BeeBDC vignette

0.0 Script preparation

0.1 Working directory

Choose the path to the root folder in which all other folders can be found.

For the first time that you run BeeBDC, and if you want to use the renv package to manage your packages, you can install renv…

        install.packages("renv", repos = "")

and then initialise renv for the project.

        renv::init(project = paste0(RootPath,"/Data_acquisition_workflow")) 

If you have already initialised a project, you can instead just activate it.

0.2 Install packages (if needed)

You may need to install gdal on your computer. This can be done on a Mac by using Homebrew in the terminal and the command “brew install gdal”.

To start out, you will need to install BiocManager, devtools, ComplexHeatmap, and rnaturalearthhires to then install and fully use BeeBDC.

Now install BeeBDC.

Snapshot the renv environment.

Set up the directories used by BeeBDC. These directories include where the data, figures, reports, etc. will be saved. The RDoc needs to be a path RELATIVE to the RootPath; i.e., the file path from which the two diverge.
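
As a rough illustration of what "relative to the RootPath" means, the base R sketch below builds a throwaway folder tree and derives the relative path of a script from it. The folder and file names are hypothetical, and this is not how dirMaker() works internally; it only shows how RootPath and RDoc relate to each other.

```r
# Hypothetical root folder, created inside a temporary directory for illustration
RootPath <- file.path(tempdir(), "BeeBDC_demo")

# Hypothetical location of the script that is being run
scriptFile <- file.path(RootPath, "vignettes", "BeeBDC_main.Rmd")

# RDoc is what remains of the script's path once "RootPath/" is removed
RDoc <- substring(scriptFile, nchar(RootPath) + 2)

# Hypothetical data and output folders that sit below RootPath
DataPath <- file.path(RootPath, "Data_acquisition_workflow")
OutPath_Intermediate <- file.path(DataPath, "Output", "Intermediate")
dir.create(OutPath_Intermediate, recursive = TRUE, showWarnings = FALSE)

print(RDoc)
```

Here RDoc is the part of the path at which the script and the root folder diverge.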

0.3 Load packages

Load packages.

1.0 Data merge

Although each line of code has been validated, in order to save time knitting the R markdown document the next section is display only. If you are not data merging (section 1.0) or preparing the data (section 2.0), feel free to skip to Section 3.0 Initial flags.

1.1 Download ALA data

Download ALA data and create a new file in the DataPath to put those data into. You should first make an account with ALA in order to download your data.

  BeeBDC::atlasDownloader(path = DataPath,
           userEmail = "",
           atlas = "ALA",
           ALA_taxon = "Apiformes")

1.2 Import and merge ALA, SCAN, iDigBio, and GBIF data

Supply the path to where the data are stored. The save_type is either “csv_files” or “R_file”.

  DataImp <- BeeBDC::repoMerge(path = DataPath, 
                  occ_paths = BeeBDC::repoFinder(path = DataPath),
                  save_type = "R_file")

If there is an error in finding a file, run repoFinder() by itself to troubleshoot. For example:

            #BeeBDC::repoFinder(path = DataPath)
            #[1] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/ALA_galah_path/galah_download_2022-09-15/data.csv"
            #[1] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/GBIF_webDL_30Aug2022/0000165-220831081235567/occurrence.txt"
            #[2] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/GBIF_webDL_30Aug2022/0436695-210914110416597/occurrence.txt"
            #[3] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/GBIF_webDL_30Aug2022/0436697-210914110416597/occurrence.txt"
            #[4] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/GBIF_webDL_30Aug2022/0436704-210914110416597/occurrence.txt"
            #[5] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/GBIF_webDL_30Aug2022/0436732-210914110416597/occurrence.txt"
            #[6] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/GBIF_webDL_30Aug2022/0436733-210914110416597/occurrence.txt"
            #[7] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/GBIF_webDL_30Aug2022/0436734-210914110416597/occurrence.txt"
            #[1] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/iDigBio_webDL_30Aug2022/5aa5abe1-62e0-4d8c-bebf-4ac13bd9e56f/occurrence_raw.csv"
            #Failing because SCAN_data seems to be missing. Downloaded separately from the one drive

Load in the most-recent version of these data if needed. This will return a list with:

  1. The occurrence dataset with attributes (.$Data_WebDL)
  2. The appended eml file (.$eml_files)

    DataImp <- BeeBDC::importOccurrences(path = DataPath,
                           fileName = "BeeData_")

1.3 Import USGS Data

The USGS_formatter() will find, import, format, and create metadata for the USGS dataset. The pubDate must be in day-month-year format.

  USGS_data <- BeeBDC::USGS_formatter(path = DataPath, pubDate = "19-11-2022")

1.4 Formatted Source Importer

Use this importer to find files that have been formatted and need to be added to the larger data file.

The attributes file must contain “attribute” in its name, and the occurrence file must not.

  Complete_data <- BeeBDC::formattedCombiner(path = DataPath, 
                                strings = c("USGS_[a-zA-Z_]+[0-9]{4}-[0-9]{2}-[0-9]{2}"), 
                                  # This should be the list-format with eml attached
                                existingOccurrences = DataImp$Data_WebDL,
                                existingEMLs = DataImp$eml_files) 

In the column catalogNumber, remove ".*specimennumber:" as what comes after should be the USGS number to match for duplicates.

  Complete_data$Data_WebDL <- Complete_data$Data_WebDL %>%
    dplyr::mutate(catalogNumber = stringr::str_replace(catalogNumber,
                                                       pattern = ".*\\| specimennumber:",
                                                       replacement = ""))
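
To see what this pattern does, here is a small base R sketch on made-up catalogNumber values. It swaps in sub() for stringr::str_replace(), which behaves the same way for a single replacement like this one; the example values are assumptions, not real records.

```r
# Made-up catalogNumber values for illustration only
catalogNumber <- c("sampleid: 123 | specimennumber:USGS_DRO123456",
                   "UAM:Ento:98765")

# Remove everything up to and including "| specimennumber:"
cleaned <- sub(pattern = ".*\\| specimennumber:",
               replacement = "",
               x = catalogNumber)

print(cleaned)
```

Values that do not contain the pattern are returned unchanged.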

1.5 Save data

Choose the type of data format you want to use in saving your work in 1.x.

  BeeBDC::dataSaver(path = DataPath,# The main path to look for data in
       save_type = "CSV_file", # "R_file" OR "CSV_file"
       occurrences = Complete_data$Data_WebDL, # The existing datasheet
       eml_files = Complete_data$eml_files, # The existing EML files
       file_prefix = "Fin_") # The prefix for the fileNames
rm(Complete_data, DataImp)

2.0 Data preparation

The data preparation section of the script relates mostly to integrating bee occurrence datasets and corrections and so may be skipped by many general taxon users.

2.1 Standardise datasets

You may either use:

a. bdc import

The bdc import is NOT truly supported here, but provided as an example. Please go to section 2.1b below. Read in the bdc metadata and standardise the dataset to bdc.

        bdc_metadata <- readr::read_csv(paste(DataPath, "out_file", "bdc_integration.csv", sep = "/"))
        # ?issue — datasetName is a darwinCore field already!
        # Standardise the dataset to bdc
        db_standardized <- bdc::bdc_standardize_datasets(
          metadata = bdc_metadata,
          format = "csv",
          overwrite = TRUE,
          save_database = TRUE)
        # read in configuration description file of the column header info
        config_description <- readr::read_csv(paste(DataPath, "Output", "bdc_configDesc.csv",
                                                    sep = "/"), 
                                              show_col_types = FALSE, trim_ws = TRUE)

b. jbd import

Find the path, read in the file, and add the database_id column.

  occPath <- BeeBDC::fileFinder(path = DataPath, fileName = "Fin_BeeData_combined_")

  db_standardized <- readr::read_csv(occPath, 
                                       # Use the basic ColTypeR function to determine types
                                     col_types = BeeBDC::ColTypeR(), trim_ws = TRUE) %>%
                                     dplyr::mutate(database_id = paste("Dorey_data_", 
                                     1:nrow(.), sep = ""),
                                     .before = family)

c. optional thin

You can thin the dataset for TESTING ONLY!

         check_pf <- check_pf %>%
           # take every 100th record
           dplyr::filter(dplyr::row_number() %% 100 == 1)

2.2 Paige dataset

Paige Chesshire’s cleaned American dataset.

Import data

If you haven’t figured it out by now, don’t worry about the column name warning — not all columns occur here.

  PaigeNAm <- readr::read_csv(paste(DataPath, "Paige_data", "NorAmer_highQual_only_ALLfamilies.csv",
                                    sep = "/"), col_types = BeeBDC::ColTypeR()) %>%
     # Change the column name from Source to dataSource to match the rest of the data.
    dplyr::rename(dataSource = Source) %>%
      # add a NEW database_id column
    dplyr::mutate(
      database_id = paste0("Paige_data_", 1:nrow(.)),
      .before = scientificName)

It is recommended to run the below code on the full bee dataset with more than 16GB RAM. Robert ran this on a laptop with 16GB RAM and an Intel(R) Core(TM) i7-8550U processor (4 cores and 8 threads) — it struggled.

Merge Paige’s data with downloaded data

  db_standardized <- BeeBDC::PaigeIntegrater(
      db_standardized = db_standardized,
      PaigeNAm = PaigeNAm,
        # This is a list of columns by which to match Paige's data to the most-recent download with. 
        # Each vector will be matched individually
      columnStrings = list(
        c("decimalLatitude", "decimalLongitude", 
          "recordNumber", "recordedBy", "individualCount", "samplingProtocol",
          "associatedTaxa", "sex", "catalogNumber", "institutionCode", "otherCatalogNumbers",
          "recordId", "occurrenceID", "collectionID"),         # Iteration 1
        c("catalogNumber", "institutionCode", "otherCatalogNumbers",
          "recordId", "occurrenceID", "collectionID"), # Iteration 2
        c("decimalLatitude", "decimalLongitude", 
          "recordedBy", "genus", "specificEpithet"),# Iteration 3
        c("id", "decimalLatitude", "decimalLongitude"),# Iteration 4
        c("recordedBy", "genus", "specificEpithet", "locality"), # Iteration 5
        c("recordedBy", "institutionCode", "genus", 
          "specificEpithet","locality"),# Iteration 6
        c("occurrenceID","decimalLatitude", "decimalLongitude"),# Iteration 7
        c("catalogNumber","decimalLatitude", "decimalLongitude"),# Iteration 8
        c("catalogNumber", "locality") # Iteration 9
      ) )
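
The idea of matching on one set of columns after another can be sketched in base R as follows. This is a toy illustration under stated assumptions, not the implementation of PaigeIntegrater(): the data, the helper makeKey() and the matched_id column are all hypothetical.

```r
# Toy "current" data that already carry a database_id
current <- data.frame(
  database_id   = c("Dorey_data_1", "Dorey_data_2", "Dorey_data_3"),
  catalogNumber = c("A1", "A2", NA),
  recordedBy    = c("Smith", "Jones", "Lee"),
  locality      = c("Adelaide", "Perth", "Suva"),
  stringsAsFactors = FALSE)

# Toy incoming data that need to be matched to the current data
incoming <- data.frame(
  catalogNumber = c("A1", NA, "Z9"),
  recordedBy    = c("Smith", "Lee", "Kim"),
  locality      = c("Adelaide", "Suva", "Hobart"),
  stringsAsFactors = FALSE)

# Each vector is one matching iteration; earlier iterations take preference
columnStrings <- list(c("catalogNumber", "locality"),
                      c("recordedBy", "locality"))

# Hypothetical helper: one key per row, NA when any matching column is missing
makeKey <- function(df, cols) {
  key <- do.call(paste, c(unname(as.list(df[cols])), sep = "\r"))
  key[!stats::complete.cases(df[cols])] <- NA
  key
}

incoming$matched_id <- NA_character_
for (cols in columnStrings) {
  # Only try rows that were not matched in an earlier iteration
  todo <- which(is.na(incoming$matched_id))
  hit <- match(makeKey(incoming[todo, , drop = FALSE], cols),
               makeKey(current, cols),
               incomparables = NA)
  incoming$matched_id[todo] <- current$database_id[hit]
}

print(incoming)
```

The first row matches on catalogNumber and locality, the second only matches in the later iteration on recordedBy and locality, and the third stays unmatched.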

Remove spent data.


2.3 USGS

The USGS dataset also partially occurs on GBIF from BISON. However, the occurrence codes are in a silly place… We will correct these here to help identify duplicates later.

    db_standardized <- db_standardized %>%
          # Remove the discoverlife html if it is from USGS
      dplyr::mutate(occurrenceID = dplyr::if_else(
        stringr::str_detect(occurrenceID, "USGS_DRO"),
        stringr::str_remove(occurrenceID, "http://www\\.discoverlife\\.org/mp/20l\\?id="),
        occurrenceID)) %>%
          # Use otherCatalogNumbers when occurrenceID is empty AND when USGS_DRO is detected there
      dplyr::mutate(
        occurrenceID = dplyr::if_else(
          stringr::str_detect(otherCatalogNumbers, "USGS_DRO") & is.na(occurrenceID),
          otherCatalogNumbers, occurrenceID)) %>%
           # Make sure that no eventIDs have snuck into the occurrenceID columns 
           # For USGS_DRO, codes with <6 digits are event ids
      dplyr::mutate(
        occurrenceID = dplyr::if_else(stringr::str_detect(occurrenceID, "USGS_DRO", negate = TRUE),
             # Keep occurrenceID if it's NOT USGS_DRO
          occurrenceID,
             # If it IS USGS_DRO and it has >= 6 numbers, keep it, else, NA
          dplyr::if_else(stringr::str_detect(occurrenceID, "USGS_DRO[0-9]{6,10}"),
                         occurrenceID, NA_character_)),
        catalogNumber = dplyr::if_else(stringr::str_detect(catalogNumber, "USGS_DRO", negate = TRUE),
             # Keep catalogNumber if it's NOT USGS_DRO
          catalogNumber,
             # If it IS USGS_DRO and it has >= 6 numbers, keep it, else, NA
          dplyr::if_else(stringr::str_detect(catalogNumber, "USGS_DRO[0-9]{6,10}"),
                         catalogNumber, NA_character_)))
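
The effect of these rules on individual identifiers can be checked with a small base R sketch. The identifiers below are made up, and base R functions (grepl(), sub(), ifelse()) are swapped in for the dplyr and stringr calls, so treat it as an approximation of the logic rather than the workflow itself.

```r
# Made-up occurrenceID values for illustration only
occurrenceID <- c("http://www.discoverlife.org/mp/20l?id=USGS_DRO123456",
                  "USGS_DRO1234",       # fewer than 6 digits: treated as an event id
                  "urn:catalog:ABC:1",  # not a USGS code: left alone
                  NA)

# Strip the Discover Life URL prefix, but only from USGS codes
cleaned <- ifelse(grepl("USGS_DRO", occurrenceID),
                  sub("http://www\\.discoverlife\\.org/mp/20l\\?id=", "", occurrenceID),
                  occurrenceID)

# USGS codes need 6 or more digits to count as specimen codes; others become NA
isUSGS <- grepl("USGS_DRO", cleaned)
isSpecimen <- grepl("USGS_DRO[0-9]{6,10}", cleaned)
cleaned[isUSGS & !isSpecimen] <- NA

print(cleaned)
```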

2.4 Additional datasets

Import additional and potentially private datasets.

Note: Private dataset functions are provided but the data itself is not integrated here until those datasets become freely available.

There will be some warnings where a few rows may not be formatted correctly or where dates fail to parse. This is normal.


a. EPEL

Guzman, L. M., Kelly, T. & Elle, E. A data set for pollinator diversity and their interactions with plants in the Pacific Northwest. Ecology, e3927 (2022).

EPEL_Data <- BeeBDC::readr_BeeBDC(dataset = "EPEL",
                                path = paste0(DataPath, "/Additional_Datasets"),
                      inFile = "/InputDatasets/bee_data_canada.csv",
                      outFile = "jbd_EPEL_data.csv",
                      dataLicense = "")

b. Allan Smith-Pardo

Data from Allan Smith-Pardo

ASP_Data <- BeeBDC::readr_BeeBDC(dataset = "ASP",
                               path = paste0(DataPath, "/Additional_Datasets"),
                      inFile = "/InputDatasets/Allan_Smith-Pardo_Dorey_ready2.csv",
                      outFile = "jbd_ASP_data.csv",
                      dataLicense = "")

c. Minckley

Data from Robert Minckley

BMin_Data <- BeeBDC::readr_BeeBDC(dataset = "BMin",
                                path = paste0(DataPath, "/Additional_Datasets"),
                        inFile = "/InputDatasets/Bob_Minckley_6_1_22_ScanRecent-mod_Dorey.csv",
                        outFile = "jbd_BMin_data.csv",
                        dataLicense = "")

d. BMont

Delphia, C. M. Bumble bees of Montana. (2022)

BMont_Data <- BeeBDC::readr_BeeBDC(dataset = "BMont",
                                 path = paste0(DataPath, "/Additional_Datasets"),
                          inFile = "/InputDatasets/Bombus_Montana_dorey.csv",
                          outFile = "jbd_BMont_data.csv",
                          dataLicense = "")

e. Ecd

Ecdysis. Ecdysis: a portal for live-data arthropod collections, (2022).

Ecd_Data <- BeeBDC::readr_BeeBDC(dataset = "Ecd",
                               path = paste0(DataPath, "/Additional_Datasets"),
                      inFile = "/InputDatasets/Ecdysis_occs.csv",
                      outFile = "jbd_Ecd_data.csv",
                      dataLicense = "")

f. Gai

Gaiarsa, M. P., Kremen, C. & Ponisio, L. C. Pollinator interaction flexibility across scales affects patch colonization and occupancy. Nature Ecology & Evolution 5, 787-793 (2021).

Gai_Data <- BeeBDC::readr_BeeBDC(dataset = "Gai",
                               path = paste0(DataPath, "/Additional_Datasets"),
                      inFile = "/InputDatasets/upload_to_scan_Gaiarsa et al_Dorey.csv",
                      outFile = "jbd_Gai_data.csv",
                      dataLicense = "")

g. CAES

From the Connecticut Agricultural Experiment Station.

Zarrillo, T. A., Stoner, K. A. & Ascher, J. S. Biodiversity of bees (Hymenoptera: Apoidea: Anthophila) in Connecticut (USA). Zootaxa (Accepted).

Ecdysis. Occurrence dataset (ID: 16fca9c2-f622-4cb1-aef0-3635a7be5aeb). (2023)

CAES_Data <- BeeBDC::readr_BeeBDC(dataset = "CAES",
                                path = paste0(DataPath, "/Additional_Datasets"),
                        inFile = "/InputDatasets/CT_BEE_DATA_FROM_PBI.xlsx",
                        outFile = "jbd_CT_Data.csv",
                        sheet = "Sheet1",
                        dataLicense = "")

h. GeoL

GeoL_Data <- BeeBDC::readr_BeeBDC(dataset = "GeoL",
                                path = paste0(DataPath, "/Additional_Datasets"),
                        inFile = "/InputDatasets/Geolocate and BELS_certain and accurate.xlsx",
                        outFile = "jbd_GeoL_Data.csv",
                        dataLicense = "")

i. EaCO

EaCO_Data <- BeeBDC::readr_BeeBDC(dataset = "EaCO",
                                path = paste0(DataPath, "/Additional_Datasets"),
                        inFile = "/InputDatasets/Eastern Colorado bee 2017 sampling.xlsx",
                        outFile = "jbd_EaCo_Data.csv",
                        dataLicense = "")

j. FSCA

Florida State Collection of Arthropods

FSCA_Data <- BeeBDC::readr_BeeBDC(dataset = "FSCA",
                                path = paste0(DataPath, "/Additional_Datasets"),
                        inFile = "InputDatasets/fsca_9_15_22_occurrences.csv",
                        outFile = "jbd_FSCA_Data.csv",
                        dataLicense = "")

k. Texas SMC

Published or unpublished data from Texas literature not in an online database, usually copied into spreadsheet from document format, or otherwise copied from a very differently-formatted spreadsheet. Unpublished or partially published data were obtained with express permission from the lead author.

SMC_Data <- BeeBDC::readr_BeeBDC(dataset = "SMC",
                               path = paste0(DataPath, "/Additional_Datasets"),
                      inFile = "/InputDatasets/TXbeeLitOccs_31Oct22.csv", 
                      outFile = "jbd_SMC_Data.csv",
                      dataLicense = "")

l. Texas Bal

Data with GPS coordinates (missing accidentally from records on Dryad) from Ballare, K. M., Neff, J. L., Ruppel, R. & Jha, S. Multi-scalar drivers of biodiversity: local management mediates wild bee community response to regional urbanization. Ecological Applications 29, e01869 (2019). The version on Dryad is missing site GPS coordinates (by accident). Kim is okay with these data being made public as long as her paper is referenced. - Elinor Lichtenberg

Bal_Data <- BeeBDC::readr_BeeBDC(dataset = "Bal",
                               path = paste0(DataPath, "/Additional_Datasets"),
                      inFile = "/InputDatasets/Beedata_ballare.xlsx", 
                      outFile = "jbd_Bal_Data.csv",
                      sheet = "animal_data",
                      dataLicense = "")

m. Palouse Lic

Elinor Lichtenberg’s canola data: Lichtenberg, E. M., Milosavljević, I., Campbell, A. J. & Crowder, D. W. Differential effects of soil conservation practices on arthropods and crop yields. Journal of Applied Entomology, (2023). These are the data I will be putting on SCAN. - Elinor Lichtenberg

Lic_Data <- BeeBDC::readr_BeeBDC(dataset = "Lic",
                               path = paste0(DataPath, "/Additional_Datasets"),
                      inFile = "/InputDatasets/Lichtenberg_canola_records.csv", 
                      outFile = "jbd_Lic_Data.csv",
                      dataLicense = "")

n. Arm

Data from Armando Falcon-Brindis from the University of Kentucky.

Arm_Data <- BeeBDC::readr_BeeBDC(dataset = "Arm",
                               path = paste0(DataPath, "/Additional_Datasets"),
                      inFile = "/InputDatasets/Bee database Armando_Final.xlsx",
                      outFile = "jbd_Arm_Data.csv",
                      sheet = "Sheet1",
                      dataLicense = "")

o. Dor

From several papers:

  1. Dorey, J. B., Fagan-Jeffries, E. P., Stevens, M. I., & Schwarz, M. P. (2020). Morphometric comparisons and novel observations of diurnal and low-light-foraging bees. Journal of Hymenoptera Research, 79, 117–144. doi:
  2. Dorey, J. B. (2021). Missing for almost 100 years: the rare and potentially threatened bee Pharohylaeus lactiferus (Hymenoptera, Colletidae). Journal of Hymenoptera Research, 81, 165-180. doi:
  3. Dorey, J. B., Schwarz, M. P., & Stevens, M. I. (2019). Review of the bee genus Homalictus Cockerell (Hymenoptera: Halictidae) from Fiji with description of nine new species. Zootaxa, 4674(1), 1–46. doi:

  Dor_Data <- BeeBDC::readr_BeeBDC(dataset = "Dor",
                    path = paste0(DataPath, "/Additional_Datasets"),
                    inFile = "/InputDatasets/DoreyData.csv",
                    outFile = "jbd_Dor_Data.csv",
                    dataLicense = "")

p. VicWam

These data are originally from the Victorian Museum and Western Australian Museum in Australia. However, in their current form they are from Dorey et al. 2021.

  1. PADIL. (2020). PaDIL.
  2. Houston, T. F. (2000). Native bees on wildflowers in Western Australia. Western Australian Insect Study Society.
  3. Dorey, J. B., Rebola, C. M., Davies, O. K., Prendergast, K. S., Parslow, B. A., Hogendoorn, K., . . . Caddy-Retalic, S. (2021). Continental risk assessment for understudied taxa post catastrophic wildfire indicates severe impacts on the Australian bee fauna. Global Change Biology, 27(24), 6551-6567. doi:

 VicWam_Data <- BeeBDC::readr_BeeBDC(dataset = "VicWam",
                    path = paste0(DataPath, "/Additional_Datasets"),
                    inFile = "/InputDatasets/Combined_Vic_WAM_databases.xlsx",
                    outFile = "jbd_VicWam_Data.csv",
                    dataLicense = "",
                    sheet = "Combined")

2.5 Merge all

Remove these spent datasets.

  rm(EPEL_Data, ASP_Data, BMin_Data, BMont_Data, Ecd_Data, Gai_Data, CAES_Data, 
  GeoL_Data, EaCO_Data, FSCA_Data, SMC_Data, Bal_Data, Lic_Data, Arm_Data, Dor_Data,
  VicWam_Data)

Read in and merge all. There are more readr_BeeBDC() supported than currently implemented and these represent datasets that will be publicly released in the future. See ‘?readr_BeeBDC()’ for details.

db_standardized <- db_standardized %>%
  dplyr::bind_rows(
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_ASP_data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_EPEL_data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_BMin_data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_BMont_data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_Ecd_data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_Gai_data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_CT_Data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_GeoL_Data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_EaCo_Data.csv"), col_types = BeeBDC::ColTypeR()), 
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_SMC_Data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_Bal_Data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_Lic_Data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_Arm_Data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_Dor_Data.csv"), col_types = BeeBDC::ColTypeR()),
readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                       "/jbd_VicWam_Data.csv"), col_types = BeeBDC::ColTypeR())) %>% 
    # END bind_rows
  suppressWarnings(classes = "warning") # End suppressWarnings — due to col_types

2.6 Match database_id

If you would like to match the database_ids of the current run with those of a prior run, you may use the script below to try to do so.

Read in a prior run of choice.

  priorRun <- BeeBDC::fileFinder(path = DataPath,
                          fileName = "01_prefilter_database_9Aug22.csv") %>%
    readr::read_csv(file = ., col_types = BeeBDC::ColTypeR())

This function will attempt to find the database_ids from prior runs.

  db_standardized <- BeeBDC::idMatchR(
  currentData = db_standardized,
  priorData = priorRun,
    # First matches will be given preference over later ones
  matchBy = tibble::lst(c("gbifID", "dataSource"),
                        c("catalogNumber", "institutionCode", "dataSource", "decimalLatitude",
                          "decimalLongitude"),
                        c("occurrenceID", "dataSource","decimalLatitude","decimalLongitude"),
                        c("recordId", "dataSource","decimalLatitude","decimalLongitude"),
                        c("id", "dataSource","decimalLatitude","decimalLongitude"),
                        # Because INHS was entered as its own dataset but is now included in the GBIF download...
                        c("catalogNumber", "institutionCode", "dataSource")),
    # You can exclude datasets from prior by matching their prefixes (the text before the first underscore):
  excludeDataset = c("ASP", "BMin", "BMont", "CAES", "EaCO", "Ecd", "EcoS",
                     "Gai", "KP", "EPEL", "CAES", "EaCO", "FSCA", "SMC", "Lic", "Arm"))

 # Remove redundant files

Save the dataset.

  db_standardized %>%
    readr::write_excel_csv(.,
                     paste(OutPath_Intermediate, "00_prefilter_database.csv",
                           sep = "/"))

3.0 Initial flags

Read data back in if needed. OutPath_Intermediate (and a few other directories) should have been created and saved to the global environment by dirMaker().

  if (!exists("db_standardized")) {
    db_standardized <- readr::read_csv(paste(OutPath_Intermediate, "00_prefilter_database.csv",
                                      sep = "/"), col_types = BeeBDC::ColTypeR())}

Normally, you would use the full dataset, as read in above. But, for the sake of this vignette, we will use a combination of two example datasets. These example datasets can further be very useful for testing functions if you’re ever feeling a bit confused and overwhelmed!

data("bees3sp", package = "BeeBDC")
data("beesRaw", package = "BeeBDC")
db_standardized <- dplyr::bind_rows(beesRaw, 
                                      # Only keep a subset of columns from bees3sp
                             bees3sp %>% dplyr::select(tidyselect::all_of(colnames(beesRaw)), countryCode))

For more details about the bdc package, please see their tutorial.

3.1 SciName

Flag occurrences without scientificName provided.

3.2 MissCoords

Flag occurrences with missing decimalLatitude and decimalLongitude.

3.3 OutOfRange

Flag occurrences that are not on Earth (outside of -180 to 180 or -90 to 90 degrees).
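
The two coordinate checks described here (missing coordinates and coordinates that are out of range) boil down to simple logical tests. The base R sketch below shows that logic on made-up records. The dot-prefixed column names follow the flag-naming style used by bdc, where TRUE means that a record passes, but the exact names are an assumption here and the vignette itself relies on the bdc functions.

```r
# Made-up records for illustration only
occ <- data.frame(
  decimalLatitude  = c(-34.9, 95.2, NA, 10),
  decimalLongitude = c(138.6, 20, 151.2, -190.5))

# TRUE = both coordinates are present
occ$.coordinates_empty <- !(is.na(occ$decimalLatitude) |
                              is.na(occ$decimalLongitude))

# TRUE = coordinates fall within -90 to 90 and -180 to 180 degrees;
# missing coordinates are left to the flag above rather than failed here
inRange <- abs(occ$decimalLatitude) <= 90 & abs(occ$decimalLongitude) <= 180
occ$.coordinates_outOfRange <- is.na(inRange) | inRange

print(occ)
```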

3.4 Source

Flag occurrences that don’t match the basisOfRecord types below.

3.5 CountryName

Try to harmonise country names.

b. run function

Get country name from coordinates using a wrapper around the jbd_country_from_coordinates() function. Because our dataset is much larger than those used to design bdc, we have made it so that you can analyse data in smaller pieces. Additionally, like some other functions in BeeBDC, we have implemented parallel operations (using mc.cores = #cores in stepSize = #rowsPerOperation); see ‘?jbd_CfC_chunker()’ for details. NOTE: In an actual run you should use scale = “large”

3.6 StandardCoNames

Run the function, which standardises country names and adds ISO2 codes, if needed.

3.7 TranspCoords

Flag and correct records when decimalLatitude and decimalLongitude appear to be transposed. We created this chunked version of bdc::bdc_coordinates_transposed() because it is very RAM-heavy using our large bee dataset. Like many of our other ‘jbd_…’ functions there are other improvements - e.g., parallel running.

NOTE: Usually you would use scale = “large”, which requires rnaturalearthhires

Get a quick summary of the number of transposed records.

Save the dataset.

Read the data in again if needed.

3.8 Coord-country

Collect all country names in the country_suggested column. We rebuilt a bdc function to flag occurrences where the coordinates are inconsistent with the provided country name.

Save the dataset.

3.9 GeoRefIssue

This function identifies records whose coordinates can potentially be extracted from locality information, which must be manually checked later.

Remove spent data.

3.10 Flag Absent

Flag the records marked as “absent”.

3.11 Flag License

Flag the records that may not be used according to their license information.

3.12 GBIF issue

Flag select issues that are flagged by GBIF.

3.13 Flag Reports

3.14 Save

Save the intermediate dataset.

4.0 Taxonomy

For more information about the corresponding bdc functions used in this section, see their tutorial.

Read in the filtered dataset or rename the 3.x dataset for 4.0.

if (!exists("check_pf")) {
    database <- readr::read_csv(paste(OutPath_Intermediate, "01_prefilter_output.csv",
        sep = "/"), col_types = BeeBDC::ColTypeR())
} else {
    # OR rename and remove
    database <- check_pf
    # Remove spent dataset
    rm(check_pf)
}

Remove names_clean if it already exists (i.e. you have run the following functions on this dataset before).

database <- database %>%
    dplyr::select(!tidyselect::any_of("names_clean"))

4.1 Prep data names

This step cleans the database’s scientificName column.

! MAC: You might need to install gnparser through the terminal with Homebrew, using the commands “brew tap gnames/gn” and then “brew install gnparser”.

This can be difficult for a Windows install. Ensure you have the most recent versions of R, R Studio, and R packages. Also, check that the package ‘rgnparser’ is installed correctly. If you still cannot get the below code to work, you may have to download the latest version of ‘gnparser’ from here. You may then need to manually install it and edit your system’s environment variable PATH to locate ‘gnparser.exe’. See here.

## The latest gnparser version is v1.7.4
## gnparser has been installed to /home/runner/bin
## >> Family names prepended to scientific names were flagged and removed from 0 records.
## >> Terms denoting taxonomic uncertainty were flagged and removed from 0 records.
## >> Other issues, capitalizing the first letter of the generic name, replacing empty names by NA, and removing extra spaces, were flagged and corrected or removed from 1 records.
## >> Infraspecific terms were flagged and removed from 0 records.

Keep only the .uncer_terms and names_clean columns.

Merge names with the complete dataset.

4.2 Harmonise taxonomy

Download the custom taxonomy file from the BeeBDC package and Discover Life website.

As of version 1.1.0, BeeBDC now has a new function that can download taxonomies using the taxadb package and transform them into the BeeBDC format. The function, BeeBDC::taxadbToBeeBDC(), allows the user to choose their desired provider (e.g., “gbif”, “itis”…), version, taxon name and rank, and to save the taxonomy as a readable csv or not. For example for the bee genus Apis:

ApisTaxonomy <- BeeBDC::taxadbToBeeBDC(
  name = "Apis",
  rank = "Genus",
  provider = "gbif",
  version = "22.12",
  outPath = getwd(),
  fileName = "ApisTaxonomy.csv"
)

Harmonise the names in the occurrence tibble. This flags the occurrences without a matched name and matches names to their correct name according to Discover Life. You can also use multiple cores to achieve this. See ‘?harmoniseR()’ for details.

You don’t need this file any more…

Save the harmonised file.

4.3 Save flags

Save the flags so far. This will find the most-recent flag file and append your new data to it. You can double-check the data and number of columns if you’d like to be thorough and sure that all of the data are intact.

5.0 Space

The final frontier or whatever.

Read in the latest database.

if (!exists("database")) {
    database <- readr::read_csv(paste(OutPath_Intermediate, "02_taxonomy_database.csv",
        sep = "/"), col_types = BeeBDC::ColTypeR())
}

5.1 Coordinate precision

This function identifies records with a coordinate precision below a specified number of decimal places. For example, the precision of a coordinate with 1 decimal place is 11.132 km at the equator, i.e., the scale of a large city. The major difference between the bdc and BeeBDC functions is that jbd_coordinates_precision() will only flag occurrences if BOTH latitude and longitude are rounded (as opposed to only one of these).

Coordinates with one, two, or three decimal places present a precision of ~11.1 km, ~1.1 km, and ~111 m at the equator, respectively.
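
The arithmetic behind these figures, and the "both coordinates" rule, can be sketched in base R. This is an illustration of the idea only and not the code of jbd_coordinates_precision(). The helper countDecimals() is hypothetical and counts decimals from the printed value, which is adequate for typical coordinates but not for numbers that R prints in scientific notation.

```r
# One degree is about 111.32 km at the equator, so d decimals resolve to:
kmAtEquator <- 111.32 / 10^(1:3)
print(kmAtEquator)

# Hypothetical helper: number of decimal places in the printed value
countDecimals <- function(x) {
  txt <- as.character(x)
  hasPoint <- grepl(".", txt, fixed = TRUE)
  ifelse(hasPoint, nchar(sub("^[^.]*\\.", "", txt)), 0L)
}

# Made-up coordinates for illustration only
lat <- c(-34.9, -34.9285, -17.8)
lon <- c(138.6, 138.6007, 177.4166)
ndec <- 2

# TRUE = passes; a record fails only when BOTH coordinates
# have fewer than ndec decimal places
passes <- !(countDecimals(lat) < ndec & countDecimals(lon) < ndec)
print(passes)
```

The third record passes because only its latitude is rounded.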

Remove the spent dataset.

Save the resulting file.

5.2 Common spatial issues

Only pass occurrences that are spatially ‘valid’ through clean_coordinates().

Next, we will flag common spatial issues using functions of the package CoordinateCleaner. It addresses some common issues in biodiversity datasets.

Re-merge the datasets.

Remove the temporary dataset.

Save the intermediate dataset.

5.3 Diagonal + grid

This step finds sequential numbers in latitude and longitude that could be fill-down errors, grouping records by the ‘groupingColumns’. This is accomplished by using a sliding window with the length determined by minRepeats. Only coordinates of precision ‘ndec’ (number of decimals in decimal degree format) will be examined. Note that this function is very RAM-intensive and so the use of multiple threads should be approached with caution depending on your dataset. However, the option is provided.
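
To make the notion of a fill-down error concrete, the base R sketch below flags runs of values that increase or decrease by a constant step. It is a simplified stand-in and not the package's algorithm: it uses run-length encoding instead of a sliding window, looks at a single column, and ignores grouping and the precision filter. The function name flagSequential() is hypothetical, and here TRUE marks a suspect value.

```r
# Hypothetical helper: TRUE where a value belongs to a run of at least
# minRepeats values separated by a constant, non-zero step
flagSequential <- function(x, minRepeats = 4, ndec = 2) {
  # Work on integers to avoid floating-point noise in the differences
  scaled <- round(x * 10^ndec)
  runs <- rle(diff(scaled))
  # A run of k equal steps spans k + 1 consecutive values
  isRun <- !is.na(runs$values) & runs$values != 0 &
    (runs$lengths + 1) >= minRepeats
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  flagged <- rep(FALSE, length(x))
  for (i in which(isRun)) {
    flagged[starts[i]:(ends[i] + 1)] <- TRUE
  }
  flagged
}

# Made-up latitudes: the middle four look like a spreadsheet fill-down
lat <- c(-34.92, 12.01, 12.02, 12.03, 12.04, -17.80)
suspect <- flagSequential(lat)
print(suspect)
```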

Spatial gridding from rasterisation: Select only the records with more than X occurrences.

Run the gridding analysis to find datasets that might be gridded.

Integrate these results with the main dataset.

Save the gridded_datasets file for later examination.

Now remove this file.

5.4 Uncertainty

Flag records that exceed a coordinateUncertaintyInMeters threshold.
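As a rough illustration of what such a flag looks like, here is a base-R sketch. The 10,000 m threshold, the toy data, the .uncertaintySketch column name, and the decision to let records without a stated uncertainty pass are all assumptions for this example, not BeeBDC defaults.

```r
# Illustrative sketch only; threshold and column name are assumptions.
occ <- data.frame(
  database_id = c("id1", "id2", "id3"),
  coordinateUncertaintyInMeters = c(50, 25000, NA)
)

threshold <- 10000  # metres

# TRUE = passes. Records with no stated uncertainty are not failed here.
occ$.uncertaintySketch <- is.na(occ$coordinateUncertaintyInMeters) |
  occ$coordinateUncertaintyInMeters <= threshold
print(occ)
```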

5.5 Country & continent checklists

This step identifies mismatches between the Discover Life country checklist for bee species (beesChecklist) and the dataset, identifying potential misidentifications, outliers, etc.

Download the country-level checklist.

Since version 1.1.2, a new function, BeeBDC::continentOutlieRs(), has been released. It is conceptually the same as BeeBDC::countryOutlieRs(), except that it works at the level of continents. Generating a continent-level checklist is much easier than generating a country-level one, so this function might be of more widespread use (beyond bees).
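The core of a checklist comparison is a lookup of species-by-region pairs. The base-R sketch below uses placeholder species and region names, a made-up two-column checklist, and a hypothetical .checklistSketch column; it does not reproduce the options of countryOutlieRs() or continentOutlieRs(), and the same logic would apply to continents by swapping the region column.

```r
# Illustrative sketch only; all names and data are placeholders.
checklist <- data.frame(
  validName = c("Genus alpha", "Genus alpha", "Genus beta"),
  region    = c("Country A", "Country B", "Country C")
)

occ <- data.frame(
  database_id    = c("id1", "id2", "id3"),
  scientificName = c("Genus alpha", "Genus beta", "Genus beta"),
  region         = c("Country B", "Country B", "Country C")
)

# Build a species|region key on both sides and look each occurrence up
occKey  <- paste(occ$scientificName, occ$region, sep = "|")
listKey <- paste(checklist$validName, checklist$region, sep = "|")

# TRUE = the species is on the checklist for that region
occ$.checklistSketch <- occKey %in% listKey
print(occ)
```

Here the second record is a candidate outlier because the checklist does not list "Genus beta" for "Country B".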

5.6 Map spatial errors

Assemble maps of potential spatial errors and outliers, either one flag at a time or using the .summary column. First, you need to rebuild the .summary column.

Rebuild the .summary column.
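Conceptually, the .summary column is the row-wise combination of the individual flag columns. The sketch below shows one way to compute such a column in base R, assuming that flag columns start with a dot, that FALSE means a failed test, that NA does not count as a failure, and that some columns (here .gridSummary) are deliberately left out. The toy data and the dontFilter name are assumptions for illustration.

```r
# Illustrative sketch only; toy data and the dontFilter name are assumptions.
occ <- data.frame(
  database_id        = c("id1", "id2", "id3"),
  .coordinates_empty = c(TRUE, TRUE, FALSE),
  .sea               = c(TRUE, NA, TRUE),
  .gridSummary       = c(FALSE, TRUE, TRUE),
  check.names = FALSE
)

# Flag columns to leave out of the summary
dontFilter <- c(".gridSummary")
flagCols <- setdiff(grep("^\\.", names(occ), value = TRUE),
                    c(dontFilter, ".summary"))

# A record passes when none of the included flags is FALSE
nFailed <- rowSums(occ[, flagCols, drop = FALSE] == FALSE, na.rm = TRUE)
occ$.summary <- unname(nFailed == 0)
print(occ)
```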

Use col_to_map in order to map ONE spatial flag at a time or map the .summary column for all flags.

5.7 Space report

Create the space report using bdc.

5.8 Space figures

Create figures for the spatial data filtering results.

For examining the figures, the options are:

Save interim dataset.

5.9 Save flags

Save the flags so far.

5.10 Save

Save the intermediate dataset.

6.0 Time

Read in the last database, if needed.

if (!exists("check_space")) {
    check_time <- readr::read_csv(paste(OutPath_Intermediate, "03_space_database.csv",
        sep = "/"), col_types = BeeBDC::ColTypeR())
} else {
    check_time <- check_space
    # Remove the spent file
    rm(check_space)
}

You can plot a histogram of dates here.

hist(lubridate::ymd_hms(check_time$eventDate, truncated = 5), breaks = 20, main = "Histogram of eventDates")

Filter some silly dates that don’t make sense.

check_time$year <- ifelse(check_time$year > lubridate::year(Sys.Date()) | check_time$year <
    1600, NA, check_time$year)
check_time$month <- ifelse(check_time$month > 12 | check_time$month < 1, NA, check_time$month)
check_time$day <- ifelse(check_time$day > 31 | check_time$day < 1, NA, check_time$day)

6.1 Recover dates

The dateFindR() function will search through some other columns in order to find and rescue dates that may not have made it into the correct columns. It will further update the eventDate, day, month, and year columns where these data were a) missing and b) located in one of the searched columns.
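To illustrate the rescue step, here is a small base-R sketch that looks for an ISO-formatted date in one free-text column and uses it only where eventDate is missing. It is not how dateFindR() is implemented: the searched column (verbatimEventDate), the single date pattern, and the toy data are assumptions, and a real search would need to handle many more date formats.

```r
# Illustrative sketch only; searched column and pattern are assumptions.
occ <- data.frame(
  eventDate         = c("2001-05-17", NA, NA),
  verbatimEventDate = c("17 May 2001", "collected 1998-11-03 by hand", "no date"),
  stringsAsFactors  = FALSE
)

# Look for a YYYY-MM-DD date inside the free-text column
isoPattern <- "[0-9]{4}-[0-9]{2}-[0-9]{2}"
hit <- regexpr(isoPattern, occ$verbatimEventDate)
found <- rep(NA_character_, nrow(occ))
found[hit > 0] <- regmatches(occ$verbatimEventDate, hit)

# Only fill eventDate where it was a) missing and b) found elsewhere
rescued <- is.na(occ$eventDate) & !is.na(found)
occ$eventDate[rescued] <- found[rescued]

# Update year, month and day from the (possibly rescued) eventDate
parsed <- as.Date(occ$eventDate, format = "%Y-%m-%d")
occ$year  <- as.integer(format(parsed, "%Y"))
occ$month <- as.integer(format(parsed, "%m"))
occ$day   <- as.integer(format(parsed, "%d"))
print(occ)
```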

6.2 No eventDate

Flag records that simply lack a collection date. :(

6.3 Old records

This will flag records prior to the date selected. 1970 is frequently chosen for SDM work. You may not need to filter old records at all, so think critically about your use. We have chosen 1950 as a lower extreme.
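The effect of the cut-off can be previewed with a one-line comparison in base R. In this sketch the toy data, the .yearSketch column name, and the choice to let records with a missing year pass are assumptions; only the 1950 cut-off comes from the text above.

```r
# Illustrative sketch only; column name and NA handling are assumptions.
occ <- data.frame(year = c(1923, 1950, 2015, NA))

minYear <- 1950

# TRUE = passes. Records with a missing year are not failed here.
occ$.yearSketch <- is.na(occ$year) | occ$year >= minYear
print(occ)

# How many records would a stricter 1970 cut-off fail?
sum(!is.na(occ$year) & occ$year < 1970)
```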

6.4 Time report

Not all of time, just the time pertaining to our precise occurrence records. Update the .summary column.

6.5 Time figures

Create time results figures.

You can check figures by using…

Save the time-revised data into the intermediate folder.

6.6 Save flags

Save the flags so far.

7.0 De-duplication

The dataset can be re-read here if it does not already exist.

if (!exists("check_time")) {
    check_time <- readr::read_csv(paste(OutPath_Intermediate, "04_time_database.csv",
        sep = "/"), col_types = BeeBDC::ColTypeR())
}

7.1 deDuplicate

We will FLAG duplicates here. These input columns can be hacked to de-duplicate as you wish. This function uses user-specified inputs and columns to identify duplicate occurrence records. Duplicates are identified iteratively and will be tallied up, duplicate pairs clustered, and sorted at the end of the function. The function is designed to work with Darwin Core data with a database_id column, but it is also modifiable to work with other columns. I would encourage you to see ‘?dupeSummary()’ for more details as this function is quite modifiable to user needs.

check_time <- BeeBDC::dupeSummary(
  data = check_time,
  path = OutPath_Report,
   # options are "ID","collectionInfo", or "both"
  duplicatedBy = "collectionInfo", 
    # The columns to generate completeness info from (and to sort by completeness)
  completeness_cols = c("decimalLatitude",  "decimalLongitude",
                        "scientificName", "eventDate"),
   # The columns to ADDITIONALLY consider when finding duplicates in collectionInfo
  collectionCols = c("decimalLatitude", "decimalLongitude", "scientificName", "eventDate", "recordedBy"),
    # The columns to combine, one-by-one with the collectionCols
  collectInfoColumns = c("catalogNumber", "otherCatalogNumbers"),
    # Custom comparisons — as a list of columns to compare
     # RAW custom comparisons do not use the character and number thresholds
  CustomComparisonsRAW = dplyr::lst(c("catalogNumber", "institutionCode", "scientificName")),
     # Other custom comparisons use the character and number thresholds
  CustomComparisons = dplyr::lst(c("gbifID", "scientificName"),
                                  c("occurrenceID", "scientificName"),
                                  c("recordId", "scientificName"),
                                  c("id", "scientificName")),
   # The order in which you want to KEEP duplicates based on data source
   # try unique(check_time$dataSource)
  sourceOrder = c("CAES", "Gai", "Ecd","BMont", "BMin", "EPEL", "ASP", "KP", "EcoS", "EaCO",
                  "FSCA", "Bal", "SMC", "Lic", "Arm",
                  "USGS", "ALA", "VicWam", "GBIF","SCAN","iDigBio"),
    # Paige ordering is done using the database_id prefix, not the dataSource prefix.
  prefixOrder = c("Paige", "Dorey"),
    # Set the complexity threshold for id letter and number length
     # minimum number of characters when WITH the numberThreshold
  characterThreshold = 2,
     # minimum number of numbers when WITH the characterThreshold
  numberThreshold = 3,
     # Minimum number of numbers WITHOUT any characters
  numberOnlyThreshold = 5
) %>% # END dupeSummary
  dplyr::as_tibble(col_types = BeeBDC::ColTypeR())
##  - Generating a basic completeness summary from the decimalLatitude, decimalLongitude, scientificName, eventDate columns.
## This summary is simply the sum of complete.cases in each column. It ranges from zero to the N of columns. This will be used to sort duplicate rows and select the most-complete rows.
##  - Updating the .summary column to sort by...
##  - We will NOT flag the following columns. However, they will remain in the data file.
## .gridSummary, .lonFlag, .latFlag, .uncer_terms, .uncertaintyThreshold, .unLicensed
##  - summaryFun:
## Flagged 99 
##   The .summary column was added to the database.
##  - Working on CustomComparisonsRAW duplicates...
## Completed iteration 1 of 1:
##  - Identified 0 duplicate records and kept 0 unique records using the column(s): 
## catalogNumber, institutionCode, scientificName
##  - Working on CustomComparisons duplicates...
## Completed iteration 1 of 4:
##  - Identified 0 duplicate records and kept 0 unique records using the column(s): 
## gbifID, scientificName
## Completed iteration 2 of 4:
##  - Identified 0 duplicate records and kept 0 unique records using the column(s): 
## occurrenceID, scientificName
## Completed iteration 3 of 4:
##  - Identified 0 duplicate records and kept 0 unique records using the column(s): 
## recordId, scientificName
## Completed iteration 4 of 4:
##  - Identified 0 duplicate records and kept 0 unique records using the column(s): 
## id, scientificName
##  - Working on collectionInfo duplicates...
## Completed iteration 1 of 2:
##  - Identified 0 duplicate records and kept 0 unique records using the columns: 
## decimalLatitude, decimalLongitude, scientificName, eventDate, recordedBy, and catalogNumber
## Completed iteration 2 of 2:
##  - Identified 0 duplicate records and kept 0 unique records using the columns: 
## decimalLatitude, decimalLongitude, scientificName, eventDate, recordedBy, and otherCatalogNumbers
##  - Clustering duplicate pairs...
## Duplicate pairs clustered. There are 0 duplicates across 0 kept duplicates.
##  - Ordering prefixs...
##  - Ordering data by 1. dataSource, 2. completeness and 3. .summary column...
##  - Find and FIRST duplicate to keep and assign other associated duplicates to that one (i.e., across multiple tests a 'kept duplicate', could otherwise be removed)...
##  - Duplicates have been saved in the file and location: /var/folders/5x/jm9bgqkj1g1f_vxsmfh8n_t40000gp/T//RtmpOfSCv8/Data_acquisition_workflow/Output/ReportduplicateRun_collectionInfo_2024-06-20.csv
##  - Across the entire dataset, there are now 0 duplicates from a total of 205 occurrences.
##  - Completed in 0.3 secs
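For intuition about what the function is doing, the sketch below flags duplicates in base R using a completeness score and a fixed set of matching columns. It is a heavily simplified illustration and not dupeSummary() itself: there is no iteration over column sets, no clustering of duplicate pairs, no source or prefix ordering, and no character or number thresholds. The toy data, placeholder species names, and the .duplicatesSketch column are assumptions.

```r
# Illustrative sketch only; toy data and column names are assumptions.
occ <- data.frame(
  database_id      = c("d1", "d2", "d3", "d4"),
  scientificName   = c("Genus alpha", "Genus alpha", "Genus beta", "Genus alpha"),
  decimalLatitude  = c(-17.5, -17.5, -17.5, -17.5),
  decimalLongitude = c(178.1, 178.1, 178.1, 178.1),
  eventDate        = c("2010-01-01", "2010-01-01", "2010-01-01", NA),
  catalogNumber    = c("X123", "X123", "X123", "X123"),
  stringsAsFactors = FALSE
)

completenessCols <- c("decimalLatitude", "decimalLongitude",
                      "scientificName", "eventDate")
matchCols <- c(completenessCols, "catalogNumber")

# Completeness = number of non-missing values across the chosen columns
occ$completeness <- rowSums(!is.na(occ[, completenessCols]))

# Sort most-complete first so that the first row of each match is the one kept
ordered <- occ[order(-occ$completeness), ]

# Rows with a missing value in any matching column are not compared
comparable <- stats::complete.cases(ordered[, matchCols])

# TRUE = kept (unique or first of its group), FALSE = flagged duplicate
ordered$.duplicatesSketch <- TRUE
ordered$.duplicatesSketch[comparable] <-
  !duplicated(ordered[comparable, matchCols])
print(ordered[, c("database_id", "completeness", ".duplicatesSketch")])
```

Only d2 is flagged: it matches d1 on every column, d3 differs in species, and d4 is not compared because its eventDate is missing.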

Save the dataset into the intermediate folder.

7.2 Save flags

Save the flags so far.

8.0 Data filtering

The dataset can be re-read here if it does not already exist.

if (!exists("check_time")) {
    check_time <- readr::read_csv(paste(OutPath_Intermediate, "04_2_dup_database.csv",
        sep = "/"), col_types = BeeBDC::ColTypeR())
}

8.1 rm Outliers

Read in the most-recent duplicates file (generated by dupeSummary()) in order to identify the duplicates of the expert outliers.

Identify the outliers and get a list of their database_ids. This would require the source outlier files provided with the BeeBDC paper. These files can further be modified to include more outliers.

check_time <- BeeBDC::manualOutlierFindeR(
  data = check_time,
  DataPath = DataPath,
  PaigeOutliersName = "removedBecauseDeterminedOutlier.csv",
  newOutliersName = "^All_outliers_ANB_14March.xlsx",
  ColombiaOutliers_all = "All_Colombian_OutlierIDs.csv",
  # A .csv with manual outlier records that are too close to otherwise TRUE records
  NearTRUE = "nearTRUE.csv",
  duplicates = duplicates)

8.2 Save uncleaned

Save the uncleaned dataset.

8.3 Filter

Now clean the dataset of extra columns and failed rows and then save it.
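In base-R terms, this kind of cleaning amounts to keeping the rows that pass and dropping the flag columns. The sketch below assumes that .summary is TRUE for passing records and that every flag column starts with a dot; the toy data and object names are hypothetical, and this is not the BeeBDC filtering code.

```r
# Illustrative sketch only; toy data and object names are assumptions.
occ <- data.frame(
  database_id    = c("id1", "id2", "id3"),
  scientificName = c("Genus alpha", "Genus alpha", "Genus beta"),
  .sea           = c(TRUE, FALSE, TRUE),
  .summary       = c(TRUE, FALSE, TRUE),
  check.names = FALSE
)

# Keep only the rows that passed every included flag
cleaned <- occ[occ$.summary, , drop = FALSE]

# Drop the flag columns (those whose names start with a dot)
cleaned <- cleaned[, !grepl("^\\.", names(cleaned)), drop = FALSE]
print(cleaned)
```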

9.0 Figures and tables

9.1 Duplicate chordDiagrams

Install BiocManager and ComplexHeatmap if you missed them at the start.

Read in the most recent file of flagged duplicates, if it’s not already in your environment.

Choose the global figure parameters.

Create the chordDiagram. You can leave many of the below values out, but we show the defaults here. There are [internally] no duplicates in our current test dataset, so BeeBDC will throw an informative error. However, we show the full output figure from our bee dataset below.

[Full chord diagram from Dorey et al. 2023]