EcoCleanR: Overview of Steps for Merging Data from Online Biodiversity Resources

Introduction

In this tutorial, we demonstrate, step by step, how to download occurrence data from GBIF, OBIS, and iDigBio using existing R packages, and how to read InvertEbase data from a local CSV file. Each dataset is then standardized to a common set of attribute names so that all files can be merged into a single dataset.

Example species: Mexacanthina lugubris

Load packages

library(EcoCleanR)
library(rgbif)
#> Warning: package 'rgbif' was built under R version 4.4.3
library(robis)
#> Warning: package 'robis' was built under R version 4.4.1
#> 
#> Attaching package: 'robis'
#> The following object is masked from 'package:rgbif':
#> 
#>     dataset
library(ridigbio)
#> Warning: package 'ridigbio' was built under R version 4.4.3
library(dplyr)

Input species name

species_name <- "Mexacanthina lugubris"
taxonkey <- name_backbone(species_name)$usageKey

Create an attribute list with TDWG-standardized names

The attributes in this list can be changed or extended as required.

attribute_list <- c("source", "catalogNumber", "basisOfRecord", "occurrenceStatus", "institutionCode", "verbatimEventDate", "scientificName", "individualCount", "organismQuantity", "abundance", "decimalLatitude", "decimalLongitude", "coordinateUncertaintyInMeters", "locality", "verbatimLocality", "municipality", "county", "stateProvince", "country", "countryCode")
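
As a minimal sketch of extending the list, base R's union() appends new terms without duplicating existing ones. Here short_list is a hypothetical, shortened stand-in for attribute_list, and the extra terms (eventDate, recordedBy) are Darwin Core terms chosen only for illustration; whether downstream EcoCleanR functions make use of additional columns is not confirmed here.

```r
# Hypothetical, shortened stand-in for attribute_list
short_list <- c("source", "catalogNumber", "scientificName")

# Illustrative extra Darwin Core terms; "source" is already in the list
extra_terms <- c("eventDate", "recordedBy", "source")

# union() keeps the existing order and skips terms that are already present
extended_list <- union(short_list, extra_terms)
print(extended_list)
```

Any term added this way must also be added (or filled with NA) in every source dataset so that all datasets keep the same columns.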

GBIF - data extraction and standardization

This step uses the occ_data function of the rgbif package to extract data from GBIF.

gbif.occ <- occ_data(taxonKey = taxonkey, occurrenceStatus = NULL, limit = 10000L)$data

# see article/cite_data.Rmd for instructions on how to cite data from GBIF data providers

## additional field added to know the source
gbif.occ$source <- "gbif"
for (field in attribute_list) {
  if (!field %in% names(gbif.occ)) {
    gbif.occ[[field]] <- NA # Add the missing field as NA
  }
}

## build a single abundance column: use individualCount when it is available, otherwise fall back to organismQuantity
gbif.occ$abundance <- ifelse(is.na(as.numeric(gbif.occ$individualCount)), as.numeric(gbif.occ$organismQuantity), as.numeric(gbif.occ$individualCount))
gbif.occ_temp <- gbif.occ[, attribute_list]
str(gbif.occ_temp[, 1:3])
#> tibble [1,927 × 3] (S3: tbl_df/tbl/data.frame)
#>  $ source       : chr [1:1927] "gbif" "gbif" "gbif" "gbif" ...
#>  $ catalogNumber: chr [1:1927] "258336784" "258586406" "260117394" "261990463" ...
#>  $ basisOfRecord: chr [1:1927] "HUMAN_OBSERVATION" "HUMAN_OBSERVATION" "HUMAN_OBSERVATION" "HUMAN_OBSERVATION" ...
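
The loop that fills in missing attributes is repeated for each source in this tutorial. As a sketch only, it could be wrapped in a small helper; add_missing_fields is a hypothetical name, not an EcoCleanR function, and the toy data frame below stands in for a downloaded dataset.

```r
# Hypothetical helper: add every field in `fields` that `df` lacks as an NA column
add_missing_fields <- function(df, fields) {
  for (field in setdiff(fields, names(df))) {
    df[[field]] <- NA
  }
  df
}

# Toy data standing in for a downloaded occurrence table
toy <- data.frame(
  catalogNumber = c("A1", "B2"),
  locality = c("x", "y"),
  stringsAsFactors = FALSE
)
wanted <- c("source", "catalogNumber", "locality", "county")

toy <- add_missing_fields(toy, wanted)
toy_temp <- toy[, wanted] # keep only the wanted fields, in the wanted order
print(names(toy_temp))
```

Selecting columns with the attribute vector afterwards, as the tutorial does, also fixes the column order, which matters when the datasets are combined later.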

OBIS - data extraction and standardization

This step uses the occurrence function of the robis package to extract data from OBIS.

obis.occ <- occurrence(species_name)
#> Retrieved 84 records of approximately 84 (100%)
for (field in attribute_list) {
  if (!field %in% names(obis.occ)) {
    obis.occ[[field]] <- NA # Add the missing field as NA
  }
}
obis.occ$abundance <- ifelse(is.na(as.numeric(obis.occ$individualCount)), as.numeric(obis.occ$organismQuantity), as.numeric(obis.occ$individualCount))
obis.occ$source <- "obis"
obis.occ$municipality <- ""
obis.occ_temp <- obis.occ[, attribute_list]
str(obis.occ_temp[, 1:3])
#> tibble [84 × 3] (S3: tbl_df/tbl/data.frame)
#>  $ source       : chr [1:84] "obis" "obis" "obis" "obis" ...
#>  $ catalogNumber: chr [1:84] NA "DMNS:Inv:25322" NA "483074" ...
#>  $ basisOfRecord: chr [1:84] "HumanObservation" "PreservedSpecimen" "HumanObservation" "PreservedSpecimen" ...
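
The abundance rule applied to the GBIF and OBIS data can be checked on a toy example. The sketch below uses made-up values: individualCount is used when present, organismQuantity is the fallback, and a record with neither stays NA.

```r
# Toy records: counts often arrive as character strings
toy <- data.frame(
  individualCount = c("3", NA, NA),
  organismQuantity = c("10", "7", NA),
  stringsAsFactors = FALSE
)

count <- as.numeric(toy$individualCount)
quantity <- as.numeric(toy$organismQuantity)

# Prefer individualCount; fall back to organismQuantity where it is missing
toy$abundance <- ifelse(is.na(count), quantity, count)
print(toy$abundance)
```

Note that as.numeric() turns non-numeric entries (for example a free-text quantity) into NA with a coercion warning, so such records end up without an abundance value.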

iDigBio - data extraction and standardization

This step uses the idig_search_records function of the ridigbio package to extract data from iDigBio.

idig.occ <- idig_search_records(
  type = "records",
  rq = list("scientificname" = species_name),
  fields = "all",
  max_items = 10000L,
  limit = 10000L,
  offset = 0
)

idig.occ <- idig.occ %>%
  mutate(
    abundance = as.numeric(individualcount),
    source = "idigbio",
    occurrenceStatus = "",
    organismQuantity = ""
  ) %>%
  rename(
    decimalLatitude = geopoint.lat,
    decimalLongitude = geopoint.lon,
    basisOfRecord = basisofrecord,
    catalogNumber = catalognumber,
    scientificName = scientificname,
    stateProvince = stateprovince,
    coordinateUncertaintyInMeters = coordinateuncertainty,
    individualCount = individualcount,
    institutionCode = institutioncode,
    verbatimLocality = verbatimlocality,
    verbatimEventDate = verbatimeventdate,
    countryCode = countrycode
  )

idig.occ_temp <- idig.occ[, attribute_list]
str(idig.occ_temp[, 1:3])
#> 'data.frame':    342 obs. of  3 variables:
#>  $ source       : chr  "idigbio" "idigbio" "idigbio" "idigbio" ...
#>  $ catalogNumber: chr  "lacmip 66.1255" "lacm 1951-43.22" "1069" "239577" ...
#>  $ basisOfRecord: chr  "fossilspecimen" "preservedspecimen" "preservedspecimen" "preservedspecimen" ...
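
Most of the renames above differ from the target names only in letter case. As a sketch under that assumption, such columns can be renamed by case-insensitive matching against the target list; rename_to_dwc is a hypothetical helper and the data are toy values.

```r
# Hypothetical helper: rename columns whose lower-cased name matches a target name
rename_to_dwc <- function(df, targets) {
  idx <- match(tolower(names(df)), tolower(targets))
  hit <- !is.na(idx)
  names(df)[hit] <- targets[idx[hit]]
  df
}

# Toy data using lower-case field names
toy <- data.frame(
  catalognumber = "T-001",
  basisofrecord = "preservedspecimen",
  geopoint.lat = 32.5,
  stringsAsFactors = FALSE
)
targets <- c("catalogNumber", "basisOfRecord", "decimalLatitude")

toy <- rename_to_dwc(toy, targets)
print(names(toy))
```

Fields whose names differ by more than case, such as geopoint.lat or coordinateuncertainty, are left untouched by this approach and still need an explicit rename as in the tutorial code.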

Local file (InvertEbase) - data read and standardization

The local file “example_sp_invertebase” was downloaded manually from InvertEbase for Mexacanthina lugubris. See the example_sp_invertebase dataset for its attributes and DwC format.

sym.occ <- example_sp_invertebase
sym.occ$abundance <- as.numeric(sym.occ$individualCount)

for (field in attribute_list) {
  if (!field %in% names(sym.occ)) {
    sym.occ[[field]] <- NA # Add the missing field as NA
  }
}

str(sym.occ[, 1:3])
#> 'data.frame':    710 obs. of  3 variables:
#>  $ source       : chr  "invert" "invert" "invert" "invert" ...
#>  $ catalogNumber: chr  "49323" "155070811" "66762485" "69352588" ...
#>  $ basisOfRecord: chr  "PreservedSpecimen" "HUMAN_OBSERVATION" "HUMAN_OBSERVATION" "HUMAN_OBSERVATION" ...
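
The tutorial uses the dataset bundled with the package. For a CSV file downloaded by the reader, the same steps might look like the sketch below. The file contents and column selection are toy assumptions, and a temporary file is written first only to keep the example self-contained.

```r
# Write a tiny toy file; in practice this would be the CSV exported from InvertEbase
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "catalogNumber,basisOfRecord,individualCount",
  "T-001,PreservedSpecimen,4",
  "T-002,PreservedSpecimen,"
), csv_path)

# Read the file, treating empty cells as missing values
local.occ <- read.csv(csv_path, stringsAsFactors = FALSE, na.strings = c("", "NA"))

# Label the source and derive abundance as in the tutorial
local.occ$source <- "invert"
local.occ$abundance <- as.numeric(local.occ$individualCount)
str(local.occ)
```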

Merging the databases

The ec_db_merge function in the EcoCleanR package merges data from all sources, provided that each source has the same attribute names and the same number of columns. It also filters the data by the specified type (e.g., modern or fossil) and removes records whose occurrenceStatus is ‘absent’.

db_list <- list(gbif.occ_temp, obis.occ_temp, idig.occ_temp, sym.occ)
Mixdb.occ <- ec_db_merge(db_list = db_list, datatype = "modern")

str(Mixdb.occ[, 1:3])
#> tibble [2,310 × 3] (S3: tbl_df/tbl/data.frame)
#>  $ source       : chr [1:2310] "gbif" "gbif" "gbif" "gbif" ...
#>  $ catalogNumber: chr [1:2310] "258336784" "258586406" "260117394" "261990463" ...
#>  $ basisOfRecord: chr [1:2310] "modern" "modern" "modern" "modern" ...
ec_geographic_map(Mixdb.occ, "decimalLatitude", longitude = "decimalLongitude") # display the records that have coordinate values
#> Warning: Removed 667 rows containing missing values or values outside the scale range
#> (`geom_point()`).
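
Because the merge expects identical attribute names in every source, a quick check before merging can catch a missed rename. The sketch below uses toy data and a hypothetical helper, has_same_columns. The row binding and the ‘absent’ filter are shown only to illustrate the idea in base R and are not the implementation of ec_db_merge.

```r
# Hypothetical helper: TRUE when every data frame has exactly `fields` as its columns, in order
has_same_columns <- function(db_list, fields) {
  all(vapply(db_list, function(df) identical(names(df), fields), logical(1)))
}

fields <- c("source", "catalogNumber", "occurrenceStatus")
a <- data.frame(source = "gbif", catalogNumber = "1", occurrenceStatus = "PRESENT",
                stringsAsFactors = FALSE)
b <- data.frame(source = "obis", catalogNumber = "2", occurrenceStatus = "absent",
                stringsAsFactors = FALSE)
# This one still carries a lower-case column name, so the check should fail
c_bad <- data.frame(source = "idigbio", catalognumber = "3", occurrenceStatus = "",
                    stringsAsFactors = FALSE)

print(has_same_columns(list(a, b), fields))
print(has_same_columns(list(a, b, c_bad), fields))

# Base R illustration: stack the compatible tables and drop 'absent' records
merged <- do.call(rbind, list(a, b))
present_only <- merged[!(tolower(merged$occurrenceStatus) %in% "absent"), ]
print(nrow(present_only))
```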

Further documentation:

* See the data cleaning steps for the merged (Mixdb) dataset in the vignette: [data_cleaning]

* See the citation guidelines for files downloaded from GBIF, OBIS, iDigBio, and InvertEbase in vignettes/article/cite_data.rmd