EcoCleanR: Overview of Steps for Data Cleaning and Defining Biogeographic Ranges

Introduction:

In this tutorial, we will demonstrate the step-by-step process of cleaning occurrences and extracting environmental data for the coastal species Mexacanthina lugubris.

This workflow covers the following data-cleaning steps:
1. Remove duplicates
2. Check for bad taxa using WoRMS
3. Improve coordinate information using external georeferencing tools
4. Check coordinate precision and rounding
5. Flag records assigned to the wrong ocean/sea or to inland locations
6. Extract environmental variables
7. Impute environmental variables when no value is assigned from online resources
8. Identify outliers
9. Visualize the data

Each step is demonstrated below.

Note: For details on data integration, see the [data_merging] vignette.

# package loading
library(EcoCleanR)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
# provide example species name
species_name <- "Mexacanthina lugubris"

Step 1: Remove duplicates

Remove duplicates from the merged dataset, which is the product of combining data from various sources.

ec_rm_duplicate returns an occurrence table, “ecodata”, after removing duplicates based on unique catalog numbers, while retaining abundance counts wherever available.
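Conceptually, deduplication on unique catalog numbers resembles the dplyr sketch below (an illustration only; the hypothetical dedup_sketch does not reproduce the package's abundance handling, and records lacking a catalogNumber would need separate care):

# sketch: keep one row per catalogNumber, preferring rows with an abundance value
dedup_sketch <- Mixdb.occ %>%
  arrange(catalogNumber, is.na(abundance)) %>%  # rows carrying abundance sort first
  distinct(catalogNumber, .keep_all = TRUE)     # first row per catalog number kept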

ecodata <- ec_rm_duplicate(Mixdb.occ, catalogNumber = "catalogNumber", abundance = "abundance")
str(ecodata[,1:3])

### Optional: check the institution code
# inst_counts <- ecodata %>%
#   group_by(institutionCode) %>%
#   summarise(record_count = n(), .groups = "drop")

### Optional: drop records from a specific institution
# ecodata <- ecodata %>%
#   filter(is.na(institutionCode) | institutionCode != "NRM")
ec_geographic_map(ecodata,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude"
)
#> Warning: Removed 358 rows containing missing values or values outside the scale range
#> (`geom_point()`).

Step 2: Remove bad taxa

This step checks whether any incorrect taxonomy was fetched from online data sources by verifying each name against the accepted synonyms in the WoRMS (World Register of Marine Species) taxonomy database.

ec_worms_synonym returns a table named comparison with two columns: the first lists the synonyms accepted in the WoRMS database, and the second lists the unique species names from ecodata with their occurrence counts.

comparison <- ec_worms_synonym(species_name,
  ecodata,
  scientificName = "scientificName"
)
#> ══  1 queries  ═══════════════
#> 
#> Retrieving data for taxon 'Mexacanthina lugubris'
#> ✔  Found:  Mexacanthina lugubris
#> ══  Results  ═════════════════
#> 
#> • Total: 1 
#> • Found: 1 
#> • Not Found: 0
print(comparison)
#>       Accepted_syn_worms                            ecodata_syn_with_count
#> 1     Acanthina lugubris           Acanthina lugubre (I.Sowerby, 1822) (6)
#> 2  Acanthina tyrianthina    Acanthina lugubris (G.B.Sowerby I, 1822) (215)
#> 3       Buccinum armatum            Acanthina lugubris (Sowerby, 1821) (2)
#> 4  Mexacanthina lugubris        Acanthina tyrianthina S.S.Berry, 1957 (27)
#> 5      Monoceros cymatum                        Mexacanthina lugubris (13)
#> 6 Monoceros denticulatum Mexacanthina lugubris (G.B.Sowerby I, 1822) (834)
#> 7      Monoceros lugubre         Monoceros cymatum G.B.Sowerby I, 1835 (1)
#> 8                   <NA>        Monoceros lugubre G.B.Sowerby I, 1822 (16)
#> 9                   <NA>                         mexacanthina lugubris (1)
# Compare the columns to identify any taxa that are not accepted synonyms in
# the WoRMS database, then filter bad taxa from ecodata using dplyr::filter()
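# For example, a minimal sketch (the binomial extraction and exact-match rule
# are assumptions, not the package's method): keep only records whose binomial
# matches an accepted WoRMS synonym.
accepted <- comparison$Accepted_syn_worms[!is.na(comparison$Accepted_syn_worms)]
ecodata <- ecodata %>%
  mutate(binomial = sub("^(\\S+ \\S+).*", "\\1", scientificName)) %>%
  filter(binomial %in% accepted) %>%
  select(-binomial)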

Step 3: Georeferencing using an external tool

This step uses the function ec_flag_with_locality to generate a table, “data_need_correction”, of records whose georeferences can potentially be assigned from the locality and verbatim locality information in the occurrence table.

This table can then be used as an input file for the GEOLocate tool to perform georeferencing. Currently, there is no R code available to automate this process; it requires manual validation of the assigned coordinates and uncertainty for each record (see Scenario B in the manuscript for this example).

Use ec_merge_corrected_coordinates to merge the corrected coordinates (latitude, longitude, and coordinate uncertainty) back into the main data table. The pre-saved data file ecodata_corrected.rda is an example to use as a template.

ecodata$flag_check_geolocate <- ec_flag_with_locality(ecodata,
  uncertainty = "coordinateUncertaintyInMeters",
  locality = "locality",
  verbatimLocality = "verbatimLocality"
)
str(ecodata[,1:3])
data_need_correction <- ecodata %>%
  filter(flag_check_geolocate != 1)
# save the data as a local file to upload to the GEOLocate web tool
# write.csv(data_need_correction, "data_check_geolocate.csv")

# Load back the corrected coordinate file exported from GEOLocate
#ecodata_corrected <- read.csv("M lugubris_corrected_geolocate.csv")

### Merge records with improved georeferences:
ecodata <- ec_merge_corrected_coordinates(
  ecodata_corrected, 
  ecodata, 
  latitude = "decimalLatitude", 
  longitude = "decimalLongitude",       
  uncertainty_col = "coordinateUncertaintyInMeters"
)
str(ecodata[,1:3]) 

### Plot the map to visualize the datapoints
ec_geographic_map(ecodata, latitude = "decimalLatitude", longitude = "decimalLongitude")

Step 4: Extremely high uncertainty

The ec_filter_by_uncertainty function can be used in all scenarios (see manuscript) to remove records with extremely high coordinate uncertainty from the remaining data.
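Conceptually, the suggested cutoff is a quantile of the uncertainty column, as in this sketch (an illustration of the idea, not the package internals):

# illustrate the 95th-percentile threshold on the uncertainty column
thr <- quantile(ecodata$coordinateUncertaintyInMeters, probs = 0.95, na.rm = TRUE)
# records with uncertainty above thr are candidates for removal, e.g.
# ecodata %>% filter(coordinateUncertaintyInMeters <= thr)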

ecodata_cl <- ec_filter_by_uncertainty(ecodata,
  uncertainty_col = "coordinateUncertaintyInMeters",
  percentile = 0.95,
  ask = FALSE,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude"
)
#> Suggested threshold at 95th percentile: 13000
str(ecodata_cl[, 1:3])
#> 'data.frame':    734 obs. of  3 variables:
#>  $ X               : int  1 3 5 7 8 9 10 11 12 13 ...
#>  $ basisOfRecord   : chr  "modern" "modern" "modern" "modern" ...
#>  $ occurrenceStatus: chr  "PRESENT" "PRESENT" "PRESENT" "PRESENT" ...
### plot the map
ec_geographic_map(ecodata_cl,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude"
)

Step 5: Coordinate precision and rounding

ec_flag_precision checks the precision of the coordinates. The function flags records based on two checkpoints: 1) each coordinate is flagged independently if it has fewer than two decimal places; 2) each coordinate is checked for rounding to the nearest 0.5°.
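For a single coordinate value, the two checks could look roughly like this (an illustration of the logic, not the package implementation; real data may need a floating-point tolerance):

# count decimal places of a coordinate via its printed representation
n_decimals <- function(x) {
  s <- format(x, scientific = FALSE)
  ifelse(grepl("\\.", s), nchar(sub("^[^.]*\\.", "", s)), 0L)
}
n_decimals(32.1) < 2  # TRUE: fewer than 2 decimal places, so flagged
(32.5 %% 0.5) == 0    # TRUE: rounded to the nearest 0.5 degree, so flagged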

ecodata_cl$flag_precision <- ec_flag_precision(ecodata_cl,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude"
)

# filter records flagged by flag_precision
ecodata_cl <- ecodata_cl %>%
  filter(flag_precision != 1)
str(ecodata_cl[, 1:3])
#> 'data.frame':    728 obs. of  3 variables:
#>  $ X               : int  1 3 5 7 8 9 10 11 12 13 ...
#>  $ basisOfRecord   : chr  "modern" "modern" "modern" "modern" ...
#>  $ occurrenceStatus: chr  "PRESENT" "PRESENT" "PRESENT" "PRESENT" ...

Step 6: Records tagged to the wrong ocean/sea

The ec_flag_non_region function helps identify records incorrectly tagged to the wrong ocean or sea. Expert knowledge is required to recognize when a species is unlikely to occur in certain ocean regions. For example, Mexacanthina lugubris is an Eastern Pacific species, so any occurrences reported from the Atlantic Ocean are flagged by this function.
The variable “direction” accepts “east” or “west”, and the variable “ocean” accepts “atlantic” or “pacific”.
If a given species lives globally, this cleaning step is not needed.

# This is a heavy processing step and won’t be executed during vignette building.
direction <- "east"
buffer <- 25000
ocean <- "pacific"
ecodata_cl$flag_non_region <- ec_flag_non_region(direction,
  ocean,
  buffer,
  ecodata_cl,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude"
)
str(ecodata_cl[, 1:3])
# filter flagged records
ecodata_cl <- ecodata_cl %>%
  filter(flag_non_region != 1)
### map view to see accepted records
ec_geographic_map(ecodata_cl,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude"
)

Step 7: Extract the environmental data

The ec_extract_env_layers function extracts environmental data for occurrence points using their associated coordinates. It is built on the ‘sdmpredictors’ package to extract layers from sources such as Bio-ORACLE, MARSPEC, and WorldClim.
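Under the hood, extraction is conceptually equivalent to loading rasters with sdmpredictors and sampling them at each coordinate, roughly as in this sketch (not the wrapper’s exact code; left unevaluated because the layers are downloaded on first use):

# rasters <- sdmpredictors::load_layers(c("BO_sstmean"))
# vals <- raster::extract(
#   rasters,
#   ecodata_unique[, c("decimalLongitude", "decimalLatitude")]  # longitude first
# )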

ec_impute_env_values imputes environmental data for coordinates lacking values in the environmental data sources, assigning the average of existing values within the input radius.
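For a single missing point, radius-based imputation amounts to averaging the non-missing neighbor values within radius_km, as in this sketch (an assumption about the approach; impute_one is hypothetical, and geosphere is assumed to be installed):

# average neighbor values within radius_km of one point (lon/lat order)
impute_one <- function(pt_lonlat, nbrs_lonlat, nbr_vals, radius_km = 10) {
  d_km <- geosphere::distHaversine(pt_lonlat, nbrs_lonlat) / 1000
  mean(nbr_vals[d_km <= radius_km], na.rm = TRUE)
}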

# This is a heavy processing step and won’t be executed during vignette building.
# get the unique combinations of coordinates
ecodata_unique <- ecodata_cl[, c("decimalLatitude", "decimalLongitude")]
ecodata_unique <- base::unique(ecodata_unique)
# It is recommended to check which layers are available in sdmpredictors and their correct names.
# available_layers <- list_layers() # returns something like c("BO_sstmean", "BO_sstmax", ...)
# provide layers as input to the env_layers variable
env_layers <- c("BO_sstmean", "BO_sstmin", "BO_sstmax")

### extract env layers
ecodata_unique <- ec_extract_env_layers(ecodata_unique,
  env_layers = env_layers,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude"
)
# A warning message appears if layers are already saved in the cache.

### impute env values that were missing after extraction
ecodata_unique <- ec_impute_env_values(
  ecodata_unique,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude",
  radius_km = 10,
  iter = 3
)

### omit coordinates that could not be assigned env values after imputation
ecodata_unique <- na.omit(ecodata_unique)

Step 8: Identify outliers

The ec_flag_outlier function identifies outliers based on both spatial (coordinate) and non-spatial (environmental) attributes (see the manuscript for more detail about this function).
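The non-spatial side can be pictured as a Mahalanobis-distance check in environmental space, sketched below (an illustration only; the package combines this with a spatial check and repeats it over iterations):

# flag points whose environmental Mahalanobis distance exceeds the 99th percentile
env <- as.matrix(ecodata_unique[, env_layers])
md  <- mahalanobis(env, center = colMeans(env), cov = cov(env))
env_outlier <- md > quantile(md, probs = 0.99)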

# This is a heavy processing step and won’t be executed during vignette building.
# Instead, we use a pre-saved cleaned file.
ecodata_unique$flag_outliers <- ec_flag_outlier(ecodata_unique,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude",
  env_layers,
  itr = 50,
  k = 3,
  geo_quantile = 0.99,
  maha_quantile = 0.99
)$outlier

### merge these unique combinations of coordinates, environmental variables, and outlier flags into the main ecodata_cl table
ecodata_cl <- ecodata_cl %>%
  left_join(ecodata_unique[, c("decimalLatitude", "decimalLongitude", "flag_outliers", env_layers)],
    by = c("decimalLatitude", "decimalLongitude")
  )
# a pre-saved file, ecodata_with_outliers, is used below instead of ecodata_cl
### map view to see records with outlier probability
ec_geographic_map_w_flag(ecodata_with_outliers,
  flag_column = "outliers",
  latitude = "decimalLatitude",
  longitude = "decimalLongitude"
)
#> Ignoring unknown labels:
#> • colour : "Flag"

### Filter out records with high outlier probability (e.g., > 0.90 or > 0.95)
ecodata_cleaned <- ecodata_cl %>%
  filter(flag_outliers < 0.95)
### map view to visualize accepted data
ec_geographic_map(ecodata_cleaned,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude"
)

Step 9: Display the final accepted biogeographic range

ec_var_summary generates a summary table of accepted occurrences after data cleaning, showing mean, minimum, and maximum values for spatial and non-spatial attributes.

ec_plot_var_range shows a plot of accepted ranges.

env_layers <- c("BO_sstmean", "BO_sstmax", "BO_sstmin")
data("ecodata_cleaned")
summary_table <- ec_var_summary(ecodata_cleaned,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude",
  env_layers
)
head(summary_table)
#>           variable     Max     Min    Mean
#> 1  decimalLatitude   34.04   22.92   31.73
#> 2 decimalLongitude -106.10 -118.94 -116.58
#> 3       BO_sstmean   29.04   16.15   17.97
#> 4        BO_sstmax   32.68   18.79   22.47
#> 5        BO_sstmin   24.96   11.42   14.41

ec_plot_var_range(ecodata_with_outliers,
  summary_df = summary_table,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude",
  env_layers = env_layers
)

Further documents:

* see the data merging vignette: [data_merging]

* see the citation guidelines for files downloaded from GBIF, OBIS, iDigBio, and InvertEBase: vignettes/article/cite_data.rmd