---
title: "EcoCleanR: Overview of Steps for Merging Data from Online Biodiversity Resources"
output: html_vignette
vignette: >
  %\VignetteIndexEntry{data_merging}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  # eval = FALSE,
  fig.width = 8,
  fig.height = 6,
  out.width = "70%"
)
```
## Introduction
In this tutorial, we demonstrate the step-by-step process of downloading data from various sources such as GBIF, OBIS, and iDigBio using existing R packages, as well as from InvertEbase via a local CSV file. This process includes merging all data files and standardizing their formats to make them compatible for integration.

Example species: Mexacanthina lugubris

## Load packages
```{r setup}
library(EcoCleanR)
library(rgbif)
library(robis)
library(ridigbio)
library(dplyr)
```

## Input species name
```{r}
species_name <- "Mexacanthina lugubris"
taxonkey <- name_backbone(species_name)$usageKey
```
## Create an attribute list with TDWG-standardized names
The attributes in this list can be changed or extended as needed.

```{r}
attribute_list <- c("source", "catalogNumber", "basisOfRecord", "occurrenceStatus", "institutionCode", "verbatimEventDate", "scientificName", "individualCount", "organismQuantity", "abundance", "decimalLatitude", "decimalLongitude", "coordinateUncertaintyInMeters", "locality", "verbatimLocality", "municipality", "county", "stateProvince", "country", "countryCode")
```

## GBIF - data extraction and standardization
This step uses the `occ_data` function of the rgbif package to extract data from GBIF.
```{r}
gbif.occ <- occ_data(taxonKey = taxonkey, occurrenceStatus = NULL, limit = 10000L)$data

# See article/cite_data.Rmd for instructions on citing data from GBIF data providers

# Add a field that records the data source
gbif.occ$source <- "gbif"
for (field in attribute_list) {
  if (!field %in% names(gbif.occ)) {
    gbif.occ[[field]] <- NA # Add the missing field as NA
  }
}

# Build a single abundance column: use individualCount, falling back to organismQuantity when it is missing
gbif.occ$abundance <- ifelse(is.na(as.numeric(gbif.occ$individualCount)), as.numeric(gbif.occ$organismQuantity), as.numeric(gbif.occ$individualCount))
gbif.occ_temp <- gbif.occ[, attribute_list]
str(gbif.occ_temp[, 1:3])
```
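
The fallback used for `abundance` above can be hard to read in a single line. The chunk below is a minimal sketch on invented toy values (the `demo_` names are hypothetical and not part of EcoCleanR): it keeps `individualCount` where it is available and uses `organismQuantity` only where the count is missing.

```{r}
# Toy records (invented values) with both count columns stored as text
demo_counts <- data.frame(
  individualCount = c("3", NA, NA),
  organismQuantity = c("10", "7", NA),
  stringsAsFactors = FALSE
)
demo_n_individual <- as.numeric(demo_counts$individualCount)
demo_n_quantity <- as.numeric(demo_counts$organismQuantity)

# Prefer individualCount; fall back to organismQuantity only where it is NA
demo_counts$abundance <- ifelse(
  is.na(demo_n_individual),
  demo_n_quantity,
  demo_n_individual
)
demo_counts$abundance
```

Rows where both columns are missing stay `NA`, so they can still be identified during later cleaning.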

## OBIS - data extraction and standardization
This step uses the `occurrence` function of the robis package to extract data from OBIS.
```{r}
obis.occ <- occurrence(species_name)
for (field in attribute_list) {
  if (!field %in% names(obis.occ)) {
    obis.occ[[field]] <- NA # Add the missing field as NA
  }
}
obis.occ$abundance <- ifelse(is.na(as.numeric(obis.occ$individualCount)), as.numeric(obis.occ$organismQuantity), as.numeric(obis.occ$individualCount))
obis.occ$source <- "obis"
obis.occ$municipality <- ""
obis.occ_temp <- obis.occ[, attribute_list]
str(obis.occ_temp[, 1:3])
```
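
The same "add missing fields, then keep only the standard attributes" pattern is repeated for each source. As a sketch only, it could be wrapped in a small helper. `demo_standardize()` and the toy data below are hypothetical and are not functions or datasets provided by EcoCleanR.

```{r}
# Hypothetical helper: add an NA column for every missing attribute,
# then keep only the requested attributes, in the requested order
demo_standardize <- function(df, attributes) {
  for (field in setdiff(attributes, names(df))) {
    df[[field]] <- NA
  }
  df[, attributes, drop = FALSE]
}

# Toy input (invented values) that lacks "locality" and has an extra column
demo_raw <- data.frame(
  scientificName = c("Mexacanthina lugubris", "Mexacanthina lugubris"),
  decimalLatitude = c(32.5, 31.9),
  extraColumn = c("a", "b")
)
demo_std <- demo_standardize(
  demo_raw,
  c("scientificName", "locality", "decimalLatitude")
)
names(demo_std)
```

Because every source ends up with identical column names in the same order, the tables can later be stacked without further alignment.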

## iDigBio - data extraction and standardization
This step uses the `idig_search_records` function of the ridigbio package to extract data from iDigBio.
```{r}
# Search iDigBio records by scientific name and request all indexed fields
idig.occ <- idig_search_records(
  rq = list("scientificname" = species_name),
  fields = "all",
  max_items = 10000L,
  limit = 10000L,
  offset = 0
)

idig.occ <- idig.occ %>%
  mutate(
    abundance = as.numeric(individualcount),
    source = "idigbio",
    occurrenceStatus = "",
    organismQuantity = ""
  ) %>%
  rename(
    decimalLatitude = geopoint.lat,
    decimalLongitude = geopoint.lon,
    basisOfRecord = basisofrecord,
    catalogNumber = catalognumber,
    scientificName = scientificname,
    stateProvince = stateprovince,
    coordinateUncertaintyInMeters = coordinateuncertainty,
    individualCount = individualcount,
    institutionCode = institutioncode,
    verbatimLocality = verbatimlocality,
    verbatimEventDate = verbatimeventdate,
    countryCode = countrycode
  )

idig.occ_temp <- idig.occ[, attribute_list]
str(idig.occ_temp[, 1:3])
```
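
The renaming step above maps lowercase iDigBio field names to their camelCase equivalents. A base R alternative, sketched here on an invented one-row table, is to keep the mapping in a named lookup vector so it can be reused or extended. The `demo_` objects are hypothetical.

```{r}
# Toy table (invented values) using lowercase names like those renamed above
demo_idig <- data.frame(
  scientificname = "mexacanthina lugubris",
  catalognumber = "12345",
  geopoint.lat = 32.5,
  geopoint.lon = -117.1
)

# Lookup vector: names are the old column names, values are the new names
demo_lookup <- c(
  scientificname = "scientificName",
  catalognumber = "catalogNumber",
  geopoint.lat = "decimalLatitude",
  geopoint.lon = "decimalLongitude"
)

# Rename only the columns that appear in the lookup; leave the rest untouched
demo_hits <- names(demo_idig) %in% names(demo_lookup)
names(demo_idig)[demo_hits] <- unname(demo_lookup[names(demo_idig)[demo_hits]])
names(demo_idig)
```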

## Local file (InvertEbase) - data read and standardization

The local file `example_sp_invertebase` was downloaded manually from InvertEbase for Mexacanthina lugubris. See the `example_sp_invertebase` dataset documentation for its attributes and DwC format.
```{r}
sym.occ <- example_sp_invertebase
sym.occ$abundance <- as.numeric(sym.occ$individualCount)

for (field in attribute_list) {
  if (!field %in% names(sym.occ)) {
    sym.occ[[field]] <- NA # Add the missing field as NA
  }
}

str(sym.occ[, 1:3])
```
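
The chunk above uses the dataset shipped with the package. For a file downloaded by hand, the same step would start from a CSV on disk. The sketch below assumes a plain comma-separated file. It writes a tiny stand-in file with invented values to a temporary path so that it runs anywhere, and the `demo_` names are hypothetical.

```{r}
# Write a tiny stand-in CSV (invented values) to a temporary file
demo_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(catalogNumber = c("A1", "A2"), individualCount = c("4", "2")),
  demo_path,
  row.names = FALSE
)

# Read every column as text so identifiers are not converted on import
demo_local <- read.csv(demo_path, colClasses = "character")

# Derive the numeric abundance column, as done for the packaged dataset
demo_local$abundance <- as.numeric(demo_local$individualCount)
unlink(demo_path)
str(demo_local)
```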

## Merging the databases
The `ec_db_merge` function in the EcoCleanR package merges data from all sources, provided that each source has the same attribute names and number of columns. It also filters the data by the specified type (e.g., modern or fossil) and removes records whose occurrenceStatus is 'absent'.
```{r}
db_list <- list(gbif.occ_temp, obis.occ_temp, idig.occ_temp, sym.occ)
Mixdb.occ <- ec_db_merge(db_list = db_list, datatype = "modern")

str(Mixdb.occ[, 1:3])
ec_geographic_map(Mixdb.occ, "decimalLatitude", longitude = "decimalLongitude") # display the records that have coordinate values
```
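
To make the requirement of identical attribute names concrete, the chunk below sketches the general idea of stacking sources and dropping absence records in base R. It is an illustration on invented toy data under our own assumptions, not the implementation of `ec_db_merge`, and it does not cover the modern/fossil filter.

```{r}
# Two toy sources (invented values) that already share the same columns
demo_a <- data.frame(
  source = "gbif",
  occurrenceStatus = c("PRESENT", "ABSENT"),
  decimalLatitude = c(32.5, 31.9),
  stringsAsFactors = FALSE
)
demo_b <- data.frame(
  source = "obis",
  occurrenceStatus = c("present", NA),
  decimalLatitude = c(30.1, 29.4),
  stringsAsFactors = FALSE
)

# Stack the sources; rbind() on data frames requires matching column names
demo_merged <- do.call(rbind, list(demo_a, demo_b))

# Drop rows explicitly flagged as absent, keeping rows with unknown status
demo_is_absent <- !is.na(demo_merged$occurrenceStatus) &
  tolower(demo_merged$occurrenceStatus) == "absent"
demo_merged <- demo_merged[!demo_is_absent, ]
table(demo_merged$source)
```

Comparing the status in lower case is a design choice for this sketch, made because sources may differ in how they capitalize the value.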

Further documents:

* See the data cleaning steps for the merged dataset (`Mixdb.occ`) in the `data_cleaning` vignette.
* See the citation guidelines for files downloaded from GBIF, OBIS, iDigBio, and InvertEbase in vignettes/article/cite_data.rmd.