---
title: "5.2 Summary metrics"
author: "Pierre Denelle, Boris Leroy and Maxime Lenormand"
date: "`r Sys.Date()`"
output:
  html_vignette:
    number_sections: true
bibliography: '`r system.file("REFERENCES.bib", package="bioregion")`'
csl: style_citation.csl
vignette: >
  %\VignetteIndexEntry{5.2 Summary metrics}
  \usepackage[utf8]{inputenc}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE,
                      fig.width = 6, fig.height = 6)
# Packages --------------------------------------------------------------------
suppressPackageStartupMessages({
  suppressWarnings({
    library("bioregion")
    library("dplyr")
    library("ggplot2")
    library("sf")
  })
})

options(tinytex.verbose = TRUE)
```

In this vignette, we describe two functions to compute summary metrics:

- metrics calculated for each species and/or site: `site_species_metrics()`
- metrics calculated for each bioregion: `bioregion_metrics()`

# 1. Terminology clarification

The `bioregion` package focuses on bioregionalization, i.e. the clustering of
geographical areas on the basis of species data. However, there are several
cases where species can also become part of the clustering (for example, in
bipartite network clustering), which poses terminology issues. To be
conceptually accurate, we have chosen to name species clusters 'chorotypes':

- **Bioregion**: A group of sites with similar species composition, identified
through clustering analysis. Bioregions are geographic units.

- **Chorotype**: A group of species with similar distributions within the
study area. Chorotypes are biological units. This generally corresponds to the
concept of "regional chorotype" sensu [@BaroniUrbani1978], as clarified by
[@Fattorini2015]. Note that when clustering on worldwide ranges, the concept
becomes "global chorotypes" (see [@Fattorini2015] for further details).

## Possible cases of chorotypes

| Clustering scenario | Site clusters | Species clusters | Conceptual basis |
|:--------------------|:-------------:|:----------------:|:-----------------|
| Site-only clustering | Bioregions | — | Sites grouped by compositional similarity |
| Bipartite network clustering | Bioregions | Chorotypes (same cluster IDs) | Sites and species grouped by shared network structure |
| Species-only clustering | — | Chorotypes | Species grouped by distributional similarity |
| Post-hoc species assignment | Bioregions | Chorotypes (derived) | Species assigned to bioregions based on specificity/IndVal |

### Bipartite network clustering

In bipartite network clustering, both sites and species are assigned to the
**same clusters** (network modules). A species assigned to cluster 1 belongs
to the same bioregion as sites assigned to cluster 1. We use the term
**chorotype** to refer to the set of species assigned to a given bioregion,
but it is important to understand that:

> **In bipartite clustering, bioregion ID = chorotype ID.** They are two
> perspectives on the same network partition: bioregion refers to the sites
> in a cluster, chorotype refers to the species in that same cluster.

### Site-only clustering with post-hoc species assignment

Species can be secondarily assigned to bioregions based on metrics such as
maximum specificity or IndVal. Here, **chorotype** refers to the group of
species most strongly associated with a given bioregion. Unlike bipartite
clustering, this assignment is derived rather than intrinsic to the clustering
algorithm.

### Species-only clustering

When clustering species directly (e.g., by distributional similarity), the
resulting groups are true **chorotypes** in the regional sense
[@Fattorini2015]: species with similar distributions within the study area.

# 2. Example data

We use the vegetation dataset included in the `bioregion` package.

```{r}
data("vegedf")
data("vegemat")

# Calculation of (dis)similarity matrices
vegedissim <- dissimilarity(vegemat, metric = c("Simpson"))
vegesim <- dissimilarity_to_similarity(vegedissim)
```

# 3. Bioregionalization

We use the same three bioregionalization algorithms as in the
[visualization vignette](https://biorgeo.github.io/bioregion/articles/a5_1_visualization.html),
i.e., non-hierarchical, hierarchical, and network bioregionalizations.
In addition, we include a network bioregionalization algorithm based on a
bipartite network, which assigns clusters to both sites and species.
We chose three bioregions for the non-hierarchical and hierarchical
bioregionalizations.

```{r}
# Non hierarchical bioregionalization
vege_nhclu <- nhclu_kmeans(vegedissim, n_clust = 3, index = "Simpson",
                           seed = 1)
vege_nhclu$cluster_info

# Hierarchical bioregionalization
set.seed(1)
vege_hclu <- hclu_hierarclust(dissimilarity = vegedissim,
                              index = "Simpson",
                              method = "average", n_clust = 3,
                              optimal_tree_method = "best",
                              verbose = FALSE)
vege_hclu$cluster_info

# Network bioregionalization
set.seed(1)
vege_netclu <- netclu_walktrap(vegesim,
                               index = "Simpson")
vege_netclu$cluster_info

# Bipartite network bioregionalization
install_binaries(verbose = FALSE)
vege_netclubip <- netclu_infomap(vegedf, seed = 1, bipartite = TRUE)
vege_netclubip$cluster_info
```

# 4. Metric components

Before diving into specific metrics, we can understand the core terms using a
simple example. Consider a study area with **4 sites** and **4 species**, where
sites have been assigned to **2 bioregions**.

## 4.1 Species-derived metrics

The following diagram shows the site-species matrix where sites are grouped by
bioregion.
Marginal sums give us all the core terms needed to compute metrics:

```
                       Species
                  sp1   sp2   sp3   sp4     n_b
                ┌─────┬─────┬─────┬─────┐
         Site A │  1  │  1  │  ·  │  ·  │
   B1    ───────┼─────┼─────┼─────┼─────┤    2
         Site B │  1  │  1  │  1  │  ·  │
Bioregion ══════╪═════╪═════╪═════╪═════╪══════
         Site C │  ·  │  1  │  1  │  1  │
   B2    ───────┼─────┼─────┼─────┼─────┤    2
         Site D │  ·  │  ·  │  1  │  1  │
                └─────┴─────┴─────┴─────┘

n_sb              sp1   sp2   sp3   sp4     n_b
(per bioregion) ┌─────┬─────┬─────┬─────┐
             B1 │  2  │  2  │  1  │  0  │    2
                ├─────┼─────┼─────┼─────┤
             B2 │  0  │  1  │  2  │  2  │    2
                └─────┴─────┴─────┴─────┘
n_s (total)        2     3     3     2     n = 4
K_s (# bioreg)     1     2     2     1     K = 2
```

| Term | Meaning | Where to find it |
|:-----|:--------|:-----------------|
| $n$ | Total number of sites | Bottom-right corner (4) |
| $K$ | Total number of bioregions | Bottom-right corner (2) |
| $n_b$ | Sites in bioregion $b$ | Right margin per bioregion row |
| $n_s$ | Sites where species $s$ occurs | Bottom margin per species column ($n_s$ row) |
| $K_s$ | Number of bioregions where species $s$ occurs | Bottom margin per species column ($K_s$ row) |
| $n_{sb}$ | Sites in bioregion $b$ with species $s$ | The $n_{sb}$ summary table |

### Examples of calculations

From the $n_{sb}$ table, all species-per-bioregion metrics follow directly:

**Specificity** (fraction of species' occurrences in a bioregion):

$$A_{sp1,B1} = \frac{n_{sp1,B1}}{n_{sp1}} = \frac{2}{2} = 1.00 \quad \text{(sp1 is exclusive to B1)}$$

$$A_{sp2,B1} = \frac{n_{sp2,B1}}{n_{sp2}} = \frac{2}{3} = 0.67 \quad \text{(sp2 mostly in B1)}$$

**Fidelity** (fraction of bioregion's sites with the species):

$$B_{sp2,B1} = \frac{n_{sp2,B1}}{n_{B1}} = \frac{2}{2} = 1.00 \quad \text{(sp2 in all B1 sites)}$$

$$B_{sp3,B1} = \frac{n_{sp3,B1}}{n_{B1}} = \frac{1}{2} = 0.50 \quad \text{(sp3 in half of B1)}$$

**IndVal** (indicator value = Specificity × Fidelity):

$$IndVal_{sp1,B1} = 1.00 \times 1.00 = 1.00 \quad \text{(perfect indicator of B1)}$$

$$IndVal_{sp2,B1} = 0.67 \times 1.00 = 0.67$$

## 4.2 Site-derived metrics

The following diagram shows the same
site-species matrix, but now **species are grouped by cluster** (chorotype).
We compute how many species from each cluster occur in each site:

```
                       Chorotypes
                ┌─── C1 ───┐ ┌─── C2 ───┐
                  sp1   sp2   sp3   sp4
                ┌─────┬─────┬─────┬─────┐
         Site A │  1  │  1  │  ·  │  ·  │    2
                ├─────┼─────┼─────┼─────┤
  Sites  Site B │  1  │  1  │  1  │  ·  │    3
                ├─────┼─────┼─────┼─────┤
         Site C │  ·  │  1  │  1  │  1  │    3
                ├─────┼─────┼─────┼─────┤
         Site D │  ·  │  ·  │  1  │  1  │    2
                └─────┴─────┴─────┴─────┘
n_c                   2           2        n = 4

n_gc               C1      C2       n_g
(per cluster)   ┌───────┬───────┐
         Site A │   2   │   0   │    2
                ├───────┼───────┤
         Site B │   2   │   1   │    3
                ├───────┼───────┤
         Site C │   1   │   2   │    3
                ├───────┼───────┤
         Site D │   0   │   2   │    2
                └───────┴───────┘
n_c                 2       2      n = 4
```

| Term | Meaning | Where to find it |
|:-----|:--------|:-----------------|
| $n$ | Total number of species | Bottom-right corner (4) |
| $n_c$ | Species in cluster $c$ | Bottom margin per cluster |
| $n_g$ | Species present in site $g$ | Right margin per site row |
| $n_{gc}$ | Species from cluster $c$ present in site $g$ | The $n_{gc}$ summary table |

**NOTE:** in bipartite clustering, bioregions and chorotypes can be the
**exact same clusters**. Nevertheless, we use different terms here to avoid
confusion in the calculation of metrics.

### Examples of calculations

**Specificity** of Site A for C1 (fraction of site's species belonging to C1):

$$A_{A,C1} = \frac{n_{A,C1}}{n_A} = \frac{2}{2} = 1.00 \quad \text{(Site A has only C1 species)}$$

**Specificity** of Site B for C1:

$$A_{B,C1} = \frac{n_{B,C1}}{n_B} = \frac{2}{3} = 0.67 \quad \text{(Site B mostly has C1 species)}$$

**Fidelity** of Site A for C1 (fraction of C1 species present in Site A):

$$B_{A,C1} = \frac{n_{A,C1}}{n_{C1}} = \frac{2}{2} = 1.00 \quad \text{(Site A has all C1 species)}$$

**Fidelity** of Site C for C1:

$$B_{C,C1} = \frac{n_{C,C1}}{n_{C1}} = \frac{1}{2} = 0.50 \quad \text{(Site C has half of C1 species)}$$

# 5. List of site/species metrics included in the package

## Metrics per cluster

### When clusters are assigned to sites (`cluster_on = "site"` or `cluster_on = "both"`)

| Metric | Entity | Cluster type | Based on | Occ | Ab | Formula (occurrence) | Interpretation |
|:-------|:-------|:------------------|:---------|:---:|:--:|:---------------------|:---------------|
| Specificity | Species | Bioregion | Co-occurrence | ✓ | ✓ | $A_{sb} = \frac{n_{sb}}{n_s}$ | Fraction of species' occurrences in bioregion |
| NSpecificity | Species | Bioregion | Co-occurrence | ✓ | ✓ | $\bar{A}_{sb} = \frac{n_{sb}/n_b}{\sum_k n_{sk}/n_k}$ | Size-normalized specificity |
| Fidelity | Species | Bioregion | Co-occurrence | ✓ | ✓ | $B_{sb} = \frac{n_{sb}}{n_b}$ | Fraction of bioregion's sites with species |
| IndVal | Species | Bioregion | Co-occurrence | ✓ | ✓ | $A_{sb} \times B_{sb}$ | Indicator value (specificity × fidelity) |
| NIndVal | Species | Bioregion | Co-occurrence | ✓ | ✓ | $\bar{A}_{sb} \times B_{sb}$ | Size-normalized indicator value |
| Rho | Species | Bioregion | Co-occurrence | ✓ | ✓ | See section 7.1.1 | Standardized contribution index |
| CoreTerms | Species | Bioregion | Co-occurrence | ✓ | ✓ | $n$, $n_b$, $n_s$, $n_{sb}$ | Raw counts for custom calculations |
| | | | | | | | |
| Richness | Site | — | Co-occurrence | ✓ | — | $S_g = n_g$ | Number of species |
| Rich_Endemics | Site | Bioregion | Co-occurrence | ✓ | — | $E_g = \sum_{s \in g} \mathbb{1}(K_s = 1)$ | Number of endemic species in the site (i.e., species occurring in only one bioregion) |
| Prop_Endemics | Site | Bioregion | Co-occurrence | ✓ | — | $\bar{PctEnd}_{g} = \frac{E_g}{S_g}$ | Proportion of endemic species in the site |
| | | | | | | | |
| MeanSim | Site | Bioregion | Similarity | — | — | $\frac{1}{n_b - \delta} \sum_{g' \neq g} sim_{gg'}$ | Mean similarity to bioregion |
| SdSim | Site | Bioregion | Similarity | — | — | See section 7.2.2 | SD of similarity to bioregion |

### When clusters are assigned to species (`cluster_on = "species"` or `cluster_on = "both"`)

| Metric | Entity | Cluster type | Based on | Occ | Ab | Formula (occurrence) | Interpretation |
|:-------|:-------|:------------------|:---------|:---:|:--:|:---------------------|:---------------|
| Specificity | Site | Chorotype | Co-occurrence | ✓ | ✓ | $A_{gc} = \frac{n_{gc}}{n_g}$ | Fraction of site's species in cluster |
| NSpecificity | Site | Chorotype | Co-occurrence | ✓ | ✓ | $\bar{A}_{gc} = \frac{n_{gc}/n_c}{\sum_k n_{gk}/n_k}$ | Size-normalized specificity |
| Fidelity | Site | Chorotype | Co-occurrence | ✓ | ✓ | $B_{gc} = \frac{n_{gc}}{n_c}$ | Fraction of cluster's species in site |
| IndVal | Site | Chorotype | Co-occurrence | ✓ | ✓ | $A_{gc} \times B_{gc}$ | Indicator value (specificity × fidelity) |
| NIndVal | Site | Chorotype | Co-occurrence | ✓ | ✓ | $\bar{A}_{gc} \times B_{gc}$ | Size-normalized indicator value |
| Rho | Site | Chorotype | Co-occurrence | ✓ | ✓ | See section 7.1.1 (analogous formula) | Standardized contribution index |
| CoreTerms | Site | Chorotype | Co-occurrence | ✓ | ✓ | $n$, $n_c$, $n_g$, $n_{gc}$ | Raw counts for custom calculations |

## Metrics in bioregionalization/clustering

These metrics summarize how an entity is distributed across *all* clusters,
rather than in relation to each individual cluster.

### When `cluster_on = "site"` (or `"both"`)

| Metric | Entity | Based on | Occ | Ab | Formula | Interpretation |
|:-------|:-------|:---------|:---:|:--:|:--------|:---------------|
| P | Species | Co-occurrence | ✓ | ✓ | $1 - \sum_k \left(\frac{n_{sk}}{n_s}\right)^2$ | Evenness of species across bioregions (0–1) |
| Silhouette | Site | Similarity | — | — | $\frac{a_g - b_g}{\max(a_g, b_g)}$ | Fit to assigned vs. nearest bioregion |

### When `cluster_on = "species"` (or `"both"`)

| Metric | Entity | Based on | Occ | Ab | Formula | Interpretation |
|:-------|:-------|:---------|:---:|:--:|:--------|:---------------|
| P | Site | Co-occurrence | ✓ | ✓ | $1 - \sum_k \left(\frac{n_{gk}}{n_g}\right)^2$ | Evenness of site across chorotypes (0–1) |

# 6. Usage

This section demonstrates how to use `site_species_metrics()` with all
metrics computed for both sites and species. This is only possible with
bipartite network clustering, where both sites and species receive clusters
simultaneously.

For this example, we will use the bipartite network bioregionalization from
section 3, where both sites and species are assigned to the same clusters. We
compute all available metrics for both sites and species.

```{r}
all_metrics <- site_species_metrics(
  bioregionalization = vege_netclubip,
  bioregion_metrics = c("Specificity", "NSpecificity", "Fidelity",
                        "IndVal", "NIndVal", "Rho", "CoreTerms",
                        "Richness", "Rich_Endemics", "Prop_Endemics",
                        "MeanSim", "SdSim"), # You can also simply write "all"
  bioregionalization_metrics = c("P", "Silhouette"),
  data_type = "both",
  cluster_on = "both",
  comat = vegemat,
  similarity = vegesim,
  index = "Simpson",
  verbose = FALSE)
```

Typing the name of the object in the console calls `print()`, which provides a
concise overview of the output, including the settings used, a preview of
available metrics, and instructions for accessing the data.

```{r}
all_metrics
```

You can also run `summary()` on the object to quickly see a statistical
summary for each output table, including the number of rows and summary
statistics for numeric columns.

```{r}
summary(all_metrics)
```

The summary also lists the sites or species with the highest IndVal, which
gives a convenient quick look at the clustering structure.

You can also use `str()` to display the internal structure of the object,
showing the settings and the dimensions and column types of each data frame
component.

```{r}
str(all_metrics)
```

# 7. Metrics per cluster

## 7.1 Species-per-bioregion metrics

These metrics are computed when sites have clusters (i.e.,
`cluster_on = "site"` or `"both"`).

In the following example, we compute all metrics
(`bioregion_metrics = c("Specificity", "NSpecificity", "Fidelity", "IndVal", "NIndVal", "Rho", "CoreTerms")`).
To compute these metrics, we need to provide `comat`.

### 7.1.1 Co-occurrence metrics: occurrence version

The occurrence metrics are computed when `data_type = "occurrence"`. By
default, the function detects the type of data used for the clustering.
However, this parameter can be overridden by users, so that occurrence
metrics can be calculated for an abundance-based clustering, and vice versa.
Users can also specify `data_type = "both"` to obtain both versions of the
co-occurrence metrics.

```{r}
nsb <- site_species_metrics(bioregionalization = vege_nhclu,
                            bioregion_metrics = c("Specificity", "NSpecificity",
                                                  "Fidelity",
                                                  "IndVal", "NIndVal",
                                                  "Rho",
                                                  "CoreTerms"),
                            bioregionalization_metrics = NULL,
                            data_type = "occurrence",
                            cluster_on = "site",
                            comat = vegemat,
                            similarity = NULL,
                            index = NULL, # Name of similarity column
                            verbose = FALSE)
nsb
```

#### Specificity (occurrence)

The specificity $A_{sb}$ of species $s$ for bioregion $b$ [@Caceres2009] is
defined as

$$A_{sb} = \frac{n_{sb}}{n_s}$$

and measures the fraction of occurrences of species $s$ that belong to
bioregion $b$. It therefore reflects the uniqueness of a species to a
particular bioregion.

#### NSpecificity (occurrence)

A normalized version that accounts for the size of each bioregion is also
available, as defined in [@Caceres2009]:

$$\bar{A}_{sb} = \frac{n_{sb}/n_b}{\sum_{k=1}^K n_{sk}/n_k}$$

It corresponds to a normalized specificity value that adjusts for differences
in bioregion size.
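
The effect of the normalization is easiest to see when bioregions differ in
size. The chunk below is a minimal base-R sketch of the two formulas above for
a single species; the core terms (`toy_n_b`, `toy_n_sb`) are hypothetical
values chosen for illustration and are not taken from the vegetation data.

```{r}
# Hypothetical core terms for one species (illustration only)
toy_n_b  <- c(B1 = 2, B2 = 6)   # number of sites in each bioregion
toy_n_sb <- c(B1 = 2, B2 = 3)   # sites of each bioregion where the species occurs
toy_n_s  <- sum(toy_n_sb)       # total number of sites where the species occurs

# Specificity: share of the species' occurrences falling in each bioregion
toy_specificity <- toy_n_sb / toy_n_s

# NSpecificity: same idea, after dividing occurrences by bioregion size
toy_nspecificity <- (toy_n_sb / toy_n_b) / sum(toy_n_sb / toy_n_b)

round(rbind(toy_specificity, toy_nspecificity), 2)
```

In this sketch most occurrences fall in the large bioregion B2, so the raw
specificity is higher for B2, whereas the species occupies every site of the
small bioregion B1, so the normalized specificity is higher for B1.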

#### Fidelity (occurrence)

The fidelity $B_{sb}$ of species $s$ for bioregion $b$ [@Caceres2009] is
defined as

$$B_{sb} = \frac{n_{sb}}{n_b}$$

and measures the fraction of sites in bioregion $b$ where species $s$ is
present. It therefore reflects the frequency of occurrence of a species within
a bioregion.

#### IndVal (occurrence)

The indicator value $IndVal_{sb}$ of species $s$ for bioregion $b$ can be
defined as the product of specificity and fidelity [@Caceres2009]:

$$IndVal_{sb} = A_{sb} \times B_{sb}$$

This index quantifies the strength of association between a species and a
bioregion by combining its specificity (uniqueness to that bioregion) and
fidelity (consistency of occurrence within that bioregion). High IndVal values
identify species that are both frequent in and restricted to a single
bioregion, making them good indicators of that region.

#### NIndVal (occurrence)

A normalized version of the indicator value is also available:

$$\bar{IndVal}_{sb} = \bar{A}_{sb} \times B_{sb}$$

This normalization adjusts for differences in bioregion size, allowing more
comparable indicator values across regions with unequal sampling effort or
extent.

#### Rho (occurrence)

The contribution index $\rho$ can also be calculated following
[@Lenormand2019]:

$$\rho_{sb} = \frac{n_{sb} - n_s\frac{n_b}{n}}{\sqrt{\frac{n_b(n - n_b)}{n - 1} \frac{n_s}{n}\left(1 - \frac{n_s}{n}\right)}}$$

This index measures the deviation between the observed number of occurrences
of species $s$ in bioregion $b$ and the expected value under random
association, providing a standardized measure of contribution to the
bioregional structure.

### 7.1.2 Co-occurrence metrics: abundance version

The abundance version of these metrics can also be computed when
`data_type = "abundance"` (or `data_type = "both"`). In this case the core
terms and associated metrics are:

- $w_{sb}$ is the sum of abundances of species $s$ in sites of bioregion $b$.
- $w_s$ is the total abundance of species $s$.
- $w_b$ is the total abundance of all species present in sites of bioregion
$b$.

```{r}
wsb <- site_species_metrics(bioregionalization = vege_nhclu,
                            bioregion_metrics = c("Specificity", "NSpecificity",
                                                  "Fidelity",
                                                  "IndVal", "NIndVal",
                                                  "Rho",
                                                  "CoreTerms"),
                            bioregionalization_metrics = NULL,
                            data_type = "abundance",
                            cluster_on = "site",
                            comat = vegemat,
                            similarity = NULL,
                            index = NULL, # Name of similarity column
                            verbose = FALSE)
wsb
```

#### Specificity (abundance)

$$A_{sb} = \frac{w_{sb}}{w_s}$$

#### NSpecificity (abundance)

$$\bar{A}_{sb} = \frac{w_{sb}/n_b}{\sum_{k=1}^K w_{sk}/n_k}$$

#### Fidelity (abundance)

$$B_{sb} = \frac{w_{sb}}{w_b}$$

#### IndVal (abundance)

$$IndVal_{sb} = A_{sb} \times \frac{n_{sb}}{n_b}$$

Note that the fidelity based on occurrence is used here [@Caceres2009].

#### NIndVal (abundance)

$$\bar{IndVal}_{sb} = \bar{A}_{sb} \times \frac{n_{sb}}{n_b}$$

Note that the fidelity based on occurrence is used here [@Caceres2009].

#### Rho (abundance)

$$\rho_{sb} = \frac{\mu_{sb} - \mu_s}{\sqrt{\left(\frac{n - n_b}{n-1}\right) \left(\frac{{\sigma_s}^2}{n_b}\right)}}$$

where

- $\mu_{sb} = \frac{w_{sb}}{n_b}$ is the average abundance of species $s$ in
bioregion $b$ (as in **NSpecificity** and **NIndVal**),
- $\mu_s = \frac{w_s}{n}$ is the average abundance of species $s$,
- $\sigma_s$ is the associated standard deviation.
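
To make the abundance core terms concrete, the chunk below is a minimal base-R
sketch that derives $w_{sb}$, $w_s$ and $w_b$ and then applies the abundance
formulas for specificity, fidelity and IndVal given above. The abundance
matrix `toy_abund` is hypothetical: it reuses the presence pattern of the toy
example of section 4 with invented abundance values, and the code is only an
illustration of the formulas, not the package's internal implementation.

```{r}
# Hypothetical abundances (same presence pattern as the toy example, section 4)
toy_abund <- matrix(c(4, 2, 0, 0,
                      6, 1, 1, 0,
                      0, 3, 5, 2,
                      0, 0, 2, 8),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(paste("Site", LETTERS[1:4]),
                                    paste0("sp", 1:4)))
toy_bioregion <- c("B1", "B1", "B2", "B2")   # bioregion of each site

# Core terms (bioregions in rows, species in columns)
toy_w_sb <- rowsum(toy_abund, group = toy_bioregion)            # summed abundances
toy_n_sb2 <- rowsum((toy_abund > 0) * 1, group = toy_bioregion) # occurrence counts
toy_w_s <- colSums(toy_abund)                 # total abundance per species
toy_w_b <- rowSums(toy_w_sb)                  # total abundance per bioregion
toy_n_b2 <- as.vector(table(toy_bioregion))   # number of sites per bioregion

# Abundance-based metrics
toy_specificity_ab <- sweep(toy_w_sb, 2, toy_w_s, "/")   # w_sb / w_s
toy_fidelity_ab <- sweep(toy_w_sb, 1, toy_w_b, "/")      # w_sb / w_b
# IndVal combines abundance specificity with occurrence fidelity n_sb / n_b
toy_indval_ab <- toy_specificity_ab * sweep(toy_n_sb2, 1, toy_n_b2, "/")

round(toy_indval_ab, 3)
```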

## 7.2 Site metrics

For sites, two types of metrics can be computed, depending on whether the
clustering is based on sites or species:

- if the clustering is based on sites (`cluster_on = "site"` or `"both"`),
then richness and similarity-based metrics can be computed
- if the clustering is based on species (`cluster_on = "species"` or
`"both"`), then we can also compute metrics that are typically applied at the
species level, such as specificity, fidelity, IndVal and other similar
metrics. The conceptual interpretation differs in this case.

### 7.2.1 Diversity & endemicity site metrics

When clusters are assigned to sites (bioregions), we can compute basic
diversity metrics:

- Richness = number of species in the site
- Rich_Endemics = number of species in the site that are endemic to a single
region (i.e., occur in only one bioregion)
- Prop_Endemics = proportion of endemic species, i.e. the ratio between
Rich_Endemics and Richness

```{r}
div_metrics <- site_species_metrics(bioregionalization = vege_nhclu,
                                    bioregion_metrics = c("Richness",
                                                          "Rich_Endemics",
                                                          "Prop_Endemics"),
                                    bioregionalization_metrics = NULL,
                                    data_type = "occurrence",
                                    cluster_on = "site",
                                    comat = vegemat,
                                    similarity = vegesim,
                                    index = "Simpson", # Name of similarity column
                                    verbose = FALSE)
div_metrics
```

### 7.2.2 Similarity-based site metrics

To compute similarity-based metrics for sites, we need to provide the site
similarity matrix (`vegesim`). These metrics include the average similarity of
each site to the sites of each bioregion ($MeanSim$) and the associated
standard deviation ($SdSim$). When computing the average similarity, the focal
site itself is not included in the calculation for its own bioregion.

```{r}
sim_metrics <- site_species_metrics(bioregionalization = vege_nhclu,
                                    bioregion_metrics = c("MeanSim", "SdSim"),
                                    bioregionalization_metrics = NULL,
                                    data_type = "occurrence",
                                    cluster_on = "site",
                                    comat = vegemat,
                                    similarity = vegesim,
                                    index = "Simpson", # Name of similarity column
                                    verbose = FALSE)
sim_metrics
```

#### MeanSim

Let $g$ be a site and $b$ a bioregion with sites $g' \in b$, then:

$$MeanSim_{gb} = \frac{1}{n_b - \delta_{g \in b}} \sum_{g' \in b, g' \neq g} sim_{gg'}$$

where $sim_{gg'}$ is the similarity between sites $g$ and $g'$, $n_b$ is the
number of sites in bioregion $b$, and $\delta_{g \in b}$ is 1 if site $g$
belongs to bioregion $b$ (to exclude itself), 0 otherwise.

#### SdSim

The standard deviation of similarities of site $g$ to bioregion $b$ is:

$$SdSim_{gb} = \sqrt{\frac{1}{n_b - 1 - \delta_{g \in b}} \sum_{g' \in b, g' \neq g} \left( sim_{gg'} - MeanSim_{gb} \right)^2}$$

with the same notation as for $MeanSim$.

### 7.2.3 Chorotype/Cluster-based site metrics

In the following example we compute only metrics for sites, on the basis of
species clusters (`cluster_on = "species"`).

```{r}
gc <- site_species_metrics(bioregionalization = vege_netclubip,
                           bioregion_metrics = c("Specificity", "NSpecificity",
                                                 "Fidelity",
                                                 "IndVal", "NIndVal",
                                                 "Rho",
                                                 "CoreTerms"),
                           bioregionalization_metrics = "P",
                           data_type = "both",
                           cluster_on = "species",
                           comat = vegemat,
                           similarity = NULL,
                           index = NULL,
                           verbose = FALSE)
gc
```

# 8. Metrics over the entire bioregionalization (i.e., over all clusters)

## 8.1 Site metrics

Based on $MeanSim$, it is possible to derive aggregated metrics that assess
how well a site fits within its assigned bioregion relative to others. For
now, only the Silhouette index [@Rousseeuw1987] is proposed.

### Silhouette

The Silhouette index for a site $g$ is defined as:

$$Silhouette_g = \frac{a_g - b_g}{\max(a_g, b_g)}$$

where:

- $a_g$ is the average similarity of site $g$ to all other sites in its own
bioregion,
- $b_g$ is the average similarity of site $g$ to all sites belonging to the
nearest bioregion.

This index reflects how strongly a site is associated with its assigned
bioregion relative to the most similar alternative bioregion. It ranges from
-1, when the site may be misassigned (i.e., more similar to another bioregion
than to its own), to 1, when the site is well matched to its own bioregion,
with values around 0 when the site lies near the boundary between bioregions.

```{r}
sil_metrics <- site_species_metrics(bioregionalization = vege_nhclu,
                                    bioregion_metrics = NULL,
                                    bioregionalization_metrics = "Silhouette",
                                    data_type = "occurrence",
                                    cluster_on = "site",
                                    comat = vegemat,
                                    similarity = vegesim,
                                    index = "Simpson", # Name of similarity column
                                    verbose = FALSE)
sil_metrics
```

### Site participation coefficient

We can compute the participation coefficient $P_g$ of a site $g$ to the
bioregionalization, following the participation coefficient described in
[@Denelle2020], available in both its occurrence and abundance versions.

These metrics measure whether a site has species from a single region or from
multiple regions, which is useful when investigating transition zones
[@Leroy2019]. They range from 0 to 1. Values close to 0 indicate that the site
only has species from a single chorotype (i.e., not a transition zone),
whereas values close to 1 indicate that the site has species evenly
distributed across multiple chorotypes (i.e., likely a transition zone).

#### P (occurrence)

$$ P_g = 1 - \sum_{k=1}^K \left(\frac{n_{gk}}{n_g}\right)^2 $$

where $n_{gk}$ is the number of species from chorotype $k$ present in site
$g$ and $n_g$ is the number of species present in site $g$ (see section 4.2).

```{r}
p_occ_site <- site_species_metrics(bioregionalization = vege_netclubip,
                                   bioregion_metrics = NULL,
                                   bioregionalization_metrics = "P",
                                   data_type = "occurrence",
                                   cluster_on = "species",
                                   comat = vegemat,
                                   similarity = NULL,
                                   index = "Simpson", # Name of similarity column
                                   verbose = FALSE)
p_occ_site
```

#### P (abundance)

$$ P_g = 1 - \sum_{k=1}^K \left(\frac{w_{gk}}{w_g}\right)^2 $$

where, by analogy with the occurrence version, $w_{gk}$ is the summed
abundance in site $g$ of the species belonging to chorotype $k$ and $w_g$ is
the total abundance in site $g$.

```{r}
p_ab_site <- site_species_metrics(bioregionalization = vege_netclubip,
                                  bioregion_metrics = NULL,
                                  bioregionalization_metrics = "P",
                                  data_type = "abundance",
                                  cluster_on = "species",
                                  comat = vegemat,
                                  similarity = NULL,
                                  index = "Simpson", # Name of similarity column
                                  verbose = FALSE)
p_ab_site
```

## 8.2 Species metrics

We can also compute the participation coefficient $P_s$ of a species $s$ to
the bioregionalization.

### P (occurrence)

$$ P_s = 1 - \sum_{k=1}^K \left(\frac{n_{sk}}{n_s}\right)^2 $$

```{r}
p_occ_sp <- site_species_metrics(bioregionalization = vege_netclubip,
                                 bioregion_metrics = NULL,
                                 bioregionalization_metrics = "P",
                                 data_type = "occurrence",
                                 cluster_on = "site",
                                 comat = vegemat,
                                 similarity = NULL,
                                 index = "Simpson", # Name of similarity column
                                 verbose = FALSE)
p_occ_sp
```

### P (abundance)

$$ P_s = 1 - \sum_{k=1}^K \left(\frac{w_{sk}}{w_s}\right)^2 $$

```{r}
p_ab_sp <- site_species_metrics(bioregionalization = vege_netclubip,
                                bioregion_metrics = NULL,
                                bioregionalization_metrics = "P",
                                data_type = "abundance",
                                cluster_on = "site",
                                comat = vegemat,
                                similarity = NULL,
                                index = "Simpson", # Name of similarity column
                                verbose = FALSE)
p_ab_sp
```

These metrics measure how evenly a species is distributed among bioregions.
They range from 0 to 1. Values close to 0 indicate that the species is largely
restricted to a single bioregion, while values close to 1 indicate that the
species is evenly distributed across multiple bioregions.
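
As a quick check of the occurrence formula for $P_s$, the chunk below applies
it in base R to the $n_{sb}$ table of the toy example from section 4.1. The
object names (`toy_nsb_table`, `toy_p`) are introduced here for illustration
only; this is a hand computation of the formula, not a call to the package.

```{r}
# n_sb table of the toy example (bioregions in rows, species in columns)
toy_nsb_table <- matrix(c(2, 2, 1, 0,
                          0, 1, 2, 2),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(c("B1", "B2"), paste0("sp", 1:4)))

# n_s: total number of sites where each species occurs
toy_ns <- colSums(toy_nsb_table)

# P_s = 1 - sum over bioregions of (n_sk / n_s)^2
toy_p <- 1 - colSums(sweep(toy_nsb_table, 2, toy_ns, "/")^2)
round(toy_p, 2)
```

Species restricted to one bioregion (sp1 and sp4) get $P_s = 0$, while sp2 and
sp3, which occur in both bioregions, get $4/9 \approx 0.44$. For a fixed
number of bioregions $K$, the largest value the formula can reach is
$1 - 1/K$ (0.5 with the two bioregions of this example), obtained when
occurrences are split equally among all bioregions.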

Both the occurrence and abundance versions can be computed in a single call:

```{r}
ps <- site_species_metrics(bioregionalization = vege_nhclu,
                           bioregion_metrics = NULL,
                           bioregionalization_metrics = "P",
                           data_type = "both",
                           cluster_on = "site",
                           comat = vegemat,
                           similarity = NULL,
                           index = NULL,
                           verbose = FALSE)
ps
```

# 9. Bioregion metrics & spatial coherence

For each bioregion, we can calculate the number of sites it contains and the
number of species present in those sites. The number and proportion of endemic
species are also computed. Endemic species are defined as those occurring only
in sites assigned to a particular bioregion (i.e., species that occur in only
one bioregion).

```{r}
bioregion_summary <- bioregion_metrics(bioregionalization = vege_nhclu,
                                       comat = vegemat)
bioregion_summary
```

We use the metric of spatial coherence as in [@Divisek2016], except that we
replace the number of pixels per bioregion with the area of each coherent
part.

The spatial coherence is expressed as a percentage and is defined as:

$$SC_j = 100 \times \frac{LargestPatch_j}{Area_j}$$

where $j$ is a bioregion, $LargestPatch_j$ is the area of its largest
coherent part, and $Area_j$ is its total area.

Here is an example with the vegetation dataset.

```{r}
# Spatial coherence
vegedissim <- dissimilarity(vegemat)
hclu <- nhclu_kmeans(dissimilarity = vegedissim, n_clust = 4)
vegemap <- map_bioregions(hclu, vegesf, write_clusters = TRUE, plot = FALSE)

bioregion_metrics(bioregionalization = hclu, comat = vegemat, map = vegemap,
                  col_bioregion = 2)
```

Bioregion 4 consists almost entirely of one contiguous block, which is why
its spatial coherence is very close to 100 %.

```{r}
ggplot(vegemap) +
  geom_sf(aes(fill = as.factor(K_4))) +
  scale_fill_viridis_d("Bioregion") +
  theme_bw() +
  theme(legend.position = "bottom")
```

# 10. References