---
title: "5.2 Summary metrics"
author: "Pierre Denelle, Boris Leroy and Maxime Lenormand"
date: "`r Sys.Date()`"
output: 
  html_vignette:
    number_sections: true
bibliography: '`r system.file("REFERENCES.bib", package="bioregion")`' 
csl: style_citation.csl    
vignette: >
  %\VignetteIndexEntry{5.2 Summary metrics}
  \usepackage[utf8]{inputenc}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
 chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE,
                      fig.width = 6, fig.height = 6)
# Packages --------------------------------------------------------------------
suppressPackageStartupMessages({
  suppressWarnings({
    library("bioregion")
    library("dplyr")
    library("ggplot2")
    library("sf")
  })
})

options(tinytex.verbose = TRUE)
```

In this vignette, we describe two functions to compute summary metrics:

- metrics calculated for each species and/or site: `site_species_metrics()`
- metrics calculated for each bioregion: `bioregion_metrics()`


# 1. Terminology clarification

The `bioregion` package focuses on bioregionalization, i.e. the clustering of 
geographical areas on the basis of species data.
However, there are several cases where species can also become part of the clustering 
(for example, in bipartite network clustering), which poses terminology issues. 

To be conceptually accurate, we have chosen to name species clusters as 'chorotypes':

- **Bioregion**: A group of sites with similar species composition, identified through 
  clustering analysis. Bioregions are geographic units.

- **Chorotype**: A group of species with similar distributions within the study area. 
  Chorotypes are biological units. This generally corresponds to the concept of "regional 
  chorotype" sensu [@BaroniUrbani1978], 
  as clarified by [@Fattorini2015]. Note that when clustering on worldwide ranges, the
  concept becomes "global chorotypes" (see [@Fattorini2015] for further details).

## Possible cases of chorotypes

| Clustering scenario | Site clusters | Species clusters | Conceptual basis |
|:--------------------|:-------------:|:----------------:|:-----------------|
| Site-only clustering | Bioregions | — | Sites grouped by compositional similarity |
| Bipartite network clustering | Bioregions | Chorotypes (same cluster IDs) | Sites and species grouped by shared network structure |
| Species-only clustering | — | Chorotypes | Species grouped by distributional similarity |
| Post-hoc species assignment | Bioregions | Chorotypes (derived) | Species assigned to bioregions based on specificity/IndVal |


### Bipartite network clustering

In bipartite network clustering,
both sites and species are assigned to the **same clusters** (network modules). 
A species assigned to cluster 1 belongs to the same bioregion as sites assigned 
to cluster 1. We use the term **chorotype** to refer to the set of species 
assigned to a given bioregion, but it is important to understand that:

> **In bipartite clustering, bioregion ID = chorotype ID.** They are two 
> perspectives on the same network partition: bioregion refers to the sites 
> in a cluster, chorotype refers to the species in that same cluster.


### Site-only clustering with post-hoc species assignment
Species can be secondarily assigned to bioregions based on metrics such as 
maximum specificity or IndVal. Here, **chorotype** refers to the group of 
species most strongly associated with a given bioregion. Unlike bipartite 
clustering, this assignment is derived rather than intrinsic to the 
clustering algorithm.

### Species-only clustering
When clustering species directly (e.g., by distributional similarity), the 
resulting groups are true **chorotypes** in the regional sense 
[@Fattorini2015]: species with similar distributions within the study area.


# 2. Example data

We use the vegetation dataset included in the `bioregion` package.

```{r}
data("vegedf")
data("vegemat")

# Calculation of (dis)similarity matrices
vegedissim <- dissimilarity(vegemat, metric = c("Simpson"))
vegesim <- dissimilarity_to_similarity(vegedissim)
```

# 3. Bioregionalization

We use the same three bioregionalization algorithms as in the
[visualization vignette](https://biorgeo.github.io/bioregion/articles/a5_1_visualization.html),
i.e., non-hierarchical, hierarchical, and network bioregionalizations. In 
addition, we include a network bioregionalization algorithm based on a bipartite 
network, which assigns clusters to both sites and species. We 
chose three bioregions for the non-hierarchical and hierarchical 
bioregionalizations.
<br>

```{r}
# Non hierarchical bioregionalization
vege_nhclu <- nhclu_kmeans(vegedissim, 
                           n_clust = 3, 
                           index = "Simpson",
                           seed = 1)
vege_nhclu$cluster_info 

# Hierarchical bioregionalization
set.seed(1)
vege_hclu <- hclu_hierarclust(dissimilarity = vegedissim,
                              index = "Simpson",
                              method = "average", 
                              n_clust = 3,
                              optimal_tree_method = "best",
                              verbose = FALSE)
vege_hclu$cluster_info

# Network bioregionalization
set.seed(1)
vege_netclu <- netclu_walktrap(vegesim,
                               index = "Simpson")
vege_netclu$cluster_info 

# Bipartite network bioregionalization
install_binaries(verbose = FALSE)
vege_netclubip <- netclu_infomap(vegedf,
                                 seed = 1, 
                                 bipartite = TRUE)
vege_netclubip$cluster_info

```

# 4. Metric components

Before diving into specific metrics, we introduce the core terms using a 
simple example. Consider a study area with **4 sites** and **4 species**, 
where sites have been assigned to **2 bioregions**.

## 4.1 Species-derived metrics 

The following diagram shows the site-species matrix where sites are grouped by 
bioregion. Marginal sums give us all the core terms needed to compute metrics:

```
                          Species
                   sp1   sp2   sp3   sp4      n_b 
                 ┌─────┬─────┬─────┬─────┐
          Site A │  1  │  1  │  ·  │  ·  │
     B1   ───────┼─────┼─────┼─────┼─────┤     2
          Site B │  1  │  1  │  1  │  ·  │
 Bioregion ══════╪═════╪═════╪═════╪═════╪══════
          Site C │  ·  │  1  │  1  │  1  │
     B2   ───────┼─────┼─────┼─────┼─────┤     2
          Site D │  ·  │  ·  │  1  │  1  │
                 └─────┴─────┴─────┴─────┘
                  
     n_sb            sp1   sp2   sp3   sp4     n_b
   (per bioregion) ┌─────┬─────┬─────┬─────┐
              B1   │  2  │  2  │  1  │  0  │   2
                   ├─────┼─────┼─────┼─────┤
              B2   │  0  │  1  │  2  │  2  │   2
                   └─────┴─────┴─────┴─────┘
     n_s (total)      2     3     3     2      n = 4
     K_s (# bioreg)   1     2     2     1      K = 2
```

| Term | Meaning | Where to find it |
|:-----|:--------|:-----------------|
| $n$ | Total number of sites | Bottom-right corner (4) |
| $K$ | Total number of bioregions | Bottom-right corner (2) |
| $n_b$ | Sites in bioregion $b$ | Right margin per bioregion row |
| $n_s$ | Sites where species $s$ occurs | Bottom margin per species column |
| $K_s$ | Number of bioregions where species $s$ occurs | Bottom margin, $K_s$ row |
| $n_{sb}$ | Sites in bioregion $b$ with species $s$ | The $n_{sb}$ summary table |

### Examples of calculations 

From the $n_{sb}$ table, all species-per-bioregion metrics follow directly:

**Specificity** (fraction of species' occurrences in a bioregion):
$$A_{sp1,B1} = \frac{n_{sp1,B1}}{n_{sp1}} = \frac{2}{2} = 1.00 \quad \text{(sp1 is exclusive to B1)}$$
$$A_{sp2,B1} = \frac{n_{sp2,B1}}{n_{sp2}} = \frac{2}{3} = 0.67 \quad \text{(sp2 mostly in B1)}$$

**Fidelity** (fraction of bioregion's sites with the species):
$$B_{sp2,B1} = \frac{n_{sp2,B1}}{n_{B1}} = \frac{2}{2} = 1.00 \quad \text{(sp2 in all B1 sites)}$$
$$B_{sp3,B1} = \frac{n_{sp3,B1}}{n_{B1}} = \frac{1}{2} = 0.50 \quad \text{(sp3 in half of B1)}$$

**IndVal** (indicator value = Specificity × Fidelity):
$$IndVal_{sp1,B1} = 1.00 \times 1.00 = 1.00 \quad \text{(perfect indicator of B1)}$$
$$IndVal_{sp2,B1} = 0.67 \times 1.00 = 0.67$$
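
These hand calculations can be reproduced in a few lines of base R. The sketch 
below rebuilds the toy matrix from the diagram and derives the core terms, 
specificity, fidelity, and IndVal from it. The object names (`toy_comat`, 
`toy_bioregion`, etc.) are illustrative only and are not part of the package.

```{r}
# Toy site-species matrix from the diagram (rows = sites, columns = species)
toy_comat <- matrix(c(1, 1, 0, 0,
                      1, 1, 1, 0,
                      0, 1, 1, 1,
                      0, 0, 1, 1),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(c("A", "B", "C", "D"),
                                    c("sp1", "sp2", "sp3", "sp4")))
toy_bioregion <- c(A = "B1", B = "B1", C = "B2", D = "B2")

# Core terms
toy_n_sb <- rowsum(toy_comat, group = toy_bioregion)  # bioregions x species
toy_n_s  <- colSums(toy_comat)                        # sites per species
toy_n_b  <- as.vector(table(toy_bioregion)[rownames(toy_n_sb)])  # sites per bioregion

# Metrics derived from the core terms
toy_specificity <- sweep(toy_n_sb, 2, toy_n_s, "/")   # n_sb / n_s
toy_fidelity    <- sweep(toy_n_sb, 1, toy_n_b, "/")   # n_sb / n_b
toy_indval      <- toy_specificity * toy_fidelity

round(toy_indval, 2)
```

The row `B1` of the resulting table should match the values derived by hand 
above (1.00 for sp1 and 0.67 for sp2).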

## 4.2 Site-derived metrics 

The following diagram shows the same site-species matrix, but now **species are 
grouped by cluster** (chorotype). We compute how many species from each cluster 
occur in each site:

```
                                Chorotypes 
                        ┌─── C1 ───┐ ┌─── C2 ───┐
                          sp1   sp2   sp3   sp4
                        ┌─────┬─────┬─────┬─────┐
                 Site A │  1  │  1  │  ·  │  ·  │  2
                        ├─────┼─────┼─────┼─────┤
   Sites         Site B │  1  │  1  │  1  │  ·  │  3
                        ├─────┼─────┼─────┼─────┤
                 Site C │  ·  │  1  │  1  │  1  │  3
                        ├─────┼─────┼─────┼─────┤
                 Site D │  ·  │  ·  │  1  │  1  │  2
                        └─────┴─────┴─────┴─────┘
                  n_c          2           2         n = 4
                  
       n_gc                C1      C2          n_g
     (per cluster)     ┌───────┬───────┐
                Site A │   2   │   0   │        2
                       ├───────┼───────┤
                Site B │   2   │   1   │        3
                       ├───────┼───────┤
                Site C │   1   │   2   │        3
                       ├───────┼───────┤
                Site D │   0   │   2   │        2
                       └───────┴───────┘
     n_c                   2       2           n = 4
```

| Term | Meaning | Where to find it |
|:-----|:--------|:-----------------|
| $n$ | Total number of species | Bottom-right corner (4) |
| $n_c$ | Species in cluster $c$ | Bottom margin per cluster |
| $n_g$ | Species present in site $g$ | Right margin per site row |
| $n_{gc}$ | Species from cluster $c$ present in site $g$ | The $n_{gc}$ summary table |

**NOTE:** in bipartite clustering, bioregions and chorotypes can be the **exact same clusters.**
Nevertheless, we use different terms here to avoid confusion in the calculation of metrics.

### Examples of calculations 

**Specificity** of Site A for C1 (fraction of site's species belonging to C1):
$$A_{A,C1} = \frac{n_{A,C1}}{n_A} = \frac{2}{2} = 1.00 \quad \text{(Site A has only C1 species)}$$

**Specificity** of Site B for C1:
$$A_{B,C1} = \frac{n_{B,C1}}{n_B} = \frac{2}{3} = 0.67 \quad \text{(Site B mostly has C1 species)}$$

**Fidelity** of Site A for C1 (fraction of C1 species present in Site A):
$$B_{A,C1} = \frac{n_{A,C1}}{n_{C1}} = \frac{2}{2} = 1.00 \quad \text{(Site A has all C1 species)}$$

**Fidelity** of Site C for C1:
$$B_{C,C1} = \frac{n_{C,C1}}{n_{C1}} = \frac{1}{2} = 0.50 \quad \text{(Site C has half of C1 species)}$$
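
The same toy example can be checked in base R for the site-derived metrics. 
This is a minimal sketch in which the object names (`toy_comat`, 
`toy_chorotype`, etc.) are illustrative and not part of the package.

```{r}
# Toy site-species matrix (rows = sites, columns = species)
toy_comat <- matrix(c(1, 1, 0, 0,
                      1, 1, 1, 0,
                      0, 1, 1, 1,
                      0, 0, 1, 1),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(c("A", "B", "C", "D"),
                                    c("sp1", "sp2", "sp3", "sp4")))
toy_chorotype <- c(sp1 = "C1", sp2 = "C1", sp3 = "C2", sp4 = "C2")

# Core terms
toy_n_gc <- t(rowsum(t(toy_comat), group = toy_chorotype))  # sites x chorotypes
toy_n_g  <- rowSums(toy_comat)                              # species per site
toy_n_c  <- as.vector(table(toy_chorotype)[colnames(toy_n_gc)])  # species per chorotype

# Site-level specificity and fidelity
toy_site_specificity <- toy_n_gc / toy_n_g                  # n_gc / n_g (row-wise)
toy_site_fidelity    <- sweep(toy_n_gc, 2, toy_n_c, "/")    # n_gc / n_c

round(toy_site_specificity, 2)
round(toy_site_fidelity, 2)
```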

# 5. List of site/species metrics included in the package

## Metrics per cluster 

### When clusters are assigned to sites (`cluster_on = "site"` or `cluster_on = "both"`) 

| Metric | Entity | Cluster type | Based on | Occ | Ab | Formula (occurrence) | Interpretation |
|:-------|:-------|:------------------|:---------|:---:|:--:|:---------------------|:---------------|
| Specificity | Species | Bioregion | Co-occurrence | ✓ | ✓ | $A_{sb} = \frac{n_{sb}}{n_s}$ | Fraction of species' occurrences in bioregion |
| NSpecificity | Species | Bioregion | Co-occurrence | ✓ | ✓ | $\bar{A}_{sb} = \frac{n_{sb}/n_b}{\sum_k n_{sk}/n_k}$ | Size-normalized specificity |
| Fidelity | Species | Bioregion | Co-occurrence | ✓ | ✓ | $B_{sb} = \frac{n_{sb}}{n_b}$ | Fraction of bioregion's sites with species |
| IndVal | Species | Bioregion | Co-occurrence | ✓ | ✓ | $A_{sb} \times B_{sb}$ | Indicator value (specificity × fidelity) |
| NIndVal | Species | Bioregion | Co-occurrence | ✓ | ✓ | $\bar{A}_{sb} \times B_{sb}$ | Size-normalized indicator value |
| Rho | Species | Bioregion | Co-occurrence | ✓ | ✓ | See section 7.1.1 | Standardized contribution index |
| CoreTerms | Species | Bioregion | Co-occurrence | ✓ | ✓ | $n$, $n_b$, $n_s$, $n_{sb}$ | Raw counts for custom calculations |
| | | | | | | | |
| Richness| Site | — | Co-occurrence | ✓ | — | $S_g = n_g$ | Number of species |
| Rich_Endemics | Site | Bioregion | Co-occurrence | ✓ | — | $E_g = \sum_{s \in g} \mathbf{1}_{K_s = 1}$ | Number of endemic species in the site (i.e., species occurring in only one bioregion, $K_s = 1$) |
| Prop_Endemics | Site | Bioregion | Co-occurrence | ✓ | — | $\bar{PctEnd}_{g} = \frac{E_g}{S_g}$ | Proportion of endemic species in the site |
| | | | | | | | |
| MeanSim | Site | Bioregion | Similarity | — | — | $\frac{1}{n_b - \delta} \sum_{g' \neq g} sim_{gg'}$ | Mean similarity to bioregion |
| SdSim | Site | Bioregion | Similarity | — | — | See section 7.2.2 | SD of similarity to bioregion |

### When clusters are assigned to species (`cluster_on = "species"` or `cluster_on = "both"`) 

| Metric | Entity | Cluster type | Based on | Occ | Ab | Formula (occurrence) | Interpretation |
|:-------|:-------|:------------------|:---------|:---:|:--:|:---------------------|:---------------|
| Specificity | Site | Chorotype | Co-occurrence | ✓ | ✓ | $A_{gc} = \frac{n_{gc}}{n_g}$ | Fraction of site's species in cluster |
| NSpecificity | Site | Chorotype | Co-occurrence | ✓ | ✓ | $\bar{A}_{gc} = \frac{n_{gc}/n_c}{\sum_k n_{gk}/n_k}$ | Size-normalized specificity |
| Fidelity | Site | Chorotype | Co-occurrence | ✓ | ✓ | $B_{gc} = \frac{n_{gc}}{n_c}$ | Fraction of cluster's species in site |
| IndVal | Site | Chorotype | Co-occurrence | ✓ | ✓ | $A_{gc} \times B_{gc}$ | Indicator value (specificity × fidelity) |
| NIndVal | Site | Chorotype | Co-occurrence | ✓ | ✓ | $\bar{A}_{gc} \times B_{gc}$ | Size-normalized indicator value |
| Rho | Site | Chorotype | Co-occurrence | ✓ | ✓ | See section 7.2.3 | Standardized contribution index |
| CoreTerms | Site | Chorotype | Co-occurrence | ✓ | ✓ | $n$, $n_c$, $n_g$, $n_{gc}$ | Raw counts for custom calculations |


## Metrics in bioregionalization/clustering

These metrics summarize how an entity is distributed across *all* clusters, 
rather than in relation to each individual cluster.

### When `cluster_on = "site"` (or `"both"`)

| Metric | Entity | Based on | Occ | Ab | Formula | Interpretation |
|:-------|:-------|:---------|:---:|:--:|:--------|:---------------|
| P | Species | Co-occurrence | ✓ | ✓ | $1 - \sum_k \left(\frac{n_{sk}}{n_s}\right)^2$ | Evenness of species across bioregions (0–1) |
| Silhouette | Site | Similarity | — | — | $\frac{a_g - b_g}{\max(a_g, b_g)}$ | Fit to assigned vs. nearest bioregion |

### When `cluster_on = "species"` (or `"both"`)

| Metric | Entity | Based on | Occ | Ab | Formula | Interpretation |
|:-------|:-------|:---------|:---:|:--:|:--------|:---------------|
| P | Site | Co-occurrence | ✓ | ✓ | $1 - \sum_k \left(\frac{n_{gk}}{n_g}\right)^2$ | Evenness of site across chorotypes (0–1) |


# 6. Usage

This section demonstrates how to use `site_species_metrics()` with all
metrics computed for both sites and species. This is only possible in a bipartite 
network clustering, where both sites and species receive clusters simultaneously.

For this example, we will use the bipartite network bioregionalization from 
section 3, where both sites and species are assigned to the same clusters. We 
compute all available metrics for both sites and species.

```{r}
all_metrics <- site_species_metrics(
  bioregionalization = vege_netclubip,
  bioregion_metrics = c("Specificity", "NSpecificity", "Fidelity", 
                        "IndVal", "NIndVal", "Rho", "CoreTerms",
                        "Richness", "Rich_Endemics", "Prop_Endemics",
                        "MeanSim", "SdSim"), # You can also simply write "all"
  bioregionalization_metrics = c("P", "Silhouette"),
  data_type = "both",
  cluster_on = "both",
  comat = vegemat,
  similarity = vegesim,
  index = "Simpson",
  verbose = FALSE)
```


Typing the name of the object in the console calls `print()`, which 
provides a concise overview of the output, including the 
settings used, a preview of available metrics, and instructions for accessing 
the data.

```{r}
all_metrics
```

You can also run `summary()` on the object to quickly see a 
statistical summary for each output table,
including the number of rows and summary statistics for numeric columns.

```{r}
summary(all_metrics)
```

We can see it also displays the top sites or species for IndVal for a convenient
quick look at our clustering structure.


You can also use `str()` to display the internal structure of the object, 
showing 
the settings and the dimensions and column types of each data frame component.

```{r}
str(all_metrics)
```


# 7. Metrics per cluster

## 7.1 Species-per-bioregion metrics

These metrics are computed when sites have clusters (i.e., `cluster_on = "site"`
or `"both"`). In the following example, we compute all metrics 
(`bioregion_metrics = c("Specificity", "NSpecificity", "Fidelity", "IndVal", "NIndVal",
"Rho", "CoreTerms")`). To compute these metrics, we need to provide `comat`.

### 7.1.1 Co-occurrence metrics: occurrence version 

The occurrence metrics are computed when `data_type = "occurrence"`. By default,
the function will detect the type of data used for the clustering. However, this 
parameter can be overridden by users, such that occurrence metrics can be calculated
for abundance clustering, and vice-versa. Users can also specify `data_type = "both"`
if they want to obtain both versions of co-occurrence metrics.

```{r}
nsb <- site_species_metrics(bioregionalization = vege_nhclu,
                            bioregion_metrics = c("Specificity", "NSpecificity",
                                                  "Fidelity", "IndVal", "NIndVal",
                                                  "Rho", 
                                                  "CoreTerms"),
                            bioregionalization_metrics = NULL,
                            data_type = "occurrence",
                            cluster_on = "site",
                            comat = vegemat,
                            similarity = NULL,
                            index = NULL, # Name of similarity column
                            verbose = FALSE)

nsb

```

#### Specificity (occurrence)

The specificity $A_{sb}$ of species $s$ for bioregion $b$ [@Caceres2009] is 
defined as

$$A_{sb} = \frac{n_{sb}}{n_s}$$

and measures the fraction of occurrences of species $s$ that belong to 
bioregion $b$. It therefore reflects the uniqueness of a species to a particular 
bioregion.

#### NSpecificity (occurrence)

A normalized version that accounts for the size of each bioregion is also 
available, as defined in [@Caceres2009]:

$$\bar{A}_{sb} = \frac{n_{sb}/n_b}{\sum_{k=1}^K n_{sk}/n_k}$$

It corresponds to a normalized specificity value that adjusts for differences 
in bioregion size.

#### Fidelity (occurrence)

The fidelity $B_{sb}$ of species $s$ for bioregion $b$ [@Caceres2009] is 
defined as

$$B_{sb} = \frac{n_{sb}}{n_b}$$

and measures the fraction of sites in bioregion $b$ where species $s$ is 
present. It therefore reflects the frequency of occurrence of a species 
within a bioregion.

#### IndVal (occurrence)

The indicator value $IndVal_{sb}$ of species $s$ for bioregion $b$ can be 
defined as the product of specificity and fidelity [@Caceres2009]:

$$IndVal_{sb} = A_{sb} \times B_{sb}$$

This index quantifies the strength of association between a species and a 
bioregion by combining its specificity (uniqueness to that bioregion) and 
fidelity (consistency of occurrence within that bioregion). High IndVal 
values identify species that are both frequent and restricted to a single 
bioregion, making them good indicators of that region.

#### NIndVal (occurrence)

A normalized version of the indicator value is also available:

$$\bar{IndVal}_{sb} = \bar{A}_{sb} \times B_{sb}$$

This normalization adjusts for differences in bioregion size, allowing more 
comparable indicator values across regions with unequal sampling effort or 
extent.
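
To see what the size normalization changes, consider a hypothetical species 
observed in the single site of a small bioregion B1 and in two of the four 
sites of a larger bioregion B2. The sketch below applies the formulas above to 
these made-up counts (all object names are illustrative).

```{r}
# Hypothetical core terms for one species and two bioregions of unequal size
toy_n_sb <- c(B1 = 1, B2 = 2)  # sites of each bioregion where the species occurs
toy_n_b  <- c(B1 = 1, B2 = 4)  # total number of sites in each bioregion

toy_specificity  <- toy_n_sb / sum(toy_n_sb)    # n_sb / n_s
toy_fidelity     <- toy_n_sb / toy_n_b          # n_sb / n_b
toy_nspecificity <- toy_fidelity / sum(toy_fidelity)  # (n_sb/n_b) / sum_k (n_sk/n_k)

round(rbind(Specificity  = toy_specificity,
            NSpecificity = toy_nspecificity,
            IndVal       = toy_specificity * toy_fidelity,
            NIndVal      = toy_nspecificity * toy_fidelity), 3)
```

With raw counts, the species appears more specific to B2 simply because B2 has 
more sites, and IndVal is identical for both bioregions. After normalization, 
the species is more strongly associated with B1, where it occupies every site.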

#### Rho (occurrence)

The contribution index $\rho$ can also be calculated following 
[@Lenormand2019]:

$$\rho_{sb} = \frac{n_{sb} - n_s\frac{n_b}{n}}{\sqrt{\frac{n_b(n - n_b)}{n - 1} 
\frac{n_s}{n}(1 - \frac{n_s}{n}) }}$$

This index measures the deviation between the observed number of occurrences of 
species $s$ in bioregion $b$ and the expected value under random association, 
providing a standardized measure of contribution to the bioregional structure.
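
As a check, the formula can be applied by hand to the toy example of section 4 
($n = 4$ sites, 2 bioregions of 2 sites each). This is a sketch of the 
occurrence formula above, not the package's internal code; the object names 
are illustrative.

```{r}
# Core terms of the toy example from section 4
toy_n    <- 4
toy_n_b  <- c(B1 = 2, B2 = 2)
toy_n_s  <- c(sp1 = 2, sp2 = 3, sp3 = 3, sp4 = 2)
toy_n_sb <- rbind(B1 = c(sp1 = 2, sp2 = 2, sp3 = 1, sp4 = 0),
                  B2 = c(sp1 = 0, sp2 = 1, sp3 = 2, sp4 = 2))

# Expected n_sb under random association: n_s * n_b / n
toy_expected <- outer(toy_n_b, toy_n_s) / toy_n

# Denominator of rho: standard deviation under random association
toy_sd <- sqrt(outer(toy_n_b * (toy_n - toy_n_b) / (toy_n - 1),
                     (toy_n_s / toy_n) * (1 - toy_n_s / toy_n)))

toy_rho <- (toy_n_sb - toy_expected) / toy_sd
round(toy_rho, 2)
```

Positive values flag species that occur in a bioregion more often than 
expected (e.g., sp1 in B1), negative values flag species that occur there 
less often than expected (e.g., sp4 in B1).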

### 7.1.2 Co-occurrence metrics: abundance version 

The abundance version of these metrics is computed when 
`data_type = "abundance"` (or `data_type = "both"`). In this case, the core 
terms are:

- $w_{sb}$ is the sum of abundances of species **s** in sites of bioregion **b**. 
- $w_s$ is the total abundance of species **s**.  
- $w_b$ is the total abundance of all species present in sites of bioregion **b**.


```{r}
wsb <- site_species_metrics(bioregionalization = vege_nhclu,
                            bioregion_metrics = c("Specificity", "NSpecificity",
                                                  "Fidelity",
                                                  "IndVal", "NIndVal",
                                                  "Rho",
                                                  "CoreTerms"),
                            bioregionalization_metrics = NULL,
                            data_type = "abundance",
                            cluster_on = "site",
                            comat = vegemat,
                            similarity = NULL,
                            index = NULL, # Name of similarity column
                            verbose = FALSE)

wsb

```

#### Specificity (abundance)

$$A_{sb} = \frac{w_{sb}}{w_s}$$

#### NSpecificity (abundance)

$$\bar{A}_{sb} = \frac{w_{sb}/n_b}{\sum_{k=1}^K w_{sk}/n_k}$$

#### Fidelity (abundance)

$$B_{sb} = \frac{w_{sb}}{w_b}$$

#### IndVal (abundance)

$$IndVal_{sb} = A_{sb} \times \frac{n_{sb}}{n_b}$$
Note that the fidelity based on occurrence is used here [@Caceres2009].

#### NIndVal (abundance)

$$\bar{IndVal}_{sb} = \bar{A}_{sb} \times \frac{n_{sb}}{n_b}$$

Note that the fidelity based on occurrence is used here [@Caceres2009].
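
The abundance formulas can be illustrated on the toy example of section 4 by 
replacing presences with abundances. The abundance values below are made up 
for illustration, and the code is a sketch of the formulas above rather than 
the package's implementation.

```{r}
# Hypothetical abundances with the same presence pattern as the toy example
toy_abund <- matrix(c(10, 2, 0, 0,
                       6, 4, 1, 0,
                       0, 2, 5, 3,
                       0, 0, 7, 9),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(c("A", "B", "C", "D"),
                                    c("sp1", "sp2", "sp3", "sp4")))
toy_bioregion <- c(A = "B1", B = "B1", C = "B2", D = "B2")

# Abundance core terms
toy_w_sb <- rowsum(toy_abund, group = toy_bioregion)  # abundance per bioregion
toy_w_s  <- colSums(toy_abund)                        # total abundance per species
toy_w_b  <- rowSums(toy_w_sb)                         # total abundance per bioregion

# Occurrence core terms (needed for IndVal)
toy_n_sb <- rowsum((toy_abund > 0) * 1, group = toy_bioregion)
toy_n_b  <- as.vector(table(toy_bioregion)[rownames(toy_n_sb)])

toy_specificity_ab <- sweep(toy_w_sb, 2, toy_w_s, "/")  # w_sb / w_s
toy_fidelity_ab    <- toy_w_sb / toy_w_b                # w_sb / w_b
toy_indval_ab      <- toy_specificity_ab * (toy_n_sb / toy_n_b)  # uses occurrence fidelity

round(toy_indval_ab, 2)
```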

#### Rho (abundance)

$$\rho_{sb} = \frac{\mu_{sb} - \mu_s}{\sqrt{\left(\frac{n - n_b}{n-1}\right) \left(\frac{{\sigma_s}^2}{n_b}\right)}}$$
where 

- $\mu_{sb} = \frac{w_{sb}}{n_b}$ the average abundance of species $s$ in 
bioregion $b$ (as in **NSpecificity** and **NIndVal**)
- $\mu_s = \frac{w_s}{n}$ the average abundance of species $s$
- $\sigma_s$ the associated standard deviation.

## 7.2 Site metrics

For sites, two types of metrics can be computed, depending on whether the clustering
is based on sites or species:

- if the clustering is based on sites (`cluster_on = "site"`
(or `"both"`)), then richness and similarity-based metrics can be computed
- if the clustering is based on species (`cluster_on = "species"`
(or `"both"`)), then we can also compute metrics that are typically applied
at the species level, such as specificity, fidelity, IndVal and other similar metrics.
The conceptual interpretation differs in this case.

### 7.2.1 Diversity & endemicity site metrics

When clusters are assigned to sites (bioregions), we can compute basic diversity metrics:

- Richness = number of species in the site
- Rich_Endemics = number of species in the site that are endemic to a single region (i.e., occur in only one bioregion)
- Prop_Endemics = proportion of endemic species, i.e. ratio between Rich_Endemics and Richness 


```{r}
div_metrics <- site_species_metrics(bioregionalization = vege_nhclu,
                            bioregion_metrics = c("Richness", "Rich_Endemics",
                                                  "Prop_Endemics"),
                            bioregionalization_metrics = NULL,
                            data_type = "occurrence",
                            cluster_on = "site",
                            comat = vegemat,
                            similarity = vegesim,
                            index = "Simpson", # Name of similarity column
                            verbose = FALSE)

div_metrics
```
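
For intuition, these three metrics can be derived by hand on the toy example 
of section 4, where sp1 and sp4 each occur in a single bioregion ($K_s = 1$). 
The sketch below uses illustrative object names and only base R.

```{r}
# Toy site-species matrix and bioregion membership from section 4
toy_comat <- matrix(c(1, 1, 0, 0,
                      1, 1, 1, 0,
                      0, 1, 1, 1,
                      0, 0, 1, 1),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(c("A", "B", "C", "D"),
                                    c("sp1", "sp2", "sp3", "sp4")))
toy_bioregion <- c(A = "B1", B = "B1", C = "B2", D = "B2")

# Number of bioregions in which each species occurs
toy_n_sb <- rowsum(toy_comat, group = toy_bioregion)
toy_K_s  <- colSums(toy_n_sb > 0)
toy_endemic <- toy_K_s == 1

toy_richness      <- rowSums(toy_comat)
toy_rich_endemics <- rowSums(toy_comat[, toy_endemic, drop = FALSE])
toy_prop_endemics <- toy_rich_endemics / toy_richness

data.frame(Richness = toy_richness,
           Rich_Endemics = toy_rich_endemics,
           Prop_Endemics = round(toy_prop_endemics, 2))
```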

### 7.2.2 Similarity-based site metrics 

To compute similarity-based metrics for sites, we need to provide the 
site similarity matrix (`vegesim`).

These metrics include the average similarity of each site to the sites of  
each bioregion ($MeanSim$) and the associated standard deviation ($SdSim$).  
When computing the average similarity, the focal site itself is not included  
in the calculation for its own bioregion.

```{r}
sim_metrics <- site_species_metrics(bioregionalization = vege_nhclu,
                            bioregion_metrics = c("MeanSim", "SdSim"),
                            bioregionalization_metrics = NULL,
                            data_type = "occurrence",
                            cluster_on = "site",
                            comat = vegemat,
                            similarity = vegesim,
                            index = "Simpson", # Name of similarity column
                            verbose = FALSE)

sim_metrics
```

#### MeanSim

Let $g$ be a site and $b$ a bioregion with sites $g' \in b$, then:

$$MeanSim_{gb} = \frac{1}{n_b - \delta_{g \in b}} \sum_{g' \in b, g' \neq g} sim_{gg'}$$
where $sim_{gg'}$ is the similarity between sites $g$ and $g'$, $n_b$ is the 
number of sites in bioregion $b$, and $\delta_{g \in b}$ is 1 if site $g$ belongs 
to bioregion $b$ (to exclude itself), 0 otherwise.

#### SdSim

The standard deviation of similarities of site $g$ to bioregion $b$ is:

$$SdSim_{gb} = \sqrt{\frac{1}{n_b - 1 - \delta_{g \in b}} \sum_{g' \in b, g' \neq g} \left( sim_{gg'} - MeanSim_{gb} \right)^2}$$
where $sim_{gg'}$ is the similarity between sites $g$ and $g'$, $n_b$ is the 
number of sites in bioregion $b$, and $\delta_{g \in b}$ is 1 if site $g$ 
belongs to bioregion $b$ (to exclude itself), 0 otherwise.
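
The two formulas can be illustrated with a small hypothetical similarity 
matrix for five sites, three in bioregion B1 and two in B2. The values and 
object names below are made up for illustration. Note that `sd()` divides by 
the number of values minus one, which corresponds to the 
$n_b - 1 - \delta_{g \in b}$ denominator once the focal site is excluded.

```{r}
# Hypothetical symmetric similarity matrix between five sites
toy_sim <- matrix(c(1.0, 0.8, 0.6, 0.2, 0.1,
                    0.8, 1.0, 0.7, 0.3, 0.2,
                    0.6, 0.7, 1.0, 0.4, 0.3,
                    0.2, 0.3, 0.4, 1.0, 0.9,
                    0.1, 0.2, 0.3, 0.9, 1.0),
                  nrow = 5, byrow = TRUE,
                  dimnames = list(LETTERS[1:5], LETTERS[1:5]))
toy_region <- c(A = "B1", B = "B1", C = "B1", D = "B2", E = "B2")

# Mean and SD of the similarities between a site and the sites of a bioregion,
# excluding the focal site when it belongs to that bioregion
toy_mean_sd_sim <- function(site, region) {
  others <- setdiff(names(toy_region)[toy_region == region], site)
  sims <- toy_sim[site, others]
  c(MeanSim = mean(sims), SdSim = sd(sims))
}

toy_mean_sd_sim("A", "B1")  # similarity of site A to its own bioregion
toy_mean_sd_sim("A", "B2")  # similarity of site A to the other bioregion
```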


### 7.2.3 Chorotype/Cluster-based site metrics 

In the following example we compute only metrics for sites, on the basis 
of species clusters (`cluster_on = "species"`).

```{r}
gc <- site_species_metrics(bioregionalization = vege_netclubip,
                            bioregion_metrics = c("Specificity", "NSpecificity",
                                                  "Fidelity",
                                                  "IndVal", "NIndVal",
                                                  "Rho",
                                                  "CoreTerms"),
                            bioregionalization_metrics = "P",
                            data_type = "both",
                            cluster_on = "species",
                            comat = vegemat,
                            similarity = NULL,
                            index = NULL,
                            verbose = FALSE)

gc

```

# 8. Metrics over the entire bioregionalization (i.e., over all clusters)

## 8.1 Site metrics 

Based on $MeanSim$, it is possible to derive aggregated metrics that assess  
how well a site fits within its assigned bioregion relative to others.  

For now, only the Silhouette index [@Rousseeuw1987] is available.

### Silhouette

The Silhouette index for a site $g$ is defined as:

$$Silhouette_g = \frac{a_g - b_g}{\max(a_g, b_g)}$$

where:

- $a_g$ is the average similarity of site $g$ to all other sites in its own bioregion,  
- $b_g$ is the average similarity of site $g$ to all sites belonging to the nearest bioregion.

This index reflects how strongly a site is associated with its assigned bioregion 
relative to the most similar alternative bioregion, ranging from -1, when the site 
may be misassigned (i.e., more similar to another bioregion than its own), to 1, 
when the site is well matched to its own bioregion, and around 0 when the site 
lies near the boundary between bioregions.

```{r}
sil_metrics <- site_species_metrics(bioregionalization = vege_nhclu,
                            bioregion_metrics = NULL,
                            bioregionalization_metrics = "Silhouette",
                            data_type = "occurrence",
                            cluster_on = "site",
                            comat = vegemat,
                            similarity = vegesim,
                            index = "Simpson", # Name of similarity column
                            verbose = FALSE)

sil_metrics
```
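
The Silhouette formula can also be traced by hand on a small hypothetical 
similarity matrix (five sites, two bioregions). In this sketch, the "nearest" 
bioregion is assumed to be the other bioregion with the highest mean 
similarity to the focal site; values and object names are illustrative.

```{r}
# Hypothetical symmetric similarity matrix between five sites
toy_sim <- matrix(c(1.0, 0.8, 0.6, 0.2, 0.1,
                    0.8, 1.0, 0.7, 0.3, 0.2,
                    0.6, 0.7, 1.0, 0.4, 0.3,
                    0.2, 0.3, 0.4, 1.0, 0.9,
                    0.1, 0.2, 0.3, 0.9, 1.0),
                  nrow = 5, byrow = TRUE,
                  dimnames = list(LETTERS[1:5], LETTERS[1:5]))
toy_region <- c(A = "B1", B = "B1", C = "B1", D = "B2", E = "B2")

toy_silhouette <- function(site) {
  own <- toy_region[[site]]
  # Mean similarity of the site to each bioregion, excluding the site itself
  mean_sim <- sapply(unique(toy_region), function(r) {
    others <- setdiff(names(toy_region)[toy_region == r], site)
    mean(toy_sim[site, others])
  })
  a <- mean_sim[[own]]                         # own bioregion
  b <- max(mean_sim[names(mean_sim) != own])   # most similar other bioregion
  (a - b) / max(a, b)
}

round(sapply(names(toy_region), toy_silhouette), 2)
```

All five sites obtain clearly positive values here, since each site is much 
more similar to the sites of its own bioregion than to those of the other one.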

### Site participation coefficient


We can compute the participation coefficient $P_g$ of a site $g$ to the 
bioregionalization, following the participation coefficient described in 
[@Denelle2020]. It is available in both its occurrence and abundance versions.

These metrics measure whether a site has species from a single chorotype or from 
multiple chorotypes, which is useful when investigating transition zones [@Leroy2019]. 
They range from 0 to 1. Values close to 0 indicate that the site only has 
species from a single chorotype (i.e., not a transition zone), whereas values 
close to 1 indicate that the site has species evenly distributed across multiple
chorotypes (i.e., likely a transition zone).

### P (occurrence)

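$$
P_g =  1 - \sum_{k=1}^K \left(\frac{n_{gk}}{n_g}\right)^2
$$

where $n_{gk}$ is the number of species of chorotype $k$ present in site $g$ 
and $n_g$ is the number of species present in site $g$.
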
```{r}
p_occ_site <- site_species_metrics(bioregionalization = vege_netclubip,
                            bioregion_metrics = NULL,
                            bioregionalization_metrics = "P",
                            data_type = "occurrence",
                            cluster_on = "species",
                            comat = vegemat,
                            similarity = NULL,
                            index = "Simpson", # Name of similarity column
                            verbose = FALSE)

p_occ_site
```

### P (abundance)

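$$
P_g =  1 - \sum_{k=1}^K \left(\frac{w_{gk}}{w_g}\right)^2
$$

where, by analogy with the occurrence version, $w_{gk}$ is the summed abundance 
in site $g$ of the species belonging to chorotype $k$ and $w_g$ is the total 
abundance in site $g$.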

```{r}
p_ab_site <- site_species_metrics(bioregionalization = vege_netclubip,
                            bioregion_metrics = NULL,
                            bioregionalization_metrics = "P",
                            data_type = "abundance",
                            cluster_on = "species",
                            comat = vegemat,
                            similarity = NULL,
                            index = "Simpson", # Name of similarity column
                            verbose = FALSE)

p_ab_site
```
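
The occurrence version of the site participation coefficient can be checked by 
hand on the toy example of section 4.2, using the notation of the summary 
table in section 5 ($n_{gk}$ and $n_g$). The sketch below uses illustrative 
object names and is not the package's internal code.

```{r}
# Toy site-species matrix and chorotype membership from section 4.2
toy_comat <- matrix(c(1, 1, 0, 0,
                      1, 1, 1, 0,
                      0, 1, 1, 1,
                      0, 0, 1, 1),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(c("A", "B", "C", "D"),
                                    c("sp1", "sp2", "sp3", "sp4")))
toy_chorotype <- c(sp1 = "C1", sp2 = "C1", sp3 = "C2", sp4 = "C2")

toy_n_gc <- t(rowsum(t(toy_comat), group = toy_chorotype))  # sites x chorotypes
toy_n_g  <- rowSums(toy_comat)                              # species per site

# Participation coefficient: 1 - sum over chorotypes of (n_gk / n_g)^2
toy_P_site <- 1 - rowSums((toy_n_gc / toy_n_g)^2)
round(toy_P_site, 2)
```

Sites A and D only contain species from one chorotype and get a value of 0, 
whereas sites B and C mix species from both chorotypes and get higher values.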


## 8.2 Species metrics

The participation coefficient can also be computed for species: $P_s$ measures 
the participation of a species $s$ in the bioregionalization, based on the 
bioregions (site clusters) in which it occurs.

### P (occurrence)

$$
P_s =  1 - \sum_{k=1}^K \left(\frac{n_{sk}}{n_s}\right)^2
$$

```{r}
p_occ_sp <- site_species_metrics(bioregionalization = vege_netclubip,
                            bioregion_metrics = NULL,
                            bioregionalization_metrics = "P",
                            data_type = "occurrence",
                            cluster_on = "site",
                            comat = vegemat,
                            similarity = NULL,
                            index = "Simpson", # Name of similarity column
                            verbose = FALSE)

p_occ_sp
```

### P (abundance)

$$
P_s =  1 - \sum_{k=1}^K \left(\frac{w_{sk}}{w_s}\right)^2
$$

```{r}
p_ab_sp <- site_species_metrics(bioregionalization = vege_netclubip,
                            bioregion_metrics = NULL,
                            bioregionalization_metrics = "P",
                            data_type = "abundance",
                            cluster_on = "site",
                            comat = vegemat,
                            similarity = NULL,
                            index = "Simpson", # Name of similarity column
                            verbose = FALSE)

p_ab_sp
```

These metrics measure how evenly a species is distributed among bioregions. 
They range from 0 to 1. Values close to 0 indicate that the species is 
largely restricted to a single bioregion, while values close to 1 indicate that 
the species is evenly distributed across multiple bioregions.
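
As a quick check of the occurrence formula, the participation coefficient of 
the four species of the toy example in section 4.1 can be computed from the 
$n_{sb}$ table. This is a sketch with illustrative object names.

```{r}
# n_sb table and n_s margin of the toy example (section 4.1)
toy_n_sb <- rbind(B1 = c(sp1 = 2, sp2 = 2, sp3 = 1, sp4 = 0),
                  B2 = c(sp1 = 0, sp2 = 1, sp3 = 2, sp4 = 2))
toy_n_s <- colSums(toy_n_sb)

# Participation coefficient: 1 - sum over bioregions of (n_sk / n_s)^2
toy_P_species <- 1 - colSums(sweep(toy_n_sb, 2, toy_n_s, "/")^2)
round(toy_P_species, 2)
```

Species sp1 and sp4 are restricted to one bioregion and get a value of 0, 
while sp2 and sp3 occur in both bioregions and get a value of about 0.44.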

Calculations on both occurrence & abundance at the same time:

```{r}
ps <- site_species_metrics(bioregionalization = vege_nhclu,
                           bioregion_metrics = NULL,
                            bioregionalization_metrics = "P",
                            data_type = "both",
                            cluster_on = "site",
                            comat = vegemat,
                            similarity = NULL,
                            index = NULL,
                            verbose = FALSE)

ps

```

# 9. Bioregion metrics & spatial coherence

For each bioregion, 
we can calculate the number of sites it contains and the number 
of species present in those sites. The number and proportion of endemic species 
are also computed. Endemic species are defined as those occurring only in sites 
assigned to a particular bioregion (i.e., species that occur in only one bioregion). 

```{r}
bioregion_summary <- bioregion_metrics(bioregionalization = vege_nhclu,
                                       comat = vegemat)
bioregion_summary
```

We use the metric of spatial coherence as in [@Divisek2016], except that we
replace the number of pixels per bioregion with the area of each coherent part.

The spatial coherence is expressed as a percentage, and has the following
formula:

$$SC_j = 100 \times \frac{LargestPatch_j}{Area_j}$$

where $j$ is a bioregion.
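
To make the formula concrete, the sketch below applies it to made-up patch 
areas, assuming that $LargestPatch_j$ is the area of the largest contiguous 
part of bioregion $j$ and $Area_j$ is the total area of that bioregion. The 
`toy_patches` table is hypothetical; `bioregion_metrics()` derives the patches 
from the spatial object supplied through `map`.

```{r}
# Hypothetical contiguous patches, one row per patch
toy_patches <- data.frame(bioregion = c("B1", "B1", "B1", "B2", "B2"),
                          area      = c(80, 15, 5, 50, 50))

toy_largest <- tapply(toy_patches$area, toy_patches$bioregion, max)
toy_total   <- tapply(toy_patches$area, toy_patches$bioregion, sum)

# Spatial coherence in percent
toy_SC <- 100 * toy_largest / toy_total
toy_SC
```

Bioregion B1 is dominated by one large patch (80 %), whereas B2 is split into 
two patches of equal size (50 %).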

Here is an example with the vegetation dataset.

```{r}
# Spatial coherence
vegedissim <- dissimilarity(vegemat)
hclu <- nhclu_kmeans(dissimilarity = vegedissim, n_clust = 4)
vegemap <- map_bioregions(hclu, vegesf, write_clusters = TRUE, plot = FALSE)

bioregion_metrics(bioregionalization = hclu, comat = vegemat, map = vegemap,
col_bioregion = 2) 
```

Bioregion 4 consists almost entirely of one contiguous block, which is why 
its spatial coherence is very close to 100 %.

```{r}
ggplot(vegemap) +
  geom_sf(aes(fill = as.factor(K_4))) +
  scale_fill_viridis_d("Bioregion") +
  theme_bw() +
  theme(legend.position = "bottom")
```


# 10. References