---
title: "BatChef package introduction"
author: 
- name: Elena Zuin
  affiliation: Department of Biology, University of Padova, Italy
- name: Chiara Romualdi
  affiliation: Department of Biology, University of Padova, Italy
- name: Davide Risso
  affiliation: Department of Statistical Sciences, University of Padova, Italy
- name: Gabriele Sales
  affiliation: Department of Biology, University of Padova, Italy
output:
  BiocStyle::html_document:
      toc: true
      toc_float:
          collapsed: true
package: BatChef
vignette: >
  %\VignetteIndexEntry{BatChef vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r intro, echo=FALSE, results="hide", message=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Introduction

Aggregating single-cell RNA sequencing (scRNA-seq) datasets from multiple sources introduces technical variability, known as batch effects, arising from differences in operator handling, reagent lots, sequencing platforms, and experimental timing. Batch effects can distort downstream analyses and obscure the true biological variation, compromising the interpretability of the results.

Numerous batch correction methods have been proposed in the literature, based on different mathematical approaches. However, their performance is highly dependent on the intrinsic characteristics of the data.

`BatChef` is an R package that:

-   implements a variety of correction methods;

-   provides quantitative metrics to evaluate the performance of the correction methods, including the Wasserstein distance, Local Inverse Simpson's Index (LISI), Average Silhouette Width (ASW), and Adjusted Rand Index (ARI);

-   can be used as a guideline to identify the most appropriate batch effect correction method based on data-specific characteristics.

# Installation

```{r installation,eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("BatChef")
```

# Setup

To demonstrate the package, we simulate a dataset with 2,148 genes, 3,461 cells, 2 batches and 3 cell types using the `simulate_data()` function. This function uses the same simulation parameters as the Splatter package. It also normalizes the data, identifies highly variable genes and performs principal component analysis using the `scrapper` package.

The `BatChef` package accepts three input formats: `SingleCellExperiment`, `Seurat` and `AnnData`. Each of them can be simulated by setting the `output_format` parameter of the `simulate_data()` function.

```{r sce,  include=TRUE, message=FALSE, warning=FALSE}
library(BatChef)
sce <- simulate_data(
  n_genes = 2148, batch_cells = c(1210, 2251), compute_hvgs = TRUE,
  batch_fac_loc = c(0.1, 0.1), batch_fac_scale = c(0.1, 0.1),
  mean_shape = 0.4, lib_loc = 11.5,
  group_prob = c(0.17, 0.46, 0.37),
  compute_pca = TRUE, pca_ncomp = 10, output_format = "SingleCellExperiment"
)
sce
```

```{r so, eval=FALSE}
so <- simulate_data(
  n_genes = 2148, batch_cells = c(1210, 2251), compute_hvgs = TRUE,
  batch_fac_loc = c(0.1, 0.1), batch_fac_scale = c(0.1, 0.1),
  mean_shape = 0.4, lib_loc = 11.5,
  group_prob = c(0.17, 0.46, 0.37),
  compute_pca = TRUE, pca_ncomp = 10, output_format = "Seurat"
)

adata <- simulate_data(
  n_genes = 2148, batch_cells = c(1210, 2251), compute_hvgs = TRUE,
  batch_fac_loc = c(0.1, 0.1), batch_fac_scale = c(0.1, 0.1),
  mean_shape = 0.4, lib_loc = 11.5,
  group_prob = c(0.17, 0.46, 0.37),
  compute_pca = TRUE, pca_ncomp = 10, output_format = "AnnData"
)
```

# Batch correction method prediction

Users can predict the optimal batch correction method for a given dataset with the `suggested_method()` function. This function uses a Support Vector Machine (SVM) to predict the most suitable batch correction method and visualizes the dataset's position within a two-dimensional embedding space. The embedding was constructed by analyzing 130 datasets and their associated characteristics.

```{r pred, warning=FALSE}
pred <- suggested_method(input = sce, batch = "Batch")
```

In this case, the predicted optimal batch correction method for our data is scMerge2 (unsupervised).

# Batch effects correction

The `batchCorrect()` function lets the user specify the desired correction method through its `params` argument, which accepts an object of one of the following parameter classes:

-   `LimmaParams`

-   `CombatParams`

-   `Seuratv3Params`

-   `Seuratv5Params`

-   `FastMNNParams`

-   `HarmonyParams`

-   `ScMerge2Params`

-   `LigerParams`

To illustrate the functionality of the package, we perform batch correction with the scMerge2 method. To improve computational efficiency, we restrict the analysis to the first 1,000 highly variable genes.

```{r scmerge2, message=FALSE, warning=FALSE, results='hide', fig.keep='none'}
# Keep only the genes flagged as highly variable
sce <- sce[SingleCellExperiment::rowData(sce)$hvg, ]

# Correct batch effects with scMerge2
sce <- batchCorrect(input = sce, batch = "Batch", params = ScMerge2Params())

# Recompute the PCA on the corrected "scmerge2" assay
library(scater)
sce <- runPCA(sce,
  subset_row = rownames(sce),
  assay.type = "scmerge2", name = "scmerge2", ncomponents = 10
)
```

The `batchCorrect()` function returns an object of the same class as the input. In this example, the output is therefore a `SingleCellExperiment` object. The corrected gene expression matrix and/or the corrected dimensionality reduction are stored in the returned object.

```{r out}
sce
```

# Performance evaluation

To evaluate batch correction performance, the user can use the `metrics()` function to compute multiple quantitative metrics, including Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), Local Inverse Simpson’s Index (LISI), Average Silhouette Width (ASW), and the Wasserstein distance.
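
To give an intuition for LISI, the sketch below computes the inverse Simpson's index of the batch labels found in a single neighbourhood of cells. The helper `inverse_simpson()` is hypothetical and not part of `BatChef`. It is also a simplification: LISI weights neighbours by distance, whereas here every cell in the neighbourhood counts equally.

```{r lisi-sketch}
# Hypothetical helper (not a BatChef function): inverse Simpson's index of
# the labels observed in one neighbourhood, with all cells weighted equally.
inverse_simpson <- function(labels) {
  p <- table(labels) / length(labels) # label proportions
  1 / sum(p^2)
}

# A neighbourhood containing a single batch gives 1
inverse_simpson(c("Batch1", "Batch1", "Batch1", "Batch1"))

# A neighbourhood with two evenly mixed batches gives 2
inverse_simpson(c("Batch1", "Batch2", "Batch1", "Batch2"))
```

When the labels are batches, values close to the number of batches suggest well-mixed neighbourhoods.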

The Wasserstein distance is computed on resampled data to account for the different numbers of cells in each batch. The parameter `rep` specifies how many times the calculation is repeated (in this case, 1).
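
The idea of resampling before computing the distance can be sketched in base R. The example below is only an illustration under our own assumptions, not the implementation used by `metrics()`: it works on a single coordinate, downsamples both batches to the size of the smaller one, and averages the one-dimensional Wasserstein distance over `n_rep` repetitions. The helper `wasserstein_1d()` and its argument `n_rep` (playing the role of `rep`) are hypothetical names.

```{r wasserstein-sketch}
# Hypothetical helper (not a BatChef function): 1-D Wasserstein distance
# between two samples, downsampled to a common size and averaged over
# n_rep repetitions.
wasserstein_1d <- function(x, y, n_rep = 1) {
  n <- min(length(x), length(y))
  dists <- vapply(seq_len(n_rep), function(i) {
    xs <- sort(x[sample.int(length(x), n)])
    ys <- sort(y[sample.int(length(y), n)])
    # For equal-sized samples, the distance is the mean absolute
    # difference between the sorted values
    mean(abs(xs - ys))
  }, numeric(1))
  mean(dists)
}

set.seed(1)
batch1 <- rnorm(1210, mean = 0)
batch2 <- rnorm(2251, mean = 0.5)

# Batches shifted relative to each other: larger distance
wasserstein_1d(batch1, batch2, n_rep = 5)

# Shift removed: distance close to zero
wasserstein_1d(batch1, batch2 - 0.5, n_rep = 5)
```

Downsampling keeps the larger batch from dominating the comparison, and repeating the calculation reduces the dependence on any single random draw.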

```{r perf, message=FALSE, warning=FALSE}
red <- SingleCellExperiment::reducedDimNames(sce)
metrics <- lapply(red, function(x) {
  metrics(
    input = sce, batch = "Batch",
    group = "Group", reduction = x,
    rep = 1
  )
})
metrics <- do.call(rbind, metrics)
metrics
```

To interpret the results, we focus on the Wasserstein distance and the ARI score, which measure batch effect removal and the preservation of biological effects, respectively.

```{r wassari}
metrics[, c(1:2, 5)]
```

A low Wasserstein distance indicates that the method effectively removes batch effects, while a high Adjusted Rand Index (ARI) indicates that the biological effects are well preserved. Overall, scMerge2 performs well on both metrics.
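
For readers unfamiliar with the ARI, the sketch below computes it from the contingency table of two label vectors, for example known cell types and cluster assignments. The helper `adjusted_rand_index()` is hypothetical and not part of `BatChef`, and how `metrics()` obtains the cluster assignments is not covered here.

```{r ari-sketch}
# Hypothetical helper (not a BatChef function): Adjusted Rand Index between
# two label vectors, computed from their contingency table.
adjusted_rand_index <- function(labels_a, labels_b) {
  tab <- table(labels_a, labels_b)
  sum_cells <- sum(choose(tab, 2))          # pairs grouped together in both
  sum_rows <- sum(choose(rowSums(tab), 2))  # pairs grouped together in a
  sum_cols <- sum(choose(colSums(tab), 2))  # pairs grouped together in b
  expected <- sum_rows * sum_cols / choose(sum(tab), 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

cell_types <- rep(c("Group1", "Group2", "Group3"), times = c(4, 4, 4))

# Clusters that match the cell types exactly (up to relabelling) give 1
adjusted_rand_index(cell_types, rep(c("c", "a", "b"), times = c(4, 4, 4)))

# Clusters unrelated to the cell types give a value near or below 0
adjusted_rand_index(cell_types, rep(c("a", "b", "c"), times = 4))
```

Because the index is corrected for chance, it does not depend on how the clusters are named, only on which cells are grouped together.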

# sessionInfo()

```{r sessioninfo}
sessionInfo()
```