1 Introduction

Aggregating single-cell RNA sequencing (scRNA-seq) datasets from multiple sources introduces technical variability, known as batch effects, arising from differences in operator handling, reagent lots, sequencing platforms, and experimental timing. Batch effects can distort downstream analyses and obscure the true biological variation, compromising the interpretability of the results.

Numerous batch correction methods have been proposed in the literature, based on different mathematical approaches. However, their performance is highly dependent on the intrinsic characteristics of the data.

BatChef is an R package that:

  • implements a variety of correction methods;

  • provides quantitative metrics to evaluate the performance of the correction methods, including the Wasserstein distance, Local Inverse Simpson’s Index (LISI), Average Silhouette Width (ASW), and Adjusted Rand Index (ARI).

  • can be used as a guideline to identify the most appropriate batch effects correction method based on data-specific characteristics.

2 Installation

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("BatChef")

3 Setup

To demonstrate, we simulate a dataset with 2,148 genes, 3,461 cells, 2 batches and 3 cell types, using the simulate_data() function. This function applies the same parameters as the Splatter package’s ones. Additionally, it normalizes the data, identifies highly variable genes and performs principal component analysis using the scrapper package.

The BatChef package allows to use three different input data: SingleCellExperiment, Seurat and AnnData. These can be simulated using the the output_format parameter in the simulate_data() function.

library(BatChef)
sce <- simulate_data(
  n_genes = 2148, batch_cells = c(1210, 2251), compute_hvgs = TRUE,
  batch_fac_loc = c(0.1, 0.1), batch_fac_scale = c(0.1, 0.1),
  mean_shape = 0.4, lib_loc = 11.5,
  group_prob = c(0.17, 0.46, 0.37),
  compute_pca = TRUE, pca_ncomp = 10, output_format = "SingleCellExperiment"
)
sce
## class: SingleCellExperiment 
## dim: 2148 3461 
## metadata(0):
## assays(2): counts logcounts
## rownames(2148): Gene1 Gene2 ... Gene2147 Gene2148
## rowData names(5): means variances fitted residuals hvg
## colnames(3461): Cell1 Cell2 ... Cell3460 Cell3461
## colData names(5): Cell Batch Group ExpLibSize sizeFactor
## reducedDimNames(1): PCA
## mainExpName: NULL
## altExpNames(0):
so <- simulate_data(
  n_genes = 2148, batch_cells = c(1210, 2251), compute_hvgs = TRUE,
  batch_fac_loc = c(0.1, 0.1), batch_fac_scale = c(0.1, 0.1),
  mean_shape = 0.4, lib_loc = 11.5,
  group_prob = c(0.17, 0.46, 0.37),
  compute_pca = TRUE, pca_ncomp = 10, output_format = "Seurat"
)

adata <- simulate_data(
  n_genes = 2148, batch_cells = c(1210, 2251), compute_hvgs = TRUE,
  batch_fac_loc = c(0.1, 0.1), batch_fac_scale = c(0.1, 0.1),
  mean_shape = 0.4, lib_loc = 11.5,
  group_prob = c(0.17, 0.46, 0.37),
  compute_pca = TRUE, pca_ncomp = 10, output_format = "AnnData"
)

4 Batch correction method prediction

Users can predict the optimal batch correction method for a given dataset via the suggested_method() function. This function employs a Support Vector Machine (SVM) algorithm to predict the most suitable batch correction method and visualize the dataset’s position within a two-dimensional embedding space. The embedding was constructed by analyzing 130 datasets and their associated characteristics.

pred <- suggested_method(input = sce, batch = "Batch")
## Optimal method: scmerge2_un

In this case, the optimal batch correction is scMerge2 (unsupervised) for our data.

5 Batch effects correction

The batchCorrect() function allows to specify the desired correction method. It can have these values:

  • LimmaParams

  • CombatParams

  • Seuratv3Params

  • Seuratv5Params

  • FastMNNParams

  • HarmonyParams

  • ScMerge2Params

  • LigerParams

To illustrate the functionality of the package, batch correction was performed using the scMerge2 method. To improve computational efficiency, the analysis was restricted to the first 1,000 highly variable genes.

sce <- sce[SingleCellExperiment::rowData(sce)$hvg, ]
sce <- batchCorrect(input = sce, batch = "Batch", params = ScMerge2Params())
library(scater)
sce <- runPCA(sce,
  subset_row = rownames(sce),
  assay.type = "scmerge2", name = "scmerge2", ncomponent = 10
)

The batchCorrect() function returns an object of the same class as the input. In this example, the output is therefore a SingleCellExperiment object. The corrected gene expression matrix and/or the corrected dimensionality reduction are stored in the returned object.

sce
## class: SingleCellExperiment 
## dim: 1101 3461 
## metadata(0):
## assays(3): counts logcounts scmerge2
## rownames(1101): Gene1 Gene2 ... Gene2147 Gene2148
## rowData names(5): means variances fitted residuals hvg
## colnames(3461): Cell1 Cell2 ... Cell3460 Cell3461
## colData names(5): Cell Batch Group ExpLibSize sizeFactor
## reducedDimNames(2): PCA scmerge2
## mainExpName: NULL
## altExpNames(0):

6 Performance evaluation

To evaluate batch correction performance, the user can use the metrics() function to compute multiple quantitative metrics, including Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), Local Inverse Simpson’s Index (LISI), Average Silhouette Width (ASW), and the Wasserstein distance.

The Wasserstein distance is computed on resampled data to account for the different numbers of cells in each batch. The parameter rep specifies how many times the calculation is repeated (in this case, 1).

red <- SingleCellExperiment::reducedDimNames(sce)
metrics <- lapply(red, function(x) {
  metrics(
    input = sce, batch = "Batch",
    group = "Group", reduction = x,
    rep = 1
  )
})
metrics <- do.call(rbind, metrics)
metrics
##     method wasserstein         iasw    ilisi       ari       nmi      casw
## 1      PCA   9.8569782  0.331507151 1.000000 0.5996069 0.7610525 0.3603265
## 2 scmerge2   0.0186726 -0.001318161 1.810335 1.0000000 1.0000000 0.5153678
##   clisi
## 1     1
## 2     1

To interpret the results, we focused on the Wasserstein distance and the ARI score, which measure batch effect removal and the preservation of biological effects, respectively.

metrics[, c(1:2, 5)]
##     method wasserstein       ari
## 1      PCA   9.8569782 0.5996069
## 2 scmerge2   0.0186726 1.0000000

A low Wasserstein distance indicates that the method effectively removes batch effects, while a high Adjusted Rand Index (ARI) indicates that the biological effects are well preserved. Overall, scMerge2 performs well on both metrics.

7 sessionInfo()

sessionInfo()
## R version 4.6.0 RC (2026-04-17 r89917)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scater_1.39.4               ggplot2_4.0.3              
##  [3] scuttle_1.21.6              SingleCellExperiment_1.33.2
##  [5] SummarizedExperiment_1.41.1 Biobase_2.71.0             
##  [7] GenomicRanges_1.63.2        Seqinfo_1.1.0              
##  [9] IRanges_2.45.0              S4Vectors_0.49.2           
## [11] BiocGenerics_0.57.1         generics_0.1.4             
## [13] MatrixGenerics_1.23.0       matrixStats_1.5.0          
## [15] BatChef_0.99.12             BiocStyle_2.39.0           
## 
## loaded via a namespace (and not attached):
##   [1] dichromat_2.0-0.1         nnet_7.3-20              
##   [3] goftest_1.2-3             Biostrings_2.79.5        
##   [5] HDF5Array_1.39.1          rstan_2.32.7             
##   [7] vctrs_0.7.3               spatstat.random_3.4-5    
##   [9] digest_0.6.39             png_0.1-9                
##  [11] shape_1.4.6.1             proxy_0.4-29             
##  [13] ggrepel_0.9.8             deldir_2.0-4             
##  [15] parallelly_1.47.0         batchelor_1.27.1         
##  [17] MASS_7.3-65               reshape2_1.4.5           
##  [19] withr_3.0.2               httpuv_1.6.17            
##  [21] ggrastr_1.0.2             scCustomize_3.2.4        
##  [23] xfun_0.57                 survival_3.8-6           
##  [25] memoise_2.0.1             proxyC_0.5.2             
##  [27] ggbeeswarm_0.7.3          janitor_2.2.1            
##  [29] zoo_1.8-15                GlobalOptions_0.1.4      
##  [31] gtools_3.9.5              V8_8.2.0                 
##  [33] pbapply_1.7-4             DEoptimR_1.1-4           
##  [35] Formula_1.2-5             KEGGREST_1.51.1          
##  [37] rematch2_2.1.2            promises_1.5.0           
##  [39] otel_0.2.0                httr_1.4.8               
##  [41] globals_0.19.1            fitdistrplus_1.2-6       
##  [43] cvTools_0.3.3             rhdf5filters_1.23.3      
##  [45] rhdf5_2.55.16             rstudioapi_0.18.0        
##  [47] units_1.0-1               miniUI_0.1.2             
##  [49] dir.expiry_1.19.0         base64enc_0.1-6          
##  [51] curl_7.1.0                sfsmisc_1.1-23           
##  [53] ScaledMatrix_1.19.0       h5mread_1.3.3            
##  [55] polyclip_1.10-7           SparseArray_1.11.13      
##  [57] xtable_1.8-8              stringr_1.6.0            
##  [59] evaluate_1.0.5            S4Arrays_1.11.1          
##  [61] bookdown_0.46             irlba_2.3.7              
##  [63] filelock_1.0.3            colorspace_2.1-2         
##  [65] ROCR_1.0-12               harmony_1.2.4            
##  [67] reticulate_1.46.0         spatstat.data_3.1-9      
##  [69] magrittr_2.0.5            lmtest_0.9-40            
##  [71] snakecase_0.11.1          later_1.4.8              
##  [73] viridis_0.6.5             lattice_0.22-9           
##  [75] genefilter_1.93.0         spatstat.geom_3.7-3      
##  [77] future.apply_1.20.2       robustbase_0.99-7        
##  [79] XML_3.99-0.23             scattermore_1.2          
##  [81] cowplot_1.2.0             RcppAnnoy_0.0.23         
##  [83] class_7.3-23              Hmisc_5.2-5              
##  [85] pillar_1.11.1             StanHeaders_2.32.10      
##  [87] nlme_3.1-169              caTools_1.18.3           
##  [89] compiler_4.6.0            beachmat_2.27.5          
##  [91] RSpectra_0.16-2           stringi_1.8.7            
##  [93] sf_1.1-0                  tensor_1.5.1             
##  [95] lubridate_1.9.5           plyr_1.8.9               
##  [97] crayon_1.5.3              abind_1.4-8              
##  [99] locfit_1.5-9.12           sp_2.2-1                 
## [101] bit_4.6.0                 dplyr_1.2.1              
## [103] codetools_0.2-20          BiocSingular_1.27.1      
## [105] bslib_0.10.0              QuickJSR_1.9.2           
## [107] splatter_1.35.2           e1071_1.7-17             
## [109] paletteer_1.7.0           plotly_4.12.0            
## [111] mime_0.13                 splines_4.6.0            
## [113] circlize_0.4.18           Rcpp_1.1.1-1.1           
## [115] fastDummies_1.7.6         basilisk_1.23.0          
## [117] sparseMatrixStats_1.23.0  scrapper_1.5.19          
## [119] blob_1.3.0                knitr_1.51               
## [121] reldist_1.7-2             listenv_0.10.1           
## [123] checkmate_2.3.4           DelayedMatrixStats_1.33.0
## [125] pkgbuild_1.4.8            tibble_3.3.1             
## [127] Matrix_1.7-5              statmod_1.5.1            
## [129] startupmsg_1.0.0          pkgconfig_2.0.3          
## [131] tools_4.6.0               cachem_1.1.0             
## [133] aricode_1.0.3             RSQLite_2.4.6            
## [135] viridisLite_0.4.3         DBI_1.3.0                
## [137] numDeriv_2016.8-1.1       fastmap_1.2.0            
## [139] rmarkdown_2.31            scales_1.4.0             
## [141] grid_4.6.0                ica_1.0-3                
## [143] Seurat_5.5.0              sass_0.4.10              
## [145] patchwork_1.3.2           ggprism_1.0.7            
## [147] BiocManager_1.30.27       dotCall64_1.2            
## [149] transport_0.15-4          RANN_2.6.2               
## [151] rpart_4.1.27              farver_2.1.2             
## [153] mgcv_1.9-4                yaml_2.3.12              
## [155] foreign_0.8-91            cli_3.6.6                
## [157] purrr_1.2.2               lifecycle_1.0.5          
## [159] uwot_0.2.4                M3Drop_1.37.0            
## [161] mvtnorm_1.3-7             bluster_1.21.1           
## [163] backports_1.5.1           annotate_1.89.0          
## [165] BiocParallel_1.45.0       distr_2.9.7              
## [167] timechange_0.4.0          gtable_0.3.6             
## [169] ggridges_0.5.7            densEstBayes_1.0-2.2     
## [171] progressr_0.19.0          parallel_4.6.0           
## [173] limma_3.67.3              jsonlite_2.0.0           
## [175] edgeR_4.9.9               RcppHNSW_0.6.0           
## [177] bitops_1.0-9              bit64_4.8.0              
## [179] assertthat_0.2.1          Rtsne_0.17               
## [181] spatstat.utils_3.2-2      BiocNeighbors_2.5.4      
## [183] SeuratObject_5.4.0        RcppParallel_5.1.11-2    
## [185] bdsmatrix_1.3-7           jquerylib_0.1.4          
## [187] metapod_1.19.2            dqrng_0.4.1              
## [189] loo_2.9.0                 spatstat.univar_3.1-7    
## [191] lazyeval_0.2.3            shiny_1.13.0             
## [193] ruv_0.9.7.1               htmltools_0.5.9          
## [195] sctransform_0.4.3         glue_1.8.1               
## [197] spam_2.11-3               rliger_2.2.1             
## [199] ResidualMatrix_1.21.0     XVector_0.51.0           
## [201] scMerge_1.27.0            scran_1.39.2             
## [203] mclust_6.1.2              classInt_0.4-11          
## [205] gridExtra_2.3             sccore_1.0.7             
## [207] igraph_2.3.0              R6_2.6.1                 
## [209] sva_3.59.0                tidyr_1.3.2              
## [211] gplots_3.3.0              forcats_1.0.1            
## [213] anndata_0.8.0             zellkonverter_1.21.2     
## [215] cluster_2.1.8.2           bbmle_1.0.25.1           
## [217] Rhdf5lib_1.99.6           mcprogress_0.1.1         
## [219] rstantools_2.6.0          DelayedArray_0.37.1      
## [221] tidyselect_1.2.1          vipor_0.4.7              
## [223] htmlTable_2.5.0           inline_0.3.21            
## [225] AnnotationDbi_1.73.1      future_1.70.0            
## [227] leidenAlg_1.1.7           rsvd_1.0.5               
## [229] KernSmooth_2.23-26        S7_0.2.2                 
## [231] data.table_1.18.2.1       htmlwidgets_1.6.4        
## [233] RColorBrewer_1.1-3        rlang_1.2.0              
## [235] spatstat.sparse_3.1-0     spatstat.explore_3.8-0   
## [237] beeswarm_0.4.0