1 Introduction

scBatchQC is a Bioconductor package for batch-aware quality control of multi-sample single-cell RNA-seq data. It provides calibrated, per-batch QC thresholds using empirical Bayes shrinkage, so you don’t over-filter low-depth batches or under-filter high-depth ones.

1.1 The problem

In multi-batch single-cell RNA-seq experiments, standard QC tools like scuttle::isOutlier apply a single global threshold across all cells. This ignores batch-to-batch variation and leads to:

  • Over-filtering of lower-depth batches (good cells removed)
  • Under-filtering of higher-depth batches (bad cells kept)

1.2 What scBatchQC does

scBatchQC solves this with a hierarchical empirical Bayes approach:

  1. Computes per-batch QC metric medians and MADs
  2. Shrinks per-batch estimates toward a global prior (borrows strength across batches)
  3. Sets batch-specific thresholds that are calibrated, not arbitrary
  4. Estimates per-batch doublet rates from cells loaded and protocol

The result: each batch gets its own QC thresholds, but small batches are stabilised by borrowing information from larger ones.

2 Installation

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("scBatchQC")

3 Quick start: simulate a two-batch dataset

We create a toy dataset with two batches that mimic a common real scenario: one high-depth batch and one lower-depth batch with a few damaged cells.

library(scBatchQC)
library(SingleCellExperiment)

set.seed(123)
n_genes <- 500

# Batch 1: high-depth fresh tissue (10x v3)
counts_b1 <- matrix(
    rpois(n_genes * 100, lambda = 12),
    nrow = n_genes, ncol = 100
)

# Batch 2: lower-depth cryopreserved (10x v2)
# with 5 deliberately damaged cells
counts_b2 <- matrix(
    rpois(n_genes * 100, lambda = 5),
    nrow = n_genes, ncol = 100
)
counts_b2[, 1:5] <- floor(counts_b2[, 1:5] / 10)

# Combine and label
counts <- cbind(counts_b1, counts_b2)
rownames(counts) <- paste0("Gene", seq_len(n_genes))
rownames(counts)[1:30] <- paste0("MT-", seq_len(30))
colnames(counts) <- paste0("Cell", seq_len(200))

sce <- SingleCellExperiment(assays = list(counts = counts))
sce$batch <- rep(c("Batch1_v3", "Batch2_v2"), each = 100)
sce
#> class: SingleCellExperiment 
#> dim: 500 200 
#> metadata(0):
#> assays(1): counts
#> rownames(500): MT-1 MT-2 ... Gene499 Gene500
#> rowData names(0):
#> colnames(200): Cell1 Cell2 ... Cell199 Cell200
#> colData names(1): batch
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

4 Step 1: Batch-aware QC metrics

The central function batchAwareQCMetrics() computes three per-cell QC metrics and flags outliers using batch-harmonised thresholds:

Metric Column added What it measures
Library size scBatchQC_sum Total UMI counts
Genes detected scBatchQC_detected Number of expressed genes
MT fraction scBatchQC_subsets_MT_percent Mitochondrial %
Outlier flag scBatchQC_outlier Should this cell be removed?
Reason scBatchQC_outlier_reason Which metric(s) failed
sce <- batchAwareQCMetrics(
    sce,
    batch = "batch",
    nmads = 3,
    shrink_strength = 0.5
)

# Columns added
grep("^scBatchQC", names(colData(sce)), value = TRUE)
#> [1] "scBatchQC_sum"                "scBatchQC_detected"          
#> [3] "scBatchQC_subsets_MT_percent" "scBatchQC_outlier"           
#> [5] "scBatchQC_outlier_reason"
# Outlier counts per batch
table(Outlier = sce$scBatchQC_outlier, Batch = sce$batch)
#>        Batch
#> Outlier Batch1_v3 Batch2_v2
#>    TRUE       100       100

4.1 Understanding the key parameters

nmads (default: 3) — Number of MADs from the median to set the threshold. Lower = stricter (more cells flagged). A typical range is 2.5–4.

shrink_strength (default: 0.5) — Controls how much per-batch estimates are pulled toward the global mean:

Value Behaviour When to use
0 Pure per-batch thresholds Many cells per batch (>500)
0.5 Balanced (default) Most experiments
0.7–0.9 Strong shrinkage Small batches (<50 cells)
1 Fully pooled (ignores batch) Single-batch fallback

5 Step 2: Visualise QC distributions

plotBatchQC() shows violin plots of each metric per batch, with threshold lines and outlier highlights:

plotBatchQC(sce, batch = "batch")
QC distributions per batch. Dashed red lines = harmonised thresholds. Red points = flagged outliers.

(#fig:plot_qc)QC distributions per batch. Dashed red lines = harmonised thresholds. Red points = flagged outliers.

6 Step 3: Estimate doublet rates

estimateBatchDoubletRate() models the expected doublet rate per batch based on how many cells were loaded and the protocol used:

cells_loaded <- c(Batch1_v3 = 8000, Batch2_v2 = 5000)

sce <- estimateBatchDoubletRate(
    sce,
    batch = "batch",
    cells_loaded = cells_loaded,
    protocol = c(Batch1_v3 = "10x_v3", Batch2_v2 = "10x_v2")
)

# Estimated doublet rate per batch
tapply(sce$scBatchQC_doublet_rate, sce$batch, unique)
#> Batch1_v3 Batch2_v2 
#>    0.0640    0.0375

The model uses the empirical relationship from 10x Genomics: ~0.8% doublets per 1,000 cells loaded on the Chromium controller.

7 Step 4: Explore thresholds interactively

Before committing to a specific nmads, use harmonizeQCThresholds() to see how many cells would be flagged at different stringencies — without re-running the full pipeline:

sweep <- lapply(c(2, 2.5, 3, 3.5, 4), function(n) {
    r <- harmonizeQCThresholds(sce, batch = "batch", nmads = n)
    data.frame(
        nmads = n,
        flagged = sum(rowSums(as.data.frame(r$n_flagged)))
    )
})
do.call(rbind, sweep)
#>   nmads flagged
#> 1   2.0     339
#> 2   2.5     222
#> 3   3.0     222
#> 4   3.5     215
#> 5   4.0     215

How to choose nmads: Look for the “elbow” — the point where flagged cells stop increasing sharply. Typical scRNA-seq experiments use nmads = 3 (default) and flag 3–10% of cells.

8 Step 5: Filter and continue

Once you’re satisfied with the QC, filter and pass to downstream analysis:

sce_clean <- sce[, !sce$scBatchQC_outlier]
cat(
    "Kept", ncol(sce_clean), "of", ncol(sce),
    "cells (removed", sum(sce$scBatchQC_outlier), ")\n"
)
#> Kept 0 of 200 cells (removed 200 )

9 Storing results: BQCResult container

For programmatic access to batch-level summaries, use the BQCResult S4 class:

library(S4Vectors)

result <- BQCResult(
    qcFlags = DataFrame(outlier = sce$scBatchQC_outlier),
    doubletScores = sce$scBatchQC_doublet_rate,
    batchSummary = estimateBatchDoubletRate(
        sce,
        batch = "batch",
        cells_loaded = cells_loaded,
        return_sce = FALSE
    )
)
result
#> BQCResult
#>   Cells          : 200 
#>   Batches        : 2 
#>   QC metrics     : outlier 
#>   Doublet range  : 0.038 - 0.064

Access individual slots with:

head(qcFlags(result))
#> DataFrame with 6 rows and 1 column
#>     outlier
#>   <logical>
#> 1      TRUE
#> 2      TRUE
#> 3      TRUE
#> 4      TRUE
#> 5      TRUE
#> 6      TRUE
head(doubletScores(result))
#> Batch1_v3 Batch1_v3 Batch1_v3 Batch1_v3 Batch1_v3 Batch1_v3 
#>     0.064     0.064     0.064     0.064     0.064     0.064
batchSummary(result)
#> DataFrame with 2 rows and 5 columns
#>                 batch n_cells_obs cells_loaded doublet_rate_est    protocol
#>           <character>   <integer>    <numeric>        <numeric> <character>
#> Batch1_v3   Batch1_v3         100         8000            0.064      10x_v3
#> Batch2_v2   Batch2_v2         100         5000            0.040      10x_v3

10 Edge cases

Scenario Behaviour
Single batch (no batch argument) Falls back to global MAD — equivalent to scuttle::isOutlier
Very small batch (<50 cells) Increase shrink_strength to 0.7–0.9
No MT genes found MT metric is silently skipped; QC uses library size and gene count only
All cells flagged Your nmads is too strict, or data has very low variance — try nmads = 4

12 Session information

sessionInfo()
#> R version 4.6.0 RC (2026-04-17 r89917)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SingleCellExperiment_1.34.0 SummarizedExperiment_1.42.0
#>  [3] Biobase_2.72.0              GenomicRanges_1.64.0       
#>  [5] Seqinfo_1.2.0               IRanges_2.46.0             
#>  [7] S4Vectors_0.50.0            BiocGenerics_0.58.0        
#>  [9] generics_0.1.4              MatrixGenerics_1.24.0      
#> [11] matrixStats_1.5.0           scBatchQC_0.99.1           
#> [13] BiocStyle_2.40.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.10         scuttle_1.22.0      SparseArray_1.12.2 
#>  [4] lattice_0.22-9      magrittr_2.0.5      digest_0.6.39      
#>  [7] evaluate_1.0.5      grid_4.6.0          RColorBrewer_1.1-3 
#> [10] bookdown_0.46       fastmap_1.2.0       jsonlite_2.0.0     
#> [13] Matrix_1.7-5        tinytex_0.59        BiocManager_1.30.27
#> [16] scales_1.4.0        codetools_0.2-20    jquerylib_0.1.4    
#> [19] abind_1.4-8         cli_3.6.6           rlang_1.2.0        
#> [22] XVector_0.52.0      withr_3.0.2         cachem_1.1.0       
#> [25] DelayedArray_0.38.1 yaml_2.3.12         otel_0.2.0         
#> [28] S4Arrays_1.12.0     beachmat_2.28.0     tools_4.6.0        
#> [31] parallel_4.6.0      BiocParallel_1.46.0 dplyr_1.2.1        
#> [34] ggplot2_4.0.3       vctrs_0.7.3         R6_2.6.1           
#> [37] magick_2.9.1        lifecycle_1.0.5     pkgconfig_2.0.3    
#> [40] pillar_1.11.1       bslib_0.10.0        gtable_0.3.6       
#> [43] Rcpp_1.1.1-1.1      glue_1.8.1          tidyselect_1.2.1   
#> [46] tibble_3.3.1        xfun_0.57           knitr_1.51         
#> [49] dichromat_2.0-0.1   farver_2.1.2        htmltools_0.5.9    
#> [52] labeling_0.4.3      rmarkdown_2.31      compiler_4.6.0     
#> [55] S7_0.2.2