1 Motivation

From the perspective of metabolites as the continuation of the central dogma of biology, metabolomics provides the closest link to many phenotypes of interest. This makes untargeted LC-MS metabolomics data promising in teasing apart the complexities of living systems. However, due to experimental reasons, the data includes non-wanted variation which limits quality and reproducibility, especially if the data is obtained from several batches.

The batchCorr package reduces unwanted variation by way of between-batch alignment, within-batch drift correction and between-batch normalization using batch-specific quality control (QC) samples and long-term reference QC samples. Please see the associated article (Brunius, Shi, and Landberg 2016) for more thorough descriptions of algorithms.

2 Installation

To install batchCorr, install BiocManager first, if it is not installed. Afterwards use the install function from BiocManager.

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("batchCorr")

3 Data

The example data allows for demonstration of all core batchCorr functionality: between-batch alignment, within-batch drift correction and between-batch normalization. The example data consisting of three batches from a single analytical mode consists of three objects: PTnofill (non-imputed/filled abundances, matrix), PTfilled (imputed/filled abundances) and meta (sample and feature metadata, data.frame).

library(batchCorr)
data("ThreeBatchData")

4 How it works

batchCorr was originally designed to work with basic data structures but the three main methods, alignBatches, correctDrift and normalizeBatches also support SummarizedExperiment. This improves interoperability with other Bioconductor packages. This includes xcms for preprocessing, the qmtools, phenomis and/or pmp packages to complement normalization and quality control as well as statistical tests, machine learning and annotation.

Abundances are included in a matrix, while sample and feature data is included in a data.frame. batchCorr works best as per the chronology presented below, where both the original functionality and SummarizedExperiment-functionality is demonstrated.

Utility functions for the original functionality include:

peakInfo to extract m/z and rt from peak table based on a separator in rownames
getBatch to extract specific batch from a list with a peak table and metadata
mergeBatches to merge batches after drift correction

Important analytical background includes batch-specific QC samples and long-term reference QC samples, which are regularly interspersed in the injection sequence. Batch-specific QC samples are typically pooled aliquots of study samples, and are used for within-batch drift correction. Long-term reference QC samples are not of the same biological origin as the batch-specific QC samples, and are therefore not directly representative of the sample population. Long-term reference QC samples are used for between-batch alignment, within-batch drift correction and between-batch normalization.

Let’s create a SummarizedExperiment object in order to demonstrate the original functionality and the new SummarizedExperiment methods in parallel. SummarizedExperiment may also be output from xcms-based preprocessing using xcms::quantify().

peaks <- SimpleList(t(PTnofill), t(PTfill))
sampleData <- meta
featureData <- peakInfo(PT = PTnofill, sep = "@", start = 3)
rownames(featureData) <- rownames(peaks[[1]])

se <- SummarizedExperiment(assays = peaks, colData = sampleData,
    rowData = featureData)
names(assays(se)) <- c("nofill", "fill")

Below, we focus on basic usage. The list output for basic data structures includes processing metadata which can be used for troubleshooting. The SummarizedExperiment methods return objects with modified peak tables. Please see the documentation for more information (for example, ?alignBatches).

4.1 Between-batch alignment

Shifts in retention time (RT) and mass-to-charge ratio (m/z) across batches results in some metabolites being redundantly represented in the dataset. To rectify this, between-batch alignment of features is performed using alignBatches, which encompasses the following steps:

Aggregation of feature presence/missingness on batch level

batch-wise flagging of low-quality features with proportion of NAs to all samples > 80% based on long-term reference QC samples
0 < total batch presence of candidates features < number of batches to be an alignment candidate

Identification of features with missingness within “the box”, i.e. sufficiently similar in RT and m/z

potential alignment candidates have similar RT and m/z across batches
orthogonal batch presence: two or more alignment candidates cannot be present in the same batch
if there are multiple combinations of candidates across batches, the features are recursively subclustered before clustering across batches

Alignment of feature clusters resulting in new peak table

# Extract peakinfo (i.e. m/z and rt of features),
# These column names have 2 leading characters describing LC-MS mode
# -> start at 3
peakIn <- peakInfo(PT = PTnofill, sep = "@", start = 3)
# Perform multi-batch alignment
alignBat <- alignBatches(
    peakInfo = peakIn, PeakTabNoFill = PTnofill,
    PeakTabFilled = PTfill, batches = meta$batch,
    sampleGroups = meta$grp, selectGroup = "QC",
    report = FALSE
)
# Extract new peak table
PT <- alignBat$PTalign

Below, we use SummarizedExperiment with multiple peak tables. When using SummarizedExperiment sequentially such that a single peak table is replaced by a new one, one doesn’t need to specify the assay.type or name parameters. The assay is added to the SummarizedExperiment supplied to PeakTabFilled.

se <- alignBatches(PeakTabNoFill = se, PeakTabFilled = se, batches = "batch",
    sampleGroups = "grp", report = FALSE, assay.type1 = "nofill",
    assay.type2 = "fill", name = "aligned", rt_col = "rt", mz_col = "mz")

4.2 Within-batch drift correction

Drift in abundance within a batch gives rise to unwanted variation which can be modelled in terms of injection order. Many methods fail to take into account different drift patterns in features or are prone to overfitting. Herein, within-batch drift correction is performed using the wrapper correctDrift, which involves:

Clustering of features in observation space

scaling by standard deviation
clustering serves to identify features with similar drift patterns, which are corrected in aggregate. As such, different drift patterns are accounted for while mitigating overfitting to unwanted variation in a single feature.

Fitting a cubic spline and calculation of correction factor
Correction of the abundances using correction factor

corrects to reference level at the first injection after scaling
corrected values were retained only if the root mean square deviation of long-term reference QC samples was reduced after drift correction for the cluster at large

The mixture models used to cluster the drift patterns in correctDrift() can fail to converge for some combinations of geometry and cluster number. This results in missing Bayesian Information Criterion (BIC) measures for some models. You can check from which converged models the final model was selected using the “BIC” element in the output for basic data structures. To learn more about the model-based clustering used herein, refer to the mclust (e)book (Scrucca et al. 2023).

# Batch B
batchB <- getBatch(
    peakTable = PT, meta = meta,
    batch = meta$batch, select = "B"
)
BCorr <- correctDrift(
    peakTable = batchB$peakTable,
    injections = batchB$meta$inj,
    sampleGroups = batchB$meta$grp, QCID = "QC",
    G = seq(5, 35, by = 3), modelNames = c("VVE", "VEE"),
    report = FALSE
)
# Batch F
batchF <- getBatch(
    peakTable = PT, meta = meta,
    batch = meta$batch, select = "F"
)
FCorr <- correctDrift(
    peakTable = batchF$peakTable,
    injections = batchF$meta$inj,
    sampleGroups = batchF$meta$grp,
    QCID = "QC", G = seq(5, 35, by = 3),
    modelNames = c("VVE", "VEE"),
    report = FALSE
)
# Batch H
batchH <- getBatch(
    peakTable = PT, meta = meta,
    batch = meta$batch, select = "H"
)
HCorr <- correctDrift(
    peakTable = batchH$peakTable,
    injections = batchH$meta$inj,
    sampleGroups = batchH$meta$grp,
    QCID = "QC", G = seq(5, 35, by = 3),
    modelNames = c("VVE", "VEE"),
    report = FALSE
)

HCorr$BIC
## Bayesian Information Criterion (BIC): 
##         VVE      VEE
## 5  7471.557 7028.343
## 8  7608.446 7460.302
## 11 7471.978 7456.081
## 
## Top 3 models based on the BIC criterion: 
##    VVE,8   VVE,11    VVE,5 
## 7608.446 7471.978 7471.557

Similarly, we subset the SummarizedExperiment by batch and perform drift correction, but for this example using long-term reference samples for quality indicators.

batch_labels <- unique(colData(se)$batch)
batches <- lapply(batch_labels, function(batch_label) {
    se[, colData(se)$batch == batch_label]
})

batches <- lapply(batches, correctDrift, injections = "inj", 
    sampleGroups = "grp", RefID = "Ref", G = seq(5, 35, by = 3), 
    modelNames = c("VVE", "VEE"), report = FALSE, 
    assay.type = "aligned", name = "corrected")

4.3 Between-batch normalization

normalizeBatches performs between-batch normalization either based on long-term reference QC samples or median batch intensity depending on the following dual criterion:

long-term reference QC sample CV < 30%
fold-change < 5 for the ratio of the average feature intensity of a specific feature between batches to the ratio of the all-feature average intensity between batches

If the long-term QC samples are not considered reliable according to the above dual criterion for a specific feature, batches were normalized by sample population median, where a sample population can be specified explicitly to the population argument. Features not present in all batches are also excluded from the dataset.

mergedData <- mergeBatches(list(BCorr, FCorr, HCorr), qualRatio = 0.5)
normData <- normalizeBatches(
    peakTableCorr = mergedData$peakTableCorr,
    batches = meta$batch, sampleGroup = meta$grp,
    refGroup = "Ref", population = "all"
)
PTnorm <- normData$peakTable

Merging with mergeBatches, as above, includes a quality control step, where features with CV > the limit (supplied to correctDrift()) in a specified proportion of batches (default = 0.5) are excluded. For SummarizedExperiment, we join the batches, keeping features shared across all batches but without filtering features by proportion of quality batches.

se <- do.call(cbind, batches)
se <- se[which(apply(assay(se), 1, function(x) any(is.na(x)))), ]

se <- normalizeBatches(peakTableCorr = se, 
    batches = "batch", sampleGroup = "grp", refGroup = "Ref",
    population = "all", assay.type = "corrected", name = "normalized")

5 Authors & Acknowledgements

The first version of batchCorr was written by Carl Brunius. batchCorr was developed for Bioconductor by Carl Brunius, Anton Ribbenstedt and Vilhelm Suksi. If you find any bugs or other things to fix, please submit an issue on GitHub! All contributions to the package are always welcome!

6 Session information

## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] batchCorr_0.99.8            SummarizedExperiment_1.39.1
##  [3] Biobase_2.69.0              GenomicRanges_1.61.1       
##  [5] Seqinfo_0.99.2              IRanges_2.43.0             
##  [7] S4Vectors_0.47.0            BiocGenerics_0.55.0        
##  [9] generics_0.1.4              MatrixGenerics_1.21.0      
## [11] matrixStats_1.5.0           BiocStyle_2.37.0           
## 
## loaded via a namespace (and not attached):
##  [1] Matrix_1.7-3        jsonlite_2.0.0      compiler_4.5.1     
##  [4] BiocManager_1.30.26 crayon_1.5.3        Rcpp_1.1.0         
##  [7] parallel_4.5.1      jquerylib_0.1.4     BiocParallel_1.43.4
## [10] yaml_2.3.10         fastmap_1.2.0       lattice_0.22-7     
## [13] plyr_1.8.9          R6_2.6.1            XVector_0.49.0     
## [16] S4Arrays_1.9.1      knitr_1.50          DelayedArray_0.35.2
## [19] bookdown_0.43       bslib_0.9.0         rlang_1.1.6        
## [22] reshape_0.8.10      cachem_1.1.0        xfun_0.52          
## [25] sass_0.4.10         SparseArray_1.9.1   cli_3.6.5          
## [28] digest_0.6.37       grid_4.5.1          mclust_6.1.1       
## [31] lifecycle_1.0.4     evaluate_1.0.4      codetools_0.2-20   
## [34] abind_1.4-8         rmarkdown_2.29      tools_4.5.1        
## [37] htmltools_0.5.8.1

References

Brunius, Carl, Lin Shi, and Rikard Landberg. 2016. “Large-Scale Untargeted Lc-Ms Metabolomics Data Correction Using Between-Batch Feature Alignment and Cluster-Based Within-Batch Signal Intensity Drift Correction.” Metabolomics 12: 1–13.

Scrucca, Luca, Chris Fraley, T. Brendan Murphy, and Adrian E. Raftery. 2023. Model-Based Clustering, Classification, and Density Estimation Using mclust in R. Chapman; Hall/CRC. https://doi.org/10.1201/9781003277965.

General introduction

2025-07-25

Contents