ProteinBatcher workflow

Overview

Quantitative proteomics is well supported in Bioconductor for individual downstream tasks such as differential testing (e.g. via limma), missing-value imputation, or visualization. However, fully integrated downstream workflows that combine these steps in a single, reproducible, and design-aware pipeline are comparatively scarce, particularly for label-free proteomics. ProteinBatcher does not run upstream quantification; it operates on protein-level abundance matrices (the package description explicitly targets DIA-NN-style outputs).

ProteinBatcher addresses this gap by providing a cohesive downstream analysis framework that focuses on three aspects that are often difficult to combine in practice:

  1. Flexible experimental designs

    • The package is built around limma and supports:
    • multi-factor designs (e.g. condition + batch, condition + sex),
    • interaction testing,
    • paired and repeated-measures designs,
    • correlation-aware models via duplicateCorrelation.

    This flexibility allows users to model common proteomics scenarios such as batch effects, donor effects, or preparation-day effects without rewriting analysis code for each experiment.

  2. End-to-end standardization of downstream steps. ProteinBatcher integrates, in a single workflow:

    • missingness-based filtering,
    • mean and left-censored (LDV) imputation strategies,
    • differential abundance testing,
    • systematic separation of main, common, and interaction effects,
    • generation of plot-ready result objects.

    By enforcing a consistent structure for inputs, models, and outputs, the package reduces ad hoc analysis choices and improves reproducibility across experiments.

  3. Interpretation-oriented visualizations. In addition to standard volcano plots, ProteinBatcher implements:

    • deregulograms, which compare full effects across the two levels of an interaction factor (e.g. batch day 1 vs day 2, female vs male). Deregulograms complement interaction volcano plots by emphasizing effect concordance and magnitude, rather than statistical significance alone, and facilitate biological interpretation of interaction effects.

Importantly, ProteinBatcher is not intended to replace existing Bioconductor infrastructure, but rather to orchestrate and standardize well-established methods (notably limma) into a coherent downstream proteomics workflow. The package is therefore particularly suited for users who repeatedly analyze proteomics experiments with similar structure but varying designs, and who need a balance between automation and modeling flexibility.

This vignette documents the workflow and outputs, but does not execute the full analysis. All examples rely on precomputed results stored in inst/extdata, ensuring portability and fast rendering during package checks.

Relationship with the Bioconductor ecosystem

ProteinBatcher is designed to operate within the standard Bioconductor infrastructure, rather than introducing new data containers or statistical methods. Its functionality is built by extending existing Bioconductor components in a consistent downstream workflow.

The package is built around two Bioconductor pillars:

All intermediate and final results in ProteinBatcher are stored in SummarizedExperiment objects. Missing-value filtering and imputation are implemented as modular steps that operate directly on SummarizedExperiment objects.

Input data and workflow parameters

ProteinBatcher needs two input data objects and a set of workflow parameters. This distinction is important: the inputs define the data to be analysed, whereas the parameters control how the analysis is performed.

Input data

ProteinBatcher expects exactly two input datasets:

  1. Quantitative matrix: A feature-by-sample matrix containing quantitative proteomics measurements (typically protein groups). Rows correspond to quantified features and columns to samples. The matrix is assumed to originate from an upstream workflow (e.g. FragPipe or DIANN), but ProteinBatcher does not depend on a specific quantification tool.

  2. Sample annotation table: A tab-delimited table describing the experimental design and linking each sample to its biological and technical metadata.

Both inputs are provided in this vignette as example datasets shipped with the package.

Code
# Load package
library(ProteinBatcher)
library(readr)

# Input files shipped with the package
path_annotation <- system.file(
  "extdata", "annotation_HaCaT.tsv",
  package = "ProteinBatcher"
)

path_pgmatrix <- system.file(
  "extdata",
  "2024MK017_HaCaT_Stimulation_1to24_Astral_report.pg_matrix.tsv",
  package = "ProteinBatcher"
)

# Output directory (portable, not used during vignette build)
path_output <- tempdir()

# Inspect annotation structure
ann <- readr::read_tsv(path_annotation)
## Rows: 24 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (5): file, sample, sample_name, condition, batch
## dbl (2): replicate, donor_id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Sample annotation structure

The annotation table must contain at least a condition column defining the main experimental comparison. Additional columns enable more complex designs such as batch correction, paired analyses, and interaction testing.

The following columns are supported:

Only columns explicitly referenced in the design formula are required.

Workflow parameters

In addition to the input data, ProteinBatcher requires several parameters that define how the analysis is carried out. These parameters do not represent additional inputs, but control filtering, modeling, and output generation. These parameters provide flexibility on the specific analysis to conduct.

Key parameters include:

These parameters are illustrated in the workflow section below, but the full analysis is not executed during vignette rendering.

Example 1: Workflow with interaction testing (condition-by-batch)

Code
# Parameters
experiment <- "HaCaT"
percent_missing <- 50

formula <- formula(~ 0 + condition + batch + condition:batch)
tests <- c(
  "IL13_vs_NoTreated",
  "IL22_vs_NoTreated",
  "COMBO_vs_NoTreated"
)
tests_interaction <- c(
  "IL13.batchday2",
  "IL22.batchday2",
  "COMBO.batchday2"
)
reference_condition <- "NoTreated"

# Workflow
res <- run_proteomics_pipeline(
  path_pgmatrix        = path_pgmatrix,
  path_annotation      = path_annotation,
  path_output          = path_output,
  tests                = tests,
  tests_interaction    = tests_interaction,
  formula              = formula,
  reference_condition  = reference_condition,
  percent_missing      = percent_missing,
  ldv_source           = "per-condition",
  experiment           = experiment,
  plots                = FALSE # Use TRUE to generate outputs
)
## Fitting limma model without block

Inspecting filtering and imputation outputs (xlsx files)

Code
## Inspect filtering and imputation summary tables
library(readxl)

path_filtered <- system.file(
  "extdata",
  "HaCaT_filtered_out_proteins.xlsx",
  package = "ProteinBatcher"
)

path_imputed <- system.file(
  "extdata",
  "HaCaT_Imputation_step.xlsx",
  package = "ProteinBatcher"
)

filtered_df <- readxl::read_xlsx(path_filtered)
imputed_df  <- readxl::read_xlsx(path_imputed)

head(filtered_df)
## # A tibble: 6 × 3
##   protein_id  gene_name reason                                                  
##   <chr>       <chr>     <chr>                                                   
## 1 MBLC1_HUMAN MBLAC1    "Failed per-condition non-missing threshold: >= 0.50 wi…
## 2 SH321_HUMAN SH3D21    "Failed per-condition non-missing threshold: >= 0.50 wi…
## 3 AN33B_HUMAN ANKRD33B  "Failed per-condition non-missing threshold: >= 0.50 wi…
## 4 SFI1_HUMAN  SFI1      "Failed per-condition non-missing threshold: >= 0.50 wi…
## 5 KLRF2_HUMAN KLRF2     "Failed per-condition non-missing threshold: >= 0.50 wi…
## 6 APRIO_HUMAN PRNP      "Failed per-condition non-missing threshold: >= 0.50 wi…
Code
head(imputed_df)
## # A tibble: 6 × 25
##   gene      NoTreated_1 `IL13 _1` IL22_1 `COMBO _1` NoTreated_2 `IL13 _2` IL22_2
##   <chr>           <dbl>     <dbl>  <dbl>      <dbl>       <dbl>     <dbl>  <dbl>
## 1 NUDT4B           22.5      22.4   22.5       22.5        22.6      22.3   22.6
## 2 SMIM26           20.7      20.3   20.6       20.6        20.6      20.5   20.6
## 3 GATD3;GA…        24.5      24.4   24.3       24.6        24.6      24.5   24.4
## 4 PIGBOS1          21.4      21.4   21.1       21.6        21.6      21.6   20.9
## 5 NBDY             20.4      20.1   20.1       20.1        19.9      20.3   20.0
## 6 MMP24OS          19.4      19.3   18.9       19.4        19.7      19.1   18.8
## # ℹ 17 more variables: `COMBO _2` <dbl>, NoTreated_3 <dbl>, `IL13 _3` <dbl>,
## #   IL22_3 <dbl>, `COMBO _3` <dbl>, NoTreated_4 <dbl>, `IL13 _4` <dbl>,
## #   IL22_4 <dbl>, `COMBO _4` <dbl>, NoTreated_5 <dbl>, `IL13 _5` <dbl>,
## #   IL22_5 <dbl>, `COMBO _5` <dbl>, NoTreated_6 <dbl>, `IL13 _6` <dbl>,
## #   IL22_6 <dbl>, `COMBO _6` <dbl>

These Excel files are generated by the workflow to make the preprocessing steps fully transparent.

HaCaT_filtered_out_proteins.xlsx lists proteins removed during the missingness-based filtering step. This allows users to verify which features were excluded and why (e.g. insufficient detection across conditions).

HaCaT_Imputation_step.xlsx summarizes the imputation strategy applied to each protein and sample, distinguishing mean-imputed values from LDV (left‑censored) imputations.

Together, these tables allow direct comparison of the dataset before and after filtering and imputation, which is critical for quality control and interpretability in label‑free proteomics analyses.

Note: For package size reasons, this vignette uses a reduced example of the imputation summary table. The full table is generated during a standard workflow run.

Differential effects - IL13 vs NoTreated contrast

From this point onwards, we focus exclusively on the IL13 vs NoTreated contrast to illustrate how different effect types are interpreted.

Main effect (all proteins)

The Main Effect (All Proteins) volcano plot shows the global IL13 effect in HaCaT cell lines, ignoring any interaction terms (batch). Every tested protein is included, regardless of whether its response depends on batch.

How to read it:

  • The x‑axis represents the estimated log2 fold change (IL13 vs NoTreated).
  • The y‑axis represents statistical significance (adjusted p‑value).

This plot answers the question: “Which proteins are affected by IL13 on average?”

This view is useful for an initial overview, but it does not distinguish stable effects from batch‑dependent ones.

Common effect (batch‑independent proteins)

The Common Effect plot shows proteins whose IL13 response is not significantly modulated by batch. These proteins pass the interaction test and are therefore considered stable across batch levels.

How to read it:

  • Proteins here show a consistent IL13 effect independent of interaction.
  • These are typically the most robust candidates for biological interpretation, as their direction and magnitude are reproducible across experimental batches

Conceptually, this plot answers: “Which IL13‑responsive proteins behave consistently across batches?”

Interaction effect (batch‑dependent proteins)

The Interaction Effect plot highlights proteins whose IL13 response differs between batch levels (here: day1 vs day2).

How to read it:

  • Significance is driven by the interaction term, not by the average IL13 effect.
  • A protein may appear weak or even non-significant in the main effect plot but still show a strong interaction if its response reverses or changes magnitude between batches.

This plot answers: “Which proteins show batch-specific IL13 effects?”

Deregulogram: interpreting batch-dependent effects

The deregulogram compares full IL13 effects between the two batch levels.

  • y-axis: full IL13 effect in the reference batch (e.g. day1)
  • x-axis: full IL13 effect in the other batch (day2), computed as main effect + interaction effect

How to read it:

  • Proteins along the diagonal behave similarly in both batches.
  • Proteins deviating from the diagonal show batch-dependent regulation.
  • Highlighted points represent proteins for which the interaction is both statistically significant and biologically relevant (effect size threshold).

Key interpretation note: The deregulogram is intentionally more restrictive than the interaction volcano. It is designed to emphasize interpretable, direction-changing or magnitude-shifting effects, rather than listing all statistically significant interactions. As a result, the set of highlighted proteins is not expected to match one-to-one with the interaction volcano, but be more stringent.

Example 2: Workflow without interaction testing

In some experiments, the main goal is to quantify a global condition effect without explicitly testing whether that effect depends on a secondary factor. Typical scenarios include:

At the level of the workflow, this corresponds to:

Here we use a simple condition-only model:

Code
# Workflow parameters (no interaction testing)
experiment <- "HaCaT"
percent_missing <- 50

formula <- ~ 0 + condition
tests <- c(
  "IL13_vs_NoTreated",
  "IL22_vs_NoTreated",
  "COMBO_vs_NoTreated"
)
tests_interaction <- "NA"
reference_condition <- "NoTreated"

Running the pipeline

Code
res <- run_proteomics_pipeline(
  path_pgmatrix        = path_pgmatrix,
  path_annotation      = path_annotation,
  path_output          = path_output,
  tests                = tests,
  tests_interaction    = tests_interaction,
  formula              = formula,
  reference_condition  = reference_condition,
  percent_missing      = percent_missing,
  ldv_source           = "per-condition",
  experiment           = experiment,
  plots                = FALSE
)
## Fitting limma model without block

When tests_interaction = “NA”:

This configuration answers the question:

“Which proteins are differentially abundant between conditions on average?”

and is appropriate when there is no strong prior reason to expect factor-dependent responses or when the analysis is intended as a first check, global comparison.

sessionInfo()

Code
sessionInfo()
## R version 4.6.0 RC (2026-04-17 r89917)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] readxl_1.4.5          readr_2.2.0           ProteinBatcher_0.99.0
## 
## loaded via a namespace (and not attached):
##  [1] utf8_1.2.6                  tidyr_1.3.2                
##  [3] sass_0.4.10                 generics_0.1.4             
##  [5] SparseArray_1.12.2          lattice_0.22-9             
##  [7] hms_1.1.4                   digest_0.6.39              
##  [9] magrittr_2.0.5              evaluate_1.0.5             
## [11] grid_4.6.0                  fastmap_1.2.0              
## [13] cellranger_1.1.0            jsonlite_2.0.0             
## [15] Matrix_1.7-5                writexl_1.5.4              
## [17] limma_3.68.1                purrr_1.2.2                
## [19] jquerylib_0.1.4             abind_1.4-8                
## [21] cli_3.6.6                   rlang_1.2.0                
## [23] crayon_1.5.3                XVector_0.52.0             
## [25] Biobase_2.72.0              bit64_4.8.0                
## [27] withr_3.0.2                 cachem_1.1.0               
## [29] DelayedArray_0.38.1         yaml_2.3.12                
## [31] otel_0.2.0                  S4Arrays_1.12.0            
## [33] tools_4.6.0                 parallel_4.6.0             
## [35] tzdb_0.5.0                  dplyr_1.2.1                
## [37] SummarizedExperiment_1.42.0 BiocGenerics_0.58.0        
## [39] assertthat_0.2.1            vctrs_0.7.3                
## [41] R6_2.6.1                    matrixStats_1.5.0          
## [43] stats4_4.6.0                lifecycle_1.0.5            
## [45] Seqinfo_1.2.0               S4Vectors_0.50.0           
## [47] IRanges_2.46.0              bit_4.6.0                  
## [49] vroom_1.7.1                 pkgconfig_2.0.3            
## [51] bslib_0.10.0                pillar_1.11.1              
## [53] glue_1.8.1                  data.table_1.18.2.1        
## [55] statmod_1.5.1               xfun_0.57                  
## [57] tibble_3.3.1                GenomicRanges_1.64.0       
## [59] tidyselect_1.2.1            MatrixGenerics_1.24.0      
## [61] knitr_1.51                  htmltools_0.5.9            
## [63] rmarkdown_2.31              compiler_4.6.0