---
title: "ProBatchFeatures"
author: |
  | Yuliya Burankova
  | Institute for Computational Systems Biology, University of Hamburg, Germany
date: "`r Sys.Date()`"
output: BiocStyle::html_document
bibliography: "references.bib"
csl: "nature-no-superscript.csl"
package: proBatch
vignette: >
  %\VignetteIndexEntry{ProBatchFeatures: QFeatures-based pipelines with operation logging for proBatch}
  %\VignetteEncoding{UTF-8}
  %\usepackage[utf8]{inputenc}
  %\VignetteEngine{knitr::rmarkdown}
abstract: |
  This vignette demonstrates how to perform and evaluate data preprocessing and batch-effect correction workflows using `ProBatchFeatures`, a `QFeatures`-based extension of `proBatch`.
toc: yes
toc_depth: 2
numbersections: true
editor_options:
  markdown:
    wrap: 72
---

```{r global_options, include=FALSE}
knitr::opts_chunk$set(
    echo = TRUE, warning = FALSE, message = FALSE, fig.pos = "h",
    collapse = TRUE,
    comment = "#>",
    knitr.table.format = "simple"
)
```

```{r setup, include = FALSE}
chooseCRANmirror(graphics = FALSE, ind = 1)
options(tinytex.verbose = TRUE)
```

# Overview

`ProBatchFeatures` is an extension of `proBatch` based on `QFeatures` [@QFeatures], which extends it's functionality by storing each processing stage (for example, raw data → log-transformation → normalization → batch effect correction → …) as a new `SummarizedExperiment` assay, and logging every step. It maintains full compatibility with `QFeatures` functions, and facilitates quick benchmarking of data preprocessing pipelines.

This vignette demonstrates how to:

1. Construct a `ProBatchFeatures` object from matrix-formatted quantification data.

2. Apply and record standard preprocessing steps (filtering features with too many missing values, log-transformation, normalization, and batch-effect correction) in a chained workflow.

3. Perform diagnostic checks.

4. Retrieve processed assay data and the corresponding preprocessing pipeline for benchmarking or reuse.


# Setup


## Installation

To install the latest version of `proBatch` package, use `BiocManager` package:

```{r install_proBatch, fig.show='hold', eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("proBatch")
```

## Loading required packages

This vignette uses `dplyr`, `tibble`, and `ggplot2` (alongside `proBatch`) for data manipulation and visualization.

```{r load_packages, message=FALSE}
library(proBatch)

library(dplyr)
library(tibble)
library(ggplot2)
library(QFeatures)
```


## Loading and preparing the example dataset

As an example, we will use the *E. coli* dataset described in the FedProt paper [@burankova2025privacy]:

```{r load_data}
data("example_ecoli_data", package = "proBatch")
```

The dataset consists of data from five centers that were  measured and analyzed independently.
For simplicity, data from all centers have been merged into a single dataset, and a subset containing 400 protein groups and their corresponding peptides is used in this example.

```{r extract-data}
# Extracting data
all_metadata <- example_ecoli_data$all_metadata
all_precursors <- example_ecoli_data$all_precursors
all_protein_groups <- example_ecoli_data$all_protein_groups
all_precursor_pg_match <- example_ecoli_data$all_precursor_pg_match

# Keeping only essential
rm(example_ecoli_data)
```

# Constructing a `ProBatchFeatures` object

The dataset contains two hierarchical levels. While additional levels can be added in a manner similar to `QFeatures`, this example focuses on the peptide and protein group levels.
The `all_precursor_pg_match` dataframe stores the mapping between these levels.
We begin by using the  `ProBatchFeatures()` constructor to create an object representing the peptide-level data:

```{r build-pbf-from-matrix}
# Building from a (features x samples) matrix
pbf <- ProBatchFeatures(
    data_matrix = all_precursors, # matrix of peptide intensities (features x samples)
    sample_annotation = all_metadata, # data.frame of sample metadata
    sample_id_col = "Run", # column in metadata that matches sample IDs
    level = "peptide" # label this assay level as "peptide"
)

pbf # show basic info about the ProBatchFeatures object
```

Assay names follow the convention "<level>::<pipeline>", for example, "peptide::raw", or
"protein::median_on_log".

Next, we add the protein-level data as a second assay in the same `ProBatchFeatures` instance. 
Internally, this is achieved by creating a `SummarizedExperiment` for the protein-level data and
 adding it to the existing object using `QFeatures` functionality.
The same sample annotation should be used for the new assay to ensure correct column alignment:

```{r add-new-pbf-level}
# Adding proteins as a new level and mapping them to precursors
#    all_precursor_pg_match has columns: "Precursor.Id", "Protein.Ids"
pbf <- pb_add_level(
    object = pbf,
    from = "peptide::raw",
    new_matrix = all_protein_groups,
    to_level = "protein", # will name "protein::raw" by default
    mapping_df = all_precursor_pg_match,
    from_id = "Precursor.Id",
    to_id = "Protein.Ids",
    map_strategy = "as_is"
)

pbf
```

If a precursor (peptide) maps to multiple protein groups, `map_strategy` parameter specifies how to resolve multimappers.
Available options are "first", "longest", and "as_is". The "first" and "longest" strategies select a single ProteinID per peptide,
where first takes the first match found in the mapping dataframe, and longest selects the ProteinID with the longest sequence, 
while "as_is" assumes a one-to-one mapping between identifiers.


```{r check_pbf}
# Check
validObject(pbf) # should be TRUE
assayLink(pbf, "protein::raw") # verify link summary

# Remove unused variables to clean up the workspace
rm(all_metadata, all_precursor_pg_match, all_precursors, all_protein_groups)
```

# Processing pipeline with diagnostics

With the data now loaded into a `ProBatchFeatures` object with peptide- and protein-level assays, we can demonstrate an example of typical preprocessing workflow. This pipeline includes filtering low-quality features, applying a log2-transformation,  median normalization, and batch effect correction.


## Step 1 - filtering features with too many missing values

Features detected in only a small subset of samples are unreliable and may obscure systematic effects in subsequent analyses. To examine the proportion of missing values across datasets, `pb_nNA()` function is used:

Function `pb_nNA()` summarizes missingness per assay:

```{r nNA_calc}
# pb_nNA returns a DataFrame with counts of NAs per feature/sample
na_counts <- pb_nNA(pbf)

# na_counts contains info about total number of NAs and % in the data:
na_counts[["nNA"]] %>% as_tibble()
```
Here nNA is a number of missing values in the assay and pNA is its
proportion. It is possible to inspect the per-feature and per-sample
breakdown for each assay:

```{r nNA_individual}
# check NAs per sample:
head(na_counts[["peptide::raw"]]$nNAcols) %>% as_tibble()
```

For each assay the following tables are availabe:
```{r print_na_counts}
print(names(na_counts[["peptide::raw"]]))
```

Visualisations helps explore batch-specific patterns of missingness. Firstly, we define a color scheme which will be used in all diagnostic plots:

```{r define_colors}
# color scheme for lab and condition labels
color_scheme <- sample_annotation_to_colors(
    pbf,
    sample_id_col = "Run",
    factor_columns = c("Lab", "Condition"),
    numeric_columns = NULL
)
```

Displaying features that contain at least one missing as a heatmap: 

```{r plot_na_heatmap_single, fig.show='hold', fig.width=10, fig.height=4}
plot_NA_heatmap(
    pbf, # by default the last created assay
    show_rownames = FALSE, show_row_dend = FALSE,
    color_by = "Lab"
)
```

To display all features, set the parameter `drop_complete=F`. When the number of features or samples exceeds 5000, a random subset of 5000 rows/columns is used for visualization by default. To display the entire dataset, set `use_subset = T`.

To display multiple assays in one plot,  specify them using `pbf_name` parameter:

```{r plot_na_heatmap_multi, fig.show='hold', fig.width=10, fig.height=4}
plot_NA_heatmap(
    pbf,
    pbf_name = c("peptide::raw", "protein::raw"),
    show_rownames = FALSE, show_row_dend = FALSE,
    color_by = "Lab"
)
```

The overlap between peptide and protein identifications across samples can be visualized as follows:

```{r plot_na_frequency_panel, fig.show='hold', fig.width=7, fig.height=4}
plot_NA_frequency(
    pbf,
    pbf_name = c("peptide::raw", "protein::raw"),
    show_percent = FALSE
)
```

In the resulting plot, particularly for peptides, batch-specific patterns are evident, reflecting the 19-20 samples contributed by each laboratory.

Here, we apply a simple filter that retains only features quantified in at least 40% of the samples.
This threshold can be adjusted if necessary.

```{r apply_missing_filter}
pbf <- pb_filterNA(
    pbf, # without specifying, the filter will be applied to all assays in the object
    inplace = TRUE, # if false, filtered matrix will be saved as a new assay.
    pNA = 0.6 # the maximal proportion of missing values per feature
)
pbf
```

After filtering, the `ProBatchFeatures` object stores fewer features,
which sharpens downstream diagnostics.

```{r plot_na_heatmap_filtered, fig.show='hold', fig.width=7, fig.height=4}
plot_NA_heatmap(
    pbf,
    pbf_name = c("peptide::raw", "protein::raw"),
    show_rownames = FALSE, show_row_dend = FALSE,
    color_by = "Lab", # currently only one level is supported
    border_color = NA # pheatmap parameter to remove cell borders
)
```

However, as the plots indicate, the remaining patterns of missingness are still associated with the lab.  
To remove the residual systematic effects from the data, we further perform normalization and batch-effect correction.

## Step 2 - log2-transformation

Proteomics data are commonly log-transformed to stabilize variance as required for furter statistical analyzes and data visualization. Here, we perform a log2-transformation of both peptide- and protein-level data:

```{r run_log_transform}
pbf <- log_transform_dm(
    pbf,
    log_base = 2, offset = 1,
    pbf_name = "protein::raw"
)

pbf <- log_transform_dm(
    pbf,
    log_base = 2, offset = 1,
    pbf_name = "peptide::raw"
)

pbf
```

After log-transformation, the `ProBatchFeatures` instance `pbf` contains additional assays (`peptide::log2_on_raw` and
`protein::log2_on_raw`). If the assay was transformed using "fast to recompute" step (for exmaple, log-transformation or median
normalization), the transformed data will not be saved in the list of assays,
but only re-computed for plotting. To force storing
the results of this steps `store_fast_steps = T` parameter can be added.


The intensity distributions of each sample after log-transformation can be visualized as boxplots to identify potential batch-specific shifts:

```{r plot_logmean, fig.height=6, fig.width=8, warning=FALSE}
plot_boxplot(
    pbf,
    pbf_name = c("peptide::log2_on_raw", "protein::log2_on_raw"), # plot multiple on one
    sample_id_col = "Run",
    order_col = "Run",
    batch_col = "Lab",
    color_by_batch = TRUE,
    color_scheme = color_scheme,
    base_size = 7,
    plot_ncol = 1, # how many columns use for plotting multy-assay plot
    plot_title = c( # title for each assay
        "Peptide level - log2 scale",
        "Protein level - log2 scale"
    )
)
```

Additionally, the intensity distribution of peptides and proteins with and without NAs
can be plotted:

```{r plot_na_density_panels, fig.show='hold', fig.width=10, fig.height=4}
plot_NA_density(
    pbf,
    pbf_name = c("peptide::log2_on_raw", "protein::log2_on_raw")
)
```
The comparison confirms that log2 stabilises the variance for both
peptides and proteins, yet the cross-lab shifts in median intensity
persist and will need to be addressed downstream.
 

Next, we will perform a Principal Component Analysis (PCA) to get a
high-level overview of the sample clustering. We expect to see a strong
batch effect, where samples cluster by their center of origin ("Lab")
rather than their biological condition ("Condition").

```{r plot_PCA, fig.show='hold', fig.width=8, fig.height=4}
pca1 <- plot_PCA(
    pbf,
    pbf_name = "protein::log2_on_raw",
    sample_id_col = "Run",
    color_scheme = color_scheme,
    color_by = "Lab",
    shape_by = "Condition",
    fill_the_missing = NULL, # remove all rows with missing values. Can be also "remove"
    plot_title = "NA rows removed, protein, log2",
    base_size = 10, point_size = 3, point_alpha = 0.5
)

pca2 <- plot_PCA(
    pbf,
    pbf_name = "protein::log2_on_raw",
    sample_id_col = "Run",
    color_scheme = color_scheme,
    color_by = "Condition",
    shape_by = "Lab",
    fill_the_missing = 0, # default value is -1
    plot_title = "NA replaced with 0, protein, log2",
    base_size = 10, point_size = 3, point_alpha = 0.5
)

library(gridExtra)
grid.arrange(pca1, pca2, ncol = 2, nrow = 1)
```

Simiarly, the PCA plots clearly demonstrate that the primary source of
variance in the raw data is the lab.

## Step 3 - median normalization

To make the samples more comparable by removing the effect of sample load on intensities, we apply median
normalization:

```{r apply_median_normalization}
pbf <- pb_transform(
    object = pbf,
    from = "peptide::log2_on_raw",
    steps = "medianNorm"
)

pbf <- pb_transform(
    object = pbf,
    from = "protein::log2_on_raw",
    steps = "medianNorm"
)

pbf
```

`pb_transform()` appends `peptide::medianNorm_on_log2_on_raw` and
`protein::medianNorm_on_log2_on_raw`. We can now plot and compare these assays before and after normalization:

```{r compare_median_norm_diagnostics, fig.show='hold', fig.width=8, fig.height=4}
# boxplot
plot_boxplot(
    pbf,
    sample_id_col = "Run",
    order_col = "Run",
    batch_col = "Lab",
    color_by_batch = TRUE,
    color_scheme = color_scheme,
    base_size = 7,
    pbf_name = c("protein::log2_on_raw", "protein::medianNorm_on_log2_on_raw"),
    plot_ncol = 1,
    plot_title = c(
        "Before Median Normalization, protein level",
        "After Median Normalization, protein level"
    )
)

# plot PCA plot
plot_PCA(
    pbf,
    pbf_name = c("protein::log2_on_raw", "protein::medianNorm_on_log2_on_raw"),
    sample_id_col = "Run",
    color_scheme = color_scheme,
    color_by = "Lab",
    shape_by = "Condition",
    fill_the_missing = NULL, # remove all rows with missing values. Can be also "remove"
    plot_title = c(
        "Before Median Normalization, protein level",
        "After Median Normalization, protein level"
    ),
    base_size = 10, point_size = 3, point_alpha = 0.5
)
```

Although median normalization made itensity distributions more similar, the residual lab-specific batch effects remain evident on the PCA plots and need to be corrected.

## Step 4 - batch-effect correction

The user can choose from two methods for batch effects correction: `ComBat` [@Johnson2006] from **sva** and `removeBatchEffect()` from **limma** [@Ritchie2015]. Both methods can be accessed via the same function `pb_transform()`.

```{r prepare_sample_annotations}
# sample annotations used by both correction methods
sa <- colData(pbf) %>% as.data.frame()
```

Apply **limma**'s `removeBatchEffect()` to log-transformed and normalized data:

```{r apply_limma_correction}
# limma::removeBatchEffect method
pbf <- pb_transform(
    pbf,
    from = "protein::log2_on_raw",
    steps = "limmaRBE",
    params_list = list(list(
        sample_annotation = sa,
        batch_col = "Lab",
        covariates_cols = c("Condition"),
        fill_the_missing = FALSE
    ))
)

pbf <- pb_transform(
    pbf,
    from = "protein::medianNorm_on_log2_on_raw",
    steps = "limmaRBE",
    params_list = list(list(
        sample_annotation = sa,
        batch_col = "Lab",
        covariates_cols = c("Condition"),
        fill_the_missing = FALSE
    ))
)
```

As batch effect correction does not listed as a `fast step`, two new protein-level assay are added to the `pbd`: 
- [3] protein::limmaRBE_on_log2_on_raw and 
- [4] protein::limmaRBE_on_medianNorm_on_log2_on_raw.

```{r inspect_pbf_after_limma}
print(pbf)
```

Alternatively, batch effect correction can be performed using `ComBat`:

```{r apply_combat_correction}
# sva::ComBat wrapped via pb_transform()
pbf <- pb_transform(
    pbf,
    from = "protein::log2_on_raw",
    steps = "combat",
    params_list = list(list(
        sample_annotation = sa,
        batch_col = "Lab",
        sample_id_col = "Run",
        par.prior = TRUE,
        fill_the_missing = "remove"
    ))
)

pbf <- pb_transform(
    pbf,
    from = "protein::medianNorm_on_log2_on_raw",
    steps = "combat",
    params_list = list(list(
        sample_annotation = sa,
        batch_col = "Lab",
        sample_id_col = "Run",
        par.prior = TRUE,
        fill_the_missing = "remove"
    ))
)
```
The `ComBat` method from the **sva** package cannot be applied to data containing missing values.  
Therefore, features with any missing values are removed by default.  
Alternatively, missing values can be imputed with a user-defined constant using the `fill_the_missing` parameter (default: `-1`).

```{r inspect_pbf_after_combat}
print(pbf)
```

Similarly, two new protein-level assays were added to `pdf`: 
- [5] protein::combat_on_log2_on_raw: ... with 244 rows 
- [6] protein::combat_on_medianNorm_on_log2_on_raw: ... with 244 rows.

And as we removed features with missing values, these assays contain less
protein groups than the raw filtered data (331 rows).

All steps can be applied to peptide-level data in the same way:

```{r apply_batch_correction_peptide}
pbf <- pb_transform(
    pbf,
    from = "peptide::medianNorm_on_log2_on_raw",
    steps = "limmaRBE",
    params_list = list(
        list(
            sample_annotation = sa,
            batch_col = "Lab",
            covariates_cols = c("Condition"),
            fill_the_missing = FALSE
        )
    )
)

pbf <- pb_transform(
    pbf,
    from = "peptide::medianNorm_on_log2_on_raw",
    steps = "combat",
    params_list = list(
        list(
            sample_annotation = sa,
            batch_col = "Lab",
            sample_id_col = "Run",
            par.prior = TRUE,
            fill_the_missing = "remove"
        )
    )
)

pbf
```

Two new peptide-level assays were added to `pdf`: 
- [7] peptide::limmaRBE_on_medianNorm_on_log2_on_raw: ... with  867 rows 
- [8] peptide::combat_on_medianNorm_on_log2_on_raw: ... with  152 rows (as NA-containing features were removed before ComBat).

## Step 5 - assess processed assays

The four new assays (`protein::limmaRBE_on_…` and `protein::combat_on_…`) are now stored alongside the previous processing steps. We can explore these assays using PCA to verify that the batch effect has been successfully removed and compare them to the data before batch effect correction.

### Principal component analysis

The progression from log$_2$-transformed to batch-corrected assays demonstrates the expected reduction in lab-specific variance and a clearer separation by biological condition in the final panel.

```{r plot_pca_limma_series, fig.height=7, fig.show='hold', fig.width=8, message=FALSE, warning=FALSE}
plot_PCA(
    pbf,
    pbf_name = c(
        "protein::log2_on_raw",
        "protein::medianNorm_on_log2_on_raw",
        "protein::limmaRBE_on_medianNorm_on_log2_on_raw",
        "protein::combat_on_medianNorm_on_log2_on_raw"
    ),
    sample_id_col = "Run",
    color_scheme = color_scheme,
    color_by = "Lab",
    shape_by = "Condition",
    fill_the_missing = NULL,
    base_size = 8, point_size = 3, point_alpha = 0.5
)
```

Similarly, for peptide-level data:

```{r plot_pca_peptides, fig.height=7, fig.show='hold', fig.width=8, message=FALSE, warning=FALSE}
plot_PCA(
    pbf,
    pbf_name = c(
        "peptide::log2_on_raw",
        "peptide::medianNorm_on_log2_on_raw",
        "peptide::limmaRBE_on_medianNorm_on_log2_on_raw",
        "peptide::combat_on_medianNorm_on_log2_on_raw"
    ),
    sample_id_col = "Run",
    color_scheme = color_scheme,
    color_by = "Lab",
    shape_by = "Condition",
    fill_the_missing = "remove",
    base_size = 8, point_size = 3, point_alpha = 0.5
)
```

### Hierarchical clustering

Hierarchical clustering provides another way to explore sample similarities.
Here we display clustering results for `protein::combat_on_medianNorm_on_log2_on_raw` and
`protein::limmaRBE_on_medianNorm_on_log2_on_raw` assays and colour
samples by lab and condition:

```{r plot_hclust_combat, fig.show='hold', fig.width=12, fig.height=4}
plot_hierarchical_clustering(
    pbf,
    pbf_name = "protein::combat_on_medianNorm_on_log2_on_raw",
    sample_id_col = "Run",
    label_font = 0.6,
    color_list = color_scheme
)
```

```{r plot_hclust_limma, fig.show='hold', fig.width=12, fig.height=4}
plot_hierarchical_clustering(
    pbf,
    pbf_name = "protein::limmaRBE_on_medianNorm_on_log2_on_raw",
    sample_id_col = "Run",
    label_font = 0.6,
    color_list = color_scheme,
    fill_the_missing = "remove"
)
```

### Principal Variance Component Analysis (PVCA)

PVCA quantifies the contribution of known factors to the observed
variance [@bushel2025pvca] which can be compared across raw, normalised, and ComBat-corrected assays:

```{r plot_pvca_panels, fig.height=6, fig.show='hold', fig.width=5, message=FALSE, warning=FALSE}
plot_PVCA(
    pbf,
    pbf_name = c(
        "protein::medianNorm_on_log2_on_raw",
        "protein::combat_on_medianNorm_on_log2_on_raw"
    ),
    sample_id_col = "Run",
    technical_factors = c("Lab"),
    biological_factors = c("Condition"),
    fill_the_missing = NULL,
    base_size = 7,
    plot_ncol = 2, # the number of plots in a row
    variance_threshold = 0 # the the percentile value of weight each of the
    # covariates needs to explain
    # (the rest will be lumped together)
)
```


# Inspecting operation log

Each call to `pb_transform()` or `pb_filterNA()` creates a log entry that records the source assay, the applied operation, and the resulting assay name. Reviewing this log ensures that complex processing pipelines remain transparent and reproducible.

```{r show_operation_log}
get_operation_log(pbf) %>%
    as_tibble()
```

If you need a compact name for the current assay chain (for example, to
label benchmark results), use
`pb_pipeline_name(pbf, "protein::combat_on_medianNorm_on_log2_on_raw")`.

```{r compute_pipeline_name}
pb_pipeline_name(pbf, "protein::combat_on_medianNorm_on_log2_on_raw")
```

## Extracting data matrices from `pbf` object

It is possible to extract a specific data matrix from a `ProBatchFeatures`
object using the assay name:

```{r matr}
extracted_matrix <- pb_assay_matrix(
    pbf,
    assay = "peptide::raw"
)
# show
extracted_matrix[1:5, 1:5]
rm(extracted_matrix)
```

For consistency with the main **ProBatch** vignette, assay can also be extracted in the long format. In this case, the `sample_id_col` needs to be specified:

```{r extract_to_long}
extracted_long <- pb_as_long(
    pbf,
    sample_id_col = "Run",
    pbf_name = "peptide::raw"
)

# show
extracted_long[1:3, ]
rm(extracted_long)
```

# Session info

```{r sessionInfo}
sessionInfo()
```
# Citation

To cite this package, please use:

```{r citation}
citation("proBatch")
```

# References