---
title: "LimROTS: A Hybrid Method Integrating Empirical Bayes and 
        Reproducibility-Optimized Statistics for Robust 
        Differential Expression Analysis"
bibliography: references.bib
output:
    BiocStyle::html_document:
        fig_height: 7
        fig_width: 7
        toc: true
        toc_float: true
        toc_depth: 3
        number_sections: true
vignette: >
    %\VignetteIndexEntry{LimROTS: Overview and Differential Expression Analysis}
    %\VignetteEncoding{UTF-8}
    %\VignetteEngine{knitr::rmarkdown}
editor_options:
  markdown:
    wrap: 72
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

# Introduction

Differential expression analysis is commonly used to study diverse
biological datasets. The reproducibility-optimized test statistic (ROTS)
([Elo et al., 2008](https://doi.org/10.1109/tcbb.2007.1078)) uses
a modified t-statistic to prioritise features that differ between two or
more groups. However, the ROTS Bioconductor implementation ([Suomi et
al.,
2017](https://doi.org/10.1371/journal.pcbi.1005562))
did not accommodate technical or biological covariates. LimROTS ([Anwar
et al., 2025](https://doi.org/10.1093/bioinformatics/btaf570)) addressed
this limitation by combining a reproducibility-optimized test statistic
with the limma empirical Bayes approach ([Ritchie et al.,
2015](https://doi.org/10.1093/nar/gkv007)). This
enables the analysis of more complex experimental designs and the
incorporation of covariates. These validated solutions were made in
Bioconductor development version 3.20 and the subsequent release 3.21.
Related but different linear modeling features were subsequently
incorporated also into ROTS Bioconductor package, but have not been
formally published to our knowledge. Both ROTS and LimROTS added linear
model extensions for survival analysis in the Bioconductor devel version
3.23.

# Citation

If you use `LimROTS` in your research, please cite:

> Anwar, A. M., Jeba, A., Lahti, L., & Coffey, E. (2025). LimROTS: A
> Hybrid Method Integrating Empirical Bayes and
> Reproducibility-Optimized Statistics for Robust Differential
> Expression Analysis. *Bioinformatics*, btaf570.
> <https://doi.org/10.1093/bioinformatics/btaf570>

# Disclaimer

LimROTS is a standalone package that extends ROTS ([Elo et al.,
2008](https://ieeexplore.ieee.org/document/4359873); [Tomi Suomi et al.,
2017](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005562))
and the limma ([Ritchie ME et al.,
2015](https://academic.oup.com/nar/article/43/7/e47/2414268)) framework.
It is a newly developed, independently implemented method that
incorporates covariates in reproducibility-optimized testing. LimROTS is
not affiliated with, endorsed by, or maintained by the original ROTS or
limma developers. Users are advised to cite the original publications
when referencing these methods.

ROTS: Elo LL, Filén S, Lahesmaa R, Aittokallio T.
Reproducibility-optimized test statistic for ranking genes in microarray
studies. IEEE/ACM Transactions on Computational Biology and
Bioinformatics. 2008;5(3):423–31.

Suomi T, Seyednasrollah F, Jaakkola MK, Faux T, Elo LL (2017) ROTS: An R
package for reproducibility-optimized statistical testing. PLOS
Computational Biology 13(5): e1005562.
<https://doi.org/10.1371/journal.pcbi.1005562>

limma: Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma
powers differential expression analyses for RNA-sequencing and
microarray studies. Nucleic Acids Research. 2015;43(7):e47.
<https://doi.org/10.1093/nar/gkv007>

# Algorithm Overview

The **LimROTS** workflow proceeds as follows:

1.  A linear model is fitted to the data using a user-supplied design
    matrix via limma @limma.
2.  Empirical Bayes variance shrinkage is applied to obtain moderated
    statistics. The adjusted standard error for each feature is:
    $SE_{\text{post}} = \sqrt{s^2_{\text{post}}} \times \text{unscaled SD}$,
    along with the regression coefficient $\beta_{(p)}$.
3.  A reproducibility-optimization step selects parameters $\alpha_1$
    and $\alpha_2$ by maximizing the overlap of top-ranked features
    across group-preserving bootstrap datasets.
4.  The final statistic for each feature is:

$$t_{\alpha(p)} = \frac{\beta_{(p)}}{\alpha_1 + \alpha_2 \times SE_{\text{post}(p)}}$$

where $t_{\alpha(p)}$ is the optimized statistic, $\beta_{(p)}$ is the
model coefficient, and $SE_{\text{post}(p)}$ is the adjusted standard
error.

P-values are derived from permutation-based null distributions using the
`qvalue` package @qvalue. FDR is estimated using an internal
implementation adapted from ROTS @rots, and q-values are computed via
the bootstrap method for the proportion of true null hypotheses.

# Computational Considerations

LimROTS runtime depends on four primary factors: the number of samples,
features, bootstrap iterations (`niter`), and the top list size (`K`).
The bootstrapping step is parallelized using `BiocParallel`
@BiocParallel; we recommend using at least 4 workers for routine
analyses. The optimization step is sequential and may be time-consuming
for large values of `K`.

# LimROTS Function Parameters

`LimROTS()` is the primary interface for end users. For a full
description of all parameters, refer to the function documentation:

```{r, eval=FALSE}
?LimROTS
```

# UPS1 Case Study

## Dataset Description

To illustrate LimROTS in a realistic multi-covariate scenario, we use a
DIA proteomics dataset from a UPS1-spiked *E. coli* protein mixture
@GOTTI2022107829. The dataset contains 1,656 proteins measured across 48
samples: 24 processed with Spectronaut and 24 with ScaffoldDIA. Eight
UPS1 concentrations (0.1–50 fmol) are grouped into two conditions:

-   **Low concentration** (0.1–2.5 fmol, `Conc. 2`): 12 samples per
    software.
-   **High concentration** (5–50 fmol, `Conc. 1`): 12 samples per
    software.

A synthetic unbalanced batch variable (`fake.batch`) was assigned across
samples with the following distribution:

```{r, echo=FALSE, message=FALSE, results="hide"}
library(LimROTS, quietly = TRUE)
data("UPS1.Case4")
```

```{r echo=FALSE}
table(UPS1.Case4$fake.batch, UPS1.Case4$Conc.)
```

Additionally, a simulated batch effect was introduced: an effect size of
10 was added to 100 randomly selected *E. coli* proteins log2 values in
one of the fake batches. The expected result is that only the 48 UPS1
human proteins are truly differentially expressed; no *E. coli* protein
should show a genuine concentration effect.

This scenario mirrors real-world conditions such as unbalanced batch
structures or unequal covariate distributions across groups.

## Loading the Data

```{r load-packages, message=FALSE}
library(LimROTS, quietly = TRUE)
library(BiocParallel, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(SummarizedExperiment, quietly = TRUE)
```

```{r load-dataset}
data("UPS1.Case4")
print(UPS1.Case4)
```

## Running LimROTS

Set a seed before running LimROTS to ensure reproducibility.

```{r set-seed}
set.seed(1234, kind = "default", sample.kind = "default")
```

> **Note:** The group column in `colData` must be a `factor` with levels
> defined in the intended contrast order. For example, `Conc. 1` (high)
> vs `Conc. 2` (low) reports the fold-change as high − low. Define this
> explicitly if needed:
>
> ``` r
> colData(UPS1.Case4)$Conc. <- factor(
>     colData(UPS1.Case4)$Conc.,
>     levels = c("1", "2")
> )
> ```

```{r run-limrots}
# Define analysis parameters
meta.info   <- c("Conc.", "tool", "fake.batch")
niter       <- 100   # bootstrap iterations (use >= 1000 for real analyses)
K           <- 100   # top list size for reproducibility optimization
formula.str <- "~ 0 + Conc. + tool + fake.batch"

# Run LimROTS
UPS1.Case4 <- LimROTS(
    x           = UPS1.Case4,
    niter       = niter,
    K           = K,
    meta.info   = meta.info,
    group       = "Conc.",
    formula.str = formula.str,
    trend       = TRUE,
    robust      = TRUE,
    permutating.group = FALSE
)
```

> **Note:** `niter = 100` and `K = 100` are used here to reduce vignette
> build time. For production analyses, use `niter = 1000` or higher. We
> also recommend at least 4 parallel workers.

To increase the number of workers, pass a `BPPARAM` argument:

```{r bpparam-example}
# Windows
# BPPARAM <- SnowParam(workers = 4)

# Linux / macOS
# BPPARAM <- MulticoreParam(workers = 4)

# Pass to LimROTS via: LimROTS(..., BPPARAM = BPPARAM)
```

The results are appended directly to the `rowData` and `metadata` slots
of the input `SummarizedExperiment` object.

## Volcano Plot

The volcano plot below shows corrected log-fold changes versus
$-\log_{10}$(q-value). At a 5% q-value threshold, LimROTS recovers the
majority of true UPS1 human proteins while flagging very few *E. coli*
proteins.

```{r volcano-plot}
# Extract results
result_df <- data.frame(
    rowData(UPS1.Case4),
    row.names = rownames(UPS1.Case4)
)

# Annotate proteins
result_df$label <- ifelse(
    grepl("HUMAN", result_df$GeneID),
    "UPS1 (true positive)",
    "E. coli (true negative)"
)

ggplot(result_df, aes(
    x     = corrected.logfc,
    y     = -log10(qvalue),
    color = label
)) +
    geom_point(alpha = 0.7, size = 1.2) +
    geom_hline(
        yintercept = -log10(0.05),
        linetype   = "dashed",
        color      = "steelblue"
    ) +
    scale_color_manual(values = c(
        "E. coli (true negative)" = "grey60",
        "UPS1 (true positive)"    = "firebrick"
    )) +
    labs(
        title = "LimROTS \u2014 UPS1 Case Study",
        x     = "Corrected Log Fold Change",
        y     = expression(-log[10](q~value)),
        color = NULL
    ) +
    theme_bw(base_size = 12) +
    theme(legend.position = "top")
```

## Quality Control

LimROTS produces permutation-derived p-values and q-values that can be
inspected visually. These are the recommended primary inference
statistics.

```{r quality-control, results="hide", message=FALSE, warning=FALSE}
plot(metadata(UPS1.Case4)[["q_values"]])
hist(metadata(UPS1.Case4)[["q_values"]])
print(summary(metadata(UPS1.Case4)[["q_values"]]))
```

```{r session-info}
sessionInfo()
```

# References