--- title: "LimROTS: A Hybrid Method Integrating Empirical Bayes and Reproducibility-Optimized Statistics for Robust Differential Expression Analysis" bibliography: references.bib output: BiocStyle::html_document: fig_height: 7 fig_width: 7 toc: true toc_float: true toc_depth: 3 number_sections: true vignette: > %\VignetteIndexEntry{LimROTS: Overview and Differential Expression Analysis} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: markdown: wrap: 72 --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction Differential expression analysis is commonly used to study diverse biological datasets. The reproducibility-optimized test statistic (ROTS) ([Elo et al., 2008](https://doi.org/10.1109/tcbb.2007.1078)) uses a modified t-statistic to prioritise features that differ between two or more groups. However, the ROTS Bioconductor implementation ([Suomi et al., 2017](https://doi.org/10.1371/journal.pcbi.1005562)) did not accommodate technical or biological covariates. LimROTS ([Anwar et al., 2025](https://doi.org/10.1093/bioinformatics/btaf570)) addressed this limitation by combining a reproducibility-optimized test statistic with the limma empirical Bayes approach ([Ritchie et al., 2015](https://doi.org/10.1093/nar/gkv007)). This enables the analysis of more complex experimental designs and the incorporation of covariates. These validated solutions were made in Bioconductor development version 3.20 and the subsequent release 3.21. Related but different linear modeling features were subsequently incorporated also into ROTS Bioconductor package, but have not been formally published to our knowledge. Both ROTS and LimROTS added linear model extensions for survival analysis in the Bioconductor devel version 3.23. # Citation If you use `LimROTS` in your research, please cite: > Anwar, A. M., Jeba, A., Lahti, L., & Coffey, E. (2025). LimROTS: A > Hybrid Method Integrating Empirical Bayes and > Reproducibility-Optimized Statistics for Robust Differential > Expression Analysis. *Bioinformatics*, btaf570. > # Disclaimer LimROTS is a standalone package that extends ROTS ([Elo et al., 2008](https://ieeexplore.ieee.org/document/4359873); [Tomi Suomi et al., 2017](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005562)) and the limma ([Ritchie ME et al., 2015](https://academic.oup.com/nar/article/43/7/e47/2414268)) framework. It is a newly developed, independently implemented method that incorporates covariates in reproducibility-optimized testing. LimROTS is not affiliated with, endorsed by, or maintained by the original ROTS or limma developers. Users are advised to cite the original publications when referencing these methods. ROTS: Elo LL, Filén S, Lahesmaa R, Aittokallio T. Reproducibility-optimized test statistic for ranking genes in microarray studies. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2008;5(3):423–31. Suomi T, Seyednasrollah F, Jaakkola MK, Faux T, Elo LL (2017) ROTS: An R package for reproducibility-optimized statistical testing. PLOS Computational Biology 13(5): e1005562. limma: Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research. 2015;43(7):e47. # Algorithm Overview The **LimROTS** workflow proceeds as follows: 1. A linear model is fitted to the data using a user-supplied design matrix via limma @limma. 2. Empirical Bayes variance shrinkage is applied to obtain moderated statistics. The adjusted standard error for each feature is: $SE_{\text{post}} = \sqrt{s^2_{\text{post}}} \times \text{unscaled SD}$, along with the regression coefficient $\beta_{(p)}$. 3. A reproducibility-optimization step selects parameters $\alpha_1$ and $\alpha_2$ by maximizing the overlap of top-ranked features across group-preserving bootstrap datasets. 4. The final statistic for each feature is: $$t_{\alpha(p)} = \frac{\beta_{(p)}}{\alpha_1 + \alpha_2 \times SE_{\text{post}(p)}}$$ where $t_{\alpha(p)}$ is the optimized statistic, $\beta_{(p)}$ is the model coefficient, and $SE_{\text{post}(p)}$ is the adjusted standard error. P-values are derived from permutation-based null distributions using the `qvalue` package @qvalue. FDR is estimated using an internal implementation adapted from ROTS @rots, and q-values are computed via the bootstrap method for the proportion of true null hypotheses. # Computational Considerations LimROTS runtime depends on four primary factors: the number of samples, features, bootstrap iterations (`niter`), and the top list size (`K`). The bootstrapping step is parallelized using `BiocParallel` @BiocParallel; we recommend using at least 4 workers for routine analyses. The optimization step is sequential and may be time-consuming for large values of `K`. # LimROTS Function Parameters `LimROTS()` is the primary interface for end users. For a full description of all parameters, refer to the function documentation: ```{r, eval=FALSE} ?LimROTS ``` # UPS1 Case Study ## Dataset Description To illustrate LimROTS in a realistic multi-covariate scenario, we use a DIA proteomics dataset from a UPS1-spiked *E. coli* protein mixture @GOTTI2022107829. The dataset contains 1,656 proteins measured across 48 samples: 24 processed with Spectronaut and 24 with ScaffoldDIA. Eight UPS1 concentrations (0.1–50 fmol) are grouped into two conditions: - **Low concentration** (0.1–2.5 fmol, `Conc. 2`): 12 samples per software. - **High concentration** (5–50 fmol, `Conc. 1`): 12 samples per software. A synthetic unbalanced batch variable (`fake.batch`) was assigned across samples with the following distribution: ```{r, echo=FALSE, message=FALSE, results="hide"} library(LimROTS, quietly = TRUE) data("UPS1.Case4") ``` ```{r echo=FALSE} table(UPS1.Case4$fake.batch, UPS1.Case4$Conc.) ``` Additionally, a simulated batch effect was introduced: an effect size of 10 was added to 100 randomly selected *E. coli* proteins log2 values in one of the fake batches. The expected result is that only the 48 UPS1 human proteins are truly differentially expressed; no *E. coli* protein should show a genuine concentration effect. This scenario mirrors real-world conditions such as unbalanced batch structures or unequal covariate distributions across groups. ## Loading the Data ```{r load-packages, message=FALSE} library(LimROTS, quietly = TRUE) library(BiocParallel, quietly = TRUE) library(ggplot2, quietly = TRUE) library(SummarizedExperiment, quietly = TRUE) ``` ```{r load-dataset} data("UPS1.Case4") print(UPS1.Case4) ``` ## Running LimROTS Set a seed before running LimROTS to ensure reproducibility. ```{r set-seed} set.seed(1234, kind = "default", sample.kind = "default") ``` > **Note:** The group column in `colData` must be a `factor` with levels > defined in the intended contrast order. For example, `Conc. 1` (high) > vs `Conc. 2` (low) reports the fold-change as high − low. Define this > explicitly if needed: > > ``` r > colData(UPS1.Case4)$Conc. <- factor( > colData(UPS1.Case4)$Conc., > levels = c("1", "2") > ) > ``` ```{r run-limrots} # Define analysis parameters meta.info <- c("Conc.", "tool", "fake.batch") niter <- 100 # bootstrap iterations (use >= 1000 for real analyses) K <- 100 # top list size for reproducibility optimization formula.str <- "~ 0 + Conc. + tool + fake.batch" # Run LimROTS UPS1.Case4 <- LimROTS( x = UPS1.Case4, niter = niter, K = K, meta.info = meta.info, group = "Conc.", formula.str = formula.str, trend = TRUE, robust = TRUE, permutating.group = FALSE ) ``` > **Note:** `niter = 100` and `K = 100` are used here to reduce vignette > build time. For production analyses, use `niter = 1000` or higher. We > also recommend at least 4 parallel workers. To increase the number of workers, pass a `BPPARAM` argument: ```{r bpparam-example} # Windows # BPPARAM <- SnowParam(workers = 4) # Linux / macOS # BPPARAM <- MulticoreParam(workers = 4) # Pass to LimROTS via: LimROTS(..., BPPARAM = BPPARAM) ``` The results are appended directly to the `rowData` and `metadata` slots of the input `SummarizedExperiment` object. ## Volcano Plot The volcano plot below shows corrected log-fold changes versus $-\log_{10}$(q-value). At a 5% q-value threshold, LimROTS recovers the majority of true UPS1 human proteins while flagging very few *E. coli* proteins. ```{r volcano-plot} # Extract results result_df <- data.frame( rowData(UPS1.Case4), row.names = rownames(UPS1.Case4) ) # Annotate proteins result_df$label <- ifelse( grepl("HUMAN", result_df$GeneID), "UPS1 (true positive)", "E. coli (true negative)" ) ggplot(result_df, aes( x = corrected.logfc, y = -log10(qvalue), color = label )) + geom_point(alpha = 0.7, size = 1.2) + geom_hline( yintercept = -log10(0.05), linetype = "dashed", color = "steelblue" ) + scale_color_manual(values = c( "E. coli (true negative)" = "grey60", "UPS1 (true positive)" = "firebrick" )) + labs( title = "LimROTS \u2014 UPS1 Case Study", x = "Corrected Log Fold Change", y = expression(-log[10](q~value)), color = NULL ) + theme_bw(base_size = 12) + theme(legend.position = "top") ``` ## Quality Control LimROTS produces permutation-derived p-values and q-values that can be inspected visually. These are the recommended primary inference statistics. ```{r quality-control, results="hide", message=FALSE, warning=FALSE} plot(metadata(UPS1.Case4)[["q_values"]]) hist(metadata(UPS1.Case4)[["q_values"]]) print(summary(metadata(UPS1.Case4)[["q_values"]])) ``` ```{r session-info} sessionInfo() ``` # References