This vignette gives a broad overview of the main functions for applying
roastgsa to RNA-seq data. A more exhaustive example that
explores the roastgsa functionality is presented in the
“roastgsa vignette (main)”. All the analyses explained in the main
vignette can be reproduced for RNA-seq data after completing the steps
covered here in the section “Data normalization and filtering”.
We consider the first dataset available in the tcga
compendium from the GSEABenchmarkeR package [1], which
consists of an RNA-seq study with 19 tumor Bladder Urothelial Carcinoma
samples and 19 adjacent healthy tissue samples.
#library(GSEABenchmarkeR)
#tcga <- loadEData("tcga", nr.datasets=1,cache = TRUE)
#ysel <- assays(tcga[[1]])$expr
#fd <- rowData(tcga[[1]])
#pd <- colData(tcga[[1]])
data(fd.tcga)
data(pd.tcga)
data(expr.tcga)
fd <- fd.tcga
ysel <- expr.tcga
pd <- pd.tcga
N <- ncol(ysel)
head(pd)
## DataFrame with 6 rows and 4 columns
##                                              sample     type     GROUP
##                                         <character> <factor> <numeric>
## TCGA-K4-A3WV-01A-11R-A22U-07 TCGA-K4-A3WV-01A-11R..     BLCA         1
## TCGA-BT-A20W-01A-21R-A14Y-07 TCGA-BT-A20W-01A-21R..     BLCA         1
## TCGA-K4-A5RI-01A-11R-A28M-07 TCGA-K4-A5RI-01A-11R..     BLCA         1
## TCGA-BT-A20N-01A-11R-A14Y-07 TCGA-BT-A20N-01A-11R..     BLCA         1
## TCGA-BL-A13J-01A-11R-A277-07 TCGA-BL-A13J-01A-11R..     BLCA         1
## TCGA-BT-A20U-01A-11R-A14Y-07 TCGA-BT-A20U-01A-11R..     BLCA         1
##                                     BLOCK
##                               <character>
## TCGA-K4-A3WV-01A-11R-A22U-07 TCGA-K4-A3WV
## TCGA-BT-A20W-01A-21R-A14Y-07 TCGA-BT-A20W
## TCGA-K4-A5RI-01A-11R-A28M-07 TCGA-K4-A5RI
## TCGA-BT-A20N-01A-11R-A14Y-07 TCGA-BT-A20N
## TCGA-BL-A13J-01A-11R-A277-07 TCGA-BL-A13J
## TCGA-BT-A20U-01A-11R-A14Y-07 TCGA-BT-A20U
cnames <- c("BLOCK","GROUP")
covar <- data.frame(pd[,cnames,drop=FALSE])
covar$GROUP <- as.factor(covar$GROUP)
colnames(covar) <- cnames
print(table(covar$GROUP))
##
##  0  1
## 19 19
To apply roastgsa, the expression data should be
approximately normally distributed, at least in their univariate form.
Depending on the user’s preferred method for differential expression
analysis, counts transformation methods such as rlog or
vst (DESeq2) [2], zscoreDGE
(edgeR) [3] or voom (limma) [4],
can be applied. In the paper we explored the type I and type II errors
when applying the rlog or vst transformation
followed by roastgsa, showing both good control of type I
errors and decent true discovery rates. In the example presented here, we
transform the expression data with the vst function from the
DESeq2 R package.
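library(DESeq2)
# Build the DESeq2 object; the design includes BLOCK (patient barcode) and GROUP
dds1 <- DESeqDataSetFromMatrix(countData=ysel,colData=pd,
                               design= ~ BLOCK + GROUP)
dds1 <- estimateSizeFactors(dds1)
# Variance-stabilized expression matrix (genes in rows, samples in columns)
ynorm <- assays(vst(dds1))[[1]]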
colnames(ynorm) <- rownames(covar) <- paste0("s",1:ncol(ynorm))
Another key step before using the roastgsa methods for enrichment
analysis is to filter out lowly expressed genes, where coverage might be a
limitation for detecting truly differentially expressed genes. For the
TCGA data considered here, the default filter employed by the authors
when loading the data was to exclude genes with cpm < 2 in more than
half of the samples. A short discussion about the relationship between
gene coverage and statistical power for the roastgsa
approach is available in our article presenting the
roastgsa package.
## [1] 3621 38
## [1] 88.26316
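The cpm rule described above can be sketched in base R as follows. This is only an illustration on a made-up count matrix, not the code used to prepare the TCGA data; the names `toy_counts`, `min_cpm` and `keep` are hypothetical, and library sizes are taken as plain column sums without any normalization factors.

```r
# Toy count matrix: 4 genes (rows) x 4 samples (columns); values are made up
# so that every library size is exactly one million reads.
toy_counts <- matrix(
  c(500000, 400000, 450000, 480000,
    499990, 599995, 549998, 519999,
         9,      4,      1,      0,
         1,      1,      1,      1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("s", 1:4))
)

# Counts per million: divide each column by its library size, scale by 1e6.
lib_size <- colSums(toy_counts)
toy_cpm <- t(t(toy_counts) / lib_size) * 1e6

# Exclude genes with cpm < 2 in more than half of the samples.
min_cpm <- 2
n_low <- rowSums(toy_cpm < min_cpm)
keep <- n_low <= ncol(toy_counts) / 2
toy_filtered <- toy_counts[keep, , drop = FALSE]

rownames(toy_filtered)
```

Here gene3 falls below the threshold in exactly half of the samples and is kept, whereas gene4 is below it in all samples and is dropped.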
We consider the Broad hallmark gene sets [5], a classic repository of general biological functions, for battery gene set analysis. The gene sets for human are stored in the roastgsa package and can be loaded from it. The first six set names are:
## [1] "HALLMARK_TNFA_SIGNALING_VIA_NFKB" "HALLMARK_HYPOXIA"
## [3] "HALLMARK_CHOLESTEROL_HOMEOSTASIS" "HALLMARK_MITOTIC_SPINDLE"
## [5] "HALLMARK_WNT_BETA_CATENIN_SIGNALING" "HALLMARK_TGF_BETA_SIGNALING"
In this case, hallmarks.hs contains gene symbols whereas
the row names for ynorm are Entrez identifiers. We can set
the row names to symbols, since in this case the mapping is
one-to-one.
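A minimal sketch of such a relabelling on toy data is given here. The objects `toy_expr` and `toy_annot` and the column names `ENTREZID` and `SYMBOL` are hypothetical; with the real data the lookup table would come from the feature annotation in `fd`, whose column layout is not shown in this vignette.

```r
# Toy expression matrix with Entrez identifiers as row names (values are random).
set.seed(1)
toy_expr <- matrix(rnorm(9), nrow = 3,
                   dimnames = list(c("7157", "672", "1956"), paste0("s", 1:3)))

# Hypothetical annotation table linking Entrez identifiers to gene symbols.
toy_annot <- data.frame(
  ENTREZID = c("672", "1956", "7157"),
  SYMBOL   = c("BRCA1", "EGFR", "TP53"),
  stringsAsFactors = FALSE
)

# Look up the symbol for each row, in the row order of the expression matrix.
idx <- match(rownames(toy_expr), toy_annot$ENTREZID)
symbols <- toy_annot$SYMBOL[idx]

# Guard the one-to-one assumption: no missing and no duplicated symbols.
stopifnot(!anyNA(symbols), anyDuplicated(symbols) == 0)

rownames(toy_expr) <- symbols
rownames(toy_expr)
```

If the mapping were not one-to-one, the guard would stop the script, and duplicated or unmapped rows would need to be handled explicitly before renaming.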
Other gene set databases that could be applied to these data for
battery testing are presented in the roastgsa vignette
(gene set collections).
The comparison of interest can be specified by a numeric vector with length matching the number of columns in the design.
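To make this concrete, here is a small sketch with a made-up paired layout: three hypothetical patients, each contributing one sample per group. All `toy_*` names are illustrative. It shows that, with a formula of the form `~ BLOCK + GROUP`, the last design column carries the group effect, so a vector with a 1 in the last position selects the group comparison adjusted for patient.

```r
# Toy covariates: 3 patients (BLOCK), each with one sample in group 0 and one in group 1.
toy_covar <- data.frame(
  BLOCK = factor(rep(c("P1", "P2", "P3"), times = 2)),
  GROUP = factor(rep(c(0, 1), each = 3))
)

# Design matrix with an intercept, patient effects and the group effect.
toy_design <- model.matrix(~ BLOCK + GROUP, data = toy_covar)
colnames(toy_design)

# Contrast vector: one entry per design column, 1 for the coefficient of interest.
toy_contrast <- rep(0, ncol(toy_design))
toy_contrast[ncol(toy_design)] <- 1
names(toy_contrast) <- colnames(toy_design)
toy_contrast
```

The same construction is applied to the real covariates in the code that follows.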
form <- as.formula(paste0("~ ", paste0(cnames, collapse = "+")))
design <- model.matrix(form , data = covar)
terms <- colnames(design)
contrast <- rep(0, length(terms))
contrast[length(colnames(design))] <- 1
Below is the standard roastgsa call (under
competitive testing) for the maxmean and mean
statistics.
fit.maxmean <- roastgsa(ynorm, form = form, covar = covar,
contrast = contrast, index = hallmarks.hs, nrot = 500,
mccores = 1, set.statistic = "maxmean",
self.contained = FALSE, executation.info = FALSE)
f1 <- fit.maxmean$res
rownames(f1) <- gsub("HALLMARK_","",rownames(f1))
head(f1)
##                           total_genes measured_genes        est       nes
## G2M_CHECKPOINT                    200            188  1.1752646  3.849555
## E2F_TARGETS                       200            194  1.4694371  3.796728
## MYOGENESIS                        200            155 -0.8304752 -2.982035
## UNFOLDED_PROTEIN_RESPONSE         113            104  0.3749170  2.562442
## UV_RESPONSE_DN                    144            134 -0.6095027 -2.456606
## MTORC1_SIGNALING                  200            194  0.5399790  2.426912
##                                  pval  adj.pval
## G2M_CHECKPOINT            0.001996008 0.0332668
## E2F_TARGETS               0.001996008 0.0332668
## MYOGENESIS                0.001996008 0.0332668
## UNFOLDED_PROTEIN_RESPONSE 0.021956088 0.1871257
## UV_RESPONSE_DN            0.017964072 0.1871257
## MTORC1_SIGNALING          0.057884232 0.2894212
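For intuition on how the two set statistics differ, the sketch below computes a plain mean and a signed maxmean score from a vector of gene-level scores. It follows the commonly used maxmean definition of Efron and Tibshirani (average the positive parts and the negative parts separately over all genes in the set, then keep the larger of the two). The function names and the scores are hypothetical, and the internal implementation in roastgsa, including any standardization behind `est` and `nes`, may differ.

```r
# Plain mean of the gene-level scores in a set.
set_mean <- function(z) mean(z)

# Signed maxmean: compare the average positive part with the average negative
# part (both taken over ALL genes in the set) and return the dominant one,
# carrying its sign.
set_maxmean <- function(z) {
  pos <- mean(pmax(z, 0))
  neg <- mean(pmax(-z, 0))
  if (pos >= neg) pos else -neg
}

# Made-up scores: three clearly up-regulated genes, three mildly down-regulated
# ones and two unchanged.
z <- c(3, 2.5, 2, -0.5, -1, -1.5, 0, 0)

set_mean(z)     # 0.5625: the down-regulated genes partly cancel the up-regulated ones
set_maxmean(z)  # 0.9375: only the dominant direction contributes
```

In this toy set the mean is diluted by genes moving in the opposite direction, whereas maxmean reflects the stronger of the two directions, which is one reason the two statistics can rank gene sets differently.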
fit.mean <- roastgsa(ynorm, form = form, covar = covar,
contrast = contrast, index = hallmarks.hs, nrot = 500,
mccores = 1, set.statistic = "mean",
self.contained = FALSE, executation.info = FALSE)
f2 <- fit.mean$res
rownames(f2) <- gsub("HALLMARK_","",rownames(f2))
head(f2)
##                           total_genes measured_genes        est       nes
## E2F_TARGETS                       200            194  1.1896796  2.853741
## G2M_CHECKPOINT                    200            188  0.9287256  2.742169
## UNFOLDED_PROTEIN_RESPONSE         113            104  0.4270303  2.531644
## MYOGENESIS                        200            155 -0.7076989 -2.437318
## UV_RESPONSE_DN                    144            134 -0.5941011 -2.207927
## MYC_TARGETS_V2                     58             58  0.9175999  2.181858
##                                  pval  adj.pval
## E2F_TARGETS               0.001996008 0.0332668
## G2M_CHECKPOINT            0.001996008 0.0332668
## UNFOLDED_PROTEIN_RESPONSE 0.001996008 0.0332668
## MYOGENESIS                0.005988024 0.0748503
## UV_RESPONSE_DN            0.009980040 0.0998004
## MYC_TARGETS_V2            0.029940120 0.2106897
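The p-values printed in both tables are multiples of 1/501, which is consistent with a rotation p-value of the form (b + 1)/(nrot + 1) with `nrot = 500`, where b counts the rotations at least as extreme as the observed statistic. The adjusted values in the first five rows of `f2` also match a Benjamini-Hochberg correction over 50 gene sets. The check below is an inference from the printed numbers, not a statement about the internals of roastgsa; the values of `b` and the count of 50 tested sets are assumptions chosen to reproduce the table.

```r
nrot <- 500    # number of rotations used in the calls above
n_sets <- 50   # assumed number of tested hallmark gene sets

# Assumed counts of rotations at least as extreme, for the first five rows of f2.
b <- c(0, 0, 0, 2, 4)

# Rotation p-values; the smallest attainable value is 1 / (nrot + 1).
pval <- (b + 1) / (nrot + 1)

# Benjamini-Hochberg adjustment. Passing n = n_sets is valid here only because
# these are the five smallest p-values among all tested sets.
adj <- p.adjust(pval, method = "BH", n = n_sets)

round(pval, 9)  # 0.001996008 0.001996008 0.001996008 0.005988024 0.009980040
round(adj, 7)   # 0.0332668 0.0332668 0.0332668 0.0748503 0.0998004
```

A practical consequence is that, with 500 rotations, no gene set can obtain a p-value below about 0.002, so increasing `nrot` is the way to get finer resolution for the top-ranked sets.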
Several graphics can be obtained to complement the table results in
f1 and f2. Here we only show the heatmaps that
summarize the expression patterns obtained for all tested hallmarks.
A full description of all graphical options available in the
roastgsa package, and of their usage, is given in the
roastgsa vignette for array data and in the
roastgsa manual.
hm1 <- heatmaprgsa_hm(fit.maxmean, ynorm, intvar = "GROUP", whplot = 1:50,
    toplot = TRUE, pathwaylevel = TRUE, mycol = c("orange","green",
    "white"), sample2zero = FALSE)
hm2 <- heatmaprgsa_hm(fit.mean, ynorm, intvar = "GROUP", whplot = 1:50,
    toplot = TRUE, pathwaylevel = TRUE, mycol = c("orange","green",
    "white"), sample2zero = FALSE)
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] DESeq2_1.50.1 SummarizedExperiment_1.40.0
## [3] Biobase_2.70.0 MatrixGenerics_1.22.0
## [5] matrixStats_1.5.0 GenomicRanges_1.62.0
## [7] Seqinfo_1.0.0 IRanges_2.44.0
## [9] S4Vectors_0.48.0 BiocGenerics_0.56.0
## [11] generics_0.1.4 roastgsa_1.8.0
## [13] knitr_1.50 BiocStyle_2.38.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.54 bslib_0.9.0
## [4] ggplot2_4.0.0 caTools_1.18.3 lattice_0.22-7
## [7] vctrs_0.6.5 tools_4.5.2 bitops_1.0-9
## [10] parallel_4.5.2 tibble_3.3.0 pkgconfig_2.0.3
## [13] Matrix_1.7-4 KernSmooth_2.23-26 RColorBrewer_1.1-3
## [16] S7_0.2.0 lifecycle_1.0.4 compiler_4.5.2
## [19] farver_2.1.2 gplots_3.2.0 statmod_1.5.1
## [22] codetools_0.2-20 htmltools_0.5.8.1 sys_3.4.3
## [25] buildtools_1.0.0 sass_0.4.10 yaml_2.3.10
## [28] pillar_1.11.1 jquerylib_0.1.4 BiocParallel_1.44.0
## [31] cachem_1.1.0 DelayedArray_0.36.0 limma_3.66.0
## [34] abind_1.4-8 gtools_3.9.5 tidyselect_1.2.1
## [37] locfit_1.5-9.12 digest_0.6.37 dplyr_1.1.4
## [40] labeling_0.4.3 maketools_1.3.2 fastmap_1.2.0
## [43] grid_4.5.2 cli_3.6.5 SparseArray_1.10.1
## [46] magrittr_2.0.4 S4Arrays_1.10.0 withr_3.0.2
## [49] scales_1.4.0 rmarkdown_2.30 XVector_0.50.0
## [52] evaluate_1.0.5 rlang_1.1.6 Rcpp_1.1.0
## [55] glue_1.8.0 BiocManager_1.30.26 jsonlite_2.0.0
## [58] R6_2.6.1
[1] Geistlinger L, Csaba G, Santarelli M, Schiffer L, Ramos M, Zimmer R, Waldron L (2019). GSEABenchmarkeR: Reproducible GSEA Benchmarking. R package version 1.6.0, https://github.com/waldronlab/GSEABenchmarkeR.
[2] Love MI, Huber W, Anders S (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15, 550. doi:10.1186/s13059-014-0550-8.
[3] Robinson MD, McCarthy DJ, Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140. doi:10.1093/bioinformatics/btp616.
[4] Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7), e47.
[5] Liberzon A, Birger C, Thorvaldsdottir H, Ghandi M, Mesirov JP, Tamayo P (2015). The Molecular Signatures Database Hallmark Gene Set Collection. Cell Systems, 1(6), 417-425.