A brief introduction of dmGsea R package for gene set enrichment analysis.
dmGsea 0.99.11
In DNA methylation data, genes are often represented by variable number of correlated probes, and a single probe can map to multiple genes. This complex data structure poses significant challenges for gene set enrichment analysis (GSEA), and can lead to biased enrichment results. The dmGsea package offers several functions with novel methods specifically designed to perform efficient gene set enrichment analysis while addressing probe dependency and probe number bias. Compared to alternative packages for DNA methylation data, these methods effectively utilize probe dependency information, provide higher statistical power, can well control type I errer rates, and are computationally more efficient. The package fully supports enrichment analysis for Illumina DNA methylation array data, and is easily extendable to other types of omics data when provided with appropriate probe annotation information.
gsGene(): GSEA based on aggregates association signals at gene level
gsPG(): GSEA using summary statistics for independent probe groups based
on gene annotation
gsProbe(): GSEA using probe level p-values
gsRank(): Fast ranking-based GSEA with gene level statistics
The following examples are brief demonstrations on how to perform gene set enrichment analysis using dmGsea functions.
require(dmGsea)
#generating example data
annopkg <- "IlluminaHumanMethylation450kanno.ilmn12.hg19"
anno <- minfi::getAnnotation(eval(annopkg))
#Use a subset of the data in the example to speed up execution
anno <- anno[1:10000,]
probe.p <- data.frame(Name=rownames(anno),p=runif(nrow(anno)))
probe.p$p[1:500] <- probe.p$p[1:500]/100000
Data4cor <- matrix(runif(nrow(probe.p)*100),ncol=100)
rownames(Data4cor) <- rownames(anno)
#geneset enrichment analysis with threshold based method
#for top ranked 1000 genes
gsGene(probe.p <- probe.p,Data4Cor=Data4cor,arrayType="450K",nTopGene=1000,
    outGenep=TRUE, method="Threshold",gSetName="KEGG",species="Human",
    outfile="gs1",ncore=1)
file.remove("gs1_KEGG_KEGG.csv")To perform GSEA using significant genes, set FDRthre = 0.05 to apply
an FDR threshold of 0.05.
To perform GSEA using a ranking-based method, specify method = "Ranking".
If the argument Data4Cor is not provided, GSEA will be performed without
using the methylation data matrix to account for between-CpG correlation.
Use gsPG() to perform enrichment analysis based on probe group-level
p-values.
To perform GSEA directly with probe-level p-values without combining them
into gene-level statistics, use the gsProbe() function. It applies a
noncentral hypergeometric test to adjust for bias introduced
by variable numbers of probes per gene.
#generate example dataset
kegg <- getKEGG(species="Human")
gene1 <- unique(as.vector(unlist(kegg[1:5])))
gene2 <- unique(as.vector(unlist(kegg[6:length(kegg)])))
gene1 <- rep(gene1,sample(1:10,length(gene1),replace=TRUE))
gene2 <- rep(gene2,sample(1:10,length(gene2),replace=TRUE))
p11 <- runif(length(gene1))*(1e-3)
p2 <- runif(length(gene2))
geneid <- c(gene1,gene2)
p <- c(p11,p2)
Name <- paste0("cg",1:length(p))
probe.p <- data.frame(Name=Name,p=p)
GeneProbeTable <- data.frame(Name=Name,entrezid=geneid)
dat <- matrix(runif(length(p)*100),ncol=100)
rownames(dat) <- Name
#enrichment analysis
gsGene(probe.p=probe.p,Data4Cor=dat,GeneProbeTable=GeneProbeTable,
        method="Threshold",gSetName="KEGG",species="Human",outfile="gs5",
        ncore=1)
file.remove("gs5_KEGG_KEGG.csv")method = "Ranking".#generatin example dataset
userGeneset <- getKEGG(species="Human")
#enrichment analysis
gsGene(probe.p=probe.p,Data4Cor=dat,GeneProbeTable=GeneProbeTable,
        method="Threshold",geneSet=userGeneset,species="Human",outfile="gs7",
        ncore=1)
file.remove("gs7_userSet_userSet.csv")method = "Ranking".#generatin example dataset
kegg <- getKEGG(species="Human")
gene <- unique(as.vector(unlist(kegg)))
p <- runif(length(gene))
names(p) <- gene
stats <- -log(p)*sample(c(1,-1),length(p),replace=TRUE)
#traditional GSEA analysis, enrichment toward higher or lower end of statstics
stats <- sort(stats,decr=TRUE)
gsRank(stats=stats,gSetName="KEGG",scoreType="std",outfile="gs9",nperm=1e4,
    ncore=1)
file.remove("gs9_KEGG_KEGG.csv")
file.remove("gsGene_genep.csv")
#enrichment of genes with higher statistics
stats <- sort(abs(stats),decr=TRUE)The package includes built-in support for KEGG, GO, MSigDB, and Reactome gene sets for both human and mouse pathways. All functions also offer options to incorporate custom, user-provided gene sets.
The KEGG pathway database is a widely used resource that provides a comprehensive collection of manually curated biological pathways. These pathways cover various biological processes, including metabolism, cellular processes, genetic information processing, and human diseases. KEGG pathways integrate information about molecular interactions, reactions, and relationships between genes,proteins, and other molecules, helping researchers understand complex biological functions at a systems level.
GO is a widely used framework for describing the roles of genes and their products (proteins, RNAs) in biological systems. Unlike pathway databases that focus on specific molecular interactions, GO provides a standardized vocabulary for annotating gene functions across species in three main categories:
MSigDB is a comprehensive collection of gene sets used for interpreting high-throughput gene expression data in biological research. It is a key resource for gene set enrichment analysis (GSEA), helping researchers identify biological pathways, processes, and mechanisms that are overrepresented in a given dataset.
The Molecular Signatures Database (MSigDB) is divided into several major collections, each of which contains different sub-categories. Here’s a list of all the main categories and their sub-categories:
Reactome is a freely accessible, curated database of biological pathways that provides detailed insights into molecular processes across a wide range of organisms. It covers various cellular and biochemical processes, such as signal transduction, metabolism, gene expression, and immune responses. Each pathway in Reactome is represented as a network of molecular interactions, where entities like proteins, small molecules, and complexes participate in reactions, including binding, transport, and modifications. These reactions are organized hierarchically, from individual molecular events to larger biological processes.Reactome pathways are extensively curated by domain experts and are used in functional analysis of high-throughput omics data (e.g., gene expression, proteomics). It integrates experimental data, enabling researchers to explore the molecular mechanisms behind diseases, drug responses, and other biological phenomena.
Threshold-Based GSEA or Over-Representation Analysis. It requires a threshold to define “significant” genes (e.g., p-value, fold change) and tests whether the overlap between a predefined list of differentially expressed genes (DEGs) and a gene set is statistically significant.
Ranking-Based or functional class scoring (FCS) GSEA, ranks all genes in a dataset based on a continuous metric (e.g., P-value, fold change, t-statistic) and assesses whether the genes in a predefined gene set cluster at the top or bottom of this ranked list.
The choice between threshold-based and ranking-based Gene Set Enrichment Analysis (GSEA) depends on the nature of the hypothesis being tested, the type of data, and the goals of the analysis. Threshold-Based GSEA: Use for well-defined, significant subsets of genes/features when you have a justifiable cutoff and are interested in strong, specific signals. Ranking-Based GSEA: Use when you want to incorporate the full dataset, avoid arbitrary cutoffs, or have a hypothesis that requires considering a continuous spectrum of feature significance.
Zongli Xu, Alison A Motsinger-Reif, Liang Niu, Efficient gene set analysis for DNA methylation addressing probe dependency and bias. Bioinformatics, 2025 Jul 24:btaf422. PMID: 40705664 DOI: 10.1093/bioinformatics/btaf422 Full text linksCite
sessionInfo() ## R version 4.5.1 Patched (2025-08-23 r88802)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
##  [2] minfi_1.55.1                                      
##  [3] bumphunter_1.51.1                                 
##  [4] locfit_1.5-9.12                                   
##  [5] iterators_1.0.14                                  
##  [6] foreach_1.5.2                                     
##  [7] Biostrings_2.77.2                                 
##  [8] XVector_0.49.1                                    
##  [9] dmGsea_0.99.11                                    
## [10] SummarizedExperiment_1.39.2                       
## [11] Biobase_2.69.1                                    
## [12] GenomicRanges_1.61.5                              
## [13] Seqinfo_0.99.2                                    
## [14] IRanges_2.43.5                                    
## [15] S4Vectors_0.47.4                                  
## [16] BiocGenerics_0.55.1                               
## [17] generics_0.1.4                                    
## [18] MatrixGenerics_1.21.0                             
## [19] matrixStats_1.5.0                                 
## [20] Matrix_1.7-4                                      
## [21] BiocStyle_2.37.1                                  
## 
## loaded via a namespace (and not attached):
##   [1] beanplot_1.3.1            DBI_1.2.3                
##   [3] bitops_1.0-9              poolr_1.2-0              
##   [5] BiasedUrn_2.0.12          magrittr_2.0.4           
##   [7] rlang_1.1.6               scrime_1.3.5             
##   [9] compiler_4.5.1            RSQLite_2.4.3            
##  [11] GenomicFeatures_1.61.6    DelayedMatrixStats_1.31.0
##  [13] png_0.1-8                 vctrs_0.6.5              
##  [15] quadprog_1.5-8            pkgconfig_2.0.3          
##  [17] crayon_1.5.3              fastmap_1.2.0            
##  [19] Rsamtools_2.25.3          rmarkdown_2.30           
##  [21] tzdb_0.5.0                preprocessCore_1.71.2    
##  [23] purrr_1.1.0               bit_4.6.0                
##  [25] xfun_0.53                 cachem_1.1.0             
##  [27] jsonlite_2.0.0            blob_1.2.4               
##  [29] rhdf5filters_1.21.0       DelayedArray_0.35.3      
##  [31] reshape_0.8.10            Rhdf5lib_1.31.0          
##  [33] BiocParallel_1.43.4       R6_2.6.1                 
##  [35] bslib_0.9.0               RColorBrewer_1.1-3       
##  [37] limma_3.65.5              rtracklayer_1.69.1       
##  [39] genefilter_1.91.0         jquerylib_0.1.4          
##  [41] Rcpp_1.1.0                bookdown_0.45            
##  [43] knitr_1.50                readr_2.1.5              
##  [45] tidyselect_1.2.1          rentrez_1.2.4            
##  [47] illuminaio_0.51.0         splines_4.5.1            
##  [49] abind_1.4-8               yaml_2.3.10              
##  [51] siggenes_1.83.0           codetools_0.2-20         
##  [53] curl_7.0.0                doRNG_1.8.6.2            
##  [55] tibble_3.3.0              lattice_0.22-7           
##  [57] plyr_1.8.9                KEGGREST_1.49.1          
##  [59] askpass_1.2.1             evaluate_1.0.5           
##  [61] survival_3.8-3            xml2_1.4.0               
##  [63] mclust_6.1.1              pillar_1.11.1            
##  [65] BiocManager_1.30.26       rngtools_1.5.2           
##  [67] mathjaxr_1.8-0            RCurl_1.98-1.17          
##  [69] hms_1.1.3                 sparseMatrixStats_1.21.0 
##  [71] xtable_1.8-4              glue_1.8.0               
##  [73] tools_4.5.1               BiocIO_1.19.0            
##  [75] data.table_1.17.8         GenomicAlignments_1.45.4 
##  [77] annotate_1.87.0           GEOquery_2.77.6          
##  [79] XML_3.99-0.19             rhdf5_2.53.5             
##  [81] grid_4.5.1                tidyr_1.3.1              
##  [83] AnnotationDbi_1.71.1      base64_2.0.2             
##  [85] nlme_3.1-168              nor1mix_1.3-3            
##  [87] HDF5Array_1.37.0          restfulr_0.0.16          
##  [89] cli_3.6.5                 S4Arrays_1.9.1           
##  [91] dplyr_1.1.4               sass_0.4.10              
##  [93] digest_0.6.37             SparseArray_1.9.1        
##  [95] dqrng_0.4.1               org.Hs.eg.db_3.21.0      
##  [97] rjson_0.2.23              memoise_2.0.1            
##  [99] htmltools_0.5.8.1         multtest_2.65.0          
## [101] lifecycle_1.0.4           h5mread_1.1.1            
## [103] httr_1.4.7                statmod_1.5.0            
## [105] openssl_2.3.4             bit64_4.6.0-1            
## [107] MASS_7.3-65