1 Installation

EnrichDO can be installed from Bioconductor:

if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")

BiocManager::install("EnrichDO")

or from the GitHub page:

if (!require("devtools", quietly = TRUE)) install.packages("devtools")

devtools::install_github("liangcheng-hrbmu/EnrichDO")

2 Introduction

Disease Ontology (DO) enrichment analysis is an effective means of discovering associations between genes and diseases. However, most current DO-based enrichment methods cannot solve the over-enrichment problem caused by the “true-path” rule. To address this problem, we present EnrichDO, a double weighted iterative model that integrates the DO graph topology on a global scale. EnrichDO is based on the latest annotations of the human genome with DO terms and applies a double weighting to the annotated genes. On the one hand, to reinforce the saliency of direct gene-DO annotations, different initial weights are assigned to directly annotated genes and indirectly annotated genes. On the other hand, to detect the locally most significant node between a parent and its children, less significant nodes are dynamically down-weighted. EnrichDO exhibits higher accuracy and often yields more specific significant DO terms, which alleviates the over-enrichment problem.
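To make the “true-path” rule concrete, the following base R sketch propagates gene annotations upward in a toy ontology. The term names, gene names and the helper ancestors are hypothetical and are not part of EnrichDO; the sketch only illustrates why ancestor terms accumulate genes and can therefore appear over-enriched.

```r
# Toy ontology (hypothetical): each term lists its direct is_a parents.
parents <- list(
  root = character(0),
  A    = "root",
  B    = "root",
  C    = c("A", "B")
)

# Genes annotated directly to each term (hypothetical).
direct <- list(root = character(0), A = "g1", B = "g2", C = c("g3", "g4"))

# Collect all ancestors of a term by walking up the is_a edges.
ancestors <- function(term) {
  ps <- parents[[term]]
  unique(c(ps, unlist(lapply(ps, ancestors))))
}

# True-path rule: a gene annotated to a term is also annotated
# to every ancestor of that term.
propagated <- direct
for (term in names(direct)) {
  for (anc in ancestors(term)) {
    propagated[[anc]] <- union(propagated[[anc]], direct[[term]])
  }
}

# Number of genes per term after propagation.
lengths(propagated)
```

In this toy case the root ends up with all four genes although none is annotated to it directly, which is the situation the weighting scheme described above is meant to address.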

EnrichDO encompasses a variety of statistical models and visualization schemes for discovering disease-gene relationships in large-scale biological data. The package is available on Bioconductor, and we anticipate that it will provide a more convenient and effective DO enrichment tool.

library(EnrichDO)
#> 

3 Weighted DO Enrichment Analysis

EnrichDO is a double weighted iterative model for DO enrichment analysis. Based on the latest annotations of the human genome with DO terms, EnrichDO can identify locally significant enriched nodes by applying different initial weights and dynamic weights for annotated genes and integrating the DO graph topology on a global scale. EnrichDO is an effective and flexible model that supplies various statistical testing models and multiple testing correction methods.
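As a rough illustration of what a statistical test followed by a multiple testing correction involves, the sketch below runs an ordinary, unweighted hypergeometric test with Benjamini-Hochberg correction in base R. All counts are invented, and the sketch does not reproduce the weighting that EnrichDO applies; it only shows the kind of computation that the 'hypergeomTest' and 'BH' options refer to.

```r
# Hypothetical counts for three DO terms (not taken from EnrichDO).
N <- 20000                                             # background genes
n <- 13                                                # genes of interest
term_size <- c(termA = 50, termB = 400, termC = 1500)  # genes annotated to each term
overlap   <- c(termA = 4,  termB = 3,   termC = 2)     # interest genes in each term

# One-sided test: probability of seeing at least `overlap` term genes
# among n genes drawn from the background.
p <- phyper(overlap - 1, term_size, N - term_size, n, lower.tail = FALSE)

# Benjamini-Hochberg correction for multiple testing.
p_adj <- p.adjust(p, method = "BH")

data.frame(term_size, overlap, p, p.adjust = p_adj)
```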

3.1 doEnrich function

In EnrichDO, the doEnrich function performs the enrichment analysis, taking into account the topological properties of the ontology graph structure.

3.1.1 Result description

In the following example, several genes (demo.data) are randomly selected from the protein-coding genes for analysis. All doEnrich parameters are left at their default values.

demo.data=c(1636,351,102,2932,3077,348,4137,54209,5663,5328,23621,3416,3553)
demo_result<-doEnrich(interestGenes=demo.data)
#>       -- Descending rights test--
#> LEVEL: 13    1 nodes 72 genes to be scored
#> LEVEL: 12    2 nodes 457 genes to be scored
#> LEVEL: 11    3 nodes 907 genes to be scored
#> LEVEL: 10    13 nodes    2279 genes to be scored
#> LEVEL: 9 54 nodes    6504 genes to be scored
#> LEVEL: 8 130 nodes   9483 genes to be scored
#> LEVEL: 7 198 nodes   11209 genes to be scored
#> LEVEL: 6 220 nodes   12574 genes to be scored
#> LEVEL: 5 198 nodes   12936 genes to be scored
#> LEVEL: 4 103 nodes   12824 genes to be scored
#> LEVEL: 3 30 nodes    11683 genes to be scored
#> LEVEL: 2 5 nodes 8032 genes to be scored
#> LEVEL: 1 0 nodes 0 genes to be scored
show(demo_result)
#> 
#> ------------------------- EnrichResult object -------------------------
#> Method of enrichment:
#>   Global Weighted Model
#>   'hypergeomTest' Statistical model with the 'BH' Multiple hypothesis correction
#> Enrichment cutoff layer: 1
#> interestGenes number: 13
#> 957 DOTerms scored: 231 terms with p < 0.01
#> Parameter setting:
#>   Enrichment cutoff layer: 1
#>   Doterm gene number limit: minGsize 5, maxGsize 5000
#>   Enrichment threshold: 0.01

Running doEnrich prints the number of nodes and the total number of genes involved in each layer of the DAG structure. The show method presents an overall summary of the result.

The result of doEnrich is demo_result, which contains enrich, interestGenes, test, method, m, maxGsize, minGsize, delta, traditional and penalize. The complete enrichment result is stored in demo_result, and the EnrichTab function extracts it.

EnrichTab takes two parameters: object receives the EnrichResult object produced by doEnrich, and all is a logical value, where TRUE extracts all enrichment results and FALSE extracts only the significant ones.

Enrich<-EnrichTab(object=demo_result,all = TRUE)

Enrich has 16 columns, including:

  • The standard ID corresponding to the disease in the Disease Ontology database (DOID).

  • The standard name of the disease (DOTerm). Each DOterm has a unique DOID.

  • The level of the DOterm (level). We constructed a directed acyclic graph according to the is_a relationships between the nodes in the DO database, and each DOterm has a corresponding level in this graph.

  • The parent nodes of the DOterm (parent.arr) and their number (parent.len), as stored in the DO database. For example, “B-cell acute lymphoblastic leukemia” (DOID:0080638) is_a “acute lymphoblastic leukemia” (DOID:9952) and “lymphoma” (DOID:0060058), so the node “B-cell acute lymphoblastic leukemia” is a child of both “acute lymphoblastic leukemia” and “lymphoma”, and the child is a more specific biological classification than its parents.

  • The child nodes of the DOterm (child.arr) and their number (child.len).

  • The disease-associated genes of the DOterm (gene.arr) and their number (gene.len). The latest GeneRIF information was used to annotate DOterms.

  • The weight assigned to each gene (weight.arr), which helps assess the contribution of different genes to the DOterm.

  • The gene weights used in the enrichment analysis (gene.w). The smaller the weight of an indirectly annotated gene, the less that gene contributes to the enrichment analysis.

  • The P-value of the DOterm (p), by which the rows of enrich are ordered, and the corrected P-value (p.adjust).

  • The genes of interest annotated to this DOterm (cg.arr) and their number (cg.len).

  • The number of genes in the interest gene set (ig.len), that is, the number of genes actually used for the enrichment analysis.
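The parent and child columns describe the same is_a edges from two directions. The sketch below, written in base R, derives the child lists and both counts from a toy list of parents built around the example above. Only three terms are included, and the parents of the two upper terms are deliberately left empty, so this is an illustration of the data layout rather than a reconstruction of the doterms data.

```r
# Toy parent lists using the example DOIDs above.
# The parents of the two upper terms are omitted on purpose.
parent.arr <- list(
  "DOID:9952"    = character(0),
  "DOID:0060058" = character(0),
  "DOID:0080638" = c("DOID:9952", "DOID:0060058")
)

terms <- names(parent.arr)

# Invert the relation: the children of a term are all terms
# that list it among their parents.
child.arr <- lapply(setNames(terms, terms), function(term) {
  terms[vapply(parent.arr, function(ps) term %in% ps, logical(1))]
})

parent.len <- lengths(parent.arr)
child.len  <- lengths(child.arr)

data.frame(DOID = terms, parent.len, child.len, row.names = NULL)
```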

Generally, an enrichment result is considered significant when its P-value is less than 0.05 or 0.01, which indicates a significant association between the gene set of interest and the disease node. In Enrich, the most significantly enriched node is DOID:0080832, whose DOTerm is mild cognitive impairment, with a P-value of 9.22e-16. This result suggests a statistically significant association between the gene set of interest and mild cognitive impairment.
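Finding the most significant node and counting the significant ones can be done with ordinary data frame operations. The sketch below uses a small invented data frame named toy_enrich that only borrows the column names DOID, DOTerm, p and p.adjust from the description above; the IDs and values are made up, and the 0.01 threshold on p.adjust is just one possible choice.

```r
# Toy stand-in for an enrichment table; all IDs and values are invented.
toy_enrich <- data.frame(
  DOID   = c("DOID:x1", "DOID:x2", "DOID:x3", "DOID:x4"),
  DOTerm = c("term one", "term two", "term three", "term four"),
  p      = c(3e-04, 2e-09, 0.2, 0.008),
  stringsAsFactors = FALSE
)
toy_enrich$p.adjust <- p.adjust(toy_enrich$p, method = "BH")

# Order by p-value so that the most significant term comes first.
toy_enrich <- toy_enrich[order(toy_enrich$p), ]

# Keep the terms whose corrected p-value is below the chosen threshold.
significant <- toy_enrich[toy_enrich$p.adjust < 0.01, ]

toy_enrich[1, c("DOID", "DOTerm", "p")]   # most significant term
nrow(significant)                         # number of significant terms
```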

The data frame doterms contains the information of the disease ontology for DAG construction.

head(doterms)
#>        DOID level     gene.arr   weight.arr parent.arr parent.len child.arr
#> 1 DOID:3720    13          595            1  DOID:3721          1          
#> 2 DOID:3722    13          596            1  DOID:3721          1          
#> 3 DOID:4927    13 9788, 74.... 1, 1, 1,....  DOID:4928          1          
#> 4 DOID:5746    13 9354, 66.... 1, 1, 1,....  DOID:3605          1          
#> 5 DOID:7024    13   4583, 1045         1, 1  DOID:4928          1          
#> 6 DOID:7642    13 8289, 45.... 1, 1, 1,....  DOID:4928          1          
#>   child.len gene.len                                   DOTerm
#> 1         0        1              extramedullary plasmacytoma
#> 2         0        1            solitary osseous plasmacytoma
#> 3         0       30                         Klatskin's tumor
#> 4         0       15        ovarian serous cystadenocarcinoma
#> 5         0        2 mucinous intrahepatic cholangiocarcinoma
#> 6         0        5            cholangiolocellular carcinoma

3.1.2 Application cases of doEnrich function

1. Weighted enrichment analysis with multiple parameters. Each parameter in the following example applies to the weighted enrichment analysis. You can modify the parameter values as required.

weighted_demo<-doEnrich(interestGenes=demo.data,
                           test="fisherTest",
                           method="holm",
                           m=1,
                           minGsize=10,
                           maxGsize=2000,
                           delta=0.05,
                           penalize = TRUE)
#>       -- Descending rights test--
#> LEVEL: 13    1 nodes 72 genes to be scored
#> LEVEL: 12    2 nodes 457 genes to be scored
#> LEVEL: 11    3 nodes 907 genes to be scored
#> LEVEL: 10    12 nodes    2278 genes to be scored
#> LEVEL: 9 50 nodes    5376 genes to be scored
#> LEVEL: 8 116 nodes   7751 genes to be scored
#> LEVEL: 7 181 nodes   9463 genes to be scored
#> LEVEL: 6 193 nodes   10144 genes to be scored
#> LEVEL: 5 174 nodes   9756 genes to be scored
#> LEVEL: 4 85 nodes    9088 genes to be scored
#> LEVEL: 3 18 nodes    4599 genes to be scored
#> LEVEL: 2 1 nodes 1605 genes to be scored
#> LEVEL: 1 0 nodes 0 genes to be scored

2. The parameter penalize is used to alleviate the impact of different magnitudes of p-values; its default value is TRUE. When it is set to FALSE, the weights of non-significant nodes are reduced to a lesser degree, which slightly increases the significance of these nodes, i.e., their p-values become smaller.

penalF_demo<-doEnrich(interestGenes=demo.data, penalize = FALSE)
#>       -- Descending rights test--
#> LEVEL: 13    1 nodes 72 genes to be scored
#> LEVEL: 12    2 nodes 457 genes to be scored
#> LEVEL: 11    3 nodes 907 genes to be scored
#> LEVEL: 10    13 nodes    2279 genes to be scored
#> LEVEL: 9 54 nodes    6504 genes to be scored
#> LEVEL: 8 130 nodes   9483 genes to be scored
#> LEVEL: 7 198 nodes   11209 genes to be scored
#> LEVEL: 6 220 nodes   12574 genes to be scored
#> LEVEL: 5 198 nodes   12936 genes to be scored
#> LEVEL: 4 103 nodes   12824 genes to be scored
#> LEVEL: 3 30 nodes    11683 genes to be scored
#> LEVEL: 2 5 nodes 8032 genes to be scored
#> LEVEL: 1 0 nodes 0 genes to be scored

3. Traditional enrichment analysis. With traditional = TRUE, the weights are not reduced according to the DAG structure. The parameters test, method, m, maxGsize and minGsize can still be set as needed.

Tradition_demo<-doEnrich(demo.data , traditional = TRUE)
#>       -- Traditional test--

3.2 writeDoTerms function

writeDoTerms writes DOID, DOTerm, level, genes, parents, children, gene.len, parent.len and child.len from the data frame doterms to a text file. The default file name is “doterms.txt”.

writeDoTerms(doterms,file=file.path(tempdir(),"doterms.txt"))

3.3 writeResult function

The writeResult function writes DOID, DOTerm, p, p.adjust, geneRatio, bgRatio and cg from the data frame enrich to a text file. The default file name is “result.txt”.

geneRatio is the number of genes shared by the doterm and the interest gene set divided by the size of the interest gene set, and bgRatio is the number of all genes of the doterm divided by the size of the background gene set.
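The two ratios can be worked through with a small arithmetic example. The gene names and the background size below are invented, and the sketch returns plain numbers; the exact formatting used in the output file is not shown here.

```r
# Hypothetical gene sets.
interest_genes  <- c("g1", "g2", "g3", "g4", "g5")   # genes of interest
term_genes      <- c("g2", "g4", "g7", "g8")         # genes annotated to one doterm
background_size <- 100                               # assumed number of background genes

# Genes of interest that are annotated to the doterm.
cg <- intersect(term_genes, interest_genes)

geneRatio <- length(cg) / length(interest_genes)     # 2 / 5
bgRatio   <- length(term_genes) / background_size    # 4 / 100

c(geneRatio = geneRatio, bgRatio = bgRatio)
```

A geneRatio that is much larger than the bgRatio, as in this toy case, is what an enriched doterm looks like before any statistical test is applied.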

writeResult has four parameters. Enrich is the enrichment result obtained from doEnrich, and file is the path of the output file. A doterm is written only when its p.adjust is less than or equal to Q and its p-value is less than or equal to P. The default values of P and Q are both 1.

writeResult(Enrich,file=file.path(tempdir(),"result.txt"),Q=1,P=1)
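The effect of the P and Q thresholds can be imitated on a toy table with base R. The data frame toy, its values and the file name toy_result.txt are invented for illustration, and write.table stands in for the package's own writer, so this is a sketch of the filtering logic only.

```r
# Toy enrichment table; IDs and p-values are invented.
toy <- data.frame(
  DOID = c("DOID:x1", "DOID:x2", "DOID:x3", "DOID:x4"),
  p    = c(1e-06, 0.003, 0.04, 0.6),
  stringsAsFactors = FALSE
)
toy$p.adjust <- p.adjust(toy$p, method = "BH")

# Keep a row only if it passes both thresholds.
P <- 0.05
Q <- 0.05
kept <- toy[toy$p <= P & toy$p.adjust <= Q, ]

# Write the filtered rows as tab-separated text and read them back.
out_file <- file.path(tempdir(), "toy_result.txt")
write.table(kept, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
read.delim(out_file)
```

With both thresholds left at 1, every row would pass, which matches the behaviour described for the defaults.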

4 Visualization of enrichment results

EnrichDO provides four methods to visualize enrichment results: bar plot (drawBarGraph), bubble plot (drawPointGraph), tree plot (drawGraphViz) and heatmap (drawHeatmap), which show the results more concisely and intuitively. Pay attention to the threshold setting of each visualization method: if the threshold is too low, too few nodes are displayed.

4.1 drawBarGraph function

drawBarGraph draws the top n nodes with the most significant p-values as a bar chart, restricted to nodes whose p-value is less than delta (by default, n is 10 and delta is 1e-15).

drawBarGraph(Enrich,n=10,delta=0.05)

Figure 1: bar plot
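The interplay between n and delta can be sketched without any plotting. The helper select_top and the toy p-values below are hypothetical; the sketch only mirrors the selection rule described above (p-value below delta, then the n most significant) and is not the code used by drawBarGraph.

```r
# Toy table with invented p-values.
toy <- data.frame(
  DOTerm = paste("term", 1:6),
  p      = c(1e-20, 3e-04, 0.2, 5e-17, 0.03, 1e-03),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep nodes with p < delta, then take the n smallest.
select_top <- function(enrich, n = 10, delta = 1e-15) {
  hits <- enrich[enrich$p < delta, ]
  hits <- hits[order(hits$p), ]
  head(hits, n)
}

nrow(select_top(toy))                        # strict default delta
nrow(select_top(toy, n = 3, delta = 0.05))   # looser delta, capped at n
```

With the strict default only two toy nodes pass, which illustrates why a threshold that is too low leaves little to display.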

4.2 drawPointGraph function

drawPointGraph draws the top n nodes with the most significant p-values as a bubble plot, restricted to nodes whose p-value is less than delta (by default, n is 10 and delta is 1e-15).

drawPointGraph(Enrich,n=10,delta=0.05)

Figure 2: point plot

4.3 drawGraphViz function

drawGraphViz draws the DAG structure of the n most significant nodes, and labelfontsize sets the font size of the node labels (by default, n is 10 and labelfontsize is 14). The text in each node of the figure is the name of the corresponding doterm.

In addition, the drawGraphViz function can also display the P-value of each node in the enrichment analysis (pview=TRUE), and the number of overlapping genes of each doterm and interest set (numview=TRUE).


drawGraphViz(demo_result, n=10, numview = FALSE, pview = FALSE,labelfontsize=17)
#>  chr [1:3] "DOID:1561" "DOID:150" "DOID:4"
#>  chr [1:3] "DOID:1561" "DOID:150" "DOID:4"
#>  chr [1:6] "DOID:680" "DOID:1289" "DOID:331" "DOID:863" "DOID:7" "DOID:4"
#>  chr [1:6] "DOID:0050890" "DOID:1289" "DOID:331" "DOID:863" "DOID:7" ...
#>  chr [1:5] "DOID:1289" "DOID:331" "DOID:863" "DOID:7" "DOID:4"
#>  chr [1:5] "DOID:936" "DOID:331" "DOID:863" "DOID:7" "DOID:4"
#>  chr [1:7] "DOID:649" "DOID:0050117" "DOID:936" "DOID:4" "DOID:331" ...
#>  chr [1:4] "DOID:0080599" "DOID:934" "DOID:0050117" "DOID:4"
#>  chr [1:4] "DOID:2468" "DOID:1561" "DOID:150" "DOID:4"
#>  chr [1:2] "DOID:0014667" "DOID:4"

Figure 3: tree plot

4.4 drawHeatmap function

The drawHeatmap function visualizes the strength of the relationship between the top DOID_n nodes of the enrichment result and the genes whose weight sums rank in the top gene_n among these nodes. Only genes included in the genes of interest are displayed. readable indicates whether genes are displayed by their symbols.

drawHeatmap also accepts additional parameters of the pheatmap function, which you can set according to your needs. By default, DOID_n is 10, gene_n is 50, fontsize_row is 10 and readable is TRUE.

Meanwhile, the weightMatrix variable is written to the environment to store the values shown in the heatmap.

drawHeatmap(interestGenes=demo.data,
            enrich=Enrich,
            gene_n=10,
            fontsize_row=8,
            readable = TRUE)
#> gene symbol conversion result:
#> 
#> 'select()' returned 1:1 mapping between keys and columns

Figure 4: heatmap

4.5 Convenient drawing

The drawing functions (drawBarGraph, drawPointGraph and drawGraphViz) can also draw from writeResult output files, so you don’t have to wait for the algorithm to run.

# First, read the writeResult output file using the following two lines
data<-read.delim(file.path(system.file("examples", package="EnrichDO"),"result.txt"))
convDraw(resultDO=data)
#> The enrichment results you provide are stored in enrich
#> Now you can use the drawing function

# Then use the drawing function you need
drawGraphViz(enrich=enrich)    #Tree diagram
#>  chr [1:6] "DOID:680" "DOID:1289" "DOID:331" "DOID:863" "DOID:7" "DOID:4"
#>  chr [1:5] "DOID:1289" "DOID:331" "DOID:863" "DOID:7" "DOID:4"
#>  chr [1:3] "DOID:1561" "DOID:150" "DOID:4"
#>  chr [1:3] "DOID:1561" "DOID:150" "DOID:4"
#>  chr [1:6] "DOID:0050890" "DOID:1289" "DOID:331" "DOID:863" "DOID:7" ...
#>  chr [1:5] "DOID:1289" "DOID:331" "DOID:863" "DOID:7" "DOID:4"
#>  chr [1:2] "DOID:150" "DOID:4"
#>  chr [1:4] "DOID:3324" "DOID:1561" "DOID:150" "DOID:4"
#>  chr [1:10] "DOID:0060004" "DOID:3213" "DOID:331" "DOID:438" "DOID:863" ...
#>  chr [1:4] "DOID:2468" "DOID:1561" "DOID:150" "DOID:4"

drawPointGraph(enrich=enrich,delta = 0.05)  #Bubble diagram
#> Warning: Using size for a discrete variable is not advised.

drawBarGraph(enrich=enrich,delta = 0.05)    #Bar plot

5 Session information

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] EnrichDO_0.99.11 BiocStyle_2.33.1
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3      jsonlite_1.8.8          magrittr_2.0.3         
#>   [4] magick_2.8.4            farver_2.1.2            rmarkdown_2.28         
#>   [7] fs_1.6.4                zlibbioc_1.51.1         vctrs_0.6.5            
#>  [10] memoise_2.0.1           ggtree_3.13.1           tinytex_0.52           
#>  [13] htmltools_0.5.8.1       gridGraphics_0.5-1      sass_0.4.9             
#>  [16] bslib_0.8.0             plyr_1.8.9              cachem_1.1.0           
#>  [19] igraph_2.0.3            lifecycle_1.0.4         pkgconfig_2.0.3        
#>  [22] Matrix_1.7-0            R6_2.5.1                fastmap_1.2.0          
#>  [25] gson_0.1.0              GenomeInfoDbData_1.2.12 digest_0.6.37          
#>  [28] aplot_0.2.3             enrichplot_1.25.2       colorspace_2.1-1       
#>  [31] patchwork_1.2.0         AnnotationDbi_1.67.0    S4Vectors_0.43.2       
#>  [34] RSQLite_2.3.7           org.Hs.eg.db_3.19.1     labeling_0.4.3         
#>  [37] fansi_1.0.6             httr_1.4.7              polyclip_1.10-7        
#>  [40] compiler_4.4.1          bit64_4.0.5             withr_3.0.1            
#>  [43] BiocParallel_1.39.0     viridis_0.6.5           DBI_1.2.3              
#>  [46] highr_0.11              ggforce_0.4.2           R.utils_2.12.3         
#>  [49] MASS_7.3-61             tools_4.4.1             ape_5.8                
#>  [52] scatterpie_0.2.4        R.oo_1.26.0             glue_1.7.0             
#>  [55] nlme_3.1-166            GOSemSim_2.31.2         grid_4.4.1             
#>  [58] shadowtext_0.1.4        reshape2_1.4.4          fgsea_1.31.0           
#>  [61] generics_0.1.3          gtable_0.3.5            tzdb_0.4.0             
#>  [64] R.methodsS3_1.8.2       tidyr_1.3.1             data.table_1.16.0      
#>  [67] hms_1.1.3               tidygraph_1.3.1         utf8_1.2.4             
#>  [70] XVector_0.45.0          BiocGenerics_0.51.1     ggrepel_0.9.5          
#>  [73] pillar_1.9.0            stringr_1.5.1           yulab.utils_0.1.7      
#>  [76] vroom_1.6.5             splines_4.4.1           dplyr_1.1.4            
#>  [79] tweenr_2.0.3            treeio_1.29.1           lattice_0.22-6         
#>  [82] bit_4.0.5               tidyselect_1.2.1        GO.db_3.19.1           
#>  [85] Biostrings_2.73.1       knitr_1.48              gridExtra_2.3          
#>  [88] bookdown_0.40           IRanges_2.39.2          stats4_4.4.1           
#>  [91] xfun_0.47               graphlayouts_1.1.1      Biobase_2.65.1         
#>  [94] pheatmap_1.0.12         stringi_1.8.4           UCSC.utils_1.1.0       
#>  [97] lazyeval_0.2.2          ggfun_0.1.6             yaml_2.3.10            
#> [100] evaluate_0.24.0         codetools_0.2-20        ggraph_2.2.1           
#> [103] archive_1.1.8           tibble_3.2.1            qvalue_2.37.0          
#> [106] hash_2.2.6.3            Rgraphviz_2.49.0        BiocManager_1.30.25    
#> [109] graph_1.83.0            ggplotify_0.1.2         cli_3.6.3              
#> [112] munsell_0.5.1           jquerylib_0.1.4         Rcpp_1.0.13            
#> [115] GenomeInfoDb_1.41.1     png_0.1-8               parallel_4.4.1         
#> [118] ggplot2_3.5.1           readr_2.1.5             blob_1.2.4             
#> [121] clusterProfiler_4.13.3  DOSE_3.99.1             viridisLite_0.4.2      
#> [124] tidytree_0.4.6          scales_1.3.0            purrr_1.0.2            
#> [127] crayon_1.5.3            rlang_1.1.4             cowplot_1.1.3          
#> [130] fastmatch_1.1-4         KEGGREST_1.45.1