1 Overview

The ChIPDBData package provides curated ChIP-seq transcription factor target databases designed for use with TFEA.ChIP.

Each dataset contains a collection of ChIP-seq experiments (e.g., from ENCODE) along with their associated gene targets. These datasets are structured as ChIPDB list objects and can be retrieved either directly through ExperimentHub or via the getChIPDB() function.

Important: When loading any dataset, make sure it is assigned to an object named ChIPDB. This is crucial, as TFEA.ChIP looks for a globally defined object called ChIPDB and will not recognize it under any other name.

2 Installation

To install the package, start R and enter:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ChIPDBData")

Once ChIPDBData is installed, it can be loaded with the following command:

library(ChIPDBData)

3 Available Datasets

The datasets currently available in the ChIPDBData package can be listed and accessed via the ExperimentHub interface:

library(ExperimentHub)
#> Loading required package: BiocGenerics
#> Loading required package: generics
#> 
#> Attaching package: 'generics'
#> The following objects are masked from 'package:base':
#> 
#>     as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
#>     setequal, union
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
#>     mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#>     rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
#>     unsplit, which.max, which.min
#> Loading required package: AnnotationHub
#> Loading required package: BiocFileCache
#> Loading required package: dbplyr

eh <- ExperimentHub()
dbs <- query(eh, "ChIPDBData")
dbs
#> ExperimentHub with 10 records
#> # snapshotDate(): 2025-09-22
#> # $dataprovider: ENCODE, GeneHancer, CREDB
#> # $species: Homo sapiens
#> # $rdataclass: list
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["EH9847"]]' 
#> 
#>            title                              
#>   EH9847 | ENCODE rE2G complete               
#>   EH9848 | ENCODE rE2G greater than 0.25 score
#>   EH9849 | ENCODE rE2G greater than 0.5 score 
#>   EH9850 | ENCODE rE2G greater than 0.75 score
#>   EH9851 | ENCODE rE2G greater than 50 depth  
#>   EH9852 | ENCODE rE2G greater than 100 depth 
#>   EH9853 | ENCODE rE2G greater than 200 depth 
#>   EH9854 | ENCODE rE2G greater than 300 depth 
#>   EH9855 | CREdb                              
#>   EH9856 | GeneHancer

# Example: load the "ENCODE rE2G greater than 300 depth" dataset (EH9854)
ChIPDB <- dbs[["EH9854"]]  # IMPORTANT: Assign to 'ChIPDB'
#> see ?ChIPDBData and browseVignettes('ChIPDBData') for documentation
#> loading from cache

Alternatively, you can retrieve datasets programmatically using getChIPDB() with any of the following identifiers: “ENCODE_rE2G”, “ENCODE_rE2G_25score”, “ENCODE_rE2G_50score”, “ENCODE_rE2G_75score”, “ENCODE_rE2G_50depth”, “ENCODE_rE2G_100depth”, “ENCODE_rE2G_200depth”, “ENCODE_rE2G_300depth”, “CREdb” or “GeneHancer”.

For example:

# Load the ENCODE dataset filtered by depth >= 300
ChIPDB <- getChIPDB("ENCODE_rE2G_300depth")
#> see ?ChIPDBData and browseVignettes('ChIPDBData') for documentation
#> loading from cache

A ChIPDB object is a named list with two main components:

  1. A character vector of Entrez Gene IDs, representing the universe of possible targets.
  2. A named list of ChIP-seq experiments, where each element is a vector of integer indices pointing to the genes in component 1. Each entry represents the gene targets of a transcription factor in a specific experiment.

Exploring the structure:

# List names of the top-level elements
names(ChIPDB)
#> [1] "Gene Keys"    "ChIP Targets"

# Preview the first few Entrez IDs
ChIPDB[[1]][1:5]
#> [1] "1"     "10"    "100"   "1000"  "10000"

# View names of ChIP-seq experiments
names(ChIPDB[[2]])[1:3]
#> [1] "ENCSR000AHD.CTCF.MCF-7"   "ENCSR000AHF.TAF1.MCF-7"  
#> [3] "ENCSR000AKB.CTCF.GM12878"

# Show gene indices for the first experiment
ChIPDB[[2]][[1]][1:5]
#> [1]  4  9 90 94 97

# Get actual gene IDs from those indices
ChIPDB[[1]][ ChIPDB[[2]][[1]][1:5] ]
#> [1] "1000"      "100009676" "100036567" "100048912" "100049716"
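
The index lookup above can be wrapped into small helpers. The following is a minimal sketch on a toy object with made-up IDs and experiment names, so it runs without downloading anything. The helpers parse_experiments() and targets_for_tf() are hypothetical and not part of ChIPDBData or TFEA.ChIP. The sketch assumes experiment names follow the "Accession.TF.Cell" pattern seen in the output above and that neither the accession nor the TF symbol contains a dot.

```r
# Toy ChIPDB-like object (made-up IDs and experiment names, illustration only)
toy_db <- list(
  "Gene Keys" = c("1", "10", "100", "1000", "10000"),
  "ChIP Targets" = list(
    "EXP001.CTCF.MCF-7"   = c(1L, 4L),
    "EXP002.TAF1.MCF-7"   = c(2L, 3L, 5L),
    "EXP003.CTCF.GM12878" = c(4L, 5L)
  )
)

# Hypothetical helper: split "Accession.TF.Cell" names into a data frame.
# Assumes no dot inside the accession or the TF symbol.
parse_experiments <- function(db) {
  nm <- names(db[["ChIP Targets"]])
  parts <- regmatches(nm, regexec("^([^.]+)\\.([^.]+)\\.(.+)$", nm))
  data.frame(
    experiment = nm,
    accession  = vapply(parts, `[`, character(1), 2),
    tf         = vapply(parts, `[`, character(1), 3),
    cell       = vapply(parts, `[`, character(1), 4),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: union of target Entrez IDs over all experiments of a TF
targets_for_tf <- function(db, tf) {
  meta <- parse_experiments(db)
  hits <- db[["ChIP Targets"]][meta$experiment[which(meta$tf == tf)]]
  idx  <- sort(unique(as.integer(unlist(hits, use.names = FALSE))))
  db[["Gene Keys"]][idx]
}

experiment_table <- parse_experiments(toy_db)
ctcf_targets <- targets_for_tf(toy_db, "CTCF")
```

With a real database, the same calls would take ChIPDB in place of toy_db, provided the naming assumption holds for that dataset.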

4 Integration with TFEA.ChIP

To perform transcription factor enrichment analysis, start by loading your differential expression data and defining the regulated and control gene sets. Ensure that your ChIP-seq database is loaded and assigned to ChIPDB. The TFEA.ChIP functions will automatically use this object for analysis.

Important: Make sure to load ChIPDB after running library(TFEA.ChIP). Otherwise, the package’s default database (a limited subset from GeneHancer) will overwrite it.
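
Because the analysis depends on an object with a specific name and layout, a quick sanity check before running it may be useful. The sketch below is only an illustration. is_valid_chipdb() and chipdb_ready() are hypothetical helpers, not functions from TFEA.ChIP or ChIPDBData, and they test only the two-component layout described in this vignette. A toy object and a scratch environment stand in for the real database and the global environment.

```r
# Hypothetical helper: TRUE if `db` has the two-component layout described
# in this vignette (character gene keys + named list of index vectors).
is_valid_chipdb <- function(db) {
  if (!is.list(db) || length(db) != 2L) return(FALSE)
  keys    <- db[[1]]
  targets <- db[[2]]
  if (!is.character(keys) || !is.list(targets)) return(FALSE)
  in_range <- vapply(
    targets,
    function(idx) is.numeric(idx) && all(idx >= 1 & idx <= length(keys)),
    logical(1)
  )
  isTRUE(all(in_range))
}

# Hypothetical helper: is a valid object called "ChIPDB" defined in `env`?
chipdb_ready <- function(env = globalenv()) {
  exists("ChIPDB", envir = env, inherits = FALSE) &&
    is_valid_chipdb(get("ChIPDB", envir = env))
}

# Toy data (made-up IDs) and a scratch environment for the demonstration
toy_db <- list(
  "Gene Keys"    = c("1", "10", "100"),
  "ChIP Targets" = list("EXP001.CTCF.MCF-7" = c(1L, 3L))
)
demo_env <- new.env()
before <- chipdb_ready(demo_env)           # nothing assigned yet
assign("ChIPDB", toy_db, envir = demo_env)
after <- chipdb_ready(demo_env)            # object present and well-formed
```

In an interactive session, calling chipdb_ready() with its default argument would inspect the global environment instead.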

# Load and preprocess differential expression table
data('hypoxia_DESeq')
hypoxia_table <- preprocessInputData(hypoxia_DESeq)
#> Loading required namespace: DESeq2
#> Warning: Some genes returned 1:many mapping to ENTREZ ID.

# Define gene sets
Genes.Upreg <- Select_genes(hypoxia_table, min_LFC = 1)
Genes.Control <- Select_genes(hypoxia_table,
  min_pval = 0.5, max_pval = 1,
  min_LFC = -0.25, max_LFC = 0.25
)

# Run TF enrichment
CM_list <- contingency_matrix(Genes.Upreg, Genes.Control)
results <- getCMstats(CM_list)
#> Warning in tmpOR[is.infinite(tmpOR)] <- ifelse(statMat$OR == Inf,
#> max(statMat$OR, : number of items to replace is not a multiple of replacement
#> length

# Display results
head(results)
#>                                    Accession       Cell Treatment
#> 7559                  GSE89836.EPAS1.HUVEC-C    HUVEC-C          
#> 5971                 GSE48516.JARID2.UTEIPS6    UTEIPS6          
#> 1189 ENCSR341VYI.EZH2_phosphoT487.hepatocyte hepatocyte          
#> 5972                 GSE48516.JARID2.UTEIPS7    UTEIPS7          
#> 866                ENCSR091BOQ.SUZ12.GM12878    GM12878          
#> 4755                    GSE135024.EZH2.THP-1      THP-1          
#>                    TF      p.value        OR      OR.SE  log2.OR  adj.p.value
#> 7559            EPAS1 2.883815e-06 63.194453 68.3644690 5.981726 1.066667e-05
#> 5971           JARID2 2.309584e-43  6.137755  0.7739575 2.617711 1.860601e-40
#> 1189 EZH2_phosphoT487 1.779976e-35  5.662254  0.7414628 2.501377 2.926427e-33
#> 5972           JARID2 6.754287e-34  6.571209  0.9378734 2.716159 8.776216e-32
#> 866             SUZ12 9.452603e-32  5.674107  0.7785041 2.504393 8.368151e-30
#> 4755             EZH2 5.757605e-31  5.817861  0.8167430 2.540489 4.503230e-29
#>      log10.adj.pVal distance
#> 7559       4.971971 62.39287
#> 5971      39.730347 40.06117
#> 1189      32.533662 32.86603
#> 5972      31.056693 31.55244
#> 866       29.077370 29.45065
#> 4755      28.346476 28.75299
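
To make the per-experiment statistics more concrete, the sketch below builds a single 2x2 table by hand in base R and tests it. It assumes that the association test is Fisher's exact test on a regulated/control by target/non-target table. The gene sets are made up, and the code is not the TFEA.ChIP implementation. In particular, the sample odds ratio computed here need not match how the package derives its OR column.

```r
# Toy gene sets (made-up Entrez-like IDs, illustration only)
regulated  <- c("1", "2", "3", "4", "5", "6")
control    <- c("7", "8", "9", "10", "11", "12", "13", "14")
tf_targets <- c("1", "2", "3", "4", "7", "20")  # targets of one experiment

# 2x2 contingency table: rows = gene set, columns = bound by the TF or not
cm <- matrix(
  c(sum(regulated %in% tf_targets), sum(!regulated %in% tf_targets),
    sum(control %in% tf_targets),   sum(!control %in% tf_targets)),
  nrow = 2, byrow = TRUE,
  dimnames = list(set   = c("regulated", "control"),
                  bound = c("target", "non_target"))
)

# Fisher's exact test for association between regulation and TF binding
ft <- fisher.test(cm)

# Sample (cross-product) odds ratio; fisher.test() itself reports a
# conditional maximum likelihood estimate, which generally differs
or_sample <- (cm[1, 1] * cm[2, 2]) / (cm[1, 2] * cm[2, 1])

print(cm)
print(ft$p.value)
print(or_sample)
```

Repeating this for every experiment in the database and adjusting the p-values for multiple testing (for example with p.adjust()) yields a table comparable in spirit to the results shown above.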

5 Session Info

sessionInfo()
#> R version 4.5.1 Patched (2025-08-23 r88802)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ExperimentHub_2.99.5 AnnotationHub_3.99.6 BiocFileCache_2.99.6
#> [4] dbplyr_2.5.1         BiocGenerics_0.55.1  generics_0.1.4      
#> [7] ChIPDBData_0.99.7    TFEA.ChIP_1.29.3     BiocStyle_2.37.1    
#> 
#> loaded via a namespace (and not attached):
#>  [1] DBI_1.2.3                   bitops_1.0-9               
#>  [3] httr2_1.2.1                 biomaRt_2.65.14            
#>  [5] rlang_1.1.6                 magrittr_2.0.4             
#>  [7] matrixStats_1.5.0           compiler_4.5.1             
#>  [9] RSQLite_2.4.3               GenomicFeatures_1.61.6     
#> [11] png_0.1-8                   vctrs_0.6.5                
#> [13] stringr_1.5.2               pkgconfig_2.0.3            
#> [15] crayon_1.5.3                fastmap_1.2.0              
#> [17] XVector_0.49.1              Rsamtools_2.25.3           
#> [19] rmarkdown_2.29              purrr_1.1.0                
#> [21] bit_4.6.0                   xfun_0.53                  
#> [23] cachem_1.1.0                jsonlite_2.0.0             
#> [25] progress_1.2.3              blob_1.2.4                 
#> [27] DelayedArray_0.35.3         BiocParallel_1.43.4        
#> [29] parallel_4.5.1              prettyunits_1.2.0          
#> [31] R6_2.6.1                    bslib_0.9.0                
#> [33] stringi_1.8.7               RColorBrewer_1.1-3         
#> [35] rtracklayer_1.69.1          GenomicRanges_1.61.5       
#> [37] jquerylib_0.1.4             Rcpp_1.1.0                 
#> [39] Seqinfo_0.99.2              bookdown_0.44              
#> [41] SummarizedExperiment_1.39.2 knitr_1.50                 
#> [43] org.Mm.eg.db_3.21.0         R.utils_2.13.0             
#> [45] IRanges_2.43.2              Matrix_1.7-4               
#> [47] tidyselect_1.2.1            dichromat_2.0-0.1          
#> [49] abind_1.4-8                 yaml_2.3.10                
#> [51] codetools_0.2-20            curl_7.0.0                 
#> [53] lattice_0.22-7              tibble_3.3.0               
#> [55] Biobase_2.69.1              withr_3.0.2                
#> [57] KEGGREST_1.49.1             S7_0.2.0                   
#> [59] evaluate_1.0.5              Biostrings_2.77.2          
#> [61] pillar_1.11.1               BiocManager_1.30.26        
#> [63] filelock_1.0.3              MatrixGenerics_1.21.0      
#> [65] stats4_4.5.1                RCurl_1.98-1.17            
#> [67] BiocVersion_3.22.0          S4Vectors_0.47.2           
#> [69] hms_1.1.3                   ggplot2_4.0.0              
#> [71] scales_1.4.0                glue_1.8.0                 
#> [73] tools_4.5.1                 BiocIO_1.19.0              
#> [75] locfit_1.5-9.12             GenomicAlignments_1.45.4   
#> [77] XML_3.99-0.19               grid_4.5.1                 
#> [79] AnnotationDbi_1.71.1        restfulr_0.0.16            
#> [81] cli_3.6.5                   rappdirs_0.3.3             
#> [83] S4Arrays_1.9.1              dplyr_1.1.4                
#> [85] gtable_0.3.6                R.methodsS3_1.8.2          
#> [87] DESeq2_1.49.4               sass_0.4.10                
#> [89] digest_0.6.37               SparseArray_1.9.1          
#> [91] org.Hs.eg.db_3.21.0         rjson_0.2.23               
#> [93] farver_2.1.2                memoise_2.0.1              
#> [95] htmltools_0.5.8.1           R.oo_1.27.1                
#> [97] lifecycle_1.0.4             httr_1.4.7                 
#> [99] bit64_4.6.0-1