GEO and TCGA provide us with a wealth of data, such as RNA-seq, DNA Methylation, single nucleotide Variation and Copy number variation data. It’s easy to download data from TCGA using the  gdc tool or TCGAbiolinks, and some software provides organized TCGA data, such as UCSC Xena , UCSCXenaTools, and sangerbox, but processing these data into a format suitable for bioinformatics  analysis requires more work. This R package was developed to handle these data.
This is a basic example which shows you how to solve a common problem:
It is convenient to use TCGAbiolinks  or GDCRNATools to download and analysis Gene expression data.  TCGAbiolinks use edgeR package to do differential expression analysis, while GDCRNATools can implement three most commonly used methods: limma, edgeR , and DESeq2 to identify differentially expressed  genes (DEGs).
Alicia Oshlack et al. claimed that unlike the chip data, the RNA-seq data had one bias[1]: the larger the transcript length / mean read count , the more likely it was to be identified as a differential gene, while there was no such trend in the chip data[2].
However, when we use their chip data for difference analysis (using the limma package), we find that chip data has the same trend as RNA-seq data. And we also found this trend in the difference analysis results given by the data authors[3].
It is worse noting that only technical replicate data, which has small gene dispersions, shows this bias[4]. This is because in technical replicate RNA-seq data a long gene has more reads mapping to it compared to a short gene of similar expression, and most of the statistical methods used to detect differential expression have stronger detection ability for genes with more reads. However, we have not deduced why there is such a bias in the current difference analysis algorithms.
Some software, such as CQN , present a normalization algorithm [5] to correct systematic biases(gene length bias and GC-content bias[6]. But they did not provide sufficient evidence to prove that the correction is effective. We use the Marioni dataset[2] to verify the correction effect of CQN and find that there is still a deviation after correction:
GOseq [1]based on Wallenius’ noncentral hypergeometric distribution can effectively correct the gene length deviation in enrichment analysis. However, the current RNA-seq data often have no gene length bias, but only the expression amount(read count) bias, GOseq may overcorrect these data, correcting originally unbiased data into reverse bias.
GOseq also fails to correct for expression bias, therefore, read count bias correction is still a challenge for us.
# use user-defined data
df <- matrix(rnbinom(400, mu = 4, size = 10), 25, 16)
df <- as.data.frame(df)
rownames(df) <- paste0("gene", 1:25)
colnames(df) <- paste0("sample", 1:16)
group <- sample(c("group1", "group2"), 16, replace = TRUE)
result <- differential_RNA(counts = df, group = group,
    filte = FALSE, method = "Wilcoxon")
# use SummarizedExperiment object input
df <- matrix(rnbinom(400, mu = 4, size = 10), 25, 16)
rownames(df) <- paste0("gene", 1:25)
colnames(df) <- paste0("sample", 1:16)
group <- sample(c("group1", "group2"), 16, replace = TRUE)
nrows <- 200; ncols <- 20
counts <- matrix(
    runif(nrows * ncols, 1, 1e4), nrows,
    dimnames = list(paste0("cg",1:200),paste0("S",1:20))
)
colData <- S4Vectors::DataFrame(
  row.names = paste0("sample", 1:16),
  group = group
)
data <- SummarizedExperiment::SummarizedExperiment(
         assays=S4Vectors::SimpleList(counts=df),
         colData = colData)
result <- differential_RNA(counts = data, groupCol = "group",
    filte = FALSE, method = "Wilcoxon") use TCGAbiolinks data.
The codes may need to be modified if TCGAbiolinks updates. So please read its documents.
# use user defined data
library(ChAMP)
cpgData <- matrix(runif(2000), nrow = 200, ncol = 10)
rownames(cpgData) <- paste0("cpg", seq_len(200))
colnames(cpgData) <- paste0("sample", seq_len(10))
sampleGroup <- c(rep("group1", 5), rep("group2", 5))
names(sampleGroup) <- colnames(cpgData)
cpg2gene <- data.frame(cpg = rownames(cpgData), 
    gene = rep(paste0("gene", seq_len(20)), 10))
result <- differential_methy(cpgData, sampleGroup, 
    cpg2gene = cpg2gene, normMethod = NULL)
# use SummarizedExperiment object input
library(ChAMP)
cpgData <- matrix(runif(2000), nrow = 200, ncol = 10)
rownames(cpgData) <- paste0("cpg", seq_len(200))
colnames(cpgData) <- paste0("sample", seq_len(10))
sampleGroup <- c(rep("group1", 5), rep("group2", 5))
names(sampleGroup) <- colnames(cpgData)
cpg2gene <- data.frame(cpg = rownames(cpgData), 
    gene = rep(paste0("gene", seq_len(20)), 10))
colData <- S4Vectors::DataFrame(
    row.names = colnames(cpgData),
    group = sampleGroup
)
data <- SummarizedExperiment::SummarizedExperiment(
         assays=S4Vectors::SimpleList(counts=cpgData),
         colData = colData)
result <- differential_methy(cpgData = data, 
    groupCol = "group", normMethod = NULL, 
    cpg2gene = cpg2gene)  Note: ChAMPhas a large number of dependent packages. If you cannot install it successfully, you can download each dependent package separately(Source or Binary) and install it locally.
We provide two models to get methylation difference genes:
if model = “cpg”, step1: calculate difference cpgs; step2: calculate difference genes;
if model = “gene”, step1: calculate the methylation level of genes; step2: calculate difference genes.
We find that only model = “gene” has no deviation of CpG number.
use TCGAbiolinks to download TCGA data(Gene Level Copy Number Scores)
We provide SNP_QC function to do quality control of SNP data
snpDf <- matrix(sample(c("AA", "Aa", "aa"), 100, replace = TRUE), 10, 10)
snpDf <- as.data.frame(snpDf)
sampleGroup <- sample(c("A", "B"), 10, replace = TRUE)
result <- SNP_QC(snpDf)Then use differential_SNP to do differential analysis.
The function gene_ave could average the expression data of different ids for the same gene in the GEO chip data. For example:
aa <- c("MARCH1","MARC1","MARCH1","MARCH1","MARCH1")
bb <- c(2.969058399,4.722410064,8.165514853,8.24243893,8.60815086)
cc <- c(3.969058399,5.722410064,7.165514853,6.24243893,7.60815086)
file_gene_ave <- data.frame(aa=aa,bb=bb,cc=cc)
colnames(file_gene_ave) <- c("Gene", "GSM1629982", "GSM1629983")
result <- gene_ave(file_gene_ave, 1)Multiple genes symbols may correspond to a same chip id. The result of function repAssign is to assign the expression of this id to each gene, and function repRemove deletes the expression. For example:
aa <- c("MARCH1 /// MMA","MARC1","MARCH2 /// MARCH3",
        "MARCH3 /// MARCH4","MARCH1")
bb <- c("2.969058399","4.722410064","8.165514853","8.24243893","8.60815086")
cc <- c("3.969058399","5.722410064","7.165514853","6.24243893","7.60815086")
input_file <- data.frame(aa=aa,bb=bb,cc=cc)
repAssign_result <- repAssign(input_file," /// ")
repRemove_result <- repRemove(input_file," /// ")id_conversion_TCGA could convert ENSEMBL gene id to gene Symbol in TCGA. For example:The parameter profile is a data.frame or matrix of gene expression data in TCGA.
Note: In previous versions(< 1.0.0) the id_conversion and id_conversion_TCGA used HGNC data to convert human gene id.
In future versions, we will use clusterProfiler::bitr for ID conversion.
countToFpkm and countToTpm could convert count data to FPKM or TPM data.data(gene_cov)
lung_squ_count2 <- matrix(c(1,2,3,4,5,6,7,8,9),ncol=3)
rownames(lung_squ_count2) <- c("DISC1","TCOF1","SPPL3")
colnames(lung_squ_count2) <- c("sample1","sample2","sample3")
result <- countToFpkm(lung_squ_count2, keyType = "SYMBOL", 
    gene_cov = gene_cov)
#> 'select()' returned 1:1 mapping between keys and columns
#> Warning in clusterProfiler::bitr(rownames(gene_cov), fromType = "ENTREZID", :
#> 0.07% of input gene IDs are fail to map...data(gene_cov)
lung_squ_count2 <- matrix(c(0.11,0.22,0.43,0.14,0.875,
    0.66,0.77,0.18,0.29),ncol=3)
rownames(lung_squ_count2) <- c("DISC1","TCOF1","SPPL3")
colnames(lung_squ_count2) <- c("sample1","sample2","sample3")
result <- countToTpm(lung_squ_count2, keyType = "SYMBOL", 
    gene_cov = gene_cov)sessionInfo()
#> R version 4.4.0 beta (2024-04-15 r86425)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] ChAMP_2.34.0                              
#>  [2] RPMM_1.25                                 
#>  [3] cluster_2.1.6                             
#>  [4] DT_0.33                                   
#>  [5] IlluminaHumanMethylationEPICmanifest_0.3.0
#>  [6] Illumina450ProbeVariants.db_1.39.0        
#>  [7] DMRcate_3.0.0                             
#>  [8] ChAMPdata_2.35.0                          
#>  [9] minfi_1.50.0                              
#> [10] bumphunter_1.46.0                         
#> [11] locfit_1.5-9.9                            
#> [12] iterators_1.0.14                          
#> [13] foreach_1.5.2                             
#> [14] Biostrings_2.72.0                         
#> [15] XVector_0.44.0                            
#> [16] SummarizedExperiment_1.34.0               
#> [17] Biobase_2.64.0                            
#> [18] MatrixGenerics_1.16.0                     
#> [19] matrixStats_1.3.0                         
#> [20] GenomicRanges_1.56.0                      
#> [21] GenomeInfoDb_1.40.0                       
#> [22] IRanges_2.38.0                            
#> [23] S4Vectors_0.42.0                          
#> [24] BiocGenerics_0.50.0                       
#> [25] GeoTcgaData_2.4.0                         
#> 
#> loaded via a namespace (and not attached):
#>   [1] R.methodsS3_1.8.2                                  
#>   [2] dichromat_2.0-0.1                                  
#>   [3] cqn_1.50.0                                         
#>   [4] progress_1.2.3                                     
#>   [5] nnet_7.3-19                                        
#>   [6] shinythemes_1.2.0                                  
#>   [7] kpmt_0.1.0                                         
#>   [8] HDF5Array_1.32.0                                   
#>   [9] vctrs_0.6.5                                        
#>  [10] digest_0.6.35                                      
#>  [11] png_0.1-8                                          
#>  [12] lumi_2.56.0                                        
#>  [13] ggrepel_0.9.5                                      
#>  [14] deldir_2.0-4                                       
#>  [15] combinat_0.0-8                                     
#>  [16] permute_0.9-7                                      
#>  [17] MASS_7.3-60.2                                      
#>  [18] reshape_0.8.9                                      
#>  [19] reshape2_1.4.4                                     
#>  [20] withr_3.0.0                                        
#>  [21] httpuv_1.6.15                                      
#>  [22] qvalue_2.36.0                                      
#>  [23] ggfun_0.1.4                                        
#>  [24] xfun_0.43                                          
#>  [25] survival_3.6-4                                     
#>  [26] doRNG_1.8.6                                        
#>  [27] memoise_2.0.1                                      
#>  [28] gson_0.1.0                                         
#>  [29] clusterProfiler_4.12.0                             
#>  [30] BiasedUrn_2.0.11                                   
#>  [31] tidytree_0.4.6                                     
#>  [32] gtools_3.9.5                                       
#>  [33] DNAcopy_1.78.0                                     
#>  [34] R.oo_1.26.0                                        
#>  [35] Formula_1.2-5                                      
#>  [36] prettyunits_1.2.0                                  
#>  [37] KEGGREST_1.44.0                                    
#>  [38] promises_1.3.0                                     
#>  [39] httr_1.4.7                                         
#>  [40] restfulr_0.0.15                                    
#>  [41] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1 
#>  [42] rhdf5filters_1.16.0                                
#>  [43] rhdf5_2.48.0                                       
#>  [44] rstudioapi_0.16.0                                  
#>  [45] UCSC.utils_1.0.0                                   
#>  [46] generics_0.1.3                                     
#>  [47] DOSE_3.30.0                                        
#>  [48] base64enc_0.1-3                                    
#>  [49] curl_5.2.1                                         
#>  [50] zlibbioc_1.50.0                                    
#>  [51] ggraph_2.2.1                                       
#>  [52] polyclip_1.10-6                                    
#>  [53] GenomeInfoDbData_1.2.12                            
#>  [54] quadprog_1.5-8                                     
#>  [55] ExperimentHub_2.12.0                               
#>  [56] SparseArray_1.4.0                                  
#>  [57] xtable_1.8-4                                       
#>  [58] stringr_1.5.1                                      
#>  [59] doParallel_1.0.17                                  
#>  [60] evaluate_0.23                                      
#>  [61] S4Arrays_1.4.0                                     
#>  [62] BiocFileCache_2.12.0                               
#>  [63] preprocessCore_1.66.0                              
#>  [64] hms_1.1.3                                          
#>  [65] colorspace_2.1-0                                   
#>  [66] filelock_1.0.3                                     
#>  [67] JADE_2.0-4                                         
#>  [68] IlluminaHumanMethylationEPICanno.ilm10b4.hg19_0.6.0
#>  [69] marray_1.82.0                                      
#>  [70] magrittr_2.0.3                                     
#>  [71] readr_2.1.5                                        
#>  [72] ggtree_3.12.0                                      
#>  [73] later_1.3.2                                        
#>  [74] viridis_0.6.5                                      
#>  [75] lattice_0.22-6                                     
#>  [76] genefilter_1.86.0                                  
#>  [77] shadowtext_0.1.3                                   
#>  [78] XML_3.99-0.16.1                                    
#>  [79] cowplot_1.1.3                                      
#>  [80] Hmisc_5.1-2                                        
#>  [81] pillar_1.9.0                                       
#>  [82] nlme_3.1-164                                       
#>  [83] compiler_4.4.0                                     
#>  [84] stringi_1.8.3                                      
#>  [85] dendextend_1.17.1                                  
#>  [86] GenomicAlignments_1.40.0                           
#>  [87] plyr_1.8.9                                         
#>  [88] crayon_1.5.2                                       
#>  [89] abind_1.4-5                                        
#>  [90] BiocIO_1.14.0                                      
#>  [91] gridGraphics_0.5-1                                 
#>  [92] graphlayouts_1.1.1                                 
#>  [93] org.Hs.eg.db_3.19.1                                
#>  [94] bit_4.0.5                                          
#>  [95] dplyr_1.1.4                                        
#>  [96] fastmatch_1.1-4                                    
#>  [97] codetools_0.2-20                                   
#>  [98] openssl_2.1.2                                      
#>  [99] bslib_0.7.0                                        
#> [100] biovizBase_1.52.0                                  
#> [101] plotly_4.10.4                                      
#> [102] multtest_2.60.0                                    
#> [103] mime_0.12                                          
#> [104] wateRmelon_2.10.0                                  
#> [105] splines_4.4.0                                      
#> [106] Rcpp_1.0.12                                        
#> [107] dbplyr_2.5.0                                       
#> [108] sparseMatrixStats_1.16.0                           
#> [109] IlluminaHumanMethylation450kmanifest_0.4.0         
#> [110] HDO.db_0.99.1                                      
#> [111] interp_1.1-6                                       
#> [112] knitr_1.46                                         
#> [113] blob_1.2.4                                         
#> [114] utf8_1.2.4                                         
#> [115] clue_0.3-65                                        
#> [116] BiocVersion_3.19.1                                 
#> [117] AnnotationFilter_1.28.0                            
#> [118] fs_1.6.4                                           
#> [119] checkmate_2.3.1                                    
#> [120] DelayedMatrixStats_1.26.0                          
#> [121] Gviz_1.48.0                                        
#> [122] ggplotify_0.1.2                                    
#> [123] tibble_3.2.1                                       
#> [124] Matrix_1.7-0                                       
#> [125] statmod_1.5.0                                      
#> [126] tzdb_0.4.0                                         
#> [127] tweenr_2.0.3                                       
#> [128] pkgconfig_2.0.3                                    
#> [129] tools_4.4.0                                        
#> [130] cachem_1.0.8                                       
#> [131] RSQLite_2.3.6                                      
#> [132] viridisLite_0.4.2                                  
#> [133] globaltest_5.58.0                                  
#> [134] DBI_1.2.2                                          
#> [135] impute_1.78.0                                      
#> [136] fastmap_1.1.1                                      
#> [137] rmarkdown_2.26                                     
#> [138] scales_1.3.0                                       
#> [139] grid_4.4.0                                         
#> [140] Rsamtools_2.20.0                                   
#> [141] AnnotationHub_3.12.0                               
#> [142] sass_0.4.9                                         
#> [143] patchwork_1.2.0                                    
#> [144] BiocManager_1.30.22                                
#> [145] VariantAnnotation_1.50.0                           
#> [146] scrime_1.3.5                                       
#> [147] rpart_4.1.23                                       
#> [148] farver_2.1.1                                       
#> [149] scatterpie_0.2.2                                   
#> [150] tidygraph_1.3.1                                    
#> [151] mgcv_1.9-1                                         
#> [152] yaml_2.3.8                                         
#> [153] latticeExtra_0.6-30                                
#> [154] foreign_0.8-86                                     
#> [155] rtracklayer_1.64.0                                 
#> [156] illuminaio_0.46.0                                  
#> [157] cli_3.6.2                                          
#> [158] purrr_1.0.2                                        
#> [159] siggenes_1.78.0                                    
#> [160] GEOquery_2.72.0                                    
#> [161] lifecycle_1.0.4                                    
#> [162] askpass_1.2.0                                      
#> [163] backports_1.4.1                                    
#> [164] BiocParallel_1.38.0                                
#> [165] annotate_1.82.0                                    
#> [166] methylumi_2.50.0                                   
#> [167] gtable_0.3.5                                       
#> [168] rjson_0.2.21                                       
#> [169] missMethyl_1.38.0                                  
#> [170] ape_5.8                                            
#> [171] limma_3.60.0                                       
#> [172] jsonlite_1.8.8                                     
#> [173] edgeR_4.2.0                                        
#> [174] isva_1.9                                           
#> [175] bitops_1.0-7                                       
#> [176] ggplot2_3.5.1                                      
#> [177] bit64_4.0.5                                        
#> [178] assertthat_0.2.1                                   
#> [179] yulab.utils_0.1.4                                  
#> [180] base64_2.0.1                                       
#> [181] geneLenDataBase_1.39.0                             
#> [182] jquerylib_0.1.4                                    
#> [183] GOSemSim_2.30.0                                    
#> [184] R.utils_2.12.3                                     
#> [185] lazyeval_0.2.2                                     
#> [186] shiny_1.8.1.1                                      
#> [187] htmltools_0.5.8.1                                  
#> [188] affy_1.82.0                                        
#> [189] enrichplot_1.24.0                                  
#> [190] GO.db_3.19.1                                       
#> [191] rappdirs_0.3.3                                     
#> [192] ensembldb_2.28.0                                   
#> [193] glue_1.7.0                                         
#> [194] ROC_1.80.0                                         
#> [195] httr2_1.0.1                                        
#> [196] RCurl_1.98-1.14                                    
#> [197] treeio_1.28.0                                      
#> [198] mclust_6.1.1                                       
#> [199] BSgenome_1.72.0                                    
#> [200] jpeg_0.1-10                                        
#> [201] gridExtra_2.3                                      
#> [202] igraph_2.0.3                                       
#> [203] R6_2.5.1                                           
#> [204] sva_3.52.0                                         
#> [205] tidyr_1.3.1                                        
#> [206] GenomicFeatures_1.56.0                             
#> [207] rngtools_1.5.2                                     
#> [208] Rhdf5lib_1.26.0                                    
#> [209] beanplot_1.3.1                                     
#> [210] aplot_0.2.2                                        
#> [211] bsseq_1.40.0                                       
#> [212] DelayedArray_0.30.0                                
#> [213] tidyselect_1.2.1                                   
#> [214] ProtGenerics_1.36.0                                
#> [215] nleqslv_3.3.5                                      
#> [216] htmlTable_2.4.2                                    
#> [217] ggforce_0.4.2                                      
#> [218] xml2_1.3.6                                         
#> [219] AnnotationDbi_1.66.0                               
#> [220] fastICA_1.2-4                                      
#> [221] munsell_0.5.1                                      
#> [222] KernSmooth_2.23-22                                 
#> [223] goseq_1.56.0                                       
#> [224] affyio_1.74.0                                      
#> [225] nor1mix_1.3-3                                      
#> [226] data.table_1.15.4                                  
#> [227] htmlwidgets_1.6.4                                  
#> [228] fgsea_1.30.0                                       
#> [229] RColorBrewer_1.1-3                                 
#> [230] biomaRt_2.60.0                                     
#> [231] rlang_1.1.3                                        
#> [232] topconfects_1.20.0                                 
#> [233] fansi_1.0.6