--- title: "curatedTBData" author: - name: Xutao Wang affiliation: - Department of Biostatistics, Boston University School of Public Health, Boston, MA, U.S.A. email: xutaow@bu.edu - name: Prasad Patil affiliation: - Department of Biostatistics, Boston University School of Public Health, Boston, MA, U.S.A. email: patil@bu.edu - name: W. Evan Johnson affiliation: - Section of Computational Biomedicine, Boston University School of Medicine, Boston, MA, U.S.A. email: wej@bu.edu package: curatedTBData abstract: > The curatedTBData is an R package that provides standardized, curated tuberculosis(TB) transcriptomic studies. The initial release of the package contains 49 studies. The curatedTBData package allows users to access tuberculosis trancriptomic efficiently and to make easy comparison for different TB gene signatures across multiple datasets. date: "`r Sys.Date()`" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{curatedTBData} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # Installation ```{r install curatedTBData, eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("curatedTBData") ``` # R packages ```{r libraries, message=FALSE, warning=FALSE, results='hide'} library(curatedTBData) library(dplyr) library(SummarizedExperiment) library(sva) ``` # Access to MetaData The `DataSummary` table summarized the list of available studies and their metadata information contained in the curatedTBData package. \ The table helps users to query avialable datasets quickly. ```{r accessimg datasummary} # Remove GeographicalRegion, Age, DiagnosisMethod, Notes, Tissue, HIVStatus for concise display data("DataSummary", package = "curatedTBData") DataSummary |> dplyr::select(-c(`Country/Region`, Age, DiagnosisMethod, Notes, Tissue, HIVStatus)) |> DT::datatable() ``` # Access curated tuberculosis gene expression profile Users can use `curatedTBData()` function to access data. There are three arguments in the function. The first argument `study_name` represents the names of the data that are used to determine the resources of interests. Users can find all available resource names from `DataSummary$Study`. The second argument `dry.run` enables users to determine the resources's availability before actually downloading them. When `dry.run` is set to `TRUE`, the output includes names of the resources. The third argument `curated.only` allows the users to access the curated ready-to-use data. If `curated.only` is `TRUE`, the function only download the curated gene expression profile and the clinical annotation of the corresponding data. If `curated.only` is `FALSE`, the function downloads all available resources for input studies. ```{r access data with dry.run} curatedTBData("GSE19439", dry.run = TRUE, curated.only = FALSE) ``` To download complete data for a Microarry study (e.g. [GSE19439](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19439)), we set `dry.run = FALSE` and `curated.only = FALSE`. There are two experiments assay being included in the output Microarray studies. The first experiment is `assay_curated`, which is a `matrix` that represents normalized, curated version of the gene expression profile. The second experiment is `assay_raw`, which is a `r Biocpkg("SummarziedExperiment")` object that contains the raw gene expression profile and information about probe features. ```{r download full data for GSE19439, warning=FALSE} GSE19439 <- curatedTBData("GSE19439", dry.run = FALSE, curated.only = FALSE) GSE19439 ``` When accessing data for RNA sequencing studies, another assay called `assay_reprocess` is included. This `matrix` represents the reprocessed version of gene expression profile from the raw .fastq files using `r Biocpkg("Rsubread")`. ```{r, warning=FALSE} GSE79362 <- curatedTBData("GSE79362", dry.run = FALSE, curated.only = FALSE) GSE79362 ``` A list of `r Biocpkg("MultiAssayExperiment")` objects is returned when `dry.run` is `FALSE`. \ To save running time, in the following example, we set the `curated.only = TRUE` and selected five studies that belong to [GSE19491](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19491) from the Gene Expression Omnibus. ```{r access selected curated datasets microarray, warning=FALSE, results='hide'} myGEO <- c("GSE19435", "GSE19439", "GSE19442", "GSE19444", "GSE22098") object_list <- curatedTBData(myGEO, dry.run = FALSE, curated.only = TRUE) ``` ```{r} object_list[1:2] ``` A full version example for RNA sequencing data: [GSE79362](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE79362). ```{r ACS RNA-seq, warning=FALSE, results='hide'} GSE79362 <- curatedTBData("GSE79362", dry.run = FALSE, curated.only = FALSE) ``` ```{r} GSE79362 ``` # Subset/Combine MultiAssayExperiment objects ## Subset The major advantage of using `r Biocpkg("MultiAssayExperiment")` is the coordination of the meta-data and assays when sub-setting. The MultiAssayExperiment object has built-in function for subsetting samples based on column condition. \ The following code shows how to select samples with only active TB. ```{r} GSE19439 <- object_list$GSE19439 GSE19439[, GSE19439$TBStatus == "PTB"]["assay_curated"] # 13 samples ``` The following example shows how to subset patients with active TB and LTBI ```{r} GSE19439[, GSE19439$TBStatus %in% c("PTB", "LTBI")]["assay_curated"] # 30 samples ``` # Dataset integeration ## Merge Studies with common gene symbols The `combine_objects()` function provides an easy implementation for combining different studies based on common gene symbols. The function returns a `r Biocpkg("SummarizedExperiment")` object that contains the merged assay and associated clinical annotation. Noticed that [GSE74092](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi) is usually removed from merging, because it used quantitative PCR, which did not have enough coverage to capture all genes. \ There are two arguments in the `combine_objects()` function. The first one is `object_list`, which takes a list of `r Biocpkg("MultiAssayExperiment")` objects obtained from `curatedTBData()`. Notice that the `names(object_list)` should not be `NULL` and must be unique for each object within the list, so that we can keep track the original study after merging. The second argument is `experiment_name`, which can be a string or vector of strings representing the name of the assay from the object. ```{r} GSE19491 <- combine_objects(object_list, experiment_name = "assay_curated", update_genes = TRUE) GSE19491 ``` When the objects are merged, the original individual data tag can be found in the `Study` section from the metadata. ```{r} unique(GSE19491$Study) ``` It is also possible to combine the given list of objects with different experiment names. In this case, the `experiment_name` is a vector of string that corresponds to each of object from the input list. ```{r commbine objects multiple experiment name} exp <- combine_objects(c(GSE79362[1], object_list[1]), experiment_name = c("assay_reprocess_hg19", "assay_curated"), update_genes = TRUE) exp ``` ## Batch correction If datasets are merged, it is typically recommended to remove a very likely batch effect. We use the `ComBat()` function from `r Biocpkg("sva")` to remove potential batch effect between studies. \ In the following example, each study is viewed as one batch. The batch corrected assay will be stored in a `r Biocpkg("SummarizedExperiment")` object. ```{r batch correction, message=FALSE} batch1 <- colData(GSE19491)$Study combat_edata1 <- sva::ComBat(dat = assay(GSE19491), batch = batch1) assays(GSE19491)[["Batch_corrected_assay"]] <- combat_edata1 GSE19491 ``` # Dataset subset The function `subset_curatedTBData()` allows the users to subset a list of `r Biocpkg("MultiAssayExperiment")` with the output contains the exact conditions given by the `annotationCondition`. \ With `subset_curatedTBData()`, users can quickly subset desired results from `r Biocpkg("curatedTBData")` database without checking individual object. \ There are four arguments in this function. The `theObject` represents a `r Biocpkg("MultiAssayExperiment")` or `r Biocpkg("SummarizedExperiment")` object. The `annotationColName` is a character that indicates the column name in the metadata. The `annotationCondition` is a character or vector of characters that the users intend to select. ## Select patients with Active TB and LTBI In the following example, we call `subset_curatedTBData()` function to subset samples with active TB (`PTB`) and latent TB infection (`LTBI`) for binary classification. ```{r subset active TB and LTBI, message=FALSE, warning=FALSE} multi_set_PTB_LTBI <- lapply(object_list, function(x) subset_curatedTBData(x, annotationColName = "TBStatus", annotationCondition = c("LTBI", "PTB"), assayName = "assay_curated")) # Remove NULL from the list multi_set_PTB_LTBI <- multi_set_PTB_LTBI[!sapply(multi_set_PTB_LTBI, is.null)] multi_set_PTB_LTBI[1:3] ``` ## Select other outcome of interests The HIV status (`HIVStatus`) and diabetes status (`DiabetesStatus`) for each subject were also recorded for each study in the `r Biocpkg("curatedTBData")`. \ In the following example, we select subjects with HIV positive from the input. \ Users can also find HIV status information for each study by looking at the column: `HIVStatus` from `DataSummary`. \ When the the length of the `annotationCondition` equals to 1, we can subset using either `MultiAssayExperiment` built-in procedure or `subset_curatedTBData`. ```{r subset patients with HIV, message=FALSE, warning=FALSE} multi_set_HIV <- lapply(object_list, function(x) subset_curatedTBData(x, annotationColName = "HIVStatus", annotationCondition = "Negative", assayName = "assay_curated")) # Remove NULL from the list multi_set_HIV <- multi_set_HIV[!vapply(multi_set_HIV, is.null, TRUE)] multi_set_HIV[1:3] ``` # Session Info ```{r session info} sessionInfo() ```