--- title: "Tumor, tumor-adjacent normal, and healthy colorectal transcriptomes as a SummarizedExperiment on Bioconductor's ExperimentHub" author: - name: Christopher Dampier affiliation: - University of Virginia - Center for Public Health Genomics - Department of General Surgery email: chd5n@virginia.edu package: "FieldEffectCrc" date: "`r BiocStyle::doc_date()`" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{SummarizedExperiments of colorectal transcriptomes} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction Data included in this package was generated by harmonizing all publicly available bulk RNA-seq samples from human primary colorectal tissue available through NCBI's [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra) and [database of Genotypes and Phenotypes](https://www.ncbi.nlm.nih.gov/gap/), NCI's [Genomic Data Commons](https://gdc.cancer.gov/), and the NIH-funded [BarcUVa-Seq](https://barcuvaseq.org/) project. FASTQ files were downloaded directly or generated from BAM files using [biobambam2](https://gitlab.com/german.tischler/biobambam2). Gene expression estimates were generated from FASTQ files using [Salmon](https://combine-lab.github.io/salmon/) and aggregated from `quant.sf` files using [tximport](https://bioconductor.org/packages/release/bioc/html/tximport.html). The original processed data are stored as tab-delimited text on [Synapse](https://www.synapse.org/) under Synapse ID [syn22237139](https://www.synapse.org/#!Synapse:syn22237139/wiki/604471) and packaged as `SummarizedExperiment` objects on `ExperimentHub` to facilitate reproduction and extension of results published in Dampier et al. (PMCID: PMC7386360, PMID: 32764205). # Installation The following example demonstrates how to download the dataset from `ExperimentHub`. First, make sure the Bioconductor packages `BiocManager`, `SummarizedExperiment`, and `ExperimentHub` are installed: ```{r, install-packages, eval=FALSE} if (!requireNamespace("BiocManager", quietly=TRUE)) { install.packages("BiocManager") } BiocManager::install(c("SummarizedExperiment", "ExperimentHub", "FieldEffectCrc")) ``` Second, load the `SummarizedExperiment` and `ExperimentHub` packages: ```{r, load-hub} library(SummarizedExperiment) library(ExperimentHub) ``` Third, find the `FieldEffectCrc` data objects and look at their descriptions: ```{r, find-data} hub = ExperimentHub() ## simply list the resource names ns <- listResources(hub, "FieldEffectCrc") ns ## query the hub for full metadata crcData <- query(hub, "FieldEffectCrc") crcData ## extract metadata df <- mcols(crcData) df ## see where the cache is stored hc <- hubCache(crcData) hc ``` Fourth, explore the metadata of a particular file. Let's look at the data for cohort A: ```{r, explore-file} md <- hub["EH3524"] md ``` Fifth, download the data object itself as a `SummarizedExperiment`: ```{r, download-data} ## using resource ID data1 <- hub[["EH3524"]] data1 ## using loadResources() data2 <- loadResources(hub, "FieldEffectCrc", "cohort A") data2 ``` Since the data is a `SummarizedExperiment`, the different phenotypes can be accessed using `colData()`. The phenotype of interest in the field effect analysis is referred to as `sampType`: ```{r, phenotypes} summary(colData(data1)$sampType) ``` # Quick Start To immediately access the count data for the 834 samples in cohort A, load the `SummarizedExperiment` object for cohort A as demonstrated in the [Installation](#installation) section immediately above. Then use the `assays()` accessor function to extract the `counts` assay, which is the second assay in the `SummarizedExperiment`: ```{r, quick-start-1} se <- data1 counts1 <- assays(se)[["counts"]] ``` Note that `assay()`, the singular as opposed to plural, will by default extract the first assay element from a `SummarizedExperiment` but will extract the 'i'th assay element with the `assay(se, i)` syntax: ```{r, quick-start-2} counts2 <- assay(se, 2) ``` The two alternatives are equivalent: ```{r, quick-start-3} all(counts1==counts2) ``` # Case Study We are interested in conducting a differential expression analysis using `DESeq2` to identify genes over- and under-expressed in different colorectal tissue phenotypes. Due to substantial technical artifacts between single-end and paired-end library formats observed in the full `FieldEffectCrc` data set, samples are grouped according to library format, with paired-end samples in cohorts A and B and single-end samples in cohort C. Cohort A is the primary analysis cohort and the one we want to use to discover differentially expressed genes. To begin, load cohort A as demonstrated in the [Installation](#installation) section immediately above if not already loaded. In the following steps, we will use `DESeq2` to perform differential expression analysis on expression data stored in the `SummarizedExperiment` object for cohort A. We will adjust for latent batch effects with surrogate variable analysis as implemented in the `sva` package. This analysis would typically consume excess memory and time, so for the purpose of demonstration, we perform the analysis on a small subset of the full data. ## Prepare DESeqDataSet In order to take advantage of the feature lengths preserved by `tximport`, we will recreate a `tximport` object from the `SummarizedExperiment` object downloaded from `ExperimentHub` and use the `DESeqDataSetFromTximport()` function to create the `DESeqDataSet` object. Alternatively, the `SummarizedExperiment` object itself could be used to create the `DESeqDataSet` object using the `DESeqDataSet()` function, but the order of the assays would need to be re-arranged and the counts would need to be rounded, as the `DESeqDataSet()` function expects `counts` with integer values to be the first assay in a `SummarizedExperiment` object, and `abundance` with floating-point values is the first assay in the `FieldEffectCrc` objects to facilitate conversion to `tximport` objects. Such a re-ordering is facilitated by the `FieldEffectCrc::reorder_assays()` function. Yet another option is to extract the `counts` assay as a matrix and use the `DESeqDataSetFromMatrix()` function to create the `DESeqDataSet` object. Since there are probably some benefits to feature length normalization as implemented in `DESeq2` when such information is available, we demonstrate creation of the `tximport` object. First, make the `tximport` object, which is just an object of class `list` from base R: ```{r, make-txi} ## manual construction txi <- as.list(assays(se)) txi$countsFromAbundance <- "no" ## convenience function construction txi <- make_txi(se) ``` Next, we could assign the clinical annotations from `colData(se)` to an `S4Vectors::DataFrame` or a `base::data.frame` object, since that is what `DESeqDataSetFromTximport()` expects for the `colData` argument. However, the `colData` slot of a `SummarizedExperiment` object already stores the clinical annotations as an `S4Vectors::DataFrame`, so we are ready to create the `DESeqDataSet` object: ```{r, make-dds} if (!requireNamespace("DESeq2", quietly=TRUE)) { BiocManager::install("DESeq2") } library(DESeq2) dds <- DESeqDataSetFromTximport(txi, colData(se), ~ sampType) ``` Note that we are interested in expression differences between phenotypes, which are coded in the `sampType` field, so we include `sampType` in our preliminary design. ## Filter Genes Normally, we would pre-filter genes to select genes whose quantities are likely to be estimated accurately. We could use something like the following code, which selects for genes with a count greater than 10 RNA molecules in at least a third of all samples: ```{r, typical-filter} countLimit <- 10 sampleLimit <- (1/3) * ncol(dds) keep <- rowSums( counts(dds) > countLimit ) >= sampleLimit dds <- dds[keep, ] ``` However, to expedite subsequent computational steps in this demonstration, we will select a random subset of genes and samples for analysis and then filter out any gene with a count less than 10 across all samples: ```{r, special-filter} r <- sample(seq_len(nrow(dds)), 6000) c <- sample(seq_len(ncol(dds)), 250) dds <- dds[r, c] r <- rowSums(counts(dds)) >= 10 dds <- dds[r, ] ``` ## Estimate Size Factors and Dispersions, Fit Negative Binomial Models, and Perform Wald Test Amazingly, `DESeq2` estimates size factors (i.e. normalization factors based on library size and feature length) and dispersions for every gene, fits negative binomial models for every gene, and performs a Wald test for every gene in a single function: ```{r, size-factors} dds <- DESeq(dds) ``` Pretty incredible. Unfortunately, we have to do some more work before we can fit useful models. The size factor estimates are the key result of the `DESeq()` function at this stage. ## Estimate Latent Factors If we trusted our high-throughput assays to be perfect representations of the underlying biology they were designed to measure, we would be ready to test for differential expression. However, our dataset is composed of samples from several studies collected and processed in various ways over several years, so we have reason to suspect there are latent factors driving non-biological variation in the expression data. Fortunately, we can use `sva` to estimate the latent factors and include them in our `DESeq2` design. For a full tutorial on how to use the `sva` package, see the package [vignette](https://bioconductor.org/packages/release/bioc/vignettes/sva/inst/doc/sva.pdf). For this demonstration, we will provide only cursory motivations for the code. ### Prepare Inputs The `sva` functions will take the normalized count data as well as model matrices as inputs. The full model matrix should include the variables of interest (e.g. `sampType`) as well as adjustment variables (e.g. `sex`, `age`), which we do not use here. The null model matrix should include adjustment variables only (i.e. no variables of interest). Without any adjustment variables, the null model includes an intercept term alone. ```{r, prep-input} dat <- counts(dds, normalized=TRUE) ## extract the normalized counts mod <- model.matrix(~sampType, data=colData(dds)) ## set the full model mod0 <- model.matrix(~1, data=colData(dds)) ## set the null model ``` ### Identify Number of Factors to Estimate The `sva` package offers two approaches for estimating the number of surrogate variables that should be included in a differential expression model, the Buja Eyuboglu and Leek approaches. Buja Eyuboglu is the default approach. ```{r, get-nsv} if (!requireNamespace("sva", quietly=TRUE)) { BiocManager::install("sva") } library(sva) nsv <- num.sv(dat, mod, method=c("be"), B=20, seed=1) ## Buja Eyuboglu method ``` ### Estimate Factors The crucial function for estimating surrogate variables in RNA-seq data is `svaseq()`: ```{r, get-svs} svs <- svaseq(dat, mod, mod0, n.sv=nsv) ``` The output is a list of 4 elements, the first of which is the latent factor estimate with a vector for each factor. The name of this element is `sv`. ### Append Latent Factors to colData We want to include the latent factors as adjustment variables in our statistical testing. Here, we add the surrogate variables to `colData` and update the `DESeqDataSet` design formula: ```{r, add-svs} for (i in seq_len(svs$n.sv)) { newvar <- paste0("sv", i) colData(dds)[ , newvar] <- svs$sv[, i] } nvidx <- (ncol(colData(dds)) - i + 1):ncol(colData(dds)) newvars <- colnames(colData(dds))[nvidx] d <- formula( paste0("~", paste(paste(newvars, collapse="+"), "sampType", sep="+")) ) design(dds) <- d ``` ## Re-Fit Negative Binomial Models and Perform Wald Tests with Surrogate Variables Now that we have the surrogate variables in the `DESeqDataSet`, we can perform the differential expression analysis we set out to do: ```{r, test-de} dds <- DESeq(dds) ``` ## Extract Results We want to extract results from each of the following 3 comparisons: 1. HLT vs NAT 2. NAT vs CRC 3. HLT vs CRC With three categories under consideration, the `DESeq()` function call tests the first level against the other two, but the second and third levels are not tested against each other. (Note that the order of the levels is simply alphabetical if they are not explicitly set.) Furthermore, the results reported with the `results()` function are for the comparison of the first level against the last by default. Therefore, in order to extract all the results of interest, we set contrasts as follows: ```{r, results} cons <- list() m <- combn(levels(colData(dds)$sampType), 2) for (i in seq_len(ncol(m))) { cons[[i]] <- c("sampType", rev(m[, i])) names(cons) <- c( names(cons)[seq_len(length(cons) - 1)], paste(rev(m[, i]), collapse="v") ) } res <- list() for (i in seq_len(length(cons))) { res[[i]] <- results(dds, contrast=cons[[i]], alpha=0.05) ## default alpha is 0.1 } names(res) <- names(cons) ``` ## Display Results Let's take a look at the results: ```{r, display} lapply(res, head) lapply(res, summary) ``` # sessionInfo() ```{r, sessionInfo} sessionInfo() ```