---
title: "Using curatedAdipoRNA"
author: "Mahmoud Ahmed"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using curatedAdipoRNA}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Overview

In this document, we introduce the purpose of the `curatedAdipoRNA` package, its contents and its potential use cases. This package is a curated dataset of RNA-Seq samples. The samples are MDI-induced pre-adipocytes (3T3-L1) at different time points/stages of differentiation. The package documents the data collection, pre-processing and processing. In addition to the documentation, the package contains the scripts that were used to generate the data in `inst/scripts/` and the final `RangedSummarizedExperiment` object in `data/`.

# Introduction

## What is `curatedAdipoRNA`?

It is an R package for documenting and distributing a curated dataset. The package doesn't contain any R functions.

## What is contained in `curatedAdipoRNA`?

The package contains two different things:

1. Scripts for documenting/reproducing the data in `inst/scripts`
2. Final `RangedSummarizedExperiment` object in `data/`

## What is `curatedAdipoRNA` for?

The `RangedSummarizedExperiment` object `adipo_counts` contains the gene counts, `colData`, `rowRanges` and `metadata`, which can be used to conduct differential expression or gene set enrichment analysis on the cell line model.

# Installation

The `curatedAdipoRNA` package can be installed from Bioconductor using `BiocManager`.

```{r install_biocmanager,eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("curatedAdipoRNA")
```

# Docker image

The environment set up for pre-processing and processing the data is available as a `docker` image. This image is also suitable for reproducing this document.
The `docker` image can be obtained using the `docker` CLI client.

```
$ docker pull bcmslab/adiporeg_rna:latest
```

# Generating `curatedAdipoRNA`

## Search strategy & data collection

The term "3T3-L1" was used to search the NCBI **SRA** repository. The results were sent to the **run selector**. 1,176 runs were viewed. The runs were faceted by **Assay Type** and filtered to "rna-seq", which resulted in 323 runs. Only 98 samples from 16 different studies were included after being manually reviewed to fit the following criteria:

* The raw data is available from GEO and has a GEO identifier (GSM#)
* The raw data is linked to a published, publicly available article
* The protocols for generating the data sufficiently describe the origin of the cell line, the differentiation medium and the time points when the samples were collected.
* In case the experimental design included treatments other than the differentiation media, the control (non-treated) samples were included.

Note: The data quality and the platform discrepancies are not included in these criteria.

## Pre-processing

The scripts to download and process the raw data are located in `inst/scripts/` and are glued together to run sequentially by the GNU make file `Makefile`. The following is a description of the recipes in the `Makefile` with emphasis on the software versions, options, inputs and outputs.

### 1. Downloading data `download_fastq`

* Program: `wget` (1.18)
* Input: `run.csv`, the URLs column
* Output: `*.fastq.gz`
* Options: `-N`

### 2. Making a genome index `make_index`

* Program: `hisat2-build` (2.0.5)
* Input: URL for mm10 mouse genome fasta files
* Output: `*.bt2` bowtie2 index for the mouse genome
* Options: defaults

### 3. Downloading annotations `get_annotation`

* Program: `wget` (1.18)
* Input: URL for mm10 gene annotation file
* Output: `annotation.gtf`
* Options: `-N`

### 4. Aligning reads `align_reads`

* Program: `hisat2` (2.0.5)
* Input: `*.fastq.gz` and `mm10/` bowtie2 index for the mouse genome
* Output: `*.sam`
* Options: defaults

### 5. Counting features `count_features`

* Program: `featureCounts` (1.5.1)
* Input: `*.bam` and the annotation `gtf` file for the mm10 mouse genome.
* Output: `*.txt`
* Options: defaults

### Quality assessment `fastqc`

* Program: `fastqc` (0.11.5)
* Input: `*.fastq.gz` and `*.sam`
* Output: `*_fastqc.zip`
* Options: defaults

## Processing

The aim of this step is to construct a self-contained object with minimal manipulation of the pre-processed data, followed by a simple exploration of the data in the next section.

### Making a summarized experiment object `make_object`

The steps required to make this object from the pre-processed data are documented in the script and are meant to be fully reproducible when run through this package. The output is a `RangedSummarizedExperiment` object containing the gene counts, the phenotype and feature data, and the metadata.

The `RangedSummarizedExperiment` contains

* The gene counts matrix `gene_counts`
* The phenotype data `colData`
* The feature data `rowRanges`
* The metadata `metadata`, which contains a `data.frame` of the studies from which the samples were collected.

## Exploring the `adipo_counts` object

In this section, we conduct a simple exploration of the data object to show the contents of the package and how they can be loaded and used.

```{r loading_libraries, message=FALSE}
# loading required libraries
library(curatedAdipoRNA)
library(SummarizedExperiment)
library(S4Vectors)
library(fastqcr)
library(DESeq2)
library(dplyr)
library(tidyr)
library(ggplot2)
```

```{r loading_data}
# load data
data("adipo_counts")

# print object
adipo_counts
```

The count matrix can be accessed using `assay`. Here we show the first five entries of the first five samples.

```{r adipo_counts}
# print count matrix
assay(adipo_counts)[1:5, 1:5]
```

The phenotype/sample data is a `DataFrame` that can be accessed using `colData`. The `time` and `stage` columns encode the time point in hours and the stage of differentiation, respectively.

```{r colData}
# names of the coldata object
names(colData(adipo_counts))

# table of times column
table(colData(adipo_counts)$time)

# table of stage column
table(colData(adipo_counts)$stage)
```

Other columns in `colData` hold selected information about the samples/runs or identifiers to different databases. The following table describes each of these columns.

| col_name         | description |
|------------------|-------------|
| id               | The GEO sample identifier. |
| study            | The SRA study identifier. |
| pmid             | The PubMed ID of the article where the data were published originally. |
| time             | The time point of the sample when collected in hours. The time is recorded from the beginning of the protocol as 0 hours. |
| stage            | The stage of differentiation of the sample when collected. Possible values are 0 to 3; 0 for non-differentiated; 1 for differentiating; and 2/3 for maturing samples. |
| bibtexkey        | The key of the study where the data were published originally. This maps to the studies object of the metadata which records the study information in bibtex format. |
| run              | The SRA run identifier. |
| submission       | The SRA study submission identifier. |
| sample           | The SRA sample identifier. |
| experiment       | The SRA experiment identifier. |
| study_name       | The GEO study/series identifier. |
| library_layout   | The type of RNA library. Possible values are SINGLE for single-end and PAIRED for paired-end runs. |
| instrument_model | The name of the sequencing machine that was used to obtain the sequence reads. |
| qc               | The quality control output of fastqc on the separate files/runs. |

Using the identifiers in `colData` along with Bioconductor packages such as [`GEOmetadb`](http://bioconductor.org/packages/GEOmetadb/) and/or [`SRAdb`](http://bioconductor.org/packages/SRAdb/) gives access to the sample metadata as submitted by the authors or recorded in the data repositories.

The feature data is a `GRanges` object and can be accessed using `rowRanges`.

```{r rowRanges}
# print GRanges object
rowRanges(adipo_counts)
```

`qc` is a column of `colData`; it is a list of lists. Each entry in the list corresponds to one sample. Each sample has one or more objects of class `qc_read`, because paired-end samples have two separate files on which the `fastqc` quality control was run.

```{r qc}
# show qc data
adipo_counts$qc

# show the class of the first entry in qc
class(adipo_counts$qc[[1]][[1]])
```

The metadata is a list of one object. `studies` is a `data.frame` containing the bibliographic information of the studies from which the data were collected. Here we show the first entry in `studies`.

```{r metadata, message=FALSE}
# print data of first study
metadata(adipo_counts)$studies[1,]
```

# Summary of the studies in the dataset

```{r summary_table,echo=FALSE}
# generate a study summary table
colData(adipo_counts) %>%
  as.data.frame() %>%
  group_by(study_name) %>%
  summarise(pmid = unique(pmid),
            nsamples = n(),
            time = paste(unique(time), collapse = '/'),
            stages = paste(unique(stage), collapse = '/'),
            instrument_model = unique(instrument_model)) %>%
  knitr::kable(align = 'cccccc',
               col.names = c('GEO series ID', 'PubMed ID', 'Num. of Samples',
                             'Time (hr)', 'Differentiation Stage',
                             'Instrument Model'))
```

# Example of using `curatedAdipoRNA`

## Motivation

All the samples in this dataset come from the [3T3-L1](https://en.wikipedia.org/wiki/3T3-L1) cell line.
The [MDI](http://www.protocol-online.org/prot/Protocols/In-Vitro-Adipocytes-Differentiation-4789.html) induction medium was used to induce adipocyte differentiation. The two important variables in the dataset are `time` and `stage`, which correspond to the time point and the stage of differentiation when the samples were collected. Ideally, this dataset should be treated as a time course. However, for the purposes of this example, we only use samples from two time points, 0 and 24 hours, and treat them as independent groups. The goal of this example is to show how a typical differential expression analysis can be applied to the dataset. The main focus is to explain how the data and metadata in `adipo_counts` fit into each main piece of the analysis. We start by filtering out the low quality samples and low count genes. Then we apply the `DESeq2` method with the default values.

## Filtering low quality samples

First, we subset the `adipo_counts` object to the samples at time points 0 or 24 hours. The total number of samples is 30; 22 at 0 hours and 8 at 24 hours. The total number of features/genes in the set is 23916.

```{r subset_object}
# subsetting counts to 0 and 24 hours
se <- adipo_counts[, adipo_counts$time %in% c(0, 24)]

# showing the numbers of features, samples and time groups
dim(se)
table(se$time)
```

Since the quality metrics are reported per run file, we need to get the `SRR*` id for each of the samples. Notice that some samples have more than one file: the paired-end samples each have two files, `SRR*_1` and `SRR*_2`.

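
As a minimal sketch of what flattening a per-sample list of per-file reports looks like, the toy list below mimics the shape of `se$qc`. It assumes that each sample entry is named by its sample id and each file by its run id; the `separate()` call on `'\\.'` later in this example suggests that naming, but it is an assumption here. All identifiers and values are made up for illustration.

```r
# Toy stand-in for se$qc: one entry per sample, one element per fastq file.
# Every identifier below is hypothetical.
qc_toy <- list(
  GSM0000001 = list(SRR0000001 = "single-end report"),
  GSM0000002 = list(SRR0000002_1 = "mate 1 report",
                    SRR0000002_2 = "mate 2 report")
)

# number of files per sample (1 for single-end, 2 for paired-end)
n_files <- lengths(qc_toy)

# flatten one level only; the names become "<sample>.<file>"
flat <- unlist(qc_toy, recursive = FALSE)
flat_names <- names(flat)

# recover the sample and file ids from the combined names
ids <- do.call(rbind, strsplit(flat_names, ".", fixed = TRUE))
colnames(ids) <- c("id", "run")
ids
```

Flattening with `recursive = FALSE` keeps each report intact while joining the outer and inner names with a dot, which is why the combined names can be split back into a sample id and a run id.
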
```{r filtering_samples}
# filtering low quality samples
# check the library layout
table(se$library_layout)

# check the number of files in qc
qc <- se$qc
table(lengths(qc))

# flattening qc list
qc <- unlist(qc, recursive = FALSE)
length(qc)
```

The `qc` object of the `colData` contains the output of [fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) in a `qc_read` class. More information on this object can be accessed by calling `?fastqcr::qc_read`. Here, we only use the `per_base_sequence_quality` to filter out low quality samples. This is by no means sufficient quality control, but it illustrates the point.

```{r per_base_scores}
# extracting per_base_sequence_quality
per_base <- lapply(qc, function(x) {
  df <- x[['per_base_sequence_quality']]
  df %>%
    select(Base, Mean) %>%
    # split base ranges on '-' so that each end point gets its own row
    transform(Base = strsplit(as.character(Base), '-')) %>%
    unnest(Base) %>%
    mutate(Base = as.numeric(Base))
}) %>%
  bind_rows(.id = 'run')
```

After tidying the data, we get a `data.frame` with three columns: `run`, `Mean` and `Base` for the run ID, the mean quality score and the base number in each read. [fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) provides thorough documentation of this quality control module and others. Notice that the read length varies significantly between the runs and that the average of the per-base mean scores is a reasonable summary of each run's quality.

```{r score_summary}
# a quick look at quality scores
summary(per_base)
```

To identify the low quality samples, we categorize the runs by `length` and `run_average`, which flag whether the read length exceeds 150 bases and whether the average of the per-base mean scores exceeds 34. The following figure should make it easier to see why these cutoffs were used in this case.

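
The categorization can be sketched on a toy table in base R. The run names and scores below are made up, the column names `long_read` and `good_run` are hypothetical, and the cutoffs of 150 bases and an average score of 34 are the ones used in the chunk that draws the figure. Treat this as an illustration under those assumptions rather than the package's own code.

```r
# Toy per-base table; run names and scores are hypothetical.
per_base_toy <- data.frame(
  run  = rep(c("runA", "runB", "runC"), each = 3),
  Base = c(1, 50, 100,  1, 100, 200,  1, 50, 100),
  Mean = c(36, 35, 34,  36, 35, 35,   33, 30, 28)
)

# one row per run: read length (largest base position) and average score
by_run <- data.frame(
  run         = sort(unique(per_base_toy$run)),
  length      = as.vector(tapply(per_base_toy$Base, per_base_toy$run, max)),
  run_average = as.vector(tapply(per_base_toy$Mean, per_base_toy$run, mean))
)

# flag long reads (> 150 bases) and runs with a good average score (> 34)
by_run$long_read <- by_run$length > 150
by_run$good_run  <- by_run$run_average > 34
by_run
```

In this toy table only `runC` falls below the score cutoff, so it is the one that would be treated as a low quality run.
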
```{r finding_low_scores,fig.align='center',fig.height=3,fig.width=7}
# find low quality runs
per_base <- per_base %>%
  group_by(run) %>%
  mutate(length = max(Base) > 150,
         run_average = mean(Mean) > 34)

# plot average per base quality
per_base %>%
  ggplot(aes(x = Base, y = Mean, group = run, color = run_average)) +
  geom_line() +
  facet_wrap(~length, scales = 'free_x')
```

The run IDs of the "bad" samples are then used to remove them from the dataset.

```{r remove_lowscore}
# get run ids of low quality samples
bad_samples <- data.frame(samples = unique(per_base$run[per_base$run_average == FALSE]))
bad_samples <- separate(bad_samples, col = samples, into = c('id', 'run'), sep = '\\.')

# subset the counts object
se2 <- se[, !se$id %in% bad_samples$id]
table(se2$time)
```

## Filtering low count genes

To remove the low count features/genes (possibly not expressed), we keep only the features with more than 10 reads in 2 or more samples and subset the object to exclude the rest.

```{r remove low_counts}
# filtering low count genes
# TRUE marks genes with more than 10 reads in at least 2 samples (kept)
low_counts <- apply(assay(se2), 1, function(x) length(x[x>10])>=2)
table(low_counts)

# subsetting the count object
se3 <- se2[low_counts,]
```

## Applying differential expression using `DESeq2`

[DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) is a well documented and widely used R package for differential expression analysis. Here we use the default values of `DESeq` to find the genes which are differentially expressed between the samples at 24 hours and 0 hours.

```{r differential expression}
# differential expression analysis
se3$time <- factor(se3$time)
dds <- DESeqDataSet(se3, ~time)
dds <- DESeq(dds)
res <- results(dds)
table(res$padj < .1)
```

## Next!

In this example, we didn't attempt to correct for the between-study factors that might confound the results.
To show how this confounding can arise, we use [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) plots with a few of these factors in the following graphs. The first uses the `time` factor, which is the factor of interest in this case. We see that the `rlog` transformation from `DESeq2` did a good job of separating the samples into their expected groups. However, it also seems that `time` is not the only factor in play. For example, we show in the second and third graphs two other factors, `library_layout` and `instrument_model`, which might explain some of the variance between the samples. This is expected because the data were collected from different studies using slightly different protocols and different sequencing machines. Therefore, it is necessary to account for these differences to obtain reliable results. There are multiple methods to do that, such as [Removing Unwanted Variation (RUV)](http://www-personal.umich.edu/~johanngb/ruv/) and [Surrogate Variable Analysis (SVA)](http://bioconductor.org/packages/release/bioc/html/sva.html).

```{r pca,fig.align='center',fig.height=4,fig.width=4}
# explaining variance
# compute the rlog transformation once and reuse it for all three plots
rld <- rlog(dds)
plotPCA(rld, intgroup = 'time')
plotPCA(rld, intgroup = 'library_layout')
plotPCA(rld, intgroup = 'instrument_model')
```

## Citing the studies in this subset of the data

Speaking of studies, as mentioned earlier, the `studies` object contains the full references of the original studies in which the data were published. Please cite them when using this dataset.

```{r studies_keys}
# keys of the studies in this subset of the data
unique(se3$bibtexkey)
```

# Citing `curatedAdipoRNA`

To cite the package, use:

```{r citation, warning=FALSE}
# citing the package
citation("curatedAdipoRNA")
```

# Session Info

```{r session_info}
devtools::session_info()
```