--- title: "Using ObMiTi" author: "Omar Elashkar" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Using ObMiTi} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # ObMiTi A MusMus Dataset of Ob/ob and WT mice on different diets. # Overview In this document, we introduce the purpose of `ObMiTi` package, its contents and its potential use cases. This package is a dataset of RNA-seq samples. The samples are of 6 ob/ob mice and 6 wild type mice divided further into High fat diet and normal diet. From each mice 7 tissues has been analyzed. The duration of dieting was 20 weeks. The package document the data collection, pre-processing and processing. In addition to the documentation the package contains the scripts that were used to generate the data object from the processed data. This data is deposited as `RangedSummarizedExperiment` object and can be accessed through `ExperimentHub`. # Introduction ## What is `ObMiTi`? It is an R package for documenting and distributing a dataset. The package doesn't contain any R functions. ## What is contained in `ObMiTi`? The package contains two different things: 1. Scripts for documenting/reproducing the data in `inst/scripts` 2. Access to the final `RangedSummarizedExperiment` through `ExperimentHub`. ## What is `ObMiTi` for? The `RangedSummarizedExperiment` object contains the `counts`, `colData`, `rowRanges` and `metadata` which can be used for the purposes of differential gene expression and get set enrichment analysis. # Installation The `ObMiTi` package can be installed from Bioconductor using `BiocManager`. ```{r, eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("ObMiTi") ``` # Generating `ObMiTi` ## 1. RNA-Seq Analysis RNA-seq analysis of wild type, and ob/ob mice at 25 weeks of age (n = 3 mice per group) . The sequencing library was constructed using Illumina’s TruSeq RNA Prep kit (Illumina Inc., San Diego, CA, USA), and data generation was performed using the NextSeq 500 platform (Illumina Inc.) following the manufacturer’s protocol. ## 2. Quality Control * Program: Trimmomatic (0.36) * Input: `*.fastq.gz` * Options: PE ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 ## 3. Aligning reads * Program: `HISAT2` (2.0.5) * Input: `*.fastq.gz` and `GRCm38` bowtie2 index for the mouse genome * Output: `*. * Options: defaults ## 4. Counting * Program: `FeatureCount * Input: `*.bam` * Output: `MouseRNA-seq.txt` * Options: defaults ## Processing The aim of this step is to construct a self-contained object with minimal manipulations of the pre-processed data followed by a simple exploration of the data in the next section. ### Making a summarized experiment object `ob_counts` The required steps to make this object from the pre-processed data are documented in the script and are supposed to be fully reproducible when run through this package. The output is a `RangedSummarizedExperiment` object containing the peak counts and the phenotype and features data and metadata. The `RangedSummarizedExperiment` contains * The gene counts matrix `counts` * The phenotype data `colData`. The column `name` links samples with the counts columns. * The feature data `rowRanges` * The metadata `metadata` which contain a `data.frame` of extra details about the sample collected and phenotype. ## Exploring the `ob_counts` object In this section, we conduct a simple exploration of the data objects to show the content of the package and how they can be loaded and used. ```{r} # loading required libraries library(ExperimentHub) library(SummarizedExperiment) ``` ```{r} # query package resources on ExperimentHub eh <- ExperimentHub() query(eh, "ObMiTi") # load data from ExperimentHub ob_counts <- query(eh, "ObMiTi")[[1]] # print object ob_counts ``` The count matrix can be accessed using `assay`. Here we show the first five entries of the first five samples. ```{r} # print count matrix assay(ob_counts)[1:5, 1:5] ``` The phenotype/samples data is a `data.frame`, It can be accessed using `colData`. ```{r} # View Structure of counts str(colData(ob_counts)) # Studies' metadata available names(colData(ob_counts)) # Sample GSM ID (Same ob_counts$geo_accession) rownames(colData(ob_counts)) # Sample strain, tissue and diet ID ob_counts$title # Frequencies of different diets table(ob_counts$diet.ch1) # Frequncies of tissues table(ob_counts$tissue.ch1) # crosstable of tissue and diet and stratify by genotype table(ob_counts$diet.ch1, ob_counts$tissue.ch1,ob_counts$genotype.ch1) # Summarize Numeric data summary(data.frame(colData(ob_counts))) ``` Other columns in `colData` are selected information about the samples/runs or identifiers to different databases. The following table provides the description of each of these columns. Here are a brief description about the key columns. | col_name | description | |-----------------------|----------------------------------------------------------| | title | Sample title include strain, diet, tissue and replicate | | genotype.ch1 | the mice type; either ob/ob or WT | | diet.ch1 | The diet type; either high fat (HFD) or Normal diet (ND) | | tissue.ch1 | tissue type. 7 tissues included* | |-----------------------|----------------------------------------------------------| \* Ao: arota, Ep=Epididymis; He=Heart; Hi=Hippocampus; Hy=Hypothalamus; Li=Liver; Sk=Skeletal Muscle. Additional information about mice characteristics can be accessed from the `metadata`. The main dataframe passed is measures. You can access measures as: ```{r} metadata(ob_counts)$measures ``` The information presented in `measures` table is described in the table below: | col_name | description | |-----------------------|----------------------------------------------------------| | blood | Total blood volume | | weight | mice weight | | fasting_glucose | Fasting blood glucose measurement | | brain | Brain weight | | Li | Liver weight | | Ep | Epididymis weight | | mesentrec_fact | Mesenteric fat weight | | reteroperitoneal_fact | Reteroperitoneal fat weight | | ALT_UL | ALT measurment (U/L) | | AST_UL | AST measurement (U/L) | | T.Chol_mgdL | Total cholesterol measurement (mg/dL) | | FFA_uEql | Free fatty acids measurement | | Glucose_mgdL | Glucose measrurement (mg/dL) | | Triglyceride_mgdL | Triglyceride measurement (mg/dL) | | Leptin_ngmL | Leptin measurement (ng/dL) | | fat | Mice's fat mass by echo MRI | | lean | Lean body mass by echo MRI | | free_water | Free water measurement by echo MRI | | total water | Total water measurement by echo MRI | |-----------------------|----------------------------------------| The features data are a `GRanges` object and can be accessed using `rowRanges`. ```{r} # print GRanges object rowRanges(ob_counts) ``` Notice there are two types of data in this object. The first is the coordinates of the identified genes `ranges(ob_counts)`. The second is the annotation of the these genes `mcols(ob_counts)`. The following table show the description of the second annotation item. All annotations were obtained using `biomaRt` package as described in the `inst/scripts`. | col_name | description | |-----------|-------------------------------------------------------------------| | ranges | The range of start and end of gene | | strand | Either this gene is located on the positive or negative strand | | gene_id | Ensembl gene id | | entrez_id | Entrez gene id (if available) | | symbol | Common gene symbol (if available) | | biotype | The biological function of gene as classified by Ensembl database | |-----------|-------------------------------------------------------------------| # Example of using `ObMiTi` ## Selecting Protein Coding genes ```{r} se <- ob_counts[rowRanges(ob_counts)$biotype == 'protein_coding',] ``` ## Plot first 100 genes ```{r} plot(log(assay(se)[1:100,])) ``` # Citing `ObMiTi` For citing the package use: ```{r citation, warning=FALSE, eval=FALSE} #citing the package citation("ObMiTi") ``` # Session Info ```{r session_info} devtools::session_info() ```