--- title: Overview of the MouseThymusAgeing datasets author: Mike Morgan date: "November 4, 2020" output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{Available datasets} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: thymus.bib --- ```{r, echo=FALSE, results="hide"} knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE) ``` # Introduction The `r Biocpkg("MouseThymusAgeing")` package provides convenient access to the single-cell RNA sequencing (scRNA-seq) datasets from @baran-gale_ageing_2020. The study used single-cell transcriptomic profiling to resolve how the epithelial composition of the mouse thymus changes with ageing. The datasets from the paper are provided as count matrices with relevant sample-level and feature-level meta-data. All data are provided post-processing and QC. The raw sequencing data can be directly acquired from ArrayExpress using accessions [E-MTAB-8560](https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-8560/) and [E-MTAB-8737](https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-8737/). # Installation The package can be installed from Bioconductor. Bioconductor packages can be accessed using the `r CRANpkg("BiocManager")` package. ```{r getPackage, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("MouseThymusAgeing") ``` To use the package, load it in the typical way. ```{r Load, message=FALSE} library(MouseThymusAgeing) ``` # Processsing Overview Detailed experimental protocols are available in [the manuscript](https://elifesciences.org/articles/56221) and analytical details are provided in the accompanying [GitHub repo](https://github.com/WTSA-Homunculus/Ageing2019). This data package contains 2 single-cell data sets from the paper. The first details the initial transcriptomic profiling of defined TEC populations using the plate-based SMART-seq2 chemistry. These cells were sorted from mice at 1, 4, 16, 32 and 52 weeks of age using the following flow cytometry phenotypes: * mTEClo: Cd45- EpCam+ Ly51- Cd80lo MHCIIlo Dsg3- * mTEChi: Cd45- EpCam+ Ly51- Cd80hi MHCIIhi Dsg3- * cTEC: Cd45- EpCam+ Ly51+ Cd80- * Dsg3+ TEC: Cd45- EpCam+ Ly51- Cd80lo MHCIIlo Dsg3+ In each case cells were sorted from 5 separate mice at each age into a 384 well plate containing lysis buffer, with cells from different ages and days block sorted into different areas of each plate to minimise the confounding between batch effects, mouse age and sorted subpopulation. The single-cell libraries were prepared according to the [SMART-seq2 protocol](https://doi.org/10.1038/nprot.2014.006) and sequenced on an Illumina NovaSeq 6000. The computational processing invovled the following steps: * Read trimming with trimmomatic, prior to alignment to the mm10 genome with STAR v2.5.3a, and read de-duplication using Picard Tools. * FeatureCounts was used to count reads on each gene using the Ensembl v95 annotation. * Poor quality cells were removed based on excess of ERCC92 spiked-in sequences, poor sequencing coverage and low gene-detection. * Gene counts were normalized using size factors estimated using `computeSumFactors()` function from `r Biocpkg("scran")` [@l._lun_pooling_2016]. * Highly variable genes were identified, and used to subset the log (+1 pseudocount) normalised expression prior to PCA. * Shared nearest neighbour (SNN) graph building was performed using `r CRANpkg("igraph")` and cells were clustered using the Walktrap community detection algorithm [@pons_computing_2005]. Clusters were manually annotated based on inspecting the expression of marker genes. The second dataset contains cells that were profiling from TEC at 8, 20 and 36 weeks old, derived from a transgenic model system that is also able to lineage trace cells that derive from those that express the thymoproteasomal gene, $\beta$-5t. When this gene is expressed it drives the expression of a fluorescent reporter gene, ZsGreen (ZsG). The mouse is denoted $\mbox{3xtg}^{\beta5t}$. Each mouse (3 replicates per age) first had their transgene induced using doxycycline, and 4 weeks later the TEC were collected by flow cytometry in separate ZsG+ and ZsG- groups. Within each of these groups cells were FAC-sorted into mTEC (Cd45+EpCam+MHCII+Ly51-UEA1+) and cTEC (Cd45+EpCam+Ly51+UEA1+) populations. For this experiment we made us of recent developments in multiplexing with hashtag oligos (HTO; cell-hashing)[@stoeckius_cell_2018]. Consequently, the cells were super-loaded onto the 10X Genomics Chromium chips before library prep and sequencing on an Illumina NovaSeq 6000. The computational processing for these data is different to above. Specifically: * Demultiplexing, read alignment, UMI deduplication and feature counting were all performed using _Cellranger_ v3.1.0. * Non-empty droplets were called separately using the `emptyDrops()` from the `r Biocpkg("DropletUtils")` [@lun_emptydrops_2019]. * Cells were de-multiplexed into specific samples by assigning each cell to its best-matching hashtag oligo using the approach by [@stoeckius_cell_2018] - this was also able to identify multiplet cells containing multiple HTOs. * Cells were filtered based on low sequencing coverage and high deviation of mitochondrial content from the population median. * Size factors were estimated as above using `computeSumFactors()` function from `r Biocpkg("scran")` [@l._lun_pooling_2016], and used for normalization with a `log(X + 1)` transformation. * A SNN-graph was then built using `r CRANpkg("igraph")` as above, and cell were also clustered using Walktrap community detection algorithm [@pons_computing_2005]. These clusters were annotated with concordant labels from the above data set. The exception being that many more clusters were identified, and thus each cluster was suffixed with a number to uniquely identify them. # Package data format The SMART-seq2 data is stored in subsets according to the sorting day (numbered 1-5). For the droplet data, the data can be accessed according to the specific multiplexed samples (6 in total). For the SMART-seq2 the exported object `SMARTseqMetadata` provides the relevant metadata information for each sorting day, the equivalent object `DropletMetadata` contains the relevant information for each separate sample. Specific descriptions of each column can be accessed using `?SMARTseqMetadata` and `?DropletMetadata`. ```{r} head(SMARTseqMetadata, n = 5) ``` All of the data access functions allow you to select the particular samples or sorting days that you would like to access for the relevant data set. By loading only the samples or sorting days that you are interested in for your particular analysis, you will save time when downloading and loading the data, and also reduce memory consumption on your machine. Droplet single-cell experiments tend to be much larger owing to the ability to encapsulate and process many more cells than in either 96- or 384-well plates. The droplet scRNA-seq made use of hashtag oligonucleotides to multiplex samples, allowing for replicated experimental design without breaking the bank. ```{r} head(DropletMetadata, n = 5) ``` ## Data access Package data are provided as `SingleCellExperiment` objects, an extension of the Bioconductor `SummarizedExperiment` object for high-throughput omics experiment data. `SingleCellExperiment` object uses memory-efficient storage and sparse matrices to store the single-cell experiment data, whilst allowing the layering of additional feature- and cell-wise meta-data to facilitate single-cell analyses. This section will detail how to access and interact with these objects from the `MouseThymusAgeing` package. ```{r, message=FALSE} smart.sce <- MouseSMARTseqData(samples="day2") smart.sce ``` The gene counts are stored in the `assays(sce, "counts")` slot, which can be accessed using the convenience function `counts`. The gene counts are stored in a memory efficient sparse matrix class from the `r CRANpkg("Matrix")` package. ```{r, message=FALSE} head(counts(smart.sce)[, 1:10]) ``` The normalisation factors per cell can be accessed using the `sizeFactors()` function. ```{r} head(sizeFactors((smart.sce))) ``` These are used to normalise the data. To generate single-cell expression values on a log-normal scale, we can apply the `logNormCounts` from the `r Biocpkg("scuttle")` package. This will add the `logcounts` entry to the `assays` slot in our object. ```{r, message=FALSE} library(scuttle) smart.sce <- logNormCounts(smart.sce) ``` With these normalised counts we can perform our standard down-stream analytical tasks, such as identifying highly variable genes, projecting cells into a reduced dimensional space and clustering using a nearest-neighbour graph. You can further inspect the cell-wise meta-data attached to each dataset, stored in the `colData` for each `r Biocpkg("SingleCellExperiment")` object. ```{r, message=FALSE} head(colData(smart.sce)) ``` Details of what information is stored can be found in the documentation using `?DropletMetadata` and `?SMARTseqMetada`. In each object we also have the pre-computed reduced dimensions that can be accessed through the `reducedDim(, "PCA")` slot. # Session Information ```{r} sessionInfo() ``` # References