---
title: "celarefData"
author: "Sarah Williams"
date: "`r Sys.Date()`"
output: 
  html_document: 
    toc: yes
bibliography: celaref.bib
vignette: >
  %\VignetteIndexEntry{Manual}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Overview

The celarefData package is a repository of a few public datasets that have been
processed using the *celaref* package. 
This includes some example data for the  *celaref* package vignette 
(please refer to full examples there), and some other potentially useful 
preprocessed reference datasets.


# Data

The data in this package is a series of data frames output from the
*contrast_each_group_to_the_rest* function of the *celaref* package. 
That is, these are differential experession results (calculated using 
MAST @Finak2015), of each cell cluster versus the rest of the experiment. 

For details and explanation see the celaref package vignette. 
The commands used to make these data files are also in the make-data.R 
file of this package.

### Dataset : Farmer et al 2017, Mouse lacrimal gland

@Farmer2017 have published a survey of cell types in the mouse lacrimal gland 
at two developmental stages in paper (*Defining epithelial cell dynamics and 
lineage relationships in the developing lacrimal gland*). Only the more 
mature P4 timepoint is included.

Data:

* Farmer2017_lacrimalP4

### Dataset : Watkins et al 2009, PBMC microarrays 

The ('Watkins2009') ‘HaemAtlas’ [@Watkins2009] microarray dataset of purified 
PBMC cell types
was downloaded as a normalised table from the 'haemosphere' website:
http://haemosphere.org/datasets/show [@DeGraaf2016]

Processing for those data files via *contrast_each_group_to_the_rest_for_norm_ma_with_limma* is described in the vignette.


Data:

* de_table_Watkins2009_pbmcs


### Dataset : 10X genomics, pbmc4k, k=7

10X genomics has several datasets available to download from their website,
including the pbmc4k dataset, which contains PBMCs derived from a healthy
individual. The kmeans k=7 cell-cluster assignments were chosen. 
Source dataset available here:
(https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/pbmc4k)

Data: 

* de_table_10X_pbmc4k_k7

### Dataset : Zeisel et al 2015, Mouse brain


In their paper 'Cell types in the mouse cortex and hippocampus revealed by
single-cell RNA-seq', @Zeisel2015 performed single cell RNA sequencing
in mouse, in two tissues (sscortex and ca1hippocampus).

This data was download from the link provided in the paper: http://linnarssonlab.org/cortex

As described in make-data.R, both counts and cell annotations were parsed from this file:
https://storage.googleapis.com/linnarsson-lab-www-blobs/blobs/cortex/expression_mRNA_17-Aug-2014.txt


Data:

* de_table_Zeisel2015_cortex
* de_table_Zeisel2015_hc


### Dataset : Zheng et al 2017, Pure PBMCs

As part of their analysis described in 
_'Massively parallel digital transcriptional profiling of single cells'_, 
Zheng et al generated a reference dataset of PBMC 
(peripheral blood mononuclear cell) sub-populations [@Zheng2017].
These cell types are:

* CD34+                        
* CD56+ NK                     
* CD4+/CD45RA+/CD25- Naive T  
* CD4+/CD25 T Reg              
* CD8+/CD45RA+ Naive Cytotoxic 
* CD4+/CD45RO+ Memory         
* CD8+ Cytotoxic T             
* CD19+ B                      
* CD4+ T Helper2              
* CD14+ Monocyte               
* Dendritic   

They used a bead-based purification approach described in their paper, followed
by the analysis which they have shared at https://github.com/10XGenomics/single-cell-3prime-paper/tree/master/pbmc68k_analysis 

------


To create the derived differential expression tables suitable for using as a 
reference dataset with celaref. 

1. Data and scripts obtained from https://github.com/10XGenomics/single-cell-3prime-paper/tree/master/pbmc68k_analysis
 
2. Cell cluster labels were obtained by the (rather nicely reproduceable) 
analysis scripts (specifically 'main_process_pure_pbmc.R' ) also provided by
Zheng et al at:
https://github.com/10XGenomics/single-cell-3prime-paper/tree/master/pbmc68k_analysis

3. Cells were subsetted to a _maximum of 1000 per group_ - enough for 
the differential expression for this puropose.

4. Gene-level ID was set to the GeneSymbol, choosing the more highly 
expressed ensemblID if multiple mappings exist. The *de_table_Zheng2017purePBMC*
dataset has GeneSymbol IDs, wherease *de_table_Zheng2017purePBMC_ensembl* was
converted back to ensemble IDs. 

Exact commands are provided in celarefData make-data.R script.


# Use

```{r}
library(ExperimentHub)
eh = ExperimentHub()
ExperimentHub::listResources(eh, "celarefData")

de_table.10X_pbmc4k_k7        <- ExperimentHub::loadResources(eh, "celarefData", 'de_table_10X_pbmc4k_k7')[[1]]
de_table.Watkins2009PBMCs     <- ExperimentHub::loadResources(eh, "celarefData", 'de_table_Watkins2009_pbmcs')[[1]]
de_table.zeisel.cortex        <- ExperimentHub::loadResources(eh, "celarefData", 'de_table_Zeisel2015_cortex')[[1]]
de_table.zeisel.hippo         <- ExperimentHub::loadResources(eh, "celarefData", 'de_table_Zeisel2015_hc')[[1]]
de_table.Farmer2017lacrimalP4 <- ExperimentHub::loadResources(eh, "celarefData", 'de_table_Farmer2017_lacrimalP4')[[1]]

de_table.Zheng2017purePBMC         <- ExperimentHub::loadResources(eh, "celarefData", 'de_table_Zheng2017purePBMC')[[1]]
de_table.Zheng2017purePBMC_ensembl <- ExperimentHub::loadResources(eh, "celarefData", 'de_table_Zheng2017purePBMC_ensembl')[[1]]

head(de_table.10X_pbmc4k_k7)
```

# References