%\VignetteIndexEntry{SEQC Vignette}
%\VignetteDepends{}
%\VignetteKeywords{SEQC RNA-se Read mapping}
%\VignettePackage{seqc}
\documentclass[12pt]{article}

\usepackage{hyperref, url}

\newcommand{\R}[1]{{\texttt{#1}}}
\newcommand{\C}[1]{{\texttt{#1}}}

\textwidth=6.2in
\textheight=8.5in
\oddsidemargin=0.2in
\evensidemargin=0.2in
\headheight=0in
\headsep=0in

\begin{document}
\title{RNA-seq Data Employed in The Sequencing Quality Control (SEQC) Project}
\author{Yang Liao and Wei Shi}

\maketitle

\section{Introduction}
This vignette briefly describes the content in the seqc package.

The seqc package provides a series of data frames derived from the SEQC project between 2011 to 2014. Three types of data frames are included in this package: RNA-seq read count tables, junction tables and an additional gene intensity table generated by using the TaqMan RT-PCR technology. All the data frames are ready to use in the R environment once the library is loaded. 

\begin{scriptsize}
<<>>=
library(seqc)
options(width=110, digits=2)
ls(2)
@
\end{scriptsize}
Six (6) samples were employed in the SEQC project [1]. Sample A and B were obtained from two human RNA reference libraries, while a small amount of Ambion ERCC RNA Spike-in Mix was added into both samples. Sample A and B were then mixed up to different ratios, creating Sample C and D. The pure ERCC Spike-in Mix 1 and 2 were also used as sample E and F, ending up with totally six samples being studied in this project. The details about the samples are further described in the next section. Huge amounts of RNA-seq data (short reads) were yielded from the six samples at twelve (12) sequencing site by using three (3) different platforms (sequencers).

Genes that have different expression levels among samples can be detected from the RNA-seq data. The read count tables in this package are useful for the purpose of Differently Expressed (DE) gene analysis. For generating the read count tables, the RNA-seq data was mapped to the human genome by using the Subread aligner [2], and the mapping result was assigned to the annotated genes by using the featureCounts program [3] for estimating their expression levels. Two sets of gene annotations were referred when the mapping result was assigned to the genomes: the AceView annotations and the RefSeq annotations, both from the National Center for Biotechnology Information (NCBI) in the United States. 

The RNA-seq technologies give an opportunity to precisely call the exon-exon junctions that are coded in the RNA transcriptome. We used the Subjunc program [2] to process the mapping result from the Subread aligner, therefore detecting the exon-exon junctions by using the seed-and-vote strategy. The junction tables from Subjunc were summarized on the sample level and are included in this package. 

In addition to the analyses on the RNA-seq data, samples A, B, C and D were also analyzed by using the TaqMan RT-PCR technology. The intensity values of the selected genes are included in this package as a reference.

\section{Samples and Sequencing}
The SEQC consortium prepared six RNA samples and distributed the samples to twelve sequencing sites. The first two samples, A and B, were derived from Agilent's Universal Human Reference RNA (UHRR) and Life Technologies' Human Brain Reference RNA (HBRR) cell lines respectively. Sample A and B were then mixed with Ambion ERCC RNA Spike-In Mix 1 and 2 accordingly, which are two mixtures of RNA molecules from 92 contigs, but each contig is known to have different density levels in the two mixtures. Sample C and D were then created by mixing sample A and B to different ratios. In sample C, there is 75\% of the volume from sample A and 25\% of the volume from sample B; the ratio between sample A and B is 25\% and 75\% in sample D. As we mentioned above, sample E and F are the pure ERCC RNA Spike-In Mix 1 and 2. All the sequencing sites sequenced Sample A and B, but some sites did not sequence sample C, D, E and/or F. 

Each of the samples has a number of replicates. Samples A, B, C and D each has five (5) replicates, and Sample E and F each has two (2) replicates. Some sequencing sites sequenced all of these replicates, but the other sites only sequenced a part of them. The set of replicates sequenced at each site can be found from the column names of the data frames in this package. The naming scheme is described in the next chapter.

Three RNA-seq platforms were examined in the SEQC project: the Illumina HiSeq 2000 devices (ILM), the Roche 454 GS FLX platform (ROC) and Life Technologies' SOLiD 5500 instruments (LIF). Each sequencing site uses a number of lanes and flowcells to sequence each replicate, but the numbers of lanes and flowcells used in the SEQC project differ between the platforms and event between the sequencing sites using the same platform. This project generated 2758 libraries in total: 1832 of them from the Illumnia HiSeq platform, 12 from the 454 Life Science platform and 914 from the SOLiD systems. Table~\ref{tab:seq-sites} lists the sequencing sites, the platforms employed, the number of samples/replicates that were sequenced at each site, and the numbers of short read libraries yielded.

The 2758 libraries have different read formats. All the libraries from Illumnia HiSeq are paired-end, and the length of every single read is around 100bp. The 454 Life Science libraries, on the other hand, are all single-end, and the read length is variable from tens of bases to above one thousand bases. Different to the libraries from Illumnia HiSeq and 454 Life Science, which contain only base-space reads, all the SOLiD libraries contain only color-space reads. There are 50 single-end libraries and 864 paired-end libraries among all the 914 SOLiD libraries. The read lengths in the SOLiD library vary from 36 to 76 colors long.

\begin{table}[h]
\footnotesize{
\begin{tabular}{c|c|c|c|c|c}
Site & Platform & Samples & Replicates & Libraries &PE/SE\\
\hline
AGR & Illumina (ILM) & A, B, C, D & 4 for each sample &256&PE\\
BGI & Illumina (ILM) & A, B, C, D, E, F & 5 for A, B, C \& D; 2 for E \& F & 384&PE\\ 
CNL & Illumina (ILM) & A, B, C, D, E, F & 5 for A, B, C \& D; 2 for E \& F & 384&PE\\
COH & Illumina (ILM) & A, B, C, D & 4 for each sample &128&PE\\
MAY & Illumina (ILM) & A, B, C, D, E, F & 5 for A, B, C \& D; 2 for E \& F & 384&PE\\
NVS & Illumina (ILM) & A, B, C, D, E, F & 4 for A, B, C \& D; 2 for E \& F & 320&PE\\
\hline
LIV & SOLiD (LIF) & A, B, C, D & 2 for each sample &50&SE\\
NWU & SOLiD (LIF) & A, B, C, D, E, F & 5 for A, B, C \& D; 2 for E \& F & 288&PE\\
PSU & SOLiD (LIF) & A, B, C, D, E, F & 5 for A, B, C \& D; 2 for E \& F & 288&PE\\
SQW & SOLiD (LIF) & A, B, C, D, E, F & 5 for A, B, C \& D; 2 for E \& F & 288&PE\\
\hline
SQW & 454 (ROC) & A, B & 1 for each sample & 4&SE\\
MGP & 454 (ROC) & A, B & 1 for each sample & 4&SE\\
NYU & 454 (ROC) & A, B & 1 for each sample & 4&SE\\
\end{tabular}
}
\caption[Table caption text]{ List of the sequencing sites and processed samples/replicates. SQW used two platforms: SOLiD and 454. The last column indicates if the libraries are paired-end (PE) or single-end (SE)}
\label{tab:seq-sites}
\end{table}

\section{Read Mapping and Read Assignment}
All the libraries were mapped to the human genome by using the Subread package.  

The reference genome index was first built from the GRCg37/hg19 human genome, plus the 92 ERCC Spike-In contigs\footnote{\url{http://tools.lifetechnologies.com/downloads/ERCC_Controls_Annotation.txt}}, the Subread aligner then mapped the RNA-seq libraries to the index using the default settings. The mapping results were then assigned to the annotated genes by featureCounts, running on the single-end mode or the paired-end mode according to the format of the library. Two sets of annotations were used in the study: the AceView annotations and the RefSeq annotations. There are 1,383,372 exons (from 55,950 genes) in the AceView annotations, and there are 225,074 exons (from 25,072 genes) in the RefSeq annotations; only the exon regions of the genes were extracted from the annotations for read assignment. It is noticed that, although there are much more genes in the AceView annotations, the genes in the RefSeq annotations are not a subset of genes in the AceView annotations. Both sets of annotations have some unique genes not in the other set. 

The gene-level read assignment results are provided in this package as 22 data frames, which contain the numbers of single-end reads or paired-end fragments overlapping with exons of each annotated gene. The names of the data frames are synthesized following this scheme: {\tt (PLATFORM)\_(ANNOTATION)\_gene\_(SITE)}, for example, data frame {\tt LIF\_refseq\_gene\_NWU} includes the read assignment results on the libraries sequenced at NMU by using the SOLiD platform, and the reads were assigned to the RefSeq annotations. The column names in the data frames describe the sample names, the replicate numbers, the lane numbers and the flowcell numbers (if applicable) associated with the libraries. For example, the read counts in {\tt LIF\_refseq\_gene\_NWU\$B\_4\_L02\_FlowCell2} are the fourth replicate of Sample B, sequenced by using the second flowcell in lane 2. Another example, {\tt ROC\_aceview\_gene\_MGP\$A\_1\_R02} include the RNA-seq library generated from the first replicate of Sample A at MGP by using the second region of the 454 platform. 

\begin{scriptsize}
<<>>=
ROC_aceview_gene_MGP[1:15,]
@
<<>>=
colnames(ILM_aceview_gene_BGI)
ILM_aceview_gene_BGI[1:15,1:7]
@
\end{scriptsize}

\section{Junction Detection}
The subread aligner can very efficiently map reads to the reference genome. From the mapping results, subjunc furthers detected exon-exon junctions that were coded in the RNA-seq reads\footnote{The libraries were processed by using an old version of the subread package. The subjunc in that package only takes SAM files from the subread aligner as input. The subjunc program in the newer package (after version 1.4.0) is able to directly take FASTQ and FASTA files as input.}, and generated a junction table for each library. We merged the junction tables from the libraries on the platform-site-sample level, then provide the resulted data frames in this package. 

The data frames are named following this scheme: {\tt (PLATFORM)\_junction\_(SITE)\_SAMPLE}. For example, data frame {\tt ILM\_junction\_AGR\_B} includes the junctions that were detected from all the libraries of Sample B sequenced at AGR by using the Illumina HiSeq platform. Each data frame contains four columns: the chromosome name, the first junction side, the second junction side and the number of supporting reads (i.e., the reads that spanning the junction location and include the two exons surrounding the junction). The two junction sides are defined as the chromosomal position (one starting) of the last base in the first exon, and the chromosomal position of the first base in the second exon both surrounding this junction point. 

\begin{scriptsize}
<<>>=
ILM_junction_AGR_B[1:15,]
@
\end{scriptsize}


\section{TaqMan RT-PCR}
The gene expression levels in Samples A, B, C and D were further measured by using the TaqMan RT-PCR technologies. The expression levels of the 1044 selected genes in replicates 1, 2, 3 and 4 of the four samples are stored in data frame {\tt taqman}. The first column in this data frame are the entrez ids and symbols of the 1044 genes, followed by the gene intensity values in 16 columns, which are named as {\tt (SAMPLE).(SAMPLE)(REPLICATE)\_values}, for example, {\tt taqman\$C.C3\_values} are the gene intensity values in the third replicate of sample C. 

\begin{scriptsize}
<<>>=
colnames(taqman)
taqman[1:15, 1:9]
@
\end{scriptsize}

\section{Citation}
[1] Su Z, Labaj PP, Li S, Thierry-Mieg J, Thierry-Mieg D, Shi W, Wang C, Schroth GP, Setterquist RA, Thompson JF, et al. (SEQC/MAQC-III Consortium). A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nature Biotechnology, 32(9), 903-914\\\\
\noindent{[2] Yang Liao, Gordon K Smyth and Wei Shi (2013). The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Research, 41(10):e108.}\\\\
\noindent{[3] Yang Liao, Gordon K Smyth and Wei Shi (2014). featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7):923-30}

\section{Authors}
Yang Liao and Wei Shi\\
Bioinformatics Division \\
The Walter and Eliza Hall Institute of Medical Research \\
1G Royal Parade, Parkville, Victoria 3052 \\
Australia \\

\section{Contact}
Please post to Bioconductor mailing list (\url{http://bioconductor.org/}) if you find any bugs and have any inquires.
Or, you may contact Wei Shi (shi at wehi dot edu dot au) directly.

\end{document}