%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{Pre-Processing for the Zebrafish RNA-Seq Gene-Level Counts}

\documentclass{article}

<<eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@

\usepackage{url}
\usepackage[numbers,sort&compress]{natbib}

\title{Pre-Processing for the Zebrafish RNA-Seq Gene-Level Counts}
\author{Davide Risso}
\date{Modified: April 13, 2014. Compiled: \today.}

\begin{document}

\maketitle

This vignette describes the pre-processing steps used to generate the gene-level read counts contained in the \Bioconductor{} package \Biocpkg{zebrafishRNASeq}.

\tableofcontents

\section{Sample preparation and sequencing}

Olfactory sensory neurons were isolated from three pairs of gallein-treated and control embryonic zebrafish pools and purified by fluorescence-activated cell sorting (FACS) \cite{ferreira2014silencing}. Each RNA sample was enriched in poly(A)+ RNA from 10--30 ng total RNA, and 1 $\mu$L (1:1000 dilution) of Ambion ERCC ExFold RNA Spike-in Control Mix 1 was added to 30 ng of total RNA before mRNA isolation. cDNA libraries were prepared according to the manufacturer's protocol. The six libraries were sequenced in two multiplex runs on an Illumina HiSeq2000 sequencer, yielding approximately 50 million 100~bp paired-end reads per library.

\section{Read alignment and expression quantitation}

We used a custom reference sequence, defined as the union of the zebrafish reference genome (Zv9, downloaded from Ensembl \cite{flicek2012ensembl}, v. 67) and the ERCC spike-in sequences (\url{http://tools.invitrogen.com/downloads/ERCC92.fa}). Reads were mapped with TopHat \cite{trapnell2009tophat} (v. 2.0.4) using the following parameters:
\begin{verbatim}
--library-type=fr-unstranded -G ensembl.gtf
--transcriptome-index=transcript --no-novel-juncs
\end{verbatim}
where \texttt{ensembl.gtf} is a GTF file containing Ensembl gene annotation.
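
The reference construction and mapping steps described above could be scripted roughly as follows. This is only a sketch under stated assumptions, not the exact commands that were run: the function names (\texttt{build\_reference}, \texttt{index\_reference}, \texttt{align\_library}), all file and directory names other than \texttt{ensembl.gtf}, and the use of \texttt{bowtie2-build} to index the combined reference are illustrative choices that are not stated in this vignette. Only the four TopHat options are taken from the text above.

\begin{verbatim}
# Concatenate the genome and spike-in FASTA files into one reference.
# $1: genome FASTA, $2: spike-in FASTA, $3: output FASTA
# awk re-emits every line with a trailing newline, so the first spike-in
# header cannot be glued onto the last genome line if the genome file
# lacks a final newline.
build_reference() {
    awk '{ print }' "$1" "$2" > "$3"
}

# Index the combined reference (assumption: a Bowtie 2 index).
# $1: combined FASTA, $2: index base name
index_reference() {
    bowtie2-build "$1" "$2"
}

# Map one paired-end library with the options listed in the text.
# $1: index base name, $2: mate-1 FASTQ, $3: mate-2 FASTQ, $4: output dir
align_library() {
    tophat --library-type=fr-unstranded -G ensembl.gtf \
        --transcriptome-index=transcript --no-novel-juncs \
        -o "$4" "$1" "$2" "$3"
}

# Example calls (hypothetical file names):
#   build_reference zv9.fa ERCC92.fa zv9_ercc.fa
#   index_reference zv9_ercc.fa zv9_ercc
#   align_library zv9_ercc lib1_1.fastq lib1_2.fastq tophat_lib1
\end{verbatim}

Keeping the spike-in sequences in the same reference as the genome means that ERCC reads are aligned and counted in the same pass as the zebrafish genes. The GTF file passed to \texttt{-G} would need entries for the spike-ins if they are to be counted alongside the genes.
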
Gene-level read counts were obtained using the \texttt{htseq-count} Python script \cite{htseq} in the ``union'' mode with the Ensembl (v. 67) gene annotation. After verifying that there were no run-specific biases, we used the sum of the counts from the two runs as the expression measure for each library.

\section{Loading the zebrafish data into \R{}}

To load the gene-level read counts into \R{}, type

<<>>=
library(zebrafishRNASeq)
data(zfGenes)
head(zfGenes)
@

The ERCC spike-in read counts are in the last rows of the same matrix and can be retrieved as follows.

<<>>=
# Spike-in rows are those whose names start with "ERCC"
spikes <- zfGenes[grep("^ERCC", rownames(zfGenes)),]
head(spikes)
@

The typical use of this dataset is the identification of differentially expressed genes between control (Ctl) and treated (Trt) samples. For additional details, exploratory analysis, and normalization of the zebrafish data, see \cite{risso2014ruv,risso2014role}. The data are used as a case study for the \Bioconductor{} package \Biocpkg{RUVSeq}.

\section{Session info}

<<results="asis">>=
toLatex(sessionInfo())
@

\bibliography{biblio}

\end{document}