\documentclass[12pt]{article}

\topmargin 0in
\headheight 0in
\headsep 0in
\oddsidemargin 0in
\evensidemargin 0in
\textwidth 176mm
\textheight 215mm
\usepackage{natbib}
\usepackage{Sweave}
\usepackage{url}
\DeclareGraphicsRule{.tif}{png}{.png}{`convert #1 `dirname #1`/`basename #1 .tif`.png}
\newcommand{\edgeR}{\textsf{edgeR}}
\newcommand{\limma}{\textsf{limma}}
\newcommand{\code}[1]{{\small\texttt{#1}}}
\newcommand{\R}{\textsf{R}}


\begin{document}

%\VignetteIndexEntry{edgeR User's Guide}

\title{\edgeR: differential expression analysis \\ of digital gene expression data}
\author{Mark Robinson \\ \texttt{mrobinson@wehi.edu.au}
\and
Davis McCarthy \\ \texttt{dmccarthy@wehi.edu.au}
\and
Gordon K.\ Smyth}
% Please increment date when working on this document, so that
% date shows genuine change date, not merely date of compile.
\date{1 July 2010}
\maketitle

\section{Introduction}

This document gives a brief introduction and overview of the \R~Bioconductor package \edgeR~\citep{Robinson:2010p249},
which provides statistical routines for determining differential
expression in digital gene expression data. The package implements
methods developed by~\citet{Robinson07, Robinson08}.  The routines can
be applied equally to RNA-seq, Tag-seq, SAGE, CAGE,  Illumina/Solexa, 454 or ABI SOLiD experiments.  In fact, the methods may be useful in other experiments where counts are observed.


\section{How to get help}

Most questions about \edgeR~will hopefully be answered by the documentation or references.
Every function mentioned in this guide has its own help page.
For example, a detailed description of the arguments and output of the \code{exactTest} function can be read by typing \code{?exactTest} or \code{help(exactTest)} at the \R~prompt.

The authors of the package always appreciate receiving reports of bugs in the package functions or in the documentation.
The same goes for well-considered suggestions for improvements.
Other questions about how to use \edgeR~are best sent to the Bioconductor mailing list \url{bioconductor@stat.math.ethz.ch}.
To subscribe to the mailing list, see \url{https://stat.ethz.ch/mailman/listinfo/bioconductor}.
Please send requests for general assistance and advice to the mailing list rather than to the individual authors.
Users posting to the mailing list for the first time may find it helpful to read the 
helpful posting guide at \url{http://www.bioconductor.org/doc/postingGuide.html}.


\section{Reading data}

%\R~packages for the processing of raw data files for digital gene expression (DGE) datasets are still in development stages (e.g. \texttt{ShortRead}) at time of writing.

\edgeR~requires three pieces of information:
\begin{enumerate}
\item
\texttt{counts}: a matrix of counts where each row represents a gene/exon (or whatever genomic feature is being tracked) and each column is a different sample.
The row names are transcript IDs.
\item
\texttt{group}: a factor (with length equal to the number of columns of \texttt{counts}) denoting the experimental group.
\item
\texttt{lib.size}: vector of the same length as \texttt{group} giving the total number of reads sequenced for each sample.
\end{enumerate}
We assume that the counts are stored in one of two formats.
Either there is a single file containing a table of counts with the first column containing the tag identifiers and the remaining columns containing the tag counts for each library sequenced, or there is an individual file for each library, each with first column for tag identifiers and second column for counts. 

If the counts for all libraries are stored in a single file, then an appriopriate in-built \texttt{R} function (such as \texttt{read.delim} or \texttt{read.csv}) can be used to read the table of counts into \texttt{R}. The library sizes can be the column sums from the table of counts, and thus easily obtained, or the user can specify the the library sizes throught \texttt{lib.size} argument of the \texttt{DGEList()} constructor.  See the help documentation (\texttt{?DGEList} or \texttt{?"DGEList-class"}) or the examples below for further details.

If the counts are stored in separate files, then, given a vector containing the filenames the \edgeR~function \texttt{readDGE} will read in the data from the individual files, collate the counts into a table and compute the library sizes and return a \texttt{DGEList} object.  See the help documentation (\texttt{?readDGE}) or the examples below for further details.

\section{Moderated negative binomial dispersions}

The basic model we use for DGE data is based on the negative binomial distribution.  The model is very flexible.  For example, if $Y$ is distributed as $NB(\mu,\phi)$, then the expected value of $Y$ is $\mu$ and the variance is $\mu + \mu^2 \cdot \phi$, thus giving sufficient flexibility for many scenarios in observing count data.

The observed data can be denoted as $Y_{gij}$ where $g$ is the gene (tag, exon, etc.), $i$ is the experimental group and $j$ is the index of the replicate.  We can model the counts as
\[  Y_{gij} \sim NB(M_j \cdot p_{gi}, \phi_g) \]
where $p_{gi}$ represents the proportion of the sequenced sample for group $i$ that is tag $g$ and $M_j$ represents the library size. 

It is of interest to find genes where, for example, $p_{g1}$ is significantly different from $p_{g2}$.  The parameter $\phi_g$ is the overdispersion (relative to the Poisson) and represents the biological, or sample-to-sample variability.   The methods we developed moderate the dispersion estimates towards a common dispersion, much like how the \limma~package moderates the variances in the analysis of microarray data. It is also possible to analyse DGE data using a common dispersion for each tag using \edgeR.


\section{Normalization issues for digital counts}

\subsection{General comments}

The \edgeR\ methodology needs to work with the original digital expression counts, so these should not be transformed in any way by users prior to analysis.
\edgeR\ automatically takes into account the total size (total read number) of each library in all calculations of fold-changes, concentration and statistical significance.
For many datasets, no other normalization is required for evaluating differential expression.


\subsection{TMM Normalization}

Recently, \citet{Robinson:2010p145} described a method to account for a bias introduced by what they call RNA composition.  In brief, there are occasions when comparing different DGE libraries where a small number of genes are very highly expressed in one sample, but not in another.  Because these highly expressed genes consume a substantial proportion of the sequencing ``real estate'', the remaining genes in the library are undersampled.
This situation may occur for example when the two tissues being compared have transcriptomes of different sizes, i.e., when there are noticeably more transcripts expressed in one tissues than the other.
For example, \citet{Robinson:2010p145} show that in comparing kidney and liver RNA, there are a large number of genes expressed in kidney but not in liver, causing the remaining genes to be undersampled and skewing the differential expression calls.
To account for this, the authors developed an empirical approach to estimate the bias and proposed to build that into the library size (or, an offset in a generalized linear model), making it an {\em effective} library size.  We demonstrate this on the \citet{Marioni:2008p93} RNA-seq dataset.


Given a table counts, one can calculate normalization factors in a pairwise fashion, using one sample as the reference and build the normalization factors into the \texttt{DGEList} object.  For symmetry, you may want to divide the factors by the geometric mean so that if you use a different reference column, the factors are comparable.

\begin{Schunk}
\begin{Sinput}
> head(D)
\end{Sinput}
\begin{Soutput}
   R1L1Kidney R1L2Liver R1L3Kidney R1L4Liver
10          0         0          0         0
15          4        35          7        32
17          0         2          0         0
18        110       177        131       135
19      12685      9246      13204      9312
22          0         1          0         0
\end{Soutput}
\begin{Sinput}
> f <- calcNormFactors(D)
> f
\end{Sinput}
\begin{Soutput}
R1L1Kidney  R1L2Liver R1L3Kidney  R1L4Liver 
     1.000      0.678      1.003      0.685 
\end{Soutput}
\begin{Sinput}
> f <- f/exp(mean(log(f)))
> prod(f)
\end{Sinput}
\begin{Soutput}
[1] 1
\end{Soutput}
\begin{Sinput}
> g <- gsub("R[1-2]L[1-8]", "", colnames(D))
> d <- DGEList(counts = D, group = g, lib.size = colSums(D) * f)
> d
\end{Sinput}
\begin{Soutput}
An object of class "DGEList"
$samples
            group lib.size
R1L1Kidney Kidney  2184732
R1L2Liver   Liver  1389149
R1L3Kidney Kidney  2252906
R1L4Liver   Liver  1405422

$counts
   R1L1Kidney R1L2Liver R1L3Kidney R1L4Liver
15          4        35          7        32
17          0         2          0         0
18        110       177        131       135
19      12685      9246      13204      9312
22          0         1          0         0
20830 more rows ...
\end{Soutput}
\end{Schunk}

By default, \texttt{calcNormFactors} uses the first sample as the reference, but this can be specified by the \texttt{refColumn} argument.  Also, you can specify different levels of trimming on the log-ratios or log-concentrations, as well as a cutoff on the log-concentrations (See the help documentation for further details).   From the calculated factors, you'll notice that since the reference is a kidney sample, all of the other kidney (technical) replicates are very near to 1, whereas all of the liver factors are less than 1, reflective of the impact that composition bias introduces.

To see the bias and normalization visually, consider a smear plot between the first (kidney) and second (liver) sample.  In the left panel of Figure \ref{fig:Marioni_normalization}, we show a smear plot (X-axis: log-concentration, Y-axis: log fold-change of liver over kidney, those with 0 in either sample are shown in the smear on the left) of the raw data (Note: the argument \texttt{normalize=TRUE} {\em only} divides by the sum of counts in each sample and has nothing to do with the normalization factors mentioned in this Section).  One should notice a shift downward in the log-ratios, presumably caused by the genes highly expressed in liver that are taking away sequencing capacity from the remainder of the genes in the liver RNA sample.  The red line signifies the estimated TMM (trimmed mean of M values) normalization factor, which in this case represents the adjustment applied to the library size to account for the compositional bias.  The right panel of Figure \ref{fig:Marioni_normalization} simply shows the M and A values after correction.  Here, one should find that the bulk of the M-values are centred around 0.


\begin{Schunk}
\begin{Sinput}
> par(mfrow = c(1, 2))
> maPlot(D[, 1], D[, 2], normalize = TRUE, pch = 19, cex = 0.2, 
+     ylim = c(-8, 8))
> grid(col = "blue")
> abline(h = log2(f[2]), col = "red", lwd = 4)
> maPlot(d$counts[, 1]/d$samples$lib.size[1], d$counts[, 2]/d$samples$lib.size[2], 
+     normalize = FALSE, pch = 19, cex = 0.2, ylim = c(-8, 8))
> grid(col = "blue")
\end{Sinput}
\end{Schunk}


\begin{figure}[ht]
\begin{center}
\includegraphics[width=\textwidth]{edgeR_normalization_example_marioni}
\caption{Smear plots before (left) and after (right) composition normalization.}
\label{fig:Marioni_normalization}
\end{center}
\end{figure}


This normalization analysis was conducted on:

\begin{Schunk}
\begin{Sinput}
> sessionInfo()
\end{Sinput}
\begin{Soutput}
R version 2.11.0 (2010-04-22) 
x86_64-apple-darwin9.8.0 

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/C/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_1.6.2

loaded via a namespace (and not attached):
[1] limma_3.3.4  tools_2.11.0
\end{Soutput}
\end{Schunk}


%%% CASE STUDIES

\section{Case study: SAGE data}
\subsection{Introduction}
This section provides a detailed analysis of data from a SAGE experiment to illustrate the data analysis pipeline for \edgeR. The data come from a very early study using SAGE technology to analyse gene expression profiles in human cancer cells  \citep{Zhang:1997p13}.

\subsection{Source of the data}
At the time that \citet{Zhang:1997p13} published their paper, no comprehensive study of gene expression in cancer cells had been reported. \citet{Zhang:1997p13} designed a study to address the following issues:
\begin{enumerate}
	\item How many genes are expressed differentially in tumour versus normal cells?
	\item Are the majority of those differences cell-autonomous rather than dependent on the tumour micro-environment?
	\item Are most differences cell type-specific or tumour-specific?
\end{enumerate}
They used normal and neoplastic gastro-intestinal tissue as a prototype and analysed global profiles of gene expression in human cancer cells. The researchers derived transcripts from human colorectal (CR) epithelium, CR cancers or pancreatic cancers. The data that we analyse in this case study are \citet{Zhang:1997p13}'s SAGE results for the comparison of expression patterns between normal colon epithelium and primary colon cancer. 

They report that the expression profiles revealed that most transcripts were expressed at similar levels, but that $289$ transcripts were expressed at significantly different levels [$P$-value < 0.01] and that $181$ of these $289$ were decreased in colon tumours as compared with normal colon tissue. \citet{Zhang:1997p13} used Monte Carlo simulation to determine statistical significance. In this case study we will use the \edgeR~package, based around the negative binomial model, to identify genes differentially expressed in the normal and cancer samples.


\subsection{Reading in the data and creating a \code{DGEList} object}
Our first task is to load the \edgeR~package, read the data into \R~and organise the data into a \code{DGEList} object that the functions in the package can recognise. The library size is usually the total sum of all of the counts for a library, and that is how library size is defined in this analysis. The easiest way to construct an appropriate \code{DGEList} object for these data is described below. 

In this case, the tag counts for the four individual libraries are stored in four separate plain text files, \code{GSM728.txt, GSM729.txt, GSM755.txt \textrm{and} GSM756.txt}. In each file, the tag IDs and counts for each tag are provided in a table. It is best to create a tab-delimited, plain-text `Targets' file, which, under the headings `files', `group' and `description', gives the filename, the group and a brief description for each sample.

The \code{targets} object is produced when the `Targets.txt' file is read into the \R~session. This object makes a convenient argument to the function \code{readDGE} which reads the tables of counts into our \code{R} session, calculates the sizes of the count libraries and produces a \code{DGEList} object for use by subsequent functions.

\begin{Schunk}
\begin{Sinput}
> library(edgeR)
> setwd("/Users/davismcc/Documents/Honours/Data/ZhangData")
> targets <- read.delim(file = "Targets.txt", stringsAsFactors = FALSE)
> targets
\end{Sinput}
\begin{Soutput}
       files group                description
1 GSM728.txt    NC               Normal colon
2 GSM729.txt    NC               Normal colon
3 GSM755.txt    Tu Primary colonrectal tumour
4 GSM756.txt    Tu Primary colonrectal tumour
\end{Soutput}
\begin{Sinput}
> d <- readDGE(targets, skip = 5, comment.char = "#")
> d
\end{Sinput}
\begin{Soutput}
An object of class "DGEList"
$samples
            files group                description lib.size
GSM728 GSM728.txt    NC               Normal colon    50179
GSM729 GSM729.txt    NC               Normal colon    49593
GSM755 GSM755.txt    Tu Primary colonrectal tumour    57686
GSM756 GSM756.txt    Tu Primary colonrectal tumour    49064

$counts
           GSM728 GSM729 GSM755 GSM756
CCCATCGTCC   1288   1380   1236      0
CCTCCAGCTA    719    458    148    142
CTAAGACTTC    559    558    248    199
GCCCAGGTCA    520    448     22     62
CACCTAATTG    469    472    763    421
57443 more rows ...
\end{Soutput}
\end{Schunk}

This \code{DGEList} is now ready to be passed to the functions that do the calculations to determine differential expression levels for the genes. Note that when we `see' the \code{DGEList} object \code{d}, the counts for just the first five genes in the table are shown, as well as the library sizes and groups for the samples.

\subsection{Analysis using common dispersion}
\subsubsection{Estimating the common dispersion}
The first major step in the analysis of DGE data using the NB model is to estimate the dispersion parameter for each tag. The most straight-forward analysis of DGE data uses the common dispersion estimate as the dispersion for all tags. For many applications this will be adequate and it may not be necessary to estimate tagwise dispersions, i.e.~estimate the dispersion parameter separately for each tag. Using the common dispersion allows the user to obtain DE results very quickly and in few steps, and so makes a good place to start with any analysis of DGE data.

Estimating the common dispersion is done using the function \code{estimateCommonDisp}. In order to do this, the function first needs to generate the `pseudocounts' under the alternative hypothesis (that there really is a difference in expression level between the groups). The conditional maximum likelihood method assumes that the library sizes are equal, which is certainly not true in general for DGE data. 

The pseudocounts are calculated using a quantile-to-quantile method for the negative binomial distribution so that the library sizes for the pseudocounts are equal to the geometric mean of the original library sizes. These pseudocounts are then used as the count data for the common conditional negative binomial likelihood function, which is maximised over the dispersion parameter to obtain our estimate of the common dispersion.

\begin{Schunk}
\begin{Sinput}
> d <- estimateCommonDisp(d)
> names(d)
\end{Sinput}
\begin{Soutput}
[1] "samples"           "common.dispersion" "counts"           
[4] "pseudo.alt"        "genes"             "conc"             
[7] "common.lib.size"  
\end{Soutput}
\end{Schunk}

The output of \code{estimateCommonDisp} is a DGEList object with several new elements. The element \code{common.dispersion}, as the name suggests, provides the estimate of the common dispersion, and \code{pseudo.alt} gives the pseudocounts calculated under the alternative hypothesis. The element \code{genes} contains the information about gene/tag identifiers. The element \code{conc} gives the estimates of the overall concentration of each tag across all of the original samples (\code{conc\$conc.common}) and the estimate of the concentration of each tag within each group (\code{conc\$conc.group}). The element \code{common.lib.size} gives the library size to which the original libraries have been adjusted in the pseudocounts. 

We see in the output below that the total number of counts in each library of the pseudocounts agrees well with the common library size, as desired. 

\begin{Schunk}
\begin{Sinput}
> d$samples$lib.size
\end{Sinput}
\begin{Soutput}
[1] 50179 49593 57686 49064
\end{Soutput}
\begin{Sinput}
> d$common.lib.size
\end{Sinput}
\begin{Soutput}
[1] 51516
\end{Soutput}
\begin{Sinput}
> colSums(d$pseudo.alt)
\end{Sinput}
\begin{Soutput}
GSM728 GSM729 GSM755 GSM756 
 51512  51513  51674  51483 
\end{Soutput}
\end{Schunk}

Under the negative binomial model, the square root of the common dispersion gives the coefficient of variation of biological variation. Here, as seen in the code below, the coefficient of variation of biological variation is found to be $0.45$. We also note that a common dispersion estimate of $0.2$ means that there is a lot more variability in the data that can be accounted for by the Poisson model---if a tag has just $200$ counts, then the estimate of the tag's variance under the NB model is over $40$ times greater than it would be under the Poisson model.

\begin{Schunk}
\begin{Sinput}
> d$common.dispersion
\end{Sinput}
\begin{Soutput}
[1] 0.1988
\end{Soutput}
\begin{Sinput}
> sqrt(d$common.dispersion)
\end{Sinput}
\begin{Soutput}
[1] 0.4458
\end{Soutput}
\end{Schunk}

\subsubsection{Testing}
Once we have an estimate of the common dispersion, we can proceed with testing procedures for determining differential expression. The \edgeR~package uses an exact test for the negative binomial distribution, which has strong parallels with Fisher's exact test, to compute exact $p$-values that can be used to assess differential expression. The function \code{exactTest} allows the user to conduct the NB exact test for pairwise comparisons of groups. Here there are only two groups, so the pair need not be specified---the function by default compares the two groups present.

\begin{Schunk}
\begin{Sinput}
> de.com <- exactTest(d)
\end{Sinput}
\begin{Soutput}
Comparison of groups:  Tu - NC 
\end{Soutput}
\begin{Sinput}
> names(de.com)
\end{Sinput}
\begin{Soutput}
[1] "table"      "comparison" "genes"   
\end{Soutput}
\begin{Sinput}
> names(de.com$table)
\end{Sinput}
\begin{Soutput}
[1] "logConc" "logFC"   "p.value"
\end{Soutput}
\end{Schunk}

The object produced by \code{exactTest} contains three elements:
\code{table}, \code{comparison} and \code{genes}. The element \code{de.com\$comparison} contains a vector giving the names of the two groups compared. The table\code{de.com\$table} contains the elements \code{logConc}, which gives the overall concentration for a tag across the two groups being compared, \code{logFC}, which gives the log-fold change difference for the counts between the groups and \code{p.value} gives the exact $p$-values computed.

The results of the NB exact test can be accessed conveniently using the \code{topTags} function applied to the object produced by \code{exactTest}. The user can specify the number, \code{n}, of tags for which they would like to see the differential expression information, ranked by $p$-value (default) or fold change. As the same test is conducted for many thousands of tags, adjusting the $p$-values for multiple testing is recommended. The desired adjustment method can be supplied by the user, with the default method being Benjamini and Hochberg's approach for controlling the false discovery rate (FDR)~\citep{Benjamini95}. The table below shows the top $10$ DE genes ranked by $p$-value.

The output below shows that the \edgeR~package identifies a good deal of differential expression between the normal colon cell group and the primary CR cancer cell group. The top DE genes are given very small $p$-values, even after adjusting for multiple testing. Furthermore, all of the top genes have a large fold change, indicating that these genes are more likely to be biologically meaningful. A Gene Ontology analysis could be carried out using the list of top genes and $p$-values provided by \code{topTags} in order to obtain more systematic and functional information about the differentially expressed genes.

\begin{Schunk}
\begin{Sinput}
> options(digits = 4)
> topTags(de.com)
\end{Sinput}
\begin{Soutput}
Comparison of groups:  Tu - NC 
           logConc   logFC    PValue       FDR
AGCTGTTCCC  -28.24  43.552 1.017e-19 5.841e-15
CTTGGGTTTT  -29.88  40.271 2.967e-10 8.522e-06
TACAAAATCG  -30.26  39.510 2.712e-08 4.926e-04
CCCAACGCGC  -12.80  -5.849 3.430e-08 4.926e-04
GCCACCCCCT  -30.33  39.380 6.341e-08 7.285e-04
CCAGTCCGCC  -30.43  39.177 1.844e-07 1.513e-03
GTCATCACCA  -30.42 -39.189 1.844e-07 1.513e-03
TCACCGGTCA  -11.13  -4.229 4.428e-07 3.180e-03
TAAATTGCAA  -11.39  -4.242 6.090e-07 3.887e-03
CGCGTCACTA  -12.30   4.611 7.300e-07 4.194e-03
\end{Soutput}
\end{Schunk}

The table below shows the raw counts for the genes that \edgeR~has identified as the most differentially expressed. For these genes there seems to be very large differences between the groups, suggesting that the DE genes identified are truly differentially expressed, and not false positives.

\begin{Schunk}
\begin{Sinput}
> detags.com <- rownames(topTags(de.com)$table)
> d$counts[detags.com, ]
\end{Sinput}
\begin{Soutput}
           GSM728 GSM729 GSM755 GSM756
AGCTGTTCCC      0      0    119   1011
CTTGGGTTTT      0      0     21     97
TACAAAATCG      0      0     14     56
CCCAACGCGC    106      1      2      0
GCCACCCCCT      0      0      5     58
CCAGTCCGCC      0      0      6     49
GTCATCACCA     35     20      0      0
TCACCGGTCA    118     75      6      5
TAAATTGCAA    103     59      3      6
CGCGTCACTA      1      3     88     21
\end{Soutput}
\end{Schunk}

If we order the genes by fold change instead of $p$-value, as in the table below, we see that the genes with the largest fold changes have very small concentrations. This ranking is dominated by genes that have zero total counts in one group and is less informative than ranking by $p$-value. 

\begin{Schunk}
\begin{Sinput}
> topTags(de.com, sort.by = "logFC")
\end{Sinput}
\begin{Soutput}
Comparison of groups:  Tu - NC 
           logConc  logFC    PValue       FDR
AGCTGTTCCC  -28.24  43.55 1.017e-19 5.841e-15
CTTGGGTTTT  -29.88  40.27 2.967e-10 8.522e-06
TACAAAATCG  -30.26  39.51 2.712e-08 4.926e-04
GCCACCCCCT  -30.33  39.38 6.341e-08 7.285e-04
GTCATCACCA  -30.42 -39.19 1.844e-07 1.513e-03
CCAGTCCGCC  -30.43  39.18 1.844e-07 1.513e-03
GTGCGCTGAG  -30.67  38.69 2.362e-06 9.048e-03
CTTGACATAC  -30.69 -38.66 2.837e-06 9.588e-03
GTGTGTTTGT  -30.72  38.59 4.135e-06 1.320e-02
GGGGGGGGGG  -30.74  38.56 5.019e-06 1.518e-02
\end{Soutput}
\end{Schunk}

 \citet{Zhang:1997p13} identified $289$ genes as significantly differentially expressed with $p$-values less than $0.01$. We can look at the genes that are given an exact $p$-value less than $0.01$ by \edgeR~before adjusting for multiple testing, and less than $0.05$ after adjustment.

We see in the output below that $264$ genes are significantly differentially expressed according to \edgeR~when using the common dispersion estimate. Of those genes, $100$ are up-regulated in the cancer cells compared with the normal cells and $164$ are down-regulated in the cancer cells compared with normal cells. These proportions of up- and down-regulated tags are very similar to those found by \citet{Zhang:1997p13}.

\begin{Schunk}
\begin{Sinput}
> sum(de.com$table$p.value < 0.01)
\end{Sinput}
\begin{Soutput}
[1] 264
\end{Soutput}
\begin{Sinput}
> top264 <- topTags(de.com, n = 264)
> sum(top264$table$logFC > 0)
\end{Sinput}
\begin{Soutput}
[1] 100
\end{Soutput}
\begin{Sinput}
> sum(top264$table$logFC < 0)
\end{Sinput}
\begin{Soutput}
[1] 164
\end{Soutput}
\end{Schunk}

Furthermore, we see below that $33$ tags ($0.06$\% of the total number) have a $p$-value of less than $0.05$ after adjusting for multiple testing using the~\citet{Benjamini95} method for controlling the FDR, which is strong evidence for differential expression.

\begin{Schunk}
\begin{Sinput}
> sum(p.adjust(de.com$table$p.value, method = "BH") < 0.05)
\end{Sinput}
\begin{Soutput}
[1] 33
\end{Soutput}
\begin{Sinput}
> mean(p.adjust(de.com$table$p.value, method = "BH") < 0.05) * 
+     100
\end{Sinput}
\begin{Soutput}
[1] 0.05744
\end{Soutput}
\end{Schunk}


\subsubsection{Visualising DGE results}
The function \code{plotSmear} can be used to generate a plot of the log-fold change against the log-concentration for each tag (analogous to an MA-plot in the microarray context). We can easily identify the top DE tags and highlight them on the plot. The code for producing the default fold-change plot is shown below, and the result of this code is shown in Figure~\ref{fig:Zhang_FC1}. 

\begin{Schunk}
\begin{Sinput}
> detags264 <- rownames(topTags(de.com, n = 264)$table)
\end{Sinput}
\end{Schunk}
\begin{Schunk}
\begin{Sinput}
> plotSmear(d, de.tags = detags264, main = "FC plot using common dispersion")
> abline(h = c(-2, 2), col = "dodgerblue")
\end{Sinput}
\end{Schunk}

Figure~\ref{fig:Zhang_FC1} shows the default fold change-plot for these data---the `smear plot'. Plotting DGE data poses some challenges, as when the total counts in one group are zero, the log-fold change is technically infinite, and the log-concentration is negative infinity. With the algorithm used by topTags, we see very high log-fold changes and very small values for log-concentration for such tags, but plotting these values directly causes problems with the scale of the graph. To get around this problem, \edgeR~produces a `smear' of points at the left-most edge of the plot for tags which have zero counts in one of the groups. Although this is still slightly artificial, it has the advantage that the expression level of all tags can be visualised and interpreted simultaneously.

The `lines' of points we see at smaller log-concentration values arise from the discrete nature of the count data. When the sum all of the counts in one of the groups is one, we see the lines of points furthest away from the main body of points, and other lines of points correspond to when the total sum of counts in one of the groups is $2$, $3$, $4$ and so on.  

In Figure~\ref{fig:Zhang_FC1}, the $264$ tags identified as differentially expressed (i.e.~those identified as significant ($p$-value less than $0.01$) by \edgeR~using the common dispersion) are outlined in red.

\begin{figure}[ht]
\begin{center}
\includegraphics[width=\textwidth]{edgeR_case_study_Zhang-015.png}
\caption{Plot of the log-fold change against the log-concentration for each tag. The 264 most differentially expressed tags as identified by \edgeR~using the common dispersion are outlined in red.}
\label{fig:Zhang_FC1}
\end{center}
\end{figure}


\subsection{Analysing the data using moderated tagwise dispersions}
\subsubsection{Moderating the tagwise dispersion}
An extension to simply using the common dispersion for each tag is to estimate the dispersion separately for each tag, while `squeezing' these estimates towards the common dispersion estimate in order to improve inference by sharing information between tags. This type of analysis can also be carried out in few steps using the \edgeR~package.

As noted earlier, the dispersion parameter is the overdispersion relative to the Poisson, and represents the biological, or sample-to-sample variability. The methods we developed moderate the dispersion estimates towards a common dispersion, much like how the \limma~package moderates the variances in the analysis of microarray data.

The amount of moderation done is determined by the value of a weight
parameter \code{prior.n}. The value for \code{prior.n} corresponds
to the number of individual tags equivalent to the weight given to the
common likelihood. Thus, the higher \code{prior.n}, the more
strongly the individual dispersion estimates are moderated, or
`squeezed', towards the common value. To run the moderated analysis, we need to determine how much
moderation is necessary. How best to do this is still an open research
question, but we currently recommend selecting a value for the weight parameter \code{prior.n} \emph{a
  priori} and have found that very good results can be obtained this way.

In an experiment such as that we consider here, in which we
have just four samples, two in each group, and thus two degrees of
freedom for estimating the dispersion parameter. Standard tagwise
dispersion estimates are likely to be unreliable, so we want to give a
reasonable weight to the common likelihood. We need to choose a value
for \code{prior.n} such that individual tagwise dispersion estimates are `squeezed' quite strongly
towards the common dispersion. Here, we choose a moderate amount of
smoothing---we let \code{prior.n} be $10$. This means that the
common likelihood receives the weight of $10$ individual tags, so
there will be a reasonable degree of `squeezing', but there is still
ample scope to estimate an individual dispersion for each gene.

The function \code{estimateTagwiseDisp} produces a DGEList object that contains all of the elements present in the object produced by \code{estimateCommonDisp}, as well as the value for \code{prior.n} used (\code{d\$prior.n}) and the tagwise dispersion estimates (\code{d\$tagwise.dispersion}), as we see below.

\begin{Schunk}
\begin{Sinput}
> d <- estimateTagwiseDisp(d, prior.n = 10)
\end{Sinput}
\begin{Soutput}
Using grid search to estimate tagwise dispersion. 
\end{Soutput}
\begin{Sinput}
> names(d)
\end{Sinput}
\begin{Soutput}
[1] "samples"            "common.dispersion"  "prior.n"           
[4] "tagwise.dispersion" "counts"             "pseudo.alt"        
[7] "conc"               "common.lib.size"   
\end{Soutput}
\begin{Sinput}
> d$prior.n
\end{Sinput}
\begin{Soutput}
[1] 100
\end{Soutput}
\begin{Sinput}
> head(d$tagwise.dispersion)
\end{Sinput}
\begin{Soutput}
[1] 2.612336 0.065168 0.001001 0.268445 0.053908 0.112720
\end{Soutput}
\end{Schunk}

It is interesting to consider the distribution of the tagwise
dispersion estimates. As we can see from the output below, the tagwise
dispersion estimates range from a minimum of $0.001$ to a maximum of
$3.41$. The range of dispersions is therefore large, but the tags in the middle two quartiles of the tagwise dispersion estimates have dispersion estimates just slightly smaller than the common dispersion estimate.

\begin{Schunk}
\begin{Sinput}
> summary(d$tagwise.dispersion)
\end{Sinput}
\begin{Soutput}
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.0010 0.1926 0.1998 0.2132 0.1998 3.4120 
\end{Soutput}
\begin{Sinput}
> d$common.dispersion
\end{Sinput}
\begin{Soutput}
[1] 0.1988
\end{Soutput}
\end{Schunk}

\subsubsection{Estimating smoothing using an approximate eBayes rule}
For this, we can use an empirical Bayes rule that involves calculating a weight parameter $\textrm{prior.n}$. However, for many applications (especially if the estimated smoothing parameter is large), using the common dispersion for all tags will give excellent results. 

In order to determine how much moderation is necessary, we require an
estimate of the dispersion parameter, so we use the common dispersion
estimate from \code{estimateCommonDisp}. The smoothing parameter,
\code{prior.n} can then be calculated. As we see below, for this dataset the estimate of the smoothing parameter is very large. This estimate arises from the fact that the estimate of the overdispersion
of the likelihood scores (a critical aspect of the algorithm for estimating the amount of smoothing required) is strictly zero. In this
circumstance the algorithm is indicating that we should be using a common dispersion, and so returns a very large suggested value for
\code{prior.n}. If we were to use this large value for the weight parameter we would moderate the tagwise dispersion estimates so strongly that it would be equivalent to using the common dispersion estimate.

\begin{Schunk}
\begin{Sinput}
> prior.n <- estimateSmoothing(d)
\end{Sinput}
\begin{Soutput}
  Warning message:
In .odls(scores.g = scores, info.g = exp.inf) :
  Estimate of overdispersion of likelihood scores is strictly zero.
 Returning min val close to zero (1e-08).
\end{Soutput}
\begin{Sinput}
> prior.n
\end{Sinput}
\begin{Soutput}
[1] 182665666
\end{Soutput}
\end{Schunk}


\subsubsection{Testing}
The testing procedures when using tagwise dispersion estimates are
carried out exactly as for the common dispersion, as described above,
but we add the argument \code{common.disp=FALSE} to the call to
\code{exactTest}. Here we
carry out the testing using the tagwise dispersion estimates
calculated using a \code{prior.n} value of ten.

\begin{Schunk}
\begin{Sinput}
> de.tgw <- exactTest(d, common.disp = FALSE)
\end{Sinput}
\begin{Soutput}
Comparison of groups:  Tu - NC 
\end{Soutput}
\end{Schunk}

The output below shows that when using tagwise dispersions, the
\edgeR~package still identifies a lot of differential
expression between the normal colon cell group and the primary CR
cancer cell group---indeed the $p$-values of the top tags are even
smaller than the top tags based on the common dispersion. This arises
because the moderated tagwise dispersions can be much smaller than the
common dispersion, and tags with smaller dispersions will have smaller
p-values than the same tags with $p$-values computed using a common
dispersion. As with the analysis using the common dispersion, all of the top tags have a large fold change, indicating that these changes in expression are likely to be biologically meaningful. We note that the ranking is different, however, and not all of the top ten tags according to using the common dispersion are found to be among the top ten tags using tagwise dispersions.

\begin{Schunk}
\begin{Sinput}
> topTags(de.tgw)
\end{Sinput}
\begin{Soutput}
Comparison of groups:  Tu - NC 
           logConc  logFC    PValue       FDR
CTAAGACTTC  -7.191 -1.419 3.610e-56 2.074e-51
TGCTCCTACC  -9.985 -2.722 2.690e-21 7.726e-17
GTGCTGAATG -10.385 -2.459 6.358e-18 1.218e-13
CGCTGTGGGG -11.303 -2.982 2.311e-16 3.319e-12
GGGGTCAGGG  -9.639  1.934 1.928e-15 2.215e-11
GATGACCCCC -11.900 -3.603 4.097e-15 3.923e-11
TTATGGGATC -11.724  3.122 8.147e-14 6.065e-10
TCACCGGTCA -11.130 -4.229 8.446e-14 6.065e-10
ATGCGGGAGA -11.720 -2.797 5.618e-12 3.586e-08
GGCTGGGGGC -10.209 -1.701 1.012e-11 5.816e-08
\end{Soutput}
\end{Schunk}

The table below shows the raw counts for the tags that \edgeR~has identified as the most differentially expressed using tagwise dispersions. For these genes there seems to be very large differences between the groups, suggesting that the DE genes identified are truly differentially expressed, and not false positives. 

We note that in general, when using tagwise dispersions, the counts
are more consistent within groups, as using tagwise dispersions
instead of the common dispersion penalises tags which are highly
variable within groups. The smaller the value selected for
\code{prior.n}, the more highly variable tags will be penalised, as
there is less `squeezing' of the tagwise dispersions towards the
common value. This effect is seen clearly in the table below (compare
this with the corresponding table for the analysis using the common dispersion).

\begin{Schunk}
\begin{Sinput}
> detags.tgw <- rownames(topTags(de.tgw)$table)
> d$counts[detags.tgw, ]
\end{Sinput}
\begin{Soutput}
  GSM728 GSM729 GSM755 GSM756
CTAAGACTTC    559    558    248    199
TGCTCCTACC    140    113     22     19
GTGCTGAATG     96     79     17     17
CGCTGTGGGG     58     53      6      9
GGGGTCAGGG     27     37    134    127
GATGACCCCC     42     49      3      5
TTATGGGATC      6      4     48     45
TCACCGGTCA    118     75      6      5
ATGCGGGAGA     39     39      6      6
GGCTGGGGGC     75     77     26     24
\end{Soutput}
\end{Schunk}

Of course, we can sort the top table differently, as we did earlier.

We see in the output below that $309$ genes are significantly differentially expressed according to \edgeR~when using the tagwise dispersion estimates (ten fewer than when using the common dispersion). Of those tags, $100$ are up-regulated in the cancer cells compared with the normal cells and $209$ are down-regulated in the cancer cells compared with normal cells. These proportions of up- and down-regulated tags are similar to those found using the common dispersion, but there is a slightly higher proportion of down-regulated tags in those identified as DE using tagwise dispersions.

\begin{Schunk}
\begin{Sinput}
> sum(de.tgw$table$p.value < 0.01)
\end{Sinput}
\begin{Soutput}
[1] 309
\end{Soutput}
\begin{Sinput}
> toptgw <- topTags(de.tgw, n = sum(de.tgw$table$p.value < 0.01))
> sum(toptgw$table$logFC > 0)
\end{Sinput}
\begin{Soutput}
[1] 100
\end{Soutput}
\begin{Sinput}
> sum(toptgw$table$logFC < 0)
\end{Sinput}
\begin{Soutput}
[1] 209
\end{Soutput}
\end{Schunk}

Furthermore, we see below that $72$ tags ($0.13$\% of the total number) have a $p$-value of less than $0.05$ after adjusting for multiple testing using the~\citet{Benjamini95} method for controlling the FDR, which is strong evidence for differential expression.

\begin{Schunk}
\begin{Sinput}
> sum(p.adjust(de.tgw$table$p.value, method = "BH") < 0.05)
\end{Sinput}
\begin{Soutput}
[1] 72
\end{Soutput}
\begin{Sinput}
> mean(p.adjust(de.tgw$table$p.value, method = "BH") < 0.05) * 
+     100
\end{Sinput}
\begin{Soutput}
[1] 0.1253307
\end{Soutput}
\end{Schunk}


\subsubsection{Visualising DGE results}
Shown below is the code for producing the default fold-change plot using \code{plotSmear} with the DE tags as determined using tagwise dispersions highlighted, and the result of this code is shown in Figure~\ref{fig:Zhang_FC2}. 

\begin{Schunk}
\begin{Sinput}
> detags.tgw <- rownames(topTags(de.tgw, n = sum(de.tgw$table$p.value
< 0.01) )$table)
\end{Sinput}
s\end{Schunk}
\begin{Schunk}
\begin{Sinput}
> plotSmear(d, de.tags = detags.tgw, main = "FC plot using tagwise dispersions")
> abline(h = c(-2, 2), col = "dodgerblue")
\end{Sinput}
\end{Schunk}

In Figure~\ref{fig:Zhang_FC2}, the $309$ tags identified as differentially expressed (i.e.~those identified as significant ($p$-value less than $0.01$) by \edgeR~using the tagwise dispersions) are highlighted in red. We see that the pattern of differential expression using tagwise dispersions that we see in Figure~\ref{fig:Zhang_FC2} is very similar to that obtained using the common dispersion that we saw in Figure~\ref{fig:Zhang_FC1}.

\begin{figure}[ht]
\begin{center}
\includegraphics[width=\textwidth]{edgeR_case_study_Zhang-028.png}
\caption{Plot of the log-fold change against the log-concentration for each tag. The 264 most differentially expressed tags as identified by \edgeR~are outlined in red.}
\label{fig:Zhang_FC2}
\end{center}
\end{figure}

\subsection{Setup}
This analysis of \citet{Zhang:1997p13}'s SAGE data was conducted on:
\begin{Schunk}
\begin{Sinput}
> sessionInfo()
\end{Sinput}
\begin{Soutput}
R version 2.9.1 (2009-06-26) 
i386-apple-darwin8.11.1 

locale:
en_AU.UTF-8/en_AU.UTF-8/C/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_1.3.6

loaded via a namespace (and not attached):
[1] limma_2.18.2
\end{Soutput}
\end{Schunk}
and took 2--3 minutes to carry out on an Apple MacBook with a 2.4 Ghz Intel Core 2 Duo processor and 4 Gb of 1067 MHz DDR3 memory.


\section{Case Study: deep-sequenced short tags}
\subsection{Introduction}
This section provides a detailed analysis of data from an experiment seeking to compare deep-sequenced tag-based expression profiling to the microarray platforms that had been previously used to conduct such studies~\citep{THoen:2008p9}. 

\subsection{Source of the data}
\citet{THoen:2008p9} address both biological and technical questions in their study. The biological question addressed was the identification of transcripts differentially expressed in the hippocampus between wild-type mice and transgenic mice overexpressing a splice variant of the $\delta$C-doublecortin-like kinase-1 (\emph{Dclk1}) gene. The splice variant, DCLK-short, makes the kinase constitutively active and causes subtle behavioural phenotypes.

On the technical side, the researchers compare the robustness, resolution and inter-lab portability of Solexa/Illumina's DGE tag profiling approach and five microarray platforms~\citep{THoen:2008p9}. The tag-based gene expression technology in this experiment could be thought of as a hybrid between SAGE and RNA-seq---like SAGE it uses short sequence tags ($\sim17$bp) to identify transcripts, but it uses the deep sequencing capabilities of Solexa/Illumina 1G Genome Analyzer to greatly increase the number of tags that can be sequenced. For our purposes we will concentrate solely on the DGE data generated in the experiment.

The RNA samples came from wild-type male C57/BL6j mice and transgenic mice overexpressing DCLK-short with a C57/BL6j background. Tissue samples were collected from four individuals in each of the two groups by dissecting out both hippocampi from each mouse. Total RNA was isolated and extracted from the hippocampus cells and sequence tags were prepared using Illumina's Digital Gene Expression Tag Profiling Kit according to the manufacturer's protocol. 

Sequencing was done using Solexa/Illumina's Whole Genome Sequencer. RNA from each biological sample was supplied to an individual lane in one Illumina 1G flowcell. The instrument conducted $18$ cycles of base incorporation, then image analysis and basecalling were performed using the Illumina Pipeline. Sorting and counting the unique tags followed, and the raw data (tag sequences and counts) are what we will analyze here. \citet{THoen:2008p9} went on to annotate the tags by mapping them back to the genome. In general, the mapping of tags is an important and highly non-trivial part of a DGE experiment, but we shall not deal with this task in this case study. 

The researchers obtained $\sim2.4$ million sequence tags per sample, with tag abundance spanning four orders of magnitude. They found the results to be highly reproducible, even across laboratories. Using a dedicated Bayesian model, they found $3179$ transcripts to be differentially expressed with a FDR of $8.5$\%. This is a much higher figure than was found for the microarrays. \citet{THoen:2008p9} conclude that deep-sequencing offers a major advance in robustness, comparability and richness of expression profiling data.


\subsection{Reading in the data and creating a \code{DGEList} object}
Our first task is to load the \edgeR~package, read the data into \R~and organise the data into an object that the functions in the package can recognise. In this case, the tag counts for the eight individual libraries are stored in eight separate plain text files, \code{GSM272105.txt}, \code{GSM272106.txt}, \code{GSM272318.txt}, \code{GSM272319.txt}, \code{GSM272320.txt}, \code{GSM272321.txt}, \code{GSM272322.txt} and \code{GSM272323.txt}.

In each file, the tag IDs and counts for each tag are provided in a table. It is best to create a tab-delimited, plain-text `Targets' file, which, under the headings `files', `group' and `description', gives the filename, the group and a brief description for each sample.

The \code{targets} object is produced when the `Targets.txt' file is read into the \code{R} session. This object makes a convenient argument to the function \code{readDGE} which reads the tables of counts into our \code{R} session, calculates the sizes of the count libraries and produces a \code{DGEList} object for use by subsequent functions.

\begin{Schunk}
\begin{Sinput}
> setwd("/Users/davismcc/Documents/Honours/Data/Long_SAGE_Data")
> library(edgeR)
> targets <- read.delim(file = "targets.txt", stringsAsFactors = FALSE)
> targets
\end{Sinput}
\begin{Soutput}
          files group                          description
1 GSM272105.txt  DCLK transgenic (Dclk1) mouse hippocampus
2 GSM272106.txt    WT          wild-type mouse hippocampus
3 GSM272318.txt  DCLK transgenic (Dclk1) mouse hippocampus
4 GSM272319.txt    WT          wild-type mouse hippocampus
5 GSM272320.txt  DCLK transgenic (Dclk1) mouse hippocampus
6 GSM272321.txt    WT          wild-type mouse hippocampus
7 GSM272322.txt  DCLK transgenic (Dclk1) mouse hippocampus
8 GSM272323.txt    WT          wild-type mouse hippocampus
\end{Soutput}
\begin{Sinput}
> d <- readDGE(targets, skip = 5, comment.char = "!")
> d
\end{Sinput}
\begin{Soutput}
An object of class "DGEList"
$samples
                  files group                          description
GSM272105 GSM272105.txt  DCLK transgenic (Dclk1) mouse hippocampus
GSM272106 GSM272106.txt    WT          wild-type mouse hippocampus
GSM272318 GSM272318.txt  DCLK transgenic (Dclk1) mouse hippocampus
GSM272319 GSM272319.txt    WT          wild-type mouse hippocampus
GSM272320 GSM272320.txt  DCLK transgenic (Dclk1) mouse hippocampus
GSM272321 GSM272321.txt    WT          wild-type mouse hippocampus
GSM272322 GSM272322.txt  DCLK transgenic (Dclk1) mouse hippocampus
GSM272323 GSM272323.txt    WT          wild-type mouse hippocampus
          lib.size
GSM272105  2685418
GSM272106  3517977
GSM272318  3202246
GSM272319  3558260
GSM272320  2460753
GSM272321   294909
GSM272322   651172
GSM272323  3142280

$counts
                  GSM272105 GSM272106 GSM272318 GSM272319 GSM272320
CATCGCCAGCGGGCACC         1         0         0         0         0
AAGGTCGACTCTGAAGT         1         1         0         0         0
CCTTCCTGGCTCTATGG         1         0         0         0         0
TCTGCTGAGCGTCTGTT         1         0         0         0         0
CCCCAGAGCGAATCAGG         1         1         2         1         1
                  GSM272321 GSM272322 GSM272323
CATCGCCAGCGGGCACC         0         0         0
AAGGTCGACTCTGAAGT         0         0         0
CCTTCCTGGCTCTATGG         0         0         0
TCTGCTGAGCGTCTGTT         0         0         0
CCCCAGAGCGAATCAGG         0         2         1
844311 more rows ...
\end{Soutput}
\end{Schunk}

This \code{DGEList} is now ready to be passed to the functions that do the calculations to determine differential expression levels for the genes. Note that when we `see' the \code{DGEList} object \code{d}, the counts for just the first five genes in the table are shown, as well as the \code{samples} element, which is a data frame constructed from the `Targets.txt' file and provides the filenames, groups, descriptions and library sizes for the samples.

However, for this dataset, there were over 800 000 unique tags sequenced, most of which have a very small number of counts in total across all libraries. Since it is not possible to achieve statistical significance with fewer than six counts in total for a tag, we filter out tags with five or fewer counts in total---this also helps to speed up the calculations we need to perform. The subsetting capability of \code{DGEList} objects makes such filtering very easy to carry out.

\begin{Schunk}
\begin{Sinput}
> d <- d[rowSums(d$counts) > 5, ]
\end{Sinput}
\end{Schunk}

Now the dataset is ready to be analysed for differential expression.


\subsection{Producing an MDS plot}
Before proceeding with the computations for differential expression, it is possible to produce a plot showing the sample relations based on multidimensional scaling. The function \code{plotMDS.dge} produces an MDS plot for the samples when provided with the \code{DGEList} object and other usual graphical parameters as arguments, as shown below.

\begin{Schunk}
\begin{Sinput}
> plotMDS.dge(d, main="MDS Plot for 't Hoen Data", xlim=c(-2,1))
\end{Sinput}
\end{Schunk}


This function is a variation on the usual multdimensional scaling (or
principle coordinate) plot, in that a distance measure particularly
appropriate for the digital gene expression (DGE) context is used. The
distance between each pair of samples (columns) is the square root of
the common dispersion for the top $n$ (default is $n = 500$) genes
which best distinguish that pair of samples. These top $n$ genes are
selected according to the tagwise dispersion of all the samples. The
resulting plot for the `t Hoen data is shown
in~\ref{fig:tHoen_MDS}. From this MDS plot we can identify sample
GSM272322 as an outlier---there is a large distance between this
sample and the others in Dimension 1. It would be a sensible approach
to remove this sample from the dataset for analysis. In the
analysis that follows, however, we do not remove the sample as we will
just be demonstrating the use of \edgeR. 

\begin{figure}[ht]
\begin{center}
\includegraphics[width=0.9\textwidth]{edgeR_case_study_longSAGE_MDSplot.pdf}
\caption{Multidimensional scaling (MDS) plot for the `t Hoen data,
  showing the relations between the samples in two dimensions. From
  this plot, the sample GSM272322 can be identified easily as an
  outlier---there is a large distance between this sample and the others
  in Dimension 1.}
\label{fig:tHoen_MDS}
\end{center}
\end{figure}


\subsection{Analysis using common dispersion}
\subsubsection{Estimating the common dispersion}
As discussed for the SAGE data, the first major step in the analysis of DGE data using the NB model is to estimate the dispersion parameter for each tag. Like in the earlier case study, we begin by estimating the common dispersion using the function \code{estimateCommonDisp}. 

\begin{Schunk}
\begin{Sinput}
> d <- estimateCommonDisp(d)
> names(d)
\end{Sinput}
\begin{Soutput}
[1] "samples"           "common.dispersion" "counts"           
[4] "pseudo.alt"        "conc"              "common.lib.size"  
\end{Soutput}
\end{Schunk}

We see in the output below that the total counts in each library of the pseudocounts agrees well with the common library size, as desired. 

\begin{Schunk}
\begin{Sinput}
> d$samples$lib.size
\end{Sinput}
\begin{Soutput}
[1] 2685418 3517977 3202246 3558260 2460753  294909  651172 3142280
\end{Soutput}
\begin{Sinput}
> d$common.lib.size
\end{Sinput}
\begin{Soutput}
[1] 1885653
\end{Soutput}
\begin{Sinput}
> colSums(d$pseudo.alt)
\end{Sinput}
\begin{Soutput}
GSM272105 GSM272106 GSM272318 GSM272319 GSM272320 GSM272321 GSM272322 
  1781297   1783960   1778180   1774107   1789066   1777588   1795899 
GSM272323 
  1783746 
\end{Soutput}
\end{Schunk}

Here the coefficient of variation of biological variation (square root of the common dispersion) is found to be $0.44$. We also note that a common dispersion estimate of $0.2$ means that there is a lot more variability in the data that can be accounted for by the Poisson model---if a tag has just $200$ counts, then the estimate of the tag's variance under the NB model is over $40$ times greater than it would be under the Poisson model.

\begin{Schunk}
\begin{Sinput}
> d$common.dispersion
\end{Sinput}
\begin{Soutput}
[1] 0.1977094
\end{Soutput}
\begin{Sinput}
> sqrt(d$common.dispersion)
\end{Sinput}
\begin{Soutput}
[1] 0.4446452
\end{Soutput}
\end{Schunk}

\subsubsection{Testing}
Once we have an estimate of the common dispersion, we can proceed with testing procedures for determining differential expression. As for the SAGE data, there are only two groups here, so the pair need not be specified in the call to \code{exactTest}.

\begin{Schunk}
\begin{Sinput}
> de.common <- exactTest(d)
\end{Sinput}
\begin{Soutput}
Comparison of groups:  WT - DCLK 
\end{Soutput}
\end{Schunk}

The results of the NB exact test can be accessed conveniently using the \code{topTags} function applied to the object produced by \code{exactTest}. The table below shows the top $10$ DE genes ranked by $p$-value.

The table in the output from \code{topTags} shows that the \edgeR~package identifies a good deal of differential expression between the wild-type and the DCLK-transgenic groups. The top DE tags are given very small $p$-values, even after adjusting for multiple testing. Furthermore, all of the top tags have a large fold change, indicating that these tags are likely to be biologically meaningful. As suggested in the SAGE case study, a Gene Ontology analysis could be carried out using the list of top tags and $p$-values provided by \code{topTags} in order to obtain more systematic and functional information about the differentially expressed genes.

\begin{Schunk}
\begin{Sinput}
> topTags(de.common)
\end{Sinput}
\begin{Soutput}
Comparison of groups:  WT - DCLK 
                    logConc     logFC       PValue          FDR
AATTTCTTCCTCTTCCT -17.29988 11.605093 1.028275e-36 1.149704e-31
CCGTCTTCTGCTTGTCG -10.57698  5.571165 1.076926e-23 6.020501e-19
TCTGTACGCAGTCAGGC -18.46570 -9.731983 7.100854e-23 2.646464e-18
CCGTCTTCTGCTTGTAA -14.44395  5.448158 1.083062e-21 3.027402e-17
CCGTCTTCTGCTTGTCA -15.45499  5.496920 2.759755e-20 6.171309e-16
AAGACTCAGGACTCATC -32.27026 35.491588 7.516087e-20 1.400610e-15
CCGTCTTCTGCTTGTAG -15.57138  4.803709 3.572708e-17 5.453064e-13
AGTGTGACGTGACCGGG -19.06213  8.067070 3.901700e-17 5.453064e-13
AAATTCTTCCTCTTCCT -19.14713  7.910596 2.838381e-16 3.526184e-12
TGTGTATCCCACAAGGG -18.68579  6.864501 5.262632e-16 5.884096e-12
\end{Soutput}
\end{Schunk}

The table below shows the raw counts for the tags that \edgeR~has identified as the most differentially expressed. For these tags there seem to be very large differences between the groups, suggesting that the DE genes identified are truly differentially expressed, and not false positives.

\begin{Schunk}
\begin{Sinput}
> detags.com <- rownames(topTags(de.common)$table)
> d$counts[detags.com, ]
\end{Sinput}
\begin{Soutput}
                  GSM272105 GSM272106 GSM272318 GSM272319 GSM272320
AATTTCTTCCTCTTCCT         1        44         0         1         0
CCGTCTTCTGCTTGTCG       106      1485       268       420       601
TCTGTACGCAGTCAGGC       160         0       101         1       440
CCGTCTTCTGCTTGTAA        12        87        21        28        31
CCGTCTTCTGCTTGTCA         2        42         8        17        19
AAGACTCAGGACTCATC         0         6         0         2         0
CCGTCTTCTGCTTGTAG         9        61        11        20        17
AGTGTGACGTGACCGGG         0       249         0         2         1
AAATTCTTCCTCTTCCT         1         6         0         0         0
TGTGTATCCCACAAGGG         1         1         1         0         0
                  GSM272321 GSM272322 GSM272323
AATTTCTTCCTCTTCCT        76         0      3487
CCGTCTTCTGCTTGTCG      5156         5       242
TCTGTACGCAGTCAGGC         0        33         0
CCGTCTTCTGCTTGTAA       352         1        14
CCGTCTTCTGCTTGTCA       183         1        17
AAGACTCAGGACTCATC         4         0       461
CCGTCTTCTGCTTGTAG       133         0         9
AGTGTGACGTGACCGGG         5         0        85
AAATTCTTCCTCTTCCT         2         0       288
TGTGTATCCCACAAGGG         6         0       252
\end{Soutput}
\end{Schunk}

If we order the tags by fold change instead of $p$-value, as in the table below, we see that the genes with the largest fold changes have very small concentrations, and in general the $p$-values are not as small as when ranked by $p$-value (not surprisingly). This ranking is dominated by genes that have zero total counts in one group and is less informative than ranking by $p$-value. 

\begin{Schunk}
\begin{Sinput}
> topTags(de.common, sort.by = "logFC")
\end{Sinput}
\begin{Soutput}
Comparison of groups:  WT - DCLK 
                    logConc     logFC       PValue          FDR
AAGACTCAGGACTCATC -32.27026  35.49159 7.516087e-20 1.400610e-15
CCTGATGCTACAGAAAA -32.72599  34.58013 5.104779e-15 5.188729e-11
CATAAGTCACAGAGTCG -32.76506 -34.50198 6.541738e-14 5.626348e-10
ACTCTGTGTATTACTCC -32.89435  34.24342 3.132417e-14 2.918603e-10
GATTTTTGTCGTGTTGG -32.95947  34.11317 2.185773e-13 1.745636e-09
AAAAGAAATCACAGTTG -32.96984 -34.09243 7.296827e-12 3.399379e-08
CACATAAGACTTTGGAC -33.09656  33.83899 2.181792e-12 1.434965e-08
AAAATGTTGTTTATGGA -33.10525  33.82162 4.049564e-12 2.156084e-08
GAAATTCTCCATTGATT -33.13607  33.75996 4.049564e-12 2.156084e-08
AAATTATTCCTCTTCCT -33.17108  33.68995 9.017868e-12 4.033115e-08
\end{Soutput}
\end{Schunk}

Using their dedicated Bayesian model, \citet{THoen:2008p9} found $3179$ transcripts to be differentially expressed with a FDR of $8.5$\%. We can compare \citet{THoen:2008p9}'s results with the results from \edgeR~by applying the \code{topTags} function to help look at the tags that have a FDR of less than $0.085$ after adjusting for multiple testing using \citet{Benjamini95}'s method for controlling the FDR. 

We see in the output below that $1710$ tags ($1.5$\% of the total number analysed) are significantly differentially expressed according to \edgeR~using the common dispersion estimate. Of those tags, $943$ ($55$\% of the DE tags) are up-regulated in the wild-type compared with the transgenic samples and $767$ ($45$\%) are down-regulated in the wild-type compared with transgenic mice.

\begin{Schunk}
\begin{Sinput}
> sum(p.adjust(de.common$table$p.value, method = "BH") < 0.085)
\end{Sinput}
\begin{Soutput}
[1] 1710
\end{Soutput}
\begin{Sinput}
> mean(p.adjust(de.common$table$p.value, method = "BH") < 
+     0.085) * 100
\end{Sinput}
\begin{Soutput}
[1] 1.529394
\end{Soutput}
\begin{Sinput}
> top.com <- topTags(de.common, n = 1710)
> sum(top.com$table$logFC > 0)
\end{Sinput}
\begin{Soutput}
[1] 943
\end{Soutput}
\begin{Sinput}
> sum(top.com$table$logFC < 0)
\end{Sinput}
\begin{Soutput}
[1] 767
\end{Soutput}
\end{Schunk}

\subsubsection{Visualising DGE results}
The code for producing the default fold-change plot, with the top $500$ most DE tags highlighted in red, is shown below, and the result of this code is shown in Figure~\ref{fig:longSAGE_FC1}. In Figure~\ref{fig:longSAGE_FC1}, we see that the $500$ tags identified as most differentially expressed have large fold changes---almost all of the $500$ tags in red fall outside the blue lines at $\log \textrm{FC} = -2$ and $\log \textrm{FC} = 2$. This means that most of these tags show at least a 4-fold change in expression level between the samples. This plot suggests strongly that the tags identified by \edgeR~as differentially expressed are truly differentially expressed, and, given the large changes in expression level, are likely to be biologically meaningful.

\begin{Schunk}
\begin{Sinput}
> detags500.com <- rownames(topTags(de.common, n = 500)$table)
\end{Sinput}
\end{Schunk}
\begin{Schunk}
\begin{Sinput}
> plotSmear(d, de.tags = detags500.com, main = "FC plot using common dispersion")
> abline(h = c(-2, 2), col = "dodgerblue", lwd = 2)
\end{Sinput}
\end{Schunk}

\begin{figure}[ht]
\begin{center}
\includegraphics[width=\textwidth]{edgeR_case_study_longSAGE-015.png}
\caption{Plot of the log-fold change against the log-concentration for each tag. The $500$ most differentially expressed tags as identified by \edgeR~using the common dispersion are outlined in red.}
\label{fig:longSAGE_FC1}
\end{center}
\end{figure}


\subsection{Analysis using moderated tagwise dispersions}
\subsubsection{Moderating the tagwise dispersion}
An extension to simply using the common dispersion for each tag is to estimate the dispersion separately for each tag, while `squeezing' these estimates towards the common dispersion estimate in order to improve inference by sharing information between tags. This type of analysis can also be carried out in few steps using the \edgeR~package.

To run the moderated analysis, we need to determine how much
moderation is necessary. As discussed in the SAGE case study, above,
we currently recommend choosing a
value for \code{prior.n} \emph{a priori} that will provide an
appropriate balance between the common and tagwise dispersion
values. This moderation can improve the analysis by giving higher
levels of significance to tags which have more consistent counts
within groups (and therefore lower within-group variance) and reducing
the significance of tags which have one extremely large count in one
library, which can otherwise dominate the statistical assessment of
differential expression. 


In an experiment such as that we consider here, in which we have eight samples and thus six degrees of freedom for estimating the dispersion parameter, setting the \code{prior.n} to be ten should be appropriate. This means that the common likelihood receives the weight of ten individual tags, so there will be a reasonable degree of `squeezing' towards the common dispersion estimate, but there is still enough scope to estimate an individual dispersion for each tag.

The function \code{estimateTagwiseDisp} produces a DGEList object that contains all of the elements present in the object produced by \code{estimateCommonDisp}, as well as the value for \code{prior.n} used (\code{d\$prior.n}) and the tagwise dispersion estimates (\code{d\$tagwise.dispersion}), as we see below.

\begin{Schunk}
\begin{Sinput}
> d <- estimateTagwiseDisp(d, prior.n = 10)
\end{Sinput}
\begin{Soutput}
Using grid search to estimate tagwise dispersion. 
\end{Soutput}
\begin{Sinput}
> names(d)
\end{Sinput}
\begin{Soutput}
[1] "samples"            "common.dispersion"  "prior.n"           
[4] "tagwise.dispersion" "counts"             "pseudo.alt"        
[7] "conc"               "common.lib.size"   
\end{Soutput}
\begin{Sinput}
> d$prior.n
\end{Sinput}
\begin{Soutput}
[1] 10
\end{Soutput}
\begin{Sinput}
> head(d$tagwise.dispersion)
\end{Sinput}
\begin{Soutput}
[1] 0.1997564 0.1854905 0.1925808 0.1579529 0.2070189 0.1854905
\end{Soutput}
\end{Schunk}

It is interesting to consider the distribution of the tagwise dispersion estimates. As we can see from the output below, the tagwise dispersion estimates range from a minimum of $0.09$ to a maximum of $0.97$, and the common dispersion estimate lies in between the median and mean values for the tagwise dispersion estimates.

\begin{Schunk}
\begin{Sinput}
> summary(d$tagwise.dispersion)
\end{Sinput}
\begin{Soutput}
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.08842 0.18550 0.19260 0.19930 0.20700 0.99000 
\end{Soutput}
\begin{Sinput}
> d$common.dispersion
\end{Sinput}
\begin{Soutput}
[1] 0.1977094
\end{Soutput}
\end{Schunk}


\subsubsection{Estimating the smoothing parameter using an approximate
  eBayes rule}
Although we recommend using a fixed value for the \code{prior.n}
parameter, chosen \emph{a priori}, there is an alternative
approach. \citet{Robinson07} have developed an approximate empirical
Bayes algorithm for estimating the smoothing parameter, and this
method is implemented in \edgeR. The smoothing parameter, \code{prior.n} can be calculated using
the function \code{estimateSmoothing}, which takes the
\code{DGEList} object produced by \code{estimateCommonDisp} as its
argument. As we see below, for this dataset the estimate of the
smoothing parameter is very small, so if we were to use this small
value for the weight parameter we would moderate the tagwise
dispersion estimates hardly at all. It would be equivalent to using
the unmoderated tagwise dispersion estimates, which we do not trust as
there are not enough samples to estimate the tagwise dispersions
reliably. We will be much better off with greater moderation of the
tagwise dispersions, as outlined above. At this time, it is
recommended that the approximate
empirical Bayes method for estimating the smoothing parameter be used
as a guide, but not as a routine part of DGE analysis.

\begin{Schunk}
\begin{Sinput}
> prior.n <- estimateSmoothing(d)
> prior.n
\end{Sinput}
\begin{Soutput}
[1] 0.0008078916
\end{Soutput}
\end{Schunk}


\subsubsection{Testing}
Once we have an estimate of the common dispersion and/or estimates of
the tagwise dispersions, we can proceed with testing procedures for
determining differential expression using \code{exactTest}. Here we
carry out the testing using the tagwise dispersion estimates
calculated using a \code{prior.n} value of ten.

By default, \code{exactTest} uses the common dispersion, but by adding the argument \code{common.disp=FALSE}, tagwise dispersion estimates will be used instead.

\begin{Schunk}
\begin{Sinput}
> de.tagwise <- exactTest(d, common.disp = FALSE)
\end{Sinput}
\begin{Soutput}
Comparison of groups:  WT - DCLK 
\end{Soutput}
\end{Schunk}

Just as we saw earlier, the object produced by \code{exactTest} contains two elements. The first is a data frame (\code{table}) that contains the elements \code{logConc}, \code{logFC} and \code{p.value} and the second is a vector (\code{comparison}) that lists the names of the groups being compared.

The output below shows that when using tagwise dispersions, the \edgeR~package still identifies a lot of differential expression between the wild-type group and the DCLK-transgenic group. The top DE tags are given very small $p$-values, even after adjusting for multiple testing. However, We see immediately that the $p$-values for the top tags are many orders of magnitude greater than those for the top tags identified using the common dispersion. 

As with the analysis using the common dispersion, all of the top tags have a large fold change, indicating that these changes in expression are likely to be biologically meaningful, although interestingly we see more tags ($7$ out of $10$) that are down-regulated in the wild-type group compared with the DCLK group, which contrasts with using the common dispersion. We note that the ranking of the tags is different, too, and only three of the top ten tags according to using the common dispersion are found to be among the top ten tags using tagwise dispersions.

\begin{Schunk}
\begin{Sinput}
> topTags(de.tagwise)
\end{Sinput}
\begin{Soutput}
Comparison of groups:  WT - DCLK 
                    logConc      logFC       PValue          FDR
TCTGTACGCAGTCAGGC -18.46778  -9.732317 2.271105e-18 2.539300e-13
CATAAGTCACAGAGTCG -32.76303 -34.506051 4.507171e-15 2.519711e-10
CCAAGAATCTGGTCGTA -17.46762  -3.929366 1.966857e-12 7.330412e-08
AATTTCTTCCTCTTCCT -17.30380  11.607564 3.139447e-12 8.775460e-08
ATACTGACATTTCGTAT -16.77800   4.144907 6.097301e-12 1.363466e-07
GCTAATAAATGGCAGAT -14.91149  -3.295240 9.212532e-12 1.716740e-07
CTGCTAAGCAGAAGCAA -16.99434  -3.427292 1.803488e-11 2.880660e-07
TTCCTGAAAATGTGAAG -17.08403  -3.639794 2.653336e-11 3.708335e-07
AAAAGAAATCACAGTTG -32.97330 -34.085504 6.032072e-11 7.493777e-07
AGTGTGACGTGACCGGG -19.07157   8.035868 1.581782e-10 1.768575e-06
\end{Soutput}
\end{Schunk}

Of course, we can also rank the top tags using the fold change instead of the $p$-value. The results of doing this are shown in the table below, but this ranking is dominated by genes that have zero total counts in one group and is less informative than ranking by $p$-value.

\begin{Schunk}
\begin{Sinput}
> topTags(de.tagwise, n = 10, sort.by = "logFC")
\end{Sinput}
\begin{Soutput}
Comparison of groups:  WT - DCLK 
                    logConc     logFC       PValue          FDR
AAGACTCAGGACTCATC -32.29053  35.45105 1.216554e-08 5.667572e-05
CCTGATGCTACAGAAAA -32.75498  34.52215 4.898918e-07 1.165411e-03
CATAAGTCACAGAGTCG -32.76303 -34.50605 4.507171e-15 2.519711e-10
ACTCTGTGTATTACTCC -32.92486  34.18238 2.898915e-08 1.200463e-04
AAAAGAAATCACAGTTG -32.97330 -34.08550 6.032072e-11 7.493777e-07
GATTTTTGTCGTGTTGG -33.00084  34.03043 1.747407e-07 4.765265e-04
CACATAAGACTTTGGAC -33.13007  33.77197 2.927280e-06 3.596662e-03
GAAATTCTCCATTGATT -33.15363  33.72485 7.690677e-07 1.500202e-03
AAAATGTTGTTTATGGA -33.15760  33.71690 5.539764e-06 5.567035e-03
AAATTATTCCTCTTCCT -33.18808  33.65595 1.494026e-06 2.456552e-03
\end{Soutput}
\end{Schunk}

The tables below shows the raw counts for the genes that \edgeR~has identified as the most differentially expressed, using the common dispersion and tagwise dispersions. For these tags, using both methods, there seem to be very large differences between the groups, suggesting that the DE genes identified are truly differentially expressed, and not false positives. 

Particularly noteworthy, however, is how much more consistent the counts \emph{within} groups are for the top tags identified using tagwise dispersions compared with those identified using the common dispersion. This is to be expected, as allowing tagwise dispersions penalises highly variable tags, so those that have greater variability within groups (especially one or two libraries with extremely high counts) will appear far lower in the ranking using tagwise dispersions than they would using the common dispersion. This difference in the rankings provided by the two approaches to the dispersion parameter could yield valuable information.

\begin{Schunk}
\begin{Sinput}
> detags.tgw <- rownames(topTags(de.tagwise)$table)
> detags.com <- rownames(topTags(de.common)$table)
> d$counts[detags.tgw, ]
\end{Sinput}
\begin{Soutput}
                  GSM272105 GSM272106 GSM272318 GSM272319 GSM272320
TCTGTACGCAGTCAGGC       160         0       101         1       440
CATAAGTCACAGAGTCG        67         0        77         0        58
CCAAGAATCTGGTCGTA        70         3        66         5        47
AATTTCTTCCTCTTCCT         1        44         0         1         0
ATACTGACATTTCGTAT         5       113         5       228         8
GCTAATAAATGGCAGAT       387        45       321        32       132
CTGCTAAGCAGAAGCAA        76         7        88         7        52
TTCCTGAAAATGTGAAG        74         6        70         9        86
AAAAGAAATCACAGTTG        31         0        90         0        42
AGTGTGACGTGACCGGG         0       249         0         2         1
                  GSM272321 GSM272322 GSM272323
TCTGTACGCAGTCAGGC         0        33         0
CATAAGTCACAGAGTCG         0         7         0
CCAAGAATCTGGTCGTA         0        13         7
AATTTCTTCCTCTTCCT        76         0      3487
ATACTGACATTTCGTAT         4         1       104
GCTAATAAATGGCAGAT         1        71        38
CTGCTAAGCAGAAGCAA         0        15        11
TTCCTGAAAATGTGAAG         0        10         7
AAAAGAAATCACAGTTG         0         3         0
AGTGTGACGTGACCGGG         5         0        85
\end{Soutput}
\begin{Sinput}
> d$counts[detags.com, ]
\end{Sinput}
\begin{Soutput}
                  GSM272105 GSM272106 GSM272318 GSM272319 GSM272320
AATTTCTTCCTCTTCCT         1        44         0         1         0
CCGTCTTCTGCTTGTCG       106      1485       268       420       601
TCTGTACGCAGTCAGGC       160         0       101         1       440
CCGTCTTCTGCTTGTAA        12        87        21        28        31
CCGTCTTCTGCTTGTCA         2        42         8        17        19
AAGACTCAGGACTCATC         0         6         0         2         0
CCGTCTTCTGCTTGTAG         9        61        11        20        17
AGTGTGACGTGACCGGG         0       249         0         2         1
AAATTCTTCCTCTTCCT         1         6         0         0         0
TGTGTATCCCACAAGGG         1         1         1         0         0
                  GSM272321 GSM272322 GSM272323
AATTTCTTCCTCTTCCT        76         0      3487
CCGTCTTCTGCTTGTCG      5156         5       242
TCTGTACGCAGTCAGGC         0        33         0
CCGTCTTCTGCTTGTAA       352         1        14
CCGTCTTCTGCTTGTCA       183         1        17
AAGACTCAGGACTCATC         4         0       461
CCGTCTTCTGCTTGTAG       133         0         9
AGTGTGACGTGACCGGG         5         0        85
AAATTCTTCCTCTTCCT         2         0       288
TGTGTATCCCACAAGGG         6         0       252
\end{Soutput}
\end{Schunk}

We might also be interested in comparing the top-ranking genes as identified by \edgeR~using the common dispersion and tagwise dispersions. The output below shows, firstly, that there are three tags that appear in the top ten most DE tags using both common and tagwise dispersions. Secondly, we see that of the top $1000$ most DE tags as identified using tagwise dispersions, $77$\% of these tags are also in the list of the $1000$ most DE tags as identified using the common dispersion. This shows that although we do get quite different results depending on which method we use, there is still a great deal of agreement as to which tags are DE.

\begin{Schunk}
\begin{Sinput}
> sum(rownames(topTags(de.tagwise)$table) %in% rownames(topTags(de.common)$table))
\end{Sinput}
\begin{Soutput}
[1] 3
\end{Soutput}
\begin{Sinput}
> sum(rownames(topTags(de.tagwise, n = 1000)$table) %in% rownames(topTags(de.common, 
+     n = 1000)$table))/1000 * 100
\end{Sinput}
\begin{Soutput}
[1] 76.7
\end{Soutput}
\end{Schunk}

Using their dedicated Bayesian model, \citet{THoen:2008p9} found $3179$ transcripts to be differentially expressed with a FDR of $8.5$\%. The output below shows that using \citet{Benjamini95}'s approach for controlling the FDR at $8.5$\%, \edgeR~identifies $1717$ tags as DE using common dispersion and $1441$ tags as DE using tagwise dispersions. This means that we determine $1.54$\% and $1.29$\% of tags to be DE using common and tagwise dispersions, respectively.

\begin{Schunk}
\begin{Sinput}
> sum(p.adjust(de.common$table$p.value, method = "BH") < 0.085)
\end{Sinput}
\begin{Soutput}
[1] 1710
\end{Soutput}
\begin{Sinput}
> mean(p.adjust(de.common$table$p.value, method = "BH") < 
+     0.085) * 100
\end{Sinput}
\begin{Soutput}
[1] 1.529394
\end{Soutput}
\begin{Sinput}
> sum(p.adjust(de.tagwise$table$p.value, method = "BH") < 
+     0.085)
\end{Sinput}
\begin{Soutput}
[1] 1441
\end{Soutput}
\begin{Sinput}
> mean(p.adjust(de.tagwise$table$p.value, method = "BH") < 
+     0.085) * 100
\end{Sinput}
\begin{Soutput}
[1] 1.288805
\end{Soutput}
\end{Schunk}

Of the $1441$ tags identified as DE using tagwise dispersions, $729$ ($51$\%) are up-regulated in wild-type and $712$ ($49$\%) are up-regulated in the transgenic mice. The proportions of up- and down-regulated genes identified using the two approaches to modeling the dispersion are similar, but using the common dispersion identifies slightly more tags up-regulated in wild-type mice as DE.

\begin{Schunk}
\begin{Sinput}
> top.tgw <- topTags(de.tagwise, n = 1441)
> sum(top.tgw$table$logFC > 0)
\end{Sinput}
\begin{Soutput}
[1] 729
\end{Soutput}
\begin{Sinput}
> sum(top.tgw$table$logFC < 0)
\end{Sinput}
\begin{Soutput}
[1] 712
\end{Soutput}
\end{Schunk}

\subsubsection{Visualising DGE results}
As discussed earlier, the function \code{plotSmear} can be used to generate a plot of the log-fold change against the log-concentration for each tag (analogous to an MA-plot in the microarray context). We identify the top $500$ most DE tags using both common dispersion and tagwise dispersions so we can highlight them on the plots and compare what we see. The code for producing the fold-change plots is shown below, and the result of this code is shown in Figure~\ref{fig:longSAGE_FC2}. 

\begin{Schunk}
\begin{Sinput}
> detags500.com <- rownames(topTags(de.common, n = 500)$table)
> detags500.tgw <- rownames(topTags(de.tagwise, n = 500)$table)
\end{Sinput}
\end{Schunk}
\begin{Schunk}
\begin{Sinput}
> par(mfcol = c(2, 1))
> plotSmear(d, de.tags = detags500.com, main = "Using common dispersion")
> abline(h = c(-2, 2), col = "dodgerblue", lwd = 2)
> plotSmear(d, de.tags = detags500.tgw, main = "Using tagwise dispersions")
> abline(h = c(-2, 2), col = "dodgerblue", lwd = 2)
\end{Sinput}
\end{Schunk}

In Figure~\ref{fig:longSAGE_FC2}, the top $500$ most differentially expressed tags (those identified as significant by \edgeR~using the common dispersion (top) and tagwise dispersions (bottom)) are highlighted in red. Looking at Figure~\ref{fig:longSAGE_FC2}, we see that, generally speaking, the pattern of differential expression looks similar using the two different methods, and the tags identified as DE have convincingly large fold changes.

\begin{figure}[ht]
\begin{center}
\includegraphics[height=0.9\textheight]{edgeR_case_study_longSAGE-030.png}
\caption{Plots of the log-fold change against the log-concentration for each tag, using the common dispersion (upper), and tagwise dispersions (lower). Tags with positive fold-change here are up-regulated in wild-type compared with transgenic mice. The $500$ most differentially expressed tags according to each method are highlighted in red on both plots.}
\label{fig:longSAGE_FC2}
\end{center}
\end{figure}


\subsection{Setup}
This analysis of \citet{THoen:2008p9}'s tag-based DGE data was conducted on:
\begin{Schunk}
\begin{Sinput}
> sessionInfo()
\end{Sinput}
\begin{Soutput}
R version 2.9.1 (2009-06-26) 
i386-apple-darwin8.11.1 

locale:
en_AU.UTF-8/en_AU.UTF-8/C/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_1.3.6

loaded via a namespace (and not attached):
[1] limma_2.18.2
\end{Soutput}
\end{Schunk}
and took 5--10 minutes to carry out on an Apple MacBook with a 2.4 Ghz Intel Core 2 Duo processor and 4 Gb of 1067 MHz DDR3 memory.


\section{Case Study: RNA-seq data}
\subsection{Introduction}
This section provides a detailed analysis of data from a study by \citet{Li08} designed to address a range of practical issues in RNA-seq experiments:
\begin{enumerate}
\item How many annotated genes are detected in a single cell type?

\item What is the number of tags that is necessary for the analysis of differentially regulated genes under different experimental conditions?

\item To what extent can different mRNA isoforms be detected?

\item How can one quantify alternative splicing by using a single or combination of existing technologies?
\end{enumerate}
\citet{Li08} attempt to address all of these issues on an androgen-sensitive prostate cancer cell model. We are interested primarily in the second question, and the challenge of identifying differentially regulated genes under different experimental conditions. We will demonstrate the use of the \edgeR~package for analyzing RNA-seq data for differential gene expression.

\subsection{Source of the data}
\citet{Li08} sequenced $\textrm{poly(A)}^+$ RNA from mock-treated or androgen sensitive LNCaP cells (a cell line of human cells commonly used in the field of oncology) on the Illumina 1G Genome Analyzer. The researchers used a double-random priming approach that was capable of generating strand-specific information, although this is not of relevance to our analysis here. The raw RNA-seq data provided by Li et al. consists of 7 `lanes' of 35bp reads.~\footnote{The Illumina instrument requires samples to be placed in a `flow cell' which contains eight `lanes'---each lane has a sample of cDNA and generates a library of sequence counts for that sample.} Approximately 10 million sequence tags were generated from both control and hormone-treated cells (treated with DHT), and \citet{Li08}'s analysis suggests that this tag density is sufficient for quantitative analysis of gene expression. 

The 10 million sequenced tags arise from four libraries from control cells and three libraries for hormone-treated cells, giving a total of seven libraries to analyse. From \citet{Li08} and its companion paper \citep{Li:2006p382} it is unclear as to whether the treatments are independent or not. The following analysis shows how a quantitative analysis of gene expression, focusing on identifying differentially expressed genes, can be conducted for these seven libraries using \edgeR.

\subsection{Reading in the data and creating a \code{DGEList} object}
Our first task is to load the \edgeR~package and read the data into \R. In this case, the tag counts for the libraries are stored in a single table in a plain text file \code{pnas\_expression.txt}, in which the rows of the table represent tags and the columns represent the different libraries. 

To turn the raw RNA-seq data into a table of counts, reads were mapped to the NCBI36 build of the human genome using \code{bowtie}, allowing up to two mismatches. Reads which did not map uniquely were discarded. The number of mapped reads that overlapped ENSEMBL gene annotations (version 53) was then counted. In counting reads associated with genes, reads which mapped to non-coding gene regions, such as introns, were included in the count. 

Unlike in the other datasets we have look at, counts here are aggregated at the gene, not at the tag, level.

The \code{files} object provides the name of the data file, and makes a convenient argument to the function \code{read.delim} which reads the table of counts into our \R~session. 

\begin{Schunk}
\begin{Sinput}
> setwd("/Users/davismcc/Documents/Honours/Data/LiData")
> library(edgeR)
> raw.data <- read.delim("pnas_expression.txt")
> names(raw.data)
\end{Sinput}
\begin{Soutput}
[1] "ensembl_ID" "lane1"      "lane2"      "lane3"      "lane4"     
[6] "lane5"      "lane6"      "lane8"      "len"       
\end{Soutput}
\end{Schunk}

The raw data is stored in a table with columns representing the gene names, the counts for the seven libraries and a column giving the length of each gene. The gene length is not currently used by \edgeR, but this information could be used in future versions of the package. In the code below, we assign the counts matrix to an object \code{d} with the appropriate rownames, define the groups to which the samples belong, and then pass these arguments to \code{DGEList}, which calculates the library sizes and constructs a \code{DGEList} containing all of the data we require for the analysis.

\begin{Schunk}
\begin{Sinput}
> d <- raw.data[, 2:8]
> rownames(d) <- raw.data[, 1]
> group <- c(rep("Control", 4), rep("DHT", 3))
> d <- DGEList(counts = d, group = group)
> d
\end{Sinput}
\begin{Soutput}
An object of class "DGEList"
$samples
        group lib.size
lane1 Control   978576
lane2 Control  1156844
lane3 Control  1442169
lane4 Control  1485604
lane5     DHT  1823460
lane6     DHT  1834335
lane8     DHT   681743

$counts
                lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000124208   478   619   628   744   483   716   240
ENSG00000182463    27    20    27    26    48    55    24
ENSG00000124201   180   218   293   275   373   301    88
ENSG00000124205     0     0     5     5     0     0     0
ENSG00000124207    76    80    85    97    80    81    37
21872 more rows ...
\end{Soutput}
\end{Schunk}

This \code{DGEList} is now ready to be passed to the functions that do the calculations to determine differential expression levels for the genes. 

\subsection{Producing an MDS plot}
Before proceeding with the computations for differential expression,
it is possible to produce a plot showing the sample relations based on
multidimensional scaling, as demonstrated for the Tag-seq data
above. We can produce an MDS plot for the Li Data using the command below.

\begin{Schunk}
\begin{Sinput}
> plotMDS.dge(d.Li, main="MDS Plot for Li Data",xlim=c(-0.4,0.8))
\end{Sinput}
\end{Schunk}


The resulting plot for the Li data is shown in~\ref{fig:Li_MDS}. In
this plot, Lane 8 appears to be separated from the other samples in
Dimension 1, but the distances between the samples are relatively
small, so we need not treat this sample as an outlier and remove it
from the analysis. Having now investigated some of the relationships
between the samples we can proceed to the DE analysis of the data.

\begin{figure}[ht]
\begin{center}
\includegraphics[width=0.9\textwidth]{edgeR_case_study_Li_MDSplot.pdf}
\caption{Multidimensional scaling (MDS) plot for the Li data, showin
  the relations between the samples in two dimensions. From this plot,
  Lane 8 may be identified easily as an outlier, but we note that the
  distance between this samples and the others in Dimension 1 is
  relatively small, so we do not remove this Lane 8 from the analysis.}
\label{fig:Li_MDS}
\end{center}
\end{figure}


\subsection{Analysis using common dispersion}
\subsubsection{Estimating the common dispersion}
As discussed for the SAGE data, the first major step in the analysis of DGE data using the NB model is to estimate the dispersion parameter for each tag. Like in the earlier case study, we begin by estimating the common dispersion using the function \code{estimateCommonDisp}, and analysing the data using the common dispersion.

\begin{Schunk}
\begin{Sinput}
> d <- estimateCommonDisp(d)
> names(d)
\end{Sinput}
\begin{Soutput}
[1] "samples"           "common.dispersion" "counts"           
[4] "pseudo.alt"        "conc"              "common.lib.size"  
\end{Soutput}
\begin{Sinput}
> d
\end{Sinput}
\begin{Soutput}
An object of class "DGEList"
$samples
        group lib.size
lane1 Control   978576
lane2 Control  1156844
lane3 Control  1442169
lane4 Control  1485604
lane5     DHT  1823460
lane6     DHT  1834335
lane8     DHT   681743

$common.dispersion
[1] 0.02021596

$counts
                lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000124208   478   619   628   744   483   716   240
ENSG00000182463    27    20    27    26    48    55    24
ENSG00000124201   180   218   293   275   373   301    88
ENSG00000124205     0     0     5     5     0     0     0
ENSG00000124207    76    80    85    97    80    81    37
21872 more rows ...

$pseudo.alt
                      lane1        lane2      lane3      lane4     lane5
ENSG00000124208 623.7011345 682.94207324 555.660542 639.479771 336.71526
ENSG00000182463  34.3356362  22.18976707  23.832567  22.180621  33.16394
ENSG00000124201 235.0851553 240.62900197 259.559061 236.246548 262.61397
ENSG00000124205   0.1502441   0.05777303   4.539068   4.427231   0.00000
ENSG00000124207  98.4667235  88.25763482  75.005547  83.279817  55.73657
                    lane6        lane8
ENSG00000124208 499.57554 4.485689e+02
ENSG00000182463  38.29654 4.367427e+01
ENSG00000124201 209.30644 1.685512e+02
ENSG00000124205   0.00000 1.509887e-10
ENSG00000124207  56.12481 6.774033e+01
21872 more rows ...

$conc
$conc.common
ENSG00000124208 ENSG00000182463 ENSG00000124201 ENSG00000124205 
   4.236152e-04    2.419602e-05    1.809021e-04    1.063407e-06 
ENSG00000124207 
   5.831078e-05 
21872 more elements ...

$conc.group
                     Control          DHT
ENSG00000124208 4.897757e-04 3.350953e-04
ENSG00000182463 1.990269e-05 2.966863e-05
ENSG00000124201 1.902894e-04 1.684999e-04
ENSG00000124205 1.963536e-06 8.783496e-16
ENSG00000124207 6.735245e-05 4.640575e-05
21872 more rows ...


$common.lib.size
[1] 1276768
\end{Soutput}
\end{Schunk}

The output of \code{estimateCommonDisp} is a DGEList object with several new elements. The element \code{common.dispersion}, as the name suggests, provides the estimate of the common dispersion. The pseudocounts calculated under the alternative hypothesis are given by  \code{pseudo.alt}. The element \code{conc} gives the estimates of the overall concentration of each tag across all of the original samples (\code{conc\$conc.common}) and the estimate of the concentration of each tag within each group (\code{conc\$conc.group}). The element \code{common.lib.size} gives the library size to which the original libraries have been adjusted in the pseudocounts. 

We see in the output below that the total counts in each library of the pseudocounts agrees well with the common library size, as desired. 

\begin{Schunk}
\begin{Sinput}
> d$samples$lib.size
\end{Sinput}
\begin{Soutput}
[1]  978576 1156844 1442169 1485604 1823460 1834335  681743
\end{Soutput}
\begin{Sinput}
> d$common.lib.size
\end{Sinput}
\begin{Soutput}
[1] 1276768
\end{Soutput}
\begin{Sinput}
> colSums(d$pseudo.alt)
\end{Sinput}
\begin{Soutput}
  lane1   lane2   lane3   lane4   lane5   lane6   lane8 
1277021 1276791 1276935 1277035 1277209 1277069 1277919 
\end{Soutput}
\end{Schunk}

Here the coefficient of variation of biological variation (square root of the common dispersion) is found to be $0.142$. We also note that although a common dispersion estimate of $0.02$ may seem `small', if a tag has just $200$ counts, then the estimate of the tag's variance is 5 times greater than it would be under the Poisson model.

\begin{Schunk}
\begin{Sinput}
> d$common.dispersion
\end{Sinput}
\begin{Soutput}
[1] 0.02021596
\end{Soutput}
\begin{Sinput}
> sqrt(d$common.dispersion)
\end{Sinput}
\begin{Soutput}
[1] 0.1421828
\end{Soutput}
\end{Schunk}

\subsubsection{Testing}
Once we have an estimate of the common dispersion, we can proceed with testing procedures for determining differential expression. As for the SAGE data, there are only two groups here, so the pair need not be specified in the call to \code{exactTest}.

\begin{Schunk}
\begin{Sinput}
> de.com <- exactTest(d)
\end{Sinput}
\begin{Soutput}
Comparison of groups:  DHT - Control 
\end{Soutput}
\begin{Sinput}
> names(de.com)
\end{Sinput}
\begin{Soutput}
[1] "table"      "comparison"
\end{Soutput}
\end{Schunk}

The results of the NB exact test can be accessed conveniently using the \code{topTags} function applied to the object produced by \code{exactTest}. The table below shows the top $10$ DE genes ranked by $p$-value.

The table in the output from \code{topTags} shows that the \edgeR~package identifies a great deal of differential expression, and gives the top genes extremely small $p$-values, even after adjusting for multiple testing. Furthermore, all of the top genes have a very large fold change (indicating that these tags are likely to be biologically meaningful), and all are up-regulated in the DHT-treatment group compared to the control group. 

Of course, for many applications the ranking for differential expression is more important than the $p$-value, and \code{topTags} provides such a ranking. As suggested in the SAGE case study, a Gene Ontology analysis could be carried out using the list of top gene and $p$-values provided by \code{topTags} in order to obtain more systematic and functional information about the differentially expressed genes. 

\begin{Schunk}
\begin{Sinput}
> topTags(de.com)
\end{Sinput}
\begin{Soutput}
Comparison of groups:  DHT - Control 
                  logConc    logFC        PValue           FDR
ENSG00000151503 -11.94799 5.705233 7.744047e-185 1.694165e-180
ENSG00000096060 -11.33288 4.893134 5.349073e-155 5.851083e-151
ENSG00000127954 -15.63280 8.118692 7.331162e-148 5.346128e-144
ENSG00000166451 -12.28742 4.570439 7.122773e-128 3.895623e-124
ENSG00000131016 -14.42856 5.190737 1.701099e-104 7.442990e-101
ENSG00000113594 -12.83343 4.000650  2.579172e-96  9.404091e-93
ENSG00000116285 -13.56732 4.087861  8.108902e-88  2.534264e-84
ENSG00000123983 -12.09539 3.544802  1.444749e-86  3.950846e-83
ENSG00000166086 -15.24551 5.390848  6.219612e-86  1.511849e-82
ENSG00000162772 -10.81704 3.201950  7.560702e-80  1.654055e-76
\end{Soutput}
\end{Schunk}

The table below shows the raw counts for the genes that \edgeR~has identified as the most differentially expressed. For these genes there seems to be very large differences between the groups, suggesting that the DE genes identified are truly differentially expressed.

\begin{Schunk}
\begin{Sinput}
> detags.com <- rownames(topTags(de.com)$table)
> d$counts[detags.com, ]
\end{Sinput}
\begin{Soutput}
                lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000151503    35    35    49    59  3307  3439  1224
ENSG00000096060    65    79   105   113  3975  3727  1451
ENSG00000127954     0     0     3     3   607   602   220
ENSG00000166451    41    52    57    57  1750  1654   728
ENSG00000131016     9     5    18     6   564   377   213
ENSG00000113594    37    36    57    43   936   959   418
ENSG00000116285    18    28    23    32   645   630   218
ENSG00000123983    62    76    94   108  1354  1258   628
ENSG00000166086     9     2     3     6   296   298   121
ENSG00000162772   172   204   250   304  2972  3269  1112
\end{Soutput}
\end{Schunk}

If we order the genes by fold change instead of $p$-value, we see that the genes with the largest fold changes have very small concentrations. This ranking is dominated by genes that have zero total counts in one group and is less informative than ranking by $p$-value. 

\begin{Schunk}
\begin{Sinput}
> topTags(de.com, n = 10, sort.by = "logFC")
\end{Sinput}
\begin{Soutput}
Comparison of groups:  DHT - Control 
                  logConc     logFC       PValue          FDR
ENSG00000091972 -31.75811 -36.51589 8.689939e-56 6.789635e-53
ENSG00000164120 -32.27738  35.47735 4.636882e-44 2.205241e-41
ENSG00000100373 -32.93100 -34.17010 2.739292e-17 2.147939e-15
ENSG00000118513 -33.00607 -34.01998 8.893707e-16 5.895989e-14
ENSG00000081237 -33.15786 -33.71640 5.634630e-13 2.691459e-11
ENSG00000196660 -33.22302 -33.58606 3.870150e-12 1.647223e-10
ENSG00000117245 -33.23863 -33.55484 1.051358e-11 4.151725e-10
ENSG00000019549 -33.39510  33.24191 2.420329e-13 1.225684e-11
ENSG00000137404 -33.41447 -33.20316 1.041329e-08 2.556807e-07
ENSG00000059804 -33.43964  33.15284 2.174556e-12 9.630115e-11
\end{Soutput}
\end{Schunk}

We can see how many genes are identified as differentially expressed between the control group (untreated LNCaP cells) and the DHT-treated LNCaP cells, for a given threshold for the exact $p$-value or for the adjusted $p$-value. 

As the output below shows, \edgeR~detects a huge number of differentially expressed genes in this dataset. Almost $5000$ genes are given a $p$-value less than $0.01$.

\begin{Schunk}
\begin{Sinput}
> sum(de.com$table$p.value < 0.01)
\end{Sinput}
\begin{Soutput}
[1] 4760
\end{Soutput}
\end{Schunk}

The output below shows that over $4835$ genes are given an adjusted $p$-value of less than $0.05$. This means that if we set our control the FDR for differential expression at 5\%, then \edgeR~identifies $22$\% of all the genes in the dataset as differentially expressed. 

\begin{Schunk}
\begin{Sinput}
> sum(p.adjust(de.com$table$p.value, method = "BH") < 0.05)
\end{Sinput}
\begin{Soutput}
[1] 4835
\end{Soutput}
\begin{Sinput}
> mean(p.adjust(de.com$table$p.value, method = "BH") < 0.05) * 
+     100
\end{Sinput}
\begin{Soutput}
[1] 22.10084
\end{Soutput}
\end{Schunk}

Of the genes identified as DE above, $1911$ (40\% of the DE genes) are up-regulated in DHT-treated compared with control cells, and $2924$ (60\%) are up-regulated in the control cells compared with DHT-treated cells. It is interesting to note that although we detect far more genes as DE that are up-regulated in the control group, all of the top ten genes were up-regulated in the DHT-treated group.

\begin{Schunk}
\begin{Sinput}
> top.com <- topTags(de.com, n = 4835)
> sum(top.com$table$logFC > 0)
\end{Sinput}
\begin{Soutput}
[1] 1911
\end{Soutput}
\begin{Sinput}
> sum(top.com$table$logFC < 0)
\end{Sinput}
\begin{Soutput}
[1] 2924
\end{Soutput}
\end{Schunk}


\subsubsection{Visualising DGE results}
The code for producing the default fold-change plot, with the top $500$ most DE tags highlighted in red, is shown below, and the result of this code is shown in Figure~\ref{fig:Li_FC1}. In Figure~\ref{fig:Li_FC1}, we see that the $500$ tags identified as most differentially expressed have large fold changes---almost all of the $500$ tags in red fall outside the blue lines at $\log \textrm{FC} = -2$ and $\log \textrm{FC} = 2$. This means that most of these tags show at least a 4-fold change in expression level between the samples. This plot suggests strongly that the tags identified by \edgeR~as differentially expressed are truly differentially expressed, and, given the large changes in expression level, are likely to be biologically meaningful.

\begin{Schunk}
\begin{Sinput}
> detags500.com <- rownames(topTags(de.com, n = 500)$table)
\end{Sinput}
\end{Schunk}
\begin{Schunk}
\begin{Sinput}
> plotSmear(d, de.tags = detags500.com, main = "FC plot using common dispersion")
> abline(h = c(-2, 2), col = "dodgerblue", lwd = 2)
\end{Sinput}
\end{Schunk}

\begin{figure}[ht]
\begin{center}
\includegraphics[width=\textwidth]{edgeR_case_study_Li-017.png}
\caption{Plot of the log-fold change against the log-concentration for each tag. The $500$ most differentially expressed tags as identified by \edgeR~using the common dispersion are outlined in red.}
\label{fig:Li_FC1}
\end{center}
\end{figure}


\subsection{Analysis using moderated tagwise dispersions}
\subsubsection{Moderating the tagwise dispersion}
As discussed in the previous case studies, an extension to simply using the common dispersion for each tag is to estimate the dispersion separately for each tag, while `squeezing' these estimates towards the common dispersion estimate. The goal of this moderation of the dispersion estimates is to improve inference by sharing information between tags. This type of analysis can be carried out in few steps using the \edgeR~package.

To run the moderated analysis, we need to determine how much
moderation is necessary. As discussed above, we currently prefer to
choose a sensible value for the smoothing parameter \emph{a priori},
although we do have an algorithm developed by \citet{Robinson07} for estimating the smoothing
parameter as an approximate eBayes rule.

As we only have seven libraries, a small sample size, we should not be
too confident about the accuracy of the tagwise dispersions. Therefore
it is recommended to use a larger value for \code{prior.n}, which
can be selected \emph{a priori}, instead of being estimated. In an
experiment such as this, the seven samples mean that we have five
degrees of freedom for estimating the dispersion parameter. Thus,
setting the \code{prior.n} to be ten (as we have done previously)
should be appropriate. This means that the common likelihood receives
the weight of ten individual tags. Therefore, there will be a
reasonable degree of `squeezing' towards the common dispersion
estimate, but still enough scope to allow flexibility when estimatig
the individual dispersion for each gene.

The function \code{estimateTagwiseDisp} produces a DGEList object that contains all of the elements present in the object produced by \code{estimateCommonDisp}, as well as the value for \code{prior.n} used (\code{d\$prior.n}) and the tagwise dispersion estimates (\code{d\$tagwise.dispersion}), as we see below. Here we set \code{grid.length=500} for greater precision in the tagwise dispersion estimates.

\begin{Schunk}
\begin{Sinput}
> d <- estimateTagwiseDisp(d, prior.n = 10, grid.length = 500)
\end{Sinput}
\begin{Soutput}
Using grid search to estimate tagwise dispersion. 
\end{Soutput}
\begin{Sinput}
> names(d)
\end{Sinput}
\begin{Soutput}
[1] "samples"            "common.dispersion"  "prior.n"           
[4] "tagwise.dispersion" "counts"             "pseudo.alt"        
[7] "conc"               "common.lib.size"   
\end{Soutput}
\begin{Sinput}
> d$prior.n
\end{Sinput}
\begin{Soutput}
[1] 10
\end{Soutput}
\begin{Sinput}
> head(d$tagwise.dispersion)
\end{Sinput}
\begin{Soutput}
[1] 0.01936799 0.01729400 0.01936799 0.02145046 0.01522843 0.02774923
\end{Soutput}
\end{Schunk}

It is interesting to consider the distribution of the tagwise dispersion estimates. As we can see from the output below, the tagwise dispersion estimates range from a minimum of $0.005$ to a maximum of $0.236$, and the common dispersion estimate lies in between the median and mean values for the tagwise dispersion estimates.

\begin{Schunk}
\begin{Sinput}
> summary(d$tagwise.dispersion)
\end{Sinput}
\begin{Soutput}
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.005025 0.019370 0.019370 0.020640 0.021450 0.236100 
\end{Soutput}
\begin{Sinput}
> d$common.dispersion
\end{Sinput}
\begin{Soutput}
[1] 0.02021596
\end{Soutput}
\end{Schunk}

\subsubsection{Estimating the smoothing parameter as an approximate eBayes rule}
As discussed in the SAGE and Tag-seq case studies, we can use an approximate
empirical Bayes rule that involves calculating a weight parameter
$\textrm{prior.n}$. However, for many applications (especially if the
estimated smoothing parameter is large), using the common dispersion
for all tags will give excellent results. Also, if the estimated
smoothing parameter is very small, then we prefer to select a larger
smoothing parameter to squeeze the (in general unreliable) tagwise
dispersion estimates towards the common value and thus stabilize these
estimates. We currently recommend that the \code{prior.n} value
given by \code{estimateSmoothing} be used as a guide, but not
routinely in analysis.

The function \code{estimateSmoothing} estimates the smoothing
parameter \code{prior.n} and takes the \code{DGEList} object
produced by \code{estimateCommonDisp} as its argument. As we see
below, for this dataset the estimate of the smoothing parameter is
very small. If we were to use this tiny value for the weight parameter
we would moderate the tagwise dispersion estimates so little that it
would be equivalent to using just using the tagwise dispersion for
each gene without any `squeezing' at all towards the common
dispersion. As the unmoderated tagwise dispersion estimates will be
unreliable in experiments with small sample sizes, we recommend
selecting a larger value for \code{prior.n}, as outlined above.

\begin{Schunk}
\begin{Sinput}
> prior.n <- estimateSmoothing(d)
> prior.n
\end{Sinput}
\begin{Soutput}
[1] 0.002132353
\end{Soutput}
\end{Schunk}


\subsubsection{Testing}
Once we have an estimate of the common dispersion and/or estimates of
the tagwise dispersions, we can proceed with testing procedures for
determining differential expression using \code{exactTest}. Here we
carry out the testing using the tagwise dispersion estimates
calculated using a \code{prior.n} value of ten.

By default, \code{exactTest} uses the common dispersion, but by adding the argument \code{common.disp=FALSE}, tagwise dispersion estimates will be used instead.

\begin{Schunk}
\begin{Sinput}
> de.tgw <- exactTest(d, common.disp = FALSE)
\end{Sinput}
\begin{Soutput}
Comparison of groups:  DHT - Control 
\end{Soutput}
\end{Schunk}

The output below shows that when using tagwise dispersions, the \edgeR~package still identifies a huge amount of differential expression between the control group and the DHT-treated group. The top DE tags are given even smaller $p$-values than using the common dispersion---many, many orders of magnitude smaller. 

As with the analysis using the common dispersion, all of the top genes have large fold changes, indicating that these changes in expression are likely to be biologically meaningful. Again, all of the top genes are up-regulated in the DHT-treated group compared with the control group. We note that the ranking of the tags is similar, with seven of the top ten genes using the common dispersion to be found among the top ten genes using tagwise dispersions.

\begin{Schunk}
\begin{Sinput}
> topTags(de.tgw)
\end{Sinput}
\begin{Soutput}
Comparison of groups:  DHT - Control 
                   logConc    logFC        PValue           FDR
ENSG00000151503 -11.947221 5.704024 2.214558e-301 4.844788e-297
ENSG00000096060 -11.332026 4.891305 1.545034e-259 1.690035e-255
ENSG00000166451 -12.288475 4.570589 8.545084e-180 6.231360e-176
ENSG00000127954 -15.632016 8.117608 5.274185e-171 2.884584e-167
ENSG00000162772 -10.816415 3.201287 6.821241e-136 2.984566e-132
ENSG00000113594 -12.834292 3.999954 8.573960e-119 3.126209e-115
ENSG00000116133 -11.741194 3.128433 2.367018e-118 7.397607e-115
ENSG00000116285 -13.566933 4.089714 4.303139e-116 1.176747e-112
ENSG00000115648  -8.831597 2.481756 1.967977e-114 4.783715e-111
ENSG00000130066 -10.322217 2.494893 6.742514e-108 1.475060e-104
\end{Soutput}
\end{Schunk}

Of course, we can also rank the top tags using the fold change instead of the $p$-value, as described above, this ranking is dominated by genes that have zero total counts in one group and is less informative than ranking by $p$-value.

\begin{Schunk}
\begin{Sinput}
> topTags(de.tgw, n = 10, sort.by = "logFC")
\end{Sinput}
\begin{Soutput}
Comparison of groups:  DHT - Control 
                  logConc     logFC       PValue          FDR
ENSG00000091972 -31.75810 -36.51592 2.059993e-63 1.502216e-60
ENSG00000164120 -32.27722  35.47767 1.732888e-44 5.123027e-42
ENSG00000100373 -32.93097 -34.17017 1.170463e-17 7.759461e-16
ENSG00000118513 -33.00748 -34.01715 6.961165e-15 3.696345e-13
ENSG00000081237 -33.15838 -33.71535 1.283161e-12 5.113246e-11
ENSG00000196660 -33.22317 -33.58577 3.427356e-12 1.288321e-10
ENSG00000117245 -33.23845 -33.55521 1.635293e-11 5.651708e-10
ENSG00000019549 -33.39438  33.24334 2.743940e-13 1.175645e-11
ENSG00000059804 -33.43993  33.15225 2.018339e-12 7.801273e-11
ENSG00000137404 -33.44594 -33.14022 5.612659e-07 9.183854e-06
\end{Soutput}
\end{Schunk}

The tables below shows the quantile-adjusted counts (i.e. counts for equalised library sizes) for the genes that \edgeR~has identified as the most differentially expressed, using the common dispersion and tagwise dispersions. For these tags, using both methods, there seem to be very large differences between the groups, suggesting that the DE genes identified are truly differentially expressed, and not false positives. 

We saw for \citet{THoen:2008p9}'s data how much more consistent the
counts \emph{within} groups are for the top tags identified using
tagwise dispersions compared with those identified using the common
dispersion. This effect is not nearly as pronounced here, as the
differences between groups for the top ten tags are so profound, but
we do note that there is a great deal of consistency in the counts
within groups for these top tags.

\begin{Schunk}
\begin{Sinput}
> detags.tgw <- rownames(topTags(de.tgw)$table)
> detags.com <- rownames(topTags(de.com)$table)
> round(d$pseudo.alt[detags.tgw, ])
\end{Sinput}
\begin{Soutput}
                lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000151503    46    39    43    51  2315  2394  2293
ENSG00000096060    85    87    93    97  2783  2594  2717
ENSG00000166451    53    57    50    49  1225  1151  1361
ENSG00000127954     0     0     3     3   425   419   412
ENSG00000162772   225   225   221   262  2081  2276  2083
ENSG00000113594    48    40    51    37   655   667   780
ENSG00000116133   127   118   137   123  1143  1121  1041
ENSG00000116285    24    31    20    28   452   439   409
ENSG00000115648  1226  1196  1166  1156  6813  6920  6130
ENSG00000130066   403   426   440   410  2239  2301  2571
\end{Soutput}
\begin{Sinput}
> round(d$pseudo.alt[detags.com, ])
\end{Sinput}
\begin{Soutput}
                lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000151503    46    39    43    51  2315  2394  2293
ENSG00000096060    85    87    93    97  2783  2594  2717
ENSG00000127954     0     0     3     3   425   419   412
ENSG00000166451    53    57    50    49  1225  1151  1361
ENSG00000131016    12     6    16     5   396   261   396
ENSG00000113594    48    40    51    37   655   667   780
ENSG00000116285    24    31    20    28   452   439   409
ENSG00000123983    81    84    83    93   948   875  1172
ENSG00000166086    11     2     3     5   207   207   226
ENSG00000162772   225   225   221   262  2081  2276  2083
\end{Soutput}
\end{Schunk}

We might also be interested in comparing the top-ranking genes as identified by \edgeR~using the common dispersion and tagwise dispersions. We see in the output below that of the top $1000$ most DE tags as identified using tagwise dispersions, $87$\% of these tags are also in the list of the $1000$ most DE tags as identified using the common dispersion. This shows that for this dataset there is a great deal of agreement between the common and tagwise dispersion approaches as to which tags are DE.

\begin{Schunk}
\begin{Sinput}
> sum(rownames(topTags(de.tgw, n = 1000)$table) %in% rownames(topTags(de.com, 
+     n = 1000)$table))/1000 * 100
\end{Sinput}
\begin{Soutput}
[1] 87.3
\end{Soutput}
\end{Schunk}

Using the common dispersion we found that $4835$ genes (22\% of the total number) are given an adjusted $p$-value of less than $0.05$. In the output below, we see that using tagwise dispersions we obtain slightly more DE genes, namely $4933$, or $23$\% of all of the genes in the dataset.

\begin{Schunk}
\begin{Sinput}
> sum(p.adjust(de.tgw$table$p.value, method = "BH") < 0.05)
\end{Sinput}
\begin{Soutput}
[1] 4933
\end{Soutput}
\begin{Sinput}
> mean(p.adjust(de.tgw$table$p.value, method = "BH") < 0.05) * 
+     100
\end{Sinput}
\begin{Soutput}
[1] 22.54880
\end{Soutput}
\end{Schunk}

Of the $4933$ tags identified as DE using tagwise dispersions, $1981$ ($40$\%) are up-regulated in DHT-treated cells and $2952$ ($60$\%) are up-regulated in the control cells. The proportions of up- and down-regulated genes identified using the two approaches to modeling the dispersion are equal.

\begin{Schunk}
\begin{Sinput}
> top.tgw <- topTags(de.tgw, n = 4933)
> sum(top.tgw$table$logFC > 0)
\end{Sinput}
\begin{Soutput}
[1] 1981
\end{Soutput}
\begin{Sinput}
> sum(top.tgw$table$logFC < 0)
\end{Sinput}
\begin{Soutput}
[1] 2952
\end{Soutput}
\end{Schunk}

\subsubsection{Visualising DGE results}
As discussed earlier, the function \code{plotSmear} can be used to generate a plot of the log-fold change against the log-concentration for each tag. We identify the top $500$ most DE tags using both common dispersion and tagwise dispersions so we can highlight them on the plots and compare what we see. The code for producing the fold-change plots (in the one frame for purposes of comparison) is shown below, and the result of this code is shown in Figure~\ref{fig:Li_FC2}. 

\begin{Schunk}
\begin{Sinput}
> detags500.com <- rownames(topTags(de.com, n = 500)$table)
> detags500.tgw <- rownames(topTags(de.tgw, n = 500)$table)
\end{Sinput}
\end{Schunk}
\begin{Schunk}
\begin{Sinput}
> par(mfcol = c(2, 1))
> plotSmear(d, de.tags = detags500.com, main = "Using common dispersion")
> abline(h = c(-2, 2), col = "dodgerblue", lwd = 2)
> plotSmear(d, de.tags = detags500.tgw, main = "Using tagwise dispersions")
> abline(h = c(-2, 2), col = "dodgerblue", lwd = 2)
\end{Sinput}
\end{Schunk}

In Figure~\ref{fig:Li_FC2}, the top $500$ most differentially expressed genes (those identified as significant by \edgeR~using the common dispersion (top) and tagwise dispersions (bottom)) are highlighted in red. Looking at Figure~\ref{fig:Li_FC2}, we see that, generally speaking, the pattern of differential expression looks similar using the two different methods, and the genes identified as DE have convincingly large fold changes.

\begin{figure}[ht]
\begin{center}
\includegraphics[height=0.9\textheight]{edgeR_case_study_Li-032.png}
\caption{Plots of the log-fold change against the log-concentration for each tag, using the common dispersion (top), and tagwise dispersions (bottom). Tags with positive fold-change here are up-regulated in DHT-treated cells compared with control cells. The $500$ most differentially expressed tags according to each method are highlighted in red on both plots.}
\label{fig:Li_FC2}
\end{center}
\end{figure}


\subsection{Setup}
This analysis of \citet{Li08}'s RNA-seq data was conducted on:
\begin{Schunk}
\begin{Sinput}
> sessionInfo()
\end{Sinput}
\begin{Soutput}
R version 2.9.1 (2009-06-26) 
i386-apple-darwin8.11.1 

locale:
en_AU.UTF-8/en_AU.UTF-8/C/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_1.3.6

loaded via a namespace (and not attached):
[1] limma_2.18.2
\end{Soutput}
\end{Schunk}
and took 2--4 minutes to carry out on an Apple MacBook with a 2.4 Ghz Intel Core 2 Duo processor and 4 Gb of 1067 MHz DDR3 memory.


\section{Poisson example}

It has been noted that, in some deep sequencing approaches, not a great deal of overdispersion is observed.  Specifically, the means and variances appear to be very close to each other, suggesting the Poisson distribution is a good fit.  Methods within the \edgeR~package may still be useful, including the quantile adjustment (effectively a normalization) and the exact testing routines.

To illustrate this, we sample Poisson data and run \code{de4DGE}
with the \code{doPoisson} argument set to \code{TRUE}.  The data
is quantile-adjusted and before the exact test is invoked, the
dispersion parameter is set to 0. Currently, elements from the output
of \code{de4DGE} must be manually added to the \code{DGEList}
object, as \code{exactTest} operates only on \code{DGEList} objects, as illustrated in the case studies above.  

Nevertheless, an analysis using the Poisson distribution can be carried out as follows:

<<poisson,width=7,height=4>>=
library(edgeR)
set.seed(101)
n <- 10000
lib.sizes <- c(40000,50000,38000,40000)
p <- runif(n,min=.0001,.001)
mu <- outer(p,lib.sizes)
mu[1:5,3:4]<-mu[1:5,3:4]*8
y <- matrix(rpois(4*n,lambda=mu),nrow=n)
dP <- DGEList(counts=y,group=rep(1:2,each=2),lib.size=lib.sizes)
msP <- de4DGE(dP,doPoisson=TRUE)
dP$pseudo.alt <- msP$pseudo
dP$common.dispersion <- 1e-06 # Set dispersion close to zero to allow exact test for Poisson data
dP$conc <- msP$conc
dP$common.lib.size <- msP$M
@

And you can proceed as before:

<<poissonstats,fig=TRUE,height=6,width=8>>=
de.P <- exactTest(dP)
topTags(de.P)
plotSmear(dP,col=c( rep("blue",5), rep("black",n-5) ))
@

\section{Future improvements and extension}

Here, we list some improvements/extensions that are planned for the \edgeR~package:

\begin{enumerate}
\item As the packages for the processing of raw high-throughput sequencing data become more mature, \edgeR~may need to adapt and operate on different objects.  As shown above, \edgeR~operates on a simple object containing simple data summaries which will  presumably be readily available from pre-processing steps.
\item Significant speed improvements have been made since the earlier
  versions of \edgeR, but as the datasets become larger, some further optimizations may be necessary.  We are exploring various ways to do this.
\item The current quantile-based normalization assumes the library sizes are fixed.  Depending on the circumstances of the samples in question, it may be necessary to explore something like an {\em effective} library size.
\item We are exploring more general testing procedures.
\end{enumerate}

%<<setupfiles>>=
%getwd()
%dir("")
%@

\section{Setup}

This vignette was built on:

<<setup>>=
sessionInfo()
@


\bibliographystyle{plainnat}
\bibliography{edgeR_case_study_bib}

\end{document}