%\VignetteIndexEntry{Differential expression for RNA-seq data with dispersion shrinkage}
%\VignettePackage{DSS}                                                                        

\documentclass{article}

\usepackage{float}
\usepackage{Sweave}
\usepackage[a4paper]{geometry}
\usepackage{hyperref,graphicx}
\textwidth=6.5in
\textheight=9in
%\parskip=.3cm                                                                                     
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.5in
\footskip=0.6in
\renewcommand{\baselinestretch}{1.3}


\SweaveOpts{keep.source=TRUE,eps=FALSE,include=TRUE,width=4,height=4}
%\newcommand{\Robject}[1]{\texttt{#1}}                                                             
%\newcommand{\Rpackage}[1]{\textit{#1}}                                                            
%\newcommand{\Rclass}[1]{\textit{#1}}                                                              
%\newcommand{\Rfunction}[1]{{\small\texttt{#1}}}                                                   

\author{Hao Wu \\[1em]Department of Biostatistics and Bioinformatics\\ Emory University\\
  Atlanta, GA 303022 \\ [1em] \texttt{hao.wu@emory.edu}}

\title{\textsf{\textbf{Differential analyses with DSS}}}


\begin{document}
\maketitle
\tableofcontents

%% abstract
\begin{abstract}
This vignette introduces the use of the Bioconductor package 
DSS ({\underline D}ispersion {\underline S}hrinkage 
for {\underline S}equencing data), which is designed for 
differential analysis based on high-throughput sequencing data. 
It performs differential expression analyses for RNA-seq, 
and differential methylation analyses for bisulfite sequencing (BS-seq) data. 
The core of DSS is a new procedure to estimate and shrink 
gene- or CpG site-specific dispersions, then conduct Wald tests for
differential expression/methylation. Compared with existing methods, DSS provides excellent 
statistical and computational performance.
\end{abstract}


\section{Introduction}
Recent advances in high-throughput sequencing technology have
revolutionized genomic research. For example, 
RNA-seq is a new technology for measuring the abundance of RNA products.
Compared to gene expression microarrays, 
it provides a better dynamic range and lower signal-to-noise ratio.
Bisulfite sequencing (BS-seq) is a new technology for measuring DNA methylation. 
Compared to capture-based methods such as MeDIP-seq, 
it provides single-base resolution and eliminates biases associated with CpG density. 

Fundamental questions for RNA-seq or BS-seq data analyses
are whether gene expression regulation or DNA methylation dynamics
vary under different biological contexts. 
Identifying sites or regions exhibiting differential expression (DE) or differential methylation (DM)
are thus key tasks in functional genomics research.

RNA- or BS-seq experiments typically have a limited number of biological
replicates due to cost constraints. This can lead to unstable estimation of
within group variance, and subsequently undesirable results from hypothesis testing. 
Variance shrinkage methods have been widely used in DE analyses based on
microarray data. The methods are typically based on a Bayesian hierarchical model, 
with a prior imposed on the gene-specific variances to provide 
a basis for information sharing across all genes/CpG sites. 
In these models, shrinkage is achieved for variance estimation.
Using shrunk variance in hypothesis tests has been shown to provide better results. 

A distinct feature of RNA-seq or BS-seq data is that the measurements are
in the form of counts. These data are often assumed to be from the Poisson (for RNA-seq)
or Binomial (for BS-seq) distributions. 
Unlike continuous distributions such as the Gaussian distribution, 
the variances depend on means in these discrete distributions. 
This implies that the sample variances do not account for biological variation between replicates,
and shrinkage cannot be applied on variances directly. 

In contrast, we assume that our count data come from the Gamma-Poisson (RNA-seq) or Beta-Binomial (BS-seq) distribution.
These distributions can be parameterized by a mean and an over dispersion parameter.  
The over dispersion parameters, which represent the biological variation for
replicates within a treatment group, play a central role in the differential analyses.

Here we present a new DE/DM detection algorithm, where shrinkage is performed on the dispersion parameters. 
We first impose a log-normal prior on the dispersions, 
and then combine data from all genes/CpG sites to shrink dispersions
through a penalized likelihood  approach. 
Finally, we construct Wald tests to test each gene/site for differential expression/methylation. 
Our results show that the new method provides excellent performance
compared to existing methods, especially when the overall dispersion level is high
or the number of replicates is small. 

Currently DSS only supports comparison of expression or methylation from
two treatment groups. Methods for more advanced study designs
are under development and will be implemented soon.


\section{Using {\tt DSS} for differential expression analysis}

\subsection{Single factor experiment}
Required inputs for DSS are (1) gene expression values as a matrix of integers, 
rows are for genes and columns are for samples;
and (2) a vector representing experimental designs. The length of the
design vector must match the number of columns of input counts.
Optionally, normalization factors or additional annotation for genes 
can be supplied. 

The basic data container in the package is {\tt SeqCountSet} class, 
which is directly inherited from {\tt ExpressionSet} class 
defined in {\tt Biobase}. An object of the class contains all necessary
information for a DE analysis: gene expression values, experimental designs,
and additional annotations. 

A typical DE analysis contains the following simple steps. 
\begin{enumerate}
\item Create a {\tt SeqCountSet} object using {\tt newSeqCountSet}.
\item Estimate normalization factor using {\tt estNormFactors}. 
\item Estimate and shrink gene-wise dispersion using {\tt estDispersion}
\item Two-group comparison using {\tt waldTest}. 
\end{enumerate}

The usage of DSS is demonstrated in the simple simulation below. 
\begin{enumerate}
\item First load in the library, and make a {\tt SeqCountSet}
object from some counts for 2000 genes and 6 samples. 
<<echo=TRUE, result=TRUE>>=
library(DSS)
counts1=matrix(rnbinom(300, mu=10, size=10), ncol=3)
counts2=matrix(rnbinom(300, mu=50, size=10), ncol=3)
X1=cbind(counts1, counts2) ## these are 100 DE genes
X2=matrix(rnbinom(11400, mu=10, size=10), ncol=6)
X=rbind(X1,X2)
designs=c(0,0,0,1,1,1)
seqData=newSeqCountSet(X, designs)
seqData
@
\item Estimate normalization factor. 
<<echo=TRUE, result=TRUE>>=
seqData=estNormFactors(seqData)
@ 
\item Estimate and shrink gene-wise dispersions
<<>>=
seqData=estDispersion(seqData)
@ 
\item With the normalization factors and dispersions ready, the two-group comparison can be 
conducted via a Wald test:
<<>>=
result=waldTest(seqData, 0, 1)
head(result,5)
@
\end{enumerate}

\subsection{Multifactor experiment}
{\tt DSS} provides functionalities for dispersion shrinkage for multifactor experimental designs. 
Downstream model fitting (through genearlized linear model)
and hypothesis testing can be performed using other packages such as {\tt edgeR},
with the dispersions estimated from DSS. 

Below is an example, based a simple simulation, to illustrate the DE analysis of 
a crossed design.  
\begin{enumerate}
\item First simulate data for a 2x2 crossed experiments. Note the 
counts are randomly generated.
<<echo=TRUE, result=TRUE>>=
library(DSS)
library(edgeR)
counts=matrix(rpois(800, 10), ncol=8)
design=data.frame(gender=c(rep("M",4), rep("F",4)), strain=rep(c("WT", "Mutant"),4))
X=model.matrix(~gender+strain, data=design)
@ 

\item make SeqCountSet, then estimate size factors and  dispersion 
<<echo=TRUE>>=
seqData=newSeqCountSet(counts, as.data.frame(X))
seqData=estNormFactors(seqData)
seqData=estDispersion(seqData)
@ 

\item Using edgeR's function to do glm model fitting, but plugging in the estimated size factors and  dispersion
from DSS. 
<<>>=
fit.edgeR <- glmFit(counts, X, lib.size=normalizationFactor(seqData),
             dispersion=dispersion(seqData))
@ 

\item Using edgeR's function to do hypothesis testing on the second parameter of the model (gender).
<<echo=TRUE, result=TRUE>>=
lrt.edgeR <- glmLRT(fit.edgeR, coef=2)
head(lrt.edgeR$table)
@ 

\end{enumerate}


%%% DML detection
\section{Using {\tt DSS} for differential methylation analysis}
For BS-seq experiments, after sequence alignment 
and proper processing, the BS-seq data can be
summarized with following information for each C position (mostly CpG
sites, with the occasional CH): chromosome number, genomic coordinate, total
number of reads covering the position, and number of reads showing
methylation at this position. For a sample, this information 
needs to be saved in a simple text file, with each row representing a CpG site. 

DML detection using {\tt DSS} starts from several such text files. 
A typical DML detection contains two simple steps.
Below we will use files distributed with the package 
to illustrate the usage of the package. 

\begin{enumerate}
\item Load in library. Read in text files and create {\tt BSseq} objects for two conditions.
This step requires {\tt bsseq} Bioconductor package. {\tt BSseq} class is defined in that package.

<<>>=
library(DSS)
require(bsseq)
path <- file.path(system.file(package="DSS"), "extdata")
dat1.1 <- read.table(file.path(path, "cond1_1.txt"), header=TRUE)
dat1.2 <- read.table(file.path(path, "cond1_2.txt"), header=TRUE)
dat2.1 <- read.table(file.path(path, "cond2_1.txt"), header=TRUE)
dat2.2 <- read.table(file.path(path, "cond2_2.txt"), header=TRUE)
BS1 <- makeBSseqData( list(dat1.1, dat1.2), paste("cond1",1:2,sep=".") )
BS2 <- makeBSseqData( list(dat2.1, dat2.2), paste("cond2",1:2,sep=".") )
BS1
@ 

\item Detect DML by calling {\tt callDML} function.
<<>>=
dmls <- callDML(BS1, BS2)
head(dmls)
@ 

\item Based on the DML results, detect DMR by calling {\tt callDMR} function. 
<<>>=
dmrs <- callDMR(dmls, p.threshold=0.001)
head(dmrs)
@ 


\end{enumerate}

For differentially methylated loci (DML) detection, 
a hypothesis test is conducted at each CpG site,
and the tests are performed independently. 
The differentially methylated region (DMR) detection
is based on the DML test results. Regions will CpG sites being
statistically significant are detected as DMRs. Nearby DMRs are
merged into longer ones. Some restrictions including the minimum
length, minimum number of CpG sites, etc. are applied. Note that
this function doesn't consider the spatial correlation among
nearby CpG sites, which is an important aspect in the whole genome 
BS-seq data. This will be in the future development plan.


\section{Session Info}
<<echo=TRUE, result=TRUE>>=
sessionInfo()
@ 
\end{document}