%\VignetteIndexEntry{Differential expression for RNA-seq data with dispersion shrinkage} %\VignettePackage{DSS} \documentclass{article} \usepackage{float} \usepackage{Sweave} \usepackage[a4paper]{geometry} \usepackage{hyperref,graphicx} \textwidth=6.5in \textheight=9in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.5in \footskip=0.6in \renewcommand{\baselinestretch}{1.3} \SweaveOpts{keep.source=TRUE,eps=FALSE,include=TRUE,width=4,height=4} %\newcommand{\Robject}[1]{\texttt{#1}} %\newcommand{\Rpackage}[1]{\textit{#1}} %\newcommand{\Rclass}[1]{\textit{#1}} %\newcommand{\Rfunction}[1]{{\small\texttt{#1}}} \author{Hao Wu \\[1em]Department of Biostatistics and Bioinformatics\\ Emory University\\ Atlanta, GA 303022 \\ [1em] \texttt{hao.wu@emory.edu}} \title{\textsf{\textbf{Differential analyses with DSS}}} \begin{document} \maketitle \tableofcontents %% abstract \begin{abstract} This vignette introduces the use of the Bioconductor package DSS ({\underline D}ispersion {\underline S}hrinkage for {\underline S}equencing data), which is designed for differential analysis based on high-throughput sequencing data. It performs differential expression analyses for RNA-seq, and differential methylation analyses for bisulfite sequencing (BS-seq) data. The core of DSS is a new procedure to estimate and shrink gene- or CpG site-specific dispersions, then conduct Wald tests for differential expression/methylation. Compared with existing methods, DSS provides excellent statistical and computational performance. \end{abstract} \section{Introduction} Recent advances in high-throughput sequencing technology have revolutionized genomic research. For example, RNA-seq is a new technology for measuring the abundance of RNA products. Compared to gene expression microarrays, it provides a better dynamic range and lower signal-to-noise ratio. Bisulfite sequencing (BS-seq) is a new technology for measuring DNA methylation. Compared to capture-based methods such as MeDIP-seq, it provides single-base resolution and eliminates biases associated with CpG density. Fundamental questions for RNA-seq or BS-seq data analyses are whether gene expression regulation or DNA methylation dynamics vary under different biological contexts. Identifying sites or regions exhibiting differential expression (DE) or differential methylation (DM) are thus key tasks in functional genomics research. RNA- or BS-seq experiments typically have a limited number of biological replicates due to cost constraints. This can lead to unstable estimation of within group variance, and subsequently undesirable results from hypothesis testing. Variance shrinkage methods have been widely used in DE analyses based on microarray data. The methods are typically based on a Bayesian hierarchical model, with a prior imposed on the gene-specific variances to provide a basis for information sharing across all genes/CpG sites. In these models, shrinkage is achieved for variance estimation. Using shrunk variance in hypothesis tests has been shown to provide better results. A distinct feature of RNA-seq or BS-seq data is that the measurements are in the form of counts. These data are often assumed to be from the Poisson (for RNA-seq) or Binomial (for BS-seq) distributions. Unlike continuous distributions such as the Gaussian distribution, the variances depend on means in these discrete distributions. This implies that the sample variances do not account for biological variation between replicates, and shrinkage cannot be applied on variances directly. In contrast, we assume that our count data come from the Gamma-Poisson (RNA-seq) or Beta-Binomial (BS-seq) distribution. These distributions can be parameterized by a mean and an over dispersion parameter. The over dispersion parameters, which represent the biological variation for replicates within a treatment group, play a central role in the differential analyses. Here we present a new DE/DM detection algorithm, where shrinkage is performed on the dispersion parameters. We first impose a log-normal prior on the dispersions, and then combine data from all genes/CpG sites to shrink dispersions through a penalized likelihood approach. Finally, we construct Wald tests to test each gene/site for differential expression/methylation. Our results show that the new method provides excellent performance compared to existing methods, especially when the overall dispersion level is high or the number of replicates is small. Currently DSS only supports comparison of expression or methylation from two treatment groups. Methods for more advanced study designs are under development and will be implemented soon. \section{Using {\tt DSS} for differential expression analysis} \subsection{Single factor experiment} Required inputs for DSS are (1) gene expression values as a matrix of integers, rows are for genes and columns are for samples; and (2) a vector representing experimental designs. The length of the design vector must match the number of columns of input counts. Optionally, normalization factors or additional annotation for genes can be supplied. The basic data container in the package is {\tt SeqCountSet} class, which is directly inherited from {\tt ExpressionSet} class defined in {\tt Biobase}. An object of the class contains all necessary information for a DE analysis: gene expression values, experimental designs, and additional annotations. A typical DE analysis contains the following simple steps. \begin{enumerate} \item Create a {\tt SeqCountSet} object using {\tt newSeqCountSet}. \item Estimate normalization factor using {\tt estNormFactors}. \item Estimate and shrink gene-wise dispersion using {\tt estDispersion} \item Two-group comparison using {\tt waldTest}. \end{enumerate} The usage of DSS is demonstrated in the simple simulation below. \begin{enumerate} \item First load in the library, and make a {\tt SeqCountSet} object from some counts for 2000 genes and 6 samples. <>= library(DSS) counts1=matrix(rnbinom(300, mu=10, size=10), ncol=3) counts2=matrix(rnbinom(300, mu=50, size=10), ncol=3) X1=cbind(counts1, counts2) ## these are 100 DE genes X2=matrix(rnbinom(11400, mu=10, size=10), ncol=6) X=rbind(X1,X2) designs=c(0,0,0,1,1,1) seqData=newSeqCountSet(X, designs) seqData @ \item Estimate normalization factor. <>= seqData=estNormFactors(seqData) @ \item Estimate and shrink gene-wise dispersions <<>>= seqData=estDispersion(seqData) @ \item With the normalization factors and dispersions ready, the two-group comparison can be conducted via a Wald test: <<>>= result=waldTest(seqData, 0, 1) head(result,5) @ \end{enumerate} \subsection{Multifactor experiment} {\tt DSS} provides functionalities for dispersion shrinkage for multifactor experimental designs. Downstream model fitting (through genearlized linear model) and hypothesis testing can be performed using other packages such as {\tt edgeR}, with the dispersions estimated from DSS. Below is an example, based a simple simulation, to illustrate the DE analysis of a crossed design. \begin{enumerate} \item First simulate data for a 2x2 crossed experiments. Note the counts are randomly generated. <>= library(DSS) library(edgeR) counts=matrix(rpois(800, 10), ncol=8) design=data.frame(gender=c(rep("M",4), rep("F",4)), strain=rep(c("WT", "Mutant"),4)) X=model.matrix(~gender+strain, data=design) @ \item make SeqCountSet, then estimate size factors and dispersion <>= seqData=newSeqCountSet(counts, as.data.frame(X)) seqData=estNormFactors(seqData) seqData=estDispersion(seqData) @ \item Using edgeR's function to do glm model fitting, but plugging in the estimated size factors and dispersion from DSS. <<>>= fit.edgeR <- glmFit(counts, X, lib.size=normalizationFactor(seqData), dispersion=dispersion(seqData)) @ \item Using edgeR's function to do hypothesis testing on the second parameter of the model (gender). <>= lrt.edgeR <- glmLRT(fit.edgeR, coef=2) head(lrt.edgeR$table) @ \end{enumerate} %%% DML detection \section{Using {\tt DSS} for differential methylation analysis} For BS-seq experiments, after sequence alignment and proper processing, the BS-seq data can be summarized with following information for each C position (mostly CpG sites, with the occasional CH): chromosome number, genomic coordinate, total number of reads covering the position, and number of reads showing methylation at this position. For a sample, this information needs to be saved in a simple text file, with each row representing a CpG site. DML detection using {\tt DSS} starts from several such text files. A typical DML detection contains two simple steps. Below we will use files distributed with the package to illustrate the usage of the package. \begin{enumerate} \item Load in library. Read in text files and create {\tt BSseq} objects for two conditions. This step requires {\tt bsseq} Bioconductor package. {\tt BSseq} class is defined in that package. <<>>= library(DSS) require(bsseq) path <- file.path(system.file(package="DSS"), "extdata") dat1.1 <- read.table(file.path(path, "cond1_1.txt"), header=TRUE) dat1.2 <- read.table(file.path(path, "cond1_2.txt"), header=TRUE) dat2.1 <- read.table(file.path(path, "cond2_1.txt"), header=TRUE) dat2.2 <- read.table(file.path(path, "cond2_2.txt"), header=TRUE) BS1 <- makeBSseqData( list(dat1.1, dat1.2), paste("cond1",1:2,sep=".") ) BS2 <- makeBSseqData( list(dat2.1, dat2.2), paste("cond2",1:2,sep=".") ) BS1 @ \item Detect DML by calling {\tt callDML} function. <<>>= dmls <- callDML(BS1, BS2) head(dmls) @ \item Based on the DML results, detect DMR by calling {\tt callDMR} function. <<>>= dmrs <- callDMR(dmls, p.threshold=0.001) head(dmrs) @ \end{enumerate} For differentially methylated loci (DML) detection, a hypothesis test is conducted at each CpG site, and the tests are performed independently. The differentially methylated region (DMR) detection is based on the DML test results. Regions will CpG sites being statistically significant are detected as DMRs. Nearby DMRs are merged into longer ones. Some restrictions including the minimum length, minimum number of CpG sites, etc. are applied. Note that this function doesn't consider the spatial correlation among nearby CpG sites, which is an important aspect in the whole genome BS-seq data. This will be in the future development plan. \section{Session Info} <>= sessionInfo() @ \end{document}