%\VignetteIndexEntry{ChimpHumanBrainData} %\VignetteDepends{ChimpHumanBrainData} %\VignetteKeywords{data} %\VignetteKeywords{ChimpHumanBrainData} %\VignettePackage{ChimpHumanBrainData} \documentclass{article} %%%%%%%%%%%%%%%%%%%%%%%% Standard Packages %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \usepackage{epsfig} \usepackage{graphicx} \usepackage{graphics} \usepackage{amssymb} \usepackage{amsmath} \usepackage{mathrsfs} \usepackage{fancyvrb} \usepackage{theorem} \usepackage{underscore} \usepackage{color} \usepackage{caption} \usepackage{comment} \usepackage{fancyhdr} \usepackage{lscape} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textsf{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\code}[1]{{\texttt{#1}}} \newcommand{\file}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \newcommand{\bioc}{\software{BioConductor}} %% Excercises and Questions \theoremstyle{break} \newtheorem{Ex}{Exercise} \theoremstyle{break} \newtheorem{Q}{Question} %% And solution or answer \newenvironment{solution}{\color{blue}}{\bigskip} \usepackage[margin=1in]{geometry} \usepackage{fancyhdr} \pagestyle{fancy} \rhead{} \renewcommand{\footrulewidth}{\headrulewidth} \SweaveOpts{eps=FALSE} \title{Differential Expression Analysis using LIMMA} \author{Naomi S. Altman\\ Department of Statistics\\ Penn State University\\ {\small naomi@stat.psu.edu}} \date{November, 2013} \thispagestyle{fancy} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \section{Introduction} In this lab we will do differential expression analysis of a complex experiment using Affymetrix@ Genechip arrays. \section{The Data} Khaitovich et al (2004) considered gene expression in 7 homologous regions of human and chimpanzee brains. There were 3 human brains and 3 chimpanzee brains available for the study. Each brain was dissected to obtain tissue samples for each of the 7 regions. This is called a split plot design. Each brain is a ``whole plot" yielding a tissue sample for each region. The brains are classified into species which is the whole plot factor. The dissected portions of the brain are called ``subplots" and region is the subplot factor. The interaction between species and region is also considered a subplot effect. The factors ``species'' and ``region'' are arranged in a balanced factorial design, because each combination of species and region was sampled with the same number of biological replicates. However, there is also a blocking factor ``brain'' with 6 levels representing the 6 individuals. The samples were hybridized to a variety of Affymetrix@ Genechips and are available as experiment E-AFMX-2 at http://www.ebi.ac.uk/aerep/dataselection?expid=352682122. We will use only 4 of the brain regions: prefrontal cortex, caudate nucleus, cerebellum and Broca's region and only one of the Genechips, HG_U95B with one hybridization per sample. The data have been compiled into a Bioconductor dataset called \texttt{ChimpHumanBrainData}. To start, we need to load the data into R. <>= options(width=95) @ <>= library('ChimpHumanBrainData') library(affy) celfileDir = system.file('extdata',package='ChimpHumanBrainData') celfileNames = list.celfiles(celfileDir) brainBatch=ReadAffy(filenames=celfileNames,celfile.path=celfileDir,compress=TRUE) @ The sample names for brainBatch are the cel file names, which are not informative. We will replace them with more informative names, and then extract the probewise raw expression values for quality assessment. The \texttt{paste} and \texttt{rep} command are very handy for creating names. The array names are coded ``a_xny'' where n is the replicate number, x is either ``c'' for chimpanzee or ``h'' for human, and the brain regions are a) prefrontal cortex, d) caudate nucleus e) cerebellum or f) Broca's region. First print the sampleNames to be sure the arrays are in the right order. Then replace the names with the more informative names. <>= sampleNames(brainBatch) sampleNames(brainBatch)=paste(rep(c("CH","HU"),each=12),rep(c(1:3,1:3),each=4), rep(c("Prefrontal","Caudate","Cerebellum","Broca"),6),sep="") @ We should at minimum check quality by doing some scatterplot matrices of the log2(expression) values. We could do a hexplom plot of each tissue. <>= brain.expr=exprs(brainBatch) library(hexbin) plot(hexplom(log2(brain.expr[,paste(rep(c("CH","HU"),each=3), c(1:3,1:3),rep("Prefrontal",6),sep="")]))) @ Notice that the most dense data are along the diagonal, and that the arrays of the same species are more correlated than the arrays from different species. \begin{Ex} Draw hexplom plots for the other 3 brain regions. Do any of the arrays appear to be different than the others? \end{Ex} We should set up the treatment names and blocks. This is readily done using \texttt{paste} and \texttt{rep}. The treatment names are the same as the sample names, but the replicate numbers are dropped. There is one block label for each brain. <<>>= trts=factor(paste(rep(c("CH","HU"),each=12), rep(c("Prefrontal","Caudate","Cerebellum","Broca"),6),sep="")) blocks=factor(rep(1:6,each=4)) @ Finally, we should normalize the expression values and combine into probeset summaries using a method such as RMA. <>= brain.rma=rma(brainBatch) @ We might also want to do some quality checks after normalization. \section{LIMMA analysis} We are now ready to perform analysis in LIMMA. The steps are: \begin{enumerate} \item Compute $S^2_p$ the pooled variance. To do this, we need to compute within region variance for each gene, and the correlation among regions from the same brain (averaged across all the genes). \item Create the coefficient matrix for the contrasts. \item Compute the estimated contrasts. \item Compute the moderated contrast t-test for each gene. \item Plot the histogram of p-values for each contrast for each gene. \item Create the list of significant genes based on the p-values, adjusted p-values or FDR estimates. \end{enumerate} \subsection{Compute $S^2_p$} There are 3 steps to computing the pooled variance. \begin{enumerate} \item Create a design matrix for the treatment effects. \item If there are blocks, compute the within block correlation for each gene. \item Fit the model for the treatment effects to obtain the pooled variance. \end{enumerate} A design matrix is a matrix whose columns give the coefficients of the linear model. There is one row for each sample. The simplest way to set up the matrix is with an incidence matrix that has value 1 if the column belongs to the treatment and 0 otherwise. Since we have already created the factor \texttt{trts} which connects the columns of the expression matrix to the treatment names, the design matrix is readily created. <<>>= library(limma) design.trt=model.matrix(~0+trts) @ By default, \texttt{model.matrix} includes a column of all 1's representing $\mu$ in the ANOVA model $Y_{ij}=\mu+\alpha_i+$error. In this case, the fitted values will be estimates of $\mu$ and the $\alpha$'s. We prefer to eliminate this column (``0+") because then the fitted values will be the treatment mean expression value for each gene in each treatment, which are quantities we usually want to compute. You might want to print \texttt{design.trt} to see what it looks like. If there are blocks or technical replicates the correlation of genes within the blocks need to be computed. This requires the design matrix and the blocking factor. In our case, the blocking factor is called \texttt{blocks}. The function \texttt{duplicateCorrelation} requires the package \texttt{statmod} which may not be automatically downloaded to your R directory with the Bioconductor routines. If R cannot find it, you will need to install it from a CRAN site. Note also that \texttt{duplicateCorrelation} is a time-consuming computation. Be patient. <<>>= library(statmod) corfit <- duplicateCorrelation(brain.rma, design.trt, block = blocks) @ The within-block correlation for each gene is stored on as hyperbolic arctan(correlation). So to obtain a histogram of the correlations, you need to use the \texttt{tanh} function: <>= hist(tanh(corfit$atanh.correlations)) @ Notice that the correlations are mainly positive and have a mode around 0.6. A consensus correlation is computed by discarding the most extreme outliers, averaging the remainder on the hyperbolic arctan scale, and then transforming back to a correlation. This is stored in component \texttt{consensus.correlation}. \texttt{LIMMA} assumes that the correlation induced by the blocks is the same for all genes and uses the consensus. We are now ready to compute the pooled sample variance for each gene. As a side effect, we also compute the sample mean expression of each gene in each treatment (remembering that after RMA normalization, the data are on the log2 scale). <<>>= fitTrtMean <- lmFit(brain.rma, design.trt, block = blocks, cor = corfit$consensus.correlation) @ The output \texttt{fitTrtMean} has several components, but only 2 of these are of interest. Component \texttt{coefficients} contains the mean expression for each gene in each treatment. Component \texttt{sigma} has the estimate of $S_p$. (Notice this the pooled SD, not the pooled variance.) \subsection{Create the coefficient matrix for the contrasts} We need to compute the coefficient matrix for any contrasts we want to do. We do not need to worry about the rank of this matrix, as we will obtain the pooled variances from \texttt{fitTrtMean}. We need to decide what contrasts are interesting to us. For this lab, we will look at 6 contrasts: \begin{enumerate} \item Average chimpanzee versus average human \item Chimpanzee versus human for each region \item The interaction between species and the comparison of cerebellum to Broca's region. \end{enumerate} Note that the treatment names are taken from the columns of the design matrix. To make more useful names for the final output, we will want to rename the columns of the contrast matrix. <<>>= colnames(design.trt) contrast.matrix=makeContrasts( (trtsCHBroca+trtsCHCaudate+trtsCHCerebellum+trtsCHPrefrontal)/4 -(trtsHUBroca+trtsHUCaudate+trtsHUCerebellum+trtsHUPrefrontal)/4, trtsCHBroca-trtsHUBroca, trtsCHCaudate-trtsHUCaudate, trtsCHCerebellum-trtsHUCerebellum, trtsCHPrefrontal-trtsHUPrefrontal, (trtsCHCerebellum-trtsHUCerebellum)-(trtsCHBroca-trtsHUBroca), levels=design.trt) colnames(contrast.matrix)= c("ChVsHu","Broca","Caudate","Cerebellum","Prefrontal","Interact") @ The resulting contrast coefficient matrix has one row for each treatment and one column for each contrast. \subsection{Compute the estimated contrasts.} We simply fit the contrast matrix to the previous fitted model: <<>>= fit.contrast=contrasts.fit(fitTrtMean,contrast.matrix) @ \subsection{Compute the moderated contrast t-test.} The \texttt{eBayes} command will compute the consensus pooled variance, and then use it to compute the empirical Bayes (moderated) pooled variance for each gene. This also adjusts the degrees of freedom for the contrast t-tests. The command also computes the t-tests and associated p-values. <<>>= efit.contrast=eBayes(fit.contrast) @ The interesting components of this output are the estimated contrasts, which are stored in the component \texttt{coefficient} and the contrast p-values, which are stored in component \texttt{p.value}. \subsection{Plot the p-values.} It is important to remember that when the null hypothesis is true for every comparison, the p-values should be uniformly distributed and we expect to have false detections. In the more usual situation that some of the genes differentially express, we will have both false detections and false nondetections. As simple way of visualizing what to expect is to plot a histogram of p-values for each contrast. Below we put all 6 histograms on a single plot and use the contrast names as labels. <>= par(mfrow=c(2,3)) for (i in 1:ncol(efit.contrast$p.value)) { hist(efit.contrast$p.value[,i],main=colnames(efit.contrast$p.value)[i]) } @ Notice that the overall species contrast has the most differentially expressing genes (small p-values). All of the comparisons have a large percentage of differentially expressing genes. We will want to use a multiple comparisons procedure that adapts to having a large number of non-null hypotheses such as the Benjamini and Yuketiel or Storey methods. All of the plots show sharp peaks of small p-values. This indicates that we have good detection power in this study. When the power is poor, the histogram drops slowly towards the flat area. \subsection{Compute the gene list} The most statistically significant genes for each contrast can be assembled into spreadsheets. There are several ways to do this. \texttt{LIMMA} provides 2 functions, \texttt{topTable} and \texttt{decideTests} to assemble gene lists. I prefer to compute FDR or q-value estimates or adjusted p-values for each gene and output the treatment means and estimated contrasts, p-values and FDR or q-values to a comma separated text file which I can import to a spreadsheet. To use \texttt{topTable}, select a contrast and one of the adjustment methods. Of those available, Benjamini and Yuketiel (2001) (``BY") is a good general purpose choice. You also need probeset ids, which can either be extracted from the original data or from the row names of the p-value matrix. To limit the output to the most statistically significant genes, set the input parameter \texttt{p.value} to the maximum adjusted p-value or estimated FDR that you want to consider and the input parameter \texttt{n} to the maximum number of genes you want on the list. If you want a complete list, set \texttt{p.value=1.0} and \texttt{n=X} where X is bigger than the total number of probesets on the array. For example to get the top 10 genes with p<$10^{-5}$ for the overall comparison and for the interaction contrast: <<>>= genes=geneNames(brainBatch) topTable(efit.contrast,coef=1,adjust.method="BY",n=10,p.value=1e-5,genelist=genes) topTable(efit.contrast,coef=6,adjust.method="BY",n=10,p.value=1e-5,genelist=genes) @ The columns of the table are the row number of the gene, the gene id, the estimated contrast, the expression mean over all arrays, contrast t-value, contrast p-value, contrast adjusted p-value or estimated FDR and the estimated log-odds probability ratio that the gene is differentially expressed. The \texttt{decideTests} function can be used to create indicator variables for significance of contrasts with a variety of options. As an alternative, \texttt{write.table} can be used to create a comma separated text file, using \texttt{cbind} to concatenate matrices. <<>>= write.table(file="fits.txt", cbind(genes,fitTrtMean$coefficients,efit.contrast$coefficients,efit.contrast$p.value), row.names=F, col.names=c("GeneID",colnames(fitTrtMean$coefficients),colnames(efit.contrast$p.value), paste("p",colnames(efit.contrast$coefficients))),sep=",") @ \begin{Ex} Append adjusted p-values to the table above using either \texttt{p.adjust} or \texttt{qvalue}. Note that to use \texttt{qvalue} with \texttt{apply} you need to write a wrapper function. For example <<>>= library(qvalue) q.values=apply(efit.contrast$p.value,2, function(x) qvalue(x)$qvalues) @ \end{Ex} \begin{thebibliography}{30} \bibitem{Benjamini} Benjamini, Y., and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. \textit{Annals of Statistics},\textbf{29}: 1165?-1188. \bibitem{Khaitovich} Khaitovich, P., Muetzel, B., She, X., Lachmann, M., Hellmann, I., Dietzsch, J., Steigele, S., Do, H. H., Weiss, G., Enard, W., Heissig, F., Arendt, T., Nieselt-Struwe, K., Eichler, E. E., P\"a\"abo, S. (2004) Regional patterns of gene expression in human and chimpanzee brains. \textit{Genome research}, \textbf{14} (8) :1462--73. \bibitem{Smyth} Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. \textit{Statistical Applications in Genetics and Molecular Biology}, \textbf{3}, Article 3. http://www.bepress.com/sagmb/vol3/iss1/art3. \bibitem{Storey} Storey JD. (2003) The positive false discovery rate: A Bayesian interpretation and the q-value. \textit{Annals of Statistics}, \textbf{31}: 2013--2035. \end{thebibliography} \section*{SessionInfo} <>= toLatex(sessionInfo()) @ <<>>= print(gc()) @ \end{document}