% \VignetteIndexEntry{Adaptive Gene Picking for Microarray Expression Data Analysis} % \VignetteDepends{pickgene} % \VignetteKeywords{mRNA,differential expression} %\VignettePackage{pickgene} \documentclass{article} \begin{document} \title{Adaptive Gene Picking for Microarray Expression Data Analysis} \author{Brian S. Yandell, Yi Lin, Hong Lan and Alan D. Attie\\ Univeristy of Wisconsin--Madison} \maketitle \section{Overview} The following is adapted from Lin et al. (2002). References can be found there or at {\tt http://www.stat.wisc.edu/$\sim$yandell/statgen}. Our gene array analysis algorithm uses rank order to normalize data for each experimental condition and estimates the variability at each level of gene expression to set varying significance thresholds for differential expression across levels of mRNA abundance. This procedure can be used to prefilter data in detecting patterns of differential gene expression, for instance using clustering methods. We propose assigning Bonferroni-corrected $p$-values, which requires only minimal assumptions. Although expression data may be acquired from a variety of technologies, we focus attention on the oligonucleotide arrays in Affymetrix chips used in a mouse experiment on diabetes and obesity. Our approach was motivated by a series of experiments on diabetes and obesity. Nadler et al. (2000) used Affymetrix MGU74AV2 chips with over 13,000 probes representing about 12,000 genes on mRNA from adipose tissue to examine the relationship between obesity and mouse genotype (B6, BTBR, or F1). Further experiments have grown out of this collaboration using replicates and will be reported elsewhere. The primary goal was to find patterns of differential gene expression in mouse tissue between strains. Thus, we have a two-factor experiment with possible replication for each chip mRNA. \subsection{Transformation to Approximate Normality} Raw microarray measurements are typically normalized to account for systematic bias and noise to attempt to restore expression levels from raw data (Lockhart et al. 1996). One important source of bias is background fluorescence. Other factors that require attention include variations in array, dye, thickness of sample, and measurement noise. Background fluorescence may be measured in several ways, depending on chip technology, and is typically removed by subtraction (see Lockhart et al. 1996; Li and Wong 2001; Schadt et al. 2001; Irizarry et al. 2002; Li and Wong 2002). Affymetrix chips handle background by comparing perfect match ($PM$) with mismatch ($MM$) intensity. We use weighted averages $PM$ and $MM$ across oligo probe pairs using recent `low-level' analysis (Li and Wong 2001; Schadt et al. 2001; Li and Wong 2002) to reduce measurement variability. More recently, Irizarry et al. (2003) and Zhang et al. (2003) have proposed normalization {\em without} using $MM$ measurements. Background-adjusted intensities are typically log-transformed to reduce the dynamic range and achieve normality. Various authors have noted that comparisons based on such log-transformed gene expression levels appear to be approximately normal (see Kerr and Churchill 2001). However, negative adjusted values can arise from low expression levels swamped by background noise. Some authors have proposed adding a small value before taking the log to recover some of these data (Kerr and Churchill 2001). Our alternative normalization method leverages this idea while providing comparisons that are more robust to difficulties with the lognormal assumption. For further discussion on normalization, see Dudoit and Yang (2002) and Colantuoni et al. (2002). Our procedure converts the background-adjusted expression values into normal-scores without discarding negative values. This normal-scores transformation has been employed for microarray data using a different approach (Efron et al. 2001). If expression data are really lognormal, then this normal-scores transformation is indistinguishable from a log transformation after rescaling. We have found that log-transformed data appear roughly normal in the middle of the distribution, while the normal scores are normal throughout. Our procedure depends on the existence of some unknown monotone transformation of the data to near multivariate normal. There is always such a transformation in one dimension: let $F$ be the cumulative distribution of adjusted values $\Delta$ and $\Phi$ be the cumulative normal distribution. Then $\Phi^{-1}(F(\Delta))$ transforms $\Delta$ to normal. If $F$ is lognormal, then $\Phi^{-1}(F(\Delta))=\log(\Delta)$, but we prefer not to make this assumption up-front. Instead, we approximate the transformation by $\Phi^{-1}(F_J(\Delta))$, where $F_J$ is the empirical distribution of the $J$ adjusted values $\Delta_1, \cdots, \Delta_J$. The difference between this approximate transformation and the ideal one is small (on the order of $1/\sqrt{J}$). This is known as the normal-scores transformation, and is readily computed as $$ x = \Phi^{-1}(F_J(\Delta)) = \mbox{qnorm}( \mbox{rank}(\Delta) / (J+1) ) $$ \noindent where rank($\Delta$) is the rank order of adjusted gene measurements $\Delta = PM - MM$ among all $J$ genes under the same condition. The normal quantiles, qnorm(), transform the ranks to be essentially a sample from standard normal: a histogram of these $x$ is bell-shaped and centered about zero, with normal scores equally spaced in terms of probability mass. Thus, these normal scores are close to a transformation that would make the data appear normal (Efron et al. 2001). If done separately by condition, this normalization automatically standardizes the center to 0 and the scale (standard deviation) to 1. Alternatively, if the experimental conditions are viewed as a random sample of a broader set of possible conditions, data across all conditions could be transformed together by normal scores. Normal scores are unaffected by monotone transformations of adjusted intensities or by global factors such as array, dye, and thickness of chip sample. Ranks may be disturbed by local noise, but that effect is unavoidable in any analysis of such an experiment. \subsection{Differential Expression Across Conditions} Differential expression across conditions of interest can be computed by comparing their transformed expression levels. Information on comparison of two conditions, 1 and 2, is summarized in pairs of normal scores, $x_1$ and $x_2$, across the genes; plotting $x_1$ against $x_2$ yields points dispersing from the diagonal. However, differential gene expression between experimental conditions may depend on the average level of gene expression, with genes of different average expression having intrinsically different variability. Thus, we recommend plotting the average intensity $a = (x_1 + x_2) / 2$ against the difference $d = x_1 - x_2$, which involves just a 45 degree rotation (Roberts et al. 2000; Dudoit and Yang 2002; Irizarry et al. 2002; Lee and O'Connell 2002; Colantuoni et al. 2002; Wu et al. 2002). Since our normal scores may be considered a forgiving approximation to the log transform, we prefer to represent the plotting axes as if the data were log-transformed; that is, use an antilog or exp scale. Thus, the $a$ axis is centered on 1 and suggests a fold change in intensity, while the $d$ axis suggests a fold change in differential expression. This method can be extended to experiments with multiple conditions, multiple readings (e.g., dyes) per gene on a chip, and replication of chips (Kerr et al. 2001; Wu et al. 2002). Consider an ANOVA model $$ x_{ijk} = \mu + c_i + g_j + (cg)_{ij} + \epsilon_{ijk} $$ with $i = 1, \cdots, I$ conditions, $j = 1, \cdots, J$ genes, $k = 1, \cdots, K$ replicate chips per condition, $\epsilon_{ijk} \sim \Phi(0,\sigma_j^2)$ being the measurement error for the $k$th replicate, and $c_i = 0$ if there is separate normalization by condition. Both the gene effect $g_j$ and the condition by gene interaction $(cg)_{ij}$ are random effects. In general, all variance components may depend on the gene effect $g_j$. Adding multiple readings per chip introduces a nested structure to the experimental design that we do not develop further here (see Lee et al. 2000). The major biological research focus is on differential gene expression, the condition $i$ by gene $j$ interaction. We assume that most genes show no differential expression; thus with some small probability $\pi_1$, a particular interaction $(cg)_{ij}$ is nonzero, say from $\Phi(0,\delta_j^2)$. Let $z_j = 1$ indicate differential expression, $\mbox{Prob}\{z_j = 1\} = \pi_1$. The variance of the expression score is $$ \begin{array}{rcll} \mbox{Var}(x_{ijk}) &=& \gamma_j^2 + \delta_j^2 + \sigma_j^2 & \mbox{if } z_j = 1 \mbox{ (differential expression)}, \\ \mbox{Var}(x_{ijk}) &=& \gamma_j^2 + \sigma_j^2 & \mbox{if } z_j = 0 \mbox{ (no differential expression)}, \end{array} $$ for $i = 1, ..., I, k = 1, ..., K$, with $\gamma_j^2$ the variance for the gene $j$ random effect. This differential expression indicator has been effectively used for microarray analysis (Lee et al. 2000; Kerr et al. 2001; Newton et al. 2001). This ANOVA framework allows isolation of the $(cg)_{ij}$ differential expression from the $g_j$ gene effect by contrasting conditions. Suppose that $w_i$ are condition contrasts such that $\sum_iw_i=0$ and $\sum_iw_i^2=1$. The standardized contrast $d_j = ( \bar{x}_{1j\cdot} - \bar{x}_{2j\cdot} )\sqrt{K/2}$ with $\bar{x}_{ij\cdot} = \sum_k x_{ijk}/K$ compares condition 1 with condition 2. More generally, the contrast $$ d_{jk} = \sum_iw_i \bar{x}_{ij\cdot}\sqrt{K} = \sum_iw_i\sqrt{K}[c_i + (cg)_{ij} + \bar{\epsilon}_{ij\cdot}] $$ with $\bar{\epsilon}_{ij\cdot}=\sum_k\epsilon_{ijk}/K$ has $E(d_{j})=\sum_iw_ic_i\sqrt{K}$ and $$ \begin{array}{rcll} \mbox{Var}(d_{j}) &=& \delta_j^2 + \sigma_j^2 & \mbox{if } z_j = 1 \mbox{ (differential expression)}, \\ \mbox{Var}(d_{j}) &=& \sigma_j^2 & \mbox{if } z_j = 0 \mbox{ (no differential expression)}. \end{array} $$ Again, condition effects $c_i$ drop out and $E(d_{j})=0$ if each chip is standardized separately, but in general they remain part of the contrast. Although microarray experiments began by contrasting two conditions, this approach adapts naturally to contrasts capturing key features of differential gene expression across design factors. Time or other progressions over multiple levels, such as a linear series of glucose concentrations, might be examined for linear or quadratic trends using orthogonal contrasts (Lentner and Bishop 1993). For instance, with five conditions the linear and quadratic contrasts are, respectively (dropping subscripts except for condition), $$ \begin{array}{lcl} d_{\mbox{\tiny{linear}}} &=& (2x_5 + x_4 - x_2 - 2x_1)\sqrt{K/8}\,, \\ d_{\mbox{\tiny{quadratic}}} &=& (2x_5 - x_4 - 2x_3 - x_2 + 2x_1)\sqrt{K/14}\,. \end{array} $$ With conditions resolved as multiple factors, such as obesity and genotype in our situation, separate contrasts can be considered for main effects and interactions. Each contrast can be analyzed in a fashion similar to that above. Alternatively, one can examine factors with multiple levels, say three genotypes, by an appropriate ANOVA evaluation (Lee et al. 2000). \subsection{Robust Center and Spread} For the majority of genes that are not changing, the difference $d_j$ reflects only the intrinsic noise. Thus, genes that do change can be detected by assessing their differential expression relative to the intrinsic noise found in the nonchanging genes. Although it is natural to use replicates when possible to assess the significance of contrasts for each gene, microarray experiments have typically had few replicates $K$, leading to unreliable tests. Some authors have considered shrinkage approaches that combine variance information across genes (Efron et al. 2001; L\"{o}nnstedt and Speed 2001). Measurement error seems to depend on the gene expression level $a_j = \sum_{ik} x_{ijk}/IK$, and it may be more efficient to combine variance estimates across genes with similar average expression levels (Hughes et al. 2000; Roberts et al. 2000; Baldi and Long 2001; Kerr et al. 2001; Long et al. 2001; Newton et al. 2001). Further, if there were no replicates, as in early microarray data, then it would be important to combine across genes in some fashion. There may in addition be systematic biases that depend on the average expression level (Dudoit et al. 2000; Yang et al. 2000). We noticed that empirically the variance across nonchanging genes seems to depend approximately on expression level in some smooth way, decreasing as $a$ increases, due in part to the mechanics of hybridization and reading spot measurements. Here, we consider smooth estimates of abundance-based variance to account for these concerns. In a later paper, we will investigate shrinking the gene-specific variance estimate using our abundance-based estimate and an empirical Bayes argument similar to that of L\"{o}nnstedt and Speed (2001). Our approach involves estimating the center and spread of differential expression as it varies across average gene expression $a_j$ to standardize the differential expression. Specifically, we use smoothed medians and smoothed median absolute deviations, respectively, to estimate the center and spread. Smoothing splines (Wahba 1990) are combined with standardized local median absolute deviation (MAD) to provide a data-adapted, robust estimate of spread $s(a)$. A smooth, robust estimate of center $m(a)$ can be computed in a similar fashion by smoothing the medians across the slices. We use these robust estimates of center and scale to construct standardized values $$ T_j = ( d_j - m(a_j) ) / s(a_j) $$ and base further analysis on these standardized differences. For convenience, we illustrate with two conditions and drop explicit reference to gene $j$. Revisiting the motivating model helps explain our specification for spread. Consider again $\log(G) = g+h+\epsilon$ and suppose that hybridization error is negligible or at least the same across conditions. The intrinsic noise $\epsilon$ may depend on the true expression level $g$: for two conditions 1 and 2, the difference $d$ is approximately $$ d \approx \log(G_1) - \log(G_2) = g_1 - g_2 + \epsilon_1 - \epsilon_2\,. $$ If there is no differential expression, $g_1 = g_2 = g$, then $\mbox{Var}(d \vert g) = s^2(g)$, and the gene signal $g$ may be approximated by $a$. However, the true formula for $\mbox{Var}(d \vert a)$ is not exactly $s^2(a)$ and cannot be determined without further assumptions. Thus, differential contrasts standardized by estimated center and spread that depend on $a$ should have approximately the standard normal distribution for genes that have no differential expression across the experimental conditions. Comparison of gene expressions between two conditions involves finding genes with strong differential expression. Typically, most genes show no real difference, only chance measurement variation. Therefore, a robust method that ignores genes showing large differential expression should capture the properties of the vast majority of unchanging genes. The genes are sorted and partitioned based on $a$ into many (say 400) slices containing roughly the same number of genes and summarized by the median and the MAD for each slice. For example, with 12,000 genes, the 30 contrasts $d$ for each slice are sorted; the average of ordered values 15 and 16 is the median, while the MAD is the median of absolute deviations from that central value. These 400 medians and MADs should have roughly the same distribution up to a constant. To estimate the scale, it is natural to regress the 400 values of log(MAD) on $a$ with smoothing splines (Wahba 1990), but other nonparametric smoothing methods would work as well. The smoothing parameter is tuned automatically by generalized cross-validation (Wahba 1990). The antilog of the smoothed curve, globally rescaled, provides an estimate of $s(a)$, which can be forced to be decreasing if appropriate. The 400 medians are smoothed via regression on $a$ to estimate $m(a)$. Replicates are averaged over in the robust smoothing approach, that is, contrasts $d_{j} = \sum w_i \bar{x}_{ij\cdot}\sqrt{K}$ factor out replicates. We are currently investigating shrinkage variance estimates of the form $$ s_j^2=\frac{\nu_0 s^2(a_j) + \nu_1 \hat{\sigma}_j^2}{\nu_0+\nu_1} $$ with $\hat{\sigma}_j^2=\sum_k(x_{ijk}-\bar{x}_{ij\cdot})^2/\nu_1$, $\nu_1=I(K-1)$, and $\nu_0$ is the empirical Bayes estimate (see L\"{o}nnstedt and Speed 2001) of the degrees of freedom for $\hat{\sigma}_j^2/s^2(a_j)$. It should be possible to combine estimates of spread across multiple contrasts; say, by using the absolute deviations $\vert x_{ijk} - a_j\vert$ for all genes with average intensity $a_j$ within the range of a particular slice to estimate the slice MAD. This is sensible since these absolute deviations estimate the measurement error for most genes and most conditions. Those few genes with large differential effects across conditions would have large absolute deviations that are effectively ignored by using the robust median absolute deviation. \subsection{Formal Evaluation of Significant Differential Expression} Formal evaluation of differential expression may be approached as a collection of tests for each gene of the ``null hypothesis'' of no difference or alternatively as estimating the probability that a gene shows differential expression (Kerr et al. 2001; Newton et al. 2001). Testing raises the need to account for multiple comparisons, here we use $p$-values derived using a Bonferroni-style genome-wide correction (Dudoit et al. 2000). Genes with significant differential expression are reported in order of increasing $p$-value. We can use the standardized differences $T$ to rank the genes. The conditional distribution of these $T$ given $a$ is assumed to be standard normal across all genes whose expressions do not change between conditions. Hypothesis testing here amounts to comparing the standardized differences with the intrinsic noise level. Since we are conducting multiple tests, we should adjust the test level of each gene to have a suitable overall level of significance. We prefer the conservative Zidak version of the Bonferroni correction: the overall $p$-value is bounded by $1-(1-p)^J$, where $p$ is the single-test $p$-value. For example, for 13,000 genes with an overall level of significance of 0.05, each gene should be tested at level $1.95\times 10^{-6}$, which corresponds to 4.62 score units. Testing for a million genes would correspond to identifying significant differential expression at more than 5.45 score units. Guarding against overall type I error may seem conservative. However, a larger overall level does not substantially change the normal critical value (from 4.62 to 4.31 with 13,000 genes for a 0.05 to 0.20 change in $p$-value). This test can be made one-sided if preferred. Apparently less conservative multiple-comparison adjustments to $p$-values are proposed in Yang et al. (2000). However, the results are essentially the same with all such methods, except when more than 5--10\% of the genes show differential expression across conditions. For an alternative interpretation of $p$-values in terms of false discovery rates, see Storey and Tibshirani (2003). It may be appropriate to examine a histogram of standardized differences $T$ using these critical values as guidelines rather than as strict rules. The density $f$ of all the scores is a mixture of the densities for nonchanging $f_0$ and changing $f_1$ genes, $$ f(T) = (1-\pi_1)f_0(T) + \pi_1 f_1(T). $$ By our construction, $f_0$ is approximately standard normal. Following Efron et al. (2001), set $\pi_1$ just large enough so that the estimate $$ f_1(T) = [f(T) - (1-\pi_1) f_0(T)] / \pi_1 $$ is positive. This in some sense provides a `liberal' estimate of the distribution of differentially expressed genes. It lends support to examination of a wider set of genes, with standardized scores that are above 3 or below --3. We suggest using this set as the basis for hierarchical clustering. Notice also that this provides an estimate of the posterior probability of differential expression ($z_j=1$) for each mRNA, $$ \mbox{Prob}\{z_j = 1| T_j \} = \pi_1 f_1(T_j) / f(T_j) . $$ Gross errors on microarrays can be confused with changing genes. Replicates can be used to detect outliers in a fashion similar to the approach for differential gene expression. Residual deviations of each replicate from the condition by gene mean, $x_{ijk}-\bar{x}_{ij\cdot}$, could be plotted against the average intensity, $a_j$. Robust estimates of center and scale could be used as above in formal Bonferroni-style tests for outliers. Separate smooth robust estimates of center and scale are needed for each contrast. Perhaps an additional Bonferroni correction may be used to adjust for multiple contrasts. \section{Software} The analysis procedures are written as an R language module. The R system is publicly available from the R Project, and our code is available from the corresponding author as the R {\tt pickgene} library. The function {\tt pickgene()} plots $d$ against $a$, after backtransforming to show fold changes, and picks the genes with significant differences in expression. Examples include the simulations and graphics presented here. This library can be found at {\tt www.stat.wisc.edu/$\sim$yandell/statgen}. In its simplest form, {\tt pickgene()} takes a data frame (or matrix) of microarray data, one column per array. We assume that housekeeping genes have already been removed. Columns are automatically contrasted using the prevailing form of orthonormal contrast (default is polynomial, {\tt contrasts = "contr.poly"}). \begin{verbatim} library( pickgene ) result <- pickgene( data ) \end{verbatim} This produces a scatterplot with average intensity $a$ along the horizontal axis and contrasts $d$ along the vertical, with one plot for each contrast (typically one fewer than the number of columns of {\tt data}). With two columns, we are usually interested in something analogous to the log ratio, which can be achieved by renormalizing the contrast. If desired, the log transform can be specified by setting {\tt rankbased = F}. Gene ideas can be preserved in the results as well. \begin{verbatim} result <- pickgene( data, geneID = probes, renorm = sqrt( 2 ), rankbased = F ) print( result$pick[[1]] ) \end{verbatim} The {\tt pick} object is a list with one entry for each contrast, including the probe names, average intensity $a$, fold change ($\exp(d)$, as if $\Phi^{-1}(F(\Delta))=\log(\Delta)$), and Bonferroni-adjusted $p$-value. The result also contains a score object with the average intensity $a$, score $T$, lower and upper Bonferroni limits, and probe names. The {\tt pickgene()} function relies on two other functions. The function {\tt model.pickgene()} generates the contrasts, although this can be bypassed. More importantly, the function {\tt robustscale} slices the pairs $(a,d)$ into 400 equal-sized sets based on $a$, finds medians and log(MAD)s for each slice, and then smoothes them using splines (Wahba 1990) to estimate the center, $m(a)$, and spread, $s(a)$, respectively. Estimates of density are based on the {\tt density()} function, packaged in our {\tt pickedhist()} routine. \begin{verbatim} pickedhist( result, p1 = .05, bw = NULL ) \end{verbatim} We pick a bandwidth {\tt bw} that provides smooth curves and then adjust $\pi_1$ = {\tt p1} so that $f_1$ is positive. The standard deviation $s(a)$ is not returned directly in result. However, it is easily calculated as log(upper/lower)/2. \section{References} Lin Y, Nadler ST, Lan H, Attie AD, Yandell BS (2002) Adaptive gene picking with microarray data: detecting important low abundance signals. in {\em The Analysis of Gene Expression Data: Methods and Software}, ed by G Parmigiani, ES Garrett, RA Irizarry, SL Zeger. Springer-Verlag, ch. 13. \end{document}