%\VignetteIndexEntry{RPA} %The above line is needed to remove a warning in R CMD check \documentclass[12pt]{article} \usepackage{amsmath,amssymb,amsfonts} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \usepackage{Sweave} \textwidth=6.2in \textheight=8.5in \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \def\muj{\mathbf{\mu}_j} \def\mujs{\{\mathbf{\mu}_j\}} \def\s{\{\mathbf{s}\}} \def\st{\mathbf{s}^t} \def\sj{\mathbf{s}_j} \def\sij{s_{ij}} \def\sjs{\mathbf{s}_{j=1}^J} \def\sjt{\mathbf{s}_j^t} \def\stj{s_{tj}} \def\scj{s_{cj}} \def\stjs{\{s_{tj}\}_t} \def\sjI{\mathbf{s}_j^1} \def\sjT{\mathbf{s}_j^T} \def\m{\mathbf{m}} \def\mj{\mathbf{m}_j} \def\mjs{\{\mathbf{m}_j\}} \def\mtj{m_{tj}} \def\gj{\{\mathbf{g}_j\}} \def\gi{g_i} \def\gt{g_t} \def\gc{g_c} \def\g{\mathbf{g}} \def\d{\mathbf{d}} \def\dt{d_t} \def\djt{\mathbf{d}_j^t} \def\djr{\mathbf{s}_{j}} \def\TauSq{\boldsymbol{\tau}^2} \def\tauj{\mathbf{\tau}_j} \def\taus{\{\tau_j^2\}} \def\taujSq{\mathbf{\tau}_j^2} \def\tauOneSq{\mathbf{\tau}_1^2} \def\tauPSq{\mathbf{\tau}_J^2} \def\epsilonj{\varepsilon_j} \def\epsilonij{\mathbf{\varepsilon}_{ij}} \def\epsiloncj{\varepsilon_{cj}} \def\epsilontj{\varepsilon_{tj}} \def\epsilontjs{\{\varepsilon_{tj}\}_t} \def\epsiloncjs{\{\varepsilon_{cj}\}_j} \def\Epsilonj{\boldsymbol{\varepsilon}_j} \def\Epsiloncj{\boldsymbol{\varepsilon}_{cj}} \def\alphaj{\alpha_j} \def\betaj{\beta_j} \def\alphahatj{\hat{\alpha}_j} \def\betahatj{\hat{\beta}_j} \def\zi{z_i} \def\j{\mathbf{j}} \def\sigmaj{\sigma_j} \def\lambdaj{\lambda_j} \author{Leo Lahti\\Helsinki Institute for Information Technology HIIT\\ and Adaptive Informatics Research Centre,\\Department of Information and Computer Science,\\Aalto University School of Science and Technology,\\P.O. Box 15400, FI-00076 Aalto, Finland\\{\tt leo.lahti@iki.fi}} \begin{document} \title{Description of RPA} \maketitle \section{Introduction} \Rpackage{RPA} (Robust Probabilistic Averaging) package\footnote{\url{http://www.cis.hut.fi/projects/mi/software/RPA/}} provides tools for analyzing probe reliability and differential gene expression on Affymetrix short oligonucleotide arrays. RPA provides explicit estimates probe reliability, and a probeset-level estimate of differential gene expression between a user-specified reference array and the other arrays. Probabilistic formulation allows incorporation of prior information concerning probe reliability into gene expression analysis within a principled framework. The underlying probabilistic model for probe-level observations is described in \cite{Lahti09tcbb}. \subsection{Relation to other probe-level models} RPA utilizes probe-level measurements of differential expression to avoid modeling unidentifiable probe affinities, which are the key probe-specific parameter in many preprocessing methods including RMA \cite{Irizarry03rma}. Instead, RPA estimates the overall reliability of each probe in terms of a probe-specific variance term. While RPA can be used for preprocessing of gene expression data, its primary focus on probe reliability analysis. This distinguishes RPA from probe-level preprocessing methods such as dChip's MBEI \citep{Li01pnas}, RMA \citep{Irizarry03rma}, or FARMS \cite{Hochreiter06}, that provide probeset-level summaries but have not been used to investigate probe performance. For further details, see \cite{Lahti09tcbb}. \subsection{Summary of RPA model} RPA assumes a Gaussian model for probe effects. Consider a probe set targeted at measuring the expression level of target transcript \(g\). Probe-level observation \(\sij\) of probe \(j\) on array \(i\) is modeled as a sum of the true expression signal (common for all probes in the probeset), and probe-specific Gaussian noise: \(\sij = \gi + \muj + \epsilonij\). Importantly, the stochastic noise component is probe-specific in RPA, and distributed as \(\epsilonij \sim N(0,\taujSq)\). The variance parameters \(\taus\) are of interest in probe reliability analysis; the inverse variance \(1/\taujSq\) provides a measure of probe reliability. The mean parameter \(\muj\) of the noise model describes systematic probe affinity effect but it is unidentifiable. In RPA, these parameters cancel out when the signal log-ratio between a user-specified 'reference' array and the remaining arrays is computed for each probe: the differential expression signal between arrays \(t = \{1, \dots, T\} \) and the reference array \(c\) for probe \(j\) is given by \(\mtj = \stj - \scj = \gt - \gc + \epsilontj - \epsiloncj = \dt + \epsilontj - \epsiloncj\). In vector notation the differential gene expression profile of probe \(j\) across the arrays can be written as \(\mj = \d + \Epsilonj\). In practice, \(\d\) and the probe-specific variances \(\{\tau_j\}_{j=1}^P\) for the \(P\) probes within the probeset are estimated simultaneously based on the probabilistic model. With large sample sizes the solution will converge to estimating the mean of the probe-level observations weighted by probe reliability. The probe-level data is background corrected, normalized, and log2-transformed before the analysis. By default, RPA uses the background correction model of RMA \cite{Irizarry03} and quantile normalization \cite{Bolstad03}. Our implementation utilizes the \Rpackage{affy} package \cite{Gautier04a} to handle probe-level data. For details about short oligonucleotide arrays and the design of the Affymetrix GeneChip arrays, see the Affymetrix MAS manual \cite{affy5}. \section{Probe reliability analysis} RPA operates on affybatch objects. An affybatch can be created from Affymetrix CEL files using the ReadAffy function of the BioConductor \Rpackage{affy} package \cite{Gautier04a}. An affybatch contains the probe-level data of Affymetrix arrays. Our toy examples use the \Robject{Dilution} dataset provided by \Rpackage{affydata} package. Load example data (the 'Dilution' affybatch): <>= require(affy) require(affydata) data(Dilution) @ {\it RPA.pointestimate} is the main function. Let us perform the analysis for particular probesets in the Dilution data using the first array (\(cind = 1\)) as the reference for calculating differential expression values of the other arrays. <>= require(RPA) sets <- geneNames(Dilution)[1:2] rpa.results <- RPA.pointestimate(Dilution, sets, cind = 1) @ Probe reliability and differential gene expression analysis is performed on the whole data set as follows (potentially slow). <>= rpa.results <- RPA.pointestimate(Dilution, cind = 1) @ The 'rpa2eset' function coerces the probeset-level differential expression estimates (d) into an ExpressionSet object that can be analyzed using standard R/BioC tools for gene expression. The results for a particular probeset are visualized with <>= rpa.plot("1000_at", rpa.results) @ The output is shown in Figure \ref{fig:illustration}. See help('rpa.plot') for details. \begin{figure}[htbp] \begin{center} <>= rpa.plot("1000_at", rpa.results) @ \caption{Estimated probe-specific variances and differential gene expression for an example probe set.} \label{fig:illustration} \end{center} \end{figure} \subsection{Measuring probe-specific noise} RPA provides an estimate of the noise level of each individual probe by the probe-specific variance parameter (\(\taujSq\)). Fetch the probe variances (or normalized variants; see help(get.probe.noise.estimates)) with <>= noise <- get.probe.noise.estimates(rpa.results) @ The more noisy the probe is (the higher the variance \(\taujSq\)), the less reliable it is considered. Inverse of the variance, \(\frac{1}{\taujSq}\), can be used to quantitate probe reliability. \subsection{Manual analysis of an individual probe set} Preprocess the whole data set before the analysis: <>= Smat <- RPA.preprocess(Dilution, cind = 1) @ Pick probe-level data for a probe set (arrays x probes matrix): <>= S <- t(Smat$fcmat[Smat$set.inds[["1000_at"]], ]) @ Estimate probeset-level signal and probe-specific variances: <>= res <- RPA.iteration(S) @ \subsection{Setting probe-specific priors} Prior information of probe reliability can be set by tuning the shape (alpha) and scale (beta) parameters of the model. These are inverse Gamma distribution parameters, which is the conjugate prior for the variances. Set priors for a particular probeset. If the 'priors' parameter is not given, non-informative priors will be given for the other probesets: <>= alpha <- beta <- rep(1, 16) probe.index <- 5 alpha[[probe.index]] <- 3 beta[[probe.index]] <- 1 priors <- set.priors(Dilution, set = "1000_at", alpha, beta) @ Run RPA with priors: <>= rpa.results <- RPA.pointestimate(Dilution, sets, priors = priors) @ \section{Differential gene expression analysis} RPA provides a wrapper ('rpa') for robust preprocessing of gene expression data in differential gene expression studies. Note that one of the arrays is used as the reference against which the differential expression values are calculated; RPA expression values are log\(_2\)-ratios related to the reference array. RPA supports alternative CDF environments (see function documentation for details). <>= eset <- rpa(Dilution, cind = 1) @ The output is an ExpressionSet object, which allows downstream analysis of the results using standard R/BioC tools for gene expression data. \section{Citing RPA} Please cite \cite{Lahti09tcbb} when using the package. \section{Details} This document was written using: <
>= sessionInfo() @ \bibliographystyle{plainnat} \bibliography{RPA} \end{document}