%\VignetteIndexEntry{RPA}
%The above line is needed to remove a warning in R CMD check
\documentclass[12pt]{article}

\usepackage{amsmath,amssymb,amsfonts}
\usepackage{hyperref}
\usepackage[authoryear,round]{natbib}
\usepackage{Sweave}

\textwidth=6.2in
\textheight=8.5in
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}

\def\muj{\mathbf{\mu}_j}
\def\mujs{\{\mathbf{\mu}_j\}}

\def\s{\{\mathbf{s}\}}
\def\st{\mathbf{s}^t}
\def\sj{\mathbf{s}_j}
\def\sij{s_{ij}}
\def\sjs{\mathbf{s}_{j=1}^J}
\def\sjt{\mathbf{s}_j^t}
\def\stj{s_{tj}}
\def\scj{s_{cj}}
\def\stjs{\{s_{tj}\}_t}
\def\sjI{\mathbf{s}_j^1}
\def\sjT{\mathbf{s}_j^T}

\def\m{\mathbf{m}}
\def\mj{\mathbf{m}_j}
\def\mjs{\{\mathbf{m}_j\}}
\def\mtj{m_{tj}}

\def\gj{\{\mathbf{g}_j\}}
\def\gi{g_i}
\def\gt{g_t}
\def\gc{g_c}
\def\g{\mathbf{g}}

\def\d{\mathbf{d}}
\def\dt{d_t}
\def\djt{\mathbf{d}_j^t}
\def\djr{\mathbf{s}_{j}}

\def\TauSq{\boldsymbol{\tau}^2}
\def\tauj{\mathbf{\tau}_j}
\def\taus{\{\tau_j^2\}}
\def\taujSq{\mathbf{\tau}_j^2}
\def\tauOneSq{\mathbf{\tau}_1^2}
\def\tauPSq{\mathbf{\tau}_J^2}

\def\epsilonj{\varepsilon_j}
\def\epsilonij{\mathbf{\varepsilon}_{ij}}
\def\epsiloncj{\varepsilon_{cj}}
\def\epsilontj{\varepsilon_{tj}}
\def\epsilontjs{\{\varepsilon_{tj}\}_t}
\def\epsiloncjs{\{\varepsilon_{cj}\}_j}
\def\Epsilonj{\boldsymbol{\varepsilon}_j}
\def\Epsiloncj{\boldsymbol{\varepsilon}_{cj}}

\def\alphaj{\alpha_j}
\def\betaj{\beta_j}
\def\alphahatj{\hat{\alpha}_j}
\def\betahatj{\hat{\beta}_j}

\def\zi{z_i}
\def\j{\mathbf{j}}
\def\sigmaj{\sigma_j}
\def\lambdaj{\lambda_j}

\author{Leo Lahti\\Helsinki Institute for Information Technology HIIT\\
  and Adaptive Informatics Research Centre,\\Department of Information
  and Computer Science,\\Aalto University School of Science and
  Technology,\\P.O. Box 15400, FI-00076 Aalto, Finland\\{\tt
    leo.lahti@iki.fi}}

\begin{document}
\title{Description of RPA}


\maketitle 

\section{Introduction}

\Rpackage{RPA} (Robust Probabilistic Averaging)
package\footnote{\url{http://www.cis.hut.fi/projects/mi/software/RPA/}}
provides tools for analyzing probe reliability and differential gene
expression on Affymetrix short oligonucleotide arrays.

RPA provides explicit estimates probe reliability, and a
probeset-level estimate of differential gene expression between a
user-specified reference array and the other arrays. Probabilistic
formulation allows incorporation of prior information concerning probe
reliability into gene expression analysis within a principled
framework.  The underlying probabilistic model for probe-level
observations is described in \cite{Lahti09tcbb}.

\subsection{Relation to other probe-level models}

RPA utilizes probe-level measurements of differential expression to
avoid modeling unidentifiable probe affinities, which are the key
probe-specific parameter in many preprocessing methods including RMA
\cite{Irizarry03rma}.  Instead, RPA estimates the overall reliability
of each probe in terms of a probe-specific variance term. While RPA
can be used for preprocessing of gene expression data, its primary
focus on probe reliability analysis. This distinguishes RPA from
probe-level preprocessing methods such as dChip's MBEI
\citep{Li01pnas}, RMA \citep{Irizarry03rma}, or FARMS
\cite{Hochreiter06}, that provide probeset-level summaries but have
not been used to investigate probe performance. For further details,
see \cite{Lahti09tcbb}.

\subsection{Summary of RPA model}

RPA assumes a Gaussian model for probe effects.  Consider a probe set
targeted at measuring the expression level of target transcript
\(g\). Probe-level observation \(\sij\) of probe \(j\) on array \(i\)
is modeled as a sum of the true expression signal (common for all
probes in the probeset), and probe-specific Gaussian noise: \(\sij =
\gi + \muj + \epsilonij\). Importantly, the stochastic noise component
is probe-specific in RPA, and distributed as \(\epsilonij \sim
N(0,\taujSq)\). The variance parameters \(\taus\) are of interest in
probe reliability analysis; the inverse variance \(1/\taujSq\)
provides a measure of probe reliability.

The mean parameter \(\muj\) of the noise model describes systematic
probe affinity effect but it is unidentifiable. In RPA, these
parameters cancel out when the signal log-ratio between a
user-specified 'reference' array and the remaining arrays is computed
for each probe: the differential expression signal between arrays \(t
= \{1, \dots, T\} \) and the reference array \(c\) for probe \(j\) is
given by \(\mtj = \stj - \scj = \gt - \gc + \epsilontj - \epsiloncj =
\dt + \epsilontj - \epsiloncj\). In vector notation the differential
gene expression profile of probe \(j\) across the arrays can be
written as \(\mj = \d + \Epsilonj\). In practice, \(\d\) and the
probe-specific variances \(\{\tau_j\}_{j=1}^P\) for the \(P\) probes
within the probeset are estimated simultaneously based on the
probabilistic model. With large sample sizes the solution will
converge to estimating the mean of the probe-level observations
weighted by probe reliability. 

The probe-level data is background corrected, normalized, and
log2-transformed before the analysis. By default, RPA uses the
background correction model of RMA \cite{Irizarry03} and quantile
normalization \cite{Bolstad03}. Our implementation utilizes the
\Rpackage{affy} package \cite{Gautier04a} to handle probe-level data.
For details about short oligonucleotide arrays and the design of the
Affymetrix GeneChip arrays, see the Affymetrix MAS manual
\cite{affy5}.

\section{Probe reliability analysis}

RPA operates on affybatch objects. An affybatch can be created from
Affymetrix CEL files using the ReadAffy function of the BioConductor
\Rpackage{affy} package \cite{Gautier04a}. An affybatch contains the
probe-level data of Affymetrix arrays. Our toy examples use the
\Robject{Dilution} dataset provided by \Rpackage{affydata}
package. Load example data (the 'Dilution' affybatch):

<<startchunk,results=hide>>=
require(affy)
require(affydata)
data(Dilution)
@

{\it RPA.pointestimate} is the main function. Let us perform the
analysis for particular probesets in the Dilution data using the first
array (\(cind = 1\)) as the reference for calculating differential
expression values of the other arrays.

<<RPA.pointestimate, results = hide>>=
require(RPA)
sets <- geneNames(Dilution)[1:2]
rpa.results <- RPA.pointestimate(Dilution, sets, cind = 1)
@

Probe reliability and differential gene expression analysis is
performed on the whole data set as follows (potentially slow).

<<rpa.res, results = hide, eval = FALSE>>=
rpa.results <- RPA.pointestimate(Dilution, cind = 1)
@


The 'rpa2eset' function coerces the probeset-level differential
expression estimates (d) into an ExpressionSet object that can be
analyzed using standard R/BioC tools for gene expression. The results
for a particular probeset are visualized with

<<visu, fig = FALSE, echo = TRUE, print = FALSE, results = hide, eval = FALSE>>=
rpa.plot("1000_at", rpa.results)
@

The output is shown in Figure \ref{fig:illustration}. See
help('rpa.plot') for details.


\begin{figure}[htbp]
\begin{center}
<<barplots, fig = TRUE, echo = FALSE, print = FALSE>>=
rpa.plot("1000_at", rpa.results)
@
\caption{Estimated probe-specific variances and differential gene expression for an example probe set.}
\label{fig:illustration}
\end{center}
\end{figure}

\subsection{Measuring probe-specific noise}

RPA provides an estimate of the noise level of each individual probe
by the probe-specific variance parameter (\(\taujSq\)). Fetch the
probe variances (or normalized variants; see
help(get.probe.noise.estimates)) with

<<noise,results=hide>>=
noise <- get.probe.noise.estimates(rpa.results)
@ 

The more noisy the probe is (the higher the variance \(\taujSq\)), the
less reliable it is considered. Inverse of the variance,
\(\frac{1}{\taujSq}\), can be used to quantitate probe reliability.

\subsection{Manual analysis of an individual probe set}

Preprocess the whole data set before the analysis:

<<Smat,results=hide, eval = FALSE>>=
Smat <- RPA.preprocess(Dilution, cind = 1)
@

Pick probe-level data for a probe set (arrays x probes matrix):

<<pick, results=hide, eval = FALSE>>=
S <- t(Smat$fcmat[Smat$set.inds[["1000_at"]], ])
@

Estimate probeset-level signal and probe-specific variances:

<<iter, results=hide, eval = FALSE>>=
res <- RPA.iteration(S)
@

\subsection{Setting probe-specific priors}

Prior information of probe reliability can be set by tuning the shape
(alpha) and scale (beta) parameters of the model. These are inverse
Gamma distribution parameters, which is the conjugate prior for the
variances. Set priors for a particular probeset. If the 'priors'
parameter is not given, non-informative priors will be given for the
other probesets:

<<setpriors, results = hide>>=
alpha <- beta <- rep(1, 16)
probe.index <- 5
alpha[[probe.index]] <- 3
beta[[probe.index]] <- 1
priors <- set.priors(Dilution, set = "1000_at", alpha, beta)
@

Run RPA with priors:

<<pe, results=hide, eval = FALSE>>=
rpa.results <- RPA.pointestimate(Dilution, sets, priors = priors)
@                                  

\section{Differential gene expression analysis}

RPA provides a wrapper ('rpa') for robust preprocessing of gene
expression data in differential gene expression studies. Note that one
of the arrays is used as the reference against which the differential
expression values are calculated; RPA expression values are
log\(_2\)-ratios related to the reference array. RPA supports
alternative CDF environments (see function documentation for details).

<<rpa, results = hide, eval = FALSE>>=
eset <- rpa(Dilution, cind = 1)
@

The output is an ExpressionSet object, which allows downstream
analysis of the results using standard R/BioC tools for gene
expression data.

\section{Citing RPA}

Please cite \cite{Lahti09tcbb} when using the package.

\section{Details}

This document was written using:

<<details>>=
sessionInfo()
@

\bibliographystyle{plainnat}
\bibliography{RPA}

\end{document}