\name{plgem.fit} \alias{plgem.fit} \title{PLGEM Fitting and Evaluation} \description{ Function for fitting and evaluating goodness of fit of a \bold{PLGEM} on a set of replicated samples defined in an \code{ExpressionSet}. } \usage{ plgem.fit(data, covariate=1, fitCondition=1, p=10, q=0.5, trimAllZeroRows=FALSE, zeroMeanOrSD=c("replace", "trim"), fittingEval=FALSE, plot.file=FALSE, verbose=FALSE) } \arguments{ \item{data}{an object of class \code{ExpressionSet}; see Details for important information on how the \code{phenoData} slot of this object will be interpreted by the function.} \item{covariate}{\code{integer}, \code{numeric} or \code{character}; specifies the covariate to be used to fit the \bold{PLGEM}. See Details for how to specify the \code{covariate}.} \item{fitCondition}{\code{integer}, \code{numeric} or \code{character}; specifies the condition to be used to fit the \bold{PLGEM}. See Details for how to specify the \code{fitCondition}.} \item{p}{\code{integer} (or coercible to \code{integer}); number of intervals used to partition the expression value range.} \item{q}{\code{numeric} in [0,1]; the quantile of standard deviation used for \bold{PLGEM} fitting.} \item{trimAllZeroRows}{\code{logical}; if \code{TRUE}, rows in the data set containing only zero values are trimmed before fitting \bold{PLGEM}.} \item{zeroMeanOrSD}{either \code{NULL} or \code{character}; what should be done if a row with non-positive mean or zero standard deviation is encountered before fitting \bold{PLGEM}? Current options are one of \code{"replace"} or \code{"trim"}. Partial matching is used to switch between the options and setting the value to \code{NULL} will cause the default behaviour to be enforced, i.e. to \code{"replace"} (see Details).} \item{fittingEval}{\code{logical}; if \code{TRUE}, the fitting is evaluated generating a diagnostic plot.} \item{plot.file}{\code{logical}; if \code{TRUE}, a png file is written on the current working directory.} \item{verbose}{\code{logical}; if \code{TRUE}, comments are printed out while running.} } \details{ \code{plgem.fit} fits a Power Law Global Error Model (\bold{PLGEM}) to an \code{ExpressionSet} and optionally evaluates the quality of the fit. This \bold{PLGEM} aims to find the mathematical relationship between standard deviation and mean gene expression values (or protein abundance levels) in a set of replicated microarray (or proteomics) samples, according to the following power law: \deqn{\log{(modeledSpread)}=PLGEMslope*\log{(mean)}+PLGEMintercept}{% ln(modeledSpread) = PLGEMslope * ln(mean) + PLGEMintercept} It has been demonstrated that this model fits to Affymetrix GeneChip datasets, as well as to datasets of normalized spectral counts obtained by mass spectrometry-based proteomics. Technically, two replicates are required and sufficient to fit a \bold{PLGEM}. Having 3 or more replicates, of course, improves the fitting and is recommended (see References for details). The \code{phenoData} slot of the \code{ExpressionSet} given as input is expected to contain the necessary information to distinguish the various experimental conditions from one another. The columns of the \code{pData} are referred to as \sQuote{covariates}. There has to be at least one covariate defined in the input \code{ExpressionSet}. The sample attributes according to this covariate must be distinct for samples that are to be treated as distinct experimental conditions and identical for samples that are to be treated as replicates. There is a couple different ways to specify the \code{covariate}: If an \code{integer} or a \code{numeric} is given, it will be taken as the covariate number (in the same order in which the covariates appear in the \code{colnames} of the \code{pData}). If a \code{character} is given, it will be taken as the covariate name itself (in the same way the covariates are specified in the \code{colnames} of the \code{pData}). By default, the first covariate appearing in the \code{colnames} of the \code{pData} is used. Similarly, there is a couple different ways to specify on which experimental condition to fit the model. The available \sQuote{condition names} are taken from \code{unique(as.character(pData(data)[, covariate]))}. If \code{fitCondition} is given as a \code{character}, it will be taken as the condition name itself. If \code{fitCondition} is given as an \code{integer} or a \code{numeric} value, it will be taken as the condition number (in the same order of appearance as in the \sQuote{condition names}). By default, the first condition name is used. Setting \code{trimAllZeroRows=TRUE} is especially useful in proteomics data sets, where there is no guarantee of identifying a protein across all experimental conditions. Since \bold{PLGEM} is fitted only to the data corresponding to a single experimental condition (as defined by \code{fitCondition}), it is possible to generate a non-negligible number of rows containing only zero values, even if there were no such rows in the original (complete) data set containing all experimental conditions. Setting \code{zeroMeanOrSD="replace"} (the current default, for backward compatibility) will cause the function to replace zero or negative means with the smallest positive mean found in the data set and to replace zero standard deviations with the smallest non-zero standard deviation found in the data set. Setting \code{zeroMeanOrSD="trim"} is the current recommended option, especially for spectral counting proteomics data sets that are typically characterized by a high data granularity or for microarray data sets with a small number of replicates. In both cases, there are chances for data values for a same gene or protein to be identical across replicates (and therefore with zero standard deviation) by chance alone. Note that setting \code{trimAllZeroRows=TRUE} does not guarantee that there will be no rows with zero mean or zero standard deviation. If argument \code{fittingEval} is set to \code{TRUE}, a graphical control of the goodness of the \bold{PLGEM} fitting is produced and a plot containing four panels is generated. The top-left panel shows the power law, characterized by a \sQuote{SLOPE} and an \sQuote{INTERCEPT}. The top-right panel represents the distribution of model residuals. The bottom-left reports the contour plot of ranked residuals. The bottom-right panel finally shows the relationship between the distribution of observed residuals and the normal distribution. A good fit normally gives a horizontal symmetric rank-plot and a near normal distribution of residuals. Warnings are issued if the fitted PLGEM slope is above 1 or under 0.5, if the adjusted \eqn{r^{2}}{r^2} is below 0.95 or if the Pearson correlation coefficient is below 0.85. These are the ranges of values inside which most GeneChip MAS5 dataset and NSAF proteomics dataset have been empirically observed to lie (see References). } \value{ A \code{list} of six elements (see Details): \item{SLOPE}{the slope of the fitted PLGEM.} \item{INTERCEPT}{the intercept of the fitted PLGEM.} \item{DATA.PEARSON}{the Pearson correlation coefficient between the \eqn{\log{(sd)}}{log(sd)} and the \eqn{\log{(mean)}}{log(mean)} in the original data.} \item{ADJ.R2.MP}{the adjusted \eqn{r^{2}}{r^2} of PLGEM fitted on the modelling points.} \item{COVARIATE}{a \code{character} indicating the covariate used for fitting.} \item{FIT.CONDITION}{a \code{character} indicating the condition used for fitting.} } \references{ Pavelka N, Pelizzola M, Vizzardelli C, Capozzoli M, Splendiani A, Granucci F, Ricciardi-Castagnoli P. A power law global error model for the identification of differentially expressed genes in microarray data. BMC Bioinformatics. 2004 Dec 17; 5:203; \url{http://www.biomedcentral.com/1471-2105/5/203}. Pavelka N, Fournier ML, Swanson SK, Pelizzola M, Ricciardi-Castagnoli P, Florens L, Washburn MP. Statistical similarities between transcriptomics and quantitative shotgun proteomics data. Mol Cell Proteomics. 2008 Apr; 7(4):631-44; \url{http://www.mcponline.org/cgi/content/abstract/7/4/631}. } \author{ Mattia Pelizzola \email{mattia.pelizzola@gmail.com} Norman Pavelka \email{normanpavelka@gmail.com} } \seealso{ \code{\link{plgem.obsStn}}, \code{\link{plgem.resampledStn}}, \code{\link{plgem.pValue}}, \code{\link{plgem.deg}}, \code{\link{run.plgem}} } \examples{ data(LPSeset) LPSfit <- plgem.fit(data=LPSeset, fittingEval=TRUE) as.data.frame(LPSfit) } \keyword{models}