%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Do not modify this file since it was automatically generated from:
%
%  901.CalibrationAndNormalization.R
%
% by the Rdoc compiler part of the R.oo package.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\name{1. Calibration and Normalization}
\alias{1. Calibration and Normalization}
\title{1. Calibration and Normalization}

\encoding{latin1}

\description{
  In this section we give \emph{our} recommendation on how spotted two-color (or multi-color) microarray data is best calibrated and normalized.
}

\section{Classical background subtraction}{
  We do \emph{not} recommend background subtraction in the classical sense, where background is estimated by various image analysis methods. This means that we will only consider foreground signals in the analysis.

  We estimate "background" by other means. In what is explained below, only a global background, that is, a global bias, is estimated and removed.
}

\section{Multiscan calibration}{
  In Bengtsson et al. (2004) we give evidence that microarray scanners can introduce a significant bias in data. This bias, which is about 15-25 out of 65535, \emph{will} introduce intensity dependency in the log-ratios, as explained in Bengtsson \& \enc{Hössjer}{Hossjer} (2006).

  In Bengtsson et al. (2004) we find that this bias is stable across arrays (and over a couple of months), but further research is needed in order to tell if this is true over a longer time period.

  To calibrate signals for scanner biases, scan the same array at three or more PMT settings (in decreasing order). Do this \emph{without} washing, cleaning or otherwise altering the array between subsequent scans. Although not necessary, it is preferred that the array remains in the scanner between subsequent scans. This will simplify the image analysis, since spot identification needs to be done only once if the images align perfectly.

  After image analysis, read all K scans for the same array into two matrices, one for the red and one for the green channel, where the K columns correspond to the scans and the N rows to the spots. It is enough to use foreground signals.

  To multiscan calibrate the data, call, for each channel separately,

  \code{Xc <- calibrateMultiscan(X)}

  where \code{X} is the NxK matrix of signals for one channel across all scans. The calibrated signals are returned in the Nx1 matrix \code{Xc}.

  Multiscan calibration may sometimes be skipped, especially if affine normalization is applied immediately afterward, but we do recommend that every lab checks at least once whether its scanner introduces bias.
}

\section{Affine normalization}{
  In Bengtsson \& \enc{Hössjer}{Hossjer} (2006), we carry out a detailed study of how biases in each channel introduce so-called intensity-dependent log-ratios, among other systematic artifacts.

  Data with (additive) bias in each channel is said to be \emph{affinely} transformed. Data without such bias is said to be \emph{linearly} (proportionally) transformed. Ideally, observed signals (data) are a linear (proportional) function of the true gene expression levels.

  We do \emph{not} assume proportional observations. The scanner bias is real evidence that assuming linearity is not correct.

  Affine normalization corrects for affine transformations in data. Without control spots it is not possible to estimate the bias in each of the channels, but only the relative bias, such that after normalization the effective biases are the same in all channels. This is why we call it normalization and not calibration.

  In its simplest form, affine normalization is done by

  \code{Xn <- normalizeAffine(X)}

  where \code{X} is an Nx2 matrix whose first column holds the foreground signals from the red channel and whose second column holds the signals from the green channel. If three- or four-channel data is used, the additional channels are added as further columns in the same way. The normalized data is returned as an Nx2 matrix \code{Xn}.

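
  As a minimal illustration of why additive channel biases matter, the following base R sketch simulates non-differentially expressed genes under an assumed affine transformation in each channel. It does not use or reproduce \code{normalizeAffine()}; the offsets and scale factors are made-up values chosen for illustration, and the offsets can be subtracted exactly only because the data is simulated.

```r
# Hypothetical true expression levels for non-differentially expressed
# genes: the true red and green signals are identical.
x <- 2^seq(from = 4, to = 14, by = 2)   # 16, 64, ..., 16384

# Assumed affine transformation per channel: offset + scale * truth.
# The offsets (a) and scales (b) are made-up values for illustration.
aR <- 20; bR <- 1.0
aG <- 50; bG <- 1.2

# Nx2 matrix with the red channel first and the green channel second.
X <- cbind(R = aR + bR * x, G = aG + bG * x)

# Log-ratios of the observed (biased) signals. With unequal offsets they
# change with intensity although no gene is differentially expressed.
M <- log2(X[, "R"] / X[, "G"])
print(round(M, 3))

# Removing the (here known) offsets leaves a constant log-ratio,
# log2(bR / bG), at every intensity.
M0 <- log2((X[, "R"] - aR) / (X[, "G"] - aG))
print(round(M0, 3))
```

  With real data the offsets are unknown, which is why only the relative bias between channels can be estimated, as explained above.
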
  To normalize all arrays and all channels at once, one may put all data into one big NxK matrix where the K columns hold all channels from the first array, then all channels from the second array, and so on. Then

  \code{Xn <- normalizeAffine(X)}

  will return the across-array and across-channel normalized data in the NxK matrix \code{Xn}, where the columns are stored in the same order as in matrix \code{X}.

  Having equal effective bias in all channels is a substantial improvement. First of all, any intensity-dependent bias in the log-ratios is removed \emph{for all non-differentially expressed genes}. There is still an intensity-dependent bias in the log-ratios for differentially expressed genes, but this is now symmetric around log-ratio zero.

  Affine normalization will (by default, and as recommended) normalize \emph{all} arrays together and at once. This will guarantee that all arrays are "on the same scale". Thus, it is \emph{not} recommended to apply a classical between-array scale normalization afterward. Moreover, the average log-ratio will be zero after an affine normalization.

  Note that an affine normalization will only remove curvature in the log-ratios at lower intensities. If a strong intensity-dependent bias at high intensities remains, this is most likely due to saturation effects, such as too high PMT settings or quenching.

  Note that for a perfect affine normalization you \emph{should} expect much higher noise levels in the \emph{log-ratios} at lower intensities than at higher intensities. The log-ratios should also be approximately symmetric around zero. In other words, \emph{a strong fanning effect is a good sign}.

  Due to different noise levels in the red and green channels, different PMT settings in different channels, plus the fact that the minimum signal is zero, "odd shapes" may be seen in the log-ratio versus log-intensity graphs at lower intensities. Typically, these show up as asymmetry between positive and negative log-ratios. Note that you should not see this at higher intensities.

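
  The fanning effect described above can be reproduced with a small simulation. The sketch below is only an illustration in base R under assumed conditions: unbiased signals, equal true expression in both channels, and additive Gaussian noise with a made-up standard deviation that is the same at all intensities. It is not part of the normalization method itself.

```r
set.seed(1)

# Simulate log-ratios for spots whose true signal is 'level' in both
# channels, with additive noise of standard deviation 'noiseSd'.
# Both the function and its default noise level are hypothetical.
simulateM <- function(level, n, noiseSd = 10) {
  R <- level + rnorm(n, sd = noiseSd)
  G <- level + rnorm(n, sd = noiseSd)
  ok <- (R > 0 & G > 0)   # log-ratios exist only for positive signals
  log2(R[ok] / G[ok])
}

n <- 5000
Mlow  <- simulateM(level = 50, n = n)     # low intensity
Mhigh <- simulateM(level = 5000, n = n)   # high intensity

# The spread of the log-ratios is far larger at the low intensity,
# while both sets remain roughly centered at zero.
print(c(sdLow = sd(Mlow), sdHigh = sd(Mhigh)))
print(c(medianLow = median(Mlow), medianHigh = median(Mhigh)))
```

  The same additive noise has a much larger relative effect on weak signals, which is why the log-ratios fan out at lower intensities.
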
  If there is a strong intensity-dependent effect left after the affine normalization, we recommend, for now, that a subsequent curve-fit or quantile normalization is done. Which of the two is preferable, we do not know.

  Why negative signals? By default, 5\% of the normalized signals will have a non-positive signal in one or both channels. \emph{This is on purpose}, although the exact number 5\% is chosen by experience. The reason for introducing negative signals is that they are indeed expected. For instance, when measuring a zero gene expression level, there is a chance that the observed value is (should be) negative due to measurement noise. (For this reason it is possible that the scanner manufacturers have introduced scanner bias on purpose to avoid negative signals, which would otherwise all be truncated to zero.) To adjust the ratio (or number) of negative signals allowed, use for example \code{normalizeAffine(X, constraint=0.01)} for 1\% negative signals. If set to zero (or \code{"max"}), only as much bias is removed such that no negative signals exist afterward. Note that this is also true if there were negative signals beforehand.

  Why not lowess normalization? Curve-fit normalization methods such as lowess normalization are basically designed based on linearity assumptions and will for this reason not correct for channel biases. Curve-fit normalization methods can by definition only be applied to one pair of channels at a time and therefore require a subsequent between-array scale normalization, which is, by the way, very ad hoc.

  Why not quantile normalization? Affine normalization can be thought of as a special case of quantile normalization that is more robust than the latter. See Bengtsson \& \enc{Hössjer}{Hossjer} (2006) for details. Quantile normalization is probably better to apply than curve-fit normalization methods, but less robust than affine normalization, especially at extreme (low and high) intensities.
  For this reason, we recommend using affine normalization first; if the result is not satisfactory, quantile normalization may be applied afterward.
}

\author{Henrik Bengtsson (\url{http://www.braju.com/R/})}

\keyword{documentation}