\name{backgroundData}
\alias{backgroundData}
\title{Generation of background expression set}
\description{The function generates background expression sets
  using one of three methods: permutation within rows, a Gaussian
  distribution, or first-order autoregressive (AR(1)) models.
}

  \usage{backgroundData(eset,model=c("rr", "gauss", "ar1"))}
  \arguments{\item{eset}{object of class \dQuote{ExpressionSet}}
    \item{model}{model used to generate the background data:
      \dQuote{rr} for random permutation within rows,
      \dQuote{gauss} for the Gaussian model and
      \dQuote{ar1} for the AR(1) model}}

    
\details{Microarray data comprise measurements of transcript levels for many thousands of genes. Because of the large number of genes, some genes can be expected to show periodicity simply by chance. To assess the significance of periodic signals, it is therefore necessary first to define what distribution of signals can be expected if the studied process exhibits no true periodicity. In statistical terms, this is equivalent to defining a null hypothesis of non-periodic expression.

The simplest model for non-periodic expression is based on randomization of the observed time series. A background distribution can then be constructed by (repeated) random permutation of the sequentially ordered measurements in the experiment. This background model is used if \code{model="rr"} is chosen.

Alternatively, non-periodic expression can be derived from a statistical model. A conventional approach is to assume that the data are normally distributed and to use the normal distribution as the background model. This background model is used if \code{model="gauss"} is chosen.

However, these two approaches neglect the fact that time series data generally exhibit considerable autocorrelation, i.e. correlation between successive measurements. Therefore, neither the assumption of data normality nor the assumption underlying randomization may hold. As demonstrated for yeast cell cycle data (Futschik and Herzel, \emph{Bioinformatics} 2008), this failure can substantially interfere with significance testing, and neglecting autocorrelation can lead to a considerable overestimation of the number of periodically expressed genes.

A more suitable model is based on autoregressive processes of order one (AR(1)),
for which the value of the time-dependent variable \emph{X} depends on its previous value up to a normally distributed random variable \emph{Z}. Such a model is used if
\code{model="ar1"} is chosen. The autocorrelation of \emph{X} and the variance of \emph{Z} are estimated separately for each feature of the \code{ExpressionSet} object. Mathematical details can be found in the given reference.
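
As a sketch of the model described above (the symbols \eqn{\alpha}{alpha} and \eqn{\sigma}{sigma} and the zero-mean form are notation assumed here for illustration and may differ from the notation in the reference), an AR(1) process can be written as
\deqn{X_t = \alpha X_{t-1} + Z_t, \qquad Z_t \sim N(0, \sigma^2),}{X[t] = alpha * X[t-1] + Z[t],   Z[t] ~ N(0, sigma^2),}
where, for a stationary process, \eqn{\alpha}{alpha} equals the lag-one autocorrelation of \emph{X} and \eqn{\sigma^2}{sigma^2} is the variance of \emph{Z}.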

It is important to note in this context that AR(1) processes cannot capture periodic patterns, except for alternations with period two. Since \emph{Z} is a random variable, we can readily generate a collection of time series with the same autocorrelation as in the original data set. Therefore, although AR(1) processes are random processes, they allow us to construct a background distribution that captures the autocorrelation structure of the original gene expression time series without fitting any periodic pattern the data may contain.
}

\value{ExpressionSet object with expression data generated by the chosen background model}

\note{This function evaluates solely the \code{exprs} matrix;
no information from the \code{phenoData} is used. In particular,
the ordering of samples (arrays) is taken to be the ordering
of the columns in the \code{exprs} matrix. Also, replicated arrays in the
\code{exprs} matrix are treated as independent,
i.e. they should be averaged prior to analysis or placed into
distinct \dQuote{ExpressionSet} objects.}


\author{Matthias E. Futschik (\url{http://www.cbme.ualg.pt/mfutschik_cbme.html})}

\references{Matthias E. Futschik and Hanspeter Herzel (2008) Are we overestimating the number of cell-cycling genes? The impact of background models on time series analysis, \emph{Bioinformatics}, 24(8):1063-1069.
}
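
\examples{
## A minimal usage sketch, not taken from the package sources.
## Assumptions: Biobase is installed and the package providing
## backgroundData() is attached. The toy matrix below is made-up
## random data, used only to show the calling convention.
library(Biobase)

set.seed(1)
## 20 features (rows) measured at 12 time points (columns, in time order)
x <- matrix(rnorm(20 * 12), nrow = 20,
            dimnames = list(paste0("g", 1:20), paste0("t", 1:12)))
eset <- ExpressionSet(assayData = x)

## Generate one background data set per model
bg.rr    <- backgroundData(eset, model = "rr")     # permutation within rows
bg.gauss <- backgroundData(eset, model = "gauss")  # Gaussian model
bg.ar1   <- backgroundData(eset, model = "ar1")    # AR(1) model

## Each result is again an ExpressionSet; inspect the generated data
dim(exprs(bg.ar1))
}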

\keyword{distribution}
\keyword{datagen}