\author{Hao Wu}
\name{read.madata}
\alias{read.madata}
\title{Read Microarray data}

\description{
  This function reads Microarray experiment data from a TAB delimited
  text file or a matrix object.
}

\section{Preparing data file}{
  Before using the package, the user needs to prepare the input data file.

  1) The data file can be a matrix type R object, such as the output of
  exprs() from the affy or beadarray package. It is assumed that the
  intensity starts from the first column and that the row names are the
  probe IDs. Otherwise, the column numbers containing the probe ID and
  the intensity should be specified.

  2) The data file can be a TAB delimited text file. In this file, each
  row corresponds to a gene. The leading columns can hold gene specific
  information, e.g., the Probe ID, GenBank ID, etc., and the grid
  location of the spot. Most importantly, the intensity data must come
  after these columns.

  Most Microarray gridding software generates one file for each slide.
  You need to combine these files manually into the data file, and you
  need to decide which data to use in the analysis, e.g., mean versus
  median, background subtracted or not, etc.

  For an N-dye array, the intensity data should have N columns for each
  array, and these N columns need to be adjacent to each other. You can
  put the spot flag as a column after the intensity data for each array.
  (Note that if you have a flag, you will have N+1 columns of data for
  each array.) If you have replicates, replicated measurements of the
  same probe (clone) on the same array should appear in adjacent rows.

  For example, suppose that for a 2-dye cDNA array you have four slides
  scanned by GenePix, giving four files. First open your favorite
  spreadsheet editor, e.g., MS Excel. Copy your probe ID and Cluster ID
  to the first 2 columns. Then open one of the files generated by
  GenePix and copy the grid location into the next 4 columns (you only
  need to do this once because the grid locations are the same for all
  four slides).
  Then, for all four files, copy the two columns of foreground median
  values (if you want to use them) and one column of flags to the file
  in the order Cy5, Cy3, flag. Then select the whole file and sort the
  rows by probe ID. Save the file as a TAB delimited text file and you
  are done.

  The data file must be "full", that is, all rows must have the same
  number of fields. If there are missing data in your data file, you
  need to check the data or use \code{\link[maanova]{fill.missing}} to
  fill in the missing values.

  Leading and trailing TABs in the text file can cause problems,
  depending on the operating system, so the user needs to be careful
  about them.
}

\section{Preparing design file}{
  The design file can be a data.frame or matrix R object, or a TAB
  delimited text file. The number of rows equals the number of arrays
  times N (the number of dyes), plus one for the column header if the
  design file is a TAB delimited file and \code{header = TRUE}. The
  rows of the design file *MUST* follow the order of the data file
  unless the matchDataToDesign parameter is set to TRUE. For example,
  if the data file stores the intensity from array1, array11,
  array2,..., then the rows of the design file must follow this order.

  The number of columns depends on the experimental design. For
  example, you can have "Strain", "Diet", "Sex", etc. in your design
  file. You *MUST* have a column named "Array" in the design file. For
  a two-color array, in addition to the "Array" column, you must have
  "Sample" and "Dye" columns (case sensitive) in the design file.
  "Sample" should contain integers representing biological individuals.
  Reference samples should have the Sample number zero (0). The
  reference sample is always treated as a fixed factor in the mixed
  model and is not involved in any test.

  You must NOT have "Spot", "Label" and "covM" columns. They are
  reserved for spotting, labeling and covariance effects.

  Note that you DO NOT have to use all factors in the design file.
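
  As a minimal sketch of the layout described above, a design for a
  hypothetical 2-dye experiment with four arrays could be built directly
  in R. The array names, the dye labels and the sample numbering are
  illustrative assumptions, not values required by the package.

\preformatted{
# Hypothetical 2-dye experiment with 4 arrays (all values are illustrative)
n.array <- 4
n.dye <- 2

# One design row per array/dye pair, in the same order as the data columns
design <- data.frame(
  Array  = rep(paste0("array", 1:n.array), each = n.dye),
  Dye    = rep(c("Cy5", "Cy3"), times = n.array),
  Sample = c(1, 0, 2, 0, 3, 0, 4, 0)  # 0 marks the reference sample
)

# Number of rows must equal number of arrays times number of dyes
stopifnot(nrow(design) == n.array * n.dye)

# The reserved column names must not appear in the design
reserved <- c("Spot", "Label", "covM")
stopifnot(length(intersect(reserved, names(design))) == 0)
}

  A data.frame like this could then be passed as the designfile argument
  instead of a file name.
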
  You can put all factors in the design file but turn them on or off in
  the formula in \code{\link[maanova]{fitmaanova}}.
}

\section{Preparing covariate file}{
  If you have an array specific covariate, it should be included in the
  design matrix. If you have a gene specific covariate, you need to
  prepare a matrix type R object or a TAB delimited text file, "covM".
  The size of "covM" equals the size of the intensity data (a TAB
  delimited text file must have a column header if
  \code{header = TRUE}, but NO row names). Specify covM only if you
  have a gene specific covariate variable. The covariate variable must
  be numeric and needs to be specified in
  \code{\link[maanova]{fitmaanova}}.
}

\usage{
read.madata(datafile=datafile, designfile=designfile, covM = covM,
      arrayType=c("oneColor", "twoColor"),header=TRUE, spotflag=FALSE,
      n.rep=1, avgreps=0, log.trans=FALSE, metarow, metacol, row, col,
      probeid, intensity, matchDataToDesign=FALSE, ...)
}

\arguments{
  \item{datafile}{Matrix R object, or data file name with path as a
    string.}
  \item{designfile}{Matrix or data.frame R object, or design file name
    with path as a string.}
  \item{covM}{Gene specific covariate matrix. Specify this only if you
    have a gene specific covariate matrix.}
  \item{arrayType}{Specifies whether it is a one or two color array.
    Default is one color.}
  \item{header}{A logical value indicating whether the input files
    (data file, design file or covariate matrix) have a column header
    when they are TAB delimited files.}
  \item{spotflag}{A logical value indicating whether the input file
    contains the flag for bad spots.}
  \item{n.rep}{An integer giving the number of replicates.}
  \item{avgreps}{An integer indicating whether to average or collapse
    the replicates. 0 means no averaging; 1 means take the mean of the
    replicates; 2 means take the median of the replicates.}
  \item{log.trans}{A logical value indicating whether to apply a log2
    transformation to the raw data.
    It is FALSE by default. If this is TRUE, the \code{TransformMethod}
    field will be set to "log2".}
  \item{metarow}{For 2-dye array. The column number for meta row.
    Default values are 1s.}
  \item{metacol}{For 2-dye array. The column number for meta column.
    Default values are 1s.}
  \item{row}{For 2-dye array. The column number for row. Default value
    is NA.}
  \item{col}{For 2-dye array. The column number for column. Default
    value is NA.}
  \item{probeid}{The column number storing the probe (clone) id. When
    datafile is a matrix R object, the row names of the data are assumed
    to be the probe ids. If the data does not have row names, then
    1,2,... is used as the probe id. For a TAB delimited file, if
    probeid is not provided, the first column is assumed to store the
    probe id. If you do not have a probe id, set probeid = 0.}
  \item{intensity}{The start column number of the intensity data. By
    default, the intensity is assumed to start from the first column for
    a matrix R object and from the second column for a TAB delimited
    file.}
  \item{matchDataToDesign}{Defaults to FALSE. If set to TRUE, the
    datafile column headers (or colnames(datafile) in the case of a
    matrix) will be matched to the design file's Array column. This
    allows you to ignore the input order of the array data as long as
    the datafile's header values can be matched exactly to the
    designfile's Array values.}
  \item{\dots}{Other gene information in the data file.}
}

\value{
  An object of class \code{madata}, which is a list of the following
  components:
  \item{n.gene}{Total number of genes in the experiment.}
  \item{n.rep}{Number of replicates in the experiment, if .}
  \item{n.spot}{Number of spots for each gene.}
  \item{data}{Data field. It is either the log2 transformed data (if
    log.trans=TRUE), or the original data (if log.trans=FALSE).}
  \item{n.array}{Number of arrays in the experiment.}
  \item{n.dye}{Number of dyes.}
  \item{flag}{A matrix of spot flags. Each element corresponds to one
    spot.
    0 means a normal spot; all other values mean a bad spot.}
  \item{metarow}{Meta row for each spot.}
  \item{metacol}{Meta column for each spot.}
  \item{row}{Row for each spot.}
  \item{col}{Column for each spot.}
  \item{ArrayName}{A list of strings giving the names of the intensity
    data.}
  \item{design}{An object representing the experimental design.}
  \item{Others}{Other experiment information listed in the data file
    and specified by the user.}
}

\examples{
# Note that .CEL files are not distributed with the package, so the
# following code does not run as is. It shows how to read data from the
# affy (or beadarray) package when a TAB delimited design file is ready.
\dontrun{
library(affy)
beforeRma <- ReadAffy()
rmaData <- rma(beforeRma)
datafile <- exprs(rmaData)
abf1 <- read.madata(datafile=datafile, designfile="design.txt")

# make and read designfile (data.frame type R object) from R
design.table <- data.frame(Array=row.names(pData(beforeRma)))
Strain <- rep(c('Aj', 'B6', 'B6xAJ'), each=6)
Sample <- rep(c(1:9), each=2)
designfile <- cbind(design.table, Strain, Sample)
abf1 <- read.madata(datafile, designfile=designfile)

# read in a TAB delimited file with spot flag - for two color array
# HAVE TO SPECIFY that the data is from two color array
kidney.raw <- read.madata("kidney.txt", designfile="kidneydesign.txt",
      metarow=1, metacol=2, col=3, row=4, probeid=6,
      intensity=7, arrayType='twoColor', log.trans=TRUE, spotflag=TRUE)
}}

\keyword{IO}