% \VignetteIndexEntry{Creating reference datasets} % \VignetteDepends{} % \VignetteKeywords{gCMAP} % \VignettePackage{gCMAP} \documentclass[10pt]{article} \usepackage{times} \usepackage{hyperref} \usepackage{Sweave} \textwidth=6.5in \textheight=8.5in \oddsidemargin=-.1in \evensidemargin=-.1in \headheight=-.3in \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textsf{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \title{Primer: Preparing NChannelSet objects with differential expression scores} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle This document exemplifies the processing of differential expression data using small, simulated datasets shipped with the gCMAP package. To see real-life examples with data from available from public databases, please refer to the documentation of the \Rpackage{gCMAPWeb} companion package. \section{Differential expression analysis} <>= options(warn=-1) @ The \tt gCMAP \rm package offers a the \tt generate\_gCMAP\_NChannelSet \rm function to process multiple instances differential expression experiments with two classes (e.g. cases vs controls). For microarray data, the \tt limma \rm package is used to calculate a moderated t-statistic (default). Optionally, a standard t-test can be computed instead. For RNAseq data, the \tt DESeq \rm package is used instead. Data preprocessing differs considerably between different technologies and array platforms and needs to be performed beforehand. Normalized microarray data and accompanying annotation is passed to \tt generate\_gCMAP\_NChannelSet \rm as a list of \tt ExpressionSet \rm objects, RNAseq data can be passed as a list of \tt CountDataSet \rm objects instead. To generate a set of 3 example \tt CountDataSets \rm, we use the \tt makeExampleCountDataSet \rm function from the \tt DESeq \rm package. <>= library(gCMAP) library(DESeq) set.seed( 123 ) cds.list <- lapply( 1:3, function(n) { cds <- makeExampleCountDataSet() featureNames(cds) <- paste("gene",1:10000, sep="_") cds }) names(cds.list) <- paste("Instance", 1:3, sep="") sapply(cds.list, dim) sapply(cds.list, function(n) pData(n)$condition ) @ By default, each \tt CountDataset \rm object contains counts for 10000 genes from five samples. Each sample is assigned to one of two conditions, A or B, in the phenoData slot of the \tt CountDataSet \rm. The pData column containing group membership information ( e.g. "condition" ) is provided as the control\_perturb\_col parameter. The levels associated with control and treatment groups are specified as "control" and "perturb" character strings. Each of the three \tt CountDataSet \rm instances is analyzed individually by \tt generate\_gCMAP\_NChannelSet \rm. To assemble the results into a single NChannelSet, the input \tt ExpressionSet \rm or \tt CountDataSet \rm objects must contain measurements for the same features (e.g. the vectors returned by "featureNames" must be identical across all instances). To include information about the instances in the NChannelSet, a 'sample.annotation' data.frame can be provided, containing exactly one row for each element of the input list of \tt ExpressionSet / CountDataSet \rm objects. <>= ## this step takes a little time cde <- generate_gCMAP_NChannelSet(cds.list, uids=1:3, sample.annotation=NULL, platform.annotation="Entrez", control_perturb_col="condition", control="A", perturb="B") channelNames(cde) @ For array data, a \tt NChannelSet \rm with slots "exprs","z", "p", and "log\_fc" is returned, containing the average intenstity across all samples within the instance, z-scores, (raw) p-values and log2 fold changes, respectively. If count data is processed, an additional "mod\_fc" channel is returned, providing the moderated fold change, calculated after performing variance-stabilising transformation across all input instances. (Please consult the \tt DESeq \rm vignette for details.) \subsection{Storing assayData as BigMatrix objects on disk} When large numbers of instances are processed, the resulting \tt NChannelSet \rm objects can require large amounts of memory. The \tt bigmemory \rm and \tt bigmemoryExtras \rm packages can be used to create \tt BigMatrix \rm objects, allowing methods to subset large datasets without having to load them fully into memory first. Note: at the time of writing, the \tt bigmemory \rm package was only available for Unix and Mac OS X operating systems but not for Windows. Windows users can take advantage of \tt gCMAP's \rm functionality but datasets must be fully loaded into memory first. If the \tt bigmemory \rm and \tt bigmemoryExtras \rm packages are available and a file name is provided via the "big.matrix" parameter, \tt generate\_gCMAP\_NChannelSet \rm uses the \tt BigMatrix \rm package to store data from each channel on disk. In the future, individual channels and / or subsets of the datasets can then be loaded without requiring the full object to be read into memory again. To highlight this functionality, we derive three (arbitrary) instances from the sample.ExpressionSet object available from tbe \tt Biobase \rm package, process them and store the results in a temporary directory. Note: this section will only created the expected \tt big.matrix \rm files on disk if the \tt bigmemory \rm and \tt bigmemoryExtras \rm packages can be loaded. Otherwise, a standard Rdata object is created and a warning is issued. <>= ## list of ExpressionSets data("sample.ExpressionSet") ## from Biobase es.list <- list( sample.ExpressionSet[,1:4], sample.ExpressionSet[,5:8], sample.ExpressionSet[,9:12]) ## three instances names(es.list) <- paste( "Instance", 1:3, sep=".") storage.file <- tempfile() storage.file ## filename prefix for BigMatrices de <- generate_gCMAP_NChannelSet( es.list, 1:3, platform.annotation = annotation(es.list[[1]]), control_perturb_col="type", control="Control", perturb="Case", big.matrix=storage.file) channelNames(de) head( assayDataElement(de, "z") ) dir(dirname( storage.file )) @ If the \tt bigmemoryExtras \rm package is available, it generated a \tt BigMatrix \rm objects containing pointers to three files in the temporary directory, one for each channel (identified by their suffices). If the package is unavailable, a standard eSet is saved to disk, which will be read fully into memory upon reload. To demonstrate the use of disk-based \tt NChannelSet \rm objects, we will first delete the object from the current R workspace and reload it from disk. Accessing the complete matrix in the assayData slots, e.g. for the "z" channel, returns another BigMatrix object with assayData slot pointing to the associated file on disk. Upon subsetting, only the requested part of the dataset is loaded into memory. <>= ## remove de object from R session and reload rm( de ) de <-get( load( paste( storage.file, "rdata", sep=".")) ) class( assayDataElement(de, "z") ) assayDataElement(de, "z")[1:10,] ## load subset @ The \tt memorize \rm function reads the complete \tt NChannelSet \rm into memory. In addition, one or more selected channels can be specified with the 'name' parameter. <>= ## read z-score channel into memory dem <- memorize( de, name="z" ) channelNames(dem) class( assayDataElement(dem, "z") ) ## matrix @ <>= ## remove tempfiles unlink(paste(storage.file,"*", sep="")) @ <>= sessionInfo() @ \end{document}