%\VignetteKeywords{Database} %\VignetteDepends{curatedOvarianData} %\VignettePackage{curatedOvarianData} %\VignetteIndexEntry{curatedOvarianData} \documentclass[a4paper, 10pt]{article} \usepackage{rotating} <<>= BiocStyle::latex() @ \bioctitle[curatedOvarianData]{curatedOvarianData: Clinically Annotated Data for the Ovarian Cancer Transcriptome} \author{ Benjamin Frederick Ganzfried, Markus Riester, Benjamin \\ Haibe-Kains, Thomas Risch, Svitlana Tyekucheva, Ina Jazic,\\ Victoria Xin Wang, Mahnaz Ahmadifar, Michael Birrer, \\ Giovanni Parmigiani, Curtis Huttenhower, Levi Waldron} \begin{document} \date{2014} \maketitle \tableofcontents <>= library(sva) library(logging) options(width=60) @ \section{curatedOvarianData: Clinically Annotated Data for the Ovarian Cancer Transcriptome} This package represents a manually curated data collection for gene expression meta-analysis of patients with ovarian cancer. This resource provides uniformly prepared microarray data with curated and documented clinical metadata. It allows a computational user to efficiently identify studies and patient subgroups of interest for analysis and to run such analyses immediately without the challenges posed by harmonizing heterogeneous microarray technologies, study designs, expression data processing methods, and clinical data formats. The \Biocpkg{curatedOvarianData} package is published in the journal DATABASE \cite{ganzfried2013}. Note the existence also of \Biocpkg{curatedCRCData} and \Biocpkg{curatedBladderData}. Please see http://bcb.dfci.harvard.edu/ovariancancer for alterative versions of this package, differing in how redundant probe sets are dealt with. In this vignette, we give a short tour of the package and will show how to use it efficiently. \section{Load TCGA data} Loading a single dataset is very easy. First we load the package: <>= library(curatedOvarianData) @ To get a listing of all the datasets, use the \Rfunction{data} function: <>= data(package="curatedOvarianData") @ Now to load the TCGA data, we use the \Rfunction{data} function again: <>= data(TCGA_eset) TCGA_eset @ The datasets are provided as \Bioconductor{} \Rclass{ExpressionSet} objects and we refer to the Bioconductor documentation for users unfamiliar with this data structure. \section{Load datasets based on rules} For a meta-analysis, we typically want to filter datasets and patients to get a population of patients we are interested in. We provide a short but powerful R script that does the filtering and provides the data as a list of \Rclass{ExpressionSet} objects. One can use this script within R by first sourcing a config file which specifies the filters, like the minimum numbers of patients in each dataset. It is also possible to filter samples by annotation, for example to remove early stage and normal samples. <>= source(system.file("extdata", "patientselection.config",package="curatedOvarianData")) ls() @ See what the values of these variables we have loaded are. The variable names are fairly descriptive, but note that ``rule.1'' is a character vector of length 2, where the first entry is the name of a clinical data variable, and the second entry is a Regular Expression providing a requirement for that variable. Any number of rules can be added, with increasing identifiers, e.g. ``rule.2'', ``rule.3'', etc. Here strict.checking is FALSE, meaning that samples not annotated for the variables in these rules are allowed to pass the filter. If strict.checking == TRUE, samples missing this annotation will be removed. \subsection{Cleaning of duplicate samples} The patientselection.config file loaded above contains several objects indicating which samples were removed for QC and duplicate cleaning by Waldron \emph{et al.} \cite{waldron2014}: \begin{itemize} \item tcga.lowcor.outliers: two profiles identified in the TCGA dataset with anomolously low correlation to other ovc profiles \item duplicates: samples blacklisted because they contain duplicates. In the case of duplicates, generally better-annotated samples, and samples from more recent studies, were kept. \item remove.samples: the above to vectors of samples concatenated \end{itemize} <>= #remove.samples and duplicates are too voluminous: sapply(ls(), function(x) if(!x %in% c("remove.samples", "duplicates")) print(get(x))) @ Now that we have defined the sample filter, we create a list of \Rclass{ExpressionSet} objects by sourcing the \file{createEsetList.R} file: <>= source(system.file("extdata", "createEsetList.R", package = "curatedOvarianData")) @ It is also possible to run the script from the command line and then load the R data file within R: \begin{verbatim} R --vanilla "--args patientselection.config ovarian.eset.rda tmp.log" < createEsetList.R \end{verbatim} Now we have \Sexpr{length(esets)} datasets with samples that passed our filter in a list of \Rclass{ExpressionSet} objects called \texttt{esets}: <>= names(esets) @ \section{Association of CXCL12 expression with overall survival} Next we use the list of \Sexpr{length(esets)} datasets from the previous example and test if the expression of the CXCL12 gene is associated with overall survival. CXCL12/CXCR4 is a chemokine/chemokine receptor axis that has previously been shown to be directly involved in cancer pathogenesis. We first define a function that will generate a forest plot for a given gene. It needs the overall survival information as \Rclass{Surv} objects, which the \file{createEsetList.R} function already added in the \texttt{phenoData} slots of the \Rclass{ExpressionSet} objects, accessible at the \texttt{y} label. The resulting forest plot is shown for the CXCL12 gene in Figure~\ref{fig:example3:forest}. <>= esets[[1]]$y forestplot <- function(esets, y="y", probeset, formula=y~probeset, mlab="Overall", rma.method="FE", at=NULL,xlab="Hazard Ratio",...) { require(metafor) esets <- esets[sapply(esets, function(x) probeset %in% featureNames(x))] coefs <- sapply(1:length(esets), function(i) { tmp <- as(phenoData(esets[[i]]), "data.frame") tmp$y <- esets[[i]][[y]] tmp$probeset <- exprs(esets[[i]])[probeset,] summary(coxph(formula,data=tmp))$coefficients[1,c(1,3)] }) res.rma <- metafor::rma(yi = coefs[1,], sei = coefs[2,], method=rma.method) if (is.null(at)) at <- log(c(0.25,1,4,20)) forest.rma(res.rma, xlab=xlab, slab=gsub("_eset$","",names(esets)), atransf=exp, at=at, mlab=mlab,...) return(res.rma) } @ \begin{figure} \centering <>= res <- forestplot(esets=esets,probeset="CXCL12",at=log(c(0.5,1,2,4))) @ \caption{The database confirms CXCL12 as prognostic of overall survival in patients with ovarian cancer. Forest plot of the expression of the chemokine CXCL12 as a univariate predictor of overall survival, using all \Sexpr{length(esets)} datasets with applicable expression and survival information. A hazard ratio significantly larger than 1 indicates that patients with high CXCL12 levels had poor outcome. The p-value for the overall HR, found in res\$pval, is \Sexpr{signif(res$pval, 2)}. This plot is Figure 3 of the curatedOvarianData manuscript. } \label{fig:example3:forest} \end{figure} We now test whether CXCL12 is an independent predictor of survival in a multivariate model together with success of debulking surgery, defined as residual tumor smaller than 1 cm, and Federation of Gynecology and Obstetrics (FIGO) stage. We first filter the datasets without debulking and stage information: <>= idx.tumorstage <- sapply(esets, function(X) sum(!is.na(X$tumorstage)) > 0 & length(unique(X$tumorstage)) > 1) idx.debulking <- sapply(esets, function(X) sum(X$debulking=="suboptimal",na.rm=TRUE)) > 0 @ In Figure~\ref{fig:example4:forest}, we see that CXCL12 stays significant after adjusting for debulking status and FIGO stage. We repeated this analysis for the CXCR4 receptor and found no significant association with overall survival (Figure~\ref{fig:example5:forest}). \begin{figure} \centering <>= res <- forestplot(esets=esets[idx.debulking & idx.tumorstage], probeset="CXCL12",formula=y~probeset+debulking+tumorstage, at=log(c(0.5,1,2,4))) @ \caption{Validation of CXCL12 as an independent predictor of survival. This figure shows a forest plot as in Figure~\ref{fig:example3:forest}, but the CXCL12 expression levels were adjusted for debulking status (optimal versus suboptimal) and tumor stage. The p-value for the overall HR, found in res\$pval, is \Sexpr{signif(res$pval, 2)}.} \label{fig:example4:forest} \end{figure} \begin{figure} \centering <>= res <- forestplot(esets=esets,probeset="CXCR4",at=log(c(0.5,1,2,4))) @ \caption{Up-regulation of CXCR4 is not associated with overall survival. This figure shows again a forest plot as in Figure~\ref{fig:example3:forest}, but here the association of mRNA expression levels of the CXCR4 receptor and overall survival is shown. The p-value for the overall HR, found in res\$pval, is \Sexpr{signif(res$pval, 2)}.} \label{fig:example5:forest} \end{figure} \clearpage \section{Batch correction with ComBat} If datasets are merged, it is typically recommended to remove a very likely batch effect. We will use the ComBat \cite{Johnson:2007} method, implemented for example in the SVA Bioconductor package \cite{sva}. To combine two ExpressionSet objects, we can use the \Rfunction{combine()} function. This function will fail when the two ExpressionSets have conflicting annotation slots, for example \texttt{annotation} when the platforms differ. We write a simple \texttt{combine2} function which only considers the \texttt{exprs} and \texttt{phenoData} slots: <>= combine2 <- function(X1, X2) { fids <- intersect(featureNames(X1), featureNames(X2)) X1 <- X1[fids,] X2 <- X2[fids,] ExpressionSet(cbind(exprs(X1),exprs(X2)), AnnotatedDataFrame(rbind(as(phenoData(X1),"data.frame"), as(phenoData(X2),"data.frame"))) ) } @ In Figure~\ref{fig:batch:1}, we combined two datasets from different platforms, resulting in a huge batch effect. \begin{figure} <>= data(E.MTAB.386_eset) data(GSE30161_eset) X <- combine2(E.MTAB.386_eset, GSE30161_eset) boxplot(exprs(X)) @ \caption{Boxplot showing the expression range for all samples of two merged datasets arrayed on different platforms. This illlustrates a huge batch effect.} \label{fig:batch:1} \end{figure} Now we apply ComBat and adjust for the batch and show the boxplot after batch correction in Figure~\ref{fig:batch:2}: <>= mod <- model.matrix(~as.factor(tumorstage), data=X) batch <- as.factor(grepl("DFCI",sampleNames(X))) combat_edata <- ComBat(dat=exprs(X), batch=batch, mod=mod) @ \begin{figure} <>= boxplot(combat_edata) @ \caption{Boxplot showing the expression range for all samples of two merged datasets arrayed on different platforms after batch correction with ComBat.} \label{fig:batch:2} \end{figure} \clearpage \section{Non-specific probe sets} In the standard version of curatedOvarianData (the version available on Bioconductor), we collapse manufacturer probesets to official HGNC symbols using the Biomart database. Some probesets are mapped to multiple HGNC symbols in this database. For these probesets, we provide all the symbols. For example \texttt{220159\_at} maps to \textit{ABCA11P} and \textit{ZNF721} and we provide \texttt{ABCA11P///ZNF721} as probeset name. If you have an array of gene symbols for which you want to access the expression data, "ABCA11P" would not be found in curatedOvarianData in this example. The script createEsetList.R provides three methods to deal with non-specific probe sets by setting the variable $probes.not.mapped.uniquely$ to: \begin{itemize} \item "keep": leave as-is, these have "///" in gene names, \item "drop": drop any non-uniquely mapped features, or \item "split": split non-uniquely mapped features to one per row. If this creates duplicate rows for a gene, those rows are averaged. \end{itemize} This feature uses the following function to create a new ExpressionSet, in which both \textit{ZNF721} and \textit{ABCA11P} are features with identical expression data: <>= expandProbesets <- function (eset, sep = "///") { x <- lapply(featureNames(eset), function(x) strsplit(x, sep)[[1]]) eset <- eset[order(sapply(x, length)), ] x <- lapply(featureNames(eset), function(x) strsplit(x, sep)[[1]]) idx <- unlist(sapply(1:length(x), function(i) rep(i, length(x[[i]])))) xx <- !duplicated(unlist(x)) idx <- idx[xx] x <- unlist(x)[xx] eset <- eset[idx, ] featureNames(eset) <- x eset } X <- TCGA_eset[head(grep("///", featureNames(TCGA_eset))),] exprs(X)[,1:3] exprs(expandProbesets(X))[,1:3] @ \section{FULLVcuratedOvarianData} In curatedOvarianData, probesets mapping to the same gene symbol are merge by selecting the probeset with maximum mean across all studies of a given platform. You can see which representative probeset was chosen by looking at the featureData of the Expressionset, e.g.: <>= head(pData(featureData(GSE18520_eset))) @ The full, unmerged ExpressionSets are available through the FULLVcuratedOvarianData package at http://bcb.dfci.harvard.edu/ovariancancer/. Probeset to gene maps are again provided in the featureData of those \Robject{ExpressionSet}s. Where official \Bioconductor annotation packages are available for the array, these are stored in the \Robject{ExpressionSet} annotation slots, e.g.: <>= annotation(GSE18520_eset) @ so that standard filtering methods such as \Rfunction{nsFilter} will work by default. \section{Available Clinical Characteristics} <>= rm(list=ls()) source(system.file("extdata", "patientselection_all.config",package="curatedOvarianData")) source(system.file("extdata", "createEsetList.R", package = "curatedOvarianData")) @ \begin{figure}[!htb] \centering <>= .esetsStats <- function(esets) { res <- lapply(varLabels(esets[[1]]), function(covar) unlist(sapply(esets, function(X) sum(!is.na(X[[covar]]))>0))) names(res) <- varLabels(esets[[1]]) do.call(rbind, res) } df.r <- .esetsStats(esets) M <- as.matrix(apply(df.r,c(1,2),ifelse,0,1)) colnames(M) <- gsub("_eset$", "", colnames(M)) # no need to show the sample ids M <- M[-(1:2),] heatmap(M[nrow(M):1,],scale="none",margins=c(8,10),Rowv=NA) @ \caption{Available clinical annotation. This heatmap visualizes for each curated clinical characteristic (rows) the availability in each dataset (columns). Red indicates that the corresponding characteristic is available for at least one sample in the dataset. This plot is Figure 2 of the curatedOvarianData manuscript.} \end{figure} \section{Summarizing the List of ExpressionSets} This example provides a table summarizing the datasets being used, and is useful when publishing analyses based on curatedOvarianData. First, define some useful functions for this purpose: <>= source(system.file("extdata", "summarizeEsets.R", package = "curatedOvarianData")) @ Now create the table, used for Table 1 of the curatedOvarianData manuscript: <>= summary.table <- t(sapply(esets, getEsetData)) rownames(summary.table) <- sub("_eset", "", rownames(summary.table)) @ Optionally write this table to file, for example ( replace myfile <- tempfile() with something like myfile <- ``nicetable.csv'' ) <>= (myfile <- tempfile()) write.table(summary.table, file=myfile, row.names=FALSE, quote=TRUE, sep=",") @ <>= library(xtable) print(xtable(summary.table[, c(2, 3, 4, 5, 7)], caption="Datasets provided by curatedOvarianData. This is an abbreviated version of Table 1 of the manuscript; the full version is written by the write.table command above. Stage column is early/late/unknown, histology column is ser/clearcell/endo/mucinous/other/unknown.", table.placement="p", caption.placement="bottom"), floating.environment='sidewaystable') @ \section{For non-R users} If you are not doing your analysis in R, and just want to get some data you have identified from the curatedOvarianData manual, here is a simple way to do it. For one dataset: <>= library(curatedOvarianData) library(affy) data(GSE30161_eset) write.csv(exprs(GSE30161_eset), file="GSE30161_eset_exprs.csv") write.csv(pData(GSE30161_eset), file="GSE30161_eset_clindata.csv") @ Or for several datasets: <>= data.to.fetch <- c("GSE30161_eset", "E.MTAB.386_eset") for (onedata in data.to.fetch){ print(paste("Fetching", onedata)) data(list=onedata) write.csv(exprs(get(onedata)), file=paste(onedata, "_exprs.csv", sep="")) write.csv(pData(get(onedata)), file=paste(onedata, "_clindata.csv", sep="")) } @ \section{Session Info} <>= toLatex(sessionInfo()) @ \clearpage \bibliography{curatedOvarianData_vignette} \end{document}