%\VignetteKeywords{Database}
%\VignetteDepends{curatedOvarianData}
%\VignettePackage{curatedOvarianData}
%\VignetteIndexEntry{curatedOvarianData}

\documentclass[a4paper, 10pt]{article}

\usepackage{rotating}

<<<style-Sweave, eval=TRUE, echo=FALSE, results=tex>>=
BiocStyle::latex()
@ 


\bioctitle[curatedOvarianData]{curatedOvarianData: Clinically Annotated Data for the Ovarian Cancer Transcriptome}
\author{
Benjamin Frederick Ganzfried, Markus Riester, Benjamin \\
Haibe-Kains, Thomas Risch, Svitlana Tyekucheva, Ina Jazic,\\
Victoria Xin Wang, Mahnaz Ahmadifar, Michael Birrer, \\
Giovanni Parmigiani, Curtis Huttenhower, Levi Waldron}

\begin{document}
\date{2014}

\maketitle
\tableofcontents

<<preliminaries,echo=FALSE,results=hide>>=
library(sva)
library(logging)
options(width=60)
@ 


\section{curatedOvarianData: Clinically Annotated Data for the Ovarian Cancer
Transcriptome}

This package represents a manually curated data collection for gene expression
meta-analysis of patients with ovarian cancer. This resource provides
uniformly prepared microarray data with curated and documented clinical
metadata.  It allows a computational user to efficiently identify studies and
patient subgroups of interest for analysis and to run such analyses
immediately without the challenges posed by harmonizing heterogeneous
microarray technologies, study designs, expression data processing methods,
and clinical data formats.  

The \Biocpkg{curatedOvarianData} package is published in the journal
DATABASE \cite{ganzfried2013}. Note the existence also of
\Biocpkg{curatedCRCData} and \Biocpkg{curatedBladderData}.

Please see http://bcb.dfci.harvard.edu/ovariancancer for alterative
versions of this package, differing in how redundant probe sets are
dealt with.

In this vignette, we give a short tour of the package and will show how to use
it efficiently.

\section{Load TCGA data}
Loading a single dataset is very easy. First we load the package:
<<example1tcgastep1>>=
library(curatedOvarianData)
@
To get a listing of all the datasets, use the \Rfunction{data} function:
<<example1tcgastep2, eval=FALSE>>=
data(package="curatedOvarianData")
@
Now to load the TCGA data, we use the \Rfunction{data} function again:
<<example1tcgastep2>>=
data(TCGA_eset)
TCGA_eset
@
The datasets are provided as \Bioconductor{} \Rclass{ExpressionSet} objects and
we refer to the Bioconductor documentation for users unfamiliar with this data
structure. 

\section{Load datasets based on rules}
For a meta-analysis, we typically want to filter datasets and patients to
get a population of patients we are interested in. We provide a short but
powerful R script that does the filtering and provides the data as a list of
\Rclass{ExpressionSet} objects. One can use this script within R by first sourcing a config
file which specifies the filters, like the minimum numbers of patients in each
dataset. It is also possible to filter samples by annotation, for example to remove
early stage and normal samples. 
<<example2loadstep1>>=
source(system.file("extdata", 
"patientselection.config",package="curatedOvarianData"))
ls()
@

See what the values of these variables we have loaded are.  The
variable names are fairly descriptive, but note that ``rule.1'' is a
character vector of length 2, where the first entry is the name of a
clinical data variable, and the second entry is a Regular Expression
providing a requirement for that variable.  Any number of rules can be
added, with increasing identifiers, e.g. ``rule.2'', ``rule.3'', etc.

Here strict.checking is FALSE, meaning that samples not annotated for
the variables in these rules are allowed to pass the filter.  If
strict.checking == TRUE, samples missing this annotation will be
removed.

\subsection{Cleaning of duplicate samples}
The patientselection.config file loaded above contains several objects
indicating which samples were removed for QC and duplicate cleaning by
Waldron \emph{et al.} \cite{waldron2014}: 

\begin{itemize}
  \item tcga.lowcor.outliers: two profiles identified in the TCGA dataset with anomolously low correlation to other ovc profiles
  \item duplicates: samples blacklisted because they contain
      duplicates.  In the case of duplicates, generally
      better-annotated samples, and samples from more recent studies,
      were kept.
  \item remove.samples: the above to vectors of samples concatenated
\end{itemize}

<<showls>>=
#remove.samples and duplicates are too voluminous:
sapply(ls(), function(x) if(!x %in% c("remove.samples", "duplicates")) print(get(x)))
@ 

Now that we have defined the sample filter, we create a list of
\Rclass{ExpressionSet} objects by sourcing the \file{createEsetList.R} file:
<<example2loadstep2>>=
source(system.file("extdata", "createEsetList.R", package =
"curatedOvarianData"))
@
It is also possible to run the script from the command line and then load the
R data file within R:
\begin{verbatim}
R --vanilla "--args patientselection.config ovarian.eset.rda tmp.log"  < createEsetList.R 
\end{verbatim}
Now we have \Sexpr{length(esets)} datasets with samples that passed our filter in a list of
\Rclass{ExpressionSet} objects called \texttt{esets}:
<<example2loadstep3>>=
names(esets)
@
\section{Association of CXCL12 expression with overall survival}
Next we use the list of \Sexpr{length(esets)} datasets from the previous
example and test if the expression of the CXCL12 gene is associated with
overall survival.  CXCL12/CXCR4 is a chemokine/chemokine receptor axis that
has previously been shown to be directly involved in cancer pathogenesis.  

We first define a function that will generate a forest plot for a given gene. It
needs the overall survival information as \Rclass{Surv} objects, which the
\file{createEsetList.R} function already added in the \texttt{phenoData}
slots of the \Rclass{ExpressionSet} objects, accessible at the \texttt{y} label.  The
resulting forest plot is shown for the CXCL12 gene in
Figure~\ref{fig:example3:forest}.

<<example3prepare>>=
esets[[1]]$y
forestplot <- function(esets, y="y", probeset, formula=y~probeset,
mlab="Overall", rma.method="FE", at=NULL,xlab="Hazard Ratio",...) {
    require(metafor)
    esets <- esets[sapply(esets, function(x) probeset %in% featureNames(x))]
    coefs <- sapply(1:length(esets), function(i) {
        tmp   <- as(phenoData(esets[[i]]), "data.frame")
        tmp$y <- esets[[i]][[y]]
        tmp$probeset <- exprs(esets[[i]])[probeset,]

        summary(coxph(formula,data=tmp))$coefficients[1,c(1,3)]
    })  

    res.rma <- metafor::rma(yi = coefs[1,], sei = coefs[2,], 
        method=rma.method)

    if (is.null(at)) at <- log(c(0.25,1,4,20))
    forest.rma(res.rma, xlab=xlab, slab=gsub("_eset$","",names(esets)),
    atransf=exp, at=at, mlab=mlab,...)
    return(res.rma)
}
@ 

\begin{figure}
\centering
<<example3plot,fig=TRUE>>=
res <- forestplot(esets=esets,probeset="CXCL12",at=log(c(0.5,1,2,4)))
@
\caption{The database confirms CXCL12 as prognostic of overall
  survival in patients with ovarian cancer.  Forest plot of the
  expression of the chemokine CXCL12 as a univariate predictor of
  overall survival, using all \Sexpr{length(esets)} datasets with
  applicable expression and survival information.  A hazard ratio
  significantly larger than 1 indicates that patients with high CXCL12
  levels had poor outcome.  The p-value for the overall HR, found in
  res\$pval, is \Sexpr{signif(res$pval, 2)}. This plot is Figure 3 of
  the curatedOvarianData manuscript. }
\label{fig:example3:forest}
\end{figure}

We now test whether CXCL12 is an independent predictor of survival in a
multivariate model together with success of debulking surgery, defined as residual
tumor smaller than 1 cm, and Federation of Gynecology and Obstetrics (FIGO)
stage. We first filter the datasets without debulking and stage information:

<<filterdatasets>>=
idx.tumorstage <- sapply(esets, function(X) 
    sum(!is.na(X$tumorstage)) > 0 & length(unique(X$tumorstage)) > 1)

idx.debulking <- sapply(esets, function(X) 
    sum(X$debulking=="suboptimal",na.rm=TRUE)) > 0 
@ 

In Figure~\ref{fig:example4:forest}, we see that CXCL12 stays significant after
adjusting for debulking status and FIGO stage. We repeated this analysis for
the CXCR4 receptor and found no significant association with overall survival
(Figure~\ref{fig:example5:forest}).

\begin{figure}
\centering
<<example4plot,fig=TRUE>>=
res <- forestplot(esets=esets[idx.debulking & idx.tumorstage],
    probeset="CXCL12",formula=y~probeset+debulking+tumorstage, 
    at=log(c(0.5,1,2,4)))
@
\caption{Validation of CXCL12 as an independent predictor of survival. This
figure shows a forest plot as in Figure~\ref{fig:example3:forest}, but the
CXCL12 expression levels were adjusted for debulking status (optimal versus
suboptimal) and tumor stage.  The p-value for the overall HR, found in
  res\$pval, is \Sexpr{signif(res$pval, 2)}.}
\label{fig:example4:forest}
\end{figure}

\begin{figure}
\centering
<<example5plot,fig=TRUE>>=
res <- forestplot(esets=esets,probeset="CXCR4",at=log(c(0.5,1,2,4)))
@
\caption{Up-regulation of CXCR4 is not associated with overall survival. This
figure shows again a forest plot as in Figure~\ref{fig:example3:forest}, but
here the association of mRNA expression levels of the CXCR4 receptor and
overall survival is shown.  The p-value for the overall HR, found in
  res\$pval, is \Sexpr{signif(res$pval, 2)}.}
\label{fig:example5:forest}
\end{figure}

\clearpage
\section{Batch correction with ComBat}

If datasets are merged, it is typically recommended to remove a very likely
batch effect. We will use the ComBat \cite{Johnson:2007} method, implemented
for example in the SVA Bioconductor package \cite{sva}.  To combine two ExpressionSet
objects, we can use the \Rfunction{combine()} function. This function will fail
when the two ExpressionSets have conflicting annotation slots, for example
\texttt{annotation} when the platforms differ. We write a simple
\texttt{combine2} function which only considers the \texttt{exprs} and
\texttt{phenoData} slots:

<<combine2>>=
combine2 <- function(X1, X2) {
    fids <- intersect(featureNames(X1), featureNames(X2))
    X1 <- X1[fids,]
    X2 <- X2[fids,]
    ExpressionSet(cbind(exprs(X1),exprs(X2)), 
        AnnotatedDataFrame(rbind(as(phenoData(X1),"data.frame"), 
                                 as(phenoData(X2),"data.frame")))
    )
}
@

In Figure~\ref{fig:batch:1}, we combined two datasets from different platforms, resulting in a huge batch effect.
\begin{figure}
<<boxplot1, fig=TRUE>>=
data(E.MTAB.386_eset)
data(GSE30161_eset)
X <- combine2(E.MTAB.386_eset, GSE30161_eset)
boxplot(exprs(X))
@
\caption{Boxplot showing the expression range for all samples of two merged datasets arrayed on different platforms. This illlustrates a huge batch effect.}
\label{fig:batch:1}
\end{figure}
Now we apply ComBat and adjust for the batch and show the boxplot after batch correction in Figure~\ref{fig:batch:2}:
<<combat>>=
mod <- model.matrix(~as.factor(tumorstage), data=X)
batch <- as.factor(grepl("DFCI",sampleNames(X)))
combat_edata <- ComBat(dat=exprs(X), batch=batch, mod=mod)
@

\begin{figure}
<<boxplot2, fig=TRUE>>=
boxplot(combat_edata)
@
\caption{Boxplot showing the expression range for all samples of two merged datasets arrayed on different platforms after batch correction with ComBat.}
\label{fig:batch:2}
\end{figure}

\clearpage
\section{Non-specific probe sets}

In the standard version of curatedOvarianData (the version available on
Bioconductor), we collapse manufacturer probesets to official HGNC symbols
using the Biomart database. Some probesets are mapped to multiple HGNC symbols
in this database. For these probesets, we provide all the symbols. For example 
\texttt{220159\_at} maps to \textit{ABCA11P} and \textit{ZNF721} and we
provide \texttt{ABCA11P///ZNF721} as probeset name. If you have an array of
gene symbols for which you want to access the expression data, "ABCA11P" would
not be found in curatedOvarianData in this example. 

The script createEsetList.R provides three methods to deal with
non-specific probe sets by setting the variable $probes.not.mapped.uniquely$ to:

\begin{itemize}
   \item  "keep":  leave as-is, these have "///" in gene names,
   \item  "drop":  drop any non-uniquely mapped features, or
   \item "split":  split non-uniquely mapped features to one per row.
     If this creates duplicate rows for a gene, those rows are
     averaged.
\end{itemize}

This feature uses the following function to create a new ExpressionSet,
in which both \textit{ZNF721} and \textit{ABCA11P} are features with
identical expression data:

<<expand>>=
expandProbesets <- function (eset, sep = "///") 
{
    x <- lapply(featureNames(eset), function(x) strsplit(x, sep)[[1]])
    eset <- eset[order(sapply(x, length)), ]
    x <- lapply(featureNames(eset), function(x) strsplit(x, sep)[[1]])
    idx <- unlist(sapply(1:length(x), function(i) rep(i, length(x[[i]]))))
    xx <- !duplicated(unlist(x))
    idx <- idx[xx]
    x <- unlist(x)[xx]
    eset <- eset[idx, ]
    featureNames(eset) <- x
    eset
}

X <- TCGA_eset[head(grep("///", featureNames(TCGA_eset))),]
exprs(X)[,1:3]
exprs(expandProbesets(X))[,1:3]
@

\section{FULLVcuratedOvarianData}

In curatedOvarianData, probesets mapping to the same gene symbol are
merge by selecting the probeset with maximum mean across all studies
of a given platform.  You can see which representative probeset was chosen by looking at the featureData of the Expressionset, e.g.:

<<featureData>>=
head(pData(featureData(GSE18520_eset)))
@ 

The full, unmerged ExpressionSets are available through the
FULLVcuratedOvarianData package at
http://bcb.dfci.harvard.edu/ovariancancer/.  Probeset to gene maps are
again provided in the featureData of those \Robject{ExpressionSet}s.
Where official \Bioconductor annotation packages are available for the
array, these are stored in the \Robject{ExpressionSet} annotation
slots, e.g.:

<<annotationeg>>=
annotation(GSE18520_eset)
@ 

so that standard filtering methods such as \Rfunction{nsFilter} will
work by default.

\section{Available Clinical Characteristics}
<<loadallsamples, echo=FALSE,results=hide>>=
rm(list=ls())
source(system.file("extdata", 
    "patientselection_all.config",package="curatedOvarianData"))
source(system.file("extdata", "createEsetList.R", package =
    "curatedOvarianData"))
@

\begin{figure}[!htb]
\centering
<<heatmap, echo=FALSE, fig=TRUE>>=
.esetsStats <- function(esets) {
    res <- lapply(varLabels(esets[[1]]), function(covar) unlist(sapply(esets, 
        function(X) sum(!is.na(X[[covar]]))>0)))
    names(res) <- varLabels(esets[[1]])    
    do.call(rbind, res)
}

df.r <- .esetsStats(esets)
M <- as.matrix(apply(df.r,c(1,2),ifelse,0,1))
colnames(M) <- gsub("_eset$", "", colnames(M))
# no need to show the sample ids
M <- M[-(1:2),]
heatmap(M[nrow(M):1,],scale="none",margins=c(8,10),Rowv=NA)
@
\caption{Available clinical annotation. This heatmap visualizes for
  each curated clinical characteristic (rows) the availability in each
  dataset (columns). Red indicates that the corresponding
  characteristic is available for at least one sample in the dataset.
  This plot is Figure 2 of the curatedOvarianData manuscript.}
\end{figure}

\section{Summarizing the List of ExpressionSets}

This example provides a table summarizing the datasets being used, and
is useful when publishing analyses based on curatedOvarianData.
First, define some useful functions for this purpose:

<<esetToTableFuns>>=
source(system.file("extdata", "summarizeEsets.R", package =
    "curatedOvarianData"))
@ 

Now create the table, used for Table 1 of the curatedOvarianData manuscript:

<<esettable, echo=FALSE>>=
summary.table <- t(sapply(esets, getEsetData))
rownames(summary.table) <- sub("_eset", "", rownames(summary.table))
@

Optionally write this table to file, for example ( replace myfile <-
tempfile() with something like myfile <- ``nicetable.csv'' )

<<writeesettable>>=
(myfile <- tempfile())
write.table(summary.table, file=myfile, row.names=FALSE, quote=TRUE, sep=",")
@ 

<<xtable, echo=FALSE, results=tex>>=
library(xtable)
print(xtable(summary.table[, c(2, 3, 4, 5, 7)], 
             caption="Datasets provided by curatedOvarianData.  This is an abbreviated version of Table 1 of the manuscript; the full version is written by the write.table command above.  Stage column is early/late/unknown, histology column is ser/clearcell/endo/mucinous/other/unknown.", 
             table.placement="p", caption.placement="bottom"),
      floating.environment='sidewaystable')
@ 

\section{For non-R users}

If you are not doing your analysis in R, and just want to get some
data you have identified from the curatedOvarianData manual, here is
a simple way to do it.  For one dataset:

<<simplygetdata, eval=FALSE>>=
library(curatedOvarianData)
library(affy)
data(GSE30161_eset)
write.csv(exprs(GSE30161_eset), file="GSE30161_eset_exprs.csv")
write.csv(pData(GSE30161_eset), file="GSE30161_eset_clindata.csv")
@ 

Or for several datasets:

<<simplyseveraldatasets, eval=FALSE>>=
data.to.fetch <- c("GSE30161_eset", "E.MTAB.386_eset")
for (onedata in data.to.fetch){
    print(paste("Fetching", onedata))
    data(list=onedata)
    write.csv(exprs(get(onedata)), file=paste(onedata, "_exprs.csv", sep=""))
    write.csv(pData(get(onedata)), file=paste(onedata, "_clindata.csv", sep=""))
}
@ 

\section{Session Info}
<<sessioninfo, echo=FALSE, results=tex>>=
toLatex(sessionInfo())
@

\clearpage
\bibliography{curatedOvarianData_vignette}

\end{document}