%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!
% The .tex file is likely to be overwritten.
%
% \VignetteIndexEntry{The Iterative Bayesian Model Averaging Algorithm For Survival Analysis}
%\VignetteDepends{BMA}
%\VignetteSuggests{}
%\VignetteKeywords{}
%\VignettePackage{iterativeBMAsurv}

\documentclass[11pt]{article}
\usepackage{times}
\usepackage{hyperref}
\usepackage{natbib}
\usepackage{comment}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newlength{\smallfigwidth}
\setlength{\smallfigwidth}{6cm}

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textsf{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\newcommand{\MAT}[1]{{\bf #1}}
\newcommand{\VEC}[1]{{\bf #1}}

\newcommand{\Amat}{{\MAT{A}}}
%%notationally this is going to break
\newcommand{\Emat}{{\MAT{E}}}
\newcommand{\Xmat}{{\MAT{X}}}
\newcommand{\Xvec}{{\VEC{X}}}
\newcommand{\xvec}{{\VEC{x}}}
\newcommand{\Zvec}{{\VEC{Z}}}
\newcommand{\zvec}{{\VEC{z}}}
\newcommand{\calG}{\mbox{${\cal G}$}}

\bibliographystyle{plain}

\title{The Iterative Bayesian Model Averaging Algorithm for Survival Analysis: an Improved Method for Gene Selection and Survival Analysis on Microarray Data}
\author{Amalia Annest, Roger E. Bumgarner, Adrian E. Raftery, and Ka Yee Yeung}

\begin{document}

\maketitle

\section{Introduction}

Survival analysis is a supervised learning technique that, in the context of microarray data, is most frequently used to identify genes whose expression levels are correlated with patient survival prognosis. Survival analysis is generally applied to diseased samples for the purpose of analyzing time to event, where the event can be any milestone of interest (e.g., metastases, relapse, or death). Typically, the interest is in identifying genes that are predictive of a patient's chances for survival. In such cases, both the accuracy of the prediction and the number of genes necessary to obtain a given accuracy are important. In particular, methods that select a small number of relevant genes and provide accurate patient risk assessment can aid in the development of simpler diagnostic tests. In addition, methods that adopt a weighted average approach over multiple models have the potential to provide more accurate predictions than methods that do not take model uncertainty into consideration. To this end, we developed the iterative Bayesian Model Averaging (BMA) method for gene selection and survival analysis on microarray data \citep{Annest2008}.

Typical gene selection and survival analysis procedures ignore model uncertainty and use a single set of relevant genes (a single model) to predict patient risk. BMA is a multivariate technique that takes the interaction of variables (typically genes) and model uncertainty into account. In addition, the output of BMA contains posterior probabilities for each prediction, which can be useful in assessing the correctness of a given prognosis.

\subsection{Bayesian Model Averaging (BMA)}

Bayesian Model Averaging (BMA) takes model uncertainty into consideration by averaging over the posterior distributions of a quantity of interest based on multiple models, weighted by their posterior model probabilities \citep{Raftery1995}.
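In symbols, if $M_1, \ldots, M_K$ denote the models under consideration and $D$ denotes the training data, the BMA posterior distribution of a quantity of interest $\Delta$ is the weighted average
\[
\Pr(\Delta \mid D) \; = \; \sum_{k=1}^{K} \Pr(\Delta \mid M_k, D) \, \Pr(M_k \mid D),
\]
where each model's contribution is weighted by its posterior model probability $\Pr(M_k \mid D)$ \citep{Raftery1995}.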
In our setting, the posterior probability that a test sample is at risk for the given event is computed under each model $M_k$ using the set of relevant genes in $M_k$, multiplied by the posterior probability of $M_k$, and summed over a set of `good' models $M_k$.

\subsection{Iterative Bayesian Model Averaging (BMA) Algorithm for Survival Analysis}

The traditional BMA algorithm described above is limited to data in which the number of variables is smaller than the number of responses. In the case of performing survival analysis on microarray data, there are typically thousands or tens of thousands of genes (variables) and only a few dozen samples (responses). In this package, the iterative BMA algorithm for survival analysis is implemented.

In the iterative BMA algorithm for survival analysis, we start by ranking the genes in descending order of their log likelihood using a univariate measure such as the Cox Proportional Hazards Model \citep{Cox1972}. In this initial preprocessing step, genes with a larger log likelihood are given a higher ranking. Once the dataset is sorted, we apply the traditional BMA algorithm to the top $maxNvar$ genes in the log likelihood ranking. We use a default of $maxNvar=25$, because the traditional BMA algorithm employs the leaps and bounds algorithm, which is inefficient for more than 30 genes (variables). In the next step, genes to which the BMA algorithm assigns low posterior probabilities of being in the predictive model are removed. In our study, we used 1\% as the threshold and eliminated genes with posterior probabilities $<$ 1\%. Suppose $m$ genes are removed. The next $m$ genes from the rank-ordered log likelihood scores are added back to the set of genes so that we maintain a window of $maxNvar$ genes, and we apply the traditional BMA algorithm again. These steps of gene swaps and iterative applications of BMA are continued until all genes are considered (a schematic sketch of this loop is shown in the next section).

\section{Some examples}

The R package {\tt BMA} is required to run the key commands in this package.

<<>>=
library(BMA)
library(iterativeBMAsurv)
@

An adapted diffuse large B-Cell lymphoma (DLBCL) dataset \citep{Rosenwald2002} is included for illustration purposes. The adapted DLBCL dataset consists of the top 100 genes selected using the Cox Proportional Hazards Model. The training set consists of 65 samples, while the test set consists of 36 samples. In the following examples, we chose parameters to reduce computational time for illustrative purposes. Please refer to our manuscript \citep{Annest2008} for recommended input parameters.

<<>>=
## Use the sample training data. The data matrix is called trainData.
data(trainData)
## The survival time vector for the training set is called trainSurv,
## where survival times are reported in years.
data(trainSurv)
## The censor vector for the training set is called trainCens,
## where 0 = censored and 1 = uncensored.
data(trainCens)
@

The function {\tt iterateBMAsurv.train} selects relevant variables by iteratively applying the {\tt bic.surv} function from the {\tt BMA} package until all variables are exhausted. The function {\tt iterateBMAsurv.train.wrapper} acts as a wrapper for {\tt iterateBMAsurv.train}: it initializes the {\tt bic.surv} parameters and calls {\tt iterateBMAsurv.train} to launch the {\tt bic.surv} iterations. When {\tt iterateBMAsurv.train.wrapper} is called, the data is assumed to be pre-sorted by rank and to contain the desired number of variables.
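For intuition, the windowing loop carried out by {\tt iterateBMAsurv.train} can be sketched as follows. This is a simplified illustration only: the hypothetical function below is not part of the package and omits the edge cases that {\tt iterateBMAsurv.train} handles, and it assumes the columns of {\tt x} are already sorted in descending order of their univariate log likelihood.

<<eval=FALSE>>=
## Illustrative sketch of the iterative windowing procedure (not package code)
iterateBMAsketch <- function(x, surv.time, cens.vec, maxNvar=25, nbest=10) {
  window <- seq_len(min(maxNvar, ncol(x)))
  waiting <- setdiff(seq_len(ncol(x)), window)
  repeat {
    ## apply traditional BMA to the current window of genes
    fit <- BMA::bic.surv(x[, window, drop=FALSE], surv.time, cens.vec, nbest=nbest)
    ## genes with posterior probability < 1% (probne0 is reported in percent)
    low <- which(fit$probne0 < 1)
    if (length(low) == 0 || length(waiting) == 0) {
      return(list(obj=fit, curr.names=colnames(x)[window]))
    }
    ## swap out the low-probability genes and refill the window
    ## with the next genes from the rank-ordered list
    refill <- waiting[seq_len(min(length(low), length(waiting)))]
    waiting <- setdiff(waiting, refill)
    window <- c(window[-low], refill)
  }
}
@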
In the training phase, only the sorted training dataset and the corresponding survival times and censor data are required as input.

<<>>=
## Training phase: select relevant genes.
## In this example training set, the top 100 genes have already been
## sorted in decreasing order of their log likelihood.
ret.list <- iterateBMAsurv.train.wrapper (x=trainData, surv.time=trainSurv,
                                          cens.vec=trainCens, nbest=5)

## Extract the bic.surv object
ret.bic.surv <- ret.list$obj

## Extract the names of the genes from the last iteration of bic.surv
gene.names <- ret.list$curr.names

## Get the selected genes with probne0 > 0
top.gene.names <- gene.names[ret.bic.surv$probne0 > 0]
top.gene.names

## Get the posterior probabilities for the selected models
ret.bic.surv$postprob
@

If all the variables are exhausted in the {\tt bic.surv} iterations, the {\tt iterateBMAsurv.train.wrapper} function returns {\tt curr.names}, a vector containing the names of the variables in the last iteration of {\tt bic.surv}. It also returns an object of class {\tt bic.surv} from the last iteration of {\tt bic.surv}. This object is a list consisting of many components. Here are some of the relevant components:

\begin{itemize}
\item {\tt namesx}: The names of the variables in the last iteration of {\tt bic.surv}.
\item {\tt postprob}: The posterior probabilities of the models selected. The length of this vector indicates the number of models selected by BMA.
\item {\tt which}: A logical matrix with one row per model and one column per variable, indicating whether that variable is in the model.
\item {\tt probne0}: The posterior probability that each variable is non-zero (in percent) in the last iteration of {\tt bic.surv}. The length of this vector should be identical to that of {\tt curr.names}.
\item {\tt mle}: Matrix with one row per model and one column per variable, giving the maximum likelihood estimate of each coefficient for each model.
\end{itemize}

In the training phase, the relevant variables (genes) are selected using the training data, the survival times, and the censor vector. In the test phase, we call the function {\tt predictBicSurv} with the selected variables (genes), the selected models, and the corresponding posterior probabilities to predict the risk scores for the patient samples in the test set. The predicted risk score of a test sample is a weighted average of its risk scores under the selected models, with each model's risk score weighted by the posterior probability of that model (see the formula below). Note that in this case, a model consists of a set of genes, and different models can potentially have overlapping genes. The posterior probability of a gene is equal to the sum of the posterior probabilities of all the models that the gene belongs to. Finally, the function {\tt predictiveAssessCategory} assigns each test sample to a risk group (either high-risk or low-risk) based on the predicted risk score of the sample.
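In symbols, if $\hat{y}_k(\xvec)$ denotes the risk score of a test sample with expression profile $\xvec$ under model $M_k$ (in {\tt predictBicSurv} this per-model score is computed from $\xvec$ and the maximum likelihood coefficient estimates stored in {\tt mle}), the predicted risk score is
\[
\hat{y}(\xvec) \; = \; \sum_{k} \hat{y}_k(\xvec) \, \Pr(M_k \mid D),
\]
where $\Pr(M_k \mid D)$ is the posterior probability of model $M_k$ reported in {\tt postprob}.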
<<>>=
## The test data matrix is called testData.
data(testData)
## The survival time vector for the test set is called testSurv,
## where survival times are reported in years.
data(testSurv)
## The censor vector for the test set is called testCens,
## where 0 = censored and 1 = uncensored.
data(testCens)

## Get the subset of test data with the genes from the last iteration of bic.surv
curr.test.dat <- testData[, top.gene.names]

## Compute the predicted risk scores for the test samples
y.pred.test <- apply (curr.test.dat, 1, predictBicSurv,
                      postprob.vec=ret.bic.surv$postprob, mle.mat=ret.bic.surv$mle)

## Compute the risk scores for the training samples
y.pred.train <- apply (trainData[, top.gene.names], 1, predictBicSurv,
                       postprob.vec=ret.bic.surv$postprob, mle.mat=ret.bic.surv$mle)

## Assign risk categories for the test samples. Argument cutPoint is the
## percentage cutoff for separating the high-risk group from the low-risk group.
ret.table <- predictiveAssessCategory (y.pred.test, y.pred.train, testCens, cutPoint=50)

## Extract the risk group vector and the risk group table
risk.vector <- ret.table$groups
risk.table <- ret.table$assign.risk
risk.table

## Create a survival object from the test set
mySurv.obj <- Surv(testSurv, testCens)

## Extract statistics including p-value, chi-square, and variance matrix
stats <- survdiff(mySurv.obj ~ unlist(risk.vector))
stats
@

The p-value is calculated using the central chi-square distribution.
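The Kaplan-Meier survival curves for the two predicted risk groups can also be drawn directly from the quantities computed above. The following is a minimal sketch that assumes the {\tt survival} package is available (its functions {\tt Surv} and {\tt survdiff} were already used above):

<<eval=FALSE>>=
## Kaplan-Meier curves for the predicted risk groups (illustrative sketch)
km.fit <- survfit(Surv(testSurv, testCens) ~ unlist(risk.vector))
plot(km.fit, lty=1:2, xlab="Time (years)", ylab="Survival probability")
legend("topright", legend=names(km.fit$strata), lty=1:2)
@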
The function {\tt iterateBMAsurv.train.predict.assess} combines the training, prediction, and test phases. The function begins by calling {\tt singleGeneCoxph}, which sorts the genes in descending order of their log likelihood. After calling {\tt iterateBMAsurv.train.wrapper} to conduct the {\tt bic.surv} iterations, the algorithm predicts the risk scores for the test samples and assigns them to a risk group. Predictive accuracy is evaluated through the p-value and chi-square statistic, along with a Kaplan-Meier survival analysis curve that serves as a pictorial nonparametric estimator of the difference between risk groups. If the Cox Proportional Hazards Model is the desired univariate ranking measure, then calling the function {\tt iterateBMAsurv.train.predict.assess} is all that is necessary for a complete survival analysis run. The parameter $p$ represents the number of top-ranked genes from the univariate sorting to be used in the iterative calls to the {\tt bic.surv} algorithm. Our studies showed that a relatively large $p$ typically yields good results \citep{Yeung2005}. For simplicity, there are 100 genes in the sample training set, and we used $p = 100$ in the iterative BMA algorithm for survival analysis. Experimenting with greater $p$ values, higher numbers of $nbest$ models, and different percentage cutoffs for $cutPoint$ will likely yield more significant results than the following examples illustrate.

The function returns a list consisting of the following components:

\begin{itemize}
\item {\tt nvar}: The number of variables selected by the last iteration of {\tt bic.surv}.
\item {\tt nmodel}: The number of models selected by the last iteration of {\tt bic.surv}.
\item {\tt ypred}: The predicted risk scores on the test samples.
\item {\tt result.table}: A $2 \times 2$ table indicating the number of test samples in each category (high-risk/censored, high-risk/uncensored, low-risk/censored, low-risk/uncensored).
\item {\tt statistics}: An object of class {\tt survdiff} that contains the statistics from survival analysis, including the variance matrix, chi-square statistic, and p-value.
\item {\tt success}: A boolean variable returned as TRUE if both risk groups are present in the patient test samples.
\end{itemize}

If all test samples are assigned to a single risk group or all samples are in the same censor category, the boolean variable {\tt success} is returned as FALSE.

<<>>=
## Use p=10 genes and nbest=5 for fast computation
ret.bma <- iterateBMAsurv.train.predict.assess (train.dat=trainData, test.dat=testData,
    surv.time.train=trainSurv, surv.time.test=testSurv, cens.vec.train=trainCens,
    cens.vec.test=testCens, p=10, nbest=5)

## Extract the statistics from this survival analysis run
number.genes <- ret.bma$nvar
number.models <- ret.bma$nmodel
evaluate.success <- ret.bma$statistics
evaluate.success
@

The function {\tt crossVal} performs $k$ runs of $n$-fold cross validation on the training set, where $k$ and $n$ are specified by the user through the {\tt noRuns} and {\tt noFolds} arguments, respectively. The {\tt crossVal} function in this package can be used to evaluate the selected mathematical models and determine the optimal input parameters for a given dataset. For each run of cross validation, the training set, survival times, and censor data are re-ordered according to a random permutation. For each fold of cross validation, $1/n$ of the data is set aside to act as the validation set. In each fold, the {\tt iterateBMAsurv.train.predict.assess} function is called in order to carry out a complete run of survival analysis. This means the univariate ranking measure for this cross validation function is Cox Proportional Hazards Regression; see {\tt iterateBMAsurv.train.wrapper} to experiment with alternate univariate ranking methods. With each run of cross validation, the survival analysis statistics are saved and written to file. The output of this function is a series of files written to the working R directory, giving the fold results, run results, per-fold statistics, and average statistics across all runs and all folds.

<<>>=
## Perform 1 run of 2-fold cross validation on the training set,
## using p=10 genes and nbest=5 for fast computation.
## Argument diseaseType specifies the type of disease present in the
## training samples; it is used when writing results to file.
cv <- crossVal (exset=trainData, survTime=trainSurv, censor=trainCens,
    diseaseType="DLBCL", noRuns=1, noFolds=2, p=10, nbest=5)
@

This package also contains the {\tt imageplot.iterate.bma.surv} function, which allows for the creation of a heatmap-style image to visualize the selected genes and models (see Figure~\ref{fig:image}).
\begin{figure}[hp]
\centering
<<imageplot, fig=TRUE>>=
imageplot.iterate.bma.surv (ret.bic.surv)
@
%%\includegraphics{iterateBMAsurv-imageplot}
\caption{\label{fig:image}%
An image plot showing the selected genes and models.}
\end{figure}

In Figure~\ref{fig:image}, the BMA-selected variables are shown on the vertical axis, and the BMA-selected models are shown on the horizontal axis. The variables (genes) are sorted in decreasing order of the posterior probability that the variable is not equal to 0 ({\tt probne0}) from top to bottom. The models are sorted in decreasing order of the model posterior probability ({\tt postprob}) from left to right.

\section*{Acknowledgements}

We would like to thank Isabelle Bichindaritz, Donald Chinn, Steve Hanks, Ian Painter, Deanna Petrochilos, and Chris Volinsky. Yeung is supported by NIH-NCI 1K25CA106988. Bumgarner is funded by NIH-NHLBI P50 HL073996, NIH-NIAID U54 AI057141, NIH-NCRR R24 RR021863-01A1, NIH-NIDCR R01 DE012212-06, NIH-NCRR 1 UL1 RR 025014-01, and a generous basic research grant from Merck. Raftery is supported by NIH-NICHD 1R01HD054511-01A1, NSF IIS0534094, NSF ATM0724721, and Office of Naval Research grant N00014-01-1-0745.

\bibliography{iterativeBMAsurv}

\end{document}