% \VignetteIndexEntry{GlobalAncova} % \VignetteDepends{GlobalAncova, globaltest, golubEsets, hu6800.db, vsn} % \VignetteKeywords{Expression Analysis} % \VignettePackage{GlobalAncova} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rfunarg}[1]{{\textit{#1}}} \documentclass[a4paper]{article} \title{Global testing of differential gene expression} \author{Manuela Hummel \and Reinhard Meister \and Ulrich Mansmann} %\and Ramona Scheufele } \begin{document} \maketitle \tableofcontents \newpage \section{Abstract} In studies about differential gene expression between different clinical diagnoses the main interest may often not be in single genes but rather in groups of genes that are associated with a pathway or have a common location in the genome. In such cases it may be better to perform a global test because the problems of multiple testing can be avoided. The approach presented here is an ANCOVA global test on phenotype effects and gene--phenotype interaction. Testing many pathways simultaneously is also possible. This, of course, causes again need for correction for multiple testing. Besides the standard approaches for correction we introduce a closed testing procedure in which the experiment--wise error rate equals the required level of confidence of the overall test. <>= #library(Biobase) library(GlobalAncova) library(globaltest) sI <- sessionInfo() @ This document was created using R version \Sexpr{paste(R.version$major,".",R.version$minor,sep="")} and versions \Sexpr{sI$otherPkgs$GlobalAncova$Version} and \Sexpr{sI$otherPkgs$globaltest$Version} of the packages \Rpackage{GlobalAncova} and \Rpackage{globaltest}, respectively. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Changes to Previous Versions} \noindent {\bf Version 3.14.x} \begin{itemize} \item The testing of collections of functional gene sets (Gene Ontology, KEGG, Broad Institute's gene sets) was adapted to new functions in package \Rpackage{globaltest} (version $> 5.0.0$) and can now be performed using the functions \Rfunction{GAGO}, \Rfunction{GAKEGG} and \Rfunction{GABroad}. \item Note change in use of function \Rfunction{GAGO} as compared to previous versions. \end{itemize} \vspace{0.2cm} \noindent {\bf Version 3.3.3} \begin{itemize} \item The permutation approach is now implemented in \emph{C} and therefore faster. \item If the number of possible phenotype permutations is smaller than the number specified in \Rfunarg{perm} (i.e. in very small sample sizes), all possible permutations are considered for the permutation test. \item Some more error messages are included. \item In \Rfunction{Plot.genes} and \Rfunction{Plot.subjects} bar labels can be manipulated with the argument \Rfunarg{bar.names}. \end{itemize} \vspace{0.2cm} \noindent {\bf Version 3.x.x} \begin{itemize} \item Besides the permutation-based p-values also asymptotic p-values based on an approximation of the distribution of the test statistic are provided. Theoretical F-test p-values are no longer displayed since they are not valid in case of correlations or non-normality. \item The focus level procedure for finding interesting Gene Ontology subgraphs from Goeman and Mansmann (2007) \cite{GoeMan:08} was adapted for the use with \Rfunction{GlobalAncova}. \item Sequential and type III decompositions of the residual sum of squares, adjustment for global covariates and pair-wise comparisons of different levels of a categorical factor are implemented. These functionalities are described in the additional vignette \emph{GlobalAncovaDecomp.pdf}. \item Now the parameter \Rfunarg{test.genes} allows for specifying the gene group which a graph shall be based on in the plotting functions. \end{itemize} \vspace{0.2cm} \noindent {\bf Version 2.5.1} \begin{itemize} \item Testing several groups of genes is now more efficient and less time consuming. \item In the gene and subjects plots bar heights (gene-wise reduction in sum of squares and subject-wise reduction in sum of squares, respectively) can be returned. %This eases the identification of genes with most influence regarding differential expression or of subjects behaving like outliers. \item Plots are more flexible regarding graphical parameters like specification of colors, titles, axis labels and axis limits. \end{itemize} \vspace{0.2cm} \noindent {\bf Version 2.x.x} \begin{itemize} \item The major modification in the new version is the transfer from simple two group comparisons to a general linear model framework where arbitrary clinical variables (in especially with more groups or also continuous ones), time trends, gene--gene interactions, co--expression and so forth can be analysed. \item According to the new framework also the diagnostic plots are more flexible. The variable defining the coloring of bars can now be specified by the user, see section \ref{plots} for details. \item A bug was fixed concerning testing only a single gene for differential expression with the global ANCOVA F--test. \item Within the closed testing procedure a bug was fixed concerning testing non--disjunct groups of genes. \end{itemize} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} The ANCOVA global test is a test for the association between expression values and clinical entities. The test is carried out by comparison of linear models via the extra sum of squares principle. If the mean expression level for at least one gene differs between corresponding models the global null hypothesis, which is the intersection of all single gene null hypotheses, is violated. As our test is based on the sum of gene-wise reduction in sum of squares due to phenotype, all systematic differences in gene expression between phenotypes equally contribute to the power of the test. Single genes are not, in general, the primary focus of gene expression experiments. The researcher might be more interested in relevant pathways, functional sets or genomic regions consisting of several genes. Most of the current methods for studying pathways analyse differential expression of single genes. In these methods pathways where many genes show minor changes in their expression values may not be identified. Goeman's global test and the ANCOVA global test were designed to address this issue. Applying global tests for differential expression in pathways substantially reduces the number of tests compared to gene-wise multiple testing. The amount of correction for multiple testing decreases. Function (KEGG, GO) or location (chromosome, cytoband) could be used as grouping criteria, for example. We want to compare our method with the global test of Goeman et al. (2004) \cite{Goe:04}. Our function \Rfunction{GlobalAncova} tests whether the expectation of expression levels differs between biological entities for a given group of genes. This vignette has its focus on the practical use of the test. For more details about the mathematical background and the interpretation of results, we refer to the papers by Mansmann and Meister (2005) \cite{MaMei:05} and Hummel, Meister and Mansmann (2008) \cite{Hummel:07}. This document shows the functionality of the R-package \Rpackage{GlobalAncova}. The datasets, all necessary R--packages and our package \Rpackage{GlobalAncova} are available from the Bioconductor website ({\it www.bioconductor.org}). First we load the packages and data we will use. <>= library(GlobalAncova) library(globaltest) library(golubEsets) library(hu6800.db) library(vsn) data(Golub_Merge) golubX <- justvsn(Golub_Merge) @ This creates a dataset \Robject{golubX}, which is of the format \Rclass{ExpressionSet}, the standard format for gene expression data in BioConductor. It consists of \Sexpr{length(featureNames(golubX))} genes and \Sexpr{length(sampleNames(golubX))} samples (the data are from \cite{Gol:99}). We used \Rpackage{vsn} to normalize the data. Other appropriate normalization methods may be used as well. From several phenotype variables we use ``ALL.AML'' as the clinical diagnoses of interest. ALL and AML are two types of acute leukemia. There are \Sexpr{sum(golubX$ALL.AML=="ALL")} patients with ALL and \Sexpr{sum(golubX$ALL.AML=="AML")} with AML. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Global Testing of a Single Pathway} \label{onepathway} \subsection{Golub Data and Cell Cycle Pathway} Suppose we are interested in testing whether AML and ALL have different gene expression patterns for certain pathways, for example from the KEGG database. With \Rpackage{globaltest} we answer the question whether the expression profile has prognostic power with respect to diagnosis of AML or ALL. \Rpackage{GlobalAncova} asks for differences in mean expression between the two clinical groups. \subsubsection{Testing all Genes} We start by applying our test to all genes in the Golub dataset so that differences in the overall gene-expression pattern can be demonstrated. <>= options(width=70) @ <>= gr <- as.numeric(golubX$ALL.AML=="ALL") ga.all <- GlobalAncova(xx=exprs(golubX), group=gr, covars=NULL, perm=100) @ The first input \Rfunarg{xx} is a \Sexpr{dim(exprs(golubX))[[1]]} $\times$ \Sexpr{dim(exprs(golubX))[[2]]} matrix that contains the expression values of all genes and samples. Missing values in the expression matrix \Rfunarg{xx} are not allowed because otherwise gene-wise linear models could not be summarized adequately to a global group statement. If missing values occur we propose either leaving out the genes with missing values (i.e. the corresponding rows in the gene expression matrix), or imputing the data before applying \Rpackage{GlobalAncova}. An easy way to do the latter would be for example to calculate linear models for each gene using the available model variables (e.g. phenotype group labels). Missing values can then be estimated based on the resulting model parameters and the actual values of phenotype variables of the corresponding samples. The use of more sophisticated imputation methods \cite{Rubin:87} would be computationally expensive and is not implemented in \Rpackage{GlobalAncova}. Note that we did not yet evaluate how data imputation affects \Rpackage{GlobalAncova} results and whether the easier imputation methods described above yield similar results as the more complex approaches. The second input \Rfunarg{group} in the \Rfunction{GlobalAncova} function is a vector that defines the clinical diagnosis for the \Sexpr{dim(exprs(golubX))[[2]]} patients. Note that \Rpackage{GlobalAncova} is not restricted to the analysis of dichotomous phenotype groups. More complex tasks like variables with more groups or also continuous ones, time trends, gene--gene interactions and co--expression can be performed as well. Some examples will be given in section \ref{vantVeer}. The realization of such tasks is done by definition of two linear models that shall be compared via the extra sum of squares principle. Hence model formulas for the full model containing all parameters and the reduced model, where the terms of interest are omitted, have to be given. An alternative is to provide the formula for the full model and a character vector naming the terms of interest. Those names can be chosen by previous output of the \Rfunction{GlobalAncova} function. Consequently we could run the same analysis as above with two possible further function calls shown below (output is omitted). In both cases a data frame with information about all variables for each sample is required. In the case of microarray data this can be the corresponding \Robject{pData} object. <>= GlobalAncova(xx=exprs(golubX), formula.full=~ALL.AML, formula.red=~1, model.dat=pData(golubX), perm=100) GlobalAncova(xx=exprs(golubX), formula.full=~ALL.AML, test.terms="ALL.AMLAML", model.dat=pData(golubX), perm=100) @ To avoid alpha--inflation due to correlated data and effects of non--normality of the data tests for significance of the resulting F--ratios are performed using a permutation test approach. We apply permutation of samples which is equivalent to permuting rows of the full design matrix. Note that permutation is only conducted for such columns of the design matrix that correspond to the variables of interest. Values of additional covariates remain in the original order. This prevents us for destroying covariate effects. Still the permutation approach is not optimal since residuals may be correlated. However, this does not seem to be a severe problem. The argument \Rfunarg{perm} defines the number of permutations, which is 10,000 for default. Here we set \Rfunarg{perm} to just 100 or 1000 so that creating this vignette will not last too long. For getting more reliable results one should recompute the examples with more permutations. \\ As an alternative to the permutation approach an approximation of the F--statistic nominator according to \cite{Rob:49} yields asymptotic p--values. Note that the approximation is not feasible for very large gene groups since the huge gene expression covariance matrix has to be estimated, which is not possible for too many genes. The default value for group size (\Rfunarg{max.group.size}) is 2500, groups above this size are treated by the permutation approach. When using work stations with good working memory this number may be increased. The estimation of the covariance matrix is carried out with the R package \Rpackage{corpcor} from \cite{corpcor}.\\ Whether the permutation--based or the asymptotic p--values or both should be calculated is controlled by the argument \Rfunarg{method}. The result of the \Rfunction{GlobalAncova} function is a typical ANOVA table with information about sums of squares, degrees of freedom and mean sums of squares for the effect and error term, respectively. Besides F--statistics %and classical F--test p--values, there are given either p--values from the permutation test or the asymptotic p--values or both. The names of all involved parameters are displayed as well as the name(s) of the tested effect(s). <>= ga.all @ From this result we conclude that the overall gene expression profile for all \Sexpr{length(featureNames(golubX))} genes is associated with the clinical outcome. This means that samples with different AML/ALL status tend to have different expression profiles. We expect most pathways (especially the ones containing many genes) also to be associated with the phenotype groups. If we apply Goeman's global test we get <>= gt.all <- gt(ALL.AML, golubX) @ <>= gt.all @ Both tests show that the data contain overwhelming evidence for differential gene expression between AML and ALL. \subsubsection{Testing the Cell Cycle Pathway} Now we ask the more specific question of whether there is evidence for differential gene expression between both diagnoses restricted to genes belonging to the cell cycle pathway. First we load all KEGG pathways. <>= kegg <- as.list(hu6800PATH2PROBE) @ <>= cellcycle <- kegg[["04110"]] @ The list \Robject{kegg} consists of \Sexpr{length(kegg)} pathways. Each pathway is represented by a vector of gene names. We are mainly interested in the cell cycle pathway which has the identifier ``04110'' in the KEGG database. It corresponds to \Sexpr{length(cellcycle)} probe sets on the hu6800 chip. <>= cellcycle <- kegg[["04110"]] @ We apply the global test to this pathway using the option \Rfunarg{test.genes}. <>= ga.cc <- GlobalAncova(xx=exprs(golubX), group=gr, test.genes=cellcycle, method="both", perm=1000) ga.cc @ Also with \Rpackage{globaltest} we get a very small p--value <>= gt.cc <- gt(ALL.AML, golubX, subsets=cellcycle) gt.cc @ <>= gt.cc @ The test results clearly indicate that the expression pattern of the cell cycle pathway is different between the two clinical groups. \subsubsection{Adjusting for Covariates} Covariate information can be incorporated by specifying the \Rfunarg{covars} option. For example if we want to adjust for whether samples were taken from bone marrow or from peripheral blood (\Robject{BM.PB}), we can do this by <>= ga.cc.BMPB <- GlobalAncova(xx=exprs(golubX), group=gr, covars=golubX$BM.PB, test.genes=cellcycle, method="both", perm=1000) ga.cc.BMPB @ With the more general function call we would simply adjust the definitions of model formulas, namely \Rfunarg{formula.full =} $\sim$ \Rfunarg{ALL.AML + BM.PB} and \Rfunarg{formula.red =} $\sim$ \Rfunarg{BM.PB}. The source of the samples does not seem to have an explanatory effect on the outcome since F--statistics and p--values are very similar to the model without adjustment. With the \Rpackage{globaltest} we get a similar p--value. <>= gt.cc.BMPB <- gt(ALL.AML ~ BM.PB, golubX, subsets=cellcycle) gt.cc.BMPB @ <>= gt.cc.BMPB @ Permutation based p--values can also be obtained with Goeman's test, however only when covariates are absent. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{van't Veer Data and p53--Signalling Pathway} \label{vantVeer} <>= data(vantVeer) data(phenodata) data(pathways) @ We present another example from a study on breast cancer from van't Veer et al. (2002) \cite{vV:02}. This example illustrates how more complex tasks than comparing %expression levels between just two clinical groups can be performed with \Rpackage{GlobalAncova}. A subset of the data consisting of the expression values for \Sexpr{dim(phenodata)[1]} patients without {\it BRCA1} or {\it BRCA2} mutations is available with the package. The dataset (\Robject{vantVeer}) is restricted to \Sexpr{dim(vantVeer)[1]} genes associated with \Sexpr{length(pathways)} cancer related pathways that are provided as a list named (\Robject{pathways}), too. We take one gene from the original data additionally to the expression set, namely 'AL137718'. This gene is part of the original van't Veer prognosis signature. We will later use it to demonstrate how signature genes can be related to pathways. Information about some of the originally surveyed covariates is stored in \Robject{phenodata}. %Further we provide a list of \Sexpr{length(pathways)} cancer related pathways (\Robject{pathways}). The tumour suppressor protein p53 contributes as a transcription factor to cell cycle arrest and apoptosis induction. %Oncogene. 2004 Apr 22;23(19):3376-84. %Identification of Tcf-4 as a transcriptional target of p53 signalling. %Rother K, Johne C, Spiesbach K, Haugwitz U, Tschop K, Wasner M, %Klein-Hitpass L, Moroy T, Mossner J, Engeland K. Therefore, first the p53-signalling pathway is selected as a candidate, where differential expression between relevant prognostic groups, defined by the development of distant metastases within five years, was expected. <>= data(vantVeer) data(phenodata) data(pathways) metastases <- phenodata$metastases p53 <- pathways$p53_signalling @ We get a significant result with the global ANCOVA. <>= vV.1 <- GlobalAncova(xx=vantVeer, group=metastases, test.genes=p53, method="both", perm=1000) vV.1 @ \subsubsection{Analysis of Various Clinical Groups} In the new version of the package also clinical variables with more than two groups can be considered. For demonstration we investigate differential expression for the three ordered levels of tumour grade. <>= vV.3 <- GlobalAncova(xx=vantVeer, formula.full=~grade, formula.red=~1, model.dat=phenodata, test.genes=p53, method="both", perm=1000) vV.3 @ \subsubsection{Gene--Gene Interaction} Now we want to go into the matter of other interesting biological questions. For example one might ask if there exists interaction between the expression of genes which the authors in \cite{vV:02} presented as signature for prediction of cancer recurrence and the expression of genes in a certain pathway. This question can be answered by viewing the expression values of the signature genes as linear regressors and to test their effects on the expression pattern of the pathway genes. For demonstration we pick the signature gene "AL137718", which is not part of any of the pathways, and test its effect on the p53--signalling pathway. Assume that we also want to adjust for the Estrogen receptor status. The analysis can be carried out in the following way. <>= signature.gene <- "AL137718" model <- data.frame(phenodata, signature.gene=vantVeer[signature.gene, ]) vV.4 <- GlobalAncova(xx=vantVeer, formula.full=~signature.gene + ERstatus, formula.red=~ERstatus, model.dat=model, test.genes=p53, method="both", perm=1000) vV.4 @ Assuming a significance level of 0.05 we get a significant effect of the signature gene on the p53--signalling pathway. \subsubsection{Co--Expression} Next we want to analyse co--expression regarding the clinical outcome of building distant metastases within five years. This can be done by simply adding the variable \Robject{metastases} to the full and reduced model, respectively. Such layout corresponds to testing the linear effect of the signature gene stratified not only by Estrogen receptor status but also by \Robject{metastases}. <>= vV.5 <- GlobalAncova(xx=vantVeer, formula.full=~metastases + signature.gene + ERstatus, formula.red=~metastases + ERstatus, model.dat=model, test.genes=p53, method="both", perm=1000) vV.5 @ Again we get a significant result. Supposably the most interesting question in this case concerns differential co--expression. Differential co--expression is on hand if the effect of the signature gene behaves different in both \Robject{metastases} groups. In a one dimensional context this would become manifest by different slopes of the regression lines. Hence what we have to test is the interaction between \Robject{metastases} and \Robject{signature.gene}. <>= vV.6 <- GlobalAncova(xx=vantVeer, formula.full=~metastases * signature.gene + ERstatus, formula.red=~metastases + signature.gene + ERstatus, model.dat=model, test.genes=p53, method="both", perm=1000) vV.6 @ We observe a significant differential co-expression between the chosen signature gene and the p53--signalling pathway. With \Rpackage{globaltest} we can also test gene-gene interaction, also adjusted for phenotype groups. But it is not possible to test for differential co-expression or the influence of more than one signature gene on a pathway. On the other hand globaltest is able to deal with survival times as clinical outcome. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Testing Several Pathways Simultaneously} Systems biology involves the study of mechanisms underlying complex biological processes as integrated systems of many diverse interacting components, often referred to as pathways. We regard the possibility to investigate differential gene expression simultaneously for several of those pathways as a contribution towards understanding biological relevant relations. The user can apply \Rfunction{GlobalAncova} to compute p--values for a couple of pathways with one call by specifying the \Rfunarg{test.genes} option. The members of each pathway to be tested must belong to genes in the expression--matrix. Afterwards a suitable correction for multiple testing has to be applied. An alternative based on the closed testing approach is described later. Suppose for example that for sake of simplicity we want to test the first four of the cancer related pathways with the van't Veer data. We proceed as follows. <>= metastases <- phenodata$metastases ga.pw <- GlobalAncova(xx=vantVeer, group=metastases, test.genes=pathways[1:4], method="both", perm=1000) ga.pw @ The result is a matrix whose rows correspond to the different pathways. With the \Rfunction{globaltest} we get a similar matrix. <>= gt.options(transpose = TRUE) gt.pw <- gt(metastases, vantVeer, subsets=pathways[1:4]) gt.pw @ \subsection{Simultaneous Adjustment of p--values} Next we show how to extract p--values for correction for multiple testing. Note however that due to the extremely high correlations between these tests, many procedures that correct for multiple testing here are inappropriate. An appropriate way of adjusting would be for example the method of Holm (1979) \cite{Holm:79}. An alternative to such adjustments that is not affected by correlations between tests is a closed testing procedure. For this approach you need a family of null hypotheses that is closed under intersection. Then a single hypothesis can be rejected at level $\alpha$ if it is rejected along with all hypotheses included in it (\cite{Mar:76}). %(Marcus et al. 1976). For the adjustment according to Bonferroni and Holm we build a vector of the raw p--values. The function \Rfunction{p.adjust} provides several adjusting methods for any vector of raw p-values. For the output of function \Rfunction{gt} there is a specific \Rfunction{p.adjust} method. <<>>= ga.pw.raw <- ga.pw[ ,"p.perm"] ga.pw.adj <- p.adjust(ga.pw.raw, "holm") ga.result <- data.frame(rawp=ga.pw.raw, Holm=ga.pw.adj) ga.result gt.result <- p.adjust(gt.pw) gt.result @ Allowing a family--wise error rate of 0.05 %with \Rfunction{GlobalAncova} all but two pathways remain significant. all but one pathways remain significant for both methods. \subsection{Closed Testing Procedure} \label{closed} Closed testing procedures (\cite{Mar:76}) %(Marcus et al., 1976 \cite{Mar:76}) offer a versatile and powerful approach to the multiple testing problem. Implementation is non--trivial, therefore, the program given in this version should be regarded as a prototype. In order to apply the closed testing procedure we first have to create the required family of hypotheses by building all intersections between the ``natural'' hypotheses tested above and all intersections of those new hypotheses and so on. The resulting family of hypotheses can be illustrated in a directed graph. If we just for the sake of illustration assume that we have only four hypotheses named ``1'', $\ldots$ ``4'' then the node ``1-2-3-4'' for example stands for the global hypothesis that the genes of all four pathways are not differentially expressed. Now the interesting hypothesis ``1'' for example can be rejected if also the hypotheses ``1-2-3-4'', ``1-2-3'', ``1-2-4'', ``1-3-4'', ``2-3-4'', ``1-2'', $\ldots$, ``1-4'' are rejected. These relationships are represented by the edges of the graph. <>= if(require(Rgraphviz)) { res <- GlobalAncova:::Hnull.family(pathways[1:4]) graph <- new("graphNEL", nodes=names(res), edgemode="directed") graph <- addEdge(from=c(rep(names(res)[15],4),rep(names(res)[10],3),rep(names(res)[11],3), rep(names(res)[13],3),rep(names(res)[14],3),rep(names(res)[5],2),rep(names(res)[6],2), rep(names(res)[7],2),rep(names(res)[8],2),rep(names(res)[9],2),rep(names(res)[12],2)), to=c(names(res)[10],names(res)[11],names(res)[13],names(res)[14],names(res)[5],names(res)[6], names(res)[8],names(res)[5],names(res)[7],names(res)[9],names(res)[6],names(res)[7],names(res)[12], names(res)[8],names(res)[9],names(res)[12],names(res)[1],names(res)[2],names(res)[1],names(res)[3], names(res)[1],names(res)[4],names(res)[2],names(res)[3],names(res)[2],names(res)[4],names(res)[3], names(res)[4]), graph, weights=rep(1, 28)) att <- list() lab <- c("1","2","3","4","1-2","1-3","1-4","2-3","2-4","1-2-3","1-2-4","3-4","1-3-4","2-3-4","1-2-3-4") names(lab) <- names(res) att$label <- lab plot(graph, nodeAttrs=att) } else { plot(1, 1, type="n", main="This plot requires Rgraphviz", axes=FALSE, xlab="", ylab="") } @ We can compute the closed testing procedure using the function <>= ga.closed <- GlobalAncova.closed(xx=vantVeer, group=metastases, test.genes=pathways[1:4], previous.test=ga.pw, level=0.05, method="approx") @ where \Rfunarg{test.genes} is again a list of pathways. In order to shorten computing time we can provide the results of the previous application of \Rfunction{GlobalAncova} for the pathways of interest. The option \Rfunarg{level} allows to manipulate the level of significance. There is again the possibility to choose between permutation and asymptotic p-values via the option \Rfunarg{method}. %\Rfunarg{perm} again gives the desired number of permutations used in the permutation test. Note that if you provide results of previous computation, the type of p-values has to correspond, i.e. if we now want to use \Rfunarg{method = "approx"} in the previous test we should have used \Rfunarg{method = "approx"} or \Rfunarg{method = "both"} such that asymptotic p-values are available. Also for \Rfunction{GlobalAncova.closed} all three different function calls as for \Rfunction{GlobalAncova} itself are possible. The function \Rfunction{GlobalAncova.closed} provides the formed null hypotheses (this means lists of genes to be tested simultaneously), the test results for each pathway of interest and the names of significant and non significant pathways. Names for the intersections of hypotheses are built by simply coercing the names of the respective pathways. If for a pathway one single hypothesis can not be rejected there is no need to test all the remaining hypotheses. That is why in test results of non significant pathways lines are filled with NA's after a p--value $> \alpha$ occured. Here only test results for the first pathway are displayed. <>= names(ga.closed) rownames(ga.closed$test.results[[1]]) rownames(ga.closed$test.results[[1]]) <- NULL ga.closed$test.results[1] ga.closed$significant ga.closed$not.significant @ We get the same significant and non significant pathways as before. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Testing Public Gene Set Collections} \subsection{Gene Ontology} When testing gene sets defined by the Gene Ontology it is of special interest to incorporate the hierarchical structure of the GO graph. Goeman and Mansmann (2008)\cite{GoeMan:08} developed the \emph{focus level} method, a multiple testing approach on the GO that combines the correction method of Holm (1979) \cite{Holm:79} and the closed testing procedure from Marcus et al. (1976) \cite{Mar:76} (also used in section \ref{closed}). The method is originally implemented in package \Rpackage{globaltest}. We adapted the corresponding functions such that the procedure is available also with \Rfunction{GlobalAncova}. For details see the vignette of \Rpackage{globaltest}. For reasons of computing time here we only apply the focus level method the subgraph of the 'cell cycle' GO term and all its descendants. To test all terms within an ontology (or several ontologies) the \Rfunarg{id} argument can just be omitted. <>= library(GO.db) descendants <- get("GO:0007049", GOBPOFFSPRING) gago <- GAGO(xx=exprs(golubX), formula.full=~ALL.AML, formula.red=~1, model.dat=pData(golubX), id=c("GO:0007049", descendants), annotation="hu6800", ontology="BP", multtest="focuslevel") @ <<>>= head(gago) @ All arguments for specifying the linear model used in \Rfunction{GlobalAncova} can be given here. Only the parameter \Rfunarg{method} is not available because the focus level procedure does only work with the asymptotic test. Note however, that still a number of permutations can be specified (\Rfunarg{perm}, default 10,000) since very large GO terms (with more annotated genes than defined by parameter \Rfunarg{max.group.size}, default 2500) are tested permutation-based. Alternative to the focus level procedure, one can choose between the methods of Holm \cite{Holm:79}, Benjamini $\&$ Hochberg \cite{BeHo:95} and Benjamini $\&$ Yekutieli \cite{BeYe:01} for multiple testing correction, using the argument \Rfunarg{multtest}. \subsection{KEGG Pathways} The function \Rfunction{GAKEGG} (adapted from \Rpackage{globaltest} function \Rfunction{gtKEGG}) can be used to score the pathways from the KEGG database. Let's say we are interested in testing the 'cell cycle' and the 'apoptosis' pathways and want to correct the resulting p-values by the FDR method of Benjamini $\&$ Hochberg \cite{BeHo:95}. <>= GAKEGG(xx=exprs(golubX), formula.full=~ALL.AML, formula.red=~1, model.dat=pData(golubX), id=c("04110", "04210"), annotation="hu6800", multtest="BH") @ If the \Rfunarg{id} argument is not specified, all KEGG gene sets will be tested. \subsection{The Broad Gene Sets} As described in the \Rpackage{globaltest} vignette, also the gene set collection from the Broad Institute can be tested quite easily (c1: positional gene sets, c2: curated gene sets, c3: motif gene sets, c4: computational gene sets, c5: GO gene sets). First the file \verb:msigdb_v.2.5.xml: containing all gene sets has to be downloaded from \verb&http://www.broad.mit.edu/gsea/downloads.jsp#msigdb&. The function \Rfunction{getBroadSets} from package \Rpackage{GSEABase} can then be used to read the xml file into \verb:R:. With the function \Rfunction{GABroad} gene sets can be selected and tested using the gene set IDs <>= broad <- getBroadSets("your/path/to/msigdb_v.2.5.xml") GABroad(xx=exprs(golubX), formula.full=~ALL.AML, formula.red=~1, model.dat=pData(golubX), id="chr5q33", collection=broad, annotation="hu6800") @ or all gene sets from one or several categories can be tested <>= GABroad(xx=exprs(golubX), formula.full=~ALL.AML, formula.red=~1, model.dat=pData(golubX), category=c("c1", "c3"), collection=broad, annotation="hu6800", multtest="holm") @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \setkeys{Gin}{width=0.6\textwidth} \section{Diagnostic Plots} \label{plots} There are two types of diagnostic plots available supporting communication and interpretation of results of the global ANCOVA. The \Rfunction{Plot.genes} visualizes the influence of individual genes on the test result while the \Rfunction{Plot.subjects} visualizes the influence of individual samples. Both plots are based on the decomposition of sums of squares. We use again the van't Veer data constricted to the genes of the p53--signalling pathway for demonstration of the plot functions. \subsection{Gene Plot} The influence of each gene on the outcome of the test can be assessed and visualized with a diagnostic plot generated by our function \Rfunction{Plot.genes}. It corresponds to the function \Rfunction{features} in the \Rpackage{globaltest} package. We use the \Rfunction{features} function with the option \Rfunarg{what = "w"} for displaying weighted gene-wise test statistics, which corresponds best to what is shown in \Rfunction{Plot.genes}. The function \Rfunction{Plot.genes} gives a graphical display of single gene-wise analysis for all genes. Bars are always positive as a reduction of sum of squares is always achieved in this case. The bar height indicates the influence of the respective gene on the test statistic. The added reference line is the residual mean square error per gene and corresponds to the expected height of the bars under the null hypothesis which says that the gene is not associated with the clinical outcome. The actual and expected bar heights also correspond to the nominator and denominator of gene-wise F-statistics. Hence the ratio of the two values is a measure for the association of the respective gene with the phenotype. Bar heights for all genes can be returned by setting the option \Rfunarg{returnValues} to \Robject{TRUE}. This helps to detect genes with most influence on the global statistics. Note however that comparisons between different gene groups can not easily be done by means of these values directly since different group sizes have an impact on global significance. The bars can be colored according to a variable of interest with the option \Rfunarg{Colorgroup} in order to show in which of the groups a gene has the highest expression values. The automatically chosen bar labels can be manipulated with the parameter \Rfunarg{bar.names}. The commands for creating gene plots in the \Rpackage{GlobalAncova} and the \Rpackage{globaltest} are as follows. Note that for the former one again three alternatives for function calls are provided, see section \ref{onepathway} for details. The two approaches show almost the same results (figures \ref{ga.genepl1} and \ref{gt.genepl1}). We prefer plotting horizontal bars rather than vertical because we think it is easier to read off the bar heights this way. <>= Plot.genes(xx=vantVeer, group=metastases, test.genes=p53) gt.vV <- gt(metastases, vantVeer, subsets=p53) features(gt.vV, what="w") @ \begin{figure}[htb!] \begin{center} <>= Plot.genes(xx=vantVeer, group=metastases, test.genes=p53) @ \vspace{-0.4cm} \caption{{\small \label{ga.genepl1} Gene Plot for the van't Veer data with \Rpackage{GlobalAncova}. Shown are the genes of the p53--signalling pathway. The bar height indicates the influence of the respective gene on the test statistic. The color shows in which of the phenotype groups the gene has higher expression values. The reference line is the residual mean square error per gene.}} \end{center} \end{figure} \begin{figure}[htb!] \begin{center} <>= features(gt.vV, what="w") @ \vspace{-0.4cm} \caption{{\small \label{gt.genepl1} Gene Plot for the van't Veer data with \Rpackage{globaltest}. Shown are the genes of the p53--signalling pathway. The bar height indicates the influence of the respective gene on the test statistic. The position of the fat marks gives the expected height of the bar under the null hypothesis. The other marks indicate with how many standard deviations the bar exceeds this reference.}} \end{center} \end{figure} In this case where only the influence of one variable is of interest (and therefore the easiest version of possible function calls is chosen), the same variable is assumed to be relevant for coloring. However one is free to specify another coloring. For example for the same plot we could ask which genes are higher expressed in samples with either positive or negative Estrogen receptor status, see figure \ref{ga.genepl2}. <>= Plot.genes(xx=vantVeer, formula.full=~metastases, formula.red=~1, model.dat=phenodata, test.genes=p53, Colorgroup="ERstatus") @ \begin{figure}[htb!] \begin{center} <>= Plot.genes(xx=vantVeer, formula.full=~metastases, formula.red=~1, model.dat=phenodata, test.genes=p53, Colorgroup="ERstatus") @ \vspace{-0.4cm} \caption{{\small \label{ga.genepl2} Gene Plot for the van't Veer data with \Rpackage{GlobalAncova}. Shown are the genes of the p53--signalling pathway. The bar height indicates the influence of the respective gene on the test statistic. The color shows in which of the specified phenotype groups, in this case Estrogen receptor status, the gene has higher expression values. The reference line is the residual mean square error per gene.}} \end{center} \end{figure} \subsection{Subjects Plot} The function \Rfunction{Plot.subjects} visualizes the influence of the individual samples on the test result and corresponds to the \Rfunction{subjects} of Goeman. As for the \Rfunction{features} plot we use the option \Rfunarg{what = "w"} to get closest to the output of \Rfunction{Plot.subjects}. The function \Rfunction{Plot.subjects} gives information on the reduction of sum of squares per subject. Here we sum over genes. Large reduction demonstrates a good approximation of a subject's gene expressions by the corresponding group means. If an individual does not fit into the pattern of its phenotype, negative values can occur. A small p--value will therefore generally coincide with many positive bars. If there are still tall negative bars, these indicate deviating samples: removing a sample with a negative bar would result in a lower p-value. The bars are colored to distinguish samples of different clinical entities that can again be specified by the user through the option \Rfunarg{Colorgroup}. With the option \Rfunarg{sort} it is also possible to sort the bars with respect to the phenotype groups. Bar labels can be changed with the argument \Rfunarg{bar.names}. Also in the subjects plot bar heights can be returned by setting the option \Rfunarg{returnValues} to \Robject{TRUE}. That may help to detect, not only visually, samples which do not fit into their respective clinical groups. We compare again the different approaches (figures \ref{ga.subjects1} and \ref{gt.subjects1}): <>= #colnames(exprs(golubX)) <- pData(golubX)[ ,1] Plot.subjects(xx=vantVeer, group=metastases, test.genes=p53, legendpos="bottomright") subjects(gt.vV, what="w") @ \begin{figure}[htb!] \begin{center} <>= Plot.subjects(xx=vantVeer, group=metastases, test.genes=p53, legendpos="bottomright") @ \vspace{-0.4cm} \caption{{\small \label{ga.subjects1} Subjects Plot for the van't Veer data with \Rpackage{GlobalAncova}. The bar height indicates the influence of the respective sample on the test result. If an individual does not fit into the pattern of its phenotype, negative values can occur. Bars are colored corresponding to phenotype groups.}} \end{center} \end{figure} \begin{figure}[htb!] \begin{center} <>= subjects(gt.vV, what="w") @ \vspace{-0.4cm} \caption{{\small \label{gt.subjects1} Subjects Plot for the van't Veer data with \Rpackage{globaltest}. The bar height indicates the influence of the respective sample on the test result. If an individual does not fit into the pattern of its phenotype, negative values can occur. The fat marks on the bars show the expected influence of the samples under the null hypothesis. The other marks indicate the standard deviation of the influence of the sample under the null hypothesis.}} \end{center} \end{figure} The function \Rfunction{Plot.subjects} can be invoked by the three alternative function calls (see section \ref{onepathway}) and hence also plots corresponding to more complex testing challenges can be produced as well. To give just one example we consider again the influence of the tumour grade, which can take three possible values, on gene expression (figure \ref{ga.subjects2}). <>= Plot.subjects(xx=vantVeer, formula.full=~grade, formula.red=~1, model.dat=phenodata, test.genes=p53, Colorgroup="grade", legendpos="topleft") @ \begin{figure}[htb!] \begin{center} <>= Plot.subjects(xx=vantVeer, formula.full=~grade, formula.red=~1, model.dat=phenodata, test.genes=p53, Colorgroup="grade", legendpos="topleft") @ \vspace{-0.4cm} \caption{{\small \label{ga.subjects2} Subjects Plot for the van't Veer data with \Rpackage{GlobalAncova}. Tumour grade is the clinical variable of interest. The bar height indicates the influence of the respective sample on the test result. If an individual does not fit into the pattern of its phenotype, negative values can occur. Bars are colored corresponding to phenotype groups.}} \end{center} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Acknowledgements} We thank Sven Kn\"uppel who took the initiative and contributed a C-code implementation of the permutation test to our package.\\ This work was supported by the NGFN project 01 GR 0459, BMBF, Germany. \bibliographystyle{plain} \bibliography{references} \end{document}