% \VignetteIndexEntry{SAFE manual} % \VignetteDepends{safe} % \VignetteKeywords{Expression Analysis} % \VignettePackage{safe} \documentclass[11pt, a4paper]{article} \setlength{\topmargin}{-0.2in} \setlength{\oddsidemargin}{0.05 in} \setlength{\textwidth}{6in} \setlength{\textheight}{9in} \headsep=0in \oddsidemargin=0in \evensidemargin=0in \title{Significance Analysis of Function and Expression} \author{William T. Barry \\ bill.barry@duke.edu} \begin{document} \maketitle \section{Introduction} This vignette demonstrates the utility and flexibility of the R-package \texttt{safe} in conducting tests of functional categories for gene expression studies. SAFE is a resampling-based method of testing that is applicable to many different experimental designs and sets of functional categories. SAFE extends and builds on an approach employed in Virtaneva {\em et al.} (2001), and defined more rigorously in recent publications from Barry {\em et al.} (2005 and 2008). It is suggested that all users refer to these publications in order to understand the SAFE terminology and principles in greater detail. We also ask that Barry {\em et al.} (2008) be cited in publications that use the updated version of \texttt{safe}. Several of the extensions to the \texttt{safe} package are itemized below, and relate to added functionality discussed in Barry {\em et al.} (2008). Further, more functions and arguments are provided which improve the input and output capabilities of the package. Manuscripts for the citations mentioned above, and additional tutorials and examples are available at the following URL. \begin{center} \verb+http://www.duke.edu/~dinbarry/SAFE/+ \end{center} \section{Changes in version 2.0} The following list describes the extended capability of \texttt{safe} version 2.0. Examples of their implementation and the changes to functional arguments are illustrated in subsequent sections and detailed in help documents. \begin{itemize} \item Gene categories can be automatically generated within \texttt{safe} using the \texttt{platform} and \texttt{annotate} arguments. This can build categories from GO ontologies, KEGG pathways or PFAM domain, if a suitable Bioconductor annotation package exists for the array type. \item The sparse matrix package \texttt{SparseM} is incorporated to ease the memory constraints in working with large datasets and/or many hundreds of categories. \item Local statistics are added for analyses of paired data (\texttt{local = "t.paired"}), and a Cox proportional hazard model for censored survival data (\texttt{local = "z.COXPH"}). \item Global statistics are added for doing resampling-based tests of genelist-type analyses (\texttt{global = "Fisher"}) or (\texttt{global = "Pearson"}) or mean difference (\texttt{global = "AveDiff"}). See Barry et al. (2008) for further explanation of these global statistics. \item Non-resampling based error estimates are available including a Bonferroni correction or Holm's step-down procedure for the family-wise error rate (\texttt{error = "FWER.Bonf"}) or (\texttt{error = "FWER.Holm"}), and the Benjamini-Hochberg step-up procedure to control the false discovery rate (\texttt{error = "FDR.BH"}). \item The gene-specific results within a given category can be displayed using \texttt{gene.results}. \item SAFE results can be plotted across the directed graph of Gene Ontology using \texttt{safedag}. \item A bootstrap-based test is included which we have shown to be more powerful, require fewer resamples, and extendable to more complicated experimental designs with covariate information; see Barry {\em et al.} (2008) for more discussion of bootstrap-based tests. \end{itemize} \section{SAFE implementation and output} Here, we implement \texttt{safe} using the datasets and annotations in Bioconductor packages listed below. <>= library(safe) library(multtest) library(hu6800.db) @ Every SAFE analysis requires three elements from a dataset: (1) gene expression data, (2) a response vector associated with the samples, and (3) a matrix containing category assignments that is either user-defined or built from annotation packages for the array platform. The expression data should be in the form of an $m \times n$ matrix, where appropriate normalization and other pre-processing steps have been taken. It should be noted that in the current version of \texttt{safe}, missing values are not allowed in the expression data, and must be imputed prior to analysis. In this vignette, we will use the AML/ALL dataset from Golub {\em et al.} (1999) as illustration. <<>>= data(golub) dimnames(golub)[[1]] <- golub.gnames[,3] @ \texttt{golub} is a matrix of normalized expression estimates for 3,051 genes across 38 samples. Row-names of Affymetrix hu6800 probeset IDs are added to \texttt{golub}. The row names are necessary for building gene categories on the subset of probesets retained in \texttt{golub}. The comparison of interest is between AML and ALL tumors subtypes. Tumor classification of samples is provided in \texttt{golub.cl} ( AML = 1, ALL = 0 ). Section 4 will discuss the valid forms of response vectors for the experimental designs allowed in \texttt{safe}. <<>>= table(golub.cl) @ For the primary example in this vignette, the functional categories of interest will be KEGG pathways. Pathway annotation for the Affymetrix array is available from the \texttt{hu6800} package. For the sake of parsimony, we will only consider pathways that have at minimum of 10 probeset members among the 3,051 in the \texttt{golub} dataset. In version 2.0, the KEGG categories can now be automatically generated by the \texttt{safe} function, and is discussed is more detail in section 8. NOTE: to be more efficient while running multiple examples, we will also generate a matrix of indicator variables for KEGG category membership externally from \texttt{safe} for repeated use. When working with user-defined category matrices, it is strongly suggested that appropriate names are given to all objects so the rows in the data and C.matrix correspond, and the output from \texttt{safe} is properly labeled. <<>>= C.mat <- getCmatrix(gene.list = as.list(hu6800PATH), as.matrix = TRUE, present.genes = golub.gnames[,3], min.size = 10) dimnames(C.mat)[[2]] <- paste("KEGG:",dimnames(C.mat)[[2]],sep="") @ <>= set.seed(12345) results <- safe(golub, golub.cl, platform = "hu6800",annotate = "KEGG", min.size = 10) @ The SAFE framework for testing gene categories is a two-stage process, where ``local'' statistics assess the association between expression and the response of interest in a gene-by-gene manner, and a ``global'' statistic measures the extent of association in genes assigned to a category relative to their complement. As indicated, the default local statistic for the 2-sample comparison of AML and ALL is the Student's t-statistic. An increased amount of differential expression within a KEGG pathway is determined using a global Wilcoxon rank sum statistic by default. Inference on each type of statistic is achieved through permutation. <<>>= results @ The basic output from \texttt{safe} is an object of class {\em SAFE}. Showing objects of class {\em SAFE} will print details on the type of analysis and the results for categories that attain a certain level of significance. Here, significant results are printed for the 6 categories that have empirical p-values $\le$ 0.05. For each category, the number of annotated genes in the dataset is displayed along with the Wilcoxon global statistic and its empirical p-value. NOTE: as in standard gene-by-gene analyses, it is of critical importance to account for multiple comparisons when considering a number of categories simultaneously. Several options for adjusted p-values are provided in \texttt{safe}, and discussed in detail in Section 6. Gene-specific results within a category are now made more readily accessible through the \texttt{gene.results} function. We believe this is very useful for investigators interested in seeing which category members are contributing to its significance. The following example demonstrates how the direction and magnitude of differential expression are displayed by default. A list of two data.frames can also be returned with the argument \texttt{print.it = FALSE}. <<>>= gene.results(results,cat.name="KEGG:00860") @ \section{Experimental Designs and Local Statistics} The basic 2-sample comparison in the example above is one of several experimental designs that \texttt{safe} can automatically accommodate. The following examples illustrate the arguments needed for the variety of designs and statistics allowed. In addition to the internal local statistics for generic comparisons, one can also employ user-defined functions in \texttt{safe} to extend its utility. For 2-sample comparisons, the response vector can either be given as a (0,1) vector, or a character vector with two unique elements. It is important to note that when a character vector is passed to \texttt{safe} as the response, the assignment of the first array becomes Group 1, and is printed as a warning. Thus, the sign of the t-statistics have flipped below. NOTE: to decrease computation time, the permutation testing is bypassed by using the argument \texttt{Pi.mat = 1}. By default, a Student's t-statistic is employed for 2-sample comparisons, but if unequal variances are assumed, the Welch t-statistic can be selected using \texttt{local="t.Welch"}. <>= y.vec <- c("ALL","AML")[1+golub.cl] results2 <- safe(golub, y.vec, C.mat, Pi.mat = 1) results3 <- safe(golub, golub.cl, C.mat, local="t.Welch", Pi.mat = 1) round(cbind(Student1 = results@local.stat[1:3], Student2 = results2@local.stat[1:3], Welch = results3@local.stat[1:3]),3) @ For multi-class designs, response vectors should be character or numeric vectors with unique values for each group. If a character vector is supplied for \texttt{y.vec}, an ANOVA F-statistic is computed by default; otherwise, an ANOVA test can be specified with the argument \texttt{local = "f.ANOVA"} for numeric class assignments. Simple linear regression is performed if a numeric vector with more than two unique values is supplied, or by using the argument \texttt{local = "t.LM"}. Version 2.0 of \texttt{safe} is extended to include the paired t-test for matched experiments. For this, samples are identified by (+/-) pairs of integers. Internally, the permutation algorithm changes from random sampling without replacement, to randomly flipping the signs of each paired sample. <<>>= y.vec <- rep(1:19,2)*rep(c(-1,1),each=19) y.vec results2 <- safe(golub, y.vec, C.mat, local="t.paired", Pi.mat = 1) @ In Barry et. al. (2005), SAFE was applied to a Cox proportional hazards model for associating tumor gene expression to the survival of lung cancer patients. To include this functionality in \texttt{safe} The argument \texttt{local = "z.COXPH"} will conduct a univariate Wald test for each gene, with event times given as y.vec. This requires providing censoring indicators as a logical or numeric vector, \texttt{censor}, in the argument \texttt{args.local}: <<>>= y.vec <- rexp(38) cens <- rep(0:1,c(30,8)) results2 <- safe(golub, y.vec, C.mat, local="z.COXPH", Pi.mat = 1, args.local = list(censor=cens)) @ In addition to these predefined local statistics, \texttt{safe} has been structured such that the user can specify other statistics. In creating a function for computing local statistics, it must take as input the matrix of expression data and response information as illustrated below; additional information can be passed as objects in the optional list, \texttt{args.local}. Here, we create a function for a one-sided Wilcoxon statistic for gene-specific increases in expression in the AML subtype (this choice of local statistic should not be confused with the default global statistic) <>= local.Wilcoxon<-function(X.mat,y.vec, ...){ return(function(data,trt = (y.vec == 1)) { return(as.numeric(trt %*% apply(data,1,rank))) }) } results2 <- safe(golub, golub.cl, C.mat, Pi.mat = 1, local="Wilcoxon") @ <>= cbind(Student1 = round(results@local.stat[1:3],3), Rank.Sum=results2@local.stat[1:3]) @ As a resampling-based method, \texttt{safe} is computationally intensive, so considerations of efficiency should made in programming user-defined functions for local and global statistics. The above example, while simple, is much slower than the default run of \texttt{safe} because of the \texttt{apply} function. Likewise, for Barry et. al. (2005), a separate and faster function was written in C for solving the iterative solution to the univariate Cox proportional hazards model. Interfacing with C or another foreign language is highly suggested for intensive computational settings. A complete discussion of how to design and include user-defined functions will not be included in this vignette. \section{Alternative Global Statistics} In the above SAFE analyses, a functional category was compared to its complement set of genes through a Wilcoxon rank sum statistic. The merits of using rank-based statistics for functional analysis are discussed in more detail in Barry {\em et al.} (2005). However, the the SAFE framework naturally extends to other statistics used in gene category analyses. This way one more properly account for gene correlation in testing categories (see Barry {\em et al.} 2008). By default, \texttt{safe} conducts two-sided tests, whereby one takes the absolute value of local statistics such as a Student's $t$, before ranking genes. In this way, one can identify categories showing both consistent up-regulation, down-regulation, and also bi-directional differential expression. To conduct one-sided tests, one must specify \texttt{args.global = list(one.sided=T)} to consider only genes in the positive direction to be significant. One popular way of examining categories is through ``gene-list enrichment'' methods, that were developed as {\em post hoc} means of testing once the genes with significant differential expression had been identified. These methods use a global statistics that only consider the dichotomous outcomes of gene-specific hypothesis tests, and typically use Fisher's Exact test, or Pearson's test for a difference in proportions. p-values are often times extremely anti-conservative under the false assumption of gene independence, which can lead to spurious results. For this reason, we have extended SAFE to these global statistics such that valid p-values can be obtained. In using the gene-list type global statistics, one must specify either the list length, as in the example below, or a (one- or two-sided) cut-off value: <>= set.seed(12345) results2 <- safe(golub, golub.cl, C.mat, global="Fisher", args.global = list(one.sided=F,genelist.length=200)) @ <<>>= results2 @ The following calculation demonstrates the inappropriate p-value one would get from a naive application of Fisher's Exact test to KEGG:04640. <<>>= 1-phyper(12-1, 70, 3051-12, 200) @ Alternatively, a one-sided cutoff value for local statistics is declared by the argument \texttt{args.global = list(one.sided = TRUE, genelist.cutoff=2.0)}. Further, the Pearson test for difference in proportions (which is equivalent to a Chi-squared test) can be specified by the argument \texttt{global="Pearson"}, and the cutoff for the gene-list is specified in the same manners as shown above. One can also substitute the average difference in local statistics as a measure of category-wide increases in differential expression using the argument \texttt{global="AveDiff"}. An alternative non-parametric 2-sample comparison that is also valid (albeit more computationally intensive) is the Kolmogorov-Smirnoff test; a computationally inefficient test can be specified as \texttt{global="Kolmogorov"}. See \verb+http://www.duke.edu/~dinbarry/SAFE/+ for further examples that use these alternative global statistics. \section{Adjusting for multiple-testing in SAFE} As in standard gene-by-gene analyses, it is important to account for multiple comparisons when considering a set of categories. Since SAFE is a resampling-based test, permutation-based error rate methods have been incorporated into \texttt{safe} when applicable. Shown below are the results from the first example once multiple testing is accounted for using the Yekutieli-Benjamini method of estimating the {\em false discovery rate} (FDR). <>= set.seed(12345) results <- safe(golub, golub.cl, C.mat, error = "FDR.YB", alpha = 0.25) @ <<>>= results @ NOTE: By default, when correcting for multiple testing, the cutoff for display changes from categories with empirical p-value less than 0.05 to those with adjusted p-values less than 0.1. The cutoff for display can be changed with the \texttt{alpha} argument in \texttt{safe} As shown above, the most significant KEGG pathways are only marginally significant in their association to leukemia subtype after accounting for multiple comparisons through the FDR. In addition to the Yekutieli-Benjamini FDR estimate, \texttt{safe} can use the permutation algorithm to estimate the {\em family-wise error rate} with the Westfall-Young method (\texttt{error = "FWER.WY"}). Although we feel these two permutation-based procedures for controlling error are superior by empirically accounting for correlation among tests, one can also apply tradition methods including a Bonferroni correction or Holm's step-down procedure for the family-wise error rate (\texttt{error = "FWER.Bonf"}) or (\texttt{error ="FWER.Holm"}), and the Benjamini-Hochberg step-up procedure to control the false discovery rate (\texttt{error = "FDR.BH"}). These may be useful when comparing results from SAFE with other non-resampling-based procedures, or when using the bootstrap algorithms discussed in the following section. \section{Bootstrap-based tests in SAFE} In Barry {\em et al.} (2008), a bootstrap-based version of SAFE is proposed that is shown to generally be more powerful, and applicable to a wider set of experimental designs. Two basic methods of hypothesis testing are available: 1) The argument (\texttt{method="bootstrap"}) or (\texttt{method="bootstrap.t"}), will invoke pivot tests to look for the exclusion of a null value from Gaussian confidence intervals based on resampled estimates of the mean and variance of the global statistic; 2) alternatively (\texttt{method="bootstrap.q"}), will invoke tests based on the exclusion of a null value from the alpha-quantile interval of the resampled global statistic. The following illustrates that more categories are identified as marginally significant under bootstrap-resampling version of SAFE. <>= set.seed(12345) results2 <- safe(golub, golub.cl, C.mat, method = "bootstrap", error = "FDR.BH") @ <<>>= results2 @ Based on the requirements for bootstrap-based hypothesis testing (see Barry {\em et al.} 2008 for explanation), they can only be performed using (\texttt{global \%in\% c("Wilcoxon", "AveDiff", "Pearson")}). Further the permutation-based error rate estimates are no longer applicable, so that the options available are (\texttt{error \%in\% c("FDR.BH", "FWER.Bonf", "FWER.Holm", "none")}). By default, the data are resampled 1000 times when selecting \texttt{method \%in\% c("permutation", "bootstrap.q")} though often times $>10$-fold more resamples are suggested if there are several hundred categories being investigated. The Gaussian bootstrap-based test has the added advantage that empirical p-values are not bounded by the total number of resamples taken. Thus, small p-values can be obtained without intensive computational effort; this is of particular importance when overcoming stringent correction for high degrees of multiple-testing. As such, 200 resamples are taken for (\texttt{method = "bootstrap"}), and have been demonstrated to provide sufficient error control. Also, permutation-based p-values for local statistics are no longer obtained under bootstrap resampling. Instead, empirical p-values can be obtained using the exclusion of 0 from quantile intervals with (\texttt{args.local = list(boot.test = "q")}) and Gaussian intervals with (\texttt{args.local = list(boot.test = "t")}). A null value of 0 relates to no differential expression in the supplied local statistics, however this must be reasonable for any user-defined statistics ({\em e.g.} it does not apply for a Kolmogorov-Smirnov test statistic). \section{Sources of Functional Categories} In the above sections, functional categories were derived from KEGG pathways as provided in the package \texttt{hu6800}. Functional categories can also be derived from other sources of information in the same package. The Protein Families database can also be used to generate categories of genes that share homologous domains: <<>>= results2 <- safe(golub, golub.cl, platform="hu6800", annotate="PFAM", min.size=10,method="bootstrap") @ <<>>= results2 @ Gene Ontology is also available from \texttt{hu6800} and other Bioconductor metadata packages. \texttt{annotate = "GO.ALL"} will form categories from all three ontologies, while ``GO.CC'', ``GO.BP'', and ``GO.MF'' will work with Cellular Compartment, Biological Process and Molecular Function respectively. It is important to note that in the hierarchical structure of the GO vocabularies, a gene category is generally thought of as containing the set of genes directly annotated to a term, and also to any terms beneath it in the ontology. The C.matrix of each can be externally built with the \texttt{getCmatrix} function as follows (in a much more efficient manner than in SAFE 1.0). <<>>= GO.list <- as.list(hu6800GO2ALLPROBES) C.mat.CC <- getCmatrix(keyword.list = GO.list, GO.ont = "CC", present.genes = dimnames(golub)[[1]], min.size = 10, max.size=200) results2 <- safe(golub, golub.cl, C.mat.CC, method="bootstrap") @ \section{Plotting SAFE results} For a single category, we have proposed that the differential expression of genes be plotted as a SAFE-plot (Barry {\em et al.}, 2005). Shown below is the SAFE-plot for the most significant KEGG pathway, which is the default output of \texttt{safeplot} when a object of class {\em SAFE} is provided. <>= safeplot(results) @ SAFE-plots of other categories can be generated with the argument \texttt{cat.name}, as shown below for a non-significant category. <>= safeplot(results,cat.name="KEGG:00010") @ SAFE-plots show the cumulative distribution function (CDF) for the ranked local statistics from a given category (solid line). A significant category will have more extreme associations to the response of interest than its complement, resulting in either a right-ward, left-ward, or bidirectional shift in the CDF away from the unit line (dashed line). The shaded regions of the plot correspond to the genes that pass a nominal level of significance (empirical p-values $\le$ 0.05 by default). Also, the genes in the category are shown as tick marks along the top of the graph, and depending on the category size, either all genes in the category are labeled, or only ones in the shaded region of the graph. Thus SAFE-plots show that the KEGG pathways 00860 and 00590 show upregulation in AML on average, while 00970 shows downregulation in AML on average, and 00010 shows no consistent trend of differential expression. Finally, Gene Ontology is a unique structured vocabulary where genes are annotated from broad to narrow levels of classification in a directed acyclic graph (DAG). As such, many categories are highly related in their gene membership, and visualizing results across the ontology can be useful in ascertaining the relationship among multiply significant categories. The following function interacts with the \texttt{GOstats} and \texttt{Rgraphviz} packages in order to overlay SAFE results onto the DAG structure in a color-metric manner. By default, nodes with unadjusted p-values less than 0.001 are drawn in blue; less than 0.01 are drawn in green; and less than 0.1 are drawn in red. User-defined cutoffs for the three colors can be specified using the argument \texttt{color.cutoffs}. Further illustrations are provided in example scripts available at \verb+http://www.duke.edu/~dinbarry/SAFE/+ <>= safedag(results2,filter=1) @ And one can also zoom in on parts of the DAG by specifying a node to be the top of the graph. <>= safedag(results2, filter=1,top="GO:0044428") @ \section{References} \begin{itemize} \item Barry, W.T., Nobel, A.B. and Wright, F.A. (2005) `Significance Analysis of functional categories in gene expression studies: a structured permutation approach', {\em Bioinformatics}, {\bf 21}(9), 1943--1949. \item Barry, W.T., Nobel, A.B. and Wright, F.A. (In press) `A Statistical Framework for Testing Functional Categories in Microarray Data', {\em Annals of Applied Statistics}. \item Virtaneva, K.I., Wright, F.A., Tanner, S.M., Yuan, B., Lemon, W.J., Caligiuri, M.A., Bloomfield, C.D., de~laChapelle, A., \& Krahe, R. (2001) `Expression profiling reveals fundamental biological differences in acute myeloid leukemia with isolated trisomy 8 and normal cytogenetics'{\em Proc Natl Acad Sci U S A} {\bf 98}(3), 1124--1129. \item Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caliguiri, M.A., Bloomfield, C.D., and Lander, E.S. (1999) `Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring', {\em Science}, {\bf 286}, 531--537. \item Yekutieli, D. and Benjamini, Y. (1999) `Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics' {\em J Statist Plann Inference} {\bf 82}, 171--196 \item Westfall, P.H. and Young, S.S. (1989) `P-value adjustment for multiple tests in multivariate binomial models', {\em J Amer Statist Assoc} {\bf 84}, 780--786 \end{itemize} \end{document}