% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % %\VignetteIndexEntry{annaffy Primer} %\VignetteDepends{Biobase, GO.db, KEGG.db, multtest, hgu95av2.db} %\VignetteKeywords{HTML, URL} %\VignettePackage{annaffy} \documentclass[12pt]{article} \usepackage{hyperref} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \begin{document} \title{Textual Description of annaffy} \author{Colin A. Smith} \maketitle \section*{Introduction} \verb+annaffy+ is part of the Bioconductor project. It is designed to help interface between Affymetrix analysis results and web-based databases. It provides classes and functions for accessing those resources both interactively and through statically generated HTML pages. The core functionality of \verb+annaffy+ depends on annotation contained in Bioconductor data packages. The data packages are created by the \verb+SQLForge+ code inside of another package called \verb+AnnotationDbi+. It gathers annotation data from many diverse sources and makes the information easily processed by R. Preconstructed packages for most Affymetrix chips are available on the Bioconductor web site. \section{Loading Annotation Data} \verb+annaffy+ represents each type of annotation data as a different class. Currently implemented classes include: \begin{description} \item[aafSymbol] gene symbol \item[aafDescription] gene description/name \item[aafFunction] gene functional description \item[aafChromosome] genomic chromosome \item[aafChromLoc] location on the chromosome (in base pairs) \item[aafGenBank] GenBank accession number \item[aafLocusLink] LocusLink ids (almost never more than one) \item[aafCytoband] mapped cytoband location \item[aafUniGene] UniGene cluster ids (almost never more than one) \item[aafPubMed] PubMed ids \item[aafGO] Gene Ontology identifiers, names, types, and evidence codes \item[aafPathway] KEGG pathway identifiers and names \end{description} For each class, there is a constructor function with the same name. It takes as arguments a vector of Affymetrix probe ids as well as the chip name. The chip name corresponds to the name of the data package that contains the annotation. If the data package for the chip is not already loaded, the constructor will attempt to load it. The constructor returns a list of the corresponding objects populated with annotation data. (\verb+NA+ values in the annotation package are mapped to empty objects.) <>= library("annaffy") @ (For the purpose of demonstration, we will use the \verb+hgu95av2.db+ metadata package and probe ids from the \verb+aafExpr+ dataset.) <<>>= data(aafExpr) probeids <- featureNames(aafExpr) symbols <- aafSymbol(probeids, "hgu95av2.db") symbols[[54]] symbols[55:57] @ All annotation constructors return their results as \verb+aafList+ objects, which act like normal lists but have special behavior when used with certain methods. One such method is \verb+getText()+, which returns a simple textual representation of most \verb+annaffy+ objects. Note the differing ways \verb+annaffy+ handles missing annotation data. <<>>= getText(symbols[54:57]) @ Other annotation constructors return more complex data structures: <<>>= gos <- aafGO(probeids, "hgu95av2.db") gos[[3]] @ The gene ontology constructor, \verb+aafGO()+, returns \verb+aafList+s of \verb+aafGO+ objects, which are in turn lists of \verb+aafGOItem+ objects. Within each of those objects, there are four slots: id, name, type, and evidence code. The individual slots can be accessed with the \verb+@+ operator. <<>>= gos[[3]][[1]]@name @ If the reader is not already aware, \verb+R+ includes two subsetting operators, which can be the source of some confusion at first. Single brackets (\verb+[]+) always return an object of the same type that they are used to subset. For example, using single brackets to subset an \verb+aafList+ will return another \verb+aafList+, even if it only contains one item. On the other hand, double brackets (\verb+[[]]+) return just a single item which is not enclosed in a list. Thus the above statement first picks out the third \verb+aafGO+ object, then the first \verb+aafGOItem+, and finally the name slot. \section{Linking to Online Databases} One of the most important features of the \verb+annaffy+ package its ability to link to various public online databases. Most of the annotation classes in annaffy have a \verb+getURL()+ method which returns single or multiple URLs, depending on the object type. The simplest annotation class which produces a URL is \verb+aafGenBank+. Because Affymetrix chips are generally based off GenBank, all probes have a corresponding GenBank accession number, even those missing other annotation data. The GenBank database provides information about the expressed sequence that the Affymetrix chip detects. Additionally, it helps break down the functional parts of the sequence and provides information about the authors that initially sequenced the gene fragment. (See this URL \href{http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=search&db=nucleotide&term=U41068%5BACCN%5D&doptcmdl=GenBank}{here}.) <<>>= gbs <- aafGenBank(probeids, "hgu95av2.db") getURL(gbs[[1]]) @ In most distributions of R, you can open URLs in your browser with the \verb+browseURL()+ function. Many other types of URLs are also possible. Entrez Gene (formerly LocusLink) is a very useful online database that links to many other data sources not referenced by Bioconductor. One worthy of note is OMIM, which provides relatively concise gene function and mutant phenotype information. (See this URL \href{http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=2322}{here}.) <<>>= lls <- aafLocusLink(probeids, "hgu95av2.db") getURL(lls[[2]]) @ If you are interested in exploring the area of the genome surrounding a probe, the \verb+aafCytoband+ provides a link to NCBI's online genome viewer. It includes adjacent genes and other genomic annotations. (See this URL \href{http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?direct=on&query=U02687%5BACCN%5D}{here}.) <<>>= bands <- aafCytoband(probeids, "hgu95av2.db") getURL(bands[[2]]) @ For primary literature information about a gene, use the \verb+aafPubMed+ class. It will provide a link to a list of abstracts on PubMed which describe the gene of interest. The list abstracts that Bioconductor provides are by no means complete and will sometimes only include the paper in which the gene was cloned. (See this URL \href{http://www.ncbi.nih.gov/entrez/query.fcgi?tool=bioconductor&cmd=Retrieve&db=PubMed&list_uids=15059064%2c14759363%2c14737077%2c14670924%2c14630076%2c14604974%2c14562119%2c14504097%2c12969963%2c12935959%2c12926083%2c12854887%2c12842996%2c12816873%2c12691136%2c12676789%2c12481903%2c12468433%2c12393674%2c12239147%2c12239146%2c12070009%2c12060771%2c12036858%2c11983110%2c11971190%2c8394751%2c7507245}{here}.) <<>>= pmids <- aafPubMed(probeids, "hgu95av2.db") getURL(pmids[[2]]) @ A number of interesting queries can be done with the gene ontology class. You can display the gene ontology family hierarchy for an entire probe id at once, including multiple GO ids. The usefulness of such a query may be dubious, but it is possible. See this URL \href{http://godatabase.org/cgi-bin/go.cgi?open_0=GO:0007155&open_0=GO:0005581&open_0=GO:0030199&open_0=GO:0005592&open_0=GO:0005737&open_0=GO:0030020&open_0=GO:0007605&open_0=GO:0006817&open_0=GO:0001501}{here}. <<>>= getURL(gos[[1]]) @ You can also show the family hierarchy for a single GO id. (See this URL \href{http://godatabase.org/cgi-bin/go.cgi?open_0=GO:0005592}{here}.) <<>>= getURL(gos[[1]][[4]]) @ The last link type of note is that for KEGG Pathway information. Most genes are not annotated with pathway data. However, for those that are, it is possible to retrieve schematics of the biochemical pathways a gene is involved in. (See this URL \href{http://www.genome.ad.jp/dbget-bin/show_pathway?MAP00480+2.5.1.18}{here}. Look for the enzyme in question to be highlighted in red.) <<>>= paths <- aafPathway(probeids, "hgu95av2.db") getURL(paths[[4]]) @ \section{Building HTML Pages} In addition to using \verb+annaffy+ interactively through R, it may also be desirable to generate annotated reports summarizing your microarray analysis results. Such a report can be utilized by a scientist collaborator with no knowledge of either R or Bioconductor. Additionally, by having all the annotation and statistical data presented together on one page, connections between and generalizations about the data can be made in a more efficient manner. The primary intent of the \verb+annaffy+ package is to produce such reports in HTML. Additionally, it can easily format the same report as tab-delimited text for import into a table, spreadsheet, or database. It supports nearly all the annotation data available through Bioconductor. Additionally, it has facilities for including and colorizing user data in an informative manner. The rest of this tutorial will make use of an \verb+ExpressionSet+ generated for demonstration purposes. It contains anonymized data from a microarray experiment which used the Affymetrix hgu95av2 chip. There are eight total samples in the set, four control samples and four experimental samples. 250 expression measures were selected at random from the results, and another 250 probe ids were selected at random and assigned to those expression measures. The data therefore has no real biological significance, but can still fully show the capabilities of \verb+annaffy+. \subsection{Limiting the Results} HTML reports generated by \verb+annaffy+ can grow to become quite large unless some measures are taken to limit the results. Multi-megabyte web pages are unwieldy and should thus be avoided. Doing a ranked statistical analysis is one way to limit results, and will be shown here. We will rank the expression measures by putting their two-sample Welch t-statistics in order of decreasing absolute value. The first step is to load the \verb+multtest+ package which will be used for the t-test. (It is also part of the Bioconductor project.) <<>>= library(multtest) @ The \verb+mt.teststat()+ function requires a vector that specifies which samples belong to the different observation classes. Using a few R tricks, that vector can be produced directly from the first covariate of \verb+pData+. <>= class <- as.integer(pData(aafExpr)$covar1) - 1 @ Using the class vector, we calculate the t-statistic for each of the probes. We then generate an index vector which can be used to order the probes themselves in increasing order. As a last step, we produce the vector of ranked probe ids. Latter annotation steps will only use the first 50 of those probes. <<>>= teststat <- mt.teststat(exprs(aafExpr), class) index <- order(abs(teststat), decreasing = TRUE) probeids <- featureNames(aafExpr)[index] @ \subsection{Annotating the Probes} Once there is a list of probes, annotation is quite simple. The only decision that needs to be made is which classes of annotation to include in the table. Including all the annotation classes, which is the default, may not be a good idea. If the table grows too wide, its usefulness may decrease. To see which columns of data can be included, use the \verb+aaf.handler()+ function. When called with no arguments, it returns the annotation types \verb+annaffy+ can handle. <<>>= aaf.handler() @ To help avoid typing errors, subset the vector instead of retyping each column name. <>= anncols <- aaf.handler()[c(1:3,8:9,11:13)] @ This may be too many columns, but it is possible at a later stage to choose to either not show some of the columns or remove them altogether. Note that by using the \verb+widget=TRUE+ option in the next function, it is also possible select data columns with a graphical widget. See Figure \ref{fig:widget.selector}. \begin{figure}[htbp] \begin{center} \includegraphics{selector} \caption{\label{fig:widget.selector}Graphical display for selecting annotation data columns. } \end{center} \end{figure} Now we generate the annotation table with the \verb+aafTableAnn()+ function. Note that for this tutorial, \verb+annaffy+ is acting as its own data package. If you wish to annotate results for other chips, download the appropriate data package from the Bioconductor website. <<>>= anntable <- aafTableAnn(probeids[1:50], "hgu95av2.db", anncols) @ To see what has been produced so far, use the \verb+saveHTML()+ method to generate the HTML report. Using the optional argument \verb+open=TRUE+ will open the resulting file in your browser. <<>>= saveHTML(anntable, "example1.html", title = "Example Table without Data") @ See this page online \href{http://genome.nasa.gov/downloads/annaffy/example1.html}{here}. \subsection{Adding Other Data} To add other data to the table, just use any of the other table constructors to generate your own table, and then merge the two. For instance, listing the t-test results along with the annotation data is quite useful. \verb+annaffy+ provides the option of colorizing signed data, making it easier to assimilate. <<>>= testtable <- aafTable("t-statistic" = teststat[index[1:50]], signed = TRUE) table <- merge(anntable, testtable) @ After HTML generation, a one line change to the style sheet header will change the colors used to show the positive and negative values. In fact, with a bit of CSS it is possible to heavily customize the appearance of the tables very quickly, even on a column by column basis. \verb+annaffy+ also provides an easy way to include expression data in the table. It colorizes the cells with varrying intensities of green to show relative expression values. Additionally, because of the way \verb+merge+ works, it will always match probe id rows together, regardless of their order. This allows a quick "sanity check" on the other statistics produced, and can help decrease user error. (Check, for example, that the t-statistics and ranking seem reasonable given the expression data.) <<>>= exprtable <- aafTableInt(aafExpr, probeids = probeids[1:50]) table <- merge(table, exprtable) saveHTML(table, "example2.html", title = "Example Table with Data") @ See this page online \href{http://genome.nasa.gov/downloads/annaffy/example2.html}{here}. Producing a tab-delimited text version uses the \verb+saveText()+ method. The text ouput also includes more digits of precision than HTML. <<>>= saveText(table, "example2.txt") @ \section{Searching Metadata} Often a biologist will make hypotheses about changes in gene expression either before or after the microarray experiment. In order to facilitate the formulation and testing of such hypotheses, \verb+annaffy+ includes functions to search annotation metadata using various criteria. All search functions return character vectors of Affymetrix probe ids that can be used to subset data and annotation. \subsection{Text} The two currently implemented search functions are simple and easy to use. The first is a text search that matches against the textual representation of biological metadata. Recall that textual representations are extracted using the \verb+getText()+ method. For complex annotation structures, the textual representation can include a variety of information, including numeric identifiers and textual descriptions. For the purposes of demonstration, we will use the hgu95av2.db annotation data package available through Bioconductor. <<>>= library(hgu95av2.db) probeids <- ls(hgu95av2SYMBOL) gos <- aafGO(probeids[1:2], "hgu95av2.db") getText(gos) @ The textual search is probably best applied to the Symbol, Description, and Pathway metadata types. (A specialized Gene Ontology search will be discussed later.) For instance, to find all the kinases on a chip, simply do a text search of Description for kinases. <<>>= kinases <- aafSearchText("hgu95av2.db", "Description", "kinase") kinases[1:5] print(length(kinases)) @ One can search multiple metadata types with multiple queries all with a single function call. For instance, to find all genes with "ribosome" or "polymerase" in the Description or Pathway annotation, use the following function call. <<>>= probes <- aafSearchText("hgu95av2.db", c("Description", "Pathway"), c("ribosome", "polymerase")) print(length(probes)) @ When doing searches of multiple annotation data types or multiple terms, by default the search returns all probe ids matching any of the search criteria. That can be altered by changing the logical operator from OR to AND using the \verb+logic="AND"+ argument. This is useful because \verb+aafSearchText()+ does not automatically tokenize a search query as Google and many other search engines do. For example, "DNA polymerase" finds all occurrences of that exact string. To find all probes whose description contains both "DNA" and "polymerase", use the following function call. <<>>= probes <- aafSearchText("hgu95av2.db", "Description", c("DNA", "polymerase"), logic = "AND") print(length(probes)) @ Another useful application of the text search is to map a vector of GenBank accession numbers onto a vector of probe ids. This comes in handy if you wish to filter microarray data based on the results of a BLAST job. <<>>= gbs <- c("AF035121", "AL021546", "AJ006123", "AL080082", "AI289489") aafSearchText("hgu95av2.db", "GenBank", gbs) @ Lastly, two points for power users. One, the text search is always case insensitive. Second, individual search terms are treated as Perl compatible regular expressions. This means that you should be cautious of special regular expression characters. See the Perl documentation\footnote{\url{http://perldoc.perl.org/perlre.html}} for further information about how to use regular expressions. \subsection{Gene Ontology} The second type of search available is a Gene Ontology search. It takes a vector of Gene Ontology identifiers and maps them onto a list of probe ids. Gene Ontology is a tree and you can include or exclude descendents with the \verb+descendents+ argument. The search also supports the \verb+logic+ argument. Because the Bioconductor metadata packages include pre-indexed Gene Ontology mappings, this search is very fast. The input format for Gene Ontology ids is very flexible. You may use numeric or character vectors, either excluding or including the "GO:" prefix and leading zeros. <<>>= aafSearchGO("hgu95av2.db", c("GO:0000002", "GO:0000008")) aafSearchGO("hgu95av2.db", c("2", "8")) aafSearchGO("hgu95av2.db", c(2, 8)) @ A good source for finding relevant Gene Ontology identifiers is the AmiGO website\footnote{\url{http://www.godatabase.org/}}, operated by the Gene Ontology Consortium. \end{document}