\documentclass[12pt]{article} \topmargin 0in \headheight 0in \headsep 0in \oddsidemargin 0in \evensidemargin 0in \textwidth 176mm \textheight 215mm \usepackage{natbib} \usepackage{Sweave} \usepackage{url} \DeclareGraphicsRule{.tif}{png}{.png}{`convert #1 `dirname #1`/`basename #1 .tif`.png} \newcommand{\np}{\vspace{5mm}\par} \begin{document} %\VignetteIndexEntry{goseq User's Guide} \title{\texttt{goseq}: Gene Ontology testing for RNA-seq datasets} \author{Matthew D. Young \and Nadia Davidson \\ \texttt{nadia.davidson@mcri.edu.au} \and Matthew J. Wakefield \and Gordon K.\ Smyth \and Alicia Oshlack} % Please increment date when working on this document, so that % date shows genuine change date, not merely date of compile. \date{26 June 2014} \maketitle \section{Introduction} This document gives an introduction to the use of the \texttt{goseq} R Bioconductor package~\citep{Young10}. This package provides methods for performing Gene Ontology analysis of RNA-seq data, taking length bias into account~\citep{Oshlack:2009p149}. The methods and software used by \texttt{goseq} are equally applicable to other category based test of RNA-seq data, such as KEGG pathway analysis. \np Once installed, the \texttt{goseq} package can be easily loaded into R using: <>= library(goseq) @ <>= options(width=84) @ \np In order to perform a GO analysis of your RNA-seq data, \texttt{goseq} only requires a simple named vector, which contains two pieces of information. \begin{enumerate} \item \texttt{Measured genes}: all genes for which RNA-seq data was gathered for your experiment. Each element of your vector should be named by a unique gene identifier. \item \texttt{Differentially expressed genes}: each element of your vector should be either a 1 or a 0, where 1 indicates that the gene is differentially expressed and 0 that it is not. \end{enumerate} If the organism, gene identifier or category test is currently not natively supported by \texttt{goseq}, it will also be necessary to supply additional information regarding the genes length and/or the association between categories and genes. \np A combination of bioconductor R packages such as \texttt{GenomicFeatures} and \texttt{Rsamtools} allow for the summarization of mapped reads into a table of counts, such as reads per gene. From there, several packages exist for performing differential expression analysis on summarized data (eg. \texttt{edgeR}~\citep{Robinson07, Robinson08, Robinson10}). \texttt{goseq} will work with any method for determining differential expression and as such differential expression analysis is outside the scope of this document, but in order to facilitate ease of use, we will make use of the \texttt{edgeR} package to calculate differentially expressed (DE) genes in all the case studies in this document. \section{Reading data} We assume that the user can use appropriate in-built \texttt{R} functions (such as \texttt{read.table} or \texttt{scan}) to obtain two vectors, one containing all genes assayed in the RNA-seq experiment, the other containing all genes which are DE. If we assume that the vector of genes being assayed is named \texttt{assayed.genes} and the vector of DE genes is named \texttt{de.genes} we can construct a named vector suitable for use with \texttt{goseq} using the following: <>= gene.vector=as.integer(assayed.genes%in%de.genes) names(gene.vector)=assayed.genes head(gene.vector) @ It may be that the user can already read in a vector in this format, in which case it can then be immediately used by \texttt{goseq}. \section{GO testing of RNA-seq data} To begin the analysis, \texttt{goseq} first needs to quantify the length bias present in the dataset under consideration. This is done by calculating a Probability Weighting Function or PWF which can be thought of as a function which gives the probability that a gene will be differentially expressed (DE), based on its length alone. The PWF is calculated by fitting a monotonic spline to the binary data series of differential expression (1=DE, 0=Not DE) as a function of gene length. The PWF is used to weight the chance of selecting each gene when forming a null distribution for GO category membership. The fact that the PWF is calculated directly from the dataset under consideration makes this approach robust, only correcting for the length bias present in the data. For example, if \texttt{goseq} is run on a microarray dataset, for which no length bias exists, the calculated PWF will be nearly flat and all genes will be weighted equally, resulting in no length bias correction. \np In order to account for the length bias inherent to RNA-seq data when performing a GO analysis (or other category based tests), one cannot simply use the hypergeometric distribution as the null distribution for category membership, which is appropriate for data without DE length bias, such as microarray data. GO analysis of RNA-seq data requires the use of random sampling in order to generate a suitable null distribution for GO category membership and calculate each categories significance for over representation amongst DE genes. \np However, this random sampling is computationally expensive. In most cases, the Wallenius distribution can be used to approximate the true null distribution, without any significant loss in accuracy. The \texttt{goseq} package implements this approximation as its default option. The option to generate the null distribution using random sampling is also included as an option. \np Having established a null distribution, each GO category is then tested for over and under representation amongst the set of differentially expressed genes and the null is used to calculate a p-value for under and over representation. \section{Natively supported Gene Identifiers and category tests} \texttt{goseq} needs to know the length of each gene, as well as what GO categories (or other categories of interest) each gene is associated with. \texttt{goseq} relies on the UCSC genome browser to provide the length information for each gene. However, because the process of fetching the length of every transcript is slow and bandwidth intensive, \texttt{goseq} relies on an offline copy of this information stored in the data package \texttt{geneLenDataBase}. To see which genome/gene identifier combinations are in the local database, simply run: <>= supportedGenomes() @ or <>= supportedGeneIDs() @ The rightmost column in the output of both these commands lists the identifiers (in the case of supportedGenomes) or genomes (in the case of supportedGeneIDs) for which length data exists in the local database. If your genome/ID combination is not in the local database, it will be downloaded from the UCSC genome browser. If your genome/ID combination is not in the UCSC database, you will have to manually specify the gene lengths. \np In order to link GO categories to genes, \texttt{goseq} uses the organism packages from Bioconductor. These packages are named org...db, where is a short string identifying the genome and is a short string identifying the gene identifier. Currently, \texttt{goseq} will automatically retrieve the mapping between GO categories and genes from the relevant package (as long as it is installed) for commonly used genome/ID combinations. If GO mappings are not automatically available for your genome/ID combination, you will have to manually specify the relationship between genes and categories. Although the Genome/ID naming conventions used by the organism packages differ from the UCSC, \texttt{goseq} is able to convert between the two, so the user need only ever specify the UCSC genome/ID in most cases. \section{Non-native Gene Identifier or category test} If the organism, Gene Identifier or category test you wish to perform is not in the native \texttt{goseq} database, you will have to supply one or all of the following: \begin{itemize} \item \texttt{Length data}: the length of each gene in your gene identifier format. \item \texttt{Category mappings}: the mapping (usually many-to-many) between the categories you wish to test for over/under representation amongst DE genes and genes in your gene identifier format. \end{itemize} \subsection{Length data format} The length data must be formatted as a numeric vector, of the same length as the main named vector specifying gene names/DE genes. Each entry should give the length of the corresponding gene in bp. If length data is unavailable for some genes, that entry should be set to NA. \subsection{Category mapping format} The mapping between category names and genes should be given as a data frame with two columns. One column should contain the gene IDs and the other the name of an associated category. As the mapping between categories and genes is usually many-to-many, this data frame will usually have multiple rows with the same gene name and category name. \np Alternatively, mappings between genes and categories can be given as a list. The names of list entries should be gene IDs and the entries themselves should be a vector of category names to which the gene ID corresponds. \subsection{Some additional tips} Any organism for which there is an annotation on either Ensembl or the UCSC, can be easily turned into length data using the GenomicFeatures package. To do this, first create a TranscriptDb object using either makeTranscriptDbFromBiomart or makeTranscriptDbFromUCSC (see the help in the GenomicFeatures package on using these commands). Once you have a transcriptDb object, you can get a vector named by gene ID containing the median transcript length of each gene simply by using the command. <>= txsByGene=transcriptsBy(txdb,"gene") lengthData=median(width(txsByGene)) @ The relationship between gene identifier and GO category can usually be obtained from the Gene Ontology website (www.geneontology.org) or from the NCBI. Additionally, the bioconductor AnnotationDbi library has recently added a function "makeOrgPackageFromNCBI", which can be used to create an organism package from within R, using the NCBI data. Once created, this package can then be used to obtain the mapping between genes and gene ontology. \section{Case study: Prostate cancer data} \subsection{Introduction} This section provides an analysis of data from an RNA-seq experiment to illustrate the use of \texttt{goseq} for GO analysis. \np This experiment examined the effects of androgen stimulation on a human prostate cancer cell line, LNCaP. The data set includes more than 17 million short cDNA reads obtained for both the treated and untreated cell line and sequenced on Illumina's 1G genome analyzer. \np For each sample we were provided with the raw 35 bp RNA-seq reads from the authors. For the untreated prostate cancer cells (LNCaP cell line) there were 4 lanes totaling ~10 million, 35 bp reads. For the treated cells there were 3 lanes totaling ~7 million, 35 bp reads. All replicates were technical replicates. Reads were mapped to NCBI version 36.3 of the human genome using bowtie. Any read with which mapped to multiple locations was discarded. Using the ENSEMBL 54 annotation from biomart, each mapped read was associated with an ENSEMBL gene. This was done by associating any read that overlapped with any part of the gene (not just the exons) with that gene. Reads that did not correspond to genes were discarded. \subsection{Source of the data} The data set used in this case study is taken from \citep{Li:2008p113} and was made available from the authors upon request. \subsection{Determining the DE genes using \texttt{edgeR}} To begin with, we load in the text data and convert it the appropriate \texttt{edgeR} DGEList object. <>= library(edgeR) table.summary=read.table(system.file("extdata","Li_sum.txt",package='goseq'), sep='\t',header=TRUE,stringsAsFactors=FALSE) counts=table.summary[,-1] rownames(counts)=table.summary[,1] grp=factor(rep(c("Control","Treated"),times=c(4,3))) summarized=DGEList(counts,lib.size=colSums(counts),group=grp) @ Next, we use \texttt{edgeR} to estimate the biological dispersion and calculate differential expression using a negative binomial model. <>= disp=estimateCommonDisp(summarized) disp$common.dispersion tested=exactTest(disp) topTags(tested) @ Finally, we Format the DE genes into a vector suitable for use with \texttt{goseq} <>= genes=as.integer(p.adjust(tested$table$PValue[tested$table$logFC!=0], method="BH")<.05) names(genes)=row.names(tested$table[tested$table$logFC!=0,]) table(genes) @ \subsection{Determining Genome and Gene ID} In order to allow for automatic data retrieval, the user has to tell \texttt{goseq} what genome and gene ID format were used to summarize the data. In our case we will use the GRCh37 build of the human genome, we check what code this corresponds to by running: <>= head(supportedGenomes())[,1:4] @ Which lists the genome codes in the far left column, headed ``db''. As we are using GRCh37, we see that we should use ``hg19". We also know that we used ENSEMBL Gene ID to summarize our read data, we check what code this corresponds to by running: <>= head(supportedGeneIDs(),n=12)[,1:4] @ Note that the head command, and first four columns, is only used here to save on space in this guide. Again, the gene ID codes are listed in the far left ``db'' column. We find that our gene ID code is ``ensGene". We will use these strings whenever we are asked for a genome or id. \subsection{GO analysis} \subsubsection{Fitting the Probability Weighting Function (PWF)} We first need to obtain a weighting for each gene, depending on its length, given by the PWF. As you may have noticed when running supportedGenomes or supportedGeneIDs, length data is available in the local database for our gene ID, ``ensGene" and our genome, ``hg19". We will let \texttt{goseq} automatically fetch this data from its databases. <>= pwf=nullp(genes,"hg19","ensGene") head(pwf) @ \texttt{nullp} plots the resulting fit, allowing verification of the goodness of fit before continuing the analysis. Further plotting of the pwf can be performed using the \texttt{plotPWF} function. \np The output of nullp contains all the data used to create the PWF, as well as the PWF itself. It is a data frame with 3 columns, named "DEgenes", "bias.data" and "pwf" with the rownames set to the gene names. Each row corresponds to a gene with the DEgenes column specifying if the gene is DE (1 for DE, 0 for not DE), the bias.data column giving the numeric value of the DE bias being accounted for (usually the gene length or number of counts) and the pwf column giving the genes value on the probability weighting function. \subsubsection{Using the Wallenius approximation} To start with we will use the default method, to calculate the over and under expressed GO categories among DE genes. Again, we allow \texttt{goseq} to fetch data automatically, except this time the data being fetched is the relationship between ENSEMBL gene IDs and GO categories. %This hides the output of goseq <>= GO.wall=goseq(pwf,"hg19","ensGene") head(GO.wall) @ The resulting object is ordered by GO category over representation amongst DE genes. \subsubsection{Using random sampling} It may sometimes be desirable to use random sampling to generate the null distribution for category membership. This is easily accomplished by using the \texttt{method} option to specify sampling and the \texttt{repcnt} option to specify the number of samples to generate: <>= GO.samp=goseq(pwf,"hg19","ensGene",method="Sampling",repcnt=1000) @ <>= head(GO.samp) @ You will notice that this takes far longer than the Wallenius approximation. Plotting the p-values against one another, we see that there is little difference between the two methods. <>= plot(log10(GO.wall[,2]), log10(GO.samp[match(GO.samp[,1],GO.wall[,1]),2]), xlab="log10(Wallenius p-values)",ylab="log10(Sampling p-values)", xlim=c(-3,0)) abline(0,1,col=3,lty=2) @ \subsubsection{Ignoring length bias} \texttt{goseq} also allows for one to perform a GO analysis without correcting for RNA-seq length bias. In practice, this is only useful for assessing the effect of length bias on your results. You should NEVER use this option as your final analysis. If length bias is truly not present in your data, \texttt{goseq} will produce a nearly flat PWF and no length bias correction will be applied to your data and all methods will produce the same results. \np However, if you still wish to ignore length bias in calculating GO category enrichment, this is again accomplished using the \texttt{method} option. <>= GO.nobias=goseq(pwf,"hg19","ensGene",method="Hypergeometric") head(GO.nobias) @ Ignoring length bias gives very different results from a length bias corrected analysis. <>= plot(log10(GO.wall[,2]), log10(GO.nobias[match(GO.nobias[,1],GO.wall[,1]),2]), xlab="log10(Wallenius p-values)", ylab="log10(Hypergeometric p-values)", xlim=c(-3,0), ylim=c(-3,0)) abline(0,1,col=3,lty=2) @ \subsubsection{Limiting GO categories and other category based tests} By default, \texttt{goseq} tests all three major Gene Ontology branches; Cellular Components, Biological Processes and Molecular Functions. However, it is possible to limit testing to any combination of the major branches by using the \texttt{test.cats} argument to the goseq function. This is done by specifying a vector consisting of some combination of the strings ``GO:CC", ``GO:BP" and ``GO:MF". For example, to test for only Molecular Function GO categories: <>= GO.MF=goseq(pwf,"hg19","ensGene",test.cats=c("GO:MF")) head(GO.MF) @ Native support for other category tests, such as KEGG pathway analysis are also made available via this argument. See the man \texttt{goseq} function man page for up to date information on what category tests are natively supported. \subsubsection{Making sense of the results} Having performed the GO analysis, you may now wish to interpret the results. If you wish to identify categories significantly enriched/unenriched below some p-value cutoff, it is necessary to first apply some kind of multiple hypothesis testing correction. For example, GO categories over enriched using a .05 FDR cutoff \citep{Benjamini95} are: <>= enriched.GO=GO.wall$category[p.adjust(GO.wall$over_represented_pvalue, method="BH")<.05] head(enriched.GO) @ Unless you are a machine, GO accession identifiers are probably not very meaningful to you. Information about each term can be obtained from the Gene Ontology website, \url{http://www.geneontology.org/}, or using the R package \texttt{GO.db}. <>= library(GO.db) for(go in enriched.GO[1:10]){ print(GOTERM[[go]]) cat("--------------------------------------\n") } @ \subsubsection{Understanding goseq internals} The situation may arise where it is necessary for the user to perform some of the data processing steps usually performed automatically by \texttt{goseq} themselves. With this in mind, it will be useful to step through the preprocessing steps performed automatically by \texttt{goseq} to understand what is happening. \np To start with, when \texttt{nullp} is called, \texttt{goseq} uses the genome and gene identifiers supplied to try and retrieve length information for all genes given to the \texttt{genes} argument. To do this, it retrieves the data from the database of gene lengths maintained in the package \texttt{geneLenDataBase}. This is performed by the \texttt{getlength} function in the following way: <>= len=getlength(names(genes),"hg19","ensGene") length(len) length(genes) head(len) @ After some data cleanup, the length data and the DE data is then passed to the \texttt{makespline} function to produce the PWF. The \texttt{nullp} returns a data frame which has 3 columns, the original DEgenes vector, the length bias data (in a column called bias.data) and the PWF itself (in a column named pwf). The names of the genes are also kept in this data frame as the names of the rows. If length data could not be obtained for a certain gene the corresponding entries in the "bias.data" and "pwf" columns are set to NA. \np Next we call the \texttt{goseq} function to determine over/under representation of GO categories amongst DE genes. When we do this, \texttt{goseq} looks for the appropriate organism package and tries to obtain the mapping from genes to GO categories from it. This is done using the \texttt{getgo} function as follows: <>= go=getgo(names(genes),"hg19","ensGene") length(go) length(genes) head(go) @ Note that some of the gene categories have been returned as "NULL". This means that a GO category could not be found in the database for one of the genes. In the \texttt{goseq} command, enrichment will only be calculated using genes with a GO category by default. However, in older versions of goseq (below 1.15.2), we counted all genes. i.e. genes with no categories still counted towards the total number of gene outside of any single category. It is possible to switch between these two behaviors using the \texttt{use\_genes\_without\_cat} flag in \texttt{goseq}. The first thing the getgo function does is to convert the UCSC genome/ID namings into the naming convention used by the organism packages. This is done using two hard coded conversion vectors that are included in the \texttt{goseq} package but usually hiden from the user. <>= goseq:::.ID_MAP goseq:::.ORG_PACKAGES @ It is just as valid to run the length and GO category fetching as separate steps and then pass the result to the \texttt{nullp} and \texttt{goseq} functions using the \texttt{bias.data} and \texttt{gene2cat} arguments. Thus the following two blocks of code are equivalent: <>= pwf=nullp(genes,"hg19","ensGene") go=goseq(pwf,"hg19","ensGene") @ and <>= gene_lengths=getlength(names(genes),"hg19","ensGene") pwf=nullp(genes,bias.data=gene_lengths) go_map=getgo(names(genes),"hg19","ensGene") go=goseq(pwf,"hg19","ensGene",gene2cat=go_map) @ \subsection{KEGG pathway analysis} In order to illustrate performing a category test not present in the \texttt{goseq} database, we perform a KEGG pathway analysis. For human, the mapping from KEGG pathways to genes are stored in the package org.Hs.eg.db, in the object org.Hs.egPATH. In order to test for KEGG pathway over representation amongst DE genes, we need to extract this information and put it in a format that \texttt{goseq} understands. Unfortunately, the org.Hs.eg.db package does not contain direct mappings between ENSEMBL gene ID and KEGG pathway. Therefore, we have to construct this map by combining the ENSEMBL <-> Entrez and Entrez <-> KEGG mappings. This can be done using the following code: <>= # Get the mapping from ENSEMBL 2 Entrez en2eg=as.list(org.Hs.egENSEMBL2EG) # Get the mapping from Entrez 2 KEGG eg2kegg=as.list(org.Hs.egPATH) # Define a function which gets all unique KEGG IDs # associated with a set of Entrez IDs grepKEGG=function(id,mapkeys){unique(unlist(mapkeys[id],use.names=FALSE))} # Apply this function to every entry in the mapping from # ENSEMBL 2 Entrez to combine the two maps kegg=lapply(en2eg,grepKEGG,eg2kegg) head(kegg) @ Note that this step is quite time consuming. The code written here is not the most efficient way of producing this result, but the logic is much clearer than faster algorithms. The source code for getgo contains a more efficient routine. \np We produce the PWF as before. Then, to perform a KEGG analysis, we simply make use of the \texttt{gene2cat} option in \texttt{goseq}: <>= pwf=nullp(genes,"hg19","ensGene") KEGG=goseq(pwf,gene2cat=kegg) head(KEGG) @ Note that we do not have to tell the goseq function what organism and gene ID we are using as we are manually supplying the mapping between genes and categories. \np KEGG analysis is shown as an illustration of how to supply your own mapping between gene ID and category, KEGG analysis is actually natively supported by GOseq and we could have performed it with the following code. <>= pwf=nullp(genes,'hg19','ensGene') KEGG=goseq(pwf,'hg19','ensGene',test.cats="KEGG") head(KEGG) @ Noting that this time it was necessary to tell the goseq function that we are using HG19 and ENSEMBL gene ID, as the function needs this information to automatically construct the mapping from geneid to KEGG pathway. \subsection{Extracting mappings from organism packages} If you know that the information mapping gene ID to your categories of interest is contained in the organism packages, but \texttt{goseq} fails to fetch it automatically, you may want to extract it yourself and then pass it to the \texttt{goseq} function using the \texttt{gene2cat} argument. This is done in exactly the same way as extracting the KEGG to ENSEMBL mappings in the section ``KEGG pathway analysis" above. This example is actually the worst case, where it is necessary to combine two mappings to get the desired list. If we had instead wanted the association between Entrez gene IDs and KEGG pathways, the following code would have been sufficient: <>= kegg=as.list(org.Hs.egPATH) head(kegg) @ A note on fetching GO mappings from the organism. The data structure of GO is a directed acyclic graph. This means that in addition to each GO category being associated with a set of genes, it may also have children that are associated to other genes. It is important to use the org.Hs.egGO2ALLEGS and NOT the org.Hs.egGO object to create the mapping between GO categories and gene identifiers, as the latter does not include the links to genes arising from "child" GO categories. Thank you to Christopher Fjell for pointing this out. \subsection{Correcting for other biases} It is possible that in some circumstances you will wish to correct not just for length bias, but for the total number of counts. This can make sense because power to detect DE depends on the total number of counts a gene receives, which is the product of gene length and gene expression. So correcting for read count bias will compensate for all biases, known and unknown, in power to detect DE. On the other hand, it will also remove bias resulting from differences in expression level, which may not be desirable. \np Correcting for count bias will produce a different PWF. Therefore, we need to tell \texttt{goseq} about the data on which the fraction DE depends when calculating the PWF using the \texttt{nullp} function. We then simply pass the result to \texttt{goseq} as usual. \np So, in order to tell \texttt{goseq} to correct for read count bias instead of length bias, all you need to do is supply a numeric vector, containing the number of counts for each gene to \texttt{nullp}. <>= countbias=rowSums(counts)[rowSums(counts)!=0] length(countbias) length(genes) @ To use the count bias when doing GO analysis, simply pass this vector to \texttt{nullp} using the \texttt{bias.data} option. Note that we have to supply "hg19" and "ensGene" to \texttt{goseq} as it is not used by nullp and hence not in the pwf.counts object. <>= pwf.counts=nullp(genes,bias.data=countbias) GO.counts=goseq(pwf.counts,"hg19","ensGene") head(GO.counts) @ Note that if you want to correct for length bias, but your organism/gene identifier is not natively supported, then you need to follow the same procedure as above, only the numeric vector supplied will contain each gene's length instead of its number of reads. \section{Setup} This vignette was built on: <>= sessionInfo() @ \section{Acknowledgments} Christopher Fjell for a series of bug fixes and pointing out the difference between the egGO and egGO2ALLEGS objects in the organism packages. \bibliographystyle{plainnat} \bibliography{goseq_bib} \end{document}