%\VignetteIndexEntry{segmentSeq} %\VignettePackage{segmentSeq} \documentclass[a4paper]{article} \title{segmentSeq} \author{Thomas J. Hardcastle} \begin{document} \maketitle \section{Introduction} High-throughput sequencing technologies allow the production of large volumes of short sequences, which can be aligned to the genome to create a set of \textsl{matches} to the genome. By looking for regions of the genome which to which there are high densities of matches, we can infer a segmentation of the genome into regions of biological significance. The methods we propose allows the simultaneous segmentation of data from multiple samples, taking into account replicate data, in order to create a consensus segmentation. This has obvious applications in a number of classes of sequencing experiments, particularly in the discovery of small RNA loci and novel mRNA transcriptome discovery. We approach the problem by considering a large set of potential \textsl{segments} upon the genome and counting the number of tags that match to that segment in multiple sequencing experiments (that may or may not contain replication). We then adapt the empirical Bayesian methods based on the Poisson-Gamma conjugacy and implemented in the \verb'baySeq' package \cite{Hardcastle:2010} to establish, for a given segment, the likelihood that the count data in that segment is similar to background levels, or that it is similar to the regions to the left or right of that segment. We then rank all the potential segments in order of increasing likelihood of similarity and reject those segments for which there is a high likelihood of similarity with the background or the regions to the left or right of the segment. This gives us a large list of overlapping segments. We reduce this list to identify non-overlapping loci by choosing, for a set of overlapping segments, the segment which has the lowest likelihood of similarity with either background or the regions to the left or right of that segment and rejecting all other segments that overlap with this segment. For fuller details of the method, see Hardcastle (2010) \cite{Hardcastle:2010a}. \section{Preparation} We begin by loading the \verb'segmentSeq' package. <<>>= library(segmentSeq) @ Note that because the experiments that \verb'segmentSeq' is designed to analyse are usually massive, we should use (if possible) parallel processing as implemented by the \verb'snow' package. We therefore need to load the \verb'snow' package (if it exists) and define a \textsl{cluster}. If \verb'snow' is not present, we can proceed anyway with a \verb'NULL' cluster. Results may be slightly different depending on whether or not a cluster is used owing to the non-deterministic elements of the method. <>= if("snow" %in% installed.packages()[,1]) { library(snow) cl <- makeCluster(4, "SOCK") } else cl <- NULL @ There is a convenience function, \verb'processTags' which is able to read in tab-delimited files which have appropriate column names, and create an \verb'alignmentData' object. Alternatively, if the appropriate column names are not present, we can specify which columns to use for the data. In either case, we pass a character vector of files, together with information on which data are to be treated as replicates to the function. We also need to define the lengths of the chromosome and specifiy the chromosome names as a character. The data here, drawn from text files in the 'data' directory of the \verb'segmentSeq' package are taken from the first million bases of an alignment to chromosome 1 and the first five hundred thousand bases of an alignment to chromosome 2 of \textsl{Arabidopsis thaliana} in a sequencing experiment where libraries 'SL10' and 'SL26' are replicates, as are 'SL32' and 'SL9'. <<>>= chrlens <- c(1e6, 5e5) datadir <- system.file("data", package = "segmentSeq") libfiles <- dir(datadir, pattern = ".txt", full.names = TRUE) libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,2,1,2) aD <- processTags(libfiles, replicates, libnames, chrlens, chrs = c(">Chr1", ">Chr2"), header = TRUE) aD @ Next, we process this \verb'alignmentData' object to produce a \verb'segData' object. This \verb'segData' object contains a set of potential segments on the genome defined by the start and end points of regions of overlapping alignments in the \verb'alignmentData' object. It then evaluates the number of tags that hit in each of these segments. <<>>= sD <- processAD(aD, maxgaplen = 500, cl = cl) sD @ We then try and estimate prior parameters on the data with the \verb'getPriors' function. This function constructs a random sample of non-overlapping segments from the segData object \verb'sD' and uses the prior estimation functions from the package \verb'baySeq' to construct a set of prior parameters on the data. At present only the Poisson-Gamma method of prior estimation implemented by \verb'baySeq' is supported as alternative methods require excessive computational time. Additional options can be given to the \verb'getPriors' function to be passed on to the \verb'getPriors.Pois' method of the \verb'baySeq' library. <<>>= sDP <- getPriors(sD, type = "Pois", samplesize = 100, perSE = 0.5, maxit = 1000, cl = cl) @ Having found estimates of the prior parameters of the data, we can then segment the genome based on the information in the \verb'segData' object. The function compares, for each replicate group, the likelihood that the number of alignments matching in a segment is similar to background, or that it is similar to the regions to the left and right of the segment. It then evaluates the likelihood that this similarity occurs in all replicate groups, and ranks all the potential segments in order of decreasing likelihood of similarity. Any segment with a likelihood of similarity greater than the \verb'pcut' argument is discarded at this point. The function then filters the potential segments to form a non-overlapping set of segments by choosing the segment with the lowest likelihood of similarity and discarding any segments which overlap with this. By iteratively discarding overlapping segments, we form a non-overlapping set of segments which have a low likelihood of being similar to background (and are thus regions corresponding to a high density of aligned sequences). <<>>= segD <- segmentSequences(sDP, pcut = 0.1, estimatePriors = FALSE, verbose = TRUE, cl = cl) @ We finally acquire an annotated \verb'countData' object, with the annotations describing the co-ordinates of each segment. <<>>= segD @ We can use this \verb'countData' object, in combination with the \verb'alignmentData' object, to plot the segmented genome. <>= plotGenome(aD, segD, chr = ">Chr1", limits = c(1, 1e5)) @ \begin{figure}[!ht] \begin{center} <>= <> @ \caption{The segmented genome (first $10^5$ bases of chromosome 1.} \label{fig:Seg} \end{center} \end{figure} This \verb'countData' object can also be examined for differential expression with the \verb'baySeq' package. \begin{thebibliography}{99} \bibitem{Hardcastle:2010} Thomas J. Hardcastle and Krystyna A. Kelly. \textsl{Empirical Bayesian Methods For Identifying Patterns of Differential Expression in Count Data}. In submission. 2010. \bibitem{Hardcastle:2010a} Thomas J. Hardcastle and Krystyna A. Kelly. \textsl{Genome Segmentation From High-Throughput Sequencing Data.}. In submission. 2010. \end{thebibliography} \end{document}