%\VignetteIndexEntry{An Introduction to IRanges} %\VignetteDepends{} %\VignetteKeywords{Ranges} %\VignettePackage{IRanges} \documentclass[10pt]{article} \usepackage{times} \usepackage{hyperref} \textwidth=6.5in \textheight=8.5in %\parskip=.3cm \oddsidemargin=-.1in \evensidemargin=-.1in \headheight=-.3in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \newcommand{\IRanges}{\Rpackage{IRanges}} \title{An Introduction to \IRanges{}} \author{Patrick Aboyoun, Michael Lawrence, Herv\'e Pag\`es} \date{\today} \begin{document} \maketitle <>= options(width=72) @ \section{Introduction} The \IRanges{} package is designed to represent sequences, ranges representing indices along those sequences, and data related to those ranges. In this vignette, we will rely on simple, illustrative example datasets, rather than large, real-world data, so that each data structure and algorithm can be explained in an intuitive, graphical manner. We expect that packages that apply \IRanges{} to a particular problem domain will provide vignettes with relevant, realistic examples. The \IRanges{} package is available at bioconductor.org and can be downloaded via \Rfunction{biocLite}: <>= source("http://bioconductor.org/biocLite.R") biocLite("IRanges") @ <>= library(IRanges) @ \section{Sequences} A sequence is an ordered collection of elements. The \IRanges{} packages represents three types of objects as sequences: atomic sequences, lists (or non-atomic sequences), and data tables (similar to \Rclass{data.frame}). The following subsections describe each in turn. All \IRanges{}-derived sequences inherit from the \Rclass{Sequence} virtual class. \subsection{Atomic Sequences} In \R{}, atomic sequences are typically stored in atomic vectors. The \IRanges{} package includes two additional atomic sequence object types: \Rclass{Rle}, which compresses an atomic sequence through run-length encoding, and \Rclass{XVector}, which refers to its data through an external pointer. \Rclass{XVector} and its derivatives are considered low-level infrastructure and, as such, will not be covered by this vignette. We begin our discussion of atomic sequences using two \R{} \Rclass{integer} vectors. <>= set.seed(0) lambda <- c(rep(0.001, 4500), seq(0.001, 10, length = 500), seq(10, 0.001, length = 500)) xVector <- rpois(1e7, lambda) yVector <- rpois(1e7, lambda[c(251:length(lambda), 1:250)]) @ All atomic sequences in \R{} have three main properties: (1) a notion of length or number of elements, (2) the ability to extract elements to create new atomic sequences, and (3) the ability to be combined with one or more atomic sequences to form larger atomic sequences. The main functions for these three operations are \Rfunction{length}, \Rfunction{[}, and \Rfunction{c}. <>= length(xVector) xVector[1] zVector <- c(xVector, yVector) @ While these three methods may seem trivial, they provide a great deal of power and many atomic sequence manipulations can be constructed using them. \subsubsection{Sequence Extraction} As with ordinary \R{} atomic vectors, it is often necessary to extract one sequence from another. When this extraction does not duplicate or reorder the elements being extracted, the result is called a \textit{subsequence}. In general, the \Rfunction{[} function can be used to construct a new sequence or extract a subsequence, but its interface is often inconvenient and not amenable to optimization. To compensate for this, the \IRanges{} package supports seven additional functions for sequence extraction: \begin{enumerate} \item \Rfunction{window} - Produces a subsequence over a specified region with or without regular interval subsampling. \item \Rfunction{seqselect} - Concatenates multiple consecutive subsequences into a new sequence. The result may or may not be a subsequence. \item \Rfunction{subset} - Extracts the subsequence specified by a logical vector. \item \Rfunction{head} - Extracts a consecutive subsequence containing the first n elements. \item \Rfunction{tail} - Extracts a consecutive subsequence containing the last n elements. \item \Rfunction{rev} - Creates a new sequence with the elements in the reverse order. \item \Rfunction{rep} - Creates a new sequence by repeating sequence elements. \item \Rfunction{Filter} - Extracts the elements for which a function evaluates to TRUE. \item \Rfunction{Find} - Extracts the first or last such element specified by a function. \item \Rfunction{Position} - Extracts the first or last such position specified by a function. \end{enumerate} The following code illustrates how these functions are used on an ordinary \R{} \Rclass{integer} vector: <>= xSnippet <- window(xVector, start = 4751, end = 4760) xSnippet head(xSnippet) tail(xSnippet) rev(xSnippet) rep(xSnippet, 2) window(xSnippet, delta = 2) seqselect(xSnippet, start = c(6,1), end = c(10, 5)) subset(xSnippet, xSnippet >= 5L) @ \subsubsection{Combining Sequences} The \IRanges{} package uses two generic functions, \Rfunction{c} and \Rfunction{append}, for combining two \Rclass{Sequence} objects. The methods for \Rclass{Sequence} objects follow the definition that these two functions are given the the \Rpackage{base} package. <>= c(xSnippet, rev(xSnippet)) append(xSnippet, xSnippet, after = 3) @ \subsubsection{Looping over Sequences and Subsequences} In \R{}, \Rfunction{for} looping can be an expensive operation. To compensate for this, \IRanges{} uses three generics, \Rfunction{endoapply}, \Rfunction{lapply}, and \Rfunction{sapply}, for looping over sequences and two generics, \Rfunction{aggregate} and \Rfunction{shiftApply}, to perform calculations over subsequences. The \Rfunction{lapply} and \Rfunction{sapply} functions are familiar to many \R{} users since they are the standard functions for looping over the elements of an \R{} \Rclass{list} object. The \Rfunction{endoapply} function performs an endomorphism equivalent to \Rfunction{lapply}, i.e. returns a \Rclass{Sequence} object of the same class as the input rather than a \Rclass{list} object. More will be given on these three functions in the List subsection. The \Rfunction{aggregate} function combines sequence extraction functionality of the \Rfunction{window} function with looping capabilities of the \Rfunction{sapply} function. For example, here is some code to compute medians across a moving window of width 3 using the function \Rfunction{aggregate}: <>= xSnippet aggregate(xSnippet, start = 1:8, width = 3, FUN = median) @ The \Rfunction{shiftApply} function is a looping operation involving two sequences whose elements are lined up via a positional shift operation. For example, the elements of \Robject{xVector} and \Robject{yVector} were simulated from Poisson distributions with the mean of element i from \Robject{yVector} being equivalent to the mean of element i + 250 from \Robject{xVector}. If we did not know the size of the shift, we could estimate it by finding the shift that maximizes the correlation between \Robject{xVector} and \Robject{yVector}. <>= cor(xVector, yVector) shifts <- seq(235, 265, by=3) corrs <- shiftApply(shifts, yVector, xVector, FUN = cor) @ % <>= plot(shifts, corrs) @ The result is shown in Fig.~\ref{figshiftcorrs}. \begin{figure}[tb] \begin{center} \includegraphics[width=0.5\textwidth]{IRangesOverview-figshiftcorrs} \caption{\label{figshiftcorrs}% Correlation between \Robject{xVector} and \Robject{yVector} for various shifts.} \end{center} \end{figure} \subsubsection{Run Length Encoding} Up until this point we have used \R{} atomic vectors to represent atomic sequences, but there are times when these object become too large to manage in memory. When there are lots of consecutive repeats in the sequence, the data can be compressed and managed in memory through a run-length encoding where a data value is paired with a run length. For example, the sequence \{1, 1, 1, 2, 3, 3\} can be represented as values = \{1, 2, 3\}, run lengths = \{3, 1, 2\}. The \Rclass{Rle} class in \IRanges{} is used to represent a run-length encoded (compressed) sequence of \Rclass{logical}, \Rclass{integer}, \Rclass{numeric}, \Rclass{complex}, \Rclass{character}, or \Rclass{raw} values. One way to construct an \Rclass{Rle} object is through the \Rclass{Rle} constructor function: <>= xRle <- Rle(xVector) yRle <- Rle(yVector) xRle yRle @ When there are lots of consecutive repeats, the memory savings through an RLE can be quite dramatic. For example, the \Robject{xRle} object occupies less than one quarter of the space of the original \Robject{xVector} object, while storing the same information: <>= as.vector(object.size(xRle) / object.size(xVector)) identical(as.vector(xRle), xVector) @ The functions \Rfunction{runValue} and \Rfunction{runLength} extract the run values and run lengths from an \Rclass{Rle} object respectively: <>= head(runValue(xRle)) head(runLength(xRle)) @ The \Rclass{Rle} class supports many of the basic methods associated with \R{} atomic vectors including the Ops, Math, Math2, Summary, and Complex group generics. Here is a example of manipulating \Rclass{Rle} objects using methods from the Ops group: <>= xRle > 0 xRle + yRle xRle > 0 | yRle > 0 @ Here are some from the Summary group: <>= range(xRle) sum(xRle > 0 | yRle > 0) @ And here is one from the Math group: <>= log1p(xRle) @ As with the atomic vectors, the \Rfunction{cor} and \Rfunction{shiftApply} functions operate on \Rclass{Rle} objects: <>= cor(xRle, yRle) shiftApply(249:251, yRle, xRle, FUN = function(x, y) var(x, y) / (sd(x) * sd(y))) @ For more information on the methods supported by the \Rclass{Rle} class, consult the \Rcode{Rle} man page. \subsection{Lists} In many data analysis situation there is a desire to organize and manipulate multiple objects simultaneously. Typically this is done in \R{} through the usage of a list. While a list serves as a generic container, it does not confer any information about the specific class of its elements, provides no infrastructure to ensure type safety, and the S3 and S4 method dispatch mechanisms do not support method selection for lists with homogeneous object types. \subsubsection{Lists of Atomic Sequences} The first type of lists we consider are those containing atomic sequences such as \Rclass{integer} vectors or \Rclass{Rle} objects. We may wish to define a method that retrieves the length of each atomic sequence element, without special type checking. To enable this, we define collection classes such as \Rclass{IntegerList} and \Rclass{RleList}, which inherit from the \Rclass{Sequence} virtual class, for representing lists of \Rclass{integer} vectors and \Rclass{Rle} objects respectively. <>= getClassDef("RleList") @ As the class definition above shows, the \Rclass{RleList} class is virtual with subclasses \Rclass{SimpleRleList} and \Rclass{CompressedRleList}. A \Rclass{SimpleRleList} class uses a regular \R{} list to store the underlying elements and the \Rclass{CompressedRleList} class stores the elements in an unlisted form and keeps track of where the element breaks are. The former ``simple list" class is useful when the Rle elements are long and the latter ``compressed list" class is useful when the list is sparse, i.e. a number of the list elements have length 0. In fact, all of the atomic vector types (raw, logical, integer, numeric, complex, and character) have similar list classes that derive from the \Rclass{Sequence} virtual class. For example, there is an \Rclass{IntegerList} virtual class with subclasses \Rclass{SimpleIntegerList} and \Rclass{CompressedIntegerList}. Each of the list classes for atomic sequences, be they stored as vectors or \Rclass{Rle} objects, have a constructor function with a name of the appropriate list virtual class, such as \Rclass{IntegerList}, and an optional argument \Rfunarg{compress} that takes an argument to specify whether or not to create the simple list object type or the compressed list object type. The default is to create the compressed list object type. <>= args(IntegerList) cIntList1 <- IntegerList(x = xVector, y = yVector) cIntList1 sIntList2 <- IntegerList(x = xVector, y = yVector, compress = FALSE) sIntList2 ## sparse integer list xExploded <- lapply(xVector[1:5000], function(x) seq_len(x)) cIntList2 <- IntegerList(xExploded) sIntList2 <- IntegerList(xExploded, compress = FALSE) object.size(cIntList2) object.size(sIntList2) @ The \Rfunction{length} function returns the number of elements in a \Rclass{Sequence}-derived object and, for ``simple lists" and ``compressed lists", the \Rfunction{elementLengths} function returns an integer vector containing the lengths of each of the elements: <>= length(cIntList2) Rle(elementLengths(cIntList2)) @ Just as with ordinary \R{} \Rclass{list} objects, \Rclass{Sequence}-derived list classes support the \Rfunction{[[} for element extraction \Rfunction{c} for combining, and \Rfunction{lapply}/\Rfunction{sapply} for looping. When looping over sparse lists, the ``compressed list" classes can be much faster during computations since only the non-empty elements are looped over during the \Rfunction{lapply}/\Rfunction{sapply} computation and all the empty elements are assigned the appropriate value based on their status. <>= system.time(sapply(xExploded, mean)) system.time(sapply(sIntList2, mean)) system.time(sapply(cIntList2, mean)) identical(sapply(xExploded, mean), sapply(sIntList2, mean)) identical(sapply(xExploded, mean), sapply(cIntList2, mean)) @ Unlist ordinary \R{} \Rclass{list} objects, \Rclass{AtomicList} objects support the \Rfunction{Ops} (e.g. \Rfunction{+}, \Rfunction{==}, \Rfunction{\&}), \Rfunction{Math} (e.g. \Rfunction{log}, \Rfunction{sqrt}), \Rfunction{Math2} (e.g. \Rfunction{round}, \Rfunction{signif}), \Rfunction{Summary} (e.g. \Rfunction{min}, \Rfunction{max}, \Rfunction{sum}), and \Rfunction{Complex} (e.g. \Rfunction{Re}, \Rfunction{Im}) group generics. <>= xRleList <- RleList(xRle, 2L * rev(xRle)) yRleList <- RleList(yRle, 2L * rev(yRle)) xRleList > 0 xRleList + yRleList sum(xRleList > 0 | yRleList > 0) @ Since these atomic lists inherit from \Rclass{Sequence}, they can also use the looping function \Rfunction{endoapply} to perform endomorphisms. <>= safe.max <- function(x) { if(length(x)) max(x) else integer(0) } endoapply(sIntList2, safe.max) endoapply(cIntList2, safe.max) endoapply(sIntList2, safe.max)[[1]] @ \subsection{Data Tables} To Do: \Rclass{DataTable}, \Rclass{DataFrame}, \Rclass{DataFrameList}, \Rclass{SplitDataFrameList} \section{Sequence Annotations} Often when one has a collection of objects, there is a need to attach metadata that describes the collection in some way. The \Rclass{Sequence} class definition contains two slots for storing metadata: the \Robject{metadata} slot, which stores an ordinary \Rclass{list} for storing metadata about the whole object, and \Robject{elementMetadata}, which can store and object that inherits from \Rclass{DataTable} whose rows annotate the corresponding element of the \Rclass{Sequence} object. \section{Sequence Ranges} When analyzing sequences, we are often interested in particular consecutive subsequences. For example, the \Sexpr{letters} vector could be considered a sequence of lower-case letters, in alphabetical order. We would call the first five letters (\textit{a} to \textit{e}) a consecutive subsequence, while the subsequence containing only the vowels would not be consecutive. It is not uncommon for an analysis task to focus only on the geometry of the regions, while ignoring the underlying sequence values. A list of indices would be a simple way to select a subsequence. However, a sparser representation for consecutive subsequences would be a range, a pairing of a start position and a width, as used when extracting sequences with \Rfunction{window} and \Rfunction{seqselect}, above. When analyzing subsequences in \IRanges{}, each range is treated as an observation. The virtual \Rclass{Ranges} class represents lists of ranges, or, equivalently and as a derivative \Rclass{IntegerList}, sequences of consecutive integers. The most commonly used implementation of \Rclass{Ranges} is \Rclass{IRanges}, which stores the starts and widths as ordinary integer vectors. To construct an \Rclass{IRanges} instance, we call the \Rfunction{IRanges} constructor. Ranges are normally specified by passing two out of the three parameters: start, end and width (see \Rcode{help(IRanges)} for more information). % <>= ir1 <- IRanges(start = 1:10, width = 10:1) ir2 <- IRanges(start = 1:10, end = 11) ir3 <- IRanges(end = 11, width = 10:1) identical(ir1, ir2) & identical(ir2, ir3) ir <- IRanges(c(1, 8, 14, 15, 19, 34, 40), width = c(12, 6, 6, 15, 6, 2, 7)) @ % All of the above calls construct an \Rclass{IRanges} instance with the same ranges, using different combinations of the \Rfunarg{start}, \Rfunarg{end} and \Rfunarg{width} parameters. Accessing the starts, widths and ends is supported by every \Rclass{Ranges} implementation. <>= start(ir) @ <>= end(ir) @ <>= width(ir) @ For \Rclass{IRanges} and some other \Rclass{Ranges} derivatives, subsetting is also supported, by numeric and logical indices. <>= ir[1:4] @ <>= ir[start(ir) <= 15] @ One may think of each range as a sequence of integer ranges, and \Rclass{Ranges} is, in fact, derived from \Rclass{IntegerList}. <>= ir[[1]] @ In order to illustrate range operations, we'll create a function to plot ranges. <>= plotRanges <- function(x, xlim = x, main = deparse(substitute(x)), col = "black", sep = 0.5, ...) { height <- 1 if (is(xlim, "Ranges")) xlim <- c(min(start(xlim)), max(end(xlim))) bins <- disjointBins(IRanges(start(x), end(x) + 1)) plot.new() plot.window(xlim, c(0, max(bins)*(height + sep))) ybottom <- bins * (sep + height) - height rect(start(x)-0.5, ybottom, end(x)+0.5, ybottom + height, col = col, ...) title(main) axis(1) } @ <>= plotRanges(ir) @ \begin{figure}[tb] \begin{center} \includegraphics[width=0.5\textwidth]{IRangesOverview-ir-plotRanges} \caption{\label{fig-ir-plotRanges}% Plot of original ranges.} \end{center} \end{figure} \subsection{Normality} Sometimes, it is necessary to formally represent a subsequence, where no elements are repeated and order is preserved. Also, it is occasionally useful to think of a \Rclass{Ranges} object as a set, where no elements are repeated and order does not matter. While every \Rclass{Ranges} object, as a \Rclass{Sequence} derivative, has an implicit ordering, one can enforce the same ordering for all such objects, so that ordering becomes inconsequential within that context. The \Rclass{NormalIRanges} class formally represents either a subsequence encoding or a set of integers. By definition a Ranges object is said to be \textit{normal} when its ranges are: (a) not empty (i.e. they have a non-null width); (b) not overlapping; (c) ordered from left to right; (d) not even adjacent (i.e. there must be a non empty gap between 2 consecutive ranges). There are three main advantages of using a \textit{normal} \Rclass{Ranges} object: (1) it guarantees a subsequence encoding or set of integers, (2) it is compact in terms of the number of ranges, and (3) it uniquely identifies its information, which simplifies comparisons. The \Rfunction{reduce} function reduces any \Rclass{Ranges} object to a \Rclass{NormalIRanges} by merging redundant ranges. <>= reduce(ir) plotRanges(reduce(ir)) @ \begin{figure}[tb] \begin{center} \includegraphics[width=0.5\textwidth]{IRangesOverview-ranges-reduce} \caption{\label{fig-ranges-reduce}% Plot of reduced ranges.} \end{center} \end{figure} \subsection{Lists of \Rclass{Ranges} objects} As with \Rclass{Sequence} objects, it is common to manipulate collections of \Rclass{Ranges} objects during an analysis. Thus, the \IRanges{} package defines some specific classes for working with multiple \Rclass{Ranges} objects. The \Rclass{RangesList} class asserts that each element is a \Rclass{Ranges} object and provides convenience methods, such as \Rfunction{start}, \Rfunction{end} and \Rfunction{width} accessors that return flat integer vectors, aligning with the unlisted form of the \Rclass{RangesList}. To explicitly construct a \Rclass{RangesList}, use the \Rfunction{RangesList} function. <>= rl <- RangesList(ir, rev(ir)) @ % <>= start(rl) @ \subsection{Sequence Extraction} As the elements of \Rclass{Ranges} objects encode consecutive subsequences, they may be used directly in sequence extraction. Note that when a \textit{normal} \Rclass{Ranges} is given as the index, the result a subsequence, as no elements are repeated or reordered. % <>= irextract <- IRanges(start = c(4501, 4901) , width = 100) seqselect(xRle, irextract) @ % If the sequence is a \Rclass{Sequence} subclass (i.e. not an ordinary \Rclass{vector}), the canonical \Rfunction{[} function also accepts a \Rclass{Ranges} instance. % <>= xRle[irextract] @ % \subsection{Finding Overlapping Ranges} The functions \Rfunction{findOverlaps} and \Rfunction{\%in\%} detect overlaps between two \Rclass{Ranges} objects. <>= ol <- findOverlaps(ir, reduce(ir)) as.matrix(ol) @ \subsection{Counting Overlapping Ranges} The function \Rfunction{coverage} counts the number of ranges over each position. <>= cov <- coverage(ir) plotRanges(ir) cov <- as.vector(cov) mat <- cbind(seq_along(cov)-0.5, cov) d <- diff(cov) != 0 mat <- rbind(cbind(mat[d,1]+1, mat[d,2]), mat) mat <- mat[order(mat[,1]),] lines(mat, col="red", lwd=4) axis(2) @ \begin{figure}[tb] \begin{center} \includegraphics[width=0.5\textwidth]{IRangesOverview-ranges-coverage} \caption{\label{fig-ranges-coverage}% Plot of ranges with accumulated coverage.} \end{center} \end{figure} \subsection{Finding Neighboring Ranges} The \Rfunction{nearest} function finds the nearest neighbor ranges (overlapping is zero distance). The \Rfunction{precede} and \Rfunction{follow} functions find the non-overlapping nearest neighbors on a specific side. \subsection{Transforming Ranges} Utilities are available for transforming a \Rclass{Ranges} object in a variety of ways. Some transformations, like \Rfunction{reduce} introduced above, can be dramatic, while others are simple per-range adjustments of the starts, ends or widths. \subsubsection{Adjusting starts, ends and widths} Perhaps the simplest transformation is to adjust the start values by a scalar offset, as performed by the \Rfunction{shift} function. Below, we shift all ranges forward 10 positions. % <>= shift(ir, 10) @ The arithmetic functions \Rfunction{+}, \Rfunction{-} and \Rfunction{*} change both the start and the end/width by symmetrically expanding or contracting each range. Adding or subtracting a numeric (integer) vector to a \Rclass{Ranges} causes each range to be expanded or contracted on each side by the corresponding value in the numeric vector. <>= ir + seq_len(length(ir)) @ % The \Rfunction{*} operator symmetrically magnifies a \Rclass{Ranges} object by a factor, where positive contracts (zooms in) and negative expands (zooms out). % <>= ir * -2 # half the width @ There are several other ways to form subranges, besides symmetric contraction. These include \Rfunction{narrow}, \Rfunction{threebands} and \Rfunction{restrict}. \Rfunction{narrow} supports the adjustment of start, end and width values, which should be relative to each range. Unlike \Rfunction{shift}, these adjustments are vectorized over the ranges. As its name suggests, the ranges can only be narrowed. % <>= narrow(ir, start=1:5, width=2) @ % The \Rfunction{threebands} function extends \Rfunction{narrow} so that the remaining left and right regions adjacent to the narrowed region are also returned in separate \Rclass{Ranges} objects. <>= threebands(ir, start=1:5, width=2) @ The \Rfunction{restrict} function ensures every range falls within a set of bounds. Ranges are contracted as necessary, and the ranges that fall completely outside of but not adjacent to the bounds are dropped, by default. <>= restrict(ir, start=2, end=3) @ \subsubsection{Making ranges disjoint} A more complex type of operation is making a set of ranges disjoint, \textit{i.e.} non-overlapping. For example, \Rfunction{threebands} returns a disjoint set of three ranges for each input range. The \Rfunction{disjoin} function makes a \Rclass{Ranges} object disjoint by fragmenting it into the widest ranges where the set of overlapping ranges is the same. <>= disjoin(ir) plotRanges(disjoin(ir)) @ \begin{figure}[tb] \begin{center} \includegraphics[width=0.5\textwidth]{IRangesOverview-ranges-disjoin} \caption{\label{fig-ranges-disjoin}% Plot of disjoined ranges.} \end{center} \end{figure} A variant of \Rfunction{disjoin} is \Rfunction{disjointBins}, which divides the ranges into bins, such that the ranges in each bin are disjoint. The return value is an integer vector of the bins. <>= disjointBins(ir) @ \subsubsection{Other transformations} Other transformations include \Rfunction{reflect} and \Rfunction{flank}. The former ``flips'' each range within a set of common reference bounds. <>= reflect(ir, IRanges(start(ir), width=width(ir)*2)) @ % The \Rfunction{flank} returns ranges of a specified width that flank, to the left (default) or right, each input range. One use case of this is forming promoter regions for a set of genes. <>= flank(ir, width = seq_len(length(ir))) @ % \subsection{Set Operations} Sometimes, it is useful to consider a \Rclass{Ranges} object as a set of integers, although there is always an implicit ordering. This is formalized by \Rclass{NormalIRanges}, above, and we now present versions of the traditional mathematical set operations \textit{complement}, \textit{union}, \textit{intersect}, and \textit{difference} for \Rclass{Ranges} objects. There are two variants for each operation. The first treats each \Rclass{Ranges} object as a set and returns a \textit{normal} value, while the other has a ``parallel'' semantic like \Rfunction{pmin}/\Rfunction{pmax} and performs the operation for each range pairing separately. The \textit{complement} operation is implemented by the \Rfunction{gaps} and \Rfunction{pgap} functions. By default, \Rfunction{gaps} will return the ranges that fall between the ranges in the (normalized) input. It is possible to specify a set of bounds, so that flanking ranges are included. <>= gaps(ir, start=1, end=50) plotRanges(gaps(ir, start=1, end=50), c(1,50)) @ \begin{figure}[tb] \begin{center} \includegraphics[width=0.5\textwidth]{IRangesOverview-ranges-gaps} \caption{\label{fig-ranges-gap}% Plot of gaps from ranges.} \end{center} \end{figure} \Rfunction{pgap} considers each parallel pairing between two \Rclass{Ranges} objects and finds the range, if any, between them. Note that the function name is singular, suggesting that only one range is returned per range in the input. <>= @ The remaining operations, \textit{union}, \textit{intersect} and \textit{difference} are implemented by the \Rfunction{[p]union}, \Rfunction{[p]intersect} and \Rfunction{[p]setdiff} functions, respectively. These are relatively self-explanatory. <>= @ <>= @ <>= @ <>= @ <>= @ <>= @ % \subsection{Mapping Ranges Between Sequences} % \Rclass{RangesMapping} % \Rfunction{map} \section{Sequence Views} When we extract a sequence with \Rfunction{seqselect}, we can pass multiple ranges, each selecting a single consecutive subsequence. Those subsequences are extracted and concatenated into a single sequence. There are many cases where the user wishes to avoid the concatenation step and instead treat each consecutive subsequence as a separate element in a list. While one could simply store each extracted sequence as an element in a list object like a \Rclass{SimpleList}, this is undesirable for a couple of reasons. First, the user often wants to preserve the original sequence and declare a set of interesting regions as an overlay. This allows retrieving sequence values even after the ranges have been adjusted. Another benefit of an overlay approach is performance: the sequence values need not be copied. For representing such an overlay, the \IRanges{} package provides the virtual \Rclass{Views} class, which derives from \Rclass{Ranges} but also stores a sequence. Each range is said to represent a \textit{view} onto the sequence. Here, we will demonstrate the \Rclass{RleViews} class, where the sequence is of class \Rclass{Rle}. Other \Rclass{Views} implementations exist, such as \Rclass{XStringViews} in the \Rpackage{Biostrings} package. \subsection{Creating Views} There are two basic constructors for creating views: the \Rfunction{Views} function based on indicators and the \Rfunction{slice} based on numeric boundaries. <>= xViews <- Views(xRle, xRle >= 1) xViews <- slice(xRle, 1) xViewsList <- slice(xRleList, 1) @ \subsection{Aggregating Views} While \Rfunction{sapply} can be used to loop over each window, the native functions \Rfunction{viewMaxs}, \Rfunction{viewMins}, \Rfunction{viewSums}, and \Rfunction{viewMeans} provide fast looping to calculate their respective statistical summaries. <>= head(viewSums(xViews)) viewSums(xViewsList) head(viewMaxs(xViews)) viewMaxs(xViewsList) @ \section{Data on Ranges} When analyzing ranges, there are often additional variables of interest, besides the geometry (starts, ends and widths). To formally represent a dataset where the ranges are the observations, \IRanges{} defines the \Rclass{RangedData} class. %\subsection{Manipulating \Rclass{RangedData}} To create a \Rclass{RangedData} instance, one needs to provide a \Rclass{Ranges} object and, optionally, any number of variables on those ranges. The variable objects need not be vectors, but they must satisfy the contract of \Rclass{DataFrame}. <>= values <- rnorm(length(ir)) rd <- RangedData(ir, name = letters[seq_len(length(ir))], values) rd @ One might notice the term ``sequence'' in the above output. This refers to an important feature of \Rclass{RangedData}: the ability to segregate ranges by their sequence (or space). For example, when analyzing genomic data, one is often working with ranges on different chromosomes. In many cases, such as when calculating overlap, one needs to separately treat ranges from different spaces, and \Rclass{RangedData} aims to facilitate this mode of operation. The segregation may be performed at construction time. <>= rd <- RangedData(ir, name = letters[seq_len(length(ir))], values, space = rep(c("chr1", "chr2"), c(3, length(ir) - 3))) rd @ With the knowledge that the data is split into spaces, it should not be surprising that the \Rfunction{ranges} accessor returns a \Rclass{RangesList} and \Rfunction{values} returns a \Rclass{SplitDataFrameList}. <>= ranges(rd) @ <>= values(rd) @ To obtain a \Rclass{RangedData} for a specific set of spaces, one should use the \Rfunction{[} function, which accepts logical, numeric and character indices. <>= rd["chr1"] @ % <>= all(identical(rd["chr1"], rd[1]), identical(rd[1], rd[c(TRUE, FALSE)])) @ The \Rfunction{names} and \Rfunction{length} functions return the names and number of spaces, respectively. <>= names(rd) @ <>= length(rd) @ The \Rfunction{lapply} function operates over the spaces. The object passed to the user function is a subset \Rclass{RangedData}. <>= lapply(rd, names) @ The above would suggest that \Rclass{RangedData} is a sequence of spaces. However, \Rclass{RangedData} also inherits from \Rclass{DataTable}, so it in some ways behaves like a sequence of columns. For example, one can extract a column via \Rfunction{\$} or \Rfunction{[[}. <>= rd[[2]] @ <>= rd$values @ Note that the extracted columns are ``unlisted'' over the spaces, which is usually much more convenient than obtaining them as lists. It is important to note that the elements have been sorted by the space factor and thus may not have the same order as the objects passed to the constructor. The two dimensional matrix-style subsetting is also supported. The rows are indexed globally, independent of space. <>= rd[1:3, "name"] @ % \subsection{Applying Over Spaces} % \Rclass{RDApplyParams} % \Rclass{FilterRules} \section{IRanges in Biological Sequence Analysis} The \IRanges{} packages was primarily designed with biological sequence analysis in mind and Table \ref{table:bioseq} shows how some biological sequence analysis concepts are represented in the \IRanges{} class system. \begin{table}[ht] \begin{center} \begin{tabular}{l|l} \hline Biological Entity & \Rclass{Sequence} Subclass \\ \hline Genome browser track(s) & \Rclass{RangedData}/\Rclass{RangedDataList} \\ Coverage across chromosomes/contigs & \Rclass{RleList} \\ Mapped ranges to genome & \Rclass{CompressedIRangesList} \\ Data (sans ranges) across chroms/contigs & \Rclass{SplitDataFrameList} \\ \hline \end{tabular} \end{center} \caption{\Rclass{Sequence} subclasses for Biological Sequence Analysis} \label{table:bioseq} \end{table} \pagebreak[4] \section{Session Information} \begin{table*}[tbp] \begin{minipage}{\textwidth} <>= toLatex(sessionInfo()) @ \end{minipage} \caption{\label{tab:sessioninfo}% The output of \Rfunction{sessionInfo} on the build system after running this vignette.} \end{table*} \end{document}