---
title: "Additional examples of plyranges"
author: "Stuart Lee"
package: plyranges
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Additional examples of plyranges}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

# Quick overview

## About `Ranges`

`Ranges` objects can either represent sets of integers as `IRanges`
(which have start, end and width attributes) or represent genomic intervals
(which have additional attributes, sequence name and strand) as `GRanges`.
In addition, both types of `Ranges` can store information about their
intervals as metadata columns (for example, GC content over a genomic interval).

`Ranges` objects follow the tidy data principle: each row of a `Ranges`
object corresponds to an interval, each column represents a variable
about that interval, and generally each object represents a single unit of
observation (like gene annotations).

We can construct an `IRanges` object from a `data.frame` that has a `start`
column together with a `width` or an `end` column, using the `as_iranges()`
function.

```{r, message=FALSE}
library(plyranges)
# intervals described by start and width
df <- data.frame(start = 1:5, width = 5)
as_iranges(df)
# alternatively with end
df <- data.frame(start = 1:5, end = 5:9)
as_iranges(df)
```

We can also construct a `GRanges` object in a similar manner. Note that a
`GRanges` object requires at least a `seqnames` column to be present in the
`data.frame` (but not necessarily a `strand` column).

```{r}
df <- data.frame(seqnames = c("chr1", "chr2", "chr2", "chr1", "chr2"),
                 start = 1:5,
                 width = 5)
as_granges(df)
# strand can be specified with `+`, `*` (missing) and `-`
df$strand <- c("+", "+", "-", "-", "*")
as_granges(df)
```

# Example: finding GWAS hits that overlap known exons

Let's look at a more realistic example (taken from the HelloRanges vignette).

```{r, include=FALSE}
# locate the example files shipped with HelloRangesData
dir <- system.file(package = "HelloRangesData", "extdata/")
# chromosome lengths, stored as one range per chromosome
genome <- as_granges(read.delim(file.path(dir, "hg19.genome"),
                                header = FALSE),
                     seqnames = V1,
                     start = 1L,
                     width = V2)
gwas <- read_bed(file.path(dir, "gwas.bed"), genome_info = genome)
exons <- read_bed(file.path(dir, "exons.bed"), genome_info = genome)
```

Suppose we have two _GRanges_ objects: one containing coordinates of known
exons and another containing SNPs from a GWAS.

The first and last 5 exons are printed below. There are two additional
columns, corresponding to the exon name and a score. We can check the number
of exons per chromosome using `group_by` and `summarise`.

```{r}
exons
exons %>%
  group_by(seqnames) %>%
  summarise(n = n())
```

Next we create a column representing the transcript ID with `mutate`:

```{r}
# drop the "_exon..." suffix from the exon name to obtain the transcript ID
exons <- exons %>%
  mutate(tx_id = sub("_exon.*", "", name))
```

To find all GWAS SNPs that overlap exons, we use `join_overlap_inner`. This
creates a new _GRanges_ with the coordinates of the SNPs that overlap exons,
as well as metadata from both objects.

```{r}
olap <- join_overlap_inner(gwas, exons)
olap
```

For each SNP we can count the number of times it overlaps a transcript.

```{r}
olap %>%
  group_by(name.x, tx_id) %>%
  summarise(n = n())
```

We can also generate 2bp splice sites on either side of each exon using
`flank_left` and `flank_right`. We add a column indicating the side of
flanking (via the `.id` argument) for illustrative purposes. The `interweave`
function pairs the left and right ranges objects.

```{r}
left_ss <- flank_left(exons, 2L)
right_ss <- flank_right(exons, 2L)
all_ss <- interweave(left_ss, right_ss, .id = "side")
all_ss
```

# Session information

```{r}
sessionInfo()
```