--- title: "fastRanges: A Practical Introduction to Genomic Interval Analysis" output: BiocStyle::html_document: toc: true vignette: > %\VignetteIndexEntry{fastRanges: A Practical Introduction to Genomic Interval Analysis} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r vignette-options, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") ``` ## Overview `fastRanges` provides high-throughput interval operations for Bioconductor workflows built on `IRanges` and `GRanges`. For genomic data, the central data type is `GRanges`. Each row is one genomic interval with: - a sequence name such as `"chr1"` - a genomic range (`start`, `end`, or `width`) - a strand (`"+"`, `"-"`, or `"*"`) Typical analyses ask questions such as: - Which peaks overlap genes or promoters? - How many query ranges overlap each subject annotation? - Which subject is nearest to each query? - How can the same subject annotation be reused across many batches of query intervals? `fastRanges` is designed to answer those questions while keeping its grammar close to the Bioconductor overlap APIs in `IRanges` and `GenomicRanges`. The package is best thought of as a high-throughput interval engine for supported Bioconductor range classes rather than as a promise of full drop-in replacement for every `findOverlaps()` input type. ```{r setup} library(fastRanges) library(GenomicRanges) library(S4Vectors) data("fast_ranges_example", package = "fastRanges") query <- fast_ranges_example$query subject <- fast_ranges_example$subject ``` ## A Mental Model The package uses the same two-object language as Bioconductor: - `query`: the intervals you are asking about - `subject`: the intervals you compare the query against If a query peak overlaps three subject genes, then: - `fast_find_overlaps()` returns three hit pairs - `fast_count_overlaps()` returns `3` for that query row - `fast_overlaps_any()` returns `TRUE` for that query row That distinction is the main decision point when choosing a function. ## Which Function Should You Use? | Goal | Function | Main output | |---|---|---| | Get every matching pair | `fast_find_overlaps()` | `Hits` | | Count matches per query | `fast_count_overlaps()` | integer vector | | Ask only yes/no | `fast_overlaps_any()` | logical vector | | Build a tabular overlap join | `fast_overlap_join()` | `data.frame` | | Keep only matching query rows | `fast_semi_overlap_join()` | `data.frame` | | Keep only non-matching query rows | `fast_anti_overlap_join()` | `data.frame` | | Reuse the same subject many times | `fast_build_index()` | `fast_ranges_index` | | Find nearest subject | `fast_nearest()` | `DataFrame` | | Summarize overlap counts by subject metadata | `fast_count_overlaps_by_group()` | matrix | | Summarize subject scores over overlaps | `fast_overlap_aggregate()` | numeric vector | | Merge or transform intervals themselves | `fast_reduce()`, `fast_disjoin()`, `fast_gaps()` | ranges | | Summarize coverage in bins | `fast_tile_coverage()` | `data.frame` | ## The Example Data The package ships with a small reproducible example: ```{r inspect-example} names(fast_ranges_example) query subject ``` In a real analysis, `query` might be ChIP-seq peaks and `subject` might be genes, exons, enhancers, promoters, or other genomic annotations. This example is intentionally small so every function in the vignette can run quickly. The structure is: - `query`: 6 genomic intervals with metadata columns `query_id` and `score` - `subject`: 7 genomic intervals with metadata columns `gene_id` and `biotype` If you want file-based examples instead of in-memory objects, the same records are also installed as BED files: ```{r example-files} system.file("extdata", "query_peaks.bed", package = "fastRanges") system.file("extdata", "subject_genes.bed", package = "fastRanges") ``` That split is deliberate: - use `fast_ranges_example` for package examples, tutorials, and tests - use the BED files when demonstrating import pipelines or command-line inputs ## Core Overlap Operations ### Return all hit pairs ```{r core-overlaps} hits <- fast_find_overlaps(query, subject, threads = 2) hits ``` This result is a `Hits` object. The important parts are: - `queryHits(hits)`: row indices from `query` - `subjectHits(hits)`: matching row indices from `subject` ```{r core-overlaps-inspect} cbind( query_id = S4Vectors::queryHits(hits), subject_id = S4Vectors::subjectHits(hits) ) ``` ### Count matches per query ```{r core-counts} counts <- fast_count_overlaps(query, subject, threads = 2) counts ``` There is one value per query row: - `0` means no subject matched - `2` means that query matched two subject ranges ### Ask only whether a match exists ```{r core-any} any_hits <- fast_overlaps_any(query, subject, threads = 2) any_hits ``` This is the lightest-weight summary when you need only a logical answer. ## Important Arguments Most overlap-style functions use the same argument grammar: - `query` - `subject` - `max_gap` - `min_overlap` - `type` - `ignore_strand` - `threads` - `deterministic` ### `type` `type` controls what "match" means. ```text type = "any" query : |------| subject: |------| => any qualifying overlap is a hit type = "within" subject: |--------------| query : |------| => query must lie inside subject type = "start" query : |------| subject: |------------| => start coordinates must match or be within tolerance type = "end" query : |------| subject: |------------| => end coordinates must match or be within tolerance type = "equal" query : |------| subject: |------| => start and end coordinates must match, or be within tolerance ``` ### `max_gap` `max_gap` controls how far apart two ranges may be and still count as a hit. - `-1` means true overlap is required - `0` allows Bioconductor's zero-gap tolerance behavior - positive values allow nearby, non-overlapping ranges to count in the modes that support that interpretation ### `min_overlap` `min_overlap` is the minimum overlap width, in bases. - `0` is the least strict setting - larger values require wider shared overlap ### `ignore_strand` For `GRanges`: - `FALSE` uses strand-aware Bioconductor semantics - `TRUE` ignores strand and compares only genomic coordinates For `IRanges`, strand does not exist, so this argument has no effect. ### `threads` and `deterministic` - `threads` controls parallel execution - `deterministic = TRUE` keeps stable ordering of hit pairs - `deterministic = FALSE` can be slightly faster for large jobs when order is unimportant ## Bioconductor Compatibility `fastRanges` is intended to stay close to Bioconductor overlap semantics. The easiest way to see that is to compare outputs directly: ```{r compatibility-check} ref <- GenomicRanges::findOverlaps(query, subject, ignore.strand = FALSE) identical( cbind(S4Vectors::queryHits(hits), S4Vectors::subjectHits(hits)), cbind(S4Vectors::queryHits(ref), S4Vectors::subjectHits(ref)) ) ``` ### What is currently supported? - `IRanges` - `GRanges` - overlap modes such as `"any"`, `"start"`, `"end"`, `"within"`, and `"equal"` - strand-aware or strand-ignored comparison for `GRanges` - `select = "all"`, `"first"`, `"last"`, and `"arbitrary"` ### What is not currently supported? - `GRangesList` - circular genomic sequences For these unsupported cases, `fastRanges` now raises an explicit error rather than returning silently different answers. ### Empty ranges Width-0 / empty-range cases are handled using Bioconductor-compatible reference behavior. This is a correctness choice: empty ranges are uncommon in production overlap workloads, so exact semantics are more important than forcing these cases through the multithreaded C++ path. ### Direct vs indexed usage There are two intended usage modes: - Direct mode: - `fast_find_overlaps(query, subject, ...)` - best for one-off calls or exploratory work - Indexed mode: - `idx <- fast_build_index(subject)` - `fast_find_overlaps(query, idx, ...)` - best when the same subject is queried repeatedly This distinction matters for performance much more than for semantics. ### `deterministic` `deterministic = TRUE` keeps output ordering stable across thread counts and is useful for tests, reports, and exact reproducibility. `deterministic = FALSE` avoids that extra ordering step and is usually the right choice when you care about maximum throughput on large multithreaded jobs. That compatibility is important because `fastRanges` is designed to fit into existing `IRanges` / `GRanges` workflows rather than replace those classes. ## Return Types You Will See Most Often ```{r return-types} ret_classes <- c( hits = class(fast_find_overlaps(query, subject, threads = 1))[1], count = class(fast_count_overlaps(query, subject, threads = 1))[1], any = class(fast_overlaps_any(query, subject, threads = 1))[1], nearest = class(fast_nearest(query, subject))[1], join = class(fast_overlap_join(query, subject, threads = 1))[1], coverage = class(fast_coverage(query))[1] ) ret_classes ``` Common return shapes: - `Hits`: pairwise overlap relationships - integer vector: one summary value per query - logical vector: one yes/no value per query - `DataFrame`: nearest-neighbor tables - `data.frame`: join or coverage summary tables - `GRanges` / `IRanges`: transformed intervals - `RleList` or `Rle`: base-resolution coverage ## Reusing the Same Subject with an Index If you will query the same `subject` repeatedly, build an index once and reuse it. ```{r index-build} subject_index <- fast_build_index(subject) subject_index ``` Now use that index in later overlap calls: ```{r index-use} hits_from_index <- fast_find_overlaps(query, subject_index, threads = 2) identical( cbind(S4Vectors::queryHits(hits_from_index), S4Vectors::subjectHits(hits_from_index)), cbind(S4Vectors::queryHits(hits), S4Vectors::subjectHits(hits)) ) ``` When should you build an index? - yes: many query batches against the same subject - yes: benchmarking repeated overlap workloads - no: one small one-off overlap call - no: when you need subject metadata directly in a join output ## Overlap Joins The join family turns overlap hits into beginner-friendly tables. ```{r join-inner} joined_inner <- fast_overlap_join(query, subject, join = "inner", threads = 2) head(joined_inner) ``` Important columns: - `query_id`: row index from `query` - `subject_id`: matching row index from `subject` - `query_*`: columns copied from `query` - `subject_*`: columns copied from `subject` Left join keeps all query rows: ```{r join-left} joined_left <- fast_left_overlap_join(query, subject, threads = 2) head(joined_left) ``` Semi and anti joins keep only query rows: ```{r join-semi-anti} semi_tbl <- fast_semi_overlap_join(query, subject, threads = 2) anti_tbl <- fast_anti_overlap_join(query, subject, threads = 2) head(semi_tbl) head(anti_tbl) ``` ## Nearest and Directional Queries These functions are useful when direct overlap is not the only biological question. ```{r nearest-queries} nearest_tbl <- fast_nearest(query, subject) head(as.data.frame(nearest_tbl)) ``` Interpretation: - `subject_id` is the nearest subject for each matched query - `distance = 0` means overlap - positive distance means the nearest subject is separated by a gap Directional helpers return subject row indices: ```{r directional-queries} fast_precede(query, subject) fast_follow(query, subject) ``` ## Grouped Overlap Summaries Many analyses need more than raw hits. They need counts or scores summarized by subject annotation. ```{r grouped-setup} set.seed(1) S4Vectors::mcols(subject)$group <- sample( c("promoter", "enhancer"), length(subject), replace = TRUE ) S4Vectors::mcols(subject)$score <- seq_len(length(subject)) ``` ### Count overlaps by group ```{r grouped-counts} by_group <- fast_count_overlaps_by_group( query, subject, group_col = "group", threads = 2 ) by_group ``` Rows are query ranges. Columns are subject groups. ### Aggregate a subject score over overlaps ```{r grouped-aggregate} sum_score <- fast_overlap_aggregate( query, subject, value_col = "score", fun = "sum", threads = 2 ) sum_score ``` This pattern is useful when subject intervals carry signal, annotation scores, or other numeric metadata. ## Self-Overlaps, Clusters, and Sliding Windows Self-overlaps compare a range set to itself: ```{r self-overlaps} self_hits <- fast_self_overlaps(query, threads = 2) self_hits ``` Clusters identify connected components in the self-overlap graph: ```{r overlap-clusters} clusters <- fast_cluster_overlaps(query, return = "data.frame") head(clusters) ``` Sliding-window summaries convert interval data into a regular table: ```{r overlap-windows} window_counts <- fast_window_count_overlaps( query, subject, window_width = 1000L, step_width = 500L, threads = 2 ) head(window_counts) ``` ## Range Algebra These functions work on the ranges themselves rather than overlap hit pairs. ```{r range-algebra} query_reduced <- fast_reduce(query) query_disjoint <- fast_disjoin(query) query_gaps <- fast_gaps(query) c( original = length(query), reduced = length(query_reduced), disjoint = length(query_disjoint), gaps = length(query_gaps) ) ``` You can also combine two range sets directly: ```{r range-set-ops} union_ranges <- fast_range_union(query, subject) intersect_ranges <- fast_range_intersect(query, subject) setdiff_ranges <- fast_range_setdiff(query, subject) c( union = length(union_ranges), intersect = length(intersect_ranges), setdiff = length(setdiff_ranges) ) ``` ## Coverage and Binned Coverage Coverage answers: "how many ranges cover each base position?" ```{r coverage} query_cov <- fast_coverage(query) query_cov ``` For plotting and downstream summaries, tile-based coverage is often easier to work with: ```{r tile-coverage} query_tiles <- fast_tile_coverage( query, tile_width = 1000L, step_width = 1000L ) head(query_tiles) ``` ## Iterating over Large Query Sets Use the iterator API when you do not want to materialize all hits at once. ```{r iterator} iter <- fast_find_overlaps_iter(query, subject_index, chunk_size = 2L, threads = 2) fast_iter_has_next(iter) chunk1 <- fast_iter_next(iter) chunk1 fast_iter_has_next(iter) ``` To start over: ```{r iterator-reset} fast_iter_reset(iter) all_iter_hits <- fast_iter_collect(iter) length(all_iter_hits) ``` `fast_iter_collect()` gathers the remaining chunks from the current iterator position. If you already consumed some chunks and want everything again, reset the iterator first. ## Saving and Loading an Index ```{r index-io} tmp_index <- tempfile(fileext = ".rds") fast_save_index(subject_index, tmp_index) loaded_index <- fast_load_index(tmp_index) fast_index_stats(loaded_index) unlink(tmp_index) ``` This is useful when index construction is expensive and the same subject set will be queried again in a later session. ## A Small Benchmark Example The chunk below is only a small demonstration. Publication-scale benchmarking should use the Quarto workflow in `inst/benchmarks`. ```{r benchmark-demo} base_time <- system.time({ h_base <- GenomicRanges::findOverlaps(query, subject, ignore.strand = FALSE) }) fast_time <- system.time({ h_fast <- fast_find_overlaps(query, subject_index, ignore_strand = FALSE, threads = 2) }) rbind( genomic_ranges = base_time[1:3], fast_ranges = fast_time[1:3] ) c(base_hits = length(h_base), fast_hits = length(h_fast)) ``` ## Benchmark Resources The vignette example above is intentionally small. For large benchmark studies, repeated-query experiments, and figure-rich interpretation, use the benchmark resources maintained in the repository: - [Benchmark summary](https://github.com/cparsania/fastRanges/blob/main/inst/benchmarks/README.md) - [Benchmark runner](https://github.com/cparsania/fastRanges/blob/main/inst/benchmarks/benchmark_bioc.qmd) - [Benchmark interpretation report](https://github.com/cparsania/fastRanges/blob/main/inst/benchmarks/benchmark_result_interpretation.qmd) This separation is deliberate: - the vignette stays fast and reproducible on ordinary machines - the benchmark runner can be executed on larger servers - the interpretation report can be rendered later from saved result tables ## Package Example Files The package also ships example BED files and benchmarking helpers: ```{r extdata} c( list.files(system.file("extdata", package = "fastRanges")), list.files(system.file("benchmarks", package = "fastRanges")) ) ``` ## Session Info ```{r session-info} sessionInfo() ```