fastRanges provides high-throughput interval operations
for Bioconductor workflows built on IRanges and
GRanges.
For genomic data, the central data type is GRanges. Each
row is one genomic interval with:
"chr1"start, end, or
width)"+", "-", or
"*")Typical analyses ask questions such as:
fastRanges is designed to answer those questions while
keeping its grammar close to the Bioconductor overlap APIs in
IRanges and GenomicRanges.
The package is best thought of as a high-throughput interval engine
for supported Bioconductor range classes rather than as a promise of
full drop-in replacement for every findOverlaps() input
type.
library(fastRanges)
library(GenomicRanges)
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: generics
#>
#> Attaching package: 'generics'
#> The following objects are masked from 'package:base':
#>
#> as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
#> setequal, union
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> anyDuplicated, aperm, append, as.data.frame, basename, cbind,
#> colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
#> get, grep, grepl, is.unsorted, lapply, Map, mapply, match, mget,
#> order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
#> rbind, Reduce, rownames, sapply, saveRDS, table, tapply, unique,
#> unsplit, which.max, which.min
#> Loading required package: S4Vectors
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#>
#> findMatches
#> The following objects are masked from 'package:base':
#>
#> expand.grid, I, unname
#> Loading required package: IRanges
#> Loading required package: Seqinfo
library(S4Vectors)
data("fast_ranges_example", package = "fastRanges")
query <- fast_ranges_example$query
subject <- fast_ranges_example$subjectThe package uses the same two-object language as Bioconductor:
query: the intervals you are asking aboutsubject: the intervals you compare the query
againstIf a query peak overlaps three subject genes, then:
fast_find_overlaps() returns three hit pairsfast_count_overlaps() returns 3 for that
query rowfast_overlaps_any() returns TRUE for that
query rowThat distinction is the main decision point when choosing a function.
| Goal | Function | Main output |
|---|---|---|
| Get every matching pair | fast_find_overlaps() |
Hits |
| Count matches per query | fast_count_overlaps() |
integer vector |
| Ask only yes/no | fast_overlaps_any() |
logical vector |
| Build a tabular overlap join | fast_overlap_join() |
data.frame |
| Keep only matching query rows | fast_semi_overlap_join() |
data.frame |
| Keep only non-matching query rows | fast_anti_overlap_join() |
data.frame |
| Reuse the same subject many times | fast_build_index() |
fast_ranges_index |
| Find nearest subject | fast_nearest() |
DataFrame |
| Summarize overlap counts by subject metadata | fast_count_overlaps_by_group() |
matrix |
| Summarize subject scores over overlaps | fast_overlap_aggregate() |
numeric vector |
| Merge or transform intervals themselves | fast_reduce(), fast_disjoin(),
fast_gaps() |
ranges |
| Summarize coverage in bins | fast_tile_coverage() |
data.frame |
The package ships with a small reproducible example:
names(fast_ranges_example)
#> [1] "query" "subject"
query
#> GRanges object with 6 ranges and 2 metadata columns:
#> seqnames ranges strand | query_id score
#> <Rle> <IRanges> <Rle> | <character> <integer>
#> [1] chr1 100-159 + | q1 8
#> [2] chr1 180-269 + | q2 10
#> [3] chr1 350-419 - | q3 6
#> [4] chr2 40-79 * | q4 3
#> [5] chr2 300-379 + | q5 9
#> [6] chr3 10-34 - | q6 5
#> -------
#> seqinfo: 3 sequences from an unspecified genome; no seqlengths
subject
#> GRanges object with 7 ranges and 2 metadata columns:
#> seqnames ranges strand | gene_id biotype
#> <Rle> <IRanges> <Rle> | <character> <character>
#> [1] chr1 90-169 + | gA promoter
#> [2] chr1 210-309 - | gB enhancer
#> [3] chr1 320-369 - | gC enhancer
#> [4] chr2 20-89 + | gD promoter
#> [5] chr2 330-419 * | gE gene_body
#> [6] chr3 1-40 - | gF promoter
#> [7] chr3 80-114 + | gG enhancer
#> -------
#> seqinfo: 3 sequences from an unspecified genome; no seqlengthsIn a real analysis, query might be ChIP-seq peaks and
subject might be genes, exons, enhancers, promoters, or
other genomic annotations.
This example is intentionally small so every function in the vignette can run quickly. The structure is:
query: 6 genomic intervals with metadata columns
query_id and scoresubject: 7 genomic intervals with metadata columns
gene_id and biotypeIf you want file-based examples instead of in-memory objects, the same records are also installed as BED files:
system.file("extdata", "query_peaks.bed", package = "fastRanges")
#> [1] "/tmp/RtmpdxvFL5/Rinstf5825d7003e/fastRanges/extdata/query_peaks.bed"
system.file("extdata", "subject_genes.bed", package = "fastRanges")
#> [1] "/tmp/RtmpdxvFL5/Rinstf5825d7003e/fastRanges/extdata/subject_genes.bed"That split is deliberate:
fast_ranges_example for package examples,
tutorials, and testshits <- fast_find_overlaps(query, subject, threads = 2)
hits
#> Hits object with 5 hits and 0 metadata columns:
#> from to
#> <integer> <integer>
#> [1] 1 1
#> [2] 3 3
#> [3] 4 4
#> [4] 5 5
#> [5] 6 6
#> -------
#> nLnode: 6 / nRnode: 7This result is a Hits object. The important parts
are:
queryHits(hits): row indices from
querysubjectHits(hits): matching row indices from
subjectThere is one value per query row:
0 means no subject matched2 means that query matched two subject rangesMost overlap-style functions use the same argument grammar:
querysubjectmax_gapmin_overlaptypeignore_strandthreadsdeterministictypetype controls what “match” means.
type = "any"
query : |------|
subject: |------|
=> any qualifying overlap is a hit
type = "within"
subject: |--------------|
query : |------|
=> query must lie inside subject
type = "start"
query : |------|
subject: |------------|
=> start coordinates must match or be within tolerance
type = "end"
query : |------|
subject: |------------|
=> end coordinates must match or be within tolerance
type = "equal"
query : |------|
subject: |------|
=> start and end coordinates must match, or be within tolerance
max_gapmax_gap controls how far apart two ranges may be and
still count as a hit.
-1 means true overlap is required0 allows Bioconductor’s zero-gap tolerance
behaviormin_overlapmin_overlap is the minimum overlap width, in bases.
0 is the least strict settingignore_strandFor GRanges:
FALSE uses strand-aware Bioconductor semanticsTRUE ignores strand and compares only genomic
coordinatesFor IRanges, strand does not exist, so this argument has
no effect.
threads and deterministicthreads controls parallel executiondeterministic = TRUE keeps stable ordering of hit
pairsdeterministic = FALSE can be slightly faster for large
jobs when order is unimportantfastRanges is intended to stay close to Bioconductor
overlap semantics.
The easiest way to see that is to compare outputs directly:
ref <- GenomicRanges::findOverlaps(query, subject, ignore.strand = FALSE)
identical(
cbind(S4Vectors::queryHits(hits), S4Vectors::subjectHits(hits)),
cbind(S4Vectors::queryHits(ref), S4Vectors::subjectHits(ref))
)
#> [1] TRUEIRangesGRanges"any", "start",
"end", "within", and "equal"GRangesselect = "all", "first",
"last", and "arbitrary"GRangesListFor these unsupported cases, fastRanges now raises an
explicit error rather than returning silently different answers.
Width-0 / empty-range cases are handled using Bioconductor-compatible reference behavior. This is a correctness choice: empty ranges are uncommon in production overlap workloads, so exact semantics are more important than forcing these cases through the multithreaded C++ path.
There are two intended usage modes:
fast_find_overlaps(query, subject, ...)idx <- fast_build_index(subject)fast_find_overlaps(query, idx, ...)This distinction matters for performance much more than for semantics.
deterministicdeterministic = TRUE keeps output ordering stable across
thread counts and is useful for tests, reports, and exact
reproducibility.
deterministic = FALSE avoids that extra ordering step
and is usually the right choice when you care about maximum throughput
on large multithreaded jobs.
That compatibility is important because fastRanges is
designed to fit into existing IRanges /
GRanges workflows rather than replace those classes.
ret_classes <- c(
hits = class(fast_find_overlaps(query, subject, threads = 1))[1],
count = class(fast_count_overlaps(query, subject, threads = 1))[1],
any = class(fast_overlaps_any(query, subject, threads = 1))[1],
nearest = class(fast_nearest(query, subject))[1],
join = class(fast_overlap_join(query, subject, threads = 1))[1],
coverage = class(fast_coverage(query))[1]
)
ret_classes
#> hits count any nearest join
#> "Hits" "integer" "logical" "DFrame" "data.frame"
#> coverage
#> "SimpleRleList"Common return shapes:
Hits: pairwise overlap relationshipsDataFrame: nearest-neighbor tablesdata.frame: join or coverage summary tablesGRanges / IRanges: transformed
intervalsRleList or Rle: base-resolution
coverageIf you will query the same subject repeatedly, build an
index once and reuse it.
subject_index <- fast_build_index(subject)
subject_index
#> <fast_ranges_index>
#> subject class: GRanges
#> subject ranges: 7
#> partitions: 3Now use that index in later overlap calls:
hits_from_index <- fast_find_overlaps(query, subject_index, threads = 2)
identical(
cbind(S4Vectors::queryHits(hits_from_index), S4Vectors::subjectHits(hits_from_index)),
cbind(S4Vectors::queryHits(hits), S4Vectors::subjectHits(hits))
)
#> [1] TRUEWhen should you build an index?
The join family turns overlap hits into beginner-friendly tables.
joined_inner <- fast_overlap_join(query, subject, join = "inner", threads = 2)
head(joined_inner)
#> query_id subject_id query_seqnames query_start query_end query_width
#> 1 1 1 chr1 100 159 60
#> 3 3 3 chr1 350 419 70
#> 4 4 4 chr2 40 79 40
#> 5 5 5 chr2 300 379 80
#> 6 6 6 chr3 10 34 25
#> query_strand query_query_id query_score subject_seqnames subject_start
#> 1 + q1 8 chr1 90
#> 3 - q3 6 chr1 320
#> 4 * q4 3 chr2 20
#> 5 + q5 9 chr2 330
#> 6 - q6 5 chr3 1
#> subject_end subject_width subject_strand subject_gene_id subject_biotype
#> 1 169 80 + gA promoter
#> 3 369 50 - gC enhancer
#> 4 89 70 + gD promoter
#> 5 419 90 * gE gene_body
#> 6 40 40 - gF promoterImportant columns:
query_id: row index from querysubject_id: matching row index from
subjectquery_*: columns copied from querysubject_*: columns copied from
subjectLeft join keeps all query rows:
joined_left <- fast_left_overlap_join(query, subject, threads = 2)
head(joined_left)
#> query_id subject_id query_seqnames query_start query_end query_width
#> 1 1 1 chr1 100 159 60
#> 3 3 3 chr1 350 419 70
#> 4 4 4 chr2 40 79 40
#> 5 5 5 chr2 300 379 80
#> 6 6 6 chr3 10 34 25
#> 2 2 NA chr1 180 269 90
#> query_strand query_query_id query_score subject_seqnames subject_start
#> 1 + q1 8 chr1 90
#> 3 - q3 6 chr1 320
#> 4 * q4 3 chr2 20
#> 5 + q5 9 chr2 330
#> 6 - q6 5 chr3 1
#> 2 + q2 10 <NA> NA
#> subject_end subject_width subject_strand subject_gene_id subject_biotype
#> 1 169 80 + gA promoter
#> 3 369 50 - gC enhancer
#> 4 89 70 + gD promoter
#> 5 419 90 * gE gene_body
#> 6 40 40 - gF promoter
#> 2 NA NA <NA> <NA> <NA>Semi and anti joins keep only query rows:
semi_tbl <- fast_semi_overlap_join(query, subject, threads = 2)
anti_tbl <- fast_anti_overlap_join(query, subject, threads = 2)
head(semi_tbl)
#> query_id overlap_count query_seqnames query_start query_end query_width
#> 1 1 1 chr1 100 159 60
#> 3 3 1 chr1 350 419 70
#> 4 4 1 chr2 40 79 40
#> 5 5 1 chr2 300 379 80
#> 6 6 1 chr3 10 34 25
#> query_strand query_query_id query_score
#> 1 + q1 8
#> 3 - q3 6
#> 4 * q4 3
#> 5 + q5 9
#> 6 - q6 5
head(anti_tbl)
#> query_id overlap_count query_seqnames query_start query_end query_width
#> 2 2 0 chr1 180 269 90
#> query_strand query_query_id query_score
#> 2 + q2 10These functions are useful when direct overlap is not the only biological question.
nearest_tbl <- fast_nearest(query, subject)
head(as.data.frame(nearest_tbl))
#> query_id subject_id distance
#> 1 1 1 0
#> 2 2 1 10
#> 3 3 3 0
#> 4 4 4 0
#> 5 5 5 0
#> 6 6 6 0Interpretation:
subject_id is the nearest subject for each matched
querydistance = 0 means overlapDirectional helpers return subject row indices:
Many analyses need more than raw hits. They need counts or scores summarized by subject annotation.
set.seed(1)
S4Vectors::mcols(subject)$group <- sample(
c("promoter", "enhancer"),
length(subject),
replace = TRUE
)
S4Vectors::mcols(subject)$score <- seq_len(length(subject))Self-overlaps compare a range set to itself:
self_hits <- fast_self_overlaps(query, threads = 2)
self_hits
#> Hits object with 0 hits and 0 metadata columns:
#> from to
#> <integer> <integer>
#> -------
#> nLnode: 6 / nRnode: 6Clusters identify connected components in the self-overlap graph:
clusters <- fast_cluster_overlaps(query, return = "data.frame")
head(clusters)
#> range_id cluster_id cluster_size
#> 1 1 1 1
#> 2 2 2 1
#> 3 3 3 1
#> 4 4 4 1
#> 5 5 5 1
#> 6 6 6 1Sliding-window summaries convert interval data into a regular table:
window_counts <- fast_window_count_overlaps(
query,
subject,
window_width = 1000L,
step_width = 500L,
threads = 2
)
#> Warning in .merge_two_Seqinfo_objects(x, y): The 2 combined objects have no sequence levels in common. (Use
#> suppressWarnings() to suppress this warning.)
#> Warning in .merge_two_Seqinfo_objects(x, y): The 2 combined objects have no sequence levels in common. (Use
#> suppressWarnings() to suppress this warning.)
head(window_counts)
#> window_id seqnames start end overlap_count
#> 1 1 chr1 100 1099 3
#> 2 2 chr2 40 1039 2
#> 3 3 chr3 10 1009 2These functions work on the ranges themselves rather than overlap hit pairs.
query_reduced <- fast_reduce(query)
query_disjoint <- fast_disjoin(query)
query_gaps <- fast_gaps(query)
c(
original = length(query),
reduced = length(query_reduced),
disjoint = length(query_disjoint),
gaps = length(query_gaps)
)
#> original reduced disjoint gaps
#> 6 6 6 6You can also combine two range sets directly:
union_ranges <- fast_range_union(query, subject)
intersect_ranges <- fast_range_intersect(query, subject)
setdiff_ranges <- fast_range_setdiff(query, subject)
c(
union = length(union_ranges),
intersect = length(intersect_ranges),
setdiff = length(setdiff_ranges)
)
#> union intersect setdiff
#> 10 3 4Coverage answers: “how many ranges cover each base position?”
query_cov <- fast_coverage(query)
query_cov
#> RleList of length 3
#> $chr1
#> integer-Rle of length 419 with 6 runs
#> Lengths: 99 60 20 90 80 70
#> Values : 0 1 0 1 0 1
#>
#> $chr2
#> integer-Rle of length 379 with 4 runs
#> Lengths: 39 40 220 80
#> Values : 0 1 0 1
#>
#> $chr3
#> integer-Rle of length 34 with 2 runs
#> Lengths: 9 25
#> Values : 0 1For plotting and downstream summaries, tile-based coverage is often easier to work with:
Use the iterator API when you do not want to materialize all hits at once.
iter <- fast_find_overlaps_iter(query, subject_index, chunk_size = 2L, threads = 2)
fast_iter_has_next(iter)
#> [1] TRUE
chunk1 <- fast_iter_next(iter)
chunk1
#> Hits object with 1 hit and 0 metadata columns:
#> from to
#> <integer> <integer>
#> [1] 1 1
#> -------
#> nLnode: 6 / nRnode: 7
fast_iter_has_next(iter)
#> [1] TRUETo start over:
fast_iter_collect() gathers the remaining chunks from
the current iterator position. If you already consumed some chunks and
want everything again, reset the iterator first.
tmp_index <- tempfile(fileext = ".rds")
fast_save_index(subject_index, tmp_index)
loaded_index <- fast_load_index(tmp_index)
fast_index_stats(loaded_index)
#> DataFrame with 1 row and 6 columns
#> subject_n partition_n block_n seqlevel_n mean_partition_size index_size_mb
#> <integer> <integer> <integer> <integer> <numeric> <numeric>
#> 1 7 3 3 3 2.33333 0.0043869
unlink(tmp_index)This is useful when index construction is expensive and the same subject set will be queried again in a later session.
The chunk below is only a small demonstration. Publication-scale
benchmarking should use the Quarto workflow in
inst/benchmarks.
base_time <- system.time({
h_base <- GenomicRanges::findOverlaps(query, subject, ignore.strand = FALSE)
})
fast_time <- system.time({
h_fast <- fast_find_overlaps(query, subject_index, ignore_strand = FALSE, threads = 2)
})
rbind(
genomic_ranges = base_time[1:3],
fast_ranges = fast_time[1:3]
)
#> user.self sys.self elapsed
#> genomic_ranges 0.010 0 0.011
#> fast_ranges 0.003 0 0.003
c(base_hits = length(h_base), fast_hits = length(h_fast))
#> base_hits fast_hits
#> 5 5The vignette example above is intentionally small. For large benchmark studies, repeated-query experiments, and figure-rich interpretation, use the benchmark resources maintained in the repository:
This separation is deliberate:
The package also ships example BED files and benchmarking helpers:
sessionInfo()
#> R version 4.5.3 (2026-03-11)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] GenomicRanges_1.63.2 Seqinfo_1.1.0 IRanges_2.45.0
#> [4] S4Vectors_0.49.2 BiocGenerics_0.57.1 generics_0.1.4
#> [7] fastRanges_0.99.2 BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] httr_1.4.8 cli_3.6.6 knitr_1.51
#> [4] rlang_1.2.0 xfun_0.57 UCSC.utils_1.7.1
#> [7] jsonlite_2.0.0 buildtools_1.0.0 htmltools_0.5.9
#> [10] maketools_1.3.2 sys_3.4.3 sass_0.4.10
#> [13] rmarkdown_2.31 evaluate_1.0.5 jquerylib_0.1.4
#> [16] fastmap_1.2.0 GenomeInfoDb_1.47.2 yaml_2.3.12
#> [19] lifecycle_1.0.5 BiocManager_1.30.27 compiler_4.5.3
#> [22] Rcpp_1.1.1-1 XVector_0.51.0 digest_0.6.39
#> [25] R6_2.6.1 bslib_0.10.0 tools_4.5.3
#> [28] cachem_1.1.0