---
title: "jvecfor: Fast K-Nearest Neighbor Search for Single-Cell Analysis"
abstract: >
  **jvecfor** provides fast k-nearest neighbor (KNN) and nearest-neighbor
  graph construction for single-cell genomics. Powered by the **jvector**
  Java library, it delivers HNSW-DiskANN approximate search and VP-tree
  exact search with SIMD acceleration (AVX2/AVX-512), achieving
  approximately 2× speedup over Annoy-based `BiocNeighbors::findKNN` at
  n ≥ 100K cells while maintaining a fully compatible output format
  suitable for drop-in use in Bioconductor workflows.
author:
- name: Anestis Gkanogiannis
  email: anestis@gkanogiannis.com
date: "`r Sys.Date()`"
package: jvecfor
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{jvecfor}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
# Guard all jvecfor calls: skip if Java or the JAR is unavailable
has_java <- nzchar(Sys.which("java"))
has_jar <- length(dir(system.file("java", package = "jvecfor"),
                      pattern = "^jvecfor-.*\\.jar$")) > 0L
run_eg <- has_java && has_jar
```

# Introduction

**jvecfor** is a high-performance KNN library for Bioconductor single-cell
workflows. It wraps a Java library, built on top of
[jvector](https://github.com/jbellis/jvector), which implements two
algorithms:

- **HNSW-DiskANN** (`type = "ann"`) — approximate KNN via Hierarchical
  Navigable Small World graphs, with SIMD-accelerated distance computation
  (AVX2/AVX-512). O(log n) query time; ~2× faster than
  `r Biocpkg("BiocNeighbors")` at n ≥ 100K cells.
- **VP-tree** (`type = "knn"`) — exact KNN using vantage-point trees.
  Deterministic; ideal for small datasets or ground-truth validation.

The R package communicates with the JVM via `system2()` — no JNI, no
`r CRANpkg("rJava")` dependency. The JAR is bundled in `inst/java/` and
requires no separate installation step beyond having Java 20+ on your
`PATH`.
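Conceptually, each call serializes the input matrix, launches the bundled
JAR in a subprocess, and parses the results back into R. The sketch below
is purely illustrative: `jdk.incubator.vector` is the real incubator
module, but the command-line flag (`--k`) and the I/O handling are
hypothetical, not jvecfor's actual interface.

```{r jvm-sketch, eval=FALSE}
# Illustrative sketch only — flag names and I/O format are internal to
# jvecfor and may differ.
jar <- dir(system.file("java", package = "jvecfor"),
           pattern = "^jvecfor-.*\\.jar$", full.names = TRUE)[1]
out <- system2(
  "java",
  c("--add-modules", "jdk.incubator.vector",  # real module; enables SIMD
    "-jar", shQuote(jar),
    "--k", "15"),                             # hypothetical CLI flag
  stdout = TRUE
)
```

Because the bridge is a plain subprocess call, jvecfor inherits none of
the build-time fragility of JNI or `rJava`; the only requirement is a
`java` binary on `PATH`.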
## Function overview

| Function | Description | Returns |
|---|---|---|
| `fastFindKNN()` | Core KNN search | `list($index, $distance)` |
| `fastMakeSNNGraph()` | KNN → shared nearest-neighbor graph | `igraph` object |
| `fastMakeKNNGraph()` | KNN → k-nearest-neighbor graph | `igraph` object |
| `jvecfor_setup()` | Copy a custom JAR into the package | path (invisibly) |

The output of `fastFindKNN()` is a named list with `$index` (n × k integer
matrix, 1-based) and `$distance` (n × k numeric matrix), identical in
structure to `BiocNeighbors::findKNN()`.

# Prerequisites and Setup

## Java version

jvecfor requires **Java ≥ 20** on `PATH`. The JAR uses the
`jdk.incubator.vector` module for SIMD acceleration. Check your
installation:

```{r check-java, eval=FALSE}
system2("java", "--version")  # should print openjdk 20 or higher
```

If `java` is not found, install [Eclipse Temurin](https://adoptium.net) or
any OpenJDK 20+ distribution and ensure the `java` binary is on your
`PATH`.

## Package options

Two global options control runtime behaviour:

```{r options-table, echo=FALSE}
opts <- data.frame(
  Option = c("`jvecfor.verbose`", "`jvecfor.jar`"),
  Default = c("`FALSE`", "`NULL`"),
  Effect = c(
    "Pass `--verbose` to jvecfor, printing HNSW build progress to stderr.",
    "Use a custom JAR instead of the one bundled in `inst/java/`."
  ),
  stringsAsFactors = FALSE
)
knitr::kable(opts)
```

Set them persistently in your `.Rprofile` or per-session:

```{r options-example, eval=FALSE}
options(jvecfor.verbose = TRUE)                       # global verbose mode
options(jvecfor.jar = "/path/to/custom/jvecfor.jar")  # custom JAR
```

## Custom JAR setup

The bundled JAR is installed automatically.
If you have compiled a custom jvecfor build:

```{r setup-jar, eval=FALSE}
# Auto-finds target/jvecfor-*.jar relative to the working directory
jvecfor_setup()

# Or provide the path explicitly
jvecfor_setup(jar_path = "/path/to/jvecfor-x.y.z.jar")
```

# Quick Start

```{r quick-start, eval=run_eg}
library(jvecfor)

# Simulate 300 cells × 30 principal components
set.seed(42)
pca <- matrix(rnorm(300 * 30), nrow = 300, ncol = 30)

# Find 15 nearest neighbours (HNSW-DiskANN approximate, Euclidean)
nn <- fastFindKNN(pca, k = 15)
str(nn)
```

The result is a two-element list. `$index` is a 300 × 15 integer matrix of
1-based neighbor indices; `$distance` is a 300 × 15 numeric matrix of
Euclidean distances.

# Core Function: `fastFindKNN()`

## Output structure

```{r output-structure, eval=run_eg}
dim(nn$index)       # 300 x 15
dim(nn$distance)    # 300 x 15
typeof(nn$index)    # "integer" — 1-based, same as BiocNeighbors
typeof(nn$distance) # "double"

# No zeros in index (1-based) and all values within [1, nrow(pca)]
range(nn$index)
```

This output is structurally identical to `BiocNeighbors::findKNN()`, so
existing code that passes `nn$index` to `r Biocpkg("bluster")` or other
Bioconductor tools works without modification.

## Algorithm choice: ANN vs. exact

```{r ann-vs-knn, eval=run_eg}
# Approximate (HNSW-DiskANN) — default, fast, O(log n) query
nn_ann <- fastFindKNN(pca, k = 15, type = "ann")

# Exact (VP-tree) — deterministic, O(n log n) query
nn_knn <- fastFindKNN(pca, k = 15, type = "knn")

# Sanity check: no duplicate neighbor indices within any row
stopifnot(all(apply(nn_knn$index, 1, function(x) !anyDuplicated(x))))
```

**When to choose ANN vs. exact:**

| Criterion | ANN (`"ann"`) | Exact (`"knn"`) |
|---|---|---|
| Dataset size | n ≥ 5K | n < 5K, or benchmark ground truth |
| Recall | ~95–99 % (tunable) | 100 % |
| Speed | O(log n) | O(n log n) |
| Metrics | euclidean, cosine, dot_product | euclidean, cosine |
| Reproducibility | Deterministic (fixed graph) | Deterministic |

## Distance metrics

### Euclidean (default)

The L2 distance in PC space. Appropriate when principal components have
been whitened (equal variance per component) or when the embedding
preserves meaningful inter-cell distances.

```{r metric-euclidean, eval=run_eg}
nn_euc <- fastFindKNN(pca, k = 15, metric = "euclidean")
```

### Cosine

Measures the angle between cell vectors; ignores magnitude. This is useful
when only the direction of the embedding vector matters, for instance with
log-normalised count embeddings where library-size effects inflate norms.
Normalise rows to unit length before calling to ensure consistent
behaviour:

```{r metric-cosine, eval=run_eg}
# L2-normalise rows
norms <- sqrt(rowSums(pca^2))
pca_norm <- sweep(pca, 1, norms, "/")
nn_cos <- fastFindKNN(pca_norm, k = 15, metric = "cosine")
```

### Dot-product (ANN only)

The inner product ⟨u, v⟩. A similarity measure (higher = more similar)
rather than a distance. Only valid for `type = "ann"` — the VP-tree
requires a proper metric. Attempting to combine `metric = "dot_product"`
with `type = "knn"` raises an informative error:

```{r metric-dotproduct, eval=run_eg}
# ANN with dot-product similarity
nn_dot <- fastFindKNN(pca_norm, k = 15, metric = "dot_product", type = "ann")
```

```{r metric-dotproduct-error, eval=run_eg, error=TRUE}
# Attempting exact + dot_product raises an error
fastFindKNN(pca, k = 15, metric = "dot_product", type = "knn")
```

## Skipping distance computation

When you only need neighbor indices (e.g. for graph construction), set
`get.distance = FALSE`.
This avoids allocating and transmitting the n × k distance matrix, saving
roughly 50 % of output parsing time:

```{r no-distance, eval=run_eg}
nn_idx <- fastFindKNN(pca, k = 15, get.distance = FALSE)
is.null(nn_idx$distance)  # TRUE
```

The `fastMakeSNNGraph()` and `fastMakeKNNGraph()` wrappers set
`get.distance = FALSE` internally.

# HNSW-DiskANN Tuning

## Parameter reference

```{r hnsw-params-table, echo=FALSE}
hnsw_tbl <- data.frame(
  "Parameter" = c(
    "`M`", "`ef.search`", "`oversample.factor`", "`pq.subspaces`"
  ),
  "Default" = c("16", "0 (auto)", "1.0", "0 (off)"),
  "Tuning guidance" = c(
    paste(
      "Connections per node. Increase to 32-64",
      "for high-dimensional (>50 dims) data;",
      "doubles memory and build time."
    ),
    paste(
      "Beam width. Auto = max(k+1, 3k).",
      "Increase (e.g. 100-500) to trade speed for recall."
    ),
    paste(
      "Fetch ceil(ef x factor) candidates, keep top k.",
      "Values 1.5-3.0 improve recall with proportional cost."
    ),
    paste(
      "PQ subspaces. See section 5.3.",
      "4-8x speedup with minimal recall loss."
    )
  ),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
knitr::kable(hnsw_tbl)
```

## High-recall configuration

Use this when recall quality matters more than raw speed — for example
when producing the KNN graph that will underpin a publication's cluster
assignments or trajectory analysis:

```{r high-recall, eval=run_eg}
nn_highrecall <- fastFindKNN(
  pca,
  k = 15,
  M = 32L,
  ef.search = 150L,
  oversample.factor = 2.0
)
dim(nn_highrecall$index)
```

## Product Quantization (pq.subspaces)

Product Quantization (PQ) compresses vectors into compact codes before
distance computation, yielding a 4–8× search speedup with negligible
recall loss. Set `pq.subspaces` to `ncol(X) / 2` as a starting point; the
value must evenly divide `ncol(X)`.
```{r pq-demo, eval=run_eg}
# 300 × 20 matrix — use 10 PQ subspaces (= 20 / 2)
set.seed(1)
pca20 <- matrix(rnorm(300 * 20), 300, 20)

nn_default <- fastFindKNN(pca20, k = 15)                     # no PQ
nn_pq      <- fastFindKNN(pca20, k = 15, pq.subspaces = 10L) # PQ enabled

# Recall overlap (fraction of PQ neighbors that match the default)
shared <- mapply(function(a, b) length(intersect(a, b)),
                 split(nn_default$index, row(nn_default$index)),
                 split(nn_pq$index, row(nn_pq$index)))
message(sprintf("Mean PQ recall vs. default: %.1f%%",
                100 * mean(shared / 15)))
```

For small matrices (< 1K rows) the JVM startup cost dominates and PQ will
not show a wall-clock benefit. The speedup becomes material at n ≥ 10K.

> **Avoid PQ when** `ncol(X)` is very small (< 8 features): quantisation
> error dominates and recall drops sharply.

A timing comparison for large data:

```{r pq-timing, eval=FALSE}
# Illustrative — run on your own large matrix
nn_default_time <- system.time(fastFindKNN(large_pca, k = 15))
nn_pq_time <- system.time(
  fastFindKNN(large_pca, k = 15, pq.subspaces = ncol(large_pca) %/% 2L)
)
nn_default_time["elapsed"] / nn_pq_time["elapsed"]  # expect 4–8×
```

## Thread management

jvecfor auto-detects the number of physical cores and uses
`max(1, physical_cores − 1)` threads by default. Override with
`num.threads`:

```{r threads, eval=run_eg}
# Single-threaded (reproducible, good for shared HPC nodes)
nn_1t <- fastFindKNN(pca, k = 15, num.threads = 1L)

# Explicit thread count
nn_4t <- fastFindKNN(pca, k = 15, num.threads = 4L)

# Use verbose to confirm the thread count jvecfor actually uses
nn_v <- fastFindKNN(pca, k = 15, num.threads = 2L, verbose = TRUE)
```

On shared HPC nodes, set `num.threads` explicitly to avoid consuming all
physical cores and interfering with co-located jobs.
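## Measuring recall of tuned searches

To relate tuning parameters such as `ef.search` to search quality, you can
measure recall of the approximate search against the exact VP-tree result
on the same data. The `knn_recall()` helper below is plain R and not part
of jvecfor, and the chosen `ef.search` values are an illustrative sketch;
useful settings depend on your data.

```{r recall-check, eval=run_eg}
# Mean fraction of approximate neighbors that agree with the exact result
# (both arguments are n × k index matrices)
knn_recall <- function(approx_idx, exact_idx) {
  hits <- vapply(seq_len(nrow(exact_idx)), function(i) {
    length(intersect(approx_idx[i, ], exact_idx[i, ]))
  }, numeric(1))
  mean(hits / ncol(exact_idx))
}

# Exact ground truth, then ANN at increasing beam widths (0 = auto)
exact <- fastFindKNN(pca, k = 15, type = "knn", get.distance = FALSE)
for (ef in c(0L, 100L, 300L)) {
  approx <- fastFindKNN(pca, k = 15, type = "ann",
                        ef.search = ef, get.distance = FALSE)
  message(sprintf("ef.search = %3d: recall = %.3f",
                  ef, knn_recall(approx$index, exact$index)))
}
```

Recall should increase monotonically with `ef.search` at the cost of query
time; once it plateaus, larger values buy nothing.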
# Graph Construction

The `fastMakeSNNGraph()` and `fastMakeKNNGraph()` wrappers build
`r CRANpkg("igraph")` objects directly from the PC matrix, combining
`fastFindKNN()` with `r Biocpkg("bluster")`'s graph constructors.

## Shared Nearest-Neighbor graph

```{r snn-graph, eval=run_eg}
library(igraph)

g_snn <- fastMakeSNNGraph(pca, k = 15)
class(g_snn)
igraph::vcount(g_snn)  # 300 — one vertex per cell
igraph::ecount(g_snn)  # weighted undirected edges
```

The `snn.type` argument controls how edge weights are computed:

```{r snn-type-table, echo=FALSE}
snn_tbl <- data.frame(
  "snn.type" = c(
    '`"rank"` (default)', '`"jaccard"`', '`"number"`'
  ),
  "Edge weight" = c(
    "Rank-based Jaccard similarity",
    "Jaccard similarity",
    "Count of shared neighbors"
  ),
  "Use case" = c(
    "Seurat-compatible; robust to hub nodes",
    "Standard set-overlap measure",
    "Simple; unnormalised"
  ),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
knitr::kable(snn_tbl)
```

```{r snn-types, eval=run_eg}
g_rank    <- fastMakeSNNGraph(pca, k = 15, snn.type = "rank")
g_jaccard <- fastMakeSNNGraph(pca, k = 15, snn.type = "jaccard")
g_number  <- fastMakeSNNGraph(pca, k = 15, snn.type = "number")

# Edge weight ranges differ between types
summary(igraph::E(g_rank)$weight)
summary(igraph::E(g_jaccard)$weight)
```

## K-Nearest Neighbor graph

```{r knn-graph, eval=run_eg}
# Undirected KNN graph (mutual edges)
g_knn_u <- fastMakeKNNGraph(pca, k = 15, directed = FALSE)

# Directed KNN graph (each cell points to its k neighbors)
g_knn_d <- fastMakeKNNGraph(pca, k = 15, directed = TRUE)
igraph::is_directed(g_knn_d)  # TRUE
```

## Community detection

The SNN graph is the standard input for graph-based clustering in
single-cell analysis.
Here we use the Louvain algorithm:

```{r louvain, eval=run_eg}
louvain <- igraph::cluster_louvain(g_snn)
message(
  "Number of detected communities: ",
  length(igraph::communities(louvain))
)
membership_vec <- igraph::membership(louvain)
table(membership_vec)
```

Leiden clustering (when available via `r CRANpkg("igraph")` ≥ 1.3) gives
finer control over community resolution:

```{r leiden, eval=FALSE}
leiden <- igraph::cluster_leiden(g_snn, resolution_parameter = 0.5)
table(igraph::membership(leiden))
```

# Full Single-Cell Workflow

This end-to-end example simulates three cell populations (clusters)
embedded in 30 PCs and recovers them via SNN graph clustering. All steps
use only **jvecfor** and `r CRANpkg("igraph")`.

```{r workflow-simulate, eval=run_eg}
# Simulate 300 cells (100 per cluster) with separable cluster centres in PC1–PC2
set.seed(42)
n_per <- 100
centres <- list(c(5, 0), c(-5, 0), c(0, 5))
pca_clust <- do.call(rbind, lapply(seq_along(centres), function(i) {
  m <- matrix(rnorm(n_per * 30), n_per, 30)
  m[, 1] <- m[, 1] + centres[[i]][1]
  m[, 2] <- m[, 2] + centres[[i]][2]
  m
}))
true_labels <- rep(1:3, each = n_per)
```

```{r workflow-knn, eval=run_eg}
# Step 1: find 15 nearest neighbours
nn_wf <- fastFindKNN(pca_clust, k = 15)
```

```{r workflow-graph, eval=run_eg}
# Step 2: build SNN graph
g_wf <- fastMakeSNNGraph(pca_clust, k = 15)
```

```{r workflow-cluster, eval=run_eg}
# Step 3: Louvain community detection
lou_wf <- igraph::cluster_louvain(g_wf)
detected <- igraph::membership(lou_wf)

# Step 4: cross-tabulate detected vs. true labels
print(table(Detected = detected, True = true_labels))
```

```{r workflow-plot-cap, include=FALSE, eval=run_eg}
wf_cap <- paste(
  "Louvain clusters (colours) in PC1-PC2 space.",
  "Three well-separated populations are recovered."
)
```

```{r workflow-plot, eval=run_eg, fig.width=5, fig.height=5, fig.cap=wf_cap}
# Visualise in the first two PCs (base R — no extra dependencies)
cluster_cols <- c("#E64B35", "#4DBBD5", "#00A087", "#3C5488", "#F39B7F")
plot(
  pca_clust[, 1], pca_clust[, 2],
  col = cluster_cols[detected],
  pch = 19, cex = 0.7,
  xlab = "PC 1", ylab = "PC 2",
  main = "Louvain clusters on SNN graph"
)
legend("topright",
       legend = paste("Cluster", sort(unique(detected))),
       col = cluster_cols[sort(unique(detected))],
       pch = 19, bty = "n", cex = 0.85)
```

### Using a real SingleCellExperiment

Swap the simulated matrix for `reducedDim(sce, "PCA")` to apply the same
pipeline to real data:

```{r sce-workflow, eval=FALSE}
library(SingleCellExperiment)

# Assume sce is a SingleCellExperiment with PCA computed
# (e.g. via scater::runPCA)
pca_mat <- reducedDim(sce, "PCA")

# KNN search
nn_sce <- fastFindKNN(pca_mat, k = 15)

# Graph-based clustering
g_sce <- fastMakeSNNGraph(pca_mat, k = 15)
lou_sce <- igraph::cluster_louvain(g_sce)
sce$cluster <- as.factor(igraph::membership(lou_sce))
```

# Comparing with BiocNeighbors

jvecfor is designed as a drop-in replacement for
`r Biocpkg("BiocNeighbors")`. The two calls below are functionally
equivalent and return identically structured output:

```{r biocneighbors-compare, eval=FALSE}
library(BiocNeighbors)

# BiocNeighbors (Annoy backend, the default ANN method)
nn_bn <- BiocNeighbors::findKNN(pca, k = 15, BNPARAM = AnnoyParam())

# jvecfor (HNSW-DiskANN backend) — standalone function
nn_jv <- jvecfor::fastFindKNN(pca, k = 15)

# Same structure
identical(names(nn_bn), names(nn_jv))          # TRUE — both have $index, $distance
identical(dim(nn_bn$index), dim(nn_jv$index))  # TRUE
```

Results will not be identical because both methods are approximate — they
use different ANN algorithms with different recall characteristics.
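To put a number on that difference, compute the per-cell overlap between
the two neighbor sets produced above. This snippet is illustrative and not
part of either package; it reuses `nn_bn` and `nn_jv` from the previous
chunk.

```{r biocneighbors-overlap, eval=FALSE}
# Per-cell fraction of neighbors shared by the two backends (0-1)
overlap <- vapply(seq_len(nrow(nn_bn$index)), function(i) {
  length(intersect(nn_bn$index[i, ], nn_jv$index[i, ])) / ncol(nn_bn$index)
}, numeric(1))
summary(overlap)  # typically high, but rarely exactly 1 for two ANN backends
```

Low median overlap suggests one backend is operating at poor recall;
raising `ef.search` on the jvecfor side (or `search.mult` in Annoy) should
bring the two closer together.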
## Drop-in BiocNeighbors integration

`JvecforParam` is a `BiocNeighborParam` subclass, so you can pass it as
the `BNPARAM` argument to any Bioconductor function that accepts one:

```{r bnparam-dropin, eval=FALSE}
library(BiocNeighbors)

# Use jvecfor through the standard BiocNeighbors interface
nn <- findKNN(pca, k = 15, BNPARAM = JvecforParam())

# Works with scran, scater, and other BNPARAM-aware packages:
# library(scran)
# g <- buildSNNGraph(sce, BNPARAM = JvecforParam())
#
# library(scater)
# sce <- runUMAP(sce, BNPARAM = JvecforParam())

# Customise algorithm parameters via the constructor
nn2 <- findKNN(pca, k = 15, BNPARAM = JvecforParam(
  type = "knn",          # exact VP-tree search
  distance = "Cosine",
  M = 32L
))
```

## Timing comparison

To benchmark on your own data:

```{r timing-compare, eval=FALSE}
# Replace `pca_large` with your n × p matrix (e.g. n = 100K cells, p = 50 PCs)
t_bn <- system.time(
  BiocNeighbors::findKNN(pca_large, k = 15, BNPARAM = AnnoyParam())
)
t_jv <- system.time(
  fastFindKNN(pca_large, k = 15)
)
message(sprintf("BiocNeighbors: %.1f s", t_bn["elapsed"]))
message(sprintf("jvecfor:       %.1f s", t_jv["elapsed"]))
message(sprintf("Speedup:       %.1fx", t_bn["elapsed"] / t_jv["elapsed"]))
```

# Troubleshooting

## Common errors

```{r errors-table, echo=FALSE}
err_tbl <- data.frame(
  "Error message" = c(
    "`Java not found on PATH`",
    "`Java >= 20 is required (found Java X)`",
    paste("`jvecfor JAR not found.", "Run jvecfor_setup()`"),
    "`jvecfor exited with status N`",
    "Very slow first call",
    "Unexpected neighbour count"
  ),
  "Likely cause" = c(
    "`java` not in shell `PATH`",
    "Outdated JDK",
    "`inst/java/` empty",
    "JVM crash or invalid input",
    "JVM cold-start overhead",
    "`k >= nrow(X)`"
  ),
  "Resolution" = c(
    paste("Install OpenJDK 20+ (adoptium.net);",
          "re-launch R after updating PATH."),
    paste("Upgrade to OpenJDK 20+;",
          "verify with `java --version`."),
    "Run `jvecfor_setup()` or reinstall.",
    "Set `verbose=TRUE` to inspect Java stderr.",
    "Expected; subsequent calls are faster.",
    paste("Reduce `k`; need at least", "`k + 1` observations.")
  ),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
knitr::kable(err_tbl)
```

## Debugging with verbose mode

Enable verbose output to see HNSW build progress and confirm threading:

```{r verbose-debug, eval=run_eg}
options(jvecfor.verbose = TRUE)
nn_debug <- fastFindKNN(pca[1:50, ], k = 5)
options(jvecfor.verbose = FALSE)  # reset
```

## Checking the Java runtime

```{r java-check, eval=FALSE}
# Which java is on PATH?
Sys.which("java")

# Version reported by that java binary
system2("java", "--version", stderr = TRUE)

# Where is the bundled JAR?
system.file("java", package = "jvecfor")
```

## Using a safe wrapper

For production code that should degrade gracefully when Java is
unavailable, wrap calls with `tryCatch`:

```{r tryCatch-example, eval=FALSE}
nn <- tryCatch(
  fastFindKNN(pca, k = 15),
  error = function(e) {
    warning("jvecfor unavailable: ", conditionMessage(e),
            "\nFalling back to BiocNeighbors.")
    BiocNeighbors::findKNN(pca, k = 15)
  }
)
```

# Session Information

```{r session-info}
sessionInfo()
```