---
title: Using scrapper to analyze single-cell data
author:
- name: Aaron Lun
  email: infinite.monkeys.with.keyboards@gmail.com
date: "Revised: October 29, 2025"
output: BiocStyle::html_document
package: scrapper
vignette: >
  %\VignetteIndexEntry{Using scrapper to analyze single-cell data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE, results="hide", message=FALSE}
require(knitr)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
library(BiocStyle)
self <- Biocpkg("scrapper")
```

# Overview

`r self` implements bindings to C++ code for analyzing single-cell data, mostly from the [**libscran**](https://github.com/libscran) libraries.
Each function performs an individual analysis step, ranging from quality control to clustering and marker detection.
`r self` is mostly intended for other Bioconductor package developers; users should check out the `r Biocpkg("scrapple")` package instead for a more ergonomic experience.

# Basic analysis

Let's fetch a small single-cell RNA-seq dataset for demonstration purposes:

```{r}
library(scRNAseq)
sce.z <- ZeiselBrainData()
sce.z
```

We run it through the `r self` analysis pipeline.
First, some quality control:

```{r}
counts <- assay(sce.z)
is.mito <- grepl("^mt-", rownames(sce.z))
nthreads <- 2 # using a smaller value to avoid stressing out the build machines.

library(scrapper)
rna.qc.metrics <- computeRnaQcMetrics(counts, subsets=list(mt=is.mito), num.threads=nthreads)
rna.qc.thresholds <- suggestRnaQcThresholds(rna.qc.metrics)
rna.qc.filter <- filterRnaQcMetrics(rna.qc.thresholds, rna.qc.metrics)
filtered <- counts[,rna.qc.filter,drop=FALSE]
```

Then normalization:

```{r}
size.factors <- centerSizeFactors(rna.qc.metrics$sum[rna.qc.filter])
normalized <- normalizeCounts(filtered, size.factors)
```

Feature selection and PCA:

```{r}
gene.var <- modelGeneVariances(normalized, num.threads=nthreads)
top.hvgs <- chooseHighlyVariableGenes(gene.var$statistics$residuals)
pca <- runPca(normalized[top.hvgs,], num.threads=nthreads)
```

Some clustering:

```{r}
snn.graph <- buildSnnGraph(pca$components, num.threads=nthreads)
clust.out <- clusterGraph(snn.graph)
table(clust.out$membership)
```

And visualization:

```{r}
umap.out <- runUmap(pca$components, num.threads=nthreads)
tsne.out <- runTsne(pca$components, num.threads=nthreads)
plot(tsne.out[,1], tsne.out[,2], col=factor(clust.out$membership))
```

And finally, marker detection:

```{r}
markers <- scoreMarkers(normalized, groups=clust.out$membership, num.threads=nthreads)

# Top markers for each cluster, say, based on the median AUC:
top.markers <- lapply(markers$auc, function(x) {
    head(rownames(x)[order(x$median, decreasing=TRUE)], 10)
})
head(top.markers)
```

Readers are referred to the [OSCA book](https://bioconductor.org/books/release/OSCA/) for more details on the theory behind each step.
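As a quick illustrative aside (using only base graphics and the objects computed above), we can color the t-SNE by the normalized expression of the top-ranked marker for the first cluster listed in `top.markers`:

```{r}
best <- top.markers[[1]][1]
expr <- as.numeric(as.matrix(normalized[best, , drop=FALSE]))
plot(tsne.out[,1], tsne.out[,2], pch=16, main=best,
    xlab="t-SNE 1", ylab="t-SNE 2",
    col=grDevices::hcl.colors(100)[as.integer(cut(expr, 100))])
```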
# Blocking on batches

Let's fetch another brain dataset and combine it with the previous one.

```{r}
sce.t <- TasicBrainData()
common <- intersect(rownames(sce.z), rownames(sce.t))
combined <- cbind(assay(sce.t)[common,], assay(sce.z)[common,])
block <- rep(c("tasic", "zeisel"), c(ncol(sce.t), ncol(sce.z)))
```

We set `block=` in each step to account for the batch structure.
This ensures that the various calculations are not affected by inter-block differences.
We also apply MNN correction to remove batch effects in the low-dimensional space prior to further analyses.

```{r}
library(scrapper)
rna.qc.metrics <- computeRnaQcMetrics(combined, subsets=list(), num.threads=nthreads)
rna.qc.thresholds <- suggestRnaQcThresholds(rna.qc.metrics, block=block)
rna.qc.filter <- filterRnaQcMetrics(rna.qc.thresholds, rna.qc.metrics, block=block)
filtered <- combined[,rna.qc.filter,drop=FALSE]
filtered.block <- block[rna.qc.filter]

size.factors <- centerSizeFactors(rna.qc.metrics$sum[rna.qc.filter], block=filtered.block)
normalized <- normalizeCounts(filtered, size.factors)

gene.var <- modelGeneVariances(normalized, num.threads=nthreads, block=filtered.block)
top.hvgs <- chooseHighlyVariableGenes(gene.var$statistics$residuals)
pca <- runPca(normalized[top.hvgs,], num.threads=nthreads, block=filtered.block)

# Now we perform an MNN correction to get rid of the batch effect.
corrected <- correctMnn(pca$components, block=filtered.block, num.threads=nthreads)

umap.out <- runUmap(corrected$corrected, num.threads=nthreads)
tsne.out <- runTsne(corrected$corrected, num.threads=nthreads)
plot(tsne.out[,1], tsne.out[,2], pch=16, col=factor(filtered.block))

snn.graph <- buildSnnGraph(corrected$corrected, num.threads=nthreads)
clust.out <- clusterGraph(snn.graph)

markers <- scoreMarkers(normalized, groups=clust.out$membership, block=filtered.block, num.threads=nthreads)
```

We can also compute pseudo-bulk profiles for each cluster-dataset combination, e.g., for differential expression analyses.

```{r}
aggregates <- aggregateAcrossCells(filtered, list(cluster=clust.out$membership, dataset=filtered.block))
```

# Combining multiple modalities

Let's fetch a single-cell dataset with both RNA-seq and CITE-seq data.
To keep the run-time short, we'll only consider the first 5000 cells.

```{r}
sce.s <- StoeckiusHashingData(mode=c("human", "adt1", "adt2"))
sce.s <- sce.s[,1:5000]
sce.s
```

We extract the counts for both the RNA and the ADTs.

```{r}
rna.counts <- assay(sce.s)
adt.counts <- rbind(assay(altExp(sce.s, "adt1")), assay(altExp(sce.s, "adt2")))
```

We use both matrices in our analysis pipeline.
Each modality is processed separately before being combined for steps like clustering and visualization.

```{r}
# QC in both modalities, only keeping the cells that pass in both.
is.mito <- grepl("^MT-", rownames(rna.counts))
rna.qc.metrics <- computeRnaQcMetrics(rna.counts, subsets=list(MT=is.mito), num.threads=nthreads)
rna.qc.thresholds <- suggestRnaQcThresholds(rna.qc.metrics)
rna.qc.filter <- filterRnaQcMetrics(rna.qc.thresholds, rna.qc.metrics)

is.igg <- grepl("^IgG", rownames(adt.counts))
adt.qc.metrics <- computeAdtQcMetrics(adt.counts, subsets=list(IgG=is.igg), num.threads=nthreads)
adt.qc.thresholds <- suggestAdtQcThresholds(adt.qc.metrics)
adt.qc.filter <- filterAdtQcMetrics(adt.qc.thresholds, adt.qc.metrics)

keep <- rna.qc.filter & adt.qc.filter
rna.filtered <- rna.counts[,keep,drop=FALSE]
adt.filtered <- adt.counts[,keep,drop=FALSE]

rna.size.factors <- centerSizeFactors(rna.qc.metrics$sum[keep])
rna.normalized <- normalizeCounts(rna.filtered, rna.size.factors)

adt.size.factors <- computeClrm1Factors(adt.filtered, num.threads=nthreads)
adt.size.factors <- centerSizeFactors(adt.size.factors)
adt.normalized <- normalizeCounts(adt.filtered, adt.size.factors)

gene.var <- modelGeneVariances(rna.normalized, num.threads=nthreads)
top.hvgs <- chooseHighlyVariableGenes(gene.var$statistics$residuals)
rna.pca <- runPca(rna.normalized[top.hvgs,], num.threads=nthreads)

# Combining the RNA-derived PCs with the ADT expression. Here, there are very few
# ADT tags, so there's no need for further dimensionality reduction.
combined <- scaleByNeighbors(list(rna.pca$components, as.matrix(adt.normalized)), num.threads=nthreads)

snn.graph <- buildSnnGraph(combined$combined, num.threads=nthreads)
clust.out <- clusterGraph(snn.graph)
table(clust.out$membership)

umap.out <- runUmap(combined$combined, num.threads=nthreads)
tsne.out <- runTsne(combined$combined, num.threads=nthreads)
plot(umap.out[,1], umap.out[,2], pch=16, col=factor(clust.out$membership))

rna.markers <- scoreMarkers(rna.normalized, groups=clust.out$membership, num.threads=nthreads)
adt.markers <- scoreMarkers(adt.normalized, groups=clust.out$membership, num.threads=nthreads)
```

# Other useful functions

The `runAllNeighborSteps()` function runs `runUmap()`, `runTsne()`, `buildSnnGraph()` and `clusterGraph()` in a single call.
This performs the UMAP/t-SNE iterations and the graph-based clustering in parallel to maximize the use of multiple threads.

The `scoreGeneSet()` function computes a gene set score from the input expression matrix.
This can be used to summarize the activity of a pathway into a single per-cell score for visualization.

The `subsampleByNeighbors()` function deterministically selects a representative subset of cells based on their local neighborhoods.
This can be used to reduce the compute time of the various steps downstream of the PCA.

For CRISPR data, quality control can be performed with `computeCrisprQcMetrics()`, `suggestCrisprQcThresholds()` and `filterCrisprQcMetrics()`.
To normalize, we use size factors defined by centering the total guide count for each cell.
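To give a flavor of how these helpers slot into the pipeline, the unevaluated sketch below shows one plausible way to call them on the objects from the basic analysis above.
Only the function names come from `r self`; the argument names (beyond the input matrices and `num.threads=`), the shape of the return values, and the `crispr.counts` matrix are assumptions for illustration, so consult the relevant help pages (e.g., `?runAllNeighborSteps`) for the actual interfaces.

```{r, eval=FALSE}
# Unevaluated sketch: argument names beyond the inputs and num.threads=, and the
# structure of the return values, are assumptions -- see the help pages.

# One call for the UMAP, t-SNE, SNN graph construction and graph clustering.
neighbor.res <- runAllNeighborSteps(pca$components, num.threads=nthreads)

# Per-cell score for a gene set, using the top HVGs as a stand-in set.
set.score <- scoreGeneSet(normalized, set=top.hvgs, num.threads=nthreads)

# Representative subset of cells (assumed to be returned as column indices).
representatives <- subsampleByNeighbors(pca$components, num.threads=nthreads)
subset.pcs <- pca$components[,representatives,drop=FALSE]

# CRISPR quality control, for a hypothetical matrix of guide counts.
crispr.qc.metrics <- computeCrisprQcMetrics(crispr.counts, num.threads=nthreads)
crispr.qc.thresholds <- suggestCrisprQcThresholds(crispr.qc.metrics)
crispr.qc.filter <- filterCrisprQcMetrics(crispr.qc.thresholds, crispr.qc.metrics)
```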
# Session information {-}

```{r}
sessionInfo()
```