---
title: Using scran to analyze single-cell RNA-seq data
author:
- name: Aaron Lun
  email: infinite.monkeys.with.keyboards@gmail.com
date: "Revised: April 7, 2026"
output:
  BiocStyle::html_document:
    toc_float: true
package: scran
vignette: >
  %\VignetteIndexEntry{Using scran to analyze scRNA-seq data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE, results="hide", message=FALSE}
require(knitr)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
```

```{r setup, echo=FALSE, message=FALSE}
library(scran)
library(BiocParallel)
set.seed(100)
self <- BiocStyle::Biocpkg("scran")
```

# Introduction

Woah, hi there. Didn't expect a visitor to this vignette. Most of the functionality in `r self` is now deprecated or defunct, having since been superseded by the `r Biocpkg("scrapper")` package. We just keep this package around so that users trying to call the old functions are redirected to their replacements. And also, `r self` got me a well-paid gig that funds my retirement, so I feel an obligation to repay the favor to a loyal old draft horse.

```r
# Show your appreciation for this package by installing it
# to bump up its download statistics.
install.packages("BiocManager")
BiocManager::install("scran")
```

Anyway, this vignette will go through some of the remaining non-deprecated functions in this package. In general, these functions are no longer part of routine single-cell analyses, but we keep them around for historical purposes. We'll demonstrate on a small single-cell pancreas dataset:

```{r}
library(scran)
library(scRNAseq)
sce <- GrunPancreasData()
sce

# Quickly analyzing it to generate bits and pieces that we can
# reuse in the rest of this vignette.
library(scrapper)
results <- analyze.se(sce, num.threads=2)
sce.ready <- results$x
```

# Quick clustering

When using the `computeSumFactors()` function from `r Biocpkg("scuttle")`, it can be beneficial to initially compute normalization factors within clusters.
This reduces our sensitivity to violations of the assumption of a non-DE majority of genes. The `quickCluster()` function generates a quick-and-dirty clustering that can be passed to `clusters=`:

```{r}
library(scran)
clusters <- quickCluster(sce.ready)
sce.normed <- computeSumFactors(sce.ready, clusters=clusters)
summary(sizeFactors(sce.normed))
```

If necessary, further subclustering can be performed conveniently with the `quickSubCluster()` wrapper function. This splits the input `SingleCellExperiment` into several smaller objects containing the cells from each cluster, and performs another round of clustering within each cluster using a freshly identified set of HVGs to improve resolution of internal structure.

```{r}
subout <- quickSubCluster(sce.ready, groups=clusters)
table(metadata(subout)$subcluster) # formatted as 'cluster.subcluster'
```

# Automated PC choice

A common question is how many PCs to retain: more PCs will capture more biological signal, at the cost of retaining more noise and requiring more computational work. One approach to choosing the number of PCs is to use the estimates of the technical component to determine the proportion of variance that should be retained. This is implemented in `denoisePCANumber()`:

```{r}
var.explained <- metadata(sce.ready)$PCA$variance.explained
stats <- rowData(sce.ready)
total.tech.var.in.hvgs <- sum(stats[stats$hvg,"fitted"])
total.var.in.hvgs <- sum(stats[stats$hvg,"variances"])
chosen <- denoisePCANumber(var.explained, total.tech.var.in.hvgs, total.var.in.hvgs)
chosen
```

We could then use only the specified number of top PCs in our downstream analyses:

```{r}
reducedDim(sce.ready, "PCA.denoised") <- reducedDim(sce.ready, "PCA")[,1:chosen]
```

# Detecting correlated genes

Another useful procedure is to identify significant pairwise correlations between pairs of HVGs. The idea is to distinguish between HVGs caused by random stochasticity and those that are driving systematic heterogeneity, e.g., between subpopulations.
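As a rough point of reference, this boils down to the same kind of test available in base R via `cor.test()`, just applied across many gene pairs at once. A minimal sketch on simulated data (the variable names and simulation parameters here are purely illustrative, not part of the package):

```r
# A minimal sketch of a standard Spearman correlation test in base R,
# using simulated counts for two hypothetical genes that are
# correlated by construction.
set.seed(42)
gene1 <- rpois(100, lambda=5)
gene2 <- gene1 + rpois(100, lambda=2)
res <- cor.test(gene1, gene2, method="spearman", exact=FALSE)
res$estimate # Spearman's rho
res$p.value  # p-value under the null hypothesis of zero correlation
```

A positive rho with a small p-value indicates that the two genes covary across cells, which is the signal being aggregated over all pairs in the functions below.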
Correlations are computed by the `correlatePairs()` method using a slightly modified version of Spearman's rho, tested against the null hypothesis of zero correlation with the same method as `cor.test()`.

```{r}
# Using the first 200 HVGs, which are the most interesting anyway.
of.interest <- order(rowData(sce.ready)$residuals, decreasing=TRUE)[1:200]
cor.pairs <- correlatePairs(sce.ready, subset.row=of.interest)
cor.pairs
```

As with variance estimation, any uninteresting substructure should be blocked on with the `block=` argument. This avoids strong correlations that are driven solely by the blocking factor.

```{r}
cor.pairs2 <- correlatePairs(sce.ready, subset.row=of.interest, block=sce.ready$donor)
```

The pairs can be used to choose marker genes for experimental validation, or to construct gene-gene association networks. In other situations, the pairs themselves may not be of direct interest; rather, we just want to know whether a gene is correlated with any other gene. This is often the case when selecting a set of correlated HVGs for use in downstream steps like clustering or dimensionality reduction. To do so, we use `correlateGenes()` to compute a single set of statistics for each gene, rather than for each pair.

```{r}
cor.genes <- correlateGenes(cor.pairs)
cor.genes
```

Significant correlations are defined at a false discovery rate (FDR) threshold of, e.g., 5%. Note that the p-values are calculated by permutation and will have a lower bound. If there were insufficient permutation iterations, a warning will be issued suggesting that more iterations be performed.

# Session information {-}

```{r}
sessionInfo()
```