---
title: "anglemania Tutorial"
author:
  - name: Aaron Kollotzek
    affiliation:
    - &MDC Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany
    - &BIMSB Berlin Institute for Medical Systems Biology, Berlin, Germany
    email: aaron.kollotzek@mdc-berlin.de
  - name: Vedran Franke
    affiliation:
    - &MDC Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany
    - &BIMSB Berlin Institute for Medical Systems Biology, Berlin, Germany
  - name: Artem Baranovskii
    affiliation:
    - Helmholtz Munich
  - name: Altuna Akalin
    affiliation:
    - &MDC Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany
    - &BIMSB Berlin Institute for Medical Systems Biology, Berlin, Germany
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    df_print: "paged"
    toc: true
    toc_depth: 2
    toc_float: true
    theme: simplex
vignette: >
  %\VignetteIndexEntry{anglemania tutorial}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Introduction

anglemania is a feature selection package that extracts genes from multi-batch scRNA-seq experiments for downstream dataset integration. The goal is to select genes that carry high biological information and low technical noise between the batches. These genes are extracted from gene pairs whose expression vectors form an angle that is both invariant across batches and extremely narrow or wide.

Conventionally, highly variable genes (HVGs) or sometimes all genes are used for integration tasks (https://www.nature.com/articles/s41592-021-01336-8). While HVGs are a simple and effective way to reduce the noise and dimensionality of the data, we hypothesize that there are better ways to select genes specifically for integration tasks. HVGs are sensitive to batch effects because the observed variance is a function of both technical and biological variance.
anglemania improves on the conventional use of HVGs for integration tasks, especially when the transcriptional differences between cell types or cell states are subtle (showcased here with simulated data generated using `r Biocpkg("splatter")` with `de.facLoc` and `de.facScale` set to 0.1, which results in mild differences between "Groups"). The package can be used on top of `r Biocpkg("SingleCellExperiment")` or `r CRANpkg("Seurat")` objects. Under the hood, anglemania works with file-backed big matrices (FBMs) from the `r CRANpkg("bigstatsr")` package for fast and memory-efficient computation.

```{r, load libraries}
suppressPackageStartupMessages({
  library(anglemania)
  library(dplyr)
  library(Seurat)
  options(Seurat.object.assay.version = "v5")
  library(splatter)
  library(SingleCellExperiment)
  library(scater)
  library(scran)
  library(bluster)
  library(batchelor)
  library(UpSetR)
})
```

# Simulation

We simulate a scRNA-seq dataset using Splatter with 4 batches and 3 cell types, with subtle differences between cell types and fairly strong batch effects.

```{r, create simulated data}
batch.facLoc <- 0.3
de.facLoc <- 0.1
nBatches <- 4
nGroups <- 3
nGenes <- 5000
groupCells <- 300

sce_raw <- splatSimulate(
  batchCells = rep(groupCells * nGroups, nBatches),
  batch.facLoc = batch.facLoc,
  group.prob = rep(1 / nGroups, nGroups),
  nGenes = nGenes,
  batch.facScale = 0.1,
  method = "groups",
  verbose = FALSE,
  out.prob = 0.001,
  de.prob = 0.1, # mild
  de.facLoc = de.facLoc,
  de.facScale = 0.1,
  bcv.common = 0.1,
  seed = 42
)
sce <- sce_raw
assays(sce)
```

# Unintegrated data

Here we perform a standard workflow on the unintegrated data. When we cluster the unintegrated data and visualize it in a UMAP, we can see that the clusters are driven by batch effects rather than cell types.

```{r, fig.cap = "UMAPs of unintegrated data, colored by Batch and Group. The clusters are driven by batch effects.", fig.wide = TRUE, fig.width = 10}
sce_unintegrated <- sce
# Normalization.
sce_unintegrated <- logNormCounts(sce_unintegrated)

# Feature selection.
dec <- modelGeneVar(sce_unintegrated)
hvg <- getTopHVGs(dec, prop = 0.1)

# PCA.
set.seed(1234)
sce_unintegrated <- scater::runPCA(
  sce_unintegrated,
  ncomponents = 50,
  subset_row = hvg
)

# Clustering.
colLabels(sce_unintegrated) <- clusterCells(sce_unintegrated,
  use.dimred = "PCA",
  BLUSPARAM = NNGraphParam(cluster.fun = "louvain")
)

# Visualization.
sce_unintegrated <- scater::runUMAP(sce_unintegrated, dimred = "PCA")

patchwork::wrap_plots(
  plotUMAP(sce_unintegrated, colour_by = "Batch") +
    ggtitle("Unintegrated data, colored by Batch"),
  plotUMAP(sce_unintegrated, colour_by = "Group") +
    ggtitle("Unintegrated data, colored by Group")
)
```

# anglemania

anglemania works on a `r Biocpkg("SingleCellExperiment")` object. The function has a few important arguments:

- `batch_key`: the column in the metadata of the SingleCellExperiment object that indicates which batch the cells belong to. This is required to distinguish between batches, because we compute the angle between gene pairs for each batch.
- `method`: either cosine, spearman or diem. This is the method by which the relationship of the gene pairs is measured. The default is cosine, which is the cosine similarity between the expression vectors of the gene pairs.
- `zscore_mean_threshold`: We compute the mean z-score of the relationship between a gene pair across batches, and then set a minimal cutoff for the (absolute) mean z-score. A cutoff of 2 means that the filtered gene pairs have a relationship, e.g. cosine similarity, that is 2 standard deviations away from the mean of all cosine similarities from this dataset. A higher value means a more stringent cutoff.
- `zscore_sn_threshold`: The SNR, or signal-to-noise ratio, measures the invariance of the relationship between the gene pair. A high SN ratio means that the relationship is constant over multiple batches.
- `max_n_genes`: you can specify a maximum number of extracted genes.
  They are sorted by decreasing mean z-score after passing the thresholds.

```{r, run anglemania, message = FALSE}
head(colData(sce))
batch_key <- "Batch"
sce <- anglemania(
  sce,
  batch_key = batch_key,
  zscore_mean_threshold = 2,
  zscore_sn_threshold = 2
)
```

## Extract the anglemania genes from the SCE object

```{r}
anglemania_genes <- get_anglemania_genes(sce)
head(anglemania_genes)
length(anglemania_genes)
```

## select_genes

Once anglemania has been run on the SCE, you can adjust the initial z-score mean and SN ratio thresholds with the `select_genes()` function.

```{r, select genes, message = FALSE}
# If you think the number of selected genes is
# too high or low you can adjust the thresholds:
sce <- select_genes(sce,
  zscore_mean_threshold = 2.5,
  zscore_sn_threshold = 2.5
)
# Inspect the anglemania genes
anglemania_genes <- get_anglemania_genes(sce)
head(anglemania_genes)
length(anglemania_genes)
# 306 genes are selected with these thresholds
```

# MNN integration

The anglemania genes can now be used for downstream integration algorithms such as MNN. We compare the integration results obtained with the anglemania genes against those obtained with the top 300 and the standard top 2000 HVGs.
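The three integrations in this section repeat the same steps with a different gene set each time. As a sketch, that pattern could be wrapped in a small helper. The function name `integrate_with_genes` and its `label` argument are hypothetical and not part of anglemania or batchelor; the calls inside simply mirror the ones used in this vignette.

```{r, eval = FALSE}
library(SingleCellExperiment)

# Hypothetical helper (an assumption for illustration): run fastMNN on a
# given gene set, store the corrected embedding in `sce` and add a UMAP
# computed from that embedding.
integrate_with_genes <- function(sce, genes, batch_key, label,
                                 k = 20, d = 50) {
  # Log-normalize a copy so the input object keeps its original assays.
  sce_norm <- scater::logNormCounts(sce)
  corrected <- batchelor::fastMNN(
    sce_norm,
    subset.row = genes,
    k = k, # number of nearest neighbours used to identify MNNs
    batch = factor(colData(sce_norm)[[batch_key]]),
    d = d
  )
  dimred_name <- paste0("MNN_", label)
  reducedDim(sce, dimred_name) <- reducedDim(corrected, "corrected")
  scater::runUMAP(
    sce,
    dimred = dimred_name,
    name = paste0("umap_", dimred_name)
  )
}
```

With such a helper, a call like `integrate_with_genes(sce, hvg_300, batch_key, "hvg_300")` would produce the `MNN_hvg_300` and `umap_MNN_hvg_300` embeddings that the plotting code in this section expects.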
## HVGs

### 300 HVGs

```{r, MNN_300 HVGs}
hvg_300 <- sce %>%
  scater::logNormCounts() %>%
  modelGeneVar(block = colData(sce)[[batch_key]]) %>%
  getTopHVGs(n = 300)

barcodes_by_batch <- split(rownames(colData(sce)), colData(sce)[[batch_key]])
sce_list <- lapply(barcodes_by_batch, function(x) sce[, x])

sce_mnn <- sce %>%
  scater::logNormCounts()
sce_mnn <- batchelor::fastMNN(
  sce_mnn,
  subset.row = hvg_300,
  k = 20,
  batch = factor(colData(sce_mnn)[[batch_key]]),
  d = 50
)
reducedDim(sce, "MNN_hvg_300") <- reducedDim(sce_mnn, "corrected")
sce <- scater::runUMAP(sce, dimred = "MNN_hvg_300", name = "umap_MNN_hvg_300")
# k is the number of nearest neighbours to consider when identifying MNNs
```

### 2000 HVGs

```{r, MNN_2000 HVGs}
hvg_2000 <- sce %>%
  scater::logNormCounts() %>%
  modelGeneVar(block = colData(sce)[[batch_key]]) %>%
  getTopHVGs(n = 2000)

barcodes_by_batch <- split(rownames(colData(sce)), colData(sce)[[batch_key]])
sce_list <- lapply(barcodes_by_batch, function(x) sce[, x])

sce_mnn <- sce %>%
  scater::logNormCounts()
sce_mnn <- batchelor::fastMNN(
  sce_mnn,
  subset.row = hvg_2000,
  k = 20,
  batch = factor(colData(sce_mnn)[[batch_key]]),
  d = 50
)
reducedDim(sce, "MNN_hvg_2000") <- reducedDim(sce_mnn, "corrected")
sce <- scater::runUMAP(sce, dimred = "MNN_hvg_2000", name = "umap_MNN_hvg_2000")
```

## anglemania genes

```{r, MNN anglemania, message = FALSE}
sce_mnn <- sce %>%
  scater::logNormCounts()
sce_mnn <- batchelor::fastMNN(
  sce_mnn,
  subset.row = anglemania_genes,
  k = 20,
  batch = factor(colData(sce_mnn)[[batch_key]]),
  d = 50
)
reducedDim(sce, "MNN_anglemania") <- reducedDim(sce_mnn, "corrected")
sce <- scater::runUMAP(
  sce,
  dimred = "MNN_anglemania",
  name = "umap_MNN_anglemania"
)
```

## Plot

### UMAP embeddings

We can see from the UMAPs that anglemania genes yield the best integration in terms of clustering by cell type and mixing the batches.
The goal of integration and subsequent clustering should be low intra-cluster variance and high inter-cluster variance. This is at least true for most downstream scRNA-seq analyses, where the goal is, for example, to differentiate between cell types or cell states and annotate them.

```{r, fig.cap = "UMAPs of MNN integrated data. Comparison of UMAP embeddings of integrated data using anglemania genes, top 300 HVGs and top 2000 HVGs.", fig.width = 12, fig.height = 8, fig.wide = TRUE}
# Use wrap_plots
patchwork::wrap_plots(
  plotReducedDim(sce, colour_by = "Batch", dimred = "umap_MNN_anglemania") +
    ggtitle("MNN integration using anglemania genes, colored by Batch"),
  plotReducedDim(sce, colour_by = "Group", dimred = "umap_MNN_anglemania") +
    ggtitle("MNN integration using anglemania genes, colored by Group"),
  plotReducedDim(sce, colour_by = "Batch", dimred = "umap_MNN_hvg_300") +
    ggtitle("MNN integration using top 300 HVGs, colored by Batch"),
  plotReducedDim(sce, colour_by = "Group", dimred = "umap_MNN_hvg_300") +
    ggtitle("MNN integration using top 300 HVGs, colored by Group"),
  plotReducedDim(sce, colour_by = "Batch", dimred = "umap_MNN_hvg_2000") +
    ggtitle("MNN integration using top 2000 HVGs, colored by Batch"),
  plotReducedDim(sce, colour_by = "Group", dimred = "umap_MNN_hvg_2000") +
    ggtitle("MNN integration using top 2000 HVGs, colored by Group"),
  ncol = 2
)
```

### Overlap

```{r, fig.cap = "Overlap of selected genes. Additionally, we check the overlap of the anglemania genes with the HVGs. About 33 of the 306 anglemania genes are also found in the top 300 HVGs, and about 179 of the 306 anglemania genes are also found in the top 2000 HVGs.", fig.width = 10, fig.height = 8, fig.wide = TRUE, message = FALSE, warning = FALSE}
upsetr_df <- fromList(
  list(
    anglemania = anglemania_genes,
    hvg_300 = hvg_300,
    hvg_2000 = hvg_2000
  )
)
upset(upsetr_df, text.scale = 2)
```

# Seurat

The anglemania genes can also be used with other integration algorithms.
When using Seurat, the easiest approach is to create an SCE from the counts and metadata of the SeuratObject, then run anglemania on it and save those genes as the VariableFeatures of the SeuratObject.

```{r, message = FALSE, warning = FALSE}
se <- CreateSeuratObject(
  counts = counts(sce_raw),
  meta.data = as.data.frame(colData(sce_raw))
)
se
anglemania_genes <- se |>
  as.SingleCellExperiment(assay = "RNA") |>
  anglemania(
    batch_key = "Batch",
    zscore_mean_threshold = 2,
    zscore_sn_threshold = 2
  ) |>
  get_anglemania_genes()
```

## Integration

In Seurat v5 you split the layers of an assay by batch and then run the normal Seurat workflow. To use anglemania genes for integration, you need to assign them to the VariableFeatures slot of the SeuratObject. After that, you integrate the layers, passing the anglemania genes as the `features` argument.

```{r, Seurat Integration v5, message = FALSE, warning = FALSE}
# Split by batch
se[["RNA"]] <- split(se[["RNA"]], f = se$Batch)

# Standard preprocessing but use anglemania genes as "VariableFeatures"
se <- NormalizeData(se, verbose = FALSE)
VariableFeatures(se) <- anglemania_genes
se <- se |>
  ScaleData(verbose = FALSE) |>
  RunPCA(verbose = FALSE)

# Integrate
se <- IntegrateLayers(
  object = se,
  method = CCAIntegration,
  orig.reduction = "pca",
  new.reduction = "integrated.cca",
  features = anglemania_genes,
  verbose = FALSE
)
se <- RunUMAP(se, dims = 1:30, reduction = "integrated.cca", verbose = FALSE)
se
```

## Plot

```{r, fig.cap = "UMAPs of Seurat integrated data. Here we show that we can use the anglemania genes for integration of a SeuratObject.", fig.wide = TRUE, fig.width = 10}
patchwork::wrap_plots(
  DimPlot(se, reduction = "umap", group.by = "Batch") +
    ggtitle("Seurat integration using\nanglemania genes\ncolored by Batch"),
  DimPlot(se, reduction = "umap", group.by = "Group") +
    ggtitle("Seurat integration using\nanglemania genes\ncolored by Group"),
  ncol = 2
)
```

# Showcase underlying functions

## Normal anglemania workflow

```{r, message = FALSE}
sce_raw <- sce_example()
sce <- sce_raw
batch_key <- "batch"
sce <- anglemania(sce, batch_key = batch_key, verbose = FALSE)
```

`anglemania` is run on the SCE object and essentially calls three functions:

- `factorise`:
  - creates a permutation of the input matrix whose correlation matrix is used to create a null distribution for each batch.
  - computes the matrix of cosine similarities (or Spearman coefficients) between pairs of gene expression vectors, for both the original and the permuted matrix.
  - computes the z-score of the relationship between the gene pairs, using the mean and standard deviation of the null distribution.
  - it does this for every batch in the dataset!
- `get_list_stats`:
  - computes the mean and standard deviation of the z-scores across the matrices from the different batches. This creates two important matrices: the mean z-score matrix `mean_zscore` and the signal-to-noise ratio matrix `sn_zscore`. These are stored in the metadata of the SCE object.
- `select_genes`:
  - filters the gene pairs by the `mean_zscore` and `sn_zscore` matrices (SN ratio, i.e. the mean divided by the standard deviation).
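
The three steps above can be illustrated with a toy example in base R. This is a minimal sketch under simplifying assumptions, not the package's implementation: it uses small dense matrices instead of FBMs, centers each gene vector before computing the cosine similarity, takes an unweighted mean across batches, and uses the absolute mean for the SN ratio. The helper names `cosine_sim`, `toy_factorise` and `simulate_batch` are hypothetical.

```{r, eval = FALSE}
set.seed(42)

# Cosine similarity between all pairs of rows (genes). Rows are centered
# first, which is a simplification made for this toy example.
cosine_sim <- function(x) {
  x <- x - rowMeans(x)
  x <- x / sqrt(rowSums(x^2))
  tcrossprod(x) # genes x genes
}

# Step 1 (cf. factorise): z-score the observed similarities against a
# null distribution obtained by shuffling each gene across cells.
toy_factorise <- function(counts) {
  observed <- cosine_sim(counts)
  permuted <- t(apply(counts, 1, sample))
  null_sim <- cosine_sim(permuted)
  null_values <- null_sim[upper.tri(null_sim)]
  z <- (observed - mean(null_values)) / sd(null_values)
  diag(z) <- NA # self-pairs carry no information
  z
}

# Toy batch of 20 genes x 100 cells in which gene1 and gene2 share a signal.
simulate_batch <- function(n_genes = 20, n_cells = 100) {
  counts <- matrix(
    rpois(n_genes * n_cells, lambda = 5),
    nrow = n_genes,
    dimnames = list(paste0("gene", seq_len(n_genes)), NULL)
  )
  shared <- rpois(n_cells, lambda = 20)
  counts[1, ] <- counts[1, ] + shared
  counts[2, ] <- counts[2, ] + shared
  counts
}
batches <- replicate(3, simulate_batch(), simplify = FALSE)
z_list <- lapply(batches, toy_factorise)

# Step 2 (cf. get_list_stats): mean, standard deviation and SN ratio of
# the z-scores across batches, computed per gene pair.
n_batches <- length(z_list)
mean_z <- Reduce(`+`, z_list) / n_batches
sq_dev <- lapply(z_list, function(z) (z - mean_z)^2)
sd_z <- sqrt(Reduce(`+`, sq_dev) / (n_batches - 1))
sn_z <- abs(mean_z) / sd_z

# Step 3 (cf. select_genes): keep the genes of all pairs passing both cutoffs.
passing <- which(abs(mean_z) >= 2.5 & sn_z >= 2.5, arr.ind = TRUE)
gene_names <- rownames(batches[[1]])
toy_genes <- unique(gene_names[as.vector(passing)])
toy_genes
```

Because gene1 and gene2 are strongly and consistently correlated in every toy batch, their pair has both a large mean z-score and a large SN ratio, so both genes are expected to appear in `toy_genes`. Pairs of independent genes have z-scores that scatter around zero and should rarely pass either cutoff.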
## factorise

```{r, message = FALSE}
barcodes_by_batch <- split(rownames(colData(sce)), colData(sce)[[batch_key]])
counts_by_batch <- lapply(barcodes_by_batch, function(x) {
  counts(sce[, x]) %>%
    sparse_to_fbm()
})
counts_by_batch[[1]][1:10, 1:6]
# we are working on FBMs (file-backed matrices
# implemented in the bigstatsr package)
class(counts_by_batch[[1]])
# factorise produces the correlation matrices transformed to z-scores
factorised <- lapply(counts_by_batch, factorise)
factorised[[1]][1:10, 1:6]
```

## get_list_stats

The "list stats" are computed by `get_list_stats`, which takes the z-score-transformed correlation matrices from `factorise` as input. The outputs are the mean z-score matrix `mean_zscore` and the signal-to-noise ratio matrix `sn_zscore`. These are stored in the metadata of the SCE object.

```{r, message = FALSE}
matrix_list <- metadata(sce)$anglemania$matrix_list
weights <- setNames(
  metadata(sce)$anglemania$weights$weight,
  metadata(sce)$anglemania$weights$batch
)
list_stats <- get_list_stats(
  matrix_list = matrix_list,
  weights = weights,
  verbose = FALSE
)
names(list_stats)
class(list_stats)
list_stats$mean_zscore[1:10, 1:6]
list_stats$sn_zscore[1:10, 1:6]
# Or we can access them directly from the SCE object
# after running anglemania
metadata(sce)$anglemania$list_stats$mean_zscore[1:10, 1:6]
metadata(sce)$anglemania$list_stats$sn_zscore[1:10, 1:6]
```

## select_genes

- Under the hood, `anglemania` calls `select_genes` with the default thresholds `zscore_mean_threshold = 2.5` and `zscore_sn_threshold = 2.5`.
- We can use `select_genes` to change the thresholds without having to run anglemania again.

```{r, message = FALSE}
previous_genes <- get_anglemania_genes(sce)
sce <- select_genes(
  sce,
  zscore_mean_threshold = 2,
  zscore_sn_threshold = 2,
  verbose = FALSE
)
# Inspect the anglemania genes
new_genes <- get_anglemania_genes(sce)
length(previous_genes)
length(new_genes)
```

# sessionInfo

```{r}
sessionInfo()
```