--- title: "The CSOA algorithm" author: "Andrei-Florian Stoica" package: CSOA date: August 13, 2025 output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{The CSOA algorithm} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include=FALSE} knitr::opts_chunk$set( collapse=TRUE, comment="#>" ) ``` # Introduction In this tutorial, we will walk through the main steps of the CSOA algorithm. For clarity and simplicity, we will illustrate the algorithm by scoring a single gene signature set. # Prerequisites In addition to CSOA, you need to install [scRNAseq](https://bioconductor.org/packages/release/data/experiment/html/scRNAseq.html) and [scuttle](https://www.bioconductor.org/packages/release/bioc/html/scuttle.html) for this tutorial. # Constructing cell sets First, we repeat the setup from the **Getting started with CSOA** tutorial. We need the `BaronPancreasData` dataset from `scRNAseq` and the list of acinar markers from [PanglaoDB](https://www.panglaodb.se/markers.html). ```{r message=FALSE, warning=FALSE, results=FALSE} library(CSOA) library(scRNAseq) library(scuttle) library(Seurat) sceObj <- BaronPancreasData('human') sceObj <- logNormCounts(sceObj) seuratObj <- as.Seurat(sceObj) acinarMarkers <- c('PRSS1', 'KLK1', 'CTRC', 'PNLIP', 'AKR1C3', 'CTRB1', 'DUOXA2', 'ALDOB', 'REG3A', 'SERPINA3', 'PRSS3', 'REG1B', 'CFB', 'GDF15', 'MUC1','ANPEP', 'ANGPTL4', 'OLFM4', 'GSTA1', 'LGALS2', 'PDZK1IP1', 'RARRES2', 'CXCL17', 'UBD', 'GSTA2', 'LYZ', 'RBPJL', 'PTF1A', 'CELA3A', 'SPINK1', 'ZG16', 'CEL', 'CELA2A', 'CPB1', 'CELA1', 'PNLIPRP1', 'RNASE1', 'AMY2B', 'CPA2','CPA1', 'CELA3B', 'CTRB2', 'PLA2G1B', 'PRSS2', 'CLPS', 'REG1A', 'SYCN') ``` We now extract the expression data for the input genes. The results will be stored as a non-sparse matrix: ```{r} geneSetExp <- expMat(seuratObj, acinarMarkers) class(geneSetExp) dim(geneSetExp) ``` Next, a cell set representing the highest-expression cells is constructed for each input gene. The cutoff for cell selection is implemented using the `percentile` argument, set at `90` by default. **! Note:** The percentile is taken over all the cells expressing the gene, not over all the cells in the dataset. Therefore, the cell sets will vary in size. ```{r} cellSets <- percentileSets(geneSetExp, percentile=90) vapply(cellSets, length, integer(1)) ``` We can now visualize the distribution of the cell sets among all the cells: ```{r, out.height='90%', out.width='90%', fig.height=6, fig.width=5} mat <- cellDistribution(cellSets, allCells = colnames(seuratObj)) basicHeatmap(mat, c('Gene', 'Cell', 'Presence'), title='Distribution of cell sets among all cells', palType='fillDis') ``` The cell sets strongly cluster in a small region of the heatmap, which indicates their significant overlap, expected since they correspond to markers of the same cell type identity. # Assessing overlaps of pairs of cell sets The next step is calculating the statistical significance of the overlaps of pairs of cell sets by using hypergeometric tests using the `cellSetsOverlaps` function. ```{r} cso <- cellSetsOverlaps(cellSets, nCells=dim(seuratObj)[2]) ``` **! Note:** The `nCells` argument must be set to the total number of cells in the Seurat object, not to the number of cells in the cell sets. We can inspect the output of this function: ```{r} head(cso) ``` The number of rows (`1081`) equals the number of all possible unordered pairs that can be generated from a list of `47` genes (`47 * 46 / 2 = 1081`). To perform both steps—creating cell sets and assessing overlaps of pairs of cell sets—in a single call, the `generateOverlaps` function can be used: ```{r} cso2 <- generateOverlaps(geneSetExp, percentile = 90) identical(cso, cso2) ``` # Identifying and scoring top overlaps Before scoring the cells, the top overlaps need to be identified and scored. This step is performed in a single call using `processOverlaps`: ```{r} po <- processOverlaps(cso) head(po) ``` `processOverlaps` sequentially performs these steps: 1. p-values are corrected for multiple testing and insignificant overlaps are removed. 2. Two preliminary overlap ranks are defined based on adjusted p-value and recorded-over-expected size ratio. 3. These ranks are replaced by two **connectivity-based ranks**, consisting of the minimum p-value/ratio rank of the vertices neighboring the edge corresponding to each overlap. The order imposed by the by average of the two ranks is the final rank 4. The overlap rank cutoff is defined as the highest-frequency rank, if unique, and as the average of the minimum and maximum of the highest-frequency ranks otherwise: ```{r, out.height='100%', out.width='100%', fig.height=5, fig.width=5} overlapCutoffPlot(po) ``` 4. Overlaps are filtered based on the rank cutoff. 5. The retained overlaps are scored. If the default `log` overlap scoring method is selected, scores are assigned starting at `1` and decreasing logarithmically towards `0` for each unique rank (though `0` is not reached but interpreted as the score of the first non-top overlap). If the `minmax` method is selected instead, overlaps are scored through inverted minmax normalization. # Scoring the cells `scoreCells` scores the cells using an overlap data frame as input. It requires the output of `cellSetsOverlaps` (or of its wrapper `generateOverlaps`), and first calls `processOverlaps`. Subsequently, it assigns a score to each cell by summing the products of these three quantities: 1. The overlap score of the gene pair; 2. The in-cell minmax-normalized expresssion of the first gene; 3. The in-cell minmax-normalized expresssion of the second gene. ```{r} scoreDF <- scoreCells(geneSetExp, cso, setPairs=list(overlapPairs(cso)), geneSetNames='CSOA_acinar') head(scoreDF) ``` The `attachCellScores` function attaches cell scores to the object: ```{r, message=FALSE} seuratObj <- attachCellScores(seuratObj, scoreDF) head(seuratObj$CSOA_acinar) ``` # The runCSOA wrapper `runCSOA` runs the full CSOA pipeline, as a wrapper around `generateOverlaps`, `scoreCells` and `attachCellScores`. The wrapper delivers the same results as running the three functions in succession: ```{r} geneSet <- list(CSOA_acinar = acinarMarkers) seuratObj <- runCSOA(seuratObj, geneSet) identical(seuratObj[[]]$CSOA_acinar, scoreDF$CSOA_acinar) ``` # Session information {-} ```{r} sessionInfo() ```