---
title: "The CSOA algorithm"
author: "Andrei-Florian Stoica"
package: CSOA
date: August 13, 2025
output: BiocStyle::html_document
vignette: >
    %\VignetteIndexEntry{The CSOA algorithm}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r, include=FALSE}
knitr::opts_chunk$set(
  collapse=TRUE,
  comment="#>"
)
```

# Introduction
In this tutorial, we will walk through the main steps of the CSOA algorithm. 
For clarity and simplicity, we will illustrate the algorithm by scoring a 
single gene signature set.

# Prerequisites
In addition to CSOA, you need to install [scRNAseq](https://bioconductor.org/packages/release/data/experiment/html/scRNAseq.html) and [scuttle](https://www.bioconductor.org/packages/release/bioc/html/scuttle.html) for this tutorial.

# Constructing cell sets
First, we repeat the setup from the **Getting started with CSOA** tutorial. 
We need the `BaronPancreasData` dataset from `scRNAseq` and the list of acinar 
markers from [PanglaoDB](https://www.panglaodb.se/markers.html).

```{r message=FALSE, warning=FALSE, results=FALSE}
library(CSOA)
library(scRNAseq)
library(scuttle)
library(Seurat)

sceObj <- BaronPancreasData('human')
sceObj <- logNormCounts(sceObj)

seuratObj <- as.Seurat(sceObj)

acinarMarkers <- c('PRSS1', 'KLK1', 'CTRC', 'PNLIP', 'AKR1C3', 'CTRB1', 
                   'DUOXA2', 'ALDOB', 'REG3A', 'SERPINA3', 'PRSS3', 'REG1B', 
                   'CFB',	'GDF15',	'MUC1','ANPEP', 'ANGPTL4', 'OLFM4', 
                   'GSTA1', 'LGALS2', 'PDZK1IP1', 'RARRES2', 'CXCL17', 
                   'UBD', 'GSTA2', 'LYZ', 'RBPJL', 'PTF1A', 'CELA3A', 
                   'SPINK1', 'ZG16', 'CEL', 'CELA2A', 'CPB1', 'CELA1', 
                   'PNLIPRP1', 'RNASE1', 'AMY2B', 'CPA2','CPA1', 'CELA3B', 
                   'CTRB2', 'PLA2G1B', 'PRSS2', 'CLPS', 'REG1A', 'SYCN')
```

We now extract the expression data for the input genes. The results will be 
stored as a non-sparse matrix:

```{r}
geneSetExp <- expMat(seuratObj, acinarMarkers)
class(geneSetExp)
dim(geneSetExp)
```

Next, a cell set representing the highest-expression cells is constructed for 
each input gene. The cutoff for cell selection is implemented using the 
`percentile` argument, set at `90` by default.

**! Note:** The percentile is taken over all the cells expressing the gene, 
not over all the cells in the dataset. Therefore, the cell sets will vary in 
size.

```{r}
cellSets <- percentileSets(geneSetExp, percentile=90)
vapply(cellSets, length, integer(1))
```

We can now visualize the distribution of the cell sets among all the cells:

```{r, out.height='90%', out.width='90%', fig.height=6, fig.width=5}
mat <- cellDistribution(cellSets, allCells = colnames(seuratObj))
basicHeatmap(mat, c('Gene', 'Cell', 'Presence'), 
             title='Distribution of cell sets among all cells',
             palType='fillDis')
```

The cell sets strongly cluster in a small region of the heatmap, 
which indicates their significant overlap, expected since they correspond 
to markers of the same cell type identity.

# Assessing overlaps of pairs of cell sets

The next step is calculating the statistical significance of the overlaps of 
pairs of cell sets by using hypergeometric tests using the `cellSetsOverlaps` 
function.

```{r}
cso <- cellSetsOverlaps(cellSets, nCells=dim(seuratObj)[2])
```

**! Note:** The `nCells` argument must be set to the total number of cells in the Seurat object, not to the number of cells in the cell sets.

We can inspect the output of this function:

```{r}
head(cso)
```

The number of rows (`1081`) equals the number of all possible unordered pairs that can be generated from a list of `47` genes (`47 * 46 / 2 = 1081`).

To perform both steps—creating cell sets and assessing overlaps of pairs of cell sets—in a single call, the `generateOverlaps` function can be used:

```{r}
cso2 <- generateOverlaps(geneSetExp, percentile = 90)
identical(cso, cso2)
```

# Identifying and scoring top overlaps

Before scoring the cells, the top overlaps need to be identified and scored.
This step is performed in a single call using `processOverlaps`:

```{r}
po <- processOverlaps(cso)
head(po)
```

`processOverlaps` sequentially performs these steps:

1. p-values are corrected for multiple testing and insignificant overlaps
are removed.

2. Two preliminary overlap ranks are defined based on adjusted p-value and 
recorded-over-expected size ratio. 

3. These ranks are replaced by two **connectivity-based ranks**, consisting of 
the minimum p-value/ratio rank of the vertices neighboring the edge 
corresponding to each overlap. The order imposed by the by average of the two 
ranks is the final rank

4. The overlap rank cutoff is defined as the highest-frequency rank, if unique, 
and as the average of the minimum and maximum of the highest-frequency ranks 
otherwise:

```{r, out.height='100%', out.width='100%', fig.height=5, fig.width=5}
overlapCutoffPlot(po)
```

4. Overlaps are filtered based on the rank cutoff.

5. The retained overlaps are scored. If the default `log` overlap scoring
method is selected, scores are assigned starting at `1` and decreasing 
logarithmically towards `0` for each unique rank (though `0` is not reached but 
interpreted as the score of the first non-top overlap). If the `minmax` method
is selected instead, overlaps are scored through inverted minmax normalization.

# Scoring the cells

`scoreCells` scores the cells using an overlap data frame as input. It requires
the output of `cellSetsOverlaps` (or of its wrapper `generateOverlaps`), and
first calls `processOverlaps`. Subsequently, it assigns a score to each cell
by summing the products of these three quantities:

1. The overlap score of the gene pair;

2. The in-cell minmax-normalized expresssion of the first gene;

3. The in-cell minmax-normalized expresssion of the second gene.

```{r}
scoreDF <- scoreCells(geneSetExp, cso, setPairs=list(overlapPairs(cso)), 
                      geneSetNames='CSOA_acinar')
head(scoreDF)
```

The `attachCellScores` function attaches cell scores to the object: 

```{r, message=FALSE}
seuratObj <- attachCellScores(seuratObj, scoreDF)
head(seuratObj$CSOA_acinar)
```

# The runCSOA wrapper

`runCSOA` runs the full CSOA pipeline, as a wrapper around `generateOverlaps`, 
`scoreCells` and `attachCellScores`. The wrapper delivers the same results as
running the three functions in succession:

```{r}
geneSet <- list(CSOA_acinar = acinarMarkers)
seuratObj <- runCSOA(seuratObj, geneSet)
identical(seuratObj[[]]$CSOA_acinar, scoreDF$CSOA_acinar)
```

# Session information {-}
```{r}
sessionInfo()
```