---
title: "Advanced CSOA"
author: "Andrei-Florian Stoica"
package: CSOA
date: August 13, 2025
output: BiocStyle::html_document
vignette: >
    %\VignetteIndexEntry{Advanced CSOA}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r, include=FALSE}
knitr::opts_chunk$set(
    collapse=TRUE,
    comment="#>"
)
```

# Introduction

This tutorial showcases some advanced functions of CSOA. 

# Prerequisites
In addition to CSOA, you need to install 
[patchwork](https://patchwork.data-imaginist.com/index.html),
[scRNAseq](https://bioconductor.org/packages/release/data/experiment/html/scRNAseq.html), [scuttle](https://www.bioconductor.org/packages/release/bioc/html/scuttle.html) and
[stringr](https://cran.r-project.org/web/packages/stringr/index.html)
for this tutorial.

# Refining gene signatures: I
First, we will repeat the setup from the **CSOA** tutorial. We need the 
`BaronPancreasData` dataset from `scRNAseq`, and
we load the [PanglaoDB](https://www.panglaodb.se/markers.html)-generated
list of acinar markers.

```{r message=FALSE, warning=FALSE, results=FALSE}
library(CSOA)
library(ggplot2)
library(patchwork)
library(scRNAseq)
library(scuttle)
library(stringr)
library(Seurat)

sceObj <- BaronPancreasData('human')
sceObj <- logNormCounts(sceObj)

seuratObj <- as.Seurat(sceObj)

acinarMarkers <- c('PRSS1', 'KLK1', 'CTRC', 'PNLIP', 'AKR1C3', 'CTRB1', 
                   'DUOXA2', 'ALDOB', 'REG3A', 'SERPINA3', 'PRSS3', 'REG1B', 
                   'CFB',	'GDF15',	'MUC1','ANPEP', 'ANGPTL4', 'OLFM4', 
                   'GSTA1', 'LGALS2', 'PDZK1IP1', 'RARRES2', 'CXCL17', 
                   'UBD', 'GSTA2', 'LYZ', 'RBPJL', 'PTF1A', 'CELA3A', 
                   'SPINK1', 'ZG16', 'CEL', 'CELA2A', 'CPB1', 'CELA1', 
                   'PNLIPRP1', 'RNASE1', 'AMY2B', 'CPA2','CPA1', 'CELA3B', 
                   'CTRB2', 'PLA2G1B', 'PRSS2', 'CLPS', 'REG1A', 'SYCN')
```

A distinctive feature of CSOA is that the calculation of per-cell scores does not necessarily use all the genes in the input gene signature. If a gene did not register any top overlap, it will not feature in the calculation of CSOA scores. 

We now run CSOA with the option of calculating the contribution of individual gene pairs to the CSOA score toggled on using `pairFileName`:

```{r}
seuratObj <- runCSOA(seuratObj, list(CSOA_acinar = acinarMarkers), 
                     pairFileTemplate='pairs')
```

Next, we inspect the resulting data frame using `qGrab`, a wrapper around
`qread` from the `qs` package which deletes the file after reading it:

```{r}
acinarPairs <- qGrab('pairsCSOA_acinar.qs')
head(acinarPairs)
```

Using the `overlapGenes` function, we find that only `27` out of the `47` input genes contributed directly to the CSOA score:

```{r}
genes <- overlapGenes(acinarPairs)
nGenes <- length(genes)
nGenes
length(acinarMarkers)
setdiff(acinarMarkers, genes)
```

Next, we run CSOA with only the `27` genes as input:

```{r}
seuratObj <- runCSOA(seuratObj, list(CSOA_acinar2 = genes))
```

We aim to compare the results with those obtained using the original `47` acinar markers. To this end, we build a function that computes and prints several statistics of interest here—Pearson correlation, and mean, median, maximum and minimum of the absolute-value differences between the two sets of CSOA scores:

```{r}
printStats <- function(seuratObj, col1, col2){
    paste0('Pearson correlation: ', cor(seuratObj[[]][[col1]], 
                                        seuratObj[[]][[col2]]))
    absDiffs <- abs(seuratObj[[]][[col1]] - seuratObj[[]][[col2]])
    absDiffsStats <- sapply(list(mean, median, max, min), 
                            function(fun) fun(absDiffs))
    names(absDiffsStats) <- paste0(c('mean', 'median', 'max', 'min'), 
                                   'AbsDiff')
    absDiffsStats
}
```

The results are quite close to those obtained using all the `47` genes:

```{r}
printStats(seuratObj, 'CSOA_acinar', 'CSOA_acinar2')
```

It is instructive to think why the results between the two runs were not 
identical. This is because the additional overlaps computed for the `47` acinar 
markers impacted each individual component of the overlap rank (p-value rank, 
ratio rank), inducing changes in the overlap ranking and subsequent selection 
and scoring. 


# Refining gene signatures: II

We can further refine gene signatures by eliminating overlaps which account 
only for a small percent of the total score, and subsequently setting the 
corresponding genes as new input genes.

We return to the `acinarPairs` data frame. The `revCumsum` column shows 
that `90%` of the total CSOA score is accounted for by the top-scoring 
`128` gene pairs:

```{r}
acinarPairsSub <- subset(acinarPairs, revCumsum > 10)
nrow(acinarPairsSub)
```

The `overlapGenes` function finds the `24` genes that correspond to these pairs:

```{r}
genes90 <- overlapGenes(acinarPairsSub)
genes90
nGenes90 <- length(genes90)
nGenes90
```

Next, we run CSOA using solely these `24` genes as input genes. The results remain close to those obtained using all acinar markers:

```{r}
seuratObj <- runCSOA(seuratObj, list(CSOA_acinar3 = genes90))

printStats(seuratObj, 'CSOA_acinar', 'CSOA_acinar3')
```

# Addressing noisy gene sets

This section will demonstrate CSOA's ability to extract the most biologically 
meaningful genes from an input gene set. First, as random numbers are going to 
be involved, we set a seed to ensure reproducibility:

```{r}
set.seed(123)
```

We will proceed by replacing randomly chosen genes among the top `24` acinar 
markers with random genes from the Seurat object, and running CSOA using the
resulting gene set.

First, we build a function for this task:

```{r}
runCSOAReplace <- function(seuratObj, markers, nReplacedGenes, 
                           pairFileTemplate = 'pairs'){
    genesComplement <- setdiff(rownames(seuratObj), markers)
    geneSet <- list(c(sample(markers, length(markers) - nReplacedGenes), 
                      sample(genesComplement, nReplacedGenes)))
    names(geneSet) <- paste0('CSOA_replace', nReplacedGenes)
    seuratObj <- runCSOA(seuratObj, geneSet, pairFileTemplate=pairFileTemplate)
    return(seuratObj)
}
```

Now we run CSOA on a set in which half of top `24` acinar markers (randomly chosen) have been replaced by random genes. The results are very close to those obtained using the original `24` markers, though with an increase in the maximum absolute difference—CSOA no longer recognizes a very few acinar cells as such, because of the lower number of input acinar markers. 

```{r, warning=FALSE, out.height='80%', out.width='80%', fig.height=4, fig.width=5}
nReplacedGenes <- ceiling(nGenes90 / 2)
seuratObj <- runCSOAReplace(seuratObj, genes90, nReplacedGenes)

newCol <- paste0('CSOA_replace', nReplacedGenes)
printStats(seuratObj, 'CSOA_acinar3', newCol)
VlnPlot(seuratObj, newCol, group.by='label') + NoLegend()
```

The noise introduced by random genes did not directly affect CSOA results at all. Using the `overlapGenes` function, we can verify that none of the random genes participated in the `23` top overlaps:

```{r}
length(setdiff(overlapGenes(qGrab('pairsCSOA_replace12.qs')), genes90))
```

We now replace all but `2` of the top `24` acinar markers with random genes and run CSOA again. The mean, median and minimum absolute differences remain very small, and  the correlation with the original results remains very high. However, the maximum absolute difference has become substantial, indicating that an additional small number of cells could no longer be detected as acinar cells. Noise now makes an effect, though a small one: a single random gene (`PTGR1`) introduces two overlaps into CSOA scoring, which together account for `27.8%` of the total CSOA score.

```{r, out.height='80%', out.width='80%', fig.height=4, fig.width=5}
nReplacedGenes <- nGenes90 - 2
seuratObj <- runCSOAReplace(seuratObj, genes90, nReplacedGenes)

newCol <- paste0('CSOA_replace', nReplacedGenes)
printStats(seuratObj, 'CSOA_acinar3', newCol)
dfReplace <- qGrab('pairsCSOA_replace22.qs')
dfReplace
setdiff(overlapGenes(dfReplace), genes90)
VlnPlot(seuratObj, newCol, group.by='label') + NoLegend()
```

These results have implications upon the construction of input gene sets for CSOA. Due to CSOA's efficient noise-filtering ability, it is generally better to err on the side of **including more genes in the input**, even though this will incur higher computational costs.

# Operating with gene sets that characterize multiple subpopulations

Cell signatures do not necessarily characterize a single cell type or state. Therefore, it is desirable that a gene set analysis method is able to recognize the different cell types or states characterized by different genes in the input set. 

To illustrate this functionality of CSOA, we will first construct a mixed signature using acinar and stellate cell markers selected from [PanglaoDB](https://www.panglaodb.se/markers.html):

```{r, warning=FALSE}

stellateMarkers <- c('COL6A1', 'RGS5', 'COL1A1', 'MMP11', 'THY1', 'COL6A3',
                     'SFRP2', 'COL1A2', 'TNFAIP6', 'TIMP3', 'SPARC', 'COL3A1',
                     'MGP', 'COL6A2', 'COL4A1', 'FN1', 'SPON2', 'TIMP1',
                     'TGFB1', 'INHBA', 'PDGFRA', 'NDUFA4L2', 'MMP14', 'CTGF',
                     'CYGB', 'KRT10', 'PDGFRB', 'DYNLT1', 'GEM')
  
mixedMarkers <- union(acinarMarkers, stellateMarkers)

```

Now we run CSOA with the mixed markers as input. As expected, we see enrichment of CSOA scores among both acinar and stellate cells:

```{r, warning=FALSE, out.height='80%', out.width='80%', fig.height=4, fig.width=5}
seuratObj <- runCSOA(seuratObj, list(CSOA_mixed = mixedMarkers), 
                     pairFileTemplate='pairs')
VlnPlot(seuratObj, 'CSOA_mixed', group.by='label') + NoLegend()
```

We can interpret the overlap data frame as a graph, with genes as vertices and overlaps as edges. We can build and visualize the graph as follows:

```{r, out.height='80%', out.width='80%', fig.height=5, fig.width=5}
dfNetwork <- qGrab('pairsCSOA_mixed.qs')
networkPlot(dfNetwork, rankCol='overlapRank')
```

We remark that the graph is not **connected**: it has no vertex from which we can reach all other vertices by travelling on the graph edges. Instead, some subgraphs are connected. We refer to these as **connected components**. Their significance is that they can be interpreted as **gene modules**—groups of genes linked by the highly significant overlaps of some of their corresponding cell sets. The graph contains two connected components, `1` and `2`:

```{r}
dfNetwork <- connectedComponents(dfNetwork)
dplyr::count(dfNetwork, component)
```

Next, we are interested in scoring the major gene modules in order to ascertain the identity of the cells they characterize. We can score the major components separately as follows:

```{r, warning=FALSE, out.height='80%', out.width='80%', fig.height=7, fig.width=6}
components <- c(1, 2)
seuratObj <- scoreModules(seuratObj, dfNetwork, components)
p1 <- VlnPlot(seuratObj, 'CSOA_component1', group.by='label') + NoLegend() +
    theme(axis.text.y = element_blank())
p2 <- VlnPlot(seuratObj, 'CSOA_component2', group.by='label') + NoLegend() +
    theme(axis.text.y = element_blank())
p1 / p2

```

Thus, component `1` is an acinar one and component `2` is a stellate one. The genes defining each component can be extracted from the overlap matrix after `connectedComponents` has been run:

```{r, warning=FALSE}
genes1 <- overlapGenes(subset(dfNetwork, component == 1))
genes1
genes2 <- overlapGenes(subset(dfNetwork, component == 2))
genes2

```
It may happen, though, that the network of overlaps generated by CSOA for a mixed signature is connected. Let us construct a mixed signature of acinar and ductal markers, run CSOA, and identity the connected components of the graph determined by the top overlaps:

```{r, warning=FALSE, out.height='80%', out.width='80%', fig.height=4, fig.width=5}
ductalMarkers <- c('CFTR', 'SERPINA5', 'SLPI', 'TFF1', 'CFB', 'LGALS4',
                   'CTSH',	'PERP', 'PDLIM3',	'WFDC2', 'SLC3A1', 'AQP1',
                   'ALDH1A3', 'VTCN1',	'KRT19', 'TFF2', 'KRT7', 'CLDN4',
                   'LAMB3', 'TACSTD2', 'CCL2', 'DCDC2','CXCL2', 'CLDN10',
                   'HNF1B', 'KRT20', 'MUC1', 'ONECUT1', 'AMBP', 'HHEX',
                   'ANXA4', 'SPP1', 'PDX1', 'SERPINA3', 'GDF15', 'AKR1C3',
                   'MMP7', 'DEFB1', 'SERPING1', 'TSPAN8', 'CLDN1', 'S100A10',
                   'PIGR')

mixedMarkers <- union(acinarMarkers, ductalMarkers)
seuratObj <- runCSOA(seuratObj, list(CSOA_mixed = mixedMarkers), 
                     pairFileTemplate='pairs')
VlnPlot(seuratObj, 'CSOA_mixed', group.by='label') + NoLegend()
dfNetwork <- qGrab('pairsCSOA_mixed.qs')
dfNetwork <- connectedComponents(dfNetwork)
dplyr::count(dfNetwork, component)
```

The violin plot shows enrichment in both the acinar and ductal groups, 
but the overlap graph is connected. To divide the mixed signature into acinar 
and ductal components, we will run CSOA with a non-null value set for 
the `jaccardCutoff`  parameter:

```{r}
seuratObj <- runCSOA(seuratObj, list(CSOA_mixed = mixedMarkers), 
                     jaccardCutoff=1/3, pairFileTemplate='pairs')
```
This function has reduced the number of overlaps from 760 to 486 by selectively 
removing overlaps for which the neighbors of the corresponding genes overlapped 
little, as calculated using the Jaccard score. The resulting graph has two connected components:

```{r, out.height='80%', out.width='80%', fig.height=5, fig.width=5}
dfNetwork <- qGrab('pairsCSOA_mixed.qs')
networkPlot(dfNetwork, rankCol='overlapRank')
```
These correspond to acinar and ductal cells, respectively:

```{r, message=FALSE, warning=FALSE, out.height='80%', out.width='80%', fig.height=7, fig.width=6}  
dfNetwork <- connectedComponents(dfNetwork)
components <- unique(dfNetwork$component)
seuratObj <- scoreModules(seuratObj, dfNetwork, components)
p1 <- VlnPlot(seuratObj, 'CSOA_component1', group.by='label') + NoLegend() +
    theme(axis.text.y=element_blank())
p2 <- VlnPlot(seuratObj, 'CSOA_component2', group.by='label') + NoLegend() +
    theme(axis.text.y=element_blank())
p1 / p2
```


# Visualizing gene participation in top overlaps

To display gene participation in top overlaps, CSOA provides the `geneRadialPlot` function:

```{r, out.height='100%', out.width='100%', fig.height=6, fig.width=6}
geneRadialPlot(acinarPairs, 
               title='Top 100 overlap genes - Acinar markers')
```

Additionally, we can visualize and compare the genes involved in the top overlaps across multiple gene sets. We will reload the alpha and gamma markers from the **Getting started with CSOA** tutorial, and run again CSOA with four gene sets. To facilitate visualization, we will select the top `15` overlaps for each gene set:


```{r, message=FALSE, out.height='100%', out.width='100%', fig.height=6, fig.width=6}
alphaMarkers <- c('GCG', 'TTR', 'PCSK2', 'FXYD5', 'LDB2', 'MAFB',
                  'CHGA', 'SCGB2A1', 'GLS', 'FAP', 'DPP4', 'GPR119',
                  'PAX6', 'NEUROD1', 'LOXL4', 'PLCE1', 'GC', 'KLHL41',
                  'FEV', 'PTGER3', 'RFX6', 'SMARCA1', 'PGR', 'IRX1',
                  'UCP2', 'RGS4', 'KCNK16', 'GLP1R', 'ARX', 'POU3F4',
                  'RESP18', 'PYY', 'SLC38A5', 'TM4SF4', 'CRYBA2', 'SH3GL2', 
                  'PCSK1', 'PRRG2', 'IRX2', 'ALDH1A1','PEMT', 'SMIM24', 
                  'F10', 'SCGN', 'SLC30A8')

gammaMarkers <- c('PPY', 'ABCC9', 'FGB', 'ZNF503', 'MEIS1', 'LMO3', 'EGR3', 
                  'CHN2', 'PTGFR', 'ENTPD2', 'AQP3', 'THSD7A', 'CARTPT',
                  'ISL1', 'PAX6', 'NEUROD1', 'APOBEC2', 'SEMA3E', 'SLITRK6',
                  'SERTM1', 'PXK', 'PPY2P', 'ETV1', 'ARX', 'CMTM8', 'SCGB2A1', 
                  'FXYD2', 'SCGN')

geneSets <- list(acinarMarkers, alphaMarkers, ductalMarkers, gammaMarkers)
names(geneSets) <- c('CSOA_acinar', 'CSOA_alpha', 'CSOA_ductal', 'CSOA_gamma')
seuratObj <- runCSOA(seuratObj, geneSets, pairFileTemplate='pairs', 
                             keepOverlapOrder=TRUE)

geneRadialPlot(lapply(paste0('pairs', names(geneSets), '.qs'), qGrab),
               groupLegendName='Gene set',
               groupNames=str_remove(names(geneSets), 'CSOA_'),
               cutoff=15,
               title='Top 15 overlap genes',
               extraCircles=2)

```

# Session information {-}
```{r}
sessionInfo()
```