Contents

1 Introduction

In this tutorial, we will walk through the main steps of the CSOA algorithm. For clarity and simplicity, we will illustrate the algorithm by scoring a single gene signature set.

2 Prerequisites

In addition to CSOA, you need to install scRNAseq and scuttle for this tutorial.

3 Constructing cell sets

First, we repeat the setup from the Getting started with CSOA tutorial. We need the BaronPancreasData dataset from scRNAseq and the list of acinar markers from PanglaoDB.

library(CSOA)
library(scRNAseq)
library(scuttle)
library(Seurat)

sceObj <- BaronPancreasData('human')
sceObj <- logNormCounts(sceObj)

seuratObj <- as.Seurat(sceObj)

acinarMarkers <- c('PRSS1', 'KLK1', 'CTRC', 'PNLIP', 'AKR1C3', 'CTRB1', 
                   'DUOXA2', 'ALDOB', 'REG3A', 'SERPINA3', 'PRSS3', 'REG1B', 
                   'CFB',   'GDF15',    'MUC1','ANPEP', 'ANGPTL4', 'OLFM4', 
                   'GSTA1', 'LGALS2', 'PDZK1IP1', 'RARRES2', 'CXCL17', 
                   'UBD', 'GSTA2', 'LYZ', 'RBPJL', 'PTF1A', 'CELA3A', 
                   'SPINK1', 'ZG16', 'CEL', 'CELA2A', 'CPB1', 'CELA1', 
                   'PNLIPRP1', 'RNASE1', 'AMY2B', 'CPA2','CPA1', 'CELA3B', 
                   'CTRB2', 'PLA2G1B', 'PRSS2', 'CLPS', 'REG1A', 'SYCN')

We now extract the expression data for the input genes. The results will be stored as a non-sparse matrix:

geneSetExp <- expMat(seuratObj, acinarMarkers)
class(geneSetExp)
#> [1] "matrix" "array"
dim(geneSetExp)
#> [1]   47 8569

Next, a cell set representing the highest-expression cells is constructed for each input gene. The cutoff for cell selection is implemented using the percentile argument, set at 90 by default.

! Note: The percentile is taken over all the cells expressing the gene, not over all the cells in the dataset. Therefore, the cell sets will vary in size.

cellSets <- percentileSets(geneSetExp, percentile=90)
#> Computing percentile sets...
vapply(cellSets, length, integer(1))
#>   AKR1C3    ALDOB    AMY2B  ANGPTL4    ANPEP      CEL    CELA1   CELA2A 
#>      124       95       48      218      127       69        1      152 
#>   CELA3A   CELA3B      CFB     CLPS     CPA1     CPA2     CPB1    CTRB1 
#>      266      186      132      194      218      145      188      264 
#>    CTRB2     CTRC   CXCL17   DUOXA2    GDF15    GSTA1    GSTA2     KLK1 
#>      320      143       57       90      259       73       74      119 
#>   LGALS2      LYZ     MUC1    OLFM4 PDZK1IP1  PLA2G1B    PNLIP PNLIPRP1 
#>       37       46      150       75      144      199      136      124 
#>    PRSS1    PRSS2    PRSS3    PTF1A  RARRES2    RBPJL    REG1A    REG1B 
#>      241      318      195       18       82        3      481      262 
#>    REG3A   RNASE1 SERPINA3   SPINK1     SYCN      UBD     ZG16 
#>      103      131      259      271      124       14       14

We can now visualize the distribution of the cell sets among all the cells:

mat <- cellDistribution(cellSets, allCells = colnames(seuratObj))
basicHeatmap(mat, c('Gene', 'Cell', 'Presence'), 
             title='Distribution of cell sets among all cells',
             palType='fillDis')

The cell sets strongly cluster in a small region of the heatmap, which indicates their significant overlap, expected since they correspond to markers of the same cell type identity.

4 Assessing overlaps of pairs of cell sets

The next step is calculating the statistical significance of the overlaps of pairs of cell sets by using hypergeometric tests using the cellSetsOverlaps function.

cso <- cellSetsOverlaps(cellSets, nCells=dim(seuratObj)[2])
#> Assessing gene overlaps...

! Note: The nCells argument must be set to the total number of cells in the Seurat object, not to the number of cells in the cell sets.

We can inspect the output of this function:

head(cso)
#>    gene1   gene2 ncells1 ncells2 shared_cells exp_shared_cells     ratio
#> 1 AKR1C3   ALDOB     124      95           14       1.37472284 10.183871
#> 2 AKR1C3   AMY2B     124      48            2       0.69459680  2.879368
#> 3 AKR1C3 ANGPTL4     124     218           12       3.15462714  3.803936
#> 4 AKR1C3   ANPEP     124     127           10       1.83778737  5.441326
#> 5 AKR1C3     CEL     124      69           10       0.99848290 10.015194
#> 6 AKR1C3   CELA1     124       1            0       0.01447077  0.000000
#>           pval
#> 1 6.388483e-11
#> 2 1.528311e-01
#> 3 7.039371e-05
#> 4 1.402596e-05
#> 5 4.621307e-08
#> 6 1.000000e+00

The number of rows (1081) equals the number of all possible unordered pairs that can be generated from a list of 47 genes (47 * 46 / 2 = 1081).

To perform both steps—creating cell sets and assessing overlaps of pairs of cell sets—in a single call, the generateOverlaps function can be used:

cso2 <- generateOverlaps(geneSetExp, percentile = 90)
#> Computing percentile sets...
#> Assessing gene overlaps...
identical(cso, cso2)
#> [1] TRUE

5 Identifying and scoring top overlaps

Before scoring the cells, the top overlaps need to be identified and scored. This step is performed in a single call using processOverlaps:

po <- processOverlaps(cso)
#> 194 overlaps will be used in the calculation of CSOA scores.
head(po)
#>     gene1 gene2 ncells1 ncells2 shared_cells exp_shared_cells     ratio
#> 812  KLK1 PNLIP     119     136           36         1.888668 19.061048
#> 789 GSTA2 PNLIP      74     136           27         1.174466 22.989169
#> 765 GSTA1 PNLIP      73     136           19         1.158595 16.399174
#> 504  CPA1 PNLIP     218     136           91         3.459914 26.301234
#> 659  CTRC PNLIP     143     136           47         2.269576 20.708710
#> 782 GSTA2  KLK1      74     119            9         1.027658  8.757779
#>              pval       pvalAdj pvalRank ratioRank rawAggRank rank     score
#> 812  1.600432e-37  7.073002e-36     19.0       5.5      12.25    1 1.0000000
#> 789  1.119819e-30  4.258412e-29     20.5       4.5      12.50    2 0.9733086
#> 765  1.386693e-18  3.964172e-17     20.5       4.5      12.50    2 0.9733086
#> 504 3.720306e-119 1.448430e-116     13.0      12.5      12.75    4 0.9458851
#> 659  1.990427e-51  1.232851e-49     17.0       8.5      12.75    4 0.9458851
#> 782  7.345269e-07  1.325707e-05     23.5       2.0      12.75    4 0.9458851

processOverlaps sequentially performs these steps:

  1. p-values are corrected for multiple testing and insignificant overlaps are removed.

  2. Two preliminary overlap ranks are defined based on adjusted p-value and recorded-over-expected size ratio.

  3. These ranks are replaced by two connectivity-based ranks, consisting of the minimum p-value/ratio rank of the vertices neighboring the edge corresponding to each overlap. The order imposed by the by average of the two ranks is the final rank

  4. The overlap rank cutoff is defined as the highest-frequency rank, if unique, and as the average of the minimum and maximum of the highest-frequency ranks otherwise:

overlapCutoffPlot(po)

  1. Overlaps are filtered based on the rank cutoff.

  2. The retained overlaps are scored. If the default log overlap scoring method is selected, scores are assigned starting at 1 and decreasing logarithmically towards 0 for each unique rank (though 0 is not reached but interpreted as the score of the first non-top overlap). If the minmax method is selected instead, overlaps are scored through inverted minmax normalization.

6 Scoring the cells

scoreCells scores the cells using an overlap data frame as input. It requires the output of cellSetsOverlaps (or of its wrapper generateOverlaps), and first calls processOverlaps. Subsequently, it assigns a score to each cell by summing the products of these three quantities:

  1. The overlap score of the gene pair;

  2. The in-cell minmax-normalized expresssion of the first gene;

  3. The in-cell minmax-normalized expresssion of the second gene.

scoreDF <- scoreCells(geneSetExp, cso, setPairs=list(overlapPairs(cso)), 
                      geneSetNames='CSOA_acinar')
#> 194 overlaps will be used in the calculation of CSOA scores.
#> Normalizing expression matrix by rows...
#> Computing per-cell scores for gene pairs...
#> Computing per-cell gene signature scores...
head(scoreDF)
#>                             CSOA_acinar
#> human1_lib1.final_cell_0001   0.7534719
#> human1_lib1.final_cell_0002   0.5009556
#> human1_lib1.final_cell_0003   0.5308047
#> human1_lib1.final_cell_0004   0.5052516
#> human1_lib1.final_cell_0005   0.5371408
#> human1_lib1.final_cell_0006   0.3647260

The attachCellScores function attaches cell scores to the object:

seuratObj <- attachCellScores(seuratObj, scoreDF)
head(seuratObj$CSOA_acinar)
#> human1_lib1.final_cell_0001 human1_lib1.final_cell_0002 
#>                   0.7534719                   0.5009556 
#> human1_lib1.final_cell_0003 human1_lib1.final_cell_0004 
#>                   0.5308047                   0.5052516 
#> human1_lib1.final_cell_0005 human1_lib1.final_cell_0006 
#>                   0.5371408                   0.3647260

7 The runCSOA wrapper

runCSOA runs the full CSOA pipeline, as a wrapper around generateOverlaps, scoreCells and attachCellScores. The wrapper delivers the same results as running the three functions in succession:

geneSet <- list(CSOA_acinar = acinarMarkers)
seuratObj <- runCSOA(seuratObj, geneSet)
#> Computing percentile sets...
#> Assessing gene overlaps...
#> 194 overlaps will be used in the calculation of CSOA scores.
#> Normalizing expression matrix by rows...
#> Computing per-cell scores for gene pairs...
#> Computing per-cell gene signature scores...
identical(seuratObj[[]]$CSOA_acinar, scoreDF$CSOA_acinar)
#> [1] TRUE

Session information

sessionInfo()
#> R version 4.5.1 Patched (2025-08-23 r88802)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] Seurat_5.3.0                SeuratObject_5.2.0         
#>  [3] sp_2.2-0                    stringr_1.5.2              
#>  [5] scuttle_1.19.0              scRNAseq_2.23.0            
#>  [7] SingleCellExperiment_1.31.1 SummarizedExperiment_1.39.2
#>  [9] Biobase_2.69.1              GenomicRanges_1.61.4       
#> [11] Seqinfo_0.99.2              IRanges_2.43.2             
#> [13] S4Vectors_0.47.2            BiocGenerics_0.55.1        
#> [15] generics_0.1.4              MatrixGenerics_1.21.0      
#> [17] matrixStats_1.5.0           patchwork_1.3.2            
#> [19] ggplot2_4.0.0               CSOA_0.99.2                
#> [21] BiocStyle_2.37.1           
#> 
#> loaded via a namespace (and not attached):
#>   [1] ProtGenerics_1.41.0      spatstat.sparse_3.1-0    bitops_1.0-9            
#>   [4] httr_1.4.7               RColorBrewer_1.1-3       tools_4.5.1             
#>   [7] sctransform_0.4.2        alabaster.base_1.9.5     R6_2.6.1                
#>  [10] HDF5Array_1.37.0         lazyeval_0.2.2           uwot_0.2.3              
#>  [13] rhdf5filters_1.21.0      withr_3.0.2              gridExtra_2.3           
#>  [16] progressr_0.16.0         cli_3.6.5                spatstat.explore_3.5-2  
#>  [19] fastDummies_1.7.5        labeling_0.4.3           alabaster.se_1.9.0      
#>  [22] sass_0.4.10              S7_0.2.0                 spatstat.data_3.1-8     
#>  [25] ggridges_0.5.7           pbapply_1.7-4            Rsamtools_2.25.3        
#>  [28] dichromat_2.0-0.1        parallelly_1.45.1        RSQLite_2.4.3           
#>  [31] RApiSerialize_0.1.4      BiocIO_1.19.0            ica_1.0-3               
#>  [34] spatstat.random_3.4-1    dplyr_1.1.4              wesanderson_0.3.7       
#>  [37] Matrix_1.7-4             ggbeeswarm_0.7.2         abind_1.4-8             
#>  [40] lifecycle_1.0.4          yaml_2.3.10              rhdf5_2.53.4            
#>  [43] SparseArray_1.9.1        BiocFileCache_2.99.6     Rtsne_0.17              
#>  [46] grid_4.5.1               blob_1.2.4               promises_1.3.3          
#>  [49] ExperimentHub_2.99.5     crayon_1.5.3             miniUI_0.1.2            
#>  [52] lattice_0.22-7           beachmat_2.25.5          cowplot_1.2.0           
#>  [55] GenomicFeatures_1.61.6   KEGGREST_1.49.1          poibin_1.6              
#>  [58] magick_2.9.0             pillar_1.11.1            knitr_1.50              
#>  [61] rjson_0.2.23             bayesbio_1.0.0           future.apply_1.20.0     
#>  [64] codetools_0.2-20         glue_1.8.0               spatstat.univar_3.1-4   
#>  [67] data.table_1.17.8        vctrs_0.6.5              png_0.1-8               
#>  [70] gypsum_1.5.0             spam_2.11-1              gtable_0.3.6            
#>  [73] kernlab_0.9-33           cachem_1.1.0             xfun_0.53               
#>  [76] S4Arrays_1.9.1           mime_0.13                tidygraph_1.3.1         
#>  [79] survival_3.8-3           tinytex_0.57             fitdistrplus_1.2-4      
#>  [82] ROCR_1.0-11              nlme_3.1-168             bit64_4.6.0-1           
#>  [85] alabaster.ranges_1.9.1   filelock_1.0.3           RcppAnnoy_0.0.22        
#>  [88] GenomeInfoDb_1.45.11     bslib_0.9.0              irlba_2.3.5.1           
#>  [91] vipor_0.4.7              KernSmooth_2.23-26       DBI_1.2.3               
#>  [94] ggrastr_1.0.2            tidyselect_1.2.1         bit_4.6.0               
#>  [97] compiler_4.5.1           curl_7.0.0               httr2_1.2.1             
#> [100] h5mread_1.1.1            textshape_1.7.5          DelayedArray_0.35.3     
#> [103] plotly_4.11.0            stringfish_0.17.0        bookdown_0.44           
#> [106] rtracklayer_1.69.1       scales_1.4.0             lmtest_0.9-40           
#> [109] ggeasy_0.1.6             sgof_2.3.5               rappdirs_0.3.3          
#> [112] digest_0.6.37            goftest_1.2-3            spatstat.utils_3.1-5    
#> [115] alabaster.matrix_1.9.0   rmarkdown_2.29           XVector_0.49.1          
#> [118] htmltools_0.5.8.1        pkgconfig_2.0.3          ensembldb_2.33.2        
#> [121] dbplyr_2.5.1             fastmap_1.2.0            UCSC.utils_1.5.0        
#> [124] rlang_1.1.6              htmlwidgets_1.6.4        shiny_1.11.1            
#> [127] farver_2.1.2             jquerylib_0.1.4          zoo_1.8-14              
#> [130] jsonlite_2.0.0           BiocParallel_1.43.4      RCurl_1.98-1.17         
#> [133] magrittr_2.0.4           dotCall64_1.2            Rhdf5lib_1.31.0         
#> [136] Rcpp_1.1.0               kerntools_1.2.0          ggnewscale_0.5.2        
#> [139] viridis_0.6.5            reticulate_1.43.0        stringi_1.8.7           
#> [142] alabaster.schemas_1.9.0  ggraph_2.2.2             MASS_7.3-65             
#> [145] AnnotationHub_3.99.6     plyr_1.8.9               parallel_4.5.1          
#> [148] listenv_0.9.1            ggrepel_0.9.6            deldir_2.0-4            
#> [151] Biostrings_2.77.2        graphlayouts_1.2.2       splines_4.5.1           
#> [154] tensor_1.5.1             igraph_2.1.4             spatstat.geom_3.5-0     
#> [157] RcppHNSW_0.6.0           reshape2_1.4.4           BiocVersion_3.22.0      
#> [160] XML_3.99-0.19            evaluate_1.0.5           RcppParallel_5.1.11-1   
#> [163] BiocManager_1.30.26      tweenr_2.0.3             httpuv_1.6.16           
#> [166] RANN_2.6.2               tidyr_1.3.1              purrr_1.1.0             
#> [169] polyclip_1.10-7          alabaster.sce_1.9.0      qs_0.27.3               
#> [172] future_1.67.0            scattermore_1.2          ggforce_0.5.0           
#> [175] xtable_1.8-4             AnnotationFilter_1.33.0  restfulr_0.0.16         
#> [178] RSpectra_0.16-2          later_1.4.4              viridisLite_0.4.2       
#> [181] tibble_3.3.0             beeswarm_0.4.0           memoise_2.0.1           
#> [184] AnnotationDbi_1.71.1     GenomicAlignments_1.45.4 cluster_2.1.8.1         
#> [187] globals_0.18.0