In this tutorial, we will walk through the main steps of the CSOA algorithm. For clarity and simplicity, we will illustrate the algorithm by scoring a single gene signature set.
First, we repeat the setup from the Getting started with
CSOA tutorial. We need the BaronPancreasData
dataset from scRNAseq and the list of acinar markers from
PanglaoDB.
library(CSOA)
library(scRNAseq)
library(scuttle)
library(Seurat)
sceObj <- BaronPancreasData('human')
sceObj <- logNormCounts(sceObj)
seuratObj <- as.Seurat(sceObj)
acinarMarkers <- c('PRSS1', 'KLK1', 'CTRC', 'PNLIP', 'AKR1C3', 'CTRB1',
'DUOXA2', 'ALDOB', 'REG3A', 'SERPINA3', 'PRSS3', 'REG1B',
'CFB', 'GDF15', 'MUC1','ANPEP', 'ANGPTL4', 'OLFM4',
'GSTA1', 'LGALS2', 'PDZK1IP1', 'RARRES2', 'CXCL17',
'UBD', 'GSTA2', 'LYZ', 'RBPJL', 'PTF1A', 'CELA3A',
'SPINK1', 'ZG16', 'CEL', 'CELA2A', 'CPB1', 'CELA1',
'PNLIPRP1', 'RNASE1', 'AMY2B', 'CPA2','CPA1', 'CELA3B',
'CTRB2', 'PLA2G1B', 'PRSS2', 'CLPS', 'REG1A', 'SYCN')We now extract the expression data for the input genes. The results will be stored as a non-sparse matrix:
geneSetExp <- expMat(seuratObj, acinarMarkers)
class(geneSetExp)
#> [1] "matrix" "array"
dim(geneSetExp)
#> [1] 47 8569Next, a cell set representing the highest-expression cells is
constructed for each input gene. The cutoff for cell selection is
implemented using the percentile argument, set at
90 by default.
! Note: The percentile is taken over all the cells expressing the gene, not over all the cells in the dataset. Therefore, the cell sets will vary in size.
cellSets <- percentileSets(geneSetExp, percentile=90)
#> Computing percentile sets...
vapply(cellSets, length, integer(1))
#> AKR1C3 ALDOB AMY2B ANGPTL4 ANPEP CEL CELA1 CELA2A
#> 124 95 48 218 127 69 1 152
#> CELA3A CELA3B CFB CLPS CPA1 CPA2 CPB1 CTRB1
#> 266 186 132 194 218 145 188 264
#> CTRB2 CTRC CXCL17 DUOXA2 GDF15 GSTA1 GSTA2 KLK1
#> 320 143 57 90 259 73 74 119
#> LGALS2 LYZ MUC1 OLFM4 PDZK1IP1 PLA2G1B PNLIP PNLIPRP1
#> 37 46 150 75 144 199 136 124
#> PRSS1 PRSS2 PRSS3 PTF1A RARRES2 RBPJL REG1A REG1B
#> 241 318 195 18 82 3 481 262
#> REG3A RNASE1 SERPINA3 SPINK1 SYCN UBD ZG16
#> 103 131 259 271 124 14 14We can now visualize the distribution of the cell sets among all the cells:
mat <- cellDistribution(cellSets, allCells = colnames(seuratObj))
basicHeatmap(mat, c('Gene', 'Cell', 'Presence'),
title='Distribution of cell sets among all cells',
palType='fillDis')The cell sets strongly cluster in a small region of the heatmap, which indicates their significant overlap, expected since they correspond to markers of the same cell type identity.
The next step is calculating the statistical significance of the
overlaps of pairs of cell sets by using hypergeometric tests using the
cellSetsOverlaps function.
! Note: The nCells argument must be set
to the total number of cells in the Seurat object, not to the number of
cells in the cell sets.
We can inspect the output of this function:
head(cso)
#> gene1 gene2 ncells1 ncells2 shared_cells exp_shared_cells ratio
#> 1 AKR1C3 ALDOB 124 95 14 1.37472284 10.183871
#> 2 AKR1C3 AMY2B 124 48 2 0.69459680 2.879368
#> 3 AKR1C3 ANGPTL4 124 218 12 3.15462714 3.803936
#> 4 AKR1C3 ANPEP 124 127 10 1.83778737 5.441326
#> 5 AKR1C3 CEL 124 69 10 0.99848290 10.015194
#> 6 AKR1C3 CELA1 124 1 0 0.01447077 0.000000
#> pval
#> 1 6.388483e-11
#> 2 1.528311e-01
#> 3 7.039371e-05
#> 4 1.402596e-05
#> 5 4.621307e-08
#> 6 1.000000e+00The number of rows (1081) equals the number of all
possible unordered pairs that can be generated from a list of
47 genes (47 * 46 / 2 = 1081).
To perform both steps—creating cell sets and assessing overlaps of
pairs of cell sets—in a single call, the generateOverlaps
function can be used:
Before scoring the cells, the top overlaps need to be identified and
scored. This step is performed in a single call using
processOverlaps:
po <- processOverlaps(cso)
#> 194 overlaps will be used in the calculation of CSOA scores.
head(po)
#> gene1 gene2 ncells1 ncells2 shared_cells exp_shared_cells ratio
#> 812 KLK1 PNLIP 119 136 36 1.888668 19.061048
#> 789 GSTA2 PNLIP 74 136 27 1.174466 22.989169
#> 765 GSTA1 PNLIP 73 136 19 1.158595 16.399174
#> 504 CPA1 PNLIP 218 136 91 3.459914 26.301234
#> 659 CTRC PNLIP 143 136 47 2.269576 20.708710
#> 782 GSTA2 KLK1 74 119 9 1.027658 8.757779
#> pval pvalAdj pvalRank ratioRank rawAggRank rank score
#> 812 1.600432e-37 7.073002e-36 19.0 5.5 12.25 1 1.0000000
#> 789 1.119819e-30 4.258412e-29 20.5 4.5 12.50 2 0.9733086
#> 765 1.386693e-18 3.964172e-17 20.5 4.5 12.50 2 0.9733086
#> 504 3.720306e-119 1.448430e-116 13.0 12.5 12.75 4 0.9458851
#> 659 1.990427e-51 1.232851e-49 17.0 8.5 12.75 4 0.9458851
#> 782 7.345269e-07 1.325707e-05 23.5 2.0 12.75 4 0.9458851processOverlaps sequentially performs these steps:
p-values are corrected for multiple testing and insignificant overlaps are removed.
Two preliminary overlap ranks are defined based on adjusted p-value and recorded-over-expected size ratio.
These ranks are replaced by two connectivity-based ranks, consisting of the minimum p-value/ratio rank of the vertices neighboring the edge corresponding to each overlap. The order imposed by the by average of the two ranks is the final rank
The overlap rank cutoff is defined as the highest-frequency rank, if unique, and as the average of the minimum and maximum of the highest-frequency ranks otherwise:
Overlaps are filtered based on the rank cutoff.
The retained overlaps are scored. If the default log
overlap scoring method is selected, scores are assigned starting at
1 and decreasing logarithmically towards 0 for
each unique rank (though 0 is not reached but interpreted
as the score of the first non-top overlap). If the minmax
method is selected instead, overlaps are scored through inverted minmax
normalization.
scoreCells scores the cells using an overlap data frame
as input. It requires the output of cellSetsOverlaps (or of
its wrapper generateOverlaps), and first calls
processOverlaps. Subsequently, it assigns a score to each
cell by summing the products of these three quantities:
The overlap score of the gene pair;
The in-cell minmax-normalized expresssion of the first gene;
The in-cell minmax-normalized expresssion of the second gene.
scoreDF <- scoreCells(geneSetExp, cso, setPairs=list(overlapPairs(cso)),
geneSetNames='CSOA_acinar')
#> Processing overlaps...
#> 194 overlaps will be used in the calculation of CSOA scores.
#> Normalizing expression matrix by rows...
#> Computing per-cell scores for gene pairs...
#> Computing per-cell gene signature scores...
head(scoreDF)
#> CSOA_acinar
#> human1_lib1.final_cell_0001 0.7534719
#> human1_lib1.final_cell_0002 0.5009556
#> human1_lib1.final_cell_0003 0.5308047
#> human1_lib1.final_cell_0004 0.5052516
#> human1_lib1.final_cell_0005 0.5371408
#> human1_lib1.final_cell_0006 0.3647260The attachCellScores function attaches cell scores to
the object:
seuratObj <- attachCellScores(seuratObj, scoreDF)
head(seuratObj$CSOA_acinar)
#> human1_lib1.final_cell_0001 human1_lib1.final_cell_0002
#> 0.7534719 0.5009556
#> human1_lib1.final_cell_0003 human1_lib1.final_cell_0004
#> 0.5308047 0.5052516
#> human1_lib1.final_cell_0005 human1_lib1.final_cell_0006
#> 0.5371408 0.3647260runCSOA runs the full CSOA pipeline, as a wrapper around
generateOverlaps, scoreCells and
attachCellScores. The wrapper delivers the same results as
running the three functions in succession:
geneSet <- list(CSOA_acinar = acinarMarkers)
seuratObj <- runCSOA(seuratObj, geneSet)
#> Computing percentile sets...
#> Generating cell set overlaps...
#> Processing overlaps...
#> 194 overlaps will be used in the calculation of CSOA scores.
#> Normalizing expression matrix by rows...
#> Computing per-cell scores for gene pairs...
#> Computing per-cell gene signature scores...
identical(seuratObj[[]]$CSOA_acinar, scoreDF$CSOA_acinar)
#> [1] TRUEsessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] Seurat_5.3.1 SeuratObject_5.2.0
#> [3] sp_2.2-0 stringr_1.6.0
#> [5] scuttle_1.21.0 scRNAseq_2.25.0
#> [7] SingleCellExperiment_1.33.0 SummarizedExperiment_1.41.0
#> [9] Biobase_2.71.0 GenomicRanges_1.63.0
#> [11] Seqinfo_1.1.0 IRanges_2.45.0
#> [13] S4Vectors_0.49.0 BiocGenerics_0.57.0
#> [15] generics_0.1.4 MatrixGenerics_1.23.0
#> [17] matrixStats_1.5.0 patchwork_1.3.2
#> [19] henna_0.3.4 ggplot2_4.0.1
#> [21] CSOA_1.1.1 BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] ProtGenerics_1.43.0 spatstat.sparse_3.1-0 bitops_1.0-9
#> [4] httr_1.4.7 RColorBrewer_1.1-3 tools_4.5.2
#> [7] sctransform_0.4.2 alabaster.base_1.11.1 R6_2.6.1
#> [10] HDF5Array_1.39.0 lazyeval_0.2.2 uwot_0.2.4
#> [13] rhdf5filters_1.23.1 withr_3.0.2 gridExtra_2.3
#> [16] progressr_0.18.0 cli_3.6.5 spatstat.explore_3.6-0
#> [19] fastDummies_1.7.5 labeling_0.4.3 alabaster.se_1.9.0
#> [22] sass_0.4.10 S7_0.2.1 spatstat.data_3.1-9
#> [25] ggridges_0.5.7 pbapply_1.7-4 Rsamtools_2.27.0
#> [28] parallelly_1.45.1 RSQLite_2.4.5 RApiSerialize_0.1.4
#> [31] BiocIO_1.21.0 ica_1.0-3 spatstat.random_3.4-3
#> [34] dplyr_1.1.4 wesanderson_0.3.7 Matrix_1.7-4
#> [37] abind_1.4-8 lifecycle_1.0.4 yaml_2.3.11
#> [40] rhdf5_2.55.11 SparseArray_1.11.7 BiocFileCache_3.1.0
#> [43] Rtsne_0.17 grid_4.5.2 blob_1.2.4
#> [46] promises_1.5.0 ExperimentHub_3.1.0 crayon_1.5.3
#> [49] miniUI_0.1.2 lattice_0.22-7 beachmat_2.27.0
#> [52] cowplot_1.2.0 GenomicFeatures_1.63.1 cigarillo_1.1.0
#> [55] KEGGREST_1.51.1 sys_3.4.3 maketools_1.3.2
#> [58] pillar_1.11.1 knitr_1.50 abdiv_0.2.0
#> [61] rjson_0.2.23 bayesbio_1.0.0 future.apply_1.20.0
#> [64] codetools_0.2-20 glue_1.8.0 spatstat.univar_3.1-5
#> [67] data.table_1.17.8 gypsum_1.7.0 vctrs_0.6.5
#> [70] png_0.1-8 spam_2.11-1 gtable_0.3.6
#> [73] kernlab_0.9-33 cachem_1.1.0 xfun_0.54
#> [76] S4Arrays_1.11.1 mime_0.13 tidygraph_1.3.1
#> [79] survival_3.8-3 fitdistrplus_1.2-4 ROCR_1.0-11
#> [82] liver_1.26 nlme_3.1-168 bit64_4.6.0-1
#> [85] alabaster.ranges_1.9.1 filelock_1.0.3 RcppAnnoy_0.0.22
#> [88] GenomeInfoDb_1.47.1 bslib_0.9.0 irlba_2.3.5.1
#> [91] KernSmooth_2.23-26 otel_0.2.0 DBI_1.2.3
#> [94] tidyselect_1.2.1 bit_4.6.0 compiler_4.5.2
#> [97] curl_7.0.0 httr2_1.2.1 h5mread_1.3.1
#> [100] textshape_1.7.5 DelayedArray_0.37.0 plotly_4.11.0
#> [103] stringfish_0.17.0 rtracklayer_1.71.0 scales_1.4.0
#> [106] lmtest_0.9-40 ggeasy_0.1.6 rappdirs_0.3.3
#> [109] digest_0.6.39 goftest_1.2-3 spatstat.utils_3.2-0
#> [112] alabaster.matrix_1.9.0 rmarkdown_2.30 XVector_0.51.0
#> [115] htmltools_0.5.8.1 pkgconfig_2.0.3 ensembldb_2.35.0
#> [118] dbplyr_2.5.1 fastmap_1.2.0 UCSC.utils_1.7.0
#> [121] rlang_1.1.6 htmlwidgets_1.6.4 shiny_1.11.1
#> [124] farver_2.1.2 jquerylib_0.1.4 zoo_1.8-14
#> [127] jsonlite_2.0.0 BiocParallel_1.45.0 RCurl_1.98-1.17
#> [130] magrittr_2.0.4 dotCall64_1.2 Rhdf5lib_1.33.0
#> [133] Rcpp_1.1.0 kerntools_1.2.0 ggnewscale_0.5.2
#> [136] viridis_0.6.5 reticulate_1.44.1 stringi_1.8.7
#> [139] alabaster.schemas_1.11.0 ggalluvial_0.12.5 ggraph_2.2.2
#> [142] MASS_7.3-65 AnnotationHub_4.1.0 plyr_1.8.9
#> [145] parallel_4.5.2 listenv_0.10.0 ggrepel_0.9.6
#> [148] deldir_2.0-4 Biostrings_2.79.2 graphlayouts_1.2.2
#> [151] splines_4.5.2 tensor_1.5.1 igraph_2.2.1
#> [154] spatstat.geom_3.6-1 RcppHNSW_0.6.0 buildtools_1.0.0
#> [157] reshape2_1.4.5 BiocVersion_3.23.1 XML_3.99-0.20
#> [160] evaluate_1.0.5 RcppParallel_5.1.11-1 BiocManager_1.30.27
#> [163] tweenr_2.0.3 httpuv_1.6.16 RANN_2.6.2
#> [166] tidyr_1.3.1 purrr_1.2.0 polyclip_1.10-7
#> [169] alabaster.sce_1.9.0 qs_0.27.3 future_1.68.0
#> [172] scattermore_1.2 ggforce_0.5.0 xtable_1.8-4
#> [175] AnnotationFilter_1.35.0 restfulr_0.0.16 RSpectra_0.16-2
#> [178] later_1.4.4 viridisLite_0.4.2 tibble_3.3.0
#> [181] memoise_2.0.1 AnnotationDbi_1.73.0 GenomicAlignments_1.47.0
#> [184] cluster_2.1.8.1 globals_0.18.0