CSOA 0.99.2
In this tutorial, we will walk through the main steps of the CSOA algorithm. For clarity and simplicity, we will illustrate the algorithm by scoring a single gene signature set.
First, we repeat the setup from the Getting started with CSOA tutorial.
We need the BaronPancreasData
dataset from scRNAseq
and the list of acinar
markers from PanglaoDB.
library(CSOA)
library(scRNAseq)
library(scuttle)
library(Seurat)
sceObj <- BaronPancreasData('human')
sceObj <- logNormCounts(sceObj)
seuratObj <- as.Seurat(sceObj)
acinarMarkers <- c('PRSS1', 'KLK1', 'CTRC', 'PNLIP', 'AKR1C3', 'CTRB1',
'DUOXA2', 'ALDOB', 'REG3A', 'SERPINA3', 'PRSS3', 'REG1B',
'CFB', 'GDF15', 'MUC1','ANPEP', 'ANGPTL4', 'OLFM4',
'GSTA1', 'LGALS2', 'PDZK1IP1', 'RARRES2', 'CXCL17',
'UBD', 'GSTA2', 'LYZ', 'RBPJL', 'PTF1A', 'CELA3A',
'SPINK1', 'ZG16', 'CEL', 'CELA2A', 'CPB1', 'CELA1',
'PNLIPRP1', 'RNASE1', 'AMY2B', 'CPA2','CPA1', 'CELA3B',
'CTRB2', 'PLA2G1B', 'PRSS2', 'CLPS', 'REG1A', 'SYCN')
We now extract the expression data for the input genes. The results will be stored as a non-sparse matrix:
geneSetExp <- expMat(seuratObj, acinarMarkers)
class(geneSetExp)
#> [1] "matrix" "array"
dim(geneSetExp)
#> [1] 47 8569
Next, a cell set representing the highest-expression cells is constructed for
each input gene. The cutoff for cell selection is implemented using the
percentile
argument, set at 90
by default.
! Note: The percentile is taken over all the cells expressing the gene, not over all the cells in the dataset. Therefore, the cell sets will vary in size.
cellSets <- percentileSets(geneSetExp, percentile=90)
#> Computing percentile sets...
vapply(cellSets, length, integer(1))
#> AKR1C3 ALDOB AMY2B ANGPTL4 ANPEP CEL CELA1 CELA2A
#> 124 95 48 218 127 69 1 152
#> CELA3A CELA3B CFB CLPS CPA1 CPA2 CPB1 CTRB1
#> 266 186 132 194 218 145 188 264
#> CTRB2 CTRC CXCL17 DUOXA2 GDF15 GSTA1 GSTA2 KLK1
#> 320 143 57 90 259 73 74 119
#> LGALS2 LYZ MUC1 OLFM4 PDZK1IP1 PLA2G1B PNLIP PNLIPRP1
#> 37 46 150 75 144 199 136 124
#> PRSS1 PRSS2 PRSS3 PTF1A RARRES2 RBPJL REG1A REG1B
#> 241 318 195 18 82 3 481 262
#> REG3A RNASE1 SERPINA3 SPINK1 SYCN UBD ZG16
#> 103 131 259 271 124 14 14
We can now visualize the distribution of the cell sets among all the cells:
mat <- cellDistribution(cellSets, allCells = colnames(seuratObj))
basicHeatmap(mat, c('Gene', 'Cell', 'Presence'),
title='Distribution of cell sets among all cells',
palType='fillDis')
The cell sets strongly cluster in a small region of the heatmap, which indicates their significant overlap, expected since they correspond to markers of the same cell type identity.
The next step is calculating the statistical significance of the overlaps of
pairs of cell sets by using hypergeometric tests using the cellSetsOverlaps
function.
cso <- cellSetsOverlaps(cellSets, nCells=dim(seuratObj)[2])
#> Assessing gene overlaps...
! Note: The nCells
argument must be set to the total number of cells in the Seurat object, not to the number of cells in the cell sets.
We can inspect the output of this function:
head(cso)
#> gene1 gene2 ncells1 ncells2 shared_cells exp_shared_cells ratio
#> 1 AKR1C3 ALDOB 124 95 14 1.37472284 10.183871
#> 2 AKR1C3 AMY2B 124 48 2 0.69459680 2.879368
#> 3 AKR1C3 ANGPTL4 124 218 12 3.15462714 3.803936
#> 4 AKR1C3 ANPEP 124 127 10 1.83778737 5.441326
#> 5 AKR1C3 CEL 124 69 10 0.99848290 10.015194
#> 6 AKR1C3 CELA1 124 1 0 0.01447077 0.000000
#> pval
#> 1 6.388483e-11
#> 2 1.528311e-01
#> 3 7.039371e-05
#> 4 1.402596e-05
#> 5 4.621307e-08
#> 6 1.000000e+00
The number of rows (1081
) equals the number of all possible unordered pairs that can be generated from a list of 47
genes (47 * 46 / 2 = 1081
).
To perform both steps—creating cell sets and assessing overlaps of pairs of cell sets—in a single call, the generateOverlaps
function can be used:
cso2 <- generateOverlaps(geneSetExp, percentile = 90)
#> Computing percentile sets...
#> Assessing gene overlaps...
identical(cso, cso2)
#> [1] TRUE
Before scoring the cells, the top overlaps need to be identified and scored.
This step is performed in a single call using processOverlaps
:
po <- processOverlaps(cso)
#> 194 overlaps will be used in the calculation of CSOA scores.
head(po)
#> gene1 gene2 ncells1 ncells2 shared_cells exp_shared_cells ratio
#> 812 KLK1 PNLIP 119 136 36 1.888668 19.061048
#> 789 GSTA2 PNLIP 74 136 27 1.174466 22.989169
#> 765 GSTA1 PNLIP 73 136 19 1.158595 16.399174
#> 504 CPA1 PNLIP 218 136 91 3.459914 26.301234
#> 659 CTRC PNLIP 143 136 47 2.269576 20.708710
#> 782 GSTA2 KLK1 74 119 9 1.027658 8.757779
#> pval pvalAdj pvalRank ratioRank rawAggRank rank score
#> 812 1.600432e-37 7.073002e-36 19.0 5.5 12.25 1 1.0000000
#> 789 1.119819e-30 4.258412e-29 20.5 4.5 12.50 2 0.9733086
#> 765 1.386693e-18 3.964172e-17 20.5 4.5 12.50 2 0.9733086
#> 504 3.720306e-119 1.448430e-116 13.0 12.5 12.75 4 0.9458851
#> 659 1.990427e-51 1.232851e-49 17.0 8.5 12.75 4 0.9458851
#> 782 7.345269e-07 1.325707e-05 23.5 2.0 12.75 4 0.9458851
processOverlaps
sequentially performs these steps:
p-values are corrected for multiple testing and insignificant overlaps are removed.
Two preliminary overlap ranks are defined based on adjusted p-value and recorded-over-expected size ratio.
These ranks are replaced by two connectivity-based ranks, consisting of the minimum p-value/ratio rank of the vertices neighboring the edge corresponding to each overlap. The order imposed by the by average of the two ranks is the final rank
The overlap rank cutoff is defined as the highest-frequency rank, if unique, and as the average of the minimum and maximum of the highest-frequency ranks otherwise:
overlapCutoffPlot(po)
Overlaps are filtered based on the rank cutoff.
The retained overlaps are scored. If the default log
overlap scoring
method is selected, scores are assigned starting at 1
and decreasing
logarithmically towards 0
for each unique rank (though 0
is not reached but
interpreted as the score of the first non-top overlap). If the minmax
method
is selected instead, overlaps are scored through inverted minmax normalization.
scoreCells
scores the cells using an overlap data frame as input. It requires
the output of cellSetsOverlaps
(or of its wrapper generateOverlaps
), and
first calls processOverlaps
. Subsequently, it assigns a score to each cell
by summing the products of these three quantities:
The overlap score of the gene pair;
The in-cell minmax-normalized expresssion of the first gene;
The in-cell minmax-normalized expresssion of the second gene.
scoreDF <- scoreCells(geneSetExp, cso, setPairs=list(overlapPairs(cso)),
geneSetNames='CSOA_acinar')
#> 194 overlaps will be used in the calculation of CSOA scores.
#> Normalizing expression matrix by rows...
#> Computing per-cell scores for gene pairs...
#> Computing per-cell gene signature scores...
head(scoreDF)
#> CSOA_acinar
#> human1_lib1.final_cell_0001 0.7534719
#> human1_lib1.final_cell_0002 0.5009556
#> human1_lib1.final_cell_0003 0.5308047
#> human1_lib1.final_cell_0004 0.5052516
#> human1_lib1.final_cell_0005 0.5371408
#> human1_lib1.final_cell_0006 0.3647260
The attachCellScores
function attaches cell scores to the object:
seuratObj <- attachCellScores(seuratObj, scoreDF)
head(seuratObj$CSOA_acinar)
#> human1_lib1.final_cell_0001 human1_lib1.final_cell_0002
#> 0.7534719 0.5009556
#> human1_lib1.final_cell_0003 human1_lib1.final_cell_0004
#> 0.5308047 0.5052516
#> human1_lib1.final_cell_0005 human1_lib1.final_cell_0006
#> 0.5371408 0.3647260
runCSOA
runs the full CSOA pipeline, as a wrapper around generateOverlaps
,
scoreCells
and attachCellScores
. The wrapper delivers the same results as
running the three functions in succession:
geneSet <- list(CSOA_acinar = acinarMarkers)
seuratObj <- runCSOA(seuratObj, geneSet)
#> Computing percentile sets...
#> Assessing gene overlaps...
#> 194 overlaps will be used in the calculation of CSOA scores.
#> Normalizing expression matrix by rows...
#> Computing per-cell scores for gene pairs...
#> Computing per-cell gene signature scores...
identical(seuratObj[[]]$CSOA_acinar, scoreDF$CSOA_acinar)
#> [1] TRUE
sessionInfo()
#> R version 4.5.1 Patched (2025-08-23 r88802)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] Seurat_5.3.0 SeuratObject_5.2.0
#> [3] sp_2.2-0 stringr_1.5.2
#> [5] scuttle_1.19.0 scRNAseq_2.23.0
#> [7] SingleCellExperiment_1.31.1 SummarizedExperiment_1.39.2
#> [9] Biobase_2.69.1 GenomicRanges_1.61.4
#> [11] Seqinfo_0.99.2 IRanges_2.43.2
#> [13] S4Vectors_0.47.2 BiocGenerics_0.55.1
#> [15] generics_0.1.4 MatrixGenerics_1.21.0
#> [17] matrixStats_1.5.0 patchwork_1.3.2
#> [19] ggplot2_4.0.0 CSOA_0.99.2
#> [21] BiocStyle_2.37.1
#>
#> loaded via a namespace (and not attached):
#> [1] ProtGenerics_1.41.0 spatstat.sparse_3.1-0 bitops_1.0-9
#> [4] httr_1.4.7 RColorBrewer_1.1-3 tools_4.5.1
#> [7] sctransform_0.4.2 alabaster.base_1.9.5 R6_2.6.1
#> [10] HDF5Array_1.37.0 lazyeval_0.2.2 uwot_0.2.3
#> [13] rhdf5filters_1.21.0 withr_3.0.2 gridExtra_2.3
#> [16] progressr_0.16.0 cli_3.6.5 spatstat.explore_3.5-2
#> [19] fastDummies_1.7.5 labeling_0.4.3 alabaster.se_1.9.0
#> [22] sass_0.4.10 S7_0.2.0 spatstat.data_3.1-8
#> [25] ggridges_0.5.7 pbapply_1.7-4 Rsamtools_2.25.3
#> [28] dichromat_2.0-0.1 parallelly_1.45.1 RSQLite_2.4.3
#> [31] RApiSerialize_0.1.4 BiocIO_1.19.0 ica_1.0-3
#> [34] spatstat.random_3.4-1 dplyr_1.1.4 wesanderson_0.3.7
#> [37] Matrix_1.7-4 ggbeeswarm_0.7.2 abind_1.4-8
#> [40] lifecycle_1.0.4 yaml_2.3.10 rhdf5_2.53.4
#> [43] SparseArray_1.9.1 BiocFileCache_2.99.6 Rtsne_0.17
#> [46] grid_4.5.1 blob_1.2.4 promises_1.3.3
#> [49] ExperimentHub_2.99.5 crayon_1.5.3 miniUI_0.1.2
#> [52] lattice_0.22-7 beachmat_2.25.5 cowplot_1.2.0
#> [55] GenomicFeatures_1.61.6 KEGGREST_1.49.1 poibin_1.6
#> [58] magick_2.9.0 pillar_1.11.1 knitr_1.50
#> [61] rjson_0.2.23 bayesbio_1.0.0 future.apply_1.20.0
#> [64] codetools_0.2-20 glue_1.8.0 spatstat.univar_3.1-4
#> [67] data.table_1.17.8 vctrs_0.6.5 png_0.1-8
#> [70] gypsum_1.5.0 spam_2.11-1 gtable_0.3.6
#> [73] kernlab_0.9-33 cachem_1.1.0 xfun_0.53
#> [76] S4Arrays_1.9.1 mime_0.13 tidygraph_1.3.1
#> [79] survival_3.8-3 tinytex_0.57 fitdistrplus_1.2-4
#> [82] ROCR_1.0-11 nlme_3.1-168 bit64_4.6.0-1
#> [85] alabaster.ranges_1.9.1 filelock_1.0.3 RcppAnnoy_0.0.22
#> [88] GenomeInfoDb_1.45.11 bslib_0.9.0 irlba_2.3.5.1
#> [91] vipor_0.4.7 KernSmooth_2.23-26 DBI_1.2.3
#> [94] ggrastr_1.0.2 tidyselect_1.2.1 bit_4.6.0
#> [97] compiler_4.5.1 curl_7.0.0 httr2_1.2.1
#> [100] h5mread_1.1.1 textshape_1.7.5 DelayedArray_0.35.3
#> [103] plotly_4.11.0 stringfish_0.17.0 bookdown_0.44
#> [106] rtracklayer_1.69.1 scales_1.4.0 lmtest_0.9-40
#> [109] ggeasy_0.1.6 sgof_2.3.5 rappdirs_0.3.3
#> [112] digest_0.6.37 goftest_1.2-3 spatstat.utils_3.1-5
#> [115] alabaster.matrix_1.9.0 rmarkdown_2.29 XVector_0.49.1
#> [118] htmltools_0.5.8.1 pkgconfig_2.0.3 ensembldb_2.33.2
#> [121] dbplyr_2.5.1 fastmap_1.2.0 UCSC.utils_1.5.0
#> [124] rlang_1.1.6 htmlwidgets_1.6.4 shiny_1.11.1
#> [127] farver_2.1.2 jquerylib_0.1.4 zoo_1.8-14
#> [130] jsonlite_2.0.0 BiocParallel_1.43.4 RCurl_1.98-1.17
#> [133] magrittr_2.0.4 dotCall64_1.2 Rhdf5lib_1.31.0
#> [136] Rcpp_1.1.0 kerntools_1.2.0 ggnewscale_0.5.2
#> [139] viridis_0.6.5 reticulate_1.43.0 stringi_1.8.7
#> [142] alabaster.schemas_1.9.0 ggraph_2.2.2 MASS_7.3-65
#> [145] AnnotationHub_3.99.6 plyr_1.8.9 parallel_4.5.1
#> [148] listenv_0.9.1 ggrepel_0.9.6 deldir_2.0-4
#> [151] Biostrings_2.77.2 graphlayouts_1.2.2 splines_4.5.1
#> [154] tensor_1.5.1 igraph_2.1.4 spatstat.geom_3.5-0
#> [157] RcppHNSW_0.6.0 reshape2_1.4.4 BiocVersion_3.22.0
#> [160] XML_3.99-0.19 evaluate_1.0.5 RcppParallel_5.1.11-1
#> [163] BiocManager_1.30.26 tweenr_2.0.3 httpuv_1.6.16
#> [166] RANN_2.6.2 tidyr_1.3.1 purrr_1.1.0
#> [169] polyclip_1.10-7 alabaster.sce_1.9.0 qs_0.27.3
#> [172] future_1.67.0 scattermore_1.2 ggforce_0.5.0
#> [175] xtable_1.8-4 AnnotationFilter_1.33.0 restfulr_0.0.16
#> [178] RSpectra_0.16-2 later_1.4.4 viridisLite_0.4.2
#> [181] tibble_3.3.0 beeswarm_0.4.0 memoise_2.0.1
#> [184] AnnotationDbi_1.71.1 GenomicAlignments_1.45.4 cluster_2.1.8.1
#> [187] globals_0.18.0