---
title: "CSOA"
author: "Andrei-Florian Stoica"
package: CSOA
date: August 13, 2025
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Getting started with CSOA}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include=FALSE}
knitr::opts_chunk$set(
    collapse=TRUE,
    comment="#>"
)
```

# Introduction

CSOA is a tool for scoring gene set signatures in scRNA-seq data. It constructs, for each input gene, a set of cells highly expressing the gene, then assesses all pairs of cell sets for statistical significance. The significant overlaps are ranked, and the top-ranked overlaps receive scores used to calculate per-cell CSOA scores.

# Installation

To install CSOA, run the following commands in an R session:

```{r setup, eval=FALSE}
if (!require("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("CSOA")
```

# Prerequisites

In addition to CSOA, you need to install [patchwork](https://patchwork.data-imaginist.com/index.html), [scRNAseq](https://bioconductor.org/packages/release/data/experiment/html/scRNAseq.html) and [scuttle](https://www.bioconductor.org/packages/release/bioc/html/scuttle.html) for this tutorial.

# Scoring gene sets

This tutorial uses an scRNA-seq human pancreas dataset with provided cell type annotations. After loading the required packages, download the dataset using the `BaronPancreasData` function from `scRNAseq`. The dataset will be stored as a `SingleCellExperiment` object.

CSOA requires the input scRNA-seq data to be normalized and log-transformed. We will employ the `logNormCounts` function from `scuttle` for this step.

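
As a rough illustration of what this normalization step involves, the chunk below mimics the default behaviour of `logNormCounts` as we understand it (library-size factors scaled to a mean of 1, followed by a log2 transformation with a pseudo-count of 1) on a toy count matrix in base R. The matrix and all variable names are made up for illustration; the tutorial itself relies on `scuttle` for this step.

```{r}
# Toy count matrix (genes x cells); values and names are hypothetical.
# matrix() fills column-wise, so each group of three values is one cell.
toyCounts <- matrix(c(4, 0, 2,
                      8, 2, 6,
                      0, 1, 1),
                    nrow = 3,
                    dimnames = list(paste0('gene', 1:3),
                                    paste0('cell', 1:3)))

# Library-size factors: total counts per cell, scaled to have mean 1
sizeFactors <- colSums(toyCounts) / mean(colSums(toyCounts))

# Divide each cell (column) by its size factor, add a pseudo-count of 1
# and log2-transform
toyLogCounts <- log2(sweep(toyCounts, 2, sizeFactors, '/') + 1)
round(toyLogCounts, 2)
```

Zero counts remain zero after the transformation, and cells with larger libraries are scaled down before the logarithm is taken.
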
```{r message=FALSE, warning=FALSE, results=FALSE}
library(CSOA)
library(ggplot2)
library(patchwork)
library(scRNAseq)
library(scuttle)
library(Seurat)

sceObj <- BaronPancreasData('human')
sceObj <- logNormCounts(sceObj)
```

We can take a look at the cell types found in this dataset:

```{r}
table(colData(sceObj)$label)
```

Next, we will define gene sets of acinar, alpha, ductal and gamma markers based on [PanglaoDB](https://www.panglaodb.se/markers.html). The gene sets must be provided as a named list of character vectors. The names of the list will be used as column names when storing the results.

```{r}
acinarMarkers <- c('PRSS1', 'KLK1', 'CTRC', 'PNLIP', 'AKR1C3', 'CTRB1',
                   'DUOXA2', 'ALDOB', 'REG3A', 'SERPINA3', 'PRSS3', 'REG1B',
                   'CFB', 'GDF15', 'MUC1', 'ANPEP', 'ANGPTL4', 'OLFM4',
                   'GSTA1', 'LGALS2', 'PDZK1IP1', 'RARRES2', 'CXCL17', 'UBD',
                   'GSTA2', 'LYZ', 'RBPJL', 'PTF1A', 'CELA3A', 'SPINK1',
                   'ZG16', 'CEL', 'CELA2A', 'CPB1', 'CELA1', 'PNLIPRP1',
                   'RNASE1', 'AMY2B', 'CPA2', 'CPA1', 'CELA3B', 'CTRB2',
                   'PLA2G1B', 'PRSS2', 'CLPS', 'REG1A', 'SYCN')

alphaMarkers <- c('GCG', 'TTR', 'PCSK2', 'FXYD5', 'LDB2', 'MAFB', 'CHGA',
                  'SCGB2A1', 'GLS', 'FAP', 'DPP4', 'GPR119', 'PAX6',
                  'NEUROD1', 'LOXL4', 'PLCE1', 'GC', 'KLHL41', 'FEV',
                  'PTGER3', 'RFX6', 'SMARCA1', 'PGR', 'IRX1', 'UCP2', 'RGS4',
                  'KCNK16', 'GLP1R', 'ARX', 'POU3F4', 'RESP18', 'PYY',
                  'SLC38A5', 'TM4SF4', 'CRYBA2', 'SH3GL2', 'PCSK1', 'PRRG2',
                  'IRX2', 'ALDH1A1', 'PEMT', 'SMIM24', 'F10', 'SCGN',
                  'SLC30A8')

ductalMarkers <- c('CFTR', 'SERPINA5', 'SLPI', 'TFF1', 'CFB', 'LGALS4',
                   'CTSH', 'PERP', 'PDLIM3', 'WFDC2', 'SLC3A1', 'AQP1',
                   'ALDH1A3', 'VTCN1', 'KRT19', 'TFF2', 'KRT7', 'CLDN4',
                   'LAMB3', 'TACSTD2', 'CCL2', 'DCDC2', 'CXCL2', 'CLDN10',
                   'HNF1B', 'KRT20', 'MUC1', 'ONECUT1', 'AMBP', 'HHEX',
                   'ANXA4', 'SPP1', 'PDX1', 'SERPINA3', 'GDF15', 'AKR1C3',
                   'MMP7', 'DEFB1', 'SERPING1', 'TSPAN8', 'CLDN1', 'S100A10',
                   'PIGR')

gammaMarkers <- c('PPY', 'ABCC9', 'FGB', 'ZNF503', 'MEIS1', 'LMO3', 'EGR3',
                  'CHN2', 'PTGFR', 'ENTPD2', 'AQP3',
                  'THSD7A', 'CARTPT', 'ISL1', 'PAX6', 'NEUROD1', 'APOBEC2',
                  'SEMA3E', 'SLITRK6', 'SERTM1', 'PXK', 'PPY2P', 'ETV1',
                  'ARX', 'CMTM8', 'SCGB2A1', 'FXYD2', 'SCGN')

geneSets <- list(acinarMarkers, alphaMarkers, ductalMarkers, gammaMarkers)
names(geneSets) <- c('CSOA_acinar', 'CSOA_alpha', 'CSOA_ductal', 'CSOA_gamma')
```

Before running CSOA, we will convert the `SingleCellExperiment` object to a `Seurat` object so that we can use Seurat's `VlnPlot` function for visualization.

```{r, warning=FALSE}
seuratObj <- as.Seurat(sceObj)
seuratObj <- runCSOA(seuratObj, geneSets)
```

We can now display the results:

```{r, warning=FALSE, out.height='100%', out.width='100%', fig.height=8, fig.width=8}
plots <- lapply(names(geneSets), function(x) {
    VlnPlot(seuratObj, x, group.by = 'label') +
        NoLegend() +
        theme(axis.text = element_text(size=10),
              axis.title.x = element_blank(),
              plot.title = element_text(size=11))
})
(plots[[1]] | plots[[2]]) / (plots[[3]] | plots[[4]])
```

Alternatively, CSOA can be run on a `SingleCellExperiment` object. The results will be stored as columns in the object's `colData`, one column per gene set:

```{r}
sceObj <- runCSOA(sceObj, geneSets)
```

We verify that the results are identical to those stored in the `Seurat` object:

```{r, message=FALSE}
vapply(names(geneSets), function(x)
    identical(seuratObj[[]][[x]], colData(sceObj)[[x]]), logical(1))
```

Additionally, CSOA can be run directly on an expression matrix. CSOA's computations require a dense expression matrix, as opposed to the sparse matrix class (`dgCMatrix`) used by `Seurat` and `SingleCellExperiment`. The `expMat` function satisfies this requirement: it extracts the log-transformed and normalized expression matrix (`data` for `Seurat`, `logcounts` for `SingleCellExperiment`) from the input object and converts it to a dense matrix.

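
To make the sparse-versus-dense distinction concrete, the chunk below is a minimal sketch on toy data. It is not CSOA's internal code: it assumes the `Matrix` package is installed, and every object and gene name in it is hypothetical. It builds a small `dgCMatrix`, keeps only the rows belonging to some gene set, and converts the result to an ordinary dense matrix with `as.matrix`.

```{r}
library(Matrix)

# Toy sparse expression matrix (genes x cells); all names are hypothetical
toySparse <- sparseMatrix(
    i = c(1, 2, 2, 3, 4),
    j = c(1, 1, 3, 2, 3),
    x = c(1.2, 0.5, 2.1, 0.8, 1.7),
    dims = c(4, 3),
    dimnames = list(paste0('gene', 1:4), paste0('cell', 1:3)))

toyGeneSets <- list(setA = c('gene1', 'gene2'),
                    setB = c('gene2', 'gene4'))

# Keep only the genes that occur in at least one gene set, then densify
keptGenes <- intersect(rownames(toySparse), unique(unlist(toyGeneSets)))
toyDense <- as.matrix(toySparse[keptGenes, , drop = FALSE])

class(toySparse)
dim(toyDense)
```

Restricting the matrix to gene-set genes before densifying keeps the dense object small, which matters because a dense matrix stores every zero explicitly.
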
We apply `expMat` to the `Seurat` object:

```{r}
geneSetExp <- expMat(seuratObj)
```

**Note**: `runCSOA` also accepts the sparse matrix class used by `Seurat` and `SingleCellExperiment` (`dgCMatrix`) as input. Internally, `runCSOA` filters the `dgCMatrix` so that it contains only the genes present in the input gene sets, then densifies the matrix using `expMat`.

When running directly on an expression matrix, `runCSOA` will return a list in which the first element is the expression matrix and the second is the data frame of CSOA scores.

```{r, message=FALSE}
res <- runCSOA(geneSetExp, geneSets)
head(res[[2]])
```

The results are identical to those obtained earlier for the `Seurat` object:

```{r}
vapply(names(geneSets), function(x)
    identical(seuratObj[[]][[x]], res[[2]][[x]]), logical(1))
```

# Session information {-}

```{r}
sessionInfo()
```