1 Introduction

This document describes some important parameters of the UCell algorithm, and how they can be adapted depending on your dataset. Here we will use single-cell data stored in a Seurat object, but the same considerations apply to SingleCellExperiment or matrix input formats.

2 Load example dataset

For this demo, we will download a single-cell dataset of lung cancer (Zilionis et al. (2019) Immunity) through the scRNAseq package. This dataset contains >170,000 single cells; for the sake of simplicity, in this demo we will focus on immune cells, according to the annotations by the authors, and downsample to 5000 cells.

library(scRNAseq)
library(ggplot2)

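# Download the dataset as a SingleCellExperiment object
lung <- ZilionisLungData()
# Keep only cells flagged as used by the authors and included in the immune subset
immune <- lung$Used & lung$used_in_NSCLC_immune
lung <- lung[,immune]
# Keep the first 5000 cells to speed up the demo
lung <- lung[,1:5000]

# Store the counts as a sparse matrix and make cell names unique
exp.mat <- Matrix::Matrix(counts(lung),sparse = TRUE)
colnames(exp.mat) <- paste0(colnames(exp.mat), seq(1,ncol(exp.mat)))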

Save it as a Seurat object

library(Seurat)

seurat.object <- CreateSeuratObject(counts = exp.mat, 
                                    project = "Zilionis_immune")
seurat.object <- NormalizeData(seurat.object)

Note: because UCell scores are based on relative gene ranks, they can be calculated on either raw counts or normalized data. As long as the normalization preserves the relative ranks between genes within each cell, the results will be equivalent.
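
The point about ranks can be checked on a toy example. The sketch below uses a made-up count vector for a single cell and a log-normalization of the same form as a library-size scaling followed by log1p; the vector and the gene-specific scaling are assumptions for illustration only, not part of the UCell API.

```r
# Toy counts for one cell (hypothetical values)
counts <- c(0, 3, 1, 0, 7, 2, 1, 0, 12, 5)

# Library-size scaling followed by log1p is monotonic within a cell,
# so the ranking of genes is unchanged
log_norm <- log1p(counts / sum(counts) * 1e4)
same_ranks <- identical(rank(-counts), rank(-log_norm))

# A gene-specific scaling (each gene divided by a different factor)
# is not monotonic within a cell and can reorder genes
gene_scaled <- counts / seq_along(counts)
changed_ranks <- !identical(rank(-counts), rank(-gene_scaled))

same_ranks     # TRUE
changed_ranks  # TRUE
```

Transformations of the second kind, such as scaling each gene across cells, would therefore be expected to change the scores.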

3 Parameters

3.1 Positive and negative gene sets in signatures

UCell supports positive and negative gene sets within a signature. Simply append + or - signs to the genes to include them in positive and negative sets, respectively. For example:

signatures <- list(
    CD8T = c("CD8A+","CD8B+","CD4-"),
    CD4 = c("TRAC+","CD4+","CD40LG+","CD8A-","CD8B-"),
    NK = c("KLRD1+","NCR1+","NKG7+","CD3D-","CD3E-")
)

UCell evaluates the positive and negative gene sets separately, then subtracts the negative-set score from the positive-set score. The parameter w_neg controls the weight of the negative gene set relative to the positive set (w_neg=1.0 means equal weight). Note that negative combined scores are clipped to zero, so that UCell scores stay in the [0, 1] range.
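
As a rough illustration of this logic, the sketch below splits a signed signature into its two gene sets and combines two partial scores. The helpers split_signature and combine_scores are hypothetical names written for this example, and the partial scores are made-up numbers; the package's internal implementation may differ in detail.

```r
# Hypothetical helper: split a signed signature into positive and negative sets.
# Genes without a sign are treated as positive.
split_signature <- function(genes) {
    is_neg <- grepl("-$", genes)
    bare <- sub("[+-]$", "", genes)
    list(pos = bare[!is_neg], neg = bare[is_neg])
}

# Hypothetical helper: weighted difference of the two partial scores,
# clipped at zero so the result stays in [0, 1]
combine_scores <- function(u_pos, u_neg, w_neg = 1) {
    pmax(0, u_pos - w_neg * u_neg)
}

sets <- split_signature(c("CD8A+", "CD8B+", "CD4-"))
sets$pos  # "CD8A" "CD8B"
sets$neg  # "CD4"

combine_scores(u_pos = 0.75, u_neg = 0.25)               # 0.5
combine_scores(u_pos = 0.25, u_neg = 0.75)               # 0 (clipped)
combine_scores(u_pos = 0.75, u_neg = 0.25, w_neg = 0.5)  # 0.625
```

Lowering w_neg makes the negative set less punishing, which can help when the negative markers are noisy.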

library(UCell)

seurat.object <- AddModuleScore_UCell(seurat.object, features=signatures, 
                                      w_neg = 1.0, name = NULL)

scores <- seurat.object[[names(signatures)]]
head(scores,15)
##              CD8T        CD4       NK
## bcHTNA1  0.000000 0.14975523 0.000000
## bcHNVA2  0.000000 0.02503338 0.000000
## bcALZN3  0.000000 0.00000000 0.000000
## bcFWBP4  0.000000 0.00000000 0.000000
## bcBJYE5  0.000000 0.28627058 0.000000
## bcGSBJ6  0.000000 0.00000000 0.000000
## bcHQGJ7  0.000000 0.00000000 0.000000
## bcHKKM8  0.000000 0.21161549 0.000000
## bcIGQU9  0.000000 0.28649310 0.000000
## bcDVGG10 0.000000 0.21384068 0.000000
## bcEPCC11 0.707374 0.00000000 0.000000
## bcDOHD12 0.000000 0.00000000 0.000000
## bcFPZF13 0.000000 0.27403204 0.274032
## bcHRXV14 0.000000 0.00000000 0.000000
## bcFGME15 0.000000 0.31753449 0.000000

3.2 The maxRank parameter

Single-cell data are sparse: for any given cell, only a few hundred to a few thousand genes (out of tens of thousands) are detected with at least one UMI count. Because UCell scores are based on ranking genes by their expression values, it is essential to account for data sparsity when calculating ranks. This is implemented by capping ranks to a maxRank parameter. In other words, only the top maxRank genes are ranked, and the rest are assumed equivalent at the lowest ranking value.
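
The effect of capping can be seen on a toy cell. In the sketch below, the expression vector and the tiny cap max_rank are made up for illustration, and the lowest ranking value is taken to be the cap itself, which is an assumption rather than a statement about the package internals.

```r
# Toy expression vector for one cell: 10 genes, only 4 detected
expr <- c(A = 5, B = 0, C = 2, D = 0, E = 9, F = 0, G = 1, H = 0, I = 0, J = 0)
max_rank <- 5  # hypothetical cap, tiny to suit the toy example

# Rank 1 is the most highly expressed gene; undetected genes tie at 7.5
ranks <- rank(-expr, ties.method = "average")

# Every rank beyond the cap collapses to the same value (assumed: the cap)
capped <- pmin(ranks, max_rank)
capped
```

After capping, the four detected genes keep ranks 1 to 4 and all undetected genes share the same lowest rank, so the arbitrary ordering among zeros no longer influences the score.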

It is often useful to adjust maxRank depending on the sparsity of your dataset. A good rule of thumb is to examine the median number of expressed genes per cell, and set maxRank to a value of that order of magnitude. For example, for the test dataset:

VlnPlot(seurat.object, features="nFeature_RNA", pt.size = 0, log = TRUE)

This dataset has relatively low depth, so it is advisable to choose a maxRank around 800-1000 (instead of the default of 1500):

seurat.object <- AddModuleScore_UCell(seurat.object, features=signatures,
                                      maxRank=1000)

This is even more important when applying UCell to technologies or modalities of much lower dimensionality, for example probe-based spatial transcriptomics data (e.g. Xenium, CosMx) or antibody-derived tags (ADT) in CITE-seq experiments. Xenium panels contain a few hundred to a few thousand genes, and CITE-seq can detect a few hundred proteins, as opposed to the thousands of genes detected by scRNA-seq. The maxRank parameter should then be adapted to the new dimensionality and set to at most the number of probes in the panel.
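
The rule of thumb above can be turned into a small starting-point calculation. The helper suggest_max_rank below is hypothetical (it is not a UCell function) and the count matrix is a made-up toy panel; it simply takes the median number of detected features per cell and never exceeds the number of features in the panel. For a sparse matrix, Matrix::colSums would replace colSums.

```r
# Hypothetical helper: median number of detected features per cell,
# capped at the number of features in the matrix
suggest_max_rank <- function(counts) {
    detected <- colSums(counts > 0)
    min(nrow(counts), max(1, round(median(detected))))
}

# Toy panel: 6 features (rows) x 4 cells (columns)
counts <- matrix(
    c(3, 0, 1, 0, 0, 2,
      0, 0, 5, 1, 0, 0,
      1, 1, 1, 0, 0, 0,
      0, 4, 0, 0, 2, 1),
    nrow = 6,
    dimnames = list(paste0("probe", 1:6), paste0("cell", 1:4))
)

suggested <- suggest_max_rank(counts)
suggested  # 3
```

The value is only a starting point: the text above recommends the right order of magnitude, not an exact number, so rounding to a convenient value is reasonable.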

3.3 Handling missing genes

If a subset of the genes in your signature are absent from the count matrix, how should they be handled?

UCell offers two alternative ways of handling missing genes:

  • missing_genes="impute" (default): it assumes that absence from the count matrix means zero expression. All values for this gene are imputed to zero. This can sometimes be the case for processed scRNA-seq datasets deposited in public repositories, where poorly detected genes are often dropped from the count matrix.
  • missing_genes="skip": simply exclude all missing genes from the signatures; they won’t contribute to the scores.

Here’s an example with a missing gene:

signatures <- list(
    Myeloid = c("LYZ","CSF1R","not_a_gene")
)

seurat.object <- AddModuleScore_UCell(seurat.object, features=signatures,
                                      missing_genes="impute")
scores1 <- seurat.object$Myeloid_UCell

seurat.object <- AddModuleScore_UCell(seurat.object, features=signatures,
                                      missing_genes="skip")
scores2 <- seurat.object$Myeloid_UCell

scores <- cbind(scores1, scores2)
head(scores)
##           scores1   scores2
## bcHTNA1 0.3319982 0.4978312
## bcHNVA2 0.5263685 0.7892893
## bcALZN3 0.3333333 0.4998332
## bcFWBP4 0.0000000 0.0000000
## bcBJYE5 0.2078327 0.3116450
## bcGSBJ6 0.4755229 0.7130464
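
The difference between the two modes can be sketched with a toy rank-based score. The function ucell_like_score below is a hypothetical stand-in, not the package code: it caps ranks at max_rank, computes a Mann-Whitney-style U statistic, and rescales it by the largest attainable U so that a cell expressing none of the signature genes scores 0. This normalization is an assumption, chosen because it is consistent with the values printed above; the ranks used are also hypothetical.

```r
# Hypothetical stand-in for a rank-based signature score (assumed formula)
ucell_like_score <- function(ranks, max_rank = 1500) {
    n <- length(ranks)
    r <- pmin(ranks, max_rank)               # cap ranks at max_rank
    u <- sum(r) - n * (n + 1) / 2            # U statistic from the rank sum
    u_max <- n * max_rank - n * (n + 1) / 2  # U when no signature gene is detected
    1 - u / u_max
}

# Hypothetical cell: LYZ is the 2nd most expressed gene, CSF1R is not detected
detected <- c(LYZ = 2, CSF1R = Inf)

# "impute": the missing gene enters the signature with zero expression
impute <- ucell_like_score(c(detected, not_a_gene = Inf))
# "skip": the missing gene is dropped from the signature
skip <- ucell_like_score(detected)

round(c(impute = impute, skip = skip), 7)
##    impute      skip 
## 0.3333333 0.4998332
```

Under these assumptions, an imputed gene always contributes the worst possible rank, so "impute" can only lower a score relative to "skip".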

3.4 Chunk size

UCell scores are calculated individually for each cell (though they may be later smoothed by nearest-neighbor similarity). This means that computation can be easily split into batches, reducing the computational footprint of gene ranking and enabling parallel processing (see below). The size of the batches is controlled by the chunk.size parameter. Large chunks take up more RAM, while small chunk sizes have large overhead from dataset splitting and merging. A sweet spot for chunk.size is usually in the order of 100-1000 cells per batch.

seurat.object <- AddModuleScore_UCell(seurat.object, features=signatures,
                                      chunk.size=500)
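
To make the batching idea concrete, the sketch below splits cell indices into chunks of a given size, processes each chunk independently, and recombines the results in the original order. The function toy_score is a placeholder for per-cell scoring, and the splitting strategy is an assumption for illustration, not the package's internal code.

```r
n_cells <- 5000
chunk_size <- 500

# Assign consecutive cells to chunks of at most chunk_size cells
chunks <- split(seq_len(n_cells), ceiling(seq_len(n_cells) / chunk_size))
length(chunks)  # 10

# Placeholder for per-cell scoring: each chunk is processed on its own
toy_score <- function(idx) idx / n_cells

# Process chunk by chunk, then merge back in the original cell order
scores <- unlist(lapply(chunks, toy_score), use.names = FALSE)
```

Because each chunk is independent, the lapply call is the natural place to swap in a parallel backend.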

3.5 Parallelization

If your machine has multi-core capabilities and enough RAM, running UCell in parallel can considerably speed up your analysis. The example below runs on a single core; you may modify this behavior by setting e.g. workers=8 to parallelize over 8 processes:

BPPARAM <- BiocParallel::MulticoreParam(workers=1)

seurat.object <- AddModuleScore_UCell(seurat.object, features=signatures,
                                      BPPARAM=BPPARAM)
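
When choosing the number of workers, it can help to check how many cores are actually available. The helper choose_workers below is a hypothetical convenience function based on the parallel package that ships with R; leaving one core free is a design choice of this sketch, not a UCell requirement. Note also that MulticoreParam relies on forking, which is not supported on Windows, where BiocParallel::SnowParam is the usual alternative.

```r
# Hypothetical helper: use at most `requested` workers, leaving one core free
choose_workers <- function(requested = 8) {
    available <- parallel::detectCores()
    if (is.na(available)) available <- 1L  # detectCores() may return NA
    max(1L, min(as.integer(requested), available - 1L))
}

workers <- choose_workers(8)
# The result could then be passed on, e.g.
# BPPARAM <- BiocParallel::MulticoreParam(workers = workers)
```

Each worker holds its own chunk of ranked data in memory, so the worker count may also need to be reduced when RAM is limited.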

3.6 Signature score smoothing

To mitigate sparsity in single-cell data, it can be useful to ‘impute’ scores from neighboring cells. The function SmoothKNN smooths single-cell scores by taking a weighted average over the k-nearest neighbors in a given dimensionality reduction. A crucial parameter is the number of neighbors k used for smoothing. A small k only borrows from very close neighbors, whereas a large k takes weighted averages over large portions of transcriptional space:

seurat.object <- NormalizeData(seurat.object)
seurat.object <- FindVariableFeatures(seurat.object, 
                     selection.method = "vst", nfeatures = 500)
  
seurat.object <- ScaleData(seurat.object)
seurat.object <- RunPCA(seurat.object, npcs = 20, 
                        features=VariableFeatures(seurat.object)) 
seurat.object <- RunUMAP(seurat.object, reduction = "pca", 
                         dims = 1:20, seed.use=123)
signatures <- list(
    Tcell = c("CD3D","CD3E","CD3G","CD2","TRAC"),
    Myeloid = c("CD14","LYZ","CSF1R","FCER1G","SPI1","LCK-"),
    NK = c("KLRD1","NCR1","NKG7","CD3D-","CD3E-"),
    Plasma_cell = c("MZB1","DERL3","CD19-")
)

seurat.object <- AddModuleScore_UCell(seurat.object, features=signatures,
                                      name=NULL)
seurat.object <- SmoothKNN(seurat.object, reduction="pca",
                           signature.names = names(signatures),
                           k=3, suffix = "_kNN3")

seurat.object <- SmoothKNN(seurat.object, reduction="pca",
                           signature.names = names(signatures),
                           k=100, suffix = "_kNN100")
FeaturePlot(seurat.object, reduction = "umap",
            features = c("Tcell","Tcell_kNN3")) &
  theme(aspect.ratio = 1)

FeaturePlot(seurat.object, reduction = "umap",
            features = c("Tcell","Tcell_kNN100")) &
  theme(aspect.ratio = 1)

The decay parameter controls the relative influence of close vs. distant neighbors. Lower the decay parameter to increase the weight of distant neighbors; increase it to give higher weight to close neighbors:

seurat.object <- SmoothKNN(seurat.object, reduction="pca",
                           signature.names = names(signatures),
                           k=100, decay=0.001, suffix = "_decay0.001")

seurat.object <- SmoothKNN(seurat.object, reduction="pca",
                           signature.names = names(signatures),
                           k=100, decay=0.5, suffix = "_decay0.5")
FeaturePlot(seurat.object, reduction = "umap",
            features = c("Tcell_decay0.5","Tcell_decay0.001")) &
  theme(aspect.ratio = 1)
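
The interplay of k and decay can be illustrated with a small base R sketch. The function smooth_knn_sketch is hypothetical and is not the SmoothKNN implementation: it assumes that the i-th nearest neighbor receives weight (1 - decay)^i, with the cell itself at i = 0, and it uses a toy one-dimensional embedding with made-up scores.

```r
# Hypothetical sketch of kNN smoothing with exponentially decaying weights
smooth_knn_sketch <- function(scores, embedding, k = 3, decay = 0.1) {
    d <- as.matrix(dist(embedding))           # pairwise Euclidean distances
    vapply(seq_along(scores), function(i) {
        nn <- order(d[i, ])[seq_len(k + 1)]   # the cell itself plus k neighbors
        w <- (1 - decay)^(seq_along(nn) - 1)  # assumed weights: 1 for self, then decaying
        sum(w * scores[nn]) / sum(w)          # weighted average
    }, numeric(1))
}

# Toy data: 6 cells on a line, two groups with scores 0 and 1
embedding <- cbind(x = 1:6, y = 0)
scores <- c(0, 0, 0, 1, 1, 1)

low_decay <- smooth_knn_sketch(scores, embedding, k = 3, decay = 0.001)
high_decay <- smooth_knn_sketch(scores, embedding, k = 3, decay = 0.5)
unchanged <- smooth_knn_sketch(scores, embedding, k = 3, decay = 1)
```

With a low decay, all neighbors contribute almost equally and the boundary between the two groups is blurred more strongly; with decay = 1 only the cell itself has non-zero weight, so the scores are returned unchanged.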

4 Resources

Please report any issues at the UCell GitHub repository.

More demos available on the Bioc landing page and at the UCell demo repository.

If you find UCell useful, you may also check out the scGate package, which relies on UCell scores to automatically purify populations of interest based on gene signatures.

See also SignatuR for easy storing and retrieval of gene signatures.

5 References

Appendix

  • Andreatta, M., Carmona, S. J. (2021). UCell: Robust and scalable single-cell gene signature scoring. Computational and Structural Biotechnology Journal.
  • Zilionis, R., Engblom, C., …, Klein, A. M. (2019). Single-Cell Transcriptomics of Human and Mouse Lung Cancers Reveals Conserved Myeloid Populations across Individuals and Species. Immunity.
  • Hao, Y., et al. (2021). Integrated analysis of multimodal single-cell data. Cell.

A Session Info

sessionInfo()
## R version 4.5.1 Patched (2025-08-23 r88802)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] patchwork_1.3.2             ggplot2_4.0.0              
##  [3] UCell_2.13.3                Seurat_5.3.0               
##  [5] SeuratObject_5.2.0          sp_2.2-0                   
##  [7] scRNAseq_2.23.0             SingleCellExperiment_1.31.1
##  [9] SummarizedExperiment_1.39.2 Biobase_2.69.1             
## [11] GenomicRanges_1.61.5        Seqinfo_0.99.2             
## [13] IRanges_2.43.5              S4Vectors_0.47.4           
## [15] BiocGenerics_0.55.1         generics_0.1.4             
## [17] MatrixGenerics_1.21.0       matrixStats_1.5.0          
## [19] BiocStyle_2.37.1           
## 
## loaded via a namespace (and not attached):
##   [1] RcppAnnoy_0.0.22         splines_4.5.1            later_1.4.4             
##   [4] BiocIO_1.19.0            bitops_1.0-9             filelock_1.0.3          
##   [7] tibble_3.3.0             polyclip_1.10-7          XML_3.99-0.19           
##  [10] fastDummies_1.7.5        lifecycle_1.0.4          httr2_1.2.1             
##  [13] globals_0.18.0           lattice_0.22-7           ensembldb_2.33.2        
##  [16] MASS_7.3-65              alabaster.base_1.9.5     magrittr_2.0.4          
##  [19] plotly_4.11.0            sass_0.4.10              rmarkdown_2.30          
##  [22] jquerylib_0.1.4          yaml_2.3.10              httpuv_1.6.16           
##  [25] sctransform_0.4.2        spam_2.11-1              spatstat.sparse_3.1-0   
##  [28] reticulate_1.43.0        cowplot_1.2.0            pbapply_1.7-4           
##  [31] DBI_1.2.3                RColorBrewer_1.1-3       abind_1.4-8             
##  [34] Rtsne_0.17               purrr_1.1.0              AnnotationFilter_1.33.0 
##  [37] RCurl_1.98-1.17          rappdirs_0.3.3           ggrepel_0.9.6           
##  [40] irlba_2.3.5.1            spatstat.utils_3.2-0     listenv_0.9.1           
##  [43] alabaster.sce_1.9.0      goftest_1.2-3            RSpectra_0.16-2         
##  [46] spatstat.random_3.4-2    fitdistrplus_1.2-4       parallelly_1.45.1       
##  [49] codetools_0.2-20         DelayedArray_0.35.3      tidyselect_1.2.1        
##  [52] UCSC.utils_1.5.0         farver_2.1.2             spatstat.explore_3.5-3  
##  [55] BiocFileCache_2.99.6     GenomicAlignments_1.45.4 jsonlite_2.0.0          
##  [58] BiocNeighbors_2.3.1      progressr_0.16.0         ggridges_0.5.7          
##  [61] survival_3.8-3           tools_4.5.1              ica_1.0-3               
##  [64] Rcpp_1.1.0               glue_1.8.0               gridExtra_2.3           
##  [67] SparseArray_1.9.1        xfun_0.53                GenomeInfoDb_1.45.12    
##  [70] dplyr_1.1.4              HDF5Array_1.37.0         gypsum_1.5.0            
##  [73] withr_3.0.2              BiocManager_1.30.26      fastmap_1.2.0           
##  [76] rhdf5filters_1.21.0      digest_0.6.37            R6_2.6.1                
##  [79] mime_0.13                scattermore_1.2          tensor_1.5.1            
##  [82] spatstat.data_3.1-8      dichromat_2.0-0.1        RSQLite_2.4.3           
##  [85] h5mread_1.1.1            tidyr_1.3.1              data.table_1.17.8       
##  [88] rtracklayer_1.69.1       htmlwidgets_1.6.4        httr_1.4.7              
##  [91] S4Arrays_1.9.1           uwot_0.2.3               pkgconfig_2.0.3         
##  [94] gtable_0.3.6             blob_1.2.4               lmtest_0.9-40           
##  [97] S7_0.2.0                 XVector_0.49.1           htmltools_0.5.8.1       
## [100] dotCall64_1.2            bookdown_0.44            ProtGenerics_1.41.0     
## [103] scales_1.4.0             alabaster.matrix_1.9.0   png_0.1-8               
## [106] spatstat.univar_3.1-4    knitr_1.50               reshape2_1.4.4          
## [109] rjson_0.2.23             nlme_3.1-168             curl_7.0.0              
## [112] cachem_1.1.0             zoo_1.8-14               rhdf5_2.53.5            
## [115] stringr_1.5.2            BiocVersion_3.22.0       KernSmooth_2.23-26      
## [118] vipor_0.4.7              parallel_4.5.1           miniUI_0.1.2            
## [121] AnnotationDbi_1.71.1     ggrastr_1.0.2            restfulr_0.0.16         
## [124] pillar_1.11.1            grid_4.5.1               alabaster.schemas_1.9.0 
## [127] vctrs_0.6.5              RANN_2.6.2               promises_1.3.3          
## [130] dbplyr_2.5.1             xtable_1.8-4             cluster_2.1.8.1         
## [133] beeswarm_0.4.0           evaluate_1.0.5           magick_2.9.0            
## [136] tinytex_0.57             GenomicFeatures_1.61.6   cli_3.6.5               
## [139] compiler_4.5.1           Rsamtools_2.25.3         rlang_1.1.6             
## [142] crayon_1.5.3             future.apply_1.20.0      labeling_0.4.3          
## [145] ggbeeswarm_0.7.2         plyr_1.8.9               stringi_1.8.7           
## [148] deldir_2.0-4             viridisLite_0.4.2        alabaster.se_1.9.0      
## [151] BiocParallel_1.43.4      Biostrings_2.77.2        lazyeval_0.2.2          
## [154] spatstat.geom_3.6-0      Matrix_1.7-4             ExperimentHub_2.99.5    
## [157] RcppHNSW_0.6.0           bit64_4.6.0-1            future_1.67.0           
## [160] Rhdf5lib_1.31.0          KEGGREST_1.49.1          shiny_1.11.1            
## [163] alabaster.ranges_1.9.1   AnnotationHub_3.99.6     ROCR_1.0-11             
## [166] igraph_2.1.4             memoise_2.0.1            bslib_0.9.0             
## [169] bit_4.6.0