Over the past decade, advances in single-cell RNA-sequencing (scRNA-seq) technologies have significantly increased the sensitivity and specificity with which cellular transcriptional dynamics can be analyzed. Further, parallel increases in the number of cells which can be simultaneously sequenced have allowed for novel analysis pipelines, including the description of transcriptional trajectories and the discovery of rare sub-populations of cells. The development of droplet-based, unique-molecular-identifier (UMI) protocols such as Drop-seq, inDrop, and the 10x Genomics Chromium platform has significantly contributed to these advances. In particular, the commercially available 10x Genomics platform has allowed the rapid and cost-effective gene expression profiling of hundreds to tens of thousands of cells across many studies to date.
The use of UMIs in the 10x Genomics and related platforms has
augmented these developments in sequencing technology by tagging
individual mRNA transcripts with unique cell- and transcript-specific
identifiers. In this way, biases due to transcript length and PCR
amplification have been significantly reduced. However, technical
variability in sequencing depth remains and, consequently, normalization
to adjust for sequencing depth is required to ensure accurate downstream
analyses. To address this, we introduce Dino, an
R package implementing the Dino
normalization method.
Dino utilizes a flexible mixture of Negative Binomials
model of gene expression to reconstruct full gene-specific expression
distributions which are independent of sequencing depth. By giving exact
zeros positive probability, the Negative Binomial components are
applicable to shallow sequencing (high proportions of zeros).
Additionally, the mixture component is robust to cell heterogeneity as
it accommodates multiple centers of gene expression in the distribution.
By directly modeling (possibly heterogeneous) gene-specific expression
distributions, Dino outperforms competing approaches, especially for
datasets in which the proportion of zeros is high, as is typical for
modern, UMI-based protocols.
Dino does not attempt to correct for batch or other
sample-specific effects, and will only do so to the extent that they are
correlated with sequencing depth. In situations where batch effects are
expected, downstream analysis may benefit from a separate batch-correction step.
Dino is now available on Bioconductor and
can be easily installed from that repository by running:
# Install BiocManager if not present, skip otherwise
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# Install Dino package
BiocManager::install("Dino")
# View (this) vignette from R
browseVignettes("Dino")
Dino is also available from GitHub, and bug fixes,
patches, and updates are available there first. To install
Dino from GitHub, run
Note, building vignettes can take a little time, so for a
quicker install, consider setting
build_vignettes = FALSE.
Dino (function) is an all-in-one function to normalize
raw UMI count data from 10X Cell Ranger or similar protocols. Under
default options, Dino outputs a sparse matrix of normalized
expression. SeuratFromDino provides one-line functionality
to return a Seurat object from raw UMI counts or from a previously
normalized expression matrix.
library(Dino)
# Return a sparse matrix of normalized expression
Norm_Mat <- Dino(UMI_Mat)
# Return a Seurat object from already normalized expression
# Use normalized (doNorm = FALSE) and un-transformed (doLog = FALSE) expression
Norm_Seurat <- SeuratFromDino(Norm_Mat, doNorm = FALSE, doLog = FALSE)
# Return a Seurat object from UMI expression
# Transform normalized expression as log(x + 1) to improve
# some types of downstream analysis
Norm_Seurat <- SeuratFromDino(UMI_Mat)
To facilitate concrete examples, we demonstrate normalization on a
small subset of sequencing data from about 3,000 peripheral blood
mononuclear cells (PBMCs) published by 10X
Genomics. This dataset, named pbmcSmall, contains 200
cells and 1,000 genes and is included with the Dino
package.
set.seed(1)
# Bring pbmcSmall into R environment
library(Dino)
library(Seurat)
library(Matrix)
data("pbmcSmall")
print(dim(pbmcSmall))
## [1] 1000 200
While Dino was developed to normalize UMI count data, it
will run on any matrix of non-negative expression data; user caution is
advised if applying Dino to non-UMI sequencing protocols.
Input formats may be sparse or dense matrices of expression with genes
(features) on the rows and cells (samples) on the columns.
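As a minimal sketch of the expected input layout, the following builds a toy count matrix with genes on the rows and cells on the columns, then stores the same data in sparse format. The object names (toyCounts, toySparse) and the simulated values are illustrative assumptions and are not part of the Dino package.

```r
library(Matrix)

set.seed(1)
# Toy UMI-like counts: 5 genes (rows) by 4 cells (columns)
toyCounts <- matrix(
    rpois(20, lambda = 2),
    nrow = 5, ncol = 4,
    dimnames = list(paste0("gene", 1:5), paste0("cell", 1:4))
)

# Dense input: a base R matrix of non-negative values
stopifnot(is.matrix(toyCounts), all(toyCounts >= 0))

# Sparse input: the same data stored as a sparse matrix from the Matrix package
toySparse <- Matrix(toyCounts, sparse = TRUE)

# Both layouts keep genes on rows and cells on columns
print(dim(toySparse))
```

Either object has the orientation described above, so a count matrix with cells on the rows would need to be transposed with t() first.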
While Dino can normalize the pbmcSmall
dataset as it currently exists, the resulting normalized matrix, and in
particular, downstream analysis are likely to be improved by cleaning
the data. Of greatest use is removing genes that are expected
not to contain useful information. This set of genes may be
case dependent, but a good rule of thumb for UMI protocols is to remove
genes lacking a minimum of non-zero expression prior to normalization
and analysis.
By default, Dino will not perform the resampling
algorithm on any genes without at least 10 non-zero samples, and will
rather normalize such genes by scaling with sequencing depth. To
demonstrate a stricter threshold, we remove genes lacking at least 20
non-zero samples prior to normalization.
# Filter genes for a minimum of non-zero expression
pbmcSmall <- pbmcSmall[rowSums(pbmcSmall != 0) >= 20, ]
print(dim(pbmcSmall))
## [1] 907 200
Dino contains several options to tune output. One of
particular interest is nCores which allows for parallel
computation of normalized expression. By default, Dino runs
with two threads. Choosing nCores = 0 will utilize all
available cores, and otherwise an integer number of parallel instances
can be chosen.
After normalization, Dino makes it easy to perform data
analysis. The default output is the normalized matrix in sparse format,
and Dino additionally provides a function to transform
normalized output into a Seurat object. We demonstrate this
by running a quick clustering pipeline in Seurat. Much of
the pipeline is modified from the tutorial at https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html
# Normalize the filtered counts (defines pbmcSmall_Norm, which is used below)
pbmcSmall_Norm <- Dino(pbmcSmall)
# Reformat normalized expression as a Seurat object
pbmcSmall_Seurat <- SeuratFromDino(pbmcSmall_Norm, doNorm = FALSE)
# Cluster pbmcSmall_Seurat
pbmcSmall_Seurat <- FindVariableFeatures(pbmcSmall_Seurat,
    selection.method = "mvp")
pbmcSmall_Seurat <- ScaleData(pbmcSmall_Seurat,
    features = rownames(pbmcSmall_Norm))
pbmcSmall_Seurat <- RunPCA(pbmcSmall_Seurat,
    features = VariableFeatures(object = pbmcSmall_Seurat),
    verbose = FALSE)
pbmcSmall_Seurat <- FindNeighbors(pbmcSmall_Seurat, dims = 1:10)
pbmcSmall_Seurat <- FindClusters(pbmcSmall_Seurat, verbose = FALSE)
pbmcSmall_Seurat <- RunUMAP(pbmcSmall_Seurat, dims = 1:10)
DimPlot(pbmcSmall_Seurat, reduction = "umap")
Dino additionally supports the normalization of datasets
formatted as SingleCellExperiment. As with the
Seurat pipeline, this functionality is implemented through
the use of a wrapper function. We demonstrate this by quickly converting
the pbmcSmall dataset to a SingleCellExperiment object
and then normalizing.
# Reformatting pbmcSmall as a SingleCellExperiment
library(SingleCellExperiment)
pbmc_SCE <- SingleCellExperiment(assays = list("counts" = pbmcSmall))
# Run Dino
pbmc_SCE <- Dino_SCE(pbmc_SCE)
str(normcounts(pbmc_SCE))
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:162715] 0 1 2 3 4 5 6 7 8 9 ...
## ..@ p : int [1:201] 0 820 1622 2432 3251 4041 4854 5667 6479 7281 ...
## ..@ Dim : int [1:2] 907 200
## ..@ Dimnames:List of 2
## .. ..$ : chr [1:907] "ENSG00000087086" "ENSG00000167996" "ENSG00000251562" "ENSG00000205542" ...
## .. ..$ : chr [1:200] "CCAACCTGACGTAC-1" "ATCTGGGATTCCGC-1" "TACTTTCTTTTGGG-1" "CAGGCCGAACACGT-1" ...
## ..@ x : num [1:162715] 105.3 30.3 12.7 30 10.3 ...
## ..@ factors : list()
By default, Dino computes sequencing depth, which is
corrected for in the normalized data, as the sum of expression for a
cell (sample) across genes. This sum is then scaled such that the median
depth is 1. For some datasets, however, it may be beneficial to run
Dino on an alternately computed set of sequencing depths.
Note: it is generally recommended that the median depth not be
far from 1 as this corresponds to recomputing expression as though all
cells had been sequenced at the median depth.
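The default depth calculation described above can be sketched in base R as follows. This mirrors the prose description (column sums scaled to a median of 1) rather than reproducing Dino's internal code, and the names toyCounts, rawDepth, and toyDepth are illustrative assumptions.

```r
set.seed(1)
# Toy counts: 6 genes (rows) by 5 cells (columns)
toyCounts <- matrix(rpois(30, lambda = 5), nrow = 6, ncol = 5)

# Sequencing depth as total expression per cell (column sums)
rawDepth <- colSums(toyCounts)

# Scale so that the median depth across cells is 1
toyDepth <- rawDepth / median(rawDepth)

print(median(toyDepth))
```

Cells with toyDepth above 1 were sequenced more deeply than the median cell, and cells below 1 less deeply.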
A simple pipeline to compute alternate sequencing depths utilizes the
scran method for computing normalization scale factors, and
is demonstrated below.
library(scran)
# Compute scran size factors
scranSizes <- calculateSumFactors(pbmcSmall)
# Re-normalize data
# Re-normalize data, passing the log-transformed size factors as depth
pbmcSmall_SNorm <- Dino(pbmcSmall, nCores = 1, depth = log(scranSizes))
A fuller discussion of a specific use case for providing alternate
sequencing depths can be viewed on the Dino GitHub page: Issue #1
Dino models observed UMI counts as a mixture of Negative
Binomial random variables. The Negative Binomial distribution can,
however, be decomposed into a hierarchical Gamma-Poisson distribution,
so for gene \(g\) and cell \(j\), the Dino model for UMI
counts is:
\[\begin{aligned}
y_{gj} &\sim f^{P}(\lambda_{gj}\delta_{j})\\
\lambda_{gj} &\sim \sum_{k=1}^{K}\pi_{k}f^{G}\left(\frac{\mu_{gk}}{\theta_g},\theta_g\right)
\end{aligned}\]
where \(f^{P}\) is a Poisson
distribution parameterized by mean \(\lambda_{gj}\delta_{j}\) and \(f^{G}\) is a Gamma distribution
parameterized by shape \(\mu_{gk}/\theta_g\) and scale \(\theta_g\). \(\delta_{j}\) is the cell-specific
sequencing depth, \(\lambda_{gj}\) is
the latent level of gene/cell-specific expression independent of depth,
component probabilities \(\pi_{k}\) sum
to 1, the Gamma distribution is parameterized such that \(\mu_{gk}\) denotes the distribution mean,
and the Gamma scale parameter, \(\theta_g\), is constant across mixture
components.
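The Gamma-Poisson decomposition can be checked numerically for a single mixture component. The sketch below integrates the Poisson likelihood over the Gamma distribution and compares the result with the Negative Binomial density that has size \(\mu/\theta\) and mean \(\mu\delta\). All parameter values are illustrative assumptions rather than Dino defaults or fitted values.

```r
# Illustrative parameter values (assumptions, not Dino defaults)
mu    <- 4     # Gamma mean for one mixture component
theta <- 0.5   # Gamma scale
delta <- 1.5   # cell-specific sequencing depth
y     <- 0:10  # UMI counts at which to compare the two densities

# Marginal probability of each count: Poisson likelihood integrated over
# the Gamma distribution of the latent expression lambda
gammaPoisson <- vapply(y, function(yi) {
    integrate(
        function(lambda) {
            dpois(yi, lambda * delta) *
                dgamma(lambda, shape = mu / theta, scale = theta)
        },
        lower = 0, upper = Inf
    )$value
}, numeric(1))

# Negative Binomial with size mu / theta and mean mu * delta
negBinom <- dnbinom(y, size = mu / theta, mu = mu * delta)

# Largest absolute difference between the two sets of probabilities
print(max(abs(gammaPoisson - negBinom)))
```

The difference is at the level of numerical integration error, which is what makes the hierarchical form a convenient way to separate depth \(\delta_j\) from latent expression \(\lambda_{gj}\).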
Following model fitting for a fixed gene through an accelerated EM
algorithm, Dino produces normalized expression values by
resampling from the posterior distribution of the latent expression
parameters, \(\lambda_{gj}\). It can be
shown that the posterior distribution of \(\lambda_{j}\) (dropping the gene-specific
subscript \(g\), as calculations are
repeated across genes) is a mixture of Gammas, specifically:
\[\mathbb{P}(\lambda_{j}|y_{j},\delta_j)=\sum_{k=1}^{K}\tau_{kj}f^{G}\left(\frac{\mu_{k}}{\theta}+\gamma y_{j},\frac{1}{\frac{1}{\theta}+\gamma\delta_j}\right)\]
where \(\tau_{kj}\) denotes the conditional
probability that \(\lambda_{j}\) was
sampled from mixture component \(k\)
and \(\gamma\) is a global
concentration parameter. The \(\tau_{kj}\) are estimated as part of the
implementation of the EM algorithm in Dino. The adjustment
from the concentration parameter can be seen as a bias in the normalized
values towards a scale-factor version of normalization, since, in the
limit as \(\gamma\to\infty\), the normalized
expression for cell \(j\) converges to
\(y_j/\delta_j\). The default value,
\(\gamma=15\), has proven
successful.
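The resampling step can be illustrated with a small base R sketch that draws from a mixture of Gammas with the shape and scale given in the posterior formula. The component means, probabilities, and other values below are made-up assumptions rather than output from Dino's EM fit, and the helper names samplePosterior and posteriorMean are hypothetical.

```r
# Illustrative values (assumptions, not fitted by Dino)
mu_k    <- c(1, 5, 20)       # component means
tau_kj  <- c(0.2, 0.7, 0.1)  # conditional component probabilities for cell j
theta   <- 0.5               # shared Gamma scale
y_j     <- 6                 # observed UMI count for cell j
delta_j <- 1.2               # sequencing depth for cell j

# Draw n values from the posterior mixture of Gammas;
# conc plays the role of the concentration parameter gamma
samplePosterior <- function(n, conc) {
    k <- sample(seq_along(mu_k), size = n, replace = TRUE, prob = tau_kj)
    rgamma(
        n,
        shape = mu_k[k] / theta + conc * y_j,
        scale = 1 / (1 / theta + conc * delta_j)
    )
}

# Mean of the mixture: weighted sum of shape * scale over components
posteriorMean <- function(conc) {
    sum(tau_kj * (mu_k / theta + conc * y_j) / (1 / theta + conc * delta_j))
}

set.seed(1)
draws <- samplePosterior(1000, conc = 15)
print(c(sampleMean = mean(draws), posteriorMean = posteriorMean(15)))

# A very large concentration pulls the mean towards y_j / delta_j
print(c(conc15 = posteriorMean(15), conc1e6 = posteriorMean(1e6),
        scaled = y_j / delta_j))
```

With these values the posterior mean sits close to the scale-factor value \(y_j/\delta_j = 5\) at a concentration of 15 and is essentially equal to it for very large concentrations, which matches the limiting behavior of the formula.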
Approximating the flexibility of a non-parametric method,
Dino uses a large number of mixture components, \(K\), in order to capture the full
heterogeneity of expression that may exist for a given gene. The
gene-specific number of components is estimated as the square root of
the number of strictly positive UMI counts for a given gene. By default,
\(K\) is limited to be no larger than
100. In simulation, large values of \(K\) are shown to successfully reconstruct
both unimodal and multimodal underlying distributions. For example, when
UMI counts are simulated from a single Negative Binomial distribution,
the Dino fitted prior distribution (black, right panel)
which is used to sample normalized expression closely matches the
theoretical sampling distribution (red, right panel). Likewise, the
fitted means (\(\mu_k\) in the model,
gray lines, left panel) span the range of the simulated data (heat map
of counts, left panel), but concentrate around the theoretical mean of
the sampling distribution (red, left panel).
[Figure: two panels for the unimodal simulation. Left: heat map of simulated counts with fitted means (gray lines) and the theoretical mean (red). Right: fitted prior distribution (black) and theoretical sampling distribution (red).]
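The rule for choosing the number of mixture components can be sketched as a small helper. The function chooseK and its maxK argument are hypothetical names for illustration and are not part of the Dino API. Rounding the square root up is also an assumption, since the description gives no rounding rule.

```r
# Hypothetical helper, not part of the Dino API
chooseK <- function(geneCounts, maxK = 100) {
    # Number of strictly positive UMI counts for this gene
    nPositive <- sum(geneCounts > 0)
    # Rounding up is an assumption; the text does not specify a rounding rule
    min(ceiling(sqrt(nPositive)), maxK)
}

set.seed(1)
# A sparsely expressed gene across 200 cells (many zeros)
sparseGene <- rpois(200, lambda = 0.3)
print(chooseK(sparseGene))

# A gene with 40,000 positive counts reaches the cap of 100
print(chooseK(rep(1, 40000)))
```

Because the count of positive values grows much faster than its square root, the cap only comes into play for genes expressed in more than 10,000 cells.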
Simulating data from a pair of Negative Binomial distributions with different means and different dispersion parameters yields similar results in the multimodal case.
[Figure: the same two-panel layout for data simulated from a pair of Negative Binomial distributions.]
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ggpubr_0.6.2 gridExtra_2.3
## [3] ggplot2_4.0.0 SingleCellExperiment_1.33.0
## [5] SummarizedExperiment_1.41.0 Biobase_2.70.0
## [7] GenomicRanges_1.63.0 Seqinfo_1.1.0
## [9] IRanges_2.45.0 S4Vectors_0.49.0
## [11] BiocGenerics_0.56.0 generics_0.1.4
## [13] MatrixGenerics_1.23.0 matrixStats_1.5.0
## [15] future_1.67.0 Matrix_1.7-4
## [17] Seurat_5.3.1 SeuratObject_5.2.0
## [19] sp_2.2-0 Dino_1.16.0
## [21] knitr_1.50 BiocStyle_2.38.0
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 splines_4.5.2 later_1.4.4
## [4] tibble_3.3.0 polyclip_1.10-7 fastDummies_1.7.5
## [7] lifecycle_1.0.4 rstatix_0.7.3 edgeR_4.9.0
## [10] globals_0.18.0 lattice_0.22-7 MASS_7.3-65
## [13] backports_1.5.0 magrittr_2.0.4 limma_3.67.0
## [16] plotly_4.11.0 sass_0.4.10 rmarkdown_2.30
## [19] jquerylib_0.1.4 yaml_2.3.10 metapod_1.19.0
## [22] httpuv_1.6.16 otel_0.2.0 sctransform_0.4.2
## [25] spam_2.11-1 spatstat.sparse_3.1-0 reticulate_1.44.0
## [28] cowplot_1.2.0 pbapply_1.7-4 buildtools_1.0.0
## [31] RColorBrewer_1.1-3 abind_1.4-8 Rtsne_0.17
## [34] purrr_1.2.0 ggrepel_0.9.6 irlba_2.3.5.1
## [37] listenv_0.10.0 spatstat.utils_3.2-0 maketools_1.3.2
## [40] goftest_1.2-3 RSpectra_0.16-2 spatstat.random_3.4-2
## [43] dqrng_0.4.1 fitdistrplus_1.2-4 parallelly_1.45.1
## [46] codetools_0.2-20 DelayedArray_0.37.0 scuttle_1.21.0
## [49] tidyselect_1.2.1 farver_2.1.2 ScaledMatrix_1.19.0
## [52] spatstat.explore_3.5-3 jsonlite_2.0.0 BiocNeighbors_2.4.0
## [55] Formula_1.2-5 progressr_0.18.0 ggridges_0.5.7
## [58] survival_3.8-3 tools_4.5.2 ica_1.0-3
## [61] Rcpp_1.1.0 glue_1.8.0 SparseArray_1.11.1
## [64] xfun_0.54 dplyr_1.1.4 withr_3.0.2
## [67] BiocManager_1.30.26 fastmap_1.2.0 bluster_1.21.0
## [70] digest_0.6.37 rsvd_1.0.5 R6_2.6.1
## [73] mime_0.13 scattermore_1.2 tensor_1.5.1
## [76] spatstat.data_3.1-9 hexbin_1.28.5 tidyr_1.3.1
## [79] data.table_1.17.8 httr_1.4.7 htmlwidgets_1.6.4
## [82] S4Arrays_1.11.0 uwot_0.2.4 pkgconfig_2.0.3
## [85] gtable_0.3.6 lmtest_0.9-40 S7_0.2.0
## [88] XVector_0.51.0 sys_3.4.3 htmltools_0.5.8.1
## [91] carData_3.0-5 dotCall64_1.2 scales_1.4.0
## [94] png_0.1-8 spatstat.univar_3.1-4 scran_1.39.0
## [97] reshape2_1.4.4 nlme_3.1-168 cachem_1.1.0
## [100] zoo_1.8-14 stringr_1.6.0 KernSmooth_2.23-26
## [103] parallel_4.5.2 miniUI_0.1.2 pillar_1.11.1
## [106] vctrs_0.6.5 RANN_2.6.2 promises_1.5.0
## [109] BiocSingular_1.26.0 car_3.1-3 beachmat_2.26.0
## [112] xtable_1.8-4 cluster_2.1.8.1 evaluate_1.0.5
## [115] cli_3.6.5 locfit_1.5-9.12 compiler_4.5.2
## [118] rlang_1.1.6 future.apply_1.20.0 ggsignif_0.6.4
## [121] labeling_0.4.3 plyr_1.8.9 stringi_1.8.7
## [124] viridisLite_0.4.2 deldir_2.0-4 BiocParallel_1.44.0
## [127] lazyeval_0.2.2 spatstat.geom_3.6-0 RcppHNSW_0.6.0
## [130] patchwork_1.3.2 statmod_1.5.1 shiny_1.11.1
## [133] ROCR_1.0-11 igraph_2.2.1 broom_1.0.10
## [136] bslib_0.9.0
If you use Dino in your analysis, please cite our paper:
Brown, J., Ni, Z., Mohanty, C., Bacher, R., and Kendziorski, C. (2021). “Normalization by distributional resampling of high throughput single-cell RNA-sequencing data.” Bioinformatics, 37, 4123-4128. https://academic.oup.com/bioinformatics/article/37/22/4123/6306403.
Other work referenced in this vignette includes:
Satija, R., Farrell, J.A., Gennert, D., Schier, A.F. and Regev, A. (2015). “Spatial reconstruction of single-cell gene expression data.” Nat. Biotechnol., 33, 495–502. https://doi.org/10.1038/nbt.3192
Amezquita, R.A., Lun, A.T.L., Becht, E., Carey, V.J., Carpp, L.N., Geistlinger, L., Marini, F., Rue-Albrecht, K., Risso, D., Soneson, C., et al. (2020). “Orchestrating single-cell analysis with Bioconductor.” Nat. Methods, 17, 137–145. https://doi.org/10.1038/s41592-019-0654-x
Lun, A. T. L., Bach, K. and Marioni, J. C. (2016). “Pooling across cells to normalize single-cell RNA sequencing data with many zero counts.” Genome Biol., 17, 1–14. https://doi.org/10.1186/s13059-016-0947-7
Jared Brown:
Christina Kendziorski: