celda_CG()CEllular Latent Dirichlet Allocation (celda), is among multiple approaches to cluster cells from single-cell RNA-seq data into discrete sub-populations. In most cases, annotating a concise set of important features for distinguishing these sub-populations is not a trivial task.
In this vignette we will demonstrate the implementation of a multiclass decision tree approach to simultaneously sort and annotate cell cluster label estimations by generating a sequence of univariate rules for each cluster.
The procedure has two main deviations from simple multiclass decision tree procedures. First, at each split cells from the same cluster are never separated during tree building. Instead cells from the same population are moved to one-side of a particular split based on majority voting. Second, each cluster split can be determined by one of two heuristics, as follows…
celda can be installed from Bioconductor:
if (!requireNamespace("BiocManager", quietly=TRUE)){
    install.packages("BiocManager")}
BiocManager::install("celda")The package can be loaded using the library command.
library(celda)Complete list of help files are accessible using the help command with a package option.
help(package = celda)To see the latest updates and releases or to post a bug, see our GitHub page at https://github.com/campbio/celda. To ask questions about running celda, post a thread on Bioconductor support site at https://support.bioconductor.org/.
celda_CG()First, we will read in a single-cell data set and use celda_CG() to cluster the genes and samples. The M3DExampleData is a subset of single-cell RNA-Seq expression data of r mouse 2-cell embryos and blastocysts, originally generated by Deng et al. (2014). In this example we will use 500 genes to generate 10 sample and 10 gene clusters using celda_CG(). Note, that these parameters are not based on knowledge of these data, but rather just for running a speedy example.
library(M3DExampleData)
counts <- M3DExampleData::Mmus_example_list$data
# Subset 500 genes for fast clustering
counts <- as.matrix(counts[1501:2000, ])
# Cluster genes ans samples each into 10 modules
cm <- celda_CG(counts = counts, L = 10, K = 10, verbose = FALSE)The decision trees are generated and evaluated by findMarkers. findMarkers requires two arguments: features and class. features is a numeric matrix with a row for every variable (or feature) to be used in sorting and a column for every sample. In this example we will use a factorized matrix of the gene clusters as our feature matrix (Check ?factorizedMatrix for more details). class is a vector with a label assignment for every sample. Note, that neither features nor class may have any missing values.
# Get features matrix and cluster assignments
factorized <- factorizeMatrix(counts, cm)
features <- factorized$proportions$cell
class <- clusters(cm)$zfindMarkers allows for different parameter options. The optimal combination of these parameters is generally analysis specific and may require some trial and error. Parameters include:
reuseFeatures = FALSE a feature is still allowed to be used more than once in the decision tree. This happens when the same feature is used for different sets of cluster rules after those sets of clusters have diverged.consecutiveOneoff = FALSE, more then one good one-off splits will be assessed at the same level and used if determined to be a good split.DecTree <- findMarkers(features,
    class,
    oneoffMetric = "modified F1",
    threshold = 0.95,
    reuseFeatures = FALSE,
    altSplit = TRUE,
    consecutiveOneoff = FALSE)findMarkers outputThe findMarkers output is a named list of four elements
data.frame for every label. Each
data.frame has five columns and gives the set of rules for distinguishing
each label.
performance A named list denoting the training performance of the
plotDendro generates a dendrogram as a ggplot object. The addSensPrec argument determines whether to print cluster-specific sensitivities and precisions under the leaf labels for each cluster. leafSize, boxSize, boxColor are all stylistic choices to make the plot labels legible.
plotDendro(DecTree,
    classLabel = NULL,
    addSensPrec = TRUE,
    leafSize = 24,
    boxSize = 7,
    boxColor = "black")
In the plot, the feature(s) used for splits determined by the one-off metric are printed above the cluster labels for which they are markers. Alternatively, the features used for the balanced splits are centered above the two sets of clusters defined by that split. The up-regulated set of clusters for a balanced split are one the right side of that split. The size, sensitivitie and precision of each class are printed below the leaf labels, respectively.
In plotDendro the classLabel argument may used to only print the sequence of relevant rules for a given split.
plotDendro(DecTree,
    classLabel = "1",
    addSensPrec = TRUE,
    leafSize = 15,
    boxSize = 7,
    boxColor = "black")findMarkers performs label estimation of each sample in the training set automatically. You can use the set of rules generated by findMarkers to predict the labels on an independent feature matrix using getDecisions.
head(getDecisions(DecTree$rules, features))## GSM1112603_early2cell_0r.1_expression GSM1112604_early2cell_0r.2_expression 
##                                   "7"                                   "7" 
##  GSM1112605_early2cell_1.1_expression  GSM1112606_early2cell_1.2_expression 
##                                   "7"                                   "7" 
##  GSM1112607_early2cell_2.1_expression  GSM1112608_early2cell_2.2_expression 
##                                   "7"                                   "7"If you have a priori understanding of sub-groups of your cluster labels, you can ensure that these sub-groups are not separated up-stream in the tree by using the optional cellTypes argument. This is just a named list of labels in your class vector. For example, if we knew that clusters, 4, 5, and 1 were of the same subtype of clusters, we could do the following…
# Run with a hierarchichal split
cellTypes <- list(metaLabel = c("4", "5", "1"))
DecTreeMeta <- findMarkers(features,
    class,
    cellTypes,
    oneoffMetric = "modified F1",
    threshold = 1,
    reuseFeatures = F,
    consecutiveOneoff = FALSE)
plotDendro(DecTreeMeta)sessionInfo()## R version 3.6.2 (2019-12-12)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] M3DExampleData_1.12.0 celda_1.2.4           BiocStyle_2.14.4     
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.46.0              ggdendro_0.1-20            
##  [3] viridis_0.5.1               httr_1.4.1                 
##  [5] viridisLite_0.3.0           foreach_1.4.7              
##  [7] RcppParallel_4.4.4          assertthat_0.2.1           
##  [9] BiocManager_1.30.10         stats4_3.6.2               
## [11] GenomeInfoDbData_1.2.2      ggrepel_0.8.1              
## [13] yaml_2.2.0                  pillar_1.4.3               
## [15] lattice_0.20-38             glue_1.3.1                 
## [17] pROC_1.16.1                 RcppEigen_0.3.3.7.0        
## [19] digest_0.6.23               GenomicRanges_1.38.0       
## [21] RColorBrewer_1.1-2          XVector_0.26.0             
## [23] colorspace_1.4-1            enrichR_2.1                
## [25] htmltools_0.4.0             Matrix_1.2-18              
## [27] plyr_1.8.5                  pkgconfig_2.0.3            
## [29] magick_2.2                  bookdown_0.17              
## [31] zlibbioc_1.32.0             purrr_0.3.3                
## [33] scales_1.1.0                Rtsne_0.15                 
## [35] BiocParallel_1.20.1         tibble_2.1.3               
## [37] combinat_0.0-8              farver_2.0.3               
## [39] IRanges_2.20.2              ggplot2_3.2.1              
## [41] withr_2.1.2                 SummarizedExperiment_1.16.1
## [43] BiocGenerics_0.32.0         lazyeval_0.2.2             
## [45] magrittr_1.5                crayon_1.3.4               
## [47] evaluate_0.14               doParallel_1.0.15          
## [49] MASS_7.3-51.5               MAST_1.12.0                
## [51] tools_3.6.2                 data.table_1.12.8          
## [53] lifecycle_0.1.0             matrixStats_0.55.0         
## [55] stringr_1.4.0               S4Vectors_0.24.3           
## [57] munsell_0.5.0               DelayedArray_0.12.2        
## [59] compiler_3.6.2              GenomeInfoDb_1.22.0        
## [61] rlang_0.4.2                 grid_3.6.2                 
## [63] RCurl_1.98-1.1              iterators_1.0.12           
## [65] rjson_0.2.20                SingleCellExperiment_1.8.0 
## [67] labeling_0.3                bitops_1.0-6               
## [69] rmarkdown_2.1               gtable_0.3.0               
## [71] codetools_0.2-16            abind_1.4-5                
## [73] reshape2_1.4.3              R6_2.4.1                   
## [75] gridExtra_2.3               knitr_1.27.2               
## [77] dplyr_0.8.3                 uwot_0.1.5                 
## [79] dendextend_1.13.2           stringi_1.4.5              
## [81] parallel_3.6.2              Rcpp_1.0.3                 
## [83] MCMCprecision_0.4.0         tidyselect_0.2.5           
## [85] xfun_0.12