Contents

1 Introduction

Single-cell DNA sequencing (scDNA-seq) enables the analysis of mutation profiles at single-cell resolution. This cutting-edge technique makes it possible to shed light on the previously inaccessible complexity of heterogeneous tissues such as tumors. scafari is an R package that offers easy-to-use data quality control as well as explorative variant analyses and visualization for scDNA-seq data.

Other scDNA-seq analysis packages are: - mosaic (https://missionbio.github.io/mosaic/) - optima optima - scDNA (https://github.com/bowmanr/scDNA)

2 Requirements

To run scafari, you need R (Version 4.4 or higher).

3 Running scafari

3.1 Installation

A development version of scafari will be available on GitHub and can be installed as follows:

BiocManager::install("scafari")

3.2 Getting started

A basic workflow analysis of scDNA-seq data is demonstrated in the following on an exemplary dataset. First we need to load scafari.

library(scafari)
#> 

Once the package is installed and have been loaded, the analysis can start. scafari is available as R package and shiny GUI. You can run the shiny version by RStudio executing the function launchScafariShiny(). The scafari workflow can be split into four main parts.

  1. Sequencing information and quality control
  2. Panel analysis
  3. Variant analysis
  4. Analyses of variants of special interest

These parts are represented in the four tabs of the shiny app and several functions in the R package.

3.3 Sequencing information and quality control

The analysis start with uploading the .h5 file. In the shiny app an interactive upload is provided on the starting page. Information from the .h5 file are loaded in an sce class object in this step.

library(SingleCellExperiment)

# Load .h5 file
h5_file_path <- system.file("extdata", "demo.h5", package = "scafari")
h5 <- h5ToSce(h5_file_path)
sce <- h5$sce_amp
se.var <- h5$se_var

After successful data upload the metadata including sequencing information can be accessed and basic quality information can be inspected.

# Get metadata
metadata(sce)
#> $ado_rate
#> [1] "NA"
#> 
#> $avg_mapping_error_rate
#> [1] "0.006751065"
#> 
#> $avg_panel_uniformity
#> [1] "0.960629921259842"
#> 
#> $chemistry_version
#> [1] "V2"
#> 
#> $genome_version
#> [1] "hg19"
#> 
#> $n_amplicons
#> [1] "127"
#> 
#> $n_bases_r1
#> [1] "1313531484"
#> 
#> $n_bases_r1_q30
#> [1] "1257088621"
#> 
#> $n_bases_r2
#> [1] "1313531484"
#> 
#> $n_bases_r2_q30
#> [1] "1191181775"
#> 
#> $n_cell_barcode_bases
#> [1] "433072697"
#> 
#> $n_cell_barcode_bases_q30
#> [1] "433315642"
#> 
#> $n_cells
#> [1] "1313"
#> 
#> $n_read_pairs
#> [1] "8698884"
#> 
#> $n_read_pairs_mapped_to_cells
#> [1] "5417202"
#> 
#> $n_read_pairs_trimmed
#> [1] "8573237"
#> 
#> $n_read_pairs_valid_cell_barcodes
#> [1] "8419726"
#> 
#> $n_reads_mapped
#> [1] "16453845"
#> 
#> $n_reads_mapped_insert
#> [1] "13777334"
#> 
#> $panel_name
#> [1] "AML"
#> 
#> $pipeline_version
#> [1] "2.0.2"

Further, reads per barcode can be visualized in a log-log plot.

logLogPlot(sce)

3.4 Panel analysis

The panel analysis start by normalizing the read counts using normalizeReadCounts(). The normalized read counts are stored in the corresponding assay (normalized.counts).

sce <- normalizeReadCounts(sce = sce)

After read counts are normalized they can be annotated with biological relevant information such as exon number and canonical transcript ID.

Therefore, known canonicals should be provided. They can be dowloaded like it is shown here:

library(R.utils)
#> Loading required package: R.oo
#> Loading required package: R.methodsS3
#> R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
#> R.oo v1.27.1 (2025-05-02 21:00:05 UTC) successfully loaded. See ?R.oo for help.
#> 
#> Attaching package: 'R.oo'
#> The following object is masked from 'package:R.methodsS3':
#> 
#>     throw
#> The following object is masked from 'package:SummarizedExperiment':
#> 
#>     trim
#> The following object is masked from 'package:GenomicRanges':
#> 
#>     trim
#> The following object is masked from 'package:IRanges':
#> 
#>     trim
#> The following object is masked from 'package:generics':
#> 
#>     compile
#> The following objects are masked from 'package:methods':
#> 
#>     getClasses, getMethods
#> The following objects are masked from 'package:base':
#> 
#>     attach, detach, load, save
#> R.utils v2.13.0 (2025-02-24 21:20:02 UTC) successfully loaded. See ?R.utils for help.
#> 
#> Attaching package: 'R.utils'
#> The following object is masked from 'package:generics':
#> 
#>     evaluate
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
#> The following objects are masked from 'package:base':
#> 
#>     cat, commandArgs, getOption, isOpen, nullfile, parse, use, warnings

temp_dir <- tempdir()
known_canon_path <- file.path(
    temp_dir,
    "UCSC_hg19_knownCanonical_goldenPath.txt"
)

if (!file.exists(known_canon_path)) {
    url <- paste0(
        "http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/",
        "knownCanonical.txt.gz"
    )
    destfile <- file.path(
        temp_dir,
        "UCSC_hg19_knownCanonical_goldenPath.txt.gz"
    )

    # Download the file
    download.file(url, destfile)

    # Decompress the file
    R.utils::gunzip(destfile, remove = FALSE)
}
try(annotateAmplicons(sce = sce, known.canon = known_canon_path))

The exact genomic locations of the amplicons can be inspected in the amplicon distribution plot.

plotAmpliconDistribution(sce)
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.

The amplicon performance plot depicts normalized read counts per amplicon. This can help to identify low performing amplicons and get deeper insights into potential copy number alterations (Sarah).

plotNormalizedReadCounts(sce = sce)

Insights in the normalized read counts per amplicon on single cell resolution is provided in the panel uniformity plot. Each dot represenents one cell and each row an amplicon. The amplicons are sorted by their mean reads, from high to low. Amplicons meeting at least 20% of the average depth of coverage are represented in blue, others in orange.

plotPanelUniformity(sce, interactive = FALSE)
#> No id variables; using all as measure variables

3.5 Variant analysis

The variant analysis starts with variant filtering and is performed on criteria such as sequencing depth, variant allele frequency (VAF), genoytpe data and genotype quality data. Further, the minimum percentage of cells that could be assigned with a genotype other than missing (wild type, heterozygous, homozygous) and the minimum percentage of cells with mutated genotypes (heterozygous, homozygous).

library(SummarizedExperiment)

filteres <- filterVariants(
    depth.threshold = 10,
    genotype.quality.threshold = 30,
    vaf.ref = 5,
    vaf.het = 35,
    vaf.hom = 95,
    min.cell = 50,
    min.mut.cell = 1,
    se.var = se.var,
    sce = sce,
    shiny = FALSE
)
se.f <-
    SummarizedExperiment(
        assays =
            list(
                VAF = t(filteres$vaf.matrix.filtered),
                Genotype = t(filteres$genotype.matrix.filtered),
                Genoqual = t(filteres$genoqual.matrix.filtered)
            ),
        rowData = filteres$variant.ids.filtered,
        colData = filteres$cells.keep
    )

# Filter out cells in sce object
# Find the indices of the columns to keep
indices_to_keep <- match(filteres$cells.keep,
    SummarizedExperiment::colData(sce)[[1]],
    nomatch = 0
)

# Subset the SCE using these indices
sce_filtered <- sce[, indices_to_keep]
SingleCellExperiment::altExp(sce_filtered, "variants") <- se.f

After filtering the variants are annotated with biological information like gene names and clinical variant metrics. By default, annotation is performed via the Mission Bio API, which provides information on the names of the gene and the affected protein, the coding effect, the functional classification (e.g.  intronic, coding), ClinVar information, the dbSNP ID and the DANN (Deleterious Annotation of genetic variants using Neural Networks) score. The annotated variants can than be inspected.

suppressPackageStartupMessages(library(DT))
suppressPackageStartupMessages(library(dplyr))

# Load sce object with filtered variants
# Check metadata
metadata(altExp(sce_filtered))
#> list()

# Annotate variants
sce_filtered <- annotateVariants(sce = sce_filtered)

# Check metadata. Now a annotation flag is present
metadata(altExp(sce_filtered))
#> $annotated
#> [1] TRUE

# The annotations are stored in the rowData
rowData(altExp(sce_filtered)) %>%
    as.data.frame() %>%
    head() %>%
    datatable()

To gain deeper insights into the VAF of filtered variants across the cells, a VAF heatmap with genotype and chromosome annotations is available.

plotVariantHeatmap(sce = sce_filtered)

More detailed insights into the genotype and their quality can be obtained in a violin plot.

plotGenotypequalityPerGenotype(sce = sce_filtered)

3.6 Analysis of variants of interest

To further focus on selected variants of interest, there is a set of function to analyze user selected variants. Thereby, variants forming cell clusters/clones can be identified and analyzed regarding VAF and genotype. First the user need to define a set of variants of interest.

variants.of.interest <- c(
    "KIT:chr4:55599436:T/C",
    "TET2:chr4:106158216:G/A",
    "FLT3:chr13:28610183:A/G",
    "TP53:chr17:7577427:G/A"
)

After selecting the variants of interest, the cells can be clustered based on their VAF. The clustering returns the k-means results and the clusterplot itself in a list object. For stable k-means clustering a seed can be used set.seed(1).

plot <- clusterVariantSelection(
    sce = sce_filtered,
    variants.of.interest = variants.of.interest,
    n.clust = 3
)

names(plot)
#> [1] "k_means"     "clusterplot"

The clustered cells plot is stored in clusterplot.

plot$clusterplot

To compare the clusters regarding the variant selection there are various visualizations available. The distribution of variant allele frequency (VAF) within each cell cluster can be obtained using plotClusterVAF().

plotClusterVAF(
    sce = sce_filtered,
    variants.of.interest = variants.of.interest,
    gg.clust = plot$clusterplot
)

plotClusterVAFMap(
    sce = sce_filtered,
    variants.of.interest = variants.of.interest,
    gg.clust = plot$clusterplot
)

Further, the clusters can be analyzed regarding the genotype using plotClusterGenotype().

plotClusterGenotype(
    sce = sce_filtered,
    variants.of.interest = variants.of.interest,
    gg.clust = plot$clusterplot
)
#> Using cluster as id variables

3.7 Data

4 Session information

sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] dplyr_1.1.4                 DT_0.33                    
#>  [3] R.utils_2.13.0              R.oo_1.27.1                
#>  [5] R.methodsS3_1.8.2           SingleCellExperiment_1.31.1
#>  [7] SummarizedExperiment_1.39.1 Biobase_2.69.0             
#>  [9] GenomicRanges_1.61.1        Seqinfo_0.99.2             
#> [11] IRanges_2.43.0              S4Vectors_0.47.0           
#> [13] BiocGenerics_0.55.0         generics_0.1.4             
#> [15] MatrixGenerics_1.21.0       matrixStats_1.5.0          
#> [17] scafari_0.99.11             BiocStyle_2.37.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] later_1.4.2              BiocIO_1.19.0            bitops_1.0-9            
#>   [4] filelock_1.0.3           tibble_3.3.0             graph_1.87.0            
#>   [7] XML_3.99-0.18            rpart_4.1.24             factoextra_1.0.7        
#>  [10] lifecycle_1.0.4          httr2_1.2.1              rstatix_0.7.2           
#>  [13] doParallel_1.0.17        OrganismDbi_1.51.4       lattice_0.22-7          
#>  [16] ensembldb_2.33.1         crosstalk_1.2.1          backports_1.5.0         
#>  [19] magrittr_2.0.3           Hmisc_5.2-3              plotly_4.11.0           
#>  [22] sass_0.4.10              rmarkdown_2.29           jquerylib_0.1.4         
#>  [25] yaml_2.3.10              shinyBS_0.61.1           httpuv_1.6.16           
#>  [28] shinycustomloader_0.9.0  ggbio_1.57.1             DBI_1.2.3               
#>  [31] RColorBrewer_1.1-3       abind_1.4-8              purrr_1.1.0             
#>  [34] AnnotationFilter_1.33.0  biovizBase_1.57.1        RCurl_1.98-1.17         
#>  [37] nnet_7.3-20              VariantAnnotation_1.55.1 rappdirs_0.3.3          
#>  [40] circlize_0.4.16          ggrepel_0.9.6            codetools_0.2-20        
#>  [43] DelayedArray_0.35.2      xml2_1.3.8               tidyselect_1.2.1        
#>  [46] shape_1.4.6.1            UCSC.utils_1.5.0         farver_2.1.2            
#>  [49] BiocFileCache_2.99.5     base64enc_0.1-3          GenomicAlignments_1.45.1
#>  [52] jsonlite_2.0.0           GetoptLong_1.0.5         waiter_0.2.5            
#>  [55] Formula_1.2-5            iterators_1.0.14         foreach_1.5.2           
#>  [58] tools_4.5.1              progress_1.2.3           Rcpp_1.1.0              
#>  [61] glue_1.8.0               gridExtra_2.3            SparseArray_1.9.1       
#>  [64] xfun_0.52                GenomeInfoDb_1.45.9      withr_3.0.2             
#>  [67] BiocManager_1.30.26      fastmap_1.2.0            GGally_2.3.0            
#>  [70] rhdf5filters_1.21.0      shinyjs_2.1.0            digest_0.6.37           
#>  [73] R6_2.6.1                 mime_0.13                colorspace_2.1-1        
#>  [76] Cairo_1.6-2              dichromat_2.0-0.1        markdown_2.0            
#>  [79] biomaRt_2.65.0           RSQLite_2.4.2            tidyr_1.3.1             
#>  [82] data.table_1.17.8        rtracklayer_1.69.1       prettyunits_1.2.0       
#>  [85] httr_1.4.7               htmlwidgets_1.6.4        S4Arrays_1.9.1          
#>  [88] ggstats_0.10.0           pkgconfig_2.0.3          gtable_0.3.6            
#>  [91] blob_1.2.4               ComplexHeatmap_2.25.2    S7_0.2.0                
#>  [94] XVector_0.49.0           htmltools_0.5.8.1        carData_3.0-5           
#>  [97] bookdown_0.43            RBGL_1.85.0              ProtGenerics_1.41.0     
#> [100] clue_0.3-66              scales_1.4.0             png_0.1-8               
#> [103] knitr_1.50               rstudioapi_0.17.1        reshape2_1.4.4          
#> [106] rjson_0.2.23             checkmate_2.3.2          curl_6.4.0              
#> [109] org.Hs.eg.db_3.21.0      cachem_1.1.0             rhdf5_2.53.3            
#> [112] GlobalOptions_0.1.2      stringr_1.5.1            parallel_4.5.1          
#> [115] shinycssloaders_1.1.0    foreign_0.8-90           AnnotationDbi_1.71.0    
#> [118] restfulr_0.0.16          pillar_1.11.0            grid_4.5.1              
#> [121] vctrs_0.6.5              ggpubr_0.6.1             promises_1.3.3          
#> [124] car_3.1-3                dbplyr_2.5.0             xtable_1.8-4            
#> [127] cluster_2.1.8.1          htmlTable_2.4.3          evaluate_1.0.4          
#> [130] magick_2.8.7             tinytex_0.57             GenomicFeatures_1.61.5  
#> [133] cli_3.6.5                compiler_4.5.1           Rsamtools_2.25.1        
#> [136] rlang_1.1.6              crayon_1.5.3             ggsignif_0.6.4          
#> [139] labeling_0.4.3           plyr_1.8.9               stringi_1.8.7           
#> [142] viridisLite_0.4.2        BiocParallel_1.43.4      txdbmaker_1.5.6         
#> [145] Biostrings_2.77.2        lazyeval_0.2.2           Matrix_1.7-3            
#> [148] BSgenome_1.77.1          hms_1.1.3                bit64_4.6.0-1           
#> [151] ggplot2_3.5.2            Rhdf5lib_1.31.0          KEGGREST_1.49.1         
#> [154] shiny_1.11.1             broom_1.0.8              memoise_2.0.1           
#> [157] bslib_0.9.0              bit_4.6.0