---
title: "Quality control of sc/snRNA-seq"
author:
  - name: Mariano Ruz Jurado
    affiliation:
    - Goethe University
output: 
  BiocStyle::html_document:
    self_contained: true
    toc: true
    toc_float: true
    toc_depth: 3
    code_folding: show
package: "`r pkg_ver('DOtools')`"
vignette: >
  %\VignetteIndexEntry{Quality control of sc/snRNA-seq}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

```{r chunk_setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

```{r vignette_setup, echo=FALSE, message=FALSE, warning = FALSE}
# Track time spent on making the vignette.
start_time <- Sys.time()
pkg <- function(x) {
    paste0("[*", x, "*](https://cran.r-project.org/package=", x, ")")
}
pypkg <- function(x) {
    paste0("[*", x, "*](https://pypi.org/project/", x, "/)")
}
```

# Installation
`r Biocpkg("DOtools")` is an R package distributed as part of the Bioconductor
project. To install the package, start R and enter:

```{r bioconductor_install, eval=FALSE}
install.packages("BiocManager") # WORK iN PROGRESS
BiocManager::install("DOtools")
```

Alternatively, you can instead install the latest development version from
[*GitHub*](https://github.com/) with:

```{r github_install, eval=FALSE}
BiocManager::install("MarianoRuzJurado/DOtools")
```

# Usage
`r Biocpkg("DOtools")` contains different functions
for processing and visualizing gene expression in scRNA/snRNA experiments:

In this vignette we showcase how to use the functions with public available 
data.

## Libraries

`r Biocpkg("DOtools")` can be imported as:

```{r load_library, message=FALSE}
library(DOtools)

# Additional packages
library(SummarizedExperiment)
library(scran)
library(scater)
library(plyr)
library(dplyr)
library(tibble)
library(enrichR)
library(kableExtra)
library(Seurat)
```

## Ambient removal
Despite advances in optimizing and standardizing droplet-based single-cell omics
protocols like single-cell and single-nucleus RNA sequencing (sc/snRNA-seq),
these experiments still suffer from systematic biases and background noise.
In particular, ambient RNA in snRNA-seq can lead to an overestimation of 
expression levels for certain genes. Computational tools such as
`r pypkg("cellbender")` have been developed to address these biases by
correcting for ambient RNA contamination.

We have integrated a wrapper function to run CellBender within the
`r Biocpkg("DOtools")` package. The current implementation supports processing
samples generated with CellRanger.

```{r Cellbender, eval=FALSE, warning = FALSE}
base <- DOtools:::.example_10x()
dir.create(file.path(base, "/cellbender"))

raw_files <- list.files(base,
    pattern = "raw_feature_bc_matrix\\.h5$",
    recursive = TRUE,
    full.names = TRUE
)

DO.CellBender(
    cellranger_path = base,
    output_path = file.path(base, "/cellbender"),
    samplenames = c("disease"),
    cuda = TRUE,
    BarcodeRanking = FALSE,
    cpu_threads = 48,
    epochs = 150
)
```

After running the analysis, several files are saved in the `output_folder`,
including a summary report to check for any issues during CellBender execution,
individual log files for each sample, and a `commands_Cellbender.txt` file with
the exact command used. The corrected `.h5` files can then be used alternatively
to the cellranger output for downstream analysis.

## Quality control
`r Biocpkg("DOtools")` The `DO.Import()` function provides a streamlined 
pipeline for performing quality control on single-cell or single-nucleus 
RNA sequencing (sc/snRNA-seq) data. It takes as input a list of .h5 files 
generated by e.g. CellRanger or STARsolo, along with sample names and metadata.

During preprocessing, low-quality genes and cells are filtered out based on 
specified thresholds. Genes expressed in fewer than five cells are removed. 
Cells are filtered according to mitochondrial gene content, number of detected 
genes, total UMI counts, and potential doublets. The function supports doublet 
detection using `r Biocpkg("scDblFinder")`. Thresholds for mitochondrial 
content (e.g., 5% for scRNA-seq and 3% for snRNA-seq), gene counts, and 
UMI counts can be defined to tailor the filtering.

After filtering, samples are merged into one `r pkg("SingleCellExperiment")` 
or `r pkg("Seurat")` object, followed by log-normalisation, scaling, and the 
identification of highly variable genes. To help assess the effect of quality 
control, violin plots showing distributions of key metrics before and after 
filtering are automatically generated and saved alongside the input files. 
A summary of removed genes and cells is also recorded. 

To show how the quality control works, we are going to use a public dataset 
from 10X from human blood of healthy and donors with a malignant tumor:

```{r read_example_data, warning = FALSE, eval=FALSE}
base <- DOtools:::.example_10x()

paths <- c(
    file.path(base, "healthy/outs/filtered_feature_bc_matrix.h5"),
    file.path(base, "disease/outs/filtered_feature_bc_matrix.h5")
)

SCE_obj <- DO.Import(
    pathways = paths,
    ids = c("healthy-1", "disease-1"),
    DeleteDoublets = TRUE,
    cut_mt = .05,
    min_counts = 500,
    min_genes = 300,
    high_quantile = .95,
    Seurat = FALSE # Set to TRUE for Seurat object
)
```
We can now check the quality before introducing filterings:
```{r preflt, out.width="100%", fig.align="center", fig.width=8, fig.height=6}
prefilterplots <- system.file(
    "figures", "prefilterplots-1.png",
    package = "DOtools"
)

pQC1 <- magick::image_read(prefilterplots)
plot(pQC1)
```
<br>
And after:
```{r postflt, out.width="100%", fig.align="center", fig.width=8, fig.height=6}
postfilterplots <- system.file(
    "figures",
    "postfilterplots-1.png",
    package = "DOtools"
)

pQC2 <- magick::image_read(postfilterplots)
plot(pQC2)
```
We observed that most cells were removed due to increased mitochondrial content.
Depending on the experimental design, the mitochondrial content threshold can be
adjusted to retain a higher number of cells, if minimizing cell loss is of 
relevance.

<br>
The DOtools package provides a slim object of this data set. Please feel free,
to use the one created from DO.Import for prettier results or this slim downed 
version. We can observe how similar the samples are through running a 
correlation analysis.
```{r Correlation, fig.width=6, fig.height=6}
# Making sure we have a save folder
base <- tempfile("my_tempdir_")
dir.create(base)

SCE_obj <- readRDS(
    system.file("extdata",
        "sce_data.rds",
        package = "DOtools"
    )
)

DO.Correlation(SCE_obj)
```


## Data integration

After quality control the preferred integration method can be chosen, we support
every integration method within Seurat's `IntegrateLayers` function. 
Additionally, we implemented a new wrapper function for the scVI integration 
from the [scvi-tools](https://pypi.org/project/scvi-tools/) package. After the 
integration completes, we run the Leiden algorithm to find clusters and 
generate UMAP embeddings.

```{r CCA Integration, warning = FALSE}
SCE_obj <- DO.Integration(
    sce_object = SCE_obj,
    split_key = "orig.ident",
    HVG = TRUE,
    scale = TRUE,
    pca = TRUE,
    integration_method = "CCAIntegration"
)
```

```{r scVI Integration, warning=FALSE, eval=FALSE}
# (Optional) Integration with scVI-Model
SCE_obj <- DO.scVI(
    sce_object = SCE_obj,
    batch_key = "orig.ident",
    layer_counts = "counts",
    layer_logcounts = "logcounts"
)


SNN_Graph <- scran::buildSNNGraph(SCE_obj,
    use.dimred = "scVI"
)

clust_SCVI <- igraph::cluster_louvain(SNN_Graph,
    resolution = 0.3
)

SCE_obj$leiden0.3 <- factor(igraph::membership(clust_SCVI))
SCE_obj <- scater::runUMAP(SCE_obj, dimred = "scVI", name = "UMAP")
```


After the integration finished, both corrected expression matrices can be found
saved in the SCE object and can be used for cluster calculations and UMAP
projections. In this case, we will continue with the CCA Integration method.

```{r Clustering and UMAP, warning = FALSE}
DO.UMAP(SCE_obj,
    group.by = "leiden0.3"
)

DO.UMAP(SCE_obj,
    group.by = "condition",
    legend.position = "right",
    label = FALSE
)
```


## Semi-automatic annotation with Celltypist
Next up, we implemented a wrapper around the semi-automatic annotation tool
`r pypkg("celltypist")`. It will annotate the defined clusters based on the
`Adult_COVID19_PBMC.pkl` model.

```{r annotation, warning = FALSE}
SCE_obj <- DO.CellTypist(SCE_obj,
    modelName = "Healthy_COVID19_PBMC.pkl",
    runCelltypistUpdate = TRUE,
    over_clustering = "leiden0.3"
)
DO.UMAP(SCE_obj, group.by = "autoAnnot", legend.position = "right")
```

The semi-automatic annotation is a good estimate of the cell types in your 
object. But you should always manually validate the findings of the model. 
You can manually define a set of marker genes for the cell population or check 
the most preeminent genes per cluster by using scran's `findMarkers` function.
Marker genes can also be visualised using the `DO.UMAP` function.

```{r Manual annotation, fig.width=12, fig.height=7}
markers_list <- scran::findMarkers(
    SCE_obj,
    test.type = "t",
    groups = SingleCellExperiment::colData(SCE_obj)$autoAnnot,
    direction = "up",
    lfc = 0.25,
    pval.type = "any"
)

# pick top 5 per cluster, naming adjustments
annotation_Markers <- lapply(names(markers_list), function(cluster) {
    df <- as.data.frame(markers_list[[cluster]])
    df$gene <- rownames(df)
    df$cluster <- cluster
    df %>%
        rename(
            avg_log2FC = summary.logFC,
            p_val      = p.value,
            p_val_adj  = FDR
        ) %>%
        dplyr::select(gene, cluster, avg_log2FC, p_val, p_val_adj)
}) %>%
    bind_rows()

# or with seurat if preferred
Seu_obj <- as.Seurat(SCE_obj)
annotation_Markers <- FindAllMarkers(
    object = Seu_obj,
    assay = "RNA",
    group.by = "autoAnnot",
    min.pct = 0.25,
    logfc.threshold = 0.25
)

annotation_Markers <- annotation_Markers %>%
    arrange(desc(avg_log2FC)) %>%
    distinct(gene, .keep_all = TRUE) %>%
    group_by(cluster) %>%
    slice_head(n = 5)

p1 <- DO.Dotplot(
    sce_object = SCE_obj,
    Feature = annotation_Markers,
    group.by.x = "leiden0.3",
    plot.margin = c(1, 1, 1, 1),
    annotation_x = TRUE,
    point_stroke = 0.1,
    annotation_x_rev = TRUE,
    textSize = 14,
    hjust = 0.5,
    vjust = 0,
    textRot = 0,
    segWidth = 0.3,
    lwd = 3
)

# manual set of markers
annotation_Markers <- data.frame(
    cluster = c(
        "ImmuneCells",
        rep("B_cells", 3),
        rep("T_cells", 3),
        rep("NK", 2),
        rep("Myeloid", 3),
        rep("pDC", 3)
    ),
    genes = c(
        "PTPRC", "CD79A", "BANK1", "MS4A1",
        "CD3E", "CD4", "IL7R", "NKG7",
        "KLRD1", "CD68", "CD14", "ITGAM",
        "LILRA4", "CLEC4C", "LRRC26"
    )
)

p2 <- DO.Dotplot(
    sce_object = SCE_obj,
    Feature = annotation_Markers,
    group.by.x = "leiden0.3",
    plot.margin = c(1, 1, 1, 1),
    annotation_x = TRUE,
    point_stroke = 0.1,
    annotation_x_rev = TRUE,
    textSize = 14,
    hjust = 0.5,
    vjust = 0,
    textRot = 0,
    segWidth = 0.3,
    lwd = 3
)

# Visualise marker expression in UMAP
DO.UMAP(SCE_obj,
    FeaturePlot = TRUE,
    features = "NKG7",
    group.by = "leiden0.3",
    legend.position = "right"
)
```

The manual markers for the immune cells show an agreement for the annotation 
therefore we can continue with it after some minor adjustments.

```{r renaming}
SCE_obj$annotation <- plyr::revalue(SCE_obj$leiden0.3, c(
    `1` = "T_cells",
    `2` = "T_cells",
    `3` = "NK",
    `4` = "B_cells",
    `5` = "Monocytes",
    `6` = "NK",
    `7` = "T_cells",
    `8` = "pDC"
))

DO.UMAP(SCE_obj, group.by = "annotation", legend.position = "right")
```

## Cell composition

After the identification of the celltype populations, we can also evaluate if 
there are significant changes in these populations in the healthy and diseased 
condition using a wrapper function around the python tool `r pypkg("scanpro")`.

```{r cell populations, warning=FALSE}
DO.CellComposition(SCE_obj,
    assay_normalized = "RNA",
    cluster_column = "annotation",
    sample_column = "orig.ident",
    condition_column = "condition",
    transform_method = "arcsin",
    n_reps = 3
)
```


## Reclustering of cell populations

Subpopulations can be tricky to find, therefore it is always a good practice to 
perform a reclustering of a given cell populations, if we are interested in a 
specific set of cells in a population. Here for example in the T cells. We will 
identify the subpopulations and then markers defining them.

```{r Recluster, warning=FALSE}
SCE_obj <- DO.FullRecluster(SCE_obj, over_clustering = "annotation")
DO.UMAP(SCE_obj, group.by = "annotation_recluster")
T_cells <- DO.Subset(SCE_obj,
    ident = "annotation_recluster",
    ident_name = grep("T_cells",
        unique(SCE_obj$annotation_recluster),
        value = TRUE
    )
)

T_cells <- DO.CellTypist(T_cells,
    modelName = "Healthy_COVID19_PBMC.pkl",
    runCelltypistUpdate = FALSE,
    over_clustering = "annotation_recluster",
    SeuV5 = FALSE
)

T_cells$annotation <- plyr::revalue(
    T_cells$annotation_recluster,
    c(
        `T_cells_1` = "CD4_T_cells",
        `T_cells_2` = "CD4_T_cells",
        `T_cells_3` = "CD4_T_cells",
        `T_cells_4` = "CD8_T_cells"
    )
)
```

Now that we identified the marker genes describing the different T cell 
populations. We can re-annotate them based on their expression profile and a 
new prediciton from Celltypist. After this we, can easily transfer the labels 
in the subset to the original object.

```{r re-annotate}
SCE_obj <- DO.TransferLabel(SCE_obj,
    Subset_obj = T_cells,
    annotation_column = "annotation",
    subset_annotation = "annotation"
)

DO.UMAP(SCE_obj, group.by = "annotation", legend.position = "right")
```

## Gene ontology analysis

To explore which biological processes are enriched in a specific cell type 
across conditions, we can perform gene ontology analysis. We'll start by 
identifying differentially expressed genes, focusing here on T cells. For 
differential gene expression analysis, we introduced a new function, which 
combines DGE analysis using a single cell approach, e.g. the popular Wilcoxon 
test and a pseudobulk testing using DESeq2. We can then observe the results in 
a combined dataframe.


```{r GO analysis, warning=FALSE}
# this data set contains only one sample per condition
# we introduce replicates for showing the pseudo bulk approach
set.seed(123)
SCE_obj$orig.ident2 <- sample(rep(c("A", "B", "C", "D", "E", "F"),
    length.out = ncol(SCE_obj)
))
CD4T_cells <- DO.Subset(SCE_obj,
    ident = "annotation",
    ident_name = "CD4_T_cells"
)

DGE_result <- DO.MultiDGE(CD4T_cells,
    sample_col = "orig.ident2",
    method_sc = "wilcox",
    ident_ctrl = "healthy"
)

head(DGE_result, 10) %>%
    kable(format = "html", table.attr = "style='width:100%;'") %>%
    kable_styling(bootstrap_options = c(
        "striped",
        "hover",
        "condensed",
        "responsive"
    ))
```

After inspecting the DGE analysis, we continue with `DO.enrichR` function, 
which uses the enrichR API to run gene set enrichment. It separates the DE 
genes into up- and down-regulated sets and runs the analysis for each group 
independently.

```{r GO analysis2, warning=FALSE}
result_GO <- DO.enrichR(
    df_DGE = DGE_result,
    gene_column = "gene",
    pval_column = "p_val_adj_SC_wilcox",
    log2fc_column = "avg_log2FC_SC_wilcox",
    pval_cutoff = 0.05,
    log2fc_cutoff = 0.25,
    path = NULL,
    filename = "",
    species = "Human",
    go_catgs = "GO_Biological_Process_2023"
)

head(result_GO, 5) %>%
    kable(format = "html", table.attr = "style='width:100%;'") %>%
    kable_styling(bootstrap_options = c(
        "striped",
        "hover",
        "condensed",
        "responsive"
    ))
```

The top significant results can then be visualized in a bar plot.

```{r GO visualisation, fig.width=10, fig.height=8}
result_GO_sig <- result_GO[result_GO$Adjusted.P.value < 0.05, ]
result_GO_sig$celltype <- "CD4T_cells"

DO.SplitBarGSEA(
    df_GSEA = result_GO_sig,
    term_col = "Term",
    col_split = "Combined.Score",
    cond_col = "State",
    pos_cond = "enriched",
    showP = FALSE,
    path = paste0(base, "/")
)
GSEA_plot <- list.files(
    path = base,
    pattern = "SplitBar.*\\.svg$",
    full.names = TRUE,
    recursive = TRUE
)
plot(magick::image_read_svg(GSEA_plot))
```

## Candidate gene visualisation

After performing DGE and GO analyses, discovering whether specific genes are
regulated in a particular disease state and/or cell type is a common step.
To address this, we implemented advanced methods in our functions that
provide summarised results and incorporate statistical testing to answer these
questions efficiently.

The `DO.Dotplot` function covers the expression over three variables at the same
time along with statistical testing. For example, we can visualise the 
expression of a gene across cell types and conditions:

```{r dotplot, fig.width=5, fig.height=5}
DO.Dotplot(
    sce_object = SCE_obj,
    group.by.x = "condition",
    group.by.y = "annotation",
    Feature = "NKG7",
    stats_y = TRUE
)
```

The `DO.Heatmap` function shows the expression of multiple genes in a publish 
ready way, including statistical testing:

```{r Heat map}
#| out.width: "100%"
#| fig.align: "center"
#| fig.width: 10
#| fig.height: 8
#| warning: FALSE

path_file <- tempfile("dotools_plots_")
dir.create(path_file, recursive = TRUE, showWarnings = FALSE)

DO.Heatmap(SCE_obj,
    group_by = "leiden0.3",
    features = rownames(SCE_obj)[1:10],
    xticks_rotation = 45,
    path = path_file,
    stats_x_size = 20,
    showP = FALSE
)

Heatmap_plot <- list.files(
    path = path_file,
    pattern = "Heatmap*\\.svg$",
    full.names = TRUE,
    recursive = TRUE
)

plot(magick::image_read_svg(Heatmap_plot))
```

We can visualize the average expression of a gene in a cell type + condition
or we can plot continuous metadata information across conditions with
violinplots, barplots and boxplots. Additionally, we can test for significance.

```{r Violin, fig.width=8, fig.height=5.5, warning = FALSE}
SCE_obj_sub <- DO.Subset(SCE_obj,
    ident = "annotation",
    ident_name = c("NK", "CD4_T_cells", "B_cells")
)

DO.VlnPlot(SCE_obj_sub,
    Feature = "NKG7",
    group.by = "condition",
    group.by.2 = "annotation",
    ctrl.condition = "healthy"
)
```

```{r Bar, fig.width=5, fig.height=6, warning = FALSE}
SCE_obj_NK <- DO.Subset(SCE_obj,
    ident = "annotation",
    ident_name = "NK"
)

DO.BarplotWilcox(SCE_obj_NK,
    group.by = "condition",
    ctrl.condition = "healthy",
    Feature = "NKG7",
    x_label_rotation = 0
)
```

```{r Box, fig.width=5, fig.height=6, warning = FALSE}
set.seed(123)
SCE_obj$rdm_sample <- sample(rep(c("A", "B", "C"),
    length.out = ncol(SCE_obj)
))
DO.BoxPlot(SCE_obj,
    group.by = "rdm_sample",
    ctrl.condition = "A",
    Feature = "nCount_RNA",
    step_mod = 50,
    stat_pos_mod = 1.05,
    plot_sample = FALSE
)
```

# Session information

```{r session_info, echo=FALSE}
options(width = 120)
sessioninfo::session_info()
```