---
title: "Using SuperCellCyto for Stratified Summarising"
author: "Givanna Putri"
output: BiocStyle::html_document
vignette: >
    %\VignetteIndexEntry{using-runsupercellcyto-for-stratified-summarising}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r global_options, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

If you have followed the vignette on 
[how to create supercells](SuperCellCyto.html),
you may wonder whether `SuperCellCyto` can replace 
stratified sampling to avoid overcrowding a UMAP/tSNE plot.

The short answer is yes.
We call this **stratified summarising**.
To do this, we set the sample column of our data 
not to the biological sample each cell came from, but to the column 
we want to stratify the data by.

For example, when drawing a UMAP or tSNE plot, we commonly subsample each 
cluster or cell type to avoid crowding the plot.
Instead of subsampling, we can generate supercells for each cluster or cell 
type by passing the column that denotes the cluster or cell type 
each cell belongs to as the `sample_colname` parameter.
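
For contrast, here is a minimal base R sketch of the conventional stratified
subsampling that stratified summarising is meant to replace.
The toy data frame, its `cell_id` and `cluster` columns, and the
`n_per_cluster` cap are all hypothetical and chosen only for illustration;
they are not part of `SuperCellCyto`.

```{r stratified_subsampling_sketch}
# Hypothetical toy data: 1,000 cells with an assumed cluster label column.
set.seed(1)
toy <- data.frame(
    cell_id = paste0("cell_", seq_len(1000)),
    cluster = sample(
        c("A", "B", "C"),
        size = 1000, replace = TRUE, prob = c(0.7, 0.2, 0.1)
    )
)

# Keep at most `n_per_cluster` randomly chosen cells from each cluster.
n_per_cluster <- 50
rows_by_cluster <- split(seq_len(nrow(toy)), toy$cluster)
kept_rows <- unlist(lapply(rows_by_cluster, function(idx) {
    idx[sample.int(length(idx), size = min(n_per_cluster, length(idx)))]
}))
toy_subsampled <- toy[kept_rows, ]

# Number of cells retained per cluster.
table(toy_subsampled$cluster)
```

Subsampling like this discards every cell that is not drawn, whereas
supercells summarise all the cells in each cluster.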

Let's illustrate this using a toy dataset clustered with k-means.

```{r simulate_and_cluster}
library(SuperCellCyto)

set.seed(42)

# Simulate some data
dat <- simCytoData()
markers_col <- paste0("Marker_", seq_len(10))
cell_id_col <- "Cell_Id"

# Run kmeans
clust <- kmeans(
    x = dat[, markers_col, with = FALSE],
    centers = 5
)

clust_col <- "kmeans_clusters"
dat[[clust_col]] <- paste0("cluster_", clust$cluster)
```

To perform stratified summarising, we supply the cluster column
(`kmeans_clusters` in the example above) as `runSuperCellCyto`'s 
`sample_colname` parameter.

```{r run_supercellcyto_stratified}
supercells <- runSuperCellCyto(
    dt = dat,
    markers = markers_col,
    sample_colname = clust_col,
    cell_id_colname = cell_id_col
)
```

Now, if we look at the `supercell_expression_matrix`, each row 
(each supercell) is labelled with the cluster it belongs to, and 
*not the biological sample it came from*:

```{r inspect_supercell_matrix}
# Inspect the first 3 and last 3 rows of the expression matrix for selected columns.
rbind(
    head(supercells$supercell_expression_matrix, n = 3),
    tail(supercells$supercell_expression_matrix, n = 3)
)[, c("kmeans_clusters", "SuperCellId", "Marker_10")]
```

If we compare the number of supercells created with the number of cells
in each cluster, we find that each cluster yields approximately 
`n_cells_in_the_cluster/20` supercells, where 20 is the value of the `gam` 
parameter used by `runSuperCellCyto` (the default).

```{r cells_per_cluster}
# Count the cells per cluster and divide by 20, the gam value.
table(dat$kmeans_clusters) / 20
```

```{r supercells_per_cluster}
table(supercells$supercell_expression_matrix$kmeans_clusters)
```
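
The arithmetic behind this comparison can be sketched on its own.
The cluster sizes below are made up for illustration and are not taken from
`dat`, and the use of `round()` is an assumption about how fractional counts
are handled, so treat the numbers as approximate.

```{r expected_supercells_sketch}
# Hypothetical cluster sizes, for illustration only (not taken from `dat`).
cluster_sizes <- c(cluster_1 = 12000, cluster_2 = 4500, cluster_3 = 830)
gam <- 20

# Approximate number of supercells per cluster; rounding is an assumption.
expected_supercells <- round(cluster_sizes / gam)
expected_supercells

# Overall reduction in the number of points to plot, which is close to gam.
sum(cluster_sizes) / sum(expected_supercells)
```

Because every cluster is reduced by roughly the same factor, the relative
sizes of the clusters are largely preserved in the supercell data.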

## Session information
```{r session_info}
sessionInfo()
```