---
title: "SpaNorm: Spatially aware library size normalisation"
author: "Dharmesh D. Bhuva and Agus Salim"
date: "`r BiocStyle::doc_date()`"
output:
  prettydoc::html_pretty:
    theme: cayman
    toc: yes
    toc_depth: 2
    number_sections: yes
    fig_caption: yes
    df_print: paged
abstract: > 
  This package implements the spatially aware library size normalisation algorithm, SpaNorm. SpaNorm normalises out library size effects while retaining biology through the modelling of smooth functions for each effect. Normalisation is performed in a gene- and cell-/spot- specific manner, yielding library size adjusted data.
vignette: >
  %\VignetteIndexEntry{SpaNorm}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}

bibliography: ../inst/REFERENCES.bib
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  message = FALSE,
  comment = "#>"
)

# load libraries without messages
library(SpaNorm)
library(ggplot2)
library(patchwork)
library(SpatialExperiment)
library(scater)

update_geom_defaults("point", aes(size = 0.5))
```

# SpaNorm

SpaNorm is a spatially aware library size normalisation method that removes library size effects, while retaining biology. Library sizes need to be removed from molecular datasets to allow comparisons across observations, in this case, across space. Bhuva et al. [@Bhuva2024] and Atta et al. [@Atta2023] have shown that standard single-cell inspired library size normalisation approaches are not appropriate for spatial molecular datasets as they often remove biological signals while doing so. This is because library size confounds biology in spatial molecular data.

![_The SpaNorm workflow: SpaNorm takes the gene expression data and spatial coordinates as inputs. Using a gene-wise model (e.g., Negative Binomial (NB)), SpaNorm decomposes spatially-smooth variation into those unrelated to library size (LS), representing the underlying true biology and those related to library size. The adjusted data is then produced by keeping only the variation unrelated to library size._](SpaNormWorkflow.png)

SpaNorm uses a unique approach to spatially constraint modelling approach to model gene expression (e.g., counts) and remove library size effects, while retaining biology. It achieves this through three key innovations:

1. Optmial decomposition of spatial variation into spatially smooth library size associated (technical) and library size independent (biology) variation using generalized linear models (GLMs).
1. Computing spatially smooth functions (using thin plate splines) to represent the gene- and location-/cell-/spot- specific size factors.
1. Adjustment of data using percentile adjusted counts (PAC) [@Salim2022], as well as other adjustment approaches (e.g., Pearson).

The SpaNorm package can be installed as follows:

```{r eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# release version
BiocManager::install("SpaNorm")
# development version from GitHub
BiocManager::install("bhuvad/SpaNorm")
```

# Load count data

We begin by loading some example 10x Visium data profiling the dorsolateral prefrontal cortex (DLPFC) of the human brain. The data has ~4,000 spots and covers genome-wide measurements. The example data here is filtered to remove lowly expressed genes (using `filterGenes(HumanDLPFC, prop = 0.1)`).  This filtering retains genes that are expressed in at least 10% of cells.

```{r fig.width=4, fig.height=4.25}
library(SpaNorm)
library(SpatialExperiment)
library(ggplot2)

# load sample data
data(HumanDLPFC)
# change gene IDs to gene names
rownames(HumanDLPFC) = rowData(HumanDLPFC)$gene_name
HumanDLPFC

# plot regions
p_region = plotSpatial(HumanDLPFC, colour = AnnotatedCluster, size = 0.5) +
  scale_colour_brewer(palette = "Paired", guide = guide_legend(override.aes = list(shape = 15, size = 5))) +
  ggtitle("Region")
p_region
```

The `filterGenes` function returns a logical vector indicating which genes should be kept.

```{r}
# filter genes expressed in 20% of spots
keep = filterGenes(HumanDLPFC, 0.2)
table(keep)
# subset genes
HumanDLPFC = HumanDLPFC[keep, ]
```

The log-transformed raw counts are visualised below for the gene _MOBP_ which is a marker of oligodendrocytes enriched in the white matter (WM) [@Maynard2021]. Despite being a marker of this region, we see that it is in fact absent from the white matter region.

```{r fig.width=7.5, fig.height=4.25}
logcounts(HumanDLPFC) = log2(counts(HumanDLPFC) + 1)

p_counts = plotSpatial(
    HumanDLPFC,
    colour = MOBP,
    what = "expression",
    assay = "logcounts",
    size = 0.5
  ) +
  scale_colour_viridis_c(option = "F") +
  ggtitle("logCounts")
p_region + p_counts
```

# Normalise count data

SpaNorm normalises data in two steps: (1) fitting the SpaNorm model of library sizes; (2) adjusting data using the fit model. A single call to the `SpaNorm()` function is enough to run these two steps. To speed up computation, the model is fit using a smaller proportion of spots/cells (default is 0.25). The can be modified using the `sample.p` parameter.

```{r message=TRUE}
set.seed(36)
HumanDLPFC = SpaNorm(HumanDLPFC)
HumanDLPFC
```

The above output (which can be switched off by setting `verbose = FALSE`), shows the two steps of normalisation. In the model fitting step, `r round(0.25 * ncol(HumanDLPFC))` cells/spots are used to fit the negative binomial (NB) model. Subsequent output shows that this fit is performed by alternating between estimation of the dispersion parameter and estimation of the NB parameters by fixing the dispersion. The output also shows that each intermediate fit converges, and so does the final fit. The accuracy of the fit can be controlled by modifying the tolerance parameter `tol` (default `1e-4`). 

Next, data is adjusted using the fit model. The following approaches are implemented for count data:

1. `adj.method = "logpac"` (default) - percentile adjusted counts (PAC) which estimates the count for each gene at each location/spot/cell using a model that does not contain unwanted effects such as the library size.
1. `adj.method = "person"` - Pearson residuals from factoring out unwanted effects.
1. `adj.method = "meanbio"` - the mean of each gene at each location estimated from the biological component of the model.
1. `adj.method = "medbio"` - the median of each gene at each location estimated from the biological component of the model.

These data are stored in the `logcounts` assay of the SpatialExperiment object. After normalisation, we see that MOBP is enriched in the white matter.

```{r fig.width=7.5, fig.height=4.25}
p_logpac = plotSpatial(
    HumanDLPFC,
    colour = MOBP,
    what = "expression",
    assay = "logcounts",
    size = 0.5
  ) +
  scale_colour_viridis_c(option = "F") +
  ggtitle("logPAC")
p_region + p_logpac
```

# Using Seurat objects

Users may prefer to work with Seurat. Below we convert the `SpatialExperiment` object to Seurat (v5) and add spatial coordinates extracted from the original `SpatialExperiment`. The SpaNorm function can then be run directly on the Seurat object. Note that the `counts` assay is used to store raw counts and the `data` assay is used to store normalised values, following Seurat conventions. SpaNorm searches for coordinates in the `images` slot which contains the `coordinates` slot. If not found, it will also search `meta.data` slot of the Seurat object with names `x` and `y`, or with names `imagecol` and `imagerow`. This is the order of the search of coordinates

```{r}
library(Seurat)

# create a Seurat object with the data
seurat_obj = HumanDLPFC
logcounts(seurat_obj) = counts(seurat_obj)
seurat_obj = suppressWarnings(Seurat::as.Seurat(seurat_obj))  
# add spatial coordinates to Seurat meta.data from SpatialExperiment
coords = SpatialExperiment::spatialCoords(HumanDLPFC)
seurat_obj@meta.data$x = coords[, 1]
seurat_obj@meta.data$y = coords[, 2]

# run SpaNorm on Seurat (CPU backend)
seurat_obj <- SpaNorm(seurat_obj, sample.p = 0.1, df.tps = 2, backend = "cpu", verbose = TRUE)
seurat_obj
```

# Computing alternative adjustments using a precomputed SpaNorm fit

As no appropriate slot exists for storing model parameters, we currently save them in the metadata slot with the name "SpaNorm". This also means that subsetting features (i.e., genes) or observations (i.e., cells/spots/loci) does not subset the model. In such an instance, the SpaNorm function will realise that the model no longer matches the data and re-estimates when called. If instead the model is valid for the data, the existing fit is extracted and reused.

The fit can be manually retrieved as below for users wishing to reuse the model outside the SpaNorm framework. Otherwise, calling `SpaNorm()` on an object containing the fit will automatically use it.

```{r}
# manually retrieve model
fit.spanorm = metadata(HumanDLPFC)$SpaNorm
fit.spanorm
```

When a valid fit exists in the object, only the adjustment step is performed. The model is recomputed if `overwrite = TRUE` or any of the following parameters change: degrees of freedom (`df.tps`), penalty parameters(`lambda.a`), object dimensions, or `batch` specification. Alternative adjustments can be computed as below and stored to the `logcounts` assay.

```{r fig.width=11.5, fig.height=8.5}
# Pearson residuals
HumanDLPFC = SpaNorm(HumanDLPFC, adj.method = "pearson")
p_pearson = plotSpatial(
    HumanDLPFC,
    colour = MOBP,
    what = "expression",
    assay = "logcounts",
    size = 0.5
  ) +
  scale_colour_viridis_c(option = "F") +
  ggtitle("Pearson")

# meanbio residuals
HumanDLPFC = SpaNorm(HumanDLPFC, adj.method = "meanbio")
p_meanbio = plotSpatial(
    HumanDLPFC,
    colour = MOBP,
    what = "expression",
    assay = "logcounts",
    size = 0.5
  ) +
  scale_colour_viridis_c(option = "F") +
  ggtitle("Mean biology")

# meanbio residuals
HumanDLPFC = SpaNorm(HumanDLPFC, adj.method = "medbio")
p_medbio = plotSpatial(
    HumanDLPFC,
    colour = MOBP,
    what = "expression",
    assay = "logcounts",
    size = 0.5
  ) +
  scale_colour_viridis_c(option = "F") +
  ggtitle("Median biology")

p_region + p_counts + p_logpac + p_pearson + p_meanbio + p_medbio + plot_layout(ncol = 3)
```

The mean biology adjustment shows a significant enrichment of the _MOBP_ gene in the white matter. As the overall counts of this gene are low in this sample, other methods show less discriminative power.

# Varying model complexity

The complexity of the spatial smoothing function is determined by the `df.tps` parameter where larger values result in more complicated functions (default 6). 

```{r fig.width=7.5, fig.height=4.25}
# df.tps = 2
HumanDLPFC_df2 = SpaNorm(HumanDLPFC, df.tps = 2)
p_logpac_2 = plotSpatial(
    HumanDLPFC,
    colour = MOBP,
    what = "expression",
    assay = "logcounts",
    size = 0.5
  ) +
  scale_colour_viridis_c(option = "F") +
  ggtitle("logPAC (df.tps = 2)")

# df.tps = 6 (default)
p_logpac_6 = p_logpac +
  ggtitle("logPAC (df.tps = 6)")

p_logpac_2 + p_logpac_6
```

# Enhancing signal

As the counts for the MOBP gene are very low, we see artifacts in the adjusted counts. As we have a model for the genes, we can increase the signal by adjusting all means by a constant factor. Applying a scale factor of 4 shows how the adjusted data are more continuous, with significant enrichment in the white matter.

```{r fig.width=7.5, fig.height=4.25}
# scale.factor = 1 (default)
HumanDLPFC = SpaNorm(HumanDLPFC, scale.factor = 1)
p_logpac_sf1 = plotSpatial(
    HumanDLPFC,
    colour = MOBP,
    what = "expression",
    assay = "logcounts",
    size = 0.5
  ) +
  scale_colour_viridis_c(option = "F") +
  ggtitle("logPAC (scale.factor = 1)")

# scale.factor = 4
HumanDLPFC = SpaNorm(HumanDLPFC, scale.factor = 4)
p_logpac_sf4 = plotSpatial(
    HumanDLPFC,
    colour = MOBP,
    what = "expression",
    assay = "logcounts",
    size = 0.5
  ) +
  scale_colour_viridis_c(option = "F") +
  ggtitle("logPAC (scale.factor = 4)")

p_logpac_sf1 + p_logpac_sf4 + plot_layout(ncol = 2)
```

# Exploring learnt functions

The `plotCovariate()` function can be used to explore the learnt functions. We could study what the model has learnt about the biology and library size effects of the _MOBP_ gene.

```{r fig.width=7.5, fig.height=4.25}
p1 = plotCovariate(HumanDLPFC, colour = MOBP, covariate = "biology") +
  scale_colour_viridis_c(option = "F") +
  ggtitle("Biology")
p2 = plotCovariate(HumanDLPFC, colour = MOBP, covariate = "ls") +
  scale_colour_viridis_c(option = "F") +
  ggtitle("Library size effect")
p1 + p2
```

# Identifying spatially variable genes

The `SpaNormSVG()` function can be used to identify spatially variable genes (SVGs) in the data. This function fits a nested model without the biological function of the form:

$$
log \mu_{c,g} = \varsigma_{g} + \{\alpha + h_g(x_c, y_c;\gamma_g)\}logLS_c
$$

where $\mu_{c,g}$ is the mean of gene $g$ at location $c$, $\varsigma_g$ is the log-mean of gene $g$, $h_g(x_c, y_c)$ is a smooth function of the spatial coordinates $x_c$ and $y_c$ representing the library size effect, and $\alpha$ is the log-mean of the library size.

The `SpaNormSVG()` function fits this model and then uses an F-test to identify spatially variable genes (SVGs).

```{r message=TRUE}
HumanDLPFC = SpaNormSVG(HumanDLPFC)
HumanDLPFC
```

The `topSVGs()` function can be used to retrieve the top spatially variable genes (SVGs) at a given false discovery rate (FDR). These are stored in the `rowData` slot of the SpatialExperiment object.

```{r}
svgs = topSVGs(HumanDLPFC, n = 10)
svgs
```

We can visualise the spatially variable genes using the `plotSpatial()` function.

```{r fig.width=12, fig.height=12}
# fix gene names
rownames(HumanDLPFC) = gsub("-", ".", rownames(HumanDLPFC))
rownames(svgs) = gsub("-", ".", rownames(svgs))

lapply(rownames(svgs)[1:9], function(g) {
  plotSpatial(HumanDLPFC, colour = !!sym(g), what = "expression", assay = "logcounts", size = 0.5) +
    scale_colour_viridis_c(option = "F") +
    ggtitle(g) +
    theme(legend.position = "bottom")
}) |> 
  wrap_plots(ncol = 3)
```

# GLM-PCA

PCA on log-transfomed counts has been show to distort low dimensional features in single-cell datasets [@Townes2019]. As spatial transcriptomics data follows similar distributions, this is also the case for spatial transcriptomics data. GLM-PCA which computes PCA directly on the counts is a better approach to perform PCA on spatial transcriptomics data. While the GLM-PCA algorithm itself is computationally intensive, an approximation proposed in the original manuscripts is to fit a null model to the data and perform PCA on the deviance or Pearson residuals. This null model is the one fit above to estimate SVGs. SpaNorm implements the GLM-PCA approximation using the `SpaNormPCA()` function. Users can specify the features to use from the SVG calling, using cutoffs for the top SVGs to use (recommended) or using a FDR cutoff. The interface to this function is similar to the `scater::runPCA()` function and the results are stored in the `reducedDim` slot of the SpatialExperiment object using the proposed name (default is "PCA"). This function should always be run calling SVGs (`SpaNormSVG()`) first.

```{r fig.width=6, fig.height=5}
library(scater)

HumanDLPFC = SpaNormPCA(HumanDLPFC, ncomponents = 50, svg.fdr = 0.2, nsvgs = Inf)
HumanDLPFC = runUMAP(HumanDLPFC, n_neighbors = 20, min_dist = 0.3)
plotUMAP(HumanDLPFC, colour_by = "AnnotatedCluster", size_by = "cell_count") +
  scale_colour_brewer(palette = "Paired", guide = guide_legend(override.aes = list(shape = 15, size = 5))) +
  labs(title = "UMAP derived from SpaNorm PCA", colour = "Cluster")
```

# Session information

```{r}
sessionInfo()
```

# References