---
title: 'Proteomics vignette: dimensionality reduction and imputation 
with omicsGMF'
author:
- name: Alexandre Segers
bibliography: sgdGMF.bib
date: "14/01/2025"
output: 
  BiocStyle::html_document:
    toc: true
    toc_depth: 3
  BiocStyle::pdf_document: default
package: omicsGMF
abstract: |
  Proteomics vignette for the omicsGMF package. This vignette aims to provide a 
  detailed description of a matrix factorization on proteomics data, 
  which can further be used for dimensionality reduction, visualization and 
  imputation of missing values.
vignette: >
  %\VignetteIndexEntry{Proteomics-vignette: omicsGMF}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
  
```{r, echo = FALSE}
library(knitr)
```

# Introduction

`omicsGMF` is an R package for generalized matrix factorization and missing 
value imputation in omics data. It is designed for dimensionality reduction and 
visualization, specifically handling count data and missing values efficiently. 
Unlike conventional PCA, `omicsGMF` does not require log-transformation of 
RNA-seq data or prior imputation of proteomics data.

A key advantage of `omicsGMF` is its ability to control for known sample- and 
feature-level covariates, such as batch effects. This improves downstream 
analyses like clustering. Additionally, omicsGMF includes model selection to 
optimize the number of latent confounders, ensuring an optimal dimensionality 
for analysis. Its stochastic optimization algorithms allow it to remain fast 
while handling these complex data structures.

`omicsGMF` builds on the `sgdGMF` framework provided in the `sgdGMF` CRAN 
package, and provides easy integration with `SingleCellExperiment`, 
`SummarizedExperiment`,
and `QFeature` classes, with adapted default values for the optimization 
arguments when dealing with omics data.

All details about the `sgdGMF` framework, such as the adaptive learning rates,
exponential gradient averaging and subsampling of the data are 
described in our preprint [@Castiglione2024]. There, we show the use of the
`sgdGMF-framework` on single-cell RNA-seq data. In our newest preprint 
[@Segers2025], we show how `omicsGMF` can be used to visualize (single-cell) 
proteomics data and impute missing values.


This vignette provides a step-by-step workflow for using `omicsGMF` for
dimensionality reduction of omics data. The main function are:


1. `runCVGMF` or `calculateCVGMF` performs cross-validation to determine the 
optimal number of latent confounders. These results can be visualized using 
`plotCV`. This cross-validation avoids arbitrarily choosing `ncomponents`, but
requires some computational time. An alternative is `calculateRankGMF`, which
performs an eigenvalue decomposition on the deviance residuals. This allows for
model selection based on a scree plot using `plotRank`, for example using
the elbow method.

2. `runGMF` or `calculateGMF` estimates the latent confounders and the rotation
matrix, and estimates the respective parameters of the sample-level and 
feature-level covariates.

3. `plotGMF` plots the samples using its decomposition.

4. `imputeGMF` creates a new assay with missing values imputed using the 
estimates of `runGMF`.

We here apply `omicsGMF` on proteomics data.


# Package installation

`sgdGMF` can be installed through CRAN.
`omicsGMF` can be installed from github, and will be soon available
through Bioconductor.

```{r, eval=FALSE}
if(!requireNamespace("sgdGMF", quietly = TRUE))
    install.packages("sgdGMF")

BiocManager::install("omicsGMF")
```


```{r, echo = TRUE, warning=FALSE, message=FALSE}
library(sgdGMF)
library(omicsGMF)
library(dplyr)
library(scuttle)
set.seed(100)
```


# Proteomics data analysis

To perform dimensionality reduction on proteomics data, one can use the 
log-transformed intensities, which makes the data Gaussian distributed. 
Optionally, one can opt to perform normalization such as median-normalization,
although this is not required, but might enhance numerical stability and 
convergence speed. For proteomics data, `family = gaussian()` should be used 
in the data analysis, and missing values should not be imputed prior to the 
matrix factorization as omicsGMF deals with these internally. If the goal is to 
impute missing values after matrix decomposition, one should include all 
features in the analysis, therefore ignoring the `ntop` argument.

For the proteomics vignette, we will simulate some artificial Gaussian 
data, and introduce some missing values completely ad random.
We store the simulated intensities in the `logintensities` assay of a 
`SingleCellExperiment`. For sake of exposition, we also include a batch
effect in the `colData`. `omicsGMF` can internally correct for these batch 
effects, and therefore does not require prior correction with other tools.


```{r}
sim_intensities <- matrix(rnorm(n = 20*50, mean = 1, sd = 1),
                          ncol = 20)
NAs <- rbinom(n = prod(dim(sim_intensities)), size = 1, prob = 0.3) == 1
sim_intensities[NAs] <- NA

colnames(sim_intensities) <- paste0("S_", c(1:20))
rownames(sim_intensities) <-  paste0("G_", c(1:50))

example_sce <- SingleCellExperiment(
    assays = SimpleList("logintensities" = sim_intensities),
    colData = data.frame("Batch" = rep(c("Batch1", "Batch2"), each = 10)))

X <- model.matrix(~Batch, data = colData(example_sce))
```


A recommended step is to estimate the optimal dimensionality in the model
by using cross-validation. This cross-validation masks a proportion of the
values as missing, and tries to reconstruct these. Using the out-of-sample 
deviances, one can estimate the optimal dimensionality of the latent space.
This cross-validation can be done with the `runCVGMF` or `calculateCVGMF` 
function, which builds 
on the sgdgmf.cv function from the `sgdGMF` package. Although the 
`sgdGMF` framework allows great flexibility regarding the optimization 
algorithm, sensible default values are here introduced for omics data. One
should choose the correct distribution family (`family`), the number
of components in the dimensionality reduction for which the cross-validation is
run (`ncomponents`), and the known covariate matrices to account for 
(`X` and `Z`). Also, one should select the right assay that is used for
dimensionality reduction (`exprs_values` or `assay.type`). 

Visualization of the cross-validation results can be done using 
`plotCV`. In case that multiple cross-validation results are available in the 
`metadata`, it is possible to visualize these by giving all names of the 
metadata slots. The optimal dimensionality is the one that has the lowest 
out-of-sample deviances.


```{r}
example_sce <- runCVGMF(
    example_sce, 
    X = X,                          # Covariate matrix
    exprs_values="logintensities",  # Use log-transformed intensities
    family = gaussian(),            # Gaussian model for proteomics data
    ncomponents = c(1:5))           # Test components from 1 to 5

metadata(example_sce)$cv_GMF %>% 
    group_by(ncomp) %>% 
    summarise(mean_dev = mean(dev),
              mean_aic = mean(aic),
              mean_bic = mean(bic),
              mean_mae = mean(mae),
              mean_mse = mean(mse))

plotCV(example_sce, name = "cv_GMF")

```


If the dataset is large or you are unsure about the optimal range of components 
to test, an alternative is the scree plot approach. This method uses PCA on 
deviance residuals to estimate eigenvalues, providing a fast approximation of 
the optimal dimensionality.

This can be done using `runRankGMF` or `calculateRankGMF` followed
by `plotRank` or `screeplot_rank` respectively. Note that now, the 
`maxcomp` argument can be defined, which is the number of 
eigenvalues computed.


```{r}
example_sce <- runRankGMF(example_sce, 
                                    X = X, 
                                    exprs_values="logintensities", 
                                    family = gaussian(),
                                    maxcomp = 10)

plotRank(example_sce, maxcomp = 10)
```


After choosing the number of components to use in the final dimensionality
reduction, `runGMF` or `calculateGMF` can be used. Again, one should select the
distribution family (`family`), the dimensionality (`ncomponents`), 
the known covariate matrices to account for (`X` and `Z`) and the 
assay used (`exprs_values` or `assay.type`). Unlike `runPCA`, `runGMF` uses all 
features by default. If you want to select the most variable genes instead, 
set `ntop`. However, when the goal is imputing missing values, all 
features should be used. The results are stored in the `reducedDim` slot of the 
`SingleCellExperiment` object. Additional information such as the `rotation` 
matrix, parameter estimates, the optimization history of `sgdGMF` framework 
and many more are available in the `attributes`. See `runGMF` for all 
outputs. 


```{r}
example_sce <- runGMF(
    example_sce, 
    exprs_values="logintensities", 
    family = gaussian(), 
    ncomponents = 3, # Use optimal dimensionality, here arbitrarily chosen as 3
    name = "GMF")
```


```{r, results = 'hide'}
reducedDimNames(example_sce)
head(reducedDim(example_sce, type = "GMF"))

names(attributes(reducedDim(example_sce, type = "GMF")))
head(attr(reducedDim(example_sce, type = "GMF"), "rotation"))
tail(attr(reducedDim(example_sce, type = "GMF"), "trace"))

```


To visualize the reduced dimensions, you can use `plotReducedDim` from the 
`scater` package, specifying "GMF" as the dimension reduction method. 
Alternatively, the `plotGMF` function provides a direct wrapper for this.


```{r}
plotReducedDim(example_sce, dimred = "GMF", colour_by = "Batch")
```

Finally, it is possible to impute the missing values with the model-based 
estimates. This can be done with the `imputeGMF` function:

```{r, results = 'hide'}
example_sce <- imputeGMF(example_sce, 
                         exprs_values = "logintensities", 
                         reducedDimName = "GMF",
                         name = "logintensities_imputed")


assay(example_sce,'logintensities')[1:5,1:5]
assay(example_sce,'logintensities_imputed')[1:5,1:5]
```


```{r}
sessionInfo()
```