---
title: "GSVA on proteomics data"
author:
- name: Robert Castelo
  affiliation:
  - &idupf Dept. of Medicine and Life Sciences, Universitat Pompeu Fabra, Barcelona, Spain
  email: robert.castelo@upf.edu
- name: Axel Klenk
  affiliation: *idupf
  email: axel.klenk@gmail.com
- name: Justin Guinney 
  affiliation: 
  - Tempus Labs, Inc.
  email: justin.guinney@tempus.com
abstract: >
  Here we illustrate how to use GSVA with proteomics data.
date: "`r BiocStyle::doc_date()`"
package: "`r pkg_ver('GSVA')`"
vignette: >
  %\VignetteIndexEntry{GSVA on proteomics data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignetteKeywords{GeneExpression, Microarray, RNAseq, GeneSetEnrichment, Pathway}
output:
  BiocStyle::html_document:
    toc: true
    toc_float: true
    number_sections: true
    fig_captions: yes
bibliography: GSVA.bib
---

**License**: `r packageDescription("GSVA")[["License"]]`

```{r setup, include=FALSE}
options(width=80)
knitr::opts_chunk$set(collapse=TRUE,
                      message=TRUE,
                      warning=TRUE,
                      comment="",
                      fig.align="center",
                      fig.wide=TRUE)
```

# Introduction

Proteomics data, such as the one produced through the Clinical Proteomic Tumor
Analysis Consortium
([CPTAC](https://www.cancer.gov/research/resources/resource/176)), can have
a massive amount of missing values, in which imputing can result in regressing
to the mean, underestimating protein expression variability. In such a setting,
skipping missing values from the calculations may constitute a suitable
alternative to imputation. The GSVA package provides a missing data policy for
the GSVA and ssGSEA methods, which we illustrate here with the proteomics
data from @costa2021genome and available in the `r Biocpkg("GSVAdata")`
experiment data package. We start loading the data as follows.

```{r, message=FALSE}
library(SummarizedExperiment)
library(GSVA)
library(GSVAdata)

data(geneprotExpCostaEtAl2021)
ls()
protExpCostaEtAl2021
assayNames(protExpCostaEtAl2021)
```
The data is stored in a `SummarizedExperiment` object called
`protExpCostaEtAl2021` with three assays:

* `logCPM` with the normalized log-CPM RNA-seq expression values for the
`r nrow(protExpCostaEtAl2021)` genes encoding the quantified proteins.
* `log2protexp` with the normalized protein units of expression, and
the missing quantifications specified by `NA` values.
* `log2protexpimp` with the normalized protein units of expression, but
where the missing quantifications have been imputed.

Please consult the publication in [@costa2021genome] for details on how
these data have been produced.

# Load gene sets

We will use the MSigDB C7 collection of immunologic genesets, reading them
with the `readGMT()` function of the GSVA package.

```{r, eval=FALSE}
library(GSEABase)

URL <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/2024.1.Hs/c7.immunesigdb.v2024.1.Hs.symbols.gmt"
c7.genesets <- readGMT(URL)
```
```{r, echo=FALSE, message=FALSE, warning=FALSE}
library(GSEABase)

gmtfname <- system.file("extdata", "c7.immunesigdb.v2024.1.Hs.symbols.gmt.gz",
                        package="GSVAdata")
c7.genesets <- readGMT(gmtfname)
c7.genesets
```
We filter this collection of gene sets to those formed by gene upregulated in
innate leukocytes and adaptive mature lymphocytes, excluding those reported in
studies on myeloid cells and the lupus autoimmune disease.

```{r}
innatepat <- c("NKCELL_VS_.+_UP", "MAST_CELL_VS_.+_UP",
               "EOSINOPHIL_VS_.+_UP", "BASOPHIL_VS_.+_UP",
               "MACROPHAGE_VS_.+_UP", "NEUTROPHIL_VS_.+_UP")
innatepat <- paste(innatepat, collapse="|")
innategsets <- names(c7.genesets)[grep(innatepat, names(c7.genesets))]
length(innategsets)

adaptivepat <- c("CD4_TCELL_VS_.+_UP", "CD8_TCELL_VS_.+_UP", "BCELL_VS_.+_UP")
adaptivepat <- paste(adaptivepat, collapse="|")
adaptivegsets <- names(c7.genesets)[grep(adaptivepat, names(c7.genesets))]
excludepat <- c("NAIVE", "LUPUS", "MYELOID")
excludepat <- paste(excludepat, collapse="|")
adaptivegsets <- adaptivegsets[-grep(excludepat, adaptivegsets)]
length(adaptivegsets)

c7.genesets.filt <- c7.genesets[c(innategsets, adaptivegsets)]
length(c7.genesets.filt)
```

# Usage and benchmark with the RNA-seq data

Here we show first a quick benchmarking using the logCPM expression
profiles, which are complete, by introducing `NA` values with the pattern
of missing data observed in the corresponding proteins.

```{r}
logCPMsWithNAs <- assay(protExpCostaEtAl2021, "logCPM")
logCPMsWithNAs[1:5, 1:5]
logCPMsWithNAs[is.na(assay(protExpCostaEtAl2021, "log2protexp"))] <- NA
logCPMsWithNAs[1:5, 1:5]
```
We need to add the annotation metadata to this new matrix of expression
profiles with missing values.

```{r}
gsvaAnnotation(logCPMsWithNAs) <- EntrezIdentifier("org.Hs.eg.db")
```

When building the parameter object, GSVA will check automatically for the
presence of missing values whenever the input expression data container is a
non-sparse container of expression values, such as a base matrix object or a
`SummarizedExperiment` object.


```{r}
gsvapar <- gsvaParam(logCPMsWithNAs, c7.genesets.filt, minSize=5)
```
We can force the `gsvaParam()` function to check for missing values irrespective
of the input expression data container by setting the argument `checkNA="yes"`,
or disable that check altogether with `checkNA="no"`. By default
`checkNA="auto"`. Once missing values have been detected when we build the
parameter object, the `gsva()` function (or `gsvaRanks()` and `gsvaScores()`)
will apply a missing data policy specified through a parameter called `use`,
which takes one of the following three possible character string values:
`everything`, `all.obs` or `na.rm`. The first value (`everything`) is the
default value and it propagates the missing `NA` values through the
calculations.

```{r}
es_gsva_everything <- gsva(gsvapar)
es_gsva_everything[1:10, 1:4]
all(is.na(es_gsva_everything))
```
So, in this case, there are so many missing values that the resulting
enrichment scores are all `NA` values. If we try the second option
`use="all.obs"` it will immediately produce an error at the parameter
building step.

```{r, error=TRUE}
gsvapar <- gsvaParam(logCPMsWithNAs, c7.genesets.filt, minSize=5,
                     use="all.obs")
```
Finally, if we set `use="na.rm"`, missing `NA` values will be skipped
during calculations, giving the chance to obtain non-missing enrichment
scores for those gene sets with enough genes with non-missing expression
values.

```{r}
gsvapar <- gsvaParam(logCPMsWithNAs, c7.genesets.filt, minSize=5,
                     use="na.rm")
es_gsva_narm <- gsva(gsvapar)
round(es_gsva_narm[1:10, 1:4], digits=2)
```
These parameters work exactly in the same way with the ssGSEA method.

```{r}
ssgseapar <- ssgseaParam(logCPMsWithNAs, c7.genesets.filt, minSize=5,
                         use="na.rm")
es_ssgsea_narm <- gsva(ssgseapar)
round(es_ssgsea_narm[1:10, 1:4], digits=2)
```
Since we are doing calculations on the RNA-seq data, for which we have the
complete logCPM values, we can compare how close are the enrichment scores
obtained by skipping `NA` values from the ones obtained with the complete
data, for both GSVA and ssGSEA. First, we calculate the enrichment scores with
the complete data for each of the two methods.

```{r}
gsvaAnnotation(protExpCostaEtAl2021) <- EntrezIdentifier("org.Hs.eg.db")
gsvapar <- gsvaParam(protExpCostaEtAl2021, c7.genesets.filt,
                     assay="logCPM", minSize=5)
es_gsva <- gsva(gsvapar)
ssgseapar <- ssgseaParam(protExpCostaEtAl2021, c7.genesets.filt,
                         assay="logCPM", minSize=5)
es_ssgsea <- gsva(ssgseapar)
```
Second, we calculate and plot the correlations between the enrichment scores
obtained from the data with missing values with the ones obtained from the
complete data.

```{r missingdatacomp, echo=TRUE, fig.width=5, fig.height=5, out.width="600px", dpi=100, fig.cap="Correlation between enrichment scores calculated from RNA-seq expression profiles with missing values, and those calculated from the complete version of the same data."}
r_es_gsva <- r_es_ssgsea <- numeric(nrow(es_gsva))
for (i in 1:nrow(es_gsva)) {
    r_es_gsva[i] <- cor(assay(es_gsva)[i, ], es_gsva_narm[i, ], use="complete.obs")
    r_es_ssgsea[i] <- cor(assay(es_ssgsea)[i, ], es_ssgsea_narm[i, ], use="complete.obs")
}
boxplot(list(GSVA=r_es_gsva, ssGSEA=r_es_ssgsea),
	ylab="Correlation with complete data", las=1)
```
As we can see in Figure \@ref(fig:missingdatacomp) GSVA scores calculated by skipping
missing values correlate better with those calculated from the complete data, than
ssGSEA scores calculated in the same way.

# Session information {.unnumbered}

Here is the output of `sessionInfo()` on the system on which this document was
compiled running pandoc `r rmarkdown::pandoc_version()`:

```{r session_info, cache=FALSE}
sessionInfo()
```

# References