---
title: "SmartPhos: a pipeline for processing and analysis of phosphoproteomic 
data"
author: "Shubham Agrawal, Junyan Lu"
date: "2025-03-27"
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{"Introduction to SmartPhos"}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r style, echo=FALSE, results='asis'}
BiocStyle::markdown()
```

# Introduction

Phosphoproteomics provides rich information for dissecting pathway activities 
and therefore is becoming vital in basic and translational biomedical research. 
To facilitate and streamline phosphoproteomics data analysis, we developed 
`SmartPhos`, an R package for the pre-processing, quality control, and 
exploratory analysis of phosphoproteomics data generated by mass-spectrometry. 
SmartPhos can process outputs from `MaxQuant` and `Spectronaut` either using 
the R command line or in an interactive ShinyApp called `SmartPhos Explorer`. 
Besides commonly used preprocessing steps, such as normalization, 
transformation, imputation, and batch effect correction, our package features a 
novel method for correcting normalization artifacts observed in 
phosphoproteomics, especially when large global phosphorylation changes are 
expected, by taking both phospho-enriched and unenriched samples into account. 

In addition, the `SmartPhos Explorer` `ShinyApp` included in our R package 
provides a user-friendly and interactive one-stop solution for performing 
exploratory data analysis (PCA, hierarchical clustering, etc.), differential 
expression, time-series clustering, gene set enrichment analysis, and kinase 
activity analysis easily without the knowledge of coding or the underlying 
statistical model. 

This vignette focuses on using `SmartPhos` with the R command line. 


# Install and load SmartPhos Package
To install `SmartPhos` enter the following to the `R` console

```{r install, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("SmartPhos")
```

Load the package
```{r initialize, warning=FALSE, message=FALSE}
library(MultiAssayExperiment)
library(SmartPhos)
``` 

Load the R object
```{r}
data("dia_example")
dia_example
```

# Preprocessing the assay, basic visualization, PCA and Heatmaps

Extract the SummarizedExperiment object from the multiAssayExperiment object
```{r}
se <- dia_example[["Phosphoproteome"]]
colData(se) <- colData(dia_example)
se
```

## Preprocessing options

SmartPhos package performs some pre-filtering and preprocessing based on 
various threshold values during the generation of the multiAssayExperiment 
object. On top of that, this panel has different options for preprocessing of 
the selected assay. The options provided by the shiny app are as follows:

* **Normalization correction**: This option is present only if the 
phophoproteomics data is present. It allows for the correction of normalization 
artefacts introduced by the Spectronaut. 
* **Normalize** phospho intensity by the corresponding protein expression.
* **Transformation**: The transformation methods available are: log2 and vst 
(variance stabilizing transformation).
* **Normalization**: Normalization strategy depends on the selected 
transformation method. Therefore, the user has option of *Yes* or *No*. If the 
user selects *Yes*, then for log2 and no transformation, *median scaling* is 
applied and for vst, *vsn* is applied. 
* **Missing values**: Users can select what percentage of missing values are 
allowed. Proteins with missing values above the selected threshold will be 
removed from all analyses.
* **Imputation**: Currently four imputation methods are available: 
  + QRILC (Quantile Regression Imputation of Left-Censored data)
  + MinDet (Deterministic minimal value approach)
  + BPCA (Bayesian PCA)
  + MLE (Maximum Likelihood Estimation)
  + Random forest
* **Batch effects removal**: This option provides the abilty to correct for 
batch effects. It uses *removeBatchEffect()* from the *limma* package. The 
users can select the maximum of two columns. The columns used for batch effects 
removal should be present in the fileTable.txt file.

The preprocessing (transformation, normalization, imputation, etc.) on the 
intensity assay can be performed by running the following command:
```{r message=FALSE}
newSE <- preprocessPhos(seData = se, transform = "log2", normalize = TRUE, 
                        impute = "QRILC")
```


## Different visualization options

Plot the boxplot of assay intensities:
```{r}
plotIntensity(newSE, colorByCol = "replicate")
```

Plot the completeness (percentage of non-missing) for each sample:
```{r}
plotMissing(newSE)
```
This is performed on the original assay and not on the imputed assay. 

Perform principal component analysis (PCA) by running the following command:
```{r}
# perform PCA
pca <- stats::prcomp(t(assays(newSE)[["imputed"]]), center = TRUE, 
                     scale. = TRUE)
# call the plotting function
p <- plotPCA(pca = pca, se = newSE, color = "replicate")
p
```

Heatmap can be plotted using the plotHeatmap() function of SmartPhos. There are 
three different heatmaps possible based on the **type** argument:

* **Top variant**: This allows the users to plot the genes with highest 
variance. Users can decide the number of top variants genes to plot. This 
option performs clustering automatically. The user also has the option to 
divide the columns and rows of the heatmap into specific number of clusters 
using the **cutCol** and **cutRow** arguments.
* **Differentially expressed**: Allows to plot the heatmaps for differentially 
expressed genes. The differential expression analysis is available in 
*Differential expression* tab and can be performed using ProDA or Limma. 
* **Selected time series cluster**: After performing the time-series clustering 
in another tab, users have option to plot the heatmap of the selected cluster.

```{r}
plotHeatmap(type = "Top variant", newSE, top = 10, 
            annotationCol = c("replicate", "treatment"))
```


# Differential Expression Analysis

The goal of performing differential expression analysis is to quantify the 
expression levels of genes between different experimental conditions using 
statistical tests. 

If the users want paired t-test on the patient IDs, cell lines, etc, then the 
users must have **subjectID** as one of the column in the *fileTable.txt* file 
with the relevant information. The subjectID column should also be selected in 
the additional column annotations before the generation of multiAssayExperiment 
object. 

The *performDifferentialExp()* has two methods for performing differential 
expression analysis:

* **limma**: uses linear models.
* **ProDA**: uses probabilistic dropout model.

Limma method needs an assay with non-missing data and is faster than the proDA 
method. ProDA method can work on assay with missing values also.

```{r message=FALSE}
dea <- performDifferentialExp(se = newSE, assay = "imputed", method = "limma", 
                              condition = "treatment", reference = "EGF", 
                              target = "1stCrtl")
```

‘dea’ is a list object which contains two values: ‘resDE’ with tabular 
differential expression analysis results and ‘seSub’ which is the subsetted 
SummarizedExperiment object based on the selected condition. 

```{r}
dea$seSub
dea$resDE
```

Plot the volcano plot for the obtained differential expression analysis results 
by running the following command:
```{r}
plotVolcano(dea$resDE)
```


**pFilter** and **fcFilter** arguments can be used to change the thresholds for 
the upregulated and downregulated genes being represented in the volcano plot.

Use intensityBoxPlot() to plot the intensity data of a specified gene, with 
optional subject-specific lines:
```{r}
intensityBoxPlot(se = dea$seSub, id = 's447', symbol = "WASL")
```

# Time series clustering

Time series clustering is performed to group proteins/phosphopeptides based on 
how their level changes over time. The *clusterTS()* function employs 
**fuzzy c-means clustering** algorithm which considers the time-resolved trend, 
but not the expression levels. Thus, members of the same cluster would have a 
similar trend over time (e.g., all increasing or all decreasing), though their 
expression level can be different.

To use this function, users must have logitudinal data with **timepoint** 
annotation in their data. The time points are either unit-less numbers 
(e.g., 1, 2, 3) or are in hour and/or minute, which must be typed as "h" and 
"min", respectively (e.g., 1min, 2min, 3h). Please notice that mixing the two 
said options (e.g., 1, 2min, 3) or using other unit for time (e.g., 1hour, 
2minute, 3day) will likely lead to wrong results. 

```{r}
# call addZeroTime function to add zero timepoint to EGF treatment
newSEzero <- addZeroTime(newSE, condition = "treatment", treat = "EGF", 
                         zeroTreat = "1stCrtl", 
                         timeRange = c("10min","100min", "24h"))
# extract the assay
exprMat <- SummarizedExperiment::assay(newSEzero)
# call the clustering function
tsc <- clusterTS(x = exprMat, k = 5)
```

‘tsc’ is a list object which contains two values: 
‘cluster’ with tabular time series clustering results and ‘plot’ 
with the clusters plots. 

```{r}
tsc$cluster
tsc$plot
```

A single gene or phospho-site time series data can also be plotted:
```{r}
timerange <- unique(newSEzero$timepoint)
plotTimeSeries(newSEzero, type = "expression", geneID = "s40", 
               symbol = "RBM47_T519", condition = "treatment", 
               treatment = "EGF", timerange = timerange)
```


# Enrichment analysis

Enrichment analysis can be performed on either gene sets (gene-centric) or 
post-translational modification signature sets (site-centric) that are 
differentially expressed or in a time-series cluster.

In the latter, each set contains PTM site names with direction of regulation 
(up- or down-regulated) instead of gene names. Only phosphorylation sites will 
be considered since this pipeline supports proteomics and phosphoproteomics 
data. 

The gene sets and PTM signature sets are derived from the Molecular Signatures 
Database (Subramanian, Tamayo et al.,2005; Liberzon et al., 2011, Liberzon et 
al., 2015) and the PTM Signature Database version 2.0.0 (Krug et al., 2019), 
respectively. Users are encouraged to consult the PTMsigDB website
(https://proteomics.broadapps.org/ptmsigdb/) and the paper of Krug et al. for 
details on how PTM signature sets were curated and their annotation.

In gene-centric pathway enrichment, users can perform either Parametric 
Analysis of Gene Set Enrichment (PAGE) (Kim & Volsky, 2005) or Gene Set 
Enrichment Analysis (GSEA) (Subramanian, Tamayo et al.,2005) with Differential 
Expression analysis result. With Time series clustering result, we offer the 
Fisher's exact test. For phospho-signature enrichment on Differential 
Expression analysis result, we offer PTM-Signature Enrichment Analysis 
(PTM-SEA), a method adapted from GSEA by Krug et al. (2019) to be applicable 
with the PTM Signature Database. On time series clustering result, we offer 
the Fisher's exact test, in which each signature set is split into two, one 
containing upregulated and the other downregulated phosphosites.

## Gene Enrichment analysis on differential expression analysis results

```{r message=FALSE}
# Load the gene set
genesetPath <- system.file("shiny-app/geneset", package = "SmartPhos")
inGMT <- piano::loadGSC(paste0(genesetPath,"/Cancer_Hallmark.gmt"), type="gmt")
# Call the function
resTab <- enrichDifferential(dea = dea$resDE, type = "Pathway enrichment", 
                             gsaMethod = "PAGE", geneSet = inGMT, 
                             statType = "stat", nPerm = 200, sigLevel = 0.05, 
                             ifFDR = FALSE)
resTab
```

## Gene Enrichment analysis on time-series results

```{r message=FALSE}
# Load the gene set
genesetPath <- system.file("shiny-app/geneset", package = "SmartPhos")
inGMT <- piano::loadGSC(paste0(genesetPath,
                               "/Chemical_and_Genetic_Perturbations.gmt"), 
                        type="gmt")
# Call the function
clustEnr <- clusterEnrich(clusterTab = tsc$cluster,
                          se = newSE, inputSet = inGMT,
                          filterP = 0.05,
                          ifFDR = FALSE)
clustEnr
```

## Phospho Enrichment analysis on time-series results

```{r message=FALSE}
# Load the ptm set
ptmsetPath <- system.file("shiny-app/ptmset", package = "SmartPhos")
load(paste0(ptmsetPath, "/human_PTM.rda"))
# Call the function
clustEnr <- clusterEnrich(clusterTab = tsc$cluster, se = newSE, 
                          inputSet = ptmSetDb, ptm = TRUE, filterP = 0.05,
                          ifFDR = FALSE)
clustEnr
```

# Kinase Activity Inference

SmartPhos can perform kinase activity inference based on phosphopeptides that 
are differentially expressed or in a cluster. Similar to enrichment analysis, 
users would first need to perform either a differential expression analysis or 
time-series clustering and select a cluster of interest. 

A network of kinase-phosphosite interactions is constructed using the package 
**OmnipathR** (Türei et al., 2021). Users can choose to construct this network 
with prior knowledge from either Homo sapiens (taxonomy ID = 9606) or Mus 
musculus (taxonomy ID = 10090).

By combining prior knowledge about known kinase-phosphosite interactions and 
the data, SmartPhos can infer the activity of the kinases responsible for the 
phosphopeptides being considered. The activity is estimated by an activity 
score computed with the package **decoupleR** (Badia-I-Mompel et al., 2022) 
following the authors' tutorial on Kinase and Transcription Factor activity 
estimation. For time-series clustering, users can also estimate how likely the 
kinases are associated with phosphopeptides in the selected cluster.

## Kinase activity inference on differential expression analysis results

```{r eval=FALSE}
netw <- getDecouplerNetwork("Homo sapiens")
scoreTab <- calcKinaseScore(dea$resDE, netw, statType = "stat", nPerm = 500)
```

# Session Info

```{r}
sessionInfo()
```