--- title: "SmartPhos: a pipeline for processing and analysis of phosphoproteomic data" author: "Shubham Agrawal, Junyan Lu" date: "2025-03-27" output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{"Introduction to SmartPhos"} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ```{r style, echo=FALSE, results='asis'} BiocStyle::markdown() ``` # Introduction Phosphoproteomics provides rich information for dissecting pathway activities and therefore is becoming vital in basic and translational biomedical research. To facilitate and streamline phosphoproteomics data analysis, we developed `SmartPhos`, an R package for the pre-processing, quality control, and exploratory analysis of phosphoproteomics data generated by mass-spectrometry. SmartPhos can process outputs from `MaxQuant` and `Spectronaut` either using the R command line or in an interactive ShinyApp called `SmartPhos Explorer`. Besides commonly used preprocessing steps, such as normalization, transformation, imputation, and batch effect correction, our package features a novel method for correcting normalization artifacts observed in phosphoproteomics, especially when large global phosphorylation changes are expected, by taking both phospho-enriched and unenriched samples into account. In addition, the `SmartPhos Explorer` `ShinyApp` included in our R package provides a user-friendly and interactive one-stop solution for performing exploratory data analysis (PCA, hierarchical clustering, etc.), differential expression, time-series clustering, gene set enrichment analysis, and kinase activity analysis easily without the knowledge of coding or the underlying statistical model. This vignette focuses on using `SmartPhos` with the R command line. # Install and load SmartPhos Package To install `SmartPhos` enter the following to the `R` console ```{r install, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("SmartPhos") ``` Load the package ```{r initialize, warning=FALSE, message=FALSE} library(MultiAssayExperiment) library(SmartPhos) ``` Load the R object ```{r} data("dia_example") dia_example ``` # Preprocessing the assay, basic visualization, PCA and Heatmaps Extract the SummarizedExperiment object from the multiAssayExperiment object ```{r} se <- dia_example[["Phosphoproteome"]] colData(se) <- colData(dia_example) se ``` ## Preprocessing options SmartPhos package performs some pre-filtering and preprocessing based on various threshold values during the generation of the multiAssayExperiment object. On top of that, this panel has different options for preprocessing of the selected assay. The options provided by the shiny app are as follows: * **Normalization correction**: This option is present only if the phophoproteomics data is present. It allows for the correction of normalization artefacts introduced by the Spectronaut. * **Normalize** phospho intensity by the corresponding protein expression. * **Transformation**: The transformation methods available are: log2 and vst (variance stabilizing transformation). * **Normalization**: Normalization strategy depends on the selected transformation method. Therefore, the user has option of *Yes* or *No*. If the user selects *Yes*, then for log2 and no transformation, *median scaling* is applied and for vst, *vsn* is applied. * **Missing values**: Users can select what percentage of missing values are allowed. Proteins with missing values above the selected threshold will be removed from all analyses. * **Imputation**: Currently four imputation methods are available: + QRILC (Quantile Regression Imputation of Left-Censored data) + MinDet (Deterministic minimal value approach) + BPCA (Bayesian PCA) + MLE (Maximum Likelihood Estimation) + Random forest * **Batch effects removal**: This option provides the abilty to correct for batch effects. It uses *removeBatchEffect()* from the *limma* package. The users can select the maximum of two columns. The columns used for batch effects removal should be present in the fileTable.txt file. The preprocessing (transformation, normalization, imputation, etc.) on the intensity assay can be performed by running the following command: ```{r message=FALSE} newSE <- preprocessPhos(seData = se, transform = "log2", normalize = TRUE, impute = "QRILC") ``` ## Different visualization options Plot the boxplot of assay intensities: ```{r} plotIntensity(newSE, colorByCol = "replicate") ``` Plot the completeness (percentage of non-missing) for each sample: ```{r} plotMissing(newSE) ``` This is performed on the original assay and not on the imputed assay. Perform principal component analysis (PCA) by running the following command: ```{r} # perform PCA pca <- stats::prcomp(t(assays(newSE)[["imputed"]]), center = TRUE, scale. = TRUE) # call the plotting function p <- plotPCA(pca = pca, se = newSE, color = "replicate") p ``` Heatmap can be plotted using the plotHeatmap() function of SmartPhos. There are three different heatmaps possible based on the **type** argument: * **Top variant**: This allows the users to plot the genes with highest variance. Users can decide the number of top variants genes to plot. This option performs clustering automatically. The user also has the option to divide the columns and rows of the heatmap into specific number of clusters using the **cutCol** and **cutRow** arguments. * **Differentially expressed**: Allows to plot the heatmaps for differentially expressed genes. The differential expression analysis is available in *Differential expression* tab and can be performed using ProDA or Limma. * **Selected time series cluster**: After performing the time-series clustering in another tab, users have option to plot the heatmap of the selected cluster. ```{r} plotHeatmap(type = "Top variant", newSE, top = 10, annotationCol = c("replicate", "treatment")) ``` # Differential Expression Analysis The goal of performing differential expression analysis is to quantify the expression levels of genes between different experimental conditions using statistical tests. If the users want paired t-test on the patient IDs, cell lines, etc, then the users must have **subjectID** as one of the column in the *fileTable.txt* file with the relevant information. The subjectID column should also be selected in the additional column annotations before the generation of multiAssayExperiment object. The *performDifferentialExp()* has two methods for performing differential expression analysis: * **limma**: uses linear models. * **ProDA**: uses probabilistic dropout model. Limma method needs an assay with non-missing data and is faster than the proDA method. ProDA method can work on assay with missing values also. ```{r message=FALSE} dea <- performDifferentialExp(se = newSE, assay = "imputed", method = "limma", condition = "treatment", reference = "EGF", target = "1stCrtl") ``` ‘dea’ is a list object which contains two values: ‘resDE’ with tabular differential expression analysis results and ‘seSub’ which is the subsetted SummarizedExperiment object based on the selected condition. ```{r} dea$seSub dea$resDE ``` Plot the volcano plot for the obtained differential expression analysis results by running the following command: ```{r} plotVolcano(dea$resDE) ``` **pFilter** and **fcFilter** arguments can be used to change the thresholds for the upregulated and downregulated genes being represented in the volcano plot. Use intensityBoxPlot() to plot the intensity data of a specified gene, with optional subject-specific lines: ```{r} intensityBoxPlot(se = dea$seSub, id = 's447', symbol = "WASL") ``` # Time series clustering Time series clustering is performed to group proteins/phosphopeptides based on how their level changes over time. The *clusterTS()* function employs **fuzzy c-means clustering** algorithm which considers the time-resolved trend, but not the expression levels. Thus, members of the same cluster would have a similar trend over time (e.g., all increasing or all decreasing), though their expression level can be different. To use this function, users must have logitudinal data with **timepoint** annotation in their data. The time points are either unit-less numbers (e.g., 1, 2, 3) or are in hour and/or minute, which must be typed as "h" and "min", respectively (e.g., 1min, 2min, 3h). Please notice that mixing the two said options (e.g., 1, 2min, 3) or using other unit for time (e.g., 1hour, 2minute, 3day) will likely lead to wrong results. ```{r} # call addZeroTime function to add zero timepoint to EGF treatment newSEzero <- addZeroTime(newSE, condition = "treatment", treat = "EGF", zeroTreat = "1stCrtl", timeRange = c("10min","100min", "24h")) # extract the assay exprMat <- SummarizedExperiment::assay(newSEzero) # call the clustering function tsc <- clusterTS(x = exprMat, k = 5) ``` ‘tsc’ is a list object which contains two values: ‘cluster’ with tabular time series clustering results and ‘plot’ with the clusters plots. ```{r} tsc$cluster tsc$plot ``` A single gene or phospho-site time series data can also be plotted: ```{r} timerange <- unique(newSEzero$timepoint) plotTimeSeries(newSEzero, type = "expression", geneID = "s40", symbol = "RBM47_T519", condition = "treatment", treatment = "EGF", timerange = timerange) ``` # Enrichment analysis Enrichment analysis can be performed on either gene sets (gene-centric) or post-translational modification signature sets (site-centric) that are differentially expressed or in a time-series cluster. In the latter, each set contains PTM site names with direction of regulation (up- or down-regulated) instead of gene names. Only phosphorylation sites will be considered since this pipeline supports proteomics and phosphoproteomics data. The gene sets and PTM signature sets are derived from the Molecular Signatures Database (Subramanian, Tamayo et al.,2005; Liberzon et al., 2011, Liberzon et al., 2015) and the PTM Signature Database version 2.0.0 (Krug et al., 2019), respectively. Users are encouraged to consult the PTMsigDB website (https://proteomics.broadapps.org/ptmsigdb/) and the paper of Krug et al. for details on how PTM signature sets were curated and their annotation. In gene-centric pathway enrichment, users can perform either Parametric Analysis of Gene Set Enrichment (PAGE) (Kim & Volsky, 2005) or Gene Set Enrichment Analysis (GSEA) (Subramanian, Tamayo et al.,2005) with Differential Expression analysis result. With Time series clustering result, we offer the Fisher's exact test. For phospho-signature enrichment on Differential Expression analysis result, we offer PTM-Signature Enrichment Analysis (PTM-SEA), a method adapted from GSEA by Krug et al. (2019) to be applicable with the PTM Signature Database. On time series clustering result, we offer the Fisher's exact test, in which each signature set is split into two, one containing upregulated and the other downregulated phosphosites. ## Gene Enrichment analysis on differential expression analysis results ```{r message=FALSE} # Load the gene set genesetPath <- system.file("shiny-app/geneset", package = "SmartPhos") inGMT <- piano::loadGSC(paste0(genesetPath,"/Cancer_Hallmark.gmt"), type="gmt") # Call the function resTab <- enrichDifferential(dea = dea$resDE, type = "Pathway enrichment", gsaMethod = "PAGE", geneSet = inGMT, statType = "stat", nPerm = 200, sigLevel = 0.05, ifFDR = FALSE) resTab ``` ## Gene Enrichment analysis on time-series results ```{r message=FALSE} # Load the gene set genesetPath <- system.file("shiny-app/geneset", package = "SmartPhos") inGMT <- piano::loadGSC(paste0(genesetPath, "/Chemical_and_Genetic_Perturbations.gmt"), type="gmt") # Call the function clustEnr <- clusterEnrich(clusterTab = tsc$cluster, se = newSE, inputSet = inGMT, filterP = 0.05, ifFDR = FALSE) clustEnr ``` ## Phospho Enrichment analysis on time-series results ```{r message=FALSE} # Load the ptm set ptmsetPath <- system.file("shiny-app/ptmset", package = "SmartPhos") load(paste0(ptmsetPath, "/human_PTM.rda")) # Call the function clustEnr <- clusterEnrich(clusterTab = tsc$cluster, se = newSE, inputSet = ptmSetDb, ptm = TRUE, filterP = 0.05, ifFDR = FALSE) clustEnr ``` # Kinase Activity Inference SmartPhos can perform kinase activity inference based on phosphopeptides that are differentially expressed or in a cluster. Similar to enrichment analysis, users would first need to perform either a differential expression analysis or time-series clustering and select a cluster of interest. A network of kinase-phosphosite interactions is constructed using the package **OmnipathR** (Türei et al., 2021). Users can choose to construct this network with prior knowledge from either Homo sapiens (taxonomy ID = 9606) or Mus musculus (taxonomy ID = 10090). By combining prior knowledge about known kinase-phosphosite interactions and the data, SmartPhos can infer the activity of the kinases responsible for the phosphopeptides being considered. The activity is estimated by an activity score computed with the package **decoupleR** (Badia-I-Mompel et al., 2022) following the authors' tutorial on Kinase and Transcription Factor activity estimation. For time-series clustering, users can also estimate how likely the kinases are associated with phosphopeptides in the selected cluster. ## Kinase activity inference on differential expression analysis results ```{r eval=FALSE} netw <- getDecouplerNetwork("Homo sapiens") scoreTab <- calcKinaseScore(dea$resDE, netw, statType = "stat", nPerm = 500) ``` # Session Info ```{r} sessionInfo() ```