---
title: "1. Introduction to SMAD"
author:
- name: "Qingzhou (Johnson) Zhang"
  email: zqzneptune@hotmail.com
date: "`r Sys.Date()`"
package: SMAD
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{1. Introduction to SMAD}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

# Introduction

The `SMAD` (Statistical Modelling of AP-MS Data) package is designed to process Affinity Purification-Mass Spectrometry (AP-MS) data. Its primary goal is to compute confidence scores that help researchers distinguish true protein-protein interactions (PPI) from non-specific background contaminants.

In a typical AP-MS experiment, many proteins might be identified, but only a fraction are *bona fide* interactors. `SMAD` implements several validated statistical models to assign probability scores to these interactions.

# Installation

You can install the `SMAD` package from Bioconductor using the following command:

```r
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("SMAD")
```

# Preparing Input Data

`SMAD` requires input data as a dataframe. The standard format includes identifiers for the experiment run, the bait protein, the prey protein, and quantitative measurements (like spectral counts and protein length).

## Example Dataset

We provide a sample dataset `TestDatInput`, which is a subset of the BioPlex 2.0 data focusing on apoptosis-related proteins.

```{r prepare_input}
library(SMAD)
data("TestDatInput")
head(TestDatInput)
```

The columns required for most scoring functions are:

| Column Name | Description |
|:---|:---|
| `idRun` | Unique ID for the AP-MS experiment run |
| `idBait` | Identifier for the bait protein used in the pull-down |
| `idPrey` | Identifier for the identified prey protein |
| `countPrey` | Quantitative measure (e.g., peptide or spectral counts) |
| `lenPrey` | Length of the prey protein (used for normalization) |

# Scoring Protein Interactions

`SMAD` offers multiple scoring algorithms. This guide focuses on two popular methods: **CompPASS** and **HGScore**.

## CompPASS (Comparative Proteomic Analysis Software Suite)

CompPASS is a "spoke model" algorithm that identifies high-confidence interactors by comparing occurrences across a large number of experiments. It was originally developed by [Sowa et al. (2009)][1] and is widely used in the BioPlex projects.

The output includes several metrics, with the **WD-score** (Weighted D-score) being the most commonly used for ranking.

```{r compPASS_score}
# Run CompPASS scoring
scoreCompPASS <- CompPASS(TestDatInput)

# View the top results
head(scoreCompPASS[order(scoreCompPASS$scoreWD, decreasing = TRUE), ])
```

We can visualize the distribution of scores to see the separation of high-confidence interactors:

```{r compPASS_plot, echo=TRUE}
par(mfrow = c(1, 1))
plot(sort(scoreCompPASS$scoreWD, decreasing = TRUE),
     pch = 20, col = "royalblue",
     xlab = "Ranked Interactions", ylab = "WD-score",
     main = "CompPASS WD-score Distribution")
abline(h = mean(scoreCompPASS$scoreWD) + 2 * sd(scoreCompPASS$scoreWD),
       col = "red", lty = 2)
legend("topright", legend = "Mean + 2SD", col = "red", lty = 2)
```

## HGScore (Hypergeometric Score)

HGScore is based on a hypergeometric distribution error model [(Hart et al., 2007)][6], incorporating Normalized Spectral Abundance Factor (NSAF) to account for protein length.
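
For reference, NSAF is commonly defined as below. The notation is introduced here for illustration and is not taken from the `SMAD` source: $\mathrm{SpC}_i$ is the spectral count of prey $i$, $L_i$ is its length, and $N$ is the number of proteins detected in the same run.

$$
\mathrm{NSAF}_i = \frac{\mathrm{SpC}_i / L_i}{\sum_{k=1}^{N} \mathrm{SpC}_k / L_k}
$$

Under this definition the NSAF values within a run sum to 1, so a long protein with many spectra is not favoured over a short protein with proportionally fewer. It is an assumption on our part that `countPrey` and `lenPrey` play the roles of $\mathrm{SpC}$ and $L$; how `HG()` applies the normalization internally is not covered in this vignette.
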
Unlike CompPASS, HGScore can incorporate a "matrix model" perspective, often leading to a larger number of inferred interactions.

```{r hg_score}
# Run HG scoring
scoreHG <- HG(TestDatInput)

# View the top results
head(scoreHG[order(scoreHG$HG, decreasing = TRUE), ])
```

Visualizing the HGScore distribution:

```{r hg_plot, echo=TRUE}
plot(sort(scoreHG$HG, decreasing = TRUE),
     pch = 20, col = "darkorange",
     xlab = "Ranked Interactions", ylab = "HGscore",
     main = "HGScore Distribution")
```

# Advanced Scoring and Further Reading

While CompPASS and HGScore are excellent starting points, `SMAD` includes several other advanced scoring methods such as:

- **SAINTexpress**: A widely adopted Bayesian framework for AP-MS.
- **PE (Purification Enrichment)**: A Bayesian classifier combining spoke and matrix models.
- **DICE and Hart**: Specialized scores for prey-prey interaction affinity.

For a detailed showcase of all these functions, please refer to the **Scoring Functions in SMAD** vignette:

```r
vignette("scoring_functions", package = "SMAD")
```

# References

[1]: https://doi.org/10.1016/j.cell.2009.04.042
[2]: http://besra.hms.harvard.edu/ipmsmsdbs/cgi-bin/tutorial.cgi
[3]: https://doi.org/10.1016/j.cell.2015.06.043
[4]: https://www.nature.com/articles/nature22366
[5]: https://github.com/dnusinow/cRomppass
[6]: https://doi.org/10.1186/1471-2105-8-236
[7]: https://doi.org/10.1021/pr060161n
[8]: https://doi.org/10.1016/j.cell.2011.08.047

# Session Information

```{r sessionInfo}
sessionInfo()
```