---
title: "1. Introduction to SMAD"
author: 
- name: "Qingzhou (Johnson) Zhang"
  email: zqzneptune@hotmail.com
date: "`r Sys.Date()`"
package: SMAD
output: 
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{1. Introduction to SMAD}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

# Introduction

The `SMAD` (Statistical Modelling of AP-MS Data) package is designed to process Affinity Purification-Mass Spectrometry (AP-MS) data. Its primary goal is to compute confidence scores that help researchers distinguish true protein-protein interactions (PPIs) from non-specific background contaminants.

In a typical AP-MS experiment, many proteins are identified, but only a fraction are *bona fide* interactors. `SMAD` implements several published statistical models that assign confidence scores to these candidate interactions.

# Installation

You can install the `SMAD` package from Bioconductor using the following command:

```r
# Install BiocManager first if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("SMAD")
```

# Preparing Input Data

`SMAD` requires input data as a data frame. The standard format includes identifiers for the experiment run, the bait protein, and the prey protein, together with quantitative measurements for each prey (such as spectral counts and protein length).

## Example Dataset

We provide a sample dataset `TestDatInput`, which is a subset of the BioPlex 2.0 data focusing on apoptosis-related proteins.

```{r prepare_input}
library(SMAD)
data("TestDatInput")
head(TestDatInput)
```

The columns required for most scoring functions are:

| Column Name | Description |
|:---|:---|
| `idRun` | Unique ID for the AP-MS experiment run |
| `idBait` | Identifier for the bait protein used in the pull-down |
| `idPrey` | Identifier for the identified prey protein |
| `countPrey` | Quantitative measure (e.g., peptide or spectral counts) |
| `lenPrey` | Length of the prey protein (used for normalization) |
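
If you are preparing your own data, the layout above can be sketched with a small hand-built data frame. The object `toyInput`, its identifiers, and its counts are illustrative assumptions rather than part of the package, and the checks shown are only a suggested sanity test, not something `SMAD` requires you to run:

```r
# Hypothetical toy input; identifiers and numbers are for illustration only.
toyInput <- data.frame(
  idRun     = c("run1", "run1", "run2", "run2", "run3"),
  idBait    = c("BAX", "BAX", "BCL2", "BCL2", "BAX"),
  idPrey    = c("BCL2", "HSPA8", "BAX", "HSPA8", "BCL2"),
  countPrey = c(12, 30, 9, 25, 15),
  lenPrey   = c(239, 646, 192, 646, 239),
  stringsAsFactors = FALSE
)

# Confirm that the expected columns exist and the numeric columns are sane.
requiredCols <- c("idRun", "idBait", "idPrey", "countPrey", "lenPrey")
missingCols <- setdiff(requiredCols, colnames(toyInput))
stopifnot(
  length(missingCols) == 0,
  is.numeric(toyInput$countPrey), all(toyInput$countPrey >= 0),
  is.numeric(toyInput$lenPrey), all(toyInput$lenPrey > 0)
)
```

Each row describes one prey observed in one run, so the same bait-prey pair may appear several times when a bait is purified in replicate runs.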

# Scoring Protein Interactions

`SMAD` offers multiple scoring algorithms. This guide focuses on two popular methods: **CompPASS** and **HGScore**.

## CompPASS (Comparative Proteomic Analysis Software Suite)

CompPASS is a "spoke model" algorithm that identifies high-confidence interactors by comparing how each prey is detected across a large number of experiments. It was originally developed by [Sowa et al. (2009)][1] and has been widely used in the BioPlex projects.

The output includes several metrics, with the **WD-score** (Weighted D-score) being the most commonly used for ranking.

```{r compPASS_score}
# Run CompPASS scoring
scoreCompPASS <- CompPASS(TestDatInput)

# View the top results
head(scoreCompPASS[order(scoreCompPASS$scoreWD, decreasing = TRUE), ])
```

We can visualize the distribution of scores to see the separation of high-confidence interactors:

```{r compPASS_plot, echo=TRUE}
par(mfrow = c(1, 1))
plot(sort(scoreCompPASS$scoreWD, decreasing = TRUE), 
     pch = 20, col = "royalblue",
     xlab = "Ranked Interactions", 
     ylab = "WD-score",
     main = "CompPASS WD-score Distribution")
abline(h = mean(scoreCompPASS$scoreWD) + 2 * sd(scoreCompPASS$scoreWD), 
       col = "red", lty = 2)
legend("topright", legend = "Mean + 2SD", col = "red", lty = 2)
```
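
The dashed line in the plot can also be used to subset the score table. The following is a minimal sketch on an invented table, `toyScores`, that stands in for the output of `CompPASS()`; the mean + 2SD cutoff simply mirrors the line drawn above and is a heuristic, not a threshold recommended by the package:

```r
# Toy score table standing in for CompPASS() output; values are invented.
toyScores <- data.frame(
  idBait  = c("A", "A", "A", "B", "B", "B", "C", "C"),
  idPrey  = c("P1", "P2", "P3", "P1", "P4", "P5", "P2", "P6"),
  scoreWD = c(0.2, 0.3, 0.1, 0.4, 0.2, 0.3, 0.1, 9.0),
  stringsAsFactors = FALSE
)

# Same heuristic as the dashed line: mean plus two standard deviations.
cutoff <- mean(toyScores$scoreWD, na.rm = TRUE) +
  2 * sd(toyScores$scoreWD, na.rm = TRUE)

# Keep only the interactions scoring above the cutoff, ignoring missing scores.
highConf <- toyScores[!is.na(toyScores$scoreWD) & toyScores$scoreWD > cutoff, ]
highConf
```

On real data the appropriate cutoff depends on the size and composition of the dataset, so it is worth inspecting the ranked plot before fixing a threshold.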

## HGScore (Hypergeometric Score)

HGScore is based on a hypergeometric distribution error model [(Hart et al., 2007)][6] and incorporates the Normalized Spectral Abundance Factor (NSAF) to account for protein length. Unlike CompPASS, which scores bait-prey pairs only, HGScore can also take a "matrix model" perspective in which proteins co-purified in the same run are paired with each other, often leading to a larger number of inferred interactions.
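
To make the length normalization concrete, here is a minimal sketch of the commonly used NSAF definition: each prey's spectral count is divided by its length, and the result is rescaled so that the values within a run sum to one. The data frame `toyRun` is invented, and `HG()` may compute or rescale this quantity differently internally:

```r
# Invented observations for two runs; not taken from TestDatInput.
toyRun <- data.frame(
  idRun     = c("run1", "run1", "run1", "run2", "run2"),
  idPrey    = c("P1", "P2", "P3", "P1", "P4"),
  countPrey = c(10, 20, 30, 5, 15),
  lenPrey   = c(100, 200, 300, 100, 300),
  stringsAsFactors = FALSE
)

# Spectral abundance factor: spectral count divided by protein length.
saf <- toyRun$countPrey / toyRun$lenPrey

# Normalize within each run so that NSAF values of a run sum to 1.
toyRun$nsaf <- saf / ave(saf, toyRun$idRun, FUN = sum)
toyRun
```

In this toy example the three preys of `run1` have different raw counts but identical NSAF values, because their counts scale exactly with their lengths.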

```{r hg_score}
# Run HG scoring
scoreHG <- HG(TestDatInput)

# View the top results
head(scoreHG[order(scoreHG$HG, decreasing = TRUE), ])
```
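
The difference between the spoke and matrix models can be illustrated by enumerating candidate pairs from the same observations. This is only a sketch of the general idea on an invented table, `toyObs`, with hypothetical column names `proteinA` and `proteinB`; it is not the pairing code used by `HG()`:

```r
# Invented observations: bait B1 pulls down three preys, bait B2 pulls down two.
toyObs <- data.frame(
  idRun  = c("run1", "run1", "run1", "run2", "run2"),
  idBait = c("B1", "B1", "B1", "B2", "B2"),
  idPrey = c("P1", "P2", "P3", "P1", "P4"),
  stringsAsFactors = FALSE
)

# Spoke model: only bait-prey pairs are considered.
spokePairs <- unique(toyObs[, c("idBait", "idPrey")])

# Matrix model: every pair of proteins seen in the same run is considered.
matrixPairs <- do.call(rbind, lapply(split(toyObs, toyObs$idRun), function(run) {
  members <- sort(unique(c(run$idBait, run$idPrey)))
  if (length(members) < 2) return(NULL)  # nothing to pair in this run
  pairs <- t(combn(members, 2))
  data.frame(proteinA = pairs[, 1], proteinB = pairs[, 2],
             stringsAsFactors = FALSE)
}))
matrixPairs <- unique(matrixPairs)
rownames(matrixPairs) <- NULL

nrow(spokePairs)
nrow(matrixPairs)
```

Because the number of pairs per run grows quadratically with the number of co-purified proteins, the matrix model produces many more candidate pairs than the spoke model from the same data.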

The HGScore distribution can be visualized in the same way:

```{r hg_plot, echo=TRUE}
plot(sort(scoreHG$HG, decreasing = TRUE), 
     pch = 20, col = "darkorange",
     xlab = "Ranked Interactions", 
     ylab = "HGScore",
     main = "HGScore Distribution")
```

# Advanced Scoring and Further Reading

While CompPASS and HGScore are good starting points, `SMAD` includes several other scoring methods, such as:

- **SAINTexpress**: A widely adopted Bayesian framework for AP-MS.
- **PE (Purification Enrichment)**: A Bayesian classifier combining spoke and matrix models.
- **DICE and Hart**: Specialized scores for prey-prey interaction affinity.

For a detailed showcase of all these functions, please refer to the **Scoring Functions in SMAD** vignette:

```r
vignette("scoring_functions", package = "SMAD")
```

# References

[1]: https://doi.org/10.1016/j.cell.2009.04.042
[2]: http://besra.hms.harvard.edu/ipmsmsdbs/cgi-bin/tutorial.cgi
[3]: https://doi.org/10.1016/j.cell.2015.06.043
[4]: https://www.nature.com/articles/nature22366
[5]: https://github.com/dnusinow/cRomppass
[6]: https://doi.org/10.1186/1471-2105-8-236
[7]: https://doi.org/10.1021/pr060161n
[8]: https://doi.org/10.1016/j.cell.2011.08.047

# Session Information

```{r sessionInfo}
sessionInfo()
```