---
title: "Mass Spectrometry Data on ExperimentHub"
author:
- name: Laurent Gatto
package: MsDataHub
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Mass Spectrometry Data on ExperimentHub}
  %\VignetteEngine{knitr::rmarkdown}
  %%\VignetteKeywords{Mass Spectrometry, MS, MSMS, Proteomics, Metabolomics}
  %\VignetteEncoding{UTF-8}
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

```{r env, echo = FALSE, message = FALSE}
library(Spectra)
library(PSMatch)
library(QFeatures)
```


# Introduction

The `MsDataHub` package provides example mass spectrometry data,
peptide spectrum matches and quantitative data from proteomics and
metabolomics experiments. The data are served through the
`ExperimentHub` infrastructure, which downloads each file only once
and caches it for further use. Currently available data are summarised
in the table below and detailed in the next section.

```{r data}
library("MsDataHub")
DT::datatable(MsDataHub())
```
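
Because the files are distributed through `ExperimentHub`, they can
also be looked up in the hub itself. The chunk below is a minimal
sketch (not evaluated here); it assumes that the hub records mention
the package name in their metadata, so that `query()` finds them. The
accessor functions shown throughout this vignette return the path to
the locally cached file.

```{r, eval = FALSE}
library("ExperimentHub")
eh <- ExperimentHub()
## List the hub records associated with MsDataHub (assumes the package
## name appears in the records' metadata).
query(eh, "MsDataHub")
## The first call downloads the file; later calls reuse the local
## cache and return the same file path.
f <- MsDataHub::ko15.CDF()
f
```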

# Installation

To install the package:

```{r install1, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("MsDataHub")
```


# Available data

## TripleTOF

- Type: Raw MS data
- Files: `PestMix1_DDA.mzML` and `PestMix1_SWATH.mzML`
- More details: `?TripleTOF`

Load with

```{r, eval = TRUE}
f <- PestMix1_DDA.mzML()
library(Spectra)
Spectra(f)
```

```{r, eval = TRUE}
f <- PestMix1_SWATH.mzML()
Spectra(f)
```

## sciex

- Type: Raw MS data
- Files: `20171016_POOL_POS_1_105-134.mzML` and `20171016_POOL_POS_3_105-134.mzML`
- More details: `?sciex`

Load with

```{r, eval = TRUE}
f <- X20171016_POOL_POS_1_105.134.mzML()
Spectra(f)
```

```{r, eval = TRUE}
f <- X20171016_POOL_POS_3_105.134.mzML()
Spectra(f)
```

## PXD000001

- Type: Raw MS data and peptide spectrum matches
- Files:
  `TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz`
  and
  `TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid`
- More details: `?PXD000001`

Load with

```{r, eval = TRUE}
f <- TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.20141210.mzML.gz()
Spectra(f)
```

```{r, eval = TRUE}
f <- TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.20141210.mzid()
library(PSMatch)
PSM(f)
```

## CPTAC

- Type: tab-delimited quantitative proteomics data tables (as produced
  by MaxQuant)
- Files: `cptac_a_b_c_peptides.txt`, `cptac_a_b_peptides.txt` and
  `cptac_peptides.txt`
- More details: `?cptac`

Load with

```{r, eval = TRUE}
library(QFeatures)
f <- cptac_peptides.txt()
ecols <- grep("Intensity\\.", names(read.delim(f)))
readSummarizedExperiment(f, ecols, sep = "\t")
```

```{r, eval = TRUE}
cptac_a_b_c_peptides.txt()
cptac_a_b_peptides.txt()
```

## FAAH KO

- Type: Raw MS data, in netCDF format.
- File: `ko15.CDF`
- More details: `?cdf`

Load with

```{r, eval = TRUE}
f <- ko15.CDF()
Spectra(f)
```

## DIA-NN software outputs

- Type: tab-delimited DIA quantitative proteomics data tables produced
  by [DIA-NN](https://github.com/vdemichev/DiaNN).
- Files:
  - Label-free DIA: `benchmarkingDIA.tsv`
  - mTRAQ plexDIA: `Report.Derks2022.plexDIA.tsv`
- More details: `?benchmarkingDIA.tsv` and
  `?Report.Derks2022.plexDIA.tsv`

Load with

```{r lfdia, eval = TRUE, message = FALSE}
library(QFeatures)
lfdia <- read.delim(MsDataHub::benchmarkingDIA.tsv())
readQFeaturesFromDIANN(lfdia)
```

```{r plexdia, eval = TRUE, message = FALSE}
plexdia <- read.delim(MsDataHub::Report.Derks2022.plexDIA.tsv())
readQFeaturesFromDIANN(plexdia, multiplexing = "mTRAQ")
```

## DIA-NN single-cell proteomics reports

- Type: tab-delimited DIA quantitative proteomics data tables produced
  by [DIA-NN](https://github.com/vdemichev/DiaNN).
- Files:
  - Single-cell label-free: `Ai2025_aCMs_report.tsv`
  - Single-cell label-free: `Ai2025_iCMs_report.tsv`
- More details: `?Ai2025`.
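
These reports might be loaded along the same lines as the DIA-NN
outputs of the previous section. The chunk below is a sketch that is
not evaluated here and rests on two assumptions: that the accessor
function is named after the file (`Ai2025_aCMs_report.tsv()`; see
`?Ai2025` for the exact names), and that the report has the standard
DIA-NN columns expected by `readQFeaturesFromDIANN()`.

```{r, eval = FALSE}
library(QFeatures)
## Accessor name assumed from the file name listed above.
acms <- read.delim(MsDataHub::Ai2025_aCMs_report.tsv())
## Label-free data, hence no 'multiplexing' argument.
readQFeaturesFromDIANN(acms)
```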

## Proteomics contaminant databases

- Type: fasta files, as documented in `camprotR`'s [cRAP
  databases](https://cambridgecentreforproteomics.github.io/camprotR/articles/crap.html)
  vignette.
- Files:
  - `crap_gpm.fasta`: the common Repository of Adventitious Proteins
    (cRAP) from the Global Proteome Machine (GPM) organisation.
  - `crap_ccp.fasta`: Cambridge Centre for Proteomics' own cRAP fasta
    database.
  - `crap_maxquant.fasta.gz`: MaxQuant's contaminant database.
- More details: `?cRAP`.
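
As a sketch of how one of these databases might be read into R, the
chunk below (not evaluated here) uses `readAAStringSet()` from the
*Biostrings* package. The accessor name `crap_gpm.fasta()` is assumed
from the file name; see `?cRAP` for the exact names.

```{r, eval = FALSE}
library(Biostrings)
## Accessor name assumed from the file name listed above.
f <- MsDataHub::crap_gpm.fasta()
## Read the protein sequences as an AAStringSet.
crap <- readAAStringSet(f)
length(crap)
head(names(crap))
```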

## FTICR-MS direct injection MS data

Example files for direct injection Fourier-transform ion cyclotron resonance
(FTICR) mass spectrometry data.

- Type: raw MS data in mzML file format.
- Files: 5 replicates from sample *HAM004*, 5 replicates from sample *HAM005*,
  i.e., 10 mzML files.
- More details: `?FTICR`.

Example of how to load one of the available files:

```{r}
f <- MsDataHub::HAM004_641fE_14.11.07..Exp1.extracted.mzML()
Spectra(f)
```

## MRM data file

Example file in mzML format for multiple reaction monitoring (MRM) data. The
file does not contain mass spectra, but chromatographic data. The data can be
imported and represented with the *Chromatograms* Bioconductor package.

- Type: raw (chromatographic) MS data in mzML file format.
- Files:
  - `MRM-standmix-5.mzML`: sample from mouse brain acquired by HILIC
    ESI-QqQ/MS in dynamic multiple reaction monitoring (MRM) mode. The HPLC
    system was a 1290 Infinity (Agilent Technologies) coupled to an ion-Funnel
    Triple quadrupole 6490 mass spectrometer (Agilent Technologies). This
    file was contributed by Xavi Domingo-Almenara from The Scripps
    Research Institute, San Diego, CA.
- More details: `?MRM`.

Load with

```{r}
f <- MsDataHub::MRM.standmix.5.mzML()
```
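
The chunk above only retrieves the file path. As a minimal sketch (not
evaluated here), the chromatographic data can also be inspected with
the lower-level *mzR* package; this is an alternative chosen for
illustration, not the *Chromatograms* interface mentioned above.

```{r, eval = FALSE}
library(mzR)
f <- MsDataHub::MRM.standmix.5.mzML()
ms <- openMSfile(f)
## Number of chromatograms stored in the file.
nChrom(ms)
## Metadata (identifiers, precursor and product m/z) per chromatogram.
head(chromatogramHeader(ms))
## Retention time/intensity pairs, one data.frame per chromatogram.
chr <- chromatograms(ms)
close(ms)
```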

## CE-MS data

The CE-MS test data consist of two files, `"CEMS_10ppm.mzML"` and
`"CEMS_25ppm.mzML"`. The data contain CE-MS runs of a standard mixture that
contains e.g. Lysine (at 10 ppm and 25 ppm, respectively) and the neutral EOF
marker Paracetamol (50 ppm). The data were acquired on a 7100 capillary
electrophoresis system from Agilent Technologies, coupled to an Agilent 6560
IM-QToF-MS. CE separation was performed using an 80 cm fused silica capillary
with an internal diameter of 50 µm and an external diameter of 365 µm. The
background electrolyte was 10% acetic acid and separation was performed at
+30 kV and a constant pressure of 50 mbar. MS detection was performed in
positive ionization mode.

The raw data were then converted to the open *.mzML* format with
ProteoWizard. To reduce data size, the test data were subset to a retention
time range from 400 to 900 seconds and an *m/z* range from 147.1 to 152.0.

- Type: raw MS data in mzML file format.
- Files:
  - `CEMS_10ppm.mzML`: sample with Lysine added at 10 ppm.
  - `CEMS_25ppm.mzML`: sample with Lysine added at 25 ppm.
- More details: `?CEMS`.

Load with

```{r}
f <- MsDataHub::CEMS_25ppm.mzML()
s <- Spectra(f)
```
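
The subsetting described above can be checked on the loaded object. The
chunk below is a small sketch (not evaluated here) that uses only
standard `Spectra` accessors; the ranges it returns should fall within
the retention time and *m/z* windows given above.

```{r, eval = FALSE}
f <- MsDataHub::CEMS_25ppm.mzML()
s <- Spectra(f)
## Retention time range, in seconds.
range(rtime(s))
## Overall m/z range across all spectra.
range(unlist(mz(s)))
```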

## TMT MS3 SPS data

Example MS3 SPS TMT data.

- `MS3TMT10_01022016_32917-33481.mzML.gz` is an mzML file containing
  565 spectra from an MS3 SPS TMT 10-plex experiment.
- `MS3TMT11.mzML` is an mzML file containing 994 scans from an MS3 SPS
  TMT 11-plex experiment.
- `fdms3tmt11.rda` contains a data.frame with identification data for
  `MS3TMT11.mzML`.
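
As a sketch of how the 11-plex file might be loaded (not evaluated
here), assuming the accessor function is named after the file
(`MS3TMT11.mzML()`):

```{r, eval = FALSE}
library(Spectra)
## Accessor name assumed from the file name listed above.
f <- MsDataHub::MS3TMT11.mzML()
sp <- Spectra(f)
## Number of scans and their distribution across MS levels.
length(sp)
table(msLevel(sp))
```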

# Adding data to `MsDataHub`

1. If you would like to add a dataset to `MsDataHub`, start by
   opening an
   [issue](https://github.com/rformassspectrometry/MsDataHub/issues)
   in the package's GitHub repository and describe the new data. In
   particular, provide information about its provenance, its use and
   its format(s), and acknowledge that the data may be shared freely
   with the community without any restrictions. You may provide an
   open licence specifying the terms under which the data can be
   re-used, typically a CC-BY-SA licence.
2. By contributing to the package, you acknowledge that you will
   comply with the R for Mass Spectrometry project [code of
   conduct](https://rformassspectrometry.github.io/RforMassSpectrometry/articles/RforMassSpectrometry.html#code-of-conduct).
3. A maintainer of the package will reply to your issue, confirming
   that the data can be added.
4. At this point, if you are familiar with the development of
   `ExperimentHub` packages and GitHub *pull requests*, you may
   directly send one that adds your data to the package. Make sure (1)
   to add appropriate references in the manual page and (2) to add
   yourself as a contributor to the package in the DESCRIPTION file.
5. Alternatively, a maintainer will add the dataset to the package and
   may require your input to make sure the documentation file is
   complete.

# Session information {-}

```{r sessioninfo, echo=FALSE}
sessionInfo()
```