---
title: "Supported input formats for readQFeatures()"
author:
- name: "Karolína Kryštofová"
- name: Laurent Gatto
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('QFeatures')`"
output:
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
vignette: >
  %\VignetteIndexEntry{readQFeatures() input format}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
link-citations: true
link-bibliography: true
---

## Methods

### Datasets

This vignette demonstrates the use of the *QFeatures* package's
`readQFeatures()` function to import data produced by popular third-party
software. For this purpose, subsets of the following publicly available
datasets have been used:

+---------------------------+-----------+------+---------+------------+
| Citation                  | PXD ID    | Mode | Label   | Code       |
+===========================+===========+======+=========+============+
| Van Puyvelde et al., 2022 | PXD028735 | DIA  | LFQ     | A_DIA_LFQ  |
|                           |           +------+---------+------------+
|                           |           | DDA  | LFQ     | A_DDA_LFQ  |
+---------------------------+-----------+------+---------+------------+
| Derks et al., 2022        | PXD029531 | DIA  | plexDIA | B_DIA_plex |
+---------------------------+-----------+------+---------+------------+
| Christoforou et al., 2016 | PXD001279 | DDA  | TMT     | C_DDA_TMT  |
+---------------------------+-----------+------+---------+------------+

Specifically, these subsets consist of the following files:

+------------+-------------------------------------------------------+
| Code       | Original raw file name                                |
+------------+-------------------------------------------------------+
| A_DIA_LFQ  | LFQ_timsTOFPro_diaPASEF_Condition_A_Sample_Alpha_01.d |
|            +-------------------------------------------------------+
|            | LFQ_timsTOFPro_diaPASEF_Condition_A_Sample_Alpha_02.d |
|            +-------------------------------------------------------+
|            | LFQ_timsTOFPro_diaPASEF_Condition_A_Sample_Alpha_03.d |
|            +-------------------------------------------------------+
|            | LFQ_timsTOFPro_diaPASEF_Condition_B_Sample_Alpha_01.d |
|            +-------------------------------------------------------+
|            | LFQ_timsTOFPro_diaPASEF_Condition_B_Sample_Alpha_02.d |
|            +-------------------------------------------------------+
|            | LFQ_timsTOFPro_diaPASEF_Condition_B_Sample_Alpha_03.d |
+------------+-------------------------------------------------------+
| A_DDA_LFQ  | LFQ_Orbitrap_DDA_Condition_A_Sample_Alpha_01.raw      |
|            +-------------------------------------------------------+
|            | LFQ_Orbitrap_DDA_Condition_A_Sample_Alpha_02.raw      |
|            +-------------------------------------------------------+
|            | LFQ_Orbitrap_DDA_Condition_A_Sample_Alpha_03.raw      |
|            +-------------------------------------------------------+
|            | LFQ_Orbitrap_DDA_Condition_B_Sample_Alpha_01.raw      |
|            +-------------------------------------------------------+
|            | LFQ_Orbitrap_DDA_Condition_B_Sample_Alpha_02.raw      |
|            +-------------------------------------------------------+
|            | LFQ_Orbitrap_DDA_Condition_B_Sample_Alpha_03.raw      |
+------------+-------------------------------------------------------+
| B_DIA_plex | wJD803.raw                                            |
|            +-------------------------------------------------------+
|            | wJD804.raw                                            |
|            +-------------------------------------------------------+
|            | wJD815.raw                                            |
+------------+-------------------------------------------------------+
| C_DDA_TMT  | Replicate1_fraction1.raw                              |
|            +-------------------------------------------------------+
|            | Replicate1_fraction2.raw                              |
+------------+-------------------------------------------------------+

Identification and quantification of these datasets were performed using
the following software:

```{r dataset_matrix, echo=FALSE}
suppressPackageStartupMessages(library(ComplexHeatmap))
# All software tools, labels and report levels
sws <- c('MaxQuant', 'DIA-NN', 'Spectronaut', 'Proteome Discoverer',
         'sage', 'FragPipe', 'PEAKS')
label <- c('LFQ', 'multiplex')
# Create a combination dataset, by default leave value cells blank
## value: 'no' for not yet processed data, 'yes' for already processed data
## dataset: what dataset will be used
d <- expand.grid(software = sws, label = label)
d$value <- 'no'
d$dataset <- ''
# Fill in manually with currently known information
d$dataset[which(d$label == 'LFQ' & d$software == 'Spectronaut')] <- 'A_DIA_LFQ'
d$dataset[which(d$label == 'LFQ' & d$software == 'DIA-NN')] <- 'A_DIA_LFQ'
d$dataset[which(d$label == 'LFQ' & d$dataset == '')] <- 'A_DDA_LFQ'
d$dataset[which(d$label == 'multiplex' &
                d$software %in% c('DIA-NN', 'Spectronaut'))] <- 'B_DIA_plex'
d$dataset[which(d$dataset == '')] <- 'C_DDA_TMT'
## d$value[which(d$software %in% c('DIA-NN', 'MaxQuant', 'sage', 'FragPipe'))] <- 'yes'
## d$value[which(d$software == 'Spectronaut' & d$label == 'LFQ')] <- 'yes'
## d$value[which(d$software == 'FragPipe')] <- 'yes'
## d$value[which(d$software == 'PEAKS' & d$label == 'LFQ')] <- 'yes'
d$value <- "yes"
# Dataset to matrix for plotting
d <- tidyr::unite(d, col, label)
#lvls <- c('LFQ_PSM', 'LFQ_peptide', 'LFQ_PG',
#          'multiplex_PSM', 'multiplex_peptide', 'multiplex_PG')
#d$col <- factor(d$col, levels = lvls)
#d <- dplyr::arrange(d, col)
m <- d |>
    dplyr::select(-dataset) |>
    tidyr::spread(key = 'col', value = 'value')
rownames(m) <- m$software
m$software <- NULL
m <- as.matrix(m)
colnames(m) <- gsub('LFQ_', '', fixed = TRUE, colnames(m))
colnames(m) <- gsub('multiplex_', '', fixed = TRUE, colnames(m))
# Create a helper matrix storing the text annotations
m_anno <- d |>
    dplyr::select(-value) |>
    tidyr::spread(key = 'col', value = 'dataset')
rownames(m_anno) <- m_anno$software
m_anno$software <- NULL
m_anno <- as.matrix(m_anno)
colnames(m_anno) <- gsub('LFQ_', '', fixed = TRUE, colnames(m_anno))
colnames(m_anno) <- gsub('multiplex_', '', fixed = TRUE, colnames(m_anno))
## Create a plot using the ComplexHeatmap package
colors <- structure(rep('grey95', 2), names = c("yes", "no"))
Heatmap(m, name = "Available", col =
        colors,
        cluster_rows = FALSE, cluster_columns = FALSE,
        row_names_side = "left", column_names_side = "top",
        column_names_centered = TRUE, column_names_rot = 0,
        show_heatmap_legend = FALSE, border = TRUE,
        cell_fun = function(j, i, x, y, width, height, fill) {
            grid.text(m_anno[i, j], x, y, gp = gpar(fontsize = 10))
        })
```

The data files are available in the `MsDataHub` package (>= 1.11.5).

```{r MsDataHub}
library("MsDataHub")
MsDataHub() |>
    dplyr::filter(grepl("19137577", SourceUrl)) |>
    dplyr::pull(Title)
```

Each file can be accessed with the function bearing its name:

```{r loadFromMsDataHub}
vanPuyvelde_2022_LFQ_DDA_FragPipe_A_2_psm.tsv()
Derks_2022_plex_DIA_DIANN_report_subset.tsv()
```

and imported as a standard `data.frame` using the usual `read.*()`
functions (see below).

### Existing search outputs

Example outputs for LFQ quantification using DIA-NN, Spectronaut and
PEAKS were sourced from the ProteoBench website. Their IDs on the
ProteoBench website are as follows:

- DIA-NN:
    - DIA-NN_20250714_074145 (.parquet output)
    - DIA-NN_20250606_103313 (.tsv output)
- Spectronaut:
    - Spectronaut_20250609_135453
- PEAKS:
    - PEAKS_20250714_150458

As an example output for multiplex quantification using DIA-NN, a search
result of the plexDIA dataset was sourced from the MassIVE repository
(MSV000088302).

### New search outputs

The following software versions were used to produce the search results
used in this vignette:

- MaxQuant 2.7.0
- sage 0.14.7
- FragPipe 23.1

## Introduction

Below, we describe the individual outputs and their processing using the
`readQFeatures()` functions. When applicable, we demonstrate how to read
data at the PSM, precursor and protein group levels. For a general
explanation of the `QFeatures` class and a detailed description of the
individual arguments taken by the `readQFeatures()` group of functions,
consult the `readQFeatures()` manual page and this
[vignette](https://rformassspectrometry.github.io/QFeatures/articles/readQFeatures.html).
To initiate the session, we will load the `QFeatures` package.

```{r library, message = FALSE}
library(QFeatures)
```

## MaxQuant

MaxQuant produces several output `.txt` files. To obtain information from
several levels of the search, we can look at the `evidence.txt`,
`peptides.txt` and `proteinGroups.txt` files.

### Label-free

Here we will process the results of a multi-set label-free experiment.
First we will read the `evidence.txt` file storing the PSM-level data:

```{r mq-lfq-psm, message=FALSE}
dataMaxquantLFQevidence <-
    vanPuyvelde_2022_LFQ_DDA_MaxQuant_evidence.txt() |>
    read.delim()
nrow(dataMaxquantLFQevidence)
```

We can now import the data.frame as a `QFeatures` object using
`"Intensity"` as the quantitative column. This column contains the
quantitation values of all samples, acquired in different runs, as
defined in the `"Experiment"` column. We also rename the sets, prefixing
their names with `"psm_"`.

```{r mq-lfq-psm2, message=FALSE}
qfMaxquant <- readQFeatures(dataMaxquantLFQevidence,
                            quantCols = "Intensity",
                            runCol = "Experiment")
names(qfMaxquant) <- paste('psm', names(qfMaxquant), sep = '_')
qfMaxquant
```

Next we will read the peptide-level results from a `peptides.txt` file
and append them to the `QFeatures` object as a new assay:

```{r mq-lfq-peptide0, message=FALSE}
dataMaxquantLFQpeptide <-
    vanPuyvelde_2022_LFQ_DDA_MaxQuant_peptides.txt() |>
    read.delim()
nrow(dataMaxquantLFQpeptide)
```

This table is in a wide format, meaning that the peptide quantitation
values of the different samples are stored in different columns. We thus
get the indices of the respective intensity columns, starting with
`"Intensity."`.

```{r mq-lfq-peptide1, message=FALSE}
(i <- grep('Intensity.', colnames(dataMaxquantLFQpeptide), fixed = TRUE))
colnames(dataMaxquantLFQpeptide)[i]
```

We can read this peptide-level table as a new `QFeatures` object using
the same `readQFeatures()` call as above.
This time, it will contain a single set with as many columns as there are
samples/acquisitions in the data.

```{r mq-lfq-peptide2, message=FALSE}
readQFeatures(dataMaxquantLFQpeptide, quantCols = i, fnames = 'Sequence')
```

If we want to add the peptide-level data to our previously created
`QFeatures` object, we can read it as an individual set (a
`SummarizedExperiment` instance) and add it with `addAssay()`:

```{r mq-lfq-peptide3, message=FALSE}
pepSE <- readSummarizedExperiment(dataMaxquantLFQpeptide,
                                  quantCols = i, fnames = 'Sequence')
pepSE
qfMaxquant <- addAssay(qfMaxquant, pepSE, name = 'peptides')
qfMaxquant
```

We see that a new assay has been appended to the `QFeatures` object.

Finally, we will append the protein group-level data in the same manner.
Here we will use the `"LFQ.intensity"` columns:

```{r mq-lfq-pg, message=FALSE}
dataMaxquantLFQprotein <-
    vanPuyvelde_2022_LFQ_DDA_MaxQuant_proteinGroups.txt() |>
    read.delim()
nrow(dataMaxquantLFQprotein)
## get indices of LFQ intensity columns
(i <- grep('LFQ.intensity.', colnames(dataMaxquantLFQprotein), fixed = TRUE))
colnames(dataMaxquantLFQprotein)[i]
## load the data
protSE <- readSummarizedExperiment(dataMaxquantLFQprotein,
                                   quantCols = i, fnames = 'Protein.IDs')
protSE
qfMaxquant <- addAssay(qfMaxquant, protSE, name = 'proteinGroups')
qfMaxquant
```

It is important to highlight, however, that while it is possible to add
PSM-, peptide- and protein-level sets one by one, as illustrated above,
we strongly recommend computing the peptide-level data from the PSM-level
data, and the protein-level data from the peptide-level data, using the
`QFeatures::aggregateFeatures()` function. That function records the
links between features, from PSMs to peptides and from peptides to
proteins, and consistently applies filtering across these levels.
Alternatively, the links between sets can be re-computed with the
`addAssayLink()` function.
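The recommended aggregation workflow could be sketched as follows. This
is a minimal, non-evaluated sketch assuming the PSM-level object created
above; the set index `1`, the summarisation function and the rowData
columns `Sequence` and `Leading.razor.protein` (standard MaxQuant
evidence columns after `read.delim()`) would need to be adapted to your
data.

```{r aggregate-sketch, eval = FALSE}
## Aggregate PSMs into peptides, grouping rows by their peptide sequence
## and summarising the quantitative values with the column medians
qfAgg <- aggregateFeatures(qfMaxquant, i = 1,
                           fcol = "Sequence",
                           name = "peptides",
                           fun = colMedians)
## Aggregate peptides into protein groups in the same way
qfAgg <- aggregateFeatures(qfAgg, i = "peptides",
                           fcol = "Leading.razor.protein",
                           name = "proteins",
                           fun = colMedians)
qfAgg
```

Because each call records the parent-child relations between the assays,
subsequent filtering operations propagate consistently across the PSM,
peptide and protein levels.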
### TMT

Below, we demonstrate how to read data from a TMT-labelled experiment
consisting of two runs:

```{r mq-tmt, message=FALSE}
dataMaxquantTMTevidence <-
    Christoforou_2016_TMT_DDA_MaxQuant_evidence.txt() |>
    read.delim()
(i <- grep('Reporter.intensity.\\d+', colnames(dataMaxquantTMTevidence)))
colnames(dataMaxquantTMTevidence)[i]
qfMaxquantTMT <- readQFeatures(dataMaxquantTMTevidence,
                               quantCols = i,
                               runCol = 'Raw.file',
                               fnames = 'Sequence')
qfMaxquantTMT
```

We see that a separate set has been created for each run, with 10 columns
corresponding to the 10 TMT channels.

## DIA-NN

### Label-free

DIA-NN versions 1.9.0 and below produce a main *.tsv* search result file,
which, from version 2.0.0 onwards, has been replaced by a *.parquet*
file. DIA-NN *.tsv* reports can be read using the
`readQFeaturesFromDIANN()` function:

```{r diann-tsv, message = FALSE}
qfDiannLFQ <-
    vanPuyvelde_2022_LFQ_DIA_DIANN_report.tsv() |>
    read.delim() |>
    readQFeaturesFromDIANN(runCol = 'Run')
qfDiannLFQ
```

To read a *.parquet* file in R, we need the `arrow` package, which
provides an interface to the Arrow C++ library. Once the file has been
read, however, we can work with the resulting data.frame in the same way
as with the *.tsv* report.

```{r diann-parquet, message=FALSE}
qfDiannParquet <-
    vanPuyvelde_2022_LFQ_DIA_DIANN_report.parquet() |>
    arrow::read_parquet() |>
    readQFeaturesFromDIANN(runCol = 'Run')
qfDiannParquet
```

Both objects contain the same run names, as both searches were performed
on the same set of raw data. The numbers of rows in each
`SummarizedExperiment` differ between the two reports, however, as the
searches were performed with different software versions and different
search parameters.
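This can be checked directly; the following non-evaluated sketch assumes
the `qfDiannLFQ` and `qfDiannParquet` objects created above:

```{r diann-compare, eval = FALSE}
## The two objects share the same set (run) names ...
setdiff(names(qfDiannLFQ), names(qfDiannParquet))
## ... but the number of precursors per run differs
dims(qfDiannLFQ)
dims(qfDiannParquet)
```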
### plexDIA

To correctly parse a plexDIA experiment, it is necessary to set the
`multiplexing` parameter to `"mTRAQ"`:

```{r diann-plexdia, message=FALSE}
qfDiannPlex <-
    Derks_2022_plex_DIA_DIANN_report_subset.tsv() |>
    read.delim() |>
    readQFeaturesFromDIANN(runCol = 'Run', multiplexing = 'mTRAQ')
qfDiannPlex
```

The run names in this output file are not the most informative with
regards to the samples. We will now edit the sample metadata to contain
more meaningful sample annotations. All runs were performed on the same
sample, in 3 technical replicates:

```{r}
qfDiannPlex$sample <- 'mixed standard'
qfDiannPlex$rep <- rep(1:3, each = 3)
qfDiannPlex$label <- paste0('mTraq d', rep(c(0, 4, 8), times = 3))
colData(qfDiannPlex)
```

## sage

The **sage** search engine stores quantification data in either the
*lfq.tsv* or the *tmt.tsv* file, depending on the quantification used.

For label-free quantification, the *lfq.tsv* file contains the estimated
quantities of the identified peptidoforms, grouped at the modified
sequence level:

```{r sage-lfq, message = FALSE, warning=FALSE}
dataSageLFQ <-
    vanPuyvelde_2022_LFQ_DDA_sage_lfq.tsv() |>
    read.delim()
(i <- grep('.mzML', colnames(dataSageLFQ), fixed = TRUE))
colnames(dataSageLFQ)[i]
qfSageLFQ <- readQFeatures(dataSageLFQ, quantCols = i, name = 'peptides')
qfSageLFQ
```

For TMT-based quantification, the PSM-level quantification is included in
the *tmt.tsv* file.

```{r sage-tmt, message = FALSE, warning = FALSE}
dataSageTMT <-
    Christoforou_2016_TMT_DDA_sage_tmt.tsv() |>
    read.delim()
```

Upon inspection, we can see that the peptide identification information
is missing from this file:

```{r}
colnames(dataSageTMT)
```

We can source it from the main *results.sage.tsv* file and append this
information to the quantification data frame. We also extract the indices
of the quantification columns before loading the data with the
`readQFeatures()` function.
```{r sage-tmt-2, message = FALSE, warning = FALSE}
dataSageTMTident <-
    Christoforou_2016_TMT_DDA_sage_results.sage.tsv() |>
    read.delim()
dataSageTMTfinal <- merge(dataSageTMT, dataSageTMTident,
                          by = c('filename', 'scannr'))
(i <- grep('tmt_', colnames(dataSageTMTfinal), fixed = TRUE))
colnames(dataSageTMTfinal)[i]
qfSageTMT <- readQFeatures(dataSageTMTfinal, quantCols = i)
qfSageTMT
```

A more straightforward way is to use the `sager::sageQFeatures()`
function from the `r BiocStyle::Githubpkg("UCLouvain-CBIO/sager")`
package to load the TMT quantification data from both *.tsv* output files
into a `QFeatures` object in one step.

```{r sager, eval = FALSE}
sager::sageQFeatures(
    Christoforou_2016_TMT_DDA_sage_tmt.tsv(),
    Christoforou_2016_TMT_DDA_sage_results.sage.tsv())
```

## FragPipe

**FragPipe** produces several outputs. The following code blocks show the
processing of a label-free multi-set experiment.

First, we load the *psm.tsv* files that are produced separately for each
sample/biological replicate combination, as specified during the FragPipe
configuration. In our case, a separate file has been created for each
run. We start by extracting the relevant filenames from `MsDataHub`.

```{r fragpipe-lfq0, message = FALSE}
fls <- MsDataHub() |>
    dplyr::filter(grepl("2022_LFQ_DDA_FragPipe", Title)) |>
    dplyr::pull(1)
fls
```

Below, we iterate over each filename, convert it to a function call that
we then evaluate, and load the result as a `SummarizedExperiment`. The
code below produces a list of `SummarizedExperiment` instances, which we
then name using the initial filenames.

```{r fragpipe-lfq1, message = FALSE}
lst <- lapply(fls, function(fl) {
    call(fl) |>
        eval() |>
        read.delim() |>
        readSummarizedExperiment(quantCols = "Intensity")
})
names(lst) <- fls
```

We can now pass this list to the `QFeatures` constructor to create a
`QFeatures` object.
```{r fragpipe-lfq2, message = FALSE}
qfFpipeLFQ <- QFeatures(lst)
qfFpipeLFQ
```

The names of the assays are based on the (rather long) filenames they
were derived from. We can shorten these:

```{r fragpipe-lfq-names, message=FALSE}
names(qfFpipeLFQ) <-
    sub('vanPuyvelde_2022_LFQ_DDA_FragPipe_(\\w_\\d_psm)\\.tsv',
        '\\1', names(qfFpipeLFQ))
qfFpipeLFQ
```

The processing of the peptide- and protein-level outputs is similar to
the MaxQuant processing above.

### TMT

In the following section, we demonstrate the processing of a TMT-labelled
multi-set experiment. It consists of two runs named *Fraction1* and
*Fraction2*. Just like in the case of a label-free experiment, a separate
*psm.tsv* file is produced for each run:

```{r fragpipe-tmt, message=FALSE}
fls <- MsDataHub() |>
    dplyr::filter(grepl("Christoforou_2016_TMT_DDA_FragPipe_Fraction", Title)) |>
    dplyr::pull(1)
lst <- lapply(fls, function(fl) {
    x <- eval(call(fl)) |>
        read.delim()
    i <- grep('Intensity\\.', colnames(x))
    readSummarizedExperiment(x, quantCols = i)
})
names(lst) <- fls
QFeatures(lst)
```
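As in the label-free example above, the long assay names could be
shortened with a regular expression. The titles below are hypothetical,
following the MsDataHub naming pattern used in this section; the actual
names should be taken from `names(lst)`.

```{r fragpipe-tmt-names}
## Hypothetical MsDataHub titles for the two TMT runs
fns <- c("Christoforou_2016_TMT_DDA_FragPipe_Fraction1_psm.tsv",
         "Christoforou_2016_TMT_DDA_FragPipe_Fraction2_psm.tsv")
## Keep only the 'FractionN' part to use as set names
sub('.*_(Fraction\\d+)_psm\\.tsv', '\\1', fns)
```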