---
title: "Introduction to CalibraCurve"
author: 
  - name: Karin Schork
    affiliation:
    - Medizinisches Proteom-Center, Ruhr-Universität Bochum
    email: karin.schork@rub.de
output: 
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
date: "`r doc_date()`"
package: "`r pkg_ver('CalibraCurve')`"
vignette: >
  %\VignetteIndexEntry{1. Introduction to CalibraCurve}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL 
    ## Related to 
    ## https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
```


```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE}
## Bib setup
library("RefManageR")

## Write bibliography information
bib <- c(
    R = citation(),
    BiocStyle = citation("BiocStyle")[1],
    knitr = citation("knitr")[1],
    RefManageR = citation("RefManageR")[1],
    rmarkdown = citation("rmarkdown")[1],
    sessioninfo = citation("sessioninfo")[1],
    testthat = citation("testthat")[1],
    CalibraCurve = citation("CalibraCurve")[1],
    ggplot2 = citation("ggplot2")[1],
    msqc1 = citation("msqc1")[1]
    
)

BibOptions(max.names = 2, bib.style = "authoryear", style = "citation")

```


# Introduction

Targeted mass spectrometry-based experiments (e.g. in proteomics, metabolomics 
or lipidomics) are used for the validation of biomarkers or for quantitative 
assays. During the development of these assays, calibration curves are a 
valuable tool to assess the upper and lower limits of quantification and the 
linearity of the obtained measurements. The resulting model can also be used to 
predict concentrations from measured intensities (absolute quantification).

CalibraCurve is a tool to generate and visualize calibration curves. The tool is
already established as a KNIME workflow based on an R script 
`r Citep(bib[["CalibraCurve"]])` and a service in 
[de.NBI](https://www.denbi.de/) 
and [ELIXIR Germany](https://elixir-europe.org/). The code base was re-worked 
into an R package without changing the underlying algorithm. Several 
improvements were made:

- support for xlsx files and SummarizedExperiment objects as input
- plots generated with `ggplot2` `r Citep(bib[["ggplot2"]])` for nicer, 
publication-ready visualization 
- more customization options for the plots (colours etc.)
- the option to generate multiple curves in one graph
- re-worked and improved error messages
- a function to predict concentrations based on the model
- an additional 
[nextflow workflow](https://github.com/mpc-bioinformatics/CalibraCurve_NF), 
which can be integrated into other workflows or platforms like 
[MAcWorP](https://github.com/cubimedrub/macworp)


# Installation of `CalibraCurve`

To install the latest version of `CalibraCurve` from Bioconductor use the 
following code snippet:

```{r "install", eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("CalibraCurve")
```

# Asking for help

If you need help with `CalibraCurve`, have encountered a bug or have 
suggestions for improvements, leave a post on the 
[Bioconductor support site](https://support.bioconductor.org/) with the 
`CalibraCurve` tag or open an issue on our 
[GitHub page](https://github.com/mpc-bioinformatics/CalibraCurve).


# Citing `CalibraCurve`

We hope that `r Biocpkg("CalibraCurve")` will be useful for your research. 
Please use the following information to cite the package and the overall 
approach. Thank you!

```{r "citation"}
## Citation info
citation("CalibraCurve")
```


# Quick start to using `CalibraCurve`

CalibraCurve's main task is to calculate calibration curves for targeted mass 
spectrometry-based data, e.g. from proteomics, metabolomics or lipidomics. 
The experimental setup is the following: for the analyte of interest, samples 
with different known concentration levels of this analyte are prepared.
Ideally, these samples are measured in replicates, yielding intensities or 
areas as measurements.

To calculate the calibration curve, two pieces of information are needed: The 
known concentration levels and the measured intensities for each sample.

There are two options to provide this information to CalibraCurve. First, a 
set of Excel files may be provided, each containing at least a Concentration 
and a Measurement column. 
Second, a SummarizedExperiment object may be provided. This object may contain 
measurements for different analytes, one analyte per row. Information about the 
concentration levels has to be provided in the colData part. 

## Example data set

CalibraCurve contains test data from the `msqc1` R package 
`r Citep(bib[["msqc1"]])`, more precisely parts of the
dilution series from the `msqc1_dil` object.

This data was filtered in the following way:

- only QTRAP, TSQVantage and QExactive instruments (PRM or SRM method) are used
- only data on the level of y-ions are used, no precursors
- only the heavy isotope version of the peptide is used

`SummarizedExperiment` objects with the corresponding data for each peptide 
sequence are provided as .rds files.

Additionally, for the peptide sequence "GGPFSDSYR" the data is stored as .xlsx
tables, separated by instrument and ion type.


## Import one or more xlsx, csv or txt files

A single `.xlsx`, `.csv` or `.txt` file, containing at least a "Concentration" 
and a "Measurement" column, can be imported using the `readDataTable` function. 

The number of the column containing the concentration levels has 
to be defined as `concCol`.
The number of the column containing the measurements of the substance or 
analyte has to be defined as `measCol`.
The resulting object can then be used directly in the main function 
`CalibraCurve` to generate and visualize the calibration curve.

Here, an example is given using a `.xlsx` file:


```{r "start", message=FALSE}
library("CalibraCurve")

file <- system.file("extdata", "MSQC1_xlsx", "GGPFSDSYR_QExactive_y5.xlsx", 
                    package = "CalibraCurve")

D <- readDataTable(dataPath = file, 
                    fileType = "xlsx", 
                    concCol = 16, # column "amount" containing concentrations
                    measCol = 12, # column "Area" containing measurements
                    naStrings = c("NA", "NaN", "Filtered", "#NV"), 
                    sheet = 1)
print(head(D))

```
The resulting object is a data frame with two columns, 
"Concentration" and "Measurement". 
The rows are sorted from lowest to highest concentration level. This data frame
can then be used directly in the main function 
`CalibraCurve` to generate and visualize the calibration curve.
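
To make this layout concrete, the following minimal sketch builds a data frame 
of the same shape by hand. All values and the object name `toy_input` are 
invented for illustration and are not part of the example data set. The sketch 
only shows the two-column, sorted structure described above, not what 
`readDataTable` does internally.

```{r "sketch_input_format", message=FALSE}
## Hypothetical toy data: three concentration levels, two replicates each
toy_input <- data.frame(
    Concentration = c(10, 1, 100, 1, 10, 100),
    Measurement   = c(2.1e4, 2.3e3, 1.9e5, 1.9e3, 1.8e4, 2.2e5)
)

## Sort the rows from lowest to highest concentration level
toy_input <- toy_input[order(toy_input$Concentration), ]
rownames(toy_input) <- NULL

print(toy_input)
```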


Alternatively, a whole folder of `.xlsx`, `.csv` or `.txt` files can be 
imported using the `readMultipleTables` function. All files in the folder have
to follow the same structure, i.e. the "Concentration" and "Measurement"
columns have to be in the same positions in every file. 

Here is an example using a folder containing 9 `.xlsx` files:


```{r "start2", message=FALSE}
library("CalibraCurve")

folder <- system.file("extdata", "MSQC1_xlsx", package = "CalibraCurve")

D <- readMultipleTables(dataFolder = folder, 
                        fileType = "xlsx", 
                        concCol = 16, 
                        measCol = 12) 
print(D)

str(D)

```

Each file in the folder is imported and the resulting data frames are combined 
into a list. Each list entry contains the data belonging to one of the input 
files, which here represent different measurements of the same peptide that 
differ in the fragment ion and the instrument used.


## Import a SummarizedExperiment object

A SummarizedExperiment object stored as an .rds file can be imported using the 
`readDataSE` function. 
The name of the colData column containing the concentration levels has 
to be defined as `concColName`.
The name of the rowData column containing the substance or analyte names has 
to be defined as `substColName`.

Here, we will use the default options. 


```{r "start_SE", message=FALSE}
library("CalibraCurve")

file <- system.file("extdata", "MSQC1", "msqc1_dil_GGPFSDSYR.rds", 
                    package = "CalibraCurve")

D <- readDataSE(file, concColName = "amount_fmol", substColName = "Substance")

print(D)

str(D)
```

The resulting object is a list of data frames, 
each with two columns, "Concentration" and "Measurement". 
The rows are sorted from lowest to highest concentration level. 

This list of data frames can then be used directly in the main function 
`CalibraCurve` to generate and visualize the calibration curves.


Alternatively, a SummarizedExperiment object that has already been loaded
into R can be passed directly to `readDataSE`:

```{r "start_SE2", message=FALSE}
library("CalibraCurve")

file <- system.file("extdata", "MSQC1", "msqc1_dil_GGPFSDSYR.rds", 
                    package = "CalibraCurve")

SE_data <- readRDS(file)

D <- readDataSE(rawDataSE = SE_data, 
                concColName = "amount_fmol", 
                substColName = "Substance")

print(D)

str(D)
```


## Apply CalibraCurve

As soon as the data have been imported, they can be passed directly to the 
`CalibraCurve` function. CalibraCurve has many options, 
e.g. thresholds or customization options for the graphics. All of these options
have sensible default values, so you can try out CalibraCurve without 
thinking about them. However, for optimal results or publication-ready 
visualizations some adjustments may be necessary. 
Details and examples on customizing the visualization are given in the
vignette "Customizing the visualizations of CalibraCurve".


```{r "apply_CC", message=FALSE}
RES <- CalibraCurve(D)


```


CalibraCurve outputs several tables and two kinds of plots: the calibration 
curves and the response factor plots. The calibration curves are stored in the 
list entry `plot_CC_list` of the result object.
By default, each calibration curve is plotted separately. 
Please be aware that by default the plots are not exported (you have to specify 
an `output_path` for that).

As an example, we plot here
the calibration curve for the fourth analyte: 


```{r "plot_SE", message=FALSE}
print(RES$plot_CC_list[[4]])
```

The calibration curve is plotted as a red line, and both axes are shown on a 
logarithmic scale. 
The grey background shows the linear range (here between XY and 1000 fmol); 
the curve was only fitted to data points inside this range. The grey data 
points outside of the linear range visibly do not fit the curve.

Additionally, we can now look at the corresponding response factor plot:

```{r "plot_RF", message=FALSE}
print(RES$plot_RF_list[[4]])
```

For each data point, the so-called response factor is calculated. For the data 
point $i$ it is defined as

$$\mathrm{RF}_i = \frac{\mathrm{intensity}_i - \mathrm{intercept}}{\mathrm{concentration}_i}$$

where the intercept is the y-intercept of the linear model.

The numerator describes the signal produced by the analyte (intensity minus the 
intercept) and the denominator the concentration or amount of the analyte.
Ideally, the mean response factors for each concentration level (shown as the 
larger dots) would lie on a straight line and not vary much between the 
concentration levels. In orange, thresholds based on 80% and 120% of the 
overall mean response factor (within the linear range) are plotted. Values 
within these borders are coloured in green, the ones outside in pink. For all 
concentration levels within the final linear range the response factors lie 
within these borders, which serves as a further quality check of the results 
and of the linearity of the response.
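
The calculation described above can be sketched in a few lines of base R. This 
is only an illustration under stated assumptions: the toy data, the intercept 
value and all object names (`toy_rf`, `toy_intercept`, ...) are hypothetical, 
all toy data points are assumed to lie within the linear range, and the overall 
mean is taken over the individual data points, which may differ from how the 
package computes it.

```{r "sketch_response_factor", message=FALSE}
## Hypothetical toy data and intercept (not taken from the example data set)
toy_rf <- data.frame(
    Concentration = c(1, 1, 10, 10, 100, 100),
    Measurement   = c(1200, 1300, 10500, 9800, 100900, 99500)
)
toy_intercept <- 200

## Response factor per data point: (intensity - intercept) / concentration
toy_rf$RF <- (toy_rf$Measurement - toy_intercept) / toy_rf$Concentration

## Mean response factor per concentration level
toy_rf_mean <- tapply(toy_rf$RF, toy_rf$Concentration, mean)

## Thresholds at 80% and 120% of the overall mean response factor
toy_overall <- mean(toy_rf$RF)
toy_limits <- c(lower = 0.8 * toy_overall, upper = 1.2 * toy_overall)

## Flag levels whose mean response factor lies within the thresholds
toy_within <- toy_rf_mean >= toy_limits["lower"] &
    toy_rf_mean <= toy_limits["upper"]

print(toy_rf_mean)
print(toy_within)
```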

For each analyte, two result tables are produced to give insight into the steps 
of the algorithm. The first one gives an overview of the
results per concentration level:

```{r "print_table_conc_levels", message=FALSE}
print(RES$RES[[4]]$result_table_conc_levels)
```

There is one row for each concentration level. It contains the mean 
measurement, the measurement estimated by the linear model and further 
information on the different steps of the algorithm (e.g. if the concentration 
level was removed during the cleaning step or if it was present in the 
preliminary linear range). The last column shows which concentration levels 
ended up in the final linear range.

For this specific analyte, no concentration level was removed during the 
cleaning step, and all levels fulfilled the CV criterion and were included in 
the preliminary linear range. The two highest concentration levels were 
subsequently removed during the calculation of the final linear range.
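
As a rough illustration of a CV criterion, the sketch below computes the 
coefficient of variation (standard deviation divided by mean, in percent) of 
the replicates within each concentration level. It assumes that "CV" refers to 
this coefficient of variation. The toy data, the threshold of 20 and all object 
names are hypothetical and chosen only for this example; they are not taken 
from the package.

```{r "sketch_cv", message=FALSE}
## Hypothetical toy data: three concentration levels, three replicates each
toy_cv_data <- data.frame(
    Concentration = rep(c(1, 10, 100), each = 3),
    Measurement   = c(900, 1000, 1100,
                      5000, 10000, 15000,
                      99000, 100000, 101000)
)

## CV in percent per concentration level: 100 * sd / mean
toy_cv <- tapply(toy_cv_data$Measurement, toy_cv_data$Concentration,
                 function(x) 100 * sd(x) / mean(x))

## Hypothetical threshold, used only for this illustration
toy_cv_threshold <- 20
toy_pass <- toy_cv <= toy_cv_threshold

print(toy_cv)
print(toy_pass)
```

In this toy example the middle concentration level has widely scattered 
replicates and would therefore fail the hypothetical threshold.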

The second table offers a more detailed view for each individual replicate:

```{r "print_table_obs", message=FALSE}
print(RES$RES[[4]]$result_table_obs)
```

In this table it can be seen, for example, that every single 
replicate within the final linear range also fulfills the response factor 
criterion.


If multiple analytes are analyzed, a further summary table is generated:

```{r "print_table_summary", message=FALSE}
print(RES$summary_tab)
```


This table contains one row per analyte with more detailed information 
on the linear model (coefficients and R squared) and the lower and upper limits 
of quantification. In this case it can be seen that the QExactive leads to the 
worst R squared value overall, while the TSQVantage leads to a wider linear 
range.
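
To show where such summary values can come from, the following sketch fits an 
ordinary, unweighted linear model with base R's `lm()` and collects the 
coefficients, R squared and the bounds of the concentration range in one row. 
This is an assumption-laden illustration: the toy data and all object and 
column names are hypothetical, every toy data point is assumed to lie within 
the linear range, and the model fitted by the package may differ (e.g. in 
weighting).

```{r "sketch_lm_summary", message=FALSE}
## Hypothetical toy data, assumed to lie entirely within the linear range
toy_fit_data <- data.frame(
    Concentration = c(1, 1, 10, 10, 100, 100),
    Measurement   = c(1200, 1300, 10500, 9800, 100900, 99500)
)

## Unweighted linear model: Measurement = intercept + slope * Concentration
toy_fit <- lm(Measurement ~ Concentration, data = toy_fit_data)

## One summary row: coefficients, R squared and limits of the linear range
toy_summary <- data.frame(
    intercept = unname(coef(toy_fit)[1]),
    slope     = unname(coef(toy_fit)[2]),
    r_squared = summary(toy_fit)$r.squared,
    LLOQ      = min(toy_fit_data$Concentration),
    ULOQ      = max(toy_fit_data$Concentration)
)

print(toy_summary)
```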


## Use calibration curve for prediction

The resulting calibration curves can be used to predict the analytes' 
concentrations based on measured intensities. You need to define a vector of 
measured intensities and
pass it to the `predictConcentration` function together with the CalibraCurve 
result object for this particular analyte. Imagine we have measured three new 
samples, with intensities of 1e6, 1e7 and 1e8, respectively.

```{r "prediction", message=FALSE}
## Measured intensities of three new samples
newdata <- c(1e6, 1e7, 1e8)

## Result object for the fourth analyte
CC_res <- RES$RES[[4]]

predictConcentration(CC_res = CC_res, newdata = newdata) 
```

The table shows the predicted concentrations based on the linear model.
Please note the warning message: the predicted concentration for the intensity 
1e8 falls outside of the linear range for this analyte. Therefore, this value 
may be unreliable.
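
The idea behind such a prediction can be sketched by inverting a linear model 
by hand: if intensity = intercept + slope * concentration, then 
concentration = (intensity - intercept) / slope. The coefficients, the linear 
range and all object names below are hypothetical, and the check against the 
linear range is done on the predicted concentration, which is an assumption 
about how such a check could work rather than a description of 
`predictConcentration`.

```{r "sketch_prediction", message=FALSE}
## Hypothetical model coefficients and linear range (in concentration units)
toy_b0 <- 200            # intercept
toy_b1 <- 1000           # slope
toy_range <- c(1, 100)   # lower and upper limit of the linear range

## Hypothetical measured intensities of three new samples
toy_new <- c(5200, 50200, 500200)

## Invert the linear model to obtain concentrations
toy_pred <- (toy_new - toy_b0) / toy_b1

## Flag predictions that fall outside the linear range
toy_in_range <- toy_pred >= toy_range[1] & toy_pred <= toy_range[2]

toy_prediction <- data.frame(
    intensity       = toy_new,
    predicted_conc  = toy_pred,
    in_linear_range = toy_in_range
)
print(toy_prediction)
```

In this toy example the third prediction lies above the upper limit, which 
corresponds to the situation in which a warning is appropriate.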


# Further information

- information on customizing the graphics can be found in vignette 
"Customizing the visualizations of CalibraCurve"
- more detailed information on the parameters and settings can be found 
in the publication `r Citep(bib[["CalibraCurve"]])`


# Comparison with other related packages

There are not many R packages that offer functionality for the analysis of 
targeted mass spectrometry-based data, in particular for calculating and 
visualizing calibration curves. Therefore, `CalibraCurve` complements other 
existing packages in this research area and is a valuable addition to the 
field of targeted proteomics, metabolomics and lipidomics. There are some 
packages with functionality for targeted proteomics data that cover different 
parts of the whole workflow. These may be used in combination with 
`CalibraCurve`, e.g. to handle the raw data, process them or perform further 
statistical analysis on the predicted concentrations (e.g. group comparisons).

- `MSstatsLOBD` estimates the limit of 
detection and limit of blank (LoB) for targeted experiments. In contrast to 
`CalibraCurve`, it does not estimate the lower and upper limit of 
quantification (linear range) and provides only minimal functionality for 
visualizing calibration curves. Response factor plots and the ability to 
predict concentrations are also not provided.
- The package `msqc1` is a data package containing 
targeted proteomics data from the MSQC1 standard samples for benchmarking of 
different machines and methods. Parts of its dilution series data are used as 
an example in this vignette (see section "Example data set"). 
- The package `specL` produces spectral libraries that 
can be used to identify peptides in targeted proteomics data and therefore 
covers the very beginning of the targeted data analysis workflow.
- The package `MSstats` contains a lot of 
functionality for the statistical analysis of proteomics data (targeted and 
untargeted), for example estimating the required sample size for an 
experiment, summarizing data to protein level, visualizations and statistical 
analysis (group comparisons). 
- `mzR` provides parsing for different standard file 
formats used for mass spectrometry data.
- The packages `xcms` and `MSnbase` provide functionality for processing 
and visualizing mass spectrometry-based (proteomics) data.


# R session information

```{r reproduce3, echo=FALSE}
## Session info
library("sessioninfo")
options(width = 120)
session_info()
```


# Bibliography

This vignette was generated using `r Biocpkg("BiocStyle")` 
`r Citep(bib[["BiocStyle"]])` with `r CRANpkg("knitr")` 
`r Citep(bib[["knitr"]])` and `r CRANpkg("rmarkdown")` 
`r Citep(bib[["rmarkdown"]])` running behind the scenes.

Citations made with `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])`.

```{r Biblio, results = "asis", echo = FALSE, warning = FALSE, message = FALSE}
## Print bibliography
PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html"))
```