---
title: "Getting Started"
author: "Torben Kimhofer"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
    number_sections: true
vignette: >
  %\VignetteIndexEntry{Getting Started}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  markdown:
    wrap: 72
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  error = FALSE,
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

This vignette demonstrates the data import and preprocessing workflow for NMR-based metabolomics with **metabom8**.

```{r get-data}
library(metabom8)
```

# Importing spectral data

In routine high-throughput NMR workflows, raw time-domain data (the free-induction decay, FID) are typically processed directly on the spectrometer using the vendor software TopSpin. Downstream metabolomics analysis therefore commonly starts from frequency-domain spectra. Consequently, this vignette focuses on importing spectra generated by TopSpin. For completeness and teaching purposes, the section [Raw FID data](#import-fid) briefly illustrates importing and transforming time-domain data to obtain spectra.

## TopSpin-processed spectra {#import-spectra}

The function `read1d_proc()` imports TopSpin-processed 1D NMR spectra from the `pdata` subdirectory of Bruker experiment folders (typically `pdata/1`). It reads the processed absorption-mode spectrum (e.g. `1r`) together with associated acquisition (`acqus`) and processing (`procs`) parameters.

At minimum, two arguments are required:

- `path`: path to the directory that encloses the NMR experiments
- `exp_type`: list of spectrometer parameters used to filter the desired experiment types based on acquisition and processing metadata

Further details on supported filtering options are provided in `?read1d_proc`.
```{r import, fig.width=7.2, fig.height=4}
## Example: import 1D spectra
exp_dir <- system.file("extdata", package = "metabom8")
exp_type <- list(pulprog = "noesygppr1d")
nmr_data <- read1d_proc(exp_dir, exp_type, n_max = 100)
names(nmr_data)
```

Both import functions (`read1d_proc()` and `read1d_raw()`) return a named list containing three core components used throughout this vignette: the spectral matrix (`X`), the chemical-shift axis (`ppm`), and associated metadata (`meta`).

For interactive exploration, setting `to_global = TRUE` assigns the components `X`, `ppm`, and `meta` directly to the global environment (`.GlobalEnv`). Existing objects with these names will be overwritten.

The row names of `X` and `meta` correspond to experiment directories and can be used to join external sample annotations. The `meta` data frame contains the full set of TopSpin acquisition and processing parameters for each spectrum. Example acquisition and processing parameters as they appear in `meta`:

```{r pars_table, echo=FALSE}
knitr::kable(
  data.frame(
    Prefix = c("a_", "p_"),
    Meaning = c("Acquisition parameters", "Processing parameters"),
    Examples = c(
      "a_PULPROG, a_SFO1, a_RG, a_SW_h, a_TE",
      "p_WDW, p_LB, p_SI, p_PHC*, p_BC*, p_NC_proc"
    )
  ),
  align = "lll"
)
```

See `?read1d_proc` for further details.

## Raw FID data {#import-fid}

For completeness, `metabom8` also provides functionality to import and process raw FIDs. This allows users to reproduce basic spectral processing steps within R, or to explore how different processing parameters influence the resulting spectra.

Processing an FID converts the time-domain signal into a frequency-domain spectrum. Typical processing steps include digital-filter (group-delay) correction, apodisation (windowing), zero-filling, Fourier transformation, and phase correction.
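As a toy illustration, these steps can be sketched in a few lines of base R on a synthetic FID. This is a conceptual sketch only, not the package implementation (and it skips group-delay and phase correction by using magnitude mode):

```{r fid-pipeline-sketch}
n   <- 256
t   <- seq(0, 1, length.out = n)            # acquisition time axis (s)
fid <- exp(-4 * t) * cos(2 * pi * 30 * t)   # synthetic decaying 30 Hz signal

fid_apod <- fid * exp(-0.3 * t)             # apodisation: exponential window
fid_zf   <- c(fid_apod, rep(0, n))          # zero-filling: doubles the number of points
spec     <- Mod(fft(fid_zf))[1:n]           # Fourier transform, magnitude mode

which.max(spec)                             # peak index corresponding to ~30 Hz
```

Zero-filling does not add information but interpolates the spectrum onto a finer frequency grid, which is why the peak position can be read off more precisely after padding.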
The function `read1d_raw()` performs these operations:

```{r read-in-raw, fig.show='hold'}
# import 1D NMR data
nmr_raw <- read1d_raw(
  exp_dir, exp_type,
  apodisation = list(fun = "exponential", lb = 0.2),
  zerofil = 1L,
  mode = "absorption"
)
names(nmr_raw)
```

The following examples illustrate several commonly used apodisation profiles implemented internally in `metabom8`.

```{r apod, eval=TRUE, fig.width=8}
# selected FID apodisation functions
f_apod <- c("sine", "cosine", "exponential", "sem")

# create toy FID
n <- 200
t <- seq(0, 1, len = n)
fid <- exp(-5 * t) * cos(20 * pi * t)

# apply apodisation
A <- sapply(f_apod, \(f) metabom8:::.fidApodisationFct(n, list(fun = f, lb = -0.2)))
Fp <- sweep(A, 1, fid, `*`)
S <- apply(Fp, 2, \(x) Mod(fft(x))[1:(n/2)])

# compare graphically
cols <- 1:ncol(A)
par(mfrow = c(2, 2), mar = c(3, 3, 2, 1))
plot(t, fid, type = "l", lwd = 2, main = "FID")
matplot(t, A, type = "l", lwd = 2, col = cols, main = "Apodisation windows")
matplot(t, Fp, type = "l", lwd = 2, col = cols, main = "Windowed FID")
matplot(S, type = "l", lwd = 2, col = cols, main = "Spectra")
legend("topright", f_apod, col = cols, lty = 1, bty = "n", cex = .8)
```

## Visual inspection

Spectra can be visualised with the function `plot_spec()`. The minimum set of arguments includes the spectral data `X` (array or matrix) and the chemical-shift vector `ppm`.

Three plotting backends are available: `"plotly"` (default), `"base"`, and `"ggplot2"`. The `"plotly"` backend is set up to use WebGL, enabling smooth interactive graphics even when the number of spectra is large. The `"ggplot2"` backend is not interactive and renders much more slowly. However, the returned object integrates with the extensive `ggplot2` ecosystem and can be further customised to generate publication-quality figures. See `?plot_spec` for additional details.
```{r viz_covid, fig.width=7.2, fig.height=4}
data("covid", package = "metabom8")

## plot directly from the dataset object
plot_spec(covid, shift = c(0.5, 2))

## alternatively extract data components: X, ppm
X <- covid$X
ppm <- covid$ppm

dim(X)
head(ppm)

plot_spec(X[1:3, ], ppm, shift = c(0.5, 2)) # backend='plotly' by default
plot_spec(X[1:3, ], ppm, shift = c(0.5, 2), backend = 'base')
plot_spec(X[1:3, ], ppm, shift = c(0.5, 2), backend = 'ggplot2')
```

# Preprocessing

`metabom8` provides a modular preprocessing API for 1D NMR spectra. Each preprocessing function is composable and designed to support both interactive exploration and reproducible pipelines. For a list of preprocessing functions, call `list_preprocessing()` from the R console.

## Pipeline-style usage

When operating on a `metabom8`-style dataset object, preprocessing functions return the same structured object: a named list containing the elements `X`, `ppm`, and `meta`. This allows steps to be chained using the base R pipe operator (`|>`):

```{r preproc_pipe}
data("hiit_raw", package = "metabom8") # urine NMR (HIIT exercise experiment)
names(hiit_raw) # X, ppm, meta

# piped preprocessing
hiit_proc <- hiit_raw |>
  calibrate(type = "tsp") |>
  excise() |>
  correct_baseline(method = 'asls') |>
  align_spectra() |>
  pqn()

dim(hiit_proc$X)
```

This API design enables clear, declarative workflows suitable for production analyses.

```{r viz-raw-hiit, fig.width=7.2, fig.height=4}
plot_spec(hiit_raw, shift = c(4, 4.1))
```

```{r viz-proc-hiit, fig.width=7.2, fig.height=4}
plot_spec(hiit_proc, shift = c(4, 4.1))
```

## Stepwise execution

Each operation can also be applied independently to a spectral matrix and ppm vector. This explicit alternative is useful for pipeline development, parameter tuning and visual inspection.
```{r preproc_stepwise, fig.width=7.2, fig.height=4}
X <- hiit_raw$X
ppm <- hiit_raw$ppm

# perform TSP calibration
X_cal <- calibrate(X, ppm, type = 'tsp')

# excise chemical-shift regions
regions <- list(
  upfield_noise = c(min(ppm), 0.25),
  residual_water = c(4.5, 5.2),
  downfield_noise = c(9.7, max(ppm))
)
X_exi <- excise(X_cal, ppm, regions)
ppm_exi <- attr(X_exi, 'm8_axis')$ppm

# baseline correction
X_bli <- correct_baseline(X_exi)

# plot spectrum
plot_spec(X_bli[1, ], ppm_exi, shift = c(0, 10))
```

## Provenance

Data provenance refers to the documented record of how data were generated, processed, and transformed. Recording this information ensures that analyses remain reproducible and that individual preprocessing steps can be inspected, verified, or repeated at a later stage. In `metabom8`, particular emphasis is placed on transparent preprocessing, as these decisions can substantially influence downstream statistical analyses and biological interpretation.

Each preprocessing operation automatically records its parameters and execution context as structured metadata attached to the spectral matrix `X`. The resulting dataset therefore remains self-describing: the complete processing history is linked to the data matrix and can be inspected programmatically at any time.

```{r preproc_provenance}
## Inspect preprocessing steps
print_provenance(hiit_proc)

# Retrieve a specific step
get_provenance(hiit_proc, step = 2)

## Retrieve the full provenance log
log <- get_provenance(hiit_proc)
# print(log) # very detailed
```

Each entry stores:

* sequential execution index
* preprocessing step name
* parameter settings
* optional notes
* timestamp and versioning info

Individual parameter values can be retrieved as well, allowing the full transformation history of a dataset to be inspected programmatically.
```{r preproc_provenance-indi}
# Retrieve a parameter from a named step
get_provenance(hiit_proc, step = "calibrate", param = "target")
```

Custom notes can also be included in the provenance log. This is particularly useful when data are processed in an automated fashion, for example when using workflow managers:

```{r provenance_note}
# collect environment variables
params <- list(
  agent = paste0("snakemake/", Sys.getenv("SNAKEMAKE_VERSION", "v?")),
  runtime = "docker",
  workflow = Sys.getenv("M8_WORKFLOW", "std_prof-urine"),
  run_id = Sys.getenv("M8_RUN_ID", "m8-2605-001")
)

# create note with environment parameters
hiit_proc <- add_note(hiit_proc, 'Excluding outlier sample XX - reason is ...', params)

# check provenance
get_provenance(hiit_proc, step = "user note")
```

> **Note on provenance:** `metabom8` records preprocessing history as structured attributes on standard R matrices or named lists. This design is lightweight and prioritises interoperability and flexibility over strict container classes (e.g., `SummarizedExperiment`). Provenance is preserved when objects are saved and loaded with standard R serialisation (`save()`/`load()`, `saveRDS()`/`readRDS()`), but may be lost if the object is modified by functions outside the `metabom8` API that drop or overwrite attributes (for example when objects are coerced, reconstructed, or exported to external formats).

# Modelling

This section demonstrates data modelling using PCA (unsupervised) and PLS/OPLS (supervised). All modelling calls return an instance of class `m8_model`, which can be used for further processing and diagnostics.

```{r mva-data}
Y <- covid$an$type
X <- covid$X
```

To prevent the natural dynamic range of metabolite concentrations from dominating statistical modelling, data are scaled prior to analysis.
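Unit-variance (UV) scaling mean-centres each variable and divides it by its standard deviation, so that high- and low-abundance signals contribute with equal variance. As a conceptual sketch (independent of `metabom8`), this corresponds to base R's `scale()`:

```{r uv-concept}
# toy data: 5 samples x 3 variables on very different scales
set.seed(1)
Xt <- cbind(rnorm(5, 1, 0.1), rnorm(5, 10, 1), rnorm(5, 100, 25))

# mean-centre, then divide each column by its standard deviation
X_uv <- scale(Xt, center = TRUE, scale = TRUE)

apply(X_uv, 2, sd) # every column now has standard deviation 1
```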
In **metabom8**, scaling and centering are passed explicitly to the model function:

```{r model_setup}
# Define the scaling and centering strategy
uv <- uv_scaling(center = TRUE)
```

See `?uv_scaling` for further details and additional scaling strategies.

## Principal Components Analysis (PCA)

Unsupervised modelling with PCA uses an iterative procedure (NIPALS) that calculates components one at a time. Therefore, the number of desired components is provided in the function call. Here we compute two principal components using unit-variance scaling:

```{r pca}
pca_model <- pca(X, scaling = uv, ncomp = 2)

# show / summary methods
show(pca_model)
summary(pca_model)
```

## Partial Least Squares (PLS)

Partial Least Squares (PLS), as a supervised modelling technique, utilises statistical re-sampling to determine the optimal number of components and to safeguard against overfitting. In this vignette, we employ balanced Monte Carlo cross-validation (`balanced_mc`). This procedure repeatedly splits the data into training and testing sets to ensure that the model's predictive power generalises to unseen data.

```{r resampling}
mc_cv <- balanced_mc(k = 15, split = 2/3, type = "DA")
```

See `?balanced_mc` for further details and additional statistical validation strategies.

```{r pls}
pls_model <- pls(X, Y, validation_strategy = mc_cv, scaling = uv)

# model information
show(pls_model)

# model performance summary
summary(pls_model)

# model scores & loadings
Tp <- scores(pls_model)
Pp <- loadings(pls_model)
```

## Orthogonal Partial Least Squares (OPLS)

OPLS can be thought of as PLS with an integrated orthogonality filter, which has been reported to improve model interpretability. While standard PLS mixes predictive and non-predictive variation, OPLS partitions the variance into two parts: variation correlated with the response ($Y$) and variation "orthogonal" (unrelated) to it.
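This partition corresponds to the standard OPLS bilinear decomposition, with predictive scores and loadings $T_p$, $P_p$, orthogonal scores and loadings $T_o$, $P_o$, and a residual matrix $E$:

$$
X = T_p P_p^{\top} + T_o P_o^{\top} + E
$$

Only the predictive part is used to model $Y$; the orthogonal part captures structured variation in $X$ that is uncorrelated with the response.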
Similar to PLS, the OPLS algorithm evaluates model performance using statistical re-sampling to ensure that the results are robust and not driven by noise.

```{r opls}
opls_model <- opls(X, Y, validation_strategy = mc_cv, scaling = uv)

# model information & performance
show(opls_model)
```

The OPLS algorithm fitted one predictive and one orthogonal component. A second orthogonal component was evaluated, but the improvement in cross-validated class prediction (`AUCv`) based on the left-out sets was negligible (`stop reason`). Therefore, this third component was not fitted to the full spectral matrix.

```{r opls-summary}
summary(opls_model)
```

For detailed information on the component stopping criteria used in `opls()` and `pls()`, see `?.evalFit`.

Methods implemented for both `pls()` and `opls()` models include `scores()`, `loadings()` and `fitted()` (Y-fitted values from statistical resampling). For OPLS models, the former two functions accept the argument `orth = TRUE` to extract orthogonal-component information. Further OPLS model interpretation can be performed using Variable Importance in Projection (`vip()`), which identifies the most influential variables in the model.

```{r opls-t-p}
# model scores, loadings & VIP
Tp <- scores(opls_model)
To <- scores(opls_model, orth = TRUE)
Pp <- loadings(opls_model)
Po <- loadings(opls_model, orth = TRUE)
vip <- vip(opls_model)
```

To ensure the model is statistically robust, `metabom8` provides several diagnostic tools:

* `dmodx()`: Calculates the distance to the model in X space (DModX) to identify spectral outliers
* `opls_perm()`: Executes Y-permutations to validate that the observed $Q^2$ is significantly higher than what would be expected by chance
* `cv_anova()`: Performs a cross-validated ANOVA to test the significance of the model's predictive ability (regression models only)

```{r opls-dx, fig.width=7.2, fig.height=4}
# distance to the model in X space
dex <- dmodx(opls_model)
head(dex)
```

This vignette demonstrated the core data import, preprocessing, and modelling workflow implemented in `metabom8`. For further details on individual functions, see the corresponding help pages.

# Citation

```{r citation, echo=FALSE, results='asis'}
citation("metabom8")
```

# Session Info

```{r session-info, echo=FALSE}
sessionInfo()
```