---
title: "Getting Started"
author: "Torben Kimhofer"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
    number_sections: true
vignette: >
  %\VignetteIndexEntry{Getting Started}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  markdown:
    wrap: 72
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  error = FALSE,
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

This vignette demonstrates the data import and preprocessing workflow for NMR-based metabolomics with **metabom8**.

```{r get-data}
library(metabom8)
```

# Importing spectral data

In routine high-throughput NMR workflows, raw time-domain data (the free-induction decay, FID) are typically processed directly on the spectrometer using the vendor software TopSpin. Downstream metabolomics analysis therefore commonly starts from frequency-domain spectra. Consequently, this vignette focuses on importing spectra generated by TopSpin. For completeness and teaching purposes, the section [Raw FID data](#import-fid) briefly illustrates importing and transforming time-domain data to obtain spectra.

## TopSpin-processed spectra {#import-spectra}

The function `read1d_proc()` imports TopSpin-processed 1D NMR spectra from the `pdata` subdirectory of Bruker experiment folders (typically `pdata/1`). It reads the processed absorption-mode spectrum (e.g. `1r`) together with associated acquisition (`acqus`) and processing (`procs`) parameters.

At minimum, two arguments are required:

- `path`: path to the directory that encloses the NMR experiments
- `exp_type`: list of spectrometer parameters used to filter the desired experiment types based on acquisition and processing metadata

Further details on supported filtering options are provided in `?read1d_proc`.
```{r import, fig.width=7.2, fig.height=4}
## Example: import 1D spectra
exp_dir <- system.file("extdata", package = "metabom8")
exp_type <- list(pulprog = "noesygppr1d")
nmr_data <- read1d_proc(exp_dir, exp_type, n_max = 100)
names(nmr_data)
```

Both import functions (`read1d_proc()` and `read1d_raw()`) return a named list containing three core components used throughout this vignette: the spectral matrix (`X`), the chemical-shift axis (`ppm`), and associated metadata (`meta`).

For interactive exploration, setting `to_global = TRUE` assigns the components `X`, `ppm`, and `meta` directly to the global environment (`.GlobalEnv`). Existing objects with these names will be overwritten.

The row names of `X` and `meta` correspond to experiment directories and can be used to join external sample annotations. The `meta` data frame contains the full set of TopSpin acquisition and processing parameters for each spectrum. Example acquisition and processing parameters as they appear in `meta`:

```{r pars_table, echo=FALSE}
knitr::kable(
  data.frame(
    Prefix = c("a_", "p_"),
    Meaning = c("Acquisition parameters", "Processing parameters"),
    Examples = c(
      "a_PULPROG, a_SFO1, a_RG, a_SW_h, a_TE",
      "p_WDW, p_LB, p_SI, p_PHC*, p_BC*, p_NC_proc"
    )
  ),
  align = "lll"
)
```

See `?read1d_proc` for further details.

## Raw FID data {#import-fid}

For completeness, `metabom8` also provides functionality to import and process raw FIDs. This allows users to reproduce basic spectral processing steps within R, or to explore how different processing parameters influence the resulting spectra.

Processing an FID converts the time-domain signal into a frequency-domain spectrum. Typical processing steps include digital-filter (group-delay) correction, apodisation (windowing), zero-filling, Fourier transformation, and phase correction.
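As a toy illustration, these steps can be sketched in a few lines of base R on a synthetic FID. This is a conceptual sketch only, not the package implementation (and it skips group-delay and phase correction by using magnitude mode):

```{r fid-pipeline-sketch}
n   <- 256
t   <- seq(0, 1, length.out = n)            # acquisition time axis (s)
fid <- exp(-4 * t) * cos(2 * pi * 30 * t)   # synthetic decaying 30 Hz signal

fid_apod <- fid * exp(-0.3 * t)             # apodisation: exponential window
fid_zf   <- c(fid_apod, rep(0, n))          # zero-filling: doubles the number of points
spec     <- Mod(fft(fid_zf))[1:n]           # Fourier transform, magnitude mode

which.max(spec)                             # peak index corresponding to ~30 Hz
```

Zero-filling does not add information but interpolates the spectrum onto a finer frequency grid, which is why the peak position can be read off more precisely after padding.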
The function `read1d_raw()` performs these operations:

```{r read-in-raw, fig.show='hold'}
# import 1D NMR data
nmr_raw <- read1d_raw(
  exp_dir, exp_type,
  apodisation = list(fun = "exponential", lb = 0.2),
  zerofil = 1L,
  mode = "absorption"
)
names(nmr_raw)
```

The following examples illustrate several commonly used apodisation profiles implemented internally in `metabom8`.

```{r apod, eval=TRUE, fig.width=8}
# selected FID apodisation functions
f_apod <- c("sine", "cosine", "exponential", "sem")

# create toy FID
n <- 200
t <- seq(0, 1, len = n)
fid <- exp(-5 * t) * cos(20 * pi * t)

# apply apodisation
A <- sapply(f_apod, \(f) metabom8:::.fidApodisationFct(n, list(fun = f, lb = -0.2)))
Fp <- sweep(A, 1, fid, `*`)
S <- apply(Fp, 2, \(x) Mod(fft(x))[1:(n/2)])

# compare graphically
cols <- 1:ncol(A)
par(mfrow = c(2, 2), mar = c(3, 3, 2, 1))
plot(t, fid, type = "l", lwd = 2, main = "FID")
matplot(t, A, type = "l", lwd = 2, col = cols, main = "Apodisation windows")
matplot(t, Fp, type = "l", lwd = 2, col = cols, main = "Windowed FID")
matplot(S, type = "l", lwd = 2, col = cols, main = "Spectra")
legend("topright", f_apod, col = cols, lty = 1, bty = "n", cex = .8)
```

## Visual inspection

Spectra can be visualised with the function `plot_spec()`. The minimum set of arguments includes the spectral data `X` (array or matrix) and the chemical-shift vector `ppm`.

Three plotting backends are available: `"plotly"` (default), `"base"`, and `"ggplot2"`. The `"plotly"` backend is set up to use WebGL, enabling smooth interactive graphics even when the number of spectra is large. The `"ggplot2"` backend is not interactive and renders much more slowly. However, the returned object integrates with the extensive `ggplot2` ecosystem and can be further customised to generate publication-quality figures. See `?plot_spec` for additional details.
```{r viz_covid, fig.width=7.2, fig.height=4}
data("covid", package = "metabom8")

## plot directly from the dataset object
plot_spec(covid, shift = c(0.5, 2))

## alternatively extract data components: X, ppm
X <- covid$X
ppm <- covid$ppm

dim(X)
head(ppm)

plot_spec(X[1:3, ], ppm, shift = c(0.5, 2)) # backend='plotly' by default
plot_spec(X[1:3, ], ppm, shift = c(0.5, 2), backend = 'base')
plot_spec(X[1:3, ], ppm, shift = c(0.5, 2), backend = 'ggplot2')
```

# Preprocessing

`metabom8` provides a modular preprocessing API for 1D NMR spectra. Each preprocessing function is composable and designed to support both interactive exploration and reproducible pipelines. For a list of preprocessing functions, call `list_preprocessing()` from the R console.

## Pipeline-style usage

When operating on a `metabom8`-style dataset object, preprocessing functions return the same structured object: a named list containing the elements `X`, `ppm`, and `meta`. This allows steps to be chained using the base R pipe operator (`|>`):

```{r preproc_pipe}
data("hiit_raw", package = "metabom8") # urine NMR (HIIT exercise experiment)
names(hiit_raw) # X, ppm, meta

# piped preprocessing
hiit_proc <- hiit_raw |>
  calibrate(type = "tsp") |>
  excise() |>
  correct_baseline(method = 'asls') |>
  align_spectra() |>
  pqn()

dim(hiit_proc$X)
```

This API design enables clear, declarative workflows suitable for production analyses.

```{r viz-raw-hiit, fig.width=7.2, fig.height=4}
plot_spec(hiit_raw, shift = c(4, 4.1))
```

```{r viz-proc-hiit, fig.width=7.2, fig.height=4}
plot_spec(hiit_proc, shift = c(4, 4.1))
```

## Stepwise execution

Each operation can also be applied independently to a spectral matrix and ppm vector. This explicit alternative is useful for pipeline development, parameter tuning and visual inspection.
```{r preproc_stepwise, fig.width=7.2, fig.height=4}
X <- hiit_raw$X
ppm <- hiit_raw$ppm

# perform TSP calibration
X_cal <- calibrate(X, ppm, type = 'tsp')

# excise chemical-shift regions
regions <- list(
  upfield_noise = c(min(ppm), 0.25),
  residual_water = c(4.5, 5.2),
  downfield_noise = c(9.7, max(ppm))
)
X_exi <- excise(X_cal, ppm, regions)
ppm_exi <- attr(X_exi, 'm8_axis')$ppm

# baseline correction
X_bli <- correct_baseline(X_exi)

# plot spectrum
plot_spec(X_bli[1, ], ppm_exi, shift = c(0, 10))
```

## Provenance

Data provenance refers to the documented record of how data were generated, processed, and transformed. Recording this information ensures that analyses remain reproducible and that individual preprocessing steps can be inspected, verified, or repeated at a later stage. In `metabom8`, particular emphasis is placed on transparent preprocessing, as these decisions can substantially influence downstream statistical analyses and biological interpretation.

Each preprocessing operation automatically records its parameters and execution context as structured metadata attached to the spectral matrix `X`. The resulting dataset therefore remains self-describing: the complete processing history is linked to the data matrix and can be inspected programmatically at any time.

```{r preproc_provenance}
## Inspect preprocessing steps
print_provenance(hiit_proc)

# Retrieve a specific step
get_provenance(hiit_proc, step = 2)

## Retrieve the full provenance log
log <- get_provenance(hiit_proc)
# print(log) # very detailed
```

Each entry stores:

* sequential execution index
* preprocessing step name
* parameter settings
* optional notes
* timestamp and versioning info

Individual parameter values can be retrieved as well, allowing the full transformation history of a dataset to be inspected programmatically.
```{r preproc_provenance-indi}
# Retrieve a parameter from a named step
get_provenance(hiit_proc, step = "calibrate", param = "target")
```

Custom notes can also be included in the provenance log. This is particularly useful when data are processed in an automated fashion, for example when using workflow managers:

```{r provenance_note}
# collect environment variables
params <- list(
  agent = paste0("snakemake/", Sys.getenv("SNAKEMAKE_VERSION", "v?")),
  runtime = "docker",
  workflow = Sys.getenv("M8_WORKFLOW", "std_prof-urine"),
  run_id = Sys.getenv("M8_RUN_ID", "m8-2605-001")
)

# create note with environment parameters
hiit_proc <- add_note(hiit_proc, 'Excluding outlier sample XX - reason is ...', params)

# check provenance
get_provenance(hiit_proc, step = "user note")
```

> **Note on provenance:** `metabom8` records preprocessing history as structured attributes on standard R matrices or named lists. This design is lightweight and prioritises interoperability and flexibility over strict container classes (e.g., `SummarizedExperiment`). Provenance is preserved when objects are saved and loaded with standard R serialisation (`save()`/`load()`, `saveRDS()`/`readRDS()`), but may be lost if the object is modified by functions outside the `metabom8` API that drop or overwrite attributes (for example when objects are coerced, reconstructed, or exported to external formats).

# Modelling

This section demonstrates data modelling using PCA (unsupervised) and PLS/OPLS (supervised). All modelling calls return an instance of class `m8_model`, which can be used for further processing and diagnostics.

```{r mva-data}
Y <- covid$an$type
X <- covid$X
```

To prevent the natural dynamic range of metabolite concentrations from dominating statistical modelling, data are scaled prior to analysis.
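Unit-variance (UV) scaling mean-centres each variable and divides it by its standard deviation, so that high- and low-abundance signals contribute with equal variance. As a conceptual sketch (independent of `metabom8`), this corresponds to base R's `scale()`:

```{r uv-concept}
# toy data: 5 samples x 3 variables on very different scales
set.seed(1)
Xt <- cbind(rnorm(5, 1, 0.1), rnorm(5, 10, 1), rnorm(5, 100, 25))

# mean-centre, then divide each column by its standard deviation
X_uv <- scale(Xt, center = TRUE, scale = TRUE)

apply(X_uv, 2, sd) # every column now has standard deviation 1
```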
In **metabom8**, scaling and centering are passed explicitly to the model function:

```{r model_setup}
# Define the scaling and centering strategy
uv <- uv_scaling(center = TRUE)
```

See `?uv_scaling` for further details and additional scaling strategies.

## Principal Components Analysis (PCA)

Unsupervised modelling with PCA uses an iterative procedure (NIPALS) that calculates components one at a time. Therefore, the number of desired components is provided in the function call. Here we compute two principal components using unit-variance scaling:

```{r pca}
pca_model <- pca(X, scaling = uv, ncomp = 2)

# show / summary methods
show(pca_model)
summary(pca_model)
```

## Partial Least Squares (PLS)

Partial Least Squares (PLS), as a supervised modelling technique, utilises statistical re-sampling to determine the optimal number of components and to safeguard against overfitting. In this vignette, we employ balanced Monte Carlo cross-validation (`balanced_mc`). This procedure repeatedly splits the data into training and testing sets to ensure that the model's predictive power generalises to unseen data.

```{r resampling}
mc_cv <- balanced_mc(k = 15, split = 2/3, type = "DA")
```

See `?balanced_mc` for further details and additional statistical validation strategies.

```{r pls}
pls_model <- pls(X, Y, validation_strategy = mc_cv, scaling = uv)

# model information
show(pls_model)

# model performance summary
summary(pls_model)

# model scores & loadings
Tp <- scores(pls_model)
Pp <- loadings(pls_model)
```

## Orthogonal Partial Least Squares (OPLS)

OPLS can be thought of as PLS with an integrated orthogonality filter, which has been reported to improve model interpretability. While standard PLS mixes predictive and non-predictive variation, OPLS partitions the variance into two parts: variation correlated with the response ($Y$) and variation "orthogonal" (unrelated) to it.
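This partition corresponds to the standard OPLS bilinear decomposition, with predictive scores and loadings $T_p$, $P_p$, orthogonal scores and loadings $T_o$, $P_o$, and a residual matrix $E$:

$$
X = T_p P_p^{\top} + T_o P_o^{\top} + E
$$

Only the predictive part is used to model $Y$; the orthogonal part captures structured variation in $X$ that is uncorrelated with the response.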
Similar to PLS, the OPLS algorithm evaluates model performance using statistical re-sampling to ensure that the results are robust and not driven by noise.

```{r opls}
opls_model <- opls(X, Y, validation_strategy = mc_cv, scaling = uv)

# model information & performance
show(opls_model)
```

The OPLS algorithm fitted one predictive and one orthogonal component. A second orthogonal component was evaluated, but the improvement in cross-validated class prediction (`AUCv`) based on the left-out sets was negligible (`stop reason`). Therefore, this third component was not fitted to the full spectral matrix.

```{r opls-summary}
summary(opls_model)
```

For detailed information on the component stopping criteria used in `opls()` and `pls()`, see `?.evalFit`.

Methods implemented for both `pls()` and `opls()` models include `scores()`, `loadings()` and `fitted()` (Y-fitted values from statistical resampling). For OPLS models, the former two functions accept the argument `orth = TRUE` to extract orthogonal-component information. Further OPLS model interpretation can be performed using Variable Importance in Projection (`vip()`), which identifies the most influential variables in the model.

```{r opls-t-p}
# model scores, loadings & VIP
Tp <- scores(opls_model)
To <- scores(opls_model, orth = TRUE)
Pp <- loadings(opls_model)
Po <- loadings(opls_model, orth = TRUE)
vip <- vip(opls_model)
```

To ensure the model is statistically robust, `metabom8` provides several diagnostic tools:

* `dmodx()`: Calculates the distance to the model in X space (DModX) to identify spectral outliers
* `opls_perm()`: Executes Y-permutations to validate that the observed $Q^2$ is significantly higher than what would be expected by chance
* `cv_anova()`: Performs a cross-validated ANOVA to test the significance of the model's predictive ability (regression models only)

```{r opls-dx, fig.width=7.2, fig.height=4}
# distance to the model in X space
dex <- dmodx(opls_model)
head(dex)
```

This vignette demonstrated the core data import, preprocessing, and modelling workflow implemented in `metabom8`. For further details on individual functions, see the corresponding help pages.

# Citation

```{r citation, echo=FALSE, results='asis'}
citation("metabom8")
```

# Session Info

```{r session-info, echo=FALSE}
sessionInfo()
```