---
title: "Essential concepts and setup"
author:
  - name: Leonardo Ramirez-Lopez
    email: ramirez.lopez.leo@gmail.com
date: today
bibliography: resemble.bib
csl: elsevier-harvard.csl
format:
  html:
    toc: true
    toc-depth: 3
    number-sections: true
    toc-location: left
    code-overflow: wrap
    smooth-scroll: true
    html-math-method: mathjax
vignette: >
  %\VignetteIndexEntry{1 Essential concepts and setup}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{quarto::html}
---

```{r}
#| echo: false
Sys.setenv(OMP_NUM_THREADS = 2)
```

:::: {.columns}

::: {.column width="70%"}
> *Think Globally, Fit Locally* -- [@saul2003think]
:::

::: {.column width="30%"}
:::

::::

# Introduction

Spectroscopic data analysis plays a central role in many environmental,
agricultural, and food-related applications. Techniques such as near-infrared
(NIR), mid-infrared (MIR), and other forms of diffuse reflectance spectroscopy
provide rapid, non-destructive, and cost-efficient measurements that can be
used to infer chemical, physical, or biological properties of complex matrices,
including soils, plant materials, and food products. In quantitative
applications, these measurements are typically linked to reference laboratory
values through empirical calibration models.

As spectral databases grow in size and diversity, their effective use becomes
increasingly challenging. Large spectral libraries often contain substantial
heterogeneity, domain shifts, redundant observations, and samples that are only
locally informative for a given prediction problem. Under these conditions,
global modelling strategies are often insufficient on their own, and methods
based on dimensionality reduction, dissimilarity analysis, neighbour retrieval,
local modelling, and targeted sample selection become essential.

The `resemble` package provides a framework for sample retrieval and local
learning in spectral chemometrics.
It is designed to support the analysis of large and complex spectral datasets
through tools for projection-based representation, dissimilarity computation,
neighbourhood search, memory-based learning, evolutionary subset search, and
retrieval-based modelling with pre-computed local models. The package therefore
supports both classical local modelling workflows and newer strategies for
exploiting spectral libraries as structured resources for predictive modelling.

The functions presented here implement the methods described in
@ramirezlopez2026a, @ramirezlopez2026b, and @ramirez2013spectrum.

The main functionalities of `resemble` include:

- orthogonal projection of spectral data using principal component analysis
  (PCA) and partial least squares (PLS) methods
- computation and evaluation of spectral dissimilarity measures
- nearest-neighbour search in spectral reference sets
- memory-based learning and local regression
- evolutionary sample search for context-specific calibration
- retrieval-based modelling using libraries of localised experts

# Citing the package

To obtain the citation information for the package, run:

```{r}
#| eval: true
citation(package = "resemble")
```

# Dataset used across the vignettes

The vignettes in `resemble` use the soil near-infrared (NIR) spectral dataset
provided in the [`prospectr`](https://CRAN.R-project.org/package=prospectr)
package [@stevens2020introduction]. This dataset is used because soils are
among the most complex matrices analyzed by NIR spectroscopy. It was originally
used in the *Chimiométrie 2006* challenge [@pierna2008soil].

The dataset contains NIR absorbance spectra for 825 dried and sieved soil
samples collected from agricultural fields across the Walloon region of
Belgium. In `R`, the data are stored in a `data.frame` with the following
structure:

* **Response variables**:
    * **`Nt`**: total nitrogen (g/kg dry soil); available for 645 samples and missing for 180.
    * **`Ciso`**: carbon (g/100 g dry soil); available for 732 samples and missing for 93.
    * **`CEC`**: cation exchange capacity (meq/100 g dry soil); available for 447 samples and missing for 378.
* **Predictor variables (`spc`)**: the spectral predictors are stored in the
  matrix `NIRsoil$spc`, embedded within the data frame. These variables contain
  NIR absorbance spectra measured from 1100 to 2498 nm at 2 nm intervals. Each
  column name corresponds to a wavelength value (in nm).
* **Set indicator (`train`)**: a binary variable indicating whether a sample
  belongs to the training set (`1`, 618 samples) or the test set (`0`, 207
  samples).

Load the necessary packages:

```{r}
#| label: libraries
#| message: false
library(resemble)
library(prospectr)
```

The dataset can then be loaded into R and inspected as follows:

```{r}
#| message: false
#| results: hide
data(NIRsoil)
dim(NIRsoil)
str(NIRsoil)
```

# Spectral preprocessing

Throughout the vignettes, the same preprocessing workflow is used to improve
the suitability of the spectra for quantitative analysis. In particular, the
goal is to reduce unwanted baseline variation and enhance local spectral
features that may be informative for modeling. The preprocessing steps are
implemented using the
[`prospectr`](https://CRAN.R-project.org/package=prospectr) package
[@stevens2020introduction].

The following steps are applied:

1. **Detrending** is applied first to reduce broad baseline shifts and
   curvature effects across the spectra.
2. A **first-order Savitzky–Golay derivative** [@Savitzky1964] is then computed
   to emphasize local spectral features and reduce remaining additive effects.
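
To make the second step more concrete, the following base-R sketch shows what a first-order Savitzky–Golay derivative amounts to when the polynomial order is 1 and the window spans 7 points: a least-squares slope fitted within each moving window. It is an illustration only, not the `prospectr` implementation. The helper name `sg_first_derivative()` and the toy signal are hypothetical, and the derivative is expressed per index step rather than per nanometre.

```r
# Illustrative sketch (not prospectr code): first-order Savitzky-Golay
# derivative with a first-order polynomial, written with base R only.
# `sg_first_derivative()` is a hypothetical helper name.
sg_first_derivative <- function(x, window = 7) {
  stopifnot(window %% 2 == 1, window >= 3, length(x) >= window)
  half <- (window - 1) / 2
  k <- -half:half
  # least-squares slope weights for a straight line fitted in each window
  coefs <- k / sum(k^2)
  # stats::filter() performs a convolution, so the weights are reversed
  d <- stats::filter(x, rev(coefs), method = "convolution", sides = 2)
  # the first and last `half` positions cannot be computed and are NA
  as.numeric(d)
}

# toy signal: a straight line with slope 2
toy_signal <- 2 * (1:20) + 5
toy_deriv <- sg_first_derivative(toy_signal, window = 7)
toy_deriv
```

Because a straight line has a constant slope, the interior values of `toy_deriv` all equal 2, while three positions at each end are undefined. A constant offset added to the signal would leave the result unchanged, which is the sense in which the derivative reduces additive effects.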

```{r}
#| label: NIRsoil
#| message: false

# numeric vector of the wavelengths at which the spectra were recorded
wavs <- as.numeric(colnames(NIRsoil$spc))

# preprocessing parameters
diff_order <- 1  # derivative order
poly_order <- 1  # polynomial order of the Savitzky-Golay filter
window <- 7      # filter window size (number of points)

# detrend the spectra, then compute the first-order derivative
NIRsoil$spc_pr <- savitzkyGolay(
  detrend(NIRsoil$spc, wav = wavs),
  m = diff_order,
  p = poly_order,
  w = window
)
```

```{r}
#| label: fig-plotspectra
#| fig-cap: "Raw spectral absorbance data (top) and first derivative of the absorbance spectra (bottom)."
#| fig-align: center
#| fig-width: 7
#| fig-height: 7
#| echo: false

# Hex sticker palette
bg_dark     <- NA         # figure background (no fill)
blue        <- "#3B82F6"  # border / spiral high
amber       <- "#F59E0B"  # spiral low
amber_light <- "#FBBF24"  # text accent
slate       <- "#64748B"  # muted elements

old_par <- par("mfrow", "mar", "bg")
par(mfrow = c(2, 1), mar = c(4, 4, 1, 4), bg = bg_dark)

# wavelengths retained after the Savitzky-Golay filter
new_wavs <- as.numeric(colnames(NIRsoil$spc_pr))
text_col <- "black"

# Plot 1: Raw spectra
plot(range(wavs), range(NIRsoil$spc), col = NA,
     xlab = "", ylab = "Absorbance",
     col.lab = text_col, col.axis = text_col)
rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4],
     col = "#1E293B")
grid(lty = 1, col = "#334155")
matlines(x = wavs, y = t(NIRsoil$spc), lty = 1, col = paste0(blue, "33"))

# Plot 2: First derivative
plot(range(new_wavs), range(NIRsoil$spc_pr), col = NA,
     xlab = "Wavelengths, nm",
     ylab = expression(d(detrended~A)/d*lambda),
     col.lab = text_col, col.axis = text_col)
rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4],
     col = "#1E293B")
grid(lty = 1, col = "#334155")
matlines(x = new_wavs, y = t(NIRsoil$spc_pr), lty = 1,
         col = paste0(amber, "33"))

par(old_par)
```

Both the raw absorbance spectra and the preprocessed spectra are shown in
@fig-plotspectra. The preprocessed spectra, obtained as the first derivative of
detrended absorbance, are used as the predictor variables in all examples
throughout this document.
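
The derivative spectra span slightly fewer wavelengths than the raw spectra. The arithmetic below is a sketch under the assumption that the filter drops `(window_size - 1) / 2` points at each end of every spectrum. The wavelength grid is rebuilt from the range and step stated in the dataset description rather than read from `NIRsoil`, and the variable names are illustrative.

```r
# wavelength grid as described for NIRsoil: 1100 to 2498 nm in 2 nm steps
wavs_raw <- seq(1100, 2498, by = 2)
window_size <- 7

# assumption: (window_size - 1) / 2 points are lost at each end
half <- (window_size - 1) / 2
wavs_kept <- wavs_raw[(half + 1):(length(wavs_raw) - half)]

length(wavs_raw)   # 700
length(wavs_kept)  # 694
range(wavs_kept)   # 1106 2492
```

In practice, the retained wavelengths can be read directly from `colnames(NIRsoil$spc_pr)`, which avoids relying on this assumption.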

For illustration purposes, the `NIRsoil` data are divided into training and
test subsets. In the examples that require a response variable, `Ciso` is used
to demonstrate the functionality of the package.

```{r}
train_x <- NIRsoil$spc_pr[NIRsoil$train == 1, ]
train_y <- NIRsoil$Ciso[NIRsoil$train == 1]

test_x <- NIRsoil$spc_pr[NIRsoil$train == 0, ]
test_y <- NIRsoil$Ciso[NIRsoil$train == 0]
```

The notation used throughout the `resemble` package for arguments referring to
training and test observations is as follows:

* **Training observations**:
    * `Xr` denotes the matrix of predictor variables in the reference/training
      set.
    * `Yr` denotes the response variable(s) in the reference/training set. In
      the context of this package, `Yr` may also be referred to as **side
      information**, that is, variables associated with the training
      observations that can support or guide optimization during modeling, even
      when they are not directly used as model inputs. For example, as shown in
      later sections, `Yr` can be used in principal component analysis to help
      determine the optimal number of components.
* **Test observations**:
    * `Xu` denotes the matrix of predictor variables in the unknown/test set.
    * `Yu` denotes the response variable(s) in the unknown/test set.

# References {-}