| Type: | Package |
| Title: | Automated Classification of Mass Spectra |
| Version: | 0.5.0 |
| Maintainer: | Alexandre Godmer <alexandre.godmer@aphp.fr> |
| Description: | Functions to classify mass spectra in known categories and to determine discriminant mass-to-charge values (m/z). Includes easy-to-use preprocessing pipelines for Matrix Assisted Laser Desorption Ionisation - Time Of Flight Mass Spectrometry (MALDI-TOF) mass spectra, methods to select discriminant m/z from labelled libraries, and tools to predict categories (species, phenotypes, etc.) from selected features. Also provides utilities to build design matrices from peak intensities and labels. While this package was developed with the aim of identifying very similar species or phenotypes of bacteria from MALDI-TOF MS, the functions of this package can also be used to classify other categories associated to mass spectra; or from mass spectra obtained with other mass spectrometry techniques. Parallelized processing and optional C++-accelerated functions are available (notably to deal with large datasets) from version 0.5.0. If you use this package in your research, please cite the associated publication (<doi:10.1016/j.eswa.2025.128796>). For a comprehensive guide, additional applications, and detailed examples, see https://github.com/agodmer/MSclassifR_examples. |
| URL: | https://github.com/agodmer/MSclassifR_examples, https://doi.org/10.1016/j.eswa.2025.128796 |
| BugReports: | https://github.com/agodmer/MSclassifR_examples/issues |
| License: | GPL (≥ 3) |
| Encoding: | UTF-8 |
| LazyData: | true |
| Language: | en-US |
| Depends: | R (≥ 4.0), cp4p, caret, statmod, MALDIquant, MALDIrppa |
| Imports: | reshape2, ggplot2, dplyr, stats, limma, car, Rcpp, Matrix, methods |
| Suggests: | doParallel, foreach, ranger, randomForest, mixOmics, VSURF, vita, Boruta, glmnet, e1071, xgboost, nnet, mclust, mltools, metap, MALDIquantForeign, matrixStats, rmarkdown |
| LinkingTo: | Rcpp |
| NeedsCompilation: | yes |
| RoxygenNote: | 7.3.1 |
| Packaged: | 2025-12-09 10:29:48 UTC; alexa |
| Author: | Alexandre Godmer [aut, cre], Quentin Giai Gianetto [aut], Karen Druart [aut] |
| ByteCompile: | true |
| Repository: | CRAN |
| Date/Publication: | 2025-12-09 11:20:01 UTC |
Internal: choose a safe number of workers
Description
Internal: choose a safe number of workers
Usage
.safe_n_workers(n_workers, fallback = 1L)
Metadata of mass spectra corresponding to the bacterial species Citrobacter sp. from The Robert Koch-Institute (RKI) database of microbial MALDI-TOF mass spectra
Description
Metadada of the CitrobacterRKIspectra list of mass spectra.
Usage
data("CitrobacterRKImetadata", package = "MSclassifR")
Format
A data frame with 14 rows (each corresponding to a mass spectrum), and five columns that contain (in order): the strain name, the species name, the spot, a sample number and the name of the strain associated with the spot.
Details
The Robert Koch-Institute (RKI) database of microbial MALDI-TOF mass spectra contains raw mass spectra. Only mass spectra of the Citrobacter bacterial species were collected. Metadata were manually reported from raw data.
Source
The raw data were downloaded from this link : https://zenodo.org/record/163517#.YIkWiNZuJCp. The dataset focuses only on mass spectra from Citrobacter.
References
Lasch, Peter, Stammler, Maren, & Schneider, Andy. (2018). Version 3 (20181130) of the MALDI-TOF Mass Spectrometry Database for Identification and Classification of Highly Pathogenic Microorganisms from the Robert Koch-Institute (RKI) [Data set]. Zenodo.doi:10.5281/zenodo.163517
Mass spectra corresponding to the bacterial species Citrobacter sp. from The Robert Koch-Institute (RKI) database of microbial MALDI-TOF mass spectra
Description
Mass spectra of the CitrobacterRKIspectra dataset.
Usage
data("CitrobacterRKIspectra", package = "MSclassifR")
#####
#Plotting the first mass spectrum
#library("MSclassifR")
#PlotSpectra(SpectralData=CitrobacterRKIspectra[[1]],absx = "ALL", Peaks = NULL,
# Peaks2 = NULL, col_spec = 1, col_peak = 2, shape_peak = 3,
# col_peak2 = 2, shape_peak2 = 2)
Format
A list that contains 14 objects of class S4 corresponding each to a each mass spectrum.
Details
The Robert Koch-Institute (RKI) database of microbial MALDI-TOF mass spectra contains raw mass spectra. Only mass spectra of the Citrobacter bacterial species were collected.
Source
The raw data were downloaded from this link : https://zenodo.org/record/163517#.YIkWiNZuJCp. The dataset focuses only on mass spectra from Citrobacter.
References
Lasch, Peter, Stammler, Maren, & Schneider, Andy. (2018). Version 3 (20181130) of the MALDI-TOF Mass Spectrometry Database for Identification and Classification of Highly Pathogenic Microorganisms from the Robert Koch-Institute (RKI) [Data set]. Zenodo.doi:10.5281/zenodo.163517
Fast supervised classifier with m/z subsetting and optional sampling
Description
Trains a multiclass classifier on a subset of m/z features using cross-validation.
For kind = "rf", it automatically delegates to a ranger-based algorithm
(LogReg_rf_fast) when available for maximum speed and parallelism; otherwise it
uses the caret R package with method = "ranger" as a fast fallback. Other kinds ("linear",
"nnet", "svm", "xgb") are trained via caret with compact grids and optional
parallelization. Features (columns) are selected by matching their numeric
column names to moz. Optional class-balancing (among up/down-sampling or SMOTE) can be applied.
Usage
LogReg(
X,
moz,
Y,
number = 2,
repeats = 2,
Metric = c("Kappa", "Accuracy", "F1", "AdjRankIndex", "MatthewsCorrelation"),
kind = "linear",
Sampling = c("no", "up", "down", "smote"),
ncores = max(1L, parallel::detectCores() - 1L),
num.trees = 500L,
tuneLength = 5L,
seed = 123L
)
Arguments
X |
Numeric matrix or data.frame with samples in rows and features (m/z) in columns. Column names must be numeric (or coercible), e.g., "1234.567" or "mz_1234.567". Non-finite values are set to 0. |
moz |
Numeric vector of m/z values to keep. Only columns of |
Y |
Factor (or coercible) of class labels; length must equal nrow(X). |
number |
Integer; number of CV folds (k). Default 2. |
repeats |
Integer; number of CV repeats. Default 2. |
Metric |
Character; selection metric. One of "Kappa", "Accuracy", "F1", "AdjRankIndex", "MatthewsCorrelation". For non-caret metrics, custom summary functions are used. |
kind |
Character; model type. One of "linear" (multinom), "nnet" (nnet), "rf" (random forest), "svm" (svmLinear2), "xgb" (xgbTree). Default "linear". |
Sampling |
Character; class-balancing strategy. One of "no", "up", "down", "smote". For "smote", the function smote_classif(Y ~ ., data.frame(Y, X)) is used before training. For "up"/"down", caret’s in-fold sampling is used. |
ncores |
Integer; number of CPU cores to use for caret’s parallel backend (doParallel). Default is all but one core. Ignored if doParallel is unavailable. |
num.trees |
Integer; number of trees for random forests (ranger engine). Default 500. Used when kind = "rf" and either the caret "ranger" fallback is used or the caret-free LogReg_rf_fast is available. |
tuneLength |
Integer; size of the hyperparameter search (caret-based models). Default 5 (compact grid). |
seed |
Integer; random seed for reproducibility. Default 123. |
Details
Feature subsetting:
Xis subset to columns whose numeric names matchmoz. This avoids expensive joins/transposes and guarantees stable feature order.Random forests: if the function LogReg_rf_fast is available in the namespace (see its documentation), this function delegates the "rf" case to it for maximum speed and Windows-friendly parallel CV. Otherwise, it uses caret with method = "ranger" (still fast and parallelizable).
Sampling: "smote" is applied once, before training; "up"/"down" are applied in-fold by caret via trainControl(sampling = ...). "no" leaves the data unchanged.
Parallelism: if ncores > 1 and doParallel is installed, a PSOCK cluster is registered for caret. The fast RF engine (LogReg_rf_fast) internally handles fold-level parallelism and ranger threading to avoid oversubscription.
Value
A list with:
train_mod: the fitted model (caret::train object) or, if kind = "rf" and LogReg_rf_fast is available, the structure returned by LogReg_rf_fast (contains the final ranger model and CV details).
boxplot: ggplot object of resampling metric distributions (caret paths) or the boxplot returned by LogReg_rf_fast.
Confusion.Matrix: caret::confusionMatrix on the fitted model (caret paths) or the confusion matrix returned by LogReg_rf_fast.
stats_global: data.frame summarizing per-fold metrics (Metric, Mean, Sd) for caret paths; from LogReg_rf_fast otherwise.
See Also
LogReg_rf_fast, ranger::ranger, caret::train, caret::confusionMatrix
Examples
## Not run:
set.seed(1)
X <- matrix(runif(2000), nrow = 100, ncol = 20)
colnames(X) <- as.character(round(seq(1000, 1190, length.out = 20), 4))
moz <- as.numeric(colnames(X))[seq(1, 20, by = 2)]
Y <- factor(sample(letters[1:3], 100, replace = TRUE))
# Fast RF (delegates to LogReg_rf_fast if available; else caret + ranger)
fit_rf <- LogReg(X, moz, Y, number = 3, repeats = 1, kind = "rf",
Metric = "Kappa", Sampling = "no", ncores = 4,
num.trees = 300, seed = 42)
fit_rf$Confusion.Matrix
# Linear (multinom) with macro F1 metric
fit_lin <- LogReg(X, moz, Y, number = 3, repeats = 1, kind = "linear",
Metric = "F1", Sampling = "no", ncores = 2)
fit_lin$stats_global
## End(Not run)
Fast random-forest classifier with stratified CV and in-fold sampling (ranger, caret-free)
Description
Trains a multiclass random-forest classifier using the ranger algorithm with a compact hyperparameter search and repeated stratified cross-validation. Feature columns are first subset by the provided m/z list (moz). Class balancing (no/up/down/SMOTE) is applied only within training folds to avoid leakage, and again on the full data before fitting the final model. The evaluation across folds can be parallelized in a Windows-safe manner (PSOCK), while avoiding CPU oversubscription by giving each fold worker one ranger thread. Returns the final ranger model, per-fold metrics, a confusion matrix on the full data, and a ggplot boxplot of resampling metrics.
Usage
LogReg_rf_fast(
X,
moz,
Y,
number = 5,
repeats = 1,
Metric = c("Kappa", "Accuracy", "F1", "AdjRankIndex", "MatthewsCorrelation"),
Sampling = c("no", "up", "down", "smote"),
ncores = max(1L, parallel::detectCores() - 1L),
num.trees = 500L,
tuneLength = 5L,
folds_parallel = c("auto", "TRUE", "FALSE"),
seed = 123L,
mtry = NULL,
splitrule = "gini",
sample.fraction = 1,
min.node.size.grid = c(1L, 5L, 10L),
min_node_frac = 1/3
)
Arguments
X |
Numeric matrix or data frame; rows are samples and columns are features (m/z). Column names must be numeric (coercible with as.numeric), representing the feature m/z. Non-finite values are set to 0 internally. |
moz |
Numeric vector of m/z to keep. Only columns of X whose numeric names
match values in |
Y |
Factor (or coercible to factor) of class labels; length must equal nrow(X). |
number |
Integer; number of CV folds (k). Default 5. |
repeats |
Integer; number of CV repeats. Default 1. |
Metric |
Character; CV selection metric. One of "Kappa", "Accuracy", "F1", "AdjRankIndex", "MatthewsCorrelation". The best hyperparameters maximize this metric averaged over folds. |
Sampling |
Character; class-balancing strategy applied within each training fold (and before the final fit on the full data). One of "no", "up", "down", "smote".
|
ncores |
Integer; number of CPU cores to use. Controls both fold-level parallelism and ranger threads when not parallelizing folds. Default is all but one core. |
num.trees |
Integer; number of trees per ranger model. Default 500. |
tuneLength |
Integer; upper bound on the size of the hyperparameter grid.
If the full grid (mtry × min.node.size) is larger, a random subset of size
|
folds_parallel |
Character; "auto", "TRUE", or "FALSE".
|
seed |
Integer; RNG seed for reproducibility. Default 123. |
mtry |
Optional integer; if provided, fixes the number of variables tried at each split. If NULL (default), a small grid around floor(sqrt(p)) is used, where p = number of features. |
splitrule |
Character; ranger split rule (e.g., "gini", "extratrees"). Default "gini". |
sample.fraction |
Numeric in (0, 1]; subsampling fraction per tree in ranger. Default 1. |
min.node.size.grid |
Integer vector; candidate values for ranger’s |
min_node_frac |
Numeric in (0, 1]. Safety cap for ranger’s min.node.size per fold/final fit: the value used is min(requested_min.node.size, floor(min_node_frac * n_train)), with a lower bound of 1. This prevents root-only trees (near-uniform class probabilities) on small training folds (e.g., with SMOTE). Applied inside CV and for the final model. Default: 1/3 (set to 1 to disable capping). |
Details
Feature subsetting: X is subset to columns whose numeric names match
moz. This avoids expensive joins/transposes and guarantees consistent feature order.Cross-validation: folds are stratified by Y and repeated
repeatstimes. Sampling is applied only to training indices in each fold (to prevent leakage) and again before the final fit.Hyperparameter search: a compact grid over mtry (around sqrt(p)) and min.node.size (from
min.node.size.grid), optionally downsampled totuneLength. The best combination maximizes the chosen metric averaged over folds.Parallel strategy: by default ("auto"), the code parallelizes across folds with a PSOCK cluster (Windows-safe) and sets ranger’s num.threads = 1 inside each worker to avoid oversubscription. If you set folds_parallel = "FALSE", folds run sequentially and each ranger fit uses up to
ncoresthreads for strong single-fit parallelism.Metrics:
Accuracy and Cohen’s Kappa computed from the confusion matrix.
F1 is macro-averaged across classes.
AdjRankIndex uses mclust::adjustedRandIndex.
MatthewsCorrelation is the multiclass MCC.
Value
A list with:
train_mod: list with fields
model: the fitted ranger::ranger object (final model on full data)
method: "ranger"
best_params: data.frame with the best hyperparameters found by CV
cv_score: best mean CV score (according to
Metric)metric: the metric name used
boxplot: ggplot object showing the distribution of per-fold metric values
Confusion.Matrix: caret::confusionMatrix for predictions of the final model on the full data
stats_global: data.frame with columns Metric, Mean, Sd summarizing per-fold metrics
resample: data.frame of per-fold metrics (columns: variable, value, fold)
See Also
ranger::ranger, caret::confusionMatrix
Examples
## Not run:
set.seed(1)
X <- matrix(runif(3000), nrow = 100, ncol = 30)
colnames(X) <- as.character(round(seq(1000, 1290, length.out = 30), 4))
moz <- as.numeric(colnames(X))[seq(1, 30, by = 2)] # keep half the m/z
Y <- factor(sample(letters[1:3], 100, replace = TRUE))
fit <- LogReg_rf_fast(
X, moz, Y,
number = 3, repeats = 1,
Metric = "Kappa",
Sampling = "no",
ncores = 4,
num.trees = 300,
tuneLength = 4,
seed = 42
)
fit$train_mod$best_params
fit$Confusion.Matrix
## End(Not run)
Automated classification of mass spectra
Description
This package provides R functions to classify mass spectra in known categories, and to determine discriminant mass-to-charge values. It was developed with the aim of identifying very similar species or phenotypes of bacteria from mass spectra obtained by Matrix Assisted Laser Desorption Ionisation - Time Of Flight Mass Spectrometry (MALDI-TOF MS). However, the different functions of this package can also be used to classify other categories associated to mass spectra; or from mass spectra obtained with other mass spectrometry techniques. It includes easy-to-use functions for pre-processing mass spectra, functions to determine discriminant mass-to-charge values (m/z) from a library of mass spectra corresponding to different categories, and functions to predict the category (species, phenotypes, etc.) associated to a mass spectrum from a list of selected mass-to-charge values.
If you use this package in your research, please cite the associated publication: doi:10.1016/j.eswa.2025.128796.
For a comprehensive guide, additional applications, and detailed examples of using this package, please visit our GitHub repository: here.
Value
No return value. Package description.
Author(s)
Alexandre Godmer, Quentin Giai Gianetto
References
Alexandre Godmer, Yahia Benzerara, Emmanuelle Varon, Nicolas Veziris, Karen Druart, Renaud Mozet, Mariette Matondo, Alexandra Aubry, Quentin Giai Gianetto, MSclassifR: An R package for supervised classification of mass spectra with machine learning methods, Expert Systems with Applications, Volume 294, 2025, 128796, ISSN 0957-4174, doi:10.1016/j.eswa.2025.128796.
Internal C++ helpers (not user-facing)
Description
These are internal compiled routines used by MSclassifR for fast matching and per-feature tests. They are exported only for technical reasons and are not part of the public API.
Detection of peaks in MassSpectrum objects
Description
Detects peaks on a list of MALDIquant MassSpectrum objects, with an optional preliminary averaging step and an optional discrete-bin alignment of detected peaks. Per-spectrum peak detection can be parallelized on Unix-alike systems (Linux/macOS) for very large inputs; on Windows a serial/vectorized path is used. The result is a list of MALDIquant MassPeaks ready for downstream matrix building (e.g., with MALDIquant::intensityMatrix or build_X_from_peaks_fast).
Usage
PeakDetection(
x,
averageMassSpec = TRUE,
labels = NULL,
averageMassSpectraMethod = "median",
SNRdetection = 3,
binPeaks = TRUE,
PeakDetectionMethod = "MAD",
halfWindowSizeDetection = 11,
AlignMethod = "strict",
Tolerance = 0.002,
n_workers = NULL,
verbose = TRUE,
min_parallel_n = 2000L,
chunk_size = 1000L,
...
)
Arguments
x |
List of MALDIquant::MassSpectrum objects (one per sample). These are typically obtained after preprocessing (baseline, smoothing, normalization). |
averageMassSpec |
Logical; if TRUE, average spectra using
MALDIquant::averageMassSpectra before peak detection. If |
labels |
Optional factor/character vector for groupwise averaging (same
semantics as MALDIquant::averageMassSpectra). Ignored if |
averageMassSpectraMethod |
Character, "median" (default) or "mean". Passed
to MALDIquant::averageMassSpectra when |
SNRdetection |
Numeric; signal-to-noise ratio threshold for peak detection (MALDIquant::detectPeaks argument SNR). Default 3. |
binPeaks |
Logical; if TRUE, align detected peaks into discrete bins using
MALDIquant::binPeaks (method/tolerance set by |
PeakDetectionMethod |
Character; MALDIquant::detectPeaks method, e.g., "MAD" (default) or "SuperSmoother". |
halfWindowSizeDetection |
Integer; half window size for local maxima (MALDIquant::detectPeaks argument halfWindowSize). Default 11. |
AlignMethod |
Character; MALDIquant::binPeaks method, "strict" (default) or "relaxed". |
Tolerance |
Numeric; MALDIquant::binPeaks tolerance (units consistent with your m/z axis). Default 0.002. |
n_workers |
Integer or NULL; requested number of parallel workers for the Unix mclapply path. On Windows, this is ignored (serial path). The effective number is sanitized by an internal helper (.safe_n_workers) to avoid oversubscription and R CMD check issues. Default NULL (auto). |
verbose |
Logical; if TRUE, print progress messages. Default TRUE. |
min_parallel_n |
Integer; minimum number of spectra at which the function will attempt Unix parallelization via mclapply. Defaults to 2000. Increase to be more conservative, decrease to parallelize more aggressively. Set to Inf to effectively disable Unix parallelization regardless of n_workers. |
chunk_size |
Integer; number of spectra per chunk/task submitted to mclapply on Unix. Larger chunks reduce scheduling overhead but use more memory per task. Default 1000. |
... |
Reserved for future extensions or pass-through to MALDIquant/MALDIrppa. |
Details
Averaging: if
averageMassSpec = TRUEandlabelsis provided with length(labels) == length(x), MALDIquant::averageMassSpectra performs a groupwise averaging by labels. Otherwise, all spectra are averaged. IfaverageMassSpec = FALSE, the input list is used as-is.Peak detection: peak finding uses MALDIquant::detectPeaks with the given SNR/method/half-window. This is applied per spectrum (serial or parallel).
Discrete-bin alignment: when
binPeaks = TRUE, MALDIquant::binPeaks aligns detected peaks to a shared discrete grid (methodAlignMethod, toleranceTolerance), enabling consistent feature columns across spectra.Parallelization: on Windows, a single serial/vectorized call to MALDIquant::detectPeaks is used (fast enough for small/medium inputs). On Unix-alike systems, when
length(x) >= min_parallel_nandn_workers > 1, the list is split into chunks of sizechunk_sizeand processed with parallel::mclapply using the requested number of workers.Meta-data: if
labelsis provided, label information is appended best-effort to MassPeaks metaData (file/fullName), preserving existing fields where possible.
Value
A list of MALDIquant::MassPeaks objects (one per input or averaged
spectrum). If binPeaks = TRUE, all MassPeaks are aligned to the same
discrete m/z bins (shared centers), facilitating fast matrix construction.
See Also
MALDIquant::averageMassSpectra, MALDIquant::detectPeaks, MALDIquant::binPeaks; MALDIquant::intensityMatrix; build_X_from_peaks_fast for a fast matrix builder from MassPeaks.
Examples
if (requireNamespace("MALDIquant", quietly = TRUE)) {
# Two toy spectra with peaks near 1000, 1500, 2000 Da
mass <- seq(900, 2100, by = 1)
make_spectrum <- function(shift) {
inten <- dnorm(mass, 1000 + shift, 2) * 50 +
dnorm(mass, 1500 - shift, 2) * 80 +
dnorm(mass, 2000 + shift, 2) * 40 +
rnorm(length(mass), 0, 0.2)
MALDIquant::createMassSpectrum(mass = mass, intensity = inten)
}
spectra <- list(make_spectrum(0.3), make_spectrum(-0.3))
# Detect peaks without averaging; align in strict bins
peaks <- PeakDetection(
x = spectra,
averageMassSpec = FALSE,
SNRdetection = 3,
PeakDetectionMethod = "MAD",
binPeaks = TRUE,
AlignMethod = "strict",
Tolerance = 0.5,
verbose = TRUE
)
# Build an intensity matrix (rows = spectra, cols = aligned m/z bins)
X <- MALDIquant::intensityMatrix(peaks)
dim(X)
}
Plot spectral data with optional peak markers
Description
Create a ggplot of a mass spectrum with optional points for detected peaks and optional vertical lines (and points) highlighting user-specified m/z values.
Usage
PlotSpectra(
SpectralData,
absx = "ALL",
Peaks = NULL,
Peaks2 = NULL,
col_spec = 1,
col_peak = 2,
shape_peak = 3,
col_peak2 = 2,
shape_peak2 = 2,
tol = 0
)
Arguments
SpectralData |
An object containing spectrum data with numeric slots
|
absx |
Either the string "ALL" to plot the full m/z range, or a numeric length-2 vector c(min, max) specifying the m/z window to display. |
Peaks |
Optional MassPeaks object (e.g., from MALDIquant) providing
detected peak positions ( |
Peaks2 |
Optional numeric or character vector of m/z values to highlight.
If character/factor, values are coerced to numeric (non-numeric chars removed).
Vertical dashed lines are drawn at these m/z. If |
col_spec |
Colour for the spectrum line. Default: 1 (black). |
col_peak |
Colour for points corresponding to all peaks in |
shape_peak |
Point shape for |
col_peak2 |
Colour for points corresponding to the subset of |
shape_peak2 |
Point shape for the |
tol |
Numeric tolerance (in m/z units) used to match |
Value
A ggplot object representing the spectrum and optional annotations.
Examples
if (requireNamespace("MALDIquant", quietly = TRUE)) {
# Load mass spectra
data("CitrobacterRKIspectra", package = "MSclassifR")
# Plot raw mass spectrum
PlotSpectra(SpectralData = CitrobacterRKIspectra[[1]])
# standard pre-processing of mass spectra
spectra <- SignalProcessing(CitrobacterRKIspectra)
# Plot pre-processed mass spectrum
PlotSpectra(SpectralData=spectra[[1]])
# detection of peaks in pre-processed mass spectra
peaks <- PeakDetection(x = spectra, averageMassSpec=FALSE)
# Plot peaks on pre-processed mass spectrum
PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],col_spec="blue",col_peak="black")
}
Fast class prediction from peak lists using linear regressions
Description
Builds a sample-by-m/z matrix from a list of MALDIquant MassPeaks and predicts
the class of each spectrum by fitting, for each class, a linear regression of
the spectrum’s intensities on the training spectra of that class. The class
minimizing the AIC is selected as the predicted label. In parallel, an F-test
p-value is computed per class to quantify how unlikely the spectrum is to
belong to the training database; the minimum across classes is returned as
p_not_in_DB. The peak-to-m/z matching is done in C++ via
build_X_from_peaks_fast() for speed.
Usage
PredictFastClass(
peaks,
mod_peaks,
Y_mod_peaks,
moz = "ALL",
tolerance = 6,
normalizeFun = TRUE,
noMatch = 0,
chunk_size = 2000L,
ncores = 1L,
verbose = FALSE
)
Arguments
peaks |
List of MALDIquant::MassPeaks objects to classify (one per spectrum).
Each element must expose |
mod_peaks |
Numeric training matrix of dimension n_train x p (rows =
spectra, columns = m/z features) used as regressors per class. Column names
must be m/z values (character) and must include all m/z requested in |
Y_mod_peaks |
Factor of length n_train giving the class labels for rows of
|
moz |
Either "ALL" or a numeric vector of target m/z. If "ALL" (default),
the column names of |
tolerance |
Numeric (Da). A target m/z is matched to the nearest peak only
if the absolute difference is <= |
normalizeFun |
Logical; if TRUE, per-spectrum max normalization is applied after matching (i.e., each row of the new matrix is divided by its maximum). Default TRUE. |
noMatch |
Numeric; intensity value inserted when no peak is matched for a given target m/z. Default 0. |
chunk_size |
Integer; rows per block when building the new matrix from
|
ncores |
Integer; number of cores to use when building the new matrix from
|
verbose |
Logical; print progress messages. Default FALSE. |
Details
Matrix building:
build_X_from_peaks_fast()maps each spectrum inpeaksto the target m/z grid with nearest-within-tolerance matching (C++). IfnormalizeFun = TRUE, each row is divided by its maximum (guarded to avoid division by zero). Spectra with initially no matches are retried with a slightly increased tolerance (internal bumping).Alignment to training: columns of the new matrix must align to
mod_peaks. The function stops if any requested m/z is missing frommod_peaks.Per-class regression: for each class k, it regresses the new spectrum’s intensities on the columns of
mod_peaksbelonging to class k (after removing entries where the new spectrum is non-finite). If the number of training spectra exceeds the number of non-missing points in the spectrum, a random subset of columns (size = length(non-missing) - 1) is used to avoid singular fits. Fitting is done via stats::lm.fit for speed.Selection and scores:
pred_catis the class with smallest AIC across fitted models. For each class, an F-test p-value is computed from the model summary;p_not_in_DBis the minimum across classes (1 if a class model fails).
Value
A data.frame with columns:
name: spectrum name (from MassPeaks metaData fullName/file if available).
p_not_in_DB: minimum F-test p-value across classes (smaller suggests the spectrum matches the training database; larger suggests “not in DB”).
pred_cat: predicted class (label with smallest AIC).
See Also
build_X_from_peaks_fast; MALDIquant::createMassPeaks; stats::lm.fit
Examples
## Not run:
if (requireNamespace("MALDIquant", quietly = TRUE)) {
set.seed(1)
# Create a small training set (mod_peaks) with 2 classes
p <- 6
moz <- as.character(round(seq(1000, 1500, length.out = p), 2))
mod_peaks <- rbind(
matrix(runif(5 * p, 0, 1), nrow = 5, dimnames = list(NULL, moz)),
matrix(runif(5 * p, 0, 1), nrow = 5, dimnames = list(NULL, moz))
)
Y_mod <- factor(rep(c("A", "B"), each = 5))
# Two spectra to classify: generate MassPeaks near moz
mk_peaks <- function(shift = 0) {
MALDIquant::createMassPeaks(
mass = as.numeric(moz) + rnorm(length(moz), shift, 0.2),
intensity = runif(length(moz), 10, 100)
)
}
peaks <- list(mk_peaks(0.1), mk_peaks(-0.1))
res <- PredictFastClass(
peaks = peaks,
mod_peaks = mod_peaks,
Y_mod_peaks = Y_mod,
moz = "ALL",
tolerance = 1,
normalizeFun = TRUE
)
res
}
## End(Not run)
Prediction of the category to which a mass spectrum belongs
Description
Predicts the category (species, phenotype, etc.) of each spectrum in a list
of MALDIquant MassPeaks using one or more trained models (e.g., multinomial
logistic regression from LogReg). Peaks are matched to a given shortlist of
discriminant m/z values (moz) within a tolerance; unmatched positions are
filled with noMatch. If several models are supplied, the function also
produces meta-predictions: per-class Fisher combinations (if 'metap' is
available) and a majority-vote fraction across models.
Usage
PredictLogReg(
peaks,
model,
moz,
tolerance = 6,
toleranceStep = 2,
normalizeFun = TRUE,
noMatch = 0,
Reference = NULL,
chunk_size = 10000L,
ncores = 1L,
verbose = FALSE
)
Arguments
peaks |
a list of MALDIquant::MassPeaks objects (one per spectrum). |
model |
a model or a list of models estimated from a shortlist of m/z (e.g., the output of LogReg). Each model must support predict(..., type = "prob"). If a single model is supplied, it is wrapped into a one-element list. |
moz |
a numeric vector of shortlisted m/z values used for prediction (typically the selection from SelectionVar/SelectionVarStat_fast). |
tolerance |
numeric; accepted m/z tolerance (in Da) for matching peaks to moz. Default 6. |
toleranceStep |
numeric; if a spectrum yields no matches at the initial
tolerance, the function retries by increasing tolerance in steps of
|
normalizeFun |
logical; if TRUE (default), per-spectrum max normalization is applied after matching (row is divided by its maximum). |
noMatch |
numeric; intensity used when no peak matches a given m/z. Default 0. |
Reference |
optional factor of true categories, length equal to length(peaks). If provided and has at least two distinct levels, the function returns, in addition to the predictions, per-model confusion matrices (caret::confusionMatrix). |
chunk_size |
integer; number of spectra per prediction batch (rows of X). Large datasets can be processed in chunks to limit memory. Default 10000. |
ncores |
integer; number of cores used while building X from peaks on the R side (the C++ matching itself is single-threaded here). Default 1. |
verbose |
logical; print progress messages. Default FALSE. |
Details
Matching and normalization: peak-to-moz matching is performed by build_X_from_peaks_fast (C++-backed; nearest-within-tolerance). If no m/z from a spectrum match the shortlist initially, the tolerance is increased by
toleranceStepin a small number of attempts until at least one match is found. IfnormalizeFun = TRUE, each row is divided by its maximum (guarded to avoid divide-by-zero).Multiple models: when several models are supplied, the output contains one set of probabilities per model (with method column identifying it). Two additional rows per spectrum can be appended:
comb_fisher: per-class Fisher combined p-values computed via metap::sumlog (if available).
max_vote: per-class fraction of models casting the top-probability vote for that class.
Models: this function is agnostic of the modeling engine as long as predict(type = "prob") is implemented (e.g., caret multinom/nnet/ranger/xgb, glmnet, etc.).
Value
If Reference is missing (or has < 2 levels), a data.frame with:
name: spectrum name (from MassPeaks metaData fullName/file when available)
method: model identifier (from model$method; suffixed with "_i" if needed)
one column per class with predicted probabilities
pred_max_p: predicted class (argmax of probabilities)
If Reference is provided with at least two levels, a list with:
Prob.results: the predictions data.frame as above
Confusion.Matrix: a list of caret::confusionMatrix objects (one per method)
References
Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(1), 1–26.
Alexandre Godmer, Yahia Benzerara, Emmanuelle Varon, Nicolas Veziris, Karen Druart, Renaud Mozet, Mariette Matondo, Alexandra Aubry, Quentin Giai Gianetto (2025). MSclassifR: An R package for supervised classification of mass spectra with machine learning methods. Expert Systems with Applications, 294, 128796. doi:10.1016/j.eswa.2025.128796
See Also
LogReg; SelectionVar; SelectionVarStat_fast; build_X_from_peaks_fast
Examples
library(MSclassifR)
library(MALDIquant)
## 1) Preprocess and detect peaks
data("CitrobacterRKIspectra", "CitrobacterRKImetadata", package = "MSclassifR")
spectra <- SignalProcessing(CitrobacterRKIspectra)
peaks <- MSclassifR::PeakDetection(x = spectra, averageMassSpec = FALSE)
## 2) Build X and Y (sample-by-peak intensities + labels)
## Option A: if you prefer the helper and a sparse return:
Y <- factor(CitrobacterRKImetadata$Species)
xy <- build_XY_from_peaks(peaks, labels = Y, normalize = "max", sparse = FALSE)
X <- xy$X
Y <- xy$Y
## Option B: via MALDIquant::intensityMatrix (as in the original examples)
##IntMat <- MALDIquant::intensityMatrix(peaks)
##rownames(IntMat) <- paste(CitrobacterRKImetadata$Strain_name_spot)
##IntMat[is.na(IntMat)] <- 0
##IntMat <- t(apply(IntMat, 1, function(x) x / max(x))) # per-spectrum max norm
##X <- t(IntMat) # features in columns
##Y <- factor(CitrobacterRKImetadata$Species)
## 3) Select discriminant m/z with "cvp" method
a <- MSclassifR::SelectionVar(
X, Y,
MethodSelection = "cvp",
MethodValidation = "cv",
PreProcessing = c("center","scale","nzv","corr"),
NumberCV = 2,
Metric = "Kappa"
)
sel_moz <- a$sel_moz
## 4) Train several models on the shortlisted m/z
model_lm <- MSclassifR::LogReg(X = X, moz = sel_moz, Y = Y, number = 2,
repeats = 2, Metric = "Kappa", kind = "linear")
model_nn <- MSclassifR::LogReg(X = X, moz = sel_moz, Y = Y, number = 2,
repeats = 2, Metric = "Kappa", kind = "nnet", Sampling = "up")
model_rf <- MSclassifR::LogReg(X = X, moz = sel_moz, Y = Y, number = 2,
repeats = 2, Metric = "Kappa", kind = "rf", Sampling = "down")
model_svm <- MSclassifR::LogReg(X = X, moz = sel_moz, Y = Y, number = 2,
repeats = 2, Metric = "Kappa", kind = "svm", Sampling = "up")
Models <- list(
model_lm$train_mod,
model_nn$train_mod,
model_rf$train_mod,
model_svm$train_mod
)
## 5) Predict classes for a subset of peaks; 6 Da tolerance for matching
prob_cat <- MSclassifR::PredictLogReg(
peaks = peaks[1:5],
model = Models,
moz = sel_moz,
tolerance = 6,
Reference = Y[1:5]
)
prob_cat
## 6) Meta-classifier strategy (several RF models + SMOTE + Fisher combine)
a2 <- MSclassifR::SelectionVar(X, Y, MethodSelection = "mda", Ntree = 5 * ncol(X))
sel_moz2 <- a2$sel_moz
models2 <- vector("list", 4L)
for (i in seq_along(models2)) {
models2[[i]] <- MSclassifR::LogReg(
X = X, moz = sel_moz2, Y = Y,
number = 5, repeats = 5,
kind = "rf", Metric = "Kappa",
Sampling = "smote"
)$train_mod
}
prob_cat2 <- MSclassifR::PredictLogReg(
peaks = peaks,
model = models2,
moz = sel_moz2,
tolerance = 6,
Reference = Y
)
Variable selection using methods based on random forests and others.
Description
This function performs variable selection (i.e. selection of discriminant mass-to-charge values) using several selection methods (see MethodSelection argument).
Usage
SelectionVar(X,
Y,
MethodSelection = c("RFERF", "RFEGlmnet", "VSURF", "sPLSDA", "mda",
"cvp", "boruta"),
MethodValidation = c("cv", "repeatedcv", "LOOCV"),
PreProcessing = c("center", "scale", "nzv", "corr"),
Metric = c("Kappa", "Accuracy"),
Sampling = c("no", "up","down", "smote"),
NumberCV = NULL,
RepeatsCV = NULL,
Sizes,
Ntree = 1000,
ncores = 2,
threshold = 0.01,
ncomp.max = 10,
nbf=0)
Arguments
X |
a numeric |
Y |
a |
MethodSelection |
a |
MethodValidation |
a |
NumberCV |
a |
RepeatsCV |
a |
PreProcessing |
a |
Metric |
a |
Sampling |
a |
Sizes |
a numeric |
Ntree |
a |
ncores |
a |
ncomp.max |
a |
threshold |
a |
nbf |
a |
Details
The selection of variables can be carried out with two different objectives: either to find a minimum number of variables allowing to obtain the highest possible accuracy (or Kappa coefficient), which involves the possible elimination of variables correlated between them (i.e. not bringing any additional predictive power with respect to some other variables); or to find all the variables in the dataset with a potential predictive power ("discriminant" variables).
The VSURF method attempts to accomplish only the first objective.
The mda and cvp methods attempt to accomplish the second objective, as do the methods available in the SelectionVarStat function of our MSclassifR R package.
The RFERF, RFEGlmnet and sPLSDA methods take as input a number of variables to be selected(Sizes argument), and can therefore be used with both objectives.
Within the framework of the second objective, either the mda or cvp methods can be used to estimate a number of discriminant variables from the importances of variables. The SelectionVarStat function can also be used to estimate this number from distributions of p-values. Of note, be sure that the Ntree argument is high enough to get a robust estimation with the mda or cvp methods.
The "RFEGlmnet" and "RFERF" methods are based on recursive feature elimination and can either optimize the kappa coefficient or the accuracy as metrics when selecting variables.
The "sPLSDA" method selects variables from the ones kept in latent components of the sparse PLS-DA model using an automatic choice of the number of components (when the balanced classification error rate (BER) reaches a plateau - see argument threshold).
The "mda" and "cvp" methods use the distribution of variable importances to estimate the number of discriminant features (mass-to-charge values). Briefly, the distribution of variable importances for useless (not discriminant) features is firstly estimated from negative importance variables by the method proposed in section 2.6 of Janitza et al.(2018). Next, the following mixture model is assumed:
F(x)=\pi\times F_u(x)+(1-\pi)\times F_d(x)
where F is the empirical cumulative distribution of variable importances of all the features, F_u the one of the useless features, F_d the one of the discriminative features, and \pi is the proportion of useless features in the dataset.
From the estimated distribution of useless features, we can estimate quantile values x_q and compute \epsilon_q=min(F(x_q)/q;1) for each quantile q. The minimum of the \epsilon_q corresponds to the estimated proportion of useless features in the dataset, what allows estimating the number of discriminant features by N_d=floor(N\times (1 - \pi)) where N is the total number of features. Next, the N_d features with the highest variable importances are selected.
The "VSURF" and "sPLSDA" methods use the minimum mean out-of-bag (OOB) and balanced classification error rate (BER) metrics respectively.
The "boruta" method selects variables from the Boruta algorithm (see Kursa and Rudnicki (2010)). The maxRuns argument of the Boruta function is fixed to 3*ncol(X) to perform the selection of variables.
For Sampling methods available for unbalanced data: "up" corresponds to the up-sampling method which consists of random sampling (with replacement) so that the minority class is the same size as the majority class; "down" corresponds to the down-sampling method randomly which consists of random sampling (without replacement) of the majority class so that their class frequencies match the minority class; "smote" corresponds to the Synthetic Minority Over sampling Technique (SMOTE) specific algorithm for data augmentation which consist of creates new data from minority class using the K Nearest Neighbor algorithm.
See rfe in the caret R package, VSURF in the VSURF R package, splsda in the mixOmics R package, importance function in the randomForest R package, CVPVI function in the vita R package, and Boruta function in the Boruta R package for more details.
Godmer et al. (2025) presents a comparison of different pipelines using SelectionVar that can help you to optimize your workflow. For a comprehensive guide, additional applications, and detailed examples of using this package, please visit our GitHub repository: here.
Value
A list composed of:
sel_moz |
a |
For the "RFERF" and "RFEGlmnet" methods, it also returns the results of the rfe function of the caret R package.
For the "VSURF" method, it also returns the results of the results of the VSURF function of the VSURF R package.
For the "sPLSDA" method, it also returns the following items:
Raw_data |
a horizontal bar plot and containing the contribution of features on each component. |
selected_variables |
|
For the "mda" and "cvp" methods, it also returns the following items:
nb_to_sel |
a numeric value corresponding to an estimated number of mass-over-chage values where the intensities are significantly different between categories (see details). |
imp_sel |
a vector containing the variable importances for the selected features. |
References
Kuhn, Max. (2012). The caret Package. Journal of Statistical Software. 28.
Genuer, Robin, Jean-Michel Poggi and Christine Tuleau-Malot. VSURF : An R Package for Variable Selection Using Random Forests. R J. 7 (2015): 19.
Friedman J, Hastie T, Tibshirani R (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22.
Kim-Anh Le Cao, Florian Rohart, Ignacio Gonzalez, Sebastien Dejean with key contributors Benoit Gautier, Francois, Bartolo, contributions from Pierre Monget, Jeff Coquery, FangZou Yao and Benoit Liquet. (2016). mixOmics: Omics. Data Integration Project. R package version 6.1.1. https://CRAN.R-project.org/package=mixOmics
Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: synthetic minority over-sampling technique. J. Artif. Int. Res. 16, 1 (January 2002), 321–357.
Branco P, Ribeiro R, Torgo L (2016). “UBL: an R Package for Utility-Based Learning.” CoRR, abs/1604.08079.
Janitza, S., Celik, E., Boulesteix, A. L. (2018). A computationally fast variable importance test for random forests for high-dimensional data. Advances in Data Analysis and Classification, 12, 885-915.
Miron B. Kursa, Witold R. Rudnicki (2010). Feature Selection with the Boruta Package. Journal of Statistical Software, 36(11), p. 1-13.
Alexandre Godmer, Yahia Benzerara, Emmanuelle Varon, Nicolas Veziris, Karen Druart, Renaud Mozet, Mariette Matondo, Alexandra Aubry, Quentin Giai Gianetto, MSclassifR: An R package for supervised classification of mass spectra with machine learning methods, Expert Systems with Applications, Volume 294, 2025, 128796, ISSN 0957-4174, doi:10.1016/j.eswa.2025.128796.
Examples
library("MSclassifR")
library("MALDIquant")
###############################################################################
## 1. Pre-processing of mass spectra
# load mass spectra and their metadata
data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR")
# standard pre-processing of mass spectra
spectra <- MSclassifR::SignalProcessing(CitrobacterRKIspectra)
# detection of peaks in pre-processed mass spectra
peaks <- MSclassifR::PeakDetection(x = spectra, averageMassSpec=FALSE)
# matrix with intensities of peaks arranged in rows (each column is a mass-to-charge value)
IntMat <- MALDIquant::intensityMatrix(peaks)
rownames(IntMat) <- paste(CitrobacterRKImetadata$Strain_name_spot)
# remove missing values in the matrix
IntMat[is.na(IntMat)] <- 0
# normalize peaks according to the maximum intensity value for each mass spectrum
IntMat <- apply(IntMat,1,function(x) x/(max(x)))
# transpose the matrix for statistical analysis
X <- t(IntMat)
# define the known categories of mass spectra for the classification
Y <- factor(CitrobacterRKImetadata$Species)
###############################################################################
## 2. Perform variables selection using SelectionVar with RFE and random forest
# with 5 to 10 variables,
# up sampling method and trained with the Kappa coefficient metric
a <- SelectionVar(X,
Y,
MethodSelection = c("RFERF"),
MethodValidation = c("cv"),
PreProcessing = c("center","scale","nzv","corr"),
NumberCV = 2,
Metric = "Kappa",
Sizes = c(5:10),
Sampling = "up")
# Plotting peaks on the first pre-processed mass spectrum and highlighting the
# discriminant mass-to-charge values with red lines
PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],
Peaks2=a$sel_moz,col_spec="blue",col_peak="black")
###############################################################################
## 3. Perform variables selection using SelectionVar with VSURF
# This function can last a few minutes
b <- SelectionVar(X, Y, MethodSelection = c("VSURF"))
summary(b$result)
###############################################################################
## 4. Perform variables selection using SelectionVar with "mda" or "cvp"
# option 1: Using mean decrease in accuracy
# with no sampling method
c <- SelectionVar(X,Y,MethodSelection="mda",Ntree=10*ncol(X))
# Estimation of the number of peaks to discriminate species
c$nb_to_sel
# Discriminant mass-to-charge values
c$sel_moz
# Plotting peaks on the first pre-processed mass spectrum and highlighting the
# discriminant mass-to-charge values with red lines
PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],
Peaks2=c$sel_moz,col_spec="blue",col_peak="black")
# option 2: Using cross-validated permutation variable importance measures (more "time-consuming")
# with no sampling method
d <- SelectionVar(X,Y,MethodSelection="cvp",NumberCV=2,ncores=2,Ntree=1000)
# Estimation of the number of peaks to discriminate species
d$nb_to_sel
# Discriminant mass-to-charge values
d$sel_moz
# Plotting peaks on the first pre-processed mass spectrum and highlighting the
# discriminant mass-to-charge values with red lines
PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],
Peaks2=d$sel_moz,col_spec="blue",col_peak="black")
# Mass-over charge values found with both methods ("mda" and "cvp")
intersect(c$sel_moz,d$sel_moz)
Fast feature (m/z) selection using multiple hypothesis testing (LIMMA/ANOVA/Kruskal) with optional class balancing (no/up/down/SMOTE)
Description
Runs univariate statistical tests across all features (columns) to identify discriminant m/z values. Supports three tests:
"Limma": moderated F-test via the limma R package on k-1 contrasts between groups
"anova": classical one-way ANOVA (implemented in C++ for speed)
"kruskal": Kruskal–Wallis rank-sum test (implemented in C++ with tie correction)
Usage
SelectionVarStat(
X,
Y,
stat.test = c("Limma", "anova", "kruskal"),
pi0.method = "abh",
fdr = 0.05,
Sampling = c("no", "up", "down", "smote"),
seed = NULL
)
Arguments
X |
Numeric matrix with samples in rows and features (peaks) in columns. If a sparse Matrix is provided, it is coerced to a base R dense matrix. Infinities are set to NA; NAs are allowed. |
Y |
Class labels. A factor (or coercible to factor) of length |
stat.test |
Character string; which univariate test to use. One of
|
pi0.method |
Character; method for |
fdr |
Numeric in (0, 1]; FDR threshold used to select features after
p-value adjustment. Default |
Sampling |
Character string; optional class balancing applied before testing.
One of
|
seed |
Optional integer. If provided, sets the random seed for up/down sampling and SMOTE to ensure reproducibility. |
Details
Optional class balancing can be applied before testing: no sampling,
up-sampling, down-sampling, or SMOTE using the internal smote_classif() function.
P-values are adjusted with the cp4p R package, and features below the FDR
threshold are returned, together with an estimated proportion of nulls, pi0.
Missing values are preserved as NA: for ANOVA/Kruskal, the C++ backends
skip NAs per feature; for Limma, missing values are handled by limma internally.
Non-finite values (Inf/-Inf) are coerced to NA.
Limma design:
model.matrix(~ 0 + Y)is used; a full-rank set of k-1 contrasts vs the first level is constructed withlimma::makeContrasts. The returned p-values are moderated F-test p-values for overall group differences.ANOVA/Kruskal implementations are in C++ (see
anova_cols_cppandkruskal_cols_cpp); they process all features in one pass and skip NAs per feature.Non-finite p-values (if any) are set to 1 before
pi0/adjustment.SMOTE requires at least two observations per class; otherwise the function automatically falls back to up-sampling.
For limma, if residual degrees of freedom are not positive (
nrow(X) - nlevels(Y) <= 0), p-values are set to 1 with a warning.
Value
A list with:
-
nb_to_sel: integer,floor(ncol(X) * (1 - pi0))usingcp4p::estim.pi0. -
n_selected_fdr: integer, number of features with adjusted p-value <fdr. -
sel_moz: character vector of selected feature names (columns ofX) with adjusted p-value <fdr. -
ap: the full object returned bycp4p::adjust.p(contains adjusted p-values).
See Also
limma::lmFit, limma::contrasts.fit, limma::eBayes, limma::makeContrasts; cp4p::estim.pi0, cp4p::adjust.p
Examples
###############################################################################
## 1. Pre-processing of mass spectra
# load mass spectra and their metadata
data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR")
# standard pre-processing of mass spectra
spectra <- MSclassifR::SignalProcessing(CitrobacterRKIspectra)
# detection of peaks in pre-processed mass spectra
peaks <- MSclassifR::PeakDetection(x = spectra, labels = CitrobacterRKImetadata$Strain_name_spot)
# build matrix with intensities of peaks (rows = samples, columns = m/z)
Y <- factor(CitrobacterRKImetadata$Species)
xy <- build_XY_from_peaks(peaks, labels = Y, normalize = "max", sparse = FALSE)
X <- xy$X
Y <- xy$Y
###############################################################################
## 2. Estimate the optimal number of peaks to discriminate the different species
OptiPeaks <- MSclassifR::SelectionVarStat(X,
Y,
stat.test = "Limma",
pi0.method = "abh",
fdr = 0.05,
Sampling = "smote",
seed = 1)
## Estimation of the optimal number of peaks to discriminate species (from the pi0 parameter)
OptiPeaks$nb_to_sel
## discriminant mass-to-charge values estimated using a 5 per cent false discovery rate
OptiPeaks$sel_moz
## p-values and adjusted p-values estimated for all the tested mass-to-charge values
OptiPeaks$ap$adjp
Signal processing for MALDI-TOF spectra (wrapper to SignalProcessingUltra)
Description
Backward-compatible wrapper that delegates to SignalProcessingUltra. Keeps the original argument names/signature so existing code continues to work.
Usage
SignalProcessing(
x,
transformIntensity_method = "sqrt",
smoothing_method = "Wavelet",
removeBaseline_method = "SNIP",
removeBaseline_iterations = 25,
calibrateIntensity_method = "TIC",
alignSpectra_NoiseMethod = "MAD",
alignSpectra_method = "lowess",
alignSpectra_halfWs = 11,
alignSpectra_SN = 3,
tolerance_align = 0.002,
referenceSpectra = NULL,
minFrequency = 0.5,
binPeaks_method = "strict",
keepReferenceSpectra = FALSE,
...
)
Arguments
x |
list of MALDIquant MassSpectrum objects. |
transformIntensity_method |
character, intensity transform (default "log"). |
smoothing_method |
character, smoothing method ("Wavelet" UDWT). |
removeBaseline_method |
character, baseline method ("TopHat" default; "SNIP","ConvexHull" supported). |
removeBaseline_iterations |
integer, SNIP iterations if removeBaseline_method = "SNIP". |
calibrateIntensity_method |
character, intensity calibration ("PQN" default, or "TIC","median"). |
alignSpectra_NoiseMethod |
character, noise estimator for peak finding pre-alignment ("MAD"). |
alignSpectra_method |
character, alignment engine: "cubic" (default), "lowess", or "landmark_cpp". |
alignSpectra_halfWs |
integer, half window size for peak detection. |
alignSpectra_SN |
numeric, SNR for peak detection. |
tolerance_align |
numeric, tolerance for matching anchors to the reference during alignment. Use consistent units across your pipeline (Da by default here). |
referenceSpectra |
optional MALDIquant MassPeaks object to use as alignment reference. |
minFrequency |
numeric, minimum peak frequency to build reference if not provided (default 0.7). |
binPeaks_method |
character, "strict" (default) or "relaxed" for reference peak binning. |
keepReferenceSpectra |
logical, if TRUE and no reference provided, returns list(spectra=..., RefS=...). |
... |
additional arguments passed to SignalProcessingUltra (e.g., n_workers, ref_sample_n). |
Value
A list of processed MassSpectrum objects, or list(spectra, RefS) if keepReferenceSpectra = TRUE.
Optimized signal processing for MALDI-TOF spectra (parallel + optional C++ alignment)
Description
This function performs post-acquisition processing for lists of MALDIquant MassSpectrum objects. It parallelizes per-spectrum steps and can align with either MALDIquant's warping (cubic/lowess) or a fast landmark-based C++ algorithm ("landmark_cpp").
Usage
SignalProcessingUltra(
x,
transformIntensity_method = "log",
smoothing_method = "Wavelet",
removeBaseline_method = "TopHat",
removeBaseline_iterations = 25,
calibrateIntensity_method = "PQN",
alignSpectra_NoiseMethod = "MAD",
alignSpectra_method = c("cubic", "lowess", "landmark_cpp"),
alignSpectra_halfWs = 11,
alignSpectra_SN = 3,
tolerance_align = 0.002,
ppm_align = FALSE,
referenceSpectra = NULL,
minFrequency = 0.7,
binPeaks_method = "strict",
keepReferenceSpectra = FALSE,
n_workers = NULL,
ref_sample_n = NULL,
verbose = TRUE,
...
)
Arguments
x |
list of MALDIquant MassSpectrum objects. |
transformIntensity_method |
character, intensity transform (default "log"). |
smoothing_method |
character, smoothing method ("Wavelet" UDWT). |
removeBaseline_method |
character, baseline method ("TopHat" default; "SNIP","ConvexHull" supported). |
removeBaseline_iterations |
integer, SNIP iterations if removeBaseline_method = "SNIP". |
calibrateIntensity_method |
character, intensity calibration ("PQN" default, or "TIC","median"). |
alignSpectra_NoiseMethod |
character, noise estimator for peak finding pre-alignment ("MAD"). |
alignSpectra_method |
character, alignment engine: "cubic" (default), "lowess", or "landmark_cpp". |
alignSpectra_halfWs |
integer, half window size for peak detection. |
alignSpectra_SN |
numeric, SNR for peak detection. |
tolerance_align |
numeric, tolerance for matching anchors to the reference during alignment. Use consistent units across your pipeline (Da by default here). |
ppm_align |
logical, set TRUE if tolerance_align is in ppm (then interpreted as ppm). |
referenceSpectra |
optional MALDIquant MassPeaks object to use as alignment reference. |
minFrequency |
numeric, minimum peak frequency to build reference if not provided (default 0.7). |
binPeaks_method |
character, "strict" (default) or "relaxed" for reference peak binning. |
keepReferenceSpectra |
logical, if TRUE and no reference provided, returns list(spectra=..., RefS=...). |
n_workers |
integer, number of parallel workers (default: all cores minus one). |
ref_sample_n |
integer or NULL, if set, build the reference from a random subset of this many spectra. |
verbose |
logical, print progress. |
... |
passed to MALDIrppa::wavSmoothing (e.g., n.levels). |
Value
A list of MassSpectrum objects, or a list with $spectra and $RefS if keepReferenceSpectra = TRUE.
Build design matrix X and response Y from peak intensities
Description
Constructs a sample-by-peak design matrix (X) and an outcome vector/factor (Y) from peak-intensity input. Accepts either a numeric matrix/data.frame (rows = samples, columns = peaks) or a list of aligned per-sample peak vectors. Optionally applies per-sample max normalization and can return X as a sparse dgCMatrix for memory efficiency.
Usage
build_XY_from_peaks(
peaks,
labels,
normalize = c("max", "none"),
sparse = FALSE,
name_cols = FALSE,
name_digits = 4
)
Arguments
peaks |
Peak data from which to build the design matrix X. Either:
|
labels |
Outcome/response labels used to build Y. A vector or factor with one entry per sample (length must equal nrow(peaks) for matrix/data.frame input, or length(peaks) for list input). |
normalize |
Normalization to apply to each sample’s peak intensities before constructing X. One of "max" or "none" (matched via match.arg). "max" scales each sample by its maximum non-NA intensity; "none" applies no scaling. |
sparse |
Logical; if TRUE, return X as a sparse Matrix::dgCMatrix. If FALSE, return a base R dense matrix. Default is FALSE. |
name_cols |
Logical; if TRUE, set column names of X from the m/z values returned by MALDIquant::intensityMatrix (formatted as "mz_<mz>"). Default is FALSE. This only applies when peaks is a list of MALDIquant::MassPeaks (or when attr(X, "mass") is available); otherwise it is ignored. Enabling this can add noticeable overhead for very wide matrices. |
name_digits |
Integer scalar; number of decimal digits to use when formatting m/z values into column names if name_cols = TRUE. Default is 4. Must be a non-negative integer. Ignored when name_cols = FALSE or when no m/z vector is available (i.e., for plain matrix/data.frame or generic list inputs). |
Value
A list with:
X: numeric matrix or Matrix::dgCMatrix of dimension n_samples x n_peaks
Y: response vector/factor aligned to rows of X (returned as supplied/coerced by the function)
Examples
data("CitrobacterRKIspectra", "CitrobacterRKImetadata", package = "MSclassifR")
spectra <- SignalProcessing(CitrobacterRKIspectra)
peaks <- MSclassifR::PeakDetection(x = spectra, averageMassSpec = FALSE)
labels <- CitrobacterRKImetadata$Species # adjust to your label column
xy <- build_XY_from_peaks(peaks, labels, normalize = "max", sparse = TRUE)
Build a sample-by-m/z intensity matrix from a list of peaks (fast, C++-backed)
Description
Converts a list of MALDIquant MassPeaks into a numeric matrix X (rows = samples, columns = target m/z), by matching each target m/z to the nearest peak within a tolerance. If requested, per-row max normalization is applied. Spectra that initially produce no matches can be retried with an increased tolerance. Internally uses Rcpp for speed (binary search per m/z).
Usage
build_X_from_peaks_fast(
peaks,
moz,
tolerance = 6,
normalize = TRUE,
noMatch = 0,
bump_if_empty = TRUE,
toleranceStep = 2,
max_bumps = 5L
)
Arguments
peaks |
List of MALDIquant::MassPeaks objects (one per sample). Each element must provide pk@mass (numeric vector of m/z) and pk@intensity (numeric vector of intensities) of the same length. |
moz |
Numeric vector of target m/z values (in Da). Will be sorted and uniqued; the output matrix columns follow this sorted order. |
tolerance |
Numeric scalar (Da). A target m/z is matched to the nearest peak only if the absolute difference is <= tolerance. Default 6. |
normalize |
Logical; if TRUE, per-row max normalization is applied after matching (i.e., each sample is divided by its maximum non-NA intensity). Default TRUE. |
noMatch |
Numeric scalar; intensity value to insert when no peak is matched for a given target m/z. Default 0. |
bump_if_empty |
Logical; if TRUE, any spectrum resulting in an all-noMatch
(or all-zero after normalization) row will be retried by increasing the tolerance
in steps of |
toleranceStep |
Numeric scalar (Da); the increment used when bumping the tolerance for empty rows. Default 2. |
max_bumps |
Integer; maximum number of bumps when retrying empty rows. Default 5. |
Details
Matching: for each target m/z, the nearest peak is chosen if its distance is <=
tolerance; otherwisenoMatchis used. Ties are resolved by nearest distance via binary search.Normalization: when
normalize = TRUE, each row is divided by its maximum non-NA intensity (guarded to avoid division by zero).Empty rows: when
bump_if_empty = TRUE, rows with allnoMatch(or all zeros after normalization) are retried with increased tolerance (bytoleranceStep) up tomax_bumpstimes.Performance: implemented with C++ helpers (map_spectrum_to_moz_cpp and build_X_from_peaks_cpp) for speed on large datasets.
Value
A numeric matrix X of dimension n x p:
n = length(peaks)
p = length(unique(sort(moz))) Column names are the sorted
mozcoerced to character. Values are intensities (possibly normalized) ornoMatchfor unmatched positions.
See Also
MALDIquant::createMassPeaks; internal C++ helpers map_spectrum_to_moz_cpp and build_X_from_peaks_cpp (not user-facing).
Examples
# Minimal example with synthetic MassPeaks
if (requireNamespace("MALDIquant", quietly = TRUE)) {
set.seed(1)
# Two spectra with slightly jittered peaks around 1000, 1500, 2000 Da
mz1 <- c(999.7, 1500.2, 2000.1); int1 <- c(10, 50, 30)
mz2 <- c(1000.3, 1499.8, 2000.4); int2 <- c(12, 60, 28)
p1 <- MALDIquant::createMassPeaks(mass = mz1, intensity = int1)
p2 <- MALDIquant::createMassPeaks(mass = mz2, intensity = int2)
peaks <- list(p1, p2)
# Target m/z grid (unsorted, will be sorted internally)
moz <- c(2000, 1500, 1000)
X <- build_X_from_peaks_fast(peaks, moz, tolerance = 1, normalize = TRUE)
dim(X)
colnames(X)
X
}
# Typical usage in a pipeline:
# spectra <- SignalProcessing(yourSpectra)
# peaks <- MSclassifR::PeakDetection(x = spectra, averageMassSpec = FALSE)
# moz <- c(1000, 1500, 2000) # from selection or prior knowledge
# X <- build_X_from_peaks_fast(peaks, moz, tolerance = 6, normalize = TRUE)
# Then pass X to SelectionVar/SelectionVarStat_fast/LogReg, etc.
Function calculating the distance between two vectors.
Description
This function calculates the distance between two vectors using the specified distance metric. Distance metrics available are "p-norm", "Chebyshev", "Canberra", "Overlap", "HEOM" (Heterogeneous Euclidean-Overlap Metric), "HVDM" (Heterogeneous Value Difference Metric). Used in the fast_find_neighbors function of our package.
Usage
calculate_distance(x, y, nominal_indices, p_code)
Arguments
x |
Vector of numeric and/or categorical values. |
y |
Vector of numeric and/or categorical values. |
nominal_indices |
Vector indicating which positions in
|
p_code |
Numeric code representing the distance metric to use:
|
Details
Different distance metrics handle nominal variables differently:
For pure numeric metrics (p >= 1, p = 0, p = -1), nominal features are ignored
For the Overlap metric (p = -2), only nominal features are considered
For HEOM (p = -3), numeric features use normalized Euclidean distance while nominal features use overlap distance (1 if different, 0 if same)
For HVDM (p = -4), a specialized metric combines normalized differences for numeric features and value difference metric for nominal features
Value
A numeric value representing the distance between the two vectors.
Function joining two tables based not on exact matches
Description
This function joins two tables based on a distance metric of one or more columns. It gives the same results that the distance_left_join function of the fuzzyjoin R package, but its execution time is faster. It is used to match mass spectra to short-listed mass-to-charge values in PredictLogReg and PredictFastClass functions.
Usage
d_left_join (x, y, by = NULL, method = "euclidean", max_dist = 1,
distance_col = NULL)
Arguments
x |
A |
y |
A |
by |
Columns by which to join the two tables. |
method |
Method to use for computing distance, either "euclidean" (default) or "manhattan". |
max_dist |
A |
distance_col |
A |
Value
Returns a data frame that contains the results of joining the x dataset onto the y datasets, where all rows from x are preserved and matching data from y is added according to the specified criteria.
Fast cross-validated permutation variable importance (ranger-based)
Description
Computes cross-validated permutation variable importance (PVI) using the ranger random-forest algorithm. For each CV fold, a ranger model is trained on the training split and permutation importance is computed (OOB) inside ranger in C++. Importances are averaged across folds to obtain a stable CV importance vector. Optionally appends artificial “false” features to estimate the null distribution and pi0, then selects the top (1 - pi0) proportion of features. The evaluation can parallelize across folds (Windows-safe PSOCK) while avoiding CPU oversubscription.
Usage
fast_cvpvi(
X,
Y,
k = 5,
ntree = 500,
nbf = 0,
nthreads = max(1L, parallel::detectCores() - 1L),
folds_parallel = c("auto", "TRUE", "FALSE"),
mtry = NULL,
sample_fraction = 1,
min_node_size = 1L,
seed = 123
)
Arguments
X |
Numeric matrix (n x p); samples in rows, features in columns. Column names should be feature IDs (e.g., m/z). Non-finite values are set to zero internally for modeling. |
Y |
Factor or numeric response of length n. A factor triggers classification; numeric triggers regression. |
k |
Integer; number of cross-validation folds. Default 5. |
ntree |
Integer; number of trees per fold model. Default 500. |
nbf |
Integer (>= 0); number of artificial “false” (noise) features to append to X for estimating the null distribution of importances. Default 0 disables this (the null is then approximated using mirrored negative importances). |
nthreads |
Integer; total threads available. When parallelizing folds,
each fold worker gets one ranger thread to avoid oversubscription; when not
parallelizing folds, ranger uses up to |
folds_parallel |
Character; "auto", "TRUE", or "FALSE".
|
mtry |
Optional integer; variables tried at each split. If NULL, defaults to floor(sqrt(p)) for classification or max(floor(p/3), 1) for regression. |
sample_fraction |
Numeric in (0, 1]; subsampling fraction per tree (speed/ regularization knob). Default 1. |
min_node_size |
Integer; ranger minimum node size. Larger values speed up training and yield smaller trees. Default 1. |
seed |
Integer; RNG seed. Default 123. |
Details
One ranger model is trained per fold (training split). Permutation importance (importance = "permutation") is computed in C++ using OOB. The per-fold importances are averaged to obtain CV importances.
Null and pi0: if
nbf > 0, false peaks are created to get negative importances. For this,nbfnoise features (uniform betweenmin(X)andmax(X)) are appended and negative importances among them help shape the null. Ifnbf = 0, the null is approximated by mirroring negative importances of true features. An estimator of the proportion of useless features over high quantiles yields pi0. If no negative importances occur, pi0 is set to 0 (conservative).Parallelism: with
folds_parallel = "auto"/"TRUE", folds run in parallel using a PSOCK cluster (Windows-safe). Each worker sets ranger num.threads = 1 to avoid oversubscription. With"FALSE", folds are sequential and ranger uses up tonthreadsthreads, which can be faster for small k or very large p.
Value
A list with:
nb_to_sel: integer; number of selected features (floor(p * (1 - pi0))).
sel_moz: character vector of selected feature names (columns of X).
imp_sel: named numeric vector of CV importances for selected features.
fold_varim: matrix (features x folds) of per-fold permutation importances.
cv_varim: matrix (features x 1) of averaged importances across folds.
pi0: estimated proportion of null features.
References
Alexandre Godmer, Yahia Benzerara, Emmanuelle Varon, Nicolas Veziris, Karen Druart, Renaud Mozet, Mariette Matondo, Alexandra Aubry, Quentin Giai Gianetto, MSclassifR: An R package for supervised classification of mass spectra with machine learning methods, Expert Systems with Applications, Volume 294, 2025, 128796, ISSN 0957-4174, doi:10.1016/j.eswa.2025.128796.
See Also
ranger::ranger; for a holdout-based (validation-fold) permutation alternative, see a custom implementation using predict() on permuted features. For a full feature-selection wrapper, see SelectionVar with MethodSelection = "cvp".
Examples
## Not run:
set.seed(1)
n <- 120; p <- 200
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("mz_", seq_len(p))
Y <- factor(sample(letters[1:3], n, replace = TRUE))
if (requireNamespace("ranger", quietly = TRUE)) {
out <- fast_cvpvi(
X, Y,
k = 5,
ntree = 300,
nbf = 50,
nthreads = max(1L, parallel::detectCores() - 1L),
folds_parallel = "auto",
seed = 42
)
head(out$sel_moz)
# CV importances for top features
head(sort(out$cv_varim[,1], decreasing = TRUE))
}
## End(Not run)
Function finding k Nearest Neighbors for each row of a matrix
Description
This function finds the k nearest neighbors for each row in a matrix using the specified distance metric. Distance metrics available are "p-norm", "Chebyshev", "Canberra", "Overlap", "HEOM" (Heterogeneous Euclidean-Overlap Metric), "HVDM" (Heterogeneous Value Difference Metric). See the calculate_distance function of our package for more details on the distances.
Usage
fast_find_neighbors(data, nominal_indices, p_code, k)
Arguments
data |
Matrix where the k nearest neighbors for each row are searched. |
nominal_indices |
Vector of column indices indicating which features are
categorical (nominal) variables. This is crucial for proper distance calculation
as nominal and numeric features require different handling. For example, if columns
2 and 5 contain categorical variables, nominal_indices should be |
p_code |
Numeric code representing the distance metric to use:
|
k |
Number of nearest neighbors to find. |
Value
A matrix where each row contains the indices of the k nearest neighbors for the corresponding example.
See Also
Function generating synthetic examples using SMOTE
Description
This function generates synthetic examples using the SMOTE algorithm.
For each example in the minority class, this function generates synthetic examples by interpolating between the example and its k nearest neighbors. This function is used in the smote_classif function.
Usage
fast_generate_synthetic(dat, k, n, p_code)
Arguments
dat |
A |
k |
Number of nearest neighbors to consider. |
n |
Number of synthetic examples to generate. |
p_code |
Numeric code representing the distance metric to use:
|
Value
A data frame containing the generated synthetic examples.
See Also
Fast MDA-style variable selection using ranger permutation importance
Description
Computes feature importances with a single multiclass (or regression) random
forest using the ranger engine and its C++ permutation importance. The null
distribution of importances is estimated either by appending artificial “false”
(noise) features (when nbf > 0) or by mirroring negative importances (when
nbf = 0). An estimator of the proportion of useless features yields pi0, and the top (1 - pi0) proportion
of true features are selected. This implementation is a fast, OS-agnostic alternative to
repeated/random-forest-based MDA schemes.
Usage
fast_mda(
X,
Y,
ntree = 1000,
nbf = 0,
nthreads = max(1L, parallel::detectCores() - 1L),
mtry = NULL,
sample_fraction = 1,
min_node_size = 1L,
seed = 123
)
Arguments
X |
Numeric matrix (n x p); samples in rows, features in columns. Column names should be feature IDs (e.g., m/z). Non-finite values are set to zero internally for modeling. |
Y |
Factor (classification) or numeric (regression) response of length n.
The default |
ntree |
Integer; number of trees. Default 1000. |
nbf |
Integer (>= 0); number of artificial “false” (noise) features to append to X to estimate the null distribution. Default 0 disables this and uses mirrored negative importances as the null. |
nthreads |
Integer; total number of threads for ranger. Default is max(1, parallel::detectCores() - 1). |
mtry |
Optional integer; variables tried at each split. If NULL (default), computed as floor(sqrt(p)) for classification or max(floor(p/3), 1) for regression. |
sample_fraction |
Numeric in (0, 1]; subsampling fraction per tree (speed/ regularization knob). Default 1. |
min_node_size |
Integer; ranger minimum node size. Larger values speed up training and yield simpler trees. Default 1. |
seed |
Integer; RNG seed for reproducibility. Default 123. |
Details
A single ranger model is fit with importance = "permutation". This computes permutation importance in C++ using OOB (fast and stable).
Null and pi0:
If
nbf > 0,nbffalse features (uniform betweenmin(X)andmax(X)) are appended; negative importances among them help shape the null. An estimator of the proportion of useless features over high quantiles (e.g., 0.75–1) yields pi0 and is adjusted for the number of false features.If
nbf = 0, the null is approximated by mirroring negative importances of true features. If no negative importances occur, pi0 is set to 0 (conservative).
Task: factors in
Ytrigger probability = TRUE; numericYtriggers regression.Robustness: any non-finite importances are set to zero. Selection is performed only among the original (true) features; false features are discarded.
Performance: this is typically 5–20x faster than randomForest-based MDA and fully multithreaded via
nthreads.
Value
A list with:
nb_to_sel: integer; number of selected features (floor(p_true * (1 - pi0))).
sel_moz: character vector of selected feature names (columns of X).
imp_sel: named numeric vector of importances for selected features (true features only).
all_imp: named numeric vector of importances for all true features.
pi0: estimated proportion of null features.
References
Alexandre Godmer, Yahia Benzerara, Emmanuelle Varon, Nicolas Veziris, Karen Druart, Renaud Mozet, Mariette Matondo, Alexandra Aubry, Quentin Giai Gianetto, MSclassifR: An R package for supervised classification of mass spectra with machine learning methods, Expert Systems with Applications, Volume 294, 2025, 128796, ISSN 0957-4174, doi:10.1016/j.eswa.2025.128796.
See Also
ranger::ranger; for cross-validated permutation importance, see fast_cvpvi. For a wrapper that plugs MDA/CVP into broader selection workflows, see SelectionVar (MethodSelection = "mda" or "cvp").
Examples
## Not run:
set.seed(1)
n <- 100; p <- 300
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("mz_", seq_len(p))
Y <- factor(sample(letters[1:3], n, replace = TRUE))
if (requireNamespace("ranger", quietly = TRUE)) {
out <- fast_mda(
X, Y,
ntree = 500,
nbf = 50,
nthreads = max(1L, parallel::detectCores() - 1L),
seed = 42
)
out$nb_to_sel
head(out$sel_moz)
# Top importances
head(sort(out$all_imp, decreasing = TRUE))
}
## End(Not run)
Internal helpers and C++ exports
Description
These functions are internal building blocks used by SignalProcessingUltra:
Details
alignSpectra_landmark_cpp: wrapper that applies landmark matching/warping per spectrum.
cpp_match_anchors: Rcpp function for greedy 1:1 matching between spectrum peaks and reference anchors.
cpp_warp_mz_piecewise: Rcpp function for piecewise-linear warping of a mass axis.
They are not part of the user-facing API.
SMOTE for classification datasets
Description
Generate synthetic examples for minority classes using the SMOTE idea, to balance a classification dataset.
Usage
smote_classif(
formula,
data,
k = 5,
strategy = c("balance", "perc"),
perc = NULL,
metric = c("euclidean", "manhattan", "chebyshev", "canberra", "overlap", "heom",
"hvdm", "pnorm"),
p = 2,
seed = NULL,
C.perc = NULL
)
Arguments
formula |
A model formula target ~ predictors indicating the response and predictors. |
data |
A data.frame containing the variables in the model. |
k |
Integer, number of nearest neighbors used by SMOTE (default 5). |
strategy |
One of "balance" (oversample to the max class size) or "perc" (oversample each class by a percentage). Default "balance". |
perc |
Numeric percentage used when strategy = "perc" (e.g., 100 means generate as many synthetic examples as existing in the class). Ignored for "balance". |
metric |
Distance metric for neighbor search: one of "euclidean", "manhattan", "chebyshev", "canberra", "overlap", "heom", "hvdm", "pnorm". Default "euclidean". |
p |
Numeric p for the p-norm when metric = "pnorm"; also used implicitly for "euclidean" (p=2) and "manhattan" (p=1). Default 2. |
seed |
Optional integer seed for reproducibility. |
C.perc |
Deprecated. Backward-compatibility alias for oversampling control. If character "balance", mapped to strategy = "balance". If a single numeric, mapped to strategy = "perc" and perc = C.perc. Other forms are ignored with a warning. |
Details
The function supports multi-class data. With strategy = "balance" (default), each class is oversampled up to the size of the largest class. With strategy = "perc", each class c is oversampled by round(n_c * perc/100). Neighbors are computed within each class.
Value
A data.frame with synthetic rows appended, same columns and types as input.
Examples
data(iris)
imbal_iris <- iris[c(1:40, 51:100, 101:110), ]
table(imbal_iris$Species)
balanced_iris <- smote_classif(Species ~ ., imbal_iris)
table(balanced_iris$Species)