Omics technologies have advanced precision medicine by enabling the prediction of clinically relevant outcomes based on individual molecular profiles. However, their diagnostic and prognostic applications remain limited due to variability introduced by different technological platforms, laboratory protocols, and data processing methods. These inconsistencies hinder the generalizability of machine learning (ML) models trained on individual datasets.
Single-sample molecular scoring has emerged as a promising solution to improve interstudy reproducibility. By summarizing molecular features (e.g., gene expression) into pathway or gene set activity scores, these methods provide a more standardized and robust representation of biological processes. This enables the development of ML models that generalize across datasets, even when quantification methods differ.
Several R packages implement molecular scoring methods, such as
decoupleR (Badia-i-Mompel et al.
(2022)) and GSVA (Hänzelmann,
Castelo, and Guinney (2013)). pathMED unifies and
simplifies these approaches, providing a common input format and
allowing users to compute multiple scores efficiently. Additionally,
pathMED includes our novel method, M-scores, which enhances
disease characterization and prediction by leveraging reference datasets
(Toro-Domínguez et al. (2022)).
A key limitation of pathway-based scoring is the heterogeneity within gene sets, where different genes may have opposing effects on biological processes. Direct scoring of such broad gene sets can obscure meaningful patterns. To address this, pathMED incorporates methods to refine gene sets by clustering features with coordinated activity, improving pathway granularity and revealing previously hidden subpathways.
Many ML-based clinical prediction models also suffer from overfitting, unreliable performance metrics, and limited reproducibility in external datasets, restricting their clinical applicability. To overcome these challenges, pathMED provides a unified framework for ML model training, validation, and prediction. While it leverages the caret package, pathMED simplifies its usage and implements nested cross-validation by default to enhance model reliability.
You can install pathMED from the GitHub repository:
Please note that on some operating systems such as Ubuntu there may be installation errors in the packages “metrica”, “factoextra” and “FactoMineR”, as it is necessary to install system dependencies with administrator permissions. These system dependencies can be identified the following code (adapted to Ubuntu 20.04) and installed by running the output in a terminal as administrator:
Gene sets often contain genes associated with a specific biological
function, but their expression can be heterogeneous, sometimes even
exhibiting opposing directions when the function is altered (e.g., in
disease). The dissectDB() function clusters the genes
within each set into subsets based on coordinated expression patterns.
This clustering can be performed using one or more molecular datasets,
which may include control samples or not. The same dataset used in the
following sections can also be used for this purpose. Both preloaded and
custom gene sets can be used (see the next section for details). The
following code splits KEGG pathways into subpathways using the example
dataset:
library(pathMED)
data(pathMEDExampleData)
customKEGG <- dissectDB(list(pathMEDExampleData), geneSets = "kegg")The output is a gene set that can be used with the
getScores() function. In the new gene set, subpathways are
labeled with the suffix _splitx. For example, for the pathway
hsa04714:
## [1] "COX6A2" "COX7A1" "COX15" "GRB2" "HRAS" "MAP3K5"
## $hsa04714.split1
## [1] "NDUFA4" "NDUFS4" "UQCRQ" "NDUFAF3" "NDUFA11" "SMARCD1" "NDUFA6"
## [8] "NDUFB2" "NDUFA3" "COX16"
##
## $hsa04714.split2
## [1] "MGLL" "MLST8" "NDUFA2" "NDUFV1" "NDUFS8" "CPT1B" "NDUFS7"
##
## $hsa04714.split3
## [1] "COX20" "RPS6" "NDUFB6" "NDUFC2" "UQCRC2" "ACTL6A" "UQCRH"
Only the features available in the input dataset are included in this object.
The getScores() function calculates a score for each
gene set and sample. It takes as input a matrix or data frame with
molecular features in rows and samples in columns, along with a gene set
collection. The function supports 20 different scoring methods, which
can be specified using the method parameter. The available
methods and their references are:
Gene sets can be provided as a named list, where each entry contains
the genes associated with a specific gene set, using the
geneSets parameter. Alternatively, preloaded databases can
be used by specifying their name in the geneSets
parameter.
Note that these preloaded gene sets are built with gene symbols. Therefore, the row names of the input data should be gene symbols to use them.
If running the script on a multi-core processor, you can use the
cores parameter to specify the number of workers (most
pathMED functions include this parameter). Additionally,
specific parameters for the scoring functions can be customized (see,
for example, ?GSVA::gsvaParam).
To run this function with the example data:
library(pathMED)
data(pathMEDExampleData)
scoresExample <- getScores(pathMEDExampleData, geneSets = "kegg",
method = "Z-score")
print(scoresExample[1:5, 1:5])## 2f13c769_68 bd9262b7_91 564c1b1a_143 55d3e499_0 2f13c769_109
## hsa00010 0.6615978 0.3752335 -0.8888598 1.87367072 1.1322258
## hsa00020 -3.0246505 -2.4858390 0.9356097 0.75894566 1.3966702
## hsa00030 -2.4434677 1.3570865 -2.6715807 0.06263129 -0.2135047
## hsa00040 -0.3852843 -0.4486775 -0.4217614 0.51468743 0.6125073
## hsa00051 -0.3852843 -0.4486775 -0.4217614 0.51468743 0.6125073
The result is a matrix with gene set identifiers as rows and samples
as columns. To annotate the identifiers with their corresponding
descriptions, you can use the ann2term() function.
## ID term
## 1 hsa00010 Glycolysis / Gluconeogenesis
## 2 hsa00020 Citrate cycle (TCA cycle)
## 3 hsa00030 Pentose phosphate pathway
## 4 hsa00040 Pentose and glucuronate interconversions
## 5 hsa00051 Fructose and mannose metabolism
## 6 hsa00052 Galactose metabolism
The next step in the core pathMED pipeline is to use the
scores calculated in the previous step to train machine learning (ML)
models for predicting categorical or numerical clinical variables. The
trainModel() function fits and tests multiple algorithms,
selecting the one with the best performance for predicting a variable.
The input objects are inputData (the scores matrix obtained
in the previous step, or any other numerical matrix), and
metadata (a data frame containing sample metadata). The
var2predict parameter indicates the column name of metadata
with the variable to be predicted.
The function methodsML() should be used to prepare a
list of ML algorithms to be trained and tested. One or more algorithms
can be selected using the algorithms parameter (see the
function help for details), and the amount of hyperparameters
combinations to be tested is specified with the tuneLength
parameter.
Nested cross-validation is performed using Koutter folds
for the outer cross-validation (CV) and Kinner folds for
the inner CV. The inner k-fold CV can be repeated multiple times
(repeatsCV parameter) for hyperparameter tuning.
In the example dataset, the metadata include a column called
Response, indicating whether patients responded to a
treatment. The positiveClass parameter can be specified to
calculate performance metrics based on the chosen positive class (e.g.,
diseased).
data(pathMEDExampleMetadata)
modelsList <- methodsML(algorithms = c("rf", "knn"), outcomeClass = "character")
set.seed(123)
trainedModel <- trainModel(scoresExample,
metadata = pathMEDExampleMetadata,
var2predict = "Response",
positiveClass = "YES",
models = modelsList,
Koutter = 2,
Kinner = 2,
repeatsCV = 1
)## | | | 0% | |================== | 25% | |=================================== | 50% | |==================================================== | 75% | |======================================================================| 100%
## $model
## Random Forest
##
## 40 samples
## 310 predictors
## 2 classes: 'NO', 'YES'
##
## No pre-processing
## Resampling: None
##
## $stats
## rf knn
## mcc 0.3115628 0.1483202
## balacc 0.6446970 0.5696970
## accuracy 0.6466165 0.5751880
## recall 0.4393939 0.4393939
## specificity 0.8500000 0.7000000
## npv 0.6071429 0.5584416
## precision 0.7285714 0.6000000
## fscore 0.5476190 0.5000000
## perc.lossSamples 0.0000000 0.0000000
##
## $bestTune
## .mtry
## 1 2.5
##
## $subsample.preds
## NO YES obs
## bd9262b7_91 0.472 0.528 NO
## 564c1b1a_143 0.524 0.476 NO
## 55d3e499_0 0.574 0.426 NO
## 564c1b1a_124 0.572 0.428 NO
## 55d3e499_52 0.554 0.446 NO
## 2f13c769_0 0.526 0.474 NO
## 96bf8031_0 0.650 0.350 NO
## 3a42e820_170 0.582 0.418 NO
## 3a42e820_79 0.654 0.346 NO
## 2270bbfa_91 0.702 0.298 NO
## b60c5f9a_430 0.668 0.332 YES
## 3daec0f2_219 0.476 0.524 YES
## 3daec0f2_295 0.320 0.680 YES
## dc81f201_0 0.562 0.438 YES
## a1b702ca_382 0.376 0.624 YES
## 616d027e_0 0.486 0.514 YES
## 3daec0f2_384 0.344 0.656 YES
## b60c5f9a_625 0.562 0.438 YES
## dc81f201_157 0.466 0.534 YES
## 601fe1df_4 0.536 0.464 YES
## b60c5f9a_535 0.610 0.390 YES
## 2f13c769_68 0.524 0.476 NO
## 2f13c769_109 0.398 0.602 NO
## 564c1b1a_52 0.596 0.404 NO
## 3a42e820_435 0.550 0.450 NO
## 49eb5d3c_163 0.656 0.344 NO
## 96bf8031_96 0.680 0.320 NO
## bd9262b7_0 0.586 0.414 NO
## 3a42e820_560 0.494 0.506 NO
## 86a2a8eb_0 0.500 0.500 NO
## 2270bbfa_0 0.748 0.252 NO
## 3daec0f2_126 0.386 0.614 YES
## 3daec0f2_35 0.446 0.554 YES
## a1b702ca_284 0.432 0.568 YES
## 616d027e_672 0.574 0.426 YES
## a1b702ca_27 0.596 0.404 YES
## 601fe1df_0 0.630 0.370 YES
## dc81f201_38 0.524 0.476 YES
## b60c5f9a_89 0.540 0.460 YES
## a1b702ca_0 0.556 0.444 YES
The output object is a list with the following elements:
This object may be used with the predictExternal()
function to make predictions on new data with the same features than the
training data.
data(refData)
scoresExternal <- getScores(refData$dataset1,
geneSets = "kegg",
method = "Z-score"
)
predictions <- predictExternal(scoresExternal, trainedModel)
head(predictions)## Obs Prediction
## 1 GSM1594346 YES
## 2 GSM1594449 NO
## 3 GSM1594572 YES
## 4 GSM1594656 NO
## 5 GSM1594994 YES
## 6 GSM1594605 NO
In addition to previously developed molecular scoring methods, pathMED includes the novel M-score method. M-scores were used in a previous study that used personalized scoring to predict drug response, symptoms and disease progression in Systemic Lupus Erythematosus patients based on gene expression data Toro-Domínguez et al. (2022).
M-scores are calculated by comparing pathway activities between cases
and healthy controls, requiring the availability of control sample data.
To use M-scores with the getScores() function, the
labels parameter must be used to indicate control samples
(using 0 or “Healthy”).
As described previously Toro-Domínguez et al. (2022), M-scores can be used to identify key pathways associated with the studied disease. To achieve this, a reference dataset of patients and healthy controls must be constructed, either from a single dataset or by integrating multiple datasets, following the steps outlined below. This reference will also enable the imputation of M-scores in new cohorts or individual patients where healthy controls are not available.
We will load refData into the environment, which
includes four log2-normalized gene expression datasets and four
corresponding metadata datasets containing clinical information. These
datasets were extracted from NCBI GEO and contain both lupus and healthy
samples, encoded in the metadata as “Healthy_sample” and
“Disease_sample”.
The buildRefObject() function can be used to create a
refObject object with the required structure for the
following steps. The function require:
data) or a single
dataset if only one is available.groupVar).controlGroup), typically
representing healthy samples. If different datasets use different
control labels, a separate value should be provided for each metadata
file.data(refData)
refObject <- buildRefObject(
data = list(
refData$dataset1, refData$dataset2,
refData$dataset3, refData$dataset4
),
metadata = list(
refData$metadata1, refData$metadata2,
refData$metadata3, refData$metadata4
),
groupVar = "group",
controlGroup = "Healthy_sample"
)To create the refObject object, it is recommended to
include as many datasets as possible to improve robustness. Ensure that
all gene expression datasets are log2 transformed and follow a normal
distribution. Additionally, gene names must be annotated as Gene Symbols
in both reference and test datasets.
After creating a refObject object, the next step is to
calculate M-scores for the reference datasets with the
mScores_createReference() function.
To identify the most relevant gene sets for the studied phenotype,
you can use the mScores_filterPaths() function. This
function selects pathways based on two key parameters:
perc_samples, which defines the percentage of samples in
which a pathway must be statistically significant, and
min_datasets, which sets the minimum number of datasets
that must meet this th
These parameters help identify pathways that may be significant in small patient subgroups—a common scenario in heterogeneous diseases—while also ensuring that the selected pathways appear deregulated across multiple studies, reducing the risk of study-specific biases. The significance of each pathway in each sample is determined using its M-score, as described in the original study Toro-Domínguez et al. (2022).
## Selected paths: 168
Once the reference data is prepared and the relevant gene sets are
selected, you can calculate M-scores for datasets without healthy
controls using the mScores_imputeFromReference()
function.
The function requires a matrix containing case samples, gene sets (or
the output of the mScores_filterPaths() function), and the
reference data generated by mScores_createReference().
M-scores are computed using the k most similar samples
from the reference data, with k specified through the
nk parameter. The distance.threshold parameter
defines the maximum allowable Euclidean distance between a sample and
the reference for M-score imputation.
The output is a list containing a matrix of M-scores, where gene sets are in rows and samples in columns, along with a dataframe of calculated distances between each sample and the reference.
mScoresExample <- mScores_imputeFromReference(
inputData = pathMEDExampleData,
geneSets = relevantPaths,
externalReference = refMscores,
distance.threshold = 50
)## Calculating M-Scores by similarity with samples from external reference for 40 patients
## 2f13c769_68 bd9262b7_91 564c1b1a_143 55d3e499_0 2f13c769_109
## LI.M0 4.667796657 0.3831264 -0.7573253 -0.1096919 -1.1079187
## LI.M2.1 0.008238995 -0.2117298 -0.3545561 -0.3799828 -0.3549857
## LI.M4.1 1.559523495 1.3094181 1.2007531 1.2003028 1.6567628
## LI.M4.2 1.722801405 1.2468369 1.1522729 1.1678329 1.5024763
## LI.M4.4 0.496440873 0.5941196 0.8777018 0.9477592 1.2017842
## ID Distance
## 1 2f13c769_68 47.60733
## 2 bd9262b7_91 47.62470
## 3 564c1b1a_143 47.61592
## 4 55d3e499_0 47.69350
## 5 2f13c769_109 47.51437
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] caret_7.0-1 lattice_0.22-9 ggplot2_4.0.2 pathMED_1.3.0
##
## loaded via a namespace (and not attached):
## [1] splines_4.5.2 tibble_3.3.1
## [3] hardhat_1.4.2 graph_1.89.1
## [5] pROC_1.19.0.1 XML_3.99-0.22
## [7] rpart_4.1.24 factoextra_1.0.7
## [9] metrica_2.1.0 lifecycle_1.0.5
## [11] globals_0.19.0 MASS_7.3-65
## [13] flashClust_1.01-2 magrittr_2.0.4
## [15] gam_1.22-7 sass_0.4.10
## [17] rmarkdown_2.30 jquerylib_0.1.4
## [19] yaml_2.3.12 otel_0.2.0
## [21] pbapply_1.7-4 DBI_1.2.3
## [23] buildtools_1.0.0 RColorBrewer_1.1-3
## [25] lubridate_1.9.5 abind_1.4-8
## [27] GenomicRanges_1.63.1 purrr_1.2.1
## [29] BiocGenerics_0.57.0 minerva_1.5.10
## [31] nnet_7.3-20 ipred_0.9-15
## [33] lava_1.8.2 IRanges_2.45.0
## [35] S4Vectors_0.49.0 ggrepel_0.9.6
## [37] irlba_2.3.7 listenv_0.10.0
## [39] maketools_1.3.2 GSVA_2.5.15
## [41] BiocStyle_2.39.0 annotate_1.89.0
## [43] parallelly_1.46.1 codetools_0.2-20
## [45] caretEnsemble_4.0.1 DelayedArray_0.37.0
## [47] DT_0.34.0 energy_1.7-12
## [49] tidyselect_1.2.1 memuse_4.2-3
## [51] farver_2.1.2 ScaledMatrix_1.19.0
## [53] matrixStats_1.5.0 stats4_4.5.2
## [55] Seqinfo_1.1.0 jsonlite_2.0.0
## [57] e1071_1.7-17 survival_3.8-6
## [59] iterators_1.0.14 emmeans_2.0.1
## [61] foreach_1.5.2 tools_4.5.2
## [63] Rcpp_1.1.1 glue_1.8.0
## [65] prodlim_2025.04.28 SparseArray_1.11.10
## [67] xfun_0.56 MatrixGenerics_1.23.0
## [69] dplyr_1.2.0 HDF5Array_1.39.0
## [71] withr_3.0.2 BiocManager_1.30.27
## [73] fastmap_1.2.0 boot_1.3-32
## [75] rhdf5filters_1.23.3 digest_0.6.39
## [77] rsvd_1.0.5 timechange_0.4.0
## [79] R6_2.6.1 estimability_1.5.1
## [81] RSQLite_2.4.6 h5mread_1.3.1
## [83] tidyr_1.3.2 generics_0.1.4
## [85] data.table_1.18.2.1 recipes_1.3.1
## [87] class_7.3-23 httr_1.4.7
## [89] htmlwidgets_1.6.4 S4Arrays_1.11.1
## [91] scatterplot3d_0.3-44 ModelMetrics_1.2.2.2
## [93] pkgconfig_2.0.3 gtable_0.3.6
## [95] timeDate_4052.112 blob_1.3.0
## [97] S7_0.2.1 SingleCellExperiment_1.33.0
## [99] XVector_0.51.0 sys_3.4.3
## [101] htmltools_0.5.9 GSEABase_1.73.0
## [103] multcompView_0.1-10 scales_1.4.0
## [105] Biobase_2.71.0 leaps_3.2
## [107] png_0.1-8 SpatialExperiment_1.21.0
## [109] gower_1.0.2 knitr_1.51
## [111] reshape2_1.4.5 rjson_0.2.23
## [113] nlme_3.1-168 proxy_0.4-29
## [115] cachem_1.1.0 rhdf5_2.55.13
## [117] stringr_1.6.0 parallel_4.5.2
## [119] AnnotationDbi_1.73.0 pillar_1.11.1
## [121] grid_4.5.2 vctrs_0.7.1
## [123] randomForest_4.7-1.2 BiocSingular_1.27.1
## [125] beachmat_2.27.2 ada_2.0-5
## [127] xtable_1.8-4 cluster_2.1.8.2
## [129] evaluate_1.0.5 magick_2.9.0
## [131] mvtnorm_1.3-3 cli_3.6.5
## [133] compiler_4.5.2 rlang_1.1.7
## [135] crayon_1.5.3 future.apply_1.20.1
## [137] plyr_1.8.9 stringi_1.8.7
## [139] BiocParallel_1.45.0 Biostrings_2.79.4
## [141] gsl_2.1-9 Matrix_1.7-4
## [143] patchwork_1.3.2 sparseMatrixStats_1.23.0
## [145] bit64_4.6.0-1 future_1.69.0
## [147] Rhdf5lib_1.33.0 KEGGREST_1.51.1
## [149] FactoMineR_2.13 SummarizedExperiment_1.41.1
## [151] memoise_2.0.1 bslib_0.10.0
## [153] bit_4.6.0 xgboost_3.2.0.1