
# TrustworthyMLR: Stability and Robustness Evaluation for Machine Learning Models

TrustworthyMLR is an R package designed to help data scientists, machine learning engineers, and researchers evaluate the trustworthiness of their predictive models. In production environments and academic research alike, it is critical to understand not only how well a model performs, but how reliably it performs under varying conditions.
This package provides the following core diagnostics:
| Metric | Purpose | Output |
|---|---|---|
| Stability Index | Measures consistency of predictions across multiple training runs or resamples | 0–1 (1 = perfectly stable) |
| Classification Stability | Measures consistency of predicted class labels, adjusted for chance agreement | 0–1 (1 = perfect agreement) |
| Robustness Score | Measures resilience of predictions under small input perturbations | 0–1 (1 = perfectly robust) |
| Visualizations | Decay curves and stability plots for deep diagnostic insights | plots |
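For intuition, the chance-adjusted agreement behind Classification Stability can be illustrated with a few lines of base R in the spirit of Cohen's kappa. This is a conceptual sketch with made-up labels; it is not TrustworthyMLR's implementation, and the package's exact scaling may differ.

```r
# Conceptual sketch of chance-adjusted label agreement (Cohen's kappa style).
# NOT the package's internal code; illustrative labels only.
labels_run1 <- c("A", "A", "B", "B", "A", "B", "A", "A", "B", "B")
labels_run2 <- c("A", "A", "B", "A", "A", "B", "A", "B", "B", "B")

observed <- mean(labels_run1 == labels_run2)               # raw agreement
tab      <- table(labels_run1, labels_run2)
expected <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # agreement expected by chance
(observed - expected) / (1 - expected)                     # chance-adjusted agreement
```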
Modern ML pipelines often focus exclusively on accuracy metrics (RMSE, AUC, F1). However, a model that achieves high accuracy on one training run but produces substantially different predictions on another is not reliable for deployment. Similarly, a model whose predictions change dramatically with tiny input perturbations is not robust enough for real-world use.
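As a quick simulated illustration of that point, two training runs can look equally accurate against the ground truth while still disagreeing noticeably with each other. The data below are made up for demonstration only.

```r
# Simulated example: similar accuracy per run, imperfect agreement between runs.
set.seed(7)
truth <- rnorm(100)
run1  <- truth + rnorm(100, sd = 0.5)   # predictions from training run 1
run2  <- truth + rnorm(100, sd = 0.5)   # predictions from training run 2

sqrt(mean((truth - run1)^2))  # RMSE of run 1
sqrt(mean((truth - run2)^2))  # RMSE of run 2 (about the same)
cor(run1, run2)               # yet the two runs agree only partially
```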
TrustworthyMLR addresses this gap by providing principled, easy-to-use diagnostics that complement traditional performance metrics, both for validating models before production deployment and for assessing the reliability of results in research.
### Installation

Install the development version from GitHub:
```r
# install.packages("devtools")
devtools::install_github("your-username/TrustworthyMLR")
```

### Stability Index

Evaluate how consistent a model's predictions are across multiple runs:

```r
library(TrustworthyMLR)
# Simulate predictions from 5 independent model runs
set.seed(42)
base_predictions <- rnorm(100)
prediction_matrix <- matrix(
  rep(base_predictions, 5) + rnorm(500, sd = 0.1),
  ncol = 5
)
# Compute stability (1 = perfectly consistent)
stability_index(prediction_matrix)
#> [1] 0.9950...
```

### Robustness Score

Evaluate how sensitive a model's predictions are to small input noise:

```r
# Define a prediction function (e.g., wrapping a trained model)
predict_fn <- function(X) X %*% c(1.5, -0.8, 2.3)
# Generate sample input data
set.seed(42)
X <- matrix(rnorm(300), ncol = 3)
# Compute robustness under 5% Gaussian noise
robustness_score(predict_fn, X, noise_level = 0.05, n_rep = 20)
#> [1] 0.9975...
```
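For intuition only, the sketch below shows one common way such a perturbation-based check can be framed: compare predictions on clean inputs with predictions on slightly noised inputs. It reuses `predict_fn` and `X` from the block above and is an assumption about the general idea, not the formula `robustness_score()` actually uses.

```r
# Conceptual sketch only; robustness_score() may define the score differently.
set.seed(42)
clean <- as.numeric(predict_fn(X))
agreement <- replicate(20, {
  X_noisy <- X + matrix(rnorm(length(X), sd = 0.05), nrow = nrow(X))
  cor(clean, as.numeric(predict_fn(X_noisy)))  # 1 = predictions barely move
})
mean(agreement)  # values near 1 indicate robustness to small perturbations
```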
### Visual Diagnostics
Visualize how model performance decays as noise increases:
```r
plot_robustness(predict_fn, X, main = "Robustness Decay Curve")
```

Visualize prediction variance across observations:

```r
plot_stability(prediction_matrix, main = "Model Prediction Stability")
```
### Real-World Workflow Example
```r
library(TrustworthyMLR)
# Step 1: Train multiple models on bootstrap resamples
set.seed(1)
n <- 200
p <- 5
X <- matrix(rnorm(n * p), ncol = p)
y <- X %*% rnorm(p) + rnorm(n, sd = 0.5)
# Collect predictions from 10 bootstrap resamples
dat <- data.frame(y = as.numeric(y), X)
predictions <- replicate(10, {
  idx <- sample(n, replace = TRUE)
  fit <- lm(y ~ ., data = dat[idx, ])
  predict(fit, newdata = dat)   # predict on the full data for comparability
})
# Step 2: Assess stability
cat("Stability Index:", stability_index(predictions), "\n")
# Step 3: Assess robustness
model <- lm(y ~ X)
pred_fn <- function(newX) {
  as.numeric(cbind(1, newX) %*% coef(model))
}
cat("Robustness Score:", robustness_score(pred_fn, X, noise_level = 0.05), "\n")
### Function Reference

| Function | Description |
|---|---|
| `stability_index()` | Compute the stability of predictions across multiple runs |
| `robustness_score()` | Compute robustness of a model under input perturbations |
### Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.

1. Create a feature branch (`git checkout -b feature/new-metric`)
2. Commit your changes (`git commit -m "Add new metric"`)
3. Push to the branch (`git push origin feature/new-metric`)
4. Open a pull request

### License

MIT © Ali Hamza