--- title: "bioLeak: Leakage-Aware Biomedical Modeling" author: "Selçuk Korkmaz" date: "`r Sys.Date()`" output: rmarkdown::html_document: toc: true toc_float: true number_sections: true theme: flatly highlight: tango vignette: > %\VignetteIndexEntry{bioLeak: Leakage-Aware Biomedical Modeling} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", message = FALSE, warning = FALSE, eval = TRUE ) ``` ## Why bioLeak bioLeak is a leakage-aware modeling toolkit for biomedical and machine-learning analyses. Its purpose is to prevent and diagnose information leakage across resampling workflows where training and evaluation data are not truly independent because samples share subjects, batches, studies, or time. Standard workflows are often insufficient. Random, row-wise cross-validation assumes samples are independent. Global preprocessing (imputation, scaling, feature selection) done before resampling lets test-fold information shape the training process. These choices inflate performance and can lead to incorrect biomarker discovery, misleading clinical signals, or models that fail in deployment. Data leakage means any direct or indirect use of evaluation data in training or feature engineering. In biomedical data, leakage commonly appears as: - repeated measurements from the same patient split across folds - batch or study effects that align with the outcome - duplicates or near-duplicates across train and test - time-series lookahead or random splits that use future information - preprocessing that uses all samples instead of training-only statistics bioLeak addresses these issues with leakage-aware splitting, guarded preprocessing that is fit only on training data, and audit diagnostics that surface overlaps, confounding, and duplicates. ## Guided workflow The sections below walk through a leakage-aware workflow from data setup to audits. Each step includes a leaky failure mode and a corrected alternative. ### Example data ```{r example-data} library(bioLeak) set.seed(123) n <- 160 subject <- rep(seq_len(40), each = 4) batch <- sample(paste0("B", 1:6), n, replace = TRUE) study <- sample(paste0("S", 1:4), n, replace = TRUE) time <- seq_len(n) x1 <- rnorm(n) x2 <- rnorm(n) x3 <- rnorm(n) linpred <- 0.7 * x1 - 0.4 * x2 + 0.2 * x3 + rnorm(n, sd = 0.5) p <- stats::plogis(linpred) outcome <- factor(ifelse(runif(n) < p, "case", "control"), levels = c("control", "case")) df <- data.frame( subject = subject, batch = batch, study = study, time = time, outcome = outcome, x1 = x1, x2 = x2, x3 = x3 ) df_leaky <- within(df, { leak_subject <- ave(as.numeric(outcome == "case"), subject, FUN = mean) leak_batch <- ave(as.numeric(outcome == "case"), batch, FUN = mean) leak_global <- mean(as.numeric(outcome == "case")) }) df_time <- df df_time$leak_future <- c(as.numeric(df_time$outcome == "case")[-1], 0) predictors <- c("x1", "x2", "x3") # Example data (first 6 rows) head(df) # Outcome class counts as.data.frame(table(df$outcome)) ``` The table preview displays the metadata columns (`subject`, `batch`, `study`, `time`), the binary outcome, and three numeric predictors (`x1`, `x2`, `x3`). The class count table shows the baseline prevalence. Stratified splits try to preserve these proportions (for grouped modes, stratification is applied at the group level). ### Create leakage-aware splits with `make_split_plan()` `make_split_plan()` is the foundation. 
It returns a `LeakSplits` object with explicit train/test indices (or compact fold assignments) and metadata. For data.frame inputs, a unique `row_id` column is added automatically, so `group = "row_id"` is the explicit way to request sample-wise CV. It assumes that the grouping columns you provide are complete and that samples sharing a group must not cross folds. Grouped stratification uses the majority class per group and is ignored for `study_loocv`, `time_series`, and survival outcomes. Misuse to avoid: - using `group = "row_id"` (sample-wise CV) when subjects repeat - using the wrong grouping column (for example batch instead of subject) - using time-series CV with unsorted or missing time values - relying on stratification when the outcome is missing, single-class at the group level, or a time/event outcome - assuming `time_series` will always return exactly `v` folds **Leaky example: row-wise CV when subjects repeat** ```{r leaky-splits} leaky_splits <- make_split_plan( df, outcome = "outcome", mode = "subject_grouped", group = "row_id", v = 5, repeats = 1, stratify = TRUE ) cat("Leaky splits summary (sample-wise CV):\n") leaky_splits ``` The printed `LeakSplits` summary reports the split mode, number of folds, and fold-level training and test set sizes. Because `group = "row_id"`, each sample is treated as its own group. As a result, repeated samples from the same subject may appear in both the training and test sets, which introduces information leakage. **Leakage-safe alternative: group by subject** ```{r safe-splits} safe_splits <- make_split_plan( df, outcome = "outcome", mode = "subject_grouped", group = "subject", v = 5, repeats = 1, stratify = TRUE, seed = 10 ) cat("Leakage-safe splits summary (subject-grouped CV):\n") safe_splits ``` Here, each subject is confined to a single fold. Test set sizes are multiples of the number of samples per subject, confirming that subjects were not split across folds. The fold sizes remain comparable because `stratify = TRUE` balances outcome proportions across folds while respecting subject-level boundaries. **Other leakage-aware modes** ```{r splits-other-modes} batch_splits <- make_split_plan( df, outcome = "outcome", mode = "batch_blocked", batch = "batch", v = 4, stratify = TRUE ) cat("Batch-blocked splits summary:\n") batch_splits study_splits <- make_split_plan( df, outcome = "outcome", mode = "study_loocv", study = "study" ) cat("Study leave-one-out splits summary:\n") study_splits time_splits <- make_split_plan( df, outcome = "outcome", mode = "time_series", time = "time", v = 4, horizon = 2 ) cat("Time-series splits summary:\n") time_splits nested_splits <- make_split_plan( df, outcome = "outcome", mode = "subject_grouped", group = "subject", v = 3, nested = TRUE, stratify = TRUE ) cat("Nested CV splits summary:\n") nested_splits ``` Each summary reports the number of folds and the fold sizes for the corresponding split strategy. - **Batch & Study:** The test set sizes vary because batches/studies are not evenly sized. `study_loocv` uses one study per fold, so `v` and `repeats` are ignored. - **Time series:** The training set size increases across folds while the test set size follows the block size. Early blocks with no prior training data (or too-small test windows) are skipped, so you may see fewer than `v` folds. 
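
**Programmatic split checks**

The printed summaries can be backed up with direct checks on the split objects. The sketch below is not part of the bioLeak API; it assumes explicit (non-compact) index storage and uses the per-fold `$train`/`$test` vectors in the `@indices` slot (the same slot used in the guarded-preprocessing example below) to confirm that no subject crosses folds in `safe_splits` and that every training time point precedes the test block in `time_splits`.

```{r split-sanity-check}
# Sanity checks on the split objects using base R only.
# Assumes explicit (non-compact) index storage: a list of $train/$test vectors.

# 1) Subject-grouped CV: no subject should appear in both train and test of a fold.
subject_overlap <- vapply(safe_splits@indices, function(fold) {
  length(intersect(df$subject[fold$train], df$subject[fold$test]))
}, integer(1))
subject_overlap  # all zeros when grouping is respected

# 2) Time-series CV: every training time point should precede the test block.
lookahead_ok <- vapply(time_splits@indices, function(fold) {
  max(df$time[fold$train]) < min(df$time[fold$test])
}, logical(1))
lookahead_ok  # all TRUE when folds respect temporal order
```
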
**Assumptions and intent by mode** - `subject_grouped`: keeps all samples from a subject together - `batch_blocked`: keeps batches/centers together (leave-one-group-out when `v` is at least the number of batches; otherwise multiple batches per fold) - `study_loocv`: holds out each study in turn for external validation - `time_series`: rolling-origin splits with an optional horizon to prevent lookahead Use `repeats` for repeated CV in grouped/batch modes (it is ignored for `study_loocv` and `time_series`), `stratify = TRUE` to balance outcome proportions at the group level when applicable, and `nested = TRUE` to attach inner folds (one repeat). For large datasets, `progress = TRUE` reports progress; storing explicit indices can be memory intensive. ### Scalability **Handling Large Datasets (Compact Mode)** For large datasets (e.g., $N > 50,000$) with many repeats, storing explicit integer indices for every fold can consume gigabytes of memory. Use `compact = TRUE` to store a lightweight vector of fold assignments instead. `fit_resample()` will automatically reconstruct the indices on the fly. Compact mode is not supported when `nested = TRUE` (it falls back to full indices). For `time_series` compact splits, the time column must be present in the stored split metadata so folds can be reconstructed. ```{r compact-splits} # Efficient storage for large N big_splits <- make_split_plan( df, outcome = "outcome", mode = "subject_grouped", group = "subject", v = 5, compact = TRUE # <--- Saves memory ) cat("Compact-mode splits summary:\n") big_splits cat(sprintf("Compact storage enabled: %s\n", big_splits@info$compact)) ``` The summary is identical to a regular split, but the underlying storage is a compact fold-assignment vector. Use the `compact` flag to confirm the memory- saving mode is active. ### Guarded preprocessing and imputation `bioLeak` uses guarded preprocessing to prevent leakage from global imputation, scaling, and feature selection. The low-level API is: - `.guard_fit()` to fit preprocessing on training data only - `predict_guard()` to apply the trained guard to new data - `.guard_ensure_levels()` to align factor levels across train/test - `impute_guarded()` as a convenience wrapper for leakage-safe imputation `.guard_fit()` fits a pipeline that can winsorize, impute, normalize, filter, and select features. It one-hot encodes non-numeric columns and carries factor levels forward to new data. Assumptions and misuse to avoid: - `fs = "ttest"` requires a binary outcome with enough samples per class - `fs = "lasso"` requires `glmnet` - `impute = "mice"` is not supported for guarded prediction - `impute = "none"` with missing values will trigger a median fallback and missingness indicators to avoid errors Supported preprocessing options include imputation (`median`, `knn`, `missForest`, `none`), normalization (`zscore`, `robust`, `none`), filtering by variance or IQR (optionally `min_keep`), and feature selection (`ttest`, `lasso`, `pca`). Winsorization is controlled via `impute$winsor` and `impute$winsor_k` in guarded steps; in `impute_guarded()`, use `winsor` and `winsor_thresh`. `impute_guarded()` returns a `LeakImpute` object with guarded data, diagnostics, and the fitted guard state. 
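
All guarded steps follow the same contract: every statistic used to transform the data is estimated on the training rows, frozen, and then applied unchanged to the test rows. The miniature base R sketch below (with illustrative helpers `fit_simple_guard()` and `apply_simple_guard()`, independent of bioLeak's own implementation) makes that contract explicit for median imputation plus scaling; the package functions demonstrated next apply the same idea across all preprocessing steps.

```{r guarded-contract-sketch}
# Minimal illustration of the guarded contract: fit on train, apply everywhere.
fit_simple_guard <- function(x_train) {
  list(
    center = vapply(x_train, median, numeric(1), na.rm = TRUE),
    scale  = vapply(x_train, stats::sd, numeric(1), na.rm = TRUE)
  )
}
apply_simple_guard <- function(state, x_new) {
  out <- x_new
  for (j in names(state$center)) {
    out[[j]][is.na(out[[j]])] <- state$center[[j]]                  # train median
    out[[j]] <- (out[[j]] - state$center[[j]]) / state$scale[[j]]   # train center/scale
  }
  out
}

toy_train <- data.frame(a = c(1, 2, NA, 4), b = c(0.5, 1.2, 0.9, 1.1))
toy_test  <- data.frame(a = c(NA, 5),       b = c(1.0, NA))

state <- fit_simple_guard(toy_train)
apply_simple_guard(state, toy_test)  # test rows never influence the statistics
```
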
**Leaky example: global scaling before CV** ```{r leaky-scaling} df_leaky_scaled <- df df_leaky_scaled[predictors] <- scale(df_leaky_scaled[predictors]) scaled_summary <- data.frame( feature = predictors, mean = colMeans(df_leaky_scaled[predictors]), sd = apply(df_leaky_scaled[predictors], 2, stats::sd) ) scaled_summary$mean <- round(scaled_summary$mean, 3) scaled_summary$sd <- round(scaled_summary$sd, 3) # Leaky global scaling: means ~0 and SDs ~1 computed on all samples scaled_summary ``` The summary shows that scaling used the full dataset, so test-fold statistics influenced the training transformation. This violates the train/test separation and can bias performance estimates. **Leakage-safe alternative: fit preprocessing on training only** ```{r guarded-preprocess} fold1 <- safe_splits@indices[[1]] train_x <- df[fold1$train, predictors] test_x <- df[fold1$test, predictors] guard <- .guard_fit( X = train_x, y = df$outcome[fold1$train], steps = list( impute = list(method = "median", winsor = TRUE), normalize = list(method = "zscore"), filter = list(var_thresh = 0, iqr_thresh = 0), fs = list(method = "none") ), task = "binomial" ) train_x_guarded <- predict_guard(guard, train_x) test_x_guarded <- predict_guard(guard, test_x) cat("GuardFit object:\n") guard cat("\nGuardFit summary:\n") summary(guard) # Guarded training data (first 6 rows) head(train_x_guarded) # Guarded test data (first 6 rows) head(test_x_guarded) ``` The `GuardFit` object records the preprocessing steps and the number of features retained after filtering. The summary reports how many features were removed and the preprocessing audit trail. The guarded train/test previews show that missing values were imputed and (if requested) scaled using training-only statistics; the test data never influences these values. **Factor level guard (advanced)** ```{r guard-ensure-levels} raw_levels <- data.frame( site = c("A", "B", "B"), status = c("yes", "no", "yes"), stringsAsFactors = FALSE ) level_state <- .guard_ensure_levels(raw_levels) # Aligned factor data with consistent levels level_state$data # Levels map level_state$levels ``` The returned `data` keeps factor levels consistent across folds, while the `levels` list records the training-time levels (including any dummy levels added to preserve one-hot columns). Use these when transforming new data to avoid misaligned model matrices. **Leaky example: imputation using train and test together** ```{r leaky-impute} train <- data.frame(a = c(1, 2, NA, 4), b = c(NA, 1, 1, 0)) test <- data.frame(a = c(NA, 5), b = c(1, NA)) all_median <- vapply(rbind(train, test), function(col) median(col, na.rm = TRUE), numeric(1)) train_leaky <- as.data.frame(Map(function(col, m) { col[is.na(col)] <- m; col }, train, all_median)) test_leaky <- as.data.frame(Map(function(col, m) { col[is.na(col)] <- m; col }, test, all_median)) # Leaky medians computed on train + test data.frame(feature = names(all_median), median = all_median) # Leaky-imputed training data train_leaky # Leaky-imputed test data test_leaky ``` The medians above were computed using both train and test data, so the test set influences the imputation values. This is a classic leakage pathway because test information directly alters the training features. 
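
To quantify the distortion before switching to the guarded helper, compare the medians a training-only fit would have produced with the pooled medians computed above; for column `a` the pooled value is pulled from 2 to 3 by the test rows.

```{r impute-median-compare}
# Training-only medians vs the pooled (leaky) medians computed above.
train_median <- vapply(train, median, numeric(1), na.rm = TRUE)

data.frame(
  feature = names(train_median),
  train_only = train_median,
  pooled = all_median,
  shifted = train_median != all_median
)
```
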
**Leakage-safe alternative: `impute_guarded()`** ```{r safe-impute} imp <- impute_guarded( train = train, test = test, method = "median", winsor = FALSE ) # Guarded-imputed training data imp$train # Guarded-imputed test data imp$test ``` Here, the imputation statistics are computed from the training set only. The missing value in the test set (column `a`) is replaced by 2, which is the training median. In contrast, in the leaky example above, it was replaced by 3, the global median. This confirms that the test set is transformed using fixed values learned from the training data, preserving a clean separation between training and evaluation. The `LeakImpute` object also contains missingness diagnostics in `imp$summary$diagnostics`, and guarded outputs use the same encoding as `.guard_fit()` (categorical variables are one-hot encoded). Use `vars` to impute only a subset of columns when needed. ### Fit and resample with `fit_resample()` `fit_resample()` combines leakage-aware splits with guarded preprocessing and model fitting. It supports: - built-in learners (`glmnet`, `ranger`) - `parsnip` model specifications or `workflows::workflow` objects (recommended) - `recipes::recipe` preprocessing (prepped on training folds only) - `rsample` resamples (`rset`/`rsplit`) via `splits`, with `split_cols` to map metadata - custom learners via `custom_learners` (legacy/advanced) - multiple metrics, class weights, positive class override, and yardstick metrics Assumptions and misuse to avoid: - outcome must be binary or multiclass (factor) for classification, numeric for regression, or a survival outcome (a `Surv` column or time/event columns) - `positive_class` must be a single value that exists in the outcome levels - `class_weights` only applies to binomial/multiclass tasks; workflows must handle weights internally - if you supply a matrix without metadata, you must remove leakage columns yourself (group, batch, study, time) Use `learner_args` to pass model-specific arguments. For custom learners (advanced), you can supply separate `fit` and `predict` argument lists. Set `parallel = TRUE` to use future.apply for fold-level parallelism when available. When `learner` is a parsnip specification, `learner_args` and `custom_learners` are ignored. When `learner` is a workflow, `preprocess` and `learner_args` are ignored. When a recipe or workflow is used, the built-in guarded preprocessing list is bypassed, so ensure the recipe/workflow itself is leakage-safe. Metrics used in this vignette: - **AUC**: area under the ROC curve (0.5 = random, 1.0 = perfect; higher is better) - **PR AUC**: area under the precision-recall curve (0 to 1; higher is better) - **Accuracy**: fraction of correct predictions (0 to 1; higher is better) - **RMSE**: root mean squared error (0 to infinity; lower is better) - **Macro F1**: average F1 across classes (multiclass; higher is better) - **Log loss**: cross-entropy for probabilities (lower is better) - **C-index**: concordance for regression/survival (higher is better) Always report the mean and variability across folds, not a single fold value. When using yardstick metrics, the positive class is the second factor level; set `positive_class` to control this. **Parsnip model specification (recommended)** ```{r parsnip-spec} spec <- parsnip::logistic_reg(mode = "classification") |> parsnip::set_engine("glm") ``` This spec uses base R `glm` under the hood and does not require extra model packages. Use it in the examples below; custom learners are covered in the Advanced section. 
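
Any other parsnip classification specification can stand in for `spec`. For instance, a lasso-penalized logistic regression can be defined as below (the `spec_lasso` name and the `penalty` value are illustrative, and the glmnet engine must be installed); pass it as `learner = spec_lasso` in place of `spec` to rerun the same guarded workflow with regularization.

```{r spec-alternative}
if (requireNamespace("glmnet", quietly = TRUE)) {
  # Illustrative alternative spec: lasso-penalized logistic regression.
  spec_lasso <- parsnip::logistic_reg(penalty = 0.01, mixture = 1) |>
    parsnip::set_engine("glmnet")
  print(spec_lasso)
} else {
  cat("glmnet not installed; skipping the alternative spec.\n")
}
```
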
**Leaky example: leaky features and row-wise splits** ```{r fit-leaky} fit_leaky <- fit_resample( df_leaky, outcome = "outcome", splits = leaky_splits, learner = spec, metrics = c("auc", "accuracy"), preprocess = list( impute = list(method = "median"), normalize = list(method = "zscore"), filter = list(var_thresh = 0), fs = list(method = "none") ) ) cat("Leaky fit summary:\n") summary(fit_leaky) metrics_leaky <- as.data.frame(fit_leaky@metric_summary) num_cols <- vapply(metrics_leaky, is.numeric, logical(1)) metrics_leaky[num_cols] <- lapply(metrics_leaky[num_cols], round, digits = 3) # Leaky fit: mean and SD of metrics across folds metrics_leaky ``` The summary reports cross-validated performance. AUC ranges from 0.5 (random) to 1.0 (perfect), while accuracy is the fraction of correct predictions. Because the splits are leaky, these metrics can be artificially high and should not be used for reporting. **Leakage-safe alternative: grouped splits and clean predictors** ```{r fit-safe} fit_safe <- fit_resample( df, outcome = "outcome", splits = safe_splits, learner = spec, metrics = c("auc", "accuracy"), preprocess = list( impute = list(method = "median"), normalize = list(method = "zscore"), filter = list(var_thresh = 0), fs = list(method = "none") ), positive_class = "case", class_weights = c(control = 1, case = 1), refit = TRUE ) cat("Leakage-safe fit summary:\n") summary(fit_safe) metrics_safe <- as.data.frame(fit_safe@metric_summary) num_cols <- vapply(metrics_safe, is.numeric, logical(1)) metrics_safe[num_cols] <- lapply(metrics_safe[num_cols], round, digits = 3) # Leakage-safe fit: mean and SD of metrics across folds metrics_safe # Per-fold metrics (first 6 rows) head(fit_safe@metrics) ``` `fit_resample()` returns a `LeakFit` object. Use `summary(fit_safe)` for a compact report and inspect `fit_safe@metrics` and `fit_safe@predictions` for details. When `refit = TRUE`, the final model and preprocessing state are stored in `fit_safe@info$final_model` and `fit_safe@info$final_preprocess`. The mean +/- SD table is the primary performance summary to report, while the per-fold metrics reveal variability and potential instability across folds. By default, `fit_resample()` stores refit inputs in `fit_safe@info$perm_refit_spec` to enable `audit_leakage(perm_refit = "auto")`. Set `store_refit_data = FALSE` to reduce memory and provide `perm_refit_spec` manually when you want refit-based permutations. When multiple learners are passed, `refit = TRUE` refits only the first learner; set `refit = FALSE` if you do not need a final model. The examples above use a parsnip model specification. To swap in other models, replace `spec` with another parsnip spec (see the gradient boosting example below). **Multiclass classification (optional)** ```{r multiclass-example} if (requireNamespace("ranger", quietly = TRUE)) { set.seed(11) df_multi <- df df_multi$outcome3 <- factor(sample(c("A", "B", "C"), nrow(df_multi), replace = TRUE)) multi_splits <- make_split_plan( df_multi, outcome = "outcome3", mode = "subject_grouped", group = "subject", v = 4, stratify = TRUE, seed = 11 ) fit_multi <- fit_resample( df_multi, outcome = "outcome3", splits = multi_splits, learner = "ranger", metrics = c("accuracy", "macro_f1", "log_loss"), refit = FALSE ) cat("Multiclass fit summary:\n") summary(fit_multi) } else { cat("ranger not installed; skipping multiclass example.\n") } ``` Multiclass fits compute accuracy, macro-F1, and log loss when class probabilities are available. `positive_class` is ignored for multiclass tasks. 
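
**Macro-F1 by hand**

For reference, the macro-F1 reported above is the unweighted mean of per-class F1 scores. The self-contained base R sketch below (synthetic labels and an illustrative `macro_f1()` helper, not bioLeak's internal implementation) spells out the computation.

```{r macro-f1-sketch}
# Macro-F1 from scratch: per-class precision/recall/F1, then an unweighted mean.
macro_f1 <- function(truth, estimate) {
  classes <- levels(truth)
  f1 <- vapply(classes, function(cl) {
    tp <- sum(estimate == cl & truth == cl)
    fp <- sum(estimate == cl & truth != cl)
    fn <- sum(estimate != cl & truth == cl)
    precision <- if ((tp + fp) > 0) tp / (tp + fp) else 0
    recall    <- if ((tp + fn) > 0) tp / (tp + fn) else 0
    if (precision + recall > 0) 2 * precision * recall / (precision + recall) else 0
  }, numeric(1))
  mean(f1)
}

set.seed(42)
truth <- factor(sample(c("A", "B", "C"), 30, replace = TRUE))
estimate <- factor(sample(c("A", "B", "C"), 30, replace = TRUE), levels = levels(truth))
macro_f1(truth, estimate)
```
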
**Survival outcomes (optional)** ```{r survival-example} if (requireNamespace("survival", quietly = TRUE) && requireNamespace("glmnet", quietly = TRUE)) { set.seed(12) df_surv <- df df_surv$time_to_event <- rexp(nrow(df_surv), rate = 0.1) df_surv$event <- rbinom(nrow(df_surv), 1, 0.7) surv_splits <- make_split_plan( df_surv, outcome = c("time_to_event", "event"), mode = "subject_grouped", group = "subject", v = 4, stratify = FALSE, seed = 12 ) fit_surv <- fit_resample( df_surv, outcome = c("time_to_event", "event"), splits = surv_splits, learner = "glmnet", metrics = "cindex", refit = FALSE ) cat("Survival fit summary:\n") summary(fit_surv) } else { cat("survival or glmnet not installed; skipping survival example.\n") } ``` For survival tasks, supply `outcome = c(time_col, event_col)` or a `Surv` column; `stratify` is ignored for time/event outcomes and `class_weights` are not used. **Passing learner-specific arguments (optional)** ```{r learner-args} if (requireNamespace("glmnet", quietly = TRUE)) { fit_glmnet <- fit_resample( df, outcome = "outcome", splits = safe_splits, learner = "glmnet", metrics = "auc", learner_args = list(glmnet = list(alpha = 0.5)), preprocess = list( impute = list(method = "median"), normalize = list(method = "zscore"), filter = list(var_thresh = 0), fs = list(method = "none") ) ) cat("GLMNET summary with learner-specific arguments:\n") summary(fit_glmnet) } else { cat("glmnet not installed; skipping learner_args example.\n") } ``` This summary reflects the same guarded preprocessing but a different model configuration (here, `alpha = 0.5`). Use the mean +/- SD metrics to compare learner settings under identical splits. **SummarizedExperiment inputs (optional)** ```{r se-example} if (requireNamespace("SummarizedExperiment", quietly = TRUE)) { se <- SummarizedExperiment::SummarizedExperiment( assays = list(counts = t(as.matrix(df[, predictors]))), colData = df[, c("subject", "batch", "study", "time", "outcome"), drop = FALSE] ) se_splits <- make_split_plan( se, outcome = "outcome", mode = "subject_grouped", group = "subject", v = 3 ) se_fit <- fit_resample( se, outcome = "outcome", splits = se_splits, learner = spec, metrics = "auc" ) cat("SummarizedExperiment fit summary:\n") summary(se_fit) } else { cat("SummarizedExperiment not installed; skipping SE example.\n") } ``` The summary is identical in structure to the `data.frame` case because `fit_resample()` extracts predictors and metadata from the `SummarizedExperiment` object. Note that `features_final` here reflects only the assay predictors (`x1`, `x2`, `x3`), because the assay was constructed without metadata columns. ### Tidymodels interoperability `bioLeak` integrates with `rsample`, `recipes`, `workflows`, and `yardstick`: - `splits` can be an `rsample` rset/rsplit; use `split_cols` (default `"auto"`) or `bioLeak_mode` / `bioLeak_perm_mode` attributes to map group/batch/study/time metadata. - `preprocess` can be a `recipes::recipe` (prepped on training folds only). - `learner` can be a `workflows::workflow`. - `metrics` can be a `yardstick::metric_set`. - `as_rsample()` converts `LeakSplits` to an `rsample` object. 
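
Because `metrics` accepts a `yardstick::metric_set`, you are not limited to the built-in metric names. The bundle below (an illustrative set; the `metrics_rich` name is ours) combines ROC AUC, PR AUC, log loss, and accuracy and can be passed as `metrics = metrics_rich` to `fit_resample()` exactly like the smaller set used in the next chunk.

```{r yardstick-metric-set}
if (requireNamespace("yardstick", quietly = TRUE)) {
  # A richer yardstick metric bundle; probability and class metrics can be mixed.
  metrics_rich <- yardstick::metric_set(
    yardstick::roc_auc,
    yardstick::pr_auc,
    yardstick::mn_log_loss,
    yardstick::accuracy
  )
  print(metrics_rich)
} else {
  cat("yardstick not installed; skipping the metric set example.\n")
}
```
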
```{r tidymodels-interop} library(bioLeak) library(parsnip) library(recipes) library(yardstick) set.seed(123) N <- 60 df <- data.frame( subject = factor(rep(paste0("S", 1:20), length.out = N)), outcome = factor(rep(c("ClassA", "ClassB"), length.out = N)), x1 = rnorm(N), x2 = rnorm(N), x3 = rnorm(N) ) spec <- logistic_reg() |> set_engine("glm") # Use bioLeak's native split planner to avoid conversion errors set.seed(13) # Use make_split_plan instead of rsample::group_vfold_cv # This creates a subject-grouped CV directly compatible with fit_resample splits <- make_split_plan( df, outcome = "outcome", mode = "subject_grouped", group = "subject", v = 3 ) rec <- recipes::recipe(outcome ~ x1 + x2 + x3, data = df) |> recipes::step_impute_median(recipes::all_numeric_predictors()) |> recipes::step_normalize(recipes::all_numeric_predictors()) metrics_set <- yardstick::metric_set(yardstick::roc_auc, yardstick::accuracy) fit_rs <- fit_resample( df, outcome = "outcome", splits = splits, learner = spec, preprocess = rec, metrics = metrics_set, refit = FALSE ) if (exists("as_rsample", where = asNamespace("bioLeak"), mode = "function")) { rs_export <- as_rsample(fit_rs@splits, data = df) print(rs_export) } ``` Use `split_cols` to ensure split-defining metadata are dropped from predictors when using rsample inputs. ```{r workflow-example} if (requireNamespace("workflows", quietly = TRUE)) { wf <- workflows::workflow() |> workflows::add_model(spec) |> workflows::add_formula(outcome ~ x1 + x2 + x3) fit_wf <- fit_resample( df, outcome = "outcome", splits = safe_splits, learner = wf, metrics = "auc", refit = FALSE ) cat("Workflow fit summary:\n") summary(fit_wf) } else { cat("workflows not installed; skipping workflow example.\n") } ``` When auditing rsample-based fits, pass `perm_mode = "subject_grouped"` (or set `attr(rs, "bioLeak_perm_mode") <- "subject_grouped"`) so restricted permutations respect the intended split design. ### Nested tuning with `tune_resample()` `tune_resample()` runs leakage-aware nested tuning with tidymodels. It expects outer splits that include inner folds (use `nested = TRUE` in `make_split_plan()`), and a parsnip model spec or workflow. Survival tasks are not yet supported. Use `inner_v`, `inner_repeats`, and `inner_seed` to control inner resampling when inner folds are not precomputed. ```{r tune-example} library(bioLeak) library(parsnip) library(recipes) library(tune) library(glmnet) # --- 1. Create Data --- set.seed(123) N <- 60 df <- data.frame( subject = factor(rep(paste0("S", 1:20), length.out = N)), # FIX: Use sample() to randomize outcome and avoid perfect subject-confounding outcome = factor(sample(c("ClassA", "ClassB"), N, replace = TRUE)), x1 = rnorm(N), x2 = rnorm(N), x3 = rnorm(N) ) # --- 2. Generate Nested Splits --- set.seed(1) nested_splits <- make_split_plan( df, outcome = "outcome", mode = "subject_grouped", group = "subject", v = 3, nested = TRUE ) # --- 3. Define Recipe & Model --- rec <- recipes::recipe(outcome ~ x1 + x2 + x3, data = df) |> recipes::step_impute_median(recipes::all_numeric_predictors()) |> recipes::step_normalize(recipes::all_numeric_predictors()) spec_tune <- parsnip::logistic_reg( penalty = tune::tune(), mixture = 1, mode = "classification" ) |> parsnip::set_engine("glmnet") # --- 4. 
Run Tuning --- tuned <- tune_resample( df, outcome = "outcome", splits = nested_splits, learner = spec_tune, preprocess = rec, inner_v = 2, grid = 3, metrics = c("auc", "accuracy"), seed = 14 ) summary(tuned) ``` ### Advanced: Using Gradient Boosting with Parsnip `bioLeak` natively supports `tidymodels` specifications. You can pass a `parsnip` model specification directly to `fit_resample()`. This allows you to use state-of-the-art algorithms like XGBoost, LightGBM, or SVMs while ensuring all preprocessing remains guarded. ```{r fit-xgboost} if (requireNamespace("parsnip", quietly = TRUE) && requireNamespace("xgboost", quietly = TRUE) && requireNamespace("recipes", quietly = TRUE)) { # 1. Define the model spec xgb_spec <- parsnip::boost_tree( mode = "classification", trees = 100, tree_depth = 6, learn_rate = 0.01 ) |> parsnip::set_engine("xgboost") # 2. Define a Recipe (CORRECTED) # FIX: Define the recipe on data WITHOUT 'subject'. # This matches the data bioLeak passes (which has subject removed). df_for_rec <- df[, !names(df) %in% "subject"] rec_xgb <- recipes::recipe(outcome ~ ., data = df_for_rec) |> recipes::step_dummy(recipes::all_nominal_predictors()) |> recipes::step_impute_median(recipes::all_numeric_predictors()) # Note: No need for step_rm(subject) because it's already gone! # 3. Fit fit_xgb <- fit_resample( df, outcome = "outcome", splits = nested_splits, learner = xgb_spec, metrics = "auc", preprocess = rec_xgb ) cat("XGBoost parsnip fit summary:\n") print(summary(fit_xgb)) } else { cat("parsnip/xgboost/recipes not installed.\n") } ``` The summary reports cross-validated AUC for a non-linear gradient boosting model. Use the mean $\pm$ SD table to compare against baseline models, and confirm that any gains do not coincide with leakage signals in the audit diagnostics. ### Advanced: Custom learners Custom learners are used when a model is not available as a *parsnip* specification or when a lightweight wrapper around base R is needed. Each custom learner must define `fit` and `predict` components. The `fit` function must accept `x`, `y`, `task`, and `weights` as inputs, while the `predict` function must accept `object` and `newdata`. ```{r custom-learners} custom_learners <- list( glm = list( fit = function(x, y, task, weights, ...) { df_fit <- data.frame(y = y, x, check.names = FALSE) stats::glm(y ~ ., data = df_fit, family = stats::binomial(), weights = weights) }, predict = function(object, newdata, task, ...) { as.numeric(stats::predict(object, newdata = as.data.frame(newdata), type = "response")) } ) ) cat("Custom learner names:\n") names(custom_learners) cat("Custom learner components (fit/predict):\n") lapply(custom_learners, names) ``` This output confirms that each custom learner provides both a `fit` and a `predict` function. Custom learners can be used with `fit_resample()` as follows: ```r fit_resample( ..., learner = "glm", custom_learners = custom_learners ) ``` ### Visual diagnostics **bioLeak** includes plotting helpers for quick diagnostic checks: - `plot_fold_balance()` shows class balance per fold. - `plot_overlap_checks()` highlights train/test overlap for a metadata column. - `plot_time_acf()` checks autocorrelation in time-series predictions. - `plot_calibration()` shows reliability of binomial probabilities. - `plot_confounder_sensitivity()` summarizes performance by batch/study strata. Tabular helpers are also available: - `calibration_summary()` returns calibration curve data and ECE/MCE/Brier metrics. 
- `confounder_sensitivity()` returns per-stratum performance summaries. **Misuse to avoid** - Plotting without predictions (fits with no successful folds). - Using `plot_overlap_checks()` when the column is not present in the split metadata. - Using `plot_time_acf()` without a time column or for non-temporal data. - Using `plot_calibration()` outside binomial tasks. - Using `plot_confounder_sensitivity()` with a metric unsupported by the task. For classification, `plot_fold_balance()` uses the positive class recorded in `fit@info$positive_class`, or defaults to the second factor level if this is not set. For multiclass tasks, it shows per-class counts without a proportion line. ```{r plot-fold-balance} if (requireNamespace("ggplot2", quietly = TRUE)) { plot_fold_balance(fit_safe) } else { cat("ggplot2 not installed; skipping fold balance plot.\n") } ``` The bar chart shows the counts of Positives (blue) and Negatives (tan) in each fold. The dashed blue line represents the proportion of the positive class. In a well-stratified fit, this line remains relatively stable across folds. Large fluctuations indicate poor stratification, which can lead to unstable fold-level performance estimates. ```{r plot-calibration} if (requireNamespace("ggplot2", quietly = TRUE)) { plot_calibration(fit_safe, bins = 10) } else { cat("ggplot2 not installed; skipping calibration plot.\n") } ``` Calibration curves compare predicted probabilities to observed event rates. Large deviations from the diagonal indicate miscalibration. Use `min_bin_n` to suppress tiny bins and `learner =` to select a model when multiple learners are stored in the fit. ```{r plot-confounder-sensitivity} if (requireNamespace("ggplot2", quietly = TRUE)) { plot_confounder_sensitivity(fit_safe, confounders = c("batch", "study"), metric = "auc", min_n = 3) } else { cat("ggplot2 not installed; skipping confounder sensitivity plot.\n") } ``` Confounder sensitivity plots highlight whether performance varies substantially across batch or study strata. ```{r diagnostics-tables} cal <- calibration_summary(fit_safe, bins = 10, min_bin_n = 5) conf_tbl <- confounder_sensitivity(fit_safe, confounders = c("batch", "study"), metric = "auc", min_n = 3) # Calibration metrics cal$metrics # Confounder sensitivity table (first 6 rows) head(conf_tbl) ``` ```{r plot-overlap} if (requireNamespace("ggplot2", quietly = TRUE)) { plot_overlap_checks(fit_leaky, column = "subject") plot_overlap_checks(fit_safe, column = "subject") } else { cat("ggplot2 not installed; skipping overlap plots.\n") } ``` The overlap plots compare the number of unique subjects appearing in the training and test sets for each fold. **Top plot (`fit_leaky`)** The red line shows high overlap counts, accompanied by the explicit *“WARNING: Overlaps detected!”* annotation. This confirms that the same subjects appear in both the training and test sets, indicating information leakage. **Bottom plot (`fit_safe`)** The overlap line remains flat at zero across all folds. This confirms that `subject_grouped` splitting successfully keeps subjects isolated, ensuring that no subject appears in both sets simultaneously. 
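
**Calibration metrics by hand**

The calibration table shown earlier reports ECE, MCE, and the Brier score. Their definitions are simple: ECE is the bin-size-weighted average gap between mean predicted probability and observed event rate, and the Brier score is the mean squared error of the probabilities. The self-contained base R sketch below (synthetic probabilities and illustrative `ece`/`brier` computations, not bioLeak's internal code) makes both explicit.

```{r ece-brier-sketch}
# Expected calibration error (ECE) and Brier score from first principles.
set.seed(7)
prob <- runif(200)                   # synthetic predicted probabilities
obs  <- rbinom(200, 1, prob^1.3)     # outcomes from a miscalibrated model
bins <- cut(prob, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)

bin_stats <- data.frame(
  n = as.vector(table(bins)),
  mean_prob = tapply(prob, bins, mean),
  event_rate = tapply(obs, bins, mean)
)
bin_stats <- bin_stats[bin_stats$n > 0, ]

ece <- with(bin_stats, sum(n / sum(n) * abs(mean_prob - event_rate)))
brier <- mean((prob - obs)^2)
c(ECE = ece, Brier = brier)
```
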
### Audit leakage with `audit_leakage()`

`audit_leakage()` combines four diagnostics in one object:

- permutation gap: tests whether model signal is non-random
- batch or study association with folds (chi-square and Cramér's V)
- target leakage scan: feature-wise outcome association to flag proxy columns
- duplicate and near-duplicate detection across train/test (by default) using feature similarity

Assumptions and misuse to avoid:

- `coldata` (or stored split metadata) must align with prediction IDs; rownames or a `row_id` column help alignment
- `X_ref` must align to prediction IDs (rownames, `row_id`, or matching order)
- `plot_perm_distribution()` requires `return_perm = TRUE`
- `target_threshold` should be set high enough to flag only strong proxies
- target leakage scan includes univariate associations plus an optional multivariate/interaction check (`target_scan_multivariate`, `*_components`, `*_interactions`, `*_B`); proxies outside `X_ref` can still pass undetected
- for rsample splits, set `perm_mode` (or `bioLeak_perm_mode`) so restricted permutations respect the intended design
- for time-series audits, ensure a valid time column and set `time_block` / `block_len` as needed

**Interpretation Note:** The label-permutation test refits models by default when refit inputs are available (`perm_refit = "auto"` with `store_refit_data = TRUE`) and `B <= perm_refit_auto_max`. Otherwise it keeps predictions fixed, so the p-value quantifies prediction-label association rather than a full refit null. A large gap indicates a non-random association. To determine whether that signal is *real* or *leaked*, check the **Batch Association** (confounding), **Target Leakage Scan** (proxy features), and **Duplicate Detection** (memorization) tables. Use `perm_refit = FALSE` to force fixed-prediction shuffles or `perm_refit = TRUE` with `perm_refit_spec` to always refit.

Use `feature_space = "rank"` to compare samples by rank profiles, and `sim_method` (`cosine` or `pearson`) to control similarity. For large `n`, `nn_k` and `max_pairs` limit duplicate searches. Use `duplicate_scope = "all"` to include within-fold duplicates (data-quality checks). Use `ci_method = "if"` (influence-function) or `ci_method = "bootstrap"` with `boot_B` to obtain a confidence interval for the permutation gap; `include_z` controls whether a z-score is reported. Set `perm_stratify = "auto"` to stratify permutations only when outcomes support it.
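
**Permutation logic in miniature**

To make the permutation table easier to read, the self-contained sketch below reproduces the fixed-prediction variant of the test on synthetic data: shuffle the labels, recompute the metric (here a rank-based AUC written in base R), and compare the observed value to the null distribution with the conventional estimator \(p = (1 + k) / (B + 1)\), where \(k\) is the number of permuted values at least as large as the observed metric. bioLeak's refit-based default is stricter and its exact estimator may differ in detail, but the interpretation of the gap and p-value is the same.

```{r perm-pvalue-sketch}
# Fixed-prediction label permutation on synthetic data (rank-based AUC).
set.seed(21)
score <- rnorm(120)
label <- rbinom(120, 1, stats::plogis(1.2 * score))  # labels carry real signal

auc <- function(score, label) {
  r <- rank(score)
  n1 <- sum(label == 1); n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

B <- 200
auc_obs <- auc(score, label)
auc_null <- replicate(B, auc(score, sample(label)))  # shuffled labels, fixed scores

gap <- auc_obs - mean(auc_null)
p_value <- (1 + sum(auc_null >= auc_obs)) / (B + 1)
c(observed = auc_obs, null_mean = mean(auc_null), gap = gap, p_value = p_value)
```
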
**Leakage-aware audit** ```{r audit-leakage} X_ref <- df[, predictors] X_ref[c(1, 5), ] <- X_ref[1, ] audit <- audit_leakage( fit_safe, metric = "auc", B = 20, perm_stratify = TRUE, batch_cols = c("batch", "study"), X_ref = X_ref, sim_method = "cosine", sim_threshold = 0.995, return_perm = TRUE ) cat("Leakage audit summary:\n") summary(audit) if (!is.null(audit@permutation_gap) && nrow(audit@permutation_gap) > 0) { # Permutation significance results audit@permutation_gap } if (!is.null(audit@batch_assoc) && nrow(audit@batch_assoc) > 0) { # Batch/study association with folds (Cramer's V) audit@batch_assoc } else { cat("No batch or study associations detected.\n") } if (!is.null(audit@target_assoc) && nrow(audit@target_assoc) > 0) { # Top features by target association score head(audit@target_assoc) } else { cat("No target leakage scan results available.\n") } if (!is.null(audit@duplicates) && nrow(audit@duplicates) > 0) { # Top duplicate/near-duplicate pairs by similarity head(audit@duplicates) } else { cat("No near-duplicates detected.\n") } ``` The permutation table reports the observed metric, the mean under random label permutation, the gap (difference), and a permutation *p*-value. For metrics where higher values indicate better performance, larger gaps reflect stronger non-random signal. The batch association table reports chi-square statistics and Cramer's V. Large p-values and small V values indicate that folds are not aligned with batch or study labels (which is the desired outcome when these should be independent). The target scan table lists features with the strongest associations with the outcome. For numeric features, the score is \(|\mathrm{AUC} - 0.5| \times 2\) for classification or the absolute correlation for regression. For categorical features, the score is Cramér’s V or eta-squared. Scores closer to 1 indicate stronger outcome association. The duplicate table lists pairs of samples with near-identical profiles that cross train/test folds (by default). In this setup, the artificially duplicated pair (`X_ref[c(1, 5), ] <- X_ref[1, ]`) should appear near the top of the list. Use `duplicate_scope = "all"` to include within-fold duplicates. ```{r plot-perm} if (requireNamespace("ggplot2", quietly = TRUE)) { plot_perm_distribution(audit) } else { cat("ggplot2 not installed; skipping permutation plot.\n") } ``` The histogram shows the null distribution (gray bars) of the performance metric under random label permutation. The blue dashed line represents the average performance of a random model (the permuted mean). The red solid line represents the observed performance of the fitted model. A genuine signal is indicated when the red line lies well to the right of the gray distribution; overlap suggests weak or unstable signal. **Audit per learner** ```{r audit-by-learner} if (requireNamespace("ranger", quietly = TRUE) && requireNamespace("parsnip", quietly = TRUE)) { # 1. Define specs # Standard GLM (no tuning) spec_glm <- parsnip::logistic_reg() |> parsnip::set_engine("glm") # Random Forest spec_rf <- parsnip::rand_forest( mode = "classification", trees = 100 ) |> parsnip::set_engine("ranger") # 2. Fit with CORRECT splits fit_multi <- fit_resample( df, outcome = "outcome", splits = nested_splits, # <--- FIX: Use the valid split object learner = list(glm = spec_glm, rf = spec_rf), metrics = "auc" ) # 3. 
Run the audit audits <- audit_leakage_by_learner(fit_multi, metric = "auc", B = 20) cat("Per-learner audit summary:\n") print(audits) } else { cat("ranger/parsnip not installed.\n") } ``` This example uses **ranger** for the random forest specification; if it is not installed, the code chunk is skipped. Use `parallel_learners = TRUE` to audit learners concurrently when **future.apply** is available. The printed table summarizes each learner’s observed metric, permutation gap, p-value, and key batch/duplicate summaries. Use it to compare signal strength across models while checking for leakage risks. Pass `learners =` to audit a subset, or `mc.cores =` to control parallel workers. **HTML audit report** `audit_report()` accepts either a `LeakAudit` or a `LeakFit` object. When a `LeakFit` is provided, it first runs `audit_leakage()` and then forwards any additional arguments to the audit step. If multiple learners were fit, pass `learner =` via `...` to select one. Use `open = TRUE` to open the report in a browser after rendering. ```{r audit-report, eval = FALSE} if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) { report_path <- audit_report(audit, output_dir = ".") cat("HTML report written to:\n", report_path, "\n") } else { cat("rmarkdown or pandoc not available; skipping audit report rendering.\n") } ``` The report path points to a standalone HTML file containing the same audit tables and plots, suitable for sharing with collaborators or archiving as a quality control record. ### Time-series leakage checks Time-series data require special handling. Random splits can leak information from the future into the past. Use `mode = "time_series"` with a prediction `horizon`, and audit with block permutations. Choose `time_block = "circular"` or `"stationary"`; when `block_len` is NULL, the audit uses a default block length (~10% of the test block size, minimum 5). **Leaky example: lookahead feature** ```{r time-series-fit} time_splits <- make_split_plan( df_time, outcome = "outcome", mode = "time_series", time = "time", v = 4, horizon = 1 ) cat("Time-series splits summary:\n") time_splits fit_time_leaky <- fit_resample( df_time, outcome = "outcome", splits = time_splits, learner = spec, metrics = "auc" ) cat("Time-series leaky fit summary:\n") summary(fit_time_leaky) ``` Time-series splitting trains on growing windows, so performance can differ from standard CV simply because early folds have smaller training sets. Regardless of the score, the presence of the `leak_future` feature makes this estimate methodologically invalid. **Leakage-safe alternative: remove lookahead and audit with blocks** ```{r time-series-safe} df_time_safe <- df_time df_time_safe$leak_future <- NULL fit_time_safe <- fit_resample( df_time_safe, outcome = "outcome", splits = time_splits, learner = spec, metrics = "auc" ) cat("Time-series safe fit summary:\n") summary(fit_time_safe) audit_time <- audit_leakage( fit_time_safe, metric = "auc", B = 20, time_block = "stationary", block_len = 5 ) cat("Time-series leakage audit summary:\n") summary(audit_time) if (!is.null(audit_time@permutation_gap) && nrow(audit_time@permutation_gap) > 0) { # Time-series permutation significance results audit_time@permutation_gap } if (requireNamespace("ggplot2", quietly = TRUE)) { plot_time_acf(fit_time_safe, lag.max = 20) } else { cat("ggplot2 not installed; skipping ACF plot.\n") } ``` The safe fit summary provides the leakage-resistant performance estimate. Compare leaky vs. 
safe fits to gauge inflation risk; `features_final` should drop when the lookahead feature is removed. The ACF plot shows the autocorrelation of out-of-fold predictions by lag. The dashed red lines represent the 95% confidence interval for white noise; large bars outside the bands indicate residual temporal dependence. `plot_time_acf()` requires numeric predictions and time metadata aligned to the fit. ## Parallel Processing `bioLeak` uses the `future` framework for parallelism (Windows, macOS, Linux). `fit_resample()`, `audit_leakage()`, `audit_leakage_by_learner()`, and `simulate_leakage_suite()` honor the active plan when `parallel = TRUE` (or `parallel_learners = TRUE`). ```{r parallel-setup, eval=FALSE} library(future) # Use multiple cores (works on all OS) plan(multisession, workers = 4) # Run a heavy simulation sim <- simulate_leakage_suite(..., parallel = TRUE) # Parallel folds or audits # fit_resample(..., parallel = TRUE) # audit_leakage(..., parallel = TRUE) # audit_leakage_by_learner(..., parallel_learners = TRUE) # Return to sequential processing plan(sequential) ``` This chunk only configures the parallel plan, so it does not produce a result object. Use it as a template before running compute-heavy functions. ### Simulation suite `simulate_leakage_suite()` runs Monte Carlo simulations that inject specific leakage mechanisms and then evaluates detection with the audit pipeline. It is useful for validating your leakage checks before applying them to real data. Available leakage types include `none`, `subject_overlap`, `batch_confounded`, `peek_norm`, and `lookahead`. Use `signal_strength`, `prevalence`, `rho`, `K`, `repeats`, `horizon`, and `preprocess` to control the simulation. The output is a `LeakSimResults` data frame with one row per seed. Metrics are selected automatically based on outcome type (AUC for binary classification). ```{r simulate-suite} if (requireNamespace("glmnet", quietly = TRUE)) { sim <- simulate_leakage_suite( n = 80, p = 6, mode = "subject_grouped", learner = "glmnet", leakage = "subject_overlap", seeds = 1:2, B = 20, parallel = FALSE, signal_strength = 1 ) # Simulation results (first 6 rows) head(sim) } else { cat("glmnet not installed; skipping simulation suite example.\n") } ``` Each row corresponds to one simulation seed. - `metric_obs` is the observed performance (AUC for this simulation). Higher values suggest stronger apparent signal, which can be inflated by leakage. - `gap` and `p_value` indicate that the leaked signal is statistically distinguishable from random noise. Use `metric_obs` to gauge the magnitude of the leakage effect, and `gap` to assess detection sensitivity. ### Objects and summaries bioLeak uses S4 classes and list-like results to capture provenance and diagnostics: - `LeakSplits`: returned by `make_split_plan()`, printed with `show()` - `LeakFit`: returned by `fit_resample()`, summarized with `summary()` - `LeakAudit`: returned by `audit_leakage()`, summarized with `summary()` - `LeakAuditList`: returned by `audit_leakage_by_learner()`, printed directly - `GuardFit`: returned by `.guard_fit()`, printed and summarized - `LeakImpute`: returned by `impute_guarded()` (guarded data + diagnostics) - `LeakTune`: returned by `tune_resample()` (nested tuning summaries) - `LeakSimResults`: returned by `simulate_leakage_suite()` (per-seed results) These objects store hashes and metadata that help reproduce and audit the workflow. 
Inspect their slots (`@metrics`, `@predictions`, `@permutation_gap`, `@duplicates`) to drill into specific leakage signals.
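
For example, `slotNames()` lists the available slots of the S4 objects created above, and the slots named here can be pulled out directly (a quick inspection, assuming the fit and audit objects from the previous sections are still in the session).

```{r object-slots}
# Enumerate the S4 slots, then drill into the ones referenced above.
slotNames(fit_safe)
slotNames(audit)

head(fit_safe@predictions)    # predictions stored by fit_resample()
audit@permutation_gap         # permutation diagnostics stored by audit_leakage()
```
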