--- title: "Exploratory Subgroup Identification with ForestSearch" subtitle: "German Breast Cancer Study Group (GBSG) Analysis" author: "Larry F. León" date: "`r Sys.Date()`" output: html_document: toc: true toc_depth: 3 number_sections: true code_folding: hide theme: cosmo highlight: tango fig_width: 8 fig_height: 6 vignette: > %\VignetteIndexEntry{Exploratory Subgroup Identification with ForestSearch} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, warning = FALSE, message = FALSE, fig.align = "center", fig.retina = 2, fig.width = 8, fig.height = 6 ) # Bootstrap 5 code-folding hook for pkgdown compatibility # Usage: add `echo=FALSE, code_fold=TRUE` to any chunk header knitr::knit_hooks$set(code_fold = function(before, options, envir) { if (!before && isTRUE(options$code_fold)) { id <- gsub("[^a-zA-Z0-9]", "", options$label) code_lines <- knitr::knit_code$get(options$label) code_text <- htmltools::htmlEscape(paste(code_lines, collapse = "\n")) sprintf( '

%s

', id, id, id, code_text ) } }) # Initialize timing timings <- list() t_vignette_start <- proc.time() ``` # Introduction This vignette demonstrates the **ForestSearch** methodology for exploratory subgroup identification in survival analysis, as described in León et al. (2024) *Statistics in Medicine*. ## Motivation In clinical trials, particularly oncology, subgroup analyses are essential for: - Evaluating treatment effect consistency across patient populations - Identifying subgroups where treatment may be detrimental (harm) - Characterizing subgroups with enhanced benefit - Informing regulatory decisions and clinical practice While prespecified subgroups provide stronger evidence, important subgroups based on patient characteristics may not be anticipated. ForestSearch provides a principled approach to **exploratory** subgroup identification with proper statistical inference. ## Methodology Overview ForestSearch identifies subgroups through: 1. **Candidate factor selection**: Using LASSO and/or Generalized Random Forests (GRF) 2. **Exhaustive subgroup search**: Evaluating all combinations up to `maxk` factors 3. **Consistency-based selection**: Applying splitting consistency criteria 4. **Bootstrap bias correction**: Adjusting for selection-induced optimism 5. **Cross-validation**: Assessing algorithm stability The key innovation is the **splitting consistency criterion**: a subgroup is considered "consistent with harm" if, when randomly split 50/50 many times, both halves consistently show hazard ratios ≥ 1.0 (for example if 1.0 represents a meaningful "harm threshold"). # Setup ## Load Required Packages ```{r load-packages} library(forestsearch) library(survival) library(data.table) library(ggplot2) library(gt) library(grf) library(policytree) library(doFuture) # Optional packages for enhanced output library(patchwork) library(weightedsurv) # Set ggplot theme theme_set(theme_minimal(base_size = 12)) ``` # Data: German Breast Cancer Study Group Trial ## Study Background The GBSG trial evaluated hormonal treatment (tamoxifen) versus chemotherapy in node-positive breast cancer patients. Key characteristics: - **Sample size**: N = 686 - **Outcome**: Recurrence-free survival time - **Censoring rate**: ~56% - **Treatment**: Hormonal therapy (tamoxifen) vs. chemotherapy ## Data Preparation ```{r data-setup} # Load GBSG data (included in forestsearch package) df.analysis <- gbsg # Prepare analysis variables df.analysis <- within(df.analysis, { id <- seq_len(nrow(df.analysis)) time_months <- rfstime / 30.4375 grade3 <- ifelse(grade == "3", 1, 0) treat <- hormon }) # Define variable roles confounders.name <- c("age", "meno", "size", "grade3", "nodes", "pgr", "er") outcome.name <- "time_months" event.name <- "status" id.name <- "id" treat.name <- "hormon" # Display data structure cat("Sample size:", nrow(df.analysis), "\n") cat("Events:", sum(df.analysis[[event.name]]), sprintf("(%.1f%%)\n", 100 * mean(df.analysis[[event.name]]))) cat("Baseline factors:", paste(confounders.name, collapse = ", "), "\n") ``` ## Baseline Characteristics ```{r baseline-table} create_summary_table( data = df.analysis, treat_var = treat.name, table_title = "GBSG Baseline Characteristics by Treatment Arm", vars_continuous = c("age", "nodes", "size", "er", "pgr"), vars_categorical = c("grade", "meno"), font_size = 12 ) ``` ## Kaplan-Meier Analysis (ITT Population) ```{r km-itt, fig.width=8, fig.height=5} # Prepare counting process data for KM plot dfcount <- df_counting( df = df.analysis, by.risk = 6, tte.name = outcome.name, event.name = event.name, treat.name = treat.name ) # Plot with confidence intervals and log-rank test plot_weighted_km( dfcount, conf.int = TRUE, show.logrank = TRUE, ymax = 1.05, xmed.fraction = 0.775, ymed.offset = 0.125 ) ``` The ITT Cox hazard ratio estimate is approximately 0.69 (95% CI: 0.54, 0.89), suggesting an overall benefit for hormonal therapy. # Preliminary Analysis: Generalized Random Forests Before running ForestSearch, we can use GRF to explore potential treatment effect heterogeneity and identify candidate factors. ```{r grf-analysis} t0 <- proc.time() grf_est <- grf.subg.harm.survival( data = df.analysis, confounders.name = confounders.name, outcome.name = outcome.name, event.name = event.name, id.name = id.name, treat.name = treat.name, maxdepth = 2, n.min = 60, dmin.grf = 12, frac.tau = 0.6, details = TRUE, return_selected_cuts_only = FALSE ) timings$grf <- (proc.time() - t0)["elapsed"] ``` ```{r grf-trees, fig.width=10, fig.height=4} # Display policy trees # leaf1 = recommend control, leaf2 = recommend treatment oldpar <- par(mfrow = c(1, 2)) plot(grf_est$tree1, leaf.labels = c("Control", "Treat"), main = "Depth 1") plot(grf_est$tree2, leaf.labels = c("Control", "Treat"), main = "Depth 2") par(oldpar) ``` GRF identifies estrogen receptor status (ER) as a key factor, with ER ≤ 0 suggesting potential harm from hormonal therapy. # ForestSearch Analysis ## Parallel Processing Configuration ForestSearch supports parallel processing for computationally intensive operations (bootstrap, cross-validation). ```{r parallel-setup} # Detect available cores (limited to 2 cores for CRAN checks) n_cores <- 2 n_cores_total <- parallel::detectCores() cat("Using", n_cores, "of", n_cores_total, "total cores for parallel processing") ``` ## Running ForestSearch ForestSearch performs an exhaustive search over candidate subgroup combinations with up to `maxk` factors. Key parameters: | Parameter | Value | Description | |-----------|-------|-------------| | `hr.threshold` | 1.25 | Minimum HR for consistency evaluation | | `hr.consistency` | 1.0 | Minimum consistency rate for candidates | | `pconsistency.threshold` | 0.90 | Required consistency for selection | | `maxk` | 2 | Maximum factors in subgroup definition | | `n.min` | 60 | Minimum subgroup sample size | | `d0.min`, `d1.min` | 12 | Minimum events per treatment arm | ```{r forestsearch-main, fig.width=10, fig.height=7} t0 <- proc.time() fs <- forestsearch( df.analysis, confounders.name = confounders.name, outcome.name = outcome.name, treat.name = treat.name, event.name = event.name, id.name = id.name, # Threshold parameters (per León et al. 2024) hr.threshold = 1.25, hr.consistency = 1.0, pconsistency.threshold = 0.80, stop_threshold = 0.80, # Search configuration sg_focus = "hr", max_subgroups_search = 3, use_twostage = TRUE, # Factor selection use_grf = TRUE, return_selected_cuts_only = TRUE, use_lasso = TRUE, cut_type = "default", # Subgroup constraints maxk = 2, n.min = 60, d0.min = 12, d1.min = 12, # Consistency evaluation fs.splits = 100, # Parallel processing parallel_args = list( plan = "multisession", workers = n_cores, show_message = TRUE ), # Output options showten_subgroups = TRUE, details = TRUE, plot.sg = TRUE ) plan("sequential") timings$forestsearch <- (proc.time() - t0)["elapsed"] cat("\nForestSearch completed in", round(timings$forestsearch, 1), "seconds\n") ``` ## ForestSearch Results ### Identified Subgroups ```{r fs-results} # Generate results tables res_tabs <- sg_tables(fs, ndecimals = 3, which_df = "est") # Display top subgroups meeting criteria res_tabs$sg10_out ``` ### Treatment Effect Estimates ```{r fs-estimates} # ITT and subgroup estimates res_tabs$tab_estimates ``` ### Identified Subgroup Definition ```{r fs-subgroup} cat("Identified subgroup (H):", paste(fs$sg.harm, collapse = " & "), "\n") cat("Subgroup size:", sum(fs$df.est$treat.recommend == 0), sprintf("(%.1f%% of ITT)\n", 100 * mean(fs$df.est$treat.recommend == 0))) ``` ForestSearch identifies **Estrogen ≤ 0** (ER-negative) as the subgroup with potential harm. This is biologically plausible: tamoxifen is a selective estrogen receptor modulator with limited efficacy in ER-negative tumors. # Bootstrap Bias Correction ## Rationale Cox model estimates from identified subgroups are **upwardly biased** due to the selection process (subgroups are selected *because* they show extreme effects). Bootstrap bias correction addresses this by: 1. Resampling with replacement 2. Re-running the entire ForestSearch algorithm 3. Computing bias terms from bootstrap vs. observed estimates 4. Applying infinitesimal jackknife variance estimation ## Running Bootstrap Analysis ```{r bootstrap, eval=TRUE} # Number of bootstrap iterations # Use 500-2000 for production; reduced here for vignette NB <- 2 t0 <- proc.time() fs_bc <- forestsearch_bootstrap_dofuture( fs.est = fs, nb_boots = NB, show_three = FALSE, details = FALSE, parallel_args = list( plan = "multisession", workers = n_cores, show_message = TRUE ) ) plan("sequential") timings$bootstrap <- (proc.time() - t0)["elapsed"] cat("\nBootstrap completed in", round(timings$bootstrap / 60, 1), "minutes\n") ``` ## Bootstrap Summary and Diagnostics ```{r bootstrap-summary} # Comprehensive summary with diagnostics summaries <- summarize_bootstrap_results( sgharm = fs$sg.harm, boot_results = fs_bc, create_plots = TRUE, est.scale = "hr" ) # Display bias-corrected estimates table summaries$table ``` ## Kaplan-Meier by Identified Subgroups ```{r fs-results_figure, fig.width=10, fig.height=7, fig.cap="Kaplan-Meier survival curves by identified subgroup"} km_result <- plot_sg_weighted_km( fs.est = fs, outcome.name = "time_months", event.name = "status", treat.name = "hormon", show.logrank = FALSE, conf.int = TRUE, by.risk = 12, show.cox = FALSE, show.cox.bc = TRUE, fs_bc = fs_bc, hr_bc_position = "topright" ) ``` ::: {.figure-note} `r figure_note(km_result, prefix = "**Note:** ", custom_text = "Medians [95% CI] for arms are un-adjusted.")` ::: ### Event Count Summary Low event counts can lead to unstable HR estimates. This summary helps identify potential issues: ```{r event-summary} # note that default required minimum events is 12 for subgroup candidate # Here we evaluate frequency of subgroup candidates in bootstrap samples less than 15 event_summary <- summarize_bootstrap_events(fs_bc, threshold = 15) ``` ### Bootstrap Diagnostics ```{r bootstrap-diagnostics} # Quality metrics summaries$diagnostics_table_gt ``` ### Subgroup Agreement How consistently does bootstrap identify the same subgroup? ```{r subgroup-agreement} # Agreement with original analysis if (!is.null(summaries$subgroup_summary$original_agreement)) { summaries$subgroup_summary$original_agreement } # Factor presence across bootstrap iterations if (!is.null(summaries$subgroup_summary$factor_presence)) { summaries$subgroup_summary$factor_presence } ``` ### Bootstrap Distributions ```{r bootstrap-plots, fig.width=10, fig.height=4, eval=TRUE} if (!is.null(summaries$plots)) { summaries$plots$H_distribution + summaries$plots$Hc_distribution } ``` # Cross-Validation Cross-validation assesses the **stability** of the ForestSearch algorithm. Two approaches are available: ## K-Fold Cross-Validation ```{r kfold-cv, eval = TRUE} # 10-fold CV with multiple iterations # Use Ksims >= 50 for production Ksims <- 1 t0 <- proc.time() fs_kfold <- forestsearch_tenfold( fs.est = fs, sims = Ksims, Kfolds = 2, details = FALSE, parallel_args = list( plan = "multisession", workers = n_cores, show_message = FALSE ) ) plan("sequential") timings$kfold <- (proc.time() - t0)["elapsed"] metrics_tables <- cv_metrics_tables(fs_kfold) metrics_tables ``` ## Out-of-Bag (N-Fold) Cross-Validation N-fold CV (leave-one-out): ```{r oob-cv, eval = FALSE} t0 <- proc.time() fs_OOB <- forestsearch_Kfold( fs.est = fs, details = FALSE, Kfolds = round(nrow(df.analysis)/100,0), # N-fold = leave-one-out parallel_args = list( plan = "multisession", workers = n_cores, show_message = TRUE ) ) plan("sequential") timings$oob <- (proc.time() - t0)["elapsed"] # Summarize OOB results cv_out <- forestsearch_KfoldOut( res = fs_OOB, details = FALSE, outall = TRUE ) tables <- cv_summary_tables(cv_out) tables$combined_table tables$metrics_table ``` # Results Visualization ## Forest Plot The forest plot summarizes treatment effects across the ITT population, reference subgroups, and identified subgroups with cross-validation metrics. ```{r forest-plot, fig.width=18, fig.height=12, fig.cap="Subgroup forest plot including identified subgroups"} # Define reference subgroups for comparison subgroups <- list( age_gt65 = list( subset_expr = "age > 65", name = "Age > 65", type = "reference" ), age_le65 = list( subset_expr = "age <= 65", name = "Age ≤ 65", type = "reference" ), pgr_positive = list( subset_expr = "pgr > 0", name = "PgR > 0", type = "reference" ), pgr_negative = list( subset_expr = "pgr <= 0", name = "PgR ≤ 0", type = "reference" ) ) my_theme <- create_forest_theme(base_size = 24, footnote_fontsize = 17, cv_fontsize = 22) # Create forest plot # Include fs_kfold and fs_OOB if available for CV metrics result <- plot_subgroup_results_forestplot( fs_results = list( fs.est = fs, fs_bc = fs_bc, fs_OOB = NULL, fs_kfold = fs_kfold ), df_analysis = df.analysis, subgroup_list = subgroups, outcome.name = outcome.name, event.name = event.name, treat.name = treat.name, E.name = "Hormonal", C.name = "Chemo", ci_column_spaces = 25, xlog = TRUE, theme = my_theme ) # Option 2: Custom sizing render_forestplot(result) ``` ### KM Difference plots: ITT and subgroups The solid black line denotes the ITT Kaplan-Meier treatment difference estimates along with 95%95% CIs (the grey shaded region). K-M differences corresponding to subgroups are displayed. ```{r KMdiffs, fig.width = 8, fig.height = 6, fig.align="center"} # Add additional subgroups along with ITT and identified subgroups ref_sgs <- list( age_young = list(subset_expr = "age < 65", color = "brown"), age_old = list(subset_expr = "age >= 65", color = "orange") ) plot_km_band_forestsearch( df = df.analysis, fs.est = fs, ref_subgroups = ref_sgs, outcome.name = outcome.name, event.name = event.name, treat.name = treat.name, draws_band = 20 ) # # Example with more subgroups # ref_sgs <- list( # pgr_positive = list(subset_expr = "pgr > 0", color ="green"), # pgr_negative = list(subset_expr = "pgr <= 0", color = "purple"), # age_young = list(subset_expr = "age < 65", color = "brown"), # age_old = list(subset_expr = "age >= 65", color = "orange") # ) ``` # Summary and Interpretation ## Key Findings ```{r summary-findings} # Extract key results cat("=" %>% rep(60) %>% paste(collapse = ""), "\n") cat("FORESTSEARCH ANALYSIS SUMMARY\n") cat("=" %>% rep(60) %>% paste(collapse = ""), "\n\n") cat("Dataset: GBSG (N =", nrow(df.analysis), ")\n") cat("Outcome: Recurrence-free survival\n\n") cat("ITT Analysis:\n") cat(" HR (95% CI): 0.69 (0.54, 0.89)\n\n") cat("Identified Subgroup (H):\n") cat(" Definition:", paste(fs$sg.harm, collapse = " & "), "\n") cat(" Size:", sum(fs$df.est$treat.recommend == 0), sprintf("(%.1f%%)\n", 100 * mean(fs$df.est$treat.recommend == 0))) cat(" Unadjusted HR:", sprintf("%.2f", exp(fs$grp.consistency$out_sg$result$hr[1])), "\n") cat("\nComplement Subgroup (Hc):\n") cat(" Size:", sum(fs$df.est$treat.recommend == 1), sprintf("(%.1f%%)\n", 100 * mean(fs$df.est$treat.recommend == 1))) ``` ## Clinical Interpretation The ForestSearch analysis identifies **estrogen receptor-negative (ER ≤ 0)** patients as a subgroup with potential lack of benefit from hormonal therapy. **Biological plausibility**: Tamoxifen is a selective estrogen receptor modulator. Its efficacy depends on ER expression. The finding that ER-negative patients may not benefit is consistent with: - Mechanistic understanding of tamoxifen action - Meta-analyses showing no tamoxifen benefit in ER-negative breast cancer - Clinical guidelines recommending tamoxifen primarily for ER-positive tumors **Caveats**: 1. This is an **exploratory** analysis requiring independent validation 2. The bias-corrected estimates have wider confidence intervals 3. Cross-validation metrics should be evaluated for algorithm stability ## Computational Timing ```{r timing-summary, echo=FALSE, code_fold=TRUE} timings$total <- (proc.time() - t_vignette_start)["elapsed"] timing_df <- data.frame( Analysis = c("GRF", "ForestSearch", "Bootstrap", "Total"), Seconds = c( timings$grf, timings$forestsearch, timings$bootstrap, timings$total ) ) timing_df$Minutes <- timing_df$Seconds / 60 gt(timing_df) |> tab_header(title = "Computational Timing") |> fmt_number(columns = c(Seconds, Minutes), decimals = 1) |> cols_label( Analysis = "Component", Seconds = "Time (sec)", Minutes = "Time (min)" ) ``` # References León LF, Jemielita T, Guo Z, Marceau West R, Anderson KM (2024). "Exploratory subgroup identification in the heterogeneous Cox model: A relatively simple procedure." *Statistics in Medicine*. DOI: 10.1002/sim.10163 # Session Information ```{r session-info} sessionInfo() ```