---
title: "Exploratory Subgroup Identification with ForestSearch"
subtitle: "German Breast Cancer Study Group (GBSG) Analysis"
author: "Larry F. León"
date: "`r Sys.Date()`"
output:
html_document:
toc: true
toc_depth: 3
number_sections: true
code_folding: hide
theme: cosmo
highlight: tango
fig_width: 8
fig_height: 6
vignette: >
%\VignetteIndexEntry{Exploratory Subgroup Identification with ForestSearch}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
warning = FALSE,
message = FALSE,
fig.align = "center",
fig.retina = 2,
fig.width = 8,
fig.height = 6
)
# Bootstrap 5 code-folding hook for pkgdown compatibility
# Usage: add `echo=FALSE, code_fold=TRUE` to any chunk header
knitr::knit_hooks$set(code_fold = function(before, options, envir) {
if (!before && isTRUE(options$code_fold)) {
id <- gsub("[^a-zA-Z0-9]", "", options$label)
code_lines <- knitr::knit_code$get(options$label)
code_text <- htmltools::htmlEscape(paste(code_lines, collapse = "\n"))
sprintf(
'
Code
\n',
id, id, id, code_text
)
}
})
# Initialize timing
timings <- list()
t_vignette_start <- proc.time()
```
# Introduction
This vignette demonstrates the **ForestSearch** methodology for exploratory
subgroup identification in survival analysis, as described in
León et al. (2024) *Statistics in Medicine*.
## Motivation
In clinical trials, particularly oncology, subgroup analyses are essential for:
- Evaluating treatment effect consistency across patient populations
- Identifying subgroups where treatment may be detrimental (harm)
- Characterizing subgroups with enhanced benefit
- Informing regulatory decisions and clinical practice
While prespecified subgroups provide stronger evidence, important subgroups
based on patient characteristics may not be anticipated. ForestSearch provides
a principled approach to **exploratory** subgroup identification with proper
statistical inference.
## Methodology Overview
ForestSearch identifies subgroups through:
1. **Candidate factor selection**: Using LASSO and/or Generalized Random Forests (GRF)
2. **Exhaustive subgroup search**: Evaluating all combinations up to `maxk` factors
3. **Consistency-based selection**: Applying splitting consistency criteria
4. **Bootstrap bias correction**: Adjusting for selection-induced optimism
5. **Cross-validation**: Assessing algorithm stability
The key innovation is the **splitting consistency criterion**: a subgroup is
considered "consistent with harm" if, when randomly split 50/50 many times,
both halves consistently show hazard ratios ≥ 1.0 (for example if 1.0 represents a meaningful "harm threshold").
# Setup
## Load Required Packages
```{r load-packages}
library(forestsearch)
library(survival)
library(data.table)
library(ggplot2)
library(gt)
library(grf)
library(policytree)
library(doFuture)
# Optional packages for enhanced output
library(patchwork)
library(weightedsurv)
# Set ggplot theme
theme_set(theme_minimal(base_size = 12))
```
# Data: German Breast Cancer Study Group Trial
## Study Background
The GBSG trial evaluated hormonal treatment (tamoxifen) versus chemotherapy
in node-positive breast cancer patients. Key characteristics:
- **Sample size**: N = 686
- **Outcome**: Recurrence-free survival time
- **Censoring rate**: ~56%
- **Treatment**: Hormonal therapy (tamoxifen) vs. chemotherapy
## Data Preparation
```{r data-setup}
# Load GBSG data (included in forestsearch package)
df.analysis <- gbsg
# Prepare analysis variables
df.analysis <- within(df.analysis, {
id <- seq_len(nrow(df.analysis))
time_months <- rfstime / 30.4375
grade3 <- ifelse(grade == "3", 1, 0)
treat <- hormon
})
# Define variable roles
confounders.name <- c("age", "meno", "size", "grade3", "nodes", "pgr", "er")
outcome.name <- "time_months"
event.name <- "status"
id.name <- "id"
treat.name <- "hormon"
# Display data structure
cat("Sample size:", nrow(df.analysis), "\n")
cat("Events:", sum(df.analysis[[event.name]]),
sprintf("(%.1f%%)\n", 100 * mean(df.analysis[[event.name]])))
cat("Baseline factors:", paste(confounders.name, collapse = ", "), "\n")
```
## Baseline Characteristics
```{r baseline-table}
create_summary_table(
data = df.analysis,
treat_var = treat.name,
table_title = "GBSG Baseline Characteristics by Treatment Arm",
vars_continuous = c("age", "nodes", "size", "er", "pgr"),
vars_categorical = c("grade", "meno"),
font_size = 12
)
```
## Kaplan-Meier Analysis (ITT Population)
```{r km-itt, fig.width=8, fig.height=5}
# Prepare counting process data for KM plot
dfcount <- df_counting(
df = df.analysis,
by.risk = 6,
tte.name = outcome.name,
event.name = event.name,
treat.name = treat.name
)
# Plot with confidence intervals and log-rank test
plot_weighted_km(
dfcount,
conf.int = TRUE,
show.logrank = TRUE,
ymax = 1.05,
xmed.fraction = 0.775,
ymed.offset = 0.125
)
```
The ITT Cox hazard ratio estimate is approximately 0.69 (95% CI: 0.54, 0.89),
suggesting an overall benefit for hormonal therapy.
# Preliminary Analysis: Generalized Random Forests
Before running ForestSearch, we can use GRF to explore potential treatment
effect heterogeneity and identify candidate factors.
```{r grf-analysis}
t0 <- proc.time()
grf_est <- grf.subg.harm.survival(
data = df.analysis,
confounders.name = confounders.name,
outcome.name = outcome.name,
event.name = event.name,
id.name = id.name,
treat.name = treat.name,
maxdepth = 2,
n.min = 60,
dmin.grf = 12,
frac.tau = 0.6,
details = TRUE,
return_selected_cuts_only = FALSE
)
timings$grf <- (proc.time() - t0)["elapsed"]
```
```{r grf-trees, fig.width=10, fig.height=4}
# Display policy trees
# leaf1 = recommend control, leaf2 = recommend treatment
oldpar <- par(mfrow = c(1, 2))
plot(grf_est$tree1, leaf.labels = c("Control", "Treat"), main = "Depth 1")
plot(grf_est$tree2, leaf.labels = c("Control", "Treat"), main = "Depth 2")
par(oldpar)
```
GRF identifies estrogen receptor status (ER) as a key factor, with ER ≤ 0
suggesting potential harm from hormonal therapy.
# ForestSearch Analysis
## Parallel Processing Configuration
ForestSearch supports parallel processing for computationally intensive
operations (bootstrap, cross-validation).
```{r parallel-setup}
# Detect available cores (limited to 2 cores for CRAN checks)
n_cores <- 2
n_cores_total <- parallel::detectCores()
cat("Using", n_cores, "of", n_cores_total, "total cores for parallel processing")
```
## Running ForestSearch
ForestSearch performs an exhaustive search over candidate subgroup combinations
with up to `maxk` factors. Key parameters:
| Parameter | Value | Description |
|-----------|-------|-------------|
| `hr.threshold` | 1.25 | Minimum HR for consistency evaluation |
| `hr.consistency` | 1.0 | Minimum consistency rate for candidates |
| `pconsistency.threshold` | 0.90 | Required consistency for selection |
| `maxk` | 2 | Maximum factors in subgroup definition |
| `n.min` | 60 | Minimum subgroup sample size |
| `d0.min`, `d1.min` | 12 | Minimum events per treatment arm |
```{r forestsearch-main, fig.width=10, fig.height=7}
t0 <- proc.time()
fs <- forestsearch(
df.analysis,
confounders.name = confounders.name,
outcome.name = outcome.name,
treat.name = treat.name,
event.name = event.name,
id.name = id.name,
# Threshold parameters (per León et al. 2024)
hr.threshold = 1.25,
hr.consistency = 1.0,
pconsistency.threshold = 0.80,
stop_threshold = 0.80,
# Search configuration
sg_focus = "hr",
max_subgroups_search = 3,
use_twostage = TRUE,
# Factor selection
use_grf = TRUE,
return_selected_cuts_only = TRUE,
use_lasso = TRUE,
cut_type = "default",
# Subgroup constraints
maxk = 2,
n.min = 60,
d0.min = 12,
d1.min = 12,
# Consistency evaluation
fs.splits = 100,
# Parallel processing
parallel_args = list(
plan = "multisession",
workers = n_cores,
show_message = TRUE
),
# Output options
showten_subgroups = TRUE,
details = TRUE,
plot.sg = TRUE
)
plan("sequential")
timings$forestsearch <- (proc.time() - t0)["elapsed"]
cat("\nForestSearch completed in",
round(timings$forestsearch, 1), "seconds\n")
```
## ForestSearch Results
### Identified Subgroups
```{r fs-results}
# Generate results tables
res_tabs <- sg_tables(fs, ndecimals = 3, which_df = "est")
# Display top subgroups meeting criteria
res_tabs$sg10_out
```
### Treatment Effect Estimates
```{r fs-estimates}
# ITT and subgroup estimates
res_tabs$tab_estimates
```
### Identified Subgroup Definition
```{r fs-subgroup}
cat("Identified subgroup (H):", paste(fs$sg.harm, collapse = " & "), "\n")
cat("Subgroup size:", sum(fs$df.est$treat.recommend == 0),
sprintf("(%.1f%% of ITT)\n",
100 * mean(fs$df.est$treat.recommend == 0)))
```
ForestSearch identifies **Estrogen ≤ 0** (ER-negative) as the subgroup with
potential harm. This is biologically plausible: tamoxifen is a selective
estrogen receptor modulator with limited efficacy in ER-negative tumors.
# Bootstrap Bias Correction
## Rationale
Cox model estimates from identified subgroups are **upwardly biased** due to
the selection process (subgroups are selected *because* they show extreme effects).
Bootstrap bias correction addresses this by:
1. Resampling with replacement
2. Re-running the entire ForestSearch algorithm
3. Computing bias terms from bootstrap vs. observed estimates
4. Applying infinitesimal jackknife variance estimation
## Running Bootstrap Analysis
```{r bootstrap, eval=TRUE}
# Number of bootstrap iterations
# Use 500-2000 for production; reduced here for vignette
NB <- 2
t0 <- proc.time()
fs_bc <- forestsearch_bootstrap_dofuture(
fs.est = fs,
nb_boots = NB,
show_three = FALSE,
details = FALSE,
parallel_args = list(
plan = "multisession",
workers = n_cores,
show_message = TRUE
)
)
plan("sequential")
timings$bootstrap <- (proc.time() - t0)["elapsed"]
cat("\nBootstrap completed in",
round(timings$bootstrap / 60, 1), "minutes\n")
```
## Bootstrap Summary and Diagnostics
```{r bootstrap-summary}
# Comprehensive summary with diagnostics
summaries <- summarize_bootstrap_results(
sgharm = fs$sg.harm,
boot_results = fs_bc,
create_plots = TRUE,
est.scale = "hr"
)
# Display bias-corrected estimates table
summaries$table
```
## Kaplan-Meier by Identified Subgroups
```{r fs-results_figure, fig.width=10, fig.height=7, fig.cap="Kaplan-Meier survival curves by identified subgroup"}
km_result <- plot_sg_weighted_km(
fs.est = fs,
outcome.name = "time_months",
event.name = "status",
treat.name = "hormon",
show.logrank = FALSE,
conf.int = TRUE,
by.risk = 12,
show.cox = FALSE, show.cox.bc = TRUE,
fs_bc = fs_bc,
hr_bc_position = "topright"
)
```
::: {.figure-note}
`r figure_note(km_result, prefix = "**Note:** ", custom_text = "Medians [95% CI] for arms are un-adjusted.")`
:::
### Event Count Summary
Low event counts can lead to unstable HR estimates. This summary helps
identify potential issues:
```{r event-summary}
# note that default required minimum events is 12 for subgroup candidate
# Here we evaluate frequency of subgroup candidates in bootstrap samples less than 15
event_summary <- summarize_bootstrap_events(fs_bc, threshold = 15)
```
### Bootstrap Diagnostics
```{r bootstrap-diagnostics}
# Quality metrics
summaries$diagnostics_table_gt
```
### Subgroup Agreement
How consistently does bootstrap identify the same subgroup?
```{r subgroup-agreement}
# Agreement with original analysis
if (!is.null(summaries$subgroup_summary$original_agreement)) {
summaries$subgroup_summary$original_agreement
}
# Factor presence across bootstrap iterations
if (!is.null(summaries$subgroup_summary$factor_presence)) {
summaries$subgroup_summary$factor_presence
}
```
### Bootstrap Distributions
```{r bootstrap-plots, fig.width=10, fig.height=4, eval=TRUE}
if (!is.null(summaries$plots)) {
summaries$plots$H_distribution + summaries$plots$Hc_distribution
}
```
# Cross-Validation
Cross-validation assesses the **stability** of the ForestSearch algorithm.
Two approaches are available:
## K-Fold Cross-Validation
```{r kfold-cv, eval = TRUE}
# 10-fold CV with multiple iterations
# Use Ksims >= 50 for production
Ksims <- 1
t0 <- proc.time()
fs_kfold <- forestsearch_tenfold(
fs.est = fs,
sims = Ksims,
Kfolds = 2,
details = FALSE,
parallel_args = list(
plan = "multisession",
workers = n_cores,
show_message = FALSE
)
)
plan("sequential")
timings$kfold <- (proc.time() - t0)["elapsed"]
metrics_tables <- cv_metrics_tables(fs_kfold)
metrics_tables
```
## Out-of-Bag (N-Fold) Cross-Validation
N-fold CV (leave-one-out):
```{r oob-cv, eval = FALSE}
t0 <- proc.time()
fs_OOB <- forestsearch_Kfold(
fs.est = fs,
details = FALSE,
Kfolds = round(nrow(df.analysis)/100,0), # N-fold = leave-one-out
parallel_args = list(
plan = "multisession",
workers = n_cores,
show_message = TRUE
)
)
plan("sequential")
timings$oob <- (proc.time() - t0)["elapsed"]
# Summarize OOB results
cv_out <- forestsearch_KfoldOut(
res = fs_OOB,
details = FALSE,
outall = TRUE
)
tables <- cv_summary_tables(cv_out)
tables$combined_table
tables$metrics_table
```
# Results Visualization
## Forest Plot
The forest plot summarizes treatment effects across the ITT population,
reference subgroups, and identified subgroups with cross-validation metrics.
```{r forest-plot, fig.width=18, fig.height=12, fig.cap="Subgroup forest plot including identified subgroups"}
# Define reference subgroups for comparison
subgroups <- list(
age_gt65 = list(
subset_expr = "age > 65",
name = "Age > 65",
type = "reference"
),
age_le65 = list(
subset_expr = "age <= 65",
name = "Age ≤ 65",
type = "reference"
),
pgr_positive = list(
subset_expr = "pgr > 0",
name = "PgR > 0",
type = "reference"
),
pgr_negative = list(
subset_expr = "pgr <= 0",
name = "PgR ≤ 0",
type = "reference"
)
)
my_theme <- create_forest_theme(base_size = 24,
footnote_fontsize = 17, cv_fontsize = 22)
# Create forest plot
# Include fs_kfold and fs_OOB if available for CV metrics
result <- plot_subgroup_results_forestplot(
fs_results = list(
fs.est = fs,
fs_bc = fs_bc,
fs_OOB = NULL,
fs_kfold = fs_kfold
),
df_analysis = df.analysis,
subgroup_list = subgroups,
outcome.name = outcome.name,
event.name = event.name,
treat.name = treat.name,
E.name = "Hormonal",
C.name = "Chemo",
ci_column_spaces = 25,
xlog = TRUE,
theme = my_theme
)
# Option 2: Custom sizing
render_forestplot(result)
```
### KM Difference plots: ITT and subgroups
The solid black line denotes the ITT Kaplan-Meier treatment difference estimates along with 95%95% CIs (the grey shaded region). K-M differences corresponding to subgroups are displayed.
```{r KMdiffs, fig.width = 8, fig.height = 6, fig.align="center"}
# Add additional subgroups along with ITT and identified subgroups
ref_sgs <- list(
age_young = list(subset_expr = "age < 65", color = "brown"),
age_old = list(subset_expr = "age >= 65", color = "orange")
)
plot_km_band_forestsearch(
df = df.analysis,
fs.est = fs,
ref_subgroups = ref_sgs,
outcome.name = outcome.name,
event.name = event.name,
treat.name = treat.name,
draws_band = 20
)
# # Example with more subgroups
# ref_sgs <- list(
# pgr_positive = list(subset_expr = "pgr > 0", color ="green"),
# pgr_negative = list(subset_expr = "pgr <= 0", color = "purple"),
# age_young = list(subset_expr = "age < 65", color = "brown"),
# age_old = list(subset_expr = "age >= 65", color = "orange")
# )
```
# Summary and Interpretation
## Key Findings
```{r summary-findings}
# Extract key results
cat("=" %>% rep(60) %>% paste(collapse = ""), "\n")
cat("FORESTSEARCH ANALYSIS SUMMARY\n")
cat("=" %>% rep(60) %>% paste(collapse = ""), "\n\n")
cat("Dataset: GBSG (N =", nrow(df.analysis), ")\n")
cat("Outcome: Recurrence-free survival\n\n")
cat("ITT Analysis:\n")
cat(" HR (95% CI): 0.69 (0.54, 0.89)\n\n")
cat("Identified Subgroup (H):\n")
cat(" Definition:", paste(fs$sg.harm, collapse = " & "), "\n")
cat(" Size:", sum(fs$df.est$treat.recommend == 0),
sprintf("(%.1f%%)\n", 100 * mean(fs$df.est$treat.recommend == 0)))
cat(" Unadjusted HR:", sprintf("%.2f", exp(fs$grp.consistency$out_sg$result$hr[1])), "\n")
cat("\nComplement Subgroup (Hc):\n")
cat(" Size:", sum(fs$df.est$treat.recommend == 1),
sprintf("(%.1f%%)\n", 100 * mean(fs$df.est$treat.recommend == 1)))
```
## Clinical Interpretation
The ForestSearch analysis identifies **estrogen receptor-negative (ER ≤ 0)**
patients as a subgroup with potential lack of benefit from hormonal therapy.
**Biological plausibility**: Tamoxifen is a selective estrogen receptor
modulator. Its efficacy depends on ER expression. The finding that ER-negative
patients may not benefit is consistent with:
- Mechanistic understanding of tamoxifen action
- Meta-analyses showing no tamoxifen benefit in ER-negative breast cancer
- Clinical guidelines recommending tamoxifen primarily for ER-positive tumors
**Caveats**:
1. This is an **exploratory** analysis requiring independent validation
2. The bias-corrected estimates have wider confidence intervals
3. Cross-validation metrics should be evaluated for algorithm stability
## Computational Timing
```{r timing-summary, echo=FALSE, code_fold=TRUE}
timings$total <- (proc.time() - t_vignette_start)["elapsed"]
timing_df <- data.frame(
Analysis = c("GRF", "ForestSearch", "Bootstrap", "Total"),
Seconds = c(
timings$grf,
timings$forestsearch,
timings$bootstrap,
timings$total
)
)
timing_df$Minutes <- timing_df$Seconds / 60
gt(timing_df) |>
tab_header(title = "Computational Timing") |>
fmt_number(columns = c(Seconds, Minutes), decimals = 1) |>
cols_label(
Analysis = "Component",
Seconds = "Time (sec)",
Minutes = "Time (min)"
)
```
# References
León LF, Jemielita T, Guo Z, Marceau West R, Anderson KM (2024).
"Exploratory subgroup identification in the heterogeneous Cox model:
A relatively simple procedure." *Statistics in Medicine*.
DOI: 10.1002/sim.10163
# Session Information
```{r session-info}
sessionInfo()
```