---
title: "Deriving Endpoints from Imputed Data"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    number_sections: true
vignette: >
  %\VignetteIndexEntry{Deriving Endpoints from Imputed Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE
)
```

# Introduction

Once the rbmi pipeline has been run and imputed datasets are available, the same imputed data can be reused to analyse derived endpoints without re-running the computationally expensive `draws()` and `impute()` steps. This is particularly useful for binary responder endpoints, where response is defined by whether a subject's outcome crosses a pre-specified threshold.

This vignette demonstrates how to define binary responder endpoints from imputed continuous data and analyse them using rbmi's `analyse()` / `pool()` machinery via **rbmiUtils** helper functions. Two types of responder definitions are covered:

1. **Threshold-based responder** -- using the pre-derived `CRIT1FLN` column (CHG > 3) already present in the imputed data.
2. **Clinical cutoff responder** -- deriving a new binary variable from the continuous `CHG` column using a higher threshold (CHG > 5), demonstrating the flexibility of imputed data reuse.

For the full rbmi workflow including `draws()`, `impute()`, and continuous ANCOVA analysis, see the [pipeline vignette](pipeline.html). This vignette assumes familiarity with rbmi core concepts (draws, impute, analyse, pool).

# Prerequisites and Setup

We need **rbmi** for the analysis/pool infrastructure, **rbmiUtils** for the analysis helpers and reporting functions, and **dplyr** for data manipulation.

```{r libraries, message = FALSE, warning = FALSE}
library(rbmi)
library(rbmiUtils)
library(dplyr)
```

Next, we load the pre-built `ADMI` dataset bundled with rbmiUtils.
This dataset contains 100 imputed copies of a simulated two-arm clinical trial, with continuous change-from-baseline outcomes (`CHG`) and a pre-derived binary responder variable (`CRIT1FLN`, defined as `CHG > 3`).

```{r load-data}
data("ADMI", package = "rbmiUtils")
```

Before analysis, we convert the key grouping columns to factors. Factor levels control the ordering of treatment arms and visits, so it is important to set them explicitly.

```{r factor-prep}
ADMI <- ADMI |>
  mutate(
    TRT = factor(TRT, levels = c("Placebo", "Drug A")),
    USUBJID = factor(USUBJID),
    AVISIT = factor(AVISIT),
    STRATA = factor(STRATA),
    REGION = factor(REGION)
  )
```

We also define the analysis variables for the binary responder endpoint. The outcome is `CRIT1FLN` (the numeric 0/1 responder flag), and we adjust for baseline, stratification, and region.

```{r define-vars}
vars_binary <- set_vars(
  subjid = "USUBJID",
  visit = "AVISIT",
  group = "TRT",
  outcome = "CRIT1FLN",
  covariates = c("BASE", "STRATA", "REGION")
)
```

Finally, we specify the method object. This must match the imputation method originally used to create the imputed datasets -- here, Bayesian MI with 100 samples.

```{r define-method}
method <- method_bayes(
  n_samples = 100,
  control = control_bayes(warmup = 200, thin = 2)
)
```

# Threshold-Based Responder (CHG > 3)

The `ADMI` dataset already contains the `CRIT1FLN` column, which flags subjects as responders (`1`) if their change from baseline exceeds 3, and non-responders (`0`) otherwise. We can verify this:

> **Interpreting the results:** In this simulated dataset, positive CHG values
> represent worsening (an increase in symptom score). Therefore CHG > 3 defines
> "worsening responders" -- subjects whose symptoms deteriorated by more than 3
> points. A lower responder rate for Drug A compared to Placebo indicates that
> fewer patients on the active treatment experienced clinically meaningful
> worsening, which is evidence of drug efficacy.
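To ground this interpretation, we can tabulate the observed responder rates in the imputed data by arm and visit. This is a descriptive check only (it averages over all imputed copies rather than applying Rubin's rules) and uses only columns already present in `ADMI`:

```{r crit1fln-rates}
# Raw CHG > 3 responder rates, pooled naively across imputations
ADMI |>
  group_by(TRT, AVISIT) |>
  summarise(rate = mean(CRIT1FLN), .groups = "drop")
```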
```{r verify-crit1fln}
# CRIT stores the label of the pre-derived responder criterion
ADMI |>
  distinct(CRIT) |>
  pull(CRIT)
```

## Analyse

We use `analyse_mi_data()` with `gcomp_responder_multi()` as the analysis function. This applies g-computation via logistic regression at each visit, estimating covariate-adjusted marginal treatment effects using the method of Ge et al. (implemented in the beeca package).

```{r threshold-analyse, message = FALSE, warning = FALSE}
ana_obj <- analyse_mi_data(
  data = ADMI,
  vars = vars_binary,
  method = method,
  fun = gcomp_responder_multi,
  reference_levels = "Placebo"
)
```

## Pool

Pool the per-imputation results using Rubin's rules:

```{r threshold-pool}
pool_obj <- pool(ana_obj)
```

## Results

The `tidy_pool_obj()` function converts the pool object into a tidy tibble with clearly labelled columns for estimates, standard errors, confidence intervals, and p-values.

```{r threshold-tidy}
tidy_pool_obj(pool_obj)
```

The efficacy table presents the results in a regulatory-style format:

```{r threshold-table, eval = requireNamespace("gt", quietly = TRUE)}
efficacy_table(
  pool_obj,
  title = "Responder Analysis: CHG > 3",
  subtitle = "G-computation with Marginal Effects (Ge et al.)",
  arm_labels = c(ref = "Placebo", alt = "Drug A")
)
```

# Clinical Cutoff Responder (CHG > 5)

The key advantage of working with imputed continuous data is the flexibility to derive new binary endpoints without re-running the imputation model. Here, we define a more stringent threshold of CHG > 5, corresponding to a greater degree of worsening.
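Before deriving the new flag, a quick look at the distribution of the continuous `CHG` values helps situate the cutoff (a simple descriptive summary across all imputed copies):

```{r chg-distribution}
# Where does CHG > 5 sit within the observed/imputed range?
summary(ADMI$CHG)
```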
## Derive the New Endpoint

We create a new binary column `RESP5` directly from the continuous `CHG` values in the imputed data:

```{r cutoff-derive}
ADMI_cutoff <- ADMI |>
  mutate(RESP5 = as.numeric(CHG > 5))
```

We can inspect the responder rates by treatment arm and visit:

```{r cutoff-rates}
ADMI_cutoff |>
  group_by(TRT, AVISIT) |>
  summarise(
    n = n(),
    responders = sum(RESP5),
    rate = mean(RESP5),
    .groups = "drop"
  )
```

## Analyse

We define new analysis variables pointing to the `RESP5` outcome and repeat the analysis:

```{r cutoff-vars}
vars_cutoff <- set_vars(
  subjid = "USUBJID",
  visit = "AVISIT",
  group = "TRT",
  outcome = "RESP5",
  covariates = c("BASE", "STRATA", "REGION")
)
```

```{r cutoff-analyse, message = FALSE, warning = FALSE}
ana_obj_cutoff <- analyse_mi_data(
  data = ADMI_cutoff,
  vars = vars_cutoff,
  method = method,
  fun = gcomp_responder_multi,
  reference_levels = "Placebo"
)
```

## Pool and Display

```{r cutoff-pool}
pool_obj_cutoff <- pool(ana_obj_cutoff)
```

```{r cutoff-tidy}
tidy_pool_obj(pool_obj_cutoff)
```

```{r cutoff-table, eval = requireNamespace("gt", quietly = TRUE)}
efficacy_table(
  pool_obj_cutoff,
  title = "Responder Analysis: CHG > 5",
  subtitle = "G-computation with Marginal Effects (Ge et al.)",
  arm_labels = c(ref = "Placebo", alt = "Drug A")
)
```

# Storing Results as ARD

The Analysis Results Dataset (ARD) format from the pharmaverse provides a standardised long-format representation suitable for downstream use with tools like gtsummary. The `pool_to_ard()` function converts a pool object into this format:

```{r ard-conversion, message = FALSE, warning = FALSE, eval = requireNamespace("cards", quietly = TRUE)}
ard <- pool_to_ard(pool_obj)
print(ard)
```

Each row in the ARD represents a single statistic (estimate, standard error, confidence interval bound, or p-value) for a given parameter, with grouping columns for visit, parameter type, and least-squares-mean type. This format integrates directly into pharmaverse reporting workflows.
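Because both responder flags live in the same imputed dataset, the two definitions can also be compared descriptively side by side. This is a naive average over imputations for orientation only; the pooled estimates above remain the formal results:

```{r threshold-comparison}
# Raw responder rates under the two thresholds, by arm
ADMI_cutoff |>
  group_by(TRT) |>
  summarise(
    rate_chg3 = mean(CRIT1FLN),
    rate_chg5 = mean(RESP5),
    .groups = "drop"
  )
```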
# Caveats

When deriving binary responder endpoints from multiply imputed continuous data, keep the following considerations in mind:

- **Imputation model assumptions carry forward.** The binary endpoints are derived from continuous values that were imputed under a specific model (e.g., Bayesian MMRM with jump-to-reference). The validity of the responder analysis depends on the appropriateness of that continuous imputation model.
- **Pre-specify responder thresholds.** Responder definitions and their thresholds should be documented in the statistical analysis plan before unblinding. Post-hoc threshold selection risks inflating type I error.
- **Results are conditional on reference-based assumptions.** The imputed values -- and therefore the derived responder status -- reflect the chosen reference-based assumption (e.g., jump-to-reference, copy-reference). Different assumptions will produce different responder rates.