--- title: "MI Diagnostics and Pipeline Inspection" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{MI Diagnostics and Pipeline Inspection} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction The rbmi multiple imputation pipeline produces several intermediate objects -- draws, imputation, analysis, and pool -- each containing useful metadata about the imputation model and results. **rbmiUtils** v0.3.0 adds tools to inspect these objects and extract diagnostic statistics, making it easier to verify that the MI pipeline behaved as expected. This vignette covers three features: - `describe_draws()`: structured metadata from draws objects (method, formula, samples, MCMC convergence) - `describe_imputation()`: structured metadata from imputation objects (method, M, references, missingness breakdown) - `pool_to_ard()` MI diagnostic enrichment: fraction of missing information (FMI), lambda, relative increase in variance (RIV), and other Rubin's rules diagnostics embedded in the ARD output # Setup We load the required packages and prepare data for the `pool_to_ard()` diagnostic enrichment examples in Section 5. This setup uses `analyse_mi_data()` which works directly with the pre-imputed `ADMI` dataset (containing an `IMPID` column), so there is no need for `draws()` or `impute()`. ```{r setup-pipeline, message = FALSE, warning = FALSE} library(rbmiUtils) library(rbmi) library(dplyr) data("ADMI", package = "rbmiUtils") ADMI <- ADMI |> mutate( TRT = factor(TRT, levels = c("Placebo", "Drug A")), USUBJID = factor(USUBJID), AVISIT = factor(AVISIT) ) vars <- set_vars( subjid = "USUBJID", visit = "AVISIT", group = "TRT", outcome = "CHG", covariates = c("BASE", "STRATA", "REGION") ) method <- method_bayes( n_samples = 100, control = control_bayes(warmup = 200, thin = 5) ) ana_obj <- analyse_mi_data(ADMI, vars, method, fun = ancova) pool_obj <- pool(ana_obj) ``` # Inspecting Draws with `describe_draws()` The `describe_draws()` function extracts structured metadata from an rbmi `draws` object, providing a quick summary of the imputation model configuration and, for Bayesian methods, MCMC convergence diagnostics. The code below uses `ADEFF` data and defines its own `vars` and `method` objects. We use `eval = FALSE` because `draws()` runs MCMC sampling, which is too slow for vignette builds. 
# Inspecting Draws with `describe_draws()`

The `describe_draws()` function extracts structured metadata from an rbmi `draws` object, providing a quick summary of the imputation model configuration and, for Bayesian methods, MCMC convergence diagnostics.

The code below uses the `ADEFF` data and defines its own `vars` and `method` objects. We use `eval = FALSE` because `draws()` runs MCMC sampling, which is too slow for vignette builds.

```{r describe-draws-code, eval = FALSE}
data("ADEFF", package = "rbmiUtils")

ADEFF <- ADEFF |>
  mutate(
    TRT = factor(TRT01P, levels = c("Placebo", "Drug A")),
    USUBJID = factor(USUBJID),
    AVISIT = factor(AVISIT, levels = c("Week 24", "Week 48"))
  )

vars <- set_vars(
  subjid = "USUBJID",
  visit = "AVISIT",
  group = "TRT",
  outcome = "CHG",
  covariates = c("BASE", "STRATA", "REGION")
)

method <- method_bayes(
  n_samples = 100,
  control = control_bayes(warmup = 200, thin = 2)
)

dat <- ADEFF |>
  select(USUBJID, STRATA, REGION, TRT, BASE, CHG, AVISIT)

draws_obj <- draws(data = dat, vars = vars, method = method)

desc <- describe_draws(draws_obj)
print(desc)
```

Example output from `describe_draws()`:

```
-- Draws Summary --
Method: Bayesian (MCMC via Stan)
Formula: CHG ~ 1 + BASE + STRATA + REGION + TRT + AVISIT + TRT:AVISIT
Samples: 100
Failures: 0
Covariance: us
Same covariance across groups: Yes
--
-- MCMC Convergence --
v All Rhat < 1.1 (42 parameters)
  Max Rhat: 1.003
  Min ESS: 245.2
```

The returned object is a list with programmatic access to all fields:

- `$method` -- human-readable method name (e.g., "Bayesian (MCMC via Stan)")
- `$method_class` -- raw class: `"bayes"`, `"approxbayes"`, or `"condmean"`
- `$formula` -- the deparsed model formula string
- `$n_samples` -- total number of samples drawn
- `$n_failures` -- number of failed samples
- `$mcmc` -- (Bayesian only) list with `rhat`, `ess`, `max_rhat`, `min_ess`, `n_params`, `converged`

# Inspecting Imputations with `describe_imputation()`

The `describe_imputation()` function extracts metadata from an rbmi `imputation` object, including the method, number of imputations (M), reference arm mappings, and a missingness breakdown by visit and treatment arm.

This section continues from the `draws_obj` created in the code above (Section 3). Again, we use `eval = FALSE` because the pipeline requires MCMC.

```{r describe-imputation-code, eval = FALSE}
impute_obj <- impute(
  draws_obj,
  references = c("Placebo" = "Placebo", "Drug A" = "Placebo")
)

desc <- describe_imputation(impute_obj)
print(desc)
```

Example output from `describe_imputation()`:

```
-- Imputation Summary --
Method: Bayesian (MCMC via Stan)
Imputations (M): 100
Subjects: 200
--
-- References --
Placebo -> Placebo
Drug A -> Placebo

-- Missingness by Visit and Arm --
visit    group    n_total  n_miss  pct_miss
Week 24  Placebo      100       8       8.0
Week 24  Drug A       100      10      10.0
Week 48  Placebo      100      15      15.0
Week 48  Drug A       100      18      18.0
```

The returned object provides programmatic access to:

- `$method` -- human-readable method name
- `$n_imputations` -- number of imputations (M)
- `$n_subjects` -- total number of unique subjects
- `$references` -- named character vector of reference arm mappings (or `NULL`)
- `$missingness` -- a `data.frame` with columns `visit`, `group`, `n_total`, `n_miss`, `pct_miss`

# MI Diagnostic Statistics in ARD

The `pool_to_ard()` function converts a pool object to the pharmaverse Analysis Results Dataset (ARD) format. When you also pass the `analysis_obj`, it enriches the ARD with MI diagnostic statistics computed from Rubin's rules.

```{r ard-base-vs-enriched, eval = requireNamespace("cards", quietly = TRUE)}
# Base ARD (no diagnostics)
ard <- pool_to_ard(pool_obj)

# Enriched ARD with MI diagnostics
ard_enriched <- pool_to_ard(pool_obj, analysis_obj = ana_obj)
```

The enriched ARD includes additional rows for each parameter with diagnostic statistics.
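Each of these statistics is a simple function of the within- and between-imputation variance components from Rubin's rules. The sketch below uses made-up values for `M`, `W`, `B`, and `df` (illustrative only, not taken from `ard_enriched`) to show how the quantities relate:

```{r rubin-rules-sketch}
# Illustrative inputs (not derived from the objects above)
M  <- 100   # number of imputations
W  <- 0.80  # average within-imputation variance
B  <- 0.10  # between-imputation variance
df <- 150   # Barnard-Rubin adjusted degrees of freedom (treated as given here)

riv    <- (1 + 1 / M) * B / W               # relative increase in variance
lambda <- riv / (1 + riv)                   # proportion of total variance due to missingness
fmi    <- (riv + 2 / (df + 3)) / (1 + riv)  # fraction of missing information (mice convention)
re     <- 1 / (1 + fmi / M)                 # relative efficiency vs. infinite imputations

round(c(riv = riv, lambda = lambda, fmi = fmi, re = re), 3)
```

The same quantities appear as diagnostic rows in the enriched ARD.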
We can filter and display them:

```{r ard-diagnostics, eval = requireNamespace("cards", quietly = TRUE)}
ard_enriched |>
  dplyr::filter(stat_name %in% c("fmi", "lambda", "riv", "df.adjusted", "re")) |>
  dplyr::select(group1_level, variable_level, stat_name, stat)
```

Each diagnostic statistic has a specific interpretation:

- **FMI** (fraction of missing information) -- the adjusted proportion of total sampling variance attributable to missing data, following the mice convention: `(riv + 2/(df + 3)) / (1 + riv)`
- **lambda** -- the proportion of total variance attributable to between-imputation variance (missingness)
- **RIV** (relative increase in variance) -- the ratio of between-imputation variance to within-imputation variance, scaled by `(1 + 1/M)`
- **df.adjusted** -- the Barnard-Rubin adjusted degrees of freedom, accounting for finite complete-data degrees of freedom
- **re** (relative efficiency) -- `1 / (1 + fmi/M)`, the efficiency of the MI estimator relative to an estimator with infinite imputations

# When Diagnostics Are Not Available

Non-Rubin pooling methods (e.g., conditional mean with jackknife) do not produce MI diagnostic statistics because the variance decomposition does not apply. When `pool_to_ard()` is called with an `analysis_obj` from a non-Rubin method, it emits an informative message and omits the diagnostic rows from the ARD.

The `describe_draws()` and `describe_imputation()` functions work with all method types (Bayesian, approximate Bayesian, and conditional mean).

# Learn More

- [From rbmi Analysis to Regulatory Tables](pipeline.html) -- the full end-to-end pipeline vignette
- [`pool_to_ard()`](../reference/pool_to_ard.html) -- function documentation with ARD format details
- [`describe_draws()`](../reference/describe_draws.html) and [`describe_imputation()`](../reference/describe_imputation.html) -- function documentation with full field descriptions