--- title: "Data Preparation and Validation" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 2 number_sections: true vignette: > %\VignetteIndexEntry{Data Preparation and Validation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction This vignette demonstrates how to prepare and validate data before running multiple imputation with [`{rbmi}`](https://cran.r-project.org/package=rbmi). The `{rbmiUtils}` package provides three key functions for this workflow: * `validate_data()`: Pre-flight validation to catch common data issues * `prepare_data_ice()`: Build intercurrent event data from flag columns * `summarise_missingness()`: Understand missing data patterns Using these functions helps ensure your imputation will run successfully and gives you insight into the structure of your missing data. # Setup ```{r libraries, message = FALSE, warning = FALSE} library(dplyr) library(rbmi) library(rbmiUtils) ``` # Example Data We'll create a small example dataset to demonstrate the functions: ```{r example-data} set.seed(42) dat <- data.frame( USUBJID = factor(rep(paste0("SUBJ-", 1:20), each = 4)), AVISIT = factor( rep(c("Week 4", "Week 8", "Week 12", "Week 16"), 20), levels = c("Week 4", "Week 8", "Week 12", "Week 16") ), TRT = factor(rep(c("Placebo", "Drug A"), each = 40)), BASE = rep(round(rnorm(20, 50, 10), 1), each = 4), STRATA = factor(rep(sample(c("Low", "High"), 20, replace = TRUE), each = 4)) ) # Generate CHG with some missing values dat$CHG <- round(rnorm(80, mean = -2, sd = 3), 1) # Create missing data patterns: # - Subjects 3, 8: monotone dropout at Week 12 # - Subject 15: intermittent missing at Week 8 # - Subject 18: monotone dropout at Week 16 dat$CHG[dat$USUBJID == "SUBJ-3" & dat$AVISIT %in% c("Week 12", "Week 16")] <- NA dat$CHG[dat$USUBJID == "SUBJ-8" & dat$AVISIT %in% c("Week 12", "Week 16")] <- NA dat$CHG[dat$USUBJID == "SUBJ-15" & dat$AVISIT == "Week 8"] <- NA dat$CHG[dat$USUBJID == "SUBJ-18" & dat$AVISIT == "Week 16"] <- NA # Add discontinuation flag dat$DISCFL <- ifelse( dat$USUBJID %in% c("SUBJ-3", "SUBJ-8") & dat$AVISIT == "Week 12", "Y", ifelse( dat$USUBJID == "SUBJ-18" & dat$AVISIT == "Week 16", "Y", "N" ) ) head(dat, 12) ``` # Define Variables ```{r define-vars} vars <- set_vars( subjid = "USUBJID", visit = "AVISIT", group = "TRT", outcome = "CHG", covariates = c("BASE", "STRATA"), strategy = "STRATEGY" ) ``` # Validating Data The `validate_data()` function performs comprehensive checks on your data before imputation: ```{r validate} # This will pass validation validate_data(dat, vars) ``` The function checks: * Data is a data.frame * All required columns exist (subjid, visit, group, outcome, covariates) * Factor columns are properly typed * Outcome column is numeric * Covariates have no missing values * No duplicate subject-visit combinations * If `data_ice` is provided: valid subjects, visits, and strategies ## Catching Validation Errors Here's an example of how validation catches issues: ```{r validate-error, error = TRUE} # Create problematic data bad_dat <- dat bad_dat$TRT <- as.character(bad_dat$TRT) # Should be factor bad_dat$BASE[1] <- NA # Covariate with missing value # This will report all issues at once tryCatch( validate_data(bad_dat, vars), error = function(e) cat(e$message) ) ``` # Summarising Missing Data Before imputation, it's important to understand your missing data patterns: ```{r summarise-missing} miss <- summarise_missingness(dat, vars) ``` ## Missing by Visit ```{r by-visit} print(miss$by_visit) ``` ## Subject Patterns ```{r patterns} print(miss$patterns) ``` ## Summary by Treatment Group ```{r summary} print(miss$summary) ``` The three pattern types are: * **complete**: No missing outcome values * **monotone**: Once a value is missing, all subsequent visits are also missing (dropout) * **intermittent**: Missing values with observed values both before and after # Preparing ICE Data When subjects discontinue treatment, you may want to apply reference-based imputation strategies (see the [`{rbmi}` documentation](https://openpharma.github.io/rbmi/latest-tag/articles/quickstart.html) for details on intercurrent event handling). The `prepare_data_ice()` function builds the required `data_ice` data.frame from a discontinuation flag: ```{r prepare-ice} data_ice <- prepare_data_ice( data = dat, vars = vars, ice_col = "DISCFL", strategy = "JR" # Jump to Reference ) print(data_ice) ``` The function: * Identifies subjects with ICE flags (`"Y"`, `TRUE`, or `1`) * Takes the first flagged visit per subject * Assigns the specified imputation strategy Available strategies are: * `"MAR"`: Missing at Random * `"CR"`: Copy Reference * `"JR"`: Jump to Reference * `"CIR"`: Copy Increment from Reference * `"LMCF"`: Last Mean Carried Forward # Complete Workflow Here's how these functions fit into a typical [`{rbmi}`](https://cran.r-project.org/package=rbmi) workflow: ```{r workflow, eval = FALSE} library(rbmi) library(rbmiUtils) # 1. Validate data validate_data(dat, vars) # 2. Understand missing patterns miss <- summarise_missingness(dat, vars) print(miss$summary) # 3. Prepare ICE data if needed data_ice <- prepare_data_ice(dat, vars, ice_col = "DISCFL", strategy = "JR") # 4. Define method method <- method_bayes( n_samples = 100, control = control_bayes(warmup = 200, thin = 2) ) # 5. Run imputation draws_obj <- draws( data = dat, vars = vars, data_ice = data_ice, method = method ) # 6. Continue with impute() and analyse() ``` # Summary The data preparation functions in `{rbmiUtils}` help you: 1. **Catch issues early** with `validate_data()` before running time-consuming imputations 2. **Understand your data** with `summarise_missingness()` to characterize missing data patterns 3. **Simplify ICE handling** with `prepare_data_ice()` to build `data_ice` from flag columns These utilities complement the core [`{rbmi}`](https://cran.r-project.org/package=rbmi) package and support reproducible, well-documented analysis workflows. After data preparation, see `vignette('pipeline')` for the complete analysis workflow from imputation through to regulatory tables.