---
title: "Data Preparation and Validation"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
    number_sections: true
vignette: >
  %\VignetteIndexEntry{Data Preparation and Validation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
 collapse = TRUE,
 comment = "#>"
)
```

# Introduction
This vignette demonstrates how to prepare and validate data before running multiple imputation with [`{rbmi}`](https://cran.r-project.org/package=rbmi). The `{rbmiUtils}` package provides three key functions for this workflow:

* `validate_data()`: Pre-flight validation to catch common data issues
* `prepare_data_ice()`: Build intercurrent event data from flag columns
* `summarise_missingness()`: Understand missing data patterns

Using these functions helps ensure your imputation will run successfully and gives you insight into the structure of your missing data.

# Setup

```{r libraries, message = FALSE, warning = FALSE}
library(dplyr)
library(rbmi)
library(rbmiUtils)
```

# Example Data

We'll create a small example dataset to demonstrate the functions:

```{r example-data}
set.seed(42)

dat <- data.frame(
 USUBJID = factor(rep(paste0("SUBJ-", 1:20), each = 4)),
 AVISIT = factor(
   rep(c("Week 4", "Week 8", "Week 12", "Week 16"), 20),
   levels = c("Week 4", "Week 8", "Week 12", "Week 16")
 ),
 TRT = factor(rep(c("Placebo", "Drug A"), each = 40)),
 BASE = rep(round(rnorm(20, 50, 10), 1), each = 4),
 STRATA = factor(rep(sample(c("Low", "High"), 20, replace = TRUE), each = 4))
)

# Generate CHG with some missing values
dat$CHG <- round(rnorm(80, mean = -2, sd = 3), 1)

# Create missing data patterns:
# - Subjects 3, 8: monotone dropout at Week 12
# - Subject 15: intermittent missing at Week 8
# - Subject 18: monotone dropout at Week 16
dat$CHG[dat$USUBJID == "SUBJ-3" & dat$AVISIT %in% c("Week 12", "Week 16")] <- NA
dat$CHG[dat$USUBJID == "SUBJ-8" & dat$AVISIT %in% c("Week 12", "Week 16")] <- NA
dat$CHG[dat$USUBJID == "SUBJ-15" & dat$AVISIT == "Week 8"] <- NA
dat$CHG[dat$USUBJID == "SUBJ-18" & dat$AVISIT == "Week 16"] <- NA

# Add discontinuation flag
dat$DISCFL <- ifelse(
 dat$USUBJID %in% c("SUBJ-3", "SUBJ-8") & dat$AVISIT == "Week 12",
 "Y",
 ifelse(
   dat$USUBJID == "SUBJ-18" & dat$AVISIT == "Week 16",
   "Y",
   "N"
 )
)

head(dat, 12)
```

# Define Variables

```{r define-vars}
vars <- set_vars(
 subjid = "USUBJID",
 visit = "AVISIT",
 group = "TRT",
 outcome = "CHG",
 covariates = c("BASE", "STRATA"),
 strategy = "STRATEGY"
)
```

# Validating Data

The `validate_data()` function performs comprehensive checks on your data before imputation:

```{r validate}
# This will pass validation
validate_data(dat, vars)
```

The function checks:

* Data is a data.frame
* All required columns exist (subjid, visit, group, outcome, covariates)
* Factor columns are properly typed
* Outcome column is numeric
* Covariates have no missing values
* No duplicate subject-visit combinations
* If `data_ice` is provided: valid subjects, visits, and strategies

## Catching Validation Errors

Here's an example of how validation catches issues:
```{r validate-error, error = TRUE}
# Create problematic data
bad_dat <- dat
bad_dat$TRT <- as.character(bad_dat$TRT)  # Should be factor
bad_dat$BASE[1] <- NA  # Covariate with missing value

# This will report all issues at once
tryCatch(
 validate_data(bad_dat, vars),
 error = function(e) cat(e$message)
)
```

# Summarising Missing Data

Before imputation, it's important to understand your missing data patterns:

```{r summarise-missing}
miss <- summarise_missingness(dat, vars)
```

## Missing by Visit

```{r by-visit}
print(miss$by_visit)
```

## Subject Patterns

```{r patterns}
print(miss$patterns)
```

## Summary by Treatment Group

```{r summary}
print(miss$summary)
```

The three pattern types are:

* **complete**: No missing outcome values
* **monotone**: Once a value is missing, all subsequent visits are also missing (dropout)
* **intermittent**: Missing values with observed values both before and after

# Preparing ICE Data

When subjects discontinue treatment, you may want to apply reference-based imputation strategies (see the [`{rbmi}` documentation](https://openpharma.github.io/rbmi/latest-tag/articles/quickstart.html) for details on intercurrent event handling). The `prepare_data_ice()` function builds the required `data_ice` data.frame from a discontinuation flag:

```{r prepare-ice}
data_ice <- prepare_data_ice(
 data = dat,
 vars = vars,
 ice_col = "DISCFL",
 strategy = "JR"  # Jump to Reference
)

print(data_ice)
```

The function:

* Identifies subjects with ICE flags (`"Y"`, `TRUE`, or `1`)
* Takes the first flagged visit per subject
* Assigns the specified imputation strategy

Available strategies are:

* `"MAR"`: Missing at Random
* `"CR"`: Copy Reference
* `"JR"`: Jump to Reference
* `"CIR"`: Copy Increment from Reference
* `"LMCF"`: Last Mean Carried Forward

# Complete Workflow

Here's how these functions fit into a typical [`{rbmi}`](https://cran.r-project.org/package=rbmi) workflow:

```{r workflow, eval = FALSE}
library(rbmi)
library(rbmiUtils)

# 1. Validate data
validate_data(dat, vars)

# 2. Understand missing patterns
miss <- summarise_missingness(dat, vars)
print(miss$summary)

# 3. Prepare ICE data if needed
data_ice <- prepare_data_ice(dat, vars, ice_col = "DISCFL", strategy = "JR")

# 4. Define method
method <- method_bayes(
 n_samples = 100,
 control = control_bayes(warmup = 200, thin = 2)
)

# 5. Run imputation
draws_obj <- draws(
 data = dat,
 vars = vars,
 data_ice = data_ice,
 method = method
)

# 6. Continue with impute() and analyse()
```

# Summary

The data preparation functions in `{rbmiUtils}` help you:

1. **Catch issues early** with `validate_data()` before running time-consuming imputations
2. **Understand your data** with `summarise_missingness()` to characterize missing data patterns
3. **Simplify ICE handling** with `prepare_data_ice()` to build `data_ice` from flag columns

These utilities complement the core [`{rbmi}`](https://cran.r-project.org/package=rbmi) package and support reproducible, well-documented analysis workflows.

After data preparation, see `vignette('pipeline')` for the complete analysis workflow from imputation through to regulatory tables.