---
title: "Simulation benchmarks: recovery and power"
author: "tidyILD authors"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Simulation benchmarks: recovery and power}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 4
)
# Optional: set `cache = TRUE` on slow chunks when developing locally (see chunks below).
```

This article shows how to use **`ild_simulate()`** and **`ild_power(..., return_sims = TRUE)`** to assess **parameter recovery** for a fixed effect under the package’s bundled Gaussian mixed-model simulation setup, and how that connects to **empirical power**. It is an **illustrative benchmark** for that DGP—not a proof that every tidyILD method is validated for all data.

**Power** answers: “How often do we reject the null when the effect is present?” **Recovery** answers: “How close are point estimates (and intervals) to the known coefficient across repeated simulations?” Both use the same replication loop inside **`ild_power()`**.

## Simulation size and precision

Recovery summaries (**bias**, **RMSE**, **coverage**) are **Monte Carlo estimates**: their sampling variability shrinks as **`n_sim`** increases (roughly \(\propto 1/\sqrt{n_\text{sim}}\) for means). The worked example below uses **`n_sim = 200`** as a balance between stability and build time on **CRAN**; for publication-style tables, consider a larger **`n_sim`** and report the Monte Carlo uncertainty (e.g. bootstrap across replications, or analytical approximations where available). Chunks that re-run **`ild_power()`** carry an explicit **`cache`** option so local rebuilds can skip re-simulation when the code is unchanged (set **`cache = TRUE`** while developing; caching is not used on CRAN by default).
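As a quick back-of-the-envelope check (plain base R, not a tidyILD helper), the Monte Carlo standard error of a proportion-type summary such as coverage or empirical power follows the usual binomial formula:

```{r mc-se}
# Monte Carlo standard error of a proportion-type summary (coverage, power)
# estimated from n_sim independent replications: sqrt(p * (1 - p) / n_sim).
mc_se <- function(p, n_sim) sqrt(p * (1 - p) / n_sim)

# At nominal 0.95 coverage with n_sim = 200, the MC standard error is about
# 0.015, so observed coverage anywhere in roughly [0.92, 0.98] is still
# compatible with the nominal level.
mc_se(0.95, 200)
```

The worst case is `p = 0.5` (e.g. power near 50%), where the standard error is \(0.5/\sqrt{n_\text{sim}}\), about 0.035 at `n_sim = 200`.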
## What `ild_simulate()` encodes

The base outcome **`y`** (before `ild_power()` adds a predictor) is built as follows:

- **Between-person**: independent random intercepts per `id` (scale **`bp_effect`**).
- **Within-person**: optional AR(1) structure on the de-trended within-person series when **`ar1`** is set (numeric \(\phi\)); otherwise i.i.d. Gaussian noise (scale **`wp_effect`**).
- **Time**: a mild linear-in-time component, plus **`time`** as `POSIXct` from a regular grid (with optional jitter if **`irregular = TRUE`**).

The formula you fit should match this structure (e.g. a random intercept for `id` when using **`ild_lme(..., ar1 = FALSE)`**). Recovery in this vignette focuses on a **single fixed effect** added by **`ild_power()`** on top of that DGP (see below).

## Fixed-effect recovery with `ild_power(..., return_sims = TRUE)`

For each replication, **`ild_power()`**:

1. Simulates data with **`ild_simulate()`**.
2. Adds a standard normal predictor **`x`** and sets **`y <- y + effect_size * x`**, so the **true** coefficient of **`x`** is **`effect_size`** (call it β).
3. Runs **`ild_prepare()`** and **`ild_lme()`** with your formula.
4. Records the Wald-based estimate and standard error for the test term.

With **`return_sims = TRUE`**, you get a tibble of per-replication **`estimate`**, **`std_error`**, **`p_value`**, and **`converged`**. Use **`ild_recovery_metrics()`** for **bias**, **RMSE**, and **nominal Wald interval coverage** (normal approximation \(\hat\beta \pm z_{1-\alpha/2} \cdot \mathrm{SE}\)), consistent with the manual calculation shown in earlier package versions.
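For intuition, the three summaries can also be computed by hand from such a tibble. A minimal base-R sketch, assuming only `estimate` and `std_error` columns (`ild_recovery_metrics()` remains the supported interface):

```{r recovery-by-hand}
# Hand-rolled recovery summaries for a data frame of per-replication
# `estimate` and `std_error` values, mirroring the definitions above.
recovery_by_hand <- function(sims, truth, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)                 # ~1.96 for a 95% interval
  lower <- sims$estimate - z * sims$std_error     # Wald interval bounds
  upper <- sims$estimate + z * sims$std_error
  data.frame(
    bias     = mean(sims$estimate) - truth,
    rmse     = sqrt(mean((sims$estimate - truth)^2)),
    coverage = mean(lower <= truth & truth <= upper)
  )
}

# Synthetic check: estimates scattered around a known truth with
# well-calibrated standard errors should give bias ~0 and coverage ~0.95.
set.seed(1)
fake <- data.frame(estimate = rnorm(500, mean = 0.35, sd = 0.1),
                   std_error = rep(0.1, 500))
recovery_by_hand(fake, truth = 0.35)
```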
```{r recovery-run, cache = FALSE}
library(tidyILD)
library(dplyr)
library(ggplot2)

truth <- 0.35
n_sim <- 200L

set.seed(2026)
res <- ild_power(
  formula = y ~ x + (1 | id),
  n_sim = n_sim,
  n_id = 25L,
  n_obs_per = 12L,
  effect_size = truth,
  seed = 2026L,
  return_sims = TRUE,
  verbose = FALSE
)

sim <- res$sim_results |>
  dplyr::filter(converged)
```

```{r recovery-metrics}
ild_recovery_metrics(res$sim_results, truth = truth, level = 0.95)
```

Across replications, the mean estimate should be near **β**, bias near 0, RMSE positive but modest if the model is well specified, and **coverage** near the nominal level (e.g. 0.95) for large **`n_sim`**—subject to Monte Carlo noise when **`n_sim`** is moderate.

```{r recovery-plot, fig.alt = "Histogram of simulation estimates of the fixed effect", cache = FALSE}
ggplot2::ggplot(sim, ggplot2::aes(x = estimate)) +
  ggplot2::geom_histogram(fill = "steelblue", color = "white", bins = 20) +
  ggplot2::geom_vline(xintercept = truth, linetype = 2, linewidth = 0.8) +
  ggplot2::labs(
    x = "Estimated coefficient for x",
    y = "Count",
    title = "Sampling distribution of fixed-effect estimates",
    subtitle = sprintf("True beta = %s (vertical line)", truth)
  )
```

## From recovery to power

The same object **`res`** already contains **empirical power**: the fraction of converged runs with **`p_value < alpha`**.

```{r power-link}
tibble::tibble(
  power = res$power,
  n_reject = res$n_reject,
  n_converged = res$n_converged,
  alpha = res$alpha
)
```

**Interpretation:** **Recovery** summarizes the **sampling distribution** of \(\hat\beta\). **Power** summarizes how often that distribution leads to rejection at **`alpha`**. A large effect with low variance yields good recovery *and* high power; a small effect may show acceptable recovery (unbiased estimates) but low power at a fixed sample size.

## Variance components (illustrative snapshot)

**`ild_power()`** is built around a **fixed-effect** estimand.
Recovery of **variance components** (random-intercept SD, residual variance) is a separate question. As a **single-run sanity check**, you can inspect **`VarCorr()`** on one simulated dataset (**`lme4`** or **`nlme`** output, depending on **`ar1`**):

```{r varcorr-snapshot, eval = requireNamespace("lme4", quietly = TRUE)}
set.seed(1)
d_one <- ild_simulate(n_id = 30L, n_obs_per = 15L, seed = 99)
d_one$x <- rnorm(nrow(d_one))
d_one$y <- d_one$y + 0.3 * d_one$x

prep_one <- ild_prepare(d_one, id = "id", time = "time")
fit_one <- ild_lme(y ~ x + (1 | id), prep_one, ar1 = FALSE, warn_no_ar1 = FALSE)
lme4::VarCorr(fit_one)
```

This is **not** a simulation-based assessment of variance recovery—only a template for where to look when extending benchmarks.

## AR(1) in the DGP vs residual correlation in the fit

- **`ild_simulate(..., ar1 = phi)`** (numeric \(\phi\)) injects **AR(1) within-person** correlation into the **outcome** before **`ild_power()`** adds **`x`**.
- **`ild_lme(..., ar1 = TRUE)`** fits a **residual** AR1/CAR1 structure in **`nlme`** (a different object from the DGP’s \(\phi\)); when **`ar1 = TRUE`**, the formula must be **fixed-effects-only** with **`random = ~ 1 | id`** (see **`?ild_lme`**).

**`ild_power()`** exposes **`ar1`** as a **logical** flag for the **fit**, not the DGP’s \(\phi\). To study AR(1) DGPs with **`ild_power()`**, you need a **custom loop** that calls **`ild_simulate(..., ar1 = ...)`** yourself; in either case, aim to match the correlation structure between the DGP and the fit.
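One way such a custom loop might look, sketched here with a hand-rolled AR(1) DGP and plain `nlme` so that it runs without tidyILD internals. The generator, scales, and the small number of replications are illustrative assumptions, not package defaults:

```{r ar1-custom-loop-sketch}
library(nlme)  # recommended package shipped with R

# Hand-rolled AR(1) DGP: a random intercept per id plus AR(1) within-person
# noise, standing in for ild_simulate(..., ar1 = phi).
simulate_ar1_draw <- function(n_id, n_obs, phi, beta) {
  do.call(rbind, lapply(seq_len(n_id), function(i) {
    u <- rnorm(1)                                            # random intercept
    e <- as.numeric(arima.sim(list(ar = phi), n = n_obs, sd = 0.5))
    x <- rnorm(n_obs)
    data.frame(id = factor(i), time = seq_len(n_obs),
               x = x, y = u + beta * x + e)
  }))
}

# Replication loop: simulate, fit with an AR(1) residual structure, and
# record the estimate and p-value for x (small n_sim to keep this quick).
set.seed(42)
runs <- lapply(seq_len(20), function(r) {
  d <- simulate_ar1_draw(n_id = 20, n_obs = 14, phi = 0.35, beta = 0.25)
  fit <- tryCatch(
    nlme::lme(y ~ x, random = ~ 1 | id, data = d,
              correlation = nlme::corAR1(form = ~ time | id)),
    error = function(e) NULL                                 # skip failed fits
  )
  if (is.null(fit)) return(NULL)
  tt <- summary(fit)$tTable["x", ]
  c(estimate = tt[["Value"]], p_value = tt[["p-value"]])
})
runs <- Filter(Negate(is.null), runs)
colMeans(do.call(rbind, runs))  # mean estimate should sit near beta = 0.25
```

Swapping `simulate_ar1_draw()` for `ild_simulate(..., ar1 = phi)` plus `ild_prepare()` / `ild_lme(..., ar1 = TRUE)` gives the package-native version of the same loop.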
Minimal illustration of **DGP AR + nlme residual correlation** on one draw:

```{r ar1-illustration, eval = requireNamespace("nlme", quietly = TRUE)}
set.seed(2)
d_ar <- ild_simulate(n_id = 20L, n_obs_per = 14L, ar1 = 0.35, wp_effect = 0.5, seed = 2)
d_ar$x <- rnorm(nrow(d_ar))
d_ar$y <- d_ar$y + 0.25 * d_ar$x

prep_ar <- ild_prepare(d_ar, id = "id", time = "time")
fit_ar <- tryCatch(
  ild_lme(y ~ x, prep_ar, ar1 = TRUE, random = ~ 1 | id, warn_no_ar1 = FALSE),
  error = function(e) NULL
)

if (!is.null(fit_ar)) summary(fit_ar) else "nlme fit failed on this platform (skip)"
```

## Bayesian and state-space extensions

- **`ild_power()`** currently drives **`ild_lme()`** only. For **brms** / Stan, use **`ild_simulate()`** + **`ild_fit(..., backend = "brms")`** (or **`ild_brms()`**) in your own replication loop, then summarize posterior intervals against **`truth`** with the same **coverage** idea (credible and Wald nominal coverage are **not** interchangeable—document which you use).
- **KFAS** targets **state-space** models; see **`vignette("kfas-state-space-modeling", package = "tidyILD")`** and **`?ild_kfas`**. Cross-checking mixed-model simulation benchmarks against KFAS is out of scope here but natural for method comparison papers.

## Cross-backend validation harness (optional)

For **repeatable CI-style checks** across several engines (`lme4`, optional `nlme` / `brms` / **KFAS** / **ctsem**), the package ships a **non-exported** harness documented in **`inst/dev/BACKEND_VALIDATION_BENCHMARK_CONTRACT.md`**. From a **source tree**, run `scripts/run-backend-validation-benchmarks.R` with a **`--tier`** (`smoke`, `nightly`, `full`) and compare outputs to JSON thresholds via `scripts/check-backend-validation-thresholds.R`. This complements the **`ild_power()`** / **`ild_recovery_metrics()`** workflow above by standardizing **raw and summary tables** for trend comparison; it is **not** a substitute for study-specific simulation design.
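For the Bayesian route above, the coverage computation itself is simple once you have per-replication posterior draws. A base-R sketch in which the fake draws stand in for output from a `brms` replication loop:

```{r credible-coverage-sketch}
# Coverage of central (quantile-based) credible intervals across replications,
# given one numeric vector of posterior draws for beta per replication.
cred_coverage <- function(draws_list, truth, level = 0.95) {
  a <- (1 - level) / 2
  hits <- vapply(draws_list, function(dr) {
    ci <- quantile(dr, c(a, 1 - a))
    ci[[1]] <= truth && truth <= ci[[2]]
  }, logical(1))
  mean(hits)
}

# Fake, well-calibrated posteriors: each replication's posterior is centered
# at the true value plus estimation noise on the same scale.
set.seed(3)
fake_draws <- replicate(
  100,
  rnorm(4000, mean = 0.35 + rnorm(1, sd = 0.05), sd = 0.05),
  simplify = FALSE
)
cred_coverage(fake_draws, truth = 0.35)  # should be near 0.95
```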
## Limitations and scope

- **One estimand in `ild_power()`**: this path targets a **single fixed effect** in a linear mixed model fit with **`ild_lme()`**, matching the current **`ild_power()`** implementation.
- **Inference**: for **`lme4`** fits, **`ild_power()`** uses a **Wald z-approximation** when **p-values** are unavailable from the engine; that is appropriate for this illustration but not identical to likelihood-ratio tests.
- **Variance / AR / other backends**: recovery of **variance components** and **AR** parameters, **brms**, and **KFAS** are discussed **above**; full simulation suites for those paths are left to application-specific code.

## See also

- **`vignette("tidyILD-workflow", package = "tidyILD")`** — end-to-end pipeline.
- **`?ild_power`**, **`?ild_recovery_metrics`** — simulation-based power and recovery summaries.
- **`?ild_simulate`** — base data generator.

```{r session-info, echo = FALSE}
sessionInfo()
```