--- title: "Programmatic workflow demonstration" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Programmatic workflow demonstration} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- This vignette provides an in-depth walk-through of the `MRPWorkflow` and `MRPModel` classes in the **shinymrp** package. You will find practical information about the purpose, arguments, and caveats of each method, enabling you to conduct MRP analyses programmatically. For an introduction to the package and workflow concepts, begin with the [Getting started with shinymrp](https://mrp-interface.github.io/shinymrp/articles/getting-started.html) vignette. ```{r setup, include=FALSE} run_stan <- requireNamespace("cmdstanr", quietly = TRUE) knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) knitr::opts_template$set( tall_plot = list( fig.width = 10, fig.height = 6, fig.align = "center", out.width = "90%" ), long_plot = list( fig.width = 13, fig.height = 6, fig.align = "center", out.width = "90%" ), map = list( fig.height = 4, out.width = "100%" ) ) ``` ```{r cache_setup, eval=!run_stan, include=FALSE} sample_data <- readr::read_csv( "data/timevarying_binomial_prep.csv", show_col_types = FALSE ) workflow <- qs::qread("data/workflow.qs") model1 <- qs::qread("data/model.qs") compare_df <- readr::read_csv( "data/loo.csv", show_col_types = FALSE ) ``` # Initializing the workflow MRP analysis begins by instantiating an `MRPWorkflow` object with `mrp_workflow()`. The object-oriented design efficiently manages your data and supports a reproducible workflow without requiring repeated user input. All workflow methods are accessed via the [R6](https://r6.r-lib.org/) `$` operator. ```{r create_workflow, eval=run_stan} library(shinymrp) workflow <- mrp_workflow() ``` # 1. Data preparation MRP requires two key data components: - **Sample data**: Your analysis sample, which must include the outcome of interest (binary or continuous) and predictors (demographic, geographic, etc.), such as COVID-19 test records or survey responses. - **Poststratification data**: A table containing population sizes of groups defined by demographic and geographic factors. For details on accepted formats and preprocessing, see the [Data Preprocessing](https://mrp-interface.github.io/shinymrp/articles/data-prep) vignette. ## 1.1 Preprocessing the sample data Sample data should be a tidy data frame with your outcome measure and predictors. Time-varying data are also supported, allowing dates or indices (e.g., week, month, or year) to be incorporated for time-varying summaries and estimates. The `example_sample_data()` function demonstrates accepted structures and common options: - whether a time variable is included - whether the data are aggregated - whether the data follow a specific standard (e.g., medical records from the Epic system) - whether the outcome measure is binary (modeled with binomial distributions) or continuous (modeled with Gaussian/normal distributions) ```{r example_sample, eval=run_stan} sample_data <- example_sample_data( is_timevar = TRUE, is_aggregated = TRUE, special_case = NULL, family = "binomial" ) head(sample_data) ``` ```{r example_sample_cached, eval=!run_stan, echo=FALSE} head(sample_data) ``` Data preprocessing is performed with the `$preprocess()` method, which standardizes formats (e.g., recodes age groups, creates time indices), removes duplicates, and prepares data for modeling. Note: Only basic cleaning is performed. Conduct preliminary data checks beforehand. ```{r preprocess, eval=run_stan} workflow$preprocess( sample_data, is_timevar = TRUE, is_aggregated = TRUE, special_case = NULL, family = "binomial" ) ``` ## 1.2 Assembling poststratification data ### 1.2.1 Linking to ACS Use the `$link_acs()` method to retrieve poststratification data from the American Community Survey (ACS), specifying the desired geographic level and reference year (i.e., the last year of the five-year ACS data collection). Supported levels include ZIP code, county, and state. If not specified, poststratification data are aggregated for the U.S. population. ```{r link_acs, eval=run_stan} workflow$link_acs(link_geo = "zip", acs_year = 2021) ``` ### 1.2.2 Importing custom poststratification data You can also import your own poststratification data using `$load_pstrat()`. [Requirements](https://mrp-interface.github.io/shinymrp/articles/data-prep#required-columns-and-categories) mirror those for sample data. ```{r load_pstrat, eval=run_stan} pstrat_data <- example_pstrat_data() workflow$load_pstrat(pstrat_data, is_aggregated = TRUE) ``` # 2. Descriptive statistics The `MRPWorkflow` class provides methods for creating visualizations of demographic distributions, geographic coverage, and outcome summaries. Use `$demo_bars()` to generate bar plots comparing demographic distributions in the sample and poststratification data. ```{r demo_bars, opts.label="tall_plot"} workflow$demo_bars(demo = "sex") ``` For analyzing COVID-19 test data with linked geographic predictors, the `$covar_hist()` method creates frequency histograms of the geographic predictors. This method will give the following error if it is called with incompatible data that do not include geographic predictors. ```{r covar_hist, error = TRUE} workflow$covar_hist(covar = "income") ``` `$sample_size_map()` visualizes sample sizes across geographic areas in an interactive choropleth map, powered by [highcharter](https://CRAN.R-project.org/package=highcharter). ```{r sample_size_map, opts.label="map"} workflow$sample_size_map() ``` `$outcome_plot()` creates plots of average outcome values, including time variation if present. ```{r outcome_plot, opts.label="long_plot"} workflow$outcome_plot() ``` `$outcome_map()` visualizes outcome summaries by geography, with options for the maximum or minimum value across time. ```{r outcome_map, opts.label="map"} workflow$outcome_map(summary_type = "max") ``` # 3. Model building The `MRPModel` class is a wrapper around CmdStanR objects, providing a high-level interface for obtaining posterior summaries and diagnostics. ## 3.1 Model specification Use `$create_model()` to specify the model, including the mean structure (fixed and varying effects, two-way interactions) and priors. Varying effects assume normal distributions with unknown standard deviations (with specified priors). Stan code is generated for compilation via CmdStanR. Supported prior types include: - `normal(mu, sigma)` - `student_t(nu, mu, sigma)` - `structured` (see [Si et al., 2020](https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2020002/article/00003-eng.pdf?st=iF1_Fbrh)) Default priors are used if `""` (empty string) is supplied: - Overall intercept: normal(0, 5) - Coefficient: normal(0, 3) - Standard deviation (main effect): normal+(0, 3) - Standard deviation (interaction): normal+(0, 1) ```{r create_model1, eval=run_stan} model1 <- workflow$create_model( intercept_prior = "normal(0, 4)", fixed = list( sex = "normal(0, 2)", race = "normal(0, 2)" ), varying = list( age = "", time = "" ) ) ``` ## 3.2 Model fitting The `$fit()` method runs Stan's main Markov chain Monte Carlo algorithm via CmdStanR. ```{r fit_model1, eval=run_stan} model1$fit( n_iter = 500, n_chains = 2, seed = 123, show_messages = FALSE, show_exceptions = FALSE ) ``` ## 3.3 Posterior summaries and diagnostics The fitted `MRPModel` object extracts draws from the internal `CmdStanMCMC` object for summaries and diagnostics. - **Posterior summary and convergence diagnostics**: Use `$summary()` for posterior statistics, effective sample sizes, and R-hat diagnostics. ```{r summary} model1$summary() ``` - **Additional diagnostics**: Use `$diagnostics()` for [detailed diagnostic information](https://mc-stan.org/learn-stan/diagnostics-warnings) about the MCMC process. ```{r diagnostics} model1$diagnostics() ``` - **Posterior predictive check**: Use `$pp_check()` to compare posterior predictive samples against observed data. ```{r pp_check, opts.label="long_plot"} workflow$pp_check(model1) ``` - **Model comparison**: Compare models via leave-one-out cross-validation with `$compare_models()`. ```{r create_fit_model2, eval=run_stan} model2 <- workflow$create_model( intercept_prior = "normal(0, 4)", fixed = list( sex = "normal(0, 2)", race = "normal(0, 2)" ), varying = list( age = "", time = "" ), interaction = list( `age:time` = "normal(0, 1)", `race:time` = "normal(0, 1)", `sex:time` = "normal(0, 1)" ) ) model2$fit( n_iter = 500, n_chains = 2, seed = 123, show_messages = FALSE, show_exceptions = FALSE ) ``` ```{r compare_models, eval=run_stan} workflow$compare_models(model1, model2) ``` ```{r compare_models_cached, eval=!run_stan, echo=FALSE} compare_df ``` ## 3.4 Saving and loading models You can easily save and restore models using the `qs` package. The `$save()` gathers all posterior draws and diagnostics, then calls `qs::qsave()` internally. ```{r, save_model, eval=FALSE} model1$save(file = "model1.qs") # load back into workspace model1 <- qs::qread("model1.qs") ``` # 4. Result visualization Use `$estimate_plot()` to visualize the model estimates, either overall or by group, with a 95% credible interval by default. Users can specify alternative intervals. ```{r, estimate_plot, opts.label="tall_plot", fig.width=13, fig.height=14} workflow$estimate_plot(model1, group = "sex", interval = 0.95) ``` For time-varying data, `$estimate_map()` creates snapshots of geography-based estimates at specific time points. `time_index` is irrelevant for cross-sectional data. ```{r estimate_map, opts.label="map"} workflow$estimate_map(model1, geo = "county", interval = 0.95, time_index = 1) ``` # Further reading - [Getting Started with shinymrp](https://mrp-interface.github.io/shinymrp/articles/getting-started.html) - [Data Preprocessing](https://mrp-interface.github.io/shinymrp/articles/data-prep) - [Stan Model Diagnostics](https://mc-stan.org/learn-stan/diagnostics-warnings)