---
title: "One-Arm Regression Modeling with CausalMixGPD"
output: rmarkdown::html_vignette
bibliography: CausalMixGPD.bib
link-citations: true
vignette: >
  %\VignetteIndexEntry{One-Arm Regression Modeling with CausalMixGPD}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 4.8,
  out.width = "85%",
  fig.align = "center",
  message = FALSE,
  warning = FALSE
)
.pkg_extdata <- function(name) {
  f <- system.file("extdata", name, package = "CausalMixGPD")
  if (nzchar(f)) {
    return(f)
  }
  # Fallback for source checkouts: current_input(dir = TRUE) returns the path
  # of the input *file*, so take its directory before stepping up.
  vdir <- dirname(knitr::current_input(dir = TRUE))
  pkg_root <- normalizePath(file.path(vdir, ".."), winslash = "/", mustWork = TRUE)
  alt <- file.path(pkg_root, "inst", "extdata", name)
  if (file.exists(alt)) {
    return(normalizePath(alt, winslash = "/", mustWork = TRUE))
  }
  stop("Missing extdata: ", name, ". Run tools/.Rscripts/build_vignette_fits.R",
       call. = FALSE)
}
```

# Introduction

`CausalMixGPD` is designed for settings where the response distribution is not well represented by a simple parametric family. Many applications have heteroskedastic, skewed, or multimodal outcomes that are better modeled with flexible mixtures, together with upper-tail behavior that may require extrapolation. The `CausalMixGPD` package addresses this setting with a spliced distribution in which a Dirichlet process mixture (DPM) model for the core is optionally augmented with a generalized Pareto distribution (GPD) tail [@EscobarWest1995; @BalkemaDeHaan1974; @Pickands1975; @Coles2001]. A conditional outcome distribution is fitted first, and densities, survival functions, quantiles, means, and restricted means are then extracted from the fitted distribution. The software paper describes the one-arm workflow that the causal and clustering extensions build on. This vignette walks through the workflow using a built-in simulated data set.
It introduces the notation and model setup, describes the main one-arm workflow functions and their statistical targets, shows the interface through executable code, and summarizes the main customization options. # Notation and statistical background ## Conditional distribution as the inferential target Let $Y$ denote the response and let $X \in \mathbb{R}^p$ denote a predictor vector. For a fixed covariate value $x$, the conditional cumulative distribution function and conditional density are $$ F(y \mid x) = \Pr(Y \le y \mid X = x), \qquad f(y \mid x) = \frac{\partial}{\partial y} F(y \mid x), $$ and the conditional quantile function is $$ Q(\tau \mid x) = \inf\{y : F(y \mid x) \ge \tau\}, \qquad 0 < \tau < 1. $$ In `CausalMixGPD`, this conditional distribution is the primary object of inference. Scalar summaries such as means or upper quantiles are derived from the fitted distribution rather than modeled separately. That design is important because it keeps different summaries internally coherent and avoids issues such as quantile crossing that can arise when quantiles are estimated one at a time. ## Bulk model: Dirichlet process mixtures Below a tail threshold, the package uses a DPM regression model. If $k(\cdot \mid x; \theta)$ denotes a chosen kernel family, the bulk conditional density can be written as $$ f_{\mathrm{DP}}(y \mid x) = \int k(y \mid x; \theta) \, dH(\theta) = \sum_{j=1}^{\infty} w_j k(y \mid x; \theta_j), $$ where $H$ denotes a random mixing distribution having a Dirichlet process prior such that the mixing proportions obey $w_j \ge 0$ and $\sum_{j} w_j = 1$, and the kernel parameters can be either fixed or covariate-dependent [@Ferguson1973; @Sethuraman1994; @Neal2000]. The mixture formulation provides much flexibility in dealing with multimodality, skewness, nonuniform local dispersion, and potential latent grouping without requiring any prespecification of the number of mixture components. 
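To make the stick-breaking weights concrete, the following standalone sketch in plain base R (not a `CausalMixGPD` call) draws truncated weights $w_j = v_j \prod_{l < j} (1 - v_l)$ under a DP($\alpha$) prior and samples from the induced lognormal mixture; the truncation level `J` plays the role of the internal truncation discussed next.

```r
# Standalone sketch (not a CausalMixGPD function): truncated stick-breaking
# draw of DPM weights, w_j = v_j * prod_{l < j} (1 - v_l).
set.seed(1)
alpha <- 1                        # DP concentration parameter
J <- 25                           # truncation level (cf. the `components` argument)
v <- rbeta(J, 1, alpha)           # stick-breaking fractions, v_j ~ Beta(1, alpha)
v[J] <- 1                         # close the final stick so the weights sum to 1
w <- v * cumprod(c(1, 1 - v[-J]))
sum(w)                            # equals 1 by construction
# One draw from the induced mixture of lognormal kernels:
mu <- rnorm(J, mean = 0, sd = 1)  # component-specific log-scale locations
y <- rlnorm(1e4, meanlog = sample(mu, 1e4, replace = TRUE, prob = w), sdlog = 0.3)
```

With $\alpha = 1$, most of the mass typically concentrates on a handful of components, which is why the finite truncation is usually harmless in practice.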
From an implementation standpoint, the bulk model specification is largely driven by four main inputs: `kernel`, `backend`, `components`, and `param_specs`. The first two control the kernel function $k(\cdot)$ and the underlying representation, respectively, the third is the internal truncation level, and the fourth decides whether the kernel parameters are fixed, drawn from priors, or regressed. ## Tail model: generalized Pareto exceedances When the upper tail requires explicit modeling, the package replaces the part of the density above a threshold $u(x)$ by a GPD component. For exceedances over the threshold, define $$ Z = Y - u(x) \quad \text{given } Y > u(x). $$ Under the peaks-over-threshold framework, the conditional excess distribution is approximated by a GPD when the threshold is sufficiently high [@BalkemaDeHaan1974; @Pickands1975; @Coles2001]. For $y > u(x)$, with scale parameter $\sigma(x) > 0$ and shape parameter $\xi$, the GPD cumulative distribution function is $$ F_{\mathrm{GPD}}(y \mid u(x), \sigma(x), \xi) = \begin{cases} 1 - \left(1 + \xi \frac{y-u(x)}{\sigma(x)}\right)^{-1/\xi}, & \xi \ne 0, \\ 1 - \exp\left(-\frac{y-u(x)}{\sigma(x)}\right), & \xi = 0. \end{cases} $$ The support is $y \ge u(x)$ when $\xi \ge 0$, and $u(x) \le y \le u(x) - \sigma(x)/\xi$ when $\xi < 0$. The shape parameter controls the tail regime: negative values imply a finite upper endpoint, $\xi = 0$ gives an exponential-type tail, and positive values correspond to heavy tails. ## Spliced bulk-tail construction The key statistical idea in the package is the spliced model. Let $F_{\mathrm{DP}}(\cdot \mid x; \Theta)$ and $f_{\mathrm{DP}}(\cdot \mid x; \Theta)$ denote the bulk CDF and density, and let $\Phi$ collect the tail parameters. Write $$ p_u(x; \Theta) = F_{\mathrm{DP}}(u(x) \mid x; \Theta) $$ for the bulk probability mass below the threshold. 
Then the spliced conditional CDF is
$$
F(y \mid x; \Theta, \Phi) =
\begin{cases}
F_{\mathrm{DP}}(y \mid x; \Theta), & y \le u(x), \\
p_u(x; \Theta) + \{1 - p_u(x; \Theta)\} F_{\mathrm{GPD}}(y \mid x; \Phi), & y > u(x),
\end{cases}
$$
and the corresponding conditional density is
$$
f(y \mid x; \Theta, \Phi) =
\begin{cases}
f_{\mathrm{DP}}(y \mid x; \Theta), & y \le u(x), \\
\{1 - p_u(x; \Theta)\} f_{\mathrm{GPD}}(y \mid x; \Phi), & y > u(x).
\end{cases}
$$
This formulation preserves the flexibility of the mixture in the central region while forcing the tail to follow a form motivated by extreme value theory (EVT). The splice is continuous at $u(x)$ by construction, and the tail density is weighted by the bulk survival probability at the threshold.

The quantile function follows the same two-part logic. For quantile levels below the threshold mass, the package inverts the bulk CDF. For levels above the threshold mass, the probability index is rescaled to the exceedance scale:
$$
\widetilde\tau(x; \Theta) = \frac{\tau - p_u(x; \Theta)}{1 - p_u(x; \Theta)}.
$$
The upper-tail quantiles are then obtained from the GPD quantile function evaluated at $\widetilde\tau(x; \Theta)$. This distinction explains why upper quantile predictions are one of the main places where a spliced fit can differ meaningfully from a bulk-only fit.

## Means, restricted means, and existence conditions

When the kernel mean exists and the tail satisfies $\xi < 1$, the conditional mean of the spliced model is finite. In that case the one-arm interface can report ordinary means through `predict(type = "mean")`. When the tail is too heavy for a finite mean, or when the user prefers a bounded summary, the package provides restricted means through `predict(type = "rmean")`, which returns
$$
E\{\min(Y, c) \mid X = x\}
$$
for a user-supplied cutoff $c$. Restricted means are often easier to interpret in heavy-tailed settings because they remain finite and are less dominated by a few very large posterior draws.
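One way to see why restricted means behave well is the identity $E\{\min(Y, c)\} = \int_0^c S(y)\,dy$ for a nonnegative response, where $S$ is the survival function. The standalone sketch below (plain R, with a lognormal as a stand-in for a fitted survival curve, not a package call) checks the identity by Monte Carlo:

```r
# Standalone sketch: for nonnegative Y, E{min(Y, c)} equals the integral of
# the survival function S(y) from 0 to c. Lognormal used as a stand-in.
S <- function(y) plnorm(y, meanlog = 0, sdlog = 1, lower.tail = FALSE)
c_cut <- 5
rmean_int <- integrate(S, lower = 0, upper = c_cut)$value
# Monte Carlo version of the same quantity:
set.seed(1)
y_draws <- rlnorm(2e5, meanlog = 0, sdlog = 1)
rmean_mc <- mean(pmin(y_draws, c_cut))
```

The two numbers agree to Monte Carlo error, and the integral form makes clear that the restricted mean is finite for any finite cutoff regardless of the tail.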
The software paper emphasizes this distinction explicitly in the one-arm prediction section. # Function map for the one-arm workflow The main exported functions used in this vignette are listed below. - `dpmgpd()` fits a one-arm spliced DPM-GPD model. - `dpmix()` fits the corresponding bulk-only DPM model. - `summary()` reports posterior summaries from a fitted object. - `params()` extracts parameter summaries in a more direct programmatic form. - `plot()` provides diagnostic or summary graphics for fitted and predicted objects. - `predict()` returns posterior summaries for densities, survival functions, quantiles, means, restricted means, medians, samples, and fitted values. These are not separate modeling frameworks. They are different ways of interrogating the same posterior distribution after fitting a one-arm model. # Package setup ```{r libraries} library(CausalMixGPD) library(ggplot2) ``` For a vignette, it is reasonable to use modest MCMC settings so the code remains runnable on ordinary hardware. In a substantive analysis, longer runs and sensitivity checks are recommended. ```{r mcmc-settings} mcmc_vig <- list( niter = 1200, nburnin = 300, thin = 2, nchains = 2, seed = 2026 ) ``` # Data We use the bundled dataset `nc_posX100_p3_k2`, which contains a positive response and three predictors. It is convenient for illustrating both the bulk-only and spliced workflows. ```{r load-data} data("nc_posX100_p3_k2", package = "CausalMixGPD") onearm_dat <- data.frame( y = nc_posX100_p3_k2$y, nc_posX100_p3_k2$X ) ``` A quick visual check helps motivate flexible modeling. 
```{r exploratory-plots, fig.height=4.5, fig.width=7} p1 <- ggplot(onearm_dat, aes(x = y)) + geom_histogram(aes(y = after_stat(density)), bins = 25, fill = "grey85", colour = "grey35") + geom_density(linewidth = 0.9) + labs(x = "y", y = "Density", title = "Observed response distribution") + theme_minimal() p2 <- ggplot(onearm_dat, aes(sample = y)) + stat_qq() + stat_qq_line(colour = 2) + labs(title = "Normal Q-Q plot for the response") + theme_minimal() p1 p2 ``` The response is positive and the empirical distribution is skewed. That makes the dataset a natural candidate for positive-support kernels such as `gamma`, `lognormal`, or `invgauss`. If the scientific question involves high quantiles or tail probabilities, a spliced GPD tail can also be justified. # Fitting a spliced one-arm model We begin with `dpmgpd()`, which fits the bulk mixture together with an active GPD tail. The article presents this as the main one-arm interface for the spliced workflow, with the formula describing the conditional outcome model and the backend selecting the latent DPM representation. To keep vignette build time predictable, outputs from this vignette are **loaded from** `inst/extdata/one_arm_outputs.rds` (created by `tools/.Rscripts/build_vignette_fits.R`). This contains only the prediction/summary objects needed to render the figures and tables, rather than the full fitted MCMC objects. The code chunks below show the package calls you would run in a full analysis; the rendered tables and figures are produced from the shipped artifacts in companion chunks with hidden source. ```{r fit-spliced-display, eval=FALSE} fit_spliced <- dpmgpd( formula = y ~ x1 + x2 + x3, data = onearm_dat, backend = "sb", kernel = "lognormal", components = 5, mcmc = mcmc_vig ) ``` ```{r fit-spliced, echo=FALSE} one_arm_out <- readRDS(.pkg_extdata("one_arm_outputs.rds")) ``` A fitted one-arm object supports posterior summaries and diagnostic plots. 
To avoid storing the full fit, we include static diagnostics generated offline.

```{r summary-spliced-display, eval=FALSE}
summary(fit_spliced)
```

```{r summary-spliced, echo=FALSE}
fsum <- system.file("extdata", "one_arm_fit_summary.txt", package = "CausalMixGPD")
if (!nzchar(fsum)) {
  # current_input(dir = TRUE) returns the input file path, so take dirname().
  alt <- normalizePath(
    file.path(dirname(knitr::current_input(dir = TRUE)), "..", "inst", "extdata",
              "one_arm_fit_summary.txt"),
    winslash = "/", mustWork = FALSE
  )
  if (file.exists(alt)) fsum <- alt
}
if (nzchar(fsum) && file.exists(fsum)) {
  cat(paste(readLines(fsum, warn = FALSE, encoding = "UTF-8"), collapse = "\n"))
} else {
  cat("Precomputed fit summary not found; run tools/.Rscripts/build_vignette_fits.R.\n")
}
```

```{r plot-spliced-trace-display, eval=FALSE}
plot(fit_spliced, family = c("traceplot", "density"), params = "alpha")
```

```{r plot-spliced-trace, echo=FALSE, fig.height=5.2, fig.width=7}
knitr::include_graphics("assets/one_arm_alpha_trace.png")
```

```{r plot-spliced-density-display, eval=FALSE}
# Posterior density for monitored parameters (e.g. alpha); see plot.mixgpd_fit().
plot(fit_spliced, family = "density", params = "alpha")
```

```{r plot-spliced-density, echo=FALSE, fig.height=5, fig.width=7}
knitr::include_graphics("assets/one_arm_alpha_density.png")
```

The exact parameter names stored by the object depend on the chosen kernel and backend, but the general interpretation is stable. The concentration parameter controls the effective complexity of the mixture, while the monitored bulk and tail parameters determine the conditional distribution used for all later prediction tasks. Use `params(fit_spliced)` when you need the posterior draws as a matrix for custom summaries.

# Posterior prediction from the spliced model

## Conditional density and survival summaries

The package article notes that density and survival summaries are returned by evaluating the corresponding functional draw by draw and then summarizing over retained MCMC draws.
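That draw-by-draw scheme is easy to emulate by hand, which is useful for custom summaries built from posterior draws. In this standalone sketch (synthetic draws, not package output), rows are retained MCMC draws of a density evaluated on a grid and columns are grid points:

```r
# Standalone sketch of pointwise posterior summaries over MCMC draws.
set.seed(1)
y_grid <- seq(0.1, 6, length.out = 50)
# 200 synthetic "draws" of a conditional density on the grid:
draws <- t(replicate(200, dlnorm(y_grid, meanlog = rnorm(1, 0, 0.05), sdlog = 1)))
est <- colMeans(draws)                                     # pointwise posterior mean
ci <- apply(draws, 2, quantile, probs = c(0.025, 0.975))   # pointwise 95% band
```

Because every summary is computed from the same set of draws, the point estimate and the band are automatically coherent at each grid point.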
We illustrate that at three representative predictor profiles formed from empirical quartiles. ```{r newdata-grid-display, eval=FALSE} x_new <- as.data.frame(lapply( onearm_dat[, c("x1", "x2", "x3")], stats::quantile, probs = c(0.25, 0.50, 0.75), na.rm = TRUE )) rownames(x_new) <- c("q25", "q50", "q75") x_new ``` ```{r newdata-grid, echo=FALSE} x_new <- one_arm_out$x_new x_new ``` ```{r density-prediction-display, eval=FALSE} y_grid <- seq(min(onearm_dat$y), quantile(onearm_dat$y, 0.99), length.out = 120) pred_dens <- predict( fit_spliced, newdata = x_new, y = y_grid, type = "density", interval = "credible", level = 0.95, store_draws = FALSE ) plot(pred_dens, type = "density", facet = "covariate") ``` ```{r density-prediction, echo=FALSE} pred_dens <- one_arm_out$pred_dens plot(pred_dens, type = "density", facet = "covariate") ``` ```{r survival-prediction-display, eval=FALSE} y_grid <- seq(min(onearm_dat$y), quantile(onearm_dat$y, 0.99), length.out = 120) pred_surv <- predict( fit_spliced, newdata = x_new, y = y_grid, type = "survival", interval = "credible", level = 0.95, store_draws = FALSE ) plot(pred_surv) ``` ```{r survival-prediction, echo=FALSE} pred_surv <- one_arm_out$pred_surv plot(pred_surv) ``` These summaries show how the fitted conditional distribution changes across covariate profiles. In practice, that helps determine whether predictors act mostly through a location shift, through dispersion, or through upper-tail behavior. ## Quantiles, means, and restricted means The one-arm workflow is especially useful when the scientific target is not just the density itself but one or more functionals of the conditional distribution. The article highlights quantiles and mean-type summaries as central examples, with restricted means remaining available even when the ordinary mean is unstable or undefined under a heavy tail. 
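The existence condition $\xi < 1$ is easy to see empirically. The GPD quantile function is $Q(\tau) = \sigma\{(1-\tau)^{-\xi} - 1\}/\xi$ for $\xi \ne 0$, so exceedances can be simulated by inversion. In this standalone sketch (plain R, not package code), a draw with $\xi = 1.2$ has an erratic sample mean, while its restricted mean stays bounded:

```r
# Standalone sketch: simulate GPD exceedances by inverting the CDF,
# Q(tau) = sigma * ((1 - tau)^(-xi) - 1) / xi  (for xi != 0).
rgpd_sketch <- function(n, sigma = 1, xi) {
  sigma * ((1 - runif(n))^(-xi) - 1) / xi
}
set.seed(1)
z <- rgpd_sketch(1e4, sigma = 1, xi = 1.2)  # xi >= 1: no finite mean exists
mean(z)            # dominated by the largest draws; varies wildly across seeds
mean(pmin(z, 50))  # restricted mean: finite by construction and stable
```

Rerunning with different seeds makes the contrast obvious: the ordinary mean jumps by orders of magnitude, the restricted mean barely moves.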
```{r quantile-prediction-display, eval=FALSE} pred_quant <- predict( fit_spliced, newdata = x_new, type = "quantile", index = c(0.25, 0.50, 0.75, 0.90, 0.95), interval = "credible", level = 0.95, store_draws = FALSE ) ``` ```{r quantile-prediction, echo=FALSE} pred_quant <- one_arm_out$pred_quant ``` ```{r quantile-prediction-plot-display, eval=FALSE} plot(pred_quant) ``` ```{r quantile-prediction-plot, echo=FALSE, fig.height=5.5, fig.width=7} plot(pred_quant) ``` ```{r mean-prediction-display, eval=FALSE} pred_mean <- predict( fit_spliced, newdata = x_new, type = "mean", interval = "hpd", level = 0.90, store_draws = FALSE ) ``` ```{r mean-prediction, echo=FALSE} pred_mean <- one_arm_out$pred_mean ``` ```{r mean-prediction-plot-display, eval=FALSE} plot(pred_mean) ``` ```{r mean-prediction-plot, echo=FALSE, fig.height=5.2, fig.width=7} plot(pred_mean) ``` ```{r rmean-prediction-display, eval=FALSE} cutoff_val <- as.numeric(stats::quantile(onearm_dat$y, 0.95)) pred_rmean <- predict( fit_spliced, newdata = x_new, type = "rmean", cutoff = cutoff_val, interval = "hpd", level = 0.90, store_draws = FALSE ) ``` ```{r rmean-prediction, echo=FALSE} cutoff_val <- one_arm_out$cutoff_val ``` ```{r rmean-table-display, eval=FALSE} knitr::kable(pred_rmean$fit_df, format = "html", row.names = FALSE, digits = 4) ``` ```{r rmean-table, echo=FALSE} knitr::kable( one_arm_out$pred_rmean_fit_df, format = "html", row.names = FALSE, digits = 4 ) ``` The restricted mean is often a good complement to upper-tail quantiles. It preserves the original response scale, is always finite for finite cutoffs, and gives a more stable summary when a few extreme draws can dominate the ordinary mean. # Comparing bulk-only and spliced fits The package also provides a bulk-only wrapper, `dpmix()`, which removes the GPD tail while leaving the general predictive interface intact. The article presents this as the natural special case obtained when the tail component is switched off. 
```{r fit-bulk-only-display, eval=FALSE} fit_bulk <- dpmix( formula = y ~ x1 + x2 + x3, data = onearm_dat, backend = "sb", kernel = "lognormal", components = 5, mcmc = mcmc_vig ) ``` ```{r fit-bulk-only, echo=FALSE} cat("Bulk-only fit computed offline; see quantile comparison below.\n") ``` ```{r compare-quantiles-display, eval=FALSE} quant_bulk_fit_df <- predict( fit_bulk, newdata = x_new, type = "quantile", index = c(0.90, 0.95), interval = "credible", level = 0.95, store_draws = FALSE )$fit_df quant_spliced_fit_df <- predict( fit_spliced, newdata = x_new, type = "quantile", index = c(0.90, 0.95), interval = "credible", level = 0.95, store_draws = FALSE )$fit_df ``` ```{r compare-quantiles, echo=FALSE} quant_bulk_fit_df <- one_arm_out$quant_bulk_fit_df quant_spliced_fit_df <- one_arm_out$quant_spliced_fit_df ``` ```{r compare-quantiles-plot-display, eval=FALSE} qb <- quant_bulk_fit_df qb$model <- "dpmix (bulk only)" qs <- quant_spliced_fit_df qs$model <- "dpmgpd (spliced)" qc <- rbind(qb, qs) qc$profile <- rownames(x_new)[as.integer(qc$id)] ggplot(qc, aes(x = profile, y = estimate, colour = model, group = model)) + geom_line(linewidth = 0.8) + geom_point(size = 2) + facet_wrap(~index, scales = "free_y", ncol = 2) + labs( x = "Covariate profile", y = "Posterior mean quantile", title = "Upper quantiles: bulk-only vs spliced tail" ) + theme_minimal() + theme(axis.text.x = element_text(angle = 20, hjust = 1)) ``` ```{r compare-quantiles-plot, echo=FALSE, fig.height=5, fig.width=7} qb <- quant_bulk_fit_df qb$model <- "dpmix (bulk only)" qs <- quant_spliced_fit_df qs$model <- "dpmgpd (spliced)" qc <- rbind(qb, qs) qc$profile <- rownames(x_new)[as.integer(qc$id)] ggplot(qc, aes(x = profile, y = estimate, colour = model, group = model)) + geom_line(linewidth = 0.8) + geom_point(size = 2) + facet_wrap(~index, scales = "free_y", ncol = 2) + labs( x = "Covariate profile", y = "Posterior mean quantile", title = "Upper quantiles: bulk-only vs spliced tail" ) + theme_minimal() 
+ theme(axis.text.x = element_text(angle = 20, hjust = 1)) ``` This comparison is one of the most revealing diagnostics in a heavy-tailed application. Bulk-only and spliced models can look very similar in the center of the distribution, yet their upper-tail quantile predictions can differ materially because only the spliced model imposes an explicit EVT-based tail form. # Analysis summary The one-arm workflow is easiest to interpret when the fitted conditional distribution is the main object of interest. In this example, the positive-support mixture captures the main structure of the response across covariate profiles, while the optional GPD tail changes how upper-tail mass is extrapolated. The density and survival summaries describe the overall shape of the fitted conditional distribution, the quantile summaries highlight profile-specific location and tail changes, and the restricted mean provides a bounded summary on the original response scale. From a practical modeling standpoint, `dpmix()` and `dpmgpd()` are two related model classes with the same summary interface. The choice between them should be driven by the role of the upper tail in the analysis. If the target is mainly bulk prediction, a bulk-only mixture may be adequate. If inference on extreme upper quantiles, tail probabilities, or means sensitive to upper-tail behavior matters, the spliced formulation is often more appropriate. # Customization options for one-arm models This section collects the main arguments that users typically customize in one-arm analyses. The package appendix provides the same controls in compact reference form. ## Model specification options `kernel` selects the bulk family. Positive-support choices include `"gamma"`, `"lognormal"`, `"invgauss"`, and `"amoroso"`; real-line kernels include `"normal"`, `"laplace"`, and `"cauchy"`. The support of the response should guide this choice. `backend` determines how the DPM is represented internally. 
The main options are `"sb"` for stick-breaking and `"crp"` for a Chinese restaurant process representation. `components` controls the truncation size used for the finite approximation to the infinite mixture. `GPD` is implicit in the wrapper choice: `dpmgpd()` activates the spliced tail, whereas `dpmix()` fits the bulk-only model. ## Parameter and prior customization `param_specs` is the central argument for advanced model control. Each kernel or tail parameter can be assigned one of three modes: a fixed value, a prior-driven random value, or a covariate-linked regression structure. This is the main route for changing whether a threshold, scale, or kernel parameter is held constant or allowed to vary with predictors. `alpha_random` controls whether the DPM concentration parameter is treated as random under its prior or fixed. `epsilon` controls downstream truncation and summary behavior for very small mixture components. ## MCMC and monitoring options `mcmc` contains the iteration controls `niter`, `nburnin`, `thin`, `nchains`, and `seed`. `monitor` chooses how much of the posterior state is retained. The usual values are `"core"` and `"full"`. `monitor_latent = TRUE` requests latent allocation labels in the retained output, and `monitor_v = TRUE` stores stick-breaking fractions for SB fits. ## Prediction options `predict()` supports `type = "density"`, `"survival"`, `"quantile"`, `"mean"`, `"rmean"`, `"median"`, `"sample"`, and `"fit"`. For density and survival summaries, the argument `y` supplies the evaluation grid. For quantile summaries, the argument `index` supplies the probability levels. For restricted means, the argument `cutoff` defines the truncation point and `nsim_mean` can control Monte Carlo evaluation when applicable. `interval` can be set to `"credible"`, `"hpd"`, or `NULL`, and `level` controls the posterior interval coverage level. # Session information ```{r session-info} sessionInfo() ```