--- title: "Causal Inference with CausalMixGPD" output: rmarkdown::html_vignette bibliography: CausalMixGPD.bib link-citations: true vignette: > %\VignetteIndexEntry{Causal Inference with CausalMixGPD} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4.8, out.width = "85%", fig.align = "center", message = FALSE, warning = FALSE ) .pkg_extdata <- function(name) { f <- system.file("extdata", name, package = "CausalMixGPD") if (nzchar(f)) { return(f) } vdir <- knitr::current_input(dir = TRUE) pkg_root <- normalizePath(file.path(vdir, ".."), winslash = "/", mustWork = TRUE) alt <- file.path(pkg_root, "inst", "extdata", name) if (file.exists(alt)) { return(normalizePath(alt, winslash = "/", mustWork = TRUE)) } stop("Missing extdata: ", name, ".", call. = FALSE) } ``` # Introduction In contrast to `CausalMix`, which was limited to modeling the average treatment effect, `CausalMixGPD` expands on the framework of causal inference by fitting treatment-specific outcome distributions and obtaining the contrasts by evaluating posterior functional of the fitted distribution [@Rubin1974; @RosenbaumRubin1983; @ImbensRubin2015]. Thus, it is a distribution-based approach where one obtains means, quantiles, restricted means, survival summaries, and their contrasts based on posterior objects obtained from the modeled outcome distributions. The key point made in the software article is that the proposed approach becomes especially relevant when the outcome is skewed or heavy-tailed. In such settings, treatment heterogeneity might not manifest itself via contrasts of means but rather in terms of other summaries of interest such as quantiles, and hence the modeling of treatment-specific distributions becomes critical for obtaining a coherent picture of the heterogeneous treatment effect. In this vignette, we describe the process of performing causal analysis with `CausalMixGPD` in four parts. First, we introduce potential outcomes framework and discuss the causal estimands computed by the package. Next, we explain what happens under the hood when the spliced DPM-GPD model is fit to each of the arms. After that, we demonstrate the usage of the key functions of the package using the example of the provided simulated data set. # Notation and short theory background ## Potential outcomes and arm-specific distributions Let $A \in \{0,1\}$ denote a binary treatment indicator, with $A = 1$ for treatment and $A = 0$ for control. Let $Y(1)$ and $Y(0)$ be the potential outcomes, and let $X$ denote the pre-treatment covariates. The observed outcome is $$ Y = A Y(1) + (1-A)Y(0). $$ For each treatment arm $a \in \{0,1\}$, define the arm-specific conditional CDF and density by $$ F_a(y \mid x) = \Pr\{Y(a) \le y \mid X = x\}, \qquad f_a(y \mid x) = \frac{\partial}{\partial y} F_a(y \mid x). $$ The package models these arm-specific conditional distributions directly. Once they are available, all reported causal estimands are computed from them. ## Identification assumptions The causal workflow follows the standard potential-outcomes framework under consistency, no interference, conditional ignorability, and overlap [@Rubin1974; @RosenbaumRubin1983; @ImbensRubin2015]. Under these assumptions, $$ F_a(y \mid x) = \Pr(Y \le y \mid A=a, X=x), \qquad a \in \{0,1\}. $$ That identification step matters because it justifies fitting arm-specific outcome models to the observed data and then interpreting posterior contrasts as causal quantities. ## Propensity score augmentation In observational settings, the package can estimate a propensity score $$ \rho(x) = \Pr(A=1 \mid X=x) $$ and append a summary of it to the outcome model covariates. If $\widehat\rho(x)$ is the estimated score, the augmented predictor used by the outcome model is typically of the form $$ r(x) = (x^\top, \psi\{\widehat\rho(x)\})^\top, $$ where $\psi(\cdot)$ is either the identity map on the probability scale or the logit transformation, depending on the setting. This is the package's default strategy for incorporating design information into the arm-specific outcome regressions. ## Spliced outcome model within each arm Within each arm, `CausalMixGPD` uses the same spliced model described in the one-arm workflow. For arm $a$, the conditional distribution combines a DPM bulk and an optional GPD tail: $$ f_a(y \mid x) = \begin{cases} f_{a,\mathrm{DP}}(y \mid x; \Theta_a), & y \le u_a(x), \\ \{1-p_{u_a}(x;\Theta_a)\} f_{a,\mathrm{GPD}}(y \mid x; \Phi_a), & y > u_a(x), \end{cases} $$ where $p_{u_a}(x;\Theta_a) = F_{a,\mathrm{DP}}(u_a(x) \mid x; \Theta_a)$. The main modeling choice in the causal workflow is therefore not the contrast itself but the quality of the two fitted conditional outcome distributions. ## Causal estimands reported by the package Once the arm-specific distributions are fitted, the package returns several causal functionals. The conditional mean in arm $a$ is $$ \mu_a(x) = E\{Y(a) \mid X=x\}, $$ and the conditional quantile is $$ Q_a(\tau \mid x) = \inf\{y : F_a(y \mid x) \ge \tau\}. $$ From these, the package constructs conditional and standardized estimands. The main ones are $$ \mathrm{CATE}(x) = \mu_1(x) - \mu_0(x), $$ $$ \mathrm{CQTE}(\tau \mid x) = Q_1(\tau \mid x) - Q_0(\tau \mid x), $$ $$ \mathrm{ATE} = E\{\mu_1(X) - \mu_0(X)\}, $$ and $$ \mathrm{QTE}(\tau) = Q^m_1(\tau) - Q^m_0(\tau), $$ where the superscript $m$ indicates standardization over the empirical covariate distribution in the full sample. The package also provides ATT and QTT analogues standardized over the treated covariate distribution. These definitions explain the structure of the causal API. Functions such as `ate()`, `qte()`, `cate()`, and `cqte()` are not fitting new models. They are post-processing a fitted causal object to summarize different posterior functionals. # Function map for the causal workflow The most important exported functions in this vignette are: - `dpmgpd.causal()` for spliced arm-specific outcome modeling. - `dpmix.causal()` for the bulk-only causal analogue. - `ate()` and `att()` for mean or restricted-mean treatment contrasts. - `qte()` and `qtt()` for marginal quantile contrasts. - `cate()` and `cqte()` for profile-specific conditional contrasts. - `predict()` for arm-wise predictive summaries and contrast-based summaries at new covariate profiles. - `summary()` and `plot()` for effect-object summaries and visualization. # Package setup ```{r libraries} library(CausalMixGPD) library(ggplot2) ``` ```{r mcmc-settings} mcmc_vig <- list( niter = 1200, nburnin = 300, thin = 2, nchains = 2, seed = 2026 ) # Settings used when regenerating `inst/extdata/causal_*.csv` (see tools/.Rscripts/build_vignette_fits.R). mcmc_out <- list( niter = 1200, nburnin = 300, thin = 2, nchains = 2, seed = 2027 ) mcmc_ps <- list( niter = 1000, nburnin = 250, thin = 2, nchains = 2, seed = 2028 ) ``` # Data We use the bundled causal dataset `causal_pos500_p3_k2`, which contains a positive outcome `y`, a binary treatment indicator `A`, and a predictor matrix `X`. ```{r load-data} data("causal_pos500_p3_k2", package = "CausalMixGPD") causal_dat <- data.frame( y = causal_pos500_p3_k2$y, A = causal_pos500_p3_k2$A, causal_pos500_p3_k2$X ) ``` A quick empirical comparison between the observed outcome distributions by treatment arm gives a sense of the modeling problem. ```{r exploratory-plots, fig.height=4.8, fig.width=7} ggplot(causal_dat, aes(x = y, colour = factor(A), fill = factor(A))) + geom_histogram(aes(y = after_stat(density)), bins = 25, alpha = 0.25, position = "identity") + labs( x = "y", y = "Density", colour = "A", fill = "A", title = "Observed outcome distributions by treatment arm" ) + theme_minimal() ``` Even in a simulated benchmark, the treated and control distributions can differ in ways that are not well summarized by a single average shift. That is exactly where the distributional causal framework is useful. # Fitting a causal spliced model The high-level causal wrapper is `dpmgpd.causal()`. It fits treatment-specific outcome models and, when requested, a propensity score stage. The software article presents this as the direct causal extension of the one-arm workflow. To keep vignette building stable, the numerical outputs shown below are read from small precomputed files shipped in `inst/extdata/`. The following chunk shows the package call you would use for a full fit; the tables and figures in this section are rendered from those shipped artifacts in hidden companion chunks. ```{r fit-causal-display, eval=FALSE} fit_causal <- dpmgpd.causal( formula = y ~ x1 + x2 + x3, data = causal_dat, treat = "A", backend = "crp", kernel = "gamma", components = 6, PS = "logit", ps_scale = "logit", ps_summary = "mean", mcmc_outcome = mcmc_out, mcmc_ps = mcmc_ps, parallel_arms = FALSE ) ``` The call matches the structure used to build the shipped causal summaries: arm-specific spliced outcome distributions, a gamma bulk kernel, a CRP backend for the mixture, and a logistic propensity score stage summarized on the logit scale. # Standardized causal summaries ## Average treatment effect The standardized ATE contrasts the arm-specific means after averaging over the empirical covariate distribution. ```{r ate-summary-display, eval=FALSE} ate_fit <- ate(fit_causal, interval = "hpd", level = 0.95, show_progress = FALSE) knitr::kable(summary(ate_fit)$effect_table, format = "html", digits = 4) ``` ```{r ate-summary, echo=FALSE} ate_tab <- read.csv(.pkg_extdata("causal_ate.csv")) knitr::kable(ate_tab, format = "html", digits = 4) ``` This table is intentionally simple. In the package, `ate()` returns an effect object with summary and plotting methods. The key interpretation point is that the reported effect is built from the arm-specific fitted distributions, not from a stand-alone regression coefficient. ## Quantile treatment effects Quantile treatment effects are often more informative than a single average contrast in skewed or heavy-tailed settings. ```{r qte-summary-display, eval=FALSE} qte_fit <- qte( fit_causal, probs = c(0.25, 0.50, 0.75, 0.90, 0.95), interval = "credible", level = 0.95, show_progress = FALSE ) knitr::kable(summary(qte_fit)$effect_table, format = "html", digits = 4) ``` ```{r qte-summary, echo=FALSE} qte_tab <- read.csv(.pkg_extdata("causal_qte.csv")) knitr::kable(qte_tab, format = "html", digits = 4) ``` ```{r qte-plot-display, eval=FALSE} qte_fit <- qte( fit_causal, probs = c(0.25, 0.50, 0.75, 0.90, 0.95), interval = "credible", level = 0.95, show_progress = FALSE ) plot(qte_fit, type = "effect") ``` ```{r qte-plot, echo=FALSE, fig.height=5, fig.width=7} plot( qte_tab$prob, qte_tab$estimate, type = "b", pch = 19, xlab = "Quantile level", ylab = "Estimated QTE", ylim = range(c(qte_tab$lower, qte_tab$upper)), main = "Quantile treatment effect curve" ) segments(qte_tab$prob, qte_tab$lower, qte_tab$prob, qte_tab$upper, lwd = 1.2) abline(h = 0, lty = 2, col = "grey50") ``` The QTE display shows how the treatment effect varies with the quantile level. This is one of the main scientific motivations for fitting treatment-specific distributions directly rather than relying on mean-only summaries. # Conditional causal summaries at representative profiles The conditional functions `cate()` and `cqte()` evaluate treatment contrasts at user-supplied covariate profiles. A common strategy is to build representative profiles from empirical quartiles of the predictors. ```{r newdata-grid} qs <- c(0.25, 0.50, 0.75) Xgrid <- expand.grid(lapply(causal_dat[, c("x1", "x2", "x3")], quantile, probs = qs)) head(Xgrid) ``` To illustrate the profile-specific summaries in a lightweight vignette, we reuse the package's shipped conditional quantile summaries. ```{r conditional-qte-table-display, eval=FALSE} Xgrid <- as.data.frame(lapply( causal_dat[, c("x1", "x2", "x3")], stats::quantile, probs = c(0.25, 0.50, 0.75), na.rm = TRUE )) rownames(Xgrid) <- c("low", "mid", "high") cqte_fit <- cqte( fit_causal, newdata = Xgrid, probs = c(0.25, 0.50, 0.75, 0.90, 0.95), interval = "credible", level = 0.95, show_progress = FALSE ) knitr::kable(summary(cqte_fit)$effect_table, format = "html", digits = 4) ``` ```{r conditional-qte-table, echo=FALSE} cond_q <- read.csv(.pkg_extdata("conditional_quantiles.csv")) knitr::kable(cond_q, format = "html", digits = 4) ``` The columns have the following meaning. - `id` indexes the covariate profile. - `index` is the quantile level $\tau$. - `estimate` is the posterior point estimate of the arm-contrast quantile summary. - `lower` and `upper` are the corresponding interval bounds. In a full causal fit, `cqte()` would return an effect object that can be plotted profile by profile. The important conceptual point is that these conditional quantile contrasts are produced by evaluating the fitted treated and control quantile functions at the same covariate profile and differencing the results. # Predictive interpretation The causal workflow also supports direct prediction. After fitting the causal object, `predict()` can be used to obtain arm-specific densities, survival probabilities, quantiles, means, or restricted means at new covariate profiles. For mean and quantile-type outputs, the treated-minus-control contrast is often the quantity of interest; for density and survival outputs, users frequently inspect the arm-specific predictions themselves. The predictive logic is the same as in the one-arm case. First, the model builds arm-specific conditional distributions. Second, desired posterior functionals are evaluated from those distributions. Third, contrasts are formed draw by draw and then summarized. # Analysis summary The causal analysis in `CausalMixGPD` is best understood as a structured post-processing layer on top of two fitted conditional outcome models. The ATE summarizes a standardized mean contrast, but the QTE and CQTE summaries reveal whether treatment effects differ across the outcome scale or across covariate profiles. In heavy-tailed problems, this distinction is often substantive rather than cosmetic. Two interventions may have similar average effects while behaving quite differently in the upper tail. The package's causal interface is especially useful in that situation because the same fitted object supports several scientifically relevant views of the treatment effect. Mean, restricted-mean, quantile, and conditional contrasts all remain tied to the same posterior predictive mechanism. # Customization options for causal models The package appendix lists the full causal customization surface; the most practically important arguments are summarized here. ## Outcome-model controls `kernel`, `backend`, `GPD`, `components`, and `param_specs` can each be supplied either as a single shared value or as arm-specific values of length two. This allows the treated and control outcome models to differ in kernel family, backend, truncation size, and prior structure. `dpmgpd.causal()` activates the spliced tail, while `dpmix.causal()` fits the bulk-only causal analogue. ## Propensity score controls `PS` selects the propensity score model. Common values are `"logit"`, `"probit"`, `"naive"`, and `FALSE`. `ps_scale` controls whether the propensity score enters the outcome model on the probability scale or the logit scale. `ps_summary` selects the posterior summary of the propensity score appended to the outcome regressors, typically `"mean"` or `"median"`. `ps_prior`, `ps_clamp`, and `include_intercept` provide additional control over propensity score estimation and numerical stability. ## MCMC controls The causal workflow separates iteration controls for the propensity score stage and the outcome stage through `mcmc_ps` and `mcmc_outcome`. `parallel_arms`, `workers`, and related runtime arguments can be used when the treated and control models are fitted in parallel. ## Effect and prediction controls `ate()`, `att()`, `cate()`, and related functions accept `type = "mean"` or `"rmean"` when the target is mean-based. `qte()`, `qtt()`, and `cqte()` use the argument `probs` to specify the quantile levels. Effect intervals are controlled by `interval` and `level`, just as in the one-arm workflow. `predict()` supports the same distributional output types as the one-arm interface, now interpreted in an arm-specific or contrast-based manner depending on the chosen output. # Session information ```{r session-info} sessionInfo() ```