--- title: "Simulation Settings for Feature Importance Methods" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Simulation Settings for Feature Importance Methods} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5 ) ``` ```{r setup} library(xplainfi) library(DiagrammeR) library(mlr3learners) set.seed(123) ``` ## Introduction The `xplainfi` package provides several data generating processes (DGPs) designed to illustrate specific strengths and weaknesses of different feature importance methods. Each DGP focuses on one primary challenge to make the differences between methods clear. This article provides a comprehensive overview of all simulation settings, including their mathematical formulations and causal structures visualized as directed acyclic graphs (DAGs). ## Overview of Simulation Settings ```{r overview-table, echo=FALSE} dgp_overview <- data.frame( DGP = c( "sim_dgp_correlated", "sim_dgp_mediated", "sim_dgp_confounded", "sim_dgp_interactions", "sim_dgp_independent", "sim_dgp_ewald" ), Challenge = c( "Spurious correlation", "Mediation effects", "Confounding", "Interaction effects", "Baseline (no challenges)", "Mixed effects" ), `PFI Behavior` = c( "High for spurious x2", "Shows total effects", "Biased upward", "Low (no main effects)", "Accurate", "Mixed" ), `CFI Behavior` = c( "Low for spurious x2", "Shows direct effects", "Less biased", "High (captures interactions)", "Accurate", "Mixed" ), check.names = FALSE ) knitr::kable(dgp_overview, caption = "Overview of simulation settings and expected method behavior") ``` ## 1. Correlated Features DGP This DGP creates a highly correlated spurious predictor to illustrate the fundamental difference between marginal and conditional importance methods. ### Mathematical Model $$(X_1, X_2)^T \sim \text{MVN}(0, \Sigma)$$ where $\Sigma$ is a $2 \times 2$ covariance matrix with 1 on the diagonal and correlation $r$ (default 0.9) on the off-diagonal. $$X_3 \sim N(0,1), \quad X_4 \sim N(0,1)$$ $$Y = 2 \cdot X_1 + X_3 + \varepsilon$$ where $\varepsilon \sim N(0, 0.2^2)$. ### Causal Structure ```{r dag-correlated, echo=FALSE, fig.cap="DAG for correlated features DGP", fig.width=10, fig.height=4} #| fig.alt: "Directed acyclic graph with four feature nodes (X1-X4) and outcome Y. Arrows show causal paths from X1 and X3 to Y, with an edge between X1 and X2 showing r approx. 0.9" grViz( " digraph Correlated { rankdir=LR; graph [ranksep=1.5]; node [shape=circle, style=filled, fontsize=14, width=1.2]; X1 [fillcolor='lightcoral', label='X₁\n(β=2.0)']; X2 [fillcolor='pink', label='X₂\n(β=0)']; X3 [fillcolor='lightblue', label='X₃\n(β=1.0)']; X4 [fillcolor='lightgray', label='X₄\n(β=0)']; Y [fillcolor='greenyellow', label='Y', width=1.5]; X1 -> X2 [color=red, style=bold, label='r≈0.9']; X1 -> Y [label='2.0']; X2 -> Y [style=dashed, color=gray, label='0']; X3 -> Y [label='1.0']; X4 -> Y [style=dashed, color=gray]; {rank=source; X1; X3; X4} {rank=same; X2} {rank=sink; Y} }" ) ``` ### Usage Example ```{r correlated-example} set.seed(123) task <- sim_dgp_correlated(n = 500) # Check correlation between X1 and X2 cor(task$data()[, c("x1", "x2")]) # True coefficients: x1=2.0, x2=0, x3=1.0, x4=0 # Note: x2 is highly correlated with x1 but has NO causal effect! ``` ### Expected Behavior - **Marginal methods (PFI, Marginal SAGE)**: Will falsely assign high importance to x2 because permuting it breaks the correlation with x1, creating unrealistic data that confuses the model - **Conditional methods (CFI, Conditional SAGE)**: Should correctly assign near-zero importance to x2 because conditional sampling preserves the correlation, revealing that x2 adds no information beyond what x1 provides - **Key insight**: x2 is a spurious predictor - it appears predictive due to correlation with x1 but has no causal effect on y ## 2. Mediated Effects DGP This DGP demonstrates the difference between total and direct causal effects. Some features affect the outcome only through mediators. ### Mathematical Model $$\text{exposure} \sim N(0,1), \quad \text{direct} \sim N(0,1)$$ $$\text{mediator} = 0.8 \cdot \text{exposure} + 0.6 \cdot \text{direct} + \varepsilon_m$$ $$Y = 1.5 \cdot \text{mediator} + 0.5 \cdot \text{direct} + \varepsilon$$ where $\varepsilon_m \sim N(0, 0.3^2)$ and $\varepsilon \sim N(0, 0.2^2)$. ### Causal Structure ```{r dag-mediated, echo=FALSE, fig.cap="DAG for mediated effects DGP", fig.width=10, fig.height=4} #| fig.alt: "Directed acyclic graph showing mediation structure. Exposure connects to mediator, which connects to Y. Direct connects to both mediator and Y. Noise node has dashed arrow to Y." grViz( " digraph Mediated { rankdir=LR; graph [ranksep=1.2]; node [shape=circle, style=filled, fontsize=14, width=1.2]; E [fillcolor='orange', label='Exposure\n(β=0)']; D [fillcolor='lightblue', label='Direct\n(β=0.5)']; M [fillcolor='yellow', label='Mediator\n(β=1.5)']; N [fillcolor='lightgray', label='Noise\n(β=0)']; Y [fillcolor='greenyellow', label='Y', width=1.5]; E -> M [label='0.8', color=purple, penwidth=2]; D -> M [label='0.6', color=blue]; D -> Y [label='0.5', color=blue]; M -> Y [label='1.5', color=purple, penwidth=2]; N -> Y [style=dashed, color=gray]; {rank=source; E; D; N} {rank=same; M} {rank=sink; Y} }" ) ``` ### Usage Example ```{r mediated-example} set.seed(123) task <- sim_dgp_mediated(n = 500) # Calculate total effect of exposure # Total effect = 0.8 * 1.5 = 1.2 (through mediator) # Direct effect = 0 (no direct path to Y) ``` ### Expected Behavior - **PFI**: Shows total effects (exposure appears important with effect approximately 1.2) - **CFI**: Shows direct effects (exposure appears unimportant when conditioning on mediator) - **RFI with mediator**: Should show direct effects similar to CFI ## 3. Confounding DGP This DGP includes a confounder that affects both a feature and the outcome. ### Mathematical Model $$H \sim N(0,1) \quad \text{(confounder)}$$ $$X_1 = H + \varepsilon_1$$ $$\text{proxy} = H + \varepsilon_p, \quad \text{independent} \sim N(0,1)$$ $$Y = H + X_1 + \text{independent} + \varepsilon$$ where all $\varepsilon \sim N(0, 0.5^2)$ independently. ### Causal Structure ```{r dag-confounded, echo=FALSE, fig.cap="DAG for confounding DGP", fig.width=10, fig.height=5} #| fig.alt: "Directed acyclic graph with hidden confounder H (dashed circle) connecting to X1, proxy, and Y. X1 and independent also have arrows to Y." grViz( " digraph Confounded { rankdir=LR; graph [ranksep=1.2, nodesep=0.8]; node [shape=circle, style=filled, fontsize=14, width=1.2]; H [fillcolor='red', label='H\n(Confounder)', style='filled,dashed']; X1 [fillcolor='lightcoral', label='X₁\n(β=1.0)']; P [fillcolor='pink', label='Proxy\n(β=0)']; I [fillcolor='lightblue', label='Independent\n(β=1.0)']; Y [fillcolor='greenyellow', label='Y', width=1.5]; H -> X1 [color=red, label='1.0']; H -> P [color=red, style=dashed, label='1.0']; H -> Y [color=red, label='1.0', penwidth=2]; X1 -> Y [label='1.0']; I -> Y [label='1.0']; {rank=source; H} {rank=same; X1; P; I} {rank=sink; Y} }" ) ``` ### Usage Example ```{r confounded-example} set.seed(123) # Hidden confounder scenario (default) task_hidden <- sim_dgp_confounded(n = 500, hidden = TRUE) task_hidden$feature_names # proxy available but not confounder # Observable confounder scenario task_observed <- sim_dgp_confounded(n = 500, hidden = FALSE) task_observed$feature_names # both confounder and proxy available ``` ### Expected Behavior - **PFI**: Will show inflated importance for x1 due to confounding (the confounder H contributes to x1's predictive power beyond its direct effect) - **CFI**: Should partially account for confounding through conditional sampling - **RFI conditioning on proxy**: Should reduce confounding bias by conditioning on the noisy measurement of the confounder ## 4. Interaction Effects DGP This DGP demonstrates a pure interaction effect where features have no main effects. ### Mathematical Model $$Y = 2 \cdot X_1 \cdot X_2 + X_3 + \varepsilon$$ where $X_j \sim N(0,1)$ independently and $\varepsilon \sim N(0, 0.5^2)$. ### Causal Structure ```{r dag-interactions, echo=FALSE, fig.cap="DAG for interaction effects DGP", fig.width=10, fig.height=4} #| fig.alt: "Directed acyclic graph with X1 and X2 connecting to a diamond-shaped interaction node, which connects to Y. X3 connects directly to Y. N1 and N2 have dashed arrows to Y." grViz( " digraph Interaction { rankdir=LR; graph [ranksep=1.2]; node [shape=circle, style=filled, fontsize=14, width=1.2]; X1 [fillcolor='orange', label='X₁\n(β=0)']; X2 [fillcolor='orange', label='X₂\n(β=0)']; X3 [fillcolor='lightblue', label='X₃\n(β=1.0)']; N1 [fillcolor='lightgray', label='N₁\n(β=0)']; N2 [fillcolor='lightgray', label='N₂\n(β=0)']; Y [fillcolor='greenyellow', label='Y', width=1.5]; INT [fillcolor='red', shape=diamond, label='X₁×X₂\n(β=2.0)', width=1.5]; X1 -> INT [color=red, penwidth=2]; X2 -> INT [color=red, penwidth=2]; INT -> Y [color=red, label='2.0', penwidth=2]; X3 -> Y [label='1.0']; N1 -> Y [style=dashed, color=gray]; N2 -> Y [style=dashed, color=gray]; {rank=source; X1; X2; X3; N1; N2} {rank=same; INT} {rank=sink; Y} }" ) ``` ### Usage Example ```{r interactions-example} set.seed(123) task <- sim_dgp_interactions(n = 500) # Note: X1 and X2 have NO main effects # Their importance comes ONLY through their interaction ``` ### Expected Behavior - **PFI**: Should assign near-zero importance to x1 and x2 (no marginal effect) - **CFI**: Should capture the interaction and assign high importance to x1 and x2 - **LOCO**: May show high importance for x1 and x2 (removing either breaks the interaction) - **LOCI**: Should show near-zero importance for x1 and x2 (individually useless) ## 5. Independent Features DGP (Baseline) This is a baseline scenario where all features are independent and their effects are additive. All importance methods should give similar results. ### Mathematical Model $$Y = 2.0 \cdot X_1 + 1.0 \cdot X_2 + 0.5 \cdot X_3 + \varepsilon$$ where $X_j \sim N(0,1)$ independently and $\varepsilon \sim N(0, 0.2^2)$. ### Causal Structure ```{r dag-independent, echo=FALSE, fig.cap="DAG for independent features DGP", fig.width=10, fig.height=4} #| fig.alt: "Directed acyclic graph with five feature nodes in a row, each with arrows pointing to outcome Y. X1-X3 have solid arrows with effect sizes labeled, N1-N2 have dashed arrows." grViz( " digraph Independent { rankdir=LR; graph [ranksep=1.5]; node [shape=circle, style=filled, fontsize=14, width=1.2]; X1 [fillcolor='lightblue', label='X₁\n(β=2.0)']; X2 [fillcolor='lightblue', label='X₂\n(β=1.0)']; X3 [fillcolor='lightblue', label='X₃\n(β=0.5)']; N1 [fillcolor='lightgray', label='N₁\n(β=0)']; N2 [fillcolor='lightgray', label='N₂\n(β=0)']; Y [fillcolor='greenyellow', label='Y', width=1.5]; X1 -> Y [label='2.0', penwidth=3]; X2 -> Y [label='1.0', penwidth=2]; X3 -> Y [label='0.5']; N1 -> Y [style=dashed, color=gray]; N2 -> Y [style=dashed, color=gray]; {rank=source; X1; X2; X3; N1; N2} {rank=sink; Y} }" ) ``` ### Usage Example ```{r independent-example} set.seed(123) task <- sim_dgp_independent(n = 500) # All methods should rank features consistently: # important1 > important2 > important3 > unimportant1,2 (approx. 0) ``` ### Expected Behavior - **All methods**: Should rank features consistently by their true effect sizes - **Ground truth**: important1 (2.0) > important2 (1.0) > important3 (0.5) > unimportant1,2 (0) ## 6. Ewald et al. (2024) DGP Reproduces the data generating process from Ewald et al. (2024) for benchmarking feature importance methods. Includes correlated features and interaction effects. ### Mathematical Model $$X_1, X_3, X_5 \sim \text{Uniform}(0,1)$$ $$X_2 = X_1 + \varepsilon_2, \quad \varepsilon_2 \sim N(0, 0.001)$$ $$X_4 = X_3 + \varepsilon_4, \quad \varepsilon_4 \sim N(0, 0.1)$$ $$Y = X_4 + X_5 + X_4 \cdot X_5 + \varepsilon, \quad \varepsilon \sim N(0, 0.1)$$ ### Causal Structure ```{r dag-ewald, echo=FALSE, fig.cap="DAG for Ewald et al. (2024) DGP", fig.width=10, fig.height=4} #| fig.alt: "Directed acyclic graph with X1-X2 correlation, X3-X4 correlation. X4 and X5 connect to a diamond interaction node and directly to Y." grViz( " digraph Ewald { rankdir=LR; graph [ranksep=1.2]; node [shape=circle, style=filled, fontsize=14, width=1.2]; X1 [fillcolor='lightgray', label='X₁\n(β=0)']; X2 [fillcolor='lightgray', label='X₂\n(β=0)']; X3 [fillcolor='lightgray', label='X₃\n(β=0)']; X4 [fillcolor='lightblue', label='X₄\n(β=1.0)']; X5 [fillcolor='lightblue', label='X₅\n(β=1.0)']; Y [fillcolor='greenyellow', label='Y', width=1.5]; INT [fillcolor='red', shape=diamond, label='X₄×X₅\n(β=1.0)', width=1.5]; X1 -> X2 [color=gray, label='≈1.0']; X3 -> X4 [color=gray, label='≈1.0']; X4 -> Y [label='1.0']; X5 -> Y [label='1.0']; X4 -> INT [color=red]; X5 -> INT [color=red]; INT -> Y [color=red, label='1.0']; {rank=source; X1; X3; X5} {rank=same; X2; X4} {rank=same; INT} {rank=sink; Y} }" ) ``` ### Usage Example ```{r ewald-example} sim_dgp_ewald(n = 500) ```