---
title: "Using DAGassist for Diagnosis and Re-estimation"
author: "Graham Goff and Mike Denly"
date: "`r Sys.Date()`"
output: html_document
bibliography: references.bib
link-citations: true
vignette: >
  %\VignetteIndexEntry{Get Started}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r dev-load, include=FALSE}
# Prefer the source build when available (works in RStudio, pkgdown, or a local render)
if (requireNamespace("devtools", quietly = TRUE) &&
    file.exists(file.path("..", "DESCRIPTION"))) {
  # Don't error on CRAN/build machines that don't have devtools or the source path
  try(devtools::load_all("..", quiet = TRUE), silent = TRUE)
}

# If we've already loaded from source, avoid re-attaching a different installed build later
from_source <- try({
  "DAGassist" %in% loadedNamespaces() &&
    grepl(normalizePath(".."),
          getNamespaceInfo(asNamespace("DAGassist"), "path"),
          fixed = TRUE)
}, silent = TRUE)
from_source <- isTRUE(from_source)

# Feature gates (computed *after* attempting load_all)
has_show <- tryCatch({
  "show" %in% names(formals(DAGassist::DAGassist))
}, error = function(e) FALSE)

# Robust check: the dev build defines a private .report_dotwhisker helper
has_dotwhisker <- tryCatch({
  exists(".report_dotwhisker", envir = asNamespace("DAGassist"), inherits = FALSE)
}, error = function(e) FALSE)
```

## Introduction

*DAGassist* contains tools for using directed acyclic graphs (DAGs) to align regressions with an estimand and its identifying assumptions. DAGs are causal graphs that nonparametrically encode the relationships between a model's variables. For good introductory treatments of DAGs, see @Pearl1995, @Pearl2009, @HunermundEtAl2025, and @Elwert2013.

The *DAGassist* workflow has five steps: (1) declare an estimand; (2) draw a DAG; (3) classify control variables by role; (4) estimate models using DAG-consistent adjustment sets; and (5) recover the interpretable estimand. This guide provides an applied introduction to the *DAGassist* workflow.

## Step 1: Declare an Estimand

Declaring the estimand up front ensures that a study maintains a consistent quantity of interest for evaluation [@LundbergJohnsonStewart2021; @FindleyKikutaDenly2021]. Of course, some estimands may be more policy-relevant than others [@Deaton2010]. For the purpose of this guide, we are interested in the sample average treatment effect (SATE).

## Step 2: Draw a DAG

DAGs have three basic building blocks: variables, arrows, and missing arrows. In DAG terminology, variables are represented as nodes (or vertices), and arrows as edges (or arcs) [@TennantEtAl2021]. A missing arrow between two variables is equivalent to a strong null hypothesis: it asserts that neither variable directly causes the other.
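As a minimal sketch of these building blocks, consider a hypothetical three-variable graph (not the DAG used later in this guide) written in `dagitty` syntax: the nodes are the variables, the arrows are the assumed causal effects, and any pair of variables without an arrow is assumed to have no direct causal link. The chunk is shown but not evaluated.

```{r building-blocks, eval=FALSE}
library(dagitty)

# X -> Y is the causal effect of interest; Z -> X and Z -> Y make Z a confounder.
# The *absence* of an arrow directly between any other pair encodes the strong
# null hypothesis that neither variable directly causes the other.
toy_dag <- dagitty("dag { Z -> X; Z -> Y; X -> Y }")

# Inspect the encoded structure: parents of a node and paths between two nodes
parents(toy_dag, "Y")
paths(toy_dag, "X", "Y")
```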
Dataset summary statistics (click to expand) ```{r make-df, include=FALSE} simulate_toy_panel <- function( n_id = 1000, n_t = 5, seed = 20251225, p_band = c(0.001, 0.08), max_tries = 160 ) { stopifnot(length(p_band) == 2, p_band[1] < p_band[2]) clamp <- function(x, lo, hi) pmin(pmax(x, lo), hi) inv_logit <- function(z) 1 / (1 + exp(-z)) rmultinom1 <- function(prob_named) { levs <- names(prob_named) levs[which.max(stats::runif(1) <= cumsum(prob_named))] } set.seed(seed) id <- seq_len(n_id) gender <- factor(sample(c("Female", "Male"), n_id, replace = TRUE, prob = c(0.51, 0.49))) immigrant <- rbinom(n_id, 1, 0.13) urban <- rbinom(n_id, 1, 0.72) class_levels <- c("Low", "Working", "Middle", "Upper") class <- factor(sample(class_levels, n_id, replace = TRUE, prob = c(0.18, 0.38, 0.34, 0.10)), levels = class_levels, ordered = TRUE) class_num <- as.integer(class) - 1L age0 <- round(clamp(rnorm(n_id, mean = 35, sd = 16), 0, 100 - (n_t - 1)), 1) # ----------------------- # Contract type (baseline; affects edu_year, income, children) # ----------------------- contract_levels <- c("Informal", "Temporary", "Permanent") contract0 <- character(n_id) for (i in seq_len(n_id)) { cnum <- class_num[i] u <- urban[i] im <- immigrant[i] gF <- (gender[i] == "Female") a0 <- age0[i] # Scores: Permanent higher for higher class + urban; Informal higher for immigrant score <- c( Informal = 0.15 - 0.25*cnum + 0.10*im - 0.05*u + 0.02*(a0 - 35), Temporary = 0.05 + 0.05*cnum - 0.05*im + 0.02*u + 0.01*(a0 - 35), Permanent = -0.20 + 0.22*cnum - 0.08*im + 0.10*u - 0.02*gF + 0.01*(a0 - 35) ) prob <- exp(score) / sum(exp(score)) contract0[i] <- rmultinom1(prob) } contract <- factor(rep(contract0, each = n_t), levels = contract_levels) # Numeric index for linear effects (Informal=0, Temporary=1, Permanent=2) contract_num <- as.integer(factor(contract0, levels = contract_levels)) - 1L contract_num_p <- rep(contract_num, each = n_t) # ----------------------- # Preference for children (baseline; affects children only) # ----------------------- # Interpretable desired children: ~0 to 5, centered at ~2 pref0 <- clamp(rnorm(n_id, mean = 2.0, sd = 1.0), 0, 5) pref <- rep(pref0, each = n_t) # Centered version for linear predictors pref_c <- pref0 - 2.0 pref_c_p <- rep(pref_c, each = n_t) target_deg_adult <- c( HS_grad = 0.40, Some_college = 0.28, BA = 0.22, MA = 0.08, PhD_or_prof = 0.02 ) stopifnot(abs(sum(target_deg_adult) - 1) < 1e-8) edu_me_sd <- 0.70 income_intercept <- 10.45 income_sd <- 0.58 marriage_intercept <- -3.75 children_cap <- 12L edu_fert_scale <- 1.25 assign_deg_by_quantiles <- function(z, age0, target_deg_adult) { adult_idx <- which(age0 >= 25) qs <- stats::quantile( z[adult_idx], probs = cumsum(target_deg_adult)[1:4], names = FALSE, type = 8 ) assign_deg <- function(zv, cuts) { out <- rep("PhD_or_prof", length(zv)) out[zv < cuts[4]] <- "MA" out[zv < cuts[3]] <- "BA" out[zv < cuts[2]] <- "Some_college" out[zv < cuts[1]] <- "HS_grad" out } deg <- rep(NA_character_, length(z)) deg[adult_idx] <- assign_deg(z[adult_idx], qs) # 22–24: allow BA but NOT grad degrees idx_22_24 <- which(age0 >= 22 & age0 < 25) if (length(idx_22_24) > 0) { tmp <- assign_deg(z[idx_22_24], qs) tmp[tmp %in% c("MA", "PhD_or_prof")] <- "BA" deg[idx_22_24] <- tmp } # 18–21: allow Some_college but NOT BA+ idx_18_21 <- which(age0 >= 18 & age0 < 22) if (length(idx_18_21) > 0) { tmp <- assign_deg(z[idx_18_21], qs) tmp[tmp %in% c("BA", "MA", "PhD_or_prof")] <- "Some_college" deg[idx_18_21] <- tmp } # <18: in progress deg[age0 < 18] <- "In_progress" 
deg[is.na(deg)] <- "HS_grad" deg } # helper: map degree to target years of schooling degree_to_target_years <- function(deg, age0) { target_years <- numeric(length(deg)) # K-12 accumulation for in-progress kids/teens k12 <- clamp(age0 - 5, 0, 13) target_years[deg == "In_progress"] <- k12[deg == "In_progress"] target_years[deg == "HS_grad"] <- 12 + rnorm(sum(deg == "HS_grad"), 0, 0.30) target_years[deg == "Some_college"] <- 13.5 + rnorm(sum(deg == "Some_college"), 0, 0.45) target_years[deg == "BA"] <- 16 + rnorm(sum(deg == "BA"), 0, 0.55) target_years[deg == "MA"] <- 18 + rnorm(sum(deg == "MA"), 0, 0.65) target_years[deg == "PhD_or_prof"] <- 21 + rnorm(sum(deg == "PhD_or_prof"), 0, 0.70) # feasibility given age max_feasible <- clamp(age0 - 5, 0, 22) clamp(target_years, 0, max_feasible) } # ---- Try loop: redraw endogenous noise until calibration holds ---- for (try_idx in seq_len(max_tries)) { set.seed(seed + 7000 + try_idx) # Panel skeleton year <- rep(0:(n_t - 1), times = n_id) pid <- rep(id, each = n_t) age <- clamp(rep(age0, each = n_t) + year, 0, 100) gender_p <- rep(gender, each = n_t) immigrant_p <- rep(immigrant, each = n_t) urban_p <- rep(urban, each = n_t) class_p <- rep(class, each = n_t) class_num_p <- rep(class_num, each = n_t) # ----------------------- # Education propensity index z # ----------------------- abil <- rnorm(n_id, 0, 1) z <- 0.85 * abil + 0.55 * class_num + 0.20 * urban + (-0.10) * immigrant + 0.05 * (gender == "Female") + (-0.04) * (age0 - 35) + 0.22 * contract_num + # <--- contract raises schooling rnorm(n_id, 0, 0.35) # Robust degree assignment (adult shares forced; HS > BA forced) deg <- assign_deg_by_quantiles(z, age0, target_deg_adult) edu_degree <- factor(rep(deg, each = n_t), levels = c("In_progress","HS_grad","Some_college","BA","MA","PhD_or_prof")) # True baseline education from degree + feasibility edu0_true <- degree_to_target_years(deg, age0) edu_target <- clamp(edu0_true, 0, 22) # target is baseline level (then accrual for young) # Evolve education in panel (<25 can grow toward a slightly higher personal target) # Give some additional accumulation for young people (esp. HS->BA pathways). 
edu_target2 <- edu_target bump <- rep(0, n_id) bump[deg %in% c("HS_grad","Some_college","BA") & age0 < 23] <- rnorm(sum(deg %in% c("HS_grad","Some_college","BA") & age0 < 23), 1.2, 0.6) bump <- clamp(bump, 0, 3.0) edu_target2 <- clamp(edu_target2 + bump, 0, 22) edu_true <- numeric(length(pid)) edu_true[year == 0] <- edu0_true for (i in seq_len(n_id)) { idx <- which(pid == i) for (t in seq_along(idx)) { if (t == 1) next prev <- edu_true[idx[t - 1]] a <- age[idx[t]] if (a < 25 && prev < edu_target2[i]) { delta <- clamp(rnorm(1, mean = 1.0, sd = 0.25), 0, 1.2) edu_true[idx[t]] <- pmin(prev + delta, edu_target2[i]) } else { edu_true[idx[t]] <- prev } } } # observed edu_year (measurement error) edu_year <- clamp(edu_true + rnorm(length(pid), 0, edu_me_sd), 0, 22) # ----------------------- # Job stability (time-varying; treatment-induced intermediate Z) # - affected by edu_true (treatment) and baseline contract # - affects mediators (income, married, birth_control) and outcome (children) # ----------------------- rho_job <- 0.70 # persistence over time job_trait_i <- rnorm(n_id, 0, 0.8) # stable id-level trait job_stability_t <- numeric(length(pid)) for (i in seq_len(n_id)) { idx <- which(pid == i) # baseline level at t=0 base0 <- -0.25 + 0.10 * (edu_true[idx[1]] - 12) + 0.30 * contract_num[i] + 0.10 * class_num[i] + 0.12 * urban[i] + (-0.10) * immigrant[i] + 0.02 * (age0[i] - 35) + 0.05 * (gender[i] == "Female") + job_trait_i[i] + rnorm(1, 0, 0.45) job_stability_t[idx[1]] <- clamp(base0, -3, 3) if (length(idx) > 1) { for (t in 2:length(idx)) { prev <- job_stability_t[idx[t - 1]] # current-period innovation (edu affects job stability each period) innov <- -0.05 + 0.08 * (edu_true[idx[t]] - 12) + 0.08 * (age[idx[t]] - 35) / 10 + rnorm(1, 0, 0.50) job_stability_t[idx[t]] <- clamp(rho_job * prev + (1 - rho_job) * (base0 + innov), -3, 3) } } } # ----------------------- # Religion (depends on baseline edu_true + class + urban) # ----------------------- rel_levels <- c("Christian","Muslim","Hindu","Buddhist","Jewish","Unaffiliated","Other") religion0 <- character(n_id) for (i in seq_len(n_id)) { e <- edu0_true[i]; cnum <- class_num[i]; u <- urban[i] score <- c( Christian = 0.95 - 0.010*e - 0.02*u + 0.03*cnum, Muslim = -0.45 - 0.006*e + 0.02*u, Hindu = -1.55 + 0.004*e, Buddhist = -1.60 + 0.004*e, Jewish = -1.85 + 0.030*e + 0.04*cnum + 0.03*u, Unaffiliated = 0.10 + 0.045*e + 0.08*u, Other = -0.95 + 0.004*e ) prob <- exp(score) / sum(exp(score)) religion0[i] <- rmultinom1(prob) } religion <- factor(rep(religion0, each = n_t), levels = rel_levels) # ----------------------- # Married (no married < 17; edu delays slightly) # ----------------------- rel_marriage_effect <- setNames(c( Christian = 0.18, Muslim = 0.20, Hindu = 0.10, Buddhist = 0.06, Jewish = 0.10, Unaffiliated = -0.15, Other = 0.00 ), rel_levels) lp_married <- marriage_intercept + 0.22 * (age - 18) + (-0.00115) * (age - 38)^2 + (-0.012) * edu_true + 0.35 * job_stability_t + # <--- add rel_marriage_effect[as.character(religion)] married <- rbinom(length(pid), 1, inv_logit(lp_married)) married[age < 17] <- 0L # ----------------------- # Birth control (edu increases use; marriage increases use) # ----------------------- rel_bc_effect <- setNames(c( Christian = -0.05, Muslim = -0.20, Hindu = -0.10, Buddhist = -0.05, Jewish = 0.05, Unaffiliated = 0.15, Other = 0.00 ), rel_levels) lp_bc <- -1.05 + 0.085 * edu_true + 0.55 * married + 0.25 * job_stability_t + # <--- add 0.04 * (age - 18) - 0.0009 * (age - 30)^2 + rel_bc_effect[as.character(religion)] 
birth_control <- rbinom(length(pid), 1, inv_logit(lp_bc)) # ----------------------- # Income (US-like) # ----------------------- lp_inc <- income_intercept + 0.060 * (edu_true - 12) + 0.12 * urban_p + (-0.05) * immigrant_p + 0.10 * class_num_p + (-0.03) * (gender_p == "Female") + 0.030 * (age - 25) - 0.00055 * (age - 45)^2 + 0.18 * contract_num_p + # from earlier 0.22 * job_stability_t income <- exp(rnorm(length(pid), mean = lp_inc, sd = income_sd)) income[age < 16] <- exp(rnorm(sum(age < 16), mean = 8.45, sd = 0.25)) # ----------------------- # Children: degree gradient + negative edu effect, with dispersion # ----------------------- deg_fert_effect <- c( In_progress = 0.00, HS_grad = 0.00, Some_college = -0.06, BA = -0.10, MA = -0.12, PhD_or_prof = -0.14 ) deg_eff_i <- deg_fert_effect[deg] fert_i <- rnorm(n_id, 0, 0.95) fert_pref <- rep(fert_i, each = n_t) rel_fert_effect <- setNames(c( Christian = 0.10, Muslim = 0.18, Hindu = 0.08, Buddhist = -0.05, Jewish = -0.02, Unaffiliated = -0.12, Other = 0.00 ), rel_levels) idx0 <- which(year == 0) job0 <- job_stability_t[idx0] rel0 <- rel_fert_effect[religion0] mar0 <- married[idx0] bc0 <- birth_control[idx0] inc0 <- income[idx0] a0 <- age0 age_term0 <- ifelse(a0 < 15, -10, 0.11 * (pmin(a0, 42) - 18) - 0.0022 * (pmin(a0, 42) - 28)^2) beta_edu_children0 <- (-0.020 * edu_fert_scale) mu_children0 <- exp( -0.85 + age_term0 + 0.55 * mar0 + (-0.45) * bc0 + beta_edu_children0 * edu0_true + deg_eff_i + (-0.0000007) * inc0 + 0.04 * class_num + 0.18 * contract_num + 0.45 * pref_c + 0.25 * job0 + # <--- add rel0 + fert_i ) children0 <- rnbinom(n_id, size = 1.3, mu = mu_children0) children0[a0 < 15] <- 0L children0 <- pmin(children0, children_cap) fertile <- (age >= 15 & age <= 45) beta_edu_births <- (-0.008 * edu_fert_scale) lp_birth <- -3.35 + 0.80 * married + (-0.90) * birth_control + beta_edu_births * edu_true + rep(deg_eff_i, each = n_t) + 0.02 * (age - 22) - 0.0010 * (age - 30)^2 + rel_fert_effect[as.character(religion)] + 0.12 * contract_num_p + 0.35 * pref_c_p + 0.18 * job_stability_t + # <--- add fert_pref births <- rpois(length(pid), exp(lp_birth) * fertile) births <- pmin(births, 2L) children <- numeric(length(pid)) for (i in seq_len(n_id)) { idx <- which(pid == i) children[idx] <- children0[i] + cumsum(births[idx]) } children <- pmin(children, children_cap) # Assemble df <- data.frame( id = pid, year = year, age = age, gender = gender_p, immigrant = factor(immigrant_p, levels = c(0, 1), labels = c("No", "Yes")), urban = factor(urban_p, levels = c(0, 1), labels = c("Rural", "Urban")), class = class_p, religion = religion, contract = contract, pref = pref, edu_year = round(edu_year, 1), edu_degree = edu_degree, married = married, birth_control = birth_control, income = round(income, 0), children = children, job_stability_t = job_stability_t ) # ----------------------- # Calibration: LAST WAVE, age-adjusted edu_year # ----------------------- dL <- df[df$year == max(df$year), ] fit <- lm(children ~ edu_year + age, data = dL) cc <- summary(fit)$coefficients est <- unname(cc["edu_year", "Estimate"]) pval <- unname(cc["edu_year", "Pr(>|t|)"]) if (is.na(pval)) pval <- 0 lo <- p_band[1]; hi <- p_band[2] ok <- (est < 0) && (pval >= lo) && (pval <= hi) if (ok || try_idx == max_tries) { attr(df, "pval_last_wave_age_adj") <- pval attr(df, "edu_est_last_wave_age_adj") <- est attr(df, "edu_fert_scale_used") <- edu_fert_scale attr(df, "edu_me_sd_used") <- edu_me_sd attr(df, "tries") <- try_idx return(df) } # If too weak / wrong sign: strengthen negative edu 
effect if (est >= 0 || pval > hi) edu_fert_scale <- edu_fert_scale * 1.12 # If too significant: weaken edu effect and add measurement error if (pval < lo) { edu_fert_scale <- edu_fert_scale * 0.85 edu_me_sd <- min(edu_me_sd * 1.08, 2.0) } } } # ---- Example usage ---- dat <- simulate_toy_panel(n_id = 1000, n_t = 5, seed = 20251225) d0 <- subset(dat, year == 0) dL <- subset(dat, year == max(year)) # Calibration check (what we target) summary(lm(children ~ edu_year + age, data = dL)) attr(dat, "pval_last_wave_age_adj") attr(dat, "edu_est_last_wave_age_adj") attr(dat, "tries") # Attainment realism (adult-only) d0_adult <- subset(d0, age >= 25 & edu_degree != "In_progress") prop.table(table(d0_adult$edu_degree)) # Degree-type result (adult, completed degrees) dA <- subset(dL, age >= 25 & edu_degree != "In_progress") dA$edu_degree <- relevel(dA$edu_degree, "HS_grad") df <- dat ``` ```{r summary, echo=FALSE} toy_summary_split <- function(dat, top_k = 3, digits = 2) { stopifnot(is.data.frame(dat)) fmt <- function(x) { ifelse(is.na(x), NA_character_, format(round(x, digits), nsmall = digits, trim = TRUE)) } num_table <- function(df) { out <- lapply(names(df), function(nm) { x <- df[[nm]] s <- summary(x) data.frame( variable = nm, type = class(x)[1], Min = fmt(unname(s["Min."])), Q1 = fmt(unname(s["1st Qu."])), Median = fmt(unname(s["Median"])), Mean = fmt(unname(s["Mean"])), Q3 = fmt(unname(s["3rd Qu."])), Max = fmt(unname(s["Max."])), stringsAsFactors = FALSE ) }) res <- do.call(rbind, out) rownames(res) <- NULL res } fac_table <- function(df) { out <- lapply(names(df), function(nm) { x <- df[[nm]] if (is.logical(x)) x <- factor(x, levels = c(FALSE, TRUE)) if (is.character(x)) x <- factor(x) x <- as.factor(x) tab <- sort(table(x, useNA = "no"), decreasing = TRUE) k <- min(top_k, length(tab)) top <- tab[seq_len(k)] rest <- sum(tab) - sum(top) parts <- paste0(names(top), ":", as.integer(top)) if (rest > 0) parts <- c(parts, paste0("(Other):", as.integer(rest))) data.frame( variable = nm, type = class(df[[nm]])[1], top_levels = paste(parts, collapse = " "), stringsAsFactors = FALSE ) }) res <- do.call(rbind, out) rownames(res) <- NULL res } is_cat <- vapply(dat, function(x) is.factor(x) || is.character(x) || is.logical(x), logical(1)) num_df <- dat[!is_cat] cat_df <- dat[is_cat] list( numeric = if (ncol(num_df)) num_table(num_df) else data.frame(), categorical = if (ncol(cat_df)) fac_table(cat_df) else data.frame() ) } # vignette usage: tabs <- toy_summary_split(dat, top_k = 3) knitr::kable(tabs$numeric, align = "l") knitr::kable(tabs$categorical, align = "l") ```
```{r example-dag, echo=FALSE, message=FALSE, warning=FALSE, fig.width=8, fig.height=5, fig.cap="*Example: The Causal Effects of Family Background and Life Course Events on Fertility Patterns*"}
library(dagitty)
library(ggdag)
library(tidyverse)

x_pos <- c(
  children = 10, income = 5, edu_year = 0, urban = 10, immigrant = 10,
  class = 3, birth_control = 8, married = 5, gender = 3, age = 7,
  pref = 12, contract = 5, job_stability_t = 7
)
y_pos <- c(
  children = 0, income = 0.5, edu_year = 0, urban = 2, immigrant = -2,
  class = -4, birth_control = -4, married = -4, gender = 2, age = 2,
  pref = 0, contract = 2, job_stability_t = -2
)

dag_model <- dagify(
  children ~ income + edu_year + urban + immigrant + class + religion +
    birth_control + married + gender + age + pref + contract + job_stability_t,
  edu_year ~ class + immigrant + urban + gender + age + contract,
  job_stability_t ~ edu_year + contract + class + urban + immigrant + age + gender,
  income ~ edu_year + urban + immigrant + class + gender + age + contract + job_stability_t,
  birth_control ~ edu_year + religion + married + age + job_stability_t,
  married ~ edu_year + religion + age + job_stability_t,
  coords = list(x = x_pos, y = y_pos),
  exposure = "edu_year",
  outcome = "children",
  labels = c(
    children = "Children",
    income = "Income",
    edu_year = "Education",
    urban = "Urban/Rural",
    immigrant = "Immigrant",
    class = "Parents' Class",
    birth_control = "Birth Control",
    married = "Married",
    gender = "Gender",
    age = "Age",
    pref = "Preference\nfor Children",
    contract = "Contract Type",
    job_stability_t = "Job Stability"
  )
)

# Check that the object is a valid, acyclic DAG
# is.dagitty(dag_model)
# dagitty::isAcyclic(dag_model)
# dagitty::findCycle(dag_model)

ggdag_status(dag_model, text = FALSE, use_labels = "label") +
  geom_dag_label_repel(aes(label = NA)) +
  theme_dag() +
  scale_fill_manual(
    values = c(exposure = "black", outcome = "grey30", none = "grey90")
  ) +
  scale_color_manual(
    values = c(exposure = "black", outcome = "grey30", none = "grey60")
  ) +
  theme(legend.position = "none")
```

For the purpose of this guide, we visualize a common social science question: how does education affect fertility [@MorganWinship2015]? The DAG encodes a plausible, but not exhaustive, set of covariates.

## Step 3: Classify Control Variables by Role

```{r roles-table}
DAGassist(dag_model, show = "roles")
```

**Interpreting the roles table:**

- **ROLES:** `DAGassist` classifies the variables in your formula by causal role, based on the relationships in your DAG. It uses the following categories:
  - **X** is the `treatment` / `independent variable` / `exposure`.
  - **Y** is the `outcome` / `dependent variable`.
  - **conf** stands for `confounder`, a common cause of X and Y. Confounders create a spurious association between X and Y and must be adjusted for.
  - **med** stands for `mediator`, a variable that lies on a path from X to Y and transmits some of the effect of X on Y. Mediators should not be adjusted for when estimating the total effect of X on Y.
  - **col** stands for `collider`, a direct common descendant of X and Y. Colliders already block the paths they sit on, so adjusting for them opens a spurious association between X and Y.
  - **IO** stands for `intermediate outcome`, a descendant of Y, which introduces bias if adjusted for.
  - **dMed / Dmediator** stands for `descendant of a mediator`, which should not be adjusted for when estimating the total effect.
  - **dCol / Dcollider** stands for `descendant of a collider`. Adjusting for a descendant of a collider opens a spurious association between X and Y.
  - **other** variables, such as those that only affect the outcome, do not fit any of the previous definitions. They are included in the canonical model because they can be safely used as controls, but they are omitted from the minimal model because their inclusion is not strictly necessary.

Since `other` variables' effects are generally neutral, it is usually best to use the minimal adjustment set as your baseline model. These roles also determine the adjustment sets used in the next step; the sketch below shows how to cross-check them directly against the DAG.
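As an optional cross-check (not part of the `DAGassist` output itself), `dagitty`, which is already loaded for the DAG above, can list the adjustment sets implied by `dag_model`. Under the DAG as specified here, these should correspond to the minimal and canonical sets reported in the model comparison in the next step. The chunk is shown but not evaluated.

```{r adjustment-sets-check, eval=FALSE}
# Minimal back-door adjustment set(s): confounder-only controls
dagitty::adjustmentSets(dag_model, exposure = "edu_year", outcome = "children",
                        type = "minimal")

# Canonical adjustment set: all controls that can be safely included
dagitty::adjustmentSets(dag_model, exposure = "edu_year", outcome = "children",
                        type = "canonical")
```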
## Step 4: Estimate Models Using DAG-Consistent Adjustment Sets

```{r model-comparison}
DAGassist(dag_model,
          formula = lm(children ~ edu_year + age + class + gender + immigrant +
                         urban + birth_control + income + married +
                         job_stability_t + contract + pref,
                       data = dat))
```

**Interpreting the model comparison table:**

- **MODEL COMPARISON:**
  - `Minimal` is the smallest adjustment set necessary to close all back-door paths between the treatment and the outcome. The minimal set includes only `confounders` as controls.
  - `Canonical` is the largest permissible adjustment set. Essentially, the `canonical` set contains every control variable that can be safely included, that is, everything except `mediators`, `colliders`, `intermediate outcomes`, `descendants of mediators`, and `descendants of colliders`.

## Step 5: Recover the Interpretable Estimand

```{r sate}
DAGassist(dag_model,
          formula = lm(children ~ edu_year + age + class + gender + immigrant +
                         urban + birth_control + income + married +
                         job_stability_t + contract + pref,
                       data = dat),
          estimand = "SATE")
```

In some cases, the target estimand is the average controlled direct effect. `DAGassist` supports recovering the controlled direct effect using sequential g-estimation via integration with the `DirectEffects` R package. Continuing the prior example, we can use `DAGassist` to estimate the effect of years of education on a person's number of children that does not operate through birth control, income, and marital status.

```{r cde-est, warning=FALSE, fig.width=8, fig.height=5, fig.cap="*Visualizing all estimands*"}
library(DirectEffects)

DAGassist(dag_model,
          formula = lm(children ~ edu_year + age + class + gender + immigrant +
                         urban + birth_control + income + married +
                         job_stability_t + contract + pref,
                       data = dat),
          estimand = c("SATE", "SACDE"),
          type = "dotwhisker")
```

## Export Publication-Grade Reports

To export `DAGassist` reports as files, users must first install a few commonly used packages. The dependencies vary by export file type:

- `modelsummary` builds the **model comparison** table for **LaTeX**, **Word**, **Excel**, and **plaintext** output.
  - For LaTeX, `broom` is used as a fallback for report generation.
- `knitr` builds the intermediate .md files for **Word** and **plaintext** report generation.
- `rmarkdown` converts the intermediate .md files to .docx for **Word** reports.
- `writexl` exports **Excel** files.

In short, to export:

- **LaTeX** only needs `modelsummary`;
- **Excel** needs `modelsummary` and `writexl`;
- **plaintext** needs `modelsummary` and `knitr`;
- **Word** needs `modelsummary`, `knitr`, and `rmarkdown`.

## References