\documentclass[12pt]{article} \usepackage{fullpage} \usepackage{natbib} \usepackage{amssymb} \usepackage{hyperref} \usepackage{graphicx} \usepackage[utf8]{inputenc} \usepackage{amsmath} \usepackage{amsthm} \usepackage{bbm} %\VignetteIndexEntry{Using causaldrf} %\VignetteDepends{} %\VignetteKeyWords{} %\VignettePackage{causaldrf} %\VignetteEngine{knitr::knitr} \newcommand{\bma}[1]{\mbox{\boldmath $#1$}} \newcommand{\proglang}[1]{\texttt{#1}} \newcommand{\code}[1]{\texttt{#1}} \newcommand{\pkg}[1]{\texttt{#1}} % \linespread{1.8} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % custom abbreviations for this document \newcommand{\be}{\begin{equation}} \newcommand{\bea}{\begin{eqnarray}} \newcommand{\ee}{\end{equation}} \newcommand{\eea}{\end{eqnarray}} \newcommand{\nn}{\nonumber} \newcommand{\bA}{{\mbox{\boldmath $A$}}} \newcommand{\bX}{{\mbox{\boldmath $X$}}} \newcommand{\bY}{{\mbox{\boldmath $Y$}}} \newcommand{\bT}{{\mbox{\boldmath $T$}}} \newcommand{\bb}{{\mbox{\boldmath $b$}}} \newcommand{\bB}{{\mbox{\boldmath $B$}}} \newcommand{\bW}{{\mbox{\boldmath $W$}}} \newcommand{\bmu}{{\mbox{\boldmath $\mu$}}} \newcommand{\bnu}{{\mbox{\boldmath $\nu$}}} \newcommand{\bbeta}{{\mbox{\boldmath $\beta$}}} \newcommand{\balpha}{{\mbox{\boldmath $\alpha$}}} \newcommand{\bgamma}{{\mbox{\boldmath $\gamma$}}} \newcommand{\btheta}{{\mbox{\boldmath $\theta$}}} \newcommand{\bphi}{{\mbox{\boldmath $\phi$}}} \newcommand{\bxi}{{\mbox{\boldmath $\xi$}}} \newcommand{\bpsi}{{\mbox{\boldmath $\psi$}}} \newcommand{\bpi}{{\mbox{\boldmath $\pi$}}} \newcommand{\bdelta}{{\mbox{\boldmath $\delta$}}} \newcommand{\bSigma}{{\mbox{\boldmath $\Sigma$}}} \newcommand{\bGamma}{{\mbox{\boldmath $\Gamma$}}} \newcommand{\bOmega}{{\mbox{\boldmath $\Omega$}}} \newcommand{\bI}{{\mbox{\boldmath $I$}}} \newcommand{\bzero}{{\mbox{\boldmath $0$}}} \newcommand{\RRn}{1\!\!1} % This is for indicator functions \newcommand{\quotes}[1]{``#1''} % This is for quotes %\boldsymbol %Use this to make greek symbols bold \overfullrule=5pt % \doublespace \begin{document} % \SweaveOpts{concordance=TRUE} % \SweaveOpts{concordance=TRUE} % \SweaveOpts{concordance=TRUE} <>= library(knitr) opts_chunk$set( concordance=TRUE ) @ \title{Estimating Average Dose Response Functions Using the R Package \texttt{causaldrf}} \author{Douglas Galagate, Joseph L. Schafer} \maketitle \begin{abstract} \vspace{-0.08in} This document describes the R package \texttt{causaldrf} for estimating average dose response functions (ADRF). The R package contains functions to estimate ADRFs using parametric and non-parametric models when the data contains a continuous treatment variable. The \texttt{causaldrf} R package is flexible and can be used on data sets containing treatment variables from a range of probability distributions. \end{abstract} \hrule \vspace{0.1in} \noindent {\em Keywords:} Causal Inference; Propensity Score; Generalized Propensity Score; Propensity Function; Average Dose Response Function. \section{Introduction} In this document, we provide examples to illustrate the flexibility and the ease of use of the \texttt{causaldrf} R package, which estimates the average dose response function (ADRF) when the treatment is continuous. The \texttt{causaldrf} R package also provides methods for estimating average potential outcomes when the treatment is binary or multi-valued. The user can compare different methods to understand the sensitivity of the estimates and a way to check robustness. The package contains new estimators based on a linear combination of a finite number of basis functions \cite{schafer2015causal}. In addition, \texttt{causaldrf} includes functions useful for model diagnostics such as assessing common support and for checking covariate balance. This package fills a gap in the R package space and offers a range of existing and new estimators described in the statistics literature such as \cite{schafer2015causal}, \cite{bia2014stata}, \cite{flores2012estimating}, \cite{imai2004causal}, \cite{hirano2004propensity}, and \cite{robins2000marginal}. The \texttt{causaldrf} R package is currently available on the Comprehensive R Archive Network (CRAN). The R package contains 12 functions for estimating the ADRF which are explained in more detail in Chapters 2, 3, and in the documentation files for the package \url{https://cran.r-project.org/web/packages/causaldrf/index.html}. The user can choose which estimator to apply based on their particular problems and goals. This document is organized as follows. In Section \ref{section:Sim_data_Moodie}, we introduce a simulated dataset from \cite{hirano2004propensity} and \cite{moodie2012estimation} and apply functions from \texttt{causaldrf} to estimate the ADRF. In Section \ref{section:NMES_analysis}, we use data from the National Medical Expenditures Survey (NMES) to show the capabilities of \texttt{causaldrf} in analyzing a data set containing weights. Section \ref{section:IHDP_analysis} contains data from the Infant Health and Development Program (IHDP) and applies methods from \texttt{causaldrf} to the data. Conclusions are presented in Section \ref{section:Discussion_examples}. \section{An Example Based on Simulated Data} \label{section:Sim_data_Moodie} % %\subsection{Simulation from \cite{schafer2015causal}} % %Recall that chapter 3 describes this simulation. It contains ..... % % This section demonstrates the use of the \texttt{causaldrf} package by using simulated data from \cite{hirano2004propensity} and \cite{moodie2012estimation}. This simulation constructs an ADRF with an easy to interpret functional form, and a means to clearly compare the performance of different estimation methods. Let $Y_{1}(t) | X_{1}, X_{2} \sim \mathcal{N}\left(t + (X_{1} + X_{2})e^{-t(X_{1} + X_{2})}, 1 \right)$ and $X_{1}, X_{2}$ be unit exponentials, $T_{1} \sim \text{exp}(X_{1} + X_{2})$. The ADRF can be calculated by integrating out the covariates analytically \citep{moodie2012estimation}, \begin{equation} \label{eq:HI_sim_truth} \mu(t) = E(Y_{i}(t)) = t + \frac{2}{(1 + t)^{3}} \end{equation} This example provides a setting to compare ADRF estimates with the true ADRF given in Equation \ref{eq:HI_sim_truth}. In this simulation, our goal is to demonstrate how to use the functions. We introduce a few of the estimators and show their plots. <>= options(width=70) @ First, install {\bf causaldrf} and then load the package: <<>>= library (causaldrf) @ The data is generated from: <<>>= set.seed(301) hi_sample <- function(N){ X1 <- rexp(N) X2 <- rexp(N) T <- rexp(N, X1 + X2) gps <- (X1 + X2) * exp(-(X1 + X2) * T) Y <- T + gps + rnorm(N) hi_data <- data.frame(cbind(X1, X2, T, gps, Y)) return(hi_data) } hi_sim_data <- hi_sample(1000) head(hi_sim_data) @ <>= overlap_list <- overlap_fun(Y = Y, treat = T, treat_formula = T ~ X1 + X2, data_set = hi_sim_data, n_class = 3, treat_mod = "Gamma", link_function = "inverse") overlap_data <- overlap_list$overlap_dataset t_mod_list <- t_mod(treat = T, treat_formula = T ~ X1 + X2 + I(X1^2) + I(X2^2), data = overlap_data, treat_mod = "Gamma", link_function = "inverse") cond_exp_data <- t_mod_list$T_data full_data <- cbind(overlap_data, cond_exp_data) @ %' <>= %' %' x <- hi_sim_data$T %' quantile_grid <- quantile(x, probs = seq(0, .95, by = 0.01)) %' # quantile_grid <- quantile_grid[1:100] %' true_hi_fun <- function(t){t + 2/(1 + t)^3} %' %' plot(quantile_grid, %' true_hi_fun(quantile_grid), %' pch = ".", %' main = "true ADRF", %' xlab = "treat", %' ylab = "Y", %' col = "black") %' %' lines(quantile_grid, %' true_hi_fun(quantile_grid), %' col = "black", %' lty = 1, %' lwd = 2) %' %' @ Below is code for a few different estimators of the ADRF. The first is the additive spline estimator from \cite{bia2014stata}. This estimator fits a treatment model to estimate the GPS. Next, additive spline bases values are created for both the treatment and the GPS. The outcome is regressed on the treatment, GPS, treatment bases, and GPS bases. After the outcome model is estimated, each treatment grid value and set of covariates is plugged in to the model which corresponds to imputed values for each unit at that particular treatment value. The imputed values are averaged to get the estimated ADRF at that treatment value. Repeating this process for many treatment values, \code{grid\_val}, traces out the estimated ADRF. The arguments are: \code{Y} for the name of the outcome variable, \code{treat} for the name of the treatment variable, \code{treat\_formula} for the formula used to fit the treatment model, \code{data} for the name of the data set, \code{grid\_val} for a vector in the domain of the treatment for where the outcome is estimated, \code{knot\_num} for the number of knots for the spline fit, and \code{treat\_mod} for the treatment model that relates treatment with the covariates. In this example we fit the correct treatment model so that the GPS is correctly specified with a gamma distribution. <<>>= add_spl_estimate <- add_spl_est(Y = Y, treat = T, treat_formula = T ~ X1 + X2, data = hi_sim_data, grid_val = quantile(hi_sim_data$T, probs = seq(0, .95, by = 0.01)), knot_num = 3, treat_mod = "Gamma", link_function = "inverse") @ The next estimator is based on the generalized additive model. This method requires a treatment formula and model to estimate the GPS. The estimated GPS values are used to fit an outcome regression. The outcome, \code{Y}, is regressed on two things: the treatment, \code{T}, and spline basis terms from the GPS fit. <<>>= gam_estimate <- gam_est(Y = Y, treat = T, treat_formula = T ~ X1 + X2, data = hi_sim_data, grid_val = quantile(hi_sim_data$T, probs = seq(0, .95, by = 0.01)), treat_mod = "Gamma", link_function = "inverse") @ The Hirano-Imbens estimator also requires two models. The first model regresses the treatment, \code{T}, on a set of covariates to estimate the GPS values. The second step requires fitting the outcome, \code{Y}, on the observed treatment and fitted GPS values. The summary above shows the fit of both the treatment model and outcome model. Also shown is the estimated outcome values on the grid of treatment values, \code{quantile\_grid}. <<>>= hi_estimate <- hi_est(Y = Y, treat = T, treat_formula = T ~ X1 + X2, outcome_formula = Y ~ T + I(T^2) + gps + I(gps^2) + T * gps, data = hi_sim_data, grid_val = quantile(hi_sim_data$T, probs = seq(0, .95, by = 0.01)), treat_mod = "Gamma", link_function = "inverse") @ This last method, importance sampling, fits the treatment as a function of the covariates, then calculates GPS values. The GPS values are used as inverse probability weights in the regression of \code{Y} on \code{T} \citep{robins2000marginal}. The estimated parameters correspond to coefficients for a quadratic model of the form $\hat{\mu}(t) = \hat{\alpha}_0 + \hat{\alpha}_1 t + \hat{\alpha}_2 t^2$. In this example, the estimator is restricted to a quadratic fit. <<>>= iptw_estimate <- iptw_est(Y = Y, treat = T, treat_formula = T ~ X1 + X2, numerator_formula = T ~ 1, data = hi_sim_data, degree = 2, treat_mod = "Gamma", link_function = "inverse") @ %' << echo = FALSE >>= %' plot(quantile_grid, %' true_hi_fun(quantile_grid), %' pch = ".", %' main = "true ADRF", %' xlab = "treat", %' ylab = "Y", %' col = "black") %' %' lines(quantile_grid, %' true_hi_fun(quantile_grid), %' col = "black", %' lty = 1, %' lwd = 2) %' %' lines(quantile_grid, %' add_spl_estimate$param, %' lty = 2, %' col = "red", %' lwd = 2) %' %' lines(quantile_grid, %' gam_estimate$param, %' lty = 3, %' col = "orange", %' lwd = 2) %' %' lines(quantile_grid, %' hi_estimate$param, %' lty = 4, %' col = "green", %' lwd = 2) %' %' lines(quantile_grid, %' iptw_estimate$param[1] + %' iptw_estimate$param[2] * quantile_grid + %' iptw_estimate$param[3] * quantile_grid^2, %' lty = 5, %' col = "blue", %' lwd = 2) %' %' legend('bottomright', %' "reg estimate", %' lty= c(1, 2, 3, 4, 5), %' lwd = c(2, 2, 2, 2, 2), %' col = c("black", "red", "orange", "green", "blue"), %' c("true ADRF", "additive spline", "generalized additive", "Hirano-Imbens", "importance sampling"), %' bty='Y', %' cex=0.75) %' @ The true ADRF and 4 estimates are plotted in Figure \ref{fig:HI_sim}. \begin{figure} \begin{center} \includegraphics[scale=.55]{HI_sim_10292015_np.png} \end{center} \caption{True ADRF along with estimated curves.}\label{fig:HI_sim} \end{figure} \newpage \section{Analysis of the National Medical Expenditures Survey} \label{section:NMES_analysis} \subsection{Introduction} The 1987 National Medical Expenditures Survey (NMES) includes information about smoking amount, in terms of the quantity packyears, and medical expenditures in a representative sample of the U.S. civilian, non-institutionalized population (U.S. Department of Health and Human Services, Public Health service, 1987). The 1987 medical costs were verified by multiple interviews and other data from clinicians and hospitals. \cite{johnson2003disease} analyzed the NMES to estimate the fraction of disease cases and the fraction of the total medical expenditures attributable to smoking for two disease groups. \cite{imai2004causal} emulate the setting by \cite{johnson2003disease} but estimated the effect of smoking amount on medical expenditures. \cite{johnson2003disease} and \cite{imai2004causal} conducted a complete case analysis by removing units containing missing values. Both \cite{johnson2003disease} used multiple imputation techniques to deal with the missing values, but did not find significant differences between that analysis and the complete case analysis. Complete case analysis with propensity scores will lead to biased causal inference unless the data are missing completely at random \citep{d2000estimating}. Regardless of this drawback, the analysis in this section uses the complete case data to illustrate the different statistical methods available for estimating the ADRF relating smoking amount and medical expenditures. This example is analyzed in this section because the treatment variable, smoking amount, is a continuous variable. The data is restricted to that used in \cite{imai2004causal} with 9708 observations and 12 variables. For each person interviewed, the survey collected information on age at the time of the survey, age when the person started smoking, gender, race (white, black, other), marital status (married, widowed, divorced, separated, never married), education level (college graduate, some college, high school graduate, other), census region (Northeast, Midwest, South, or West), poverty status (poor, near poor, low income, middle income, high income), and seat belt usage (rarely, sometimes, always/almost always) \citep{imai2004causal}. The data is available in the \texttt{causaldrf} package. Our goal is to understand how the amount of smoking affects the amount of medical expenditures. \cite{johnson2003disease} use a measure of cumulative exposure to smoking that combines self-reported information about frequency and duration of smoking into a variable called \textit{packyear} \begin{equation} \label{eq:packyear} packyear = \frac{\text{number of cigarettes per day}}{20} \times \text{(number of years smoked)} \end{equation} \textit{packyear} can also be defined as the number of packs smoked per day multiplied by the number of years the person was a smoker. The total number of cigarettes per pack is normally 20. Determining the effect of smoking on health has a long history. Scientists cannot ethically assign smoking amounts randomly to people because of the potential negative effects, so observational data analysis is needed to understand the relationship. The rest of this section will focus on the relationship between smoking amount and medical expenditures. The NMES oversampled subgroups of the population in order to reduce variances of the estimates. Oversampling reduces the variances of the estimates by increasing the sample size of the target sub-population disproportionately \citep{singhoversampling}. The U.S. Department of Health and Human Services oversampled Blacks, Hispanics, the poor and near poor, and the elderly and persons with functional limitations \citep{cohen2000sample}. \subsection{Data} <>= load("nmes_10192015.RData") @ Load {\bf nmes\_data} into the workspace with <<>>= data("nmes_data") dim (nmes_data) summary(nmes_data) @ The dataset {\bf nmes\_data} is a data frame with 9708 rows and 12 variables with summaries of the variables given above. Six of the variables are numeric and the other six are categorical. The outcome variable is the total amount of medical expenditures, \code{TOTALEXP}, the treatment is the amount of smoking, \code{packyears}. The data set contains weights, \code{HSQACCWT}, that can be used to upweight estimates to the population of interest which is the set of people who smoke and above the age of 18. This analysis demonstrates the capability of \pkg{causaldrf} by estimating the ADRF with and without weights. In Figure \ref{fig:nmes_plots}, we plot the estimated ADRFs, their 95\% confidence bands, and the 95\% confidence bands without weigths. \subsection{Common support} The data set is restricted to observations that overlap and have a common support. Units outside of the common support are removed. See Figure \ref{fig:overlap_balance}. The preliminary steps of analysis are omitted such as cleaning and making sure the data overlap. From \cite{bia2014stata}, we use the formula $$CS = \cap^{K}_{q=1} \{i: \hat{R}_{i}^{q} \in [max\{min_{j:Q_{j} = q}\hat{R}_{j}^{q}, min_{j:Q_{j} \neq q}\hat{R}_{j}^{q}\}, min\{max_{j:Q_{j} = q}\hat{R}_{j}^{q}, max_{j:Q_{j} \neq q}\hat{R}_{j}^{q}\}]\}$$ to get the common support. For 3 subclasses, the sample is reduced to 8732 units in the common support. \begin{figure} \begin{minipage}[t]{0.45\textwidth} \includegraphics[scale=0.25]{nmes_overlap_a_10022015.png} \label{fig:overlap_1_all} \end{minipage} \hspace{\fill} \begin{minipage}[t]{0.45\textwidth} \includegraphics[scale=0.25]{nmes_overlap_b_10022015.png} \label{fig:overlap_1_removed} \end{minipage} \vspace*{0.4cm} % (or whatever vertical separation you prefer) \begin{minipage}[t]{0.45\textwidth} \includegraphics[scale=0.25]{nmes_overlap_c_10022015.png} \label{fig:overlap_2_all} \end{minipage} \hspace{\fill} \begin{minipage}[t]{0.45\textwidth} \includegraphics[scale=0.25]{nmes_overlap_d_10022015.png} \label{fig:overlap_2_removed} \end{minipage} \vspace*{0.4cm} % (or whatever vertical separation you prefer) \begin{minipage}[t]{0.45\textwidth} \includegraphics[scale=0.25]{nmes_overlap_e_10022015.png} \label{fig:overlap_3_all} \end{minipage} \hspace{\fill} \begin{minipage}[t]{0.45\textwidth} \includegraphics[scale=0.25]{nmes_overlap_f_10022015.png} \label{fig:overlap_3_removed} \end{minipage} \caption{\footnotesize Common support restriction. Shaded bars represent units not in tercile, while white bars represent units in the tercile. (a) Compares group 1 vs others before deleting non-overlapping units. (b) Compares group 1 vs others after deleting non-overlapping units. (c) Compares group 2 vs others before deleting non-overlapping units. (d) Compares group 2 vs others after deleting non-overlapping units. (e) Compares group 3 vs others before deleting non-overlapping units. (f) Compares group 3 vs others after deleting non-overlapping units. } \label{fig:overlap_balance} \end{figure} \subsection{Covariate balance} One of the main goals of fitting a treatment model is to balance the covariates. The GPS or the PF provide a way to balance the covariates. Comparisons of the balance of the covariates before and after adjusting for the GPS or the PF are shown in the following results: <<>>= t(p_val_bal_cond) t(p_val_bal_no_cond) @ The last column displays the p-value of regressing each of the continuous covariates on the outcome variable, packyears, before and after conditioning on the PF. The first three rows show the p-values after conditioning on the PF, while the last three rows show the p-values when there is no conditioning. \subsection{Estimating the ADRF} The \texttt{causaldrf} R package contains a variety of estimators. Below is code for 4 other estimators that can account for weights. Although the true ADRF is not a polynomial, we will illustrate methods that are restricted to polynomial form of up to degree 2. The prima facie estimator is a basic estimator that regresses the outcome \code{Y} on the treatment \code{T} without taking covariates into account. The prima facie estimator is unbiased if the data comes from a simple random sample; otherwise it will likely be biased. The model fit is $Y \sim \alpha_0 + \alpha_1 t + \alpha_2 t^2$. <<>>= pf_estimate <- reg_est(Y = TOTALEXP, treat = packyears, covar_formula = ~ 1, data = full_data_orig, degree = 2, wt = full_data_orig$HSQACCWT, method = "same") pf_estimate @ The regression prediction method generalizes the prima facie estimator and takes the covariates into account \citep{schafer2015causal}. <<>>= reg_estimate <- reg_est(Y = TOTALEXP, treat = packyears, covar_formula = ~ LASTAGE + LASTAGE2 + AGESMOKE + AGESMOKE2 + MALE + beltuse + educate + marital + POVSTALB + RACE3, covar_lin_formula = ~ 1, covar_sq_formula = ~ 1, data = full_data_orig, degree = 2, wt = full_data_orig$HSQACCWT, method = "different") reg_estimate @ The propensity spline prediction method adds spline basis terms to the regression prediction method. This method is similar to that of \cite{little2004robust} and \cite{schafer2008average}, but for the continuous treatment setting \citep{schafer2015causal}. <<>>= spline_estimate <- prop_spline_est(Y = TOTALEXP, treat = packyears, covar_formula = ~ LASTAGE + LASTAGE2 + AGESMOKE + AGESMOKE2 + MALE + beltuse + educate + marital + POVSTALB + RACE3, covar_lin_formula = ~ 1, covar_sq_formula = ~ 1, data = full_data_orig, e_treat_1 = full_data_orig$est_treat, degree = 2, wt = full_data_orig$HSQACCWT, method = "different", spline_df = 5, spline_const = 4, spline_linear = 4, spline_quad = 4) spline_estimate @ This last method fits a spline basis to the estimated PF values and then regresses the outcome on both the basis terms and the treatment to estimate the ADRF. This is described in \cite{imai2004causal} and \cite{schafer2015causal}. The estimated parameters correspond to coefficients for a quadratic model of the form $\hat{\mu}(t) = \hat{\alpha}_0 + \hat{\alpha}_1 t + \hat{\alpha}_2 t^2$. <<>>= ivd_estimate <- prop_spline_est(Y = TOTALEXP, treat = packyears, covar_formula = ~ 1, covar_lin_formula = ~ 1, covar_sq_formula = ~ 1, data = full_data_orig, e_treat_1 = full_data_orig$est_treat, degree = 2, wt = full_data_orig$HSQACCWT, method = "different", spline_df = 5, spline_const = 4, spline_linear = 4, spline_quad = 4) ivd_estimate @ \begin{figure} \begin{minipage}[t]{0.45\textwidth} \includegraphics[width=1\linewidth]{nmes_pf_10222015_wt.png} %\caption{\textit{Histogram of lottery prize}} \label{fig:nmes_pf_10222015_wt} \end{minipage} \hspace{\fill} \begin{minipage}[t]{0.45\textwidth} \includegraphics[width=1\linewidth]{nmes_reg_10222015_wt.png} %\caption{\textit{Histogram of log(lottery prize}} \label{fig:nmes_reg_10222015_wt} \end{minipage} \vspace*{0.5cm} % (or whatever vertical separation you prefer) \begin{minipage}[t]{0.45\textwidth} \includegraphics[width=1\linewidth]{nmes_spline_10222015_wt.png} %\caption{\textit{QQ plot of \texttt{log\_prize} $ \sim \bX$}} \label{fig:nmes_spline_10222015_wt} \end{minipage} \hspace{\fill} \begin{minipage}[t]{0.45\textwidth} \includegraphics[width=1\linewidth]{nmes_ivd_11062015_wt.png} %\caption{\textit{Residuals of the \texttt{log\_prize} regressed on $\bX$}} \label{fig:nmes_ivd_10222015_wt} \end{minipage} \caption{\footnotesize Estimated dose-response functions using 4 different methods with 95\% pointwise standard errors. The standard errors are estimated by bootstrapping the entire estimation process from the beginning.} \label{fig:nmes_plots} \end{figure} \clearpage %\subsubsubsection{Other estimators} \subsection{Discussion} These four methods estimate the ADRF in a structured way and assumes the true ADRF is a linear combination of a finite number of basis functions. Figure \ref{fig:nmes_plots} shows an overall rising amount of \code{TOTALEXP} as \code{packyear} increases. Recall that in this example, the four estimators are restricted to fitting the ADRF as a polynomial of up to degree 2. Fitting more flexible models may give slightly different curves. The next section analyzes a different data set and will fit other flexible estimators such as BART, which allows for flexible response surfaces to estimate the ADRF. \section{Analysis of the Infant Health and Development Program} \label{section:IHDP_analysis} \subsection{Introduction} The next example on the Infant Health and Development Program is described by \cite{gross1992infant}: \begin{quotation} The Infant Health and Development Program (IHDP) was a collaborative, randomized, longitudinal, multisite clinical trial designed to evaluate the efficacy of comprehensive early intervention in reducing the developmental and health problems of low birth weight, premature infants. An intensive intervention extending from hospital discharge to 36 months corrected age was administered between 1985 and 1988 at eight different sites. The study sample of infants was stratified by birth weight (2,000 grams or less, 2,001-2,500 grams) and randomized to the Intervention Group or the Follow-Up Group. \end{quotation} The intervention (treatment) group received more support than the control group. In addition to the standard pediatric follow-up, the treatment group also received home visits and attendance at a special child development center. Although the treatment was assigned randomly, families chosen for the intervention self-selected into different participation levels \citep{hill2011bayesian}. Therefore, restricting our analysis to families in the intervention group and their participation levels leads to an observational setting. In this section, even though families are randomly selected for intervention, we restrict our analysis on those selected for the treatment. These families choose the amount of days they attend the child development centers and this makes the data set, for practical purposes, an observational data set. We apply our methods on this subset of the data to estimate the ADRF for those who received the treatment. We analyze this data set because the treatment variable, number of child development center days, is analyzed as a continuous variable. The data set we use comes from \cite{hill2011bayesian}. \subsection{Data} Part of this data set is included in the supplement in \cite{hill2011bayesian}, but does not include all the needed variables. The continuous treatment is available through the data repository at \url{icpsr.umich.edu}. To get the data, go to \url{http://www.icpsr.umich.edu/icpsrweb/HMCA/studies/9795?paging.startRow=51} and download DS141: Transport Format SAS Library Containing the 59 Evaluation Data Files - Download All Files (27.9 MB). After downloading the .zip file, extract the data file named \quotes{09795-0141-Data-card\_image.xpt} to a folder and set the R working directory to this folder. The following instructions describe how to extract the continuous treatment variable. Making sure the working directory contains \quotes{09795-0141-Data-card\_image.xpt}, the next step is to load the \pkg{Hmisc} package to read sas export files. << eval = FALSE, echo = TRUE >>= library(Hmisc) mydata <- sasxport.get("09795-0141-Data-card_image.xpt") data_58 <- mydata[[58]] ihdp_raw <- data_58 # restricts data to treated cases treated_raw <- ihdp_raw[which(ihdp_raw$tg == "I"),] # continuous treatment variable treat_value <- treated$cdays.t @ The continuous treatment variable is merged with the data given in the supplement by \cite{hill2011bayesian} to create the data set for this section. A few more steps are needed to clean and recode the data. We collect a subset of families eligible for the intervention and restrict the data set to families that use the child development centers at least once. The data set contains the outcome variable, \code{iqsb.36}, which is the measured iq of the child at 36 months. The treatment variable is the number of days the child attended the child development center divided by 100, \code{ncdctt} (i.e. \code{ncdctt} $= 1.5$ means $150$ days in the child developement center). We select the covariates using a stepwise procedure to simplify the analysis. \subsection{Common support} << eval = FALSE, echo = TRUE >>= overlap_temp <- overlap_fun(Y = iqsb.36, treat = ncdctt, treat_formula = t_formula, data = data_set, n_class = 3, treat_mod = "Normal") median_list <- overlap_temp[[2]] overlap_orig <- overlap_temp[[1]] overlap_3 <- overlap_temp[[3]] fitted_values_overlap <- overlap_3$fitted.values @ \subsection{Covariate balance} Balance is evaluated similarly to the NMES example. \subsection{Estimating the ADRF} The BART estimator fits a rich outcome model on the treatment and covariates to create a flexible response surface \citep{hill2011bayesian}. The flexible response surface imputes the missing potential outcomes. The estimated potential outcomes are averaged to get the estimated ADRF over a grid of treatment values. \begin{figure} \begin{minipage}[t]{0.45\textwidth} \includegraphics[width=1\linewidth]{ihdp_bart_10212015.png} %\caption{\textit{Histogram of lottery prize}} \label{fig:ihdp_bart_10212015} \end{minipage} \hspace{\fill} \begin{minipage}[t]{0.45\textwidth} \includegraphics[width=1\linewidth]{ihdp_iw_10202015.png} %\caption{\textit{Histogram of log(lottery prize}} \label{fig:ihdp_hi_10202015} \end{minipage} \vspace*{0.5cm} % (or whatever vertical separation you prefer) \begin{minipage}[t]{0.45\textwidth} \includegraphics[width=1\linewidth]{ihdp_nw_10202015.png} %\caption{\textit{QQ plot of \texttt{log\_prize} $ \sim \bX$}} \label{fig:ihdp_ivd_10202015} \end{minipage} \hspace{\fill} \begin{minipage}[t]{0.45\textwidth} \includegraphics[width=1\linewidth]{ihdp_prop_spl_10202015.png} %\caption{\textit{Residuals of the \texttt{log\_prize} regressed on $\bX$}} \label{fig:ihdp_prop_spl_10202015} \end{minipage} \caption{\footnotesize Estimated dose-response functions using 4 different methods with 95\% pointwise standard errors. The standard errors are estimated by bootstrapping the entire estimation process from the beginning.} \label{fig:ihdp_plots} \end{figure} << eval = FALSE, echo = TRUE >>= bart_estimate <- bart_est(Y = iqsb.36, treat = ncdctt, outcome_formula = iqsb.36 ~ ncdctt + bw + female + mom.lths + site1 + site7 + momblack + workdur.imp, data = full_data_orig, grid_val = grid_treat) @ The next method is described in \cite{flores2012estimating} and uses inverse weights to adjust for the covariates. First a treatment model is fit and GPS values are estimated. This is a method that uses weights to locally regress the outcome on nearby points. This is a local linear regression of the outcome, \code{iqsb.36}, on the treatment, \code{ncdctt}, with a weighted kernel. The weighted kernel is weighted by the reciprocal of the GPS values. << eval = FALSE, echo = TRUE >>= iw_estimate <- iw_est(Y = iqsb.36, treat = ncdctt, treat_formula = ncdctt ~ bw + female + mom.lths + site1 + site7 + momblack + workdur.imp, data = full_data_orig, grid_val = grid_treat, bandw = 2 * bw.SJ(full_data_orig$ncdctt), treat_mod = "Normal") @ This next method, the Nadaraya-Watson based estimator, is similar to the inverse weighting method in the previous code chunk, but uses a local constant regression. << eval = FALSE, echo = TRUE >>= nw_estimate <- nw_est(Y = iqsb.36, treat = ncdctt, treat_formula = ncdctt ~ bw + female + mom.lths + site1 + site7 + momblack + workdur.imp, data = full_data_orig, grid_val = grid_treat, bandw = 2 * bw.SJ(full_data_orig$ncdctt), treat_mod = "Normal") @ The propensity spline estimator is a generalization of the prima facie and regression prediction method in \cite{schafer2015causal}. In this example, the estimator is restricted to a polynomial of up to degree 2 of the form $\hat{\mu}(t) = \hat{\alpha}_0 + \hat{\alpha}_1 t + \hat{\alpha}_2 t^2$. << eval = FALSE, echo = TRUE >>= spline_estimate <- prop_spline_est(Y = iqsb.36, treat = ncdctt, covar_formula = ~ bw + female + mom.lths + site1 + site7 + momblack + workdur.imp, covar_lin_formula = ~ 1, covar_sq_formula = ~ 1, data = full_data_orig, e_treat_1 = full_data_orig$est_treat, degree = 2, wt = NULL, method = "different", spline_df = 5, spline_const = 2, spline_linear = 2, spline_quad = 2) @ \subsection{Discussion} The plots in Figure \ref{fig:ihdp_plots} show the estimated relationship of IQ at 36 months, \code{iqsb.36}, on number of days in the child development care center, \code{ncdctt}. The inverse weighting and Nadaraya-Watson show a decreasing trend for $\code{ncdctt} \in (0, 0.8)$, but an increasing trend for $\code{ncdctt} > 0.8$. These estimators are jagged because of the bandwidth selection. In this example, we use twice the Sheather-Jones bandwidth estimate to select the bandwidth. Picking a larger bandwidth will give smoother estimates. The BART and propensity spline estimators have a generally increasing trend. \section{Conclusion} \label{section:Discussion_examples} % explain what we can do with this newly acquired knowledge. Answer the question \quotes{so what?} In this document, we have demonstrated how to estimate ADRFs using different statistical techniques using the R package \texttt{causaldrf}, both for simulated and real data, by correcting for confounding variables. \texttt{causaldrf} can accommodate a wide array of treatment models, is user friendly, and does not require extensive programming. This contribution of the R package \texttt{causaldrf} will make ADRF estimation more accessible to applied researchers. In future updates of the package, the functions will be adapted to an even wider range of problems. % % \section*{Acknowledgments} % The author would like to thank colleagues at the University of Maryland, College Park, and U.S. Census Bureau. Also, thanks to Steven Pav for coding ideas. \bibliographystyle{apalike} \bibliography{Chapter1Ref09212015} \end{document}