\documentclass[nojss]{jss}
%\VignetteIndexEntry{imputeMulti: An Introduction}
%\VignetteAuthor{Alex Whitworth}
%\VignetteKeywords{R, introduction}
%\VignetteTangle{no}
%\VignetteEngine{R.rsp::rsp}
% \usepackage{graphicx, color, hyperref, ae, fancyvrb, natbib} % -- default JSS packages
\usepackage{amsmath, amssymb, amsthm} % American Math. Society packages
\usepackage{thumbpdf} % JSS encouraged
\usepackage{float}
\usepackage[font= footnotesize, labelfont= bf]{caption}
\usepackage{placeins}

% from Achim Zeileis -- email correspondence 6/25/2016
\newcommand{\Sconcordance}[1]{%
  \ifx\pdfoutput\undefined%
  \csname newcount\endcsname\pdfoutput\fi%
  \ifcase\pdfoutput\special{#1}%
  \else%
  \begingroup%
  \pdfcompresslevel=0%
  \immediate\pdfobj stream{#1}%
  \pdfcatalog{/SweaveConcordance \the\pdflastobj\space 0 R}%
  \endgroup%
  \fi}
%% no need for \usepackage{Sweave.sty}

%------------------------------------------------------------
% title information
\author{Alex Whitworth}
\Plainauthor{Alex Whitworth}

\title{imputeMulti: Imputation for Multivariate Multinomial Missing Data}
\Shorttitle{imputeMulti}

\Abstract{
\pkg{imputeMulti} is an \proglang{R} package for imputation of multivariate multinomial missing data via expectation-maximization and data augmentation algorithms. The package allows the specification of Bayesian priors, including data-dependent priors. For performance, calculation of the summary statistics of the multinomial distribution is implemented in \proglang{C++}; \pkg{imputeMulti} also supports these calculations in parallel. As a result, the \pkg{imputeMulti} package capably handles large datasets. In this article, we introduce the package's functionality and provide a hands-on approach to solving multinomial missing data problems.
}
\Keywords{multinomial, missing data, imputation, expectation-maximization, data-augmentation, EM, DA, R}

\Address{
Alex Whitworth \\
Seattle, WA 98103 \\
E-mail: \email{whitworth.alex@gmail.com}
}

%------------------------------------------------------------
\begin{document}

\section{Introduction}

As \cite{amelia} noted, ``missing data is a ubiquitous problem [especially] in social science data. Respondents do not answer every question, countries do not collect statistics every year, archives are incomplete, [and] subjects drop out of panels. Most statistical analysis methods, however, assume the absence of missing data, and are only able to include observations for which every variable is measured.'' In the past several decades, widely available methods have been developed for missing value imputation, allowing practitioners to move beyond ad-hoc methods such as casewise deletion and mean imputation to more advanced methods such as multiple imputation \cite{rubin}, expectation maximization \cite{dempster}, chained equations \cite{mice}, and data augmentation \cite{data_aug}. However, many of these methods assume that the data come from a multivariate normal distribution, an assumption that is inappropriate for data from many other distributions. \pkg{imputeMulti} performs missing data imputation for the common multivariate multinomial distribution. Like other imputation methods, it creates a ``filled in'' version of the incomplete data so that analyses which require complete observations can appropriately use all the data in a dataset containing missingness.
\pkg{imputeMulti} uses the familiar expectation-maximization (EM) and data augmentation (DA) algorithms to estimate the parameters of the multinomial distribution and maximum likelihood to impute missing observations. Two issues which make multivariate multinomial missing data imputation challenging are the memory required to store all possible combinations of the variables of interest and the computation time required to calculate both the complete data sufficient statistics and the missing data marginal sufficient statistics. To overcome these performance challenges, \pkg{imputeMulti} uses \proglang{C++} via \pkg{Rcpp} \cite{rcpp} and also allows for parallel processing via \pkg{parallel} \cite{r_core}.

\pkg{imputeMulti} provides advantages over other popular imputation methods such as \pkg{Amelia} \cite{amelia}, \pkg{mice} \cite{mice}, and \pkg{Hmisc} \cite{Hmisc} when working with multinomial data. The most important advantage is higher imputation accuracy for multinomial data relative to any of \pkg{Amelia}, \pkg{mice}, and \pkg{Hmisc}. Secondly, \pkg{imputeMulti} provides researchers easy access to the multinomial parameter estimates. In many cases, the parameter estimates may be of more interest to researchers than the observation level imputations. One disadvantage of \pkg{imputeMulti} is longer run time than \pkg{Amelia} or \pkg{Hmisc}, although \pkg{imputeMulti} is faster than \pkg{mice}. Run time is impacted by unique aspects of the multinomial distribution, which are discussed below.

The remainder of the paper proceeds as follows: Section~\ref{sec:multi} provides a brief discussion of multinomial missing data problems; Section~\ref{sec:guide} provides a user's guide to \pkg{imputeMulti}; Section~\ref{sec:compare} compares \pkg{imputeMulti} with alternative imputation methods; and Section~\ref{sec:conclude} concludes.

%------------------------------------------------------------
\section{Multinomial missing data} \label{sec:multi}
%------------------------------------------------------------

\pkg{imputeMulti} is an implementation of the imputation methods for multivariate multinomial missing data found in \cite[Chapter~7]{schafer}. An abbreviated discussion of multivariate multinomial missing data is provided below. To facilitate use of this reference text, the same notation as \cite{schafer} is used throughout this discussion.

To begin, let $Y_1, Y_2, \ldots, Y_p$ be categorical variables, each taking a finite number of values $Y_j \in \{1,2,\ldots, d_j \}; j= 1,2,\ldots, p$. If the observations are independent and identically distributed (iid), then we can reduce $Y$ to a contingency table with $D$ cells, where $D= \prod_{j=1}^p d_j$ is the number of distinct combinations of the levels of $Y_1, Y_2, \ldots, Y_p$. The counts in the $D$ cells are denoted by $x_d, d = 1,\ldots, D$. If the sample size is $n$, then $x$ has a multinomial distribution:
\begin{align}
x|\theta &\sim M(n,\theta)
\end{align}
with parameter vector $\theta = (\theta_1, \theta_2, \ldots, \theta_D )$. The likelihood function for the multinomial parameter $\theta$ is
\begin{align}
L(\theta|Y) &\propto \prod_{d=1}^D \theta_{d}^{x_d} I_{\Theta}(\theta)
\end{align}
where $I_{\Theta}(\theta)$ is an indicator function equal to 1 if $\theta \in \Theta$ and 0 otherwise. Maximizing this likelihood subject to the constraint $\sum_{d=1}^D \theta_d = 1$ leads to the well known maximum likelihood estimates (MLE):
\begin{align}
\hat \theta_d = \frac{x_d}{n}, d= 1,\ldots, D
\end{align}
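For fully observed data, these quantities are simple to compute in base \proglang{R}. The following is a minimal sketch using two small hypothetical factors; \code{table} tabulates the cell counts $x_d$ and \code{prop.table} computes the MLE $\hat \theta_d = x_d / n$.
\begin{Code}
R> ## two hypothetical categorical variables (illustration only)
R> Y <- data.frame(Y1 = factor(c("a", "b", "a", "a")),
+                  Y2 = factor(c("x", "x", "y", "x")))
R> x_d <- table(Y)              # D-cell contingency table of counts x_d
R> theta_hat <- prop.table(x_d) # MLE: theta_d = x_d / n
\end{Code}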
\subsection{The Bayesian case and the Dirichlet prior}

The simplest way to conduct Bayesian inference on a multinomial model is to assume a Dirichlet prior \cite{dirichlet}, which is typically denoted by the parameter vector $\alpha= (\alpha_1, \alpha_2, \ldots, \alpha_D)$. The prior and posterior of $\theta$ with a Dirichlet prior are thus written in shorthand as
\begin{align}
\theta|\alpha &\sim D(\alpha) \\
\theta|Y &\sim D(\alpha')
\end{align}
where $\alpha' = (\alpha_1 + x_1, \alpha_2 + x_2, \ldots, \alpha_D + x_D)$. The posterior mean is
\begin{align}
\E(\theta|Y) &= \left(\frac{\alpha_1'}{\alpha_0'}, \frac{\alpha_2'}{\alpha_0'}, \ldots, \frac{\alpha_D'}{\alpha_0'} \right)
\end{align}
with $\alpha_0' = \sum_{d=1}^D (\alpha_d + x_d) = \alpha_0 + n$.

As is typical in Bayesian analyses, the choice of prior can have important effects on the outcome of missing value imputation for multinomial data. In addition to no prior, \pkg{imputeMulti} supports three choices of prior. When little prior information is known, a noninformative prior may be a sensible choice. If the sample size is large relative to the number of parameters being estimated, a noninformative prior will have little impact on model inferences. The choice used in \pkg{imputeMulti} is twice the Jeffreys prior, $c=1$, which provides proper posterior modes of $\theta$.\footnote{Using the Jeffreys prior, $c=\frac{1}{2}$, can produce improper posterior modes. Specifically, if $\alpha_y < 1$ and the corresponding count $x_y$ is zero, then $\hat \theta_y$ will be negative.} Another choice of prior is a flattening prior $\alpha= (c,c,\ldots,c)$ for some $c>1$. A final choice available in \pkg{imputeMulti} is the data-dependent prior, whose aim is to take a \textit{peek} at the data, via some summary statistics for instance, prior to analysis. Discussion and criticism of the use of data-dependent priors are beyond the scope of this article. Interested readers are referred to \cite{darnieder}.

\subsection{Characterizing the multivariate multinomial missing data problem}

At the core of all imputation methods is the fact that we do not completely observe the data $x$, but instead only observe part of it. In the multinomial case, assume that we have grouped the observed data into their observed patterns $x^{obs}$ and their missingness patterns $x^{mis}$. Index the missingness patterns by $s= 1,2,\ldots,S$ and define a set of indicator variables:
\begin{align*}
r_{sj} &= \left\{ \begin{array}{lc}
1 & \mbox{if } Y_j \mbox{ is observed in } s \\
0 & \mbox{if } Y_j \mbox{ is missing in } s
\end{array} \right.
\end{align*}
Let $O_s(y)$ and $M_s(y)$ respectively denote the sets of observed and missing variables within each missingness pattern, with elements defined
\begin{align}
O_s(y) &= \{y_j : r_{sj} = 1 \} \\
M_s(y) &= \{y_j : r_{sj} = 0 \}.
\end{align}
Since most missing observations are \textit{partially} observed, within each missingness pattern $s$, observations are cross-classified by their observed variables. The distinct partially observed patterns are denoted $x^{(s)}$, and their counts are tabulated into a table denoted
\begin{align}
z_{O_s(y)}^{s} &= \sum_{M_s(y)\in M_s} x^{(s)} \mbox{ for all } O_s(y) \in O_s.
\end{align}
The marginal probability that an observation falls within a given cell of this table of partially observed values is denoted
\begin{align}
\beta_{O_s(y)} &= \sum_{M_s(y)\in M_s} \theta_{y} \, .
\end{align}
The observed-data loglikelihood contribution from the partially observed data is
\begin{align}
l(\theta|Y_{obs}) &= \sum_{s=1}^S \sum_{O_s(y)\in O_s} z_{O_s(y)}^{s} \log\left(\beta_{O_s(y)} \right) .
\end{align}

The EM algorithm for maximizing the complete-data loglikelihood is straightforward, with the multivariate case first described by \cite{fuchs}. For the E-step, we find the expected summary statistics given the observed data and the current value of $\theta$,
\begin{align}
\E(x_y|Y_{obs}, \theta) &= \sum_{s=1}^S z_{O_s(y)}^{s} \theta_y / \beta_{O_s(y)} . \label{E-step}
\end{align}
This leads to the trivial M-step. Under maximum likelihood, set $\hat \theta_y = \E(x_y|Y_{obs}, \theta) / n$ for all $y \in Y$. A minor modification is made to maximize the complete-data posterior density under a Dirichlet prior: $\hat \theta_y = (x_y +\alpha_y - 1) / (n + \alpha_0 - D)$ for all $y \in Y$, where $\alpha_0=\sum_i^D \alpha_i$ and $D$ is the number of parameters.

Similarly, different minor modifications can be made to convert the EM algorithm to data augmentation. For DA, instead of proportionally allocating the counts of partially missing observations to fully observed $x^{obs}$ counts based on the current value of $\theta$ as in Equation~\ref{E-step}, the proportional allocation is replaced by a random allocation based on the current value of $\theta$.
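The E- and M-steps can be illustrated with a small worked example. The following sketch performs one EM iteration by hand for two binary variables where $Y_2$ is sometimes missing; the counts are hypothetical and the code is illustrative only, not the package's internal implementation.
\begin{Code}
R> ## one EM iteration by hand; cells ordered (Y1, Y2) = 00, 01, 10, 11
R> theta <- setNames(rep(1/4, 4), c("00", "01", "10", "11"))
R> x_obs <- c("00" = 20, "01" = 10, "10" = 5, "11" = 15) # fully observed
R> z_0 <- 8    # hypothetical count with Y1 = 0 and Y2 missing
R> z_1 <- 12   # hypothetical count with Y1 = 1 and Y2 missing
R> beta_0 <- sum(theta[c("00", "01")])  # marginal P(Y1 = 0)
R> beta_1 <- sum(theta[c("10", "11")])  # marginal P(Y1 = 1)
R> ## E-step: allocate partially observed counts proportionally to theta
R> E_xy <- x_obs + c(z_0 * theta[c("00", "01")] / beta_0,
+                    z_1 * theta[c("10", "11")] / beta_1)
R> ## M-step under maximum likelihood
R> theta <- E_xy / sum(E_xy)
\end{Code}
Iterating these two steps to convergence yields the MLE of $\theta$.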
%------------------------------------------------------------
\section{Software user's guide} \label{sec:guide}
%------------------------------------------------------------

We now turn to the practical matter of using \pkg{imputeMulti}, which is freely available as a package for the statistical software \proglang{R} \cite{r_core} and can be run in any environment in which \proglang{R} runs. \proglang{R} is freely available from \url{https://www.r-project.org/}. To install \pkg{imputeMulti} from a CRAN (Comprehensive R Archive Network) mirror, simply type the following command at the \proglang{R} command prompt:
\begin{Code}
R> install.packages("imputeMulti")
\end{Code}
If you wish to use the most current development version, you can install it from Github. This is most easily done using the \pkg{devtools} package \cite{devtools}, via
\begin{Code}
R> devtools::install_github("alexWhitworth/imputeMulti")
\end{Code}
To keep \pkg{imputeMulti} up to date, you should use the \proglang{R} command \code{update.packages()}.

\subsection{Example data}

To illustrate \pkg{imputeMulti}, we use simulated American Community Survey (ACS) data for individuals living in census tract 2221 in Los Angeles County, California. The data was simulated using spatial microsimulation \cite{orcutt}. This dataset contains ten variables on 3,974 individuals. Missing values have been inserted, independently and at random, into seven of the ten variables. A detailed description of the dataset can be found by typing \code{?tract2221} into the \proglang{R} console.
\begin{Code}
R> library("imputeMulti")
R> data("tract2221")
\end{Code}

Beyond being useful to illustrate the use of \pkg{imputeMulti}, the data also serves to illustrate one constraint on multivariate multinomial missing data imputation---combinatorial explosion. If we choose to use all ten variables in the model, the number of distinct combinations which may be observed is extremely high.
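This can be verified directly from the data; a minimal sketch, assuming all ten variables in \code{tract2221} are stored as factors:
\begin{Code}
R> ## number of cells D = prod(d_j); assumes each column is a factor
R> prod(sapply(tract2221, nlevels))
\end{Code}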
Using the previously introduced notation, $D= \prod_{j=1}^p d_j = 8,467,200$. Given that seven variables have missing values, the number of possible distinct combinations of marginally observed variables is even higher at 38,707,200. In \proglang{R}, an integer vector of this size requires 155 MB of memory and a corresponding \code{data.frame}, which is needed to store $z_{O_s(y)}^{s}$, requires 1.7 GB of memory. If our model were substantially larger, we would quickly run into \pkg{base} \proglang{R}'s constraint on matrix size of 2 billion elements. One option to alleviate the memory constraint is the use of a package that stores objects locally, such as \pkg{bigmemory} \cite{bigmemory}. But alleviating memory concerns does not address the larger problem of run time that is factorial---$O(p!)$---in the number of variables $p$. One approach is the use of a multivariate normal distribution to provide an approximate solution. The approach suggested in this guide is to instead find an exact solution to an approximate problem. That is, to decompose the set of all variables into subsets of related variables, and to find an exact solution for each subset.

Run-time constraints are not the only motivation for decomposing an increasingly large set of variables. An additional benefit is that complex higher-order interactions may be poorly estimated by the fully saturated multinomial model. In these situations, it is often useful to simplify the model by selectively removing some of the highest order associations. In this dataset, for example, the researcher might decide that income, poverty and employment statuses, educational attainment, age, and gender are closely related and thus group \code{age}, \code{gender}, \code{edu\_attain}, \code{pov\_status}, \code{emp\_status}, and \code{ind\_income} into one subset. Further, she might consider marital status, race, nativity, and geographic mobility to all be closely related. This would lead to a second subset of \code{marital\_status}, \code{nativity}, \code{geog\_mobility}, and \code{race}. This is the approach we use here to illustrate the use of \pkg{imputeMulti}.

\subsection{Imputation via expectation-maximization}

The primary user-facing function for \pkg{imputeMulti} is \code{multinomial\_impute}, which provides observation level imputation for multinomial data. Various options can be specified to perform multinomial imputations using EM or DA along with the specification of one of four possible priors: \code{`none'}, \code{`non.informative'}, \code{`flat.prior'}, or \code{`data.dep'}. These options for imputing missing values via EM are now illustrated, starting without the use of a Bayesian prior.
\begin{CodeInput}
R> set.seed(2134L)
R> imputeEM_none <- multinomial_impute(tract2221[,c(1:2,4,7,9)],
+    method = "EM", conj_prior = "none", verbose = TRUE)
\end{CodeInput}
Here, \code{method = "EM"} is specified for the expectation-maximization algorithm and \code{conj\_prior = "none"} for a strictly maximum likelihood estimate. The parameter \code{verbose = TRUE} is specified to provide informative intermediate outputs. The option \code{parallel = TRUE} can also be specified to compute the sufficient statistics in parallel. In practice, this only provides a performance improvement for exceptionally large datasets and thus the default is currently \code{parallel = FALSE}.
\begin{CodeOutput}
[1] "Calculating observed data sufficient statistics."
[1] "Setting up Iteration 1."
Iteration 1 : log-likelihood = -32083.8393739337  Convergence Criteria = 0.0160612788 ...
Iteration 2 : log-likelihood = -24079.2963400782  Convergence Criteria = 0.0043943837 ...
.... output truncated ....
Iteration 119 : log-likelihood = -23650.2292737947  Convergence Criteria = 0.0000004931 ...
[1] "Imputing missing observations via MLE results."
\end{CodeOutput}
Typical of EM, convergence is quite rapid. By changing the argument \code{conj\_prior}, Bayesian priors may be specified. The other options for \code{conj\_prior} are shown below.
% NEED TO ADD: custom priors -- fully specify argument alpha
\begin{Code}
R> imputeEM_non <- multinomial_impute(tract2221[,c(1:2,4,7,9)],
+    method = "EM", conj_prior = "non.informative", verbose = TRUE)
R> imputeEM_flat <- multinomial_impute(tract2221[,c(1:2,4,7,9)],
+    method = "EM", conj_prior = "flat.prior", alpha = 10L, verbose = TRUE)
R> imputeEM_data <- multinomial_impute(tract2221[,c(1:2,4,7,9)],
+    method = "EM", conj_prior = "data.dep", verbose = TRUE)
\end{Code}
where a flat prior may be specified as a scalar by the parameter \code{alpha}. Note the use of a strong flat prior in this example.

In addition to observation level imputation, some users may only be interested in the parameter estimates. \pkg{imputeMulti} allows for this type of analysis as well. To do so, first compute the observed and marginally-observed summary statistics via \code{multinomial\_stats} and then estimate the parameters directly with your choice of prior via \code{multinomial\_em}. An example of this is shown below.\footnote{The differences in log-likelihood and convergence criteria in the displayed output of \code{multinomial\_impute} and \code{multinomial\_em} are due to random seeding of initial parameters. These random differences in initial parameter values lead to differences of less than 0.003 in final log-likelihood.}
\begin{CodeChunk}
\begin{CodeInput}
R> x_y <- multinomial_stats(tract2221[,c(1:2,4,7,9)], output = "x_y")
R> z_Os_y <- multinomial_stats(tract2221[,c(1:2,4,7,9)], output = "z_Os_y")
R> x_possible <- multinomial_stats(tract2221[,c(1:2,4,7,9)],
+    output = "possible.obs")
R> imputeEM_mle <- multinomial_em(x_y, z_Os_y, x_possible,
+    n_obs = nrow(tract2221), conj_prior = "none", verbose = TRUE)
\end{CodeInput}
\begin{CodeOutput}
[1] "Setting up Iteration 1."
Iteration 1 : log-likelihood = -32096.9776607773  Convergence Criteria = 0.0157098212 ...
Iteration 2 : log-likelihood = -24086.0109006056  Convergence Criteria = 0.0030262731 ...
.... output truncated ....
Iteration 101 : log-likelihood = -23650.2320024124  Convergence Criteria = 0.0000004888 ...
\end{CodeOutput}
\end{CodeChunk}
The outputs of these functions will be examined in Section~\ref{sec:outputs}.

\subsection{Imputation via data-augmentation} \label{sec:DA}

Using \pkg{imputeMulti} for DA is very similar to its use for expectation-maximization. The chief functional difference is using \code{method = "DA"} instead of \code{method = "EM"}. Imputation of observation level data is still done via \code{multinomial\_impute}; and, the choice of prior can be controlled by \code{conj\_prior}. An example is shown below using the second subset of variables discussed at the beginning of this section.
\begin{CodeChunk}
\begin{CodeInput}
R> imputeDA_none <- multinomial_impute(tract2221[,c(3,6,9:10)],
+    method = "DA", conj_prior = "none", verbose = TRUE, burnin = 100)
\end{CodeInput}
\begin{CodeOutput}
[1] "Calculating observed data sufficient statistics."
[1] "Setting up Iteration 1."
Iteration 1 : log-likelihood = -26927.4017334082 ...
Iteration 2 : log-likelihood = -18571.4267260919 ...
.... output truncated ....
Iteration 99 : log-likelihood = -18369.0208750788 ...
Iteration 100 : log-likelihood = -18381.5414370566 ...
[1] "Imputing missing observations via MLE results."
\end{CodeOutput}
\end{CodeChunk}

Parameter estimates via DA can also be easily obtained via \code{multinomial\_data\_aug} in a similar fashion to EM.
\begin{Code}
R> imputeDA_mle <- multinomial_data_aug(x_y, z_Os_y, x_possible,
+    conj_prior = "none", verbose = TRUE, burnin = 100, post_draws = 1000)
\end{Code}
With DA, the number of burn-in samples is controlled via \code{burnin} for both \code{multinomial\_impute} and \code{multinomial\_data\_aug}. \code{burnin} defaults to 100. Parameter estimates are calculated as the posterior mean based on \code{post\_draws} draws, which can again be specified for both functions. The default is 1000 draws.

\subsection{Examining imputed outputs} \label{sec:outputs}

\pkg{imputeMulti} uses S4 classes. Both \code{multinomial\_em} and \code{multinomial\_data\_aug} return \code{mod\_imputeMulti-class} objects while \code{multinomial\_impute} returns \code{imputeMulti-class} objects. The \code{imputeMulti-class} class inherits from the \code{mod\_imputeMulti-class}. Several methods are available for summarizing and extracting elements from the class objects. However, most users will be interested in only two methods of the \code{imputeMulti-class} object: \code{get\_parameters}, which extracts the parameter estimates, and \code{get\_imputations}, which extracts the imputed observation level data.\footnote{Other methods for these classes can be found in the documentation. See \code{?'mod\_imputeMulti-class'} and \code{?'imputeMulti-class'}.}
\begin{Code}
R> param_est <- get_parameters(imputeEM_none)
R> imputed_data <- get_imputations(imputeEM_none)
\end{Code}
Since neither \code{multinomial\_em} nor \code{multinomial\_data\_aug} imputes at the observation level, the \code{get\_imputations} method does not exist for \code{mod\_imputeMulti-class} objects.

To recombine imputed data with the original dataset, utilize the fact that \pkg{imputeMulti} maintains both row order and rownames. Recombining imputed and original data can therefore be done via a simple replacement or by merging via rownames. The latter method is implemented in \pkg{imputeMulti} and both methods are shown below:
\begin{Code}
R> tract_imp <- data.frame(imputed_data, tract2221[, c(3,6,9:10)])
R> tract_imp <- merge_imputed(imputeEM_none, tract2221[,c(3,6,9:10)])
\end{Code}

Researchers can also examine the parameter estimates directly. For example, researchers may be interested in the marginal distribution of $\hat \theta$ by gender and educational attainment among those aged eighteen to thirty-four compared to those aged fifty to sixty-four. Here a comparison is shown for the \code{"non.informative"} and \code{"flat.prior"} priors, which also illustrates the impact of the previously chosen strong flat prior.
\begin{Code}
R> param_est_non <- get_parameters(imputeEM_non)
R> param_est_flat <- get_parameters(imputeEM_flat)
\end{Code}
We first extract the marginal statistics and then normalize $\hat \theta$ within each age group. The impact of the choice of prior on $\hat \theta$ is then shown in Tables \ref{tbl:marg_param_est_noprior} and \ref{tbl:marg_param_est_flatprior}.
% non-informative prior
\begin{Code}
R> pe_non18 <- param_est_non[param_est_non$age %in%
+    c("18_24", "25_29", "30_34"),]
R> pe_non50 <- param_est_non[param_est_non$age %in%
+    c("50_54", "55_59", "60_64"),]
R> non_theta18_34 <- sum(pe_non18$theta_y)
R> non_theta50_64 <- sum(pe_non50$theta_y)
R> marg_non18 <- tapply(pe_non18$theta_y,
+    list(pe_non18$gender, pe_non18$edu_attain), sum)
R> marg_non50 <- tapply(pe_non50$theta_y,
+    list(pe_non50$gender, pe_non50$edu_attain), sum)
R> round(marg_non18 / non_theta18_34, 4)
R> round(marg_non50 / non_theta50_64, 4)
\end{Code}
% strong flat prior
\begin{Code}
R> pe_flat18 <- param_est_flat[param_est_flat$age %in%
+    c("18_24", "25_29", "30_34"),]
R> pe_flat50 <- param_est_flat[param_est_flat$age %in%
+    c("50_54", "55_59", "60_64"),]
R> flat_theta18_34 <- sum(pe_flat18$theta_y)
R> flat_theta50_64 <- sum(pe_flat50$theta_y)
R> marg_flat18 <- tapply(pe_flat18$theta_y,
+    list(pe_flat18$gender, pe_flat18$edu_attain), sum)
R> marg_flat50 <- tapply(pe_flat50$theta_y,
+    list(pe_flat50$gender, pe_flat50$edu_attain), sum)
R> round(marg_flat18 / flat_theta18_34, 4)
R> round(marg_flat50 / flat_theta50_64, 4)
\end{Code}

\FloatBarrier
\begin{table}
\centering
\begin{tabular}{l|ccccccc} \hline
\textbf{Ages 18-34} & \code{lt_hs} & \code{some_hs} & \code{hs_grad} & \code{some_col} & \code{assoc_dec} & \code{ba_deg} & \code{grad_deg} \\ \hline
Female & 0.0507 & 0.0421 & 0.1523 & 0.1251 & 0.0000 & 0.0912 & 0.0205 \\
Male & 0.0423 & 0.0928 & 0.0922 & 0.1141 & 0.0187 & 0.1197 & 0.0383 \\ \hline
\textbf{Ages 50-64} & & & & & & & \\ \hline
Female & 0.1260 & 0.0866 & 0.1304 & 0.1051 & 0.0405 & 0.0000 & 0.0318 \\
Male & 0.1704 & 0.0980 & 0.0357 & 0.0471 & 0.0575 & 0.0710 & 0.0000 \\ \hline \hline
\end{tabular}
\caption{Estimates of the marginal distribution of $\hat \theta$ by gender and educational attainment using a non-informative prior. Estimates are for individuals aged 18-34 and 50-64.}
\label{tbl:marg_param_est_noprior}
\end{table}

\begin{table}
\centering
\begin{tabular}{l|ccccccc} \hline
\textbf{Ages 18-34} & \code{lt_hs} & \code{some_hs} & \code{hs_grad} & \code{some_col} & \code{assoc_dec} & \code{ba_deg} & \code{grad_deg} \\ \hline
Female & 0.0684 & 0.0671 & 0.0831 & 0.0791 & 0.0611 & 0.0744 & 0.0641 \\
Male & 0.0671 & 0.0744 & 0.0744 & 0.0776 & 0.0641 & 0.0784 & 0.0668 \\ \hline
\textbf{Ages 50-64} & & & & & & & \\ \hline
Female & 0.0763 & 0.0729 & 0.0764 & 0.0744 & 0.0687 & 0.0653 & 0.0680 \\
Male & 0.0801 & 0.0738 & 0.0652 & 0.0694 & 0.0702 & 0.0713 & 0.0652 \\ \hline \hline
\end{tabular}
\caption{Estimates of the marginal distribution of $\hat \theta$ by gender and educational attainment using a strong flat prior. Estimates are for individuals aged 18-34 and 50-64.}
\label{tbl:marg_param_est_flatprior}
\end{table}

The marginalized estimates of $\hat \theta$ with a non-informative prior show that education levels have improved over time, as the younger cohort has higher estimated parameters for `some college' or higher educational attainment. It is also clear that there is greater gender equality in educational outcomes for the younger cohort. But the analyst would not draw these conclusions using a strong flat prior. As expected, a strong flat prior exerts substantial smoothing effects on the estimates of $\hat \theta$. While there is a large range in parameter estimates for the non-informative prior, all estimates for the flat prior have been smoothed to near 0.07.
This comparison shows that, as with most Bayesian analyses, the results of imputation of multinomial data can be quite sensitive to the choice of prior.

%------------------------------------------------------------
\section{Comparing imputeMulti} \label{sec:compare}
%------------------------------------------------------------

In this section, we compare \pkg{imputeMulti} with other popular imputation methods. Specifically, we examine bootstrap EM via \pkg{Amelia} \cite{amelia}, polytomous logistic regression via \pkg{mice} \cite{mice}, and predictive mean matching via \pkg{Hmisc} \cite{Hmisc}. The goal of the comparison is not to provide an \textit{exhaustive} comparison of imputation methods, but to compare commonly used methods within each package. The \code{tract2221} dataset is again used for illustration, in this case looking at imputation of five variables: \code{age}, \code{gender}, \code{marital_status}, \code{edu_attain}, and \code{emp_status}.

The first thing to note is that \pkg{Amelia} does not allow for categorical variables with more than ten levels, while \code{tract2221$age} has sixteen levels.
\begin{CodeChunk}
\begin{CodeInput}
R> library("Amelia")
R> amelia_EM <- amelia(tract2221[,1:5], m = 1, noms = 1:5)
\end{CodeInput}
\begin{CodeOutput}
Warning message:
In amcheck(x = x, m = m, idvars = numopts$idvars, priors = priors, :
  The number of categories in one of the variables marked nominal
  has greater than 10 categories. Check nominal specification.
\end{CodeOutput}
\end{CodeChunk}
This is one serious limitation on multivariate multinomial imputation when using a package designed for a multivariate normal distribution. Researchers using \pkg{Amelia} must immediately decide the best way to limit some of the richness of their data. None of the other three packages has this restriction. To keep the remainder of the comparison as equivalent as possible, only four variables are used: \code{gender}, \code{marital_status}, \code{edu_attain}, and \code{emp_status}.

\subsection{Comparing algorithm speed}

The \pkg{microbenchmark} package \cite{microbenchmark} is used for speed comparisons. Tests were performed on an Intel Xeon CPU E5-2960 v3 2.60GHz server running Windows Server 2012 R2 Standard and were not run in parallel. To purely test algorithm speed, multiple imputations are not specified for any packages that allow them.
\begin{CodeChunk}
\begin{CodeInput}
R> library("Amelia")
R> library("Hmisc")
R> library("imputeMulti")
R> library("mice")
R> library("microbenchmark")
R> test_d <- tract2221[,2:5]
R> microbenchmark(
+    EM = multinomial_impute(test_d, "EM", conj_prior = "non.informative"),
+    DA = multinomial_impute(test_d, "DA", conj_prior = "non.informative"),
+    amelia = amelia(test_d, m = 1, noms = 1:4),
+    hmisc = aregImpute(~ gender + marital_status + edu_attain + emp_status,
+      data = test_d, n.impute = 1, type = "pmm"),
+    mice = mice(test_d, m = 1, method = "polyreg"),
+    times = 10L)
\end{CodeInput}
\begin{CodeOutput}
Unit: milliseconds
   expr     min      lq    mean  median      uq     max neval
     EM 2271.97 2279.53 2336.20 2282.18 2418.99 2532.10    10
     DA 3664.21 3843.10 4328.98 4242.36 4948.12 4979.40    10
 amelia  144.29  149.73  175.99  151.40  153.20  402.19    10
  hmisc  682.13  688.32  822.01  819.47  945.75  997.42    10
   mice 3899.12 3913.96 4091.42 3956.81 4161.81 4995.85    10
\end{CodeOutput}
\end{CodeChunk}
All algorithms finish in a matter of milliseconds or seconds. \pkg{Amelia} is clearly the fastest algorithm, beating \pkg{imputeMulti} by more than an order of magnitude in this test.
As expected, EM, which stops upon convergence, is noticeably faster than DA, which must run all \code{burnin} iterations before calculating posterior means.

\subsection{Comparing outputs: Parameter estimates} \label{sec:param_compare}

Differences in the outputs of the various methods are examined next, focusing on differences in output rather than the statistical properties of each package. For a discussion of using the multivariate normal to approximate multinomial data, see \cite[Chapter~6]{schafer}. For the purposes of this article, a comparison of a single commonly used method within each package is used rather than an exhaustive comparison. As above, multiple imputations are not specified for any packages that allow them. Functions used for these comparisons, as well as for post-processing the \pkg{Hmisc} and \pkg{mice} outputs (Section~\ref{sec:obs_compare}), are provided in the supplementary material to this article.
\begin{Code}
R> source("./00_JSS_suplemental_functions.R")
R> set.seed(1987543L)
R> test_dat1 <- tract2221[complete.cases(tract2221[,2:5]), 2:5]
R> test_dat2 <- createNAs(test_dat1, pctNA = 0.15)
R> rownames(test_dat2) <- 1:nrow(test_dat2)
R> IM_EM <- multinomial_impute(test_dat2, "EM",
+    conj_prior = "non.informative", verbose = TRUE)
R> amelia_EM <- amelia(test_dat2, m = 1,
+    noms = c("gender", "marital_status", "edu_attain", "emp_status"))
R> hmisc_pmm <- aregImpute(~ gender + marital_status + edu_attain + emp_status,
+    data = test_dat2, n.impute = 1, type = "pmm")
R> mice_ch_eq <- mice(test_dat2, m = 1, method = "polyreg")
\end{Code}

\pkg{Amelia}, \pkg{mice}, and \pkg{Hmisc} all use S3 classes for outputs. In both \pkg{mice} and \pkg{Hmisc}, the parameter estimates are not directly available. The parameter estimates from the \code{amelia} function are found in \code{amelia_EM$mu} and \code{amelia_EM$theta}. This output structure fits the expectation that the data come from a multivariate normal (e.g., $Y \sim N(\mu, \Sigma)$). The output of \code{amelia_EM$mu} is shown below.
\begin{Code}
noms.gender.2          -0.034183160
noms.marital_status.2  -0.001370118
noms.marital_status.3   0.001611829
noms.marital_status.4  -0.005067206
noms.marital_status.5   0.040268816
noms.edu_attain.2       0.014664753
noms.edu_attain.3       0.011949303
noms.edu_attain.4       0.020386341
noms.edu_attain.5      -0.024465460
noms.edu_attain.6       0.015697896
noms.edu_attain.7       0.003025067
noms.emp_status.2      -0.022108630
noms.emp_status.3       0.012268149
\end{Code}
As is clear from the above output, when working with multivariate multinomial data using the \pkg{Amelia} package, researchers are required to post-process parameter outputs to fit their needs. It is not immediately obvious how to get specific multinomial parameter estimates or how to marginalize these estimates, for example obtaining the marginal distribution of $\hat \theta$ by gender and educational attainment as in Section~\ref{sec:outputs}. An additional advantage to using \pkg{imputeMulti} is therefore improved ease of use for researchers interested in multinomial parameter estimates.

\FloatBarrier
\subsection{Comparing outputs: Observation level imputations} \label{sec:obs_compare}

Observation level imputations are provided by \pkg{Amelia}, \pkg{mice}, and \pkg{Hmisc}. In \pkg{Amelia} and in \pkg{imputeMulti}, imputed observations are returned in the original dataset and can be compared directly. For both \pkg{mice} and \pkg{Hmisc}, only the missing observations and their row numbers are returned. The imputations thus require post-processing to insert the imputed values into the original dataset containing missing values, as sketched below.
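Each package provides a helper for this post-processing: \code{complete} in \pkg{mice} and \code{impute.transcan} in \pkg{Hmisc}. A minimal, illustrative sketch for the \code{mice_ch_eq} and \code{hmisc_pmm} objects fit in Section~\ref{sec:param_compare} is shown below; the supplementary functions used for the accuracy tests that follow differ in their details.
\begin{Code}
R> ## mice: return the data with the imputations filled in
R> mice_complete <- complete(mice_ch_eq, 1)
R> ## Hmisc: insert the imputed values back into each variable
R> hmisc_complete <- as.data.frame(impute.transcan(hmisc_pmm,
+    imputation = 1, data = test_dat2, list.out = TRUE,
+    pr = FALSE, check = FALSE))
\end{Code}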
To test the accuracy of observation level imputations, two tests are run. The first test does not allow for multiple imputations, while the second does. To set up the tests, the \code{complete.cases} from \code{tract2221[,2:5]} are extracted and missing values are randomly inserted. Observation level imputation is then conducted for these datasets and the imputations are compared to the original, fully observed, dataset. To test sensitivity to the amount of missingness in the data, tests are run using 15\%, 30\%, 45\%, and 60\% missing values. The first test compares the accuracy of \pkg{imputeMulti} to a single imputation in the other methods. The second test allows ten multiple imputations and compares the accuracy of a single \pkg{imputeMulti} imputation to the maximum accuracy across the ten multiple imputations using the other methods. Ten such tests are run for each level of inserted missingness and the mean results are shown in Table~\ref{tbl:impute_accuracy_test1} and Table~\ref{tbl:impute_accuracy_test2}.

Unsurprisingly, imputation accuracy degrades for each method as missingness increases. But, in each test, \pkg{imputeMulti} has the highest accuracy for completely matching true observations; and, \pkg{imputeMulti} maintains this advantage even against multiple imputations. The closest comparison method is \pkg{mice}.

\begin{table}
\centering
\begin{tabular}{l|ccccc} \hline
Percent Missing & IM-EM & IM-DA & Amelia & Hmisc & mice \\ \hline
15\% & 0.7642 & 0.7615 & 0.6761 & 0.7077 & 0.7133 \\
30\% & 0.5700 & 0.5639 & 0.4404 & 0.4833 & 0.4882 \\
45\% & 0.4135 & 0.4077 & 0.2679 & 0.3166 & 0.3188 \\
60\% & 0.2827 & 0.2794 & 0.1460 & 0.1908 & 0.1993 \\ \hline
\end{tabular}
\caption{Mean accuracy of observation level imputation for \pkg{imputeMulti}, \pkg{Amelia}, \pkg{Hmisc}, and \pkg{mice} using a single imputation.}
\label{tbl:impute_accuracy_test1}
\end{table}

\begin{table}
\centering
\begin{tabular}{l|ccccc} \hline
Percent Missing & IM-EM & IM-DA & Amelia & Hmisc & mice \\ \hline
15\% & 0.7651 & 0.7615 & 0.6857 & 0.7175 & 0.7166 \\
30\% & 0.5687 & 0.5645 & 0.4493 & 0.4968 & 0.4966 \\
45\% & 0.4097 & 0.4070 & 0.2816 & 0.3233 & 0.3298 \\
60\% & 0.2853 & 0.2857 & 0.1653 & 0.1994 & 0.2052 \\ \hline
\end{tabular}
\caption{Mean accuracy of observation level imputation for \pkg{imputeMulti}, \pkg{Amelia}, \pkg{Hmisc}, and \pkg{mice}. A single imputation from \pkg{imputeMulti} is compared to the maximum accuracy from ten multiple imputations of the comparison methods.}
\label{tbl:impute_accuracy_test2}
\end{table}

%------------------------------------------------------------
\section{Conclusion} \label{sec:conclude}
%------------------------------------------------------------

Deciding how to deal with missing values is a frequent concern for applied researchers. Although many methods exist for missing value imputation, the vast majority rely on assumptions of multivariate normality. It is often desirable to use a model specifically designed for the distribution from which a researcher's data is generated. \pkg{imputeMulti} provides an easily accessible model for imputing multivariate multinomial data. Further, \pkg{imputeMulti} integrates easily into common analyses and handles large datasets with high performance by providing functions for calculating sufficient statistics in \proglang{C++}.
\pkg{imputeMulti} also allows parallel processing in the case of extremely large datasets. This article provides a hands-on introduction to the \pkg{imputeMulti} package as well as a comparison to existing methods in \proglang{R} for multivariate multinomial imputation. As shown, the use of a package specifically designed for multivariate multinomial imputation, such as \pkg{imputeMulti}, provides exact solutions for datasets with small numbers of variables; and, for datasets with large numbers of variables, \pkg{imputeMulti} may be used by splitting the variables into related groups. In addition, \pkg{imputeMulti} produces higher observation level imputation accuracy and provides easier access to parameter estimates.

Future development of \pkg{imputeMulti} is expected to focus on extending performance by migrating more of the code to \proglang{C++}, improving the search algorithms used in calculating the multinomial distribution sufficient statistics by implementing balanced trees, and exploring further parallelization of parameter estimation. As with any parallelization effort, the tradeoff between parallel overhead and increased utilization of multi-core computing resources must be intelligently balanced. To date, timing tests on Windows machines, which do not allow forking, have indicated that the parallel overhead is only justified for datasets with several hundred thousand or more observations.

%------------------------------------------------------------
% Bibliography
% \clearpage
\bibliography{imputeMulti}
\nocite{*}
%------------------------------------------------------------
\end{document}