%% missForest_1.0.Rnw created on 04.05.11
%% missForest_1.2.Rnw updated on 20.02.12
%% missForest_1.3.Rnw updated on 22.06.12
%% missForest_1.4.Rnw updated on 30.12.13
%% missForest_1.5.Rnw updated on 14.04.22
\documentclass[11pt]{article}
%\VignetteIndexEntry{missForest_1.5}
%\VignetteDepends{missForest}
\usepackage[english]{babel}
\usepackage{Sweave,graphicx,amsmath,amssymb,url}
\usepackage{natbib}
\usepackage[paper=a4paper,top=3cm,left=2.5cm,right=2.5cm,
  foot=1cm,bottom=2.5cm]{geometry} %% changing margins
%\newcommand{\bibfont}{\large}
\newcommand{\Rp}{\textsf{R }}
\newcommand{\mF}{\texttt{missForest }}

\title{Using the \texttt{missForest} Package}
\author{Daniel J. Stekhoven\\stekhoven@stat.math.ethz.ch}
\date{Friday, 13$^{\textrm{th}}$ of May, 2011\\\small{Update: Version 1.5, 14.04.22}}

\begin{document}
\SweaveOpts{concordance=TRUE}
\maketitle
\SweaveOpts{engine=R, eps=FALSE, pdf=TRUE, width=5.33, height=3}
\SweaveOpts{strip.white=true, keep.source=TRUE}

%% Setting default R options
<<echo=FALSE>>=
require(foreach)
require(missForest)
options(width=70, prompt="> ", continue="+ ")
options(SweaveHooks=list(fig=function() par(mar=c(4,4,0.4,0.7))))
@

\tableofcontents

\section{Introduction}

\subsection{What is this document? (And what it isn't!)}
This {\it package vignette} is an application-focussed user guide for the \Rp package \texttt{missForest}. The functionality is explained using a couple of real data examples. Argument selection with respect to feasibility and accuracy issues is discussed and illustrated using these real data sets. Do not be alarmed by the length of this document, which is mainly due to the extensive \Rp output included for illustrative reasons.

This document is {\it not} a theoretical primer for the fundamental approach of the \mF algorithm. It also does not contain any simulations or comparative studies with other imputation methods. For this information we point the interested reader to \cite{stekhoven11}.

\subsection{The \texttt{missForest} algorithm}
\mF is a nonparametric imputation method for basically any kind of data. It can cope with mixed-type variables, nonlinear relations, complex interactions and high dimensionality ($p \gg n$). It only requires the observations (i.e., the rows of the data frame supplied to the function) to be pairwise independent. The algorithm is based on random forest (\cite{breiman01}) and is dependent on its \Rp implementation \texttt{randomForest} by Andy Liaw and Matthew Wiener.

Put simply (for those who have skipped the previous paragraph): for each variable \mF fits a random forest on the observed part and then predicts the missing part. The algorithm continues to repeat these two steps until a stopping criterion is met or the user-specified maximum number of iterations is reached. For further details see \cite{stekhoven11}.

To understand the remainder of this user guide it is important to know that \mF runs iteratively, continuously updating the imputed matrix variable-wise, and assesses its performance between iterations. This assessment is done by considering the difference(s) between the previous imputation result and the new imputation result. As soon as this difference (in the case of a single type of variable) or these differences (in the case of mixed-type variables) increase, the algorithm stops.

\mF provides the user with an estimate of the imputation error. This estimate is based on the out-of-bag (OOB) error estimate of random forest. \cite{stekhoven11} showed that this estimate produces an appropriate representation of the true imputation error.
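For readers who prefer code over prose, the following chunk gives a purely schematic sketch of the iterative scheme described above for a data set consisting of continuous variables only. It is {\it not} the actual implementation of the package: the handling of categorical variables, the OOB error estimation and the maximum number of iterations are omitted, and the helper objects are made up for illustration.

<<eval=FALSE>>=
## Schematic sketch only -- not the actual missForest implementation.
## Assumes 'xmis' is a data frame of continuous variables containing NAs.
library(randomForest)
ximp <- xmis
for (v in seq_len(ncol(ximp)))                        # crude initial guess: column means
  ximp[is.na(xmis[, v]), v] <- mean(xmis[, v], na.rm = TRUE)
diff.old <- Inf
repeat {
  ximp.old <- ximp
  for (v in order(colSums(is.na(xmis)))) {            # variables sorted by missingness
    mis <- is.na(xmis[, v])
    if (!any(mis)) next
    rf <- randomForest(x = ximp[!mis, -v, drop = FALSE], y = ximp[!mis, v])
    ximp[mis, v] <- predict(rf, ximp[mis, -v, drop = FALSE])
  }
  diff.new <- sum((ximp - ximp.old)^2) / sum(ximp^2)  # difference between iterations
  if (diff.new > diff.old) {                          # stop as soon as the difference increases
    ximp <- ximp.old                                  # and return the previous result
    break
  }
  diff.old <- diff.new
}
@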
\subsection{Installation}
The \Rp package \mF is available from the Comprehensive \Rp Archive Network (CRAN, \url{http://cran.r-project.org/}) and as such can be installed in the default way using the \texttt{install.packages} function:

<<eval=FALSE>>=
install.packages("missForest", dependencies = TRUE)
@

Make sure to include the \texttt{dependencies = TRUE} argument to also install the \texttt{randomForest} package, unless it is already installed.

\section{Missing value imputation with \texttt{missForest}}
In this section we describe the use of the \mF function. We will shed light on all arguments which can or have to be supplied to the algorithm. Also, we will discuss how to make \mF faster or more accurate. Finally, an interpretation of the OOB imputation error estimates is given.

\subsection{Description of the data used}
\begin{description}
\item[Iris data] This complete data set contains five variables of which one is categorical with three levels. It is contained in the \Rp base and can be loaded directly by typing \texttt{data(iris)}. The data were collected by \cite{anderson35}.
\item[Oesophageal cancer data] This complete data set comes from a case-control study of oesophageal cancer in Ile-et-Vilaine, France. It is contained in the \Rp base and can be loaded directly by typing \texttt{data(esoph)}. The data were collected by \cite{breslow80}.
\item[Musk data] This data set describes the shapes of 92 molecules of which 47 are musks and 45 are non-musks. Since a molecule can have many conformations due to rotating bonds, there are $n=476$ different conformations in the set. The classification into musk and non-musk molecules is removed. For further details see \cite{UCI10}.
\end{description}

\subsection{\texttt{missForest} in a nutshell}\label{nutshell}
After you have properly installed \mF you can load the package in your \Rp session:

<<>>=
library(missForest)
@

We will now load the famous Iris data set and artificially remove 10\% of the entries in the data completely at random using the \texttt{prodNA} function from the \mF package:

<<echo=FALSE>>=
set.seed(81)
@
<<>>=
data(iris)
iris.mis <- prodNA(iris, noNA = 0.1)
summary(iris.mis)
@

We can see that the missing values are roughly evenly distributed over the variables in the data set. By {\it completely at random} we mean that the process of deleting entries is not influenced by the data or the data generating process. The missing data is now imputed by simply handing it over to \mF:

<<echo=FALSE>>=
set.seed(81)
@
<<>>=
iris.imp <- missForest(iris.mis)
@

Except for the iteration numbering no additional print-out is given. The results are stored in the \Rp object \texttt{iris.imp} which is a list. We can call upon the imputed data matrix by typing \texttt{iris.imp\$ximp}. {\it Note: A common mistake is to use} \texttt{iris.imp} {\it instead of} \texttt{iris.imp\$ximp} {\it for subsequent analyses.}

Additionally, \mF provides an OOB imputation error estimate which can be extracted using the same \texttt{\$} notation as with the imputed data matrix:

<<>>=
iris.imp$OOBerror
@

As mentioned before, the Iris data set contains two types of variables, continuous and categorical. This is why the OOB imputation error supplies two values for the result of the imputation (default setting). The first value is the normalized root mean squared error (NRMSE, see \cite{oba03}) for the continuous part of the imputed data set, i.e., \texttt{Sepal.Length}, \texttt{Sepal.Width}, \texttt{Petal.Length} and \texttt{Petal.Width}. The second value is the proportion of falsely classified entries (PFC) in the categorical part of the imputed data set, i.e., \texttt{Species}. In both cases good performance of \mF leads to a value close to 0 and bad performance to a value around 1.
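For reference, the NRMSE is defined in \cite{oba03} as
\[
\textrm{NRMSE} = \sqrt{\frac{\textrm{mean}\left(\left({\bf X}^{true}-{\bf X}^{imp}\right)^2\right)}{\textrm{var}\left({\bf X}^{true}\right)}},
\]
where ${\bf X}^{true}$ is the complete data matrix, ${\bf X}^{imp}$ the imputed data matrix and the mean and variance are computed only over the continuous missing entries. For the OOB estimate -- where the true values of the missing entries are of course unknown -- the observed values and their OOB predictions take the roles of ${\bf X}^{true}$ and ${\bf X}^{imp}$, respectively.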
If you are interested in assessing the reliability of the imputation for single variables, e.g., to decide which variables to use in a subsequent data analysis, \mF can return the OOB errors for each variable separately instead of aggregating over the whole data matrix. This can be done using the argument \texttt{variablewise = TRUE} when calling the \texttt{missForest} function.

<<>>=
iris.imp <- missForest(iris.mis, variablewise = TRUE)
iris.imp$OOBerror
@

We can see that the output has the same length as there are variables in the data. For each variable the resulting error and the type of error measure, i.e., mean squared error (MSE) or PFC, is returned. Note that in this case the MSE is returned instead of the NRMSE.

\subsection{Additional output using \texttt{verbose}}\label{verbose}
In Section \ref{nutshell} the print-out of \mF showed only which iteration is currently taking place. However, if you are imputing a large data set or choose to use ridiculously large \texttt{mtry} and/or \texttt{ntree} arguments (see Section \ref{ntreemtry}) you might be interested in getting additional information on how \mF is performing. By setting the logical \texttt{verbose} argument to \texttt{TRUE} the print-out is extended threefold:
\begin{description}
\item[\texttt{estimated error(s)}] The OOB imputation error estimate for the continuous and categorical parts of the imputed data set. {\it Note: If there is only one type of variable there will be only one value with the corresponding error measure.}
\item[\texttt{difference(s)}] The difference between the previous and the new imputed continuous and categorical parts of the data set. The difference for the set of continuous variables ${\bf N}$ in the data set is computed by
\[
\frac{\sum_{j\in {\bf N}}({\bf X}^{imp}_{new} - {\bf X}^{imp}_{old})^2}{\sum_{j\in {\bf N}}({\bf X}^{imp}_{new})^2},
\]
and for the set of categorical variables the difference corresponds to the PFC.
\item[\texttt{time}] The runtime of the iteration in seconds.
\end{description}

If we rerun the previous imputation of the Iris data\footnote{Since random forest -- as its name suggests -- is using a random number generator (RNG) the result for two trials on the same missing data set will be different. To prevent this from happening in the given illustrative example we use the \texttt{set.seed} function before applying \mF on the \texttt{iris.mis} data set. This causes the RNG to be reset to the same state as before (where we invisibly called \texttt{set.seed(81)} already but did not want to trouble the concerned reader with technical details).}, this time setting \texttt{verbose = TRUE}, we get:

<<>>=
set.seed(81)
iris.imp <- missForest(iris.mis, verbose = TRUE)
@

The above print-out shows that \mF needs four iterations to finish. If we check the final OOB imputation error estimate:

<<>>=
iris.imp$OOBerror
@

we can see that it used the result from the second to last iteration, i.e., the third instead of the last one.
This is because the stopping criterion was triggered, and the fact that the differences increase indicates that the new imputation is probably less accurate than the previous one. However, we can also see that the {\it estimated} errors are lower for the last imputation than for the one before. We will show later on that the true imputation error is nevertheless lower for iteration 3 (the impatient reader can jump ahead to Section \ref{xtrue}).

\subsection{Changing the number of iterations with \texttt{maxiter}}
Depending on the composition and structure of the data it is possible that \mF needs more than the typical four to five iterations (see Section \ref{verbose}) until the stopping criterion kicks in. From an optimality point of view we do want \mF to stop due to the stopping criterion and not due to the limit of iterations. However, if the difference between iterations is steadily shrinking towards nought and the estimated error is in a stalemate, the only way to keep computation time at a reasonable level is to limit the number of iterations using the argument \texttt{maxiter}.

We show this using the \texttt{esoph} data. First, we run \mF on a data set where we removed 5\% of the entries at random:

<<echo=FALSE>>=
set.seed(84)
@
<<>>=
data(esoph)
esoph.mis <- prodNA(esoph, 0.05)
set.seed(96)
esoph.imp <- missForest(esoph.mis, verbose = TRUE)
@

We can see that it takes \mF nine iterations to come to a stop. The returned imputation result was reached in iteration 8, with estimated errors of 0.55 and 0.73 and differences of $3\cdot10^{-5}$ and $0$. In iteration 6 the estimated errors are smaller (i.e., 0.53 and 0.70) and the differences are $1\cdot10^{-4}$ and $4\cdot10^{-3}$. So why does \mF not simply take the sixth iteration and call it a day? Because the difference in the continuous part of the data set is still decreasing in each iteration up until iteration 9. This stopping strategy is on average (taking all possible data sets into account) quite good but can have its caveats for specific data sets. In the above case of the \texttt{esoph} data we can get the result of the sixth iteration by doing the following:

<<>>=
set.seed(96)
esoph.imp <- missForest(esoph.mis, verbose = TRUE, maxiter = 6)
@

The returned result is now given by iteration 6. Essentially, there are two uses for the \texttt{maxiter} argument:
\begin{enumerate}
\item controlling the run time in case of stagnating performance;
\item extracting a preferred iteration result not supplied by the stopping criterion.
\end{enumerate}

\subsection{Speed and accuracy trade-off manipulating \texttt{mtry} and \texttt{ntree}}\label{ntreemtry}
<<eval=FALSE>>=
X <- scan('http://stat.ethz.ch/Teaching/Datasets/musk.dat',
          what = 'character', sep = ',')
X <- matrix(X, ncol = 169, byrow = TRUE)
rowlabels <- X[,1]
X <- X[,-c(1,2)] # remove molecule names and conformations
X <- X[,-167]    # remove response
musk <- matrix(as.numeric(X), ncol = 166)
@

In each iteration \mF grows a random forest for each variable in order to impute the missing values. With a large number of variables $p$ this can lead to computation times beyond today's perception of feasibility. There are two ways to speed up the imputation process of \texttt{missForest}:
\begin{enumerate}
\item reducing the number of trees grown in each forest using the argument \texttt{ntree};
\item reducing the number of variables randomly sampled at each split using the argument \texttt{mtry}.
\end{enumerate}
It is imperative to know that reducing either of these numbers will probably result in reduced accuracy.
This is why we speak of a speed and accuracy {\it trade-off}.

\subsubsection{\texttt{ntree}}
The effect of reducing \texttt{ntree} on the computation time is linear, e.g., halving \texttt{ntree} will halve the computation time for a single iteration. The default value in \mF is set to 100, which is fairly large. Smaller values in the tens can already give appropriate results. We show this using the Musk data:

<<eval=FALSE>>=
musk.mis <- prodNA(musk, 0.05)
musk.imp <- missForest(musk.mis, verbose = TRUE, maxiter = 3)
@
\begin{verbatim}
  missForest iteration 1 in progress...done!
    estimated error(s): 0.1491825
    difference(s): 0.02383702
    time: 280.739 seconds

  missForest iteration 2 in progress...done!
    estimated error(s): 0.1367353
    difference(s): 0.0001208087
    time: 277.011 seconds

  missForest iteration 3 in progress...done!
    estimated error(s): 0.137418
    difference(s): 3.836082e-05
    time: 278.287 seconds
\end{verbatim}

The computation time is about 14 minutes and we end up with an estimated NRMSE of 0.14. {\it Note: The response was removed from the Musk data, which is why there is only the estimated NRMSE and only the difference for the continuous part of the data set.}

If we repeat the imputation setting the \texttt{ntree} argument to 20 we get:

<<eval=FALSE>>=
musk.imp <- missForest(musk.mis, verbose = TRUE, maxiter = 3, ntree = 20)
@
\begin{verbatim}
  missForest iteration 1 in progress...done!
    estimated error(s): 0.1724939
    difference(s): 0.02383371
    time: 56.705 seconds

  missForest iteration 2 in progress...done!
    estimated error(s): 0.1576795
    difference(s): 0.0002417658
    time: 55.833 seconds

  missForest iteration 3 in progress...done!
    estimated error(s): 0.1591702
    difference(s): 0.0001966117
    time: 56.053 seconds
\end{verbatim}

The computation time is now around 3 minutes, which is approximately a fifth of the previous computation time using 100 trees (as a matter of fact, taking the floor values of the iteration times in seconds, the former imputation took {\it exactly} five times longer than the latter). The estimated NRMSE has increased to 0.16 -- an increase of 14\% compared to before. In some applications this might seem an unacceptable increase of imputation error. However, if the number of variables is large enough, e.g., in the thousands as in gene expression data, the amount of computation time saved will surpass the increase in imputation error.

\subsubsection{\texttt{mtry}}
The effect on computation time when changing \texttt{mtry} is not as straightforward as with \texttt{ntree}. It is, however, more pronounced in settings with high dimensionality (e.g., $p\gg n$, where $n$ is the number of observations) and complex structures. The default setting in \mF is $\sqrt{p}$. This choice qualifies for a quite nice trade-off between imputation error and computation time. However, certain data might demand different choices, putting the focus either on a lower imputation error or on a shorter computation time. We leave this delicate choice to the user of these certain data sets.
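As a purely illustrative example (the values are made up and not a recommendation), a call putting more weight on accuracy by increasing \texttt{mtry} -- while still using the reduced \texttt{ntree} from above -- could look like this:

<<eval=FALSE>>=
## illustrative only: the default mtry for the Musk data would be floor(sqrt(166)) = 12
musk.imp <- missForest(musk.mis, verbose = TRUE, maxiter = 3, ntree = 20, mtry = 30)
@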
\subsection{Use subsampling instead of bootstrapping by setting \texttt{replace} to \texttt{FALSE}}\label{replace}
Like in the original paper by \cite{breiman01}, \mF grows its trees on bootstrap samples of the data. Another possibility would be to use subsamples instead. When bootstrapping is performed, randomly selected observations are replaced in the data, i.e., a single observation can be selected several times. In subsampling the observations are not replaced and thus a single observation can be selected only once. If \texttt{replace=FALSE} then \texttt{sampsize} (controlling the size of the sample drawn from the data to grow a tree) is reduced from $n$ to $0.632n$. This is because otherwise there would be no more OOB observations and an error prediction would be impossible. The number 0.632 is the expected proportion of observations selected when using bootstrapping, i.e., when drawing $n$ observations with replacement: each observation is left out of such a sample with probability $(1-1/n)^n \approx e^{-1}$, so roughly $1-e^{-1} \approx 0.632$ of the observations are selected.

<<>>=
set.seed(81)
iris.imp.sub <- missForest(iris.mis, verbose = TRUE, replace = FALSE)
iris.imp.sub$OOBerror
@

We can see that there is no substantial improvement compared to bootstrapping in the previous example in Section \ref{verbose} for the \texttt{iris} data. However, in some cases subsampling can be superior to bootstrapping. Therefore, if time allows, explore both strategies and settle for the better performing one.

\subsection{Imbalanced data, stratified sampling and focussed selection (\texttt{classwt}, \texttt{cutoff}, \texttt{strata}, \texttt{sampsize})}\label{stratified}
From version 1.3 on, \mF offers the possibility to pass more arguments to the \texttt{randomForest} function at its core. These include:
\begin{description}
\item[\texttt{classwt}] adding priors to the classes in categorical variables;
\item[\texttt{cutoff}] setting cutoffs for each class in categorical variables;
\item[\texttt{strata}] performing stratified sampling for categorical variables;
\item[\texttt{sampsize}] defining the size of the samples drawn from a variable.
\end{description}
For each of these arguments the user has to generate a list containing the appropriate object for each variable at the corresponding list entry, i.e., the third entry of the list corresponds to the third variable in the data, etc. The first three arguments in the above list only make sense when used with categorical variables. However, the generated list has to have an entry for each variable -- for continuous variables include \texttt{NULL} (for \texttt{cutoff} use \texttt{1}). The \texttt{sampsize} argument can be used for both types of data: in the case of a continuous variable the entry is a single integer, and in the case of a categorical variable it is a vector of the same length as there are classes in the variable.

<<results=hide>>=
iris.sampsize <- list(12, 12, 12, 12, c(10, 15, 10))
iris.imp.sampsize <- missForest(iris.mis, sampsize = iris.sampsize)
@

Note how we set the list entry for \texttt{sampsize} in case of the fifth variable \texttt{Species} to a vector with three entries. An example for the use of \texttt{cutoff} could be:

<<results=hide>>=
iris.cutoff <- list(1, 1, 1, 1, c(0.3, 0.6, 0.1))
iris.imp.cutoff <- missForest(iris.mis, cutoff = iris.cutoff)
@

Here we set the cutoff for \texttt{setosa} to 0.3, for \texttt{versicolor} to 0.6 and for \texttt{virginica} to 0.1 (\emph{not that this would make any sense -- it is simply to show how the arguments have to be generated}). Equivalently, using \texttt{NULL} instead of 1 for the continuous variables, the input for \texttt{classwt} looks as follows:

<<results=hide>>=
iris.classwt <- list(NULL, NULL, NULL, NULL, c(10, 30, 20))
iris.imp.classwt <- missForest(iris.mis, classwt = iris.classwt)
@

\subsection{Controlling terminal nodes w.r.t. \texttt{nodesize} and \texttt{maxnodes}}\label{terminal}
We can control the structural tree growing process twofold:
\begin{itemize}
\item by setting the maximum number of terminal nodes in the tree;
\item by defining the minimum number of observations in a terminal node.
\end{itemize}
The default setting for the maximum number of nodes is given by the maximum possible in the tree growing process, subject to the limits of \texttt{nodesize}, which in turn has the default setting of 1 for continuous and 5 for categorical variables. The \texttt{maxnodes} argument is simply specified by an integer. For \texttt{nodesize} the user needs to supply a vector of length 2 where the first entry corresponds to continuous and the second entry to categorical variables:

<<results=hide>>=
iris.imp.term <- missForest(iris.mis, nodesize = c(3, 7))
@

In the above call to \mF we set the minimum number of observations in terminal nodes to 3 for continuous variables and 7 for categorical variables. Especially the \texttt{maxnodes} argument can have a strong effect on computation time.
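As a purely illustrative example (the value 20 is made up and not a recommendation), restricting each tree to at most 20 terminal nodes using \texttt{maxnodes} could look like this:

<<results=hide>>=
iris.imp.maxn <- missForest(iris.mis, maxnodes = 20)
@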
\subsection{Testing the appropriateness by supplying \texttt{xtrue}}\label{xtrue}
Whenever imputing data with real missing values the question arises how good the imputation was. In \mF the estimated OOB imputation error gives a good indication of what you have to expect. A wary user might want to make an additional assessment (or back the OOB estimate up) by performing cross-validation or -- in the optimal case -- testing \mF previously on complete data. For both cases \mF offers the \texttt{xtrue} argument, which simply takes in the same data matrix as \texttt{xmis} but with no missing values present. The strategy for testing the performance is the same as shown in the previous examples using \texttt{prodNA}:
\begin{enumerate}
\item Generate a data matrix with missing values;
\item impute this artificially generated data matrix;
\item compare the complete and imputed data matrices.
\end{enumerate}
The functions to use for this strategy are \texttt{prodNA}, \mF and \texttt{mixError}. Using again the Iris data this would look like:

<<results=hide>>=
iris.mis <- prodNA(iris, noNA = 0.1)
iris.imp <- missForest(iris.mis)
@
<<>>=
iris.err <- mixError(iris.imp$ximp, iris.mis, iris)
print(iris.err)
@

{\it Note: We want to point out once more that the user has to extract the imputed matrix from the \mF output using the} \texttt{\$} {\it list notation. Not doing so will generate the following error:}

<<eval=FALSE>>=
iris.err <- mixError(iris.imp, iris.mis, iris)
@
\begin{verbatim}
Error in mixError(iris.imp, iris.mis, iris) :
  Wrong input for 'xmis' - you probably forgot to point at the list
  element $ximp from the missForest output object.
\end{verbatim}

We can simplify the above strategy by using \texttt{xtrue}. If combined with \texttt{verbose = TRUE} the user even gets additional information on the performance of \mF between iterations:

<<echo=FALSE>>=
set.seed(81)
@
<<>>=
iris.imp <- missForest(iris.mis, xtrue = iris, verbose = TRUE)
@

Supplying \texttt{xtrue} adds the line \texttt{error(s)} to the \mF output. We can observe that the true imputation error really is lower for the second to last iteration, as mentioned in Section \ref{verbose}. Additionally, the output object (in the above example \texttt{iris.imp}) now contains a list element \texttt{error} which can be called directly:

<<>>=
iris.imp$error
@

\subsection{Parallel execution of \mF using \texttt{parallelize}}\label{parallelize}
The argument \texttt{parallelize} allows running \mF on multiple cores in parallel to save computation time. The parallel computation is achieved using the packages \texttt{foreach} (\cite{foreach}) and \texttt{itertools} (\cite{itertools}).
There are two possible ways to parallelize the algorithm of \mF:
\begin{enumerate}
\item {\bf Create random forest objects in parallel}.\\
For each random forest grown, \texttt{ntree} is divided into $k$ parts, where $k$ equals the number of available cores. Each core computes $\frac{\textrm{ntree}}{k}$ trees and finally the results are combined. This parallelization is most useful if the random forest objects take long to compute and there are not too many variables with missing values in the data. It has no consequences for the theoretical aspects of \mF (see \cite{stekhoven11}).
\item {\bf Compute multiple iterations of \mF in parallel}.\\
An iteration of \mF consists of growing a random forest on the observed parts of a variable and subsequently predicting the missing parts using this forest. Partitioning the variables containing missing values into subsets of size $k$, where $k$ equals the number of available cores, allows for the parallel computation of $k$ such iterations. When each of the $k$ iterations is finished the missing values in the first $k$ variables are updated and the next block of $k$ variables is started. This parallelization is most useful if the data consists of many variables and the random forest objects do not take long to compute\footnote{This can be achieved artificially by setting \texttt{ntree} or \texttt{mtry} relatively low, see Section \ref{ntreemtry}.}. However, the methodology of this procedure deviates from the original \mF algorithm, where the missing values are updated after each iteration. Now, the values are only updated after the separate computation of the $k$ iterations. This seems to have no negative effect on the performance of \mF.
\end{enumerate}

The use of parallel computing with \Rp requires an appropriate parallel backend telling \Rp where the cores are and how many of them are available. There are several packages offering this functionality. Here, we give a short example taken from \cite{doParallel} on how this can be achieved using the package \texttt{doParallel}. It is recommended to carefully read the documentation of \texttt{doParallel} (especially the section ``Registering the \texttt{doParallel} parallel backend''). The following code loads the \texttt{doParallel} package and registers a (default) parallel backend. The function \texttt{getDoParWorkers()} returns the number of available cores. The package \texttt{doRNG} provides an operator which ensures consistent results across \texttt{foreach} loops with respect to random numbers. This is important because, as you may know, random forests depend on the random number generator. The command \texttt{registerDoRNG} subsequently allows setting a seed.

<<>>=
require(doParallel)
registerDoParallel(cores=2)
getDoParWorkers()
require(doRNG)
registerDoRNG(seed = 1.618)
foreach(i=1:3) %dorng% sqrt(i)
@

If the \texttt{foreach} loop returns the above output, the backend is registered and can be used by \mF. If the data has neither many variables nor random forests that take very long to compute, but parallel computing is still needed, we recommend trying both parallelizations (see the calls sketched below) and settling for the one that improves the performance most.
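With a registered backend, the two strategies are selected through the \texttt{parallelize} argument of \mF itself, which accepts the values \texttt{'no'} (the default), \texttt{'forests'} and \texttt{'variables'}. Purely for illustration, the two calls could look like this:

<<eval=FALSE>>=
## parallelize the computation of each random forest over the registered cores
iris.imp.forests <- missForest(iris.mis, parallelize = 'forests')
## parallelize over the variables to be imputed
iris.imp.vars <- missForest(iris.mis, parallelize = 'variables')
@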
\section{Concluding remarks}
Imputation using \mF can be done very easily. The OOB imputation error estimate facilitates the interpretation of such imputation results. However, it should always be kept in mind that imputing data with missing values does not increase the information contained within the data. It is only a way to obtain completeness for further data analysis.

Many methods of data analysis require complete observations. In such complete case analyses, observations missing only a single entry are completely removed from the data and therefore the information content is reduced. Imputing the data beforehand prevents this reduction. For further details on the effect of imputation on the subsequent data analysis we suggest the books of \cite{schafer97} and \cite{little87}.

\section*{Acknowledgments}
We thank Steve Weston for his input on the parallel computation approach.

\bibliographystyle{plainnat}
\bibliography{myBib}

\end{document}