%\VignetteIndexEntry{Mandallaz' model-assisted small area estimators} \documentclass[a4paper]{article} \bibliographystyle{unsrt} \usepackage[usenames,dvipsnames]{color} \usepackage{listings} \lstset{ % basicstyle=\footnotesize , commentstyle=\color{PineGreen} , numberstyle=\tiny\color{Gray} , rulecolor=\color{Black} , keywordstyle=\color{Blue} , stringstyle=\color{Sepia} , showstringspaces=false , language=R } \usepackage[utf8]{inputenc} \title{Mandallaz' Model-Assisted Small Area Estimators} \author{Andreas Dominik Cullmann} \SweaveOpts{echo=false} \SweaveOpts{eval=false} \SweaveOpts{print=false} \SweaveOpts{width=60} \begin{document} \maketitle \providecommand{\rpack}[1]{package \texttt{#1}} \providecommand{\rdata}[1]{\texttt{#1}} \providecommand{\rcode}[1]{\texttt{#1}} \section{Introduction} Model-\emph{based} small area estimators (for example \cite{rao2003}, chapters 5ff.) depend on model assumptions to hold. This dependency doesn't make them very attractive for official statistics. Model-\emph{assisted} small area estimators do not depend on the model assumptions to hold, albeit their variances will be higher if the model is inappropriate (see \cite{saerndal1992}, chapter 6.7). The synthetic-regression estimator (SRE) (for example \cite{rao2003}, chapter 4.2.2) is biased, and the variance of its biased-corrected version, the generalized regression estimator (for example \cite{rao2003}, chapter 2.5), ``depends crucially on the [\ldots{}] residuals in the small area'' (\cite{mandallaz2013a2}, p. 444), which basically means that its variance will be unacceptably high in many applications. Daniel Mandallaz and others ( \cite{mandallaz2012a1}, \cite{mandallaz2013a2}, \cite{mandallaz2013b2}, \cite{mandallaz2013b1}, \cite{mandallaz2014c2} and \cite{mandallaz2013c1} ) propose an unbiased extension of the SRE for two- and three-phase sampling designs with or without clustering. The variance of the extended SRE is, like that of the SRE, based on all residuals in the second (or third) phase and asymptotically equivalent to the SRE's variance (see \cite{mandallaz2013a2}, p.~444). \section{Non-Exhaustive Auxiliary Information} \subsection{Two-Phase Sampling} <>== library("maSAE") @ Let us suppose we have a two-phase sampling design, and the sample data can be loaded via <>== data("s2") data("s1") @ We now add sampling phase indicators to the data and join (from a database point of view we do a union, but I'll keep calling it join) it into a single data.frame \rcode{s12}: <>== s12 <- bind_data(s1, s2) @ We build a small area Object <>== saeO <- saObj(data = s12, f = y ~x1 + x2 + x3 | g, s2 = "phase2") @ and get the small area estimations for the non-exhaustive case as <>== predict(saeO) @ \subsection{Three-Phase Sampling} Suddenly we stumble across data from a third sampling phase, the null phase (\rdata{s0} has all the predictors of \rdata{s2}, but we keep only one -- if we kept all of them, we'ld be back to two-phase sampling with more observations), and we join all three phases into \rdata{s012}: <>== data("s0") s0$x1 <- s0$x3 <- NULL s012 <- bind_data(s1, s2, s0) @ from which we predict again: <>== predict(saObj(data = s012, f = y ~x1 + x2 + x3 | g, s2 = "phase2", s1 = "phase1")) @ Note the drop in variance induced by the extensive null phase sampling. \section{Partially Exhaustive Auxiliary Information} Let us suppose we \emph{knew} the estimated small area means of the fixed effect sampled in all three phases to be the true small area means: <>== tm1 <- as.data.frame(tapply(s012$x2, s012$g, mean)) names(tm1)[1] <- c("x2"); tm1$g <- row.names(tm1) predict(saObj(data = s12, f = y ~x1 + x2 + x3 | g, s2 = "phase2", smallAreaMeans = tm1)) @ Again, the variance estimation is reduced, but not as markedly as before: due to the extensive null phase the mean estimations had very small variances (which added to the small area estimation variances). \section{Exhaustive Auxiliary Information} Of course we could also take our estimated small area means of all fixed effects from the first and second phase to be the true means: <>== preds <- paste("x",1:3, sep="") tm <- as.data.frame(rbind(colMeans(subset(s12, g == "a")[, preds]), colMeans(subset(s12, g == "b")[, preds]) ) ); tm$g=c("a", "b") @ That would give us the smallest variance estimates: <>== predict(saObj(data = s12, f = y ~x1 + x2 + x3 | g, s2 = "phase2", smallAreaMeans = tm)) @ \section{Model-Based Small Area Estimation} \textit{Wait! If I use the code in Appendix \ref{RAO} to calculate the EBLUP for the basic unit level model given by \cite{rao2003}, chapter 7.2, I get at least a much smaller mse1:} <>== source("Rao.R") library(nlme) dat <- subset(s2, ! is.na(s2$g)) dat <- dat[with(dat, order(g)), TRUE] aLmeObj <- lme(y ~x1 + x2 + x3, data = dat, random = ~1 | g) foo <- new(Class = "sae", lmeObj = aLmeObj, domain.df = tm) sae(foo) @ Of course you do, but you rely on the model's assumptions to hold. Have you checked them? \textit{ No, I don't even know what they are. But even if the tests would not reject the hypotheses of the assumptions being valid on, say, a .95 confidence level, I could never be sure.}\\ That's why I wrote \rcode{maSAE}.\\ \textit{ I see, but wait again, what is \ldots} \section{Cluster Sampling} <>== grep(".*clust", capture.output(str(s1)), value = TRUE) @ \textit{\ldots this ``clustid'' in the data.frames?}\\ Dear, I forgot that sampling for my completely made up example data was done with a cluster design! So all the above variances estimates are too optimistic. My fault. For the sake of CPU time on CRAN, I'll leave the following to you: <>== predict(saObj(data = s12, f = y ~x1 + x2 + x3 | g, s2 = "phase2", cluster = "clustid")) @ <>== predict(saObj(data = s012, f = y ~x1 + x2 + x3 | g, s2 = "phase2", s1 = "phase1", cluster = "clustid")) @ <>== predict(saObj(data = s12, f = y ~x1 + x2 + x3 | g, s2 = "phase2", smallAreaMeans = tm1, cluster = "clustid")) @ <>== predict(saObj(data = s12, f = y ~x1 + x2 + x3 | g, s2 = "phase2", smallAreaMeans = tm, cluster = "clustid")) @ \bibliography{bibliography} \appendix \section{Rao.R\label{RAO}} \lstinputlisting{Rao.R} \end{document}