\documentclass{article} \usepackage{url} %\VignetteIndexEntry{Estimates in subpopulations} \usepackage{Sweave} \author{Thomas Lumley} \title{Estimates in subpopulations.} \begin{document} \maketitle Estimating a mean or total in a subpopulation (domain) from a survey, eg the mean blood pressure in women, is not done simply by taking the subset of data in that subpopulation and pretending it is a new survey. This approach would give correct point estimates but incorrect standard errors. The standard way to derive domain means is as ratio estimators. I think it is easier to derive them as regression coefficients. These derivations are not important for R users, since subset operations on survey design objects automatically do the necessary adjustments, but they may be of interest. The various ways of constructing domain mean estimators are useful in quality control for the survey package, and some of the examples here are taken from \texttt{survey/tests/domain.R}. Suppose that in the artificial \texttt{fpc} data set we want to estimate the mean of \texttt{x} when \texttt{x>4}. <<>>= library(survey) data(fpc) dfpc<-svydesign(id=~psuid,strat=~stratid,weight=~weight,data=fpc,nest=TRUE) dsub<-subset(dfpc,x>4) svymean(~x,design=dsub) @ The \texttt{subset} function constructs a survey design object with information about this subpopulation and \texttt{svymean} computes the mean. The same operation can be done for a set of subpopulations with \texttt{svyby}. <<>>= svyby(~x,~I(x>4),design=dfpc, svymean) @ In a regression model with a binary covariate $Z$ and no intercept, there are two coefficients that estimate the mean of the outcome variable in the subpopulations with $Z=0$ and $Z=1$, so we can construct the domain mean estimator by regression. <<>>= summary(svyglm(x~I(x>4)+0,design=dfpc)) @ Finally, the classical derivation of the domain mean estimator is as a ratio where the numerator is $X$ for observations in the domain and 0 otherwise and the denominator is 1 for observations in the domain and 0 otherwise <<>>= svyratio(~I(x*(x>4)),~as.numeric(x>4), dfpc) @ The estimator is implemented by setting the sampling weight to zero for observations not in the domain. For most survey design objects this allows a reduction in memory use, since only the number of zero weights in each sampling unit needs to be kept. For more complicated survey designs, such as post-stratified designs, all the data are kept and there is no reduction in memory use. \subsection*{More complex examples} Verifying that \texttt{svymean} agrees with the ratio and regression derivations is particularly useful for more complicated designs where published examples are less readily available. This example shows calibration (GREG) estimators of domain means for the California Academic Performance Index (API). <<>>= data(api) dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc) pop.totals<-c(`(Intercept)`=6194, stypeH=755, stypeM=1018) gclus1 <- calibrate(dclus1, ~stype+api99, c(pop.totals, api99=3914069)) svymean(~api00, subset(gclus1, comp.imp=="Yes")) svyratio(~I(api00*(comp.imp=="Yes")), ~as.numeric(comp.imp=="Yes"), gclus1) summary(svyglm(api00~comp.imp-1, gclus1)) @ Two-stage samples with full finite-population corrections <<>>= data(mu284) dmu284<-svydesign(id=~id1+id2,fpc=~n1+n2, data=mu284) svymean(~y1, subset(dmu284,y1>40)) svyratio(~I(y1*(y1>40)),~as.numeric(y1>40),dmu284) summary(svyglm(y1~I(y1>40)+0,dmu284)) @ Stratified two-phase sampling of children with Wilm's Tumor, estimating relapse probability for those older than 3 years (36 months) at diagnosis <<>>= library("survival") data(nwtco) nwtco$incc2<-as.logical(with(nwtco, ifelse(rel | instit==2,1,rbinom(nrow(nwtco),1,.1)))) dccs8<-twophase(id=list(~seqno,~seqno), strata=list(NULL,~interaction(rel,stage,instit)), data=nwtco, subset=~incc2) svymean(~rel, subset(dccs8,age>36)) svyratio(~I(rel*as.numeric(age>36)), ~as.numeric(age>36), dccs8) summary(svyglm(rel~I(age>36)+0, dccs8)) @ \end{document}