\documentclass{article} \usepackage{url} %\VignetteIndexEntry{A survey analysis example} \usepackage{Sweave} \author{Thomas Lumley} \title{A survey analysis example} \begin{document} \maketitle This document provides a simple example analysis of a survey data set, a subsample from the California Academic Performance Index, an annual set of tests used to evaluate California schools. The API website, including the original data files are at \url{http://api.cde.ca.gov}. The subsample was generated as a teaching example by Academic Technology Services at UCLA and was obtained from \url{http://www.ats.ucla.edu/stat/stata/Library/svy_survey.htm}. We have a cluster sample in which 15 school districts were sampled and then all schools in each district. This is in the data frame \texttt{apiclus1}, loaded with \texttt{data(api)}. The two-stage sample is defined by the sampling unit (\texttt{dnum}) and the population size(\texttt{fpc}). Sampling weights are computed from the population sizes, but could be provided separately. <<>>= library(survey) data(api) dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc) @ The \texttt{svydesign} function returns an object containing the survey data and metadata. <<>>= summary(dclus1) @ We can compute summary statistics to estimate the mean, median, and quartiles of the Academic Performance Index in the year 2000, the number of elementary, middle, and high schools in the state, the total number of students, and the proportion who took the test. Each function takes a formula object describing the variables and a survey design object containing the data. <<>>= svymean(~api00, dclus1) svyquantile(~api00, dclus1, quantile=c(0.25,0.5,0.75), ci=TRUE) svytotal(~stype, dclus1) svytotal(~enroll, dclus1) svyratio(~api.stu,~enroll, dclus1) @ The ordinary R subsetting functions \verb'[' and \texttt{subset} work correctly on these survey objects, carrying along the metadata needed for valid standard errors. Here we compute the proportion of high school students who took the test <<>>= svyratio(~api.stu, ~enroll, design=subset(dclus1, stype=="H")) @ The warnings referred to in the output occured because several school districts have only one high school sampled, making the second stage standard error estimation unreliable. Specifying a large number of variables is made easier by the \texttt{make.formula} function <<>>= vars<-names(apiclus1)[c(12:13,16:23,27:37)] svymean(make.formula(vars),dclus1,na.rm=TRUE) @ Summary statistics for subsets can also be computed with \texttt{svyby}. Here we compute the average proportion of ``English language learners'' and of students eligible for subsidized school meals for elementary, middle, and high schools <<>>= svyby(~ell+meals, ~stype, design=dclus1, svymean) @ Regression models show that these socieconomic variables predict API score and whether the school achieved its API target <<>>= regmodel <- svyglm(api00~ell+meals,design=dclus1) logitmodel <- svyglm(I(sch.wide=="Yes")~ell+meals, design=dclus1, family=quasibinomial()) summary(regmodel) summary(logitmodel) @ We can calibrate the sampling using the statewide total for the previous year's API <<>>= gclus1 <- calibrate(dclus1, formula=~api99, population=c(6194, 3914069)) @ which improves estimation of some quantities <<>>= svymean(~api00, gclus1) svyquantile(~api00, gclus1, quantile=c(0.25,0.5,0.75), ci=TRUE) svytotal(~stype, gclus1) svytotal(~enroll, gclus1) svyratio(~api.stu,~enroll, gclus1) @ \end{document}