--- title: "surveyCV: Cross Validation Based on Survey Design" author: "Cole Guerin, Thomas McMahon, Jerzy Wieczorek" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{intro} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction to `surveyCV` This package implements cross validation (CV) for complex survey data, by accounting for strata, clusters, FPCs, and survey weights when creating CV folds as well as when calculating test-set loss estimates (currently either mean squared error (MSE) for linear models, or binary cross-entropy for logistic models). It is meant to work smoothly with [the `survey` package](https://cran.r-project.org/package=survey). In addition, the function `folds.svy()` (which makes the design-based CV folds) can be used to code up custom CV loops to evaluate other models besides linear and logistic regression. To understand why we believe it's important to account for survey design when carrying out CV, please see our paper: Wieczorek, Guerin, and McMahon (2022), "K-Fold Cross-Validation for Complex Sample Surveys," *Stat* [](https://doi.org/10.1002/sta4.454). Briefly, the problem is that standard CV assumes exchangeable data, which is not true of most complex sampling designs. We argue (1) that CV folds should mimic the sampling design (so that we're honest about how much the model-fits will vary across "samples like this one"), and (2) that test set MSEs should be averaged and weighted in ways that generalize to the full population from which the sample was drawn, assuming that is the population for which the models will be applied. This vignette briefly illustrates how to use `surveyCV` with several common types of sampling design. We provide three equivalent ways to carry out CV: direct control with `cv.svy`, use of an existing survey design object with `cv.svydesign`, or use of an existing survey GLM object with `cv.svyglm`. We also illustrate how to write your own design-based CV loop, with a design-consistent random forest example from the [`rpms`](https://cran.r-project.org/package=rpms) package. # Setup ```{r setup-packages, message=FALSE} # We set a random seed in this vignette only to ensure # that our discussion will match the CV output; # otherwise, each time we reran the vignette, we'd get different CV folds set.seed(2022) library(surveyCV) data("NSFG_data") library(survey) data("api") ``` # Linear and logistic regression with `cv.svy`, `cv.svydesign`, and `cv.svyglm` ## Direct control with `cv.svy` First, we borrow some examples from the documentation for `survey::svyglm()`, based on the `api` data (the Academic Performance Index for California schools in the year 2000). Say that we want to carry out complex survey CV for several linear models, and we are working with the stratified sample in `apistrat`. Using `cv.svy()`, we specify the dataset; the formulas for the models being compared; the number of folds; and any necessary information about the sampling design. For this particular stratified design, we must specify the strata variable, the weights variable, and the FPC variable. 

```{r}
# stratified sample
cv.svy(apistrat, c("api00~ell", "api00~ell+meals", "api00~ell+meals+mobility"),
       nfolds = 5, strataID = "stype", weightsID = "pw", fpcID = "fpc")
```

The 2nd model (`api00~ell+meals`) appears to have the lowest average test MSE (in the `mean` column), although its performance does not differ much from the 3rd model, and the difference between them is much smaller than either of their standard errors (in the `SE` column).

If we were to use the single-stage cluster sample in `apiclus1` instead, the command would be similar, but we would need to specify clusters instead of strata:

```{r}
# one-stage cluster sample
cv.svy(apiclus1, c("api00~ell", "api00~ell+meals", "api00~ell+meals+mobility"),
       nfolds = 5, clusterID = "dnum", weightsID = "pw", fpcID = "fpc")
```

On the other hand, if we are working with a simple random sample as in `apisrs`, we could just use standard CV instead of complex survey CV to generate the folds. However, complex survey CV may still be useful with an SRS if we want to account for the finite population correction (FPC) when estimating the SEs of our CV MSEs.

```{r}
# simple random sample
cv.svy(apisrs, c("api00~ell", "api00~ell+meals", "api00~ell+meals+mobility"),
       nfolds = 5, fpcID = "fpc")
```

Note that our intent in this vignette is **not** to compare the stratified sample results vs. the cluster sample results vs. the SRS results. We show 3 approaches only to illustrate how the code depends on the sampling design. Each of these samples was taken in a different way, and so it should be analyzed in a different way. In most cases, the data analyst will have only one dataset, and they ought to choose the right form of CV based on that one dataset's sampling design.

(But if we *do* have different datasets taken under different sampling designs from the same population, it's plausible that the "best model" might differ across sampling designs. For instance, a stratified sample will likely have more precision than a cluster sample with the same number of ultimate observation units---so it's possible that the stratified-sample's CV would choose a larger model than the cluster-sample's CV. That is, the stratified sample might have enough precision to fit a larger model without overfitting, while CV might show us that the cluster sample is overfitting when it tries to fit a larger model.)

Finally, we show an example from a different dataset, the 2015-2017 National Survey of Family Growth (NSFG), which uses clusters nested within strata as well as unequal sample weights. In this case we are using 4 folds, not 5 as above, because each stratum has only 4 clusters. (Note: We use `nest = TRUE` only if cluster IDs are nested within strata, i.e., if clusters in different strata might reuse the same names.)

```{r}
# complex sample from NSFG
library(splines)
cv.svy(NSFG_data, c("income ~ ns(age, df = 1)", "income ~ ns(age, df = 2)",
                    "income ~ ns(age, df = 3)", "income ~ ns(age, df = 4)",
                    "income ~ ns(age, df = 5)", "income ~ ns(age, df = 6)"),
       nfolds = 4, strataID = "strata", clusterID = "SECU", nest = TRUE,
       weightsID = "wgt")
```

Here we tend to see that the spline with 3 degrees of freedom fits a little better than the others, but the differences between models are mostly smaller than the SEs.

## Using a survey design object with `cv.svydesign`

When using the `survey` package, we will often create a `svydesign` object in order to calculate means and totals, fit GLMs, etc.
In that case, instead of using `cv.svy()`, it is more convenient to use `cv.svydesign()`, which will read the relevant information out of the `svydesign` object and internally pass it along to `cv.svy()` for us.

We repeat the `api` stratified, cluster, and SRS examples above using this `cv.svydesign()` approach.

```{r}
# stratified sample
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)
cv.svydesign(formulae = c("api00~ell", "api00~ell+meals", "api00~ell+meals+mobility"),
             design_object = dstrat, nfolds = 5)

# one-stage cluster sample
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
cv.svydesign(formulae = c("api00~ell", "api00~ell+meals", "api00~ell+meals+mobility"),
             design_object = dclus1, nfolds = 5)

# simple random sample
dsrs <- svydesign(id = ~1, data = apisrs, fpc = ~fpc)
cv.svydesign(formulae = c("api00~ell", "api00~ell+meals", "api00~ell+meals+mobility"),
             design_object = dsrs, nfolds = 5)
```

Next, we repeat the NSFG example using this approach.

```{r}
NSFG.svydes <- svydesign(id = ~SECU, strata = ~strata, nest = TRUE,
                         weights = ~wgt, data = NSFG_data)
cv.svydesign(formulae = c("income ~ ns(age, df = 1)", "income ~ ns(age, df = 2)",
                          "income ~ ns(age, df = 3)", "income ~ ns(age, df = 4)",
                          "income ~ ns(age, df = 5)", "income ~ ns(age, df = 6)"),
             design_object = NSFG.svydes, nfolds = 4)
```

## Using a survey GLM object with `cv.svyglm`

Finally, if we have already fit a `svyglm` object, we can use `cv.svyglm()`, which will read the formula as well as the survey design information and internally pass them along to `cv.svy()` for us. However, this approach only works for one GLM at a time, instead of comparing several models in a single function call.

We repeat the `api` stratified, cluster, and SRS examples above using this `cv.svyglm()` approach, with the help of the `svydesign` objects already created above.

```{r}
# stratified sample
glmstrat <- svyglm(api00 ~ ell+meals+mobility, design = dstrat)
cv.svyglm(glmstrat, nfolds = 5)

# one-stage cluster sample
glmclus1 <- svyglm(api00 ~ ell+meals+mobility, design = dclus1)
cv.svyglm(glmclus1, nfolds = 5)

# simple random sample
glmsrs <- svyglm(api00 ~ ell+meals+mobility, design = dsrs)
cv.svyglm(glmsrs, nfolds = 5)
```

Next, we repeat the NSFG example using this approach.

```{r}
NSFG.svyglm <- svyglm(income ~ ns(age, df = 3), design = NSFG.svydes)
cv.svyglm(glm_object = NSFG.svyglm, nfolds = 4)
```

## Fitting a logistic model instead of linear

If we have a binary response variable and wish to fit a logistic regression instead, we can use `family = quasibinomial()` in the call to `svyglm()` and feed that directly into `cv.svyglm()`:

```{r}
NSFG.svyglm.logistic <- svyglm(LBW ~ ns(age, df = 3), design = NSFG.svydes,
                               family = quasibinomial())
cv.svyglm(glm_object = NSFG.svyglm.logistic, nfolds = 4)
```

In this case, the `mean` column shows the average of the binary cross-entropy loss $-[y_i \log(\hat y_i) + (1-y_i)\log(1-\hat y_i)]$ (where $\hat y_i$ is the predicted probability that $y_i = 1$) rather than MSE. As with MSE, lower values of this loss indicate better-fitting models.
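
To make this loss concrete, here is a small hand computation of a survey-weighted average of the binary cross-entropy loss for a single test set. This is only an illustrative sketch with made-up toy values for the predicted probabilities `phat`, the 0/1 outcomes `y`, and the survey weights `w`; it is not the package's internal code.

```{r}
# Illustrative sketch only (toy values, not NSFG data):
# survey-weighted average binary cross-entropy for one hypothetical test set
phat <- c(0.10, 0.35, 0.62)  # predicted probabilities of y = 1
y    <- c(0, 1, 1)           # observed binary outcomes
w    <- c(150, 80, 120)      # survey weights
loss <- -(y * log(phat) + (1 - y) * log(1 - phat))
weighted.mean(loss, w)
```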

Or we can set `method = "logistic"` in `cv.svydesign()` or in `cv.svy()`:

```{r}
cv.svydesign(formulae = c("LBW ~ ns(age, df = 1)", "LBW ~ ns(age, df = 2)",
                          "LBW ~ ns(age, df = 3)", "LBW ~ ns(age, df = 4)",
                          "LBW ~ ns(age, df = 5)", "LBW ~ ns(age, df = 6)"),
             design_object = NSFG.svydes, nfolds = 4, method = "logistic")

cv.svy(NSFG_data, c("LBW ~ ns(age, df = 1)", "LBW ~ ns(age, df = 2)",
                    "LBW ~ ns(age, df = 3)", "LBW ~ ns(age, df = 4)",
                    "LBW ~ ns(age, df = 5)", "LBW ~ ns(age, df = 6)"),
       nfolds = 4, strataID = "strata", clusterID = "SECU", nest = TRUE,
       weightsID = "wgt", method = "logistic")
```

These two examples carry out the same CV over the same data and the same models; their results differ slightly only because of randomness in forming the CV folds for each example. In these examples, model 2 or model 3 has the lowest loss, though any differences between models are well within one SE.

# Other models with `folds.svy` and `folds.svydesign`

The function `folds.svy()` generates design-based fold IDs for K-fold CV, using any specified strata and clusters. (Briefly: For a stratified sample, each fold will contain data from each stratum. For a cluster sample, a given cluster's rows will all be assigned to the same fold. See [our *Stat* paper](https://doi.org/10.1002/sta4.454) for details.) Using these fold IDs, you can write your own CV loop for models that our package does not currently handle. These might be other models from the `survey` package (Poisson regression, Cox proportional hazards model, etc.) or from other design-based modeling packages such as [`rpms`](https://cran.r-project.org/package=rpms), [`mase`](https://cran.r-project.org/package=mase), or others in the [CRAN Task View on Official Statistics & Survey Statistics](https://cran.r-project.org/view=OfficialStatistics).

Here is an example of tuning the bin size parameter for a design-based random forest, using the `rpms_forest()` function from the [`rpms`](https://cran.r-project.org/package=rpms) package. (We have also set the forest size to 50 trees instead of the default 500, purely to make the example run faster.) Note that while `folds.svy()` accounts for the clustering in this survey design, we need to do a bit of extra work to ensure that the model training and testing steps also account for the survey design. In this particular example, we must pass the cluster IDs and survey weights to `rpms_forest()` for design-consistent model-fitting, and we must use the survey weights in the MSE calculations.

```{r, eval = FALSE}
# Based on example("rpms"):
# model the mean of retirement account value `IRAX` among households with
# reported retirement account values > 0,
# predicted from householder education, age, and urban/rural location
library(rpms)
data(CE)

# Generate fold IDs that account for clustering in the survey design
# for the IRAX>0 subset of the CE dataset
nfolds <- 5
CEsubset <- CE[which(CE$IRAX > 0), ]
CEsubset$.foldID <- folds.svy(CEsubset, nfolds = nfolds, clusterID = "CID")

# Use CV to tune the bin_size parameter of rpms_forest()
bin_sizes <- c(10, 20, 50, 100, 250, 500)

# Create placeholder for weighted Sums of Squared Errors
SSEs <- rep(0, length(bin_sizes))

for(ff in 1:nfolds) { # For every fold...
  # Use .foldID to split data into training and testing sets
  train <- subset(CEsubset, .foldID != ff)
  test  <- subset(CEsubset, .foldID == ff)

  for(bb in 1:length(bin_sizes)) { # For every value of the tuning parameter...
    # Fit a new model
    rf <- rpms_forest(IRAX ~ EDUCA + AGE + BLS_URBN,
                      data = train,
                      weights = ~FINLWT21, clusters = ~CID,
                      bin_size = bin_sizes[bb], f_size = 50)

    # Get predictions and squared errors
    yhat <- predict(rf, newdata = test)
    res2 <- (yhat - test$IRAX)^2

    # Sum up weighted SSEs, not MSEs yet,
    # b/c cluster sizes may differ across folds and b/c of survey weights
    SSEs[bb] <- SSEs[bb] + sum(res2 * test$FINLWT21)
  }
}

# Divide entire weighted sum by the sum of weights
MSEs <- SSEs / sum(CEsubset$FINLWT21)

# Show results
cbind(bin_sizes, MSEs)
#>      bin_sizes         MSEs
#> [1,]        10 204246617270
#> [2,]        20 202870633392
#> [3,]        50 201393921358
#> [4,]       100 201085838446
#> [5,]       250 201825549231
#> [6,]       500 204155844501
```

Bin size 100 had the lowest survey-weighted CV MSE estimate, although sizes 50 and 250 were quite similar. Using this chosen tuning parameter value, the next step could be to fit a random forest with bin size 100 on the full `CEsubset` dataset.

In this case, `rpms_forest()` does not work with the `survey` package, so there was no need to create a `svydesign` object for this analysis. But if we were working with a different model that expects us to create a `svydesign` object from `CEsubset`, we could have used `folds.svydesign()` instead, which would read out the cluster variable from the `svydesign` object automatically:

```{r, eval = FALSE}
CE.svydes <- svydesign(id = ~CID, weights = ~FINLWT21, data = CEsubset)
# Use update() to add variables to a svydesign object
CE.svydes <- update(CE.svydes,
                    .foldID = folds.svydesign(CE.svydes, nfolds = nfolds))
# Now the fold IDs are available as CE.svydes$variables$.foldID
table(CE.svydes$variables$.foldID)
#>   1   2   3   4   5
#> 813 923 928 885 960
```
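
As a sketch of how those stored fold IDs could drive a custom CV loop for a model that *does* use `svydesign` objects, the outline below subsets the design object by fold and fits a survey-weighted linear model with `svyglm()`. The choice of `svyglm()` and of this formula are only illustrative stand-ins for whatever design-based model you actually want to evaluate; this is not how `cv.svy()` is implemented internally.

```{r, eval = FALSE}
# Sketch of a custom CV loop driven by the fold IDs stored in CE.svydes.
# svyglm() here is just a stand-in for any design-based model of interest.
SSE <- 0
for (ff in 1:nfolds) {
  train_des <- subset(CE.svydes, .foldID != ff)             # training-fold design
  test_df   <- subset(CE.svydes$variables, .foldID == ff)   # test-fold data frame
  fit  <- svyglm(IRAX ~ EDUCA + AGE + BLS_URBN, design = train_des)
  yhat <- as.numeric(predict(fit, newdata = test_df, type = "response"))
  # Accumulate the survey-weighted sum of squared errors across folds
  SSE <- SSE + sum((yhat - test_df$IRAX)^2 * test_df$FINLWT21)
}
# Survey-weighted CV MSE estimate for this model
SSE / sum(CE.svydes$variables$FINLWT21)
```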