---
title: "Getting Started with surveycore"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
bibliography: references.bib
link-citations: true
vignette: >
  %\VignetteIndexEntry{Getting Started with surveycore}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
library(surveycore)
knitr::opts_chunk$set(
  comment = "#>"
)
```

`surveycore` gives you a (mostly) complete workflow for survey data analysis. This vignette gives you a quick overview of the main functionality in the package. It comprises two main sections:

1. Creating survey objects
2. Conducting simple analysis

**Quick PSA before jumping in:** `surveycore` was built as an alternative to `survey` and `srvyr`. However, the code powering the variance estimation and analysis is vendored from the `survey` package, and every estimate `surveycore` produces has been numerically verified to match `survey` output. Without Thomas Lumley's work on that package, `surveycore` would not be possible.

---

## Creating the survey object

The first step in survey analysis is creating the right survey object, where we specify the sampling design, weights, and whatever other information is needed. Without this information, point estimates may be biased and standard errors are almost certainly wrong [@lumley2010; @lohr2022].

Fortunately, we don't have to calculate that uncertainty ourselves! That's what the survey objects are for. They tell the analysis functions how to properly account for the variance and bias introduced by the survey design.

`surveycore` has four survey object constructors:

1. `as_survey()`
2. `as_survey_replicate()`
3. `as_survey_nonprob()`
4. `as_survey_twophase()`

Rather than going into detail on each constructor, I'm just going to provide a quick overview of each. For more information, visit `vignette("creating-survey-objects")`.

### `as_survey()`

You want to use this if you drew a probability sample and your data has cluster IDs, strata, and/or design weights. In this example we'll use the General Social Survey, which has variables for clustering, strata, and design weights.

```{r as_survey}
gss_svy <- as_survey(
  gss_2024,
  ids = vpsu,
  strata = vstrat,
  weights = wtssps
)

gss_svy
```

### `as_survey_replicate()`

Use this when your data comes with pre-built replicate weight columns like `repwt_1`, `repwt_2`. For example, Pew's Jewish American study from 2020 uses replicate weights.

```{r replicate}
pew_jewish_svy <- as_survey_replicate(
  pew_jewish_2020,
  weights = extweight,
  repweights = extweight1:extweight100,
  type = "JK1"
)

pew_jewish_svy
```

### SRS designs with `as_survey()`

When your survey was a pure simple random sample — each respondent had an equal probability of selection, with no clustering and no strata — use `as_survey()` without specifying `ids` or `strata`. Omitting both creates a `survey_taylor` object with no cluster or stratum structure, which is the SRS special case.
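The example below also supplies an `fpc` (finite population correction) column. The FPC matters because, with a 20% sampling fraction, the usual SRS formula overstates uncertainty. Here is the textbook correction in base R — illustrative only, not `surveycore` internals:

```{r fpc-sketch}
# Textbook SRS standard error of a mean, with and without the finite
# population correction (FPC). Illustrative only -- not surveycore internals.
set.seed(1)
N <- 400                           # population size
n <- 80                            # sample size
y <- rnorm(n, mean = 72, sd = 11)  # sampled outcome values

se_srs <- sqrt(var(y) / n)                # ignores the population size
se_fpc <- sqrt((1 - n / N) * var(y) / n)  # applies the FPC

se_fpc / se_srs  # sqrt(1 - 80/400) = sqrt(0.8), about 0.894
```

The ratio depends only on the sampling fraction, which is why supplying `fpc` shrinks every standard error by the same factor.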
Here is a synthetic school district survey to illustrate:

```{r srs}
set.seed(101)

N <- 400 # total schools in district
n <- 80  # schools sampled

school_survey <- data.frame(
  school_id = sample(seq_len(N), n),
  avg_score = round(rnorm(n, mean = 72, sd = 11), 1),
  pct_frpl = round(runif(n, 0.10, 0.85), 2), # % free/reduced price lunch
  enrollment = round(runif(n, 180, 850)),
  sw = N / n, # equal sampling weight = 400/80 = 5.0
  fpc = N     # population size for FPC
)

school_svy <- as_survey(
  school_survey,
  weights = sw, # each sampled school represents 5 schools in the population
  fpc = fpc     # reduces SEs: we sampled 20% of the population
)

school_svy
```

### `as_survey_nonprob()`

Use this when you drew from a non-probability panel (e.g., Qualtrics Panels, Cint/Lucid, Dynata, YouGov) and have created weights designed to make the sample representative of the population you are interested in.

```{r calibrated}
ns_wave1_svy <- as_survey_nonprob(ns_wave1, weights = weight)

ns_wave1_svy
```

### `as_survey_twophase()`

This design is rare, so it's unlikely you have it. But if you took a large sample of a population and then resampled a subset of it, you might have a two-phase design. Common contexts are case-cohort studies, medical validation studies, and surveys with a screening phase. We will use the `nwtco` data from the `survival` package.

```{r nwtco, eval=requireNamespace("survival", quietly=TRUE)}
nwtco <- survival::nwtco

# in.subcohort is stored as 0/1 -- must be logical for as_survey_twophase()
nwtco$in.subcohort <- as.logical(nwtco$in.subcohort)

# Phase 1: all 4,028 enrolled patients (each patient is their own unit)
phase1 <- as_survey(nwtco, ids = seqno)

# Phase 2: subcohort, with Phase 2 sampling stratified by relapse status
nwtco_svy <- as_survey_twophase(
  phase1,
  strata2 = rel,         # Phase 2 strata: cases (rel=1) vs. non-cases (rel=0)
  subset = in.subcohort, # Logical column: TRUE = selected into Phase 2
  method = "full"
)

nwtco_svy
```

As you may have noticed, each survey object has a print method that, like a tibble, shows the first 10 rows of the data, along with a brief description of the survey design.

---

## Conducting Simple Analysis

In addition to creating survey objects, `surveycore` has several functions designed to make analysis easier:

- `get_freqs()`
- `get_means()`
- `get_totals()`
- `get_corr()`
- `get_ratios()`
- `get_quantiles()`

### Frequency tables — `get_freqs()`

`get_freqs()` calculates weighted frequencies (aka proportions). The first argument is the survey design; the second is the variable you want frequencies for. Here's a simple example where we calculate whether or not people are willing to consider voting for Trump.

```{r freqs-basic}
get_freqs(ns_wave1_svy, consider_trump)
```

**Analyzing multiple variables at once**

A key piece of survey research involves select-all-that-apply style questions. For example, the Nationscape data asked people: "We're interested in where you might have heard news about politics in the last week. Please indicate which of the following sources you used." Rather than looking at each one individually, `get_freqs()` allows you to pass in multiple variables (it uses `tidy-select` under the hood to do this). Let's look at an example:

```{r freqs-multi}
get_freqs(ns_wave1_svy, c(news_sources_facebook:news_sources_other))
```

The `name` column identifies which variable each row belongs to; `value` holds the response code. You can also change the names of these columns. For example:

```{r}
ns_wave1_svy |>
  get_freqs(
    c(news_sources_facebook:news_sources_other),
    names_to = "news_source",
    values_to = "choice"
  )
```

### Weighted means — `get_means()`

`get_means()` estimates the survey-weighted mean of a continuous variable.
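For intuition, the point estimate is the familiar weighted mean, $\sum_i w_i y_i / \sum_i w_i$. A toy base-R sketch (the design-based standard error is what `get_means()` adds on top):

```{r weighted-mean-sketch}
# Toy illustration (not surveycore internals): the weighted point estimate
y <- c(3, 5, 8)      # responses
w <- c(1, 1, 2)      # survey weights
sum(w * y) / sum(w)  # 6, identical to weighted.mean(y, w)
```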
```{r means-basic}
# Mean discrimination against blacks
get_means(ns_wave1_svy, discrimination_blacks)
```

### Population totals — `get_totals()`

`get_totals()` estimates the weighted sum: the total count or aggregate for the target population.

It's important to note that the weights typically used in non-probability samples are calibration weights, which are designed to make the sample representative of the population it was drawn from. These weights will not give you accurate population totals. Design weights scaled to the target population, however, will give an estimated population size.

To show the difference, we will compare Pew's Jewish-Americans Study from 2020, whose weights do yield population totals, with the Nationscape data, whose calibrated weights do not.

First, the Nationscape data. Since these are calibration weights, the totals add up to the number of rows (6,422):

```{r}
get_totals(ns_wave1_svy)
```

However, the Pew Jewish-Americans study shows us the estimated total population of Jews in the US:

```{r}
get_totals(pew_jewish_svy)
```

Note how we did not use any variables. That's because we are interested in the sample's population size. If we are instead interested in the estimated total income or age, we can specify that in the `x` argument. If we want to know how many estimated people selected a specific response option, we use the `group` argument. For example, to estimate how many Jews are in each age category:

```{r}
get_totals(pew_jewish_svy, group = age4cat)
```

---

### Weighted correlations — `get_corr()`

`get_corr()` estimates survey-weighted Pearson correlations between two or more continuous variables. Confidence intervals use the Fisher Z transformation, guaranteeing bounds in (−1, 1). Let's look at approval for Trump and Biden.
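Before that, here is a minimal sketch of the Fisher Z construction itself. The classical unweighted standard error `1/sqrt(n - 3)` is an assumption used only for illustration; `get_corr()` uses a design-based SE instead:

```{r fisher-z-sketch}
# Sketch of a Fisher Z interval. The SE shown here is the classical
# unweighted 1/sqrt(n - 3); the package's design-based SE differs.
r <- 0.62
n <- 500
z  <- atanh(r)                                # Fisher Z transform
se <- 1 / sqrt(n - 3)                         # assumed classical SE
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)  # back-transform to r scale
ci  # bounds are guaranteed to stay inside (-1, 1)
```

Because `tanh()` maps the whole real line into (−1, 1), the back-transformed bounds can never escape the valid correlation range.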
First we clean the underlying data frame — dropping rows with missing values and removing "Not sure" responses (coded `999`) — then rebuild the survey object.

```{r corr-basic}
ns_wave1_clean <- ns_wave1 |>
  dplyr::filter(
    !is.na(cand_favorability_trump),
    !is.na(cand_favorability_biden),
    cand_favorability_trump != 999,
    cand_favorability_biden != 999
  )

ns_wave1_clean_svy <- as_survey_nonprob(ns_wave1_clean, weights = weight)

get_corr(
  ns_wave1_clean_svy,
  c(cand_favorability_trump, cand_favorability_biden)
)
```

Next, let's look at favorability across multiple variables.

```{r corr-multi}
fav_vars <- c(
  "cand_favorability_trump",
  "cand_favorability_biden",
  "cand_favorability_harris",
  "cand_favorability_sanders",
  "cand_favorability_warren",
  "cand_favorability_buttigieg",
  "cand_favorability_pence"
)

ns_wave1_multi <- ns_wave1 |>
  dplyr::filter(
    dplyr::if_all(dplyr::all_of(fav_vars), ~ !is.na(.x) & .x != 999)
  )

ns_wave1_multi_svy <- as_survey_nonprob(ns_wave1_multi, weights = weight)

get_corr(
  ns_wave1_multi_svy,
  c(cand_favorability_trump:cand_favorability_pence)
)
```

The output defaults to a long format where each row is a unique variable pair. It shows the correlation in `r`, the confidence intervals, p-values, and other relevant information. Switch to wide format for a more familiar correlation-matrix layout:

```{r corr-wide}
get_corr(
  ns_wave1_multi_svy,
  c(cand_favorability_trump:cand_favorability_pence),
  format = "wide"
)
```

---

### Ratio estimation — `get_ratios()`

`get_ratios()` estimates the ratio of two weighted totals:

$$\hat{R} = \frac{\sum_i w_i \, y_i}{\sum_i w_i \, x_i}$$

Variance is estimated via the delta method (linearization), equivalent to `survey::svyratio()` [@lumley2010]. Ratios are useful when you want an estimate that is invariant to the scale of the weights — for example, wages per hour, spending per household member, or disease prevalence ratios. A good example uses the Nationscape data.
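Before turning to it, note that the point estimate by itself is nothing more than two weighted sums. A toy base-R sketch (`get_ratios()` adds the delta-method SE on top):

```{r ratio-sketch}
# Toy illustration of R-hat = sum(w * y) / sum(w * x)
w <- c(2, 1, 1)     # survey weights
y <- c(10, 20, 30)  # numerator variable
x <- c(5, 10, 15)   # denominator variable
sum(w * y) / sum(w * x)  # 70 / 35 = 2
```

Multiplying every weight by a constant leaves this ratio unchanged, which is the scale-invariance property mentioned above.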
The favorability scale runs from 1 (Very favorable) to 4 (Very unfavorable), so we can estimate the ratio of Trump's aggregate favorability score to Biden's. A ratio less than 1 means Trump's aggregate score is lower — i.e., he is viewed more favorably on average; a ratio greater than 1 means Biden is viewed more favorably.

```{r ratios-basic}
get_ratios(
  ns_wave1_clean_svy,
  numerator = cand_favorability_trump,
  denominator = cand_favorability_biden
)
```

---

### `get_quantiles()`

`get_quantiles()` estimates survey-weighted quantiles using the Woodruff (1952) confidence interval method. Confidence intervals are derived by inverting the weighted CDF rather than assuming normality, so they are generally **asymmetric** around the estimate and always respect the range of the data. By default, it calculates the 25th, 50th, and 75th percentiles.

```{r quantiles-basic}
# Quartiles and median of age (default probs = c(0.25, 0.5, 0.75))
get_quantiles(ns_wave1_svy, age)
```

#### Choosing quantiles

Pass any numeric vector to `probs`. For the median alone:

```{r quantiles-median}
get_quantiles(ns_wave1_svy, age, probs = 0.5)
```

For deciles of age:

```{r quantiles-deciles}
get_quantiles(ns_wave1_svy, age, probs = seq(0.1, 0.9, 0.1))
```

### Subgroup analysis — the `group` argument

Every analysis function accepts a `group` argument for computing estimates separately within levels of a categorical variable. Pass a bare column name, or multiple names using `c()`. For example, we'll look at Trump consideration by party identification.

```{r group-means}
get_freqs(ns_wave1_svy, consider_trump, group = pid3)
```

Rows where the grouping variable is `NA` are excluded from all groups and do not appear in the output. Responses within each group sum to 100% for `get_freqs()`.

---

### Controlling uncertainty output

All analysis functions share a common `variance` argument.
You can request any combination of:

| Code     | What it returns                                        |
|----------|--------------------------------------------------------|
| `"se"`   | Standard error                                         |
| `"ci"`   | Confidence interval: `ci_low`, `ci_high`               |
| `"var"`  | Variance (square of the SE)                            |
| `"cv"`   | Coefficient of variation (SE / estimate)               |
| `"moe"`  | Margin of error at `conf_level`                        |
| `"deff"` | Design effect (complex design variance / SRS variance) |

The `conf_level` argument controls the confidence level for `"ci"` and `"moe"`. The default is `0.95`; for a 90% CI:

```{r variance-options}
get_means(
  ns_wave1_svy,
  age,
  variance = c("se", "ci", "moe"),
  conf_level = 0.9
)
```

Set `variance = NULL` to suppress all uncertainty columns and return point estimates and sample counts only.

### Estimated population counts

Add `n_weighted = TRUE` to any function to include the estimated population count — the sum of weights — alongside the unweighted sample count `n`.

Let's look at Pew's Jewish-Americans data again. Using `get_freqs()` with `n_weighted = TRUE`, we can see proportions and estimated populations. In this example we're looking at the proportion (`pct`), the number of people from the sample (`n`), and the estimated population size (`n_weighted`) of Jewish Americans in each age category.
```{r n-weighted}
get_freqs(pew_jewish_svy, age4cat, n_weighted = TRUE)
```

---

## Summary

| Function          | Use for                                                     |
|-------------------|-------------------------------------------------------------|
| `get_freqs()`     | Categorical variables — weighted distributions, percentages |
| `get_means()`     | Continuous variables — weighted means                       |
| `get_totals()`    | Population counts or aggregates — weighted sums             |
| `get_ratios()`    | Ratios of two weighted totals                               |
| `get_corr()`      | Pairwise Pearson correlations                               |
| `get_quantiles()` | Weighted quantiles and median — Woodruff CIs                |

All functions:

- Return a tibble subclass ready for further analysis or display
- Accept a `group` argument for subgroup estimates
- Accept a `variance` argument to control which uncertainty columns appear
- Handle all survey design classes: `survey_taylor`, `survey_replicate`, `survey_twophase`, and `survey_nonprob`
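One closing note on the `variance` options above: under a normal approximation, the `"moe"` column is just the standard error scaled by the quantile implied by `conf_level`. A base-R sketch (the use of `qnorm()` here is an assumption about the internals; the package may use design degrees of freedom instead):

```{r moe-sketch}
# Normal-approximation margin of error from an SE. Illustrative only.
se <- 0.8
conf_level <- 0.9
qnorm(1 - (1 - conf_level) / 2) * se  # z_{0.95} * se, about 1.316
```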