---
title: "sae.projection: A Model-Assisted Projection Estimator for Combining Independent Surveys"
author: "Ridson Al Farizal P (ridsonap@bps.go.id)"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Model-Assisted Projection Estimator}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Introduction

The `ma_projection()` function implements a model-assisted projection estimator for combining information from two independent surveys. This method is especially useful in survey sampling scenarios where:

- **Survey 1** contains a large sample with only auxiliary variables.
- **Survey 2** contains a smaller sample with both the outcome and auxiliary variables.

This vignette illustrates how to use `ma_projection()` for domain-level estimation using various supervised learning models, including machine learning techniques via the `parsnip` interface.

## Method Overview

The approach follows the work of Kim & Rao (2012), where a working model is trained on Survey 2 to predict the outcome variable. Predictions are made for the auxiliary-only Survey 1 data. These predictions are then aggregated by domain to generate small area estimates.
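
The three steps described above (fit on Survey 2, predict on Survey 1, aggregate by domain) can be sketched in base R. This is a toy illustration on simulated data, not the internals of `ma_projection()`: the data frames `svy1` and `svy2`, their columns, and the use of `lm()` as the working model are all assumptions made for the example, and the sketch ignores variance estimation and the cluster and strata structure of a real survey design.

```r
# Toy illustration of the fit / predict / aggregate steps (simulated data only).
set.seed(1)

# Hypothetical Survey 2: small sample with outcome y and auxiliary variable x
svy2 <- data.frame(x = rnorm(200))
svy2$y <- 2 + 3 * svy2$x + rnorm(200)

# Hypothetical Survey 1: large sample with auxiliary x, weight w, domain d,
# and no outcome variable
svy1 <- data.frame(
  x = rnorm(2000),
  w = runif(2000, 50, 150),
  d = sample(c("A", "B", "C"), 2000, replace = TRUE)
)

# Step 1: fit the working model on Survey 2
fit <- lm(y ~ x, data = svy2)

# Step 2: predict the outcome for every unit in Survey 1
svy1$y_hat <- predict(fit, newdata = svy1)

# Step 3: weighted mean of the predictions within each domain
num <- tapply(svy1$w * svy1$y_hat, svy1$d, sum)
den <- tapply(svy1$w, svy1$d, sum)
proj_est <- num / den
proj_est
```

Swapping `lm()` for another learner changes only the fitting and prediction steps; the weighted aggregation by domain stays the same.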
## Required Packages

```r
library(sae.projection)
library(dplyr)
library(tidymodels)
library(bonsai) # for modern tree-based models
```

## Example: Income Estimation Using Linear Regression

```r
# Filter non-missing values for income
svy22_income <- df_svy22 %>% filter(!is.na(income))
svy23_income <- df_svy23 %>% filter(!is.na(income))

# Fit projection model
lm_result <- ma_projection(
  income ~ age + sex + edu + disability,
  cluster_ids = "PSU",
  weight = "WEIGHT",
  strata = "STRATA",
  domain = c("PROV", "REGENCY"),
  working_model = linear_reg(),
  data_model = svy22_income,
  data_proj = svy23_income,
  nest = TRUE
)

# View results
head(lm_result$df_result)
```

## Example: Binary Outcome Using Logistic Regression

```r
# Filter youth population for NEET classification
svy22_neet <- df_svy22 %>% filter(between(age, 15, 24))
svy23_neet <- df_svy23 %>% filter(between(age, 15, 24))

# Fit logistic regression model
lr_result <- ma_projection(
  formula = neet ~ sex + edu + disability,
  cluster_ids = ~ PSU,
  weight = ~ WEIGHT,
  strata = ~ STRATA,
  domain = ~ PROV + REGENCY,
  working_model = logistic_reg(),
  data_model = svy22_neet,
  data_proj = svy23_neet,
  nest = TRUE
)

# View results
head(lr_result$df_result)
```

## Example: LightGBM with Hyperparameter Tuning

```r
# Define LightGBM model with tuning
lgbm_model <- boost_tree(
  mtry = tune(),
  trees = tune(),
  min_n = tune(),
  tree_depth = tune(),
  learn_rate = tune(),
  engine = "lightgbm"
)

# Fit with cross-validation
lgbm_result <- ma_projection(
  formula = neet ~ sex + edu + disability,
  cluster_ids = "PSU",
  weight = "WEIGHT",
  strata = "STRATA",
  domain = c("PROV", "REGENCY"),
  working_model = lgbm_model,
  data_model = svy22_neet,
  data_proj = svy23_neet,
  cv_folds = 3,
  tuning_grid = 5,
  nest = TRUE
)

# View results
head(lgbm_result$df_result)
```

## Supported Models

`ma_projection()` supports many working models using the `parsnip` interface, including:

- `linear_reg()`, `logistic_reg()` (also with Stan engine)
- `poisson_reg()`, `mlp()`,
  `naive_bayes()`, `nearest_neighbor()`
- Tree-based: `decision_tree()`, `bag_tree()`, `boost_tree()` with LightGBM/XGBoost, `rand_forest()` (ranger, aorsf), `bart()`
- SVM: `svm_linear()`, `svm_poly()`, `svm_rbf()`

## References

Kim, J. K., & Rao, J. N. K. (2012). Combining data from two independent surveys: a model-assisted approach. *Biometrika*, 99(1), 85–100. [doi:10.1093/biomet/asr063](https://doi.org/10.1093/biomet/asr063)

## Conclusion

`ma_projection()` provides a flexible way to combine data from two independent surveys, using any of the working models listed above through the `parsnip` interface. Typical applications include domain-level estimates of socioeconomic and health indicators.