--- title: "Roy Model: Sector Choice with a Latent Ability Factor" author: "Greg Veramendi" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Roy Model: Sector Choice with a Latent Ability Factor} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Overview In a Roy model, individuals choose the sector that offers the highest utility, and only their sector-specific wage is observed. An unobserved ability factor affects both test scores and potential wages, and the choice is driven by the implied gain. **factorana** can jointly estimate: 1. A **measurement system** for ability (multiple test scores). 2. **Potential outcomes** in each sector (wages, partially observed). 3. A **discrete choice** of sector (probit). All three components share the same latent ability factor, and the model is identified because test scores contain information about ability independent of sector choice and realised wages. The methodology is described in Heckman, Humphries and Veramendi (2016) [doi:10.1016/j.jeconom.2015.12.001](https://doi.org/10.1016/j.jeconom.2015.12.001) and (2018) [doi:10.1086/698760](https://doi.org/10.1086/698760). ## Simulate data ```{r} library(factorana) set.seed(108) n <- 500 # Observed covariates and the unobserved ability factor x1 <- rnorm(n) # shifts wages x2 <- rnorm(n) # shifts wages and sector choice f <- rnorm(n, sd = 1) # latent ability (unobserved in practice) # Test scores measure ability with error (shared scale across tests) T1 <- 2.0 + 1.0 * f + rnorm(n, 0, 0.5) T2 <- 1.5 + 1.2 * f + rnorm(n, 0, 0.6) T3 <- 1.0 + 0.8 * f + rnorm(n, 0, 0.4) # Potential wages in each sector wage0 <- 2.0 + 0.5 * x1 + 0.3 * x2 + 0.5 * f + rnorm(n, 0, 0.6) wage1 <- 2.5 + 0.6 * x1 + 1.0 * f + rnorm(n, 0, 0.7) # Sector choice: higher ability -> more likely to pick sector 1 z_sector <- 0.0 + 0.4 * x2 + 0.8 * f sector <- as.numeric(runif(n) < pnorm(z_sector)) # Only the wage in the chosen sector is observed wage <- ifelse(sector == 1, wage1, wage0) dat <- data.frame( intercept = 1, x1 = x1, x2 = x2, T1 = T1, T2 = T2, T3 = T3, wage = wage, sector = sector, eval_tests = 1L, # always observe all three tests eval_wage0 = 1L - sector, # wage0 observed iff sector = 0 eval_wage1 = sector, # wage1 observed iff sector = 1 eval_sector = 1L # sector choice always observed ) ``` ## Model specification Only one latent factor (ability). Every component loads on that factor, so `loading_normalization` is a scalar here. ```{r} fm <- define_factor_model(n_factors = 1, n_types = 1) ``` ### Measurement equations (tests) Fix T1's loading at 1.0 to pin the factor scale; leave T2 and T3 free. ```{r} mc_T1 <- define_model_component( name = "T1", data = dat, outcome = "T1", factor = fm, covariates = "intercept", model_type = "linear", loading_normalization = 1.0, evaluation_indicator = "eval_tests" ) mc_T2 <- define_model_component( name = "T2", data = dat, outcome = "T2", factor = fm, covariates = "intercept", model_type = "linear", loading_normalization = NA_real_, evaluation_indicator = "eval_tests" ) mc_T3 <- define_model_component( name = "T3", data = dat, outcome = "T3", factor = fm, covariates = "intercept", model_type = "linear", loading_normalization = NA_real_, evaluation_indicator = "eval_tests" ) ``` ### Potential outcomes (wages) Each wage equation is a linear model evaluated only on the subsample that chose that sector. `evaluation_indicator` tells **factorana** which rows contribute to which component's likelihood. ```{r} mc_wage0 <- define_model_component( name = "wage0", data = dat, outcome = "wage", factor = fm, covariates = c("intercept", "x1", "x2"), model_type = "linear", loading_normalization = NA_real_, evaluation_indicator = "eval_wage0" ) mc_wage1 <- define_model_component( name = "wage1", data = dat, outcome = "wage", factor = fm, covariates = c("intercept", "x1"), model_type = "linear", loading_normalization = NA_real_, evaluation_indicator = "eval_wage1" ) ``` ### Sector-choice equation A probit with a free factor loading (positive ability raises the probability of choosing sector 1 when the simulated value is positive). ```{r} mc_sector <- define_model_component( name = "sector", data = dat, outcome = "sector", factor = fm, covariates = c("intercept", "x2"), model_type = "probit", loading_normalization = NA_real_, evaluation_indicator = "eval_sector" ) ``` ### Assemble the system ```{r} ms <- define_model_system( components = list(mc_T1, mc_T2, mc_T3, mc_wage0, mc_wage1, mc_sector), factor = fm ) ``` ## Estimation ```{r} ctrl <- define_estimation_control(n_quad_points = 8, num_cores = 1) fit <- estimate_model_rcpp( model_system = ms, data = dat, control = ctrl, optimizer = "nlminb", parallel = FALSE, verbose = FALSE ) fit$convergence # 0 == strict convergence fit$loglik ``` ## Results ```{r} components_table(fit, digits = 3) ``` Estimated loadings on the ability factor track the simulated values: the test equations recover (1.0, 1.2, 0.8); the wage equations recover roughly 0.5 in sector 0 and 1.0 in sector 1, capturing the positive selection on ability into the high-return sector. ## Recover factor scores Given the fitted model, we can back out each individual's posterior mean ability. ```{r} fscores <- estimate_factorscores_rcpp( fit, dat, control = ctrl, parallel = FALSE, verbose = FALSE ) head(fscores[, c("obs_id", "factor_1", "se_factor_1", "converged")]) # Correlation of estimated scores with the true (simulated) ability cor(fscores$factor_1, f) ``` ## Two-stage estimation (optional) For larger applications it is often convenient to estimate the measurement system first, then freeze those parameters and estimate the structural part. See `?estimate_model_rcpp` (argument `previous_stage` in `define_model_system()`) and `?set_adaptive_quadrature_cpp` for the adaptive integration that makes the second stage cheap.