---
title: "Introduction to aiDIF: Detecting Differential Item Functioning in AI-Scored Assessments"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to aiDIF}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(aiDIF)
```

## Background

When AI systems score essays, short-answer responses, or structured tasks, a critical fairness question arises: does the AI scoring engine shift item difficulties **differently** for different demographic groups?

Classical DIF methods test whether an item performs differently across groups *within* a single scoring condition. `aiDIF` extends this to a paired design:

1. **Human-scoring DIF** — robust M-estimation of item-level bias
2. **AI-scoring DIF** — the same analysis applied to AI-scored data
3. **Differential AI Scoring Bias (DASB)** — a new test for *group-dependent* parameter shifts from human to AI scoring

## The Example Dataset

`make_aidif_eg()` returns a built-in example with item parameter MLEs for 6 items in two groups under both scoring conditions. The planted structure is:

- **Item 1**: DIF in human scoring (intercept +0.5 in the focal group)
- **Item 3**: DASB — AI scoring adds +0.4 to the focal group intercept only
- **Impact**: 0.5 SD (focal group higher on the latent trait)
- **AI drift**: +0.1 uniform calibration offset across all items

```{r data}
eg <- make_aidif_eg()
str(eg, max.level = 2)
```

## Fitting the Model

`fit_aidif()` runs the robust IRLS engine under each scoring condition and performs the DASB test.

```{r fit}
mod <- fit_aidif(
  human_mle = eg$human,
  ai_mle = eg$ai,
  alpha = 0.05
)
print(mod)
```

## Full Report

```{r summary}
summary(mod)
```

## The DASB Test

`scoring_bias_test()` can also be called directly.

```{r dasb}
sb <- scoring_bias_test(eg$human, eg$ai)
print(sb)
```

Item 3 should be significant, reflecting the planted group-dependent AI scoring bias.
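The contrast underlying the DASB test can be illustrated with toy numbers. The sketch below is purely conceptual and is **not** the `aiDIF` implementation: the intercept values, the +0.1 uniform drift, and the +0.4 focal-only shift on item 3 are invented to mirror the planted structure of the example dataset. A uniform calibration drift moves both groups equally and cancels out of the between-group contrast; only a group-dependent shift survives.

```r
# Toy item intercepts for 3 items, reference (R) and focal (F) groups,
# under human scoring. Values are illustrative only.
human_R <- c(0.0, 0.2, -0.1)
human_F <- c(0.0, 0.2, -0.1)

# AI scoring: +0.1 uniform drift for everyone, plus a focal-group-only
# +0.4 shift on item 3 (the planted DASB effect).
ai_R <- human_R + 0.1
ai_F <- human_F + 0.1 + c(0, 0, 0.4)

shift_R <- ai_R - human_R   # human-to-AI shift, reference group
shift_F <- ai_F - human_F   # human-to-AI shift, focal group

# The group-dependent component: uniform drift cancels, DASB remains.
dasb <- shift_F - shift_R
dasb
#> [1] 0.0 0.0 0.4
```

Only item 3 shows a nonzero contrast, which is the pattern `scoring_bias_test()` is designed to flag.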
## AI-Effect Classification

```{r effect}
eff <- ai_effect_summary(mod$dif_human, mod$dif_ai)
print(eff)
```

| Status | Meaning |
|---|---|
| `introduced` | AI scoring creates DIF not present under human scoring |
| `masked` | AI scoring hides DIF that existed under human scoring |
| `stable_dif` | DIF detected in both conditions |
| `stable_clean` | No DIF in either condition |

## Visualisations

```{r plots, fig.width=7, fig.height=5, eval=FALSE}
plot(mod, type = "dif_forest")  # human vs AI DIF side by side
plot(mod, type = "dasb")        # DASB bar chart with error bars
plot(mod, type = "weights")     # bi-square anchor weights
```

## Simulation

```{r sim}
dat <- simulate_aidif_data(
  n_items = 8,
  n_obs = 600,
  dif_items = c(1, 2),
  dif_mag = 0.5,
  dasb_items = 5,
  dasb_mag = 0.4,
  seed = 123
)
sim_mod <- fit_aidif(dat$human, dat$ai)
print(sim_mod)
```

## References

- Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In *Test validity* (pp. 129–145). Erlbaum.
- Halpin, P., Nickodem, K., & Eagle, J. (2024). *robustDIF: Differential Item Functioning Using Robust Scaling*. R package version 0.2.0.