Load required packages.
We use the full example Census and Customer Information System (CIS) datasets from McLeod et al. (2011). The goal is to link records from CIS to records from Census.
data("census", package = "automatedRecLin")
data("cis", package = "automatedRecLin")
setDT(census)
setDT(cis)
NROW(cis)
#> [1] 24613
NROW(census)
#> [1] 25343The person_id variable identifies the correct linkage.
We use this information only to evaluate the result.
We compare forename and surname using the Jaro-Winkler distance. These two comparison variables are modeled with the continuous parametric MEC method. Sex and date-of-birth variables use the default binary method. Address fields are used only to construct blocks.
variables <- c(
"pername1", "pername2", "sex",
"dob_day", "dob_mon", "dob_year"
)
comparators <- list(
"pername1" = jarowinkler_complement(),
"pername2" = jarowinkler_complement()
)
methods <- list(
"pername1" = "continuous_parametric",
"pername2" = "continuous_parametric"
)
blocking_variables <- c(variables, "enumcap", "enumpc")Run blocked MEC. The model is trained on all candidate pairs retained by blocking.
set.seed(1)
result <- mec_blocking(
A = cis,
B = census,
variables = variables,
comparators = comparators,
methods = methods,
blocking_variables = blocking_variables,
blocking_sep = " ",
controls_blocking = list(seed = 1, n_threads = 1),
alpha = 0.5,
true_matches = true_matches
)
result
#> Blocked MEC record linkage based on:
#> pername1, pername2, sex, dob_day, dob_mon, dob_year.
#> ========================================================
#> The algorithm predicted 23700 matches.
#> The first 6 predicted matches are:
#> a b block ratio / 1000
#> <int> <int> <num> <num>
#> 1: 12264 18361 17173 3.317857e-12
#> 2: 23367 13031 12223 3.317857e-12
#> 3: 23495 15194 14243 3.317857e-12
#> 4: 1768 12279 11529 6.944914e-12
#> 5: 343 18657 17447 7.957520e-12
#> 6: 2124 5497 5152 7.957520e-12
#> ========================================================
#> ========================================================
#> Blocking diagnostics:
#> Known matches: 24043.
#> Known matches retained by blocking: 23688.
#> Known matches missed by blocking: 355.
#> Blocking MMR: 1.4765 %.
#> Candidate pairs retained: 25343 of 623767259.
#> Candidate pair reduction: 99.9959 %.
#> ========================================================
#> Evaluation metrics:
#> FLR (%) MMR (%)
#> 0.0591 1.4848The full Cartesian product contains 623,767,259 record pairs. Blocking reduces this to 25,343 candidate pairs, while retaining 98.52% of known links. The final linkage set contains 23,700 predicted matches.
| step | result |
|---|---|
| Training | all_candidate_pairs on 23,726 blocks |
| Blocking | 23,688 of 24,043 known links retained |
| Linkage | FLR = 0.06%; MMR = 1.48% |