4. Latent and Mixed-Scale Correlation

Scope

This vignette covers the latent-correlation estimators used when the observed data are binary, ordinal, or mixed. These methods do not target the same quantity as an ordinary Pearson correlation on coded categories. They are designed for settings where the observed variables are treated as thresholded versions of latent continuous variables.

The relevant functions are:

tetrachoric()
polychoric()
polyserial()
biserial()

A small latent-data example

library(matrixCorr)

set.seed(30)
n <- 500
Sigma <- matrix(c(
  1.00, 0.55, 0.35, 0.20,
  0.55, 1.00, 0.40, 0.30,
  0.35, 0.40, 1.00, 0.45,
  0.20, 0.30, 0.45, 1.00
), 4, 4, byrow = TRUE)

Z <- matrix(rnorm(n * 4), n, 4) %*% chol(Sigma)

X_bin <- data.frame(
  b1 = Z[, 1] > qnorm(0.70),
  b2 = Z[, 2] > qnorm(0.55),
  b3 = Z[, 3] > qnorm(0.50)
)

X_ord <- data.frame(
  o1 = ordered(cut(Z[, 2], breaks = c(-Inf, -0.5, 0.4, Inf),
    labels = c("low", "mid", "high")
  )),
  o2 = ordered(cut(Z[, 3], breaks = c(-Inf, -1, 0, 1, Inf),
    labels = c("1", "2", "3", "4")
  ))
)

X_cont <- data.frame(x1 = Z[, 1], x2 = Z[, 4])

Binary-binary and ordinal-ordinal settings

tetrachoric() is used for binary variables. polychoric() is used for ordered categorical variables.

fit_tet <- tetrachoric(X_bin, ci = TRUE, p_value = TRUE)
fit_pol <- polychoric(X_ord, ci = TRUE, p_value = TRUE)

print(fit_tet, digits = 2)
#> Tetrachoric correlation matrix
#>   method      : tetrachoric
#>   dimensions  : 3 x 3
#>   ci          : yes
#>   p_values    : yes
#> 
#>      b1   b2   b3
#> b1 1.00 0.62 0.42
#> b2 0.62 1.00 0.48
#> b3 0.42 0.48 1.00
summary(fit_pol)
#> Polychoric correlation summary
#>   output      : matrix
#>   dimensions  : 2 x 2
#>   retained_pairs: 3
#>   threshold   : 0.0000
#>   diag        : included
#>   estimate    : 0.4757 to 1.0000
#>   thresholds  : 2 set(s)
#>   inference   : wald_z_polychoric
#>   ci          : 95%
#>   ci_method   : wald_information_polychoric
#>   ci_width    : 0.164
#> 
#>  item1 item2 estimate n_complete lwr    upr    statistic df    p_value fisher_z
#>  o1    o2    0.4757   500        0.3940 0.5575 11.4008      NA 0.0000  0.5175  
#> 
#> Strongest pairs by |estimate|
#> 
#>  item1 item2 estimate  n_complete lwr       upr       statistic df p_value     
#>  o1    o2    0.4757396 500        0.3939531 0.5575261 11.40081  NA 4.142299e-30
#>  fisher_z 
#>  0.5174631

These estimators assume a latent-normal threshold model. That assumption should be stated whenever the results are reported, because the interpretation is not simply “correlation between coded categories.”

It is often useful to compare that latent estimate with a naive Pearson correlation computed on coded categories.

fit_bin_naive <- pearson_corr(data.frame(lapply(X_bin[, 1:2], as.numeric)))
fit_ord_naive <- pearson_corr(data.frame(lapply(X_ord, as.numeric)))

round(c(
  b1_b2_pearson = fit_bin_naive[1, 2],
  b1_b2_tetrachoric = fit_tet[1, 2],
  o1_o2_pearson = fit_ord_naive[1, 2],
  o1_o2_polychoric = fit_pol[1, 2]
), 2)
#>     b1_b2_pearson b1_b2_tetrachoric     o1_o2_pearson  o1_o2_polychoric 
#>              0.40              0.62              0.40              0.48

Those numbers need not agree. The latent estimators target the association between the underlying continuous variables, not the correlation between arbitrarily coded categories.

Mixed continuous-discrete settings

polyserial() is used when one variable is continuous and the other is ordinal. biserial() is used when one variable is continuous and the other is binary.

fit_ps <- polyserial(X_cont, X_ord, ci = TRUE, p_value = TRUE)
fit_bis <- biserial(X_cont, X_bin[, 1:2], ci = TRUE, p_value = TRUE)

summary(fit_ps)
#> Polyserial correlation summary
#>   output      : matrix
#>   dimensions  : 2 x 2
#>   retained_pairs: 4
#>   threshold   : 0.0000
#>   diag        : included
#>   estimate    : 0.3282 to 0.6043
#>   inference   : wald_z_polyserial
#>   ci          : 95%
#>   ci_method   : wald_information_polyserial
#>   ci_width    : 0.123 to 0.175
#> 
#>  item1 item2 estimate n_complete lwr    upr    statistic df    p_value fisher_z
#>  x1    o1    0.6043   500        0.5426 0.6660 19.2066      NA 0.0000  0.6999  
#>  x2    o2    0.4654   500        0.3943 0.5365 12.8327      NA 0.0000  0.5042  
#>  x1    o2    0.4183   500        0.3432 0.4934 10.9164      NA 0.0000  0.4457  
#>  x2    o1    0.3282   500        0.2407 0.4158 7.3485       NA 0.0000  0.3408  
#> 
#> Strongest pairs by |estimate|
#> 
#>  item1 item2 estimate  n_complete lwr       upr       statistic df p_value     
#>  x1    o1    0.6042923 500        0.5426265 0.6659580 19.206632 NA 3.257254e-82
#>  x2    o2    0.4654365 500        0.3943497 0.5365233 12.832747 NA 1.074842e-37
#>  x1    o2    0.4183313 500        0.3432226 0.4934399 10.916370 NA 9.626728e-28
#>  x2    o1    0.3282230 500        0.2406806 0.4157654  7.348499 NA 2.004446e-13
#>  fisher_z 
#>  0.6998810
#>  0.5042289
#>  0.4456676
#>  0.3408354
summary(fit_bis)
#> Biserial correlation summary
#>   output      : matrix
#>   dimensions  : 2 x 2
#>   retained_pairs: 4
#>   threshold   : 0.0000
#>   diag        : included
#>   estimate    : 0.2120 to 1.0000
#>   inference   : biserial_t_test
#>   ci          : 95%
#>   ci_method   : fisher_z_biserial
#>   ci_width    : 0.000 to 0.168
#> 
#>  item1 item2 estimate n_complete lwr    upr    statistic df       p_value
#>  x1    b1    1.0000   500        1.0000 1.0000   Inf     498.0000 0.0000 
#>  x1    b2    0.6024   500        0.5435 0.6555 16.8435   498.0000 0.0000 
#>  x2    b2    0.3191   500        0.2381 0.3957 7.5146    498.0000 0.0000 
#>  x2    b1    0.2120   500        0.1267 0.2942 4.8409    498.0000 0.0000 
#>  fisher_z
#>     NA   
#>  0.6970  
#>  0.3307  
#>  0.2153  
#> 
#> Strongest pairs by |estimate|
#> 
#>  item1 item2 estimate  n_complete lwr       upr       statistic df 
#>  x1    b1    1.0000000 500        1.0000000 1.0000000       Inf 498
#>  x1    b2    0.6024366 500        0.5434557 0.6554985 16.843500 498
#>  x2    b2    0.3191279 500        0.2381004 0.3957438  7.514554 498
#>  x2    b1    0.2119967 500        0.1266607 0.2942177  4.840932 498
#>  p_value      fisher_z 
#>  0.000000e+00        NA
#>  1.033521e-50 0.6969630
#>  2.670252e-13 0.3306758
#>  1.727016e-06 0.2152610

Confidence intervals and p-values

These functions now follow the same user-facing pattern as the rest of the package:

estimates are returned by default;
confidence intervals are added only when ci = TRUE;
p-values are added only when p_value = TRUE where supported.

The important point is that inference is tied to the fitted latent model rather than to an ordinary Pearson-correlation formula applied to coded categories.

Practical guidance

These estimators are appropriate when the scientific question is explicitly about latent association under a threshold model.

Use tetrachoric() for binary-binary pairs.
Use polychoric() for ordinal-ordinal pairs.
Use polyserial() for continuous-ordinal pairs.
Use biserial() for continuous-binary pairs.

If the variables are nominal rather than ordered, these latent-correlation functions are not the right tools.