3. Robust and High-Dimensional Correlation

Scope

This vignette covers the estimators used when ordinary wide-data correlation is not enough. The package provides two broad extensions: robust estimators that limit the influence of outliers, and estimators for conditional and high-dimensional structure.

The relevant functions are bicor(), pbcor(), wincor(), and skipped_corr() for robust correlation; pcorr() for partial correlation; and shrinkage_corr() for high-dimensional shrinkage correlation.

Robust correlation

Outliers can distort ordinary correlation severely, especially in moderate sample sizes. The robust estimators in matrixCorr are designed to limit that effect, but they do so in different ways.

library(matrixCorr)

set.seed(20)
Y <- data.frame(
  x1 = rnorm(60),
  x2 = rnorm(60),
  x3 = rnorm(60),
  x4 = rnorm(60)
)

idx <- sample.int(nrow(Y), 5)
Y$x1[idx] <- Y$x1[idx] + 8
Y$x2[idx] <- Y$x2[idx] - 6

R_pear <- pearson_corr(Y)
R_bicor <- bicor(Y)
R_pb <- pbcor(Y)
R_win <- wincor(Y)
R_skip <- skipped_corr(Y)

summary(R_pear)
#> Pearson correlation summary
#>   output      : matrix
#>   dimensions  : 4 x 4
#>   retained_pairs: 10
#>   threshold   : 0.0000
#>   diag        : included
#>   estimate    : -0.8552 to 1.0000
#> 
#>  item1 item2 estimate fisher_z
#>  x1    x2    -0.8552  -1.2754 
#>  x3    x4    0.1514   0.1525  
#>  x1    x3    0.1058   0.1062  
#>  x1    x4    -0.0400  -0.0400 
#>  x2    x3    -0.0327  -0.0328 
#>  x2    x4    0.0050   0.0050  
#> 
#> Strongest pairs by |estimate|
#> 
#>  item1 item2 estimate    fisher_z   
#>  x1    x2    -0.85524378 -1.27536016
#>  x3    x4     0.15136710  0.15253930
#>  x1    x3     0.10583308  0.10623089
#>  x1    x4    -0.04001767 -0.04003905
#>  x2    x3    -0.03274562 -0.03275733
summary(R_bicor)
#> Biweight mid-correlation summary
#>   output      : matrix
#>   dimensions  : 4 x 4
#>   retained_pairs: 10
#>   threshold   : 0.0000
#>   diag        : included
#>   estimate    : -0.2738 to 1.0000
#> 
#>  item1 item2 estimate fisher_z
#>  x1    x2    -0.2738  -0.2809 
#>  x3    x4    0.1699   0.1716  
#>  x1    x4    -0.1324  -0.1332 
#>  x2    x3    0.1279   0.1287  
#>  x2    x4    0.1062   0.1066  
#>  x1    x3    -0.0407  -0.0407 
#> 
#> Strongest pairs by |estimate|
#> 
#>  item1 item2 estimate   fisher_z  
#>  x1    x2    -0.2737753 -0.2809405
#>  x3    x4     0.1699477  0.1716128
#>  x1    x4    -0.1324173 -0.1331995
#>  x2    x3     0.1279485  0.1286537
#>  x2    x4     0.1062372  0.1066396

In practical terms:

- bicor() downweights extreme observations smoothly through the biweight function rather than discarding them.
- pbcor() bends a fixed proportion of the most extreme values, trading a little efficiency for resistance.
- wincor() winsorizes each margin, clipping values beyond the trimming quantiles before correlating.
- skipped_corr() removes flagged multivariate outliers entirely and correlates the remaining points (the skipped_n and skipped_prop lines in its summary report how many were dropped).

That last property is useful, but it also makes skipped_corr() the most computationally expensive estimator in this group.

summary(R_skip)
#> Skipped correlation summary
#>   output      : matrix
#>   dimensions  : 4 x 4
#>   retained_pairs: 10
#>   threshold   : 0.0000
#>   diag        : included
#>   estimate    : -0.2896 to 1.0000
#>   skipped_n   : 2 to 5
#>   skipped_prop: 0.0333 to 0.0833
#> 
#>  item1 item2 estimate n_complete fisher_z statistic p_value
#>  x3    x4    0.2986   60         0.3080   2.3256    0.0200 
#>  x1    x2    -0.2896  60         -0.2982  -2.2512   0.0244 
#>  x2    x3    0.1829   60         0.1850   1.3965    0.1626 
#>  x1    x4    -0.1651  60         -0.1667  -1.2582   0.2083 
#>  x2    x4    0.0834   60         0.0836   0.6309    0.5281 
#>  x1    x3    -0.0544  60         -0.0544  -0.4110   0.6811 
#> 
#> Strongest pairs by |estimate|
#> 
#>  item1 item2 estimate    n_complete fisher_z    statistic  p_value   
#>  x3    x4     0.29864746 60          0.30803396  2.3256054 0.02003961
#>  x1    x2    -0.28964887 60         -0.29818293 -2.2512317 0.02437086
#>  x2    x3     0.18289322 60          0.18497441  1.3965262 0.16255611
#>  x1    x4    -0.16513181 60         -0.16665782 -1.2582390 0.20830535
#>  x2    x4     0.08337372 60          0.08356771  0.6309224 0.52809127
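To see mechanically why clipping extremes helps, here is a small base-R sketch of 20% winsorizing on a contaminated pair. It reuses the vignette's seed but draws a fresh two-variable version of the contamination; the winsorize() helper is illustrative, not a matrixCorr function, and the exact trimming used by wincor() may differ.

```r
# Illustrative base-R sketch, not matrixCorr internals: clip each margin
# at its 20% and 80% quantiles, then correlate the clipped values.
winsorize <- function(v, tr = 0.2) {
  qs <- quantile(v, c(tr, 1 - tr))
  pmin(pmax(v, qs[1]), qs[2])
}

set.seed(20)
x <- rnorm(60)
y <- rnorm(60)
idx <- sample.int(60, 5)
x[idx] <- x[idx] + 8   # shared outliers, as in the vignette data
y[idx] <- y[idx] - 6

round(c(
  raw        = cor(x, y),
  winsorized = cor(winsorize(x), winsorize(y))
), 2)
```

The raw estimate is dominated by the five shared outliers, while the winsorized estimate stays much closer to the near-zero correlation of the clean data.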

Inference for robust estimators

Inference is not uniform across the robust estimators, so it is worth stating the current scope explicitly:

- bicor() offers large-sample inference (inference: large_sample_bicor) with Fisher-z style confidence intervals via ci = TRUE.
- pbcor() and wincor() offer t-tests for the null of zero correlation together with percentile-bootstrap confidence intervals via ci = TRUE and p_value = TRUE.
- skipped_corr() offers bootstrap inference only; both its intervals and its p-values come from resampling.

fit_bicor_ci <- bicor(Y, ci = TRUE)
summary(fit_bicor_ci)
#> Biweight mid-correlation summary
#>   output      : matrix
#>   dimensions  : 4 x 4
#>   retained_pairs: 10
#>   threshold   : 0.0000
#>   diag        : included
#>   estimate    : -0.2738 to 1.0000
#>   inference   : large_sample_bicor
#>   ci          : 95%
#>   ci_method   : fisher_z_bicor
#>   ci_width    : 0.472 to 0.507
#>   cross_zero  : 5 pair(s)
#> 
#>  item1 item2 estimate n_complete lwr     upr     statistic df      p_value
#>  x1    x2    -0.2738  60         -0.4934 -0.0213 2.1678    58.0000 0.0343 
#>  x3    x4    0.1699   60         -0.0878 0.4063  1.3134    58.0000 0.1942 
#>  x1    x4    -0.1324  60         -0.3738 0.1257  1.0174    58.0000 0.3132 
#>  x2    x3    0.1279   60         -0.1302 0.3699  0.9825    58.0000 0.3299 
#>  x2    x4    0.1062   60         -0.1518 0.3507  0.8137    58.0000 0.4192 
#>  x1    x3    -0.0407  60         -0.2916 0.2155  0.3101    58.0000 0.7576 
#>  fisher_z
#>  -0.2809 
#>  0.1716  
#>  -0.1332 
#>  0.1287  
#>  0.1066  
#>  -0.0407 
#> 
#> Strongest pairs by |estimate|
#> 
#>  item1 item2 estimate   n_complete lwr         upr         statistic df
#>  x1    x2    -0.2737753 60         -0.49339972 -0.02133372 2.1678359 58
#>  x3    x4     0.1699477 60         -0.08776438  0.40633735 1.3133888 58
#>  x1    x4    -0.1324173 60         -0.37377417  0.12573516 1.0174191 58
#>  x2    x3     0.1279485 60         -0.13020648  0.36985685 0.9825024 58
#>  x2    x4     0.1062372 60         -0.15178200  0.35070130 0.8136833 58
#>  p_value   fisher_z  
#>  0.0342880 -0.2809405
#>  0.1942241  0.1716128
#>  0.3131795 -0.1331995
#>  0.3299332  0.1286537
#>  0.4191544  0.1066396

fit_pb_inf <- pbcor(Y, ci = TRUE, p_value = TRUE, n_boot = 200, seed = 1)
summary(fit_pb_inf)
#> Percentage bend correlation summary
#>   output      : matrix
#>   dimensions  : 4 x 4
#>   retained_pairs: 10
#>   threshold   : 0.0000
#>   diag        : included
#>   estimate    : -0.4218 to 1.0000
#>   inference   : percentage_bend_t_test
#>   ci          : 95%
#>   ci_method   : percentile_bootstrap
#>   ci_width    : 0.461 to 0.554
#>   cross_zero  : 5 pair(s)
#> 
#>  item1 item2 estimate n_complete lwr     upr     statistic df      p_value
#>  x1    x2    -0.4218  60         -0.6878 -0.1340 -3.5426   58.0000 0.0008 
#>  x3    x4    0.1735   60         -0.1188 0.3812  1.3418    58.0000 0.1849 
#>  x2    x3    0.0967   60         -0.1530 0.3448  0.7396    58.0000 0.4625 
#>  x2    x4    0.0919   60         -0.1571 0.3043  0.7032    58.0000 0.4848 
#>  x1    x4    -0.0720  60         -0.3303 0.1495  -0.5499   58.0000 0.5845 
#>  x1    x3    0.0105   60         -0.2447 0.2603  0.0800    58.0000 0.9365 
#>  fisher_z
#>  -0.4498 
#>  0.1753  
#>  0.0970  
#>  0.0922  
#>  -0.0721 
#>  0.0105  
#> 
#> Strongest pairs by |estimate|
#> 
#>  item1 item2 estimate    n_complete lwr        upr        statistic  df
#>  x1    x2    -0.42176832 60         -0.6877642 -0.1339875 -3.5426038 58
#>  x3    x4     0.17351533 60         -0.1187733  0.3812304  1.3418070 58
#>  x2    x3     0.09665599 60         -0.1530372  0.3448375  0.7395729 58
#>  x2    x4     0.09193922 60         -0.1570500  0.3042500  0.7031664 58
#>  x1    x4    -0.07201735 60         -0.3302915  0.1495091 -0.5498957 58
#>  p_value      fisher_z   
#>  0.0007897873 -0.44984102
#>  0.1848883191  0.17528886
#>  0.4625421763  0.09695868
#>  0.4847651309  0.09219959
#>  0.5845023802 -0.07214224

fit_win_inf <- wincor(Y, ci = TRUE, p_value = TRUE, n_boot = 200, seed = 1)
summary(fit_win_inf)
#> Winsorized correlation summary
#>   output      : matrix
#>   dimensions  : 4 x 4
#>   retained_pairs: 10
#>   threshold   : 0.0000
#>   diag        : included
#>   estimate    : -0.3578 to 1.0000
#>   inference   : winsorized_t_test
#>   ci          : 95%
#>   ci_method   : percentile_bootstrap
#>   ci_width    : 0.486 to 0.601
#>   cross_zero  : 5 pair(s)
#> 
#>  item1 item2 estimate n_complete lwr     upr     statistic df      p_value
#>  x1    x2    -0.3578  60         -0.6771 -0.0757 -2.9184   34.0000 0.0062 
#>  x3    x4    0.1682   60         -0.1069 0.4246  1.2994    34.0000 0.2026 
#>  x2    x4    0.1032   60         -0.1522 0.3343  0.7900    34.0000 0.4350 
#>  x2    x3    0.0849   60         -0.1747 0.3673  0.6487    34.0000 0.5209 
#>  x1    x3    0.0095   60         -0.2634 0.2731  0.0724    34.0000 0.9427 
#>  x1    x4    -0.0048  60         -0.3033 0.2185  -0.0365   34.0000 0.9711 
#>  fisher_z
#>  -0.3744 
#>  0.1698  
#>  0.1035  
#>  0.0851  
#>  0.0095  
#>  -0.0048 
#> 
#> Strongest pairs by |estimate|
#> 
#>  item1 item2 estimate     n_complete lwr        upr         statistic   df
#>  x1    x2    -0.357832947 60         -0.6770871 -0.07572829 -2.91841561 34
#>  x3    x4     0.168185252 60         -0.1068624  0.42458958  1.29936971 34
#>  x2    x4     0.103172411 60         -0.1521755  0.33432344  0.78995327 34
#>  x2    x3     0.084868015 60         -0.1747059  0.36734596  0.64867584 34
#>  x1    x3     0.009503506 60         -0.2633990  0.27308773  0.07237982 34
#>  p_value     fisher_z    
#>  0.006198272 -0.374398404
#>  0.202559111  0.169798500
#>  0.435030149  0.103540842
#>  0.520905514  0.085072656
#>  0.942723765  0.009503792

fit_skip_inf <- skipped_corr(Y, ci = TRUE, p_value = TRUE, n_boot = 200, seed = 1)
summary(fit_skip_inf)
#> Skipped correlation summary
#>   output      : matrix
#>   dimensions  : 4 x 4
#>   retained_pairs: 10
#>   threshold   : 0.0000
#>   diag        : included
#>   estimate    : -0.2896 to 1.0000
#>   skipped_n   : 2 to 5
#>   skipped_prop: 0.0333 to 0.0833
#>   inference   : bootstrap_b2
#>   ci          : 95%
#>   ci_width    : 0.473 to 0.718
#>   cross_zero  : 6 pair(s)
#> 
#>  item1 item2 estimate n_complete lwr     upr    p_value fisher_z
#>  x3    x4    0.2986   60         -0.1183 0.5996 0.2900  0.3080  
#>  x1    x2    -0.2896  60         -0.4994 0.0898 0.0900  -0.2982 
#>  x2    x3    0.1829   60         -0.1003 0.4052 0.2400  0.1850  
#>  x1    x4    -0.1651  60         -0.3979 0.2096 0.3600  -0.1667 
#>  x2    x4    0.0834   60         -0.1242 0.3487 0.4000  0.0836  
#>  x1    x3    -0.0544  60         -0.3124 0.2238 0.6300  -0.0544 
#> 
#> Strongest pairs by |estimate|
#> 
#>  item1 item2 estimate    n_complete lwr        upr       p_value fisher_z   
#>  x3    x4     0.29864746 60         -0.1183050 0.5996071 0.29     0.30803396
#>  x1    x2    -0.28964887 60         -0.4994458 0.0898211 0.09    -0.29818293
#>  x2    x3     0.18289322 60         -0.1003327 0.4051890 0.24     0.18497441
#>  x1    x4    -0.16513181 60         -0.3979290 0.2096387 0.36    -0.16665782
#>  x2    x4     0.08337372 60         -0.1241757 0.3487093 0.40     0.08356771

For pbcor(), wincor(), and skipped_corr(), the inferential layer is more computationally demanding than the estimate-only path because it is built from bootstrap resampling. The skipped-correlation route remains the most expensive of the three because each bootstrap sample also repeats the pairwise outlier screening step. The vignette uses n_boot = 200 only to keep runtime modest; substantive analyses usually need more resamples.
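For orientation, the lwr/upr columns reported by bicor() can be reproduced by hand with the textbook Fisher-z construction. This is a base-R calculation, and it assumes the package's fisher_z_bicor method uses the classical n - 3 standard error, which the numbers above are consistent with.

```r
# Hand-rolled Fisher-z 95% interval for the x1-x2 biweight estimate.
r <- -0.2738                          # estimate reported above
n <- 60
z  <- atanh(r)                        # Fisher z transform
se <- 1 / sqrt(n - 3)                 # classical large-sample SE
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)
round(ci, 3)
#> [1] -0.493 -0.021
```

This matches the lwr/upr pair reported for x1 and x2 above, up to the rounding of r.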

Partial correlation

Partial correlation addresses a different target. Instead of summarising raw association, it estimates conditional association after accounting for the remaining variables.

set.seed(21)
n <- 300
x2 <- rnorm(n)
x1 <- 0.8 * x2 + rnorm(n, sd = 0.4)
x3 <- 0.8 * x2 + rnorm(n, sd = 0.4)
x4 <- 0.7 * x3 + rnorm(n, sd = 0.5)
x5 <- rnorm(n)
x6 <- rnorm(n)
X <- data.frame(x1, x2, x3, x4, x5, x6)

fit_pcor_sample <- pcorr(X, method = "sample")
fit_pcor_oas <- pcorr(X, method = "oas")
R_raw <- pearson_corr(X)

round(c(
  raw_x1_x3 = R_raw["x1", "x3"],
  partial_x1_x3 = fit_pcor_sample$pcor["x1", "x3"]
), 2)
#>     raw_x1_x3 partial_x1_x3 
#>          0.77         -0.07

print(fit_pcor_sample, digits = 2)
#> Partial correlation (sample covariance)
#> 
#> Partial correlation matrix
#>   method      : sample
#>   dimensions  : 6 x 6
#> 
#>       x1    x2    x3    x4    x5    x6
#> x1  1.00  0.64 -0.07  0.07  0.07 -0.08
#> x2  0.64  1.00  0.63  0.04 -0.03  0.06
#> x3 -0.07  0.63  1.00  0.41  0.00  0.02
#> x4  0.07  0.04  0.41  1.00 -0.05  0.01
#> x5  0.07 -0.03  0.00 -0.05  1.00  0.05
#> x6 -0.08  0.06  0.02  0.01  0.05  1.00
summary(fit_pcor_oas)
#> Correlation summary
#>   method      : oas
#>   dimensions  : 6 x 6
#>   pairs       : 15
#>   estimate    : -0.0676 to 0.6074
#>   most_negative: x1-x6 (-0.0676)
#>   most_positive: x1-x2 (0.6074)
#>   rho         : 0.02242
#>   jitter      : 0
#> 
#> Strongest pairs by |estimate|
#> 
#>  item1 item2 estimate
#>  x1    x2    0.6074  
#>  x2    x3    0.5882  
#>  x3    x4    0.3966  
#>  x2    x4    0.0728  
#>  x1    x4    0.0704

Here x1 and x3 are strongly associated in the raw correlation matrix because both depend on x2. Partial correlation is useful because it can separate that indirect association from the remaining direct structure.
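The conditioning step has a convenient base-R cross-check: partial correlations are, up to sign, the off-diagonal entries of the standardised inverse covariance (precision) matrix. The sketch below re-simulates just the first three variables and recovers a near-zero x1-x3 partial correlation; it uses a standard identity, not the pcorr() code path, so the value differs slightly from the six-variable fit above.

```r
# Partial correlation via the precision matrix:
# pcor_ij = -P_ij / sqrt(P_ii * P_jj).
set.seed(21)
n  <- 300
x2 <- rnorm(n)
x1 <- 0.8 * x2 + rnorm(n, sd = 0.4)
x3 <- 0.8 * x2 + rnorm(n, sd = 0.4)
X3 <- cbind(x1, x2, x3)

P  <- solve(cov(X3))                  # precision matrix
pc <- -cov2cor(P)                     # standardise and flip sign
diag(pc) <- 1

round(c(raw = cor(x1, x3), partial = pc["x1", "x3"]), 2)
```

Conditioning on x2 alone already removes essentially all of the raw x1-x3 association.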

The choice of method is important:

- method = "sample" uses the classical sample covariance. It is appropriate when n is comfortably larger than p, and it is the only method with a full inferential layer.
- method = "oas" applies Oracle Approximating Shrinkage regularisation, which stabilises the estimate as p approaches n at the cost of that inferential layer.

When the classical sample estimator is appropriate, partial correlation also supports p-values and Fisher-z confidence intervals.

Unlike the other correlation wrappers, pcorr() uses return_p_value = TRUE rather than p_value = TRUE. That is intentional: the p-value matrix is an optional returned component of the fitted list object.

fit_pcor_inf <- pcorr(
  X,
  method = "sample",
  return_p_value = TRUE,
  ci = TRUE
)

summary(fit_pcor_inf)
#> Partial correlation summary
#>   dimensions  : 6 x 6
#>   pairs       : 15
#>   estimate    : -0.0759 to 0.6445
#>   most_negative: x1-x6 (-0.0759)
#>   most_positive: x1-x2 (0.6445)
#>   ci          : 95%
#>   ci_method   : fisher_z_partial
#>   ci_width    : 0.134 to 0.228
#>   cross_zero  : 12 pair(s)
#> 
#> Strongest pairs by |estimate|
#> 
#>  item1 item2 estimate lwr    upr   p_value      n_complete
#>  x1    x2     0.6445   0.573 0.707 3.819610e-36 300       
#>  x2    x3     0.6271   0.553 0.692 9.272162e-34 300       
#>  x3    x4     0.4145   0.315 0.505 1.011392e-13 300       
#>  x1    x6    -0.0759  -0.188 0.038 1.929409e-01 300       
#>  x1    x5     0.0714  -0.043 0.184 2.206908e-01 300       
#> ... 10 more rows not shown (omitted)
#> Use as.data.frame()/tidy()/as.matrix() to inspect the full result.

That inference layer is available only for the ordinary sample estimator in the low-dimensional setting. The regularised estimators are primarily estimation tools rather than direct inferential procedures in the current interface.

High-dimensional shrinkage correlation

When the number of variables is large relative to the sample size, direct sample correlation can become unstable. shrinkage_corr() provides a shrinkage correlation route designed for exactly this setting.

set.seed(22)
n <- 40
p_block <- 30

make_block <- function(f) {
  sapply(seq_len(p_block), function(j) 0.8 * f + rnorm(n, sd = 0.8))
}

f1 <- rnorm(n)
f2 <- rnorm(n)
f3 <- rnorm(n)

Xd <- cbind(make_block(f1), make_block(f2), make_block(f3))
p <- ncol(Xd)
colnames(Xd) <- paste0("G", seq_len(p))

R_raw <- stats::cor(Xd)
fit_shr <- shrinkage_corr(Xd)
block <- rep(1:3, each = p_block)
within <- outer(block, block, "==") & upper.tri(R_raw)
between <- outer(block, block, "!=") & upper.tri(R_raw)

round(c(
  raw_within = mean(abs(R_raw[within])),
  raw_between = mean(abs(R_raw[between])),
  shrink_within = mean(abs(fit_shr[within])),
  shrink_between = mean(abs(fit_shr[between]))
), 3)
#>     raw_within    raw_between  shrink_within shrink_between 
#>          0.509          0.119          0.404          0.095

print(fit_shr, digits = 2, max_rows = 6, max_vars = 6)
#> Schafer-Strimmer shrinkage correlation matrix
#>   method      : schafer_shrinkage
#>   dimensions  : 90 x 90
#> 
#>        G1   G2   G3 ...   G88   G89  G90
#> G1   1.00 0.47 0.42 ... -0.16 -0.12 0.01
#> G2   0.47 1.00 0.42 ...  0.12  0.04 0.14
#> G3   0.42 0.42 1.00 ...  0.02  0.02 0.18
#> ...   ...  ...  ... ...   ...   ...  ...
#> G88 -0.16 0.12 0.02 ...  1.00  0.39 0.52
#> G89 -0.12 0.04 0.02 ...  0.39  1.00 0.52
#> G90  0.01 0.14 0.18 ...  0.52  0.52 1.00
#> ... 84 more rows not shown (omitted)
#> ... 84 more variables not shown (omitted)
#> Use summary() for a richer digest.
#> Use as.data.frame()/tidy()/as.matrix() to inspect the full result.
summary(fit_shr)
#> Correlation summary
#>   output      : matrix
#>   dimensions  : 90 x 90
#>   retained_pairs: 4,095
#>   threshold   : 0.0000
#>   diag        : included
#>   estimate    : -0.4097 to 1.0000
#> 
#>  item1 item2 estimate fisher_z
#>  G65   G78   0.6154   0.7176  
#>  G53   G60   0.6048   0.7006  
#>  G14   G18   0.5970   0.6884  
#>  G31   G60   0.5930   0.6823  
#>  G53   G57   0.5915   0.6800  
#>  ...   ...   ...      ...     
#>  G37   G65   0.0003   0.0003  
#>  G22   G55   -0.0001  -0.0001 
#>  G9    G87   -0.0001  -0.0001 
#>  G12   G82   0.0000   0.0000  
#>  G51   G75   0.0000   0.0000  
#> ... 3,995 more rows not shown (omitted)
#> Use as.data.frame()/tidy()/as.matrix() to inspect the full result.
#> 
#> Strongest pairs by |estimate|
#> 
#>  item1 item2 estimate  fisher_z 
#>  G65   G78   0.6154224 0.7176030
#>  G53   G60   0.6047778 0.7006462
#>  G14   G18   0.5969508 0.6883963
#>  G31   G60   0.5930245 0.6823184
#>  G53   G57   0.5915489 0.6800453

The shrinkage estimator is not intended to reproduce the raw sample matrix. Its purpose is to provide a better-conditioned and more stable estimate in settings where the sample matrix is noisy or singular. In this block-factor simulation, the raw sample matrix retains the main within-block pattern but also carries substantial between-block noise; shrinkage pulls that background noise down without erasing the dominant structure.
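The instability has a concrete linear-algebra signature: with n = 40 rows, a sample correlation matrix of 90 variables has rank at most 39, so most of its eigenvalues are exactly zero and the matrix cannot be inverted. A quick base-R check on fresh noise (independent of the block simulation above) makes the point:

```r
# With n < p the sample correlation matrix is rank-deficient:
# rank(cor(Z)) <= n - 1 because of column centring.
set.seed(22)
n <- 40; p <- 90
Z  <- matrix(rnorm(n * p), n, p)
ev <- eigen(cor(Z), symmetric = TRUE, only.values = TRUE)$values
c(n_zero = sum(ev < 1e-8), smallest = min(ev), largest = max(ev))
```

Shrinking toward a well-conditioned target lifts those zero eigenvalues away from the boundary, which is what makes the estimate usable in downstream computations such as inversion.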

Choosing among these methods

The choice is usually governed by the main source of difficulty in the data:

- Outliers or heavy tails: use the robust estimators (bicor(), pbcor(), wincor(), skipped_corr()).
- Indirect association through shared variables: use partial correlation via pcorr().
- Many variables relative to the sample size: use shrinkage_corr().

These are not interchangeable estimators. They solve different problems, even though they share the same matrix-oriented calling style.