This vignette covers the estimators used when ordinary wide-data correlation is not enough. The package provides two broad extensions: robust correlation for outlier-contaminated data, and partial or shrinkage correlation for p >= n settings. The relevant functions are:

- bicor()
- pbcor()
- wincor()
- skipped_corr()
- shrinkage_corr()
- pcorr()

Outliers can distort ordinary correlation severely, especially in moderate sample sizes. The robust estimators in matrixCorr are designed to limit that effect, but they do so in different ways.
library(matrixCorr)
set.seed(20)
Y <- data.frame(
x1 = rnorm(60),
x2 = rnorm(60),
x3 = rnorm(60),
x4 = rnorm(60)
)
idx <- sample.int(nrow(Y), 5)
Y$x1[idx] <- Y$x1[idx] + 8
Y$x2[idx] <- Y$x2[idx] - 6
R_pear <- pearson_corr(Y)
R_bicor <- bicor(Y)
R_pb <- pbcor(Y)
R_win <- wincor(Y)
R_skip <- skipped_corr(Y)
summary(R_pear)
#> Pearson correlation summary
#> output : matrix
#> dimensions : 4 x 4
#> retained_pairs: 10
#> threshold : 0.0000
#> diag : included
#> estimate : -0.8552 to 1.0000
#>
#> item1 item2 estimate fisher_z
#> x1 x2 -0.8552 -1.2754
#> x3 x4 0.1514 0.1525
#> x1 x3 0.1058 0.1062
#> x1 x4 -0.0400 -0.0400
#> x2 x3 -0.0327 -0.0328
#> x2 x4 0.0050 0.0050
#>
#> Strongest pairs by |estimate|
#>
#> item1 item2 estimate fisher_z
#> x1 x2 -0.85524378 -1.27536016
#> x3 x4 0.15136710 0.15253930
#> x1 x3 0.10583308 0.10623089
#> x1 x4 -0.04001767 -0.04003905
#> x2 x3 -0.03274562 -0.03275733
summary(R_bicor)
#> Biweight mid-correlation summary
#> output : matrix
#> dimensions : 4 x 4
#> retained_pairs: 10
#> threshold : 0.0000
#> diag : included
#> estimate : -0.2738 to 1.0000
#>
#> item1 item2 estimate fisher_z
#> x1 x2 -0.2738 -0.2809
#> x3 x4 0.1699 0.1716
#> x1 x4 -0.1324 -0.1332
#> x2 x3 0.1279 0.1287
#> x2 x4 0.1062 0.1066
#> x1 x3 -0.0407 -0.0407
#>
#> Strongest pairs by |estimate|
#>
#> item1 item2 estimate fisher_z
#> x1 x2 -0.2737753 -0.2809405
#> x3 x4 0.1699477 0.1716128
#> x1 x4 -0.1324173 -0.1331995
#> x2 x3 0.1279485 0.1286537
#> x2 x4 0.1062372 0.1066396

In practical terms:

- bicor() is often a good first robust alternative for wide data.
- pbcor() and wincor() follow classical robustification routes and are easy to compare against ordinary correlation.
- skipped_corr() is more aggressive because it first removes bivariate outliers pair by pair.

That last property is useful, but it also makes skipped_corr() the most computationally expensive estimator in this group.
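To make the classical robustification route concrete, here is a minimal base-R sketch of a winsorized correlation for a single pair. The helper names `winsorize` and `win_cor` are ours for illustration, not part of matrixCorr; wincor() handles whole matrices and has its own trimming defaults.

```r
# Clamp each margin at its gamma / (1 - gamma) quantiles, then
# correlate the clamped values (a sketch, not matrixCorr's code).
winsorize <- function(x, gamma = 0.2) {
  q <- quantile(x, c(gamma, 1 - gamma), names = FALSE)
  pmin(pmax(x, q[1]), q[2])
}
win_cor <- function(x, y, gamma = 0.2) {
  cor(winsorize(x, gamma), winsorize(y, gamma))
}

set.seed(1)
x <- rnorm(60)
y <- 0.5 * x + rnorm(60)
x[1:3] <- x[1:3] + 10   # plant a few one-sided outliers
round(c(pearson = cor(x, y), winsorized = win_cor(x, y)), 2)
```

Because the outlying x-values are clamped back to the 80th percentile before correlating, the winsorized estimate is far less sensitive to the planted contamination than the ordinary Pearson estimate.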
summary(R_skip)
#> Skipped correlation summary
#> output : matrix
#> dimensions : 4 x 4
#> retained_pairs: 10
#> threshold : 0.0000
#> diag : included
#> estimate : -0.2896 to 1.0000
#> skipped_n : 2 to 5
#> skipped_prop: 0.0333 to 0.0833
#>
#> item1 item2 estimate n_complete fisher_z statistic p_value
#> x3 x4 0.2986 60 0.3080 2.3256 0.0200
#> x1 x2 -0.2896 60 -0.2982 -2.2512 0.0244
#> x2 x3 0.1829 60 0.1850 1.3965 0.1626
#> x1 x4 -0.1651 60 -0.1667 -1.2582 0.2083
#> x2 x4 0.0834 60 0.0836 0.6309 0.5281
#> x1 x3 -0.0544 60 -0.0544 -0.4110 0.6811
#>
#> Strongest pairs by |estimate|
#>
#> item1 item2 estimate n_complete fisher_z statistic p_value
#> x3 x4 0.29864746 60 0.30803396 2.3256054 0.02003961
#> x1 x2 -0.28964887 60 -0.29818293 -2.2512317 0.02437086
#> x2 x3 0.18289322 60 0.18497441 1.3965262 0.16255611
#> x1 x4 -0.16513181 60 -0.16665782 -1.2582390 0.20830535
#> x2 x4 0.08337372 60 0.08356771 0.6309224 0.52809127

Inference is not uniform across the robust estimators, so it is worth stating the current scope explicitly.

- bicor() supports large-sample confidence intervals.
- pbcor() supports percentile-bootstrap confidence intervals and method-specific p-values.
- wincor() supports percentile-bootstrap confidence intervals and method-specific p-values.
- skipped_corr() supports bootstrap confidence intervals and bootstrap p-values.

fit_bicor_ci <- bicor(Y, ci = TRUE)
summary(fit_bicor_ci)
#> Biweight mid-correlation summary
#> output : matrix
#> dimensions : 4 x 4
#> retained_pairs: 10
#> threshold : 0.0000
#> diag : included
#> estimate : -0.2738 to 1.0000
#> inference : large_sample_bicor
#> ci : 95%
#> ci_method : fisher_z_bicor
#> ci_width : 0.472 to 0.507
#> cross_zero : 5 pair(s)
#>
#> item1 item2 estimate n_complete lwr upr statistic df p_value
#> x1 x2 -0.2738 60 -0.4934 -0.0213 2.1678 58.0000 0.0343
#> x3 x4 0.1699 60 -0.0878 0.4063 1.3134 58.0000 0.1942
#> x1 x4 -0.1324 60 -0.3738 0.1257 1.0174 58.0000 0.3132
#> x2 x3 0.1279 60 -0.1302 0.3699 0.9825 58.0000 0.3299
#> x2 x4 0.1062 60 -0.1518 0.3507 0.8137 58.0000 0.4192
#> x1 x3 -0.0407 60 -0.2916 0.2155 0.3101 58.0000 0.7576
#> fisher_z
#> -0.2809
#> 0.1716
#> -0.1332
#> 0.1287
#> 0.1066
#> -0.0407
#>
#> Strongest pairs by |estimate|
#>
#> item1 item2 estimate n_complete lwr upr statistic df
#> x1 x2 -0.2737753 60 -0.49339972 -0.02133372 2.1678359 58
#> x3 x4 0.1699477 60 -0.08776438 0.40633735 1.3133888 58
#> x1 x4 -0.1324173 60 -0.37377417 0.12573516 1.0174191 58
#> x2 x3 0.1279485 60 -0.13020648 0.36985685 0.9825024 58
#> x2 x4 0.1062372 60 -0.15178200 0.35070130 0.8136833 58
#> p_value fisher_z
#> 0.0342880 -0.2809405
#> 0.1942241 0.1716128
#> 0.3131795 -0.1331995
#> 0.3299332 0.1286537
#> 0.4191544 0.1066396
fit_pb_inf <- pbcor(Y, ci = TRUE, p_value = TRUE, n_boot = 200, seed = 1)
summary(fit_pb_inf)
#> Percentage bend correlation summary
#> output : matrix
#> dimensions : 4 x 4
#> retained_pairs: 10
#> threshold : 0.0000
#> diag : included
#> estimate : -0.4218 to 1.0000
#> inference : percentage_bend_t_test
#> ci : 95%
#> ci_method : percentile_bootstrap
#> ci_width : 0.461 to 0.554
#> cross_zero : 5 pair(s)
#>
#> item1 item2 estimate n_complete lwr upr statistic df p_value
#> x1 x2 -0.4218 60 -0.6878 -0.1340 -3.5426 58.0000 0.0008
#> x3 x4 0.1735 60 -0.1188 0.3812 1.3418 58.0000 0.1849
#> x2 x3 0.0967 60 -0.1530 0.3448 0.7396 58.0000 0.4625
#> x2 x4 0.0919 60 -0.1571 0.3043 0.7032 58.0000 0.4848
#> x1 x4 -0.0720 60 -0.3303 0.1495 -0.5499 58.0000 0.5845
#> x1 x3 0.0105 60 -0.2447 0.2603 0.0800 58.0000 0.9365
#> fisher_z
#> -0.4498
#> 0.1753
#> 0.0970
#> 0.0922
#> -0.0721
#> 0.0105
#>
#> Strongest pairs by |estimate|
#>
#> item1 item2 estimate n_complete lwr upr statistic df
#> x1 x2 -0.42176832 60 -0.6877642 -0.1339875 -3.5426038 58
#> x3 x4 0.17351533 60 -0.1187733 0.3812304 1.3418070 58
#> x2 x3 0.09665599 60 -0.1530372 0.3448375 0.7395729 58
#> x2 x4 0.09193922 60 -0.1570500 0.3042500 0.7031664 58
#> x1 x4 -0.07201735 60 -0.3302915 0.1495091 -0.5498957 58
#> p_value fisher_z
#> 0.0007897873 -0.44984102
#> 0.1848883191 0.17528886
#> 0.4625421763 0.09695868
#> 0.4847651309 0.09219959
#> 0.5845023802 -0.07214224
fit_win_inf <- wincor(Y, ci = TRUE, p_value = TRUE, n_boot = 200, seed = 1)
summary(fit_win_inf)
#> Winsorized correlation summary
#> output : matrix
#> dimensions : 4 x 4
#> retained_pairs: 10
#> threshold : 0.0000
#> diag : included
#> estimate : -0.3578 to 1.0000
#> inference : winsorized_t_test
#> ci : 95%
#> ci_method : percentile_bootstrap
#> ci_width : 0.486 to 0.601
#> cross_zero : 5 pair(s)
#>
#> item1 item2 estimate n_complete lwr upr statistic df p_value
#> x1 x2 -0.3578 60 -0.6771 -0.0757 -2.9184 34.0000 0.0062
#> x3 x4 0.1682 60 -0.1069 0.4246 1.2994 34.0000 0.2026
#> x2 x4 0.1032 60 -0.1522 0.3343 0.7900 34.0000 0.4350
#> x2 x3 0.0849 60 -0.1747 0.3673 0.6487 34.0000 0.5209
#> x1 x3 0.0095 60 -0.2634 0.2731 0.0724 34.0000 0.9427
#> x1 x4 -0.0048 60 -0.3033 0.2185 -0.0365 34.0000 0.9711
#> fisher_z
#> -0.3744
#> 0.1698
#> 0.1035
#> 0.0851
#> 0.0095
#> -0.0048
#>
#> Strongest pairs by |estimate|
#>
#> item1 item2 estimate n_complete lwr upr statistic df
#> x1 x2 -0.357832947 60 -0.6770871 -0.07572829 -2.91841561 34
#> x3 x4 0.168185252 60 -0.1068624 0.42458958 1.29936971 34
#> x2 x4 0.103172411 60 -0.1521755 0.33432344 0.78995327 34
#> x2 x3 0.084868015 60 -0.1747059 0.36734596 0.64867584 34
#> x1 x3 0.009503506 60 -0.2633990 0.27308773 0.07237982 34
#> p_value fisher_z
#> 0.006198272 -0.374398404
#> 0.202559111 0.169798500
#> 0.435030149 0.103540842
#> 0.520905514 0.085072656
#> 0.942723765 0.009503792
fit_skip_inf <- skipped_corr(Y, ci = TRUE, p_value = TRUE, n_boot = 200, seed = 1)
summary(fit_skip_inf)
#> Skipped correlation summary
#> output : matrix
#> dimensions : 4 x 4
#> retained_pairs: 10
#> threshold : 0.0000
#> diag : included
#> estimate : -0.2896 to 1.0000
#> skipped_n : 2 to 5
#> skipped_prop: 0.0333 to 0.0833
#> inference : bootstrap_b2
#> ci : 95%
#> ci_width : 0.473 to 0.718
#> cross_zero : 6 pair(s)
#>
#> item1 item2 estimate n_complete lwr upr p_value fisher_z
#> x3 x4 0.2986 60 -0.1183 0.5996 0.2900 0.3080
#> x1 x2 -0.2896 60 -0.4994 0.0898 0.0900 -0.2982
#> x2 x3 0.1829 60 -0.1003 0.4052 0.2400 0.1850
#> x1 x4 -0.1651 60 -0.3979 0.2096 0.3600 -0.1667
#> x2 x4 0.0834 60 -0.1242 0.3487 0.4000 0.0836
#> x1 x3 -0.0544 60 -0.3124 0.2238 0.6300 -0.0544
#>
#> Strongest pairs by |estimate|
#>
#> item1 item2 estimate n_complete lwr upr p_value fisher_z
#> x3 x4 0.29864746 60 -0.1183050 0.5996071 0.29 0.30803396
#> x1 x2 -0.28964887 60 -0.4994458 0.0898211 0.09 -0.29818293
#> x2 x3 0.18289322 60 -0.1003327 0.4051890 0.24 0.18497441
#> x1 x4 -0.16513181 60 -0.3979290 0.2096387 0.36 -0.16665782
#> x2 x4 0.08337372 60 -0.1241757 0.3487093 0.40 0.08356771

For pbcor(), wincor(), and skipped_corr(), the inferential layer is more computationally demanding than the estimate-only path because it is built from bootstrap resampling. The skipped-correlation route remains the most expensive of the three because each bootstrap sample also repeats the pairwise outlier screening step. The vignette uses n_boot = 200 only to keep runtime modest; substantive analyses usually need more resamples.
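The percentile-bootstrap recipe behind these intervals can be sketched in a few lines of base R. This mirrors the general method, not matrixCorr's internal implementation, and the variable names are ours.

```r
# Percentile bootstrap CI for a single correlation coefficient:
# resample (x, y) pairs with replacement, recompute the statistic,
# and take empirical quantiles of the bootstrap distribution.
set.seed(1)
x <- rnorm(60)
y <- 0.5 * x + rnorm(60)

n_boot <- 200
boot_r <- replicate(n_boot, {
  idx <- sample.int(length(x), replace = TRUE)  # resample pairs jointly
  cor(x[idx], y[idx])
})
quantile(boot_r, c(0.025, 0.975))  # percentile 95% interval
```

Resampling rows (pairs) rather than each variable separately is essential: it preserves the dependence between x and y that the statistic is measuring.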
Partial correlation addresses a different target. Instead of summarising raw association, it estimates conditional association after accounting for the remaining variables.
set.seed(21)
n <- 300
x2 <- rnorm(n)
x1 <- 0.8 * x2 + rnorm(n, sd = 0.4)
x3 <- 0.8 * x2 + rnorm(n, sd = 0.4)
x4 <- 0.7 * x3 + rnorm(n, sd = 0.5)
x5 <- rnorm(n)
x6 <- rnorm(n)
X <- data.frame(x1, x2, x3, x4, x5, x6)
fit_pcor_sample <- pcorr(X, method = "sample")
fit_pcor_oas <- pcorr(X, method = "oas")
R_raw <- pearson_corr(X)
round(c(
raw_x1_x3 = R_raw["x1", "x3"],
partial_x1_x3 = fit_pcor_sample$pcor["x1", "x3"]
), 2)
#> raw_x1_x3 partial_x1_x3
#> 0.77 -0.07
print(fit_pcor_sample, digits = 2)
#> Partial correlation (sample covariance)
#>
#> Partial correlation matrix
#> method : sample
#> dimensions : 6 x 6
#>
#> x1 x2 x3 x4 x5 x6
#> x1 1.00 0.64 -0.07 0.07 0.07 -0.08
#> x2 0.64 1.00 0.63 0.04 -0.03 0.06
#> x3 -0.07 0.63 1.00 0.41 0.00 0.02
#> x4 0.07 0.04 0.41 1.00 -0.05 0.01
#> x5 0.07 -0.03 0.00 -0.05 1.00 0.05
#> x6 -0.08 0.06 0.02 0.01 0.05 1.00
summary(fit_pcor_oas)
#> Correlation summary
#> method : oas
#> dimensions : 6 x 6
#> pairs : 15
#> estimate : -0.0676 to 0.6074
#> most_negative: x1-x6 (-0.0676)
#> most_positive: x1-x2 (0.6074)
#> rho : 0.02242
#> jitter : 0
#>
#> Strongest pairs by |estimate|
#>
#> item1 item2 estimate
#> x1 x2 0.6074
#> x2 x3 0.5882
#> x3 x4 0.3966
#> x2 x4 0.0728
#> x1 x4 0.0704

Here x1 and x3 are strongly associated in the raw correlation matrix because both depend on x2. Partial correlation is useful because it can separate that indirect association from the remaining direct structure.
The choice of method is important:

- "sample" is the ordinary full-rank route.
- "oas" and "ridge" stabilise estimation in higher-dimensional settings.
- "glasso" is appropriate when a sparse precision structure is the target.

When the classical sample estimator is appropriate, partial correlation also supports p-values and Fisher-z confidence intervals.
Unlike the other correlation wrappers, pcorr() uses
return_p_value = TRUE rather than
p_value = TRUE. That is intentional: the p-value matrix is
an optional returned component of the fitted list object.
fit_pcor_inf <- pcorr(
X,
method = "sample",
return_p_value = TRUE,
ci = TRUE
)
summary(fit_pcor_inf)
#> Partial correlation summary
#> dimensions : 6 x 6
#> pairs : 15
#> estimate : -0.0759 to 0.6445
#> most_negative: x1-x6 (-0.0759)
#> most_positive: x1-x2 (0.6445)
#> ci : 95%
#> ci_method : fisher_z_partial
#> ci_width : 0.134 to 0.228
#> cross_zero : 12 pair(s)
#>
#> Strongest pairs by |estimate|
#>
#> item1 item2 estimate lwr upr p_value n_complete
#> x1 x2 0.6445 0.573 0.707 3.819610e-36 300
#> x2 x3 0.6271 0.553 0.692 9.272162e-34 300
#> x3 x4 0.4145 0.315 0.505 1.011392e-13 300
#> x1 x6 -0.0759 -0.188 0.038 1.929409e-01 300
#> x1 x5 0.0714 -0.043 0.184 2.206908e-01 300
#> ... 10 more rows not shown (omitted)
#> Use as.data.frame()/tidy()/as.matrix() to inspect the full result.

That inference layer is available only for the ordinary sample estimator in the low-dimensional setting. The regularised estimators are primarily estimation tools rather than direct inferential procedures in the current interface.
When the number of variables is large relative to the sample size,
direct sample correlation can become unstable.
shrinkage_corr() provides a shrinkage correlation route
designed for exactly this setting.
set.seed(22)
n <- 40
p_block <- 30
make_block <- function(f) {
sapply(seq_len(p_block), function(j) 0.8 * f + rnorm(n, sd = 0.8))
}
f1 <- rnorm(n)
f2 <- rnorm(n)
f3 <- rnorm(n)
Xd <- cbind(make_block(f1), make_block(f2), make_block(f3))
p <- ncol(Xd)
colnames(Xd) <- paste0("G", seq_len(p))
R_raw <- stats::cor(Xd)
fit_shr <- shrinkage_corr(Xd)
block <- rep(1:3, each = p_block)
within <- outer(block, block, "==") & upper.tri(R_raw)
between <- outer(block, block, "!=") & upper.tri(R_raw)
round(c(
raw_within = mean(abs(R_raw[within])),
raw_between = mean(abs(R_raw[between])),
shrink_within = mean(abs(fit_shr[within])),
shrink_between = mean(abs(fit_shr[between]))
), 3)
#> raw_within raw_between shrink_within shrink_between
#> 0.509 0.119 0.404 0.095
print(fit_shr, digits = 2, max_rows = 6, max_vars = 6)
#> Schafer-Strimmer shrinkage correlation matrix
#> method : schafer_shrinkage
#> dimensions : 90 x 90
#>
#> G1 G2 G3 ... G88 G89 G90
#> G1 1.00 0.47 0.42 ... -0.16 -0.12 0.01
#> G2 0.47 1.00 0.42 ... 0.12 0.04 0.14
#> G3 0.42 0.42 1.00 ... 0.02 0.02 0.18
#> ... ... ... ... ... ... ... ...
#> G88 -0.16 0.12 0.02 ... 1.00 0.39 0.52
#> G89 -0.12 0.04 0.02 ... 0.39 1.00 0.52
#> G90 0.01 0.14 0.18 ... 0.52 0.52 1.00
#> ... 84 more rows not shown (omitted)
#> ... 84 more variables not shown (omitted)
#> Use summary() for a richer digest.
#> Use as.data.frame()/tidy()/as.matrix() to inspect the full result.
summary(fit_shr)
#> Correlation summary
#> output : matrix
#> dimensions : 90 x 90
#> retained_pairs: 4,095
#> threshold : 0.0000
#> diag : included
#> estimate : -0.4097 to 1.0000
#>
#> item1 item2 estimate fisher_z
#> G65 G78 0.6154 0.7176
#> G53 G60 0.6048 0.7006
#> G14 G18 0.5970 0.6884
#> G31 G60 0.5930 0.6823
#> G53 G57 0.5915 0.6800
#> ... ... ... ...
#> G37 G65 0.0003 0.0003
#> G22 G55 -0.0001 -0.0001
#> G9 G87 -0.0001 -0.0001
#> G12 G82 0.0000 0.0000
#> G51 G75 0.0000 0.0000
#> ... 3,995 more rows not shown (omitted)
#> Use as.data.frame()/tidy()/as.matrix() to inspect the full result.
#>
#> Strongest pairs by |estimate|
#>
#> item1 item2 estimate fisher_z
#> G65 G78 0.6154224 0.7176030
#> G53 G60 0.6047778 0.7006462
#> G14 G18 0.5969508 0.6883963
#> G31 G60 0.5930245 0.6823184
#> G53 G57 0.5915489 0.6800453

The shrinkage estimator is not intended to reproduce the raw sample matrix. Its purpose is to provide a better-conditioned and more stable estimate in settings where the sample matrix is noisy or singular. In this block-factor simulation, the raw sample matrix retains the main within-block pattern but also carries substantial between-block noise; shrinkage pulls that background noise down without erasing the dominant structure.
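The core of this kind of shrinkage is a convex combination of the sample correlation matrix with a structured target, here the identity. The fixed lambda below is purely illustrative; shrinkage_corr() estimates the shrinkage intensity from the data rather than taking it as an input.

```r
# Linear shrinkage of a sample correlation matrix toward the identity:
# R_shr = lambda * I + (1 - lambda) * R.
set.seed(22)
n <- 40
p <- 90
Z <- matrix(rnorm(n * p), n, p)   # pure noise: the true correlation is I
R <- cor(Z)

lambda <- 0.3                     # illustrative, not a data-driven choice
R_shr  <- lambda * diag(p) + (1 - lambda) * R

# Every off-diagonal entry is damped by the factor (1 - lambda),
# while the unit diagonal is preserved exactly.
c(raw_max_offdiag    = max(abs(R[upper.tri(R)])),
  shrunk_max_offdiag = max(abs(R_shr[upper.tri(R_shr)])))
```

With n = 40 observations of 90 pure-noise variables, the raw matrix shows sizeable spurious off-diagonal entries; shrinking toward the identity damps them uniformly, which is the same mechanism that reduced the between-block noise in the block-factor example.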
The choice is usually governed by the main source of difficulty in the data.

- For outlier contamination, start with bicor() and compare with pbcor(), wincor(), or skipped_corr() as needed.
- For conditional association, use pcorr().
- For p >= n or unstable covariance estimation, use shrinkage_corr() or a regularised pcorr() method.

These are not interchangeable estimators. They solve different problems, even though they share the same matrix-oriented calling style.