This vignette covers the basic wide-data correlation estimators. These methods start from one numeric matrix or data frame, treat columns as variables, and return a square matrix indexed by those columns.
The main functions in this group are:
pearson_corr()spearman_rho()kendall_tau()dcor()They answer related but not identical questions, so method choice should be driven by the structure of the data rather than by habit alone.
library(matrixCorr)
set.seed(10)
z <- rnorm(80)
u <- rnorm(80)
X <- data.frame(
x1 = z + rnorm(80, sd = 0.35),
x2 = 0.85 * z + rnorm(80, sd = 0.45),
x3 = 0.25 * z + 0.70 * u + rnorm(80, sd = 0.45),
x4 = rnorm(80)
)All four estimators accept this same wide input format.
R_pear <- pearson_corr(X)
R_spr <- spearman_rho(X)
R_ken <- kendall_tau(X)
R_dcor <- dcor(X)
print(R_pear, digits = 2)
#> Pearson correlation matrix
#> method : pearson
#> dimensions : 4 x 4
#>
#> x1 x2 x3 x4
#> x1 1.00 0.78 0.13 0.08
#> x2 0.78 1.00 0.17 -0.02
#> x3 0.13 0.17 1.00 0.22
#> x4 0.08 -0.02 0.22 1.00
summary(R_spr)
#> Correlation summary
#> output : matrix
#> dimensions : 4 x 4
#> retained_pairs: 10
#> threshold : 0.0000
#> diag : included
#> estimate : -0.0023 to 1.0000
#>
#> item1 item2 estimate fisher_z
#> x1 x2 0.7766 1.0368
#> x2 x3 0.1866 0.1888
#> x3 x4 0.1750 0.1768
#> x1 x3 0.1318 0.1326
#> x1 x4 0.1047 0.1051
#> x2 x4 -0.0023 -0.0023
#>
#> Strongest pairs by |estimate|
#>
#> item1 item2 estimate fisher_z
#> x1 x2 0.7766292 1.0368199
#> x2 x3 0.1866151 0.1888278
#> x3 x4 0.1750117 0.1768321
#> x1 x3 0.1318097 0.1325811
#> x1 x4 0.1046882 0.1050732This toy dataset is intentionally structured so that x1
and x2 form a clear linear pair, x3 is only
moderately related to that first block, and x4 is close to
null. That makes the shared output structure easier to interpret than a
pure-noise example.
Pearson correlation targets linear association on the original measurement scale. It is the natural first choice when variables are continuous, the relationship is approximately linear, and there is no strong concern about outlier sensitivity.
If confidence intervals are required, they can be requested directly.
R_pear_ci <- pearson_corr(X, ci = TRUE)
summary(R_pear_ci)
#> Pearson correlation summary
#> output : matrix
#> dimensions : 4 x 4
#> retained_pairs: 10
#> threshold : 0.0000
#> diag : included
#> estimate : -0.0239 to 1.0000
#> ci : 95%
#> ci_method : fisher_z
#> ci_width : 0.179 to 0.439
#> cross_zero : 4 pair(s)
#>
#> item1 item2 estimate n_complete lwr upr fisher_z statistic p_value
#> x1 x2 0.7773 80 0.6724 0.8516 1.0385 9.1128 0.0000
#> x3 x4 0.2249 80 0.0054 0.4236 0.2288 2.0074 0.0447
#> x2 x3 0.1741 80 -0.0474 0.3793 0.1759 1.5437 0.1227
#> x1 x3 0.1250 80 -0.0974 0.3355 0.1257 1.1029 0.2701
#> x1 x4 0.0770 80 -0.1452 0.2918 0.0772 0.6771 0.4984
#> x2 x4 -0.0239 80 -0.2424 0.1968 -0.0239 -0.2100 0.8337
#>
#> Strongest pairs by |estimate|
#>
#> item1 item2 estimate n_complete lwr upr fisher_z statistic
#> x1 x2 0.7772944 80 0.672415903 0.8515753 1.03849848 9.1127871
#> x3 x4 0.2248536 80 0.005403655 0.4236409 0.22876233 2.0073813
#> x2 x3 0.1741272 80 -0.047403236 0.3793314 0.17591984 1.5436903
#> x1 x3 0.1250327 80 -0.097358788 0.3355320 0.12569046 1.1029293
#> x1 x4 0.0770052 80 -0.145167838 0.2917853 0.07715795 0.6770583
#> p_value
#> 8.029223e-20
#> 4.470908e-02
#> 1.226634e-01
#> 2.700579e-01
#> 4.983690e-01Spearman’s rho and Kendall’s tau are rank-based estimators. They are
useful when monotone association is of interest and the analysis should
be less sensitive to departures from strict linearity. Both functions
also support optional large-sample confidence intervals through
ci = TRUE.
set.seed(11)
x <- sort(rnorm(60))
y <- x^3 + rnorm(60, sd = 0.5)
dat_mon <- data.frame(x = x, y = y)
pearson_corr(dat_mon)
#> Pearson correlation matrix
#> method : pearson
#> dimensions : 2 x 2
#>
#> x y
#> x 1.0000 0.7839
#> y 0.7839 1.0000
spearman_rho(dat_mon)
#> Spearman correlation matrix
#> method : spearman
#> dimensions : 2 x 2
#>
#> x y
#> x 1.0000 0.8007
#> y 0.8007 1.0000
kendall_tau(dat_mon)
#> Kendall correlation matrix
#> method : kendall
#> dimensions : 2 x 2
#>
#> x y
#> x 1.0000 0.6395
#> y 0.6395 1.0000In this setting the relationship is monotone but not linear, so a rank-based summary is often the clearer first description.
When interval estimation is required, the same matrix-style interface is kept.
fit_spr_ci <- spearman_rho(X, ci = TRUE)
fit_ken_ci <- kendall_tau(X, ci = TRUE)
summary(fit_spr_ci)
#> Spearman correlation summary
#> output : matrix
#> dimensions : 4 x 4
#> retained_pairs: 10
#> threshold : 0.0000
#> diag : included
#> estimate : -0.0023 to 1.0000
#> ci : 95%
#> ci_method : jackknife_euclidean_likelihood
#> ci_width : 0.197 to 0.476
#> cross_zero : 5 pair(s)
#>
#> item1 item2 estimate n_complete lwr upr fisher_z statistic p_value
#> x1 x2 0.7766 80 0.6776 0.8750 1.0368 9.0981 0.0000
#> x2 x3 0.1866 80 -0.0400 0.4130 0.1888 1.6570 0.0975
#> x3 x4 0.1750 80 -0.0525 0.4023 0.1768 1.5517 0.1207
#> x1 x3 0.1318 80 -0.0975 0.3609 0.1326 1.1634 0.2447
#> x1 x4 0.1047 80 -0.1335 0.3427 0.1051 0.9220 0.3565
#> x2 x4 -0.0023 80 -0.2312 0.2267 -0.0023 -0.0200 0.9841
#>
#> Strongest pairs by |estimate|
#>
#> item1 item2 estimate n_complete lwr upr fisher_z statistic
#> x1 x2 0.7766292 80 0.67755395 0.8749743 1.0368199 9.0980576
#> x2 x3 0.1866151 80 -0.04003742 0.4129867 0.1888278 1.6569574
#> x3 x4 0.1750117 80 -0.05247634 0.4022706 0.1768321 1.5516955
#> x1 x3 0.1318097 80 -0.09748395 0.3609353 0.1325811 1.1633941
#> x1 x4 0.1046882 80 -0.13348567 0.3427447 0.1050732 0.9220137
#> p_value
#> 9.196175e-20
#> 9.752809e-02
#> 1.207351e-01
#> 2.446697e-01
#> 3.565214e-01
summary(fit_ken_ci)
#> Kendall correlation summary
#> output : matrix
#> dimensions : 4 x 4
#> retained_pairs: 10
#> threshold : 0.0000
#> diag : included
#> estimate : -0.0032 to 1.0000
#> ci : 95%
#> ci_method : fieller
#> ci_width : 0.195 to 0.295
#> cross_zero : 5 pair(s)
#>
#> item1 item2 estimate n_complete lwr upr fisher_z statistic p_value
#> x1 x2 0.5861 80 0.4800 0.6752 0.6717 5.8939 0.0000
#> x3 x4 0.1152 80 -0.0329 0.2583 0.1157 1.0153 0.3100
#> x2 x3 0.1133 80 -0.0348 0.2565 0.1138 0.9984 0.3181
#> x1 x3 0.0880 80 -0.0603 0.2325 0.0882 0.7740 0.4389
#> x1 x4 0.0741 80 -0.0743 0.2192 0.0742 0.6510 0.5151
#> x2 x4 -0.0032 80 -0.1506 0.1444 -0.0032 -0.0278 0.9778
#>
#> Strongest pairs by |estimate|
#>
#> item1 item2 estimate n_complete lwr upr fisher_z statistic
#> x1 x2 0.58607595 80 0.48004770 0.6752274 0.67166789 5.8938618
#> x3 x4 0.11518987 80 -0.03290630 0.2583364 0.11570344 1.0152936
#> x2 x3 0.11329114 80 -0.03482794 0.2565401 0.11377960 0.9984119
#> x1 x3 0.08797468 80 -0.06034551 0.2324940 0.08820270 0.7739756
#> x1 x4 0.07405063 80 -0.07429803 0.2191928 0.07418643 0.6509833
#> p_value
#> 3.772727e-09
#> 3.099659e-01
#> 3.180797e-01
#> 4.389452e-01
#> 5.150573e-01Distance correlation addresses a broader target. It is designed to
detect general dependence rather than only linear or monotone structure.
The function also supports optional hypothesis testing through
p_value = TRUE.
set.seed(12)
x <- runif(100, -2, 2)
y <- x^2 + rnorm(100, sd = 0.2)
dat_nonlin <- data.frame(x = x, y = y)
pearson_corr(dat_nonlin)
#> Pearson correlation matrix
#> method : pearson
#> dimensions : 2 x 2
#>
#> x y
#> x 1.0000 0.0045
#> y 0.0045 1.0000
dcor(dat_nonlin)
#> Distance correlation (dCor) matrix
#> method : distance_correlation
#> dimensions : 2 x 2
#>
#> x y
#> x 1.0000 0.2064
#> y 0.2064 1.0000This is a typical situation where Pearson correlation can be close to zero even though the variables are clearly dependent.
If a formal inferential summary is needed, p-values can be requested directly.
fit_dcor_p <- dcor(dat_nonlin, p_value = TRUE)
summary(fit_dcor_p)
#> Distance correlation summary
#> output : matrix
#> dimensions : 2 x 2
#> retained_pairs: 3
#> threshold : 0.0000
#> diag : included
#> estimate : 0.2064 to 1.0000
#> inference : dcor_t_test
#>
#> item1 item2 estimate n_complete statistic df p_value fisher_z
#> x y 0.2064 100 14.6861 4849.0000 0.0000 0.2094
#>
#> Strongest pairs by |estimate|
#>
#> item1 item2 estimate n_complete statistic df p_value fisher_z
#> x y 0.2063619 100 14.68607 4849 4.156224e-48 0.2093684The default wide-data behaviour is strict validation. Missing values
are rejected unless the function explicitly supports a relaxed mode
through na_method = "pairwise".
X_miss <- X
X_miss$x2[c(3, 7)] <- NA
try(pearson_corr(X_miss))
#> Error in eval(expr, envir) : Missing values are not allowed.
pearson_corr(X_miss, na_method = "pairwise")
#> Pearson correlation matrix
#> method : pearson
#> dimensions : 4 x 4
#>
#> x1 x2 x3 x4
#> x1 1.0000 0.7742 0.1250 0.0770
#> x2 0.7742 1.0000 0.1829 -0.0285
#> x3 0.1250 0.1829 1.0000 0.2249
#> x4 0.0770 -0.0285 0.2249 1.0000When na_method = "pairwise", the package uses pairwise
complete observations for the affected estimator. That is convenient,
but it also means different pairs may be based on different effective
sample sizes.
In ordinary wide-data work, the following sequence is usually defensible.
pearson_corr() when the variables are
continuous and linear association is the scientific target.spearman_rho() or kendall_tau() when a
monotone summary is preferred.dcor() when non-linear dependence is plausible and
a zero Pearson correlation would be misleading.The next vignette addresses settings where this basic family is still not sufficient because the data are contaminated by outliers or the number of variables is large relative to sample size.