--- title: "3. Robust and High-Dimensional Correlation" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{3. Robust and High-Dimensional Correlation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning = FALSE, message = FALSE ) ``` ## Scope This vignette covers the estimators used when ordinary wide-data correlation is not enough. The package provides two broad extensions: - robust estimators for outlier-prone data; - regularised estimators for partial correlation or `p >= n` settings. The relevant functions are: - `bicor()` - `pbcor()` - `wincor()` - `skipped_corr()` - `shrinkage_corr()` - `pcorr()` ## Robust correlation Outliers can distort ordinary correlation severely, especially in moderate sample sizes. The robust estimators in `matrixCorr` are designed to limit that effect, but they do so in different ways. ```{r} library(matrixCorr) set.seed(20) Y <- data.frame( x1 = rnorm(60), x2 = rnorm(60), x3 = rnorm(60), x4 = rnorm(60) ) idx <- sample.int(nrow(Y), 5) Y$x1[idx] <- Y$x1[idx] + 8 Y$x2[idx] <- Y$x2[idx] - 6 R_pear <- pearson_corr(Y) R_bicor <- bicor(Y) R_pb <- pbcor(Y) R_win <- wincor(Y) R_skip <- skipped_corr(Y) summary(R_pear) summary(R_bicor) ``` In practical terms: - `bicor()` is often a good first robust alternative for wide data. - `pbcor()` and `wincor()` follow classical robustification routes and are easy to compare against ordinary correlation. - `skipped_corr()` is more aggressive because it first removes bivariate outliers pair by pair. That last property is useful, but it also makes `skipped_corr()` the most computationally expensive estimator in this group. ```{r} summary(R_skip) ``` ## Inference for robust estimators Inference is not uniform across the robust estimators, so it is worth stating the current scope explicitly. - `bicor()` supports large-sample confidence intervals. - `pbcor()` supports percentile-bootstrap confidence intervals and method-specific p-values. - `wincor()` supports percentile-bootstrap confidence intervals and method-specific p-values. - `skipped_corr()` supports bootstrap confidence intervals and bootstrap p-values. ```{r} fit_bicor_ci <- bicor(Y, ci = TRUE) summary(fit_bicor_ci) fit_pb_inf <- pbcor(Y, ci = TRUE, p_value = TRUE, n_boot = 200, seed = 1) summary(fit_pb_inf) fit_win_inf <- wincor(Y, ci = TRUE, p_value = TRUE, n_boot = 200, seed = 1) summary(fit_win_inf) fit_skip_inf <- skipped_corr(Y, ci = TRUE, p_value = TRUE, n_boot = 200, seed = 1) summary(fit_skip_inf) ``` For `pbcor()`, `wincor()`, and `skipped_corr()`, the inferential layer is more computationally demanding than the estimate-only path because it is built from bootstrap resampling. The skipped-correlation route remains the most expensive of the three because each bootstrap sample also repeats the pairwise outlier screening step. The vignette uses `n_boot = 200` only to keep runtime modest; substantive analyses usually need more resamples. ## Partial correlation Partial correlation addresses a different target. Instead of summarising raw association, it estimates conditional association after accounting for the remaining variables. ```{r} set.seed(21) n <- 300 x2 <- rnorm(n) x1 <- 0.8 * x2 + rnorm(n, sd = 0.4) x3 <- 0.8 * x2 + rnorm(n, sd = 0.4) x4 <- 0.7 * x3 + rnorm(n, sd = 0.5) x5 <- rnorm(n) x6 <- rnorm(n) X <- data.frame(x1, x2, x3, x4, x5, x6) fit_pcor_sample <- pcorr(X, method = "sample") fit_pcor_oas <- pcorr(X, method = "oas") R_raw <- pearson_corr(X) round(c( raw_x1_x3 = R_raw["x1", "x3"], partial_x1_x3 = fit_pcor_sample$pcor["x1", "x3"] ), 2) print(fit_pcor_sample, digits = 2) summary(fit_pcor_oas) ``` Here `x1` and `x3` are strongly associated in the raw correlation matrix because both depend on `x2`. Partial correlation is useful because it can separate that indirect association from the remaining direct structure. The choice of `method` is important: - `"sample"` is the ordinary full-rank route. - `"oas"` and `"ridge"` stabilise estimation in higher-dimensional settings. - `"glasso"` is appropriate when a sparse precision structure is the target. When the classical sample estimator is appropriate, partial correlation also supports p-values and Fisher-z confidence intervals. Unlike the other correlation wrappers, `pcorr()` uses `return_p_value = TRUE` rather than `p_value = TRUE`. That is intentional: the p-value matrix is an optional returned component of the fitted list object. ```{r} fit_pcor_inf <- pcorr( X, method = "sample", return_p_value = TRUE, ci = TRUE ) summary(fit_pcor_inf) ``` That inference layer is available only for the ordinary sample estimator in the low-dimensional setting. The regularised estimators are primarily estimation tools rather than direct inferential procedures in the current interface. ## High-dimensional shrinkage correlation When the number of variables is large relative to the sample size, direct sample correlation can become unstable. `shrinkage_corr()` provides a shrinkage correlation route designed for exactly this setting. ```{r} set.seed(22) n <- 40 p_block <- 30 make_block <- function(f) { sapply(seq_len(p_block), function(j) 0.8 * f + rnorm(n, sd = 0.8)) } f1 <- rnorm(n) f2 <- rnorm(n) f3 <- rnorm(n) Xd <- cbind(make_block(f1), make_block(f2), make_block(f3)) p <- ncol(Xd) colnames(Xd) <- paste0("G", seq_len(p)) R_raw <- stats::cor(Xd) fit_shr <- shrinkage_corr(Xd) block <- rep(1:3, each = p_block) within <- outer(block, block, "==") & upper.tri(R_raw) between <- outer(block, block, "!=") & upper.tri(R_raw) round(c( raw_within = mean(abs(R_raw[within])), raw_between = mean(abs(R_raw[between])), shrink_within = mean(abs(fit_shr[within])), shrink_between = mean(abs(fit_shr[between])) ), 3) print(fit_shr, digits = 2, max_rows = 6, max_vars = 6) summary(fit_shr) ``` The shrinkage estimator is not intended to reproduce the raw sample matrix. Its purpose is to provide a better-conditioned and more stable estimate in settings where the sample matrix is noisy or singular. In this block-factor simulation, the raw sample matrix retains the main within-block pattern but also carries substantial between-block noise; shrinkage pulls that background noise down without erasing the dominant structure. ## Choosing among these methods The choice is usually governed by the main source of difficulty in the data. - If the concern is outliers, start with `bicor()` and compare with `pbcor()`, `wincor()`, or `skipped_corr()` as needed. - If the concern is conditional dependence, use `pcorr()`. - If the concern is `p >= n` or unstable covariance estimation, use `shrinkage_corr()` or a regularised `pcorr()` method. These are not interchangeable estimators. They solve different problems, even though they share the same matrix-oriented calling style.