---
title: "3. Robust and High-Dimensional Correlation"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{3. Robust and High-Dimensional Correlation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE
)
```

## Scope

This vignette covers the estimators used when ordinary wide-data correlation is
not enough. The package provides two broad extensions:

- robust estimators for outlier-prone data;
- regularised estimators for partial correlation or `p >= n` settings.

The relevant functions are:

- `bicor()`
- `pbcor()`
- `wincor()`
- `skipped_corr()`
- `shrinkage_corr()`
- `pcorr()`

## Robust correlation

Outliers can distort ordinary correlation severely, especially in moderate
sample sizes. The robust estimators in `matrixCorr` are designed to limit that
effect, but they do so in different ways.

```{r}
library(matrixCorr)

set.seed(20)
Y <- data.frame(
  x1 = rnorm(60),
  x2 = rnorm(60),
  x3 = rnorm(60),
  x4 = rnorm(60)
)

idx <- sample.int(nrow(Y), 5)
Y$x1[idx] <- Y$x1[idx] + 8
Y$x2[idx] <- Y$x2[idx] - 6

R_pear <- pearson_corr(Y)
R_bicor <- bicor(Y)
R_pb <- pbcor(Y)
R_win <- wincor(Y)
R_skip <- skipped_corr(Y)

summary(R_pear)
summary(R_bicor)
```

In practical terms:

- `bicor()` is often a good first robust alternative for wide data.
- `pbcor()` and `wincor()` follow classical robustification routes and are easy
  to compare against ordinary correlation.
- `skipped_corr()` is more aggressive because it first removes bivariate
  outliers pair by pair.

That last property is useful, but it also makes `skipped_corr()` the most
computationally expensive estimator in this group.

```{r}
summary(R_skip)
```

## Inference for robust estimators

Inference is not uniform across the robust estimators, so it is worth stating
the current scope explicitly.

- `bicor()` supports large-sample confidence intervals.
- `pbcor()` supports percentile-bootstrap confidence intervals and
  method-specific p-values.
- `wincor()` supports percentile-bootstrap confidence intervals and
  method-specific p-values.
- `skipped_corr()` supports bootstrap confidence intervals and bootstrap
  p-values.

```{r}
fit_bicor_ci <- bicor(Y, ci = TRUE)
summary(fit_bicor_ci)

fit_pb_inf <- pbcor(Y, ci = TRUE, p_value = TRUE, n_boot = 200, seed = 1)
summary(fit_pb_inf)

fit_win_inf <- wincor(Y, ci = TRUE, p_value = TRUE, n_boot = 200, seed = 1)
summary(fit_win_inf)

fit_skip_inf <- skipped_corr(Y, ci = TRUE, p_value = TRUE, n_boot = 200, seed = 1)
summary(fit_skip_inf)
```

For `pbcor()`, `wincor()`, and `skipped_corr()`, the inferential layer is more
computationally demanding than the estimate-only path because it is built from
bootstrap resampling. The skipped-correlation route remains the most expensive
of the three because each bootstrap sample also repeats the pairwise outlier
screening step. The vignette uses `n_boot = 200` only to keep runtime modest;
substantive analyses usually need more resamples.

## Partial correlation

Partial correlation addresses a different target. Instead of summarising raw
association, it estimates conditional association after accounting for the
remaining variables.

```{r}
set.seed(21)
n <- 300
x2 <- rnorm(n)
x1 <- 0.8 * x2 + rnorm(n, sd = 0.4)
x3 <- 0.8 * x2 + rnorm(n, sd = 0.4)
x4 <- 0.7 * x3 + rnorm(n, sd = 0.5)
x5 <- rnorm(n)
x6 <- rnorm(n)
X <- data.frame(x1, x2, x3, x4, x5, x6)

fit_pcor_sample <- pcorr(X, method = "sample")
fit_pcor_oas <- pcorr(X, method = "oas")
R_raw <- pearson_corr(X)

round(c(
  raw_x1_x3 = R_raw["x1", "x3"],
  partial_x1_x3 = fit_pcor_sample$pcor["x1", "x3"]
), 2)

print(fit_pcor_sample, digits = 2)
summary(fit_pcor_oas)
```

Here `x1` and `x3` are strongly associated in the raw correlation matrix
because both depend on `x2`. Partial correlation is useful because it can
separate that indirect association from the remaining direct structure.

The choice of `method` is important:

- `"sample"` is the ordinary full-rank route.
- `"oas"` and `"ridge"` stabilise estimation in higher-dimensional settings.
- `"glasso"` is appropriate when a sparse precision structure is the target.

When the classical sample estimator is appropriate, partial correlation also
supports p-values and Fisher-z confidence intervals.

Unlike the other correlation wrappers, `pcorr()` uses
`return_p_value = TRUE` rather than `p_value = TRUE`. That is intentional:
the p-value matrix is an optional returned component of the fitted list object.

```{r}
fit_pcor_inf <- pcorr(
  X,
  method = "sample",
  return_p_value = TRUE,
  ci = TRUE
)

summary(fit_pcor_inf)
```

That inference layer is available only for the ordinary sample estimator in the
low-dimensional setting. The regularised estimators are primarily estimation
tools rather than direct inferential procedures in the current interface.

## High-dimensional shrinkage correlation

When the number of variables is large relative to the sample size, direct
sample correlation can become unstable. `shrinkage_corr()` provides a shrinkage
correlation route designed for exactly this setting.

```{r}
set.seed(22)
n <- 40
p_block <- 30

make_block <- function(f) {
  sapply(seq_len(p_block), function(j) 0.8 * f + rnorm(n, sd = 0.8))
}

f1 <- rnorm(n)
f2 <- rnorm(n)
f3 <- rnorm(n)

Xd <- cbind(make_block(f1), make_block(f2), make_block(f3))
p <- ncol(Xd)
colnames(Xd) <- paste0("G", seq_len(p))

R_raw <- stats::cor(Xd)
fit_shr <- shrinkage_corr(Xd)
block <- rep(1:3, each = p_block)
within <- outer(block, block, "==") & upper.tri(R_raw)
between <- outer(block, block, "!=") & upper.tri(R_raw)

round(c(
  raw_within = mean(abs(R_raw[within])),
  raw_between = mean(abs(R_raw[between])),
  shrink_within = mean(abs(fit_shr[within])),
  shrink_between = mean(abs(fit_shr[between]))
), 3)

print(fit_shr, digits = 2, max_rows = 6, max_vars = 6)
summary(fit_shr)
```

The shrinkage estimator is not intended to reproduce the raw sample matrix. Its
purpose is to provide a better-conditioned and more stable estimate in settings
where the sample matrix is noisy or singular. In this block-factor simulation,
the raw sample matrix retains the main within-block pattern but also carries
substantial between-block noise; shrinkage pulls that background noise down
without erasing the dominant structure.

## Choosing among these methods

The choice is usually governed by the main source of difficulty in the data.

- If the concern is outliers, start with `bicor()` and compare with
  `pbcor()`, `wincor()`, or `skipped_corr()` as needed.
- If the concern is conditional dependence, use `pcorr()`.
- If the concern is `p >= n` or unstable covariance estimation, use
  `shrinkage_corr()` or a regularised `pcorr()` method.

These are not interchangeable estimators. They solve different problems, even
though they share the same matrix-oriented calling style.