corrselect identifies all maximal subsets of variables
whose pairwise correlations stay below a chosen threshold. This process
reduces multicollinearity and redundancy before modeling, while
preserving interpretability. Unlike greedy or stepwise approaches,
corrselect exhaustively searches for all valid subsets
using fast, exact algorithms. It is fully model-agnostic, making it
suitable as a preprocessing step for regression, clustering, feature
selection, and other analyses.
Given a threshold \(t \in (0,1)\),
the functions corrSelect() (data-frame interface) and
MatSelect() (matrix interface) enumerate all
maximal subsets \(S\)
of variables satisfying:
\[ \forall i, j \in S,\ i \neq j: \ |r_{ij}| < t \]
where \(r_{ij}\) denotes the chosen correlation measure between variables \(i\) and \(j\). Enumeration relies on two exact graph-theoretic algorithms: Eppstein–Löffler–Strash (ELS) and Bron–Kerbosch.
Results are returned as a CorrCombo S4 object containing
each subset’s variable names and summary statistics
(avg_corr, min_corr, max_corr).
You can then extract subsets from the original data via
corrSubset(). Because the procedure does not depend on any
downstream model, it cleanly separates “feature curation” from “model
fitting” and supports multiple correlation measures
(pearson, spearman, kendall,
bicor, distance, maximal).
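To make the threshold condition and the summary statistics concrete, here is a minimal base-R sketch. It is not corrselect's implementation: `subset_stats()` is a hypothetical helper, and it assumes the statistics are taken over absolute pairwise correlations.

```r
# Hypothetical helper (not part of corrselect): summarise the absolute
# pairwise correlations inside one candidate subset and check |r_ij| < t.
subset_stats <- function(R, vars, threshold) {
  sub <- abs(R[vars, vars, drop = FALSE])
  pairs <- sub[upper.tri(sub)]            # each unordered pair once
  list(
    avg_corr = mean(pairs),
    min_corr = min(pairs),
    max_corr = max(pairs),
    valid    = all(pairs < threshold)     # the pairwise threshold condition
  )
}

# Toy correlation matrix: A and B are strongly correlated, C is not.
R <- matrix(c(1.0, 0.9, 0.1,
              0.9, 1.0, 0.2,
              0.1, 0.2, 1.0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

stats_ac  <- subset_stats(R, c("A", "C"), threshold = 0.7)
stats_abc <- subset_stats(R, c("A", "B", "C"), threshold = 0.7)

stats_ac$valid    # TRUE: the only pair has |r| = 0.1
stats_abc$valid   # FALSE: the pair (A, B) has |r| = 0.9
```

In this toy case {A, C} and {B, C} are the two maximal valid subsets, since adding the missing variable to either one would bring in the A-B pair.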
Data-frame interface (corrSelect):
res <- corrSelect(df, threshold = 0.7)
res
#> CorrCombo object
#> -----------------
#> Method: bron-kerbosch
#> Correlation: pearson
#> Threshold: 0.700
#> Subsets: 2 valid combinations
#> Data Rows: 100 used in correlation
#> Pivot: TRUE
#>
#> Top combinations:
#> No. Variables Avg Max Size
#> ------------------------------------------------------------
#> [ 1] F, B, C, D, E 0.082 0.185 5
#> [ 2] A, B, C, D, E 0.083 0.185 5
as.data.frame(res)
#> VarName01 VarName02 VarName03 VarName04 VarName05
#> Subset01 [avg=0.082] F B C D E
#> Subset02 [avg=0.083] A B C D E
corrSubset(res, df, which = 1)[1:10, ]
#> F B C D E
#> 1 1.33677667 1.2009654 -2.0009292 -0.004620768 1.33491259
#> 2 -0.41675087 1.0447511 0.3337772 0.760242168 -0.86927176
#> 3 0.32656994 -1.0032086 1.1713251 0.038990913 0.05548695
#> 4 0.58317730 1.8484819 2.0595392 0.735072142 0.04906691
#> 5 0.29182614 -0.6667734 -1.3768616 -0.146472627 -0.57835573
#> 6 -0.11532450 0.1055138 -1.1508556 -0.057887335 -0.99873866
#> 7 1.25744892 -0.4222559 -0.7058214 0.482369466 -0.00243278
#> 8 -0.18188872 -0.1223502 -1.0540558 0.992943637 0.65551188
#> 9 1.69450003 0.1881930 -0.6457437 -1.246395498 1.47684228
#> 10 0.02717808 0.1191610 -0.1853780 -0.033487525 -1.90915279
res2 <- corrSelect(df, threshold = 0.7, force_in = "A")
res2
#> CorrCombo object
#> -----------------
#> Method: els
#> Correlation: pearson
#> Threshold: 0.700
#> Subsets: 1 valid combinations
#> Data Rows: 100 used in correlation
#> Forced-in: A
#>
#> Top combinations:
#> No. Variables Avg Max Size
#> ------------------------------------------------------------
#> [ 1] A, B, C, D, E 0.083 0.185 5
res3 <- corrSelect(df, threshold = 0.6, cor_method = "spearman")
res3
#> CorrCombo object
#> -----------------
#> Method: bron-kerbosch
#> Correlation: spearman
#> Threshold: 0.600
#> Subsets: 2 valid combinations
#> Data Rows: 100 used in correlation
#> Pivot: TRUE
#>
#> Top combinations:
#> No. Variables Avg Max Size
#> ------------------------------------------------------------
#> [ 1] F, B, C, D, E 0.088 0.191 5
#> [ 2] A, B, C, D, E 0.090 0.206 5
Matrix interface (MatSelect):
If you have already computed a correlation matrix, apply the method to it directly:
mat <- cor(df)
res4 <- MatSelect(mat, threshold = 0.7)
res4
#> CorrCombo object
#> -----------------
#> Method: bron-kerbosch
#> Threshold: 0.700
#> Subsets: 2 valid combinations
#> Data Rows: 6 used in correlation
#> Pivot: TRUE
#>
#> Top combinations:
#> No. Variables Avg Max Size
#> ------------------------------------------------------------
#> [ 1] F, B, C, D, E 0.082 0.185 5
#> [ 2] A, B, C, D, E 0.083 0.185 5
Selecting subsets:
MatSelect(mat, threshold = 0.5)
#> CorrCombo object
#> -----------------
#> Method: bron-kerbosch
#> Threshold: 0.500
#> Subsets: 2 valid combinations
#> Data Rows: 6 used in correlation
#> Pivot: TRUE
#>
#> Top combinations:
#> No. Variables Avg Max Size
#> ------------------------------------------------------------
#> [ 1] F, B, C, D, E 0.082 0.185 5
#> [ 2] A, B, C, D, E 0.083 0.185 5
Force variable 1 into every subset:
MatSelect(mat, threshold = 0.5, force_in = 1)
#> CorrCombo object
#> -----------------
#> Method: els
#> Threshold: 0.500
#> Subsets: 1 valid combinations
#> Data Rows: 6 used in correlation
#> Forced-in: A
#>
#> Top combinations:
#> No. Variables Avg Max Size
#> ------------------------------------------------------------
#> [ 1] A, B, C, D, E 0.083 0.185 5
Mixed data types (assocSelect):
df_ass <- data.frame(
height = rnorm(15, 170, 10),
weight = rnorm(15, 70, 12),
group = factor(rep(LETTERS[1:3], each = 5)),
score = ordered(sample(c("low","med","high"), 15, TRUE))
)
# keep every subset whose internal associations ≤ 0.6
res5 <- assocSelect(df_ass, threshold = 0.6)
res5
#> CorrCombo object
#> -----------------
#> Method: bron-kerbosch
#> Correlation: mixed
#> AssocMethod: numeric_numeric = pearson, numeric_factor = eta, numeric_ordered
#> = spearman, factor_ordered = cramersv
#> Threshold: 0.600
#> Subsets: 1 valid combinations
#> Data Rows: 15 used in correlation
#> Pivot: TRUE
#>
#> Top combinations:
#> No. Variables Avg Max Size
#> ------------------------------------------------------------
#> [ 1] height, weight, group, score 0.174 0.332 4
By default, corrSelect() uses Pearson correlation. You can choose alternatives with the cor_method argument:
- "pearson": linear correlation (default)
- "spearman": rank-based monotonic association
- "kendall": Kendall’s tau
- "bicor": robust biweight midcorrelation (WGCNA::bicor)
- "distance": distance correlation (energy::dcor)
- "maximal": maximal information coefficient (minerva::mine)
Example:
res6 <- corrSelect(df, threshold = 0.7, cor_method = "spearman")
res6
#> CorrCombo object
#> -----------------
#> Method: bron-kerbosch
#> Correlation: spearman
#> Threshold: 0.700
#> Subsets: 2 valid combinations
#> Data Rows: 100 used in correlation
#> Pivot: TRUE
#>
#> Top combinations:
#> No. Variables Avg Max Size
#> ------------------------------------------------------------
#> [ 1] F, B, C, D, E 0.088 0.191 5
#> [ 2] A, B, C, D, E 0.090 0.206 5
The function assocSelect() extends
corrSelect() to support mixed data types —
including numeric, factor, and ordered variables — by using appropriate
association measures for each variable pair.
Instead of a single correlation matrix, it constructs a generalized association matrix using the following logic:
| Variable 1 | Variable 2 | Method Used |
|---|---|---|
| numeric | numeric | pearson (default; customizable) |
| numeric | factor | eta |
| numeric | ordered | spearman (default; customizable) |
| factor | factor | cramersv |
| factor | ordered | cramersv |
| ordered | ordered | spearman (default; customizable) |
The defaults for numeric-numeric, numeric-ordered, and ordered-ordered associations can be changed via arguments:
assocSelect(df_ass,
method_num_num = "kendall",
method_num_ord = "spearman",
method_ord_ord = "kendall"
)
#> CorrCombo object
#> -----------------
#> Method: bron-kerbosch
#> Correlation: mixed
#> AssocMethod: numeric_numeric = kendall, numeric_factor = eta, numeric_ordered
#> = spearman, factor_ordered = cramersv
#> Threshold: 0.700
#> Subsets: 1 valid combinations
#> Data Rows: 15 used in correlation
#> Pivot: TRUE
#>
#> Top combinations:
#> No. Variables Avg Max Size
#> ------------------------------------------------------------
#> [ 1] height, weight, group, score 0.178 0.332 4
All other combinations use fixed methods (eta or
cramersv) appropriate for measuring association
strength.
df_ass <- data.frame(
height = rnorm(10),
weight = rnorm(10),
group = factor(sample(c("A", "B"), 10, replace = TRUE)),
score = ordered(sample(1:3, 10, replace = TRUE))
)
res7 <- assocSelect(df_ass, threshold = 1, method = "bron-kerbosch", use_pivot = TRUE)
res7
#> CorrCombo object
#> -----------------
#> Method: bron-kerbosch
#> Correlation: mixed
#> AssocMethod: numeric_numeric = pearson, numeric_factor = eta, numeric_ordered
#> = spearman, factor_ordered = cramersv
#> Threshold: 1.000
#> Subsets: 1 valid combinations
#> Data Rows: 10 used in correlation
#> Pivot: TRUE
#>
#> Top combinations:
#> No. Variables Avg Max Size
#> ------------------------------------------------------------
#> [ 1] height, weight, group, score 0.336 0.495 4
Each pairwise association is bounded to [0,1] and treated analogously to correlation.
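The two fixed measures can be illustrated with their textbook definitions. The base-R sketch below uses hypothetical helpers `eta_assoc()` and `cramers_v()`. It assumes the standard correlation ratio and the uncorrected Cramér's V, which may differ in detail from what assocSelect() computes internally.

```r
# Correlation ratio (eta) between a numeric vector and a factor:
# square root of between-group over total sum of squares.
eta_assoc <- function(x, g) {
  g <- droplevels(as.factor(g))
  group_means <- tapply(x, g, mean)
  group_sizes <- tapply(x, g, length)
  ss_between <- sum(group_sizes * (group_means - mean(x))^2)
  ss_total   <- sum((x - mean(x))^2)
  sqrt(ss_between / ss_total)
}

# Cramer's V between two categorical vectors, from the Pearson
# chi-squared statistic without continuity correction.
cramers_v <- function(a, b) {
  tab <- unclass(table(a, b))
  n <- sum(tab)
  expected <- outer(rowSums(tab), colSums(tab)) / n
  chi2 <- sum((tab - expected)^2 / expected)
  sqrt(chi2 / (n * (min(dim(tab)) - 1)))
}

x <- c(1, 2, 3, 11, 12, 13)
g <- factor(c("a", "a", "a", "b", "b", "b"))
h <- factor(c("u", "u", "u", "v", "v", "v"))
k <- factor(c("u", "v", "u", "v", "u", "v"))

eta_assoc(x, g)   # well-separated groups, so close to 1
cramers_v(g, h)   # identical partitions give 1
cramers_v(g, k)   # weakly related partitions give a small value
```

Both measures lie in [0, 1], so they can be compared against the same threshold as an absolute correlation.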
Given a symmetric correlation matrix \(R \in \mathbb{R}^{p \times p}\), we seek all maximal subsets \(S \subseteq \{1, \dots, p\}\) such that:
\[ \forall i, j \in S,\ i \neq j: \ |R_{ij}| < t \]
for a fixed threshold \(t \in (0, 1)\).
This is equivalent to finding all maximal cliques in the thresholded correlation graph, where each node is a variable and an edge joins \(i\) and \(j\) whenever \(|R_{ij}| < t\).
A maximal clique corresponds to a variable subset that cannot be extended without violating the correlation limit.
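The clique formulation can be checked directly in base R. The sketch below assumes an edge joins two variables whenever their absolute correlation is below the threshold. The helper names are hypothetical and not part of corrselect.

```r
# Adjacency matrix of the thresholded graph: TRUE where |R_ij| < t.
threshold_graph <- function(R, t) {
  adj <- abs(R) < t
  diag(adj) <- FALSE   # no self-loops
  adj
}

# S is a clique if every pair of its members is adjacent.
is_clique <- function(adj, S) {
  sub <- adj[S, S, drop = FALSE]
  all(sub[upper.tri(sub)])
}

# S is maximal if it is a clique and no outside vertex is adjacent to all of S.
is_maximal_clique <- function(adj, S) {
  outside <- setdiff(seq_len(nrow(adj)), S)
  extendable <- vapply(outside, function(v) all(adj[v, S]), logical(1))
  is_clique(adj, S) && !any(extendable)
}

# Toy matrix: variables 1 and 2 are strongly correlated, all other pairs weakly.
R <- matrix(0.1, 4, 4)
diag(R) <- 1
R[1, 2] <- R[2, 1] <- 0.9
adj <- threshold_graph(R, t = 0.7)

is_maximal_clique(adj, c(1, 3, 4))  # TRUE
is_maximal_clique(adj, c(3, 4))     # FALSE: can still be extended by 1 or 2
is_clique(adj, c(1, 2, 3))          # FALSE: 1 and 2 exceed the threshold
```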
The ELS algorithm enumerates all maximal cliques in a sparse graph by processing vertices in degeneracy order.
Formally, define:
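\[ \text{extend}(S, C) = \begin{cases} \{S\}, & C = \emptyset, \\ \bigcup_{v \in C} \text{extend}(S \cup \{v\},\ C \cap N(v)), & \text{otherwise}, \end{cases} \]
where \(N(v)\) is the set of neighbors of \(v\) in the thresholded graph, so only candidates compatible with every member of \(S\) remain.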
ELS avoids redundant exploration, achieving good performance on typical correlation graphs.
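A degeneracy ordering can be computed by repeatedly removing a vertex of minimum remaining degree. The base-R sketch below illustrates that step only. `degeneracy_order()` is a hypothetical helper, written for clarity rather than speed, and is not corrselect's implementation.

```r
# Repeatedly remove a vertex of minimum remaining degree.
# Returns the removal order and the graph's degeneracy
# (the largest degree seen at removal time).
degeneracy_order <- function(adj) {
  n <- nrow(adj)
  remaining <- rep(TRUE, n)
  deg <- unname(rowSums(adj))
  ord <- integer(0)
  degeneracy <- 0
  for (step in seq_len(n)) {
    cand <- which(remaining)
    v <- cand[which.min(deg[cand])]       # ties go to the lowest index
    degeneracy <- max(degeneracy, deg[v])
    ord <- c(ord, v)
    remaining[v] <- FALSE
    nb <- which(adj[v, ] & remaining)     # neighbours still in the graph
    deg[nb] <- deg[nb] - 1
  }
  list(order = ord, degeneracy = degeneracy)
}

# Toy graph: a triangle on vertices 1, 2, 3 plus vertex 4 attached to vertex 1.
adj <- matrix(FALSE, 4, 4)
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(1, 4))
adj[edges] <- TRUE
adj[edges[, 2:1]] <- TRUE

res_deg <- degeneracy_order(adj)
res_deg$order        # 4 1 2 3
res_deg$degeneracy   # 2
```

When vertices are processed in this order, each one has at most `degeneracy` neighbors later in the ordering. That bound keeps the candidate sets small on sparse graphs.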
The classical Bron–Kerbosch algorithm enumerates maximal cliques via recursive backtracking with optional pivoting:
Let \(R\) = current clique, \(P\) = prospective nodes, \(X\) = excluded nodes. Then:
\[ \text{BK}(R, P, X): \begin{cases} \text{report}(R), & \text{if } P = X = \emptyset, \\ \text{for each } v \in P \setminus N(u):\ \ \text{BK}(R \cup \{v\},\ P \cap N(v),\ X \cap N(v));\ \ P \leftarrow P \setminus \{v\};\ \ X \leftarrow X \cup \{v\}, & \text{otherwise}. \end{cases} \]
Choosing a pivot \(u \in P \cup X\) and iterating over \(P \setminus N(u)\) reduces recursive calls.
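The recursion with pivoting can be sketched in base R as follows. This is an illustrative implementation of the textbook algorithm under the notation above, not corrselect's code. `bron_kerbosch()` is a hypothetical name, and the pivot rule (most neighbors in \(P\)) is one common choice.

```r
# Enumerate all maximal cliques of a graph given as a logical adjacency matrix.
bron_kerbosch <- function(adj) {
  cliques <- list()
  neighbours <- function(v) unname(which(adj[v, ]))

  bk <- function(R, P, X) {
    if (length(P) == 0 && length(X) == 0) {
      cliques[[length(cliques) + 1]] <<- sort(R)   # report a maximal clique
      return(invisible(NULL))
    }
    # Pivot u: the vertex of P union X with the most neighbours in P.
    PX <- c(P, X)
    counts <- vapply(PX, function(u) length(intersect(P, neighbours(u))),
                     integer(1))
    u <- PX[which.max(counts)]
    # Only vertices of P that are not neighbours of the pivot need to be tried.
    for (v in setdiff(P, neighbours(u))) {
      nv <- neighbours(v)
      bk(c(R, v), intersect(P, nv), intersect(X, nv))
      P <- setdiff(P, v)   # v has been fully explored
      X <- c(X, v)         # exclude v from later cliques at this level
    }
    invisible(NULL)
  }

  bk(integer(0), seq_len(nrow(adj)), integer(0))
  cliques
}

# Toy matrix: A and B are strongly correlated, all other pairs weakly.
vars <- c("A", "B", "C", "D")
R <- matrix(0.1, 4, 4, dimnames = list(vars, vars))
diag(R) <- 1
R["A", "B"] <- R["B", "A"] <- 0.9

adj <- abs(R) < 0.7
diag(adj) <- FALSE

cliques <- bron_kerbosch(adj)
lapply(cliques, function(s) vars[s])   # the subsets {A, C, D} and {B, C, D}
```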
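Most existing R tools, such as findCorrelation, take a greedy approach and return a single solution. corrselect instead enumerates all maximal valid subsets with exact algorithms and returns them as CorrCombo objects.
This makes it well suited to pipelines where interpretability and completeness are essential.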
Convert results for downstream use:
df_res <- as.data.frame(res)
head(df_res)
#> VarName01 VarName02 VarName03 VarName04 VarName05
#> Subset01 [avg=0.082] F B C D E
#> Subset02 [avg=0.083] A B C D E
Extract individual subsets:
lapply(corrSubset(res, df, which = 1:2), function(x) head(x, 10))
#> $Subset1
#> F B C D E
#> 1 1.33677667 1.2009654 -2.0009292 -0.004620768 1.33491259
#> 2 -0.41675087 1.0447511 0.3337772 0.760242168 -0.86927176
#> 3 0.32656994 -1.0032086 1.1713251 0.038990913 0.05548695
#> 4 0.58317730 1.8484819 2.0595392 0.735072142 0.04906691
#> 5 0.29182614 -0.6667734 -1.3768616 -0.146472627 -0.57835573
#> 6 -0.11532450 0.1055138 -1.1508556 -0.057887335 -0.99873866
#> 7 1.25744892 -0.4222559 -0.7058214 0.482369466 -0.00243278
#> 8 -0.18188872 -0.1223502 -1.0540558 0.992943637 0.65551188
#> 9 1.69450003 0.1881930 -0.6457437 -1.246395498 1.47684228
#> 10 0.02717808 0.1191610 -0.1853780 -0.033487525 -1.90915279
#>
#> $Subset2
#> A B C D E
#> 1 1.37095845 1.2009654 -2.0009292 -0.004620768 1.33491259
#> 2 -0.56469817 1.0447511 0.3337772 0.760242168 -0.86927176
#> 3 0.36312841 -1.0032086 1.1713251 0.038990913 0.05548695
#> 4 0.63286260 1.8484819 2.0595392 0.735072142 0.04906691
#> 5 0.40426832 -0.6667734 -1.3768616 -0.146472627 -0.57835573
#> 6 -0.10612452 0.1055138 -1.1508556 -0.057887335 -0.99873866
#> 7 1.51152200 -0.4222559 -0.7058214 0.482369466 -0.00243278
#> 8 -0.09465904 -0.1223502 -1.0540558 0.992943637 0.65551188
#> 9 2.01842371 0.1881930 -0.6457437 -1.246395498 1.47684228
#> 10 -0.06271410 0.1191610 -0.1853780 -0.033487525 -1.90915279
Summarize correlation metrics:
# Number and size of subsets
length(res@subset_list)
#> [1] 2
summary(lengths(res@subset_list))
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 5 5 5 5 5 5
# Summaries of within-subset correlations
summary(res@max_corr)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.185 0.185 0.185 0.185 0.185 0.185
summary(res@avg_corr)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.08162 0.08185 0.08208 0.08208 0.08232 0.08255
A CorrCombo S4 object contains:
- subset_list: list of character vectors (variable names)
- avg_corr, min_corr, max_corr: numeric vectors of correlation metrics
- threshold, forced_in, search_type, cor_method, n_rows_used
- use_pivot (if applicable)
Inspect slots:
str(res@subset_list)
#> List of 2
#> $ : chr [1:5] "F" "B" "C" "D" ...
#> $ : chr [1:5] "A" "B" "C" "D" ...
sessionInfo()
#> R version 4.5.1 (2025-06-13 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26100)
#>
#> Matrix products: default
#> LAPACK version 3.12.1
#>
#> locale:
#> [1] LC_COLLATE=C
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> time zone: Europe/Luxembourg
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] corrselect_2.0.1
#>
#> loaded via a namespace (and not attached):
#> [1] digest_0.6.37 R6_2.6.1 fastmap_1.2.0 xfun_0.52
#> [5] cachem_1.1.0 knitr_1.50 htmltools_0.5.8.1 rmarkdown_2.29
#> [9] lifecycle_1.0.4 cli_3.6.5 sass_0.4.10 jquerylib_0.1.4
#> [13] compiler_4.5.1 rstudioapi_0.17.1 tools_4.5.1 evaluate_1.0.4
#> [17] bslib_0.9.0 Rcpp_1.1.0 yaml_2.3.10 rlang_1.1.6
#> [21] jsonlite_2.0.0