Type: | Package |
Title: | 'SplitWise': Hybrid Stepwise Regression with Single-Split Dummy Encoding |
Version: | 1.0.0 |
Description: | Implements 'SplitWise', a hybrid regression approach that transforms numeric variables into either single-split (0/1) dummy variables or retains them as continuous predictors. The transformation is followed by stepwise selection to identify the most relevant variables. The default 'iterative' mode adaptively explores partial synergies among variables to enhance model performance, while an alternative 'univariate' mode applies simpler transformations independently to each predictor. For details, see Kurbucz et al. (2025) <doi:10.48550/arXiv.2505.15423>. |
License: | GPL (≥ 3) |
Encoding: | UTF-8 |
Depends: | R (≥ 3.5.0) |
Imports: | rpart, stats |
RoxygenNote: | 7.3.2 |
Suggests: | knitr, rmarkdown, testthat (≥ 3.0.0) |
Config/testthat/edition: | 3 |
VignetteBuilder: | knitr |
NeedsCompilation: | no |
Packaged: | 2025-05-26 20:17:10 UTC; Marcell |
Author: | Marcell T. Kurbucz [aut, cre], Nikolaos Tzivanakis [aut], Nilufer Sari Aslam [aut], Adam Sykulski [aut] |
Maintainer: | Marcell T. Kurbucz <m.kurbucz@ucl.ac.uk> |
Repository: | CRAN |
Date/Publication: | 2025-05-28 16:00:02 UTC |
Decide Variable Type (Iterative)
Description
A stepwise variable-selection method that iteratively chooses each variable's best form: "linear", single-split "dummy", or double-split ("middle=1") dummy, based on AIC/BIC improvement. Supports "forward", "backward", or "both" strategies.
Usage
decide_variable_type_iterative(
  X,
  Y,
  minsplit = 5,
  direction = c("backward", "forward", "both"),
  criterion = c("AIC", "BIC"),
  exclude_vars = NULL,
  verbose = FALSE,
  ...
)
Arguments
X |
A data frame of predictors (no response). |
Y |
A numeric vector (the response). |
minsplit |
Minimum number of observations in a node to consider splitting. Default = 5. |
direction |
Stepwise strategy: "forward", "backward", or "both". |
criterion |
A character string: either "AIC" or "BIC". |
exclude_vars |
A character vector of variable names to exclude from dummy transformations.
These variables will always be treated as linear. Default = NULL. |
verbose |
Logical; if |
... |
Additional arguments (currently unused). |
Details
Dummy forms come from a shallow (maxdepth = 2) rpart tree fit to the partial residuals of the current model. We extract up to two splits:
- Single cutoff dummy (e.g., x >= c)
- Double cutoff dummy (e.g., c1 < x < c2)
The function then picks the form (linear, single-split dummy, or double-split dummy) that yields the lowest AIC/BIC. Variables listed in exclude_vars will be forced to remain linear (dummy transformations are never attempted).
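The residual-based idea described above can be illustrated with a minimal sketch. This is not the package's internal code: the starting model (mpg ~ wt), the candidate predictor (hp), the working data frame, and the maxcompete/maxsurrogate settings are all assumptions chosen to keep the example small, and only the single-split form is compared against the linear form.

```r
library(rpart)

d <- mtcars

# Assumed "current model": mpg explained linearly by wt.
current_fit <- lm(mpg ~ wt, data = d)

# Fit a shallow tree of the current residuals on the candidate predictor hp.
work <- data.frame(res = resid(current_fit), hp = d$hp)
tree <- rpart(
  res ~ hp,
  data = work,
  method = "anova",
  control = rpart.control(maxdepth = 2, minsplit = 5,
                          maxcompete = 0, maxsurrogate = 0)
)

# With competing and surrogate splits disabled, tree$splits holds only the
# primary splits; the "index" column stores the split points.
cutoffs <- if (is.null(tree$splits)) numeric(0) else unname(tree$splits[, "index"])

# Compare the linear form with a single-split dummy built from the root split.
aic_linear <- AIC(lm(mpg ~ wt + hp, data = d))
aic_dummy <- Inf
if (length(cutoffs) >= 1) {
  d$hp_dummy <- as.integer(d$hp >= cutoffs[1])
  aic_dummy <- AIC(lm(mpg ~ wt + hp_dummy, data = d))
}

# Keep whichever form gives the lower AIC.
chosen <- if (aic_dummy < aic_linear) "dummy" else "linear"
chosen
```

If the tree finds no split, the sketch falls back to the linear form, which matches the rule that the lowest-criterion form is kept.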
Value
A named list of decisions, where each element is a list with:
- type: Either "linear" or "dummy".
- cutoff: A numeric vector of length 1 or 2 (the chosen split points).
Decide Variable Type (Univariate)
Description
For each numeric predictor, this function fits a shallow (maxdepth = 2) rpart tree directly on Y ~ x and tests whether a dummy transformation improves model fit.
Usage
decide_variable_type_univariate(
  X,
  Y,
  minsplit = 5,
  criterion = c("AIC", "BIC"),
  exclude_vars = NULL,
  verbose = FALSE
)
Arguments
X |
A data frame of numeric predictors (no response). |
Y |
A numeric response vector. |
minsplit |
Minimum number of observations in a node to consider splitting. Default = 5. |
criterion |
A character string: either "AIC" or "BIC". |
exclude_vars |
A character vector of variable names to exclude from dummy transformations.
These variables will always be treated as linear. Default = NULL. |
verbose |
Logical; if |
Details
Dummy forms come from a shallow (maxdepth = 2) rpart tree fit to the data. We extract up to two splits:
- Single cutoff dummy (e.g., x >= c)
- Double cutoff dummy (e.g., c1 < x < c2)
The function then picks the form (linear, single-split dummy, or double-split dummy) that yields the lowest AIC/BIC. If a variable is listed in exclude_vars, it will always be used as a linear predictor (dummy transformation is never attempted).
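As a rough illustration of the three-way comparison described above, the following sketch fits a shallow tree on mpg ~ hp from mtcars and scores each candidate form by AIC. It is an assumption-laden approximation, not the package's implementation: the choice of variables, the use of the first two splits in tree order for the double-split dummy, and the names candidates, aic, and best are all hypothetical.

```r
library(rpart)

d <- mtcars

# Shallow tree fit directly on the response and one predictor.
tree <- rpart(
  mpg ~ hp,
  data = d,
  method = "anova",
  control = rpart.control(maxdepth = 2, minsplit = 5,
                          maxcompete = 0, maxsurrogate = 0)
)

# Primary split points in tree order (empty if the tree did not split).
raw_cuts <- if (is.null(tree$splits)) numeric(0) else unname(tree$splits[, "index"])

# Candidate forms of hp: always linear, plus dummies when cutoffs exist.
candidates <- list(linear = d$hp)
if (length(raw_cuts) >= 1) {
  candidates$single <- as.integer(d$hp >= raw_cuts[1])
}
if (length(unique(raw_cuts)) >= 2) {
  rng <- sort(unique(raw_cuts[1:2]))
  # "Middle = 1" dummy: 1 strictly between the two cutoffs, 0 elsewhere.
  candidates$double <- as.integer(d$hp > rng[1] & d$hp < rng[2])
}

# Score every candidate with a one-predictor linear model and keep the best.
aic <- vapply(candidates, function(z) AIC(lm(d$mpg ~ z)), numeric(1))
best <- names(which.min(aic))
best
```

Swapping AIC() for BIC() in the scoring step gives the BIC-based variant of the same comparison.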
Value
A named list of decisions, where each element is a list with:
- type: Either "dummy" or "linear".
- cutoffs: A numeric vector (length 1 or 2) if type = "dummy", or NULL if linear.
- tree_model: The fitted rpart model (for reference) or NULL if excluded.
SplitWise Regression
Description
Transforms each numeric variable into a single-split dummy or keeps it linear, then runs stats::step() for stepwise selection. The user can choose between a simpler univariate transformation and an iterative approach.
Usage
splitwise(
  formula,
  data,
  transformation_mode = c("iterative", "univariate"),
  direction = c("backward", "forward", "both"),
  minsplit = 5,
  criterion = c("AIC", "BIC"),
  exclude_vars = NULL,
  verbose = FALSE,
  trace = 1,
  steps = 1000,
  k = 2,
  ...
)
## S3 method for class 'splitwise_lm'
print(x, ...)
## S3 method for class 'splitwise_lm'
summary(object, ...)
Arguments
formula |
A formula specifying the response and (initial) predictors, e.g. |
data |
A data frame containing the variables used in formula. |
transformation_mode |
Either "iterative" or "univariate". |
direction |
Stepwise direction: "backward", "forward", or "both". |
minsplit |
Minimum number of observations in a node to consider splitting. Default = 5. |
criterion |
Either "AIC" or "BIC". |
exclude_vars |
A character vector naming variables that should be forced to remain linear
(i.e., no dummy splits allowed). Default = NULL. |
verbose |
Logical; if |
trace |
If positive, |
steps |
Maximum number of steps for step(). |
k |
Penalty multiple for the number of degrees of freedom (used by step()). |
... |
Additional arguments passed to |
x |
A splitwise_lm object. |
object |
A splitwise_lm object. |
Value
An S3 object of class c("splitwise_lm", "lm"), storing:
splitwise_info |
List containing transformation decisions, final data, and call. |
Functions
- print(splitwise_lm): Prints a summary of the splitwise_lm object.
- summary(splitwise_lm): Provides a detailed summary, including how dummies were created.
Examples
# Load the mtcars dataset
data(mtcars)
# Univariate transformations (AIC-based, backward stepwise)
model_uni <- splitwise(
  mpg ~ .,
  data = mtcars,
  transformation_mode = "univariate",
  direction = "backward",
  trace = 0
)
summary(model_uni)
# Iterative approach (BIC-based, forward stepwise)
# Note: typically set k = log(nrow(mtcars)) for BIC in step().
model_iter <- splitwise(
  mpg ~ .,
  data = mtcars,
  transformation_mode = "iterative",
  direction = "forward",
  criterion = "BIC",
  k = log(nrow(mtcars)),
  trace = 0
)
summary(model_iter)
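The note about k in the example above refers to how stats::step() penalises model size: k = 2 corresponds to AIC and k = log(n) to BIC. The sketch below shows this with plain lm() and step() on mtcars, without SplitWise itself, so the object names (full, sel_aic, sel_bic) are illustrative only.

```r
data(mtcars)
n <- nrow(mtcars)

# Full linear model with every predictor in its original (linear) form.
full <- lm(mpg ~ ., data = mtcars)

# Backward selection with the AIC penalty (k = 2, the default in step()).
sel_aic <- step(full, direction = "backward", k = 2, trace = 0)

# Backward selection with the BIC penalty (k = log(n)).
sel_bic <- step(full, direction = "backward", k = log(n), trace = 0)

# Inspect which terms each criterion retained.
formula(sel_aic)
formula(sel_bic)
```

Because log(n) exceeds 2 whenever n is at least 8, the BIC penalty charges more per coefficient than AIC does.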
Transform Features (Iterative Logic)
Description
Once decide_variable_type_iterative has chosen which variables to add (and how), we can build a final data frame from those decisions.
Usage
transform_features_iterative(X, decisions)
Arguments
X |
Original predictor data frame. |
decisions |
Output of decide_variable_type_iterative. |
Value
A data frame with the chosen variables in their final forms (dummy or linear).
Transform Features (Univariate Logic)
Description
Given the decisions (dummy or linear) for each predictor, produce a transformed data frame. Dummy columns are 0/1 based on the cutoff.
Usage
transform_features_univariate(X, decisions)
Arguments
X |
Original predictor data frame. |
decisions |
The list returned by decide_variable_type_univariate. |
Value
A new data frame with either the original column or a dummy column for each variable.
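To make the mapping from decisions to columns concrete, here is a minimal base R sketch. The helper apply_decisions, the "_dummy" column suffix, and the hand-written decisions list with its cutoff values are hypothetical and are not taken from the package; the list only mirrors the documented type and cutoffs fields.

```r
# Hypothetical helper: build a transformed data frame from a decisions list.
apply_decisions <- function(X, decisions) {
  out <- X[, 0, drop = FALSE]  # zero-column frame with the same rows as X
  for (v in names(decisions)) {
    dec <- decisions[[v]]
    x <- X[[v]]
    if (identical(dec$type, "dummy") && length(dec$cutoffs) == 1) {
      # Single cutoff: 1 when x >= c, 0 otherwise.
      out[[paste0(v, "_dummy")]] <- as.integer(x >= dec$cutoffs[1])
    } else if (identical(dec$type, "dummy") && length(dec$cutoffs) == 2) {
      # Double cutoff: 1 when c1 < x < c2, 0 otherwise.
      out[[paste0(v, "_dummy")]] <-
        as.integer(x > dec$cutoffs[1] & x < dec$cutoffs[2])
    } else {
      # Linear: keep the original column unchanged.
      out[[v]] <- x
    }
  }
  out
}

# Hand-written decisions with made-up cutoffs, for illustration only.
decisions <- list(
  wt   = list(type = "linear", cutoffs = NULL),
  hp   = list(type = "dummy",  cutoffs = 120),
  disp = list(type = "dummy",  cutoffs = c(100, 300))
)

X_new <- apply_decisions(mtcars[, c("wt", "hp", "disp")], decisions)
head(X_new)
```

The resulting frame can be combined with the response and passed to lm(), which is the role the transformed data plays before stepwise selection.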