Version: | 0.1-5 |
Date: | 2023-04-11 |
Title: | Stability Assessment of Statistical Learning Methods |
Description: | Graphical and computational methods that can be used to assess the stability of results from supervised statistical learning. |
Depends: | R (≥ 3.0.0) |
Imports: | graphics, methods, MASS, e1071, partykit, party, randomForest, ranger |
Suggests: | utils, Formula, nnet, rpart, evtree, rchallenge, knitr, rmarkdown |
VignetteBuilder: | knitr |
License: | GPL-2 | GPL-3 |
Encoding: | UTF-8 |
NeedsCompilation: | no |
Packaged: | 2023-04-12 23:53:11 UTC; zeileis |
Author: | Michel Philipp [aut],
Carolin Strobl [aut],
Achim Zeileis |
Maintainer: | Achim Zeileis <Achim.Zeileis@R-project.org> |
Repository: | CRAN |
Date/Publication: | 2023-04-13 12:22:20 UTC |
List of Predefined Learners for Assessing Stability
Description
The list contains details about several predefined learners that are required to assess the stability of results from statistical learning.
Usage
LearnerList
Details
Currently implemented learners are:
- ctree: conditional inference trees using ctree from partykit.
- rpart: recursive partitioning using rpart from rpart.
- J48: recursive partitioning using J48 from RWeka.
- C5.0: recursive partitioning using C5.0 from C50.
- tree: recursive partitioning using tree from tree.
- lda: linear discriminant analysis using lda from MASS.
- lm: linear models using lm from stats.
- glm: generalized linear models using glm from stats.
Users can add new learners to LearnerList for the current R session, see addLearner.
Prediction Accuracy from Stability Assessment Results
Description
Function to compute the prediction accuracy from an object of class "stablelearner" or "stablelearnerList", in parallel to the similarity values estimated by stability in each iteration of the stability assessment procedure.
Usage
accuracy(x, measure = "kappa", na.action = na.exclude,
applyfun = NULL, cores = NULL)
Arguments
x |
an object of class |
measure |
a character string (or a vector of character strings).
Name(s) of the measure(s) used to compute accuracy. Currently implemented
measures are |
na.action |
a function which indicates what should happen to the predictions
of the results containing |
applyfun |
a |
cores |
integer. The number of cores to use in multicore computations
using |
Details
This function can be used to compute prediction accuracy after the stability was estimated using stability.
Value
A matrix of size 2*B times length(measure) containing prediction accuracy values of the learners trained during the stability assessment procedure.
Examples
library("partykit")
res <- ctree(Species ~ ., data = iris)
stab <- stability(res)
accuracy(stab)
Add Learners to LearnerList
Description
The function can be used to add a new learner to LearnerList in the current R session.
Usage
addLearner(x)
Arguments
x |
a list containing all required information to define a new learner (see Details below). |
Details
The function can be used to add new learners to LearnerList in the current R session. It expects a list of four elements: the name of the learner's object class, the name of the package in which the class and the fitting method are implemented, the name of the method, and a prediction function. The prediction function predicts class probabilities (in the classification case) or numeric values (in the regression case) and takes three arguments: x (the fitted model object), newdata (a data.frame containing the predictors of the observations in the evaluation sample) and yclass (a character string specifying the type of the response variable, e.g. "numeric", "factor", etc.). The elements in the list should be named class, package, method and predfun.
Examples
newlearner <- list(
  class   = "svm",
  package = "e1071",
  method  = "Support Vector Machine",
  predict = function(x, newdata, yclass = NULL) {
    ## %in% always yields TRUE/FALSE, whereas match() returns NA for
    ## other response types and would make if() fail
    if(any(yclass %in% c("ordered", "factor"))) {
      attr(predict(x, newdata = newdata, probability = TRUE), "probabilities")
    } else {
      predict(x, newdata = newdata)
    }
  })
addLearner(newlearner)
Sampler Infrastructure for Stability Assessment
Description
Sampler functions that provide objects with functionality used by stabletree to generate resampled datasets.
Usage
bootstrap(B = 500, v = 1)
subsampling(B = 500, v = 0.632)
samplesplitting(k = 5)
jackknife(d = 1, maxrep = 5000)
splithalf(B = 500)
Arguments
B |
An integer value specifying the number of resampled datasets. |
k |
An integer value specifying the number of folds in sample-splitting. |
d |
An integer value specifying the number of observations left out in jackknife. |
maxrep |
An integer value specifying the maximum number of resampled datasets allowed, when using jackknife. |
v |
A numeric value between 0 and 1 specifying the fraction of observations in each subsample. |
Details
The sampler functions provide objects that include functionality to generate resampled datasets used by stabletree.
The bootstrap function provides an object that can be used to generate B bootstrap samples by sampling from n observations with replacement.
The subsampling function provides an object that can be used to generate B subsamples by sampling floor(v*n) of the n observations without replacement.
The samplesplitting function provides an object that can be used to generate k folds from n observations.
The jackknife function provides an object that can be used to generate all datasets necessary to perform leave-d-out jackknife sampling from n observations. The number of datasets is limited by maxrep to prevent unintended CPU or memory overload caused by accidentally choosing too large a value for d.
The splithalf function provides an object that can be used to generate B subsamples by sampling floor(0.5*n) of the n observations without replacement. When used to implement the "splithalf" resampling strategy for measuring the stability of a result via the stability function, the matrix containing the complement learning samples is generated automatically by stability.
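The resampling schemes described above can be illustrated with a minimal base-R sketch. It only mimics the index sampling and is not the package's internal implementation; all object names (boot_idx, sub_idx, half, complement, jack_idx) are made up for illustration.

```r
# Illustrative only: plain index sampling with base R
set.seed(1)
n <- 10   # number of observations
B <- 3    # number of resampled datasets
v <- 0.6  # fraction of observations per subsample

# Bootstrap: n draws with replacement, one column per resampled dataset
boot_idx <- replicate(B, sample.int(n, size = n, replace = TRUE))

# Subsampling: floor(v * n) draws without replacement
sub_idx <- replicate(B, sample.int(n, size = floor(v * n), replace = FALSE))

# Split-half: a subsample of size floor(0.5 * n) and its complement
half <- sample.int(n, size = floor(0.5 * n))
complement <- setdiff(seq_len(n), half)

# Leave-d-out jackknife: every index set with d observations removed
d <- 1
jack_idx <- combn(n, n - d)
ncol(jack_idx)  # number of jackknife datasets, choose(n, d)
```

The number of jackknife datasets grows as choose(n, d), which is why a cap such as maxrep is useful.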
Examples
set.seed(0)
## bootstrap sampler
s <- bootstrap(3)
s$sampler(10)
## subsampling
s <- subsampling(3, v = 0.6)
s$sampler(10)
## 5-fold sample-splitting
s <- samplesplitting(5)
s$sampler(10)
## jackknife
s <- jackknife(d = 1)
s$sampler(10)
## splithalf
s <- splithalf(3)
s$sampler(10)
Illustrate Results from Stability Assessment
Description
Illustrates the results from stability assessments performed by stability using boxplots.
Usage
## S3 method for class 'stablelearnerList'
boxplot(x, ..., main = NULL, xlab = NULL, ylab = NULL, reverse = TRUE)
## S3 method for class 'stablelearner'
boxplot(x, ...)
Arguments
x |
an object of class |
... |
Arguments passed to |
main |
a character specifying the title. By default set to |
xlab |
a character specifying the title for the x axis. By default set
to |
ylab |
a character specifying the title for the y axis. By default set
to |
reverse |
logical. If |
See Also
stability, summary.stablelearnerList
Examples
library("partykit")
r1 <- ctree(Species ~ ., data = iris)
library("rpart")
r2 <- rpart(Species ~ ., data = iris)
stab <- stability(r1, r2, names = c("ctree", "rpart"))
boxplot(stab)
Data-Generating Function for Two-Class Problem
Description
Data-generating function to generate artificial data sets of a classification problem with two response classes, denoted as "A" and "B".
Usage
dgp_twoclass(n = 100, p = 4, noise = 16, rho = 0,
b0 = 0, b = rep(1, p), fx = identity)
Arguments
n |
integer. Number of observations. The default is 100. |
p |
integer. Number of signal predictors. The default is 4. |
noise |
integer. Number of noise predictors. The default is 16. |
rho |
numeric value between -1 and 1 specifying the correlation
between the signal predictors. The correlation is given by |
b0 |
numeric value. Baseline probability for class |
b |
numeric value. Slope parameter for the predictors on the logit scale. The default is 1 for all predictors. |
fx |
a function that is used to transform the predictors. The default
is |
Value
A data.frame including a column denoted as class that is a factor with two levels "A" and "B". All other columns represent the predictor variables (signal predictors followed by noise predictors) and are named "x1", "x2", etc.
Examples
dgp_twoclass(n = 200, p = 6, noise = 4)
Get Learner Details from LearnerList
Description
Function to get information available about a specific learner in LearnerList of the current R session.
Usage
getLearner(x)
Arguments
x |
a fitted model object. |
Details
The function returns the entry in LearnerList found for the class of the object submitted to the function.
Examples
library("partykit")
m <- ctree(Species ~ ., data = iris)
getLearner(m)
Visualizing Tree Stability Assessments
Description
Visualizations of tree stability assessments carried out via stabletree.
Usage
## S3 method for class 'stabletree'
plot(x, select = order(colMeans(x$vs), decreasing = TRUE),
type.breaks = "levels", col.breaks = "red", lty.breaks = "dashed",
cex.breaks = 0.7, col.main = c("black", "gray50"), main.uline = TRUE,
args.numeric = NULL, args.factor = NULL, args.ordered = NULL, main = NULL,
original = TRUE, ...)
## S3 method for class 'stabletree'
barplot(height, main = "Variable selection frequencies",
xlab = "", ylab = "", horiz = FALSE, col = gray.colors(2),
names.arg = NULL, names.uline = TRUE, names.diag = TRUE,
cex.names = 0.9,
ylim = if (horiz) NULL else c(0, 100), xlim = if (horiz) c(0, 100) else NULL,
original = TRUE, ...)
## S3 method for class 'stabletree'
image(x, main = "Variable selections",
ylab = "Repetitions", xlab = "", col = gray.colors(2),
names.arg = NULL, names.uline = TRUE, names.diag = TRUE,
cex.names = 0.9, xaxs = "i", yaxs = "i",
col.tree = 2, lty.tree = 2, xlim = c(0, length(x$vs0)), ylim = c(0, x$B),
original = TRUE, ...)
Arguments
x , height |
an object of class |
original |
logical. Should the original tree information be highlighted? |
select |
A vector of integer or character values representing the
number(s) or the name(s) of the variable(s) to be plotted. By default all
variables are plotted. The numbers correspond to the ordering of all
partitioning variables used in the call of the fitted model object that was
passed to |
type.breaks |
A character specifying the type of information added to
the lines that represent the splits in the complete data tree.
|
col.breaks |
Coloring of the lines and the texts that represent the splits in the complete data tree. |
lty.breaks |
Type of the lines that represent the splits in the complete data tree. |
cex.breaks |
Size of the texts that represent the splits in the complete data tree. |
col.main |
A vector of colors of length two. The first color is used for titles of variables that are selected in the complete data tree. The second color is used for titles of variables that are not selected in the complete data tree. |
main.uline |
A logical value. If |
args.numeric |
A list of arguments passed to the internal function
that is used for plotting a histogram of the cutpoints in numerical splits.
|
args.factor |
A list of arguments passed to the internal function that
is used for plotting an image plot of the cutpoints in categorical splits.
|
args.ordered |
A list of arguments passed to the internal function that
is used for plotting a barplot of the cutpoints in ordered categorical splits.
All arguments in the list are passed to the function
|
... |
further arguments passed to plotting functions, especially for labeling and annotation. |
main , xlab , ylab |
character. Annotations of axes and main title, respectively. |
horiz |
A logical value. If |
col |
A vector of colors of length two used for coloring in the
|
names.arg |
A vector of labels to be plotted below each bar (in case of
|
names.uline |
A logical value. If |
names.diag |
A logical value (omitted if |
cex.names |
Expansion factor for labels. |
xlim , ylim |
The limits of the plot. |
xaxs , yaxs |
The style of axis interval calculation to be used (see
|
col.tree , lty.tree |
color and line type to indicate differences from the original tree that was resampled. |
Details
plot visualizes the variability of the cutpoints.
barplot visualizes the variable selection frequency.
image visualizes the combinations of variables selected.
Examples
## build a tree
library("partykit")
m <- ctree(Species ~ ., data = iris)
plot(m)
## investigate stability
set.seed(0)
s <- stabletree(m, B = 500)
## show variable selection proportions
## with different labels and different ordering
barplot(s)
barplot(s, cex.names = 0.8)
barplot(s, names.diag = FALSE)
barplot(s, names.arg = c("a", "b", "c", "d"))
barplot(s, names.uline = FALSE)
barplot(s, col = c("lightgreen", "darkred"))
barplot(s, horiz = TRUE)
## illustrate variable selections of replications
## with different labels and different ordering
image(s)
image(s, cex.names = 0.8)
image(s, names.diag = FALSE)
image(s, names.arg = c("a", "b", "c", "d"))
image(s, names.uline = FALSE)
image(s, col = c("lightgreen", "darkred"))
## graphical cutpoint analysis, selecting variable by number and name
## with different numbers of break points
plot(s)
plot(s, select = 3)
plot(s, select = "Petal.Width")
plot(s, args.numeric = list(breaks = 40))
# change labels of splits in complete data tree
plot(s, select = 3, type.breaks = "levels")
plot(s, select = 3, type.breaks = "nodeids")
plot(s, select = 3, type.breaks = "breaks")
plot(s, select = 3, type.breaks = "none")
Similarity Measure Infrastructure for Stability Assessment with Ordinal Responses
Description
Functions that provide objects with functionality used by stability to measure the similarity between the predictions of two results in classification problems.
Usage
clagree()
ckappa()
bdist()
tvdist()
hdist()
jsdiv(base = 2)
Arguments
base |
A positive or complex number: the base with respect to which logarithms are computed. Defaults to 2. |
Details
The similarity measure functions provide objects that include functionality used by stability to measure the similarity between the probability predictions of two results in classification problems.
The clagree and ckappa functions provide an object that can be used to assess the similarity based on the predicted classes of two results. The predicted class is the class with the highest predicted probability.
The bdist (Bhattacharyya distance), tvdist (total variation distance), hdist (Hellinger distance) and jsdiv (Jensen-Shannon divergence) functions provide an object that can be used to assess the similarity based on the predicted class probabilities of two results.
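As a rough illustration of the two kinds of measures, the following base-R sketch computes a class agreement rate and two probability-based distances for a pair of small, made-up probability matrices. It uses textbook definitions evaluated per observation; the exact scaling and aggregation used inside clagree, tvdist and hdist may differ, so treat the formulas as assumptions.

```r
# Two hypothetical m x K probability matrices (m = 2 observations, K = 3 classes)
p1 <- rbind(c(0.8, 0.1, 0.1), c(0.2, 0.5, 0.3))
p2 <- rbind(c(0.6, 0.3, 0.1), c(0.2, 0.5, 0.3))

# Predicted class = column with the highest probability in each row
cl1 <- max.col(p1, ties.method = "first")
cl2 <- max.col(p2, ties.method = "first")
agreement <- mean(cl1 == cl2)

# Total variation distance per observation: half the L1 distance
tv <- 0.5 * rowSums(abs(p1 - p2))

# Hellinger distance per observation
hellinger <- sqrt(0.5 * rowSums((sqrt(p1) - sqrt(p2))^2))

c(agreement = agreement, tv = mean(tv), hellinger = mean(hellinger))
```

Both models predict the same classes here, so the class-based measure reports full agreement while the probability-based distances still register the difference in the first row.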
Examples
set.seed(0)
## build trees
library("partykit")
m1 <- ctree(Species ~ ., data = iris[sample(1:nrow(iris), replace = TRUE),])
m2 <- ctree(Species ~ ., data = iris[sample(1:nrow(iris), replace = TRUE),])
p1 <- predict(m1, type = "prob")
p2 <- predict(m2, type = "prob")
## class agreement
m <- clagree()
m$measure(p1, p2)
## cohen's kappa
m <- ckappa()
m$measure(p1, p2)
## bhattacharyya distance
m <- bdist()
m$measure(p1, p2)
## total variation distance
m <- tvdist()
m$measure(p1, p2)
## hellinger distance
m <- hdist()
m$measure(p1, p2)
## jensen-shannon divergence
m <- jsdiv()
m$measure(p1, p2)
## jensen-shannon divergence (base = exp(1))
m <- jsdiv(base = exp(1))
m$measure(p1, p2)
Similarity Measure Infrastructure for Stability Assessment with Numerical Responses
Description
Functions that provide objects with functionality used by stability to measure the similarity between numeric predictions of two results in regression problems.
Usage
edist()
msdist()
rmsdist()
madist()
qadist(p = 0.95)
cprob(kappa = 0.1)
rbfkernel()
tanimoto()
cosine()
ccc()
pcc()
Arguments
p |
A numeric value between 0 and 1 specifying the probability to which the sample quantile of the absolute distance between the predictions is computed. |
kappa |
A positive numeric value specifying the upper limit of the absolute distance between the predictions to which the coverage probability is computed. |
Details
The similarity measure functions provide objects that include functionality used by stability to measure the similarity between numeric predictions of two results in regression problems.
The edist (Euclidean distance), msdist (mean squared distance), rmsdist (root mean squared distance), madist (mean absolute distance) and qadist (quantile of absolute distance) functions implement scale-variant distance measures that are unbounded.
The cprob (coverage probability), rbfkernel (Gaussian radial basis function kernel), tanimoto (Tanimoto coefficient) and cosine (cosine similarity) functions implement scale-variant distance measures that are bounded.
The ccc (concordance correlation coefficient) and pcc (Pearson correlation coefficient) functions implement scale-invariant distance measures that are bounded between 0 and 1.
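To make the families of measures more concrete, here is a small base-R sketch that evaluates several of them on two made-up prediction vectors. The formulas are the usual textbook versions (for example, the coverage probability is taken as the share of absolute differences not exceeding kappa, and the concordance correlation uses population moments); whether the package applies exactly these definitions or rescales the results is an assumption.

```r
# Two hypothetical prediction vectors for the same m = 4 observations
p1 <- c(1, 2, 3, 4)
p2 <- c(1.5, 2, 2.5, 5)
d  <- p1 - p2

euclid   <- sqrt(sum(d^2))                             # Euclidean distance
msd      <- mean(d^2)                                  # mean squared distance
rmsd     <- sqrt(msd)                                  # root mean squared distance
mean_abs <- mean(abs(d))                               # mean absolute distance
q_abs    <- quantile(abs(d), probs = 0.95, names = FALSE)  # quantile of absolute distance
cover    <- mean(abs(d) <= 0.1)                        # coverage probability, kappa = 0.1

# Pearson and concordance correlation (the latter also penalizes shifts in mean and scale)
pearson <- cor(p1, p2)
m1 <- mean(p1); m2 <- mean(p2)
concord <- 2 * mean((p1 - m1) * (p2 - m2)) /
  (mean((p1 - m1)^2) + mean((p2 - m2)^2) + (m1 - m2)^2)

round(c(euclid = euclid, msd = msd, rmsd = rmsd, mean_abs = mean_abs,
        q_abs = q_abs, cover = cover, pearson = pearson, concord = concord), 3)
```

Rescaling both vectors by a constant changes the first group of values but leaves the two correlation coefficients untouched, which is the scale-variant versus scale-invariant distinction drawn above.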
Examples
set.seed(0)
library("partykit")
airq <- subset(airquality, !is.na(Ozone))
m1 <- ctree(Ozone ~ ., data = airq[sample(1:nrow(airq), replace = TRUE),])
m2 <- ctree(Ozone ~ ., data = airq[sample(1:nrow(airq), replace = TRUE),])
p1 <- predict(m1)
p2 <- predict(m2)
## euclidean distance
m <- edist()
m$measure(p1, p2)
## mean squared distance
m <- msdist()
m$measure(p1, p2)
## root mean squared distance
m <- rmsdist()
m$measure(p1, p2)
## mean absolute distance
m <- madist()
m$measure(p1, p2)
## quantile of absolute distance
m <- qadist()
m$measure(p1, p2)
## coverage probability
m <- cprob()
m$measure(p1, p2)
## gaussian radial basis function kernel
m <- rbfkernel()
m$measure(p1, p2)
## tanimoto coefficient
m <- tanimoto()
m$measure(p1, p2)
## cosine similarity
m <- cosine()
m$measure(p1, p2)
## concordance correlation coefficient
m <- ccc()
m$measure(p1, p2)
## pearson correlation coefficient
m <- pcc()
m$measure(p1, p2)
Extracting Similarity Values
Description
Extract similarity values from an object returned by stability for further illustration or analysis.
Usage
similarity_values(x, reverse = TRUE)
Arguments
x |
an object of class |
reverse |
logical. If |
Value
A numeric array of dimension 3 containing similarity values. The dimensions represent repetitions, results (fitted model objects) and similarity measures.
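An array with this layout can be summarized with apply. The sketch below uses a randomly filled stand-in array rather than real output from similarity_values; the dimension names ("ctree", "rpart", "measure_a", "measure_b") are hypothetical.

```r
# Stand-in for a 3-dimensional array: repetitions x results x similarity measures
set.seed(1)
sv <- array(runif(5 * 2 * 2), dim = c(5, 2, 2),
  dimnames = list(NULL, c("ctree", "rpart"), c("measure_a", "measure_b")))

# Median similarity per result (rows) and measure (columns),
# aggregating over the repetitions in the first dimension
med <- apply(sv, c(2, 3), median)
med
```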
See Also
stability, summary.stablelearnerList
Examples
library("partykit")
res <- ctree(Species ~ ., data = iris)
stab <- stability(res)
similarity_values(stab)
Control for Supervised Stability Assessments
Description
Various parameters that control aspects of the stability assessment performed via stability.
Usage
stab_control(B = 500, measure = list(tvdist, ccc), sampler = "bootstrap",
evaluate = "OOB", holdout = 0.25, seed = NULL, na.action = na.exclude,
savepred = TRUE, silent = TRUE, ...)
Arguments
B |
an integer value specifying the number of repetitions. The default
is |
measure |
a list of similarity measure (generating) functions. Those
should either be functions of |
sampler |
a resampling (generating) function. Either this should be a
function of |
evaluate |
a character specifying the evaluation strategy to be applied
(see Details below). The default is |
holdout |
a numeric value between zero and one that specifies the
proportion of observations held out for evaluation over all repetitions,
only if |
seed |
a single value, interpreted as an integer, see
|
na.action |
a function which indicates what should happen when the
predictions of the results contain |
savepred |
logical. Should the predictions from each iteration be
saved? If |
silent |
logical. If |
... |
arguments passed to |
Details
With the argument measure, one or more measures can be defined that are used to assess the stability of a result from supervised statistical learning by stability. Predefined similarity measures for the classification and the regression case are listed in similarity_measures_classification and similarity_measures_regression, respectively.
Users can define their own similarity functions f(p1, p2) that must return a single numeric value for the similarity between two results trained on resampled data sets. Such a function must take the arguments p1 and p2. In the classification case, p1 and p2 are probability matrices of size m * K, where m is the number of predicted observations (size of the evaluation sample) and K is the number of classes. In the regression case, p1 and p2 are numeric vectors of length m.
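A user-defined similarity function for the classification case could look like the sketch below. The function name max_diff_similarity and the measure itself (one minus the average largest absolute probability difference per observation) are invented for illustration; the only parts taken from the description above are the argument names p1 and p2, the m * K input shape and the single numeric return value.

```r
# Hypothetical similarity function for classification:
# p1 and p2 are m x K probability matrices, the result is one number
max_diff_similarity <- function(p1, p2) {
  stopifnot(is.matrix(p1), is.matrix(p2), all(dim(p1) == dim(p2)))
  # largest absolute probability difference per observation, averaged over rows
  1 - mean(apply(abs(p1 - p2), 1, max))
}

# Quick check on made-up predictions for m = 2 observations and K = 3 classes
p1 <- rbind(c(0.8, 0.1, 0.1), c(0.2, 0.5, 0.3))
p2 <- rbind(c(0.6, 0.3, 0.1), c(0.2, 0.5, 0.3))
max_diff_similarity(p1, p2)
```

Such a function would then be supplied through the measure argument of stab_control.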
A different way to implement new similarity functions for the current R session is to define a similarity measure generator function, which is a function without arguments that generates a list of five elements: the name of the similarity measure, the function to compute the similarity between the predictions as described above, a vector of character values specifying the response types for which the similarity measure can be used, a list containing two numeric elements lower and upper that specify the range of values of the similarity measure, and the function to invert (or reverse) the similarity values such that higher values indicate higher stability. The latter can be set to NULL if higher similarity values already indicate higher stability. Those elements should be named name, measure, classes, range and reverse.
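The five-element structure can be sketched as follows for the regression case. The generator name inv_mad and the measure it wraps are made up for illustration, and "numeric" as the only supported response type is an assumption; the element names follow the description above.

```r
# Hypothetical similarity measure generator for regression problems
inv_mad <- function() {
  list(
    name    = "Inverse mean absolute distance",
    # p1 and p2 are numeric prediction vectors of equal length
    measure = function(p1, p2) 1 / (1 + mean(abs(p1 - p2))),
    classes = "numeric",
    # the measure lies in (0, 1]; identical predictions give 1
    range   = list(lower = 0, upper = 1),
    # higher values already mean higher stability, so nothing to reverse
    reverse = NULL
  )
}

m <- inv_mad()
names(m)
m$measure(c(1, 2, 3), c(1, 2, 5))
```

Writing reverse = NULL inside list() keeps the element in the list, so all five names are present.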
The argument evaluate can be used to specify the evaluation strategy. If set to "ALL", all observations in the original data set are used for evaluation. If set to "OOB", only the pairwise out-of-bag observations are used for evaluation within each repetition. If set to "OOS", a fraction (defined by holdout) of the observations in the original data set is randomly sampled and used for evaluation, but not for training, over all repetitions.
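The three evaluation strategies differ only in which observations are predicted. The base-R sketch below shows one plausible reading in terms of row indices: "pairwise out-of-bag" is interpreted as observations contained in neither of the two learning samples, and the hold-out size is rounded down. Both points are assumptions, and the variable names are illustrative.

```r
set.seed(42)
n <- 20
all_idx <- seq_len(n)

# Two bootstrap learning samples drawn within one repetition
s1 <- sample.int(n, replace = TRUE)
s2 <- sample.int(n, replace = TRUE)

# "ALL": evaluate on every observation of the original data
eval_all <- all_idx

# "OOB": evaluate only on observations that appear in neither learning sample
eval_oob <- setdiff(all_idx, union(s1, s2))

# "OOS": set aside a fixed evaluation sample once, resample only from the rest
holdout <- 0.25
eval_oos   <- sample.int(n, size = floor(holdout * n))
train_pool <- setdiff(all_idx, eval_oos)

lengths(list(ALL = eval_all, OOB = eval_oob, OOS = eval_oos))
```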
The argument seed can be used to make similarity assessments comparable when comparing the stability of different results that were trained on the same data set. By default, seed is set to NULL and the learning samples are sampled independently for each fitted model object passed to stability. If seed is set to a specific number, the seed will be set for each fitted model object before the learning samples are generated using "L'Ecuyer-CMRG" (see set.seed), which guarantees identical learning samples for each stability assessment and, thus, comparability of the stability assessments between the results.
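The effect of a fixed seed can be reproduced in base R. The helper draw_sample below is hypothetical and only demonstrates that resetting the same seed with the "L'Ecuyer-CMRG" generator yields identical resampling indices; it does not replicate how stability handles the seed internally.

```r
# Hypothetical helper: draw one bootstrap index vector after fixing the seed
draw_sample <- function(seed, n = 10) {
  old_kind <- RNGkind()[1]
  on.exit(RNGkind(kind = old_kind))  # restore the previous generator afterwards
  set.seed(seed, kind = "L'Ecuyer-CMRG")
  sample.int(n, replace = TRUE)
}

a <- draw_sample(0)  # learning sample for the first model
b <- draw_sample(0)  # learning sample for the second model
identical(a, b)
```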
Examples
library("partykit")
res <- ctree(Species ~ ., data = iris)
## fewer repetitions
stability(res, control = stab_control(B = 100))
## Not run:
## change similarity measure
stability(res, control = stab_control(measure = list(bdist)))
## change evaluation strategy
stability(res, control = stab_control(evaluate = "ALL"))
stability(res, control = stab_control(evaluate = "OOS"))
## change resampling strategy to subsampling
stability(res, control = stab_control(sampler = subsampling))
stability(res, control = stab_control(sampler = subsampling, evaluate = "ALL"))
stability(res, control = stab_control(sampler = subsampling, evaluate = "OOS"))
## change resampling strategy to splithalf
stability(res, control = stab_control(sampler = splithalf, evaluate = "ALL"))
stability(res, control = stab_control(sampler = splithalf, evaluate = "OOS"))
## End(Not run)
Stability Assessment for Results from Supervised Statistical Learning
Description
Stability assessment of results from supervised statistical learning (e.g., recursive partitioning, support vector machines or neural networks). The procedure involves the pairwise comparison of results generated from learning samples randomly drawn from the original data set or directly from the data-generating process (if available).
Usage
stability(x, ..., data = NULL, control = stab_control(), weights = NULL,
applyfun = NULL, cores = NULL, names = NULL)
Arguments
x |
fitted model object. Any model object can be used whose class is
registered in |
... |
additional fitted model objects. |
data |
an optional |
control |
a list with control parameters, see |
weights |
an optional matrix of dimension n * B that can be used to
weight the observations from the original learning data when the models
are refitted. If |
applyfun |
a |
cores |
integer. The number of cores to use in multicore computations
using |
names |
a vector of characters to specify a name for each fitted model object. By default, the objects are named by their class. |
Details
Assesses the (overall) stability of a result from supervised statistical learning by quantifying the similarity of realizations from the distribution of possible results (given the algorithm, the formulated model, the data-generating process, the sample size, etc.). The stability distribution is estimated by repeatedly assessing the similarity between the results generated by training the algorithm on two different learning samples, by means of a similarity metric. The learning samples are generated by sampling from the learning data or the data-generating process in case of a simulation study. For more details, see Philipp et al. (2018).
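The procedure can be mimicked by hand in base R. The sketch below is a simplified stand-in for what stability automates, not its implementation: it uses lm on the built-in cars data as the learner, bootstrap learning samples, evaluation on all observations, and the root mean squared distance between predictions as the (dis)similarity value. All of these choices are assumptions made for the illustration.

```r
set.seed(123)
B <- 50           # number of repetitions
n <- nrow(cars)

rmsd_values <- replicate(B, {
  # draw two learning samples from the original data
  i1 <- sample.int(n, replace = TRUE)
  i2 <- sample.int(n, replace = TRUE)
  # train the same learner on each learning sample
  m1 <- lm(dist ~ speed, data = cars[i1, ])
  m2 <- lm(dist ~ speed, data = cars[i2, ])
  # predict the evaluation sample (here: all observations)
  p1 <- predict(m1, newdata = cars)
  p2 <- predict(m2, newdata = cars)
  # compare the two sets of predictions
  sqrt(mean((p1 - p2)^2))
})

# The B values approximate the distribution that is summarized afterwards
summary(rmsd_values)
```

Smaller distances indicate that the two fitted models predict similarly, that is, a more stable result.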
Value
For a single fitted model object, stability returns an object of class "stablelearner" with the following components:
call |
the call from the model object |
learner |
the information about the learner retrieved from |
B |
the number of repetitions, |
sval |
a matrix containing the estimated similarity values for each
similarity measure specified in |
sampstat |
a list containing information on the size of the learning
samples ( |
data |
a language object referring to the |
control |
a list with control parameters used for assessing the stability, |
For several fitted model objects, stability returns an object of class "stablelearnerList" which is a list of objects of class "stablelearner".
References
Philipp M, Rusch T, Hornik K, Strobl C (2018). “Measuring the Stability of Results from Supervised Statistical Learning”. Journal of Computational and Graphical Statistics, 27(4), 685–700. doi:10.1080/10618600.2018.1473779
See Also
boxplot.stablelearnerList, summary.stablelearner
Examples
## assessing the stability of a single result
library("partykit")
r1 <- ctree(Species ~ ., data = iris)
stab <- stability(r1)
summary(stab)
## assessing the stability of several results
library("rpart")
r2 <- rpart(Species ~ ., data = iris)
stab <- stability(r1, r2, control = stab_control(seed = 0))
summary(stab, names = c("ctree", "rpart"))
## using case-weights instead of resampling
stability(r1, weights = TRUE)
## using self-defined case-weights
n <- nrow(iris)
B <- 500
w <- array(sample(c(0, 1), size = n * B * 3, replace = TRUE), dim = c(n, B, 3))
stability(r1, weights = w)
## assessing stability for a given data-generating process
my_dgp <- function() dgp_twoclass(n = 100, p = 2, noise = 4, rho = 0.2)
res <- ctree(class ~ ., data = my_dgp())
stability(res, data = my_dgp)
Stability Assessment for Tree Learners
Description
Stability assessment of variable and cutpoint selection in tree learners (i.e., recursive partitioning). By refitting trees to resampled versions of the learning data, the stability of the tree structure is assessed and can be summarized and visualized.
Usage
stabletree(x, data = NULL, sampler = subsampling, weights = NULL,
applyfun = NULL, cores = NULL, savetrees = FALSE, ...)
Arguments
x |
fitted model object. Any tree-based model object that can be coerced
by |
data |
an optional |
sampler |
a resampling (generating) function. Either this should be a function
of |
weights |
an optional matrix of dimension n * B that can be used to
weight the observations from the original learning data when the trees
are refitted. If |
applyfun |
a |
cores |
integer. The number of cores to use in multicore computations
using |
savetrees |
logical. If |
... |
further arguments passed to |
Details
The function stabletree assesses the stability of tree learners (i.e., recursive partitioning methods) by refitting the tree to resampled versions of the learning data. By default, if data = NULL, the fitting call is extracted by getCall to infer the learning data. Subsequently, the sampler generates B resampled versions of the learning data, the tree is regrown with update, and (if necessary) coerced by as.party. For each of the resampled trees it is queried and stored which variables are selected for splitting and what the selected cutpoints are.
The resulting object of class "stabletree" comes with a set of standard methods to generic functions including print, summary for numerical summaries and plot, barplot, and image for graphical representations. See plot.stabletree for more details. In most methods, the argument original can be set to TRUE or FALSE, turning highlighting of the original tree information on and off.
Value
stabletree returns an object of class "stabletree" which is a list with the following components:
call |
the call from the model object |
B |
the number of resampled datasets, |
sampler |
the |
vs0 |
numeric vector of the variable selections of the original tree, |
br0 |
list of the break points (list of |
vs |
numeric matrix of the variable selections for each resampled dataset, |
br |
list of the break points (only the |
classes |
character vector indicating the classes of all partitioning variables, |
trees |
a list of tree objects of class |
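The vs component lends itself to simple summaries. The sketch below works on a made-up matrix and assumes that each entry counts how often a variable was selected for splitting in one resampled tree, which this page does not state explicitly; the variable names x1 to x3 are hypothetical.

```r
# Stand-in for the vs component: one row per resampled tree, one column per variable
set.seed(1)
vs <- matrix(rpois(6 * 3, lambda = 0.8), nrow = 6,
  dimnames = list(NULL, c("x1", "x2", "x3")))

# Share of trees in which each variable was selected at least once
sel_freq <- colMeans(vs > 0)

# Average number of splits per tree for each variable
sel_mean <- colMeans(vs)

# Variables ordered from most to least frequently selected
sort(sel_freq, decreasing = TRUE)
```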
References
Hothorn T, Zeileis A (2015). partykit: A Modular Toolkit for Recursive Partytioning in R. Journal of Machine Learning Research, 16(118), 3905–3909.
Philipp M, Zeileis A, Strobl C (2016). “A Toolkit for Stability Assessment of Tree-Based Learners”. In A. Colubi, A. Blanco, and C. Gatu (Eds.), Proceedings of COMPSTAT 2016 – 22nd International Conference on Computational Statistics (pp. 315–325). The International Statistical Institute/International Association for Statistical Computing. Preprint available at https://EconPapers.RePEc.org/RePEc:inn:wpaper:2016-11
See Also
plot.stabletree, as.stabletree, as.party
Examples
## build a simple tree
library("partykit")
m <- ctree(Species ~ ., data = iris)
plot(m)
## investigate stability
set.seed(0)
s <- stabletree(m, B = 500)
print(s)
## variable selection statistics
summary(s)
## show variable selection proportions
barplot(s)
## illustrate variable selections of replications
image(s)
## graphical cutpoint analysis
plot(s)
Coercion Functions
Description
Functions coercing various forest objects to objects of class "stabletree".
Usage
as.stabletree(x, ...)
## S3 method for class 'randomForest'
as.stabletree(x, applyfun = NULL, cores = NULL, ...)
## S3 method for class 'RandomForest'
as.stabletree(x, applyfun = NULL, cores = NULL, ...)
## S3 method for class 'cforest'
as.stabletree(x, applyfun = NULL, cores = NULL, savetrees = FALSE, ...)
## S3 method for class 'ranger'
as.stabletree(x, applyfun = NULL, cores = NULL, ...)
Arguments
x |
an object of class |
applyfun |
a |
cores |
integer. The number of cores to use in multicore computations
using |
savetrees |
logical. If |
... |
additional arguments (currently not used). |
Details
Random forests fitted using randomForest (from randomForest), cforest (from party), cforest (from partykit) or ranger (from ranger) are coerced to "stabletree" objects.
Note that when plotting a coerced randomForest or ranger object, the gray areas of levels of a nominal variable do not mimic exactly the same behavior as for classical "stabletree" objects, because randomForest and ranger do not store any information on whether any individuals were left fulfilling the splitting criterion in the subsample. Therefore, gray areas only indicate that this level of this variable has already been used in a split before in such a way that it could not be used for any further splits.
For ranger, interaction terms are (currently) not supported.
Value
as.stabletree returns an object of class "stabletree" which is a list with the following components:
call |
the call from the model object |
B |
the number of trees of the random forest, |
sampler |
the random forest fitting function, |
vs0 |
numeric vector of the variable selections of the original tree, here always a vector of zeros because there is no original tree, |
br0 |
always |
vs |
numeric matrix of the variable selections for each tree of the random forest, |
br |
list of the break points (only the |
classes |
character vector indicating the classes of all partitioning variables, |
trees |
a list of tree objects of class |
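The components listed above can be inspected directly on the returned list. The following is a minimal sketch, assuming that the randomForest package is installed and that rows of vs index trees while columns index partitioning variables (an assumption, not stated above). The value ntree = 50 is chosen only to keep the run short.

```r
library("stablelearner")
library("randomForest")
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 50)
srf <- as.stabletree(rf)
## number of trees of the random forest
srf$B
## there is no original tree, so vs0 contains only zeros
srf$vs0
## share of trees in which each variable was selected at least once
## (assumes rows of vs are trees and columns are variables)
colMeans(srf$vs > 0)
```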
See Also
Examples
## build a randomForest using randomForest
library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris)
## coerce to a stabletree
srf <- as.stabletree(rf)
print(srf)
summary(srf, original = FALSE) # there is no original tree
barplot(srf)
image(srf)
plot(srf)
## build a RandomForest using party
library("party")
set.seed(2)
cf_party <- cforest(Species ~ ., data = iris,
control = cforest_unbiased(mtry = 2))
## coerce to a stabletree
scf_party <- as.stabletree(cf_party)
print(scf_party)
summary(scf_party, original = FALSE)
barplot(scf_party)
image(scf_party)
plot(scf_party)
## build a cforest using partykit
library("partykit")
set.seed(3)
cf_partykit <- cforest(Species ~ ., data = iris)
## coerce to a stabletree
scf_partykit <- as.stabletree(cf_partykit)
print(scf_partykit)
summary(scf_partykit, original = FALSE)
barplot(scf_partykit)
image(scf_partykit)
plot(scf_partykit)
## build a random forest using ranger
library("ranger")
set.seed(4)
rf_ranger <- ranger(Species ~ ., data = iris)
## coerce to a stabletree
srf_ranger <- as.stabletree(rf_ranger)
print(srf_ranger)
summary(srf_ranger, original = FALSE)
barplot(srf_ranger)
image(srf_ranger)
plot(srf_ranger)
Summarize Results from Stability Assessment
Description
Summarizes and prints the results from stability assessments performed by
stability.
Usage
## S3 method for class 'stablelearnerList'
summary(object, ..., reverse = TRUE,
probs = c(0.05, 0.25, 0.5, 0.75, 0.95), digits = 3, names = NULL)
Arguments
object |
an object of class |
... |
Arguments passed from or to other functions (currently ignored). |
reverse |
logical. If |
digits |
integer. The number of digits used to summarize the similarity
distribution in the |
probs |
a vector of probabilities used to summarize the similarity
distribution in the |
names |
a vector of characters to specify a name for each result from
statistical learning in the |
See Also
Examples
library("partykit")
rval <- ctree(Species ~ ., data = iris)
stab <- stability(rval)
summary(stab)
summary(stab, reverse = FALSE)
summary(stab, probs = c(0.25, 0.5, 0.75))
summary(stab, names = "conditional inference tree")
Passengers and Crew on the RMS Titanic
Description
The titanic data is a complete list of passengers and crew members on
the RMS Titanic. It includes a variable indicating whether a person
survived the sinking of the RMS Titanic on April 15, 1912.
Usage
data("titanic")
Format
A data frame containing 2207 observations on 11 variables.
- name
a string with the name of the passenger.
- gender
a factor with levels male and female.
- age
a numeric value with the person's age on the day of the sinking. The age of babies (under 12 months) is given as a fraction of one year (1/month).
- class
a factor specifying the class for passengers or the type of service aboard for crew members.
- embarked
a factor with the person's place of embarkation.
- country
a factor with the person's home country.
- ticketno
a numeric value specifying the person's ticket number (NA for crew members).
- fare
a numeric value with the ticket price (NA for crew members, musicians and employees of the shipyard company).
- sibsp
an ordered factor specifying the number of siblings/spouses aboard; adopted from the Vanderbilt data set (see below).
- parch
an ordered factor specifying the number of parents/children aboard; adopted from the Vanderbilt data set (see below).
- survived
a factor with two levels (no and yes) specifying whether the person survived the sinking.
Details
The website https://www.encyclopedia-titanica.org/ offers detailed information about passengers and crew members on the RMS Titanic. According to the website, 1317 passengers and 890 crew members were aboard.
8 musicians and 9 employees of the shipyard company are listed as
passengers but travelled with a free ticket, which is why they have NA
values in fare. In addition, fare is truly missing for a few regular
passengers.
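The pattern of missing fares described above can be checked by cross-tabulating class against an indicator for missing fare values. This is a minimal sketch, assuming the stablelearner package is installed; the label missing_fare is introduced only for this illustration.

```r
data("titanic", package = "stablelearner")
## count missing and non-missing fares per class / type of service
with(titanic, table(class, missing_fare = is.na(fare)))
## total number of persons without a recorded fare
sum(is.na(titanic$fare))
```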
Source
The complete list of persons on the RMS Titanic was downloaded from
https://www.encyclopedia-titanica.org/ on April 5, 2016. The
information given in sibsp and parch was adopted from a data
set obtained from https://hbiostat.org/data/.
References
https://www.encyclopedia-titanica.org/ and https://hbiostat.org/data/.
Examples
data("titanic", package = "stablelearner")
summary(titanic)
Tuning Wrapper Function
Description
Convenience function to train a method using different tuning parameters.
Usage
tuner(method, tunerange, ...)
Arguments
method |
a character string. Name of the R function to train the method. |
tunerange |
a list. A list that specifies the range of values to be used for each tuning parameter. Each element of the list should be a vector that specifies the values to be tested for the tuning parameter. The element must be named after the corresponding tuning parameter of the method (see examples). |
... |
additional information passed to |
Details
This function can be used to train any method using different values for its
tuning parameter(s). The result can be passed directly to stability to compare
the stability of results based on different values of the tuning parameter.
Value
A list that contains all fitted model objects.
Additional information about the range of values used for the tuning parameters is attached to the resulting object as an attribute.
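To illustrate the idea of such a wrapper, the following base R sketch fits a method once per combination of tuning values and attaches the tuning range as an attribute. The helper tuner_sketch is hypothetical and is not the package's implementation of tuner; in particular, the use of a full grid over several tuning parameters is an assumption made here. The choice of loess and the cars data is only for illustration.

```r
## Hypothetical helper, NOT the package's tuner(): fit `method` once per
## combination of the tuning values and keep the range as an attribute.
tuner_sketch <- function(method, tunerange, ...) {
  ## all combinations of the supplied tuning values (assumed full grid)
  grid <- expand.grid(tunerange, stringsAsFactors = FALSE)
  fits <- lapply(seq_len(nrow(grid)), function(i) {
    ## fixed arguments from ... plus one row of tuning values
    do.call(method, c(list(...), as.list(grid[i, , drop = FALSE])))
  })
  attr(fits, "range") <- tunerange
  fits
}

fits <- tuner_sketch("loess",
  tunerange = list(span = c(0.75, 1), degree = 1:2),
  formula = dist ~ speed, data = cars)
## one fitted model per combination of span and degree
length(fits)
attr(fits, "range")
```

Each element of the returned list is an ordinary fitted model object, so it can be inspected with the usual methods for that model class.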
See Also
Examples
library("partykit")
## tuning cforest using different values of its tuning parameter mtry
r <- tuner("cforest", tunerange = list(mtry = 1:4), formula = Species ~ ., data = iris)
## assess stability (with B = 10 for illustration to avoid excessive computation times)
stability(r, control = stab_control(seed = 1234, B = 10))
## receive information about the range of tuning parameters
attr(r, "range")