Version: | 1.0.4 |
Title: | Prediction Explanation with Dependence-Aware Shapley Values |
Description: | Complex machine learning models are often hard to interpret. However, in many situations it is crucial to understand and explain why a model made a specific prediction. Shapley values are the only prediction explanation framework with a solid theoretical foundation. Previously known methods for estimating the Shapley values do, however, assume feature independence. This package implements methods that account for any feature dependence and thereby produce more accurate estimates of the true Shapley values. An accompanying 'Python' wrapper ('shaprpy') is available through the GitHub repository. |
URL: | https://norskregnesentral.github.io/shapr/, https://github.com/NorskRegnesentral/shapr/ |
BugReports: | https://github.com/NorskRegnesentral/shapr/issues |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
ByteCompile: | true |
Language: | en-US |
RoxygenNote: | 7.3.2 |
Depends: | R (≥ 3.5.0) |
Imports: | stats, data.table (≥ 1.15.0), Rcpp (≥ 0.12.15), Matrix, future.apply, methods, cli, rlang |
Suggests: | ranger, xgboost, mgcv, testthat (≥ 3.0.0), knitr, rmarkdown, roxygen2, ggplot2, gbm, party, partykit, waldo, progressr, future, ggbeeswarm, vdiffr, forecast, torch, GGally, coro, parsnip, recipes, workflows, tune, dials, yardstick, hardhat, rsample |
LinkingTo: | RcppArmadillo, Rcpp |
VignetteBuilder: | knitr |
Config/testthat/edition: | 3 |
NeedsCompilation: | yes |
Packaged: | 2025-04-28 11:26:18 UTC; jullum |
Author: | Martin Jullum |
Maintainer: | Martin Jullum <Martin.Jullum@nr.no> |
Repository: | CRAN |
Date/Publication: | 2025-04-28 13:00:02 UTC |
shapr: Prediction Explanation with Dependence-Aware Shapley Values
Description
Complex machine learning models are often hard to interpret. However, in many situations it is crucial to understand and explain why a model made a specific prediction. Shapley values are the only prediction explanation framework with a solid theoretical foundation. Previously known methods for estimating the Shapley values do, however, assume feature independence. This package implements methods that account for any feature dependence and thereby produce more accurate estimates of the true Shapley values. An accompanying 'Python' wrapper ('shaprpy') is available through the GitHub repository.
Author(s)
Maintainer: Martin Jullum Martin.Jullum@nr.no (ORCID)
Authors:
Lars Henry Berge Olsen lhbolsen@nr.no (ORCID)
Annabelle Redelmeier ardelmeier@gmail.com
Jon Lachmann Jon@lachmann.nu (ORCID)
Nikolai Sellereite nikolaisellereite@gmail.com (ORCID)
Other contributors:
Anders Løland Anders.Loland@nr.no [contributor]
Jens Christian Wahl jens.c.wahl@gmail.com [contributor]
Camilla Lingjærde [contributor]
Norsk Regnesentral [copyright holder, funder]
See Also
Useful links:
Report bugs at https://github.com/NorskRegnesentral/shapr/issues
Additional setup for regression-based methods
Description
Additional setup for regression-based methods
Usage
additional_regression_setup(internal, model, predict_model)
Arguments
internal |
List.
Not used directly, but passed through from |
Value
The (updated) internal list
AICc formula for several sets, alternative definition
Description
AICc formula for several sets, alternative definition
Usage
aicc_full_cpp(h, X_list, mcov_list, S_scale_dist, y_list, negative)
Arguments
h |
numeric specifying the scaling (sigma) |
X_list |
List. Contains matrices with the appropriate features of the training data |
mcov_list |
List. Contains the covariance matrices of the matrices in X_list |
S_scale_dist |
Logical. Indicates whether Mahalanobis distance should be scaled with the number of variables. |
y_list |
List. Contains the appropriate (temporary) response variables. |
negative |
Logical. Whether to return the negative of the AICc value. |
Value
Scalar with the numeric value of the AICc formula
Author(s)
Martin Jullum
Temp-function for computing the full AICc with several X's etc
Description
Temp-function for computing the full AICc with several X's etc
Usage
aicc_full_single_cpp(X, mcov, S_scale_dist, h, y)
Arguments
X |
matrix. |
mcov |
matrix The covariance matrix of X. |
S_scale_dist |
logical. Indicating whether the Mahalanobis distance should be scaled with the number of variables |
h |
numeric specifying the scaling (sigma) |
y |
Vector Representing the (temporary) response variable |
Value
Scalar with the numeric value of the AICc formula.
Author(s)
Martin Jullum
Appends the new vS_list to the prev vS_list
Description
Appends the new vS_list to the prev vS_list
Usage
append_vS_list(vS_list, internal)
Arguments
vS_list |
List
Output from |
internal |
List.
Not used directly, but passed through from |
Value
The vS_list after being merged with previously computed vS_lists (stored in internal)
A torch::nn_module() representing a categorical_to_one_hot_layer
Description
The categorical_to_one_hot_layer module/layer expands categorical features into one-hot vectors, because multi-layer perceptrons are known to work better with this data representation. It also replaces NaNs with zeros so that further layers may work correctly.
Usage
categorical_to_one_hot_layer(
one_hot_max_sizes,
add_nans_map_for_columns = NULL
)
Arguments
one_hot_max_sizes |
A torch tensor of dimension |
add_nans_map_for_columns |
Optional list containing indices of columns whose is_nan masks are to be appended to the result tensor. This option is necessary for the full encoder to distinguish whether a value is to be reconstructed or not. |
Details
Note that the module works with mixed data represented as 2-dimensional inputs, and it works correctly with missing values in the ground truth as long as they are represented by NaNs.
Author(s)
Lars Henry Berge Olsen
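The one-hot expansion described above can be illustrated without torch. The following is a minimal base R sketch, not the package's implementation: the helper one_hot_encode() is hypothetical, and it assumes categories are coded as the integers 1, ..., n_classes.

```r
# Hypothetical helper (not part of shapr): one-hot encode an integer-coded
# categorical vector. Assumes the categories are coded 1, ..., n_classes.
one_hot_encode <- function(x, n_classes) {
  out <- matrix(0, nrow = length(x), ncol = n_classes)
  observed <- !is.na(x) # is.na() is TRUE for both NA and NaN
  # Set a single 1 per observed row; rows with missing values stay all zero
  out[cbind(which(observed), x[observed])] <- 1
  out
}

x <- c(1, 3, NaN, 2)
encoded <- one_hot_encode(x, n_classes = 3)

# Optionally append an is_nan mask column, so that later layers can tell
# a missing value apart from an all-zero encoding
encoded_with_mask <- cbind(encoded, is_nan = as.numeric(is.nan(x)))
```

The mask column plays the role that add_nans_map_for_columns is described as serving: without it, a missing value is indistinguishable from a row of zeros.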
Check that all explicands have at least one valid MC sample in causal Shapley values
Description
Check that all explicands have at least one valid MC sample in causal Shapley values
Usage
check_categorical_valid_MCsamp(dt, n_explain, n_MC_samples, joint_prob_dt)
Arguments
dt |
Data.table containing the generated MC samples (and conditional values) after each sampling step |
n_MC_samples |
Positive integer.
For most approaches, it indicates the maximum number of samples to use in the Monte Carlo integration
of every conditional expectation.
For |
Details
For undocumented arguments, see setup_approach.categorical().
Author(s)
Lars Henry Berge Olsen
Checks the convergence according to the convergence threshold
Description
Checks the convergence according to the convergence threshold
Usage
check_convergence(internal)
Arguments
internal |
List.
Not used directly, but passed through from |
Value
The (updated) internal list
Check that the group parameter has the right form and content
Description
Check that the group parameter has the right form and content
Usage
check_groups(feature_names, group)
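The exact checks performed are not listed here. As a rough sketch of what validating a group specification could involve, the base R function below enforces a few conditions that are assumptions on our part (named list, every feature in exactly one group, no unknown features); check_groups_sketch() is a hypothetical name and not the package's function.

```r
# Hypothetical validator (not shapr's check_groups): the conditions below are
# assumed, plausible requirements for a feature grouping.
check_groups_sketch <- function(feature_names, group) {
  if (!is.list(group) || is.null(names(group)) || any(names(group) == "")) {
    stop("`group` must be a named list.")
  }
  group_features <- unlist(group, use.names = FALSE)
  if (!is.character(group_features)) {
    stop("All elements of `group` must be character vectors.")
  }
  if (anyDuplicated(group_features) > 0) {
    stop("Each feature can only appear in one group.")
  }
  unknown <- setdiff(group_features, feature_names)
  if (length(unknown) > 0) {
    stop("Unknown features in `group`: ", paste(unknown, collapse = ", "))
  }
  not_grouped <- setdiff(feature_names, group_features)
  if (length(not_grouped) > 0) {
    stop("Features missing from `group`: ", paste(not_grouped, collapse = ", "))
  }
  invisible(TRUE)
}

feature_names <- c("Solar.R", "Wind", "Temp", "Month")
ok <- check_groups_sketch(
  feature_names,
  list(A = c("Temp", "Month"), B = c("Wind", "Solar.R"))
)
```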
Function that checks the verbose parameter
Description
Function that checks the verbose parameter
Usage
check_verbose(verbose)
Arguments
verbose |
String vector or NULL.
Specifies the verbosity (printout detail level) through one or more of strings |
Value
The function does not return anything.
Author(s)
Lars Henry Berge Olsen, Martin Jullum
Printing messages in compute_vS with cli
Description
Printing messages in compute_vS with cli
Usage
cli_compute_vS(internal)
Arguments
internal |
List.
Not used directly, but passed through from |
Value
No return value (but prints compute_vS messages with cli)
Printing messages in iterative procedure with cli
Description
Printing messages in iterative procedure with cli
Usage
cli_iter(verbose, internal, iter)
Arguments
verbose |
String vector or NULL.
Specifies the verbosity (printout detail level) through one or more of strings |
internal |
List.
Not used directly, but passed through from |
iter |
Integer. The iteration number. Only used internally. |
Value
No return value (but prints iterative messages with cli)
Printing startup messages with cli
Description
Printing startup messages with cli
Usage
cli_startup(internal, model_class, verbose)
Arguments
internal |
List.
Not used directly, but passed through from |
model_class |
String. Class of the model as a string |
verbose |
String vector or NULL.
Specifies the verbosity (printout detail level) through one or more of strings |
Value
No return value (but prints startup messages with cli)
Create a header topline with cli
Description
Create a header topline with cli
Usage
cli_topline(verbose, testing, init_time, type, is_python)
Arguments
verbose |
String vector or NULL.
Specifies the verbosity (printout detail level) through one or more of strings |
testing |
Logical.
Only use to remove random components like timing from the object output when comparing output with testthat.
Defaults to |
init_time |
POSIXct object.
The time when the |
type |
Character.
Either "regular" or "forecast" corresponding to function |
is_python |
Logical.
Indicates whether the function is called from the Python wrapper.
Default is FALSE which is never changed when calling the function via |
Value
No return value (but prints header with cli unless verbose is NULL)
Get coalition matrix
Description
Get coalition matrix
Usage
coalition_matrix_cpp(coalitions, m)
Arguments
coalitions |
List. Each of the elements equals an integer vector representing a valid combination of features/feature groups. |
m |
Integer. Number of features/feature groups. |
Value
Matrix
Author(s)
Nikolai Sellereite, Martin Jullum
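To make the input and output concrete, here is a base R sketch of turning a list of coalitions into a binary matrix. It only illustrates the data structure; the function name coalition_matrix() is hypothetical, and the compiled routine may differ in details such as storage type.

```r
# Hypothetical base R counterpart: row i has a 1 in column j when
# feature/group j is a member of coalition i
coalition_matrix <- function(coalitions, m) {
  mat <- matrix(0L, nrow = length(coalitions), ncol = m)
  for (i in seq_along(coalitions)) {
    mat[i, coalitions[[i]]] <- 1L
  }
  mat
}

# Empty coalition, {1}, {1, 3} and the grand coalition for m = 3
coalitions <- list(integer(0), 1L, c(1L, 3L), 1:3)
S <- coalition_matrix(coalitions, m = 3)
```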
Mean Squared Error of the Contribution Function v(S)
Description
Function that computes the Mean Squared Error (MSEv) of the contribution function v(S) as proposed by Frye et al. (2019) and used by Olsen et al. (2022).
Usage
compute_MSEv_eval_crit(
internal,
dt_vS,
MSEv_uniform_comb_weights,
MSEv_skip_empty_full_comb = TRUE
)
Arguments
internal |
List.
Holds all parameters, data, functions and computed objects used within |
dt_vS |
Data.table of dimension |
MSEv_uniform_comb_weights |
Logical.
If |
MSEv_skip_empty_full_comb |
Logical. If |
Details
The MSEv evaluation criterion does not rely on access to the true contribution functions or the true Shapley values to be computed. A lower value indicates better approximations. However, the scale and magnitude of the MSEv criterion are not directly interpretable with regard to the precision of the final estimated Shapley values. Olsen et al. (2024) illustrate in Figure 11 a fairly strong linear relationship between the MSEv criterion and the MAE between the estimated and true Shapley values in a simulation study. Note that explicands refer to the observations whose predictions we are to explain.
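The criterion is the squared difference between the model prediction and the estimated v(S), averaged over coalitions and explicands. A minimal base R sketch on simulated numbers follows; the matrix layout (coalitions in rows, explicands in columns) and all object names are assumptions made for illustration, and the empty and grand coalitions are taken to be excluded already.

```r
set.seed(1)
n_explain <- 4
n_coalitions <- 6

# Simulated stand-ins: predictions f(x_i) and contribution values v(S, x_i)
pred_explain <- rnorm(n_explain)
vS <- matrix(rnorm(n_coalitions * n_explain), nrow = n_coalitions)

# Squared errors (v(S, x_i) - f(x_i))^2; sweep() subtracts column-wise
sq_err <- sweep(vS, 2, pred_explain)^2

MSEv_explicand <- colMeans(sq_err) # averaged over the coalitions
MSEv_coalition <- rowMeans(sq_err) # averaged over the explicands
MSEv <- mean(sq_err)               # averaged over both

# Standard deviation of the per-explicand values, scaled by sqrt(n_explain)
MSEv_sd <- sd(MSEv_explicand) / sqrt(n_explain)
```

Because every coalition is evaluated for every explicand, the overall value equals the mean of either set of marginal averages.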
Value
List containing:
MSEv
A data.table with the overall MSEv evaluation criterion averaged over both the coalitions and observations/explicands. The data.table also contains the standard deviation of the MSEv values for each explicand (only averaged over the coalitions) divided by the square root of the number of explicands.
MSEv_explicand
A data.table with the mean squared error for each explicand, i.e., only averaged over the coalitions.
MSEv_coalition
A data.table with the mean squared error for each coalition, i.e., only averaged over the explicands/observations. The data.table also contains the standard deviation of the MSEv values for each coalition divided by the square root of the number of explicands.
Author(s)
Lars Henry Berge Olsen
Computes the Shapley values and their standard deviation given the v(S)
Description
Computes the Shapley values and their standard deviation given the v(S)
Usage
compute_estimates(internal, vS_list)
Arguments
internal |
List.
Not used directly, but passed through from |
vS_list |
List
Output from |
Value
The (updated) internal list
Compute Shapley values
Description
Compute Shapley values
Usage
compute_shapley(internal, dt_vS)
Arguments
internal |
List.
Holds all parameters, data, functions and computed objects used within |
dt_vS |
The contribution matrix. |
Value
A data.table with Shapley values for each test observation.
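The step from contribution values v(S) to Shapley values can be sketched as a weighted least squares fit in the kernelSHAP style. The base R example below is an illustration under stated assumptions rather than the package's code: the toy contribution function, the object names, and the use of 10^6 in place of the infinite weights of the empty and grand coalitions are chosen for the example.

```r
m <- 3

# All 2^m coalitions as rows of a binary matrix
S <- as.matrix(expand.grid(rep(list(0:1), m)))
size <- rowSums(S)

# Shapley kernel weights; the infinite weights of the empty and grand
# coalitions are replaced by a large constant
w <- (m - 1) / (choose(m, size) * size * (m - size))
w[!is.finite(w)] <- 10^6

# Toy contribution function: additive effects plus an interaction
# between features 1 and 2
v <- 10 + S %*% c(1, -2, 0.5) + 3 * (S[, 1] * S[, 2])

# Weighted least squares: the first coefficient estimates phi0,
# the remaining coefficients are the Shapley values
Z <- cbind(1, S)
phi <- solve(t(Z) %*% (w * Z), t(Z) %*% (w * v))
```

For this toy game the interaction of 3 is shared equally between features 1 and 2, so the fit should come out close to 10 for the intercept and 2.5, -0.5 and 0.5 for the features. The values are approximate because the large finite weight only approximates the exact constraints.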
Gathers and computes the timing of the different parts of the explain function.
Description
Gathers and computes the timing of the different parts of the explain function.
Usage
compute_time(internal)
Arguments
internal |
List.
Not used directly, but passed through from |
Value
List of reformatted timing information
Computes v(S) for all feature subsets S.
Description
Computes v(S) for all feature subsets S.
Usage
compute_vS(internal, model, predict_model)
Arguments
internal |
List.
Not used directly, but passed through from |
Value
List of v(S) for different coalitions S, optionally also with the samples used to estimate v(S)
Convert feature names into feature indices
Description
Function that takes a causal_ordering specified using strings and converts these strings to feature indices.
Usage
convert_feature_name_to_idx(causal_ordering, labels, feat_group_txt)
Arguments
causal_ordering |
List.
Not applicable for (regular) non-causal or asymmetric explanations.
|
labels |
Vector of strings containing (the order of) the feature names. |
feat_group_txt |
String that is either "feature" or "group" based on
if |
Value
The causal_ordering list, but with feature indices (w.r.t. labels) instead of feature names.
Author(s)
Lars Henry Berge Olsen
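The conversion amounts to looking up each name's position in labels. A short base R sketch of that idea follows; the variable names and the error message are illustrative and not taken from the package.

```r
labels <- c("Solar.R", "Wind", "Temp", "Month")

# A (partial) causal ordering given by feature names
causal_ordering <- list("Month", c("Temp", "Solar.R"), "Wind")

# Replace every name by its position in `labels`, failing on unknown names
causal_ordering_idx <- lapply(causal_ordering, function(component) {
  idx <- match(component, labels)
  if (anyNA(idx)) {
    stop(
      "Unknown feature name(s) in causal_ordering: ",
      paste(component[is.na(idx)], collapse = ", ")
    )
  }
  idx
})
```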
Correction term with trace_input in AICc formula
Description
Correction term with trace_input in AICc formula
Usage
correction_matrix_cpp(tr_H, n)
Arguments
tr_H |
numeric The trace of H |
n |
numeric The number of rows in H |
Value
Scalar
Author(s)
Martin Jullum
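The formula itself is not stated here. Assuming the function evaluates the penalty term of the corrected AIC for linear smoothers (Hurvich, Simonoff and Tsai, 1998), it can be written in base R as below; aicc_correction() is a hypothetical name for this sketch.

```r
# Assumed form of the correction term: (1 + tr(H)/n) / (1 - (tr(H) + 2)/n),
# where H is the smoother ("hat") matrix and n its number of rows
aicc_correction <- function(tr_H, n) {
  (1 + tr_H / n) / (1 - (tr_H + 2) / n)
}

correction <- aicc_correction(tr_H = 3, n = 50)

# Algebraically equivalent form: 1 + 2 * (tr(H) + 1) / (n - tr(H) - 2)
correction_alt <- 1 + 2 * (3 + 1) / (50 - 3 - 2)
```

The term grows without bound as tr(H) approaches n - 2, which penalises bandwidths that make the smoother nearly interpolate the data.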
Define coalitions, and fetch additional information about each unique coalition
Description
Define coalitions, and fetch additional information about each unique coalition
Usage
create_coalition_table(
m,
exact = TRUE,
n_coalitions = 200,
n_coal_each_size = choose(m, seq(m - 1)),
weight_zero_m = 10^6,
paired_shap_sampling = TRUE,
prev_X = NULL,
n_samps_scale = 10,
coal_feature_list = as.list(seq_len(m)),
approach0 = "gaussian",
kernelSHAP_reweighting = "none",
semi_deterministic_sampling = FALSE,
dt_coal_samp_info = NULL,
dt_valid_causal_coalitions = NULL
)
Arguments
m |
Positive integer. Total number of features/groups. |
exact |
Logical.
If |
n_coalitions |
Positive integer.
Note that if |
n_coal_each_size |
Vector of integers of length |
weight_zero_m |
Numeric. The value to use as a replacement for infinite coalition weights when doing numerical operations. |
paired_shap_sampling |
Logical. Whether to do paired sampling of coalitions. |
prev_X |
data.table. The X data.table from the previous iteration. |
n_samps_scale |
Positive integer.
Integer that scales the number of coalitions |
coal_feature_list |
List. A list mapping each coalition to the features it contains. |
approach0 |
Character vector.
Contains the approach to be used for estimation of each coalition size. Same as |
kernelSHAP_reweighting |
String.
How to reweight the sampling frequency weights in the kernelSHAP solution after sampling.
The aim of this is to reduce the randomness and thereby the variance of the Shapley value estimates.
The options are one of |
semi_deterministic_sampling |
Logical.
If |
dt_coal_samp_info |
data.table. The data.table contains information about which coalitions should be
deterministically included and which can be sampled, in addition to the sampling probabilities of each available
coalition size, and the weight given to the sampled and deterministically included coalitions (excluding empty and
grand coalitions which are given the |
dt_valid_causal_coalitions |
data.table. Only applicable for asymmetric Shapley
values explanations, and is |
Value
A data.table with info about the coalitions to use
Author(s)
Nikolai Sellereite, Martin Jullum, Lars Henry Berge Olsen
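The role of weight_zero_m is easiest to see from the Shapley kernel weights themselves. The base R sketch below computes the weight per coalition size and the implied probability of sampling each non-trivial size; shapley_kernel_weight() and the other names are hypothetical, and the package's table contains more information than shown here.

```r
# Shapley kernel weight of a single coalition of a given size. The weights
# of the empty and grand coalitions are infinite and are replaced by
# weight_zero_m.
shapley_kernel_weight <- function(m, size, weight_zero_m = 10^6) {
  w <- (m - 1) / (choose(m, size) * size * (m - size))
  w[!is.finite(w)] <- weight_zero_m
  w
}

m <- 5
weights <- shapley_kernel_weight(m, 0:m)

# Probability of drawing a coalition of each size 1, ..., m - 1:
# per-coalition weight times the number of coalitions of that size
inner <- seq_len(m - 1)
p_size <- weights[inner + 1] * choose(m, inner)
p_size <- p_size / sum(p_size)
```

The size distribution is symmetric in s and m - s, which is what makes pairing each sampled coalition with its complement (paired_shap_sampling) a natural choice.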
Build all the conditional inference trees
Description
Build all the conditional inference trees
Usage
create_ctree(
given_ind,
x_train,
mincriterion,
minsplit,
minbucket,
use_partykit = "on_error"
)
Arguments
given_ind |
Integer vector. Indicates which features are conditioned on. |
x_train |
Data.table with training data. |
use_partykit |
String. In some semi-rare cases |
Details
See the documentation of the setup_approach.ctree() function for undocumented parameters.
Value
List with conditional inference tree and the variables conditioned/not conditioned on.
Author(s)
Annabelle Redelmeier, Martin Jullum
Create marginal categorical data for causal Shapley values
Description
This function is used when we generate marginal data for the categorical approach and there are several sampling steps. We need to treat this case separately because, in the marginal step, we cannot generate feature values whose combination with the feature values we condition on in S is not in categorical.joint_prob_dt. If we did, we could not progress further in the chain of sampling steps.
E.g., X1 in (1,2,3), X2 in (1,2,3), and X3 in (1,2,3). We know X2 = 2, and let the causal structure be X1 -> X2 -> X3. Assume that P(X1 = 1, X2 = 2, X3 = 3) = P(X1 = 2, X2 = 2, X3 = 3) = 1/2. Then there is no point in generating X1 = 3, as we then cannot generate X3.
The solution is to generate only the values that can proceed through the whole chain of sampling steps. To do that, we have to ensure that the marginal sampling respects the valid feature coalitions for all sets of conditional features, i.e., the features in features_steps_cond_on. We sample from the valid coalitions using the marginal probabilities.
Usage
create_marginal_data_cat(
n_MC_samples,
x_explain,
Sbar_features,
S_original,
joint_prob_dt
)
Arguments
n_MC_samples |
Positive integer.
For most approaches, it indicates the maximum number of samples to use in the Monte Carlo integration
of every conditional expectation.
For |
x_explain |
Matrix or data.frame/data.table. Contains the features whose predictions ought to be explained. |
Sbar_features |
Vector of integers containing the features indices to generate marginal observations for.
That is, if |
S_original |
Vector of integers containing the features indices of the original coalition |
Details
For undocumented arguments, see setup_approach.categorical().
Value
Data table of dimension (n_MC_samples * nrow(x_explain)) × length(Sbar_features) with the sampled observations.
Author(s)
Lars Henry Berge Olsen
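The restriction to valid values can be sketched in base R on a small variant of the X1 -> X2 -> X3 example from the description. The joint probability table, the object names and the single-explicand setup are assumptions made for illustration; this is not the package's implementation.

```r
# Toy joint probability table (a variant of the example above in which
# X1 = 3 has positive marginal probability but never occurs with X2 = 2)
joint_prob <- data.frame(
  X1 = c(1, 2, 3),
  X2 = c(2, 2, 1),
  X3 = c(3, 3, 1),
  prob = c(0.3, 0.3, 0.4)
)

# We condition on X2 = 2 and want marginal samples of X1
x2_explain <- 2

# X1 values that can still be combined with X2 = 2 later in the chain
valid_x1 <- unique(joint_prob$X1[joint_prob$X2 == x2_explain & joint_prob$prob > 0])

# Marginal probabilities of X1, restricted to the valid values and renormalised
marginal_x1 <- tapply(joint_prob$prob, joint_prob$X1, sum)
p_valid <- marginal_x1[as.character(valid_x1)]
p_valid <- p_valid / sum(p_valid)

set.seed(123)
# Index-based sampling avoids sample()'s special case for length-one input
x1_samples <- valid_x1[sample.int(length(valid_x1), size = 10, replace = TRUE, prob = p_valid)]
```

Sampling X1 from its unrestricted marginal would produce X1 = 3 about 40% of the time, and none of those samples could be extended to a valid X3.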
Generate marginal Gaussian data using Cholesky decomposition
Description
Given a multivariate Gaussian distribution, this function creates data from specified marginals of said distribution.
Usage
create_marginal_data_gaussian(n_MC_samples, Sbar_features, mu, cov_mat)
Arguments
n_MC_samples |
Integer. The number of samples to generate. |
Sbar_features |
Vector of integers indicating which marginals to sample from. |
mu |
Numeric vector containing the expected values for all features in the multivariate Gaussian distribution. |
cov_mat |
Numeric matrix containing the covariance between all features in the multivariate Gaussian distribution. |
Author(s)
Lars Henry Berge Olsen
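A marginal of a multivariate Gaussian is again Gaussian, with the corresponding sub-vector of mu and sub-matrix of cov_mat, so sampling reduces to one Cholesky factorisation. The following base R sketch shows the idea; sample_marginal_gaussian() is a hypothetical name and the package's internals may differ.

```r
# Hypothetical helper: draw n_MC_samples from the marginal distribution of
# the features in Sbar_features
sample_marginal_gaussian <- function(n_MC_samples, Sbar_features, mu, cov_mat) {
  mu_sub <- mu[Sbar_features]
  cov_sub <- cov_mat[Sbar_features, Sbar_features, drop = FALSE]
  # chol() returns the upper-triangular factor R with t(R) %*% R == cov_sub
  R <- chol(cov_sub)
  # Standard normal draws, one row per sample
  Z <- matrix(rnorm(n_MC_samples * length(Sbar_features)), nrow = n_MC_samples)
  # Z %*% R has covariance t(R) %*% R; then shift each column by its mean
  sweep(Z %*% R, 2, mu_sub, "+")
}

set.seed(1)
mu <- c(0, 5, -3)
cov_mat <- matrix(c(2, 0.5, 0.3,
                    0.5, 1, 0.2,
                    0.3, 0.2, 1.5), nrow = 3)
samples <- sample_marginal_gaussian(1e4, c(1, 3), mu, cov_mat)
```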
Function that samples data from the empirical marginal training distribution
Description
Sample observations from the empirical distribution P(X) using the training dataset.
Usage
create_marginal_data_training(
x_train,
n_explain,
Sbar_features,
n_MC_samples = 1000,
stable_version = TRUE
)
Arguments
x_train |
Data.table with training data. |
Sbar_features |
Vector of integers containing the features indices to generate marginal observations for.
That is, if |
stable_version |
Logical. If |
Value
Data table of dimension n_MC_samples × length(Sbar_features) with the sampled observations.
Author(s)
Lars Henry Berge Olsen
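Sampling from the empirical marginal distribution can be pictured as drawing training rows and keeping only the columns in Sbar_features. The base R sketch below assumes whole rows are drawn with replacement and ignores the stable_version option; it is an illustration, not the package's code.

```r
# Training features from the airquality data used in the package examples
x_train <- airquality[complete.cases(airquality), c("Solar.R", "Wind", "Temp", "Month")]

Sbar_features <- c(2, 4) # generate marginal observations for Wind and Month
n_MC_samples <- 5

set.seed(1)
# Draw row indices (with replacement, an assumption of this sketch) and keep
# only the requested columns; drawing whole rows keeps the dependence among
# the sampled features
rows <- sample.int(nrow(x_train), n_MC_samples, replace = TRUE)
marginal_samples <- x_train[rows, Sbar_features, drop = FALSE]
```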
Exported documentation helper function.
Description
Exported documentation helper function.
Usage
default_doc_export(internal, iter, index_features)
Arguments
internal |
List.
Not used directly, but passed through from |
iter |
Integer. The iteration number. Only used internally. |
index_features |
Positive integer vector. Specifies the id_coalition to
apply to the present method. |
Unexported documentation helper function.
Description
Unexported documentation helper function.
Usage
default_doc_internal(
internal,
model,
predict_model,
x_explain,
x_train,
n_features,
W_kernel,
S,
dt_vS,
output_size,
...
)
Arguments
internal |
List.
Holds all parameters, data, functions and computed objects used within |
model |
Objects.
The model object that ought to be explained.
See the documentation of |
predict_model |
Function.
The prediction function used when |
x_explain |
Data.table with the features of the observation whose predictions ought to be explained (test data). |
x_train |
Data.table with training data. |
n_features |
Positive integer. The number of features. |
W_kernel |
Numeric matrix. Contains all nonscaled weights between training and test
observations for all coalitions. The dimension equals |
S |
Integer matrix of dimension |
dt_vS |
Data.table of dimension |
output_size |
Scalar integer. Specifies the dimension of the output from the prediction model for every observation. |
... |
Further arguments passed to |
Value
The internal list. It holds all parameters, data, and computed objects used within explain().
Get table with all (exact) coalitions
Description
Get table with all (exact) coalitions
Usage
exact_coalition_table(
m,
max_fixed_coal_size = ceiling((m - 1)/2),
dt_valid_causal_coalitions = NULL,
weight_zero_m = 10^6
)
Arguments
m |
Positive integer. Total number of features/groups. |
dt_valid_causal_coalitions |
data.table. Only applicable for asymmetric Shapley
values explanations, and is |
weight_zero_m |
Numeric. The value to use as a replacement for infinite coalition weights when doing numerical operations. |
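For small m, the full set of 2^m coalitions can be enumerated directly. The base R sketch below builds such a list together with a small summary table; the column names are illustrative guesses and need not match those of the package's table.

```r
m <- 4

# Enumerate all coalitions: the empty set, then all subsets of size 1, ..., m
coalitions <- c(
  list(integer(0)),
  unlist(
    lapply(seq_len(m), function(k) combn(m, k, simplify = FALSE)),
    recursive = FALSE
  )
)

# One row per coalition with its size and the number of coalitions of that size
coal_table <- data.frame(
  id_coalition = seq_along(coalitions),
  coalition_size = lengths(coalitions)
)
coal_table$N <- choose(m, coal_table$coalition_size)
```

The number of rows doubles with every added feature or group, which is why sampling-based and iterative estimation become relevant for larger m.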
Explain the output of machine learning models with dependence-aware (conditional/observational) Shapley values
Description
Computes dependence-aware Shapley values for observations in x_explain from the specified model by using the method specified in approach to estimate the conditional expectation. See Aas et al. (2021) for a thorough introduction to dependence-aware prediction explanation with Shapley values.
Usage
explain(
model,
x_explain,
x_train,
approach,
phi0,
iterative = NULL,
max_n_coalitions = NULL,
group = NULL,
n_MC_samples = 1000,
seed = NULL,
verbose = "basic",
predict_model = NULL,
get_model_specs = NULL,
prev_shapr_object = NULL,
asymmetric = FALSE,
causal_ordering = NULL,
confounding = NULL,
extra_computation_args = list(),
iterative_args = list(),
output_args = list(),
...
)
Arguments
model |
Model object.
Specifies the model whose predictions we want to explain.
Run |
x_explain |
Matrix or data.frame/data.table. Contains the features whose predictions ought to be explained. |
x_train |
Matrix or data.frame/data.table. Contains the data used to estimate the (conditional) distributions for the features needed to properly estimate the conditional expectations in the Shapley formula. |
approach |
Character vector of length |
phi0 |
Numeric. The prediction value for unseen data, i.e. an estimate of the expected prediction without conditioning on any features. Typically we set this value equal to the mean of the response variable in our training data, but other choices such as the mean of the predictions in the training data are also reasonable. |
iterative |
Logical or NULL
If |
max_n_coalitions |
Integer.
The upper limit on the number of unique feature/group coalitions to use in the iterative procedure
(if |
group |
List.
If |
n_MC_samples |
Positive integer.
For most approaches, it indicates the maximum number of samples to use in the Monte Carlo integration
of every conditional expectation.
For |
seed |
Positive integer.
Specifies the seed before any randomness based code is being run.
If |
verbose |
String vector or NULL.
Specifies the verbosity (printout detail level) through one or more of strings |
predict_model |
Function.
The prediction function used when |
get_model_specs |
Function.
An optional function for checking model/data consistency when
If |
prev_shapr_object |
|
asymmetric |
Logical.
Not applicable for (regular) non-causal or asymmetric explanations.
If |
causal_ordering |
List.
Not applicable for (regular) non-causal or asymmetric explanations.
|
confounding |
Logical vector.
Not applicable for (regular) non-causal or asymmetric explanations.
|
extra_computation_args |
Named list.
Specifies extra arguments related to the computation of the Shapley values.
See |
iterative_args |
Named list.
Specifies the arguments for the iterative procedure.
See |
output_args |
Named list.
Specifies certain arguments related to the output of the function.
See |
... |
Arguments passed on to
|
Details
The shapr package implements kernelSHAP estimation of dependence-aware Shapley values with eight different Monte Carlo-based approaches for estimating the conditional distributions of the data. These are all introduced in the general usage vignette (from R: vignette("general_usage", package = "shapr")). Moreover, Aas et al. (2021) gives a general introduction to dependence-aware Shapley values and the three approaches "empirical", "gaussian", and "copula", and also discusses "independence". Redelmeier et al. (2020) introduces the approach "ctree". Olsen et al. (2022) introduces the "vaeac" approach. Approach "timeseries" is discussed in Jullum et al. (2021). shapr has also implemented two regression-based approaches, "regression_separate" and "regression_surrogate", as described in Olsen et al. (2024). It is also possible to combine the different approaches; see the general usage vignette for more information.
The package also supports the computation of causal and asymmetric Shapley values as introduced by Heskes et al. (2020) and Frye et al. (2020), respectively. Asymmetric Shapley values were proposed by Frye et al. (2020) as a way to incorporate causal knowledge in the real world by restricting the possible feature combinations/coalitions when computing the Shapley values to those consistent with a (partial) causal ordering. Causal Shapley values were proposed by Heskes et al. (2020) as a way to explain the total effect of features on the prediction, taking into account their causal relationships, by adapting the sampling procedure in shapr.
The package allows for parallelized computation with progress updates through the tightly connected future::future and progressr::progressr packages. See the examples below.
For iterative estimation (iterative = TRUE), intermediate results may also be printed to the console (according to the verbose argument). Moreover, the intermediate results are written to disk. This, combined with batch computation of the v(S) values, enables fast and accurate estimation of the Shapley values in a memory-friendly manner.
Value
Object of class c("shapr", "list"). Contains the following items:
shapley_values_est
data.table with the estimated Shapley values, with explained observations in the rows and features along the columns. The column none is the prediction not devoted to any of the features (given by the argument phi0).
shapley_values_sd
data.table with the standard deviation of the Shapley values reflecting the uncertainty. Note that this only reflects the coalition sampling part of the kernelSHAP procedure, and is therefore by definition 0 when all coalitions are used. Only present when extra_computation_args$compute_sd = TRUE, which is the default when iterative = TRUE.
internal
List with the different parameters, data, functions and other output used internally.
pred_explain
Numeric vector with the predictions for the explained observations.
MSEv
List with the values of the MSEv evaluation criterion for the approach. See the MSEv evaluation section in the general usage vignette for details.
timing
List containing timing information for the different parts of the computation. init_time and end_time give the time stamps for the start and end of the computation. total_time_secs gives the total time in seconds for the complete execution of explain(). main_timing_secs gives the time in seconds for the main computations. iter_timing_secs gives, for each iteration of the iterative estimation, the time spent on the different parts of the iterative estimation routine.
Author(s)
Martin Jullum, Lars Henry Berge Olsen
Examples
# Load example data
data("airquality")
airquality <- airquality[complete.cases(airquality), ]
x_var <- c("Solar.R", "Wind", "Temp", "Month")
y_var <- "Ozone"
# Split data into test- and training data
data_train <- head(airquality, -3)
data_explain <- tail(airquality, 3)
x_train <- data_train[, x_var]
x_explain <- data_explain[, x_var]
# Fit a linear model
lm_formula <- as.formula(paste0(y_var, " ~ ", paste0(x_var, collapse = " + ")))
model <- lm(lm_formula, data = data_train)
# Explain predictions
p <- mean(data_train[, y_var])
# (Optionally) enable parallelization via the future package
if (requireNamespace("future", quietly = TRUE)) {
  future::plan("multisession", workers = 2)
}
# (Optionally) enable progress updates within every iteration via the progressr package
if (requireNamespace("progressr", quietly = TRUE)) {
  progressr::handlers(global = TRUE)
}
# Empirical approach
explain1 <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "empirical",
  phi0 = p,
  n_MC_samples = 1e2
)
# Gaussian approach
explain2 <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "gaussian",
  phi0 = p,
  n_MC_samples = 1e2
)
# Gaussian copula approach
explain3 <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "copula",
  phi0 = p,
  n_MC_samples = 1e2
)
if (requireNamespace("party", quietly = TRUE)) {
  # ctree approach
  explain4 <- explain(
    model = model,
    x_explain = x_explain,
    x_train = x_train,
    approach = "ctree",
    phi0 = p,
    n_MC_samples = 1e2
  )
}
# Combined approach
approach <- c("gaussian", "gaussian", "empirical")
explain5 <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = approach,
  phi0 = p,
  n_MC_samples = 1e2
)
# Print the Shapley values
print(explain1$shapley_values_est)
# Plot the results
if (requireNamespace("ggplot2", quietly = TRUE)) {
  plot(explain1)
  plot(explain1, plot_type = "waterfall")
}
# Group-wise explanations
group_list <- list(A = c("Temp", "Month"), B = c("Wind", "Solar.R"))
explain_groups <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  group = group_list,
  approach = "empirical",
  phi0 = p,
  n_MC_samples = 1e2
)
print(explain_groups$shapley_values_est)
# Separate and surrogate regression approaches with linear regression models.
req_pkgs <- c("parsnip", "recipes", "workflows", "rsample", "tune", "yardstick")
# requireNamespace() checks a single package, so test each required package in turn
if (all(vapply(req_pkgs, requireNamespace, logical(1), quietly = TRUE))) {
  explain_separate_lm <- explain(
    model = model,
    x_explain = x_explain,
    x_train = x_train,
    phi0 = p,
    approach = "regression_separate",
    regression.model = parsnip::linear_reg()
  )
  explain_surrogate_lm <- explain(
    model = model,
    x_explain = x_explain,
    x_train = x_train,
    phi0 = p,
    approach = "regression_surrogate",
    regression.model = parsnip::linear_reg()
  )
}
# Iterative estimation
# For illustration purposes only. By default not used for such small dimensions as here
# Gaussian approach
explain_iterative <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "gaussian",
  phi0 = p,
  n_MC_samples = 1e2,
  iterative = TRUE,
  iterative_args = list(initial_n_coalitions = 10)
)
Explain a forecast from time series models with dependence-aware (conditional/observational) Shapley values
Description
Computes dependence-aware Shapley values for observations in explain_idx from the specified model by using the method specified in approach to estimate the conditional expectation. See Aas et al. (2021) for a thorough introduction to dependence-aware prediction explanation with Shapley values.
Usage
explain_forecast(
model,
y,
xreg = NULL,
train_idx = NULL,
explain_idx,
explain_y_lags,
explain_xreg_lags = explain_y_lags,
horizon,
approach,
phi0,
max_n_coalitions = NULL,
iterative = NULL,
group_lags = TRUE,
group = NULL,
n_MC_samples = 1000,
seed = NULL,
predict_model = NULL,
get_model_specs = NULL,
verbose = "basic",
extra_computation_args = list(),
iterative_args = list(),
output_args = list(),
...
)
Arguments
model |
Model object.
Specifies the model whose predictions we want to explain.
Run |
y |
Matrix, data.frame/data.table or a numeric vector. Contains the endogenous variables used to estimate the (conditional) distributions needed to properly estimate the conditional expectations in the Shapley formula including the observations to be explained. |
xreg |
Matrix, data.frame/data.table or a numeric vector. Contains the exogenous variables used to estimate the (conditional) distributions needed to properly estimate the conditional expectations in the Shapley formula including the observations to be explained. As exogenous variables are used contemporaneously when producing a forecast, this item should contain nrow(y) + horizon rows. |
train_idx |
Numeric vector.
The row indices in data and reg denoting points in time to use when estimating the conditional expectations in
the Shapley value formula.
If |
explain_idx |
Numeric vector. The row indices in data and reg denoting points in time to explain. |
explain_y_lags |
Numeric vector.
Denotes the number of lags that should be used for each variable in |
explain_xreg_lags |
Numeric vector.
If |
horizon |
Numeric.
The forecast horizon to explain. Passed to the |
approach |
Character vector of length |
phi0 |
Numeric. The prediction value for unseen data, i.e. an estimate of the expected prediction without conditioning on any features. Typically we set this value equal to the mean of the response variable in our training data, but other choices such as the mean of the predictions in the training data are also reasonable. |
max_n_coalitions |
Integer.
The upper limit on the number of unique feature/group coalitions to use in the iterative procedure
(if |
iterative |
Logical or NULL
If |
group_lags |
Logical.
If |
group |
List.
If |
n_MC_samples |
Positive integer.
For most approaches, it indicates the maximum number of samples to use in the Monte Carlo integration
of every conditional expectation.
For |
seed |
Positive integer.
Specifies the seed before any randomness based code is being run.
If |
predict_model |
Function.
The prediction function used when |
get_model_specs |
Function.
An optional function for checking model/data consistency when
If |
verbose |
String vector or NULL.
Specifies the verbosity (printout detail level) through one or more of strings |
extra_computation_args |
Named list.
Specifies extra arguments related to the computation of the Shapley values.
See |
iterative_args |
Named list.
Specifies the arguments for the iterative procedure.
See |
output_args |
Named list.
Specifies certain arguments related to the output of the function.
See |
... |
Arguments passed on to
|
Details
This function explains a forecast of length horizon. The argument train_idx is analogous to x_train in explain(); however, it just contains the time indices of where in the data the forecast should start for each training sample. In the same way, explain_idx defines the time index (indices) which will precede a forecast to be explained.
As any autoregressive forecast model will require a set of lags to make a forecast at an arbitrary point in time, explain_y_lags and explain_xreg_lags define how many lags are required to "refit" the model at any given time index. This allows the different approaches to work in the same way they do for time-invariant models.
See the forecasting section of the general usage vignette for further details.
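The role of the lags can be illustrated with base R alone. The sketch below is not how shapr builds its data internally; it only shows, on a made-up series and with stats::embed(), how a univariate series with two lags turns into one row per usable time index, which is the tabular shape the different approaches operate on.

```r
# Hypothetical toy series; the values are made up for illustration
y <- c(10, 12, 11, 13, 15, 14, 16)
n_lags <- 2

# embed() returns one row per usable time point: (y_t, y_{t-1}, ..., y_{t-n_lags})
lagged <- stats::embed(y, n_lags + 1)
colnames(lagged) <- c("y_t", paste0("y_lag", seq_len(n_lags)))
lagged
```

The first n_lags time points cannot be used as rows because their lags are not observed, which is one reason the time indices passed to a forecast explanation cannot start at the very beginning of the series.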
Value
Object of class c("shapr", "list"). Contains the following items:
shapley_values_est
data.table with the estimated Shapley values, with explained observations in the rows and features along the columns. The column none is the prediction not devoted to any of the features (given by the argument phi0).
shapley_values_sd
data.table with the standard deviation of the Shapley values reflecting the uncertainty. Note that this only reflects the coalition sampling part of the kernelSHAP procedure, and is therefore by definition 0 when all coalitions are used. Only present when extra_computation_args$compute_sd = TRUE, which is the default when iterative = TRUE.
internal
List with the different parameters, data, functions and other output used internally.
pred_explain
Numeric vector with the predictions for the explained observations.
MSEv
List with the values of the MSEv evaluation criterion for the approach. See the MSEv evaluation section in the general usage vignette for details.
timing
List containing timing information for the different parts of the computation. init_time and end_time give the time stamps for the start and end of the computation. total_time_secs gives the total time in seconds for the complete execution of explain(). main_timing_secs gives the time in seconds for the main computations. iter_timing_secs gives, for each iteration of the iterative estimation, the time spent on the different parts of the iterative estimation routine.
Author(s)
Jon Lachmann, Martin Jullum
Examples
# Load example data
data("airquality")
data <- data.table::as.data.table(airquality)
# Fit an AR(2) model.
model_ar_temp <- ar(data$Temp, order = 2)
# Calculate the zero prediction values for a three step forecast.
p0_ar <- rep(mean(data$Temp), 3)
# Empirical approach, explaining forecasts starting at T = 152 and T = 153.
explain_forecast(
model = model_ar_temp,
y = data[, "Temp"],
train_idx = 2:151,
explain_idx = 152:153,
explain_y_lags = 2,
horizon = 3,
approach = "empirical",
phi0 = p0_ar,
group_lags = FALSE
)
Gathers the final output to create the explanation object
Description
Gathers the final output to create the explanation object
Usage
finalize_explanation(internal)
Arguments
internal |
List.
Not used directly, but passed through from |
Value
List of reformatted output information extracted from internal
A torch::nn_module() Representing a gauss_cat_loss
Description
The gauss_cat_loss module layer computes the log probability of the groundtruth for each object given the mask and the distribution parameters. That is, the log-likelihoods of the true/full training observations based on the generative distribution parameters distr_params inferred by the masked versions of the observations.
Usage
gauss_cat_loss(one_hot_max_sizes, min_sigma = 1e-04, min_prob = 1e-04)
Arguments
one_hot_max_sizes |
A torch tensor of dimension |
min_sigma |
For stability it might be desirable that the minimal sigma is not too close to zero. |
min_prob |
For stability it might be desirable that the minimal probability is not too close to zero. |
Details
Note that the module works with mixed data represented as 2-dimensional inputs and it works correctly with missing values in groundtruth as long as they are represented by NaNs.
Author(s)
Lars Henry Berge Olsen
A torch::nn_module() Representing a gauss_cat_parameters
Description
The gauss_cat_parameters module extracts the parameters from the inferred generative Gaussian and categorical distributions for the continuous and categorical features, respectively.
If one_hot_max_sizes is [4, 1, 1, 2], then the inferred distribution parameters for one observation form the vector [p_{00}, p_{01}, p_{02}, p_{03}, \mu_1, \sigma_1, \mu_2, \sigma_2, p_{30}, p_{31}], where \operatorname{Softmax}([p_{00}, p_{01}, p_{02}, p_{03}]) and \operatorname{Softmax}([p_{30}, p_{31}]) are the probabilities of the first and the fourth feature categories, respectively, in the model generative distribution, and Gaussian(\mu_1, \sigma_1^2) and Gaussian(\mu_2, \sigma_2^2) are the model generative distributions on the second and the third features.
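The layout of the parameter vector can be sketched in plain R. The helper names below (split_params, softmax) are hypothetical and not part of shapr, the parameter values are made up, and the sketch assumes, as in the example above, that an entry of 1 in one_hot_max_sizes marks a continuous feature with two parameters. The clamping controlled by min_sigma and min_prob is not shown.

```r
# Hypothetical helper (not part of shapr): split one parameter vector by feature
split_params <- function(params, one_hot_max_sizes) {
  # Continuous features (size 1) use two entries (mu, sigma);
  # categorical features use one entry per category
  n_par <- ifelse(one_hot_max_sizes <= 1, 2, one_hot_max_sizes)
  ends <- cumsum(n_par)
  starts <- ends - n_par + 1
  lapply(seq_along(n_par), function(j) params[starts[j]:ends[j]])
}

# Numerically stable softmax
softmax <- function(z) exp(z - max(z)) / sum(exp(z - max(z)))

one_hot_max_sizes <- c(4, 1, 1, 2)
params <- c(0.1, 0.2, 0.3, 0.4, 1.5, 0.8, -0.5, 1.2, 2.0, 0.0) # made-up values
parts <- split_params(params, one_hot_max_sizes)

probs_feature1 <- softmax(parts[[1]]) # class probabilities, first feature
probs_feature4 <- softmax(parts[[4]]) # class probabilities, fourth feature
```

Here parts[[2]] and parts[[3]] hold (mu, sigma) for the two continuous features, matching the ordering of the vector in the description.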
Usage
gauss_cat_parameters(one_hot_max_sizes, min_sigma = 1e-04, min_prob = 1e-04)
Arguments
one_hot_max_sizes |
A torch tensor of dimension |
min_sigma |
For stability it might be desirable that the minimal sigma is not too close to zero. |
min_prob |
For stability it might be desirable that the minimal probability is not too close to zero. |
Author(s)
Lars Henry Berge Olsen
A torch::nn_module() Representing a gauss_cat_sampler_most_likely
Description
The gauss_cat_sampler_most_likely generates the most likely samples from the generative distribution defined by the output of the vaeac. I.e., the layer will return the mean and most probable class for the Gaussian (continuous features) and categorical (categorical features) distributions, respectively.
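As a rough illustration of "most likely" in this setting, the following base R sketch picks the mean for continuous features and the highest-scoring class for categorical features. The list structure and values are made up for the example, and the 1-based class index is an assumption; the torch module may encode classes differently.

```r
# Hypothetical per-feature parameters for one observation (made-up values):
# categorical features hold unnormalized class scores, continuous ones hold mu and sigma
params <- list(
  list(type = "categorical", scores = c(0.1, 0.2, 0.3, 0.4)),
  list(type = "continuous", mu = 1.5, sigma = 0.8),
  list(type = "continuous", mu = -0.5, sigma = 1.2),
  list(type = "categorical", scores = c(2.0, 0.0))
)

most_likely <- vapply(params, function(p) {
  if (p$type == "continuous") {
    p$mu # the mode of a Gaussian is its mean
  } else {
    as.numeric(which.max(p$scores)) # softmax is monotone, so the argmax is unchanged
  }
}, numeric(1))
most_likely
```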
Usage
gauss_cat_sampler_most_likely(
one_hot_max_sizes,
min_sigma = 1e-04,
min_prob = 1e-04
)
Arguments
one_hot_max_sizes |
A torch tensor of dimension |
min_sigma |
For stability it might be desirable that the minimal sigma is not too close to zero. |
min_prob |
For stability it might be desirable that the minimal probability is not too close to zero. |
Value
A gauss_cat_sampler_most_likely object.
Author(s)
Lars Henry Berge Olsen
A torch::nn_module() Representing a gauss_cat_sampler_random
Description
The gauss_cat_sampler_random generates random samples from the generative distribution defined by the output of the vaeac. The random sample is generated by sampling from the inferred Gaussian and categorical distributions for the continuous and categorical features, respectively.
Usage
gauss_cat_sampler_random(
one_hot_max_sizes,
min_sigma = 1e-04,
min_prob = 1e-04
)
Arguments
one_hot_max_sizes |
A torch tensor of dimension |
min_sigma |
For stability it might be desirable that the minimal sigma is not too close to zero. |
min_prob |
For stability it might be desirable that the minimal probability is not too close to zero. |
Author(s)
Lars Henry Berge Olsen
Transforms a sample to standardized normal distribution
Description
Transforms a sample to standardized normal distribution
Usage
gaussian_transform(x)
Arguments
x |
Numeric vector. The data which should be transformed to a standard normal distribution. |
Value
Numeric vector of length length(x)
Author(s)
Martin Jullum
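One common way to transform a sample to standard normal scores is a rank-based transform. The sketch below is a hypothetical stand-in written for illustration, under the assumption that a normal-scores transform is what is meant; it is not taken from the package source and gaussian_transform() may differ in details.

```r
# Hypothetical stand-in for a Gaussian (normal scores) transform; not shapr's own code
normal_scores <- function(x) {
  u <- rank(x) / (length(x) + 1) # empirical quantile levels strictly inside (0, 1)
  stats::qnorm(u)
}

set.seed(1)
x <- rexp(200) # skewed, made-up sample
z <- normal_scores(x)
```

The transform preserves the ordering of x and returns a vector of the same length, in line with the documented return value.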
Transforms new data to standardized normal (dimension 1) based on other data transformations
Description
Transforms new data to standardized normal (dimension 1) based on other data transformations
Usage
gaussian_transform_separate(yx, n_y)
Arguments
yx |
Numeric vector. The first |
n_y |
Positive integer. Number of elements of |
Value
Vector of back-transformed Gaussian data
Author(s)
Martin Jullum
Get the steps for generating MC samples for coalitions following a causal ordering
Description
Get the steps for generating MC samples for coalitions following a causal ordering
Usage
get_S_causal_steps(S, causal_ordering, confounding, as_string = FALSE)
Arguments
S |
Integer matrix of dimension |
causal_ordering |
List.
Not applicable for (regular) non-causal or asymmetric explanations.
|
confounding |
Logical vector.
Not applicable for (regular) non-causal or asymmetric explanations.
|
as_string |
Boolean. If the returned object is to be a list of lists of integers or a list of vectors of strings. |
Value
Depends on the value of the parameter as_string. If as_string = TRUE, then results[j] is a vector specifying the process of generating the samples for coalition j. The length of results[j] is the number of steps, and results[j][i] is a string of the form features_to_sample|features_to_condition_on. If the features_to_condition_on part is blank, then we are to sample from the marginal distribution. If as_string = FALSE, we instead return a list where results[[j]][[i]] contains the elements Sbar and S representing the features to sample and condition on, respectively.
Author(s)
Lars Henry Berge Olsen
get_cov_mat
Description
get_cov_mat
Usage
get_cov_mat(x_train, min_eigen_value = 1e-06)
Arguments
x_train |
Matrix or data.frame/data.table. Contains the data used to estimate the (conditional) distributions for the features needed to properly estimate the conditional expectations in the Shapley formula. |
min_eigen_value |
Numeric
Specifies the smallest allowed eigen value before the covariance matrix of |
Set up data for explain_forecast
Description
Set up data for explain_forecast
Usage
get_data_forecast(
y,
xreg,
train_idx,
explain_idx,
explain_y_lags,
explain_xreg_lags,
horizon
)
Arguments
y |
Matrix, data.frame/data.table or a numeric vector. Contains the endogenous variables used to estimate the (conditional) distributions needed to properly estimate the conditional expectations in the Shapley formula including the observations to be explained. |
xreg |
Matrix, data.frame/data.table or a numeric vector. Contains the exogenous variables used to estimate the (conditional) distributions needed to properly estimate the conditional expectations in the Shapley formula including the observations to be explained. As exogenous variables are used contemporaneously when producing a forecast, this item should contain nrow(y) + horizon rows. |
train_idx |
Numeric vector.
The row indices in data and reg denoting points in time to use when estimating the conditional expectations in
the Shapley value formula.
If |
explain_idx |
Numeric vector. The row indices in data and reg denoting points in time to explain. |
explain_y_lags |
Numeric vector.
Denotes the number of lags that should be used for each variable in |
explain_xreg_lags |
Numeric vector.
If |
horizon |
Numeric.
The forecast horizon to explain. Passed to the |
Value
A list containing
The data.frames x_train and x_explain which hold the lagged data examples.
A numeric, n_endo denoting how many columns are endogenous in x_train and x_explain.
A list, group with groupings of each variable to explain per variable and not per variable and lag.
Fetches feature information from a given data set
Description
Fetches feature information from a given data set
Usage
get_data_specs(x)
Arguments
x |
data.frame or data.table. The data to extract feature information from. |
Details
This function is used to extract the feature information to be checked against the corresponding information extracted from the model and other data sets. The function is only called internally.
Value
A list with the following elements:
- labels
character vector with the feature names to compute Shapley values for
- classes
a named character vector with the labels as names and the class types as elements
- factor_levels
a named list with the labels as names and character vectors with the factor levels as elements (NULL if the feature is not a factor)
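The shape of this return value can be mimicked with a few lines of base R. The function data_specs below is a hypothetical sketch that mirrors the documented structure on a made-up data.frame; it is not the package implementation.

```r
# Hypothetical base R sketch mirroring the documented return structure
data_specs <- function(x) {
  labels <- names(x)
  classes <- vapply(x, function(col) class(col)[1], character(1))
  factor_levels <- lapply(x, levels) # levels() is NULL for non-factors
  list(labels = labels, classes = classes, factor_levels = factor_levels)
}

x <- data.frame(
  temp = c(20.5, 22.1, 19.8),
  month = factor(c("May", "June", "May"), levels = c("May", "June"))
)
specs <- data_specs(x)
```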
Author(s)
Martin Jullum
Gets the default values for the extra computation arguments
Description
Gets the default values for the extra computation arguments
Usage
get_extra_comp_args_default(
internal,
paired_shap_sampling = isFALSE(internal$parameters$asymmetric),
semi_deterministic_sampling = FALSE,
kernelSHAP_reweighting = "on_all_cond",
compute_sd = isFALSE(internal$parameters$exact),
n_boot_samps = 100,
vS_batching_method = "future",
max_batch_size = 10,
min_n_batches = 10
)
Arguments
internal |
List.
Not used directly, but passed through from |
paired_shap_sampling |
Logical.
If |
semi_deterministic_sampling |
Logical.
If |
kernelSHAP_reweighting |
String.
How to reweight the sampling frequency weights in the kernelSHAP solution after sampling.
The aim of this is to reduce the randomness and thereby the variance of the Shapley value estimates.
The options are one of |
compute_sd |
Logical. Whether to estimate the standard deviations of the Shapley value estimates. This is TRUE whenever sampling based kernelSHAP is applied (either iteratively or with a fixed number of coalitions). |
n_boot_samps |
Integer. The number of bootstrapped samples (i.e. samples with replacement) from the set of all coalitions used to estimate the standard deviations of the Shapley value estimates. |
vS_batching_method |
String. The method used to perform batch computing of vS.
|
max_batch_size |
Integer. The maximum number of coalitions to estimate simultaneously within each iteration. Larger numbers require more memory, but may have a slight computational advantage. |
min_n_batches |
Integer. The minimum number of batches to split the computation into within each iteration. Larger numbers give more frequent progress updates. If parallelization is applied, this should be set no smaller than the number of parallel workers. |
Value
A list with the default values for the extra computation arguments.
Author(s)
Martin Jullum
This includes both extra parameters and other objects
Description
This includes both extra parameters and other objects
Usage
get_extra_parameters(internal, type)
Gets the feature specifications from the model
Description
Gets the feature specifications from the model
Usage
get_feature_specs(get_model_specs, model)
Arguments
get_model_specs |
Function.
An optional function for checking model/data consistency when
If |
model |
Model object.
Specifies the model whose predictions we want to explain.
Run |
Function to specify arguments of the iterative estimation procedure
Description
Function to specify arguments of the iterative estimation procedure
Usage
get_iterative_args_default(
internal,
initial_n_coalitions = ceiling(min(200, max(5, internal$parameters$n_features,
(2^internal$parameters$n_features)/10), internal$parameters$max_n_coalitions)),
fixed_n_coalitions_per_iter = NULL,
max_iter = 20,
convergence_tol = 0.02,
n_coal_next_iter_factor_vec = c(seq(0.1, 1, by = 0.1), rep(1, max_iter - 10))
)
Arguments
internal |
List.
Not used directly, but passed through from |
initial_n_coalitions |
Integer. Number of coalitions to use in the first estimation iteration. |
fixed_n_coalitions_per_iter |
Integer. Number of |
max_iter |
Integer. Maximum number of estimation iterations |
convergence_tol |
Numeric. The t variable in the convergence threshold formula on page 6 in the paper Covert and Lee (2021), 'Improving KernelSHAP: Practical Shapley Value Estimation via Linear Regression' https://arxiv.org/pdf/2012.01536. Smaller values require more coalitions before convergence is reached. |
n_coal_next_iter_factor_vec |
Numeric vector. The number of |
Details
The function sets default values for the iterative estimation procedure, according to the function defaults. If the argument iterative of explain() is FALSE, it sets parameters corresponding to the use of a non-iterative estimation procedure.
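The default expression for initial_n_coalitions in the Usage section is easier to read with numbers plugged in. The helper below is hypothetical and simply re-evaluates that expression with n_features and max_n_coalitions supplied directly rather than read from internal$parameters; setting max_n_coalitions to 2^n_features is an assumption made for the example.

```r
# Hypothetical helper reproducing the default expression from the Usage section
default_initial_n_coalitions <- function(n_features, max_n_coalitions) {
  ceiling(min(200, max(5, n_features, (2^n_features) / 10), max_n_coalitions))
}

default_initial_n_coalitions(4, 2^4)   # 5
default_initial_n_coalitions(10, 2^10) # 103
default_initial_n_coalitions(20, 2^20) # 200

# The default factor vector has one entry per iteration when max_iter = 20
max_iter <- 20
factor_vec <- c(seq(0.1, 1, by = 0.1), rep(1, max_iter - 10))
length(factor_vec) # 20
```

For few features the floor of 5 dominates, for moderate dimensions a tenth of all coalitions is used, and for high dimensions the cap of 200 applies.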
Value
A list with the default values for the iterative estimation procedure
Author(s)
Martin Jullum
Get the number of coalitions that respect the causal ordering
Description
Get the number of coalitions that respect the causal ordering
Usage
get_max_n_coalitions_causal(causal_ordering)
Arguments
causal_ordering |
List.
Not applicable for (regular) non-causal or asymmetric explanations.
|
Details
The function computes the number of coalitions that respect the causal ordering by computing the number of coalitions in each partial causal component and then summing these. We compute the number of coalitions in the i-th partial causal component as 2^n - 1, where n is the number of features in the i-th partial causal component, and we subtract one as we do not want to include the situation where no features in the i-th partial causal component are present. In the end, we add 1 for the empty coalition.
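The counting rule above amounts to a one-line computation. The helper below is a hypothetical sketch that assumes the causal ordering is given as a list of integer vectors, one vector per partial causal component.

```r
# Hypothetical helper applying the counting rule described in Details
n_causal_coalitions <- function(causal_ordering) {
  n_per_component <- lengths(causal_ordering)
  sum(2^n_per_component - 1) + 1 # "+ 1" accounts for the empty coalition
}

n_causal_coalitions(list(1:3, 4:5)) # (2^3 - 1) + (2^2 - 1) + 1 = 11
n_causal_coalitions(list(1:4))      # a single component gives all 2^4 = 16 coalitions
```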
Value
Integer. The (maximum) number of coalitions that respect the causal ordering.
Author(s)
Lars Henry Berge Olsen
Fetches feature information from natively supported models
Description
This function is used to extract the feature information from the model to be checked against the corresponding feature information in the data passed to explain().
NOTE: You should never need to call this function explicitly. It is exported just to make it more easily accessible for users; see Details.
Usage
get_model_specs(x)
## Default S3 method:
get_model_specs(x)
## S3 method for class 'ar'
get_model_specs(x)
## S3 method for class 'Arima'
get_model_specs(x)
## S3 method for class 'forecast_ARIMA'
get_model_specs(x)
## S3 method for class 'glm'
get_model_specs(x)
## S3 method for class 'lm'
get_model_specs(x)
## S3 method for class 'gam'
get_model_specs(x)
## S3 method for class 'ranger'
get_model_specs(x)
## S3 method for class 'workflow'
get_model_specs(x)
## S3 method for class 'xgb.Booster'
get_model_specs(x)
Arguments
x |
Model object for the model to be explained. |
Details
If you are explaining a model not supported natively, you may (optionally) enable such checking by creating this function yourself and passing it on to explain().
Value
A list with the following elements:
- labels
character vector with the feature names to compute Shapley values for
- classes
a named character vector with the labels as names and the class type as elements
- factor_levels
a named list with the labels as names and character vectors with the factor levels as elements (NULL if the feature is not a factor)
Author(s)
Martin Jullum
See Also
For model classes not supported natively, you NEED to create an analogue to predict_model(). See its help file for details.
Examples
# Load example data
data("airquality")
airquality <- airquality[complete.cases(airquality), ]
# Split data into test- and training data
x_train <- head(airquality, -3)
x_explain <- tail(airquality, 3)
# Fit a linear model
model <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = x_train)
get_model_specs(model)
get_mu_vec
Description
get_mu_vec
Usage
get_mu_vec(x_train)
Arguments
x_train |
Matrix or data.frame/data.table. Contains the data used to estimate the (conditional) distributions for the features needed to properly estimate the conditional expectations in the Shapley formula. |
Gets the default values for the output arguments
Description
Gets the default values for the output arguments
Usage
get_output_args_default(
keep_samp_for_vS = FALSE,
MSEv_uniform_comb_weights = TRUE,
saving_path = tempfile("shapr_obj_", fileext = ".rds")
)
Arguments
keep_samp_for_vS |
Logical.
Indicates whether the samples used in the Monte Carlo estimation of v_S should be returned (in |
MSEv_uniform_comb_weights |
Logical.
If |
saving_path |
String. The path to the file where the results of the iterative estimation procedure should be saved. Defaults to a temporary file (see the default in the Usage section). |
Value
A list of default output arguments.
Author(s)
Martin Jullum
Get predict_model function
Description
Get predict_model function
Usage
get_predict_model(predict_model, model)
Arguments
predict_model |
Function.
The prediction function used when |
model |
Objects.
The model object that ought to be explained.
See the documentation of |
Gets the implemented approaches
Description
Gets the implemented approaches
Usage
get_supported_approaches()
Value
Character vector.
The names of the implemented approaches that can be passed to argument approach in explain().
Provides a data.table with the supported models
Description
Provides a data.table with the supported models
Usage
get_supported_models()
Value
A data.table with the supported models.
Get all coalitions satisfying the causal ordering
Description
This function is only relevant when we are computing asymmetric Shapley values. For symmetric Shapley values (both regular and causal), all coalitions are allowed.
Usage
get_valid_causal_coalitions(
causal_ordering,
sort_features_in_coalitions = TRUE
)
Arguments
causal_ordering |
List.
Not applicable for (regular) non-causal or asymmetric explanations.
|
sort_features_in_coalitions |
Boolean. If |
Value
List of vectors containing all coalitions that respect the causal ordering.
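To make the notion of a coalition respecting a causal ordering concrete, the sketch below enumerates such coalitions in base R. It is a hypothetical helper, not the package implementation, and assumes that a feature may only enter a coalition if all features in earlier components are already present, with the ordering given as a list of integer vectors.

```r
# Hypothetical helper (not shapr's implementation): enumerate coalitions that
# respect a causal ordering given as a list of integer vectors
valid_causal_coalitions <- function(causal_ordering) {
  coalitions <- list(integer(0)) # start with the empty coalition
  ancestors <- integer(0)        # all features in earlier components
  for (component in causal_ordering) {
    for (k in seq_along(component)) {
      # every non-empty subset of the current component, on top of all ancestors
      subsets <- utils::combn(length(component), k, simplify = FALSE)
      for (idx in subsets) {
        coalitions[[length(coalitions) + 1]] <- sort(c(ancestors, component[idx]))
      }
    }
    ancestors <- c(ancestors, component)
  }
  coalitions
}

coals <- valid_causal_coalitions(list(1:2, 3))
coals
```

With the ordering list(1:2, 3), feature 3 only appears together with both features 1 and 2, so five coalitions remain instead of the 2^3 = 8 that are allowed in the symmetric case.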
Author(s)
Lars Henry Berge Olsen
Set up user provided groups for explanation in a forecast model.
Description
Set up user provided groups for explanation in a forecast model.
Usage
group_forecast_setup(group, horizon_features)
Arguments
group |
The list of groups to be explained. |
horizon_features |
A list of features per horizon, to split appropriate groups over. |
Value
A list containing
group The list group with entries that differ per horizon split accordingly.
horizon_group A list of which groups are applicable per horizon.
Computing single H matrix in AICc-function using the Mahalanobis distance
Description
Computing single H matrix in AICc-function using the Mahalanobis distance
Usage
hat_matrix_cpp(X, mcov, S_scale_dist, h)
Arguments
X |
matrix. |
mcov |
matrix The covariance matrix of X. |
S_scale_dist |
logical. Indicating whether the Mahalanobis distance should be scaled with the number of variables |
h |
numeric specifying the scaling (sigma) |
Value
Matrix of dimension ncol(X)*ncol(X)
Author(s)
Martin Jullum
Transforms new data to a standardized normal distribution
Description
Transforms new data to a standardized normal distribution
Usage
inv_gaussian_transform_cpp(z, x)
Arguments
z |
arma::mat. The data are the Gaussian Monte Carlo samples to transform. |
x |
arma::mat.
The data with the original transformation. Used to conduct the transformation of |
Value
arma::mat of the same dimension as z
Author(s)
Lars Henry Berge Olsen
Lag a matrix of variables a specific number of lags for each variable.
Description
Lag a matrix of variables a specific number of lags for each variable.
Usage
lag_data(x, lags)
Arguments
x |
The matrix of variables (one variable per column). |
lags |
A numeric vector denoting how many lags each variable should have. |
Value
A list with two items
A matrix, lagged with the lagged data.
A list, group, with groupings of the lagged data per variable.
(Generalized) Mahalanobis distance
Description
Used to get the Euclidean distance as well by setting mcov = diag(m).
Usage
mahalanobis_distance_cpp(
featureList,
Xtrain_mat,
Xexplain_mat,
mcov,
S_scale_dist
)
Arguments
featureList |
List. Contains the vectors indicating all factor combinations that should be included in the computations. Assumes that the first one is empty. |
Xtrain_mat |
Matrix Training data in matrix form |
Xexplain_mat |
Matrix Explanation data in matrix form. |
mcov |
matrix The covariance matrix of X. |
S_scale_dist |
logical. Indicating whether the Mahalanobis distance should be scaled with the number of variables |
Value
Array of three dimensions. Contains the squared distance between all training and test observations for all feature combinations passed to the function.
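The relation between the Mahalanobis and Euclidean distances mentioned in the description can be checked with base R. The sketch uses stats::mahalanobis(), which also returns squared distances, on made-up data; it does not call the package's C++ routine and ignores the S_scale_dist scaling.

```r
x_train <- matrix(c(0, 0,
                    1, 2,
                    3, 1), ncol = 2, byrow = TRUE) # made-up training rows
x_explain <- c(1, 1)                                # made-up observation to explain

# With the identity matrix as covariance, the squared Mahalanobis distance
# reduces to the squared Euclidean distance
d2_identity <- stats::mahalanobis(x_train, center = x_explain, cov = diag(2))
d2_euclid <- rowSums(sweep(x_train, 2, x_explain)^2)

# With the empirical covariance, correlated directions are down-weighted
mcov <- stats::cov(x_train)
d2_mahal <- stats::mahalanobis(x_train, center = x_explain, cov = mcov)
```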
Author(s)
Martin Jullum
Missing Completely at Random (MCAR) Mask Generator
Description
A mask generator which masks the entries in the input completely at random.
Usage
mcar_mask_generator(masking_ratio = 0.5, paired_sampling = FALSE)
Arguments
masking_ratio |
Numeric between 0 and 1. The probability for an entry in the generated mask to be 1 (masked). |
paired_sampling |
Boolean. If we are doing paired sampling. So include both S and |
Details
The mask generator masks each element in the batch (N x p) using a component-wise independent Bernoulli distribution with probability masking_ratio. The default value for masking_ratio is 0.5, so all masks are equally likely to be generated, including the empty and full masks. The function returns a mask of the same shape as the input batch, and the batch can contain missing values, indicated by the "NaN" token, which will always be masked.
Shape
Input: (N, p), where N is the number of observations in the batch and p is the number of features.
Output: (N, p), same shape as the input.
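The masking rule can be sketched in base R with matrices in place of torch tensors. The function mcar_mask is a hypothetical stand-in and the batch values are made up; the sketch only covers the Bernoulli masking and the NaN rule, not paired sampling.

```r
# Hypothetical base R stand-in for an MCAR mask generator
# (the package version works on torch tensors)
mcar_mask <- function(batch, masking_ratio = 0.5) {
  mask <- matrix(
    stats::rbinom(length(batch), size = 1, prob = masking_ratio),
    nrow = nrow(batch), ncol = ncol(batch)
  )
  mask[is.nan(batch)] <- 1 # missing entries are always masked
  mask
}

set.seed(123)
batch <- matrix(c(0.5, NaN, 1.2, -0.3, 2.1, 0.7), nrow = 2) # made-up N = 2, p = 3 batch
mask <- mcar_mask(batch)
```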
Author(s)
Lars Henry Berge Olsen
A torch::nn_module() Representing a Memory Layer
Description
The layer is used to make skip-connections inside a torch::nn_sequential() network or between several torch::nn_sequential() networks without unnecessary code complication.
Usage
memory_layer(id, shared_env, output = FALSE, add = FALSE, verbose = FALSE)
Arguments
id |
A unique id to use as a key in the storage list. |
shared_env |
A shared environment for all instances of memory_layer where the inputs are stored. |
output |
Boolean variable indicating if the memory layer is to store input in storage or extract from storage. |
add |
Boolean variable indicating if the extracted value is to be added or concatenated to the input.
Only applicable when |
verbose |
Boolean variable indicating if we want to give printouts to the user. |
Details
If output = FALSE, this layer stores its input in the shared_env with the key id and then passes the input to the next layer. This is the case when the memory layer is used in the masked encoder. If output = TRUE, this layer takes the stored tensor from the storage. This is the case when the memory layer is used in the decoder. If add = TRUE, it returns the sum of the stored tensor and the input, otherwise it returns their concatenation. If the tensor with the specified id is not in storage when the layer with output = TRUE is called, an exception is raised.
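The store-then-fetch behaviour can be mimicked in plain R with an environment and numeric vectors in place of torch tensors. The constructor make_memory_layer is hypothetical, and the order of concatenation (input first, stored value second) is an arbitrary choice made for the sketch.

```r
# Hypothetical plain R analogue of a memory layer, using numeric vectors instead of tensors
make_memory_layer <- function(id, shared_env, output = FALSE, add = FALSE) {
  function(input) {
    if (!output) {
      assign(id, input, envir = shared_env) # store and pass the input through
      return(input)
    }
    if (!exists(id, envir = shared_env, inherits = FALSE)) {
      stop("No stored value for id '", id, "'")
    }
    stored <- get(id, envir = shared_env)
    if (add) input + stored else c(input, stored)
  }
}

shared_env <- new.env()
store <- make_memory_layer("skip1", shared_env)
fetch_add <- make_memory_layer("skip1", shared_env, output = TRUE, add = TRUE)
fetch_cat <- make_memory_layer("skip1", shared_env, output = TRUE)

h <- store(c(1, 2, 3))      # encoder side: stores its input, returns it unchanged
fetch_add(c(10, 10, 10))    # decoder side, add = TRUE: 11 12 13
fetch_cat(c(10, 10, 10))    # decoder side, add = FALSE: a vector of length 6
```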
Author(s)
Lars Henry Berge Olsen
Check that the type of model is supported by the native implementation of the model class
Description
The function checks whether the model given by x is supported. If x is not a supported model, the function will return an error message; otherwise it returns NULL (meaning all types of models with this class are supported).
Usage
model_checker(x)
## Default S3 method:
model_checker(x)
## S3 method for class 'ar'
model_checker(x)
## S3 method for class 'Arima'
model_checker(x)
## S3 method for class 'forecast_ARIMA'
model_checker(x)
## S3 method for class 'glm'
model_checker(x)
## S3 method for class 'lm'
model_checker(x)
## S3 method for class 'gam'
model_checker(x)
## S3 method for class 'ranger'
model_checker(x)
## S3 method for class 'workflow'
model_checker(x)
## S3 method for class 'xgb.Booster'
model_checker(x)
Arguments
x |
Model object for the model to be explained. |
Value
Error or NULL
See Also
See predict_model() for more information about what type of models shapr currently supports.
Generate permutations of training data using test observations
Description
Generate permutations of training data using test observations
Usage
observation_impute(
W_kernel,
S,
x_train,
x_explain,
empirical.eta = 0.7,
n_MC_samples = 1000
)
Arguments
W_kernel |
Numeric matrix. Contains all nonscaled weights between training and test
observations for all coalitions. The dimension equals |
S |
Integer matrix of dimension |
x_train |
Data.table with training data. |
x_explain |
Data.table with the features of the observation whose predictions ought to be explained (test data). |
Value
data.table
Author(s)
Nikolai Sellereite
Get imputed data
Description
Get imputed data
Usage
observation_impute_cpp(index_xtrain, index_s, x_train, x_explain, S)
Arguments
index_xtrain |
Positive integer. Represents a sequence of row indices from |
index_s |
Positive integer. Represents a sequence of row indices from |
x_train |
Matrix. Contains the training data. |
x_explain |
Matrix with 1 row. Contains the features of the observation for a single prediction. |
S |
arma::mat.
Matrix of dimension ( |
Details
S(i, j) = 1 if and only if feature j is present in feature combination i, otherwise S(i, j) = 0. I.e. if m = 3, there are 2^3 = 8 unique ways to combine the features. In this case dim(S) = c(8, 3).
Let's call the features x1, x2, x3 and take a closer look at the combination represented by s = c(x1, x2). If this combination is represented by the second row, the following is true: S[2, 1:3] = c(1, 1, 0).
The returned object, X, is a numeric matrix where dim(X) = c(length(index_xtrain), ncol(x_train)). If feature j is present in the k-th observation, that is S[index_s[k], j] == 1, X[k, j] = x_explain[1, j]. Otherwise X[k, j] = x_train[index_xtrain[k], j].
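The imputation rule in Details translates directly into a short R loop. The function impute_rows is a hypothetical plain R version written for illustration (the package routine is implemented in C++), and the data are made up.

```r
# Hypothetical plain R version of the rule in Details
impute_rows <- function(index_xtrain, index_s, x_train, x_explain, S) {
  X <- x_train[index_xtrain, , drop = FALSE] # start from the selected training rows
  for (k in seq_along(index_xtrain)) {
    present <- S[index_s[k], ] == 1
    X[k, present] <- x_explain[1, present]   # overwrite the features in the coalition
  }
  X
}

S <- rbind(c(0, 0, 0), c(1, 1, 0), c(1, 1, 1))  # three of the 2^3 coalitions
x_train <- matrix(1:12, nrow = 4, ncol = 3)      # made-up training data
x_explain <- matrix(c(100, 200, 300), nrow = 1)  # made-up observation to explain

# Training rows 1 and 4, both combined with the coalition in row 2 of S
X <- impute_rows(c(1, 4), c(2, 2), x_train, x_explain, S)
X
```

In the result, the first two columns come from x_explain and the third column keeps the values from the chosen training rows.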
Value
Numeric matrix
Author(s)
Nikolai Sellereite
Sampling Paired Observations
Description
A sampler used to sample the batches where each instance is sampled twice.
Usage
paired_sampler(vaeac_dataset_object, shuffle = FALSE)
Arguments
vaeac_dataset_object |
A |
shuffle |
Boolean. If |
Details
A sampler object that allows for paired sampling by always including each observation from the vaeac_dataset() twice. A torch::sampler() object can be used with torch::dataloader() when creating batches from a torch dataset torch::dataset(). See https://rdrr.io/cran/torch/src/R/utils-data-sampler.R for more information. This function does not use batch iterators, which might increase the speed.
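The idea of paired sampling can be illustrated with plain index vectors. The sketch below is not the torch sampler itself; it assumes that the two copies of each observation are placed next to each other, and the helper name paired_indices is hypothetical.

```r
# Sketch of the index order produced by paired sampling.
# paired_indices() is a hypothetical helper; n is the dataset size.
paired_indices <- function(n, shuffle = FALSE) {
  idx <- if (shuffle) sample.int(n) else seq_len(n)
  rep(idx, each = 2) # every observation appears twice, back to back
}

paired_indices(3)
set.seed(1)
shuffled <- paired_indices(5, shuffle = TRUE)
table(shuffled) # each index still occurs exactly twice
```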
Author(s)
Lars Henry Berge Olsen
Plot of the Shapley value explanations
Description
Plots the individual prediction explanations.
Usage
## S3 method for class 'shapr'
plot(
x,
plot_type = "bar",
digits = 3,
index_x_explain = NULL,
top_k_features = NULL,
col = NULL,
bar_plot_phi0 = TRUE,
bar_plot_order = "largest_first",
scatter_features = NULL,
scatter_hist = TRUE,
include_group_feature_means = FALSE,
beeswarm_cex = 1/length(index_x_explain)^(1/4),
...
)
Arguments
x |
An |
plot_type |
Character.
Specifies the type of plot to produce.
|
digits |
Integer.
Number of significant digits to use in the feature description.
Applicable for |
index_x_explain |
Integer vector.
Which of the test observations to plot. E.g. if you have
explained 10 observations using |
top_k_features |
Integer.
How many features to include in the plot.
E.g. if you have 15 features in your model you can plot the 5 most important features,
for each explanation, by setting |
col |
Character vector (where length depends on plot type).
The color codes (hex codes or other names understood by If you want to alter the colors in the plot, the length of the |
bar_plot_phi0 |
Logical.
Whether to include |
bar_plot_order |
Character.
Specifies in what order to plot the features with respect to the magnitude of the Shapley values with
|
scatter_features |
Integer or character vector.
Only used for |
scatter_hist |
Logical.
Only used for |
include_group_feature_means |
Logical.
Whether to include the average feature value in a group on the y-axis or not.
If |
beeswarm_cex |
Numeric.
The cex argument of |
... |
Other arguments passed to underlying functions,
like |
Details
See the examples below, or vignette("general_usage", package = "shapr"), for examples of how to use the function.
Value
ggplot object with plots of the Shapley value explanations
Author(s)
Martin Jullum, Vilde Ung, Lars Henry Berge Olsen
Examples
if (requireNamespace("party", quietly = TRUE)) {
data("airquality")
airquality <- airquality[complete.cases(airquality), ]
x_var <- c("Solar.R", "Wind", "Temp", "Month")
y_var <- "Ozone"
# Split data into test- and training data
data_train <- head(airquality, -50)
data_explain <- tail(airquality, 50)
x_train <- data_train[, x_var]
x_explain <- data_explain[, x_var]
# Fit a linear model
lm_formula <- as.formula(paste0(y_var, " ~ ", paste0(x_var, collapse = " + ")))
model <- lm(lm_formula, data = data_train)
# Explain predictions
p <- mean(data_train[, y_var])
# Empirical approach
x <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "empirical",
phi0 = p,
n_MC_samples = 1e2
)
if (requireNamespace("ggplot2", quietly = TRUE) && requireNamespace("ggbeeswarm", quietly = TRUE)) {
# The default plotting option is a bar plot of the Shapley values
# We draw bar plots for the first 4 observations
plot(x, index_x_explain = 1:4)
# We can also make waterfall plots
plot(x, plot_type = "waterfall", index_x_explain = 1:4)
# And only showing the 2 features with largest contribution
plot(x, plot_type = "waterfall", index_x_explain = 1:4, top_k_features = 2)
# Or scatter plots showing the distribution of the Shapley values and feature values
plot(x, plot_type = "scatter")
# And only for a specific feature
plot(x, plot_type = "scatter", scatter_features = "Temp")
# Or a beeswarm plot summarising the Shapley values and feature values for all features
plot(x, plot_type = "beeswarm")
plot(x, plot_type = "beeswarm", col = c("red", "black")) # we can change colors
# Additional arguments can be passed to ggbeeswarm::geom_beeswarm() using the '...' argument.
# For instance, sometimes the beeswarm plots overlap too much.
# This can be fixed with the 'corral = "wrap"' argument.
# See ?ggbeeswarm::geom_beeswarm for more information.
plot(x, plot_type = "beeswarm", corral = "wrap")
}
# Example of scatter and beeswarm plot with factor variables
airquality$Month_factor <- as.factor(month.abb[airquality$Month])
airquality <- airquality[complete.cases(airquality), ]
x_var <- c("Solar.R", "Wind", "Temp", "Month_factor")
y_var <- "Ozone"
# Split data into test- and training data
data_train <- airquality
data_explain <- tail(airquality, 50)
x_train <- data_train[, x_var]
x_explain <- data_explain[, x_var]
# Fit a linear model
lm_formula <- as.formula(paste0(y_var, " ~ ", paste0(x_var, collapse = " + ")))
model <- lm(lm_formula, data = data_train)
# Explain predictions
p <- mean(data_train[, y_var])
# ctree approach
x <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "ctree",
phi0 = p,
n_MC_samples = 1e2
)
if (requireNamespace("ggplot2", quietly = TRUE) && requireNamespace("ggbeeswarm", quietly = TRUE)) {
plot(x, plot_type = "scatter")
plot(x, plot_type = "beeswarm")
}
}
Plots of the MSEv Evaluation Criterion
Description
Make plots to visualize and compare the MSEv evaluation criterion for a list of explain() objects applied to the same data and model. The function creates bar plots and line plots with points to illustrate the overall MSEv evaluation criterion, but also for each observation/explicand and coalition by only averaging over the coalitions and observations/explicands, respectively.
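The three levels of averaging can be sketched with a plain matrix. The values below are random placeholders, and the object names are hypothetical; the package computes the actual criterion from the explain() objects.

```r
# Hypothetical squared-error contributions: rows = explicands, columns = coalitions
set.seed(1)
sq_err <- matrix(
  runif(5 * 4),
  nrow = 5, ncol = 4,
  dimnames = list(paste0("explicand_", 1:5), paste0("coalition_", 1:4))
)

MSEv_overall <- mean(sq_err)       # average over both explicands and coalitions
MSEv_explicand <- rowMeans(sq_err) # one value per explicand (averaged over coalitions)
MSEv_coalition <- colMeans(sq_err) # one value per coalition (averaged over explicands)
```

Since every explicand has the same number of coalitions in this toy setup, the mean of the per-explicand values equals the overall value.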
Usage
plot_MSEv_eval_crit(
explanation_list,
index_x_explain = NULL,
id_coalition = NULL,
CI_level = if (length(explanation_list[[1]]$pred_explain) < 20) NULL else 0.95,
geom_col_width = 0.9,
plot_type = "overall"
)
Arguments
explanation_list |
A list of |
index_x_explain |
Integer vector.
Which of the test observations to plot. E.g. if you have
explained 10 observations using |
id_coalition |
Integer vector. Which of the coalitions to plot.
E.g. if you used |
CI_level |
Positive numeric between zero and one. Default is |
geom_col_width |
Numeric. Bar width. By default, set to 90% of the |
plot_type |
Character vector. The possible options are "overall" (default), "comb", and "explicand".
If |
Value
Either a single ggplot2::ggplot() object of the MSEv criterion when plot_type = "overall", or a list of ggplot2::ggplot() objects based on the plot_type parameter.
Author(s)
Lars Henry Berge Olsen
Examples
if (requireNamespace("xgboost", quietly = TRUE) && requireNamespace("ggplot2", quietly = TRUE)) {
# Get the data
data("airquality")
data <- data.table::as.data.table(airquality)
data <- data[complete.cases(data), ]
# Define the features and the response
x_var <- c("Solar.R", "Wind", "Temp", "Month")
y_var <- "Ozone"
# Split data into test and training data set
ind_x_explain <- 1:25
x_train <- data[-ind_x_explain, ..x_var]
y_train <- data[-ind_x_explain, get(y_var)]
x_explain <- data[ind_x_explain, ..x_var]
# Fitting a basic xgboost model to the training data
model <- xgboost::xgboost(
data = as.matrix(x_train),
label = y_train,
nround = 20,
verbose = FALSE
)
# Specifying the phi_0, i.e. the expected prediction without any features
phi0 <- mean(y_train)
# Independence approach
explanation_independence <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "independence",
phi0 = phi0,
n_MC_samples = 1e2
)
# Gaussian 1e1 approach
explanation_gaussian_1e1 <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "gaussian",
phi0 = phi0,
n_MC_samples = 1e1
)
# Gaussian 1e2 approach
explanation_gaussian_1e2 <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "gaussian",
phi0 = phi0,
n_MC_samples = 1e2
)
# ctree approach
explanation_ctree <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "ctree",
phi0 = phi0,
n_MC_samples = 1e2
)
# Combined approach
explanation_combined <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = c("gaussian", "independence", "ctree"),
phi0 = phi0,
n_MC_samples = 1e2
)
# Create a list of explanations with names
explanation_list_named <- list(
"Ind." = explanation_independence,
"Gaus. 1e1" = explanation_gaussian_1e1,
"Gaus. 1e2" = explanation_gaussian_1e2,
"Ctree" = explanation_ctree,
"Combined" = explanation_combined
)
# Create the default MSEv plot where we average over both the coalitions and observations
# with approximate 95% confidence intervals
plot_MSEv_eval_crit(explanation_list_named, CI_level = 0.95, plot_type = "overall")
# Can also create plots of the MSEv criterion averaged only over the coalitions or observations.
MSEv_figures <- plot_MSEv_eval_crit(explanation_list_named,
CI_level = 0.95,
plot_type = c("overall", "comb", "explicand")
)
MSEv_figures$MSEv_bar
MSEv_figures$MSEv_coalition_bar
MSEv_figures$MSEv_explicand_bar
# When there are many coalitions or observations, then it can be easier to look at line plots
MSEv_figures$MSEv_coalition_line_point
MSEv_figures$MSEv_explicand_line_point
# We can specify which observations or coalitions to plot
plot_MSEv_eval_crit(explanation_list_named,
plot_type = "explicand",
index_x_explain = c(1, 3:4, 6),
CI_level = 0.95
)$MSEv_explicand_bar
plot_MSEv_eval_crit(explanation_list_named,
plot_type = "comb",
id_coalition = c(3, 4, 9, 13:15),
CI_level = 0.95
)$MSEv_coalition_bar
# We can alter the figures if other palette schemes or design is wanted
bar_text_n_decimals <- 1
MSEv_figures$MSEv_bar +
ggplot2::scale_x_discrete(limits = rev(levels(MSEv_figures$MSEv_bar$data$Method))) +
ggplot2::coord_flip() +
ggplot2::scale_fill_discrete() + # Default ggplot2 palette
ggplot2::theme_minimal() + # This must be set before the other theme call
ggplot2::theme(
plot.title = ggplot2::element_text(size = 10),
legend.position = "bottom"
) +
ggplot2::guides(fill = ggplot2::guide_legend(nrow = 1, ncol = 6)) +
ggplot2::geom_text(
ggplot2::aes(label = sprintf(
paste("%.", sprintf("%d", bar_text_n_decimals), "f", sep = ""),
round(MSEv, bar_text_n_decimals)
)),
vjust = -1.1, # This value must be altered based on the plot dimension
hjust = 1.1, # This value must be altered based on the plot dimension
color = "black",
position = ggplot2::position_dodge(0.9),
size = 5
)
}
Shapley value bar plots for several explanation objects
Description
Make plots to visualize and compare the estimated Shapley values for a list of explain() objects applied to the same data and model. For group-wise Shapley values, the feature values plotted are the mean feature values for all features in each group.
Usage
plot_SV_several_approaches(
explanation_list,
index_explicands = NULL,
index_explicands_sort = FALSE,
only_these_features = NULL,
plot_phi0 = FALSE,
digits = 4,
add_zero_line = FALSE,
axis_labels_n_dodge = NULL,
axis_labels_rotate_angle = NULL,
horizontal_bars = TRUE,
facet_scales = "free",
facet_ncol = 2,
geom_col_width = 0.85,
brewer_palette = NULL,
include_group_feature_means = FALSE
)
Arguments
explanation_list |
A list of |
index_explicands |
Integer vector. Which of the explicands (test observations) to plot.
E.g. if you have explained 10 observations using |
index_explicands_sort |
Boolean. If |
only_these_features |
String vector. Containing the names of the features which are to be included in the bar plots. |
plot_phi0 |
Boolean. If we are to include the |
digits |
Integer.
Number of significant digits to use in the feature description.
Applicable for |
add_zero_line |
Boolean. If we are to add a black line for a feature contribution of 0. |
axis_labels_n_dodge |
Integer. The number of rows that should be used to render the labels. This is useful for displaying labels that would otherwise overlap. |
axis_labels_rotate_angle |
Numeric. The angle of the axis label, where 0 means horizontal, 45 means tilted,
and 90 means vertical. Compared to setting the angle in |
horizontal_bars |
Boolean. Flip Cartesian coordinates so that horizontal becomes vertical,
and vertical, horizontal. This is primarily useful for converting geoms and statistics which display
y conditional on x, to x conditional on y. See |
facet_scales |
Should scales be free (" |
facet_ncol |
Integer. The number of columns in the facet grid. Default is |
geom_col_width |
Numeric. Bar width. By default, set to 85% of the |
brewer_palette |
String. Name of one of the color palettes from
|
include_group_feature_means |
Logical. Whether to include the average feature value in a group on the
y-axis or not. If |
Value
A ggplot2::ggplot() object.
Author(s)
Lars Henry Berge Olsen
Examples
## Not run:
if (requireNamespace("xgboost", quietly = TRUE) && requireNamespace("ggplot2", quietly = TRUE)) {
# Get the data
data("airquality")
data <- data.table::as.data.table(airquality)
data <- data[complete.cases(data), ]
# Define the features and the response
x_var <- c("Solar.R", "Wind", "Temp", "Month")
y_var <- "Ozone"
# Split data into test and training data set
ind_x_explain <- 1:12
x_train <- data[-ind_x_explain, ..x_var]
y_train <- data[-ind_x_explain, get(y_var)]
x_explain <- data[ind_x_explain, ..x_var]
# Fitting a basic xgboost model to the training data
model <- xgboost::xgboost(
data = as.matrix(x_train),
label = y_train,
nround = 20,
verbose = FALSE
)
# Specifying the phi_0, i.e. the expected prediction without any features
phi0 <- mean(y_train)
# Independence approach
explanation_independence <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "independence",
phi0 = phi0,
n_MC_samples = 1e2
)
# Empirical approach
explanation_empirical <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "empirical",
phi0 = phi0,
n_MC_samples = 1e2
)
# Gaussian 1e1 approach
explanation_gaussian_1e1 <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "gaussian",
phi0 = phi0,
n_MC_samples = 1e1
)
# Gaussian 1e2 approach
explanation_gaussian_1e2 <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "gaussian",
phi0 = phi0,
n_MC_samples = 1e2
)
# Combined approach
explanation_combined <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = c("gaussian", "ctree", "empirical"),
phi0 = phi0,
n_MC_samples = 1e2
)
# Create a list of explanations with names
explanation_list <- list(
"Ind." = explanation_independence,
"Emp." = explanation_empirical,
"Gaus. 1e1" = explanation_gaussian_1e1,
"Gaus. 1e2" = explanation_gaussian_1e2,
"Combined" = explanation_combined
)
# The function uses the provided names.
plot_SV_several_approaches(explanation_list)
# We can change the number of columns in the grid of plots and add other visual alterations
plot_SV_several_approaches(explanation_list,
facet_ncol = 3,
facet_scales = "free_y",
add_zero_line = TRUE,
digits = 2,
brewer_palette = "Paired",
geom_col_width = 0.6
) +
ggplot2::theme_minimal() +
ggplot2::theme(legend.position = "bottom", plot.title = ggplot2::element_text(size = 0))
# We can specify which explicands to plot to get less chaotic plots and make the bars vertical
plot_SV_several_approaches(explanation_list,
index_explicands = c(1:2, 5, 10),
horizontal_bars = FALSE,
axis_labels_rotate_angle = 45
)
# We can change the order of the features by specifying the
# order using the `only_these_features` parameter.
plot_SV_several_approaches(explanation_list,
index_explicands = c(1:2, 5, 10),
only_these_features = c("Temp", "Solar.R", "Month", "Wind")
)
# We can also remove certain features if we are not interested in them
# or want to focus on, e.g., two features. The function will give a
# message if the user specifies invalid feature names.
plot_SV_several_approaches(explanation_list,
index_explicands = c(1:2, 5, 10),
only_these_features = c("Temp", "Solar.R"),
plot_phi0 = TRUE
)
}
## End(Not run)
Plot the training VLB and validation IWAE for vaeac models
Description
This function makes (ggplot2::ggplot()) figures of the training VLB and the validation IWAE for a list of explain() objects with approach = "vaeac". See setup_approach() for more information about the vaeac approach. Two figures are returned by the function. In the first figure, each object in explanation_list gets its own facet, while in the second figure, we plot the criteria in each facet for all objects.
Usage
plot_vaeac_eval_crit(
explanation_list,
plot_from_nth_epoch = 1,
plot_every_nth_epoch = 1,
criteria = c("VLB", "IWAE"),
plot_type = c("method", "criterion"),
facet_wrap_scales = "fixed",
facet_wrap_ncol = NULL
)
Arguments
explanation_list |
A list of |
plot_from_nth_epoch |
Integer. If we are only to plot the results from the nth epoch onward. The first epochs can be large in absolute value and make the rest of the plot difficult to interpret. |
plot_every_nth_epoch |
Integer. If we are only to plot every nth epoch. Useful to illustrate the overall trend, as there can be a lot of fluctuation and oscillation in the values between each epoch. |
criteria |
Character vector. The possible options are "VLB", "IWAE", "IWAE_running". Default is the first two. |
plot_type |
Character vector. The possible options are "method" and "criterion". Default is to plot both. |
facet_wrap_scales |
String. Should the scales be fixed (" |
facet_wrap_ncol |
Integer. Number of columns in the facet wrap. |
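How plot_from_nth_epoch and plot_every_nth_epoch could combine is sketched below. This is one plausible reading of the argument descriptions, not the package's code; in particular, the package may align the thinning differently (for example on multiples of n). The helper name select_epochs is hypothetical.

```r
# Sketch: which epochs remain after applying both arguments.
# select_epochs() is a hypothetical helper, not part of shapr.
select_epochs <- function(n_epochs, plot_from_nth_epoch = 1, plot_every_nth_epoch = 1) {
  seq(from = plot_from_nth_epoch, to = n_epochs, by = plot_every_nth_epoch)
}

kept <- select_epochs(n_epochs = 10, plot_from_nth_epoch = 2, plot_every_nth_epoch = 2)
kept
```

With 10 epochs, starting from epoch 2 and keeping every second epoch, this sketch retains epochs 2, 4, 6, 8, and 10.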
Details
See Olsen et al. (2022) or the blog post for a summary of the VLB and IWAE.
Value
Either a single ggplot2::ggplot() object or a list of ggplot2::ggplot() objects based on the plot_type parameter.
Author(s)
Lars Henry Berge Olsen
References
Examples
if (requireNamespace("xgboost", quietly = TRUE) &&
requireNamespace("torch", quietly = TRUE) &&
torch::torch_is_installed()) {
data("airquality")
data <- data.table::as.data.table(airquality)
data <- data[complete.cases(data), ]
x_var <- c("Solar.R", "Wind", "Temp", "Month")
y_var <- "Ozone"
ind_x_explain <- 1:6
x_train <- data[-ind_x_explain, ..x_var]
y_train <- data[-ind_x_explain, get(y_var)]
x_explain <- data[ind_x_explain, ..x_var]
# Fitting a basic xgboost model to the training data
model <- xgboost::xgboost(
data = as.matrix(x_train),
label = y_train,
nround = 100,
verbose = FALSE
)
# Specifying the phi_0, i.e. the expected prediction without any features
p0 <- mean(y_train)
# Train vaeac with and without paired sampling
explanation_paired <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "vaeac",
phi0 = p0,
n_MC_samples = 1, # As we are only interested in the training of the vaeac
vaeac.epochs = 10, # Should be higher in applications.
vaeac.n_vaeacs_initialize = 1,
vaeac.width = 16,
vaeac.depth = 2,
vaeac.extra_parameters = list(vaeac.paired_sampling = TRUE)
)
explanation_regular <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "vaeac",
phi0 = p0,
n_MC_samples = 1, # As we are only interested in the training of the vaeac
vaeac.epochs = 10, # Should be higher in applications.
vaeac.width = 16,
vaeac.depth = 2,
vaeac.n_vaeacs_initialize = 1,
vaeac.extra_parameters = list(vaeac.paired_sampling = FALSE)
)
# Collect the explanation objects in a named list
explanation_list <- list(
"Regular sampling" = explanation_regular,
"Paired sampling" = explanation_paired
)
# Call the function with the named list, will use the provided names
plot_vaeac_eval_crit(explanation_list = explanation_list)
# The function also works if we have only one method,
# but then one should only look at the method plot.
plot_vaeac_eval_crit(
explanation_list = explanation_list[2],
plot_type = "method"
)
# Can alter the plot
plot_vaeac_eval_crit(
explanation_list = explanation_list,
plot_from_nth_epoch = 2,
plot_every_nth_epoch = 2,
facet_wrap_scales = "free"
)
# If we only want the VLB
plot_vaeac_eval_crit(
explanation_list = explanation_list,
criteria = "VLB",
plot_type = "criterion"
)
# If we only want the criterion version
tmp_fig_criterion <-
plot_vaeac_eval_crit(explanation_list = explanation_list, plot_type = "criterion")
# Since tmp_fig_criterion is a ggplot2 object, we can alter it
# by, e.g., adding points or smooths with se bands
tmp_fig_criterion + ggplot2::geom_point(shape = "circle", size = 1, ggplot2::aes(col = Method))
tmp_fig_criterion$layers[[1]] <- NULL
tmp_fig_criterion + ggplot2::geom_smooth(method = "loess", formula = y ~ x, se = TRUE) +
ggplot2::scale_color_brewer(palette = "Set1") +
ggplot2::theme_minimal()
}
Plot Pairwise Plots for Imputed and True Data
Description
A function that creates a matrix of plots (GGally::ggpairs()) from generated imputations from the unconditioned distribution p(\boldsymbol{x}) estimated by a vaeac model, and then compares the imputed values with data from the true distribution (if provided). See ggpairs for an introduction to GGally::ggpairs(), and the corresponding vignette.
Usage
plot_vaeac_imputed_ggpairs(
explanation,
which_vaeac_model = "best",
x_true = NULL,
add_title = TRUE,
alpha = 0.5,
upper_cont = c("cor", "points", "smooth", "smooth_loess", "density", "blank"),
upper_cat = c("count", "cross", "ratio", "facetbar", "blank"),
upper_mix = c("box", "box_no_facet", "dot", "dot_no_facet", "facethist",
"facetdensity", "denstrip", "blank"),
lower_cont = c("points", "smooth", "smooth_loess", "density", "cor", "blank"),
lower_cat = c("facetbar", "ratio", "count", "cross", "blank"),
lower_mix = c("facetdensity", "box", "box_no_facet", "dot", "dot_no_facet",
"facethist", "denstrip", "blank"),
diag_cont = c("densityDiag", "barDiag", "blankDiag"),
diag_cat = c("barDiag", "blankDiag"),
cor_method = c("pearson", "kendall", "spearman")
)
Arguments
explanation |
Shapr list. The output list from the |
which_vaeac_model |
String. Indicating which |
x_true |
Data.table containing the data from the distribution that the |
add_title |
Logical. If |
alpha |
Numeric between |
upper_cont |
String. Type of plot to use in upper triangle for continuous features, see |
upper_cat |
String. Type of plot to use in upper triangle for categorical features, see |
upper_mix |
String. Type of plot to use in upper triangle for mixed features, see |
lower_cont |
String. Type of plot to use in lower triangle for continuous features, see |
lower_cat |
String. Type of plot to use in lower triangle for categorical features, see |
lower_mix |
String. Type of plot to use in lower triangle for mixed features, see |
diag_cont |
String. Type of plot to use on the diagonal for continuous features, see |
diag_cat |
String. Type of plot to use on the diagonal for categorical features, see |
cor_method |
String. Type of correlation measure, see |
Value
A GGally::ggpairs() figure.
Author(s)
Lars Henry Berge Olsen
References
Examples
if (requireNamespace("xgboost", quietly = TRUE) &&
requireNamespace("ggplot2", quietly = TRUE) &&
requireNamespace("torch", quietly = TRUE) &&
torch::torch_is_installed()) {
data("airquality")
data <- data.table::as.data.table(airquality)
data <- data[complete.cases(data), ]
x_var <- c("Solar.R", "Wind", "Temp", "Month")
y_var <- "Ozone"
ind_x_explain <- 1:6
x_train <- data[-ind_x_explain, ..x_var]
y_train <- data[-ind_x_explain, get(y_var)]
x_explain <- data[ind_x_explain, ..x_var]
# Fitting a basic xgboost model to the training data
model <- xgboost::xgboost(
data = as.matrix(x_train),
label = y_train,
nround = 100,
verbose = FALSE
)
explanation <- shapr::explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "vaeac",
phi0 = mean(y_train),
n_MC_samples = 1,
vaeac.epochs = 10,
vaeac.n_vaeacs_initialize = 1
)
# Plot the results
figure <- shapr::plot_vaeac_imputed_ggpairs(
explanation = explanation,
which_vaeac_model = "best",
x_true = x_train,
add_title = TRUE
)
figure
# Note that this is a ggplot2 object which we can alter, e.g., we can change the colors.
figure +
ggplot2::scale_color_manual(values = c("#E69F00", "#999999")) +
ggplot2::scale_fill_manual(values = c("#E69F00", "#999999"))
}
Generate predictions for input data with specified model
Description
Performs prediction of the response for stats::lm(), stats::glm(), ranger::ranger(), mgcv::gam(), workflows::workflow() (i.e., tidymodels models), and xgboost::xgb.train() with binary or continuous response. See details for more information.
Usage
predict_model(x, newdata, ...)
## Default S3 method:
predict_model(x, newdata, ...)
## S3 method for class 'ar'
predict_model(x, newdata, newreg, horizon, ...)
## S3 method for class 'Arima'
predict_model(
x,
newdata,
newreg,
horizon,
explain_idx,
explain_lags,
y,
xreg,
...
)
## S3 method for class 'forecast_ARIMA'
predict_model(x, newdata, newreg, horizon, ...)
## S3 method for class 'glm'
predict_model(x, newdata, ...)
## S3 method for class 'lm'
predict_model(x, newdata, ...)
## S3 method for class 'gam'
predict_model(x, newdata, ...)
## S3 method for class 'ranger'
predict_model(x, newdata, ...)
## S3 method for class 'workflow'
predict_model(x, newdata, ...)
## S3 method for class 'xgb.Booster'
predict_model(x, newdata, ...)
Arguments
x |
Model object for the model to be explained. |
newdata |
A data.frame/data.table with the features to predict from. |
... |
|
horizon |
Numeric.
The forecast horizon to explain. Passed to the |
explain_idx |
Numeric vector. The row indices in data and reg denoting points in time to explain. |
y |
Matrix, data.frame/data.table or a numeric vector. Contains the endogenous variables used to estimate the (conditional) distributions needed to properly estimate the conditional expectations in the Shapley formula including the observations to be explained. |
xreg |
Matrix, data.frame/data.table or a numeric vector. Contains the exogenous variables used to estimate the (conditional) distributions needed to properly estimate the conditional expectations in the Shapley formula including the observations to be explained. As exogenous variables are used contemporaneously when producing a forecast, this item should contain nrow(y) + horizon rows. |
Details
The natively supported models are those listed in the Description above.
If you have a binary classification model we'll always return the probability prediction for a single class.
If you are explaining a model that is not supported natively, you need to create the predict_model() function yourself and pass it as an argument to explain().
For more details on how to explain such non-supported models (i.e. custom models), see the Advanced usage section of the general usage vignette:
From R: vignette("general_usage", package = "shapr")
Web: https://norskregnesentral.github.io/shapr/articles/general_usage.html#explain-custom-models
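As a rough illustration of what such a user-written prediction function could look like, the sketch below wraps a stats::loess() model, which is not among the models listed in the Description. The function name my_predict_model is hypothetical; the only contract assumed here is the one stated under Value: a numeric vector with one entry per row of newdata.

```r
# Sketch of a custom prediction function for a model class that is not
# covered by the methods listed in Usage (here: stats::loess).
data("airquality")
aq <- airquality[complete.cases(airquality), ]

model <- loess(Ozone ~ Wind + Temp, data = aq)

# Hypothetical name; takes the model and the new data, returns a numeric vector
my_predict_model <- function(x, newdata) {
  as.numeric(predict(x, newdata = as.data.frame(newdata)))
}

preds <- my_predict_model(model, head(aq[, c("Wind", "Temp")], 3))
preds
```

The vignette referenced above documents how such a function is handed to explain().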
Value
Numeric. Vector of size equal to the number of rows in newdata.
Author(s)
Martin Jullum
Examples
# Load example data
data("airquality")
airquality <- airquality[complete.cases(airquality), ]
# Split data into test- and training data
x_train <- head(airquality, -3)
x_explain <- tail(airquality, 3)
# Fit a linear model
model <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = x_train)
# Predicting for a model with a standardized format
predict_model(x = model, newdata = x_explain)
Generate data used for predictions and Monte Carlo integration
Description
Generate data used for predictions and Monte Carlo integration
Usage
prepare_data(internal, index_features = NULL, ...)
## S3 method for class 'categorical'
prepare_data(internal, index_features = NULL, ...)
## S3 method for class 'copula'
prepare_data(internal, index_features, ...)
## S3 method for class 'ctree'
prepare_data(internal, index_features = NULL, ...)
## S3 method for class 'empirical'
prepare_data(internal, index_features = NULL, ...)
## S3 method for class 'gaussian'
prepare_data(internal, index_features, ...)
## S3 method for class 'independence'
prepare_data(internal, index_features = NULL, ...)
## S3 method for class 'regression_separate'
prepare_data(internal, index_features = NULL, ...)
## S3 method for class 'regression_surrogate'
prepare_data(internal, index_features = NULL, ...)
## S3 method for class 'timeseries'
prepare_data(internal, index_features = NULL, ...)
## S3 method for class 'vaeac'
prepare_data(internal, index_features = NULL, ...)
Arguments
internal |
List.
Not used directly, but passed through from |
index_features |
Positive integer vector. Specifies the id_coalition to
apply to the present method. |
... |
Currently not used. |
Value
A data.table containing simulated data used to estimate the contribution function by Monte Carlo integration.
Author(s)
Martin Jullum
Annabelle Redelmeier and Lars Henry Berge Olsen
Lars Henry Berge Olsen
Martin Jullum,
Generate data used for predictions and Monte Carlo integration for causal Shapley values
Description
This function loops over the given coalitions, and for each coalition it extracts the chain of relevant sampling steps provided in internal$object$S_causal. This chain can contain sampling from marginal and conditional distributions. We use the approach given by internal$parameters$approach to generate the samples from the conditional distributions, and we iteratively call prepare_data() with a modified internal_copy list to reuse code. However, this also means that chains with the same conditional distributions will retrain a model of said conditional distributions several times.
For the marginal distribution, we sample from the Gaussian marginals when the approach is gaussian and from the marginals of the training data for all other approaches. Note that we could extend the code to sample from the marginal (Gaussian) copula, too, when approach is copula.
Usage
prepare_data_causal(internal, index_features = NULL, ...)
Arguments
internal |
List.
Not used directly, but passed through from |
index_features |
Positive integer vector. Specifies the id_coalition to
apply to the present method. |
... |
Currently not used. |
Value
A data.table containing simulated data that respects the (partial) causal ordering and the confounding assumptions. The data is used to estimate the contribution function by Monte Carlo integration.
Author(s)
Lars Henry Berge Olsen
Generate (Gaussian) Copula MC samples
Description
Generate (Gaussian) Copula MC samples
Usage
prepare_data_copula_cpp(
MC_samples_mat,
x_explain_mat,
x_explain_gaussian_mat,
x_train_mat,
S,
mu,
cov_mat
)
Arguments
MC_samples_mat |
arma::mat.
Matrix of dimension ( |
x_explain_mat |
arma::mat.
Matrix of dimension ( |
x_explain_gaussian_mat |
arma::mat.
Matrix of dimension ( |
x_train_mat |
arma::mat.
Matrix of dimension ( |
S |
arma::mat.
Matrix of dimension ( |
mu |
arma::vec.
Vector of length |
cov_mat |
arma::mat.
Matrix of dimension ( |
Value
An arma::cube/3D array of dimension (n_MC_samples, n_explain * n_coalitions, n_features), where the columns (,j,) are matrices of dimension (n_MC_samples, n_features) containing the conditional Gaussian copula MC samples for each explicand and coalition on the original scale.
Author(s)
Lars Henry Berge Olsen
Generate (Gaussian) Copula MC samples for the causal setup with a single MC sample for each explicand
Description
Generate (Gaussian) Copula MC samples for the causal setup with a single MC sample for each explicand
Usage
prepare_data_copula_cpp_caus(
MC_samples_mat,
x_explain_mat,
x_explain_gaussian_mat,
x_train_mat,
S,
mu,
cov_mat
)
Arguments
MC_samples_mat |
arma::mat.
Matrix of dimension ( |
x_explain_mat |
arma::mat.
Matrix of dimension ( |
x_explain_gaussian_mat |
arma::mat.
Matrix of dimension ( |
x_train_mat |
arma::mat.
Matrix of dimension ( |
S |
arma::mat.
Matrix of dimension ( |
mu |
arma::vec.
Vector of length |
cov_mat |
arma::mat.
Matrix of dimension ( |
Value
An arma::cube/3D array of dimension (n_MC_samples, n_explain * n_coalitions, n_features), where the columns (,j,) are matrices of dimension (n_MC_samples, n_features) containing the conditional Gaussian copula MC samples for each explicand and coalition on the original scale.
Author(s)
Lars Henry Berge Olsen
Generate Gaussian MC samples
Description
Generate Gaussian MC samples
Usage
prepare_data_gaussian_cpp(MC_samples_mat, x_explain_mat, S, mu, cov_mat)
Arguments
MC_samples_mat |
arma::mat.
Matrix of dimension ( |
x_explain_mat |
arma::mat.
Matrix of dimension ( |
S |
arma::mat.
Matrix of dimension ( |
mu |
arma::vec.
Vector of length |
cov_mat |
arma::mat.
Matrix of dimension ( |
Value
An arma::cube/3D array of dimension (n_MC_samples, n_explain * n_coalitions, n_features), where the columns (,j,) are matrices of dimension (n_MC_samples, n_features) containing the conditional Gaussian MC samples for each explicand and coalition.
Author(s)
Lars Henry Berge Olsen
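The conditional Gaussian sampling described above can be illustrated in plain R. This is only a sketch and not the package's C++ code: the helper name cond_gaussian_samples, the example values, and the convention that a 1 in the coalition row marks a feature that is conditioned on are all assumptions made for illustration.

```r
# Illustrative sketch only; `cond_gaussian_samples` is a hypothetical helper,
# not a function exported by shapr.
cond_gaussian_samples <- function(x, S_row, mu, cov_mat, n_MC_samples) {
  given <- which(S_row == 1)   # features in the coalition (conditioned on)
  dep <- which(S_row == 0)     # features to be sampled
  # Start from the explicand repeated in every row; given features stay fixed
  out <- matrix(x, nrow = n_MC_samples, ncol = length(x), byrow = TRUE)
  if (length(dep) == 0) return(out)
  if (length(given) == 0) {
    cond_mu <- mu[dep]
    cond_cov <- cov_mat[dep, dep, drop = FALSE]
  } else {
    # Regression coefficients of the dependent features on the given ones
    B <- cov_mat[dep, given, drop = FALSE] %*%
      solve(cov_mat[given, given, drop = FALSE])
    cond_mu <- mu[dep] + as.vector(B %*% (x[given] - mu[given]))
    cond_cov <- cov_mat[dep, dep, drop = FALSE] -
      B %*% cov_mat[given, dep, drop = FALSE]
  }
  cond_cov <- (cond_cov + t(cond_cov)) / 2   # enforce symmetry numerically
  # Draw standard normals and transform them with the Cholesky factor
  Z <- matrix(stats::rnorm(n_MC_samples * length(dep)), nrow = n_MC_samples)
  out[, dep] <- sweep(Z %*% chol(cond_cov), 2, cond_mu, "+")
  out
}

set.seed(1)
mu <- c(0, 1, 2)
cov_mat <- matrix(c(1, 0.5, 0.2, 0.5, 1, 0.3, 0.2, 0.3, 1), nrow = 3)
x <- c(0.3, 1.2, 1.5)
samples <- cond_gaussian_samples(x, S_row = c(1, 0, 0), mu, cov_mat,
                                 n_MC_samples = 5)
```

Each row of samples keeps the conditioned feature at its observed value and replaces the remaining features with draws from the conditional Gaussian distribution.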
Generate Gaussian MC samples for the causal setup with a single MC sample for each explicand
Description
Generate Gaussian MC samples for the causal setup with a single MC sample for each explicand
Usage
prepare_data_gaussian_cpp_caus(MC_samples_mat, x_explain_mat, S, mu, cov_mat)
Arguments
MC_samples_mat |
arma::mat.
Matrix of dimension ( |
x_explain_mat |
arma::mat.
Matrix of dimension ( |
S |
arma::mat.
Matrix of dimension ( |
mu |
arma::vec.
Vector of length |
cov_mat |
arma::mat.
Matrix of dimension ( |
Value
An arma::cube/3D array of dimension (n_MC_samples
, n_explain
* n_coalitions
, n_features
), where
the columns (,j,) are matrices of dimension (n_MC_samples
, n_features
) containing the conditional Gaussian
MC samples for each explicand and coalition.
Author(s)
Lars Henry Berge Olsen
Compute the conditional probabilities for a single coalition for the categorical approach
Description
The prepare_data.categorical()
function is slow when evaluated for a single coalition.
This is a bottleneck for causal Shapley values, which call this function many times with single coalitions.
Usage
prepare_data_single_coalition(internal, index_features)
Arguments
internal |
List.
Holds all parameters, data, functions and computed objects used within |
Author(s)
Lars Henry Berge Olsen
Prepares the next iteration of the iterative sampling algorithm
Description
Prepares the next iteration of the iterative sampling algorithm
Usage
prepare_next_iteration(internal)
Arguments
internal |
List.
Not used directly, but passed through from |
Value
The (updated) internal list
Print method for shapr objects
Description
Print method for shapr objects
Usage
## S3 method for class 'shapr'
print(x, digits = 4, ...)
Arguments
x |
A shapr object |
digits |
Scalar Integer. Number of digits to display to the console |
... |
Unused |
Value
No return value (but prints the Shapley values to the console)
Prints iterative information
Description
Prints iterative information
Usage
print_iter(internal)
Arguments
internal |
List.
Not used directly, but passed through from |
Value
No return value (but prints iterative information)
Treat factors as numeric values
Description
Factors are given a numeric value above the highest numeric value in the data. The values of the different levels are sorted by factor and then by level.
Usage
process_factor_data(dt, factor_cols)
Arguments
dt |
data.table to plot |
factor_cols |
Columns that are factors or character |
Value
A list of a lookup table with each factor and level and its numeric value, a data.table very similar to the input data, but now with numeric values for factors, and the maximum feature value.
Compute the quantiles using quantile type seven
Description
Compute the quantiles using quantile type seven
Usage
quantile_type7_cpp(x, probs)
Arguments
x |
arma::vec. Numeric vector whose sample quantiles are wanted. |
probs |
arma::vec. Numeric vector of probabilities with values between zero and one. |
Details
Using quantile type number seven from stats::quantile in R.
Value
A vector of length length(probs)
with the quantiles is returned.
Author(s)
Lars Henry Berge Olsen
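As a rough illustration of the type-seven definition referred to in the Details, the quantile at probability p is a linear interpolation between the order statistics around position (n - 1) * p + 1. The helper quantile_type7 below is a hypothetical pure-R sketch, not the package's C++ routine, and is compared against stats::quantile with type = 7.

```r
# Hypothetical pure-R helper mirroring the type-7 definition; not part of shapr.
quantile_type7 <- function(x, probs) {
  x <- sort(x)
  n <- length(x)
  h <- (n - 1) * probs + 1        # fractional (1-based) position
  lo <- floor(h)
  hi <- pmin(lo + 1, n)           # guard the upper end when probs == 1
  x[lo] + (h - lo) * (x[hi] - x[lo])
}

x <- c(2.5, 0.1, 7.3, 4.4, 1.9)
probs <- c(0, 0.25, 0.5, 0.9, 1)
agree <- all.equal(
  quantile_type7(x, probs),
  unname(stats::quantile(x, probs, type = 7))
)
```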
Set up exogenous regressors for explanation in a forecast model.
Description
Set up exogenous regressors for explanation in a forecast model.
Usage
reg_forecast_setup(x, horizon, group)
Arguments
x |
A matrix with the exogenous variables. |
horizon |
Numeric.
The forecast horizon to explain. Passed to the |
group |
The list of endogenous groups, to append exogenous groups to. |
Value
A list containing
fcast A matrix containing the exogenous observations needed for each observation.
group The list group with the exogenous groups appended.
Check that needed libraries are installed
Description
This function checks that the parsnip
, recipes
, workflows
, tune
, dials
,
yardstick
, hardhat
and rsample
packages are available.
Usage
regression.check_namespaces()
Author(s)
Lars Henry Berge Olsen
Check regression parameters
Description
Check regression parameters
Usage
regression.check_parameters(internal)
Arguments
internal |
List.
Holds all parameters, data, functions and computed objects used within |
Value
The same internal
list, but with the added logical indicator internal$parameters$regression.tune
indicating whether we are to tune the regression model(s).
Author(s)
Lars Henry Berge Olsen
Check regression.recipe_func
Description
Check that regression.recipe_func is a function that returns the RHS of the formula for arbitrary feature name inputs.
Usage
regression.check_recipe_func(regression.recipe_func, x_explain)
Arguments
regression.recipe_func |
Either |
x_explain |
Data.table with the features of the observation whose predictions ought to be explained (test data). |
Author(s)
Lars Henry Berge Olsen
Check the regression.surrogate_n_comb
parameter
Description
Check that regression.surrogate_n_comb
is either NULL or a valid integer.
Usage
regression.check_sur_n_comb(regression.surrogate_n_comb, n_coalitions)
Arguments
regression.surrogate_n_comb |
Positive integer. Specifies the number of unique coalitions to apply to each training observation. The default is the number of sampled coalitions in the present iteration. Any integer between 1 and the default is allowed. Larger values require more memory, but may improve the surrogate model. If the user sets a value lower than the maximum, we sample this number of unique coalitions separately for each training observation. That is, on average, all coalitions should be equally trained. |
n_coalitions |
Integer. The number of used coalitions (including the empty and grand coalition). |
Author(s)
Lars Henry Berge Olsen
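A check of this kind could look roughly like the sketch below. The function name check_sur_n_comb is hypothetical, and the upper bound n_coalitions - 2 is an assumption based on the default value shown in the usage of setup_approach() for the regression_surrogate class; the package's own check may differ.

```r
# Hypothetical validator; the upper bound n_coalitions - 2 is an assumption
# taken from the default shown for setup_approach().
check_sur_n_comb <- function(regression.surrogate_n_comb, n_coalitions) {
  if (is.null(regression.surrogate_n_comb)) return(invisible(TRUE))
  val <- regression.surrogate_n_comb
  ok <- is.numeric(val) &&
    length(val) == 1 &&
    !is.na(val) &&
    val == round(val) &&            # must be a whole number
    val >= 1 &&
    val <= n_coalitions - 2         # excludes the empty and grand coalitions
  if (!ok) {
    stop("`regression.surrogate_n_comb` must be NULL or an integer in [1, n_coalitions - 2].")
  }
  invisible(TRUE)
}

check_sur_n_comb(NULL, n_coalitions = 10)
check_sur_n_comb(8, n_coalitions = 10)
```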
Check the parameters that are sent to rsample::vfold_cv()
Description
Check that regression.vfold_cv_para
is either NULL or a named list that only contains recognized parameters.
Usage
regression.check_vfold_cv_para(regression.vfold_cv_para)
Arguments
regression.vfold_cv_para |
Either |
Author(s)
Lars Henry Berge Olsen
Produce message about which batch prepare_data is working on
Description
Produce message about which batch prepare_data is working on
Usage
regression.cv_message(
regression.results,
regression.grid,
n_cv = 10,
current_comb
)
Arguments
regression.results |
The results of the CV procedures. |
regression.grid |
Object containing the hyperparameter values. |
n_cv |
Integer (default is 10) specifying the number of CV hyperparameter configurations to print. |
current_comb |
Integer vector. The current combination of features, passed to verbosity printing function. |
Author(s)
Lars Henry Berge Olsen
Convert the string into an R object
Description
Convert the string into an R object
Usage
regression.get_string_to_R(string)
Arguments
string |
A character vector/string containing the text to convert into R code. |
Author(s)
Lars Henry Berge Olsen
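One common base-R way to turn a string into an R object is to parse the text and evaluate the resulting expression. Whether the package does exactly this is an assumption; the helper name string_to_R and the example string are hypothetical.

```r
# Hypothetical helper; shown only to illustrate the parse-then-evaluate idea.
string_to_R <- function(string) {
  eval(parse(text = string))
}

grid <- string_to_R("expand.grid(penalty = c(0.01, 0.1), mixture = c(0, 1))")
```

Because the string is executed as code, this pattern should only be applied to trusted input.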
Get if model is to be tuned
Description
That is, if the regression model contains hyperparameters we are to tune using cross validation. See tidymodels for default model hyperparameters.
Usage
regression.get_tune(regression.model, regression.tune_values, x_train)
Arguments
regression.model |
A |
regression.tune_values |
Either |
x_train |
Data.table with training data. |
Value
A boolean variable indicating if the regression model is to be tuned.
Author(s)
Lars Henry Berge Olsen
Get the predicted responses
Description
Get the predicted responses
Usage
regression.get_y_hat(internal, model, predict_model)
Arguments
internal |
List.
Holds all parameters, data, functions and computed objects used within |
model |
Objects.
The model object that ought to be explained.
See the documentation of |
predict_model |
Function.
The prediction function used when |
Value
The same internal
list, but with the added vectors internal$data$x_train_y_hat
and
internal$data$x_explain_y_hat
containing the predicted responses for the training and explain data.
Author(s)
Lars Henry Berge Olsen
Augment the training data and the explicands
Description
Augment the training data and the explicands
Usage
regression.surrogate_aug_data(
internal,
x,
y_hat = NULL,
index_features = NULL,
augment_masks_as_factor = FALSE,
augment_include_grand = FALSE,
augment_add_id_coal = FALSE,
augment_comb_prob = NULL,
augment_weights = NULL
)
Arguments
internal |
List.
Holds all parameters, data, functions and computed objects used within |
x |
Data.table containing the training data. |
y_hat |
Vector of numerics (optional) containing the predicted responses for the observations in |
index_features |
Array of integers (optional) containing which coalitions to consider. Must be provided if
|
augment_masks_as_factor |
Logical (default is |
augment_include_grand |
Logical (default is |
augment_add_id_coal |
Logical (default is |
augment_comb_prob |
Array of numerics (default is |
augment_weights |
String (optional). Specifying which type of weights to add to the observations.
If |
Value
A data.table containing the augmented data.
Author(s)
Lars Henry Berge Olsen
Train a tidymodels model via workflows
Description
Function that trains a tidymodels
model via workflows
based on the provided input parameters.
This function allows for cross validating the hyperparameters of the model.
Usage
regression.train_model(
x,
seed = 1,
verbose = NULL,
regression.model = parsnip::linear_reg(),
regression.tune = FALSE,
regression.tune_values = NULL,
regression.vfold_cv_para = NULL,
regression.recipe_func = NULL,
regression.response_var = "y_hat",
regression.surrogate_n_comb = NULL,
current_comb = NULL
)
Arguments
x |
Data.table containing the training data. |
seed |
Positive integer.
Specifies the seed set before any randomness-based code is run.
If |
verbose |
String vector or NULL.
Specifies the verbosity (printout detail level) through one or more of strings |
regression.model |
A |
regression.tune |
Logical (default is |
regression.tune_values |
Either |
regression.vfold_cv_para |
Either |
regression.recipe_func |
Either |
regression.response_var |
String (default is |
regression.surrogate_n_comb |
Integer (default is |
current_comb |
Integer vector. The current combination of features, passed to verbosity printing function. |
Value
A trained tidymodels
model based on the provided input parameters.
Author(s)
Lars Henry Berge Olsen
Auxiliary function for the vignettes
Description
Function that asks whether the main and vaeac vignettes have been built using the
rebuild-long-running-vignette.R
function.
This is only useful when using devtools to release shapr
to CRAN.
See devtools::release()
for more information.
Usage
release_questions()
Function for computing sigma_hat_sq
Description
Function for computing sigma_hat_sq
Usage
rss_cpp(H, y)
Arguments
H |
Matrix.
Output from |
y |
Vector. Represents the (temporary) response variable |
Value
Scalar
Author(s)
Martin Jullum
Get table with sampled coalitions using the semi-deterministic sampling approach
Description
Get table with sampled coalitions using the semi-deterministic sampling approach
Usage
sample_coalition_table(
m,
n_coalitions = 200,
n_coal_each_size = choose(m, seq(m - 1)),
weight_zero_m = 10^6,
paired_shap_sampling = TRUE,
prev_X = NULL,
kernelSHAP_reweighting = "on_all_cond",
semi_deterministic_sampling = FALSE,
dt_coal_samp_info = NULL,
dt_valid_causal_coalitions = NULL,
n_samps_scale = 10
)
Arguments
m |
Positive integer. Total number of features/groups. |
n_coalitions |
Positive integer.
Note that if |
n_coal_each_size |
Vector of integers of length |
weight_zero_m |
Numeric. The value to use as a replacement for infinite coalition weights when doing numerical operations. |
paired_shap_sampling |
Logical. Whether to do paired sampling of coalitions. |
prev_X |
data.table. The X data.table from the previous iteration. |
kernelSHAP_reweighting |
String.
How to reweight the sampling frequency weights in the kernelSHAP solution after sampling.
The aim of this is to reduce the randomness and thereby the variance of the Shapley value estimates.
The options are one of |
semi_deterministic_sampling |
Logical.
If |
dt_coal_samp_info |
data.table. The data.table contains information about which coalitions should be
deterministically included and which can be sampled, in addition to the sampling probabilities of each available
coalition size, and the weight given to the sampled and deterministically included coalitions (excluding empty and
grand coalitions which are given the |
dt_valid_causal_coalitions |
data.table. Only applicable for asymmetric Shapley
values explanations, and is |
n_samps_scale |
Positive integer.
Integer that scales the number of coalitions |
We here return a vector of strings/characters, i.e., a CharacterVector, where each string is a space-separated list of integers.
Description
We here return a vector of strings/characters, i.e., a CharacterVector, where each string is a space-separated list of integers.
Usage
sample_coalitions_cpp_str_paired(m, n_coalitions, paired_shap_sampling = TRUE)
Arguments
m |
Positive integer. Total number of features/groups. |
n_coalitions |
IntegerVector. The number of features to sample for each feature combination. |
paired_shap_sampling |
Logical. Whether to do paired sampling of coalitions. |
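The output format and the idea of paired sampling can be sketched in plain R. The function below is a hypothetical analogue, not the package's C++ code: for each requested coalition size it draws a coalition and, when pairing is on, also emits the complementary coalition, with each coalition written as a space-separated string of feature indices.

```r
# Hypothetical pure-R analogue; the name and details are assumptions.
sample_coalitions_str <- function(m, n_features_each,
                                  paired_shap_sampling = TRUE) {
  out <- character(0)
  for (k in n_features_each) {
    coal <- sort(sample.int(m, k))            # k distinct features out of m
    out <- c(out, paste(coal, collapse = " "))
    if (paired_shap_sampling) {
      comp <- setdiff(seq_len(m), coal)       # complement coalition
      out <- c(out, paste(comp, collapse = " "))
    }
  }
  out
}

set.seed(123)
res <- sample_coalitions_str(m = 5, n_features_each = c(2, 1, 3))
```

With pairing enabled, consecutive entries of res form a coalition and its complement, so together they cover all m features exactly once.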
Helper function to sample a combination of training and testing rows, which does not risk getting the same observation twice. Need to improve this help file.
Description
Helper function to sample a combination of training and testing rows, which does not risk getting the same observation twice. Need to improve this help file.
Usage
sample_combinations(ntrain, ntest, nsamples, joint_sampling = TRUE)
Arguments
ntrain |
Positive integer. Number of training observations to sample from. |
ntest |
Positive integer. Number of test observations to sample from. |
nsamples |
Positive integer. Number of samples. |
joint_sampling |
Logical. Indicates whether train- and test data should be sampled
separately or in a joint sampling space. If they are sampled separately (which typically
would be used when optimizing more than one distribution at once) we sample with replacement
if |
Value
data.frame
Author(s)
Martin Jullum
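One way to sample train/test row combinations without drawing the same pair twice is to sample without replacement from the joint index space of size ntrain * ntest and decode each index into a pair. The sketch below shows that idea under stated assumptions; the function name, the column names, and the decoding scheme are hypothetical and may not match the package's implementation.

```r
# Hypothetical sketch of joint sampling; not the shapr implementation.
sample_pairs_jointly <- function(ntrain, ntest, nsamples) {
  n_total <- ntrain * ntest
  stopifnot(nsamples <= n_total)
  idx <- sample.int(n_total, nsamples) - 1L   # distinct 0-based pair indices
  data.frame(
    samp_train = idx %% ntrain + 1L,          # decode the training row
    samp_test = idx %/% ntrain + 1L           # decode the test row
  )
}

set.seed(42)
pairs <- sample_pairs_jointly(ntrain = 4, ntest = 3, nsamples = 6)
```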
Sample ctree variables from a given conditional inference tree
Description
Sample ctree variables from a given conditional inference tree
Usage
sample_ctree(tree, n_MC_samples, x_explain, x_train, n_features, sample)
Arguments
tree |
List. Contains tree which is an object of type ctree built from the party package. Also contains given_ind, the features to condition upon. |
n_MC_samples |
Scalar integer.
Corresponds to the number of samples from the leaf node.
See an exception when sample = FALSE in |
x_explain |
Data.table with the features of the observation whose predictions ought to be explained (test data). |
x_train |
Data.table with training data. |
n_features |
Positive integer. The number of features. |
Details
See the documentation of the setup_approach.ctree()
function for undocumented parameters.
Value
data.table with n_MC_samples
(conditional) Gaussian samples
Author(s)
Annabelle Redelmeier
Saves the intermediate results to disk
Description
Saves the intermediate results to disk
Usage
save_results(internal)
Arguments
internal |
List.
Not used directly, but passed through from |
Value
No return value (but saves the intermediate results to disk)
check_setup
Description
check_setup
Usage
setup(
x_train,
x_explain,
approach,
phi0,
output_size = 1,
max_n_coalitions,
group,
n_MC_samples,
seed,
feature_specs,
type = "regular",
horizon = NULL,
y = NULL,
xreg = NULL,
train_idx = NULL,
explain_idx = NULL,
explain_y_lags = NULL,
explain_xreg_lags = NULL,
group_lags = NULL,
verbose,
iterative = NULL,
iterative_args = list(),
is_python = FALSE,
testing = FALSE,
init_time = NULL,
prev_shapr_object = NULL,
asymmetric = FALSE,
causal_ordering = NULL,
confounding = NULL,
output_args = list(),
extra_computation_args = list(),
...
)
Arguments
x_train |
Matrix or data.frame/data.table. Contains the data used to estimate the (conditional) distributions for the features needed to properly estimate the conditional expectations in the Shapley formula. |
x_explain |
Matrix or data.frame/data.table. Contains the features, whose predictions ought to be explained. |
approach |
Character vector of length |
phi0 |
Numeric. The prediction value for unseen data, i.e. an estimate of the expected prediction without conditioning on any features. Typically we set this value equal to the mean of the response variable in our training data, but other choices such as the mean of the predictions in the training data are also reasonable. |
output_size |
Scalar integer. Specifies the dimension of the output from the prediction model for every observation. |
max_n_coalitions |
Integer.
The upper limit on the number of unique feature/group coalitions to use in the iterative procedure
(if |
group |
List.
If |
n_MC_samples |
Positive integer.
For most approaches, it indicates the maximum number of samples to use in the Monte Carlo integration
of every conditional expectation.
For |
seed |
Positive integer.
Specifies the seed set before any randomness-based code is run.
If |
feature_specs |
List. The output from
|
type |
Character.
Either "regular" or "forecast" corresponding to function |
horizon |
Numeric.
The forecast horizon to explain. Passed to the |
y |
Matrix, data.frame/data.table or a numeric vector. Contains the endogenous variables used to estimate the (conditional) distributions needed to properly estimate the conditional expectations in the Shapley formula including the observations to be explained. |
xreg |
Matrix, data.frame/data.table or a numeric vector. Contains the exogenous variables used to estimate the (conditional) distributions needed to properly estimate the conditional expectations in the Shapley formula including the observations to be explained. As exogenous variables are used contemporaneously when producing a forecast, this item should contain nrow(y) + horizon rows. |
train_idx |
Numeric vector.
The row indices in data and reg denoting points in time to use when estimating the conditional expectations in
the Shapley value formula.
If |
explain_idx |
Numeric vector. The row indices in data and reg denoting points in time to explain. |
explain_y_lags |
Numeric vector.
Denotes the number of lags that should be used for each variable in |
explain_xreg_lags |
Numeric vector.
If |
group_lags |
Logical.
If |
verbose |
String vector or NULL.
Specifies the verbosity (printout detail level) through one or more of strings |
iterative |
Logical or NULL.
If |
iterative_args |
Named list.
Specifies the arguments for the iterative procedure.
See |
is_python |
Logical.
Indicates whether the function is called from the Python wrapper.
Default is FALSE which is never changed when calling the function via |
testing |
Logical.
Only used to remove random components like timing from the object output when comparing output with testthat.
Defaults to |
init_time |
POSIXct object.
The time when the |
prev_shapr_object |
|
asymmetric |
Logical.
Not applicable for (regular) non-causal or asymmetric explanations.
If |
causal_ordering |
List.
Not applicable for (regular) non-causal or asymmetric explanations.
|
confounding |
Logical vector.
Not applicable for (regular) non-causal or asymmetric explanations.
|
output_args |
Named list.
Specifies certain arguments related to the output of the function.
See |
extra_computation_args |
Named list.
Specifies extra arguments related to the computation of the Shapley values.
See |
... |
Further arguments passed to specific approaches, see below. |
Value
An internal list, containing parameters, info, data and computations needed for the later computations. The list is expanded and modified in other functions.
Set up the framework for the chosen approach
Description
The different choices of approach
take different (optional) parameters,
which are forwarded from explain()
.
See the general usage vignette
for more information about the different approaches.
Usage
setup_approach(internal, ...)
## S3 method for class 'combined'
setup_approach(internal, ...)
## S3 method for class 'categorical'
setup_approach(
internal,
categorical.joint_prob_dt = NULL,
categorical.epsilon = 0.001,
...
)
## S3 method for class 'copula'
setup_approach(internal, ...)
## S3 method for class 'ctree'
setup_approach(
internal,
ctree.mincriterion = 0.95,
ctree.minsplit = 20,
ctree.minbucket = 7,
ctree.sample = TRUE,
...
)
## S3 method for class 'empirical'
setup_approach(
internal,
empirical.type = "fixed_sigma",
empirical.eta = 0.95,
empirical.fixed_sigma = 0.1,
empirical.n_samples_aicc = 1000,
empirical.eval_max_aicc = 20,
empirical.start_aicc = 0.1,
empirical.cov_mat = NULL,
model = NULL,
predict_model = NULL,
...
)
## S3 method for class 'gaussian'
setup_approach(internal, gaussian.mu = NULL, gaussian.cov_mat = NULL, ...)
## S3 method for class 'independence'
setup_approach(internal, ...)
## S3 method for class 'regression_separate'
setup_approach(
internal,
regression.model = parsnip::linear_reg(),
regression.tune_values = NULL,
regression.vfold_cv_para = NULL,
regression.recipe_func = NULL,
...
)
## S3 method for class 'regression_surrogate'
setup_approach(
internal,
regression.model = parsnip::linear_reg(),
regression.tune_values = NULL,
regression.vfold_cv_para = NULL,
regression.recipe_func = NULL,
regression.surrogate_n_comb =
internal$iter_list[[length(internal$iter_list)]]$n_coalitions - 2,
...
)
## S3 method for class 'timeseries'
setup_approach(
internal,
timeseries.fixed_sigma = 2,
timeseries.bounds = c(NULL, NULL),
...
)
## S3 method for class 'vaeac'
setup_approach(
internal,
vaeac.depth = 3,
vaeac.width = 32,
vaeac.latent_dim = 8,
vaeac.activation_function = torch::nn_relu,
vaeac.lr = 0.001,
vaeac.n_vaeacs_initialize = 4,
vaeac.epochs = 100,
vaeac.extra_parameters = list(),
...
)
Arguments
internal |
List.
Not used directly, but passed through from |
... |
Arguments passed to specific classes. See below |
categorical.joint_prob_dt |
Data.table. (Optional)
Containing the joint probability distribution for each combination of feature
values.
|
categorical.epsilon |
Numeric value. (Optional)
If |
ctree.mincriterion |
Numeric scalar or vector.
Either a scalar or vector of length equal to the number of features in the model.
The value is equal to 1 - |
ctree.minsplit |
Numeric scalar. Determines the minimum value that the sum of the left and right daughter nodes must reach for a split to be made. The default value is 20. |
ctree.minbucket |
Numeric scalar. Determines the minimum sum of weights in a terminal node required for a split. The default value is 7. |
ctree.sample |
Boolean.
If |
empirical.type |
Character. (default = |
empirical.eta |
Numeric scalar.
Needs to be |
empirical.fixed_sigma |
Positive numeric scalar.
The default value is 0.1.
Represents the kernel bandwidth in the distance computation used when conditioning on all different coalitions.
Only used when |
empirical.n_samples_aicc |
Positive integer.
Number of samples to consider in AICc optimization.
The default value is 1000.
Only used for |
empirical.eval_max_aicc |
Positive integer.
Maximum number of iterations when optimizing the AICc.
The default value is 20.
Only used for |
empirical.start_aicc |
Numeric.
Start value of the |
empirical.cov_mat |
Numeric matrix. (Optional)
The covariance matrix of the data generating distribution used to define the Mahalanobis distance.
|
model |
Objects.
The model object that ought to be explained.
See the documentation of |
predict_model |
Function.
The prediction function used when |
gaussian.mu |
Numeric vector. (Optional)
Containing the mean of the data generating distribution.
|
gaussian.cov_mat |
Numeric matrix. (Optional)
Containing the covariance matrix of the data generating distribution.
|
regression.model |
A |
regression.tune_values |
Either |
regression.vfold_cv_para |
Either |
regression.recipe_func |
Either |
regression.surrogate_n_comb |
Positive integer. Specifies the number of unique coalitions to apply to each training observation. The default is the number of sampled coalitions in the present iteration. Any integer between 1 and the default is allowed. Larger values require more memory, but may improve the surrogate model. If the user sets a value lower than the maximum, we sample this number of unique coalitions separately for each training observation. That is, on average, all coalitions should be equally trained. |
timeseries.fixed_sigma |
Positive numeric scalar. Represents the kernel bandwidth in the distance computation. The default value is 2. |
timeseries.bounds |
Numeric vector of length two.
Specifies the lower and upper bounds of the timeseries.
The default is |
vaeac.depth |
Positive integer (default is |
vaeac.width |
Positive integer (default is |
vaeac.latent_dim |
Positive integer (default is |
vaeac.activation_function |
An |
vaeac.lr |
Positive numeric (default is |
vaeac.n_vaeacs_initialize |
Positive integer (default is |
vaeac.epochs |
Positive integer (default is |
vaeac.extra_parameters |
Named list with extra parameters to the |
Value
Updated internal object with the approach set up
Author(s)
Martin Jullum
Lars Henry Berge Olsen
Set up the kernelSHAP framework
Description
Set up the kernelSHAP framework
Usage
shapley_setup(internal)
Arguments
internal |
List.
Not used directly, but passed through from |
Value
The internal list updated with the coalitions to be estimated
Calculate Shapley weight
Description
Calculate Shapley weight
Usage
shapley_weights(m, N, n_components, weight_zero_m = 10^6)
Arguments
m |
Positive integer. Total number of features/groups. |
N |
Positive integer. The number of unique coalitions when sampling |
n_components |
Positive integer. Represents the number of features/feature groups you want to sample from
a feature space consisting of |
weight_zero_m |
Numeric. The value to use as a replacement for infinite coalition weights when doing numerical operations. |
Value
Numeric
Author(s)
Nikolai Sellereite
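For orientation, the standard Shapley kernel weight for a coalition of size n_components out of m features is (m - 1) / (N * n_components * (m - n_components)), where N is read here as the number of coalitions of that size, choose(m, n_components). The sketch below is a hypothetical re-implementation under that reading; it replaces the infinite weights of the empty and grand coalitions by weight_zero_m, as the argument description suggests.

```r
# Hypothetical re-implementation of the Shapley kernel weight for illustration.
shapley_kernel_weight <- function(m, N, n_components, weight_zero_m = 10^6) {
  w <- (m - 1) / (N * n_components * (m - n_components))
  w[!is.finite(w)] <- weight_zero_m   # empty and grand coalitions
  w
}

m <- 4
sizes <- 0:m
w <- shapley_kernel_weight(m, N = choose(m, sizes), n_components = sizes)
```

For m = 4 this gives a weight of 0.25 to coalitions of size 1 and 3, 0.125 to coalitions of size 2, and weight_zero_m to the empty and grand coalitions.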
A torch::nn_module()
Representing a skip connection
Description
Skip connection over the sequence of layers in the constructor. The module passes input data sequentially through these layers and then adds original data to the result.
Usage
skip_connection(...)
Arguments
... |
network modules such as, e.g., |
Author(s)
Lars Henry Berge Olsen
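The behaviour described above, passing the input through a sequence of layers and then adding the original input to the result, can be mimicked without torch. The sketch below is a plain-R analogue using ordinary functions as stand-ins for network modules; skip_connection_fn and the toy layers are hypothetical and only illustrate the data flow.

```r
# Plain-R analogue with hypothetical names; the package version is a torch module.
skip_connection_fn <- function(...) {
  layers <- list(...)
  function(input) {
    output <- input
    for (layer in layers) output <- layer(output)  # pass through layers in order
    output + input                                 # add the original input back
  }
}

block <- skip_connection_fn(function(x) 2 * x, function(x) x + 1)
result <- block(c(1, 2, 3))
```

Here the two toy layers compute 2 * x + 1, so the block returns 3 * x + 1 for each element of the input.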
A torch::nn_module()
Representing a specified_masks_mask_generator
Description
A mask generator which masks the entries based on sampling provided 1D masks with corresponding probabilities. Used for Shapley value estimation when only a subset of coalitions are used to compute the Shapley values.
Usage
specified_masks_mask_generator(masks, masks_probs, paired_sampling = FALSE)
Arguments
masks |
Matrix/Tensor of possible/allowed 'masks' which we sample from. |
masks_probs |
Array of 'probabilities' for each of the masks specified in 'masks'. Note that they do not need to be between 0 and 1 (e.g. sampling frequency). They are scaled, hence, they only need to be positive. |
paired_sampling |
Boolean. If we are doing paired sampling. So include both S and |
Author(s)
Lars Henry Berge Olsen
A torch::nn_module()
Representing a specified_prob_mask_generator
Description
A mask generator which masks the entries based on specified probabilities.
Usage
specified_prob_mask_generator(masking_probs, paired_sampling = FALSE)
Arguments
masking_probs |
A vector of M+1 numerics containing the probabilities of masking 'd' of the (0, ..., M) entries for each observation. |
paired_sampling |
Boolean. If we are doing paired sampling. So include both S and |
Details
A class that takes in the probabilities of having d masked observations. I.e., for M dimensional data, masking_probs is of length M+1, where the d'th entry is the probability of having d-1 masked values.
A mask generator that first samples the number of entries 'd' to be masked in the 'M'-dimensional observation 'x' in the batch based on the given M+1 probabilities. The 'd' masked entries are then sampled uniformly from the 'M' possible feature indices. The d'th entry of the probability vector is the probability of having d-1 masked values.
Note that mcar_mask_generator with p = 0.5 is the same as using specified_prob_mask_generator()
with
masking_ratio
= choose(M, 0:M), where M is the number of features. This function was initially created to check if
increasing the probability of having masks with many masked features improved vaeac's performance by focusing more
on these situations during training.
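The two-step sampling scheme in the Details can be sketched in base R. The real generator is a torch module, so the function below is only a hypothetical analogue: it draws the number of masked entries from the supplied (unnormalized) probabilities and then picks that many feature indices uniformly at random. Using choose(M, 0:M) as weights makes every entry masked with probability 0.5, which matches the remark about mcar_mask_generator above.

```r
# Hypothetical base-R sketch; the real generator is a torch module.
sample_masks <- function(n_obs, masking_probs) {
  M <- length(masking_probs) - 1
  # Number of masked entries per observation, drawn from {0, ..., M};
  # sample() rescales the weights, so they only need to be non-negative
  d <- sample(0:M, n_obs, replace = TRUE, prob = masking_probs)
  masks <- matrix(0L, nrow = n_obs, ncol = M)
  for (i in seq_len(n_obs)) {
    masks[i, sample.int(M, d[i])] <- 1L   # 1 means masked
  }
  masks
}

set.seed(2024)
M <- 4
masks <- sample_masks(n_obs = 1000, masking_probs = choose(M, 0:M))
mask_rate <- mean(masks)   # should be close to 0.5 for these weights
```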
Model testing function
Description
Model testing function
Usage
test_predict_model(x_test, predict_model, model, internal)
Arguments
predict_model |
Function.
The prediction function used when |
model |
Objects.
The model object that ought to be explained.
See the documentation of |
internal |
List.
Holds all parameters, data, functions and computed objects used within |
Cleans out certain output arguments to allow perfect reproducibility of the output
Description
Cleans out certain output arguments to allow perfect reproducibility of the output
Usage
testing_cleanup(output)
Value
Cleaned up version of the output list used for testthat testing
Author(s)
Lars Henry Berge Olsen, Martin Jullum
Initializing a vaeac model
Description
Class that represents a vaeac model, i.e., the class creates the neural networks in the vaeac model and necessary training utilities. For more details, see Olsen et al. (2022).
Usage
vaeac(
one_hot_max_sizes,
width = 32,
depth = 3,
latent_dim = 8,
activation_function = torch::nn_relu,
skip_conn_layer = FALSE,
skip_conn_masked_enc_dec = FALSE,
batch_normalization = FALSE,
paired_sampling = FALSE,
mask_generator_name = c("mcar_mask_generator", "specified_prob_mask_generator",
"specified_masks_mask_generator"),
masking_ratio = 0.5,
mask_gen_coalitions = NULL,
mask_gen_coalitions_prob = NULL,
sigma_mu = 10000,
sigma_sigma = 1e-04
)
Arguments
one_hot_max_sizes |
A torch tensor of dimension |
width |
Integer. The number of neurons in each hidden layer in the neural networks of the masked encoder, full encoder, and decoder. |
depth |
Integer. The number of hidden layers in the neural networks of the masked encoder, full encoder, and decoder. |
latent_dim |
Integer. The number of dimensions in the latent space. |
activation_function |
A |
skip_conn_layer |
Boolean. If we are to use skip connections in each layer, see |
skip_conn_masked_enc_dec |
Boolean. If we are to apply concatenating skip connections between the layers in the masked encoder and decoder. The first layer of the masked encoder will be linked to the last layer of the decoder. The second layer of the masked encoder will be linked to the second to last layer of the decoder, and so on. |
batch_normalization |
Boolean. If we are to use batch normalization after the activation function.
Note that if |
paired_sampling |
Boolean. If we are doing paired sampling. I.e., if we are to include both coalition S
and |
mask_generator_name |
String specifying the type of mask generator to use. Needs to be one of 'mcar_mask_generator', 'specified_prob_mask_generator', or 'specified_masks_mask_generator'. |
masking_ratio |
Scalar. The probability for an entry in the generated mask to be 1 (masked).
Not used if |
mask_gen_coalitions |
Matrix containing the different coalitions to learn.
Must be given if |
mask_gen_coalitions_prob |
Numerics containing the probabilities
for sampling each mask in |
sigma_mu |
Numeric representing a hyperparameter in the normal-gamma prior used on the masked encoder, see Section 3.3.1 in Olsen et al. (2022). |
sigma_sigma |
Numeric representing a hyperparameter in the normal-gamma prior used on the masked encoder, see Section 3.3.1 in Olsen et al. (2022). |
Details
This function builds neural networks (masked encoder, full encoder, decoder) given the list of one-hot max sizes of the features in the dataset we use to train the vaeac model, and the provided parameters for the networks. It also creates, e.g., the reconstruction log probability function and methods for sampling from the decoder output, and then uses these to create the vaeac model.
Value
Returns a list with the neural networks of the masked encoder, full encoder, and decoder, together with the reconstruction log probability function, optimizer constructor, sampler from the decoder output, mask generator, batch size, and scale factor for the stability of the variational lower bound optimization.
make_observed
Apply Mask to Batch to Create Observed Batch
make_latent_distributions
Compute the Latent Distributions Inferred by the Encoders
Compute the parameters for the latent normal distributions inferred by the encoders. If only_masked_encoder = TRUE, then we only compute the latent normal distributions inferred by the masked encoder. This is used in the deployment phase when we do not have access to the full observation.
masked_encoder_regularization
Compute the Regularizers for the Latent Distribution Inferred by the Masked Encoder.
The masked encoder (prior) distribution regularization in the latent space. This is used to compute the extended variational lower bound used to train vaeac, see Section 3.3.1 in Olsen et al. (2022). Although the regularization prevents the masked encoder distribution parameters from going to infinity, the model usually does not diverge even without it. With the default regularization parameters, which are recommended, it has almost no effect on the learning process near zero.
batch_vlb
Compute the Variational Lower Bound for the Observations in the Batch
Compute a differentiable lower bound for the given batch of objects and mask. Used as the (negative) loss function for training the vaeac model.
batch_iwae
Compute the IWAE log-likelihood estimate with K samples per object.
Technically, it is differentiable, but it is recommended to use it for evaluation purposes inside torch::with_no_grad() in order to save memory. With torch::with_no_grad(), the method requires almost no extra memory, even for very large K. The method makes K independent passes through the decoder network, so the batch size is the same as for training with batch_vlb.
IWAE (the name stems from the importance weighted autoencoder) here denotes the importance sampling estimator:
\log p_{\theta, \psi}(x|y) \approx \log \left\{\frac{1}{K} \sum_{i=1}^K \frac{p_\theta(x|z_i, y) \, p_\psi(z_i|y)}{q_\phi(z_i|x,y)}\right\}
= \log \left\{\sum_{i=1}^K \exp\left(\log\left[\frac{p_\theta(x|z_i, y) \, p_\psi(z_i|y)}{q_\phi(z_i|x,y)}\right]\right)\right\} - \log(K)
= \log \left\{\sum_{i=1}^K \exp\left(\log[p_\theta(x|z_i, y)] + \log[p_\psi(z_i|y)] - \log[q_\phi(z_i|x,y)]\right)\right\} - \log(K)
= \operatorname{logsumexp}_{i}\left(\log[p_\theta(x|z_i, y)] + \log[p_\psi(z_i|y)] - \log[q_\phi(z_i|x,y)]\right) - \log(K)
= \operatorname{logsumexp}(\text{rec}\_\text{loss} + \text{prior}\_\text{log}\_\text{prob} - \text{proposal}\_\text{log}\_\text{prob}) - \log(K),
where z_i \sim q_\phi(z|x,y).
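The log-sum-exp form of the estimator can be sketched in base R as follows. This is only an illustration on plain numeric vectors, not the torch-based implementation in the package; the helper log_sum_exp() and the numeric values are hypothetical and chosen purely for demonstration.

```r
# Numerically stable log-sum-exp of a numeric vector (hypothetical helper).
log_sum_exp <- function(v) {
  m <- max(v)
  m + log(sum(exp(v - m)))
}

# Made-up log-densities for one observation and K = 4 latent samples z_i.
rec_loss          <- c(-3.2, -2.9, -3.5, -3.1) # log p_theta(x | z_i, y)
prior_log_prob    <- c(-1.1, -0.8, -1.4, -1.0) # log p_psi(z_i | y)
proposal_log_prob <- c(-0.9, -0.7, -1.2, -0.8) # log q_phi(z_i | x, y)

K <- length(rec_loss)
log_weights <- rec_loss + prior_log_prob - proposal_log_prob

# IWAE estimate: logsumexp of the log importance weights minus log(K).
iwae <- log_sum_exp(log_weights) - log(K)

# The same quantity computed directly from the first line of the derivation.
iwae_naive <- log(mean(exp(log_weights)))
```

Subtracting the maximum before exponentiating keeps the computation finite even when the log weights are very negative, which is why the log-sum-exp form is preferred over the direct average.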
generate_samples_params
Generate the parameters of the generative distributions for samples from the batch.
The function makes K latent representations for each object in the batch and sends these latent representations through the decoder to obtain the parameters of the generative distributions, i.e., means and variances for the normal distributions (continuous features) and probabilities for the categorical distributions (categorical features). The second axis is used to index samples for an object, i.e., if the batch shape is [n x D1 x D2], then the result shape is [n x K x D1 x D2]. It is better to use it inside torch::with_no_grad() in order to save memory. With torch::with_no_grad(), the method doesn't require extra memory except the memory for the result.
Author(s)
Lars Henry Berge Olsen
Creates Categorical Distributions
Description
Function that takes in a tensor containing the logits for each of the K classes. Each row corresponds to an observation. Each row is sent through the softmax function to convert the logits to probabilities that sum to one. The function also clamps the probabilities between a minimum and maximum probability. Note that we still normalize them afterward, so the final probabilities can be marginally below or above the thresholds.
Usage
vaeac_categorical_parse_params(params, min_prob = 0, max_prob = 1)
Arguments
params |
Tensor of dimension |
min_prob |
For stability it might be desirable that the minimal probability is not too close to zero. |
max_prob |
For stability it might be desirable that the maximal probability is not too close to one. |
Details
Take a tensor (e.g., a part of the neural network output) and return a torch::distr_categorical() distribution. After applying softmax over the last axis, the input tensor contains a batch of categorical probabilities, so there are no restrictions on the values of the input tensor. Technically, this function treats the last axis as the categorical probabilities, but the categorical distribution takes only 2D input where the first axis is the batch axis and the second one corresponds to the probabilities, so in practice the function requires 2D input with the batch of probabilities for one categorical feature. min_prob is the minimal probability for each class. After clipping the probabilities from below and above, they are renormalized in order to form a valid distribution. This regularization is required for numerical stability and may be considered a neural network architecture choice without any change to the probabilistic model.
Note that the softmax function is given by \operatorname{Softmax}(x_i) = \exp(x_i) / \sum_{j} \exp(x_j), where x_i are the logits, which can take on any value, negative or positive. The output satisfies \operatorname{Softmax}(x_i) \in [0,1] and \sum_{i} \operatorname{Softmax}(x_i) = 1.
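The softmax, clamping, and renormalization steps described above can be sketched in base R on an ordinary matrix of logits. This is a minimal illustration under the assumption that each row holds the logits of one observation; the function name categorical_probs() is hypothetical and the sketch does not use torch.

```r
# Hypothetical base-R sketch: logits -> clamped, renormalized probabilities.
categorical_probs <- function(logits, min_prob = 0, max_prob = 1) {
  # Row-wise softmax; subtracting the row maximum avoids overflow in exp().
  shifted <- logits - apply(logits, 1, max)
  probs <- exp(shifted)
  probs <- probs / rowSums(probs)
  # Clamp every probability to [min_prob, max_prob].
  probs <- pmin(pmax(probs, min_prob), max_prob)
  # Renormalize so that each row is a valid distribution again.
  probs / rowSums(probs)
}

logits <- rbind(c(10, 0, -10), c(0, 0, 0))
probs <- categorical_probs(logits, min_prob = 0.01, max_prob = 0.99)
```

In the first row, the clamped values 0.99, 0.01, and 0.01 sum to 1.01, so after renormalization the smallest probabilities end up slightly below min_prob. This is the effect mentioned in the Description.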
Value
A torch::distr_categorical() distribution with the provided probabilities for each class.
Author(s)
Lars Henry Berge Olsen
Function that checks the provided activation function
Description
Function that checks the provided activation function
Usage
vaeac_check_activation_func(activation_function)
Arguments
activation_function |
An |
Value
The function does not return anything.
Author(s)
Lars Henry Berge Olsen
Function that checks for access to CUDA
Description
Function that checks for access to CUDA
Usage
vaeac_check_cuda(cuda, verbose)
Arguments
cuda |
Logical (default is |
verbose |
String vector or NULL.
Specifies the verbosity (printout detail level) through one or more of strings |
Value
The function does not return anything.
Author(s)
Lars Henry Berge Olsen
Function that checks provided epoch arguments
Description
Function that checks provided epoch arguments
Usage
vaeac_check_epoch_values(
epochs,
epochs_initiation_phase,
epochs_early_stopping,
save_every_nth_epoch,
verbose
)
Arguments
epochs |
Positive integer (default is |
epochs_initiation_phase |
Positive integer (default is |
epochs_early_stopping |
Positive integer (default is |
save_every_nth_epoch |
Positive integer (default is |
verbose |
String vector or NULL.
Specifies the verbosity (printout detail level) through one or more of strings |
Value
The function does not return anything.
Author(s)
Lars Henry Berge Olsen
Check vaeac.extra_parameters list
Description
Check vaeac.extra_parameters list
Usage
vaeac_check_extra_named_list(vaeac.extra_parameters)
Arguments
vaeac.extra_parameters |
List containing the extra parameters to the |
Author(s)
Lars Henry Berge Olsen
Function that checks logicals
Description
Function that checks logicals
Usage
vaeac_check_logicals(named_list_logicals)
Arguments
named_list_logicals |
List containing named entries. I.e., |
Value
The function does not return anything.
Author(s)
Lars Henry Berge Olsen
Function that checks the specified masking scheme
Description
Function that checks the specified masking scheme
Usage
vaeac_check_mask_gen(mask_gen_coalitions, mask_gen_coalitions_prob, x_train)
Arguments
mask_gen_coalitions |
Matrix (default is |
mask_gen_coalitions_prob |
Numeric array (default is |
x_train |
A data.table containing the training data. Categorical data must have class names |
Value
The function does not return anything.
Author(s)
Lars Henry Berge Olsen
Function that checks that the masking ratio argument is valid
Description
Function that checks that the masking ratio argument is valid
Usage
vaeac_check_masking_ratio(masking_ratio, n_features)
Arguments
masking_ratio |
Numeric (default is |
n_features |
The number of features, i.e., the number of columns in the training data. |
Value
The function does not return anything.
Author(s)
Lars Henry Berge Olsen
Function that calls all vaeac parameters check functions
Description
Function that calls all vaeac parameters check functions
Usage
vaeac_check_parameters(
x_train,
model_description,
folder_to_save_model,
cuda,
n_vaeacs_initialize,
epochs_initiation_phase,
epochs,
epochs_early_stopping,
save_every_nth_epoch,
val_ratio,
val_iwae_n_samples,
depth,
width,
latent_dim,
lr,
batch_size,
running_avg_n_values,
activation_function,
skip_conn_layer,
skip_conn_masked_enc_dec,
batch_normalization,
paired_sampling,
masking_ratio,
mask_gen_coalitions,
mask_gen_coalitions_prob,
sigma_mu,
sigma_sigma,
save_data,
log_exp_cont_feat,
which_vaeac_model,
verbose,
seed,
...
)
Arguments
x_train |
A data.table containing the training data. Categorical data must have class names |
model_description |
String (default is |
folder_to_save_model |
String (default is |
cuda |
Logical (default is |
n_vaeacs_initialize |
Positive integer (default is |
epochs_initiation_phase |
Positive integer (default is |
epochs |
Positive integer (default is |
epochs_early_stopping |
Positive integer (default is |
save_every_nth_epoch |
Positive integer (default is |
val_ratio |
Numeric (default is |
val_iwae_n_samples |
Positive integer (default is |
depth |
Positive integer (default is |
width |
Positive integer (default is |
latent_dim |
Positive integer (default is |
lr |
Positive numeric (default is |
batch_size |
Positive integer (default is |
running_avg_n_values |
Positive integer (default is |
activation_function |
An |
skip_conn_layer |
Logical (default is |
skip_conn_masked_enc_dec |
Logical (default is |
batch_normalization |
Logical (default is |
paired_sampling |
Logical (default is |
masking_ratio |
Numeric (default is |
mask_gen_coalitions |
Matrix (default is |
mask_gen_coalitions_prob |
Numeric array (default is |
sigma_mu |
Numeric (default is |
sigma_sigma |
Numeric (default is |
save_data |
Logical (default is |
log_exp_cont_feat |
Logical (default is |
which_vaeac_model |
String (default is |
verbose |
String vector or NULL.
Specifies the verbosity (printout detail level) through one or more of strings |
seed |
Positive integer (default is |
... |
List of extra parameters, currently not used. |
Value
The function does not return anything.
Author(s)
Lars Henry Berge Olsen
Function that checks positive integers
Description
Function that checks positive integers
Usage
vaeac_check_positive_integers(named_list_positive_integers)
Arguments
named_list_positive_integers |
List containing named entries. I.e., |
Value
The function does not return anything.
Author(s)
Lars Henry Berge Olsen
Function that checks positive numerics
Description
Function that checks positive numerics
Usage
vaeac_check_positive_numerics(named_list_positive_numerics)
Arguments
named_list_positive_numerics |
List containing named entries. I.e., |
Value
The function does not return anything.
Author(s)
Lars Henry Berge Olsen
Function that checks probabilities
Description
Function that checks probabilities
Usage
vaeac_check_probabilities(named_list_probabilities)
Arguments
named_list_probabilities |
List containing named entries. I.e., |
Value
The function does not return anything.
Author(s)
Lars Henry Berge Olsen
Function that checks that the save folder exists and for a valid file name
Description
Function that checks that the save folder exists and for a valid file name
Usage
vaeac_check_save_names(folder_to_save_model, model_description)
Arguments
folder_to_save_model |
String (default is |
model_description |
String (default is |
Value
The function does not return anything.
Author(s)
Lars Henry Berge Olsen
Function that gives a warning about disk usage
Description
Function that gives a warning about disk usage
Usage
vaeac_check_save_parameters(
save_data,
epochs,
save_every_nth_epoch,
x_train_size,
verbose
)
Arguments
save_data |
Logical (default is |
epochs |
Positive integer (default is |
save_every_nth_epoch |
Positive integer (default is |
x_train_size |
The object size of the |
verbose |
String vector or NULL.
Specifies the verbosity (printout detail level) through one or more of strings |
Value
The function does not return anything.
Author(s)
Lars Henry Berge Olsen
Function that checks for valid vaeac model name
Description
Function that checks for valid vaeac model name
Usage
vaeac_check_which_vaeac_model(
which_vaeac_model,
epochs,
save_every_nth_epoch = NULL
)
Arguments
which_vaeac_model |
String (default is |
epochs |
Positive integer (default is |
save_every_nth_epoch |
Positive integer (default is |
Value
The function does not return anything.
Author(s)
Lars Henry Berge Olsen
Function that checks the feature names of data and vaeac model
Description
Function that checks the feature names of data and vaeac model
Usage
vaeac_check_x_colnames(feature_names_vaeac, feature_names_new)
Arguments
feature_names_vaeac |
Array of strings containing the feature names of the |
feature_names_new |
Array of strings containing the feature names to compare with. |
Value
The function does not return anything.
Author(s)
Lars Henry Berge Olsen
Compute Featurewise Means and Standard Deviations
Description
Returns the means and standard deviations for all continuous features in the data set. Categorical features get mean = 0 and sd = 1 by default.
Usage
vaeac_compute_normalization(data, one_hot_max_sizes)
Arguments
data |
A torch_tensor of dimension |
one_hot_max_sizes |
A torch tensor of dimension |
Value
List containing the means and the standard deviations of the different features.
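A minimal base-R sketch of this featurewise normalization is given below. It assumes that entries of one_hot_max_sizes less than or equal to 1 mark continuous features and larger entries give the number of classes of a categorical feature; the function name compute_normalization() is hypothetical, and sd() uses the n - 1 denominator, which may differ from the package's torch-based computation.

```r
# Hypothetical sketch: means and standard deviations per feature.
compute_normalization <- function(data, one_hot_max_sizes) {
  # Assumption: sizes <= 1 indicate continuous features.
  is_cont <- one_hot_max_sizes <= 1
  # Continuous features get their empirical mean and sd,
  # categorical features get mean 0 and sd 1.
  norm_mean <- ifelse(is_cont, colMeans(data), 0)
  norm_std <- ifelse(is_cont, apply(data, 2, sd), 1)
  list(norm_mean = norm_mean, norm_std = norm_std)
}

# Columns 1 and 3 are continuous; column 2 is categorical with 3 classes.
data <- cbind(c(1, 2, 3, 4), c(1, 2, 1, 3), c(10, 20, 30, 40))
norm <- compute_normalization(data, one_hot_max_sizes = c(1, 3, 1))
```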
Author(s)
Lars Henry Berge Olsen
Dataset used by the vaeac model
Description
Convert the data into a torch::dataset() from which the vaeac model creates batches.
Usage
vaeac_dataset(X, one_hot_max_sizes)
Arguments
X |
A torch_tensor containing the data of shape N x p, where N and p are the number of observations and features, respectively. |
one_hot_max_sizes |
A torch tensor of dimension |
Details
This function creates a torch::dataset() object that represents a map from keys to data samples. It is used by the torch::dataloader() to load the data from which the batches are extracted for all epochs in the training phase of the neural network. Note that a dataset object is an R6 instance, see https://r6.r-lib.org/articles/Introduction.html, which is classical object-oriented programming with self reference. I.e., vaeac_dataset() is a subclass of type torch::dataset().
Author(s)
Lars Henry Berge Olsen
Extends Incomplete Batches by Sampling Extra Data from Dataloader
Description
If the height of the batch is less than batch_size, this function extends the batch with data from the torch::dataloader() until the batch reaches the required size. Note that batch is a tensor.
Usage
vaeac_extend_batch(batch, dataloader, batch_size)
Arguments
batch |
The batch to check. If it does not have the right size, it is extended until it does. |
dataloader |
A |
batch_size |
Integer. The number of samples to include in each batch. |
Value
Returns the extended batch with the correct batch_size.
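The idea of extending an incomplete batch can be illustrated with ordinary matrices. In the sketch below, the matrix data is a hypothetical stand-in for the torch dataloader, and the function name extend_batch() is likewise hypothetical; the package operates on tensors and draws the extra rows from the dataloader.

```r
# Hypothetical sketch: add rows to a batch until it has batch_size rows.
extend_batch <- function(batch, data, batch_size) {
  while (nrow(batch) < batch_size) {
    n_missing <- batch_size - nrow(batch)
    # Draw at most n_missing extra rows from the data source.
    extra_idx <- sample.int(nrow(data), size = min(n_missing, nrow(data)))
    batch <- rbind(batch, data[extra_idx, , drop = FALSE])
  }
  batch
}

set.seed(1)
data <- matrix(1:20, ncol = 2)       # 10 observations, 2 features
batch <- data[1:3, , drop = FALSE]   # incomplete batch with 3 rows
extended <- extend_batch(batch, data, batch_size = 8)
```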
Author(s)
Lars Henry Berge Olsen
Function that extracts additional objects from the environment to the state list
Description
The function extracts the objects that we are going to save together with the vaeac model to make it possible to train the model further and to evaluate it. The environment should be the local environment inside the vaeac_train_model_auxiliary() function.
Usage
vaeac_get_current_save_state(environment)
Arguments
environment |
The |
Value
List containing the values of epoch, train_vlb, val_iwae, val_iwae_running, and the state_dict() of the vaeac model and optimizer.
Author(s)
Lars Henry Berge Olsen
Function to set up data loaders and save file names
Description
Function to set up data loaders and save file names
Usage
vaeac_get_data_objects(
x_train,
log_exp_cont_feat,
val_ratio,
batch_size,
paired_sampling,
model_description,
depth,
width,
latent_dim,
lr,
epochs,
save_every_nth_epoch,
folder_to_save_model,
train_indices = NULL,
val_indices = NULL
)
Arguments
x_train |
A data.table containing the training data. Categorical data must have class names |
log_exp_cont_feat |
Logical (default is |
val_ratio |
Numeric (default is |
batch_size |
Positive integer (default is |
paired_sampling |
Logical (default is |
model_description |
String (default is |
depth |
Positive integer (default is |
width |
Positive integer (default is |
latent_dim |
Positive integer (default is |
lr |
Positive numeric (default is |
epochs |
Positive integer (default is |
save_every_nth_epoch |
Positive integer (default is |
folder_to_save_model |
String (default is |
train_indices |
Numeric array (optional) containing the indices of the training observations. No checks are conducted to validate the indices. |
val_indices |
Numeric array (optional) containing the indices of the validation observations. No checks are conducted to validate the indices. |
Value
List of objects needed to train the vaeac model.
Extract the Training VLB and Validation IWAE from a list of explanation objects using the vaeac approach
Description
Extract the Training VLB and Validation IWAE from a list of explanation objects using the vaeac approach
Usage
vaeac_get_evaluation_criteria(explanation_list)
Arguments
explanation_list |
A list of |
Value
A data.table containing the training VLB, validation IWAE, and running validation IWAE at each epoch for each vaeac model.
Author(s)
Lars Henry Berge Olsen
Function to specify the extra parameters in the vaeac model
Description
In this function, we specify the default values for the extra parameters used in explain() for approach = "vaeac".
Usage
vaeac_get_extra_para_default(
vaeac.model_description = make.names(Sys.time()),
vaeac.folder_to_save_model = tempdir(),
vaeac.pretrained_vaeac_model = NULL,
vaeac.cuda = FALSE,
vaeac.epochs_initiation_phase = 2,
vaeac.epochs_early_stopping = NULL,
vaeac.save_every_nth_epoch = NULL,
vaeac.val_ratio = 0.25,
vaeac.val_iwae_n_samples = 25,
vaeac.batch_size = 64,
vaeac.batch_size_sampling = NULL,
vaeac.running_avg_n_values = 5,
vaeac.skip_conn_layer = TRUE,
vaeac.skip_conn_masked_enc_dec = TRUE,
vaeac.batch_normalization = FALSE,
vaeac.paired_sampling = TRUE,
vaeac.masking_ratio = 0.5,
vaeac.mask_gen_coalitions = NULL,
vaeac.mask_gen_coalitions_prob = NULL,
vaeac.sigma_mu = 10000,
vaeac.sigma_sigma = 1e-04,
vaeac.sample_random = TRUE,
vaeac.save_data = FALSE,
vaeac.log_exp_cont_feat = FALSE,
vaeac.which_vaeac_model = "best",
vaeac.save_model = TRUE
)
Arguments
vaeac.model_description |
String (default is |
vaeac.folder_to_save_model |
String (default is |
vaeac.pretrained_vaeac_model |
List or String (default is |
vaeac.cuda |
Logical (default is |
vaeac.epochs_initiation_phase |
Positive integer (default is |
vaeac.epochs_early_stopping |
Positive integer (default is |
vaeac.save_every_nth_epoch |
Positive integer (default is |
vaeac.val_ratio |
Numeric (default is |
vaeac.val_iwae_n_samples |
Positive integer (default is |
vaeac.batch_size |
Positive integer (default is |
vaeac.batch_size_sampling |
Positive integer (default is |
vaeac.running_avg_n_values |
Positive integer (default is |
vaeac.skip_conn_layer |
Logical (default is |
vaeac.skip_conn_masked_enc_dec |
Logical (default is |
vaeac.batch_normalization |
Logical (default is |
vaeac.paired_sampling |
Logical (default is |
vaeac.masking_ratio |
Numeric (default is |
vaeac.mask_gen_coalitions |
Matrix (default is |
vaeac.mask_gen_coalitions_prob |
Numeric array (default is |
vaeac.sigma_mu |
Numeric (default is |
vaeac.sigma_sigma |
Numeric (default is |
vaeac.sample_random |
Logical (default is |
vaeac.save_data |
Logical (default is |
vaeac.log_exp_cont_feat |
Logical (default is |
vaeac.which_vaeac_model |
String (default is |
vaeac.save_model |
Boolean. If |
Details
The vaeac model consists of three neural networks (a full encoder, a masked encoder, and a decoder) based on the provided vaeac.depth and vaeac.width. The encoders map the full and masked input representations to latent representations, respectively, where the dimension is given by vaeac.latent_dim. The latent representations are sent to the decoder to go back to the real feature space and provide a samplable probabilistic representation, from which the Monte Carlo samples are generated. By default, we use the vaeac model at the epoch with the lowest validation error (IWAE), but other choices are available by setting the vaeac.which_vaeac_model parameter. See Olsen et al. (2022) for more details.
Value
Named list of the default values of the vaeac extra parameter arguments specified in this function call. Note that both vaeac.model_description and vaeac.folder_to_save_model will change with time and R session.
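How a user-supplied named list can override entries of such a list of defaults can be sketched with utils::modifyList(). The list defaults below is a hand-written, partial stand-in that copies four default values from the Usage section. It is not produced by calling the package, and the package may merge user input differently.

```r
# Hand-written subset of the defaults shown in the Usage section.
defaults <- list(
  vaeac.batch_size = 64,
  vaeac.val_ratio = 0.25,
  vaeac.masking_ratio = 0.5,
  vaeac.paired_sampling = TRUE
)

# Hypothetical user choices that should take precedence over the defaults.
user <- list(vaeac.batch_size = 128, vaeac.val_ratio = 0.2)

# Entries in `user` replace the matching entries in `defaults`.
params <- utils::modifyList(defaults, user)
```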
Author(s)
Lars Henry Berge Olsen
References
Function that extracts the state list objects from the environment
Description
The function extracts the objects that we are going to save together with the vaeac model to make it possible to train the model further and to evaluate it. The environment should be the local environment inside the vaeac_train_model_auxiliary() function.
Usage
vaeac_get_full_state_list(environment)
Arguments
environment |
The |
Value
List containing the values of norm_mean, norm_std, model_description, folder_to_save_model, n_train, n_features, one_hot_max_sizes, epochs, epochs_specified, epochs_early_stopping, early_stopping_applied, running_avg_n_values, paired_sampling, mask_generator_name, masking_ratio, mask_gen_coalitions, mask_gen_coalitions_prob, val_ratio, val_iwae_n_samples, n_vaeacs_initialize, epochs_initiation_phase, width, depth, latent_dim, activation_function, lr, batch_size, skip_conn_layer, skip_conn_masked_enc_dec, batch_normalization, cuda, train_indices, val_indices, save_every_nth_epoch, sigma_mu, sigma_sigma, feature_list, col_cat_names, col_cont_names, col_cat, col_cont, cat_in_dataset, map_new_to_original_names, map_original_to_new_names, log_exp_cont_feat, save_data, verbose, seed, and vaeac_save_file_names.
Author(s)
Lars Henry Berge Olsen
Function that determines which mask generator to use
Description
Function that determines which mask generator to use
Usage
vaeac_get_mask_generator_name(
mask_gen_coalitions,
mask_gen_coalitions_prob,
masking_ratio,
verbose
)
Arguments
mask_gen_coalitions |
Matrix (default is |
mask_gen_coalitions_prob |
Numeric array (default is |
masking_ratio |
Numeric (default is |
verbose |
String vector or NULL.
Specifies the verbosity (printout detail level) through one or more of strings |
Value
The function does not return anything.
Author(s)
Lars Henry Berge Olsen
Function to load a vaeac model and set it in the right state and mode
Description
Function to load a vaeac model and set it in the right state and mode
Usage
vaeac_get_model_from_checkp(checkpoint, cuda, mode_train)
Arguments
checkpoint |
List. This must be a loaded |
cuda |
Logical (default is |
mode_train |
Logical. If |
Value
A vaeac model with the correct state (based on checkpoint), sent to the desired hardware (based on cuda), and in the right mode (based on mode_train).
Author(s)
Lars Henry Berge Olsen
Function to get string of values with specific number of decimals
Description
Function to get string of values with specific number of decimals
Usage
vaeac_get_n_decimals(value, n_decimals = 3)
Arguments
value |
The number to get |
n_decimals |
Positive integer. The number of decimals. Default is three. |
Value
String of value with n_decimals decimals.
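Fixed-decimal formatting of this kind can be sketched in base R with formatC(). The helper name format_n_decimals() is hypothetical, and the package function may be implemented differently.

```r
# Hypothetical sketch: format a number with a fixed number of decimals.
format_n_decimals <- function(value, n_decimals = 3) {
  formatC(value, digits = n_decimals, format = "f")
}

format_n_decimals(3.14159)                # "3.142"
format_n_decimals(2, n_decimals = 2)      # "2.00"
```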
Author(s)
Lars Henry Berge Olsen
Function to create the optimizer used to train vaeac
Description
Only torch::optim_adam() is currently supported, but additional options can easily be added later.
Usage
vaeac_get_optimizer(vaeac_model, lr, optimizer_name = "adam")
Arguments
vaeac_model |
A |
lr |
Positive numeric (default is |
optimizer_name |
String containing the name of the |
Value
A torch::optim_adam() optimizer connected to the parameters of the vaeac_model.
Author(s)
Lars Henry Berge Olsen
Function that creates the save file names for the vaeac model
Description
Function that creates the save file names for the vaeac model
Usage
vaeac_get_save_file_names(
model_description,
n_features,
n_train,
depth,
width,
latent_dim,
lr,
epochs,
save_every_nth_epoch,
folder_to_save_model = NULL
)
Arguments
model_description |
String (default is |
depth |
Positive integer (default is |
width |
Positive integer (default is |
latent_dim |
Positive integer (default is |
lr |
Positive numeric (default is |
epochs |
Positive integer (default is |
save_every_nth_epoch |
Positive integer (default is |
folder_to_save_model |
String (default is |
Value
Array of strings containing the save files to use when training the vaeac model. The first three names correspond to the best, best_running, and last epochs, in that order.
Author(s)
Lars Henry Berge Olsen
Compute the Importance Sampling Estimator (Validation Error)
Description
Compute the Importance Sampling Estimator which the vaeac model uses to evaluate its performance on the validation data.
Usage
vaeac_get_val_iwae(
val_dataloader,
mask_generator,
batch_size,
vaeac_model,
val_iwae_n_samples
)
Arguments
val_dataloader |
A torch dataloader which loads the validation data. |
mask_generator |
A mask generator object that generates the masks. |
batch_size |
Integer. The number of samples to include in each batch. |
vaeac_model |
The vaeac model. |
val_iwae_n_samples |
Number of samples to generate for computing the IWAE for each validation sample. |
Details
Compute the mean IWAE log-likelihood estimate over the validation set. IWAE here denotes the importance sampling estimator
\log p_{\theta, \psi}(x|y) \approx \log \left\{\frac{1}{S}\sum_{i=1}^S \frac{p_\theta(x|z_i, y) \, p_\psi(z_i|y)}{q_\phi(z_i|x,y)}\right\},
where z_i \sim q_\phi(z|x,y).
For more details, see Olsen et al. (2022).
Value
The average IWAE over all instances in the validation dataset.
Author(s)
Lars Henry Berge Olsen
Function to extend the explicands and apply all relevant masks/coalitions
Description
Function to extend the explicands and apply all relevant masks/coalitions
Usage
vaeac_get_x_explain_extended(x_explain, S, index_features)
Arguments
x_explain |
Matrix or data.frame/data.table. Contains the features whose predictions ought to be explained. |
S |
The |
index_features |
Positive integer vector. Specifies the id_coalition to apply to the present method. |
Value
The extended version of x_explain where the masks from S with indices index_features have been applied.
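The extension step can be illustrated with plain matrices. The sketch below assumes that S is a 0/1 matrix with one row per coalition, where 1 marks a feature that is part of the coalition (kept) and 0 marks a feature to be imputed (set to NaN), and that the output is ordered explicand by explicand. The function name extend_and_mask(), the NaN convention, and the row ordering are assumptions made for this illustration.

```r
# Hypothetical sketch: repeat each explicand once per selected coalition
# and set the features outside the coalition to NaN.
extend_and_mask <- function(x_explain, S, index_features) {
  S_sub <- S[index_features, , drop = FALSE]
  n_explain <- nrow(x_explain)
  n_coal <- nrow(S_sub)
  # Each explicand is repeated n_coal times (explicand-major ordering).
  x_rep <- x_explain[rep(seq_len(n_explain), each = n_coal), , drop = FALSE]
  # The block of selected coalitions is repeated once per explicand.
  mask <- S_sub[rep(seq_len(n_coal), times = n_explain), , drop = FALSE]
  x_rep[mask == 0] <- NaN
  x_rep
}

x_explain <- rbind(c(1, 2, 3), c(4, 5, 6))
S <- rbind(c(0, 0, 0), c(1, 0, 0), c(1, 1, 0), c(1, 1, 1))
x_extended <- extend_and_mask(x_explain, S, index_features = 2:3)
```

With two explicands and two selected coalitions, x_extended has four rows, and the NaN entries mark the values that a conditional sampler would subsequently fill in.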
Author(s)
Lars Henry Berge Olsen
Impute Missing Values Using Vaeac
Description
Impute Missing Values Using Vaeac
Usage
vaeac_impute_missing_entries(
x_explain_with_NaNs,
n_MC_samples,
vaeac_model,
checkpoint,
sampler,
batch_size,
verbose = NULL,
seed = NULL,
n_explain = NULL,
index_features = NULL
)
Arguments
x_explain_with_NaNs |
A 2D matrix, where the missing entries to impute are represented by |
n_MC_samples |
Integer. The number of imputed versions we create for each row in |
vaeac_model |
An initialized |
checkpoint |
List containing the parameters of the |
sampler |
A sampler object used to sample the MC samples. |
batch_size |
Positive integer (default is |
verbose |
String vector or NULL.
Specifies the verbosity (printout detail level) through one or more of strings |
seed |
Positive integer (default is |
n_explain |
Positive integer. The number of explicands. |
index_features |
Optional integer vector. Used internally in shapr package to index the coalitions. |
Details
Function that imputes the missing values in a 2D matrix where each row constitutes an individual. The values are sampled from the conditional distribution estimated by a vaeac model.
Value
A data.table where the missing values (NaN) in x_explain_with_NaNs have been imputed n_MC_samples times. The data table will contain extra id columns if index_features and n_explain are provided.
Author(s)
Lars Henry Berge Olsen
Compute the KL Divergence Between Two Gaussian Distributions.
Description
Computes the KL divergence between univariate normal distributions using the analytical formula, see https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Multivariate_normal_distributions.
Usage
vaeac_kl_normal_normal(p, q)
Arguments
p |
A |
q |
A |
Value
The KL divergence between the two Gaussian distributions.
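For two univariate normal distributions p = N(\mu_p, \sigma_p^2) and q = N(\mu_q, \sigma_q^2), the analytical formula is KL(p || q) = \log(\sigma_q / \sigma_p) + (\sigma_p^2 + (\mu_p - \mu_q)^2) / (2 \sigma_q^2) - 1/2. The base-R sketch below evaluates this formula and compares it with numerical integration. Unlike the package function, which takes distribution objects p and q, the hypothetical helper here takes means and standard deviations directly.

```r
# Hypothetical helper: closed-form KL(p || q) for univariate normals.
kl_normal_normal_sketch <- function(mu_p, sigma_p, mu_q, sigma_q) {
  log(sigma_q / sigma_p) +
    (sigma_p^2 + (mu_p - mu_q)^2) / (2 * sigma_q^2) - 0.5
}

kl_closed <- kl_normal_normal_sketch(mu_p = 0, sigma_p = 1, mu_q = 1, sigma_q = 2)

# Numerical check: integrate p(x) * (log p(x) - log q(x)) over the real line.
integrand <- function(x) {
  dnorm(x, 0, 1) * (dnorm(x, 0, 1, log = TRUE) - dnorm(x, 1, 2, log = TRUE))
}
kl_numeric <- integrate(integrand, lower = -Inf, upper = Inf)$value
```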
Author(s)
Lars Henry Berge Olsen
Creates Normal Distributions
Description
Function that takes in a tensor where the first half of the columns contains the means of the normal distributions, while the latter half of the columns contains the standard deviations. The standard deviations are clamped with min_sigma to ensure stable results. If params is of dimensions batch_size x 8, the function will create 4 independent normal distributions for each observation (batch_size observations in total).
Usage
vaeac_normal_parse_params(params, min_sigma = 1e-04)
Arguments
params |
Tensor of dimension |
min_sigma |
For stability it might be desirable that the minimal sigma is not too close to zero. |
Details
Take a tensor (e.g., neural network output) and return a torch::distr_normal() distribution. This normal distribution is component-wise independent, and its dimensionality depends on the input shape. The first half of the channels is the mean (\mu) of the distribution, and the softplus of the second half is the standard deviation (\sigma), so there are no restrictions on the input tensor. min_sigma is the minimal value of \sigma. I.e., if the above softplus is less than min_sigma, then \sigma is clipped from below at min_sigma. This regularization is required for numerical stability and may be considered a neural network architecture choice without any change to the probabilistic model.
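The split into means and softplus-transformed, clamped standard deviations can be sketched in base R on an ordinary matrix. The function names softplus() and normal_parse_params_sketch() are hypothetical, and the sketch returns a plain list rather than a torch::distr_normal() object.

```r
# Numerically stable softplus, log(1 + exp(x)).
softplus <- function(x) pmax(x, 0) + log1p(exp(-abs(x)))

# Hypothetical sketch: first half of the columns are means, the softplus of
# the second half are standard deviations, clamped from below at min_sigma.
normal_parse_params_sketch <- function(params, min_sigma = 1e-4) {
  d <- ncol(params) / 2
  mu <- params[, seq_len(d), drop = FALSE]
  sigma <- pmax(softplus(params[, d + seq_len(d), drop = FALSE]), min_sigma)
  list(mu = mu, sigma = sigma)
}

# One observation with two normal distributions (4 columns in total).
params <- matrix(c(0.5, -1, -20, 2), nrow = 1)
dist_params <- normal_parse_params_sketch(params)
```

Here softplus(-20) is roughly 2e-9, so the first standard deviation is clipped to min_sigma, while the second, softplus(2), is left unchanged.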
Value
A torch::distr_normal() distribution with the provided means and standard deviations.
Author(s)
Lars Henry Berge Olsen
Normalize mixed data for vaeac
Description
Compute the mean and std for each continuous feature, while the categorical features will have mean 0 and std 1.
Usage
vaeac_normalize_data(
data_torch,
one_hot_max_sizes,
norm_mean = NULL,
norm_std = NULL
)
Arguments
one_hot_max_sizes |
A torch tensor of dimension |
norm_mean |
Torch tensor (optional). A 1D array containing the means of the columns of |
norm_std |
Torch tensor (optional). A 1D array containing the stds of the columns of |
Value
A list containing the normalized version of x_torch, norm_mean and norm_std.
Author(s)
Lars Henry Berge Olsen
Postprocess Data Generated by a vaeac Model
Description
vaeac generates numerical values. This function converts categorical features from numerics with class labels 1,2,...,K to factors with the original class labels.
Usage
vaeac_postprocess_data(data, vaeac_model_state_list)
Arguments
data |
data.table containing the data generated by a vaeac model |
vaeac_model_state_list |
List. The returned list from the |
Value
data.table with the generated data from a vaeac model where the categorical features now have the original class names.
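Mapping integer class labels back to the original class names can be sketched in base R as follows. The vectors codes and original_levels are made-up examples; in the package, the mapping is taken from the vaeac model state list.

```r
# Made-up generated class labels 1, 2, ..., K for one categorical feature.
codes <- c(2, 1, 3, 2)

# Made-up original class names, ordered so that label k maps to element k.
original_levels <- c("low", "medium", "high")

# Convert the numeric labels back to a factor with the original class names.
restored <- factor(original_levels[codes], levels = original_levels)
```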
Author(s)
Lars Henry Berge Olsen
Preprocess Data for the vaeac approach
Description
vaeac only supports numerical values. This function converts categorical features to numerics with class labels 1,2,...,K, and keeps track of the map between the original and new class labels. It also computes the one_hot_max_sizes.
Usage
vaeac_preprocess_data(
data,
log_exp_cont_feat = FALSE,
normalize = TRUE,
norm_mean = NULL,
norm_std = NULL
)
Arguments
data |
matrix/data.frame/data.table containing the training data. Only the features and not the response. |
log_exp_cont_feat |
Boolean. Whether to log-transform all continuous
features before sending the data to vaeac. vaeac generates unbounded values, so if the continuous
features are strictly positive, as for Burr and Abalone data, it can be advantageous to log-transform
the data to unbounded form before using vaeac. If TRUE, then |
norm_mean |
Torch tensor (optional). A 1D array containing the means of the columns of |
norm_std |
Torch tensor (optional). A 1D array containing the stds of the columns of |
Value
List containing the data in a form that can be used in vaeac, the maps between the original and new class names for the categorical features, one_hot_max_sizes, and a list of information about the data.
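A minimal base R sketch of the categorical encoding step, assuming that categorical features are stored as factors and that continuous features receive a one_hot_max_sizes entry of 1. The function name preprocess_sketch and the returned element names are hypothetical; log transformation and normalization are left out.

```r
# Hypothetical helper: encode factors as integers 1..K and remember the labels.
preprocess_sketch <- function(data) {
  data <- as.data.frame(data)
  is_cat <- vapply(data, is.factor, logical(1))
  # Number of classes per categorical feature, 1 for continuous (assumption).
  one_hot_max_sizes <- vapply(
    data, function(col) if (is.factor(col)) nlevels(col) else 1L, integer(1)
  )
  level_map <- lapply(data[is_cat], levels) # original class labels
  if (any(is_cat)) data[is_cat] <- lapply(data[is_cat], as.integer)
  list(data = data, level_map = level_map, one_hot_max_sizes = one_hot_max_sizes)
}

train <- data.frame(age = c(31, 54), colour = factor(c("red", "blue")))
prep <- preprocess_sketch(train)
prep$data
prep$level_map
prep$one_hot_max_sizes
```

The stored level_map is what makes the integer codes reversible after data generation.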
Author(s)
Lars Henry Berge Olsen
Function to print out a training summary for the vaeac
model
Description
Function to print out a training summary for the vaeac
model
Usage
vaeac_print_train_summary(best_epoch, best_epoch_running, last_state)
Arguments
best_epoch |
Positive integer. The epoch with the lowest validation error. |
best_epoch_running |
Positive integer. The epoch with the lowest running validation error. |
last_state |
The state list (i.e., the saved |
Value
This function only prints out a message.
Author(s)
Lars Henry Berge Olsen
Function that saves the state list and the current save state of the vaeac
model
Description
Function that saves the state list and the current save state of the vaeac
model
Usage
vaeac_save_state(state_list, file_name, return_state = FALSE)
Arguments
state_list |
List containing all the parameters in the state. |
file_name |
String containing the file path. |
return_state |
Logical. Whether to return the state list. |
Value
This function does not return anything unless return_state is TRUE, in which case the state list is returned.
Author(s)
Lars Henry Berge Olsen
Train the Vaeac Model
Description
Function that fits a vaeac model to the given dataset based on the provided parameters, as described in Olsen et al. (2022). Note that all default parameters specified below originate from setup_approach.vaeac() and vaeac_get_extra_para_default().
Usage
vaeac_train_model(
x_train,
model_description,
folder_to_save_model,
cuda,
n_vaeacs_initialize,
epochs_initiation_phase,
epochs,
epochs_early_stopping,
save_every_nth_epoch,
val_ratio,
val_iwae_n_samples,
depth,
width,
latent_dim,
lr,
batch_size,
running_avg_n_values,
activation_function,
skip_conn_layer,
skip_conn_masked_enc_dec,
batch_normalization,
paired_sampling,
masking_ratio,
mask_gen_coalitions,
mask_gen_coalitions_prob,
sigma_mu,
sigma_sigma,
save_data,
log_exp_cont_feat,
which_vaeac_model,
verbose,
seed,
...
)
Arguments
x_train |
A data.table containing the training data. Categorical data must have class names |
model_description |
String (default is |
folder_to_save_model |
String (default is |
cuda |
Logical (default is |
n_vaeacs_initialize |
Positive integer (default is |
epochs_initiation_phase |
Positive integer (default is |
epochs |
Positive integer (default is |
epochs_early_stopping |
Positive integer (default is |
save_every_nth_epoch |
Positive integer (default is |
val_ratio |
Numeric (default is |
val_iwae_n_samples |
Positive integer (default is |
depth |
Positive integer (default is |
width |
Positive integer (default is |
latent_dim |
Positive integer (default is |
lr |
Positive numeric (default is |
batch_size |
Positive integer (default is |
running_avg_n_values |
Positive integer (default is |
activation_function |
An |
skip_conn_layer |
Logical (default is |
skip_conn_masked_enc_dec |
Logical (default is |
batch_normalization |
Logical (default is |
paired_sampling |
Logical (default is |
masking_ratio |
Numeric (default is |
mask_gen_coalitions |
Matrix (default is |
mask_gen_coalitions_prob |
Numeric array (default is |
sigma_mu |
Numeric (default is |
sigma_sigma |
Numeric (default is |
save_data |
Logical (default is |
log_exp_cont_feat |
Logical (default is |
which_vaeac_model |
String (default is |
verbose |
String vector or NULL.
Specifies the verbosity (printout detail level) through one or more of strings |
seed |
Positive integer (default is |
... |
List of extra parameters, currently not used. |
Details
The vaeac model consists of three neural networks: a masked encoder, a full encoder, and a decoder. The networks share depth, width, and activation_function. The encoders map x_train to a latent representation of dimension latent_dim, while the decoder maps the latent representations back to the feature space. See Olsen et al. (2022) for more details. The function first initializes n_vaeacs_initialize vaeac models with different randomly initialized network parameter values to guard against poor initial values. After epochs_initiation_phase epochs, the n_vaeacs_initialize vaeac models are compared, and the function continues to train only the best-performing one for a total of epochs epochs. The networks are trained using the Adam optimizer with learning rate lr.
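The two-phase training strategy can be illustrated with a toy base R sketch. Everything here is a stand-in chosen for illustration: a scalar parameter replaces the network weights, a simple non-convex function replaces the training loss, and plain gradient descent replaces the Adam optimizer. Only the argument names are borrowed from the usage above.

```r
set.seed(1)

# Toy loss with two local minima, and its gradient.
loss <- function(theta) (theta^2 - 1)^2 + 0.3 * theta
grad <- function(theta) 4 * theta * (theta^2 - 1) + 0.3

# Plain gradient descent for a given number of epochs.
train <- function(theta, n_epochs, lr) {
  for (epoch in seq_len(n_epochs)) theta <- theta - lr * grad(theta)
  theta
}

n_vaeacs_initialize <- 4
epochs_initiation_phase <- 5
epochs <- 100
lr <- 0.01

# Phase 1: train several randomly initialized candidates for a few epochs.
starts <- runif(n_vaeacs_initialize, -2, 2)
candidates <- vapply(starts, train, numeric(1),
                     n_epochs = epochs_initiation_phase, lr = lr)

# Phase 2: keep only the best candidate and train it for the remaining epochs.
best <- candidates[which.min(loss(candidates))]
final <- train(best, n_epochs = epochs - epochs_initiation_phase, lr = lr)
c(loss_after_phase_1 = loss(best), loss_final = loss(final))
```

The short first phase is cheap relative to the full run, so trying several initializations adds little cost while reducing the risk of continuing from a poor starting point.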
Value
A list containing the training/validation errors and paths to where the vaeac models are saved on the disk.
Author(s)
Lars Henry Berge Olsen
References
Function used to train a vaeac
model
Description
This function can be applied both in the initialization phase, when we train several initialized vaeac models, and afterwards, to keep training the best-performing vaeac model for the remaining number of epochs. We are in the former setting when initialization_idx is provided and in the latter when it is NULL. When it is NULL, we save to disk the vaeac models with the lowest VLB, IWAE, and running IWAE, as well as the models at the epochs given by save_every_nth_epoch.
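The bookkeeping behind "best" and "best running" epochs can be sketched in base R. The validation errors below are made-up numbers, and treating running_avg_n_values as the window length of a trailing mean is an assumption based on the argument name.

```r
# Made-up validation errors, one per epoch.
val_error <- c(5.0, 4.2, 4.6, 3.9, 4.1, 4.0, 4.3)
running_avg_n_values <- 3 # assumed: window length of the trailing mean

# Trailing mean over the last running_avg_n_values epochs (shorter at the start).
val_error_running <- vapply(seq_along(val_error), function(i) {
  mean(val_error[max(1, i - running_avg_n_values + 1):i])
}, numeric(1))

best_epoch <- which.min(val_error)
best_epoch_running <- which.min(val_error_running)
c(best_epoch = best_epoch, best_epoch_running = best_epoch_running)
```

The two criteria can disagree: a single lucky epoch wins on the raw error, while the running average favors a stretch of consistently low errors.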
Usage
vaeac_train_model_auxiliary(
vaeac_model,
optimizer,
train_dataloader,
val_dataloader,
val_iwae_n_samples,
running_avg_n_values,
verbose,
cuda,
epochs,
save_every_nth_epoch,
epochs_early_stopping,
epochs_start = 1,
progressr_bar = NULL,
vaeac_save_file_names = NULL,
state_list = NULL,
initialization_idx = NULL,
n_vaeacs_initialize = NULL,
train_vlb = NULL,
val_iwae = NULL,
val_iwae_running = NULL
)
Arguments
vaeac_model |
A |
optimizer |
A |
train_dataloader |
A |
val_dataloader |
A |
val_iwae_n_samples |
Positive integer (default is |
running_avg_n_values |
Positive integer (default is |
verbose |
String vector or NULL.
Specifies the verbosity (printout detail level) through one or more of strings |
cuda |
Logical (default is |
epochs |
Positive integer (default is |
save_every_nth_epoch |
Positive integer (default is |
epochs_early_stopping |
Positive integer (default is |
epochs_start |
Positive integer (default is |
progressr_bar |
A |
vaeac_save_file_names |
Array of strings containing the save file names for the |
state_list |
Named list containing the objects returned from |
initialization_idx |
Positive integer (default is |
n_vaeacs_initialize |
Positive integer (default is |
train_vlb |
A |
val_iwae |
A |
val_iwae_running |
A |
Value
The return value depends on whether we are in the initialization phase: either the trained vaeac model, or a list of where the vaeac models are stored on disk together with the parameters of the model.
Author(s)
Lars Henry Berge Olsen
Continue to Train the vaeac Model
Description
Function that loads a previously trained vaeac model and continues the training, either on new data or on the same dataset it was trained on before. If a new dataset is given, we assume that it has the same distribution and one_hot_max_sizes as the original dataset.
Usage
vaeac_train_model_continue(
explanation,
epochs_new,
lr_new = NULL,
x_train = NULL,
save_data = FALSE,
verbose = NULL,
seed = 1
)
Arguments
explanation |
A |
epochs_new |
Positive integer. The number of extra epochs to conduct. |
lr_new |
Positive numeric. If provided, it overwrites the old learning rate in the Adam optimizer. |
x_train |
A data.table containing the training data. Categorical data must have class names |
save_data |
Logical (default is |
verbose |
String vector or NULL.
Specifies the verbosity (printout detail level) through one or more of strings |
seed |
Positive integer (default is |
Value
A list containing the training/validation errors and paths to where the vaeac models are saved on the disk.
Author(s)
Lars Henry Berge Olsen
References
Move vaeac
parameters to correct location
Description
This function ensures that the main and extra parameters for the vaeac
approach are located at their correct locations.
Usage
vaeac_update_para_locations(parameters)
Arguments
parameters |
List. The |
Value
Updated version of parameters
where all vaeac
parameters are located at the correct location.
Author(s)
Lars Henry Berge Olsen
Function that checks and adds a pre-trained vaeac
model
Description
Function that checks and adds a pre-trained vaeac
model
Usage
vaeac_update_pretrained_model(parameters)
Arguments
parameters |
List containing the parameters used within |
Value
This function adds a valid pre-trained vaeac model to parameters.
Author(s)
Lars Henry Berge Olsen
Calculate weight matrix
Description
Calculate weight matrix
Usage
weight_matrix(X, normalize_W_weights = TRUE)
Arguments
X |
data.table.
Output from |
normalize_W_weights |
Logical. Whether to normalize the weights for the coalitions to sum to 1 for
increased numerical stability before solving the WLS (weighted least squares). Applies to all coalitions
except coalition |
Value
Numeric matrix. See weight_matrix_cpp()
for more information.
Author(s)
Nikolai Sellereite, Martin Jullum
Calculate weight matrix
Description
Calculate weight matrix
Usage
weight_matrix_cpp(coalitions, m, n, w)
Arguments
coalitions |
List. Each of the elements equals an integer vector representing a valid combination of features/feature groups. |
m |
Integer. Number of features/feature groups. |
n |
Integer. Number of combinations. |
w |
Numeric vector
Should have length |
Value
Matrix of dimension n x m + 1
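For intuition, the weighted least squares (WLS) step that such a weight matrix supports can be sketched in base R for m = 2 features. This is the textbook Kernel SHAP formulation, not a transcription of weight_matrix_cpp(): the large weight 1e6 for the empty and full coalitions, the choice to normalize only the remaining weights, the contribution values v, and the orientation of the resulting matrix (one row per Shapley value) are all assumptions of the sketch.

```r
coalitions <- list(integer(0), 1L, 2L, c(1L, 2L))
m <- 2
n <- length(coalitions)

# Unnormalized weights: very large for the empty and full coalitions (assumption).
w <- c(1e6, 2, 2, 1e6)
inner <- 2:(n - 1)
w[inner] <- w[inner] / sum(w[inner]) # normalize the remaining weights to sum to 1

# Binary design matrix: intercept column plus one indicator column per feature.
Z <- matrix(0, nrow = n, ncol = m + 1)
Z[, 1] <- 1
for (i in seq_len(n)) Z[i, coalitions[[i]] + 1] <- 1

# WLS solution operator (Z' W Z)^{-1} Z' W, here of dimension (m + 1) x n.
R <- solve(t(Z) %*% (w * Z)) %*% t(Z * w)

# Made-up contribution function values v(S), one per coalition.
v <- c(1, 3, 2, 6)
phi <- as.numeric(R %*% v)
round(phi, 4)
```

With these values, the first element of phi approximates v of the empty coalition, and all elements together sum approximately to v of the full coalition, as Shapley values should.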
Author(s)
Nikolai Sellereite, Martin Jullum