--- title: "OptimalBinningWoE: Practical Guide for Credit Risk Modeling" author: "José Evandeilton Lopes" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{OptimalBinningWoE: Practical Guide for Credit Risk Modeling} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 8, fig.height = 6, warning = FALSE, message = FALSE ) ``` # Introduction **OptimalBinningWoE** implements 36 high-performance binning algorithms for Weight of Evidence (WoE) transformation in credit scoring and risk modeling. This vignette demonstrates practical applications using real-world credit data. ## Package Overview The package provides: - **36 algorithms**: 20 numerical + 16 categorical methods - **C++ performance**: Fast processing of large datasets - **tidymodels integration**: Production-ready ML pipelines - **Regulatory compliance**: Monotonic binning for Basel/IFRS 9 - **Comprehensive metrics**: IV, KS, Gini, lift curves ## Theoretical Foundation ### Weight of Evidence (WoE) For bin $i$, WoE quantifies the logarithmic odds ratio: $$\text{WoE}_i = \ln\left(\frac{\text{Distribution of Events}_i}{\text{Distribution of Non-Events}_i}\right) = \ln\left(\frac{n_{i,1}/N_1}{n_{i,0}/N_0}\right)$$ **Interpretation**: - WoE > 0: Higher risk than population average - WoE < 0: Lower risk than population average - WoE ≈ 0: Similar to population average ### Information Value (IV) IV measures total predictive power: $$\text{IV} = \sum_{i=1}^{k} \left(\frac{n_{i,1}}{N_1} - \frac{n_{i,0}}{N_0}\right) \times \text{WoE}_i$$ **Benchmarks** (Siddiqi, 2006): | IV Range | Predictive Power | Recommendation | |-------------|------------------|----------------| | < 0.02 | Unpredictive | Exclude | | 0.02 - 0.10 | Weak | Marginal | | 0.10 - 0.30 | Medium | Include | | 0.30 - 0.50 | Strong | Prioritize | | > 0.50 | Suspicious | Check leakage | # Installation ```{r install, eval=FALSE} # From GitHub devtools::install_github("evandeilton/OptimalBinningWoE") # Install dependencies for this vignette install.packages(c("scorecard", "tidymodels", "pROC")) ``` # Dataset: German Credit Data ```{r load_data} library(OptimalBinningWoE) library(scorecard) # Load German credit dataset data("germancredit", package = "scorecard") # Inspect structure dim(germancredit) str(germancredit[, 1:8]) # Target variable table(germancredit$creditability) cat("\nDefault rate:", round(mean(germancredit$creditability == "bad") * 100, 2), "%\n") ``` ## Data Preparation ```{r data_prep} # Create binary target (must be a factor for tidymodels classification) german <- germancredit german$default <- factor( ifelse(german$creditability == "bad", 1, 0), levels = c(0, 1), labels = c("good", "bad") ) german$creditability <- NULL # Select key features for demonstration features_num <- c("duration.in.month", "credit.amount", "age.in.years") features_cat <- c( "status.of.existing.checking.account", "credit.history", "purpose", "savings.account.and.bonds" ) german_model <- german[c("default", features_num, features_cat)] # Summary statistics cat("Numerical features:\n") summary(german_model[, features_num]) cat("\n\nCategorical features:\n") sapply(german_model[, features_cat], function(x) length(unique(x))) ``` # Quick Start: Single Feature Binning ```{r quickstart_single} # Bin credit amount with JEDI algorithm result_single <- obwoe( data = german_model, target = "default", feature = "credit.amount", algorithm = "jedi", min_bins = 
3, max_bins = 6 ) # View results print(result_single) # Detailed binning table result_single$results$credit.amount ``` ## Visualize Binning Results ```{r quickstart_plot} # WoE pattern visualization plot(result_single, type = "woe") ``` ## Key Insights from Single Feature ```{r quickstart_insights} # Extract metrics bins <- result_single$results$credit.amount cat("Binning Summary:\n") cat(" Number of bins:", nrow(bins), "\n") cat(" Total IV:", round(sum(bins$iv), 4), "\n") cat(" Monotonic:", all(diff(bins$woe) >= 0) || all(diff(bins$woe) <= 0), "\n\n") # Event rates by bin bins_summary <- data.frame( Bin = bins$bin, Count = bins$count, Event_Rate = round(bins$count_pos / bins$count * 100, 2), WoE = round(bins$woe, 4), IV_Contribution = round(bins$iv, 4) ) print(bins_summary) ``` # Multiple Features: Automated Binning ```{r multifeature_binning} # Bin all features simultaneously result_multi <- obwoe( data = german_model, target = "default", algorithm = "cm", min_bins = 3, max_bins = 4 ) # Summary of all features summary(result_multi) ``` ## Feature Selection by IV ```{r feature_selection} # Extract IV summary iv_summary <- result_multi$summary[!result_multi$summary$error, ] iv_summary <- iv_summary[order(-iv_summary$total_iv), ] # Top predictive features cat("Top 5 Features by Information Value:\n\n") print(head(iv_summary[, c("feature", "total_iv", "n_bins")], 5)) # Select features with IV >= 0.02 strong_features <- iv_summary$feature[iv_summary$total_iv >= 0.02] cat("\n\nFeatures with IV >= 0.02:", length(strong_features), "\n") ``` ## Gains Table Analysis ```{r gains_analysis} # Compute gains for best numerical feature best_num_feature <- iv_summary$feature[ iv_summary$feature %in% features_num ][1] gains <- obwoe_gains(result_multi, feature = best_num_feature, sort_by = "id") print(gains) # Plot gains curves oldpar <- par(mfrow = c(2, 2)) plot(gains, type = "cumulative") plot(gains, type = "ks") plot(gains, type = "lift") plot(gains, type = "woe_iv") par(oldpar) ``` # Algorithm Comparison Different algorithms excel in different scenarios. Let's compare performance. 
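Since the comparison below ranks algorithms by their total IV, it is worth being explicit about how that number is built up from a binning table. The short sketch below recomputes WoE and IV by hand for a toy table (the counts are hypothetical, not taken from the German credit data), following the formulas from the introduction.

```{r woe_iv_by_hand}
# Hand-check of the WoE/IV formulas on a toy binning table
# (hypothetical counts, not derived from the German credit data)
toy <- data.frame(
  bin       = c("low", "mid", "high"),
  count_pos = c(20, 60, 120),  # events (bad) per bin
  count_neg = c(280, 240, 80)  # non-events (good) per bin
)

N1 <- sum(toy$count_pos)  # total events
N0 <- sum(toy$count_neg)  # total non-events

# WoE_i = ln((n_i1 / N1) / (n_i0 / N0))
toy$woe <- log((toy$count_pos / N1) / (toy$count_neg / N0))

# IV contribution per bin and total IV
toy$iv <- (toy$count_pos / N1 - toy$count_neg / N0) * toy$woe
toy
sum(toy$iv)  # total IV, the quantity used to rank features and algorithms
```

The per-bin `iv` values here correspond to the `iv` column in the binning tables shown earlier, which sum to the feature's total IV.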
```{r algorithm_comparison}
# Test multiple algorithms on credit.amount
algorithms <- c("jedi", "mdlp", "mob", "ewb", "cm")

compare_algos <- function(data, target, feature, algos) {
  results <- lapply(algos, function(algo) {
    tryCatch(
      {
        fit <- obwoe(
          data = data,
          target = target,
          feature = feature,
          algorithm = algo,
          min_bins = 3,
          max_bins = 6
        )
        data.frame(
          Algorithm = algo,
          N_Bins = fit$summary$n_bins[1],
          IV = round(fit$summary$total_iv[1], 4),
          Converged = fit$summary$converged[1],
          stringsAsFactors = FALSE
        )
      },
      error = function(e) {
        # Return NA but log the error for debugging during vignette rendering
        message(sprintf("Algorithm '%s' failed: %s", algo, e$message))
        data.frame(
          Algorithm = algo,
          N_Bins = NA_integer_,
          IV = NA_real_,
          Converged = FALSE,
          stringsAsFactors = FALSE
        )
      }
    )
  })
  do.call(rbind, results)
}

# Compare on credit.amount
comp_result <- compare_algos(
  german_model, "default", "credit.amount", algorithms
)

cat("Algorithm Comparison on 'credit.amount':\n\n")
print(comp_result[order(-comp_result$IV), ])
```

## Algorithm Selection Guide

```{r algo_guide}
# View algorithm capabilities
algo_info <- obwoe_algorithms()

cat("Algorithm Categories:\n\n")

cat("Fast for Large Data (O(n) complexity):\n")
print(algo_info[
  algo_info$algorithm %in% c("ewb", "sketch"),
  c("algorithm", "numerical", "categorical")
])

cat("\n\nRegulatory Compliant (Monotonic):\n")
print(algo_info[
  algo_info$algorithm %in% c("mob", "mblp", "ir"),
  c("algorithm", "numerical", "categorical")
])

cat("\n\nGeneral Purpose:\n")
print(algo_info[
  algo_info$algorithm %in% c("jedi", "cm", "mdlp"),
  c("algorithm", "numerical", "categorical")
])
```

## Algorithm Selection by Use Case

| Use Case | Recommended Algorithm | Rationale |
|----------|----------------------|-----------|
| **General credit scoring** | `jedi`, `mob` | Best balance of speed and predictive power. |
| **Monotonicity mandatory** | `mob`, `mblp`, `ir` | Guaranteed monotonic WoE profile for regulatory compliance. |
| **Large datasets (>1M rows)** | `sketch`, `ewb` | Constant or sublinear memory footprint. |
| **Non-linear associations** | `dp`, `cm` | Optimal partitioning (dynamic programming) capturing complex trends. |
| **Mixed data types** | `jedi_mwoe`, `udt` | Handles both numerical and categorical features uniformly. |
| **Outlier robustness** | `mdlp`, `fetb` | Entropy-based discretization less sensitive to extreme values. |
| **Sparse categorical data** | `gmb`, `ivb`, `swb` | Groups infrequent categories based on similar risk profiles. |

### Complete Algorithm List (36 Algorithms)

| Type | Algorithms |
|------|------------|
| **Universal (9)** | `jedi`, `jedi_mwoe`, `cm`, `dp`, `dmiv`, `fetb`, `mob`, `sketch`, `udt` |
| **Numerical (12)** | `bb`, `ewb`, `fast_mdlp`, `ir`, `kmb`, `ldb`, `lpdb`, `mblp`, `mdlp`, `mrblp`, `oslp`, `ubsd` |
| **Categorical (15)** | `gmb`, `ivb`, `mba`, `milp`, `sab`, `sblp`, `swb` (and others) |

*The full mapping can be inspected via `obwoe_algorithms()`.*

# Production Pipeline with tidymodels

The most powerful application is integrating WoE into production ML workflows.
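Before the full tuned workflow, the sketch below previews the core pattern: a recipe containing `step_obwoe()` is prepped and baked to return WoE-encoded predictors. It reuses the same `step_obwoe()` arguments as the tuned recipe later in this section and is marked `eval=FALSE` because it is only an illustrative preview of the prep/bake behaviour.

```{r recipe_preview, eval=FALSE}
library(tidymodels) # also attached in the setup chunk below

# Minimal sketch: learn bins and return WoE columns directly,
# mirroring the step_obwoe() call used in the tuned workflow below.
# For illustration only; in practice prep on training data only
# (see the Pitfalls section).
rec_preview <- recipe(default ~ ., data = german_model) %>%
  step_obwoe(
    all_predictors(),
    outcome = "default",
    algorithm = "jedi",
    min_bins = 3,
    max_bins = 6,
    output = "woe"
  )

baked <- bake(prep(rec_preview), new_data = NULL)
head(baked[, 1:4])
```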
```{r tidymodels_setup, message=FALSE}
library(tidymodels)

# Train/test split with stratification
set.seed(123)
german_split <- initial_split(german_model, prop = 0.7, strata = default)
train_data <- training(german_split)
test_data <- testing(german_split)

cat("Training set:", nrow(train_data), "observations\n")
cat("Test set:", nrow(test_data), "observations\n")
cat("Train default rate:", round(mean(train_data$default == "bad") * 100, 2), "%\n")
```

## Define Preprocessing Recipe

```{r recipe_definition}
# Create recipe with WoE transformation
rec_woe <- recipe(default ~ ., data = train_data) %>%
  step_obwoe(
    all_predictors(),
    outcome = "default",
    algorithm = "jedi",
    min_bins = 2,
    max_bins = tune(), # Hyperparameter tuning
    bin_cutoff = 0.05,
    output = "woe"
  )

# Preview recipe
rec_woe
```

## Model Specification and Workflow

```{r workflow_setup}
# Logistic regression specification
lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# Create complete workflow
wf_credit <- workflow() %>%
  add_recipe(rec_woe) %>%
  add_model(lr_spec)

wf_credit
```

## Hyperparameter Tuning

```{r cv_tuning}
# Define tuning grid (named to avoid masking tune::tune_grid())
bins_grid <- tibble(max_bins = c(4, 6, 8))

# Create cross-validation folds
set.seed(456)
cv_folds <- vfold_cv(train_data, v = 5, strata = default)

# Tune workflow
tune_results <- tune_grid(
  wf_credit,
  resamples = cv_folds,
  grid = bins_grid,
  metrics = metric_set(roc_auc, accuracy)
)

# Best configuration by AUC
collect_metrics(tune_results) %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean))

# Visualize tuning
autoplot(tune_results, metric = "roc_auc")
```

## Final Model Fitting

```{r final_model}
# Select best parameters
best_params <- select_best(tune_results, metric = "roc_auc")
cat("Optimal max_bins:", best_params$max_bins, "\n\n")

# Finalize and fit
final_wf <- finalize_workflow(wf_credit, best_params)
final_fit <- fit(final_wf, data = train_data)

# Extract coefficients
final_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  arrange(desc(abs(estimate)))
```

## Model Evaluation

```{r model_eval}
# Predictions on test set
test_pred <- augment(final_fit, test_data)

# Performance metrics (named to avoid masking yardstick::metrics())
eval_metrics <- metric_set(roc_auc, accuracy, sens, spec, precision)
eval_metrics(test_pred,
  truth = default,
  estimate = .pred_class,
  .pred_bad,
  event_level = "second"
)

# ROC curve
roc_curve(test_pred,
  truth = default,
  .pred_bad,
  event_level = "second"
) %>%
  autoplot() +
  labs(title = "ROC Curve - German Credit Model")
```

## Inspect Learned Binning Rules

```{r inspect_binning}
# Extract trained recipe
trained_rec <- extract_recipe(final_fit)
woe_step <- trained_rec$steps[[1]]

# View binning for credit.amount
credit_bins <- woe_step$binning_results$credit.amount
data.frame(
  Bin = credit_bins$bin,
  WoE = round(credit_bins$woe, 4),
  IV = round(credit_bins$iv, 4)
)
```

# Traditional Scorecard Development

This section builds a classical scorecard outside tidymodels, using base R and `glm()`.
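The split in the next subsection draws a simple random sample, whereas the tidymodels split above stratified on the target. If stratification is wanted here as well, a small base-R helper can provide it; `strat_split()` below is a hypothetical sketch, not a function exported by the package.

```{r stratified_split_sketch}
# Hypothetical helper: stratified training index in base R,
# sampling the same proportion within each level of the target
strat_split <- function(y, prop = 0.7, seed = 789) {
  set.seed(seed)
  idx <- unlist(lapply(split(seq_along(y), y), function(i) {
    sample(i, size = floor(prop * length(i)))
  }))
  sort(idx)
}

# Usage (drop-in replacement for the simple split below):
# train_idx <- strat_split(german_model$default, prop = 0.7)
```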
## Train-Test Split ```{r scorecard_split} set.seed(789) n_total <- nrow(german_model) train_idx <- sample(1:n_total, size = 0.7 * n_total) train_sc <- german_model[train_idx, ] test_sc <- german_model[-train_idx, ] ``` ## Fit Optimal Binning ```{r scorecard_binning} # Use monotonic binning for regulatory compliance sc_binning <- obwoe( data = train_sc, target = "default", algorithm = "mob", # Monotonic Optimal Binning min_bins = 3, max_bins = 5, control = control.obwoe( bin_cutoff = 0.05, convergence_threshold = 1e-6 ) ) summary(sc_binning) ``` ## Apply WoE Transformation ```{r scorecard_transform} # Transform training data train_woe <- obwoe_apply(train_sc, sc_binning, keep_original = FALSE) # Transform test data (uses training bins) test_woe <- obwoe_apply(test_sc, sc_binning, keep_original = FALSE) # Preview transformed features head(train_woe[, c("default", grep("_woe$", names(train_woe), value = TRUE)[1:3])], 10) ``` ## Build Logistic Regression ```{r scorecard_model} # Select features with IV >= 0.02 selected <- sc_binning$summary$feature[ sc_binning$summary$total_iv >= 0.02 & !sc_binning$summary$error ] woe_vars <- paste0(selected, "_woe") formula_str <- paste("default ~", paste(woe_vars, collapse = " + ")) # Fit model scorecard_glm <- glm( as.formula(formula_str), data = train_woe, family = binomial(link = "logit") ) summary(scorecard_glm) ``` ## Scorecard Validation ```{r scorecard_validation} library(pROC) # Predictions test_woe$score <- predict(scorecard_glm, newdata = test_woe, type = "response") # ROC curve roc_obj <- roc(test_woe$default, test_woe$score, quiet = TRUE) auc_val <- auc(roc_obj) # KS statistic ks_stat <- max(abs( ecdf(test_woe$score[test_woe$default == "bad"])(seq(0, 1, 0.01)) - ecdf(test_woe$score[test_woe$default == "good"])(seq(0, 1, 0.01)) )) # Gini coefficient gini <- 2 * auc_val - 1 cat("Scorecard Performance:\n") cat(" AUC: ", round(auc_val, 4), "\n") cat(" Gini: ", round(gini, 4), "\n") cat(" KS: ", round(ks_stat * 100, 2), "%\n") # ROC plot plot(roc_obj, main = "Scorecard ROC Curve", print.auc = TRUE, print.thres = "best" ) ``` # Data Preprocessing Proper preprocessing improves binning quality. 
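The example below calls `ob_preprocess()` with `outlier_method = "iqr"`. As a reminder of what the standard IQR (Tukey) rule flags (the package's exact handling of flagged values should be checked in its report), here is the textbook fence computation:

```{r iqr_fences}
# Standard IQR (Tukey) fences: values outside
# [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] are treated as outliers
set.seed(1)
x <- c(rnorm(100, 5000, 2000), 40000) # toy vector with one extreme value

q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
iqr <- unname(q[2] - q[1])
fences <- c(
  lower = unname(q[1]) - 1.5 * iqr,
  upper = unname(q[2]) + 1.5 * iqr
)

fences
sum(x < fences["lower"] | x > fences["upper"]) # how many values get flagged
```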
## Handling Missing Values and Outliers ```{r preprocessing} # Simulate feature with issues set.seed(2024) problematic <- c( rnorm(800, 5000, 2000), # Normal values rep(NA, 100), # Missing runif(100, -10000, 50000) # Outliers ) target_sim <- rbinom(1000, 1, 0.3) # Preprocess with IQR method preproc_result <- ob_preprocess( feature = problematic, target = target_sim, outlier_method = "iqr", outlier_process = TRUE, preprocess = "both" ) # View report print(preproc_result$report) # Compare distributions cat("\n\nBefore preprocessing:\n") cat(" Range:", range(problematic, na.rm = TRUE), "\n") cat(" Missing:", sum(is.na(problematic)), "\n") cat(" Mean:", round(mean(problematic, na.rm = TRUE), 2), "\n") cat("\nAfter preprocessing:\n") cleaned <- preproc_result$preprocess$feature_preprocessed cat(" Range:", range(cleaned), "\n") cat(" Missing:", sum(is.na(cleaned)), "\n") cat(" Mean:", round(mean(cleaned), 2), "\n") ``` # Production Deployment ## Model Serialization ```{r production_save, eval=FALSE} # Add metadata to model sc_binning$metadata <- list( creation_date = Sys.time(), creator = Sys.info()["user"], dataset_size = nrow(train_sc), default_rate = mean(train_sc$default == "bad"), r_version = R.version.string, package_version = packageVersion("OptimalBinningWoE") ) # Save model saveRDS(sc_binning, "credit_scorecard_v1_20250101.rds") # Load model loaded_model <- readRDS("credit_scorecard_v1_20250101.rds") ``` ## Production Scoring Function ```{r production_score, eval=FALSE} score_applications <- function(new_data, model_file) { # Load binning model binning_model <- readRDS(model_file) # Validate required features required_vars <- binning_model$summary$feature[ !binning_model$summary$error ] missing_vars <- setdiff(required_vars, names(new_data)) if (length(missing_vars) > 0) { stop("Missing features: ", paste(missing_vars, collapse = ", ")) } # Apply WoE transformation scored <- obwoe_apply(new_data, binning_model, keep_original = TRUE) # Add timestamp scored$scoring_date <- Sys.Date() return(scored) } # Usage example # new_apps <- read.csv("new_applications.csv") # scored_apps <- score_applications(new_apps, "credit_scorecard_v1_20250101.rds") ``` # Best Practices Summary ## Workflow Recommendations 1. **Start Simple**: Use `algorithm = "jedi"` as default 2. **Check IV**: Select features with IV ≥ 0.02 3. **Validate Monotonicity**: Use MOB/MBLP for regulatory models 4. **Cross-Validate**: Tune binning parameters with CV 5. **Monitor Stability**: Track WoE distributions over time 6. **Document Thoroughly**: Save metadata with models ## Common Pitfalls to Avoid ```{r pitfalls, eval=FALSE} # ❌ Don't bin on full dataset before splitting # This causes data leakage! bad_approach <- obwoe(full_data, target = "y") train_woe <- obwoe_apply(train_data, bad_approach) # ✅ Correct: Bin only on training data good_approach <- obwoe(train_data, target = "y") test_woe <- obwoe_apply(test_data, good_approach) # ❌ Don't ignore IV thresholds # IV > 0.50 likely indicates target leakage suspicious_features <- result$summary$feature[ result$summary$total_iv > 0.50 ] # ❌ Don't over-bin # Too many bins (>10) reduces interpretability # and may cause overfitting ``` # References Siddiqi, N. (2006). *Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring*. John Wiley & Sons. Thomas, L. C., Edelman, D. B., & Crook, J. N. (2002). *Credit Scoring and Its Applications*. SIAM. Navas-Palencia, G. (2020). Optimal Binning: Mathematical Programming Formulation. 
*Expert Systems with Applications*, 158, 113508. Anderson, R. (2007). *The Credit Scoring Toolkit: Theory and Practice for Retail Credit Risk Management*. Oxford University Press. # Session Information ```{r session_info} sessionInfo() ```