---
title: "How tidypredict generates tree formulas"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{How tidypredict generates tree formulas}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(dplyr)
library(tidypredict)
library(rpart)
set.seed(100)
```

This vignette explains how tidypredict converts decision tree models into dplyr formulas, and why we chose nested `case_when()` expressions over flat ones.

## Nested vs flat case_when

Consider a simple decision tree:

```
           +-------+
      +----|x <= 5 |----+
      |    +-------+    |
      v                 v
  +-------+           "high"
  |y <= 3 |
  +-------+
   |     |
   v     v
 "low" "med"
```

This tree has three leaves with predictions "low", "med", and "high".

### Flat case_when (old approach)

The flat approach lists every leaf path as a separate condition:

```r
case_when(
  x <= 5 & y <= 3 ~ "low",
  x <= 5 & y > 3  ~ "med",
  x > 5           ~ "high"
)
```

Each condition must encode the **entire path** from root to leaf. For a tree with depth `d`, each condition can have up to `d` comparisons joined by `&`.

### Nested case_when (current approach)

The nested approach mirrors the tree structure:

```r
case_when(
  x <= 5 ~ case_when(
    y <= 3 ~ "low",
    .default = "med"
  ),
  .default = "high"
)
```

Each internal node becomes its own `case_when()`: the split condition selects the left branch, and `.default` handles the right branch.
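As a quick check that the two forms agree, the sketch below applies both to a small made-up data frame that reaches all three leaves. The `df` values are illustrative only, and the code assumes dplyr 1.1.0 or later, where `case_when()` accepts `.default`.

```r
library(dplyr)

# Illustrative data (not from a real model) covering all three leaves
df <- tibble(
  x = c(1, 4, 5, 6, 9),
  y = c(2, 7, 3, 1, 8)
)

preds <- df %>%
  mutate(
    # Flat form: one condition per root-to-leaf path
    flat = case_when(
      x <= 5 & y <= 3 ~ "low",
      x <= 5 & y > 3  ~ "med",
      x > 5           ~ "high"
    ),
    # Nested form: one case_when() per split
    nested = case_when(
      x <= 5 ~ case_when(
        y <= 3 ~ "low",
        .default = "med"
      ),
      .default = "high"
    )
  )

# Both columns should hold the same predictions
identical(preds$flat, preds$nested)
```

The data deliberately contains no missing values. With an `NA` in `x`, these two expressions differ: the flat form matches no condition and returns `NA`, while the nested form falls through to `.default`.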

### Why nested is better

**1. Fewer comparisons at runtime**

With flat `case_when()`, every leaf condition spells out its full path from the root. Because `case_when()` is vectorized, dplyr evaluates each of those conditions over the whole column, so a split that was already decided higher in the tree is compared again in every leaf below it. In the example above, `x <= 5` appears in two conditions, and `x > 5` tests the same split a third time.

With nested `case_when()`, each split appears, and is evaluated, exactly once. The `.default` clause handles the "else" branch without re-checking the condition.
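The repeated splits can be counted directly from the two expressions for the example tree. The sketch below walks each unevaluated expression and counts its comparison calls. `count_comparisons()` is a helper written for this illustration, not a tidypredict function.

```r
# Count comparison calls (<=, <, >, >=) in an unevaluated expression
# by recursively walking its syntax tree.
count_comparisons <- function(expr) {
  if (!is.call(expr)) {
    return(0L)
  }
  is_cmp <- as.character(expr[[1]])[1] %in% c("<=", "<", ">", ">=")
  as.integer(is_cmp) +
    sum(vapply(as.list(expr)[-1], count_comparisons, integer(1)))
}

# quote() captures the expressions without evaluating them
flat <- quote(case_when(
  x <= 5 & y <= 3 ~ "low",
  x <= 5 & y > 3  ~ "med",
  x > 5           ~ "high"
))

nested <- quote(case_when(
  x <= 5 ~ case_when(
    y <= 3 ~ "low",
    .default = "med"
  ),
  .default = "high"
))

count_comparisons(flat)   # 5 comparisons
count_comparisons(nested) # 2 comparisons
```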

**2. More efficient SQL**

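The same reduction carries over to SQL. A flat `CASE` expression repeats each split inside compound `AND` conditions, while a nested `CASE` tests each split once and uses `ELSE` for the other branch. How much this helps at runtime depends on the database engine and its query planner.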

Flat SQL:

```sql
CASE
  WHEN x <= 5 AND y <= 3 THEN 'low'
  WHEN x <= 5 AND y > 3 THEN 'med'
  WHEN x > 5 THEN 'high'
END
```

Nested SQL:

```sql
CASE
  WHEN x <= 5 THEN
    CASE
      WHEN y <= 3 THEN 'low'
      ELSE 'med'
    END
  ELSE 'high'
END
```
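To see how a nested `case_when()` maps to a nested `CASE` expression, the following sketch uses dbplyr's `translate_sql()` with a simulated connection, so no database is needed. It assumes a recent dbplyr that translates the `.default` argument. The exact quoting and parenthesization of the output depend on the backend and dbplyr version, so no output is shown.

```r
library(dbplyr)

# simulate_dbi() stands in for a real database connection
con <- simulate_dbi()

# The expression is translated to SQL, not evaluated in R
sql_nested <- translate_sql(
  case_when(
    x <= 5 ~ case_when(
      y <= 3 ~ "low",
      .default = "med"
    ),
    .default = "high"
  ),
  con = con
)

sql_nested
```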

**3. Smaller formula size**

For a balanced tree of depth `d` with `2^d` leaves:

- Flat: Each leaf condition has `d` terms, so total terms = `d * 2^d`
- Nested: Each split appears once, so total terms = `2^d - 1`

For a tree of depth 2 (4 leaves):

- Flat: 2 * 4 = 8 comparison terms
- Nested: 4 - 1 = 3 comparison terms

For a tree of depth 6 (64 leaves):

- Flat: 6 * 64 = 384 comparison terms
- Nested: 64 - 1 = 63 comparison terms
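The two counts above can be reproduced for any depth with a few lines of base R. `term_counts()` is a hypothetical helper for this illustration, and it assumes a perfectly balanced tree.

```r
# Comparison terms in a balanced tree of depth d:
#   flat   = d terms per leaf * 2^d leaves
#   nested = one term per internal node = 2^d - 1
term_counts <- function(d) {
  data.frame(
    depth  = d,
    leaves = 2^d,
    flat   = d * 2^d,
    nested = 2^d - 1
  )
}

counts <- term_counts(c(2, 6))
counts
```

The ratio `flat / nested` grows roughly in proportion to `d`, so the saving becomes larger as trees get deeper.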

## Parsed model versions

tidypredict uses a version number in parsed models to track format changes:

- **Versions 1 and 2**: Used flat `case_when()` for trees
- **Version 3**: Uses nested `case_when()` (current)

When loading a model saved with an older version, tidypredict automatically uses the appropriate formula builder for backwards compatibility. See `?parse_model` for details.
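A minimal sketch of inspecting a parsed model follows. It assumes the installed tidypredict supports `rpart` models, as this vignette's setup suggests, and that the version is stored under `general$version` in the parsed list. Both are assumptions worth checking against your installed version.

```r
library(tidypredict)
library(rpart)

# Fit a small regression tree on a built-in dataset
model <- rpart(mpg ~ wt + hp, data = mtcars)

# parse_model() returns a plain list describing the model;
# the field location below is an assumption, not documented here
pm <- parse_model(model)
pm$general$version

# The dplyr formula generated for the fitted tree
tidypredict_fit(model)
```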