---
title: "Float precision at split boundaries"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Float precision at split boundaries}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## The issue

Tree-based models such as XGBoost, LightGBM, and CatBoost internally convert data to 32-bit floats during training, so split thresholds are chosen at 32-bit precision. R, however, uses 64-bit doubles by default, and most databases also store floating-point numbers at higher precision.

This precision mismatch can cause predictions to differ when a data point falls exactly on, or very close to, a split boundary. The 32-bit and 64-bit representations of the same number may round differently, sending the data point left in one system and right in another.

## Which models are affected?

XGBoost and Cubist store everything as 32-bit floats, making them the most susceptible to this issue. LightGBM and CatBoost use 64-bit doubles for leaf values, which reduces (but does not eliminate) the risk.

## Example

Here is a real example from a Cubist model. When we extract the split values used in the model's rules, we see values like:

```
variable  value
lstat     9.5299997
rm        6.2259998
rm        6.546
lstat     5.3899999
```

These split values should correspond to actual values in the training data. But when we check, only one of the four matches exactly:

```
# Exact matches
variable  value
rm        6.546

# Non-matches
variable  value
lstat     9.5299997
rm        6.2259998
lstat     5.3899999
```

If we look for nearby values in the training data:

```
variable  value_data  value_split
rm        6.226       6.2259998
rm        6.546       6.546
lstat     5.39        5.3899999
lstat     9.53        9.5299997
```

The original training values were `6.226`, `5.39`, and `9.53`, but they were converted to 32-bit floats during model training, resulting in slightly different stored thresholds.

Why does this matter?
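The altered thresholds above can be reproduced in base R by round-tripping a double through a 4-byte (single-precision) representation. The `as_float32()` helper below is purely illustrative, built from base R's `writeBin()`/`readBin()` with `size = 4`; it is not how any of these libraries are implemented internally:

```r
# Round-trip a 64-bit double through a 4-byte single-precision
# representation, mimicking the 32-bit conversion that tree
# libraries perform during training.
as_float32 <- function(x) {
  readBin(writeBin(x, raw(), size = 4), "double", size = 4)
}

print(as_float32(6.226), digits = 9)  # 6.22599983
print(as_float32(9.53),  digits = 9)  # 9.52999973
print(as_float32(5.39),  digits = 9)  # 5.38999987
```

The round-tripped values match the stored split thresholds, confirming that the discrepancy is exactly a 64-bit-to-32-bit rounding artifact.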
Consider a model with two rules:

```
rule 1: rm > 6.2259998
rule 2: rm <= 6.2259998
```

If you pass in an observation where `rm` is `6.226`, you might expect rule 1 to apply, since `6.226 > 6.2259998`. But the native model applies rule 2, because it internally converted `6.226` to `6.2259998` during training, making the two values equal.

## What tidypredict does

tidypredict extracts split thresholds from the model and uses them in R formulas or SQL queries. Since R and databases typically use 64-bit floats, the comparisons are made at 64-bit precision against thresholds that were originally determined at 32-bit precision.

In most cases this works fine, because data points rarely fall exactly on split boundaries. However, you should always verify that predictions match using `tidypredict_test()`.

## Pros and cons

Pros of using tidypredict despite this issue:

- In-database scoring avoids moving large datasets out of the database
- SQL translation is portable across database systems
- For most rows, predictions will match exactly

Risks:

- A small fraction of predictions may differ unpredictably
- For classification, boundary cases could flip the predicted class
- It is hard to know in advance which specific rows will be affected

When are values likely to hit boundaries?

- Integer or rounded data (e.g., age in whole years, prices rounded to cents)
- Data with limited unique values or repeated measurements
- Scoring the same data used for training

Continuous or high-precision real-world measurements are less likely to land exactly on split boundaries.

Considerations:

- Use `tidypredict_test()` to quantify the discrepancy rate on your data
- The magnitude of difference is usually small (just a neighboring leaf value)
- High-stakes applications (medical, financial) may need stricter validation

## Recommendations

1. Accept small differences: For production use, accept that a tiny fraction of predictions may differ at exact boundaries, and decide whether that is acceptable for your use case.

2. Use native predictions when possible: For applications where perfect agreement is critical, use the native model's predict function rather than the SQL translation.
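The boundary flip described in the two-rule example can be sketched in base R. The `as_float32()` helper and the threshold below are illustrative stand-ins (built from `writeBin()`/`readBin()` with `size = 4`), not tidypredict internals:

```r
# Round-trip a double through single precision (4 bytes), as a
# stand-in for the 32-bit conversion tree libraries perform.
as_float32 <- function(x) {
  readBin(writeBin(x, raw(), size = 4), "double", size = 4)
}

threshold <- as_float32(6.226)   # stored split value, ~6.2259998
rm_value  <- 6.226               # incoming observation, 64-bit double

# R/SQL-style comparison at 64-bit precision:
# rule 1 (rm > threshold) fires
rm_value > threshold             # TRUE

# Native-model behavior: the observation is first cast to 32 bits,
# so it equals the threshold and rule 2 (rm <= threshold) fires
as_float32(rm_value) > threshold # FALSE
```

The same observation is routed down different branches depending on whether the comparison happens at 64-bit or 32-bit precision, which is precisely the discrepancy `tidypredict_test()` is meant to surface.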