---
title: "Float precision at split boundaries"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Float precision at split boundaries}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## The issue

Tree-based models such as XGBoost, LightGBM, and CatBoost internally convert data to 32-bit floats during training, so split thresholds are chosen at 32-bit precision. R, however, uses 64-bit doubles by default, and most databases also store floating-point numbers at higher precision.

This precision mismatch can cause predictions to differ when a data point falls exactly on, or very close to, a split boundary. The 32-bit and 64-bit representations of the same number may round differently, sending the data point left in one system and right in another.

## Which models are affected?

XGBoost and Cubist store everything as 32-bit floats, making them the most susceptible to this issue. LightGBM and CatBoost use 64-bit doubles for leaf values, which reduces (but does not eliminate) the risk.

## Example

Here is a real example from a Cubist model. When we extract the split values used in the model's rules, we see values like:

```
variable  value
lstat     9.5299997
rm        6.2259998
rm        6.546
lstat     5.3899999
```

These split values should correspond to actual values in the training data. But when we check, only one of the four matches exactly:

```
# Exact matches
variable  value
rm        6.546

# Non-matches
variable  value
lstat     9.5299997
rm        6.2259998
lstat     5.3899999
```

If we look for nearby values in the training data:

```
variable  value_data  value_split
rm        6.226       6.2259998
rm        6.546       6.546
lstat     5.39        5.3899999
lstat     9.53        9.5299997
```

The original training values were `6.226`, `5.39`, and `9.53`, but they were converted to 32-bit floats during model training, resulting in slightly different stored thresholds.

Why does this matter?
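The altered thresholds above can be reproduced in base R by round-tripping a double through a 4-byte (single-precision) representation. The `as_float32()` helper below is purely illustrative, built from base R's `writeBin()`/`readBin()` with `size = 4`; it is not how any of these libraries are implemented internally:

```r
# Round-trip a 64-bit double through a 4-byte single-precision
# representation, mimicking the 32-bit conversion that tree
# libraries perform during training.
as_float32 <- function(x) {
  readBin(writeBin(x, raw(), size = 4), "double", size = 4)
}

print(as_float32(6.226), digits = 9)  # 6.22599983
print(as_float32(9.53),  digits = 9)  # 9.52999973
print(as_float32(5.39),  digits = 9)  # 5.38999987
```

The round-tripped values match the stored split thresholds, confirming that the discrepancy is exactly a 64-bit-to-32-bit rounding artifact.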
Consider a model with two rules:

```
rule 1: rm > 6.2259998
rule 2: rm <= 6.2259998
```

If you pass in an observation where `rm` is `6.226`, you might expect rule 1 to apply, since `6.226 > 6.2259998`. But the native model applies rule 2, because it internally converted `6.226` to `6.2259998` during training, making the two values equal.

## What tidypredict does

tidypredict extracts split thresholds from the model and uses them in R formulas or SQL queries. Since R and databases typically use 64-bit floats, the comparisons are made at 64-bit precision against thresholds that were originally determined at 32-bit precision.

In most cases this works fine, because data points rarely fall exactly on split boundaries. However, you should always verify that predictions match using `tidypredict_test()`.

## Pros and cons

Pros of using tidypredict despite this issue:

- In-database scoring avoids moving large datasets out of the database
- SQL translation is portable across database systems
- For most rows, predictions will match exactly

Risks:

- A small fraction of predictions may differ unpredictably
- For classification, boundary cases could flip the predicted class
- It is hard to know in advance which specific rows will be affected

When are values likely to hit boundaries?

- Integer or rounded data (e.g., age in whole years, prices rounded to cents)
- Data with limited unique values or repeated measurements
- Scoring the same data used for training

Continuous or high-precision real-world measurements are less likely to land exactly on split boundaries.

Considerations:

- Use `tidypredict_test()` to quantify the discrepancy rate on your data
- The magnitude of difference is usually small (just a neighboring leaf value)
- High-stakes applications (medical, financial) may need stricter validation

## Recommendations

1. Accept small differences: For production use, accept that a tiny fraction of predictions may differ at exact boundaries, and decide whether that is acceptable for your use case.

2. Use native predictions when possible: For applications where perfect agreement is critical, use the native model's predict function rather than the SQL translation.
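The boundary flip described in the two-rule example can be sketched in base R. The `as_float32()` helper and the threshold below are illustrative stand-ins (built from `writeBin()`/`readBin()` with `size = 4`), not tidypredict internals:

```r
# Round-trip a double through single precision (4 bytes), as a
# stand-in for the 32-bit conversion tree libraries perform.
as_float32 <- function(x) {
  readBin(writeBin(x, raw(), size = 4), "double", size = 4)
}

threshold <- as_float32(6.226)   # stored split value, ~6.2259998
rm_value  <- 6.226               # incoming observation, 64-bit double

# R/SQL-style comparison at 64-bit precision:
# rule 1 (rm > threshold) fires
rm_value > threshold             # TRUE

# Native-model behavior: the observation is first cast to 32 bits,
# so it equals the threshold and rule 2 (rm <= threshold) fires
as_float32(rm_value) > threshold # FALSE
```

The same observation is routed down different branches depending on whether the comparison happens at 64-bit or 32-bit precision, which is precisely the discrepancy `tidypredict_test()` is meant to surface.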