---
title: "CatBoost models"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{CatBoost models}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
eval_code <- requireNamespace("catboost", quietly = TRUE) &&
  requireNamespace("tidypredict", quietly = TRUE)
if (eval_code) {
  library(tidypredict)
  library(dplyr)
}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = eval_code
)
```

| Function                                                  | Works |
|-----------------------------------------------------------|-------|
| `tidypredict_fit()`, `tidypredict_sql()`, `parse_model()` | ✔     |
| `tidypredict_to_column()`                                 | ✔     |
| `tidypredict_test()`                                      | ✔     |
| `tidypredict_interval()`, `tidypredict_sql_interval()`    |       |
| `parsnip`                                                 | ✔     |

## `tidypredict_` functions

```{r}
library(catboost)

# Prepare data
X <- data.matrix(mtcars[, c("mpg", "cyl", "disp")])
y <- mtcars$hp
pool <- catboost.load_pool(
  X,
  label = y,
  feature_names = as.list(c("mpg", "cyl", "disp"))
)

model <- catboost.train(
  pool,
  params = list(
    iterations = 10L,
    depth = 3L,
    learning_rate = 0.5,
    loss_function = "RMSE",
    logging_level = "Silent",
    allow_writing_files = FALSE
  )
)
```

- Create the R formula:

```{r}
tidypredict_fit(model)
```

- Add the prediction to the original table:

```{r}
library(dplyr)

mtcars %>%
  tidypredict_to_column(model) %>%
  glimpse()
```

- Confirm that `tidypredict` results match the model's `predict()` results. The `xg_df` argument expects the matrix data set:

```{r}
tidypredict_test(model, xg_df = X)
```

## Supported objectives

CatBoost supports many objective functions.
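Each objective determines the transform that is applied to the raw score (the sum of the leaf values across all trees) to produce a prediction. As a minimal base-R sketch (the `sigmoid()` helper and the example score are illustrative, not part of the `tidypredict` API), the sigmoid transform used for the binary classification objectives is:

```r
# Sigmoid (inverse logit): maps a raw score in (-Inf, Inf) to a
# probability in (0, 1); used for Logloss and CrossEntropy.
sigmoid <- function(raw_score) 1 / (1 + exp(-raw_score))

sigmoid(0) # a raw score of 0 maps to a probability of 0.5
```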
The following objectives are supported by `tidypredict`:

### Regression objectives (identity transform)

- `RMSE` (default)
- `MAE`
- `Quantile`
- `MAPE`

### Count regression (exponential transform)

- `Poisson`

### Binary classification (sigmoid transform)

- `Logloss`
- `CrossEntropy`

### Multiclass classification

- `MultiClass` (softmax transform)
- `MultiClassOneVsAll` (sigmoid per class)

## Binary classification example

```{r}
X_bin <- data.matrix(mtcars[, c("mpg", "cyl", "disp")])
y_bin <- mtcars$am
pool_bin <- catboost.load_pool(
  X_bin,
  label = y_bin,
  feature_names = as.list(c("mpg", "cyl", "disp"))
)

model_bin <- catboost.train(
  pool_bin,
  params = list(
    iterations = 10L,
    depth = 3L,
    learning_rate = 0.5,
    loss_function = "Logloss",
    logging_level = "Silent",
    allow_writing_files = FALSE
  )
)

tidypredict_test(model_bin, xg_df = X_bin)
```

## Multiclass classification example

```{r}
X_multi <- data.matrix(iris[, 1:4])
y_multi <- as.integer(iris$Species) - 1L
pool_multi <- catboost.load_pool(
  X_multi,
  label = y_multi,
  feature_names = as.list(colnames(iris)[1:4])
)

model_multi <- catboost.train(
  pool_multi,
  params = list(
    iterations = 10L,
    depth = 3L,
    learning_rate = 0.5,
    loss_function = "MultiClass",
    logging_level = "Silent",
    allow_writing_files = FALSE
  )
)

# Multiclass returns a list of formulas, one per class
formulas <- tidypredict_fit(model_multi)
names(formulas)
```

Test the multiclass predictions:

```{r}
tidypredict_test(model_multi, xg_df = X_multi)
```

## Categorical features

CatBoost models can use categorical features with one-hot encoding.
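To make the encoding concrete, here is a minimal base-R sketch (independent of CatBoost; `model.matrix()` stands in for the engine's internal one-hot step):

```r
# One-hot encode a two-level factor: each level becomes its own 0/1 column,
# so a tree can split on "is this row 'manual'?" with a simple threshold.
df <- data.frame(cat_feat = factor(c("auto", "manual", "auto")))
one_hot <- model.matrix(~ cat_feat - 1, data = df)
one_hot
```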
### With parsnip/bonsai (recommended)

When using parsnip/bonsai, categorical features are handled automatically:

```{r}
library(parsnip)
library(bonsai)

df_cat <- data.frame(
  num_feat = mtcars$mpg,
  cat_feat = factor(ifelse(mtcars$am == 1, "manual", "auto")),
  target = mtcars$hp
)

model_spec <- boost_tree(trees = 10, tree_depth = 3) |>
  set_engine("catboost", logging_level = "Silent", one_hot_max_size = 10) |>
  set_mode("regression")

model_fit <- fit(model_spec, target ~ num_feat + cat_feat, data = df_cat)

# Categorical features are handled automatically
tidypredict_fit(model_fit)
```

### With raw CatBoost

For raw CatBoost models, you need to establish the hash-to-category mapping manually:

```{r}
pool_cat <- catboost.load_pool(
  df_cat[, c("num_feat", "cat_feat")],
  label = df_cat$target
)

model_cat <- catboost.train(
  pool_cat,
  params = list(
    iterations = 10L,
    depth = 3L,
    learning_rate = 0.5,
    loss_function = "RMSE",
    logging_level = "Silent",
    allow_writing_files = FALSE,
    one_hot_max_size = 10
  )
)

# Parse the model, then set the category mapping manually
pm_cat <- parse_model(model_cat)
pm_cat <- set_catboost_categories(pm_cat, model_cat, df_cat)

# Now use the parsed model
tidypredict_fit(pm_cat)
```

## Parse model spec

Here is an example of the model spec:

```{r}
pm <- parse_model(model)
str(pm, 2)
```

```{r}
str(pm$trees[1])
```

## Limitations

- Prediction intervals are not supported.
- CatBoost uses 32-bit floats for split thresholds, which may cause prediction discrepancies at exact split boundaries. See the [float precision](float-precision.html) article for details.