---
title: "Exploring Random Forests with ggRandomForests"
author: "John Ehrlinger"
date: today
format:
  html:
    toc: true
    html-math-method: mathjax
editor:
  markdown:
    wrap: 80
vignette: >
  %\VignetteIndexEntry{Exploring Random Forests with ggRandomForests}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

The **ggRandomForests** package extracts tidy data objects from either
`randomForestSRC` or `randomForest` fits and feeds them into familiar `ggplot2`
workflows. This vignette highlights the most common objects (`gg_error`,
`gg_variable`, and `gg_vimp`) along with a small helper for building balanced
conditioning intervals.

```{r pkg-setup, include=FALSE}
# Prefer the installed package; fall back to the development tree via pkgload.
if (requireNamespace("ggRandomForests", quietly = TRUE)) {
  library(ggRandomForests)
} else if (requireNamespace("pkgload", quietly = TRUE)) {
  pkgload::load_all(export_all = FALSE, helpers = FALSE, attach_testthat = FALSE)
} else {
  stop("Install ggRandomForests (or pkgload for dev builds) to render this vignette.")
}
```

## Error trajectories with `gg_error()`

```{r error-demo}
library(randomForest)

set.seed(42)
rf_iris <- randomForest(Species ~ ., data = iris, ntree = 200, keep.forest = TRUE)

# training = TRUE adds the in-bag error trajectory alongside the OOB curves
err_df <- ggRandomForests::gg_error(rf_iris, training = TRUE)
head(err_df)
```

The `gg_error()` object stores the cumulative OOB error rate for each outcome
column plus the `ntree` counter. When `training = TRUE`, the function
reconstructs the original model frame and appends the in-bag error trajectory
(`train`). Plotting overlays both curves by default:

```{r error-plot, fig.height=4}
plot(err_df)
```

## Marginal dependence via `gg_variable()`

```{r variable-demo}
set.seed(99)
boston <- MASS::Boston
rf_boston <- randomForest(medv ~ ., data = boston, ntree = 150)

var_df <- ggRandomForests::gg_variable(rf_boston)
str(var_df[, c("lstat", "yhat")])
```

Because the original training data are recovered from the model call,
`gg_variable()` works even when the forest was trained within helper functions
or against a `subset()` expression.
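
To make the idea of recovering data from a model call concrete, here is a minimal base R sketch. It is not the code ggRandomForests uses internally: `lm()` stands in for a forest fit, `fit_in_helper()` is a hypothetical helper, and the lookup through the terms environment is only one plausible way such a recovery could work.

```r
# Illustration only: lm() stands in for a forest fit, and fit_in_helper()
# is a hypothetical helper, not part of ggRandomForests.
fit_in_helper <- function() {
  # The training data exist only inside this function's environment.
  local_data <- subset(mtcars, cyl > 4)
  lm(mpg ~ wt + hp, data = local_data)
}

fit <- fit_in_helper()

# The stored call records the expression passed as `data`, not the data itself.
fit$call$data

# The model terms remember the environment where the formula was created,
# so the expression can be re-evaluated there to get the data back.
model_env <- attr(terms(fit), ".Environment")
recovered <- eval(fit$call$data, envir = model_env)

dim(recovered)
```

The recovered frame contains only the rows that passed the `subset()` filter, which is why a downstream plotting helper can rebuild the predictors without being handed the data a second time.
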
The output keeps the raw predictors plus either a continuous `yhat` column
(regression) or per-class probability columns prefixed with `yhat.`
(classification). Plotting a single variable is straightforward:

```{r variable-plot, fig.height=4}
plot(var_df, xvar = "lstat")
```

Survival forests can request multiple horizons using the `time` argument;
non-OOB predictions are available by setting `oob = FALSE`.

## Variable importance with `gg_vimp()`

```{r vimp-demo}
vimp_df <- ggRandomForests::gg_vimp(rf_boston)
head(vimp_df)
plot(vimp_df)
```

If a `randomForest` object lacks stored importance scores, `gg_vimp()` tries to
compute them on the fly. When the forest truly cannot provide the information
(for example when `importance = FALSE` and the predictors are no longer
accessible), the function emits a warning and returns `NA` placeholders so plots
still render.

## Balanced conditioning cuts with `quantile_pts()`

```{r quantile-demo}
# intervals = TRUE returns break points suitable for cut()
rm_breaks <- ggRandomForests::quantile_pts(boston$rm, groups = 6, intervals = TRUE)
rm_groups <- cut(boston$rm, breaks = rm_breaks)
table(rm_groups)
```

The helper wraps `stats::quantile()` to produce evenly populated strata that
drop directly into `cut()` when building coplots or facet labels.

## Next steps

* Inspect the full API reference at .
* Use `?gg_error`, `?gg_variable`, `?gg_vimp`, and `?quantile_pts` for
  additional arguments and examples.
* Pair these data objects with your own `ggplot2` themes to align with your
  preferred publication style.
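
As a sketch of the last point, the plot methods return ordinary `ggplot2` objects, so themes and labels can be layered on with `+`. The example assumes `ggplot2`, `randomForest`, `MASS`, and `ggRandomForests` are installed; the theme choice and the title text are illustrative, not package defaults.

```r
library(ggplot2)
library(randomForest)
library(ggRandomForests)

# Refit the regression forest from the earlier sections.
set.seed(99)
rf_boston <- randomForest(medv ~ ., data = MASS::Boston, ntree = 150)

# plot() on a gg_vimp object yields a ggplot, so ggplot2 layers compose with it.
vimp_plot <- plot(gg_vimp(rf_boston)) +
  theme_minimal(base_size = 12) +                       # illustrative theme
  labs(title = "Boston housing: variable importance")   # illustrative title

vimp_plot
```

The same pattern should apply to the other plot methods shown above, for example `plot(err_df) + theme_bw()`.
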