---
title: "Getting started with SelectBoost.beta"
shorttitle: "SelectBoost.beta quick tour"
author:
- name: "SelectBoost.beta authors"
  affiliation:
  - Cedric, Cnam, Paris
  email: frederic.bertrand@lecnam.net
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with SelectBoost.beta}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
LOCAL <- identical(Sys.getenv("LOCAL"), "TRUE")
knitr::opts_chunk$set(purl = LOCAL, collapse = TRUE, comment = "#>")
suppressPackageStartupMessages(library(SelectBoost.beta))
set.seed(2024)
```

## Introduction

This vignette provides a CRAN-friendly tour of the SelectBoost.beta workflow. It
simulates a reproducible beta-regression data set, runs the high-level
`sb_beta()` driver, and shows how to interpret the stability matrix returned by
the algorithm. All code is self-contained and executes quickly under the default
knitr settings.

## Simulated data

We use the built-in `simulation_DATA.beta()` helper to generate a correlated
design with three truly associated predictors. The response lives in `(0, 1)`
and is already compatible with the beta-regression selectors.

```{r, cache=TRUE, eval=LOCAL}
sim <- simulation_DATA.beta(n = 120, p = 6, s = 3, rho = 0.35,
                            beta_size = c(1.1, -0.9, 0.7))
str(sim$X)
summary(sim$Y)
```

## Running `sb_beta()`

The `sb_beta()` wrapper orchestrates the full SelectBoost loop: it normalises
the design matrix, groups correlated predictors, regenerates surrogate designs,
and records selection frequencies for each threshold.

```{r, cache=TRUE, eval=LOCAL}
sb <- sb_beta(sim$X, sim$Y, B = 40, step.num = 0.4, seed = 99)
sb
```

The returned matrix has one row per correlation threshold.
Attributes attached to the matrix document how the fit was produced:

```{r, cache=TRUE, eval=LOCAL}
attr(sb, "c0.seq")
attr(sb, "B")
attr(sb, "interval")
```

Use `summary()` to obtain per-threshold summaries and `autoplot.sb_beta()` (when
`ggplot2` is available) to visualise the stability matrix.

```{r, cache=TRUE, eval=LOCAL}
summary(sb)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  autoplot.sb_beta(sb)
}
```

The frequency values range between 0 and 1 and report how often each predictor
received a non-zero coefficient across the correlated replicates. High values
signal stable selections. If your data contain zeros or ones, keep
`squeeze = TRUE` (the default) so the algorithm applies the standard SelectBoost
transformation before fitting the selectors.

## Comparing selectors

When you wish to benchmark multiple selector families, the
`compare_selectors_single()` helper runs them once on the same data set and
returns both raw coefficients and a tidy summary table. Column names are briefly
shortened internally to satisfy each selector and then mapped back in the
outputs.

```{r, cache=TRUE, eval=LOCAL}
single <- compare_selectors_single(sim$X, sim$Y, include_enet = FALSE)
head(single$table)
```

Bootstrap tallies add a stability perspective. The `freq` column in the table
below measures the proportion of resamples where the variable was selected;
values close to 1 indicate consistent discoveries.

```{r, cache=TRUE, eval=LOCAL}
freq <- suppressWarnings(compare_selectors_bootstrap(
  sim$X, sim$Y, B = 100, include_enet = FALSE, seed = 99
))
head(freq)
```

Merge both views with `compare_table()` and use `plot_compare_coeff()` or
`plot_compare_freq()` for quick diagnostics.

```{r, cache=TRUE, eval=LOCAL}
compare_table(single$table, freq)
```

## Interval responses

If your outcome is interval-censored, run the `sb_beta_interval()` convenience
wrapper. It enables the interval sampling logic inside `sb_beta()` while keeping
the same output format and attributes.


```{r, cache=TRUE, eval=LOCAL}
y_low <- pmax(sim$Y - 0.05, 0)
y_high <- pmin(sim$Y + 0.05, 1)
interval_fit <- sb_beta_interval(
  sim$X, y_low, y_high,
  B = 30, sample = "uniform", seed = 321
)
attr(interval_fit, "interval")
```

The resulting stability matrix can be summarised and visualised exactly like the
point-response output shown earlier.
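
To build intuition for what `sample = "uniform"` could mean, the following
base-R sketch draws one pseudo-response per observation uniformly inside its
interval and repeats the draw a few times. It is only an illustration under
that assumption: the helper `draw_uniform()` and the toy bounds `lower` and
`upper` are hypothetical names, and the package's internal sampling code may
differ.

```r
# Illustrative sketch in base R; NOT the internal code of SelectBoost.beta.
set.seed(321)

# Made-up interval bounds on the unit interval (hypothetical toy data).
lower <- c(0.00, 0.20, 0.55, 0.90)
upper <- c(0.10, 0.30, 0.65, 1.00)

# Hypothetical helper: one uniform draw per observation within [low, high].
draw_uniform <- function(low, high) {
  stopifnot(length(low) == length(high), all(low <= high))
  runif(length(low), min = low, max = high)
}

# Five replicate response vectors, stored as the columns of a 4 x 5 matrix.
y_draws <- replicate(5, draw_uniform(lower, upper))

# Every draw stays inside its own interval (bounds recycle down each column).
all(y_draws >= lower & y_draws <= upper)
```

Each column of `y_draws` plays the role of one plausible point response. Under
the assumption above, a selector would be refitted on each such column and the
selections tallied into frequencies.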