--- title: "Introduction to betaregscale" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to betaregscale} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 10, fig.height = 5, fig.align = "center", warning = FALSE, message = FALSE, digits = 4 ) kbl10 <- function(x, n = 10, digits = 4, ...) { knitr::kable(utils::head(as.data.frame(x), n), digits = digits, align = "c", ...) } ``` ## Overview The **betaregscale** package provides maximum-likelihood estimation of beta regression models for responses derived from bounded rating scales. Common examples include pain intensity scales (NRS-11, NRS-21, NRS-101), Likert-type scales, product quality ratings, and any instrument whose response can be mapped to the open interval $(0,1)$. The key idea is that a discrete score recorded on a bounded scale carries measurement uncertainty inherent to the instrument. For instance, a pain score of $y=6$ on a 0–10 NRS is not an exact value but rather represents a range: after rescaling to $(0,1)$, the observation is treated as interval-censored in $[0.55,0.65]$. The package uses the beta distribution to model such data, building a complete likelihood that supports mixed censoring types within the same dataset. ## Installation ```{r install, eval = FALSE} # Development version from GitHub: # install.packages("remotes") remotes::install_github("evandeilton/betaregscale") ``` ```{r library} library(betaregscale) ``` ## Censoring types The complete likelihood supports four censoring types, automatically classified by `brs_check()`: | $\delta$ | Type | Likelihood contribution | |:--------:|:-----|:-----------------------| | 0 | Exact (uncensored) | $f(y_i;a_i,b_i)$ | | 1 | Left-censored ($y=0$) | $F(u_i;a_i,b_i)$ | | 2 | Right-censored ($y=K$) | $1-F(l_i;a_i,b_i)$ | | 3 | Interval-censored | $F(u_i;a_i,b_i)-F(l_i;a_i,b_i)$ | where $f(\cdot)$ and $F(\cdot)$ are the beta density and CDF, $[l_i,u_i]$ are the interval endpoints, and $(a_i,b_i)$ are the beta shape parameters derived from $\mu_i$ and $\phi_i$ via the chosen reparameterization. ## Interval construction Scale observations are mapped to $(0,1)$ with midpoint uncertainty intervals: $$y_t=y/K,\quad\text{interval }[y_t-h/K,y_t+h/K]$$ where $K$ is the number of scale categories (`ncuts`) and $h$ is the half-width (`lim`, default **0.5**). ```{r check-response} # Illustrate brs_check with a 0-10 NRS scale y_example <- c(0, 3, 5, 7, 10) cr <- brs_check(y_example, ncuts = 10) kbl10(cr) ``` The `delta` column shows that $y=0$ is left-censored ($\delta=1$), $y=10$ is right-censored ($\delta=2$), and all interior values are interval-censored ($\delta=3$). ## Data preparation with `brs_prep()` In practice, analysts may want to supply their own censoring indicators or interval endpoints rather than relying on the automatic classification of `brs_check()`. The `brs_prep()` function provides a flexible, validated bridge between raw analyst data and `brs()`. It supports four input modes: ### Mode 1: Score only (automatic) ```{r bs-prepare-mode1} # Equivalent to brs_check - delta inferred from y d1 <- data.frame(y = c(0, 3, 5, 7, 10), x1 = rnorm(5)) kbl10(brs_prep(d1, ncuts = 10)) ``` ### Mode 2: Score + explicit censoring indicator ```{r bs-prepare-mode2} # Analyst specifies delta directly d2 <- data.frame( y = c(50, 0, 99, 50), delta = c(0, 1, 2, 3), x1 = rnorm(4) ) kbl10(brs_prep(d2, ncuts = 100)) ``` ### Mode 3: Interval endpoints with NA patterns When the analyst provides `left` and/or `right` columns, censoring is inferred from the NA pattern: ```{r bs-prepare-mode3} d3 <- data.frame( left = c(NA, 20, 30, NA), right = c(5, NA, 45, NA), y = c(NA, NA, NA, 50), x1 = rnorm(4) ) kbl10(brs_prep(d3, ncuts = 100)) ``` ### Mode 4: Analyst-supplied intervals When the analyst provides `y`, `left`, and `right` simultaneously, their endpoints are used directly (rescaled by $K$): ```{r bs-prepare-mode4} d4 <- data.frame( y = c(50, 75), left = c(48, 73), right = c(52, 77), x1 = rnorm(2) ) kbl10(brs_prep(d4, ncuts = 100)) ``` ### Using brs_prep with `brs()` Data processed by `brs_prep()` is automatically detected by `brs()` - the internal `brs_check()` step is skipped: ```{r prepare-workflow} set.seed(42) n <- 1000 dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n)) sim <- brs_sim( formula = ~ x1 + x2, data = dat, beta = c(0.2, -0.5, 0.3), phi = 0.3, link = "logit", link_phi = "logit", repar = 2 ) # Remove left, right, yt so brs_prep can rebuild them prep <- brs_prep(sim[-c(1:3)], ncuts = 100) fit_prep <- brs(y ~ x1 + x2, data = prep, repar = 2, link = "logit", link_phi = "logit" ) summary(fit_prep, digits = 4) ``` ## Example 1: Fixed dispersion model ### Simulating data We simulate observations from a beta regression model with fixed dispersion, two covariates, and a logit link for the mean. ```{r sim-fixed} set.seed(4255) n <- 1000 dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n)) sim_fixed <- brs_sim( formula = ~ x1 + x2, data = dat, beta = c(0.3, -0.6, 0.4), phi = 1 / 10, link = "logit", link_phi = "logit", ncuts = 100, repar = 2 ) kbl10(head(sim_fixed, 8)) ``` Each observation is centered in its interval. For example, a score of 67 on a 0–100 scale yields $y_t=0.67$ with the interval $[0.665,0.675]$. ### Fitting the model ```{r fit-fixed} fit_fixed <- brs( y ~ x1 + x2, data = sim_fixed, link = "logit", link_phi = "logit", repar = 2 ) summary(fit_fixed) ``` The summary output follows the `betareg` package style, showing separate coefficient tables for the mean and precision submodels, with Wald z-tests and $p$-values based on the standard normal distribution. ### Goodness of fit ```{r gof-fixed} kbl10(brs_gof(fit_fixed)) ``` ### Comparing link functions The package supports several link functions for the mean submodel. We can compare them using information criteria: ```{r compare-links} links <- c("logit", "probit", "cauchit", "cloglog") fits <- lapply(setNames(links, links), function(lnk) { brs(y ~ x1 + x2, data = sim_fixed, link = lnk, repar = 2) }) # Estimates est_table <- do.call(rbind, lapply(names(fits), function(lnk) { e <- brs_est(fits[[lnk]]) e$link <- lnk e })) kbl10(est_table) # Goodness of fit gof_table <- do.call(rbind, lapply(fits, brs_gof)) kbl10(gof_table) ``` ### Residual diagnostics The `plot()` method provides six diagnostic panels. By default, the first four are shown: ```{r plot-fixed, fig.height = 6} plot(fit_fixed, caption = NULL, gg = TRUE, title = NULL, theme = ggplot2::theme_bw()) ``` For ggplot2 output (requires the **ggplot2** package): ```{r plot-fixed-gg, eval = requireNamespace("ggplot2", quietly = TRUE), fig.height = 6} plot(fit_fixed, gg = TRUE) ``` ### Predictions ```{r predict-fixed} # Fitted means kbl10( data.frame(mu_hat = head(predict(fit_fixed, type = "response"))), digits = 4 ) # Conditional variance kbl10( data.frame(var_hat = head(predict(fit_fixed, type = "variance"))), digits = 4 ) # Quantile predictions kbl10(predict(fit_fixed, type = "quantile", at = c(0.10, 0.25, 0.5, 0.75, 0.90))) ``` ### Confidence intervals Wald confidence intervals based on the asymptotic normal approximation: ```{r confint-fixed} kbl10(confint(fit_fixed)) kbl10(confint(fit_fixed, model = "mean")) ``` ### Censoring structure The `brs_cens()` function provides a visual and tabular overview of the censoring types in the fitted model: ```{r censoring-summary, fig.height = 5} brs_cens(fit_fixed, gg = TRUE, inform = TRUE) ``` ## Example 2: Variable dispersion model In many applications, the dispersion parameter $\phi$ may depend on covariates. The package supports variable-dispersion models using the `Formula` package notation: `y ~ x1 + x2 | z1 + z2`, where the terms after `|` define the linear predictor for $\phi$. The same `brs_sim()` function is used for fixed and variable dispersion; the second formula part activates the precision submodel in simulation. ### Simulating data ```{r sim-variable} set.seed(2222) n <- 1000 dat_z <- data.frame( x1 = rnorm(n), x2 = rnorm(n), x3 = rbinom(n, size = 1, prob = 0.5), z1 = rnorm(n), z2 = rnorm(n) ) sim_var <- brs_sim( formula = ~ x1 + x2 + x3 | z1 + z2, data = dat_z, beta = c(0.2, -0.6, 0.2, 0.2), zeta = c(0.2, -0.8, 0.6), link = "logit", link_phi = "logit", ncuts = 100, repar = 2 ) kbl10(head(sim_var, 10)) ``` ### Fitting the model ```{r fit-variable} fit_var <- brs( y ~ x1 + x2 | z1, data = sim_var, link = "logit", link_phi = "logit", repar = 2 ) summary(fit_var) ``` Notice the `(phi)_` prefix in the precision coefficient names, following the `betareg` convention. ### Accessing coefficients by submodel ```{r coef-variable} # Full parameter vector round(coef(fit_var), 4) # Mean submodel only round(coef(fit_var, model = "mean"), 4) # Precision submodel only round(coef(fit_var, model = "precision"), 4) # Variance-covariance matrix for the mean submodel kbl10(vcov(fit_var, model = "mean")) ``` ### Comparing link functions (variable dispersion) ```{r compare-links-var} links <- c("logit", "probit", "cauchit", "cloglog") fits_var <- lapply(setNames(links, links), function(lnk) { brs(y ~ x1 + x2 | z1, data = sim_var, link = lnk, repar = 2) }) # Estimates est_var <- do.call(rbind, lapply(names(fits_var), function(lnk) { e <- brs_est(fits_var[[lnk]]) e$link <- lnk e })) kbl10(est_var) # Goodness of fit gof_var <- do.call(rbind, lapply(fits_var, brs_gof)) kbl10(gof_var) ``` ### Diagnostics for variable dispersion ```{r plot-variable, fig.height = 6} plot(fit_var) ``` ## Advanced analyst functions The package includes analyst-facing helpers for uncertainty quantification, effect interpretation, score-scale communication, and predictive validation. ### Parametric bootstrap confidence intervals ```{r bootstrap-fixed} set.seed(101) boot_ci <- brs_bootstrap( fit_fixed, R = 100, level = 0.95, ci_type = "bca", keep_draws = TRUE ) kbl10(head(boot_ci, 10)) ``` ```{r bootstrap-forest, eval = requireNamespace("ggplot2", quietly = TRUE)} autoplot.brs_bootstrap( boot_ci, type = "ci_forest", title = "Bootstrap (BCa) vs Wald intervals" ) ``` ### Average marginal effects (AME) ```{r marginaleffects-fixed} set.seed(202) ame <- brs_marginaleffects( fit_fixed, model = "mean", type = "response", interval = TRUE, n_sim = 120, keep_draws = TRUE ) kbl10(ame) ``` ```{r marginaleffects-plot, eval = requireNamespace("ggplot2", quietly = TRUE)} autoplot.brs_marginaleffects(ame, type = "forest") ``` ### Score probabilities on the original scale ```{r scoreprob-fixed} prob_scores <- brs_predict_scoreprob(fit_fixed, scores = 0:10) kbl10(prob_scores[1:6, 1:6]) kbl10( data.frame(row_sum = rowSums(prob_scores)[1:6]), digits = 4 ) ``` ### Repeated k-fold cross-validation ```{r cv-fixed} set.seed(303) # For cross-validation reproducibility cv_res <- brs_cv( y ~ x1 + x2, data = sim_fixed, k = 5, repeats = 5, repar = 2, ) kbl10(cv_res) kbl10( data.frame( metric = c("log_score", "rmse_yt", "mae_yt"), mean = c( mean(cv_res$log_score, na.rm = TRUE), mean(cv_res$rmse_yt, na.rm = TRUE), mean(cv_res$mae_yt, na.rm = TRUE) ) ), digits = 4 ) ``` ## S3 methods reference The following standard S3 methods are available for objects of class `"brs"`: | Method | Description | |:-------|:-----------| | `print()` | Compact display of call and coefficients | | `summary()` | Detailed output with Wald tests and goodness-of-fit | | `coef(model=)` | Extract coefficients (full, mean, or precision) | | `vcov(model=)` | Variance-covariance matrix (full, mean, or precision) | | `confint(model=)` | Wald confidence intervals | | `logLik()` | Log-likelihood value | | `AIC()`, `BIC()` | Information criteria | | `nobs()` | Number of observations | | `formula()` | Model formula | | `model.matrix(model=)` | Design matrix (mean or precision) | | `fitted()` | Fitted mean values | | `residuals(type=)` | Residuals: response, pearson, rqr, weighted, sweighted | | `predict(type=)` | Predictions: response, link, precision, variance, quantile | | `plot(gg=)` | Diagnostic plots (base R or ggplot2) | ## Reparameterizations The package supports three reparameterizations of the beta distribution, controlled by the `repar` argument: **Direct (`repar = 0`):** Shape parameters $a=\mu$ and $b=\phi$ are used directly. This is rarely used in practice. **Precision (`repar = 1`, Ferrari & Cribari-Neto, 2004):** The mean $\mu\in(0,1)$ and precision $\phi>0$ yield $a=\mu\phi$ and $b=(1-\mu)\phi$. Higher $\phi$ means less variability. **Mean–variance (`repar = 2`):** The mean $\mu\in(0,1)$ and dispersion $\phi\in(0,1)$ yield $a=\mu(1-\phi)/\phi$ and $b=(1-\mu)(1-\phi)/\phi$. Here $\phi$ acts as a coefficient of variation: smaller $\phi$ means less variability. ```{r reparam} # Precision parameterization: mu = 0.5, phi = 10 (high precision) brs_repar(mu = 0.5, phi = 10, repar = 1) # Mean-variance parameterization: mu = 0.5, phi = 0.1 (low dispersion) brs_repar(mu = 0.5, phi = 0.1, repar = 2) ``` ## References - Ferrari, S. L. P., and Cribari-Neto, F. (2004). Beta regression for modelling rates and proportions. *Journal of Applied Statistics*, **31**(7), 799–815. DOI: 10.1080/0266476042000214501. Validated online via: . - Smithson, M., and Verkuilen, J. (2006). A better lemon squeezer? Maximum-likelihood regression with beta-distributed dependent variables. *Psychological Methods*, **11**(1), 54–71. DOI: 10.1037/1082-989X.11.1.54. Validated online via: . - Hawker, G. A., Mian, S., Kendzerska, T., and French, M. (2011). Measures of adult pain: Visual Analog Scale for Pain (VAS Pain), Numeric Rating Scale for Pain (NRS Pain), McGill Pain Questionnaire (MPQ), Short-Form McGill Pain Questionnaire (SF-MPQ), Chronic Pain Grade Scale (CPGS), Short Form-36 Bodily Pain Scale (SF-36 BPS), and Measure of Intermittent and Constant Osteoarthritis Pain (ICOAP). *Arthritis Care and Research*, **63**(S11), S240–S252. DOI: 10.1002/acr.20543. Validated online via: . - Hjermstad, M. J., Fayers, P. M., Haugen, D. F., et al. (2011). Studies comparing numerical rating scales, verbal rating scales, and visual analogue scales for assessment of pain intensity in adults: a systematic literature review. *Journal of Pain and Symptom Management*, **41**(6), 1073–1093. DOI: 10.1016/j.jpainsymman.2010.08.016. Validated online via: .