---
title: "Introduction to ggInterval"
subtitle: "Visualizing Interval-Valued Data using ggplot2"
author: "Bo-Syue Jiang and Han-Ming Wu"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_document:
    toc: true
    toc_depth: 3
    number_sections: true
vignette: >
  %\VignetteIndexEntry{Introduction to ggInterval}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, include = FALSE}
# Chunk defaults for the whole vignette, set in a single call
knitr::opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  message = FALSE,
  out.width = "100%",
  fig.align = "center"
)
library(knitr)
library(ggInterval)
library(RSDA)
```

# Introduction

Exploratory data analysis (EDA) relies on graphical summaries---boxplots,
histograms, scatter plots---to reveal a dataset's salient features before
formal modeling. Yet modern data are increasingly recorded not as single
scalar values but as *intervals*, *histograms*, or full empirical
distributions. These richer objects are collectively known as **symbolic
data** (Billard and Diday, 2006). For example, when individual-level
measurements are aggregated by group, each variable naturally becomes an
interval $[a, b]$ rather than a point value.

Conventional R graphics cannot natively accommodate interval-valued
observations. **ggInterval** (formerly **ggESDA**) bridges this gap by
extending the **ggplot2** framework to visualize interval-valued symbolic
data. The package provides a family of plot functions with a uniform
interface:

```r
ggInterval_<GRAPH_TYPE>(data, mapping = aes(...), ...)
```

where `data` is a symbolic data object and `mapping` uses the standard
`ggplot2::aes()` syntax. Because most plot functions return `ggplot2`
objects, users can freely add themes, scales, labels, and additional layers.

# Data Preparation

## Built-in datasets

The package ships with several symbolic datasets. The most commonly used
are:

- **`facedata`** -- 27 face images (9 subjects $\times$ 3 replicates) with 6
  interval-valued facial measurements (AD, BC, AH, GH, EH, BG).
- **`Environment`** -- 14 cities described by 17 variables, including both
  interval-valued and modal multi-valued variables.
- **`oils`** -- 8 types of oils with 4 interval-valued chemical properties.

```{r data-facedata}
data(facedata)
facedata
```

```{r data-summary}
summary(facedata)
```

## Converting classical data with `classic2sym`

Classical (scalar) data can be converted to symbolic interval data by
aggregating within groups. The `classic2sym()` function supports several
grouping strategies:

```{r classic2sym-species}
myIris <- classic2sym(iris, groupby = "Species")
myIris$intervalData
```

The `groupby` argument accepts:

- A column name (factor variable) in the data, e.g. `"Species"`.
- `"kmeans"` or `"hclust"` for unsupervised clustering (with `k` groups).
- `"customize"` for user-supplied minimum and maximum data frames.

```{r classic2sym-kmeans}
myIris_km <- classic2sym(iris, groupby = "kmeans", k = 5)
myIris_km$intervalData
```
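
Conceptually, grouping by a factor reduces each numeric variable to one
interval per group. The following base-R sketch illustrates the idea, assuming
each interval is simply the within-group $[\min, \max]$. It is not the
package's implementation, and the object names (`lower`, `upper`,
`intervals`) are chosen here for illustration only.

```r
# Assumption: an interval is the [min, max] of a variable within each group.
num_vars <- names(iris)[sapply(iris, is.numeric)]

# Group-wise lower and upper endpoints for every numeric variable
lower <- aggregate(iris[num_vars], by = list(Species = iris$Species), FUN = min)
upper <- aggregate(iris[num_vars], by = list(Species = iris$Species), FUN = max)

# Show the resulting intervals for one variable
intervals <- data.frame(
  Species          = lower$Species,
  Sepal.Length.min = lower$Sepal.Length,
  Sepal.Length.max = upper$Sepal.Length
)
intervals
```

Each row of `intervals` is one symbolic observation: a species described by a
range of sepal lengths rather than by a single value.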

## Converting RSDA objects with `RSDA2sym`

If you already have an RSDA `symbolic_tbl` object, wrap it with `RSDA2sym()`
so it can be used with all ggInterval plot functions:

```{r rsda2sym, eval = FALSE}
mySym <- RSDA2sym(Cardiological)
mySym$intervalData
```

# Descriptive Statistics

ggInterval provides S3 methods for common statistical summaries on symbolic
interval data.

## Summary statistics

`summary()` reports the minimum, quartiles, median, mean, maximum, and
standard deviation for each interval-valued variable:

```{r stat-summary}
summary(facedata)
```

## Correlation and covariance

`cor()` and `cov()` compute association matrices. Several methods are
available for interval data, including `"centers"`, `"B"` (Billard),
`"BD"` (Billard--Diday), and `"BG"` (Bertrand--Goupil):

```{r stat-cor}
cor(facedata)
```

```{r stat-cov}
cov(facedata)
```
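
As a rough illustration of the simplest option, the `"centers"` approach can
be thought of as replacing each interval by its midpoint and then applying the
classical estimator. The sketch below does this in base R for `iris`
aggregated by species. The $[\min, \max]$ aggregation and all object names are
assumptions made for this example, and the package's own computation may
differ in detail.

```r
# Build [min, max] intervals per species (assumed aggregation rule)
num_vars <- names(iris)[sapply(iris, is.numeric)]
lower <- aggregate(iris[num_vars], by = list(Species = iris$Species), FUN = min)
upper <- aggregate(iris[num_vars], by = list(Species = iris$Species), FUN = max)

# Midpoint of every interval: (min + max) / 2
centers <- (lower[num_vars] + upper[num_vars]) / 2
rownames(centers) <- as.character(lower$Species)

# Classical correlation and covariance computed on the midpoints
cor_centers <- cor(centers)
cov_centers <- cov(centers)
round(cor_centers, 3)
```

Midpoint-based estimates ignore the variation inside each interval, which is
the motivation for the other methods listed above.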

## Standardization

`scale()` standardizes symbolic interval data (centering and scaling),
which can be useful before multivariate analyses:

```{r stat-scale}
facedata_scaled <- scale(facedata)
facedata_scaled
```

# Univariate Plots

## Index plot

`ggInterval_indexplot()` displays the interval range of each observation as
a vertical bar. This is useful for spotting outliers and comparing spreads
across observations.

```{r indexplot, fig.width = 7, fig.height = 4}
ggInterval_indexplot(facedata, aes(x = AD))
```

## Index image

`ggInterval_indexImage()` replaces the margin bars of the index plot with a
color-coded strip. The `column_condition` parameter controls whether colors
represent column-wise or matrix-wise conditions, and `full_strip` expands
the color strip to the full figure width.

```{r indeximage-col, fig.width = 7, fig.height = 4}
ggInterval_indexImage(facedata, aes(AD),
                      column_condition = TRUE, full_strip = FALSE)
```

```{r indeximage-full, fig.width = 7, fig.height = 4}
ggInterval_indexImage(facedata, aes(AD),
                      column_condition = TRUE, full_strip = TRUE) +
  coord_flip()
```

## Boxplot

`ggInterval_boxplot()` draws an interval-valued box plot for a variable, in
which nested rectangles summarize the distribution of the interval endpoints
across observations. Use `plotAll = TRUE` to display all variables side by
side.

```{r boxplot-single, fig.width = 7, fig.height = 4}
ggInterval_boxplot(facedata, aes(AD))
```

```{r boxplot-all, fig.width = 7, fig.height = 5}
ggInterval_boxplot(facedata, plotAll = TRUE)
```

## Histogram

`ggInterval_hist()` constructs a histogram from interval-valued data. Two
binning strategies are supported:

- `method = "equal-bin"` (default): bins of equal width.
- `method = "unequal-bin"`: bin boundaries depend on the data distribution.

Note that `ggInterval_hist()` returns a list; use `$plot` to extract the
`ggplot2` object.

```{r hist-equal, fig.width = 7, fig.height = 4}
ggInterval_hist(facedata, aes(x = AD), bins = 10,
                method = "equal-bin")$plot
```

```{r hist-unequal, fig.width = 7, fig.height = 4}
ggInterval_hist(facedata, aes(x = AD),
                method = "unequal-bin")$plot
```
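
To make the idea of binning intervals concrete, here is a minimal base-R
sketch of an equal-bin interval histogram. It assumes that each observation is
spread uniformly over its interval, so that it contributes to a bin in
proportion to the overlap between the interval and the bin. The function
`interval_hist()` and the toy values are hypothetical, intervals of zero width
are not handled, and the package's exact algorithm may differ.

```r
# Relative frequency of each bin for interval data [lower, upper].
# Assumption: mass is uniform within each interval (upper > lower).
interval_hist <- function(lower, upper, breaks) {
  n_bins <- length(breaks) - 1
  freq <- numeric(n_bins)
  for (i in seq_along(lower)) {
    width <- upper[i] - lower[i]
    for (b in seq_len(n_bins)) {
      # Length of the overlap between interval i and bin b
      overlap <- max(0, min(upper[i], breaks[b + 1]) - max(lower[i], breaks[b]))
      freq[b] <- freq[b] + overlap / width
    }
  }
  freq / length(lower)
}

# Made-up illustrative intervals: [0, 2], [1, 3], [2, 4]
lower  <- c(0, 1, 2)
upper  <- c(2, 3, 4)
breaks <- 0:4                      # four bins of equal width

freq <- interval_hist(lower, upper, breaks)
freq
```

Because every interval distributes a total weight of one across the bins, the
relative frequencies sum to one when the bins cover all intervals.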

## Min-max plot

`ggInterval_MMplot()` marks the minimum and maximum endpoints of each
observation's interval, connected by a line segment. This makes it easy to
compare ranges across observations.

```{r mmplot, fig.width = 7, fig.height = 4}
ggInterval_MMplot(facedata, aes(AD))
```

Use `plotAll = TRUE` to display all variables together:

```{r mmplot-all, fig.width = 7, fig.height = 5}
ggInterval_MMplot(facedata, plotAll = TRUE)
```

## Center-range plot

`ggInterval_CRplot()` plots each observation as a point in a two-dimensional
space where the x-axis is the center (midpoint) of the interval and the
y-axis is the range (spread).

```{r crplot, fig.width = 7, fig.height = 4}
ggInterval_CRplot(facedata, aes(AD))
```

```{r crplot-all, fig.width = 7, fig.height = 5}
ggInterval_CRplot(facedata, plotAll = TRUE)
```
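
The two coordinates of a center-range plot are simple functions of the
interval endpoints: for an interval $[a, b]$, the center is $(a + b)/2$ and
the range is $b - a$. A small base-R sketch with made-up values (the vectors
`a` and `b` are illustrative, not taken from `facedata`):

```r
# Made-up illustrative intervals [a, b] for four observations
a <- c(1.0, 2.5, 4.0, 3.0)
b <- c(3.0, 3.5, 9.0, 3.0)

cr <- data.frame(
  center = (a + b) / 2,  # midpoint: x-axis of a center-range plot
  range  = b - a         # width: y-axis of a center-range plot
)
cr
```

An observation with a range of zero, such as the last one here, is an ordinary
point value and sits on the horizontal axis.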

# Bivariate Plots

## Scatter plot

`ggInterval_scatterplot()` visualizes two interval-valued variables
simultaneously. Each observation is drawn as a rectangle whose width and
height represent the intervals on the x- and y-axes, respectively.

```{r scatterplot, fig.width = 7, fig.height = 5}
ggInterval_scatterplot(facedata, aes(x = AD, y = BC))
```
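
The rectangle representation can be reproduced by hand with plain
**ggplot2**, which may help when a custom layer is needed. The sketch below
uses `geom_rect()` on a small data frame of made-up interval endpoints; the
column names and values are assumptions for illustration, and the output is
not meant to match the styling of `ggInterval_scatterplot()`.

```r
library(ggplot2)

# Made-up illustrative intervals: one rectangle per observation
rects <- data.frame(
  id   = c("A", "B", "C"),
  xmin = c(1, 2, 4),    xmax = c(2, 4, 5),     # interval on the x variable
  ymin = c(10, 12, 11), ymax = c(13, 15, 12)   # interval on the y variable
)

p <- ggplot(rects) +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = id),
            alpha = 0.4, colour = "black") +
  labs(x = "X", y = "Y")
p
```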

## 2D histogram

`ggInterval_2Dhist()` partitions the bivariate domain into a grid and counts
how many interval observations overlap each cell. The `xBins` and `yBins`
parameters control the grid resolution.

```{r hist2d, fig.width = 7, fig.height = 5}
ggInterval_2Dhist(facedata, aes(x = AD, y = BC), xBins = 10, yBins = 10)
```

Here is the same plot for the `oils` dataset:

```{r hist2d-oils, fig.width = 7, fig.height = 5}
data(oils)
ggInterval_2Dhist(oils, aes(x = GRA, y = FRE), xBins = 5, yBins = 5)
```

# Multivariate Plots

## Scatter plot matrix

`ggInterval_scatterMatrix()` produces a pairwise scatter plot matrix for all
continuous interval variables in the dataset. Note that this function returns
a `marrangeGrob` object (from **gridExtra**), not a `ggplot2` object.

```{r scattermatrix, fig.width = 8, fig.height = 8}
ggInterval_scatterMatrix(facedata[, 1:3])
```

## 2D histogram matrix

`ggInterval_2DhistMatrix()` is the matrix analogue of `ggInterval_2Dhist()`,
showing 2D histograms for all variable pairs.

```{r histmatrix, fig.width = 8, fig.height = 8}
ggInterval_2DhistMatrix(oils, xBins = 5, yBins = 5)
```

## Index image heatmap

When `plotAll = TRUE`, `ggInterval_indexImage()` produces a heatmap-style
visualization across all variables, providing an overview of the entire
dataset.

```{r indeximage-heatmap, fig.width = 8, fig.height = 5}
ggInterval_indexImage(facedata, plotAll = TRUE)
```

## Radar plot

`ggInterval_radarplot()` displays multiple interval-valued variables on
radial axes. Each observation is represented by a polygon (or rectangle)
whose extent along each axis shows the interval range. The `plotPartial`
argument selects which observations to display.

```{r radar-polygon, fig.width = 7, fig.height = 6}
data(Environment)
ggInterval_radarplot(Environment[, 5:17],
                     plotPartial = 2,
                     showLegend = FALSE,
                     base_circle = TRUE,
                     base_lty = 2,
                     addText = FALSE) +
  labs(title = "Environment: radar plot (default)")
```

The `type = "rect"` variant draws rectangles instead of polygons:

```{r radar-rect, fig.width = 7, fig.height = 6}
ggInterval_radarplot(Environment[, 5:17],
                     plotPartial = 2,
                     type = "rect",
                     showLegend = FALSE,
                     base_circle = TRUE,
                     addText = FALSE) +
  labs(title = "Environment: radar plot (rect)")
```

## 3D scatter plot

`ggInterval_3Dscatterplot()` visualizes three interval-valued variables,
rendering each observation as a cube-like shape projected into two
dimensions.

```{r scatter3d, fig.width = 7, fig.height = 6}
ggInterval_3Dscatterplot(facedata[1:5, ], aes(x = BC, y = EH, z = GH))
```

# Principal Component Analysis

`ggInterval_PCA()` performs vertices-based PCA on interval-valued data. Each
interval observation is expanded to its vertices (all $2^p$ corner
combinations), PCA is applied, and the results are projected back to
interval form.

```{r pca, fig.width = 7, fig.height = 5}
pca_result <- ggInterval_PCA(facedata, plot = FALSE)
pca_result$ggplotPCA
```
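
The vertex expansion described above can be sketched in base R as follows.
This is a minimal illustration of the general vertices approach on made-up
intervals, not the code used by `ggInterval_PCA()`. The helper
`vertices_of()` and the choice to summarize each observation by the minimum
and maximum of its vertex scores are assumptions made for this example.

```r
# Made-up illustrative intervals: 3 observations, p = 2 variables
lower <- matrix(c(1, 2, 4, 10, 12, 11), ncol = 2,
                dimnames = list(NULL, c("X1", "X2")))
upper <- matrix(c(2, 4, 5, 13, 15, 12), ncol = 2,
                dimnames = list(NULL, c("X1", "X2")))

# All 2^p corner combinations of one observation's hyper-rectangle
vertices_of <- function(lo, hi) {
  as.matrix(expand.grid(Map(c, lo, hi)))
}

# Stack the vertices of every observation and remember their owner
all_vertices <- do.call(rbind, lapply(seq_len(nrow(lower)), function(i) {
  vertices_of(lower[i, ], upper[i, ])
}))
obs_id <- rep(seq_len(nrow(lower)), each = 2^ncol(lower))

# Classical PCA on the stacked vertices
pca <- prcomp(all_vertices, center = TRUE, scale. = TRUE)

# Back to interval form: range of the PC1 scores within each observation
pc1_interval <- t(sapply(split(pca$x[, 1], obs_id), range))
colnames(pc1_interval) <- c("PC1.min", "PC1.max")
pc1_interval
```

The number of vertices grows as $n \times 2^p$, so this direct expansion is
only practical for a moderate number of variables.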

Setting `poly = TRUE` adds a convex-hull polygon connecting the projected
vertices for each observation:

```{r pca-poly, fig.width = 7, fig.height = 5}
pca_poly <- ggInterval_PCA(facedata, poly = TRUE, plot = FALSE)
pca_poly$ggplotPCA
```

PCA can also be applied to classical data once it has been converted with
`classic2sym()`:

```{r pca-iris, fig.width = 7, fig.height = 5}
myIris <- classic2sym(iris, groupby = "Species")
pca_iris <- ggInterval_PCA(myIris, plot = FALSE)
pca_iris$ggplotPCA
```

# Working with ggplot2

Because most ggInterval functions return standard `ggplot2` objects, you can
customize plots with the full range of ggplot2 features.

**Themes and labels:**

```{r ggplot2-theme, fig.width = 7, fig.height = 4}
ggInterval_indexplot(facedata, aes(x = AD)) +
  theme_minimal() +
  labs(title = "Index plot of AD", x = "Observation", y = "AD")
```

**Custom color scales:**

```{r ggplot2-scale, fig.width = 7, fig.height = 4}
p <- ggInterval_hist(facedata, aes(x = AD), bins = 10,
                     method = "equal-bin")$plot
p + scale_fill_manual(values = rainbow(10))
```

**Adding reference lines:**

```{r ggplot2-ref, fig.width = 7, fig.height = 4}
ggInterval_CRplot(facedata, aes(AD)) +
  geom_hline(yintercept = 5, linetype = "dashed", color = "red")
```

Note that `ggInterval_scatterMatrix()` returns a `marrangeGrob` object, so
the ggplot2 `+` operator cannot be applied to it directly.

# References

- Billard, L. and Diday, E. (2006). *Symbolic Data Analysis: Conceptual
  Statistics and Data Mining*. Wiley, Chichester.

- Jiang, B.S. and Wu, H.M. (2025). *ggInterval: An R Package for Visualizing
  Interval-Valued Data Using ggplot2*. R package version 0.2.3,
  <https://CRAN.R-project.org/package=ggInterval>.