Most of the forest plotting functions in R are built to be
general(ish), but they do not anticipate wanting to plot two different
confidence intervals for each estimate. Some of them, like
forestploter::forest and
forestplot::forestplot can be made to do this, but it takes
a bit of hacking to get there. In an effort to make it easier to plot
forests with our two sets of confidence intervals, we wrote a function
called gg_forest() which constucts a table panel and a plot
panel and uses patchwork to put them together in a coherent
fashion. Many features of the underlying machinery are exposed in
gg_forest, but the function itself relies on some
geom_ and scale_ functions that are part of
VizTest. This vignette will show some basic use cases of
these tools.
The input data for gg_forest (and the underlying
geom_ functions) should be organized in a data frame. It
should have columns for estimates, standard lower and upper confidence
intervals (e.g., 95 percent) and optionally inferential confidence
intervals. While the inferential confidence intervals are optional, we
built these functions to work with those intervals alongside the
standard intervals. If you want a standard forest plot with only one set
of intervals, it is likely that other packages, like those mentioned
above, may provide more customization options. In addition to the
estimates and intervals, if a summary row (or rows) exists, then there
should be a logical variable indicating which rows are the summary -
generally you’ll want to print those at the bottom. The data frame
should also have a variable that indicates the character expansion for
the point sizes. We will show a couple of examples with the built-in
esoph data from the datasets package. We will
use the emmeans package to make predictions from the model
to plot.
First, we can load the data, we will change age into an
unordered factor and then run two different models one with age as a
covariate and a null model to generate the overall summary.
library(emmeans)
library(VizTest)
library(dplyr)
data(esoph)
esoph$age <- factor(as.character(esoph$age),
levels=c("25-34", "35-44", "45-54", "55-64", "65-74", "75+"))
model1 <- glm(cbind(ncases, ncontrols) ~ age,
data = esoph, family = binomial())
model_sum <- glm(cbind(ncases, ncontrols) ~ 1,
data = esoph, family = binomial())We can use emmeans to generate predictions and
confidence intervals for each age group and save them in the data frame
called fit.
em <- emmeans(model1, "age")
fit <- print(em)
#> age emmean SE df asymp.LCL asymp.UCL
#> 25-34 -4.745 1.000 Inf -6.713 -2.776
#> 35-44 -3.050 0.341 Inf -3.718 -2.381
#> 45-54 -1.289 0.167 Inf -1.616 -0.963
#> 55-64 -0.781 0.138 Inf -1.053 -0.510
#> 65-74 -0.656 0.166 Inf -0.982 -0.330
#> 75+ -0.869 0.330 Inf -1.517 -0.221
#>
#> Results are given on the logit (not the response) scale.
#> Confidence level used: 0.95To get the inferential confidence intervals, we can execute
viztest() on the emmeans object. We can use
get_viztest_level() to pull the confidence level associated
with the easiest tests to decipher.
vt <- viztest(em, test_level = .05)
vt
#>
#> Correspondents of PW Tests with CI Tests
#> level psame pdiff easy method
#> 1 0.61 1 0.8095238 -0.2390388 Lowest
#> 2 0.75 1 0.8095238 -0.0058850 Middle
#> 3 0.90 1 0.8095238 -0.3696089 Highest
#> 4 0.75 1 0.8095238 -0.0058850 Easiest
#>
#> All 21 tests properly represented for by CI overlaps.
ic_level <- get_viztest_levels(vt, "easiest")Next, we can use confint() to get the standard and
inferential confidence intervals and transform the predictions into the
odds ratio scale.
fit$or <- exp(fit$emmean)
fit$lower <- exp(fit$asymp.LCL)
fit$upper <- exp(fit$asymp.UCL)
ag_data <- aggregate(esoph[,c("ncases", "ncontrols")], list(age = esoph$age), \(x)as.integer(sum(x)))
fit$n_cases <- ag_data$ncases
fit$n_controls <- ag_data$ncontrols
fit_ci <- confint(em, level = ic_level)
fit$lower_ici <- exp(fit_ci$asymp.LCL)
fit$upper_ici <- exp(fit_ci$asymp.UCL)Now, we can do the same for the summary data. Here we simply generate
a prediction for the model with predict() and confidence
intervals using confint(). We do not need to make visual
testing intervals for the summary value.
pred_sum <- predict(model_sum, type="link", se.fit=TRUE)
ci_sum <- confint(model_sum)
fit_sum <- data.frame(
age = "Summary",
emmean = pred_sum$fit[1],
SE = pred_sum$se.fit[1],
asymp.LCL = ci_sum[1],
asymp.UCL = ci_sum[2],
or = exp(pred_sum$fit[1]),
lower = exp(ci_sum[1]),
upper = exp(ci_sum[2]),
n_cases =sum(ag_data$ncases),
n_controls = sum(ag_data$ncontrols),
lower_ici = NA,
upper_ici = NA
)We can then put the data together in the same place.
fit <- rbind(fit[,-4], #remove df column
fit_sum)
# add row_number indicator for later.
fit$row_number <- 1:nrow(fit)Now, there are some other variables we can add since the data are all
in the same place. We need to add a logical flag for the summary row(s),
a point_size scaling variable, which we will set to the inverse of the
sampling variance estimate and a label variable that puts the stimuli in
the right order. Sometimes, we may want to reorder the rows based on the
value of the odds ratio. In those cases, you could use
reorder_forest() for this. It works like
reorder(), but it forces the summary row to be in the right
place. In this case, since the age levels are already ordered in a
useful way, we just want to reverse the values so 25 has the highest
value (it will plot at the top) and Summary has the smallest value (it
will plot at the bottom).
fit$is_summary <- fit$age == "Summary"
fit$point_size <- 1/fit$SE^2
fit$age_label <- factor(fit$age, levels=rev(fit$age))The fit object now has all the relevant data to make a
forest plot.
gg_forest().We first show how to do this with the gg_forest()
wrapper. We will show the function first and then explain its
arguments.
out <- gg_forest(fit,
y = "age_label",
x = "or",
xmin_std = "lower",
xmax_std = "upper",
xmin_ici = "lower_ici",
xmax_ici = "upper_ici",
vline = 1,
size_prop = "point_size",
is_summary = "is_summary",
use_log_scale = TRUE,
data_cols = c("Age" = "age_label", "Controls" = "n_controls", "Cases"="n_cases", "OR" = "or"),
max_size=5,
table_header_size = 16,
table_text_size = 5,
col_nudge=c(-.15, 0,0,0),
diamond_aspect=15,
diamond_row_frac = 2)The first argument is the data. The names of the y-variable (the
labelled indicator for the rows) and the other variables are in quotes.
The _std variables are the standard confidence intervals
and the _ici variables are the inferential confidence
intervals. The size_prop variable is the variable that will
be used to scale the point sizes. The is_summary variable
is a logical variable indicating which rows are summary rows. The
use_log_scale argument indicates whether to use a log scale
for the x-axis, which is common for odds ratios. The
data_cols argument takes a named vector of column names to
be printed in the table panel and the corresponding variables in the
data frame. The max_size argument sets the maximum size for
the points. The table_header_size and
table_text_size arguments set the font size for the table
header and text, respectively. The col_nudge argument takes
a vector of horizontal nudges for each column in the table to adjust
their position. We use this for the first column which is left-aligned.
In this case, the cell text aligns left with the anchor as the tick mark
location. The header label is set centered over the tick mark regardless
of the justification of the text. The col_nudge parameter
allows you to move the cell text to be roughly in line with the header
label. It may take a bit of trial and error to get it just right.
Finally, the diamond_aspect and
diamond_row_frac arguments control the aspect ratio and row
fraction for summary rows, which are plotted as diamonds instead of
points. If the summary diamond is too flat, increase
diamond_aspect. The diamond_row_frac argument
is intended to make is so that the diamond never spills over into
another row.
The out object now contains two elements - a table and a
plot. We can use the plot method to put them together, which is really
just a call to patchwork functions.
Forest Plot with Standard (95%) and Inferential (75%) Confidence Intervals. The standard confidence intervals are the thin lines and the inferential confidence intervals are the thick lines. See Armstrong and Poirier (2025) for more details on inferential confidence intervals.
geom_ Functions.The gg_forest() function is a wrapper that makes it easy
to put together a forest plot with two sets of confidence intervals, but
the underlying machinery is built on ggplot2 geoms and
scales that are part of the VizTest package. If you want
more control over the aesthetics, you can use those functions
directly.
Let’s make the figure first. We can make confidence bounds with
ggplot2’s geom_linerange() using different
line widths. We made a utility - geom_foreststripe() which
will make a striped figure that could match striping in the table if you
use it. The striping should be added before you add other elements. The
points are added with geom_forestpoint() the main utility
of this function is that it plots the summary and non-summary dots as
different shapes and the size argument works differently. The summary
diamonds are as wide as the confidence bounds and the non-summary points
are scaled proportional to the size_prop variable. We use
other ggplot2 functions to add vertical lines, re-scale the
x-axis and adjust the maximum size of the points.
library(ggplot2)
g <- ggplot(fit, aes(y=age_label, x=or)) +
geom_foreststripe(n_cols=NULL, start = 3) +
geom_linerange(aes(xmin = lower, xmax=upper, color="Standard CI"), linewidth=1) +
geom_linerange(aes(xmin=lower_ici, xmax=upper_ici, color="Inferential CI"), linewidth=5) +
geom_forestpoint(aes(is_summary = is_summary, size=point_size, xmin = lower, xmax=upper), fill="black",
diamond_aspect=10, diamond_row_frac = .9) +
theme_classic() +
scale_color_manual(values=c("gray60", "black")) +
scale_x_log10() +
geom_vline(xintercept = 1, linetype="dashed") +
labs(x = "Odds Ratio", y="", colour="") +
scale_size_area(max_size=5) +
guides(size = "none")
gForest Plot with Standard (95%) and Inferential (75%) Confidence Intervals. The standard confidence intervals are the thin lines and the inferential confidence intervals are the thick lines. See Armstrong and Poirier (2025) for more details on inferential confidence intervals.
To make this compatible with the table, we will also need to remove the y-axis ticks and labels.
g <- g +
theme(axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
axis.line.y=element_blank())Next, we can make the table. For this, we can use
geom_foresttable() a function from this package. Again, we
will make the table and then discuss the arguments.
data_cols <- c("Age" = "age_label", "Controls" = "n_controls", "Cases"="n_cases", "OR" = "or")
tab <- ggplot(fit, aes(y=age_label)) +
geom_foreststripe(n_cols=length(data_cols), start=3) +
geom_foresttable(
cols = data_cols,
col_nudge = c(-.15, 0, 0, 0),
fmt = list(
age_label = as.character,
n_controls = function(x)sprintf("%d",x),
n_cases = function(x)sprintf("%d",x),
or = function(x)sprintf("%.2f",x)
),
col_align = c("left", "center", "center", "center"),
size = 4
) +
scale_x_foresttable(data_cols, names(data_cols),
col_gap = 1) + # returns list(scale, theme) — ggplot2 handles lists fine
theme_void() +
theme(
axis.text.x.top = element_text(face = "bold", size = 16, margin = margin(t = 8)),
axis.line.x.top = element_blank(),
axis.ticks.x.top = element_blank(),
axis.ticks.length = unit(0, "pt")
)
tabData Table for Forest Plot.
In the code above, you we start with a ggplot with just
the y aesthetic. This should be the same y
variable as in the figure. We add striping in the same way. We can
specify n_cols to let the function know that there are four
columns of data to stripe. The data_cols is a named vector
of values. The values are variable names and the names of the vector are
what will be printed in the table header row. The col_nudge
argument takes a vector of horizontal nudges for each column in the
table to adjust their position. We use this for the first column which
is left-aligned as described above. The fmt argument takes
a list of functions that format each variable for printing. The names of
the list should be the same as the variable names in
data_cols. I’ve used sprintf() here, but
format() would work as well. In fact, any function that
could be called on the values to render them to print would work. The
col_align argument takes a vector of justifications for
each column. Finally, we use scale_x_foresttable() to set
the header labels and adjust the gap between columns with
col_gap. We end with some theming that turns off elements
that we do not need.
We could put these two elements together with
patchwork:
library(patchwork)
(tab + g + plot_layout(widths = c(1.25, 1), guides = "collect")) & theme(legend.position="bottom")Forest Plot with Data Table. The left panel is the data table and the right panel is the forest plot. The striping matches across panels.
Since these are just made with ggplot, in principle,
there are a couple of ways that subgroup analysis could be done. First,
it would be trivial to make these two different plots for models built
on subsets of the data. Alternatively, you could build a single model
with interactions and then subset the data once it is made and make two
different plots. That is likelyt he easiest way and we will not pursue
that more here.
Another option, would be to use the custom syntax we propose above
and add grouping variables. For example, using the esoph
data imagine we wanted to look at instances by age-group for high and
low tobacco usage groups. We will pull the data together much in the
same way that we did above.
data(esoph)
esoph <- subset(esoph, agegp != "25-34")
esoph$agegp <- factor(as.character(esoph$agegp),
levels=c("25-34", "35-44", "45-54", "55-64", "65-74", "75+"))
esoph$high_tob <- ifelse(esoph$tobgp %in% c("20-29", "30+"), "High Tobacco", "Low Tobacco")
esoph$high_tob <- factor(esoph$high_tob, levels=c("Low Tobacco", "High Tobacco"))
model1 <- glm(cbind(ncases, ncontrols) ~ agegp*high_tob,
data = esoph, family = binomial())
model_sum <- glm(cbind(ncases, ncontrols) ~ high_tob,
data = esoph, family = binomial())
em <- emmeans(model1, c("agegp", "high_tob"))
em_sum <- emmeans(model_sum, "high_tob")
fit <- print(em)
#> agegp high_tob emmean SE df asymp.LCL asymp.UCL
#> 35-44 Low Tobacco -3.212 0.416 Inf -4.028 -2.3961
#> 45-54 Low Tobacco -1.602 0.211 Inf -2.015 -1.1885
#> 55-64 Low Tobacco -1.027 0.168 Inf -1.356 -0.6969
#> 65-74 Low Tobacco -0.782 0.184 Inf -1.143 -0.4213
#> 75+ Low Tobacco -0.860 0.360 Inf -1.565 -0.1552
#> 35-44 High Tobacco -2.615 0.598 Inf -3.787 -1.4427
#> 45-54 High Tobacco -0.552 0.288 Inf -1.117 0.0124
#> 55-64 High Tobacco -0.134 0.259 Inf -0.641 0.3737
#> 65-74 High Tobacco 0.000 0.408 Inf -0.800 0.8002
#> 75+ High Tobacco -0.916 0.837 Inf -2.556 0.7235
#>
#> Results are given on the logit (not the response) scale.
#> Confidence level used: 0.95
fit_sum <- print(em_sum)
#> high_tob emmean SE df asymp.LCL asymp.UCL
#> Low Tobacco -1.381 0.0963 Inf -1.569 -1.192
#> High Tobacco -0.653 0.1540 Inf -0.955 -0.351
#>
#> Results are given on the logit (not the response) scale.
#> Confidence level used: 0.95
fit_sum$agegp <- "Summary"
fit_sum$lower_ici <- fit_sum$upper_ici <- NA
ag_data <- aggregate(esoph[,c("ncases", "ncontrols")], list(high_tob = esoph$high_tob), \(x)as.integer(sum(x)))
fit_sum$n_cases <- ag_data$ncases
fit_sum$n_controls <- ag_data$ncontrols
vt <- viztest(em, include_zero = FALSE, test_level = .05)
vt
#>
#> Correspondents of PW Tests with CI Tests
#> level psame pdiff easy method
#> 1 0.81 1 0.6666667 -0.07238479 Lowest
#> 2 0.84 1 0.6666667 -0.01736363 Middle
#> 3 0.87 1 0.6666667 -0.12092006 Highest
#> 4 0.83 1 0.6666667 -0.01385398 Easiest
#>
#> All 45 tests properly represented for by CI overlaps.
ic_level <- get_viztest_levels(vt, "easiest")
fit_ci <- confint(em, level = ic_level)
fit$lower_ici <- exp(fit_ci$asymp.LCL)
fit$upper_ici <- exp(fit_ci$asymp.UCL)
ag_data <- aggregate(esoph[,c("ncases", "ncontrols")], list(age = esoph$age, high_tob = esoph$high_tob), \(x)as.integer(sum(x)))
fit$n_cases <- ag_data$ncases
fit$n_controls <- ag_data$ncontrols
fit <- rbind(fit, fit_sum)
fit$or <- exp(fit$emmean)
fit$lower <- exp(fit$asymp.LCL)
fit$upper <- exp(fit$asymp.UCL)
fit$is_summary <- fit$agegp == "Summary"
fit$point_size <- 1/fit$SE^2
fit <- fit[order(fit$agegp, fit$high_tob), ]
fit$row <- 1:nrow(fit)
fit$row_fac <- as.factor(1:nrow(fit))
fit$row_fac <- reorder_forest(fit$row_fac, fit$row)g <- ggplot(fit, aes(y=row_fac, x=or)) +
geom_foreststripe(n_cols=NULL, start = 3) +
geom_linerange(aes(xmin = lower, xmax=upper, color = high_tob, alpha="Standard CI"), linewidth=1) +
geom_linerange(aes(xmin=lower_ici, xmax=upper_ici, color=high_tob, alpha= "Inferential CI"), linewidth=5) +
geom_forestpoint(aes(is_summary = is_summary, size=point_size, xmin = lower, xmax=upper, color=high_tob, fill=high_tob),
diamond_aspect=10, diamond_row_frac = .9) +
theme_classic() +
scale_alpha_manual(values=c(.25, .75)) +
scale_x_log10() +
geom_vline(xintercept = 1, linetype="dashed") +
labs(x = "Odds Ratio", y="", alpha="") +
scale_size_area(max_size=5) +
guides(size = "none", color="none", fill="none") +
theme(axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
axis.line.y=element_blank())
gForest Plot with Standard (95%) and Inferential (83%) Confidence Intervals. The standard confidence intervals are the thin lines and the inferential confidence intervals are the thick lines. See Armstrong and Poirier (2025) for more details on inferential confidence intervals.
Next, we can make the table. Because these work in th ggplot
ecosystem, you can color the text by a variable if you like. Here, we
color the text by the tobacco usage group. We can do this simply by
adding a color aesthetic to geom_foresttable().
data_cols = c("Age" = "agegp", "Tobacco\nUsage" = "high_tob", "Controls" = "n_controls", "Cases"="n_cases", "OR" = "or")
tab <- ggplot(fit, aes(y=row_fac, x=or)) +
geom_foreststripe(n_cols=length(data_cols), start=3) +
geom_foresttable(aes(color=high_tob),
cols = data_cols,
col_nudge = c(-.15, -.35, 0, 0, 0),
fmt = list(
agegp = as.character,
high_tob = as.character,
n_controls = function(x)sprintf("%d",x),
n_cases = function(x)sprintf("%d",x),
or = function(x)sprintf("%.2f",x)
),
col_align = c("left", "left", "center", "center", "center"),
size = 4
) +
scale_x_foresttable(data_cols, names(data_cols),
col_gap = 1) + # returns list(scale, theme) — ggplot2 handles lists fine
theme_void() +
theme(
axis.text.x.top = element_text(face = "bold", size = 12, margin = margin(t = 8)),
axis.line.x.top = element_blank(),
axis.ticks.x.top = element_blank(),
axis.ticks.length = unit(0, "pt")
)
tabData Table for Forest Plot with Subgroups.
And now putting them together.
library(patchwork)
(tab + g + plot_layout(widths = c(1.25, 1), guides = "collect")) & theme(legend.position="bottom")Forest Plot with Data Table with Subgroups. The left panel is the data table and the right panel is the forest plot. The striping matches across panels.