---
title: "TrueType-Aware Automatic Column Widths"
date: "2025-06-23"
author: "Gabriel Becker"
output:
  rmarkdown::html_document:
    theme: "spacelab"
    highlight: "kate"
    toc: true
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Automatic Column Widths}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
editor_options:
  markdown:
    wrap: 72
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  collapse = TRUE,
  comment = "#>"
)
# TODO change to pharmaverse data to parmaversejnj versions
```

## Introduction

TrueType fonts (i.e., those where different characters have different
printed widths) complicate the calculation of column widths based on the
contents of a table or listing, particularly when combined with verbose
human-readable column and/or row labels.

`junco` provides default algorithms for calculating appropriate column
widths for both tables and listings when exporting to RTF via
`tt_to_tlgrtf`. These can be invoked explicitly by calling the
`def_colwidths` function on a `TableTree` or `listing_df` object, along
with a font specification.

## Tables

Many tables have column labels many times longer than the data in that
column's cells; the width of cell data tends to be bounded by the fact
that it is a set of one to three numbers interspersed with punctuation,
rather than words as is the case for labels.

### Pagination Assumptions

`tt_to_tlgrtf` allows for horizontal `rtables`-style pagination, but
does not perform vertical pagination; each vertical strip of the table
(which, mind, comes from *horizontal* pagination) is written to a
separate file. The `combined_rtf` argument indicates whether a single
combined RTF should *also* be generated by stacking those separate
sections of the table into a single RTF (as different table objects).

### Algorithm And Optimality Criterion

The column-width algorithm for tables is relatively simple. For table
columns, it calculates the widths required so that no *cell values* will
be word-wrapped. This is essentially what
`formatters::propose_column_widths` does, with the exception that
`propose_column_widths` also takes the column labels into account, which
we have found in practice to be much wider than the cells.

`def_colwidths` also constrains the maximum width of the row labels to
the width (in inches) specified via `label_width_ins`, with a default of
two inches.

### Examples

We can see this by comparing tables with the same structure and value
contents but varying verbosity in their column and row labels.

```{r}
library(junco)
adsl2 <- ex_adsl
adsl2$ARM2 <- adsl2$ARM
levels(adsl2$ARM2) <- c("A", "B", "C")
adsl2$ARM3 <- adsl2$ARM
levels(adsl2$ARM3) <- c(
  "Full Drug Name Of Drug X",
  "Current Best-Practice Standard Of Care",
  "The Weird Other Arm"
)

## col-labels unmodified (middling width)
lyt1 <- basic_table() |>
  split_cols_by("ARM") |>
  split_rows_by("RACE") |>
  summarize_row_groups(format = "xx (xx.xx%)") |>
  analyze("DCSREAS")

tbl1 <- build_table(lyt1, adsl2)
head(tbl1)

## super narrow column labels
lyt2 <- basic_table() |>
  split_cols_by("ARM", labels_var = "ARM2") |>
  split_rows_by("RACE") |>
  summarize_row_groups(format = "xx (xx.xx%)") |>
  analyze("DCSREAS")

tbl2 <- build_table(lyt2, adsl2)
head(tbl2)

## super wide column labels
lyt3 <- basic_table() |>
  split_cols_by("ARM", labels_var = "ARM3") |>
  split_rows_by("RACE") |>
  summarize_row_groups(format = "xx (xx.xx%)") |>
  analyze("DCSREAS")

tbl3 <- build_table(lyt3, adsl2)
head(tbl3)
```

The default column widths in `rtables` (implemented via
`formatters::propose_column_widths`) take the maximum width required for
a *label or value* in each column (and in the row-label pseudo-column):

```{r}
propose_column_widths(tbl1)
```

This means that the width of the third column will be slightly smaller
for `tbl2`, as the column label is no longer wider than the group
summary cell values. The first and second columns remain the same, as
the cell value widths were already slightly larger than the labels in
`tbl1`.
```{r}
propose_column_widths(tbl2)
```

Meanwhile, the verbose column labels in `tbl3` result in dramatically
wider column widths, as `propose_column_widths` enforces no wrapping
**even within column labels**:

```{r}
propose_column_widths(tbl3)
```

In contrast, `def_colwidths` gives all three tables the same widths for
the three columns as `propose_column_widths` gave for `tbl2`:

```{r}
def_colwidths(tbl1, fontspec = font_spec(), label_width_ins = 2, col_gap = 0)
def_colwidths(tbl2, fontspec = font_spec(), label_width_ins = 2, col_gap = 0)
def_colwidths(tbl3, fontspec = font_spec(), label_width_ins = 2, col_gap = 0)
```

We see, however, that the row-label width has been reduced due to the
`label_width_ins` constraint, which we can vary up to the maximum width
the row labels need with no wrapping:

```{r}
## bigger than 2, but not what we got from propose_column_widths
def_colwidths(tbl1, fontspec = font_spec(), label_width_ins = 2.2, col_gap = 0)

## bigger than required so we get same row label width as propose_column_widths
def_colwidths(tbl1, fontspec = font_spec(), label_width_ins = 6, col_gap = 0)
```

While we have done these examples with the default monospace font used
by `rtables` and `formatters`, the difference is often particularly
large when using a TrueType font with verbose labels, as many letters
have larger print widths than punctuation and numeric digit characters:

```{r}
fspec_times <- font_spec("Times", 9)
propose_column_widths(tbl3, fontspec = fspec_times)
```

```{r}
def_colwidths(tbl3, fontspec = fspec_times, label_width_ins = 2, col_gap = 0)
```

We note here that for our (fictional but realistically verbose) column
labels in `tbl3`, the default behavior from `formatters` will not fit on
a single page: even without padding between the columns, those widths
take up

```{r}
sum(propose_column_widths(tbl3, fontspec = fspec_times))
```

space-character widths (which is the unit `formatters` calculates widths
in) while a standard page only has

```{r}
formatters::page_lcpp(fontspec = fspec_times)$cpp
```

spaces of width available. The column widths calculated by
`def_colwidths`, however, easily fit on a single page.

## Listings

Listings, unlike tables, often have text in their cell values, sometimes
even concatenations of multiple demographic variables into a single
column. They also do not have the row-labels pseudo-column present in
tables. As such, we need a different, and much more complicated,
algorithm to calculate good column widths.

### Pagination Assumptions

`def_colwidths` assumes that listings should **not** be horizontally
paginated, so all columns, and any gaps between them, must fit within
the width of a single page.

### Optimality Criterion

For listings, we optimize *the number of total lines a listing will
require to print, including repetition of the table header*. This helps
control the total size of the resulting RTF file, as well as generally
providing a better reading experience for the listing.

We further constrain our column widths such that no words *within cell
values* will need to be broken up by word wrapping, *if possible*. We
define "words" for this purpose as strings of characters separated by
space(s) or "-". For this reason, we recommend that values concatenated
into a single listing column be separated by, e.g., `" / "` rather than
`"/"`: even though that makes the value slightly longer, it gives the
algorithm much more flexibility to find column widths that don't break
up individual "words".

This translates, generally, to finding widths where, after wrapping, no
single column is wrapped many more times than the others within the
majority of rows. In practice, we have found that this results in
listings that are both legible and aesthetically reasonable.

### Algorithm

The algorithm for selecting column widths has two parts.
First, for each column individually, all widths that would result in
different numbers of total lines for the cells in that column are
determined; the constraint that words within cells not be broken up is
key here, as it dramatically reduces the number of widths that actually
result in different numbers of lines.

The second step is to search the space of candidate column widths
collectively for the optimal set whose combined width fits within the
total available space.

We will use the following data to illustrate:

```{r}
library(rlistings)
adae <- pharmaverseadam::adae
adae$AEOUT <- gsub("/", " / ", adae$AEOUT)
adsl <- pharmaverseadam::adsl
adsl <- adsl[, c("USUBJID", setdiff(names(adsl), names(adae)))]
lstdat <- merge(adae, adsl, by = "USUBJID")
var_labels(lstdat) <- c(var_labels(adae), var_labels(adsl)[-1])

lstdat$demog <- with_label(
  paste(lstdat$RACE, lstdat$SEX, lstdat$AGE, sep = " / "),
  "Demographic Information"
)

lsting <- as_listing(lstdat,
  key_cols = c("USUBJID"),
  disp_cols = c(
    "ACTARM", "COUNTRY", "demog", "AESEV", "AEBODSYS",
    "AEDECOD", "ASTDTM", "AENDTM", "AEOUT", "EOSSTT"
  )
)
```

#### Candidate Column Widths

For example, the last cell in the demographics column contains the value

```{r}
demcell <- lstdat$demog[nrow(lstdat)]
demcell
```

Broken up according to our definition, it contains the following "words"
which must remain whole during column width selection.

```{r}
wrds <- strsplit(demcell, "[ -]")[[1]]
wrds
```

Assuming a monospace font for simplicity, then, the smallest possible
width of the column is

```{r}
max(nchar(wrds))
```

Using that width, the first two words fit into a line, the third into
another, the fourth into its own, and "words" five through eight all fit
into a final line, for a total of four lines. We call this *packing
lines*.

```{r}
## width of each packed line: the words on a line joined by single spaces
packed_widths <- function(...) {
  lst <- list(...)
  nchar(vapply(lst, paste, FUN.VALUE = "", collapse = " "))
}

packed_widths(
  wrds[1:2],
  wrds[3],
  wrds[4],
  wrds[5:8]
)
```

We do not care which words are allocated where, only about the total
number of lines required. A column width of 10 would allow the fifth
word (`r wrds[5]`) to be packed into the same line as the fourth, giving
`r paste(wrds[4:5], collapse = " ")`, but it results in the same number
of total lines, so it is not considered a distinct candidate column
width with respect to that cell.

The next column width that results in fewer lines for that cell is one
where words one through three can all be packed into a single line, with
spaces between them: `r nchar(paste(wrds[1:3], collapse = " "))` in this
case. With that column width we get three lines, as we do not have
enough room for the space required to consolidate the final two lines
into one.

```{r}
packed_widths(wrds[1:3], wrds[4], wrds[5:8])
```

Increasing the column width to 17, however, allows us to get down to two
lines:

```{r}
packed_widths(
  wrds[1:3],
  wrds[4:8]
)
```

Finally, the last possible width with a different line total is the
smallest width that will fit the entire value, i.e., `r nchar(demcell)`.

So for this cell, there are four, and only four, candidate column
widths.

#### Selecting The Optimal Set Of Widths

Once we have the full set of candidate widths for each column
individually, the algorithm for selecting the optimal collective set is
as follows:

0.  Initialize
    a.  Remove candidate widths which result in column labels requiring
        more than the allowable number of lines (default 3)
    b.  Initialize with the smallest candidate width for each column
1.  Determine the column which requires the largest total number of
    lines
2.  Check whether the total space allows changing to the next candidate
    width for that column
    a.  If it does, select that column width and go to step (1)
    b.  Otherwise, end the search and spread any remaining available
        space equally among the columns

We are able to end the search at step (2b) because, even if another
column has a candidate width available that would require fewer lines,
the total lines for the document are determined solely by the column
which requires the most lines, so widening any other column would not
affect the outcome.

### Example

`def_colwidths` calls down to `listing_column_widths` with default
values when passed a `listing_df` object. We will call the latter
directly here for explicitness, and to make the column widths more
directly comparable via `export_as_txt` output.

```{r}
fspec_times8 <- font_spec("Times", 8, 1)
cw <- listing_column_widths(lsting, col_gap = 0, fontspec = fspec_times8, verbose = TRUE)

txt <- export_as_txt(lsting,
  pg_width = inches_to_spaces(8.88, fontspec = fspec_times8),
  lpp = NULL,
  colwidths = cw,
  fontspec = fspec_times8,
  col_gap = 0
)

txt2 <- strsplit(txt, "\n", fixed = TRUE)[[1]]
head(txt2)
length(txt2)
```

Compare this with giving each column an equal portion of the width
(admittedly an ill-conceived strategy):

```{r}
txtbad <- export_as_txt(lsting,
  pg_width = inches_to_spaces(8.88, fontspec = fspec_times8),
  lpp = NULL,
  colwidths = rep(floor(320 / 11), 11),
  fontspec = fspec_times8,
  col_gap = 0
)

txt2bad <- strsplit(txtbad, "\n", fixed = TRUE)[[1]]
head(txt2bad)
length(txt2bad)
```

```{r calc, include = FALSE}
saved <- (length(txt2bad) - length(txt2)) / length(txt2bad) * 100
dec_saved <- round(saved, 2)
```

So we see that our algorithm saved `r dec_saved` percent of the total
lines required by (a set of) naive column widths in this instance.
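
### Appendix: A Toy Sketch Of The Search

To make the two parts of the listing algorithm concrete, the following
base R sketch computes candidate widths for individual cells and then
runs the greedy search described under "Selecting The Optimal Set Of
Widths". It is an illustration under simplifying assumptions, not the
implementation in `junco`: the helper names (`split_words`,
`lines_needed`, `candidate_widths`, `greedy_widths`) and the `toy` data
are hypothetical, a monospace font is assumed, cells are assumed to be
non-empty, and the column-label line limit from step (0a) is omitted.

```{r}
## Hypothetical helpers for illustration only; not part of junco.

## Split a cell into the "words" that must stay whole (space or "-").
split_words <- function(cell) {
  words <- strsplit(cell, "[ -]")[[1]]
  words[nzchar(words)]
}

## Lines needed to pack `words` greedily into lines of width `w`
## (monospace: width == number of characters; needs w >= longest word).
lines_needed <- function(words, w) {
  n_lines <- 1L
  used <- 0L
  for (len in nchar(words)) {
    if (used == 0L) {
      used <- len
    } else if (used + 1L + len <= w) {
      used <- used + 1L + len # word fits after a separating space
    } else {
      n_lines <- n_lines + 1L # start a new line with this word
      used <- len
    }
  }
  n_lines
}

## Smallest width for each distinct line count of a single cell.
candidate_widths <- function(cell) {
  words <- split_words(cell)
  widths <- seq(max(nchar(words)), sum(nchar(words)) + length(words) - 1L)
  n <- vapply(widths, function(w) lines_needed(words, w), integer(1))
  widths[!duplicated(n)]
}

## Greedy search over per-column candidates; `columns` is a named list
## of character vectors, `total_width` the available width in characters.
greedy_widths <- function(columns, total_width) {
  words <- lapply(columns, function(cells) lapply(cells, split_words))
  cands <- lapply(columns, function(cells) {
    all_w <- sort(unique(unlist(lapply(cells, candidate_widths))))
    longest <- max(nchar(unlist(lapply(cells, split_words))))
    all_w[all_w >= longest] # never break the longest word in the column
  })
  total_lines <- function(j, w) {
    sum(vapply(words[[j]], lines_needed, integer(1), w = w))
  }
  idx <- rep(1L, length(columns))
  widths <- vapply(cands, `[`, integer(1), 1L) # step (0b)
  stopifnot(sum(widths) <= total_width)
  repeat {
    lines <- vapply(
      seq_along(columns),
      function(j) total_lines(j, widths[[j]]),
      integer(1)
    )
    worst <- which.max(lines) # step (1)
    if (idx[worst] == length(cands[[worst]])) break
    next_w <- cands[[worst]][idx[worst] + 1L]
    if (sum(widths) - widths[[worst]] + next_w > total_width) break # (2b)
    idx[worst] <- idx[worst] + 1L # step (2a)
    widths[[worst]] <- next_w
  }
  ## spread the remaining space equally among the columns
  widths + (total_width - sum(widths)) %/% length(widths)
}

## Made-up cell values, loosely modelled on the listing columns above.
toy <- list(
  demog = c("BLACK OR AFRICAN AMERICAN / F / 71", "WHITE / M / 64"),
  term = c("APPLICATION SITE PRURITUS", "DIARRHOEA")
)

candidate_widths(toy$demog[1])
greedy_widths(toy, total_width = 40)
```

For the first toy cell the candidates are 8, 16, 17 and 34, mirroring
the walkthrough in "Candidate Column Widths". The search stops once the
next candidate for the column with the most lines no longer fits in the
40 available characters, and the leftover space is then shared equally.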