--- title: "Guide to fjoin" format: html vignette: > %\VignetteEngine{quarto::html} %\VignetteIndexEntry{Guide to fjoin} %\VignetteEncoding{UTF-8} # knitr chunk options: knitr: opts_chunk: collapse: false # don't collapse code and output into one block out.width: "100%" # figs 100% width # execution settings for Quarto: execute: echo: true warning: false message: false cache: false # always render from scratch --- ```{r setup, include=FALSE} pkg <- function(x) { # style "pkgname" is defined in _pkgdown.yml sprintf('%s', x) } ``` `r pkg("fjoin")` builds on `r pkg("data.table")` to provide fast, flexible joins on any data frames. It slots into tidyverse pipelines and general workflows in a single line, and provides NA-safe matching by default, on-the-fly column selection, flexible row-order preservation, multiple-match handling on both sides, and an indicator column for row origin. # Installation Stable release (CRAN): ```r install.packages("fjoin") ``` Latest development version ([R-universe](https://trobx.r-universe.dev)): ```r install.packages("fjoin", repos = c("https://trobx.r-universe.dev", "https://cloud.r-project.org")) ``` # Features - **Scope:** Inner, left, right, full, semi-, and anti-joins with equality and inequality conditions, plus cross joins. - **Compatibility:** Accepts any mix of data frame-like objects or `list`s of same-length vectors, and returns a plain `data.frame`, (grouped) tibble, `data.table`, `sf`, or `sf`-tibble. Refreshes dynamic attributes like `groups`, keys, `agr`, and `bbox` in the output. - **Performance:** Uses case-specific `r pkg("data.table")` solutions specially developed for high performance, and avoids data-copying at any stage for memory-efficiency. - **Transparency:** "Just works" for general users, but you can also view the resulting `r pkg("data.table")` code instead of (or as well as) executing it, or run “mock” joins without data that output template code. - **Low dependency:** Depends only on `r pkg("data.table")`, a mature, dependency-free package emphasising long-term stability. `r pkg("fjoin")` provides several distinctive options and controls, including: - **NA-safe joins by default:** Missing values are treated sensibly (no matches on `NA` unless overridden with `match.na`). - **On-the-fly column selection:** Choose and order columns with `select` for efficient one-liners. - **Row origin tagging:** Add a Stata-style marker column showing each row’s source with `indicate`. - **Multiple-match handling on both sides:** Arguments `mult.x` and `mult.y` specify how to handle multiple matches on either or both sides of the join, including in semi- and anti-joins. - **Flexible order:** Control which data frame's row order is preserved in the result (without expensive sorting) using `order`. # API | `fjoin_*` functions
(`x`/`y` style) | `dtjoin_*` functions
(extended `DT[i]` style) | |:-------|:--| | `fjoin_inner()`, `fjoin_left()`, `fjoin_right()`,`fjoin_full()` | `dtjoin()` | | `fjoin_left_semi` (alias `fjoin_semi`), `fjoin_right_semi()` | `dtjoin_semi()` | | `fjoin_left_anti` (alias `fjoin_anti`), `fjoin_right_anti()` | `dtjoin_anti()` | | `fjoin_cross()` | `dtjoin_cross()` | The `fjoin_*` family consists of conventional left/right-style join functions. They are wrappers around the `dtjoin_*` functions (also exported), which use a generalisation of `r pkg("data.table")`'s `DT[i]` join syntax, and which write the join code and handle execution. # Examples ```{r} library(fjoin) read_df <- function(x) data.table::fread(x, quote = "'", data.table = FALSE) ``` ## Example 1: Basic use and options Plain data frames joined by simple equality, using `fjoin_full()` for illustration. ```{r} dfP <- read_df(" id item price other_cols NA apples 10 ... 3 bananas 20 ... 2 cherries 30 ... 1 dates 40 ... ") dfQ <- read_df(" id quantity notes other_cols 2 5 '' ... 1 6 '' ... 3 7 '' ... NA 8 'oranges' ... ") ``` ### (1) Basic syntax ```{r} fjoin_full(dfQ, dfP, on = "id") ``` The default `match.na = FALSE` prevents apples-and-oranges matches on missing values --- other frameworks would join the last two rows. ### (2) Efficient selective joins in one line Use `select` to restrict the join to particular columns and put them in desired order: ```{r} fjoin_full(dfQ, dfP, on = "id", select = c("item", "price", "quantity")) ``` This is much easier to write and read than the equivalent operation in `r pkg("dplyr")`, which requires two calls before the join to avoid inflating it with irrelevant columns, and a call after it to shuffle column order: ```{r eval = FALSE} x <- dfQ |> select(id, quantity) y <- dfP |> select(id, item, price) full_join(x, y, join_by(id), na.matches = "never") |> select(id, item, price, quantity) ``` ### (3) Indicator column for row origin `1L`: left only, `2L`: right only, `3L`: joined from both. In Stata since 1984!^[See p. 99 of the Stata 1 reference manual, which has been linked to on Statalist and can be found by searching online. Thanks to Nick Cox for this information.] ```{r} fjoin_full( dfQ, dfP, on = "id", select = c("item", "price", "quantity"), indicate = TRUE ) ``` ### (4) Switch row order from left-then-right to right-then-left ```{r} fjoin_full( dfQ, dfP, on = "id", select = c("item", "price", "quantity"), indicate = TRUE, order = "right" ) ``` ### (5) Display code instead of executing For `r pkg("data.table")` users. ```{r} fjoin_full( dfQ, dfP, on = "id", select = c("item", "price", "quantity"), indicate = TRUE, order = "right", do = FALSE ) ``` ## Example 2: Reducing an inequality join from M:M to 1:1 with `mult.x` and `mult.y` `r pkg("data.table")` (`mult`) and `r pkg("dplyr")` (`multiple`) can reduce the cardinality on one side of the join from many (`"all"`) to one (`"first"` or `"last"`). `r pkg("fjoin")` (`mult.x`, `mult.y`) will do this on either side of the join, or on both sides at the same time. This example (using `fjoin_left()`) shows an application to temporally ordered data frames of generic "events" and "reactions". ```{r} events <- read_df(" event_id event_ts 1 10 2 20 3 40 ") reactions <- read_df(" reaction_id reaction_ts 1 30 2 50 3 60 ") ``` ### (1) For each event, all subsequent reactions (M:M) ```{r} fjoin_left( events, reactions, on = c("event_ts < reaction_ts") ) ``` ### (2) For each event, the next reaction (1:M) Equivalent to a one-way forward rolling join. ```{r} fjoin_left( events, reactions, on = c("event_ts < reaction_ts"), mult.x = "first" ) ``` ### (3) For each event, the next reaction, provided there was no intervening event (1:1) Equivalent to a two-way rolling join (mutual forward/backward rolling matches). ```{r} fjoin_left( events, reactions, on = c("event_ts < reaction_ts"), mult.x = "first", mult.y = "last" ) ``` ## Example 3: Chain of calls from `fjoin_*` to `dtjoin_*` to `r pkg("data.table")` A technical illustration showing: - two calls to `fjoin_left()` (commented out), differing in the `order` argument - the resulting calls to `dtjoin()`, with the addition of `show = TRUE` - the generated `r pkg("data.table")` code and its output ```{r} df_x <- data.frame(id_x = 1:3, row_x = paste0("x", 1:3)) df_y <- data.frame(id_y = rep(4:2, each = 2L), row_y = paste0("y", 1:6)) ``` ```{r} # (1) fjoin_left(df_x, df_y, on = "id_x == id_y", mult.x = "first") dtjoin( df_y, df_x, on = "id_y == id_x", mult = "first", i.home = TRUE, prefix = "R.", show = TRUE ) # (2) fjoin_left(df_x, df_y, on = "id_x == id_y", mult.x = "first", order = "right") dtjoin( df_x, df_y, on = "id_x == id_y", mult.DT = "first", nomatch = NULL, nomatch.DT = NA, prefix = "R.", show = TRUE ) ``` The difference in the `order` argument passed to `fjoin_*()` is reflected at `dtjoin()` level in the identity of the tables passed to `.DT` and `.i`, the values of the extended arguments `nomatch.DT` and `mult.DT` (counterparts to the familiar `r pkg("data.table")` arguments `nomatch` and `mult` on the other side of the join), and a compensating argument `i.home` which toggles the "home" and "foreign" table for the purposes of column order and prefixing (as well as for `indicate` and output class). `dtjoin()` in turn translates these specifications into `r pkg("data.table")` code for execution. See the `dtjoin()` documentation for full details of this extended `DT[i]`-style syntax. # Performance ```{r echo=FALSE} pal <- c( "#B8860B", "grey60", "#66C2A5", "#1B9E77" ) factor_in_order <- function(x, ...) factor(x, levels = unique(x), ...) pl <- function(x) { library(ggplot2) ggplot(x, aes(x = soln, y = ifelse(is.na(median_secs), 0, median_secs), fill = soln)) + geom_bar(stat = "identity", position = "dodge") + geom_text(aes(label = sprintf("%.1f", median_secs)), vjust = -0.3, size = 3) + facet_grid(factor_in_order(style) ~ factor_in_order(xyargs), switch = "y") + scale_fill_manual(values = pal) + scale_y_continuous(limits = c(0, k)) + #theme_minimal() + theme( legend.position = "none", axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank(), #strip.placement.y = "left", strip.text.y.left = element_text(angle = 0), ) + labs( x = NULL, # title = x[, unique(description)], subtitle = sprintf("median time in secs, %s runs per join",x[, unique(na.omit(N))]) ) } dat <- structure(list(style = c("Inner join", "Inner join", "Inner join", "Inner join", "Inner join", "Inner join", "Inner join", "Inner join", "Left join", "Left join", "Left join", "Left join", "Left join", "Left join", "Left join", "Left join", "Right join (ordered by left)", "Right join (ordered by left)", "Right join (ordered by left)", "Right join (ordered by left)", "Right join (ordered by left)", "Right join (ordered by left)", "Right join (ordered by left)", "Right join (ordered by left)", "Right join (ordered by right)", "Right join (ordered by right)", "Right join (ordered by right)", "Right join (ordered by right)", "Right join (ordered by right)", "Right join (ordered by right)", "Right join (ordered by right)", "Right join (ordered by right)", "Full join", "Full join", "Full join", "Full join", "Full join", "Full join", "Full join", "Full join" ), xyargs = c("x small, y big", "x small, y big", "x small, y big", "x small, y big", "x big, y small", "x big, y small", "x big, y small", "x big, y small", "x small, y big", "x small, y big", "x small, y big", "x small, y big", "x big, y small", "x big, y small", "x big, y small", "x big, y small", "x small, y big", "x small, y big", "x small, y big", "x small, y big", "x big, y small", "x big, y small", "x big, y small", "x big, y small", "x small, y big", "x small, y big", "x small, y big", "x small, y big", "x big, y small", "x big, y small", "x big, y small", "x big, y small", "x small, y big", "x small, y big", "x small, y big", "x small, y big", "x big, y small", "x big, y small", "x big, y small", "x big, y small"), soln = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), levels = c("fjoin", "merge", "dplyr", "collapse"), class = "factor"), description = c("No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered", "No missing values, tables unordered" ), N = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, NA, 5L, 5L, 5L, NA, 5L, NA, NA, 5L, 5L, NA, NA, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), median_secs = c(6.1902577, 6.1713217, 19.9907639, 19.7675571, 6.1000215, 6.1203305, 13.3964799, 7.0342887, 6.2597127, 6.2432609, 19.9681676, 20.1874484, 5.6448957, 5.6632873, 13.6564839, 5.5813086, 10.4640203, 10.8960488, 22.8774932, NA, 8.5460365, 12.2154671, 14.2776151, NA, 5.9763782, NA, NA, 5.6241829, 5.7766581, NA, NA, 20.3630658, 10.6481448, 10.864006, 23.0229502, 23.5587724, 9.8157773, 13.6417876, 15.7693984, 9.0004465)), row.names = c(NA, -40L), class = c("data.frame", "data.table")) ``` ```{r echo=FALSE} library(data.table) k <- 30 setDT(dat) pl(dat) ``` The benchmark above is based on a no-frills equality join to allow comparison with `merge.data.table` and `collapse::join()`; see the [Performance](https://trobx.github.io/fjoin/articles/fjoin-performance.html) article for more detail. In the inner and left joins, `r pkg("fjoin")` and `merge.data.table` reflect a simple operation in `r pkg("data.table")`, and straightforwardly inherit its speed and robustness to the order of the tables. But `r pkg("fjoin")` performs a bit better than `merge.data.table` on the right and full joins. This is typical: `r pkg("fjoin")`'s solutions for join types and additional options that are not straightforward in native `r pkg("data.table")` have been developed with close attention to performance. # Notes for `r pkg("tidyverse")` users `r pkg("fjoin")` is a drop-in alternative to `dplyr::*_join()` with fast large data performance and useful options that `dplyr::*_join()` lacks (though the reverse is also true --- see below). Joins are fairly infrequent operations, and the package name is short, so you may not feel the need to attach it: ```{r message=FALSE} library(dplyr) dfQ <- as_tibble(dfQ) dfQ |> fjoin::fjoin_full(dfP, on = "id", select = c("item", "price", "quantity"), order = "right", indicate = TRUE ) |> mutate( quantity = if_else(.join == 2L, 0L, quantity), revenue = price * quantity ) ``` Please note that `r pkg("fjoin")`, for now, has no equivalent of `dplyr::*_join()`'s `relationship` validation: it is silent and permissive about cardinality. It also doesn't yet support rolling joins on unordered data, which `r pkg("dplyr")` implements elegantly via a helper function in `join_by`, or dedicated overlap joins (although these are easily written in terms of inequalities). These features will be added. The implementation of joins in the `r pkg("data.table")`-backed packages `r pkg("dtplyr")` and `r pkg("tidytable")` needs maintenance. In both cases it only supports equality joins, malfunctions in the presence of same-named non-join columns on each side (try it), and silently ignores `r pkg("dplyr")`'s additional join arguments such as `na.matches` and `multiple` (as well as `relationship`). These packages are excellent tools that do much more than just joins, but you should be aware of these limitations. # Notes for `r pkg("sf")` users ```{r} countries <- read_df(" country_id country_name country_shape 1 'Country A' 'POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))' 2 'Country B' 'POLYGON ((1 1, 2 1, 2 2, 1 2, 1 1))' 3 'Country C' 'POLYGON ((2 2, 3 2, 3 3, 2 3, 2 2))' ") |> sf::st_as_sf(wkt = "country_shape", crs = 4326) capitals <- read_df(" country_id capital_name capital_loc 2 'City B' 'POINT (1.5 1.5)' 3 'City C' 'POINT (2.5 2.5)' 4 'City D' 'POINT (3.5 3.5)' ") |> sf::st_as_sf(wkt = "capital_loc", crs = 4326) ``` `r pkg("fjoin")` smoothly accommodates joins involving `sf` data frames. In particular, joins between two `sf` objects work as you would hope: ```{r} fjoin_inner(countries, capitals, on = "country_id") ``` This is useful in workflows where you want to hold multiple geometries in the same `sf` data frame. In `r pkg("dplyr")` such joins are prohibited: ```{r, message = TRUE} try(dplyr::inner_join(countries, capitals, by = "country_id")) ``` In addition, `r pkg("fjoin")` always detects and refreshes `sfc`-class columns in the join output, regardless of whether the inputs and output have `sf` class (and whether those columns were/are active geometries). This avoids stale bounding boxes and ensures that values at non-matching rows are converted from `NULL` to a valid empty geometry. For example, here is the `sfc` column `capital_loc` from the input on the right, after a left join in which the inputs are plain data frames instead of `sf`s: ```{r, message=TRUE} fjoin_left(as.data.frame(countries), as.data.frame(capitals), on = "country_id")$capital_loc ``` # Notes for `r pkg("data.table")` users `r pkg("fjoin")` automates joins that are challenging or laborious to write in `r pkg("data.table")`, while solving frustrations such as garbled join columns in inequality joins and the lack of an effective `incomparables` argument, and providing other useful options. Even for very simple joins, there is no reason *not* to use it, since it has negligible overhead and if anything will actually slightly outperform `merge` (see above) and even native `r pkg("data.table")` (see below). That said, `r pkg("fjoin")` is not a comprehensive wrapper for `r pkg("data.table")`'s rich join functionality. Some things it cannot do are: - computing on joined columns inside `j` (including `by = .EACHI` aggregations), rather than simply selecting them - joins by reference of the direct form `DT[i, on = , v := i.v]` - for now, dedicated rolling and overlap joins You do have the option of setting `do = FALSE`, copying the console output, and editing the `j`-expression(s). You can also "plonk" joined columns by reference using the pattern `DT[, v := fjoin_left(DT, i, on = , select = "v")$v]`, which can often usefully be combined with the `".join"` indicator: ```{r} library(data.table) dtQ <- as.data.table(dfQ) dtP <- as.data.table(dfP) dtP[, revenue := price * fjoin_left( dtP, dtQ, on = "id", select = c("quantity"), indicate = TRUE )[.join == 1L, quantity := 0L]$quantity][] ``` The package actually began life as a tool to solve `r pkg("data.table")`'s garbling of join columns in non-equi joins. It still does that: ```{r} dt1 <- data.table(t=c(5L,25L,45L)) dt2 <- data.table(t_start=c(1L,21L), t_end=c(10L,30L)) ``` Here is a range join with `r pkg("fjoin")`: ```{r} dtjoin(dt2, dt1, on=c("t_start <= t", "t_end >= t"), show = TRUE) ``` Compare the default output from `r pkg("data.table")`: ```{r message=FALSE} dt2[dt1, on=.(t_start <= t, t_end >= t)] ``` Notice that `r pkg("fjoin")` uses `data.frame()` in `j` (coupled with an outer `setDT()`) instead of the more usual `list()`, even when the required output is a `data.table`. This is to sidestep an unnecessary deep copy of the joined columns currently made by `r pkg("data.table")` in this case. Likewise, `r pkg("fjoin")` avoids deep copies on the way in and out by "shallow-casting" inputs and outputs to and from `data.table`s as necessary. Shallow conversion is not a safe general way of operating with foreign objects in `r pkg("data.table")`, but it works for joins, because the input vectors only need to be read, and the output vectors are guaranteed to be unshared. One consequence of this is that `r pkg("fjoin")` can be more efficient than `r pkg("data.table")` itself with standard interactive idioms. Consider the following (exaggerated) example, where the `i` expression is a `data.frame` and we select columns in `j` with `list()` (aliased `.()`): ```{r message=FALSE} n <- 1e6L; ncol_dt <- 2L; ncol_df <- 10L dt <- data.table(id = rep(1:n, each = 5L), matrix(runif(n * ncol_dt), ncol = ncol_dt)) df <- data.frame(id = 1:n, matrix(runif(n * ncol_df), ncol = ncol_df)) bench::mark( data.table = dt[df, on = .(id), .(id, V1, V2, X1, X3, X5, X7, X9)], fjoin = dtjoin(dt, df, on = "id", select.i = c("X1", "X3", "X5", "X7", "X9")), iterations = 3, check = TRUE ) |> summary() |> subset(select = c("expression", "n_itr", "median", "mem_alloc")) ``` Here `r pkg("fjoin")` avoids a call to `as.data.table.data.frame` on the way in, and a call to `as.data.table.list` on the way out, both of which (currently) always deep-copy. The `bench::mark` memory measurements exclude C-level allocations, but the difference between them reflects these R-level copies. This behaviour will eventually change in `r pkg("data.table")`, at which point `r pkg("fjoin")` will revert to the more familiar `j = list()` idiom. Finally, please note that `dtjoin()` differs from `r pkg("data.table")` in its preservation of keys. In a `r pkg("data.table")` `DT[i]` join, the output inherits the key of `DT` provided it happens to remain sorted on those columns; this is consistent with `r pkg("data.table")`'s conception of joins as a subsetting-like operation on `DT`, even though it is the `i`-table that dictates the row order. With `dtjoin()`, the output always inherits the key of `.i` (unless the non-joining rows of `.DT` are appended, in which case the key is `NULL` since sortedness on `.i`'s key columns can no longer be guaranteed). This design choice ensures intuitive behaviour of the `fjoin_*()` functions: for example, `fjoin_left(x, y)` with the default `order = "left"` preserves `x`'s key, as a user would surely expect. # "Internals" With `match.na = FALSE` (the default), `r pkg("fjoin")` inspects the data and, if the join permits, chooses which table to omit `NA`-containing rows from using a heuristic. The only thing that is not reflected in the `r pkg("data.table")` code that `r pkg("fjoin")` produces are the input and output handling steps: - shallow-casting non-`data.table` inputs as `data.table`s (new objects borrowing columns from the input) - adding (grouped) tibble and/or `sf` class to `data.frame` outputs where appropriate, including setting `groups` and `agr` attributes - keying `data.table` outputs as described above - refreshing `sfc` columns regardless of output class - on exit, dropping temporary columns (pointers) from `data.table` inputs used as-is These steps ensure that `r pkg("fjoin")` handles object classes efficiently (no data copying) while leaving inputs intact. In other respects, the package is a constructor of `r pkg("data.table")` code to a set of carefully thought-out solutions. `r pkg("fjoin")` pays a lot of attention to doing this well and is also very thoroughly tested.