--- title: "Getting Started with joinspy" author: "Gilles Colling" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting Started with joinspy} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 4 ) library(joinspy) # Transparent backgrounds for pkgdown light/dark mode old_par <- par(no.readonly = TRUE) ``` ## Overview **joinspy** provides diagnostic tools for understanding and debugging data frame joins in R. Instead of discovering join problems after the fact, joinspy helps you identify issues upfront and understand what will happen when you join. **Key features:** - Pre-join diagnostics with `join_spy()` and `key_check()` - Duplicate detection with `key_duplicates()` - Post-join analysis with `join_explain()` and `join_diff()` - Safe wrappers: `left_join_spy()`, `right_join_spy()`, `inner_join_spy()`, `full_join_spy()` - Cardinality enforcement with `join_strict()` - Auto-repair with `join_repair()` and `suggest_repairs()` - Visualization with `plot()` and `summary()` methods ## Installation ```{r eval = FALSE} # Install development version from GitHub # install.packages("pak") pak::pak("gcol33/joinspy") ``` ## What joinspy Does joinspy analyzes join keys **before** you join, detecting: - Duplicate keys (causes row multiplication) - Whitespace issues (leading/trailing spaces) - Case mismatches ("ABC" vs "abc") - Encoding problems (invisible Unicode characters) - NA values in keys - Type mismatches (numeric vs character) ## Quick Examples ### Pre-Join Diagnostics with `join_spy()` ```{r} # Sample data with issues orders <- data.frame( customer_id = c("A", "B", "B", "C", "D "), amount = c(100, 200, 150, 300, 50), stringsAsFactors = FALSE ) customers <- data.frame( customer_id = c("A", "B", "C", "D", "E"), name = c("Alice", "Bob", "Carol", "David", "Eve"), stringsAsFactors = FALSE ) # Get diagnostic report report <- join_spy(orders, customers, by = "customer_id") ``` Use `summary()` for a compact data frame of metrics: ```{r} summary(report) ``` ### Quick Check with `key_check()` For a fast pass/fail assessment: ```{r} key_check(orders, customers, by = "customer_id") ``` ### Finding Duplicates with `key_duplicates()` ```{r} key_duplicates(orders, by = "customer_id") ``` ## Join Wrappers Use `*_join_spy()` functions for joins with automatic diagnostics: ```{r} orders_clean <- data.frame( customer_id = c("A", "B", "C"), amount = c(100, 200, 300), stringsAsFactors = FALSE ) # Silent mode for pipelines result <- left_join_spy(orders_clean, customers, by = "customer_id", .quiet = TRUE) head(result) # Access diagnostics afterward last_report()$match_analysis$match_rate ``` Available wrappers: | Function | Description | |----------|-------------| | `left_join_spy()` | Left join with diagnostics | | `right_join_spy()` | Right join with diagnostics | | `inner_join_spy()` | Inner join with diagnostics | | `full_join_spy()` | Full join with diagnostics | ## Cardinality Enforcement Use `join_strict()` to enforce expected relationships: ```{r} products <- data.frame(id = 1:3, name = c("Widget", "Gadget", "Gizmo")) prices <- data.frame(id = 1:3, price = c(10, 20, 30)) # Succeeds - 1:1 relationship join_strict(products, prices, by = "id", expect = "1:1") ``` ```{r error = TRUE} # Fails - duplicates violate 1:1 prices_dup <- data.frame(id = c(1, 1, 2, 3), price = c(10, 15, 20, 30)) join_strict(products, prices_dup, by = "id", expect = "1:1") ``` | Cardinality | Description | |-------------|-------------| | `"1:1"` | Each key unique in both tables | | `"1:m"` | Left keys unique, right may duplicate | | `"m:1"` | Right keys unique, left may duplicate | | `"m:m"` | No constraints | ## Auto-Repair Fix common issues automatically: ```{r} messy <- data.frame( id = c(" A", "B ", " C "), value = 1:3, stringsAsFactors = FALSE ) # Preview repairs join_repair(messy, by = "id", dry_run = TRUE) # Apply repairs fixed <- join_repair(messy, by = "id") fixed$id ``` ## Post-Join Diagnostics ### Explaining Row Count Changes ```{r} orders_dup <- data.frame(id = 1:3, product = c("A", "B", "C")) inventory <- data.frame(id = c(1, 1, 2, 3), location = c("NY", "LA", "NY", "LA")) result <- merge(orders_dup, inventory, by = "id") join_explain(result, orders_dup, inventory, by = "id", type = "inner") ``` ### Before/After Comparison ```{r} before <- data.frame(id = 1:3, val = c("a", "b", "c")) after <- merge(before, data.frame(id = 2:4, name = c("B", "C", "D")), by = "id", all = TRUE) join_diff(before, after, by = "id") ``` ## Visualization ```{r fig.width = 5, fig.height = 4} orders <- data.frame(id = 1:5, val = 1:5) customers <- data.frame(id = 3:7, name = letters[3:7]) report <- join_spy(orders, customers, by = "id") plot(report) # Venn diagram ``` ## Quick Reference ### join_spy() Comprehensive pre-join diagnostic report. ```r join_spy(x, y, by, sample = NULL, ...) ``` | Parameter | Description | Default | |-----------|-------------|---------| | `x` | Left data frame | *required* | | `y` | Right data frame | *required* | | `by` | Column name(s) to join by | *required* | | `sample` | Sample size for large datasets | `NULL` | **Returns**: `JoinReport` S3 object with `print()`, `summary()`, and `plot()` methods. ### key_check() Quick pass/fail key quality assessment. ```r key_check(x, y, by) ``` **Returns**: Logical `TRUE` if keys are clean, `FALSE` otherwise (with messages). ### join_strict() Join with cardinality enforcement. ```r join_strict(x, y, by, type = "left", expect = "1:1") ``` | Parameter | Description | Default | |-----------|-------------|---------| | `type` | Join type: `"left"`, `"right"`, `"inner"`, `"full"` | `"left"` | | `expect` | Expected cardinality: `"1:1"`, `"1:m"`, `"m:1"`, `"m:m"` | `"1:1"` | **Returns**: Joined data frame, or error if cardinality violated. ### join_repair() Automatically fix common key issues. ```r join_repair(x, y = NULL, by, trim_whitespace = TRUE, standardize_case = NULL, remove_invisible = TRUE, empty_to_na = FALSE, dry_run = FALSE) ``` | Parameter | Description | Default | |-----------|-------------|---------| | `trim_whitespace` | Remove leading/trailing whitespace | `TRUE` | | `standardize_case` | Convert to `"lower"` or `"upper"` | `NULL` | | `remove_invisible` | Remove invisible Unicode characters | `TRUE` | | `empty_to_na` | Convert `""` to `NA` | `FALSE` | | `dry_run` | Preview changes without modifying | `FALSE` | **Returns**: Repaired data frame(s). ## See Also - `vignette("common-issues")` - Catalogue of join problems and solutions - `?join_spy`, `?key_check`, `?join_strict`, `?join_repair` ## Session Info ```{r} sessionInfo() ```