dtaudit

Lifecycle: experimental

Audit and Diagnostic Tools for data.table Workflows

Overview

dtaudit helps analysts validate data operations, compare datasets, and diagnose data quality issues. It provides diagnostic output (receipts) at each step, making it easier to catch data issues early in your pipeline.

Installation

# Install CRAN version using
install.packages("dtaudit")

# Install development version using `remotes`
remotes::install_github("fpcordeiro/dtaudit")

# Install development version using using `pak`
pak::pkg_install("fpcordeiro/dtaudit")

Features

Join Validation

Analyze potential joins before performing them. Understand relationship types, match rates, and identify unmatched keys.

library(dtaudit)
library(data.table)

orders <- data.table(
  order_id = 1:5,
  customer_id = c(101, 102, 101, 103, 104),
  revenue = c(200, 150, 300, 450, 100)
)

customers <- data.table(
  customer_id = c(101, 102, 105),
  name = c("Alice", "Bob", "Charlie")
)

result <- validate_join(orders, customers, by = "customer_id")
print(result)
#> ============== Join Validation Summary ==============
#> Tables: orders <--> customers
#> Keys in orders: customer_id
#> Keys in customers: customer_id
#>   Relationship                                : many-to-one
#>   Key(s) in orders   [customer_id]            : (1 col)
#>   Key(s) in customers   [customer_id]         : (1 col)
#>   Rows in orders                              : 5
#>   Distinct key combos in orders               : 4
#>   Rows in customers                           : 3
#>   Distinct key combos in customers            : 3
#>   Overlapping distinct key combos             : 2
#>   Matched row pairs (cartesian)               : 3
#>   Match rate from orders                      : 60.00%
#>   Match rate from customers                   : 66.67%
#>   Rows only in orders (no match in customers) : 2
#>   Rows only in customers (no match in orders) : 1
#> ------------------------------------
#> Duplicates: orders=yes  customers=no

Track a numeric column through the join to quantify the business impact of unmatched rows. Use stat when both tables share the same column name, or stat.x/stat.y for different columns:

validate_join(orders, customers, by = "customer_id", stat.x = "revenue")
#> ============== Join Validation Summary ==============
#> Tables: orders <--> customers
#> ...
#>   Match rate from orders                      : 60.00%
#>   Match rate from customers                   : 66.67%
#>   Rows only in orders (no match in customers) : 2
#>   Rows only in customers (no match in orders) : 1
#>
#>   --- Stat: revenue (orders) ---
#>   Total revenue in orders     : 1,200
#>   Matched revenue in orders   : 650  (54.17%)
#>   Unmatched revenue in orders : 550  (45.83%)
#> ------------------------------------
#> Duplicates: orders=yes  customers=no

Here, 60% of rows match but only 54% of revenue—the unmatched orders carry disproportionately high revenue, something row counts alone wouldn’t reveal.

Filter with Diagnostics

Filter data while tracking what gets dropped—including optional statistics like revenue or counts.

sales <- data.table(
  region = c("East", "West", "East", "West"),
  revenue = c(1000, 500, 2000, 300),
  valid = c(TRUE, FALSE, TRUE, TRUE)
)

# Keep rows where valid == TRUE
clean_sales <- filter_keep(sales, valid == TRUE, stat = revenue)
#> filter_keep(sales, valid == TRUE)
#>   Dropped 1 of 4 rows (25.00%).
#>   Dropped 500 of 3,800 for revenue (13.16%).

# Or equivalently, drop rows where valid == FALSE
clean_sales <- filter_drop(sales, valid == FALSE, stat = revenue)
#> filter_drop(sales, valid == FALSE)
#>   Dropped 1 of 4 rows (25.00%).
#>   Dropped 500 of 3,800 for revenue (13.16%).

Primary Key Validation

Test whether a set of columns uniquely identifies every row in a data.table.

employees <- data.table(
  dept = c("Sales", "Sales", "Engineering", "Engineering"),
  emp_id = c(1, 2, 1, 1),
  name = c("Alice", "Bob", "Charlie", "Diana")
)

validate_primary_keys(employees, c("dept", "emp_id"))
#> ============== Primary Key Validation ==============
#> Table: employees
#> Key column(s): dept, emp_id
#> -----------------------------------------------------
#>   Total rows:              4
#>   Unique key combinations: 3
#>   Duplicate key combos:    1
#> -----------------------------------------------------
#> Result: NO - Keys do NOT uniquely identify all rows.
#>
#> Duplicate keys (showing up to 10):
#> Key: <dept, emp_id>
#>           dept emp_id     N
#>         <char>  <num> <int>
#> 1: Engineering      1     2

Variable Relationship Validation

Determine the relationship type between two variables in a data.table.

dt <- data.table(
  student = c("Alice", "Alice", "Bob", "Bob"),
  course = c("Math", "English", "Math", "Science")
)

validate_var_relationship(dt, "student", "course")
#> ============== Variable Relationship Validation ==============
#> Table: dt
#> Variables: student <--> course
#> --------------------------------------------------------------
#>   Unique values in student: 2
#>   Unique values in course: 3
#>   Unique (student, course) pairs: 4
#> --------------------------------------------------------------
#>   student -> course: one-to-many
#>   course -> student: one-to-many
#> --------------------------------------------------------------
#> Relationship: MANY-TO-MANY

Compare Data Tables

Compare two data.tables by structure, keys, and numeric values.

dt_v1 <- data.table(id = 1:3, value = c(10.0, 20.0, 30.0))
dt_v2 <- data.table(id = 1:3, value = c(10.1, 20.0, 30.5), extra = "new")

compare_datatables(dt_v1, dt_v2)
#> 1. Number of rows
#>    dt_v1: 3 rows
#>    dt_v2: 3 rows
#>    Difference (dt_v1 - dt_v2): 0
#>
#> 2. Column names
#>    Matching column names : 2
#>    Only in dt_v1: 0
#>    Only in dt_v2: 1 (extra)
#>    Type mismatches  : 0
#>
#> 3. Key columns used for matching
#>    Key columns: id (auto-detected)
#>    Distinct key combinations in dt_v1: 3
#>    Distinct key combinations in dt_v2: 3
#>    Matching key combinations: 3
#>    Only in dt_v1: 0
#>    Only in dt_v2: 0
#>
#> 4. Numeric column discrepancies (absolute differences)
#>    Comparing numeric columns after merging on keys.
#>    Rows matched on keys: 3
#>    column     n   min   q25 median   q75   max
#>    <char> <int> <num> <num>  <num> <num> <num>
#> 1:  value     3     0  0.05    0.1   0.3   0.5

Diagnose Missing Values

Report NA counts and percentages for each column.

dt <- data.table(
  id = 1:6,
  name = c("Alice", NA, "Charlie", "Diana", NA, "Frank"),
  score = c(85, 90, NA, NA, NA, 70),
  grade = c("A", "A", "B", NA, NA, "C")
)

diagnose_nas(dt)
#> 3 of 4 columns have missing values
#>   variable                           n_na   pct_na
#>   score                                 3    50.0%
#>   name                                  2    33.3%
#>   grade                                 2    33.3%

Diagnose String Columns

Audit character vectors for data quality issues: NAs, empty strings, whitespace problems, non-ASCII characters, and case inconsistencies.

x <- c("Apple", "apple", "APPLE", "  banana  ", "", NA, "caf\u00e9", "na\u00efve")

diagnose_strings(x, name = "fruits")
#> =============== String Column Diagnosis ===============
#> Variable: fruits
#> --------------------------------------------------------
#> Total elements:        8
#> --------------------------------------------------------
#> Missing & Empty:
#>   NA values:           1 (12.5%)
#>   Empty strings:       1 (12.5%)
#>   Whitespace-only:     0 (0.0%)
#> --------------------------------------------------------
#> Whitespace Issues:
#>   Leading whitespace:  1
#>   Trailing whitespace: 1
#> --------------------------------------------------------
#> Encoding:
#>   Non-ASCII chars:     2
#> --------------------------------------------------------
#> Case Inconsistencies:
#>   Variant groups:      1
#>   Total variants:      3
#>
#> Case variant examples (up to 5 groups):
#> Key: <lower>
#>   lower n_variants            examples
#>  <char>      <int>              <char>
#>   apple          3 Apple, apple, APPLE

Audit Cleaning Operations

Apply a cleaning function and see what changed—before committing to the transformation.

x <- c("Apple, Inc.", "MICROSOFT CORP.", "google llc", "Amazon.com Inc")

result <- audit_clean(x, clean_firm_name, name = "company")
#> =============== String Cleaning Audit ===============
#> Variable: company
#> Function: clean_firm_name
#> -----------------------------------------------------
#> Total elements:  4
#>   NA values:     0
#>   Changed:       4 (100.0% of non-NA)
#>   Unchanged:     0
#> -----------------------------------------------------
#> Examples of changes (showing 4 of 4):
#>           before     after
#>           <char>    <char>
#>      Apple, Inc.     APPLE
#>  MICROSOFT CORP. MICROSOFT
#>       google llc    GOOGLE
#>   Amazon.com Inc AMAZONCOM
#>
#> Access cleaned vector with: result$cleaned

Data Summaries

Generate comprehensive column summaries with type detection, missing value counts, and descriptive statistics.

summary_table <- get_summary_table(my_data)

Check date coverage for time series data:

dates <- as.IDate(c("2023-01-15", "2023-02-20", "2023-03-10", "2023-04-05",
                    "2023-05-12", "2023-06-18", "2023-09-22", "2023-10-30",
                    "2023-11-14", "2023-12-25"))
check_months_coverage(dates, "2023-01-01", "2023-12-31")
#> Checking dates between 2023-01-01 and 2023-12-31 (by month )
#> There are 2 month periods missing. These are:
#> Jul-2023, Aug-2023

String Cleaning

Standardize variable names and firm names for matching.

clean_var_names(c("Sales Revenue", "cost-of-goods"))
#> [1] "sales_revenue" "cost_of_goods"

clean_firm_name(c("Apple, Inc.", "MICROSOFT CORP."))
#> [1] "APPLE"     "MICROSOFT"

Function Reference

Function Purpose
validate_join() Analyze joins before merging
validate_primary_keys() Test if columns uniquely identify rows
validate_var_relationship() Determine relationship between two variables
filter_keep() Filter with diagnostic output
filter_drop() Drop rows with diagnostic output
compare_datatables() Compare two data.tables
diagnose_nas() Report missing values by column
diagnose_strings() Audit string columns for quality issues
audit_clean() Audit a cleaning function’s effect
get_summary_table() Generate column summaries
check_date_coverage() Validate date coverage (flexible granularity)
check_months_coverage() Validate monthly date coverage
embed_into_cartesian() Expand to cartesian product
clean_var_names() Standardize variable names
clean_firm_name() Normalize firm names

Dependencies

Required: - data.table - stringi

Optional: - pbapply - for progress bars

License

LGPL (>= 3)