README

dtaudit helps analysts validate data operations, compare datasets, and diagnose data quality issues. It provides diagnostic output (receipts) at each step, making it easier to catch data issues early in your pipeline.

Installation

# Install CRAN version using
install.packages("dtaudit")

# Install development version using `remotes`
remotes::install_github("fpcordeiro/dtaudit")

# Install development version using using `pak`
pak::pkg_install("fpcordeiro/dtaudit")

Features

Join Validation

Analyze potential joins before performing them. Understand relationship types, match rates, and identify unmatched keys.

library(dtaudit)
library(data.table)

orders <- data.table(
  order_id = 1:5,
  customer_id = c(101, 102, 101, 103, 104),
  revenue = c(200, 150, 300, 450, 100)
)

customers <- data.table(
  customer_id = c(101, 102, 105),
  name = c("Alice", "Bob", "Charlie")
)

result <- validate_join(orders, customers, by = "customer_id")
print(result)
#> ============== Join Validation Summary ==============
#> Tables: orders <--> customers
#> Keys in orders: customer_id
#> Keys in customers: customer_id
#>   Relationship                                : many-to-one
#>   Key(s) in orders   [customer_id]            : (1 col)
#>   Key(s) in customers   [customer_id]         : (1 col)
#>   Rows in orders                              : 5
#>   Distinct key combos in orders               : 4
#>   Rows in customers                           : 3
#>   Distinct key combos in customers            : 3
#>   Overlapping distinct key combos             : 2
#>   Matched row pairs (cartesian)               : 3
#>   Match rate from orders                      : 60.00%
#>   Match rate from customers                   : 66.67%
#>   Rows only in orders (no match in customers) : 2
#>   Rows only in customers (no match in orders) : 1
#> ------------------------------------
#> Duplicates: orders=yes  customers=no

Track a numeric column through the join to quantify the business impact of unmatched rows. Use stat when both tables share the same column name, or stat.x/stat.y for different columns:

validate_join(orders, customers, by = "customer_id", stat.x = "revenue")
#> ============== Join Validation Summary ==============
#> Tables: orders <--> customers
#> ...
#>   Match rate from orders                      : 60.00%
#>   Match rate from customers                   : 66.67%
#>   Rows only in orders (no match in customers) : 2
#>   Rows only in customers (no match in orders) : 1
#>
#>   --- Stat: revenue (orders) ---
#>   Total revenue in orders     : 1,200
#>   Matched revenue in orders   : 650  (54.17%)
#>   Unmatched revenue in orders : 550  (45.83%)
#> ------------------------------------
#> Duplicates: orders=yes  customers=no

Here, 60% of rows match but only 54% of revenue—the unmatched orders carry disproportionately high revenue, something row counts alone wouldn’t reveal.

Filter with Diagnostics

Filter data while tracking what gets dropped—including optional statistics like revenue or counts.

sales <- data.table(
  region = c("East", "West", "East", "West"),
  revenue = c(1000, 500, 2000, 300),
  valid = c(TRUE, FALSE, TRUE, TRUE)
)

# Keep rows where valid == TRUE
clean_sales <- filter_keep(sales, valid == TRUE, stat = revenue)
#> filter_keep(sales, valid == TRUE)
#>   Dropped 1 of 4 rows (25.00%).
#>   Dropped 500 of 3,800 for revenue (13.16%).

# Or equivalently, drop rows where valid == FALSE
clean_sales <- filter_drop(sales, valid == FALSE, stat = revenue)
#> filter_drop(sales, valid == FALSE)
#>   Dropped 1 of 4 rows (25.00%).
#>   Dropped 500 of 3,800 for revenue (13.16%).

Primary Key Validation

employees <- data.table(
  dept = c("Sales", "Sales", "Engineering", "Engineering"),
  emp_id = c(1, 2, 1, 1),
  name = c("Alice", "Bob", "Charlie", "Diana")
)

validate_primary_keys(employees, c("dept", "emp_id"))
#> ============== Primary Key Validation ==============
#> Table: employees
#> Key column(s): dept, emp_id
#> -----------------------------------------------------
#>   Total rows:              4
#>   Unique key combinations: 3
#>   Duplicate key combos:    1
#> -----------------------------------------------------
#> Result: NO - Keys do NOT uniquely identify all rows.
#>
#> Duplicate keys (showing up to 10):
#> Key: <dept, emp_id>
#>           dept emp_id     N
#>         <char>  <num> <int>
#> 1: Engineering      1     2

Variable Relationship Validation

dt <- data.table(
  student = c("Alice", "Alice", "Bob", "Bob"),
  course = c("Math", "English", "Math", "Science")
)

validate_var_relationship(dt, "student", "course")
#> ============== Variable Relationship Validation ==============
#> Table: dt
#> Variables: student <--> course
#> --------------------------------------------------------------
#>   Unique values in student: 2
#>   Unique values in course: 3
#>   Unique (student, course) pairs: 4
#> --------------------------------------------------------------
#>   student -> course: one-to-many
#>   course -> student: one-to-many
#> --------------------------------------------------------------
#> Relationship: MANY-TO-MANY

Compare Data Tables

dt_v1 <- data.table(id = 1:3, value = c(10.0, 20.0, 30.0))
dt_v2 <- data.table(id = 1:3, value = c(10.1, 20.0, 30.5), extra = "new")

compare_datatables(dt_v1, dt_v2)
#> 1. Number of rows
#>    dt_v1: 3 rows
#>    dt_v2: 3 rows
#>    Difference (dt_v1 - dt_v2): 0
#>
#> 2. Column names
#>    Matching column names : 2
#>    Only in dt_v1: 0
#>    Only in dt_v2: 1 (extra)
#>    Type mismatches  : 0
#>
#> 3. Key columns used for matching
#>    Key columns: id (auto-detected)
#>    Distinct key combinations in dt_v1: 3
#>    Distinct key combinations in dt_v2: 3
#>    Matching key combinations: 3
#>    Only in dt_v1: 0
#>    Only in dt_v2: 0
#>
#> 4. Numeric column discrepancies (absolute differences)
#>    Comparing numeric columns after merging on keys.
#>    Rows matched on keys: 3
#>    column     n   min   q25 median   q75   max
#>    <char> <int> <num> <num>  <num> <num> <num>
#> 1:  value     3     0  0.05    0.1   0.3   0.5

Diagnose Missing Values

dt <- data.table(
  id = 1:6,
  name = c("Alice", NA, "Charlie", "Diana", NA, "Frank"),
  score = c(85, 90, NA, NA, NA, 70),
  grade = c("A", "A", "B", NA, NA, "C")
)

diagnose_nas(dt)
#> 3 of 4 columns have missing values
#>   variable                           n_na   pct_na
#>   score                                 3    50.0%
#>   name                                  2    33.3%
#>   grade                                 2    33.3%

Diagnose String Columns

Audit character vectors for data quality issues: NAs, empty strings, whitespace problems, non-ASCII characters, and case inconsistencies.

x <- c("Apple", "apple", "APPLE", "  banana  ", "", NA, "caf\u00e9", "na\u00efve")

diagnose_strings(x, name = "fruits")
#> =============== String Column Diagnosis ===============
#> Variable: fruits
#> --------------------------------------------------------
#> Total elements:        8
#> --------------------------------------------------------
#> Missing & Empty:
#>   NA values:           1 (12.5%)
#>   Empty strings:       1 (12.5%)
#>   Whitespace-only:     0 (0.0%)
#> --------------------------------------------------------
#> Whitespace Issues:
#>   Leading whitespace:  1
#>   Trailing whitespace: 1
#> --------------------------------------------------------
#> Encoding:
#>   Non-ASCII chars:     2
#> --------------------------------------------------------
#> Case Inconsistencies:
#>   Variant groups:      1
#>   Total variants:      3
#>
#> Case variant examples (up to 5 groups):
#> Key: <lower>
#>   lower n_variants            examples
#>  <char>      <int>              <char>
#>   apple          3 Apple, apple, APPLE

Audit Cleaning Operations

Apply a cleaning function and see what changed—before committing to the transformation.

x <- c("Apple, Inc.", "MICROSOFT CORP.", "google llc", "Amazon.com Inc")

result <- audit_clean(x, clean_firm_name, name = "company")
#> =============== String Cleaning Audit ===============
#> Variable: company
#> Function: clean_firm_name
#> -----------------------------------------------------
#> Total elements:  4
#>   NA values:     0
#>   Changed:       4 (100.0% of non-NA)
#>   Unchanged:     0
#> -----------------------------------------------------
#> Examples of changes (showing 4 of 4):
#>           before     after
#>           <char>    <char>
#>      Apple, Inc.     APPLE
#>  MICROSOFT CORP. MICROSOFT
#>       google llc    GOOGLE
#>   Amazon.com Inc AMAZONCOM
#>
#> Access cleaned vector with: result$cleaned

Data Summaries

Generate comprehensive column summaries with type detection, missing value counts, and descriptive statistics.

summary_table <- get_summary_table(my_data)

dates <- as.IDate(c("2023-01-15", "2023-02-20", "2023-03-10", "2023-04-05",
                    "2023-05-12", "2023-06-18", "2023-09-22", "2023-10-30",
                    "2023-11-14", "2023-12-25"))
check_months_coverage(dates, "2023-01-01", "2023-12-31")
#> Checking dates between 2023-01-01 and 2023-12-31 (by month )
#> There are 2 month periods missing. These are:
#> Jul-2023, Aug-2023

String Cleaning

clean_var_names(c("Sales Revenue", "cost-of-goods"))
#> [1] "sales_revenue" "cost_of_goods"

clean_firm_name(c("Apple, Inc.", "MICROSOFT CORP."))
#> [1] "APPLE"     "MICROSOFT"

Function	Purpose
`validate_join()`	Analyze joins before merging
`validate_primary_keys()`	Test if columns uniquely identify rows
`validate_var_relationship()`	Determine relationship between two variables
`filter_keep()`	Filter with diagnostic output
`filter_drop()`	Drop rows with diagnostic output
`compare_datatables()`	Compare two data.tables
`diagnose_nas()`	Report missing values by column
`diagnose_strings()`	Audit string columns for quality issues
`audit_clean()`	Audit a cleaning function’s effect
`get_summary_table()`	Generate column summaries
`check_date_coverage()`	Validate date coverage (flexible granularity)
`check_months_coverage()`	Validate monthly date coverage
`embed_into_cartesian()`	Expand to cartesian product
`clean_var_names()`	Standardize variable names
`clean_firm_name()`	Normalize firm names

dtaudit

Overview