---
title: "Introduction to dtaudit"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to dtaudit}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

`dtaudit` provides diagnostic tools for data analysis workflows built on `data.table`. Unlike assertion-based packages that test whether data meets specific rules, `dtaudit` focuses on **interactive development** and **understanding what happens** to your data during joins, filters, and transformations.

```{r setup}
library(dtaudit)
library(data.table)
```

## Example data

Suppose you have two tables: one with customer orders and another with product information. These are typical inputs to a join operation.

```{r data}
orders <- data.table(
  order_id = 1:8,
  customer = c("Alice", "Bob", "Alice", "Carol", "Bob", "Alice", "Dave", "Eve"),
  product_id = c(101L, 102L, 101L, 103L, 104L, 102L, 105L, 106L),
  amount = c(50, 30, 50, 75, 20, 35, 60, 45)
)

products <- data.table(
  product_id = c(101L, 102L, 103L, 104L, 107L),
  category = c("Electronics", "Books", "Clothing", "Books", "Food"),
  price = c(25.0, 15.0, 37.5, 10.0, 8.0)
)
```

## Step 1: Validate primary keys

Before joining, check that your join keys actually identify rows uniquely in each table.

```{r validate-pk}
validate_primary_keys(orders, "order_id")
validate_primary_keys(products, "product_id")
```

## Step 2: Validate the join

Use `validate_join()` to understand the relationship between the two tables **before** joining. It reports the join type, match rates, and which keys are unmatched, without materializing the full outer join of the original tables --- so it is typically lighter and faster than ad-hoc checks.

```{r validate-join}
validate_join(orders, products, by = "product_id")
```

The output shows a many-to-one relationship (multiple orders per product), with some product IDs appearing only in one table.
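For intuition, the same facts can be assembled by hand with plain `data.table` --- duplicate-key counts determine the join type, and anti-joins list the one-sided keys. This is only an illustration of what the report summarizes, not how `validate_join()` is implemented:

```{r join-by-hand}
# Duplicate join keys on each side determine the join type
orders[, .N, by = product_id][N > 1]     # 101 and 102 repeat: many orders per product
products[, .N, by = product_id][N > 1]   # empty: product_id is unique in products

# Anti-joins list the keys that will not match
orders[!products, on = "product_id", unique(product_id)]   # 105, 106
products[!orders, on = "product_id", unique(product_id)]   # 107
```

Either way, the conclusion is the same: a many-to-one join with a handful of one-sided keys.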
This is expected, but now you know exactly what the merge will look like.

You can also track numeric columns through the join with `stat.x` and `stat.y` (or `stat` when both tables share the same column name). This shows how much of a metric falls in matched vs. unmatched rows --- often more revealing than row counts alone:

```{r validate-join-stat}
validate_join(orders, products, by = "product_id",
              stat.x = "amount", stat.y = "price")
```

Here 75% of order rows match, but only 71% of total order `amount` --- the unmatched orders carry disproportionately high amounts. On the product side, 80% of rows match but 92% of total `price`, meaning the unmatched product is relatively cheap. These insights help you decide whether unmatched keys are safe to ignore.

## Step 3: Filter with diagnostics

After merging, you may want to filter your data. `filter_keep()` and `filter_drop()` work like standard subsetting but report how many rows (and optionally how much of a statistic) you're removing.

```{r filter}
merged <- merge(orders, products, by = "product_id", all.x = TRUE)

# Keep only Electronics and Books, report dropped amount
result <- filter_keep(merged, category %in% c("Electronics", "Books"), stat = amount)
```

## Step 4: Compare before and after

`compare_datatables()` gives you a structural comparison between two tables --- useful for checking what changed after a transformation.

```{r compare}
compare_datatables(orders, result)
```

## Step 5: Check data quality

### Missing values

```{r diagnose-nas}
# Introduce some NAs to demonstrate
merged_with_na <- copy(merged)
merged_with_na[sample(.N, 2), category := NA_character_]

diagnose_nas(merged_with_na)
```

### String quality

`diagnose_strings()` checks for common string issues: missing values, empty strings, whitespace problems, non-ASCII characters, and case inconsistencies.
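To see what kinds of values trigger these checks, here is a deliberately messy vector probed with base R. These one-off probes are shown only for intuition about the issue classes; `diagnose_strings()` bundles such checks into a single report:

```{r string-issues-by-hand}
# A small vector exhibiting the issue classes listed above
messy <- c("Acme ", " acme", "ACME", "", NA, "Café")

sum(is.na(messy))                # missing values: 1
sum(messy == "", na.rm = TRUE)   # empty strings: 1
sum(grepl("^\\s|\\s$", messy))   # leading/trailing whitespace: 2
sum(grepl("[^ -~]", messy))      # non-ASCII characters: 1 ("Café")
```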
```{r diagnose-strings}
diagnose_strings(orders$customer)
```

### Date coverage

If your data should cover a continuous time range, `check_date_coverage()` identifies gaps.

```{r date-coverage}
dates <- as.IDate(c("2024-01-15", "2024-02-20", "2024-04-10", "2024-05-05"))
check_date_coverage(dates, "2024-01-01", "2024-06-30")
```

## Step 6: Summarize columns

`get_summary_table()` produces a one-row-per-column overview of your data, including types, missing counts, and descriptive statistics.

```{r summary-table}
get_summary_table(orders)
```

## Auditing cleaning operations

When you apply a cleaning function, `audit_clean()` reports exactly what changed:

```{r audit-clean}
firms <- c("Apple Inc.", "MICROSOFT CORP", "Alphabet LLC", "apple", NA)
audit_clean(firms, clean_firm_name)
```

## Further reading

See the function reference for complete documentation:

- **Join & key validation**: `validate_join()`, `validate_primary_keys()`, `validate_var_relationship()`
- **Data comparison**: `compare_datatables()`
- **Filter diagnostics**: `filter_keep()`, `filter_drop()`
- **Data quality**: `diagnose_nas()`, `diagnose_strings()`, `audit_clean()`, `get_summary_table()`, `summarize_vector()`, `check_date_coverage()`
- **String cleaning**: `clean_var_names()`, `clean_firm_name()`
- **Data manipulation**: `embed_into_cartesian()`
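As a closing illustration, one common pattern behind grid-embedding helpers --- filling a table out to the full Cartesian grid of its keys so that missing combinations become explicit `NA` rows --- can be sketched with plain `data.table`'s `CJ()`. This is an assumption about the general pattern, not the actual behavior of `embed_into_cartesian()`; see its help page for details:

```{r cartesian-by-hand}
# One row per observed (customer, product) pair
sales <- orders[, .(amount = sum(amount)), by = .(customer, product_id)]

# Full grid: every customer crossed with every product
grid <- CJ(customer = unique(orders$customer),
           product_id = unique(orders$product_id))

# Right-join onto the grid; unobserved pairs get NA amounts
full <- sales[grid, on = c("customer", "product_id")]
nrow(full)   # 5 customers x 6 products = 30 rows
```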