--- title: "Getting started with dqcheckr" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting started with dqcheckr} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, eval = FALSE) ``` `dqcheckr` automates the verification of recurring external dataset deliveries. For each new file arrival it runs a battery of quality checks, compares the file to the previous delivery, writes a self-contained HTML report, and records summary statistics in a local SQLite database so that quality trends can be tracked over time. ## How it works ``` Your data file Your YAML config │ │ ▼ ▼ read_dataset() load_config() (all columns as (global defaults + character, trimmed) per-dataset rules) │ │ └──────────────┬─────────────────┘ ▼ run_qc_checks() ← single-snapshot checks (QC / SC series) run_comparison_checks()← vs previous delivery (CP series) run_custom_checks() ← your own rules │ ▼ write_snapshot() ← SQLite: run summary + per-column stats render_report() ← self-contained HTML │ ▼ run_dq_check() ← the one function you call ``` A data officer runs a single command for each arriving dataset: ```{r one-call} run_dq_check("customer_accounts", config_dir = "path/to/configs") ``` This prints a one-line console summary, writes an HTML report, and returns `list(status, report_path, snapshot_id)` invisibly. --- ## Installation `dqcheckr` is a first-party package — install it directly from source: ```{r install} devtools::install("path/to/dqr/dqcheckr") library(dqcheckr) ``` All dependencies are on CRAN: `readr`, `DBI`, `RSQLite`, `rmarkdown`, `knitr`, `kableExtra`, `ggplot2`, `gridExtra`, `yaml`, `dplyr`, `tidyr`, `rlang`. --- ## Configuration Two YAML files control every run. ### Global config — `dqcheckr.yml` Placed in your config directory. Sets default thresholds used by all datasets unless a dataset overrides them. ```yaml snapshot_db: "data/snapshots.sqlite" report_output_dir: "reports/" default_rules: # Proportion of non-empty values that must parse as numeric for a column to # be classified as numeric (affects type inference across many checks) type_inference_threshold: 0.90 # FAIL if any column's missing-value rate exceeds this (0-1) max_missing_rate: 0.05 # FAIL if non-numeric values in a numeric column exceed this rate max_non_numeric_rate: 0.01 # FAIL if row count drops below this (0 = disabled) min_row_count: 0 # Version-comparison thresholds (used when a previous file is available) max_row_count_change_pct: 0.10 max_numeric_mean_shift_pct: 0.20 max_missing_rate_change_pp: 2.0 # percentage points max_non_numeric_rate_change_pp: 1.0 flag_new_columns: true # set false to suppress new-column warnings in CP-02 flag_dropped_columns: true # set false to suppress dropped-column warnings in CP-02 flag_type_changes: true # set false to suppress type-change warnings in CP-02 flag_column_order_change: true # set false to skip CP-08 entirely (WARN CSV / FAIL FWF) ``` ### Per-dataset config — `.yml` One file per dataset. Only the keys you specify are needed — everything else inherits from the global defaults. ```yaml dataset_name: "customer_accounts" # for human reference only — the functional # identifier is the YAML filename, which must # match the dataset_name argument to run_dq_check() # --- File location --- # Option A: folder scan (two most recently modified files are used) folder: "data/incoming/customer_accounts/" # Option B: explicit paths (set current_file only → single-file mode, # no version comparison) # current_file: "data/incoming/customer_accounts/20260410.csv" # previous_file: "data/incoming/customer_accounts/20260401.csv" format: csv encoding: "UTF-8" delimiter: "," # --- Schema contract (optional) --- # SC-01 flags extra columns; SC-02 flags missing columns. expected_columns: - id - name - country_code - account_status - account_balance - created_date # --- Key columns (optional) --- # QC-12 checks each for duplicate values. key_columns: - id # --- Rule overrides (optional) --- # These replace the matching keys in default_rules for this dataset only. rule_overrides: max_missing_rate: 0.02 max_row_count_change_pct: 0.05 min_row_count: 1000 # --- Per-column rules (optional) --- column_rules: country_code: allowed_values: ["GB", "US", "DE", "FR"] # QC-09 pattern: "^[A-Z]{2}$" # QC-13 account_status: allowed_values: ["ACTIVE", "CLOSED", "SUSPENDED"] account_balance: min_value: 0 # QC-10 max_value: 1000000 # --- Custom checks (optional) --- # custom_checks_file: "custom/customer_accounts_checks.R" # --- Output paths (optional, override global) --- # snapshot_db: "data/snapshots.sqlite" # report_output_dir: "reports/" ``` ### What is required vs optional The only fields the package requires to run are the file location (`folder` or `current_file`) and the `format`. Every other config key is optional — the corresponding checks are simply skipped when the key is absent. | Config key | Skipped / default when absent | |---|---| | `key_columns` | QC-12 (key uniqueness) skipped | | `expected_columns` | SC-01 and SC-02 (schema contract) skipped | | `column_rules.allowed_values` | QC-09 (allowed values) skipped | | `column_rules.min_value` / `max_value` | QC-10 (numeric bounds) skipped | | `column_rules.pattern` | QC-13 (pattern / regex) skipped | | `custom_checks_file` | All custom checks skipped | | `previous_file` / second file in folder | All CP-01–CP-08 (version comparison) skipped | | `min_row_count` set to 0 | QC-14 (minimum row count) skipped | | `type_inference_threshold` | Defaults to 0.90 (affects QC-06, QC-07, QC-08, QC-11, CP-02, CP-04, CP-05, CP-06, CP-07) | | `flag_new_columns` | Defaults to `true`; set to `false` to suppress new-column warnings in CP-02 (still tracked in SQLite) | | `flag_dropped_columns` | Defaults to `true`; set to `false` to suppress dropped-column warnings in CP-02 (still tracked in SQLite) | | `flag_type_changes` | Defaults to `true`; set to `false` to suppress type-change warnings in CP-02 | | `flag_column_order_change` | Defaults to `true`; set to `false` to skip CP-08 entirely | A minimal dataset config that runs basic checks with no column-level rules: ```yaml dataset_name: "my_dataset" # for reference; must match the YAML filename current_file: "data/incoming/my_dataset.csv" format: csv ``` ### Fixed-width files For fixed-width files, set `format: fwf` and supply column widths and names. If the file has a header row, set `fwf_skip: 1` to skip it. ```yaml format: fwf fwf_skip: 1 fwf_widths: [10, 20, 3, 10, 12, 11] fwf_col_names: [id, name, country_code, account_status, account_balance, created_date] ``` --- ## The quality checks Each check returns one or more `dq_result` objects with four key fields: `check_id`, `status` (`PASS` / `WARN` / `FAIL` / `INFO`), `observed`, and `message`. The overall run status is the worst single status across all checks. ### Single-snapshot checks (QC series) These run on the current file alone. | ID | Name | Triggers on | Severity | |----|------|-------------|----------| | QC-01 | Missing rate | Column missing rate > `max_missing_rate` | FAIL | | QC-02 | Empty column | Column is 100% empty | FAIL | | QC-03 | Duplicate rows | Any fully-identical rows | WARN | | QC-04 | Row count | Always | INFO | | QC-05 | Column count | Always | INFO | | QC-06 | Inferred type | Always (date / numeric / character / unknown) | INFO | | QC-07 | Numeric stats | Numeric columns (min, max, mean, SD) | INFO | | QC-08 | Distinct count | Character columns | INFO | | QC-09 | Allowed values | Values outside `allowed_values` list | FAIL | | QC-10 | Numeric bounds | Values outside `min_value` / `max_value` | FAIL | | QC-11 | Non-numeric values | Non-numeric values in a numeric column | WARN / FAIL | | QC-12 | Key uniqueness | Duplicate values in a `key_columns` column | FAIL | | QC-13 | Pattern / regex | Values not matching `pattern` | FAIL | | QC-14 | Minimum row count | Row count below `min_row_count` | FAIL | ### Schema contract checks (SC series) Only run when `expected_columns` is set. | ID | Name | Triggers on | Severity | |----|------|-------------|----------| | SC-01 | Unexpected column | Column in file not in `expected_columns` | FAIL | | SC-02 | Missing expected column | Column in `expected_columns` absent from file | FAIL | ### Version comparison checks (CP series) Only run when a previous file is available. Skipped automatically in single-file mode. | ID | Name | Triggers on | Severity | |----|------|-------------|----------| | CP-01 | Row count change | Change > `max_row_count_change_pct` | WARN | | CP-02 | Schema diff | Columns added, dropped, or type-changed | WARN | | CP-03 | Missing rate change | Increase > `max_missing_rate_change_pp` pp | WARN | | CP-04 | Numeric mean shift | Shift > `max_numeric_mean_shift_pct` | WARN | | CP-05 | New distinct values | Values in current not seen in previous | INFO | | CP-06 | Dropped distinct values | Values in previous absent from current | INFO | | CP-07 | Non-numeric rate change | Increase > `max_non_numeric_rate_change_pp` pp | WARN | | CP-08 | Column order change | Column order differs from previous | WARN (CSV) / FAIL (FWF) | --- ## Type inference `infer_col_type()` classifies each column after whitespace trimming. Rules are applied in priority order: 1. **date** — all non-empty values parse as a known date format (`%Y-%m-%d`, `%d/%m/%Y`, `%m/%d/%Y`, `%Y%m%d`, `%d-%m-%Y`) 2. **numeric** — ≥ `type_inference_threshold` of non-empty values coerce to numeric 3. **character** — everything else 4. **unknown** — all values are empty or `NA` The threshold defaults to 90%, meaning a column with up to 10% bad values is still classified as numeric, making those bad values visible to QC-11 (non-numeric contamination). It is configurable per dataset via `type_inference_threshold` in `default_rules` or `rule_overrides`: ```yaml # raise threshold — stricter, fewer columns classified as numeric rule_overrides: type_inference_threshold: 0.95 # lower threshold — more lenient, columns with more noise still classified numeric rule_overrides: type_inference_threshold: 0.80 ``` Changing the threshold affects QC-06, QC-07, QC-08, QC-11, CP-02, CP-04, CP-05, CP-06, and CP-07. Note that changing it between runs may cause CP-02 to report a spurious type change on columns that sit close to the old and new thresholds, even if the underlying data did not change. ```{r type-inference} infer_col_type(c("2024-01-15", "2024-06-30")) # "date" infer_col_type(c("100", "200", "N/A")) # "character" (only 67% numeric) infer_col_type(c("100", "200", "N/A", rep("50", 17))) # "numeric" (95% numeric) infer_col_type(c(NA, "", NA)) # "unknown" # with a custom threshold: infer_col_type(c(rep("1", 17), "a", "b", "c"), threshold = 0.80) # "numeric" (85%) ``` --- ## Running a check ```{r run-check} result <- run_dq_check( dataset_name = "customer_accounts", config_dir = "path/to/configs", # contains dqcheckr.yml + customer_accounts.yml open_report = TRUE # open HTML in browser when interactive ) # Console output (always printed): # [dqcheckr] customer_accounts: FAIL - 1 warning(s), 2 failure(s). Report: reports/...html result$status # "PASS", "WARN", or "FAIL" result$report_path # full path to the HTML file result$snapshot_id # integer row ID in snapshots table ``` The HTML report contains: 1. **Header** — dataset name, file name, run time, overall status badge 2. **File summary** — row count, column count, file size, format, encoding 3. **Quality checks table** — every check result, failures first 4. **Custom checks table** — if configured 5. **Version comparison** — what changed since the previous delivery 6. **Historical trend** — row count and check outcomes across the last 10 runs (appears once at least two runs have been recorded) 7. **Appendix** — full column-level statistics table --- ## Calling individual checks You can call any check function directly, without going through the full pipeline. Every check takes `(df, config)` and returns a list of `dq_result` objects. ```{r individual-checks} library(dqcheckr) # Build a minimal config programmatically cfg <- list( rules = list( max_missing_rate = 0.05, max_non_numeric_rate = 0.01, min_row_count = 0 ), column_rules = list( status = list(allowed_values = c("ACTIVE", "CLOSED")) ), key_columns = "id", expected_columns = NULL ) df <- read.csv("data/customer_accounts.csv", colClasses = "character") df[] <- lapply(df, trimws) # Run a single check results <- check_missing_rate(df, cfg) # Inspect results[[1]]$status # "PASS" or "FAIL" results[[1]]$observed # e.g. "3.2% missing (4 of 125)" # Run all single-snapshot checks at once all_results <- run_qc_checks(df, cfg) # Overall status overall_status(all_results) # "FAIL" > "WARN" > "PASS" > "INFO" ``` --- ## Custom checks Create a plain `.R` file that defines `custom_checks(df)`. It receives the trimmed data frame and must return a list of `dq_result` objects. ```{r custom-file} # File: custom/customer_accounts_checks.R custom_checks <- function(df) { results <- list() # Rule: ACTIVE accounts must not have a zero balance active_zero <- df[df$account_status == "ACTIVE" & !is.na(df$account_balance) & df$account_balance == "0", ] n <- nrow(active_zero) results <- c(results, list(dq_result( check_id = "CUST-01", check_name = "No zero-balance active accounts", column = "account_balance", # enables per-column storage in SQLite status = if (n > 0) "FAIL" else "PASS", observed = sprintf("%d ACTIVE account(s) with balance 0", n), message = if (n > 0) sprintf("%d ACTIVE account(s) have a zero balance.", n) else "No ACTIVE accounts have a zero balance." ))) results } ``` Point to it in the dataset config: ```yaml custom_checks_file: "custom/customer_accounts_checks.R" ``` The file is sourced in an isolated environment (not the global session). `dq_result()` is available inside it. Any error — missing file, undefined function, runtime failure — stops the run with a clear message. **Tip**: always set `column =` when your check is scoped to a specific column. Results with `column = NA` (the default) appear in the HTML report and contribute to the overall status, but are not written to the `column_snapshots` SQLite table and therefore do not appear in per-column trend queries. --- ## The snapshot database Every run writes two rows to a shared SQLite database: one row in `snapshots` (run-level summary) and one row per column-stat in `column_snapshots`. ```{r snapshot-query} library(DBI) library(RSQLite) con <- dbConnect(SQLite(), "data/snapshots.sqlite") # Recent runs for one dataset dbGetQuery(con, "SELECT id, file_name, overall_status, check_fail_count, run_timestamp FROM snapshots WHERE dataset_name = 'customer_accounts' ORDER BY id DESC LIMIT 10") # Column-level stats for the most recent run dbGetQuery(con, "SELECT column_name, dq_check, value, threshold FROM column_snapshots WHERE snapshot_id = ( SELECT MAX(id) FROM snapshots WHERE dataset_name = 'customer_accounts' ) ORDER BY column_name, dq_check") dbDisconnect(con) ``` Because all datasets share one database, you can query across them: ```{r cross-dataset} dbGetQuery(con, "SELECT dataset_name, COUNT(*) AS runs, SUM(check_fail_count) AS total_failures FROM snapshots GROUP BY dataset_name") ``` The `output/` directory (and database file) are created automatically on the first run if they do not exist. --- ## Worked example — Star Wars dataset The following uses `dplyr::starwars` exported to CSV and fixed-width files. ```{r starwars-setup} # Config directory contains: # dqcheckr.yml — global thresholds # starwars_csv.yml — CSV dataset config # starwars_fwf.yml — FWF dataset config # Run checks on both formats result_csv <- run_dq_check("starwars_csv", config_dir = "config", open_report = TRUE) result_fwf <- run_dq_check("starwars_fwf", config_dir = "config", open_report = TRUE) ``` The Star Wars config sets `max_missing_rate: 0.60` because `vehicles` (87%) and `starships` (77%) are naturally sparse — most characters have none. Both columns FAIL, which is the correct finding: they are genuinely incomplete and a consumer should be aware before using them. Expected console output: ``` [dqcheckr] starwars_csv: FAIL - 0 warning(s), 2 failure(s). Report: output/reports/starwars_csv_....html [dqcheckr] starwars_fwf: FAIL - 0 warning(s), 2 failure(s). Report: output/reports/starwars_fwf_....html ``` --- ## Error handling | Situation | Behaviour | |-----------|-----------| | Config file not found | Stop immediately; no output written | | Data file not found | Stop immediately; no output written | | Only one file available | Single-file mode; CP checks skipped | | File cannot be parsed | Stop with filename and parse error | | SC-01 / SC-02 violations | FAIL recorded; run continues; report written | | Custom checks file missing | Stop with message | | `custom_checks()` not defined | Stop with message | | `custom_checks()` runtime error | Stop with R error detail | | SQLite write fails | Warning emitted; HTML report still written | The package never modifies, moves, or deletes source files. --- ## Design principles **Owned and evolvable.** Every check is plain R — readable, modifiable, and not hidden behind a third-party API. **Plain English outputs.** Reports are written for data management staff, not engineers. No R code, no package names, no jargon. **Fail loudly.** Configuration errors, missing files, and custom-check failures all stop the run with a clear message. Nothing is silently skipped. **Non-destructive.** The package reads files and writes reports and database records. Source files are never touched.