---
title: "LLM-Assisted Data Cleaning with llmclean"
author: "Sadikul Islam"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    number_sections: true
vignette: >
  %\VignetteIndexEntry{LLM-Assisted Data Cleaning with llmclean}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse  = TRUE,
  comment   = "#>",
  fig.width = 7,
  out.width = "100%",
  warning   = FALSE,
  message   = FALSE
)
```

# Introduction

Data frames collected from surveys, administrative records, or manual entry almost always contain semantic inconsistencies that rule-based tools cannot catch. The `llmclean` package addresses this by sending a compact representation of your data to a large language model (LLM) and requesting structured, human-readable suggestions for each issue found.

The key distinction from existing tools:

| Tool | What it catches |
|------|-----------------|
| `janitor` | Column name formatting |
| `validate` / `pointblank` | Rule-based type/range checks |
| **`llmclean`** | **Semantic meaning: typos, abbreviations, case, malformed formats, cross-field contradictions** |

No existing CRAN package provides the detect → explain → suggest → apply workflow specialised for semantic data frame inconsistencies.
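To make the distinction concrete, here is a toy illustration (constructed for this vignette, not part of the package's data) of values that pass a rule-based type check yet are semantically inconsistent:

```r
# Hypothetical toy data: every value is a non-missing character string,
# so a type/completeness rule is satisfied -- yet four values are
# semantically inconsistent (case variants, a typo, a malformed email).
df <- data.frame(
  status = c("active", "Active", "actve", "INACTIVE"),   # case + typo
  email  = c("a@x.com", "b@x", "c@x.com", "d@x.com"),    # "b@x" is malformed
  stringsAsFactors = FALSE
)

# A rule like is.character(df$status) passes; recognising that "actve"
# means "active" requires edit-distance or semantic reasoning instead.
stopifnot(is.character(df$status), !anyNA(df$status))
```

Tools such as `validate` could flag `"b@x"` with a hand-written regex, but the `status` variants require knowing which spelling is canonical.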
## Supported Providers

```
openai    → GPT-4o, GPT-4o-mini   (API key required)
anthropic → Claude Haiku / Sonnet (API key required)
google    → Gemini 2.0 Flash      (free tier available)
groq      → LLaMA 3.1, Mixtral    (free tier available)
ollama    → Any local model       (no key, fully offline)
offline   → Statistical fallback  (no API, no internet)
```

---

# Quick Start

```{r load}
library(llmclean)
library(dplyr)
```

## Step 1 — Configure the provider

For this vignette we use offline mode, which requires no API key:

```{r provider}
set_llm_provider("offline")
```

With a real LLM provider (not run here):

```{r provider-llm, eval=FALSE}
# Free Groq tier — fastest inference
set_llm_provider("groq",
                 api_key = Sys.getenv("GROQ_API_KEY"),
                 model   = "llama-3.1-8b-instant")

# OpenAI
set_llm_provider("openai",
                 api_key = Sys.getenv("OPENAI_API_KEY"),
                 model   = "gpt-4o-mini")

# Anthropic Claude
set_llm_provider("anthropic",
                 api_key = Sys.getenv("ANTHROPIC_API_KEY"),
                 model   = "claude-haiku-4-5-20251001")

# Local Ollama (no key needed, model must be installed)
set_llm_provider("ollama", model = "llama3")
```

## Step 2 — Inspect the built-in messy datasets

```{r data}
data(messy_employees)
data(messy_survey)

cat("messy_employees:", nrow(messy_employees), "rows x",
    ncol(messy_employees), "cols\n\n")

# Peek at known issues
cat("Status variants:\n")
print(table(messy_employees$status))

cat("\nDepartment variants:\n")
print(table(messy_employees$department))

cat("\nAge outliers:",
    messy_employees$age[messy_employees$age < 0 | messy_employees$age > 100],
    "\n")
```

The data contains at least six types of inconsistency across eight columns — typical of manually entered HR records.

---

# Stage 1 — Detect Issues

`detect_issues()` calls the LLM (or offline fallback) and returns a tidy tibble with one row per detected problem.

```{r detect}
issues <- detect_issues(
  messy_employees,
  context = "HR employee records. Status values should be 'active' or 'inactive'."
)

cat("Issues found:", nrow(issues), "\n\n")
print(issues[, c("column", "row_index", "value", "issue_type",
                 "suggestion", "confidence")])
```

## Issue type breakdown

```{r issue-types}
# Summary by type
as.data.frame(table(Type = issues$issue_type)) |>
  dplyr::arrange(dplyr::desc(Freq))
```

## Case inconsistencies

```{r case-issues}
# Show all case inconsistencies found
issues[issues$issue_type == "case",
       c("column", "row_index", "value", "suggestion", "confidence")]
```

## Typos and near-duplicates (Levenshtein)

The offline detector uses `utils::adist()` to compute Levenshtein edit distances between all pairs of unique values in each column. Values within 2 edits of a more frequent value are flagged as typos.

```{r typo-issues}
issues[issues$issue_type == "typo",
       c("column", "row_index", "value", "suggestion", "explanation")]
```

## Malformed email addresses

```{r format-issues}
issues[issues$issue_type == "format",
       c("column", "row_index", "value", "suggestion")]
```

## Numeric outliers (Tukey outer fence)

```{r outlier-issues}
issues[issues$issue_type == "outlier",
       c("column", "row_index", "value", "explanation")]
```

---

# Stage 2 — Suggest Fixes

`suggest_fixes()` can enrich low-confidence suggestions by re-querying the LLM with surrounding row context. In offline mode it returns the issues unchanged.
```{r suggest}
enriched <- suggest_fixes(messy_employees, issues)

cat("Enriched columns:", paste(names(enriched), collapse = ", "), "\n")

# Show suggestions for status column
enriched[enriched$column == "status",
         c("row_index", "value", "suggestion", "alternatives",
           "confidence_revised")]
```

---

# Stage 3 — Apply Fixes

`apply_fixes()` has two modes:

- **`confirm = FALSE`**: apply all fixes above `min_confidence` automatically (batch mode)
- **`confirm = TRUE`**: interactive review, one fix at a time

```{r apply-noninteractive}
# Non-interactive: apply fixes with confidence >= 0.88
df_clean <- apply_fixes(
  messy_employees, enriched,
  confirm        = FALSE,
  min_confidence = 0.88
)

cat("Status before:", paste(sort(unique(messy_employees$status)), collapse = ", "), "\n")
cat("Status after: ", paste(sort(unique(df_clean$status)), collapse = ", "), "\n\n")

cat("Department before:", paste(sort(unique(messy_employees$department)), collapse = ", "), "\n")
cat("Department after: ", paste(sort(unique(df_clean$department)), collapse = ", "), "\n")
```

## Dry run mode

Before applying anything, preview what would change:

```{r dry-run}
plan <- apply_fixes(messy_employees, enriched, dry_run = TRUE)

cat("Planned changes:\n")
print(plan[, c("column", "row_index", "current_value", "suggestion", "issue_type")])
```

---

# Stage 4 — Offline Detection (No API Key)

`offline_detect()` runs without any LLM using three statistical methods:

1. **Levenshtein distance** (Chaudhuri et al., 2003) for typo detection
2. **Regex patterns** for email, date, and phone format validation
3. **Tukey outer fence** (Tukey, 1977) for numeric outlier detection

```{r offline}
# Works completely offline
offline_issues <- offline_detect(
  messy_survey,
  issue_types       = c("case", "typo", "format", "outlier"),
  max_edit_distance = 2L
)

cat("Survey issues found:", nrow(offline_issues), "\n\n")
offline_issues[, c("column", "value", "issue_type", "suggestion", "confidence")]
```

---

# Stage 5 — Summary Report

`llmclean_report()` produces a structured audit log suitable for reproducible data quality documentation.

```{r report}
rpt <- llmclean_report(messy_employees, df_clean, issues)
```

```{r report-summary}
cat("Summary by column and type:\n")
print(rpt$summary)

cat("\nCell-level changes (first 8):\n")
print(head(rpt$changes, 8))

cat("\nMetadata:\n")
cat("  Provider :", rpt$metadata$provider, "\n")
cat("  Model    :", rpt$metadata$model, "\n")
cat("  Detected :", rpt$metadata$n_total, "\n")
cat("  Applied  :", rpt$metadata$n_applied, "\n")
```

---

# Complete Workflow (One Pipeline)

```{r full-pipeline, eval=FALSE}
library(llmclean)

# 1. Configure provider (use Groq free tier)
set_llm_provider("groq",
                 api_key = Sys.getenv("GROQ_API_KEY"),
                 model   = "llama-3.1-8b-instant")

# 2. Load data
data(messy_employees)

# 3. Detect semantic issues
issues <- detect_issues(
  messy_employees,
  context = "Employee records. Status: active/inactive. Age: 18-70."
)

# 4. Enrich low-confidence suggestions
enriched <- suggest_fixes(messy_employees, issues, n_alternatives = 2L)

# 5. Apply fixes non-interactively
df_clean <- apply_fixes(messy_employees, enriched,
                        confirm = FALSE, min_confidence = 0.80)

# 6. Generate audit report
llmclean_report(messy_employees, df_clean, issues)
```

---

# API Key Storage Best Practice

Never hardcode API keys in scripts. Store them in `.Renviron`:

```r
# ~/.Renviron
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GROQ_API_KEY=gsk_...
GOOGLE_API_KEY=AIza...
```

Then retrieve with `Sys.getenv("OPENAI_API_KEY")`.
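Before configuring a keyed provider, it can help to check that the variable is actually set. A minimal defensive sketch (the `has_key()` helper is ours, not part of `llmclean`); note that `Sys.getenv()` returns `""`, not `NA`, for an unset variable:

```r
# Check that an environment variable is set and non-empty.
# Sys.getenv() returns "" (not NA) when a variable is unset,
# so nzchar() is the right test.
has_key <- function(var) nzchar(Sys.getenv(var))

if (has_key("GROQ_API_KEY")) {
  # set_llm_provider("groq", api_key = Sys.getenv("GROQ_API_KEY"))
  message("GROQ_API_KEY found.")
} else {
  message("GROQ_API_KEY not set; falling back to offline mode.")
  # set_llm_provider("offline")
}
```

Passing an empty `api_key` to a provider typically fails only at the first request, so failing fast here gives a clearer error message.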
The `usethis` package provides `usethis::edit_r_environ()` to open this file.

---

# Session Information

```{r session}
sessionInfo()
```

---

# References

Chaudhuri, S., Ganjam, K., Ganti, V. and Motwani, R. (2003). Robust and efficient fuzzy match for online data cleaning. *Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data*, 313--324.

de Jonge, E. and van der Loo, M. (2013). An introduction to data cleaning with R. *Statistics Netherlands Discussion Paper*.

Levenshtein, V.I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. *Soviet Physics Doklady*, 10(8), 707--710.

Müller, H. and Freytag, J.C. (2003). Problems, methods, and challenges in comprehensive data cleansing. *Technical Report HUB-IB-164*, Humboldt University Berlin.

Tukey, J.W. (1977). *Exploratory Data Analysis*. Addison-Wesley. ISBN: 978-0-201-07616-5.

van der Loo, M.P.J. and de Jonge, E. (2018). *Statistical Data Cleaning with Applications in R*. John Wiley & Sons.