--- title: "Troubleshooting Guide" author: "Gilles Colling" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Troubleshooting Guide} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, error = TRUE ) library(couplr) library(dplyr) ``` ## Overview This vignette provides solutions to common issues encountered when using couplr. Each section describes a problem, explains why it occurs, and provides step-by-step solutions. **Quick reference:** | Issue | Jump to Section | |-------|-----------------| | No valid pairs found | [Infeasible Problems](#infeasible-problems) | | Poor balance after matching | [Poor Balance](#poor-balance-despite-matching) | | Matching is too slow | [Performance Issues](#performance-issues) | | Memory errors | [Memory Errors](#memory-errors) | | Different results with different methods | [Non-Determinism](#different-results-with-different-methods) | | Unexpected NA in results | [Missing Values](#missing-values-in-data) | | C++ compilation errors | [Installation Issues](#installation-issues) | --- ## Infeasible Problems ### Symptom ```{r, error=TRUE} # This fails: all assignments have Inf cost cost <- matrix(c(1, Inf, Inf, Inf, Inf, Inf, Inf, 2, 3), nrow = 3, byrow = TRUE) result <- lap_solve(cost) ``` Error message: "No feasible assignment found" or total cost is Inf. ### Cause A feasible assignment requires that every row can be assigned to at least one column with finite cost. When rows have all-Inf or all-NA entries, no valid assignment exists. ### Diagnosis ```{r} # Check for infeasible rows check_feasibility <- function(cost_matrix) { finite_per_row <- rowSums(is.finite(cost_matrix)) infeasible_rows <- which(finite_per_row == 0) if (length(infeasible_rows) > 0) { cat("Infeasible rows (no finite costs):", infeasible_rows, "\n") return(FALSE) } finite_per_col <- colSums(is.finite(cost_matrix)) infeasible_cols <- which(finite_per_col == 0) if (length(infeasible_cols) > 0) { cat("Infeasible columns (no finite costs):", infeasible_cols, "\n") return(FALSE) } cat("Problem appears feasible\n") return(TRUE) } # Check the problematic matrix cost <- matrix(c(1, Inf, Inf, Inf, Inf, Inf, Inf, 2, 3), nrow = 3, byrow = TRUE) check_feasibility(cost) ``` ### Solutions **1. Remove infeasible rows/columns:** ```{r} # Remove rows with no valid assignments cost <- matrix(c(1, Inf, Inf, Inf, Inf, Inf, Inf, 2, 3), nrow = 3, byrow = TRUE) valid_rows <- rowSums(is.finite(cost)) > 0 cost_valid <- cost[valid_rows, , drop = FALSE] if (nrow(cost_valid) > 0) { result <- lap_solve(cost_valid) cat("Matched", nrow(result), "of", nrow(cost), "rows\n") } ``` **2. Add fallback costs:** ```{r} # Replace Inf with high (but finite) penalty cost_with_fallback <- cost cost_with_fallback[!is.finite(cost_with_fallback)] <- 1e6 # Large penalty result <- lap_solve(cost_with_fallback) print(result) ``` **3. For matching: check covariate overlap:** ```{r, fig.width=7, fig.height=4, fig.alt="Density plot showing overlap between treatment and control groups on the age variable"} # Simulate poor overlap scenario set.seed(123) left <- tibble(id = 1:50, age = rnorm(50, mean = 25, sd = 3)) right <- tibble(id = 1:50, age = rnorm(50, mean = 55, sd = 3)) # Visualize overlap library(ggplot2) combined <- bind_rows( left %>% mutate(group = "Left"), right %>% mutate(group = "Right") ) ggplot(combined, aes(x = age, fill = group)) + geom_density(alpha = 0.5) + labs(title = "Poor Covariate Overlap", subtitle = "No overlap means no good matches possible") + theme_minimal() + theme(plot.background = element_rect(fill = "transparent", color = NA), panel.background = element_rect(fill = "transparent", color = NA)) ``` --- ## Poor Balance Despite Matching ### Symptom After matching, `balance_diagnostics()` shows |std_diff| > 0.25 for some variables. ```{r} # Example of poor balance set.seed(456) left <- tibble( id = 1:100, age = rnorm(100, 30, 5), income = rnorm(100, 80000, 20000) # Very different from right ) right <- tibble( id = 1:100, age = rnorm(100, 32, 5), income = rnorm(100, 40000, 10000) # Very different income ) result <- match_couples(left, right, vars = c("age", "income"), auto_scale = TRUE) balance <- balance_diagnostics(result, left, right, vars = c("age", "income")) print(balance) ``` ### Causes 1. **Weak overlap**: Groups are too different on key variables 2. **Missing confounders**: Important variables not included in matching 3. **Caliper too loose**: Accepting poor matches 4. **Wrong scaling**: Variables not properly weighted ### Solutions **1. Add more matching variables:** ```{r, eval=FALSE} # Include additional relevant variables result <- match_couples( left, right, vars = c("age", "income", "education", "region"), # More variables auto_scale = TRUE ) ``` **2. Tighten caliper (fewer but better matches):** ```{r} result_strict <- match_couples( left, right, vars = c("age", "income"), max_distance = 0.3, # Stricter caliper auto_scale = TRUE ) cat("Original matches:", result$info$n_matched, "\n") cat("With caliper:", result_strict$info$n_matched, "\n") balance_strict <- balance_diagnostics(result_strict, left, right, vars = c("age", "income")) print(balance_strict) ``` **3. Use blocking on problematic variable:** ```{r, eval=FALSE} # Block on income tertiles to ensure exact balance left$income_cat <- cut(left$income, breaks = 3, labels = c("low", "mid", "high")) right$income_cat <- cut(right$income, breaks = 3, labels = c("low", "mid", "high")) blocks <- matchmaker(left, right, block_type = "group", block_by = "income_cat") result_blocked <- match_couples( blocks$left, blocks$right, vars = c("age"), # Match on age within income blocks block_id = "block_id" ) ``` **4. Try different scaling:** ```{r} # Compare scaling methods for (scale_method in c("robust", "standardize", "range")) { res <- match_couples(left, right, vars = c("age", "income"), auto_scale = TRUE, scale = scale_method) bal <- balance_diagnostics(res, left, right, vars = c("age", "income")) cat(scale_method, "- max |std_diff|:", round(bal$overall$max_abs_std_diff, 3), "\n") } ``` --- ## Performance Issues ### Symptom `match_couples()` or `lap_solve()` takes too long or doesn't complete. ### Cause Optimal matching has $O(n^3)$ complexity. For n = 5,000, this means ~125 billion operations. ### Diagnosis ```{r} # Estimate runtime estimate_runtime <- function(n, seconds_per_billion = 1) { ops <- n^3 time_sec <- ops / 1e9 * seconds_per_billion if (time_sec < 60) { sprintf("%.1f seconds", time_sec) } else if (time_sec < 3600) { sprintf("%.1f minutes", time_sec / 60) } else { sprintf("%.1f hours", time_sec / 3600) } } cat("Estimated runtime for optimal matching:\n") for (n in c(100, 500, 1000, 3000, 5000, 10000)) { cat(sprintf(" n = %5d: %s\n", n, estimate_runtime(n))) } ``` ### Solutions **1. Use greedy matching for large problems:** ```{r} set.seed(789) n <- 500 large_left <- tibble(id = 1:n, x1 = rnorm(n), x2 = rnorm(n)) large_right <- tibble(id = 1:n, x1 = rnorm(n), x2 = rnorm(n)) # Greedy is much faster time_greedy <- system.time({ result_greedy <- greedy_couples( large_left, large_right, vars = c("x1", "x2"), strategy = "row_best" ) }) cat("Greedy matching (n=500):", round(time_greedy["elapsed"], 2), "seconds\n") cat("Quality (mean distance):", round(mean(result_greedy$pairs$distance), 4), "\n") ``` **2. Use blocking to divide the problem:** ```{r, eval=FALSE} # Create clusters to match within blocks <- matchmaker( large_left, large_right, block_type = "cluster", block_vars = c("x1", "x2"), n_blocks = 10 # 10 blocks of ~200 each ) # Match within blocks (10 x O(200^3) << O(2000^3)) result_blocked <- match_couples( blocks$left, blocks$right, vars = c("x1", "x2"), block_id = "block_id" ) ``` **3. Choose a faster algorithm:** ```{r, eval=FALSE} # For n > 1000, auction algorithm often faster result <- match_couples( large_left, large_right, vars = c("x1", "x2"), method = "auction" ) # For sparse problems (many forbidden pairs) result <- match_couples( left, right, vars = vars, max_distance = 0.5, # Creates sparsity method = "sap" # Sparse algorithm ) ``` **4. Pre-compute and cache distances:** ```{r} # Compute once, reuse multiple times dist_cache <- compute_distances( large_left, large_right, vars = c("x1", "x2"), scale = "robust" ) # Fast: reuse cached distances result1 <- match_couples(dist_cache, max_distance = 0.3) result2 <- match_couples(dist_cache, max_distance = 0.5) result3 <- match_couples(dist_cache, max_distance = 1.0) ``` --- ## Memory Errors ### Symptom R crashes or shows: "Error: cannot allocate vector of size X GB" ### Cause A full distance matrix for n×n requires 8n² bytes: ```{r} # Memory requirements memory_needed <- function(n) { bytes <- 8 * n^2 if (bytes < 1e6) { sprintf("%.1f KB", bytes / 1e3) } else if (bytes < 1e9) { sprintf("%.1f MB", bytes / 1e6) } else { sprintf("%.1f GB", bytes / 1e9) } } cat("Memory for full distance matrix:\n") for (n in c(1000, 5000, 10000, 20000, 50000)) { cat(sprintf(" n = %5d: %s\n", n, memory_needed(n))) } ``` ### Solutions **1. Use greedy matching (avoids full matrix):** ```{r, eval=FALSE} # Greedy computes distances on-the-fly result <- greedy_couples( left, right, vars = covariates, strategy = "row_best" # Most memory-efficient ) ``` **2. Use blocking to create smaller subproblems:** ```{r, eval=FALSE} # Each block is much smaller blocks <- matchmaker(left, right, block_type = "cluster", n_blocks = 20) result <- match_couples(blocks$left, blocks$right, vars = vars, block_id = "block_id") ``` **3. Use caliper to create sparse matrix:** ```{r, eval=FALSE} # Caliper excludes distant pairs (sparse representation) result <- match_couples( left, right, vars = covariates, max_distance = 0.5, method = "sap" # Sparse-optimized algorithm ) ``` **4. Increase R's memory limit (Windows):** ```{r, eval=FALSE} # Increase to 16 GB (if available) memory.limit(size = 16000) ``` --- ## Different Results with Different Methods ### Symptom Different algorithms return different assignments: ```{r} cost <- matrix(c(1, 2, 2, 2, 1, 2, 2, 2, 1), nrow = 3, byrow = TRUE) result_jv <- lap_solve(cost, method = "jv") result_hungarian <- lap_solve(cost, method = "hungarian") cat("JV assignment: ", result_jv$target, "\n") cat("Hungarian assignment:", result_hungarian$target, "\n") cat("JV total cost: ", get_total_cost(result_jv), "\n") cat("Hungarian total cost:", get_total_cost(result_hungarian), "\n") ``` ### Cause When multiple optimal solutions exist (tied costs), different algorithms may find different ones. **The total cost should be the same**; if not, report a bug. ### Diagnosis ```{r} # Check for ties check_ties <- function(cost_matrix) { n <- nrow(cost_matrix) # Check if diagonal dominates (trivial ties) diag_costs <- diag(cost_matrix) if (length(unique(diag_costs)) < n) { cat("Tied costs on diagonal - multiple optima likely\n") } # Check cost uniqueness unique_costs <- length(unique(as.vector(cost_matrix))) total_entries <- length(cost_matrix) if (unique_costs < total_entries * 0.5) { cat("Many repeated costs - ties possible\n") } } check_ties(cost) ``` ### Solutions **1. Verify total costs match:** ```{r} # Total cost should be identical stopifnot(get_total_cost(result_jv) == get_total_cost(result_hungarian)) cat("Both methods found optimal solutions (same total cost)\n") ``` **2. Use Hungarian for deterministic tie-breaking:** ```{r} # Hungarian has consistent tie-breaking result <- lap_solve(cost, method = "hungarian") ``` **3. Add small noise to break ties:** ```{r} set.seed(42) cost_perturbed <- cost + matrix(rnorm(9, 0, 1e-10), 3, 3) result <- lap_solve(cost_perturbed) ``` --- ## Missing Values in Data ### Symptom Matching fails or produces unexpected results due to NA values. ```{r, error=TRUE} left <- tibble(id = 1:5, age = c(25, 30, NA, 35, 40)) right <- tibble(id = 1:5, age = c(28, 32, 33, 36, 42)) # This may fail or give unexpected results result <- match_couples(left, right, vars = "age") ``` ### Cause couplr requires complete cases for distance computation. NA values in matching variables cause issues. ### Solutions **1. Remove rows with NA (before matching):** ```{r} left_clean <- left %>% filter(!is.na(age)) right_clean <- right %>% filter(!is.na(age)) result <- match_couples(left_clean, right_clean, vars = "age") cat("Matched", result$info$n_matched, "pairs (excluded 1 left unit with NA)\n") ``` **2. Impute missing values:** ```{r} # Simple mean imputation left_imputed <- left %>% mutate(age = if_else(is.na(age), mean(age, na.rm = TRUE), age)) result <- match_couples(left_imputed, right, vars = "age") cat("Matched", result$info$n_matched, "pairs (imputed 1 NA with mean)\n") ``` **3. Use preprocessing to diagnose:** ```{r} health <- preprocess_matching_vars( left, right, vars = "age" ) print(health) ``` --- ## Installation Issues ### C++ Compilation Errors #### Symptom ``` Error in .Call("_couplr_assignment_impl"): object not found ``` or ``` Error: package 'couplr' was built under R version X.X ``` #### Cause C++ code not compiled properly, or object files are stale. #### Solutions **1. Clean and reinstall:** ```{r, eval=FALSE} # In R: remove.packages("couplr") install.packages("couplr") # Or from GitHub: devtools::install_github("gcol33/couplr", force = TRUE) ``` **2. Clean compiled files (development):** ```powershell # PowerShell Remove-Item src\*.o, src\*.dll -Force ``` ```bash # Bash rm src/*.o src/*.so ``` Then rebuild: ```{r, eval=FALSE} devtools::clean_dll() devtools::load_all() ``` **3. Check Rtools/compiler (Windows):** ```{r} # Check if Rtools is properly configured Sys.which("make") Sys.which("g++") ``` Make sure Rtools is installed and PATH is configured correctly. --- ## Common Error Messages ### "method_used attribute is NULL" **Cause**: Result object wasn't created properly. **Solution**: Use `lap_solve()` instead of `assignment()` for the tidy interface. ### "Cost matrix contains non-finite values" **Cause**: NA or NaN values in cost matrix (not Inf). **Solution**: Replace NA with Inf for forbidden assignments, or remove rows/cols. ```{r} cost <- matrix(c(1, NA, 3, 4), 2, 2) cost[is.na(cost)] <- Inf # Mark as forbidden result <- lap_solve(cost) ``` ### "All rows assigned to Inf columns" **Cause**: Caliper is too strict; all potential matches exceed threshold. **Solution**: Increase `max_distance` or check for overlap issues. ### "subscript out of bounds" **Cause**: Empty left or right data frame, or ID mismatch. **Solution**: Check data dimensions and ID column consistency. ```{r} # Always verify data before matching stopifnot(nrow(left) > 0, nrow(right) > 0) ``` --- ## Getting Help If you encounter an issue not covered here: 1. **Check function documentation**: `?match_couples`, `?lap_solve` 2. **Search GitHub issues**: [github.com/gcol33/couplr/issues](https://github.com/gcol33/couplr/issues) 3. **Create a minimal reproducible example**: ```{r, eval=FALSE} # Minimal example template library(couplr) # Minimal data that reproduces the issue set.seed(123) left <- tibble(id = 1:10, x = rnorm(10)) right <- tibble(id = 1:10, x = rnorm(10)) # Code that causes the error result <- match_couples(left, right, vars = "x") # Expected vs actual behavior # Expected: ... # Actual: [error message] # Session info sessionInfo() ``` 4. **Open an issue** with your example at [github.com/gcol33/couplr/issues/new](https://github.com/gcol33/couplr/issues/new) --- ## See Also - `vignette("getting-started")` - Basic usage - `vignette("matching-workflows")` - Production matching pipelines - `vignette("algorithms")` - Algorithm selection guide - `vignette("comparison")` - Comparison with other packages