--- title: "Optimizing the Matching Process with a Random Search Algorithm" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Optimizing the Matching Process with a Random Search Algorithm} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} library(vecmatch) data(cancer) # or data("cancer", package = "vecmatch") formula_cancer <- formula(status ~ age * sex) knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.asp = 0.8, echo = TRUE, echo = TRUE, warning = FALSE, message = FALSE, cache = TRUE # so that estimate_gps doesn’t re-run on every knit ) ``` # Practical Example: Optimizing the Matching Process Matching observations based on generalized propensity scores involves tuning multiple hyperparameters. Running separate workflows with different parameter combinations can be tedious, and the effects of some parameters are not always predictable. To streamline this process, vecmatch provides an automated optimization workflow using a random search algorithm. The function `optimize_gps()` is implemented with multithreading to leverage computational resources efficiently. Step 1: Define the Formula, Data, and Optimization Space In this example, we use the built-in `cancer` dataset and focus on two predictors: the categorical variable `sex` and the continuous variable `age`. We first specify the model formula. Note that `data` and `formula` are fixed inputs; if you want to compare different formulas, you must run the workflow separately for each. ```{r include=FALSE} library(vecmatch) formula_cancer <- formula(status ~ age * sex) ``` Next, we define the search space for the hyperparameters. The helper function `make_opt_args()` validates your inputs and computes the Cartesian product of all specified values, reporting the total number of parameter combinations. ```{r} opt_args <- make_opt_args( data = cancer, formula = formula_cancer, reference = c("control", "adenoma", "crc_beningn", "crc_malignant"), gps_method = c("m1", "m7", "m8"), matching_method = c("fullopt", "nnm"), caliper = seq(0.01, 5, 0.01), cluster = 1:3, ratio = 1:3, min_controls = 1:3, max_controls = 1:3 ) opt_args ``` The `print()` method for `make_opt_args()` provides a clear summary of the search space, including each hyperparameter’s values and the total number of combinations. # Step 2: Run the Optimizer With the search space defined, we can launch the optimization. The `optimize_gps()` function performs a random search across the parameter grid and returns a summary table containing key quality metrics for each tested combination. You control the number of iterations via the `n_iter` argument. The function uses multithreading (via the `future` package) to parallelize work. As a guideline, aim for at least 1000–1500 iterations per core for reliable search coverage. Monitor your system’s memory usage, since the parallel backend can consume substantial RAM. The function automatically registers a multisession backend and, after computing and matching the GPS with `foreach` and `doRNG`, it cleans up the parallel workers to free memory. In a future release, we plan to allow users to select alternative backends, such as an external compute cluster, for greater flexibility. By default, `optimize_gps()` preserves the global random seed. For reproducibility, set a seed before calling the optimizer. ```{r warning=FALSE, message=FALSE} set.seed(167894) seed_before <- .Random.seed opt_results <- optimize_gps( data = cancer, formula = formula_cancer, opt_args = opt_args, n_iter = 1500 ) ``` We ran the optimization on a single core with `n_iter = 1500`; on our test machine this required `r attr(opt_results, "optimization_time")` seconds. Given the size of the parameter grid, increasing `n_iter` would improve the search’s coverage, but here we limited iterations to keep the vignette’s build time reasonable. When you print `opt_results`, it summarizes the entire search by grouping parameter sets into bins defined by their maximum standardized mean difference (SMD), and within each bin it highlights the combination that achieves the highest proportion of matched observations. # Step 3: Select Optimal Configurations After optimization, `select_opt()` helps you choose parameter combinations that meet your specific objectives. For example, you may wish to maximize the number of matched samples for certain treatment groups while minimizing imbalance on key covariates. In this example, we aim to: * Retain the largest possible number of observations in the `adenoma` and `crc_malignant` groups. * Minimize the standardized mean difference (SMD) for the `age` variable. We can achieve this by specifying arguments in `select_opt()`: ```{r} select_results <- select_opt(opt_results, perc_matched = c("adenoma", "crc_malignant"), smd_variables = "age", smd_type = "max" ) ``` The output shows the SMD bins and highlights the combination within each bin that best meets our criteria. Suppose the configuration in the `0.10–0.15` SMD bin offers a desirable balance of matched samples; we can extract its parameters for manual refitting. # Step 4: Refit the Optimized Model Manually To inspect the matched dataset and detailed balance summaries, we rerun the standard `vecmatch` workflow using the selected parameters. 1. Extract the parameter data frame and filter the chosen SMD bin: ```{r} param_df <- attr(select_results, "param_df") subset(param_df, smd_group == "0.10-0.15") ``` 2. Use these values to call `estimate_gps()` and `match_gps()`. For example: ```{r estimate_gps} # estimating gps gps_mat <- estimate_gps(formula_cancer, cancer, method = "multinom", link = "generalized_logit", reference = "adenoma" ) ``` ```{r} # csr with refitting csr_mat <- csregion(gps_mat) # matching matched_df <- match_gps(csr_mat, method = "nnm", caliper = 4.29, reference = "adenoma", ratio = 2, order = "asc", replace = TRUE, ties = TRUE, kmeans_cluster = 1 ) ``` 3. Evaluate balance and verify reproducibility: ```{r} balqual(matched_df, formula = formula_cancer, type = "smd", statistic = "max", round = 3, cutoffs = 0.2 ) all.equal(seed_before, .Random.seed) ```