---
title: "Optimizing the Matching Process with a Random Search Algorithm"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Optimizing the Matching Process with a Random Search Algorithm}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
library(vecmatch)
data(cancer) # or data("cancer", package = "vecmatch")
formula_cancer <- formula(status ~ age * sex)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.asp = 0.8,
  echo = TRUE,
  echo = TRUE,
  warning = FALSE,
  message = FALSE,
  cache = TRUE # so that estimate_gps doesn’t re-run on every knit
)
```

# Practical Example: Optimizing the Matching Process
Matching observations based on generalized propensity scores involves tuning
multiple hyperparameters. Running separate workflows with different parameter
combinations can be tedious, and the effects of some parameters are not always
predictable. To streamline this process, vecmatch provides an automated
optimization workflow using a random search algorithm. The function
`optimize_gps()` is implemented with multithreading to leverage computational
resources efficiently.

Step 1: Define the Formula, Data, and Optimization Space

In this example, we use the built-in `cancer` dataset and focus on two
predictors: the categorical variable `sex` and the continuous variable `age`. We
first specify the model formula. Note that `data` and `formula` are fixed
inputs; if you want to compare different formulas, you must run the workflow
separately for each.

```{r include=FALSE}
library(vecmatch)

formula_cancer <- formula(status ~ age * sex)
```

Next, we define the search space for the hyperparameters. The helper function
`make_opt_args()` validates your inputs and computes the Cartesian product of
all specified values, reporting the total number of parameter combinations.

```{r}
opt_args <- make_opt_args(
  data            = cancer,
  formula         = formula_cancer,
  reference       = c("control", "adenoma", "crc_beningn", "crc_malignant"),
  gps_method      = c("m1", "m7", "m8"),
  matching_method = c("fullopt", "nnm"),
  caliper         = seq(0.01, 5, 0.01),
  cluster         = 1:3,
  ratio           = 1:3,
  min_controls    = 1:3,
  max_controls    = 1:3
)

opt_args
```

The `print()` method for `make_opt_args()` provides a clear summary of the
search space, including each hyperparameter’s values and the total number of
combinations.

# Step 2: Run the Optimizer
With the search space defined, we can launch the optimization. The
`optimize_gps()` function performs a random search across the parameter grid and
returns a summary table containing key quality metrics for each tested
combination. You control the number of iterations via the `n_iter` argument.

The function uses multithreading (via the `future` package) to parallelize work.
As a guideline, aim for at least 1000–1500 iterations per core for reliable
search coverage. Monitor your system’s memory usage, since the parallel backend
can consume substantial RAM.

The function automatically registers a multisession backend and, after computing
and matching the GPS with `foreach` and `doRNG`, it cleans up the parallel
workers to free memory. In a future release, we plan to allow users to select
alternative backends, such as an external compute cluster, for greater
flexibility.

By default, `optimize_gps()` preserves the global random seed. For
reproducibility, set a seed before calling the optimizer.

```{r warning=FALSE, message=FALSE}
set.seed(167894)
seed_before <- .Random.seed

opt_results <- optimize_gps(
  data     = cancer,
  formula  = formula_cancer,
  opt_args = opt_args,
  n_iter   = 1500
)
```

We ran the optimization on a single core with `n_iter = 1500`; on our test
machine this required `r attr(opt_results, "optimization_time")` seconds. Given
the size of the parameter grid, increasing `n_iter` would improve the search’s
coverage, but here we limited iterations to keep the vignette’s build time
reasonable.

When you print `opt_results`, it summarizes the entire search by grouping
parameter sets into bins defined by their maximum standardized mean difference
(SMD), and within each bin it highlights the combination that achieves the
highest proportion of matched observations.

# Step 3: Select Optimal Configurations
After optimization, `select_opt()` helps you choose parameter combinations that
meet your specific objectives. For example, you may wish to maximize the number
of matched samples for certain treatment groups while minimizing imbalance on
key covariates.

In this example, we aim to:
  * Retain the largest possible number of observations in the `adenoma` and `crc_malignant` groups.
  * Minimize the standardized mean difference (SMD) for the `age` variable.

We can achieve this by specifying arguments in `select_opt()`:

```{r}
select_results <- select_opt(opt_results,
  perc_matched = c("adenoma", "crc_malignant"),
  smd_variables = "age",
  smd_type = "max"
)
```
The output shows the SMD bins and highlights the combination within each bin
that best meets our criteria. Suppose the configuration in the `0.10–0.15` SMD
bin offers a desirable balance of matched samples; we can extract its parameters
for manual refitting.

# Step 4: Refit the Optimized Model Manually
To inspect the matched dataset and detailed balance summaries, we rerun the
standard `vecmatch` workflow using the selected parameters.

1. Extract the parameter data frame and filter the chosen SMD bin:

```{r}
param_df <- attr(select_results, "param_df")

subset(param_df, smd_group == "0.10-0.15")
```

2. Use these values to call `estimate_gps()` and `match_gps()`. For example:

```{r estimate_gps}
# estimating gps
gps_mat <- estimate_gps(formula_cancer,
  cancer,
  method = "multinom",
  link = "generalized_logit",
  reference = "adenoma"
)
```

```{r}
# csr with refitting
csr_mat <- csregion(gps_mat)

# matching
matched_df <- match_gps(csr_mat,
  method = "nnm",
  caliper = 4.29,
  reference = "adenoma",
  ratio = 2,
  order = "asc",
  replace = TRUE,
  ties = TRUE,
  kmeans_cluster = 1
)
```

3. Evaluate balance and verify reproducibility:
```{r}
balqual(matched_df,
  formula = formula_cancer,
  type = "smd",
  statistic = "max",
  round = 3,
  cutoffs = 0.2
)

all.equal(seed_before, .Random.seed)
```