--- title: "Multi-Database Benchmarking: Old vs New Cohort Generation" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Multi-Database Benchmarking: Old vs New Cohort Generation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Overview This vignette describes how to benchmark **CDMConnector::generateCohortSet** (old, CIRCE-based) against **atlasCohortGenerator::generateCohortSet2** (new, DAG-optimized batch) across multiple database platforms. The benchmarking script: 1. Runs both methods on each database with the same cohort set. 2. Records **overall time** for each method and writes results to a CSV. 3. **Confirms that the two cohort tables have identical rows** (order ignored) and writes per-database and per-cohort equivalence results to a second CSV. Supported platforms include **PostgreSQL**, **Redshift**, **Snowflake**, **Spark**, and **SQL Server**. You provide a named list of CDM reference objects; the script handles timing, comparison, and CSV output. ## Performance improvements with the new approach The new approach (`generateCohortSet2`) uses a **DAG-based batch optimizer** that: - **Shares vocabulary expansion**: Concept set expansion and codeset building are done once and reused across all cohorts, instead of once per cohort. - **Shares domain scans**: Filtered domain tables (e.g. drug exposure, condition occurrence) are built once and read by every cohort that needs them. - **Reduces I/O**: A single batch script writes to shared staging tables and finalizes in one pass, instead of N separate DELETE/INSERT cycles. As a result, **wall-clock time typically decreases** as the number of cohorts and the overlap in concept sets increase. The ratio (new time / old time) is often **below 1.0**, with larger batches showing greater speedups. The exact improvement depends on: - Number of cohorts and size of each definition - Overlap in concept sets and domains across cohorts - Database engine and hardware The benchmarking script records `time_old_sec`, `time_new_sec`, and `ratio_new_over_old` per database so you can measure the speedup on your own data and platforms. ## How to run the benchmark ### Prerequisites - **CDMConnector** and **atlasCohortGenerator** installed (or `devtools::load_all()` for the latter). - One or more **live CDM connections** (e.g. Postgres, Redshift, Snowflake, Spark, SQL Server) as CDM reference objects. - A **cohort set** (e.g. from `CDMConnector::readCohortSet("path/to/cohorts")`). ### Single database For one CDM, use the single-database benchmark and optional equivalence check: ```r source("extras/benchmark_cohort_generation.R") cohort_set <- CDMConnector::readCohortSet("path/to/cohorts") result <- benchmark_cohort_generation(cdm, cohort_set, cohort_path = "path/to/cohorts") # Compare old vs new cohort tables (identical rows, order ignored) cmp <- compare_cohort_tables(result$cdm, name_old = "cohort_bench_old", name_new = "cohort_bench_new") cmp$identical # TRUE if same set of rows cmp$per_cohort # Per-cohort row counts and match status ``` ### Multiple databases Pass a **named list of CDM objects**; names are used as the `database` identifier in the output CSVs (e.g. 
### Multiple databases

Pass a **named list of CDM objects**; the names are used as the `database` identifier in the output CSVs (e.g. `postgres`, `redshift`, `snowflake`, `spark`, `sql_server`):

```r
source("extras/benchmark_cohort_generation.R")
source("extras/benchmark_multi_database.R")

cohort_set <- CDMConnector::readCohortSet("path/to/cohorts")

cdms <- list(
  postgres   = cdm_postgres,
  redshift   = cdm_redshift,
  snowflake  = cdm_snowflake,
  spark      = cdm_spark,
  sql_server = cdm_sqlserver
)

run_benchmark_multi_database(
  cdms = cdms,
  cohort_set = cohort_set,
  cohort_path = "path/to/cohorts",
  results_csv = "benchmark_results.csv",
  equivalence_csv = "benchmark_equivalence.csv"
)
```

- **benchmark_results.csv**: one row per database with `database`, `time_old_sec`, `time_new_sec`, `ratio_new_over_old`, `n_cohorts`, `files_included`, `status`.
- **benchmark_equivalence.csv**: one row per database (overall) plus one row per (database, cohort_definition_id) with `n_old`, `n_new`, `rows_identical`, `status`.

## Benchmark results CSV (timing)

The timing CSV has one row per database. Example structure:

```{r, echo = FALSE}
results_example <- data.frame(
  database = c("postgres", "redshift", "snowflake", "sql_server"),
  time_old_sec = c(120.5, 95.2, 88.1, 110.3),
  time_new_sec = c(45.2, 38.0, 32.5, 42.1),
  ratio_new_over_old = c(0.38, 0.40, 0.37, 0.38),
  n_cohorts = 4L,
  files_included = "cohort_a.json; cohort_b.json; cohort_c.json; cohort_d.json",
  status = "ok",
  stringsAsFactors = FALSE
)
knitr::kable(results_example, digits = 2)
```

- **ratio_new_over_old** < 1 means the new method was faster.
- **files_included** lists the cohort definition files (or names) in the cohort set.

## Equivalence CSV (same results)

The equivalence CSV confirms that the **old and new cohort tables contain the same rows** (order ignored). Each database has:

1. An **overall** row (with `cohort_definition_id` NA): total row counts and whether the full tables match.
2. **Per-cohort** rows: row counts from the old table (`n_old`) and the new table (`n_new`), and whether the set of rows for that cohort is identical (`rows_identical`).

Example:

```{r, echo = FALSE}
equiv_example <- data.frame(
  database = c("postgres", "postgres", "postgres", "postgres", "redshift", "redshift"),
  cohort_definition_id = c(NA, 1L, 2L, 3L, NA, 1L),
  n_old = c(15000L, 5000L, 6000L, 4000L, 15000L, 5000L),
  n_new = c(15000L, 5000L, 6000L, 4000L, 15000L, 5000L),
  rows_identical = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE),
  status = "ok",
  stringsAsFactors = FALSE
)
knitr::kable(equiv_example)
```

When **rows_identical** is TRUE for all cohorts (and for the overall row), the new approach produces **exactly the same cohort membership and dates** as the old CIRCE-based method; only the execution strategy and performance differ.

## Summary

| Aspect | Description |
|--------|-------------|
| **Performance** | The new batch optimizer typically reduces wall-clock time (ratio < 1) by sharing vocabulary and domain work across cohorts. |
| **Correctness** | The benchmarking pipeline compares old and new cohort tables row-by-row (order ignored) and writes equivalence results to CSV. |
| **Platforms** | Run the same cohort set on Postgres, Redshift, Snowflake, Spark, and SQL Server by passing a named list of CDMs to `run_benchmark_multi_database()`. |

Use the generated CSVs to document speedups and to confirm identical results across databases and between the two cohort generation methods.
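As a final step, you may want to summarize the generated CSVs. The sketch below is one way to do that in base R; it is not part of the benchmarking scripts and assumes a completed multi-database run that produced the two files with the column names documented above.

```r
# Illustrative post-processing sketch (not part of the benchmarking
# scripts); assumes the CSVs written by run_benchmark_multi_database().
results <- read.csv("benchmark_results.csv")
equiv   <- read.csv("benchmark_equivalence.csv")

# Speedup per database: 1 / ratio_new_over_old, so 2.5 means the new
# method was 2.5x faster on that platform.
ok <- results[results$status == "ok", ]
data.frame(database = ok$database,
           speedup  = round(1 / ok$ratio_new_over_old, 2))

# Any database/cohort combination where the tables differed.
equiv[!equiv$rows_identical,
      c("database", "cohort_definition_id", "n_old", "n_new")]
```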