---
title: "Building Cohorts from Concept Sets"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Building Cohorts from Concept Sets}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(tidyOhdsiSolutions)
```

## Overview

`cohortFromConceptSet()` builds a complete CirceR-compatible cohort definition
from one or more concept set expressions. It produces a nested R list that can
be serialized to valid JSON with `cohortToJson()`. No Java, CirceR, or Capr
dependency is required.

## Typical Workflow

```
data.frame ──► toConceptSet() ──► cohortFromConceptSet() ──► cohortToJson()
```

1. Start with data frames containing a `concept_id` column (plus optional
   metadata columns).
2. Convert them to CIRCE concept set expressions with `toConceptSet()` or
   `toConceptSets()`.
3. Pass a named list of expressions to `cohortFromConceptSet()`.
4. Serialize the result to JSON with `cohortToJson()`.

## Step 1: Define Concept Sets as Data Frames

Each concept set starts as a data frame with at minimum a `concept_id` column.
Optional columns control inclusion flags and provide metadata:

```{r}
diabetes_df <- data.frame(
  concept_id       = c(201826L, 442793L),
  concept_name     = c("Type 2 diabetes mellitus",
                        "Diabetes mellitus due to insulin resistance"),
  domain_id        = c("Condition", "Condition"),
  vocabulary_id    = c("SNOMED", "SNOMED"),
  standard_concept = c("S", "S"),
  descendants      = c(TRUE, TRUE),
  excluded         = c(FALSE, FALSE)
)

hypertension_df <- data.frame(
  concept_id       = 320128L,
  concept_name     = "Essential hypertension",
  domain_id        = "Condition",
  vocabulary_id    = "SNOMED",
  standard_concept = "S",
  descendants      = TRUE,
  excluded         = FALSE
)
```
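
Before conversion it can help to confirm that a data frame has the expected
shape. The block below is a base R sketch only: `check_concept_df()` is a
hypothetical helper, not a package function, and every rule it enforces beyond
the presence of `concept_id` is an assumption about what a sensible input looks
like.

```r
# Hypothetical helper, not part of tidyOhdsiSolutions: basic sanity checks
# on a concept set data frame before conversion.
check_concept_df <- function(df) {
  stopifnot(is.data.frame(df))
  if (!"concept_id" %in% names(df)) {
    # concept_id is the only column this vignette describes as required
    return("missing required column: concept_id")
  }
  problems <- character(0)
  if (anyNA(df$concept_id)) {
    problems <- c(problems, "concept_id contains NA")
  }
  if (anyDuplicated(df$concept_id) > 0) {
    problems <- c(problems, "concept_id contains duplicates")
  }
  # Assumed convention: flag columns, when present, should be logical
  for (flag in intersect(c("descendants", "excluded"), names(df))) {
    if (!is.logical(df[[flag]])) {
      problems <- c(problems, paste(flag, "is not logical"))
    }
  }
  problems
}

good <- data.frame(concept_id = c(201826L, 442793L), descendants = c(TRUE, TRUE))
bad  <- data.frame(concept_id = c(201826L, 201826L), excluded = c("no", "no"))

good_problems <- check_concept_df(good)  # empty: no problems found
bad_problems  <- check_concept_df(bad)   # duplicate id and non-logical flag
bad_problems
```

Returning a character vector of problems, rather than stopping at the first
one, makes it easier to fix several issues in one pass.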

## Step 2: Convert to Concept Set Expressions

```{r}
diabetes_cs     <- toConceptSet(diabetes_df, name = "Type 2 Diabetes")
hypertension_cs <- toConceptSet(hypertension_df, name = "Hypertension")
```

Or convert multiple at once with `toConceptSets()`:

```{r}
all_cs <- toConceptSets(list(
  "Type 2 Diabetes" = diabetes_df,
  "Hypertension"    = hypertension_df
))
```
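
For orientation, here is a hand-built sketch of what a concept set expression
with `$items` can look like. The field names (`concept`, `isExcluded`,
`includeDescendants`, `includeMapped`) follow the general CIRCE JSON layout;
whether `toConceptSet()` emits exactly these fields is an assumption, and
`df_to_items()` is a hypothetical helper used only for illustration.

```r
# Hypothetical illustration of the CIRCE "items" shape; not the package's code.
df_to_items <- function(df) {
  lapply(seq_len(nrow(df)), function(i) {
    list(
      concept = list(
        CONCEPT_ID   = df$concept_id[i],
        CONCEPT_NAME = df$concept_name[i]
      ),
      isExcluded         = df$excluded[i],
      includeDescendants = df$descendants[i],
      includeMapped      = FALSE  # assumed default for this sketch
    )
  })
}

example_df <- data.frame(
  concept_id   = c(201826L, 442793L),
  concept_name = c("Type 2 diabetes mellitus",
                   "Diabetes mellitus due to insulin resistance"),
  descendants  = c(TRUE, TRUE),
  excluded     = c(FALSE, FALSE),
  stringsAsFactors = FALSE
)

cs_expression <- list(items = df_to_items(example_df))
length(cs_expression$items)
cs_expression$items[[1]]$concept$CONCEPT_ID
```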

## Step 3: Build the Cohort

Pass a named list of concept set expressions to `cohortFromConceptSet()`:

```{r}
cohort <- cohortFromConceptSet(
  conceptSetList      = all_cs,
  limit               = "earliest",
  requiredObservation = c(365L, 0L),
  end                 = "observation_period_end_date"
)
```

### Parameters

| Argument | Values | Description |
|:---------|:-------|:------------|
| `conceptSetList` | named list | Each element is a concept set expression with `$items` |
| `limit` | `"earliest"`, `"all"`, `"latest"` | Which qualifying event(s) to keep |
| `requiredObservation` | `c(prior, post)` | Days of continuous observation required before and after the index date |
| `end` | `"observation_period_end_date"`, `"fixed_exit"`, `"drug_exit"` | How the cohort era ends |
| `endArgs` | `list(...)` | Extra parameters for the chosen end strategy |
| `addSourceCriteria` | `TRUE` / `FALSE` | Also match on source (non-standard) concept codes |
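
To make the `requiredObservation` rule concrete, the base R sketch below checks
whether an index date has enough observation time on both sides within a single
observation period. `has_required_observation()` is a hypothetical helper, and
the comparison mirrors how CIRCE-style observation windows are usually
evaluated; it is not taken from the package source.

```r
# Hypothetical helper: does an index date satisfy c(prior, post) days of
# continuous observation inside one observation period?
has_required_observation <- function(index_date, obs_start, obs_end,
                                     required = c(365L, 0L)) {
  prior_ok <- index_date - required[1] >= obs_start
  post_ok  <- index_date + required[2] <= obs_end
  prior_ok & post_ok
}

obs_start <- as.Date("2018-01-01")
obs_end   <- as.Date("2022-12-31")

# 365 days before 2019-06-01 is still inside the observation period
late_index  <- has_required_observation(as.Date("2019-06-01"), obs_start, obs_end)
# 365 days before 2018-06-01 falls before the observation period starts
early_index <- has_required_observation(as.Date("2018-06-01"), obs_start, obs_end)
c(late_index, early_index)
```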

## Step 4: Export to JSON

```{r}
json <- cohortToJson(cohort)
cat(substr(json, 1, 300), "...\n")
```

The JSON string is ready for `CirceR::cohortExpressionFromJson()` or
`CirceR::buildCohortQuery()`, or can be saved to a file:

```{r, eval = FALSE}
writeLines(json, "my_cohort.json")
```
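
When reading a saved file back, note that `readLines()` returns one element per
line, so multi-line JSON has to be collapsed into a single string again. The
sketch below uses a stand-in JSON string and a temporary file so that it does
not depend on the objects created above.

```r
# Stand-in JSON; in the vignette this would be the result of cohortToJson()
json_text <- '{\n  "ConceptSets": [],\n  "PrimaryCriteria": {"CriteriaList": []}\n}'

path <- tempfile(fileext = ".json")
writeLines(json_text, path)

# Collapse the lines back into one string before comparing or parsing
json_back <- paste(readLines(path, warn = FALSE), collapse = "\n")
identical(json_back, json_text)
```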

## End Strategies

### Default: observation period end date

The cohort era ends when the person's observation period ends.

```{r}
cohort_obs <- cohortFromConceptSet(
  all_cs,
  end = "observation_period_end_date"
)
```

### Fixed exit: offset from index

End the cohort era a fixed number of days after the index event's start or end
date. `endArgs$index` selects the anchor date and `endArgs$offsetDays` sets the
offset.

```{r}
cohort_fixed <- cohortFromConceptSet(
  all_cs,
  end     = "fixed_exit",
  endArgs = list(index = "startDate", offsetDays = 180)
)

# Inspect the resulting date-offset end strategy
cohort_fixed$EndStrategy$DateOffset
```
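
The date arithmetic behind a fixed exit can be sketched in base R. The helper
`fixed_exit_date()` is hypothetical; it also caps the result at the end of the
observation period, which reflects how CIRCE treats cohort end dates in general
rather than anything specific to this package.

```r
# Hypothetical helper: cohort end under a fixed offset, capped at the end of
# the observation period.
fixed_exit_date <- function(anchor_date, offset_days, obs_end) {
  pmin(anchor_date + offset_days, obs_end)
}

index_date <- as.Date("2020-01-15")

# Observation continues long enough: the end date is index + 180 days
uncapped <- fixed_exit_date(index_date, 180, obs_end = as.Date("2022-12-31"))
# Observation ends first: the end date is the observation period end
capped   <- fixed_exit_date(index_date, 180, obs_end = as.Date("2020-03-31"))
c(uncapped, capped)
```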

### Drug exit: era-based persistence

For drug exposures, end the cohort using drug era logic with a configurable gap
(`persistenceWindow`) and surveillance window (`surveillanceWindow`).

```{r}
drug_df <- data.frame(
  concept_id       = 1503297L,
  concept_name     = "Metformin",
  domain_id        = "Drug",
  vocabulary_id    = "RxNorm",
  standard_concept = "S",
  descendants      = TRUE,
  excluded         = FALSE
)

drug_cs <- toConceptSets(list("Metformin" = drug_df))

cohort_drug <- cohortFromConceptSet(
  drug_cs,
  end     = "drug_exit",
  endArgs = list(persistenceWindow = 30, surveillanceWindow = 7)
)

# Inspect the resulting custom-era end strategy
cohort_drug$EndStrategy$CustomEra
```
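
The idea of era-based persistence can be illustrated with a small base R
sketch. `build_eras()` is a hypothetical helper: `gap_days` plays the role of a
persistence window and `offset_days` the role of a surveillance window. That
mapping is an assumption, and the logic is a simplification, not a copy of the
SQL that CIRCE generates.

```r
# Hypothetical helper: collapse exposure records into eras.
build_eras <- function(start, end, gap_days = 30, offset_days = 0) {
  stopifnot(length(start) >= 1, length(start) == length(end))
  ord   <- order(start)
  start <- start[ord]
  end   <- end[ord] + offset_days  # extend each exposure by the offset

  era_start <- start[1]
  era_end   <- end[1]
  eras <- list()
  for (i in seq_along(start)[-1]) {
    if (start[i] <= era_end + gap_days) {
      # Within the allowed gap: extend the current era
      era_end <- max(era_end, end[i])
    } else {
      # Gap too long: close the current era and start a new one
      eras[[length(eras) + 1]] <- data.frame(era_start = era_start,
                                             era_end   = era_end)
      era_start <- start[i]
      era_end   <- end[i]
    }
  }
  eras[[length(eras) + 1]] <- data.frame(era_start = era_start,
                                         era_end   = era_end)
  do.call(rbind, eras)
}

exposures <- data.frame(
  start = as.Date(c("2020-01-01", "2020-02-15", "2020-06-01")),
  end   = as.Date(c("2020-01-30", "2020-03-15", "2020-06-30"))
)

eras <- build_eras(exposures$start, exposures$end,
                   gap_days = 30, offset_days = 7)
eras
```

In this example the first two exposures are 16 days apart and merge into one
era, while the third starts more than 30 days after the second ends and opens a
new era.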

## Event Limits

```{r}
# Keep only the earliest qualifying event per person
earliest <- cohortFromConceptSet(all_cs, limit = "earliest")
earliest$PrimaryCriteria$PrimaryCriteriaLimit$Type

# Keep all qualifying events
all_events <- cohortFromConceptSet(all_cs, limit = "all")
all_events$PrimaryCriteria$PrimaryCriteriaLimit$Type

# Keep only the latest qualifying event
latest <- cohortFromConceptSet(all_cs, limit = "latest")
latest$PrimaryCriteria$PrimaryCriteriaLimit$Type
```
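
The effect of each limit on a set of qualifying events can be sketched on a toy
data frame. `apply_limit()` is a hypothetical base R helper that mimics the
three options per person; it only illustrates the semantics and is not how the
cohort definition is evaluated against a database.

```r
# Hypothetical helper: keep the earliest, latest, or all events per person.
apply_limit <- function(events, limit = c("earliest", "all", "latest")) {
  limit <- match.arg(limit)
  if (limit == "all") {
    return(events)
  }
  pick <- if (limit == "earliest") min else max
  dates_num <- as.numeric(events$event_date)
  # ave() returns the per-person min (or max) aligned with the original rows
  target <- ave(dates_num, events$person_id, FUN = pick)
  events[dates_num == target, , drop = FALSE]
}

events <- data.frame(
  person_id  = c(1, 1, 2, 2, 2),
  event_date = as.Date(c("2020-03-01", "2019-05-10",
                         "2021-01-01", "2021-06-01", "2020-12-31"))
)

first_events <- apply_limit(events, "earliest")
last_events  <- apply_limit(events, "latest")
every_event  <- apply_limit(events, "all")
first_events
```

If a person has several events on the same earliest (or latest) date, this
sketch keeps all of them.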

## Source Criteria

When `addSourceCriteria = TRUE`, each domain gets an additional criteria entry
that matches on source (non-standard) concept codes. This doubles the number
of primary criteria entries:

```{r}
cohort_src <- cohortFromConceptSet(all_cs, addSourceCriteria = TRUE)
length(cohort_src$PrimaryCriteria$CriteriaList)

# Without source criteria
cohort_plain <- cohortFromConceptSet(all_cs, addSourceCriteria = FALSE)
length(cohort_plain$PrimaryCriteria$CriteriaList)
```
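
The doubling can be pictured with hand-built criteria entries. The sketch below
assumes condition concept sets and uses field names from the general CIRCE JSON
layout (`CodesetId` for standard concepts, `ConditionSourceConcept` for source
concepts); whether the package emits exactly this structure is an assumption,
and `make_condition_criteria()` is a hypothetical helper.

```r
# Hypothetical sketch of CIRCE-style primary criteria entries for condition
# concept sets, identified by their codeset ids.
make_condition_criteria <- function(codeset_ids, add_source = FALSE) {
  standard <- lapply(codeset_ids, function(id) {
    list(ConditionOccurrence = list(CodesetId = id))
  })
  if (!add_source) {
    return(standard)
  }
  # One extra entry per concept set that matches on source concepts
  src <- lapply(codeset_ids, function(id) {
    list(ConditionOccurrence = list(ConditionSourceConcept = id))
  })
  c(standard, src)
}

n_plain  <- length(make_condition_criteria(c(0L, 1L)))
n_source <- length(make_condition_criteria(c(0L, 1L), add_source = TRUE))
c(plain = n_plain, with_source = n_source)
```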

## Structure of the Output

The returned list mirrors the CirceR cohort expression format:

```{r}
names(cohort)
```

- **ConceptSets** — one entry per concept set, each with `id`, `name`, and
  `expression`
- **PrimaryCriteria** — `CriteriaList` (one entry per domain per concept set),
  `ObservationWindow`, and `PrimaryCriteriaLimit`
- **EndStrategy** — `DateOffset` or `CustomEra` depending on `end`
- **CollapseSettings** — defaults to ERA-based collapsing with 0-day pad
- **InclusionRules**, **CensoringCriteria**, **CensorWindow** — empty defaults

```{r}
# Number of concept sets
length(cohort$ConceptSets)

# Names of concept sets
vapply(cohort$ConceptSets, `[[`, character(1), "name")

# Observation window
cohort$PrimaryCriteria$ObservationWindow
```
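
A quick structural check can catch a malformed list before serialization. The
helper below is a hypothetical base R sketch: the expected names are copied
from the bullet list above, and the real output may contain additional fields.

```r
# Hypothetical helper: report which expected top-level fields are absent.
missing_cohort_fields <- function(cohort) {
  expected <- c("ConceptSets", "PrimaryCriteria", "EndStrategy",
                "CollapseSettings", "InclusionRules",
                "CensoringCriteria", "CensorWindow")
  setdiff(expected, names(cohort))
}

# Stand-in list with two fields left out, to show what the helper reports
partial <- list(
  ConceptSets      = list(),
  PrimaryCriteria  = list(),
  EndStrategy      = list(),
  CollapseSettings = list(),
  InclusionRules   = list()
)

absent <- missing_cohort_fields(partial)
absent
```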