---
title: "Interpreting Summary Results with lc500s"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Interpreting Summary Results with lc500s}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  echo = FALSE
)

isMissingOrEmpty <- function(x) {
  length(x) == 0 || is.na(x[1]) || !nzchar(x[1])
}

readSummaryParquet <- function(path) {
  as.data.frame(nanoparquet::read_parquet(path), stringsAsFactors = FALSE)
}

exampleRoot <- system.file("example", "st", package = "CohortContrast")
if (isMissingOrEmpty(exampleRoot) && dir.exists("inst/example/st")) {
  exampleRoot <- normalizePath("inst/example/st")
}

studyPath <- file.path(exampleRoot, "lc500s")
if (isMissingOrEmpty(exampleRoot) || !dir.exists(studyPath)) {
  cat("Bundled summary example 'lc500s' is not available in this build.\n")
  knitr::knit_exit()
}

metadata <- jsonlite::fromJSON(file.path(studyPath, "metadata.json"), simplifyVector = FALSE)
conceptSummaries <- readSummaryParquet(file.path(studyPath, "concept_summaries.parquet"))
ordinalSummaries <- readSummaryParquet(file.path(studyPath, "ordinal_summaries.parquet"))
mappingTable <- readSummaryParquet(file.path(studyPath, "complementaryMappingTable.parquet"))

k2Summary <- readSummaryParquet(file.path(studyPath, "clustering_k2_summary.parquet"))
k3Summary <- readSummaryParquet(file.path(studyPath, "clustering_k3_summary.parquet"))
k4Summary <- readSummaryParquet(file.path(studyPath, "clustering_k4_summary.parquet"))
k5Summary <- readSummaryParquet(file.path(studyPath, "clustering_k5_summary.parquet"))

k2Overlap <- readSummaryParquet(file.path(studyPath, "clustering_k2_pairwise_overlap.parquet"))
k3Overlap <- readSummaryParquet(file.path(studyPath, "clustering_k3_pairwise_overlap.parquet"))
k4Overlap <- readSummaryParquet(file.path(studyPath, "clustering_k4_pairwise_overlap.parquet"))
k5Overlap <- readSummaryParquet(file.path(studyPath, "clustering_k5_pairwise_overlap.parquet"))
```

## Goal

This vignette explains what each summary-mode dataframe stores in the bundled `lc500s` study.

For each dataframe:

- You get markdown column descriptions.
- You see `head(...)` output.

## Summary Folder Metadata (`metadata.json`)

This JSON file is not a dataframe, but it controls how all summary tables should be interpreted.

Top-level fields:

- `study_name`: Summary study folder name.
- `original_study_name`: Source patient-level study.
- `source_path`: Path to source study used during precompute.
- `mode`: Expected value is `summary`.
- `demographics`: Cohort and age/sex summary block.
- `clustering`: Clustering quality and cluster-size summaries by `k`.
- `cluster_k_values`: List of precomputed `k` values.
- `concept_limit`: Max concepts used in clustering pipeline.
- `min_cell_count`: Suppression threshold used in precompute.
- `significant_concepts`: Count of significant concepts retained.
- `clustering_guardrails`: Guardrails for matrix and overlap computations.

```{r}
str(metadata, max.level = 2)
```

## `concept_summaries.parquet`

One row per concept with overall summary statistics.

Column descriptions:

- `CONCEPT_ID`: Concept identifier.
- `HERITAGE`: Domain/heritage of concept.
- `time_count`: Number of timing observations used.
- `time_min`: Minimum time-to-event.
- `time_max`: Maximum time-to-event.
- `time_mean`: Mean time-to-event.
- `time_median`: Median time-to-event.
- `time_std`: Standard deviation of time-to-event.
- `time_q1`: 25th percentile of time-to-event.
- `time_q3`: 75th percentile of time-to-event.
- `time_iqr`: Interquartile range of time-to-event.
- `patient_count`: Number of target patients with concept.
- `CONCEPT_NAME`: Concept name.
- `time_histogram_bins`: JSON text for histogram bin edges.
- `time_histogram_counts`: JSON text for histogram counts.
- `time_kde_x`: JSON text for KDE x-grid.
- `time_kde_y`: JSON text for KDE y-values.
- `age_mean`: Mean age for concept-positive patients.
- `age_median`: Median age for concept-positive patients.
- `age_std`: Age standard deviation.
- `age_q1`: 25th percentile of age.
- `age_q3`: 75th percentile of age.
- `n_ages`: Number of non-missing ages used.
- `male_proportion`: Male proportion for concept-positive patients.
- `TARGET_SUBJECT_PREVALENCE`: Target prevalence.
- `CONTROL_SUBJECT_PREVALENCE`: Control prevalence.
- `PREVALENCE_DIFFERENCE_RATIO`: Target/control prevalence ratio.

```{r}
utils::head(conceptSummaries, 10)
```

## `ordinal_summaries.parquet`

One row per ordinalized concept event (for repeated occurrences like 1st, 2nd, 3rd).

Column descriptions:

- `CONCEPT_ID`: Ordinal concept identifier (derived).
- `HERITAGE`: Domain/heritage.
- `ORDINAL`: Ordinal index (`1`, `2`, ...).
- `time_count`: Number of timing observations.
- `time_min`: Minimum time-to-event.
- `time_max`: Maximum time-to-event.
- `time_mean`: Mean time-to-event.
- `time_median`: Median time-to-event.
- `time_std`: Standard deviation of time-to-event.
- `time_q1`: 25th percentile of time-to-event.
- `time_q3`: 75th percentile of time-to-event.
- `time_iqr`: Interquartile range of time-to-event.
- `patient_count`: Number of target patients with this ordinal event.
- `age_mean`: Mean age.
- `age_median`: Median age.
- `age_std`: Age standard deviation.
- `age_q1`: 25th percentile of age.
- `age_q3`: 75th percentile of age.
- `n_ages`: Number of non-missing ages used.
- `male_proportion`: Male proportion.
- `ordinal_name_suffix`: Human-readable ordinal suffix (`1st`, `2nd`, ...).
- `ORIGINAL_CONCEPT_ID`: Base concept id before ordinal expansion.
- `CONCEPT_NAME`: Ordinalized concept name (for example `Death 2nd`).
- `IS_ORDINAL`: Flag for ordinal rows (`TRUE` here).
- `time_histogram_bins`: JSON text for histogram bin edges.
- `time_histogram_counts`: JSON text for histogram counts.
- `time_kde_x`: JSON text for KDE x-grid.
- `time_kde_y`: JSON text for KDE y-values.
- `TARGET_SUBJECT_PREVALENCE`: Target prevalence.
- `CONTROL_SUBJECT_PREVALENCE`: Control prevalence.
- `PREVALENCE_DIFFERENCE_RATIO`: Target/control prevalence ratio.

```{r}
utils::head(ordinalSummaries, 10)
```

## `clustering_k*_summary.parquet`

Each file contains one row per concept per cluster for a fixed `k`.

Shared columns (`k=2,3,4,5`):

- `CONCEPT_ID`: Concept identifier.
- `cluster`: Cluster label (`C1`, `C2`, ...).
- `patient_count`: Number of patients in that cluster with concept present.
- `time_median`: Median time-to-event for concept within cluster.
- `time_q1`: 25th percentile of time-to-event within cluster.
- `time_q3`: 75th percentile of time-to-event within cluster.
- `time_min`: Minimum time-to-event within cluster.
- `time_max`: Maximum time-to-event within cluster.
- `total_cluster_patients`: Cluster size.
- `CONCEPT_NAME`: Concept name.
- `ORIGINAL_CONCEPT_ID`: Base concept id.
- `ORDINAL`: Ordinal index (`0` for non-ordinal rows).
- `IS_ORDINAL`: Ordinal flag.
- `age_mean`: Mean age for concept-positive patients in cluster.
- `age_std`: Age standard deviation in cluster.
- `male_proportion`: Male proportion in cluster.
- `prevalence`: `patient_count / total_cluster_patients`.

### `clustering_k2_summary.parquet`

```{r}
utils::head(k2Summary, 10)
```

### `clustering_k3_summary.parquet`

```{r}
utils::head(k3Summary, 10)
```

### `clustering_k4_summary.parquet`

```{r}
utils::head(k4Summary, 10)
```

### `clustering_k5_summary.parquet`

```{r}
utils::head(k5Summary, 10)
```

## `clustering_k*_pairwise_overlap.parquet`

Each file contains concept-pair overlap metrics for the `overall` group and for each cluster group, at a fixed `k`.

Shared columns (`k=2,3,4,5`):

- `concept_id_1`: First concept in pair.
- `concept_id_2`: Second concept in pair.
- `jaccard`: Jaccard overlap for pair.
- `phi_correlation`: Phi correlation for pair.
- `prevalence`: Single-concept prevalence (diagonal rows where `concept_id_1 == concept_id_2`).
- `patient_count`: Single-concept patient count (diagonal rows).
- `group`: `overall` or cluster (`C1`, `C2`, ...).
- `co_occurrence`: Co-occurrence count (typically off-diagonal rows).
- `union`: Union count (typically off-diagonal rows).

### `clustering_k2_pairwise_overlap.parquet`

```{r}
utils::head(k2Overlap, 10)
```

### `clustering_k3_pairwise_overlap.parquet`

```{r}
utils::head(k3Overlap, 10)
```

### `clustering_k4_pairwise_overlap.parquet`

```{r}
utils::head(k4Overlap, 10)
```

### `clustering_k5_pairwise_overlap.parquet`

```{r}
utils::head(k5Overlap, 10)
```

## `complementaryMappingTable.parquet`

Concept mapping history table (same schema as patient mode). In `lc500s` it is empty.

Column descriptions:

- `CONCEPT_ID`: Original concept id.
- `CONCEPT_NAME`: Original concept name.
- `NEW_CONCEPT_ID`: Mapped concept id.
- `NEW_CONCEPT_NAME`: Mapped concept name.
- `TYPE`: Mapping type.
- `HERITAGE`: Heritage/domain.

```{r}
utils::head(mappingTable, 10)
```
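
Returning to the clustering tables: two of their columns can be read as simple ratios of other columns, which gives a quick sanity check when inspecting the files. The sketch below recomputes both on made-up toy rows, not on values from `lc500s`. The `prevalence` formula is the one given in the `clustering_k*_summary.parquet` column list. Treating `jaccard` as `co_occurrence / union` is an assumption based on the standard definition of the Jaccard index, and `jaccard_check` is a hypothetical column name introduced only for this illustration.

```r
# Toy rows shaped like clustering_k*_summary.parquet (values are invented).
clusterRows <- data.frame(
  CONCEPT_ID = c(101L, 101L, 202L),
  cluster = c("C1", "C2", "C1"),
  patient_count = c(30L, 12L, 45L),
  total_cluster_patients = c(60L, 40L, 60L),
  stringsAsFactors = FALSE
)

# prevalence = patient_count / total_cluster_patients, as documented.
clusterRows$prevalence <-
  clusterRows$patient_count / clusterRows$total_cluster_patients

# Toy off-diagonal rows shaped like clustering_k*_pairwise_overlap.parquet.
overlapRows <- data.frame(
  concept_id_1 = c(101L, 101L),
  concept_id_2 = c(202L, 303L),
  group = c("overall", "C1"),
  co_occurrence = c(20L, 5L),
  union = c(80L, 50L),
  stringsAsFactors = FALSE
)

# Assumption: jaccard is the usual intersection-over-union ratio.
overlapRows$jaccard_check <- overlapRows$co_occurrence / overlapRows$union

print(clusterRows)
print(overlapRows)
```

On real data, comparing a recomputed ratio with the stored column using `all.equal()` rather than `==` avoids spurious mismatches from floating-point rounding. Rows affected by `min_cell_count` suppression may not satisfy these identities.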