Part 1: Counts, Attrition, Overlap, Timing
This package aims to standardise, and provide the tools to conduct, characterisation studies as described in the Darwin-EU Catalogue of Standard Analytics.


We have three types of functions:
summarise: these functions produce a standardised output that summarises a cohort. This standard output is called summarised_result.
plot: these functions produce plots (currently only ggplot; plotly support is in progress) from a summarised_result object.
table: these functions produce tables (gt and flextable) from a summarised_result object.
result <- summariseXXX(...)
tableXXX(result)
plotXXX(result)
flowchart LR
  A[summarise function] --> B[Plot function]
  A --> C[Table function]
library(CDMConnector)
library(dplyr)
library(tidyr)
library(DBI)
db <- DBI::dbConnect(duckdb::duckdb(), dbdir = eunomia_dir())
cdm <- cdm_from_con(con = db, cdm_schema = "main", write_schema = "main")
cdm
── # OMOP CDM reference (duckdb) of Synthea synthetic health database ──────────────────────────────────────────────────
• omop tables: person, observation_period, visit_occurrence, visit_detail, condition_occurrence, drug_exposure,
procedure_occurrence, device_exposure, measurement, observation, death, note, note_nlp, specimen, fact_relationship,
location, care_site, provider, payer_plan_period, cost, drug_era, dose_era, condition_era, metadata, cdm_source,
concept, vocabulary, domain, concept_class, concept_relationship, relationship, concept_synonym, concept_ancestor,
source_to_concept_map, drug_strength
• cohort tables: -
• achilles tables: -
• other tables: -
library(CohortConstructor)
cdm$sinusitis <- conceptCohort(
  cdm = cdm,
  name = "sinusitis",
  conceptSet = list(
    "bacterial_sinusitis" = 4294548,
    "viral_sinusitis" = 40481087,
    "chronic_sinusitis" = 257012,
    "any_sinusitis" = c(4294548, 40481087, 257012)
  )
)
Let's see the sinusitis cohorts:
cdm$sinusitis
# Source: table<main.sinusitis> [?? x 4]
# Database: DuckDB v0.10.0 [martics@Windows 10 x64:R 4.2.3/C:\Users\martics\AppData\Local\Temp\RtmpmK59Vq\file1230308c511b.duckdb]
cohort_definition_id subject_id cohort_start_date cohort_end_date
<int> <int> <date> <date>
1 2 800 1970-12-08 1970-12-22
2 2 857 1997-08-22 1997-08-29
3 2 1390 1974-12-21 1975-01-11
4 2 1658 1951-05-31 1951-06-07
5 2 2077 2012-01-14 2012-02-04
6 2 2319 1963-01-10 1963-01-24
7 2 2821 1920-07-07 1920-07-14
8 2 2903 1979-12-26 1980-01-09
9 2 2989 1974-10-26 1974-11-02
10 2 4639 1992-05-28 1992-06-11
# ℹ more rows
Let's see the settings of the sinusitis cohorts.
We can easily extract metadata about the counts in this cohort:
cdm$sinusitis |> cohortCount()
# A tibble: 4 × 3
cohort_definition_id number_records number_subjects
<int> <int> <int>
1 1 939 786
2 2 17268 2686
3 3 825 812
4 4 18629 2688
We can export this metadata using summariseCohortCount:
library(CohortCharacteristics)
cdm$sinusitis |>
summariseCohortCount() |>
glimpse()
Rows: 8
Columns: 13
$ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1
$ cdm_name <chr> "Synthea synthetic health database", "Synthea synthetic health database", "Synthea synthetic …
$ group_name <chr> "cohort_name", "cohort_name", "cohort_name", "cohort_name", "cohort_name", "cohort_name", "co…
$ group_level <chr> "any_sinusitis", "chronic_sinusitis", "viral_sinusitis", "bacterial_sinusitis", "bacterial_si…
$ strata_name <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall"
$ strata_level <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall"
$ variable_name <chr> "Number records", "Number records", "Number records", "Number records", "Number subjects", "N…
$ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA
$ estimate_name <chr> "count", "count", "count", "count", "count", "count", "count", "count"
$ estimate_type <chr> "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer"
$ estimate_value <chr> "18629", "825", "17268", "939", "786", "2686", "812", "2688"
$ additional_name <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall"
$ additional_level <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall"
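Because a summarised_result is stored in long format, with every estimate held as a character string next to its declared type, a common first step is to convert estimate_value back to a number and spread it into one column per variable. The following is a minimal base R sketch of that idea. It uses a hand-built data frame (`counts`, an illustrative name) that copies the counts reported for this cohort, and it does not call any function from the packages above.

```r
# Hand-built copy of the relevant columns of the summarised_result shown above.
# `counts` is an illustrative name, not an object created by the package.
counts <- data.frame(
  group_level = c(
    "any_sinusitis", "chronic_sinusitis", "viral_sinusitis", "bacterial_sinusitis",
    "bacterial_sinusitis", "viral_sinusitis", "chronic_sinusitis", "any_sinusitis"
  ),
  variable_name = rep(c("Number records", "Number subjects"), each = 4),
  estimate_type = "integer",
  estimate_value = c("18629", "825", "17268", "939", "786", "2686", "812", "2688"),
  stringsAsFactors = FALSE
)

# estimate_value is always character; estimate_type says how to read it back.
counts$value <- as.integer(counts$estimate_value)

# One row per cohort, one column per variable. Each cell holds exactly one
# value, so sum() simply returns it.
wide <- tapply(counts$value, list(counts$group_level, counts$variable_name), sum)
wide
```

Keeping the long format for storage and reshaping only for display is what lets the same object feed both the table and the plot functions.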
We can easily create a gt or flextable from the output of summariseCohortCount:
cdm$sinusitis |>
summariseCohortCount(cohortId = 1:4) |>
tableCohortCount()
ℹ summarising data
✔ summariseCharacteristics finished!
! Results have not been suppressed.
| CDM name | Variable name | Estimate name | Any sinusitis | Bacterial sinusitis | Viral sinusitis | Chronic sinusitis |
|---|---|---|---|---|---|---|
| Synthea synthetic health database | Number records | N | 18,629 | 939 | 17,268 | 825 |
| Synthea synthetic health database | Number subjects | N | 2,688 | 786 | 2,686 | 812 |
You can easily suppress a summarised_result using the suppress function:
cdm$sinusitis |>
summariseCohortCount(cohortId = 1:4) |>
suppress(minCellCount = 5) |>
tableCohortCount()
| CDM name | Variable name | Estimate name | Bacterial sinusitis | Any sinusitis | Chronic sinusitis | Viral sinusitis |
|---|---|---|---|---|---|---|
| Synthea synthetic health database | Number records | N | 939 | 18,629 | 825 | 17,268 |
| Synthea synthetic health database | Number subjects | N | 786 | 2,688 | 812 | 2,686 |
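The table above is unchanged by suppression because every count is at least 5. To make the idea concrete, here is a conceptual base R sketch of minimum cell count masking. The helper `mask_small_counts()` is hypothetical and is not the implementation behind suppress(); in particular, leaving zero counts visible is an assumption of this sketch.

```r
# Hypothetical helper illustrating minimum-cell-count masking.
# Assumption: counts of zero stay visible; only counts in (0, min_cell_count) are hidden.
mask_small_counts <- function(x, min_cell_count = 5) {
  x <- as.integer(x)
  x[!is.na(x) & x > 0 & x < min_cell_count] <- NA_integer_
  x
}

# The count of 3 is masked; 0, 5 and 939 are kept.
masked <- mask_small_counts(c(0, 3, 5, 939))
masked
```

Masking small cells before sharing results reduces the risk of identifying individuals from very small groups.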
Counts can also be stratified, for example by sex, and the strata shown as row groups in the table:
cdm$sinusitis |>
PatientProfiles::addSex() |>
summariseCohortCount(strata = "sex") |>
tableCohortCount(header = c("group"), groupColumn = "sex")
| Sex | CDM name | Variable name | Estimate name | Any sinusitis | Bacterial sinusitis | Chronic sinusitis | Viral sinusitis |
|---|---|---|---|---|---|---|---|
| Female | Synthea synthetic health database | Number records | N | 9,542 | 494 | 427 | 8,824 |
| Female | Synthea synthetic health database | Number subjects | N | 1,371 | 418 | 424 | 1,371 |
| Male | Synthea synthetic health database | Number records | N | 9,087 | 445 | 398 | 8,444 |
| Male | Synthea synthetic health database | Number subjects | N | 1,317 | 368 | 388 | 1,315 |
| Overall | Synthea synthetic health database | Number records | N | 18,629 | 939 | 825 | 17,268 |
| Overall | Synthea synthetic health database | Number subjects | N | 2,688 | 786 | 812 | 2,686 |
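gt tables can easily be exported to Word:
myTable <- cdm$sinusitis |>
  PatientProfiles::addSex() |>
  summariseCohortCount(strata = "sex") |>
  tableCohortCount(header = c("group"), groupColumn = "sex")
# gtsave() infers the output format from the file extension; the file name is illustrative
gt::gtsave(myTable, filename = "sinusitis_counts.docx")
We can easily extract metadata about the attrition of a cohort: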
cdm$sinusitis |> attrition()
# A tibble: 4 × 7
cohort_definition_id number_records number_subjects reason_id reason excluded_records excluded_subjects
<int> <int> <int> <int> <chr> <int> <int>
1 1 939 786 1 Initial qualifying e… 0 0
2 2 17268 2686 1 Initial qualifying e… 0 0
3 3 825 812 1 Initial qualifying e… 0 0
4 4 18629 2688 1 Initial qualifying e… 0 0
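In the output above every cohort has a single step, so all excluded counts are zero. To show how the excluded_records column relates to number_records once more steps are applied, here is a small base R sketch. The step names and counts in `attrition_steps` are hypothetical and are not taken from the database.

```r
# Hypothetical record counts remaining after each inclusion step.
attrition_steps <- data.frame(
  reason_id = 1:3,
  reason = c(
    "Initial qualifying events",
    "Restrict to first entry",
    "Restrict to female individuals"
  ),
  number_records = c(1000L, 800L, 420L),
  stringsAsFactors = FALSE
)

# Records excluded at a step = records before the step minus records after it.
# The first step excludes nothing by definition.
attrition_steps$excluded_records <- c(0L, -diff(attrition_steps$number_records))
attrition_steps
```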
We can export this metadata using summariseCohortAttrition:
cdm$sinusitis |>
summariseCohortAttrition() |>
glimpse()
Rows: 16
Columns: 13
$ result_id <int> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4
$ cdm_name <chr> "Synthea synthetic health database", "Synthea synthetic health database", "Synthea synthetic …
$ group_name <chr> "cohort_name", "cohort_name", "cohort_name", "cohort_name", "cohort_name", "cohort_name", "co…
$ group_level <chr> "bacterial_sinusitis", "bacterial_sinusitis", "bacterial_sinusitis", "bacterial_sinusitis", "…
$ strata_name <chr> "reason", "reason", "reason", "reason", "reason", "reason", "reason", "reason", "reason", "re…
$ strata_level <chr> "Initial qualifying events", "Initial qualifying events", "Initial qualifying events", "Initi…
$ variable_name <chr> "number_records", "number_subjects", "excluded_records", "excluded_subjects", "number_records…
$ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
$ estimate_name <chr> "count", "count", "count", "count", "count", "count", "count", "count", "count", "count", "co…
$ estimate_type <chr> "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "inte…
$ estimate_value <chr> "939", "786", "0", "0", "17268", "2686", "0", "0", "825", "812", "0", "0", "18629", "2688", "…
$ additional_name <chr> "reason_id", "reason_id", "reason_id", "reason_id", "reason_id", "reason_id", "reason_id", "r…
$ additional_level <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"
We can easily create a diagram from the output of summariseCohortAttrition:
cdm$sinusitis |>
summariseCohortAttrition() |>
plotCohortAttrition()
The cohort settings show which cohort_definition_id corresponds to each cohort name:
cdm$sinusitis |> settings()
# A tibble: 4 × 2
cohort_definition_id cohort_name
<int> <chr>
1 1 bacterial_sinusitis
2 2 viral_sinusitis
3 3 chronic_sinusitis
4 4 any_sinusitis
We can restrict the diagram to a single cohort with the cohortId argument:
cdm$sinusitis |>
summariseCohortAttrition() |>
plotCohortAttrition(cohortId = 1)
Can you create a cohort with the following attrition?
all records of sinusitis (4294548, 40481087, 257012)
only first record per person (requireIsFirstEntry)
restrict to female individuals (requireSex)
restrict to children between 5 and 12 years old (requireAge)
plot attrition (summariseCohortAttrition + plotCohortAttrition)
summariseCohortOverlap identifies the overlap (number of subjects) between cohorts:
result <- summariseCohortOverlap(cdm$sinusitis)
result |>
glimpse()
Rows: 72
Columns: 13
$ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ cdm_name <chr> "Synthea synthetic health database", "Synthea synthetic health database", "Synthea synthetic …
$ group_name <chr> "cohort_name_reference &&& cohort_name_comparator", "cohort_name_reference &&& cohort_name_co…
$ group_level <chr> "bacterial_sinusitis &&& any_sinusitis", "bacterial_sinusitis &&& any_sinusitis", "bacterial_…
$ strata_name <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ strata_level <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ variable_name <chr> "overlap", "reference", "comparator", "overlap", "reference", "comparator", "overlap", "refer…
$ variable_level <chr> "number_subjects", "number_subjects", "number_subjects", "number_subjects", "number_subjects"…
$ estimate_name <chr> "count", "count", "count", "count", "count", "count", "count", "count", "count", "count", "co…
$ estimate_type <chr> "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "inte…
$ estimate_value <chr> "786", "0", "1902", "2686", "2", "0", "785", "1", "1901", "466", "346", "320", "810", "1876",…
$ additional_name <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ additional_level <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
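In this result each row describes a pair of cohorts, and the pair is packed into group_level with " &&& " as the separator (matching the two names in group_name). A minimal base R sketch of unpacking such values into separate columns follows. The first string is taken from the output above; the second is illustrative.

```r
# Example group_level values: the first appears in the output above,
# the second is made up for illustration.
group_level <- c(
  "bacterial_sinusitis &&& any_sinusitis",
  "viral_sinusitis &&& chronic_sinusitis"
)

# Split on the literal separator (fixed = TRUE avoids regex interpretation).
parts <- strsplit(group_level, " &&& ", fixed = TRUE)

pairs <- data.frame(
  cohort_name_reference = vapply(parts, `[`, character(1), 1),
  cohort_name_comparator = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)
pairs
```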
We can easily display them in a gt table with tableCohortOverlap:
tableCohortOverlap(result)
| CDM name | Cohort name reference | Cohort name comparator | Estimate name | Only in reference cohort | In both cohorts | Only in comparator cohort |
|---|---|---|---|---|---|---|
| Synthea synthetic health database | Any sinusitis | Bacterial sinusitis | N (%) | 1,902 (70.76%) | 786 (29.24%) | 0 (0.00%) |
| Synthea synthetic health database | Any sinusitis | Chronic sinusitis | N (%) | 1,876 (69.79%) | 812 (30.21%) | 0 (0.00%) |
| Synthea synthetic health database | Any sinusitis | Viral sinusitis | N (%) | 2 (0.07%) | 2,686 (99.93%) | 0 (0.00%) |
| Synthea synthetic health database | Bacterial sinusitis | Chronic sinusitis | N (%) | 320 (28.27%) | 466 (41.17%) | 346 (30.57%) |
| Synthea synthetic health database | Bacterial sinusitis | Viral sinusitis | N (%) | 1 (0.04%) | 785 (29.21%) | 1,901 (70.75%) |
| Synthea synthetic health database | Chronic sinusitis | Viral sinusitis | N (%) | 2 (0.07%) | 810 (30.13%) | 1,876 (69.79%) |
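The three columns of the overlap table partition the subjects who appear in either cohort, and the percentages shown are consistent with using that union as the denominator. The base R sketch below illustrates the logic with set operations on hypothetical subject IDs, then checks the arithmetic against two rows of the table. The helper `overlap_pct()` is an illustrative name, not a package function.

```r
# Hypothetical subject IDs for two cohorts.
reference_ids <- c(1, 2, 3, 4, 5, 6)
comparator_ids <- c(4, 5, 6, 7)

only_reference <- length(setdiff(reference_ids, comparator_ids))
in_both <- length(intersect(reference_ids, comparator_ids))
only_comparator <- length(setdiff(comparator_ids, reference_ids))

# Percentages relative to all subjects in either cohort (assumed denominator,
# inferred from the numbers in the table).
overlap_pct <- function(only_ref, both, only_comp) {
  n <- c(only_reference = only_ref, in_both = both, only_comparator = only_comp)
  round(100 * n / sum(n), 2)
}

overlap_pct(only_reference, in_both, only_comparator)

# Counts from the table: Any vs Bacterial sinusitis, Bacterial vs Chronic sinusitis.
any_vs_bacterial <- overlap_pct(1902, 786, 0)
bacterial_vs_chronic <- overlap_pct(320, 466, 346)
```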
We can easily plot the overlap with plotCohortOverlap:
plotCohortOverlap(result)
Create 3 drug cohorts from these 5:
- aspirin
- acetaminophen
- naproxen
- amoxicillin
- ibuprofen
Remember that CodelistGenerator provides getDrugIngredientCodes() to obtain the codes for a drug ingredient.
Identify the subject overlap between them and create a plot to show the overlap.
summariseCohortTiming measures the time between entries into different cohorts, which shows which cohort tends to occur first.
Let's create some medication cohorts:
library(CodelistGenerator) # provides getDrugIngredientCodes()
cdm$medications <- conceptCohort(
  cdm = cdm,
  conceptSet = getDrugIngredientCodes(
    cdm = cdm, name = c("warfarin", "acetaminophen", "morphine")
  ),
  name = "medications"
)
summaryTiming <- cdm$medications |>
  summariseCohortTiming(restrictToFirstEntry = TRUE)
summaryTiming |>
  glimpse()
Rows: 42
Columns: 13
$ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ cdm_name <chr> "Synthea synthetic health database", "Synthea synthetic health database", "Synthea synthetic …
$ group_name <chr> "cohort_name_reference &&& cohort_name_comparator", "cohort_name_reference &&& cohort_name_co…
$ group_level <chr> "warfarin &&& acetaminophen", "acetaminophen &&& warfarin", "acetaminophen &&& morphine", "mo…
$ strata_name <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ strata_level <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ variable_name <chr> "number records", "number records", "number records", "number records", "number records", "nu…
$ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ estimate_name <chr> "count", "count", "count", "count", "count", "count", "count", "count", "count", "count", "co…
$ estimate_type <chr> "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "inte…
$ estimate_value <chr> "136", "136", "35", "35", "6", "6", "136", "136", "35", "35", "6", "6", "-33784", "-1106", "-…
$ additional_name <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ additional_level <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
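summaryTiming |> tableCohortTiming(timeScale = "years")
| CDM name | Cohort name reference | Cohort name comparator | Variable name | Estimate name | Estimate value |
|---|---|---|---|---|---|
| Synthea synthetic health database | Acetaminophen | Morphine | Number records | N | 35 |
| Synthea synthetic health database | Acetaminophen | Morphine | Number subjects | N | 35 |
| Synthea synthetic health database | Acetaminophen | Warfarin | Number records | N | 136 |
| Synthea synthetic health database | Acetaminophen | Warfarin | Number subjects | N | 136 |
| Synthea synthetic health database | Morphine | Warfarin | Number records | N | 6 |
| Synthea synthetic health database | Morphine | Warfarin | Number subjects | N | 6 |
| Synthea synthetic health database | Acetaminophen | Morphine | Years between cohort entries | Median [Q25 - Q75] | 15.79 [5.02 - 33.51] |
| Synthea synthetic health database | Acetaminophen | Morphine | Years between cohort entries | Range | -33.72 - 77.29 |
| Synthea synthetic health database | Acetaminophen | Warfarin | Years between cohort entries | Median [Q25 - Q75] | 53.96 [46.34 - 66.97] |
| Synthea synthetic health database | Acetaminophen | Warfarin | Years between cohort entries | Range | -3.03 - 92.50 |
| Synthea synthetic health database | Morphine | Warfarin | Years between cohort entries | Median [Q25 - Q75] | 4.54 [-4.76 - 10.36] |
| Synthea synthetic health database | Morphine | Warfarin | Years between cohort entries | Range | -9.24 - 18.99 |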
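To make the "years between cohort entries" estimates concrete, the base R sketch below computes them for a few hypothetical subjects. Two points are assumptions inferred from the output above rather than confirmed behaviour: the sign convention (comparator start minus reference start, so a negative value means the comparator cohort came first) and the conversion factor of 365.25 days per year, which agrees with the values shown (for example -1106 days and -3.03 years). The names `first_entries` and `days_to_years()` are illustrative.

```r
# Hypothetical first cohort entry dates per subject in two cohorts.
first_entries <- data.frame(
  subject_id = 1:4,
  reference_start = as.Date(c("2000-01-01", "2001-06-15", "1995-03-10", "2010-09-30")),
  comparator_start = as.Date(c("2005-01-01", "2001-01-15", "1999-03-10", "2010-09-30"))
)

# Assumed sign convention: comparator minus reference, in days.
days_between <- as.numeric(
  difftime(first_entries$comparator_start, first_entries$reference_start, units = "days")
)

# Assumed conversion factor, consistent with the day and year values shown above.
days_to_years <- function(days) round(days / 365.25, 2)

years_between <- days_to_years(days_between)

# Summary statistics of the kind reported in the table.
median_years <- median(years_between)
range_years <- range(years_between)
```

A negative lower bound in the range, as in every row of the table, simply means that for at least one subject the comparator cohort entry preceded the reference cohort entry.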
summaryTiming |>
  plotCohortTiming(timeScale = "years", facet = "cdm_name", colour = c("group_level"))
A density plot needs density estimates, which this result does not contain, so the following call fails:
summaryTiming |>
  plotCohortTiming(timeScale = "years", facet = "cdm_name", colour = c("group_level"), plotType = "density")
Error in `plotCohortTiming()`:
! Please provide a cohort timing summarised result with density estimates (use `density = TRUE` in
summariseCohortTiming).
Re-running summariseCohortTiming with density = TRUE provides them:
summaryTiming <- cdm$medications |>
  summariseCohortTiming(restrictToFirstEntry = TRUE, density = TRUE)
summaryTiming |>
  plotCohortTiming(timeScale = "years", facet = "cdm_name", colour = c("group_level"), plotType = "density")
Can you do a density plot of the three cohorts that you created before (for the overlap exercise)?
Oxford Summer School 2024