CohortCharacteristics

Part 1: Counts, Attrition, Overlap, Timing

Context

This package aims to standardise characterisation studies and provide the tools to conduct them, as described in the Darwin-EU Catalogue of Standard Analytics.

Package overview

Functions

Workflow

We have three types of functions:

  • summarise: these functions produce a standardised output that summarises a cohort. This standard output is called a summarised_result.

  • plot: these functions produce plots from a summarised_result object (currently only ggplot; plotly support is in progress).

  • table: these functions produce tables (gt and flextable) from a summarised_result object.

result <- summariseXXX(...)
tableXXX(result)
plotXXX(result)

flowchart LR
  A[summarise function] --> B[Plot function]
  A --> C[Table function]

Create the cdm reference

library(CDMConnector)
library(dplyr)
library(tidyr)
library(DBI)

db <- DBI::dbConnect(duckdb::duckdb(),  dbdir = eunomia_dir())
cdm <- cdm_from_con(con = db, cdm_schema = "main", write_schema = "main")
cdm
── # OMOP CDM reference (duckdb) of Synthea synthetic health database ──────────────────────────────────────────────────
• omop tables: person, observation_period, visit_occurrence, visit_detail, condition_occurrence, drug_exposure,
procedure_occurrence, device_exposure, measurement, observation, death, note, note_nlp, specimen, fact_relationship,
location, care_site, provider, payer_plan_period, cost, drug_era, dose_era, condition_era, metadata, cdm_source,
concept, vocabulary, domain, concept_class, concept_relationship, relationship, concept_synonym, concept_ancestor,
source_to_concept_map, drug_strength
• cohort tables: -
• achilles tables: -
• other tables: -

Let’s instantiate some cohorts

library(CohortConstructor)

cdm$sinusitis <- conceptCohort(
  cdm = cdm,
  name = "sinusitis",
  conceptSet = list(
    "bacterial_sinusitis" = 4294548, 
    "viral_sinusitis" = 40481087, 
    "chronic_sinusitis" = 257012, 
    "any_sinusitis" = c(4294548, 40481087, 257012)
  )
)

summariseCohortCount

Let's look at the sinusitis cohort table:

cdm$sinusitis
# Source:   table<main.sinusitis> [?? x 4]
# Database: DuckDB v0.10.0 [martics@Windows 10 x64:R 4.2.3/C:\Users\martics\AppData\Local\Temp\RtmpmK59Vq\file1230308c511b.duckdb]
   cohort_definition_id subject_id cohort_start_date cohort_end_date
                  <int>      <int> <date>            <date>         
 1                    2        800 1970-12-08        1970-12-22     
 2                    2        857 1997-08-22        1997-08-29     
 3                    2       1390 1974-12-21        1975-01-11     
 4                    2       1658 1951-05-31        1951-06-07     
 5                    2       2077 2012-01-14        2012-02-04     
 6                    2       2319 1963-01-10        1963-01-24     
 7                    2       2821 1920-07-07        1920-07-14     
 8                    2       2903 1979-12-26        1980-01-09     
 9                    2       2989 1974-10-26        1974-11-02     
10                    2       4639 1992-05-28        1992-06-11     
# ℹ more rows

summariseCohortCount

Let's look at the settings of the sinusitis cohorts:

cdm$sinusitis |> settings() |> print(n = Inf)
# A tibble: 4 × 2
  cohort_definition_id cohort_name        
                 <int> <chr>              
1                    1 bacterial_sinusitis
2                    2 viral_sinusitis    
3                    3 chronic_sinusitis  
4                    4 any_sinusitis      

summariseCohortCount

We can easily extract metadata about the counts in this cohort:

cdm$sinusitis |> cohortCount()
# A tibble: 4 × 3
  cohort_definition_id number_records number_subjects
                 <int>          <int>           <int>
1                    1            939             786
2                    2          17268            2686
3                    3            825             812
4                    4          18629            2688

summariseCohortCount

We can export this metadata using summariseCohortCount:

cdm$sinusitis |>
  summariseCohortCount() |>
  glimpse()
Rows: 8
Columns: 13
$ result_id        <int> 1, 1, 1, 1, 1, 1, 1, 1
$ cdm_name         <chr> "Synthea synthetic health database", "Synthea synthetic health database", "Synthea synthetic …
$ group_name       <chr> "cohort_name", "cohort_name", "cohort_name", "cohort_name", "cohort_name", "cohort_name", "co…
$ group_level      <chr> "any_sinusitis", "chronic_sinusitis", "viral_sinusitis", "bacterial_sinusitis", "bacterial_si…
$ strata_name      <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall"
$ strata_level     <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall"
$ variable_name    <chr> "Number records", "Number records", "Number records", "Number records", "Number subjects", "N…
$ variable_level   <chr> NA, NA, NA, NA, NA, NA, NA, NA
$ estimate_name    <chr> "count", "count", "count", "count", "count", "count", "count", "count"
$ estimate_type    <chr> "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer"
$ estimate_value   <chr> "18629", "825", "17268", "939", "786", "2686", "812", "2688"
$ additional_name  <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall"
$ additional_level <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall"
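
Every summarised_result is stored in this long format, with estimate_value kept as character, so it can help to see how a wide table is derived from it. The sketch below rebuilds a few of the columns by hand (values copied from the cohortCount() output above; a real result has all 13 columns) and pivots them with tidyr. It only illustrates the format and is not how tableCohortCount is implemented.

```r
library(dplyr)
library(tidyr)

# Hand-built stand-in for part of a summarised_result.
# Counts are copied from the cohortCount() output shown earlier.
counts <- tibble(
  group_level = rep(
    c("bacterial_sinusitis", "viral_sinusitis", "chronic_sinusitis", "any_sinusitis"),
    times = 2
  ),
  variable_name = rep(c("Number records", "Number subjects"), each = 4),
  estimate_type = "integer",
  estimate_value = c("939", "17268", "825", "18629", "786", "2686", "812", "2688")
)

# Convert the character estimates back to integers and spread the
# variables into one column each (one row per cohort).
wide <- counts |>
  mutate(estimate_value = as.integer(estimate_value)) |>
  select(-estimate_type) |>
  pivot_wider(names_from = variable_name, values_from = estimate_value)

wide
```

The estimate_type column records how each estimate_value string should be read back, which is why the conversion step uses as.integer here.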

tableCohortCount

We can easily create a gt or flextable from the output of summariseCohortCount:

cdm$sinusitis |>
  summariseCohortCount(cohortId = 1:4) |>
  tableCohortCount()
ℹ summarising data
✔ summariseCharacteristics finished!
! Results have not been suppressed.
                                                                   Cohort name
CDM name                           Variable name    Estimate name  Any sinusitis  Bacterial sinusitis  Viral sinusitis  Chronic sinusitis
Synthea synthetic health database  Number records   N              18,629         939                  17,268           825
                                   Number subjects  N              2,688          786                  2,686            812

tableCohortCount

You can easily suppress a summarised_result using the suppress function:

cdm$sinusitis |>
  summariseCohortCount(cohortId = 1:4) |>
  suppress(minCellCount = 5) |>
  tableCohortCount()
                                                                   Cohort name
CDM name                           Variable name    Estimate name  Bacterial sinusitis  Any sinusitis  Chronic sinusitis  Viral sinusitis
Synthea synthetic health database  Number records   N              939                  18,629         825                17,268
                                   Number subjects  N              786                  2,688          812                2,686
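
In this example the counts are unchanged because every one of them is at least 5. As a rough illustration of what minimum cell count suppression means, the hypothetical helper below (suppressSmallCounts is an illustrative name, not the actual suppress implementation) masks counts that are greater than zero but below the threshold. Leaving zero counts unmasked is an assumption of this sketch.

```r
# Hypothetical helper: mask counts that are above zero but below minCellCount.
# Assumption: a count of zero is not considered identifying and is kept as is.
suppressSmallCounts <- function(x, minCellCount = 5) {
  out <- as.character(x)
  out[x > 0 & x < minCellCount] <- paste0("<", minCellCount)
  out
}

# 939 and 825 come from the table above; 3 and 0 are made-up values
# added to show a masked count and an untouched zero.
masked <- suppressSmallCounts(c(939, 3, 0, 825), minCellCount = 5)
masked
```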

tableCohortCount

We can also stratify the counts, for example by sex, and group the table rows by stratum:

cdm$sinusitis |>
  PatientProfiles::addSex() |>
  summariseCohortCount(strata = "sex") |>
  tableCohortCount(header = c("group"), groupColumn = "sex")
                                                                   Cohort name
CDM name                           Variable name    Estimate name  Any sinusitis  Bacterial sinusitis  Chronic sinusitis  Viral sinusitis
Female
Synthea synthetic health database  Number records   N              9,542          494                  427                8,824
                                   Number subjects  N              1,371          418                  424                1,371
Male
Synthea synthetic health database  Number records   N              9,087          445                  398                8,444
                                   Number subjects  N              1,317          368                  388                1,315
Overall
Synthea synthetic health database  Number records   N              18,629         939                  825                17,268
                                   Number subjects  N              2,688          786                  812                2,686

Export gt tables

gt tables can easily be exported to Word:

myTable <- cdm$sinusitis |>
  PatientProfiles::addSex() |>
  summariseCohortCount(strata = "sex") |>
  tableCohortCount(header = c("group"), groupColumn = "sex")
library(gt)
myTable |> gt::gtsave("table.docx")

summariseCohortAttrition

We can easily extract metadata about the attrition of a cohort:

cdm$sinusitis |> attrition()
# A tibble: 4 × 7
  cohort_definition_id number_records number_subjects reason_id reason                excluded_records excluded_subjects
                 <int>          <int>           <int>     <int> <chr>                            <int>             <int>
1                    1            939             786         1 Initial qualifying e…                0                 0
2                    2          17268            2686         1 Initial qualifying e…                0                 0
3                    3            825             812         1 Initial qualifying e…                0                 0
4                    4          18629            2688         1 Initial qualifying e…                0                 0

summariseCohortAttrition

We can export this metadata using summariseCohortAttrition:

cdm$sinusitis |>
  summariseCohortAttrition() |>
  glimpse()
Rows: 16
Columns: 13
$ result_id        <int> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4
$ cdm_name         <chr> "Synthea synthetic health database", "Synthea synthetic health database", "Synthea synthetic …
$ group_name       <chr> "cohort_name", "cohort_name", "cohort_name", "cohort_name", "cohort_name", "cohort_name", "co…
$ group_level      <chr> "bacterial_sinusitis", "bacterial_sinusitis", "bacterial_sinusitis", "bacterial_sinusitis", "…
$ strata_name      <chr> "reason", "reason", "reason", "reason", "reason", "reason", "reason", "reason", "reason", "re…
$ strata_level     <chr> "Initial qualifying events", "Initial qualifying events", "Initial qualifying events", "Initi…
$ variable_name    <chr> "number_records", "number_subjects", "excluded_records", "excluded_subjects", "number_records…
$ variable_level   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
$ estimate_name    <chr> "count", "count", "count", "count", "count", "count", "count", "count", "count", "count", "co…
$ estimate_type    <chr> "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "inte…
$ estimate_value   <chr> "939", "786", "0", "0", "17268", "2686", "0", "0", "825", "812", "0", "0", "18629", "2688", "…
$ additional_name  <chr> "reason_id", "reason_id", "reason_id", "reason_id", "reason_id", "reason_id", "reason_id", "r…
$ additional_level <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"

plotCohortAttrition

We can easily create a diagram from the output of summariseCohortAttrition:

plotCohortAttrition

First, we check the settings to see which cohort_definition_id corresponds to each cohort:

cdm$sinusitis |> settings()
# A tibble: 4 × 2
  cohort_definition_id cohort_name        
                 <int> <chr>              
1                    1 bacterial_sinusitis
2                    2 viral_sinusitis    
3                    3 chronic_sinusitis  
4                    4 any_sinusitis      

plotCohortAttrition

We can then plot the attrition of a single cohort by passing its cohortId:

cdm$sinusitis |>
  summariseCohortAttrition() |>
  plotCohortAttrition(cohortId = 1)

Your turn

Can you create a cohort with the following attrition?

  • all records of sinusitis (4294548, 40481087, 257012)

  • only first record per person (requireIsFirstEntry)

  • restrict to female individuals (requireSex)

  • restrict to children between 5 and 12 years old (requireAge)

  • plot attrition (summariseCohortAttrition + plotCohortAttrition)

summariseCohortOverlap

summariseCohortOverlap identifies the overlap (number of subjects) between cohorts:

result <- summariseCohortOverlap(cdm$sinusitis)
result |>
  glimpse()
Rows: 72
Columns: 13
$ result_id        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ cdm_name         <chr> "Synthea synthetic health database", "Synthea synthetic health database", "Synthea synthetic …
$ group_name       <chr> "cohort_name_reference &&& cohort_name_comparator", "cohort_name_reference &&& cohort_name_co…
$ group_level      <chr> "bacterial_sinusitis &&& any_sinusitis", "bacterial_sinusitis &&& any_sinusitis", "bacterial_…
$ strata_name      <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ strata_level     <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ variable_name    <chr> "overlap", "reference", "comparator", "overlap", "reference", "comparator", "overlap", "refer…
$ variable_level   <chr> "number_subjects", "number_subjects", "number_subjects", "number_subjects", "number_subjects"…
$ estimate_name    <chr> "count", "count", "count", "count", "count", "count", "count", "count", "count", "count", "co…
$ estimate_type    <chr> "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "inte…
$ estimate_value   <chr> "786", "0", "1902", "2686", "2", "0", "785", "1", "1901", "466", "346", "320", "810", "1876",…
$ additional_name  <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ additional_level <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
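
Note how group_name and group_level pack two values into one string, joined by " &&& ". The base R sketch below shows how such a pair can be split back into a reference and a comparator column. The first value is taken from the output above; the second is written out by hand following the same pattern, so treat it as illustrative.

```r
# group_level values in the "reference &&& comparator" encoding.
groupLevel <- c(
  "bacterial_sinusitis &&& any_sinusitis",
  "bacterial_sinusitis &&& chronic_sinusitis"
)

# Split each value on the literal " &&& " separator.
parts <- strsplit(groupLevel, " &&& ", fixed = TRUE)

# First element is the reference cohort, second the comparator.
pairs <- data.frame(
  cohort_name_reference = vapply(parts, function(p) p[1], character(1)),
  cohort_name_comparator = vapply(parts, function(p) p[2], character(1))
)

pairs
```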

tableCohortOverlap

We can easily display them in a gt table with tableCohortOverlap:

CDM name                           Cohort name reference  Cohort name comparator  Estimate name  Only in reference cohort  In both cohorts  Only in comparator cohort
Synthea synthetic health database  Any sinusitis          Bacterial sinusitis     N (%)          1,902 (70.76%)            786 (29.24%)     0 (0.00%)
                                                          Chronic sinusitis       N (%)          1,876 (69.79%)            812 (30.21%)     0 (0.00%)
                                                          Viral sinusitis         N (%)          2 (0.07%)                 2,686 (99.93%)   0 (0.00%)
                                   Bacterial sinusitis    Chronic sinusitis       N (%)          320 (28.27%)              466 (41.17%)     346 (30.57%)
                                                          Viral sinusitis         N (%)          1 (0.04%)                 785 (29.21%)     1,901 (70.75%)
                                   Chronic sinusitis      Viral sinusitis         N (%)          2 (0.07%)                 810 (30.13%)     1,876 (69.79%)
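
To make the three columns concrete, here is a small sketch of how such overlap counts and percentages can be derived from two sets of subject ids. The helper overlapCounts and the toy ids are hypothetical and only illustrate the idea; they are not the package's implementation. The percentages use the number of subjects in either cohort as the denominator, which reproduces the Bacterial sinusitis vs Chronic sinusitis row above.

```r
# Hypothetical helper: subject-level overlap between two cohorts.
overlapCounts <- function(reference, comparator) {
  reference <- unique(reference)
  comparator <- unique(comparator)
  counts <- c(
    only_in_reference = length(setdiff(reference, comparator)),
    in_both = length(intersect(reference, comparator)),
    only_in_comparator = length(setdiff(comparator, reference))
  )
  data.frame(
    variable = names(counts),
    count = unname(counts),
    # Denominator: all subjects present in at least one of the two cohorts.
    percentage = unname(round(100 * counts / sum(counts), 2))
  )
}

# Toy subject ids (made up): subject 4 has two records in the reference cohort.
toy <- overlapCounts(reference = c(1, 2, 3, 4, 4), comparator = c(3, 4, 5))
toy

# Same arithmetic applied to the counts 320, 466 and 346 from the table.
tablePct <- round(100 * c(320, 466, 346) / sum(c(320, 466, 346)), 2)
tablePct
```

The last line gives 28.27, 41.17 and 30.57, matching the percentages reported in the table.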

plotCohortOverlap

We can easily plot the overlap with plotCohortOverlap:

Your turn

Create 3 drug cohorts from these 5:

  • aspirin

  • acetaminophen

  • naproxen

  • amoxicillin

  • ibuprofen

Remember that CodelistGenerator::getDrugIngredientCodes can retrieve the codes for each ingredient.

Identify the subject overlap between them and create a plot to show the overlap.


summariseCohortTiming

summariseCohortTiming summarises the time between entries into different cohorts, which shows which cohort individuals tend to enter first.

Let's create some medication cohorts:

library(CodelistGenerator)

cdm$medications <- conceptCohort(
  cdm = cdm,
  conceptSet = getDrugIngredientCodes(
    cdm = cdm, name = c("warfarin", "acetaminophen", "morphine")
  ),
  name = "medications"
)

summariseCohortTiming

summaryTiming <- cdm$medications |>
  summariseCohortTiming(restrictToFirstEntry = TRUE)
summaryTiming |>
  glimpse()
Rows: 42
Columns: 13
$ result_id        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ cdm_name         <chr> "Synthea synthetic health database", "Synthea synthetic health database", "Synthea synthetic …
$ group_name       <chr> "cohort_name_reference &&& cohort_name_comparator", "cohort_name_reference &&& cohort_name_co…
$ group_level      <chr> "warfarin &&& acetaminophen", "acetaminophen &&& warfarin", "acetaminophen &&& morphine", "mo…
$ strata_name      <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ strata_level     <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ variable_name    <chr> "number records", "number records", "number records", "number records", "number records", "nu…
$ variable_level   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ estimate_name    <chr> "count", "count", "count", "count", "count", "count", "count", "count", "count", "count", "co…
$ estimate_type    <chr> "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "inte…
$ estimate_value   <chr> "136", "136", "35", "35", "6", "6", "136", "136", "35", "35", "6", "6", "-33784", "-1106", "-…
$ additional_name  <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ additional_level <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…

tableCohortTiming

summaryTiming |> tableCohortTiming(timeScale = "years")
CDM name                           Cohort name reference  Cohort name comparator  Variable name                 Estimate name       Estimate value
Synthea synthetic health database  Acetaminophen          Morphine                Number records                N                   35
                                                                                  Number subjects               N                   35
                                                          Warfarin                Number records                N                   136
                                                                                  Number subjects               N                   136
                                   Morphine               Warfarin                Number records                N                   6
                                                                                  Number subjects               N                   6
                                   Acetaminophen          Morphine                Years between cohort entries  Median [Q25 - Q75]  15.79 [5.02 - 33.51]
                                                                                                                Range               -33.72 - 77.29
                                                          Warfarin                Years between cohort entries  Median [Q25 - Q75]  53.96 [46.34 - 66.97]
                                                                                                                Range               -3.03 - 92.50
                                   Morphine               Warfarin                Years between cohort entries  Median [Q25 - Q75]  4.54 [-4.76 - 10.36]
                                                                                                                Range               -9.24 - 18.99
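
The sketch below shows, on made-up dates, how a "years between cohort entries" summary of this kind can be computed. It assumes the time is taken as comparator entry minus reference entry (so a negative value means the comparator cohort was entered first) and that days are converted with 365.25 days per year. The second assumption is consistent with the output above: the 33784 days seen in the glimpse correspond to the 92.50 years in the table.

```r
# Toy first-entry dates per subject (hypothetical values).
firstEntry <- data.frame(
  subject_id = 1:5,
  reference_start = as.Date(
    c("2000-01-01", "2001-06-15", "1995-03-10", "2010-09-01", "2005-12-31")
  ),
  comparator_start = as.Date(
    c("2004-01-01", "2001-01-15", "2005-03-10", "2010-09-01", "2015-12-31")
  )
)

# Days from reference entry to comparator entry; negative values mean the
# comparator cohort was entered first (sign convention assumed here).
daysBetween <- as.numeric(firstEntry$comparator_start - firstEntry$reference_start)

# Convert to years, assuming 365.25 days per year.
yearsBetween <- daysBetween / 365.25

# The same summaries the table reports: median, quartiles and range.
quantile(yearsBetween, probs = c(0.25, 0.5, 0.75))
range(yearsBetween)

# Check of the conversion against the output above.
round(33784 / 365.25, 2)
```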

plotCohortTiming

summaryTiming |>
  plotCohortTiming(timeScale = "years", facet = "cdm_name", colour = c("group_level"))

plotCohortTiming

summaryTiming |>
  plotCohortTiming(timeScale = "years", facet = "cdm_name", colour = c("group_level"), plotType = "density")
Error in `plotCohortTiming()`:
! Please provide a cohort timing summarised result with density estimates (use `density = TRUE` in
  summariseCohortTiming).

plotCohortTiming

summaryTiming <- cdm$medications |>
  summariseCohortTiming(restrictToFirstEntry = TRUE, density = TRUE)
summaryTiming |>
  plotCohortTiming(timeScale = "years", facet = "cdm_name", colour = c("group_level"), plotType = "density")
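
The density plot needs a curve rather than a handful of quantiles, which is why the result has to be computed with density = TRUE. As a rough picture of what such density estimates look like, the sketch below runs base R's kernel density estimator on simulated values. How the package computes its estimates internally is not shown here, so treat the choice of stats::density and the grid size as assumptions.

```r
# Toy data: simulated years between cohort entries (hypothetical values).
set.seed(123)
yearsBetween <- rnorm(200, mean = 5, sd = 3)

# Kernel density estimate evaluated on a grid of 512 points.
dens <- density(yearsBetween, n = 512)

# One row per grid point: x is the time in years, y the estimated density.
densityCurve <- data.frame(x = dens$x, y = dens$y)
head(densityCurve)
```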

Your turn

Can you create a density plot of the three cohorts that you created before (for the overlap exercise)?

CohortCharacteristics

Thank you for your attention!