CohortCharacteristics

Part 1: Counts, Attrition, Overlap, Timing

Context

This package aims to standardise, and provide the tools for, the characterisation studies described in the Darwin-EU Catalogue of Standard Analytics.

Package overview

Functions

Workflow

We have three types of functions:

summarise: these functions produce a standardised output that summarises a cohort. This standard output is called a summarised_result.
plot: these functions produce plots from a summarised_result object (currently ggplot only; plotly support is in progress).
table: these functions produce tables (gt and flextable) from a summarised_result object.

result <- summariseXXX(...)

tableXXX(result)

plotXXX(result)

flowchart LR
  A[summarise function ] --> B[Plot function ]
  A --> C[Table function ]

Create the cdm reference

library(CDMConnector)
library(dplyr)
library(tidyr)
library(DBI)

db <- DBI::dbConnect(duckdb::duckdb(),  dbdir = eunomia_dir())
cdm <- cdm_from_con(con = db, cdm_schema = "main", write_schema = "main")

cdm

── # OMOP CDM reference (duckdb) of Synthea synthetic health database ──────────────────────────────────────────────────

• omop tables: person, observation_period, visit_occurrence, visit_detail, condition_occurrence, drug_exposure,
procedure_occurrence, device_exposure, measurement, observation, death, note, note_nlp, specimen, fact_relationship,
location, care_site, provider, payer_plan_period, cost, drug_era, dose_era, condition_era, metadata, cdm_source,
concept, vocabulary, domain, concept_class, concept_relationship, relationship, concept_synonym, concept_ancestor,
source_to_concept_map, drug_strength

• cohort tables: -

• achilles tables: -

• other tables: -

Let’s instantiate some cohorts

library(CohortConstructor)

cdm$sinusitis <- conceptCohort(
  cdm = cdm,
  name = "sinusitis",
  conceptSet = list(
    "bacterial_sinusitis" = 4294548, 
    "viral_sinusitis" = 40481087, 
    "chronic_sinusitis" = 257012, 
    "any_sinusitis" = c(4294548, 40481087, 257012)
  )
)

summariseCohortCount

Let's look at the sinusitis cohorts:

cdm$sinusitis

# Source:   table<main.sinusitis> [?? x 4]
# Database: DuckDB v0.10.0 [martics@Windows 10 x64:R 4.2.3/C:\Users\martics\AppData\Local\Temp\RtmpmK59Vq\file1230308c511b.duckdb]
   cohort_definition_id subject_id cohort_start_date cohort_end_date
                  <int>      <int> <date>            <date>         
 1                    2        800 1970-12-08        1970-12-22     
 2                    2        857 1997-08-22        1997-08-29     
 3                    2       1390 1974-12-21        1975-01-11     
 4                    2       1658 1951-05-31        1951-06-07     
 5                    2       2077 2012-01-14        2012-02-04     
 6                    2       2319 1963-01-10        1963-01-24     
 7                    2       2821 1920-07-07        1920-07-14     
 8                    2       2903 1979-12-26        1980-01-09     
 9                    2       2989 1974-10-26        1974-11-02     
10                    2       4639 1992-05-28        1992-06-11     
# ℹ more rows

Let's look at the settings of the sinusitis cohorts:

cdm$sinusitis |> settings() |> print(n = Inf)

# A tibble: 4 × 2
  cohort_definition_id cohort_name        
                 <int> <chr>              
1                    1 bacterial_sinusitis
2                    2 viral_sinusitis    
3                    3 chronic_sinusitis  
4                    4 any_sinusitis


We can easily extract metadata about the counts in this cohort:

cdm$sinusitis |> cohortCount()

# A tibble: 4 × 3
  cohort_definition_id number_records number_subjects
                 <int>          <int>           <int>
1                    1            939             786
2                    2          17268            2686
3                    3            825             812
4                    4          18629            2688


We can export this metadata using summariseCohortCount:

library(CohortCharacteristics)
cdm$sinusitis |>
  summariseCohortCount() |>
  glimpse()

Rows: 8
Columns: 13
$ result_id        <int> 1, 1, 1, 1, 1, 1, 1, 1
$ cdm_name         <chr> "Synthea synthetic health database", "Synthea synthetic health database", "Synthea synthetic …
$ group_name       <chr> "cohort_name", "cohort_name", "cohort_name", "cohort_name", "cohort_name", "cohort_name", "co…
$ group_level      <chr> "any_sinusitis", "chronic_sinusitis", "viral_sinusitis", "bacterial_sinusitis", "bacterial_si…
$ strata_name      <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall"
$ strata_level     <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall"
$ variable_name    <chr> "Number records", "Number records", "Number records", "Number records", "Number subjects", "N…
$ variable_level   <chr> NA, NA, NA, NA, NA, NA, NA, NA
$ estimate_name    <chr> "count", "count", "count", "count", "count", "count", "count", "count"
$ estimate_type    <chr> "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer"
$ estimate_value   <chr> "18629", "825", "17268", "939", "786", "2686", "812", "2688"
$ additional_name  <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall"
$ additional_level <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall"
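
Note that estimate_value is stored as character and estimate_type records how to cast it. As a minimal base R sketch, the data frame below mimics a few summarised_result columns (values copied from the cohortCount() output above; it is not a real summarised_result object) and pivots the counts into one row per cohort:

```r
# Simplified stand-in for a summarised_result: only four of the 13 columns,
# filled with the counts printed by cohortCount() above.
res <- data.frame(
  group_level = rep(
    c("bacterial_sinusitis", "viral_sinusitis", "chronic_sinusitis", "any_sinusitis"),
    times = 2
  ),
  variable_name = rep(c("Number records", "Number subjects"), each = 4),
  estimate_type = "integer",
  estimate_value = c("939", "17268", "825", "18629", "786", "2686", "812", "2688"),
  stringsAsFactors = FALSE
)

# Cast the character estimates according to estimate_type
res$value <- as.integer(res$estimate_value)

# One row per cohort, one column per variable (rows are sorted alphabetically)
wide <- tapply(res$value, list(res$group_level, res$variable_name), sum)
wide
```

In practice the table functions shown next do this formatting for you; the sketch only illustrates the long layout of the result.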

tableCohortCount

We can easily create a gt or flextable from the output of summariseCohortCount:

cdm$sinusitis |>
  summariseCohortCount(cohortId = 1:4) |>
  tableCohortCount()

ℹ summarising data

✔ summariseCharacteristics finished!

! Results have not been suppressed.

CDM name	Variable name	Estimate name	Cohort name
CDM name	Variable name	Estimate name	Any sinusitis	Bacterial sinusitis	Viral sinusitis	Chronic sinusitis
Synthea synthetic health database	Number records	N	18,629	939	17,268	825
	Number subjects	N	2,688	786	2,686	812

You can suppress small counts in a summarised_result with the suppress function:

cdm$sinusitis |>
  summariseCohortCount(cohortId = 1:4) |>
  suppress(minCellCount = 5) |>
  tableCohortCount()

CDM name	Variable name	Estimate name	Cohort name
CDM name	Variable name	Estimate name	Bacterial sinusitis	Any sinusitis	Chronic sinusitis	Viral sinusitis
Synthea synthetic health database	Number records	N	939	18,629	825	17,268
	Number subjects	N	786	2,688	812	2,686
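
The suppressed table shows the same numbers as before because every count is at least 5. The idea of a minimum cell count can be sketched in base R as follows; mask_small_counts is a hypothetical helper written for this illustration, not the package implementation, and both the "<5" label and the choice to leave zeros unmasked are assumptions of the sketch:

```r
# Hypothetical helper: mask counts that are greater than zero but smaller
# than minCellCount. This is an illustration, not how suppress() is coded.
mask_small_counts <- function(counts, minCellCount = 5) {
  masked <- as.character(counts)
  too_small <- counts > 0 & counts < minCellCount
  masked[too_small] <- paste0("<", minCellCount)
  masked
}

# Counts from the table above: none is below 5, so nothing is masked
sinusitis_counts <- c(939, 18629, 825, 17268, 786, 2688, 812, 2686)
unchanged <- mask_small_counts(sinusitis_counts)

# Made-up counts to show the masking
example <- mask_small_counts(c(0, 3, 5, 120))
example
```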

We can also stratify the counts, for example by sex:

cdm$sinusitis |>
  PatientProfiles::addSex() |>
  summariseCohortCount(strata = "sex") |>
  tableCohortCount(header = c("group"), groupColumn = "sex")

CDM name	Variable name	Estimate name	Cohort name
CDM name	Variable name	Estimate name	Any sinusitis	Bacterial sinusitis	Chronic sinusitis	Viral sinusitis
Female
Synthea synthetic health database	Number records	N	9,542	494	427	8,824
	Number subjects	N	1,371	418	424	1,371
Male
Synthea synthetic health database	Number records	N	9,087	445	398	8,444
	Number subjects	N	1,317	368	388	1,315
Overall
Synthea synthetic health database	Number records	N	18,629	939	825	17,268
	Number subjects	N	2,688	786	812	2,686
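
A quick sanity check on a stratified table is that the strata add up to the overall row. The base R sketch below copies the counts from the table above and verifies this for records and subjects:

```r
# Counts copied from the stratified table above
# (columns: any, bacterial, chronic, viral sinusitis)
records <- rbind(
  Female  = c(any = 9542, bacterial = 494, chronic = 427, viral = 8824),
  Male    = c(9087, 445, 398, 8444),
  Overall = c(18629, 939, 825, 17268)
)
subjects <- rbind(
  Female  = c(any = 1371, bacterial = 418, chronic = 424, viral = 1371),
  Male    = c(1317, 368, 388, 1315),
  Overall = c(2688, 786, 812, 2686)
)

# TRUE when Female + Male equals Overall in every column
records_ok  <- all(records["Female", ] + records["Male", ] == records["Overall", ])
subjects_ok <- all(subjects["Female", ] + subjects["Male", ] == subjects["Overall", ])
c(records = records_ok, subjects = subjects_ok)
```

Subject counts only add up like this when each person falls in exactly one stratum, as is the case for sex here.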

export gt tables

gt tables can easily be exported to Word:

myTable <- cdm$sinusitis |>
  PatientProfiles::addSex() |>
  summariseCohortCount(strata = "sex") |>
  tableCohortCount(header = c("group"), groupColumn = "sex")

library(gt)
myTable |> gt::gtsave("table.docx")

summariseCohortAttrition

We can easily extract metadata about the attrition of a cohort:

cdm$sinusitis |> attrition()

# A tibble: 4 × 7
  cohort_definition_id number_records number_subjects reason_id reason                excluded_records excluded_subjects
                 <int>          <int>           <int>     <int> <chr>                            <int>             <int>
1                    1            939             786         1 Initial qualifying e…                0                 0
2                    2          17268            2686         1 Initial qualifying e…                0                 0
3                    3            825             812         1 Initial qualifying e…                0                 0
4                    4          18629            2688         1 Initial qualifying e…                0                 0
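
Here every cohort has a single attrition row because no restriction has been applied yet. To show how further rows arise, the base R sketch below applies two restrictions to made-up records and logs each step. The data, the attrition_row helper, and the reason labels are all hypothetical and only mimic the column layout above:

```r
# Made-up cohort records (hypothetical, not from the Eunomia data)
cohort <- data.frame(
  subject_id = c(1, 1, 2, 3, 3, 4),
  cohort_start_date = as.Date(c("2001-01-01", "2003-05-01", "2002-02-02",
                                "1999-09-09", "2000-10-10", "2004-04-04")),
  sex = c("Female", "Female", "Male", "Female", "Female", "Female")
)

# Hypothetical helper: one attrition row comparing a step with the previous one
attrition_row <- function(current, previous, reason) {
  n_subj <- function(x) length(unique(x$subject_id))
  data.frame(
    reason = reason,
    number_records = nrow(current),
    number_subjects = n_subj(current),
    excluded_records = nrow(previous) - nrow(current),
    excluded_subjects = n_subj(previous) - n_subj(current)
  )
}

# Step 1: keep only the first record per person
ordered <- cohort[order(cohort$subject_id, cohort$cohort_start_date), ]
first_entry <- ordered[!duplicated(ordered$subject_id), ]

# Step 2: keep only female individuals
females <- first_entry[first_entry$sex == "Female", ]

attrition_table <- rbind(
  attrition_row(cohort, cohort, "Initial qualifying events"),
  attrition_row(first_entry, cohort, "First entry only (made-up label)"),
  attrition_row(females, first_entry, "Female only (made-up label)")
)
attrition_table
```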


We can export this metadata using summariseCohortAttrition:

cdm$sinusitis |>
  summariseCohortAttrition() |>
  glimpse()

Rows: 16
Columns: 13
$ result_id        <int> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4
$ cdm_name         <chr> "Synthea synthetic health database", "Synthea synthetic health database", "Synthea synthetic …
$ group_name       <chr> "cohort_name", "cohort_name", "cohort_name", "cohort_name", "cohort_name", "cohort_name", "co…
$ group_level      <chr> "bacterial_sinusitis", "bacterial_sinusitis", "bacterial_sinusitis", "bacterial_sinusitis", "…
$ strata_name      <chr> "reason", "reason", "reason", "reason", "reason", "reason", "reason", "reason", "reason", "re…
$ strata_level     <chr> "Initial qualifying events", "Initial qualifying events", "Initial qualifying events", "Initi…
$ variable_name    <chr> "number_records", "number_subjects", "excluded_records", "excluded_subjects", "number_records…
$ variable_level   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
$ estimate_name    <chr> "count", "count", "count", "count", "count", "count", "count", "count", "count", "count", "co…
$ estimate_type    <chr> "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "inte…
$ estimate_value   <chr> "939", "786", "0", "0", "17268", "2686", "0", "0", "825", "812", "0", "0", "18629", "2688", "…
$ additional_name  <chr> "reason_id", "reason_id", "reason_id", "reason_id", "reason_id", "reason_id", "reason_id", "r…
$ additional_level <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"

plotCohortAttrition

We can easily create a diagram from the output of summariseCohortAttrition:

cdm$sinusitis |>
  summariseCohortAttrition() |>
  plotCohortAttrition()

To plot a single cohort, first look up its ID in the cohort settings:

cdm$sinusitis |> settings()

# A tibble: 4 × 2
  cohort_definition_id cohort_name        
                 <int> <chr>              
1                    1 bacterial_sinusitis
2                    2 viral_sinusitis    
3                    3 chronic_sinusitis  
4                    4 any_sinusitis

Then pass that ID to plotCohortAttrition through the cohortId argument:

cdm$sinusitis |>
  summariseCohortAttrition() |>
  plotCohortAttrition(cohortId = 1)

Your turn

Can you create a cohort with the following attrition?

all records of sinusitis (4294548, 40481087, 257012)
only first record per person (requireIsFirstEntry)
restrict to female individuals (requireSex)
restrict to children between 5 and 12 years old (requireAge)
plot attrition (summariseCohortAttrition + plotCohortAttrition)


summariseCohortOverlap

summariseCohortOverlap identifies the overlap (number of subjects) between cohorts:

result <- summariseCohortOverlap(cdm$sinusitis)
result |>
  glimpse()

Rows: 72
Columns: 13
$ result_id        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ cdm_name         <chr> "Synthea synthetic health database", "Synthea synthetic health database", "Synthea synthetic …
$ group_name       <chr> "cohort_name_reference &&& cohort_name_comparator", "cohort_name_reference &&& cohort_name_co…
$ group_level      <chr> "bacterial_sinusitis &&& any_sinusitis", "bacterial_sinusitis &&& any_sinusitis", "bacterial_…
$ strata_name      <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ strata_level     <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ variable_name    <chr> "overlap", "reference", "comparator", "overlap", "reference", "comparator", "overlap", "refer…
$ variable_level   <chr> "number_subjects", "number_subjects", "number_subjects", "number_subjects", "number_subjects"…
$ estimate_name    <chr> "count", "count", "count", "count", "count", "count", "count", "count", "count", "count", "co…
$ estimate_type    <chr> "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "inte…
$ estimate_value   <chr> "786", "0", "1902", "2686", "2", "0", "785", "1", "1901", "466", "346", "320", "810", "1876",…
$ additional_name  <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ additional_level <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…

tableCohortOverlap

We can easily display them in a gt table with tableCohortOverlap:

tableCohortOverlap(result)

CDM name	Cohort name reference	Cohort name comparator	Estimate name	Only in reference cohort	In both cohorts	Only in comparator cohort
Synthea synthetic health database	Any sinusitis	Bacterial sinusitis	N (%)	1,902 (70.76%)	786 (29.24%)	0 (0.00%)
		Chronic sinusitis	N (%)	1,876 (69.79%)	812 (30.21%)	0 (0.00%)
		Viral sinusitis	N (%)	2 (0.07%)	2,686 (99.93%)	0 (0.00%)
	Bacterial sinusitis	Chronic sinusitis	N (%)	320 (28.27%)	466 (41.17%)	346 (30.57%)
		Viral sinusitis	N (%)	1 (0.04%)	785 (29.21%)	1,901 (70.75%)
	Chronic sinusitis	Viral sinusitis	N (%)	2 (0.07%)	810 (30.13%)	1,876 (69.79%)
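
The percentages in this table are consistent with dividing each count by the number of subjects who appear in either cohort. A short base R sketch reproduces two of the rows from their counts (the column names in the data frame are chosen for this illustration):

```r
# Counts copied from two rows of the overlap table above
overlap <- data.frame(
  reference = c("Any sinusitis", "Bacterial sinusitis"),
  comparator = c("Bacterial sinusitis", "Chronic sinusitis"),
  only_reference = c(1902, 320),
  both = c(786, 466),
  only_comparator = c(0, 346)
)

# Denominator: subjects present in at least one of the two cohorts
total <- overlap$only_reference + overlap$both + overlap$only_comparator

# Format a count as a percentage of that denominator
pct <- function(n) sprintf("%.2f%%", 100 * n / total)

overlap$only_reference_pct <- pct(overlap$only_reference)
overlap$both_pct <- pct(overlap$both)
overlap$only_comparator_pct <- pct(overlap$only_comparator)
overlap
```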

plotCohortOverlap

We can easily have a plot of the overlap with plotCohortOverlap:

plotCohortOverlap(result)

Your turn

Create 3 drug cohorts from these 5 ingredients: aspirin, acetaminophen, naproxen, amoxicillin, ibuprofen.

Hint: CodelistGenerator::getDrugIngredientCodes returns the concept codes for a drug ingredient.

Identify the subject overlap between them and create a plot to show the overlap.

summariseCohortTiming

summariseCohortTiming measures the time between entries into different cohorts, which shows which cohort tends to occur first.

Let's create some medication cohorts:

library(CodelistGenerator)

cdm$medications <- conceptCohort(
  cdm = cdm, 
  conceptSet = getDrugIngredientCodes(
    cdm = cdm, name = c("warfarin", "acetaminophen", "morphine")
  ),
  name = "medications"
)


summaryTiming <- cdm$medications |>
  summariseCohortTiming(restrictToFirstEntry = TRUE)
summaryTiming |>
  glimpse()

Rows: 42
Columns: 13
$ result_id        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ cdm_name         <chr> "Synthea synthetic health database", "Synthea synthetic health database", "Synthea synthetic …
$ group_name       <chr> "cohort_name_reference &&& cohort_name_comparator", "cohort_name_reference &&& cohort_name_co…
$ group_level      <chr> "warfarin &&& acetaminophen", "acetaminophen &&& warfarin", "acetaminophen &&& morphine", "mo…
$ strata_name      <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ strata_level     <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ variable_name    <chr> "number records", "number records", "number records", "number records", "number records", "nu…
$ variable_level   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ estimate_name    <chr> "count", "count", "count", "count", "count", "count", "count", "count", "count", "count", "co…
$ estimate_type    <chr> "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "inte…
$ estimate_value   <chr> "136", "136", "35", "35", "6", "6", "136", "136", "35", "35", "6", "6", "-33784", "-1106", "-…
$ additional_name  <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ additional_level <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…

tableCohortTiming

summaryTiming |> tableCohortTiming(timeScale = "years")

CDM name	Cohort name reference	Cohort name comparator	Variable name	Estimate name	Estimate value
Synthea synthetic health database	Acetaminophen	Morphine	Number records	N	35
			Number subjects	N	35
		Warfarin	Number records	N	136
			Number subjects	N	136
	Morphine	Warfarin	Number records	N	6
			Number subjects	N	6
	Acetaminophen	Morphine	Years between cohort entries	Median [Q25 - Q75]	15.79 [5.02 - 33.51]
				Range	-33.72 - 77.29
		Warfarin	Years between cohort entries	Median [Q25 - Q75]	53.96 [46.34 - 66.97]
				Range	-3.03 - 92.50
	Morphine	Warfarin	Years between cohort entries	Median [Q25 - Q75]	4.54 [-4.76 - 10.36]
				Range	-9.24 - 18.99
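
The underlying estimates are in days (for example "-33784" and "-1106" in the glimpse output), and a positive value means the comparator cohort starts after the reference cohort. The base R sketch below uses made-up dates to show the day difference, then checks that the year values in the table are consistent with dividing by 365.25. That divisor is inferred from the numbers shown here, not taken from the package documentation:

```r
# Made-up first-entry dates for three hypothetical subjects
entries <- data.frame(
  subject_id = c(1, 2, 3),
  reference_start = as.Date(c("2000-01-01", "2005-06-15", "2010-03-01")),
  comparator_start = as.Date(c("2001-01-01", "2005-06-15", "2009-03-01"))
)

# Days from reference entry to comparator entry (negative: comparator first)
entries$days <- as.numeric(entries$comparator_start - entries$reference_start)
entries

# Day values from the glimpse output, converted to years
years <- round(c(-33784, -1106) / 365.25, 2)
years
```

The two converted values match the range limits reported for the acetaminophen and warfarin pair, with the sign flipped when reference and comparator are swapped.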

plotCohortTiming

summaryTiming |>
  plotCohortTiming(timeScale = "years", facet = "cdm_name", colour = c("group_level"))

A density plot can be requested through plotType, but it needs density estimates in the result:

summaryTiming |>
  plotCohortTiming(timeScale = "years", facet = "cdm_name", colour = c("group_level"), plotType = "density")

Error in `plotCohortTiming()`:
! Please provide a cohort timing summarised result with density estimates (use `density = TRUE` in
  summariseCohortTiming).

Rerun summariseCohortTiming with density = TRUE and plot again:

summaryTiming <- cdm$medications |>
  summariseCohortTiming(restrictToFirstEntry = TRUE, density = TRUE)

summaryTiming |>
  plotCohortTiming(timeScale = "years", facet = "cdm_name", colour = c("group_level"), plotType = "density")


Your turn

Can you make a density plot of the three cohorts that you created before (for the overlap exercise)?


CohortCharacteristics

Thank you for your attention!