CohortConstructor

An R package to build and curate cohorts in the OMOP Common Data Model

Introduction

  • The CohortConstructor package is designed to support cohort building pipelines in R.
  • The approach taken to create cohorts is to first build a set of base cohorts, and then apply inclusion criteria to derive the final study cohorts of interest.
  • Vignettes with further information can be found on the package website.
  • Available from CRAN.

Understanding cohorts

“A cohort is a set of persons who satisfy one or more inclusion criteria for a duration of time.”

Cohorts in R

A cohort table in R is represented by four fundamental columns:

  • cohort_definition_id: An integer identifying the cohort.

  • subject_id: An identifier for the patients who are part of the cohort.

  • cohort_start_date: The date when the patient begins contributing time to the cohort.

  • cohort_end_date: The date when the patient leaves the cohort.

!! Subjects can contribute multiple times in a cohort, but their contributions cannot overlap!
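
The non-overlap rule above can be checked on a small in-memory table. This is only a sketch: the data are made up, and has_overlap() is a hypothetical helper written for this slide, not a function from CohortConstructor.

```r
# Toy cohort table held in memory (values are made up for illustration).
cohort <- data.frame(
  cohort_definition_id = c(1L, 1L, 1L),
  subject_id           = c(10L, 10L, 11L),
  cohort_start_date    = as.Date(c("2000-01-01", "2000-03-01", "2001-05-01")),
  cohort_end_date      = as.Date(c("2000-01-31", "2000-03-15", "2001-05-20"))
)

# Hypothetical helper: TRUE if any subject has overlapping records
# within the same cohort.
has_overlap <- function(x) {
  x <- x[order(x$cohort_definition_id, x$subject_id, x$cohort_start_date), ]
  n <- nrow(x)
  if (n < 2) return(FALSE)
  same_group <- x$cohort_definition_id[-1] == x$cohort_definition_id[-n] &
    x$subject_id[-1] == x$subject_id[-n]
  # A record overlaps the previous one if it starts on or before that record's end.
  any(same_group & x$cohort_start_date[-1] <= x$cohort_end_date[-n])
}

has_overlap(cohort)  # FALSE: subject 10 contributes twice, without overlap
```

Subject 10 appears twice, which is allowed because the two contributions do not share any day.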

Cohorts in R

cdm$my_cohort
# Source:   SQL [?? x 4]
# Database: DuckDB v0.10.2 [nuriamb@Windows 10 x64:R 4.2.3/C:\Users\nuriamb\AppData\Local\Temp\RtmpAHtbZB\file182c2f69266.duckdb]
   cohort_definition_id subject_id cohort_start_date cohort_end_date
                  <int>      <int> <date>            <date>         
 1                    1       1776 1920-09-07        1920-10-05     
 2                    1       2672 1972-02-18        1972-03-10     
 3                    2       3580 1993-03-16        1993-04-20     
 4                    1       1691 1953-01-27        1953-03-03     
 5                    1       2185 1942-12-24        1943-02-22     
 6                    1       4668 1957-03-31        1957-04-14     
 7                    1       5299 1988-11-10        1989-01-09     
 8                    2       5086 2003-02-20        2003-03-27     
 9                    1       4143 1968-11-08        1968-11-22     
10                    1       4680 1957-04-10        1957-04-24     
# ℹ more rows

Cohort attributes

  • settings: Relates cohort_definition_id with cohort_name, and other variables that define the cohort.

  • attrition: Inclusion logic to create each cohort and the resulting number of records and subjects at each step.

  • cohortCount: Number of records and subjects in each cohort.

  • cohortCodelist: Concepts used to derive the cohort.

Cohort attributes

  • settings
settings(cdm$my_cohort)
# A tibble: 2 × 3
  cohort_definition_id cohort_name sex   
                 <int> <chr>       <chr> 
1                    1 aspirin     Female
2                    2 ibuprofen   Female
  • attrition
attrition(cdm$my_cohort)
# A tibble: 4 × 7
  cohort_definition_id number_records number_subjects reason_id reason                    excluded_records excluded_subjects
                 <int>          <int>           <int>     <int> <chr>                                <int>             <int>
1                    1           4379            1927         1 Initial qualifying events                0                 0
2                    1           2265             980         2 Sex requirement: Female               2114               947
3                    2           2148            1451         1 Initial qualifying events                0                 0
4                    2           1107             741         2 Sex requirement: Female               1041               710
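
As a quick check on how to read the attrition table, the excluded counts at each step are the drop from the previous step. The sketch below simply redoes that arithmetic with the aspirin values printed above.

```r
# Values copied from the attrition output above (cohort 1, aspirin).
number_records  <- c(4379, 2265)
number_subjects <- c(1927, 980)

# Excluded counts at each step are the difference from the previous step.
excluded_records  <- c(0, -diff(number_records))
excluded_subjects <- c(0, -diff(number_subjects))

excluded_records   # 0 2114
excluded_subjects  # 0  947
```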

Cohort attributes

  • cohortCount
cohortCount(cdm$my_cohort)
# A tibble: 2 × 3
  cohort_definition_id number_records number_subjects
                 <int>          <int>           <int>
1                    1           2265             980
2                    2           1107             741
  • cohortCodelist
cohortCodelist(cdm$my_cohort, 1)

- aspirin (2 codes)
cohortCodelist(cdm$my_cohort, 2)

- ibuprofen (3 codes)

CohortConstructor

Function sets

 

  • Build base cohorts: Cohort construction based on concept sets or demographic requirements on the database population.

  • Applying cohort requirements: Impose study-specific inclusion and exclusion criteria on cohorts in the database.

  • Update cohort start and end dates: Modify the start and end dates of subjects in a cohort.

  • Cohort manipulation: Generate new cohorts by manipulating a set of cohorts in the database.

Build base cohorts

Functions to build base cohorts

Demographic based - Example

cdm$age_cohort <- demographicsCohort(cdm = cdm, 
                                     ageRange = c(18, 65), 
                                     name = "age_cohort")

settings(cdm$age_cohort)
# A tibble: 1 × 3
  cohort_definition_id cohort_name  age_range
                 <dbl> <chr>        <chr>    
1                    1 demographics 18_65    
cohortCount(cdm$age_cohort)
# A tibble: 1 × 3
  cohort_definition_id number_records number_subjects
                 <int>          <int>           <int>
1                    1           2694            2694
attrition(cdm$age_cohort)
# A tibble: 2 × 7
  cohort_definition_id number_records number_subjects reason_id reason                    excluded_records excluded_subjects
                 <int>          <int>           <int>     <int> <chr>                                <int>             <int>
1                    1           2694            2694         1 Initial qualifying events                0                 0
2                    1           2694            2694         2 Age requirement: 18 to 65                0                 0

Demographic based - Example

# CohortCharacteristics R package
summary(cdm$age_cohort) |> plotCohortAttrition()

Concept based

  • Base cohorts are built by domain rather than by cohort definition.
  • This approach reduces the joins to OMOP CDM tables by using all the concept sets together, making it less computationally expensive.
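
The idea of building by domain can be pictured with a toy example. This is a sketch of the general principle only, using made-up in-memory tables; it is not the package's internal implementation.

```r
# Toy drug_exposure records (made-up values; 999 matches no concept set).
drug_exposure <- data.frame(
  person_id       = c(1L, 2L, 3L, 3L),
  drug_concept_id = c(1124300L, 1125315L, 1127433L, 999L)
)

# All concept sets stacked into one lookup table, tagged by cohort name.
concepts <- data.frame(
  cohort_name     = c("diclofenac", "acetaminophen", "acetaminophen"),
  drug_concept_id = c(1124300L, 1125315L, 1127433L)
)

# One join against the domain table serves every cohort at once,
# instead of one join per cohort definition.
records <- merge(drug_exposure, concepts, by = "drug_concept_id")
table(records$cohort_name)
```

With many cohorts drawing on the same domain table, the number of passes over that table stays at one.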

Workflow to build 5 base cohorts: asthma, COPD, diabetes, acetaminophen and warfarin.

Concept based - Example

  • Get relevant codelists
drug_codes <- getDrugIngredientCodes(cdm, 
                                     name = c("diclofenac", "acetaminophen"))
drug_codes

- diclofenac (1 codes)
- acetaminophen (7 codes)
  • Create concept based cohorts
cdm$medications <- conceptCohort(cdm = cdm, 
                                 conceptSet = drug_codes, 
                                 name = "medications")
settings(cdm$medications)
# A tibble: 2 × 2
  cohort_definition_id cohort_name  
                 <int> <chr>        
1                    1 diclofenac   
2                    2 acetaminophen

Concept based - Example

  • Cohort codelist as an attribute
attr(cdm$medications, "cohort_codelist")
# Source:   SQL [8 x 4]
# Database: DuckDB v0.10.2 [nuriamb@Windows 10 x64:R 4.2.3/C:\Users\nuriamb\AppData\Local\Temp\RtmpAHtbZB\file182c2f69266.duckdb]
  cohort_definition_id codelist_name concept_id type       
                 <int> <chr>              <int> <chr>      
1                    1 diclofenac       1124300 index event
2                    2 acetaminophen    1125315 index event
3                    2 acetaminophen    1127078 index event
4                    2 acetaminophen    1127433 index event
5                    2 acetaminophen   40229134 index event
6                    2 acetaminophen   40231925 index event
7                    2 acetaminophen   40162522 index event
8                    2 acetaminophen   19133768 index event

Concept based - Measurement

  • Cohorts can be created from the measurement table with measurementCohort.

  • For example, we can create a cohort of high fever from oral temperature measurement results as follows:

fever_codelist <- list("oral_temperature_measurement" = 3006322)

cdm$temperature <- measurementCohort(
  cdm = cdm,
  conceptSet = fever_codelist,
  name = "temperature",
  valueAsNumber = list("586323" = c(39, 45)) # 586323 -> unit concept for celsius
)
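
To make the value filter concrete, here is a rough base R equivalent on a made-up measurement table. It assumes the range bounds are inclusive, and the unit concept 9999 is a placeholder standing in for any non-Celsius unit.

```r
# Toy measurement records (made-up values).
measurement <- data.frame(
  person_id              = c(1L, 2L, 3L, 4L),
  measurement_concept_id = c(3006322L, 3006322L, 3006322L, 3006322L),
  value_as_number        = c(37.2, 39.5, 40.1, 40.0),
  unit_concept_id        = c(586323L, 586323L, 586323L, 9999L)  # 9999: placeholder unit
)

# Keep oral temperature records in Celsius with a value between 39 and 45
# (bounds assumed inclusive).
high_fever <- subset(
  measurement,
  measurement_concept_id == 3006322 &
    unit_concept_id == 586323 &
    value_as_number >= 39 & value_as_number <= 45
)

high_fever$person_id  # 2 3
```

Person 4 has a value in range but is dropped because the unit does not match.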

Let’s get started!

  • Get the necessary packages
# Install packages (install only those that you don't have)
install.packages(c("CohortConstructor", "CDMConnector", "CodelistGenerator", "dplyr", "duckdb", "here"))

# Load packages 
library(CDMConnector)
library(CodelistGenerator)
library(CohortConstructor)
library(dplyr)
  • We will use the Eunomia synthetic dataset for the practicals
# Prepare R environment and download Eunomia 
Sys.setenv("EUNOMIA_DATA_FOLDER" = here::here())
downloadEunomiaData()

Let’s get started!

  • Connect to Eunomia and create the cdm object
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = eunomia_dir())

cdm <- cdm_from_con(
  con = con, 
  cdm_schema = "main",  
  write_schema = c(prefix = "my_practical", schema = "main")
)

cdm
── # OMOP CDM reference (duckdb) of Synthea synthetic health database ────────────────────────────────────────────────────────────
• omop tables: person, observation_period, visit_occurrence, visit_detail, condition_occurrence, drug_exposure,
procedure_occurrence, device_exposure, measurement, observation, death, note, note_nlp, specimen, fact_relationship, location,
care_site, provider, payer_plan_period, cost, drug_era, dose_era, condition_era, metadata, cdm_source, concept, vocabulary,
domain, concept_class, concept_relationship, relationship, concept_synonym, concept_ancestor, source_to_concept_map,
drug_strength
• cohort tables: -
• achilles tables: -
• other tables: -

Your turn

  • Create a cohort of aspirin use.
  • How many records does it have? And how many subjects?

Move to the next slide to see the results.

Results

Number of records and subjects in the cohort.

                                                                   Cohort name
CDM name                           Variable name    Estimate name  Aspirin
Synthea synthetic health database  Number records   N              4,379
                                   Number subjects  N              1,927

Applying cohort requirements

Functions to apply cohort requirements

Deriving study cohorts from base cohorts

Current approach

CohortConstructor

Requirement functions - Example

  • We can apply different inclusion and exclusion criteria using CohortConstructor’s functions in a pipeline fashion. For instance, in what follows we require

    • only the first record per person

    • subjects aged 18 to 85 at cohort start date

    • only females

    • at least 30 days of prior observation at cohort start date

cdm$medications <- cdm$medications %>% 
  requireIsFirstEntry() %>% 
  requireDemographics(
    ageRange = list(c(18, 85)),
    sex = "Female", 
    minPriorObservation = 30
  )
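
The logic behind this pipeline can be sketched on local data frames. Everything below is illustrative: the tables are made up, age is approximated from days divided by 365.25, and the thresholds are the ones passed to requireDemographics() above.

```r
# Toy cohort and person-level data (made-up values).
cohort <- data.frame(
  subject_id        = c(1L, 1L, 2L, 3L, 4L),
  cohort_start_date = as.Date(c("2010-05-01", "2012-01-10", "2011-03-03",
                                "2015-07-20", "2009-09-09"))
)
person <- data.frame(
  subject_id        = 1:4,
  sex               = c("Female", "Female", "Male", "Female"),
  date_of_birth     = as.Date(c("1980-02-02", "2000-06-15", "1970-01-01", "1950-12-31")),
  observation_start = as.Date(c("2005-01-01", "2011-02-20", "2000-01-01", "2009-09-01"))
)

# Step 1: keep the first record per person.
cohort <- cohort[order(cohort$subject_id, cohort$cohort_start_date), ]
cohort <- cohort[!duplicated(cohort$subject_id), ]

# Step 2: demographic requirements evaluated at cohort start date.
x <- merge(cohort, person, by = "subject_id")
x$age <- floor(as.numeric(x$cohort_start_date - x$date_of_birth) / 365.25)  # approximate age
x$prior_obs <- as.numeric(x$cohort_start_date - x$observation_start)        # days
kept <- x[x$age >= 18 & x$age <= 85 & x$sex == "Female" & x$prior_obs >= 30, ]

kept$subject_id  # 1
```

Subject 2 fails the age requirement, subject 3 the sex requirement, and subject 4 the prior observation requirement.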

Requirement functions - Example

Diclofenac attrition:

Requirement functions - Example

Acetaminophen attrition:

Requirement functions - Example

  • Require no more than 1 event of GI bleed in the past
cdm$medications_no_gi_bleed <- cdm$medications %>%
  requireConceptIntersect(conceptSet = list("gi_bleed" = 192671), 
                          intersections = c(0, 1),
                          window = c(-Inf, 0), 
                          name = "medications_no_gi_bleed") 
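
What the window and intersections arguments express can be sketched in base R: count the events that fall in the window for each cohort record, then keep records whose count lies in the allowed range. The data are made up, and the window is assumed to include the cohort start date itself.

```r
# Toy cohort records and GI bleed events (made-up values).
cohort <- data.frame(
  subject_id        = c(1L, 2L, 3L),
  cohort_start_date = as.Date(c("2010-01-01", "2010-01-01", "2010-01-01"))
)
gi_bleed <- data.frame(
  subject_id = c(2L, 3L, 3L, 3L),
  event_date = as.Date(c("2005-06-01", "2001-01-01", "2008-08-08", "2011-02-02"))
)

# Count events in the window (-Inf, 0): any time up to and including cohort start.
n_events <- vapply(seq_len(nrow(cohort)), function(i) {
  sum(gi_bleed$subject_id == cohort$subject_id[i] &
        gi_bleed$event_date <= cohort$cohort_start_date[i])
}, integer(1))

# intersections = c(0, 1): keep records with between 0 and 1 events.
kept <- cohort[n_events >= 0 & n_events <= 1, ]

n_events         # 0 1 2
kept$subject_id  # 1 2
```

Subject 3 has three events in total, but only the two before cohort start fall in the window.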

Requirement functions - Example

Diclofenac attrition:

Requirement functions - Example

Acetaminophen attrition:

name argument

  • Purpose: Specifies the name for the new cohort table in the database.
  • Default Behavior: If not provided, the function uses the input cohort’s name.
  • Warning: Omitting the name argument will overwrite the existing cohort table.
# Example: overwrite cohort
cdm$cohort1 <- cdm$cohort1 %>%
  requireDeathFlag()

# Example: create new cohort table
cdm$cohort2 <- cdm$cohort1 %>%
  requireDeathFlag(name = "cohort2")

Your turn

Create a new cohort named aspirin_last by applying the following criteria to the base aspirin cohort:

  • Include only the last drug exposure for each subject.
  • Include exposures that start between January 1, 1960, and December 31, 1979.
  • Exclude individuals with an amoxicillin exposure in the 7 days prior to the aspirin exposure.

Move to the next slide to see the results.

Results

Attrition of the aspirin_last cohort.

CDM name: Synthea synthetic health database
Cohort: aspirin

Reason                                                                         Number records  Number subjects  Excluded records  Excluded subjects
Initial qualifying events                                                                4379             1927                 0                  0
Restricted to last entry                                                                 1927             1927              2452                  0
cohort_start_date after 1960-01-01                                                       1511             1511               416                416
cohort_start_date before 1979-12-31                                                      1174             1174               337                337
Not in concept amoxicillin between -7 & -1 days relative to cohort_start_date            1173             1173                 1                  1

Move forward in the presentation to get some tips on how to solve the exercise

Or don’t, and try to solve it with the content seen so far and the package website :)

Tips

  • Find on the package website the function that limits cohort entries to the first or last one, and use it to keep the last drug exposure of each subject.
  • Find on the package website the function that imposes date requirements on cohort dates.
  • Use CodelistGenerator to find amoxicillin codes, and then use the relevant requirement function to impose the absence of those concepts in the pertinent time window.
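
One thing worth keeping in mind for this exercise is that the order of the requirements matters. The toy sketch below uses made-up data and two hypothetical helpers, in_range() and last_entry(), to show that "last entry, then date range" and "date range, then last entry" can give different cohorts.

```r
# Toy cohort: subject 1 has one exposure inside the date range and a later one outside it.
cohort <- data.frame(
  subject_id        = c(1L, 1L, 2L),
  cohort_start_date = as.Date(c("1975-04-01", "1985-09-01", "1970-02-02"))
)

# Hypothetical helper: keep records starting between 1960-01-01 and 1979-12-31.
in_range <- function(x) {
  x[x$cohort_start_date >= as.Date("1960-01-01") &
      x$cohort_start_date <= as.Date("1979-12-31"), ]
}

# Hypothetical helper: keep the last record per subject.
last_entry <- function(x) {
  x <- x[order(x$subject_id, x$cohort_start_date), ]
  x[!duplicated(x$subject_id, fromLast = TRUE), ]
}

a <- in_range(last_entry(cohort))  # last entry first, then date range
b <- last_entry(in_range(cohort))  # date range first, then last entry

a$subject_id  # 2
b$subject_id  # 1 2
```

The attrition shown in the results applies the last-entry restriction first, which corresponds to the first ordering.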

Update cohort start and end dates

Functions to update cohort start and end dates

Update cohort start and end dates - Example

  • We can set the end date to the end of the subject’s observation period
cdm$medications <- cdm$medications %>%
  exitAtObservationEnd()

cdm$medications
# Source:   table<main.my_practicalmedications> [?? x 4]
# Database: DuckDB v0.10.2 [nuriamb@Windows 10 x64:R 4.2.3/C:\Users\nuriamb\AppData\Local\Temp\RtmpAHtbZB\file182c7062343.duckdb]
   cohort_definition_id subject_id cohort_start_date cohort_end_date
                  <int>      <int> <date>            <date>         
 1                    1       1600 1978-01-12        2018-10-15     
 2                    2       2445 1955-11-07        1999-02-01     
 3                    2       1627 1970-06-28        2019-02-02     
 4                    2       4260 1968-12-19        2018-10-30     
 5                    1       3464 2017-04-25        2017-12-27     
 6                    2        408 1994-06-16        2018-09-06     
 7                    1       2334 1971-07-05        2018-08-02     
 8                    1       2207 2013-05-06        2018-10-11     
 9                    1       4133 2009-10-10        2019-02-14     
10                    1       2077 1972-06-04        2012-03-26     
# ℹ more rows

Update cohort start and end dates - Example

  • We can also trim start and end dates to match demographic requirements

  • For example, cohort dates can be trimmed so that subjects contribute time only while they are 20 to 40 years old and have at least 365 days of prior observation

cdm$medications_trimmed <- cdm$medications %>%
  trimDemographics(ageRange = list(c(20, 40)),
                   minPriorObservation = 365,
                   name = "medications_trimmed")
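
The trimming logic can be sketched with plain date arithmetic. This is an illustration under stated assumptions, not the package's implementation: age 40 is taken to last until the day before the 41st birthday, the prior observation requirement is read as "at least 365 days", and add_years() is a hypothetical helper.

```r
# Hypothetical helper: shift dates by a whole number of years.
add_years <- function(d, n) {
  lt <- as.POSIXlt(d)
  lt$year <- lt$year + n
  as.Date(lt)
}

# Toy records with the person-level dates needed for trimming (made-up values).
x <- data.frame(
  subject_id        = c(1L, 2L),
  date_of_birth     = as.Date(c("1970-06-15", "1990-03-10")),
  observation_start = as.Date(c("1985-01-01", "2012-01-01")),
  cohort_start_date = as.Date(c("1988-01-01", "2012-06-01")),
  cohort_end_date   = as.Date(c("2015-12-31", "2012-09-30"))
)

# Earliest eligible day: 20th birthday or 365 days of prior observation, whichever is later.
earliest <- pmax(add_years(x$date_of_birth, 20), x$observation_start + 365)
# Latest eligible day: the day before the 41st birthday.
latest <- add_years(x$date_of_birth, 41) - 1

x$cohort_start_date <- pmax(x$cohort_start_date, earliest)
x$cohort_end_date   <- pmin(x$cohort_end_date, latest)

# Records with no eligible time left are dropped.
trimmed <- x[x$cohort_start_date <= x$cohort_end_date, ]
trimmed[, c("subject_id", "cohort_start_date", "cohort_end_date")]
```

Subject 1 is trimmed to 1990-06-15 through 2011-06-14, while subject 2 is dropped because the record ends before 365 days of prior observation are reached.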

Update cohort start and end dates - Example

Diclofenac attrition:

Update cohort start and end dates - Example

Acetaminophen attrition:

Your turn

From the aspirin_last cohort…

1) Create a new cohort named aspirin_death that

  • Includes only subjects who have a record of death.

  • Subjects exit the cohort on their date of death.

2) Create a second cohort called aspirin_30days which

  • Includes subjects for the first 30 days of taking aspirin, or until the end of their drug exposure if it is shorter than 30 days.

    • Determine the number of subjects who leave after 30 days and the number who leave before 30 days.

Move to the next slide to see the results.

Results

Death cohort: cohort counts.

CDM name                           Variable name    Estimate name  Estimate value
Synthea synthetic health database  Number subjects  N              0
                                   Number records   N              0

30 Days Aspirin cohort: exit reason counts.

exit_reason                     counts
cohort_end_date                    834
start_30_days; cohort_end_date      75
start_30_days                      264

Move forward in the presentation to get some tips on how to solve the exercise

Or don’t, and try to solve it with the content seen so far! :)

Tips

1) Death cohort

  • Use the function that updates the cohort end date to the date of death. Adjust the function’s arguments to restrict the cohort to individuals who have a death event.

2) 30 Days Aspirin cohort

  • Create a new column by adding 30 days to the cohort start date.

    • !! Warning: When adding dates in SQL tables (e.g., OMOP CDM cohorts), use the dateadd function from CDMConnector.
  • Use the appropriate function to update the cohort end date to be either the new date column or the previous cohort end date, whichever comes first.
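
The "whichever comes first" logic in the second tip can be sketched on a local data frame. The data are made up, and the exit_reason labels just mimic the ones shown in the results. Note that adding 30 to a date with + works here only because the table lives in R memory; for database tables the warning above applies.

```r
# Toy cohort with three exposure lengths: shorter than, equal to, and longer than 30 days.
x <- data.frame(
  subject_id        = 1:3,
  cohort_start_date = as.Date(c("1970-01-01", "1970-01-01", "1970-01-01")),
  cohort_end_date   = as.Date(c("1970-01-15", "1970-01-31", "1970-03-01"))
)

# New column: cohort start date plus 30 days.
x$start_30_days <- x$cohort_start_date + 30

# Record which date ends the follow-up; ties are labelled with both columns.
x$exit_reason <- ifelse(
  x$cohort_end_date == x$start_30_days, "start_30_days; cohort_end_date",
  ifelse(x$cohort_end_date < x$start_30_days, "cohort_end_date", "start_30_days")
)

# Exit at the earlier of the two dates.
x$cohort_end_date <- pmin(x$cohort_end_date, x$start_30_days)

table(x$exit_reason)
```

Each of the three exit reasons appears once in this toy table.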

Cohort manipulation

Functions for cohort manipulations

Cohort manipulation functions - Example

  • We can generate a new cohort that contains people who had an exposure to both diclofenac and acetaminophen at the same time using intersectCohorts().
cdm$intersection <- cdm$medications %>% 
  CohortConstructor::intersectCohorts(
    gap = 0,
    mutuallyExclusive = TRUE,
    returnOnlyComb = FALSE,
    name = "intersection"
  )

settings(cdm$intersection)
# A tibble: 3 × 6
  cohort_definition_id cohort_name              diclofenac acetaminophen mutually_exclusive   gap
                 <int> <chr>                         <dbl>         <dbl> <lgl>              <dbl>
1                    1 diclofenac                        1             0 TRUE                   0
2                    2 acetaminophen                     0             1 TRUE                   0
3                    3 diclofenac_acetaminophen          1             1 TRUE                   0
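
The overlap period behind the combined cohort can be pictured as an interval intersection: the later of the two start dates and the earlier of the two end dates. Below is a base R sketch on made-up data, not the package's implementation.

```r
# Toy exposure periods for two drugs (made-up values).
diclofenac <- data.frame(
  subject_id = c(1L, 2L),
  start      = as.Date(c("2000-01-01", "2000-01-01")),
  end        = as.Date(c("2000-01-31", "2000-01-10"))
)
acetaminophen <- data.frame(
  subject_id = c(1L, 2L),
  start      = as.Date(c("2000-01-20", "2000-02-01")),
  end        = as.Date(c("2000-02-15", "2000-02-10"))
)

# Pair up records of the same subject and intersect the two intervals.
pairs <- merge(diclofenac, acetaminophen, by = "subject_id", suffixes = c("_d", "_a"))
pairs$start <- pmax(pairs$start_d, pairs$start_a)
pairs$end   <- pmin(pairs$end_d, pairs$end_a)

# Only pairs with a non-empty intersection count as concurrent exposure.
both <- pairs[pairs$start <= pairs$end, c("subject_id", "start", "end")]
both
```

Subject 1 is exposed to both drugs from 2000-01-20 to 2000-01-31; subject 2 takes them at different times and does not appear.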

Matched cohort - Example

  • The matchCohorts() function generates a new cohort by matching individuals to a target cohort on sex and year of birth.

  • For example, to compare individuals who take diclofenac to the general population, we can create a matched cohort as follows:

cdm$diclofenac_match <- cdm$medications %>% 
  matchCohorts(
    cohortId = 1,
    matchSex = TRUE,
    matchYearOfBirth = TRUE,
    ratio = 5,
    name = "diclofenac_match"
  )
settings(cdm$diclofenac_match)
# A tibble: 2 × 8
  cohort_definition_id cohort_name        target_table_name target_cohort_id target_cohort_name match_sex match_year_of_birth
                 <int> <chr>              <chr>                        <int> <chr>              <lgl>     <lgl>              
1                    1 diclofenac         medications                      1 diclofenac         TRUE      TRUE               
2                    2 diclofenac_matched medications                      1 diclofenac         TRUE      TRUE               
# ℹ 1 more variable: match_status <chr>
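
A simplified picture of exact matching with a ratio is sketched below. It is deliberately naive: the data are made up, matches are taken in subject_id order rather than sampled, and it ignores details a real implementation has to handle, such as not reusing the same match twice.

```r
# Toy target cohort and candidate pool (made-up values).
target <- data.frame(
  subject_id    = c(1L, 2L),
  sex           = c("Female", "Male"),
  year_of_birth = c(1960L, 1975L)
)
pool <- data.frame(
  subject_id    = 101:106,
  sex           = c("Female", "Female", "Female", "Male", "Male", "Female"),
  year_of_birth = c(1960L, 1960L, 1960L, 1975L, 1980L, 1961L)
)
ratio <- 2  # at most two matches per target subject

# Candidates share sex and year of birth with a target subject.
candidates <- merge(target, pool, by = c("sex", "year_of_birth"),
                    suffixes = c("_target", "_match"))
candidates <- candidates[order(candidates$subject_id_target,
                               candidates$subject_id_match), ]

# Number the candidates within each target subject and keep up to `ratio` of them.
rank <- ave(seq_len(nrow(candidates)), candidates$subject_id_target, FUN = seq_along)
matched <- candidates[rank <= ratio, ]

matched$subject_id_match  # 101 102 104
```

The ratio acts as an upper limit: target subject 2 has only one eligible candidate and therefore gets a single match.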

Matched cohort - Example

cohortCount(cdm$diclofenac_match)
# A tibble: 2 × 3
  cohort_definition_id number_records number_subjects
                 <int>          <int>           <int>
1                    1            416             416
2                    2            901             901
  • Attrition for the matched cohort

Your turn

Starting from the concept-based aspirin cohort…

  • Collapse aspirin exposures: Overwrite your concept-based “aspirin” cohort by merging any aspirin exposures for the same subject that occur within 7 days of each other.
  • Create yearly cohorts: From your collapsed cohort, create six separate cohorts, one for each year from 1965 to 1970. Each cohort should include only the records for its year.

Move to the next slide to see the results.

Results

Counts for each of the cohort years

                                                                   Cohort name
CDM name                           Variable name    Estimate name  Aspirin 1967  Aspirin 1965  Aspirin 1966  Aspirin 1969  Aspirin 1970  Aspirin 1968
Synthea synthetic health database  Number records   N              141           134           143           134           139           151
                                   Number subjects  N              137           132           138           130           138           145
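
The collapsing step of this exercise can be pictured with a small base R sketch. It uses made-up data and assumes that "within 7 days" means the next record starts no more than 7 days after the previous one ends; it also relies on records of the same subject not overlapping.

```r
# Toy exposure records (made-up values).
x <- data.frame(
  subject_id = c(1L, 1L, 1L, 2L),
  start      = as.Date(c("1965-01-01", "1965-01-12", "1965-03-01", "1966-05-05")),
  end        = as.Date(c("1965-01-07", "1965-01-20", "1965-03-10", "1966-05-06"))
)
gap <- 7

x <- x[order(x$subject_id, x$start), ]
n <- nrow(x)

# A new episode starts when the subject changes or the gap to the
# previous record's end exceeds `gap` days.
new_episode <- c(TRUE,
                 x$subject_id[-1] != x$subject_id[-n] |
                   as.numeric(x$start[-1] - x$end[-n]) > gap)
# The last record of an episode is the one just before the next episode begins.
is_last <- c(new_episode[-1], TRUE)

collapsed <- data.frame(
  subject_id = x$subject_id[new_episode],
  start      = x$start[new_episode],
  end        = x$end[is_last]
)
collapsed
```

The first two records of subject 1 are 5 days apart and merge into one episode from 1965-01-01 to 1965-01-20; the other two records stay as they are.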

Thank you for your attention!

Questions?