CohortConstructor

An R package to build and curate cohorts in the OMOP Common Data Model

Introduction

The CohortConstructor package is designed to support cohort-building pipelines in R.

The approach taken to create cohorts is to first build a set of base cohorts, and then apply inclusion criteria to derive the final study cohorts of interest.

The code is publicly available in OHDSI’s GitHub repository CohortConstructor.

Vignettes with further information can be found on the package website.

Available from CRAN.

Understanding cohorts

“A cohort is a set of persons who satisfy one or more inclusion criteria for a duration of time.”

Cohorts in R

A cohort table in R is represented by four fundamental columns:

cohort_definition_id: An integer identifying the cohort.
subject_id: An identifier for the patients who are part of the cohort.
cohort_start_date: The date when the patient begins contributing time to the cohort.
cohort_end_date: The date when the patient leaves the cohort.

!! Subjects can contribute multiple times in a cohort, but their contributions cannot overlap!
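
The non-overlap rule can be checked on a small in-memory table. The following is a minimal sketch in base R, not part of CohortConstructor: `hasOverlap()` is a hypothetical helper and the data are invented for illustration.

```r
# Toy cohort table with the four required columns (values are made up).
cohort <- data.frame(
  cohort_definition_id = c(1L, 1L, 1L, 2L),
  subject_id           = c(10L, 10L, 11L, 10L),
  cohort_start_date    = as.Date(c("2000-01-01", "2000-03-01", "2001-05-10", "2000-01-15")),
  cohort_end_date      = as.Date(c("2000-01-31", "2000-03-15", "2001-06-01", "2000-02-15"))
)

# hasOverlap() is a hypothetical helper, not a CohortConstructor function.
# It returns TRUE if any subject has two entries in the same cohort whose
# date ranges overlap.
hasOverlap <- function(x) {
  x <- x[order(x$cohort_definition_id, x$subject_id, x$cohort_start_date), ]
  n <- nrow(x)
  if (n < 2) return(FALSE)
  sameGroup <- x$cohort_definition_id[-1] == x$cohort_definition_id[-n] &
    x$subject_id[-1] == x$subject_id[-n]
  # After sorting, an entry overlaps the previous one in the same group
  # if it starts on or before that entry's end date.
  any(sameGroup & x$cohort_start_date[-1] <= x$cohort_end_date[-n])
}

hasOverlap(cohort)
```

Subject 10 appears in both cohort 1 and cohort 2 during the same period, which is allowed: the rule applies within a cohort, not across cohorts.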

Cohorts in R

cdm$my_cohort

# Source:   SQL [?? x 4]
# Database: DuckDB v0.10.2 [nuriamb@Windows 10 x64:R 4.2.3/C:\Users\nuriamb\AppData\Local\Temp\RtmpAHtbZB\file182c2f69266.duckdb]
   cohort_definition_id subject_id cohort_start_date cohort_end_date
                  <int>      <int> <date>            <date>         
 1                    1       1776 1920-09-07        1920-10-05     
 2                    1       2672 1972-02-18        1972-03-10     
 3                    2       3580 1993-03-16        1993-04-20     
 4                    1       1691 1953-01-27        1953-03-03     
 5                    1       2185 1942-12-24        1943-02-22     
 6                    1       4668 1957-03-31        1957-04-14     
 7                    1       5299 1988-11-10        1989-01-09     
 8                    2       5086 2003-02-20        2003-03-27     
 9                    1       4143 1968-11-08        1968-11-22     
10                    1       4680 1957-04-10        1957-04-24     
# ℹ more rows

Cohort attributes

settings: Relates cohort_definition_id with cohort_name, and other variables that define the cohort.
attrition: Inclusion logic to create each cohort and the resulting number of records and subjects at each step.
cohortCount: Number of records and subjects in each cohort.
cohortCodelist: Concepts used to derive the cohort.

Cohort attributes

settings

settings(cdm$my_cohort)

# A tibble: 2 × 3
  cohort_definition_id cohort_name sex   
                 <int> <chr>       <chr> 
1                    1 aspirin     Female
2                    2 ibuprofen   Female

attrition

attrition(cdm$my_cohort)

# A tibble: 4 × 7
  cohort_definition_id number_records number_subjects reason_id reason                    excluded_records excluded_subjects
                 <int>          <int>           <int>     <int> <chr>                                <int>             <int>
1                    1           4379            1927         1 Initial qualifying events                0                 0
2                    1           2265             980         2 Sex requirement: Female               2114               947
3                    2           2148            1451         1 Initial qualifying events                0                 0
4                    2           1107             741         2 Sex requirement: Female               1041               710
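
The excluded counts in the attrition table are simply the drop from the previous step. A short base R sketch, using the cohort 1 numbers from the output above, makes the arithmetic explicit:

```r
# Attrition rows for cohort 1, copied from the output above.
attr1 <- data.frame(
  reason_id       = c(1L, 2L),
  number_records  = c(4379L, 2265L),
  number_subjects = c(1927L, 980L)
)

# Excluded counts at each step are the drop from the previous step;
# the first step has nothing before it, so it excludes 0.
attr1$excluded_records  <- c(0L, -diff(attr1$number_records))
attr1$excluded_subjects <- c(0L, -diff(attr1$number_subjects))
attr1
```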

Cohort attributes

cohortCount

cohortCount(cdm$my_cohort)

# A tibble: 2 × 3
  cohort_definition_id number_records number_subjects
                 <int>          <int>           <int>
1                    1           2265             980
2                    2           1107             741

cohortCodelist

cohortCodelist(cdm$my_cohort, 1)


- aspirin (2 codes)

cohortCodelist(cdm$my_cohort, 2)


- ibuprofen (3 codes)

CohortConstructor

Function sets

Build base cohorts: Cohort construction based on concept sets or demographic requirements on the database population.

Applying cohort requirements: Impose study-specific inclusion and exclusion criteria on cohorts in the database.

Update cohort start and end dates: Modify the start and end dates of subjects in a cohort.

Cohort manipulation: Generate new cohorts by manipulating a set of cohorts in the database.

Build base cohorts

Functions to build base cohorts

demographicsCohort()

conceptCohort()

measurementCohort()

Demographic based - Example

cdm$age_cohort <- demographicsCohort(cdm = cdm, 
                                     ageRange = c(18, 65), 
                                     name = "age_cohort")

settings(cdm$age_cohort)

# A tibble: 1 × 3
  cohort_definition_id cohort_name  age_range
                 <dbl> <chr>        <chr>    
1                    1 demographics 18_65

cohortCount(cdm$age_cohort)

# A tibble: 1 × 3
  cohort_definition_id number_records number_subjects
                 <int>          <int>           <int>
1                    1           2694            2694

attrition(cdm$age_cohort)

# A tibble: 2 × 7
  cohort_definition_id number_records number_subjects reason_id reason                    excluded_records excluded_subjects
                 <int>          <int>           <int>     <int> <chr>                                <int>             <int>
1                    1           2694            2694         1 Initial qualifying events                0                 0
2                    1           2694            2694         2 Age requirement: 18 to 65                0                 0

Demographic based - Example

# CohortCharacteristics R package
summary(cdm$age_cohort) |> plotCohortAttrition()

Concept based

Base cohorts are built by domain rather than by cohort definition.

This approach reduces the joins to OMOP CDM tables by using all the concept sets together, making it less computationally expensive.

Workflow to build 5 base cohorts: asthma, COPD, diabetes, acetaminophen and warfarin.
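
The idea of using all concept sets together can be pictured with a toy example in base R. This is only an illustration of the principle, not the package's implementation: the concept IDs, the `drugExposure` table and the cohort names are all invented.

```r
# Hypothetical concept sets (concept IDs are invented for illustration).
conceptSets <- list(drug_a = c(101L, 102L), drug_b = c(201L))

# Toy stand-in for an OMOP domain table such as drug_exposure.
drugExposure <- data.frame(
  person_id       = c(1L, 1L, 2L, 3L),
  drug_concept_id = c(101L, 201L, 102L, 999L),
  start_date      = as.Date(c("2010-01-01", "2010-02-01", "2011-03-01", "2012-04-01"))
)

# Stack every concept set into one lookup table ...
lookup <- data.frame(
  cohort_name     = rep(names(conceptSets), lengths(conceptSets)),
  drug_concept_id = unlist(conceptSets, use.names = FALSE)
)

# ... so the domain table is joined once, not once per cohort.
records <- merge(drugExposure, lookup, by = "drug_concept_id")
table(records$cohort_name)
```

Each matched record carries its cohort name, so the single joined result can be split into the individual base cohorts afterwards.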

Concept based - Example

Get relevant codelists

drug_codes <- getDrugIngredientCodes(cdm, 
                                     name = c("diclofenac", "acetaminophen"))
drug_codes


- diclofenac (1 codes)
- acetaminophen (7 codes)

Create concept based cohorts

cdm$medications <- conceptCohort(cdm = cdm, 
                                 conceptSet = drug_codes, 
                                 name = "medications")
settings(cdm$medications)

# A tibble: 2 × 2
  cohort_definition_id cohort_name  
                 <int> <chr>        
1                    1 diclofenac   
2                    2 acetaminophen

Concept based - Example

Cohort codelist as an attribute

attr(cdm$medications, "cohort_codelist")

# Source:   SQL [8 x 4]
# Database: DuckDB v0.10.2 [nuriamb@Windows 10 x64:R 4.2.3/C:\Users\nuriamb\AppData\Local\Temp\RtmpAHtbZB\file182c2f69266.duckdb]
  cohort_definition_id codelist_name concept_id type       
                 <int> <chr>              <int> <chr>      
1                    1 diclofenac       1124300 index event
2                    2 acetaminophen    1125315 index event
3                    2 acetaminophen    1127078 index event
4                    2 acetaminophen    1127433 index event
5                    2 acetaminophen   40229134 index event
6                    2 acetaminophen   40231925 index event
7                    2 acetaminophen   40162522 index event
8                    2 acetaminophen   19133768 index event

Concept based - Measurement

Cohorts can be created from the measurement table with measurementCohort().
This is how we can create a cohort of high fever from oral temperature measurement results.

fever_codelist <- list("oral_temperature_measurement" = 3006322)

cdm$temperature <- measurementCohort(
  cdm = cdm,
  conceptSet = fever_codelist,
  name = "temperature",
  valueAsNumber = list("586323" = c(39, 45)) # 586323 -> unit concept for celsius
)
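
To see what a unit-specific value range selects, here is a toy base R sketch. The measurement rows are invented, `inRange()` is a hypothetical helper, and treating the range limits as inclusive is an assumption rather than a statement about the package.

```r
# Toy measurement records (all values invented for illustration).
measurement <- data.frame(
  person_id              = c(1L, 2L, 3L, 4L),
  measurement_concept_id = c(3006322L, 3006322L, 3006322L, 3006322L),
  unit_concept_id        = c(586323L, 586323L, 586323L, 9999L),
  value_as_number        = c(37.2, 39.5, 46.0, 40.0)
)

# Same shape as the valueAsNumber argument above: unit concept -> c(min, max).
valueAsNumber <- list("586323" = c(39, 45))

# Keep a record only if its unit has a range and the value lies inside it.
# Treating the limits as inclusive is an assumption here.
inRange <- function(unit, value, ranges) {
  r <- ranges[[as.character(unit)]]
  !is.null(r) && value >= r[1] && value <= r[2]
}
keep <- mapply(inRange, measurement$unit_concept_id, measurement$value_as_number,
               MoreArgs = list(ranges = valueAsNumber))
measurement[keep, ]
```

Only person 2 remains: person 1 is below the range, person 3 is above it, and person 4 has a value recorded in a different unit.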

Let’s get started!

Get the necessary packages

# Install packages (install only those that you don't have)
install.packages(c("CohortConstructor", "CDMConnector", "CodelistGenerator", "dplyr", "duckdb"))

# Load packages 
library(CDMConnector)
library(CodelistGenerator)
library(CohortConstructor)
library(dplyr)

We will use the Eunomia synthetic dataset for the practicals

# Prepare R environment and download Eunomia 
Sys.setenv("EUNOMIA_DATA_FOLDER" = here::here())
downloadEunomiaData()

Let’s get started!

Connect to Eunomia and create the cdm object

con <- DBI::dbConnect(duckdb::duckdb(), dbdir = eunomia_dir())

cdm <- cdm_from_con(
  con = con, 
  cdm_schema = "main",  
  write_schema = c(prefix = "my_practical", schema = "main")
)

cdm

── # OMOP CDM reference (duckdb) of Synthea synthetic health database ────────────────────────────────────────────────────────────

• omop tables: person, observation_period, visit_occurrence, visit_detail, condition_occurrence, drug_exposure,
procedure_occurrence, device_exposure, measurement, observation, death, note, note_nlp, specimen, fact_relationship, location,
care_site, provider, payer_plan_period, cost, drug_era, dose_era, condition_era, metadata, cdm_source, concept, vocabulary,
domain, concept_class, concept_relationship, relationship, concept_synonym, concept_ancestor, source_to_concept_map,
drug_strength

• cohort tables: -

• achilles tables: -

• other tables: -

Your turn

Create a cohort of aspirin use.

How many records does it have? And how many subjects?

Move to the next slide to see the results.

Results

Number of records and subjects in the cohort.

CDM name                           Variable name    Estimate name  Aspirin
Synthea synthetic health database  Number records   N              4,379
Synthea synthetic health database  Number subjects  N              1,927

Applying cohort requirements

Functions to apply cohort requirements

On demographics
- requireDemographics()
- requireAge()
- requireSex()
- requirePriorObservation()
- requireFutureObservation()

On cohort entries
- requireIsFirstEntry()
- requireIsLastEntry()

On cohort dates
- requireInDateRange()

Require presence or absence based on other cohorts, tables and concepts

Deriving study cohorts from base cohorts

Current approach

CohortConstructor

Requirement functions - Example

We can apply different inclusion and exclusion criteria using CohortConstructor’s functions in a pipeline fashion. For instance, in what follows we require
- only the first record per person
- subjects aged 18 to 85 at cohort start date
- only females
- at least 30 days of prior observation at cohort start date

cdm$medications <- cdm$medications %>% 
  requireIsFirstEntry() %>% 
  requireDemographics(
    ageRange = list(c(18, 85)),
    sex = "Female", 
    minPriorObservation = 30
  )
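
The effect of chaining requirements can be traced on a toy table in base R. This sketch mirrors the kinds of filters in the call above (first entry, age 18 to 85, female, 30 days of prior observation), but the data are invented and the demographic columns are assumed to be already attached to each entry.

```r
# Toy cohort entries with person-level data already attached (invented values).
entries <- data.frame(
  subject_id        = c(1L, 1L, 2L, 3L, 4L),
  cohort_start_date = as.Date(c("2001-01-10", "2005-06-01", "2010-03-15",
                                "2012-07-01", "2015-09-20")),
  sex               = c("Female", "Female", "Male", "Female", "Female"),
  age               = c(40L, 44L, 50L, 17L, 60L),      # age at cohort start
  prior_observation = c(400L, 2000L, 900L, 365L, 10L)  # days before cohort start
)

# 1. First entry per subject: keep the earliest start date.
entries <- entries[order(entries$subject_id, entries$cohort_start_date), ]
step1 <- entries[!duplicated(entries$subject_id), ]

# 2-4. Age 18 to 85, female, at least 30 days of prior observation.
step2 <- step1[step1$age >= 18 & step1$age <= 85, ]
step3 <- step2[step2$sex == "Female", ]
step4 <- step3[step3$prior_observation >= 30, ]

# Records remaining after each step, as an attrition table would report them.
c(initial = nrow(entries), first_entry = nrow(step1), age = nrow(step2),
  sex = nrow(step3), prior_observation = nrow(step4))
```

The order of the steps matters: restricting to the first entry before applying the demographic filters means a subject whose first entry fails a later requirement is dropped entirely, even if a later entry would have qualified.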

Requirement functions - Example

Diclofenac attrition:

Requirement functions - Example

Acetaminophen attrition:

Requirement functions - Example

Require no more than 1 event of GI bleed in the past

cdm$medications_no_gi_bleed <- cdm$medications %>%
  requireConceptIntersect(conceptSet = list("gi_bleed" = 192671), 
                          intersections = c(0, 1),
                          window = c(-Inf, 0), 
                          name = "medications_no_gi_bleed")
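
The combination of `window` and `intersections` can be read as "count the events in the window, then keep entries whose count falls in the allowed range". A toy base R sketch with invented data:

```r
# Toy cohort entries and GI bleed event dates (invented values).
cohort <- data.frame(
  subject_id        = c(1L, 2L, 3L),
  cohort_start_date = as.Date(c("2010-01-01", "2010-01-01", "2010-01-01"))
)
events <- data.frame(
  subject_id = c(2L, 3L, 3L, 1L),
  event_date = as.Date(c("2009-05-01", "2001-02-03", "2008-11-30", "2011-06-01"))
)

# Count each subject's events on or before the cohort start date,
# i.e. inside a window of c(-Inf, 0) days relative to cohort_start_date.
cohort$n_events <- vapply(seq_len(nrow(cohort)), function(i) {
  sum(events$subject_id == cohort$subject_id[i] &
        events$event_date <= cohort$cohort_start_date[i])
}, integer(1))

# intersections = c(0, 1): keep entries with zero or one event in the window.
cohort[cohort$n_events >= 0 & cohort$n_events <= 1, ]
```

Subject 1's only event falls after cohort start, so it is not counted; subject 3 has two prior events and is excluded.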

Requirement functions - Example

Diclofenac attrition:

Requirement functions - Example

Acetaminophen attrition:

`name` argument

Purpose: Specifies the name for the new cohort table in the database.

Default Behavior: If not provided, the function uses the input cohort’s name.

Warning: Omitting the name argument will overwrite the existing cohort table.

# Example: overwrite cohort
cdm$cohort1 <- cdm$cohort1 %>%
  requireDeathFlag()

# Example: create new cohort table
cdm$cohort2 <- cdm$cohort1 %>%
  requireDeathFlag(name = "cohort2")

Your turn

Create a new cohort named aspirin_last by applying the following criteria to the base aspirin cohort:

Include only the last drug exposure for each subject.

Include exposures that start between January 1, 1960, and December 31, 1979.

Exclude individuals with an amoxicillin exposure in the 7 days prior to the aspirin exposure.

Move to the next slide to see the results.

Results

Attrition of the aspirin_last cohort.

CDM name: Synthea synthetic health database
Cohort: aspirin

Reason                                                                         Number records  Number subjects  Excluded records  Excluded subjects
Initial qualifying events                                                                4379             1927                 0                  0
Restricted to last entry                                                                 1927             1927              2452                  0
cohort_start_date after 1960-01-01                                                       1511             1511               416                416
cohort_start_date before 1979-12-31                                                      1174             1174               337                337
Not in concept amoxicillin between -7 & -1 days relative to cohort_start_date            1173             1173                 1                  1

Move forward in the presentation to get some tips on how to resolve the exercise

Or don’t, and try to resolve it with the content seen so far and the package website :)

Tips

Find in the package website which function limits cohort entries to the first or last to get the last drug exposure of a subject.

Find in the package website which function imposes date requirements on cohort dates.

Use CodelistGenerator to find amoxicillin codes, and then use the relevant requirement function to impose the absence of those concepts in the pertinent time-window.

Update cohort start and end dates

Functions to update cohort start and end dates

Cohort exit
- exitAtObservationEnd()
- exitAtDeath()
- exitAtFirstDate()
- exitAtLastDate()

Cohort entry
- entryAtFirstDate()
- entryAtLastDate()

Trim start and end dates
- trimDemographics()
- trimToDateRange()

Update cohort start and end dates - Example

We can set the end date to the end of the subject’s observation period

cdm$medications <- cdm$medications %>%
  exitAtObservationEnd()

cdm$medications

# Source:   table<main.my_practicalmedications> [?? x 4]
# Database: DuckDB v0.10.2 [nuriamb@Windows 10 x64:R 4.2.3/C:\Users\nuriamb\AppData\Local\Temp\RtmpAHtbZB\file182c7062343.duckdb]
   cohort_definition_id subject_id cohort_start_date cohort_end_date
                  <int>      <int> <date>            <date>         
 1                    1       1600 1978-01-12        2018-10-15     
 2                    2       2445 1955-11-07        1999-02-01     
 3                    2       1627 1970-06-28        2019-02-02     
 4                    2       4260 1968-12-19        2018-10-30     
 5                    1       3464 2017-04-25        2017-12-27     
 6                    2        408 1994-06-16        2018-09-06     
 7                    1       2334 1971-07-05        2018-08-02     
 8                    1       2207 2013-05-06        2018-10-11     
 9                    1       4133 2009-10-10        2019-02-14     
10                    1       2077 1972-06-04        2012-03-26     
# ℹ more rows

Update cohort start and end dates - Example

We can also trim start and end dates to match demographic requirements
i.e. cohort dates can be trimmed so that subjects contribute time only while they are 20 to 40 years old and have at least 365 days of prior observation

cdm$medications_trimmed <- cdm$medications %>%
  trimDemographics(ageRange = list(c(20, 40)),
                   minPriorObservation = 365,
                   name = "medications_trimmed")
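
The date logic behind trimming can be worked through for a single subject in base R. This is a sketch under stated assumptions, not the package's implementation: `addYears()` is a hypothetical helper, the dates are invented, and age 40 is assumed to be included in the range (so time is counted up to the day before the 41st birthday).

```r
# addYears() is a hypothetical helper: the same calendar day n years later.
addYears <- function(date, n) seq(date, by = paste(n, "years"), length.out = 2)[2]

# One toy subject (invented values).
birthDate        <- as.Date("1970-06-15")
observationStart <- as.Date("1985-01-01")
cohortStart      <- as.Date("1988-03-01")
cohortEnd        <- as.Date("2015-12-31")

# Earliest allowed start: the 20th birthday, or 365 days after observation
# start, whichever is later.
minStart <- max(addYears(birthDate, 20), observationStart + 365)
# Latest allowed end: the day before the 41st birthday (assumes age 40 is included).
maxEnd <- addYears(birthDate, 41) - 1

trimmedStart <- max(cohortStart, minStart)
trimmedEnd   <- min(cohortEnd, maxEnd)
c(trimmedStart, trimmedEnd)
```

If the trimmed start falls after the trimmed end, the entry has no time that meets the requirements and would drop out of the cohort.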

Update cohort start and end dates - Example

Diclofenac attrition:

Update cohort start and end dates - Example

Acetaminophen attrition:

Your turn

From the aspirin_last cohort…

1) Create a new cohort named aspirin_death that

Includes only subjects who have a record of death.
Subjects exit the cohort on their date of death.

2) Create a second cohort called aspirin_30days which

Includes subjects for the first 30 days of taking aspirin, or until the end of their drug exposure if it is shorter than 30 days.
- Determine the number of subjects who leave after 30 days and the number who leave before 30 days.

Move to the next slide to see the results.

Results

Death cohort: cohort counts.

CDM name                           Variable name    Estimate name  Estimate value
Synthea synthetic health database  Number subjects  N              0
Synthea synthetic health database  Number records   N              0

30 Days Aspirin cohort: exit reason counts.

exit_reason	counts
cohort_end_date	834
start_30_days; cohort_end_date	75
start_30_days	264

Move forward in the presentation to get some tips on how to resolve the exercise

Or don’t, and try to resolve it with the content seen so far! :)

Tips

1) Death cohort

Use the function that updates the cohort end date to the date of death. Adjust the function’s arguments to restrict the cohort to individuals who have a death event.

2) 30 Days Aspirin cohort

Create a new column by adding 30 days to the cohort start date.
- !! Warning: When adding dates in SQL tables (e.g., OMOP CDM cohorts), use the dateadd function from CDMConnector.
Use the appropriate function to update the cohort end date to be either the new date column or the previous cohort end date, whichever comes first.
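
The "whichever comes first" logic can be tried out on an in-memory data frame before applying it to a database cohort. The sketch below uses invented data and plain base R; the column name `start_30_days` and the exit reason labels follow the results table, and on a database table the date addition would need the database-aware approach mentioned in the warning above.

```r
# Toy cohort entries (invented values).
cohort <- data.frame(
  subject_id        = c(1L, 2L, 3L),
  cohort_start_date = as.Date(c("2000-01-01", "2000-01-01", "2000-01-01")),
  cohort_end_date   = as.Date(c("2000-01-15", "2000-01-31", "2000-06-30"))
)

# New date column: 30 days after cohort start. Plain date arithmetic works
# here only because this is an in-memory data frame.
cohort$start_30_days <- cohort$cohort_start_date + 30

# Exit at whichever date comes first.
newEnd <- pmin(cohort$cohort_end_date, cohort$start_30_days)

# Record which column supplied the exit date (both on a tie).
cohort$exit_reason <- ifelse(
  cohort$cohort_end_date == cohort$start_30_days, "start_30_days; cohort_end_date",
  ifelse(cohort$cohort_end_date < cohort$start_30_days, "cohort_end_date", "start_30_days")
)
cohort$cohort_end_date <- newEnd
cohort[, c("subject_id", "cohort_end_date", "exit_reason")]
```

Counting rows by `exit_reason` then separates subjects who leave before 30 days from those who leave at 30 days.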

Cohort manipulation

Functions for cohort manipulations

collapseCohorts()

intersectCohorts()

matchCohorts()

stratifyCohorts()

subsetCohorts()

unionCohorts()

yearCohorts()

Cohort manipulation functions - Example

We can generate a new cohort that contains people who had an exposure to both diclofenac and acetaminophen at the same time using intersectCohorts().

cdm$intersection <- cdm$medications %>% 
  CohortConstructor::intersectCohorts(
    gap = 0,
    mutuallyExclusive = TRUE,
    returnOnlyComb = FALSE,
    name = "intersection"
  )

settings(cdm$intersection)

# A tibble: 3 × 6
  cohort_definition_id cohort_name              diclofenac acetaminophen mutually_exclusive   gap
                 <int> <chr>                         <dbl>         <dbl> <lgl>              <dbl>
1                    1 diclofenac                        1             0 TRUE                   0
2                    2 acetaminophen                     0             1 TRUE                   0
3                    3 diclofenac_acetaminophen          1             1 TRUE                   0
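
For a single subject, the intersection of two exposures is the period from the later start to the earlier end. The base R sketch below uses invented dates; the way the single-drug periods are cut is an interpretation of mutually exclusive cohorts for this particular configuration, not a description of the package internals.

```r
# One exposure per drug for a single toy subject (invented dates).
dicloStart <- as.Date("2010-01-01"); dicloEnd <- as.Date("2010-01-20")
acetaStart <- as.Date("2010-01-10"); acetaEnd <- as.Date("2010-02-05")

# Time on both drugs: latest start to earliest end, if that range is non-empty.
bothStart  <- max(dicloStart, acetaStart)
bothEnd    <- min(dicloEnd, acetaEnd)
hasOverlap <- bothStart <= bothEnd

# With mutually exclusive cohorts, the single-drug periods keep only the time
# outside the overlap. These two lines are specific to this example, where
# diclofenac starts first and acetaminophen ends last.
dicloOnly <- c(dicloStart, bothStart - 1)
acetaOnly <- c(bothEnd + 1, acetaEnd)

list(both = c(bothStart, bothEnd), diclofenac_only = dicloOnly,
     acetaminophen_only = acetaOnly)
```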

Matched cohort - Example

The matchCohorts() function generates a new cohort by matching individuals to a target cohort on age (year of birth) and sex.
For example, to compare individuals who take diclofenac to the general population, we can create a matched cohort as follows:

cdm$diclofenac_match <- cdm$medications %>% 
  matchCohorts(
    cohortId = 1,
    matchSex = TRUE,
    matchYearOfBirth = TRUE,
    ratio = 5,
    name = "diclofenac_match"
  )
settings(cdm$diclofenac_match)

# A tibble: 2 × 8
  cohort_definition_id cohort_name        target_table_name target_cohort_id target_cohort_name match_sex match_year_of_birth
                 <int> <chr>              <chr>                        <int> <chr>              <lgl>     <lgl>              
1                    1 diclofenac         medications                      1 diclofenac         TRUE      TRUE               
2                    2 diclofenac_matched medications                      1 diclofenac         TRUE      TRUE               
# ℹ 1 more variable: match_status <chr>

Matched cohort - Example

cohortCount(cdm$diclofenac_match)

# A tibble: 2 × 3
  cohort_definition_id number_records number_subjects
                 <int>          <int>           <int>
1                    1            416             416
2                    2            901             901

Attrition for the matched cohort
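
Exact matching on sex and year of birth with a fixed ratio can be illustrated on toy data. This base R sketch is a simplification and not how matchCohorts() works internally: the data are invented, it takes the first candidates instead of sampling, and it does not prevent the same candidate being reused for two targets in the same stratum.

```r
# Toy target subjects and candidate pool (invented values).
target <- data.frame(subject_id = c(1L, 2L), sex = c("Female", "Male"),
                     year_of_birth = c(1960L, 1975L))
pool <- data.frame(
  subject_id    = 101:106,
  sex           = c("Female", "Female", "Female", "Male", "Female", "Male"),
  year_of_birth = c(1960L, 1960L, 1960L, 1975L, 1961L, 1980L)
)
ratio <- 2

# For each target, take up to `ratio` candidates with the same sex and
# year of birth.
matched <- do.call(rbind, lapply(seq_len(nrow(target)), function(i) {
  ok <- pool$sex == target$sex[i] & pool$year_of_birth == target$year_of_birth[i]
  picks <- head(pool[ok, ], ratio)
  if (nrow(picks) == 0) return(NULL)
  cbind(target_id = target$subject_id[i], picks)
}))
matched
```

The second target finds only one candidate, so the matched set can be smaller than `ratio` times the number of targets, which is also the pattern in the counts shown above.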

Your turn

Starting from the concept-based aspirin cohort…

Collapse aspirin exposures: Overwrite your concept-based “aspirin” cohort by merging any aspirin exposures for the same subject that occur within 7 days of each other.

Create yearly cohorts: From your collapsed cohort, create six separate cohorts. Each cohort should include records for one specific year from the following list: 1965, 1966, 1967, 1968, 1969, and 1970.

Move to the next slide to see the results.

Results

Counts for each of the cohort years

CDM name                           Variable name    Estimate name  Aspirin 1967  Aspirin 1965  Aspirin 1966  Aspirin 1969  Aspirin 1970  Aspirin 1968
Synthea synthetic health database  Number records   N              141           134           143           134           139           151
Synthea synthetic health database  Number subjects  N              137           132           138           130           138           145
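
The collapsing step of this exercise can be pictured on a toy table in base R. The sketch assumes that "within 7 days" means the next exposure starts no more than 7 days after the latest end date seen so far; the dates are invented and this is not the package's implementation.

```r
# Toy exposures for one subject, sorted by start date (invented values).
exposures <- data.frame(
  start = as.Date(c("1965-01-01", "1965-01-12", "1965-03-01")),
  end   = as.Date(c("1965-01-07", "1965-01-20", "1965-03-10"))
)
gap <- 7

# Latest end date among all earlier exposures (NA for the first row).
prevEnd <- c(NA, head(cummax(as.numeric(exposures$end)), -1))
# A new episode starts when the gap since that date exceeds `gap` days.
newEpisode <- is.na(prevEnd) | as.numeric(exposures$start) - prevEnd > gap
episode <- cumsum(newEpisode)

# One row per episode: earliest start and latest end.
collapsed <- data.frame(
  start = as.Date(as.vector(tapply(as.numeric(exposures$start), episode, min)),
                  origin = "1970-01-01"),
  end   = as.Date(as.vector(tapply(as.numeric(exposures$end), episode, max)),
                  origin = "1970-01-01")
)
collapsed
```

The first two exposures are 5 days apart and merge into one record; the third starts 40 days later and stays separate.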

Thank you for your attention!

Questions?
