Data Description
Data Files
There are three main data files in CSV format included with this dataset:
all_uti_resist_labels.csv
: Contains resistance testing results for the most common antibiotics used to treat UTIs (nitrofurantoin (NIT), trimethoprim-sulfamethoxazole (SXT), ciprofloxacin (CIP), and levofloxacin (LVX)) for all specimens in our UTI cohort.
all_prescriptions.csv
: Contains empiric clinician prescription selections for specimens in the uncomplicated UTI cohort only. By construction, our uncomplicated cohort is filtered to only contain specimens for which clinicians treated the infection with exactly one treatment in { NIT, SXT, CIP, LVX } in the empiric treatment window. We do not include prescriptions for the other specimens in the dataset, as clinicians may have treated other specimens with multiple antibiotics, or with antibiotics from outside this set.
all_uti_features.csv
: Contains constructed features for all specimens. Also contains columns indicating membership of each specimen in training vs. test set and membership in the uncomplicated UTI cohort.
We also include a data dictionary file (data_dictionary.csv) for all of the columns in the files above. This data dictionary has five columns: file and column give the relevant file and column name, description gives additional information, and from and until indicate the window (if applicable) over which the feature is calculated. For instance, if from is 14 and until is 7, this indicates that the feature is calculated over the window of 14 days prior to specimen collection, up to 7 days prior to specimen collection. All windows are inclusive.
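As an illustration of how a from/until window can be read, the following minimal sketch checks whether an event date falls inside a feature's window. The helper name in_window and the example dates are hypothetical and not part of the dataset; the sketch assumes windows are counted in whole days before specimen collection, with both endpoints inclusive as stated above.

```python
from datetime import date


def in_window(event_date, collection_date, from_days, until_days):
    """Return True if event_date lies in the inclusive window that starts
    from_days before collection and ends until_days before collection.

    Hypothetical helper for illustration only.
    """
    days_before = (collection_date - event_date).days
    return until_days <= days_before <= from_days


collection = date(2012, 3, 15)  # made-up specimen collection date
print(in_window(date(2012, 3, 8), collection, 14, 7))   # event 7 days before
print(in_window(date(2012, 3, 10), collection, 14, 7))  # event 5 days before
```

With from = 14 and until = 7, an event 7 or 14 days before collection counts toward the feature, while an event 5 or 15 days before does not.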
Columns in all_uti_resist_labels.csv
These are as follows:
example_id
: Unique specimen ID used to link between files.
NIT
: Binary indicator of resistance to nitrofurantoin (1 if resistant).
SXT
: Binary indicator of resistance to trimethoprim-sulfamethoxazole (1 if resistant).
CIP
: Binary indicator of resistance to ciprofloxacin (1 if resistant).
LVX
: Binary indicator of resistance to levofloxacin (1 if resistant).
is_train
: Used to denote membership in training set (2007-13).
uncomplicated
: Used to denote membership in uncomplicated UTI cohort.
Note that if no test result is available for a given antibiotic in (NIT, SXT, CIP, LVX), then the corresponding column will be empty.
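Because an empty cell means "not tested" rather than "susceptible", it can help to keep the two cases apart when loading the labels. The sketch below uses only the Python standard library and a small made-up sample in the layout described above; the helper names parse_label and read_labels are hypothetical, and the sketch assumes that non-empty cells hold a numeric 0 or 1.

```python
import csv
import io

# Made-up rows in the layout described above (not real data).
SAMPLE = """example_id,NIT,SXT,CIP,LVX,is_train,uncomplicated
1,0,1,0,0,1,1
2,,0,1,,1,0
"""

ANTIBIOTICS = ["NIT", "SXT", "CIP", "LVX"]


def parse_label(cell):
    """Map 1/0 to True/False and an empty cell to None (no test result)."""
    cell = cell.strip()
    if cell == "":
        return None
    return float(cell) == 1.0  # assumption: non-empty cells are numeric 0 or 1


def read_labels(handle):
    """Return {example_id: {antibiotic: True/False/None}}."""
    labels = {}
    for row in csv.DictReader(handle):
        labels[row["example_id"]] = {ab: parse_label(row[ab]) for ab in ANTIBIOTICS}
    return labels


labels = read_labels(io.StringIO(SAMPLE))
print(labels["2"])
```

To read the real file, the same function could be passed an open file handle for all_uti_resist_labels.csv instead of the in-memory sample.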
Columns in all_prescriptions.csv
example_id
: Unique specimen ID used to link between files.
prescription
: Observed empiric prescription (one of NIT, SXT, CIP, LVX).
is_train
: Used to denote membership in training set (2007-13).
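Since example_id links the files, the prescription for a specimen can be looked up against that specimen's resistance labels. The following is a minimal sketch of such a join on made-up rows; the helper index_by_id and the variable names are hypothetical, and this is only one possible use of the link, not an analysis prescribed by the dataset.

```python
import csv
import io

# Made-up rows in the layouts described above (not real data).
LABELS_CSV = """example_id,NIT,SXT,CIP,LVX
1,0,1,0,0
2,0,0,1,1
"""
PRESCRIPTIONS_CSV = """example_id,prescription,is_train
1,SXT,1
2,NIT,1
"""


def index_by_id(text):
    """Read CSV text into a dict keyed by example_id (hypothetical helper)."""
    return {row["example_id"]: row for row in csv.DictReader(io.StringIO(text))}


labels = index_by_id(LABELS_CSV)
prescriptions = index_by_id(PRESCRIPTIONS_CSV)

# For each prescription, read the label column named after the prescribed drug.
resistant_to_prescribed = {}
for example_id, rx in prescriptions.items():
    cell = labels[example_id][rx["prescription"]]
    # An empty cell means no test result, so the answer is unknown (None).
    resistant_to_prescribed[example_id] = None if cell == "" else float(cell) == 1.0

print(resistant_to_prescribed)
```

Here specimen 1 was treated with SXT and was resistant to it, while specimen 2 was treated with NIT and was susceptible.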
Columns in all_uti_features.csv
This section contains detailed descriptions of the feature columns found in all_uti_features.csv.
This section also includes notes on how missing data is handled when it is not implicit in the definition. For instance, most features are binary indicators, where a 1 indicates the presence of an observed element (e.g., a previous infection) and a 0 indicates that an element was not observed, which also covers cases where data might be missing.
Specimen Indicators
First, we note that the following columns are included in this file and are defined as in the other files:
example_id
: Unique specimen ID used to link between files.
is_train
: Used to denote membership in training set (2007-13).
uncomplicated
: Used to denote membership in uncomplicated UTI cohort.
Basic patient demographics
Each feature in this category conveys basic demographic information:
demographics - age
: Patient age (in years) at time of specimen collection, calculated using recorded date of birth. All ages >= 90 are clipped and set to 90. There are no missing values for this feature.
demographics - is_white
: Binary indicator for whether patient is white (1) or non-white (0). If race is not recorded (which occurs in 3% of specimens), this feature is 0.
demographics - is_veteran
: Binary indicator for whether patient is a veteran (1) or non-veteran (0). If veteran status is not recorded, this feature is 0.
Prior antibiotic resistance
Each feature in this category is a binary indicator for whether a patient had a resistant test result to a particular antibiotic in a specified time window preceding the current specimen. These column names are of the form:
micro - prev resistance [ANTIBIOTIC] [TIME WINDOW]
For each antibiotic, we construct binary features for prior resistance in the 14, 30, 90, and 180 days preceding specimen collection, as well as any record of prior resistance to this treatment (in which case ‘ALL’ is used for [TIME WINDOW]). Previous resistance within less than 7 days is excluded from these features, to prevent label leakage.
Antibiotic names are given as abbreviations, in accordance with those established by the American Society for Microbiology.
Note that a 0 for these variables implicitly includes instances where data is missing. For instance, a 0 for micro - prev resistance SXT 90 could indicate that an antibiotic resistance test was done and that the infection was found to be susceptible, or that no test results exist for this patient in that time window.
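The window logic described in this section can be sketched as follows. This is not the pipeline used to build the dataset: the function name prev_resistance_features and its input (a list of how many days before collection each resistant result was recorded) are hypothetical, chosen only to illustrate the 7-day exclusion and the fact that a 0 also covers the case of no prior results.

```python
WINDOWS = [14, 30, 90, 180]
EXCLUDE_BELOW = 7  # results fewer than 7 days before collection are ignored


def prev_resistance_features(antibiotic, resistant_days_before):
    """Build binary prior-resistance indicators for one antibiotic.

    resistant_days_before: hypothetical input listing, for each prior
    resistant result, the number of days before specimen collection.
    """
    eligible = [d for d in resistant_days_before if d >= EXCLUDE_BELOW]
    features = {}
    for window in WINDOWS:
        name = f"micro - prev resistance {antibiotic} {window}"
        features[name] = int(any(d <= window for d in eligible))
    # 'ALL' covers any eligible prior record, regardless of how old it is.
    features[f"micro - prev resistance {antibiotic} ALL"] = int(bool(eligible))
    return features


features = prev_resistance_features("SXT", [3, 45])
for name, value in features.items():
    print(name, value)
```

In this example the result from 3 days before collection is dropped, so only the 90, 180, and ALL indicators are set. A patient with no prior results at all would get 0 in every column, which is indistinguishable from a patient whose prior tests were all susceptible.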
Prior antibiotic exposures
Each feature in this category is a binary indicator for whether a patient was treated with a particular antibiotic or class of antibiotic in a specified time window preceding the current specimen. These column names are of the form:
medication [TIME WINDOW] - [ANTIBIOTIC]
ab subtype [TIME WINDOW] - [ANTIBIOTIC SUBCLASS]
ab class [TIME WINDOW] - [ANTIBIOTIC CLASS]
For each antibiotic, we construct features for prior exposure in the 7, 14, 30, 90, and 180 days preceding specimen collection, as well as any record of prior exposure to this treatment (in which case ‘ALL’ is used for [TIME WINDOW]). Previous exposure within less than 2 days is excluded from these features, to prevent leakage of the empiric treatment decision.
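When working with many exposure columns, it can be convenient to split a column name back into its parts. The sketch below assumes the column names follow the three patterns above exactly, with single spaces and a numeric or ALL time window; the function name and the example column names are hypothetical and should be checked against the data dictionary.

```python
import re

# Assumed layout: "<kind> <window> - <agent>", with kind one of the three prefixes.
PATTERN = re.compile(r"^(medication|ab subtype|ab class) (\d+|ALL) - (.+)$")


def parse_exposure_column(name):
    """Return (kind, window, agent) for an exposure column, or None otherwise."""
    match = PATTERN.match(name)
    if match is None:
        return None
    return match.groups()


# Hypothetical column names used only for illustration.
print(parse_exposure_column("medication 14 - ciprofloxacin"))
print(parse_exposure_column("ab class ALL - fluoroquinolones"))
print(parse_exposure_column("demographics - age"))
```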
Prior infecting organisms
Each feature in this category is a binary indicator for whether a patient was previously infected with a specific pathogen in a time window preceding the current specimen. These column names are of the form:
micro - prev organism [PATHOGEN NAME] [TIME WINDOW]
For each pathogen of interest, we construct features for prior infection in the 14, 30, 90, and 180 days preceding specimen collection. Prior infecting organisms within less than 7 days are excluded from these features, to prevent leakage of the current infecting organism.
Elixhauser comorbidities
Each feature in this category is a binary indicator for whether a patient was previously diagnosed with a given comorbidity in a time window preceding the current specimen. We use the comorbidities that comprise the Elixhauser Comorbidity Index [Quan et al. 2015], and extract these from ICD-9 and ICD-10 codes using the icd package in R. These column names are of the form:
comorbidity [TIME WINDOW] - [COMORBIDITY NAME]
For each comorbidity of interest, we construct features for prior diagnosis in the 7, 14, 30, 90, and 180 days preceding specimen collection. This includes comorbidities recorded up until the date of specimen collection (inclusive).
Hospital department type (inpatient, outpatient, ER, ICU)
Each feature in this category is a binary indicator for whether the current specimen was collected in a specific department of the hospital; there is a feature for collection in inpatient (IP) settings, outpatient (OP) settings, ER, or the ICU. These binary features are included as:
hosp ward - [IP/OP/ER/ICU]
Note that due to the filtering and merging of different specimens (see “Filtering and Merging of Specimens” in the methods section) into a single sample, it is possible to see multiple hospital departments for the same infection.
When no hospital department is recorded, all of these features will be 0.
Colonization pressure (local rate of resistance)
We define the colonization pressure of an antibiotic as the rate of resistance to that agent within a specified location and time period. We compute the colonization pressure for a given specimen as the proportion of all urinary specimens resistant to an antibiotic in the period ranging from 7 days before to 90 days before the date of specimen collection.
We compute colonization pressure at three location hierarchies, across 25 antibiotics. These column names are of the form:
selected micro - colonization pressure [ANTIBIOTIC] 90 - [granular level]
: Resistance rate across specimens collected at the same floor/ward/clinic.
selected micro - colonization pressure [ANTIBIOTIC] 90 - [higher level]
: Resistance rate across specimens collected at the same hospital (MGH or BWH) and department type (inpatient, outpatient, ICU, ER).
selected micro - colonization pressure [ANTIBIOTIC] 90 - [overall]
: Resistance rate across all specimens.
Antibiotic names are given as abbreviations, in accordance with those established by the American Society for Microbiology.
When there are no previous visits to the given location in the given time window, these features will default to 0.
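The definition of colonization pressure above can be illustrated with a short sketch. The function name and its input format (pairs of days before collection and a resistance flag, for urinary specimens from one location) are hypothetical; the sketch also assumes that the denominator counts only specimens with a result for the antibiotic in question, which the description does not spell out.

```python
def colonization_pressure(prior_specimens, from_days=90, until_days=7):
    """Proportion of resistant specimens in the inclusive window of
    from_days to until_days before the current specimen's collection.

    prior_specimens: hypothetical list of (days_before, is_resistant) pairs
    for urinary specimens collected at the same location.
    """
    in_window = [
        is_resistant
        for days_before, is_resistant in prior_specimens
        if until_days <= days_before <= from_days
    ]
    if not in_window:
        return 0.0  # no specimens at this location in the window: default to 0
    return sum(in_window) / len(in_window)


history = [(3, True), (10, True), (40, False), (90, False), (120, True)]
print(colonization_pressure(history))  # uses only the specimens at 10, 40, 90 days
print(colonization_pressure([]))
```

Note that the default of 0 means a location with no prior specimens looks the same as a location where every prior specimen was susceptible.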
Prior visits to skilled nursing facilities
We also include a feature for whether or not a patient has been to a skilled nursing facility in the past 7, 14, 30, and 90 days. This custom feature is included as:
custom [TIME WINDOW] - nursing home
Visits are identified by either of the following: (a) a CPT code in the range 99304-99318, or (b) “nursing facility” included in the procedure description.
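The two-part rule above can be sketched as a small predicate. The function name and argument format are hypothetical, and case-insensitive matching of the description is an assumption; the source does not say how case was handled.

```python
def is_nursing_facility_visit(cpt_code, description):
    """Apply rule (a) CPT code in 99304-99318 or rule (b) description text match.

    cpt_code: string or None; description: string or None (hypothetical inputs).
    """
    in_range = (
        cpt_code is not None
        and cpt_code.isdigit()
        and 99304 <= int(cpt_code) <= 99318
    )
    # Assumption: the text match ignores case.
    in_text = "nursing facility" in (description or "").lower()
    return in_range or in_text


print(is_nursing_facility_visit("99310", ""))
print(is_nursing_facility_visit(None, "Initial Nursing Facility care"))
print(is_nursing_facility_visit("99213", "Office visit"))
```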
Other infection sites
All specimens in this dataset are from the urinary tract. However, some patients have other specimens collected (on the same day as the urinary specimen) from other infection sites. We encode this information using the binary features included as:
infection_sites - [INFECTION SITE]
Prior procedures
Each feature in this category is a binary indicator for whether a patient previously received a specific procedure in a time window preceding the current specimen. For each procedure of interest, we construct binary indicators for the presence of each procedure in the window of 0 - 180 days preceding specimen collection. This includes procedures up until the date of specimen collection (inclusive).
These column names are of the form:
procedure 180 - had cvc
: Placement of a central venous catheter (CVC), defined as either (a) CPT code in 36555-36598, (b) ICD9 code 38.97 or in 999.31-999.33, or (c) “central venous catheter” included in the procedure description.
procedure 180 - had surgery
: Any surgical procedure, defined as either (a) CPT code in 10021-69990, or (b) “surgery” or “surgical” included in the procedure description.
procedure 180 - had mechanical ventilation
: Mechanical ventilation, defined as “ventilation” included in the procedure description.
procedure 180 - had hemodialysis
: Hemodialysis, defined as either (a) CPT code in 90935-90940 or (b) “hemodialysis” but not “than hemodialysis” included in the procedure description.
procedure 180 - had parenteral nutrition
: Parenteral nutrition, defined as “parenteral” and “nutrition” included (in that order) in the procedure description.
Note: The uncomplicated UTI cohort is defined to exclude any specimen where any of these features are present in the past 90 days. Hence, these features should be interpreted as giving information on the window of 91-180 days for uncomplicated UTIs.
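Two of the text-matching rules in the procedure definitions above are easy to misread, so the sketch below spells out one possible interpretation. The function names are hypothetical, case-insensitive matching is an assumption, and the hemodialysis rule is read here as "the description contains hemodialysis and does not contain than hemodialysis anywhere"; the original extraction code may have differed.

```python
import re


def had_hemodialysis(cpt_code, description):
    """Rule (a): CPT code in 90935-90940; rule (b): description text match."""
    text = (description or "").lower()  # assumption: matching ignores case
    in_range = (
        cpt_code is not None
        and cpt_code.isdigit()
        and 90935 <= int(cpt_code) <= 90940
    )
    # Assumed reading: "than hemodialysis" anywhere in the text rules out a match.
    in_text = "hemodialysis" in text and "than hemodialysis" not in text
    return in_range or in_text


def had_parenteral_nutrition(description):
    """True if 'parenteral' appears before 'nutrition' in the description."""
    text = (description or "").lower()
    return re.search(r"parenteral.*nutrition", text) is not None


print(had_hemodialysis(None, "Dialysis procedure other than hemodialysis"))
print(had_parenteral_nutrition("Total parenteral nutrition"))
```

The "than hemodialysis" exclusion presumably guards against descriptions such as "dialysis other than hemodialysis", which mention the word without indicating the procedure.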