Cohort Selection

This dataset focuses on uncomplicated urinary tract infections (UTIs) and on the antibiotics commonly prescribed for them: nitrofurantoin (NIT), trimethoprim-sulfamethoxazole (SXT), ciprofloxacin (CIP), and levofloxacin (LVX).

A specimen is classified as an uncomplicated UTI if its infection site was recorded as urinary and all of the following criteria are met:

  • Age in [18, 55]
  • Female
  • No diagnosis indicating pregnancy in the past 90 days
  • No selected procedure* in the past 90 days
  • No indication of pyelonephritis, based on matching the string “pyelo” against diagnosis names
  • Exactly one antibiotic in (NIT, SXT, LVX, CIP) prescribed
  • All AMR test results for NIT, SXT, LVX, CIP are available

*Selected procedures used to exclude specimens are as follows: (i) placement of a central venous catheter (CVC), (ii) mechanical ventilation, (iii) parenteral nutrition, (iv) hemodialysis, and (v) any surgical procedure.
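The criteria above can be sketched as a single predicate. This is an illustrative sketch only, not the code used to build the dataset: the Specimen record and all of its field names are hypothetical, the “pyelo” match is assumed to be case-insensitive, and “exactly one antibiotic” is assumed to count only the four study antibiotics.

```python
from dataclasses import dataclass

STUDY_ANTIBIOTICS = ("NIT", "SXT", "LVX", "CIP")


@dataclass
class Specimen:
    # Hypothetical record layout; these fields are not part of the release.
    site: str
    age: int
    female: bool
    pregnancy_dx_past_90d: bool
    selected_procedure_past_90d: bool
    diagnosis_names: list
    prescribed: list   # antibiotic codes prescribed in the empiric window
    amr_results: dict  # antibiotic code -> phenotype, None if not tested


def is_uncomplicated(s):
    """Return True if the specimen meets every criterion listed above."""
    # Assumption: only the four study antibiotics are counted here.
    study_rx = {a for a in s.prescribed if a in STUDY_ANTIBIOTICS}
    return (
        s.site == "urinary"
        and 18 <= s.age <= 55
        and s.female
        and not s.pregnancy_dx_past_90d
        and not s.selected_procedure_past_90d
        # Assumption: the "pyelo" string match ignores case.
        and not any("pyelo" in d.lower() for d in s.diagnosis_names)
        and len(study_rx) == 1
        and all(s.amr_results.get(a) is not None for a in STUDY_ANTIBIOTICS)
    )
```

In the released files this logic has already been applied, so users only need the uncomplicated indicator.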

The full dataset also includes a broader set of urine specimens that do not satisfy the above conditions. This broader cohort includes many patients with complex infections that might be treated with a range of antibiotics. Specimens that meet our definition of an “uncomplicated UTI” are marked with the binary indicator uncomplicated in the relevant CSV files.
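As a minimal sketch of how the indicator might be used, the snippet below selects the uncomplicated cohort from CSV text. The inline sample data is made up for illustration, and the encoding of the indicator as the string "1" is an assumption; only the column names example_id, is_train, and uncomplicated come from this documentation.

```python
import csv
import io

# Made-up sample rows; real files contain many more columns.
SAMPLE_CSV = """example_id,is_train,uncomplicated
a1,1,1
b2,0,0
c3,0,1
"""


def uncomplicated_ids(csv_text):
    """Return the example_id of every row flagged as an uncomplicated UTI."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Assumption: the binary indicator is stored as "1" for positive rows.
    return [row["example_id"] for row in reader if row["uncomplicated"] == "1"]
```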

Filtering and Merging of Specimens

Multiple microbiology specimens can be taken from a single suspected site of infection. To mimic the empiric treatment setting, we restrict to the first specimen from an infection and exclude as duplicates any specimens taken from the same body site within a 14-day period. Specimens taken on the same day are merged if they come from the same body site and kept separate if they come from different body sites.

Note that while we commonly refer to “specimens” in the remainder of the documentation, this should be taken to include both single specimens and multiple specimens that have been merged (as described above) into a single observation.
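The filtering rule can be sketched as follows. This is one possible reading rather than the procedure actually used: it assumes each specimen is a (patient_id, body_site, day) tuple with day as an integer offset, that merging same-day, same-site specimens amounts to collapsing them into one record, and that the 14-day period is measured from the most recently kept specimen. All names are hypothetical.

```python
def first_specimens(records, window_days=14):
    """Keep the first specimen per patient and body site in each window.

    records: iterable of (patient_id, body_site, day) tuples.
    """
    # Same-day specimens from the same site collapse into one observation;
    # same-day specimens from different sites stay separate.
    merged = sorted(set(records))
    kept = []
    last_kept_day = {}
    for patient, site, day in merged:
        key = (patient, site)
        # Assumption: the window restarts at each kept specimen.
        if key in last_kept_day and day - last_kept_day[key] < window_days:
            continue
        last_kept_day[key] = day
        kept.append((patient, site, day))
    return kept
```

Measuring the window from the previous specimen of any kind, kept or not, would be an equally plausible reading and gives different results for long runs of closely spaced specimens.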

Identification of Prescriptions

We define the empiric antibiotic prescription as any antibiotic prescribed in the window from two days before to one day after specimen collection. As noted above, the uncomplicated UTI cohort contains only specimens for which exactly one antibiotic in (NIT, SXT, LVX, CIP) was prescribed in this window. For uncomplicated UTIs, the vast majority of these prescriptions are observed on the same day as the specimen (91.1%) or the day after (8.0%), with a small fraction occurring on the day before (0.8%) or two days before (0.1%).
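The prescription window can be illustrated with a short sketch. It assumes prescriptions are given as (antibiotic, day) pairs with integer day offsets; the function and parameter names are hypothetical, and the released dataset itself contains no dates.

```python
def empiric_prescriptions(specimen_day, prescriptions, days_before=2, days_after=1):
    """Return (antibiotic, offset) pairs that fall inside the empiric window.

    offset is the prescription day minus the specimen collection day, so
    -2 means two days before collection and +1 means the day after.
    """
    selected = []
    for antibiotic, day in prescriptions:
        offset = day - specimen_day
        if -days_before <= offset <= days_after:
            selected.append((antibiotic, offset))
    return selected
```

For a specimen collected on day 10, prescriptions on days 8 through 11 are retained and all others are ignored.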

Derivation of Resistance Labels

The medical record contains microbiological testing results for all specimens sent to the labs at Massachusetts General Hospital (MGH) and Brigham & Women’s Hospital (BWH). This raw data includes the identity of the infecting pathogen and susceptibility testing to various antibiotics. The data contains the metric used for each test (minimum inhibitory concentration (MIC) vs. disk diameter (DD)) and the numeric value of the corresponding test result, as well as the date and location of specimen collection.

For this dataset release, we have transformed these numeric results into categorical phenotypes by applying the published 2017 CLSI clinical breakpoints [CLSI, 2017], which convert the raw semi-quantitative and quantitative results into one of three phenotypes: susceptible (S), intermediate (I), and resistant (R). We treat both intermediate and resistant phenotypes as resistant, which is generally how they are treated in clinical practice.
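The final step, collapsing the three phenotypes into a binary label, can be sketched as below. The breakpoint lookup itself is not reproduced here because the thresholds must come from the CLSI tables; the function name is hypothetical.

```python
def is_resistant(phenotype):
    """Map a CLSI phenotype to the binary resistance label.

    Intermediate (I) and resistant (R) both count as resistant;
    susceptible (S) does not.
    """
    if phenotype not in {"S", "I", "R"}:
        raise ValueError(f"unknown phenotype: {phenotype!r}")
    return phenotype in {"I", "R"}
```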


Each specimen is assigned a unique, randomly generated example_id, which is used to link records across the CSV files.

No dates or times are included in this dataset, and age is censored so that any individual with an age >89 is recorded as having an age of 90.

Any binary feature whose positive examples all come from fewer than 20 unique patients is dropped. All colonization pressure features (see “Data Description”) are rounded to the nearest 0.01.
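These two rules can be sketched as follows. The row layout (a dict per specimen with a patient_id key and 0/1 feature values) and both function names are assumptions made for illustration. The rounding uses Python's built-in round, whose handling of exact ties may differ from whatever was used to prepare the release.

```python
def rare_binary_features(rows, feature_names, min_patients=20):
    """Return the features whose positives come from too few unique patients."""
    patients_with_positive = {name: set() for name in feature_names}
    for row in rows:
        for name in feature_names:
            if row[name] == 1:
                patients_with_positive[name].add(row["patient_id"])
    return [
        name
        for name in feature_names
        if len(patients_with_positive[name]) < min_patients
    ]


def round_pressure(value):
    """Round a colonization pressure value to the nearest 0.01."""
    return round(value, 2)
```

Note that the count is over unique patients, not specimens, so a feature that is positive for many specimens from a handful of patients is still dropped.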

Train / Test Split

The dataset is divided into a training set and a test set by year. Entries marked with is_train are in the training set; all others are in the test set. The sets were constructed such that:

  • All training specimens are in the years 2007-2013.
  • All test specimens are in the years 2014-2016.
  • No patient from the uncomplicated UTI cohort has specimens in both the training and test sets.
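The three properties above can be expressed as a consistency check. This sketch is illustrative only: the released files contain no dates, so the year and patient_id keys are hypothetical fields that would exist only in the source data, and the row layout (one dict per specimen) is an assumption.

```python
def split_violations(rows):
    """Check the split properties listed above.

    Returns (bad_rows, overlap): rows whose year does not match their
    split, and uncomplicated-cohort patients present in both splits.
    """
    bad_rows = []
    train_patients, test_patients = set(), set()
    for row in rows:
        if row["is_train"]:
            in_range = 2007 <= row["year"] <= 2013
        else:
            in_range = 2014 <= row["year"] <= 2016
        if not in_range:
            bad_rows.append(row)
        # The no-overlap guarantee is stated only for the uncomplicated cohort.
        if row["uncomplicated"]:
            target = train_patients if row["is_train"] else test_patients
            target.add(row["patient_id"])
    return bad_rows, train_patients & test_patients
```

Because the guarantee covers only the uncomplicated UTI cohort, patients in the broader cohort may appear in both splits without counting as a violation here.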

In the uncomplicated UTI cohort, there are 3629 unique patients in the test set, and 10053 unique patients in the training set. In total, the specimens in the test dataset are derived from 26807 patients, and in the training set from 55078 patients.

This study was approved by the Institutional Review Board (IRB) of Massachusetts General Hospital with a waived requirement for informed consent.