PhD Thesis: Semi-Supervised Learning for Electronic Phenotyping in Support of Precision Medicine


Medical informatics plays an important role in precision medicine, delivering the right information to the right person at the right time. With the introduction and widespread adoption of electronic medical records, in the United States and worldwide, there is now a tremendous amount of health data available for analysis. Electronic record phenotyping refers to the task of determining, from an electronic medical record entry, a concise descriptor of the patient, comprising their medical history, current problems, presentation, etc. In inferring such a phenotype descriptor from the record, a computer, in a sense, “understands” the relevant parts of the record. These phenotypes can then be used in downstream applications such as cohort selection for retrospective studies, real-time clinical decision support, contextual displays, intelligent search, and precise alerting mechanisms.
We face three main challenges. First, the unstructured and incomplete nature of the data recorded in electronic medical records requires special attention: relevant information can be missing or written in an obscure way that the computer does not understand. Second, the scale of the data makes it important to develop efficient methods at all steps of the machine learning pipeline, including data collection and labeling, model learning, and inference. Third, large parts of medicine are well understood by health professionals, which raises the question of how to combine the expert knowledge of specialists with the statistical insights from the electronic medical record.
Probabilistic graphical models such as Bayesian networks provide a useful abstraction for quantifying uncertainty and describing complex dependencies in data. Although significant progress has been made over the last decade on approximate inference algorithms and structure learning from complete data, learning models with incomplete data remains one of machine learning's most challenging problems.
How can we model the effects of latent variables that are not directly observed? The first part of the thesis presents two different structural conditions under which learning with latent variables is computationally tractable. The first is the “anchored” condition, where every latent variable has at least one child that is not shared by any other parent. The second is the “singly-coupled” condition, where every latent variable is connected to at least three children that satisfy conditional independence (possibly after transforming the data). Variables that satisfy these conditions can be specified by an expert without requiring that the entire structure or its parameters be specified, allowing for effective use of human expertise and making room for statistical learning to do some of the heavy lifting. For both the anchored and singly-coupled conditions, practical algorithms are presented. The second part of the thesis describes real-life applications using the anchored condition for electronic phenotyping. A human-in-the-loop learning system and a functioning emergency informatics system for real-time extraction of important clinical variables are described and evaluated. The algorithms and discussion presented here were developed for the purpose of improving healthcare, but are much more widely applicable, dealing with the very basic questions of identifiability and learning models with latent variables, a problem that lies at the very heart of the natural and social sciences.
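As an illustration of the “anchored” condition stated above, the following minimal sketch checks whether each latent variable in a bipartite latent-to-observed graph has at least one child with no other latent parent. It is not the thesis's learning algorithm, only a structural check; the function names (`find_anchors`, `is_anchored`) and the example variables are hypothetical and chosen for illustration.

```python
from typing import Dict, Set


def find_anchors(children: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each latent variable to its anchors.

    `children` maps a latent variable name to the set of observed
    variables it points to. An anchor of a latent variable is a child
    that has no other latent parent.
    """
    # Count how many latent parents each observed variable has.
    parent_count: Dict[str, int] = {}
    for kids in children.values():
        for child in kids:
            parent_count[child] = parent_count.get(child, 0) + 1

    # A child with exactly one parent is an anchor for that parent.
    return {
        latent: {c for c in kids if parent_count[c] == 1}
        for latent, kids in children.items()
    }


def is_anchored(children: Dict[str, Set[str]]) -> bool:
    """True if every latent variable has at least one anchor."""
    return all(find_anchors(children).values())


# Hypothetical example: two latent conditions sharing one observed child.
example = {
    "diabetes": {"insulin_order", "fatigue"},
    "pneumonia": {"infiltrate_on_xray", "fatigue"},
}
anchors = find_anchors(example)
```

Here `fatigue` has two latent parents and so anchors neither condition, while `insulin_order` and `infiltrate_on_xray` each have a single parent and serve as anchors. In practice an expert would nominate such anchors directly rather than read them off a known graph.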

Yoni Halpern
PhD student

Google Research