PhD Thesis: Filling Gaps and Dodging Traps: Leveraging untapped potential in gene expression data for drug discovery with applications to cystic fibrosis


We live in the age of information, and the fields of medicine and biology are no exception. Data surrounding drugs, genes, diseases, and more are accumulating at ever-increasing rates and contain untold potential to accelerate and inform drug development in new ways. Gene expression data, in particular, has a proven capacity to enable deep quantitative and qualitative insights that can help us elucidate the biology of a drug’s effects on the human body, or the pathophysiology of a disease, or even the potential for a drug to treat a disease. In recent years, gene expression data has accumulated to the scale of millions of experiments, many of which have been made publicly available. Hence, a key challenge and opportunity now lies in identifying ways to integrate across such data to enable hypotheses and insights that may not have been apparent within the limited context of a single dataset or experiment. In this dissertation, I develop and apply integrative bioinformatic methods to leverage untapped potential in gene expression data with the aim of accelerating and supporting the process of drug development. Grounding this work is a specific focus on applications to cystic fibrosis (CF), a genetic disease that leads to a severe and progressive lung condition characterized by infection, inflammation, and fibrosis. While several treatments that target the basic cellular defect have been approved by the FDA in recent years, their efficacy and/or applicability are extremely limited, and hence improved therapeutic options are still greatly needed. My original research starts in Chapter 3, where I perform a multi-faceted meta-analysis of 17 gene expression datasets characterizing transcriptional alterations in CF disease and rescue.
I then develop a workflow to analyze the resulting signatures, revealing a number of hypotheses and insights that could be further pursued experimentally, including unexpected connections to the unfolded protein response (UPR) that are shared across a number of rescue interventions. Then, in Chapter 4, I build directly on this work to develop an integrative pipeline that compares genomic representations of drug and disease in order to identify novel compounds with predicted efficacy for CF. From these predictions, we test 120 compounds in a CF cell line, finding eight that demonstrate significant rescue, three of which stand up to a more stringent assay using CF primary cells. Analysis of the leading compounds’ transcriptional profiles suggests possible mechanisms associated with UPR and/or TNFa pathways. Next, in Chapters 5 and 6, I shift away from a particular disease application and focus instead on a recent dataset capturing the biological effects of thousands of drugs and small molecules applied to dozens of human cell types. While these data are already proving to be useful for drug discovery, one key limitation is that there are many gaps in the experimental space across drugs and cell types. In Chapter 5, I start with linear models to address the problem. I demonstrate that it is possible to computationally fill the gaps in the data (i.e. predicting entire drug-induced gene expression profiles for unmeasured experiments), and further demonstrate the added value of the resulting completed dataset for downstream prediction of drug properties and drug-disease connections. Then, in Chapter 6, I describe a nonlinear approach using a simple latent factor neural network architecture that improves predictive performance.

Rachel Hodos
PhD student

Benevolent AI