Longitudinal health data provides a uniquely detailed view of how patient health evolves over time. We develop pipelines to work efficiently with this kind of data in its rawest form, enabling the development of new state-of-the-art end-to-end machine learning approaches. While healthcare providers increasingly use learned methods to predict and understand long-term patient outcomes in order to make meaningful interventions, deep learning models often struggle to match the performance of shallow linear models in predicting these outcomes, making it difficult to leverage such techniques in practice. Motivated by the task of clinical prediction from longitudinal health data, we present a new technique called reverse distillation, which pre-trains deep models by using high-performing linear models for initialization. We make use of the longitudinal structure of our dataset to develop Self Attention with Reverse Distillation, or SARD, an architecture that combines contextual embeddings, temporal embeddings, and self-attention mechanisms and, most critically, is trained via reverse distillation. SARD outperforms state-of-the-art methods on multiple clinical prediction outcomes, with ablation studies revealing that reverse distillation is the primary driver of these improvements.
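To make the core idea concrete, the following is a minimal sketch of reverse distillation as described above: a deep student model is pre-trained to reproduce the outputs of an already-fit linear teacher before any training on labels. Everything here is illustrative, not the paper's implementation: the toy data, the randomly drawn "teacher" weights `w_lin`, the tiny two-layer network, and the plain gradient-descent loop are all assumptions chosen to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 patients, 10 features (stand-ins for longitudinal claims features).
X = rng.normal(size=(200, 10))

# A "high-performing" linear model we pretend was already fit on labels;
# these weights are illustrative, not from the paper.
w_lin = rng.normal(size=10)
teacher_logits = X @ w_lin

# Tiny two-layer "deep" student network (hypothetical stand-in for SARD).
W1 = rng.normal(scale=0.1, size=(10, 16))
W2 = rng.normal(scale=0.1, size=16)

def forward(X, W1, W2):
    h = np.tanh(X @ W1)      # hidden activations
    return h @ W2, h         # student logits, hidden layer

# Reverse-distillation pre-training: fit the student's logits to the
# linear teacher's logits (mean-squared error), with no label data used.
lr = 0.05
losses = []
for step in range(1000):
    logits, h = forward(X, W1, W2)
    err = logits - teacher_logits            # (200,)
    losses.append(float(np.mean(err ** 2)))
    # Manual backprop through the two layers.
    gW2 = h.T @ err / len(X)                 # (16,)
    dh = np.outer(err, W2) * (1 - h ** 2)    # tanh' backprop, (200, 16)
    gW1 = X.T @ dh / len(X)                  # (10, 16)
    W2 -= lr * gW2
    W1 -= lr * gW1

print(f"pre-training MSE: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

After this pre-training phase, the student would be fine-tuned on the actual clinical labels (not shown); the point of the sketch is only that the deep model starts from a state matching the linear model's decision function rather than from a random initialization.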