Appl Clin Inform 2016; 07(01): 101-115
DOI: 10.4338/ACI-2015-09-RA-0114
Research Article
Schattauer GmbH

Natural Language Processing for Cohort Discovery in a Discharge Prediction Model for the Neonatal ICU

Michael W. Temple
1 Department of Biomedical Informatics, Vanderbilt University, Nashville, TN
Christoph U. Lehmann
1 Department of Biomedical Informatics, Vanderbilt University, Nashville, TN
2 Department of Pediatrics, Vanderbilt University, Nashville, TN
Daniel Fabbri
1 Department of Biomedical Informatics, Vanderbilt University, Nashville, TN
National Library of Medicine Training Grant 5T15LM007450-13.

Publication History

received: 12 September 2015

accepted: 02 January 2016

Publication Date:
16 December 2017 (online)



Discharge of patients from the Neonatal Intensive Care Unit (NICU) can be delayed for non-medical reasons, including the procurement of home medical equipment, parental education, and the need for children’s services. We previously created a model to identify patients who will be medically ready for discharge in the subsequent 2–10 days. In this study, we use Natural Language Processing (NLP) to improve upon that model and to discern why it performed poorly on certain patients.


We retrospectively examined the text of the Assessment and Plan section from daily progress notes of 4,693 patients (103,206 patient-days) from the NICU of a large, academic children’s hospital. A matrix was constructed using words from NICU notes (single words and bigrams) to train a supervised machine learning algorithm to determine the most important words differentiating patients on whom our original discharge prediction model performed poorly from those on whom it performed well.
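The matrix construction described above can be illustrated with a minimal sketch. It assumes nothing about the study's actual pipeline: the four notes are invented toy text, the function names are hypothetical, and the ranking step is a simple difference in document frequency between the two groups, used here only as a stand-in for the supervised learning algorithm, which this summary does not specify in detail.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase a note and split it into word tokens (hyphenated words stay whole)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def unigrams_and_bigrams(text):
    """Return the single words plus the adjacent word pairs of one note."""
    words = tokens(text)
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + pairs


def build_bow_matrix(notes):
    """Build a count matrix: one row per note, one column per term."""
    counts = [Counter(unigrams_and_bigrams(note)) for note in notes]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


def rank_terms(vocab, matrix, labels):
    """Rank terms by how much more often they occur in label-1 notes.

    Stand-in scoring (an assumption, not the study's learner): fraction of
    label-1 notes containing the term minus the fraction of label-0 notes
    containing it. Both groups must be non-empty.
    """
    n_poor = sum(labels)
    n_well = len(labels) - n_poor
    scores = []
    for j, term in enumerate(vocab):
        poor = sum(1 for row, y in zip(matrix, labels) if y == 1 and row[j] > 0)
        well = sum(1 for row, y in zip(matrix, labels) if y == 0 and row[j] > 0)
        scores.append((poor / n_poor - well / n_well, term))
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Invented toy notes; 1 = original model performed poorly, 0 = performed well.
notes = [
    "g-tube feeds tolerated, surgery follow-up",
    "pulmonary hypertension on sildenafil, g-tube feeds",
    "full oral feeds, room air",
    "room air, full oral feeds, car seat test passed",
]
labels = [1, 1, 0, 0]

vocab, matrix = build_bow_matrix(notes)
ranked = rank_terms(vocab, matrix, labels)
for score, term in ranked[:5]:
    print(f"{score:+.2f}  {term}")
```

On this toy input, the terms "g-tube" and "g-tube feeds" rank first because they appear in both poorly predicted notes and in neither well predicted note, whereas "feeds" alone scores zero because it appears in every note. In practice the same matrix could be passed to any supervised classifier that reports feature importance.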


NLP using a bag of words (BOW) analysis revealed several cohorts on which our original model performed poorly. These included patients with surgical diagnoses, pulmonary hypertension, retinopathy of prematurity, and psychosocial issues.


The BOW approach aided in cohort discovery and will allow further refinement of our original discharge prediction model. Adequately identifying patients discharged home on g-tube feeds alone could improve the AUC of our original model by 0.02. Additionally, this approach identified social issues as a major cause of delayed discharge.


A BOW analysis provides a method to improve and refine our NICU discharge prediction model and could potentially avoid over 900 (0.9%) hospital days.


AUC – Area under the Curve, CART – Classification And Regression Trees, DTD – Days to Discharge, GI – Gastrointestinal, LOS – Length of Stay, NICU – Neonatal Intensive Care Unit, NS – Neurosurgery, RF – Random Forest.