Abstract
Objective The objective of this article was to compare the performance of deep learning and
conventional machine learning (ML) methods for detecting health care-associated infections
(HAIs) in French medical reports.
Methods The corpus consisted of different types of medical reports (discharge summaries,
surgery reports, consultation reports, etc.). A total of 1,531 medical text documents
were extracted from three French university hospitals and deidentified. Each document
was labeled for the presence (1) or absence (0) of HAI. We started by normalizing the records
using a series of preprocessing techniques. We calculated an overall performance metric,
the F1 Score, to compare a deep learning method (convolutional neural network [CNN])
with the most popular conventional ML models (Bernoulli and multinomial naïve Bayes, k-nearest
neighbors, logistic regression, random forests, extra-trees, gradient boosting, support
vector machines). We applied Bayesian hyperparameter optimization to each model
based on its HAI identification performance. We treated the choice of text representation
as an additional hyperparameter for each model, selecting among four text representations
(bag of words, term frequency–inverse document frequency, word2vec, and GloVe).
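As an illustration of the comparison metric named above, the following is a minimal sketch of how an F1 score can be computed for binary HAI labels (1 = presence, 0 = absence). The function name `f1_score_binary` and the toy labels are assumptions made for this example; they are not taken from the study, and the study's own evaluation code is not described in the abstract.

```python
def f1_score_binary(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # HAI correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false notifications
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed HAIs
    if tp == 0:
        # No true positives: precision and/or recall is zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels, for illustration only.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]
toy_f1 = f1_score_binary(toy_true, toy_pred)
```

Because F1 ignores true negatives, it stays informative when documents without HAI dominate the corpus, which is one plausible reason to prefer it over plain accuracy here.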
Results The CNN outperformed all conventional ML algorithms for HAI classification. The
best F1 Score of 97.7% ± 3.6% and best area under the curve score of 99.8% ± 0.41%
were achieved when the CNN was applied directly to the processed clinical notes without
a pretrained word2vec embedding. Using receiver operating characteristic curve analysis
with Youden's index as the reference, we achieved a good balance between false notifications
(specificity of 0.937) and system detection capability (sensitivity of 0.962).
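Youden's index is defined as J = sensitivity + specificity - 1, and the operating point is usually taken as the score threshold that maximizes J. The sketch below shows one way to do this; the function names, the exhaustive threshold scan, and the toy scores are assumptions for illustration and do not reproduce the study's data or code.

```python
def youden_j(sensitivity, specificity):
    """Youden's index J = sensitivity + specificity - 1."""
    return sensitivity + specificity - 1


def youden_threshold(y_true, scores):
    """Return (threshold, sensitivity, specificity, J) maximizing Youden's J.

    A document is predicted positive when its score is >= threshold.
    Ties are resolved in favor of the lowest threshold.
    """
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < t)
        sens = tp / n_pos
        spec = tn / n_neg
        j = youden_j(sens, spec)
        if best is None or j > best[3]:
            best = (t, sens, spec, j)
    return best


# Hypothetical classifier scores, for illustration only.
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
toy_best = youden_threshold(toy_labels, toy_scores)

# J at the operating point reported in the abstract.
reported_j = youden_j(0.962, 0.937)
```

With the reported sensitivity of 0.962 and specificity of 0.937, the index evaluates to approximately 0.899.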
Conclusions The main drawback of CNNs is their opacity. To address this issue, we examined
the activation values of the CNN's inner layers to visualize the most meaningful phrases in a
document. This method could be used to build a phrase-based medical assistant algorithm
that helps infection control practitioners select relevant medical records. Our
study demonstrated that a deep learning approach outperforms conventional classification
algorithms for automatically identifying HAIs in medical reports.
Keywords
electronic health records - natural language processing - machine learning - deep
learning - epidemiology - healthcare-associated infections