Summary
Background Electronic health record (EHR) systems have high potential to improve healthcare
delivery and management. Although structured EHR data are stored in machine-readable formats,
using them for decision support still poses technical challenges for researchers, because
the data must be preprocessed and converted into a matrix format. During our research, we
observed that the clinical informatics literature does not provide guidance for researchers on
how to build this matrix while avoiding potential pitfalls.
Objectives This article aims to provide researchers with a roadmap of the main technical challenges
of preprocessing structured EHR data and of possible strategies to overcome them.
Methods Following the standard data processing stages of an EDPAI framework (extracting database
entries, defining features, processing data, assessing feature values and integrating
data elements), we identified the main challenges faced by researchers and reflected
on how to address them, based on lessons learned from our research experience
and on best practices from related literature. We highlight the main potential sources
of error, present strategies to approach those challenges and discuss the implications
of these strategies.
Results Following the EDPAI framework, researchers face five key challenges: (1) gathering
and integrating data, (2) identifying and handling different feature types, (3) combining
features to handle redundancy and granularity, (4) addressing data missingness, and
(5) handling multiple feature values. Strategies to address these challenges include:
crosschecking identifiers for robust data retrieval and integration; applying clinical
knowledge in identifying feature types, in addressing redundancy and granularity,
and in accommodating multiple feature values; and adequately investigating patterns of missing data.
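As an illustration only, the following minimal Python sketch shows how three of these strategies might look on toy data: crosschecking identifiers before integration, resolving multiple feature values, and inspecting missingness in the resulting matrix. All record layouts, field names and function names are hypothetical and are not taken from the article; keeping the most recent value is just one possible rule for multiple values, and in practice that choice should be guided by clinical knowledge.

```python
# Hypothetical toy records; field names are illustrative assumptions.
patients = [
    {"patient_id": "P1", "birth_year": 1950},
    {"patient_id": "P2", "birth_year": 1962},
    {"patient_id": "P3", "birth_year": 1975},
]
lab_results = [
    {"patient_id": "P1", "test": "glucose", "value": 5.4, "day": 1},
    {"patient_id": "P1", "test": "glucose", "value": 6.1, "day": 3},
    {"patient_id": "P2", "test": "sodium", "value": 140.0, "day": 2},
    {"patient_id": "P9", "test": "glucose", "value": 7.0, "day": 1},
]


def crosscheck_ids(patients, entries):
    """Return (entry ids with no patient record, patient ids with no entries)."""
    known = {p["patient_id"] for p in patients}
    seen = {e["patient_id"] for e in entries}
    return sorted(seen - known), sorted(known - seen)


def build_matrix(patients, entries, features):
    """Build one row per patient and one column per feature.

    None marks a missing value. When a patient has several values for the
    same feature, the most recent one (largest "day") is kept.
    """
    latest = {}
    for e in entries:
        key = (e["patient_id"], e["test"])
        if key not in latest or e["day"] > latest[key]["day"]:
            latest[key] = e
    matrix = {}
    for p in patients:
        pid = p["patient_id"]
        matrix[pid] = [
            latest[(pid, f)]["value"] if (pid, f) in latest else None
            for f in features
        ]
    return matrix


def missing_fraction(matrix, features):
    """Fraction of patients with no value, per feature."""
    n = len(matrix)
    return {
        f: sum(row[i] is None for row in matrix.values()) / n
        for i, f in enumerate(features)
    }


features = ["glucose", "sodium"]
orphans, without_entries = crosscheck_ids(patients, lab_results)
matrix = build_matrix(patients, lab_results, features)
missing = missing_fraction(matrix, features)
```

In this sketch, entries whose identifier has no matching patient record are reported rather than silently dropped, so that the mismatch can be investigated before the data are integrated.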
Conclusions This article contributes to the literature by providing a roadmap to inform structured
EHR data preprocessing. It may alert researchers to potential pitfalls and to the implications
of methodological decisions in handling structured data, so as to avoid biases and
help realize the benefits of the secondary use of EHR data.
Citation: Ferrão JC, Oliveira MD, Janela F, Martins HMG. Preprocessing structured clinical
data for predictive modeling and decision support – a roadmap to tackle the challenges.
Keywords
Data mining - data access - integration and analysis - electronic health records and
systems - structured data - clinical decision support