Abstract
Background During the COVID-19 pandemic, several methodologies were designed for obtaining electronic
health record (EHR)-derived datasets for research. These processes often operate
as black boxes: clinical researchers are unaware of how the data were recorded,
extracted, and transformed. To solve this, extract,
transform, and load (ETL) processes must be based on transparent, homogeneous, and formal
methodologies that make them understandable, reproducible, and auditable.
Objectives This study aims to design and implement a methodology, in accordance with the FAIR Principles,
for building ETL processes (focused on data extraction, selection, and transformation)
for EHR reuse in a transparent and flexible manner, applicable to any clinical condition
and health care organization.
Methods The proposed methodology comprises four stages: (1) analysis of secondary use models
and identification of data operations, based on internationally used clinical repositories,
case report forms, and aggregated datasets; (2) modeling and formalization of data
operations using the Detailed Clinical Models paradigm; (3) agnostic development
of data operations in SQL and R, the selected programming languages; and (4) automation
of the ETL instantiation through a formal XML configuration file.
Results First, four international projects were analyzed, identifying 17 operations needed
to obtain datasets from the EHR that meet the specifications of these projects.
Each of the data operations was then formalized using the ISO 13606 reference
model, specifying the valid data types of its arguments, inputs, and outputs, and their
cardinality. Next, an agnostic catalog of data operations was developed in the previously
selected data-oriented programming languages. Finally, an automated ETL instantiation
process was built from a formally defined ETL configuration file.
Conclusions This study provides a transparent and flexible way to make
the processes for obtaining EHR-derived data for secondary use understandable,
auditable, and reproducible. Moreover, the abstraction carried out in this study means
that any previous EHR reuse methodology can incorporate these results.
Keywords
electronic health record - FAIR Principles - data reusability - real-world data -
standards