Information Extraction from Echocardiography Reports for a Clinical Follow-up Study—Comparison of Extracted Variables Intended for General Use in a Data Warehouse with Those Intended Specifically for the Study
Funding This work was supported by the German Ministry of Education and Research (BMBF), Berlin (#01EO1004, #01EO1504).
22 March 2019
12 November 2019
30 January 2020 (online)
Background Interest in information extraction from clinical reports for secondary data use is increasing, but experience with the productive use of information extraction processes over time is scarce. A clinical data warehouse has been in use at our university hospital for several years; it also provides an information extraction process for echocardiography reports that was developed for general use.
Objectives This study aims to illustrate the difficulties encountered when using data from a preexisting information extraction process for a large clinical study, and to compare the data from the preexisting process with data obtained from a process developed specifically to improve the quality and completeness of the study data.
Methods We extracted the echocardiography variables for 440 patients (678 reports) from the general-use information extraction of the data warehouse. We then developed an information extraction process for the same variables but specifically for this study, with the aim of extracting as much information as possible from the text. The data extracted by both processes were compared with a newly created gold standard defined by a cardiologist with long-standing experience in heart failure.
Results Among 57 echocardiography variables considered relevant for the study, 50 were documented in the routine text reports and could be extracted. Twenty of the required variables were not provided by the general-use extraction process, and some others were not provided correctly. The median macro F1-score (precision, recall) across the 30 variables for which values were extracted was 0.81 (0.94, 0.77). Across all 50 variables relevant for the study, the median macro F1-score was only 0.49 (0.56, 0.46). Employing the study-specific approach considerably improved the quality and completeness of the variables, resulting in F1-scores of 0.97 (0.98, 0.96) across all variables.
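The reported scores can be illustrated with a minimal sketch. The abstract does not give the exact evaluation procedure, so the code below assumes the common definition of macro averaging: precision, recall, and F1 are computed per class of a variable against the gold standard and then averaged with equal weight, and the median is taken across variables. The function name `macro_scores`, the variable names, and the example values are hypothetical and do not come from the study.

```python
from statistics import median


def macro_scores(gold, predicted):
    """Return macro-averaged (precision, recall, F1) for one variable.

    Assumption: each class seen in the gold standard or in the extraction
    contributes equally; undefined ratios are counted as 0.0.
    """
    labels = set(gold) | set(predicted)
    pairs = list(zip(gold, predicted))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical gold standard and extraction output for two variables.
gold_standard = {
    "mitral_regurgitation": ["none", "mild", "mild", "severe"],
    "lv_function": ["normal", "reduced"],
}
extracted = {
    "mitral_regurgitation": ["none", "mild", "none", "severe"],
    # A variable the extraction process does not provide at all.
    "lv_function": ["missing", "missing"],
}

per_variable = {
    name: macro_scores(gold_standard[name], extracted[name])
    for name in gold_standard
}
median_f1 = median(f1 for _, _, f1 in per_variable.values())
print(per_variable)
print(median_f1)
```

Under these assumptions, a variable that is never provided scores 0 on all three measures, which is one way the median over all 50 variables can fall well below the median over only the 30 variables with extracted values.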
Conclusion Data from information extractions can be used for large clinical studies. However, preexisting information extraction processes should be treated with caution, as the time and effort spent defining each variable in the information extraction process may not be clear.