Designing an openEHR-Based Pipeline for Extracting and Standardizing Unstructured Clinical Data Using Natural Language Processing

Antje Wulff; Marcel Mast; Marcus Hassler; Sara Montag; Michael Marschollek; Thomas Jack

doi:10.1055/s-0040-1716403

CC BY-NC-ND 4.0 · Methods Inf Med 2020; 59(S 02): e64-e78

Original Article

Antje Wulff¹, Marcel Mast¹, Marcus Hassler², Sara Montag¹, Michael Marschollek¹, Thomas Jack³

¹ Peter L. Reichertz Institute for Medical Informatics, TU Braunschweig and Hannover Medical School, Hannover, Germany
² Econob, Informationsdienstleistungs GmbH, Klagenfurt am Wörthersee, Austria
³ Department of Pediatric Cardiology and Intensive Care Medicine, Hannover Medical School, Hannover, Germany

Funding None.

Abstract

Background Merging disparate and heterogeneous datasets from clinical routine into a standardized and semantically enriched format to enable multiple uses of the data also means incorporating unstructured data such as medical free texts. Although the extraction of structured data from texts by means of natural language processing (NLP) has been researched extensively, at least for the English language, obtaining structured output in an arbitrary format is not enough. NLP techniques need to be combined with clinical information standards such as openEHR so that previously unstructured data can be reused and exchanged sensibly.

Objectives The aim of the study is to automatically extract crucial information from medical free texts and to transform this unstructured clinical data into a standardized and structured representation by designing and implementing an exemplary pipeline for the processing of pediatric medical histories.

Methods We constructed a pipeline that allows reusing medical free texts such as pediatric medical histories in a structured and standardized way by (1) selecting and modeling appropriate openEHR archetypes as standard clinical information models, (2) defining a German dictionary with crucial text markers serving as an expert knowledge base for an NLP pipeline, and (3) creating mapping rules between the NLP output and the archetypes. The approach was evaluated in a first pilot study using 50 manually annotated medical histories from the pediatric intensive care unit of the Hannover Medical School.
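To make the three steps more concrete, the following is a minimal Python sketch, not the authors' implementation. The marker dictionary, the temperature expression, the archetype identifiers, and the element names are all illustrative assumptions chosen for this example; the study itself used a dedicated NLP software and its own dictionary and archetype selection.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple

# (2) Hypothetical German text markers mapped to normalized concepts.
TEXT_MARKERS: Dict[str, str] = {
    "fieber": "fever",
    "husten": "cough",
    "erbrechen": "vomiting",
}

# Hypothetical regular expression for a temperature mention such as "39,5 °C".
TEMPERATURE_RE = re.compile(r"(\d{2}(?:[.,]\d)?)\s*°\s*C")

# (1) + (3) Hypothetical mapping rules: concept -> (archetype id, element name).
# The identifiers are placeholders, not the archetypes selected in the study.
MAPPING_RULES: Dict[str, Tuple[str, str]] = {
    "fever": ("openEHR-EHR-CLUSTER.symptom_sign.v1", "symptom_name"),
    "cough": ("openEHR-EHR-CLUSTER.symptom_sign.v1", "symptom_name"),
    "body_temperature": ("openEHR-EHR-OBSERVATION.body_temperature.v2", "temperature"),
}


@dataclass
class Snippet:
    concept: str
    value: str


def extract(text: str) -> List[Snippet]:
    """Find dictionary markers and temperature values in a free text."""
    snippets: List[Snippet] = []
    lowered = text.lower()
    for marker, concept in TEXT_MARKERS.items():
        if marker in lowered:
            snippets.append(Snippet(concept, "present"))
    for match in TEMPERATURE_RE.finditer(text):
        # Normalize the German decimal comma to a decimal point.
        snippets.append(Snippet("body_temperature", match.group(1).replace(",", ".")))
    return snippets


def to_openehr(snippets: List[Snippet]) -> List[dict]:
    """Apply the mapping rules; concepts without a rule are skipped."""
    entries = []
    for snippet in snippets:
        rule = MAPPING_RULES.get(snippet.concept)
        if rule is None:
            continue
        archetype_id, element = rule
        entries.append({"archetype": archetype_id, "element": element, "value": snippet.value})
    return entries


entries = to_openehr(extract("Seit 2 Tagen Fieber bis 39,5 °C und Husten."))
```

Keeping the dictionary and the mapping rules as separate tables mirrors the separation between extraction and standardization described above: either table can be extended without touching the other.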

Results We successfully reused 24 existing international archetypes to represent the most crucial elements of unstructured pediatric medical histories in a standardized form. The self-developed NLP pipeline was constructed by defining 3,055 text marker entries, 132 text events, 66 regular expressions, and a text corpus consisting of 776 entries for automatic correction of spelling mistakes. A total of 123 mapping rules were implemented to transform the extracted snippets to an openEHR-based representation to be able to store them together with other structured data in an existing openEHR-based data repository. In the first evaluation, the NLP pipeline yielded 97% precision and 94% recall.
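For readers unfamiliar with the two evaluation measures, the sketch below shows how precision and recall can be computed from extracted and manually annotated items. It assumes exact matching on hypothetical (document, concept) pairs; the abstract does not state how matches were counted in the study, and the example data are invented.

```python
from typing import Set, Tuple

Annotation = Tuple[str, str]  # hypothetical (document id, concept) pair


def precision_recall(predicted: Set[Annotation], gold: Set[Annotation]) -> Tuple[float, float]:
    """Precision = TP / all predicted items, recall = TP / all gold items."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented toy data: three of four predictions are correct,
# and three of four gold annotations are found.
gold = {("doc1", "fever"), ("doc1", "cough"), ("doc2", "vomiting"), ("doc2", "fever")}
predicted = {("doc1", "fever"), ("doc1", "cough"), ("doc2", "fever"), ("doc2", "cough")}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```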

Conclusion The use of NLP and openEHR archetypes was demonstrated to be a viable approach for extracting and representing important information from pediatric medical histories in a structured and semantically enriched format. We designed a promising approach with the potential to be generalized, and implemented a prototype that is extensible and reusable for other use cases concerning German medical free texts. In the long term, this will harness unstructured clinical data for further research purposes such as the design of clinical decision support systems. Together with structured data already integrated in openEHR-based representations, we aim to develop an interoperable openEHR-based application that is capable of automatically assessing a patient's risk status based on the patient's medical history at the time of admission.

Keywords

natural language processing - clinical decision support systems - openEHR - pediatric intensive care - medical history taking

Authors' Contributions

A.W. was responsible for drafting the methodological approach, managed the overall project work, led the proof-of-concept evaluation, and authored the manuscript. M.M. developed the described NLP pipeline, designed the openEHR archetypes and template, and co-authored the manuscript. T.J. and S.M. provided clinical expertise for requirement analysis and dictionary construction. M.H. gave subject-specific advice on the design of NLP pipelines and provided the NLP software. M.M. provided further technical and medical expertise and, together with all authors, co-authored and proofread the manuscript. All authors read and approved the final manuscript.

Supplementary Material

Publication History

Received: May 12, 2020

Accepted: July 18, 2020

Article published online: October 14, 2020

© 2020. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution-NonDerivative-NonCommercial-License, permitting copying and reproduction so long as the original work is given appropriate credit. Contents may not be used for commercial purposes, or adapted, remixed, transformed or built upon. (https://creativecommons.org/licenses/by-nc-nd/4.0/).

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany
