Validation of Semantic Analyses of Unstructured Medical Data for Research Purposes

Roman Michael Pokora; Lucian Le Cornet; Philipp Daumke; Peter Mildenberger; Hajo Zeeb; Maria Blettner

doi:10.1055/a-1007-8540

CC BY-NC-ND 4.0 · Gesundheitswesen 2020; 82(S 02): S158-S164
DOI: 10.1055/a-1007-8540

Original Article

Validierung von semantischen Analysen von unstrukturierten medizinischen Daten für Forschungszwecke

Authors

Roman Michael Pokora
¹Institute for Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center of the Johannes Gutenberg University Mainz, Mainz

Lucian Le Cornet
¹Institute for Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center of the Johannes Gutenberg University Mainz, Mainz
²Studienzentrale, Nationales Centrum für Tumorerkrankungen Heidelberg, Heidelberg

Philipp Daumke
³Averbis GmbH, Freiburg

Peter Mildenberger
⁴Klinik und Poliklinik für Diagnostische und Interventionelle Radiologie, University Medical Center of the Johannes Gutenberg University Mainz, Mainz

Hajo Zeeb
⁵Leibniz-Institut für Präventionsforschung und Epidemiologie (BIPS), Prevention and Evaluation, Bremen

Maria Blettner
⁶Institut für Medizinische Biometrie, Epidemiologie und Informatik, Johannes-Gutenberg Universität Mainz, Mainz

Further Information

Publication History

Publication Date: 09 October 2019 (online)

Abstract

Background Secondary data often contain unstructured free text. The aim of this study was to validate a text mining system for extracting information from unstructured medical data for research purposes.

Methods From the 7,102 CT findings reports of a radiological department, 1,000 were randomly selected. Two physicians manually assigned these reports to predefined findings groups. For automated keyword tagging and classification, the text analysis software Averbis Extraction Platform (AEP) was used. Special features of the system include a morphological analysis that decomposes compound words, as well as the recognition of noun phrases, abbreviations and negated statements. Based on the extracted standardized keywords, findings reports were assigned to the predefined findings groups using machine learning methods. To assess the reliability and validity of the automated process, the automated classification and the two independent manual classifications were compared for agreement in multiple runs.
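To illustrate why the recognition of negated statements matters for keyword tagging, the following is a minimal toy sketch and not the AEP algorithm, whose internals are not described here. The function name `tag_findings`, the cue list `NEGATION_CUES`, the three-token look-back window and the English example report are all assumptions made for illustration; the reports in the study are clinical free text, and a real system would need a far richer rule set.

```python
import re

# Hypothetical, deliberately tiny list of negation cues (assumption).
NEGATION_CUES = ("no", "not", "without")


def tag_findings(report, keywords):
    """Return {keyword: 'affirmed' | 'negated'} for keywords found in the report.

    A keyword counts as negated when a negation cue occurs within the three
    tokens preceding it in the same sentence. An affirmed mention anywhere in
    the report takes precedence over a negated one.
    """
    tags = {}
    for sentence in re.split(r"[.;]\s*", report.lower()):
        tokens = re.findall(r"[a-zäöüß]+", sentence)
        for keyword in keywords:
            if keyword not in tokens:
                continue
            position = tokens.index(keyword)
            window = tokens[max(0, position - 3):position]
            if any(cue in window for cue in NEGATION_CUES):
                # Keep an earlier 'affirmed' tag if one exists.
                tags.setdefault(keyword, "negated")
            else:
                tags[keyword] = "affirmed"
    return tags


report = "No evidence of fracture. Small hematoma in the left frontal lobe."
keywords = ["fracture", "hematoma", "tumor"]
result = tag_findings(report, keywords)
print(result)
```

A plain keyword search would flag this report for "fracture" even though the finding is explicitly ruled out, which is one plausible way that keyword matching alone can mislead.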

Results Manual classification was too time-consuming. For automated keyword tagging, classification according to ICD-10 proved unsuitable for our data. Keyword search alone also did not deliver reliable results. Computer-aided text mining combined with machine learning produced reliable classifications. Inter-rater reliability was very high, both between the two manual classifications and between the automated and manual classifications. The two manual classifications agreed in 93% of all findings (kappa coefficient 0.89 [95% confidence interval (CI) 0.87–0.92]). The automated classification agreed with the independent second manual classification in 86% of all findings (kappa coefficient 0.79 [95% CI 0.75–0.81]).
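The agreement measures reported above (percentage agreement and the kappa coefficient) can be sketched as follows. This is a minimal illustration of Cohen's kappa for two raters, not the authors' analysis code; the function name `cohen_kappa`, the category labels and the eight example reports are invented. The confidence interval uses a simple large-sample approximation of the standard error, which is an assumption, since the text does not state how the published intervals were computed.

```python
import math
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Return (observed agreement, kappa, approximate 95% CI) for two raters."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)

    # Observed agreement: share of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")

    kappa = (p_observed - p_expected) / (1 - p_expected)

    # Simple large-sample standard error (an approximation, see lead-in).
    se = math.sqrt(p_observed * (1 - p_observed) / (n * (1 - p_expected) ** 2))
    return p_observed, kappa, (kappa - 1.96 * se, kappa + 1.96 * se)


# Invented example data: two raters, eight reports, three hypothetical groups.
rater_1 = ["normal", "normal", "tumor", "trauma", "tumor", "normal", "trauma", "normal"]
rater_2 = ["normal", "normal", "tumor", "trauma", "normal", "normal", "trauma", "tumor"]

agreement, kappa, ci = cohen_kappa(rater_1, rater_2)
print(f"agreement={agreement:.2f}, kappa={kappa:.2f}, 95% CI={ci[0]:.2f} to {ci[1]:.2f}")
```

In this toy data the raters agree on 6 of 8 reports (75%), while the kappa of 0.60 is lower because part of that agreement would be expected by chance alone. The same relationship explains why the reported kappa values (0.89 and 0.79) lie below the raw agreement rates (93% and 86%).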

Discussion The classification performance of the AEP software was very good. In our study, however, its misclassifications followed a systematic pattern: most were found in findings that indicate an increased risk of cancer. The free-text structure of the findings raises concerns about the feasibility of a purely automated analysis. The combination of human intellect and intelligent, adaptive software appears most suitable for mining unstructured but important textual information for research.

Summary

Background Secondary data often contain unstructured free text. This work validates a text mining system for extracting unstructured medical data for research purposes.

Methods From 7,102 CT findings of a radiological clinic, 1,000 were selected at random. Two physicians manually assigned them to defined findings groups. The text analysis software Averbis Extraction Platform (AEP) was used for automated keyword tagging and classification. Special features of the system include, among others, a morphological analysis for decomposing compound words and the recognition of noun phrases, abbreviations and negated statements. Based on the extracted standardized keywords, findings reports are assigned to the predefined findings groups using machine learning methods. To assess the reliability and validity of the automated procedure, the automated classification and 2 independent manual classifications are compared for agreement over several runs.

Results Manual classification was too time-consuming. For automated keyword tagging, classification according to ICD-10 proved unsuitable for our data. Keyword search likewise did not deliver reliable results. Computer-aided text mining in combination with machine learning led to reliable classifications. The inter-rater reliability of the two manual classifications, and of the automated and the manual classification, was very high. The two manual classifications agreed in 93% of all findings. The kappa coefficient is 0.89 [95% confidence interval (CI) 0.87–0.92]. The automated classification agreed with the independent second manual classification in 86% of all findings (kappa coefficient 0.79 [95% CI 0.75–0.81]).

Discussion The classification by the AEP software was very good. In our study, however, it followed a systematic pattern. Most incorrect assignments are found in findings that indicate an increased risk of cancer. The free-text structure of the findings raises concerns about the feasibility of a purely automated analysis. The combination of human intellect and intelligent, adaptive software appears to be the way forward for making unstructured but important textual information accessible to research.

Key words

secondary data - unstructured free text - text-mining - validation

Schlüsselwörter

Sekundärdaten - unstrukturierte Freitext - Text-Mining - Validierung

