DOI: 10.1055/a-2424-1989
A Transformer-Based Pipeline for German Clinical Document De-Identification
Funding This study was funded by Deutsches Zentrum für Luft- und Raumfahrt under grant number 01ZZ2314D. The funder played no role in the study design, data collection, analysis and interpretation of data, or the writing of this manuscript.
Abstract
Objective Commercially available large language models such as Chat Generative Pre-Trained Transformer (ChatGPT) cannot be applied to real patient data for data protection reasons. At the same time, de-identification of unstructured clinical data is a tedious and time-consuming task when done manually. Since transformer models can efficiently process and analyze large amounts of text data, our study explores the impact of a large training dataset on the performance of automated de-identification.
Methods We used a large dataset of 10,240 German hospital documents from 1,130 patients, created as part of the investigating hospital's routine documentation, as training data. Our approach involved fine-tuning an ensemble of two transformer-based language models in parallel to identify sensitive data within our documents. Annotation guidelines with specific annotation categories and types were created for annotator training.
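For readers unfamiliar with this setup, the sketch below shows how de-identification can be framed as token classification with a German ELECTRA encoder using the Hugging Face transformers library. It is a minimal illustration under stated assumptions, not the authors' pipeline: the entity labels and the example sentence are hypothetical placeholders, and only the gELECTRA checkpoint family is taken from the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical label set -- NOT the study's annotation scheme, just placeholders.
labels = ["O", "B-PATIENT", "I-PATIENT", "B-DATE", "I-DATE"]
model_name = "deepset/gelectra-large"  # gELECTRA checkpoint family named in the paper

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# Fictitious example sentence ("Mr. Mustermann was admitted on 01.02.2020.").
text = "Herr Mustermann wurde am 01.02.2020 aufgenommen."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: [1, sequence_length, num_labels]
pred_ids = logits.argmax(dim=-1)[0].tolist()

# Print one predicted label per subword token. With the untrained classification
# head the output is arbitrary; after fine-tuning on annotated documents these
# labels would mark the sensitive spans to be removed or replaced.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred_id in zip(tokens, pred_ids):
    print(f"{token:15s} {labels[pred_id]}")
```

After fine-tuning on annotated documents, the predicted labels identify the spans of sensitive information that a de-identification pipeline would subsequently remove or replace.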
Results Performance evaluation on a test dataset of 100 manually annotated documents revealed that our fine-tuned German ELECTRA (gELECTRA) model achieved an F1 macro average score of 0.95, surpassing human annotators who scored 0.93.
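As a toy illustration of the reported metric (with made-up labels, not the study's data), the macro-averaged F1 score is the unweighted mean of the per-class F1 scores, so rare entity categories contribute as much as frequent ones:

```python
from sklearn.metrics import f1_score

# Made-up gold and predicted labels for eight tokens (illustrative only).
y_true = ["NAME", "NAME", "DATE", "O", "O", "ID", "DATE", "O"]
y_pred = ["NAME", "O",    "DATE", "O", "O", "ID", "DATE", "NAME"]

# Macro averaging: compute F1 per class, then take the unweighted mean.
print(f1_score(y_true, y_pred, average="macro"))
```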
Conclusion We trained and evaluated transformer models to detect sensitive information in German real-world pathology reports and progress notes. After defining an annotation scheme tailored to the documents of the investigating hospital and creating annotation guidelines for staff training, we conducted a further experimental study comparing the models with human annotators. This comparison showed that the best-performing model achieved better overall results than two experienced annotators who manually labeled 100 clinical documents.
Keywords
machine learning - deep learning - natural language processing - de-identification - anonymization
Protection of Human and Animal Subjects
The study was approved by the Institutional Review Boards.
Data Availability
The data that support the findings of this study are not openly available for reasons of data sensitivity and are available from the corresponding author upon reasonable request. The trained models created as part of this study are also not publicly accessible for reasons of data protection. The underlying code for this study is publicly available on GitHub and can be accessed at: https://github.com/UMEssen/DOME.
* These authors shared senior authorship.
Publication History
Received: 25 April 2024
Accepted: 18 August 2024
Article published online: 08 January 2025
© 2025. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution-NonDerivative-NonCommercial License, permitting copying and reproduction so long as the original work is given appropriate credit. Contents may not be used for commercial purposes, or adapted, remixed, transformed or built upon. (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany