CC BY-NC-ND 4.0 · Appl Clin Inform 2025; 16(01): 031-043
DOI: 10.1055/a-2424-1989
Research Article

A Transformer-Based Pipeline for German Clinical Document De-Identification

Kamyar Arzideh
1   Central IT Department, Data Integration Center, University Hospital Essen, Essen, Germany
2   Institute for Artificial Intelligence in Medicine, University Hospital Essen, Essen, Germany
,
Giulia Baldini
2   Institute for Artificial Intelligence in Medicine, University Hospital Essen, Essen, Germany
3   Institute of Interventional and Diagnostic Radiology and Neuroradiology, University Hospital Essen, Essen, Germany
,
Philipp Winnekens
1   Central IT Department, Data Integration Center, University Hospital Essen, Essen, Germany
2   Institute for Artificial Intelligence in Medicine, University Hospital Essen, Essen, Germany
,
Christoph M. Friedrich
4   Department of Computer Science, University of Applied Sciences and Arts Dortmund, Dortmund, Germany
5   Institute for Medical Informatics, Biometry and Epidemiology, University Hospital Essen, Essen, Germany
,
Felix Nensa
2   Institute for Artificial Intelligence in Medicine, University Hospital Essen, Essen, Germany
3   Institute of Interventional and Diagnostic Radiology and Neuroradiology, University Hospital Essen, Essen, Germany
,
Ahmad Idrissi-Yaghir*
4   Department of Computer Science, University of Applied Sciences and Arts Dortmund, Dortmund, Germany
5   Institute for Medical Informatics, Biometry and Epidemiology, University Hospital Essen, Essen, Germany
,
René Hosch*
2   Institute for Artificial Intelligence in Medicine, University Hospital Essen, Essen, Germany
3   Institute of Interventional and Diagnostic Radiology and Neuroradiology, University Hospital Essen, Essen, Germany
Funding This study was funded by Deutsches Zentrum für Luft- und Raumfahrt under grant number 01ZZ2314D. The funder played no role in the study design, data collection, analysis and interpretation of data, or the writing of this manuscript.


Abstract

Objective Commercially available large language models such as Chat Generative Pre-Trained Transformer (ChatGPT) cannot be applied to real patient data for data protection reasons. At the same time, manual de-identification of unstructured clinical data is a tedious and time-consuming task. Since transformer models can efficiently process and analyze large amounts of text data, our study explores the impact of a large training dataset on the performance of this de-identification task.

Methods We utilized a substantial dataset of 10,240 German hospital documents from 1,130 patients, created as part of the investigating hospital's routine documentation, as training data. Our approach involved simultaneously fine-tuning and training an ensemble of two transformer-based language models to identify sensitive data within these documents. Annotation guidelines with specific annotation categories and types were created for annotator training.
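To illustrate how such a setup is commonly implemented, the sketch below fine-tunes a single German ELECTRA checkpoint for token classification with the Hugging Face transformers library. It shows only one ensemble member; the checkpoint name (deepset/gelectra-large), the PHI label set, the toy training record, and the hyperparameters are illustrative assumptions and not the authors' exact pipeline.

```python
# Minimal sketch: de-identification framed as token classification (NER) with
# a German ELECTRA model. Checkpoint, labels, data, and hyperparameters are
# illustrative assumptions, not the study's actual configuration.
from datasets import Dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

# Hypothetical PHI label set in BIO format.
labels = ["O", "B-NAME", "I-NAME", "B-DATE", "I-DATE", "B-LOCATION", "I-LOCATION"]
label2id = {l: i for i, l in enumerate(labels)}
id2label = {i: l for l, i in label2id.items()}

tokenizer = AutoTokenizer.from_pretrained("deepset/gelectra-large")
model = AutoModelForTokenClassification.from_pretrained(
    "deepset/gelectra-large",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)

# Toy training record; the study instead used word/label pairs derived from
# 10,240 manually annotated hospital documents.
train_records = [
    {
        "tokens": ["Herr", "Mustermann", "wurde", "am", "01.02.2020", "aufgenommen", "."],
        "ner_tags": ["O", "B-NAME", "O", "O", "B-DATE", "O", "O"],
    }
]

def tokenize_and_align(example):
    # Tokenize pre-split words and copy each word-level label to its first
    # sub-token; special tokens and continuation sub-tokens get -100 so the
    # loss ignores them.
    enc = tokenizer(example["tokens"], is_split_into_words=True,
                    truncation=True, max_length=512)
    aligned, prev = [], None
    for wid in enc.word_ids():
        if wid is None or wid == prev:
            aligned.append(-100)
        else:
            aligned.append(label2id[example["ner_tags"][wid]])
        prev = wid
    enc["labels"] = aligned
    return enc

train_ds = Dataset.from_list(train_records).map(tokenize_and_align)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="deid-gelectra", num_train_epochs=3,
                           per_device_train_batch_size=16, learning_rate=3e-5),
    train_dataset=train_ds,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```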

Results Performance evaluation on a test dataset of 100 manually annotated documents revealed that our fine-tuned German ELECTRA (gELECTRA) model achieved an F1 macro average score of 0.95, surpassing human annotators who scored 0.93.
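For reference, the macro-averaged F1 reported here weights every entity class equally, so rare categories of sensitive data count as much as frequent ones. The minimal token-level illustration below uses scikit-learn; the label names and sequences are toy examples, and the study's actual evaluation granularity is not reproduced here.

```python
# Token-level macro F1 sketch: F1 is computed per label and then averaged
# without class weighting. Labels and sequences are toy examples, not study data.
from sklearn.metrics import f1_score

gold = ["B-NAME", "O", "B-DATE", "O", "B-LOCATION", "O"]
pred = ["B-NAME", "O", "B-DATE", "O", "O", "O"]

print(f1_score(gold, pred, average="macro", zero_division=0))
```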

Conclusion We trained and evaluated transformer models to detect sensitive information in German real-world pathology reports and progress notes. We defined an annotation scheme tailored to the documents of the investigating hospital, created annotation guidelines for staff training, and conducted a further experimental study comparing the models with human annotators. In this comparison, the best-performing model achieved better overall results than two experienced annotators who manually labeled 100 clinical documents.

Protection of Human and Animal Subjects

The study was approved by the responsible Institutional Review Boards.


Data Availability

The data that support the findings of this study are not openly available for reasons of sensitivity but are available from the corresponding author upon reasonable request. The trained models created as part of this study are also not publicly accessible for reasons of data protection. The underlying code for this study is publicly available on GitHub at https://github.com/UMEssen/DOME.


* These authors shared senior authorship.



Publication History

Received: 25 April 2024

Accepted: 18 August 2024

Article published online:
08 January 2025

© 2025. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution-NonDerivative-NonCommercial License, permitting copying and reproduction so long as the original work is given appropriate credit. Contents may not be used for commercial purposes, or adapted, remixed, transformed or built upon. (https://creativecommons.org/licenses/by-nc-nd/4.0/)

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany