Background: Open-source large language models may offer a solution to the data privacy issues that hinder the use of large language models for processing health records. In this study, we assess the performance of a recently released state-of-the-art open-source and offline-capable large language model in extracting data from unstructured electronic health records.
Methods: Fifty fictitious patient medical records were drafted in German, and the open-source large language model (in all three of its differently sized variants: 405B, 70B, and 8B parameters) was instructed to process each record. Data extraction involved text-mining and classification tasks for nine variables. Two closed-source state-of-the-art large language models were used for comparison. All models were prompted and run via publicly available online deployments.
Results: Across all 450 requested values, the accuracy of the open-source large language model was 100% (no false predictions) for the 405B variant, 98.7% (6 false predictions, all in binary classification tasks) for the 70B variant, and 90.9% (41 false predictions, all in binary classification tasks) for the 8B variant. Both closed-source large language models achieved an accuracy of 100% (no false predictions).
Conclusion: The 405B variant of the open-source large language model exhibited excellent performance, on par with the two closed-source models used for comparison. Further research with a local, offline installation of the 405B variant on sufficiently powerful computing infrastructure, using real health records, is warranted to confirm these results.