DOI: 10.1055/s-0045-1804138
Performance of an Open-source, Offline-capable Large Language Model in Data Extraction from Unstructured Electronic Health Records
Background: Open-source large language models may offer a solution to the data privacy concerns that hinder the use of large language models for processing health records. In this study, we assess the performance of a recently released, state-of-the-art, open-source, offline-capable large language model in data extraction from unstructured electronic health records.
Methods: Fifty fictitious patient medical records were drafted in German, and the open-source large language model (in all three of its differently sized variants: 405B, 70B, and 8B) was given instructions for processing each record. Data extraction involved text-mining and classification tasks for nine variables. Two closed-source, state-of-the-art large language models were used for comparison. Prompting and use of the large language models were performed via publicly available online deployments of the models.
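The extraction workflow described above can be sketched as a prompt-and-parse loop. The sketch below is illustrative only: the abstract does not list the nine variables or the exact prompt wording, so the field names and prompt text here are assumptions, and the model call itself is left out (in the study it went to hosted deployments of the models).

```python
import json

# Hypothetical field names for illustration; the abstract does not name
# the nine variables actually extracted in the study.
VARIABLES = [
    "age", "sex", "diagnosis", "medication", "smoker",
    "diabetes", "hypertension", "prior_surgery", "allergy",
]


def build_prompt(record_text: str) -> str:
    """Assemble an extraction prompt asking the model for strict JSON output."""
    fields = ", ".join(VARIABLES)
    return (
        "Extract the following fields from the German medical record below and "
        f"answer with a single JSON object containing exactly these keys: {fields}. "
        "Use null for any value not stated in the record.\n\n"
        f"Record:\n{record_text}"
    )


def parse_reply(reply: str) -> dict:
    """Parse the model's JSON reply and verify that all expected keys are present."""
    data = json.loads(reply)
    missing = [key for key in VARIABLES if key not in data]
    if missing:
        raise ValueError(f"model reply is missing keys: {missing}")
    return data
```

Requesting a fixed JSON schema like this makes the replies machine-checkable, so false or missing predictions for each of the nine variables can be counted per record.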
Results: The accuracy of the open-source large language model over all 450 requested values was 100% (no false predictions) for the 405B variant, 98.6% (6 false predictions, all in binary classifications) for the 70B variant, and 90.8% (41 false predictions, all in binary classifications) for the 8B variant. Both compared closed-source large language models achieved 100% accuracy (no false predictions).
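The accuracy figures follow directly from the study design: 50 records with nine variables each yield 450 requested values, and accuracy is the fraction of those values predicted correctly. A quick check of that arithmetic:

```python
def accuracy(n_correct: int, n_total: int) -> float:
    """Fraction of correctly extracted values."""
    return n_correct / n_total


# 50 records x 9 variables = 450 requested values in total
TOTAL = 50 * 9

acc_405b = accuracy(TOTAL, TOTAL)       # no false predictions
acc_70b = accuracy(TOTAL - 6, TOTAL)    # 6 false predictions
acc_8b = accuracy(TOTAL - 41, TOTAL)    # 41 false predictions
```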
Conclusion: The 405B variant of the open-source large language model exhibited excellent performance, on par with the two closed-source comparators. Further research with a local, offline installation of the 405B model on sufficiently capable computing infrastructure, using real health records, is warranted to confirm these results.
No conflict of interest has been declared by the author(s).
Publication History
Article published online:
11 February 2025
© 2025. Thieme. All rights reserved.
Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany