DOI: 10.1055/a-2797-4295
Using a Large Language Model–generated Prompt to Extract Features from Synthetic MRI Brain Scan Reports: A Cross-sectional Study
Abstract
Background
Feature extraction from free-text medical reports is a frequently required clinical, operational, and research task. Large language models (LLMs) hold promise for automating feature extraction, which in turn can enable category assignment tasks.
Objective
To compare the groundedness of features extracted by five LLMs from magnetic resonance imaging (MRI) brain scan reports when using a clinician-engineered versus an LLM-generated prompt.
Methods
Five OpenAI LLMs were evaluated for their ability to extract nine binary features from synthetic MRI brain reports. Two prompts were used: one clinician-engineered and one LLM-generated. Recall, precision, accuracy, and F1 score were calculated to assess model performance.
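As a rough illustration of the extraction step, the sketch below shows how a single report might be passed to an OpenAI model with a prompt requesting binary labels. The feature names, prompt wording, and model identifier are illustrative assumptions, not the study's actual materials.

```python
# Hypothetical sketch of prompt-based binary feature extraction from an MRI
# brain report using the OpenAI Python SDK; feature names, prompt wording,
# and the model identifier are placeholders, not the study's materials.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assumed example features; the study's actual nine binary features are not listed here.
FEATURES = ["acute_infarct", "intracranial_hemorrhage", "mass_lesion"]

def extract_features(report_text: str, model: str = "gpt-4") -> dict:
    """Ask the model to label each feature as 1 (present) or 0 (absent)."""
    prompt = (
        "Read the MRI brain report below and return only a JSON object mapping "
        f"each of these features to 1 if present or 0 if absent: {FEATURES}.\n\n"
        f"Report:\n{report_text}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the extraction as deterministic as possible
    )
    # Assumes the model returns valid JSON; a production pipeline would validate this.
    return json.loads(response.choices[0].message.content)
```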
Results
Across all extracted features, all studied models, and both tested prompts, the average recall was 0.956, the average precision was 0.9347, the average accuracy was 0.982, and the average F1 score was 0.9431. With GPT-3.5-turbo, the LLM-generated prompt performed numerically better than the clinician-engineered prompt. For the four GPT-4 models examined, overall recall, precision, and accuracy were higher regardless of prompt source.
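The reported metrics follow the standard binary-classification definitions. A minimal sketch of how they could be computed for one feature with scikit-learn, using synthetic placeholder labels rather than the study's data:

```python
# Hypothetical per-feature metric computation; label vectors are synthetic placeholders.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]   # ground-truth labels for one binary feature
y_pred = [1, 0, 1, 0, 0, 1]   # labels extracted by the LLM

print("recall   ", recall_score(y_true, y_pred))
print("precision", precision_score(y_true, y_pred))
print("accuracy ", accuracy_score(y_true, y_pred))
print("F1       ", f1_score(y_true, y_pred))
```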
Conclusion
This study highlights the potential of LLMs both to generate prompts and to extract features accurately from MRI brain scan reports, with newer models such as GPT-4 performing consistently well. The efficacy of LLM feature extraction depends on both the prompt and the model used.
Keywords
generative AI - artificial intelligence - large language models - natural language processing - prompt engineering
Declaration of GenAI Use
During the writing process of this paper, the author(s) used OpenAI's ChatGPT-4o to create the Supplementary Appendix (available in the online version only). The author(s) reviewed and edited the text and take(s) full responsibility for the content of the paper.
Publication History
Received: 16 March 2025
Accepted: 23 January 2026
Article published online:
19 February 2026
© 2026. Thieme. All rights reserved.
Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany