Methods Inf Med
DOI: 10.1055/a-2797-4295
Original Article

Using a Large Language Model–generated Prompt to Extract Features from Synthetic MRI Brain Scan Reports: A Cross-sectional Study

Authors

  • John J. Hanna

    1   Department of Internal Medicine, ECU Brody School of Medicine, Greenville, North Carolina, United States
    2   Information Services, ECU Health, Greenville, North Carolina, United States
    3   Clinical Informatics Center, University of Texas Southwestern, Dallas, Texas, United States
  • Christopher S. Evans

    2   Information Services, ECU Health, Greenville, North Carolina, United States
    4   Department of Emergency Medicine, ECU Brody School of Medicine, Greenville, North Carolina, United States
  • Christopher R. Dennis

    2   Information Services, ECU Health, Greenville, North Carolina, United States
  • K Stuart Lee

    5   ECU Health Neurosurgery and Spine, ECU Health, Greenville, North Carolina, United States
  • Christoph U. Lehmann

    3   Clinical Informatics Center, University of Texas Southwestern, Dallas, Texas, United States
    6   Department of Pediatrics, University of Texas Southwestern, Dallas, Texas, United States
  • Richard J. Medford

    1   Department of Internal Medicine, ECU Brody School of Medicine, Greenville, North Carolina, United States
    2   Information Services, ECU Health, Greenville, North Carolina, United States
    3   Clinical Informatics Center, University of Texas Southwestern, Dallas, Texas, United States

Abstract

Background

Feature extraction from free-text medical reports is a frequently required clinical, operational, or research procedure. Large language models (LLMs) hold promise for automating feature extraction, which in turn can enable category-assignment tasks.

Objective

To compare the groundedness of features extracted by five LLMs from magnetic resonance imaging (MRI) brain scan reports using a clinician-engineered versus an LLM-generated prompt.

Methods

Five OpenAI LLMs were evaluated for their ability to extract nine binary features from synthetic MRI brain scan reports. Two prompts were used: one clinician-engineered and one LLM-generated. Recall, precision, accuracy, and F1 score were calculated to assess model performance.
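
For illustration, the following is a minimal sketch of how such an extraction-and-scoring pipeline can be assembled with the OpenAI Python SDK and scikit-learn. The prompt wording, feature names, report texts, and labels are hypothetical placeholders, not the study's materials, and the study's nine features are abbreviated to four for brevity.

# Minimal sketch of prompt-based binary feature extraction. The prompt,
# feature names, report texts, and labels below are hypothetical
# illustrations, not the study's actual materials.
import json

from openai import OpenAI
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical binary features (the study used nine; these are examples only).
FEATURES = ["acute_infarct", "hemorrhage", "mass_effect", "midline_shift"]

PROMPT = (
    "You are extracting findings from an MRI brain scan report. "
    "For each feature, answer 1 if present and 0 if absent. "
    "Return only a JSON object with these keys: " + ", ".join(FEATURES) + ".\n\n"
    "Report:\n{report}"
)

def extract_features(report_text: str, model: str = "gpt-4") -> dict:
    """Ask the model for the binary features as a JSON object."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output for a classification-style task
        messages=[{"role": "user", "content": PROMPT.format(report=report_text)}],
    )
    return json.loads(response.choices[0].message.content)

# Hypothetical evaluation against clinician-labeled ground truth.
reports = ["Synthetic report text 1 ...", "Synthetic report text 2 ..."]
truth = [[1, 0, 0, 0], [0, 1, 1, 1]]  # one row of labels per report

predicted = []
for text in reports:
    parsed = extract_features(text)
    predicted.append([int(parsed[f]) for f in FEATURES])

# Flatten per-report label vectors so each feature decision is one sample.
y_true = [label for row in truth for label in row]
y_pred = [label for row in predicted for label in row]

print("recall   ", recall_score(y_true, y_pred))
print("precision", precision_score(y_true, y_pred))
print("accuracy ", accuracy_score(y_true, y_pred))
print("F1       ", f1_score(y_true, y_pred))

Setting the temperature to 0 and requesting a bare JSON object keeps the output machine-parsable, which matters when the same prompt is reused unchanged across several models.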

Results

Across all extracted features, all studied models, and both tested prompts, the overall average recall was 0.956, average precision 0.9347, average accuracy 0.982, and average F1 score 0.9431. With GPT-3.5-turbo, the LLM-generated prompt performed numerically better than the clinician-engineered prompt. For the other four models examined, all GPT-4 variants, overall recall, precision, and accuracy were higher than with GPT-3.5-turbo regardless of the prompt source.
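
For reference, the reported metrics presumably follow their standard definitions in terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN):

\[
\text{Recall} = \frac{TP}{TP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP},
\]
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\]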

Conclusion

This study highlights the potential of LLMs both to engineer prompts and to extract features accurately from MRI brain scan reports, with newer models such as GPT-4 performing consistently well. The efficacy of feature extraction by an LLM depends on both the prompt and the model used.

Declaration of GenAI Use

During the writing process of this paper, the author(s) used OpenAI's ChatGPT-4o to create the Supplementary Appendix (available in the online version only). The author(s) reviewed and edited the text and take(s) full responsibility for the content of the paper.

Publication History

Received: 16 March 2025

Accepted: 23 January 2026

Article published online:
19 February 2026

© 2026. Thieme. All rights reserved.

Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany