Rofo
DOI: 10.1055/a-2641-3059
Review

From Referral to Reporting: The Potential of Large Language Models in the Radiological Workflow

Article in several languages: English | German
Anna Fink
1   Department of Diagnostic and Interventional Radiology, University of Freiburg Faculty of Medicine, Freiburg, Germany (Ringgold ID: RIN88751)
,
Stephan Rau
1   Department of Diagnostic and Interventional Radiology, University of Freiburg Faculty of Medicine, Freiburg, Germany (Ringgold ID: RIN88751)
,
Kai Kästingschäfer
1   Department of Diagnostic and Interventional Radiology, University of Freiburg Faculty of Medicine, Freiburg, Germany (Ringgold ID: RIN88751)
,
Jakob Weiß
1   Department of Diagnostic and Interventional Radiology, University of Freiburg Faculty of Medicine, Freiburg, Germany (Ringgold ID: RIN88751)
,
Fabian Bamberg
1   Department of Diagnostic and Interventional Radiology, University of Freiburg Faculty of Medicine, Freiburg, Germany (Ringgold ID: RIN88751)
,
Maximilian Frederik Russe
1   Department of Diagnostic and Interventional Radiology, University of Freiburg Faculty of Medicine, Freiburg, Germany (Ringgold ID: RIN88751)

Supported by: Berta-Ottenstein-Programme for Clinician Scientists, Faculty of Medicine, University of Freiburg
 

Abstract

Background

Large language models (LLMs) hold great promise for optimizing and supporting radiology workflows amidst rising workloads. This review examines potential applications in daily radiology practice, as well as remaining challenges and potential solutions.

Method

Presentation of potential applications and challenges, illustrated with practical examples and concrete optimization suggestions.

Results

LLM-based assistance systems have potential applications in almost all language-based process steps of the radiological workflow. Significant progress has been made in areas such as report generation, particularly with retrieval-augmented generation (RAG) and multi-step reasoning approaches. However, challenges related to hallucinations, reproducibility, and data protection, as well as ethical concerns, need to be addressed before widespread implementation.

Conclusion

LLMs have immense potential in radiology, particularly for supporting language-based process steps, with technological advances such as RAG and cloud-based approaches potentially accelerating clinical implementation.

Key Points

  • LLMs can optimize reporting and other language-based processes in radiology with technologies such as RAG and multi-step reasoning approaches.

  • Challenges such as hallucinations, reproducibility, privacy, and ethical concerns must be addressed before widespread adoption.

  • RAG and cloud-based approaches could help overcome these challenges and advance the clinical implementation of LLMs.

Citation Format

  • Fink A, Rau S, Kästingschäfer K et al. From Referral to Reporting: The Potential of Large Language Models in the Radiological Workflow. Rofo 2025; DOI 10.1055/a-2641-3059


Abbreviations

EHDS: European Health Data Space
GPT: Generative pre-trained transformer
AI: Artificial intelligence
LLM: Large language model
NLP: Natural language processing
RAG: Retrieval-augmented generation
SOP: Standard operating procedure


Introduction

Advances in technology have traditionally shaped the field of radiology. For example, the transition from film-based X-ray archiving to digital archiving or the development of modern cross-sectional imaging techniques, such as CT and MRI, represent major transformations. Today, radiology is facing a new period of transition brought on once again by technological innovation as artificial intelligence (AI) becomes part of the clinical routine.

In view of increasing workloads [1] and the related risk of errors [2], there is a growing need for tools to improve diagnostic efficiency. In recent years, the development of large language models (LLMs) such as GPT-4 [3], Claude [4], and Gemini Pro [5] has garnered considerable attention, as their potential for optimization in everyday radiology practice is very promising [6] [7] [8] [9] [10]. However, challenges remain, such as hallucinations, where incorrect answers are generated to bridge gaps in knowledge, as well as limitations when it comes to complex cognitive tasks [11] [12]. The lack of transparency is also problematic in a medical context, where precise and correct answers are critical [13] [14]. In addition, data protection and ethical issues must be resolved before AI can gain widespread use in medicine [15] [16].

The aim of this paper is to provide a comprehensive overview of the areas where LLMs could be applied in radiology, to discuss possible solutions to reduce the limitations mentioned, and to describe the outlook for future implementation.


Main section

1. Principles of interaction

The development of large language models would not have been possible without advances in natural language processing (NLP) [17], which studies the linguistic interaction between humans and computers. Initial research approaches in this area date back to the 1950s, but the real breakthrough only came with the introduction of the transformer architecture [18]. This architecture forms the basis of many commercial models, such as GPT-4 [3], Claude [4], and Gemini Pro [5], that are now known worldwide.

The automated generation of sequential text is based on embeddings, i.e., the numerical representation of words and their context. Based on the parameters learned during training, the LLM attempts to predict the most likely next word or word sequence in the sentence context and thus to generate text (hence “generative AI”). LLM outputs are therefore primarily based on probabilities, which is a key aspect for understanding both the applications and the limitations of this technology.
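
To make this principle concrete, the following minimal sketch (with invented toy scores, not a real model) shows how scores assigned to candidate next words are converted into a probability distribution via the softmax function:

```python
import numpy as np

# Toy continuation of "The fracture involves the tibial ..."; the scores
# (logits) are invented, a real LLM computes them from the learned
# embeddings of the preceding context.
vocab = ["plateau", "shaft", "spine", "banana"]
logits = np.array([4.2, 2.1, 1.5, -3.0])

probs = np.exp(logits) / np.exp(logits).sum()   # softmax
for word, p in zip(vocab, probs):
    print(f"{word:>8}: {p:.3f}")
# "plateau" receives by far the highest probability; generation proceeds
# by appending the chosen word and repeating the prediction step.
```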

Despite the massive hype surrounding the development of these language models, their limitations soon became apparent. In addition to the often non-transparent tuning of model parameters by the providers, individual approaches to optimizing prompts have since been developed. Prompt engineering enables prompts to be adapted in a targeted manner, while multi-step reasoning approaches can further improve the quality of the interaction. Techniques such as few-shot or zero-shot prompting can optimize the response accuracy of LLMs by embedding task-specific information or examples directly in the prompt. Retrieval-augmented generation (RAG) additionally allows updatable, subject-specific information to be integrated automatically from external sources. This increases transparency because the sources used can be cited directly [19] [20].
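
As an illustration of few-shot prompting, the sketch below assembles a prompt that places two worked examples before the actual question; the examples and wording are purely hypothetical and would in practice be drawn from local SOPs:

```python
# Hypothetical few-shot examples for protocol selection; invented for
# illustration, not validated clinical recommendations.
FEW_SHOT_EXAMPLES = [
    ("Suspected pulmonary embolism, normal renal function",
     "CT pulmonary angiography with IV contrast"),
    ("Follow-up of known multiple sclerosis",
     "MRI brain with and without IV contrast"),
]

def build_prompt(referral_text: str) -> str:
    parts = ["Suggest an appropriate imaging protocol for the referral below."]
    for referral, protocol in FEW_SHOT_EXAMPLES:
        parts.append(f"Referral: {referral}\nProtocol: {protocol}")
    parts.append(f"Referral: {referral_text}\nProtocol:")
    return "\n\n".join(parts)

print(build_prompt("Acute flank pain, suspected ureteric stone"))
```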

Overall, these modifications have the potential to significantly expand the range of applications in the medical context. In almost every step of radiological patient care – from referral and scheduling to imaging and reporting ([Fig. 1]) – it is now possible to imagine applications using LLMs.

Fig. 1 Steps in the everyday routine care of radiology patients that could benefit from the potential of large language models.

2. Scope of applications in clinical practice

2.1. Determining indications and defining protocol

Since the best-known models are based on language processing via NLP, their greatest potential lies in optimizing language- or text-based work steps. In radiology, the adaptation of report texts comes to mind first. However, upstream process steps also offer opportunities for improving efficiency.

Radiological patient care begins with determining the indication and then defining the diagnostic imaging protocol, which requires cooperation between the referring physician and the responsible radiologist. This step forms the basis for an accurate diagnosis and helps to prevent unnecessary scans and radiation exposure.

Rosen et al. and Barash et al. were able to show that the recommendations for appropriate imaging and contrast medium administration derived by LLMs from referral texts align largely with established guidelines such as the European Imaging Referral Guidelines [21] and the Appropriateness Criteria of the American College of Radiology [22]. However, many of these applications were limited to specialized areas, and in some cases problems arose when recommendations were vaguely worded [21] [22].

A promising approach is meta-learning or in-context learning, where the LLM optimizes its output to solve new tasks based on question-specific examples [23]. A further development of this retrieval-augmented generation technique enables the model to access an external database compiled specifically for the respective discipline and containing, for example, scientific articles, textbook content, or department-specific standard operating procedures (SOPs) [20]. The knowledge extracted is integrated directly in the LLM input, in order to provide more accurate and better informed answers ([Fig. 2]). Rau et al. and Rosen et al. were able to show that this approach significantly improves the accuracy of answers and achieves a level comparable to subject matter experts in fictitious case studies. Furthermore, the use of such specialized LLMs results in significant time and cost savings [7] [24].

Fig. 2 Process steps with RAG: After manual user input, the query is embedded in a high-dimensional vector space, in order to subsequently perform a similarity search in a separate vector index containing specialist literature or guidelines, for example. The context information obtained in this way is handed over to the language model together with the original prompt and used to produce an answer based on verifiable sources.
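
The retrieval step shown in [Fig. 2] can be sketched in a few lines of Python. For a self-contained toy example, a simple word-count vector stands in for the dense embeddings of a production system, and the final LLM call is only indicated:

```python
import numpy as np
from collections import Counter

# Mini "vector index": in practice these would be guideline or SOP chunks
# embedded with a dedicated embedding model, not word-count vectors.
documents = [
    "Schatzker classification of tibial plateau fractures, types I-VI ...",
    "Vancouver classification of periprosthetic femur fractures ...",
    "Contrast medium dosing in impaired renal function ...",
]

def embed(text: str, vocab: list[str]) -> np.ndarray:
    # Toy stand-in for a dense embedding model.
    counts = Counter(text.lower().split())
    return np.array([counts[w] for w in vocab], dtype=float)

def retrieve(query: str, docs: list[str]) -> str:
    vocab = sorted({w for d in docs + [query] for w in d.lower().split()})
    q = embed(query, vocab)
    sims = []
    for d in docs:
        v = embed(d, vocab)
        denom = np.linalg.norm(q) * np.linalg.norm(v) or 1.0
        sims.append(q @ v / denom)          # cosine similarity
    return docs[int(np.argmax(sims))]

query = "Classify this tibial plateau fracture"
context = retrieve(query, documents)
prompt = f"Context:\n{context}\n\nQuestion: {query}"
# 'prompt' would now be sent to the LLM together with source references.
print(prompt)
```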

Approaches that have so far received little research attention include the use of LLMs to help evaluate laboratory parameters, automatically extract previous imaging findings, and extract relevant patient data from doctor's letters or consultation notes. Clinical information in imaging requests is often incomplete and can contain errors, which is problematic because the higher the quality of this information, the better the quality of the report [25]. Given both the clinical need and the potential offered, the use of LLMs in this area deserves closer study.


2.2. Scheduling appointments and preparing patients

It is not only radiologists who stand to benefit from integrating language models; other professional groups, such as medical assistants, could also benefit from LLMs in the future. In one possible scenario, LLMs could support appointment scheduling by automatically prioritizing urgent requests and highlighting the related appointments. These tasks could potentially also be integrated into AI-based, automated appointment-scheduling systems [26].

There are also language-based activities in the area of patient preparation that could potentially be automated. For example, a combination of language models and digital informed consent forms could be developed so that patients could ideally fill out these forms at home prior to their examination, in order to reduce the time spent in waiting rooms. In this scenario, the language model would act as a go-between by accessing department-specific SOPs, timelines, and location descriptions, as well as answering patients' frequently asked questions. In addition, this technology could help the healthcare professional providing the informed consent consultation to save time by offering relevant information from the informed consent forms – such as pre-existing diseases of the kidney or thyroid gland, or possible contrast medium allergies – in a structured manner.
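
A minimal sketch of how such risk-relevant items could be extracted in structured form is shown below; the field names and prompt are assumptions for illustration, not an established schema, and the model call itself is omitted:

```python
import json

def build_extraction_prompt(consent_text: str) -> str:
    # Field names are illustrative only, not an established schema.
    return (
        "Extract the following items from the consent form text and answer "
        "ONLY with JSON: renal_disease, thyroid_disease, contrast_allergy "
        "(each true or false).\n\nText:\n" + consent_text
    )

def parse_llm_answer(answer: str) -> dict:
    # Defensive parsing: LLM output is not guaranteed to be valid JSON.
    try:
        return json.loads(answer)
    except json.JSONDecodeError:
        return {"error": "unparseable model output", "raw": answer}

# What a well-formed model answer would look like:
print(parse_llm_answer(
    '{"renal_disease": true, "thyroid_disease": false, '
    '"contrast_allergy": false}'
))
```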

From a technical perspective, these approaches could already be implemented today, and local adaptations to internal hospital standards could be achieved with the help of RAG. However, the quality and structure of inputs have a significant impact on LLM outputs [27]. Unstructured input from patients with little or no medical expertise could therefore lead to misinformation. Applied research will thus have to demonstrate the extent to which such systems can be implemented successfully.


2.3. Reporting

The image acquisition step is followed by another language-based area in the radiological workflow: reporting. This area has been the focus of LLM research in recent years, as it promises to directly reduce the workload for radiologists in everyday clinical practice.

One of the particular strengths of LLMs is their ability to structure large amounts of text. In the early days of language models, this led to the development of an important field of research: generating structured findings from unstructured free text. LLMs can sort findings thematically, structure continuous text, and visualize the follow-up of, e.g., oncological diseases [28] [29]. In a blinded analysis, Bhayana et al. demonstrated that referring physicians prefer LLM-generated structured findings to the original findings and use them to make treatment decisions more rapidly [30]. In addition, LLMs can be used to correct existing report texts and thus save time in reporting [31] [32]. The first companies in the US are already offering systems that automatically generate report impressions, such as RadAI with Omni Impressions [33] or Nuance Communications with PowerScribe Smart Impression [34].
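
The following sketch illustrates one way such a structuring instruction could be phrased; the headings are one possible convention, not a fixed standard:

```python
# Hypothetical instruction for converting free text into a structured
# report; the headings follow one common convention and can be adapted.
STRUCTURE_INSTRUCTION = (
    "Restructure the following radiology report under the headings "
    "'Clinical information', 'Technique', 'Findings' (grouped by organ "
    "system), and 'Impression'. Do not add, remove, or reinterpret any "
    "medical content; preserve all measurements and dates verbatim."
)

def structured_report_prompt(free_text_report: str) -> str:
    return f"{STRUCTURE_INSTRUCTION}\n\nReport:\n{free_text_report}"

print(structured_report_prompt("CT abdomen: liver unremarkable. ..."))
```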

LLMs could potentially also be used in the final step of the process chain, when findings are communicated to patients. Studies by Amin et al. and Meddeb et al. have shown that radiological terminology can be translated into simpler concepts that are easy for patients to understand [10], as well as into foreign languages [35], in order to overcome communication barriers.

For a long time, it was thought that these applications held the greatest potential for LLMs, while limits were reached in generating new texts. Although popular large language models were able to pass multiple-choice knowledge tests, such as the American Board of Radiology's certifying exam, they sometimes produced low-quality results in terms of robustness and reproducibility. In addition, the models produced incorrect solutions with high confidence and displayed deficits, particularly when performing complex reasoning tasks [11] [36]. Deficits were also found when answering questions requiring medical knowledge or when generating differential diagnoses from report texts, which further underscores the importance of including expert medical knowledge in the data used to train language models [37] [38].

A major problem in this regard is that most of the powerful models come from commercial providers, so specialized medical training is unlikely to be included due to a lack of interest on the part of the providers. In addition, manual, task-specific training of the models is extremely time- and data-intensive and is therefore difficult to implement.

As a result, a variety of approaches have emerged in recent years that incorporate task-specific knowledge directly in the input prompt instead of retraining the entire model [23]. Nevertheless, when integrating large amounts of data into the input prompt, one quickly encounters input restrictions (known as token limits), in addition to the problem that relevant content is at risk of getting lost in the mass of information [39]. One promising solution is RAG, in which the LLM accesses, with every prompt, an external, manually created database of specialist articles, textbooks, or SOPs. This approach has not only led to a significant performance improvement on radiological questions [40], but has also shown potential for providing a diagnosis based on unstructured report texts. For example, concrete diagnoses could be generated in trauma imaging [8], gastrointestinal imaging [9], or fracture classification in accordance with the guidelines of the Swiss AO Foundation [41]. [Fig. 3] and [Fig. 4] show two practical examples with the corresponding output from the field of trauma imaging. The detailed input prompt for both models is provided in the supplementary material (Suppls. 1 and 2).
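
The two-stage prompting used in [Fig. 3] and [Fig. 4] can be thought of as two chained model calls: the first condenses the report to its classification-relevant findings, and the second classifies them using retrieved guideline context. A schematic sketch, in which ask_llm and retrieve are placeholders for any chat-completion API and a retrieval function such as the one sketched above:

```python
# Schematic two-stage pipeline; 'ask_llm' stands in for any chat-completion
# API and 'retrieve' for a RAG lookup such as the one sketched earlier.
def two_stage_diagnosis(report_text, ask_llm, retrieve):
    # Stage 1: condense the report to its classification-relevant findings.
    findings = ask_llm(
        "List only the fracture-relevant imaging findings in this report:\n"
        + report_text
    )
    # Stage 2: classify the findings, grounded in retrieved guideline context.
    context = retrieve(findings)
    return ask_llm(
        f"Context from the literature:\n{context}\n\n"
        f"Findings:\n{findings}\n\n"
        "State the most appropriate fracture classification and cite the "
        "supporting context passage."
    )

# Minimal dry run with mock functions instead of real model/database calls:
mock_llm = lambda prompt: f"[model answer based on {len(prompt)} prompt chars]"
mock_retrieve = lambda q: "Schatzker type IV: fracture of the medial plateau ..."
print(two_stage_diagnosis("Split fracture of the medial tibial plateau ...",
                          mock_llm, mock_retrieve))
```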

Fig. 3 Comparison of a generic model (GPT-4 Turbo, incorrect answers are highlighted in red) versus an enhanced model that uses a two-stage prompt, as well as retrieval-augmented generation (RAG) (GPT-4 Turbo with RAG, correct answer), to diagnose and classify a proximal tibial fracture, Schatzker type IV. The RAG solution provided the LLM with context-specific information extracted automatically from the “RadioGraphics Top 10 Reading List Trauma Radiology” [51]. The detailed input prompt for both models is provided in the supplementary material (Suppls. 1 and 2).
Fig. 4 Comparison of a generic model (GPT-4 Turbo, incorrect answers are highlighted in red) versus an enhanced model that uses a two-stage prompt, as well as retrieval-augmented generation (RAG) (GPT-4 Turbo with RAG, correct answer), to diagnose and classify a periprosthetic femur fracture, Vancouver type AGT. The RAG solution provided the LLM with context-specific information extracted automatically from the “RadioGraphics Top 10 Reading List Trauma Radiology” [42]. The detailed input prompt for both models is provided in the supplementary material (Suppls. 1 and 2).

Such tools could lead to significant time savings in routine radiology and reduce the amount of time-consuming research. To further improve transparency and confidence in LLM statements, hyperlinks to the sources used, including page references for the information extracted, can be included in each answer [8].

[Fig. 5] provides an overview of the possible applications discussed.

Fig. 5 Potential applications of LLMs in the radiological process chain.


3. Challenges and implications

Despite the enormous potential of LLMs, their limitations nevertheless need to be taken into account. The best-known challenges include hallucinations, where misinformation is generated to bridge gaps in knowledge, as well as problems with more complex reasoning tasks that involve multiple iterative steps. LLMs are based on probabilistic predictions and do not use classical machine learning with a ground-truth reference value. This leads to limitations in specialized areas where dedicated information is underrepresented in the training dataset.

Another problem is that the knowledge is not always up to date, because language models only use information available up to the time of their training (for GPT-4 Turbo, December 2023 [43]). This is particularly problematic in rapidly developing areas such as radiology. For example, diagnostic guidelines might have been revised in the meantime, meaning the LLM would no longer have access to the latest version and its response would therefore be based on outdated information.

Since subject-specific training is currently not very feasible for the reasons already mentioned, solutions instead focus primarily on creating the best possible input prompts, e.g., by using multi-step reasoning approaches, or on supplementing the input data using RAG [27]. The LLM can access either real-time web databases such as PubMed or a traditional RAG database with carefully curated, scientifically reviewed information. Agent-based approaches [44], where several RAG-augmented LLMs interact like an interdisciplinary team of experts and thereby produce a joint result, also offer promising future prospects. This raises the important question of responsibility. In the future, it may be necessary for RAG databases to be created and continuously updated by professional societies, scientific journals, or within departments, taking local SOPs into account.
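
Such an agent-based setup can be pictured as several "experts" answering independently, followed by a synthesis step; the roles and the ask_llm placeholder below are assumptions for illustration:

```python
# Schematic multi-agent consensus; 'ask_llm' is a placeholder, and in
# practice each agent would query its own curated RAG database.
EXPERT_ROLES = ["musculoskeletal radiologist", "trauma surgeon"]

def panel_answer(question, ask_llm):
    opinions = [
        ask_llm(f"Answer as a {role}, citing your sources: {question}")
        for role in EXPERT_ROLES
    ]
    joined = "\n\n".join(f"Opinion {i + 1}: {o}" for i, o in enumerate(opinions))
    return ask_llm(
        "Synthesize the following expert opinions into one consistent answer "
        f"and explicitly flag any disagreement:\n\n{joined}"
    )

mock_llm = lambda prompt: f"[model answer to: {prompt[:40]}...]"
print(panel_answer("How should this periprosthetic fracture be classified?",
                   mock_llm))
```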

One further limitation is the linguistic and, in some cases, content-related variability of the outputs, whereby the same instruction can lead to different results on repeated runs [45]. This is usually due to a high degree of creativity in the language model, i.e., the variability when selecting the next word (the "temperature" setting). In most common LLMs, this parameter can be adjusted manually to ensure greater consistency. Problems related to the lack of transparency of LLM outputs can be addressed by RAG-based, verifiable references with hyperlinks in each answer.
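
The effect of the temperature parameter can be made concrete with a small calculation: the logits are divided by the temperature before the softmax, which sharpens (low temperature) or flattens (high temperature) the next-word distribution; values close to zero make the most likely word nearly deterministic:

```python
import numpy as np

logits = np.array([4.2, 2.1, 1.5, -3.0])        # same toy scores as above

def next_word_probs(logits, temperature):
    scaled = logits / temperature
    scaled -= scaled.max()                       # numerical stability
    e = np.exp(scaled)
    return e / e.sum()

for t in (0.2, 1.0, 2.0):
    print(f"T={t}:", np.round(next_word_probs(logits, t), 3))
# A low temperature concentrates almost all probability on one word
# (reproducible output); a high temperature spreads it out (more variety).
```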

Data protection concerns represent another obstacle to implementing LLMs in clinical routine. Every request to the high-performance, commercially operated LLMs runs through third-party servers; wide-scale use of LLMs with highly sensitive patient data would thus be a serious violation of data protection laws. One solution would be to use specially developed local models, which requires both a high level of expertise and considerable server capacity.

A resource-saving and sustainable alternative is offered by cloud-based solutions from providers such as AWS [46], Google [47], or Microsoft [48]. The data are processed in a protected cloud environment that is subject to data security requirements similar to those of an internal hospital IT system. In this context, the European Society of Radiology recently advocated for a compliant implementation of the European AI Act, including through the creation of a European Health Data Space [49]. Since the use of these systems still requires a high level of technical expertise, commercial out-of-the-box solutions or integrated platform approaches could become established in the future.

Finally, ethical concerns should not be ignored, especially regarding the inherent bias of LLMs that can arise from distorted training data. There is a risk that users will be influenced and possibly misled by the models' often convincingly presented answers. This is particularly relevant when LLMs serve as a preliminary source of information for patients without a medical background, for example in the context of digital patient education. Before broad implementation, it is therefore essential to ensure both the diversity of the training data and a comprehensible reasoning chain of the language model so that the outputs can be verified.

As the use of large language models becomes more widespread, it is becoming increasingly common for patients to enter image data or report texts into LLMs in order to translate this information into lay language or to obtain a supposed second opinion. Since LLMs respond with a high degree of linguistic confidence even in complex situations, this can lead to uncertainty and questions from patients. Radiological specialists should therefore be specifically trained in dealing with such situations and in putting LLM-generated statements into context.

In addition to the individual responsibility of physicians, medical associations in particular can play a key role in integrating LLM applications into the healthcare system. In its current statement, the German Medical Association explicitly calls on professional organizations to support the use of AI in clinical practice through clear, evidence-based recommendations for action [50]. Moreover, given the speed of innovation in the field, these new systems will have to be evaluated continually to ensure their safety, effectiveness, and quality in everyday clinical practice over the long term.


4. Conclusions for practice

In summary, LLMs offer tremendous potential, which is already being widely discussed in the medical community [12] [42]. In radiology, LLMs can support language-based process steps in particular. Ongoing technological advances in these models, as well as approaches such as RAG, agent-based models, and cloud-based solutions, could enable clinical implementation. In this context, it is critical to define solid rules regarding data security, ethical issues, and responsibilities. In addition, comprehensive training needs to be provided to radiologists and medical staff regarding the functionality, capabilities, and limitations of LLMs, in order to ensure responsible use and to build the confidence that is essential for successful implementation.

In view of the ever-increasing number of examinations and expanding workloads, it is important to actively shape these developments in order to ensure that LLMs are used responsibly in support of everyday radiology practice.




Conflict of Interest

The authors declare that they have no conflict of interest.


Correspondence

Anna Fink
Department of Diagnostic and Interventional Radiology, University of Freiburg Faculty of Medicine
Hugstetter Str. 55
79106 Freiburg
Germany   

Publication History

Received: 25 March 2025

Accepted after revision: 16 June 2025

Article published online:
16 July 2025

© 2025. Thieme. All rights reserved.

Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany

