ChatGPT: Chasing the Storm in Radiology Training and Education
We carefully reviewed the editorial penned by Sodhi et al. addressing the potential
application of Chat Generative Pre-Trained Transformer (ChatGPT) in training and educating
radiology residents, coupled with a prudent acknowledgment of its limitations and
pitfalls.[1] In response, we aim to contribute some perspectives on large language models (LLMs)
in radiology, drawing on our experience of using and evaluating various open-source
and proprietary LLMs across a range of radiology tasks.
In their editorial, the authors exclusively leveraged ChatGPT, a widely recognized
and freely available LLM. It is noteworthy that while the free version of ChatGPT
utilizes the GPT-3.5 model, the paid versions, ChatGPT Plus and Team, leverage the
GPT-4 model, which boasts three key features: enhanced creativity, compatibility with
visual input, and the ability to process longer contexts. GPT-4 offers unparalleled
capabilities in tasks demanding advanced reasoning and complex instruction comprehension[2] and can also retrieve information from the Internet in real time using
retrieval-augmented generation.[3]
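To make the retrieval-augmented generation (RAG) pattern concrete, the following is a minimal sketch under stated assumptions: a toy in-memory corpus, naive word-overlap retrieval, and a hypothetical llm_generate placeholder standing in for any LLM API. Production systems would instead retrieve from web search or a vector database.

```python
# Minimal retrieval-augmented generation (RAG) sketch.
# Assumptions: a toy in-memory corpus and a placeholder llm_generate();
# real systems retrieve from the web or a vector database.

# Toy corpus of reference snippets (hypothetical content).
CORPUS = [
    "Right upper lobe consolidation on chest X-ray may indicate lobar pneumonia.",
    "GPT-4 accepts visual input and longer contexts than GPT-3.5.",
    "Hallucination refers to fluent but fabricated model output, e.g. fake references.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank corpus snippets by naive word overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(CORPUS, key=lambda doc: -len(q_words & set(doc.lower().split())))
    return ranked[:k]

def llm_generate(prompt: str) -> str:
    """Placeholder for a real LLM API call (hypothetical)."""
    return f"[model response grounded in a prompt of {len(prompt)} characters]"

def rag_answer(question: str) -> str:
    """Augment the question with retrieved context before calling the model."""
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm_generate(prompt)

print(rag_answer("What does right upper lobe consolidation suggest?"))
```

Grounding the prompt in retrieved text is what distinguishes RAG from relying solely on the model's parametric memory.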
Drawing from our personal experience, Microsoft Bing Chat, now rebranded as Copilot,
emerges as a viable alternative that allows use of the advanced GPT-4 model free of
charge. Additionally, we have explored other free LLMs such as Google Bard (rebranded
as Gemini), Perplexity, and Claude.[4] [5] [6]
It is crucial to highlight that, while Gemini offers image analysis capabilities,
it falls short in medical image interpretation; the response it generates when prompted
to interpret a medical image is shown in [Fig. 1]. Moreover, recent reports of Gemini
producing outputs biased toward certain sections of society have raised concerns about
the introduction of bias when LLMs are used in sensitive domains like healthcare.
Fig. 1 Screenshot of the response of Google Bard (Gemini) when asked to interpret an
X-ray image of the femur (response date 25-02-2024).
Addressing another limitation outlined by the authors, we concur with their observation
that ChatGPT responses often contain artificially generated references that prove
nonexistent despite exhaustive searches, a phenomenon termed "hallucination."[7] While
extensive research is ongoing into ways of mitigating hallucinations, current methods
remain far from perfect.
Since the publication of the aforementioned article, there have been significant advancements
in artificial intelligence (AI) capabilities for image interpretation through vision-language
models, custom GPTs built on prompts tailored for radiology, and more advanced methods
of prompt engineering, offering potential assistance to radiologists, especially those
in training.[8]
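As an illustration of what such radiology-tailored prompting can look like, the snippet below composes a hypothetical system prompt for a resident-teaching scenario; the wording and structure are our own assumptions, not a published template.

```python
# Hypothetical radiology-tailored system prompt, illustrating the kind of
# prompt engineering referred to above; the wording is illustrative only.
SYSTEM_PROMPT = (
    "You are a radiology teaching assistant for residents. "
    "Structure every answer as: (1) findings, (2) differential diagnosis, "
    "(3) suggested next imaging step. Do not cite references you cannot "
    "verify; if unsure, say so explicitly."
)

def build_messages(case_description: str) -> list[dict]:
    """Assemble a chat-style message list accepted by most LLM APIs."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": case_description},
    ]

messages = build_messages(
    "CT chest: spiculated 2 cm nodule in the right upper lobe."
)
```

Constraining the output format and explicitly discouraging unverifiable citations are simple prompt-engineering measures aimed at the hallucination problem discussed above.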
Most openly available LLMs currently lack the capability to generate realistic medical
images such as X-rays or sections of computed tomography scans. A simple prompt from
a radiologist asking to show a right upper lobe consolidation on a chest X-ray does
not yield results because of guardrails set by openly available LLMs ([Fig. 2]).
Guardrails are safety controls that oversee user interaction with an LLM application,
acting as rule-based systems between users and foundational models to ensure adherence
to organizational principles.
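A minimal sketch of such a rule-based guardrail is shown below; the blocked patterns and refusal message are our own illustrative assumptions, and production guardrails are considerably more sophisticated.

```python
import re

# Illustrative rule-based guardrail sitting between the user and the model.
# The blocked patterns and refusal text are assumptions for this sketch.
BLOCKED_PATTERNS = [
    r"generate .*(x-ray|radiograph|ct|mri)",   # synthetic medical image requests
    r"create .*medical image",
]

def guardrail(user_prompt: str) -> str | None:
    """Return a refusal message if the prompt violates a rule, else None."""
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, user_prompt, re.IGNORECASE):
            return "I cannot generate realistic medical images."
    return None  # prompt passes; forward it to the foundational model

refusal = guardrail("Generate a chest X-ray showing right upper lobe consolidation")
print(refusal)  # -> "I cannot generate realistic medical images."
```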
This is where traditional search engines remain useful, as they can access and index
a vast array of medical images and information, offering immediate, although not always
contextually interpreted, results. However, the scenario
is changing with the introduction of foundational models capable of visual question
answering (VQA). VQA requires understanding natural language questions in conjunction
with medical images for accurate and reliable responses. Unlike traditional search
engines, these specialized VQA systems employ advanced retrieval techniques or generative
capabilities to produce images directly relevant to the radiologists' queries. While
general-purpose language models like GPT-4, accessible through interfaces like Copilot
or ChatGPT, are evolving, radiology-specific VQA advancements are increasingly capable
of responding to nuanced queries with precisely relevant images.[9] [10]
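As a toy illustration of the VQA pattern, the sketch below uses the general-purpose visual-question-answering pipeline from the Hugging Face transformers library; the model shown is the general-domain default rather than a radiology-specific system, and the image path is hypothetical.

```python
# Toy visual question answering (VQA) example using the Hugging Face
# transformers pipeline. The model is general-domain, not radiology-specific,
# and "chest_xray.png" is a hypothetical local file.
from transformers import pipeline

vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

result = vqa(image="chest_xray.png",
             question="Is there a consolidation in the right upper lobe?")
print(result)  # list of candidate answers with confidence scores
```

A radiology-specific VQA system would follow the same question-plus-image interface but be trained on medical imaging data, which is what distinguishes the specialized systems cited above from general-purpose pipelines like this one.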
Fig. 2 Screenshot of the response of Chat Generative Pre-Trained Transformer 4 (ChatGPT-4)
when asked to provide an image of pneumonia on a chest X-ray (response date 01-04-2024).
In conclusion, our response aims to reflect on the original article's insights, underscore
the evolving role of generative AI tools like ChatGPT in radiology, and emphasize
the importance of continuous education while utilizing LLMs. We highlight the need
to improve LLMs by mitigating bias and hallucination, which is pivotal for overcoming
their current limitations and for their active utilization in radiology.