Rofo
DOI: 10.1055/a-2594-7085
Quality/Quality Assurance

Evaluating the Diagnostic Accuracy of ChatGPT-4.0 for Classifying Multimodal Musculoskeletal Masses: A Comparative Study with Human Raters

Bewertung der diagnostischen Genauigkeit von ChatGPT-4.0 bei der Klassifikation multimodaler muskuloskelettaler Läsionen: eine vergleichende Studie mit menschlichen Auswertern
Wolfram A. Bosbach
1   Department of Nuclear Medicine, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland
2   Department of Diagnostic, Interventional and Paediatric Radiology (DIPR), Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland

Luca Schoeni
1   Department of Nuclear Medicine, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland
2   Department of Diagnostic, Interventional and Paediatric Radiology (DIPR), Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland

Claus Beisbart
3   Institute of Philosophy, University of Bern, Bern, Switzerland
4   Center for Artificial Intelligence in Medicine, University of Bern, Bern, Switzerland

Jan F. Senge
5   Department of Mathematics and Computer Science, University of Bremen, Bremen, Germany
6   Dioscuri Centre in Topological Data Analysis, Mathematical Institute PAN, Warsaw, Poland

Milena Mitrakovic
2   Department of Diagnostic, Interventional and Paediatric Radiology (DIPR), Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland

Suzanne E. Anderson
2   Department of Diagnostic, Interventional and Paediatric Radiology (DIPR), Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland
7   Sydney School of Medicine, University of Notre Dame Australia, Darlinghurst, Sydney, Australia

Ngwe R. Achangwa
1   Department of Nuclear Medicine, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland

8   University of Zagreb School of Medicine, Department of Diagnostic and Interventional Radiology, University Hospital “Dubrava”, Zagreb, Croatia

8   University of Zagreb School of Medicine, Department of Diagnostic and Interventional Radiology, University Hospital “Dubrava”, Zagreb, Croatia

9   Department of Diagnostic and Interventional Radiology, University Hospital Augsburg, Augsburg, Germany

10   Institute of Diagnostic and Interventional Radiology, Pediatric Radiology and Neuroradiology, University Medical Center Rostock, Rostock, Germany

Martin H. Maurer
11   Department of Diagnostic and Interventional Radiology, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany

Hatice Tuba Sanal
12   Radiology Department, University of Health Sciences, Gülhane Training and Research Hospital, Ankara, Turkey
13   Department of Anatomy, Ankara University Institute of Health Sciences, Ankara, Türkiye

2   Department of Diagnostic, Interventional and Paediatric Radiology (DIPR), Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland

Abstract

Purpose

Novel artificial intelligence tools have the potential to significantly enhance productivity in medicine, while also maintaining or even improving treatment quality. In this study, we aimed to evaluate the current capability of ChatGPT-4.0 to accurately interpret multimodal musculoskeletal tumor cases.

Materials and Methods

We created 25 cases, each containing images from X-ray, computed tomography, magnetic resonance imaging, or scintigraphy. ChatGPT-4.0 was tasked with classifying each case using a six-option, two-choice question, where both a primary and a secondary diagnosis were allowed. For performance evaluation, human raters also assessed the same cases.

Results

When only the primary diagnosis was taken into account, the accuracy of human raters was greater than that of ChatGPT-4.0 by a factor of nearly 2 (87% vs. 44%). However, in a setting that also considered secondary diagnoses, the performance gap shrank substantially (accuracy: 94% vs. 71%). A power analysis based on Cohen’s w confirmed the adequacy of the sample size (n = 25).

Conclusion and Key Points

The tested artificial intelligence tool demonstrated lower performance than human raters. Considering factors such as speed, constant availability, and potential future improvements, it appears plausible that artificial intelligence tools could serve as valuable assistance systems for doctors in future clinical settings.

Key Points

  • ChatGPT-4.0 classifies musculoskeletal cases using multimodal imaging inputs.

  • Human raters outperform AI in primary diagnosis accuracy by a factor of nearly two.

  • Including secondary diagnoses improves AI performance and narrows the gap.

  • AI demonstrates potential as an assistive tool in future radiological workflows.

  • Power analysis confirms robustness of study findings with the current sample size.

Citation Format

  • Bosbach WA, Schoeni L, Beisbart C et al. Evaluating the Diagnostic Accuracy of ChatGPT-4.0 for Classifying Multimodal Musculoskeletal Masses: A Comparative Study with Human Raters. Rofo 2025; DOI 10.1055/a-2594-7085



Zusammenfassung

Ziel

Neue künstliche Intelligenz (KI)-Werkzeuge haben das Potenzial, die Produktivität in der Medizin erheblich zu steigern und gleichzeitig die Behandlungsqualität aufrechtzuerhalten oder sogar zu verbessern. In dieser Studie wollten wir die aktuelle Fähigkeit von ChatGPT-4.0 zur präzisen Interpretation multimodaler muskuloskelettaler Tumorfälle evaluieren.

Materialien und Methoden

Wir erstellten 25 Fälle, die jeweils Bilder aus Röntgenaufnahmen, Computertomografie, Magnetresonanztomografie oder Szintigrafie enthielten. ChatGPT-4.0 wurde mit der Klassifikation jedes Falls anhand einer sechsoptionalen, zweiauswahlbasierten Frage beauftragt, wobei sowohl eine primäre als auch eine sekundäre Diagnose erlaubt waren. Zur Leistungsbewertung analysierten menschliche Beurteiler dieselben Fälle.

Ergebnisse

Wurde nur die primäre Diagnose berücksichtigt, war die Genauigkeit der menschlichen Beurteiler fast doppelt so hoch wie die von ChatGPT-4.0 (87% vs. 44%). In einem Szenario, das auch sekundäre Diagnosen berücksichtigte, verringerte sich die Leistungslücke jedoch deutlich (Genauigkeit: 94% vs. 71%). Eine Power-Analyse basierend auf Cohens w bestätigte die Angemessenheit der Stichprobengröße (n = 25).

Schlussfolgerung und Kernaussagen

Das getestete KI-Werkzeug zeigte eine geringere Leistung als menschliche Beurteiler. Angesichts von Faktoren wie Geschwindigkeit, ständiger Verfügbarkeit und potenziellen zukünftigen Verbesserungen erscheint es jedoch plausibel, dass KI-Werkzeuge in zukünftigen klinischen Umgebungen als wertvolle Assistenzsysteme für Ärzte dienen könnten.

Kernaussagen

  • ChatGPT-4.0 klassifiziert muskuloskelettale Fälle anhand multimodaler Bildgebungsdaten.

  • Menschliche Beurteiler übertreffen die KI bei der primären Diagnosestellung mit nahezu doppelter Genauigkeit.

  • Die Berücksichtigung sekundärer Diagnosen verbessert die KI-Leistung und verringert die Leistungsdifferenz.

  • KI zeigt Potenzial als unterstützendes Werkzeug in zukünftigen radiologischen Arbeitsabläufen.

  • Eine Power-Analyse bestätigt die Aussagekraft der Studienergebnisse bei gegebener Stichprobengröße.



Introduction

The demand for clinical radiological imaging services will likely exceed capacity in the future, with a negative impact on the healthcare sector. Unless capacity is increased substantially, the result will be increasing, longer-than-recommended wait times and poorer patient outcomes [1]. The application of artificial intelligence (AI) [2] might offer one way to enhance clinical diagnostic capacity while maintaining quality or even improving patient outcomes. Because of this potential future contribution, AI has lately received great attention from industrial stakeholders and research groups. Pattern recognition in imaging data is typically the main focus in this field. The commercial software package Aidoc (Aidoc Medical Ltd, Tel-Aviv, Israel), for example, is designed to assist radiologists in acute care medicine, and there is similar research in more or less every radiological subspecialty: cardiovascular [3], pulmonary [4] [5], gynecological [6], musculoskeletal (MSK) [7] [8], and other imaging fields all see the possibility of AI contributing to the medicine of the future. In addition to pattern recognition, there are numerous other clinical applications where AI could make a contribution; large language models (LLM) may automate administration and documentation tasks [9] [10] [11] [12]. However, there are also limitations to the achievable effects of AI. These limitations become clear, for example, when AI is used for the assessment of radiation exposure [13] or for the acceleration of undersampled magnetic resonance imaging (MRI) [14] [15]. There is serious discussion about whether the current attention is overhyping reality.

The current study focuses on how LLMs can contribute to pattern recognition in images and to the identification of correct diagnoses. In earlier studies, we tested the ability of the LLM ChatGPT (OpenAI LLC, San Francisco, CA, USA) to draft radiology reports in MSK imaging and interventional radiology [9] [10]. This particular LLM was originally trained by means of reinforcement learning from human feedback, involving a reward model and a proximal policy optimization algorithm [16] [17]. The latest available version, ChatGPT-4.0, possesses a new feature allowing the processing not only of text but also of image files [18]. The present study evaluates ChatGPT-4.0’s ability to correctly classify multimodal MSK tumor imaging cases. It includes 25 defined MSK cases with imaging material obtained through X-ray, computed tomography (CT), MRI, or scintigraphy (scint; please see supplement S 1). ChatGPT-4.0 was shown these image sets case by case, with a prompt that asked, as a closed question with 6 possible answers, for a primary (1st) and a secondary (2nd) diagnosis. To assess the AI performance level, human raters were presented with the same 25 cases with identical possible answers. A Python-based evaluation was run to determine accuracy, interrater agreement, and significance/test power.



Method and Materials

The present study defined 25 multimodal imaging cases of typical MSK pathologies. Human raters and ChatGPT-4.0 were asked for 1st and 2nd diagnoses. The LLM ChatGPT-4.0 was also used for writing up this manuscript and for debugging the Python code [18].

Image data

This study included a total of 25 cases comprising osteosarcoma (n = 5), abscess/osteomyelitis (n = 5), heterotopic ossification (n = 5), myxoid liposarcoma (n = 3), lipoma (n = 2), and hemangioma (n = 5). Authors from multiple health centers originally submitted these cases, e.g., to the National Institutes of Health (NIH) server (https://medpix.nlm.nih.gov/), where the data are stored for open access. The original imaging data (jpg) with the data source, true diagnosis, and (where available) the source of the true diagnosis are provided in S 1. The image files merged into a single stack per case are provided in S 2 for easier reproducibility of this study. Each case contains between 2 and 4 images. All cases contain images from at least 2 imaging modalities (X-ray, CT, MRI, scint); exceptions are case 14 with 2 X-ray images and some cases, e.g., 17 and 18, with MRI images only. All cases represent textbook examples with a definitive diagnosis, reflecting standard diagnostic scenarios commonly encountered in MSK radiology.



Surveying of ChatGPT-4.0 and human raters

ChatGPT-4.0 [18] was shown the image data case by case together with the following prompt (please see also [Fig. 1]) asking for the 1st and 2nd diagnosis:

For the shown image, please give your most likely primary diagnosis and alternative secondary diagnosis from the 6 options below. Please only give the respective diagnosis numbers:

  • [1] Osteosarcoma

  • [2] Abscess/osteomyelitis

  • [3] Heterotopic ossification

  • [4] Myxoid liposarcomas

  • [5] Lipoma

  • [6] Hemangioma

Fig. 1 Prompt entered with image stack of case 17 in ChatGPT-4.0, accessed August 27, 2024.

In this way, the study implemented a six-option, two-answer question. ChatGPT-4.0 was tested across n = 10 iterations to minimize the impact of statistical variability in the LLMʼs responses.
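For reference, the querying could in principle also be scripted rather than entered through the ChatGPT web interface used in this study ([Fig. 1]). The following minimal Python sketch illustrates one way to submit a case’s image stack together with the study prompt via the OpenAI client; the model identifier ("gpt-4o") and the file name ("case_17.jpg") are illustrative assumptions and not part of the study workflow.

# Minimal sketch (not the study's actual workflow, which used the ChatGPT web
# interface): sending one case's merged image stack with the study prompt via
# the OpenAI Python client. Model name "gpt-4o" and file name "case_17.jpg"
# are assumptions for illustration only.
import base64
from openai import OpenAI

PROMPT = (
    "For the shown image, please give your most likely primary diagnosis and "
    "alternative secondary diagnosis from the 6 options below. Please only give "
    "the respective diagnosis numbers: [1] Osteosarcoma [2] Abscess/osteomyelitis "
    "[3] Heterotopic ossification [4] Myxoid liposarcomas [5] Lipoma [6] Hemangioma"
)

client = OpenAI()  # expects OPENAI_API_KEY in the environment

with open("case_17.jpg", "rb") as f:  # merged image stack of one case (cf. S 2)
    image_b64 = base64.b64encode(f.read()).decode()

answers = []
for _ in range(10):  # n = 10 iterations per case, as in the study
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    answers.append(response.choices[0].message.content)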

Human raters were presented with the same imaging cases with the same six-option, two-answer questions. Answers from human raters were collected through Google Forms (S 3, Google LLC, Mountain View, CA, USA) with randomization of the order of questions. In total, n = 10 human raters participated. All human raters had previous work experience in radiology, 8 being MSK specialists and 9 having completed their residency training. Work experience averaged 21.0 years since finishing medical school and 16.6 years since completing board exams. Raw data collected from the LLM and from the human raters are available in S 4.



Python code evaluations

The collected data (S 4) were analyzed using Python code (S 5). For the study’s assessments, two sets of answers were considered. The first set of answers contained only the 1st diagnosis. The second set of answers (referred to below as 1st & 2nd) also included the 2nd diagnosis: the 2nd diagnosis replaced the 1st diagnosis whenever the 1st diagnosis was incorrect ([Fig. 2]). The Python code calculated diagnostic performance variables, interrater agreement, and significance/test power analysis ([Table 1]).

Fig. 2 1st primary/2nd secondary diagnosis, obtained for each case from both AI and human raters.
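As a minimal illustration of the substitution rule described above (the 2nd diagnosis replaces the 1st whenever the 1st is incorrect), the following Python sketch shows one possible implementation; the column names ("truth", "dx1", "dx2") are assumptions for illustration and not necessarily those used in S 4/S 5.

# Sketch of the "1st & 2nd" answer set: the 2nd diagnosis replaces the 1st
# whenever the 1st is incorrect. Column names are illustrative assumptions.
import numpy as np
import pandas as pd

def combine_first_second(df: pd.DataFrame) -> pd.Series:
    """Return the answer used for the '1st & 2nd' evaluation."""
    return pd.Series(
        np.where(df["dx1"] == df["truth"], df["dx1"], df["dx2"]),
        index=df.index,
    )

# Example: one rater's answers for three cases (diagnosis codes 1-6)
df = pd.DataFrame({"truth": [1, 4, 6], "dx1": [1, 5, 2], "dx2": [3, 4, 6]})
print(combine_first_second(df).tolist())  # [1, 4, 6] -> cases 2 and 3 rescued by dx2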

Table 1 Diagnostic performance, interrater agreement, and test power analysis.

                                          | Human raters’ | Human raters’       | AI’s          | AI’s
                                          | 1st diagnosis | 1st & 2nd diagnoses | 1st diagnosis | 1st & 2nd diagnoses
Diagnostic performance
  Samples                                 | 25            | 25                  | 25            | 25
  Raters/LLM revisions                    | 10            | 10                  | 10            | 10
  Accuracy                                | 0.868         | 0.936               | 0.444         | 0.712
  Weighted precision                      | 0.871         | 0.939               | 0.476         | 0.737
Interrater agreement (p-value < alpha = 0.05 highlighted)
  Fleiss’ kappa                           | 0.714         | 0.852               | 0.505         | 0.598
  Fleiss’ kappa p-value                   | 2.64e-14      | 0.00e+00            | 5.62e-08      | 1.25e-09
  Gwet’s AC1                              | 0.718         | 0.854               | 0.553         | 0.607
  Gwet’s AC1 p-value                      | 2.09e-14      | 0.00e+00            | 6.63e-10      | 1.18e-09
Chi-square p-value (p-value < alpha = 0.05 highlighted)
  H0: Human raters’ 1st diagnosis         | 1             | 0.91                | 0             | 0.001
  H0: Human raters’ 1st & 2nd diagnoses   | 0.923         | 1                   | 0             | 0.007
Cohen’s w
  H0: Human raters’ 1st diagnosis         | 0             | 0.078               | 0.659         | 0.294
  H0: Human raters’ 1st & 2nd diagnoses   | 0.075         | 0                   | 0.623         | 0.253
Chi-square power (power > 1 – beta = 0.8 highlighted)
  H0: Human raters’ 1st diagnosis         | 0.05          | 0.128               | 1             | 0.966
  H0: Human raters’ 1st & 2nd diagnoses   | 0.121         | 0.05                | 1             | 0.89

For the analysis of diagnostic performance, accuracy and weighted precision were calculated. Weighted precision and accuracy take values in the interval [0, 1]. Unlike accuracy, however, weighted precision factors in data imbalance, as found in the dataset of the present study [19].
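As an illustration, accuracy and weighted precision can be computed as follows with scikit-learn; this is a sketch with invented answer vectors and is not necessarily identical to the code in S 5.

# Sketch of the diagnostic-performance metrics: accuracy and weighted
# precision via scikit-learn (illustrative data only).
from sklearn.metrics import accuracy_score, precision_score

# y_true / y_pred: flattened diagnosis codes (1-6) over all raters and cases
y_true = [1, 1, 4, 6, 2, 5]   # illustrative values only
y_pred = [1, 3, 4, 6, 2, 4]

accuracy = accuracy_score(y_true, y_pred)
# average="weighted" weights each class by its support, which accounts for the
# class imbalance of the dataset (e.g., 5 osteosarcoma vs. 2 lipoma cases)
weighted_precision = precision_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"accuracy = {accuracy:.3f}, weighted precision = {weighted_precision:.3f}")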

Interrater agreement allows assessment of the coherence between the answers from different raters or from different LLM iterations. This study uses Fleiss’ kappa and Gwet’s AC1, computed with the Python package for chance-corrected agreement coefficients (irrCAC) [20]. Gwet’s AC1 is designed for imbalanced datasets [21]. The irrCAC package includes a routine for simultaneously extracting the respective p-value. In this way, the obtained interrater agreement coefficients can be tested against the null hypothesis (H0) that there is no agreement beyond what would be expected purely by chance. Fleiss’ kappa and Gwet’s AC1 both take values in the interval [–1, 1]. For the interpretation of the numerical agreement, Landis & Koch provided an interpretation table, ranging from poor agreement (< 0.00) to almost perfect agreement (0.81–1.00) ([Table 2]) [22].

Table 2 Interpretation of strength of agreement for kappa statistics used in the present study [22].

Kappa statistic       | < 0.00 | 0.00–0.20 | 0.21–0.40 | 0.41–0.60 | 0.61–0.80   | 0.81–1.00
Strength of agreement | Poor   | Slight    | Fair      | Moderate  | Substantial | Almost perfect
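For illustration, the agreement coefficients described above can be approximated as follows. The study itself used the irrCAC package [20]; the sketch below instead relies on statsmodels for Fleiss’ kappa and on a manual implementation of the standard multi-rater formula for Gwet’s AC1 (formula assumed from the literature, ratings invented for illustration).

# Sketch of the agreement coefficients: Fleiss' kappa via statsmodels and a
# manual Gwet's AC1 (standard multi-rater formula assumed). Not the study's
# irrCAC routine; ratings are illustrative only.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# ratings: subjects x raters matrix of diagnosis codes 1-6 (illustrative values)
ratings = np.array([
    [1, 1, 1, 3],
    [4, 4, 5, 4],
    [6, 6, 6, 6],
    [2, 2, 2, 2],
])

counts, _ = aggregate_raters(ratings)        # subjects x categories count table
kappa = fleiss_kappa(counts, method="fleiss")

def gwet_ac1(counts: np.ndarray) -> float:
    n_subjects, q = counts.shape
    r = counts.sum(axis=1)                     # raters per subject
    pa = np.mean((counts * (counts - 1)).sum(axis=1) / (r * (r - 1)))
    pi_k = (counts / r[:, None]).mean(axis=0)  # mean classification probabilities
    pe = (pi_k * (1 - pi_k)).sum() / (q - 1)   # Gwet's chance agreement
    return float((pa - pe) / (1 - pe))

print(f"Fleiss' kappa = {kappa:.3f}, Gwet's AC1 = {gwet_ac1(counts):.3f}")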

For a performance comparison between the LLM and human raters, confusion matrices (CMs) are shown ([Fig. 3]). The CMs plot the true vs. predicted diagnosis. Ideally, the data concentrate on the main diagonal. Results were normalized per true diagnosis, i.e., matrix rows [23]. This visualization allows for a detailed assessment of misclassification patterns and potential biases in predictions. By comparing CM structures, differences in diagnostic tendencies between the LLM and human raters can be identified. A Chi-square goodness-of-fit test [24] was used to test whether (H0) the observed frequencies were generated by randomly sampling from a categorical distribution with probabilities proportional to the expected frequencies.

Fig. 3 Confusion matrices (true diagnoses with predicted diagnoses) for AI and human raters.
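A simplified sketch of this evaluation step is given below: the confusion matrix is normalized per true diagnosis with scikit-learn, and a Chi-square goodness-of-fit test compares observed with expected category frequencies via SciPy. The reading that the comparison runs over the distribution of predicted diagnosis categories, as well as the answer vectors themselves, are assumptions for illustration.

# Sketch of the confusion matrix normalization and the Chi-square
# goodness-of-fit test (a simplified stand-in for S 5; illustrative data only).
import numpy as np
from scipy.stats import chisquare
from sklearn.metrics import confusion_matrix

labels = [1, 2, 3, 4, 5, 6]             # diagnosis codes as in the prompt
y_true = [1, 1, 2, 4, 6, 5, 3, 6]       # illustrative values only
y_pred = [1, 3, 2, 5, 6, 5, 3, 2]

# Rows (true diagnoses) normalized to 1, as in Fig. 3
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")

# Goodness of fit: do the predicted-category frequencies of one answer set
# (observed) match those expected from the reference answer set (H0)?
observed = np.bincount(y_pred, minlength=7)[1:]
expected_probs = np.bincount(y_true, minlength=7)[1:] / len(y_true)
expected = expected_probs * len(y_pred)
chi2_stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(cm.round(2), f"chi2 = {chi2_stat:.2f}, p = {p_value:.3f}", sep="\n")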

In the final section of the present study, model stability with regard to sample size and test power was analyzed. For that purpose, the spread of the 1st-diagnosis accuracy is plotted over sample size ([Fig. 4]). Effect size was determined by Cohen’s w for Chi-square testing (Python implementation [25], Ch. 7 in Cohen’s original work [26]). The interpretation of Cohen’s w follows the original frame of reference: small 0.10, medium 0.30, large 0.50 (p. 277 in [26]). Power analysis is performed accordingly by Chi-square goodness-of-fit test [27]. Type I error (rejecting a true H0) testing in this study was performed with regard to alpha = 0.05. Type II error (failing to reject a false H0) testing was performed with beta = 0.2, requiring power = 1 – beta = 0.8.

Fig. 4 Convergence of accuracy of 1st diagnosis plotted over sample set (raters × items), sample set shuffled 5 times, 95% confidence interval shaded.
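The effect-size and power computation can be sketched as follows: Cohen’s w is implemented directly from its definition, and the achieved power is obtained from the statsmodels goodness-of-fit power routine. The category proportions and the sample size of 250 observations are illustrative assumptions, not the exact inputs of S 5.

# Sketch of the effect-size and power computation: a generic Cohen's w [26]
# and the statsmodels goodness-of-fit power routine (illustrative inputs).
import numpy as np
from statsmodels.stats.power import GofChisquarePower

def cohens_w(p_observed: np.ndarray, p_expected: np.ndarray) -> float:
    """Cohen's w = sqrt(sum((p1 - p0)^2 / p0)), Ch. 7 in [26]."""
    return float(np.sqrt(np.sum((p_observed - p_expected) ** 2 / p_expected)))

# Illustrative category proportions over the 6 diagnoses
p_expected = np.array([0.20, 0.20, 0.20, 0.12, 0.08, 0.20])  # e.g., human raters (H0)
p_observed = np.array([0.30, 0.10, 0.20, 0.10, 0.10, 0.20])  # e.g., AI answers

w = cohens_w(p_observed, p_expected)
# Achieved power of a Chi-square GOF test over 6 categories at alpha = 0.05
power = GofChisquarePower().power(effect_size=w, nobs=250, alpha=0.05, n_bins=6)
print(f"w = {w:.3f}, power = {power:.3f}")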


Results and discussion

The first result of the study is that the current version, ChatGPT-4.0, is able to load and process typical MSK imaging data. [Fig. 1] shows the LLM working on case 17. All iterations run for the 25 cases concluded with a clear output from the LLM containing the requested 1st and 2nd diagnoses.

When limited to the 1st diagnosis, the accuracy of the human raters was substantially higher than that of the LLM (87% vs. 44%, differing by a factor of almost 2, [Table 1]). The inclusion of the 2nd diagnosis increases the accuracy of human raters and of AI by definition. The accuracy of AI’s 1st & 2nd diagnoses increased to 71%, still below the accuracy of the human raters’ 1st & 2nd diagnoses (94%), but a substantial narrowing of the performance gap. Weighted precision, which factors in the data imbalance, exhibited the same pattern for the 1st diagnosis vs. the 1st & 2nd diagnoses for both humans and AI, with values marginally greater than the accuracy values.

Fleiss’ kappa indicated substantial human interrater agreement for the 1st diagnosis and almost perfect human interrater agreement for the 1st & 2nd diagnoses. AI interrater agreement was moderate as measured by Fleiss’ kappa. Gwet’s AC1 consistently yielded marginally greater values due to the data imbalance, with AI’s 1st & 2nd diagnoses improving to substantial agreement. The p-values of all calculated interrater agreement coefficients were substantially below alpha, with the human raters’ 1st & 2nd diagnoses even returning a p-value below numerical precision. H0 can therefore be rejected; none of the agreement levels was obtained by chance.

These patterns are also observable visually and qualitatively in the CM of [Fig. 3]. Results from the human raters converge better towards the main diagonal. AI results substantially improve when 2nd diagnoses are included. Visually, the performance difference between human raters and AI nearly disappears for 1st & 2nd diagnoses. Distinguishing myxoid liposarcoma from lipoma and hemangioma from abscess/osteomyelitis proved to be a particular challenge for AI when limited to the 1st diagnosis. The CM clearly highlighted these diagnostic challenges, with misclassifications concentrated in specific categories where the AI struggled most.

The Chi-square p-value in [Table 1] was calculated once with the human raters’ 1st diagnosis as H0 and once with the human raters’ 1st & 2nd diagnoses as H0. The human raters’ 1st diagnosis and the human raters’ 1st & 2nd diagnoses both resulted in perfect agreement (i.e., p = 1) when projected onto themselves. AI’s 1st diagnosis and AI’s 1st & 2nd diagnoses are both significantly different from the human raters’ 1st diagnosis and from the human raters’ 1st & 2nd diagnoses.

Cohen’s w for effect quantification was calculated once for the human raters’ 1st diagnosis as H0 and once for the human raters’ 1st & 2nd diagnoses as H0. In comparisons between the answers from the human raters, Cohen’s w was calculated to be < 0.10, i.e., below the threshold for small effects. This holds true regardless of the version of H0. A large effect was obtained for AI’s 1st diagnosis, regardless of H0. A small effect, close to medium, was obtained for AI’s 1st & 2nd diagnoses, regardless of the version of H0.

The Chi-square power analysis showed that the power values exceeded 0.8 for each pair formed between the two versions of H0 and either AI’s 1st diagnosis or AI’s 1st & 2nd diagnoses. The human raters’ 1st diagnosis and the human raters’ 1st & 2nd diagnoses, when projected onto themselves, resulted correctly in power = 0.05 = alpha. The power between the human raters’ 1st diagnosis and the human raters’ 1st & 2nd diagnoses was < 0.8, regardless of the version of H0. With regard to the size of the sample set, the Chi-square power analysis thus confirms the effect seen in this study between each combination of human answers and AI answers; the effects by Cohen’s w for the remaining combinations are too small to be confirmed. The 1st-diagnosis accuracy plotted over sample size in [Fig. 4] converges clearly towards a horizontal constant equal to the overall value calculated in [Table 1]. After ca. 75 of the total 250 data points (10 raters × 25 items), no substantial volatility remains on the result curves. Given the convergence of the graph in [Fig. 4], a larger dataset would not be expected to alter the 1st-diagnosis accuracy of the study.
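A convergence curve of the kind shown in [Fig. 4] can be sketched as follows; the flattening of answers into (rater × item) data points, the five shufflings, and the percentile-based 95% band are assumptions about the plotting procedure, and the correctness indicators are randomly generated for illustration.

# Sketch of the convergence analysis in Fig. 4: running accuracy over a
# growing, shuffled sample of (rater, item) correctness indicators.
# All inputs are randomly generated placeholders for illustration.
import numpy as np

rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=250).astype(float)  # stand-in for 10 raters x 25 items

curves = []
for _ in range(5):                                     # 5 random orderings, as in Fig. 4
    shuffled = rng.permutation(correct)
    running_acc = np.cumsum(shuffled) / np.arange(1, len(shuffled) + 1)
    curves.append(running_acc)
curves = np.vstack(curves)

mean_acc = curves.mean(axis=0)
low, high = np.percentile(curves, [2.5, 97.5], axis=0)  # 95% band across shuffles
print(f"accuracy after 75 samples: {mean_acc[74]:.3f} "
      f"(band {low[74]:.3f}-{high[74]:.3f})")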



Conclusions and future work

The aim of this study was to assess the ability of ChatGPT-4.0 to correctly classify multimodal MSK tumor cases. For this purpose, 25 multimodal cases were defined, each representing a typical MSK textbook diagnosis. For the evaluation of the performance level, the cases were diagnosed by AI and by n = 10 human raters. The accuracy of the LLM was generally lower than that of the human raters. The performance gap shrank substantially when a 2nd diagnosis was included in the set of answers. A power analysis confirmed the observable difference between human answers and AI answers. The interrater agreement was slightly greater for the answers of the human raters than for the LLM answers. This can also be seen in the plotted CMs of [Fig. 3], where the answers converge on the diagonal under 1st & 2nd diagnoses for both the human raters and AI. Given the speed and the constant availability of AI software throughout the day, it appears plausible that systems such as the one tested here will assist human physicians in clinical settings in the future.

Today, administration and documentation consume many doctor hours during hospital operations. However, increased productivity appears necessary to meet the future demand for clinical radiology services [1]. Provided that there is no loss in treatment quality, it can be considered in patients’ direct best interest to minimize the time spent by healthcare staff on administrative duties and documentation. LLMs in combination with automated pattern recognition promise to provide the technology to achieve that aim and, if successful in clinical application, to fulfill the prophecies of the early AI thinkers of the 1950s [2]. One requirement for that scenario to materialize will be that the corporations already mentioned, such as OpenAI or Google, continue investing in this technology. Ethical aspects of the application of AI in future medicine, e.g., threats of algorithmic bias and human deskilling, will have to be addressed [28] [29]. Finally, as seen before [14], the functionality and possible limitations of this technology will have to be tested.



Open access supplements

S 1: Image data with original source and true diagnosis (.pdf)

S 2: Zip folder, image files merged per case (.jpg)

S 3: Survey for human participants implemented in Google Forms (.pdf)

S 4: Study answers collected from human participants and ChatGPT (.xlsx)

S 5: Python code for data evaluation (.py)

The supplementary material is available under the DOI: 10.6084/m9.figshare.28560842.v1



Conflict of Interest

The authors declare that they have no conflict of interest.

Acknowledgement

JFS acknowledges support by the Dioscuri program initiated by the Max Planck Society, jointly managed with the National Science Centre (Poland), and mutually funded by the Polish Ministry of Science and Higher Education and the German Federal Ministry of Education and Research. The authors wish to express their thanks for all useful discussions leading to this manuscript.


Correspondence

PD Dr. Dr. med. Wolfram A. Bosbach
Department of Nuclear Medicine, Inselspital, Bern University Hospital, University of Bern
Bern
Switzerland   

Publication History

Received: 09 January 2025

Accepted after revision: 18 April 2025

Article published online:
03 June 2025

© 2025. Thieme. All rights reserved.

Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany

