DOI: 10.1055/a-2594-7085
Evaluating the Diagnostic Accuracy of ChatGPT-4.0 for Classifying Multimodal Musculoskeletal Masses: A Comparative Study with Human Raters
Bewertung der diagnostischen Genauigkeit von ChatGPT-4.0 bei der Klassifikation multimodaler muskuloskelettaler Läsionen: eine vergleichende Studie mit menschlichen Auswertern
Abstract
Purpose
Novel artificial intelligence tools have the potential to significantly enhance productivity in medicine, while also maintaining or even improving treatment quality. In this study, we aimed to evaluate the current capability of ChatGPT-4.0 to accurately interpret multimodal musculoskeletal tumor cases.
Materials and Methods
We created 25 cases, each containing images from X-ray, computed tomography, magnetic resonance imaging, or scintigraphy. ChatGPT-4.0 was tasked with classifying each case using a six-option, two-answer question in which both a primary and a secondary diagnosis were to be given. For performance evaluation, human raters also assessed the same cases.
Results
When only the primary diagnosis was taken into account, the accuracy of human raters was greater than that of ChatGPT-4.0 by a factor of nearly 2 (87% vs. 44%). However, in a setting that also considered secondary diagnoses, the performance gap shrank substantially (accuracy: 94% vs. 71%). A power analysis based on Cohen’s w confirmed the adequacy of the sample size (n = 25).
Conclusion and Key Points
The tested artificial intelligence tool demonstrated lower performance than human raters. Considering factors such as speed, constant availability, and potential future improvements, it appears plausible that artificial intelligence tools could serve as valuable assistance systems for doctors in future clinical settings.
Key Points
- ChatGPT-4.0 classifies musculoskeletal cases using multimodal imaging inputs.
- Human raters outperform AI in primary diagnosis accuracy by a factor of nearly two.
- Including secondary diagnoses improves AI performance and narrows the gap.
- AI demonstrates potential as an assistive tool in future radiological workflows.
- Power analysis confirms robustness of study findings with the current sample size.
Citation Format
- Bosbach WA, Schoeni L, Beisbart C et al. Evaluating the Diagnostic Accuracy of ChatGPT-4.0 for Classifying Multimodal Musculoskeletal Masses: A Comparative Study with Human Raters. Rofo 2025; DOI 10.1055/a-2594-7085
Zusammenfassung
Ziel
Neue künstliche Intelligenz (KI)-Werkzeuge haben das Potenzial, die Produktivität in der Medizin erheblich zu steigern und gleichzeitig die Behandlungsqualität aufrechtzuerhalten oder sogar zu verbessern. In dieser Studie wollten wir die aktuelle Fähigkeit von ChatGPT-4.0 zur präzisen Interpretation multimodaler muskuloskelettaler Tumorfälle evaluieren.
Materialien und Methoden
Wir erstellten 25 Fälle, die jeweils Bilder aus Röntgenaufnahmen, Computertomografie, Magnetresonanztomografie oder Szintigrafie enthielten. ChatGPT-4.0 wurde mit der Klassifikation jedes Falls anhand einer sechsoptionalen, zweiauswahlbasierten Frage beauftragt, wobei sowohl eine primäre als auch eine sekundäre Diagnose erlaubt waren. Zur Leistungsbewertung analysierten menschliche Beurteiler dieselben Fälle.
Ergebnisse
Wurde nur die primäre Diagnose berücksichtigt, war die Genauigkeit der menschlichen Beurteiler fast doppelt so hoch wie die von ChatGPT-4.0 (87% vs. 44%). In einem Szenario, das auch sekundäre Diagnosen berücksichtigte, verringerte sich die Leistungslücke jedoch deutlich (Genauigkeit: 94% vs. 71%). Eine Power-Analyse basierend auf Cohens w bestätigte die Angemessenheit der Stichprobengröße (n = 25).
Schlussfolgerung und Kernaussagen
Das getestete KI-Werkzeug zeigte eine geringere Leistung als menschliche Beurteiler. Angesichts von Faktoren wie Geschwindigkeit, ständiger Verfügbarkeit und potenziellen zukünftigen Verbesserungen erscheint es jedoch plausibel, dass KI-Werkzeuge in zukünftigen klinischen Umgebungen als wertvolle Assistenzsysteme für Ärzte dienen könnten.
Kernaussagen
- ChatGPT-4.0 klassifiziert muskuloskelettale Fälle anhand multimodaler Bildgebungsdaten.
- Menschliche Beurteiler übertreffen die KI bei der primären Diagnosestellung mit nahezu doppelter Genauigkeit.
- Die Berücksichtigung sekundärer Diagnosen verbessert die KI-Leistung und verringert die Leistungsdifferenz.
- KI zeigt Potenzial als unterstützendes Werkzeug in zukünftigen radiologischen Arbeitsabläufen.
- Eine Power-Analyse bestätigt die Aussagekraft der Studienergebnisse bei gegebener Stichprobengröße.
Keywords
Clinical Decision Support - Diagnostic Accuracy - Artificial Intelligence - Musculoskeletal Tumors
Introduction
The demand for clinical radiological imaging services will likely exceed capacity in the future, with negative consequences for the healthcare sector. Unless capacity is increased substantially, the result will be increasing, longer-than-recommended wait times and worse patient outcomes [1]. The application of artificial intelligence (AI) [2] might offer one way to enhance clinical diagnostic capacity while maintaining quality or even improving patient outcomes. Because of this potential future contribution, AI has lately received great attention from industrial stakeholders and research groups. Pattern recognition in imaging data is typically the main focus in this field. The commercial software package Aidoc (Aidoc Medical Ltd, Tel-Aviv, Israel), for example, is designed to assist radiologists in acute care medicine. There is similar research in more or less every radiological subspecialty: cardiovascular [3], pulmonary [4] [5], gynecological [6], musculoskeletal (MSK) [7] [8], and others all see the possibility of AI contributing to the medicine of the future. In addition to pattern recognition, there are numerous other clinical applications where AI could make a contribution. Large language models (LLMs) may automate administration and documentation tasks [9] [10] [11] [12]. However, there are also limitations to the achievable effects of AI. These limitations become clear, for example, when AI is used for the assessment of radiation exposure [13] or for the acceleration of undersampled magnetic resonance imaging (MRI) [14] [15]. There is serious discussion about whether the current attention is overhyping reality.
The current study focuses on how LLMs can contribute to pattern recognition in images and to the identification of correct diagnoses. In earlier studies, we tested the ability of the LLM ChatGPT (OpenAI LLC, San Francisco, CA, USA) to draft radiology reports in MSK imaging and interventional radiology [9] [10]. This particular LLM was originally trained by means of reinforcement learning from human feedback, involving a reward model and a proximal policy optimization algorithm [16] [17]. The latest available version, ChatGPT-4.0, possesses a new feature allowing it to process not only text but also image files [18]. The present study evaluates ChatGPT-4.0’s ability to correctly classify multimodal MSK tumor imaging cases. It includes 25 defined MSK cases with imaging material obtained through X-ray, computed tomography (CT), MRI, or scintigraphy (scint; please see supplement S 1). ChatGPT-4.0 was shown these image sets case by case, with a prompt asking, in a closed question with 6 possible answers, for a primary (1st) and a secondary (2nd) diagnosis. To assess the AI performance level, human raters were presented with the same 25 cases with identical possible answers. A Python-based evaluation was run to determine accuracy, interrater agreement, and significance/power testing.
Method and Materials
The present study defined 25 multimodal imaging cases of typical MSK pathologies. Human raters and ChatGPT-4.0 were asked for 1st and 2nd diagnoses. The LLM ChatGPT-4.0 was also used to assist in writing up this manuscript and in debugging the Python code [18].
Image data
This study included a total of 25 cases: osteosarcoma (n = 5), abscess/osteomyelitis (n = 5), heterotopic ossification (n = 5), myxoid liposarcoma (n = 3), lipoma (n = 2), and hemangioma (n = 5). Authors from multiple health centers originally submitted these cases, e.g., to the National Institutes of Health (NIH) server: https://medpix.nlm.nih.gov/. Data are stored there for open access. The original imaging data (jpg) with the data source, true diagnosis, and (where available) also the source of the true diagnosis are provided in S 1. The image files merged into a single stack per case are provided in S 2 for easier reproducibility of this study. Each case contains between 2 and 4 images. All cases contain images from at least 2 imaging modalities (X-ray, CT, MRI, scint); exceptions are case 14 with 2 X-ray images and some cases, e.g., 17 and 18, with MRI images only. All cases represent textbook examples with a definitive diagnosis, reflecting standard diagnostic scenarios commonly encountered in MSK radiology.
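The per-case stacks in S 2 can be reproduced, for example, by concatenating the individual jpg files of a case side by side. The following is a minimal sketch using the Pillow library; the file names, layout, and scaling are illustrative assumptions and not necessarily the exact procedure used to create S 2.

from pathlib import Path
from PIL import Image

def merge_case_images(image_paths, out_path):
    """Concatenate the images of one case horizontally into a single jpg stack."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    height = max(img.height for img in images)
    # Rescale all images to a common height, preserving aspect ratio
    resized = [img.resize((int(img.width * height / img.height), height)) for img in images]
    total_width = sum(img.width for img in resized)
    stack = Image.new("RGB", (total_width, height), "white")
    x = 0
    for img in resized:
        stack.paste(img, (x, 0))
        x += img.width
    stack.save(out_path, quality=95)

# Illustrative usage for a hypothetical case folder
case_files = sorted(Path("case_17").glob("*.jpg"))
merge_case_images(case_files, "case_17_stack.jpg")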
Surveying of ChatGPT-4.0 and human raters
ChatGPT-4.0 [18] was shown the image data case by case together with the following prompt (please see also [Fig. 1]) asking for the 1st and 2nd diagnosis:
For the shown image, please give your most likely primary diagnosis and alternative secondary diagnosis from the 6 options below. Please only give the respective diagnosis numbers:
- [1] Osteosarcoma
- [2] Abscess/osteomyelitis
- [3] Heterotopic ossification
- [4] Myxoid liposarcomas
- [5] Lipoma
- [6] Hemangioma


In this way, the study implemented a six-option, two-answer question. ChatGPT-4.0 was tested across n = 10 iterations to minimize the impact of statistical variability in the LLMʼs responses.
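In this study, the prompt and image stacks were submitted through the ChatGPT web interface ([Fig. 1]). For illustration only, an equivalent query could also be issued programmatically; the sketch below uses the OpenAI Python client with a base64-encoded image and assumes API access to a GPT-4-class multimodal model. The model identifier and message structure are assumptions and do not describe the interface actually used in this study.

import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "For the shown image, please give your most likely primary diagnosis and "
    "alternative secondary diagnosis from the 6 options below. Please only give "
    "the respective diagnosis numbers: [1] Osteosarcoma [2] Abscess/osteomyelitis "
    "[3] Heterotopic ossification [4] Myxoid liposarcomas [5] Lipoma [6] Hemangioma"
)

def classify_case(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed multimodal model identifier
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# One of the n = 10 iterations for a single (hypothetical) case stack
print(classify_case("case_17_stack.jpg"))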
Human raters were presented with the same imaging cases with the same six-option, two-answer questions. Answers from human raters were collected through Google Forms (S 3, Google LLC, Mountain View, CA, USA) with randomization of the order of questions. In total, n = 10 human raters participated. All human raters had previous work experience in radiology, 8 being MSK specialists and 9 having completed their residency training. Work experience averaged 21.0 years since finishing medical school and 16.6 years since completing board exams. Raw data collected from the LLM and from the human raters are available in S 4.
Python code evaluations
The collected data (S 4) were analyzed using Python code (S 5). For the study’s assessments, two sets of answers were considered. The first set of answers comprised only the 1st diagnosis. The second set of answers (called below: 1st & 2nd) also included the 2nd diagnosis; the 2nd diagnosis replaced the 1st diagnosis value whenever the 1st diagnosis was incorrect ([Fig. 2]). The Python code calculated diagnostic performance variables, interrater agreement, and significance/test power analysis ([Table 1]).
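The construction of the 1st & 2nd answer set can be illustrated with a minimal Python sketch; the column names below (truth, first, second) are illustrative and may differ from those used in S 4/S 5.

import pandas as pd

# Toy example: three cases, one rater
df = pd.DataFrame({
    "truth":  [1, 4, 6],   # true diagnosis number
    "first":  [1, 5, 2],   # 1st diagnosis given by the rater/LLM
    "second": [3, 4, 6],   # 2nd diagnosis given by the rater/LLM
})

# 1st & 2nd answer set: keep the 1st diagnosis, but replace it with the
# 2nd diagnosis whenever the 1st diagnosis is incorrect
df["first_and_second"] = df["first"].where(df["first"] == df["truth"], df["second"])

print(df)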


For the analysis of diagnostic performance, accuracy and weighted precision were calculated. Weighted precision and accuracy take values in the interval [0, 1]. Unlike accuracy, however, weighted precision factors in data imbalance, as found in the dataset of the present study [19].
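Both metrics are available in scikit-learn [19]; a minimal, self-contained sketch with illustrative toy data:

from sklearn.metrics import accuracy_score, precision_score

# Toy data: true diagnosis numbers and predicted answers (e.g., the merged 1st & 2nd set)
y_true = [1, 4, 6, 2, 5, 3]
y_pred = [1, 4, 6, 2, 2, 1]

acc = accuracy_score(y_true, y_pred)
# average="weighted" weights per-class precision by class support and thereby
# accounts for the imbalance of the six diagnosis categories
prec_w = precision_score(y_true, y_pred, average="weighted", zero_division=0)
print(f"accuracy: {acc:.2f}, weighted precision: {prec_w:.2f}")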
Interrater agreement allows assessment of the coherence between the answers from different raters or from different LLM iterations. This study uses Fleiss’ kappa and Gwet’s AC1, computed with the Python package for interrater reliability Chance-corrected Agreement Coefficients (irrCAC) [20]. Gwet’s AC1 is designed for imbalanced datasets [21]. The irrCAC package includes a routine for simultaneously extracting the respective p-value. In this way, the obtained interrater agreement coefficients can be tested against the null hypothesis (H0) that there is no agreement beyond what would be expected purely by chance. Fleiss’ kappa and Gwet’s AC1 both take values in the interval [–1, 1]. For the interpretation of numerical agreement, Landis & Koch provided an interpretation table, ranging from poor agreement (< 0.00) to almost perfect agreement (0.81–1.00) ([Table 2]) [22].
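A minimal sketch of how both coefficients and their p-values can be obtained with the irrCAC package [20]; the ratings layout (cases in rows, raters in columns) follows the package documentation, while the toy data and the keys of the returned dictionaries are assumptions to be checked against the installed version.

import pandas as pd
from irrCAC.raw import CAC

# Ratings table: one row per case, one column per rater (or per LLM iteration),
# cells contain the chosen diagnosis number 1-6
ratings = pd.DataFrame({
    "rater_1": [1, 4, 6, 2, 5],
    "rater_2": [1, 4, 6, 2, 5],
    "rater_3": [1, 5, 6, 2, 5],
})

cac = CAC(ratings)
fleiss = cac.fleiss()   # chance-corrected agreement, Fleiss' kappa
gwet = cac.gwet()       # Gwet's AC1, more robust to category imbalance [21]

# Assumed result structure: coefficient estimate and p-value under H0 of chance agreement
print(fleiss["est"]["coefficient_value"], fleiss["est"]["p_value"])
print(gwet["est"]["coefficient_value"], gwet["est"]["p_value"])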
For a performance comparison between the LLM and human raters, confusion matrices (CMs) are shown ([Fig. 3]). The CMs plot the true vs. predicted diagnosis. Ideally, the data concentrate on the main diagonal. Results were normalized per true diagnosis, i.e., matrix rows [23]. This visualization allows for a detailed assessment of misclassification patterns and potential biases in predictions. By comparing CM structures, differences in diagnostic tendencies between the LLM and human raters can be identified. A Chi-square goodness-of-fit test [24] was used to test whether (H0) the observed frequencies were generated by randomly sampling from a categorical distribution with probabilities proportional to the expected frequencies.
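Both steps can be sketched with scikit-learn [23] and SciPy [24]; the data below are illustrative toy values, and the study’s actual implementation is provided in S 5.

import numpy as np
from scipy.stats import chisquare
from sklearn.metrics import confusion_matrix

labels = [1, 2, 3, 4, 5, 6]

# Toy data: true diagnoses and AI predictions over a set of case readings
y_true = np.array([1, 1, 2, 3, 4, 5, 6, 6, 2, 3])
y_pred = np.array([1, 2, 2, 3, 5, 5, 6, 2, 2, 3])

# Row-normalized confusion matrix: rows = true diagnosis, columns = predicted diagnosis
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
print(cm.round(2))

# Chi-square goodness-of-fit: are the AI answer frequencies compatible with the
# answer frequencies of the human raters (H0)? The human answers here are illustrative.
human_answers = np.array([1, 1, 2, 3, 4, 5, 6, 6, 2, 3])
ai_counts = np.bincount(y_pred, minlength=7)[1:]        # observed counts for categories 1-6
human_counts = np.bincount(human_answers, minlength=7)[1:]
expected = human_counts / human_counts.sum() * ai_counts.sum()   # scaled to the same total
stat, p_value = chisquare(f_obs=ai_counts, f_exp=expected)
print(f"chi2 = {stat:.2f}, p = {p_value:.3f}")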


In the final section of the present study, model stability with regard to sample size and test power was analyzed. For that purpose, the spread of results for 1st-diagnosis accuracy is plotted over sample size ([Fig. 4]). Effect size was determined by Cohen’s w for Chi-square testing (Python implementation [25], Ch. 7 in Cohen’s original work [26]). The interpretation of Cohen’s w follows the original frame of reference: small 0.10, medium 0.30, large 0.50 (p. 277 in [26]). Power analysis is performed accordingly with the Chi-square goodness-of-fit test [27]. Type I error (rejecting a true H0) testing in this study was performed with alpha = 0.05. Type II error (failing to reject a false H0) testing was performed with beta = 0.2, requiring power = 1 – beta = 0.8.
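Both the effect size and the resulting test power can be computed with statsmodels [25] [27]; a minimal sketch with illustrative probability vectors (the actual answer distributions are those collected in S 4):

import numpy as np
from statsmodels.stats.gof import chisquare_effectsize
from statsmodels.stats.power import GofChisquarePower

# Illustrative answer distributions over the six diagnosis categories
p_h0 = np.array([0.20, 0.20, 0.20, 0.12, 0.08, 0.20])   # human raters (H0)
p_ai = np.array([0.10, 0.30, 0.20, 0.05, 0.15, 0.20])   # AI answers

# Cohen's w for a Chi-square goodness-of-fit comparison
w = chisquare_effectsize(p_h0, p_ai)

# Test power for n observations, alpha = 0.05, six categories (5 degrees of freedom)
power = GofChisquarePower().solve_power(effect_size=w, nobs=250, alpha=0.05, n_bins=6)
print(f"Cohen's w = {w:.2f}, power = {power:.2f}")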


Results and discussion
The first result of the study is that the current version, ChatGPT-4.0, is able to load and process typical MSK imaging data. [Fig. 1] shows the LLM working on case 17. All iterations run for the 25 cases concluded with a clear output from the LLM containing the requested 1st and 2nd diagnoses.
When limited to the 1st diagnosis, the accuracy of the human raters was substantially higher than that of the LLM (87% vs. 44%, differing by a factor of almost 2, [Table 1]). The inclusion of the 2nd diagnosis increases the accuracy of human raters and of AI by definition. The accuracy of AI’s 1st & 2nd diagnoses increased to 71%, still below the accuracy of human raters for 1st & 2nd diagnoses (94%), but substantially narrowing the performance gap. Weighted precision, which factors in the data imbalance, exhibited the same pattern for 1st diagnosis vs. 1st & 2nd diagnoses for both humans and AI, with values marginally greater than the accuracy values.
Fleiss’ kappa indicated substantial human interrater agreement for the 1st diagnosis and almost perfect human interrater agreement for the 1st & 2nd diagnoses. AI interrater agreement was moderate, measured by Fleiss’ kappa. Gwet’s AC1 consistently led to marginally greater values due to the data imbalance, with AI’s 1st & 2nd diagnoses improving to substantial agreement. The p-values of all calculated interrater agreement coefficients were far below alpha, with the human raters’ 1st & 2nd diagnoses even returning a p-value below numerical precision. Consequently, H0 can be rejected; none of the agreement levels was obtained by chance.
These patterns are also observable visually and qualitatively in the CM of [Fig. 3]. Results from the human raters converge better towards the main diagonal. AI results substantially improve when 2nd diagnoses are included. Visually, the performance difference between human raters and AI nearly disappears for 1st & 2nd diagnoses. Distinguishing myxoid liposarcoma from lipoma and hemangioma from abscess/osteomyelitis proved to be a particular challenge for AI when limited to the 1st diagnosis. The CM clearly highlighted these diagnostic challenges, with misclassifications concentrated in specific categories where the AI struggled most.
The Chi-square p-value in [Table 1] was calculated once with the human raters’ 1st diagnosis as H0, and once with the human raters’ 1st & 2nd diagnoses as H0. The human raters’ 1st diagnosis and the human raters’ 1st & 2nd diagnoses both resulted in perfect agreement (i.e., p = 1) when projected onto themselves. AI’s 1st diagnosis and AI’s 1st & 2nd diagnoses are both significantly different from the human raters’ 1st diagnosis and from the human raters’ 1st & 2nd diagnoses.
Cohen’s w for effect quantification was calculated once for the human raters’ 1st diagnosis as H0 and once for the human raters’ 1st & 2nd diagnoses as H0. In comparisons between the answers from the human raters, Cohen’s w was calculated to be < 0.10, i.e., below the threshold for small effects. This holds true regardless of the version of H0. A large effect was obtained for AI’s 1st diagnosis, regardless of H0. A small effect, close to medium, was obtained for AI’s 1st & 2nd diagnoses, regardless of the version of H0.
The Chi-square power analysis showed that the power values exceeded 0.8 for each pair formed between the two versions of H0 and either AI’s 1st diagnosis or AI’s 1st & 2nd diagnoses. The human raters’ 1st diagnosis and the human raters’ 1st & 2nd diagnoses, when projected onto themselves, correctly resulted in power = 0.05 = alpha. The power between the human raters’ 1st diagnosis and the human raters’ 1st & 2nd diagnoses was < 0.8, regardless of the version of H0. With regard to the size of the sample set, the Chi-square power analysis thus confirms the effect seen in this study between each combination of human answers and AI answers. The effects by Cohen’s w for the remaining combinations are too low to be confirmed. The 1st-diagnosis accuracy plotted over sample size in [Fig. 4] converges clearly towards a horizontal constant equal to the overall value calculated in [Table 1]. After ca. 75 of the total 250 data points (10 raters × 25 items), no substantial volatility remains on the result curves. Given the convergence of the graph in [Fig. 4], a larger dataset would not be expected to alter the study’s 1st-diagnosis accuracy.
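For illustration, the convergence behavior shown in [Fig. 4] corresponds to a running accuracy over an increasing number of rated items; the following minimal sketch reproduces such a curve with synthetic data (the study’s actual plot is generated by S 5):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)

# Synthetic stream of 250 correct/incorrect ratings with an underlying accuracy of 0.87
correct = rng.random(250) < 0.87

# Running accuracy after each additional data point
running_accuracy = np.cumsum(correct) / np.arange(1, len(correct) + 1)

plt.plot(running_accuracy)
plt.axhline(correct.mean(), linestyle="--")   # overall value, cf. [Table 1]
plt.xlabel("number of data points")
plt.ylabel("1st diagnosis accuracy")
plt.show()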
Conclusions and future work
The aim of this study was to assess the ability of ChatGPT-4.0 to correctly classify multimodal MSK tumor cases. For this purpose, 25 multimodal cases were defined, each representing a typical MSK textbook diagnosis. For the evaluation of the performance level, the cases were diagnosed by AI and by (n = 10) human raters. The accuracy of the LLM was generally lower than that of the human raters. The performance gap shrank substantially when a 2nd diagnosis was included in the set of answers. A power analysis confirmed the observable difference between human answers and AI answers. The interrater agreement was slightly greater for the answers of the human raters than for the LLM answers. This can also be seen in the plotted CM of [Fig. 3], where the answers converge on the diagonal under 1st & 2nd diagnoses for both the human raters and AI. Given the speed and the constant availability of AI software throughout the day, it appears plausible that systems such as the one tested here will assist human physicians in clinical settings in the future.
Today, administration and documentation consume many doctor hours during hospital operations. However, increased productivity appears necessary to meet the future demand for clinical radiology services [1]. Provided that there is no loss in treatment quality, it can be considered in patients’ direct best interest to minimize the time spent by healthcare staff on administrative duties and documentation. LLMs in combination with automated pattern recognition promise to provide technology to achieve that aim and, if successful in clinical application, to fulfill the prophecies of the early AI thinkers from the 1950s [2]. One requirement for that scenario to materialize will be that corporations such as OpenAI or Google continue investing in this technology. Ethical aspects of the application of AI in future medicine, e.g., threats of algorithmic bias and human deskilling, will have to be addressed [28] [29]. Finally, as seen before [14], this technology's functionality and possible limitations will have to be tested.
Open access supplements
S 1: Image data with original source and true diagnosis (.pdf)
S 2: Zip folder, image files merged per case (.jpg)
S 3: Survey for human participants implemented in Google Forms (.pdf)
S 4: Study answers collected from human participants and ChatGPT (.xlsx)
S 5: Python code for data evaluation (.py)
The supplementary material is available under the DOI: 10.6084/m9.figshare.28560842.v1
Conflict of Interest
The authors declare that they have no conflict of interest.
Acknowledgement
JFS acknowledges support by the Dioscuri program initiated by the Max Planck Society, jointly managed with the National Science Centre (Poland), and mutually funded by the Polish Ministry of Science and Higher Education and the German Federal Ministry of Education and Research. The authors are grateful for all useful discussions leading to this manuscript.
References
- 1 Sutherland G, Russell N, Gibbard R. et al. The Value of Radiology, Part II – The Conference Board of Canada. Ottawa, CAN; 2019.
- 2 McCarthy J, Minsky ML, Rochester N. et al. A Proposal For The Dartmouth Summer Research Project On Artificial Intelligence [Internet]. 1955 [cited 2021 Oct 30]. p. 1–13. http://jmc.stanford.edu/articles/dartmouth/dartmouth.pdf
- 3 Kagiyama N, Shrestha S, Farjo PD. et al. Artificial Intelligence: Practical Primer for Clinical Research in Cardiovascular Disease. J Am Heart Assoc 2019; 8: 1-12
- 4 Peters AA, Wiescholek N, Müller M. et al. Impact of artificial intelligence assistance on pulmonary nodule detection and localization in chest CT: a comparative study among radiologists of varying experience levels. Sci Rep 2024; 14 (01) 22447
- 5 Peters AA, Munz J, Klaus JB. et al. Impact of Simulated Reduced-Dose Chest CT on Diagnosing Pulmonary T1 Tumors and Patient Management. Diagnostics 2024; 14 (15)
- 6 Borkowski K, Rossi C, Ciritsis A. et al. Fully automatic classification of breast MRI background parenchymal enhancement using a transfer learning approach. Medicine (Baltimore) 2020; 99 (29) e21243
- 7 Ramedani S, Ramedani M, Von Tengg-Kobligk H. et al. A Deep Learning-based Fully Automated Approach for Body Composition Analysis in 3D Whole Body Dixon MRI. In: 2023 IEEE 19th International Conference on Intelligent Computer Communication and Processing (ICCP). 2023: 287-292
- 8 Urban G, Porhemmat S, Stark M. et al. Classifying shoulder implants in X-ray images using deep learning. Comput Struct Biotechnol J 2020; 18: 967-972
- 9 Bosbach WA, Senge JF, Nemeth B. et al. Ability of ChatGPT to generate competent radiology reports for distal radius fracture by use of RSNA template items and integrated AO classifier. Curr Probl Diagn Radiol 2023; 53 (01) 102-110
- 10 Bosbach WA, Senge JF, Nemeth B. et al. Online supplement to manuscript: “Ability of ChatGPT to generate competent radiology reports for distal radius fracture by use of RSNA template items and integrated AO classifier.” Current problems in diagnostic radiology (2023). zenodo. 2023
- 11 Senge JF, Mc Murray MT, Haupt F. et al. ChatGPT may free time needed by the interventional radiologist for administration/documentation: A study on the RSNA PICC line reporting template. Swiss J Radiol Nucl Med 2024; 7 (02) 1-14
- 12 Senge JF, Mc Murray MT, Haupt F. et al. Online supplement to manuscript: “ChatGPT may free time needed by the interventional radiologist for administration/documentation: A study on the RSNA PICC line reporting template.” zenodo. 2023
- 13 Garni SN, Mertineit N, Nöldge G. et al. Regulatory Needs for Radiation Protection Devices based upon Artificial Intelligence – State task or leave unregulated?. Swiss J Radiol Nucl Med 2024; 5 (01) 5
- 14 Bosbach WA, Merdes KC, Jung B. et al. Deep learning reconstruction of accelerated MRI: False positive cartilage delamination inserted in MRI arthrography under traction. Top Magn Reson Imaging 2024; 33: e0313
- 15 Bosbach WA, Merdes KC, Jung B. et al. Open access supplement to the publication: Bosbach, W. A., et al. (2024). Deep learning reconstruction of accelerated MRI: False positive cartilage delamination inserted in MRI arthrography under traction. Topics in Magnetic Resonance Imaging. [accepted]. [Internet]. figshare. 2024
- 16 Glowacka D, Howes A, Jokinen JP. et al. RL4HCI: Reinforcement Learning for Humans, Computers, and Interaction. Ext Abstr 2021 CHI Conf Hum Factors Comput Syst 2021; 1-3
- 17 Schulman J, Wolski F, Dhariwal P. et al. Proximal Policy Optimization Algorithms. arXiv 2017; 1707
- 18 OpenAI LLC, editor. ChatGPT-4.0 [Internet]. 2024 [cited 2024 Aug 29]. Available from: chat.openai.com.
- 19 sklearn.metrics.precision_score [Internet]. scikit-learn 1.5.1 documentation. 2024 [cited 2024 Sep 1]. Available from: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html
- 20 Gwet K, Fergadis A. irrCAC – Chance-corrected Agreement Coefficients [Internet]. 2023 [cited 2025 Sep 3]. Available from: irrcac.readthedocs.io/en/latest/usage/usage_raw_data.html
- 21 Wongpakaran N, Wongpakaran T, Wedding D. et al. A comparison of Cohen’s Kappa and Gwet’s AC1 when calculating inter-rater reliability coefficients: A study conducted with personality disorder samples. BMC Med Res Methodol 2013; 13 (01) 1-7
- 22 Landis JR, Koch GG. The Measurement of Observer Agreement for Categorical Data. Biometrics 1977; 33 (01) 159-174
- 23 sklearn.metrics.confusion_matrix [Internet]. scikit-learn 1.5.1 documentation. 2024 [cited 2024 Sep 1]. Available from: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
- 24 scipy.stats.chisquare [Internet]. SciPy. [cited 2025 Mar 10]. Available from: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html
- 25 statsmodels.stats.gof.chisquare_effectsize [Internet]. statsmodels 0.15.0 (+617). 2025 [cited 2025 Mar 9]. Available from: https://www.statsmodels.org/dev/generated/statsmodels.stats.gof.chisquare_effectsize.html#statsmodels.stats.gof.chisquare_effectsize
- 26 Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. New York, NY, USA; 1988.
- 27 statsmodels.stats.power.GofChisquarePower [Internet]. statsmodels 0.15.0 (+581). 2025 [cited 2025 Jan 2]. Available from: https://www.statsmodels.org/dev/generated/statsmodels.stats.power.GofChisquarePower.html
- 28 Goisauf M, Cano Abadía M. Ethics of AI in Radiology: A Review of Ethical and Societal Implications. Front Big Data 2022; 5: 1-13
- 29 Sparrow R, Hatherley J. The Promise and Perils of AI in Medicine. Int J Chinese Comp Philos Med 2019; 17 (02) 79-109
Publication History
Received: 09 January 2025
Accepted after revision: 18 April 2025
Article published online: 03 June 2025
© 2025. Thieme. All rights reserved.
Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany