Rofo
DOI: 10.1055/a-2594-7085
Quality/Quality Assurance

Evaluating the Diagnostic Accuracy of ChatGPT-4.0 for Classifying Multimodal Musculoskeletal Masses: A Comparative Study with Human Raters

Bewertung der diagnostischen Genauigkeit von ChatGPT-4.0 bei der Klassifikation multimodaler muskuloskelettaler Läsionen: eine vergleichende Studie mit menschlichen Auswertern
Wolfram A. Bosbach
1   Department of Nuclear Medicine, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland
2   Department of Diagnostic, Interventional and Paediatric Radiology (DIPR), Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland

Luca Schoeni
1   Department of Nuclear Medicine, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland
2   Department of Diagnostic, Interventional and Paediatric Radiology (DIPR), Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland

Claus Beisbart
3   Institute of Philosophy, University of Bern, Bern, Switzerland
4   Center for Artificial Intelligence in Medicine, University of Bern, Bern, Switzerland

Jan F. Senge
5   Department of Mathematics and Computer Science, University of Bremen, Bremen, Germany
6   Dioscuri Centre in Topological Data Analysis, Mathematical Institute PAN, Warsaw, Poland

Milena Mitrakovic
2   Department of Diagnostic, Interventional and Paediatric Radiology (DIPR), Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland

Suzanne E. Anderson
2   Department of Diagnostic, Interventional and Paediatric Radiology (DIPR), Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland
7   Sydney School of Medicine, University of Notre Dame Australia, Darlinghurst, Sydney, Australia

Ngwe R. Achangwa
1   Department of Nuclear Medicine, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland

8   University of Zagreb School of Medicine, Department of Diagnostic and Interventional Radiology, University Hospital “Dubrava”, Zagreb, Croatia

8   University of Zagreb School of Medicine, Department of Diagnostic and Interventional Radiology, University Hospital “Dubrava”, Zagreb, Croatia

9   Department of Diagnostic and Interventional Radiology, University Hospital Augsburg, Augsburg, Germany

10   Institute of Diagnostic and Interventional Radiology, Pediatric Radiology and Neuroradiology, University Medical Center Rostock, Rostock, Germany

Martin H. Maurer
11   Department of Diagnostic and Interventional Radiology, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany

Hatice Tuba Sanal
12   Radiology Department, University of Health Sciences, Gülhane Training and Research Hospital, Ankara, Turkey
13   Department of Anatomy, Ankara University Institute of Health Sciences, Ankara, Türkiye

2   Department of Diagnostic, Interventional and Paediatric Radiology (DIPR), Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland

Abstract

Purpose

Novel artificial intelligence tools have the potential to significantly enhance productivity in medicine, while also maintaining or even improving treatment quality. In this study, we aimed to evaluate the current capability of ChatGPT-4.0 to accurately interpret multimodal musculoskeletal tumor cases.

Materials and Methods

We created 25 cases, each containing images from X-ray, computed tomography, magnetic resonance imaging, or scintigraphy. ChatGPT-4.0 was tasked with classifying each case using a six-option, two-choice question, where both a primary and a secondary diagnosis were allowed. For performance evaluation, human raters also assessed the same cases.

Results

When only the primary diagnosis was taken into account, the accuracy of human raters was greater than that of ChatGPT-4.0 by a factor of nearly 2 (87% vs. 44%). However, in a setting that also considered secondary diagnoses, the performance gap shrank substantially (accuracy: 94% vs. 71%). A power analysis based on Cohen’s w confirmed the adequacy of the sample size (n = 25).

Conclusion and Key Points

The tested artificial intelligence tool demonstrated lower performance than human raters. Considering factors such as speed, constant availability, and potential future improvements, it appears plausible that artificial intelligence tools could serve as valuable assistance systems for doctors in future clinical settings.

Key Points

  • ChatGPT-4.0 classifies musculoskeletal cases using multimodal imaging inputs.

  • Human raters outperform AI in primary diagnosis accuracy by a factor of nearly two.

  • Including secondary diagnoses improves AI performance and narrows the gap.

  • AI demonstrates potential as an assistive tool in future radiological workflows.

  • Power analysis confirms robustness of study findings with the current sample size.

Citation Format

  • Bosbach WA, Schoeni L, Beisbart C et al. Evaluating the Diagnostic Accuracy of ChatGPT-4.0 for Classifying Multimodal Musculoskeletal Masses: A Comparative Study with Human Raters. Rofo 2025; DOI 10.1055/a-2594-7085



Zusammenfassung

Ziel

Neue künstliche Intelligenz (KI)-Werkzeuge haben das Potenzial, die Produktivität in der Medizin erheblich zu steigern und gleichzeitig die Behandlungsqualität aufrechtzuerhalten oder sogar zu verbessern. In dieser Studie wollten wir die aktuelle Fähigkeit von ChatGPT-4.0 zur präzisen Interpretation multimodaler muskuloskelettaler Tumorfälle evaluieren.

Materialien und Methoden

Wir erstellten 25 Fälle, die jeweils Bilder aus Röntgenaufnahmen, Computertomografie, Magnetresonanztomografie oder Szintigrafie enthielten. ChatGPT-4.0 wurde mit der Klassifikation jedes Falls anhand einer sechsoptionalen, zweiauswahlbasierten Frage beauftragt, wobei sowohl eine primäre als auch eine sekundäre Diagnose erlaubt waren. Zur Leistungsbewertung analysierten menschliche Beurteiler dieselben Fälle.

Ergebnisse

Wurde nur die primäre Diagnose berücksichtigt, war die Genauigkeit der menschlichen Beurteiler fast doppelt so hoch wie die von ChatGPT-4.0 (87% vs. 44%). In einem Szenario, das auch sekundäre Diagnosen berücksichtigte, verringerte sich die Leistungslücke jedoch deutlich (Genauigkeit: 94% vs. 71%). Eine Power-Analyse basierend auf Cohens w bestätigte die Angemessenheit der Stichprobengröße (n = 25).

Schlussfolgerung und Kernaussagen

Das getestete KI-Werkzeug zeigte eine geringere Leistung als menschliche Beurteiler. Angesichts von Faktoren wie Geschwindigkeit, ständiger Verfügbarkeit und potenziellen zukünftigen Verbesserungen erscheint es jedoch plausibel, dass KI-Werkzeuge in zukünftigen klinischen Umgebungen als wertvolle Assistenzsysteme für Ärzte dienen könnten.

Kernaussagen

  • ChatGPT-4.0 klassifiziert muskuloskelettale Fälle anhand multimodaler Bildgebungsdaten.

  • Menschliche Beurteiler übertreffen die KI bei der primären Diagnosestellung mit nahezu doppelter Genauigkeit.

  • Die Berücksichtigung sekundärer Diagnosen verbessert die KI-Leistung und verringert die Leistungsdifferenz.

  • KI zeigt Potenzial als unterstützendes Werkzeug in zukünftigen radiologischen Arbeitsabläufen.

  • Eine Power-Analyse bestätigt die Aussagekraft der Studienergebnisse bei gegebener Stichprobengröße.



Introduction

The demand for clinical radiological imaging services will likely exceed capacity in the future, with a negative impact on the healthcare sector. Unless capacity is increased substantially, the result will be increasing, longer-than-recommended wait times and poorer patient outcomes [1]. The application of artificial intelligence (AI) [2] might offer one way to enhance clinical diagnostic capacity while maintaining quality or even improving patient outcomes. Because of this potential future contribution, AI has lately received great attention from industrial stakeholders and research groups. Pattern recognition in imaging data is typically the main focus in this field. The commercial software package Aidoc (Aidoc Medical Ltd, Tel-Aviv, Israel), for example, is designed to assist radiologists in acute care medicine, and there is similar research in more or less every radiological subspecialty: cardiovascular [3], pulmonary [4] [5], gynecological [6], musculoskeletal (MSK) [7] [8], and other imaging fields all see the possibility of AI contributing to the medicine of the future. In addition to pattern recognition, there are numerous other clinical applications where AI could make a contribution; large language models (LLM) may automate administration and documentation tasks [9] [10] [11] [12]. However, there are also limitations to the achievable effects of AI. These limitations become clear, for example, when AI is used for the assessment of radiation exposure [13] or for the acceleration of undersampled magnetic resonance imaging (MRI) [14] [15]. There is serious discussion about whether the current attention is overhyping reality.

The current study focuses on how LLMs can contribute to pattern recognition in images and to the identification of correct diagnoses. In earlier studies, we tested the ability of the LLM ChatGPT (OpenAI LLC, San Francisco, CA, USA) to draft radiology reports in MSK imaging and interventional radiology [9] [10]. This particular LLM was originally trained by means of reinforcement learning from human feedback, involving a reward model and a proximal policy optimization algorithm [16] [17]. The latest available version, ChatGPT-4.0, possesses a new feature allowing the processing not only of text but also of image files [18]. The present study evaluates ChatGPT-4.0’s ability to correctly classify multimodal MSK tumor imaging cases. It includes 25 defined MSK cases with imaging material obtained through X-ray, computed tomography (CT), MRI, or scintigraphy (scint; please see supplement S 1). ChatGPT-4.0 was shown these image sets case by case, with a prompt that asked, as a closed question with 6 possible answers, for a primary (1st) and a secondary (2nd) diagnosis. To assess the AI performance level, human raters were presented with the same 25 cases with identical possible answers. A Python-based evaluation was run to determine accuracy, interrater agreement, and significance/test power.



Method and Materials

The present study defined 25 multimodal imaging cases of typical MSK pathologies. Human raters and ChatGPT-4.0 were asked for 1st and 2nd diagnoses. The LLM ChatGPT-4.0 was also used for writing up this manuscript and for debugging the Python code [18].

Image data

This study included a total of 25 cases comprising osteosarcoma (n = 5), abscess/osteomyelitis (n = 5), heterotopic ossification (n = 5), myxoid liposarcoma (n = 3), lipoma (n = 2), and hemangioma (n = 5). Authors from multiple health centers originally submitted these cases, e.g., to the National Institutes of Health (NIH) server (https://medpix.nlm.nih.gov/), where the data are stored for open access. The original imaging data (jpg) with the data source, true diagnosis, and (where available) the source of the true diagnosis are provided in S 1. The image files merged into a single stack per case are provided in S 2 for easier reproducibility of this study. Each case contains between 2 and 4 images. All cases contain images from at least 2 imaging modalities (X-ray, CT, MRI, scint); exceptions are case 14 with 2 X-ray images and some cases, e.g., 17 and 18, with MRI images only. All cases represent textbook examples with a definitive diagnosis, reflecting standard diagnostic scenarios commonly encountered in MSK radiology.



Surveying of ChatGPT-4.0 and human raters

ChatGPT-4.0 [18] was shown the image data case by case together with the following prompt (please see also [Fig. 1]) asking for the 1st and 2nd diagnosis:

For the shown image, please give your most likely primary diagnosis and alternative secondary diagnosis from the 6 options below. Please only give the respective diagnosis numbers:

  • [1] Osteosarcoma

  • [2] Abscess/osteomyelitis

  • [3] Heterotopic ossification

  • [4] Myxoid liposarcomas

  • [5] Lipoma

  • [6] Hemangioma

Fig. 1 Prompt entered with image stack of case 17 in ChatGPT-4.0, accessed August 27, 2024.

In this way, the study implemented a six-option, two-answer question. ChatGPT-4.0 was tested across n = 10 iterations to minimize the impact of statistical variability in the LLMʼs responses.
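For reference, the querying could in principle also be scripted rather than entered through the ChatGPT web interface used in this study ([Fig. 1]). The following minimal Python sketch illustrates one way to submit a case’s image stack together with the study prompt via the OpenAI client; the model identifier ("gpt-4o") and the file name ("case_17.jpg") are illustrative assumptions and not part of the study workflow.

# Minimal sketch (not the study's actual workflow, which used the ChatGPT web
# interface): sending one case's merged image stack with the study prompt via
# the OpenAI Python client. Model name "gpt-4o" and file name "case_17.jpg"
# are assumptions for illustration only.
import base64
from openai import OpenAI

PROMPT = (
    "For the shown image, please give your most likely primary diagnosis and "
    "alternative secondary diagnosis from the 6 options below. Please only give "
    "the respective diagnosis numbers: [1] Osteosarcoma [2] Abscess/osteomyelitis "
    "[3] Heterotopic ossification [4] Myxoid liposarcomas [5] Lipoma [6] Hemangioma"
)

client = OpenAI()  # expects OPENAI_API_KEY in the environment

with open("case_17.jpg", "rb") as f:  # merged image stack of one case (cf. S 2)
    image_b64 = base64.b64encode(f.read()).decode()

answers = []
for _ in range(10):  # n = 10 iterations per case, as in the study
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    answers.append(response.choices[0].message.content)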

Human raters were presented with the same imaging cases with the same six-option, two-answer questions. Answers from human raters were collected through Google Forms (S 3, Google LLC, Mountain View, CA, USA) with randomization of the order of questions. In total, n = 10 human raters participated. All human raters had previous work experience in radiology, 8 being MSK specialists and 9 having completed their residency training. Work experience averaged 21.0 years since finishing medical school and 16.6 years since completing board exams. Raw data collected from the LLM and from the human raters are available in S 4.



Python code evaluations

The collected data (S 4) were analyzed using Python code (S 5). For the study’s assessments, two sets of answers were considered. The first set of answers contained only the 1st diagnosis. The second set of answers (referred to below as 1st & 2nd) also included the 2nd diagnosis: the 2nd diagnosis replaced the 1st diagnosis whenever the 1st diagnosis was incorrect ([Fig. 2]). The Python code calculated diagnostic performance variables, interrater agreement, and significance/test power analysis ([Table 1]).

Fig. 2 1st primary/2nd secondary diagnosis, obtained for each case from both AI and human raters.
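As a minimal illustration of the substitution rule described above (the 2nd diagnosis replaces the 1st whenever the 1st is incorrect), the following Python sketch shows one possible implementation; the column names ("truth", "dx1", "dx2") are assumptions for illustration and not necessarily those used in S 4/S 5.

# Sketch of the "1st & 2nd" answer set: the 2nd diagnosis replaces the 1st
# whenever the 1st is incorrect. Column names are illustrative assumptions.
import numpy as np
import pandas as pd

def combine_first_second(df: pd.DataFrame) -> pd.Series:
    """Return the answer used for the '1st & 2nd' evaluation."""
    return pd.Series(
        np.where(df["dx1"] == df["truth"], df["dx1"], df["dx2"]),
        index=df.index,
    )

# Example: one rater's answers for three cases (diagnosis codes 1-6)
df = pd.DataFrame({"truth": [1, 4, 6], "dx1": [1, 5, 2], "dx2": [3, 4, 6]})
print(combine_first_second(df).tolist())  # [1, 4, 6] -> cases 2 and 3 rescued by dx2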

Table 1 Diagnostic performance, interrater agreement, and test power analysis.

                                          | Human raters’ | Human raters’       | AI’s          | AI’s
                                          | 1st diagnosis | 1st & 2nd diagnoses | 1st diagnosis | 1st & 2nd diagnoses
Diagnostic performance
  Samples                                 | 25            | 25                  | 25            | 25
  Raters/LLM revisions                    | 10            | 10                  | 10            | 10
  Accuracy                                | 0.868         | 0.936               | 0.444         | 0.712
  Weighted precision                      | 0.871         | 0.939               | 0.476         | 0.737
Interrater agreement (p-value < alpha = 0.05 highlighted)
  Fleiss’ kappa                           | 0.714         | 0.852               | 0.505         | 0.598
  Fleiss’ kappa p-value                   | 2.64e-14      | 0.00e+00            | 5.62e-08      | 1.25e-09
  Gwet’s AC1                              | 0.718         | 0.854               | 0.553         | 0.607
  Gwet’s AC1 p-value                      | 2.09e-14      | 0.00e+00            | 6.63e-10      | 1.18e-09
Chi-square p-value (p-value < alpha = 0.05 highlighted)
  H0: Human raters’ 1st diagnosis         | 1             | 0.91                | 0             | 0.001
  H0: Human raters’ 1st & 2nd diagnoses   | 0.923         | 1                   | 0             | 0.007
Cohen’s w
  H0: Human raters’ 1st diagnosis         | 0             | 0.078               | 0.659         | 0.294
  H0: Human raters’ 1st & 2nd diagnoses   | 0.075         | 0                   | 0.623         | 0.253
Chi-square power (power > 1 – beta = 0.8 highlighted)
  H0: Human raters’ 1st diagnosis         | 0.05          | 0.128               | 1             | 0.966
  H0: Human raters’ 1st & 2nd diagnoses   | 0.121         | 0.05                | 1             | 0.89

For the analysis of diagnostic performance, accuracy and weighted precision were calculated. Weighted precision and accuracy take values in the interval [0, 1]. Unlike accuracy, however, weighted precision factors in data imbalance, as found in the dataset of the present study [19].
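As an illustration, accuracy and weighted precision can be computed as follows with scikit-learn; this is a sketch with invented answer vectors and is not necessarily identical to the code in S 5.

# Sketch of the diagnostic-performance metrics: accuracy and weighted
# precision via scikit-learn (illustrative data only).
from sklearn.metrics import accuracy_score, precision_score

# y_true / y_pred: flattened diagnosis codes (1-6) over all raters and cases
y_true = [1, 1, 4, 6, 2, 5]   # illustrative values only
y_pred = [1, 3, 4, 6, 2, 4]

accuracy = accuracy_score(y_true, y_pred)
# average="weighted" weights each class by its support, which accounts for the
# class imbalance of the dataset (e.g., 5 osteosarcoma vs. 2 lipoma cases)
weighted_precision = precision_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"accuracy = {accuracy:.3f}, weighted precision = {weighted_precision:.3f}")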

Interrater agreement allows assessment of the coherence between the answers from different raters or from different LLM iterations. This study uses Fleiss’ kappa and Gwet’s AC1, computed with the Python package for chance-corrected agreement coefficients (irrCAC) [20]. Gwet’s AC1 is designed for imbalanced datasets [21]. The irrCAC package includes a routine for simultaneously extracting the respective p-value. In this way, the obtained interrater agreement coefficients can be tested against the null hypothesis (H0) that there is no agreement beyond what would be expected purely by chance. Fleiss’ kappa and Gwet’s AC1 both take values in the interval [–1, 1]. For the interpretation of the numerical agreement, Landis & Koch provided an interpretation table, ranging from poor agreement (< 0.00) to almost perfect agreement (0.81–1.00) ([Table 2]) [22].

Table 2 Interpretation of strength of agreement for kappa statistics used in the present study [22].

Kappa statistic       | < 0.00 | 0.00–0.20 | 0.21–0.40 | 0.41–0.60 | 0.61–0.80   | 0.81–1.00
Strength of agreement | Poor   | Slight    | Fair      | Moderate  | Substantial | Almost perfect
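For illustration, the agreement coefficients described above can be approximated as follows. The study itself used the irrCAC package [20]; the sketch below instead relies on statsmodels for Fleiss’ kappa and on a manual implementation of the standard multi-rater formula for Gwet’s AC1 (formula assumed from the literature, ratings invented for illustration).

# Sketch of the agreement coefficients: Fleiss' kappa via statsmodels and a
# manual Gwet's AC1 (standard multi-rater formula assumed). Not the study's
# irrCAC routine; ratings are illustrative only.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# ratings: subjects x raters matrix of diagnosis codes 1-6 (illustrative values)
ratings = np.array([
    [1, 1, 1, 3],
    [4, 4, 5, 4],
    [6, 6, 6, 6],
    [2, 2, 2, 2],
])

counts, _ = aggregate_raters(ratings)        # subjects x categories count table
kappa = fleiss_kappa(counts, method="fleiss")

def gwet_ac1(counts: np.ndarray) -> float:
    n_subjects, q = counts.shape
    r = counts.sum(axis=1)                     # raters per subject
    pa = np.mean((counts * (counts - 1)).sum(axis=1) / (r * (r - 1)))
    pi_k = (counts / r[:, None]).mean(axis=0)  # mean classification probabilities
    pe = (pi_k * (1 - pi_k)).sum() / (q - 1)   # Gwet's chance agreement
    return float((pa - pe) / (1 - pe))

print(f"Fleiss' kappa = {kappa:.3f}, Gwet's AC1 = {gwet_ac1(counts):.3f}")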

For a performance comparison between the LLM and human raters, confusion matrices (CMs) are shown ([Fig. 3]). The CMs plot the true vs. predicted diagnosis. Ideally, the data concentrate on the main diagonal. Results were normalized per true diagnosis, i.e., matrix rows [23]. This visualization allows for a detailed assessment of misclassification patterns and potential biases in predictions. By comparing CM structures, differences in diagnostic tendencies between the LLM and human raters can be identified. A Chi-square goodness-of-fit test [24] was used to test whether (H0) the observed frequencies were generated by randomly sampling from a categorical distribution with probabilities proportional to the expected frequencies.

Fig. 3 Confusion matrices (true diagnoses with predicted diagnoses) for AI and human raters.
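A simplified sketch of this evaluation step is given below: the confusion matrix is normalized per true diagnosis with scikit-learn, and a Chi-square goodness-of-fit test compares observed with expected category frequencies via SciPy. The reading that the comparison runs over the distribution of predicted diagnosis categories, as well as the answer vectors themselves, are assumptions for illustration.

# Sketch of the confusion matrix normalization and the Chi-square
# goodness-of-fit test (a simplified stand-in for S 5; illustrative data only).
import numpy as np
from scipy.stats import chisquare
from sklearn.metrics import confusion_matrix

labels = [1, 2, 3, 4, 5, 6]             # diagnosis codes as in the prompt
y_true = [1, 1, 2, 4, 6, 5, 3, 6]       # illustrative values only
y_pred = [1, 3, 2, 5, 6, 5, 3, 2]

# Rows (true diagnoses) normalized to 1, as in Fig. 3
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")

# Goodness of fit: do the predicted-category frequencies of one answer set
# (observed) match those expected from the reference answer set (H0)?
observed = np.bincount(y_pred, minlength=7)[1:]
expected_probs = np.bincount(y_true, minlength=7)[1:] / len(y_true)
expected = expected_probs * len(y_pred)
chi2_stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(cm.round(2), f"chi2 = {chi2_stat:.2f}, p = {p_value:.3f}", sep="\n")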

In the final section of the present study, model stability with regard to sample size and test power was analyzed. For that purpose, the spread of the 1st-diagnosis accuracy is plotted over sample size ([Fig. 4]). Effect size was determined by Cohen’s w for Chi-square testing (Python implementation [25], Ch. 7 in Cohen’s original work [26]). The interpretation of Cohen’s w follows the original frame of reference: small 0.10, medium 0.30, large 0.50 (p. 277 in [26]). Power analysis is performed accordingly by Chi-square goodness-of-fit test [27]. Type I error (rejecting a true H0) testing in this study was performed with regard to alpha = 0.05. Type II error (failing to reject a false H0) testing was performed with beta = 0.2, requiring power = 1 – beta = 0.8.

Fig. 4 Convergence of accuracy of 1st diagnosis plotted over sample set (raters × items), sample set shuffled 5 times, 95% confidence interval shaded.
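The effect-size and power computation can be sketched as follows: Cohen’s w is implemented directly from its definition, and the achieved power is obtained from the statsmodels goodness-of-fit power routine. The category proportions and the sample size of 250 observations are illustrative assumptions, not the exact inputs of S 5.

# Sketch of the effect-size and power computation: a generic Cohen's w [26]
# and the statsmodels goodness-of-fit power routine (illustrative inputs).
import numpy as np
from statsmodels.stats.power import GofChisquarePower

def cohens_w(p_observed: np.ndarray, p_expected: np.ndarray) -> float:
    """Cohen's w = sqrt(sum((p1 - p0)^2 / p0)), Ch. 7 in [26]."""
    return float(np.sqrt(np.sum((p_observed - p_expected) ** 2 / p_expected)))

# Illustrative category proportions over the 6 diagnoses
p_expected = np.array([0.20, 0.20, 0.20, 0.12, 0.08, 0.20])  # e.g., human raters (H0)
p_observed = np.array([0.30, 0.10, 0.20, 0.10, 0.10, 0.20])  # e.g., AI answers

w = cohens_w(p_observed, p_expected)
# Achieved power of a Chi-square GOF test over 6 categories at alpha = 0.05
power = GofChisquarePower().power(effect_size=w, nobs=250, alpha=0.05, n_bins=6)
print(f"w = {w:.3f}, power = {power:.3f}")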


Results and discussion

The first result of the study is that the current version, ChatGPT-4.0, is able to load and process typical MSK imaging data. [Fig. 1] shows the LLM working on case 17. All iterations run for the 25 cases concluded with a clear output from the LLM containing the requested 1st and 2nd diagnoses.

When limited to the 1st diagnosis, the accuracy of the human raters was substantially higher than that of the LLM (87% vs. 44%, differing by a factor of almost 2, [Table 1]). The inclusion of the 2nd diagnosis increases the accuracy of human raters and of AI by definition. The accuracy of AI’s 1st & 2nd diagnoses increased to 71%, still below the accuracy of the human raters’ 1st & 2nd diagnoses (94%), but a substantial narrowing of the performance gap. Weighted precision, which factors in the data imbalance, exhibited the same pattern for the 1st diagnosis vs. the 1st & 2nd diagnoses for both humans and AI, with values marginally greater than the accuracy values.

Fleiss’ kappa indicated substantial human interrater agreement for the 1st diagnosis and almost perfect human interrater agreement for the 1st & 2nd diagnoses. AI interrater agreement was moderate as measured by Fleiss’ kappa. Gwet’s AC1 consistently yielded marginally greater values due to the data imbalance, with AI’s 1st & 2nd diagnoses improving to substantial agreement. The p-values of all calculated interrater agreement coefficients were substantially below alpha, with the human raters’ 1st & 2nd diagnoses even returning a p-value below numerical precision. H0 can therefore be rejected; none of the agreement levels was obtained by chance.

These patterns are also observable visually and qualitatively in the CM of [Fig. 3]. Results from the human raters converge better towards the main diagonal. AI results substantially improve when 2nd diagnoses are included. Visually, the performance difference between human raters and AI nearly disappears for 1st & 2nd diagnoses. Distinguishing myxoid liposarcoma from lipoma and hemangioma from abscess/osteomyelitis proved to be a particular challenge for AI when limited to the 1st diagnosis. The CM clearly highlighted these diagnostic challenges, with misclassifications concentrated in specific categories where the AI struggled most.

The Chi-square p-value in [Table 1] was calculated once with the human raters’ 1st diagnosis as H0 and once with the human raters’ 1st & 2nd diagnoses as H0. The human raters’ 1st diagnosis and the human raters’ 1st & 2nd diagnoses both resulted in perfect agreement (i.e., p = 1) when projected onto themselves. AI’s 1st diagnosis and AI’s 1st & 2nd diagnoses are both significantly different from the human raters’ 1st diagnosis and from the human raters’ 1st & 2nd diagnoses.

Cohen’s w for effect quantification was calculated once for the human raters’ 1st diagnosis as H0 and once for the human raters’ 1st & 2nd diagnoses as H0. In comparisons between the answers from the human raters, Cohen’s w was calculated to be < 0.10, i.e., below the threshold for small effects. This holds true regardless of the version of H0. A large effect was obtained for AI’s 1st diagnosis, regardless of H0. A small effect, close to medium, was obtained for AI’s 1st & 2nd diagnoses, regardless of the version of H0.

The Chi-square power analysis showed that the power values exceeded 0.8 for each pair formed between the two versions of H0 and either AI’s 1st diagnosis or AI’s 1st & 2nd diagnoses. The human raters’ 1st diagnosis and the human raters’ 1st & 2nd diagnoses, when projected onto themselves, resulted correctly in power = 0.05 = alpha. The power between the human raters’ 1st diagnosis and the human raters’ 1st & 2nd diagnoses was < 0.8, regardless of the version of H0. With regard to the size of the sample set, the Chi-square power analysis thus confirms the effect seen in this study between each combination of human answers and AI answers; the effects by Cohen’s w for the remaining combinations are too small to be confirmed. The 1st-diagnosis accuracy plotted over sample size in [Fig. 4] converges clearly towards a horizontal constant equal to the overall value calculated in [Table 1]. After ca. 75 of the total 250 data points (10 raters × 25 items), no substantial volatility remains on the result curves. Given the convergence of the graph in [Fig. 4], a larger dataset would not be expected to alter the 1st-diagnosis accuracy of the study.
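A convergence curve of the kind shown in [Fig. 4] can be sketched as follows; the flattening of answers into (rater × item) data points, the five shufflings, and the percentile-based 95% band are assumptions about the plotting procedure, and the correctness indicators are randomly generated for illustration.

# Sketch of the convergence analysis in Fig. 4: running accuracy over a
# growing, shuffled sample of (rater, item) correctness indicators.
# All inputs are randomly generated placeholders for illustration.
import numpy as np

rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=250).astype(float)  # stand-in for 10 raters x 25 items

curves = []
for _ in range(5):                                     # 5 random orderings, as in Fig. 4
    shuffled = rng.permutation(correct)
    running_acc = np.cumsum(shuffled) / np.arange(1, len(shuffled) + 1)
    curves.append(running_acc)
curves = np.vstack(curves)

mean_acc = curves.mean(axis=0)
low, high = np.percentile(curves, [2.5, 97.5], axis=0)  # 95% band across shuffles
print(f"accuracy after 75 samples: {mean_acc[74]:.3f} "
      f"(band {low[74]:.3f}-{high[74]:.3f})")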



Conclusions and future work

The aim of this study was to assess the ability of ChatGPT-4.0 to correctly classify multimodal MSK tumor cases. For this purpose, 25 multimodal cases were defined, each representing a typical MSK textbook diagnosis. For the evaluation of the performance level, the cases were diagnosed by AI and by n = 10 human raters. The accuracy of the LLM was generally lower than that of the human raters. The performance gap shrank substantially when a 2nd diagnosis was included in the set of answers. A power analysis confirmed the observable difference between human answers and AI answers. The interrater agreement was slightly greater for the answers of the human raters than for the LLM answers. This can also be seen in the plotted CMs of [Fig. 3], where the answers converge on the diagonal under 1st & 2nd diagnoses for both the human raters and AI. Given the speed and the constant availability of AI software throughout the day, it appears plausible that systems such as the one tested here will assist human physicians in clinical settings in the future.

Today, administration and documentation consume many doctor hours during hospital operations. However, increased productivity appears necessary to meet the future demand for clinical radiology services [1]. Provided that there is no loss in treatment quality, it can be considered in patients’ direct best interest to minimize the time spent by healthcare staff on administrative duties and documentation. LLMs in combination with automated pattern recognition promise to provide the technology to achieve that aim and, if successful in clinical application, to fulfill the prophecies of the early AI thinkers of the 1950s [2]. One requirement for that scenario to materialize will be that the corporations already mentioned, such as OpenAI or Google, continue investing in this technology. Ethical aspects of the application of AI in future medicine, e.g., threats of algorithmic bias and human deskilling, will have to be addressed [28] [29]. Finally, as seen before [14], the functionality and possible limitations of this technology will have to be tested.



Open access supplements

S 1: Image data with original source and true diagnosis (.pdf)

S 2: Zip folder, image files merged per case (.jpg)

S 3: Survey for human participants implemented in Google Forms (.pdf)

S 4: Study answers collected from human participants and ChatGPT (.xlsx)

S 5: Python code for data evaluation (.py)

The supplementary material is available under the DOI: 10.6084/m9.figshare.28560842.v1



Conflict of Interest

The authors declare that they have no conflict of interest.

Acknowledgement

JFS acknowledges support by the Dioscuri program initiated by the Max Planck Society, jointly managed with the National Science Centre (Poland), and mutually funded by the Polish Ministry of Science and Higher Education and the German Federal Ministry of Education and Research. The authors wish to express their thanks for all useful discussions leading to this manuscript.


Correspondence

PD Dr. Dr. med. Wolfram A. Bosbach
Department of Nuclear Medicine, Inselspital, Bern University Hospital, University of Bern
Bern
Switzerland   

Publication History

Received: 09 January 2025

Accepted after revision: 18 April 2025

Article published online:
03 June 2025

© 2025. Thieme. All rights reserved.

Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany

