DOI: 10.1055/s-0044-1780077
Comparing ChatGPT-4 to BARD for Accuracy and Completeness of Responses to Questions Derived from the International Consensus Statement on Endoscopic Skull Base Surgery
Introduction: Artificial intelligence (AI) language models, such as Chat Generative Pre-Trained Transformer 4 (GPT-4) by OpenAI and Bard by Google, emerged in 2022 as tools for answering questions, providing information, and offering suggestions to the layperson. These programs are large language models trained on available data to synthesize responses. GPT-4 and Bard have the potential to greatly impact how information is disseminated to patients; however, it is essential to understand how their answers compare with those of experts in the corresponding field. The International Consensus Statement on Endoscopic Skull Base Surgery 2019 (ICAR:SB) is an international multidisciplinary collaboration to critically evaluate and grade the current literature. The goal of this study is to assess the accuracy and completeness of GPT-4- and Bard-generated responses to questions based on ICAR:SB guidelines.
Methods: Endoscopic skull-base surgery policy statements and grades of evidence were extracted from the ICAR:SB (Table 1). Questions were synthesized for each policy statement and input into GPT-4 and Bard. The GPT-4 and Bard answers were graded by a fellowship-trained rhinologist and a skull-base neurosurgeon using a 5-point Likert scale for accuracy and completeness. Statistical analysis included descriptive statistics and chi-square testing comparing the graded answers of GPT-4 with those of Bard.
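For illustration, a minimal sketch of how such a chi-square comparison of Likert score distributions could be run is shown below (Python with SciPy). The score counts are hypothetical placeholders and do not reflect the study data.

# Minimal sketch: chi-square test comparing two models' Likert score distributions.
# The counts below are hypothetical placeholders, not the study data.
from scipy.stats import chi2_contingency

# Rows: models (GPT-4, Bard); columns: Likert accuracy scores 1 through 5.
accuracy_counts = [
    [0, 0, 1, 5, 40],   # GPT-4 (hypothetical counts)
    [0, 1, 4, 12, 29],  # Bard (hypothetical counts)
]

chi2, p_value, dof, expected = chi2_contingency(accuracy_counts)
print(f"chi-square = {chi2:.2f}, p = {p_value:.3f}, dof = {dof}")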
Results: The mean accuracy and completeness of GPT-4 were 4.76 and 4.51, respectively. The mean accuracy and completeness of Bard were 4.25 and 3.84, respectively. The distribution of scores is shown in Fig. 1. Chi-square testing comparing GPT-4 with Bard demonstrated statistically significant differences in accuracy (p = 0.005) and completeness (p = 0.004).
Discussion: Overall accuracy and completeness were high for responses generated by both GPT-4 and Bard; however, GPT-4 performed significantly better in both domains. The capabilities of language models will continue to evolve as more up-to-date information is integrated into future iterations. As the popularity of AI programs continues to expand, patients may search for answers to their healthcare questions on these platforms, and it is critical for physicians to monitor the responses these programs provide.
Conclusion: This study demonstrates that GPT-4 and Bard generated accurate and complete responses when graded by fellowship-trained rhinologists and skull-base neurosurgeons. AI language models have the potential to be robust tools for disseminating information in the future.


Publication History
Article published online:
February 5, 2024
© 2024. Thieme. All rights reserved.
Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany