J Neurol Surg B Skull Base 2024; 85(S 01): S1-S398
DOI: 10.1055/s-0044-1780077
Presentation Abstracts
Oral Abstracts

Comparing ChatGPT-4 to BARD for Accuracy and Completeness of Responses to Questions Derived from the International Consensus Statement on Endoscopic Skull Base Surgery

Yavar Abgin
1   California Northstate University, Elk Grove, California, United States
,
Kayla Umemoto
1   California Northstate University, Elk Grove, California, United States
,
Sean Polster
2   University of Chicago, Chicago, Illinois, United States
,
Arthur W. Wu
3   Cedars-Sinai Medical Center, Los Angeles, California, United States
,
Andrew Goulian
1   California Northstate University, Elk Grove, California, United States
,
Christopher R. Roxbury
2   University of Chicago, Chicago, Illinois, United States
,
Omar G. Ahmed
4   Houston Methodist, Houston, Texas, United States
,
Pranay Soni
5   Cleveland Clinic, Cleveland, Ohio, United States
,
Dennis M. Tang
3   Cedars-Sinai Medical Center, Los Angeles, California, United States
 

Introduction: Artificial intelligence (AI) language models, such as Chat Generative Pre-Trained Transformer 4 (GPT-4) by OpenAI and Bard by Google, emerged in 2022 as tools for answering questions, providing information, and offering suggestions to the layperson. These programs are large language models trained on available data to synthesize responses. GPT-4 and Bard have the potential to greatly impact how information is disseminated to patients; however, it is essential to understand how their answers compare to those of experts in the corresponding field. The International Consensus Statement on Endoscopic Skull Base Surgery 2019 (ICAR:SB) is an international multidisciplinary collaboration to critically evaluate and grade the current literature. The goal of this study is to assess the accuracy and completeness of responses generated by GPT-4 and Bard to questions based on the ICAR:SB guidelines.

Methods: Endoscopic skull base surgery policy statements and their grades of evidence were extracted from the ICAR:SB (Table 1). A question was synthesized for each policy statement and input into GPT-4 and Bard. The GPT-4 and Bard answers were graded by a fellowship-trained rhinologist and a skull-base neurosurgeon using a 5-point Likert scale for accuracy and completeness. Statistical analysis included descriptive statistics and chi-square testing comparing the graded answers of GPT-4 with those of Bard.
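As a rough illustration of the workflow described above, the Python sketch below submits one ICAR:SB-derived question to GPT-4 and then runs a chi-square test on Likert-score counts. The abstract does not state how questions were submitted or how scores were tabulated; the OpenAI client call and all counts here are illustrative assumptions, not the study's method or data (Bard is omitted because it lacked a comparable public API at the time).

    # Illustrative sketch only: hypothetical setup, not the study's actual method or data.
    from openai import OpenAI                   # assumes the OpenAI Python SDK
    from scipy.stats import chi2_contingency

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask_gpt4(question: str) -> str:
        """Submit one ICAR:SB-derived question to GPT-4 and return its answer."""
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": question}],
        )
        return resp.choices[0].message.content

    answer = ask_gpt4(
        "When should a lumbar drain be used after endoscopic endonasal skull base surgery?"
    )

    # Hypothetical counts of answers scored 1..5 for accuracy (rows: GPT-4, Bard).
    accuracy_counts = [
        [0, 0, 2, 8, 45],   # GPT-4 (made-up numbers)
        [1, 3, 6, 16, 29],  # Bard (made-up numbers)
    ]
    chi2, p, dof, _ = chi2_contingency(accuracy_counts)
    print(f"chi-square = {chi2:.2f}, p = {p:.3f}")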

Results: The mean accuracy and completeness scores for GPT-4 were 4.76 and 4.51, respectively. The mean accuracy and completeness scores for Bard were 4.25 and 3.84, respectively. The distribution of the scores can be seen in Fig. 1. Chi-square testing comparing GPT-4 to Bard demonstrated statistically significant differences in accuracy (p = 0.005) and completeness (p = 0.004).

Discussion: Overall accuracy and completeness were high for responses generated by both GPT-4 and Bard; however, GPT-4 scored significantly higher in both domains. The capabilities of language models will continue to evolve as more up-to-date information is integrated into future iterations. As the popularity of AI programs continues to expand, patients may search for answers to their healthcare questions on these platforms, so it is critical for physicians to monitor the responses these programs provide.

Conclusion: This study demonstrates that GPT-4 and Bard generated accurate and complete responses when graded by a fellowship-trained rhinologist and a skull-base neurosurgeon. AI language models have the potential to be robust tools for disseminating information in the future.

Table 1 Example policy statement from the ICAR:SB and the associated question posed to the language models

Policy: X.C. Lumbar Drain after ESBS
Policy level: Option
Treatment option: LD placement before and/or after ESBS may be used during ESBS
Question: When should a lumbar drain be used after endoscopic endonasal skull base surgery?

Fig. 1 Bar graph showing counts of accuracy and completeness scores for GPT-4 and Bard answers on a 5-point Likert scale.


Publication History

Article published online:
February 5, 2024

© 2024. Thieme. All rights reserved.

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany