CC BY 4.0 · Endosc Int Open
DOI: 10.1055/a-2586-5912
Original article

Comparing ChatGPT3.5 and Bard in recommending colonoscopy intervals: bridging the gap in healthcare settings

Maziar Amini
1   Division of Gastrointestinal and Liver Diseases, University of Southern California Keck School of Medicine, Los Angeles, United States (Ringgold ID: RIN12223)
,
Patrick W. Chang
2   Department of Internal Medicine, Division of Gastroenterology, University of Southern California Keck School of Medicine, Los Angeles, United States (Ringgold ID: RIN12223)
,
Rio O. Davis
3   School of Science and Engineering, Tulane University, New Orleans, United States (Ringgold ID: RIN5783)
,
Denis D. Nguyen
1   Division of Gastrointestinal and Liver Diseases, University of Southern California Keck School of Medicine, Los Angeles, United States (Ringgold ID: RIN12223)
,
Jennifer L. Dodge
1   Division of Gastrointestinal and Liver Diseases, University of Southern California Keck School of Medicine, Los Angeles, United States (Ringgold ID: RIN12223)
,
Jennifer Phan
4   Division of Gastrointestinal Liver Disease, University of Southern California, Los Angeles, United States (Ringgold ID: RIN5116)
,
James Buxbaum
5   Medicine/Gastroenterology, University of California, San Francisco, San Francisco, United States
,
Ara Sahakian
6   Division of Gastrointestinal and Liver Diseases, University of Southern California, Los Angeles, United States

Background and study aims: Colorectal cancer is a leading cause of cancer-related death, and screening and surveillance colonoscopy play a crucial role in early detection. This study examined the accuracy of two freely available large language models (LLMs), GPT3.5 and Bard, in recommending colonoscopy intervals across diverse healthcare settings.

Patients and methods: A cross-sectional study was conducted using data from routine colonoscopies at a large safety-net hospital and a private tertiary hospital. GPT3.5 and Bard were tasked with recommending screening intervals based on colonoscopy reports and pathology data, and their accuracy and inter-rater reliability were compared against a guideline-directed endoscopist panel.

Results: Of 549 colonoscopies analyzed (N=268 at the safety-net hospital and N=281 at the private hospital), GPT3.5 showed higher concordance with guideline-based recommendations than Bard (60.4% vs. 50.0%, p<0.001). At the safety-net hospital, GPT3.5 had a 60.5% concordance rate with the panel compared with Bard's 45.7% (p<0.001); at the private hospital, concordance was 60.3% for GPT3.5 and 54.3% for Bard (p=0.13). Overall, GPT3.5 showed fair agreement with the panel (kappa=0.324), whereas Bard displayed lower agreement (kappa=0.219). At the safety-net hospital, GPT3.5 showed fair agreement (kappa=0.340) while Bard showed only slight agreement (kappa=0.148); at the private hospital, both models demonstrated fair agreement with the panel (kappa=0.295 and 0.282, respectively).

Conclusions: These findings highlight the limitations of freely available LLMs in assisting with colonoscopy screening recommendations. Although the potential of freely available LLMs to offer uniform, widely accessible recommendations is significant, the low accuracy observed here precludes their use as the sole source of interval recommendations.
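For readers who wish to reproduce the type of agreement statistics reported above, the following is a minimal sketch, not the authors' code, of how concordance and Cohen's kappa can be computed between an LLM's recommended surveillance interval and the endoscopist panel's recommendation. The interval categories and example values are hypothetical; scikit-learn's cohen_kappa_score is one standard implementation.

```python
# Sketch of the agreement analysis described in the abstract (illustrative data only).
from sklearn.metrics import cohen_kappa_score

# Recommended surveillance intervals in years, one entry per colonoscopy report.
panel = [10, 5, 3, 10, 7, 5, 10, 3]   # guideline-directed endoscopist panel
llm   = [10, 5, 5, 10, 7, 3, 10, 3]   # e.g., GPT3.5 or Bard output (hypothetical)

# Concordance: share of cases where the LLM matched the panel exactly.
concordance = sum(p == l for p, l in zip(panel, llm)) / len(panel)

# Cohen's kappa: agreement corrected for chance.
kappa = cohen_kappa_score(panel, llm)

print(f"Concordance: {concordance:.1%}")
print(f"Cohen's kappa: {kappa:.3f}")
```

On the commonly used Landis and Koch scale, kappa values of 0.21 to 0.40, as reported for GPT3.5 here, correspond to "fair" agreement, and values of 0.00 to 0.20 to "slight" agreement.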



Publication History

Submitted: 19 August 2024

Accepted after revision: 07 April 2025

Accepted Manuscript online: 14 April 2025

© 2025. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction provided the original work is properly cited (https://creativecommons.org/licenses/by/4.0/).

Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany

Bibliographical Record
Maziar Amini, Patrick W. Chang, Rio O. Davis, Denis D. Nguyen, Jennifer L. Dodge, Jennifer Phan, James Buxbaum, Ara Sahakian. Comparing ChatGPT3.5 and Bard in recommending colonoscopy intervals: bridging the gap in healthcare settings. Endosc Int Open; 0: a25865912.
DOI: 10.1055/a-2586-5912