DOI: 10.1055/a-2586-5912
Comparing ChatGPT3.5 and Bard in recommending colonoscopy intervals: bridging the gap in healthcare settings

Background and study aims: Colorectal cancer is a leading cause of cancer-related death, and screening and surveillance colonoscopy play a crucial role in early detection. This study examined the efficacy of two freely available large language models (LLMs), GPT3.5 and Bard, in recommending colonoscopy intervals in diverse healthcare settings.

Patients and methods: A cross-sectional study was conducted using data from routine colonoscopies at a large safety-net hospital and a private tertiary hospital. GPT3.5 and Bard were tasked with recommending screening intervals based on colonoscopy reports and pathology data, and their accuracy and inter-rater reliability were compared with those of a guideline-directed endoscopist panel.

Results: Of 549 colonoscopies analyzed (N=268 at the safety-net hospital and N=281 at the private hospital), GPT3.5 showed better concordance with guideline recommendations (GPT3.5: 60.4% vs. Bard: 50.0%, p<0.001). In the safety-net hospital, GPT3.5 had a 60.5% concordance rate with the panel compared with Bard's 45.7% (p<0.001). In the private hospital, concordance was 60.3% for GPT3.5 and 54.3% for Bard (p=0.13). Overall, GPT3.5 showed fair agreement with the panel (kappa=0.324), whereas Bard displayed lower agreement (kappa=0.219). In the safety-net hospital, GPT3.5 showed fair agreement with the panel (kappa=0.340), while Bard showed only slight agreement (kappa=0.148). In the private hospital, both GPT3.5 and Bard demonstrated fair agreement with the panel (kappa=0.295 and 0.282, respectively).

Conclusions: This study highlights the limitations of freely available LLMs in assisting with colonoscopy screening recommendations. While the potential of freely available LLMs to offer uniformity is significant, their low accuracy precludes their use as the sole agent in providing recommendations.
Publication history
Submitted: 19 August 2024
Accepted after revision: 07 April 2025
Accepted Manuscript online: 14 April 2025
© The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution License, permitting unrestricted use, distribution, and reproduction so long as the original work is properly cited. (https://creativecommons.org/licenses/by/4.0/).
Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany
Maziar Amini, Patrick W. Chang, Rio O. Davis, Denis D. Nguyen, Jennifer L. Dodge, Jennifer Phan, James Buxbaum, Ara Sahakian. Comparing ChatGPT3.5 and Bard in recommending colonoscopy intervals: bridging the gap in healthcare settings. Endosc Int Open ; 0: a25865912.
DOI: 10.1055/a-2586-5912