CC BY 4.0 · Endosc Int Open 2025; 13: a25865912
DOI: 10.1055/a-2586-5912
Original article

Comparing ChatGPT3.5 and Bard recommendations for colonoscopy intervals: Bridging the gap in healthcare settings

Maziar Amini
1   Division of Gastrointestinal and Liver Diseases, University of Southern California Keck School of Medicine, Los Angeles, United States (Ringgold ID: RIN12223)
,
Patrick W. Chang
2   Department of Internal Medicine, Division of Gastroenterology, University of Southern California Keck School of Medicine, Los Angeles, United States (Ringgold ID: RIN12223)
,
Rio O. Davis
3   School of Science and Engineering, Tulane University, New Orleans, United States (Ringgold ID: RIN5783)
,
Denis D. Nguyen
1   Division of Gastrointestinal and Liver Diseases, University of Southern California Keck School of Medicine, Los Angeles, United States (Ringgold ID: RIN12223)
,
Jennifer L Dodge
1   Division of Gastrointestinal and Liver Diseases, University of Southern California Keck School of Medicine, Los Angeles, United States (Ringgold ID: RIN12223)
,
Jennifer Phan
4   Division of Gastrointestinal and Liver Diseases, University of Southern California, Los Angeles, United States (Ringgold ID: RIN5116)
,
James Buxbaum
5   Medicine/Gastroenterology, University of California, San Francisco, San Francisco, United States
,
Ara Sahakian
4   Division of Gastrointestinal and Liver Diseases, University of Southern California, Los Angeles, United States (Ringgold ID: RIN5116)

Abstract

Background and study aims

Colorectal cancer is a leading cause of cancer-related deaths, with screening and surveillance colonoscopy playing a crucial role in early detection. This study examined the efficacy of two freely available large language models (LLMs), GPT3.5 and Bard, in recommending colonoscopy intervals in diverse healthcare settings.

Patients and methods

A cross-sectional study was conducted using data from routine colonoscopies at a large safety-net hospital and a private tertiary hospital. GPT3.5 and Bard were tasked with recommending screening intervals based on colonoscopy reports and pathology data, and their accuracy and inter-rater reliability were assessed against a guideline-directed endoscopist panel.

Results

Of 549 colonoscopies analyzed (N = 268 at the safety-net hospital and N = 281 at the private hospital), GPT3.5 showed better overall concordance with guideline recommendations than Bard (60.4% vs. 50.0%, P < 0.001). At the safety-net hospital, GPT3.5 had a 60.5% concordance rate with the panel compared with Bard’s 45.7% (P < 0.001). At the private hospital, concordance was 60.3% for GPT3.5 and 54.3% for Bard (P = 0.13). Overall, GPT3.5 showed fair agreement with the panel (kappa = 0.324), whereas Bard displayed lower agreement (kappa = 0.219). At the safety-net hospital, GPT3.5 showed fair agreement with the panel (kappa = 0.340), whereas Bard showed only slight agreement (kappa = 0.148). At the private hospital, both GPT3.5 and Bard demonstrated fair agreement with the panel (kappa = 0.295 and 0.282, respectively).

Conclusions

This study highlights the limitations of freely available LLMs in assisting with colonoscopy screening recommendations. Although freely available LLMs hold significant potential to offer uniform recommendations, their low accuracy precludes their use as the sole agent in providing them.


Publication History

Received: 19 August 2024

Accepted after revision: 07 April 2025

Accepted Manuscript online:
14 April 2025

Article published online:
17 June 2025

© 2025. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution License, permitting unrestricted use, distribution, and reproduction so long as the original work is properly cited. (https://creativecommons.org/licenses/by/4.0/).

Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany

Bibliographical Record
Maziar Amini, Patrick W. Chang, Rio O. Davis, Denis D. Nguyen, Jennifer L Dodge, Jennifer Phan, James Buxbaum, Ara Sahakian. Comparing ChatGPT3.5 and Bard recommendations for colonoscopy intervals: Bridging the gap in healthcare settings. Endosc Int Open 2025; 13: a25865912.
DOI: 10.1055/a-2586-5912
 