Endoscopy 2025; 57(S 02): S206
DOI: 10.1055/s-0045-1805508
Abstracts | ESGE Days 2025
Moderated poster
Diagnosis and performance in endoscopy: an update | 03/04/2025, 16:00 – 17:00 | Poster Dome 1 (P0)

Accuracy of Multi-Modal Large Language Models for Endoscopic Detection of Colorectal Neoplasia

C Hassan 1, D Massimi 2, M Spadaccini 2, G Antonelli 3, M Menini 2, T Rizkala 1, R de Sire 2, L Alfarone 2, A Capogreco 2, L Carlini 4, C Lena 4, R Maselli 2, A Repici 2

Affiliations
1   Humanitas University, Pieve Emanuele, Italy
2   Humanitas Research Hospital, Rozzano, Italy
3   Gastroenterology and Digestive Endoscopy Unit, Castelli Hospital, Roma, Italy
4   Politecnico di Milano, Milan, Italy
 

Aims Artificial Intelligence-based Computer-Aided Detection (CADe) systems have been shown to consistently improve endoscopists’ Adenoma Detection Rates (ADR) in randomized trials, but they require extensive resources for data collection and annotation. Unlike CADe, Multi-Modal Large Language Models (MLLMs) are trained with unsupervised methods and do not rely on manual annotation of colorectal neoplasia. No prior study has evaluated the ability of MLLMs to detect colorectal polyps in colonoscopy videos. This study aimed to assess the potential of MLLMs for polyp detection and segmentation in colonoscopy videos, using CADe as the control [1] [2] [3] [4] [5].

Methods The SUN colonoscopy video database, comprising data from 99 patients with 100 unique polyps, was analyzed. Frames were sampled at 1 frame per second (FPS) and video length was limited to 30 seconds to match the models’ input capabilities. The system prompt was standardized to align the behavior of GPT-4o and Gemini 1.5 Pro. Their performance was compared with that of a commercially available CADe module based on convolutional neural networks (CNNs). The primary outcome was the accuracy of polyp detection at both the per-video and per-frame levels. Secondary outcomes included sensitivity, specificity, accuracy, and precision for polyp detection. This study was funded by the AIRC IG Grant 2022 (SAVE Project), No. 27843.
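As a rough illustration of the pipeline described above, the sketch below (Python) samples a colonoscopy video at 1 FPS and queries GPT-4o frame by frame. The abstract does not report the actual system prompt, request parameters, or per-video aggregation rule used for the MLLMs, so the prompt text, function names, and the "any positive frame" rule below are assumptions (the latter mirrors the per-video criterion reported for CADe in the Results).

# Illustrative sketch only; prompt text and helper names are assumptions.
import base64
import cv2                      # OpenCV, used here for frame extraction
from openai import OpenAI       # official OpenAI SDK (GPT-4o); Gemini would use its own SDK

SYSTEM_PROMPT = (               # placeholder for the standardized system prompt
    "You are an endoscopy assistant. Answer only 'yes' or 'no': "
    "does this colonoscopy frame contain a polyp?"
)

def sample_frames_1fps(video_path: str, max_seconds: int = 30) -> list[bytes]:
    """Down-sample a colonoscopy video to ~1 frame per second, capped at 30 s."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    frames, idx = [], 0
    while len(frames) < max_seconds:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % int(round(native_fps)) == 0:          # keep roughly 1 frame per second
            ok_enc, buf = cv2.imencode(".jpg", frame)
            if ok_enc:
                frames.append(buf.tobytes())
        idx += 1
    cap.release()
    return frames

def frame_contains_polyp(client: OpenAI, jpeg: bytes) -> bool:
    """Per-frame call: ask GPT-4o whether a single frame shows a polyp."""
    b64 = base64.b64encode(jpeg).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ]},
        ],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def video_contains_polyp(client: OpenAI, video_path: str) -> bool:
    """Per-video call: positive if any sampled frame is flagged (assumed aggregation rule)."""
    return any(frame_contains_polyp(client, f) for f in sample_frames_1fps(video_path))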

Results A total of 100 polyps (median diameter: 5 mm, IQR: 3–7 mm) were included from 99 patients (median age: 69 years, IQR: 58–74; male/female ratio: 71/28), alongside 14 videos of healthy mucosa. The dataset consisted of 49,136 frames with polyps and 109,554 frames without. CADe identified polyps in at least one frame for 99/100 polyps, achieving a sensitivity of 99% (95% CI: 96.3%–100%). GPT detected polyps in 87/100 videos, with a sensitivity of 87% (95% CI: 78.8%–92.9%), and correctly classified all 13 videos without polyps. Gemini detected polyps in 68/100 videos, with a sensitivity of 68% (95% CI: 57.9%–77.0%), and accurately classified 12/13 videos without polyps. Compared to CADe, GPT exhibited lower sensitivity (87% vs. 99%; p<0.05), similar specificity (100% vs. 100%; p=ns), comparable precision (100% vs. 100%; p=ns), and lower accuracy (88.5% vs. 99.1%; p<0.05). Gemini showed lower sensitivity (68% vs. 99%; p<0.05), comparable specificity (92.3% vs. 100%; p=ns), comparable precision (98.6% vs. 100%; p=ns), and reduced accuracy (70.8% vs. 99.1%; p<0.05).
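As a minimal reproduction of the summary statistics above, the sketch below recomputes the per-video metrics from the reported counts (100 polyp videos and the 13 polyp-free videos used in the comparisons). The abstract does not state which confidence-interval method was used, so an exact Clopper-Pearson interval is assumed here purely for illustration.

# Minimal sketch reproducing the per-video metrics from the reported counts.
from scipy.stats import beta

def clopper_pearson(x: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided binomial confidence interval for x successes in n trials."""
    lo = beta.ppf(alpha / 2, x, n - x + 1) if x > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, x + 1, n - x) if x < n else 1.0
    return lo, hi

def per_video_metrics(tp: int, fn: int, tn: int, fp: int) -> dict[str, float]:
    """Sensitivity, specificity, precision, and accuracy at the per-video level."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

# Counts taken from the Results section (100 polyp videos, 13 polyp-free videos).
models = {
    "CADe":   dict(tp=99, fn=1,  tn=13, fp=0),
    "GPT-4o": dict(tp=87, fn=13, tn=13, fp=0),
    "Gemini": dict(tp=68, fn=32, tn=12, fp=1),
}
for name, counts in models.items():
    m = per_video_metrics(**counts)
    lo, hi = clopper_pearson(counts["tp"], counts["tp"] + counts["fn"])
    print(f"{name}: sens {m['sensitivity']:.1%} (95% CI {lo:.1%}-{hi:.1%}), "
          f"spec {m['specificity']:.1%}, prec {m['precision']:.1%}, acc {m['accuracy']:.1%}")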

Conclusions MLLMs such as GPT-4o and Gemini show potential for colorectal polyp detection, particularly in lesion identification, unexpectedly achieving results comparable with those of the much more mature and labor-intensive deep learning algorithms. This opens the way to a more comprehensive semantic analysis of endoscopic images, with possible benefits for patients and health systems.



Publication History

Article published online:
27 March 2025

© 2025. European Society of Gastrointestinal Endoscopy. All rights reserved.

Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany

 
  • References

  • 1 Deng J, Heybati K, Shammas-Toma M. When vision meets reality: Exploring the clinical applicability of GPT-4 with vision. Elsevier Inc; 2024
  • 2 Davis J, Van Bulck L, Durieux BN, Lindvall C. The Temperature Feature of ChatGPT: Modifying Creativity for Clinical Research. JMIR Hum Factors 2024; 11: e53559. doi:10.2196/53559
  • 3 Misawa M. et al. Development of a computer-aided detection system for colonoscopy and a publicly accessible large colonoscopy video database (with video). Gastrointest Endosc 2021; 93 (4): 960-967.e3
  • 4 Zhang Y. et al. Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports. Meta-Radiology 2024; 100103
  • 5 Katz U. et al. GPT versus Resident Physicians — A Benchmark Based on Official Board Scores. NEJM AI 2024; 1 (5)