Endoscopy 2024; 56(09): 650-652
DOI: 10.1055/a-2335-1331
Editorial

Artificial intelligence (AI) systems for detection of Barrett’s neoplasia: time to bridge domain gaps and explore human–AI interaction

Referring to Meinikheim M et al. doi: 10.1055/a-2296-5696

Authors

  • Albert Jeroen De Groof

    1   Department of Gastroenterology and Hepatology, Amsterdam UMC, Amsterdam, Netherlands

Endoscopic recognition of early Barrett’s neoplasia may be difficult; lesions often only show subtle abnormalities and most endoscopists rarely encounter early Barrett’s neoplasia, making them unfamiliar with the appearance of these lesions [1] [2]. Assistance for detection of Barrett’s neoplasia has therefore always been one of the most appealing artificial intelligence (AI) applications.

In this issue of Endoscopy, Meinikheim et al. present an ex vivo study in which they evaluated the effect of AI assistance on the performance of endoscopists in recognizing Barrett’s neoplasia on endoscopic videos [3]. The AI system was trained with endoscopic still images and video frames from 557 patients recorded with high-definition white-light endoscopy or (optical) chromoendoscopy techniques. Histopathology labels were available for 456 patients. Neoplastic images of 304 patients were additionally delineated by expert endoscopists. In a two-phase design, 96 prospectively collected endoscopic video recordings of 72 patients with Barrett’s esophagus were evaluated for the presence of neoplastic lesions by a group of six expert and 16 nonexpert endoscopists. Endoscopist assessors were randomized to begin either with or without AI assistance. Performance did not change significantly in the group that began with AI assistance, but this was probably due to a learning effect, as there was no washout period between phases. In the eight nonexpert endoscopists who started without AI assistance, sensitivity and specificity increased significantly (from 70% to 78% and from 67% to 73%, respectively) when AI assistance was provided in the second phase, although their sensitivity still did not match that of the standalone AI system (sensitivity 92%; specificity 69%).
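For readers less familiar with the metrics reported above, the following minimal sketch shows how sensitivity and specificity can be computed from per-video assessments with and without AI assistance. The function name and the toy labels are hypothetical and chosen only for illustration; they are not data from the study.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    truth: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary per-video assessments.

    truth[i] is True when video i contains neoplasia (reference standard);
    predicted[i] is True when the assessor called video i neoplastic.
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")

    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")

    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data (NOT from the study): 4 neoplastic, 4 non-neoplastic videos.
truth = [True, True, True, True, False, False, False, False]
without_ai = [True, True, False, False, True, False, False, False]
with_ai = [True, True, True, False, False, False, False, False]

print(sensitivity_specificity(truth, without_ai))
print(sensitivity_specificity(truth, with_ai))
```

In a two-phase reader study, the same function would be applied to each assessor's calls in each phase, so that the change attributable to AI assistance can be compared within assessors.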

These findings align with observations from other studies, where endoscopists were found not to consistently adhere to AI recommendations. In a recent publication, Fockens et al. described the development of a computer-aided detection system for Barrett’s neoplasia that was trained with imagery of 2506 patients from 15 different centers and tested on multiple prospective image- and video-based test sets by 112 endoscopists. Video-based detections by endoscopists increased from 67% to 79% when AI assistance was provided, while standalone AI sensitivity was 91% [4].

It is essential to recognize, however, that determination of standalone AI performance is inherently prone to bias, especially for video-based applications, and may not directly relate to actual clinical performance. With this caveat in mind, the interaction between humans and AI remains a relatively unexplored area in endoscopy and warrants evaluation in the coming years. The authors deserve credit for initiating this evaluation by examining the impact of AI on the diagnostic confidence of endoscopists.

The AI system employed in this study was designed to detect neoplasia in overview (positive detections are labeled “region at risk”), followed by detailed inspection in magnification (positive detections are labeled “Barrett’s esophagus-related neoplasia” or “BERN”). This is in line with the procedural workflow of a Barrett’s surveillance endoscopy, where visible abnormalities are first detected in overview, followed by detailed inspection of the region of interest. Unfortunately, detailed information on the exact composition and content of the videos in the test set (e.g. the distribution of overview versus magnified imagery) is not provided. It is therefore not clear how many lesions were picked up by the AI system in overview, which from a clinical perspective is the essential first step. In this regard, training and testing primary detection systems with prospective imagery without specific focus on any lesion (i.e. mimicking a setting where lesions are generally overlooked) is crucial in preventing the important hidden bias that is inherent in the use of retrospectively collected imagery, which is generally focused on the lesion ([Fig. 1]) [4].

Fig. 1 A subtle neoplastic Barrett’s esophagus lesion. a Visualized in overview without specific focus on the lesion. b Visualized with a more dedicated focus on the lesion.

Although this is not explicitly described in the manuscript, the AI system was apparently trained with retrospectively collected data from multiple centers. The test set was prospectively collected in a single center. Hence, caution is warranted when interpreting results in terms of generalizability, as the authors rightfully acknowledge. Lack of generalizability of results is a common problem in preclinical AI studies. Generally, AI systems are developed using selected, high-quality imagery collected at a limited number of centers, which leads to homogeneous training data. However, the majority of these AI systems are anticipated to be employed in community centers, where the imaging quality and settings of each endoscopy unit vary considerably. These systems may therefore not recognize lesions that are less adequately visualized. Furthermore, modern neural network architectures have shown susceptibility to undesired behavior when subjected to minor image perturbations, and this further limits robustness in handling the heterogeneity encountered in daily practice. This phenomenon is commonly referred to as the domain gap and exposes a fundamental limitation of virtually all current endoscopic AI systems: the promising performance reported in an academic context may decline significantly once systems are tested in community practice, posing risks to patients and disrupting procedural efficiency.
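One simple way to probe susceptibility to minor image perturbations is to measure how often a prediction changes when an image is altered slightly. The sketch below does this for small brightness shifts. It is only an illustration under stated assumptions: `toy_model` is a hypothetical stand-in for a real detection system, and the images and shift values are invented.

```python
from typing import Callable, List

Image = List[List[float]]  # grayscale pixel intensities in [0, 1]


def add_brightness_shift(image: Image, shift: float) -> Image:
    """Return a copy of the image with every pixel shifted and clipped to [0, 1]."""
    return [[min(1.0, max(0.0, px + shift)) for px in row] for row in image]


def flip_rate(
    model: Callable[[Image], bool], images: List[Image], shifts: List[float]
) -> float:
    """Fraction of (image, shift) pairs whose prediction differs from the
    prediction on the unperturbed image."""
    flips = 0
    total = 0
    for image in images:
        baseline = model(image)
        for shift in shifts:
            total += 1
            if model(add_brightness_shift(image, shift)) != baseline:
                flips += 1
    return flips / total if total else 0.0


def toy_model(image: Image) -> bool:
    """Hypothetical stand-in classifier: 'positive' if mean intensity > 0.5."""
    pixels = [px for row in image for px in row]
    return sum(pixels) / len(pixels) > 0.5


# Invented example images: one close to the decision threshold, one far from it.
borderline = [[0.49, 0.49], [0.49, 0.49]]
clear_cut = [[0.9, 0.9], [0.9, 0.9]]

print(flip_rate(toy_model, [borderline, clear_cut], shifts=[-0.02, 0.02]))
```

A high flip rate on imagery from a new center, relative to the development data, would be one hedged indication that a system is operating across a domain gap.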

AI systems in Barrett’s neoplasia are particularly vulnerable to domain gaps, mainly because of the scarcity of training data stemming from the low incidence of the disease. Moreover, in contrast to the situation for colonic polyp detection, this low incidence poses a barrier to conducting large-scale, randomized clinical trials owing to the considerable logistical hurdles involved. Given the impracticality of conducting such trials, bridging domain gaps becomes even more imperative before clinical impact can be investigated.

How can we bridge domain gaps and enable safe and successful clinical application of endoscopic AI? First, as a general principle, AI developers should aim for inclusion of heterogeneity in their training and test datasets. This implies the necessity of employing multicenter data, ideally encompassing a broader spectrum of image settings and varying image quality. Second, recent technical advancements may enable stabilization of algorithm predictions and increase generalizability of algorithm performance. These include employment of domain-specific pretraining techniques, the use of state-of-the-art algorithm architectures, and integration of self-critical AI. In the coming years, we can anticipate a growing focus on these technological advancements.

Finally, bridging domain gaps can involve pursuing greater data homogeneity across community hospitals through the development of quality assurance algorithms, which are to be used in conjunction with AI systems for primary detection. Quality assurance algorithms can offer comprehensive quality feedback during live endoscopic procedures. This can enhance the quality of the endoscopic procedure by (1) ensuring optimal visualization and subsequent evaluation by the endoscopist, thereby improving their ability to detect early neoplasia, and (2) ensuring high-quality image input for AI algorithms, thereby facilitating reliable and consistent algorithm predictions.
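The editorial does not specify how such quality assurance algorithms work. As a rough illustration only, the sketch below gates frames using two classical heuristics, mean brightness and the variance of a discrete Laplacian as a sharpness proxy. The function names and all thresholds are hypothetical assumptions; a real system would likely be learned from data.

```python
from typing import List

Frame = List[List[float]]  # grayscale pixel intensities in [0, 1]


def laplacian_variance(frame: Frame) -> float:
    """Variance of the 4-neighbour Laplacian over interior pixels.

    Low values suggest few sharp edges, i.e. a possibly blurred frame.
    """
    height, width = len(frame), len(frame[0])
    responses = [
        frame[y - 1][x] + frame[y + 1][x] + frame[y][x - 1] + frame[y][x + 1]
        - 4 * frame[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    if not responses:
        return 0.0
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def quality_feedback(
    frame: Frame,
    min_sharpness: float = 0.01,   # hypothetical threshold
    min_brightness: float = 0.2,   # hypothetical threshold
    max_brightness: float = 0.9,   # hypothetical threshold
) -> List[str]:
    """Return a list of quality issues; an empty list means the frame passes."""
    pixels = [px for row in frame for px in row]
    brightness = sum(pixels) / len(pixels)

    issues = []
    if brightness < min_brightness:
        issues.append("too dark")
    elif brightness > max_brightness:
        issues.append("overexposed")
    if laplacian_variance(frame) < min_sharpness:
        issues.append("blurred")
    return issues


# Invented 4x4 example frames.
flat_frame = [[0.5] * 4 for _ in range(4)]
dark_frame = [[0.05] * 4 for _ in range(4)]
textured_frame = [[0.2 if (x + y) % 2 == 0 else 0.8 for x in range(4)] for y in range(4)]

for name, frame in [("flat", flat_frame), ("dark", dark_frame), ("textured", textured_frame)]:
    print(name, quality_feedback(frame))
```

In such a setup, only frames with no reported issues would be passed to the primary detection system, while the reported issues could be shown to the endoscopist as live feedback.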

The valuable contribution of Meinikheim et al. in advancing the field of AI warrants praise. Now that the feasibility of AI systems has been confirmed, in the coming years we will need to focus on improving the robustness of these systems for their successful integration into general endoscopic practice, while also developing a deeper understanding of human–AI interaction.



Publication history

Article published online:
18 June 2024

© 2024. Thieme. All rights reserved.

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany