How to Check the Reliability of Artificial Intelligence Solutions—Ensuring Client Expectations are Met
Funding: None.
23 January 2019
04 March 2019
17 April 2019 (online)
Background and Significance
Artificial intelligence solutions for clinical tasks have in some cases been released to clinical teams prematurely, creating increased risk and workload for clinicians. This letter discusses the issues that determine good AI practice.
A recent article in Forbes described concerns in the United Kingdom over an artificial intelligence (AI) technology solution that diagnoses patient complaints and recommends the best course of action. The article concentrates on the company Babylon, but the critique is valuable for scrutinizing all AI products and their claims; this letter therefore generalizes the issues, and the putative remedies, that can be inferred from this case study.
The Forbes article says: “In the UK, Babylon Health has claimed its AI bot is as good at diagnosing as human doctors, but interviews with current and former Babylon staff and outside doctors reveal broad concerns that the company has rushed to deploy software that has not been carefully vetted, then exaggerated its effectiveness.”
More broadly, the Forbes article questions the relationship between tech startups and health organizations. Forbes says:
“Concerns around Babylon's AI point to the difficulties that can arise when healthcare systems partner with tech startups. While Babylon has positioned itself as a healthcare company, it appears to have been run like a Silicon Valley startup. The focus was on building fast and getting things out the door….”
In particular, the gung-ho approach of information technology companies is identified: “Software is developed by iteration. Developers build an app and release it into the wild, testing it on various groups of live users and iterating as they go along.”
A medical colleague has commented to me: “this is human experimentation and reckless.”
Another commentary questioning Babylon's claims in more detail was published in the Lancet. It asserted that: “In particular, data in the trials were entered by doctors, not the intended lay users, and no statistical significance testing was performed. Comparisons between the Babylon Diagnostic and Triage System and seven doctors were sensitive to outliers; poor performance of just one doctor skewed results in favor of the Babylon Diagnostic and Triage System. Qualitative assessment of diagnosis appropriateness made by three clinicians exhibited high levels of disagreement. Comparison to historical results from a study by Semigran and colleagues produced high scores for the Babylon Diagnostic and Triage System but was potentially biased by unblinded selection of a subset of 30 of 45 test cases.”
“Babylon's study does not offer convincing evidence that its Babylon Diagnostic and Triage System can perform better than doctors in any realistic situation, and there is a possibility that it might perform significantly worse…Further clinical evaluation is necessary to ensure confidence in patient safety.”
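The Lancet's point about outlier sensitivity is easy to demonstrate numerically. The sketch below uses entirely invented accuracy figures (not Babylon's data) to show how, with a panel of only seven doctors, one poorly performing doctor can drag the panel mean below the system's score even when every other doctor outperforms it:

```python
# Hypothetical illustration of the Lancet critique: with a small comparison
# panel, a single outlier can flip a mean-based comparison in the system's
# favour. All numbers are invented for illustration only.

doctor_accuracy = [0.84, 0.82, 0.86, 0.85, 0.83, 0.84, 0.48]  # one poor outlier
system_accuracy = 0.80

mean_all = sum(doctor_accuracy) / len(doctor_accuracy)
median_all = sorted(doctor_accuracy)[len(doctor_accuracy) // 2]

print(f"mean doctor accuracy:   {mean_all:.3f}")    # dragged below the system
print(f"median doctor accuracy: {median_all:.3f}")  # robust: still above it
print(f"system accuracy:        {system_accuracy:.3f}")
```

Running this prints a mean of 0.789 against a median of 0.840: the mean comparison favors the system while the outlier-robust median does not, which is exactly why significance testing and robust summaries matter with so few comparators.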
Babylon has defended itself by saying that it “goes through many, many rounds of clinicians rigorously testing the product … before deploying in the market,” which appears somewhat contrary to the Lancet article, especially in light of comments from a former employee: “Former staff say one of the biggest flaws in the way Babylon develops its software has been the lack of real-life clinical assessment and follow-up. Did people who used its chatbot ever go to an emergency room? If they did see a doctor, what was their diagnosis? ‘There was no system in place to find out,’ says a former staffer.”
In a closing statement, Babylon answered criticism over the lack of publications of its work. As Forbes notes, “The company admits it hasn't produced medical research,” saying it will publish in a medical journal “when Babylon produces medical research.”
All of this reminds me of a panel I attended at the Healthcare Information and Management Systems Society (HIMSS) conference in 2016, run by the IBM Watson team. They presented a set of comparative results on diagnostic analyses, clinicians versus Watson, in which Watson proved the better performer. Under close questioning from the audience, it emerged that the clinicians were trainees and that Watson was given credit for a correct answer if the answer appeared anywhere in the up to 50 possible diagnoses it offered. It seems they had never heard of false positives. The audience was unimpressed, and Watson went on to burn $60 million at MD Anderson Cancer Center with very little, if anything, to show for it.
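The scoring flaw in that Watson demonstration is worth making concrete. The simulation below is a hypothetical sketch, not the actual HIMSS data: it shows that even a system guessing completely at random looks dramatically better when it is credited for a correct answer anywhere in a list of 50 candidates rather than in its single top choice:

```python
import random

# Simulated illustration of how "credit if the truth appears anywhere in the
# top k" inflates apparent accuracy. The vocabulary size, case count, and the
# random-guessing system are all assumptions for this sketch.

random.seed(0)
n_conditions = 200   # assumed size of the diagnosis vocabulary
n_cases = 1000
k = 50               # credit given if the truth appears in the top 50

hits_top1, hits_topk = 0, 0
for _ in range(n_cases):
    truth = random.randrange(n_conditions)
    # A system that guesses at random, returning k distinct candidates.
    guesses = random.sample(range(n_conditions), k)
    hits_top1 += (guesses[0] == truth)
    hits_topk += (truth in guesses)

print(f"top-1 accuracy: {hits_top1 / n_cases:.1%}")   # about 1/200, ~0.5%
print(f"top-{k} accuracy: {hits_topk / n_cases:.1%}") # about 50/200, ~25%
```

A random guesser scores roughly 25% under the generous top-50 criterion against roughly 0.5% under top-1, a fifty-fold inflation with no diagnostic skill at all. Any claimed accuracy must state the scoring criterion, and a long candidate list also multiplies false positives that a clinician must then rule out.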
Protection of Human and Animal Subjects
No human or animal subjects were involved in this project.
- 1 Olson P. This health startup won big government deals—but inside, doctors flagged problems. Forbes 2018 (December 17). Available at: https://www.forbes.com/sites/parmyolson/2018/12/17/this-health-startup-won-big-government-dealsbut-inside-doctors-flagged-problems/#2f3f47dbeabb . Accessed January 4, 2019
- 2 Fraser H, Coiera E, Wong D. Safety of patient-facing digital symptom checkers. Lancet 2018; 392 (10161): 2263-2264
- 3 Moody C, Scocozza M, Brant M, et al. Using natural language processing to screen and classify pathology reports. NAACCR Annual Conference, June 2017. Available at: https://www.naaccr.org/wp-content/uploads/2017/06/Using-Natural-Language-Processing-to-Screen-and-Classify-Pathology-Reports.pdf . Accessed January 4, 2019
- 4 Pearce CM, McLeod A, Patrick J, et al. POLAR diversion: using general practice data to calculate risk of emergency department presentation at the time of consultation. Appl Clin Inform 2019; 10 (01) 151-157