Appl Clin Inform 2019; 10(02): 269-271
DOI: 10.1055/s-0039-1685220
Letter to the Editor
Georg Thieme Verlag KG Stuttgart · New York

How to Check the Reliability of Artificial Intelligence Solutions—Ensuring Client Expectations are Met

Jon Patrick
Health Language Analytics Global, Eveleigh, Australia
Funding None.

Publication History

23 January 2019

04 March 2019

Publication Date: 17 April 2019 (online)

Background and Significance

Artificial intelligence solutions for clinical tasks have been released prematurely to clinical teams, thereby creating increased risk and workload for clinicians. This letter discusses the issues that determine good AI practice.

A recent article in Forbes has described concerns in the United Kingdom over an artificial intelligence (AI) technology solution that diagnoses patient complaints and recommends the best course of action.[1] The article concentrates on the company Babylon, but its critique is valuable for scrutinizing all AI products and their claims. This letter offers generalizations of the issues, and putative remedies, that can be inferred from this case study.

The Forbes article says: “In the UK, Babylon Health has claimed its AI bot is as good at diagnosing as human doctors, but interviews with current and former Babylon staff and outside doctors reveal broad concerns that the company has rushed to deploy software that has not been carefully vetted, then exaggerated its effectiveness.”

More broadly, the Forbes article questions the relationship between tech startups and health organizations. Forbes says:

“Concerns around Babylon's AI point to the difficulties that can arise when healthcare systems partner with tech startups. While Babylon has positioned itself as a healthcare company, it appears to have been run like a Silicon Valley startup. The focus was on building fast and getting things out the door….”

In particular, the gung-ho approach of information technology companies is identified: “Software is developed by iteration. Developers build an app and release it into the wild, testing it on various groups of live users and iterating as they go along.”

A medical colleague has commented to me: “this is human experimentation and reckless.”

Another commentary, questioning Babylon's claims in more detail, was published in the Lancet.[2] It asserted that: “In particular, data in the trials were entered by doctors, not the intended lay users, and no statistical significance testing was performed. Comparisons between the Babylon Diagnostic and Triage System and seven doctors were sensitive to outliers; poor performance of just one doctor skewed results in favor of the Babylon Diagnostic and Triage System. Qualitative assessment of diagnosis appropriateness made by three clinicians exhibited high levels of disagreement. Comparison to historical results from a study by Semigran and colleagues produced high scores for the Babylon Diagnostic and Triage System but was potentially biased by unblinded selection of a subset of 30 of 45 test cases.”

“Babylon's study does not offer convincing evidence that its Babylon Diagnostic and Triage System can perform better than doctors in any realistic situation, and there is a possibility that it might perform significantly worse…Further clinical evaluation is necessary to ensure confidence in patient safety.”
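The Lancet's point about outliers can be illustrated with a minimal sketch. All numbers below are invented for illustration; they are not Babylon's data and this is not the analysis performed in the Lancet commentary. The sketch compares a hypothetical system score with the mean and the median of seven hypothetical doctor scores, then repeats the comparison with each doctor left out in turn.

```python
from statistics import mean, median

# Hypothetical accuracy scores for seven doctors (invented for illustration).
# The last doctor is the outlier with poor performance.
doctor_scores = [0.82, 0.80, 0.79, 0.81, 0.83, 0.78, 0.45]

# Hypothetical accuracy score for the AI system (also invented).
system_score = 0.77


def leave_one_out_means(scores):
    """Return the mean of the remaining scores after dropping each score in turn."""
    total = sum(scores)
    return [(total - s) / (len(scores) - 1) for s in scores]


print(f"mean of doctors:   {mean(doctor_scores):.3f}")
print(f"median of doctors: {median(doctor_scores):.3f}")
print(f"system:            {system_score:.3f}")

# If the verdict changes when a single doctor is removed,
# the comparison is sensitive to that doctor.
for dropped, loo_mean in zip(doctor_scores, leave_one_out_means(doctor_scores)):
    verdict = "system ahead" if system_score > loo_mean else "doctors ahead"
    print(f"without doctor scoring {dropped:.2f}: mean {loo_mean:.3f} -> {verdict}")
```

With these invented numbers the system beats the mean of the doctors but not their median, and the verdict reverses as soon as the single low-scoring doctor is left out. A comparison that depends on one participant in this way is weak evidence in either direction.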

Babylon has defended itself by saying that it “goes through many, many rounds of clinicians rigorously testing the product … before deploying in the market,” which appears somewhat contrary to the Lancet article, especially in light of comments from a former employee: “Former staff say one of the biggest flaws in the way Babylon develops its software has been the lack of real-life clinical assessment and follow-up. Did people who used its chatbot ever go to an emergency room? If they did see a doctor, what was their diagnosis? ‘There was no system in place to find out,’ says a former staffer.”

In closing, Forbes reports how Babylon answered criticism of the lack of publications on its work: “The company admits it hasn't produced medical research,” saying it will publish in a medical journal “when Babylon produces medical research.”

All of this reminds me of a panel run by the IBM Watson team that I attended at the Healthcare Information and Management Systems Society (HIMSS) conference in 2016. They presented comparative results of diagnostic analyses by clinicians versus Watson, in which Watson came out ahead. Under close questioning from the audience, it turned out that the clinicians were trainees and that Watson was given credit for a correct answer if the answer appeared anywhere among the up to 50 possible diagnoses it offered. It seems they had never heard of false positives. The audience was unimpressed, and Watson went on to generate a $60 million burnout at MD Anderson Cancer Center with very little, if anything, to show for it.
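The effect of crediting a system for a correct answer anywhere in a long list can be shown with a small sketch. The cases and diagnoses below are invented for illustration and have nothing to do with Watson's actual outputs; the function names are hypothetical. The sketch scores the same ranked suggestions twice: once counting only the first suggestion, and once counting a hit anywhere in the first k suggestions, while also counting how many wrong diagnoses are offered per case.

```python
from typing import Sequence


def top_k_accuracy(ranked: Sequence[Sequence[str]],
                   truths: Sequence[str], k: int) -> float:
    """Fraction of cases whose true diagnosis appears in the first k suggestions."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not truths or len(ranked) != len(truths):
        raise ValueError("need one non-empty, matching list of cases and truths")
    hits = sum(truth in suggestions[:k]
               for suggestions, truth in zip(ranked, truths))
    return hits / len(truths)


def mean_false_positives(ranked: Sequence[Sequence[str]],
                         truths: Sequence[str], k: int) -> float:
    """Average number of wrong diagnoses per case among the first k suggestions."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not truths or len(ranked) != len(truths):
        raise ValueError("need one non-empty, matching list of cases and truths")
    wrong = sum(sum(1 for d in suggestions[:k] if d != truth)
                for suggestions, truth in zip(ranked, truths))
    return wrong / len(truths)


# Invented toy cases: ranked suggestions from a hypothetical system,
# paired with the (hypothetical) confirmed diagnosis.
ranked_suggestions = [
    ["migraine", "tension headache", "sinusitis"],
    ["gastritis", "appendicitis", "gastroenteritis"],
    ["asthma", "bronchitis", "pneumonia"],
    ["viral pharyngitis", "tonsillitis", "mononucleosis"],
]
confirmed = ["migraine", "appendicitis", "pneumonia", "mononucleosis"]

for k in (1, 3):
    acc = top_k_accuracy(ranked_suggestions, confirmed, k)
    fps = mean_false_positives(ranked_suggestions, confirmed, k)
    print(f"k={k}: accuracy {acc:.2f}, wrong diagnoses per case {fps:.2f}")
```

In this toy data the reported accuracy rises from 0.25 to 1.00 when any of three suggestions counts as correct, while the number of wrong diagnoses offered per case rises from 0.75 to 2.00. Reporting the first figure without the second is the omission the audience picked up on.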

Protection of Human and Animal Subjects

No human or animal subjects were involved in the project.