Appl Clin Inform 2022; 13(03): 569-582
DOI: 10.1055/s-0042-1749119
Review Article

Diversity in Machine Learning: A Systematic Review of Text-Based Diagnostic Applications

Lane Fitzsimmons (1), Maya Dewan (2, 3), Judith W. Dexheimer (3, 4)

1   College of Agriculture and Life Science, Cornell University, Ithaca, New York, United States
2   Division of Critical Care Medicine, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, United States
3   Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States
4   Division of Emergency Medicine; Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, United States

Abstract

Objective As the storage of clinical data has transitioned into electronic formats, medical informatics has become increasingly relevant in providing diagnostic aid. The purpose of this review is to evaluate machine learning models that use text data for diagnosis and to assess the diversity of the included study populations.

Methods We conducted a systematic literature review on three public databases. Two authors reviewed every abstract for inclusion. Articles were included if they used or developed machine learning algorithms to aid in diagnosis. Articles focusing on imaging informatics were excluded.

Results From 2,260 identified papers, we included 78. Neural networks were the most frequently used machine learning models (44.9%). Studies had a median population of 661.5 patients and covered diseases and disorders of 10 different body systems. Of the papers, 35.9% (N = 28) included race data; among these, 57.1% (N = 16) of study populations were majority White, 14.3% were majority Asian, and 7.1% were majority Black. In 75% (N = 21) of the papers that included race data, White was the largest racial group represented. The sex ratio of the patient population was included in 43.6% (N = 34) of papers.
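The percentages in the Results follow directly from the stated counts. A minimal sketch that recomputes them is given here as a consistency check only; the variable names and the `pct` helper are illustrative assumptions, not part of the review's methods, and only counts stated explicitly in the abstract are used.

```python
# Counts as stated in the Results paragraph; no new data is introduced.
TOTAL_PAPERS = 78           # papers included in the review
PAPERS_WITH_RACE = 28       # papers that included race data
MAJORITY_WHITE = 16         # of those, study populations that were majority White
WHITE_LARGEST_GROUP = 21    # of those, papers where White was the largest group
PAPERS_WITH_SEX_RATIO = 34  # papers that included the sex ratio


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Note the two different denominators: all included papers versus
# only the papers that included race data.
recomputed = {
    "papers including race data": pct(PAPERS_WITH_RACE, TOTAL_PAPERS),
    "majority White (of race-reporting papers)": pct(MAJORITY_WHITE, PAPERS_WITH_RACE),
    "White largest group (of race-reporting papers)": pct(WHITE_LARGEST_GROUP, PAPERS_WITH_RACE),
    "papers including sex ratio": pct(PAPERS_WITH_SEX_RATIO, TOTAL_PAPERS),
}

for label, value in recomputed.items():
    print(f"{label}: {value}%")
```

The race-composition percentages use the 28 papers that included race data as the denominator, whereas the reporting rates for race and sex use all 78 included papers.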

Discussion With the power to build robust algorithms supported by massive quantities of clinical data, machine learning is shaping the future of diagnostics. Limitations of the underlying data create potential biases, especially if patient demographics are unknown or not included in the training data.

Conclusion As the movement toward clinical reliance on machine learning accelerates, both recording demographic information and using diverse training sets should be emphasized. Extrapolating algorithms to demographics beyond the original study population leaves large gaps for potential biases.

Protection of Human and Animal Subjects

Human subjects were not included in this project.




Publication History

Received: 09 November 2021

Accepted: 04 April 2022

Article published online: 25 May 2022

© 2022. Thieme. All rights reserved.

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany