Appl Clin Inform 2021; 12(04): 808-815
DOI: 10.1055/s-0041-1735184
Review Article

Systematic Review of Approaches to Preserve Machine Learning Performance in the Presence of Temporal Dataset Shift in Clinical Medicine

Lin Lawrence Guo (1), Stephen R. Pfohl (2), Jason Fries (2), Jose Posada (2), Scott Lanyon Fleming (2), Catherine Aftandilian (4), Nigam Shah (2), Lillian Sung (1, 3)

Author Affiliations
1 Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, Canada
2 Biomedical Informatics Research, Stanford University, Palo Alto, California, United States
3 Division of Haematology/Oncology, The Hospital for Sick Children, Toronto, Canada
4 Division of Pediatric Hematology/Oncology, Stanford University, Palo Alto, California, United States
Funding: None.

Abstract

Objective The change in performance of machine learning models over time as a result of temporal dataset shift is a barrier to the use of machine learning-derived models for decision-making in clinical practice. Our aim was to describe technical procedures used to preserve the performance of machine learning models in the presence of temporal dataset shift.

Methods Studies were included if they were fully published articles that used machine learning and implemented a procedure to mitigate the effects of temporal dataset shift in a clinical setting. We described how dataset shift was measured, the procedures used to preserve model performance, and their effects.
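A common design for measuring temporal dataset shift is to develop a model on an early time period and then track discrimination (e.g., area under the receiver operating characteristic curve, AUROC) and calibration on successively later periods. A minimal sketch of this evaluation on synthetic data, where the shift is modeled as a drifting outcome prevalence (the cohorts, coefficients, and year labels below are illustrative assumptions, not taken from any reviewed study):

```python
# Sketch: quantify temporal shift by training on an early period and
# tracking AUROC and calibration-in-the-large on later periods.
# All data are synthetic; drift is a pure intercept (prevalence) shift.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

def cohort(n, drift):
    """Synthetic cohort; `drift` shifts the outcome prevalence."""
    X = rng.normal(size=(n, 4))
    logits = X @ np.array([1.2, -0.8, 0.4, 0.0]) + drift
    y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logits))).astype(int)
    return X, y

X_train, y_train = cohort(4000, drift=0.0)          # "development" period
model = LogisticRegression().fit(X_train, y_train)

for year, drift in [("year 1", 0.0), ("year 3", -0.7), ("year 5", -1.4)]:
    X, y = cohort(4000, drift)
    p = model.predict_proba(X)[:, 1]
    auroc = roc_auc_score(y, p)
    citl = p.mean() - y.mean()  # calibration-in-the-large (predicted - observed)
    print(f"{year}: AUROC={auroc:.3f}, CITL={citl:+.3f}")
```

Because the simulated drift changes only the prevalence, the model's ranking of patients is preserved (AUROC stays roughly stable) while the average predicted risk departs from the observed event rate, mirroring the pattern reported below in which calibration deteriorates more often than discrimination.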

Results Of 4,457 potentially relevant publications identified, 15 were included. The impact of temporal dataset shift was primarily quantified using changes, usually deterioration, in calibration or discrimination. Calibration deterioration was more common (n = 11) than discrimination deterioration (n = 3). Mitigation strategies were categorized as model level or feature level. Model-level approaches (n = 15) were more common than feature-level approaches (n = 2), with the most common approaches being model refitting (n = 12), probability calibration (n = 7), model updating (n = 6), and model selection (n = 6). In general, all mitigation strategies were successful at preserving calibration but not uniformly successful in preserving discrimination.
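Of the model-level mitigation strategies, probability calibration is among the simplest to apply. One standard variant is logistic recalibration (Cox, 1958): keep the original model's risk ranking but refit an intercept and slope on the model's logits using recent data. A minimal sketch on synthetic data (the cohort simulation and a pure prevalence shift are assumptions for illustration; scikit-learn is used for the underlying models):

```python
# Sketch of logistic recalibration: refit intercept and slope on the
# original model's logits using data from the newer period.
# All data are synthetic and illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def simulate(n, shift=0.0):
    """Synthetic cohort; `shift` moves the outcome prevalence over time."""
    X = rng.normal(size=(n, 3))
    logits = X @ np.array([1.0, -0.5, 0.25]) + shift
    y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logits))).astype(int)
    return X, y

# Develop the model on historical data.
X_old, y_old = simulate(5000, shift=0.0)
model = LogisticRegression().fit(X_old, y_old)

# Temporal dataset shift: outcome prevalence drops in the new period.
X_new, y_new = simulate(5000, shift=-1.0)
p_new = model.predict_proba(X_new)[:, 1]

# Recalibrate: fit a one-variable logistic model on the original logits.
logit_new = np.log(p_new / (1 - p_new)).reshape(-1, 1)
recal = LogisticRegression().fit(logit_new, y_new)
p_recal = recal.predict_proba(logit_new)[:, 1]

# After recalibration, mean predicted risk tracks the observed event rate.
print(round(p_new.mean(), 3), round(p_recal.mean(), 3), round(y_new.mean(), 3))
```

Because recalibration is a monotone transform of the original scores, it restores calibration without changing discrimination, which is consistent with the pattern that mitigation strategies preserved calibration more reliably than discrimination.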

Conclusion Research on preserving the performance of machine learning models in the presence of temporal dataset shift in clinical medicine remains limited. Future research could focus on the impact of dataset shift on clinical decision-making, benchmark mitigation strategies across a wider range of datasets and tasks, and identify optimal strategies for specific settings.

Note

L.S. is the Canada Research Chair in Pediatric Oncology Supportive Care.


Author Contributions

L.L.G. and L.S. contributed to data acquisition and data analysis. All authors contributed to study concept and design and to data interpretation; were involved in drafting the manuscript or revising it critically for important intellectual content; gave final approval of the version to be published; and agreed to be accountable for all aspects of the work.


Protection of Human and Animal Subjects

As this study is a systematic review of primary studies, human and/or animal subjects were not included in the project.



Publication History

Received: 28 April 2021

Accepted: 12 July 2021

Article published online:
01 September 2021

© 2021. Thieme. All rights reserved.

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany

 