Abstract
Background Deep generative models (DGMs) present a promising avenue for generating realistic
synthetic data to augment existing health care datasets. However, it remains unclear
exactly how the completeness of the original dataset affects the quality of the generated
synthetic data.
Objectives In this paper, we investigate the effect of data completeness on samples generated
by the most common DGM paradigms.
Methods We create both cross-sectional and panel datasets with varying missingness and subset
rates and train generative adversarial networks, variational autoencoders, and autoregressive
models (Transformers) on these datasets. We then compare the distributions of the generated
data with those of the original training data to measure similarity.
Results We find that greater incompleteness in the training data is directly correlated
with greater dissimilarity between the original samples and those generated by DGMs.
Conclusions Care must be taken when using DGMs to generate synthetic data, as data completeness
issues can degrade the quality of generated data in both panel and cross-sectional
datasets.
Keywords
data quality - data completeness - case completeness - missingness - deep generative
models