Methods Inf Med 2023; 62(01/02): 031-039
DOI: 10.1055/a-2023-9181
Original Article for Focus Theme

Evaluating the Impact of Health Care Data Completeness for Deep Generative Models

Benjamin Smith*
1   Bredesen Center, University of Tennessee, Knoxville, Tennessee, United States
Senne Van Steelandt*
2   Department of Business Analytics and Statistics, University of Tennessee, Knoxville, Tennessee, United States
Anahita Khojandi
3   Department of Industrial and Systems Engineering, University of Tennessee, Knoxville, Tennessee, United States


Background Deep generative models (DGMs) present a promising avenue for generating realistic, synthetic data to augment existing health care datasets. However, exactly how the completeness of the original dataset affects the quality of the generated synthetic data is unclear.

Objectives In this paper, we investigate the effect of data completeness on samples generated by the most common DGM paradigms.

Methods We create both cross-sectional and panel datasets with varying missingness and subset rates and train generative adversarial networks, variational autoencoders, and autoregressive models (Transformers) on these datasets. We then compare the distributions of generated data with original training data to measure similarity.
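As an illustration of the dataset construction described above, the following is a minimal sketch, not the authors' pipeline. It assumes values are removed completely at random and that a "subset rate" means the fraction of rows kept; the abstract specifies neither. The function names `apply_missingness`, `subset_rows`, and `observed_fraction` and the toy Gaussian data are hypothetical.

```python
import random


def apply_missingness(rows, rate, seed=0):
    """Return a copy of rows where each cell is set to None with probability `rate`.

    Assumption: cells are removed completely at random (MCAR).
    """
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def subset_rows(rows, rate, seed=0):
    """Keep a random fraction `rate` of the rows, sampled without replacement."""
    rng = random.Random(seed)
    k = round(len(rows) * rate)
    return rng.sample(rows, k)


def observed_fraction(rows):
    """Share of cells that are not missing."""
    cells = [v for row in rows for v in row]
    return sum(v is not None for v in cells) / len(cells)


# Toy cross-sectional data: 1000 records with 5 numeric features (illustrative only).
rng = random.Random(42)
data = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(1000)]

# Build one training set per (missingness rate, subset rate) combination.
grid = {}
for miss in (0.0, 0.2, 0.5):
    for keep in (1.0, 0.5):
        kept = subset_rows(data, keep, seed=1)
        grid[(miss, keep)] = apply_missingness(kept, miss, seed=2)

for (miss, keep), variant in sorted(grid.items()):
    print(miss, keep, len(variant), round(observed_fraction(variant), 3))
```

Each entry of `grid` would then serve as the training set for one model run, so that the effect of each completeness setting can be compared.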

Results We find that greater incompleteness is directly correlated with greater dissimilarity between the original samples and the samples generated by DGMs.
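One way to quantify dissimilarity between original and generated samples is sketched below. The abstract does not name the similarity measure used, so the choice here (the one-dimensional Wasserstein-1 distance per feature, averaged over features) is an assumption made for illustration only. The function names and the toy data are hypothetical.

```python
import bisect
import random


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two empirical 1-D samples.

    Missing values (None) are dropped before comparison.
    Computed as the integral of |F_x - F_y| over the pooled sample points.
    """
    xs = sorted(v for v in xs if v is not None)
    ys = sorted(v for v in ys if v is not None)
    if not xs or not ys:
        raise ValueError("both samples need at least one observed value")
    points = sorted(xs + ys)
    total = 0.0
    for a, b in zip(points, points[1:]):
        fx = bisect.bisect_right(xs, a) / len(xs)  # empirical CDF of xs at a
        fy = bisect.bisect_right(ys, a) / len(ys)  # empirical CDF of ys at a
        total += abs(fx - fy) * (b - a)
    return total


def mean_feature_distance(original, generated):
    """Average the per-feature Wasserstein-1 distance across all columns."""
    n_features = len(original[0])
    dists = [
        wasserstein_1d([r[j] for r in original], [r[j] for r in generated])
        for j in range(n_features)
    ]
    return sum(dists) / n_features


def toy_sample(mean, n, seed):
    """Illustrative stand-in for real or generated records (3 features)."""
    rng = random.Random(seed)
    return [[rng.gauss(mean, 1) for _ in range(3)] for _ in range(n)]


original = toy_sample(0.0, 500, seed=0)
close = toy_sample(0.0, 500, seed=1)  # stands in for a faithful generator
far = toy_sample(1.0, 500, seed=2)    # stands in for a degraded generator

close_score = mean_feature_distance(original, close)
far_score = mean_feature_distance(original, far)
print(close_score < far_score)
```

A larger score indicates that the generated marginal distributions sit further from the original ones. A per-feature measure like this ignores dependence between features, so it is only a rough proxy for overall similarity.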

Conclusions Care must be taken when using DGMs to generate synthetic data as data completeness issues can affect the quality of generated data in both panel and cross-sectional datasets.

* Contributed equally.

Publication History

Received: 29 June 2022

Accepted: 31 January 2023

Accepted Manuscript online: 31 January 2023

Article published online: 10 March 2023

© 2023. Thieme. All rights reserved.

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany
