Synthetic Tabular Data Evaluation in the Health Domain Covering Resemblance, Utility, and Privacy Dimensions

Mikel Hernandez; Gorka Epelde; Ane Alberdi; Rodrigo Cilla; Debbie Rankin

doi:10.1055/s-0042-1760247

CC BY-NC-ND 4.0 · Methods Inf Med 2023; 62(S 01): e19-e38
DOI: 10.1055/s-0042-1760247

Original Article


Mikel Hernandez¹, Gorka Epelde¹,², Ane Alberdi³, Rodrigo Cilla¹, Debbie Rankin⁴

¹Digital Health and Biomedical Technologies Department, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastian, Spain

²eHealth Group, Biodonostia Health Research Institute, Donostia-San Sebastian, Spain

³Biomedical Engineering Department, Mondragon Unibertsitatea, Arrasate-Mondragón, Spain

⁴School of Computing, Engineering and Intelligent Systems, Ulster University, Derry-Londonderry, United Kingdom

Funding This research was partially funded by the Department of Economic Development and Infrastructure of the Basque Government through the Emaitek Plus Action Plan Programme. Ane Alberdi is part of the Intelligent Systems for Industrial Systems research group of Mondragon Unibertsitatea (IT1676-22), supported by the Department of Education, Universities and Research of the Basque Country.

Abstract

Background Synthetic tabular data generation is a promising technology for data augmentation and privacy preservation. However, before adoption, generated synthetic tabular data must be assessed empirically across the dimensions relevant to the target application to determine its efficacy. The literature lacks a standardized, objective evaluation and benchmarking strategy for synthetic tabular data in the health domain.

Objective The aim of this paper is to identify key dimensions, per-dimension metrics, and methods for evaluating synthetic tabular data generated with different techniques and configurations for health domain application development, and to provide a strategy to orchestrate them.

Methods Based on the literature, the resemblance, utility, and privacy dimensions have been prioritized, and a collection of metrics and methods for their evaluation is orchestrated into a complete evaluation pipeline. This enables a guided and comparative assessment of generated synthetic tabular data, categorizing its quality into three categories (“Excellent,” “Good,” and “Poor”). Six health care-related datasets and four synthetic tabular data generation approaches have been chosen to conduct an analysis and evaluation that verifies the utility of the proposed evaluation pipeline.

Results The synthetic tabular data generated with the four selected approaches has maintained resemblance, utility, and privacy for most combinations of dataset and synthetic tabular data generation approach. In several datasets, some approaches have outperformed others, while in other datasets, more than one approach has yielded the same performance.

Conclusion The results have shown that the proposed pipeline can effectively be used to evaluate and benchmark the synthetic tabular data generated by various synthetic tabular data generation approaches. Therefore, this pipeline can support the scientific community in selecting the most suitable synthetic tabular data generation approaches for their data and application of interest.
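
As an illustration of the categorization step described under Methods, the following minimal sketch maps per-dimension scores to the three quality categories. It is not the authors' pipeline: the normalized scores in [0, 1], the averaging rule, the cut-offs `EXCELLENT_MIN` and `GOOD_MIN`, the metric names, and the example values are all assumptions made for this example only.

```python
from typing import Dict

# Hypothetical cut-offs, NOT taken from the paper; chosen only for illustration.
EXCELLENT_MIN = 0.8
GOOD_MIN = 0.5


def categorize(score: float) -> str:
    """Map a normalized score in [0, 1] (higher is better) to a quality label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= EXCELLENT_MIN:
        return "Excellent"
    if score >= GOOD_MIN:
        return "Good"
    return "Poor"


def evaluate(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Average the metric scores of each dimension, then label the mean.

    Averaging is an assumed aggregation rule; other rules (for example,
    counting how many metrics pass a test) are equally plausible.
    """
    report = {}
    for dimension, metrics in scores.items():
        if not metrics:
            raise ValueError(f"no metrics supplied for dimension '{dimension}'")
        mean = sum(metrics.values()) / len(metrics)
        report[dimension] = categorize(mean)
    return report


# Made-up scores for one dataset and one generation approach.
example_scores = {
    "resemblance": {"univariate": 0.9, "multivariate": 0.85},
    "utility": {"train_on_synthetic": 0.6},
    "privacy": {"distance_to_real": 0.4},
}
report = evaluate(example_scores)
```

Running `evaluate` once per dataset and generation approach would yield a table of labels that can be compared across approaches, which is the kind of benchmarking the abstract describes.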

Keywords

synthetic tabular data generation - synthetic tabular data evaluation - resemblance evaluation - utility evaluation - privacy evaluation

Supplementary Material


Publication History

Received: June 13, 2022

Accepted: October 29, 2022

Article published online: January 9, 2023

© 2023. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution-NonDerivative-NonCommercial License, permitting copying and reproduction so long as the original work is given appropriate credit. Contents may not be used for commercial purposes, or adapted, remixed, transformed or built upon. (https://creativecommons.org/licenses/by-nc-nd/4.0/)

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany

