The Importance of Context: Risk-based De-identification of Biomedical Data[*]
Received: 05 February 2016
Accepted in revised form: 12 April 2016
Published online: 08 January 2018
Background: Data sharing is a central aspect of modern biomedical research, but it raises significant privacy concerns, and data often needs to be protected from re-identification. De-identification methods transform datasets in such a way that it becomes extremely difficult to link their records to identified individuals. The most important challenge in this process is to find an adequate balance between the gain in privacy and the loss in data quality.
Objectives: Accurately measuring the risk of re-identification in a specific data sharing scenario is an important aspect of data de-identification. Overestimating risks significantly deteriorates data quality, while underestimating them leaves data prone to attacks on privacy. Several models have been proposed for measuring risks, but generic methods for risk-based data de-identification are lacking. The aim of the work described in this article was to bridge this gap and to show how the quality of de-identified datasets can be improved by using risk models to tailor the de-identification process to a concrete context.
Methods: We implemented a generic de-identification process and several models for measuring re-identification risks in the ARX de-identification tool for biomedical data. By integrating the methods into an existing framework, we were able to automatically transform datasets in such a way that information loss is minimized while re-identification risks are guaranteed to meet a user-defined threshold. We performed an extensive experimental evaluation to analyze how different risk models, and different assumptions about the goals and background knowledge of an attacker, affect the quality of de-identified data.
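The core idea behind this process can be illustrated with a minimal sketch. This is not the ARX implementation: the toy dataset, the fixed generalization hierarchy, and the choice of the prosecutor risk model (the risk of a record is the reciprocal of its equivalence-class size) are all illustrative assumptions.

```python
from collections import Counter

# Hypothetical toy dataset: (age, ZIP code) quasi-identifiers.
records = [(34, "81675"), (35, "81675"), (36, "81677"),
           (52, "80331"), (53, "80331"), (54, "80333")]

def generalize(record, level):
    """Generalize both quasi-identifiers to the given level:
    widen the age interval and mask trailing ZIP digits."""
    age, zip_code = record
    bin_width = (1, 5, 10, 100)[level]
    age_g = (age // bin_width) * bin_width
    zip_g = zip_code[:len(zip_code) - level] + "*" * level
    return (age_g, zip_g)

def max_prosecutor_risk(recs):
    """The prosecutor risk of a record is 1 / size of its equivalence
    class; the risk of the dataset is the maximum over all records."""
    class_sizes = Counter(recs)
    return 1.0 / min(class_sizes.values())

def deidentify(recs, threshold):
    """Apply the weakest generalization for which the maximum
    re-identification risk does not exceed the user-defined threshold."""
    for level in range(4):
        transformed = [generalize(r, level) for r in recs]
        if max_prosecutor_risk(transformed) <= threshold:
            return level, transformed
    raise ValueError("threshold not satisfiable with this hierarchy")
```

For a threshold of 0.34, `deidentify(records, 0.34)` selects level 2 (10-year age bins, three leading ZIP digits), where every equivalence class contains at least three records; weaker thresholds permit weaker transformations and thus less information loss.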
Results: The results of our experiments show that data quality can be improved significantly by using risk models for data de-identification. On a scale where 100 % represents the original input dataset and 0 % represents a dataset from which all information has been removed, the loss of information content could be reduced by up to 10 % when protecting datasets against strong adversaries and by up to 24 % when protecting datasets against weaker adversaries.
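The 0 %/100 % scale used above can be made concrete with a simple granularity-based measure. This is a hypothetical illustration, not the information-loss metric used in the article: it scores each quasi-identifier by the fraction of its generalization hierarchy that has been applied.

```python
def information_content(levels, max_levels):
    """Percent of information retained: 100 % corresponds to the
    original dataset (no generalization), 0 % to a dataset in which
    every attribute is fully generalized. Hypothetical measure for
    illustration only."""
    loss = sum(l / m for l, m in zip(levels, max_levels)) / len(levels)
    return 100.0 * (1.0 - loss)

# Two quasi-identifiers, each with a 3-level generalization hierarchy:
print(information_content([0, 0], [3, 3]))  # original data: 100.0
print(information_content([3, 3], [3, 3]))  # fully generalized: 0.0
print(information_content([2, 1], [3, 3]))  # partial generalization: 50.0
```

Under such a measure, tailoring the de-identification to a weaker attacker permits lower generalization levels and therefore a higher retained information content.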
Conclusions: The methods studied in this article are well suited for protecting sensitive biomedical data and our implementation is available as open-source software. Our results can be used by data custodians to increase the information content of de-identified data by tailoring the process to a specific data sharing scenario. Improving data quality is important for fostering the adoption of de-identification methods in biomedical research.
Keywords: Information science, computer security, data protection, data anonymization, risk, data quality
* Supplementary material is published on our website: http://dx.doi.org/10.3414/ME16-01-0012
** These authors contributed equally to this work