The Importance of Context: Risk-based De-identification of Biomedical Data[*]
Received: 05 February 2016
Accepted in revised form: 12 April 2016
Published online: 08 January 2018
Background: Data sharing is a central aspect of modern biomedical research, but it raises significant privacy concerns, and data often needs to be protected from re-identification. De-identification methods transform datasets in such a way that it becomes extremely difficult to link their records to identified individuals. The most important challenge in this process is to find an adequate balance between the gain in privacy and the loss in data quality.
Objectives: Accurately measuring the risk of re-identification in a specific data sharing scenario is an important aspect of data de-identification. Overestimating risks significantly deteriorates data quality, while underestimating them leaves data prone to attacks on privacy. Several models have been proposed for measuring risks, but generic methods for risk-based data de-identification are lacking. The aim of the work described in this article was to bridge this gap and to show how the quality of de-identified datasets can be improved by using risk models to tailor the de-identification process to a concrete context.
Methods: We implemented a generic de-identification process and several models for measuring re-identification risks in the ARX de-identification tool for biomedical data. By integrating the methods into an existing framework, we were able to automatically transform datasets in such a way that information loss is minimized while re-identification risks are guaranteed to meet a user-defined threshold. We performed an extensive experimental evaluation to analyze how different risk models, and different assumptions about the goals and background knowledge of an attacker, affect the quality of de-identified data.
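The core idea behind this process can be illustrated with a minimal sketch. This is not the ARX implementation: the toy dataset, the fixed generalization hierarchy, and the choice of the prosecutor risk model (the risk of a record is the reciprocal of its equivalence-class size) are all illustrative assumptions.

```python
from collections import Counter

# Hypothetical toy dataset: (age, ZIP code) quasi-identifiers.
records = [(34, "81675"), (35, "81675"), (36, "81677"),
           (52, "80331"), (53, "80331"), (54, "80333")]

def generalize(record, level):
    """Generalize both quasi-identifiers to the given level:
    widen the age interval and mask trailing ZIP digits."""
    age, zip_code = record
    bin_width = (1, 5, 10, 100)[level]
    age_g = (age // bin_width) * bin_width
    zip_g = zip_code[:len(zip_code) - level] + "*" * level
    return (age_g, zip_g)

def max_prosecutor_risk(recs):
    """The prosecutor risk of a record is 1 / size of its equivalence
    class; the risk of the dataset is the maximum over all records."""
    class_sizes = Counter(recs)
    return 1.0 / min(class_sizes.values())

def deidentify(recs, threshold):
    """Apply the weakest generalization for which the maximum
    re-identification risk does not exceed the user-defined threshold."""
    for level in range(4):
        transformed = [generalize(r, level) for r in recs]
        if max_prosecutor_risk(transformed) <= threshold:
            return level, transformed
    raise ValueError("threshold not satisfiable with this hierarchy")
```

For a threshold of 0.34, `deidentify(records, 0.34)` selects level 2 (10-year age bins, three leading ZIP digits), where every equivalence class contains at least three records; weaker thresholds permit weaker transformations and thus less information loss.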
Results: The results of our experiments show that data quality can be improved significantly by using risk models for data de-identification. On a scale where 100 % represents the original input dataset and 0 % represents a dataset from which all information has been removed, the loss of information content could be reduced by up to 10 % when protecting datasets against strong adversaries and by up to 24 % when protecting datasets against weaker adversaries.
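The 0 %/100 % scale used above can be made concrete with a simple granularity-based measure. This is a hypothetical illustration, not the information-loss metric used in the article: it scores each quasi-identifier by the fraction of its generalization hierarchy that has been applied.

```python
def information_content(levels, max_levels):
    """Percent of information retained: 100 % corresponds to the
    original dataset (no generalization), 0 % to a dataset in which
    every attribute is fully generalized. Hypothetical measure for
    illustration only."""
    loss = sum(l / m for l, m in zip(levels, max_levels)) / len(levels)
    return 100.0 * (1.0 - loss)

# Two quasi-identifiers, each with a 3-level generalization hierarchy:
print(information_content([0, 0], [3, 3]))  # original data: 100.0
print(information_content([3, 3], [3, 3]))  # fully generalized: 0.0
print(information_content([2, 1], [3, 3]))  # partial generalization: 50.0
```

Under such a measure, tailoring the de-identification to a weaker attacker permits lower generalization levels and therefore a higher retained information content.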
Conclusions: The methods studied in this article are well suited for protecting sensitive biomedical data and our implementation is available as open-source software. Our results can be used by data custodians to increase the information content of de-identified data by tailoring the process to a specific data sharing scenario. Improving data quality is important for fostering the adoption of de-identification methods in biomedical research.
Keywords: Information science, computer security, data protection, data anonymization, risk, data quality
* Supplementary material is published on our website: http://dx.doi.org/10.3414/ME16-01-0012
** These authors contributed equally to this work