Is Multiclass Automatic Text De-Identification Worth the Effort?

Funding: The authors are supported in part by research funds from the UAB Informatics Institute and by the National Center for Advancing Translational Sciences of the National Institutes of Health under award number KL2TR001419. The content of this paper is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
received: 14 February 2018
accepted: 08 June 2018
24 September 2018 (online)
Objectives: Automatic de-identification to remove protected health information (PHI) from clinical text can use a “binary” model that replaces redacted text with a generic tag (e.g., “<PHI>”), or can use a “multiclass” model that retains more class information (e.g., “<Phone Number>”). Binary models are easier to develop, but result in text that is potentially less informative. We investigated whether building a multiclass de-identification model is worth the extra effort.
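The difference between the two output styles can be illustrated with a minimal sketch. The `redact` function, the span format, and the sample note below are hypothetical and chosen only for illustration; they are not the system evaluated in this study.

```python
from typing import List, Tuple

# Hypothetical span format: (start offset, end offset, PHI class).
Span = Tuple[int, int, str]


def redact(text: str, spans: List[Span], multiclass: bool) -> str:
    """Replace each annotated span with a tag. Spans are assumed not to overlap."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])
        # Binary output uses one generic tag; multiclass output keeps the class name.
        pieces.append(f"<{label}>" if multiclass else "<PHI>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Made-up example note with two annotated PHI spans.
note = "Call Dr. Smith at 555-0100."
spans = [(9, 14, "Doctor"), (18, 26, "Phone Number")]

binary_note = redact(note, spans, multiclass=False)
multiclass_note = redact(note, spans, multiclass=True)
print(binary_note)
print(multiclass_note)
```

In this sketch both outputs hide the same characters; the only difference is how much class information the replacement tag carries.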
Methods: Using the 2014 i2b2 dataset, we compared the two models on accuracy and on their impact on document readability. In the first experiment, we generated one binary and two multiclass versions trained with the same machine-learning algorithm, Conditional Random Field (CRF). We measured accuracy (recall, precision, f-score) and secondary metrics (e.g., training time, testing time, minimum memory required). In the second experiment, three reviewers assessed the readability of two redacted documents using the binary and multiclass methods. We calculated a pooled Kappa to estimate the inter-rater agreement.
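As a reminder of how the three accuracy metrics relate, the following sketch computes precision, recall, and f-score from sets of annotated spans under exact matching. The function name `precision_recall_f` and the toy spans are assumptions made for illustration; the study's actual evaluation procedure is not described here and may differ (for example, in how partial matches are scored).

```python
from typing import Set, Tuple

# Hypothetical span format: (start offset, end offset).
# For a multiclass evaluation the PHI class could be added to the tuple.
Span = Tuple[int, int]


def precision_recall_f(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and balanced f-score over spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy data: 3 gold spans, 4 predicted spans, 2 of which match exactly.
gold = {(9, 14), (18, 26), (40, 50)}
predicted = {(9, 14), (18, 26), (60, 65), (70, 72)}

precision, recall, f_score = precision_recall_f(gold, predicted)
print(precision, recall, f_score)
```

For de-identification, recall is usually the metric to watch most closely, since a missed span means PHI remains in the released text.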
Results: The multiclass model did not demonstrate a clear accuracy advantage, with lower recall (−1.9%) and only slightly better precision (+0.6%), despite requiring additional computing resources. Three raters reached a very high agreement (Kappa = 0.975, 95% Confidence Interval (0.946, 1.00), p < 0.0001) that both binary and multiclass models have the same impact on document readability.
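For readers unfamiliar with chance-corrected agreement among more than two raters, the sketch below computes Fleiss' kappa. This is a stand-in: the study reports a pooled Kappa, whose exact estimator is not specified here, so the function and the toy ratings are assumptions for illustration only and do not reproduce the reported value.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    Undefined (division by zero) when all ratings fall in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Mean observed agreement across items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Agreement expected by chance, from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    return (p_bar - p_expected) / (1 - p_expected)


# Toy ratings: 3 raters, 4 items, categories ("same readability", "different").
toy_counts = [[3, 0], [3, 0], [0, 3], [2, 1]]
kappa = fleiss_kappa(toy_counts)
print(kappa)
```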
Conclusions: This study suggests that the development of more sophisticated classification of PHI may not be worth the effort in terms of both system accuracy and the usefulness of the output.