Is Multiclass Automatic Text De-Identification Worth the Effort?

Duy Duc An Bui; David T. Redden; James J. Cimino

doi:10.3414/ME18-01-0017

Methods Inf Med 2018; 57(04): 177-184
DOI: 10.3414/ME18-01-0017

Original Articles

Georg Thieme Verlag KG Stuttgart · New York

Duy Duc An Bui¹, David T. Redden², James J. Cimino¹

¹Informatics Institute, University of Alabama at Birmingham, Birmingham, AL, USA
²Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL, USA

Funding: The authors are supported in part by research funds from the UAB Informatics Institute and by the National Center for Advancing Translational Sciences of the National Institutes of Health under award number KL2TR001419. The content of this paper is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Publication History

received: 14 February 2018

accepted: 08 June 2018

Publication Date: 24 September 2018 (online)

Summary

Objectives: Automatic de-identification to remove protected health information (PHI) from clinical text can use a “binary” model that replaces redacted text with a generic tag (e.g., “<PHI>”), or a “multiclass” model that retains more class information (e.g., “<Phone Number>”). Binary models are easier to develop, but result in text that is potentially less informative. We investigated whether building a multiclass de-identification model is worth the extra effort.
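The difference between the two output styles can be illustrated with a minimal sketch. It assumes PHI has already been detected and is given as character spans; the function name `redact`, the span format, and the sample note are hypothetical and are not taken from the study.

```python
from typing import List, Tuple

# (start, end, phi_class) character offsets; class names are illustrative only.
Span = Tuple[int, int, str]


def redact(text: str, spans: List[Span], multiclass: bool) -> str:
    """Replace each PHI span with a class-specific tag or the generic <PHI> tag."""
    pieces = []
    cursor = 0
    for start, end, phi_class in sorted(spans):
        pieces.append(text[cursor:start])  # text before the PHI span
        pieces.append(f"<{phi_class}>" if multiclass else "<PHI>")
        cursor = end
    pieces.append(text[cursor:])  # remaining text after the last span
    return "".join(pieces)


# Made-up example note with two already-detected PHI spans.
note = "Call Dr. Smith at 555-0100."
spans = [(9, 14, "Name"), (18, 26, "Phone Number")]

binary_text = redact(note, spans, multiclass=False)
multiclass_text = redact(note, spans, multiclass=True)
print(binary_text)      # Call Dr. <PHI> at <PHI>.
print(multiclass_text)  # Call Dr. <Name> at <Phone Number>.
```

Both outputs hide the same characters; the multiclass version only adds the class label to each tag.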

Methods: Using the 2014 i2b2 dataset, we compared the accuracy and impact on document readability of two models. In the first experiment, we generated one binary and two multiclass versions trained with the same machine-learning algorithm, Conditional Random Field (CRF). Accuracy (recall, precision, f-score) and secondary metrics (e.g., training time, testing time, minimum memory required) were measured. In the second experiment, three reviewers assessed the readability of two redacted documents using the binary and multiclass methods. We calculated a pooled Kappa to estimate the inter-rater agreement.
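As a rough illustration of the accuracy metrics named above, the following sketch computes precision, recall, and f-score over sets of labelled spans. It assumes exact-match scoring of (start, end, label) triples, which may differ from the matching criteria used in the study; the helper names `prf` and `to_binary` and the sample spans are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label)


def prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and f-score (harmonic mean of the two)."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def to_binary(spans: Set[Span]) -> Set[Span]:
    """Collapse every class label to the generic PHI label."""
    return {(start, end, "PHI") for start, end, _ in spans}


# Made-up annotations: one span has the right boundaries but the wrong class.
gold = {(9, 14, "Name"), (18, 26, "Phone Number"), (40, 50, "Date")}
pred = {(9, 14, "Name"), (18, 26, "ID"), (60, 65, "Date")}

multiclass_scores = prf(gold, pred)
binary_scores = prf(to_binary(gold), to_binary(pred))
print(multiclass_scores, binary_scores)
```

In this toy case the span with a wrong class label counts as an error under multiclass scoring but as a hit once labels are collapsed, which is one reason the two settings are not directly comparable without a stated matching rule.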

Results: The multiclass model did not demonstrate a clear accuracy advantage, with lower recall (−1.9%) and only slightly better precision (+0.6%), despite requiring additional computing resources. Three raters reached a very high agreement (Kappa = 0.975, 95% Confidence Interval (0.946, 1.00), p < 0.0001) that both binary and multiclass models have the same impact on document readability.
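For readers unfamiliar with multi-rater agreement statistics, the sketch below computes Fleiss' kappa for a small made-up rating table. This is a related, commonly used statistic and is not the pooled Kappa procedure the authors applied; the function name and the counts are hypothetical.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa; counts[i][j] = number of raters placing item i in category j.

    Assumes every item was rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    total = n_items * n_raters

    # Share of all ratings that fall in each category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    # Observed agreement per item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        return float("nan")  # undefined when all ratings fall in one category
    return (p_observed - p_expected) / (1 - p_expected)


# Made-up table: 4 items, 3 raters, 2 categories ("same", "different").
ratings = [[3, 0], [3, 0], [0, 3], [2, 1]]
kappa = fleiss_kappa(ratings)
print(round(kappa, 3))  # 0.625
```

A value near 1 indicates agreement well beyond chance, which is how the reported Kappa of 0.975 should be read.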

Conclusions: This study suggests that the development of more sophisticated classification of PHI may not be worth the effort in terms of both system accuracy and the usefulness of the output.

Keywords

De-identification - machine learning - comprehension - natural language processing

