Methods Inf Med 2018; 57(04): 177-184
DOI: 10.3414/ME18-01-0017
Original Articles
Georg Thieme Verlag KG Stuttgart · New York

Is Multiclass Automatic Text De-Identification Worth the Effort?

Duy Duc An Bui
1  Informatics Institute, University of Alabama at Birmingham, Birmingham, AL, USA
David T. Redden
2  Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL, USA
James J. Cimino
1  Informatics Institute, University of Alabama at Birmingham, Birmingham, AL, USA
Funding The authors are supported in part by research funds from the UAB Informatics Institute and by the National Center for Advancing Translational Sciences of the National Institutes of Health under award number KL2TR001419. The content of this paper is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Publication History

received: 14 February 2018

accepted: 08 June 2018

Publication Date:
24 September 2018 (online)


Objectives: Automatic de-identification to remove protected health information (PHI) from clinical text can use a “binary” model that replaces redacted text with a generic tag (e.g., “<PHI>”), or a “multiclass” model that retains more class information (e.g., “<Phone Number>”). Binary models are easier to develop, but result in text that is potentially less informative. We investigated whether building a multiclass de-identification model is worth the extra effort.
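
The difference between the two output styles can be illustrated with a minimal sketch. It assumes PHI spans have already been detected and are given as character offsets; the function name `redact`, the span format, and the sample note are hypothetical and are not taken from the study's system.

```python
from typing import List, Tuple

# (start, end, PHI class); end is exclusive. This span format is an assumption.
Span = Tuple[int, int, str]


def redact(text: str, spans: List[Span], multiclass: bool) -> str:
    """Replace each PHI span with a class tag (multiclass) or a generic tag (binary)."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # keep the non-PHI text unchanged
        pieces.append(f"<{label}>" if multiclass else "<PHI>")
        cursor = end
    pieces.append(text[cursor:])  # trailing text after the last span
    return "".join(pieces)


# Made-up example note with two hand-annotated spans.
note = "Call Dr. Smith at 555-0100."
spans = [(9, 14, "Doctor"), (18, 26, "Phone Number")]

binary_output = redact(note, spans, multiclass=False)
multiclass_output = redact(note, spans, multiclass=True)
print(binary_output)      # Call Dr. <PHI> at <PHI>.
print(multiclass_output)  # Call Dr. <Doctor> at <Phone Number>.
```

Both outputs hide the same text; the multiclass version only differs in what the replacement tag tells the reader.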

Methods: Using the 2014 i2b2 dataset, we compared the accuracy and impact on document readability of two models. In the first experiment, we generated one binary and two multiclass versions trained with the same machine-learning algorithm, Conditional Random Field (CRF). Accuracy (recall, precision, f-score) and secondary metrics (e.g., training time, testing time, minimum memory required) were measured. In the second experiment, three reviewers assessed the readability of two redacted documents using the binary and multiclass methods. We calculated a pooled Kappa to estimate the inter-rater agreement.
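
For reference, the three accuracy metrics named above follow the standard definitions and can be computed from counts of true positives, false positives, and false negatives. The sketch below is illustrative only: the function name is hypothetical, the counts are invented, and it is not the evaluation script used in the study.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard precision, recall, and F1 from token- or entity-level counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Invented counts, chosen only to make the arithmetic easy to follow.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In de-identification, recall is usually the metric of most concern, because a false negative means a piece of PHI was left in the text.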

Results: The multiclass model did not demonstrate a clear accuracy advantage, with lower recall (−1.9%) and only slightly better precision (+0.6%), despite requiring additional computing resources. The three raters reached very high agreement (Kappa = 0.975, 95% confidence interval (0.946, 1.00), p < 0.0001) that the binary and multiclass models have the same impact on document readability.

Conclusions: This study suggests that the development of more sophisticated classification of PHI may not be worth the effort in terms of both system accuracy and the usefulness of the output.