Is the Juice Worth the Squeeze? Costs and Benefits of Multiple Human Annotators for Clinical Text De-identification

David S. Carrell; David J. Cronkite; Bradley A. Malin; John S. Aberdeen; Lynette Hirschman

doi:10.3414/ME15-01-0122

Methods of Information in Medicine

Methods Inf Med 2016; 55(04): 356-364

Original Articles

Schattauer GmbH

Authors

David S. Carrell¹, David J. Cronkite¹, Bradley A. Malin², John S. Aberdeen³, Lynette Hirschman³

¹ Group Health Research Institute, Seattle, Washington, USA
² Vanderbilt University, Biomedical Informatics, Nashville, Tennessee, USA
³ The MITRE Corporation, Information Technology Center, Bedford, Massachusetts, USA

Abstract

Background: Clinical text contains valuable information but must be de-identified before it can be used for secondary purposes. Accurate annotation of personally identifiable information (PII) is essential to the development of automated de-identification systems and to manual redaction of PII. Yet the accuracy of annotations may vary considerably across individual annotators, and annotation is costly. The marginal benefit of incorporating additional annotators has not been well characterized.

Objectives: This study models the costs and benefits of incorporating increasing numbers of independent human annotators to identify the instances of PII in a corpus. We used a corpus with gold standard annotations to evaluate the performance of teams of annotators of increasing size.

Methods: Four annotators independently identified PII in a 100-document corpus consisting of randomly selected clinical notes from Family Practice clinics in a large integrated health care system. These annotations were pooled and validated to generate a gold standard corpus for evaluation.
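The pooling and team evaluation described in the Methods can be illustrated with a minimal sketch. It assumes each annotation is a (document id, start, end) span and that every pooled span is true PII, which skips the validation step the study performed. The toy data and the function names `pooled_gold`, `team_recall`, and `recall_by_team_size` are hypothetical and are not taken from the paper.

```python
from itertools import combinations

# Hypothetical toy data: each annotator's PII spans as (doc_id, start, end).
annotations = {
    "A": {(1, 0, 5), (1, 10, 14), (2, 3, 9)},
    "B": {(1, 0, 5), (2, 3, 9), (2, 20, 25)},
    "C": {(1, 10, 14), (2, 3, 9)},
    "D": {(1, 0, 5), (1, 10, 14), (2, 20, 25), (2, 30, 34)},
}


def pooled_gold(annotations):
    """Union of all annotators' spans (assumes every pooled span is valid PII)."""
    return set().union(*annotations.values())


def team_recall(team, annotations, gold):
    """Recall of a team: fraction of gold spans found by at least one member."""
    found = set().union(*(annotations[name] for name in team))
    return len(found & gold) / len(gold)


def recall_by_team_size(annotations, gold):
    """Map each team size to the (min, max) recall over all teams of that size."""
    names = sorted(annotations)
    result = {}
    for size in range(1, len(names) + 1):
        recalls = [
            team_recall(team, annotations, gold)
            for team in combinations(names, size)
        ]
        result[size] = (min(recalls), max(recalls))
    return result


gold = pooled_gold(annotations)
recall_ranges = recall_by_team_size(annotations, gold)
for size, (low, high) in recall_ranges.items():
    print(f"team size {size}: recall {low:.2f} to {high:.2f}")
```

Taking the union of a team's spans reflects the idea that a PII instance counts as found if any team member marked it, so recall can only stay level or rise as the team grows.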

Results: Measured against the gold standard, recall for all PII types ranged from 0.90 to 0.98 for individual annotators and from 0.998 to 1.0 for teams of three. Median cost per PII instance discovered during corpus annotation ranged from $0.71 for an individual annotator to $377 for annotations discovered only by a fourth annotator.
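The steep rise in cost per instance follows from simple arithmetic: each additional annotator costs roughly the same to employ but finds few instances that earlier annotators missed. The sketch below is one way to compute such a marginal cost for a single ordering of annotators. The abstract does not give the paper's cost model, so the function `marginal_costs`, the instance ids, and the dollar amounts are all hypothetical.

```python
def marginal_costs(order, found, cost):
    """For each annotator in order, return (name, new_instances, cost_per_new).

    An instance is "new" if no earlier annotator in the ordering found it.
    """
    seen = set()
    rows = []
    for name in order:
        new = found[name] - seen
        # If an annotator adds nothing new, the cost per new instance is unbounded.
        per_instance = cost[name] / len(new) if new else float("inf")
        rows.append((name, len(new), per_instance))
        seen |= new
    return rows


# Hypothetical data: instance ids found by each annotator, and each one's total cost.
found = {
    "A": set(range(0, 90)),
    "B": set(range(5, 98)),
    "C": set(range(10, 100)),
}
cost = {"A": 100.0, "B": 100.0, "C": 100.0}

rows = marginal_costs(["A", "B", "C"], found, cost)
for name, n_new, per_instance in rows:
    print(f"{name}: {n_new} new instances at ${per_instance:.2f} each")
```

The result depends on the ordering of annotators, which is one reason to summarize over orderings, for example with a median as the abstract reports.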

Conclusions: Incorporating a second annotator into a PII annotation process reduces unredacted PII and improves the quality of annotations to 0.99 recall, yielding clear benefit at reasonable cost; the cost advantages of annotation teams larger than two diminish rapidly.

Keywords

Patient data privacy - data sharing - natural language processing - cost analysis
