Is the Juice Worth the Squeeze? Costs and Benefits of Multiple Human Annotators for Clinical Text De-identification

David S. Carrell; David J. Cronkite; Bradley A. Malin; John S. Aberdeen; Lynette Hirschman

doi:10.3414/ME15-01-0122

Methods Inf Med 2016; 55(04): 356-364
DOI: 10.3414/ME15-01-0122

Original Articles

Schattauer GmbH

Authors

David S. Carrell
¹Group Health Research Institute, Seattle, Washington, USA

David J. Cronkite
¹Group Health Research Institute, Seattle, Washington, USA

Bradley A. Malin
²Vanderbilt University, Biomedical Informatics, Nashville, Tennessee, USA

John S. Aberdeen
³The MITRE Corporation, Information Technology Center, Bedford, Massachusetts, USA

Lynette Hirschman
³The MITRE Corporation, Information Technology Center, Bedford, Massachusetts, USA

Further Information

Publication History

received: 10 September 2015

accepted in revised form: 18 April 2016

Publication Date:
08 January 2018 (online)

Summary

Background: Clinical text contains valuable information but must be de-identified before it can be used for secondary purposes. Accurate annotation of personally identifiable information (PII) is essential to the development of automated de-identification systems and to manual redaction of PII. Yet annotation is costly, its accuracy may vary considerably across individual annotators, and the marginal benefit of incorporating additional annotators has not been well characterized.

Objectives: This study models the costs and benefits of incorporating increasing numbers of independent human annotators to identify the instances of PII in a corpus. We used a corpus with gold standard annotations to evaluate the performance of teams of annotators of increasing size.

Methods: Four annotators independently identified PII in a 100-document corpus consisting of randomly selected clinical notes from Family Practice clinics in a large integrated health care system. These annotations were pooled and validated to generate a gold standard corpus for evaluation.

Results: Recall for all PII types ranged from 0.90 to 0.98 for individual annotators and from 0.998 to 1.0 for teams of three, when measured against the gold standard. Median cost per PII instance discovered during corpus annotation ranged from $0.71 for an individual annotator to $377 for annotations discovered only by a fourth annotator.

Conclusions: Incorporating a second annotator into a PII annotation process reduces unredacted PII and improves the quality of annotations to 0.99 recall, yielding clear benefit at reasonable cost; the cost advantages of annotation teams larger than two diminish rapidly.
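
The two quantities reported above, team recall against a gold standard and cost per newly discovered PII instance, can be illustrated with a small sketch. This is not the authors' code. It assumes that a team's annotations are pooled by taking the union of its members' annotations, and that marginal cost is an annotator's cost divided by the number of gold-standard instances only that annotator adds. The spans, annotator names, and dollar amounts are hypothetical toy values, not data from the study.

```python
from itertools import combinations
from statistics import median

# Hypothetical toy data: each PII instance is a (doc_id, start, end) span.
gold = {("d1", 0, 5), ("d1", 10, 14), ("d2", 3, 9), ("d2", 20, 27), ("d3", 1, 4)}
annotators = {
    "A": {("d1", 0, 5), ("d1", 10, 14), ("d2", 3, 9), ("d2", 20, 27)},
    "B": {("d1", 0, 5), ("d2", 3, 9), ("d2", 20, 27), ("d3", 1, 4)},
    "C": {("d1", 0, 5), ("d1", 10, 14), ("d2", 20, 27), ("d3", 1, 4)},
}
# Hypothetical cost (in dollars) of one annotator's full pass over the corpus.
pass_cost = {"A": 40.0, "B": 40.0, "C": 40.0}


def team_recall(team, annotations, gold):
    """Recall of the pooled (union) annotations of a team against the gold standard."""
    found = set().union(*(annotations[name] for name in team)) & gold
    return len(found) / len(gold)


def median_recall_by_size(annotations, gold):
    """Median recall over all possible teams of each size."""
    names = sorted(annotations)
    return {
        k: median(team_recall(team, annotations, gold) for team in combinations(names, k))
        for k in range(1, len(names) + 1)
    }


def marginal_cost(new_member, team, annotations, gold, cost):
    """Cost per gold instance found only by new_member; None if nothing new is found."""
    already = set().union(*(annotations[name] for name in team)) & gold
    newly = (annotations[new_member] & gold) - already
    return cost[new_member] / len(newly) if newly else None


recall_table = median_recall_by_size(annotators, gold)
cost_first = marginal_cost("A", (), annotators, gold, pass_cost)
cost_second = marginal_cost("B", ("A",), annotators, gold, pass_cost)
cost_third = marginal_cost("C", ("A", "B"), annotators, gold, pass_cost)
```

In this toy setup each added annotator costs the same but uncovers fewer new instances, so the cost per newly discovered instance rises with team size, which is the pattern the summary describes.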

Keywords

Patient data privacy - data sharing - natural language processing - cost analysis
