Methods Inf Med 2016; 55(04): 356-364
DOI: 10.3414/ME15-01-0122
Original Articles
Schattauer GmbH

Is the Juice Worth the Squeeze? Costs and Benefits of Multiple Human Annotators for Clinical Text De-identification

Authors

  • David S. Carrell

    1   Group Health Research Institute, Seattle, Washington, USA
  • David J. Cronkite

    1   Group Health Research Institute, Seattle, Washington, USA
  • Bradley A. Malin

    2   Vanderbilt University, Biomedical Informatics, Nashville, Tennessee, USA
  • John S. Aberdeen

    3   The MITRE Corporation, Information Technology Center, Bedford, Massachusetts, USA
  • Lynette Hirschman

    3   The MITRE Corporation, Information Technology Center, Bedford, Massachusetts, USA
Further Information

Publication History

received: 10 September 2015

accepted in revised form: 18 April 2016

Publication Date:
08 January 2018 (online)

Summary

Background: Clinical text contains valuable information but must be de-identified before it can be used for secondary purposes. Accurate annotation of personally identifiable information (PII) is essential to the development of automated de-identification systems and to manual redaction of PII. Yet annotation accuracy may vary considerably across individual annotators, and annotation is costly. The marginal benefit of incorporating additional annotators has not been well characterized.

Objectives: This study models the costs and benefits of incorporating increasing numbers of independent human annotators to identify the instances of PII in a corpus. We used a corpus with gold standard annotations to evaluate the performance of teams of annotators of increasing size.

Methods: Four annotators independently identified PII in a 100-document corpus consisting of randomly selected clinical notes from Family Practice clinics in a large integrated health care system. These annotations were pooled and validated to generate a gold standard corpus for evaluation.
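The pooling and evaluation described in the Methods can be illustrated with a minimal sketch. It assumes each PII instance is represented as a (document id, start offset, end offset) tuple and that a team's output is the set union of its members' annotations. The function names and the toy data are hypothetical and are not taken from the study.

```python
from itertools import combinations


def recall(found, gold):
    """Fraction of gold-standard PII instances present in `found`."""
    return len(found & gold) / len(gold)


def team_recalls(annotations, gold, size):
    """Recall of every team of `size` annotators, pooling by set union."""
    results = {}
    for team in combinations(sorted(annotations), size):
        pooled = set().union(*(annotations[name] for name in team))
        results[team] = recall(pooled, gold)
    return results


# Toy data (assumption): PII instances as (document id, start, end).
gold = {("doc1", 0, 5), ("doc1", 10, 14), ("doc2", 3, 9), ("doc2", 20, 28)}
annotations = {
    "A": {("doc1", 0, 5), ("doc1", 10, 14), ("doc2", 3, 9)},
    "B": {("doc1", 0, 5), ("doc2", 3, 9), ("doc2", 20, 28)},
    "C": {("doc1", 10, 14), ("doc2", 20, 28)},
}

singles = team_recalls(annotations, gold, 1)  # one entry per annotator
pairs = team_recalls(annotations, gold, 2)    # one entry per team of two
```

In this toy example no single annotator finds every instance, but every team of two does, which mirrors the kind of comparison the study makes across team sizes.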

Results: Measured against the gold standard, recall for all PII types ranged from 0.90 to 0.98 for individual annotators and from 0.998 to 1.0 for teams of three. Median cost per PII instance discovered during corpus annotation ranged from $0.71 for an individual annotator to $377 for annotations discovered only by a fourth annotator.
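One way to see why the cost per discovered instance climbs so steeply is to divide each added annotator's cost by the number of gold-standard instances that no earlier annotator found. The sketch below assumes a fixed cost per annotator and a single ordering of annotators. The summary does not describe the study's actual cost model, so the function name, the cost figure, and the toy data are all hypothetical.

```python
def marginal_cost_per_instance(ordered_found, gold, cost_per_annotator):
    """Cost per newly discovered gold instance for each annotator, in order.

    An annotator who finds nothing new gets an infinite cost per instance.
    """
    seen = set()
    costs = []
    for found in ordered_found:
        new = (found & gold) - seen  # instances nobody earlier had found
        costs.append(cost_per_annotator / len(new) if new else float("inf"))
        seen |= new
    return costs


# Toy data (assumption): 100 gold instances identified by integer id.
gold = set(range(100))
first = set(range(95))        # finds 95 instances
second = set(range(90, 99))   # overlaps heavily, adds 4 new instances
third = {98, 99}              # adds 1 new instance

# Hypothetical flat cost of 100.0 per annotator.
costs = marginal_cost_per_instance([first, second, third], gold, 100.0)
```

Because each later annotator mostly re-finds instances already discovered, the same annotation cost is spread over far fewer new instances, and the cost per new instance rises quickly.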

Conclusions: Incorporating a second annotator into a PII annotation process reduces unredacted PII and raises annotation recall to 0.99, yielding clear benefit at reasonable cost; the cost advantages of annotation teams larger than two diminish rapidly.