Summary
Background: Clinical text contains valuable information but must be de-identified before it can
be used for secondary purposes. Accurate annotation of personally identifiable information
(PII) is essential to the development of automated de-identification systems and to
manual redaction of PII. Yet annotation is costly, and its accuracy may vary considerably
across individual annotators. The marginal benefit of incorporating additional annotators
has not been well characterized.
Objectives: This study models the costs and benefits of incorporating increasing numbers of independent
human annotators to identify the instances of PII in a corpus. We used a corpus with
gold standard annotations to evaluate the performance of teams of annotators of increasing
size.
Methods: Four annotators independently identified PII in a 100-document corpus consisting
of randomly selected clinical notes from Family Practice clinics in a large integrated
health care system. These annotations were pooled and validated to generate a gold
standard corpus for evaluation.
Results: Recall for all PII types ranged from 0.90 to 0.98 for individual annotators
and from 0.998 to 1.0 for teams of three, when measured against the gold standard. Median
cost per PII instance discovered during corpus annotation ranged from $0.71 for an
individual annotator to $377 for instances discovered only by a fourth annotator.
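As an illustration of the two quantities reported above, the following minimal sketch computes team recall and the marginal cost per newly discovered PII instance. It assumes that a team finds an instance when at least one member annotates it, and that each annotator has a fixed cost; the function names, the toy data, and the cost figure are hypothetical and are not taken from the study.

```python
def team_recall(gold, annotations, team):
    """Fraction of gold-standard PII instances found by at least one team member."""
    found = set().union(*(annotations[a] for a in team))
    return len(found & gold) / len(gold)


def marginal_cost_per_instance(gold, annotations, team, newcomer, annotator_cost):
    """Cost of adding `newcomer`, divided by the gold instances only they found.

    Assumption: a flat cost per annotator. Returns infinity when the
    newcomer contributes no new gold instances.
    """
    already_found = set().union(*(annotations[a] for a in team)) & gold
    newly_found = (annotations[newcomer] & gold) - already_found
    return annotator_cost / len(newly_found) if newly_found else float("inf")


# Toy data (hypothetical): 10 gold PII instances, three annotators.
gold = set(range(10))
annotations = {
    "A": {0, 1, 2, 3, 4, 5, 6, 7, 8},
    "B": {0, 1, 2, 3, 4, 5, 6, 7, 9},
    "C": {0, 1, 2, 3, 4, 5, 6, 8, 9},
}

recall_one = team_recall(gold, annotations, ("A",))
recall_two = team_recall(gold, annotations, ("A", "B"))
cost_second = marginal_cost_per_instance(gold, annotations, ("A",), "B", 10.0)
cost_third = marginal_cost_per_instance(gold, annotations, ("A", "B"), "C", 10.0)
```

In this toy example the second annotator contributes one new instance, while the third contributes none, so the marginal cost per instance grows without bound; this mirrors the qualitative pattern of rapidly rising cost for larger teams.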
Conclusions: Incorporating a second annotator into a PII annotation process reduces unredacted
PII and improves the quality of annotations to 0.99 recall, yielding clear benefit
at reasonable cost; the cost advantages of annotation teams larger than two diminish
rapidly.
Keywords
Patient data privacy - data sharing - natural language processing - cost analysis