Keywords
Natural language processing - case detection - disease surveillance - generalizability - portability
1. Objective
Surveillance of a population for infectious and other diseases is an important public
health function. Public health has historically relied on clinical diagnosis, i.e.,
educating clinicians to report individuals with certain diseases when they are diagnosed.
However, due to incompleteness and time delays in clinician-based reporting, there
has been substantial interest in accomplishing the same task using automated algorithms
that analyze information collected by electronic medical records [[1]].
A key technical barrier to automatic case detection from electronic medical records
has been the lack of coded information about symptoms and signs of disease, which
clinicians at present record in unstructured free-text clinical notes. Natural language
processing (NLP) can be put to use on this problem, turning the unstructured information
contained within clinical notes into machine-readable, coded data.
An open problem with using NLP is that clinical notes take many forms and differ in
content and structure across institutions. Even within a single institution, these
characteristics may change over time, causing system performance to drift. For an NLP-based
approach to be useful in population disease surveillance, its portability characteristics
must be understood. To some degree, NLP tools must be resilient to local differences
if they are to be effective in providing information to support wide-scale disease
surveillance.
This study evaluates the accuracy and portability of a natural language processing
tool for extracting clinical findings of influenza across two large healthcare systems
located in different regions of the United States. Two NLP parsers were developed locally,
one at each institution, using that institution's clinical notes from emergency department
(ED) encounters indicating influenza; each parser was then evaluated at both institutions.
The effectiveness of each parser is evaluated by how well it supports downstream case
detection of influenza, which is a critical component of disease surveillance for public
health outbreak detection.
2. Background and Significance
Disease surveillance and outbreak detection are fundamental activities to assist in
early public health management and response to bioterrorism threats and infectious
disease outbreaks like influenza [[2]–[5]]. New strains of old diseases and new diseases continue to emerge that require ongoing
public health vigilance [[6], [7]]. Early outbreak detection systems that can be deployed quickly and at wide scale can
help address the need for rapid response and effective outbreak management [[8]]. Attempts to make outbreak detection systems portable have been limited and only partially
successful [[2], [9], [10]]. The Real-time Outbreak and Disease Surveillance (RODS) system developed at the University
of Pittsburgh was successfully implemented in Salt Lake City, Utah, during the 2002
Olympic Winter Games for surveillance of biological agents such as anthrax [[11]]. At the national level, the BioSense platform [[12]] and the ESSENCE system [[13]] have incorporated standardized tooling that can collect, evaluate, and share
syndromic surveillance information among health officials and government agencies.
To our knowledge, these are among the few automated biosurveillance systems that have tried
to address system interoperability and standardization across institutional boundaries.
An important new input to disease surveillance systems is the information contained
within unstructured clinical notes at treating healthcare institutions. At present,
disease reporting systems depend on clinicians to establish and report patients with
selected diagnoses. Recent work has shown that diagnoses can be inferred from clinical
notes [[14]–[20]]. Natural language processing (NLP) is the essential component used to extract relevant
clinical signs and symptoms from unstructured clinical notes that help to identify
the presenting syndrome in real-time surveillance systems [[21], [22]].
Clinical natural language processing techniques used for extraction of clinical findings
from clinical notes have advanced over the last decade to make NLP a viable operational
component in clinical systems [[14], [23]–[25]] and biosurveillance systems [[26]]. Unfortunately, these systems are usually developed in localized settings addressing
the needs of single healthcare institutions. Clinical notes vary to such an extent
across healthcare systems that NLP components are typically over-fit to the target
institution and do not generalize well when migrated to other healthcare settings
[[27], [28]]. This issue limits the ability to share NLP components across institutional boundaries.
Research and method advancements in areas like domain adaptation hold promise in addressing
this limitation in the hope that these systems can one day be shared and rapidly deployed
across institutional boundaries [[29]–[31]].
On the other hand, we need to better understand the performance characteristics of
clinical NLP components on downstream processes. NLP components are expensive to develop
and even more costly to develop in a generalized way. To our knowledge, very few studies
have been done to determine how good is good enough when it comes to generalizability
of NLP components and their impact on downstream processes in clinical pipeline systems
[[32], [33]]. This study examines this issue where extraction of clinical findings using NLP
is the source information supporting downstream case detection of influenza for outbreak
detection and disease surveillance.
To conduct this study we used an NLP tool developed at University of Pittsburgh called
Topaz [[34]]. Topaz is a pipeline system that extracts domain-specific clinical findings and
their modifiers from clinical notes using deduction rules. Modifiers include whether
the finding is absent or present, recent or historical, and whether experienced by
a patient or someone else such as a family member. Deterministic production rules
are constructed based on clinical text patterns that suggest the clinical finding
of interest. The pattern-matching is expressed in the form of regular expressions
which describe the text search patterns. These search expressions act as preconditions
that fire actions, together forming production rules. The system supports forward-chaining
of production-rules so that complex expressions and inferences can be made on clinical
text [[35]].
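To make the rule mechanism concrete, the following is a minimal, hypothetical Python sketch of regex-triggered production rules with forward chaining and the present-over-absent conflict resolution described in the next paragraph. Topaz's actual rule language, engine, and rule content differ; all rule patterns and finding names here are illustrative only.

```python
import re
from dataclasses import dataclass

# Hypothetical, minimal sketch of regex-triggered production rules with
# forward chaining; not Topaz's actual rule syntax or engine.

@dataclass(frozen=True)
class Fact:
    finding: str   # e.g. "cough"
    status: str    # "present" or "absent"

# Text rules: a regex precondition over the note text and the fact it asserts.
TEXT_RULES = [
    (re.compile(r"\bdenies\s+(?:any\s+)?cough\b", re.I), Fact("cough", "absent")),
    (re.compile(r"\bproductive\s+cough\b", re.I),        Fact("productive cough", "present")),
    (re.compile(r"\bcough\b", re.I),                     Fact("cough", "present")),
]

# Chained rules: facts that, once asserted, assert further facts (forward chaining).
FACT_RULES = [
    (Fact("productive cough", "present"), Fact("cough", "present")),
]

def extract(note: str) -> set:
    facts = {fact for regex, fact in TEXT_RULES if regex.search(note)}
    changed = True
    while changed:                                   # chain until a fixpoint is reached
        changed = False
        for trigger, consequent in FACT_RULES:
            if trigger in facts and consequent not in facts:
                facts.add(consequent)
                changed = True
    # Conflict resolution: keep "present" when a finding is asserted both ways.
    present = {f.finding for f in facts if f.status == "present"}
    return {f for f in facts if not (f.status == "absent" and f.finding in present)}

print(extract("Patient denies cough. Mother reports productive cough overnight."))
```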
Topaz supports conflict resolution when a finding is identified as both absent and
present in different segments of the same clinical note. It resolves the conflicting
finding assertions in favor of being present when both absent and present assertions
have been made. Topaz has four main modules: (1) a structural document preprocessor
to identify clinical note sections, sentence boundaries, and tokenization, (2) a UMLS
Metathesaurus [[36]] concept mapper with extension capabilities, (3) a forward-chaining deductive inference
engine, and (4) a conflicting-concept resolver. Topaz has been evaluated in several
studies that have reported reasonable operating characteristics [[17], [34], [37], [38]].
Other successful applications of rule-based NLP approaches to information extraction
have surfaced recently that use similar extraction techniques to that of Topaz [[39], [40]]. MedTagger [[41]] follows a similar processing paradigm to Topaz but it is integrated into the clinical
Text Analysis and Knowledge Extraction System (cTAKES) [[42]] as a component of that framework. Topaz is a complete pipeline system that, like cTAKES,
is built on the Unstructured Information Management Architecture (UIMA) framework
[[43]], although its primary focus is information extraction. MedTagger
is an information extraction component that relies on several of the other cTAKES
components for processing tasks like sentence splitting, section identification, and
negation. Both pipelines involve similar processing tasks although they package their
processing components differently. Topaz comes with default section header rules of
its own which can be extended for adaptation to institutional variances while MedTagger
relies on SecTag [[44]] for section header identification. Both systems can activate or deactivate the
use of section header identification although in Topaz this is done by directly inactivating
certain rule definition files rather than reconfiguring the pipeline itself. Although
both tools use regular expressions for matching, Topaz integrates the regular expression
matching into production rules that are executed by a more advanced forward chaining
deduction engine. This allows Topaz to perform very complex rule chaining,
supporting deeper inference capabilities in the information extraction process.
In this study, we used a Bayesian case detection system (CDS) developed at University
of Pittsburgh to classify each ED encounter as influenza, non-influenza influenza-like
illness (NI-ILI), or ‘other’ based on the symptoms and findings extracted by Topaz. Integral
to this system are Bayesian network [[45]] models and an inference engine that produce the disease classification [[17], [34], [38]]. CDS takes as input the clinical findings (F) for a patient case that are produced by Topaz. It outputs the posterior probability
distribution that the patient has one of several diseases given those findings. The
diseases D represented in the CDS Bayesian network model are influenza, NI-ILI, and ‘other’,
the last treated as one broad category. CDS performs probabilistic case detection
by using a Bayesian network model to compute P(D|F), the posterior probability of the disease given the findings. A Bayesian network
is a graphical model that provides a method of reasoning under uncertainty.
The nodes of a Bayesian network represent variables and arcs between the nodes represent
conditional dependencies between these variables. The strength of the relationships
between the nodes (variables) is represented as conditional probability distributions.
Bayesian network models factorize a joint probability distribution as the product
of its conditional probability distributions, which often yields a compact representation
of the joint distribution. Bayesian networks have been shown to be well suited for clinical
diagnostic prediction, where only a portion of the target clinical features may be
available for a patient case, because they are robust to missing data [[15], [17], [46]].
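As a concrete illustration of computing P(D|F), the sketch below uses a naive Bayes factorization over a few findings with made-up prior and conditional probabilities; the actual CDS models use K2-learned structures and parameters estimated from training data, so this shows only the general form of the inference.

```python
# Illustrative naive Bayes-style computation of P(D | F) over three diseases.
# All probabilities below are invented for the example.

PRIOR = {"influenza": 0.02, "NI-ILI": 0.10, "other": 0.88}

# P(finding = present | disease), illustrative numbers only.
P_PRESENT = {
    "reported fever": {"influenza": 0.90, "NI-ILI": 0.70, "other": 0.10},
    "other cough":    {"influenza": 0.80, "NI-ILI": 0.75, "other": 0.15},
    "myalgia":        {"influenza": 0.60, "NI-ILI": 0.30, "other": 0.05},
}

def posterior(findings):
    """findings maps a finding name to True (present) or False (absent).
    Findings not asserted in the note are omitted, which marginalizes them
    out under the conditional independence assumption used here."""
    scores = {}
    for disease, prior in PRIOR.items():
        p = prior
        for name, present in findings.items():
            p_f = P_PRESENT[name][disease]
            p *= p_f if present else (1.0 - p_f)
        scores[disease] = p
    z = sum(scores.values())
    return {disease: s / z for disease, s in scores.items()}

print(posterior({"reported fever": True, "myalgia": True}))
```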
3. Materials and Methods
Institutional Review Board (IRB) approval was obtained from both healthcare systems
prior to conducting this study.
3.1 Healthcare Systems Characteristics
Intermountain Healthcare (IH) is the leading integrated health care delivery system
in Utah. The health system operates 22 community, tertiary, and specialty hospitals,
a health plan, and 1,400 employed physicians. Intermountain also operates 185 clinics,
including primary care and urgent care clinics. Intermountain has 137,000
annual admissions and 502,000 annual emergency department (ED) visits across the entire
system.
The University of Pittsburgh Medical Center (UPMC) is closely affiliated with its
academic partner, the University of Pittsburgh, and is the leading integrated health
care delivery system in western Pennsylvania. UPMC operates 25 academic, community,
tertiary, and specialty hospitals, a health plan, and 2,500 employed physicians. UPMC
has 170,000 annual admissions with estimated 58% market share for Allegheny County
and 720,000 annual emergency department visits.
3.2 NLP Study Datasets
We constructed four datasets to carry out the NLP parser experiments in this study.
From each Healthcare System (IH and UPMC), one development/training corpus and one
test corpus were constructed.
3.2.1 Intermountain Healthcare NLP Datasets
The Intermountain Healthcare datasets were selected from ED encounters across 19 of
their facilities. The inclusion criteria consisted of positive influenza cases by
microbiology culture, direct fluorescent-antibody (DFA) testing, or polymerase chain
reaction (PCR) testing with at least one ED Physician or Licensed Independent Practitioner
report. The first report was selected for encounters with multiple physician notes.
The development/training corpus was constructed from the first 100 adult (age >= 6)
influenza cases spanning January 1st, 2007 – February 28th, 2008 and 100 pediatric (age < 6) cases spanning January 2nd, 2007 – February 16th, 2007, totaling 200 distinct cases. The test corpus was constructed using the next
consecutive clinical encounters. This corpus was made up of 100 adult influenza cases
spanning March 2nd, 2008 – June 8th, 2009 and 100 pediatric cases spanning February 17th, 2007 – March 20th, 2007, totaling 200 distinct cases.
3.2.2 University of Pittsburgh Medical Center NLP Datasets
The University of Pittsburgh Medical Center datasets were selected from ED encounters
across 5 EDs: UPMC Presbyterian Hospital, UPMC Shadyside Hospital, UPMC McKeesport
Hospital, UPMC Mercy Hospital, and Children’s Hospital of Pittsburgh of UPMC. The
inclusion criteria consisted of positive influenza cases confirmed by polymerase chain
reaction (PCR) testing with at least one clinician report. The earliest signed clinical
report was selected for encounters with multiple clinician reports.
The development/training corpus was constructed from the first 100 adult (age >= 6)
influenza cases spanning March 15th, 2007 – February 27th, 2008 and 100 pediatric (age < 6) cases spanning December 21st, 2007 – October 20th, 2009, totaling 200 distinct cases. The test corpus was constructed using the next
consecutive clinical encounters. This corpus was made up of 100 adult influenza cases
spanning February 28th, 2008 – March 26th, 2009 and 100 pediatric cases spanning October 20th, 2009 – February 12th, 2011, totaling 200 distinct cases.
3.3 Annotation Process
Three board-certified practicing physicians (one internist and two pediatricians
from each institution) identified 77 clinical findings by consensus, covering
the four diseases of study specified in the research design – influenza, respiratory
syncytial virus (RSV), metapneumovirus, and parainfluenza. They identified the clinical
findings based on experience treating cases of influenza within their respective institutions.
Seventy of these clinical findings were relevant to influenza as shown in ►[Appendix A]. The other three diseases would be studied in later research. Annotation of a clinical
finding involved identifying the clinical finding as either absent or present in the
clinical note and marking the text phrase indicating the finding. An outside, independent
annotation service (University of Utah Core Research Lab) was contracted to provide
annotation services for this study. Four licensed RNs were trained as annotators from
a master annotation guideline providing the clinical finding definition accompanied
with example phrases, utterances, and lexical variants commonly documented by treating
clinicians for each clinical finding. The eHOST (Extensible Human Oracle Suite of
Tools) [[47]] open source annotation tool was used for annotation. Training was performed using
80 (40 adult/40 pediatric) reports randomly selected from the training corpus of each
of the two healthcare systems, IH and UPMC. This represented a total of 160 annotated
training cases, 80 from each site. The annotation training was conducted over four
rounds, each consisting of 20 clinical notes in each round. For each round, two randomly
selected annotators were given the same set of 20 reports and kappa was calculated
between those annotators to assess consistency. Discrepancies between annotator pairs
were adjudicated by the physician board-certified in internal medicine and feedback
was provided. The four rounds of annotation training resulted in kappa scores above
0.80 between annotator pairs.
Appendix A Distribution of Annotated Clinical Findings for Influenza
Influenza Clinical Finding | IH | UPMC | Influenza Clinical Finding | IH | UPMC
Abdominal Pain | 92 | 91 | Myalgia | 41 | 91
Abdominal Tenderness | 165 | 201 | Nasal Flaring | 73 | 5
Acute Onset of Symptoms | 9 | 24 | Nausea | 39 | 121
Anorexia | 81 | 67 | Nonproductive Cough | 20 | 28
Apnea | 21 | 9 | Other Abnormal Breath Sounds | 227 | 196
Arthralgia | 8 | 5 | Other Abnormal X-ray Finding | 75 | 164
Barking Cough | 14 | 5 | Other Cough | 357 | 310
Bilateral Acute Conjunctivitis | 124 | 87 | Other Pneumonia | 113 | 162
Bronchiolitis | 31 | 16 | Paroxysmal Cough | 2 | 4
Bronchitis | 4 | 22 | Pharyngitis Diagnosis | 18 | 13
Cervical Lymphadenopathy | 110 | 79 | Pharyngitis on Exam | 153 | 130
Chest Pain | 34 | 107 | Poor Antipyretics Response | 54 | 23
Chest Wall Retractions | 116 | 51 | Poor Feeding | 85 | 96
Chills | 10 | 81 | Productive Cough | 9 | 48
Conjunctivitis | 1 | 3 | Rales | 217 | 181
Crackles | 220 | 206 | Reported Fever | 580 | 623
Croup | 19 | 26 | Respiratory Distress | 250 | 192
Cyanosis | 100 | 39 | Rhonchi | 272 | 187
Decreased Activity | 74 | 56 | Rigor | 1 | 4
Diarrhea | 142 | 145 | RSV Lab Testing Only Ordered | 18 | 1
Dyspnea | 137 | 306 | RSV Positive Result | 45 | 2
Grunting | 70 | 8 | Runny Nose | 191 | 85
Headache | 96 | 126 | Seizures | 18 | 19
Hemoptysis | 4 | 12 | Sore Throat | 77 | 77
Highest Measured Temperature | 189 | 188 | Streptococcus Positive Result | 46 | 10
Hoarseness | 9 | 2 | Stridor | 47 | 30
Hypoxemia (SpO2 < 90% on RA) | 278 | 183 | Stuffy Nose | 127 | 94
Ill Appearing | 85 | 57 | Tachypnea | 264 | 239
Infiltrate | 120 | 224 | Toxic Appearance | 101 | 14
Influenza Lab Testing Only Ordered | 45 | 116 | Upper Respiratory Infection | 51 | 129
Influenza Positive Result | 146 | 10 | Viral Pneumonia | 16 | 1
Influenza-like Illness or URI | 210 | 109 | Viral Syndrome | 192 | 136
Lab Ordered (Nasal Swab) | 4 | 62 | Vomiting | 254 | 221
Lab Testing 2+ Resp. Pathogens Including Influenza | 112 | 8 | Weakness and Fatigue | 47 | 132
Malaise | 11 | 21 | Wheezing | 349 | 324
Focus was then turned to the test corpora. The test corpora consisted of 200 (100
adult/100 pediatric) clinical notes from each healthcare system broken into 5 paired
annotation sets containing 24 reports per individual set. Each paired set had 8 duplicate
reports contained within the pair so that inter-annotator agreement across paired
annotators could be measured. So for any paired annotation set, this represented 40
distinct reports – 16 unique reports in each individual annotation set, and an additional
8 duplicated reports across the pair. The annotators were assigned annotation sets
such that all four annotators would be equally paired with one another across all
of the annotation sets. This process resulted in 200 (100 adult/100 pediatric) distinct
annotated reports being generated for each healthcare system for testing purposes.
Inter-annotator agreement was measured and reported using Fleiss’ kappa [[48]] across the 80 shared reports, representing 20% of the reports. These 80 clinical notes,
duplicated across annotation sets for the purpose of measuring kappa, were then adjudicated
by a board-certified physician.
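A minimal sketch of how Fleiss' kappa can be computed from such paired annotations, assuming they are first reduced to an instances-by-categories count table; the study's actual agreement calculation is not shown here and may have been performed differently.

```python
# Sketch: Fleiss' kappa from annotations reduced to an instances x categories
# count table (columns: finding absent, finding present). Each row sums to the
# number of annotators who rated that instance (two in the paired design).
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

ratings = np.array([
    [2, 0],   # both annotators marked the finding absent
    [0, 2],   # both marked it present
    [1, 1],   # disagreement
    [0, 2],
])

print(fleiss_kappa(ratings, method="fleiss"))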
3.4 NLP Parser Development
Topaz [[34]] is a rule-based natural language processing parser capable of extracting clinical
concepts by applying domain specific pattern-matching and deduction rules. Two influenza
parsers were developed separately by two different development teams blinded to one
another’s development activities. One team was provided the IH training corpus while
the other team was given the UPMC training corpus. Each training corpus contained
the 80 annotated clinical notes and the additional 120 unannotated notes. Each team
consisted of an experienced NLP software engineer and a board-certified pediatrician
on staff at that institution. No communication between teams was permitted. They
were allowed to evaluate their respective parsers on the local annotated training
corpus provided to each team as often as deemed necessary but were unable to evaluate
their parser against each other’s corpus during the development phase of the study.
Once each team felt that their parser’s operating characteristics were optimal, the
systems were evaluated one time against the test corpus of each healthcare system
to determine the cross-compatibility performance characteristics.
Recall, precision, and F1-score were measured to characterize both local healthcare system performance and cross-site compatibility.
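A simple sketch of these metrics, assuming parser output and reference annotations are represented as sets of (report, finding, status) assertions; the representation and names are illustrative, not taken from the study's code.

```python
# Sketch of the evaluation metrics over sets of (report, finding, status) assertions.

def recall_precision_f1(system: set, gold: set):
    tp = len(system & gold)                       # correctly asserted findings
    precision = tp / len(system) if system else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f1

gold = {("r1", "cough", "present"), ("r1", "fever", "absent")}
system = {("r1", "cough", "present"), ("r1", "myalgia", "present")}
print(recall_precision_f1(system, gold))
```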
3.5 CDS Study Datasets
At the core of CDS are Bayesian network classifier models that perform case detection
based on the clinical findings extracted from the clinical notes by the NLP parsers.
We built four Bayesian network classifiers that differed in source training data (IH
or UPMC clinical notes) and NLP parser (IH NLP parser or UPMC NLP parser) to extract
the clinical findings. The training datasets were used to develop the Bayesian network
models with machine learning, learning both the network structure and the probability
distributions of the models. We used the K2 algorithm [[49]] to learn the structure of the models from this training data. The K2 algorithm
uses a forward-stepping, greedy search strategy to identify the conditional dependency
relationships (arcs) among the variables (nodes) of a Bayesian network that produce
a locally optimal network structure [[49]].
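The following is a simplified, hypothetical sketch of the K2-style greedy parent search. It substitutes a penalized log-likelihood for the Bayesian (K2) score actually used, so it illustrates the search strategy rather than reproducing the study's implementation.

```python
# Simplified K2-style greedy parent search over binary variables. A penalized
# log-likelihood stands in for the Bayesian (K2) score; the node ordering and
# the maximum number of parents are fixed in advance, as K2 requires.
import math
from collections import Counter

def score(node, parents, data):
    """Penalized log-likelihood of `node` given `parents` over binary records."""
    counts = Counter((tuple(r[p] for p in parents), r[node]) for r in data)
    parent_counts = Counter(tuple(r[p] for p in parents) for r in data)
    ll = sum(c * math.log(c / parent_counts[cfg]) for (cfg, _), c in counts.items())
    return ll - 0.5 * (2 ** len(parents)) * math.log(len(data))

def k2(order, data, max_parents=2):
    structure = {}
    for i, node in enumerate(order):
        parents, best = [], score(node, [], data)
        candidates = list(order[:i])                # only earlier nodes may be parents
        while candidates and len(parents) < max_parents:
            new_best, cand = max((score(node, parents + [c], data), c) for c in candidates)
            if new_best <= best:
                break                               # no candidate improves the score
            best, parents = new_best, parents + [cand]
            candidates.remove(cand)
        structure[node] = parents
    return structure

# Toy records: a disease variable plus two findings.
data = [{"flu": 1, "fever": 1, "cough": 1}, {"flu": 0, "fever": 0, "cough": 1},
        {"flu": 1, "fever": 1, "cough": 0}, {"flu": 0, "fever": 0, "cough": 0}] * 25
print(k2(["flu", "fever", "cough"], data))
```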
We labeled as influenza patient encounters with a positive laboratory test for influenza by polymerase chain
reaction (PCR), direct fluorescent antibody (DFA), or viral culture. Among the remaining
encounters, we labeled as NI-ILI cases with at least one negative test by PCR, DFA, or culture. All remaining encounters
were labeled as other.
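A small sketch of this labeling rule, with a hypothetical representation of the laboratory results:

```python
# Sketch of the encounter labeling rule; the test-result representation is hypothetical.

def label_encounter(tests):
    """tests: list of (test_type, result), e.g. [("PCR", "positive")]."""
    results = [r for t, r in tests if t in {"PCR", "DFA", "culture"}]
    if "positive" in results:
        return "influenza"
    if "negative" in results:
        return "NI-ILI"     # influenza testing done but not confirmed
    return "other"          # no influenza testing on the encounter

print(label_encounter([("PCR", "negative")]))   # NI-ILI
print(label_encounter([]))                      # other
```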
3.5.1 Intermountain Healthcare CDS Training Datasets
The Intermountain Healthcare CDS training dataset consisted of 47,504 ED encounters
between January 1, 2008 and May 31, 2010, including 1,858 influenza, 15,989 NI-ILI, and 29,657 other encounters. The IH training dataset represented 60,344 clinical notes. When an encounter
was associated with more than one clinical note, we used the union of clinical findings
from all of the clinical notes.
3.5.2 University of Pittsburgh Medical Center CDS Training Datasets
The University of Pittsburgh Medical Center CDS training dataset consisted of 41,189
ED encounters drawn from the same period as Intermountain Healthcare and labeled using
the same criteria. This training dataset included 915 influenza, 3,040 NI-ILI, and 37,234 other encounters. The UPMC training dataset was associated with 76,467 clinical notes.
Again, for encounters with multiple notes we used the union of the clinical findings from
all of the clinical notes.
3.5.3 Healthcare Systems’ CDS Test Datasets and Evaluation
To test downstream CDS performance based on the signs and symptoms extracted using
the healthcare systems’ NLP parsers (IH NLP Parser and UPMC NLP Parser), we collected
one year of ED visits as a test dataset from each healthcare system spanning June
1st, 2010 – May 31st, 2011. There were 220,276 IH ED clinical notes representing 182,386 ED visits and
480,067 UPMC ED notes representing 238,722 ED visits. The ED reports collected from
each healthcare system were parsed by the IH NLP parser and the UPMC NLP parser producing
four data sets: (1) IH reports parsed by the IH NLP parser, (2) IH reports parsed by the UPMC
NLP parser, (3) UPMC reports parsed by the UPMC NLP parser, and (4) UPMC reports parsed
by the IH NLP parser. The CDS Bayesian networks were then evaluated using these datasets
to determine the effects of NLP parser cross-compatibility on downstream case detection.
4. Results
4.1 Inter-annotator Agreement on NLP Datasets
Inter-annotator agreement was measured using Fleiss’ kappa [[48]]. On the Intermountain Healthcare test data set an agreement score of 0.81 (95%
CI: 0.80 to 0.81) was achieved over 1,477 clinical findings. On the University of
Pittsburgh Medical Center test data set a kappa score of 0.81 (95% CI: 0.80 to 0.82)
over 1,504 clinical findings was achieved. For both data sets reliable agreement was
reached.
4.2 Healthcare System Differences in Language Expressing Clinical Findings
To gain insight into the language used by each healthcare system to describe a particular
clinical finding, the annotated text from the test corpora describing a clinical finding
as absent or present was analyzed to identify statistically significant distributional
differences across healthcare systems. For example, for the clinical finding chest wall retractions – present, one institution may write in a note “was using his accessory muscles for respiration” while the other institution may write “does have some abdominal breathing and moderate subcostal and mild to moderate suprasternal
retractions”. ►[Figure 1] shows the results of Fisher’s Exact Test for homogeneity performed on the word
frequencies describing each clinical finding between healthcare systems. For each
healthcare system, the annotated text segments indicative of a clinical finding being
absent or present found in the clinical notes were broken down into a bag of words
with frequency counts. This was done for each clinical finding. The test statistic
was then applied between the bag of words frequency counts representing each clinical
finding between the institutions. This evaluation was performed on the annotated test
corpora with word stemming applied and stop words removed. P-values were adjusted
for multiple comparisons using the false discovery rate [[50]] method. Seventy percent (n = 49/70) of the clinical findings had statistically
significant (adjusted p-value < 0.05) differences in the language used to express
the clinical findings by each healthcare system.
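A sketch of this per-finding comparison with false discovery rate adjustment is shown below. Because SciPy's fisher_exact handles only 2 x 2 tables, a chi-square test of homogeneity on the institution-by-word table stands in here for the Fisher's exact test used in the study, and the word counts are illustrative only.

```python
# Sketch of the per-finding language comparison with Benjamini-Hochberg FDR
# adjustment; the chi-square test substitutes for the Fisher's exact test.
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.multitest import multipletests

# finding -> (IH word counts, UPMC word counts) over a shared vocabulary.
findings = {
    "chest wall retractions": (np.array([12, 3, 7, 0]), np.array([2, 9, 1, 8])),
    "reported fever":         (np.array([30, 5, 2, 1]), np.array([28, 6, 3, 2])),
}

p_values = []
for ih_counts, upmc_counts in findings.values():
    table = np.vstack([ih_counts, upmc_counts]) + 1   # +1 smoothing avoids zero cells
    _, p, _, _ = chi2_contingency(table)
    p_values.append(p)

rejected, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
for name, p, sig in zip(findings, p_adjusted, rejected):
    print(f"{name}: adjusted p = {p:.4f}, significant = {sig}")
```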
Fig. 1 Language differences expressing clinical findings for Influenza between Intermountain
Healthcare and University of Pittsburgh Medical Center. Adjusted significance is shown
on a logarithmic scale.
4.3 Clinical findings distributional characteristics between institutions
The frequency distribution of annotated clinical findings indicating influenza across
the test corpora of both institutions is illustrated in ►[Figure 2]. ►[Appendix A] also provides the frequency counts of the annotated clinical findings. Seventy clinical
findings were identified by expert annotation across the test corpus for each institution.
The distributional characteristics were evaluated by calculating a multinomial
chi-square goodness-of-fit [[51]] test statistic. The test produced a p-value < 0.0001, indicating that
the two frequency distributions are drawn from different underlying distributions. This
finding may imply that (1) the signs and symptoms documented by clinicians in the
clinical notes to describe influenza cases differ between institutions or (2) influenza
patient presentations may differ between institutions. The ten
most frequent clinical findings found in the clinical notes for each institution are
shown in ►[Table 1].
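A minimal sketch of this comparison, treating one institution's rescaled finding counts as the expected multinomial frequencies. The counts are a small subset taken from Appendix A, used here only to show the mechanics, not to reproduce the reported statistic.

```python
# Sketch of the distributional comparison of finding counts between sites.
import numpy as np
from scipy.stats import chisquare

ih_counts = np.array([580, 357, 349, 278, 272])     # IH counts for five findings
upmc_counts = np.array([623, 310, 324, 183, 187])   # UPMC counts, same findings

expected = ih_counts / ih_counts.sum() * upmc_counts.sum()
statistic, p_value = chisquare(f_obs=upmc_counts, f_exp=expected)
print(statistic, p_value)
```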
Fig. 2 Frequency distribution of annotated clinical findings for Influenza.
Table 1
Top ten annotated clinical findings by institution indicating Influenza
Intermountain Healthcare (IH) | Frequency | University of Pittsburgh Medical Center (UPMC) | Frequency
reported fever | 580 | reported fever | 623
other cough | 357 | wheezing | 324
wheezing | 349 | other cough | 310
hypoxemia (SpO2 on room air < 90%) | 278 | dyspnea | 306
rhonchi | 272 | tachypnea | 239
tachypnea | 264 | infiltrate | 224
vomiting | 254 | vomiting | 221
respiratory distress | 250 | crackles | 206
other abnormal breath sounds | 227 | abdominal tenderness | 201
crackles | 220 | other abnormal breath sounds | 196
4.4 NLP parser performance for influenza clinical findings
►[Table 2] presents the performance characteristics for each NLP parser using the rule set
that was developed from the institution’s development/training corpus but evaluated
on the test corpora from both institutions. In other words, the IH NLP parser extraction
rules were constructed using information contained within the Intermountain Healthcare
training corpus and the UPMC NLP parser extraction rules were constructed using information
contained within the University of Pittsburgh Medical Center training corpus. The
goal of this experiment was to evaluate the performance loss in applying locally developed
extraction rule sets against foreign (non-local) clinical notes to assess generalizability
and determine the downstream effects of any upstream performance loss. On the Intermountain
Healthcare test corpus, recall, precision, and F1-score of the IH NLP parser were 0.71 (95% CI: 0.70 to 0.72), 0.75 (95% CI: 0.73 to
0.76), and 0.73 respectively, and the UPMC NLP parser, 0.67 (95% CI: 0.65 to 0.68),
0.79 (95% CI: 0.78 to 0.80), and 0.73. For the Intermountain Healthcare corpus, the
local IH NLP parser had statistically significant (p-value < 0.0001) better recall
than the non-local UPMC NLP parser, but statistically significant (p-value < 0.0001)
lower precision. The F1-scores were equal. It is difficult to determine which of the parsers is better
suited in this instance, as that depends on whether precision or recall is more
important for the target disease differentiation. On the University of Pittsburgh
Medical Center test corpus, recall, precision, and F1-score of the UPMC NLP parser were 0.73 (95% CI: 0.71 to 0.74), 0.80 (95% CI: 0.79
to 0.82), and 0.76 respectively, and the IH NLP parser, 0.53 (95% CI: 0.51 to 0.54),
0.80 (95% CI: 0.78 to 0.81), and 0.64. For the UPMC corpus, the local UPMC NLP parser
had statistically significant (p-value < 0.0001) better recall than the non-local
IH NLP parser and equivalent precision, and the F1-score of the UPMC NLP parser was substantially higher. In this case, the higher recall
and equivalent precision of the local UPMC NLP parser would suggest an operational preference
over the non-local IH NLP parser. In both compatibility comparisons the local parser
may be preferred over the non-local parser.
Table 2
Summary of NLP Parser Performance
Metric | IH Parser evaluated on IH Corpus | UPMC Parser evaluated on IH Corpus | p Value (IH Parser vs UPMC Parser) on IH Data | UPMC Parser evaluated on UPMC Corpus | IH Parser evaluated on UPMC Corpus | p Value (UPMC Parser vs IH Parser) on UPMC Data
Recall | 0.71 (0.70 to 0.72) | 0.67 (0.65 to 0.68) | < 0.0001 | 0.73 (0.71 to 0.74) | 0.53 (0.51 to 0.54) | < 0.0001
Recall – Present | 0.65 (0.63 to 0.67) | 0.63 (0.61 to 0.65) | 0.3080 | 0.68 (0.66 to 0.70) | 0.47 (0.44 to 0.49) | < 0.0001
Recall – Absent | 0.76 (0.75 to 0.78) | 0.69 (0.67 to 0.71) | < 0.0001 | 0.77 (0.75 to 0.78) | 0.58 (0.56 to 0.60) | < 0.0001
Precision | 0.75 (0.73 to 0.76) | 0.79 (0.78 to 0.80) | < 0.0001 | 0.80 (0.79 to 0.82) | 0.80 (0.78 to 0.81) | 0.5188
Precision – Present | 0.74 (0.72 to 0.76) | 0.70 (0.68 to 0.72) | 0.0059 | 0.79 (0.77 to 0.81) | 0.80 (0.78 to 0.82) | 0.5612
Precision – Absent | 0.75 (0.73 to 0.77) | 0.87 (0.86 to 0.89) | < 0.0001 | 0.81 (0.80 to 0.83) | 0.80 (0.78 to 0.81) | 0.1639
F1 Score | 0.73 | 0.73 | – | 0.76 | 0.64 | –
F1 Score – Present | 0.69 | 0.66 | – | 0.73 | 0.59 | –
F1 Score – Absent | 0.75 | 0.77 | – | 0.79 | 0.67 | –
Clinical findings identified in clinical notes are asserted as present or absent.
95% confidence intervals in parentheses.
P-values calculated using the chi-square (χ²) test of two proportions.
4.5 Downstream case detection performance with NLP parsers
The effects of the NLP parsers on downstream case detection were evaluated based on
the ability of the Bayesian case detection system to discriminate between influenza
and non-influenza, as well as the more difficult case of influenza versus non-influenza
influenza-like illness. ►[Figure 3] illustrates the effects of NLP parser performance on case detection by comparing
area under the receiver operating characteristic curves (AUROCs). Significance was
calculated using DeLong’s [[52]] statistical AUROC comparison test. The AUROCs and comparison test statistic results
are shown in ►[Table 3]. For discriminating between influenza and non-influenza, the two parsers supported
case detection almost equivalently on Intermountain Healthcare cases with an AUROC
of 0.932 for the IH NLP parser and 0.936 for the UPMC NLP parser. In detecting influenza
from non-influenza on University of Pittsburgh Medical Center cases, the local UPMC
NLP parser (AUROC = 0.954) outperformed the IH NLP parser (AUROC = 0.843). For the
more difficult discrimination of influenza from NI-ILI on IH cases, the non-local
UPMC NLP parser (AUROC = 0.748) outperformed the local IH NLP parser (AUROC = 0.698).
This may be because precision matters more than recall when distinguishing influenza
from NI-ILI: the two conditions have more similar disease presentations than influenza
versus non-influenza, so precise finding extraction becomes more important for separating
the cases. On the University of Pittsburgh Medical Center cases, the local UPMC NLP parser
better supported discrimination between influenza and NI-ILI, with an AUROC of 0.766. In all
but one instance (influenza versus NI-ILI using IH cases), the local parsers supported
downstream case detection as well as or better than the non-local parsers. Even so, the
results produced by the non-local parsers may still be considered within reasonable limits
when rapid and widespread surveillance deployment is needed.
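For reference, the following is a compact, illustrative implementation of DeLong's test for two correlated AUROCs evaluated on the same cases. It is a sketch of the standard placement-value formulation applied to hypothetical scores, not the study's analysis code.

```python
import numpy as np
from scipy.stats import norm

def delong_test(y_true, scores_a, scores_b):
    """Two-sided DeLong test comparing the AUROCs of two classifiers scored
    on the same cases (placement-value formulation, O(m*n))."""
    y_true = np.asarray(y_true)
    pos, neg = y_true == 1, y_true != 1

    def placements(scores):
        x, y = scores[pos], scores[neg]                 # positive / negative scores
        psi = (x[:, None] > y[None, :]) + 0.5 * (x[:, None] == y[None, :])
        return psi.mean(axis=1), psi.mean(axis=0)       # V10 (per positive), V01 (per negative)

    place = [placements(np.asarray(s, dtype=float)) for s in (scores_a, scores_b)]
    v10 = np.vstack([p[0] for p in place])
    v01 = np.vstack([p[1] for p in place])
    aucs = v10.mean(axis=1)
    s10, s01 = np.cov(v10), np.cov(v01)
    var = (s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / pos.sum() \
        + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / neg.sum()
    z = (aucs[0] - aucs[1]) / np.sqrt(var)
    return aucs, 2 * norm.sf(abs(z))                    # AUROCs and two-sided p-value

# Toy example: classifier a separates the classes better than classifier b.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
a = y + rng.normal(0, 1.0, 500)
b = y + rng.normal(0, 2.0, 500)
print(delong_test(y, a, b))
```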
Fig. 3 CDS Performance (AUROC) using different NLP parsers.
Table 3
CDS Performance with Different NLP Parsers with AUC Comparison Tests
Comparison | CDS Performance using IH NLP Parser on IH Corpus (Local Performance) | CDS Performance using UPMC NLP Parser on IH Corpus (Portability Performance) | DeLong’s test p-value (H0: difference between AUROCs = 0; HA: difference ≠ 0)
influenza vs non-influenza | 0.932 (0.924 – 0.940) | 0.936 (0.928 – 0.944) | 0.5176
influenza vs NI-ILI[*] | 0.698 (0.675 – 0.720) | 0.748 (0.727 – 0.769) | 0.0014
Comparison | CDS Performance using UPMC NLP Parser on UPMC Corpus (Local Performance) | CDS Performance using IH NLP Parser on UPMC Corpus (Portability Performance) | DeLong’s test p-value (H0: difference between AUROCs = 0; HA: difference ≠ 0)
influenza vs non-influenza | 0.954 (0.942 – 0.966) | 0.843 (0.820 – 0.866) | < 0.0001
influenza vs NI-ILI[*] | 0.766 (0.735 – 0.796) | 0.654 (0.620 – 0.687) | < 0.0001
* NI-ILI: non-influenza influenza-like illness
5. Discussion
As this study illustrates, it is common to find language differences in clinical notes
describing clinical findings among institutions. Even within institutions, dictation
styles and linguistic expression may vary among clinicians. These variations are typically
addressed in locally developed NLP systems because a representative sample of the
variation can be obtained. Yet this does not typically address the increased variation
experienced across institutional boundaries. This is one of the most difficult challenges
faced in generalizing modern NLP systems today. Whether rule-based or statistically
based, NLP systems are developed from training sets providing samples of phrases and
linguistic expression to draw upon in developing extraction rules or statistical extraction
methods.
A significant difference in NLP rule-development between Intermountain Healthcare
and University of Pittsburgh Medical Center was that the IH rule-developer considered
semi-structured section headers in an effort to gain an early context before applying
extraction rules. These section header identification rules addressed roughly 150
section header variants within the IH dataset alone. Post-study analysis determined
that the UPMC dataset had 15 section header types, none of which were syntactically
or semantically consistent enough with Intermountain section headers to be identified
by the IH NLP parser. Section header identification relies heavily on syntactical
attributes like capitalization, ending punctuation, number of line feeds, and structural
aspects like section order, in addition to the lexical content [[44]]. This contributed to the decreased performance of the IH NLP parser when run against
the UPMC clinical notes. Anticipated clinical note sections identifiable by these
rules were not recognized within the UPMC notes and therefore large segments of clinically
relevant text were ignored by the IH NLP parser.
Surprisingly, the UPMC NLP parser, whose rules did not take this approach, produced
better precision than the IH NLP parser but worse recall when run against the IH dataset.
This led to the UPMC NLP parser doing a better job at supporting downstream case detection
than the IH NLP parser. In the case of the UPMC NLP parser, all of the text segments
within the clinical notes were processed for relevant clinical findings, regardless
of institution. Besides this difference in approach, both rule sets used regular expression
matching with negation, deductive inference to identify the absence or presence of
a clinical finding, and the default conflict resolution logic when a finding is identified
as both absent and present in different segments of the same clinical note. The default
conflict resolution is to favor an assertion of present over absent.
A deeper analysis of the clinical finding extraction rules revealed that the IH NLP
parser rules were more specific in nature than the UPMC NLP parser rules. The IH rules
included surrounding context words while the UPMC NLP parser rules were in many instances
simple clinical finding terms without surrounding context. The effects of rule specificity
related to performance would be consistent with our findings of statistically significant
differences in language expressing the clinical findings between the two institutions.
By considering less surrounding context, there is less chance that language
differences will affect performance, as seen with the UPMC NLP parser.
Systems that show reasonable generalizing characteristics have typically done a good
job at incorporating guessing heuristics into their extraction algorithms. These guessing
heuristics are used in anticipation of coming across phrase expressions that were
not seen in the development/training corpus. In information extraction, it is challenging
to develop good guessing heuristics because extraction rules are typically over-fit
to the narrow lexical scope of the phrases in the local development corpus that express
a clinical concept. It is also very difficult
to develop general extraction methods to address synonymy at a complex phrase level.
As seen between Intermountain Healthcare and University of Pittsburgh Medical
Center, generalizability is further complicated because patient disease presentations,
and how clinicians express clinical findings across patient cases, may differ.
Another approach to better addressing generalizability may be the use of terminology
and ontology mapping tools such as MetaMap [[53]] to map clinical text to standard UMLS Metathesaurus [[36]] concepts that could be used to identify common findings in clinical notes across
institutions. This may help to provide an interoperability bridge improving NLP generalizability.
One of the limitations of this approach is that the text of clinical notes has been
shown to constitute its own sublanguage [[54]]. In a continuing study on domain adaptation of part-of-speech tagging for clinical
notes that took advantage of lexical content, the SPECIALIST lexicon only provided
48.7% vocabulary coverage across a clinical note corpus made up of the ten most common
clinical note types at Intermountain Healthcare [[55]]. The SPECIALIST lexicon is one of the foundational components of MetaMap. More
work needs to go into developing terminologies and ontologies that have coverage for
the clinical sublanguages found in clinical notes.
The encouraging message is that even when there is limited opportunity for generalizing
certain natural language processing tasks, our findings suggest that downstream processes
may still operate within reasonable limits. Although we did experience a drop-off
in NLP performance when applying some of the locally developed parsers across institutional
boundaries, the downstream task of case detection was still able to produce good results
in differentiating cases of influenza and non-influenza. The more difficult task of
identifying influenza from non-influenza influenza-like illness resulted in lower
performance, but surprisingly, the UPMC NLP parser (non-local parser) better supported
this case detection task when applied to Intermountain Healthcare cases. This may
be due to the UPMC NLP parser having better precision than the IH NLP parser on IH
clinical notes. This performance characteristic may be more important to distinguish
diseases with very similar signs and symptoms. This finding is encouraging and supports
our optimism that systems can be ported across institutional boundaries with reasonable
operating characteristics.
We developed the CDS system in large part to provide probabilistic information about
each patient case to an Outbreak Detection and Characterization System (ODS) that
we developed [[56]]. ODS uses that information to detect and characterize disease outbreaks. The information
is a likelihood of the form P(evidence | disease), where “evidence” is a set of clinical finding variables, and “disease” can be one
of a number of disease states, including for example influenza. A Bayesian network
provides a natural way to represent evidence and diseases and to infer the needed
likelihoods.
Most other machine learning methods, such as random forests, do not provide a direct
way to derive such likelihoods; rather, those methods are intended to directly derive
posterior probabilities of the form P(disease | evidence). We have also performed previous studies evaluating Bayesian networks against other
top machine learning algorithms, such as random forests, which have shown that Bayesian
networks perform comparably to these other top learning methods [[57]].
5.1 Limitations
The natural language processing tool used in this study takes a rule-based approach
to information extraction and was not well suited for experimenting with more advanced
statistically based NLP information extraction techniques. Some statistical approaches
open the door to exploring methods of unsupervised or semi-supervised domain adaptation
where source models can self-adapt to distributional characteristics of new domains
[[58]–[60]]. Due to limited available syndromic disease data across these two healthcare systems
and limited resources, we were unable to expand our compatibility research beyond
that of influenza. Also, compatibility studies involving more institutions need to
be performed to further assess the issues of interoperability and to draw further
insights into the challenges faced in porting natural language processing and Biosurveillance
systems across institutional and geographical boundaries.
6. Conclusion
Portability and rapid deployment of infectious disease surveillance systems across
geographic and institutional boundaries require tools that are resilient to local
knowledge representation differences for effective surveillance to take place. This
research addresses the important question of cross-institutional portability of natural
language processing systems to support disease surveillance. Our results suggest that
there is still the need for further research in methods development to produce more
generalizable NLP tools. Natural language processing is becoming an integral, disruptive
technology for supporting surveillance systems, although this study suggests it is sensitive
to institutional variability. To our knowledge, this is one of the few comprehensive
studies done on portability across institutional boundaries. There is a compelling
need for more study designs of this sort to be carried out on sub-systems like
clinical natural language processing, which can lead to more robust and effective public
health system solutions. Our research concludes that portability of systems that incorporate
NLP can be achieved to some degree, but more work needs to be done in this area.
Clinical Relevance
Disease surveillance and outbreak detection are critical activities to assist in early
public health management and response to bioterrorism threats and infectious disease
outbreaks [[2]–[5]]. Portability and rapid deployment of infectious disease surveillance systems across
geographic and institutional boundaries require tools that are resilient to local
knowledge representation differences for effective surveillance to take place. This
research addresses the important question of cross-institutional portability of natural
language processing systems to support disease surveillance.
Questions
1. Generalization of natural language processing (NLP) tools for use across institutional boundaries can be challenging because:
A) There is greater variability in dictation styles and the linguistic expression found in clinical notes across institutional boundaries than within single institutions.
B) NLP tools are better off being over-fit to local institutional clinical note variances because this improves the operational performance of these tools.
C) Domain adaptation of NLP tools has not shown much promise in providing an avenue for improved generalization.
D) The common use of terminology services to support natural language processing makes generalization difficult.
Answer: A)
As this study illustrates, it is common to find language differences in clinical notes
describing clinical findings among institutions. Even within institutions, dictation
styles and linguistic expression may vary among clinicians. These variations are typically
addressed in locally developed NLP systems because a representative sample of the
variation can be obtained. Yet this does not typically address the increased variation
experienced across institutional boundaries. This is one of the most difficult challenges
faced in generalizing modern NLP systems today. Whether rule-based or statistically
based, NLP systems are developed from training sets providing samples of phrases and
linguistic expression to draw upon in developing extraction rules or statistical extraction
methods.
2. Disease surveillance and outbreak detection are important public health functions because:
A) Surveillance systems can help to provide the necessary information to public health officials for more effective outbreak management.
B) New infectious and non-infectious pathogens continue to emerge that require ongoing public health awareness.
C) They improve national and public safety against bioterrorism threats.
D) All of the above
Answer: D)
Disease surveillance and outbreak detection are fundamental activities to assist in
early public health management and response to bioterrorism threats and infectious
disease outbreaks like influenza. New strains of old diseases and new diseases continue
to emerge that require ongoing public health vigilance. Early outbreak detection systems
that can be quickly deployed on a widescale can help address the need for rapid response
providing effective outbreak management.