CC BY-NC-ND 4.0 · Yearb Med Inform 2018; 27(01): 184-192
DOI: 10.1055/s-0038-1667079
Section 9: Natural Language Processing
Survey
Georg Thieme Verlag KG Stuttgart

Advancing the State of the Art in Clinical Natural Language Processing through Shared Tasks

Michele Filannino
1  George Mason University, Fairfax, VA, USA
2  Massachusetts Institute of Technology, Cambridge, MA, USA
,
Özlem Uzuner
1  George Mason University, Fairfax, VA, USA
2  Massachusetts Institute of Technology, Cambridge, MA, USA
› Author Affiliations
Further Information

Correspondence to

Özlem Uzuner
4400 University Drive, MS 1G8
5359 Nguyen Engineering Building, Fairfax, VA 22030
USA   
Phone: +1 703 993 5996   

Publication History

Publication Date:
29 August 2018 (online)

 

Summary

Objectives: To review the latest scientific challenges organized in clinical Natural Language Processing (NLP) by highlighting the tasks, the most effective methodologies used, the data, and the sharing strategies.

Methods: We harvested the literature by using Google Scholar and PubMed Central to retrieve all shared tasks organized since 2015 on clinical NLP problems on English data.

Results: We surveyed 17 shared tasks. We grouped the data into four types (synthetic, drug labels, social data, and clinical data) which are correlated with size and sensitivity. We found named entity recognition and classification to be the most common tasks. Most of the methods used to tackle the shared tasks have been data-driven. There is homogeneity in the methods used to tackle the named entity recognition tasks, while more diverse solutions are investigated for relation extraction, multi-class classification, and information retrieval problems.

Conclusions: There is a clear trend in using data-driven methods to tackle problems in clinical NLP. The availability of more and varied data from different institutions will undoubtedly lead to bigger advances in the field, for the benefit of healthcare as a whole.


#

1 Introduction

Recent years have seen an increase in the number of scientific challenges, also called shared tasks, organized for the advancement of Natural Language Processing (NLP) in clinical data[1]. Shared tasks promote work specific to a “challenge question” posed to the research community and aim to evaluate the state of the art. Without the unifying framework of a shared task, even though the NLP community might work on the same general problem, the nuances of problems will vary, to the point where the approaches would not be comparable. For this reason, shared task organizers often provide a data set, annotated with gold standard annotations, for system development and tuning. The evaluation of the systems on the challenge question takes place on a held-out data set. This setup provides a way of comparing systems head-to-head on the same data and task, and helps identify the state of the art. The shared task data may remain available for research beyond the challenge time frame, providing a common benchmark for assessing the quality of future attempts[2]. They also provide a great resource for training future generations: they are a great instrument to advance the research and engage students. The ready availability of datasets, evaluation scripts, and commentary provides an ideal environment that serves as a catalyst and motivator.

In addition to the above benefits, in the clinical domain, because of the scarcity of data and their poor availability, shared tasks make it possible for the global research community to tackle problems that would otherwise be inaccessible to them[3]. However, attaining these benefits requires overcoming some obstacles[4]:

  • Availability of data: Clinical data, i.e., data that contains clinical information, can take many forms. Most often used synonymously with electronic health records (EHRs), clinical data can contain social media data as well as information from resources such as drug labels. Each of these forms of data comes with its own challenges for access and use

  • De-identification: Health Insurance Portability and Accountability Act (HIPAA)[5] defines requirements for safeguarding of patient health data, indicating elements of private health information (PHI) that need to be protected. De-identification, i.e., removal of PHI from records, provides one way of addressing this concern. However, there is a downside to de-identi-ication: this process alters the contents of the original records and as a result some useful information may be lost. On the other hand, HIPAA-compliant de-identification may not be adequate in some cases, e.g., professions, which are not covered by HIPAA, are allowed to remain in the records even though, when rare enough, they could uniquely identify patients. This makes de-identification a challenging process that needs to strike a delicate balance between de-identifying the data, so that it can be shared, and preserving the medical content of the data, so that it can be useful for downstream medical research. As a result, de-identification often requires a manual review of the data - an expensive and time-consuming process that ultimately limits the size of the shared data

  • Annotation: Often the bigger cost in shared task organization comes from gold standard generation for the clinically-relevant task that is posed to the community[6]. Gold standard generation requires input from experts who are well-versed in the tasks studied. Experts tend to be medical professionals with high hourly rates - another parameter that needs to be balanced against the volume of desired data.

In this paper, we review the latest scientific challenges organized to tackle NLP problems on clinical data. We highlight the tasks and the most effective methodologies used to tackle these tasks, along with the data used.


#

2 Methods

This review focuses on shared tasks using clinical data to tackle NLP problems. The relevant studies have been identified by querying Google Scholar and PubMed with “((shared task) OR challenge) AND (clinical OR health OR EHR) AND (NLP)”. The returned articles were limited to those describing clinical NLP shared tasks which were published since 2015. This resulted in a total of 17 shared tasks. Four challenges took place in 2015, six in 2016 and seven in 2017. Sixteen shared tasks are complete and published. One is completed but still in the process of publishing the outcomes. For a survey of shared tasks held before 2015, one can refer to Velupillai et al.[7]. For a survey in the broader field of biomedical text mining, one can refer to Huang et al.[8].


#

3 Shared Tasks

Recent clinical NLP shared tasks have utilized social media data (e.g., Twitter, forum posts), journal articles (e.g., MEDLINE/ PubMed), as well as electronic health records (e.g., pathology reports, nursing admission notes, psychiatric evaluation records, etc.) and other health-related documents such as drug labels. Collectively, these shared tasks posed questions on a variety of data, including both de-identified real data and synthetic records. [Table 1] summarizes the key characteristics of each of the shared tasks. We present the shared tasks according to the type of data they use.

Table 1

Clinical NLP Challenges, the tasks they posed, and the number of participating teams, since 2015, ordered by data sensitivity.

Category

Year

Challenge name

Task description

Data type

Data source

teams type

Academia

Industry

Joint

Total

2015

TREC Clinical Decision Support (CDS)[9]

Paticnt-ccntcrcd information retrieval

Medical case narratives

Synthetic, PubMed

33

3

0

36

TREC Precision Medicine[11]

Synthetic

2017

> Track 1

> Track 2

Patient-centered literature article retrieval Patient-centered clinical trials retrieval

Semi-structured cases

Synthetic, PubMed, ClinicalTrials.gov

27

5

0

32

2016

CLEF cHealth[12]

Information extraction

Nursing handover notes

NICTA synthetic nursing handover notes

4

0

0

4

Text Analysis Conference (TAC) Adverse Drug Reaction Extraction from Drug Labels (ADR)[18]

> Track 1

ADR mentions and modifiers extraction

Prescription drug labels

2017

> Track 2

> Track 3

> Track 4

Relation extraction Positive ADR filtering Positive ADR normalization

Drug labels

Drugs-Library.com

6

3

1

10

2015

CLPsych: Depression and PTSD on Twitter[22]

Binary classification of depression and PTSD users

Social media

Twitter

3

0

0

3

Social Media Mining (SMM)[24]

2016

> Track 1

> Track 2

> Track 3

ADR classification Information extraction Concept normalization

Social media

Twitter

9

2

0

11

Online social data

2017

Social Media Mining for Health Applications (SMM4HA)[29]

> Track 1

> Track 2

> Hack 3

ADR classification Classification of medication intake Concept normalization

Social media

Twitter

12

1

0

13

2016

CLPsych: Triaging content in online peer-support forums[33]

Classification of mental health severity in 4 levels

Forum

RcachOut

13

1

1

15

2017

CLPsych: Triaging content in online peer-support forums[35]

Classification of mental health severity in 4 levels

Forum

RcachOut

12

2

1

15

2017

NTCIR-13 MedWeb[36]

8-class classification of diseases and symptoms

Multilingual Social media

Twitter

7

1

1

9

Analysis of Clinical Text (ACT)[39]

2015

> Track 1

> Track 2a

> Track 2b

Disorder NER and normalization

Template slot filling (given gold spans)

Disorder recognition and template slot filling (end-to-end)

Clinical notes

ShARc corpus (MIMIC)

18

3

0

21

2016

TREC Clinical Decision Support (CDS)[43]

Paticnt-ccntcrcd IR

Nursing admission notes

MIMIC, PubMed

21

5

0

26

Medication and Adverse Drug Events (MADE1.0)

2017

> Track 1

> Track 2

Medication, ADE, sign and symptom identification Relation extraction

Clinical notes

UMass Memorial Medical Center

Clinical TempEval[45]

> Track 1

Time expression extraction

2015

> Track 2

> Track 3

Event extraction

Relation extraction (wrt DCT)

Relation extraction (wrt narrative containers)

Pathology reports

Mayo Clinic

3

0

0

3

Clinical data

Clinical TempEval[46] > Track 1

Time expression extraction

2016

> Track 2

> Track 3

Event extraction

Relation extraction (wrt DCT)

Relation extraction (wrt narrative containers)

Pathology reports

Mayo Clinic

U

3

0

14

Clinical TempEval[48]

2017

> Track 1

> Track 2

Time expression extraction (cross-domain) Event extraction (cross-domain)

Pathology reports, Clinical notes

Mayo Clinic

9

2

0

11

> Track 3

Relation extraction (wrt DCT)

Relation extraction (wrt narrative containers)

Centers for Excellence in Genomics N-GRJD (CEGS-NGR1D)[51]

2016

> Track la

> Track lb

> Track 2

De-identification (cross-domain) Dc-identification

Psychiatric Symptom Severity Prediction

Psychiatric evaluation records

Partners Healthcare and Harvard Medical School

23

5

3

31

3.1 Synthetic Data

Synthetic data can serve as a placeholder for real data and allows to side-step the privacy issues related to real data. The downside of synthetic data is that its generation comes with a cost and must make sure that the synthetic data captures the characteristics of real data so that the solutions developed can be valid on real data.

The 2015 Text REtrieval Conference (TREC) Clinical Decision Support (CDS) shared task aimed at evaluating biomedical retrieval systems[9]. The organizers provided a set of 30 synthetic case narratives (called topics), consisting of a short textual description, a summary, and a diagnosis. They asked the participants to develop systems for retrieving the most relevant scientific articles within a collection of 733,138 articles[1] from PubMed Central (PMC)[2]. Thirty-six teams participated in this task, 33 from academia, three from industry. The top performing system achieved an inferred normalized discounted cumulative gain (infNDCG) of 38.21%[10] by combining several Information Retrieval (IR) models (BM25, PL2, BB2).

The 2017 TREC Precision Medicine (PM) shared task[11] utilized 30 semi-structured synthetic topics (e.g., disease, genetic variants, demographic information, and other factors) and evaluated IR systems for their ability to match topics with: 1) 26,759,399 abstracts from MEDLINE; and 2) 241,006 clinical trial descriptions from ClinicalTrials.gov[3]. Thirty-two teams participated in this task, 27 from academia, ive from industry. The top performing system achieved a precision at 10 (P@10) of 63.10% and 44.29% for track 1 and 2, respectively. This system combined a query expansion module with a heuristic scoring method for abstracts and trials.

The Conference and Labs of the Evaluation Forum (CLEF) eHealth 2016[4] shared task[12] used the National Information and Communications Technology Australia (NICTA) Synthetic Nursing Handover Data[15]. This data set consisted of 300 notes that were authored by a registered nurse[5]. Each note consisted of a patient profile and a free-form text paragraph. One of the proposed tasks asked to the participants on this data was to automatically pre-populate handover forms with relevant text-snippets (slot illing)[16]. Three teams participated in this task, all of them from academia. The top performing system scored 38.2% (F1-score) and relied on a Conditional Random Field (CRF) model that used a set of features extracted from Stanford CoreNLP, Unified Medical Language System (UMLS)[17], WordNet, regular expression patterns, and Latent Dirichlet Allocation (LDA) clusters[18]. A wrapper algorithm evaluated several different subsets of these features and ultimately selected the best one.


#

3.2 Real Data

Prescription Drug Labels

Prescription drug labels published by the Food and Drug Administration (FDA) contain information about uses of medications, indications, and side effects. They are meant for public use and are free of any privacy concern[6]. This makes them a good target for studying medication-related problems, such as identifying adverse drug reactions (ADRs), comparing ADRs presented in labels from different manufacturers for the same drug, and performing pharmacovigi-lance by identifying new ADRs not currently included in labels.

The 2017 Text Analysis Conference (TAC) ADR Extraction from Drug Labels[19] studied FDA drug labels. The organizers shared a dataset of 2,309 unannotated drug labels, 200 of which manually annotated with ADR spans, relations, and concept identifiers (IDs)[7]. TAC proposed four tasks: 1) ADR mentions and modifiers span extraction; 2) extraction of relations between ADRs and their corollaries; 3) filtering of positive ADRs; and 4) positive ADR normalization[20]. Ten teams took part in this task, six from academia, three from industry, and one j oint team. The same system ranked first on all tasks, where it achieved an Fl-score of 82.48%, 49.00%, 82.19% (macro Fl), and 85.33% (macro Fl), respectively. This system used two distinct bi-directional Long Short Term Memory (LSTM) -CRF models with some post-processing rules to tackle the first two tasks. A learning-to-rank approach using RankSVM (support vector machine) on the top 10 normalization candidates tackled Tasks 3 and 4.


#

Online Social Data

Among the information shared in social media are personal views, experiences, and even health information[21]. However, social media data are not free of privacy and ethics concerns[22]. Access to most social media data requires a registration and consent to the governing rules, which can prevent secondary uses and limit the maximum amount of data to be collected. If social media data are not de-identified, then they cannot be shared among institutions and must be (re-)obtained directly from their source, e.g., Twitter data are often “distributed” in the form of tweet IDs, user IDs, and download scripts. Since 2015, there have been six clinical NLP shared tasks that used social media data. Four of them have been manually de-identified (or anonymized) and require a data use agreement (DUA) to be signed. Some are available for download beyond the challenges’ timeframes.

The 2015 Computational Linguistics and Clinical Psychology Workshop (CLPsych) used Twitter data for classifying users based on depression and post-traumatic stress disorder (PTSD)[23]. The organizers collected, anonymized, and annotated tweets of the form “I have just been diagnosed with X”, with “X” being depression or PTSD. The resulting dataset included 7,857 million tweets from 477 depression patients, 396 PTSD patients, and 1,746 control users. The data were distributed according to Twitter terms of service, along with a privacy agreement that required protective measures for downloaded copies. The data are available for download[8] and require Institutional Review Board (IRB) approval and signing of the privacy policy. Four teams participated in this task, three from academia, one from industry. The best performing system achieved an average precision above 80%[24] and was based on a Support Vector Machine (SVM) with linear kernel and baseline lexical features with term-frequency-inverse document frequency (TF-IDF) weighting.

The 2016 Social Media Mining shared task (SMM)[25] studied tweets for identifying ADRs. A data set of 10,822 anonymized tweets[26] was annotated by two pharmacology experts and was made available to the participants[9]. The shared task consisted of three tracks: 1) classification of tweets as ADR- and non-ADR-related; 2) ADR span extraction from tweets; and 3) linking ADRs to their UMLS[17] concepts. Eleven teams took part in this task but only six are reported in the overview: four from academia, two from industry. In the first track, the best performing system achieved an F1-score of 41.95%[27] by using an ensemble of Random Forest models with unigram, bigram, and trigram features. Track 2 was tackled as a Named Entity Recognition (NER) task by all the participants with the most effective machine learning (ML) model being CRFs and achieving 61.10% F1-score[28] on a subset of the entire corpus (2,131 annotated tweets). The organizers did not receive submissions for Track 3. Track 1 was re-proposed at the 2017 workshop[29] along with two new tasks: classification of medication intake types, and normalization of clinical concepts to the Medical Dictionary for Regulatory Activities (MedDRA)[20]. The 2017 workshop also extended the 2016 dataset to 15,717 tweets for training and 9,961 for testing. For classification of ADR-related tweets, the top performing system achieved an F1-score of 43.5% with an SVM model trained on textual features and domain-specific word embeddings[30]. For classification of medication intake, the top performing system scored F1 at 69.3% and used convolutional neural networks (CNNs) on word embeddings[31]. Finally, the top performing system for concept normalization scored an F1-score of 88.5% and used an ensemble of linear and deep learning models[32].

The 2016 CLPsych shared task[33] used 65,024 posts from the online forum of ReachOut, an Australian non-proit that supports young people. A total of 1,227 posts were manually prioritized by three independent judges by how urgently they need a response from a moderator (i.e., paraprofessional support) in a 4-point scale. The remaining posts were left un-annotated to experiment with semi-supervised and unsupervised techniques. Fifteen teams took part in this task: 13 from academia, one from industry, and one joint team. The top performing system achieved a macro-averaged F1-score of 42% by using an ensemble of classifiers working on different granularity of text[34]. The task was repeated in 2017 with an expanded dataset (157,963 posts, of which 1,588 were annotated[10]) and attracted a similar number of teams[35]. The best performing team obtained a macro-averaged F1-score of 46.7%. The data from 2016 and for 2017 are available for download on request.

Finally, the 2017 NII Testbeds and Community for Information access and Research'13 (NTCIR-13) MedWeb shared task[11] used a dataset of 2,560 tweets in Japanese, English, and Chinese[36]. The organizers manually de-identified the data and shared them with the participants under a DUA. Participants were asked to label the data with eight diseases/symptoms: influenza, diarrhea, hay fever, cough/sore throat, headache, fever, runny nose, and cold. Four academic teams took part in the English subtask by submitting 12 systems. The best system[37] achieved an exact match accuracy of 88% by using an ensemble ofhierarchical attention networks (HAN) and deep character-level convolutional neural networks (CNNs). At the time of writing, only the training data was available for download[12].


#

Clinical Notes

Clinical notes constitute the most sensitive set of data for shared tasks. They are governed by HIPAA and access to these data can require human subjects training, as well as DUAs even when they are de-identified. Medical Information Mart for Intensive Care (MIMIC) is the most frequently used source of de-identified clinical notes. It contains health data of over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012[38]. Since 2015, two shared tasks have utilized MIMIC as their data set. Other shared tasks have used de-identified and annotated data from their own home institutions. Unless noted otherwise, these data are distributed with DUAs.

The 2015 Analysis of Clinical Text (ACT) shared task[39] utilized the ShaRe dataset[40] consisting of 531 manually annotated discharge summaries, electrocardiograms, echo, and radiology reports from MIMIC-II. ACT focused on two tasks: 1) detection and normalization of disorder mentions; and 2) template slot illing. Twenty-one teams took part in this task, 18 from academia, three from industry. These teams tackled the first task as a sequence labelling problem, using CRFs in combination with word embeddings and ad-hoc sentence clustering. The second task was proposed in two settings according to whether the participants used the gold or the predicted spans for the disorder mentions (Track 2.a and 2.b respectively). The best performing system for the first task scored 75.7% (strict F1-score)[41]. On the second task, the same system scored first in both settings: 88.6% (weighted accuracy score) on Track 2.a and 80.8% (F1 * weighted accuracy score) on Track 2.b[42]. This system tackled the tasks by using a combination of CRFs and a binary SVM, both based on part-of-speech tags and syntactic features.

The 2016 TREC Clinical Decision Support shared tasks[43] studied patient-centered IR. The organizers provided a set of 1.25 million scientific articles from PubMed Central (PMC), and 30 nursing admission notes from MIMIC-III (called topics). With permission from the MIMIC team, the notes were made publicly available without the need for a DUA. Even though the notes were already de-identified, the de-identification process was manually carried out a second time for maximum privacy protection. For consistency with the previous challenges in the series[9] [44],, only the notes’ history of present illness sections were provided to the participants[13]. Participants were asked to retrieve articles relevant for answering questions on diagnoses, tests, and treatments. Twenty-six teams took part in the challenge: 21 from academia, and five from industry. The top performing system achieved a precision at 10 of 40.33%, which is higher than the best score achieved in 2015 (see above). Despite this, the average results were lower than in 2015. The organizers ascribed the result to the difference of real Intensive Care Unit (ICU) notes from synthetic general practice notes.

NLP Challenges for Detecting Medication and Adverse Drug Events from Electronic Health Records (MADE1.0)[14] utilized 1,092 medical records from 21 cancer patients in the UMass Memorial Medical Center to propose three tasks: 1) clinical named entity recognition; 2) relation identification; and 3) end-to-end systems to conduct the first two tasks together. This shared task is currently completed but the overview paper is not published yet.

The Clinical TempEval challenges[45] [46], hosted at the Semantic Evaluation series (SemEval), used 600 de-identified clinical notes and pathology reports from cancer patients at the Mayo Clinic that are manually annotated with temporal expressions, medical events, and temporal relations[15]. In 2015, three academic teams participated in this shared task. In 2016, 14 teams participated, three of which were from industry. The 2016 results were better than those of the previous year, with the top performing system in time expression extraction achieving an F1-score of 79.5%[47] using linear and structural SVMs on morphological, syntactic, discourse, and word representation features. The same system ranked first also in the event (F1-score 90.3%) and temporal relations tasks (F1 75.6% for the relations with respect to document creation times (DCTs), and F1 47.9% for the ones among narrative containers).

Clinical TempEval 2017[48] studied a domain adaptation problem, from colon cancer to brain cancer pathology reports and clinical notes. The corpus for this task contained 1,216 notes from each of the two types of cancer patients at the Mayo Clinic[16]. The notes were manually de-identified and annotated by experts[6] [17]. Eleven teams took part in this shared task: nine from academia, and two from industry. The best performing system achieved F1-scores of57% for time expression spans (using an ensemble of CRFs, rules and decision trees)[49], 72% for event spans, 59% for temporal relations with respect to the DCT, and 32% for those among narrative containers[50]. The system used neural networks with character and word embeddings combined with SVMs. Those results were approximately 20% lower than the ones registered by systems trained and tested on the same domain[45] [46].

Finally, the CEGS-NGRID Shared Tasks and Workshop on Challenges in NLP for Clinical Data made available a corpus of 1,000 manually de-identified psychiatric evaluation records from Partners Healthcare[51]. The organizers extended the HIPAA definition of PHI for better privacy protection. They proposed two tasks: 1) de-identification[52], and 2) symptom severity prediction[53]. De-identification was studied in two subtasks: a) benchmarking pre-existing de-identification systems[54] [55] on psychiatric records[18] (called “sight unseen”); and b) regular de-identification. Overall, 31 teams took part, 23 from academia, five from industry, and three jointly from industry/aca-demia. The same system scored the highest in both subtasks of Task 1: F1-score of 79.85%[52], and F1-score 91.43%[55] respectively. The system used a combination of CRFs, BI-LSTMs, and rules. The result suggests that “out-of-the-box solutions provide a good start at building models that can be tuned to the new data”. In the second task, symptom severity prediction, the systems were scored using the Inverse Normalized Mean Absolute Error Macro-averaged (INMAEM), which weights a prediction's error according to its ordinal distance from the correct class. The top performing system used an ensemble of machine learning classifiers based on morphological, syntactic, and structural features and achieved an INMAEM score of 86.3% which is close to the level of accuracy recorded by the least experienced of the annotators.

The information presented in this section highlights how the data varies in its sensitivity to privacy, which inversely correlates with the available data size. Tasks range from NER to relation extraction, multi-class classification problems and information retrieval, with these last ones being the most successful in terms of both attracting participation and system performance. The use of CRF and BI-LSTM models is common to almost all the top performing NER systems. More diverse methods are used for the relation extraction and multi-class classification problems.


#
#
#

4 Discussion

The discussed shared tasks offer interesting insights related to the availability of data, the advances in the state-of-the-art techniques, the role of privacy, and the importance of data size in supporting the methodological advances.

4.1 Data Availability

The concerns of availability of data, privacy, and cost of annotation ultimately shape the landscape of the field and give direction to the state of the art. Attempts to bypass concerns of availability of data and privacy with synthetic data results in displacing the cost of de-identification to the cost of generating synthetic records and come at the risk of generating a synthetic set that may not represent real data perfectly. Efforts to use social media data to understand the user perspective on her/his health problems face the same kind of privacy concerns as the notoriously sensitive EHR data. They additionally run into constraints related to long term access to data: either they do not remain available after the challenge or they need to be re-obtained from the social media site itself. When the data are to be re-obtained, this leaves the fate of the data set in the hands of the users of social media and could be lost if the users delete the messages or their accounts.


#

4.2 Observing Advances in the State of the Art

Shared tasks continue to grow both in their numbers and in the participation they attract. Especially for the tasks that are organized regularly, the consistency in the tasks and growing datasets continue to attract growing numbers of participants. Some tasks such as de-identification and NER tend to recur because of their high practical value. [Table 2] shows the performances of the systems participating in the most recent shared tasks. It shows that tasks such as clinical named entity recognition (medications, times, events, PHI) are well understood with system performances above 70% (see TAC ADR 2017, CLPsych 2015, ACT 2015, and CEGS-NGRID 2016), while tasks such as relation extraction with performances below 50% need more attention (see the Clinical TempEval series). Clinical information retrieval tasks, with a performance around 50% (see the TREC series), show the need for further research. Finally, multi-class classification tasks (see the CLPsych series) show a performance below 50%, which can be partly justified by the lack of annotated data.

Table 2

List of shared tasks with data source, data size, sub-tasks descriptions, and best-performance score (metrics differ per challenge). The table also contains information about data availability after the challenge, whether the data have been de-identified, and whether they require a DUA to be signed.

Category

Year

Challenge name

Task description

Data type

Data source

Data size

De-identification / anonymization

DUA

Currently Available?

Best Performance

Measure

2015

TREC Clinical Decision Support (CDS)[9]

Patient-centered information retrieval

Medical case narratives

Synthetic, PubMed

30 topics, 730K articles

no

no

yes

38.21%

infNDCG

TREC Precision Medicine[11]

Synthetic

2017

> Track 1

> Track 2

Patient-centered literature article retrieval

Patient-centered clinical trials retrieval

Semi-structured cases

Synthetic, PubMed, ClinicalTrials.gov

30 topics, 27M abstracts, 241K trials

no

no

yes

63.10%

44.29%

P@10

P@10

2016

CLEF cHcalth[12]

Information extraction

Nursing handover notes

NICTA synthetic nursing handover notes

300 notes

no

no

yes

38.20%

Fl (macro avg.)

Text Analysis Conference (TAC) Adverse Drug Reaction Extraction from Drug Labels (ADR)[18]

Prescription drug labels

2017

> Track 1

> Track 2

> Track 3

> Track 4

ADR mentions and modifiers extraction

Relation extraction

Positive ADR filtering

Positive ADR normalization

Drug labels

Drugs-Library.com

2309 labels

no

no

yes

82.48%

49.00%

82.19%

85.33%

Fl

Fl

Fl (macro avg.)

Fl (macro avg.)

2015

CLPsych: Depression and PTSD on Twitter[22]

Binaty classification of depression and PTSD users

Social media

Twitter

7.8M tweets

yes

yes

yes

80.00%

Avg. Precision

Social Media Mining (SMM)[24]

2016

> Track 1

> Track 2

> Track 3

ADR classification

Information extraction

Concept normalization

Social media

Twiner

10,882 tweets

no

no

yes

41.95%

61,10%

-

Fl

Fl

Social Media Mining for Health Applications (SMM4HA)[29]

Online social data

2017

> Track 1

> Track 2

> Track 3

ADR classification

Classification of medication intake

Concept normalization

Social media

Twitter

15,777 tweets

no

no

yes

43.50%

69.30%

88.50%

Fl

Fl (micro avg.)

Accuracy

2016

CLPsych: Triaging content in online peer-support forums[33]

Classification of mental health severity in 4 levels

Forum

ReachOut

65,024 (1,227 annotated)

yes

yes

yes, on request

42.00%

Fl (macro avg.)

2017

CLPsych: Triaging content in online peer-support forums[35]

Classification of mental health severity in 4 levels

Forum

ReachOut

157,963 posts (1,588 annotated)

yes

yes

yes, on request

46.70%

Fl (macro avg.)

2017

NTCIR-13 MedWeb[36]

8-class classification of diseases and symptoms

Multilingual Social media

Twitter

2560 tweets

yes

yes

yes, on request

-

Analysis of Clinical Text (ACT)[39]

2015

> Track 1

> Track 2a

> Track 2b

Disorder NER and normalization

Template slot filling (given gold spans)

Disorder recognition and template slot filling (end-to-end)

Clinical notes

ShARc corpus (MIMIC)

531 summaries

yes

yes

yes

75.70%

88.60%

80.80%

Fl (strict)

Fl * weighted acc.

Fl * weighted acc.

2016

TREC Clinical Decision Support (CDS)[43]

Patient-centered IR

Nursing admission notes

MIMIC, PubMed

30 notes, 1.25M abstracts

40.33%

P@10

Medication and Adverse Drug Events (MADE1.0)

2017

> Track 1

> Track 2

Medication, ADE, sign and symptom identification

Relation extraction

Clinical notes

UMass Memorial Medical Center

1092 records

yes

yes

no

-

Clinical TempEval[45]

2015

> Track 1

> Track 2

> Track 3

Time expression extraction

Event extraction

Relation extraction (wrt DCT)

Relation extraction (wrt narrative containers)

Pathology reports

Mayo Clinic

600 notes

yes

yes

yes, on request

72.50%

87.50%

70.20%

12.30%

Fl

Fl

Fl

Fl

Clinical data

Clinical TempEval[46]

2016

> Track 1

> Track 2

> Track 3

Time expression extraction

Event extraction

Relation extraction (wrt DCT)

Relation extraction (wrt narrative containers)

Pathology reports

Mayo Clinic

600 notes

yes

yes

yes, on request

79.50%

90.30%

75.60%

47.90%

Fl

Fl

Fl

Fl

Clinical TempEval[48]

2017

> Track 1

> Track 2

> Track 3

Time expression extraction (cross-domain)

Event extraction (cross-domain)

Relation extraction (wrt DCT)

Relation extraction (wrt narrative containers)

Pathology reports, Clinical notes

Mayo Clinic

1216 notes

yes

yes

yes, on request

57.00%

72.00%

59.00%

32.50%

Fl

Fl

Fl

Fl

Centers for Excellence in Genomics N-GRID (CEGS-NGRID)[51]

2016

> Track la

> Track lb

> Track 2

De-identification (cross-domain)

Dc-identification

Psychiatric Symptom Severity Prediction

Psychiatric evaluation records

Partners Healthcare and Harvard Medical School

1000 records

yes

yes

yes, on request

79.85%

91.43%

86.30%

Fl

Fl

INMAE^M


#

4.3 Balancing Access, Privacy, and Corporate Confidentiality

Interestingly, until now, academic institutions have dominated shared task participation. Few of the shared tasks reviewed in this paper had a significant participation from industry (e.g., the TREC series and CEGS N-GRID). Industry bridges the gap between pure research and technology[56]. However, the stringent rules governing the use of data and the hesitation to openly share the methods for fear of losing intellectual property result in decreased participation. Attracting more companies to shared tasks would help in diversifying the methods and contributions, reduce the gap between academia and industry, and shorten the time it takes for methods to be adopted by industry.

DUAs required from participants before access to data vary in complexity. Some DUAs pose really strict requirements, e.g., storing the data on machines that are not connected to the Internet for the entire duration of the challenge. Keeping the terms of DUAs to those requirements that match the sensitivity level of data could open up more data sets to more parties for research and encourage participation of more parties.


#

4.4 Larger Datasets Support Methodological Advances

The approaches used to tackle problems in clinical NLP are almost entirely in the realm of data-driven methods. Named entity recognition tasks, such as medication or ADR extraction, are commonly solved using CRFs or deep learning approaches (BI-LSTMs), often with word embeddings although n-gram features are still used. Classification and relation extraction tasks are tackled using ensembles, often as a way of coping with the imbalance nature of classes. This makes a compelling argument for advocating the adoption of bigger datasets. Despite increasing the cost of design and annotation, richer data sets have the benefit of increasing the external validity of the developed solutions.


#
#

5 Conclusions

In this paper we reviewed the latest scientific challenges organized in clinical NLP, by highlighting the tasks, the most effective methodologies used, the data, and the sharing strategies. We surveyed 17 shared tasks, grouped by the type of data used (synthetic, drug labels, social data, and clinical data). We found that the type of data is correlated with its size and sensitivity. Recognition and classification of named entities are the most common tasks, usually tackled by data-driven approaches.

We hope that the growing number of success stories in shared task organization will encourage more institutions to share data. More and varied data from different institutions will undoubtedly lead to bigger advances in the field, for the benefit of healthcare as a whole.


#
#

Acknowledgments

We wish to thank Pierre Zweigenbaum, Aurelie Neveol, and the anonymous reviewers for their valuable comments.

Author Contribution

1 Available for download at http://trec-cds.appspot.com/2015.html


2 http://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/


3 Available for download at http://trec-cds.appspot.com/2017.html


4 Although CLEF eHealth is organized every year[13] [14], its main focus is multi-linguality and information retrieval rather than clinical NLP


5 Available for download at https://sites.google.com/site/clefehealth2016/


6 Accessible at http://www.drugs-library.com


7 Available for download at https://bionlp.nlm.nih.gov/tac2017adversereactions/


8 http://www.cs.jhu.edu/~mdredze/datasets/clpsych_shared_task_2015/


9 Can be downloaded using a script at http://diego.asu.edu/downloads


10 http://clpsych.org/shared-task-2017/


11 http://mednlp.jp/medweb/NTCIR-13/


12 http://mednlp.jp/medweb/NTCIR-13/#dataset


13 Available for download at http://trec-cds.appspot.com/2016.html


14 https://bio-nlp.org/index.php/projects/39-nlp-challenges


15 The data are available under DUA at http://alt.qcri.org/semeval2015/task6/ (2015) and http://alt.qcri.org/ semeval2016/task12/ (2016).


16 The data are available under DUA at http://alt.qcri.org/semeval2017/task12/


17 The annotations are available at https://github.com/stylerw/thymedata


18 Only unannotated test data were released to prevent participants from adapting systems to the new data.



Correspondence to

Özlem Uzuner
4400 University Drive, MS 1G8
5359 Nguyen Engineering Building, Fairfax, VA 22030
USA   
Phone: +1 703 993 5996