Abstract
Background Textual datasets (corpora) are crucial for applying natural language processing
(NLP) models. However, corpus creation in the medical field is challenging, primarily
because of privacy issues with raw clinical data such as health records. Thus, existing
clinical corpora are generally small and scarce. Medical NLP (MedNLP) methodologies
must therefore perform well under limited data availability.
Objectives We present the outcomes of the Real-MedNLP workshop, which was conducted using limited
and parallel medical corpora. Real-MedNLP exhibits three distinct characteristics:
(1) Limited annotated documents: the training data comprise only a small set (∼100)
of annotated case reports (CRs) and radiology reports (RRs). (2) Bilingually
parallel: the constructed corpora are parallel in Japanese and English. (3) Practical
tasks: the workshop addresses both fundamental tasks, such as named entity recognition
(NER), and applied practical tasks.
Methods We propose three tasks: NER using the ∼100 available annotated documents (Task 1), NER based
solely on the annotation guidelines written for human annotators (Task 2), and clinical applications
(Task 3) consisting of adverse drug event (ADE) detection for CRs and identical case identification (CI)
for RRs.
Results Nine teams participated in this study. The best systems achieved F1-scores of 0.65 and 0.89
for CRs and RRs, respectively, in Task 1, whereas the top scores in Task 2 dropped by 50 to 70%
relative to Task 1. In Task 3, ADE detection reached an F1-score of up to 0.64, and CI achieved up to 0.96
binary accuracy.
Conclusion Most systems adopted medical-domain-specific pretrained language models combined with data
augmentation methods. Despite the limited corpus size in Tasks 1 and
2, recent approaches are promising, as partial-match F1-scores reached ∼0.8–0.9.
The Task 3 applications revealed that differences in the availability of external language
resources affected performance in each language.
Keywords
natural language processing - machine learning - adverse drug events