Best Paper Selection
21 August 2020 (online)
Content Summaries of Best Papers for the Natural Language Processing Section of the 2020 IMIA Yearbook
Guan J, Li R, Yu S, Zhang X. A Method for Generating Synthetic Electronic Medical Record Text. IEEE/ACM Transact on Comput Biology and Inform 2019 https://ieeexplore.ieee.org/document/8880542
Lee J, Yoon W, Kim S, Kim D, Kim S, Ho So C, Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2019;36(4):1234-40 https://academic.oup.com/bioinformatics/article/36/4/1234/5566506
Rosemblat G, Fiszman M, Shin D, Kılıçoğlu H. Towards a characterization of apparent contradictions in the biomedical literature using context analysis. J Biomed Inform 2019;98:103275 https://www.sciencedirect.com/science/article/abs/pii/S1532046419301947?via%3Dihub
Guan J, Li R, Yu S, Zhang X
A Method for Generating Synthetic Electronic Medical Record Text
IEEE/ACM Transact on Comput Biology and Inform 2019
The main obstacle to Natural Language Processing in the biomedical domain is access to clinical texts for non-medical staff, especially in languages other than English. This paper presents a neural method (a generative adversarial network, GAN, combined with the REINFORCE algorithm) that generates clinical documents in Chinese for a given disease (either pneumonia or lung cancer). The authors used a corpus of 2,216 clinical notes written in Chinese, taking the ‘History of Present Illness’ section as input and the ‘Admission Diagnosis’ section as tags. They report an accuracy of 0.7635 for the generated data. They also identified three types of errors in the generated content: repetitions, inconsistent content (“temperature of 39.5°C; no fever”), and improper word matching (“body temperature paroxysmal cough”).
Lee J, Yoon W, Kim S, Kim D, Kim S, Ho So C, Kang J
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
Bioinformatics 2019;36(4):1234-40
Current NLP methods rely on pre-trained word representations to improve results, and BERT is the most commonly used of these. However, general-domain resources are not sufficient for domain-specific language, which needs dedicated resources. This paper introduces BioBERT, a BERT model adapted to the biomedical domain. To produce this model, the authors used several English corpora (Wikipedia, BooksCorpus, PubMed abstracts, and PMC full texts). They compared the results of BioBERT with those of the general BERT model on three tasks (named entity recognition, relation extraction, and question answering). On each task, BioBERT achieved better results.
Rosemblat G, Fiszman M, Shin D, Kılıçoğlu H
Towards a characterization of apparent contradictions in the biomedical literature using context analysis
J Biomed Inform 2019;98:103275
This paper aims to identify contradictions in scientific papers. The authors defined five categories of contradictions: (a) internal to the patient, such as comorbidities, (b) external to the patient, such as dosage, (c) endogenous and exogenous, (d) known controversy, and (e) contradictions in the literature. They used the SemRep tool to identify relationships between 20 common diseases and pathologies, or signs or symptoms. They then assessed the level of certainty based on the SemMedDB repository (derived from PubMed), which contains subject-relation-object predications. Out of 117,000 instances (from 62,000 abstracts), they identified 2,236 apparent contradictions, of which 58 were real contradictions.
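To make the idea of an apparent contradiction between subject-relation-object predications concrete, here is a minimal Python sketch. It is not the authors' pipeline: the toy predications, the function name `find_apparent_contradictions`, and the assumption that a negated relation is marked with a `NEG_` prefix are all illustrative choices. It flags any triple that is asserted both positively and negatively, and then computes the proportions implied by the counts reported above.

```python
from collections import defaultdict

# Assumption for this sketch: a negated relation carries this prefix.
NEG_PREFIX = "NEG_"


def find_apparent_contradictions(predications):
    """Return sorted (subject, relation, object) triples that occur
    both in asserted and in negated form."""
    polarities = defaultdict(set)  # triple with base relation -> {True, False}
    for subject, relation, obj in predications:
        negated = relation.startswith(NEG_PREFIX)
        base = relation[len(NEG_PREFIX):] if negated else relation
        polarities[(subject, base, obj)].add(not negated)
    return sorted(
        triple for triple, seen in polarities.items() if seen == {True, False}
    )


# Hypothetical toy predications, not taken from SemMedDB.
predications = [
    ("aspirin", "TREATS", "headache"),
    ("aspirin", "NEG_TREATS", "headache"),
    ("aspirin", "TREATS", "fever"),
    ("smoking", "CAUSES", "lung cancer"),
]

contradictions = find_apparent_contradictions(predications)
print(contradictions)  # [('aspirin', 'TREATS', 'headache')]

# Proportions implied by the counts reported in the summary.
apparent_share = 2236 / 117000  # apparent contradictions per instance
real_share = 58 / 2236          # real contradictions among apparent ones
print(f"{apparent_share:.1%} apparent, {real_share:.1%} of those real")
```

With the reported counts, about 1.9% of instances are apparent contradictions and about 2.6% of those are real, which illustrates why context analysis is needed to separate the two.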