CC BY-NC-ND 4.0 · Yearb Med Inform 2022; 31(01): 254-260
DOI: 10.1055/s-0042-1742547
Section 10: Natural Language Processing

Year 2021: COVID-19, Information Extraction and BERTization among the Hottest Topics in Medical Natural Language Processing

Natalia Grabar
1   STL, CNRS, Université de Lille, Domaine du Pont-de-bois, Villeneuve-d'Ascq cedex, France
Cyril Grouin
2   Université Paris Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique, Orsay, France


Objectives: To analyze the content of publications in the medical natural language processing (NLP) domain in 2021.

Methods: Automatic and manual preselection of publications to be reviewed, and selection of the best NLP papers of the year. Analysis of the important issues.

Results: Four best papers were selected for 2021. We also analyze the content of the NLP publications from 2021, all topics included.

Conclusions: The main issues addressed in 2021 relate to the investigation of COVID-related questions and to the further adaptation and use of transformer models. In addition, trends from previous years continue, such as information extraction and the use of information from social networks.

Section Editors for the IMIA Yearbook Section on Natural Language Processing

Publication History

Article published online:
04 December 2022

© 2022. IMIA and Thieme. This is an open access article published by Thieme under the terms of the Creative Commons Attribution-NonDerivative-NonCommercial License, permitting copying and reproduction so long as the original work is given appropriate credit. Contents may not be used for commercial purposes, or adapted, remixed, transformed or built upon.

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany
