Methods Inf Med 2020; 59(04/05): 131-139
DOI: 10.1055/s-0040-1718940
Original Article

Leveraging the UMLS As a Data Standard for Rare Disease Data Normalization and Harmonization

Qian Zhu (1), Dac-Trung Nguyen (1), Eric Sid (2), Anne Pariser (2)

1   Division of Pre-Clinical Innovation, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Rockville, Maryland, United States
2   Office of Rare Diseases Research, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Bethesda, Maryland, United States
Funding This research was supported by the Intramural Research Program of the National Institutes of Health, National Center for Advancing Translational Sciences.

Abstract

Objective In this study, we aimed to evaluate the capability of the Unified Medical Language System (UMLS) as a data standard to support normalization and harmonization of datasets that have been developed for rare diseases. Based on an analysis of data mappings between multiple rare disease resources and the UMLS, we propose extensions of the UMLS that will enable its adoption as a global standard for rare diseases.

Methods We analyzed data mappings between the UMLS and existing datasets on over 7,000 rare diseases that were retrieved from four publicly accessible resources: Genetic and Rare Diseases Information Center (GARD), Orphanet, Online Mendelian Inheritance in Man (OMIM), and the Monarch Disease Ontology (MONDO). Two types of disease mappings were assessed: (1) curated mappings extracted from those four resources; and (2) established mappings generated by querying the rare disease-based integrative knowledge graph developed in our previous study.

Results We found that 100% of OMIM concepts and over 50% of concepts from GARD, MONDO, and Orphanet were normalized by the UMLS and accurately categorized into the appropriate UMLS semantic groups. We analyzed 58,636 UMLS mappings, which resulted in 3,876 UMLS concepts across these resources. Manual evaluation of a random sample of 500 UMLS mappings showed a high level of mapping accuracy (99%): 414 mappings were synonyms (82.8%), 76 were subtypes (15.2%), and five were siblings (1%).

Conclusion The mapping results in this study illustrated that the UMLS was able to accurately represent rare disease concepts and their associated information, such as genes and phenotypes, and can effectively support data harmonization across existing resources that collect rare disease data. We recommend adopting the UMLS as a data standard for rare diseases so that existing rare disease datasets can support future applications in clinical and community settings.
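The manual evaluation figures reported in the Results can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' evaluation code. It assumes that a mapping counts as accurate when it falls into one of the three reported categories (synonym, subtype, sibling), an interpretation that is consistent with the reported 99% but is not stated explicitly in the abstract. The names `SAMPLE_SIZE`, `CATEGORY_COUNTS`, and `summarize` are hypothetical.

```python
# Counts taken from the manual evaluation reported in the abstract.
SAMPLE_SIZE = 500
CATEGORY_COUNTS = {"synonym": 414, "subtype": 76, "sibling": 5}


def summarize(counts, sample_size):
    """Return per-category percentages and the overall accuracy.

    Assumption: a mapping is "accurate" if it belongs to any of the
    listed categories, so accuracy is the sum of the category counts
    divided by the sample size.
    """
    shares = {
        name: round(100 * n / sample_size, 1) for name, n in counts.items()
    }
    accuracy = round(100 * sum(counts.values()) / sample_size, 1)
    return shares, accuracy


shares, accuracy = summarize(CATEGORY_COUNTS, SAMPLE_SIZE)
print(shares)    # per-category share of the 500 sampled mappings, in percent
print(accuracy)  # overall accuracy, in percent
```

Under this reading, the three categories account for 495 of the 500 sampled mappings, which matches the reported accuracy.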

Authors' Contributions

The work was conceived by Q.Z., who also implemented this study and drafted the manuscript. D.-T.N. maintains the Neo4j database and the local UMLS REST API and participated in the project discussions. E.S. conducted the manual validation and participated in the project discussions. A.P. participated in the project discussions. The authors thank Nancy Terry, NIH Library Writing Center, for manuscript editing assistance. The authors read and approved the final manuscript.




Publication History

Received: 15 April 2020

Accepted: 17 September 2020

Article published online: 04 November 2020

© 2020. Thieme. All rights reserved.

Georg Thieme Verlag KG
Stuttgart · New York

 