Methods Inf Med 2020; 59(04/05): 131-139
DOI: 10.1055/s-0040-1718940
Original Article

Leveraging the UMLS As a Data Standard for Rare Disease Data Normalization and Harmonization

Qian Zhu (1), Dac-Trung Nguyen (1), Eric Sid (2), Anne Pariser (2)
1  Division of Pre-Clinical Innovation, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Rockville, Maryland, United States
2  Office of Rare Diseases Research, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Bethesda, Maryland, United States
Funding This research was supported by the Intramural Research Program of the National Institutes of Health, National Center for Advancing Translational Sciences.

Abstract

Objective In this study, we aimed to evaluate the capability of the Unified Medical Language System (UMLS) as a data standard to support normalization and harmonization of datasets developed for rare diseases. Based on an analysis of data mappings between multiple rare disease resources and the UMLS, we propose extensions to the UMLS that would enable its adoption as a global standard for rare diseases.

Methods We analyzed data mappings between the UMLS and existing datasets on over 7,000 rare diseases retrieved from four publicly accessible resources: the Genetic and Rare Diseases Information Center (GARD), Orphanet, Online Mendelian Inheritance in Man (OMIM), and the Monarch Disease Ontology (MONDO). Two types of disease mappings were assessed: (1) curated mappings extracted from those four resources; and (2) established mappings generated by querying the rare disease-based integrative knowledge graph developed in our previous study.

Results We found that 100% of OMIM concepts and over 50% of concepts from GARD, MONDO, and Orphanet were normalized by the UMLS and accurately categorized into the appropriate UMLS semantic groups. We analyzed 58,636 UMLS mappings, which resulted in 3,876 UMLS concepts across these resources. Manual evaluation of a random set of 500 UMLS mappings demonstrated a high level of mapping accuracy (99%); the accurate mappings comprised 414 synonyms (82.8%), 76 subtypes (15.2%), and five siblings (1%).

Conclusion The mapping results in this study illustrate that the UMLS can accurately represent rare disease concepts and their associated information, such as genes and phenotypes, and can effectively support data harmonization across existing resources that collect rare disease data. We recommend adopting the UMLS as a data standard for rare diseases so that existing rare disease datasets can support future applications in clinical and community settings.

Author's Contributions

The work was conceived by Q.Z., who also implemented this study and drafted the manuscript. D.-T.N. maintains the Neo4j database and the local UMLS REST API and participated in the project discussions. E.S. conducted the manual validation and participated in the project discussions. A.P. participated in the project discussions. The authors thank Nancy Terry, NIH Library Writing Center, for manuscript editing assistance. The authors read and approved the final manuscript.




Publication History

Received: 15 April 2020

Accepted: 17 September 2020

Publication Date: 04 November 2020 (online)

© 2020. Thieme. All rights reserved.

Georg Thieme Verlag KG
Stuttgart · New York