CC BY-NC-ND 4.0 · Yearb Med Inform 2019; 28(01): 140-151
DOI: 10.1055/s-0039-1677912
Section 6: Knowledge Representation and Management
Survey
Georg Thieme Verlag KG Stuttgart

Enhancing Clinical Data and Clinical Research Data with Biomedical Ontologies - Insights from the Knowledge Representation Perspective

Jonathan P. Bona (1), Fred W. Prior (1), Meredith N. Zozus (1), Mathias Brochhausen (1)
(1) University of Arkansas for Medical Sciences, Arkansas, USA

Correspondence to

Mathias Brochhausen
4301 W. Markham St.
Slot 782, Little Rock, AR 72205

Publication History

Publication Date: 16 August 2019 (online)

Summary

Objectives: There exists a communication gap between the biomedical informatics community on one side and the computer science/artificial intelligence community on the other side regarding the meaning of the terms “semantic integration” and “knowledge representation”. This gap leads to approaches that attempt to provide one-to-one mappings between data elements and biomedical ontologies. Our aim is to clarify the representational differences between traditional data management and semantic-web-based data management by providing use cases of clinical data and clinical research data re-representation. We discuss how and why one-to-one mappings limit the advantages of using Semantic Web Technologies (SWTs).

Methods: We employ commonly used SWTs, such as Resource Description Framework (RDF) and Web Ontology Language (OWL). We reuse pre-existing ontologies and ensure shared ontological commitment by selecting ontologies from a framework that fosters community-driven collaborative ontology development for biomedicine following the same set of principles.

Results: We demonstrate the results of providing SWT-compliant re-representation of data elements from two independent projects managing clinical data and clinical research data. Our results show how one-to-one mappings would hinder the exploitation of the advantages provided by using SWT.

Conclusions: We conclude that SWT-compliant re-representation is an indispensable step if the goal is to use the full potential of SWT. Rather than providing one-to-one mappings, developers should provide documentation that links data elements to graph structures to specify the re-representation.



1 Introduction

1.1 A Communication Gap in Biomedical Informatics

The technical means that enable data sharing and data integration are a key problem in biomedical data management. Integration of data can happen at multiple levels, and semantic integration is the second to last integration level, only followed by shared business process, according to Blobel and Oemig [1]. Semantic integration aims to preserve “the detail, uncertainty, and above all the context of the data involved” [2]. Ontologies are an integral part of current semantic integration approaches. To achieve computer-assisted integration solutions, ontologies should be machine-interpretable and thus need to provide information about details, uncertainty, and context in a computer-interpretable language [2] (as opposed to textual definitions written in any given natural language). Ontologies are an increasingly popular and successful tool for encoding and sharing machine-interpretable knowledge, such as background information about an area of biomedicine or other domains, as well as general information about the structure of the world such as is provided by upper-level ontologies like the BFO (Basic Formal Ontology) or the SUMO (Suggested Upper Merged Ontology) [3]. These ontologies are usually implemented and distributed as OWL (Web Ontology Language) files [4], [5] containing logical definitions.

In a 2018 paper, Brochhausen et al. indicated a communication gap in biomedical informatics regarding the interpretation of the term “semantic integration” and, more generally, “semantics” [6]. They showed how common data models (CDMs) were cited as fostering semantic integration or providing “semantics” despite their lack of representation of detail-oriented contextual information expressing levels of diagnostic confidence (suspected vs. confirmed, etc.) provided in a machine-interpretable language. This shows that the interpretation of the terms “semantic integration” and “semantics” differs between the biomedical informatics community and the computer science/big data community, whose view is exemplified by the publication of Cheatham and Pesquita [2]. To help mitigate that situation and address issues of the ability of resources (such as ontologies, controlled vocabularies, and terminologies) to contribute to semantic integration, Brochhausen et al. proposed “computable semantics” as a baseline to establish whether a resource is capable of supporting semantic integration [6]. For a resource to provide computable semantics means there must be an effective method that could assign or validate the meaning of the symbols and expressions. In logic and mathematics, an effective method (sometimes also called mechanical method) is a method that allows one to compute the answer to a given problem in a finite number of steps, and is logically bound to give the correct answer (and no wrong answers) [7].

Utecht et al. [8] have shown one way of demonstrating that an ontology-driven system entails the capability to provide computable semantics. For a project managing drug-drug interaction evidence information, they created an ontology that represented 44 different evidence types (such as longitudinal studies, observational studies, etc.) complete with necessary and sufficient conditions for class inclusion. In a pilot test, the team retrieved 30 evidence items (e.g. a journal paper) that had previously been assigned to one of the evidence types manually. In the test, a person had to answer five questions about each evidence item by filling in a web-based form. As they were captured, the answers about these evidence items were used to generate Resource Description Framework (RDF) representations. Based on the information entered and the axiomatic definitions of the evidence types, running the OWL reasoner HermiT (http://www.hermit-reasoner.com) sorted all evidence items correctly into evidence types.

Providing the ability to automatically sort data items, e.g. diagnoses based on properties such as anatomical location, has inspired developments in clinical vocabularies, specifically the SNOMED Clinical Terms (SNOMED CT), in recent years. For example, the SNOMED CT representation of “Herpes simplex iridocyclitis” (SCTID: 13608004) specifies in a machine-interpretable language [9] that the finding site for this disorder is the “ciliary body structure” or “iris structure”, the causative agent is “human herpes simplex virus”, the associated morphology is “inflammation”, and the pathological process is “infection”. This means that any instance of a disorder that does not fulfill these criteria would not be sorted into that category. These specifications would potentially also allow the validation of clinical coding by checking whether the finding site, causative agent, associated morphology, and pathological process specified elsewhere in the medical record are consistent with the code. Miñarro-Giménez et al. have made the point that adding formal logical axioms to SNOMED CT would help to overcome the still low agreement in code assignment between annotators [10].
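The validation idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the attribute keys, and the function name `inconsistent_attributes` are hypothetical and are not part of SNOMED CT or any SNOMED CT tooling. The attribute values are the ones quoted in the text, and the comparison is a plain string match, whereas a description logic reasoner would also take subsumption between concepts into account.

```python
# Hand-written stand-in for the defining attributes of one concept.
# Layout and names are assumptions for illustration, not a SNOMED CT API.
DEFINITIONS = {
    "13608004": {  # Herpes simplex iridocyclitis
        "finding_site": {"ciliary body structure", "iris structure"},
        "causative_agent": {"human herpes simplex virus"},
        "associated_morphology": {"inflammation"},
        "pathological_process": {"infection"},
    }
}


def inconsistent_attributes(code, record):
    """Return the attributes of `record` that contradict the definition of `code`.

    An attribute that is absent from the record is not counted as a contradiction.
    """
    definition = DEFINITIONS[code]
    return sorted(
        attribute
        for attribute, allowed in definition.items()
        if attribute in record and record[attribute] not in allowed
    )


# Two made-up record fragments that were both coded with 13608004.
consistent = {
    "finding_site": "iris structure",
    "causative_agent": "human herpes simplex virus",
}
suspicious = {
    "finding_site": "cornea structure",
    "pathological_process": "infection",
}

print(inconsistent_attributes("13608004", consistent))
print(inconsistent_attributes("13608004", suspicious))
```

The first record yields no contradictions, while the second is flagged on its finding site, which is the kind of check that purely textual code definitions cannot support.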

A recent paper that aimed to assess knowledge representation of clinical data across health systems demonstrated the existence of a communication gap regarding the term “knowledge representation”, in particular in distinction to “data representation”. Rosenbloom et al. [11] assessed three commonly used standards for sharing clinical data: Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) [12], [13], PCORNet (National Patient-Centered Clinical Research Network) CDM [14], [15], and Health Level Seven International (HL7) Fast Healthcare Interoperability Resources (FHIR) [16]. While we agree with the authors that the resources reviewed in their paper contribute to “a recent growth in high-impact efforts to support quality-assured and standardized clinical data sharing across different institutions and EHR (Electronic Health Record) systems”, we do not agree that those resources contribute to better knowledge representation and, thus, foster semantic integration.

Rosenbloom et al. do not provide a definition of knowledge representation, but use “knowledge representation” as a search term for their review. According to Davis et al. [17], knowledge representation is best understood by looking at five distinct “roles” that it plays: (i) it is based on a surrogate, a representation of the entities in the world; (ii) it consists of a set of ontological commitments; (iii) it provides a fragmentary theory of intelligent reasoning, including rules of inference; (iv) it acts as a medium of pragmatically efficient computation; and (v) it is a medium of human expression. Knowledge representation or knowledge representation and reasoning (KRR), as it is sometimes called, is a subfield of artificial intelligence and has a long history dating back to the early days of symbolic Artificial Intelligence (AI). The reasoning aspect of KRR, i.e., the capability of a computer system to automatically draw inferences based on a set of inference rules, is what allows for filling the roles (iii) and (iv) of Davis’ definition of knowledge representation.

Brochhausen et al. showed that neither OMOP CDM nor PCORNet CDM exhibits roles (iii) and (iv) defining knowledge representation [6]. Hence, they do not provide knowledge representation in the sense of computer/information science. FHIR does provide avenues to fill roles (iii) and (iv). While an extensive review of FHIR’s knowledge representation capabilities is out of the scope of this paper, Martinez-Costa and Schulz have pointed out from the perspective of knowledge representation that, although FHIR at the time of their writing (2017) required some manual effort, there are feasible strategies to use FHIR for knowledge representation and semantic integration [18].

A rich corpus of literature about the lack of reliability in coding clinical data [10], [19] [20] [21] [22] [23] demonstrates the reason why axiomatic definitions, even for tasks or databases that do not (yet) explicitly require reasoning, are relevant. Without the capability of using an effective method to ascertain the correctness, consistency, and reliability of coding, semantic integration will not be possible in a way that can be validated.



1.2 Using Semantic Web Technologies for Biomedical Data - an Engineering-oriented Perspective

The initial motivation for semantic Web technologies (SWTs) was to enable computers to play a more active role in handling, organizing, and managing data on the Internet:

“The concept of machine-understandable documents does not imply some magical artificial intelligence allowing machines to comprehend human mumblings. It relies solely on the machine’s ability to solve well-defined problems by performing well-defined operations on well-defined data. So, instead of asking machines to understand people’s language, the new technology, like the old, involves asking people to make some extra effort, in repayment for which they get major new functionality–just as the extra effort of producing HTML mark-up is outweighed by the benefit of having content searchable on the web” [24].

SWTs include numerous key methodologies, but at their core is the Resource Description Framework (RDF) [25]. In RDF, information such as the fact that “Hydrogen potassium ATPase” is a “proton pump”, is captured by a statement that identifies the two entities about which the statement is made, and by specifying the relation that holds between the two. These statements are referred to as triples, because they consist of three parts: subject, predicate, and object [26], [27]. The bold rectangle in [Fig. 1] shows a representation of a triple. RDF uses Uniform Resource Identifiers (URIs) to refer to the entities and relationships in a domain [28]. Using URIs for the entities in the domain, such as “Hydrogen potassium ATPase” in the example in Figure 1, allows us to build complex and large graphs based on the simple triple structure. Due to the use of URIs, the two triples that contain Hydrogen potassium ATPase, [Omeprazole, inhibits, Hydrogen potassium ATPase] and [Hydrogen potassium ATPase, is a, Proton pump], get connected to build a small graph consisting of three nodes and two edges.

Fig. 1 Example of using RDF triples and axioms to represent knowledge in the pharmacology domain. The transparent boxes represent RDF subjects and objects, the lines represent predicates between subject and object. The gray box shows a necessary and sufficient condition for being a member of the class 'Proton-pump inhibitor'. The dotted line represents an inferred relationship.

RDF, together with languages for defining controlled vocabularies and ontologies, such as RDF Schema (RDFS) [29] and OWL [30], forms the basis of the SWT knowledge representation strategy [26], [27]. At the core of this representation strategy is the possibility to use formal logic to draw inferences from premises and axioms to make implicit information explicit. Figure 1 shows the example of a reasoner using the RDF statements and a necessary and sufficient condition for the class “Proton pump inhibitor” to infer the statement that “Omeprazole is a proton pump inhibitor”. The use of such axioms and rules of inference, for example in ontologies [6], marks one of the key features of using SWTs.
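To make the inference step concrete, the pharmacology example can be mimicked in plain Python. This is a minimal sketch and not how an OWL reasoner works internally: plain strings stand in for URIs, the function name is hypothetical, and a single hand-coded rule replaces the class axiom. We assume the axiom reads "anything that inhibits some proton pump is a proton-pump inhibitor", and only that sufficient direction is applied here.

```python
# Triples as (subject, predicate, object); plain strings stand in for URIs.
triples = {
    ("Omeprazole", "inhibits", "Hydrogen potassium ATPase"),
    ("Hydrogen potassium ATPase", "is a", "Proton pump"),
}


def infer_proton_pump_inhibitors(graph):
    """Apply one hand-coded rule: whatever inhibits a proton pump is a proton-pump inhibitor."""
    # Step 1: collect everything asserted to be a proton pump.
    proton_pumps = {s for (s, p, o) in graph if p == "is a" and o == "Proton pump"}
    # Step 2: derive a new "is a" triple for each subject inhibiting one of them.
    return {
        (s, "is a", "Proton-pump inhibitor")
        for (s, p, o) in graph
        if p == "inhibits" and o in proton_pumps
    }


inferred = infer_proton_pump_inhibitors(triples)
print(inferred)
```

The two asserted triples share the node "Hydrogen potassium ATPase", and that shared identifier is what lets the rule connect them and derive a statement about Omeprazole that was never entered explicitly.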

The goal of this paper is not to claim that sharing and integration of clinical data requires SWTs, but the considerations presented above and in Brochhausen et al. clearly demonstrate that semantic integration requires a knowledge representation approach that is absent from both OMOP and PCORNet CDM. Those resources, of course, still provide value in biomedical informatics, but previous research indicates a number of use cases that require or benefit from using SWTs:

  • From the material presented above it follows that semantic web technologies are useful tools for all use cases where we seek to validate coding or automate a classification of cases into different categories [8]. Axiomatically rich ontologies have been shown to support a number of medically relevant functionalities such as automatic sorting of entities based on axiomatic definitions. Utecht et al. have shown that studies reporting evidence regarding drug-drug interactions can be sorted automatically into a complex system of study types using the Drug-drug Interaction and Drug-drug Interaction Evidence Ontology (DIDEO) based on six questions about the studies [8].

  • SWTs are used to allow integration of structured, but uncoded data for clinical and clinical research purposes. Mate et al. demonstrated an ontology-driven system to manage extract, transform, and load (ETL) procedures to reuse standard care data from electronic medical records (EMRs) to answer research questions [31].

  • SWTs support the integration of heterogeneous, uncoded but structured data describing instances of the same types of medical phenomena, so that the data can be queried in a truth-preserving manner using the biomedical context. Brochhausen et al. demonstrated that ontology-based representation was able to fix problems in querying biobank data from different biobanks at the same institution, by using RDF and the Ontology of Biobanking (OBIB) [32].

  • SWTs have shown great promise in improving the curation and usage of drug-drug interaction information [33] [34] [35] [36].

These are, of course, only a few examples, illustrating the type of problems and the scope of applying semantic web technologies in the biomedical arena. A PubMed query for “’semantic web technologies’ OR ’semantic web technology’ OR SWT [all]” retrieved 560 hits in December 2018.



2 Objectives

In the daily practice of an SWT specialist working with clinical data and clinical research data, requests to map or annotate existing clinical data and clinical research data with “ontology terms” are quite common. One reason for these requests is an understandable lack of awareness on the consumer side that using ontologies productively is an effort that goes beyond coding or re-coding existing data, and that it requires transforming (mostly) tabular data into a graph data format. The results of such approaches have been reported in the literature [37] [38] [39]. Previous works showed that using terms from OWL ontologies to annotate biomedical data that is not graph data may yield some results, such as assessing the domain coverage of the ontology or semantic integration based on the taxonomy that is part of the OWL file [40], [41]. However, utilizing the artificial intelligence capabilities linked to KRR requires that the data be transformed into a graph-based data representation.

Our aim is to provide use cases for using preexistent ontologies and SWTs to map clinical data and clinical research data in a way that realizes the Artificial Intelligence capabilities of those technologies. In our ontological representation, we follow the best practice of reusing existing ontologies where possible [42], [43].

Our focus in this effort is to promote awareness and understanding of the level of re-representation necessary to enable true knowledge representation based on this data. As such, we present conceptualizations of what the data is about, to help alleviate the communication gap between medical researchers and biomedical informaticians on the one side and computer scientists and the artificial intelligence community on the other side. Researchers and data curators in biomedical informatics are encouraged to embrace pre-existing tools for restructuring tabular data as graph data, such as W3C CSV2RDF [44] or RDB2RDF [45]. Our aim is to foster understanding of the re-structuring of the data that is useful for those curating it, especially those with medical domain knowledge, to ensure automatic transformation delivers correct and meaningful results. We point to ontology resources that foster orthogonal and consistent ontology development and demonstrate reuse of OWL entities from ontologies following those strategies. We demonstrate the semantic ambiguity of terms from clinical and clinical research standards and Common Data Elements (CDEs). Modeling ontologies to be one-to-one mappable to those artifacts leads to diminishing the advantages of the SWT approach by creating a multitude of study-specific classes in a pre-coordination approach.



3 Knowledge Representation Applied to Medical Data

3.1 General Approach

In our reuse of pre-existing ontologies we have embraced the collaborative, community-driven development paradigm in the biomedical ontologies community led by the Open Biological and Biomedical Ontologies (OBO) Foundry [46]. The OBO Foundry is a collaborative effort to build a library of orthogonal ontologies for both biomedical and biological domains following a core set of principles. In addition, the OBO Foundry provides a consistent way to manage naming and identifiers [46]. The results of the OBO Foundry are made available through the OBO Foundry website [47] and through the Ontobee service [48], which allows users to explore many OBO Foundry ontologies using one term search [49].

Smith and Ceusters stressed that the need for a shared upper ontology is a practical consequence of the need for collaborative ontology development in science [50]. They pointed out the relevance of the ontological realist methodology in building the OBO Foundry. The advantage of adopting a realist stance for collaborative ontology development is that the appropriateness and correctness of the ontological representation (and thus the ontological commitments) is linked to scientific research, including experimentation and scientific arguments [50]. The linkage between the ontological realist methodology and the individual OBO Foundry ontologies is ensured by the fact that the Basic Formal Ontology (BFO) is the upper ontology of most OBO Foundry ontologies [51] [52] [53] [54]. According to Arp et al., all OBO Foundry domain ontologies have adopted BFO as their upper level ontology [53].

As a collective of open biological and biomedical ontologies collaborating around shared design principles, OBO Foundry has broad and expanding term coverage for entities that one might need to model when creating semantic representation for data in the biomedical domain. However, even with this broad coverage, it is not unusual to encounter phenomena for which there are no adequate terms existing in OBO Foundry ontologies. We generally approach this issue in our projects by working with the developers of the relevant ontology, proposing terms and their definitions, and requesting their addition to the ontology [55]. Because this process is not immediate, it is sometimes necessary to create a small application ontology that has placeholders for the desired terms that we can use as we proceed with crafting our representations. Of course, having already defined and implemented a draft for the required term makes the term request easier to discuss and fulfill.

In some cases, there may not be an existing ontology that is a natural fit for the term or terms needed. Depending on the scale of this gap and the nature of the terms in question, including how generally useful they are likely to be for the larger community, it may make sense either to simply develop our own terms for internal use in an application ontology, or to initiate the development of a new OBO ontology that covers the relevant subdomain. Note that because BFO provides a full upper-level theory that is shared by all OBO ontologies, and because there are already existing interoperable OBO ontologies for many areas of biology and medicine, even in the case where we develop new terms without the intention of releasing them as part of a new ontology, these terms are not built in isolation but are developed in the context of existing OBO ontologies, with logical definitions that capture their relations to those resources.



3.2 The Cancer Imaging Archive

The Cancer Imaging Archive (TCIA) is the National Cancer Institute’s primary resource for acquiring, curating, managing, and distributing images and related data to support cancer research [56] [57] [58]. TCIA hosts over 36 million de-identified medical images of cancer (28 distinct cancer types) organized into 96 distinct collections [59]. TCIA was created to support research reproducibility and research reuse.

We are developing PRISM (Platform for Imaging in Precision Medicine) as the future basis of TCIA and offering this advanced informatics platform as an open source, easily deployable resource to support other research communities. Within the PRISM platform, we are developing state-of-the-art technologies for semantic integration of clinical and research information drawn from multiple sources. Identified near-term goals and challenges include: uniform management of non-image data; semantic query mechanisms and enhanced data exploration; and automatic curation of current and new data types. Many TCIA collections include non-image data in a variety of formats, often as downloadable spreadsheet files, which makes them difficult to combine or query. Further complicating this is the use of different representation schemes for similar information in different collections.

Our ongoing work to make these diverse non-image data more accessible and usable transforms them into shared semantic representations in OWL that use OBO Foundry resources, and will allow for queries that span collections to answer questions such as:

  • Which patients in lung cancer collections have been diagnosed with metastatic colon cancer, and how was that diagnosis obtained?

  • Which patients in head and neck cancer collections have tumors specifically in their oropharynx, and have been diagnosed with human papillomavirus, and how were those diagnoses obtained?

Our semantic representations based on these data use OBO Foundry ontologies including the Human Disease Ontology and The Uber Anatomy Ontology (Uberon). Instances for individual entries in TCIA collection data are linked to ontology classes to explicitly represent locations, disease types, diagnosis methods, etc.

[Figure 2] shows excerpts of similar data contained in two different head and neck cancer collections in the TCIA: the Head-Neck-PET-CT collection [60], which contains non-image data, including diagnostic and treatment information for patients with head and neck cancer, and the HNSCC (Head and Neck Squamous Cell Carcinoma) collection [61], which contains much of the same information. Though using different notation, these collections overlap significantly in their contents, including patient sex and other demographic data, tumor staging, HPV status, and an indication of the primary tumor location. [Figure 3] shows our semantically-enhanced representation of positive HPV status for a patient in a head and neck cancer collection, which provides a unique contextually-rich and axiomatically-defined representation for what the different values (“positive”, “negative”, “+”, “-”, “N/A”) represent. Note that while in a conversation or even in written documentation one might say, “this patient’s HPV status is positive,” an “HPV status” per se is a fairly nebulous entity from the realist perspective, which strives to represent things as they actually are. Our goal is to represent, in a form even a computer can understand, the relevant portions of reality that the authors of this collection were trying to describe by creating a column named “HPV status” and populating it with entries like “positive” or “+”. The best description we can extract based on that information is that at some point a “diagnostic process” occurred that involved the infected human, and involved some “HPV assay”. “HPV assay” is a subclass of OBI: assay, defined as “A planned process with the objective to produce information about the HPV status of the human that is the evaluant, by physically examining the human or samples taken from their body”. That diagnostic process produced a “diagnosis” as its output, and if there is some instance of the “papillomavirus infectious disease” that inheres in that human, the diagnosis is about that instance of the HPV infectious disease. For the pilot described here, mapping rules for the value sets to an ontological representation were specified manually and then automatically executed in transforming the source data into RDF.

Fig. 2 Data excerpts from two head and neck cancer collections
Fig. 3 RDF representation of positive HPV status for head and neck cancer records. “Positive” or “+” map to this representation. “Negative” or “-” will map to a representation with a human undergoing the HPV assay, not establishing the existence of a papillomavirus infectious disease.
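A mapping rule of the kind described above, from a cell value to a graph pattern, might look roughly as follows. This is a sketch under several assumptions and not the rule set used in the pilot: the node identifiers and predicate labels are hypothetical placeholders for ontology terms, the negative pattern is reduced to the process and the assay, and values such as “N/A” are assumed to produce no assertions at all.

```python
POSITIVE = {"positive", "+"}
NEGATIVE = {"negative", "-"}


def hpv_triples(patient_id, value):
    """Map one 'HPV status' cell to triples; node and predicate names are hypothetical."""
    value = value.strip().lower()
    if value not in POSITIVE | NEGATIVE:
        return []  # assumption: "N/A" and unknown values yield no assertions

    human = f"human/{patient_id}"
    process = f"diagnostic_process/{patient_id}"
    assay = f"hpv_assay/{patient_id}"

    # Shared part of both patterns: a diagnostic process involving the human
    # and an HPV assay.
    triples = [
        (human, "rdf:type", "human"),
        (process, "rdf:type", "diagnostic process"),
        (assay, "rdf:type", "HPV assay"),
        (process, "involves", human),
        (process, "involves", assay),
    ]

    if value in POSITIVE:
        diagnosis = f"diagnosis/{patient_id}"
        disease = f"disease/{patient_id}"
        # Positive pattern only: a diagnosis about a disease instance that
        # inheres in this human.
        triples += [
            (diagnosis, "rdf:type", "diagnosis"),
            (process, "has output", diagnosis),
            (disease, "rdf:type", "papillomavirus infectious disease"),
            (disease, "inheres in", human),
            (diagnosis, "is about", disease),
        ]
    return triples


for triple in hpv_triples("p1", "+"):
    print(triple)
```

Because “+” and “positive” produce identical triples, records from collections with different notation end up in the same representation, which is what makes later queries across collections possible.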

These representation patterns are used to transform tabular non-image data associated with TCIA collections into OWL/RDF instance data linked and annotated with the corresponding ontology terms. These data are then loaded into a triple store database for reasoning and querying. The resulting triple store contains assertions linking patient identifiers to RDF instances representing patients, affected body parts, diagnoses, relations among those, etc. We are able to query this database using SPARQL (SPARQL Protocol and RDF Query Language) to identify patient records matching criteria based on fields that were previously not queryable in TCIA collections, as well as queries that retrieve results spanning collections. Work is ongoing to implement this approach for additional data types and additional existing TCIA collections, with the eventual goal of a shared representation for all TCIA non-image data, including the ability to automatically curate new incoming data as it is submitted.
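The effect of a query that spans collections can be illustrated without a triple store. The following Python sketch uses a list of tuples in place of the store and a small join in place of SPARQL. All identifiers and predicate labels are made up, and the tumor location is attached directly to the patient for brevity, which flattens the richer graph pattern used in the actual representation.

```python
# Toy instance data from two collections; every identifier is made up.
store = [
    ("patient/A1", "in collection", "Head-Neck-PET-CT"),
    ("patient/A1", "has tumor location", "oropharynx"),
    ("patient/B7", "in collection", "HNSCC"),
    ("patient/B7", "has tumor location", "oropharynx"),
    ("patient/B9", "in collection", "HNSCC"),
    ("patient/B9", "has tumor location", "larynx"),
]


def subjects_with(graph, predicate, obj):
    """Return every subject that has the given predicate/object pair."""
    return {s for (s, p, o) in graph if p == predicate and o == obj}


def oropharynx_patients(graph):
    """Join two triple patterns: tumor location and collection membership."""
    matches = subjects_with(graph, "has tumor location", "oropharynx")
    return sorted(
        (s, o) for (s, p, o) in graph if p == "in collection" and s in matches
    )


print(oropharynx_patients(store))
```

The result lists one patient from each collection, because once both sources use the same predicates and terms the join no longer depends on which spreadsheet a row originally came from.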



3.3 The Data Coordinating and Operations Center

The IDeA (Institutional Development Award) States Pediatric Clinical Trials Network (ISPCTN) is a research network with the goal to provide medically underserved and rural populations with access to state-of-the-art clinical trials, apply findings from relevant pediatric cohort studies to children in IDeA Program state locations, and build pediatric research capacity at a national level [62]. It is part of the Environmental influences on Child Health Outcomes (ECHO) program, which is funded by the National Institutes of Health (NIH) [63] [64] [65]. The University of Arkansas for Medical Sciences serves as the Data Coordinating and Operations Center (DCOC) for the ISPCTN. One study undertaken by the ISPCTN deals with Neonatal Opioid Withdrawal Syndrome (NOWS), aiming to characterize current clinical practice in opioid withdrawal in newborns. The study data collection form includes patient demographics, facility characteristics, maternal and fetal exposure, maternal history, pharmacologic and non-pharmacologic treatment, and discharge disposition. As part of the DCOC mission to provide reliable and innovative data coordination and management, one of the authors (MB) was asked to provide an overview of how the NOWS data dictionary could be translated into a graph-based data representation using Semantic Web Technologies.

The NOWS data dictionary consists of 267 elements that are closely linked to questionnaire items to be completed by study representatives. Examples of those questions are: “Was the infant ≥ 36 weeks gestational age?”, “Was there a maternal history of opioid use?”, “Did the infant need major surgical intervention?”, and “What lactation interventions were employed?”. The data dictionary assigns an item name, a description label, a response type, and a response label to each data element. There is additional information for each element, such as information about the questionnaire order and logic.

In [Table 1], we present three examples illustrating that a one-to-one mapping between a data element and an OWL/RDF class would be suboptimal. We elaborate the shortcomings of one-to-one mapping and demonstrate how such mapping would lead to losing the advantage of SWT.

Table 1 An excerpt from the DCOC’s NOWS data dictionary

| Item name | Description label | Response type | Response label | Response options | Data type |
| INGAGE | Was the infant ≥ 36 weeks gestational age? | radio | YN | Yes, No | INT |
| BRTHDTC | Date of birth | [empty] | [empty] | [empty] | DATE |
| INMEDSUSD | Indicate the medication(s) used to treat NOWS for this infant at the transferring hospital | checkbox | MMBCPUO | Morphine, Phenobarbital, Methadone, Buprenorphine, Clonidine, Unknown, Other | ST |

The INGAGE data element is one example of NOWS data elements that have a yes/no answer. Though study administrators frequently employ this type of question, the information captured that way is particularly sparse from the perspective of semantics. In a case like this, capturing only the response (yes, no) does not provide any machine-interpretable semantic information on what that piece of data means. A change to the text of the question that reverses the meaning of the answer, for example changing the description label to “Was the infant <36 weeks gestational age?”, would not be discernible from a representation of the answer alone. Thus, our first aim is to provide semantically-rich data that is machine-interpretable. Doing so requires capturing the semantics in the form of a question and maintaining the association between the question and the answer. [Figure 4] shows the semantic presentation we have chosen for this INGAGE data element.

Fig. 4 SWT representation of INGAGE from [Table 1]

Using a one-to-one mapping approach for the INGAGE data element would mean creating a class in an OWL file whose members are all potential participants born at a gestational age of at least 36 weeks. Doing so is entirely possible in OWL and, of course, also in RDF. If the class were also axiomatically defined, we would lose less semantics than with the yes/no answer option. However, doing so is not advisable. Following that strategy, we would need an additional class for every study that uses a different gestational age as an inclusion criterion. If those classes are axiomatically defined, we would add a considerable reasoning burden to our ontology without much gain beyond individual studies. From an SWT perspective, it is much more advisable to ensure that all elements needed to define the inclusion criterion exist in our RDF data. For the example at hand, this means capturing the integer value of the gestational age for all participants or patients, regardless of the inclusion criteria of one specific study, e.g., NOWS. Using SPARQL, we are then able to query for all participants and patients with a gestational age of at least 36 weeks, or of at least 34 weeks, depending on the requirements of the study at hand. We do not need to deal with numerous predefined classes in our ontology that slow down reasoning. The numerical values can be extracted along with their units of measurement and used in calculations, such as analyses. For operations that go beyond the capabilities of SPARQL, it is advisable to run these calculations in tools external to the SWT suite. SWTs are, at their core, not analysis tools but knowledge management tools that can help to feed better and more meaningful data into our analysis cycles.
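The contrast between a predefined class per threshold and a threshold applied at query time can be sketched in a few lines of Python. This is a minimal illustration, not the representation shown in [Figure 4]: the `ex:` identifiers, the property name, and the three infants are all invented for the example, and a plain list of tuples stands in for an RDF store.

```python
# Hypothetical triples (subject, predicate, object); every name is illustrative.
GESTATIONAL_AGE_WEEKS = "ex:gestationalAgeInWeeks"

triples = [
    ("ex:infant1", GESTATIONAL_AGE_WEEKS, 38),
    ("ex:infant2", GESTATIONAL_AGE_WEEKS, 35),
    ("ex:infant3", GESTATIONAL_AGE_WEEKS, 33),
]


def participants_at_least(triples, min_weeks):
    """Return subjects whose recorded gestational age is >= min_weeks."""
    return sorted(
        s for (s, p, o) in triples
        if p == GESTATIONAL_AGE_WEEKS and o >= min_weeks
    )


# A roughly corresponding SPARQL pattern (prefix and property are assumptions):
#   SELECT ?infant WHERE {
#     ?infant ex:gestationalAgeInWeeks ?weeks . FILTER(?weeks >= 36)
#   }
nows_eligible = participants_at_least(triples, 36)
other_study_eligible = participants_at_least(triples, 34)
```

The same stored values serve both thresholds; only the query parameter changes, and no class needs to be added to the ontology for the second study.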

Date of birth is an extremely common data element in clinical data and clinical research data. It is also a data element whose meaning is highly contextual. While some data repositories specify that this is the patient’s or the participant’s date of birth, we still regularly find “date of birth” alone as the form or question prompt. Strictly speaking, this practice is semantically ambiguous, since we can only know from context that what is meant is the date of birth of the patient and not, say, the date of birth of the healthcare provider. Typically, however, the context is sufficient to resolve the ambiguity. In the data for the NOWS study, it is relevant to specify that this is the date of birth of the infant, who is the NOWS participant, and not the date of birth of the mother. The latter is also relevant, as NOWS collects numerous data elements related to the mother’s medical history, such as history of opioid use. The RDF presented in [Figure 5] shows how these issues are disambiguated.

Fig. 5 SWT representation of BRTHDTC from [Table 1].

For this data element, we also considered the route rejected for the previous example (INGAGE), i.e., creating a class that captures a NOWS participant’s date of birth. For the same reasons as explained previously, we chose not to do so. Instead, we ensured that the RDF data contain all elements needed to retrieve that information with a SPARQL query that matches the corresponding pattern of triples.
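To make the disambiguation concrete, the following Python sketch attaches each date of birth to the person it belongs to and retrieves dates by the role of that person. It is a simplified stand-in for the pattern in [Figure 5]: the class and property names (`ex:NowsParticipant`, `ex:Mother`, `ex:dateOfBirth`) and the dates are hypothetical.

```python
from datetime import date

# Hypothetical triples; identifiers and dates are invented for illustration.
triples = [
    ("ex:infant1", "rdf:type", "ex:NowsParticipant"),
    ("ex:mother1", "rdf:type", "ex:Mother"),
    ("ex:mother1", "ex:motherOf", "ex:infant1"),
    ("ex:infant1", "ex:dateOfBirth", date(2018, 3, 14)),
    ("ex:mother1", "ex:dateOfBirth", date(1990, 7, 2)),
]


def dates_of_birth(triples, role):
    """Map each individual of the given type to its date of birth."""
    typed = {s for (s, p, o) in triples if p == "rdf:type" and o == role}
    return {
        s: o for (s, p, o) in triples
        if p == "ex:dateOfBirth" and s in typed
    }


participant_births = dates_of_birth(triples, "ex:NowsParticipant")
mother_births = dates_of_birth(triples, "ex:Mother")
```

Because the date is tied to an individual whose type is stated explicitly, a bare “date of birth” prompt no longer has to be interpreted from context.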

[Figure 6] shows a representation of information about the drugs used in the transferring hospital to manage the infant’s neonatal opioid withdrawal symptoms. The way this data element is set up, with discrete answer options apart from the ubiquitous “other” and “unknown”, provides data that can easily be semantically enriched by linking the information to existing controlled vocabularies and terminologies. The advantage of a re-representation like the one in [Figure 6] lies in a better chance of maintaining semantic integrity when the data is integrated with data from other sources that use a different level of granularity for drug information or a different terminology or controlled vocabulary. Using Chemical Entities of Biological Interest (ChEBI) identifiers [66] for the active ingredients allows the integration of data from the NOWS study with data that reports drug products, using the Drug Ontology (DRON) [67] [68] [69] as a bridge.

Fig. 6 SWT representation of INMEDSUSD from [Table 1].
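A minimal Python sketch of this enrichment step is given below. The response options are those listed for INMEDSUSD in [Table 1]; the term identifiers are placeholders rather than real ChEBI identifiers, and the predicate `ex:treatedWithIngredient` is an assumption made for the example.

```python
# Placeholder term identifiers; a real mapping would use ChEBI IRIs instead.
INGREDIENT_TERMS = {
    "Morphine": "ex:ingredient/morphine",
    "Phenobarbital": "ex:ingredient/phenobarbital",
    "Methadone": "ex:ingredient/methadone",
    "Buprenorphine": "ex:ingredient/buprenorphine",
    "Clonidine": "ex:ingredient/clonidine",
}


def medication_triples(infant, checked_options):
    """Turn checked boxes into triples; collect options that have no term."""
    triples, unmapped = [], []
    for option in checked_options:
        term = INGREDIENT_TERMS.get(option)
        if term is None:
            # "Unknown" and "Other" carry no ingredient to link to.
            unmapped.append(option)
        else:
            triples.append((infant, "ex:treatedWithIngredient", term))
    return triples, unmapped


linked, not_linked = medication_triples("ex:infant1", ["Morphine", "Other"])
```

Keeping the unmapped options separate, rather than forcing them onto a term, avoids asserting an ingredient that the form did not actually record.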


4 Discussion

The projects described above aim to enhance pre-existing data by crafting detailed semantic representations based on axiomatically rich ontologies, and using those to re-represent these data. By building the semantics directly into our representations of the data using freely available open biomedical ontologies, we make these data understandable and useable, both to researchers and software, including software that performs automated reasoning to support producing new inferences about the data.

In the PRISM case, this work made available key information about TCIA collections that was previously not retrievable. Additionally, it supports combining similar information across collections, for instance clinical data about imaging subjects, which provides essential context for understanding and analyzing the disease depicted in these images. [Figure 7] shows a SPARQL query and results illustrating this. The query retrieves identifiers across two head and neck cancer collections for records whose subjects have a “positive HPV diagnosis” and have also been “diagnosed with cancer of the oropharynx”. Prior to this effort, a researcher interested in investigating HPV diagnoses and tumor images in head and neck cancer cases would have had to navigate a wiki page, download separate spreadsheets, and work out how to interpret and query each of those spreadsheets in order to make combined use of these data. Removing these steps is already a substantial advantage for cohort identification that includes clinical data in TCIA. To further facilitate this type of investigation, work is ongoing within this project to produce a user-friendly interface that will allow investigators to search and access this semantically integrated data without requiring any knowledge of ontologies, query languages, or other semantic web technologies.

Fig. 7 SPARQL query with results across head and neck cancer collections for individuals with HPV and cancer of the oropharynx
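The kind of cross-collection retrieval described here can be approximated in Python as follows. This sketch is not the query from [Figure 7]: it assumes that records from both collections have already been re-represented with a shared pattern, and the record identifiers, the `ex:hasDiagnosis` predicate, and the diagnosis terms are hypothetical.

```python
# Hypothetical, already harmonized triples from two collections.
triples = [
    ("ex:collectionA/rec1", "ex:hasDiagnosis", "ex:PositiveHpvDiagnosis"),
    ("ex:collectionA/rec1", "ex:hasDiagnosis", "ex:OropharynxCancer"),
    ("ex:collectionA/rec2", "ex:hasDiagnosis", "ex:OropharynxCancer"),
    ("ex:collectionB/rec7", "ex:hasDiagnosis", "ex:PositiveHpvDiagnosis"),
    ("ex:collectionB/rec7", "ex:hasDiagnosis", "ex:OropharynxCancer"),
    ("ex:collectionB/rec9", "ex:hasDiagnosis", "ex:PositiveHpvDiagnosis"),
]


def records_with_all(triples, required):
    """Return records that carry every diagnosis in the required set."""
    diagnoses = {}
    for s, p, o in triples:
        if p == "ex:hasDiagnosis":
            diagnoses.setdefault(s, set()).add(o)
    return sorted(s for s, found in diagnoses.items() if required <= found)


cohort = records_with_all(
    triples, {"ex:PositiveHpvDiagnosis", "ex:OropharynxCancer"}
)
```

Once source-specific encodings have been mapped to one representation, a single pattern retrieves matching records from both collections without per-spreadsheet logic.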

Regarding the neonatal opioid withdrawal project, our effort was exploratory. Using the study form as an example, the goal of exploring an SWT-based knowledge management approach is to assess:

  1) The feasibility of representing, curating, and extracting all information relevant to reporting;

  2) The extent to which pre-existing ontologies provide coverage for the representations necessary;

  3) Reusability of representation patterns across studies;

  4) Flexibility and maintainability of knowledge representation against evolving needs and objectives of studies.

The project to date has shown that SWT-oriented data representation is able to adequately represent the information and data at hand. In comparison to an approach that rests exclusively on the definition of common data elements, the SWT-enhanced approach provides knowledge representation capabilities, such as representing context. The coverage of pre-existing ontologies has been very good. Most of the concepts that did not already exist in OBO Foundry ontologies were study-specific. At this point we cannot make a statement about 3) and 4) with any certainty. Regarding 3), however, we can report that the generic representation of many aspects and the addition of only a few study-specific entities suggest that reuse will be an option across studies and will create synergies for data management.



5 Conclusion

As illustrated by the examples discussed above from our ongoing projects working with semantic representations of biomedical data, mapping data elements directly to ontology terms is often not a feasible solution for representing the meaning of data in a useful way. Even where such one-to-one mappings of data elements to newly created ontology terms are feasible, they come at the additional cost of increased reasoning over the ontology. Furthermore, they put an unnecessary burden on developers to pre-coordinate information that could instead be easily aggregated from the knowledge graph at the time a query is run.

For these reasons, we argue that progress in the practice of representing biomedical data with ontologies requires a shift in thinking about how these resources are to be used: rather than mapping data elements directly to classes or individuals in an ontology, we work to always provide a full graph representation of the patterns of elements involved in relaying the meaning behind the data elements. This allows developers to begin with a set of data elements to identify the elements needed in their ontology and allows straightforward creation of RDF based on instance data coming from tabular and other less knowledge-structured formats. Toward that end, one of the authors’ ongoing projects establishes a web repository of ontology use patterns built on SWT to promote open sharing and discussion of applying such patterns to represent biomedical instance data.
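As a rough illustration of creating graph data from tabular instance data by way of a reusable pattern, the Python sketch below expands one row into several triples. The pattern, the `ex:` names, and the row values are assumptions made for the example and do not reproduce the patterns in the authors’ repository.

```python
# A hypothetical reusable pattern; placeholders in braces are filled per row.
BIRTH_PATTERN = [
    ("{infant}", "rdf:type", "ex:NowsParticipant"),
    ("{infant}", "ex:hasBirth", "{birth}"),
    ("{birth}", "ex:occurredOn", "{BRTHDTC}"),
]


def instantiate(pattern, row):
    """Fill the pattern with one tabular row and return concrete triples."""
    bindings = dict(row)
    # Mint identifiers for the nodes that the flat row leaves implicit.
    bindings["infant"] = f"ex:infant/{row['id']}"
    bindings["birth"] = f"ex:birth/{row['id']}"
    return [
        tuple(part.format(**bindings) for part in triple)
        for triple in pattern
    ]


row = {"id": "001", "BRTHDTC": "2018-03-14"}
row_triples = instantiate(BIRTH_PATTERN, row)
```

A single cell in the source table becomes a small subgraph, which is the difference between a full pattern-based representation and a one-to-one mapping of the data element to a single term.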



Acknowledgements

The PRISM project presented in this paper has been funded in part with federal funds from the National Cancer Institute, National Institutes of Health under Contract No. HHSN261200800001E. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government. Under this contract, the University of Arkansas is funded by Leidos Biomedical Research subcontract 16X011. Funding was also provided by U24CA215109 and 3U24CA215109-02S1.

The DCOC work presented in this paper was funded by grant number U24OD024957 from the National Institutes of Health Office of the Director through the ECHO program.

Sarah Bost participated in literature search, writing, and technical editing of this paper.

Tracy S. Nolan assisted with data retrieval and data interpretation in the PRISM research presented in this paper. The authors thank the editors and external reviewers for their valuable input.




  
Fig. 1 Example of using RDF triples and axioms to represent knowledge in the pharmacology domain. The transparent boxes represent RDF subjects and objects, the lines represent predicates between subject and object. The gray box shows a necessary and sufficient condition for being a member of the class 'Proton-pump inhibitor'. The dotted line represents an inferred relationship.
Fig. 2 Data excerpts from two head and neck cancer collections
Fig. 3 RDF representation of positive HPV status for head and neck cancer records. “Positive” or “+” map to this representation. “Negative” or “-” will map to a representation with a human undergoing the HPV assay, not establishing the existence of a papillomavirus infectious disease.