CC BY-NC-ND 4.0 · Appl Clin Inform 2018; 09(02): 377-390
DOI: 10.1055/s-0038-1651497
Research Article
Schattauer GmbH Stuttgart

# Integrating Multimodal Radiation Therapy Data into i2b2


Eric Zapletal, PhD
Department of Medical Informatics, Biostatistics, and Public Health
Georges Pompidou European Hospital, Assistance Publique-Hôpitaux de Paris, Paris Descartes Faculty of Medicine
Paris
France

### Publication History

12 December 2017

07 April 2018

Publication Date:
30 May 2018 (online)

### Abstract

Background Clinical data warehouses are now widely used to foster clinical and translational research and the Informatics for Integrating Biology and the Bedside (i2b2) platform has become a de facto standard for storing clinical data in many projects. However, to design predictive models and assist in personalized treatment planning in cancer or radiation oncology, all available patient data need to be integrated into i2b2, including radiation therapy data that are currently not addressed in many existing i2b2 sites.

Objective To use radiation therapy data in projects related to rectal cancer patients, we assessed the feasibility of integrating radiation oncology data into the i2b2 platform.

Methods The Georges Pompidou European Hospital, a hospital from the Assistance Publique – Hôpitaux de Paris group, has developed an i2b2-based clinical data warehouse of various structured and unstructured clinical data for research since 2008. To store and reuse various radiation therapy data—dose details, activities scheduling, and dose-volume histogram (DVH) curves—in this repository, we first extracted raw data by using some reverse engineering techniques and a vendor's application programming interface. Then, we implemented a hybrid storage approach by combining the standard i2b2 “Entity-Attribute-Value” storage mechanism with a “JavaScript Object Notation (JSON) document-based” storage mechanism without modifying the i2b2 core tables. Validation was performed using (1) the Business Objects framework for replicating vendor's application screens showing dose details and activities scheduling data and (2) the R software for displaying the DVH curves.

Results We developed a pipeline to integrate the radiation therapy data into the Georges Pompidou European Hospital i2b2 instance and evaluated it on a cohort of 262 patients. We were able to use the radiation therapy data on a preliminary use case by fetching the DVH curve data from the clinical data warehouse and displaying them in an R chart.

Conclusion By adding radiation therapy data into the clinical data warehouse, we were able to analyze radiation therapy response in cancer patients. We leveraged the i2b2 platform to store radiation therapy data, including detailed information such as the DVH, and to create new ontology-based modules that provide research investigators with a wider spectrum of clinical data.


### Background and Significance

Clinical data warehouses (CDWs) have proven their efficiency for fostering translational research, and the research opportunities opened by secondary use of such clinical data have been demonstrated, e.g., at Vanderbilt[1] or Harvard.[2] Thanks to its early adoption of an electronic healthcare record (EHR) system in 2000,[3] the Georges Pompidou European Hospital (Hôpital Européen Georges Pompidou – HEGP) started a repository of structured and unstructured clinical data for care and research with the Informatics for Integrating Biology and the Bedside (i2b2) platform in 2008.[4] Since then, numerous data sources have been integrated into the HEGP CDW.[5]

In the early years of the HEGP CDW project, priority was given to the integration of data that were both available in a standard format and frequently needed, such as demographic data, biology results, and medical activity codes (diagnosis, procedures, Diagnosis Related Group [DRG], etc.). Then, new data sources were added to support projects with different needs (clinical reports, EHR structured forms, etc.). More specifically, data related to cancer patients are of special interest for the CAncer Research and PErsonalized Medicine (CARPEM) program.[6] In 2012, the French National Cancer Institute (INCa) granted eight SIRIC (Site de Recherche Intégré sur le Cancer in French, or Integrated Cancer Research Site) labels in France. The SIRICs' goals are to provide new operational resources to oncology research, to optimize and accelerate the production of knowledge, and to favor knowledge dissemination and application in patient care. CARPEM is one of these eight SIRICs, with a focus on digestive, endocrine, head and neck, hematological, lung, ovarian, and renal tumors. More generally, the multidimensional characterization of cancer patients is the first step toward achieving precision medicine and making decisions based on far more complex diagnostic and prognostic categories than are currently in use. Multivariate descriptors of cancer patients will allow a better understanding of the disease and the development of new data-derived decision support tools to assist in everyday patient care.[7] [8] With the objective of designing predictive models and assisting in personalized treatment planning, all available patient data need to be integrated and explored. However, the variables come from multiple fields such as genomics, imaging, biology, surgery, medical oncology, and radiation oncology.[9] To support cross-disciplinary research objectives leading to the development of personalized medicine, HEGP has integrated chemotherapy data in recent years and is currently developing a program dedicated to personalized radiation oncology therapy.

The HEGP CDW is based on i2b2, an open source standard system developed by Harvard Medical School,[10] which has been adopted by more than 130 academic hospitals around the world.[11] However, while the core infrastructure of i2b2 is widely shared and improved by the community, Extraction/Transform/Load (ETL) modules are still mostly developed by each hospital to load data from their local information systems into i2b2, an approach that was used at HEGP. Some data sources used to populate the HEGP CDW were hosted in applications provided by private software vendors, and others were hosted in applications provided by the Assistance Publique – Hôpitaux de Paris (AP-HP) institution to which HEGP belongs. For the latter, documentation and technical support were readily available, but for the former, little or no technical support was available. Therefore, a significant amount of time was spent analyzing the source data storage model to export the required observations.

Data sources have been imported in the CDW thanks to the i2b2 generic data storage model.

This generic storage model enables fast and efficient queries through an easy-to-use Web graphical interface dedicated to clinicians (the i2b2 Web client). However, this simplicity has a cost: every data source must be heavily transformed to fit the i2b2 data model. For some complex data sources, this transformation must be carefully analyzed because it has an impact on the way the data are later accessed.


### Objective

Hospital data related to cancer treatment are mainly contained in the core electronic clinical record, but are not limited to it. Radiation therapy data are usually produced and stored in dedicated systems and contain information in different formats. Such complex data need to be integrated with other types of individual data in CDWs. However, the integration of radiation therapy data into i2b2 is an issue currently not addressed in referenced i2b2 sites.[12] In this article, we present the method developed to integrate radiation therapy data in the i2b2 CDW and its implementation at the HEGP.


### Methods

To integrate the radiation therapy data into the HEGP CDW, the actions listed below (also shown in [Fig. 1]) were performed.

#### Selection of the Items to Integrate in the CDW

There are currently several treatment planning and record-and-verify systems for radiation oncology. Among these, VARIAN (ARIA and Eclipse) is used at HEGP. The selection of the items to integrate was established with a radiation oncology expert by analyzing the screens of VARIAN ARIA and Eclipse applications where data of interest were displayed (see Step “A” in [Fig. 1]). Four domains were retained:

1. “Dose details” domain

2. “Activities scheduling” domain

3. “Dose-volume histogram (DVH) curves” domain

4. “Couch correction” domain

These features, digitally recorded during treatment planning and delivery, are necessary to predict treatment outcomes (both efficacy and toxicity) in any predictive model.[8] [13]

The structures and treatment plans associated with the DVH curves were selected, but the images were not selected for integration into the i2b2 repository. Images are already available via the hospital picture archiving and communication system (PACS) architecture and are not yet supported in the standard i2b2 data repository storage model, although some projects, such as the mi2b2 project,[14] allow fetching images from a remote PACS.

The pilot study has been limited to a patient cohort of 262 individuals. This pilot study was approved by the Institutional Review Board (IRB) and ethics committee CPP Ile-de-France II (IRB Committee # 00001072, study reference # CDW_2015_0024).


#### Analysis of the Source System

A “backup” of the VARIAN/ARIA production database was installed on a specific computer to facilitate its analysis (see step “B” in [Fig. 1]), and an HTML documentation of the source storage model was generated by the SchemaSpy tool[15] (see step “E” in [Fig. 1]).

The VARIAN/Eclipse software also offers an application programming interface (ESAPI): this Microsoft .NET package gives access to treatment data such as plans, images, doses, structures, and DVHs, and it is currently available through a Web site[16] hosted by VARIAN.


#### Radiation Therapy Data Extraction into the CDW Staging Area

From the analysis of the source storage model, two SQL procedures were developed for the “Dose details” and “Activities scheduling” domains. These two SQL procedures have been tested on the VARIAN/ARIA backup database.

For the “DVH curves” domain, a C# template script using the ESAPI package was retrieved from the VARIAN Web site and modified to suit the project needs to extract the DVH curves (see step “C” in [Fig. 1]). This C# export script has been tested and validated on a dedicated VARIAN/Eclipse test workstation including the whole VARIAN software suite but running a dedicated patient database not linked to the daily care process (see step “C'” in [Fig. 1]).

Together with the primary selected items, we decided to include as many “related data” items as possible (annotated by a “+” symbol in the following enumerations):

For the “Dose details” domain, we extracted:

• Prescribed dose

• + Course ID

• + Plan Setup ID

• + Reference Point ID

• + Total Dose Limit

• + Daily Dose Limit

• + Session Dose Limit

• + Dose Delta

For the “Activities scheduling” domain, we extracted:

• Date/Time of the activity

• Duration of the activity

• Text comment entered by the physician during the activity

For the “DVH curves” domain, we extracted:

• DVH Curve vector:

• + Course ID

• + Anatomic structure ID

• + Volume of the affected anatomic structure

• + Coverage

• + Minimum dose

• + Maximal dose

• + Mean dose

• + Median dose

• + Standard deviation dose

• + Sampling coverage

To ensure integrity and quality prior to the integration into i2b2, these data were exported in the staging area of the HEGP CDW with Talend Open Studio scripts[17] and additional PHP scripts.

#### Talend Open Studio Scripts

The two SQL procedures developed for the “Dose details” and “Activities scheduling” domains were integrated into the Talend Open Studio scripts to export the data of these domains from the radiation therapy backup database into the CDW staging area (see step “F” in [Fig. 1]).


#### PHP Scripts

PHP scripts were used to import the DVH files (exported with the C# script) into the CDW staging area. This feature was not directly implemented in the C# script to minimize the dependencies of the global ETL workflow toward proprietary technologies such as the .NET framework (see step “G” in [Fig. 1]).


#### Validation of the Data Imported in the Staging Area

The validation of the imported data has been a key issue in the whole integration process, requiring specific developments not directly used by the ETL modules. The data sets from the “Dose details” and “Activities scheduling” domains were validated by creating two Business Objects (BO)[18] dashboards replicating the screens of the vendor's application (see step “D” in [Fig. 1]). The rationale of this method is that BO dashboards are built on an intermediate layer called a BO “Universe.” The design of a BO “Universe” requires a manual extraction of the relevant relationships between objects in the source data (see [Fig. 2]). Therefore, the BO dashboards may be seen as proof of a correct understanding of the source data model: if the BO dashboards display the same content as the Eclipse application, then the relationships between objects in the source data model have been correctly interpreted. Furthermore, the tool used to create the BO dashboards is also a SQL generator, and the artifacts generated by this tool were used to validate the two SQL procedures designed to extract the data. Then, comparing the Eclipse application screenshots made during data selection ([Fig. 3]) with the BO dashboards ([Fig. 4]) for some selected patients made it possible to validate the inner structure of the data source model inferred from the analysis step. The data set from the “DVH curve” domain was validated by developing an R[19] script displaying the curve data extracted from the CDW (see step “K” in [Fig. 1]).

These validation steps were mainly manual processes, as data from only a few patients were used in the screen comparisons with the BO dashboards and the R script. Considering the complexity of these two data sets, it was not possible to perform an automatic validation during this preliminary study.


#### The Generic i2b2 Data Storage Model

The generic i2b2 data storage model is designed around a central facts table (OBSERVATION_FACT) that stores all the observations in an Entity-Attribute-Value (EAV) model; five additional dimensional tables are used to precisely qualify the observations.[20] In the i2b2 OBSERVATION_FACT table, observation contents are encoded through a small number of columns (the other columns are links to the dimensional tables, secondary qualifiers, or technical timestamps):

• Valtype_cd: stores the type of the observation content (string, number, or text)

• Tval_char: stores the value of a string-based observation

• Nval_num & Units_cd: store the value and the unit of a number-based observation

• Observation_blob: stores the value of a text-based observation
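As an illustration only, the mapping of a value onto these columns can be sketched as follows. The helper name `encode_value`, the `as_blob` flag, and the `'B'` type code for blob content are assumptions made for this sketch; the `'N'` and `'T'` codes and the `'E'` operator stored in TVAL_CHAR for numeric values follow the conventions visible in Table 1.

```python
def encode_value(value, units=None, as_blob=False):
    """Return the value-carrying columns of one hypothetical OBSERVATION_FACT row."""
    row = dict.fromkeys(
        ["VALTYPE_CD", "TVAL_CHAR", "NVAL_NUM", "UNITS_CD", "OBSERVATION_BLOB"]
    )
    if isinstance(value, (int, float)):
        # Number-based observation: value in NVAL_NUM, unit in UNITS_CD.
        row.update(VALTYPE_CD="N", TVAL_CHAR="E", NVAL_NUM=float(value), UNITS_CD=units)
    elif as_blob:
        # Long text content goes to OBSERVATION_BLOB ('B' is an assumed type code).
        row.update(VALTYPE_CD="B", OBSERVATION_BLOB=value)
    else:
        # Short string-based observation: value in TVAL_CHAR.
        row.update(VALTYPE_CD="T", TVAL_CHAR=value)
    return row


dose_row = encode_value(45, units="Gy")
name_row = encode_value("aire iliaque g")
```

Each observation thus occupies exactly one row, whatever its type; only the columns that carry the value differ.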


#### “Dose Details” and “Activities Scheduling” Domains

Every observation in i2b2 is indexed by a set of concepts that are used by clinicians to build their queries. For the “Dose details” and “Activities scheduling” domains, three new concepts were created:

• “RTX:PRESCRIBEDDOSE”

• “RTX:ACTUALDOSE”

• “RTX:ACTIVITY”


#### Storing Radiation Therapy Structured Data as i2b2 Observations

Three levels of aggregation were actually used to model and store DVH data, as shown in [Fig. 5]:

1. A DVH: aggregation of contextual data (volume, coverage, minimum dose, etc.) and one optional curve data vector.

2. A curve data vector: aggregation of curve data points.

3. A point: aggregation of two coordinates.
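These three levels can be pictured with a small data structure. The sketch below is illustrative only: the class and field names are hypothetical, the (dose, volume) ordering of a point follows the R script in Table 2, and the numeric values are rounded from the Table 1 example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]   # level 3: two coordinates, (dose, volume)
CurveVector = List[Point]     # level 2: an ordered list of points


@dataclass
class DVH:                    # level 1: contextual data plus an optional curve
    structure_id: str
    volume: float
    coverage: float
    min_dose_gy: float
    max_dose_gy: float
    mean_dose_gy: float
    curve: Optional[CurveVector] = None


dvh = DVH(
    structure_id="aire iliaque g",
    volume=50.317,
    coverage=1.0,
    min_dose_gy=16.962,
    max_dose_gy=48.520,
    mean_dose_gy=44.449,
    curve=[(0.0, 50.317), (0.1, 50.317), (48.5, 0.014)],
)
```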

#### Standard i2b2 Techniques for Managing Structured Data

Although aggregation techniques are widely used in EHRs for structuring and displaying data (such as DVH), the i2b2 observations table does not provide aggregations.[29] It is possible to mitigate this limitation with two referenced techniques:

• Structuring the concepts ontology.[30]

• Use of concept modifiers.[31]

We considered that none of these approaches was suitable for the storage of the DVH because:

• The DVH vector is of variable size.

• Storing every X and Y coordinate as a separate “modifier-observation” would have led to a large storage overhead (a DVH curve often contains more than 600 points).
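A rough back-of-the-envelope comparison makes the overhead concrete. It assumes exactly 600 points per curve (the text gives this as a lower bound), one modifier row per coordinate, and the 1,644 DVH observations reported for the pilot cohort; the function names are hypothetical.

```python
POINTS_PER_CURVE = 600     # lower bound quoted in the text
ROWS_PER_POINT = 2         # assumption: one modifier row each for X and Y
DVH_OBSERVATIONS = 1_644   # DVH observations in the 262-patient sample


def modifier_rows(n_curves, points_per_curve=POINTS_PER_CURVE):
    """Fact rows needed if every coordinate were its own modifier-observation."""
    return n_curves * points_per_curve * ROWS_PER_POINT


def document_rows(n_curves):
    """Fact rows needed if each curve is kept as a single document."""
    return n_curves


print(modifier_rows(DVH_OBSERVATIONS))   # 1972800
print(document_rows(DVH_OBSERVATIONS))   # 1644
```

Under these assumptions the modifier approach needs about 1,200 times more fact rows for the pilot cohort alone.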


#### A New JSON Document-Based Approach for Managing Radiation Therapy Structured Data in i2b2

The JSON format was therefore chosen for the contextual and DVH data because it allows flexibility in storage while preserving data consistency, and because dedicated JSON packages provide indexing features.

JSON is widely used for storing objects in document-oriented databases[32] and NoSQL databases; CouchDB,[33] for example, provides schema-less storage of JSON-based items. Combined with parallel computation and incremental maintenance features, JSON databases scale well. For this reason, they have been used for the storage of genomic data for research purposes.[34] [35] However, in these contexts, JSON objects are not integrated into the i2b2 core database but stored in a dedicated database (CouchDB). Our approach is somewhat different since we store the JSON objects in the i2b2 core database so that they can be queried together with other data (demographic data, biology, drugs, etc.).

The contextual data and the DVH curves were then converted into JSON strings and stored in the OBSERVATION_BLOB column of the i2b2 facts table (OBSERVATION_FACT). This column may contain JSON data with a maximum size of 4 gigabytes - 1 (in the ORACLE 11g database).
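The storage step can be sketched as follows. This is not the HEGP load process: an in-memory SQLite table stands in for the ORACLE OBSERVATION_FACT table, only a few of its columns are reproduced, and the encounter number, patient number, and concept code `RTX:ILIAC_AREA_LEFT` are hypothetical.

```python
import json
import sqlite3

# Contextual data and a (shortened) curve for one anatomical structure.
dvh = {
    "volume": "50.3174409637117",
    "coverage": "1",
    "minDose": "16.962 Gy",
    "maxDose": "48.520 Gy",
    "meanDose": "44.449 Gy",
    "curveData": [[0, 50.317440963709], [0.1, 50.317440963709],
                  [48.5, 0.013851501133903]],
    "CourseId": "C1 RECTUM",
    "StructureId": "aire iliaque g",
}

conn = sqlite3.connect(":memory:")  # stand-in for the ORACLE database
conn.execute(
    """CREATE TABLE OBSERVATION_FACT (
           ENCOUNTER_NUM INTEGER, PATIENT_NUM INTEGER, CONCEPT_CD TEXT,
           VALTYPE_CD TEXT, TVAL_CHAR TEXT, OBSERVATION_BLOB TEXT)"""
)

# One fact row per DVH: the whole document is serialized into OBSERVATION_BLOB.
conn.execute(
    "INSERT INTO OBSERVATION_FACT VALUES (?, ?, ?, ?, ?, ?)",
    (1001, 42, "RTX:ILIAC_AREA_LEFT", "T", dvh["StructureId"], json.dumps(dvh)),
)

# Reading it back: the blob is fetched like any other column, then parsed.
(blob,) = conn.execute(
    "SELECT OBSERVATION_BLOB FROM OBSERVATION_FACT WHERE CONCEPT_CD LIKE 'RTX:%'"
).fetchone()
restored = json.loads(blob)
```

Because the document lives in an ordinary column of the facts table, the i2b2 core tables stay unchanged and the row remains reachable through the usual patient, encounter, and concept keys.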

The description of the radiation therapy observations in i2b2 is summarized in [Table 1].

Table 1

### Description of the i2b2 storage content for the radiation therapy data

| OBSERVATION_FACT columns | Actual dose | Prescribed dose | Activity scheduling | DVH curve |
|---|---|---|---|---|
| ENCOUNTER_NUM | Encounter/stay sequential number | Encounter/stay sequential number | Encounter/stay sequential number | Encounter/stay sequential number |
| PATIENT_NUM | Patient sequential number | Patient sequential number | Patient sequential number | Patient sequential number |
| CONCEPT_CD | 'RTX:ACTUALDOSE' | 'RTX:PRESCRIBEDDOSE' | 'RTX:ACTIVITY' | 'RTX:' + standardized name of the anatomical structure |
| PROVIDER_ID | '@' | '@' | '@' | '@' |
| START_DATE | PlanSetup.HstryDateTime | RTPlan.HstryDateTime | Scheduled Start Time | Structure.HstryDateTime |
| INSTANCE_NUM | 1 | 1 | 1 | 1 |
| VALTYPE_CD | 'N' | 'N' | 'N' | 'T' |
| TVAL_CHAR | 'E' | 'E' | 'E' | Original name of the anatomical structure in the radiation therapy software |
| NVAL_NUM | Sum(RefPointHstry.ActualDose) + RefPointLog.DoseDelta | RTPlan.PrescribedDose | The duration of the activity in minutes | |
| END_DATE | = START_DATE | = START_DATE | = START_DATE | = START_DATE |
| OBSERVATION_BLOB | `{"CourseId":"C1 RECTUM","PlanSetupId":"RECTUM.0","RefPointId":"ISO RECTUM","TotalDoseLimit":"46","DailyDoseLimit":"2","SessionDoseLimit":"2","ActualDose":"46"}` | `{"CourseId":"C1 RECTUM","PlanSetupId":"RECTUM","RefPointId":"PELVIS","PrescribedDose":"45"}` | Type of activity + text comment | `{"volume":"50.3174409637117","coverage":"1","minDose":"16.962 Gy","maxDose":"48.520 Gy","meanDose":"44.449 Gy","samplingCoverage":"0.999752750762631","medianDose":"45.973 Gy","stdDev":"4.80722597080866","curveData":[[0,50.317440963709],[0.1,50.317440963709],[0.2,50.317440963709],...,[48.4,0.37290217957042],[48.5,0.013851501133903]],"CourseId":"C1 RECTUM","StructureId":"aire iliaque g"}` |
| SOURCESYSTEM_CD | 'ARIA' | 'ARIA' | 'ARIA' | 'ARIA' |

Abbreviation: i2b2, Informatics for Integrating Biology and the Bedside.

Note: The “curveData” vector field in the OBSERVATION_BLOB column is truncated for readability purpose in the above example.
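To show how such a document can be consumed, the sketch below parses a shortened DVH blob. It assumes that dose fields are stored as strings of the form value, space, unit, as the Table 1 example suggests; the helper `parse_dose` is hypothetical.

```python
import json
from typing import Tuple


def parse_dose(text: str) -> Tuple[float, str]:
    """Split a string such as '44.449 Gy' into a number and a unit."""
    value, unit = text.split()
    return float(value), unit


# Shortened version of the DVH example: three curve points only.
blob = (
    '{"minDose":"16.962 Gy","maxDose":"48.520 Gy","meanDose":"44.449 Gy",'
    '"curveData":[[0,50.317440963709],[0.1,50.317440963709],'
    '[48.5,0.013851501133903]],"StructureId":"aire iliaque g"}'
)

record = json.loads(blob)
mean_dose, unit = parse_dose(record["meanDose"])

# Each curve point is a [dose, volume] pair; unzip them into two sequences.
doses, volumes = zip(*record["curveData"])
```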


#### The HEGP Generic Load Process

For each data source imported into the staging area, a set of ORACLE views has been designed to format the data and populate the i2b2 observations and concepts tables. Therefore, for the radiation therapy data source, an additional set of ORACLE views was designed (see step “F” in [Fig. 1]). These ORACLE views were integrated into the HEGP generic load process designed with the Talend Open Studio software suite[4] (see step “J” in [Fig. 1]).


#### Validation of the Radiation Therapy Data Imported in the i2b2 CDW

All the validation steps were conducted by a computer scientist (E.Z.) and a radiation therapy specialist (J.E.B.).

#### Validation of Prescribed Doses, Received Doses, and Activities Durations

A basic statistical check was performed to compare the data exported with the SQL procedures from the ARIA backup database and the data stored in the i2b2 repository: the count and average values (for the 262-patient cohort) of prescribed doses, received doses, and activities durations were computed (1) by using the SQL procedures on the ARIA backup database and (2) in the i2b2 CDW repository. For each of these three items, the values in (1) and (2) were identical.
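The principle of this check can be sketched as follows. The function names and the dose values are hypothetical, and the two lists stand in for the results of the source-side and warehouse-side queries.

```python
import math
from statistics import mean


def summarize(values):
    """Return the (count, average) pair used for the comparison."""
    return len(values), mean(values)


def same_summary(source, warehouse, tol=1e-9):
    """True when both sides have the same count and the same average."""
    n_src, avg_src = summarize(source)
    n_cdw, avg_cdw = summarize(warehouse)
    return n_src == n_cdw and math.isclose(avg_src, avg_cdw, abs_tol=tol)


# Hypothetical prescribed doses (Gy) read from each side; row order may differ.
source_doses = [45.0, 46.0, 50.4]
warehouse_doses = [46.0, 50.4, 45.0]

print(same_summary(source_doses, warehouse_doses))
```

Matching counts and averages detect missing or duplicated rows and systematic value shifts, but they do not prove row-by-row equality.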


#### Validation of the “DVH Curve” Domain

The “DVH curve” validation use case consisted of displaying in the R environment the DVH curves of randomly selected patients.

To achieve that, we first enabled the following extensions in R:

• DBI

• rJava

• RJDBC

• rjson

Then, we designed an R script, built on three basic steps (as shown in [Table 2]):

Table 2

### R script used to display DVH curves extracted from the CDW for a given patient defined by his encounter number

    library(DBI)
    library(rJava)
    library(RJDBC)
    library(rjson)

    # 1. A connection to the CDW is created with the JDBC driver
    drv <- JDBC("oracle.jdbc.OracleDriver", classPath = "/path/to/ojdbc6-11g.jar", " ")
    con <- dbConnect(drv, "jdbc:oracle:thin:@host:port:sid", "user", "password")

    # 2. A simple SQL query is used to fetch only DVH data for a given patient
    #    (defined by his encounter number nnnnnn)
    data <- dbGetQuery(con, "select tval_char, observation_blob from I2B2DEMODATA.OBSERVATION_FACT WHERE encounter_num = nnnnnn and concept_cd like 'RTX:%' AND concept_cd not in ('RTX:ACTUALDOSE', 'RTX:PRESCRIBEDDOSE', 'RTX:ACTIVITY')")

    # 3. A graph is created in R by transforming JSON formatted DVH data
    #    into native R objects
    attach(data)
    curveData <- apply(data[2], 1, fromJSON)
    colors <- rainbow(dim(data)[1])
    for (i in 1:dim(data)[1]) {
      dose <- sapply(curveData[[i]]$curveData, '[', 1)
      volume <- sapply(curveData[[i]]$curveData, '[', 2)
      if (i == 1) {
        plot(dose, volume, col = colors[i], type = "l", lty = 1)
      } else {
        lines(dose, volume, col = colors[i], lty = 1)
      }
    }
    legend("bottomleft", legend = data[,1], col = colors, lty = 1, cex = 0.7)

Abbreviations: CDW, clinical data warehouse; DVH, dose-volume histogram; JDBC, Java Database Connectivity; JSON, JavaScript Object Notation.

1. A connection is first created to the CDW.

2. A simple SQL query fetches the DVH data for a given patient (based on his encounter number).

3. The resulting graph is then created by transforming JSON formatted DVH data into native R objects.

The output of this R script is presented in [Fig. 6].


### Results

#### A Pipeline for Integrating Radiation Therapy Data into i2b2

We developed a pipeline to integrate the radiation therapy data into the HEGP i2b2 instance and evaluated it on a cohort of 262 patients. The volumetry of the integrated data is shown in [Table 3] and the overview of the pipeline is presented in [Fig. 1].

Table 3

### Volumetry of the radiation therapy data integrated in the HEGP i2b2 CDW with the initial 262 patients' sample

| i2b2 concept | Number of observations | Number of distinct patients | Total size of JSON objects |
|---|---|---|---|
| Prescribed dose | 791 | 246 | 75.2 kilobytes |
| Actual dose | 739 | 252 | 119 kilobytes |
| Activity | 7,631 | 262 | 197.2 kilobytes |
| DVH | 1,644 | 103 | 17.8 megabytes |
| Total | 10,805 | 262 | 18.2 megabytes |

Abbreviations: i2b2, Informatics for Integrating Biology and the Bedside; CDW, clinical data warehouse; DVH, dose-volume histogram; HEGP, Hôpital Européen Georges Pompidou; JSON, JavaScript Object Notation.


#### A New Radiation Therapy Ontology

A DVH is a curve modeled as a vector of points linked to an anatomical structure. To enable queries on specific DVH structures in the CDW, an open ontology dedicated to radiation oncology structures (ROS) was designed. The ROS ontology (http://bioportal.bioontology.org/ontologies/ROS) has 417 classes, with a maximum of 14 children classes (average = 5), and is available online in Web Ontology Language format. The integration of the ROS ontology in the i2b2 Web client is presented in [Fig. 7].


#### A Hybrid Approach for Storing DVH Curve in i2b2

We have proposed a new format for storing DVH curves in i2b2 by using a document-based technique with JSON in the OBSERVATION_BLOB column of the i2b2 fact table without modifying the underlying i2b2 storage model.


#### Business Intelligence Tools for Validating Extracted Data Set

The radiation therapy BO Universe created for this project uses 25 tables from the source data model, linked to each other by 28 relations. The tables and the links in the BO Universe were specifically designed to validate the “Dose details” and “Activities scheduling” data sets, but they can be useful to create other dashboards for various purposes in other projects.


### Discussion

#### Toward Scalable Solutions Based on the Open i2b2 Standard

Mechanisms for obtaining high-quality, routine care, disease-specific, sharable data are required in translational research. One solution is to leverage the i2b2 open source software to create new ontology-based modules that provide research investigators with the whole spectrum of clinical data. Examples of previous works focused on pediatric chronic disease registries[36] and cancer-related genomic data.[37] [38] We have leveraged the i2b2 platform to store radiation therapy data, including detailed information such as the DVH, thanks to the i2b2 open standard.


#### Shared Algorithms

We were able to integrate radiation oncology data into i2b2. The availability of the VARIAN/ESAPI was a key for success of integrating the DVH data. The integration of the “Dose details” and “Activities scheduling” data was much more time consuming.

We developed this pipeline in the context of AP-HP, the largest hospital group in Europe, with five radiation oncology departments treating 8,000 patients each year. We focused on the system used at HEGP, but the model can be generalized to other treatment planning and record-and-verify systems if the data model is similar. In practice, the SQL and C# procedures used to extract the radiation therapy data from the vendor's application would have to be redesigned to fit a different data source model. The BO Universes that are connected to the radiation therapy backup database would also have to be redesigned since they are linked to the vendor's application data model. Unless vendors' applications expose a unified API to access internal data, the ETL process is often specific to the source, which makes this kind of task very time consuming. In the presented work, the only generic step occurs between the staging area and the i2b2 repository, where a set of ORACLE views formats each data source of the staging area into i2b2 standardized data flows.


#### Compatibility of the JSON Document-Based Storage with the i2b2 Architecture

The i2b2 data repository (also referred to as the Clinical Research Chart cell) is a component of the broader i2b2 architecture, which also defines a Web client for clinical users and a set of other cells that enable additional features such as Natural Language Processing, Correlation Analysis Plugin, etc. Therefore, any modification of the i2b2 data repository storage model should take into account compatibility with the other cells, and especially with the Web client, since it is certainly the most used cell. This component is able to natively display numeric or string data fetched from the i2b2 repository, but it would need to be modified to display the DVH curves. We have not started this task yet since we are still working on loading the complete radiation therapy data into the repository. However, the Web client features can be extended by using different techniques. The most straightforward one is based on the plugin extension mechanism of the i2b2 Web client.[39] Other more complex solutions could be derived from the SmartR project, which is an open source platform for interactive presentations of translational research data.[40]


#### Estimation of the Target Volumetry for the DVH Data

For our 262 patients' cohort, the size (in character string length) of the JSON objects (i.e., all the contextual data plus the DVH curve data) is ∼18.4 megabytes. With the hypothesis of a constant DVH allotment among the 14,000 patients in the radiation therapy software, the total size of the JSON DVH objects should be around 1 gigabyte as shown in [Table 4].

Table 4

### Estimation of the JSON objects volumetry for the entire radiation therapy database

| Domains | JSON objects size, sample (262 patients) | JSON objects size, entire database (14,000 patients) |
|---|---|---|
| DVH | 18.2 megabytes | 950.7 megabytes |
| Dose | 194 kilobytes | 10.1 megabytes |
| Activities | 197 kilobytes | 10.3 megabytes |
| Total | 18.2 megabytes | 971.1 megabytes |

Abbreviations: DVH, dose-volume histogram; JSON, JavaScript Object Notation.

As a comparison, the total size of all the clinical reports objects (that are already stored in the HEGP CDW) is around 17 gigabytes.

With this estimation of the expected target volumetry, we think it is not required to use a native NoSQL framework for the management of the radiation therapy data.
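The order of magnitude of this estimate can be reproduced with a simple linear extrapolation. The sketch below assumes a constant amount of data per patient, takes the per-concept sample sizes from Table 3 (17.8 megabytes for DVH, 75.2 + 119 kilobytes for doses, 197.2 kilobytes for activities), and assumes 1 megabyte = 1,024 kilobytes; the function name is hypothetical.

```python
SAMPLE_PATIENTS = 262
TOTAL_PATIENTS = 14_000
KB_PER_MB = 1024  # assumption about the unit convention


def extrapolate_mb(sample_size_mb):
    """Scale a sample size linearly to the whole patient population."""
    return sample_size_mb * TOTAL_PATIENTS / SAMPLE_PATIENTS


sample_mb = {
    "DVH": 17.8,
    "Dose": (75.2 + 119.0) / KB_PER_MB,
    "Activities": 197.2 / KB_PER_MB,
}

estimate_mb = {name: extrapolate_mb(size) for name, size in sample_mb.items()}
total_mb = sum(estimate_mb.values())

for name, size in estimate_mb.items():
    print(f"{name}: {size:.1f} MB")
print(f"Total: {total_mb:.1f} MB")
```

Under these assumptions the results land within about 1 megabyte of the Table 4 figures, and the total stays just under 1 gigabyte.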


#### Validation of the Data Set

The validation of the imported data set is a key step in every ETL process, especially when the data are complex such as in the radiation therapy context.

By using the BO platform, we were able to validate each element of the data set in a “real life” use case by comparing the BO reports with the user's application screen. Validating the data set only by looking at the table content would have been unsatisfactory.

The simplicity of the R script needed to display the DVH curves from the CDW content is an indication that the JSON format is well suited for storing the DVH data.


#### Perspectives

There are still several short- and long-term goals in this project. The “Couch Deviation” domain is not covered yet and we must find a strategy for fetching these data. We validated our approach on a cohort of 262 patients. It is now being extended to the whole VARIAN/ARIA data. Furthermore, the integration process requires a new offline VARIAN/ARIA copy database for dose details and activities scheduling update. We are also setting up new projects using this new data. For example, we are currently analyzing response to radiation therapy in T2–4 N0–1 rectal cancer patients, and we have integrated in our model genomic, clinical, and radiation therapy data. The work presented in this article is a significant step of the integration pipeline.

The method developed for this pilot project will be scaled up and used to integrate the data generated in all five AP-HP radiation oncology departments into the central AP-HP CDW (6.5 million patient records stored as of February 2017). Common data models and shared algorithms will reinforce (1) the central role played by the i2b2 CDW, and (2) the ability to mine cancer data and discover new markers in precision radiation therapy.

In recent years, a set of stakeholders (representing academia, industry, funding agencies, and scholarly publishers) have promoted the Findable, Accessible, Interoperable, and Reusable (FAIR) data principles.[41] Behind the FAIR principles is the notion that shared algorithms, tools, and workflows are needed to search for relevant data sources, to analyze the data sets, and to mine the data for knowledge discovery. Researchers wanting to share and reuse data, methods, and scientific results will benefit from the application of the FAIR principles. As research in oncology is moving toward more data-intensive science, one of the grand challenges is to facilitate knowledge discovery by assisting researchers in their access to, and integration and analysis of, all data derived from routine care databases. We have developed the ROS ontology and the integration pipeline presented in this article to provide a semantic framework in radiation therapy that the sources and the users could agree upon to facilitate and accelerate data-driven cancer research.


### Conclusion

We have been able to integrate and reuse multimodal radiation therapy data for a preliminary study in the i2b2 platform. These data cover three functional domains:

• Dose details (delivered doses and prescribed doses)

• Activities scheduling (start time and duration of treatments)

• DVH curves

We used the standard i2b2 storage paradigm (EAV) for the doses and the activities scheduling by creating new radiation therapy concepts and by associating the scalar values (doses or dates) with the new concepts. For the contextual data and the DVH curve data, we used JSON-formatted strings, which can easily be converted into operational objects in frameworks that researchers use daily (such as R). A new domain ontology has also been created to annotate the DVH observations in a consistent and standardized manner. Some artifacts designed for validation purposes (such as the BO Universe) could also be reused in other projects.
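The hybrid storage summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the rows only borrow a few column names from the i2b2 `observation_fact` table, and the concept codes (`ROS:PRESCRIBED_DOSE`, `ROS:DELIVERED_DOSE`, `ROS:DVH`), the JSON layout of the DVH document, the helper functions, and all numeric values are hypothetical, chosen for illustration only. Python is used here instead of R purely for convenience.

```python
import json
from bisect import bisect_right

# Hypothetical EAV rows: scalar doses go in nval_num, while the DVH curve
# is kept as a JSON string in observation_blob (assumed layout).
facts = [
    {"patient_num": 1, "concept_cd": "ROS:PRESCRIBED_DOSE",
     "valtype_cd": "N", "nval_num": 50.4, "observation_blob": None},
    {"patient_num": 1, "concept_cd": "ROS:DELIVERED_DOSE",
     "valtype_cd": "N", "nval_num": 48.6, "observation_blob": None},
    {"patient_num": 1, "concept_cd": "ROS:DVH",
     "valtype_cd": "B", "nval_num": None,
     "observation_blob": json.dumps({
         "structure": "Rectum",
         "dose_gy": [0, 10, 20, 30, 40, 50],
         "volume_pct": [100, 95, 80, 50, 20, 0],
     })},
]

def scalar_value(facts, patient_num, concept_cd):
    """Return the numeric value of the first matching EAV row, or None."""
    for f in facts:
        if f["patient_num"] == patient_num and f["concept_cd"] == concept_cd:
            return f["nval_num"]
    return None

def dvh_curve(facts, patient_num, concept_cd="ROS:DVH"):
    """Decode the JSON document into a list of (dose, volume) points."""
    for f in facts:
        if f["patient_num"] == patient_num and f["concept_cd"] == concept_cd:
            doc = json.loads(f["observation_blob"])
            return list(zip(doc["dose_gy"], doc["volume_pct"]))
    return []

def volume_at_dose(curve, dose):
    """Linearly interpolate the volume (%) receiving at least `dose` Gy."""
    doses = [d for d, _ in curve]
    if dose <= doses[0]:
        return curve[0][1]
    if dose >= doses[-1]:
        return curve[-1][1]
    i = bisect_right(doses, dose)
    (d0, v0), (d1, v1) = curve[i - 1], curve[i]
    return v0 + (v1 - v0) * (dose - d0) / (d1 - d0)

curve = dvh_curve(facts, patient_num=1)
print(scalar_value(facts, 1, "ROS:DELIVERED_DOSE"), volume_at_dose(curve, 35))
```

In this sketch the scalar observations stay queryable through ordinary concept-based filters, whereas the curve is only interpreted once the JSON string has been decoded on the analysis side.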


### Clinical Relevance Statement

We have leveraged the i2b2 platform to store radiation therapy data, including detailed information such as the DVH, to create new ontology-based modules that provide research investigators with a wider spectrum of clinical data.


### Multiple Choice Questions

1. When designing a database to support queries from various clinical research projects, which data model is best suited?

• The same data model as the patient healthcare record

• A set of specific data models dedicated to each patient data source (demographic data, biology results, prescriptions, clinical reports, etc.)

• A set of specific data models dedicated to each clinical research project

• A generic data model storage with no “source-oriented” nor “project-oriented” features

Correct answer: The correct answer is option d. The resources needed to extract patient data from their production environment are often so high that the cost can only be justified if the extracted data are available for many other uses. Any source- or project-specific solution would be an obstacle to the long-term reuse of the data. Moreover, the patient health care record data model is optimized for storing data for health care-oriented tasks (queries on a patient's past and current treatments, nurse planning displays, drug prescription controls, etc.), but it is not optimized for statistical queries such as “how many patients have been given this drug for these symptoms?” The only acceptable data model is a generic model that can handle data coming from various sources and serving various uses.

2. What is the most important benefit of having radiation therapy data in a clinical data warehouse?

• Enabling physicists of radiation therapy departments to compute statistics on their data

• Enabling researchers to combine radiation therapy data with other categories of data

• Enabling hospital managers to evaluate activities of radiation therapy departments

Correct answer: The correct answer is option b, enabling researchers to combine radiation therapy data with other categories of data. The rationale of a CDW is to gather data from different clinical sources into a coherent repository where data are accessed through standardized query axes such as (1) the patient axis (“what data belong to this patient?”), (2) the encounter axis (“what data belong to that encounter?”), (3) the time axis (“what data belong to that time frame?”), (4) the provider axis (“what data have been produced by that provider?”), and (5) the concept axis (“what data match this concept?”). By aggregating data from different sources in a coherent manner, the CDW enables queries with a wider scope than business-oriented software dedicated to a specific data source. On the other hand, the CDW cannot compete with such specific software when the query focuses on the health care process and activities (radiation therapy dose calculation, internal activity statistics, etc.).


None.

### Acknowledgment

The authors are thankful to Arnaud Bernard, Alain Fauconnet, and Odile Taugourdeau for their valuable support regarding access to radiation therapy material.

### Protection of Human and Animal Subjects

The study from which the data were extracted was approved by the IRB and ethics committee CPP Ile-de-France II (IRB Committee # 00001072, study reference # CDW_2015_0024). Patients' consent to participate in the study was implicit if refusal was not expressly stated. The HEGP CDW has been declared to the French CNIL regulatory commission for data privacy (# 1695855 v 0 ; 2013/08/28).