Background and Significance
Clinical data warehouses (CDWs) have proven their efficiency in fostering translational research, and the research opportunities opened by the secondary use of such clinical data have been demonstrated, e.g., at Vanderbilt[1] or Harvard.[2] Thanks to its early adoption of an electronic healthcare record (EHR) system in 2000,[3] the Georges Pompidou European Hospital (Hôpital Européen Georges Pompidou – HEGP) started a repository of structured and unstructured clinical data for care and research with the Informatics for Integrating Biology and the Bedside (i2b2) platform in 2008.[4] Since then, numerous data sources have been integrated into the HEGP CDW.[5]
In the early years of the HEGP CDW project, priority was given to the integration of data that were both available in a standard format and frequently needed, such as demographic data, biology results, and medical activity codes (diagnoses, procedures, Diagnosis Related Groups [DRGs], etc.). New data sources were then added to support projects with different needs (clinical reports, EHR structured forms, etc.). More specifically, data related to cancer patients are of special interest for the CAncer Research and PErsonalized Medicine (CARPEM) program.[6] In 2012, the French National Cancer Institute (INCa) granted eight SIRIC (Site de Recherche Intégré sur le Cancer in French, or Integrated Cancer Research Site) labels in France. The SIRICs' goals are to provide new operational resources to oncology research, to optimize and accelerate the production of knowledge, and to favor knowledge dissemination and application in patient care. CARPEM is one of these eight SIRICs, with a focus on digestive, endocrine, head and neck, hematological, lung, ovarian, and renal tumors. More generally, the multidimensional characterization of cancer patients is the first step toward precision medicine and toward making decisions based on far more complex diagnostic and prognostic categories than are currently in use. Multivariate descriptors of cancer patients will allow a better understanding of the disease and the development of new data-derived decision support tools to assist in everyday patient care.[7] [8]
With the objective of designing predictive models and assisting in personalized treatment planning, all available patient data need to be integrated and explored. However, the variables come from multiple fields such as genomics, imaging, biology, surgery, medical oncology, and radiation oncology.[9] To support the cross-disciplinary research objectives leading to the development of personalized medicine, HEGP has integrated chemotherapy data in recent years and is currently developing a program dedicated to personalized radiation oncology therapy.
The HEGP CDW is based on i2b2, an open source standard system developed by Harvard Medical School,[10] which has been adopted by more than 130 academic hospitals around the world.[11] However, while the core infrastructure of i2b2 is widely shared and improved by the community, Extraction/Transform/Load (ETL) modules are still mostly developed by each hospital to load data from their local information systems into i2b2, an approach that was also used at HEGP. Some data sources used to populate the HEGP CDW were hosted in applications provided by private software vendors, and others in applications provided by the Assistance Publique – Hôpitaux de Paris (AP-HP) institution, to which HEGP belongs. For the latter, documentation and technical support were easily available, but for the former, little or no technical support was available. Therefore, a significant amount of time was spent analyzing the source data storage model to export the required observations.
Data sources have been imported into the CDW thanks to the i2b2 generic data storage model.
This generic storage model enables fast and efficient queries through an easy-to-use Web graphical interface dedicated to clinicians (the i2b2 Web client). However, this simplicity has a cost: every data source must be heavily transformed to fit the i2b2 data model. For some complex data sources, this transformation must be carefully analyzed because it affects the way the data are later accessed.
Methods
To integrate the radiation therapy data into the HEGP CDW, the actions listed below (also shown in [Fig. 1 ]) were performed.
Fig. 1 The integration pipeline of the radiation therapy data into the Hôpital Européen Georges Pompidou (HEGP) Informatics for Integrating Biology and the Bedside (i2b2) clinical data warehouse.
Selection of the Items to Integrate in the CDW
There are currently several treatment planning and record-and-verify systems for radiation oncology. Among these, VARIAN (ARIA and Eclipse) is used at HEGP. The selection of the items to integrate was established with a radiation oncology expert by analyzing the screens of VARIAN ARIA and Eclipse applications where data of interest were displayed (see Step “A” in [Fig. 1 ]). Four domains were retained:
“Dose details” domain
“Activities scheduling” domain
“Dose-volume histogram (DVH) curves” domain
“Couch correction” domain
These features, digitally recorded during treatment planning and delivery, are necessary inputs to any model predicting treatment outcomes (both efficacy and toxicity).[8] [13]
The structures and treatment plans associated with the DVH curves were selected, but the images were not, for two reasons: they are already available via the hospital picture archiving and communication system (PACS), and they are not yet supported by the standard i2b2 data repository storage model (although some projects, such as the mi2b2 project,[14] allow fetching images from a remote PACS).
The pilot study has been limited to a patient cohort of 262 individuals. This pilot study was approved by the Institutional Review Board (IRB) and ethics committee CPP Ile-de-France II (IRB Committee # 00001072, study reference # CDW_2015_0024).
Analysis of the Source System
A “backup” of the VARIAN/ARIA production database was installed on a specific computer (see step “B” in [Fig. 1 ]), and HTML documentation of the source storage model was generated with the SchemaSpy tool[15] to facilitate the analysis of the source system (see step “E” in [Fig. 1 ]).
The VARIAN/Eclipse software also offers an application programming interface (ESAPI): this Microsoft .NET package gives access to treatment data such as plans, images, doses, structures, and DVHs, and it is currently available through a Web site[16] hosted by VARIAN.
Radiation Therapy Data Extraction into the CDW Staging Area
From the analysis of the source storage model, two SQL procedures were developed for the “Dose details” and “Activities scheduling” domains. These two SQL procedures have been tested on the VARIAN/ARIA backup database.
For the “DVH curves” domain, a C# template script using the ESAPI package was retrieved from the VARIAN Web site and modified to suit the project's needs for extracting the DVH curves (see step “C” in [Fig. 1 ]). This C# export script was tested and validated on a dedicated VARIAN/Eclipse test workstation that includes the whole VARIAN software suite but runs a dedicated patient database not linked to the daily care process (see step “C'” in [Fig. 1 ]).
Together with the primary selected items, we decided to include as much “related data” as possible (annotated by a “ + ” symbol in the following enumerations):
For the “Dose details” domain, we extracted:
Prescribed dose
Received dose
For the “DVH curves” domain, we extracted:
To ensure integrity and quality prior to the integration into i2b2, these data were exported into the staging area of the HEGP CDW with Talend Open Studio scripts[17] and additional PHP scripts.
Talend Open Studio Scripts
The two SQL procedures developed for the “Dose details” and “Activities scheduling” domains were integrated into the Talend Open Studio scripts to export the data of these domains from the radiation therapy backup database into the CDW staging area (see step “F” in [Fig. 1 ]).
PHP Scripts
PHP scripts were used to import the DVH files (exported with the C# script) into the CDW staging area. This feature was not implemented directly in the C# script in order to minimize the dependencies of the global ETL workflow on proprietary technologies such as the .NET framework (see step “G” in [Fig. 1 ]).
Validation of the Data Imported in the Staging Area
The validation of the imported data has been a key issue in the whole integration process, requiring specific developments not directly used by the ETL modules. The data sets from the “Dose details” and “Activities scheduling” domains were validated by creating two Business Objects (BO)[18] dashboards replicating the screens of the vendor's application (see step “D” in [Fig. 1 ]). The rationale of this method is that BO dashboards are built on an intermediate layer called a BO “Universe.” The design of a BO “Universe” requires a manual extraction of the relevant relationships between objects in the source data (see [Fig. 2 ]). Therefore, the BO dashboards may be seen as proof of a correct understanding of the source data model: if the BO dashboards display the same content as the Eclipse application, then the relationships between objects in the source data model have been correctly interpreted. Furthermore, the tool used to create the BO dashboards is also a SQL generator, and the artifacts generated by this tool were used to validate the two SQL procedures designed to extract the data. The comparison of the Eclipse application screenshots made during data selection ([Fig. 3 ]) with the BO dashboards ([Fig. 4 ]) on selected patients then made it possible to validate the inner structure of the source data model inferred during the analysis step. The data set from the “DVH curve” domain was validated by developing an R[19] script displaying the curve data extracted from the CDW (see step “K” in [Fig. 1 ]).
Fig. 2 The Business Objects Universe created to validate the integration of radiation therapy data into the Hôpital Européen Georges Pompidou (HEGP) Informatics for Integrating Biology and the Bedside (i2b2) clinical data warehouse (CDW).
Fig. 3 Partial view of the “Dose details” screen of the vendor's radiation therapy application.
Fig. 4 Business Objects dashboard created to validate the “Dose details” data integration into Informatics for Integrating Biology and the Bedside (i2b2). This dashboard is replicating the vendor's application screen.
These validation steps were mainly manual processes, as only a few patients' data were used in the screen comparisons with the BO dashboards and the R script; however, considering the complexity of these two data sets, an automatic validation was not feasible during this preliminary study.
Integration of the Radiation Therapy Data into i2b2
The Generic i2b2 Data Storage Model
The generic i2b2 data storage model is designed around a central fact table (OBSERVATION_FACT) that stores all the observations in an Entity-Attribute-Value (EAV) model; five additional dimension tables are used to precisely qualify the observations.[20] In the i2b2 OBSERVATION_FACT table, observation contents are encoded through a small number of columns (the other columns are links to the dimension tables, secondary qualifiers, or technical timestamps):
Valtype_cd: stores the type of the observation content (string, number, or text)
Tval_char: stores the value of a string-based observation
Nval_num & Units_cd: store the value and the unit of a number-based observation
Observation_blob: stores the value of a text-based observation
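As an illustration, the mapping of observations onto these value columns can be sketched as follows (a minimal Python sketch; the structure concept code and all values are hypothetical, and only the value-bearing columns are shown):

```python
# Sketch: how a number-based and a text-based observation populate the
# OBSERVATION_FACT value columns described above. Column names follow
# the i2b2 model; the example values are hypothetical.
import json

def make_fact(concept_cd, valtype_cd, **columns):
    """Build a dict mimicking the value columns of one OBSERVATION_FACT row."""
    row = {
        "CONCEPT_CD": concept_cd,
        "VALTYPE_CD": valtype_cd,   # 'N' = number, 'T' = text
        "TVAL_CHAR": None,
        "NVAL_NUM": None,
        "UNITS_CD": None,
        "OBSERVATION_BLOB": None,
    }
    row.update(columns)
    return row

# A number-based observation: a prescribed dose of 45 Gy.
prescribed = make_fact("RTX:PRESCRIBEDDOSE", "N",
                       TVAL_CHAR="E", NVAL_NUM=45, UNITS_CD="Gy")

# A text-based observation: a DVH stored as a JSON document in the blob,
# with the original structure name kept as the main value.
dvh = make_fact("RTX:PELVIS", "T",
                TVAL_CHAR="original structure name",
                OBSERVATION_BLOB=json.dumps({"volume": 50.3, "coverage": 1}))

print(prescribed["NVAL_NUM"], dvh["VALTYPE_CD"])
```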
Radiation Therapy i2b2 Concepts
“Dose Details” and “Activities Scheduling” Domains
Every observation in i2b2 is indexed by a set of concepts that are used by clinicians to build their queries. For the “Dose details” and “Activities scheduling” domains, three new concepts were created:
“RTX:PRESCRIBEDDOSE”
“RTX:ACTUALDOSE”
“RTX:ACTIVITY”
“DVH Curves” Domain
There are several domain-specific information systems for radiation oncology: Elekta (MOSAIQ), VARIAN (ARIA), Accuray (Multiplan and Tomotherapy Data Management System), and BrainLab (iPlan). Each of these treatment planning and record-and-verify systems has its own structure labeling, which is not consistent across platforms, making it difficult to extract and analyze dosimetric data. Uniform data integration therefore requires mapping these labels to an ontology: a set of common concepts that can be used, independently of the software, to represent medical knowledge, in our case anatomical and target volumes. There are currently around 440 biomedical ontologies. The most commonly used include the Systematized Nomenclature of Medicine (SNOMED),[21] the National Cancer Institute (NCI) Thesaurus,[22] the Common Terminology Criteria for Adverse Events (CTCAE),[23] and the Unified Medical Language System (UMLS) Metathesaurus.[24] These ontologies do not include specific radiation oncology terms, which led to the creation of the Radiation Oncology Ontology (ROO),[25] which reuses other ontologies and adds radiation oncology terms such as region of interest (ROI), target volumes (gross tumor volume [GTV], clinical target volume [CTV], planning target volume [PTV]), and DVHs. However, the ROO does not provide enough anatomical or target volume concepts for an easy use of routine practice data. For example, lymph node levels, which are essential for the planning of nodal CTVs in radiotherapy, are not included.[26] [27] Moreover, in the radiation therapy software, the names of the anatomical structures associated with the DVHs are manually entered, leading to heterogeneity in the labels. To address that issue and to enable semantic integration and a standard representation of anatomy tailored for radiation therapy, we created a new ontology dedicated to radiation oncology structures: the Radiation Oncology Structures (ROS) ontology.[28] We then mapped all the original terms entered by the users to the concepts of the ROS ontology. The mapping table has been integrated into the staging area with a Talend Open Studio script (see step “H” in [Fig. 1 ]). However, the original name of the anatomical structure is stored as the main value of the i2b2 observation, in the TVAL_CHAR column. As for the dose details, the curve data and the other contextual data are stored as a semistructured text field in the JavaScript Object Notation (JSON) format.
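The label-to-concept mapping step can be sketched as follows (a Python sketch; the labels and target concept codes below are invented for illustration, while the real mapping table targets ROS ontology concepts):

```python
# Illustrative mapping of heterogeneous, manually entered structure labels
# to a standardized concept code. The labels and target codes are invented;
# the real mapping table maps user-entered terms to ROS ontology concepts.
LABEL_TO_CONCEPT = {
    "aire iliaque g": "RTX:ILIAC_NODES_LEFT",   # hypothetical code
    "Aire iliaque G.": "RTX:ILIAC_NODES_LEFT",
    "ptv rectum": "RTX:PTV",                     # hypothetical code
}

def normalize(label):
    # strip whitespace and trailing dots, fold case
    return label.strip().rstrip(".").lower()

def map_structure(label):
    """Return the standardized concept code, keeping the original label as value."""
    table = {normalize(k): v for k, v in LABEL_TO_CONCEPT.items()}
    return {"concept_cd": table.get(normalize(label)), "tval_char": label}

print(map_structure("Aire iliaque G.")["concept_cd"])
```

The original, unmapped label is deliberately preserved alongside the concept code, mirroring how TVAL_CHAR keeps the user-entered structure name.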
Storing Radiation Therapy Structured Data as i2b2 Observations
Three levels of aggregation were used to model and store the DVH data, as shown in [Fig. 5 ]:
Fig. 5 Modelization of the dose-volume histogram (DVH) data integrated into the Hôpital Européen Georges Pompidou (HEGP) Informatics for Integrating Biology and the Bedside (i2b2) clinical data warehouse (CDW).
A DVH: aggregation of contextual data (volume, coverage, minimum dose, etc.) and one optional curve data vector.
A curve data vector: aggregation of curve data points.
A point: aggregation of two coordinates.
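These three levels can be sketched as one nested JSON document (a Python sketch with hypothetical values; the field names echo those used in the stored observations):

```python
# The three aggregation levels as one nested JSON document:
# level 3: a point aggregates two coordinates (dose, volume);
# level 2: a curve data vector aggregates points;
# level 1: a DVH aggregates contextual data plus one optional curve.
# All values are hypothetical.
import json

point = [0.1, 50.32]                            # level 3
curve = [[0.0, 50.32], point, [48.5, 0.01]]     # level 2
dvh = {                                         # level 1
    "volume": 50.32,
    "coverage": 1,
    "minDose": "16.962 Gy",
    "curveData": curve,                         # optional: may be absent
}

blob = json.dumps(dvh)        # the string stored in OBSERVATION_BLOB
restored = json.loads(blob)   # round-trip back to a native object
print(len(restored["curveData"]))
```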
Standard i2b2 Techniques for Managing Structured Data
Although aggregation techniques are widely used in EHRs for structuring and displaying data (such as DVHs), the i2b2 observation table does not provide aggregation.[29] It is possible to mitigate this limitation with two referenced techniques:
We considered that none of these approaches was suitable for the storage of the DVH because:
A New JSON Document-Based Approach for Managing Radiation Therapy Structured Data in i2b2
The JSON format was therefore chosen for the contextual and DVH data because it allows flexibility in storage while preserving data consistency and the indexing features of dedicated JSON packages.
JSON is widely used for storing objects in document-oriented databases[32] and NoSQL databases: CouchDB,[33] for example, provides schema-less storage of JSON-based items. Combined with parallel computation and incremental maintenance features, JSON databases offer valuable scalability; for this reason, they have been used for the storage of genomic data for research purposes.[34] [35] However, in these contexts, the JSON objects are not integrated into the i2b2 core database but stored in a dedicated database (CouchDB). Our approach is somewhat different since we store the JSON objects in the i2b2 core database so that they can be queried together with other data (demographic data, biology, drugs, etc.).
The contextual data and the DVH curves were then converted into JSON strings and stored in the OBSERVATION_BLOB column of the i2b2 fact table (OBSERVATION_FACT). This column may contain JSON data with a maximum size of 4 gigabytes minus 1 byte (in the Oracle 11g database).
The description of the radiation therapy observations in i2b2 is summarized in [Table 1 ].
Table 1
Description of the i2b2 storage content for the radiation therapy data
OBSERVATION_FACT column: value per concept (received dose / prescribed dose / activity scheduling / DVH curve)
ENCOUNTER_NUM: encounter/stay sequential number (all concepts)
PATIENT_NUM: patient sequential number (all concepts)
CONCEPT_CD: ‘RTX:ACTUALDOSE’ / ‘RTX:PRESCRIBEDDOSE’ / ‘RTX:ACTIVITY’ / ‘RTX:’ + standardized name of the anatomical structure
PROVIDER_ID: ‘@’ (all concepts)
START_DATE: PlanSetup.HstryDateTime / RTPlan.HstryDateTime / scheduled start time / Structure.HstryDateTime
INSTANCE_NUM: 1 (all concepts)
VALTYPE_CD: ‘N’ (numeric concepts) / ‘T’ (DVH curve)
TVAL_CHAR: ‘E’ (numeric concepts) / original name of the anatomical structure in the radiation therapy software (DVH curve)
NVAL_NUM: Sum(RefPointHstry.ActualDose) + RefPointLog.DoseDelta / RTPlan.PrescribedDose / duration of the activity in minutes / (not used for DVH curve)
END_DATE: = START_DATE (all concepts)
OBSERVATION_BLOB:
Received dose: {"CourseId":"C1 RECTUM","PlanSetupId":"RECTUM.0","RefPointId":"ISO RECTUM","TotalDoseLimit":46,"DailyDoseLimit":2,"SessionDoseLimit":2,"ActualDose":46}
Prescribed dose: {"CourseId":"C1 RECTUM","PlanSetupId":"RECTUM","RefPointId":"PELVIS","PrescribedDose":45}
Activity scheduling: type of activity + text comment
DVH curve: {"volume":50.3174409637117,"coverage":1,"minDose":"16.962 Gy","maxDose":"48.520 Gy","meanDose":"44.449 Gy","samplingCoverage":0.999752750762631,"medianDose":"45.973 Gy","stdDev":4.80722597080866,"curveData":[[0,50.317440963709],[0.1,50.317440963709],[0.2,50.317440963709],...,[48.4,0.37290217957042],[48.5,0.013851501133903]],"CourseId":"C1 RECTUM","StructureId":"aire iliaque g"}
SOURCESYSTEM_CD: ‘ARIA’ (all concepts)
Abbreviation: i2b2, Informatics for Integrating Biology and the Bedside.
Note: The “curveData” vector field in the OBSERVATION_BLOB column is truncated for readability purpose in the above example.
The HEGP Generic Load Process
For each data source imported into the staging area, a set of ORACLE views has been designed to format the data and populate the i2b2 observation and concept tables. Therefore, for the radiation therapy data source, an additional set of ORACLE views was designed (see step “F” in [Fig. 1 ]). These ORACLE views were integrated into the HEGP generic load process designed with the Talend Open Studio software suite[4] (see step “J” in [Fig. 1 ]).
Validation of the Radiation Therapy Data Imported in the i2b2 CDW
All the validation steps were conducted by a computer scientist (E.Z.) and a radiation therapy specialist (J.E.B.).
Validation of Prescribed Doses, Received Doses, and Activities Durations
A basic statistical check was performed to compare the data exported with the SQL procedures from the ARIA backup database and the data stored in the i2b2 repository: the count and average values (for the 262-patient cohort) of prescribed doses, received doses, and activity durations were computed (1) with the SQL procedures on the ARIA backup database and (2) in the i2b2 CDW repository. For each of these three items, the values in (1) and (2) were identical.
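This check can be sketched as follows (a Python sketch; the dose values are made up for illustration, and in the real setting the two lists come from the ARIA backup database and the i2b2 repository):

```python
# Sketch of the validation check: compute the count and average of a
# variable from the two extraction paths and verify they are identical.
# The dose values below are made up for illustration.
def count_and_avg(values):
    return len(values), sum(values) / len(values)

doses_from_source = [45.0, 50.4, 66.0, 45.0]   # (1) ARIA backup via SQL procedures
doses_from_i2b2 = [45.0, 50.4, 66.0, 45.0]     # (2) i2b2 CDW repository

# The validation passes when both paths yield the same count and average.
assert count_and_avg(doses_from_source) == count_and_avg(doses_from_i2b2)
print(count_and_avg(doses_from_i2b2))
```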
Validation of the “DVH Curve” Domain
The “DVH curve” validation use case consisted of displaying in the R environment the DVH curves of randomly selected patients.
To achieve that, we first enabled the following extensions in R:
Then, we designed an R script built on three basic steps (as shown in [Table 2 ]):
Table 2
R script used to display DVH curves extracted from the CDW for a given patient defined by his encounter number
A connection to the CDW is created with the JDBC driver
drv <- JDBC("oracle.jdbc.OracleDriver", classPath = "/path/to/ojdbc6-11g.jar", " ")
con <- dbConnect(drv, "jdbc:oracle:thin:@host:port:sid", "user", "password")
A simple SQL query is used to fetch only DVH data for a given patient (defined by his encounter number nnnnnn)
data <- dbGetQuery(con, "select tval_char, observation_blob from I2B2DEMODATA.OBSERVATION_FACT WHERE encounter_num = nnnnnn and concept_cd like 'RTX:%' AND concept_cd not in ('RTX:ACTUALDOSE', 'RTX:PRESCRIBEDDOSE', 'RTX:ACTIVITY')")
A graph is created in R by transforming JSON formatted DVH data into native R objects
attach(data)
curveData <- apply(data[2], 1, fromJSON)
colors = rainbow(dim(data)[1])
for (i in 1:dim(data)[1]) {
  dose <- sapply(curveData[[i]]$curveData, '[', 1)
  volume <- sapply(curveData[[i]]$curveData, '[', 2)
  if (i == 1) {
    plot(dose, volume, col = colors[i], type = "l", lty = 1)}
  else {lines(dose, volume, col = colors[i], lty = 1)}
}
legend("bottomleft", legend = data[,1], col = colors, lty = 1, cex = 0.7)
Abbreviations: CDW, clinical data warehouse; DVH, dose-volume histogram; JDBC, Java Database Connectivity; JSON, JavaScript Object Notation.
A connection is first created to the CDW.
A simple SQL query fetches the DVH data for a given patient (based on his encounter number).
The resulting graph is then created by transforming JSON formatted DVH data into native R objects.
The output of this R script is presented in [Fig. 6 ].
Fig. 6 Output of the R script displaying radiation therapy data extracted from the Hôpital Européen Georges Pompidou (HEGP) Informatics for Integrating Biology and the Bedside (i2b2) clinical data warehouse (CDW).
Results
A Pipeline for Integrating Radiation Therapy Data into i2b2
We developed a pipeline to integrate the radiation therapy data into the HEGP i2b2 instance and evaluated it on a cohort of 262 patients. The volumetry of the integrated data is shown in [Table 3 ] and an overview of the pipeline is presented in [Fig. 1 ].
Table 3
Volumetry of the radiation therapy data integrated in the HEGP i2b2 CDW for the initial 262-patient sample
i2b2 concept: number of observations / number of distinct patients / total size of JSON objects
Prescribed dose: 791 / 246 / 75.2 kilobytes
Actual dose: 739 / 252 / 119 kilobytes
Activity: 7,631 / 262 / 197.2 kilobytes
DVH: 1,644 / 103 / 17.8 megabytes
Total: 10,805 / 262 / 18.2 megabytes
Abbreviations: i2b2, Informatics for Integrating Biology and the Bedside; CDW, clinical data warehouse; DVH, dose-volume histogram; HEGP, Hôpital Européen Georges Pompidou; JSON, JavaScript Object Notation.
A New Radiation Therapy Ontology
A DVH is a curve modeled as a vector of points linked to an anatomical structure. To enable queries on the DVHs of specific structures in the CDW, an open ontology dedicated to radiation oncology structures was designed: the ROS ontology (http://bioportal.bioontology.org/ontologies/ROS ) has 417 classes, with a maximum of 14 child classes (average = 5), and is available online in the Web Ontology Language format. The integration of the ROS ontology in the i2b2 Web client is presented in [Fig. 7 ].
Fig. 7 The Radiation Oncology Structures as displayed in the Informatics for Integrating Biology and the Bedside (i2b2) Web client.
A Hybrid Approach for Storing DVH Curves in i2b2
We have proposed a new format for storing DVH curves in i2b2, using a document-based technique with JSON in the OBSERVATION_BLOB column of the i2b2 fact table, without modifying the underlying i2b2 storage model.
Business Intelligence Tools for Validating Extracted Data Set
The radiation therapy BO Universe created for this project uses 25 tables from the source data model, linked to each other by 28 relations. The tables and links in the BO Universe were specifically designed to validate the “Dose details” and “Activities scheduling” data sets, but they can be reused to create dashboards for various purposes in other projects.
Discussion
Toward Scalable Solutions Based on the Open i2b2 Standard
Mechanisms for obtaining high-quality, routine-care, disease-specific, sharable data are required in translational research. One solution is to leverage the i2b2 open source software to create new ontology-based modules that provide research investigators with the whole spectrum of clinical data. Examples of previous works focused on pediatric chronic disease registries[36] and cancer-related genomic data.[37] [38] We have leveraged the i2b2 platform to store radiation therapy data, including detailed information such as the DVH, thanks to the i2b2 open standard.
Shared Algorithms
We were able to integrate radiation oncology data into i2b2. The availability of the VARIAN ESAPI was key to the success of integrating the DVH data; the integration of the “Dose details” and “Activities scheduling” data was much more time consuming.
We developed this pipeline in the context of AP-HP, the largest hospital group in Europe, with five radiation oncology departments treating 8,000 patients each year. We focused on the system used at HEGP, but the model can be generalized to other treatment planning and record-and-verify systems if the data model is similar. In practice, the SQL and C# procedures used to extract the radiation therapy data from the vendor's application would need to be redesigned to fit a different source data model, and the BO Universes connected to the radiation therapy backup database would also need to be redesigned, since they are linked to the vendor's application data model. Unless vendors' applications expose a unified API to access internal data, the ETL process is largely source-specific, which makes this kind of task very time consuming. In the presented work, the only generic step occurs between the staging area and the i2b2 repository, where a set of ORACLE views formats the different data sources of the staging area into standardized i2b2 data flows.
Document-Based Storage in the i2b2 Architecture
Compatibility of the JSON Document-Based Storage with the i2b2 Architecture
The i2b2 data repository (also referred to as the Clinical Research Chart cell) is a component of the broader i2b2 architecture, which also defines a Web client for clinical users and a set of other cells that enable additional features (Natural Language Processing, Correlation Analysis Plugin, etc.). Therefore, any modification of the i2b2 data repository storage model should preserve compatibility with the other cells, especially the Web client, which is certainly the most used cell. This component natively displays numeric or string data fetched from the i2b2 repository, but it would have to be modified to display the DVH curves. We have not started this task yet, as we are still completing the loading of the full radiation therapy data into the repository. However, the Web client features can be extended through different techniques. The most straightforward is the plugin extension mechanism of the i2b2 Web client.[39] Other, more complex solutions could be derived from the SmartR project, an open source platform for interactive presentations of translational research data.[40]
Estimation of the Target Volumetry for the DVH Data
For our 262-patient cohort, the size (in character string length) of the JSON objects (i.e., all the contextual data plus the DVH curve data) is ∼18.4 megabytes. Under the hypothesis of a constant DVH allotment among the 14,000 patients in the radiation therapy software, the total size of the JSON DVH objects should be around 1 gigabyte, as shown in [Table 4 ].
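The extrapolation is a simple proportional scaling, sketched here in Python for the DVH domain (the sample figure is the DVH JSON size reported in Table 3; the assumption of linear growth with the patient count is the hypothesis stated above):

```python
# Proportional scaling of the observed JSON volumetry from the
# 262-patient sample to the ~14,000 patients in the source system,
# assuming the DVH allotment per patient stays constant.
sample_patients = 262
total_patients = 14_000
sample_dvh_mb = 17.8        # DVH JSON size observed for the sample (Table 3)

estimated_mb = sample_dvh_mb * total_patients / sample_patients
print(round(estimated_mb, 1))   # on the order of 950 megabytes
```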
Table 4
Estimation of the JSON objects volumetry for the entire radiation therapy database
Domain: JSON objects size for the 262-patient sample / the entire database (14,000 patients)
DVH: 18.2 megabytes / 950.7 megabytes
Dose: 194 kilobytes / 10.1 megabytes
Activities: 197 kilobytes / 10.3 megabytes
Total: 18.2 megabytes / 971.1 megabytes
Abbreviations: DVH, dose-volume histogram; JSON, JavaScript Object Notation.
As a comparison, the total size of all the clinical reports objects (that are already stored in the HEGP CDW) is around 17 gigabytes.
Given this estimation of the expected target volumetry, we believe a native NoSQL framework is not required for the management of the radiation therapy data.
Validation of the Data Set
The validation of the imported data set is a key step in every ETL process, especially when the data are complex such as in the radiation therapy context.
By using the BO platform, we were able to validate each element of the data set in a “real life” use case by comparing the BO reports with the vendor's application screens. Validating the data set only by looking at the table content would have been unsatisfactory.
The simplicity of the R script needed to display the DVH curves from the CDW content is an indication that the JSON format is well suited for storing the DVH data.
Perspectives
There are still several short- and long-term goals in this project. The “Couch correction” domain is not covered yet, and we must find a strategy for fetching these data. We validated our approach on a cohort of 262 patients; it is now being extended to the whole VARIAN/ARIA data set. Furthermore, the integration process requires a new offline copy of the VARIAN/ARIA database for each dose details and activities scheduling update. We are also setting up new projects using these new data. For example, we are currently analyzing the response to radiation therapy in T2–4 N0–1 rectal cancer patients, and we have integrated genomic, clinical, and radiation therapy data into our model. The work presented in this article is a significant step of this integration pipeline.
The method developed for this pilot project will be scaled up and used to integrate the data generated in all five AP-HP radiation oncology departments into the central AP-HP CDW (6.5 million patient records stored as of February 2017). Common data models and shared algorithms will reinforce (1) the central role played by the i2b2 CDW, and (2) the ability to mine cancer data and discover new markers in precision radiation therapy.
In recent years, a set of stakeholders (representing academia, industry, funding agencies, and scholarly publishers) have promoted the Findable, Accessible, Interoperable, and Reusable (FAIR) data principles.[41] Behind the FAIR principles is the notion that shared algorithms, tools, and workflows are needed to search for relevant data sources, to analyze the data sets, and to mine the data for knowledge discovery. Researchers wanting to share and reuse data, methods, and scientific results will benefit from the application of the FAIR principles. As research in oncology moves toward more data-intensive science, one of the grand challenges is to facilitate knowledge discovery by assisting researchers in their access to, integration of, and analysis of all data derived from routine care databases. We have developed the ROS ontology and the integration pipeline presented in this article to provide a semantic framework in radiation therapy that both the sources and the users can agree upon, to facilitate and accelerate data-driven cancer research.