CC BY-NC-ND 4.0 · Yearb Med Inform 2020; 29(01): 184-187
DOI: 10.1055/s-0040-1701985
Section 8: Bioinformatics and Translational Informatics
Georg Thieme Verlag KG Stuttgart

Untangling Data in Precision Oncology – A Model for Chronic Diseases?

Xosé M. Fernández
1  Institut Curie, Paris, France

Correspondence to

X. M. Fernández
Institut Curie
25 rue d’Ulm, 75005 Paris
Phone: +33 (0) 156246277   

Publication History

Publication Date:
21 August 2020 (online)



Objectives: Any attempt to introduce new data types in the entangled hospital infrastructure should help to unravel old knots without tangling new ones. Health data from a wide range of sources has become increasingly available. We witness an insatiable thirst for data in oncology as treatment paradigms are shifting to targeted molecular therapies.

Methods: From nineteenth-century medical notes consisting entirely of narrative description to standardised forms recording physical examination and medical notes, we have nowadays moved to electronic health records (EHRs). All our analogue medical records are rendered as sequences of zeros and ones, changing how we capture and share data. The challenge we face is to offload the analysis without entrusting a machine (or algorithms) to make major decisions about a diagnosis, a treatment, or a surgery, thereby keeping human oversight. Computers do not have judgment; they lack context.

Results: EHRs have become the latest addition to our toolset to look after patients. Moore’s law and general advances in computation have contributed to making EHRs a cornerstone of Molecular Tumour Boards, presenting a detailed and unique description of a tumour and treatment options.

Conclusions: Precision oncology, as a systematic approach matching the most accurate and effective treatment to each individual cancer patient, based on a molecular profile, is already expanding to other disease areas.


1 Introduction

Precision medicine requires streamlined software pipelines to handle vast amounts of information, yet it faces additional challenges as we embrace new technologies: high-throughput sequencing, improved medical imaging, wearable sensors, etc. Over the past two decades, the fastest technological advance in history has universalised access to genomics, prompting an increase in the number of national genome sequencing programmes [1]. Managing and interpreting these data are the biggest challenges, alongside safe storage and long-term preservation. Redundant data can clog our warehouses, but “data” is not synonymous with “information”. Data scientists are required to curate noisy datasets. All this is ushering us into a new era of “precision healthcare”, bringing human biology to centre stage as we integrate information about complex traits and susceptibility to disease in our healthcare systems. Information from a wide variety of sources is transforming personal health and challenging already overstretched health management systems. We must find the right way to access it without neglecting privacy and data provenance tracking [2].

Medical care is largely defined by clinical practice guidelines based on population-level data; genomic medicine, however, relies on an individual’s genome information to guide personal strategies for disease diagnosis, treatment, and prevention. Cancer patients are now routinely stratified according to which treatment will be most effective for their tumour. Clinically useful gene expression signatures can also be used to adjust a therapy regimen to reduce the risk of toxicity, resulting in better patient survival. This transformation in patient management is not restricted to cancer, as we are starting to see similar approaches in other complex diseases [3].

Oncology has become increasingly data-driven. Genomic and molecular advances inform the development of targeted therapies, replacing the traditional approach of describing tumours as a disease of the tissue of origin (e.g., breast cancer, colorectal cancer, or lung cancer) and cell types (sarcoma, carcinoma). New technologies and computational resources that were unthinkable a few years ago, including gene editing [4] [5], immunotherapy [6] [7] [8] [9], and artificial intelligence [10] [11], have made this a reality. As new information flows, more intriguing applications materialise beyond cancer [12].


2 Objectives

We are living through a digital revolution driven not only by the abundance of data, but also by our capability to collect, store, and analyse this information. We often forget how much we rely on mathematical models to harness the data tsunami [13]. Each of our patients is systematically screened for a myriad of molecular information at the clinic. This generates terabytes of data per patient from which decisions must be taken regarding the best therapeutic options available. Integrating such data (most of it unstructured) requires computational methods that involve bespoke procedures. Contextual information is essential, as small signs may hold the clue to the correct interpretation, particularly when useful domain knowledge is already available.

Genes play a fundamental role in the functioning of life. Genetics turns into genomics as we start analysing the entire DNA in an organism instead of just a few genes. Between two and 10 novel mutations creep into our genome when cells duplicate their DNA [14]. The driving force behind inheritance and evolution will only be fully understood when due attention is given to DNA interactions [15].


3 Methods

Clinical records transcend their original purpose of keeping a record of disease progression and crucial information to support an optimal therapy. Hospitals have adopted EHR systems to hold mountains of paperwork [16] [17] [18]. Yet clinicians favour flexible (unstructured) data entry methods. This therefore requires a careful strategy to capture that critical contextual information as we develop suitable tools.

ConSoRe is a tool that queries medical notes, pathology reports, diagnoses, and hard-to-find lifestyle data, and structures all the required information from an EHR system [19]. It automates the accurate processing of unstructured medical notes according to a predefined disease model, here cancer [20]. It combines state-of-the-art text mining and natural language processing (NLP) techniques with semantic knowledge graphs. It provides the necessary flexibility to enable physicians to quickly identify patients matching precise criteria (potentially reducing recruitment time for a clinical trial from years to just days or weeks). A disconnected patchwork of electronic information systems becomes queryable through a unique gateway, an approach not far from the cancer Biomedical Informatics Grid (caBIG®) promoted by the NCI [21].

We are quantifying human health and disease with the help of artificial intelligence (AI) approaches. Large datasets are analysed, helping us to discover new drugs and tailored treatments. However, these applications in precision medicine can be severely hindered by the scarcity of data available in the training datasets. Indeed, we can find datasets containing nearly as many features as samples. When applied to population-based samples beyond the original clinical setting, models trained on these datasets will underperform due to distributional shift [22].

Another issue AI faces is that it cannot yet replicate the diagnostic process. A physician will order different tests sequentially throughout the period she is following a patient; any given test might be ordered because of the result of a previous one. So, when an algorithm is trained on retrospective corpora, the temporal dimension is removed and the dependency within the dataset is therefore often lost. Any model subsequently produced will not include the related decisions which ultimately led to the original diagnosis [23].
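
The loss of temporal structure can be illustrated with a minimal Python sketch. The patient histories, test names, and helper functions below are hypothetical, invented for illustration and not drawn from any real record or from the cited work; they only show how two different diagnostic pathways can collapse into the same feature set once the order of tests is discarded.

```python
# Hypothetical, illustrative data: each history is the ordered list of
# (day, test, result) events recorded while a patient was followed.
patient_a = [
    (0, "mammography", "suspicious"),
    (7, "biopsy", "malignant"),
    (14, "gene_panel", "BRCA1 variant"),
]
patient_b = [
    (0, "gene_panel", "BRCA1 variant"),
    (30, "mammography", "suspicious"),
    (37, "biopsy", "malignant"),
]


def flatten(history):
    """Retrospective 'bag of results' view: order and timing are dropped."""
    return frozenset((test, result) for _, test, result in history)


def ordered_tests(history):
    """Sequential view: keeps which test followed which."""
    return [test for _, test, _ in sorted(history)]


same_when_flattened = flatten(patient_a) == flatten(patient_b)
same_when_ordered = ordered_tests(patient_a) == ordered_tests(patient_b)

print(same_when_flattened)  # True
print(same_when_ordered)    # False
```

In this toy case the flattened view cannot tell apart a patient whose gene panel followed a biopsy from one whose imaging was prompted by a known variant, although the clinical reasoning behind the two pathways differs.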

The final aspect in this equation is accountability. Algorithms should be able to detect biases, and they therefore require robust and complete data [24] [25]. When we use mathematical models (e.g., neural networks) to identify patterns, skewed collections will lead to biased models (data collections may contain inaccuracies and errors which should have been cleaned prior to being used for training models) [26] [27].

An essential aspect in oncology is to relate a detailed and unique description of a cancer to useful properties such as response to therapy or risk of relapse. However, amongst the largest public cancer cell line panels there is a poor representation of key mutations [28] [29]; this means any model developed with these panels will be statistically underpowered.

We are building a digital ecosystem integrating new and existing technologies and data. We can investigate the potential of representation learning for cancer genomics to allow the Molecular Tumour Board to exploit the hierarchical and multi-scale nature of the data available.


4 Results

Ever since Hippocrates founded his school of medicine in ancient Greece some 2,500 years ago, observation, experimentation, and data analysis have been core ethical principles of medicine. Precision medicine relies upon comprehensive data (and biobanks) on patient treatment and outcomes. Analysis of these data leads to improved models providing the basis for treatment, and for direct use in clinical decision-making. In fact, it is data from previous patients that will probably play the biggest role in making a current patient well again. It gives our treating teams the essential insights and knowledge on which to base their care. We aggregate data in warehouses: we have mentioned ConSoRe but, in France alone, we can find other models outside oncology applications, such as Dr Warehouse and eHOP [30] [31]. Ethical and legal issues are paramount when developing these infrastructures, as it is unclear how samples (and data) might be re-used and whether any future uses are compatible with the original consent.

AI is unleashing an array of new approaches to healthcare, but we need to continuously benchmark any progress. Innovative technologies will only be widely adopted in medicine once they significantly improve outcomes for patients and their implementation is ironed out. Solutions with potential for widespread adoption cannot be resource intensive to deploy and use, and should not be too complex. Manually annotated cohorts can be used to establish baselines to benchmark automated tools [32]. An example we have been using at Institut Curie is the ESMÉ cohort (grouping 30,000 breast cancer patients), a well-annotated resource from the Unicancer excellence network of French Comprehensive Cancer Centres. The output of structuring medical records with ConSoRe can thus be compared with the work of an experienced team of curators.
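
One common way to benchmark an automated tool against a manually annotated baseline is to compute precision, recall, and F1 over the extracted facts. The sketch below is only an illustration under stated assumptions: the annotation tuples, patient identifiers, and the `precision_recall_f1` helper are hypothetical and do not reflect the actual ESMÉ data model or the evaluation protocol used for ConSoRe.

```python
def precision_recall_f1(gold, predicted):
    """Compare two sets of (patient_id, attribute, value) annotations."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations, invented for illustration only.
manual = {
    ("p1", "er_status", "positive"),
    ("p1", "her2_status", "negative"),
    ("p2", "er_status", "negative"),
    ("p3", "metastatic", "yes"),
}
automated = {
    ("p1", "er_status", "positive"),
    ("p1", "her2_status", "negative"),
    ("p2", "er_status", "positive"),  # disagrees with the curators
    ("p3", "metastatic", "yes"),
}

precision, recall, f1 = precision_recall_f1(manual, automated)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.75 recall=0.75 f1=0.75
```

Treating annotations as sets of exact tuples is a deliberate simplification; a real comparison would likely need value normalisation and some tolerance for partial matches.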

Understanding how cancer arises requires more than converting biopsies into ones and zeroes, or lists identifying which genes are mutated in certain cancers. Molecular signatures bring us a step closer towards finding interventions to halt disease and enhance health. Patient stratification meets clinical practice: evidence reveals the language of the cell, as each subtype may exhibit a predictable clinical phenotype.

Machine Learning (ML) is benefiting from robust platforms that enable scalable and reproducible computing on large datasets; however, quality is often the challenge [33]. Oncology datasets are often unsuited, as we are dealing with noisy and sparse data in which various independent resistance mechanisms can operate [34]. Particular attention must be paid to avoiding overfitting [35] when comparing results. The final model will only be as good as the data used to develop and test it.
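
Overfitting on small, noisy datasets can be made concrete with a toy experiment. The following sketch rests entirely on assumptions made for illustration: the data are synthetic, the labels are pure noise, and the sizes (`N_FEATURES`, `N_TRAIN`, `N_TEST`) and the nearest-neighbour model are arbitrary choices, not a method from the cited work. It shows a model that looks perfect on the data it was developed on and yet carries no signal on held-out samples.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

N_FEATURES = 50  # hypothetical: many features...
N_TRAIN = 60     # ...for roughly as many training samples
N_TEST = 200


def random_sample():
    """One synthetic 'patient': binary features and a label that is pure noise."""
    features = tuple(random.randint(0, 1) for _ in range(N_FEATURES))
    label = random.randint(0, 1)
    return features, label


train = [random_sample() for _ in range(N_TRAIN)]
test = [random_sample() for _ in range(N_TEST)]


def predict(features):
    """1-nearest-neighbour on Hamming distance: effectively memorises the training set."""
    nearest = min(train, key=lambda s: sum(a != b for a, b in zip(s[0], features)))
    return nearest[1]


def accuracy(samples):
    """Fraction of samples whose predicted label matches the recorded one."""
    return sum(predict(f) == y for f, y in samples) / len(samples)


train_accuracy = accuracy(train)
test_accuracy = accuracy(test)
print(f"train accuracy: {train_accuracy:.2f}")
print(f"test accuracy:  {test_accuracy:.2f}")
```

Because every training sample is its own nearest neighbour, the training accuracy is perfect, while the held-out accuracy should hover around chance. Reporting only the former would be misleading, which is why results need to be compared on data the model has never seen.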

At a time when the healthcare data economy is booming, places like Oxford, Paris, or Cambridge are teeming with start-ups promising to harness the power of data. Precision oncology increases the range of treatment options, bringing quality improvements, but for only a relatively small number of people. Challenging the modern clinical trial paradigm with basket-trial approaches is blurring one of the hallmarks of medicine, the educated guess, without any pretensions of certainty.


5 Discussion

The potential of computers to transform the clinical decision process has long been recognised. We can trace medical informatics as an interdisciplinary research field back to the 1960s [36] [37]. MYCIN, an ad hoc model with about 450 rules developed to diagnose blood infections, was one of the first expert systems [38]. ADM, a computer-assisted diagnostic system developed in 1972, covered 2,500 diseases with 22,000 signs and symptoms [39] [40].

Despite decades of research on the development of computer-based patient records, the process has been hindered by the hope that difficult clinical problems might yield to mathematical formalisms [41]. Today, technology is placed at the fingertips of everyone; wearables offer an opportunity to capture first-hand data and address disease in the early stages. However, this comes with a risk of swamping already saturated health services with anxious individuals alarmed by false positives, following the adoption of devices promising real-time atrial fibrillation detection [42].
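
The false-positive concern follows from Bayes' rule: when a condition is rare in the screened population, even an accurate detector produces mostly false alarms. The sketch below works through the arithmetic with hypothetical sensitivity, specificity, and prevalence figures chosen purely for illustration; they are not taken from [42] or from any specific device.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Bayes' rule: probability of disease given a positive alert."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Hypothetical figures chosen for illustration only.
ppv = positive_predictive_value(sensitivity=0.98, specificity=0.95, prevalence=0.01)
print(f"PPV = {ppv:.3f}")  # PPV = 0.165
```

Under these assumed figures, roughly five out of six alerts would be false alarms, each potentially leading to an anxious consultation.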

Genomics enables us to decipher and understand the blueprint of a living organism as we better understand biological systems at a molecular level. When the human genome was first assembled [43] [44], it would have been hard to predict that, a few years afterwards, the future in cancer would lie in single-cell sequencing. Today, we can sequence individual cells from biopsy samples or circulating tumour cells, enabling earlier diagnoses. Even though the cost of sequencing a cancerous tumour has dropped dramatically to affordable levels, the cost of understanding what then needs to be done remains considerably high [45]. Bringing our knowledge about the clinical implications of various genomic elements to the Molecular Tumour Board still requires substantial research investment [47].

In the precision medicine space we are often expected to assess not only how new tests can guide our decisions (companion diagnostics), but also what additional value they bring to the healthcare system (exploring drug repurposing). The ultimate goal in precision medicine is not only to treat patients based on their unique biology, but to deliver such better care without spending more money.

The publication of the first draft of our genome, and the beginning of precision medicine, concluded with the following words: “it has not escaped our notice that the more we learn about the human genome, the more there is to explore”. Twenty years after this major milestone, we can only stress how true those words were.


6 Conclusions

Only those technologies showing genuine clinical utility will be widely adopted in medicine. ML enables us to extract knowledge from the outcomes of thousands of patients (billions in a global context) to inform care of each single patient. Structured information plays a critical role in medical decision-making. A central promise of ML in medicine is that each patient will benefit from the wisdom contained in the decisions made by nearly all clinicians as they will be based on the outcomes of billions of past patients. A corollary is that patients need to be informed that by sharing their data they are not only helping individuals today, but also future patients.

As Eric Topol says, “Electronic health records have broken the backs of clinicians and made them into data clerks. So why would anyone in their right mind think that we could have a rescue through technology? (…) We’ve never had a technology that could actually give us the gift of time” [48]. We are not discussing digital alchemy, but augmented medicine through rigorous research that provides unequivocal benefit for patients.

We have seen that the potential value of computers in medicine is not something recent, but the development of digital ecosystems embedding information from EHRs allows us to streamline clinical queries across normalised medical records. Within this context, expanding our toolset beyond the hospital and extending the ever-increasing patient cohorts with new data types opens the door to exciting new opportunities in Precision Oncology [49].

Any innovation must not only address clinical problems but should result in significantly improved outcomes for patients. It should not be too complex or resource intensive to implement and use, and should have the potential for widespread adoption and diffusion [50]. The emphasis is often put on data quantity when it should be on quality, which is inherently expensive as it requires human curation. Data can be augmented, but quality cannot be taken for granted.

There is a hard lesson to learn when wandering at the limits of science and medicine. Solutions must involve a physician-scientist team, otherwise we risk that they will not be adopted. We may have elucidated the iconic double helix and have a better understanding of immunology, but we are still unable to save people from most forms of malignancy.



I would like to thank the reviewers whose suggestions have improved this contribution. My greatest gratitude goes to all our donors for their generous support of the Curie Foundation to beat cancer together. This work was supported in part by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 780495. We also benefited from SiRIC (grant INCa-DGOS-Inserm_12554).
