Keywords
Artificial intelligence - big data - data science - electronic health records - precision
medicine
1 Introduction
Precision medicine requires streamlined software pipelines to handle vast amounts
of information, yet it faces additional challenges as we embrace new technologies:
high-throughput sequencing, improved medical imaging, wearable sensors, etc. Over
the past two decades, the fastest technological advance in history has universalised
access to genomics, prompting an increase in the number of national genome sequencing
programmes [1]. Managing these data and their interpretation are the biggest challenges alongside
safe storage and long-term preservation. Redundant data can clog our warehouses, but
“data” is not synonymous of “information”. Data scientists are required to curate
noisy datasets. All this is ushering us to a new era of “precision healthcare”, bringing
human biology to centre stage as we integrate information about complex traits and
susceptibility to disease in our healthcare systems. Information from a wide variety
of sources is transforming personal health and challenging already overstretched health
management systems. We must find the right way to access it without neglecting privacy
and data provenance tracking [2].
Medical care is largely defined by clinical practice guidelines based on population-level
data; genomic medicine, however, relies on an individual’s genome to guide
personal strategies for disease diagnosis, treatment, and prevention. Cancer patients
are now routinely stratified according to which treatment will be most effective for
their tumour. Clinically useful gene expression signatures can
also be used to adjust a therapy regimen to reduce the risk of toxicity, resulting in
better patient survival. This transformation in patient management is not restricted
to cancer, as we are starting to see similar approaches in other complex diseases
[3].
Oncology has become increasingly data-driven. Genomic and molecular advances inform
the development of targeted therapies, replacing the traditional approach of classifying
tumours by tissue of origin (e.g., breast, colorectal, or lung cancer) and cell type (sarcoma, carcinoma).
New technologies and computational resources, which were unthinkable a few years ago,
have made this a reality, including gene editing [4], [5], immunotherapy [6], [7], [8], [9],
and artificial intelligence [10], [11]. As new information flows, more intriguing applications materialise beyond cancer
[12].
2 Objectives
We are living through a digital revolution driven not only by the abundance of data, but also
by our capability to collect, store, and analyse this information. We often forget
how much we rely on mathematical models to harness the data tsunami [13]. Each of our patients is systematically screened at the clinic for a myriad of molecular
information. This generates terabytes of data per patient, from which decisions must
be made regarding the best therapeutic options available. Integrating such data (most
of it unstructured) requires bespoke computational methods.
Contextual information is essential, as subtle signs may hold the clue to the correct
interpretation, particularly when useful domain knowledge is already available.
Genes play a fundamental role in the functioning of life. Genetics turns into genomics
as we start analysing the entire DNA of an organism instead of just a few genes. Between
two and ten novel mutations creep into our genome each time cells duplicate their DNA [14]. The driving force behind inheritance and evolution will only be fully understood
when due attention is given to DNA interactions [15].
3 Methods
Clinical records have transcended their original purpose of documenting disease progression
and the crucial information needed to support an optimal therapy. Hospitals have adopted
electronic health record (EHR) systems to hold mountains of paperwork [16], [17], [18].
Yet clinicians favour flexible (unstructured) data entry methods. This therefore requires
a careful strategy to capture that critical contextual information as we
develop suitable tools.
ConSoRe is a tool that queries medical notes, pathology reports, diagnoses, and hard-to-find
lifestyle data, and structures all the required information from an EHR system [19]. It automates the accurate processing of unstructured medical notes according to a predefined
disease model, in this case cancer [20]. It combines state-of-the-art text mining and natural language processing (NLP) techniques
with semantic knowledge graphs, providing the flexibility physicians need
to quickly identify patients matching precise criteria (potentially reducing recruitment
time for a clinical trial from years to just days or weeks). A disconnected patchwork
of electronic information systems becomes queryable through a single gateway, not
far from the cancer Biomedical Informatics Grid (caBIG®) promoted by the NCI [21].
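To convey the kind of cohort query such a gateway enables, consider the minimal sketch below. It is purely illustrative and does not reflect ConSoRe’s actual architecture or API; the patient identifiers, note texts, and eligibility criteria are all invented, and real systems would replace keyword matching with NLP entity extraction over a knowledge graph.

```python
import re

# Toy corpus of unstructured notes keyed by hypothetical patient identifiers.
notes = {
    "P001": "Invasive ductal carcinoma, ER-positive, HER2-negative. Former smoker.",
    "P002": "Triple-negative breast cancer. No smoking history recorded.",
    "P003": "Lung adenocarcinoma, EGFR mutation detected. Current smoker.",
}

# Trial eligibility expressed as regular-expression criteria that must all match.
criteria = [r"carcinoma|cancer", r"smoker"]

def matching_patients(notes, criteria):
    """Return identifiers of patients whose notes satisfy every criterion."""
    return [
        pid for pid, text in notes.items()
        if all(re.search(pattern, text, re.IGNORECASE) for pattern in criteria)
    ]

print(matching_patients(notes, criteria))  # ['P001', 'P003']
```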
We are quantifying human health and disease with the help of artificial intelligence
(AI) approaches. Large datasets are analysed, helping us to discover new drugs and
tailored treatments. However, these applications in precision medicine can be severely
hindered by the scarcity of data available for training. Indeed, we can
find datasets containing nearly as many features as samples. Models trained on them
will underperform when applied to population-based samples beyond the original clinical
setting, due to distributional shift [22].
Another issue AI faces is that it cannot yet replicate the diagnostic process. A physician
will order different tests sequentially over the period she is following a patient,
and any given test might be ordered because of a previous result. When
an algorithm is trained on retrospective corpora, the temporal dimension is removed
and the dependency within the dataset is often lost. Any model subsequently
produced will not capture the chain of decisions which ultimately led to the original
diagnosis [23].
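As a sketch of what is lost (the events, dates, and results below are invented), compare a time-stamped record with the flattened, order-free representation typically used for training:

```python
# A fictional patient trajectory: each test was ordered because of the last result.
trajectory = [
    ("2021-03-01", "haemoglobin", "low"),
    ("2021-03-08", "iron_panel", "abnormal"),      # ordered after the low haemoglobin
    ("2021-03-20", "colonoscopy", "lesion found"), # ordered after the abnormal panel
]

# Flattened into a bag of features, the causal ordering that drove each
# decision disappears; only the co-occurrence of results remains.
flattened = {test: result for _, test, result in trajectory}
print(flattened)
# {'haemoglobin': 'low', 'iron_panel': 'abnormal', 'colonoscopy': 'lesion found'}
```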
The final aspect in this equation is accountability. Algorithms should be able to
detect biases, and they therefore require robust and complete data [24], [25]. When we use mathematical models (e.g., neural networks) to identify patterns, skewed collections will lead to biased models
(data collections may contain inaccuracies and errors which should have been cleaned
prior to being used for training) [26], [27].
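A toy illustration of how a skewed collection rewards a useless model (assuming scikit-learn; the 95/5 class split is invented for the example):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.05).astype(int)  # ~5% positives: a skewed collection

# A classifier that always predicts the majority class learns nothing useful,
# yet scores highly on the skewed data while missing every positive case.
model = DummyClassifier(strategy="most_frequent").fit(X, y)
print("accuracy:", model.score(X, y))  # ~0.95
```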
An essential aspect in oncology is to relate a detailed and unique description of
a cancer to useful properties such as response to therapy or risk of relapse. However,
key mutations are poorly represented even amongst the largest public cancer cell line
panels [28], [29], which means any model developed with them will be statistically underpowered.
We are building a digital ecosystem integrating new and existing technologies and
data. We can investigate the potential of representation learning for cancer genomics,
allowing the Molecular Tumour Board to exploit the hierarchical and multi-scale nature
of the data available.
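As a linear stand-in for richer representation-learning models (a sketch only: the expression matrix is simulated, and a real multi-scale pipeline would use deep autoencoders or graph-based architectures over multi-omic inputs rather than PCA), dimensionality reduction conveys the idea of compressing thousands of features into a tractable embedding:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Simulated expression matrix: 200 hypothetical tumour samples x 5000 genes.
expression = rng.normal(size=(200, 5000))

# Learn a 32-dimensional representation of each sample.
embedding = PCA(n_components=32).fit_transform(expression)
print(embedding.shape)  # (200, 32)
```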
4 Results
Ever since Hippocrates founded his school of medicine in ancient Greece some 2,500
years ago, observation, experimentation, and data analysis have been core principles
of medicine. Precision medicine relies upon comprehensive data (and biobanks)
on patient treatment and outcomes. Analysis of these data leads to improved models
providing the basis for treatment, and for direct use in clinical decision-making.
In fact, it is data from previous patients that will probably play the biggest role
in making a current patient well again. It gives our treating teams the essential
insights and knowledge on which to base their care. We aggregate data in warehouses;
we have mentioned ConSoRe, but in France alone we can find other models beyond oncology,
such as Dr Warehouse and eHOP [30], [31]. Ethical and legal issues are paramount when developing these infrastructures, as
it is unclear how samples (and data) might be re-used and whether any future uses
will be compatible with the original consent.
AI is unleashing an array of new approaches to healthcare, but we need to continuously
benchmark any progress. Innovative technologies will only be widely adopted in medicine
once they significantly improve outcomes for patients, and their implementation is
ironed out. Solutions with potential for widespread adoption cannot be resource intensive
to deploy and use, and should not be too complex. Manually annotated cohorts can be
used to establish baselines against which to benchmark automated tools [32]. An example we have been using at Institut Curie is the ESMÉ cohort (grouping 30,000
breast cancer patients), a well-annotated resource from the Unicancer excellence network
of French Comprehensive Cancer Centres. Structuring medical records with ConSoRe can
thus be compared against the work of an experienced team of curators.
Understanding how cancer arises requires more than converting biopsies into ones and
zeroes, or lists identifying which genes are mutated in certain cancers. Molecular
signatures bring us a step closer towards finding interventions to halt disease and
enhance health. Patient stratification meets clinical practice: evidence reveals the
language of the cell, as each subtype may exhibit a predictable clinical phenotype.
Machine learning (ML) is benefiting from robust platforms that enable scalable and
reproducible computing on large datasets; however, quality is often the challenge
[33]. Oncology datasets are often ill-suited, as we are dealing with noisy and sparse data
in which various independent resistance mechanisms can operate [34]. Particular attention must be paid to avoiding overfitting when comparing results [35]. The final model will only be as good as the data used to develop and test it.
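A minimal guard against overfitting when comparing results is cross-validation. The sketch below (assuming scikit-learn; the data are simulated with a weak real signal) contrasts the misleading training score with an honest cross-validated estimate:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 50))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)  # weak real signal

model = RandomForestClassifier(n_estimators=200, random_state=0)
print("train accuracy:", model.fit(X, y).score(X, y))  # near-perfect, misleading
print("cross-validated:", cross_val_score(model, X, y, cv=5).mean())  # honest estimate
```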
At a moment when the healthcare data economy is booming, places like Oxford, Paris,
or Cambridge are teeming with start-ups promising to harness the power of data. Precision
oncology increases the range of treatment options, bringing quality improvements,
but only for a relatively small number of people. Challenging the modern clinical
trial paradigm with basket-trial approaches is blurring one of the hallmarks of medicine:
the educated guess, without any pretensions of certainty.
5 Discussion
The potential of computers to transform the clinical decision process has long been
recognised. We can trace medical informatics as an interdisciplinary research field
back to the 1960s [36], [37]. MYCIN, an ad hoc model with about 450 rules developed to diagnose blood infections, was one of the
first expert systems [38]. ADM, a computer-assisted diagnostic system developed in 1972, covered 2,500 diseases
with 22,000 signs and symptoms [39], [40].
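The flavour of those early systems can be conveyed with a few if-then rules. This is a toy sketch: the rules below are invented, and MYCIN's roughly 450 rules additionally carried certainty factors rather than firing deterministically.

```python
# Each rule maps a set of required findings to a hypothetical conclusion.
rules = [
    ({"fever", "gram_negative", "hospital_acquired"}, "suspect Pseudomonas"),
    ({"fever", "gram_positive", "skin_entry"}, "suspect Staphylococcus"),
]

def infer(findings, rules):
    """Forward-chaining over the rule base: fire every rule whose premises hold."""
    return [conclusion for premises, conclusion in rules if premises <= findings]

print(infer({"fever", "gram_negative", "hospital_acquired"}, rules))
# ['suspect Pseudomonas']
```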
Despite decades of research on the development of computer-based patient records, the
process has been hindered by the hope that difficult clinical problems might yield
to mathematical formalisms [41]. Today, technology is placed at everyone’s fingertips; wearables offer an opportunity
to capture first-hand data and address disease in the early stages. However, this
comes with a risk of swamping already saturated health services with anxious individuals
alarmed by false positives, following the adoption of devices promising real-time
atrial fibrillation detection [42].
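The swamping risk is basic base-rate arithmetic: with low disease prevalence, even an accurate detector yields mostly false alarms. The figures below are assumptions chosen for illustration, not the published performance of any device.

```python
# Assumed, illustrative figures -- not measured device performance.
prevalence  = 0.02   # 2% of wearers actually have atrial fibrillation
sensitivity = 0.98   # true positives / all affected
specificity = 0.95   # true negatives / all unaffected

# Positive predictive value via Bayes' rule.
p_alert = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
ppv = sensitivity * prevalence / p_alert
print(f"PPV: {ppv:.2f}")  # ~0.29: roughly 7 in 10 alerts are false positives
```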
Genomics enables us to decipher and understand the blueprint of a living organism
as we better understand biological systems at a molecular level. When the human genome
was first assembled [43]
[44], it would have been hard to predict that a few years afterwards the future in cancer
would pass by single-cell sequencing. Today, we can sequence individual cells from
biopsy samples or circulating tumour cells, enabling earlier diagnoses. Even when
the cost of sequencing a cancerous tumour has dramatically dropped to affordable levels,
the cost of understanding what then needs to be done remains considerably high [45]. Bringing our knowledge about the clinical implications of various genomic elements
to the Molecular Tumour Board still requires substantial research investment [47].
In the precision medicine space we are often expected to assess not only how new tests
can guide our decisions (companion diagnostics), but also what additional value they
bring to the healthcare system (exploring drug repurposing). The ultimate goal of
precision medicine is not only to treat patients based on their unique biology, but
to deliver such better care without spending more money.
The publication of the first draft of our genome, marking the beginning of precision medicine,
concluded with the following words: “it has not escaped our notice that the more we learn about the human genome, the more
there is to explore”. Twenty years after this major milestone, we can only stress how true those words
were.
6 Conclusions
Only those technologies showing genuine clinical utility will be widely adopted in
medicine. ML enables us to extract knowledge from the outcomes of thousands of patients
(billions in a global context) to inform the care of each individual patient. Structured information
plays a critical role in medical decision-making. A central promise of ML in medicine
is that each patient will benefit from the wisdom contained in the decisions of nearly
all clinicians, because those decisions will be based on the outcomes of billions of past
patients. A corollary is that patients need to be informed that by sharing their data
they are helping not only individuals today, but also future patients.
As Eric Topol says “Electronic health records have broken the backs of clinicians
and made them into data clerks. So why would anyone in their right mind think that
we could have a rescue through technology? (…) We’ve never had a technology that could
actually give us the gift of time” [48]. We are not discussing digital alchemy, but augmented medicine through rigorous
research that provides unequivocal benefit for patients.
We have seen that the potential value of computers in medicine is nothing recent,
but the development of digital ecosystems embedding information from EHRs allows us
to streamline clinical queries across normalised medical records. Within this context,
expanding our toolset beyond the hospital and enriching ever-increasing patient cohorts
with new data types opens the door to exciting new opportunities in Precision Oncology
[49].
Any innovation must not only address clinical problems but also result in significantly
improved outcomes for patients. It should not be too complex or resource intensive
to implement and use, and should have the potential for widespread adoption and diffusion
[50]. The emphasis is often put on data quantity when it should be on quality, which
is inherently expensive as it requires human curation. Data can be augmented, but
quality cannot be taken for granted.
There is a hard lesson to learn when wandering at the limits of science and medicine.
Solutions must involve physician-scientist teams; otherwise we risk that they will
not be adopted. We may have elucidated the iconic double helix and gained a better understanding
of immunology, but we are still unable to save people from most forms of malignancy.