Results
Artificial Intelligence, Machine Learning, and “Big Data” Analytics
Artificial Intelligence has risen to prominence in a way that belies its years of
overpromise and underdelivery. What has brought this about is, in part, slow maturation,
but the transformation arguably began with the apparent abandonment of logic as the
foundation of AI (think of “expert systems”) in preference for a plurality of data
and approaches to “learning”, i.e., development of models that fit, describe, and may ultimately explain a set of facts
or observations, in the sense of providing a means to comprehend the pattern of the
data, not merely the individual data points. Indeed, not all individual data points
need to be explainable in this way. One speaker mused in a 2009 keynote [3] on whether this trend meant “the abandonment of soundness for completeness”, alluding
to a well-known “incompleteness” theorem in logic (any sound formal system, complex
enough to support arithmetic, necessarily includes assertions that are true but not
provable within the rules of the system itself. A sufficiently rich complete system,
therefore, cannot be sound). It is in this context that we speak of “comprehending”
a pattern in the data even when some data defy the pattern. A thorough exposition
of Deep Learning has been provided by the pioneers, LeCun, Bengio, and Hinton [4]. A somewhat less technical review of its potential and clear discussion of its role
in augmenting, rather than supplanting, human intelligence has been provided by Rajkomar,
Dean, and Kohane [5].
The breadth of papers on AI in medicine and healthcare certainly defies easy summary.
Indeed, any selection is likely to be representative of the tastes of the reviewer:
this limitation must be admitted. Among papers that show the immense promise of Machine
Learning in healthcare is an interesting analysis by Rajkomar et al. [6] of the potential of Electronic Health Records (EHRs) to yield useful knowledge, appropriately published in a journal that is itself newly dedicated to the field
of Digital Medicine. These authors adopted a deep learning approach using the entire
EHR and addressed four representative questions using a single data structure: for
outcomes, risk of death; for quality, risk of readmission; for resource efficiency,
length of stay; and for the semantic value of the record, patient diagnoses. Their
approach avoids the variable selection problem and outperforms conventional predictive models across various indices. However, it is a retrospective study, so the challenge remains to build predictive models that operate prospectively on the accumulating EHR and to validate them in prospective studies. Other notable
work in the field takes a radically different approach. A team at University of California
San Francisco (UCSF) reports on a project with similar predictive goals for patients
with one particular condition, rheumatoid arthritis (RA) [7]. In this study, variables were selected based on known clinical significance, though
not necessarily known to have predictive value. A strict phenotype for RA was applied
in two diverse settings, a university hospital and a safety net hospital. Encouragingly,
the results were applicable in both, suggesting that robust models may be transposable
to new settings once developed. Yet another approach, taking its cue from process
mining, addresses the broad problem of diagnostic error for undifferentiated chief
complaints [8]. How is the sequence of events following presentation with abdominal pain best understood
and visualized? Are some diagnostic trajectories more effective than others? How do
time and timing impact the process? How well does this approach translate from one
chief complaint to another—say from abdominal pain to dizziness? Many such questions
remain to be addressed. Discovery or refinement of disease phenotypes is another potential
application of AI. In the case of sepsis, it has been observed that improved understanding
of the immune response has not translated into improved treatments. This is partly
due to the enormous range of clinical and biological features that figure in the definition
of the syndrome. A team at the University of Pittsburgh reports on a study [9] that identified four new clinical phenotypes that may help explain diverse treatment
effects and can guide the design of clinical trials and future treatment regimens.
The promise, limitations, and implications of AI have attracted voluminous commentary from experts and from anticipated beneficiaries. The Journal of the American Medical
Association has paid close attention to such questions. We note in particular a guide
to reading the literature [10], an accompanying editorial [11], and a viewpoint review [12] of the National Academy of Medicine’s comprehensive exploration of AI in healthcare
[13]. Possible biases in the design and development of AI systems in conjunction with
EHRs have also been explored [14], as have their remediation [15] and the potential legal liability risk for a provider using AI [16]. Considering the influential US regulatory framework on Software as a Medical
Device, how should the lifecycle of an AI system be viewed, especially if it is adaptive
and—at least in theory—self-improving [17]? The “black box” paradigm is an apt description for much modern AI. Models are constructed
and decisions made with virtually no explanation. This is in stark contrast to “classical”
AI which was formal logic-based and could in the main provide a logical audit trail
for a decision. The desirability of explanation in general has been recognized and has begun to be addressed in Explainable AI (XAI) [18]
[19]. In healthcare, given all the other concerns reviewed here, the need for explanation
goes beyond desirable to essential. Work in this area is underway, at this stage mainly
in the engineering domain [20], but with applicability to healthcare already under consideration [21].
A number of viewpoints and opinion pieces have addressed ethical, legal, and social
issues. Is it possible for an AI tool to monitor the status of a mental health patient
[22]? Would a conversational agent—“agent” being a term of art for software that can
initiate real world actions and, in many cases, act autonomously—be an appropriate
tool to address underserved mental health needs [23]? How does AI mediate or interfere in the relationship between physician and patient
[24]? Conversely, what is its potential to reduce provider burden and burnout [25]?
A somewhat contrasting approach [26] that leans more heavily on statistical methods [27] is variously described as “Data Science”, “Big Data”, or “Analytics,” although its
practitioners sometimes describe it as “AI” [28]. It has been successful in improving clinical operations, delivery of care, and
health system administration. The goal is often to target a particular performance
index (e.g., average length of stay, 30-day readmission, or immunization rate) or a status index
for the patient population (e.g., the percentage of patients with diabetes whose disease is controlled, or of those with asthma who experience exacerbations). A typical technique is to identify patients at risk and devise targeted
interventions. Poor data quality sometimes impairs the predictive power of these methods
[29]. It is generally considered most advantageous to implement models in the EHR so
that triggers can fire alerts for action [30]. It is the ambition of Learning Health Systems [31]
[32] to have cumulative evidence from practice improve the predictive value of the models
even as they are being used; naturally, this raises some of the “black box” issues
alluded to above. Finally, we note that analytics has also served patients in activities
bordering on Citizen Science [33].
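As a purely illustrative sketch of the pattern described above (score patients for risk, then fire an alert for those above a threshold), the following Python fragment uses invented features, coefficients, and a threshold; it does not correspond to any validated readmission model or to a particular EHR's alerting interface.

```python
# Minimal sketch of the "identify at-risk patients, trigger an intervention" pattern.
# The features, weights, and threshold are hypothetical placeholders, not a validated model.
from dataclasses import dataclass

@dataclass
class Patient:
    patient_id: str
    prior_admissions_12m: int
    has_chf: bool          # heart-failure flag from the problem list
    discharged_to_home: bool

def readmission_risk(p: Patient) -> float:
    """Toy additive score; coefficients are illustrative only."""
    score = 0.1 * p.prior_admissions_12m + (0.3 if p.has_chf else 0.0)
    score += 0.0 if p.discharged_to_home else 0.2
    return min(score, 1.0)

def fire_alerts(patients, threshold=0.4):
    """Return patient IDs that would trigger a care-management alert in the EHR."""
    return [p.patient_id for p in patients if readmission_risk(p) >= threshold]

if __name__ == "__main__":
    cohort = [
        Patient("A", prior_admissions_12m=3, has_chf=True, discharged_to_home=False),
        Patient("B", prior_admissions_12m=0, has_chf=False, discharged_to_home=True),
    ]
    print(fire_alerts(cohort))   # -> ['A']
```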
Common Data Models, Data Quality, and Standards
EHR systems are optimized for transactions, so that providers will experience minimal
delays in their interactions with patients. Data generated in this way is subsequently
stored in a database “normalized” so as to minimize duplication of information (and thus the risk of inconsistency), yet at the same time amenable to search by means of a structured
query language. Clinical research often revolves around the discovery of particular,
sometimes very complex, cohorts of patients. Common Data Models, as they have come
to be known, provide a further filter for the organization and storage of data in
a highly standardized form, so that even data from different institutions may be navigated
using the same basic queries. Among the most popular such models, i2b2 (“Informatics
for Integrating Biology and the Bedside”) and OMOP (“Observational Medical Outcomes
Partnership”) were created in 2005 and 2008, respectively, both essentially with observational
data in mind, for clinical research in the case of i2b2 and to study effects of interventions
and drugs for OMOP. Motivated by regulatory changes, the Food and Drug Administration’s
(FDA) Mini-Sentinel post-marketing drug surveillance program also created a common
data model, and this provided the inspiration for (and largely lent its design to)
the first version of the Patient-Centered Outcomes Research Network’s Common Data Model (PCORnet CDM). These models have opened up a number of avenues for research based on “real world data” (RWD)—data collected in the course of healthcare delivery or even from
the use of health-related applications on mobile devices. Research on the scope and
validity of RWD and the ways in which the analysis of RWD may lead to “real world
evidence” (RWE) deserves a section of its own, but it is worth alluding here to the
FDA’s definition and discussion of these terms [34]: “Real-world data are the data relating to patient health status and/or the delivery
of health care routinely collected from a variety of sources. … Real-world evidence
is the clinical evidence regarding the usage and potential benefits or risks of a
medical product derived from the analysis of RWD.”
An example of the breadth of rich data and study potential in an environment of independent
entities using a common data model may be seen in the study Short- and Long-Term Effects
of Antibiotics on Childhood Growth. Using the strict criteria of same day height and
weight in each of three distinct age periods (0 to <12, 12 to <30, and 30 to <72 months),
working across 35 institutions, a diverse cohort of 362,550 children was found to
be eligible for the study [35]. Of these, just over 58% had received at least one antibiotic prescription, with
over 33% receiving a broad-spectrum antibiotic. The cohort was large enough to allow
for adjustment for complex chronic conditions. In children without such a condition,
the odds ratio for overweight or obesity was 1.05 (CI 1.03 to 1.09) for those with
at least one antibiotic before age 24 months. The effect was thus shown to be real,
but small [36]. The study group was able to identify 53,320 mother–child pairs to consider whether
antibiotic use by mothers had an effect on childhood weight; taking into account timing
during pregnancy, dose-response, spectrum and class of antibiotics, the study found
no associations between maternal antibiotic use and the distribution of BMI at age
5 [37]. With an eye to what matters to stakeholders—parents and primary physicians—the
study also considered whether these findings would influence prescribing patterns
and parental expectations; the answer was unambiguously “no” [38]. They were also able to examine data quality by comparing prescriptions and dispensing
in 200,395 records and identified gaps in these data, although prescription data were
adequate for the question at hand [39]. Finally, in a technical proof of principle, they showed that a form of distributed
regression analysis, avoiding the aggregation of patient-level data, generated results
comparable to those of the main study [40].
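To make the reported effect sizes concrete, the short Python sketch below computes an odds ratio with a Wald-type 95% confidence interval from a 2x2 table; the counts are hypothetical, chosen only to illustrate the arithmetic, and are not taken from the cited study.

```python
# Worked example of an odds ratio with a Wald 95% confidence interval from a 2x2 table.
# Counts are hypothetical and illustrative only.
import math

# rows: exposed (any antibiotic) / unexposed; columns: overweight-or-obese yes / no
a, b = 21_000, 79_000   # exposed: outcome yes / no
c, d = 24_000, 96_000   # unexposed: outcome yes / no

or_hat = (a * d) / (b * c)
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)
lo = math.exp(math.log(or_hat) - 1.96 * se_log_or)
hi = math.exp(math.log(or_hat) + 1.96 * se_log_or)
print(f"OR = {or_hat:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```

With counts of this size the interval is narrow, which is what allows a small effect such as the one reported above to be distinguished from no effect at all.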
What goes into providing data for such a study? The process begins with the creation
of common data model-conforming data marts or mappings to enterprise or research data
warehouses; this is termed “extract-transform-load” (ETL). In research, this must
be followed by “phenotyping”: translating the inclusion and exclusion criteria and the information defined for the desired cohort into a query that captures as precisely as possible the required data. Intervening between the two steps in this
ideal sequence of operations (or perhaps part of a good ETL process) is data quality
analysis. Each of these stages presents certain problems and attracts attention from
researchers in the effort to bridge gaps. Notwithstanding the popularity of common
data models such as the PCORnet CDM and OMOP in the US (and increasingly elsewhere),
a good deal of research still addresses the choice of clinical data model and interoperability
between such models. The HL7 FHIR interface standard is increasingly accepted as a
way forward.
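As a highly simplified illustration of the ETL step, the following Python/pandas sketch reshapes source EHR diagnosis rows into a CDM-style condition table; the field names are invented for the example and do not reproduce the actual PCORnet CDM or OMOP specifications.

```python
# Minimal ETL sketch: source EHR diagnosis rows are reshaped into a simplified,
# CDM-style "condition" table. Field names are illustrative only.
import pandas as pd

source_dx = pd.DataFrame({
    "mrn":     ["001", "001", "002"],
    "dx_code": ["E11.9", "I10", "E11.9"],          # ICD-10-CM as recorded at the source
    "dx_date": ["2019-03-01", "2019-03-01", "2020-07-15"],
})

# Transform: rename fields, normalize types, and attach a (hypothetical) code-system tag.
condition = (
    source_dx
    .rename(columns={"mrn": "patient_id", "dx_code": "condition_code",
                     "dx_date": "condition_start_date"})
    .assign(condition_start_date=lambda df: pd.to_datetime(df["condition_start_date"]),
            code_system="ICD-10-CM")
)

# Load: in practice this would be written to the CDM data mart; here we just inspect it.
print(condition)
```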
Looking at particular examples of work in this area, a group in Germany has set out
to model the ETL process [41] and along the way define quality checks [42] and provenance standards [43]. This is an interesting way to marry process principles and implementation at the
level of technology. They link their provenance work in particular both to technical
and to administrative or regulatory requirements, so that researchers would not have
to engage in separate activities in using data for research and in ensuring that it
is handled according to all legal and ethical requirements. Yet another contribution
from Germany [44] is preoccupied with the completeness and syntactic accuracy of data from a heterogeneous
network of institutions, using a (logically) central metadata repository as its reference
point. An Australian contribution [45] rooted in business systems seeks a design method, or at least a set of design principles,
towards a unified view of data quality management in healthcare environments, alongside
methods and tools derived from that design view.
In the United States, much of the attention to ETL processes and data quality has
centered on the major common data models. Collaborations around both the PCORnet CDM
and OMOP have focused on data quality, with PCORnet requiring quarterly “data characterization”
or “hygiene” queries [46] and major tool developments by the OHDSI (Observational Health Data Sciences and
Informatics [47]) collaborative. A number of notable efforts have thus cumulatively created an impressive
collection of results. Using a data quality (DQ) ontology of their own devising in
2017 [48], a group led by Michael Kahn analyzed and mapped DQ approaches in six networks [49]. In the same year, Weiskopf et al. published a guideline for DQ assessment [50]. In 2018, there followed a contribution by Gold et al. on the challenges of concept value sets and their possible reuse; indicative of the acuteness of the challenge is that not all co-authors could sign up to every view expressed in the paper [51]! Rogers et al. [52] then analyzed data element–function combinations in checks from two environments,
ultimately identifying 751 unique elements and 24 unique functions, supporting their
systematic approach to DQ check definition. Most recently, Chunhua Weng has offered
a lifecycle perspective, indeed a philosophy, for clinical DQ for research [53], while Seneviratne, Kahn, and Hernandez-Boussard have provided an overview of challenges
for the merging of heterogeneous data sets, with an eye both on integration across
institutions (where adherence to standards may be sufficient) and across “modalities”,
the latter term interpreted in its widest possible sense, encompassing genomics, imaging,
and patient-reported data from wearables [54].
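The “data element x check function” framing lends itself to a simple illustration: in the sketch below, two generic check functions (completeness and plausible range) are applied to named elements of a toy table. The thresholds and ranges are hypothetical, and real DQ programs, such as the PCORnet characterization queries or the OHDSI tooling mentioned above, are far more extensive.

```python
# Sketch of applying generic data-quality functions to named data elements.
# Ranges are hypothetical and for illustration only.
import pandas as pd

def completeness(series: pd.Series) -> float:
    """Fraction of non-missing values."""
    return 1.0 - series.isna().mean()

def plausible_range(series: pd.Series, lo: float, hi: float) -> float:
    """Fraction of non-missing values falling within a plausible range."""
    s = series.dropna()
    return ((s >= lo) & (s <= hi)).mean() if len(s) else float("nan")

demo = pd.DataFrame({"birth_year": [1950, 2055, None, 1987],
                     "height_cm": [172.0, 17.2, 180.0, None]})

checks = {
    ("birth_year", "completeness"): completeness(demo["birth_year"]),
    ("birth_year", "plausible_range"): plausible_range(demo["birth_year"], 1900, 2021),
    ("height_cm", "plausible_range"): plausible_range(demo["height_cm"], 40, 230),
}
for (element, func), result in checks.items():
    print(f"{element:>10} | {func:<15} | {result:.2f}")
```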
It is necessary to add two more observations to this section. One concerns the Fast
Health Interoperability Resources (FHIR) standard specification and the other the
commercially supported data marts that have made a significant mark on institutions.
The relative proliferation of data models, and the passionate attachment of each one’s
proponents to the primacy of their chosen model, have resulted in a great deal of
duplication of work—the very avoidance of which was one of the drivers for their introduction
in the first place. It is debated whether one model or another should be taken as
the definitive basis for research data in an institution and how other data needs
would then be met. Thrown into this mix, FHIR, an interface or data exchange standard,
has at times been spoken of as a “meta-model” from which all others can be derived.
However, to quote an authority on this question, “FHIR’s purpose is not to define
a persistence layer—it’s to define a data exchange layer. It exists to define data
structures used to pass information from one system to another. That doesn’t mean
that you can’t use FHIR models to define how you store data, only that FHIR isn’t
designed for that purpose and doesn’t provide any guidance about how to do that” [55]. FHIR is thus likely to be an excellent approach for defining data elements to extract
from data close to care delivery; how that data is then stored and manipulated for
research remains the question, so the choice of model remains fraught.
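As a small illustration of FHIR’s role as an exchange format rather than a storage model, the sketch below flattens a hand-written, FHIR R4-style Observation payload into the kind of row a research data mart might hold. Only the FHIR field names follow the standard; the payload itself and the target column names are invented for the example.

```python
# Flattening a FHIR-style Observation (hand-written example, not fetched from a server)
# into a research-friendly row. Target column names are hypothetical.
import json

observation_json = """
{
  "resourceType": "Observation",
  "status": "final",
  "subject": {"reference": "Patient/123"},
  "effectiveDateTime": "2021-02-03",
  "code": {"coding": [{"system": "http://loinc.org", "code": "4548-4",
                        "display": "Hemoglobin A1c"}]},
  "valueQuantity": {"value": 7.2, "unit": "%"}
}
"""

obs = json.loads(observation_json)
row = {
    "patient_ref": obs["subject"]["reference"],
    "loinc_code":  obs["code"]["coding"][0]["code"],
    "value":       obs["valueQuantity"]["value"],
    "unit":        obs["valueQuantity"]["unit"],
    "obs_date":    obs["effectiveDateTime"],
}
print(row)
```

How such rows are then stored and manipulated is a separate design decision, which is precisely the point of the quotation above.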
The second observation is that in the midst of this babel of models a number of academically
well-informed companies have proposed a different, private business model—varying
somewhat by company—whereby the data is curated and made available for anonymous search,
often aggregated by geographic area or nationally. Institutions need not be identified
unless they wish to be, and the cost, which may be considerable, is borne by the commercial
clients of these systems, most obviously pharmaceutical companies, eager to define
cohorts for trials, to gauge how much of a market there might be for a drug, and so
on. At present, these systems co-exist with home-grown or academically developed public
systems (cf. Leaf [56] for an excellent example), but there is a sense in which they are in competition,
so this is a space to be watched.
Phenotyping and Cohort Discovery
These observations on data models, and their realization as real repositories—data
marts or data warehouses—naturally lead to the question of their use. There are possible
uses in quality improvement and in public health, but our main focus is clinical research.
Here, then, is the place to acknowledge and celebrate the successes of major international
consortia and other loose affiliations that have created banks of phenotypes and cohort
discovery tools to help navigate the data in standard models. By “phenotype” we mean
the criteria that identify patients with a given condition or disease. These may be
complex: e.g., patients with type II diabetes (DM2) may be identified by having a pertinent ICD9
or ICD10 code in their problem list or elsewhere in their record, or may be on a combination
of therapies that is uniquely appropriate for DM2, or may be suffering from a complication,
such as maculopathy, that has been annotated to indicate that it is due to DM2. Although
translational informatics is often interpreted in a genomic context, it should be
acknowledged that the need to match phenotypes to genotypes has provided considerable
stimulus to phenotyping from EHRs.
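A computable version of such a phenotype might look like the following sketch, which takes the union of three criteria (a diagnosis code, a suggestive medication, or a complication annotated to DM2) over simplified tables. The code and drug lists are abbreviated and purely illustrative, not a validated PheKB-style definition.

```python
# Sketch of a DM2 phenotype as the union of three criteria over simplified tables.
# Code and medication lists are abbreviated and illustrative only.
import pandas as pd

DM2_DX = {"250.00", "E11.9"}            # sample ICD-9 / ICD-10 codes
DM2_RX = {"metformin", "glipizide"}     # sample oral agents suggestive of DM2

diagnoses = pd.DataFrame({"patient_id": [1, 2, 3],
                          "code":       ["E11.9", "I10", "E11.319"]})
medications = pd.DataFrame({"patient_id": [2, 3],
                            "drug":       ["metformin", "lisinopril"]})

by_dx = set(diagnoses.loc[diagnoses["code"].isin(DM2_DX), "patient_id"])
by_rx = set(medications.loc[medications["drug"].isin(DM2_RX), "patient_id"])
# Complication codes annotated as due to DM2 (here, any E11.3x retinopathy code).
by_complication = set(
    diagnoses.loc[diagnoses["code"].str.startswith("E11.3"), "patient_id"])

dm2_cohort = by_dx | by_rx | by_complication
print(sorted(dm2_cohort))    # -> [1, 2, 3]
```

In practice such logic would typically be expressed as SQL against the CDM tables or shared in an executable form of the kind the PhEMA project discussed below targets.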
Work begun by the Electronic Medical Records and Genomics (eMERGE) network [57] and the NIH Collaboratory was first reported in 2016 [58] and has continued to grow into the most sustained and focused effort both to generate precise phenotypes and to sharpen existing ones. The concept has indeed been accepted
more widely and applied to good effect. A report [59] in a nephrology journal on phenotyping pediatric glomerular disease, a rare condition,
is accompanied by an appreciative editorial [60] recognizing similar efforts in the discipline. Looking to a particularly difficult
case, Koola et al. demonstrate improved phenotyping of hepatorenal syndrome, a difficult sub-phenotype of acute kidney injury [61]. Pacheco et al. use methods from the Phenotype Execution Modeling Architecture (PhEMA) project [62] to demonstrate the portability of a benign prostatic hyperplasia phenotype across a number of institutions [63]. Taylor et al. use previously developed phenotypes to identify patterns of comorbidities in eMERGE network
institutions [64]. That this work is far from easy is readily demonstrated by the difficulties described
in other works, such as the studies by Fawcett et al. in the UK [65] and by Ando et al. in Japan [66]. Nevertheless, independent efforts to phenotype particular conditions still arise and show considerable promise. We note in hematology the work of Singh et al. [67] and in psychiatric genomics that of Smoller [68]. Increasing reliance on EHR phenotyping is reflected not only in the proliferation
of papers applying one or other approach to particular, often complex, conditions,
but also in notable dissemination efforts, including an extended exposition by Pendergrass
and Crawford aimed at human geneticists [69].
Alternative methodologies also appear in the literature. A semantic approach by Zhang et al. takes its cue from difficulties encountered in translating specifications (e.g., in PheKB) into query code and in specializing a phenotype to each instance of a data repository [70]. Reflecting the transition we have observed in AI applications, Banda et al. outline a possible trajectory from “rule-based” phenotyping to machine learning models and suggest a research program to complete the move [71]. Among ML approaches, Ding et al. adopt “multitask learning” and find that multitask deep learning nets outperform the simpler single-task nets—a counterintuitive observation. Yet another interesting
study combined ML and rule-based approaches to identify entities and relations, providing
a natural language interface to clinical databases [72].
Clinical trial recruitment is often the driving reason for phenotyping. Several papers
with a focus on recruitment came to this reviewer’s attention. A clinical trial recruitment
planning framework by the Clinical Trials Transformation Initiative offers evidence-based
recommendations on trial design, trial feasibility, and communication [73]. A Veterans’ Affairs team describes a holistic approach to clinical trial management,
including identification of possible subjects and recruitment, based on its Cooperative
Studies Program working together with VA Informatics [74]. An oncology team at Vanderbilt and Rush reports on an ambitious framework capturing
multiple aspects of clinical trial management, including evaluation [75]. Finally, a team from Seoul, Korea, reviews the creation, deployment, and evaluation
of an entire Clinical Trial Management System which offers the full range of functions
required for recruitment and full ethical and regulatory compliance [76].
We will conclude this section with a brief mention of a reflexive analysis of the
work that goes into the creation of a library of phenotypes [77]. A retrospective analysis of phenotyping algorithms in the eMERGE network identified
nearly 500 “clauses” (phenotype criteria) associated with over 1100 tasks, some 60%
of which are related to knowledge and interpretation and 40% to programming. In each
case, portability (of knowledge, of interpretation, and of programming) was graded
on a scale of 0–3, resulting in each phenotype receiving a score that reflects expert
perception of its portability. Having commended this work, we recommend as a parallel reading the analysis of patients’ and clinical professionals’ “data work” by Fiske, Prainsack, and Buyx [78]. Here the ambiguity of “data” (“givens”) and their contextual dependency are discussed
with insight and with empathy for all participants.
Privacy: Deidentification, Distributed Computation, Blockchain
The last two decades in biomedical informatics have seen enormous growth in large-scale
collaborations and attempts to combine and share data to gain power in results, to
achieve greater diversity, and to provide the much vaunted “evidence” necessary for
evidence-based practice. The rationale and the challenges are well described by Haynes et al. [79]. One of the most frequently encountered obstacles to data sharing for research is
the concern over patient privacy—and rightly so, of course. In most jurisdictions,
there is some legal or regulatory protection for personal health information. The
acronym “PHI”, for Protected Health Information, is precisely defined in US legislation,
but also serves as a shorthand for what may commonly be considered personal and private
health information. A critical aspect that is derived from the relevant US act (the
Health Insurance Portability and Accountability Act, HIPAA) is that any value or
code that is derived from PHI is itself PHI, unless a certain kind of one-way cryptographic
“hash” function (“hashing” for short) is used in the derivation. Data that has been
thus transformed is often termed “de-identified”, although experts now are careful
to circumscribe claims of de-identification. Significant reasons for this are the
availability of other data sets, either as public goods or otherwise available for
purchase, and the possible use of similar methods to link individuals in one data
set to those in the supposedly de-identified collection. Cross correlation of information
thus obtained can lead to reidentification of at least some individuals in the list.
Where it can be done securely, an additional benefit of hashing is the possibility
of deduplication of patients that attend more than one healthcare system, thus enabling
a more nearly complete picture of a person’s record to be aggregated without identifying
the patient by name. Following significant advances in the last few years [80]
[81], more recent work has extended and exploited these methods [82]
[83]. Kayaalp has surveyed a number of approaches [84].
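The following minimal sketch illustrates the hashed-token idea: identifiers are normalized and passed through a keyed one-way hash so that two sites holding the same secret key derive the same token for the same person without exchanging names. Real privacy-preserving linkage adds careful field standardization, multiple token variants, and strict key governance; the key and fields below are hypothetical.

```python
# Minimal sketch of hashed linkage tokens for de-duplication without names.
# HMAC-SHA256 over normalized identifiers; key and normalization are illustrative only.
import hashlib
import hmac

SHARED_KEY = b"replace-with-a-securely-managed-secret"   # hypothetical key

def linkage_token(first: str, last: str, dob: str) -> str:
    normalized = f"{first.strip().upper()}|{last.strip().upper()}|{dob}"
    return hmac.new(SHARED_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()

site_a = linkage_token("Maria", "Gonzalez ", "1984-07-02")
site_b = linkage_token(" maria", "Gonzalez", "1984-07-02")
print(site_a == site_b)   # True: the two records can be de-duplicated without names
```

Key management then becomes the critical governance question, since anyone holding the key can regenerate tokens from known identifiers.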
A particularly interesting method has been advanced by Hejblum et al. in Boston [85]. It is clear that sufficient clinical details (such as diagnoses with encounter
dates) may be enough to identify a subject uniquely, or very nearly so, in two coherent
data sets. What if there are discrepancies between certain data elements in the two
sets? The method presented allows them to compute the probability of identity even
when certain elements do not agree. Other methods of de-duplication, not necessarily
with anonymity, include a Bayesian method [86] adapted from astronomy—galaxies in different astronomical databases may not have
matched names, but do have matched characteristics, by and large. Funded by the German
Medical Informatics Initiative [87], the SMITH consortium is developing the infrastructure to support a network of Data
Integration Centres (DIC), which will share services and functionality to provide
access to the local hospitals’ Electronic Medical Records (EMRs). Regulatory protections
will be provided by data trustees and privacy management services, but DIC staff will
be able to curate and amend EMR data in a core Health Data Storage. Secure multi-party
computation features in a number of studies, including a persuasive two-party instance
using garbled circuits in a geographically wide-ranging collaboration in the United
States [88] and an Estonian report of a multi-party system based on the Sharemind platform that
is “ready for practical use” [89]. Another technical aspect, which has received some attention, is the high combinatorial
cost of pairwise comparison for de-duplication. An approach known as “blocking and
windowing”, which originates in the founding studies in statistical de-duplication,
is used to reduce the dimensionality of the comparison space, and is still being refined
in various ways [90]. Further methods of interest use Bloom filter pairs. The method of Brown et al. exhibits good error tolerance [91], while that of Ranbaduge and Christen [92] includes the temporal information in records in its hashing process; this Australian
contribution is all the more interesting in the light of extensive national data linking
guidelines by the government [93].
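To give a flavour of the Bloom-filter family of methods, the sketch below hashes character bigrams of a surname into a fixed-length bit array and compares encodings with the Dice coefficient, which tolerates small typographical discrepancies. The parameters are illustrative only, and production schemes add keyed hashing and hardening against cryptanalysis.

```python
# Sketch of Bloom-filter encoding for privacy-preserving comparison of a name field.
# Filter length and number of hashes are illustrative; no hardening is applied here.
import hashlib

FILTER_BITS = 256
NUM_HASHES = 4

def bigrams(value: str):
    v = value.strip().upper()
    return {v[i:i + 2] for i in range(len(v) - 1)}

def bloom_encode(value: str) -> set:
    """Return the set of bit positions that would be set in the Bloom filter."""
    positions = set()
    for gram in bigrams(value):
        for seed in range(NUM_HASHES):
            digest = hashlib.sha256(f"{seed}|{gram}".encode()).hexdigest()
            positions.add(int(digest, 16) % FILTER_BITS)
    return positions

def dice(a: set, b: set) -> float:
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 1.0

enc1, enc2 = bloom_encode("GONZALEZ"), bloom_encode("GONZALES")   # one-letter difference
print(f"Dice similarity: {dice(enc1, enc2):.2f}")                 # high, despite the typo
```

Blocking or windowing would then restrict such pairwise comparisons to records sharing a cheap key (for example, year of birth), which is how the combinatorial cost mentioned above is kept in check.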
Blockchain has been suggested as a possible answer to the challenges of anonymous
data sharing. Indeed, in the world of Health IT, as one encounters it in practice
in health systems, blockchain is being viewed with interest [94]
[95]. Researchers have begun some exploratory work, but blockchain has not yet had wide
adoption in the field. Some interesting work can be reported here. The Journal of
Medical Systems has a special collection of papers on blockchain, including a study
of a blockchain-based privacy-preserving record linkage (PPRL) solution [96]. Other studies of blockchain include a Swiss-American systematic review of oncology
applications [97], 16 studies in all at the time of the review, distributed among countries led, unsurprisingly, by the USA (4 studies) and Switzerland (2 studies), with Germany, Iraq, Taiwan, Italy, and China contributing one study each, and a proof of principle study using a novel framework, HealthChain
[98].
A radically different concept in privacy preserving analysis is distributed computation.
A University of Pennsylvania-led team describes two performant algorithms [99]
[100], differentiated by resource requirements and performance, which analyze data behind each site's firewall and aggregate only statistical results, mainly regression models. Their particular success lies in controlling for data source heterogeneity and maintaining high faithfulness to the gold standard (i.e., analysis of the aggregated data). On the technical front, another anticipated development is that of trustworthy databases as described by Rogers et al. [101], in which complementary scenarios of trusted database/untrustworthy analyst and untrustworthy cloud/trusted analyst motivate the technical requirements.
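The principle behind such distributed analyses can be illustrated with ordinary least squares, where each site shares only its small summary matrices and the combined estimate is exactly what a pooled analysis would produce. The cited algorithms are considerably more sophisticated (notably for generalized linear models and source heterogeneity), so the sketch below, with simulated site data, illustrates the idea rather than their methods.

```python
# Distributed linear regression: sites share only X'X and X'y, never patient-level rows.
# The combined estimate equals the pooled ("gold standard") estimate exactly for OLS.
import numpy as np

rng = np.random.default_rng(0)

def simulate_site(n):
    """Hypothetical site data: intercept, one covariate, and an outcome."""
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    y = 1.5 + 2.0 * x + rng.normal(scale=0.5, size=n)
    return X, y

sites = [simulate_site(n) for n in (120, 80, 200)]

# Each site shares only these small summaries.
xtx = sum(X.T @ X for X, _ in sites)
xty = sum(X.T @ y for X, y in sites)
beta_distributed = np.linalg.solve(xtx, xty)

# Gold standard: pool all patient-level data (done here only to verify equivalence).
X_all = np.vstack([X for X, _ in sites])
y_all = np.concatenate([y for _, y in sites])
beta_pooled, *_ = np.linalg.lstsq(X_all, y_all, rcond=None)

print(np.allclose(beta_distributed, beta_pooled))   # True
```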
Much of the public concern with data sharing in healthcare [102] revolves around the known or alleged abuses that “big tech” stands accused of [103]
[104]
[105] and the near certainty that deidentification [106], even when expertly done and certified, does not eliminate the risk of reidentification
[107]
[108]
[109]. A contentious re-identification exercise was reported and commented on in the Journal
of the American Medical Association (JAMA) in 2018 [110]
[111]. As patient advocates have repeatedly pointed out, in a jurisdiction in which access to certain kinds of health or long-term care insurance, employment prospects, and other rights may be limited by what is known about one's health status, the privacy of PHI must be fiercely guarded.
Causal Inference and Real-World Evidence
A significant change in the regulatory environment impacted the world of food and
drug law in mid-2017, although this was written up later in 2018:
“In June 2017, FDA approved a new indication for a medical device without requiring
any new clinical trials. This approval marked the onset of a new era in drug and medical
device regulation: the systemic use of “Real World Evidence” (RWE). FDA based its
approval on records of the product’s actual patient use rather than on randomized
clinical trials” [112].
In some respects, biomedical research, informatics in particular, has been ahead of
the game. In the wake of the realization that the gold standard for research—randomized
clinical trials—is slow to deliver the hoped-for results and improvements, informatics
has embraced observational data and outcomes research. One of the most insightful lines of inquiry focuses on the question: if we want observational studies to deliver results comparably robust to those of clinical trials, how should we design our observational
studies? There is progress to report on a number of fronts, including data collection,
observational trial design, and causal analysis.
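As a toy illustration of the causal-analysis front, the sketch below applies inverse probability of treatment weighting to simulated data with a single measured confounder. It is not drawn from any of the cited work; it is meant only to make concrete the contrast between a naive comparison and a confounding-adjusted estimate.

```python
# Toy inverse probability weighting (IPW) example on simulated, confounded data.
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

confounder = rng.binomial(1, 0.4, n)          # e.g., baseline severity
p_treat = 0.2 + 0.5 * confounder              # sicker patients are treated more often
treated = rng.binomial(1, p_treat)
# True treatment effect on the outcome is +1.0; the confounder adds +2.0 on its own.
outcome = 1.0 * treated + 2.0 * confounder + rng.normal(size=n)

# Naive comparison, confounded because treatment and severity are entangled.
naive = outcome[treated == 1].mean() - outcome[treated == 0].mean()

# IPW: weight each patient by the inverse probability of the treatment actually received.
# The propensity score is known here by construction; in practice it is modeled.
ps = p_treat
w = treated / ps + (1 - treated) / (1 - ps)
mean_treated = np.sum(w * treated * outcome) / np.sum(w * treated)
mean_control = np.sum(w * (1 - treated) * outcome) / np.sum(w * (1 - treated))

print(f"naive difference: {naive:.2f} (biased upward by confounding)")
print(f"IPW estimate:     {mean_treated - mean_control:.2f} (true effect is 1.00)")
```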
Specialty journals as well as informatics titles have been reporting on particular
efforts to collect data for research in the process of delivery of care. Examples
include clinical oncology [113], neurology [114], nutrition and endocrinology [115]
[116], and pharmacovigilance [117], to name but a few; none of these is complacent or makes an easy equation between RWE and real-world data (RWD). It is generally appreciated that turning RWD
into RWE requires work—often ingenious and complex work [118]
[119]
[120]
[121].
This reviewer’s enthusiasm for Hernán’s and his collaborators’ work will be obvious
sooner or later, so I will leap right in. Their “second chance” paper [122] lays out the tasks ahead with great clarity and analytic perspicacity. Koch’s postulates
in microbiology are now obsolescent, but if one were to aspire, for observational
studies, to the degree of rigor they implied, this might be a good place to start.
Also worth following is a spirited defense of causality and debate between Hernán
and several other scientists with pro and con views in the American Journal of Public
Health [123]. Another debate on causality also took place in the Journal of the American Medical
Informatics Association (JAMIA) and casts a different light on the question [124]. Two books are likely to prove highly influential in this domain: Pearl and Mackenzie’s
The Book of Why [125], which provides the intellectual framework for a science of causality, and a forthcoming
textbook by Hernán and Robins, based on courses delivered at Harvard [126].