Open Access
CC BY-NC-ND 4.0 · Yearb Med Inform 2021; 30(01): 219-225
DOI: 10.1055/s-0041-1726540
Section 8: Bioinformatics and Translational Informatics
Survey

Predictions, Pivots, and a Pandemic: a Review of 2020's Top Translational Bioinformatics Publications

Authors

  • Scott P. McGrath

    1   CITRIS Health, University of California Berkeley, Berkeley, CA, USA
  • Mary Lauren Benton

    2   Department of Computer Science, Baylor University, Waco, TX, USA
  • Maryam Tavakoli

    3   MTERMS Lab, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  • Nicholas P. Tatonetti

    4   Department of Biomedical Informatics, Columbia University, New York, NY, USA
 

Summary

Objectives: Provide an overview of the emerging themes and notable papers which were published in 2020 in the field of Bioinformatics and Translational Informatics (BTI) for the International Medical Informatics Association Yearbook.

Methods: A team of 16 individuals scanned the literature from the past year. Using a scoring rubric, papers were evaluated on their novelty, importance, and objective quality. 1,224 Medical Subject Headings (MeSH) terms extracted from these papers were used to identify themes and research focuses. The authors then used the scoring results to select notable papers and trends presented in this manuscript.

Results: The search phase identified 263 potential papers and central themes of coronavirus disease 2019 (COVID-19), machine learning, and bioinformatics were examined in greater detail.

Conclusions: When addressing a once-in-a-century pandemic, scientists worldwide answered the call, with informaticians playing a critical role. Productivity and innovation reached new heights in both translational bioinformatics and science at large, but significant research gaps remain.


1 Introduction

Each year, the International Medical Informatics Association (IMIA) Yearbook includes a survey manuscript reviewing notable papers and trends in the field of Bioinformatics and Translational Informatics (BTI). The advancement of knowledge in other areas of BTI continued, despite the focus on coronavirus disease 2019 (COVID-19) and the disruptions to research and work caused by precautionary shut-downs. Machine learning and drug repositioning remain hot topics, continuing a trend seen in the 2020 Yearbook of Medical Informatics [[1]]. Significant upheaval occurred over the past year, but there are plenty of published works worthy of praise.

In this year's search, we found exciting pairings of machine learning with systematic immunogenic profiling [[2]], work adapting and integrating multiple data modalities to study disease [[3]], and examples of drug design and discovery tools built to accelerate treatment options and targets for COVID-19 vaccines [[4]]. In machine learning, we witnessed wider application of interpretation methods through a variety of tool sets, along with continued concern about data security, privacy, and bias. In bioinformatics, there has been a massive increase in the use of single-cell gene expression datasets, in line with the field of molecular and cellular biology as a whole. Drug outcome prediction techniques continue to be refined, and the increasing complexity of global biobanks is providing richer datasets. However, the need to diversify the populations in these datasets remains a priority.

For the year 2020, bioinformatics, science, and life in general were disrupted by the COVID-19 global pandemic. The scientific community did not retreat and, in fact, rose to meet the challenge. By collaborating and accelerating the dissemination of scientific knowledge at a pace never seen before, researchers made enormous strides in understanding COVID-19 and how to combat its spread. Informatics methods were often central to the execution, analysis, and presentation of these results. We will take some time to reflect on both the positive and negative outcomes of some of those changes.


2 Methods

We relied on a literature review activity, which serves as the foundation of the translational and bioinformatics year in review presentation at the American Medical Informatics Association (AMIA) Informatics Summit. This has been a recurring annual presentation given over the past decade and is a good barometer for notable papers and trends in the field [[5]].

In this year's effort, a team of 16 students and young informatics professionals aggregated papers published from December 2019 until January of 2021. The following query was used to search for manuscripts and modified as needed by members of the team:

(sign OR symptom OR disease OR drug) and (genome OR protein OR small molecule OR RNA OR DNA) AND (computer OR informatics OR statistics)

Our initial query identified 263 papers. The group then graded the manuscripts with a rubric that evaluated informatics novelty in their methods and techniques, topic importance, and overall quality. We used this corpus to identify the manuscripts which highlight some of the trends from this year. Trends were identified by using the Medical Subject Headings (MeSH) on Demand website to capture the MeSH terms from the papers. A total of 1,224 MeSH terms were identified from this step. A Python script was then used to cluster the terms and identify themes. [Table 1] presents the top MeSH terms based on frequency count, and [Table 2] shows the top themes which emerged from our corpus.
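The frequency count behind a table of this kind can be sketched in a few lines of Python. This is a minimal illustration only, not the script used by the authors: the function name `mesh_term_frequency`, the paper identifiers, and the term assignments are all hypothetical, and the clustering of terms into themes is not shown.

```python
from collections import Counter


def mesh_term_frequency(papers, top_n=10):
    """Count how many papers carry each MeSH term.

    `papers` maps a paper identifier to the list of MeSH terms assigned to it.
    Returns (term, paper_count, percent_of_papers) tuples, most frequent first;
    ties are broken alphabetically so the ranking is deterministic.
    """
    if not papers:
        return []
    counts = Counter()
    for terms in papers.values():
        # A set ensures a term repeated within one paper is counted once.
        counts.update(set(terms))
    total = len(papers)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(term, n, round(100.0 * n / total, 2)) for term, n in ranked[:top_n]]


# Hypothetical toy corpus; identifiers and term assignments are invented.
toy_papers = {
    "paper_a": ["Machine Learning", "Genomics"],
    "paper_b": ["Machine Learning", "COVID-19", "SARS-CoV-2"],
    "paper_c": ["Machine Learning", "Genomics", "Neoplasms"],
    "paper_d": ["COVID-19"],
}

for term, n, pct in mesh_term_frequency(toy_papers, top_n=3):
    print(f"{term}: {n} papers ({pct}%)")
```

Counting each term once per paper means the percentage column reads as the share of papers tagged with that term, which is how the tables below are labeled.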

Table 1

MeSH terms frequency.

Term                           Paper count    % of Papers
Machine Learning               49             20
Pharmaceutical Preparations    44             17.96
Algorithms                     42             17.14
Genomics                       41             16.73
Neoplasms                      37             15.1
Phenotype                      33             13.47
Transcriptome                  31             12.65
SARS-CoV-2                     26             10.61
COVID-19                       26             10.61
Electronic Health Records      23             9.39
Animals                        21             8.57

Table 2

Paper themes.

Term                                              Paper count    % of Papers
Investigative Techniques                          208            84.9
Environment and Public Health                     163            66.53
Information Science                               155            63.27
Health Care Quality, Access, and Evaluation       143            58.37
Genetic Phenomena                                 133            54.29
Natural Science Disciplines                       112            45.71
Mathematical Concepts                             109            44.49
Amino Acids, Peptides, and Proteins               76             31.02
Neoplasms                                         74             30.2
Health Care Facilities, Manpower, and Services    73             29.8
Health Services Administration                    69             28.16


3 Results

3.1 SARS-CoV-2

The scope of this paper is to perform a survey of the literature from the past year in the areas of bioinformatics and translational informatics. However, we believe that before starting any recent survey of scientific literature, one must address the largest sudden health crisis in modern history.

3.1.1 A Pandemic Arrives

The World Health Organization (WHO) formally declared coronavirus disease 2019 (COVID-19) a Public Health Emergency of International Concern (PHEIC) on January 30th, 2020 [[6]]. A PHEIC is the WHO's highest level of alarm, and this declaration set the stage for the year to come. Since 2009, nine events have been assessed for potential PHEIC declarations, with six formal declarations: the 2009 H1N1 pandemic, the 2014 polio declaration, the 2014 Ebola outbreak, the 2016 Zika virus epidemic, the 2018 Kivu Ebola outbreak, and the ongoing COVID-19 pandemic [[7]]. COVID-19 is not the longest PHEIC (the 2014 polio PHEIC remains in effect in 2021), but it does stand apart in its global impact. By March of 2021, global cases of COVID-19 had exceeded 126 million, with 2.77 million deaths worldwide. The largest impacts have been seen in the United States and Brazil, with deaths in excess of 559,000 and 340,000 respectively as of April of 2021 [[8]]. By comparison, the swine flu (H1N1) was estimated to have caused 284,000 deaths worldwide (from a range of 150,000 to 575,000 deaths) [[9]]. Global cost estimates of the COVID-19 pandemic have been set at $28 trillion by the International Monetary Fund [[10]], and the impact on the United States alone is estimated at $16 trillion [[11]]. Unsurprisingly, this has caused the COVID-19 pandemic to be labeled the worst global crisis since the Great Depression [[12]].

The ways COVID-19 has impacted daily life, science included, have been profound. Changes observed in the publication of scientific manuscripts were of particular relevance to our topic here. Scientific globalism suddenly found a largely unfettered path, a heightened focus on a singular topic, and a rich variety of research targets, all with a growing sense of urgency [[13]].

Scientists worldwide engaged in a collective action that became the largest research pivot in modern science. The pace of research across many fronts was astounding, with massive intellectual horsepower harnessed in this effort. Within one month of the first COVID-19 outbreak in Wuhan, China, in December of 2019, multiple full viral genomes had been sequenced [[14] [15]]. Vaccine development typically faces a 10-15 year research and testing window [[16]]. In 1967, the mumps vaccine was developed in just four years, a record that would stand for over 50 years [[17]]. Less than a year into the COVID-19 pandemic, 19 vaccine candidates yielded two different and highly effective vaccines [[18]]. By March of 2021, there were 76 SARS-CoV-2 vaccines in clinical trials and six vaccines approved for emergency use [[19]]. Scientific publications on the pandemic also reached an unprecedented level. New curated literature sites emerged, like LitCovid, which includes over 116,000 COVID-19 articles as of early April 2021 [[20]].


3.1.2 Scientific Publishing's Transmutation

The scientific publishing industry also had to adapt in extraordinary ways. With the world's research focused on a single topic, there was a sudden deluge of paper submissions. For context, since the discovery of Ebola in 1976, ~9,700 Ebola-related papers have been published [[21]]. According to LitCovid, over the past year (March 16th 2020 – March 14th 2021) an average of 2,075 COVID-19 papers were published per week, with 4,322 appearing in the week of August 24th alone. The only significant dip occurred the week of Christmas (December 21st – December 27th), when only 1,057 new papers came out.

Publishers adopted several different techniques to help streamline the publication pipeline. The journal eLife announced it would cut back on requests for additional experiments during revisions, suspend revision deadlines, and require all submissions to post preprints to bioRxiv or medRxiv [[22]]. The Royal Society Open Publishing recruited a group of 700 reviewers who committed to reviewing fast-tracked COVID-19 papers in 24 to 48 hours [[23]]. Efforts to expedite the publication process were found to be very effective across the board. Typically, a biomedical manuscript takes a median of 100 days from submission to acceptance [[24]]. Studies found that the time between submission and publication for COVID-19 papers decreased by 49% on average [[23]]. Palayew et al. found a 6-day median time from submission to publication in the early stages of the pandemic [[24]]. This highlights the demand for the most recent data on COVID-19 and the lengths to which publishers went to ensure data reached scientists and medical professionals quickly.

Demand for the newest information on SARS-CoV-2 was not confined to scientific circles. The general public was also ravenous for any new material they could find. The social news aggregation site Reddit.com had two dedicated communities, known as subreddits, materialize during the pandemic: /r/Coronavirus[1] and /r/COVID-19[2]. The /r/Coronavirus subreddit has over 2.36 million members and is dedicated to general information and news about the pandemic. The sister subreddit, /r/COVID-19, focuses on the emerging science on the virus and has over 317k members. The science-focused /r/COVID-19 subreddit has additional rules for sharing material and is more heavily moderated. The massive interest in pre-print servers was often reflected in these communities, as members shared and discussed the latest pre-print manuscripts in parallel with the latest published papers. This enthusiasm for science is a bright spot to emerge from the pandemic, with younger generations expressing more interest in STEM careers [[25]]. However, it may be somewhat tempered by concerns over the rapid pace of pre-print and publication and the potential for some corners to be cut.


3.1.3 Pitfalls and Pratfalls

For all the advancement and acceleration of the science focused on COVID-19, there were significant errors caused by removing some of the traditional guardrails in scientific publication. The website Retraction Watch, which monitors retracted manuscripts, has been tracking COVID-19 papers and noted 75 fully retracted papers, 11 retracted due to journal error, four retracted and reinstated, and five flagged with expressions of concern [[26]]. Pre-print servers like medRxiv[3] and bioRxiv[4] served as platforms to help accelerate publication and witnessed exponential growth during this pandemic [[27]]. However, concerns about medical preprints were validated as some papers went viral before there was adequate review [[28]]. For example, a pre-print about seroprevalence in Santa Clara County received national media attention when it first appeared on April 17th, 2020 [[29]]. Just a few days later, people were expressing serious concerns about potential flaws in the study [[30]], but only after it had captured the attention of the general public [[31]]. Traditional peer review should have addressed these concerns prior to publication, but the new and faster process may have led to more errors by reviewers and editors. Rushed and flawed papers were not the only concerning outcome from this pandemic. There are signs that the gender gap in science may be further exacerbated, as female scientists, particularly those with young dependents, reported significant declines in the time they could devote to their research over the past year, which could impact their careers for years to come [[32]]. A period of reflection will be needed to further identify what elements helped advance science during this pandemic, and what issues require repair or removal to prevent additional harm in the future. This sets the stage for the environment we encountered when beginning our survey of bioinformatics and translational informatics papers.

COVID-19 caused tectonic shifts in how science and the world adjusted during a modern pandemic. Scientific information saw the arrival of new pathways for dissemination. While the impact of COVID-19 has been profound, we do not want it to steal the spotlight from other notable papers and trends from the past year. After reviewing the MeSH term frequency results in [Table 1], we decided to organize the manuscripts we wanted to highlight into two categories: machine learning and bioinformatics.



3.2 Machine Learning

We reviewed novel machine learning methods proposed by the top-scored manuscripts with Information Science (L01) and Mathematical Concepts (G17) MeSH headers and identified a few significant perspectives to discuss further in this section.

3.2.1 Representation

Designing a meaningful and suitable representation for the data is one of the most crucial steps in a machine learning pipeline. Engineering meaningful and useful features takes a lot of time, hypothesis analysis, and domain expertise. Recent deep learning models offer the potential for automatic feature extraction with relatively high performance. Nevertheless, it is crucial to interpret and validate the extracted features properly.

In this year's top-scored manuscripts, embeddings and distributed representations remain a popular alternative or addition to classic feature engineering in predictive tasks. The representations are mainly extracted by deep learning [[33] [34] [35] [36] [37] [38]] or latent probabilistic [[35]] methods. These distributed representations, i.e., embeddings, are used to encode various modalities of data, including gene expressions [[39] [40]], events [[36]], images [[33]], and other relational graph data [[37] [41]]. Embedding methods produce data-driven representations that capture semantic and contextual information and incorporate it into a numerical representation. However, the heavy dependence of data-driven methods on data quality, and the detachment of domain knowledge and validation methods from the feature extraction process, suggest a broad range of potential improvements for research in this area.
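To make the idea of a numerical representation that captures similarity concrete, the following minimal sketch compares toy embedding vectors with cosine similarity. The entity names and the three-dimensional vectors are invented for illustration; learned embeddings typically have many more dimensions, and none of the cited studies is claimed to use this exact comparison.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def nearest(query, embeddings):
    """Return the key whose vector is most similar to that of `query`."""
    candidates = (key for key in embeddings if key != query)
    return max(
        candidates,
        key=lambda key: cosine_similarity(embeddings[query], embeddings[key]),
    )


# Hypothetical embeddings; names and values are invented.
toy_embeddings = {
    "gene_a": [0.9, 0.1, 0.0],
    "gene_b": [0.8, 0.2, 0.1],
    "gene_c": [0.0, 0.1, 0.9],
}

print(nearest("gene_a", toy_embeddings))
```

Entities whose vectors point in similar directions are treated as related, which is why the quality of the training data matters so much: the geometry is only as meaningful as the data that shaped it.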

In some drug-related studies, variations of the graph convolution network (GCN) [[42]] are used to incorporate domain knowledge of topological chemical structures into the representation learning process. Examples include the use of GCN in DeepCDR [[43]] and the use of a directed message-passing deep neural network model [[44]] for antibiotic drug discovery [[37]]. In multimodal studies [[41] [45] [46]], information fusion is designed in a graph-based form according to a domain-driven information flow. Wang et al. proposed a bipartite GCN for drug re-purposing prediction, which accounts for the central role of proteins in drug-disease association [[41]]. These methods are examples of a more general direction: incorporating domain knowledge to refine data-driven approaches.
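The way a graph convolution mixes each node's features with those of its neighbors can be shown with a single-layer sketch. It assumes the commonly used propagation rule H' = ReLU(D^-1/2 (A + I) D^-1/2 H W), where A is the adjacency matrix, I adds self-loops, and D is the degree matrix of A + I; the variants used in the cited studies may differ. The three-atom toy graph, feature values, and weight matrix are invented.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def normalized_adjacency(adj):
    """Add self-loops, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    with_loops = [
        [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)
    ]
    degree = [sum(row) for row in with_loops]
    return [
        [with_loops[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]


def gcn_layer(adj, features, weights):
    """One graph convolution: aggregate neighbors, project, apply ReLU."""
    aggregated = matmul(normalized_adjacency(adj), features)
    projected = matmul(aggregated, weights)
    return [[max(0.0, value) for value in row] for row in projected]


# Hypothetical three-atom chain (0-1-2) with two features per atom.
toy_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_weights = [[0.5, -0.5], [0.25, 0.75]]

for row in gcn_layer(toy_adj, toy_features, toy_weights):
    print(row)
```

Because the adjacency matrix encodes which atoms are bonded, the chemical topology enters the learned representation directly rather than being inferred from the data.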


3.2.2 Interpretation

It is notable that in many studies with deep learning, interpretation approaches were applied either by using toolsets such as SHapley Additive exPlanations (SHAP) [[47]] or by applying a parallel traditional machine learning method. Zhang et al. used a surrogate support vector machine (SVM) for convolutional neural network predictions as an interpretation method in a pyrazinamide resistance prediction study to identify important genetic factors for Mycobacterium tuberculosis [[48]]. Smedley et al. trained a transformer model and used gene masking and saliency to interpret and understand the mapping between genes and MRI image traits of cancer tumors [[49]].
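The attribution idea underlying Shapley-based tools can be illustrated by computing exact Shapley values for a tiny model through enumeration of all feature coalitions. This is a sketch of the underlying concept, not the SHAP library or its API: features outside a coalition are replaced by a baseline value (an assumption of this example), and `toy_model` is a made-up scoring function.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley attribution of model(x) - model(baseline) to each feature.

    Features absent from a coalition take their baseline value. The cost grows
    exponentially with the number of features, so this suits toy cases only.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    attributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                phi += weight * (value(members | {i}) - value(members))
        attributions.append(phi)
    return attributions


def toy_model(features):
    # Hypothetical score with an interaction between features 1 and 2.
    return 2.0 * features[0] + features[1] * features[2]


print(shapley_values(toy_model, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))
```

The attributions sum to the difference between the prediction and the baseline prediction, and the two interacting features share the credit for their joint effect equally.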

In a pioneering article by Ashdown et al., informatics and molecular biology were integrated to produce a system for predicting and evaluating antimalarial drug action [[33]]. While the goal of the study itself is laudable, the execution is what makes it so notable. In this study, the authors use laboratory experiments to generate fluorescence imaging data of normal Plasmodium falciparum cell growth. They first demonstrate the use of deep neural networks (DNN) to process this data into an interpretable quantitative feature that couples tightly with the cell cycle. Using this new analytical representation, they then show how disruptions to the cell cycle (by chemical agents, for example) can be easily identified in their new feature. The authors round out the study by using their DNN representation to accurately reveal the mechanisms of action of the chemical agents. This well-written and well-executed study serves as an exemplar of impactful and understandable neural network-based research.


3.2.3 Data Security, Privacy, and Bias Concerns

The growing demand for data-centered analyses raises two important concerns. On the one hand, prediction bias arises when models are trained on datasets that are not representative of all races and population characteristics. This issue naturally calls for more systematic data collection and data sharing practices. On the other hand, institutions must still preserve individual and population-level information privacy and prevent unintended information leakage in this data era. Gao et al. suggested transfer learning as an alternative to mixture and stratification-based models for partial bias recovery [[34]]. The authors elegantly demonstrate the utility of transfer learning in addressing underrepresentation in existing data and how to identify its source. Other studies propose better data sharing practices and a move toward federated machine learning [[50]] methods that preserve security [[4]] and privacy [[51] [52]] while pursuing data-centered research.
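The core of federated learning, where sites share model parameters rather than patient-level records, can be sketched as a sample-size-weighted average of locally trained weights. This follows the general federated averaging idea and is not claimed to be the method of any cited study; the site sizes and parameter vectors are hypothetical.

```python
def federated_average(client_updates):
    """Combine locally trained parameters without pooling the raw data.

    `client_updates` is a list of (num_samples, weights) pairs, one per site.
    Each site's parameter vector is weighted by its number of training samples.
    """
    total_samples = sum(num_samples for num_samples, _ in client_updates)
    if total_samples == 0:
        raise ValueError("at least one site must contribute samples")
    dimension = len(client_updates[0][1])
    return [
        sum(num_samples * weights[k] for num_samples, weights in client_updates)
        / total_samples
        for k in range(dimension)
    ]


# Hypothetical updates from two hospitals; only these vectors leave each site.
site_updates = [
    (100, [1.0, 0.0]),
    (300, [0.0, 4.0]),
]

print(federated_average(site_updates))
```

In a full system this averaging step would be repeated over many training rounds, and additional protections such as secure aggregation or differential privacy may still be needed, since model updates themselves can leak information.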



3.3 Bioinformatics

One of the main themes from our highly-ranked bioinformatics papers was the use of informatics to decipher data from more advanced experimental techniques. In order to better capture relevant variability in traits, single-cell gene expression datasets are becoming increasingly common. Single-cell RNA-sequencing is better able to account for dynamics across cell states, even when using simple linear models. For example, Li et al. predicted breast cancer prognosis by modeling gene expression from single-cell RNA-seq during an important cellular transition [[53]]. Similarly, other studies leveraged single-cell techniques to study populations of cells across time and space, from mapping pathway activation in response to stimuli [[54]] and contrasting expression profiles across developmental stages [[55]] to profiling chromatin accessibility across brain regions [[56]]. Ultimately, this shift away from bulk sequencing assays allows for a more nuanced view of multi-omics data, greatly improving our ability to measure the dynamic processes influencing disease progression and outcomes.

Informatics is also commonly applied to develop clinically-relevant prediction models using genomics data. Given the diverse range of -omics datasets available, studies from this year considered novel ways to integrate data from multiple experimental sources in order to build more accurate models and highlight mechanisms underlying disease. One striking example is the multi-omics approach designed by Su et al. to tease apart the immunological differences between mild, moderate, and severe COVID-19 [[54]]. The authors linked gene expression to changes in immune signaling and clinical measures that differentiate between patients with mild versus moderate disease. The biomarkers discovered through this analysis provide a starting point for developing prognostic metrics and targeted treatments for COVID-19.

3.3.1 Drug Development and Clinical Outcomes

Drug development is another major application area for such technology. Predicting drug response for individual patients remains challenging, especially for notoriously heterogeneous diseases such as cancer. Liu et al. developed a deep learning framework to predict drug response by modeling the molecular structures of the drugs themselves [[43]]. These networks of structural properties were further integrated with networks derived from genomic, transcriptomic, and epigenomic data. The features informed a final model that was able to accurately predict drug response across multiple cancer cell lines, either as the IC50 sensitivity value or classification as sensitive/resistant. When coupled with heterogeneous networks to assist with biological interpretation, predictive multi-omics models (such as the one presented in [[43]]) are interpretable and can perform well. Combining novel features with existing -omics networks will refine future models as the networks continue to evolve.

Genomics potentially impacts other clinically relevant health outcomes. Christian et al. found that patients prescribed medications that were incongruent with their genetics were more likely to have low adherence to those medications [[44]]. This study provides an interesting perspective on the impact of genomic information on other aspects of disease treatment, and suggests that including genomic information in routine clinical care can positively impact health behaviors.

It remains important to disentangle the effects of genetic variation on disease, especially variation in non-protein-coding genomic regions thought to regulate the expression of genes. Mediated expression score regression is a new approach that aims to quantify the contribution of variants to disease by calculating the proportion of disease heritability mediated by gene expression [[57]]. Although the absolute value is low, the authors found that a significant proportion of disease heritability from GWAS is mediated by gene expression in cis. Similarly, PhenomeXcan linked functional genomics and transcriptomics with trait-associated variation to connect genetically regulated gene expression with phenotype [[58]]. A deeper understanding of the relationship between genetic variation, gene expression, and phenotype will not only enable further improvements to variant effect prediction algorithms but will also generate useful hypotheses for future analysis.

2020 also saw the rise of whole-omics approaches to understanding SARS-CoV-2 infection. Ramlall et al. discovered a critical role for the complement system in COVID-19 through a hybrid analysis combining clinical data from EHRs with genomic data from the UK Biobank [[59]]. Given the urgency of the COVID-19 pandemic, researchers turned en masse to informatics and data-driven approaches to find possible therapeutics. Studies that integrated lead prioritization based on chemical informatics were quite notable. Panda et al. conducted exhaustive molecular dynamics simulations of several compounds with activity against SARS-CoV-2's viral receptor binding domain [[4]]. The authors used available data in ChEMBL (a database of compound-target activities) to identify 38 drug-like compounds with activity against coronavirus targets. They then followed up with molecular dynamics models to identify the specific binding pockets and possible mechanisms of action. This type of rapid therapeutic hypothesis generation is made possible by the tireless work of informaticians over the past 20 years to structure, organize, and release data and analytical methods.


3.3.2 Biobanks

With the continued growth of EHR-linked biobanks, increasing numbers of individuals are available with matched genomic and clinical data. Algorithms applied to these datasets can define populations based on similar attributes and highlight shared disease biology. For example, Cortes et al. clustered patients in the UK Biobank based on disease associations derived from TreeWAS [[60]]. Similar to the multi-omics approaches described earlier, the authors leveraged gene ontology hierarchies to implicate specific underlying biological processes in the disease clusters. Genetic risk scores applied to individual clusters revealed separation based on comorbidities and biological processes, both of which provide insight into disease sub-phenotypes and potential avenues of treatment. This article highlights the continued movement towards incorporating genomic data to improve our clinical understanding of disease.

Although many EHR-linked biobanks exist, individual-level data is not widely shared between sites due to patient privacy concerns. However, data sharing between biobanks would increase power for informatics studies and enable larger research efforts. Statistical methods may be able to overcome the challenges involved with data sharing. For example, Sum-Share is a method developed to detect pleiotropic genetic variants without requiring access to individual-level data [[61]]. Instead, the approach uses only summary statistics from multiple EHR-linked biobanks to detect pleiotropic effects. The authors demonstrate that this method detects pleiotropic variants with the same accuracy as a full analysis of individual-level data and with increased power compared to PheWAS approaches. This work demonstrates the potential for novel informatics approaches to expand the universe of accessible data and improve power for association studies without compromising patient privacy.
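The general principle of combining evidence across sites from summary statistics alone can be illustrated with a standard fixed-effect, inverse-variance-weighted meta-analysis. This is a deliberately simpler, swapped-in technique and is not the Sum-Share algorithm; the per-site effect sizes and standard errors are invented.

```python
import math


def inverse_variance_meta(site_estimates):
    """Pool per-site effect estimates without individual-level data.

    `site_estimates` is a list of (beta, standard_error) pairs, one per site.
    Each estimate is weighted by 1 / standard_error**2, so more precise
    sites contribute more. Returns (pooled_beta, pooled_standard_error).
    """
    weights = [1.0 / (se * se) for _, se in site_estimates]
    total_weight = sum(weights)
    pooled_beta = (
        sum(w * beta for w, (beta, _) in zip(weights, site_estimates))
        / total_weight
    )
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_beta, pooled_se


# Hypothetical summary statistics for one variant from two biobanks.
toy_sites = [(0.2, 0.1), (0.4, 0.1)]

pooled_beta, pooled_se = inverse_variance_meta(toy_sites)
print(pooled_beta, pooled_se)
```

The pooled standard error is smaller than either site's own, which is the gain in statistical power that motivates sharing summary statistics when individual-level data cannot leave a site.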


3.3.3 Genomic Diversity

One theme was notable for its absence from most of the top-scored articles discussed here. It is well documented that historical biases in data collection and analysis have led to the overrepresentation of populations of European descent in genomic studies [[62] [63] [64]]. Health disparities can result from the lack of diversity in existing genomic datasets, especially when computing polygenic risk scores for future clinical use [[65] [66]]. The authors of a polygenic risk score for glaucoma mentioned the need to develop and validate such scores in additional populations to ensure generalizability [[67]]. However, despite the use of genetic risk scores and other forms of predictive modeling based on genomic data in other articles, discussion of diversity and health disparities is not at the forefront. In order to make equitable advances in healthcare moving forward, we must consider potential historical biases in the underlying datasets and prioritize the inclusion of underrepresented populations in modeling and validation efforts. This is especially true in times of global crisis as we have witnessed this past year. In the meantime, machine learning techniques, such as transfer learning, may help to mitigate some of these disparities while we continue to push for increased diversity in our datasets [[34]].
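For readers less familiar with polygenic risk scores, the basic additive form is a weighted sum of risk-allele counts, which is why weights estimated in one population may transfer poorly to another. The sketch below assumes that simple additive form; the variant identifiers, effect sizes, and genotype are hypothetical and are not taken from the cited glaucoma score.

```python
def polygenic_risk_score(effect_sizes, genotype):
    """Additive polygenic risk score for one individual.

    `effect_sizes` maps a variant identifier to its per-allele effect estimate.
    `genotype` maps a variant identifier to the risk-allele count (0, 1, or 2).
    A variant missing from `genotype` raises KeyError instead of being
    silently treated as zero copies.
    """
    return sum(beta * genotype[variant] for variant, beta in effect_sizes.items())


# Hypothetical effect sizes, e.g. as estimated in a single-ancestry cohort.
toy_effect_sizes = {"variant_1": 0.5, "variant_2": -0.2, "variant_3": 0.1}
toy_genotype = {"variant_1": 2, "variant_2": 1, "variant_3": 0}

print(polygenic_risk_score(toy_effect_sizes, toy_genotype))
```

Because the effect sizes are estimated from a particular cohort, a score built this way inherits that cohort's ancestry composition, which is the generalizability concern raised above.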




4 Conclusion

Informatics, science, and life at large have been forever shifted by the global pandemic caused by the coronavirus SARS-CoV-2. For science generally, we have witnessed unprecedented productivity, made possible by the groundwork laid by a generation of informaticians. In this review, we highlight some of the year's most influential and inspiring informatics work. These works address the most important challenges of our time: the pandemic, underrepresentation bias, and high-throughput multi-omics integration, among others. Even so, significant research gaps remain. Biases in biomedical data limit our understanding of disease and contribute to higher morbidity and mortality for minority populations. Global warming and climate change will have severe impacts on the incidence of disease and the equitable distribution of healthcare. If these past 14 months have demonstrated anything, however, it is that the bioinformatics community is ready and willing to face these challenges head on.



No conflict of interest has been declared by the author(s).

Acknowledgements

The authors would like to thank the 2021 AMIA Year in Review research team for their work developing the source material for this paper and Melanie McGrath (PhD, LAT, ATC) for aid in the preparation of this manuscript.

1 https://reddit.com/r/Coronavirus


2 https://reddit.com/r/COVID-19


3 https://medrxiv.org


4 https://biorxiv.org



Correspondence to

Scott McGrath, PhD

Nicholas Tatonetti, PhD

Publication History

Article published online:
03 September 2021

© 2021. IMIA and Thieme. This is an open access article published by Thieme under the terms of the Creative Commons Attribution-NonDerivative-NonCommercial License, permitting copying and reproduction so long as the original work is given appropriate credit. Contents may not be used for commercial purposes, or adapted, remixed, transformed or built upon. (https://creativecommons.org/licenses/by-nc-nd/4.0/)

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany