Appl Clin Inform 2021; 12(04): 816-825
DOI: 10.1055/s-0041-1733846
Research Article

A Framework for Systematic Assessment of Clinical Trial Population Representativeness Using Electronic Health Records Data

Yingcheng Sun
1   Department of Biomedical Informatics, Columbia University, New York, New York, United States
Alex Butler
1   Department of Biomedical Informatics, Columbia University, New York, New York, United States
2   Department of Medicine, Columbia University, New York, New York, United States
Ibrahim Diallo
1   Department of Biomedical Informatics, Columbia University, New York, New York, United States
Jae Hyun Kim
1   Department of Biomedical Informatics, Columbia University, New York, New York, United States
Casey Ta
1   Department of Biomedical Informatics, Columbia University, New York, New York, United States
James R. Rogers
1   Department of Biomedical Informatics, Columbia University, New York, New York, United States
Hao Liu
1   Department of Biomedical Informatics, Columbia University, New York, New York, United States
Chunhua Weng
1   Department of Biomedical Informatics, Columbia University, New York, New York, United States
Funding This work was supported by the National Library of Medicine grant R01LM009886–11 (Bridging the Semantic Gap Between Research Eligibility Criteria and Clinical Data) and National Center for Advancing Clinical and Translational Science grants UL1TR001873 and 3U24TR001579–05.


Background Clinical trials are the gold standard for generating robust medical evidence, but their results often raise generalizability concerns, which can be attributed to a lack of population representativeness. Electronic health record (EHR) data are useful for estimating the population representativeness of clinical trial study populations.

Objectives This research aims to systematically estimate the population representativeness of clinical trials using EHR data during the early design stage.

Methods We present an end-to-end analytical framework for transforming free-text clinical trial eligibility criteria into executable database queries conformant with the Observational Medical Outcomes Partnership Common Data Model and for systematically quantifying the population representativeness for each clinical trial.

Results Using this framework, we calculated the population representativeness of 782 novel coronavirus disease 2019 (COVID-19) trials and 3,827 type 2 diabetes mellitus (T2DM) trials in the United States. Owing to the use of overly restrictive eligibility criteria, 85.7% of the COVID-19 trials and 30.1% of the T2DM trials had poor population representativeness.

Conclusion This research demonstrates the potential of using EHR data to assess the population representativeness of clinical trials, providing data-driven metrics to inform the selection and optimization of eligibility criteria.

Protection of Human and Animal Subjects

No human or animal subjects were involved in the project.

Publication History

Received: 18 April 2021

Accepted: 23 June 2021

Article published online: 08 September 2021

© 2021. Thieme. All rights reserved.

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany
