CC BY 4.0 · Appl Clin Inform 2025; 16(02): 314-326
DOI: 10.1055/a-2524-5216
Research Article

Application of an Externally Developed Algorithm to Identify Research Cases and Controls from EHR Data: Trials and Triumphs

Nelly Estefanie Garduno-Rapp* (1), Simone Herzberg* (2, 3), Henry H. Ong (4), Cindy Kao (1), Christoph U. Lehmann (1), Srushti Gangireddy (4), Nitin B Jain (5), Ayush Giri (2, 6)

Author Affiliations

1   Clinical Informatics Center, University of Texas Southwestern Medical Center, Dallas, Texas, United States
2   Division of Epidemiology, Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, United States
3   Medical Scientist Training Program, Vanderbilt University School of Medicine, Nashville, Tennessee, United States
4   Center for Precision Medicine, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, United States
5   Department of Physical Medicine and Rehabilitation, University of Michigan, Ann Arbor, Michigan, United States
6   Division of Quantitative and Clinical Sciences, Department of Obstetrics and Gynecology, Vanderbilt University Medical Center, Nashville, Tennessee, United States

Funding

This work received funding from the U.S. Department of Health and Human Services, National Institutes of Health, National Center for Advancing Translational Sciences (grant no.: UL1TR003163), and National Institutes of Health, National Institute of Arthritis and Musculoskeletal and Skin Diseases (grant no.: R01AR074989).

Abstract

Background

The use of electronic health records (EHRs) in research demands robust and interoperable systems. By linking biorepositories to EHR algorithms, researchers can efficiently identify cases and controls for large observational studies (e.g., genome-wide association studies). This is critical for ensuring efficient and cost-effective research. However, the lack of standardized metadata and algorithms across different EHRs complicates their sharing and application. Our study presents an example of a successful implementation and validation process.

Objectives

This study aimed to implement and validate a rule-based algorithm from a tertiary medical center in Tennessee to classify cases and controls from a research study on rotator cuff tear (RCT) nested within a tertiary medical center in North Texas and to assess the algorithm's performance.

Methods

We applied a phenotypic algorithm (designed and validated in a tertiary medical center in Tennessee) to EHR data from 492 patients enrolled in a case-control study and recruited from a tertiary medical center in North Texas. The algorithm used International Classification of Diseases (ICD) and Current Procedural Terminology (CPT) codes to identify case and control status for degenerative RCT. A manual review was conducted to compare the algorithm's classification with a previously recorded gold standard documented by clinical researchers.
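
As a rough illustration of how a rule-based, code-driven classifier of this kind can be structured, the sketch below assigns a status from a patient's set of diagnosis and procedure codes. It is not the published algorithm: the code sets, the `classify_patient` function, and the "excluded" category are hypothetical and chosen only for illustration, and the study's actual data dictionary and decision rules are not reproduced here.

```python
from typing import Iterable, Set

# Illustrative placeholders only: these sets are assumptions for the sketch,
# NOT the study's data dictionary.
CASE_DX_CODES: Set[str] = {"M75.100", "M75.110", "M75.120"}
CASE_CPT_CODES: Set[str] = {"29827", "23412"}
CONTROL_EXCLUSION_CODES: Set[str] = {"M75.40", "M75.50"}


def classify_patient(codes: Iterable[str]) -> str:
    """Return 'case', 'excluded', or 'control' for one patient's codes.

    Hypothetical rule: any case diagnosis or procedure code makes a case;
    otherwise any exclusion code removes the patient from the control pool;
    everyone else is treated as a control.
    """
    observed = {code.strip().upper() for code in codes}
    if observed & (CASE_DX_CODES | CASE_CPT_CODES):
        return "case"
    if observed & CONTROL_EXCLUSION_CODES:
        return "excluded"
    return "control"


if __name__ == "__main__":
    print(classify_patient(["M75.100", "I10"]))
    print(classify_patient(["I10"]))
```

Keeping the code sets separate from the decision logic means a receiving site can extend the dictionary with locally used codes without touching the rules themselves.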

Results

Initially, the algorithm identified 398 (80.9%) patients correctly as cases or controls. After fine-tuning and correcting errors in our gold standard dataset, we calculated a sensitivity of 0.94 and a specificity of 0.76. The implementation of the algorithm presented challenges due to the variability in coding practices between medical centers. To enhance performance, we refined the algorithm's data dictionary by incorporating additional codes. The process highlighted the need for meticulous code verification and standardization in multi-center studies.
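
The validation metrics reported above follow the standard definitions, which can be sketched as below. The function names and the five-patient toy labels are hypothetical; only the figure 398 of 492 comes from the text, and the confusion-matrix counts behind the reported sensitivity and specificity are not given in this abstract, so they are not reconstructed here.

```python
from typing import Dict, List, Sequence


def confusion_counts(gold: Sequence[str], predicted: Sequence[str]) -> Dict[str, int]:
    """Count true/false positives and negatives, treating 'case' as positive."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for g, p in zip(gold, predicted):
        if g == "case":
            counts["tp" if p == "case" else "fn"] += 1
        else:
            counts["fp" if p == "case" else "tn"] += 1
    return counts


def sensitivity(c: Dict[str, int]) -> float:
    """Share of gold-standard cases that the algorithm labels as cases."""
    return c["tp"] / (c["tp"] + c["fn"])


def specificity(c: Dict[str, int]) -> float:
    """Share of gold-standard controls that the algorithm labels as controls."""
    return c["tn"] / (c["tn"] + c["fp"])


def discrepancies(gold: Sequence[str], predicted: Sequence[str]) -> List[int]:
    """Indices where algorithm and gold standard disagree, for manual review."""
    return [i for i, (g, p) in enumerate(zip(gold, predicted)) if g != p]


# Initial agreement reported in the text: 398 of 492 patients.
initial_agreement = 398 / 492

# Toy labels (hypothetical, not study data).
toy_gold = ["case", "case", "control", "control", "case"]
toy_pred = ["case", "control", "control", "case", "case"]
toy_counts = confusion_counts(toy_gold, toy_pred)

if __name__ == "__main__":
    print(f"initial agreement: {initial_agreement:.1%}")
    print(toy_counts, discrepancies(toy_gold, toy_pred))
```

Listing the disagreeing records, rather than only summarizing them, is what allows a manual review to separate algorithm misclassifications from errors in the gold standard itself.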

Conclusion

Sharing case-control algorithms boosts EHR research. Our rule-based algorithm improved multi-site patient identification and revealed 12 data entry errors, helping validate our results.

Protection of Human and Animal Subjects

Our study was approved by the Institutional Review Board (STU-2020-0689). Only patients who provided informed consent at UTSW were included in the data query. To ensure confidentiality, all patient information was de-identified and securely managed.


* These authors contributed equally.




Publication History

Received: 06 October 2024

Accepted: 15 January 2025

Accepted Manuscript online: 24 January 2025

Article published online: 26 March 2025

© 2025. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution License, permitting unrestricted use, distribution, and reproduction so long as the original work is properly cited. (https://creativecommons.org/licenses/by/4.0/)

Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany

 