Searching the PDF Haystack: Automated Knowledge Discovery in Scanned EHR Documents

Alexander L. Kostrinsky-Thomas; Fuki M. Hisama; Thomas H. Payne

doi:10.1055/s-0041-1726103

Appl Clin Inform 2021; 12(02): 245-250
DOI: 10.1055/s-0041-1726103

Research Article

Authors

Alexander L. Kostrinsky-Thomas
¹College of Osteopathic Medicine, Pacific Northwest University of Health Sciences, 200 University Pkwy, Yakima, Washington, United States

Fuki M. Hisama
²Division of Medical Genetics, Department of Medicine, University of Washington School of Medicine, Seattle, Washington, United States

Thomas H. Payne
³Department of Medicine, University of Washington School of Medicine, Seattle, Washington, United States

Abstract

Background Clinicians express concern that they may be unaware of important information in the voluminous scanned and other outside documents stored in electronic health records (EHRs). An example is “unrecognized EHR risk factor information,” defined as risk factors for heritable cancer that exist within a patient's EHR but are not known to the current treating providers. In a related study using manual EHR chart review, we found that half of the women whose EHR contained risk factor information met criteria for further genetic risk evaluation for heritable forms of breast and ovarian cancer, yet they had not been referred for genetic counseling.

Objectives The purpose of this study was to compare automated methods (optical character recognition with natural language processing) with human review in their ability to identify risk factors for heritable breast and ovarian cancer within scanned EHR documents.

Methods We evaluated accuracy by comparing our criterion standard (physician chart review) with an automated method that combined Amazon's Textract service (Amazon.com, Seattle, Washington, United States), the Clinical Language Annotation, Modeling, and Processing toolkit (CLAMP) (Center for Computational Biomedicine at The University of Texas Health Science Center at Houston, Houston, Texas, United States), and a custom-written Java application.
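As a rough illustration of the final step of such a pipeline, the sketch below scans text that has already been extracted from a scanned document and flags risk factor mentions that are not preceded by a negation cue. It is written in Python for brevity and is not the study's Java application, nor does it call Textract or CLAMP. The term list, the negation cues, and the function name `find_risk_mentions` are all hypothetical and chosen only for this example.

```python
import re

# Hypothetical term list for illustration only; not the study's vocabulary.
RISK_TERMS = ["breast cancer", "ovarian cancer", "brca1", "brca2"]

# Hypothetical negation cues; a real pipeline would use a fuller
# negation-detection algorithm than this prefix check.
NEGATION_CUES = ["no", "denies", "negative for", "without"]


def find_risk_mentions(ocr_text):
    """Return (sentence, term) pairs for non-negated risk factor mentions."""
    hits = []
    # Naive sentence split on periods, semicolons, and newlines.
    for sentence in re.split(r"[.;\n]+", ocr_text):
        lowered = sentence.lower()
        for term in RISK_TERMS:
            pos = lowered.find(term)
            if pos == -1:
                continue
            # Treat the mention as negated if any cue appears as a whole
            # word or phrase earlier in the same sentence.
            prefix = lowered[:pos]
            negated = any(
                re.search(r"\b" + re.escape(cue) + r"\b", prefix)
                for cue in NEGATION_CUES
            )
            if not negated:
                hits.append((sentence.strip(), term))
    return hits


if __name__ == "__main__":
    sample = (
        "Mother had ovarian cancer at age 45. "
        "Patient denies family history of breast cancer."
    )
    for sentence, term in find_risk_mentions(sample):
        print(term, "->", sentence)
```

A prefix-based negation check like this is deliberately simple. It would miss cues that follow the term and would not handle tables or handwriting, which the Conclusion names as areas for improvement.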

Results Automated methods identified most of the cancer risk factor information that would otherwise require manual clinician review and is therefore at risk of being missed.

Conclusion The use of automated methods for identification of heritable risk factors within EHRs may provide an accurate yet rapid review of patients' past medical histories. These methods could be further strengthened via improved analysis of handwritten notes, tables, and colloquial phrases.

Keywords

electronic health records - portable document format - optical character recognition - natural language processing - machine learning - evaluation

Protection of Human and Animal Subjects

This project was approved by the University of Washington Institutional Review Board.

Publication History

Received: 06 December 2020

Accepted: 01 February 2021

Article published online: 24 March 2021

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany
