Methods Inf Med 2018; 57(01/02): 63-73
DOI: 10.3414/ME17-01-0039
Original Articles
Schattauer GmbH

Identifying Associations between Somatic Mutations and Clinicopathologic Findings in Lung Cancer Pathology Reports

Nishant Kumar (1), Laura J. Tafe (2), John H. Higgins (1), Jason D. Peterson (2), Francise Blumental de Abreu (2), Sophie J. Deharvengt (2), Gregory J. Tsongalis (2), Christopher I. Amos (1), Saeed Hassanpour (3)

1  Biomedical Data Science Department, Dartmouth College, Hanover, NH, USA
2  Pathology and Laboratory Medicine Department, Dartmouth-Hitchcock Medical Center, Lebanon, NH, USA
3  Biomedical Data Science, Epidemiology, and Computer Science Departments, Dartmouth College, Hanover, NH, USA
Funding: This research was supported in part by a National Institutes of Health grant, P20GM103534.

Publication History

Received: 20 April 2017
Accepted: 27 October 2017
Published online: 05 April 2018

Summary

Objective: We aim to build an informatics methodology capable of identifying statistically significant associations between the clinical findings of non-small cell lung cancer (NSCLC) recorded in patient pathology reports and the various clinically actionable genetic mutations identified from next-generation sequencing (NGS) of patient tumor samples.

Methods: We built an information extraction and analysis pipeline to identify associations between clinical findings in patient pathology reports and the corresponding genetic mutations. Our pipeline leverages natural language processing (NLP) techniques, large biomedical terminologies, semantic similarity measures, and clustering methods to extract clinical concepts from the free text of patient pathology reports and group them into salient findings.
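
The grouping step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify which semantic similarity measure or clustering method was used, so this example assumes a simple token-overlap (Jaccard) score as a stand-in for a terminology-based similarity measure, and a threshold-based single-linkage grouping as a stand-in for the clustering method. The function names, the threshold, and the example concept strings are all hypothetical.

```python
def jaccard(a, b):
    """Token-overlap similarity; an assumed stand-in for a semantic similarity measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def group_concepts(concepts, similarity=jaccard, threshold=0.5):
    """Group concepts whose pairwise similarity reaches the threshold (single linkage)."""
    parent = list(range(len(concepts)))

    def find(i):
        # Follow parent pointers to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the groups of every sufficiently similar pair of concepts.
    for i in range(len(concepts)):
        for j in range(i + 1, len(concepts)):
            if similarity(concepts[i], concepts[j]) >= threshold:
                parent[find(j)] = find(i)

    # Collect concepts by representative, preserving input order.
    groups = {}
    for i, concept in enumerate(concepts):
        groups.setdefault(find(i), []).append(concept)
    return list(groups.values())


if __name__ == "__main__":
    # Hypothetical extracted concepts, for illustration only.
    extracted = ["pleural invasion", "visceral pleural invasion", "lymph node metastasis"]
    for group in group_concepts(extracted):
        print(group)
```

In a real pipeline the `similarity` argument would be replaced by a measure computed over a biomedical terminology, which is why it is passed in as a parameter here.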

Results: We developed our methodology and applied it to the lobectomy surgical pathology reports of 142 NSCLC patients who underwent NGS testing and had mutations in 4 oncogenes with clinical ramifications for NSCLC treatment (EGFR, KRAS, BRAF, and PIK3CA). Our approach identified 732 distinct positive clinical concepts in these reports and highlighted multiple findings with statistically significant associations (P-value ≤ 0.05) with mutations in specific genes. Our assessment showed that these associations are consistent with the published literature.
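
To make the significance threshold concrete, the following Python sketch shows one way a finding-versus-mutation association could be tested. The abstract does not state which statistical test produced the P-values, so the use of a two-sided Fisher's exact test on a 2x2 contingency table is an assumption made for illustration. The function name and the counts in the example are hypothetical and are not data from the study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout (illustrative):
        a = finding present, gene mutated     b = finding present, gene not mutated
        c = finding absent,  gene mutated     d = finding absent,  gene not mutated
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables that are no more likely than the observed one.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    p = fisher_exact_two_sided(3, 0, 0, 3)
    print("significant at 0.05" if p <= 0.05 else "not significant at 0.05", p)
```

Because many findings are tested against several genes, an analysis of this kind would typically also need to consider multiple-comparison effects when interpreting individual P-values.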

Conclusions: This study provides an automatic pipeline for finding statistically significant associations between clinical findings in the unstructured text of patient pathology reports and genetic mutations. The approach is generalizable to other types of pathology and clinical reports across various disorders and can provide a first step toward understanding the role of genetic mutations in the development and treatment of different types of cancer.