Identifying Associations between Somatic Mutations and Clinicopathologic Findings in Lung Cancer Pathology Reports
Funding: This research was supported in part by a National Institutes of Health grant, P20GM103534.
received: 20 April 2017
accepted: 27 October 2017
05 April 2018 (online)
Objective: We aim to build an informatics methodology capable of identifying statistically significant associations between the clinical findings of non-small cell lung cancer (NSCLC) recorded in patient pathology reports and the various clinically actionable genetic mutations identified from next-generation sequencing (NGS) of patient tumor samples.
Methods: We built an information extraction and analysis pipeline to identify associations between clinical findings in patient pathology reports and the corresponding genetic mutations. The pipeline uses natural language processing (NLP) techniques, large biomedical terminologies, semantic similarity measures, and clustering methods to extract clinical concepts from the free text of pathology reports and group them into salient findings.
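The grouping step can be illustrated with a minimal sketch. The abstract does not specify which similarity measure or clustering algorithm is used, so the example below assumes a Wu-Palmer-style path similarity over a toy is-a hierarchy and a simple threshold-based single-linkage grouping. The hierarchy, concept names, and threshold are hypothetical and stand in for a real biomedical terminology.

```python
# Toy is-a hierarchy (child -> parent). Concept names are hypothetical
# placeholders, not entries from any real terminology.
PARENT = {
    "finding": None,
    "neoplasm": "finding",
    "adenocarcinoma": "neoplasm",
    "squamous_cell_carcinoma": "neoplasm",
    "lung_disease": "finding",
    "emphysema": "lung_disease",
    "fibrosis": "lung_disease",
}


def path_to_root(concept):
    """Return the concept followed by its ancestors, ending at the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(concept))


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # First ancestor of b that is also an ancestor of a is the deepest
    # common subsumer, because path_to_root walks upward from b.
    lcs = next(c for c in path_to_root(b) if c in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def group(concepts, threshold):
    """Single-linkage grouping: merge groups linked by similarity >= threshold."""
    groups = []
    for concept in concepts:
        linked = [g for g in groups
                  if any(wu_palmer(concept, member) >= threshold for member in g)]
        merged = [concept]
        for g in linked:
            merged.extend(g)
            groups.remove(g)
        groups.append(merged)
    return groups


if __name__ == "__main__":
    extracted = ["adenocarcinoma", "emphysema",
                 "squamous_cell_carcinoma", "fibrosis"]
    for g in group(extracted, threshold=0.6):
        print(sorted(g))
```

With this toy hierarchy, the two carcinoma concepts end up in one group and the two non-neoplastic lung findings in another. A real pipeline would compute similarity over a full terminology and would likely use an established clustering method rather than this greedy pass.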
Results: In this study, we developed and applied our methodology to lobectomy surgical pathology reports of 142 NSCLC patients who underwent NGS testing and who had mutations in 4 oncogenes with clinical ramifications for NSCLC treatment (EGFR, KRAS, BRAF, and PIK3CA). Our approach identified 732 distinct positive clinical concepts in these reports and highlighted multiple findings with strong associations (P-value ≤ 0.05) to mutations in specific genes. Our assessment showed that these associations are consistent with the published literature.
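As an illustration of how a finding-to-gene association could be tested at the P-value ≤ 0.05 level, the sketch below runs a two-sided Fisher's exact test on a 2x2 contingency table. The abstract does not name the statistical test, so the choice of Fisher's exact test is an assumption here, and the counts are made up for demonstration; they are not results from the study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: finding present / absent. Columns: gene mutated / not mutated.
    Returns the probability of all tables with the same margins that are
    as likely as, or less likely than, the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(p for p in (prob(x) for x in range(lowest, highest + 1))
                  if p <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)


if __name__ == "__main__":
    # Hypothetical counts: the finding appears in 8 of 9 mutated cases
    # and in 2 of 7 non-mutated cases.
    p = fisher_exact_two_sided(8, 2, 1, 5)
    print(f"p = {p:.4f}", "significant" if p <= 0.05 else "not significant")
```

Because many findings are tested against several genes, a real analysis would also need to consider a multiple-testing correction before interpreting individual P-values.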
Conclusions: This study provides an automatic pipeline to find statistically significant associations between clinical findings in unstructured text of patient pathology reports and genetic mutations. This approach is generalizable to other types of pathology and clinical reports in various disorders and can provide the first steps toward understanding the role of genetic mutations in the development and treatment of different types of cancer.
Keywords: Information extraction - natural language processing - non-small cell lung cancer - somatic mutation - pathology report