Appl Clin Inform 2015; 06(02): 345-363
DOI: 10.4338/ACI-2014-11-RA-0106
Research Article
Schattauer GmbH

Interactive Cohort Identification of Sleep Disorder Patients Using Natural Language Processing and i2b2

W. Chen
1   Research Information Solutions and Innovations
R. Kowatch
2   Center for Innovation in Pediatric Practice
S. Lin
1   Research Information Solutions and Innovations
M. Splaingard
3   Sleep Disorder Center, Nationwide Children’s Hospital, Columbus, OH
Y. Huang
1   Research Information Solutions and Innovations

Publication History

received: 25 November 2014

accepted: 23 February 2015

Publication Date:
19 December 2017 (online)


Nationwide Children’s Hospital established an i2b2 (Informatics for Integrating Biology & the Bedside) application for sleep disorder cohort identification. Discrete data were extracted from semi-structured sleep study reports. The system proved more efficient than traditional manual chart review, and it enabled search capabilities that were not previously possible.

Objective: We report on the development and implementation of the sleep disorder i2b2 cohort identification system using natural language processing of semi-structured documents.

Methods: We developed a natural language processing approach to automatically parse concepts and their values from semi-structured sleep study documents. Two parsers were developed: a regular expression parser for extracting numeric concepts and an NLP-based tree parser for extracting textual concepts. Concepts were further organized into i2b2 ontologies based on document structure and domain knowledge.
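As a rough illustration of the regular expression parsing idea for numeric concepts, the following minimal Python sketch pulls "label: number unit" lines out of a semi-structured report. The pattern, the function name `extract_numeric_concepts`, and the sample field names are hypothetical and chosen only for illustration; they are not taken from the authors' system.

```python
import re

# Hypothetical line layout: "<concept label>: <number> [unit]".
# This is an assumed format, not the actual sleep study report format.
NUMERIC_LINE = re.compile(
    r"^\s*(?P<concept>[A-Za-z][A-Za-z0-9 /()%-]*?)\s*:\s*"
    r"(?P<value>-?\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z%/]+)?\s*$"
)


def extract_numeric_concepts(report: str) -> dict:
    """Map each numeric concept label to a (value, unit) pair."""
    facts = {}
    for line in report.splitlines():
        match = NUMERIC_LINE.match(line)
        if match:
            label = match.group("concept").strip()
            facts[label] = (float(match.group("value")), match.group("unit"))
    return facts


# Made-up sample text; the third line is textual and is skipped by this parser.
sample_report = """\
Total sleep time: 412.5 min
Sleep efficiency: 88 %
Impression: mild obstructive sleep apnea
"""

facts = extract_numeric_concepts(sample_report)
for label, (value, unit) in facts.items():
    print(label, value, unit)
```

Lines whose value is free text, such as the impression line in the sample, do not match the pattern; in the approach described above, such textual concepts are left to the separate tree parser.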

Results: A total of 26,550 concepts were extracted, 99% of which were textual concepts. In all, 1.01 million facts were extracted from sleep study documents, including demographic information, sleep study lab results, medications, procedures, and diagnoses, among others. The average accuracy of terminology parsing was over 83% when compared against parsing by experts. The system is capable of capturing both standard and non-standard terminologies. The time for cohort identification was reduced from a few weeks to a few seconds.

Conclusion: Natural language processing was shown to be powerful for quickly converting large amounts of semi-structured or unstructured clinical data into discrete concepts, which, in combination with intuitive domain-specific ontologies, allow fast and effective interactive cohort identification through the i2b2 platform for research and clinical use.

Citation: Chen W, Kowatch R, Lin S, Splaingard M, Huang Y. Interactive cohort identification of sleep disorder patients using natural language processing and i2b2. Appl Clin Inf 2015; 6: 345–363
