Appl Clin Inform 2016; 07(04): 1135-1153
DOI: 10.4338/ACI-2016-03-SOA-0035
State of the Art/Best Practice Paper
Schattauer GmbH

Preprocessing structured clinical data for predictive modeling and decision support

A roadmap to tackle the challenges
José Carlos Ferrão (1, 2), Mónica Duarte Oliveira (2), Filipe Janela (1), Henrique M. G. Martins (3)

1   Siemens Healthcare, Rua Irmãos Siemens 1, 2720–093 Amadora, Portugal
2   CEG-IST, Centre for Management Studies of Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049–001 Lisbon, Portugal
3   Centre for Research and Creativity in Informatics, Hospital Prof. Doutor Fernando Fonseca, IC-19 Venteira, 2720–276 Amadora, Portugal
Funding The authors acknowledge the support from Fundação para a Ciência e a Tecnologia (grant SFRH/BDE/51605/2011), Siemens Healthcare and the Centre for Management Studies of Instituto Superior Técnico (CEG-IST, University of Lisbon).

Publication History

received: 06 March 2016

accepted: 01 October 2016

Publication Date: 18 December 2017 (online)

Summary

Background Electronic health record (EHR) systems have high potential to improve healthcare delivery and management. Although structured EHR data are stored in machine-readable formats, their use for decision support still poses technical challenges for researchers, because the data must be preprocessed and converted into a matrix format. During our research, we observed that the clinical informatics literature does not provide guidance for researchers on how to build this matrix while avoiding potential pitfalls.
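As an illustration of the matrix format mentioned above, the following is a minimal sketch, not the authors' implementation. It assumes that extracted entries arrive as hypothetical `(patient_id, feature, value)` tuples and pivots them into one row per patient and one column per feature. The function name `build_matrix`, the sample data and the choice of `min` as the default aggregate are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical long-format entries: (patient_id, feature, value).
entries = [
    ("p1", "age", 71),
    ("p1", "sodium", 138.0),
    ("p2", "age", 64),
    ("p2", "sodium", 141.0),
    ("p2", "sodium", 135.0),  # repeated measurement for the same patient
    ("p3", "age", 58),        # no sodium value recorded for p3
]


def build_matrix(entries, aggregate=min):
    """Pivot long-format entries into a patient-by-feature matrix.

    Multiple values for the same (patient, feature) pair are reduced with
    `aggregate`; absent values stay as None instead of being silently filled.
    """
    values = defaultdict(list)
    for patient_id, feature, value in entries:
        values[(patient_id, feature)].append(value)

    patients = sorted({patient for patient, _ in values})
    features = sorted({feature for _, feature in values})
    matrix = [
        [aggregate(values[(p, f)]) if (p, f) in values else None for f in features]
        for p in patients
    ]
    return patients, features, matrix


patients, features, matrix = build_matrix(entries)
```

The aggregate used for repeated values (minimum, maximum, first, last, mean) is a modeling decision that would normally depend on clinical knowledge about the feature; `min` is used here only to keep the sketch short.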

Objectives This article aims to provide researchers with a roadmap of the main technical challenges of preprocessing structured EHR data and of possible strategies to overcome them.

Methods Following the standard data processing stages (extracting database entries, defining features, processing data, assessing feature values and integrating data elements), which together form the EDPAI framework, we identified the main challenges faced by researchers and reflected on how to address them, based on lessons learned from our research experience and on best practices from related literature. We highlight the main potential sources of error, present strategies to address these challenges and discuss the implications of these strategies.

Results Following the EDPAI framework, researchers face five key challenges: (1) gathering and integrating data, (2) identifying and handling different feature types, (3) combining features to handle redundancy and granularity, (4) addressing data missingness, and (5) handling multiple feature values. Strategies to address these challenges include: crosschecking identifiers for robust data retrieval and integration; applying clinical knowledge in identifying feature types, in addressing redundancy and granularity, and in accommodating multiple feature values; and adequately investigating patterns of missing data.
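Two of the strategies listed above, crosschecking identifiers and investigating missingness, can be sketched as follows. This is an illustrative example under stated assumptions rather than a procedure taken from the article: the table layouts, the field names (`patient_id`, `encounter_id`) and the helper names `crosscheck` and `missing_fraction` are hypothetical.

```python
# Hypothetical extracts from two source tables that share identifiers.
admissions = [
    {"patient_id": "p1", "encounter_id": "e10"},
    {"patient_id": "p2", "encounter_id": "e11"},
    {"patient_id": "p3", "encounter_id": "e12"},
]
lab_results = [
    {"patient_id": "p1", "encounter_id": "e10", "sodium": 138.0},
    {"patient_id": "p2", "encounter_id": "e99", "sodium": 141.0},  # unknown encounter
]


def crosscheck(admissions, lab_results):
    """Split lab results into those whose (patient, encounter) pair exists
    in the admissions table and those that cannot be linked."""
    known = {(a["patient_id"], a["encounter_id"]) for a in admissions}
    matched, orphaned = [], []
    for result in lab_results:
        key = (result["patient_id"], result["encounter_id"])
        (matched if key in known else orphaned).append(result)
    return matched, orphaned


def missing_fraction(rows, features):
    """Fraction of rows in which each feature is absent or None."""
    return {
        f: sum(row.get(f) is None for row in rows) / len(rows)
        for f in features
    }


matched, orphaned = crosscheck(admissions, lab_results)

# Hypothetical integrated rows, one per patient.
rows = [
    {"age": 71, "sodium": 138.0},
    {"age": 64, "sodium": None},
    {"age": 58},
]
fractions = missing_fraction(rows, ["age", "sodium"])
```

Checking both identifiers together, instead of the patient identifier alone, surfaces records that would otherwise be attached to the wrong encounter. The per-feature missing fraction is only a first look; deciding whether values are missing at random requires further investigation.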

Conclusions This article contributes to the literature by providing a roadmap to inform the preprocessing of structured EHR data. It can alert researchers to potential pitfalls and to the implications of methodological decisions in handling structured data, so as to avoid biases and help realize the benefits of the secondary use of EHR data.

Citation: Ferrão JC, Oliveira MD, Janela F, Martins HMG. Preprocessing structured clinical data for predictive modeling and decision support – a roadmap to tackle the challenges. Appl Clin Inform 2016; 07 (04) 1135-1153 doi:10.4338/ACI-2016-03-SOA-0035.

 