Combining Contrast Mining with Logistic Regression To Predict Healthcare Utilization in a Managed Care Population
Funding: LS and JCP: This publication was made possible by Grant Number 1C1CMS331001–01–00 from the Department of Health and Human Services, Centers for Medicare & Medicaid Services. The contents of this publication are solely the responsibility of the authors and do not necessarily represent the official views of the U.S. Department of Health and Human Services or any of its agencies. The funding agreement ensured the authors’ independence in designing the study, interpreting the data, writing, and publishing the report. MAP is supported by the US Department of Education Graduate Assistance in Areas of National Need (GAANN) Fellowship under grant number P200A100053, and YZ and CRS are supported by the Shumaker Endowment for biomedical informatics. The high performance computing infrastructure used in this research is currently supported by the National Science Foundation under grant number CNS-1429294.
received: 26 May 2016
accepted: 21 February 2017
21 December 2017 (online)
Background: Because 5% of patients incur 50% of healthcare expenses, population health managers need to be able to focus preventive and longitudinal care on those patients who are at highest risk of increased utilization. Predictive analytics can be used to identify these patients and to better manage their care. Data mining permits the development of models that surpass the size restrictions of traditional statistical methods and take advantage of the rich data available in the electronic health record (EHR), without limiting predictions to specific chronic conditions.
Objective: The objective was to demonstrate the usefulness of unrestricted EHR data for predictive analytics in managed healthcare.
Methods: In a population of 9,568 Medicare and Medicaid beneficiaries, patients in the highest 5% of charges were compared to equal numbers of patients with the lowest charges. Contrast mining was used to discover the combinations of clinical attributes frequently associated with high utilization and infrequently associated with low utilization. The attributes found in these combinations were then tested by multiple logistic regression, and the discrimination of the model was evaluated by the c-statistic.
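The contrast mining step described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: it enumerates attribute combinations by brute force on toy data, whereas a real EHR-scale run would need a pruned or distributed miner. The function names, the toy attributes, the `max_low` cutoff for "infrequently associated with low utilization", and the `max_size` limit are all assumptions made for illustration; only the 20% support threshold for the high-utilization group comes from the abstract.

```python
from itertools import combinations


def support(itemset, records):
    """Fraction of records (each a set of attributes) containing every item in itemset."""
    if not records:
        return 0.0
    return sum(itemset <= record for record in records) / len(records)


def contrast_itemsets(high, low, min_high=0.20, max_low=0.05, max_size=2):
    """Find attribute combinations frequent in `high` and rare in `low`.

    min_high: support in the high-utilization group must exceed this value.
    max_low:  support in the low-utilization group must not exceed this value
              (hypothetical cutoff, not stated in the abstract).
    Returns {frozenset(attributes): (support_high, support_low)}.
    """
    items = sorted(set().union(*high))
    found = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s_high = support(itemset, high)
            if s_high <= min_high:
                continue  # not frequent enough among high utilizers
            s_low = support(itemset, low)
            if s_low <= max_low:
                found[itemset] = (s_high, s_low)
    return found


# Toy data: each patient is a set of (hypothetical) EHR attributes.
high_group = [
    {"chf", "ckd", "insulin"},
    {"chf", "ckd"},
    {"chf", "insulin"},
    {"ckd", "copd"},
]
low_group = [{"statin"}, {"chf"}, {"statin", "copd"}, set()]

patterns = contrast_itemsets(high_group, low_group)

# Attributes appearing in any contrast pattern become candidate predictors
# for the second step (multiple logistic regression).
candidates = sorted(set().union(*patterns))
```

In this toy example "chf" alone is not a contrast pattern, because it also occurs in the low group, yet it still enters the candidate list through the combination {"chf", "ckd"}. That mirrors the two-step idea: mining proposes attributes, and regression then tests each one.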
Results: Of 19,014 potential EHR patient attributes, 67 were found in combinations frequently associated with high utilization, but not with low utilization (support > 20%). Eleven of these attributes were significantly associated with high utilization (p < 0.05). A prediction model composed of these eleven attributes had a discrimination (c-statistic) of 84%.
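For readers unfamiliar with the c-statistic used to report discrimination, the following sketch computes it directly from its definition: the probability that a randomly chosen high-utilization patient receives a higher predicted risk than a randomly chosen low-utilization patient, with ties counted as one half. The function name and the example scores are hypothetical and are not taken from the study.

```python
def c_statistic(scores_pos, scores_neg):
    """Concordance probability (area under the ROC curve) from predicted risks.

    scores_pos: predicted risks for patients who were high utilizers.
    scores_neg: predicted risks for patients who were low utilizers.
    """
    n_pairs = len(scores_pos) * len(scores_neg)
    if n_pairs == 0:
        raise ValueError("both groups must contain at least one score")
    concordant = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                concordant += 1.0   # model ranks the pair correctly
            elif p == n:
                concordant += 0.5   # tie counts as half
    return concordant / n_pairs


# Hypothetical predicted risks from a fitted logistic regression model.
auc = c_statistic([0.9, 0.8, 0.4], [0.7, 0.3, 0.4])
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so the reported 84% means the model ranked a high utilizer above a low utilizer in roughly 84 of 100 such pairs.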
Conclusions: EHR mining reduced an unusably high number of patient attributes to a manageable set of potential healthcare utilization predictors, without conjecturing on which attributes would be useful. Treating these results as hypotheses to be tested by conventional methods yielded a highly accurate predictive model. This novel, two-step methodology can help population health managers focus preventive and longitudinal care on those patients who are at highest risk for increased utilization.
Citation: Sheets L, Petroski GF, Zhuang Y, Phinney MA, Ge B, Parker JC, Shyu C-R. Combining contrast mining with logistic regression to predict healthcare utilization in a managed care population. Appl Clin Inform 2017; 8: 430–446. https://doi.org/10.4338/ACI-2016-05-RA-0078
Keywords: Data mining - prediction models - clinical decision support - data reuse - practice management
Clinical Relevance Statement
Accurate prediction of the 5% of patients who incur 50% of healthcare expenses is needed to permit population health managers to focus preventive and longitudinal care effectively. Combining contrast mining, which permits the use of the rich data available in the EHR, with testing by traditional statistical methods created flexible and highly accurate healthcare predictive analytics that can support population health management.
Protection of Human and Animal Subjects
This project was funded by the Centers for Medicare & Medicaid Services (CMS) to expand the scope of services to a population of CMS beneficiaries. Because the explicit purpose of data use was to improve patient care, the Health Sciences Institutional Review Board deemed the project to be a quality improvement initiative that did not require a formal patient consent process. The IRB number is 2001677-QI.