Methods Inf Med 2016; 55(06): 557-563
DOI: 10.3414/ME16-01-0055
Original Articles
Schattauer GmbH

Ensemble Pruning for Glaucoma Detection in an Unbalanced Data Set[*]

Werner Adler (1), Olaf Gefeller (1), Asma Gul (2), Folkert K. Horn (3), Zardad Khan (4), Berthold Lausen (5)

1   Institute of Medical Informatics, Biometry, and Epidemiology, Friedrich-Alexander University Erlangen-Nuremberg, Erlangen, Germany
2   Department of Statistics, Shaheed Benazir Bhutto Women University, Peshawar, Pakistan
3   Department of Ophthalmology, Friedrich-Alexander University Erlangen-Nuremberg, Erlangen, Germany
4   Department of Statistics, Abdul Wali Khan University, Mardan, Pakistan
5   Department of Mathematical Sciences, University of Essex, Colchester, UK
Funding: The work on this article was supported by the German Research Foundation (DFG), grant SCHM 2966/1–2. We acknowledge support from grant number ES/L011859/1, from The Business and Local Government Data Research Centre, funded by the Economic and Social Research Council to provide researchers and analysts with secure data services.
Publication History

received: 29 April 2016

accepted in revised form: 28 August 2016

Publication Date: 08 January 2018 (online)

Summary

Background: Random forests are successful classifier ensemble methods that typically consist of 100 to 1000 classification trees. Ensemble pruning techniques reduce the computational cost of random forests, especially the memory demand, by reducing the number of trees without relevant loss of performance, or even with increased performance of the sub-ensemble. Applying these techniques to the early detection of glaucoma, a severe eye disease with low prevalence, based on topographical measurements of the eye background, poses specific challenges.
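To make the idea of ensemble pruning concrete, the following is a minimal sketch of one generic approach: greedy forward selection of trees. It is not taken from the article. The function names (`brier`, `ensemble_mean`, `greedy_prune`), the use of the Brier score (mean squared difference between predicted probability and 0/1 label) as selection criterion, the stopping rule, and the toy predictions are all assumptions chosen for illustration.

```python
def brier(y, p):
    """Mean squared difference between predicted probabilities and 0/1 labels."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def ensemble_mean(tree_preds, members):
    """Average the per-observation predictions of the selected trees."""
    n_obs = len(tree_preds[0])
    return [sum(tree_preds[t][i] for t in members) / len(members)
            for i in range(n_obs)]


def greedy_prune(tree_preds, y, max_size):
    """Hypothetical greedy forward selection (illustrative assumption).

    Starting from an empty sub-ensemble, repeatedly add the tree that
    gives the lowest Brier score of the averaged prediction. Stop when
    no tree improves the score or max_size trees are selected.
    """
    selected = []
    remaining = list(range(len(tree_preds)))
    best_score = float("inf")
    while remaining and len(selected) < max_size:
        cand, cand_score = None, None
        for t in remaining:
            score = brier(y, ensemble_mean(tree_preds, selected + [t]))
            if cand_score is None or score < cand_score:
                cand, cand_score = t, score
        if cand_score >= best_score:
            break  # no remaining tree improves the sub-ensemble
        selected.append(cand)
        remaining.remove(cand)
        best_score = cand_score
    return selected, best_score


# Toy data: 4 observations, 4 "trees" with hard 0/1 votes (made up).
y = [0, 0, 1, 1]
tree_preds = [
    [0, 0, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 0],  # a poor tree that the greedy search should skip
    [0, 0, 0, 1],
]

selected, pruned_score = greedy_prune(tree_preds, y, max_size=4)
full_score = brier(y, ensemble_mean(tree_preds, range(len(tree_preds))))
print("selected trees:", selected)
print("pruned Brier:", round(pruned_score, 4), "full Brier:", round(full_score, 4))
```

In this toy case the pruned sub-ensemble omits the poor tree and scores better than the full ensemble. In practice the selection would be evaluated on data not used to grow the trees (for example out-of-bag or validation observations) to avoid an optimistic score.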

Objectives: We examine the performance of ensemble pruning strategies for glaucoma detection in an unbalanced data situation.

Methods: The data set consists of 102 topographical features of the eye background of 254 healthy controls and 55 glaucoma patients. We compare the area under the receiver operating characteristic curve (AUC) and the Brier score on the total data set, in the majority class, and in the minority class for pruned random forest ensembles. The pruning strategies are based on the prediction accuracy of greedily grown sub-ensembles, the uncertainty weighted accuracy, and the similarity between single trees. To validate the findings and to examine the influence of the prevalence of glaucoma in the data set, we additionally perform a simulation study with lower prevalences of glaucoma.
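The evaluation measures named above can be illustrated with a short sketch. It assumes binary labels (0 = healthy control, 1 = glaucoma) and predicted glaucoma probabilities. The helper names (`brier_score`, `brier_in_class`, `auc`) and the toy values are illustrative assumptions, not the authors' code or data.

```python
def brier_score(y_true, p_pred):
    """Brier score on all observations: mean of (p - y)^2."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def brier_in_class(y_true, p_pred, label):
    """Brier score restricted to observations whose true class is `label`."""
    errors = [(p - y) ** 2 for y, p in zip(y_true, p_pred) if y == label]
    return sum(errors) / len(errors)


def auc(y_true, p_pred):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative prediction count as one half.
    """
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Toy unbalanced example (made-up values): 4 controls, 2 patients.
y = [0, 0, 0, 0, 1, 1]
p = [0.1, 0.2, 0.6, 0.3, 0.7, 0.4]

print("Brier, total:         ", round(brier_score(y, p), 4))
print("Brier, majority class:", round(brier_in_class(y, p, 0), 4))
print("Brier, minority class:", round(brier_in_class(y, p, 1), 4))
print("AUC:                  ", auc(y, p))
```

Reporting the Brier score separately per class matters in unbalanced data: the total score is dominated by the majority class, so a classifier can look well calibrated overall while performing poorly on the minority class.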

Results: In glaucoma classification, all three pruning strategies lead to improved AUC and smaller Brier scores on the total data set with sub-ensembles as small as 30 to 80 trees, compared to the classification results obtained with the full ensemble of 1000 trees. The simulation study shows that the prevalence of glaucoma is a critical factor: lower prevalence decreases the performance of our pruning strategies.

Conclusions: The memory demand for glaucoma classification based on random forests in an unbalanced data situation could effectively be reduced by the application of pruning strategies without loss of performance in a population with increased risk of glaucoma.

* Supplementary material is published on our website: http://dx.doi.org/10.3414/ME16-01-0055