Overrating Classifier Performance in ROC Analysis in the Absence of a Test Set: Evidence from Simulation and Italian CARATkids Validation

Giovanna Cilluffo; Salvatore Fasola; Giuliana Ferrante; Laura Montalbano; Ilaria Baiardini; Luciana Indinnimeo; Giovanni Viegi; Joao A. Fonseca; Stefania La Grutta

doi:10.1055/s-0039-1693732


CC BY-NC-ND 4.0 · Methods Inf Med 2019; 58(S 02): e27-e42

Original Article

Georg Thieme Verlag KG Stuttgart · New York


Giovanna Cilluffo‡§¹,², Salvatore Fasola§¹,², Giuliana Ferrante³, Laura Montalbano¹, Ilaria Baiardini⁴, Luciana Indinnimeo⁵, Giovanni Viegi¹,⁶, Joao A. Fonseca⁷, Stefania La Grutta¹

¹Institute for Biomedical Research and Innovation, National Research Council of Italy, Palermo, Italy

²Department of Economical, Business and Statistical Science, University of Palermo, Palermo, Italy

³Department of Health Promotion Sciences, Maternal and Infant Care, Internal Medicine and Medical Specialities, University of Palermo, Italy

⁴Department of Biomedical Sciences, Humanitas University, Milan, Italy

⁵Department of Pediatrics and NPI, University of Roma Sapienza, Rome, Italy

⁶Institute of Clinical Physiology, Pulmonary Environmental Epidemiology Unit, National Research Council of Italy, Pisa, Italy

⁷Department of Immunoallergy, CUF Porto Hospital and Institute, Porto, Portugal

Funding None.

Further Information

Address for correspondence

Salvatore Fasola, PhD

Institute for Biomedical Research and Innovation

National Research Council of Italy, Palermo

Italy

Email: salvatore.fasola@irib.cnr.it

Publication History

04 October 2018

21 May 2019

Publication Date:
19 November 2019 (online)



Abstract

Background The use of receiver operating characteristic curves, or “ROC analysis,” has become quite common in biomedical research to support decisions. However, sensitivity, specificity, and misclassification rates are still often estimated using the training sample, overlooking the risk of overrating the test performance.

Methods A simulation study was performed to highlight the inferential implications of splitting (or not) the dataset into training and test sets. The classifier was assumed to be normally distributed given the disease status, and Youden's criterion was used to detect the optimal cutoff. Then, an ROC analysis with sample split was applied to assess the discriminant validity of the Italian version of the Control of Allergic Rhinitis and Asthma Test (CARATkids) questionnaire for children with asthma and rhinitis, for which recent studies may have reported liberal performance estimates.

Results The simulation study showed that both single split and cross-validation (CV) provided unbiased estimators of sensitivity, specificity, and misclassification rate, therefore allowing computation of confidence intervals. For the Italian CARATkids questionnaire, the misclassification rate estimated by fivefold CV was 0.22, with 95% confidence interval 0.14 to 0.30, indicating an acceptable discriminant validity.

Conclusions Splitting into training and test set avoids overrating the test performance in ROC analysis. Validated through this method, the Italian CARATkids is valid for assessing disease control in children with asthma and rhinitis.

Keywords

asthma control test - sample split - performance estimators - optimal cutoff - simulation study - true predictive performance

Introduction

The use of receiver operating characteristic curves, or “ROC analysis,” has become quite common in biomedical research to support decisions.[1] [2] [3] In fact, continuous developments in clinical, biological, and psychometric methods provide a wide range of measurements that can be evaluated as potential diagnostic or prognostic tools. Several advanced nonparametric, semiparametric, and parametric methods have been developed for estimating and comparing ROC curves derived from continuous classifiers.[4] However, the most widespread approach to ROC analysis, routinely used in a clinical setting, is still the simplest one: several values of some numerical (continuous or discrete) classifier are evaluated as possible “optimal” cutoff for labeling individuals as “diseased” or “nondiseased.”[5] [6] [7] [8] The goal is to set up a simple screening test, therefore avoiding performing a more invasive, expensive, or time-consuming “gold standard” test.

To derive a ROC curve, sensitivity is plotted against one minus specificity, both derived from cross-tabulations of the true binary status and several binary classifiers obtained through different cutoffs. Different criteria have been proposed for establishing the “optimal” cutoff, mainly based on a trade-off between sensitivity and specificity. However, there is no general criterion that guarantees optimality in all situations, since optimality may depend on different test characteristics and on the implications (costs, psychological consequences) of false positives and false negatives. The most widely used criteria are minimization of the distance from (0,1)[9] and maximization of Youden's index (sensitivity + specificity − 1),[10] the latter being somewhat more appropriate.[11] Sensitivity, specificity, and misclassification rates, obtained with the optimal cutoff, together with the area under the ROC curve, are commonly used to report the predictive performance of a classifier.[12] [13]

The need to assess the predictive performance of a classifier on an independent test sample has been well demonstrated, for example, in the context of machine learning,[14] decision trees,[15] and penalized least square discriminant analysis.[16] By contrast, this topic appears to have been overlooked in medical literature about ROC analysis, with the result that the aforementioned performance indicators are still quite often estimated using the same sample of data where the test was developed.

Although the issue of deriving appropriate estimators for the performance error rates could be bypassed using parametric[17] [18] [19] or Bayesian approaches,[15] these methods may be unfamiliar to medical researchers. In addition, the main issue of the training-test set approach is the choice of the training set proportion (usually 1:2 or 2:3), especially when the sample size is small.[20] An alternative approach is CV.[21] [22] CV leaves out one or more observations in turn to be used as the test sample; all the test samples form a partition of the whole sample, so that all the observations are involved in estimation of the classification error. The dilemma, however, is about choosing the classifier to retain, since different classifiers may be obtained from different training subsets. In general, one may then return to the full dataset.[23]

The motivation for writing this article concerns the increasing acknowledgment of the prognostic value of patient-reported outcomes in patients with asthma,[24] rhinitis,[25] [26] or both.[27] [28] In fact, recent studies have provided simple screening tests for assessing disease control and therefore monitoring its course. However, out of the five studies referenced above, only one[24] appears to have randomly divided the total sample into a “development” (or “training”) sample (75%) and a “confirmatory” (or “test”) sample (25%). In particular, for pediatric patients with asthma and rhinitis, one of the previous validation studies of the “Control of Allergic Rhinitis and Asthma Test” (CARATkids) questionnaire in Brazilian children[27] reported an estimated probability of 1 for the CARATkids score being larger than 3 with uncontrolled asthma (sensitivity), and an estimated probability of 0.93 for the CARATkids score being lower than 7 with controlled asthma (specificity). Since sensitivity and specificity were reported from the same sample in which they were maximized, such estimates may be affected by positive bias, that is, they probably overestimate the true sensitivity and specificity in the general population.

The aim of this study was to highlight the positive inferential implications of splitting the study sample into a training sample (where the optimal test is derived) and a test sample (where performance or error rates are estimated) in the setting of ROC analysis. This was accomplished by using a well-known data-generating mechanism and a simple simulation study, as a possible reference for medical researchers dealing with such data.

Methods

Statistical Characterization

Let Y_i be a dichotomous random variable for which

$$Y_i = \begin{cases} 1 & \text{if individual } i \text{ is diseased} \\ 0 & \text{if individual } i \text{ is nondiseased} \end{cases}$$

i = 1,2,…,n, where n is the size of a given sample of individuals from some target population. It is possible to define

$$p = P(Y_i = 1)$$

as the prevalence of the disease in the target population. Now consider a quantitative random variable X_i, and suppose that, on average, the X values are greater in diseased individuals. Given this property, X_i may be considered as a potential classifier for Y_i. For the illustrative purpose of this article, the distribution of X_i conditional on the disease status is supposed to be Normal (or Gaussian), so that

$$X_i \mid Y_i = 0 \sim N(\mu_1, \sigma_1^2), \qquad X_i \mid Y_i = 1 \sim N(\mu_2, \sigma_2^2)$$

Here μ₁ and σ₁ are, respectively, the true mean and standard deviation of the classifier among nondiseased individuals, while μ₂ and σ₂ are their counterparts among diseased individuals (with μ₂ > μ₁). On this ground, the rationale of ROC analyses is that the “working variable”

$$\tilde{Y}_i = \begin{cases} 1 & \text{if } X_i > c \\ 0 & \text{if } X_i \le c \end{cases}$$

can be used as a simple classification rule in the target population for some given cutoff c. The accuracy of the test depends on its ability to correctly detect diseased and nondiseased individuals. In particular, the performance indicators of interest are usually sensitivity (probability that the test is positive in diseased individuals), specificity (probability that the test is negative in nondiseased individuals), and the misclassification rate (probability of incorrectly classifying an individual). The true performance has to be evaluated in the target population, and of course, it depends on the cutoff c. The true sensitivity is defined as:

$$Se(c) = P(X_i > c \mid Y_i = 1) = 1 - \Phi(c;\, \mu_2, \sigma_2) \tag{2}$$

where Φ(·) represents the Gaussian distribution function, with parameters of the diseased population in this case. Similarly, the true specificity is:

$$Sp(c) = P(X_i \le c \mid Y_i = 0) = \Phi(c;\, \mu_1, \sigma_1) \tag{3}$$

where now the distribution of X in nondiseased individuals is involved. Finally, the true misclassification rate is:

$$Err(c) = p\,[1 - Se(c)] + (1 - p)\,[1 - Sp(c)] \tag{4}$$
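The true performance indicators (2), (3), and (4) can be evaluated numerically for any cutoff. The following is a minimal sketch under the Gaussian assumption stated above; the function name `true_performance` and the parameter values (in particular μ₁ = 0 and c = 2) are illustrative assumptions, not values taken from the article.

```python
from statistics import NormalDist

def true_performance(c, mu1, sigma1, mu2, sigma2, p):
    """True sensitivity, specificity and misclassification rate of the
    rule 'test positive if X > c' under the two-Gaussian model."""
    se = 1.0 - NormalDist(mu2, sigma2).cdf(c)  # P(X > c | diseased)
    sp = NormalDist(mu1, sigma1).cdf(c)        # P(X <= c | nondiseased)
    err = p * (1.0 - se) + (1.0 - p) * (1.0 - sp)
    return se, sp, err

# Hypothetical population: means 0 and 4, common standard deviation 2,
# prevalence 0.6, cutoff halfway between the two means.
se, sp, err = true_performance(c=2.0, mu1=0.0, sigma1=2.0,
                               mu2=4.0, sigma2=2.0, p=0.6)
print(round(se, 3), round(sp, 3), round(err, 3))
```

With equal standard deviations and a cutoff halfway between the means, sensitivity and specificity coincide, so the misclassification rate does not depend on the prevalence in this particular configuration.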

In ROC analyses, pairs (x_i, y_i) are collected on n individuals; in particular, the disease status y_i is assessed through some validated gold standard test. To set up the classification rule, the cutoff to use is selected among several candidates on a grid of x values, as the value that optimizes a given criterion. The sample of individuals on which this optimization is performed is called the “training set.” The size of the training set will be denoted by n_δ, where δ = n_δ/n (e.g., δ = 50%, δ = 67% or δ = 100%) indicates the training percentage. According to Youden's criterion,[10] the optimal cutoff is estimated as:

$$\hat{c}_\delta = \arg\max_{c_j} \left\{ \widehat{Se}_\delta(c_j) + \widehat{Sp}_\delta(c_j) - 1 \right\}$$

where c_j is the j-th candidate cutoff on a J-dimensional discrete grid of x values, $\widehat{Se}_\delta(c_j)$ is the test sensitivity in the training sample, and $\widehat{Sp}_\delta(c_j)$ is the test specificity in the training sample.
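The cutoff search under Youden's criterion can be sketched as follows. This is an illustration only: the helper names `sens_spec` and `youden_cutoff`, the choice of spanning the grid from the smallest to the largest observed x, and the toy data are assumptions, since the article does not state how the grid limits were chosen.

```python
def sens_spec(x, y, c):
    """Empirical sensitivity and specificity of the rule 'positive if x > c'."""
    pos = [xi for xi, yi in zip(x, y) if yi == 1]
    neg = [xi for xi, yi in zip(x, y) if yi == 0]
    se = sum(xi > c for xi in pos) / len(pos)
    sp = sum(xi <= c for xi in neg) / len(neg)
    return se, sp

def youden_cutoff(x, y, n_grid=30):
    """Candidate cutoff maximizing Se + Sp - 1 on an equally spaced grid."""
    lo, hi = min(x), max(x)
    grid = [lo + j * (hi - lo) / (n_grid - 1) for j in range(n_grid)]
    # max() keeps the first grid point in case of ties
    return max(grid, key=lambda c: sum(sens_spec(x, y, c)) - 1.0)

# Toy training sample (hypothetical values): 1 = diseased, 0 = nondiseased
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 0, 1, 1, 1]
c_hat = youden_cutoff(x, y)
se_hat, sp_hat = sens_spec(x, y, c_hat)
print(c_hat, se_hat, sp_hat)
```

In this toy sample, Youden's index is largest for any cutoff between 5 and 6, where three of the four diseased individuals and all nondiseased individuals are classified correctly.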

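Once the optimal cutoff ĉ_δ has been identified using the training set, the next step is to estimate the true predictive performance of the optimized test, that is, to estimate (2), (3), and (4) for c = ĉ_δ. To accomplish this, two simple approaches are commonly used. The first, liberal approach consists in estimating the performance in the training set:

$$\widehat{Se}_\delta(\hat{c}_\delta) = \frac{\sum_{i=1}^{n_\delta} y_i\, I(x_i > \hat{c}_\delta)}{\sum_{i=1}^{n_\delta} y_i}$$

$$\widehat{Sp}_\delta(\hat{c}_\delta) = \frac{\sum_{i=1}^{n_\delta} (1 - y_i)\, I(x_i \le \hat{c}_\delta)}{\sum_{i=1}^{n_\delta} (1 - y_i)}$$

$$\widehat{Err}_\delta(\hat{c}_\delta) = p\,[1 - \widehat{Se}_\delta(\hat{c}_\delta)] + (1 - p)\,[1 - \widehat{Sp}_\delta(\hat{c}_\delta)] \tag{8}$$

where the sums run over the n_δ individuals of the training set. If the true disease prevalence (i.e., the prevalence in the general population) is simply estimated by $\hat{p} = \sum_{i=1}^{n_\delta} y_i / n_\delta$ (i.e., the prevalence in the study sample), [Eq. (8)] reduces to the following, more familiar expression:

$$\widehat{Err}_\delta(\hat{c}_\delta) = \frac{1}{n_\delta} \sum_{i=1}^{n_\delta} \left[ y_i\, I(x_i \le \hat{c}_\delta) + (1 - y_i)\, I(x_i > \hat{c}_\delta) \right] \tag{9}$$

where I(·) is an indicator function. However, if the number of diseased and nondiseased individuals is fixed a priori in the study design, using [Eq. (9)] would be wrong, because the sample prevalence is then set by design and does not estimate the population prevalence; in this case, the true prevalence should be inferred from previous studies or just hypothesized.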
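The reduction from Eq. (8) to Eq. (9) can be checked in one line. In the sketch below, n_D and n_N (numbers of diseased and nondiseased individuals in the training set) and FN and FP (numbers of false negatives and false positives at the selected cutoff) are notation introduced here for illustration only; they are not taken from the article.

```latex
\begin{align*}
\hat{p}\,\bigl[1 - \widehat{Se}\bigr] + (1 - \hat{p})\,\bigl[1 - \widehat{Sp}\bigr]
  &= \frac{n_D}{n_\delta}\cdot\frac{FN}{n_D}
   + \frac{n_N}{n_\delta}\cdot\frac{FP}{n_N} \\
  &= \frac{FN + FP}{n_\delta}
\end{align*}
```

The last expression is the proportion of misclassified individuals in the training set. The cancellation of n_D and n_N is what fails to be meaningful when these counts are fixed by design, since they then carry no information on the prevalence.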

The second, more conservative approach consists in randomly leaving out a given proportion of the sample, say n − n_δ individuals (e.g., 33% or 50% of the sample), to use as the test sample, that is, to estimate (2), (3), and (4) for c = ĉ_δ, where ĉ_δ comes from the training sample (δ < 100%). The performance estimators will therefore be denoted with $\widehat{Se}_{1-\delta}(\hat{c}_\delta)$, $\widehat{Sp}_{1-\delta}(\hat{c}_\delta)$, and $\widehat{Err}_{1-\delta}(\hat{c}_\delta)$.
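The single-split approach can be sketched as below. This is a hedged illustration, not the authors' code: the function names, the grid limits (observed minimum and maximum), the seeds, and the simulated population (means 0 and 4, standard deviation 2, prevalence 0.6) are all assumptions. The misclassification rate is computed as the proportion of misclassified test individuals, which presumes that the sample prevalence estimates the population prevalence.

```python
import random

def rates(x, y, c):
    """Empirical Se, Sp and misclassification rate at cutoff c."""
    pos = [xi for xi, yi in zip(x, y) if yi == 1]
    neg = [xi for xi, yi in zip(x, y) if yi == 0]
    se = sum(xi > c for xi in pos) / len(pos)
    sp = sum(xi <= c for xi in neg) / len(neg)
    err = sum((xi > c) != (yi == 1) for xi, yi in zip(x, y)) / len(x)
    return se, sp, err

def youden_cutoff(x, y, n_grid=30):
    """Grid cutoff maximizing Se + Sp - 1 in the given sample."""
    lo, hi = min(x), max(x)
    grid = [lo + j * (hi - lo) / (n_grid - 1) for j in range(n_grid)]

    def youden(c):
        se, sp, _ = rates(x, y, c)
        return se + sp - 1.0

    return max(grid, key=youden)

def split_estimate(x, y, delta=0.67, seed=0):
    """Select the cutoff on a random training part, evaluate on the rest."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    n_train = round(delta * len(x))
    train, test = idx[:n_train], idx[n_train:]
    c_hat = youden_cutoff([x[i] for i in train], [y[i] for i in train])
    return c_hat, rates([x[i] for i in test], [y[i] for i in test], c_hat)

# Hypothetical simulated sample of n = 200 individuals
rng = random.Random(1)
y = [1 if rng.random() < 0.6 else 0 for _ in range(200)]
x = [rng.gauss(4.0, 2.0) if yi == 1 else rng.gauss(0.0, 2.0) for yi in y]
c_hat, (se_test, sp_test, err_test) = split_estimate(x, y, delta=0.67)
print(c_hat, se_test, sp_test, err_test)
```

With very small samples the test part may contain no diseased or no nondiseased individuals, in which case this sketch raises a division error; a real analysis would need to handle that case explicitly.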

Sometimes, separation into a training and a test set is difficult due to the small sample size. In this case, k-fold CV makes better use of the data. With this approach, the whole sample is randomly partitioned into k subgroups to be used as the test set in different steps. At each step, k − 1 groups (training set) are used to develop a classifier, and the outcome predictions are derived in the test set. This procedure is repeated until all the k subgroups have been used as the test set, and the overall classifier performance is then evaluated from the pooled predictions. The main issue with this approach is the choice of the classifier to retain, since different classifiers may be obtained at each step; in this case, one may return to the full dataset using ĉ_100%.[23] The k-fold CV performance estimators will be denoted with $\widehat{Se}_{CV}$, $\widehat{Sp}_{CV}$, and $\widehat{Err}_{CV}$; when k = n, CV is referred to as leave-one-out CV (LOOCV).
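The k-fold procedure just described can be sketched as follows. The code is an illustration under stated assumptions: function names, grid limits, seeds, and the simulated data are hypothetical, and performance is computed from the out-of-fold predictions pooled over all folds. Setting k equal to the sample size gives LOOCV.

```python
import random

def youden_cutoff(x, y, n_grid=30):
    """Grid cutoff maximizing Se + Sp - 1 in the given sample."""
    pos = [xi for xi, yi in zip(x, y) if yi == 1]
    neg = [xi for xi, yi in zip(x, y) if yi == 0]
    lo, hi = min(x), max(x)
    grid = [lo + j * (hi - lo) / (n_grid - 1) for j in range(n_grid)]

    def youden(c):
        se = sum(xi > c for xi in pos) / len(pos)
        sp = sum(xi <= c for xi in neg) / len(neg)
        return se + sp - 1.0

    return max(grid, key=youden)

def cv_performance(x, y, k=5, seed=0):
    """Se, Sp and misclassification rate from pooled out-of-fold predictions."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]  # k disjoint test sets
    pred = [0] * len(x)
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        c_hat = youden_cutoff([x[i] for i in train], [y[i] for i in train])
        for i in test:  # classify held-out individuals only
            pred[i] = 1 if x[i] > c_hat else 0
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    se = sum(p == 1 and t == 1 for p, t in zip(pred, y)) / n_pos
    sp = sum(p == 0 and t == 0 for p, t in zip(pred, y)) / n_neg
    err = sum(p != t for p, t in zip(pred, y)) / len(y)
    return se, sp, err

# Hypothetical simulated sample of n = 100 individuals
rng = random.Random(42)
y = [1 if rng.random() < 0.6 else 0 for _ in range(100)]
x = [rng.gauss(4.0, 2.0) if yi == 1 else rng.gauss(0.0, 2.0) for yi in y]
se_cv, sp_cv, err_cv = cv_performance(x, y, k=5)
c_retained = youden_cutoff(x, y)  # cutoff retained from the full dataset
print(se_cv, sp_cv, err_cv, c_retained)
```

The cutoff reported to users (`c_retained`) is derived from the full dataset, while the performance figures come only from individuals that were held out when their fold's cutoff was selected.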

The next section is intended to show, empirically, that the first approach (100% training) leads to an overestimation of the true sensitivity and specificity (and consequently an underestimation of the misclassification rate in the target population), while the second approach (independent test sample) provides unbiased estimates. The following true performance indicators, averaged over ĉ_δ, will be considered for the different procedures (δ = 100%, δ = 67% and δ = 50%):

$$Se_\delta = \sum_{j=1}^{J} Se(c_j)\,\mathrm{prob}(\hat{c}_\delta = c_j), \qquad Sp_\delta = \sum_{j=1}^{J} Sp(c_j)\,\mathrm{prob}(\hat{c}_\delta = c_j), \qquad Err_\delta = \sum_{j=1}^{J} Err(c_j)\,\mathrm{prob}(\hat{c}_\delta = c_j)$$

where the terms prob(ĉ_δ = c_j) are estimated through simulation.

Simulation Study

In the simulation study shown in the next section, the n pairs (y_i, x_i) were generated as follows: first, y_i was generated from a Bernoulli random variable with probability of success equal to p; then, x_i was generated from a Gaussian distribution, using parameters μ₁ and σ₁ if y_i = 0, and μ₂ and σ₂ if y_i = 1. Simulations were performed to assess the true sensitivity, specificity, and misclassification rate, and the properties of the estimators presented in the previous section.
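The data-generating mechanism described above can be written in a few lines. This is a minimal sketch; the function name `simulate_sample`, the seed handling, and the example parameter values (in particular μ₁ = 0) are assumptions made for illustration.

```python
import random

def simulate_sample(n, p, mu1, sigma1, mu2, sigma2, seed=None):
    """Draw n pairs: y ~ Bernoulli(p), then x ~ Normal given y."""
    rng = random.Random(seed)
    y = [1 if rng.random() < p else 0 for _ in range(n)]
    x = [rng.gauss(mu2, sigma2) if yi == 1 else rng.gauss(mu1, sigma1)
         for yi in y]
    return x, y

# Hypothetical configuration: mean difference 4, standard deviations (2, 2)
x, y = simulate_sample(n=2000, p=0.6, mu1=0.0, sigma1=2.0,
                       mu2=4.0, sigma2=2.0, seed=1)
mean_diseased = sum(xi for xi, yi in zip(x, y) if yi == 1) / sum(y)
mean_nondiseased = sum(xi for xi, yi in zip(x, y) if yi == 0) / (len(y) - sum(y))
print(sum(y) / len(y), mean_diseased, mean_nondiseased)
```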

Different configurations of target populations were considered by varying the mean difference (distance between μ₁ and μ₂, δ_μ = 2, 4, 6, 8), the standard deviations [(σ₁, σ₂) = (1, 4), (2, 3), (2, 2), (3, 2), (4, 1)] and the disease prevalence (p = 0.6, 0.4). [Figure 1] illustrates the hypothesized populations of diseased and nondiseased individuals. From each population, 1,000 random data samples were generated using different sample sizes (n = 50, 100, 200). For each simulated sample, a ROC analysis was performed to detect the optimal cutoff using Youden's criterion (on a discrete grid of J = 30 equally spaced candidates c_j), and the performance of the obtained test was estimated using different training percentages (δ = 100%, 67%, 50%), fivefold CV and LOOCV.

Fig. 1 Theoretical scenarios of populations considered in the simulation study. Gray curves indicate nondiseased individuals, and black curves indicate diseased individuals.
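A scaled-down version of such a simulation is sketched below for one configuration only. It is not the authors' code: it uses 300 replicates instead of 1,000, a single population (means 0 and 4, standard deviations 2 and 2, prevalence 0.6, where μ₁ = 0 is an assumption), a grid spanning the observed range, and it tracks the bias of the estimated Youden's index (Se + Sp − 1) rather than each indicator separately. Because the pairs are generated independently, taking the first half of each sample as training set amounts to a random 50% split.

```python
import random
from statistics import NormalDist, mean

MU1, S1, MU2, S2, P = 0.0, 2.0, 4.0, 2.0, 0.6  # hypothetical population

def draw(n, rng):
    y = [1 if rng.random() < P else 0 for _ in range(n)]
    x = [rng.gauss(MU2, S2) if yi == 1 else rng.gauss(MU1, S1) for yi in y]
    return x, y

def has_both(y):
    """True if the sample contains diseased and nondiseased individuals."""
    return 0 < sum(y) < len(y)

def youden(x, y, c):
    """Empirical Se + Sp - 1 at cutoff c."""
    pos = [xi for xi, yi in zip(x, y) if yi == 1]
    neg = [xi for xi, yi in zip(x, y) if yi == 0]
    return (sum(xi > c for xi in pos) / len(pos)
            + sum(xi <= c for xi in neg) / len(neg) - 1.0)

def best_cutoff(x, y, n_grid=30):
    lo, hi = min(x), max(x)
    grid = [lo + j * (hi - lo) / (n_grid - 1) for j in range(n_grid)]
    return max(grid, key=lambda c: youden(x, y, c))

def true_youden(c):
    """True Se + Sp - 1 in the population at cutoff c."""
    return (1.0 - NormalDist(MU2, S2).cdf(c)) + NormalDist(MU1, S1).cdf(c) - 1.0

def run(n=50, reps=300, seed=0):
    rng = random.Random(seed)
    liberal, split = [], []
    for _ in range(reps):
        x, y = draw(n, rng)
        half = n // 2
        if not (has_both(y[:half]) and has_both(y[half:])):
            continue  # skip rare samples where one half lacks a class
        c_full = best_cutoff(x, y)                 # 100% training
        liberal.append(youden(x, y, c_full) - true_youden(c_full))
        c_half = best_cutoff(x[:half], y[:half])   # 50% training
        split.append(youden(x[half:], y[half:], c_half) - true_youden(c_half))
    return mean(liberal), mean(split)

bias_liberal, bias_split = run()
print(bias_liberal, bias_split)
```

The first returned value is the average optimism of the training-set estimate, and the second is the average error of the test-set estimate, each measured against the true value at the cutoff actually selected.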

Clinical Data

The data analyzed in the article come from a cross-sectional study performed at the Pediatric Pulmonology-Allergology outpatient clinic of the CNR Institute for Biomedical Research and Innovation of Palermo, and at the Department of Pediatrics of the Sapienza University of Rome, Italy. Children aged 6 to 11 years, with a medical diagnosis of allergic rhinitis and asthma, were consecutively enrolled from March 2015 to December 2016. Children with other respiratory or chronic diseases that might interfere with the study measurements, as well as children with psychiatric disorders and/or cognitive impairment, were excluded. The n = 112 patients were assessed at baseline (T0) and after a mean period of 3 months (T1). All children attended both visits and completed an Italian version of the CARATkids questionnaire[27] [28] [29] and the Childhood Asthma Control Test (C-ACT).[24]

Some psychometric characteristics of the Italian CARATkids questionnaire were assessed. In particular, the discriminant validity of CARATkids was evaluated, as in previous studies,[27] [28] as its ability to detect children with uncontrolled asthma, defined as C-ACT score ≤19.[24] Moreover, the more general cross-sectional and longitudinal validity was assessed through the correlation between the total score of the CARATkids and the total score of C-ACT. A ROC analysis was performed and the optimal cutoff value for CARATkids selected according to Youden's method. The area under the curve (AUC) was estimated and its significance (AUC > 0.5) tested using the method described by DeLong et al.[30]
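For reference, the empirical AUC equals the proportion of diseased/nondiseased pairs in which the diseased individual has the higher score, with ties counted as one half. The sketch below computes only this point estimate; it does not implement the DeLong variance or test, and the function name and the scores are hypothetical, not CARATkids data.

```python
def empirical_auc(pos_scores, neg_scores):
    """Mann-Whitney form of the AUC: P(score_pos > score_neg) + 0.5 * P(tie)."""
    wins = 0.0
    for xp in pos_scores:
        for xn in neg_scores:
            if xp > xn:
                wins += 1.0
            elif xp == xn:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical scores for uncontrolled (positive) and controlled (negative) cases
auc = empirical_auc([5, 7, 8], [1, 3, 5])
print(auc)
```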

The study was approved by the local ethics committee (N 11/2014 Azienda ospedaliera Universitaria Policlinico Paolo Giaccone) and conducted in accordance with the Declaration of Helsinki and Good Clinical Practice guidelines. All parents provided written informed consent. The study was registered on the central registration system ClinicalTrials.gov (NCT 02409550).

Results

Simulation Study

[Tables 1] to [6] show the means and standard deviations of the different estimates obtained in the simulated data samples. Scenarios with δ_μ = 2, 8 are reported in [Supplementary Tables S1] to [S6] (online only). In general, small differences are observed in the expected value of the cutoff estimator (ĉ), which is located approximately at the intersection point between the distributions in [Fig. 1], that is, the optimal cutoff in the population. As expected, the variance of ĉ increases as the training percentage decreases; as a consequence, the true performance indicators appear to get a little worse as δ decreases, since there is a greater probability that the estimated cutoff takes values far from the aforementioned intersection point. As expected, the true performances improve as the sample size and the distance between the distributions of diseased and nondiseased individuals (δ_μ) increase. Similarly, the true sensitivity increases as the variability of the classifier among diseased individuals (σ₂) decreases, just as the true specificity increases as σ₁ decreases.

Table 1
Simulated means and standard deviations (σ) of the detected cutoff (ĉ), and of estimated sensitivity (Ŝe), specificity (Ŝp), and misclassification rate (Êrr) with n = 50 and p = 0.60. Se, Sp, and Err indicate the true performances
Δ_μ	σ₁, σ₂	δ	ĉ	σ_ĉ	Se	Ŝe	σ_Ŝe	Sp	Ŝp	σ_Ŝp	Err	Êrr	σ_Êrr
4	1, 4	100%	1.660	0.512	0.719	0.738	0.085	0.930	0.974	0.043	0.196	0.168	0.052
		67%	1.560	0.539	0.727	0.729	0.151	0.916	0.921	0.131	0.197	0.195	0.098
		50%	1.473	0.576	0.734	0.739	0.123	0.899	0.900	0.133	0.200	0.197	0.080
		Fivefold CV	1.660	0.512	0.719	0.727	0.084	0.930	0.926	0.058	0.196	0.193	0.060
		LOOCV	1.660	0.512	0.719	0.722	0.087	0.930	0.935	0.056	0.196	0.193	0.062
	2, 3	100%	2.019	0.894	0.737	0.768	0.106	0.822	0.883	0.095	0.229	0.187	0.056
		67%	2.021	0.986	0.734	0.731	0.169	0.818	0.824	0.192	0.232	0.233	0.106
		50%	1.880	1.080	0.747	0.748	0.151	0.796	0.804	0.181	0.233	0.231	0.089
		Fivefold CV	2.019	0.894	0.737	0.739	0.099	0.822	0.815	0.100	0.229	0.230	0.070
		LOOCV	2.019	0.894	0.737	0.735	0.110	0.822	0.824	0.105	0.229	0.229	0.078
	2, 2	100%	1.906	0.700	0.838	0.866	0.086	0.816	0.874	0.089	0.171	0.131	0.048
		67%	1.881	0.813	0.837	0.836	0.153	0.808	0.813	0.190	0.175	0.174	0.098
		50%	1.823	0.899	0.840	0.838	0.138	0.797	0.797	0.175	0.177	0.178	0.084
		Fivefold CV	1.906	0.700	0.838	0.838	0.082	0.816	0.814	0.092	0.171	0.171	0.063
		LOOCV	1.906	0.700	0.838	0.836	0.093	0.816	0.819	0.096	0.171	0.170	0.072
	3, 2	100%	1.729	0.903	0.850	0.877	0.094	0.710	0.771	0.113	0.206	0.166	0.053
		67%	1.668	1.067	0.848	0.848	0.156	0.700	0.701	0.213	0.211	0.211	0.104
		50%	1.614	1.151	0.850	0.849	0.145	0.693	0.698	0.195	0.213	0.212	0.086
		Fivefold CV	1.729	0.903	0.850	0.845	0.092	0.710	0.707	0.114	0.206	0.210	0.070
		LOOCV	1.729	0.903	0.850	0.843	0.105	0.710	0.712	0.119	0.206	0.209	0.078
	4, 1	100%	2.153	0.657	0.942	0.963	0.046	0.703	0.749	0.102	0.154	0.124	0.045
		67%	2.023	0.866	0.942	0.940	0.100	0.690	0.695	0.204	0.159	0.157	0.090
		50%	1.916	1.063	0.936	0.938	0.103	0.679	0.677	0.186	0.166	0.168	0.083
		Fivefold CV	2.153	0.657	0.942	0.945	0.053	0.703	0.696	0.107	0.154	0.155	0.056
		LOOCV	2.153	0.657	0.942	0.946	0.057	0.703	0.703	0.105	0.154	0.151	0.058
6	1, 4	100%	1.835	0.506	0.849	0.860	0.067	0.950	0.984	0.031	0.111	0.091	0.040
		67%	1.721	0.584	0.855	0.853	0.120	0.932	0.931	0.121	0.114	0.117	0.081
		50%	1.632	0.618	0.860	0.863	0.096	0.918	0.920	0.119	0.117	0.114	0.068
		Fivefold CV	1.835	0.506	0.849	0.853	0.068	0.950	0.938	0.051	0.111	0.113	0.049
		LOOCV	1.835	0.506	0.849	0.849	0.069	0.950	0.944	0.046	0.111	0.112	0.050
	2, 3	100%	2.614	0.828	0.862	0.882	0.071	0.886	0.935	0.065	0.128	0.097	0.041
		67%	2.567	0.937	0.863	0.859	0.125	0.878	0.883	0.150	0.131	0.132	0.081
		50%	2.465	1.030	0.867	0.862	0.115	0.863	0.859	0.150	0.134	0.139	0.072
		Fivefold CV	2.614	0.828	0.862	0.862	0.071	0.886	0.879	0.073	0.128	0.131	0.055
		LOOCV	2.614	0.828	0.862	0.857	0.077	0.886	0.888	0.072	0.128	0.130	0.058
	2, 2	100%	2.877	0.696	0.930	0.947	0.049	0.913	0.955	0.049	0.077	0.051	0.032
		67%	2.734	0.802	0.935	0.935	0.093	0.898	0.895	0.142	0.080	0.082	0.070
		50%	2.605	0.947	0.938	0.937	0.087	0.880	0.879	0.135	0.085	0.086	0.063
		Fivefold CV	2.877	0.696	0.930	0.932	0.055	0.913	0.903	0.061	0.077	0.079	0.045
		LOOCV	2.877	0.696	0.930	0.930	0.057	0.913	0.911	0.057	0.077	0.077	0.047
	3, 2	100%	3.095	0.825	0.910	0.931	0.063	0.840	0.886	0.076	0.118	0.087	0.040
		67%	3.005	0.999	0.910	0.906	0.120	0.829	0.823	0.170	0.122	0.127	0.086
		50%	2.892	1.134	0.912	0.912	0.108	0.816	0.807	0.162	0.126	0.129	0.069
		Fivefold CV	3.095	0.825	0.910	0.909	0.065	0.840	0.828	0.083	0.118	0.123	0.054
		LOOCV	3.095	0.825	0.910	0.907	0.071	0.840	0.833	0.084	0.118	0.122	0.060
	4, 1	100%	3.818	0.731	0.965	0.978	0.036	0.826	0.867	0.080	0.091	0.068	0.037
		67%	3.679	0.929	0.964	0.963	0.079	0.815	0.812	0.171	0.096	0.097	0.075
		50%	3.426	1.216	0.963	0.964	0.076	0.794	0.796	0.160	0.104	0.104	0.071
		Fivefold CV	3.818	0.731	0.965	0.963	0.046	0.826	0.817	0.087	0.091	0.096	0.049
		LOOCV	3.818	0.731	0.965	0.964	0.046	0.826	0.823	0.087	0.091	0.092	0.049