Keywords
forecasting - health communication - decision support techniques - clinical decision-making - education of patients
Background and Significance
The quantification of patient preferences for health care interventions is important for determining their role in care delivery and for promoting preference-concordant decisions.[1] [2] [3] [4] Prior empiric work has described patient preferences for the tradeoffs between costs, pain severity, survival, transportation time, access to care, and place of death.[2] [5] [6] [7] These studies used a discrete choice experiment (DCE) design that provides a quantitative measure of these tradeoffs. While similar studies have examined preferences for characteristics associated with diagnostic tests such as cancer screening,[8] [9] [10] none has examined patient preferences for characteristics of clinical prediction models. This distinction is significant because a local health system typically cannot directly alter the performance characteristics of a diagnostic test to suit the needs and preferences of a particular patient. However, the performance characteristics of a clinical prediction model depend in part on modeling and statistical methods that analysts could optimize to meet individualized needs in a specific clinical scenario. Highly customized clinical prediction models based on rich data from the electronic health record (EHR) are becoming increasingly common[11] [12] in the era of learning health systems[13] and widespread EHR adoption.[14] Such models may promote “precision delivery” of health care by identifying patients at risk for a given outcome, thereby prompting targeted and timely interventions.[15] However, little is known about patient preferences for such predictive information, or how preferences for false positive and false negative errors and for types of uncertainty might vary across clinical conditions and baseline risk estimates. For example, a person with a serious illness of otherwise uncertain prognosis may prefer a false positive to a false negative prediction of death if the response to the prediction is not too costly or is likely to be undertaken at some point anyway (e.g., advance care planning). On the other hand, a person considering prophylactic surgery to prevent future cancer occurrence may prefer a false negative to a false positive error, which might lead to an unnecessary and irreversible procedure. The validity of scientific data, the time horizons of risk assessments, and the presentation of statistical uncertainty are all features of scientific knowledge important to the general public.[16]
Quantifying preferences for predictive model characteristics in such scenarios could be accomplished through a DCE. However, this approach would require at least a basic understanding of the relevant statistical concepts used to describe model performance. Although understanding of related concepts in the context of diagnostic tests has been examined among both general audiences and clinicians,[17] [18] [19] [20] [21] no studies have evaluated understanding of the performance characteristics of clinical prediction models.
Objective
We conducted a pilot feasibility study to determine (1) the level of understanding of performance characteristics of clinical prediction models, and (2) whether numeracy, education, or other demographic factors are associated with understanding of these concepts. We hypothesized that, using best practices in risk communication, a nonclinical audience could interpret key attributes of predictive models such as sensitivity, specificity, and confidence intervals (CIs).
Methods
Study Design
We conducted a cross-sectional electronic survey using the Web-based Qualtrics software
(Qualtrics, Provo, Utah, United States) among an online population to quantify the
ability to interpret the performance of clinical prediction models as described by
sensitivity, specificity, and CIs.
Population
Two sequential cohorts were included in this study. The first cohort of participants was recruited via the Amazon Mechanical Turk (MTurk) platform during December 2016. Enrollment was restricted to unique participants[22] with historical task success rates of at least 95%.[23] Each participant provided electronic informed consent and received US$4 upon completion of the survey.
Given the low median age of respondents in this first cohort ([Table 1]) and the desire to recruit a cohort generalizable to patient populations likely to utilize predictive information in decisions about health care,[24] [25] a second cohort was recruited via the TurkPrime platform, which allows for detailed filtering of eligible respondents by demographic features within the MTurk population.[26] During January 2017, we recruited an additional 301 nonduplicated respondents over the age of 60 years, each of whom received US$2 for completing the survey. The reimbursement amount was lower in the second cohort based on the lower-than-expected median study completion time observed in the first ([Table 2]).
Table 1 Characteristics of the study population

Characteristic, n (%) | All | Cohort 1 | Cohort 2
Participants, n | 534 | 280 | 254
Age (y), median (IQR) | 51 (30–63) | 31 (27–39) | 63 (60–67)
Gender
  Male | 257 (48.1) | 165 (58.9) | 92 (36.2)
  Female | 273 (51.1) | 112 (40.0) | 161 (63.4)
  Other | 4 (< 1) | 3 (1.1) | 1 (< 1)
Race
  White | 439 (82.2) | 206 (73.6) | 233 (91.7)
  Asian | 38 (7.1) | 33 (11.8) | 5 (2.0)
  Black | 33 (6.2) | 24 (8.6) | 9 (3.5)
  Other | 24 (4.5) | 17 (6.1) | 7 (2.8)
Highest level of education
  High school | 117 (21.9) | 56 (20.0) | 61 (24.0)
  GED or equivalent | 45 (8.4) | 26 (9.3) | 19 (7.5)
  Associate's degree | 123 (23.0) | 65 (23.2) | 58 (22.8)
  Bachelor's degree | 168 (31.5) | 105 (37.5) | 63 (24.8)
  Master's degree | 60 (11.2) | 24 (8.6) | 36 (14.1)
  Doctoral degree | 21 (3.9) | 4 (1.4) | 17 (6.7)
Daily meds, median (IQR) | 0 (0–2) | 0 (0–1) | 2 (0–4)
Marital status
  Single, never married | 187 (35.0) | 152 (54.3) | 35 (13.8)
  Married | 234 (43.8) | 102 (36.4) | 132 (52.0)
  Divorced/Widowed | 113 (21.2) | 26 (9.3) | 87 (34.3)
Numeracy, median (IQR)
  Objective (range 0–8) | 7 (6–8) | 7 (6–8) | 7 (6–8)
  Subjective (range 1–6) | 4.9 (4.3–5.4) | 4.8 (4.1–5.3) | 4.9 (4.4–5.4)

Abbreviations: GED, general equivalency degree; IQR, interquartile range.
Table 2 Characteristics of participant responses, and performance on questions by content area and type of knowledge

Measure | All | Cohort 1 | Cohort 2
Study duration (min), median (IQR) | 14.3 (10.8–18.5) | 12.5 (9.3–15.6) | 16.5 (13.3–19.6)
Overall knowledge score | 91.9 (89.2–94.0) | 89.9 (85.6–93.1) | 94.1 (90.3–96.6)
Sensitivity
  All | 95.9 (93.8–97.4) | 93.8 (90.1–96.2) | 98.3 (95.6–99.4)
  Verbatim | 97.0 (95.1–98.2) | 94.6 (91.3–96.9) | 99.6 (97.5–99.9)
  Gist | 98.5 (96.9–99.3) | 97.1 (94.2–98.7) | 100.0 (98.1–100.0)
  DCE task | 92.3 (89.6–94.4) | 89.6 (85.3–92.8) | 95.3 (91.7–97.4)
Specificity
  All | 93.1 (90.5–95.0) | 91.3 (87.2–94.2) | 95.0 (91.4–97.2)
  Verbatim | 93.8 (91.3–95.6) | 91.4 (87.4–94.3) | 96.5 (93.2–98.3)
  Gist | 95.3 (93.1–96.9) | 92.1 (88.2–94.9) | 98.8 (96.3–99.7)
  DCE task | 90.1 (87.1–92.4) | 90.4 (86.1–93.4) | 89.8 (85.2–93.1)
Confidence interval
  All | 86.6 (83.3–89.3) | 84.3 (79.4–88.3) | 89.1 (84.5–92.5)
  Verbatim | 94.4 (92.0–96.1) | 93.6 (90.0–96.0) | 95.3 (91.7–97.4)
  Gist | 83.1 (79.6–86.2) | 79.3 (74.0–83.9) | 87.4 (82.5–91.1)
  DCE task | 82.1 (78.4–85.3) | 79.4 (73.6–84.2) | 84.6 (79.5–88.7)

Abbreviations: DCE, discrete choice experiment; IQR, interquartile range.
Note: All scores are reported as mean percentages with 95% binomial confidence intervals.
In both cohorts, respondents who failed any attention checks or completed the entire
survey in less than 3 minutes were excluded from the analysis.[27]
Presentation of Statistical Concepts
The survey instrument included didactic modules to explain what a predictive model is and to describe the relevant statistical concepts. First, a single-page visual and text description of a predictive model was displayed. Next, following best practices in risk communication, three separate explanatory modules used text exemplars, icon arrays, integer annotations, summary explanations, and simple sentences[28] [29] to describe the concepts of sensitivity, specificity, and CIs as they relate to the performance of predictive models. We depicted CIs using a variation of a “blurred” icon array.[19] Sample explanatory icon arrays are presented in [Fig. 1]. Each module presented the statistical concept in the context of a weather prediction. Weather examples were chosen to remove potential cognitive or affective influences common in medical decision making[30] [31] and to isolate ascertainment of participants' interpretation of these concepts. The explanatory text of these modules was written at a 10th-grade Flesch–Kincaid reading level for clarity. The instrument was iteratively piloted with five experienced research coordinators not involved with the study to improve clarity. A copy of the final survey instrument with explanatory modules is available in the [Supplementary Material] (available in the online version).
Fig. 1 Sample visual explanatory tools used in the didactic modules to convey sensitivity
(A) and confidence intervals (B) using icon arrays, integer annotations, and juxtaposed comparisons.
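To make the quantities conveyed by these modules concrete, the following sketch shows how sensitivity, specificity, and a 95% binomial CI could be computed in R for a hypothetical rain-prediction example; the counts and object names are illustrative assumptions, not values taken from the survey instrument.

# Hypothetical confusion-matrix counts for a rain prediction (illustrative only)
tp <- 18   # rain predicted, rain occurred
fn <- 2    # rain not predicted, rain occurred
tn <- 70   # no rain predicted, no rain occurred
fp <- 10   # rain predicted, no rain occurred

sensitivity <- tp / (tp + fn)   # 0.90: proportion of rainy days correctly predicted
specificity <- tn / (tn + fp)   # 0.875: proportion of dry days correctly predicted

# 95% binomial confidence interval around the sensitivity estimate
sens_ci <- binom.test(tp, tp + fn)$conf.int
round(c(sensitivity = sensitivity, specificity = specificity, sens_ci), 3)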
Knowledge Testing
Each module was followed by two questions, one testing verbatim knowledge and one testing gist knowledge. Both types of knowledge are associated with identifying optimal medical treatments in a comparison task[29] and with understanding of numeric concepts.[32] There were three verbatim and three gist knowledge questions in total. Finally, to assess participants' abilities to compare two models given a complex presentation of information, participants were presented with three DCE tasks, each comparing the performance and other characteristics of two different prediction models. A DCE task was chosen because DCEs are commonly used to determine the relative utilities of time, cost, and health states among patients.[2] [5] [6] [7] Thus, they represent an ideal study design in future work for assessing preferences for the performance characteristics of clinical prediction models, and they allow researchers to quantify tradeoffs between measures such as sensitivity and specificity, as are commonly encountered in predictive model development. Only one of six model characteristics (i.e., DCE attributes) varied at a time in each of these sample tasks, and participants were asked to choose the better of the two models. The following model attributes were presented for each sample task: outcome, time horizon, sensitivity, specificity, sensitivity CI, and specificity CI. A response was scored as correct if the model with the larger sensitivity or specificity, or the narrower CI, was chosen. Because the goal was to use these sample tasks to assess the feasibility of interpreting complex presentations of information, rather than to elicit preferences for model characteristics themselves, we did not conduct a DCE in this study.
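As a rough illustration of this scoring rule (with hypothetical attribute values and object names, not the actual survey items), a single sample task comparing two models on CI width might be scored as follows in R:

# Two hypothetical models that differ only in the width of the specificity CI
model_a <- list(sensitivity = 0.80, specificity = 0.90, spec_ci = c(0.85, 0.95))
model_b <- list(sensitivity = 0.80, specificity = 0.90, spec_ci = c(0.70, 0.99))

# The "better" model has the narrower (smaller) confidence interval
better <- if (diff(model_a$spec_ci) < diff(model_b$spec_ci)) "A" else "B"

chosen  <- "A"               # a participant's hypothetical response
correct <- chosen == better  # TRUE: this response would be scored as correct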
We reported the primary outcome as the overall knowledge score, which is the mean
percentage of correct responses to all nine conceptual questions (two questions each
for understanding of three different statistical concepts plus three sample DCE tasks).
We also reported mean scores broken down by type of knowledge and concept.
Numeracy Measures and Demographics
Numeracy influences the interpretation of quantitative risk information.[17] [18] [33] Following the education modules and knowledge questions, we tested both objective and subjective numeracy because they each gauge distinct types of understanding and preferences.[34] We used the short Numeracy Understanding in Medicine Instrument (S-NUMi; range, 0–8)[35] and the Subjective Numeracy Scale (SNS; range, 1–6).[36] The order of these two instruments was randomized for each participant. The last section of the instrument asked for participants' age, gender, race, ethnicity, level of education, marital status, and number of prescription medications taken each day, a crude measure of comorbidity and health.[37] [38]
Statistical Analysis
The overall knowledge score, the mean percentage of correct responses to the nine questions, is reported with 95% binomial CIs. We evaluated differences in scores and participant characteristics between cohorts using chi-square tests and two-sample, unpaired, two-sided t-tests with α = 0.05 for categorical and continuous variables, respectively. We developed a multivariable linear regression model to examine the relationship between numeracy measures and the overall knowledge score. The following covariates were selected based on likely relevance and included in the multivariable model: age, gender, race, education level, and minutes spent on the survey. Pearson's correlation coefficients between continuous model inputs were reported in a correlation matrix. All analyses were conducted using the R language for statistical computing (version 3.3.1).[39] The deidentified data and source code used for these analyses are available online (https://github.com/gweissman/numeracy_pilot_study).
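The exact analysis code is in the repository above; as a hedged sketch of the general workflow described here (the data frame dat and its column names are assumed for illustration and are not taken from the repository), the core steps in R might look like:

# dat: one row per respondent, with knowledge score (%), numeracy, and demographics (assumed names)

# 95% binomial CI for an overall proportion of correct responses
binom.test(x = sum(dat$n_correct), n = sum(dat$n_questions))$conf.int

# Between-cohort comparisons (alpha = 0.05)
chisq.test(table(dat$cohort, dat$gender))      # categorical characteristics
t.test(knowledge_score ~ cohort, data = dat)   # continuous characteristics

# Multivariable linear regression of knowledge score on numeracy and covariates
fit <- lm(knowledge_score ~ snumi + sns + gender + race + education + age + duration_min,
          data = dat)
summary(fit)

# Pearson correlation matrix of continuous model inputs
cor(dat[, c("snumi", "sns", "age", "duration_min")], method = "pearson")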
Results
A total of 608 participants completed the survey; 69 were excluded for failing one or more attention checks and another five for completing the survey in less than 3 minutes ([Fig. 2]). Among the remaining 534 respondents, 273 (51.1%) were women and 439 (82.2%) self-identified as white. The median age was 51 years (interquartile range [IQR], 30–63; range, 18–81), and 249 (46.6%) participants had completed at least a 4-year college degree ([Table 1]).
Fig. 2 Patient enrollment and exclusions.
The second cohort (n = 254) was older (64 vs. 34 years, p < 0.001), took longer to complete the survey (17.2 vs. 13.6 minutes, p < 0.001), had higher mean subjective numeracy (4.8 vs. 4.6, p = 0.006), and included more women (63.6% vs. 40.4%, p < 0.001). The cohorts had similar mean objective numeracy (6.8 vs. 6.7, p = 0.186) and similar rates of bachelor or higher degrees (47.5% vs. 45.7%, p = 0.736). Given these similarities and differences, we reported all results both
overall and separately by cohort.
The overall knowledge score ranged from 11.1 to 100.0% (median number of correct responses, 9; IQR, 8–9). Verbatim knowledge was similar to gist knowledge for sensitivity and specificity, but significantly exceeded gist knowledge for CIs by 11.2% (p < 0.001; [Table 2]). The mean score for the subset of questions embedded in a DCE task was 88.5% (95% CI, 85.4–91.0). The second cohort scored higher than the first on the overall knowledge score (94.1% vs. 89.9%, p < 0.001). In the adjusted multivariable analysis, a one-point increase in the S-NUMi or SNS was associated with a 4.7% (95% CI, 3.9–5.6) or 2.2% (95% CI, 0.9–3.5) increase, respectively, in the overall knowledge score ([Table 3]). Age was weakly correlated with both measures of numeracy and with test duration ([Table 4]). The first 46 responses to the DCE-embedded question testing understanding of CIs (Question 9 in the [Supplementary Material], available in the online version) were excluded from all analyses due to an error in the survey instrument, which was corrected for all subsequent participants. This resulted in the exclusion of 42 participant responses, for that question only, that would otherwise have been included in the final analytic sample.
Table 3 Multivariable linear regression results

Variable | All: coefficient (95% CI), p-value | Cohort 1: coefficient (95% CI), p-value | Cohort 2: coefficient (95% CI), p-value
S-NUMi score | 4.7 (3.9–5.6), < 0.001 | 5.7 (4.4–6.9), < 0.001 | 1.9 (0.8–3.1), 0.001
SNS score | 2.2 (0.9–3.5), 0.001 | 3.1 (1.0–5.2), 0.004 | 1.0 (–0.5 to 2.4), 0.200
Male gender | –1.3 (–3.5 to 1.0), 0.215 | –3.93 (–7.3 to –0.5), 0.024 | 2.1 (–0.1 to 4.4), 0.066
White race | 5.8 (2.9–8.7), < 0.001 | 6.2 (2.1–10.2), 0.003 | 2.3 (–1.8 to 6.3), 0.267
Bachelor's degree or higher | –0.9 (–3.0 to 1.2), 0.397 | –2.7 (–6.1 to 0.7), 0.111 | 2.4 (0.1–4.7), 0.038
Age (y) | 0.0 (–0.01 to 0.1), 0.292 | –0.01 (–0.1 to 0.1), 0.772 | –0.1 (–0.3 to 0.1), 0.539
Study duration (min) | 0.1 (–0.1 to 0.1), 0.127 | 0.23 (–0.1 to 0.4), 0.077 | –0.1 (–0.1 to 0.2), 0.837

Abbreviations: CI, confidence interval; S-NUMi, Short Numeracy Understanding in Medicine Instrument; SNS, Subjective Numeracy Scale.
Note: Coefficient estimates represent the percent change in overall knowledge score associated with a one-unit change in each variable.
Table 4 Pearson's correlations (p-value) between continuous variables used in the multivariable prediction model

 | S-NUMi | SNS | Age (y) | Duration (min)
S-NUMi | 1.0 | 0.275 (< 0.001) | 0.108 (0.013) | 0.028 (0.51)
SNS |  | 1.0 | 0.133 (0.002) | 0.068 (0.12)
Age (y) |  |  | 1.0 | 0.293 (< 0.001)
Duration (min) |  |  |  | 1.0

Abbreviations: S-NUMi, Short Numeracy Understanding in Medicine Instrument; SNS, Subjective Numeracy Scale.
Discussion
These data provide preliminary evidence of the feasibility of interpreting statistical
concepts underlying the performance characteristics of a prediction model among a
nonclinical audience. The findings from this pilot study support the possibility of
using DCEs or other methods to elicit quantitatively expressed preferences for aspects
of clinical prediction models. Such an approach may increase the relevance of future
prediction models in real clinical decision-making scenarios. These findings also suggest several directions for future study across the wide range of populations for whom clinical prediction models are likely to be of use.
First, the inability to understand risk information limits the potential for widespread
deployment of patient-centered complex risk models. Objective numeracy as measured
by the S-NUMi was strongly associated with performance on the survey in both cohorts,
although the association was stronger in the younger cohort. Subjective numeracy,
on the other hand, was significantly associated with the knowledge score only in the
younger cohort, while the level of education was significant only in the older cohort.
This finding is consistent with prior work demonstrating that numeracy is positively
associated with understanding statistical risk information.[27]
[40] Adaptation of risk models based on patient preferences will require different approaches
in low-numeracy populations.
Second, the role of educational modules in explaining quantitative performance measures to a nonclinical audience remains unknown. Observed knowledge scores in this study compared favorably to understanding of statistical concepts in other populations, both for stand-alone questions and when concepts were embedded in complex DCE tasks. For example, in a review of six studies that assessed the ability of health care professionals to identify accuracy measures of diagnostic tests based on multiple choice definitions or written vignettes, sensitivity and specificity were correctly identified 76 to 88% and 80 to 88% of the time, respectively.[21] Similarly, in a sample of medicine residents, only 56.7% correctly determined which of two example tests had higher specificity.[41] Although the presentation and testing of statistical knowledge varied between these studies[21] [41] and the present study, we speculate that our participants scored higher because of the inclusion of didactic modules prior to testing. These modules presented icon arrays, integer annotations, and plain language explanations.[28] [29] Although the performance scores are not directly comparable across studies, the differences do suggest a potential role for such risk communication techniques in a nonclinical audience.
Third, further work is needed to explain age- and education-dependent influences on knowledge, and to describe potential interactions between the level of numeracy and knowledge of statistical concepts. Additionally, white race was associated with higher scores in the younger cohort, which may represent residual confounding due to factors not fully assessed in this survey.[42] [43] [44]
Fourth, this is the first study to demonstrate a difference between the interpretation of CIs as measured by verbatim and gist knowledge. Prior studies have demonstrated, using both quantitative and qualitative methods, that interpretation of CIs is difficult for the general population.[45] [46] The use of CIs to convey statistical uncertainty may even worsen understanding of risk information compared with the presentation of a point estimate alone.[27] [47] To date, the exact features of CIs, or the mechanisms of their interpretation, that cause confusion have not been elucidated. Although our survey instrument did provide helpful heuristics to guide interpretation of statistical information, our study design did not test heuristics explicitly. Given that verbatim knowledge of CIs was high in all groups while gist knowledge was markedly lower, we hypothesize that respondents may have been able to understand the numeric description of CIs but misinterpreted their qualitative comparison due to a “bigger is better” heuristic.[48] This heuristic is appropriate for sensitivity and specificity, but is reversed for the correct interpretation of CIs, where “smaller is better.” Gist interpretation typically relies on a qualitative assessment of what a numeric estimate means to the reader.[32] Without an explicit assessment of how each participant views uncertainty with respect to weather information, the categories provided in the multiple choice questions may not have encoded a standard meaning. Further work on the use and prompting of heuristics to aid understanding of quantitative features of prediction models and interpretation of CIs is warranted.
The strengths of this study include the testing of both gist and verbatim knowledge,
the adjustment for subjective and objective numeracy and other demographic factors,
and the representativeness with respect to age range, number of medications used in
an ambulatory population, and gender.[49] Additionally, this study is the first to test the interpretation of statistical
concepts as they describe prediction models in both stand-alone examples and when
embedded in a complex DCE sample task.
However, the results of this study should be interpreted in light of some limitations. First, this study conveyed predictive information in the context of weather examples, which may not elicit the same cognitive and affective decision-making mechanisms as those relating to health states.[30] [31] Second, the generalizability of these results is limited by the analytic sample, which was primarily white and of very high numeracy compared with the general population.[35] [36] [50] People with lower numeracy may be especially vulnerable to misinterpretations of these statistical concepts,[17] [18] [33] and thus are an important population in which further validation is warranted. Third, our study tested knowledge of these statistical concepts immediately following provision of the education modules, but we did not administer a pretest knowledge assessment. Therefore, we cannot draw conclusions about the efficacy of the education modules themselves in improving baseline knowledge. Future work should include a pretest baseline assessment to better characterize effective strategies for describing statistical concepts related to prediction models. Similarly, future testing should characterize the temporal duration of an intervention's effect on knowledge, which may decay with time,[51] and which may better distinguish between true knowledge and immediate recall. Fourth, the multiple-choice format limits more robust assessments of the ability to apply these statistical concepts, and may result in overly optimistic performance scores if participants employed other test-taking strategies.[52] Fifth, this study did not measure knowledge of false positive and false negative concepts directly; these concepts may be more relevant to the development of clinical prediction models than sensitivity and specificity, which are indirect measures of these error rates and which were tested in this study. Sixth, we did not perform reliability testing in the development of this pilot instrument, which may threaten the validity of the findings. Finally, because subjects were not actively screened and approached for recruitment, our study design cannot account for the self-selection of potential participants from the online platform who saw but chose not to complete the survey.
Conclusion
In conclusion, this study demonstrates that a nonclinical audience can interpret predictive model features such as sensitivity and specificity with high accuracy using both gist and verbatim knowledge. Such understanding remained high even when these concepts were embedded within a complex DCE task. These findings highlight the feasibility of using future DCEs to quantify preferences for tradeoffs between performance characteristics of predictive models, and suggest the need to validate these results in more generalizable patient populations.
Clinical Relevance Statement
The rapidly growing interest in and use of prediction models in health care settings warrant increased focus on patient preferences for predictive information. In order for clinical
prediction models to achieve significant impact in informing decisions about care,
they must incorporate different preferences for tradeoffs between false positive and
false negative errors, bias and variance, and performance across varying predictive
time horizons. Although preferences for these particular tradeoffs will likely exhibit
significant variation depending on the clinical scenario, this study demonstrates
the feasibility of assessing such preferences for model characteristics in a nonclinical
population. The incorporation of patient preferences into predictive model development
would better align with practices for cancer treatments, medical devices, and organ
transplant protocols, all of which are informed by research into patient preferences
for tradeoffs between their features.
Multiple Choice Questions
- In this study, interpretation of performance characteristics as measured by verbatim knowledge was high for which of the following:
  a. Sensitivity, confidence intervals, and time horizons.
  b. Specificity.
  c. Sensitivity and specificity.
  d. Sensitivity, specificity, and confidence intervals.
Correct Answer: The correct answer is option d. The proportion of correctly answered questions was greater than 90% when interpreting verbatim knowledge of sensitivity, specificity, and confidence intervals. This suggests participants were able to interpret the specific numbers associated with the model performance characteristics described in the question stems.
- In this study, interpretation of performance characteristics as measured by gist knowledge was high for which of the following:
  a. Sensitivity, confidence intervals, and time horizons.
  b. Specificity.
  c. Sensitivity and specificity.
  d. Sensitivity, specificity, and confidence intervals.
Correct Answer: The correct answer is option c. The proportion of correctly answered questions was greater than 90% when interpreting gist knowledge of sensitivity and specificity, but was considerably lower when interpreting confidence intervals. This suggests participants were less frequently able to interpret the “gist” meaning of the values (e.g., “good” or “bad” performance) of confidence intervals. We hypothesized this might be due to a “bigger is better” heuristic that works well for sensitivity and specificity, but fails for confidence intervals, where smaller is better.