Keywords: patellofemoral osteoarthritis - deep learning - prediction of osteoarthritis progression - knee
Introduction
Knee osteoarthritis (OA) is the most prevalent chronic joint disorder and involves
degeneration and loss of articular cartilage along with bony changes. High age and
body mass index (BMI) are strong risk factors for knee OA.[1] Structural knee OA often
leads to significant pain, stiffness, disability, and reduced quality of life for
affected individuals.[2] Current understanding of the OA disease process is inadequate
and, consequently, there is a lack of disease-modifying medical treatments. As a result,
knee OA continues to impose a significant burden on individuals and society.[3]
Although the patellofemoral (PF) joint is an important source of symptoms in knee
OA, the majority of the research on knee OA has focused on the tibiofemoral (TF)
joint.[4][5] Patellofemoral OA (PFOA) can be caused by a number of factors, including
previous injury to the knee, inflammation, biomechanical abnormalities, overuse of the
joint, obesity, and genetic predisposition.[3][6] Symptoms often include anterior knee
pain, especially when kneeling and squatting, as well as swelling and a grinding or
popping sensation when moving the knee (crepitus).[7] As the importance of the PF joint
in OA is increasingly acknowledged, the number of studies on it has been increasing.[3][8][9]
Still, more research is needed.[3]
Noninvasive imaging techniques play a crucial role in diagnosing and monitoring PFOA.
Without imaging, a confident diagnosis of PFOA is seldom possible.[10] X-ray imaging
is one of the primary diagnostic tools because of its low cost and wide availability.
Although radiography cannot visualize soft tissues, changes in the joint space and
bone structure are well depicted on X-rays. Several imaging biomarkers, such as
narrowing of the joint space, bony spurs, malalignment of the patella, bone sclerosis,
and cysts, are associated with PFOA ([Fig. 1]).[9][11]
Fig. 1 Example of PFOA progression/development. The left image shows an example
patellofemoral joint ROI imaged at the first visit in the MOST study. At baseline,
PFOA is not present. The right image shows the same participant's PF joint 7 years
after the baseline visit. The knee has developed PFOA: joint space narrowing and
osteophytes, both characteristic features of OA, are clearly seen. Best viewed on screen.
MOST, Multicenter Osteoarthritis Study; OA, osteoarthritis; PF, patellofemoral; PFOA,
patellofemoral osteoarthritis; ROI, region of interest.
In recent years, machine learning (ML) techniques have emerged as promising tools
to aid in the diagnosis of PFOA from X-ray images.[12][13] Both early diagnosis and
prediction of disease progression might be critical in the management of PFOA.
However, accurate and timely identification of PFOA progression based on X-ray images
can be challenging due to the complexity of the disease and the variability of knee
imaging. To date, no published studies have used ML-based models to predict future
PFOA development or progression from imaging data.
In this study, we introduced a deep learning-based framework to predict radiographic
progression of PFOA over a 7-year period from lateral radiographs, demographic data,
and symptom assessments (clinical data). We leveraged an attention mechanism in our deep
learning framework and proposed an end-to-end solution via a trainable attention module.
The results of this study have the potential to improve the early diagnosis and treatment
of PFOA, ultimately leading to improved patient outcomes and quality of life.
Materials and Methods
[Fig. 2] shows the overall pipeline of our study. We first located patellar landmarks using
BoneFinder software[14] ([Fig. 3]). Those anatomical landmarks were then used to align the
patellar bone consistently across knees, eliminating rotation variance.
Fig. 2 (A) Illustration of the workflow of our approach. The localization and alignment of
the patellofemoral (PF) joint in lateral knee X-rays were performed based on the anatomical
landmarks of the patellar bone (BoneFinder). Intensity normalization was then applied.
Finally, each lateral knee image was rotated so that the patella was aligned. After
localizing the PF joint ROI, a deep convolutional neural network (CNN) model was used
for predicting the progression of patellofemoral osteoarthritis (PFOA). (B) For comparison,
a separate machine learning model (gradient boosting machine [GBM]) was trained based on
clinical variables including age, sex, body mass index (BMI), the total Western Ontario
and McMaster Universities Arthritis Index (WOMAC) score, and Kellgren and Lawrence (KL)
score of the tibiofemoral joint. We used a stratified subject-wise five-fold
cross-validation setting to measure the performance of all the models. (C) In addition to
these individual models, we fused the predictions from these models in a second-layer GBM
model to improve the overall prediction performance. ROI, region of interest.
Fig. 3 Illustration of automated ROI localization. First, the patellar height (h) is
determined using landmarks. Subsequently, a small margin (Δ) is padded around the
patellar region. On the femur side, the ROI is extended such that its width equals
its height. Best viewed on screen. ROI, region of interest.
The image preprocessing step involved normalizing intensity using global contrast
normalization and truncating the histogram between the 5th and 99th percentiles. Subsequently,
we used patellar landmarks to locate the patellofemoral joint region of interest
(PFJROI) in lateral knee radiographs. To match the orientation of the left knee images,
the right knee region of interest (ROI) images were horizontally flipped. We then
used a deep convolutional neural network (CNN) to predict PFOA progression within
7 years. Additionally, we trained an ML model (GBM[15]) on clinical features as a
reference method for comparison with the proposed approach. Finally, to increase
predictive power, we trained an ensemble model using both imaging and clinical data.
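The intensity preprocessing described above can be illustrated with a minimal sketch in plain Python. The order of operations (percentile truncation first, then global contrast normalization), the function names, and the list-of-rows image representation are assumptions made for illustration; the text does not specify them.

```python
import math

def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 100]) of a pre-sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def preprocess_roi(image, is_right_knee, low_q=5.0, high_q=99.0, eps=1e-8):
    """image: 2-D list (rows of pixel intensities). Returns a normalized copy."""
    flat = sorted(v for row in image for v in row)
    lo, hi = percentile(flat, low_q), percentile(flat, high_q)
    # Truncate the histogram between the 5th and 99th percentiles.
    clipped = [[min(max(v, lo), hi) for v in row] for row in image]
    # Global contrast normalization: zero mean, unit standard deviation.
    # ASSUMPTION: applied after truncation; the source does not state the order.
    n = len(flat)
    mean = sum(v for row in clipped for v in row) / n
    var = sum((v - mean) ** 2 for row in clipped for v in row) / n
    std = math.sqrt(var)
    normed = [[(v - mean) / (std + eps) for v in row] for row in clipped]
    # Mirror right knees horizontally so they share the left-knee orientation.
    if is_right_knee:
        normed = [row[::-1] for row in normed]
    return normed
```

In practice an array library would replace the nested lists; the sketch only makes the three steps explicit.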
Data
We used data from the Multicenter Osteoarthritis Study public use datasets (MOST,
http://most.ucsf.edu). MOST is a longitudinal observational study that aims to identify
factors affecting the occurrence and progression of OA. The study enrolled 3,026
participants aged 50 to 79 years who either had radiographic knee OA or were at high risk
of developing the disease. The participants were followed up for 84 months, during which
clinical assessments were conducted and radiological data were collected. In the study,
semiflexed lateral view radiographs were acquired according to a standardized protocol.
Knee radiographs were evaluated at baseline and at the 15-, 30-, 60-, and 84-month
follow-up visits.
In this study, we employed lateral radiographs acquired at the baseline visit from
both left and right legs, comprising 3,276 knees (1,832 subjects) that did not
have PFOA at the time of the first examination. The number of knees that developed
PFOA (progressors) was 403 (12%), and the number of knees that did not develop PFOA
was 2,873 (88%). Selected knees were required to have PFOA assessments from lateral
radiographs and KL grades from posteroanterior radiographs, all performed at baseline.
Among these, we selected only knees whose PFOA status within the following 7 years
could be assessed (progressor vs. nonprogressor). For example, participants who dropped
out of the study before the last follow-up time point and had not developed PFOA at the
previous time points were excluded. See [Fig. 4] and [Table 1] for the subject flow
diagram and demographics.
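The selection rule above can be sketched as a small labeling function. This is an illustrative reading of the text, not the study's actual selection code: the function name and the encoding of follow-up status (True, False, or None for a missing assessment) are hypothetical.

```python
def progression_label(followup_pfoa):
    """Label a baseline non-PFOA knee from its follow-up PFOA assessments.

    followup_pfoa: list ordered by visit (e.g., 15, 30, 60, 84 months), where
    True = PFOA present, False = PFOA absent, None = not assessed.
    Returns "progressor", "nonprogressor", or None (knee excluded).
    """
    # Any positive follow-up reading makes the knee a progressor.
    if any(status is True for status in followup_pfoa):
        return "progressor"
    # A nonprogressor must be confirmed PFOA-free at the last follow-up.
    if followup_pfoa and followup_pfoa[-1] is False:
        return "nonprogressor"
    # Dropped out before the last visit without developing PFOA: status unknown.
    return None
```

For example, a knee read as `[False, False, None, None]` is excluded, because its 7-year status cannot be determined.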
Fig. 4 Chart shows the selection of knees from the MOST study used in this work. MOST, Multicenter
Osteoarthritis Study; PFOA, patellofemoral osteoarthritis.
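Table 1
Demographics of the data used in this study (subset of Multicenter Osteoarthritis Study)

| PFOA | Number of females | Number of males | Age, mean | Age, Std | BMI, mean | BMI, Std | WOMAC, mean | WOMAC, Std | KL, mean | KL, Std |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nonprogressor | 1,665 (58%) | 1,208 (42%) | 61.2 | 7.7 | 29.6 | 5.2 | 14.6 | 14.8 | 0.7 | 1.0 |
| Progressor | 281 (70%) | 122 (30%) | 61.1 | 7.6 | 32.8 | 6.3 | 24.1 | 17.1 | 1.7 | 1.3 |
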
Abbreviations: BMI, body mass index; KL, Kellgren and Lawrence; PFOA, patellofemoral
osteoarthritis; WOMAC, Western Ontario and McMaster Universities Arthritis Index.
In the MOST public use datasets, radiographic PFOA is defined from lateral view radiographs
as follows: osteophyte score ≥2, or joint space narrowing (JSN) score ≥1 plus
any osteophyte, sclerosis, or cyst score ≥1 in the PF joint (grades 0–3; 0 = normal,
1 = mild, 2 = moderate, 3 = severe). Unlike TF joint OA assessment (KL grading ranging
from 0 to 4), OA in the PF joint was described as either present or absent, without a
severity grading. In this study, the term “progression” refers to both progression of
existing OA and development of OA in previously nonaffected PF joints (incidence). For
example, knees that showed minor signs of PFOA (e.g., osteophyte score = 1) at baseline,
which are still considered non-PFOA cases, might experience worsening of an existing
abnormality in the following years and be diagnosed with PFOA (progression). Similarly,
knees that did not show any signs of PFOA at baseline might develop the disease
for the first time during the following 7 years (incidence). In MOST, individual
radiographic features were graded by two independent expert readers, and when there
was a disagreement in film readings, a panel of three adjudicators resolved the discrepancies.[16]
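The radiographic definition above translates directly into a boolean rule. The sketch below assumes each feature is available as an integer grade from 0 to 3; the function and argument names are hypothetical.

```python
def has_radiographic_pfoa(osteophyte, jsn, sclerosis, cysts):
    """Apply the PFOA definition stated in the text to PF joint grades (0-3)."""
    # Criterion 1: osteophyte score of at least 2.
    if osteophyte >= 2:
        return True
    # Criterion 2: JSN of at least 1 plus any osteophyte, sclerosis, or cyst
    # score of at least 1.
    return jsn >= 1 and max(osteophyte, sclerosis, cysts) >= 1
```

Under this rule, a knee with only an osteophyte score of 1 returns `False`, which matches the example of a baseline knee with minor signs that is still considered a non-PFOA case.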
Selection of Regions of Interest
We placed a PFJROI automatically using landmarks ([Fig. 3]). The height of the patellar
bone (h) was used to locate a square-shaped image ROI. Once the patellar bone margins were
determined using landmarks, a 20-pixel (Δ/2) region was padded around the bone. On
the femur side, the ROI was extended to capture the part of the femur facing the patellar
bone such that the width of the ROI equals its height (height = width = h + Δ).
Consequently, the size of the ROI is proportional to the size of the patellar bone.
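The ROI geometry described above can be sketched as follows. This is a minimal illustration under stated assumptions: landmarks are (x, y) pixel tuples in image coordinates (y increasing downward), and the `femur_on_right` flag, which says on which side of the patella the femur lies, is a hypothetical parameter introduced here.

```python
def pf_joint_roi(landmarks, pad=20, femur_on_right=True):
    """Return a square ROI (left, top, right, bottom) around the PF joint.

    landmarks: iterable of (x, y) patellar landmark coordinates in pixels.
    pad: margin added on each side of the patella (Delta / 2 in the text).
    """
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    h = max(ys) - min(ys)        # patellar height
    side = h + 2 * pad           # height = width = h + Delta
    top = min(ys) - pad
    if femur_on_right:
        # Pad the anterior patellar margin, then extend toward the femur.
        left = min(xs) - pad
    else:
        left = max(xs) + pad - side
    return left, top, left + side, top + side
```

Because the side length depends only on h and the fixed margin, larger patellae yield proportionally larger ROIs, which are later resized to a common input size.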
Predicting Progression of Patellofemoral Osteoarthritis Using Deep Convolutional Neural
Network
We adopted the deep CNN architecture proposed by Yan et al[17] to predict PFOA development
based on the baseline imaging data. It uses a VGG-16[18] backbone with two additional
attention layers and one penultimate global feature vector (obtained via global average
pooling; [Fig. 2]). PFJROI images were preprocessed by resizing them to 256 × 256 pixels
and then applying a random crop of size 224 × 224 pixels. The backbone network VGG-16 was
initialized with weights pretrained on ImageNet. The attention modules were initialized
using He et al's initialization.[19] We employed focal loss,[20] a variant of the
cross-entropy loss that has been shown to be effective under class imbalance because it
selectively downweights well-classified examples. We used a batch size of 32 and trained
the network end-to-end for 45 epochs using stochastic gradient descent with momentum.
The initial learning rate was 0.001, and it was decayed by a factor of 10 every 10 epochs.
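To make the loss and the learning-rate schedule concrete, the sketch below implements the binary form of focal loss for a single example and a step decay schedule. The values gamma = 2 and alpha = 0.25 are the commonly cited defaults from the focal loss literature and are assumptions here; the text does not report the values used.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example.

    p: predicted probability of the positive class (progressor).
    y: true label, 1 or 0.
    ASSUMPTION: gamma and alpha are illustrative defaults, not study values.
    """
    p_t = p if y == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)           # guard against log(0)
    # (1 - p_t)^gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def step_lr(epoch, base_lr=0.001, step=10, factor=0.1):
    """Learning rate decayed by a factor of 10 every 10 epochs."""
    return base_lr * factor ** (epoch // step)
```

With gamma = 0 the expression reduces to class-weighted cross-entropy, which shows that the focusing term is the only difference between the two losses.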
To examine the impact of the attention mechanism on the model's performance, a separate
training was conducted with the original VGG-16 network without the attention modules.
The network parameters were initialized with ImageNet pretraining, and the last layer
was modified for binary classification. To ensure a fair comparison, we maintained
consistency in the other network parameters and hyperparameters between the attention
model and the model without attention.
Attention Module
Previous deep learning works that employ post hoc analysis for visual explanations,
such as Grad-CAM,[21] require extra computation on a fully trained classification network
and rely on gradient information passed to the last convolutional layer combined with the
forward activation maps. However, the feature maps that are often used to produce such
explanations are not necessarily related to the target class, and they do not affect the
network parameters at all. In this study, we employed a trainable spatial attention
mechanism to produce insights into the model decisions. Attention mechanisms are widely
used in the field of natural language processing (NLP) as a way to improve the performance
of models by emphasizing the important parts of the information.[22] In the case of image
classification, the idea of trainable attention is to focus on the most informative parts
of an image while ignoring less relevant or noisy parts. During training, the network
learns to weight different regions of the input image based on the classification
performance. See [Supplementary Material S1] (available in the online version) for more
details of the attention module used in our architecture.
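The general idea of trainable spatial attention can be illustrated with a generic sketch: each spatial location receives a compatibility score, the scores are normalized into attention weights, and the local features are pooled with those weights. This is not the module used in the study (its details are in the Supplementary Material); the additive scoring with a ReLU, the softmax normalization, and all names are assumptions chosen for illustration.

```python
import math

def spatial_attention(local_feats, global_feat, w):
    """Generic attention-weighted pooling over spatial locations.

    local_feats: list of feature vectors, one per spatial location.
    global_feat: global feature vector of the same length.
    w: learnable scoring vector (fixed here; learned during training in practice).
    Returns (attention weights, attention-pooled feature vector).
    """
    scores = []
    for feat in local_feats:
        # Combine local and global information, then project to a scalar score.
        hidden = [max(0.0, f + g) for f, g in zip(feat, global_feat)]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    # Softmax over locations (shifted by the maximum for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    # Weighted sum of local features: informative regions dominate the result.
    dim = len(local_feats[0])
    pooled = [sum(a * feat[d] for a, feat in zip(attn, local_feats))
              for d in range(dim)]
    return attn, pooled
```

Reshaping the weights back to the spatial grid gives an attention map that can be overlaid on the radiograph.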
Reference Models
We employed GBM to predict the development of PFOA from demographic data and self-reported
symptom assessments. GBM is a popular and powerful ML algorithm used for regression
and classification based on ensembles of decision trees.[15] It works iteratively by
adding decision trees to the model, where each new tree attempts to correct the errors
made by the previous trees. In this study, we used an efficient implementation of GBM
called LightGBM.[23]
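The iterative error-correcting principle of gradient boosting can be shown with a toy implementation that uses one-feature decision stumps and squared-error loss. This is a didactic sketch only, not LightGBM: the loss, the stump learner, and all names are assumptions made to keep the example short.

```python
def fit_stump(xs, residuals):
    """Find the threshold on one feature that best fits the residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:
        raise ValueError("feature has a single unique value; no split possible")
    return best[1:]

def fit_boosting(xs, ys, n_trees=50, learning_rate=0.1):
    """Fit stumps sequentially, each one to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]  # errors of the ensemble so far
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lm, rm))
        preds = [p + learning_rate * (lm if x <= t else rm)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def predict(model, x):
    """Sum the base prediction and the shrunken stump outputs."""
    base, learning_rate, stumps = model
    return base + sum(learning_rate * (lm if x <= t else rm)
                      for t, lm, rm in stumps)
```

A production model would use multiple features, deeper trees, and a classification loss, but the loop structure (fit to residuals, add with shrinkage) is the same.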
We built three GBM classifiers based on clinical data and risk factors: age, sex, BMI,
the total Western Ontario and McMaster Universities Arthritis Index (WOMAC) score, and
the KL grade of the TF joint (Model1, Model2, and Model3 in [Table 2]). The WOMAC score
is a widely used questionnaire-based assessment tool designed to evaluate the severity
of pain, stiffness, and physical disability in patients with OA of the knee and hip.
Table 2
Comparison of the developed models

| Model | Input | Method | AUC (95% CI) | AP (95% CI) | Brier score | Model type |
| --- | --- | --- | --- | --- | --- | --- |
| Model1 | Age, sex, BMI | GBM | 0.655 (0.624, 0.684) | 0.232 (0.205, 0.268) | 0.103 | Clinical model |
| Model2 | Age, sex, BMI, WOMAC | GBM | 0.707 (0.678, 0.732) | 0.265 (0.231, 0.299) | 0.100 | Clinical model |
| Model3 | Age, sex, BMI, WOMAC, KL | GBM | 0.767 (0.740, 0.789) | 0.334 (0.293, 0.377) | 0.095 | Clinical model |
| Model4 | VGG-16 | CNN | 0.832 (0.812, 0.851) | 0.400 (0.359, 0.444) | 0.262 | CNN model |
| Model5 | VGG-16-Attn | CNN | 0.856 (0.838, 0.872) | 0.431 (0.387, 0.475) | 0.165 | CNN model |
| Model6 | Predictions from Model3 and Model5 | GBM | 0.865 (0.849, 0.880) | 0.447 (0.404, 0.491) | 0.084 | Stacked model |

Abbreviations: AP, average precision; AUC, area under the receiver operating characteristic
curve; BMI, body mass index; CNN, convolutional neural network; GBM, gradient boosting
machine; KL, Kellgren and Lawrence; PFOA, patellofemoral osteoarthritis; WOMAC, Western
Ontario and McMaster Universities Arthritis Index.
AUC and AP indicate the area under the receiver operating characteristics curve and
the area under the Precision–Recall curves, respectively. The 95% confidence intervals
in parentheses were given based on a five-fold cross-validation setting.
For all of our models, we used subject-wise stratified five-fold cross-validation.
This involves dividing the dataset into five folds such that all data from a given
subject (e.g., both knees) fall into the same fold, and stratifying the folds so that
the proportion of progressors versus nonprogressors in each fold is similar to that of
the overall dataset. This helps to eliminate subject-dependent bias between the training
and validation sets.
K-fold cross-validation involves iteratively selecting one fold as the testing set
and the remaining folds as the training set. The model is trained on the training
set and evaluated on the testing set. This process is repeated for each fold, with
each fold serving as the testing set exactly once.
To ensure fair comparisons, we used the same folds for all of the models. All of the
models were trained separately and the reported performances were derived from these
separate models.
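A subject-wise stratified split of this kind can be sketched in a few lines. The sketch assigns whole subjects to folds and balances progressors across folds; treating a subject as a progressor if any of their knees progressed is an assumption made here, as are the function name and the round-robin assignment.

```python
import random
from collections import defaultdict

def subject_wise_stratified_folds(subject_ids, labels, n_folds=5, seed=0):
    """Return a fold index for every knee, keeping each subject in one fold.

    subject_ids: subject identifier per knee.
    labels: 1 for progressor knees, 0 for nonprogressor knees.
    """
    knees_by_subject = defaultdict(list)
    for i, s in enumerate(subject_ids):
        knees_by_subject[s].append(i)
    # ASSUMPTION: a subject's stratum is 1 if any of their knees progressed.
    strata = defaultdict(list)
    for s, idxs in knees_by_subject.items():
        strata[max(labels[i] for i in idxs)].append(s)
    rng = random.Random(seed)
    fold_of_knee = [None] * len(subject_ids)
    offset = 0
    for stratum in sorted(strata):
        subjects = sorted(strata[stratum])
        rng.shuffle(subjects)
        # Deal subjects round-robin so each fold gets a similar share per stratum.
        for k, s in enumerate(subjects):
            for i in knees_by_subject[s]:
                fold_of_knee[i] = (k + offset) % n_folds
        offset += len(subjects)
    return fold_of_knee
```

Fixing the seed and reusing the returned fold indices for every model gives the identical folds needed for a fair comparison.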
Statistical Methods
The performance of the models was compared using receiver operating characteristics
(ROC) curves, precision–recall (PR) curves, and the Brier score.[24] ROC curves plot the
true positive rate (TPR) against the false positive rate at various classification
thresholds. The area under the ROC curve (AUC-ROC) is often used as a summary metric for
model performance, with a value of 1 indicating perfect classification and 0.5 indicating
random classification. PR curves, on the other hand, plot the precision (positive
predictive value) against the recall (TPR) at various classification thresholds. The area
under the PR curve (average precision, AP) is another commonly used summary metric, with
a value of 1 indicating perfect classification and a value equal to the proportion of
positive instances indicating random classification. ROC curves are widely used as a
general measure of discrimination, whereas PR curves are more informative when the number
of positive instances is relatively small, as in our dataset. In general, a good
classifier should have high values for both AUC-ROC and AP. The Brier score equals the
mean squared error of the predicted probabilities. To compare the differences between
model AUCs, we applied DeLong et al's test.[25]
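The three summary metrics can be computed from labels and predicted probabilities as in the sketch below. It uses the pairwise (Mann-Whitney) formulation of AUC and the step-wise definition of AP, and it assumes no tied scores when computing AP; the function names are illustrative.

```python
def roc_auc(y_true, y_score):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, y_score):
    """Mean of the precision values at the rank of each positive example."""
    ranked = sorted(zip(y_score, y_true), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

y = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y, scores))            # 0.75
print(average_precision(y, scores))  # 0.8333...
print(brier_score(y, scores))        # approximately 0.158125
```

The pairwise AUC is quadratic in the number of knees, which is acceptable for a dataset of this size; rank-based implementations are preferable for larger data.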
Results
[Table 2] and [Fig. 5] show the performance of the different models in predicting PFOA
progression. Our proposed VGG-16-Attn model achieved the highest AUC of 0.856 (0.838,
0.872) and AP of 0.431 (0.387, 0.475) among the individual models (Model1 to Model5).
We compared the performance of VGG-16-Attn with the original VGG-16 model to assess the
contribution of the attention modules. Our results show that the addition of attention
modules has a positive impact on the performance of the model, with a statistically
significant difference between the AUC values of the two models (DeLong's p-value = 0.00018).
Fig. 5 (A) ROC and (B) PR curves demonstrating the performance of the models. Area under the curves and
95% confidence intervals in parentheses were given based on a five-fold cross-validation
setting. Dashed lines in ROC indicate the performance of a random classifier, and
in case of PR it indicates the distributions of the labels of the dataset (progressor
vs. nonprogressor). AUC, area under the receiver operating characteristic curve; BMI,
body mass index; KL, Kellgren and Lawrence; PR, precision–recall; ROC, receiver operating
characteristics; WOMAC, Western Ontario and McMaster Universities Arthritis Index.
To assess the value of imaging biomarkers in predicting PFOA progression, we evaluated
reference models built on various risk factors, including age, sex, BMI, WOMAC, and
TFOA KL scores ([Fig. 5]). Using GBM, we trained models to predict the probability
of developing PFOA based on different combinations of these risk factors. The
best-performing reference model (Model3) incorporated age, sex, BMI, WOMAC, and TFOA KL
scores, achieving an AUC of 0.767 (0.740, 0.789) and an AP of 0.334 (0.293, 0.377;
[Fig. 5]). We also measured the impact of each feature on the model's output by comparing
the predicted outcome with what it would be if the feature were not included in the model
(SHapley Additive exPlanations[26]; [Supplementary Figs. S1] and [S2], available in the
online version). High BMI, WOMAC, and KL scores increase the predicted PFOA progression
risk, and low BMI, WOMAC, and KL scores reduce it.
Subsequently, we compared the performance of our deep CNN attention model (VGG-16-Attn,
Model5) to the best-performing reference method (Model3). Our results showed a statistically
significant difference between the AUC values of the two models (DeLong's p-value < 1e−10).
To further improve predictive accuracy, we used a second-layer GBM model that fused
the predictions of the VGG-16-Attn CNN model (Model5, imaging features) and the strongest
reference model (Model3, clinical assessments) ([Figs. 2C] and [6]). This stacked model
(Model6) achieved the best AUC of 0.865 (0.849, 0.880), an AP of 0.447 (0.404, 0.491),
and a Brier score of 0.084, outperforming both individual models. The increase in AUC
between the stacked model (Model6) and the VGG-16-Attn CNN model (Model5) was
statistically significant (DeLong's p-value = 0.0085), although the absolute improvement
was small (0.865 vs. 0.856).
Fig. 6 (A) ROC and (B) PR curves demonstrating the performance of the attention model (VGG-16-Attn), best
clinical model (Model3), and stacked model (Model6). Area under the curves and 95%
confidence intervals in parentheses were given based on a five-fold cross-validation
setting. Dashed lines in ROC indicate the performance of a random classifier and in
case of PR it indicates the distributions of the labels of the dataset (PFOA vs. non-PFOA).
AUC, area under the receiver operating characteristic curve; BMI, body mass index;
KL, Kellgren and Lawrence; PFOA, patellofemoral osteoarthritis; PR, precision–recall;
ROC, receiver operating characteristics; WOMAC, Western Ontario and McMaster Universities
Arthritis Index.
Examples of spatial attention maps are presented in [Fig. 7]. The shallower attention
map, which is applied after the conv3 layer, focuses on more general and diffuse areas.
Therefore, we present here only the deeper attention map (after the conv4 layer in
[Fig. 2]). In several cases, the model paid attention to the PF joint space and the
inferior and posterior regions of the patellar bone. Additional examples of such attention
maps are presented in [Supplementary Figs. S3] and [S4] (available in the online version).
Fig. 7 Examples of attention maps of two progressor knees from the dataset. The first column
shows the baseline radiographs, in which the knee does not yet have PFOA. The middle column
illustrates the attention maps, and the last column presents the final follow-up
radiographs. PFOA, patellofemoral osteoarthritis.
Discussion
This study presents a novel deep learning-based approach for predicting progression
of PFOA, utilizing both clinical variables and imaging data. The results demonstrate
the potential of ML techniques, especially deep learning, in predicting PFOA progression,
which could provide valuable information for clinicians in patient care.
In general, ML-based models can handle heterogeneous data and can identify patterns
that may not be apparent to human experts. We highlighted this by including both
clinical variables and imaging data in the stacked model. This combined model achieved
the highest performance (AUC and AP) in predicting PFOA progression, indicating its
ability to differentiate between patients who are likely to develop PFOA and those
who are not. However, it should be noted that the performance gain of the stacked model
(AUC = 0.865, AP = 0.447) over the imaging-based model (AUC = 0.856, AP = 0.431) was
minor and, although statistically significant, is probably not clinically significant.
This suggests that clinical variables contribute only marginally to the prediction
performance on top of the X-ray image alone. As in the case of knee OA progression
prediction,[27] a lateral knee X-ray image appears to already contain, indirectly, much
of the clinical information, such as age and BMI.
Our study confirmed that high BMI, high WOMAC score, female sex, and OA in the TF
joint (KL score) are all risk factors for PFOA development ([Supplementary Figs. S1]
and [S2] [available in the online version] and [Table 1]). Among the three main
demographic variables (age, sex, and BMI; Model1), BMI had the strongest predictive
capability.
It has been reported earlier that the use of attention mechanisms increases the performance
of NLP models.[22][28] Here, we observed a similar increase in performance in an image
classification task (AUC = 0.856 vs. 0.832, AP = 0.431 vs. 0.400). Besides the increase
in overall model performance, the generated attention maps highlighted the joint space
and the regions where osteophytes typically occur. These regions are known to be affected
in PFOA, and they reflect manually assessed imaging biomarkers of OA, including JSN and
morphological and structural changes in bone.
The present study is unique as it investigated the potential of ML approaches based
on imaging data to accurately predict PFOA progression for the first time. However,
there are also some limitations of this study. First and foremost, the model was trained
on data from a single population, and further research is necessary to validate the
model's generalizability to other populations and settings. Additionally, the study
did not consider other potential predictors of PFOA progression, such as biomechanical
or genetic factors. Incorporating longitudinal data and other types of imaging data,
such as MRI, could further improve the model.
In conclusion, our study demonstrates the potential of ML models to predict PFOA progression
using imaging and clinical variables. These models could assist in identifying patients
at high risk of PFOA progression, enabling clinicians to intervene with personalized
treatment plans and potentially prevent or delay disease progression.
Summary
We compared the performances of deep CNN-based models and GBM-based models using clinical
variables including age, sex, BMI, the total WOMAC score, and KL score of the TF joint.
Our results demonstrated that imaging biomarkers contain useful information for predicting
PFOA progression within 7 years. Moreover, the addition of clinical data slightly improves
the predictive power of the imaging-based model, although the clinical significance
of this performance gain is unknown.