Keywords
deep learning - epilepsy - EEG - interictal epileptiform discharges - IED
Introduction
Interictal epileptiform discharges (IEDs) are electroencephalographic (EEG) patterns
associated with an increased likelihood of epileptic seizures.[1][2] The gold standard for IED detection in an EEG is visual analysis by experts[3]; however, this has several drawbacks, including the extensive analysis time required.[4] For this reason, computer-assisted IED detection with algorithms that match or outperform experts has been developed, aiming to reduce the time and resources spent on visual analysis.[5] Automated IED detection is complex due to the similarity of IEDs to normal transients.[6] Several approaches have been used for automated IED detection, and one of the more recent is deep learning.[5][7][8][9][10] An advantage of deep learning is that a predefinition of IED features is not necessary.[7]
One prerequisite for the eventual implementation of such a deep neural network in clinical practice is a performance sufficiently high to rival expert visual analysis. One previous study using 50 EEGs and a deep neural network for IED detection showed a sensitivity of 47% and a specificity of 98%.[7] To increase performance, first, the deep neural network was made more complex and, second, the scarce IED input samples were augmented through temporal shifting and the use of different montages, leading to a sensitivity increase from 63 to 96%, with a specificity of 99%.[8] Jing and colleagues have shown that their SpikeNet deep neural network can exceed expert performance, although with more resources: 9,571 EEGs.[10]
Another approach to increase performance is the application of a second-level, consecutive deep neural network as a postprocessing step. This approach has been used in two other fields (electronics and nephrology) as a way of reducing noise in datasets, suggesting an improvement in deep neural network performance,[11][12] but it has not yet been applied to IED detection. The benefit of such a two-step approach would be that the first deep neural network can be trained to focus on, and perform excellently at, filtering out artefacts. In the second step, other features may be more important for discerning the actual IEDs, possibly improving overall performance; and because the data are less noisy, this may be achievable with a shallower network and fewer inputs. The cognitive model behind this can be viewed as the synergy between a general practitioner and a neurologist: the general practitioner is the first-level network, presenting a filtered patient population to the neurologist, the second-level network, which makes detection of the purely neurological conditions easier. Therefore, the aim of our current study was to investigate whether such a deep learning postprocessing step improves the performance of IED detection in EEGs with limited resources.
Materials and Methods
EEG Data and Preprocessing
We used 17 interictal 24-hour ambulatory EEGs randomly selected from the digital database
of the Medisch Spectrum Twente, in the Netherlands, which were not previously used
for training of deep neural networks. All EEGs were obtained as part of routine care,
and anonymized before analysis. The Medical Ethical Committee Twente waived the need
for informed consent for EEG monitoring acquired as part of routine care (K24-07).
The EEGs were retrospectively accessed on April 14, 2022 and July 3 and 6, 2023, and
the identity of the participants was kept unknown to the authors. There were 15 EEGs from patients with focal epilepsy and 2 from patients with generalized epilepsy. Patients were aged 4 to 80 years, with a median of 19 years and a 25th to 75th percentile of 14 to 32 years.
The EEG data were filtered in the 0.5- to 30-Hz range, down-sampled to 125 Hz, and
split into nonoverlapping epochs of 2 seconds in Matlab R2021a (The MathWorks, Inc.,
Natick, MA, United States) to prevent using datapoints more than once, resulting in
an 18 × 250 matrix for each epoch in the longitudinal bipolar montage. These epochs
were used as input for the VGG-C-based convolutional neural network (CNN) previously
developed by da Silva Lourenço and colleagues,[8] which was trained on both routine and long-term ambulatory registrations, and on both normal EEGs and EEGs with IEDs. The model architecture details and prediction capabilities
of the first-level CNN are reported in previous work.[8] The output of this deep neural network was the probability that each of the epochs
contains an IED. The epochs that corresponded to an IED probability of at least 0.99
were selected. This resulted in 3,629 epochs, which were used as the input for the
second postprocessing deep neural network, developed for this study. Thus, the first-level
VGG-C-based CNN was used to preselect or filter the EEG epoch inputs for the second
postprocessing CNN. This process is illustrated in a flowchart in [Fig. 1].
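For illustration, a minimal Python sketch of the preprocessing and first-level preselection described above is given below; the original filtering and epoching were performed in MATLAB, and the loading function, file names, and filter order are hypothetical assumptions, while the shapes and the 0.99 threshold follow the text.

```python
# Illustrative sketch (not the authors' exact MATLAB code) of the preprocessing
# and first-level preselection described above. The loader, file names, and
# filter order are assumptions; shapes and thresholds follow the text.
import numpy as np
from scipy.signal import butter, filtfilt, resample_poly

def preprocess(eeg, fs, fs_new=125, epoch_s=2):
    """eeg: 18 x N array in the longitudinal bipolar montage, sampled at fs Hz."""
    b, a = butter(4, [0.5, 30.0], btype="band", fs=fs)   # 0.5- to 30-Hz band-pass
    eeg = filtfilt(b, a, eeg, axis=1)
    eeg = resample_poly(eeg, fs_new, int(fs), axis=1)    # down-sample to 125 Hz
    n = epoch_s * fs_new                                 # 250 samples per epoch
    n_epochs = eeg.shape[1] // n                         # nonoverlapping 2-s epochs
    return np.stack([eeg[:, i * n:(i + 1) * n] for i in range(n_epochs)])

# First-level preselection at an IED probability of at least 0.99 (hypothetical names):
# from tensorflow import keras
# epochs = preprocess(load_eeg("recording.edf"), fs=256)
# vggc = keras.models.load_model("first_level_vggc.h5")
# p_ied = vggc.predict(epochs[..., np.newaxis]).ravel()
# selected = epochs[p_ied >= 0.99]
```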
Fig. 1 Postprocessing deep learning neural network process flowchart. Second-level postprocessing
deep neural network (convolutional neural network [CNN] developed in this study) for
improvement of EEG interictal epileptiform discharge (IED) detection by a first-level
deep neural network (VGG-C-based CNN previously developed by da Silva Lourenço and colleagues).
For this second postprocessing deep neural network, we used supervised learning in
order to assess its errors and improve its performance. Thus, each epoch selected
from the first deep neural network was visually labeled by one of the authors (G.V.A.),
assigning a score of 1 for epochs containing an IED and 0 for those not containing
an IED (non-IED). This was performed in a MATLAB App developed for this purpose with
a graphical user interface (GUI) as shown in [Supplementary Fig. S1] (available in the online version). Examples of each of those epochs (IED and non-IED)
are shown in [Fig. 2]. Data were divided into an 80/20 training/validation and a test set, where epochs
from a particular patient were used for either training/validation or testing, resulting
in 14 EEGs (3,049 epochs) for training/validation and 3 EEGs (580 epochs) for testing.
The 80/20 training/validation split was applied because of the limited amount of data and its class imbalance. A total of 3,629 EEG epochs with 1,136 true IEDs was deemed a sufficient sample size based on a previous study by Cho and colleagues, which suggested that 1,000 inputs per prediction class gave good performance in a deep learning CNN, much more complex than ours, applied to medical images.[13] This is also comparable to previous IED detection studies, such as that of Tjepkema-Cloostermans and colleagues,[7] who used 50 EEGs, combining routine 20-minute recordings and long-term ambulatory registrations, with 1,478 IEDs in their training set, an amount comparable to our 1,136 IEDs.
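A minimal sketch of the patient-wise data split described above is shown below; variable names are hypothetical, and the stratified 80/20 split is an assumption, since the text does not state how the training/validation split was drawn.

```python
# Patient-wise separation: every epoch of a given EEG/patient goes either to
# training/validation or to the test set, never both. In the study, 14 EEGs
# (3,049 epochs) were used for training/validation and 3 EEGs (580 epochs) for
# testing. Variable names and the use of stratification are assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

def patient_wise_split(epochs, labels, patient_ids, test_patients):
    test_mask = np.isin(patient_ids, test_patients)
    X_test, y_test = epochs[test_mask], labels[test_mask]
    X_train, X_val, y_train, y_val = train_test_split(
        epochs[~test_mask], labels[~test_mask],
        test_size=0.20, stratify=labels[~test_mask], random_state=42)
    return X_train, y_train, X_val, y_val, X_test, y_test
```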
Fig. 2 EEG epoch examples. Examples of an EEG epoch scored as (A, B) EEG interictal epileptiform discharge (IED) and as (C) not epileptiform (non-IED).
Postprocessing Deep Learning Model
A two-dimensional (2D) postprocessing CNN was implemented in Python 3.10 using Keras
2.6.0, TensorFlow 2.8.0, and scikit-learn 1.0.2 ([Fig. 3]). EEG epochs were used as CNN input, processed as an 18 (channels) × 250 (timepoints)
matrix. The CNN applied 25 2D convolutional filters with a receptive field of 3 × 3
on each epoch and down-sampled the data further with a 2 × 2 max pooling layer. A
dropout layer of 20% was used to prevent overfitting. The data were flattened and
forwarded to a hidden layer comprising 100 neurons. Stochastic optimization was performed
using an Adam optimizer with default parameters: learning rate = 0.001, β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁷. We used binary cross-entropy as the loss function and a batch size of 50. The ratio
of IED to non-IED epochs was 1,136:2,493 and was used as a weight factor in the model.
The model provides the probability for IED presence for each epoch as output.
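For clarity, a minimal Keras sketch of the architecture and training settings described above follows. Padding, activation functions, the output layer, and the exact class-weight formulation are not specified in the text and are therefore assumptions, so the parameter count may not match the 3,150,552 reported in [Fig. 3] exactly.

```python
# Sketch of the second-level postprocessing CNN described above (assumptions
# noted in the lead-in; not the authors' exact implementation).
from tensorflow import keras
from tensorflow.keras import layers

def build_postprocessing_cnn(input_shape=(18, 250, 1)):
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(25, (3, 3), activation="relu"),  # 25 filters, 3 x 3 receptive field
        layers.MaxPooling2D((2, 2)),                   # 2 x 2 max pooling
        layers.Dropout(0.20),                          # 20% dropout against overfitting
        layers.Flatten(),
        layers.Dense(100, activation="relu"),          # hidden layer of 100 neurons
        layers.Dense(1, activation="sigmoid"),         # probability of IED presence
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9,
                                        beta_2=0.999, epsilon=1e-7),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Training with class weights reflecting the 1,136:2,493 IED/non-IED ratio
# (this particular weighting scheme is an assumption):
# model = build_postprocessing_cnn()
# model.fit(X_train[..., None], y_train, batch_size=50,
#           validation_data=(X_val[..., None], y_val),
#           class_weight={0: 1.0, 1: 2493 / 1136})
```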
Fig. 3 Postprocessing deep learning neural network architecture. The total number of parameters
is 3,150,552.
Performance Evaluation
Model performance was evaluated as the accuracy for the second-level postprocessing
CNN for the validation and test set, and the sensitivity and specificity for the test
set using Python 3.10. The receiver operating characteristic (ROC) curve and corresponding
area under the curve (AUC) were calculated using Matlab R2021a. Additionally, the
percentage of epochs correctly labeled as containing an IED (positive predictive value)
was calculated in the test set for both the first-level VGG-C-based CNN and the second-level postprocessing CNN, using the same IED probability threshold of 0.99.
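A rough sketch of these evaluation metrics in Python is given below; the ROC curve and AUC in the study itself were computed in MATLAB, the variable names are hypothetical, and the classification threshold for accuracy, sensitivity, and specificity is not stated in the text, so it is left as a parameter.

```python
# Sketch of the performance evaluation. The 0.99 threshold is stated in the text
# only for the positive-predictive-value comparison; the threshold used for the
# other test-set metrics is an assumption left as a parameter here.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

def evaluate(y_true, p_ied, threshold=0.5):
    y_pred = (p_ied >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),                 # fraction of predicted IEDs that are true IEDs
        "auc": roc_auc_score(y_true, p_ied),   # area under the ROC curve
    }
```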
We checked that input data for the second-level CNN were sufficient by increasing
the number of input epochs from 1,483 to 3,629 (with 1,136 true IEDs). Additionally,
we checked whether the model architecture was optimal by determining model accuracy after making the model, first, more complex and, second, less complex. To increase complexity, we added (1) a second convolutional layer or (2) a hidden dense layer with 50 neurons. To decrease complexity, we (1) removed the dense layer of 100 neurons or (2) reduced the number of neurons from 100 to 20.
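These architecture variants can be expressed, for example, by parameterizing the builder sketched earlier; this is an illustrative extension under the same assumptions, not the authors' code.

```python
# Variants of the postprocessing CNN used to probe under- and overfitting:
# extra_conv / extra_dense add complexity; dense_units=None removes the 100-neuron
# dense layer, and dense_units=20 shrinks it. Activations and padding remain assumptions.
from tensorflow import keras
from tensorflow.keras import layers

def build_variant(extra_conv=False, extra_dense=False, dense_units=100):
    net = [keras.Input(shape=(18, 250, 1)),
           layers.Conv2D(25, (3, 3), activation="relu")]
    if extra_conv:                           # (1) add a second convolutional layer
        net.append(layers.Conv2D(25, (3, 3), activation="relu"))
    net += [layers.MaxPooling2D((2, 2)), layers.Dropout(0.20), layers.Flatten()]
    if dense_units:                          # removed entirely when dense_units is None
        net.append(layers.Dense(dense_units, activation="relu"))
    if extra_dense:                          # (2) add a hidden dense layer of 50 neurons
        net.append(layers.Dense(50, activation="relu"))
    net.append(layers.Dense(1, activation="sigmoid"))
    return keras.Sequential(net)
```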
The MATLAB and Python code and our dataset of selected anonymized EEG epochs are shared
in a publicly available repository at the German Neuroinformatics Node/G-Node (GIN),
with doi:10.12751/g-node.swrz7z.
Results
The accuracy of the model for the validation set was 86%. The model accuracy for the
test set was 60%, with a sensitivity of 0.89 and specificity of 0.11. The ROC curve
is shown in [Fig. 4] with an AUC of 0.56. The percentage of epochs correctly labeled as containing an
IED was 38% (10 of 26 epochs) for the second-level postprocessing CNN. This was 37%
(215 of 580 epochs) in the data preselected by the first-level VGG-C-based CNN.
Fig. 4 Receiver operating characteristic (ROC) curve. ROC curve of the second-level postprocessing
deep neural network. IED, interictal epileptiform discharge.
Doubling the number of input epochs for the second-level CNN did not improve model
performance. Making the model architecture more complex by adding (1) a second convolutional layer or (2) a hidden dense layer with 50 neurons did not improve the performance. Making
the model architecture less complex by (1) removing the dense layer of 100 neurons
or (2) reducing the number of neurons to 20 (from 100) showed a deterioration of accuracy
(53%).
Discussion
In summary, our major findings showed a model accuracy of 86% for the validation set
and 60% for the test set. We also found that the first-level CNN selected 37% true
IEDs, and after adding our second-level postprocessing CNN, this increased to 38%.
In conclusion, we were unable to reproduce the previously reported performance of the first-level CNN, and adding the postprocessing CNN did not improve IED detection, given the model's insufficient specificity of 0.11.
Underperformance of a deep learning model in general can be due to (1) insufficient
amount of data, (2) the quality of the data, and/or (3) underfitting or overfitting
of the model.[14] First, it is unlikely that the sample size we used was insufficient because we doubled
the total number of input EEG epochs without any improvement of model performance.[13][14]
Second, the quality of our input data was mainly affected by the performance of the
first-level VGG-C network. This is likely due to the limited number of IED assessors (two) who scored the EEGs used for training the first-level CNN. Only 37% of the epochs that were the output of the first-level network, and consequently the input for our postprocessing, were correctly labeled as containing an IED in our study. This is in line with IED interrater agreement in general, which is reported to be 49% (95% confidence interval [CI]: 37–60%).[15] This suggests that using a first-level network in the future that is trained with IEDs labeled by more assessors may yield a more generalizable and robust overall result.
A limitation of our study is that only one author assessed the EEG epochs for the second-level CNN, and this assessor differed from the two assessors of the first-level CNN.
Also, assessment of IEDs was performed differently for the first- and second-level
CNN: the Matlab App GUI was used for the second-level CNN with an extracted EEG epoch
with a fixed montage and filter settings, whereas assessment for the first-level CNN
was done in the context of the whole EEG. However, systematically different scoring by the assessor in this study might explain the poor positive predictive value of the first-level CNN, but not that of the second-level CNN. We feel it is unlikely that data heterogeneity, for example, the inclusion of EEGs of both adults and children, would have affected performance, as both groups were included in the training of the first- and second-level CNNs. Additionally, we used the leave-one-out principle and excluded the
child (a 4-year-old) from our test set to see if accuracy would improve, which would
be expected if children were not well represented in the training set. However, the
accuracy did not change (61%).
Third, we adapted the model architecture, making it more as well as less complex,
to check for under- and overfitting. Adding a second convolutional layer or a hidden dense layer with 50 neurons did not improve the performance on EEG data, suggesting that underfitting
was not the issue. Overfitting was addressed by adding a dropout layer in the model
architecture. To check for overfitting due to a too complex or deep architecture,
we additionally (1) removed the dense layer of 100 neurons or (2) reduced the number
of neurons to 20 (from 100), both showing a deterioration of accuracy. Validation
accuracy was higher than test accuracy most likely because we used the 80/20 training/validation
data split. This is standard practice but implies that a part of the epochs from one
patient could be in the training set and another part could be in the validation set.
For the test set, we ensured that all epochs from a particular patient were only used
for the test set. IEDs within the same patient are likely more similar to each other
than IEDs from different patients, thus explaining the difference in model accuracy
between the validation and the test set. The percentage of correctly labeled IED epochs
by the first-level network was low (37%); thus, the input data for the second-level
CNN were not as well filtered as expected for a postprocessing model, despite selecting
a relatively high IED probability threshold (0.99). Consequently, the preselected EEG data
were possibly still too noisy for the limited model architecture complexity of the
second-level CNN.
The application of a second-level postprocessing deep neural network has been used successfully in the fields of electronics and nephrology,[11][12] but it is novel in the field of clinical neurophysiology and in automated EEG IED detection in particular. These negative results are important for guiding how limited resources (time and EEG data) are spent in the future. We were unable to reproduce the previous VGG-C
(the first-level) network performance.
Further improvement of the IED detection rate with a postprocessing CNN approach may be achieved as follows. First, a different first-level CNN or the same CNN with
an increased number of assessors can be used. Additionally, unifying the assessment
methods of the first- and second-level CNN and increasing the number of assessors
of the second-level CNN may contribute to overall performance. The second-level CNN
architecture could be further optimized, using, for example, metaheuristic algorithms
and different optimizers.[16][17] However, we expect this to improve only its computational burden and not its performance,
because we have shown that altering the architecture of the postprocessing CNN did
not improve its performance.