Efficient and Accelerated EEG Seizure Analysis through Residual State Updates (2024)

Arshia Afzal Grigorios Chrysos Volkan Cevher Mahsa Shoaran

Abstract

EEG-based seizure detection models face challenges in terms of inference speed and memory efficiency, limiting their real-time implementation in clinical devices. This paper introduces a novel graph-based residual state update mechanism (Rest) for real-time EEG signal analysis in applications such as epileptic seizure detection. By leveraging a combination of graph neural networks and recurrent structures, Rest efficiently captures both non-Euclidean geometry and temporal dependencies within EEG data. Our model demonstrates high accuracy in both seizure detection and classification tasks. Notably, Rest achieves a remarkable 9-fold acceleration in inference speed compared to state-of-the-art models, while simultaneously demanding substantially less memory than the smallest model employed for this task. These attributes position Rest as a promising candidate for real-time implementation in clinical devices, such as Responsive Neurostimulation or seizure alert systems.

Machine Learning, ICML

1 Introduction

Brain disorders, including epilepsy, present substantial challenges globally, prompting the need for innovative approaches in diagnosis and treatment. Recurrent seizures, recognized as one of the most prevalent neurological emergencies globally (Strein etal., 2019), impact approximately 50 million people worldwide (Beghi etal., 2019).

Detecting changes in the rhythms of brain activity through the monitoring of electroencephalography (EEG) signal allows us to pinpoint the onset zone and time of seizures (Gotman, 1990; Siddiqui etal., 2020), making EEG an invaluable and extensively utilized tool for seizure detection and localization. Traditionally, neurological experts perform these tasks, involving the time-consuming process of manually labeling periods spanning from hours to days for each individual patient (Harrer etal., 2019; Ahmedt-Aristizabal etal., 2020). Several studies have explored the application of Machine Learning (ML) in seizure analysis, aiming to simplify the handling of large seizure datasets for experts (Tang etal., 2021; Ahmedt-Aristizabal etal., 2020; Covert etal., 2019; Siddiqui etal., 2020). These studies predominantly focus on deep models, known for their accuracy and suitability for clinical applications.

Taking inspiration from computer vision (Voulodimos etal., 2018), many studies have applied different variations of Convolutional Neural Networks (CNN) for seizure detection, as demonstrated in Saab etal. (2020). Various versions of Graph Neural Networks (GNN) effectively capture non-Euclidean geometry in datasets like EEG signals, contributing to enhanced seizure detection and classification (Li etal., 2022; Tang etal., 2021; Ho & Armanfard, 2023). Additionally, to enhance the performance of deep neural networks and accounting for time-series nature of brain rhythms, different variations of Recurrent Neural Networks (RNN) have been utilized in seizure analyses (Ahmedt-Aristizabal etal., 2020).

While these models excel in achieving high accuracy in seizure detection and classification tasks, they often struggle with issues such as complexity, inefficient memory usage, and slow inference speeds. One of the main reasons behind this inefficiency lies in structures such as the gating mechanism found in RNN models (e.g., Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) or the presence of deep convolutional layers in CNN s and GNN s.

Both inference time and memory storage considerations become critically important in the context of modern seizure treatment devices like Responsive Neurostimulation (RNS) and Deep Brain Stimulation (DBS) (Fisher & Velasco, 2014a; Sun & Morrell, 2014). These devices, which have shown promise in suppressing seizure attacks, require a small yet accurate ML model to trigger stimulation commands for symptom suppression (Shoaran etal., 2016; Shin etal., 2022). Furthermore, the model must exhibit low inference time in activating the stimulator to ensure its effectiveness (Fisher & Velasco, 2014b; Zhu etal., 2021). Unfortunately the aforementioned methods do not have such a low inference.

In this study, we introduce Rest, a graph-based residual update mechanism designed to efficiently detect both spatial and temporal information from EEG. Rest captures spatio-temporal dependencies in EEG signals without relying on computationally expensive gating mechanisms commonly found in existing models (Hochreiter & Schmidhuber, 1997; Cho etal., 2014; Asif etal., 2020; Tang etal., 2021). The ability to dynamically capture spatial information over time and update the state accordingly contributes to the high accuracy of Rest in localizing and detecting seizures. Notably, Rest attains comparable accuracy to state-of-the-art models, while achieving significantly faster processing during inference and substantially reducing computational and memory overhead ¹¹1Visit our web site at https://arshiaafzal.github.io/REST/. Our contributions are as follows:

•
We present a novel graph-based residual update mechanism designed to capture spatio-temporal dependencies in EEG signals.
•
We enhance the model’s performance while maintaining its small size and rapid detection and classification speed using binary random masking the state and multiple state updates.
•
Our model delivers predictions with an impressive inference latency of 1.29ms. This unmatched inference speed is achieved with a light memory footprint of 37KB.
•
Our model is 14 $\times$ smaller than the smallest competitive models for seizure detection. Remarkably, our architecture can match the performance of the state-of-the-art deep neural networks with less than 10K parameters.

2 Related Work

Many studies have attempted to develop ML and deep learning models for seizure detection (Siddiqui etal., 2020; O’Shea etal., 2020; Saab etal., 2020) and classification of seizure types (Ahmedt-Aristizabal etal., 2020; Iešmantas & Alzbutas, 2020; Tang etal., 2021). Here, we examine existing seizure detection and classification models, assessing their strengths and limitations across three key aspects. Firstly, we explore how these studies capture the spatio-temporal features present in EEG. Secondly, we delve into the inference speed and the impact of varying clip lengths on seizure analysis. Lastly, we study the memory requirements and model size of current models.

Spatio-Temporal Nature of EEG Signals: As introduced earlier, the nature of EEG signals involves both spatial and temporal components, which are pivotal for accurate analysis in epilepsy studies. Notably, some studies, like Asif etal. (2020), extract spectral features to represent temporal dependencies, incorporating them into a CNN architecture. In contrast, Saab etal. (2020) employ a CNN model that treats EEG signals as multi-channel images, a methodology that does not align with the time-series structure of EEG. Recent advancements involve the utilization of various RNN variations or transformers (Vaswani etal., 2017) to effectively capture temporal patterns in alignment with the intricate dynamics of EEG signals.

RNN s capture temporal dependencies within time-series data by mapping the input $x(t)$ into a latent space $h(t)$ and employ recurrence within that space through linear or non-linear transformations.Despite their effectiveness in capturing time-series dependencies, RNN s suffer from a significant challenge known as gradient vanishing. This issue occurs during backpropagation, causing gradients to diminish and hindering the effective learning of long-range dependencies in sequential data. To address the vanishing gradient problem (Pascanu etal., 2013), RNN variants like LSTM (Hochreiter & Schmidhuber, 1997) or Gated Reccurent Unit (GRU) (Cho etal., 2014) leverage gating mechanisms, introducing different gates that contribute to creating the next state $h(t)$ from the current input $x(t)$ and the previous state $h(t-1)$ . Thodoroff etal. (2016) used an LSTM based model for seizure detection.

On the other hand, attention-based models or transformers (Vaswani etal., 2017) are more complex than RNN s. Rather than constructing an explicit state, they directly use previous inputs to predict the future. However, this approach is more memory-intensive and time-demanding due to the necessity of retaining all prior inputs up to a specified time point and storing weights for each input to construct the attention matrix. Yan etal. (2022b) employed a transformer-based model for the seizure detection task.

In the context of EEG analysis where spatial details are critical at each time point, a common strategy is to utilize a CNN or graph convolution network independently across all time points, mapping them into a new feature space. This approach is then complemented by RNN to capture temporal dependencies. Ahmedt-Aristizabal etal. (2020) further employ a CNN-LSTM model, effectively addressing both spatial and temporal dependencies in EEG data.

Method	A	B	C	D	E
SeizureNet (Asif etal., 2020)	✗	✔	✗	✗	✗
Transformer (Yan etal., 2022a)	✗	✔	✔	✗	✗
EEG-CGS (Ho & Armanfard, 2023)	✔	✔	✗	✗	✗
GGN (Li etal., 2022)	✔	✔	✔	✗	✗
LSTM (Hochreiter & Schmidhuber, 1997)	✗	✔	✗	✗	✗
CNN-LSTM [1] (Ahmedt-Aristizabal etal., 2020)	✗	✔	✗	✗	✗
CNN-LSTM [2] (Thodoroff etal., 2016)	✗	✔	✗	✗	✗
DCRNN (Tang etal., 2021)	✔	✔	✔	✗	✗
Rest (Ours)	✔	✔	✔	✔	✔

Nevertheless, these approaches assume Euclidean geometry for EEG signals, overlooking the natural geometry of electrode placement (Figure1 a) and brain network connectivity (Tang etal., 2021). Recent studies exploit GNN s and graph-based modeling to capture the non-Euclidean geometry of EEG signals (Tang etal., 2021; Ho & Armanfard, 2023; Covert etal., 2019; Li etal., 2022). For instance, Tang etal. (2021) implement a self-supervised diffusion graph convolution model for both detection and classification tasks. Similarly, Ho & Armanfard (2023) employ a self-supervised graph network for channel anomaly detection. These studies (Ho & Armanfard, 2023; Tang etal., 2021) align more closely with the dynamic changes in EEG rhythms by replacing the weights of the RNN network with graph convolution filters. This approach represents the evolution of spectral features within each time point of the time-series data, offering a more integrated approach compared to the sequential mapping from CNN to LSTM (Ahmedt-Aristizabal etal., 2020).

Significance of Inference Time: Timely detection of seizure events is essential for the efficacy of closed-loop epileptic treatments such as RNS and DBS (Shoaran etal., 2016). To the best of our knowledge, most previous studies either overlook the importance of inference runtime or, as observed in Asif etal. (2020), consider a 90ms delay for giving predictions. This delay is still significant, especially for edge devices like RNS and DBS.Furthermore, current studies often evaluate models using a limited range of long window sizes, typically exceeding 10 seconds or even 1 minute (Tang etal., 2021; Saab etal., 2020). However, shorter window sizes are preferable for real-time seizure detection and responsive intervention (Christou etal., 2022; Zhu etal., 2020). The chosen window size influences a model’s ability to localize seizures and its overall detection performance. For instance, a model designed for extended window sizes may lose accuracy in short-term seizure detection scenarios, an aspect that has not been extensively explored in the literature.

Memory Requirement in Seizure Detection Models:While numerous studies have focused on enhancing the accuracy of seizure detection and classification tasks, the crucial aspect of memory demand remains largely overlooked. For instance, Tang etal. (2021) utilize 240K parameters with complex gating units, Ho & Armanfard (2023) employ 58K for channel anomaly detection, and Asif etal. (2020) address seizure classification task with a substantial number of 45.94 Million parameters. These examples underscore the need for an efficient model tailored for seizure detection and classification problems, especially one suitable for resource-constrained stimulation devices deployed at the edge, which do not have access to extensive memory storage for model weights and states (Zhu etal., 2020).

In Table1, we present a summary of current models, highlighting their respective strengths and weaknesses.

Efficient and Accelerated EEG Seizure Analysis through Residual State Updates (1)

3 Method

Below, we first formulate the tasks of seizure detection and classification, outlining the graph representation of EEG signals. Next, we describe the design of Rest’s structure using various updating strategies.

3.1 Seizure Detection and Classification Problem Setting

Following the preprocessing of raw EEG signals and constructing the EEG graph, we obtain an EEG clip $X$ and a label $y$ for both detection and classification tasks. Here, $X\in\mathbb{R}^{T\times M\times N}$ with $N$ electrodes, $T$ time points and $M$ features per node while $y$ denotes the label. For detection, the label is binary, whereas for classification, the label falls within the range of {0,1,2,3,4} where each class represents a unique seizure type ²²2The five seizure types include: focal, generalized non-specific, complex partial, absence, andtonic-clonic.. The goal for both tasks is to predict the label $y$ based on a given EEG clip $X$ .

3.2 EEG Distance Graph Construction

For each EEG clip, we denote a graph as $\mathcal{G}=\{\mathcal{V},\mathcal{E},\mathcal{A}\}$ where $\mathcal{V}=\{v_{1},...,v_{N}\}$ represents the nodes corresponding to EEG electrodes, $\mathcal{E}$ represents the edges, and $\mathcal{A}\in\mathbb{R}^{N\times N}$ denotes the adjacency matrix of the graph where $N$ is the number of nodes which in case of EEG data it is the EEG electrodes.We build a distance-based EEG graph (Figure1a) that precisely represents the electrode placement geometry in the standard 10/20 system (Jasper, 1958). Unlike correlation graphs, our graph remains static over time, reducing computations during inference, as the graph structure does not need to be constructed for each input (Ho & Armanfard, 2023). Details regarding the choice of $k$ and visualization of distance graphs based on threshold values can be found in AppendixH.

For a distance graph, the adjacency matrix is constructed using the distance between electrode locations, as in previous studies (Tang etal., 2021; Li etal., 2022; Ho & Armanfard, 2023). As the EEG electrode placements are fixed, the adjacency matrix remains unchanged over time. Thus, for each element $a_{ij}\in\mathcal{A}$ :

a_{ij}=\left\{\begin{array}[]{ c l }\exp(-\frac{||v_{i}-v_{j}||^{2}}{\sigma^{2%}})&\quad\textrm{if }||v_{i}-v_{j}||\leq k,\\0&\quad\textrm{if }\textrm{Otherwise},\end{array}\right.

(1)

where $\sigma$ is the standard deviation of the distances and $k$ is the Gaussian kernel’s threshold (Shuman etal., 2013).

3.3 Residual State Update

Similar to RNN s, Rest initially maps the input into a latent space, evolving the state over time to reach the final output. In contrast to RNN s, Rest updates the state using a novel approach that avoids the complexity of gating mechanisms like LSTM or GRU, efficiently addressing the vanishing gradient problem with fewer parameters (details in AppendixB). For mapping to the state space, Rest employs a linear mapping represented as:

3.4 Binary Random Mask: Continuous Dropout during Inference

To combat overfitting in deep neural networks, Dropout is commonly employed, randomly selecting model parameters during training and retaining all parameters during test-time (Srivastava etal., 2014). Drawing inspiration from a similar concept in Mordvintsev etal. (2020), we introduce Binary Masking for state updates, preventing overfitting while enabling the model to learn random state updates. This approach prevent the model to overfit as well as accelerates inference during test-time by skipping computations related to zero-masked feature points in the update.The state update will simply change as follows:

S^{t}=H^{t}+\delta S^{t}\odot B.

(6)

Here, $\odot$ denotes the Hadamard product, and $B\in\mathbb{R}^{Q\times N}$ is the binary mask with $B_{ij}\sim\mathcal{B}(p)$ from the Bernoulli distribution, where $B_{ij}$ takes the value 1 with a probability of $p$ and can be treated as hyperparameter for the model.

	EEG-Files	Patients	Seizure Type Numbers (Seizure Type Sessions)
	(% Seizures)	(% Seizures)	GN	TC	AB	FN	CP
Train	4664(5.34%)	579(36%)	335(152)	30(11)	50(15)	1516(496)	279(132)
Evaluation	881(5.82%)	43(79%)	185(54)	57(8)	50(1)	240(98)	108(32)

3.5 Multiple Update Mechanism: Escaping the Memory Requirements of StackedRNN Layers

As widely recognized in neural networks, increasing the depth enhances performance by enabling the extraction of more general and complex features (Nakkiran etal., 2021). However, this poses a challenge in RNN s, where each additional layer increases memory requirements, not only for storing extra weights but also for additional gates and states.

In our study, we tackle this challenge by modifying Rest to employ identical weights for state updates, thus facilitating multiple state updates. Although the graph convolution layer appears repetitive, the effect of binary random mask allows Rest to learn to update a new part of the state during each iteration. This adaptation allows Rest to align itself with the nature of these random updates, contributing to increased performance and enhanced stability without affecting memory requirements.

Thus, the Equations2, 5 and6 will be modified as follows:

H^{t}_{i}=WX^{t}+US^{t}_{i},

(7)

S^{t}_{i+1}=H^{t}_{i}+\delta S^{t}_{i}\odot B.

(8)

Here, the index $i$ denotes the current iteration during which the model updates its state, and $\delta S^{t}_{i}=\mathcal{G}_{\Theta}(H^{t}_{i})$ . It is crucial to emphasize $X^{t}$ as the feature input at time point $t$ to prevent the model from diverging into a state and neglecting the input during multiple updates (additional details are provided in the AppendixG). To update the state for the next time point, the final state obtained after multiple updates becomes the initial state. For instance, after updating the model’s state $I$ times at time point $t$ , the initial state for the next time point $t+1$ is set as the final state after the last update at time point $t$ ( $S^{t+1}_{0}$ = $S^{t}_{I}$ ). This enables the model to effectively capture the temporal dynamics across different time points. The proposed framework for the update cell is illustrated in (Figure 1c).

Moreover, previous studies (Mordvintsev etal., 2020; Pajouheshgar etal., 2023) have demonstrated that recurrently updating the state of neural networks, similar to Rest in structure, for image and texture generation contributes to improved stability. We hypothesize that a similar enhancement can be achieved for seizure detection and classification.

	Patients	Seizures	Recording (hours)
Train	18	154	732
Evaluation	3	19	91
Test	3	19	92.5

4 Rest & RNNs

To better understand the memory efficiency and speed advantages of Rest during inference, we compare Rest with traditional RNN s. As mentioned in Related Work, RNN s map the input $x(t)$ to a hidden state $h(t)$ and update this state over time using the previous state $h(t-1)$ and the current input $x(t)$ . We highlight the efficiency and connections between Rest and other types of RNNs through the following comparisons:

Single Update Rest vs. Single-Layer RNN: First we consider a single GRU as a representative of RNN models, which leverages gating mechanisms to mitigate gradient vanishing.For a simple GRU update, we have the following set of equations:

r(t)=\sigma(W_{r}\cdot[h(t-1),x(t)]),

(9)

z(t)=\sigma(W_{z}\cdot[h(t-1),x(t)]),

(10)

\tilde{h}(t)=\tanh(W_{h}\cdot[r(t)\odot h(t-1),x(t)]),

(11)

h(t)=(1-z(t))\odot h(t-1)+z(t)\odot\tilde{h}(t).

(12)

Here, $h(t)$ is the hidden state at time $t$ , $x(t)$ is the input at time $t$ , $\sigma$ is the sigmoid activation function, $\odot$ denotes element-wise multiplication, $[a,b]$ denotes the concatenation of vectors $a$ and $b$ , and $W_{r},W_{z},W_{h}$ represent the weight matrices.

These equations describe how the hidden state $h(t)$ is updated over time based on the input and the preceding state. Unlike Rest, GRU relies on three different gates ( $z(t),r(t),\tilde{h}(t)$ ) for each state update, requiring twice as much memory as Rest, in addition to the storage required for the weights utilized in generating these gates.

Despite GRU’s memory demands, it not only needs to compute the next state ( $h(t)$ ), but also three additional gates ( $z(t),r(t),\tilde{h}(t)$ ) as the next state depends on these gates. In contrast, Rest relies solely on the update result ( $\delta S^{t}$ ), enabling it to rapidly derive the next state by adding it to the previous state, without the need for additional gates.

Multi Random Update Rest vs. Multi-Layer RNN:

The remarkable efficiency of Rest becomes particularly evident when comparing it with multi-layer RNN. In the context of multi-layer GRU, reaching the final state involves computing a set of equations (Equations9, 10, 11 and12) for each layer. This process introduces three times more latency per layer, as each layer has three gates that must be computed to obtain the next state. Furthermore, it requires additional memory to store the hidden state of each layer, especially since it is required for updating the final hidden state of the last layer.

Contrastingly, Rest distinguishes itself by reusing the same set of weights for the update cell and state evolution. This eliminates the need to store the previous state, as it evolves a distinct state over iterations. Consequently, Rest maintains the same memory requirements as a single update, while delivering more accurate results (as discussed in the next section).It is worth mentioning that in the context of EEG data, all fully connected layers will be replaced by graph convolutions for both Rest and GRU. For example, the combination of GRU with diffusion graph convolution for a traffic forecasting problem was undertaken by Li etal. (2017).

Connection of Rest Update Cell to Gating Mechanism:

As shown in Equation12, the state update of RNNs, such as GRU, can be expressed as:

h(t)=h(t-1)+z(t)\odot\left(\tilde{h}(t)-h(t-1)\right).

(13)

This update shares similarities with the Rest cell update in Equation6. Instead of learning both $\tilde{h}(t)$ and $h(t)$ separately, the Rest update directly learns $\tilde{h}(t)-h(t-1)$ as the residual update $\delta S^{t}$ . Additionally, the update gate vector $z(t)$ is replaced with binary random masking. This substitution reduces the computational and memory overhead required for building $z(t)$ from the input $x(t)$ and hidden state $h(t)$ .

	Seizure Detection AUROC (%)						Model Efficiency
Model	4-s	6-s	8-s	10-s	12-s	14-s	Size(MB)	#Param	Inference(ms)
LSTM	$75.5_{\pm 0.3}$	$76.1_{\pm 0.07}$	$80.1_{\pm 0.3}$	$70.43_{\pm 0.02}$	$77.9_{\pm 0.06}$	$74.24_{\pm 0.2}$	2.147	536K	3.254
GRU	$76.1_{\pm 0.02}$	$78.8_{\pm 0.03}$	$73.2_{\pm 0.04}$	$73.5_{\pm 0.02}$	$80.1{\pm 0.1}$	$77.9_{\pm 0.04}$	1.61	402K	2.12
ResNet-LSTM	$79.1_{\pm 0.05}$	$80.1_{\pm 0.2}$	$75.6_{\pm 0.07}$	$74.3_{\pm 0.04}$	$78.8_{\pm 0.1}$	$80.0_{\pm 0.08}$	27.6	6.9M	6.78
ResNet-Dilation-LSTM	$80.2_{\pm 0.08}$	$76.5_{\pm 0.12}$	$75.9_{\pm 0.06}$	$73.6_{\pm 0.03}$	$77.4_{\pm 0.15}$	$78.2_{\pm 0.07}$	27.6	6.9M	6.78
CNN-LSTM	$81.3_{\pm 0.1}$	$78.5_{\pm 0.05}$	$76.4_{\pm 0.01}$	$75.4_{\pm 0.05}$	$75.05_{\pm 0.1}$	$74.0_{\pm 0.03}$	22.8	6M	5.624
DCRNN	$79.7_{\pm 0.01}$	$82.1_{\pm 0.04}$	$80.1_{\pm 0.04}$	$80.0_{\pm 0.06}$	$82.5_{\pm 0.1}$	$80.12_{\pm 0.04}$	0.884	126K	9.670
DCRNN w/SS	$\textbf{83.0}_{\pm\textbf{0.08}}$	$81.8_{\pm 0.05}$	$\textbf{82.7}_{\pm\textbf{0.1}}$	$82.1_{\pm 0.03}$	${85.6}_{\pm{0.2}}$	${84.0}_{\pm{0.01}}$	1.319	330K	23.25
Transformer	$83.0_{\pm 0.02}$	$82.1_{\pm 0.03}$	$82.2_{\pm 0.04}$	$\textbf{85.5}_{\pm\textbf{0.07}}$	$\textbf{86.0}_{\pm\textbf{0.03}}$	$\textbf{85.1}_{\pm\textbf{0.02}}$	0.80	120.3K	2.5
Rest_(DS)	$75.3_{\pm 0.2}$	$67.0_{\pm 0.03}$	$72.2_{\pm 0.07}$	$74.1_{\pm 0.1}$	$70.6_{\pm 0.04}$	$70.0_{\pm 0.04}$	0.037	8.4K	0.615
Rest_(RS)	$79.4_{\pm 0.03}$	$81.1_{\pm 0.01}$	$81.0_{\pm 0.08}$	$81.8_{\pm 0.02}$	$80.1_{\pm 0.1}$	$78.1_{\pm 0.4}$	0.037	8.4K	0.710
Rest_(RM)	$82.4_{\pm 0.04}$	$\textbf{82.2 }_{\pm\textbf{0.05}}$	$\textbf{82.7}_{\pm\textbf{0.1}}$	$83.6_{\pm{0.2}}$	$83.4_{\pm 0.09}$	$82.0_{\pm 0.1}$	0.037	8.4K	1.292

	Seizure Detection AUROC (%)					Model Efficiency
Model	4-s	6-s	8-s	10-s	12-s	Size(MB)	#Param	Inference(ms)
LSTM	$85.5_{\pm 0.2}$	$84.1_{\pm 0.4}$	$81.0_{\pm 0.2}$	$75.2_{\pm 0.03}$	$73.5_{\pm 0.08}$	2.691	627K	3.56
GRU	$76.1_{\pm 0.3}$	$78.8_{\pm 0.03}$	$73.2_{\pm 0.4}$	$73.5_{\pm 0.01}$	$80.1_{\pm 0.2}$	1.92	553K	2.42
ResNet-LSTM	$77.6_{\pm 0.2}$	$82.1_{\pm 0.14}$	$79.9_{\pm 0.3}$	$76.8_{\pm 0.4}$	$81.4_{\pm 0.17}$	29.1	7.2M	$6.84$
ResNet-Dilation-LSTM	$78.2_{\pm 0.03}$	$79.8_{\pm 0.1}$	$82.3_{\pm 0.4}$	$77.6_{\pm 0.4}$	$81.2_{\pm 0.1}$	29.1	7.2M	6.84
CNN-LSTM	$86.2_{\pm 0.4}$	$84.9_{\pm 0.2}$	$80.4_{\pm 0.04}$	$80.35_{\pm 0.06}$	$77.6_{\pm 0.3}$	7.6M	30.23	6.432
DCRNN	$88.7_{\pm 0.3}$	$80.0_{\pm 0.02}$	$86.8_{\pm 0.06}$	$88.8_{\pm 0.3}$	$86.5_{\pm 0.3}$	0.591	147K	9.80
Transformer	$80.1_{\pm 0.2}$	$82.3_{\pm 0.6}$	$82.2_{\pm 0.04}$	$85.5_{\pm 0.01}$	$86_{\pm 0.17}$	0.25	52.4K	6.00
Rest_(DS)	$89.1_{\pm 0.2}$	$88.5_{\pm 0.08}$	$90.1_{\pm 0.1}$	$86.3_{\pm 0.03}$	$87.8_{\pm 0.5}$	0.037	9.3K	1.314
Rest_(RS)	$92.3_{\pm 0.1}$	$88.7_{\pm 0.06}$	$\textbf{92.1}_{\pm\textbf{0.03}}$	$\textbf{93.5}_{\pm\textbf{0.02}}$	$91.5_{\pm 0.02}$	0.037	9.3K	1.314
Rest_(RM)	$\textbf{96.7}_{\pm\textbf{0.2}}$	$\textbf{92.3}_{\pm\textbf{0.04}}$	$91.4_{\pm 0.1}$	$89.2_{\pm 0.4}$	$\textbf{91.6}_{\pm\textbf{0.03}}$	0.037	9.3K	1.314

5 Empirical Results

5.1 Setup

Dataset:We used two extensive publicly available datasets for the seizure detection and classification task: the Temple University Hospital EEG Seizure Corpus (TUSZ) (Obeid & Picone, 2016; Shah etal., 2018) and the Children’s Hospital Boston (Goldberger etal., 2000) dataset. Below is a detailed description of each dataset:

TUSZThis dataset includes a total of 5545 EEG files for training and evaluation. These files encompass five different seizure types. We incorporated all 19 channels for all patients in the standard 10-20 system (Figure 1a).

CHB-MIT This dataset comprises recordings from 24 patients, with each patient having data from 9 to 42 sessions, recorded at a sampling rate of 256Hz. The dataset contains a total of 192 seizures. For our study, we included all 19 channels in the standard 10-20 system for the majority of patients, and excluded sessions that had fewer or a higher number of channels.

Preprocessing: In line with previous studies (Tang etal., 2021; Saab etal., 2020), we resample the EEG signals from TUSZ dataset into 200Hz (256Hz for CHB-MIT dataset) to have consistent sampling frequency among different EEG s. Then, we extract non-overlapping window sizes with length $T$ leading to an EEG clip $X\in\mathbb{R}^{T\times L\times N}$ with $N=19$ nodes, $L=200$ ( $L=256$ for CHB-MIT dataset) features per node, and $T$ time points. After applying the fast Fourier transform on the second dimension of the EEG clip and choosing the log amplitude of non-negative frequency components, the final EEG clip fused as the input to the models is $X\in\mathbb{R}^{T\times M\times N}$ where $M=100$ ( $M=128$ for CHB-MIT dataset). Finally, the features for each node and time point are z-normalized using the mean and variance calculated from 100 (128 for CHB-MIT dataset) feature points along its axis. We examine the presence of a seizure within an EEG clip in the detection task. For classification, we start analyzing each clip 2 seconds before the seizure begins and evaluate the outcomes within a clip duration of $T=10$ seconds. This approach aligns with the annotations of seizure onset, as demonstrated in previous works (Ahmedt-Aristizabal etal., 2020; Tang etal., 2021).

5.2 Experimental Results

Seizure Detection and Classification Accuracy: We evaluated the performance of all baseline models and Rest using the Area Under the Receiver Operating Characteristic Curve (AUROC) for seizure detection and Weighted F1-Score for seizure classification. Our model surpassed all baselines significantly on the CHB-MIT dataset for all different clip lengths. For the TUSZ dataset, it achieved very close detection AUROC scores for all clip lengths compared to DCRNN with self-supervision and the Transformer, while outperforming them at clip lengths of 6 and 8 seconds. Figure2 suggests that multiple random updates improve the stability of Rest as it leads to higher and more consistent performance compared to other models. According to Figure2, Rest_(RM) and DCRNN with self supervision exhibit more stable performance over time across clip lengths, yielding consistent results. Interestingly, CNN-LSTM achieved higher performance in a small clip size of 4s, surpassing DCRNN with graph convolution layers.

Rest Enjoys an Exponentially Smaller Size:While maintaining high accuracy, Rest exhibits a size that is 14 $\times$ smaller than the smallest existing model for seizure detection and classification on TUSZ dataset. Table 4 highlights that Rest requires 38 $\times$ fewer parameters than state-of-the-art models (DCRNN w/SS) and over 697 $\times$ fewer parameters than the deep CNN-LSTM model for seizure analysis.

Figure 3 a-b showcases Rest’s outstanding performance, achieving an AUROC of 83.6% for seizure detection with a clip length of 10 seconds. Additionally, Rest secures the second-highest F1-Score for seizure classification, trailing only 2% below DCRNN w/SS but with a significantly smaller size than all other baselines. The substantial gap between Rest’s size and the sizes of other baselines, depicted on the logarithmic scale in Figure 3 a-b, underscores Rest’s remarkable size advantage and potential for implementation on edge devices. The graph convolution layers in Rest efficiently capture both short and long-range communication between nodes, ensuring high accuracy with a compact model size. Moreover, using identical weights for multiple random updates eliminates the need for additional layers while enhancing the model’s accuracy and memory efficiency.

Rapid Seizure Detection:Rest_(RM) achieves the fastest inference speed among all models, being 20 $\times$ faster than DCRNN w/SS and 9 $\times$ faster than DCRNN during inference, with only a minor AUROC drop of less than 2% for seizure detection across various clip lengths for TUSZ dataset. Moreover, Rest, with multiple updates, requires only 1.292 ms for seizure detection, which is three times faster than the fastest baseline, LSTM, while being 13% more accurate in delivering predictions (at 10-s clip length). On the CHB-MIT dataset, Rest outperforms all other baselines in the seizure detection task, being the only model with an AUROC higher than 90%. It also significantly outperforms other baselines for the short clip length of 4 seconds, which is crucial for real-time seizure detection (Zhu etal., 2021).

In seizure classification, Rest_(RM) secures the second-highest F1-Score (Table 6) and excels in providing the fastest classification result within 1.51 ms (Figure 3 c-d). Notably, it is three times faster than LSTM, while achieving 21% higher accuracy than LSTM.The swift prediction capability of our model is attributed to its efficient design. Rest relies on a single affine mapping into the state space, complemented by two computationally lightweight graph convolutions.

6 Conclusion

In this work, we propose Rest, a graph-based residual state update mechanism for efficient seizure detection and classification tasks. Our model effectively captures both spatial and temporal behaviors of EEG signals, achieving state-of-the-art performance in seizure detection and classification. With its shallow structure, Rest boasts a fast inference speed, making it 9 times faster than current models with a comparable performance. Furthermore, Rest exhibits remarkable efficiency, requiring only 37KB of memory, which is 14 times smaller than smallest existing models for seizure analysis tasks. These advancements position Rest as a promising model for implementation on small, low-power edge devices, particularly for applications in epilepsy treatments like DBS and RNS.

Impact Statement

The EEG Seizure Corpus from Temple University Hospital, utilized in our research, is anonymized and publicly accessible with IRB approval (Obeid & Picone, 2016; Shah etal., 2018). The authors declare no conflicts of interest, and the seizure detection and classification models presented in this study do not provide any harmful insights. Although our model has demonstrated accuracy in real-time seizure analyses, further experiments are essential for real-world application and implementation on edge devices, as demonstrated in a number of recent systems (Shoaran etal., 2018; Shin etal., 2022; Shaeri etal., 2024).These evaluations should encompass testing with diverse datasets from various patient populations and hospitals. Additionally, assessing the model’s energy efficiency is crucial to ensure its safety for chronic use, along with obtaining neurologists’ approval regarding its neurological aspects for deployment in such devices.

Acknowledgements

This work was supported in part by the Swiss State Secretariat for Education, Research and Innovation under Contract number SCR0548363, in part by the Wyss project under contract number 532932, in part by Hasler Foundation Program: Hasler Responsible AI project number 21043, in part by the Army Research Office under grant number W911NF-24-1-0048, and in part by the Swiss National Science Foundation (SNSF) under grant number 200021_205011. Moreover, we appreciate the reviewers for their insightful feedback, which has significantly enhanced the robustness and clarity of our results.

References

Acharya etal. (2016)Acharya, J.N., Hani, A.J., Thirumala, P., and Tsuchida, T.N.American clinical neurophysiology society guideline 3: a proposal for standard montages to be used in clinical eeg.The Neurodiagnostic Journal, 56(4):253–260, 2016.
Ahmedt-Aristizabal etal. (2020)Ahmedt-Aristizabal, D., Fernando, T., Denman, S., Petersson, L., Aburn, M.J., and f*ckes, C.Neural memory networks for seizure type classification.In 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pp. 569–575. IEEE, 2020.
Asif etal. (2020)Asif, U., Roy, S., Tang, J., and Harrer, S.Seizurenet: Multi-spectral deep feature learning for seizure type classification.In Machine Learning in Clinical Neuroimaging and Radiogenomics in Neuro-oncology: Third International Workshop, MLCN 2020, and Second International Workshop, RNO-AI 2020, Held in Conjunction with MICCAI 2020, Lima, Peru, October 4–8, 2020, Proceedings 3, pp. 77–87. Springer, 2020.
Beghi etal. (2019)Beghi, E., Giussani, G., Nichols, E., Abd-Allah, F., Abdela, J., Abdelalim, A., Abraha, H.N., Adib, M.G., Agrawal, S., Alahdab, F., etal.Global, regional, and national burden of epilepsy, 1990–2016: a systematic analysis for the global burden of disease study 2016.The Lancet Neurology, 18(4):357–375, 2019.
Cho etal. (2014)Cho, K., VanMerriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y.Learning phrase representations using rnn encoder-decoder for statistical machine translation.arXiv preprint arXiv:1406.1078, 2014.
Christou etal. (2022)Christou, V., Miltiadous, A., Tsoulos, I., Karvounis, E., Tzimourta, K.D., Tsipouras, M.G., Anastasopoulos, N., Tzallas, A.T., and Giannakeas, N.Evaluating the window size’s role in automatic eeg epilepsy detection.Sensors, 22(23):9233, 2022.
Covert etal. (2019)Covert, I.C., Krishnan, B., Najm, I., Zhan, J., Shore, M., Hixson, J., and Po, M.J.Temporal graph convolutional networks for automatic seizure detection.In Machine Learning for Healthcare Conference, pp. 160–180. PMLR, 2019.
Fisher & Velasco (2014a)Fisher, R.S. and Velasco, A.L.Electrical brain stimulation for epilepsy.Nature Reviews Neurology, 10(5):261–270, 2014a.
Fisher & Velasco (2014b)Fisher, R.S. and Velasco, A.L.Electrical brain stimulation for epilepsy.Nature Reviews Neurology, 10(5):261–270, 2014b.
Goldberger etal. (2000)Goldberger, A.L., Amaral, L.A., Glass, L., Hausdorff, J.M., Ivanov, P.C., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.-K., and Stanley, H.E.Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals.circulation, 101(23):e215–e220, 2000.
Gotman (1990)Gotman, J.Automatic seizure detection: improvements and evaluation.Electroencephalography and clinical Neurophysiology, 76(4):317–324, 1990.
Harrer etal. (2019)Harrer, S., Shah, P., Antony, B., and Hu, J.Artificial intelligence for clinical trial design.Trends in pharmacological sciences, 40(8):577–591, 2019.
He etal. (2016)He, K., Zhang, X., Ren, S., and Sun, J.Deep residual learning for image recognition.In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
Ho & Armanfard (2023)Ho, T. K.K. and Armanfard, N.Self-supervised learning for anomalous channel detection in eeg graphs: application to seizure analysis.In Proceedings of the AAAI Conference on Artificial Intelligence, volume37, pp. 7866–7874, 2023.
Hochreiter & Schmidhuber (1997)Hochreiter, S. and Schmidhuber, J.Long short-term memory.Neural computation, 9(8):1735–1780, 1997.
Iešmantas & Alzbutas (2020)Iešmantas, T. and Alzbutas, R.Convolutional neural network for detection and classification of seizures in clinical data.Medical & Biological Engineering & Computing, 58:1919–1932, 2020.
Jasper (1958)Jasper, H.H.Ten-twenty electrode system of the international federation.Electroencephalogr Clin Neurophysiol, 10:371–375, 1958.
Kingma & Ba (2014)Kingma, D.P. and Ba, J.Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014.
Lee etal. (2022)Lee, K., Jeong, H., Kim, S., Yang, D., Kang, H.-C., and Choi, E.Real-time seizure detection using eeg: a comprehensive comparison of recent approaches under a realistic setting.arXiv preprint arXiv:2201.08780, 2022.
Li etal. (2017)Li, Y., Yu, R., Shahabi, C., and Liu, Y.Graph convolutional recurrent neural network: Data-driven traffic forecasting.arXiv preprint arXiv:1707.01926, 7(8), 2017.
Li etal. (2022)Li, Z., Hwang, K., Li, K., Wu, J., and Ji, T.Graph-generative neural network for eeg-based epileptic seizure detection via discovery of dynamic brain functional connectivity.Scientific Reports, 12(1):18998, 2022.
Loshchilov & Hutter (2016)Loshchilov, I. and Hutter, F.Sgdr: Stochastic gradient descent with warm restarts.arXiv preprint arXiv:1608.03983, 2016.
Mordvintsev etal. (2020)Mordvintsev, A., Randazzo, E., Niklasson, E., and Levin, M.Growing neural cellular automata.Distill, 5(2):e23, 2020.
Morris etal. (2019)Morris, C., Ritzert, M., Fey, M., Hamilton, W.L., Lenssen, J.E., Rattan, G., and Grohe, M.Weisfeiler and leman go neural: Higher-order graph neural networks.In Proceedings of the AAAI conference on artificial intelligence, volume33, pp. 4602–4609, 2019.
Nakkiran etal. (2021)Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I.Deep double descent: Where bigger models and more data hurt.Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021.
Obeid & Picone (2016)Obeid, I. and Picone, J.The temple university hospital eeg data corpus.Frontiers in neuroscience, 10:196, 2016.
O’Shea etal. (2020)O’Shea, A., Lightbody, G., Boylan, G., and Temko, A.Neonatal seizure detection from raw multi-channel eeg using a fully convolutional architecture.Neural Networks, 123:12–25, 2020.
Pajouheshgar etal. (2023)Pajouheshgar, E., Xu, Y., Zhang, T., and Süsstrunk, S.Dynca: Real-time dynamic texture synthesis using neural cellular automata.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20742–20751, 2023.
Pascanu etal. (2013)Pascanu, R., Mikolov, T., and Bengio, Y.On the difficulty of training recurrent neural networks.In International conference on machine learning, pp. 1310–1318. Pmlr, 2013.
Randazzo etal. (2020)Randazzo, E., Mordvintsev, A., Niklasson, E., Levin, M., and Greydanus, S.Self-classifying mnist digits.Distill, 5(8):e00027–002, 2020.
Saab etal. (2020)Saab, K., Dunnmon, J., Ré, C., Rubin, D., and Lee-Messer, C.Weak supervision as an efficient approach for automated seizure detection in electroencephalography.NPJ digital medicine, 3(1):59, 2020.
Shaeri etal. (2024)Shaeri, M.A., Shin, U., Yadav, A., Caramellino, R., Rainer, G., and Shoaran, M.33.3 mibmi: A 192/512-channel 2.46 mm² miniaturized brain-machine interface chipset enabling 31-class brain-to-text conversion through distinctive neural codes.In 2024 IEEE International Solid-State Circuits Conference (ISSCC), volume67, pp. 546–548. IEEE, 2024.
Shah etal. (2018)Shah, V., VonWeltin, E., Lopez, S., McHugh, J.R., Veloso, L., Golmohammadi, M., Obeid, I., and Picone, J.The temple university hospital seizure detection corpus.Frontiers in neuroinformatics, 12:83, 2018.
Shin etal. (2022)Shin, U., Ding, C., Zhu, B., Vyza, Y., Trouillet, A., Revol, E.C., Lacour, S.P., and Shoaran, M.Neuraltree: A 256-channel 0.227- $\mu$ j/class versatile neural activity classification and closed-loop neuromodulation soc.IEEE Journal of Solid-State Circuits, 57(11):3243–3257, 2022.
Shoaran etal. (2016)Shoaran, M., Shahshahani, M., Farivar, M., Almajano, J., Shahshahani, A., Schmid, A., Bragin, A., Leblebici, Y., and Emami, A.A 16-channel 1.1 mm 2 implantable seizure control soc with sub- $\mu$ w/channel consumption and closed-loop stimulation in 0.18 $\mu$ m cmos.In 2016 IEEE Symposium on VLSI Circuits (VLSI-Circuits), pp. 1–2. Ieee, 2016.
Shoaran etal. (2018)Shoaran, M., Haghi, B.A., Taghavi, M., Farivar, M., and Emami-Neyestanak, A.Energy-efficient classification for resource-constrained biomedical applications.IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 8(4):693–707, 2018.
Shuman etal. (2013)Shuman, D.I., Narang, S.K., Frossard, P., Ortega, A., and Vandergheynst, P.The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains.IEEE signal processing magazine, 30(3):83–98, 2013.
Siddiqui etal. (2020)Siddiqui, M.K., Morales-Menendez, R., Huang, X., and Hussain, N.A review of epileptic seizure detection using machine learning classifiers.Brain informatics, 7(1):1–18, 2020.
Srivastava etal. (2014)Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R.Dropout: a simple way to prevent neural networks from overfitting.The journal of machine learning research, 15(1):1929–1958, 2014.
Strein etal. (2019)Strein, M., Holton-Burke, J.P., Smith, L.R., and Brophy, G.M.Prevention, treatment, and monitoring of seizures in the intensive care unit.Journal of Clinical Medicine, 8(8):1177, 2019.
Sun & Morrell (2014)Sun, F.T. and Morrell, M.J.The rns system: responsive cortical stimulation for the treatment of refractory partial epilepsy.Expert review of medical devices, 11(6):563–572, 2014.
Tang etal. (2021)Tang, S., Dunnmon, J.A., Saab, K., Zhang, X., Huang, Q., Dubost, F., Rubin, D.L., and Lee-Messer, C.Self-supervised graph neural networks for improved electroencephalographic seizure analysis.arXiv preprint arXiv:2104.08336, 2021.
Thodoroff etal. (2016)Thodoroff, P., Pineau, J., and Lim, A.Learning robust features using deep learning for automatic seizure detection.In Machine learning for healthcare conference, pp. 178–190. PMLR, 2016.
Vaswani etal. (2017)Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., and Polosukhin, I.Attention is all you need.Advances in neural information processing systems, 30, 2017.
Voulodimos etal. (2018)Voulodimos, A., Doulamis, N., Doulamis, A., Protopapadakis, E., etal.Deep learning for computer vision: A brief review.Computational intelligence and neuroscience, 2018, 2018.
Yan etal. (2022a)Yan, J., Li, J., Xu, H., Yu, Y., and Xu, T.Seizure prediction based on transformer using scalp electroencephalogram.Applied Sciences, 12(9):4158, 2022a.
Yan etal. (2022b)Yan, J., Li, J., Xu, H., Yu, Y., and Xu, T.Seizure prediction based on transformer using scalp electroencephalogram.Applied Sciences, 12(9):4158, 2022b.
Zhu etal. (2020)Zhu, B., Farivar, M., and Shoaran, M.Resot: Resource-efficient oblique trees for neural signal classification.IEEE Transactions on Biomedical Circuits and Systems, 14(4):692–704, 2020.
Zhu etal. (2021)Zhu, B., Shin, U., and Shoaran, M.Closed-loop neural prostheses with on-chip intelligence: A review and a low-latency machine learning model for brain state detection.IEEE transactions on biomedical circuits and systems, 15(5):877–897, 2021.

Appendix Introduction

The Appendix is organised as followes:

•
Preprocessing details are outlined in AppendixA.
•
The mathematical proof addressing the avoidance of gradient vanishing in our model is provided in AppendixB.
•
Seizure analyses results are presented in AppendixC.
•
Hyperparameter selection and training details for all models are discussed in AppendixD.
•
The impact of BCE and MSE loss on training Rest is compared in AppendixE.
•
Training times are documented in AppendixF.
•
Details explaining how Rest avoids overfitting are shown in AppendixG.
•
Differences between various graph structures are explored in AppendixH.
•
Information about baseline compression is provided in AppendixI.
•
F1-scores for seizure detection are presented in AppendixJ.
•
The effectiveness of binary random masking on different RNN variants is shown in AppendixK.
•
Size comparisons for models with the same number of neurons are provided in AppendixL.
•
Real-time evaluations of different models with overlapping windows are detailed in AppendixM.
•
An ablation study on the inference performance of Rest with and without binary random masking is presented in AppendixN.

Appendix A Details of Preprocessing

We initially performed general preprocessing on the EEG data followed by specific steps for each detection and classification tasks:

A.1 TUSZ dataset

General Preprocessing: The EEG signals in the TUH EEG Corpus (TUSZ) dataset were initially sampled at various frequencies. As a part of the preprocessing pipeline, all signals were uniformly resampled to 200 Hz. Subsequently, EEG clips were extracted using the natural choice of one-second, non-overlapping windows, resulting in an EEG tensor $X\in\mathbb{R}^{T\times L\times N}$ , where $T$ represents clip lengths (ranging from 4, 6, 8, 10, 12, to 14 seconds), $N$ is the number of electrodes (19), and $L$ is the number of time samples (200). To harness the effectiveness of Fourier transform for neural EEG recordings, fast Fourier transform was applied to extract frequency components for each node at each time point. The log-amplitude of the frequencies was then computed and only non-negative frequency components were extracted similar to prior studies (Tang etal., 2021; Ahmedt-Aristizabal etal., 2020) leading to EEG clip tensor of $X\in\mathbb{R}^{T\times M\times N}$ with $M$ =100. Last, we have z-normalized the EEG clips across their second dimension for further analyses.

Preprocessing for Seizure Detection: For seizure detection after extracting EEG clips from the entire training set consisting of 5545 sessions, a binary label was assigned, with $y=1$ indicating the presence of at least one seizure within the clip and $y=0$ otherwise. To handle the issue of a substantial number of background clips in the dataset, non-seizure clips were randomly selected to achieve a balanced representation with seizure clips in the training data. Also, the last clip was dropped for each EEG data if the recording ends before the clip could reach it’s length.

Preprocessing for Seizure Classification:For seizure classification followed by Tang etal. (2021); Ahmedt-Aristizabal etal. (2020) we have removed the background data and only processed the seizure clips. We have started 2 seconds before the annotated seizure for tolerance in the annotations. Then we have labeled the clip $y=0$ for general non-specific (GN), $y=1$ for combined tonic (TC), $y=2$ for absence (AB), $y=3$ for focal, and $y=4$ for complex parietal (CP) seizures. Moreover, if seizure event is shorter than the clip length we have truncated the clip to avoid having multiple seizures in one clip. Also, it is noteworthy that while the training set included simple partial seizures, these seizure types were absent in the evaluation set. Therefore, we excluded simple parietal seizures from the classification task. Also, because the clips for seizure classification may have different lengths we pad 0’s to the end of the clip to assure all samples share the same length.

A.2 CHB-MIT Dataset

For the CHB-MIT dataset, we randomly selected 18 patients for training, 3 for evaluation, and 3 for testing. We followed the same preprocessing pipeline as described for the TUSZ dataset, with the exception of maintaining a uniform sampling rate of 256Hz for all patients. For each 1-second time window, we have 256 samples of raw EEG data per channel. The number of channels is consistent with the TUSZ dataset, comprising 19 channels, and we excluded any sessions with a different number of channels.

We utilized the same frequency domain components for seizure detection. Unlike the TUSZ dataset, the CHB-MIT dataset does not include seizure types for classification. The results are reported based on five different random seeds for the train/test/evaluation splits (more details at Table7).

Case	Number of Seizures	Number of Sessions	Age
1	7	24	11
2	3	36	11
3	7	38	14
4	4	42	22
5	5	39	7
6	10	18	1.5
7	3	19	14.5
8	5	20	3.5
9	4	19	10
10	7	25	3
11	3	35	12
12	27	24	2
13	10	33	3
14	8	26	9
15	20	40	16
16	8	19	7
17	3	21	12
18	6	36	18
19	3	30	19
20	8	29	6
21	4	33	13
22	3	31	9
23	7	9	6
24	16	22	Unknown

Appendix B Preventing Gradient Vanishing with Residual Update

In equations Equations3, 4 and5, the model’s state is updated using a residual state update. When we take the derivative of $S^{t-1}$ concerning the forward propagation of Equation3, we get:

\frac{\partial\mathcal{L}}{\partial S^{t-1}}=\frac{\partial\mathcal{L}}{%\partial S^{t}}\frac{\partial{S^{t}}}{\partial S^{t-1}}=\frac{\partial\mathcal%{L}}{\partial S^{t}}\Big{(}1+\frac{\partial{\delta S^{t}}}{\partial S^{t-1}}%\Big{)}=\frac{\partial\mathcal{L}}{\partial S^{t}}+\frac{\partial\mathcal{L}}{%\partial S^{t}}\frac{\partial{\delta S^{t}}}{\partial S^{t-1}}.

(14)

Here the $\mathcal{L}$ is the loss function to be minimized. This equation shows that the gradient of the previous state $S^{t-1}$ always has a term $\frac{\partial\mathcal{L}}{\partial S^{t}}$ directly added. This helps prevent the gradients of $\frac{\partial\mathcal{L}}{\partial S^{t-1}}$ from becoming too small, even when the gradients of the previous updates are small, i.e., $\frac{\partial\mathcal{L}}{\partial S^{t}}\frac{\partial{\delta S^{t}}}{%\partial S^{t-1}}$ .

Appendix C ROC Curves and Confusion Matrices for Different Clip Lengths

Efficient and Accelerated EEG Seizure Analysis through Residual State Updates (4)

Efficient and Accelerated EEG Seizure Analysis through Residual State Updates (5)

Appendix D Model Training and Hyperparameter Selection Details

Here are the details of training and hyperparameter selection for Rest and baselines:

Rest Hyperparameters:We optimized the following hyperparameters for Rest based on the lowest validation error:a) Number of neurons in each graph convolution layer within the range [16, 32, 64];b) Initial learning rate within the range [5e-4, 1e-4];c) Success probability of the random binary mask within [0.1, 0.3, 0.5, 0.7, 1].For multi-update Rest, the number of updates for each time point was randomly selected an inteager from the interval [1, 10]. We conducted training for 500 epochs using a Multistep learning rate scheduler. Five experiments were run in PyTorch with different random seeds.

DCRNN:We followed the hyperparameter tuning strategy from the original paper (Tang etal., 2021) for both DCRNN with and without self-supervision tasks. The hyperparameter search on the validation set included:a) Initial learning rate within the range [5e-5, 1e-3];b) Number of Diffusion Convolutional Gated Recurrent Units (DCGRU) layers within the range {2, 3, 4, 5} and hidden units within the range {32, 64, 128};c) Maximum diffusion step K $\in$ {2, 3, 4};d) Dropout probability in the last fully connected layer.For self-supervised pre-training, we utilized mean absolute error (MAE) as the loss function. The models underwent training for 350 epochs with an initial learning rate of 5e-4, employing a maximum diffusion step of 1 and 64 hidden units in both the encoder and decoder. Moreover, cosine annealing learning rate scheduler (Loshchilov & Hutter, 2016) was used as scheduler.

CNN-LSTM: For the baseline CNN-LSTM, we adopt the identical model architecture outlined in Ahmedt-Aristizabal etal. (2020). This configuration employes two stacked convolutional layers with 32 kernels of size 3 $\times$ 3, one max-pooling layer of size 2 $\times$ 2, one fully-connected layer with an output neuron count of 512, two stacked LSTM layers with a hidden size of 128, and one additional fully connected layer.

LSTM: We employed two stacked RNN layers, each with 64 hidden units, and an additional fully connected layer for the final prediction.

GRU: For the GRU model, we used same number of layers and hidden units as LSTM.

ResNet-LSTM: We followed two versions with and without dilation described at Lee etal. (2022).

Transformer: We implemented a two-layer multi-head attention mechanism with 64 embedding dimensions and 16 heads for the transformer architecture. Additionally, we utilized time positional encoding as introduced by Vaswani etal. (2017) for the original positional encoding.

For detection task for all models binary cross entropy loss was used exept for Rest which MSE performs slightly higher during the validation step. For classification task weighted binary cross entropy was employed due to the highly imbalancy among different seizure types.

Appendix E Comparison Between MSE and BCE loss for Training Rest

Rest was trained for seizure detection using both MSE and Binary Cross Entropy (BCE) loss functions. However, MSE outperformed BCE in terms of stability and accuracy. This advantage is attributed to BCE’s tendency for unbounded growth in classification logits, hindering residual updates and message passing between graph nodes, particularly in multi-update scenarios, as discussed in Randazzo etal. (2020). As shown in Figure 4 MSE has less fluctuations and more stability in validation error during training compared to BCE loss when training Rest with multiple updates.

Seizure Detection performance
Loss Function	AUROC
Rest BCE	80.4
Rest MSE	83.6

Efficient and Accelerated EEG Seizure Analysis through Residual State Updates (6)

Appendix F Training Time

Bellow we report the time needed for training each model (Table9). All the models were trained on the same NVIDIA A100 GPU and the number of parameters and model size has reported at Tables6 and4. Rest requires more training time to adopt itself and converge to a stable point specially to adapt its update cell with multiple random updates

Seizure Detection							Seizure Classification
Model	4-s	6-s	8-s	10-s	12-s	14-s	10-s
LSTM	5	5	5	6	7	7	4
GRU	5	5	5	6	7	8	4
CNN-LSTM	8	8	8	9	9	10	5
ResNet-LSTM	9	9	10	10	12	12	6
ResNet-LSTM-Dilation	9	9	10	10	12	12	6
DCRNN	20	22	23	25	28	30	20
DCRNN w/SS	23	30	35	40	48	60	35
Transformer	12	12	13	14	14	16	8
Rest_(DS)	45	47	50	53	55	60	10
Rest_(RS)	45	47	50	53	55	60	10
Rest_(RM)	70	75	80	90	95	100	25

Appendix G Rest Combat Forgetting at Each Time Point

While updating Rest specially when the update cell includes multiple updates Rest avoids forgetting the input by updating its state based on the affine mapping of the previous state and the input. As an example we consider two following setting:

Setting 1: Updating the state based on previous state only where first the state is initialized as $S^{t}_{i}=WX^{t}+US^{t}_{i}$ and then it will iteratively update the state $S^{t}_{i}$ as follows:

\delta S^{t}_{i}=\mathcal{G}_{\Theta}(S^{t}_{i}),

(15)

S^{t}_{i+1}=S^{t}_{i}+\delta S^{t}_{i}\odot B.

(16)

Setting 2: Updating the state based on affine mapping of current input and previous state for iteratively update the state $S^{t}_{i}$ as follows:

H^{t}_{i}=WX^{t}+US^{t}_{i}

(17)

\delta S^{t}_{i}=\mathcal{G}_{\Theta}(H^{t}_{i}),

(18)

S^{t}_{i+1}=H^{t}_{i}+\delta S^{t}_{i}\odot B.

(19)

In Setting 1, after mapping from the input to the state space, the state is updated only based on the previous state. This setup poses a risk of the model forgetting information from the current input, especially if the update cell iteratively modifies the state multiple times. This situation hinders state from converging to a stable point and simply diverges due to neglecting the input data. In Setting 2, represented by Rest’s update cell, the input plays a crucial role and is actively involved in the iterative update process, as shown in equations Equations17, 18 and19. This design prevents the model from forgetting information from the current time input $X^{t}$ , promoting convergence of the state to a more meaningful final state by utilizing the input’s information throughout the updates.

As illustrated in Figure7, Setting 1 fails to converge to a stable point, and the validation loss remains unchanged throughout the training process.

Efficient and Accelerated EEG Seizure Analysis through Residual State Updates (7)

Appendix H Comparison Between Different Gaussian Kernels Threshold for EEG Distance Graph

Here we illustrate different distance graph constructions based on different thresholds or the Gaussian kernel. The lower $k$ values (i.e. 0.6) results in missing connection between nodes and large $k$ thresholds results in connecting nodes which are far away. Similar to Tang etal. (2021) we also choose $k=0.9$ as threshold which resembles the EEG montage (longitudinal bipolar and transverse bipolar) (Acharya etal., 2016) and results in a reasonable node connection.

Efficient and Accelerated EEG Seizure Analysis through Residual State Updates (8)

Appendix I Compressing Baseline Models

We tried to compress existing models for seizure detection and classification, achieving performance comparable to those described in Tang etal. (2021). However, in case of LSTM and CNN-LSTM models shrinking the model size without a significant performance drop proved challenging. We matched the performance reported in Tang etal. (2021) for DCRNN and DCRNN w/SS models with only one diffusion convolution gated recurrent unit, and reduced the model size by half, from 2.7 MB to less than 1 MB for DCRNN. Furthermore, for the seizure detection task, we achieved the same accuracy with 126K parameters compared to the original paper’s 168,641 parameters.

For the classification task, the original paper (Tang etal., 2021) reported 280,964 parameters for DCRNN and 417,572 parameters for DCRNN w/SS. In our compressed models, we achieved 126K parameters for DCRNN and 330K for DCRNN w/SS, successfully reducing the model size by a factor of 2 for DCRNN and a factor of 1.5 for DCRNN w/SS.

Despite successful reductions, the compressed models still possess a considerable number of parameters, especially in the presence of a gating mechanism, highlighting the non-parameter efficiency and memory demands associated with existing models for seizure detection.

Appendix J F1-Score for Seizure detection

Below is the F1-score (weighted averaged) results for seizure detection task on TUSZ dataset.

Model	4-s	6-s	8-s	10-s	12-s	14-s
LSTM	82.3	69.9	79.5	80.5	72.7	73.2
CNN-LSTM	70.1	69.5	75.3	73.5	68.3	67.5
GRU	82.7	69.9	81.6	80.5	81.0	71.3
RestNet-LSTM	79.7	78.2	80.1	75.1	77.0	76.3
RestNet-Dilation-LSTM	80.5	80.4	79.0	76.6	75.0	74.6
Transformer	78.45	79.3	78.5	82.0	79.1	79.2
DCRNN	81.2	80.2	81.6	80.0	74.2	72.0
DCRNN W/SS	75.2	81.1	81.2	81.0	75.7	76.0
Rest(RS)	69.5	68.4	78.3	79.1	74.7	74.1
Rest(RM)	81.0	75.2	83.2	81.0	75.7	76.2

Appendix K Binary Random Masking and Multiple Updates for Other RNNs

We conducted an ablation study to evaluate the performance of RNN baselines with single and multiple random updates, as shown in Table 11.

Model	Vanilla	RS	RM
RNN	77.3	80.1	80.8
GRU	73.5	72.8	73.6
LSTM	70.4	74.5	74.7

As shown, the RNN variants can improve their performance in seizure detection tasks using Rest update techniques.

Appendix L Size Comparison with 64 Number of Neurons for all Models

Model	Parameters (#)	Size (MB)
DCRNN w/SS	330K	1.319
DCRNN	126K	0.884
Transformer	48.3K	0.193
GRU	402K	1.61
ResNet-LSTM	7.5M	30.3
ResNet-LSTM-Dilation	7.5M	30.3
LSTM	536K	2.147
CNN-LSTM	6M	22.8
REST(DS)	27K	0.051
REST(RS)	27K	0.051
REST(RM)	27K	0.051

Appendix M More Evaluation for Real-Time Detection

We followed the real-time seizure detection framework described by Lee etal. (2022), using a 4-second clip length for seizure detection with a 3-second overlap between consecutive clips. We measured both the inference time and latency, the latter being the delay between the actual onset of a seizure and the model’s detection. Low latency is crucial to avoid late detection of seizure events. As shown in Table13, Rest achieves the lowest latency alongside the Transformer model, while also maintaining significantly lower inference times compared to all other baselines.

Model	AUCROC	Latency (s)	Inference (ms)
LSTM	75.5	0.31	3.254
GRU	76.1	0.4	2.12
RestNet-LSTM	79.1	0.3	6.78
RestNet-Dilation-LSTM	80.2	0.34	6.78
CNN-LSTM	81.3	0.26	5.624
DCRNN	79.7	0.25	9.67
Transformer	83	0.2	2.5
Rest(DS)	75.3	0.23	0.615
Rest(RS)	79.4	0.2	0.71
Rest(RM)	82.4	0.25	1.29

Appendix N Rest W/O Binary Random Mask during Inference

We evaluated Rest performance with and without masking over the inference in which similar to Srivastava etal. (2014) strategy the mask was removed and the incremental state update was scaled using the success probability of the binary mask ( $p$ ).

Model	W/ Inference Mask	W/O Inference Mask
Rest (RS)	81.8	81.5
Rest (RM)	83.6	82.9

Model	F1-Score	Size(MB)	Parameter(#)
LSTM	0.39	2.021	512K
GRU	0.44	1.92	553K
ResNet-LSTM	0.58	30.3	7.5M
ResNet-LSTM-Dilation	0.50	30.3	7.5M
CNN-LSTM	0.47	23.9	6M
DCRNN	0.54	0.506	126K
DCRNN w/SS	0.62	1.40	332K
Transformer	0.54	0.25	53K
Rest_(DS)	0.51	0.034	8.6K
Rest_(RS)	0.57	0.034	8.6K
Rest_(RM)	0.60	0.034	8.6K