Efficient and Accelerated EEG Seizure Analysis through Residual State Updates (2024)

Arshia Afzal  Grigorios Chrysos  Volkan Cevher  Mahsa Shoaran

Abstract

EEG-based seizure detection models face challenges in terms of inference speed and memory efficiency, limiting their real-time implementation in clinical devices. This paper introduces a novel graph-based residual state update mechanism (Rest) for real-time EEG signal analysis in applications such as epileptic seizure detection. By leveraging a combination of graph neural networks and recurrent structures, Rest efficiently captures both non-Euclidean geometry and temporal dependencies within EEG data. Our model demonstrates high accuracy in both seizure detection and classification tasks. Notably, Rest achieves a remarkable 9-fold acceleration in inference speed compared to state-of-the-art models, while simultaneously demanding substantially less memory than the smallest model employed for this task. These attributes position Rest as a promising candidate for real-time implementation in clinical devices, such as Responsive Neurostimulation or seizure alert systems.


1 Introduction

Brain disorders, including epilepsy, present substantial challenges globally, prompting the need for innovative approaches in diagnosis and treatment. Recurrent seizures, recognized as one of the most prevalent neurological emergencies (Strein et al., 2019), impact approximately 50 million people worldwide (Beghi et al., 2019).

Detecting changes in the rhythms of brain activity by monitoring electroencephalography (EEG) signals allows us to pinpoint the onset zone and time of seizures (Gotman, 1990; Siddiqui et al., 2020), making EEG an invaluable and extensively utilized tool for seizure detection and localization. Traditionally, neurological experts perform these tasks, involving the time-consuming process of manually labeling periods spanning from hours to days for each individual patient (Harrer et al., 2019; Ahmedt-Aristizabal et al., 2020). Several studies have explored the application of Machine Learning (ML) in seizure analysis, aiming to simplify the handling of large seizure datasets for experts (Tang et al., 2021; Ahmedt-Aristizabal et al., 2020; Covert et al., 2019; Siddiqui et al., 2020). These studies predominantly focus on deep models, known for their accuracy and suitability for clinical applications.

Taking inspiration from computer vision (Voulodimos et al., 2018), many studies have applied different variations of Convolutional Neural Networks (CNN) to seizure detection, as demonstrated in Saab et al. (2020). Various versions of Graph Neural Networks (GNN) effectively capture the non-Euclidean geometry of datasets like EEG signals, contributing to enhanced seizure detection and classification (Li et al., 2022; Tang et al., 2021; Ho & Armanfard, 2023). Additionally, to enhance the performance of deep neural networks and to account for the time-series nature of brain rhythms, different variations of Recurrent Neural Networks (RNN) have been utilized in seizure analysis (Ahmedt-Aristizabal et al., 2020).

While these models excel in achieving high accuracy in seizure detection and classification tasks, they often struggle with issues such as complexity, inefficient memory usage, and slow inference speeds. One of the main reasons behind this inefficiency lies in structures such as the gating mechanisms found in RNN models (e.g., Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997)) or the presence of deep convolutional layers in CNNs and GNNs.

Both inference time and memory storage considerations become critically important in the context of modern seizure treatment devices like Responsive Neurostimulation (RNS) and Deep Brain Stimulation (DBS) (Fisher & Velasco, 2014a; Sun & Morrell, 2014). These devices, which have shown promise in suppressing seizure attacks, require a small yet accurate ML model to trigger stimulation commands for symptom suppression (Shoaran et al., 2016; Shin et al., 2022). Furthermore, the model must exhibit low inference time in activating the stimulator to ensure its effectiveness (Fisher & Velasco, 2014b; Zhu et al., 2021). Unfortunately, the aforementioned methods do not achieve such low inference latency.

In this study, we introduce Rest, a graph-based residual update mechanism designed to efficiently capture both spatial and temporal information from EEG. Rest captures spatio-temporal dependencies in EEG signals without relying on the computationally expensive gating mechanisms commonly found in existing models (Hochreiter & Schmidhuber, 1997; Cho et al., 2014; Asif et al., 2020; Tang et al., 2021). The ability to dynamically capture spatial information over time and update the state accordingly contributes to the high accuracy of Rest in localizing and detecting seizures. Notably, Rest attains accuracy comparable to state-of-the-art models while processing inputs significantly faster during inference and substantially reducing computational and memory overhead (project page: https://arshiaafzal.github.io/REST/). Our contributions are as follows:

  • We present a novel graph-based residual update mechanism designed to capture spatio-temporal dependencies in EEG signals.

  • We enhance the model’s performance, while maintaining its small size and rapid detection and classification speed, by applying binary random masking to the state and performing multiple state updates.

  • Our model delivers predictions with an impressive inference latency of 1.29ms. This unmatched inference speed is achieved with a light memory footprint of 37KB.

  • Our model is 14× smaller than the smallest competitive models for seizure detection. Remarkably, our architecture can match the performance of state-of-the-art deep neural networks with fewer than 10K parameters.

2 Related Work

Many studies have attempted to develop ML and deep learning models for seizure detection (Siddiqui et al., 2020; O’Shea et al., 2020; Saab et al., 2020) and classification of seizure types (Ahmedt-Aristizabal et al., 2020; Iešmantas & Alzbutas, 2020; Tang et al., 2021). Here, we examine existing seizure detection and classification models, assessing their strengths and limitations across three key aspects. Firstly, we explore how these studies capture the spatio-temporal features present in EEG. Secondly, we delve into the inference speed and the impact of varying clip lengths on seizure analysis. Lastly, we study the memory requirements and model size of current models.

Spatio-Temporal Nature of EEG Signals: As introduced earlier, EEG signals involve both spatial and temporal components, which are pivotal for accurate analysis in epilepsy studies. Notably, some studies, like Asif et al. (2020), extract spectral features to represent temporal dependencies, incorporating them into a CNN architecture. In contrast, Saab et al. (2020) employ a CNN model that treats EEG signals as multi-channel images, a methodology that does not align with the time-series structure of EEG. Recent advancements involve the use of various RNN variants or transformers (Vaswani et al., 2017) to effectively capture temporal patterns in alignment with the intricate dynamics of EEG signals.

RNNs capture temporal dependencies within time-series data by mapping the input $x(t)$ into a latent space $h(t)$ and applying recurrence within that space through linear or non-linear transformations. Despite their effectiveness in capturing time-series dependencies, RNNs suffer from a significant challenge known as vanishing gradients. This issue occurs during backpropagation, causing gradients to diminish and hindering the effective learning of long-range dependencies in sequential data. To address the vanishing gradient problem (Pascanu et al., 2013), RNN variants like LSTM (Hochreiter & Schmidhuber, 1997) or the Gated Recurrent Unit (GRU) (Cho et al., 2014) leverage gating mechanisms, introducing different gates that contribute to creating the next state $h(t)$ from the current input $x(t)$ and the previous state $h(t-1)$. Thodoroff et al. (2016) used an LSTM-based model for seizure detection.

On the other hand, attention-based models, or transformers (Vaswani et al., 2017), are more complex than RNNs. Rather than constructing an explicit state, they directly use previous inputs to predict the future. However, this approach is more memory-intensive and time-demanding due to the necessity of retaining all prior inputs up to a specified time point and storing weights for each input to construct the attention matrix. Yan et al. (2022b) employed a transformer-based model for the seizure detection task.

In the context of EEG analysis, where spatial details are critical at each time point, a common strategy is to apply a CNN or graph convolution network independently at every time point, mapping it into a new feature space, and then use an RNN to capture temporal dependencies. Ahmedt-Aristizabal et al. (2020) employ such a CNN-LSTM model, addressing both spatial and temporal dependencies in EEG data.

[Table 1: Summary of existing seizure analysis models, compared on criteria A–E (see text): SeizureNet (Asif et al., 2020), Transformer (Yan et al., 2022a), EEG-CGS (Ho & Armanfard, 2023), GGN (Li et al., 2022), LSTM (Hochreiter & Schmidhuber, 1997), CNN-LSTM [1] (Ahmedt-Aristizabal et al., 2020), CNN-LSTM [2] (Thodoroff et al., 2016), DCRNN (Tang et al., 2021), and Rest (ours).]

Nevertheless, these approaches assume Euclidean geometry for EEG signals, overlooking the natural geometry of electrode placement (Figure 1a) and brain network connectivity (Tang et al., 2021). Recent studies exploit GNNs and graph-based modeling to capture the non-Euclidean geometry of EEG signals (Tang et al., 2021; Ho & Armanfard, 2023; Covert et al., 2019; Li et al., 2022). For instance, Tang et al. (2021) implement a self-supervised diffusion graph convolution model for both detection and classification tasks. Similarly, Ho & Armanfard (2023) employ a self-supervised graph network for channel anomaly detection. These studies (Ho & Armanfard, 2023; Tang et al., 2021) align more closely with the dynamic changes in EEG rhythms by replacing the weights of the RNN network with graph convolution filters. This approach represents the evolution of spectral features within each time point of the time-series data, offering a more integrated approach compared to the sequential mapping from CNN to LSTM (Ahmedt-Aristizabal et al., 2020).

Significance of Inference Time: Timely detection of seizure events is essential for the efficacy of closed-loop epileptic treatments such as RNS and DBS (Shoaran et al., 2016). To the best of our knowledge, most previous studies either overlook the importance of inference runtime or, as observed in Asif et al. (2020), report a 90 ms delay for producing predictions. This delay is still significant, especially for edge devices like RNS and DBS. Furthermore, current studies often evaluate models using a limited range of long window sizes, typically exceeding 10 seconds or even 1 minute (Tang et al., 2021; Saab et al., 2020). However, shorter window sizes are preferable for real-time seizure detection and responsive intervention (Christou et al., 2022; Zhu et al., 2020). The chosen window size influences a model’s ability to localize seizures and its overall detection performance. For instance, a model designed for extended window sizes may lose accuracy in short-term seizure detection scenarios, an aspect that has not been extensively explored in the literature.

Memory Requirement in Seizure Detection Models: While numerous studies have focused on enhancing the accuracy of seizure detection and classification, the crucial aspect of memory demand remains largely overlooked. For instance, Tang et al. (2021) use 240K parameters with complex gating units, Ho & Armanfard (2023) employ 58K parameters for channel anomaly detection, and Asif et al. (2020) address seizure classification with a substantial 45.94 million parameters. These examples underscore the need for an efficient model tailored to seizure detection and classification, especially one suitable for resource-constrained stimulation devices deployed at the edge, which do not have access to extensive memory for model weights and states (Zhu et al., 2020).

In Table 1, we present a summary of current models, highlighting their respective strengths and weaknesses.

[Figure 1: EEG electrode placement and the distance graph (a), and the Rest update cell (c).]

3 Method

Below, we first formulate the tasks of seizure detection and classification, outlining the graph representation of EEG signals. Next, we describe the design of Rest’s structure using various updating strategies.

3.1 Seizure Detection and Classification Problem Setting

Following the preprocessing of raw EEG signals and the construction of the EEG graph, we obtain an EEG clip $X$ and a label $y$ for both detection and classification tasks. Here, $X \in \mathbb{R}^{T \times M \times N}$ with $N$ electrodes, $T$ time points, and $M$ features per node, while $y$ denotes the label. For detection, the label is binary, whereas for classification, the label falls within the range $\{0,1,2,3,4\}$, where each class represents a unique seizure type (the five seizure types are focal, generalized non-specific, complex partial, absence, and tonic-clonic). The goal for both tasks is to predict the label $y$ from a given EEG clip $X$.
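For concreteness, the shapes involved could look as follows (a toy sketch with illustrative values; $T=10$ and $M=100$ correspond to the 10-second TUSZ clips described in Section 5.1):

```python
# Hypothetical shapes for one preprocessed EEG clip and its labels (values are illustrative).
import numpy as np

T, M, N = 10, 100, 19          # time points, features per node, EEG electrodes (10/20 system)
X = np.random.randn(T, M, N)   # one EEG clip
y_detection = 1                # detection: binary label (seizure present or absent)
y_classification = 3           # classification: one of {0, 1, 2, 3, 4}, one per seizure type
```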

3.2 EEG Distance Graph Construction

For each EEG clip, we define a graph $\mathcal{G}=\{\mathcal{V},\mathcal{E},\mathcal{A}\}$, where $\mathcal{V}=\{v_1,\ldots,v_N\}$ represents the nodes corresponding to EEG electrodes, $\mathcal{E}$ represents the edges, and $\mathcal{A}\in\mathbb{R}^{N\times N}$ denotes the adjacency matrix of the graph, with $N$ the number of nodes, which for EEG data is the number of electrodes. We build a distance-based EEG graph (Figure 1a) that precisely represents the electrode placement geometry in the standard 10/20 system (Jasper, 1958). Unlike correlation graphs, our graph remains static over time, reducing computation during inference, as the graph structure does not need to be constructed for each input (Ho & Armanfard, 2023). Details regarding the choice of $k$ and visualizations of distance graphs for different threshold values can be found in Appendix H.

For a distance graph, the adjacency matrix is constructed using the distance between electrode locations, as in previous studies (Tang et al., 2021; Li et al., 2022; Ho & Armanfard, 2023). As the EEG electrode placements are fixed, the adjacency matrix remains unchanged over time. Thus, each element $a_{ij}\in\mathcal{A}$ is given by:

$$a_{ij}=\begin{cases}\exp\!\left(-\dfrac{\|v_i-v_j\|^2}{\sigma^2}\right) & \text{if } \|v_i-v_j\|\leq k,\\[4pt] 0 & \text{otherwise},\end{cases}\qquad(1)$$

where $\sigma$ is the standard deviation of the distances and $k$ is the Gaussian kernel's threshold (Shuman et al., 2013).
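As a concrete illustration, the sketch below (our own, not the authors' released code) builds the thresholded Gaussian-kernel adjacency of Equation 1 from hypothetical 2-D electrode coordinates; the coordinate values and the threshold $k$ are placeholders:

```python
# Sketch of the distance-based adjacency construction in Equation (1).
import numpy as np

def build_distance_adjacency(coords: np.ndarray, k: float) -> np.ndarray:
    """coords: (N, d) electrode positions; returns the (N, N) adjacency matrix A."""
    diff = coords[:, None, :] - coords[None, :, :]   # pairwise differences v_i - v_j
    dist = np.linalg.norm(diff, axis=-1)             # ||v_i - v_j||
    sigma = dist.std()                               # standard deviation of the distances
    A = np.exp(-(dist ** 2) / (sigma ** 2))          # Gaussian kernel weights
    A[dist > k] = 0.0                                # threshold: keep nearby electrodes only
    return A

coords = np.random.rand(19, 2)                       # 19 electrodes of the 10/20 system (dummy positions)
A = build_distance_adjacency(coords, k=0.4)
```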

3.3 Residual State Update

Similar to RNNs, Rest initially maps the input into a latent space, evolving the state over time to reach the final output. In contrast to RNNs, Rest updates the state using a novel approach that avoids the complexity of gating mechanisms like LSTM or GRU, efficiently addressing the vanishing gradient problem with fewer parameters (details in Appendix B). For mapping to the state space, Rest employs a linear mapping represented as:

$$H^t = W X^t + U S^{t-1}.\qquad(2)$$

Here, $X^t\in\mathbb{R}^{M\times N}$ represents the input, in our case the preprocessed EEG clip at time point $t\in\{1,\ldots,T\}$, and $S^{t-1}\in\mathbb{R}^{Q\times N}$ is the previous state of the model at time point $t-1$. $W\in\mathbb{R}^{Q\times M}$ and $U\in\mathbb{R}^{Q\times Q}$ are the weights of the affine mapping, with $Q$ the state size, while $H^t\in\mathbb{R}^{Q\times N}$ represents the state of Rest prior to the update. Inspired by He et al. (2016), Rest uses a residual mechanism to update its latent state:

$$S^t = H^t + \delta S^t.\qquad(3)$$

Here, $S^t$ is the next state of the model and $\delta S^t$ is the incremental update to the model's state. The critical aspect lies in extracting $\delta S^t$ so that it aligns with the spatial changes in EEG dynamics at each time point. For this purpose, we utilize the graph convolution introduced by Morris et al. (2019). We opt for this graph convolution because of its simple structure, which is well suited to our application. The graph convolution is defined as follows:

$$O^t_{[:,i]} = \sigma\Big(\Theta_1 H^t_{[:,i]} + \Theta_2 \sum_{j\neq i} a_{ij} H^t_{[:,j]}\Big),\qquad(4)$$

where $O^t\in\mathbb{R}^{Q\times N}$ is the output of the convolutional filter with $Q$ features per node, $\Theta_1,\Theta_2\in\mathbb{R}^{Q\times Q}$ parameterize the first and second convolutional filters, $a_{ij}$ represents the edge (here, the adjacency matrix element) between nodes $i,j\in\{1,\ldots,N\}$, and $\sigma$ is the activation function. We denote the graph convolution in Equation 4 as $\mathcal{G}_{\Theta}(H^t)$. Note that in Equation 4 the summation is performed over the neighbors of each node. Since $a_{ij}=0$ for non-neighbor nodes, we can take the sum over all nodes, implicitly incorporating only the neighbor nodes.

The update for the state, $\delta S^t$, leveraging the graph convolution, is expressed as follows:

$$\delta S^t = \mathcal{G}_{\Theta}(H^t).\qquad(5)$$

This approach aligns well with the spatial dynamics of EEG signals. We refer to the process of updating the state of our model using Equations 2, 3 and 5 as the update cell of Rest (Figure 1c).
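A minimal PyTorch sketch of a single update cell could look as follows; the class names, the (batch, nodes, features) tensor layout, and the state size $Q=32$ are our own illustrative choices, with the two graph-convolution layers (ReLU then linear) following the description in Section 5.1:

```python
# Sketch of one Rest update cell (Eqs. 2-5): H = W X^t + U S^{t-1}, dS = GraphConv(H), S = H + dS.
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """Graph convolution of Morris et al. (2019): Theta_1 h_i + Theta_2 * sum_j a_ij h_j."""
    def __init__(self, q, activation):
        super().__init__()
        self.theta1 = nn.Linear(q, q, bias=False)
        self.theta2 = nn.Linear(q, q, bias=False)
        self.act = activation

    def forward(self, H, A):
        # H: (batch, N, Q) node features; A: (N, N) static distance adjacency
        # (zero the diagonal so the aggregation runs over j != i, as in Eq. 4).
        neighbors = torch.einsum("ij,bjq->biq", A, H)
        return self.act(self.theta1(H) + self.theta2(neighbors))

class RestCell(nn.Module):
    def __init__(self, m, q):
        super().__init__()
        self.W = nn.Linear(m, q, bias=False)        # input-to-state weights W
        self.U = nn.Linear(q, q, bias=False)        # state-to-state weights U
        self.gc1 = GraphConv(q, nn.ReLU())          # first graph-conv layer (ReLU)
        self.gc2 = GraphConv(q, nn.Identity())      # second graph-conv layer (linear)

    def forward(self, X_t, S_prev, A):
        H = self.W(X_t) + self.U(S_prev)            # Eq. (2)
        dS = self.gc2(self.gc1(H, A), A)            # Eq. (5) with two stacked filters
        return H + dS                               # Eq. (3)

cell = RestCell(m=100, q=32)
S = torch.zeros(4, 19, 32)                          # initial state for a batch of 4 clips
S = cell(torch.randn(4, 19, 100), S, torch.rand(19, 19))
```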

3.4 Binary Random Mask: Continuous Dropout during Inference

To combat overfitting in deep neural networks, Dropout is commonly employed, randomly dropping units during training and retaining all of them at test time (Srivastava et al., 2014). Drawing inspiration from a similar concept in Mordvintsev et al. (2020), we introduce binary masking for state updates, preventing overfitting while enabling the model to learn random state updates. This approach prevents the model from overfitting and accelerates inference at test time by skipping computations for zero-masked feature points in the update. The state update simply becomes:

$$S^t = H^t + \delta S^t \odot B.\qquad(6)$$

Here, $\odot$ denotes the Hadamard product, and $B\in\mathbb{R}^{Q\times N}$ is the binary mask with entries $B_{ij}\sim\mathcal{B}(p)$ drawn from a Bernoulli distribution, where $B_{ij}$ takes the value 1 with probability $p$, which is treated as a hyperparameter of the model.
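In code, Equation 6 amounts to one extra element-wise product; a small sketch (with $p$ as a hypothetical keep probability) is:

```python
# Masked residual update (Eq. 6): resample a Bernoulli(p) mask and gate the update with it.
import torch

def masked_update(H, dS, p=0.5):
    B = torch.bernoulli(torch.full_like(dS, p))   # entries are 1 with probability p
    return H + dS * B                             # zero-masked entries keep S^t = H^t there
```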

Table 2. TUSZ train/evaluation split: EEG files and patients (with the percentage containing seizures), and the number of seizures (sessions) per seizure type.

| Split | EEG files (% seizures) | Patients (% seizures) | GN | TC | AB | FN | CP |
|---|---|---|---|---|---|---|---|
| Train | 4664 (5.34%) | 579 (36%) | 335 (152) | 30 (11) | 50 (15) | 1516 (496) | 279 (132) |
| Evaluation | 881 (5.82%) | 43 (79%) | 185 (54) | 57 (8) | 50 (1) | 240 (98) | 108 (32) |

3.5 Multiple Update Mechanism: Escaping the Memory Requirements of Stacked RNN Layers

As widely recognized in neural networks, increasing the depth enhances performance by enabling the extraction of more general and complex features (Nakkiran et al., 2021). However, this poses a challenge for RNNs, where each additional layer increases memory requirements, not only for storing extra weights but also for additional gates and states.

In our study, we tackle this challenge by modifying Rest to employ identical weights for all state updates, thus facilitating multiple state updates. Although the graph convolution layer is applied repeatedly, the binary random mask allows Rest to learn to update a different part of the state during each iteration. This adaptation allows Rest to align itself with the nature of these random updates, contributing to increased performance and enhanced stability without affecting memory requirements.

Thus, Equations 2, 5 and 6 are modified as follows:

$$H^t_i = W X^t + U S^t_i,\qquad(7)$$
$$S^t_{i+1} = H^t_i + \delta S^t_i \odot B.\qquad(8)$$

Here, the index $i$ denotes the current iteration during which the model updates its state, and $\delta S^t_i=\mathcal{G}_{\Theta}(H^t_i)$. It is crucial to keep $X^t$, the feature input at time point $t$, in every iteration to prevent the model from drifting into a state that neglects the input during multiple updates (additional details are provided in Appendix G). To update the state for the next time point, the final state obtained after multiple updates becomes the initial state. For instance, after updating the model's state $I$ times at time point $t$, the initial state for the next time point $t+1$ is set to the final state after the last update at time point $t$ ($S^{t+1}_{0}=S^{t}_{I}$). This enables the model to effectively capture the temporal dynamics across different time points. The proposed framework for the update cell is illustrated in Figure 1c.
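The multiple-update loop of Equations 7–8 can be sketched as below (our own reconstruction; the number of updates $I=3$, the keep probability $p$, and the stand-in graph convolution are illustrative assumptions):

```python
# Multiple random updates with shared weights (Eqs. 7-8), carrying S_I^t over to S_0^{t+1}.
import torch

def rest_forward(X, A, W, U, graph_conv, num_updates=3, p=0.5):
    """X: (T, N, M) one EEG clip; A: (N, N) adjacency; W: (M, Q); U: (Q, Q).
    graph_conv(H, A) -> dS stands in for Eqs. (4)-(5). Returns the final state (N, Q)."""
    T, N, _ = X.shape
    Q = W.shape[1]
    S = torch.zeros(N, Q)                        # S_0^1: initial state
    for t in range(T):                           # evolve over time points
        for _ in range(num_updates):             # I shared-weight updates per time point
            H = X[t] @ W + S @ U                 # Eq. (7): same X^t at every iteration
            dS = graph_conv(H, A)                # Eq. (5)
            B = torch.bernoulli(torch.full_like(dS, p))
            S = H + dS * B                       # Eq. (8)
        # after the last iteration, S serves as S_0^{t+1} for the next time point
    return S

# Toy usage with a stand-in graph convolution (a real one would implement Eq. 4).
X = torch.randn(10, 19, 100)
W, U = torch.randn(100, 32), torch.randn(32, 32)
S_final = rest_forward(X, torch.rand(19, 19), W, U, lambda H, A: torch.tanh(A @ H))
```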

Moreover, previous studies (Mordvintsev et al., 2020; Pajouheshgar et al., 2023) have demonstrated that recurrently updating the state of neural networks structured similarly to Rest improves stability for image and texture generation. We hypothesize that a similar enhancement can be achieved for seizure detection and classification.

Table 3. CHB-MIT train/evaluation/test split.

| Split | Patients | Seizures | Recording (hours) |
|---|---|---|---|
| Train | 18 | 154 | 732 |
| Evaluation | 3 | 19 | 91 |
| Test | 3 | 19 | 92.5 |

4 Rest & RNNs

To better understand the memory efficiency and speed advantages of Rest during inference, we compare Rest with traditional RNNs. As mentioned in Related Work, RNNs map the input $x(t)$ to a hidden state $h(t)$ and update this state over time using the previous state $h(t-1)$ and the current input $x(t)$. We highlight the efficiency of Rest and its connections to other types of RNNs through the following comparisons:

Single Update Rest vs. Single-Layer RNN: First, we consider a single GRU as a representative of RNN models, which leverages gating mechanisms to mitigate vanishing gradients. A GRU update is described by the following set of equations:

$$r(t)=\sigma(W_r\cdot[h(t-1),x(t)]),\qquad(9)$$
$$z(t)=\sigma(W_z\cdot[h(t-1),x(t)]),\qquad(10)$$
$$\tilde{h}(t)=\tanh(W_h\cdot[r(t)\odot h(t-1),x(t)]),\qquad(11)$$
$$h(t)=(1-z(t))\odot h(t-1)+z(t)\odot\tilde{h}(t).\qquad(12)$$

Here, $h(t)$ is the hidden state at time $t$, $x(t)$ is the input at time $t$, $\sigma$ is the sigmoid activation function, $\odot$ denotes element-wise multiplication, $[a,b]$ denotes the concatenation of vectors $a$ and $b$, and $W_r$, $W_z$, $W_h$ represent the weight matrices.

These equations describe how the hidden state $h(t)$ is updated over time based on the input and the preceding state. Unlike Rest, GRU relies on three different gates ($z(t)$, $r(t)$, $\tilde{h}(t)$) for each state update, requiring twice as much memory as Rest, in addition to the storage required for the weights used to generate these gates.

Beyond GRU's memory demands, it must compute not only the next state $h(t)$ but also three additional gates ($z(t)$, $r(t)$, $\tilde{h}(t)$), since the next state depends on them. In contrast, Rest relies solely on the update result $\delta S^t$, enabling it to rapidly derive the next state by adding the update to the previous state, without any additional gates.
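To make the comparison concrete, the following back-of-the-envelope sketch (our own, ignoring biases and the graph-convolution replacement of dense layers discussed below) counts per-cell weights for a GRU versus a Rest update cell with input size $M$ and state size $Q$:

```python
# Rough per-cell weight counts (biases ignored): GRU needs three gate matrices over the
# concatenation [h(t-1), x(t)], while Rest needs W, U and the two graph-conv filters.
def gru_weights(m, q):
    return 3 * q * (q + m)            # W_r, W_z, W_h each map [h(t-1), x(t)] -> R^q

def rest_weights(m, q):
    return q * m + q * q + 2 * q * q  # W, U, Theta_1, Theta_2

m, q = 100, 32
print(gru_weights(m, q), rest_weights(m, q))   # 12672 vs 6272 for these illustrative sizes
```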

Multi Random Update Rest vs. Multi-Layer RNN:

The remarkable efficiency of Rest becomes particularly evident when comparing it with a multi-layer RNN. In a multi-layer GRU, reaching the final state involves computing the full set of equations (Equations 9, 10, 11 and 12) for each layer. This introduces three times more latency per layer, as each layer has three gates that must be computed to obtain the next state. Furthermore, it requires additional memory to store the hidden state of every layer, since these states are needed to compute the final hidden state of the last layer.

In contrast, Rest reuses the same set of weights for the update cell and state evolution. This eliminates the need to store a separate hidden state per layer, as a single state is evolved over successive iterations. Consequently, Rest maintains the same memory requirements as a single update while delivering more accurate results (as discussed in the next section). It is worth mentioning that, in the context of EEG data, all fully connected layers are replaced by graph convolutions for both Rest and GRU. For example, Li et al. (2017) combined GRU with diffusion graph convolution for a traffic forecasting problem.

Connection of Rest Update Cell to Gating Mechanism:

As shown in Equation 12, the state update of RNNs such as GRU can be expressed as:

$$h(t)=h(t-1)+z(t)\odot\big(\tilde{h}(t)-h(t-1)\big).\qquad(13)$$
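For completeness, Equation 13 follows from Equation 12 by distributing the update gate (our own spelled-out step):

```latex
\begin{aligned}
h(t) &= \bigl(1-z(t)\bigr)\odot h(t-1) + z(t)\odot\tilde{h}(t)\\
     &= h(t-1) - z(t)\odot h(t-1) + z(t)\odot\tilde{h}(t)\\
     &= h(t-1) + z(t)\odot\bigl(\tilde{h}(t)-h(t-1)\bigr).
\end{aligned}
```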

This update shares similarities with the Rest cell update in Equation 6. Instead of learning both $\tilde{h}(t)$ and $h(t)$ separately, the Rest update directly learns $\tilde{h}(t)-h(t-1)$ as the residual update $\delta S^t$. Additionally, the update gate vector $z(t)$ is replaced with the binary random mask. This substitution removes the computational and memory overhead of constructing $z(t)$ from the input $x(t)$ and hidden state $h(t)$.

Table 4. Seizure detection AUROC (%) on TUSZ for different clip lengths, together with model efficiency (best values in bold).

| Model | 4-s | 6-s | 8-s | 10-s | 12-s | 14-s | Size (MB) | #Param | Inference (ms) |
|---|---|---|---|---|---|---|---|---|---|
| LSTM | 75.5±0.3 | 76.1±0.07 | 80.1±0.3 | 70.43±0.02 | 77.9±0.06 | 74.24±0.2 | 2.147 | 536K | 3.254 |
| GRU | 76.1±0.02 | 78.8±0.03 | 73.2±0.04 | 73.5±0.02 | 80.1±0.1 | 77.9±0.04 | 1.61 | 402K | 2.12 |
| ResNet-LSTM | 79.1±0.05 | 80.1±0.2 | 75.6±0.07 | 74.3±0.04 | 78.8±0.1 | 80.0±0.08 | 27.6 | 6.9M | 6.78 |
| ResNet-Dilation-LSTM | 80.2±0.08 | 76.5±0.12 | 75.9±0.06 | 73.6±0.03 | 77.4±0.15 | 78.2±0.07 | 27.6 | 6.9M | 6.78 |
| CNN-LSTM | 81.3±0.1 | 78.5±0.05 | 76.4±0.01 | 75.4±0.05 | 75.05±0.1 | 74.0±0.03 | 22.8 | 6M | 5.624 |
| DCRNN | 79.7±0.01 | 82.1±0.04 | 80.1±0.04 | 80.0±0.06 | 82.5±0.1 | 80.12±0.04 | 0.884 | 126K | 9.670 |
| DCRNN w/SS | **83.0**±0.08 | 81.8±0.05 | **82.7**±0.1 | 82.1±0.03 | 85.6±0.2 | 84.0±0.01 | 1.319 | 330K | 23.25 |
| Transformer | 83.0±0.02 | 82.1±0.03 | 82.2±0.04 | **85.5**±0.07 | **86.0**±0.03 | **85.1**±0.02 | 0.801 | 20.3K | 2.5 |
| Rest(DS) | 75.3±0.2 | 67.0±0.03 | 72.2±0.07 | 74.1±0.1 | 70.6±0.04 | 70.0±0.04 | 0.037 | 8.4K | 0.615 |
| Rest(RS) | 79.4±0.03 | 81.1±0.01 | 81.0±0.08 | 81.8±0.02 | 80.1±0.1 | 78.1±0.4 | 0.037 | 8.4K | 0.710 |
| Rest(RM) | 82.4±0.04 | **82.2**±0.05 | **82.7**±0.1 | 83.6±0.2 | 83.4±0.09 | 82.0±0.1 | 0.037 | 8.4K | 1.292 |
Table 5. Seizure detection AUROC (%) on CHB-MIT for different clip lengths, together with model efficiency (best values in bold).

| Model | 4-s | 6-s | 8-s | 10-s | 12-s | Size (MB) | #Param | Inference (ms) |
|---|---|---|---|---|---|---|---|---|
| LSTM | 85.5±0.2 | 84.1±0.4 | 81.0±0.2 | 75.2±0.03 | 73.5±0.08 | 2.691 | 627K | 3.56 |
| GRU | 76.1±0.3 | 78.8±0.03 | 73.2±0.4 | 73.5±0.01 | 80.1±0.2 | 1.92 | 553K | 2.42 |
| ResNet-LSTM | 77.6±0.2 | 82.1±0.14 | 79.9±0.3 | 76.8±0.4 | 81.4±0.17 | 29.1 | 7.2M | 6.84 |
| ResNet-Dilation-LSTM | 78.2±0.03 | 79.8±0.1 | 82.3±0.4 | 77.6±0.4 | 81.2±0.1 | 29.1 | 7.2M | 6.84 |
| CNN-LSTM | 86.2±0.4 | 84.9±0.2 | 80.4±0.04 | 80.35±0.06 | 77.6±0.3 | 30.2 | 7.6M | 6.432 |
| DCRNN | 88.7±0.3 | 80.0±0.02 | 86.8±0.06 | 88.8±0.3 | 86.5±0.3 | 0.591 | 147K | 9.80 |
| Transformer | 80.1±0.2 | 82.3±0.6 | 82.2±0.04 | 85.5±0.01 | 86±0.17 | 0.255 | 52.4K | 6.00 |
| Rest(DS) | 89.1±0.2 | 88.5±0.08 | 90.1±0.1 | 86.3±0.03 | 87.8±0.5 | 0.037 | 9.3K | 1.314 |
| Rest(RS) | 92.3±0.1 | 88.7±0.06 | **92.1**±0.03 | **93.5**±0.02 | 91.5±0.02 | 0.037 | 9.3K | 1.314 |
| Rest(RM) | **96.7**±0.2 | **92.3**±0.04 | 91.4±0.1 | 89.2±0.4 | **91.6**±0.03 | 0.037 | 9.3K | 1.314 |

5 Empirical Results

5.1 Setup

Dataset: We used two extensive, publicly available datasets for the seizure detection and classification tasks: the Temple University Hospital EEG Seizure Corpus (TUSZ) (Obeid & Picone, 2016; Shah et al., 2018) and the Children's Hospital Boston (CHB-MIT) dataset (Goldberger et al., 2000). Below is a detailed description of each dataset:

TUSZ: This dataset includes a total of 5545 EEG files for training and evaluation. These files encompass five different seizure types. We incorporated all 19 channels for all patients in the standard 10-20 system (Figure 1a).

CHB-MIT: This dataset comprises recordings from 24 patients, each with 9 to 42 sessions recorded at a sampling rate of 256 Hz, and contains a total of 192 seizures. For our study, we included all 19 channels of the standard 10-20 system, available for the majority of patients, and excluded sessions with a different number of channels.

Preprocessing: In line with previous studies (Tang et al., 2021; Saab et al., 2020), we resample the EEG signals of the TUSZ dataset to 200 Hz (256 Hz for the CHB-MIT dataset) so that all recordings share a consistent sampling frequency. We then extract non-overlapping windows of length $T$, yielding an EEG clip $X\in\mathbb{R}^{T\times L\times N}$ with $N=19$ nodes, $L=200$ ($L=256$ for CHB-MIT) features per node, and $T$ time points. After applying the fast Fourier transform along the second dimension of the EEG clip and taking the log amplitude of the non-negative frequency components, the final EEG clip used as input to the models is $X\in\mathbb{R}^{T\times M\times N}$, where $M=100$ ($M=128$ for CHB-MIT). Finally, the features for each node and time point are z-normalized using the mean and variance calculated over the 100 (128 for CHB-MIT) feature points along that axis. In the detection task, we examine the presence of a seizure within an EEG clip. For classification, we start analyzing each clip 2 seconds before the seizure begins and evaluate the outcomes within a clip duration of $T=10$ seconds. This approach aligns with the annotations of seizure onset, as in previous works (Ahmedt-Aristizabal et al., 2020; Tang et al., 2021).
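A minimal sketch of this per-window feature extraction (our reconstruction; the resampling step and any filtering details are omitted, and dropping the DC bin to reach exactly 100 features is our assumption) could look as follows:

```python
# Per-window features: log-amplitude FFT of each 1-second segment, then z-normalization.
import numpy as np

def preprocess_clip(eeg, fs=200):
    """eeg: (T_samples, N) raw window already resampled to fs.
    Returns X of shape (T, M, N) with M = fs // 2 log-amplitude features per node."""
    n_sec = eeg.shape[0] // fs
    segments = eeg[: n_sec * fs].reshape(n_sec, fs, -1)                   # (T, fs, N) 1-s segments
    spec = np.abs(np.fft.rfft(segments, axis=1))[:, 1 : fs // 2 + 1, :]   # keep fs//2 frequency bins
    logamp = np.log(spec + 1e-8)                                          # log amplitude
    mean = logamp.mean(axis=1, keepdims=True)                             # z-normalize along features
    std = logamp.std(axis=1, keepdims=True) + 1e-8
    return (logamp - mean) / std

X = preprocess_clip(np.random.randn(10 * 200, 19))                        # 10-second clip, 19 channels
print(X.shape)                                                            # (10, 100, 19)
```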

We evaluate the models' ability to perform the detection task across a range of window sizes, spanning {4, 6, 8, 10, 12, 14} seconds for TUSZ and {4, 6, 8, 10, 12} seconds for CHB-MIT. This allows us to evaluate their performance in both short- and long-term detection scenarios. For the seizure detection task, we used both seizure and background data, while for the classification task, only the seizure data were used (details in Appendix A).

Train-Evaluation Split: The original TUSZ train set was randomly split into training and validation sets with a 90/10 ratio. The TUSZ eval set served as a standardized evaluation set, consistent with previous studies (Tang et al., 2021). Further details regarding the data split are provided in Table 2. For the CHB-MIT dataset, since predefined splits for training, evaluation, and testing are not provided, we randomly selected 80% of the data for training, 10% for evaluation, and 10% for testing. We ensured that patients in each set are unique, preventing the model from being tested on patients included in the training set (details in Table 3).

Baselines: To evaluate performance and runtime, we implemented the key baselines widely used in seizure analysis: DCRNN (Tang et al., 2021), in two versions, with and without self-supervision; CNN-LSTM (Ahmedt-Aristizabal et al., 2020); LSTM (Hochreiter & Schmidhuber, 1997); Transformer (Vaswani et al., 2017); GRU (Cho et al., 2014); and two versions of the ResNet-LSTM model as described in Lee et al. (2022).

[Figure 2: Seizure detection performance of the models across different clip lengths.]

Rest architecture and training: Rest was designed with two graph convolution layers for state updates, the first employing a ReLU activation and the second a linear activation function (Figure 1c). We evaluate three versions of Rest: (a) Rest(DS), with a single deterministic update and no masking; (b) Rest(RS), with a single random update (using binary random masking); and (c) Rest(RM), with multiple random updates.

In the seizure detection task, both binary cross-entropy and mean squared error (MSE) losses were evaluated, with MSE outperforming binary cross-entropy. This result stems from the observation that binary cross-entropy prevents the residual updates from approaching zero (more details in Appendix E). For seizure classification, the cross-entropy loss was used.

Table 6. Seizure classification results (weighted F1-score) and model size.

| Model | F1-Score | Size (MB) | #Param |
|---|---|---|---|
| LSTM | 0.39 | 2.021 | 512K |
| GRU | 0.44 | 1.92 | 553K |
| ResNet-LSTM | 0.58 | 30.3 | 7.5M |
| ResNet-LSTM-Dilation | 0.50 | 30.3 | 7.5M |
| CNN-LSTM | 0.47 | 23.9 | 6M |
| DCRNN | 0.54 | 0.506 | 126K |
| DCRNN w/SS | 0.62 | 1.40 | 332K |
| Transformer | 0.54 | 0.25 | 53K |
| Rest(DS) | 0.51 | 0.034 | 8.6K |
| Rest(RS) | 0.57 | 0.034 | 8.6K |
| Rest(RM) | 0.60 | 0.034 | 8.6K |

We trained all models with 5 different random seeds and averaged the performance on the evaluation set over the different runs. We used Adam (Kingma & Ba, 2014) to optimize the models' parameters, conducting training on a single NVIDIA A100 GPU with a batch size of 128 EEG clips. Training times for all models across various clip lengths can be found in Appendix F.

Runtime Comparison: To ensure a fair comparison between different models, we adopted the following approach for each model: we selected the optimal set of hyperparameters for each clip length based on performance on the validation set. Here, inference time refers to the time required for each model to provide a prediction for one sample of the test data, where each sample is an EEG clip of length $T\in\{4,6,8,10,12,14\}$ seconds. We also attempted to shrink the baselines while maintaining the same accuracy for both tasks; the details are reported in Appendix I.
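For reference, per-clip latency of the kind reported in Tables 4–6 can be measured with a simple loop such as the one below (our own sketch; the warm-up and synchronization choices are ours, and the measured model is a placeholder):

```python
# Measure average single-clip inference latency of a PyTorch model in milliseconds.
import time
import torch

@torch.no_grad()
def single_clip_latency_ms(model, clip, n_warmup=20, n_runs=200):
    model.eval()
    for _ in range(n_warmup):                 # warm up kernels / caches
        model(clip)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model(clip)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs * 1e3

# Example with a placeholder model and a single 10-second TUSZ-style clip (1, 10, 19, 100).
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(10 * 19 * 100, 1))
print(single_clip_latency_ms(model, torch.randn(1, 10, 19, 100)))
```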

[Figure 3: (a, b) Seizure detection AUROC versus model size (log scale); (c, d) seizure classification F1-score and inference time.]

5.2 Experimental Results

Seizure Detection and Classification Accuracy: We evaluated the performance of all baseline models and Rest using the area under the receiver operating characteristic curve (AUROC) for seizure detection and the weighted F1-score for seizure classification. Our model surpasses all baselines by a significant margin on the CHB-MIT dataset across all clip lengths. On the TUSZ dataset, it achieves detection AUROC scores very close to DCRNN with self-supervision and the Transformer for all clip lengths, while outperforming them at clip lengths of 6 and 8 seconds. Figure 2 suggests that multiple random updates improve the stability of Rest, as they lead to higher and more consistent performance compared to other models. According to Figure 2, Rest(RM) and DCRNN with self-supervision exhibit more stable performance across clip lengths, yielding consistent results. Interestingly, CNN-LSTM achieved higher performance at the small clip size of 4 s, surpassing DCRNN with graph convolution layers.

Rest Enjoys an Exponentially Smaller Size: While maintaining high accuracy, Rest is 14× smaller than the smallest existing model for seizure detection and classification on the TUSZ dataset. Table 4 highlights that Rest requires 38× fewer parameters than the state-of-the-art model (DCRNN w/SS) and over 697× fewer parameters than the deep CNN-LSTM model for seizure analysis.

Figure 3 a-b showcases Rest’s outstanding performance, achieving an AUROC of 83.6% for seizure detection with a clip length of 10 seconds. Additionally, Rest secures the second-highest F1-Score for seizure classification, trailing only 2% below DCRNN w/SS but with a significantly smaller size than all other baselines. The substantial gap between Rest’s size and the sizes of other baselines, depicted on the logarithmic scale in Figure 3 a-b, underscores Rest’s remarkable size advantage and potential for implementation on edge devices. The graph convolution layers in Rest efficiently capture both short and long-range communication between nodes, ensuring high accuracy with a compact model size. Moreover, using identical weights for multiple random updates eliminates the need for additional layers while enhancing the model’s accuracy and memory efficiency.

Rapid Seizure Detection: Rest(RM) achieves the fastest inference among all models, being 20× faster than DCRNN w/SS and 9× faster than DCRNN during inference, with only a minor AUROC drop of less than 2% for seizure detection across various clip lengths on the TUSZ dataset. Moreover, Rest with multiple updates requires only 1.292 ms for seizure detection, three times faster than the fastest baseline, LSTM, while being 13% more accurate (at a 10-s clip length). On the CHB-MIT dataset, Rest outperforms all other baselines in the seizure detection task and is the only model with an AUROC above 90%. It also significantly outperforms the other baselines at the short clip length of 4 seconds, which is crucial for real-time seizure detection (Zhu et al., 2021).

In seizure classification, Rest(RM) secures the second-highest F1-score (Table 6) and provides the fastest classification result, within 1.51 ms (Figure 3c-d). Notably, it is three times faster than LSTM while achieving 21% higher accuracy. The swift prediction capability of our model is attributed to its efficient design: Rest relies on a single affine mapping into the state space, complemented by two computationally lightweight graph convolutions.

6 Conclusion

In this work, we propose Rest, a graph-based residual state update mechanism for efficient seizure detection and classification. Our model effectively captures both the spatial and temporal behavior of EEG signals, achieving state-of-the-art performance in seizure detection and classification. With its shallow structure, Rest boasts a fast inference speed, making it 9 times faster than current models while achieving comparable performance. Furthermore, Rest is remarkably memory-efficient, requiring only 37 KB, 14 times less than the smallest existing models for seizure analysis tasks. These advancements position Rest as a promising model for implementation on small, low-power edge devices, particularly for applications in epilepsy treatments such as DBS and RNS.

Impact Statement

The EEG Seizure Corpus from Temple University Hospital, utilized in our research, is anonymized and publicly accessible with IRB approval (Obeid & Picone, 2016; Shah et al., 2018). The authors declare no conflicts of interest, and the seizure detection and classification models presented in this study do not provide any harmful insights. Although our model has demonstrated accuracy in real-time seizure analyses, further experiments are essential before real-world application and implementation on edge devices, as demonstrated in a number of recent systems (Shoaran et al., 2018; Shin et al., 2022; Shaeri et al., 2024). These evaluations should encompass testing with diverse datasets from various patient populations and hospitals. Additionally, assessing the model's energy efficiency is crucial to ensure its safety for chronic use, along with obtaining neurologists' approval regarding its neurological aspects for deployment in such devices.

Acknowledgements

This work was supported in part by the Swiss State Secretariat for Education, Research and Innovation under Contract number SCR0548363, in part by the Wyss project under contract number 532932, in part by Hasler Foundation Program: Hasler Responsible AI project number 21043, in part by the Army Research Office under grant number W911NF-24-1-0048, and in part by the Swiss National Science Foundation (SNSF) under grant number 200021_205011. Moreover, we appreciate the reviewers for their insightful feedback, which has significantly enhanced the robustness and clarity of our results.

References

  • Acharya etal. (2016)Acharya, J.N., Hani, A.J., Thirumala, P., and Tsuchida, T.N.American clinical neurophysiology society guideline 3: a proposal for standard montages to be used in clinical eeg.The Neurodiagnostic Journal, 56(4):253–260, 2016.
  • Ahmedt-Aristizabal etal. (2020)Ahmedt-Aristizabal, D., Fernando, T., Denman, S., Petersson, L., Aburn, M.J., and Fookes, C.Neural memory networks for seizure type classification.In 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pp. 569–575. IEEE, 2020.
  • Asif etal. (2020)Asif, U., Roy, S., Tang, J., and Harrer, S.Seizurenet: Multi-spectral deep feature learning for seizure type classification.In Machine Learning in Clinical Neuroimaging and Radiogenomics in Neuro-oncology: Third International Workshop, MLCN 2020, and Second International Workshop, RNO-AI 2020, Held in Conjunction with MICCAI 2020, Lima, Peru, October 4–8, 2020, Proceedings 3, pp. 77–87. Springer, 2020.
  • Beghi etal. (2019)Beghi, E., Giussani, G., Nichols, E., Abd-Allah, F., Abdela, J., Abdelalim, A., Abraha, H.N., Adib, M.G., Agrawal, S., Alahdab, F., etal.Global, regional, and national burden of epilepsy, 1990–2016: a systematic analysis for the global burden of disease study 2016.The Lancet Neurology, 18(4):357–375, 2019.
  • Cho etal. (2014)Cho, K., VanMerriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y.Learning phrase representations using rnn encoder-decoder for statistical machine translation.arXiv preprint arXiv:1406.1078, 2014.
  • Christou etal. (2022)Christou, V., Miltiadous, A., Tsoulos, I., Karvounis, E., Tzimourta, K.D., Tsipouras, M.G., Anastasopoulos, N., Tzallas, A.T., and Giannakeas, N.Evaluating the window size’s role in automatic eeg epilepsy detection.Sensors, 22(23):9233, 2022.
  • Covert etal. (2019)Covert, I.C., Krishnan, B., Najm, I., Zhan, J., Shore, M., Hixson, J., and Po, M.J.Temporal graph convolutional networks for automatic seizure detection.In Machine Learning for Healthcare Conference, pp. 160–180. PMLR, 2019.
  • Fisher & Velasco (2014a)Fisher, R.S. and Velasco, A.L.Electrical brain stimulation for epilepsy.Nature Reviews Neurology, 10(5):261–270, 2014a.
  • Fisher & Velasco (2014b)Fisher, R.S. and Velasco, A.L.Electrical brain stimulation for epilepsy.Nature Reviews Neurology, 10(5):261–270, 2014b.
  • Goldberger etal. (2000)Goldberger, A.L., Amaral, L.A., Glass, L., Hausdorff, J.M., Ivanov, P.C., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.-K., and Stanley, H.E.Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals.circulation, 101(23):e215–e220, 2000.
  • Gotman (1990)Gotman, J.Automatic seizure detection: improvements and evaluation.Electroencephalography and clinical Neurophysiology, 76(4):317–324, 1990.
  • Harrer etal. (2019)Harrer, S., Shah, P., Antony, B., and Hu, J.Artificial intelligence for clinical trial design.Trends in pharmacological sciences, 40(8):577–591, 2019.
  • He etal. (2016)He, K., Zhang, X., Ren, S., and Sun, J.Deep residual learning for image recognition.In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • Ho & Armanfard (2023)Ho, T. K.K. and Armanfard, N.Self-supervised learning for anomalous channel detection in eeg graphs: application to seizure analysis.In Proceedings of the AAAI Conference on Artificial Intelligence, volume37, pp. 7866–7874, 2023.
  • Hochreiter & Schmidhuber (1997)Hochreiter, S. and Schmidhuber, J.Long short-term memory.Neural computation, 9(8):1735–1780, 1997.
  • Iešmantas & Alzbutas (2020)Iešmantas, T. and Alzbutas, R.Convolutional neural network for detection and classification of seizures in clinical data.Medical & Biological Engineering & Computing, 58:1919–1932, 2020.
  • Jasper (1958)Jasper, H.H.Ten-twenty electrode system of the international federation.Electroencephalogr Clin Neurophysiol, 10:371–375, 1958.
  • Kingma & Ba (2014)Kingma, D.P. and Ba, J.Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014.
  • Lee etal. (2022)Lee, K., Jeong, H., Kim, S., Yang, D., Kang, H.-C., and Choi, E.Real-time seizure detection using eeg: a comprehensive comparison of recent approaches under a realistic setting.arXiv preprint arXiv:2201.08780, 2022.
  • Li etal. (2017)Li, Y., Yu, R., Shahabi, C., and Liu, Y.Graph convolutional recurrent neural network: Data-driven traffic forecasting.arXiv preprint arXiv:1707.01926, 7(8), 2017.
  • Li etal. (2022)Li, Z., Hwang, K., Li, K., Wu, J., and Ji, T.Graph-generative neural network for eeg-based epileptic seizure detection via discovery of dynamic brain functional connectivity.Scientific Reports, 12(1):18998, 2022.
  • Loshchilov & Hutter (2016)Loshchilov, I. and Hutter, F.Sgdr: Stochastic gradient descent with warm restarts.arXiv preprint arXiv:1608.03983, 2016.
  • Mordvintsev etal. (2020)Mordvintsev, A., Randazzo, E., Niklasson, E., and Levin, M.Growing neural cellular automata.Distill, 5(2):e23, 2020.
  • Morris etal. (2019)Morris, C., Ritzert, M., Fey, M., Hamilton, W.L., Lenssen, J.E., Rattan, G., and Grohe, M.Weisfeiler and leman go neural: Higher-order graph neural networks.In Proceedings of the AAAI conference on artificial intelligence, volume33, pp. 4602–4609, 2019.
  • Nakkiran etal. (2021)Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I.Deep double descent: Where bigger models and more data hurt.Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021.
  • Obeid & Picone (2016)Obeid, I. and Picone, J.The temple university hospital eeg data corpus.Frontiers in neuroscience, 10:196, 2016.
  • O’Shea etal. (2020)O’Shea, A., Lightbody, G., Boylan, G., and Temko, A.Neonatal seizure detection from raw multi-channel eeg using a fully convolutional architecture.Neural Networks, 123:12–25, 2020.
  • Pajouheshgar etal. (2023)Pajouheshgar, E., Xu, Y., Zhang, T., and Süsstrunk, S.Dynca: Real-time dynamic texture synthesis using neural cellular automata.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20742–20751, 2023.
  • Pascanu etal. (2013)Pascanu, R., Mikolov, T., and Bengio, Y.On the difficulty of training recurrent neural networks.In International conference on machine learning, pp. 1310–1318. Pmlr, 2013.
  • Randazzo etal. (2020)Randazzo, E., Mordvintsev, A., Niklasson, E., Levin, M., and Greydanus, S.Self-classifying mnist digits.Distill, 5(8):e00027–002, 2020.
  • Saab etal. (2020)Saab, K., Dunnmon, J., Ré, C., Rubin, D., and Lee-Messer, C.Weak supervision as an efficient approach for automated seizure detection in electroencephalography.NPJ digital medicine, 3(1):59, 2020.
  • Shaeri etal. (2024)Shaeri, M.A., Shin, U., Yadav, A., Caramellino, R., Rainer, G., and Shoaran, M.33.3 mibmi: A 192/512-channel 2.46 mm2 miniaturized brain-machine interface chipset enabling 31-class brain-to-text conversion through distinctive neural codes.In 2024 IEEE International Solid-State Circuits Conference (ISSCC), volume67, pp. 546–548. IEEE, 2024.
  • Shah etal. (2018)Shah, V., VonWeltin, E., Lopez, S., McHugh, J.R., Veloso, L., Golmohammadi, M., Obeid, I., and Picone, J.The temple university hospital seizure detection corpus.Frontiers in neuroinformatics, 12:83, 2018.
  • Shin etal. (2022)Shin, U., Ding, C., Zhu, B., Vyza, Y., Trouillet, A., Revol, E.C., Lacour, S.P., and Shoaran, M.Neuraltree: A 256-channel 0.227-μJ/class versatile neural activity classification and closed-loop neuromodulation soc.IEEE Journal of Solid-State Circuits, 57(11):3243–3257, 2022.
  • Shoaran etal. (2016)Shoaran, M., Shahshahani, M., Farivar, M., Almajano, J., Shahshahani, A., Schmid, A., Bragin, A., Leblebici, Y., and Emami, A.A 16-channel 1.1 mm2 implantable seizure control soc with sub-μW/channel consumption and closed-loop stimulation in 0.18 μm CMOS.In 2016 IEEE Symposium on VLSI Circuits (VLSI-Circuits), pp. 1–2. IEEE, 2016.
  • Shoaran etal. (2018)Shoaran, M., Haghi, B.A., Taghavi, M., Farivar, M., and Emami-Neyestanak, A.Energy-efficient classification for resource-constrained biomedical applications.IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 8(4):693–707, 2018.
  • Shuman etal. (2013)Shuman, D.I., Narang, S.K., Frossard, P., Ortega, A., and Vandergheynst, P.The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains.IEEE signal processing magazine, 30(3):83–98, 2013.
  • Siddiqui etal. (2020)Siddiqui, M.K., Morales-Menendez, R., Huang, X., and Hussain, N.A review of epileptic seizure detection using machine learning classifiers.Brain informatics, 7(1):1–18, 2020.
  • Srivastava etal. (2014)Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R.Dropout: a simple way to prevent neural networks from overfitting.The journal of machine learning research, 15(1):1929–1958, 2014.
  • Strein etal. (2019)Strein, M., Holton-Burke, J.P., Smith, L.R., and Brophy, G.M.Prevention, treatment, and monitoring of seizures in the intensive care unit.Journal of Clinical Medicine, 8(8):1177, 2019.
  • Sun & Morrell (2014)Sun, F.T. and Morrell, M.J.The rns system: responsive cortical stimulation for the treatment of refractory partial epilepsy.Expert review of medical devices, 11(6):563–572, 2014.
  • Tang etal. (2021)Tang, S., Dunnmon, J.A., Saab, K., Zhang, X., Huang, Q., Dubost, F., Rubin, D.L., and Lee-Messer, C.Self-supervised graph neural networks for improved electroencephalographic seizure analysis.arXiv preprint arXiv:2104.08336, 2021.
  • Thodoroff etal. (2016)Thodoroff, P., Pineau, J., and Lim, A.Learning robust features using deep learning for automatic seizure detection.In Machine learning for healthcare conference, pp. 178–190. PMLR, 2016.
  • Vaswani etal. (2017)Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., and Polosukhin, I.Attention is all you need.Advances in neural information processing systems, 30, 2017.
  • Voulodimos etal. (2018)Voulodimos, A., Doulamis, N., Doulamis, A., Protopapadakis, E., etal.Deep learning for computer vision: A brief review.Computational intelligence and neuroscience, 2018, 2018.
  • Yan etal. (2022a)Yan, J., Li, J., Xu, H., Yu, Y., and Xu, T.Seizure prediction based on transformer using scalp electroencephalogram.Applied Sciences, 12(9):4158, 2022a.
  • Yan etal. (2022b)Yan, J., Li, J., Xu, H., Yu, Y., and Xu, T.Seizure prediction based on transformer using scalp electroencephalogram.Applied Sciences, 12(9):4158, 2022b.
  • Zhu etal. (2020)Zhu, B., Farivar, M., and Shoaran, M.Resot: Resource-efficient oblique trees for neural signal classification.IEEE Transactions on Biomedical Circuits and Systems, 14(4):692–704, 2020.
  • Zhu etal. (2021)Zhu, B., Shin, U., and Shoaran, M.Closed-loop neural prostheses with on-chip intelligence: A review and a low-latency machine learning model for brain state detection.IEEE transactions on biomedical circuits and systems, 15(5):877–897, 2021.

Appendix Introduction

The Appendix is organised as follows:

  • Preprocessing details are outlined in Appendix A.

  • The mathematical proof addressing the avoidance of gradient vanishing in our model is provided in Appendix B.

  • Seizure analysis results are presented in Appendix C.

  • Hyperparameter selection and training details for all models are discussed in Appendix D.

  • The impact of BCE and MSE loss on training Rest is compared in Appendix E.

  • Training times are documented in Appendix F.

  • Details explaining how Rest avoids overfitting are shown in Appendix G.

  • Differences between various graph structures are explored in Appendix H.

  • Information about baseline compression is provided in Appendix I.

  • F1-scores for seizure detection are presented in Appendix J.

  • The effectiveness of binary random masking on different RNN variants is shown in Appendix K.

  • Size comparisons for models with the same number of neurons are provided in Appendix L.

  • Real-time evaluations of different models with overlapping windows are detailed in Appendix M.

  • An ablation study on the inference performance of Rest with and without binary random masking is presented in Appendix N.

Appendix A Details of Preprocessing

We initially performed general preprocessing on the EEG data, followed by task-specific steps for the detection and classification tasks:

A.1 TUSZ dataset

General Preprocessing: The EEG signals in the TUH EEG Corpus (TUSZ) dataset were originally sampled at various frequencies. As part of the preprocessing pipeline, all signals were uniformly resampled to 200 Hz. EEG clips were then extracted using the natural choice of one-second, non-overlapping windows, resulting in an EEG tensor $X \in \mathbb{R}^{T \times L \times N}$, where $T$ is the clip length (4, 6, 8, 10, 12, or 14 seconds), $N$ is the number of electrodes (19), and $L$ is the number of time samples per window (200). To harness the effectiveness of the Fourier transform for neural EEG recordings, a fast Fourier transform was applied to extract frequency components for each node at each time point. The log-amplitude of the frequencies was then computed and only the non-negative frequency components were retained, as in prior studies (Tang et al., 2021; Ahmedt-Aristizabal et al., 2020), yielding an EEG clip tensor $X \in \mathbb{R}^{T \times M \times N}$ with $M = 100$. Last, we z-normalized the EEG clips across their second dimension for further analyses.
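For concreteness, this frequency-domain feature extraction can be sketched in a few lines of Python. The snippet below is a minimal illustration under our own assumptions (NumPy, a small numerical constant `eps` inside the logarithm, and z-normalization along the frequency dimension); it is not the exact preprocessing script used in our pipeline.

```python
import numpy as np

def clip_to_log_spectrum(clip, eps=1e-8):
    """Turn one EEG clip of shape (T, L, N) -- T one-second windows, L time
    samples per window, N electrodes -- into log-amplitude frequency features
    of shape (T, M, N) with M = L // 2 non-negative frequency bins."""
    T, L, N = clip.shape
    spectrum = np.fft.rfft(clip, axis=1)          # FFT over the time samples of each window
    log_amp = np.log(np.abs(spectrum) + eps)      # log-amplitude of the frequency components
    log_amp = log_amp[:, :L // 2, :]              # keep M = 100 bins when L = 200
    mean = log_amp.mean(axis=1, keepdims=True)    # z-normalize across the second dimension
    std = log_amp.std(axis=1, keepdims=True) + eps
    return (log_amp - mean) / std

# Example: a 10-second clip sampled at 200 Hz over 19 electrodes -> (10, 100, 19)
features = clip_to_log_spectrum(np.random.randn(10, 200, 19))
```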

Preprocessing for Seizure Detection: For seizure detection, after extracting EEG clips from the entire training set of 5,545 sessions, a binary label was assigned to each clip, with $y = 1$ indicating the presence of at least one seizure within the clip and $y = 0$ otherwise. To handle the substantial number of background clips in the dataset, non-seizure clips were randomly subsampled to achieve a balanced representation with seizure clips in the training data. The last clip of each EEG recording was dropped if the recording ended before the clip reached its full length.

Preprocessing for Seizure Classification: For seizure classification, following Tang et al. (2021) and Ahmedt-Aristizabal et al. (2020), we removed the background data and processed only the seizure clips. Each clip starts 2 seconds before the annotated seizure onset to tolerate annotation imprecision. Clips were labeled $y = 0$ for generalized non-specific (GN), $y = 1$ for combined tonic (TC), $y = 2$ for absence (AB), $y = 3$ for focal, and $y = 4$ for complex partial (CP) seizures. Moreover, if a seizure event was shorter than the clip length, we truncated the clip to avoid having multiple seizures in one clip. It is also noteworthy that while the training set included simple partial seizures, this seizure type was absent from the evaluation set; we therefore excluded simple partial seizures from the classification task. Finally, because the clips for seizure classification may have different lengths, we pad zeros to the end of each clip so that all samples share the same length.

A.2 CHB-MIT Dataset

For the CHB-MIT dataset, we randomly selected 18 patients for training, 3 for evaluation, and 3 for testing. We followed the same preprocessing pipeline described for the TUSZ dataset, except that a uniform sampling rate of 256 Hz was maintained for all patients, so each 1-second window contains 256 raw EEG samples per channel. The number of channels matches the TUSZ dataset (19 channels), and we excluded any sessions with a different number of channels.

We utilized the same frequency-domain components for seizure detection. Unlike the TUSZ dataset, the CHB-MIT dataset does not include seizure types for classification. The results are reported over five different random seeds for the train/evaluation/test splits (more details in Table 7).

Case | Number of Seizures | Number of Sessions | Age
1 | 7 | 24 | 11
2 | 3 | 36 | 11
3 | 7 | 38 | 14
4 | 4 | 42 | 22
5 | 5 | 39 | 7
6 | 10 | 18 | 1.5
7 | 3 | 19 | 14.5
8 | 5 | 20 | 3.5
9 | 4 | 19 | 10
10 | 7 | 25 | 3
11 | 3 | 35 | 12
12 | 27 | 24 | 2
13 | 10 | 33 | 3
14 | 8 | 26 | 9
15 | 20 | 40 | 16
16 | 8 | 19 | 7
17 | 3 | 21 | 12
18 | 6 | 36 | 18
19 | 3 | 30 | 19
20 | 8 | 29 | 6
21 | 4 | 33 | 13
22 | 3 | 31 | 9
23 | 7 | 9 | 6
24 | 16 | 22 | Unknown

Appendix B Preventing Gradient Vanishing with Residual Update

In Equations 3, 4 and 5, the model's state is updated through a residual state update. Taking the derivative of the loss with respect to the previous state $S^{t-1}$ through the forward propagation of Equation 3, we get:

$$\frac{\partial\mathcal{L}}{\partial S^{t-1}} = \frac{\partial\mathcal{L}}{\partial S^{t}}\,\frac{\partial S^{t}}{\partial S^{t-1}} = \frac{\partial\mathcal{L}}{\partial S^{t}}\left(1 + \frac{\partial\,\delta S^{t}}{\partial S^{t-1}}\right) = \frac{\partial\mathcal{L}}{\partial S^{t}} + \frac{\partial\mathcal{L}}{\partial S^{t}}\,\frac{\partial\,\delta S^{t}}{\partial S^{t-1}}. \quad (14)$$

Here $\mathcal{L}$ is the loss function to be minimized. This equation shows that the gradient with respect to the previous state $S^{t-1}$ always contains the term $\frac{\partial\mathcal{L}}{\partial S^{t}}$ added directly. This prevents $\frac{\partial\mathcal{L}}{\partial S^{t-1}}$ from becoming too small, even when the gradient through the update itself, $\frac{\partial\mathcal{L}}{\partial S^{t}}\frac{\partial\,\delta S^{t}}{\partial S^{t-1}}$, is small.
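As a quick numerical sanity check of Equation 14, the toy PyTorch sketch below compares the gradient reaching the initial state after many update steps with and without the residual connection. The tanh update and the weight scale are our own illustrative choices, not the Rest update cell itself.

```python
import torch

torch.manual_seed(0)
d, steps = 32, 50
W = torch.randn(d, d) * 0.02        # small weights: each individual update is contractive

def grad_norm_at_start(residual: bool) -> float:
    s0 = torch.randn(d, requires_grad=True)
    s = s0
    for _ in range(steps):
        delta = torch.tanh(s @ W)   # stand-in for the update term delta S^t
        s = s + delta if residual else delta
    s.sum().backward()
    return s0.grad.norm().item()

print("residual update :", grad_norm_at_start(True))    # stays well away from zero
print("plain recurrence:", grad_norm_at_start(False))   # shrinks toward zero
```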

Appendix C ROC Curves and Confusion Matrices for Different Clip Lengths

[Figures: ROC curves and confusion matrices for different clip lengths.]

Appendix D Model Training and Hyperparameter Selection Details

Here are the details of training and hyperparameter selection for Rest and baselines:

Rest Hyperparameters: We optimized the following hyperparameters for Rest based on the lowest validation error: a) the number of neurons in each graph convolution layer, within [16, 32, 64]; b) the initial learning rate, within [5e-4, 1e-4]; c) the success probability of the random binary mask, within [0.1, 0.3, 0.5, 0.7, 1]. For multi-update Rest, the number of updates at each time point was an integer drawn uniformly at random from the interval [1, 10]. We trained for 500 epochs using a MultiStep learning rate scheduler. Five experiments were run in PyTorch with different random seeds.
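For reference, this training setup translates into a few lines of PyTorch; only the 500-epoch budget, the learning-rate range, and the MultiStep scheduler are stated above, so the choice of Adam and the milestone epochs below are assumptions made purely for illustration.

```python
import torch

model = torch.nn.Linear(100, 1)   # placeholder module standing in for Rest
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)   # initial lr chosen from {5e-4, 1e-4}
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[250, 400], gamma=0.1)            # milestone epochs are assumed

for epoch in range(500):          # 500 training epochs
    # ... forward pass, loss computation, loss.backward(), optimizer.step() per batch ...
    scheduler.step()              # decay the learning rate at the milestones
```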

DCRNN: We followed the hyperparameter tuning strategy from the original paper (Tang et al., 2021) for DCRNN both with and without the self-supervision task. The hyperparameter search on the validation set included: a) the initial learning rate, within [5e-5, 1e-3]; b) the number of Diffusion Convolutional Gated Recurrent Unit (DCGRU) layers, within {2, 3, 4, 5}, and hidden units, within {32, 64, 128}; c) the maximum diffusion step K ∈ {2, 3, 4}; d) the dropout probability in the last fully connected layer. For self-supervised pre-training, we utilized the mean absolute error (MAE) as the loss function. The models were trained for 350 epochs with an initial learning rate of 5e-4, a maximum diffusion step of 1, and 64 hidden units in both the encoder and decoder, using a cosine annealing learning rate scheduler (Loshchilov & Hutter, 2016).

CNN-LSTM: For the CNN-LSTM baseline, we adopt the model architecture outlined in Ahmedt-Aristizabal et al. (2020). This configuration employs two stacked convolutional layers with 32 kernels of size 3 × 3, one max-pooling layer of size 2 × 2, one fully connected layer with 512 output neurons, two stacked LSTM layers with a hidden size of 128, and one additional fully connected layer.
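Under the assumption that each one-second window (a 100 × 19 frequency-by-electrode map) is fed to the convolutional front end before the LSTM runs over time, this configuration can be sketched as follows; the input layout and the single-logit head are our assumptions, not a verbatim reimplementation of Ahmedt-Aristizabal et al. (2020).

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Sketch of the CNN-LSTM baseline described above."""
    def __init__(self, n_outputs=1):
        super().__init__()
        self.cnn = nn.Sequential(                      # two stacked 32-kernel 3x3 conv layers
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                           # 2x2 max pooling: (100, 19) -> (50, 9)
        )
        self.fc = nn.Linear(32 * 50 * 9, 512)          # fully connected layer with 512 outputs
        self.lstm = nn.LSTM(512, 128, num_layers=2, batch_first=True)  # two LSTM layers, hidden 128
        self.head = nn.Linear(128, n_outputs)          # final fully connected layer

    def forward(self, x):                              # x: (batch, T, 100, 19)
        b, t, m, n = x.shape
        h = self.cnn(x.reshape(b * t, 1, m, n))
        h = self.fc(h.reshape(b * t, -1)).reshape(b, t, -1)
        out, _ = self.lstm(h)
        return self.head(out[:, -1])                   # predict from the last time step

logits = CNNLSTM()(torch.randn(2, 10, 100, 19))        # -> shape (2, 1)
```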

LSTM: We employed two stacked LSTM layers, each with 64 hidden units, followed by a fully connected layer for the final prediction.

GRU: For the GRU model, we used the same number of layers and hidden units as for the LSTM.

ResNet-LSTM: We evaluated both versions, with and without dilation, as described in Lee et al. (2022).

Transformer: We implemented a two-layer multi-head attention architecture with an embedding dimension of 64 and 16 heads. Additionally, we used the sinusoidal positional encoding introduced by Vaswani et al. (2017) to encode time.

For the detection task, binary cross-entropy loss was used for all models except Rest, for which MSE performed slightly better during validation. For the classification task, a weighted cross-entropy loss was employed due to the high class imbalance among seizure types.

Appendix E Comparison Between MSE and BCE loss for Training Rest

Rest was trained for seizure detection using both MSE and binary cross-entropy (BCE) loss functions; MSE outperformed BCE in terms of stability and accuracy. This advantage is attributed to BCE's tendency toward unbounded growth of the classification logits, which hinders residual updates and message passing between graph nodes, particularly in multi-update scenarios, as discussed in Randazzo et al. (2020). As shown in Figure 4, MSE exhibits fewer fluctuations and a more stable validation error during training compared to BCE when training Rest with multiple updates.

Seizure detection performance:

Loss Function | AUROC
Rest BCE | 80.4
Rest MSE | 83.6
[Figure 4: Validation error of Rest trained with MSE versus BCE loss.]

Appendix F Training Time

Below we report the time needed to train each model (Table 9). All models were trained on the same NVIDIA A100 GPU; the number of parameters and model sizes are reported in Tables 6 and 4. Rest requires more training time to converge to a stable point, especially for adapting its update cell to multiple random updates.

Training time for seizure detection (4-s to 14-s clips) and seizure classification (10-s clips):

Model | 4-s | 6-s | 8-s | 10-s | 12-s | 14-s | Classification (10-s)
LSTM | 5 | 5 | 5 | 6 | 7 | 7 | 4
GRU | 5 | 5 | 5 | 6 | 7 | 8 | 4
CNN-LSTM | 8 | 8 | 8 | 9 | 9 | 10 | 5
ResNet-LSTM | 9 | 9 | 10 | 10 | 12 | 12 | 6
ResNet-LSTM-Dilation | 9 | 9 | 10 | 10 | 12 | 12 | 6
DCRNN | 20 | 22 | 23 | 25 | 28 | 30 | 20
DCRNN w/SS | 23 | 30 | 35 | 40 | 48 | 60 | 35
Transformer | 12 | 12 | 13 | 14 | 14 | 16 | 8
Rest(DS) | 45 | 47 | 50 | 53 | 55 | 60 | 10
Rest(RS) | 45 | 47 | 50 | 53 | 55 | 60 | 10
Rest(RM) | 70 | 75 | 80 | 90 | 95 | 100 | 25

Appendix G Rest Combats Forgetting at Each Time Point

When updating Rest, especially when the update cell performs multiple updates, Rest avoids forgetting the input by updating its state based on an affine mapping of the previous state and the current input. As an example, we consider the two following settings:

Setting 1: The state is updated based on the previous state only. The state is first initialized as $S^{t}_{i} = WX^{t} + US^{t}_{i}$ and then updated iteratively as follows:

$$\delta S^{t}_{i} = \mathcal{G}_{\Theta}(S^{t}_{i}), \quad (15)$$
$$S^{t}_{i+1} = S^{t}_{i} + \delta S^{t}_{i} \odot B. \quad (16)$$

Setting 2: The state is updated based on an affine mapping of the current input and the previous state, iteratively updating $S^{t}_{i}$ as follows:

$$H^{t}_{i} = WX^{t} + US^{t}_{i}, \quad (17)$$
$$\delta S^{t}_{i} = \mathcal{G}_{\Theta}(H^{t}_{i}), \quad (18)$$
$$S^{t}_{i+1} = H^{t}_{i} + \delta S^{t}_{i} \odot B. \quad (19)$$

In Setting 1, after the mapping from the input to the state space, the state is updated based only on the previous state. This setup risks the model forgetting information from the current input, especially when the update cell modifies the state multiple times; the state may then fail to converge to a stable point and can simply diverge, since the input data is neglected. In Setting 2, which corresponds to Rest's update cell, the input plays a crucial role and is actively involved in the iterative update process, as shown in Equations 17, 18 and 19. This design prevents the model from forgetting information from the current input $X^{t}$ and promotes convergence of the state to a more meaningful final state by utilizing the input's information throughout the updates.
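The difference between the two settings can be made concrete with a short sketch. Below, W, U and the graph update G_Θ are abstracted as generic callables, the number of updates is fixed, and the mask probability is arbitrary; this is an illustrative toy under those assumptions, not the full Rest implementation.

```python
import torch

def multi_update(x_t, s_prev, W, U, g_theta, p=0.5, n_updates=5, reinject_input=True):
    """One multi-update cell at time t. reinject_input=False follows Setting 1
    (state-only updates); reinject_input=True follows Setting 2 (Eqs. 17-19)."""
    s = W(x_t) + U(s_prev)                                # affine mapping into the state space
    for _ in range(n_updates):
        h = W(x_t) + U(s) if reinject_input else s        # Setting 2 re-injects the input
        delta = g_theta(h)                                # graph-convolution update, abstracted
        mask = (torch.rand_like(delta) < p).float()       # binary random mask B
        s = (h if reinject_input else s) + delta * mask   # residual state update
    return s

# Toy usage with linear maps standing in for W, U and the graph update
d = 16
W, U = torch.nn.Linear(d, d), torch.nn.Linear(d, d)
g = torch.nn.Sequential(torch.nn.Linear(d, d), torch.nn.Tanh())
s_next = multi_update(torch.randn(1, d), torch.zeros(1, d), W, U, g)
```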

As illustrated in Figure 7, Setting 1 fails to converge to a stable point, and the validation loss remains unchanged throughout the training process.

[Figure 7: Validation loss during training for Setting 1 versus Setting 2.]

Appendix H Comparison Between Different Gaussian Kernel Thresholds for the EEG Distance Graph

Here we illustrate distance graph constructions for different thresholds of the Gaussian kernel. Lower values of k (e.g., 0.6) result in missing connections between nodes, while large thresholds result in connecting nodes that are far apart. Similar to Tang et al. (2021), we choose k = 0.9 as the threshold, which resembles the standard EEG montages (longitudinal bipolar and transverse bipolar) (Acharya et al., 2016) and yields a reasonable connectivity pattern.
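For reference, one common way to build a thresholded Gaussian-kernel distance graph is sketched below; the kernel bandwidth heuristic and the convention of simply zeroing edges whose weight falls below the threshold are our assumptions, and the exact construction used for Figure 8 may differ.

```python
import numpy as np

def distance_graph(coords, k=0.9):
    """Weighted adjacency over EEG electrodes: a Gaussian kernel applied to
    pairwise distances, with edges below the threshold k removed."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    sigma = d.std()                        # kernel bandwidth (assumed heuristic)
    w = np.exp(-(d ** 2) / (sigma ** 2))   # Gaussian kernel; self-weights equal 1
    w[w < k] = 0.0                         # apply the threshold
    return w

adj = distance_graph(np.random.rand(19, 2), k=0.9)   # placeholder electrode coordinates
```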

[Figure 8: EEG distance graphs constructed with different Gaussian kernel thresholds.]

Appendix I Compressing Baseline Models

We attempted to compress the existing models for seizure detection and classification while maintaining performance comparable to that reported in Tang et al. (2021). However, for the LSTM and CNN-LSTM models, shrinking the model size without a significant performance drop proved challenging. We matched the performance reported in Tang et al. (2021) for the DCRNN and DCRNN w/SS models with only one diffusion convolution gated recurrent unit, reducing the model size by half, from 2.7 MB to less than 1 MB for DCRNN. Furthermore, for the seizure detection task, we achieved the same accuracy with 126K parameters, compared to the original paper's 168,641 parameters.

For the classification task, the original paper (Tang et al., 2021) reported 280,964 parameters for DCRNN and 417,572 parameters for DCRNN w/SS. Our compressed models use 126K parameters for DCRNN and 330K for DCRNN w/SS, reducing the model size by a factor of 2 for DCRNN and a factor of 1.5 for DCRNN w/SS.

Despite these successful reductions, the compressed models still contain a considerable number of parameters, especially in the presence of a gating mechanism, highlighting the parameter inefficiency and memory demands of existing models for seizure detection.

Appendix J F1-Score for Seizure Detection

Below are the weighted-average F1-score results for the seizure detection task on the TUSZ dataset.

Model | 4-s | 6-s | 8-s | 10-s | 12-s | 14-s
LSTM | 82.3 | 69.9 | 79.5 | 80.5 | 72.7 | 73.2
CNN-LSTM | 70.1 | 69.5 | 75.3 | 73.5 | 68.3 | 67.5
GRU | 82.7 | 69.9 | 81.6 | 80.5 | 81.0 | 71.3
ResNet-LSTM | 79.7 | 78.2 | 80.1 | 75.1 | 77.0 | 76.3
ResNet-LSTM-Dilation | 80.5 | 80.4 | 79.0 | 76.6 | 75.0 | 74.6
Transformer | 78.45 | 79.3 | 78.5 | 82.0 | 79.1 | 79.2
DCRNN | 81.2 | 80.2 | 81.6 | 80.0 | 74.2 | 72.0
DCRNN w/SS | 75.2 | 81.1 | 81.2 | 81.0 | 75.7 | 76.0
Rest(RS) | 69.5 | 68.4 | 78.3 | 79.1 | 74.7 | 74.1
Rest(RM) | 81.0 | 75.2 | 83.2 | 81.0 | 75.7 | 76.2

Appendix K Binary Random Masking and Multiple Updates for Other RNNs

We conducted an ablation study to evaluate the performance of RNN baselines with single and multiple random updates, as shown in Table 11.

Model | Vanilla | RS | RM
RNN | 77.3 | 80.1 | 80.8
GRU | 73.5 | 72.8 | 73.6
LSTM | 70.4 | 74.5 | 74.7

As shown, the RNN variants can improve their performance in seizure detection tasks using Rest update techniques.

Appendix L Size Comparison with 64 Neurons for All Models

Model | Parameters (#) | Size (MB)
DCRNN w/SS | 330K | 1.319
DCRNN | 126K | 0.884
Transformer | 48.3K | 0.193
GRU | 402K | 1.61
ResNet-LSTM | 7.5M | 30.3
ResNet-LSTM-Dilation | 7.5M | 30.3
LSTM | 536K | 2.147
CNN-LSTM | 6M | 22.8
Rest(DS) | 27K | 0.051
Rest(RS) | 27K | 0.051
Rest(RM) | 27K | 0.051

Appendix M More Evaluation for Real-Time Detection

We followed the real-time seizure detection framework described by Lee et al. (2022), using a 4-second clip length for seizure detection with a 3-second overlap between consecutive clips. We measured both the inference time and the latency, the latter being the delay between the actual onset of a seizure and the model's detection. Low latency is crucial to avoid late detection of seizure events. As shown in Table 13, Rest achieves the lowest latency alongside the Transformer model, while also maintaining significantly lower inference times compared to all other baselines.

Model | AUROC | Latency (s) | Inference (ms)
LSTM | 75.5 | 0.31 | 3.254
GRU | 76.1 | 0.4 | 2.12
ResNet-LSTM | 79.1 | 0.3 | 6.78
ResNet-LSTM-Dilation | 80.2 | 0.34 | 6.78
CNN-LSTM | 81.3 | 0.26 | 5.624
DCRNN | 79.7 | 0.25 | 9.67
Transformer | 83 | 0.2 | 2.5
Rest(DS) | 75.3 | 0.23 | 0.615
Rest(RS) | 79.4 | 0.2 | 0.71
Rest(RM) | 82.4 | 0.25 | 1.29

Appendix N Rest W/O Binary Random Mask during Inference

We evaluated Rest's performance with and without masking during inference. Following the strategy of Srivastava et al. (2014), the binary mask was removed at inference time and the incremental state update was scaled by the success probability p of the mask.
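Concretely, the mask is replaced by its expectation, in the same spirit as standard dropout rescaling (Srivastava et al., 2014). The snippet below is a minimal sketch of that substitution, with toy tensors standing in for the affine-mapped state H and the incremental update δS.

```python
import torch

p = 0.5                       # success probability of the binary random mask
h = torch.randn(19, 64)       # affine-mapped state H (19 nodes, 64 hidden units, toy values)
delta = torch.randn(19, 64)   # incremental state update delta S (toy values)

# Training: sample a fresh binary mask B ~ Bernoulli(p) for each update
mask = (torch.rand_like(delta) < p).float()
s_train = h + delta * mask

# Inference without the mask: scale the update by p, the expectation of B
s_infer = h + delta * p
```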

Model | W/ Inference Mask | W/O Inference Mask
Rest(RS) | 81.8 | 81.5
Rest(RM) | 83.6 | 82.9