WO2024015018A1

WO2024015018A1 - Cognitive workload recognition from temporal series information

Info

Publication number: WO2024015018A1
Application number: PCT/SG2023/050490
Authority: WO
Inventors: Chen LYU; Haohan YANG
Original assignee: Nanyang Technological University
Priority date: 2022-07-12
Filing date: 2023-07-12
Publication date: 2024-01-18

Abstract

Disclosed is a method to train a supervised learning model for cognitive workload recognition by employing a sequence-to-sequence learning paradigm. The model comprises a main long short-term memory (LSTM) network, an auxiliary LSTM network and a classifier layer. The model is configured to output a predicted cognitive workload level of a user in response to input of temporal series information related to the user over a plurality of time steps. It does so by, at each said time step: updating the main LSTM network to map the temporal series information to a sequence of hidden states; updating the auxiliary LSTM network to generate weights for the main LSTM network; and obtaining the predicted cognitive workload level by processing the hidden states and the weights through the classifier layer.

Description

COGNITIVE WORKLOAD RECOGNITION FROM TEMPORAL SERIES INFORMATION

Technical Field

The present invention relates, in general terms, to systems and methods for training supervised learning models for cognitive workload recognition.

Background

Driver workload inference is significant for the design of intelligent humanmachine cooperative driving schemes. Such inference allows systems to alert drivers before potentially dangerous manoeuvres are performed and achieve a safer control transition. However, pattern variations among individual drivers and sensor artefacts pose great challenges to the existing cognitive workload recognition approaches.

Various advanced functions have been developed for intelligent vehicles to improve the driving experience and convenience, but each has its drawbacks. For example, in-vehicle infotainment (IVI) such as navigation systems provide real-time guidance to drivers but visual-manual tasks and auditory-verbal activities increase the driver's mental workload and are secondary to driving, thereby increasing the risk of distraction.

Various approaches to obtaining cognitive load levels have been studied. The most straightforward ones are subjective measures, requiring drivers to conduct the self-evaluation by completing questionnaires after driving tasks. Typically, subjective measuring approaches provide cumulative estimations of the cognitive workload based on drivers' memories, while these methods are intuitive and uncertain since they are susceptible to memory bias. Studies have been performed to define cognitive workload based on physiological and vehicle indicators. Such methods typically require extended sampling windows (e.g. 2-5 min recordings for heart rate), or a susceptible to noise from changes in driving conditions (e.g. machine vision-based techniques for eye tracking deteriorate in low light). Vehicle indicators such as steering angles, vehicle speeds, and accelerations can be used, but are insensitive to low workload levels.

Many learning-based approaches have been developed for the recognition of driver cognitive loads from different measured signals. Some such approaches employ deep, machine learning technologies. These machine learning-based methods commonly require the manual features extraction from raw data. However, these methods commonly consider confidence of cognitive workload assessment based on individual input modes - e.g. eye tracking or vehicle behaviour - and thereby fail to take advantage of the benefits of multi-modal information.

It would be desirable to overcome or ameliorate at least one of the abovedescribed problems, or at least to provide a useful alternative.

Summary

To address the aforementioned challenges, proposed herein is an Attention- enabled Recognition Network (ARecNet) for recognizing driver cognitive load in real time using multiple input modes - e.g. electroencephalogram (EEG) signals, eye movements and external states. An "external state" may be a "vehicle state" in embodiments applied to cognitive workload recognition for vehicle drivers. Previous machine learning technologies consider single input modalities which cannot fully exploit the complementarity of multimodal data in assessing cognitive workload, due to the information redundancy. In contrast, ARecNet employs a feature-level fusion architecture across input modes, to recognize driver cognitive workload. Also disclosed is a system that trains a supervised learning model for cognitive workload recognition, wherein the system comprises a plurality of processors configured to train the model by employing a sequence-to-sequence learning paradigm, wherein the model comprises a main long short-term memory (LSTM) network, an auxiliary LSTM network and a classifier layer; wherein the model is configured to output a predicted cognitive workload level of a user in response to input of temporal series information related to the user over a plurality of time steps. It does so by, at each said time step: updating the main LSTM network to map the temporal series information to a sequence of hidden states; updating the auxiliary LSTM network to generate weights for the main LSTM network; and obtaining the predicted cognitive workload level by processing the hidden states and the weights through a classifier layer of the model.

Relevantly, ARecNet may embody a method to train a supervised learning model for cognitive workload recognition by employing a sequence-to-sequence learning paradigm. The model comprises a main long short-term memory (LSTM) network, an auxiliary LSTM network and a classifier layer. The model is configured to output a predicted cognitive workload level of a user in response to input of temporal series information related to the user over a plurality of time steps. It does so by, at each said time step: updating the main LSTM network to map the temporal series information to a sequence of hidden states; updating the auxiliary LSTM network to generate weights for the main LSTM network; and obtaining the predicted cognitive workload level by processing the hidden states and the weights through the classifier layer.

The temporal series information in the phrase "input of temporal series information related to the user over a plurality of time steps", and similar, refers to information relating to the user themselves (e.g. of the driver, where the user is a driver) and information relating to external states (e.g. of a vehicle, such as a neighbouring vehicle or a vehicle being driven by the user, where the user is a driver). Advantageously, embodiments of the present invention establish an attention- enabled decision-level fusion architecture to infer driver cognitive workload levels. This suggests the availability of a viable generic technique for capturing useful feature representation from time-series multimodal information.

Advantageously, embodiments of the present invention involve constructing a novel driver workload dataset, including multimodal signals and multiple driving scenarios.

Brief description of the drawings

Embodiments of the present invention will now be described, by way of nonlimiting example, with reference to the drawings in which:

Figure 1 is a schematic overview of an attention-enabled cognitive workload recognition method and mode, with multimodal information fusion in accordance with present teachings;

Figure 2 is a platform for multi-modal information capture, which can be applied as a driver-in-the-loop platform;

Figure 3 illustrates a method to train a supervised learning model for cognitive workload recognition, in accordance with present teachings;

Figure 4 shows activity power spectra for typical components and the corresponding scalp topographies, in which image (a) shows eye artefact, (b) shows muscle artefact and (c) is normal;

Figure 5 shows the results of recognition accuracy assessment of the present methodology with varied temporal series information and historical horizons (image (a) t_w = 1 s, (b) t_w = 2 s, (c) t_w = 4 s) in typical driving scenarios; Figure 6 shows confusion matrices for driver cognitive workload recognition with different historical horizons in typical driving scenarios, in which image (a) is sunny noon, (b) is foggy dusk, and (c) is rainy night;

Figure 7 provides the precision-recall curves of cognitive workload level recognition with the historical horizon t_w = 1 s in different driving scenarios, being (a) sunny noon, (b) foggy dusk, and (c) rainy night, where shaded areas represent the extrema across the 5-fold cross validation;

Figure 8 is the outcome of statistical tests of decision-level fusion-based approaches with HyperLSTM modules; and

Figure 9 is a block diagram of a system for cognitive workload recognition (estimation).

Detailed description

Disclosed is ARecNet, an attention-enabled recognition network with a decisionlevel fusion architecture, that assesses cognitive workload estimation performance. Specifically, the present methodology employs a cross-attention mechanism to enhance useful feature representations learned by hyper long short-term memory (HyperLSTM) based modules from time-series multimodal information, e.g., EEG signals, eye movements, external states or behaviours (e.g. vehicle states or vehicle behaviour or, where the user is an athlete, the state or behaviour of the user's body and/or competing athletes around the user - such as on a running track). Also disclosed is the construction of a novel dataset containing multiple driving scenarios for evaluating model performance across different historical horizons and decision thresholds.

The description below will be made with reference to a driver (user) and vehicle states, for illustration purposes only. Without loss of generality, the same teachings apply to other types of user such as an athlete, where the external states and external behaviours, being "vehicle states" and "vehicle behaviours", can be substituted for the 'states' and 'behaviours' of the athlete and/or one or more neighbouring athletes (on the same team - e.g. in volleyball - or different teams - e.g. in a competitive running event), or such as an air traffic controller (user) where the "vehicle states" and "vehicle behaviours" can be replaced by the "aeroplane states" and "aeroplane behaviours" of the aeroplane currently being directed by the air traffic controller and/or aeroplanes other than the aeroplane currently being directed by the air traffic controller.

Figure 1 schematically represents a method 100, for training a supervised learning model for cognitive workload recognition, in the context of multi-modal input information acquisition 102. The input information comprises temporal or time series information related to the vehicle and/or driver over a plurality of time steps. This information can be acquired through any appropriate mechanism - e.g. the information can be extracted from a database or recorded and used for real-time training.

While the temporal series information may have a single input mode - e.g. EEG, eye tracking/movement or vehicle behavior/performance - the present temporal series information is multi-modal, thus having multiple input modes. Each input mode is a respective one of EEG signals, eye movements and vehicle states.

In some embodiments, the temporal series information is captured from monitoring drivers during normal on-road driving. However, in the present embodiment the temporal series information is captured through the driver-in- the-loop experimental platform 200, shown in Figure 2. The platform 200 may have any appropriate configuration to facilitate data acquisition of the desired input modes, and presently comprises a physical simulator 202 (e.g. Logitech G29), an image capture device or system for tracking eye movement 204 (e.g. an infrared eye tracker such as Tobii Pro), and a wired or wireless EEG headset

206 (e.g. EMOTIVE EPOC Flex, with 32 channels).

Data are captured for different weather and lighting conditions, such as sunny noon, foggy dusk, and rainy night, to be able to learn cognitive workload information across various levels of visibility and stress. Since human mental workload cannot be directly observed, the method may involve collecting the input temporal series information by subjecting one or more drivers to tests of varying difficulty. For example, n-back tasks with varying difficulty may be employed to modulate cognitive workload levels objectively. The n-back task may be a visual-auditory mixed n-back task for regulating cognitive loads on drivers. This task can therefore reflect cognitive workload introduced by both visual and auditory information during driving.

In each driving environment or under each set of driving conditions, the tests may comprise secondary tasks with the varied amount of information that participants need to memorize and respond to, such as maintaining speed through traffic, driving from origin to destination and others. These secondary tasks enable the system 200 to obtain three classes of driver cognitive workload, namely slight level, moderate level and intensive level, which correspond to ground truth labels - e.g. ternary ground truth labels for three input modes.

The data may be pre-processed. This can be necessary to remove artifacts such as blink, facial and body movement artifacts from eye tracking data. Various techniques can be adopted for removing noise from signals, or removing signals, before extracting sub-band components from raw data. Band-pass filtering (low- and/or high-band) and notch filtering (to remove power supply noise) may restrict the band spectrum to relevant information - e.g. 1-30 Hz, and independent component analysis (ICA) may be used to reject artefact-induced signal components. The temporal series information captured by the system 200 is transmitted or conveyed to a system 210 that employs method 100 to train a supervised learning model 102 for cognitive workload recognition.

With further reference to Figure 1, the model 104 comprises a main long shortterm memory (LSTM) network 106, an auxiliary LSTM network 108 and a classifier layer 110. The model 104 is configured to output a predicted cognitive workload level of a vehicle driver in response to input of the temporal series information over a plurality of time steps. Via model 104, the method 100 (as also reflected in Figure 3) employs a sequence-to-sequence learning paradigm. The learning paradigm comprises, at each time step:

- step 106' - updating the main LSTM network 106 to map the temporal series information to a sequence of hidden states

- step 108' - updating the auxiliary LSTM network 108 to generate weights for the main LSTM network

- step 110' - obtaining the predicted cognitive workload level by processing the hidden states and the weights through a classifier layer of the model.

The order of the steps may be changed. For example, step 108' may be performed before step 106'.

To make use of the model 104, the driver cognitive load recognition is formulated as a supervised classification problem. In this problem, the workload levels are adopted as labels, as shown in Figure 1. For data capture from a simulated environment, the tasks user/drivers are asked to perform will be assigned a predetermined workload level based on an anticipated difficulty or amount of attention required to successfully complete the task. For data capture in a real world environment, an EEG measurement system may be mounted ot a user/driver and the user-driver may be in a vehicle with one or more image capture devices and sensors mounted to capture images (e.g. video feed) of the driving environment, vehicle parameters - e.g. vehicle speed - and/or eye movements of the user/driver. The images may be cross-referenced to simulated tasks or otherwise processed to ascertain workload levels at multiple time intervals while driving. For illustration purposes, the present discussion will be made with reference to a simulated environment, but it will be appreciated that the same or similar teachings may be employed in respect of a real world environment. In this embodiment, the temporal series information is multimodal, comprising electroencephalogram (EEG) signals, eye movements and vehicle states. Consequently, given the dataset as:

(1)

X¹ denotes the multimodal temporal sequences of size /, y is the corresponding driver cognitive workload, which is categorized into three levels, i.e., slight (y = 0), moderate (y = 1) and intensive (y = 2), j is the j^th sample, and N is its total number of samples. For each sample: (2)

wherein x represents the feature vectors across the time steps, with x, being the feature vector at the i^th time step. The features may described one or more relationships between the multi-modal input data and driver cognitive load. The temporal series information consists of EEG signals, eye movements and vehicle motion states, denoted by X¹ = [W¹, M¹, V'] and x = [w, m, v], respectively. Specifically, w = [wδ, wΘ, w_a, w_β] is the power of four typical EEG frequency bands, i.e., delta (1-4 Hz), theta (4-8 Hz), alpha (8-13 Hz), beta (13-30 Hz). Greater, fewer or different frequency bands may be used, m = [m_Cx, m_Cy, m_Sx, m_sy] is horizontal/vertical coordinates and speeds of the eye gaze. In some embodiments, only the coordinates may be provided and in other embodiments only the speeds of eye gaze may be provided, v = [ Δv_x, Δv_y, Δa_x, Δa_y, Δv, y] represents the instantaneous longitudinal and lateral velocities/accelerations of the vehicle with respect to the front one, the relative resultant velocity and the yaw rate, respectively. In some embodiments, only a proper subset of these velocities is required for assessing cognitive workload. Accordingly, where Dim is the total dimension from concatenating all feature vectors in x. In the embodiment given above, Dim is 14 (w (4), m (4) and v (6)) and thus x ∈

The model then learns to generate the corresponding workload level

based on temporal series information

(3)

where < = {0,1,2} is the label of cognitive workload levels.

With reference to Figure 1, the model itself comprises a main LSTM 106 that maps the temporal series information 108 to a sequence of hidden states 110. The model 104 further comprises at least one auxiliary LSTM network 112 and a classifier layer 114. In some embodiments the model further comprises an attention mechanism 116. The number of HyperLSTMs may correspond to the number of modal inputs - for example, three or four HyperLSTMs will be use for three or four modal inputs, respectively.

As shown, the model 104 comprises a plurality of auxiliary LSTM networks, herein referred to as hyper long short-term memory (HyperLSTM) based modules each associated with a main LSTM, marked 106, 118, 120 for input modes EEG, eye movements and vehicle performance, respectively. Each HyperLSTM module comprises a LSTM network and a HyperLSTM network. The number of HyperLSTM networks or modules may be the same as the number of weights or hyperparameters to be dynamically learned. For example, for each input mode the hyperparameters may be the standard cell, input, output and forget gate values.

The HyperLSTM, a variant of HyperNetworks, is an auxiliary LSTM network that is designed to dynamically learn hyper parameters, i.e., the weights of each main LSTM cell at each time step. The HyperLSTM-based module is a dualnetwork architecture that jointly captures time-series feature representations and adapts itself through dynamic hyperparameters learning from the multimodal information. This joint capturing of information assists with managing data variability among individual drivers. The dual-network architecture involves the HyperLSTM output being fed to the LSTM, in each HyperLSTM module.

Regarding using the HyperLSTM-based module for mapping EEG signals (the description for other input modes is the same, but with the feature vector for the relevant input mode - so

refers to the hidden state at time I in temporal series k where, for the input modes mentioned above, k is one of w, m and v): given EEG temporal sequences W¹, for all t ∈ {1,2,...,/}, the main LSTM network maps the time-series information to a sequence of hidden states

via following updates per: (4)

where δ and tanh are the sigmoid function and hyperbolic tangent function, respectively, Θ represents the element-wise product,

_are parameters generated by the HyperLSTM, * denotes one of {/,f,o,c} gates, N_h is the hidden size of the main LSTM network and N_w = 4 is the number of EEG features. The update of the main LSTM network is denoted as:

(5) The input of the HyperLSTM network is the concatenation of the hidden state from

(the principle of the main LSTM network, with reference to equations (4) and (5)) and EEG signals w_t :

Similarly, updates of the HyperLSTM can be described as:

(7)

In the described methodology, weights are functions of a set of embeddings, where the embeddings are linear projections of the hidden states of the relevant auxiliary LSTM network. More formulaically, weights matrices W*, I* and b* are functions of a set of embeddings z*_h, z*and z_b* , respectively, which are linear projections of the hidden states of HyperLSTM cells:

where

can be set to any desired value, based on memory usage requirements. For example, N_z can be set to 16 to reduce the memory usage required by the ARecNet.

is the hidden size of the HyperLSTM. At each time step, weights matrices of main LSTM cells are dynamically formulated. The dynamic formulation may follow: (9)

where

denotes the tensor dot product. Accordingly, the last hidden state of the main LSTM, i.e.,

, , (106") is obtained as the representation of the EEG information. Similarly, learning representations of eye movements and vehicle states are mapped as ⁿ and

, respectively,

the last hidden state of the respective main LSTMs being labelled 108" and 110", respectively. For this reason, the input modes corresponding to representations w, m, and v have been replaced with k in various formulae, to indicate that the k may be any one of the input modes.

In some embodiments, the outputs of the last hidden state for each input mode, 106", 108", 110" is given to the classifier layer 114. To achieve this, the outputs may be concatenated. However, in the embodiment shown in Figure 1, learning or feature representations (collectively 122) obtained by each HyperLSTM- based module (h in equation (10)) performs an equidimensional projection through a fully connected layer 124 (HZ in equation (10) - i.e. the input dimensions are the same, and the output dimensions are consistent with the input dimensions). The results are concatenated at 126 as: (10)

wherein are parameters to be learned and

Then, similarity scores of feature representations of different information sources are computed using an attention matrix 128, formulated as: (11)

where

is utilized to enhance useful representations through increasing their scores automatically. M_att can be regarded as a weight matrix. M_att and representations 112 will be multiplied, as shown by the connecting path between the two in Figure 12. The hidden states are integrated by an integration layer, presently a max pooling layer 130, such that: (12)

yhere h_att e [R^wh denotes the attention-based hidden state with strengthened feature representations. To improve the training stability, the h_att may be normalized (at layer 132) as: h_att = layernorm(h_att) (13)

To obtain the probability of each label, the ARecNet performs a nonlinear projection through a classifier layer 114:

(14) where

are parameters to be learned. Eventually, the predicted cognitive workload level

is obtained, being either 0, 1 or 2, corresponding to slight, moderate or intensity cognitive workload. The classifier layer 114 may have any appropriate architecture. In some embodiments, the classifier layer comprises a fully connected layer followed by a softmax activation layer that produces . The classifier layer 114 is Figure 1

further comprises a fully connected layer and a rectified linear unit the output of which is fed into the second fully connected layer and from there into the softmas layer.

During training, instead of relying on a single label, a sequence-to-sequence (Seq2Seq) learning paradigm is employed. For the given dataset D) , an optimized cross-entropy loss function is adopted:

(15)

where x≤t = [x1,x2,...,xt] denotes the subsequence of . In addition to

encouraging the ARecNet to extract feature representations from early observations, the given loss function reduces the possibility of overfitting when the current information is insufficient for recognition.

The cognitive workload cannot be observed directly, resulting in inherent uncertainty in its label. Therefore, a regularization technique, namely label smoothing, is introduced to improve the model generalization ability. Label smoothing may be performed using any appropriate method. For example, label smoothing may comprise assigning the real cognitive workload label a probability that penliases overconfident predictions, e.g. real cognitive workload label y_j may be assigned a probability 1 - ∈, while the probability of other labels is replaced by

wherein a tunable parameter ∈ is set to 0.1 and k = 3 is the number of labels (i.e. 0, 1, 2).

In experiments, data was collected from fourteen participants (10 males, 4 females) with varied ages and driving experience. Data collection was performed using the system described with reference to Figure 2. The simulated driving environment was a three-lane expressway with several stationary vehicles distributed randomly in the three lanes. For all driving scenarios, the primary objective was to drive along a straight road and avoid stationary vehicles. Participants were asked to maintain the vehicle speed within a predetermined speed range - e.g. 80-90 km/h. This ensures a consistent workload level throughout driving. Obstacles with various paint colours are placed at a regular interval, and vehicle paint is classified into two categories, namely dark colours (black, navy, sepia) and bright colours (white, red, yellow). The driving tasks in different cognitive workload levels are illustrated below:

Slight level’. Only the primary task is required to be accomplished, i.e., participants need to avoid other vehicles and reach the destination. Moderate level-. In addition to avoiding all obstacles, participants need to recall the colour category of the previous obstacle and press the corresponding button as they drive past new one.

Intensive level-. Apart from primary and visual tasks, participants also need to listen to a pre-recorded series of 15 letters separated by approximately 4 second intervals and count the number of times two identical letters appeared in pairs in a sequence, e.g., "H, H".

Audio stimuli exist in all experiments to inhibit their effect on EEG signals. However, only participants with the intensive workload level react to them. To ensure no single lane was free of obstacles for an extended stretch of road, a custom-defined discrete distribution for obstacle locations is employed: ( 16)

where K_T is the distance between the current obstacle and the previous one in line T , IntervalSize denotes the distance between two adjacent obstacles. Moreover, both visual and audio stimuli are regenerated randomly in each experiment to rule out human memory effect.

The driver workload dataset extracted as set out with reference to Figure 2, can be migrated to both the performance evaluation of other recognition approaches and extended studies involving the cognitive workload, such as driving authority allocation and takeover strategies design, etc. In human-machine cooperative driving, the driver's authority is usually determined. For a very high driver workload, the driving authority can be zero. Consequently, to ensure safety the driver's inputs will not be executed. Generally, the range of driver authority can be [0, 1]. 0 indicates that the vehicle has been taken over by the machine, and 1 means that the vehicle is completely controlled by the human. The dataset is multimodal, presently containing three types of information, i.e., EEG signals, eye gaze and vehicle states. This facilitates identification of workload response on one channel (mode) where that response is not evident on another channel (mode). The dataset contains multiple scenarios - lighting conditions, colours, speeds, obstacles, audio and/or visual tasks. The dataset can, for example, reveal the influence of varied visibility on driver workload recognition. The dataset is multi-sensory. For example, visual-auditory mixed stimuli can be used, requiring drivers to respond to visual information, which reflects the visual-induced cognitive workload in the real world, and audio stimuli, which reflects auditory activities such as voice navigation and phone calls during practical driving.

Pre-processing is performed, as set out with reference to Figure 2, to remove artifacts from the data. By identifying artifacts, an activity power spectrum of three typical components and the corresponding scalp topographies can be produced as shown in Figure 4. The artifact in image (a) of Figure 3 is produced by eye activities such as blink, in which high power at low frequencies is concentrated close to eyes. Muscle artifacts are also evident as shown in image (b) of Figure 3, the muscle artefact having relatively high power at high frequencies (20-30 Hz) with a localized distribution on the scalp topography. Image (c) of Figure 3 represents the normal component generated by brain- related activities. Image (c) is therefore adopted to calculate the power of various frequency bands. The power may be calculated through: (17)

where w Ψ with Ψ = { δ,θ , α (β} is the power of the corresponding EEG signal, Ω represents EEG channels, fi and fu are lower and upper frequency limits of Ψ- band, Sj(f) is the power spectrum density (PSD) of the j^th channel, which is calculated using the fast Fourier transform (FFT) with a hamming window. All features are uniformly resampled to 10 Hz, i.e., / = 10 tw, and normalized (zz score) to the same scale. Also, the input sequences are extracted using a sliding window with 90% overlap to augment training samples.

The recognition performance of the present methodology can be evaluation through various metrics, including average accuracy (Gave), precision (Pr), recall (Re) and Fl score, which are formulated as:

(18)

(19)

where tp, tn, fp and fn represent true positives, true negatives, false positives and false negatives, respectively.

To test performance enhancement over known methods, a comparative study was performed between the present methodology and previous learning-based methods, namely, MTS-CNN (variant of a CNN-based architecture), DecNet (a variant of an LSTM-based network), CNN-LSTM model, and m-HyperLSTM (a variant of HyperNetworks which uses only one HyperLSTM-based module).

The present methodology can effectively capture and strengthen useful timeseries feature representations through HyperLSTM-based modules and a crossattention mechanism. The designed model was trained using an Adam optimizer, with a desired learning rate - e.g. 0.001. The batch size is selected to obtain a trade-off between the training time and model generalization ability (e.g. batch size of 64).

The recognition accuracies and standard deviations of the the present methodology with varied time-series information and historical horizons under typical driving scenarios is shown in Figure 5. Based on recognition accuracy, vehicle states clearly have a lower influence on cognitive workload than physiological and visual information. Since extra mental workload is generally required to ensure safe driving with the decreased visibility, classification becomes more difficult as average cognitive workload increases (e.g. in rough inverse proportion to visibility). By comparison, the multimodal information fusion-based ARecNet has a relatively stable recognition performance in varied environments, and has lower standard deviation in most cases, indicating that the multimodal information fusion-based ARecNet has better stability.

The recognition results of the multimodal information fusion-based ARecNet with different historical horizons in typical driving scenarios are shown in Figure 6, in which each confusion matrix displays the average result of five-fold cross validation. In each set (a - sunny noon), (b - foggy dusk) and (c - rainy night) of confusion matrices, the rows (top to bottom) are recognition accuracy for slight, moderate and intensive cognitive workload and the classification accuracy of each workload level with respect to prediction values, and the rightmost column shows the recognition results with respect to the ground truth label. Recognition accuracy increases with extended historical horizons.

Figure 7 shows precision-recall curves illustrating the influence of different decision thresholds n on each workload level (historical horizon t_w = 1 s). The macro-average curves are nearly the same in all of images (a), (b) and (c), indicating superior comprehensive recognition performance of the present methodology in varied weather conditions. The optimal threshold point is at the tangent of the corresponding precision-recall curve and the Fl-score curve.

In experiments against known workload recognition models mentioned above, the performance of m-HyperLSTM universally surpassed the LSTM-based models and the performance of DecNet and CNN-LSTM models was close to m- HyperLSTM in some specific situations - this suggests the adaptive module can capture feature representations more effectively than static ones. The Fl scores of the present methodology were significantly higher than those for m- HyperLSTM, especially with the historical horizons t_w = 1 s, which is increased by at least 3.32%. The phenomena indicate the superiority of decision-level fusion architecture in the present disclosure.

In ablation experiments the effects of HyperLSTM and cross-attention within the ARecNet were tested. In this regard, a variant was also tested that lacks an attention mechanism (herein referred to as RecNet). The results are in Table I.

Variants ARecNet

HyperLSTM x x √ √

Cross attention x √ V √

Driving scenario (high visibility): Sunny noon tw = 1 s 0.832 (14.81%)0.843 (13.55%) 0.877 ($0.34%) 0.874 tw = 2 s 0.879 (14.66%)0.892 (13.25%) 0.918 (10.43%) 0.922 tw = 4 s 0.882 (17.16%)0.911 (14.11%) 0.945 (10.53%) 0.950

Driving scenario (medium visibility) : Foggy dusk tw = 1 s 0.772 (15.04%)0.793 (12.46%) 0.817 (fO.49%) 0.813 tw = 2 s 0.836 Q4.89%)0.856 (12.62%) 0.874 (10.57%) 0.879 tw = 4 s 0.842 (18.48%)0.884 (13.91%) 0.913 (10.76%) 0.920

Driving scenario (low visibility): Rainy night tw = 1 s 0.756 (13.69%)0.772 (11.66%) 0.782 (10.38%) 0.785 tw = 2 s 0.787 (14.61%)0.801 (12.91%) 0.816 (11.09%) 0.825 tw = 4 s 0.793 (18. ll%)0.831 (13.71%) 0.853 (11.16%) 0.863

Table I: ablation study on the recognition accuracy of HyperLSTM and cross attention with varied historical horizons in different driving scenarios.

HyperLSTM greatly improved the model performance in all cases, further indicating its superior feature capturing ability compared to conventional LSTM models. The influence of the cross-attention mechanism remains inconspicuous in Table I, especially for HyperLSTM-based models. A paired t-test was also employed to determine statistical significance of cross attention, with five-fold cross validation performed 50 times with the same sequence of random seeds, and statistical results presented in Figure 8 - single asterisk (*) and double asterisks (**) represent R values lower than 0.05 and 0.01, respectively. Cross attention provided no statistically relevant improvement at a 1 s historical horizon, but provided significantly better performance for longer horizons - e.g. 4 s. The phenomena demonstrate that the cross- attention mechanism is better at strengthening useful learning representations of longer sequence multimodal information.

Embodiments of the present methodology seek to address the two limitations of most previous driver workload recognition models in practical applications, namely single-modality indicators, and time-series signal distortion. The present, decision-level multimodal information fusion architecture can employ a cross-attention mechanism to strengthen useful feature representations captured by the HyperLSTM-based module from individual information sources. Experimental results demonstrate that the proposed models are advantageous over other baseline approaches in terms of recognition accuracy and robustness.

In addition to the driver workload estimation, the data collection methodology provides a generic driver monitoring framework for advanced driving assistance systems (ADAS). Using a minor alteration of the structure of the model (e.g. adding more, or removing, HyperLSTM modules, containing one or more HyperLSTM networks and a main LSTM network, depending on the number of information sources/input modes - e.g. road types and traffic condition monitoring) according to the number of information sources, it can be utilized for driver distraction/fatigue detection. Meanwhile, accurate driver states recognition can provide a decision-making basis for the mutual takeover of drivers and vehicles, which is beneficial to other ADAS technologies such as lane departure warning systems, traffic jam assistant systems, and others.

The present framework can also be extended into specific application fields involving multiple biosensors, for example, athlete health stress estimation and air traffic controller states monitoring, etc.

An end-user computing device or system referred to in this disclosure comprises a smartphone device, a tablet device, a laptop device etc. that is used by an end user to train a supervised learning model for cognitive workload recognition, or implement that model for real time cognitive workload recognition. Computing device 900 of Figure 9 illustrates a schematic diagram of one such device. The device 900 comprises one or more processing units 910 with access to one or more pre-processors 902 (if used) for pre-processing input data - e.g. from EEG, vehicle behaviour and/or eye tracking - has a communication channel to a camera/external device(s) 904 for collecting input data (e.g. image capture device(s), vehicle monitoring sensors for speed, yaw and others, EEG system etc), auxiliary network module(s) 906 each comprising auxiliary network(s) 908 and a main LSTM network 910, an attention mechanism 912 and a classifier layer 914 (the term "classifier layer" may be used to refer to a single layer or multiple layers in a machine learning network, depending on the context used herein).

The external device(s) 904 may be integral with, or unitary with, system 900, or may be separate.

System 900 may be in communication (e.g. over network 918) with one or more server system 916 that serve as a back end system for an application executing on the system 900. For example, in embodiments where system 900 is a smartphone, the server system 916 may be a backend application server of a relevant application for input data evaluation executing on the system 900. The server system 916 may transmit code or information to the system 900 and may receive information from system 900 obtained after pre-processing input data captured by external device(s) 904.

The code running the methodology, and/or input data whether before or after pre-processing, may be stored in memory 920.

It will be appreciated that many further modifications and permutations of various aspects of the described embodiments are possible. Accordingly, the described aspects are intended to embrace all such alterations, modifications, and variations that fall within the spirit and scope of the appended claims.

Throughout this specification and the claims which follow, unless the context requires otherwise, the word "comprise", and variations such as "comprises" and "comprising", will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps. The reference in this specification to any prior publication (or information derived from it), or to any matter which is known, is not, and should not be taken as an acknowledgment or admission or any form of suggestion that that prior publication (or information derived from it) or known matter forms part of the common general knowledge in the field of endeavour to which this specification relates.

Claims

1. A method to train a supervised learning model for cognitive workload recognition by employing a sequence-to-sequence learning paradigm, wherein the model comprises a main long short-term memory (LSTM) network, an auxiliary LSTM network and a classifier layer; wherein the model is configured to output a predicted cognitive workload level of a user in response to input of temporal series information related to the user over a plurality of time steps by, at each said time step: updating the main LSTM network to map the temporal series information to a sequence of hidden states; updating the auxiliary LSTM network to generate weights for the main LSTM network; and obtaining the predicted cognitive workload level by processing the hidden states and the weights through the classifier layer.

2. The method of claim 1, wherein the temporal series information is multimodal, the model comprising, for each input mode, a main long shortterm memory (LSTM) network, an auxiliary LSTM network and a classifier layer, wherein the hidden states from each main LSTM network are integrated and the classifier layer obtains the predicted cognitive workload level by processing the integrated hidden states and the weights.

3. The method of claim 2, wherein each input mode is a respective one of electroencephalogram (EEG) signals, eye movements and external states. The method of claim 2 or 3, wherein the model comprises an attention mechanism for learning cross-attention parameters between input modes and integrating the hidden states using the cross-attention parameters. The method of any one of claims 1 to 4, wherein updating the main LSTM network is according to

where

refers to the hidden state at time t in temporal series k, k_t refers to the temporal series information at time t, c_t refers to a gate for selectively carrying information from the previous time step through a forget gate at time t. The method of claim 5, wherein updating the auxiliary LSTM network is according to

where k_t is the concatenation of the hidden state h_t_₁ from

The method of any one of claims 1 to 6, wherein the weights are functions of a set of embeddings, wherein the embeddings are linear projections of the hidden states of the auxiliary LSTM network. The method of any one of claims 1 to 7, wherein the classifier layer is a fully connected layer followed by a softmax activation. The method of any one of claims 1 to 8, wherein employing the sequence- to-sequence learning paradigm comprises adopting a cross-entropy loss function according to

where x_<t represents the subsequence of the hidden states. . The method of any one of claims 1 to 9, comprising using label smoothing to improve generalization ability of the model. . The method of any one of claims 1 to 10, comprising using an Adam optimizer to train the model. . The method of any one of claims 1 to 11, wherein the user is a driver. A system that trains a supervised learning model for cognitive workload recognition, wherein the system comprises a plurality of processors configured to train the model by employing a sequence-to- sequence learning paradigm, wherein the model comprises a main long short-term memory (LSTM) network, an auxiliary LSTM network and a classifier layer; wherein the model is configured to output a predicted cognitive workload level of a user in response to input of temporal series information related to the user over a plurality of time steps by, at each said time step: updating the main LSTM network to map the temporal series information to a sequence of hidden states; updating the auxiliary LSTM network to generate weights for the main LSTM network; and obtaining the predicted cognitive workload level by processing the hidden states and the weights through a classifier layer of the model. . The system of claim 12, wherein the temporal series information is multi-modal, the model comprising, for each input mode, a main long short-term memory (LSTM) network, an auxiliary LSTM network and a classifier layer, wherein the hidden states from each main LSTM network are integrated and the classifier layer obtains the predicted cognitive workload level by processing the integrated hidden states and the weights. . The system of claim 13, wherein each input mode is a respective one of electroencephalogram (EEG) signals, eye movements and external states. The system of claim 13 or 14, wherein the model comprises an attention mechanism for learning cross-attention parameters between input modes and integrating the hidden states using the cross-attention parameters. The system of any one of claims 12 to 15, wherein updating the main LSTM network is according to

where h_t refers to the hidden state at time t, x_t refers to the temporal series information at time t, c_t refers to a gate for selectively carrying information from the previous time step through a forget gate at time t. . The system of claim 16, wherein updating the auxiliary LSTM network is according to

where x_t is the concatenation of the hidden state h_L-1 from €_hyper and x_t. . The system of any one of claims 12 to 17, wherein the weights are functions of a set of embeddings, wherein the embeddings are linear projections of the hidden states of the auxiliary LSTM network.

20. The system of any one of claims 12 to 18, wherein the classifier layer is a fully connected layer followed by a softmax activation.

21. The system of any one of claims 12 to 19, wherein employing the sequence-to-sequence learning paradigm comprises adopting a crossentropy loss function according to

where x_<t represents the subsequence of the hidden states.

22. The system of any one of claims 12 to 20, wherein the processors are configured to use label smoothing to improve generalization ability of the model.

23. The system of any one of claims 12 to 22, wherein the processors are configured to use an Adam optimizer to train the model.

24. The system of any one of claims 12 to 23, wherein the user is a driver.