WO2024015018A1 - Cognitive workload recognition from temporal series information - Google Patents

Cognitive workload recognition from temporal series information

Info

Publication number
WO2024015018A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
lstm network
lstm
network
main
Prior art date
Application number
PCT/SG2023/050490
Other languages
French (fr)
Inventor
Chen LYU
Haohan YANG
Original Assignee
Nanyang Technological University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanyang Technological University filed Critical Nanyang Technological University
Publication of WO2024015018A1 publication Critical patent/WO2024015018A1/en

Classifications

    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/72 Signal processing specially adapted for physiological signals or for diagnostic purposes
    • A61B5/7235 Details of waveform analysis
    • A61B5/7264 Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems
    • A61B5/7267 Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems involving training the classification device
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/16 Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • A61B5/163 Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state by tracking eye movement, gaze, or pupil change
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/16 Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • A61B5/18 Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state for vehicle drivers or machine operators
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/24 Detecting, measuring or recording bioelectric or biomagnetic signals of the body or parts thereof
    • A61B5/316 Modalities, i.e. specific diagnostic methods
    • A61B5/369 Electroencephalography [EEG]
    • A61B5/372 Analysis of electroencephalograms
    • A61B5/374 Detecting the frequency distribution of signals, e.g. detecting delta, theta, alpha, beta or gamma waves
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W40/00 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models
    • B60W40/08 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models related to drivers or passengers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/09 Supervised learning
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B2503/00 Evaluating a particular growth phase or type of persons or animals
    • A61B2503/20 Workers
    • A61B2503/22 Motor vehicles operators, e.g. drivers, pilots, captains
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2520/00 Input parameters relating to overall vehicle dynamics
    • B60W2520/10 Longitudinal speed
    • B60W2520/105 Longitudinal acceleration
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2520/00 Input parameters relating to overall vehicle dynamics
    • B60W2520/12 Lateral speed
    • B60W2520/125 Lateral acceleration
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2520/00 Input parameters relating to overall vehicle dynamics
    • B60W2520/14 Yaw
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2540/00 Input parameters relating to occupants
    • B60W2540/22 Psychological state; Stress level or workload
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2540/00 Input parameters relating to occupants
    • B60W2540/225 Direction of gaze

Definitions

  • the present invention relates, in general terms, to systems and methods for training supervised learning models for cognitive workload recognition.
  • Driver workload inference is significant for the design of intelligent human-machine cooperative driving schemes. Such inference allows systems to alert drivers before potentially dangerous manoeuvres are performed and achieve a safer control transition. However, pattern variations among individual drivers and sensor artefacts pose great challenges to the existing cognitive workload recognition approaches.
  • IVI in-vehicle infotainment
  • navigation systems provide real-time guidance to drivers but visual-manual tasks and auditory-verbal activities increase the driver's mental workload and are secondary to driving, thereby increasing the risk of distraction.
  • ARecNet Attention-enabled Recognition Network
  • EEG electroencephalogram
  • An "external state” may be a "vehicle state” in embodiments applied to cognitive workload recognition for vehicle drivers.
  • Previous machine learning technologies consider single input modalities which cannot fully exploit the complementarity of multimodal data in assessing cognitive workload, due to the information redundancy.
  • ARecNet employs a feature-level fusion architecture across input modes, to recognize driver cognitive workload.
  • a system that trains a supervised learning model for cognitive workload recognition, wherein the system comprises a plurality of processors configured to train the model by employing a sequence-to-sequence learning paradigm, wherein the model comprises a main long short-term memory (LSTM) network, an auxiliary LSTM network and a classifier layer; wherein the model is configured to output a predicted cognitive workload level of a user in response to input of temporal series information related to the user over a plurality of time steps.
  • LSTM main long short-term memory
  • ARecNet may embody a method to train a supervised learning model for cognitive workload recognition by employing a sequence-to-sequence learning paradigm.
  • the model comprises a main long short-term memory (LSTM) network, an auxiliary LSTM network and a classifier layer.
  • the model is configured to output a predicted cognitive workload level of a user in response to input of temporal series information related to the user over a plurality of time steps. It does so by, at each said time step: updating the main LSTM network to map the temporal series information to a sequence of hidden states; updating the auxiliary LSTM network to generate weights for the main LSTM network; and obtaining the predicted cognitive workload level by processing the hidden states and the weights through the classifier layer.
  • the temporal series information in the phrase "input of temporal series information related to the user over a plurality of time steps", and similar, refers to information relating to the user themselves (e.g. of the driver, where the user is a driver) and information relating to external states (e.g. of a vehicle, such as a neighbouring vehicle or a vehicle being driven by the user, where the user is a driver).
  • embodiments of the present invention establish an attention-enabled decision-level fusion architecture to infer driver cognitive workload levels. This suggests the availability of a viable generic technique for capturing useful feature representation from time-series multimodal information.
  • embodiments of the present invention involve constructing a novel driver workload dataset, including multimodal signals and multiple driving scenarios.
  • Figure 1 is a schematic overview of an attention-enabled cognitive workload recognition method and model, with multimodal information fusion in accordance with present teachings;
  • Figure 2 is a platform for multi-modal information capture, which can be applied as a driver-in-the-loop platform;
  • Figure 3 illustrates a method to train a supervised learning model for cognitive workload recognition, in accordance with present teachings
  • Figure 4 shows activity power spectra for typical components and the corresponding scalp topographies, in which image (a) shows eye artefact, (b) shows muscle artefact and (c) is normal;
  • Figure 6 shows confusion matrices for driver cognitive workload recognition with different historical horizons in typical driving scenarios, in which image (a) is sunny noon, (b) is foggy dusk, and (c) is rainy night;
  • Figure 8 is the outcome of statistical tests of decision-level fusion-based approaches with HyperLSTM modules.
  • Figure 9 is a block diagram of a system for cognitive workload recognition (estimation).
  • ARecNet: an attention-enabled recognition network with a decision-level fusion architecture that assesses cognitive workload estimation performance.
  • the present methodology employs a cross-attention mechanism to enhance useful feature representations learned by hyper long short-term memory (HyperLSTM) based modules from time-series multimodal information, e.g., EEG signals, eye movements, external states or behaviours (e.g. vehicle states or vehicle behaviour or, where the user is an athlete, the state or behaviour of the user's body and/or competing athletes around the user - such as on a running track).
  • HyperLSTM hyper long short-term memory
  • Figure 1 schematically represents a method 100, for training a supervised learning model for cognitive workload recognition, in the context of multi-modal input information acquisition 102.
  • the input information comprises temporal or time series information related to the vehicle and/or driver over a plurality of time steps.
  • This information can be acquired through any appropriate mechanism - e.g. the information can be extracted from a database or recorded and used for real-time training.
  • the temporal series information may have a single input mode - e.g. EEG, eye tracking/movement or vehicle behavior/performance - the present temporal series information is multi-modal, thus having multiple input modes.
  • Each input mode is a respective one of EEG signals, eye movements and vehicle states.
  • the temporal series information is captured from monitoring drivers during normal on-road driving.
  • the temporal series information is captured through the driver-in-the-loop experimental platform 200, shown in Figure 2.
  • the platform 200 may have any appropriate configuration to facilitate data acquisition of the desired input modes, and presently comprises a physical simulator 202 (e.g. Logitech G29), an image capture device or system for tracking eye movement 204 (e.g. an infrared eye tracker such as Tobii Pro), and a wired or wireless EEG headset
  • n-back tasks with varying difficulty may be employed to modulate cognitive workload levels objectively.
  • the n-back task may be a visual-auditory mixed n-back task for regulating cognitive loads on drivers. This task can therefore reflect cognitive workload introduced by both visual and auditory information during driving.
  • the tests may comprise secondary tasks with the varied amount of information that participants need to memorize and respond to, such as maintaining speed through traffic, driving from origin to destination and others.
  • These secondary tasks enable the system 200 to obtain three classes of driver cognitive workload, namely slight level, moderate level and intensive level, which correspond to ground truth labels - e.g. ternary ground truth labels for three input modes.
  • the data may be pre-processed. This can be necessary to remove artifacts such as blink, facial and body movement artifacts from eye tracking data.
  • Various techniques can be adopted for removing noise from signals, or removing signals, before extracting sub-band components from raw data. Band-pass filtering (low- and/or high-band) and notch filtering (to remove power supply noise) may restrict the band spectrum to relevant information - e.g. 1-30 Hz, and independent component analysis (ICA) may be used to reject artefact-induced signal components.
  • ICA independent component analysis
  • the temporal series information captured by the system 200 is transmitted or conveyed to a system 210 that employs method 100 to train a supervised learning model 102 for cognitive workload recognition.
  • the model 104 comprises a main long shortterm memory (LSTM) network 106, an auxiliary LSTM network 108 and a classifier layer 110.
  • the model 104 is configured to output a predicted cognitive workload level of a vehicle driver in response to input of the temporal series information over a plurality of time steps.
  • the method 100 (as also reflected in Figure 3) employs a sequence-to-sequence learning paradigm.
  • the learning paradigm comprises, at each time step:
  • step 106' updating the main LSTM network 106 to map the temporal series information to a sequence of hidden states
  • step 108' updating the auxiliary LSTM network 108 to generate weights for the main LSTM network
  • step 110' obtaining the predicted cognitive workload level by processing the hidden states and the weights through a classifier layer of the model.
  • step 108' may be performed before step 106'.
  • the driver cognitive load recognition is formulated as a supervised classification problem.
  • the workload levels are adopted as labels, as shown in Figure 1.
  • the tasks users/drivers are asked to perform will be assigned a predetermined workload level based on an anticipated difficulty or amount of attention required to successfully complete the task.
  • an EEG measurement system may be mounted on a user/driver and the user/driver may be in a vehicle with one or more image capture devices and sensors mounted to capture images (e.g. video feed) of the driving environment, vehicle parameters - e.g. vehicle speed - and/or eye movements of the user/driver.
  • the images may be cross-referenced to simulated tasks or otherwise processed to ascertain workload levels at multiple time intervals while driving.
  • the present discussion will be made with reference to a simulated environment, but it will be appreciated that the same or similar teachings may be employed in respect of a real world environment.
  • the temporal series information is multimodal, comprising electroencephalogram (EEG) signals, eye movements and vehicle states. Consequently, the dataset is given as in equation (1).
  • X^l denotes the multimodal temporal sequences of size l
  • j indexes the j-th sample
  • N is the total number of samples.
  • x represents the feature vectors across the time steps, with x_i being the feature vector at the i-th time step.
  • the features may describe one or more relationships between the multi-modal input data and driver cognitive load.
  • v = [Δv_x, Δv_y, Δa_x, Δa_y, Δv, γ] represents the instantaneous longitudinal and lateral velocities/accelerations of the vehicle with respect to the front one, the relative resultant velocity and the yaw rate, respectively. In some embodiments, only a proper subset of these quantities is required for assessing cognitive workload. Accordingly, x ∈ ℝ^{Dim}, where Dim is the total dimension from concatenating all feature vectors in x; in the embodiment given above, Dim is 14 (w (4), m (4) and v (6)).
  • the model itself comprises a main LSTM 106 that maps the temporal series information 108 to a sequence of hidden states 110.
  • the model 104 further comprises at least one auxiliary LSTM network 112 and a classifier layer 114.
  • the model further comprises an attention mechanism 116.
  • the number of HyperLSTMs may correspond to the number of modal inputs - for example, three or four HyperLSTMs will be used for three or four modal inputs, respectively.
  • the model 104 comprises a plurality of auxiliary LSTM networks, herein referred to as hyper long short-term memory (HyperLSTM) based modules each associated with a main LSTM, marked 106, 118, 120 for input modes EEG, eye movements and vehicle performance, respectively.
  • Each HyperLSTM module comprises a LSTM network and a HyperLSTM network.
  • the number of HyperLSTM networks or modules may be the same as the number of weights or hyperparameters to be dynamically learned. For example, for each input mode the hyperparameters may be the standard cell, input, output and forget gate values.
  • the HyperLSTM, a variant of HyperNetworks, is an auxiliary LSTM network that is designed to dynamically learn hyperparameters, i.e., the weights of each main LSTM cell at each time step.
  • the HyperLSTM-based module is a dual-network architecture that jointly captures time-series feature representations and adapts itself through dynamic hyperparameter learning from the multimodal information. This joint capturing of information assists with managing data variability among individual drivers.
  • the dual-network architecture involves the HyperLSTM output being fed to the LSTM, in each HyperLSTM module.
  • the update of the main LSTM network is denoted by equation (5) in the description.
  • the input of the HyperLSTM network is the concatenation of the hidden state of the main LSTM network (with reference to equations (4) and (5)) and the EEG signals w_t.
  • weights are functions of a set of embeddings, where the embeddings are linear projections of the hidden states of the relevant auxiliary LSTM network. More formally, the weight matrices W_*, I_* and b_* are functions of a set of embeddings z_h^*, z_x^* and z_b^*, respectively, which are linear projections of the hidden states of HyperLSTM cells. N_z can be set to any desired value, based on memory usage requirements; for example, N_z can be set to 16 to reduce the memory usage required by the ARecNet. N_ĥ is the hidden size of the HyperLSTM. At each time step, the weight matrices of the main LSTM cells are dynamically formulated.
  • the dynamic formulation may follow equation (9), where ⟨·,·⟩ denotes the tensor dot product. Accordingly, the last hidden state of the main LSTM, i.e., h_l^w (106"), is obtained as the representation of the EEG information. Similarly, learning representations of eye movements and vehicle states are mapped as h_l^m and h_l^v, respectively, the last hidden state of the respective main LSTMs being labelled 108" and 110", respectively. For this reason, the input modes corresponding to representations w, m and v have been replaced with k in various formulae, to indicate that k may be any one of the input modes.
  • the outputs of the last hidden state for each input mode, 106", 108", 110", are given to the classifier layer 114.
  • the outputs may be concatenated.
  • the learning or feature representations (collectively 122) obtained by each HyperLSTM-based module (h in equation (10)) each undergo an equidimensional projection through a fully connected layer 124 - i.e. the input dimensions are the same, and the output dimensions are consistent with the input dimensions.
  • the results are concatenated at 126 as in equation (10), wherein the projection matrices are parameters to be learned. Then, similarity scores of the feature representations of different information sources are computed using an attention matrix 128, as in equation (11), where the softmax function is utilized to enhance useful representations through increasing their scores automatically.
  • M_att can be regarded as a weight matrix. M_att and the representations 122 will be multiplied, as shown by the connecting path between the two in Figure 1.
  • the hidden states are integrated by an integration layer, presently a max pooling layer 130, as in equation (12), where h_att ∈ ℝ^{N_h} denotes the attention-based hidden state with strengthened feature representations.
  • the ARecNet performs a nonlinear projection through a classifier layer 114, as in equation (14), where the projection weights and biases are parameters to be learned.
  • the predicted cognitive workload level is obtained, being either 0, 1 or 2, corresponding to slight, moderate or intensive cognitive workload.
  • the classifier layer 114 may have any appropriate architecture.
  • the classifier layer comprises a fully connected layer followed by a softmax activation layer that produces ŷ.
  • the classifier layer 114 in Figure 1 further comprises a fully connected layer and a rectified linear unit, the output of which is fed into a second fully connected layer and from there into the softmax layer.
  • Label smoothing may be performed using any appropriate method.
  • Slight level: Only the primary task is required to be accomplished, i.e., participants need to avoid other vehicles and reach the destination. Moderate level: In addition to avoiding all obstacles, participants need to recall the colour category of the previous obstacle and press the corresponding button as they drive past a new one.
  • Intensive level: Apart from the primary and visual tasks, participants also need to listen to a pre-recorded series of 15 letters separated by approximately 4 second intervals and count the number of times two identical letters appeared in pairs in a sequence, e.g., "H, H".
  • the driver workload dataset extracted as set out with reference to Figure 2 can be migrated to both the performance evaluation of other recognition approaches and extended studies involving the cognitive workload, such as driving authority allocation and takeover strategies design, etc.
  • in human-machine cooperative driving, the driver's authority is usually determined according to the driver's state (e.g. workload level). For a very high driver workload, the driving authority can be zero; consequently, to ensure safety, the driver's inputs will not be executed.
  • the range of driver authority can be [0, 1]: 0 indicates that the vehicle has been taken over by the machine, and 1 means that the vehicle is completely controlled by the human (a minimal mapping sketch is given at the end of this list).
  • the dataset is multimodal, presently containing three types of information, i.e., EEG signals, eye gaze and vehicle states.
  • the dataset contains multiple scenarios - lighting conditions, colours, speeds, obstacles, audio and/or visual tasks.
  • the dataset can, for example, reveal the influence of varied visibility on driver workload recognition.
  • the dataset is multi-sensory. For example, visual-auditory mixed stimuli can be used, requiring drivers to respond to visual information, which reflects the visual-induced cognitive workload in the real world, and audio stimuli, which reflects auditory activities such as voice navigation and phone calls during practical driving.
  • Pre-processing is performed, as set out with reference to Figure 2, to remove artifacts from the data.
  • an activity power spectrum of three typical components and the corresponding scalp topographies can be produced as shown in Figure 4.
  • the artifact in image (a) of Figure 4 is produced by eye activities such as blinks, in which high power at low frequencies is concentrated close to the eyes. Muscle artifacts are also evident as shown in image (b) of Figure 4, the muscle artefact having relatively high power at high frequencies (20-30 Hz) with a localized distribution on the scalp topography.
  • Image (c) of Figure 4 represents the normal component generated by brain-related activities. Image (c) is therefore adopted to calculate the power of various frequency bands.
  • the recognition performance of the present methodology can be evaluated through various metrics, including average accuracy, precision (Pr), recall (Re) and F1 score, which take the standard forms Pr = TP/(TP + FP), Re = TP/(TP + FN) and F1 = 2·Pr·Re/(Pr + Re), where TP, FP and FN denote the numbers of true positives, false positives and false negatives, respectively.
  • MTS-CNN variant of a CNN-based architecture
  • DecNet variant of an LSTM-based network
  • CNN-LSTM model CNN-LSTM model
  • m-HyperLSTM variant of HyperNetworks which uses only one HyperLSTM-based module
  • the present methodology can effectively capture and strengthen useful time-series feature representations through HyperLSTM-based modules and a cross-attention mechanism.
  • the designed model was trained using an Adam optimizer, with a desired learning rate - e.g. 0.001.
  • the batch size is selected to obtain a trade-off between the training time and model generalization ability (e.g. batch size of 64).
  • the recognition accuracies and standard deviations of the present methodology with varied time-series information and historical horizons under typical driving scenarios are shown in Figure 5.
  • vehicle states clearly have a lower influence on cognitive workload recognition than physiological and visual information. Since extra mental workload is generally required to ensure safe driving with decreased visibility, classification becomes more difficult as average cognitive workload increases (e.g. in rough inverse proportion to visibility).
  • the multimodal information fusion-based ARecNet has a relatively stable recognition performance in varied environments, and has lower standard deviation in most cases, indicating that the multimodal information fusion-based ARecNet has better stability.
  • each confusion matrix displays the average result of five-fold cross validation.
  • the rows of each confusion matrix show the recognition accuracy for slight, moderate and intensive cognitive workload, i.e., the classification accuracy of each workload level with respect to the predicted values, and the rightmost column shows the recognition results with respect to the ground truth labels. Recognition accuracy increases with extended historical horizons.
  • the macro-average curves are nearly the same in all of images (a), (b) and (c), indicating superior comprehensive recognition performance of the present methodology in varied weather conditions.
  • the optimal threshold point is at the tangent of the corresponding precision-recall curve and the F1-score curve.
  • Table I ablation study on the recognition accuracy of HyperLSTM and cross attention with varied historical horizons in different driving scenarios.
  • HyperLSTM greatly improved the model performance in all cases, further indicating its superior feature capturing ability compared to conventional LSTM models.
  • the influence of the cross-attention mechanism remains inconspicuous in Table I, especially for HyperLSTM-based models.
  • a paired t-test was also employed to determine statistical significance of cross attention, with five-fold cross validation performed 50 times with the same sequence of random seeds, and statistical results presented in Figure 8 - single asterisk (*) and double asterisks (**) represent p values lower than 0.05 and 0.01, respectively.
  • Cross attention provided no statistically significant improvement at a 1 s historical horizon, but provided significantly better performance for longer horizons - e.g. 4 s.
  • these phenomena demonstrate that the cross-attention mechanism is better at strengthening useful learning representations of longer-sequence multimodal information.
  • Embodiments of the present methodology seek to address the two limitations of most previous driver workload recognition models in practical applications, namely single-modality indicators, and time-series signal distortion.
  • the present decision-level multimodal information fusion architecture can employ a cross-attention mechanism to strengthen useful feature representations captured by the HyperLSTM-based modules from individual information sources.
  • Experimental results demonstrate that the proposed models are advantageous over other baseline approaches in terms of recognition accuracy and robustness.
  • the data collection methodology provides a generic driver monitoring framework for advanced driving assistance systems (ADAS).
  • ADAS advanced driving assistance systems
  • Using a minor alteration of the structure of the model (e.g. adding or removing HyperLSTM modules, each containing one or more HyperLSTM networks and a main LSTM network, according to the number of information sources/input modes - e.g. adding road type and traffic condition monitoring), the model can be utilized for driver distraction/fatigue detection.
  • accurate driver states recognition can provide a decision-making basis for the mutual takeover of drivers and vehicles, which is beneficial to other ADAS technologies such as lane departure warning systems, traffic jam assistant systems, and others.
  • the present framework can also be extended into specific application fields involving multiple biosensors, for example, athlete health stress estimation and air traffic controller states monitoring, etc.
  • An end-user computing device or system referred to in this disclosure comprises a smartphone device, a tablet device, a laptop device etc. that is used by an end user to train a supervised learning model for cognitive workload recognition, or implement that model for real time cognitive workload recognition.
  • Computing device 900 of Figure 9 illustrates a schematic diagram of one such device.
  • the device 900 comprises one or more processing units 910 with access to one or more pre-processors 902 (if used) for pre-processing input data - e.g. from EEG, vehicle behaviour and/or eye tracking - and has a communication channel to a camera/external device(s) 904 for collecting input data (e.g. EEG signals, eye movements and vehicle states), as well as auxiliary network module(s) 906 each comprising auxiliary network(s) 908 and a main LSTM network 910, an attention mechanism 912 and a classifier layer 914 (the term "classifier layer" may be used to refer to a single layer or multiple layers in a machine learning network, depending on the context used herein).
  • the external device(s) 904 may be integral with, or unitary with, system 900, or may be separate.
  • System 900 may be in communication (e.g. over network 918) with one or more server systems 916 that serve as a back end system for an application executing on the system 900.
  • server system 916 may be a backend application server of a relevant application for input data evaluation executing on the system 900.
  • the server system 916 may transmit code or information to the system 900 and may receive information from system 900 obtained after pre-processing input data captured by external device(s) 904.
  • the code running the methodology, and/or input data whether before or after pre-processing, may be stored in memory 920.
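As a minimal illustration of the driver-authority allocation discussed above (the bullet on the [0, 1] authority range), the sketch below maps a predicted workload level to an authority value. The target values and smoothing factor are illustrative assumptions, not part of the disclosure.

```python
def allocate_authority(workload_level, prev_authority=1.0, alpha=0.5):
    """Map a predicted workload level (0=slight, 1=moderate, 2=intensive)
    to a driver-authority value in [0, 1]. The targets and the exponential
    smoothing (to avoid abrupt control transitions) are assumptions."""
    target = {0: 1.0, 1: 0.5, 2: 0.0}[workload_level]
    return alpha * target + (1.0 - alpha) * prev_authority
```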

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Public Health (AREA)
  • Psychiatry (AREA)
  • Pathology (AREA)
  • Veterinary Medicine (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Surgery (AREA)
  • Animal Behavior & Ethology (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Psychology (AREA)
  • Child & Adolescent Psychology (AREA)
  • Social Psychology (AREA)
  • Educational Technology (AREA)
  • Hospice & Palliative Care (AREA)
  • Developmental Disabilities (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physiology (AREA)
  • Fuzzy Systems (AREA)
  • Automation & Control Theory (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Traffic Control Systems (AREA)

Abstract

Disclosed is a method to train a supervised learning model for cognitive workload recognition by employing a sequence-to-sequence learning paradigm. The model comprises a main long short-term memory (LSTM) network, an auxiliary LSTM network and a classifier layer. The model is configured to output a predicted cognitive workload level of a user in response to input of temporal series information related to the user over a plurality of time steps. It does so by, at each said time step: updating the main LSTM network to map the temporal series information to a sequence of hidden states; updating the auxiliary LSTM network to generate weights for the main LSTM network; and obtaining the predicted cognitive workload level by processing the hidden states and the weights through the classifier layer.

Description

COGNITIVE WORKLOAD RECOGNITION FROM TEMPORAL SERIES INFORMATION
Technical Field
The present invention relates, in general terms, to systems and methods for training supervised learning models for cognitive workload recognition.
Background
Driver workload inference is significant for the design of intelligent human-machine cooperative driving schemes. Such inference allows systems to alert drivers before potentially dangerous manoeuvres are performed and achieve a safer control transition. However, pattern variations among individual drivers and sensor artefacts pose great challenges to the existing cognitive workload recognition approaches.
Various advanced functions have been developed for intelligent vehicles to improve the driving experience and convenience, but each has its drawbacks. For example, in-vehicle infotainment (IVI) such as navigation systems provide real-time guidance to drivers but visual-manual tasks and auditory-verbal activities increase the driver's mental workload and are secondary to driving, thereby increasing the risk of distraction.
Various approaches to obtaining cognitive load levels have been studied. The most straightforward ones are subjective measures, requiring drivers to conduct a self-evaluation by completing questionnaires after driving tasks. Typically, subjective measuring approaches provide cumulative estimations of the cognitive workload based on drivers' memories; while intuitive, these methods are uncertain since they are susceptible to memory bias. Studies have been performed to define cognitive workload based on physiological and vehicle indicators. Such methods typically require extended sampling windows (e.g. 2-5 min recordings for heart rate), or are susceptible to noise from changes in driving conditions (e.g. machine vision-based techniques for eye tracking deteriorate in low light). Vehicle indicators such as steering angles, vehicle speeds, and accelerations can be used, but are insensitive to low workload levels.
Many learning-based approaches have been developed for the recognition of driver cognitive loads from different measured signals. Some such approaches employ deep machine learning technologies. These machine learning-based methods commonly require manual feature extraction from raw data. Moreover, these methods commonly assess cognitive workload based on individual input modes - e.g. eye tracking or vehicle behaviour - and thereby fail to take advantage of the benefits of multi-modal information.
It would be desirable to overcome or ameliorate at least one of the above-described problems, or at least to provide a useful alternative.
Summary
To address the aforementioned challenges, proposed herein is an Attention-enabled Recognition Network (ARecNet) for recognizing driver cognitive load in real time using multiple input modes - e.g. electroencephalogram (EEG) signals, eye movements and external states. An "external state" may be a "vehicle state" in embodiments applied to cognitive workload recognition for vehicle drivers. Previous machine learning technologies consider single input modalities which cannot fully exploit the complementarity of multimodal data in assessing cognitive workload, due to the information redundancy. In contrast, ARecNet employs a feature-level fusion architecture across input modes, to recognize driver cognitive workload. Also disclosed is a system that trains a supervised learning model for cognitive workload recognition, wherein the system comprises a plurality of processors configured to train the model by employing a sequence-to-sequence learning paradigm, wherein the model comprises a main long short-term memory (LSTM) network, an auxiliary LSTM network and a classifier layer; wherein the model is configured to output a predicted cognitive workload level of a user in response to input of temporal series information related to the user over a plurality of time steps. It does so by, at each said time step: updating the main LSTM network to map the temporal series information to a sequence of hidden states; updating the auxiliary LSTM network to generate weights for the main LSTM network; and obtaining the predicted cognitive workload level by processing the hidden states and the weights through a classifier layer of the model.
Relevantly, ARecNet may embody a method to train a supervised learning model for cognitive workload recognition by employing a sequence-to-sequence learning paradigm. The model comprises a main long short-term memory (LSTM) network, an auxiliary LSTM network and a classifier layer. The model is configured to output a predicted cognitive workload level of a user in response to input of temporal series information related to the user over a plurality of time steps. It does so by, at each said time step: updating the main LSTM network to map the temporal series information to a sequence of hidden states; updating the auxiliary LSTM network to generate weights for the main LSTM network; and obtaining the predicted cognitive workload level by processing the hidden states and the weights through the classifier layer.
The temporal series information in the phrase "input of temporal series information related to the user over a plurality of time steps", and similar, refers to information relating to the user themselves (e.g. of the driver, where the user is a driver) and information relating to external states (e.g. of a vehicle, such as a neighbouring vehicle or a vehicle being driven by the user, where the user is a driver). Advantageously, embodiments of the present invention establish an attention-enabled decision-level fusion architecture to infer driver cognitive workload levels. This suggests the availability of a viable generic technique for capturing useful feature representation from time-series multimodal information.
Advantageously, embodiments of the present invention involve constructing a novel driver workload dataset, including multimodal signals and multiple driving scenarios.
Brief description of the drawings
Embodiments of the present invention will now be described, by way of nonlimiting example, with reference to the drawings in which:
Figure 1 is a schematic overview of an attention-enabled cognitive workload recognition method and model, with multimodal information fusion in accordance with present teachings;
Figure 2 is a platform for multi-modal information capture, which can be applied as a driver-in-the-loop platform;
Figure 3 illustrates a method to train a supervised learning model for cognitive workload recognition, in accordance with present teachings;
Figure 4 shows activity power spectra for typical components and the corresponding scalp topographies, in which image (a) shows eye artefact, (b) shows muscle artefact and (c) is normal;
Figure 5 shows the results of recognition accuracy assessment of the present methodology with varied temporal series information and historical horizons (image (a) tw = 1 s, (b) tw = 2 s, (c) tw = 4 s) in typical driving scenarios;
Figure 6 shows confusion matrices for driver cognitive workload recognition with different historical horizons in typical driving scenarios, in which image (a) is sunny noon, (b) is foggy dusk, and (c) is rainy night;
Figure 7 provides the precision-recall curves of cognitive workload level recognition with the historical horizon tw = 1 s in different driving scenarios, being (a) sunny noon, (b) foggy dusk, and (c) rainy night, where shaded areas represent the extrema across the 5-fold cross validation;
Figure 8 is the outcome of statistical tests of decision-level fusion-based approaches with HyperLSTM modules; and
Figure 9 is a block diagram of a system for cognitive workload recognition (estimation).
Detailed description
Disclosed is ARecNet, an attention-enabled recognition network with a decision-level fusion architecture, that assesses cognitive workload estimation performance. Specifically, the present methodology employs a cross-attention mechanism to enhance useful feature representations learned by hyper long short-term memory (HyperLSTM) based modules from time-series multimodal information, e.g., EEG signals, eye movements, external states or behaviours (e.g. vehicle states or vehicle behaviour or, where the user is an athlete, the state or behaviour of the user's body and/or competing athletes around the user - such as on a running track). Also disclosed is the construction of a novel dataset containing multiple driving scenarios for evaluating model performance across different historical horizons and decision thresholds.
The description below will be made with reference to a driver (user) and vehicle states, for illustration purposes only. Without loss of generality, the same teachings apply to other types of user such as an athlete, where the external states and external behaviours, being "vehicle states" and "vehicle behaviours", can be substituted for the 'states' and 'behaviours' of the athlete and/or one or more neighbouring athletes (on the same team - e.g. in volleyball - or different teams - e.g. in a competitive running event), or such as an air traffic controller (user) where the "vehicle states" and "vehicle behaviours" can be replaced by the "aeroplane states" and "aeroplane behaviours" of the aeroplane currently being directed by the air traffic controller and/or aeroplanes other than the aeroplane currently being directed by the air traffic controller.
Figure 1 schematically represents a method 100, for training a supervised learning model for cognitive workload recognition, in the context of multi-modal input information acquisition 102. The input information comprises temporal or time series information related to the vehicle and/or driver over a plurality of time steps. This information can be acquired through any appropriate mechanism - e.g. the information can be extracted from a database or recorded and used for real-time training.
While the temporal series information may have a single input mode - e.g. EEG, eye tracking/movement or vehicle behavior/performance - the present temporal series information is multi-modal, thus having multiple input modes. Each input mode is a respective one of EEG signals, eye movements and vehicle states.
In some embodiments, the temporal series information is captured from monitoring drivers during normal on-road driving. However, in the present embodiment the temporal series information is captured through the driver-in-the-loop experimental platform 200, shown in Figure 2. The platform 200 may have any appropriate configuration to facilitate data acquisition of the desired input modes, and presently comprises a physical simulator 202 (e.g. Logitech G29), an image capture device or system for tracking eye movement 204 (e.g. an infrared eye tracker such as Tobii Pro), and a wired or wireless EEG headset 206 (e.g. EMOTIV EPOC Flex, with 32 channels).
Data are captured for different weather and lighting conditions, such as sunny noon, foggy dusk, and rainy night, to be able to learn cognitive workload information across various levels of visibility and stress. Since human mental workload cannot be directly observed, the method may involve collecting the input temporal series information by subjecting one or more drivers to tests of varying difficulty. For example, n-back tasks with varying difficulty may be employed to modulate cognitive workload levels objectively. The n-back task may be a visual-auditory mixed n-back task for regulating cognitive loads on drivers. This task can therefore reflect cognitive workload introduced by both visual and auditory information during driving.
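As a minimal illustration of the n-back protocol just described, the following sketch generates a stimulus sequence and identifies the positions at which a response is required; the stimulus alphabet and sequence length are arbitrary assumptions for illustration.

```python
import random

def make_nback_trial(n, length=15, alphabet="ABCDEFGH"):
    """Generate an n-back stimulus sequence and the indices at which a
    response is required (stimulus equal to the one presented n steps earlier)."""
    seq = [random.choice(alphabet) for _ in range(length)]
    targets = [i for i in range(n, length) if seq[i] == seq[i - n]]
    return seq, targets

# Example: a 2-back trial; participants should respond at each index in `targets`.
seq, targets = make_nback_trial(n=2)
```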
In each driving environment or under each set of driving conditions, the tests may comprise secondary tasks with the varied amount of information that participants need to memorize and respond to, such as maintaining speed through traffic, driving from origin to destination and others. These secondary tasks enable the system 200 to obtain three classes of driver cognitive workload, namely slight level, moderate level and intensive level, which correspond to ground truth labels - e.g. ternary ground truth labels for three input modes.
The data may be pre-processed. This can be necessary to remove artifacts such as blink, facial and body movement artifacts from eye tracking data. Various techniques can be adopted for removing noise from signals, or removing signals, before extracting sub-band components from raw data. Band-pass filtering (low- and/or high-band) and notch filtering (to remove power supply noise) may restrict the band spectrum to relevant information - e.g. 1-30 Hz, and independent component analysis (ICA) may be used to reject artefact-induced signal components. The temporal series information captured by the system 200 is transmitted or conveyed to a system 210 that employs method 100 to train a supervised learning model 102 for cognitive workload recognition.
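A minimal sketch of the band-pass and notch filtering step described above is given below, assuming a 128 Hz sampling rate and 50 Hz mains frequency (both assumptions for illustration); ICA-based artefact rejection would typically follow, using a dedicated EEG toolbox.

```python
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch

def filter_eeg(raw, fs=128.0, band=(1.0, 30.0), mains=50.0):
    """Restrict EEG to the 1-30 Hz band and notch out power-line noise.
    `raw` has shape (channels, samples); fs and mains are assumptions."""
    b, a = butter(4, band, btype="bandpass", fs=fs)
    x = filtfilt(b, a, raw, axis=-1)           # zero-phase band-pass
    bn, an = iirnotch(w0=mains, Q=30.0, fs=fs)  # power-supply notch
    return filtfilt(bn, an, x, axis=-1)
```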
With further reference to Figure 1, the model 104 comprises a main long shortterm memory (LSTM) network 106, an auxiliary LSTM network 108 and a classifier layer 110. The model 104 is configured to output a predicted cognitive workload level of a vehicle driver in response to input of the temporal series information over a plurality of time steps. Via model 104, the method 100 (as also reflected in Figure 3) employs a sequence-to-sequence learning paradigm. The learning paradigm comprises, at each time step:
- step 106' - updating the main LSTM network 106 to map the temporal series information to a sequence of hidden states
- step 108' - updating the auxiliary LSTM network 108 to generate weights for the main LSTM network
- step 110' - obtaining the predicted cognitive workload level by processing the hidden states and the weights through a classifier layer of the model.
The order of the steps may be changed. For example, step 108' may be performed before step 106'.
To make use of the model 104, the driver cognitive load recognition is formulated as a supervised classification problem. In this problem, the workload levels are adopted as labels, as shown in Figure 1. For data capture from a simulated environment, the tasks users/drivers are asked to perform will be assigned a predetermined workload level based on an anticipated difficulty or amount of attention required to successfully complete the task. For data capture in a real world environment, an EEG measurement system may be mounted on a user/driver and the user/driver may be in a vehicle with one or more image capture devices and sensors mounted to capture images (e.g. video feed) of the driving environment, vehicle parameters - e.g. vehicle speed - and/or eye movements of the user/driver. The images may be cross-referenced to simulated tasks or otherwise processed to ascertain workload levels at multiple time intervals while driving. For illustration purposes, the present discussion will be made with reference to a simulated environment, but it will be appreciated that the same or similar teachings may be employed in respect of a real world environment. In this embodiment, the temporal series information is multimodal, comprising electroencephalogram (EEG) signals, eye movements and vehicle states. Consequently, the dataset is given as:
$$\mathcal{D} = \left\{ \left( X_j^l,\, y_j \right) \right\}_{j=1}^{N} \tag{1}$$

where X_j^l denotes the multimodal temporal sequences of size l, y is the corresponding driver cognitive workload, which is categorized into three levels, i.e., slight (y = 0), moderate (y = 1) and intensive (y = 2), j indexes the j-th sample, and N is the total number of samples. For each sample:

$$X^l = \left[ x_1, x_2, \ldots, x_l \right] \tag{2}$$

wherein x represents the feature vectors across the time steps, with x_i being the feature vector at the i-th time step. The features may describe one or more relationships between the multi-modal input data and driver cognitive load. The temporal series information consists of EEG signals, eye movements and vehicle motion states, denoted by X^l = [W^l, M^l, V^l] and x = [w, m, v], respectively. Specifically, w = [w_δ, w_θ, w_α, w_β] is the power of four typical EEG frequency bands, i.e., delta (1-4 Hz), theta (4-8 Hz), alpha (8-13 Hz) and beta (13-30 Hz); greater, fewer or different frequency bands may be used. m = [m_cx, m_cy, m_sx, m_sy] is the horizontal/vertical coordinates and speeds of the eye gaze; in some embodiments only the coordinates, or only the speeds, may be provided. v = [Δv_x, Δv_y, Δa_x, Δa_y, Δv, γ] represents the instantaneous longitudinal and lateral velocities/accelerations of the vehicle with respect to the front one, the relative resultant velocity and the yaw rate, respectively. In some embodiments, only a proper subset of these quantities is required for assessing cognitive workload. Accordingly, x ∈ ℝ^{Dim}, where Dim is the total dimension from concatenating all feature vectors in x; in the embodiment given above, Dim is 14 (w (4), m (4) and v (6)).

The model then learns to generate the corresponding workload level ŷ based on the temporal series information X^l:

$$\hat{y} = f\left( X^l \right), \qquad \hat{y} \in \mathcal{Y} \tag{3}$$

where 𝒴 = {0, 1, 2} is the set of cognitive workload level labels.
With reference to Figure 1, the model itself comprises a main LSTM 106 that maps the temporal series information 108 to a sequence of hidden states 110. The model 104 further comprises at least one auxiliary LSTM network 112 and a classifier layer 114. In some embodiments the model further comprises an attention mechanism 116. The number of HyperLSTMs may correspond to the number of modal inputs - for example, three or four HyperLSTMs will be used for three or four modal inputs, respectively.
As shown, the model 104 comprises a plurality of auxiliary LSTM networks, herein referred to as hyper long short-term memory (HyperLSTM) based modules, each associated with a main LSTM, marked 106, 118, 120 for input modes EEG, eye movements and vehicle performance, respectively. Each HyperLSTM module comprises an LSTM network and a HyperLSTM network. The number of HyperLSTM networks or modules may be the same as the number of weights or hyperparameters to be dynamically learned. For example, for each input mode the hyperparameters may be the standard cell, input, output and forget gate values.
The HyperLSTM, a variant of HyperNetworks, is an auxiliary LSTM network that is designed to dynamically learn hyperparameters, i.e., the weights of each main LSTM cell at each time step. The HyperLSTM-based module is a dual-network architecture that jointly captures time-series feature representations and adapts itself through dynamic hyperparameter learning from the multimodal information. This joint capturing of information assists with managing data variability among individual drivers. The dual-network architecture involves the HyperLSTM output being fed to the LSTM, in each HyperLSTM module.
Regarding using the HyperLSTM-based module for mapping EEG signals (the description for other input modes is the same, but with the feature vector for the relevant input mode - so
Figure imgf000013_0004
refers to the hidden state at time I in temporal series k where, for the input modes mentioned above, k is one of w, m and v): given EEG temporal sequences W1, for all t ∈ {1,2,...,/}, the main LSTM network maps the time-series information to a sequence of hidden states
Figure imgf000013_0006
via following updates per: (4)
Figure imgf000013_0001
where δ and tanh are the sigmoid function and hyperbolic tangent function, respectively, Θ represents the element-wise product,
Figure imgf000013_0005
Figure imgf000013_0003
are parameters generated by the HyperLSTM, * denotes one of {/,f,o,c} gates, Nh is the hidden size of the main LSTM network and Nw = 4 is the number of EEG features. The update of the main LSTM network is denoted as:
Figure imgf000013_0002
(5) The input of the HyperLSTM network is the concatenation of the hidden state from
Figure imgf000014_0006
(the principle of the main LSTM network, with reference to equations (4) and (5)) and EEG signals wt :
Figure imgf000014_0001
Similarly, updates of the HyperLSTM can be described as:
Figure imgf000014_0002
(7)
In the described methodology, weights are functions of a set of embeddings, where the embeddings are linear projections of the hidden states of the relevant auxiliary LSTM network. More formulaically, weight matrices $W_*$, $I_*$ and $b_*$ are functions of a set of embeddings $z_{W_*}$, $z_{I_*}$ and $z_{b_*}$, respectively, which are linear projections of the hidden states of HyperLSTM cells:

$$
\begin{aligned}
z_{W_*} &= L_{W_*} \hat{h}_t + p_{W_*}\\
z_{I_*} &= L_{I_*} \hat{h}_t + p_{I_*}\\
z_{b_*} &= L_{b_*} \hat{h}_t + p_{b_*}
\end{aligned}
\tag{8}
$$

where $z_{W_*}, z_{I_*}, z_{b_*} \in \mathbb{R}^{N_z}$. $N_z$ can be set to any desired value, based on memory usage requirements. For example, $N_z$ can be set to 16 to reduce the memory usage required by the ARecNet. $N_{\hat{h}}$ is the hidden size of the HyperLSTM. At each time step, the weight matrices of main LSTM cells are dynamically formulated. The dynamic formulation may follow:

$$
W_* = \left\langle T_{W_*},\, z_{W_*} \right\rangle, \qquad I_* = \left\langle T_{I_*},\, z_{I_*} \right\rangle, \qquad b_* = T_{b_*} z_{b_*} + b_{*,0}
\tag{9}
$$

where $\langle \cdot, \cdot \rangle$ denotes the tensor dot product. Accordingly, the last hidden state of the main LSTM, i.e. $h_I^w$ (106"), is obtained as the representation of the EEG information. Similarly, learning representations of eye movements and vehicle states are mapped as $h_I^m$ and $h_I^v$, respectively, the last hidden states of the respective main LSTMs being labelled 108" and 110", respectively. For this reason, the input modes corresponding to representations $w$, $m$ and $v$ have been replaced with $k$ in various formulae, to indicate that $k$ may be any one of the input modes.
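By way of illustration only, the per-time-step interplay of equations (4)-(9) can be sketched in PyTorch as follows. This is a minimal sketch under our own assumptions: the class name HyperLSTMModule, the sharing of one embedding per weight matrix across the four gates, and all sizes are illustrative choices rather than the patent's.

```python
import torch
import torch.nn as nn

class HyperLSTMModule(nn.Module):
    """One HyperLSTM-based module: an auxiliary (hyper) LSTM generates the
    weights of the main LSTM cell at every time step, cf. equations (4)-(9)."""

    def __init__(self, n_feat, n_h, n_hyper=32, n_z=16):
        super().__init__()
        self.n_h = n_h
        # Auxiliary LSTM; its input is [h_{t-1}; x_t] (equation (6)).
        self.hyper = nn.LSTMCell(n_h + n_feat, n_hyper)
        # Linear projections of the hyper hidden state to embeddings z (equation (8)).
        self.to_z = nn.Linear(n_hyper, 3 * n_z)
        # Tensors turning embeddings into main-cell weights via a tensor
        # dot product (equation (9)); the four gates are stacked along dim 1.
        self.T_W = nn.Parameter(0.01 * torch.randn(n_z, 4 * n_h, n_h))
        self.T_I = nn.Parameter(0.01 * torch.randn(n_z, 4 * n_h, n_feat))
        self.T_b = nn.Parameter(0.01 * torch.randn(n_z, 4 * n_h))

    def forward(self, x):                       # x: (batch, time, n_feat)
        b, T, _ = x.shape
        h = x.new_zeros(b, self.n_h)
        c = x.new_zeros(b, self.n_h)
        hh = x.new_zeros(b, self.hyper.hidden_size)
        hc = torch.zeros_like(hh)
        for t in range(T):
            xt = x[:, t]
            # Update the auxiliary LSTM (equation (7)).
            hh, hc = self.hyper(torch.cat([h, xt], dim=-1), (hh, hc))
            z_w, z_i, z_b = self.to_z(hh).chunk(3, dim=-1)
            # Dynamically formulated weights of the main cell (equation (9)).
            W = torch.einsum('bz,zoh->boh', z_w, self.T_W)
            I = torch.einsum('bz,zof->bof', z_i, self.T_I)
            bias = torch.einsum('bz,zo->bo', z_b, self.T_b)
            gates = (torch.einsum('boh,bh->bo', W, h)
                     + torch.einsum('bof,bf->bo', I, xt) + bias)
            i, f, o, g = gates.chunk(4, dim=-1)
            # Main LSTM update (equation (4)).
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(c)
        return h                                 # last hidden state, e.g. h_I^w
```

For example, `HyperLSTMModule(n_feat=4, n_h=64)(torch.randn(8, 100, 4))` would map a batch of 100-step, four-feature EEG sequences to their last hidden states $h_I^w$.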
In some embodiments, the outputs of the last hidden state for each input mode, 106", 108", 110", are given to the classifier layer 114. To achieve this, the outputs may be concatenated. However, in the embodiment shown in Figure 1, each learning or feature representation (collectively 122) obtained by a HyperLSTM-based module ($h_I^k$ in equation (10)) undergoes an equidimensional projection through a fully connected layer 124 ($W_z^k$ in equation (10)) - i.e. the input dimensions are the same, and the output dimensions are consistent with the input dimensions. The results are concatenated at 126 as:

$$h = \left[W_z^w h_I^w;\; W_z^m h_I^m;\; W_z^v h_I^v\right]\tag{10}$$

wherein $W_z^k \in \mathbb{R}^{N_h \times N_h}$ are parameters to be learned and $h \in \mathbb{R}^{3 \times N_h}$ stacks the three projected representations.
Then, similarity scores of the feature representations of different information sources are computed using an attention matrix 128, formulated as:

$$M_{att} = \mathrm{softmax}\!\left(\frac{h\, h^{\top}}{\sqrt{N_h}}\right)\tag{11}$$

where the softmax function is utilized to enhance useful representations through increasing their scores automatically. $M_{att}$ can be regarded as a weight matrix. $M_{att}$ and the representations 122 are multiplied, as shown by the connecting path between the two in Figure 1. The hidden states are integrated by an integration layer, presently a max pooling layer 130, such that:

$$h_{att} = \mathrm{maxpool}\!\left(M_{att}\, h\right)\tag{12}$$

where $h_{att} \in \mathbb{R}^{N_h}$ denotes the attention-based hidden state with strengthened feature representations. To improve the training stability, $h_{att}$ may be normalized (at layer 132) as:

$$h_{att} = \mathrm{layernorm}(h_{att})\tag{13}$$
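A compact sketch of the fusion path of equations (10)-(13) follows; the class name and the scaled dot-product form of the similarity score are our assumptions, with the representations stacked one row per modality so that $M_{att}$ weighs the modalities against each other.

```python
import math
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse per-modality representations via an attention matrix, cf. (10)-(13)."""

    def __init__(self, n_h, n_modes=3):
        super().__init__()
        # Equidimensional projections, one per modality (equation (10)).
        self.proj = nn.ModuleList(nn.Linear(n_h, n_h) for _ in range(n_modes))
        self.norm = nn.LayerNorm(n_h)            # equation (13)
        self.n_h = n_h

    def forward(self, reps):                     # list of (batch, n_h) tensors
        h = torch.stack([p(r) for p, r in zip(self.proj, reps)], dim=1)
        # Similarity scores between information sources (equation (11)).
        m_att = torch.softmax(h @ h.transpose(1, 2) / math.sqrt(self.n_h), dim=-1)
        # Weight the representations, then integrate by max pooling (equation (12)).
        h_att = (m_att @ h).max(dim=1).values
        return self.norm(h_att)
```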
To obtain the probability of each label, the ARecNet performs a nonlinear projection through a classifier layer 114:

$$\hat{y}_t = \mathrm{softmax}\!\left(W_2\, \mathrm{ReLU}\!\left(W_1 h_{att} + b_1\right) + b_2\right)\tag{14}$$

where $W_1$, $W_2$, $b_1$ and $b_2$ are parameters to be learned. Eventually, the predicted cognitive workload level $\hat{y}_t$ is obtained, being either 0, 1 or 2, corresponding to slight, moderate or intensive cognitive workload. The classifier layer 114 may have any appropriate architecture. In some embodiments, the classifier layer comprises a fully connected layer followed by a softmax activation layer that produces $\hat{y}_t$. The classifier layer 114 shown in Figure 1 further comprises a fully connected layer and a rectified linear unit, the output of which is fed into a second fully connected layer and from there into the softmax layer.
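The two-layer classifier head described above might be sketched as follows; the hidden width is an assumed value. In practice the softmax is often folded into the cross-entropy loss for numerical stability, in which case the final activation would be omitted.

```python
import torch.nn as nn

n_h = 64                          # assumed hidden width
classifier = nn.Sequential(
    nn.Linear(n_h, n_h),          # first fully connected layer
    nn.ReLU(),                    # rectified linear unit
    nn.Linear(n_h, 3),            # second fully connected layer: labels 0, 1, 2
    nn.Softmax(dim=-1),           # label probabilities of equation (14)
)
```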
During training, instead of relying on a single label, a sequence-to-sequence (Seq2Seq) learning paradigm is employed. For the given dataset $D$, an optimized cross-entropy loss function is adopted:

$$\mathcal{L} = -\frac{1}{|D|} \sum_{(x, y) \in D} \frac{1}{I} \sum_{t=1}^{I} \log p\!\left(y \mid x_{\le t}\right)\tag{15}$$

where $x_{\le t} = [x_1, x_2, \dots, x_t]$ denotes the subsequence of $x$. In addition to encouraging the ARecNet to extract feature representations from early observations, the given loss function reduces the possibility of overfitting when the current information is insufficient for recognition.
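A direct, if inefficient, reading of equation (15) can be sketched as below; the function name and the assumption that the model maps a prefix to class logits are ours. Re-encoding every prefix is quadratic in the sequence length; a practical implementation would instead reuse the per-step hidden states already produced by the recurrent model.

```python
import torch
import torch.nn.functional as F

def seq2seq_loss(model, x, y):
    """Cross-entropy averaged over all input prefixes x_{<=t} (equation (15)).

    Assumes model maps a (batch, t, features) prefix to (batch, 3) logits,
    and y holds integer workload labels of shape (batch,)."""
    T = x.shape[1]
    losses = [F.cross_entropy(model(x[:, : t + 1]), y) for t in range(T)]
    return torch.stack(losses).mean()
```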
The cognitive workload cannot be observed directly, resulting in inherent uncertainty in its label. Therefore, a regularization technique, namely label smoothing, is introduced to improve the model generalization ability. Label smoothing may be performed using any appropriate method. For example, label smoothing may comprise assigning the real cognitive workload label a probability that penalises overconfident predictions, e.g. the real cognitive workload label $y_j$ may be assigned a probability $1 - \epsilon$, while the probability of each other label is replaced by

$$\frac{\epsilon}{k - 1}$$

wherein the tunable parameter $\epsilon$ is set to 0.1 and $k = 3$ is the number of labels (i.e. 0, 1, 2).
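A sketch of this smoothing scheme (assuming, as reasoned above, that the remaining probability mass $\epsilon$ is spread uniformly over the $k - 1$ other labels):

```python
import torch

def smooth_labels(y, k=3, eps=0.1):
    """Soft targets: 1 - eps on the true label, eps / (k - 1) on the others."""
    target = torch.full((y.shape[0], k), eps / (k - 1))
    target.scatter_(1, y.unsqueeze(1), 1.0 - eps)
    return target

# smooth_labels(torch.tensor([0, 2])) ->
# tensor([[0.9000, 0.0500, 0.0500],
#         [0.0500, 0.0500, 0.9000]])
```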
In experiments, data was collected from fourteen participants (10 males, 4 females) with varied ages and driving experience. Data collection was performed using the system described with reference to Figure 2. The simulated driving environment was a three-lane expressway with several stationary vehicles distributed randomly in the three lanes. For all driving scenarios, the primary objective was to drive along a straight road and avoid stationary vehicles. Participants were asked to maintain the vehicle speed within a predetermined speed range - e.g. 80-90 km/h. This ensures a consistent workload level throughout driving. Obstacles with various paint colours are placed at regular intervals, and vehicle paint is classified into two categories, namely dark colours (black, navy, sepia) and bright colours (white, red, yellow). The driving tasks at the different cognitive workload levels are described below:
Slight level: Only the primary task is required to be accomplished, i.e., participants need to avoid other vehicles and reach the destination.

Moderate level: In addition to avoiding all obstacles, participants need to recall the colour category of the previous obstacle and press the corresponding button as they drive past a new one.

Intensive level: Apart from the primary and visual tasks, participants also need to listen to a pre-recorded series of 15 letters separated by approximately 4 second intervals and count the number of times two identical letters appeared in pairs in the sequence, e.g., "H, H".
Audio stimuli are present in all experiments so that their effect on EEG signals is consistent across conditions; however, only participants at the intensive workload level react to them. To ensure no single lane was free of obstacles for an extended stretch of road, a custom-defined discrete distribution for obstacle locations is employed:

[Equation (16) is not legible in the source extraction.]

where $K_T$ is the distance between the current obstacle and the previous one in lane $T$, and $IntervalSize$ denotes the distance between two adjacent obstacles. Moreover, both visual and audio stimuli are regenerated randomly in each experiment to rule out human memory effects.
The driver workload dataset, extracted as set out with reference to Figure 2, can be migrated both to the performance evaluation of other recognition approaches and to extended studies involving cognitive workload, such as driving authority allocation and takeover strategy design. In human-machine cooperative driving, the driver's authority is usually determined from the driver's state. For a very high driver workload, the driving authority can be zero; consequently, to ensure safety, the driver's inputs will not be executed. Generally, the range of driver authority can be [0, 1], where 0 indicates that the vehicle has been taken over by the machine, and 1 means that the vehicle is completely controlled by the human.

The dataset is multimodal, presently containing three types of information, i.e., EEG signals, eye gaze and vehicle states. This facilitates identification of a workload response on one channel (mode) where that response is not evident on another channel (mode). The dataset contains multiple scenarios - lighting conditions, colours, speeds, obstacles, audio and/or visual tasks. The dataset can, for example, reveal the influence of varied visibility on driver workload recognition. The dataset is also multi-sensory. For example, visual-auditory mixed stimuli can be used, requiring drivers to respond to visual information, which reflects visually-induced cognitive workload in the real world, and audio stimuli, which reflect auditory activities such as voice navigation and phone calls during practical driving.
Pre-processing is performed, as set out with reference to Figure 2, to remove artifacts from the data. By identifying artifacts, an activity power spectrum of three typical components and the corresponding scalp topographies can be produced as shown in Figure 3. The artifact in image (a) of Figure 3 is produced by eye activities such as blinks, in which high power at low frequencies is concentrated close to the eyes. Muscle artifacts are also evident, as shown in image (b) of Figure 3, the muscle artifact having relatively high power at high frequencies (20-30 Hz) with a localized distribution on the scalp topography. Image (c) of Figure 3 represents the normal component generated by brain-related activities. Image (c) is therefore adopted to calculate the power of various frequency bands. The power may be calculated through:

$$w_{\Psi} = \sum_{j \in \Omega} \int_{f_l}^{f_u} S_j(f)\, df\tag{17}$$

where $w_{\Psi}$, with $\Psi \in \{\delta, \theta, \alpha, \beta\}$, is the power of the corresponding EEG band, $\Omega$ represents the EEG channels, $f_l$ and $f_u$ are the lower and upper frequency limits of the $\Psi$-band, and $S_j(f)$ is the power spectral density (PSD) of the $j$th channel, which is calculated using the fast Fourier transform (FFT) with a Hamming window. All features are uniformly resampled to 10 Hz, i.e., $I = 10\, t_w$, and normalized (z-score) to the same scale. Also, the input sequences are extracted using a sliding window with 90% overlap to augment training samples.
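For illustration, the band powers of equation (17) might be computed as follows; Welch's method stands in here for the plain FFT-based PSD, and the band edges, sampling rate and function name are our assumptions.

```python
import numpy as np
from scipy.signal import welch

BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

def band_powers(eeg, fs=256):
    """Sum of per-channel band power, cf. equation (17).

    eeg: array of shape (channels, samples); fs and band edges are illustrative."""
    f, psd = welch(eeg, fs=fs, window="hamming", axis=-1)  # PSD per channel
    powers = {}
    for name, (fl, fu) in BANDS.items():
        mask = (f >= fl) & (f < fu)
        # Integrate the PSD over the band, then sum across channels Omega.
        powers[name] = np.trapz(psd[:, mask], f[mask], axis=-1).sum()
    return powers
```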
The recognition performance of the present methodology can be evaluated through various metrics, including average accuracy ($G_{ave}$), precision ($Pr$), recall ($Re$) and F1 score, which are formulated as:

$$G_{ave} = \frac{tp + tn}{tp + tn + fp + fn}\tag{18}$$

$$Pr = \frac{tp}{tp + fp}, \qquad Re = \frac{tp}{tp + fn}, \qquad F1 = \frac{2 \cdot Pr \cdot Re}{Pr + Re}\tag{19}$$

where $tp$, $tn$, $fp$ and $fn$ represent true positives, true negatives, false positives and false negatives, respectively.
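These metrics translate directly into code, for example:

```python
def recognition_metrics(tp, tn, fp, fn):
    """Average accuracy, precision, recall and F1 score (equations (18)-(19))."""
    g_ave = (tp + tn) / (tp + tn + fp + fn)
    pr = tp / (tp + fp)
    re = tp / (tp + fn)
    f1 = 2 * pr * re / (pr + re)
    return g_ave, pr, re, f1
```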
To test performance enhancement over known methods, a comparative study was performed between the present methodology and previous learning-based methods, namely MTS-CNN (a variant of a CNN-based architecture), DecNet (a variant of an LSTM-based network), a CNN-LSTM model, and m-HyperLSTM (a variant of HyperNetworks which uses only one HyperLSTM-based module).
The present methodology can effectively capture and strengthen useful time-series feature representations through HyperLSTM-based modules and a cross-attention mechanism. The designed model was trained using an Adam optimizer, with a desired learning rate - e.g. 0.001. The batch size is selected to obtain a trade-off between the training time and the model generalization ability (e.g. a batch size of 64).
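The quoted training configuration might look as follows in PyTorch; `arecnet` and `train_set` are placeholders for the assembled model and the driver workload dataset, and `seq2seq_loss` refers to the sketch given after equation (15).

```python
import torch

optimizer = torch.optim.Adam(arecnet.parameters(), lr=0.001)  # learning rate from the text
loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

for x, y in loader:
    optimizer.zero_grad()
    loss = seq2seq_loss(arecnet, x, y)   # prefix-averaged loss of equation (15)
    loss.backward()
    optimizer.step()
```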
The recognition accuracies and standard deviations of the present methodology with varied time-series information and historical horizons under typical driving scenarios are shown in Figure 5. Based on recognition accuracy, vehicle states are clearly a weaker indicator of cognitive workload than physiological and visual information. Since extra mental workload is generally required to ensure safe driving with decreased visibility, classification becomes more difficult as the average cognitive workload increases (e.g. in rough inverse proportion to visibility). By comparison, the multimodal information fusion-based ARecNet has a relatively stable recognition performance in varied environments, and has a lower standard deviation in most cases, indicating that the multimodal information fusion-based ARecNet has better stability.
The recognition results of the multimodal information fusion-based ARecNet with different historical horizons in typical driving scenarios are shown in Figure 6, in which each confusion matrix displays the average result of five-fold cross validation. In each set of confusion matrices, (a) sunny noon, (b) foggy dusk and (c) rainy night, the rows (top to bottom) correspond to slight, moderate and intensive cognitive workload, the cells give the classification accuracy of each workload level with respect to the predicted values, and the rightmost column shows the recognition results with respect to the ground-truth label. Recognition accuracy increases with extended historical horizons.
Figure 7 shows precision-recall curves illustrating the influence of different decision thresholds $\eta$ on each workload level (historical horizon $t_w$ = 1 s). The macro-average curves are nearly the same in all of images (a), (b) and (c), indicating superior comprehensive recognition performance of the present methodology in varied weather conditions. The optimal threshold point lies at the tangent of the corresponding precision-recall curve and the F1-score curve.
In experiments against the known workload recognition models mentioned above, the performance of m-HyperLSTM universally surpassed the LSTM-based models, and the performance of the DecNet and CNN-LSTM models was close to m-HyperLSTM in some specific situations - this suggests the adaptive module can capture feature representations more effectively than static ones. The F1 scores of the present methodology were significantly higher than those for m-HyperLSTM, especially with the historical horizon $t_w$ = 1 s, for which the increase is at least 3.32%. These phenomena indicate the superiority of the decision-level fusion architecture of the present disclosure.
In ablation experiments, the effects of HyperLSTM and cross-attention within the ARecNet were tested. In this regard, a variant was also tested that lacks an attention mechanism (herein referred to as RecNet). The results are shown in Table I.
                            Variants                              ARecNet
HyperLSTM              x               x               √              √
Cross attention        x               √               x              √

Driving scenario (high visibility): Sunny noon
tw = 1 s        0.832 (↓4.81%)  0.843 (↓3.55%)  0.877 (↑0.34%)  0.874
tw = 2 s        0.879 (↓4.66%)  0.892 (↓3.25%)  0.918 (↓0.43%)  0.922
tw = 4 s        0.882 (↓7.16%)  0.911 (↓4.11%)  0.945 (↓0.53%)  0.950

Driving scenario (medium visibility): Foggy dusk
tw = 1 s        0.772 (↓5.04%)  0.793 (↓2.46%)  0.817 (↑0.49%)  0.813
tw = 2 s        0.836 (↓4.89%)  0.856 (↓2.62%)  0.874 (↓0.57%)  0.879
tw = 4 s        0.842 (↓8.48%)  0.884 (↓3.91%)  0.913 (↓0.76%)  0.920

Driving scenario (low visibility): Rainy night
tw = 1 s        0.756 (↓3.69%)  0.772 (↓1.66%)  0.782 (↓0.38%)  0.785
tw = 2 s        0.787 (↓4.61%)  0.801 (↓2.91%)  0.816 (↓1.09%)  0.825
tw = 4 s        0.793 (↓8.11%)  0.831 (↓3.71%)  0.853 (↓1.16%)  0.863

Table I: ablation study on the recognition accuracy of HyperLSTM and cross attention with varied historical horizons in different driving scenarios. Arrows show the change relative to ARecNet; the third variant (HyperLSTM without cross attention) is the RecNet.
HyperLSTM greatly improved the model performance in all cases, further indicating its superior feature capturing ability compared to conventional LSTM models. The influence of the cross-attention mechanism remains inconspicuous in Table I, especially for the HyperLSTM-based models. A paired t-test was therefore employed to determine the statistical significance of cross attention, with five-fold cross validation performed 50 times with the same sequence of random seeds, and the statistical results presented in Figure 8 - a single asterisk (*) and double asterisks (**) represent p values lower than 0.05 and 0.01, respectively. Cross attention provided no statistically significant improvement at a 1 s historical horizon, but provided significantly better performance for longer horizons - e.g. 4 s. These results demonstrate that the cross-attention mechanism is better at strengthening useful learning representations of longer-sequence multimodal information.
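Such a paired comparison can be reproduced with scipy, for example; the accuracy arrays below are synthetic stand-ins for the 50 paired cross-validation runs.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
# Synthetic stand-ins for per-run accuracies; identical random seeds make
# the two conditions pair off run by run, as in the described protocol.
acc_with_attention = 0.86 + 0.01 * rng.standard_normal(50)
acc_without_attention = 0.85 + 0.01 * rng.standard_normal(50)

t_stat, p_value = ttest_rel(acc_with_attention, acc_without_attention)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")  # * p<0.05, ** p<0.01
```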
Embodiments of the present methodology seek to address the two limitations of most previous driver workload recognition models in practical applications, namely single-modality indicators, and time-series signal distortion. The present, decision-level multimodal information fusion architecture can employ a cross-attention mechanism to strengthen useful feature representations captured by the HyperLSTM-based module from individual information sources. Experimental results demonstrate that the proposed models are advantageous over other baseline approaches in terms of recognition accuracy and robustness.
In addition to driver workload estimation, the data collection methodology provides a generic driver monitoring framework for advanced driving assistance systems (ADAS). With a minor alteration of the structure of the model, e.g. adding or removing HyperLSTM modules (each comprising one or more HyperLSTM networks and a main LSTM network) according to the number of information sources or input modes, such as road types and traffic condition monitoring, it can be utilized for driver distraction/fatigue detection. Meanwhile, accurate driver state recognition can provide a decision-making basis for the mutual takeover of drivers and vehicles, which is beneficial to other ADAS technologies such as lane departure warning systems, traffic jam assistant systems, and others.
The present framework can also be extended into specific application fields involving multiple biosensors, for example, athlete health stress estimation and air traffic controller states monitoring, etc.
An end-user computing device or system referred to in this disclosure comprises a smartphone device, a tablet device, a laptop device or the like that is used by an end user to train a supervised learning model for cognitive workload recognition, or to implement that model for real-time cognitive workload recognition. Computing device 900 of Figure 9 illustrates a schematic diagram of one such device. The device 900 comprises one or more processing units 910 with access to one or more pre-processors 902 (if used) for pre-processing input data - e.g. from EEG, vehicle behaviour and/or eye tracking. The device 900 has a communication channel to camera/external device(s) 904 for collecting input data (e.g. image capture device(s), vehicle monitoring sensors for speed, yaw and others, an EEG system, etc.), auxiliary network module(s) 906 each comprising auxiliary network(s) 908 and a main LSTM network 910, an attention mechanism 912 and a classifier layer 914 (the term "classifier layer" may be used to refer to a single layer or multiple layers in a machine learning network, depending on the context used herein).
The external device(s) 904 may be integral with, or unitary with, system 900, or may be separate.
System 900 may be in communication (e.g. over network 918) with one or more server systems 916 that serve as a back-end system for an application executing on the system 900. For example, in embodiments where system 900 is a smartphone, the server system 916 may be a backend application server of a relevant application for input data evaluation executing on the system 900. The server system 916 may transmit code or information to the system 900 and may receive information from system 900 obtained after pre-processing input data captured by external device(s) 904.
The code running the methodology, and/or input data whether before or after pre-processing, may be stored in memory 920.
It will be appreciated that many further modifications and permutations of various aspects of the described embodiments are possible. Accordingly, the described aspects are intended to embrace all such alterations, modifications, and variations that fall within the spirit and scope of the appended claims.
Throughout this specification and the claims which follow, unless the context requires otherwise, the word "comprise", and variations such as "comprises" and "comprising", will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps. The reference in this specification to any prior publication (or information derived from it), or to any matter which is known, is not, and should not be taken as an acknowledgment or admission or any form of suggestion that that prior publication (or information derived from it) or known matter forms part of the common general knowledge in the field of endeavour to which this specification relates.

Claims
1. A method to train a supervised learning model for cognitive workload recognition by employing a sequence-to-sequence learning paradigm, wherein the model comprises a main long short-term memory (LSTM) network, an auxiliary LSTM network and a classifier layer; wherein the model is configured to output a predicted cognitive workload level of a user in response to input of temporal series information related to the user over a plurality of time steps by, at each said time step: updating the main LSTM network to map the temporal series information to a sequence of hidden states; updating the auxiliary LSTM network to generate weights for the main LSTM network; and obtaining the predicted cognitive workload level by processing the hidden states and the weights through the classifier layer.
2. The method of claim 1, wherein the temporal series information is multimodal, the model comprising, for each input mode, a main long short-term memory (LSTM) network, an auxiliary LSTM network and a classifier layer, wherein the hidden states from each main LSTM network are integrated and the classifier layer obtains the predicted cognitive workload level by processing the integrated hidden states and the weights.
3. The method of claim 2, wherein each input mode is a respective one of electroencephalogram (EEG) signals, eye movements and external states.

4. The method of claim 2 or 3, wherein the model comprises an attention mechanism for learning cross-attention parameters between input modes and integrating the hidden states using the cross-attention parameters.

5. The method of any one of claims 1 to 4, wherein updating the main LSTM network is according to

$$\left[h_t^k,\; c_t^k\right] = \mathcal{L}_{main}\left(h_{t-1}^k,\; c_{t-1}^k,\; k_t\right)$$

where $h_t^k$ refers to the hidden state at time $t$ in temporal series $k$, $k_t$ refers to the temporal series information at time $t$, and $c_t$ refers to a gate for selectively carrying information from the previous time step through a forget gate at time $t$.

6. The method of claim 5, wherein updating the auxiliary LSTM network is according to

$$\left[\hat{h}_t,\; \hat{c}_t\right] = \mathcal{L}_{hyper}\left(\hat{h}_{t-1},\; \hat{c}_{t-1},\; \hat{k}_t\right)$$

where $\hat{k}_t$ is the concatenation of the hidden state $h_{t-1}$ from $\mathcal{L}_{main}$ and $k_t$.

7. The method of any one of claims 1 to 6, wherein the weights are functions of a set of embeddings, wherein the embeddings are linear projections of the hidden states of the auxiliary LSTM network.

8. The method of any one of claims 1 to 7, wherein the classifier layer is a fully connected layer followed by a softmax activation.

9. The method of any one of claims 1 to 8, wherein employing the sequence-to-sequence learning paradigm comprises adopting a cross-entropy loss function according to

$$\mathcal{L} = -\frac{1}{|D|} \sum_{(x, y) \in D} \frac{1}{I} \sum_{t=1}^{I} \log p\left(y \mid x_{\le t}\right)$$

where $x_{\le t}$ represents the subsequence of the hidden states.

10. The method of any one of claims 1 to 9, comprising using label smoothing to improve generalization ability of the model.

11. The method of any one of claims 1 to 10, comprising using an Adam optimizer to train the model.

12. The method of any one of claims 1 to 11, wherein the user is a driver.

13. A system that trains a supervised learning model for cognitive workload recognition, wherein the system comprises a plurality of processors configured to train the model by employing a sequence-to-sequence learning paradigm, wherein the model comprises a main long short-term memory (LSTM) network, an auxiliary LSTM network and a classifier layer; wherein the model is configured to output a predicted cognitive workload level of a user in response to input of temporal series information related to the user over a plurality of time steps by, at each said time step: updating the main LSTM network to map the temporal series information to a sequence of hidden states; updating the auxiliary LSTM network to generate weights for the main LSTM network; and obtaining the predicted cognitive workload level by processing the hidden states and the weights through a classifier layer of the model.

14. The system of claim 12, wherein the temporal series information is multi-modal, the model comprising, for each input mode, a main long short-term memory (LSTM) network, an auxiliary LSTM network and a classifier layer, wherein the hidden states from each main LSTM network are integrated and the classifier layer obtains the predicted cognitive workload level by processing the integrated hidden states and the weights.

15. The system of claim 13, wherein each input mode is a respective one of electroencephalogram (EEG) signals, eye movements and external states.

16. The system of claim 13 or 14, wherein the model comprises an attention mechanism for learning cross-attention parameters between input modes and integrating the hidden states using the cross-attention parameters.

17. The system of any one of claims 12 to 15, wherein updating the main LSTM network is according to

$$\left[h_t,\; c_t\right] = \mathcal{L}_{main}\left(h_{t-1},\; c_{t-1},\; x_t\right)$$

where $h_t$ refers to the hidden state at time $t$, $x_t$ refers to the temporal series information at time $t$, and $c_t$ refers to a gate for selectively carrying information from the previous time step through a forget gate at time $t$.

18. The system of claim 16, wherein updating the auxiliary LSTM network is according to

$$\left[\hat{h}_t,\; \hat{c}_t\right] = \mathcal{L}_{hyper}\left(\hat{h}_{t-1},\; \hat{c}_{t-1},\; \hat{x}_t\right)$$

where $\hat{x}_t$ is the concatenation of the hidden state $h_{t-1}$ from $\mathcal{L}_{main}$ and $x_t$.

19. The system of any one of claims 12 to 17, wherein the weights are functions of a set of embeddings, wherein the embeddings are linear projections of the hidden states of the auxiliary LSTM network.
20. The system of any one of claims 12 to 18, wherein the classifier layer is a fully connected layer followed by a softmax activation.
21. The system of any one of claims 12 to 19, wherein employing the sequence-to-sequence learning paradigm comprises adopting a cross-entropy loss function according to

$$\mathcal{L} = -\frac{1}{|D|} \sum_{(x, y) \in D} \frac{1}{I} \sum_{t=1}^{I} \log p\left(y \mid x_{\le t}\right)$$

where $x_{\le t}$ represents the subsequence of the hidden states.
22. The system of any one of claims 12 to 20, wherein the processors are configured to use label smoothing to improve generalization ability of the model.
23. The system of any one of claims 12 to 22, wherein the processors are configured to use an Adam optimizer to train the model.
24. The system of any one of claims 12 to 23, wherein the user is a driver.