CN114209323A - Method for recognizing emotion and emotion recognition model based on electroencephalogram data

Method for recognizing emotion and emotion recognition model based on electroencephalogram data

Info

Publication number
CN114209323A
Authority
CN
China
Prior art keywords: emotion, spatial, data, time, space
Prior art date
Legal status
Pending
Application number
CN202210069138.1A
Other languages
Chinese (zh)
Inventor
陈益强
翁伟宁
谷洋
王记伟
Current Assignee
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202210069138.1A
Publication of CN114209323A

Classifications

    • A61B5/165 Evaluating the state of mind, e.g. depression, anxiety
    • A61B5/369 Electroencephalography [EEG]
    • A61B5/7203 Signal processing specially adapted for physiological signals, for noise prevention, reduction or removal
    • A61B5/7225 Details of analog processing, e.g. isolation amplifier, gain or sensitivity adjustment, filtering, baseline or drift compensation
    • A61B5/7264 Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems
    • A61B5/7267 Classification of physiological signals or data involving training the classification device
    • G06N3/045 Combinations of networks
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods
    • G06F2218/12 Classification; Matching

Abstract

An embodiment of the invention provides a method for recognizing emotion and an emotion recognition model based on electroencephalogram data. The emotion recognition model comprises: a spatial matrix construction module for generating a first spatial matrix from the user's electroencephalogram signal obtained in each of a plurality of time slices, yielding a plurality of first spatial matrices; a spatial feature extraction module for calculating, with an attention mechanism, attention weights for each row and each column of each of the first spatial matrices, and obtaining a plurality of second spatial matrices from those row and column attention weights; a spatiotemporal feature fusion module for extracting time-series correlation features among the second spatial matrices and obtaining a plurality of spatiotemporal characterization vectors from the second spatial matrices and the corresponding time-series correlation features; and an emotion recognition module for determining the user's emotion from the plurality of spatiotemporal characterization vectors.

Description

Method for recognizing emotion and emotion recognition model based on electroencephalogram data
Technical Field
The invention relates to the field of physiological data mining, in particular to psychological state detection, and more particularly to a method for recognizing emotion and an emotion recognition model based on electroencephalogram data.
Background
Physiological health detection based on wearable devices is a development focus of today's medical and health industries, and various wearable devices (e.g., fitness bracelets, smart watches, and blood pressure and blood glucose monitors) are widely applied to health management. Detection and management of mental health, however, remains largely blank, even though mental health is as important as physiological health and directly influences people's emotional and psychological states. Mental health detection spans medicine, psychology, data analysis, and other fields; through medical definition, physiological state analysis, computation over behavioral and physiological signals, and emotional state prediction and detection, it can provide technical support for applications such as user mental monitoring and early warning of poor mental states, making these technologies an important approach to mental health detection.
Emotion recognition is an important part of mental state detection; emotion is a mental state generated by interaction between an individual and the outside world. Multimodal physiological data can be used to compute emotional states, including physiological signals such as electroencephalogram signals, electromyographic signals, and skin resistance, as well as behavioral signals such as expressions and gestures. Among multimodal data, electroencephalogram signals have become a primary basis for computing emotion because they are difficult to disguise, directly related to emotion, and easy to acquire. The prior art focuses only on the temporal features (e.g., patent application publication No. CN112364697A) or the spatial features (e.g., patent application publication No. CN112990008A) of electroencephalogram signals, so the accuracy of emotion recognition is not high.
Disclosure of Invention
Therefore, the present invention is directed to overcoming the above-mentioned drawbacks of the prior art and providing a method for recognizing emotion and an emotion recognition model based on electroencephalogram data.
The purpose of the invention is realized by the following technical scheme:
according to a first aspect of the present invention, there is provided an emotion recognition model based on electroencephalogram data, comprising: the spatial matrix construction module is used for generating a first spatial matrix according to the electroencephalogram signals of the user obtained by each time slice in the plurality of time slices to obtain a plurality of first spatial matrices; the spatial feature extraction module is used for calculating attention weights of each row and each column of each first spatial matrix in the plurality of first spatial matrices by using an attention mechanism respectively, and obtaining a plurality of second spatial matrices according to the attention weights of each row and each column of each first spatial matrix; the time-space characteristic fusion module is used for extracting time sequence correlation characteristics among the second space matrixes and obtaining a plurality of time-space characterization vectors according to the second space matrixes and the corresponding time sequence correlation characteristics; and the emotion recognition module is used for determining the emotion of the user according to the plurality of space-time characterization vectors.
In some embodiments of the present invention, the first spatial matrix is generated according to spatial distribution of a plurality of electrodes for acquiring the electroencephalogram signal after data preprocessing is performed on the electroencephalogram signal of the user in the corresponding time slice, where the data preprocessing includes data filtering processing, and/or data de-artifact processing and/or data de-baselining processing, and a value in the first spatial matrix corresponding to the corresponding time slice is a channel variance of the electroencephalogram signal acquired by the corresponding channel in the time slice after the data preprocessing.
In some embodiments of the present invention, the second spatial matrix is obtained by multiplying each data in the corresponding first spatial matrix by the attention weight corresponding to the row in which the data is located and by the attention weight corresponding to the column in which the data is located.
In some embodiments of the invention, the spatial feature extraction module comprises: a first fully-connected network module comprising a first fully-connected network and configured to input the concatenation vector of the means of each row of data of the first spatial matrix into the first fully-connected network, and to apply a Softmax calculation to the network's output to obtain the attention weight of each row of the first spatial matrix; and a second fully-connected network module comprising a second fully-connected network and configured to input the concatenation vector of the means of each column of data of the first spatial matrix into the second fully-connected network, and to apply a Softmax calculation to the network's output to obtain the attention weight of each column of the first spatial matrix.
In some embodiments of the invention, the spatiotemporal feature fusion module comprises a stack of coding networks, and the input of each coding network is processed in turn by that network's self-attention mechanism layer, feedforward layer, and residual layer. The input of the first coding network is a plurality of spatial characterization sequences, each obtained by concatenating, in order, the data of each row of a second spatial matrix; the input of each subsequent coding network is the intermediate spatio-temporal characterization vector output by the previous coding network; and the last coding network outputs the final spatio-temporal characterization vectors.
In some embodiments of the invention, the self-attention mechanism layer is a unidirectional self-attention mechanism layer configured to: when calculating attention relationships among the sequences corresponding to the time slices, calculate the attention relationship between the current time slice's sequence and the sequences of the time slices before it, and the attention relationship of the current time slice's sequence with itself, but not the attention relationship between the current time slice's sequence and the sequences of the time slices after it.
In some embodiments of the invention, the emotion recognition model is trained by: acquiring a plurality of training samples, each comprising electroencephalogram signals collected from an experimenter over a plurality of time slices and an emotion label for each time slice; training the emotion recognition model with the plurality of training samples to output, for each training sample, the experimenter's emotion in each corresponding time slice; calculating a loss value from the emotions output for the corresponding training samples and the corresponding emotion labels; and updating the parameters of the spatial feature extraction module, the spatiotemporal feature fusion module, and the emotion recognition module with the loss value.
According to a second aspect of the present invention, there is provided a method of recognizing emotion, the method comprising: acquiring electroencephalogram signals of a user collected by an electroencephalogram acquisition device over a plurality of time slices; and inputting the user's electroencephalogram signals for the plurality of time slices into the emotion recognition model of the first aspect, outputting the user's emotion in each time slice.
In some embodiments of the invention, the emotion of each time slice is an instantaneous emotion of the user, and the method further comprises: determining the user's long-term emotion by soft voting, based on the user's instantaneous emotions over a plurality of time slices and the probability of each instantaneous emotion.
According to a third aspect of the present invention, there is provided an electronic apparatus comprising: one or more processors; and a memory, wherein the memory is configured to store executable instructions; the one or more processors are configured to implement the steps of the method of the second aspect by executing the executable instructions.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
FIG. 1 is a schematic block diagram of an emotion recognition model based on electroencephalogram data according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a process of emotion recognition by an emotion recognition model based on electroencephalogram data according to an embodiment of the present invention;
FIG. 3 shows the arrangement of EEG electrodes in the international standard 10-20 system and the corresponding spatial electrode matrix;
FIG. 4 is a data processing diagram of an emotion recognition model based on electroencephalogram data according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a unidirectional self-attention mechanism layer in an emotion recognition model based on electroencephalogram data according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail by embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As mentioned in the background section, the prior art focuses only on the temporal or spatial characteristics of the electroencephalogram signal, resulting in low emotion recognition accuracy. In the present invention, a first spatial matrix is constructed from the electroencephalogram signals, attention weights for each row and each column are calculated from the data of the first spatial matrix, and a second spatial matrix is computed. Despite individual differences, the row and column attention weights can thus reflect the spatial correlation of the current individual's electroencephalogram signals, yielding a second spatial matrix that more accurately reflects that individual's emotion. In addition, time-series correlation features are extracted from the second spatial matrices, and a plurality of spatiotemporal characterization vectors are obtained from the second spatial matrices and the corresponding time-series correlation features, so that the spatiotemporal characterization vectors account for both spatial and temporal correlation and the user's emotion is recognized more accurately.
According to an embodiment of the present invention, referring to fig. 1, the present invention provides an emotion recognition model based on electroencephalogram data, comprising: a spatial matrix construction module 10, a spatial feature extraction module 20, a spatiotemporal feature fusion module 30, and an emotion recognition module 40. To allow time-series correlation features to be extracted, the electroencephalogram signals of a plurality of predetermined time slices form the input data, which is processed in turn by the spatial matrix construction module 10, the spatial feature extraction module 20, the spatiotemporal feature fusion module 30, and the emotion recognition module 40 to obtain the emotion corresponding to each time slice. An exemplary acquisition and processing flow for electroencephalogram signals (also called electroencephalogram data) is shown in fig. 2: multi-dimensional data are collected with the multiple electrodes of the electroencephalogram acquisition device to obtain an electroencephalogram signal; the spatial matrix construction module 10 builds a first spatial matrix according to the spatial arrangement of the electrodes; the spatial feature extraction module 20 calculates a second spatial matrix based on a cross attention mechanism (i.e., obtains the second spatial matrix from the attention weights of each row and each column of each first spatial matrix); the spatiotemporal feature fusion module 30 extracts time-series correlation features and calculates spatiotemporal characterization vectors from the corresponding second spatial matrices; and the emotion recognition module 40 identifies the user's emotion from the spatiotemporal characterization vectors. For example, -1, 0, and 1 denote negative, neutral, and positive emotion, respectively.
For ease of understanding, the process of acquiring electroencephalogram signals with an electroencephalogram acquisition device is described first. There are many types of electroencephalogram acquisition devices; in general, the invention is applicable to processing electroencephalogram signals acquired by a device whose signal acquisition points (electrodes or other types of electroencephalogram sensors) are arranged in a specific spatial distribution. As an example, consider the electrode arrangement (i.e., the spatial distribution of electrodes) of an exemplary device following the international standard 10-20 system. The electrodes used for sampling the electroencephalogram signals are silver chloride electrodes, or electrodes comprising components of silver chloride and felt. The electrodes are soaked in physiological saline and then placed in contact with the user's scalp according to the spatial distribution specified by the international standard 10-20 system, so that each electrode sits on a channel defined by the standard and records the real-time electroencephalogram signal. The letters and numbers in the small circles in fig. 3a are the electrode names. The letters mean: F: frontal lobe; Fp: frontal pole; T: temporal lobe; O: occipital lobe; P: parietal lobe; C: central, or sensorimotor cortex; Z: zero, the midline between the left and right hemispheres. The numbers distinguish the corresponding electrodes: regions of the left hemisphere use odd numbers and regions of the right hemisphere use even numbers. More detailed electrode name meanings can be found in the published specification of the international standard 10-20 system.
According to an embodiment of the present invention, the spatial matrix constructing module 10 is configured to generate a first spatial matrix according to the electroencephalogram signals of the user obtained in each of the plurality of time slices, so as to obtain a plurality of first spatial matrices.
According to one embodiment of the invention, the first spatial matrix is generated, according to the spatial distribution of the electrodes used to acquire the electroencephalogram signal, after the user's electroencephalogram signal for the corresponding time slice has undergone data preprocessing, where the data preprocessing comprises data filtering, data artifact removal, and data baseline removal. This embodiment can at least achieve the following beneficial technical effects: the irrelevant signals comprise noise data, emotion-irrelevant data, and an environmental baseline that seriously interferes with the electroencephalogram signal; applying data filtering, artifact removal, and baseline removal better eliminates the influence of irrelevant or weakly relevant factors on emotion recognition and thus improves the accuracy of subsequent emotion recognition.
According to one embodiment of the invention, the data filtering step filters low- and high-frequency data out of the electroencephalogram signal with a band-pass filter, preserving data with frequencies between a first predetermined frequency and a second predetermined frequency. For example, the filtering algorithm keeps the 0.5 Hz-70 Hz band, which contains a large amount of emotion-related information, as the retained characteristic interval, and removes the large amount of irrelevant low-pass and high-pass noise in the remaining bands, reducing the influence of irrelevant noise data on emotion recognition.
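As an illustration only (the patent itself gives no code), such a band-pass step could be sketched in Python with SciPy; the sampling rate, filter order, and function names here are assumptions:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass_filter(eeg: np.ndarray, fs: float = 200.0,
                    low: float = 0.5, high: float = 70.0, order: int = 4) -> np.ndarray:
    """Keep the 0.5-70 Hz band; eeg has shape (channels, samples)."""
    nyq = 0.5 * fs
    b, a = butter(order, [low / nyq, high / nyq], btype="band")
    return filtfilt(b, a, eeg, axis=-1)  # zero-phase filtering, per channel
```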
According to one embodiment of the invention, the data artifact removal comprises channel normalization of the electroencephalogram signal, performed per acquisition channel. Channel normalization reduces the absolute values of the signal within a channel and reduces the influence of high-amplitude sawtooth artifacts on the low-amplitude, emotion-related electroencephalogram data. The artifact-removal computation is:

Ci = (CRi - CRmin) / (CRmax - CRmin);

where CRi denotes the ith sampling point of the corresponding channel, CRmin and CRmax denote the minimum and maximum values in the channel, and Ci denotes the channel value of the ith sampling point after normalization. This embodiment can at least achieve the following beneficial technical effect: the artifact removal reduces interference information in the electroencephalogram signal, for example the influence of electromyographic and electrooculographic signals on emotion recognition, thereby improving its accuracy.
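A minimal sketch of the per-channel min-max normalization above (NumPy; the array layout and the epsilon guard are assumptions):

```python
import numpy as np

def normalize_channels(eeg: np.ndarray) -> np.ndarray:
    """Ci = (CRi - CRmin) / (CRmax - CRmin), computed per channel; eeg: (channels, samples)."""
    cr_min = eeg.min(axis=-1, keepdims=True)
    cr_max = eeg.max(axis=-1, keepdims=True)
    return (eeg - cr_min) / (cr_max - cr_min + 1e-8)  # epsilon avoids 0/0 on flat channels
```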
According to one embodiment of the invention, the data baseline removal subtracts, from the electroencephalogram signal of each time slice, the mean of the baseline signals collected on the channels corresponding to the electroencephalogram signal. The approach to reducing environmental influence is to separate the continuous electroencephalogram signal into an experimental signal (the electroencephalogram signal of the corresponding time slices) and a baseline signal and take their difference. Owing to the particularities of electroencephalogram signals and emotional states, different testing-environment factors can shift the entire electroencephalogram signal, and different initial emotional states can influence the overall change of emotional potential, reducing the effectiveness of the overall feature representation and the accuracy of emotion recognition. The baseline-removal process therefore reduces the influence of environmental factors on emotion recognition.
Because an electroencephalogram signal collected from a user in real time and an electroencephalogram signal in a training sample differ in the baseline signals available for reference, corresponding baseline-removal processing can be set for each case.
In some embodiments of the present invention, for electroencephalogram signals collected in real time, baseline removal may use a pre-stored per-channel baseline-signal mean for the electroencephalogram acquisition device, or a baseline-signal mean calculated from the per-channel baseline signals collected by the device during a predetermined period before it contacts the user's scalp. The electroencephalogram acquisition device is, for example, a multichannel electroencephalogram acquisition sensor based on Emotiv Epoc+.
According to one embodiment of the invention, for a training sample, the data in the sample can be divided by time slice into a baseline signal and an electroencephalogram signal used for predicting emotion, according to the following formulas:

XS = {XB, XE}, XB = {XB1, ..., XBk1}, XE = {XEk1+1, ..., XEk};

where XS is the electroencephalogram signal sample of the corresponding time period, XB is the baseline signal within that period, XE is the electroencephalogram signal within that period (the experimental signal required for emotion recognition), k1 denotes the number of time slices of the baseline signal, and k is the total number of time slices. The baseline data are split into time slices of the same length as the experimental data for the baseline-removal calculation. Preferably, the baseline removal is computed as:

Xi = X'i - (1/k1) × Σ(j=1..k1) XBj, i ∈ (k1+1, k);

where XBj denotes the baseline signal of the jth time slice and X'i denotes the electroencephalogram signal of the ith time slice before baseline removal. By computing the per-channel mean of the baseline signal, the formula weakens the influence of environmental factors in the electroencephalogram signal, provides more accurate training data, and improves the prediction accuracy of the model.
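A sketch of this baseline subtraction under the formula above (NumPy; the array shapes are assumptions):

```python
import numpy as np

def remove_baseline(xe: np.ndarray, xb: np.ndarray) -> np.ndarray:
    """Xi = X'i - mean_j(XBj), applied to every experimental slice.
    xe: (k - k1, channels, slice_len)  experimental slices X'
    xb: (k1, channels, slice_len)      baseline slices XB
    """
    baseline_mean = xb.mean(axis=0)  # per-channel baseline mean, (channels, slice_len)
    return xe - baseline_mean
```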
According to an embodiment of the present invention, the first spatial matrix is generated, according to the spatial distribution of the electrodes used to acquire the electroencephalogram signal, after the user's electroencephalogram signal for the corresponding time slice has undergone data preprocessing (data filtering, artifact removal, and baseline removal), and each value of the first spatial matrix for the corresponding time slice is the channel variance of the preprocessed electroencephalogram signal acquired by the corresponding channel in that time slice. For example, with the aforementioned international standard 10-20 system, the mapping between the electroencephalogram electrode arrangement and the spatial electrode matrix (spatial matrix) is shown in fig. 3b, and the spatial matrices of the several time slices are shown in fig. 4. The international standard 10-20 system yields a 9x9 spatial matrix according to the spatial arrangement of the electrodes; each position of the matrix is filled from the electroencephalogram data acquired by the corresponding spatial electrode, for example by taking the channel variance of each electrode's preprocessed channel signal as the channel feature and filling it into the position given by the electrode's spatial distribution, while positions with no corresponding electrode are set directly to zero to sparsify the matrix.
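To make the construction concrete, here is an illustrative sketch (not from the patent): a handful of hypothetical electrode-to-grid positions stand in for the real fig. 3b mapping, and each cell holds the channel variance:

```python
import numpy as np

# Hypothetical (row, col) grid positions for a few 10-20 electrodes;
# the actual 9x9 mapping follows fig. 3b of the patent.
ELECTRODE_POS = {"FP1": (0, 3), "FP2": (0, 5), "F7": (2, 0), "F8": (2, 8)}

def build_spatial_matrix(slice_data: np.ndarray, channel_names: list) -> np.ndarray:
    """slice_data: (channels, samples) for one time slice.
    Fill each electrode's cell with its channel variance; empty cells stay zero."""
    mat = np.zeros((9, 9))
    for ch, name in enumerate(channel_names):
        row, col = ELECTRODE_POS[name]
        mat[row, col] = slice_data[ch].var()
    return mat
```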
According to an embodiment of the present invention, the spatial feature extraction module 20 is configured to calculate an attention weight for each row and each column of each first spatial matrix in the plurality of first spatial matrices by using an attention mechanism, and obtain a plurality of second spatial matrices according to the attention weight for each row and each column of each first spatial matrix. Since there are multiple first spatial matrices as inputs, a second spatial matrix is generated for each first spatial matrix, resulting in multiple second spatial matrices. Preferably, the second spatial matrix is obtained by multiplying each data in the corresponding first spatial matrix by the attention weight corresponding to the row in which the data is located and by the attention weight corresponding to the column in which the data is located.
In order to analyze the spatial characteristics of each electrode within the overall spatial matrix, channel attention analysis over all electrodes is required to obtain correlation weights. According to an embodiment of the present invention, the spatial feature extraction module 20 comprises: a first fully-connected network module comprising a first fully-connected network and configured to input the concatenation vector of the means of each row of data of the first spatial matrix into the first fully-connected network, and to apply a Softmax calculation to the network's output to obtain the attention weight of each row of the first spatial matrix; and a second fully-connected network module comprising a second fully-connected network and configured to input the concatenation vector of the means of each column of data of the first spatial matrix into the second fully-connected network, and to apply a Softmax calculation to the network's output to obtain the attention weight of each column of the first spatial matrix.
According to one embodiment of the present invention, the attention weight of each row is calculated by a two-layer fully-connected network as:

wl = softmax(w4 × tanh(w3 × l + b3) + b4);

where w3 is the weight parameter of the first layer of the fully-connected network, w4 is the weight parameter of the second layer, b3 is the bias of the first layer, b4 is the bias of the second layer, tanh is the hyperbolic tangent activation function, softmax is the exponential probability activation function, and l is the row-mean vector of the first spatial matrix (the concatenation vector of the means of each row of data).
Preferably, the attention weight of each column is likewise calculated by a two-layer fully-connected network as:

wc = softmax(W4 × tanh(W3 × c + B3) + B4);

where W3 is the weight parameter of the first layer of the fully-connected network, W4 is the weight parameter of the second layer, B3 is the bias of the first layer, B4 is the bias of the second layer, tanh is the hyperbolic tangent activation function, softmax is the exponential probability activation function, and c is the column-mean vector of the first spatial matrix (the concatenation vector of the means of each column of data).
The results wl and wc above contain the attention weight of each row and of each column, respectively. Multiplying the data filled at each position of the first spatial matrix by the attention weights of the row and the column in which it lies yields the second spatial matrix for the electroencephalogram electrodes:

vi,j = wl(i) × wc(j) × Si,j, i ∈ (0, I), j ∈ (0, J);

where vi,j is the computed value at the ith row and jth column of the second spatial matrix, wl(i) denotes the attention weight of the ith row in which the value lies, wc(j) denotes the attention weight of the jth column in which the value lies, Si,j denotes the (originally filled) value at the ith row and jth column of the first spatial matrix, I denotes the number of rows of the spatial matrix, and J denotes the number of columns. For example, in the international standard 10-20 system, I = 9 and J = 9, with i ∈ (0, 9) and j ∈ (0, 9); that is, the spatial matrix (the first and/or second spatial matrix) is of size 9 × 9. This embodiment can at least achieve the following beneficial technical effects: rows and columns are treated separately, and their cross attention values (the attention weights of each row and each column) are computed with the row and the column as minimal units; the mean of each row is computed over that row's brain regions and used as the row representation for computing the attention weight, and the mean of each column is computed over that column's brain regions and used as the column representation. The second spatial matrix generated for each time slice thus contains channel similarity and correlation features, the obtained spatial features are more accurate, and emotion recognition accuracy is improved.
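A PyTorch sketch of this row/column cross-attention weighting, under assumed sizes (a 9x9 grid and hidden width 32); the class and layer names are illustrative, not the patent's:

```python
import torch
import torch.nn as nn

class CrossAttentionSpatial(nn.Module):
    """wl = softmax(w4·tanh(w3·l + b3) + b4) over row means, likewise wc over
    column means; then v[i,j] = wl[i] * wc[j] * s[i,j]."""
    def __init__(self, size: int = 9, hidden: int = 32):
        super().__init__()
        self.row_fc = nn.Sequential(nn.Linear(size, hidden), nn.Tanh(),
                                    nn.Linear(hidden, size))
        self.col_fc = nn.Sequential(nn.Linear(size, hidden), nn.Tanh(),
                                    nn.Linear(hidden, size))

    def forward(self, s: torch.Tensor) -> torch.Tensor:  # s: (batch, 9, 9)
        l = s.mean(dim=2)                                 # row means, (batch, 9)
        c = s.mean(dim=1)                                 # column means, (batch, 9)
        wl = torch.softmax(self.row_fc(l), dim=-1)        # row attention weights
        wc = torch.softmax(self.col_fc(c), dim=-1)        # column attention weights
        return wl.unsqueeze(2) * wc.unsqueeze(1) * s      # second spatial matrix
```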
According to an embodiment of the present invention, the spatiotemporal feature fusion module 30 is configured to extract the time-series correlation features among the plurality of second spatial matrices and obtain a plurality of spatiotemporal characterization vectors from the second spatial matrices and the corresponding time-series correlation features. According to one embodiment of the invention, the module comprises a stack of coding networks, the input of each coding network being processed in turn by its self-attention mechanism layer, feedforward layer, and residual layer. The input of the first coding network is a plurality of spatial characterization sequences, each obtained by concatenating, in order, the data of each row of a second spatial matrix; the input of each subsequent coding network is the intermediate spatio-temporal characterization vector output by the previous coding network; and the last coding network outputs the final spatio-temporal characterization vectors. This embodiment can at least achieve the following beneficial technical effects: processing by the self-attention mechanism layer, the feedforward layer, and the residual layer better injects time-series features into the spatial features, optimizing the accuracy of the subsequent feature-vector expression and improving emotion recognition accuracy.
According to one embodiment of the invention, the stack of coding networks may be adapted from a Transformer model. The Transformer model (a time-series data mining network) comprises an encoder and a decoder: the encoder computes the temporal correlation features of each spatial characterization sequence and generates temporal characterization vectors, while the decoder interprets the generated intermediate time-series vectors. The present invention does not use the decoding part of the Transformer network, so only the encoder of the Transformer model needs to be modified; the encoder is explained in detail below. According to one embodiment of the invention, the modified Transformer model comprises a plurality of coding networks, without the original Transformer decoder. Each coding network comprises a self-attention mechanism layer (also called a self-attention calculation module), a feedforward layer (fully-connected module), and a residual layer (residual connection module), connected in sequence. Referring to fig. 4, the coding networks are cascaded into a deep encoder (xN denotes a stack of N coding networks, e.g., N = 4, 6, or 8), i.e., the modified Transformer model. The input and output data formats of the coding networks of the modified Transformer model are identical; fine-grained temporal dependencies are computed by the self-attention mechanism layer, the feedforward layer, and the residual layer to form the corresponding representations, the gradient and data distribution are maintained in the deep network, and time-series feature mining and spatial feature fusion are achieved efficiently with fewer parameters.
According to an embodiment of the present invention, in the self-attention mechanism layer, the plurality of second spatial matrices are converted into spatial characterization sequences, ordered along the time dimension, as the input of the Transformer network. The self-attention calculation module needs to compute the correlation between the inputs; for each input, the Q, K, and V vectors used in the correlation operation are computed with three matrices:

Qt = WQ × Xt;

Kt = WK × Xt;

Vt = WV × Xt;

where Xt denotes the tth input sequence (it should be understood that, with multiple coding networks, for the first coding network the sequence is the spatial characterization sequence of time slice t, while for a subsequent coding network it is the spatio-temporal characterization vector output by the previous coding network for time slice t), WQ, WK, and WV denote the parameter matrices used to generate the Q, K, and V vectors, Qt, Kt, and Vt are the query vector, key vector, and value vector of time slice t in the correlation calculation, and × is standard matrix multiplication. The query vector is used for similarity calculation between the current sequence and the other sequences; the key vector is the index of the matrix and forms a similarity measure through dot-product calculation with the query vector; and the value vector is used, via the similarity scores, to generate the characterization vector of the spatial characterization sequence. The similarity score is computed as:

Scoreij = (Qi · Kj) / sqrt(dKj), i, j ∈ (0, n);

where Scoreij denotes the similarity score between the sequences of time slices i and j, n denotes the total number of sequences, dKj is the key-vector dimension of the sequence of time slice j, Qi denotes the query vector of the sequence of time slice i, and Kj denotes the key vector of the sequence of time slice j.
According to one embodiment of the invention, the intermediate timing vectors are computed by the self-attention mechanism layer in the temporal order of the spatial characterization sequences, from the similarity scores and the value vectors. Optionally, the self-attention mechanism layer generates the intermediate timing vector as:

XNt = Σ(j) ( e^(Scoret,j) / Σ(m) e^(Scoret,m) ) × Vj;

where XNt denotes the intermediate timing vector (intermediate timing representation) of the sequence of time slice t, Scoret,t denotes the similarity score of the sequence of time slice t with itself, Scoret,j denotes the similarity score between the sequences of time slices t and j, e denotes the base of the natural logarithm, and Vj is the value vector of the sequence of time slice j. The formula uses the softmax activation function to compute exponential similarity ratios against the different sequences and generate weights; after similarity calculation with all the other sequences, the weighted sum of the value vectors is taken as the intermediate timing vector.
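A single-head self-attention sketch in PyTorch matching the formulas above; the dimension d = 81 assumes a 9x9 second spatial matrix flattened row by row, which is an assumption for illustration:

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Score(i,j) = Qi·Kj / sqrt(dK); XNt = softmax-weighted sum of value vectors."""
    def __init__(self, d: int = 81):
        super().__init__()
        self.wq = nn.Linear(d, d, bias=False)  # WQ
        self.wk = nn.Linear(d, d, bias=False)  # WK
        self.wv = nn.Linear(d, d, bias=False)  # WV

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, n_slices, d)
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
        return torch.softmax(scores, dim=-1) @ v          # intermediate timing vectors
```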
According to an embodiment of the present invention, to reduce the amount of computation, the coding network of the modified Transformer model may change the multi-head attention mechanism of the original Transformer encoder into a single-head attention mechanism. Preferably, with single-head attention, the sizes of the weight parameters in the encoder may be adjusted to fit the spatial characterization sequences, so that the intermediate timing vectors obtained through the self-attention mechanism layer keep the same data format as the spatial characterization sequences; that is, each spatial characterization sequence yields an intermediate timing vector of the same format.
According to one embodiment of the invention, the feedforward layer is configured to apply a nonlinear transformation to the intermediate timing vector output by the self-attention mechanism layer. Preferably, the feedforward layer computes:

gt = W2(Relu(W1 · XNt + b1));

where gt denotes the vector generated for time slice t after the feedforward layer, W1 and W2 denote the weight parameters of the first and second layers of the feedforward layer, b1 denotes the bias of the first layer, XNt is the intermediate timing vector generated by the self-attention mechanism layer for time slice t, and Relu is a nonlinear activation function used to improve the nonlinear characterization capability of the algorithm.
According to one embodiment of the invention, the residual layer (residual connection layer) performs residual connection and layer regularization on the input of its coding network and the output of that coding network's feedforward layer, and outputs a spatio-temporal characterization vector. It should be understood that the input of the first coding network is the plurality of spatial characterization sequences, while the input of each subsequent coding network is the intermediate spatio-temporal characterization vector output by its previous coding network. The skip connections (residual connections) and layer regularization improve the gradient flow of the sequence operations. Preferably, the residual layer computes:

Rt = LayerNorm(gt + XNt);

where Rt denotes the output of the residual layer for time slice t (which is also the output of the coding network it belongs to), gt denotes the output of the feedforward layer of that coding network for time slice t, XNt denotes the input of the coding network to which the residual layer belongs, and LayerNorm denotes layer regularization. Layer regularization is, for example, normalization of the mean and variance over all neurons of the layer. This embodiment can at least achieve the following beneficial technical effect: the residual layer prevents vanishing gradients from making training difficult, and maintains the gradient and data distribution of the deep network.
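Continuing the SelfAttention sketch above, one coding network could then look like this; the hidden width of 256 and the stack depth N = 4 are assumptions:

```python
class CodingNetwork(nn.Module):
    """One coding network: self-attention, then gt = W2(Relu(W1·XNt + b1)),
    then Rt = LayerNorm(gt + XNt)."""
    def __init__(self, d: int = 81, hidden: int = 256):
        super().__init__()
        self.attn = SelfAttention(d)
        self.ff = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                nn.Linear(hidden, d))
        self.norm = nn.LayerNorm(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xn = self.attn(x)         # intermediate timing vectors XNt
        g = self.ff(xn)           # feedforward output gt
        return self.norm(g + xn)  # Rt = LayerNorm(gt + XNt)

# A deep encoder cascades N such networks (the xN stack of fig. 4):
encoder = nn.Sequential(*[CodingNetwork() for _ in range(4)])
```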
In the modified Transformer model above, the self-attention mechanism layer has not yet been changed; that is, it still adopts a bidirectional self-attention mechanism. In a bidirectional self-attention layer, the attention relationships between the current time slice t and both the preceding and following time slices are considered. However, through the inventors' experiments and analysis, a user's current emotion is generally related only to previously experienced events or previously generated emotions, and is unrelated or only weakly related to subsequent emotions. According to one embodiment of the invention, the self-attention mechanism layer is therefore preferably a unidirectional self-attention mechanism layer, configured to: when determining the attention relationships among the sequences corresponding to the time slices, calculate the attention relationship between the current time slice's sequence and the sequences of its forward time slices, and the attention relationship of the current time slice's sequence with itself, but not the attention relationship between the current time slice's sequence and the sequences of backward time slices. The dependency on backward time slices is thus masked when computing features for the current time slice; that is, when computing attention relationships, every sequence computes dependencies only with its forward sequences. Referring to FIG. 5, Qt, Kt, and Vt denote the query, key, and value vectors of time slice t, and Qt+1, Kt+1, and Vt+1 denote those of time slice t+1. The dotted lines in FIG. 5 show the masked data flows; with a bidirectional self-attention layer, the dotted lines would be solid, indicating attention relationships between the current time slice's sequence and subsequent time slices' sequences. For further explanation: with a bidirectional self-attention mechanism layer, the self-attention mechanism layer generates the intermediate timing vector as:
XNt = Σ(j=0..n) ( e^(Scoret,j) / Σ(m=0..n) e^(Scoret,m) ) × Vj;

namely: the attention correlation of time slice t with all sequences is calculated.
If a unidirectional self-attention mechanism layer is adopted, the calculation mode of generating the intermediate time sequence vector by the self-attention mechanism layer is as follows:
XNt = Σ(j=0..t) ( e^(Scoret,j) / Σ(m=0..t) e^(Scoret,m) ) × Vj;
it can be seen from the difference in the formulas that only the attention relationship between time slice t and the time slice before it (the forward time slice) is calculated, and the attention relationship between the time slice after time slice t (the backward time slice) is not considered (i.e., the subsequent time slice is masked). For example, assuming that 10 time slices are input at a time, assuming that the current time slice is time slice 4, the attention relationship between the time slice 4 corresponding sequence and the time slice 0-4 corresponding sequences is calculated, and the attention relationship between the time slice 4 corresponding sequence and the time slice 5-9 corresponding sequences is not calculated.
The technical scheme of the embodiment can at least realize the following beneficial technical effects: because the emotion recognition is different from the sentence translation scene, the former emotion has stronger influence on the current emotion and the latter emotion has weaker influence, so that the embodiment is changed into a unidirectional self-attention mechanism layer, and the emotion recognition precision of the model is improved.
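The masking itself can be sketched with a lower-triangular mask (PyTorch; an illustrative variant of the attention sketch above, not the patent's code):

```python
import math
import torch

def causal_self_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Unidirectional variant: attention to backward (future) time slices is masked.
    q, k, v: (batch, n_slices, d)."""
    n = q.size(1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
    keep = torch.tril(torch.ones(n, n, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~keep, float("-inf"))  # hide slices j > t
    return torch.softmax(scores, dim=-1) @ v
```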
It should be understood that, according to an embodiment of the invention, in addition to modifying an existing Transformer model, a corresponding model structure can also be built directly following the description of the embodiments of the invention to realize the emotion recognition model based on electroencephalogram data.
According to an embodiment of the present invention, the training data may employ the SEED dataset and/or the DEAP dataset. In the training data, emotion is induced by video stimulation; each video carries an emotion label, the user's real emotion is fed back in a questionnaire after the video ends, and samples whose video label is consistent with the questionnaire label are used as valid training samples. The training data are stored as a dictionary of data matrices and corresponding labels:

Si = (Xi, li), Xi ∈ R^(c × ((bt+dt) × r)), li ∈ {-1, 0, 1};

where Si is the storage data structure of the ith sample, Xi has the format [c, ((bt+dt) × r)], c is the total number of electrodes, r is the signal sampling rate, bt denotes the recording duration before emotion induction (the environmental baseline time, used to record the baseline signal), dt denotes the recording duration during emotion induction (the time over which the emotion-induced electroencephalogram signal is recorded), and li is the label of the ith sample. The labels -1, 0, and 1 denote negative, neutral, and positive emotion, respectively. All samples are collected according to the data acquisition method above and stored as matrices, providing a data basis for model training.
According to one embodiment of the invention, emotion recognition module 40 is configured to determine the emotion of the user based on the plurality of spatiotemporal characterization vectors. According to one embodiment of the invention, emotion recognition module 40 comprises a multi-layer fully-connected network. For example, emotion recognition module 40 comprises a two-layer fully-connected network, computed as:

Et = softmax(w6 × Relu(w5 × Rt + b5) + b6);

where w5 and w6 are the weight parameters of the first and second layers of the emotion recognition module's fully-connected network, b5 and b6 are the biases of the first and second layers, Relu is a nonlinear activation function, and softmax is the probability activation function of the output layer of the fully-connected network, used to generate the emotion recognition probabilities. Et denotes the probability that the recognized user belongs to the corresponding emotion at time slice t.
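A matching classification head in PyTorch; the three emotion classes follow the running example, while the hidden width of 64 and the class name are assumptions:

```python
class EmotionHead(nn.Module):
    """Et = softmax(w6 · Relu(w5 · Rt + b5) + b6)."""
    def __init__(self, d: int = 81, hidden: int = 64, n_classes: int = 3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))

    def forward(self, r: torch.Tensor) -> torch.Tensor:  # r: (batch, n_slices, d)
        return torch.softmax(self.net(r), dim=-1)         # per-slice probabilities Et
```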
According to one embodiment of the invention, the emotion recognition model is trained as follows: a plurality of training samples are acquired, each comprising electroencephalogram signals collected from an experimenter over a plurality of time slices and an emotion label for each time slice; the emotion recognition model is trained with the training samples to output the experimenter's emotion in the corresponding time slices; loss values are calculated from the emotions output for the corresponding training samples and the corresponding emotion labels; and the parameters of the spatial feature extraction module 20, the spatiotemporal feature fusion module 30, and the emotion recognition module 40 are updated with the loss values. According to one embodiment of the invention, the model is trained with a cross-entropy loss function to compute the loss values; that is, the cross-entropy loss function is the optimization target of training. The cross-entropy loss is computed as:

Loss = - Σ(t=1..T) Σ(j=1..K) Yt,j × log(Et,j);

where Et,j denotes the recognized probability that the user belongs to emotion category j at time slice t, Yt,j denotes the emotion category to which the user belongs at time slice t as indicated by the label, T denotes the number of time slices currently used to update the model parameters, and K denotes the number of emotion categories. For example, there are 3 emotion categories, negative, neutral, and positive, denoted by -1, 0, and 1 in the original labels. The cross-entropy loss function computes the loss value from the emotion recognition results of the experimenter in the corresponding time slices and the real emotion labels. Note that, since cross entropy is being computed, the original label must be converted into the multi-channel 0/1 label that the cross-entropy loss requires; that is, Yt,j is the one-hot vector of the label. For example, with the channels of the multi-channel label arranged in the order negative, neutral, positive: -1 in the original label is converted into (1, 0, 0), 0 into (0, 1, 0), and 1 into (0, 0, 1), so that the loss value of the multi-class cross-entropy loss function can be computed.
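A sketch of this label conversion and loss (PyTorch); the epsilon guard and tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def emotion_loss(probs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """probs: (n_slices, 3) per-slice probabilities Et;
    labels: (n_slices,) long tensor of original labels in {-1, 0, 1}."""
    y = F.one_hot(labels + 1, num_classes=3).float()  # -1,0,1 -> channels 0,1,2
    return -(y * torch.log(probs + 1e-8)).sum()       # Loss = -ΣΣ Yt,j log Et,j
```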
According to an embodiment of the present invention, there is also provided a method of recognizing emotion, comprising: acquiring electroencephalogram signals of the user over a plurality of time slices, inputting them into the emotion recognition model of the corresponding embodiment, and outputting the emotion of the user at each time slice. According to one embodiment of the invention, the emotion of each time slice is an instantaneous emotion of the user, and the method further comprises: determining the long-term emotion of the user by soft voting, based on the instantaneous emotions of the user over the plurality of time slices and the probability corresponding to each instantaneous emotion. Instantaneous (or short-term) emotion recognition is the emotion recognition result for each time slice, while the long-term emotion prediction is generated by soft voting over the short-term emotions within a sample: either the majority short-term emotion in the sample is taken as the sample's long-term emotion (or induced emotion), or the emotion with the highest average probability among the short-term emotions is taken as the long-term emotion.
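For the soft-voting step, a minimal sketch under the same assumptions (probs is the (T, K) matrix of per-slice probabilities returned by the model):

```python
import torch

def long_term_emotion(probs: torch.Tensor) -> int:
    """Soft voting: take the emotion whose average probability over the
    T short-term predictions is highest as the long-term emotion."""
    return int(probs.mean(dim=0).argmax())
```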
According to one embodiment of the invention, the training of the emotion recognition model and the process of recognizing emotion comprise the following steps:
step S1: the model training adopts labeled multi-channel electroencephalogram signals sampled by electroencephalogram acquisition equipment (based on Emotiv Epoc + multi-channel electroencephalogram acquisition sensors) under emotion induction conditions and SEED and DEAP electroencephalogram reference data sets as training data. Wherein, the original data needs to be subjected to noise reduction, artifact removal and environmental baseline influence removal. The data is subjected to noise reduction and filtration to remove signals of non-electroencephalogram and emotional irrelevant parts in the signals, the artifact removing algorithm reduces interference of acquisition of electro-oculogram, myoelectricity and respiratory signals in the sampling process on the electroencephalogram signals, and the instability of environmental factors (such as air temperature, humidity, body temperature and the like) in the data in the signals is removed in a baseline mode. The data preprocessing is completed before the model operation, and the data divides time slices according to the time sequence and waits for the model operation.
Step S2: the emotion recognition model reads the electroencephalogram data corresponding to each time slice in the training sample while maintaining the temporal order of the data, ensuring the accuracy of training. First, a spatial representation of the data is produced: a specified number of time slices within a certain period are arranged along the time dimension to form samples, where each time slice is a multi-channel short-time sampling matrix; a spatial matrix (corresponding to the first spatial matrix) is filled in according to the spatial distribution of the electrode channels; a cross attention mechanism computes over each dimension of the matrix and generates an attention weight matrix; and the weighted spatial matrix (corresponding to the second spatial matrix) is taken as the spatial representation of the signal. The weighted spatial matrix generated through this spatial correlation calculation encodes the activation states of the electrodes and brain regions induced by different emotions, so it can visually reflect individual differences and similarities in emotion induction, enabling the analysis and recording of differences in users' emotional-cognitive abilities. A sketch of this weighting is shown below.
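The sketch follows the row/column weighting described in claim 4; the two small fully connected networks (row_mlp and col_mlp, e.g. nn.Linear(H, H) and nn.Linear(W, W)) are assumed to be trained jointly with the rest of the model:

```python
import torch

def weighted_spatial_matrix(S, row_mlp, col_mlp):
    """S: (H, W) first spatial matrix (channel variances placed according
    to the electrode layout). The mean of each row (column) is fed to a
    fully connected network and softmax-normalized into per-row
    (per-column) attention weights; the weighted result is the second
    spatial matrix."""
    row_w = torch.softmax(row_mlp(S.mean(dim=1)), dim=0)  # (H,)
    col_w = torch.softmax(col_mlp(S.mean(dim=0)), dim=0)  # (W,)
    return S * row_w.unsqueeze(1) * col_w.unsqueeze(0)
```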
Step S3: the emotion recognition model is trained with the training samples to output the emotion corresponding to each time slice of the experimenter in each training sample; a loss value is calculated from the emotions output for the corresponding training samples and the corresponding emotion labels, using the cross entropy loss function; the loss value is used to update the parameters of the spatial feature extraction module, the spatiotemporal feature fusion module and the emotion recognition module; and steps S2 and S3 are repeated until the model converges. A sketch of such a training loop follows.
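This sketch assumes the model outputs pre-softmax logits per time slice and that the loader yields (eeg, labels) with labels as class indices in {0, 1, 2}; all hyperparameters are illustrative:

```python
import torch

def train(model, loader, epochs=50, lr=1e-3):
    """One possible training loop for the three-module pipeline: forward the
    T time slices, compute the cross entropy loss, back-propagate, and let
    the optimizer update all module parameters jointly."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    ce = torch.nn.CrossEntropyLoss()  # applies log-softmax internally
    for _ in range(epochs):
        for eeg, labels in loader:
            logits = model(eeg)       # (T, K) per-time-slice outputs
            loss = ce(logits, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
```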
Step S4: the trained emotion recognition model recognizes instantaneous emotions over short periods and, from the historical emotional conditions, persistent emotional states (corresponding to long-term emotions) over long periods. For real-time emotion monitoring of a user, the individual's electroencephalogram signals are monitored in real time, emotion data is generated within a time window, and the data in the time window is updated along the time dimension to achieve real-time recognition of instantaneous emotions and/or persistent emotional states, as in the sketch below.
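A sketch of such a sliding-window monitor; the window length and the model's interface (a callable returning (t, K) probabilities for the slices currently in the window) are assumptions:

```python
from collections import deque
import torch

def monitor(model, stream, window=10):
    """stream yields one preprocessed EEG time slice at a time. Each update
    yields (instantaneous emotion of the newest slice, long-term emotion of
    the window via soft voting)."""
    buf = deque(maxlen=window)
    for slice_data in stream:
        buf.append(slice_data)
        probs = model(torch.stack(list(buf)))  # (t, K) probabilities
        yield probs[-1].argmax().item(), probs.mean(dim=0).argmax().item()
```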
To verify the effect of the present invention, the inventors performed an experiment based on the SEED data set (SJTU Emotion EEG Dataset).
Comparison technical scheme 1: the existing Cascade CNN model, a cascaded convolutional neural network (CNN), uses about 12 convolutional layers and focuses only on the spatial characteristics of the data; its average emotion recognition accuracy is 69.2243%;
Comparison technical scheme 2: the existing Cascade LSTM model, a cascaded long short-term memory network (LSTM), uses 6 long short-term memory layers and focuses only on the temporal characteristics of the data; its average emotion recognition accuracy is 75.4227%.
Embodiment 1 of the present invention: the emotion recognition model based on electroencephalogram data adopts 6 stacked coding networks with a unidirectional self-attention mechanism layer;
Embodiment 2 of the present invention: the emotion recognition model based on electroencephalogram data adopts 6 stacked coding networks with a bidirectional self-attention mechanism layer.
The recognition accuracy for individual users and the average recognition accuracy of the embodiments of the present invention are shown in the following table. It can be seen that both embodiment 1 and embodiment 2 of the present invention are superior to comparison technical schemes 1 and 2, and that the model performs relatively better with the unidirectional self-attention mechanism layer. In general, the invention trains a lightweight neural network model, which improves the generalization capability and pervasive effect of the model, prevents overfitting during training, and ensures high-accuracy emotion recognition across users.
(Table: per-user recognition accuracy and average recognition accuracy of embodiments 1 and 2 compared with comparison technical schemes 1 and 2.)
It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (11)

1. An emotion recognition model based on electroencephalogram data, comprising:
the spatial matrix construction module is used for generating a first spatial matrix according to the electroencephalogram signals of the user obtained at each time slice of a plurality of time slices, to obtain a plurality of first spatial matrices;
the spatial feature extraction module is used for calculating attention weights of each row and each column of each first spatial matrix in the plurality of first spatial matrices by using an attention mechanism respectively, and obtaining a plurality of second spatial matrices according to the attention weights of each row and each column of each first spatial matrix;
the spatiotemporal feature fusion module is used for extracting temporal correlation features among the second spatial matrices and obtaining a plurality of spatiotemporal characterization vectors according to the second spatial matrices and the corresponding temporal correlation features;
and the emotion recognition module is used for determining the emotion of the user according to the plurality of spatiotemporal characterization vectors.
2. The emotion recognition model of claim 1, wherein the first spatial matrix is generated, after the electroencephalogram signals of the user in the corresponding time slice are preprocessed, according to the spatial distribution of the plurality of electrodes used to acquire the electroencephalogram signals, wherein the data preprocessing includes data filtering processing, and/or data de-artifact processing, and/or data de-baselining processing; and each value in the first spatial matrix corresponding to a time slice is the channel variance of the electroencephalogram signal acquired by the corresponding channel in that time slice after data preprocessing.
3. The emotion recognition model of claim 1, wherein the second spatial matrix is obtained by multiplying each element of the corresponding first spatial matrix by the attention weight of the row in which it is located and by the attention weight of the column in which it is located.
4. The emotion recognition model of claim 3, wherein the spatial feature extraction module comprises:
a first fully-connected network module comprising a first fully-connected network, the first fully-connected network module configured to: input the concatenation vector of the mean values of each row of data of the first spatial matrix into the first fully-connected network for processing to obtain the output of the first fully-connected network, and perform Softmax calculation on the output of the first fully-connected network to obtain the attention weight of each row of the first spatial matrix; and
a second fully-connected network module comprising a second fully-connected network, the second fully-connected network module configured to: input the concatenation vector of the mean values of each column of data of the first spatial matrix into the second fully-connected network for processing to obtain the output of the second fully-connected network, and perform Softmax calculation on the output of the second fully-connected network to obtain the attention weight of each column of the first spatial matrix.
5. The emotion recognition model of claim 3, wherein the spatiotemporal feature fusion module comprises a plurality of stacked coding networks, and the input of each coding network is processed in sequence by the self-attention mechanism layer, the feedforward layer and the residual layer of that coding network;
the input of the first coding network is a plurality of spatial representation sequences, each obtained by sequentially concatenating the data of each row of the second spatial matrix; the input of each subsequent coding network is the intermediate spatiotemporal characterization vector output by the previous coding network; and the last coding network outputs the final spatiotemporal characterization vectors.
6. The emotion recognition model of claim 5, wherein the self-attention mechanism layer is a unidirectional self-attention mechanism layer configured to: when computing the attention relationships between the sequences corresponding to the time slices, compute the attention relationships between the sequence corresponding to the current time slice and the sequences corresponding to preceding time slices, and between the sequence corresponding to the current time slice and itself, but not between the sequence corresponding to the current time slice and the sequences corresponding to subsequent time slices.
7. The emotion recognition model of any of claims 1-6, wherein the emotion recognition model is trained by:
acquiring a plurality of training samples, wherein each training sample comprises electroencephalogram signals acquired from an experimenter over a plurality of time slices and an emotion label corresponding to each time slice;
and training the emotion recognition model with the plurality of training samples to output the emotion corresponding to the experimenter in each corresponding time slice of each training sample, calculating a loss value from the plurality of emotions output for the corresponding training samples and the corresponding emotion labels, and updating the parameters of the spatial feature extraction module, the spatiotemporal feature fusion module and the emotion recognition module with the loss value.
8. A method of recognizing emotion, the method comprising:
acquiring electroencephalogram signals of a user collected by electroencephalogram acquisition equipment over a plurality of time slices;
inputting the electroencephalogram signals of the user over the plurality of time slices into the emotion recognition model according to any one of claims 1 to 7, and outputting the emotion of the user at each time slice.
9. The method of claim 8, wherein the emotion of each time slice is an instantaneous emotion of the user, the method further comprising:
and determining the long-term emotion of the user in a soft voting mode based on the instantaneous emotion of the user in a plurality of time slices and the corresponding probability of each instantaneous emotion.
10. An electronic device, comprising:
one or more processors; and
a memory, wherein the memory is to store executable instructions;
the one or more processors are configured to implement the steps of the method of claim 8 or 9 via execution of the executable instructions.
11. A computer-readable storage medium, on which a computer program is stored which is executable by a processor for carrying out the steps of the method as claimed in claim 8 or 9.
CN202210069138.1A 2022-01-21 2022-01-21 Method for recognizing emotion and emotion recognition model based on electroencephalogram data Pending CN114209323A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210069138.1A CN114209323A (en) 2022-01-21 2022-01-21 Method for recognizing emotion and emotion recognition model based on electroencephalogram data

Publications (1)

Publication Number Publication Date
CN114209323A (en) 2022-03-22

Family

ID=80847097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210069138.1A Pending CN114209323A (en) 2022-01-21 2022-01-21 Method for recognizing emotion and emotion recognition model based on electroencephalogram data

Country Status (1)

Country Link
CN (1) CN114209323A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102573619A (en) * 2008-12-19 2012-07-11 新加坡科技研究局 Device and method for generating a representation of a subject's attention level
CN106923824A (en) * 2017-03-27 2017-07-07 广州视源电子科技股份有限公司 Brain electricity allowance recognition methods and device based on many spatial signal properties
CN110162777A (en) * 2019-04-01 2019-08-23 广东外语外贸大学 One kind seeing figure writing type Automated Essay Scoring method and system
CN110610168A (en) * 2019-09-20 2019-12-24 合肥工业大学 Electroencephalogram emotion recognition method based on attention mechanism
CN111134666A (en) * 2020-01-09 2020-05-12 中国科学院软件研究所 Emotion recognition method of multi-channel electroencephalogram data and electronic device
CN111914486A (en) * 2020-08-07 2020-11-10 中国南方电网有限责任公司 Power system transient stability evaluation method based on graph attention network
CN112057089A (en) * 2020-08-31 2020-12-11 五邑大学 Emotion recognition method, emotion recognition device and storage medium
CN113598774A (en) * 2021-07-16 2021-11-05 中国科学院软件研究所 Active emotion multi-label classification method and device based on multi-channel electroencephalogram data
CN113947127A (en) * 2021-09-15 2022-01-18 复旦大学 Multi-mode emotion recognition method and system for accompanying robot

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115309975A (en) * 2022-06-28 2022-11-08 中银金融科技有限公司 Product recommendation method and system based on interactive features
CN115105079A (en) * 2022-07-26 2022-09-27 杭州罗莱迪思科技股份有限公司 Electroencephalogram emotion recognition method based on self-attention mechanism and application thereof
CN116965817A (en) * 2023-07-28 2023-10-31 长江大学 EEG emotion recognition method based on one-dimensional convolution network and transducer
CN116965817B (en) * 2023-07-28 2024-03-15 长江大学 EEG emotion recognition method based on one-dimensional convolution network and transducer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination