CN114662547A - MSCRNN emotion recognition method and device based on electroencephalogram signals - Google Patents
- Publication number: CN114662547A (application CN202210361451.2A)
- Authority: CN (China)
- Prior art keywords: module, emotion recognition, convolution, layer, attention
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F2218/08 — Feature extraction (aspects of pattern recognition specially adapted for signal processing)
- G06F2218/12 — Classification; Matching
- A61B5/165 — Evaluating the state of mind, e.g. depression, anxiety
- A61B5/369 — Electroencephalography [EEG]
- A61B5/374 — Detecting the frequency distribution of signals, e.g. detecting delta, theta, alpha, beta or gamma waves
- A61B5/7235 — Details of waveform analysis
- A61B5/7267 — Classification of physiological signals or data, e.g. using neural networks, involving training the classification device
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods
Abstract
The invention discloses an MSCRNN emotion recognition method and device based on electroencephalogram signals. The method comprises the following steps: constructing 4D features from the preprocessed electroencephalogram data; building a classification network composed of a multi-scale convolution module, a self-attention module, an LSTM module and other module units; using the 4D features and the corresponding labels to train the classification network; and using the trained classification network to evaluate the test set and to recognize the emotion of an actual subject. The device comprises a processor and a memory. The invention constructs a 4D feature structure containing time-domain, frequency-domain and spatial-domain information during electroencephalogram data processing and performs emotion recognition through a classification network, in which the multi-scale convolution module and the self-attention module deeply mine the frequency-domain and spatial-domain information in the 4D features while the LSTM module mines the time-domain information, giving the invention great practical value in the field of electroencephalogram emotion recognition.
Description
Technical Field
The invention relates to the field of electroencephalogram emotion recognition, in particular to an MSCRNN (multi-scale convolutional recurrent neural network) emotion recognition method and device based on electroencephalogram signals. Specifically, a 4D feature structure is constructed from the preprocessed electroencephalogram signals according to the spatial positions of the electroencephalogram electrodes, and the MSCRNN classification network performs emotion recognition by combining time-domain, frequency-domain and spatial-domain information simultaneously.
Background
Emotion is a person's attitudinal experience of objective things and the corresponding behavioral response, and accurately identifying emotion is important for human-computer interaction. Emotion recognition also has great potential in practical applications, for example: identifying negative emotions in the field of automatic driving to reduce the probability of accidents, or identifying target audiences for games, movies, music, educational products, and so on. In recent years, affective computing[1] has gradually developed into a dedicated field within Artificial Intelligence (AI), and the rapid development of computer technology has made affective computing possible. Emotion recognition technology provides an important theoretical basis for research in brain science and human-computer interaction.
There are many emotion recognition approaches: facial expressions, voice tone, gestures and physiological signals can all serve as data sources for emotion recognition. Among these, physiological signals are difficult to disguise and contain richer information[2]. Physiological signals include the electroencephalogram (EEG), electrocardiogram (ECG), electromyogram (EMG), skin temperature (SKT), skin conductance (SC), and so on. Emotion recognition from the EEG has the advantages of simple operation, low cost, mobility (ready-made wearable headsets exist), high temporal resolution and good performance, and because it records the underlying brain activity, the EEG is considered the most reliable physiological signal for emotion recognition systems. In electroencephalogram emotion recognition experiments, emotion is evoked by visual stimuli, music stimuli, autobiographical recall, situational paradigms and imagination[3]. Models describing emotional state include discrete emotion models and dimensional models such as the two-dimensional valence-arousal (VA) space[4]; recently, a three-dimensional emotion model was proposed for classifying the emotions generated while users watch music videos[5]. Examples of electroencephalogram emotion recognition data sets include SEED (the emotion EEG data set of Shanghai Jiao Tong University), DEAP (a data set for emotion recognition using physiological signals) and Imagined Emotion.
At present, directly classifying raw or preprocessed electroencephalogram data requires a huge amount of computation and yields unsatisfactory classification results, so most research first performs feature extraction on the preprocessed electroencephalogram signals and then applies a classifier; feature extraction thus plays an important role in EEG-based emotion recognition[6]. In the field of emotion recognition, conventional EEG features fall mainly into three types: time-domain, frequency-domain and time-frequency-domain features; given that the asymmetry of brain regions may also reflect emotional information, spatial-domain features are increasingly used to identify emotion[7]. Common time-domain features include the event-related potential (ERP), energy, power and signal statistics. Relatively little emotion recognition work uses time-domain features; existing conclusions indicate that time-domain features have the strongest discriminative power for emotional valence, so in follow-up research, researchers may consider adding time-domain features to improve emotion recognition[7]. Frequency-domain features include power, power spectral density (PSD)[8], event-related (de)synchronization and differential entropy (DE)[9]. Al-Nafjan et al. used deep neural networks (DNN) based on PSD features to recognize human emotions[10]. Zheng et al. proposed a deep belief network (DBN) for emotion recognition using DE features, verifying the effectiveness of DE features[11]. Jeevan Reddy Koya et al. also used DE features as input and recognized emotion with an LSTM-RNN network, further demonstrating DE to be an effective feature for recognizing emotion[12].
Yang et al. proposed a hierarchical network using the DE features of five frequency bands to identify different emotions[13]. Time-frequency-domain features include the short-time Fourier transform (STFT), the wavelet transform (DWT) and so on; Nattapong et al. studied the influence of the sliding-window size on emotion classification accuracy when using DWT features[14]. It should be noted that the essence of both the frequency domain and the time-frequency domain is to transform the time-domain signal into the frequency domain, and the feature types they compute (such as PSD and differential entropy) are the same; however, the time-frequency domain divides the signal into time periods, contains richer feature information and is more commonly used in the field of emotion recognition[7]. Spatial-domain features mainly use the connectivity changes among brain regions as features, or combine certain time-domain, frequency-domain and time-frequency features with the electrodes, so as to reflect the spatial variation of electroencephalogram information. Studies have shown that listening to happy music can increase the effective connectivity in the frontal lobe area[15]. Lu et al. studied differential asymmetry (DASM), rational asymmetry (RASM) and differential caudality (DCAU) features and demonstrated their effectiveness in classifying emotions[9]. Notably, combining multiple features for emotion recognition may yield better results, since every feature is a processing of the raw data that may discard some emotion-related information. Shen et al.[16] proposed a 4D-CRNN (4D convolutional recurrent neural network) model combining the features of three domains, and fusing the information of the three domains gave better recognition than using only two.
Jia et al. proposed SST-EmotionNet, which combines time-domain, frequency-domain and spatial-domain features and produces extremely high recognition results on a specific emotion data set[17]. However, when extracting spatial information, previous models often only map the extracted features onto a plane according to the spatial arrangement of the electroencephalogram channels and then extract spatial features with a CNN, so part of the spatial information is often lost or redundant spatial feature information is generated, restricting the improvement of emotion recognition accuracy.
Reference to the literature
[1] Erol B A, Majumdar A, Benavidez P, et al. Toward Artificial Emotional Intelligence for Cooperative Social Human-Machine Interaction[J]. IEEE Transactions on Computational Social Systems, 2019, PP(99): 1-13.
[2] Zhao G, Song J, Ge Y, et al. Advances in emotion recognition based on physiological big data[J]. Journal of Computer Research and Development, 2016.
[3] Siedlecka E, Denson T F. Experimental Methods for Inducing Basic Emotions: A Qualitative Review[J]. Emotion Review, 2018: 175407391774901.
[4] van den Broek Egon L. [J]. Personal and Ubiquitous Computing, 2013, 17(1): 53-67.
[5] Dabas H, Sethi C, Dua C, et al. Emotion Classification Using EEG Signals[C]//The 2018 2nd International Conference. 2018.
[6] Jenke R, Peer A, Buss M. Feature Extraction and Selection for Emotion Recognition from EEG[J]. IEEE Transactions on Affective Computing, 2017, 5(3): 327-339.
[7] Zhang G, Minjing Y U, Chen G, et al. A review of EEG features for emotion recognition[J]. Scientia Sinica (Informationis), 2019.
[8] Frantzidis C A, Bratsas C, Papadelis C L, et al. Toward Emotion Aware Computing: An Integrated Approach Using Multichannel Neurophysiological Recordings and Affective Visual Stimuli[J]. IEEE Transactions on Information Technology in Biomedicine, 2010, 14(3): 589-597.
[9] Duan R N, Zhu J Y, Lu B L. Differential entropy feature for EEG-based emotion classification[C]//Neural Engineering (NER), 2013 6th International IEEE/EMBS Conference on. IEEE, 2013.
[10] Patil A, Panat A, Ragade S A. Classification of human emotions from electroencephalogram using support vector machine[C]//International Conference on Information Processing. IEEE, 2016.
[11] Zheng W L, Zhu J Y, Peng Y, et al. EEG-based emotion classification using deep belief networks[C]//IEEE International Conference on Multimedia & Expo. IEEE, 2014.
[12] Jeevan R K, V M R S P, Shiva Kumar P, Srivikas M. EEG-based emotion recognition using LSTM-RNN machine learning algorithm[C]//2019 1st International Conference on Innovations in Information and Communication Technology (ICIICT), 2019: 1-4.
[13] Yang Y, et al. EEG-Based Emotion Recognition Using Hierarchical Network With Subnetwork Nodes[J]. IEEE Transactions on Cognitive and Developmental Systems, 2017.
[14] Thammasan N, Fukui K I, Numao M. Application of deep belief networks in EEG-based dynamic music-emotion recognition[C]//The International Joint Conference on Neural Networks (IJCNN 2016). IEEE, 2016.
[15] Hasanzadeh F, Shahabi H, Moghimi S, Moghimi A. EEG investigation of the effective brain networks for recognizing musical emotions[J]. Signal and Data Processing Research, 2015, 12(2): 41-54.
[16] Shen F, Dai G, Lin G, et al. EEG-based emotion recognition using 4D convolutional recurrent neural network[J]. Cognitive Neurodynamics, 2020: 1-14.
[17] Jia Z, Lin Y, Cai X, et al. SST-EmotionNet: Spatial-Spectral-Temporal based Attention 3D Dense Network for EEG Emotion Recognition[C]//MM '20: The 28th ACM International Conference on Multimedia. ACM, 2020.
Disclosure of Invention
The invention provides an MSCRNN emotion recognition method and device based on electroencephalogram signals. A 4D feature structure containing time-domain, frequency-domain and spatial-domain information is constructed during electroencephalogram data processing, and emotion recognition is performed through a classification network: a multi-scale convolution module and a self-attention module deeply mine the frequency-domain and spatial-domain information in the 4D features, while an LSTM module mines the time-domain information. The method and device have great practical value in the field of electroencephalogram emotion recognition and are described in detail below:
An MSCRNN emotion recognition method based on electroencephalogram signals, the method comprising:
constructing 4D characteristics for the preprocessed electroencephalogram data;
building a classification network consisting of a multi-scale convolution module, a self-attention module, an LSTM module and other module units;
using the 4D features and the corresponding labels for training a classification network;
the trained classification network is used to evaluate the test set and can be used to recognize the emotion of an actual subject.
The multi-scale convolution module passes the input data through three parallel convolution paths and integrates the outputs of the three paths. It is composed of several convolution modules, each of which comprises a convolution layer, a batch normalization layer and an activation layer.
Let the input data of the i-th path of the l-th convolutional layer for the j-th sample be x_j^{l(i)} ∈ R^{C×H×W}, where C denotes the number of input channels, i.e. the number of input feature maps, and H and W denote the height and width of the input feature maps. Then:

a_j^{l(i)} = f(conv2(x_j^{l(i)}, W^{l(i)}) + b^{l(i)})

where W^{l(i)} denotes the weight tensor of the i-th path of the l-th convolutional layer, b^{l(i)} denotes its bias, a_j^{l(i)} denotes the output of the i-th path of the l-th convolutional layer for the j-th sample, conv2(·) denotes the two-dimensional convolution operation, and f(·) denotes the ReLU activation function.
the batch normalization layer operates as follows:
wherein,denotes the result of normalization of the ith path of the ith BN layer, E [ a ]l(i)]Andmean and variance of the input data, respectively;
and adding a pair of parameters gamma and beta to each neuron, and expressing the output result of the final batch normalization layer as follows:
with the active layer output, the active layer uses the ReLU function;
the output of the multi-scale convolution model is calculated by:
Y=f(g(c((x,dshort),(x,dmedium),(x,dlong))))
where f (. eta.) denotes a ReLU activation function, g (. eta.) denotes a batch normalization operation, c (. eta.) denotes a convolution operation, x denotes input data, dshort、dmediumAnd dlongAnd (4) representing convolution operation of convolution kernels from small to large under different paths.
Further, the self-attention module learns the relationships among the different channels of the feature map and mines the spatial-domain features within it. The input feature map is linearly mapped into three feature spaces; an attention map is computed from the values of the first two feature spaces, and the dot product of this attention map with the linearly mapped values of the third feature space gives the output of the attention layer. Finally, the output of the attention layer is multiplied by a coefficient and added back to the feature map, yielding the feature map screened by the self-attention module:

y_j = γ·o_j + x_j

where o = (o_1, o_2, ..., o_j, ..., o_N) ∈ R^{C×N} is the output of the attention layer and γ is a learnable parameter.
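The description matches a conventional self-attention layer over spatial positions (the structure of FIG. 4). A minimal PyTorch sketch, with the channel-reduction factor of 8 as an assumption not fixed by the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Self-attention over the N = H*W spatial positions of a feature map.

    Three 1x1 convolutions play the roles of the three linear mappings in the text;
    the reduction factor of 8 is an assumption for illustration.
    """
    def __init__(self, ch, reduction=8):
        super().__init__()
        self.query = nn.Conv2d(ch, ch // reduction, 1)  # first feature space
        self.key = nn.Conv2d(ch, ch // reduction, 1)    # second feature space
        self.value = nn.Conv2d(ch, ch, 1)               # third feature space
        self.gamma = nn.Parameter(torch.zeros(1))       # the coefficient gamma

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w
        q = self.query(x).view(b, -1, n)
        k = self.key(x).view(b, -1, n)
        v = self.value(x).view(b, c, n)
        attn = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)  # (B, N, N) attention map
        o = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)    # attention-layer output
        return self.gamma * o + x  # y_j = gamma * o_j + x_j
```

Because γ is initialized to zero, the module initially passes the feature map through unchanged and learns during training how strongly to apply the attention-screened features.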
An MSCRNN emotion recognition apparatus based on electroencephalogram signals, the apparatus comprising a processor and a memory, wherein the memory stores program instructions and the processor calls the program instructions stored in the memory to cause the apparatus to perform the steps of the above method.
The technical scheme provided by the invention has the beneficial effects that:
1. the feature construction and classification network extract the time-domain, frequency-domain and spatial-domain features of the electroencephalogram signals and use them for classification of the electroencephalogram signals;
2. the invention achieves extremely high recognition accuracy on the existing public data sets SEED and SEED-IV and realizes rapid recognition of the emotion corresponding to a single subject's electroencephalogram signal;
3. the invention uses only 2 s of electroencephalogram data when constructing samples, so emotion can be recognized rapidly in practical applications with good real-time performance.
Drawings
FIG. 1 is a schematic view of a 4D feature configuration;
FIG. 2 is the 2D spatial map of the SEED and SEED-IV data sets;
FIG. 3 is a diagram of a 4D-MSCRNN model architecture;
FIG. 4 is a diagram of a conventional self-attention structure;
FIG. 5 is a schematic diagram of the SEED experimental process;
FIG. 6 is a schematic diagram of the SEED-IV experimental process;
FIG. 7 shows the recognition results, wherein (a) is the emotion recognition result on the SEED data set and (b) is the emotion recognition result on the SEED-IV data set.
Fig. 8 is a schematic structural diagram of an MSCRNN emotion recognition apparatus based on electroencephalogram signals.
Table 1 shows network hyper-parameters;
table 2 number of samples generated for three data sets;
table 3 is a table of emotion recognition results.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
A MSCRNN emotion recognition method based on electroencephalogram signals comprises the following steps:
step 101: constructing 4D characteristics for the preprocessed electroencephalogram data, and acquiring characteristic Xn∈Rh×w×d×2TsN is 1,2,. cndot.n; n is the number of segments of the preprocessed electroencephalogram data, h and w are the height and width of a 2D space diagram respectively, D is the number of bands to be divided, 2Ts is the number of segments to be decomposed under each segment, and a corresponding label is combined to divide a data set.
The duration of each segment of electroencephalogram signal is Ts, and each segment can be further divided into 2Ts sections according to 0.5s in the embodiment of the invention. The small sections are uniformly used as input, emotion corresponding to the small sections is used as a label, namely, the original data set is further divided, and a sample is manufactured.
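A minimal sketch of this sample-making step; the 200 Hz sampling rate and the trial layout are assumptions for illustration, not values fixed by the patent:

```python
import numpy as np

def make_samples(eeg, label, fs=200, seg_s=2.0):
    """Split one preprocessed trial (channels x samples) into non-overlapping
    Ts-second segments, each inheriting the trial's emotion label.

    fs = 200 Hz is an assumed sampling rate for illustration.
    """
    step = int(seg_s * fs)
    n = eeg.shape[1] // step                      # number of whole segments
    segments = [eeg[:, i * step:(i + 1) * step] for i in range(n)]
    return segments, [label] * n
```

For example, a 7-second, 62-channel trial at 200 Hz yields three 2-second segments of shape (62, 400), each carrying the trial's label.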
Step 102: building a classification network, and setting model parameters, including: the system comprises a multi-scale convolution module, a self-attention module, an LSTM module and other module units;
step 103: using the 4D features and the corresponding labels obtained in the step 101 for training the classification network in the step 102;
step 104: the trained classification network is used for testing the test set and can be used for emotion recognition of an actual tested emotion.
In summary, the emotion recognition is performed through the classification network, wherein the multi-scale convolution module and the self-attention module are used for deeply mining frequency domain and spatial domain information in the 4D features, and the LSTM module is used for mining time domain information, so that the emotion recognition method and the emotion recognition system have great practical value in the field of electroencephalogram emotion recognition.
Example 2
The scheme of embodiment 1 is further described below with reference to specific calculation formulas and examples:
1. Constructing the 4D features
In order to integrate the time-domain, frequency-domain and spatial-domain information of electroencephalogram signals simultaneously, 4D features containing all three kinds of information are proposed; the 4D feature construction process is shown in FIG. 1.
In order to increase the total number of samples while reducing the time required for emotion recognition in practical applications, the preprocessed EEG data is first divided into N segments of duration Ts, and each segment is given the label of the original data. Each channel of each segment is then decomposed with a 5th-order Butterworth filter into five bands: delta (1-4 Hz), theta (4-8 Hz), alpha (8-14 Hz), beta (14-31 Hz) and gamma (31-50 Hz). A 0.5 s window is then used to extract the Differential Entropy (DE) feature in each band. The DE feature reflects the complexity of the electroencephalogram signal and is widely used in emotion recognition; the differential entropy of a Gaussian distribution is defined as:

h(Z) = (1/2) ln(2πeσ²)

where Z is a variable obeying the Gaussian distribution N(μ, σ²), and π, e and σ denote the circle constant, Euler's number and the standard deviation of the variable, respectively.
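The closed-form value h(Z) = ½·ln(2πeσ²) can be checked numerically against SciPy's entropy of a normal distribution (σ = 1.7 here is an arbitrary example value):

```python
import math
from scipy.stats import norm

sigma = 1.7
# Closed-form differential entropy of N(mu, sigma^2)
de = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)
# SciPy's entropy for the same normal distribution should agree
scipy_de = float(norm(loc=0.0, scale=sigma).entropy())
```

In practice σ² is replaced by the sample variance of the windowed, band-filtered signal, under the Gaussian assumption stated above.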
The DE features of each band of each channel are then mapped into a plane according to the spatial locations of the channels, producing a 2D spatial map. FIG. 2 shows the spatial maps of the SEED and SEED-IV data sets employed in the embodiment of the invention, where 0 indicates that the location contains no channel or that the channel at that location is not used.
The five generated two-dimensional maps of the different bands are stacked to obtain one set of 3D features for the segment. All 3D features of the segment are arranged in the order in which the time window moves, finally yielding the 4D feature X_n ∈ R^{h×w×d×2Ts}, n = 1, 2, ..., N.
Here N is the number of segments of the preprocessed electroencephalogram data; h and w are the height and width of the 2D spatial map, 8 and 9 respectively; d is the number of frequency bands, 5; and 2Ts is the number of sub-segments per segment: if each segment lasts Ts seconds and DE features are extracted with a 0.5 s window, the segment decomposes into 2Ts sub-segments in total. In the embodiment of the invention Ts is 2 s, so each segment decomposes into 4 sub-segments as the time window moves. The obtained 4D features and their corresponding labels are then used for the training and testing of the subsequent classification network.
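The whole construction for one segment can be sketched as follows. The electrode-to-grid mapping and the 200 Hz sampling rate below are hypothetical stand-ins (the real 8 × 9 layout follows FIG. 2):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

# The five bands used in the patent (Hz)
BANDS = [(1, 4), (4, 8), (8, 14), (14, 31), (31, 50)]

def differential_entropy(x):
    """DE of a signal assumed Gaussian: 0.5 * ln(2*pi*e*sigma^2)."""
    return 0.5 * np.log(2 * np.pi * np.e * (np.var(x) + 1e-12))

def build_4d_feature(segment, fs, electrode_grid, win_s=0.5):
    """Turn one Ts-second segment (n_channels x samples) into a (2Ts, 5, 8, 9) feature.

    electrode_grid maps channel index -> (row, col) in the 8x9 plane; positions
    with no channel stay 0, as in FIG. 2. The grid passed in below is hypothetical.
    """
    win_len = int(win_s * fs)
    n_win = segment.shape[1] // win_len
    feat = np.zeros((n_win, len(BANDS), 8, 9))
    for b, (lo, hi) in enumerate(BANDS):
        # 5th-order Butterworth band-pass, applied zero-phase
        sos = butter(5, [lo, hi], btype="bandpass", fs=fs, output="sos")
        filtered = sosfiltfilt(sos, segment, axis=1)
        for ch, (r, c) in electrode_grid.items():
            for t in range(n_win):
                win = filtered[ch, t * win_len:(t + 1) * win_len]
                feat[t, b, r, c] = differential_entropy(win)
    return feat
```

With Ts = 2 s this yields 4 time slots of five 8 × 9 band maps per segment, i.e. exactly the 4D input shape the network below consumes.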
2. Design of the MSCRNN network structure
Taking the SEED data set as an example, fig. 3 shows the 4D-MSCRNN model structure for feature extraction and classification adopted in the embodiment of the present invention. The network is an end-to-end model: the input is the constructed 4D feature map, and the output is one of the positive, neutral and negative emotion categories. The multi-scale convolution module extracts features of different scales from the 4D input; the self-attention module enhances the extracted effective features while reducing network overhead, further extracting frequency-domain and spatial-domain features; and the LSTM part extracts time-domain features.
The network input consists of 4 time-slot parts, each containing a spatial map with the DE features of the 5 frequency bands; the map size is 8 × 9. Because the maps are small, the network parameters are relatively few, and no pooling layer is added at the start of the convolution processing. Taking one time-slot part as an example, the input is 5 feature maps of size 8 × 9. The multi-scale convolution module first extracts and integrates features: after the three parallel convolution paths, channel concatenation yields a 384-channel 8 × 9 feature map, which the self-attention module then screens. The features next enter a fourth convolution layer, whose convolution produces a 128-channel 8 × 9 feature map and reduces the computation. Max pooling further reduces the network parameters, giving a 128-channel 4 × 5 feature map. The same operations are performed on the remaining 3 time-slot parts; a flatten layer then converts the data into 4 × 2560 feature vectors, which one fully connected layer transforms into 4 × 512 feature vectors. These are fed into an LSTM with 128 hidden units, and the output of the last LSTM unit gives the final result, which integrates the time-domain, frequency-domain and spatial-domain features of the 2 s EEG segment. Finally, a fully connected layer produces a 3-dimensional vector, and the softmax function computes the class probabilities to obtain the classification result.
During network training, the output layer of the network provides supervision, and the network parameters are learned according to the cross-entropy loss function. It is worth noting that each module of the model is a shallow neural network, because the data volume of the feature maps is relatively small; an overly complex model would have many parameters, overfit easily and hurt accuracy. The number of output neurons of the last fully connected layer equals the number of emotion categories. The detailed network hyper-parameters of the model are listed in table 1, where FC denotes a fully connected layer, and the activation functions ReLU, linear, tanh and softmax denote the rectified linear unit, the linear function, the hyperbolic tangent and the normalized exponential function, respectively.
TABLE 1 network hyper-parameters
3. Multi-scale convolution module
As shown in the 4D-MSCRNN model structure diagram in fig. 3, the multi-scale convolution module passes the input data through three parallel convolution paths and finally integrates the outputs of the three paths. The multi-scale convolution module is composed of several convolution modules; each convolution module in fig. 3 includes a convolution layer, a batch normalization layer and an activation layer. Let the input data of the ith path of the lth convolutional layer under the jth sample be x_j^{l(i)} ∈ R^(C×H×W), where C denotes the number of input channels, i.e. the number of input feature maps, and H and W denote the height and width of the input feature maps; then:
a_j^{l(i)} = f(conv2(x_j^{l(i)}, W^{l(i)}) + b^{l(i)}) (2)
where W^{l(i)} denotes the weight tensor of the ith path of the lth convolutional layer under the jth sample, b^{l(i)} its bias, a_j^{l(i)} its output, conv2(·) the two-dimensional convolution operation, and f(·) the ReLU activation function.
The main function of the batch normalization (BN) layer is to adjust its input values into a certain range, so that each layer faces inputs with as similar a feature distribution as possible, which reduces the instability caused by distribution change and its influence on later layers. The batch normalization operation is shown below:
â^{l(i)} = (a^{l(i)} − E[a^{l(i)}]) / √(Var[a^{l(i)}] + ε) (3)
where â^{l(i)} denotes the normalized result of the ith path of the lth BN layer, and E[a^{l(i)}] and Var[a^{l(i)}] are the mean and variance of the input data, respectively.
In order to restore the normalized data toward the feature distribution learned by the model and to mitigate overfitting, a pair of parameters γ and β is added for each neuron. The output of the final BN layer can then be expressed as:
BN^{l(i)} = γ·â^{l(i)} + β (4)
and the output of the active layer is output, and the active layer uses the ReLU function, so that the nonlinear learning capability of the network can be enhanced, and the performance of the model can be improved.
The output of the entire multi-scale convolution model can be calculated by:
Y = f(g(c((x, d_short), (x, d_medium), (x, d_long)))) (5)
where f(·) denotes the ReLU activation function, g(·) the batch normalization operation, c(·) the convolution operation, x the input data, and d_short, d_medium and d_long the convolution operations with kernels from small to large on the different paths.
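A minimal PyTorch sketch of the multi-scale convolution module follows. The kernel sizes (1, 3, 5) and the 128 channels per path are assumptions, chosen so that channel concatenation yields the 384-channel 8 × 9 output described above; the patent's table 1 holds the actual hyper-parameters:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """One convolution module: convolution + batch normalization + ReLU."""
    def __init__(self, in_ch, out_ch, k):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)

class MultiScaleConv(nn.Module):
    """Three parallel paths with small/medium/large kernels (d_short,
    d_medium, d_long); the outputs are integrated by channel concatenation."""
    def __init__(self, in_ch=5, path_ch=128, kernels=(1, 3, 5)):
        super().__init__()
        self.paths = nn.ModuleList(ConvBlock(in_ch, path_ch, k) for k in kernels)

    def forward(self, x):
        return torch.cat([p(x) for p in self.paths], dim=1)

x = torch.randn(32, 5, 8, 9)   # a batch of 5-band 8x9 DE maps
y = MultiScaleConv()(x)
print(y.shape)                 # torch.Size([32, 384, 8, 9])
```

Padding of k // 2 keeps the 8 × 9 spatial size on every path, so the three outputs can be concatenated along the channel axis.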
4. Self-attention module
Fig. 4 shows the structure of a conventional self-attention block, which is used in the self-attention module of the invention to screen spatial features. In the image domain, the self-attention mechanism learns the relationship between a pixel and the pixels at all other positions (including distant ones). The embodiment of the invention instead learns the relationships between different channels of the feature map to further mine its spatial-domain features. Let the input feature map be x ∈ R^(C×H×W), where C denotes the number of input channels and H and W the height and width of the input feature map.
The input feature map is first linearly mapped to obtain f(x), g(x) and h(x), where f(x) = W_f x, g(x) = W_g x and h(x) = W_h x. The values of the first two feature spaces are used to compute the attention map:
β_{j,i} = exp(s_{ij}) / Σ_{i=1}^{N} exp(s_{ij}) (6)
where s_{ij} = f(x_i)^T g(x_j), i and j each denote a position, there are N (N = H × W) positions in the feature map in total, and β_{j,i} denotes the weight of the relation from position i to position j. The obtained attention map is multiplied with h(x) to obtain the output of the attention layer, i.e. the self-attention feature map:
o_j = W_v(Σ_{i=1}^{N} β_{j,i} h(x_i)) (7)
where o = (o_1, o_2, ..., o_j, ..., o_N) ∈ R^(C×N) is the output of the attention layer, and W_f, W_g, W_h and W_v are all learned weight matrices, generated by 1 × 1 convolution kernels. The output of the final attention layer is multiplied by a coefficient and added back to the feature map, giving the feature map screened by the self-attention module:
y_i = γ·o_i + x_i (8)
where γ is a learnable parameter initialized to 0 during training, so that the network first relies mainly on neighborhood features and then slowly increases the weight given to distant regions.
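The self-attention module can be sketched in PyTorch as below, following the conventional (SAGAN-style) structure of fig. 4; the channel-reduction factor of 8 in f and g is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Conventional self-attention: f, g, h, v are 1x1 convolutions;
    beta = softmax(f(x)^T g(x)); output y = gamma * o + x, with gamma a
    learnable scalar initialized to 0."""
    def __init__(self, in_ch, reduction=8):
        super().__init__()
        self.f = nn.Conv2d(in_ch, in_ch // reduction, 1)
        self.g = nn.Conv2d(in_ch, in_ch // reduction, 1)
        self.h = nn.Conv2d(in_ch, in_ch, 1)
        self.v = nn.Conv2d(in_ch, in_ch, 1)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, hgt, wid = x.shape
        fx = self.f(x).flatten(2)                     # (B, C/8, N)
        gx = self.g(x).flatten(2)                     # (B, C/8, N)
        hx = self.h(x).flatten(2)                     # (B, C, N)
        s = fx.transpose(1, 2) @ gx                   # s[i, j] = f(x_i)^T g(x_j)
        beta = F.softmax(s, dim=1)                    # normalize over positions i
        o = self.v((hx @ beta).view(b, c, hgt, wid))  # o_j = W_v(sum_i beta_ji h(x_i))
        return self.gamma * o + x                     # residual with learnable gamma

x = torch.randn(4, 384, 8, 9)
y = SelfAttention2d(384)(x)
print(y.shape)            # torch.Size([4, 384, 8, 9])
print(torch.equal(y, x))  # True at initialization, since gamma starts at 0
```

Because γ starts at 0, the module is initially an identity mapping, matching the behavior described above.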
5. Other modules
The channel concatenation module sits after the multi-scale convolution module and mainly integrates feature information. By continually stacking channels it enriches the extracted feature content, so that the output of each layer carries all the information of the input module. Let the number of feature channels on the ith convolution path of the multi-scale convolution be c_i; the number of feature channels z output after the features of all N convolution paths are channel-concatenated is then:
z = Σ_{i=1}^{N} c_i (9)
because the electroencephalogram signal dynamically changes along with time, and information related to emotion may exist in the change of different characteristics of a time window, the LSTM module is adopted to extract the time characteristics. In this experiment, the input data of the LSTM module is a 512-dimensional sequence obtained by 4 time windows respectively, time features are mined by using 128 hidden unit LSTM layers, and a 128-dimensional feature is output by using an output result obtained by using the input feature of the last time window as the output of the network. The feature contains frequency domain and spatial domain information extracted previously, and also contains time domain information extracted by the LSTM module.
After the last fully connected layer of the 4D-MSCRNN model outputs its data, emotion recognition is performed with the softmax function, whose calculation formula is:
y_i = exp(x_i) / Σ_{k=1}^{N} exp(x_k) (10)
where x_i denotes the output of the ith neuron of the last layer of the network, y_i the probability of predicting the ith class, and N the number of emotion classes, equal to the number of neurons in the last layer. The label with the maximum probability is finally taken as the classification result.
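The softmax classification step above can be sketched as:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax: y_i = exp(x_i) / sum_k exp(x_k)
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # last-layer outputs for 3 emotion classes
probs = softmax(logits)
print(probs.argmax())               # index of the maximum-probability label: 0
print(probs.sum())                  # probabilities sum to 1
```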
Example 3
The protocols of examples 1 and 2 are further described below in conjunction with specific experimental data, as described in detail below:
1. SEED data set
The SEED data set consists of 15 subjects (7 males, 8 females, aged 23.27 ± 2.37 years) whose emotions were evoked by watching film clips eliciting three emotions (positive, negative and neutral) while their EEG signals were recorded. The experimental procedure is shown in fig. 5: each experiment comprises 15 trials, 5 for each of the three emotions, with a 15-second rest between trials; after each viewing, the subject completes a self-assessment form to confirm that the corresponding emotion was indeed elicited. Each film clip lasts about 4 minutes, and clips of different emotions are played in random order. The EEG signals were collected with a 62-channel ESI NeuroScan device, with electrodes placed according to the 10-20 system and a sampling rate of 1000 Hz. The data set provides preprocessed EEG signals that have been down-sampled to 200 Hz.
2. SEED-IV data set
The SEED-IV data set also consists of 15 subjects (7 males, 8 females); the recorded signals include EEG and eye-movement signals, of which only the EEG signals are used here. The EEG recording device is the same as that used for the SEED data set. The data were collected while the participants watched film clips evoking four emotions: sadness, neutrality, happiness and fear. Each film clip lasts about two minutes. Each subject took part in three sessions at different times, each session consisting of 24 trials (6 per emotion); the experimental procedure is shown in fig. 6. This data set likewise provides preprocessed EEG signals down-sampled to 200 Hz.
3. Sample construction
After preprocessing and 4D feature construction, the numbers of samples generated from the two data sets are shown in the following table. Taking the SEED data set as an example, by dividing the different trials of each subject into 2 s segments and converting them into 4D features, the numbers of positive, neutral and negative samples per subject are 1749, 1650 and 1677, respectively. For the SEED-IV data set, the numbers of happy, fearful, sad and neutral samples per subject are 1069, 1239, 1370 and 1365, respectively. For each single-subject data set, 5-fold cross-validation is used for training and testing: the mean of the 5 results is taken as the final classification result for that subject, the mean of the 15 single-subject results is taken as the final emotion recognition accuracy of the scheme, and the standard deviation of the 15 results is taken as the final standard deviation of the scheme.
TABLE 2 Number of samples generated for the two data sets
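The 5-fold cross-validation protocol and the aggregation over 15 subjects can be sketched as follows; the evaluate() callback is a placeholder where a real run would train and test the 4D-MSCRNN:

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    # shuffle the sample indices and split them into k disjoint test folds
    idx = np.random.default_rng(seed).permutation(n_samples)
    return np.array_split(idx, k)

def cross_validate(n_samples, evaluate, k=5):
    # average the k fold accuracies returned by evaluate(train_idx, test_idx)
    folds = kfold_indices(n_samples, k)
    accs = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        accs.append(evaluate(train_idx, test_idx))
    return float(np.mean(accs))

# placeholder evaluate(): a real run would train/test the 4D-MSCRNN here
n_samples = 1749 + 1650 + 1677          # one SEED subject's samples
subject_accs = [cross_validate(n_samples, lambda tr, te: 0.95)
                for _ in range(15)]     # one 5-fold result per subject
print(round(np.mean(subject_accs), 4), round(np.std(subject_accs), 4))
```

The scheme-level accuracy is the mean over the 15 per-subject values and the reported spread is their standard deviation, as described above.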
4. Training parameter settings
The model is implemented with the PyTorch framework and trained on an NVIDIA GeForce RTX 3080 Ti GPU. The number of training epochs is set to 100; in each epoch the entire training set goes through one complete training pass. Batch_size is set to 32, and the Adam optimizer is used to minimize the cross-entropy loss function. The keep probability of the dropout operation is 0.5. The optimizer's smoothing constants β1 and β2 are set to 0.5 and 0.99, respectively, and the initial learning rate is set to 2 × 10⁻⁴.
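A skeleton of the training loop with these settings might look as follows; the stand-in linear model replaces the 4D-MSCRNN for brevity:

```python
import torch
import torch.nn as nn

# Training-loop skeleton with the stated hyper-parameters: Adam with
# betas=(0.5, 0.99), learning rate 2e-4, batch size 32, dropout keep
# probability 0.5, cross-entropy loss. `model` is a stand-in classifier.
model = nn.Sequential(nn.Flatten(), nn.Dropout(p=0.5),
                      nn.Linear(5 * 4 * 8 * 9, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.5, 0.99))
criterion = nn.CrossEntropyLoss()

x = torch.randn(32, 5, 4, 8, 9)   # one batch of 4D features
y = torch.randint(0, 3, (32,))    # emotion labels (positive/neutral/negative)
model.train()
for epoch in range(2):            # the experiments train for 100 epochs
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    loss_value = loss.item()
    optimizer.step()
print(loss_value)                 # finite cross-entropy value
```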
5. Analysis of experiments
Figure 7 below shows the single-subject emotion recognition results obtained with the 4D-MSCRNN model on the SEED and SEED-IV data sets. The specific emotion recognition accuracy and standard deviation for each subject are given in table 3.
TABLE 3 Emotion recognition results
On the SEED data set, three-class emotion recognition reaches an accuracy of 95.03% with a standard deviation of 2.65%; on the SEED-IV data set, four-class emotion recognition reaches an accuracy of 84.34% with a standard deviation of 6.49%. The high recognition accuracy on both data sets reflects the generalization capability of the model across different data sets.
An electroencephalogram-signal-based MSCRNN emotion recognition apparatus, see fig. 8, comprising: a processor 1 and a memory 2, the memory 2 having stored therein program instructions, the processor 1 calling the program instructions stored in the memory 2 to cause the apparatus to perform the following method steps of embodiment 1:
constructing 4D characteristics for the preprocessed electroencephalogram data;
building a classification network consisting of a multi-scale convolution module, a self-attention module, an LSTM module and other module units;
using the 4D features and the corresponding labels for training a classification network;
the trained classification network is used for testing the test set and can be used for emotion recognition of an actual tested emotion.
Wherein, the multi-scale convolution module passes the input data through three parallel convolution paths and integrates the outputs of the three paths;
the multi-scale convolution module is composed of several convolution modules, each of which comprises: a convolution layer, a batch normalization layer and an activation layer;
let the input data of the ith path of the lth convolutional layer under the jth sample be x_j^{l(i)} ∈ R^(C×H×W), where C denotes the number of input channels, i.e. the number of input feature maps, and H and W denote the height and width of the input feature maps; then:
a_j^{l(i)} = f(conv2(x_j^{l(i)}, W^{l(i)}) + b^{l(i)})
where W^{l(i)} denotes the weight tensor of the ith path of the lth convolutional layer under the jth sample, b^{l(i)} its bias, a_j^{l(i)} its output, conv2(·) the two-dimensional convolution operation, and f(·) the ReLU activation function;
the batch normalization layer operates as follows:
wherein,shows the result of the normalization of the ith path of the ith BN layer, E [ a ]l(i)]Andmean and variance of the input data, respectively;
each neuron is added with a pair of parameters gamma and beta, and the output result of the final batch normalization layer is expressed as:
the activation layer then produces the output, using the ReLU function;
the output of the multi-scale convolution model is calculated by:
Y=f(g(c((x,dshort),(x,dmedium),(x,dlong))))
where f (. eta.) denotes a ReLU activation function, g (. eta.) denotes a batch normalization operation, c (. eta.) denotes a convolution operation, x denotes input data, dshort、dmediumAnd dlongAnd (4) representing convolution operation of convolution kernels from small to large under different paths.
Wherein, the self-attention module is used to learn the relationships between different channels of the feature map and to mine the spatial-domain features of the feature map:
the input feature map is linearly mapped, and the attention map is obtained from the values of the first two feature spaces; the obtained attention map is multiplied with the linearly mapped values of the third feature space to obtain the output of the attention layer;
the output of the final attention layer is multiplied by a coefficient and added back to the feature map, giving the feature map screened by the self-attention module:
y_i = γ·o_i + x_i
where o = (o_1, o_2, ..., o_j, ..., o_N) ∈ R^(C×N) is the output of the attention layer and γ is a learnable parameter.
It should be noted that the device description in the above embodiments corresponds to the method description in the embodiments, and the embodiments of the present invention are not described herein again.
The processor 1 and the memory 2 may be implemented by devices with computing capability, such as a computer, a single-chip microcomputer or a microcontroller; in the specific implementation, the embodiment of the present invention places no limitation on them, and they are selected according to the needs of the practical application.
The memory 2 and the processor 1 transmit data signals through the bus 3, which is not described in detail in the embodiment of the present invention.
In the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (4)
1. An MSCRNN emotion recognition method based on electroencephalogram signals, characterized by comprising the following steps:
constructing 4D characteristics for the preprocessed electroencephalogram data;
building a classification network consisting of a multi-scale convolution module, a self-attention module, an LSTM module and other module units;
using the 4D features and the corresponding labels for training a classification network;
the trained classification network is used to test the test set and can be used for emotion recognition of an actual subject.
2. The electroencephalogram-signal-based MSCRNN emotion recognition method of claim 1, wherein the multi-scale convolution module passes the input data through three parallel convolution paths and integrates the outputs of the three paths,
the multi-scale convolution module is composed of several convolution modules, each of which comprises: a convolution layer, a batch normalization layer and an activation layer;
let the input data of the ith path of the lth convolutional layer under the jth sample be x_j^{l(i)} ∈ R^(C×H×W), where C represents the number of input channels, i.e. the number of input feature maps, and H and W represent the height and width of the input feature maps; then:
a_j^{l(i)} = f(conv2(x_j^{l(i)}, W^{l(i)}) + b^{l(i)})
where W^{l(i)} denotes the weight tensor of the ith path of the lth convolutional layer under the jth sample, b^{l(i)} its bias, a_j^{l(i)} its output, conv2(·) the two-dimensional convolution operation, and f(·) the ReLU activation function;
the batch normalization layer operates as follows:
wherein,shows the normalized result of the ith path of the ith BN layer,andmean and variance of the input data, respectively;
and adding a pair of parameters gamma and beta to each neuron, and expressing the output result of the final batch normalization layer as follows:
the activation layer then produces the output, using the ReLU function;
the output of the multi-scale convolution model is calculated by:
Y=f(g(c((x,dshort),(x,dmedium),(x,dlong))))
where f (. eta.) denotes a ReLU activation function, g (. eta.) denotes a batch normalization operation, c (. eta.) denotes a convolution operation, x denotes input data, dshort、dmediumAnd dlongAnd (4) representing convolution operation of convolution kernels from small to large under different paths.
3. The electroencephalogram-signal-based MSCRNN emotion recognition method of claim 1, wherein said self-attention module is used to learn the relationships between different channels of the feature map and to mine the spatial-domain features of the feature map:
the input feature map is linearly mapped, and the attention map is obtained from the values of the first two feature spaces; the obtained attention map is multiplied with the linearly mapped values of the third feature space to obtain the output of the attention layer;
the output of the final attention layer is multiplied by a coefficient and added back to the feature map, giving the feature map screened by the self-attention module:
y_i = γ·o_i + x_i
where o = (o_1, o_2, ..., o_j, ..., o_N) ∈ R^(C×N) is the output of the attention layer and γ is a learnable parameter.
4. An electroencephalogram signal-based MSCRNN emotion recognition device, characterized in that the device comprises: a processor and a memory, the memory having stored therein program instructions, the processor calling upon the program instructions stored in the memory to cause the apparatus to perform the method steps of any of claims 1-3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210361451.2A CN114662547A (en) | 2022-04-07 | 2022-04-07 | MSCRNN emotion recognition method and device based on electroencephalogram signals |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114662547A true CN114662547A (en) | 2022-06-24 |
Family
ID=82035665
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210361451.2A Pending CN114662547A (en) | 2022-04-07 | 2022-04-07 | MSCRNN emotion recognition method and device based on electroencephalogram signals |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114662547A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110287320A (en) * | 2019-06-25 | 2019-09-27 | 北京工业大学 | A kind of deep learning of combination attention mechanism is classified sentiment analysis model more |
CN111341386A (en) * | 2020-02-17 | 2020-06-26 | 大连理工大学 | Attention-introducing multi-scale CNN-BilSTM non-coding RNA interaction relation prediction method |
Non-Patent Citations (3)
Title |
---|
FANGYAO SHEN ET AL: "EEG-based emotion recognition using 4D convolutional recurrent neural network", SPRINGER, 14 September 2020 (2020-09-14), pages 1 - 14 *
FANG ZHIJUN: "TensorFlow Application Case Tutorial", 31 August 2020, China Railway Press, pages: 173 *
XUAN YINGLÜ ET AL: "LSTM time-series classification based on multi-scale convolution and attention mechanism", Journal of Computer Applications, 24 January 2022 (2022-01-24), pages 1 - 13 *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115251909A (en) * | 2022-07-15 | 2022-11-01 | 山东大学 | Electroencephalogram signal hearing assessment method and device based on space-time convolutional neural network |
CN115251909B (en) * | 2022-07-15 | 2024-04-30 | 山东大学 | Method and device for evaluating hearing by electroencephalogram signals based on space-time convolutional neural network |
CN117332317A (en) * | 2023-10-23 | 2024-01-02 | 昆明理工大学 | EEG emotion recognition method combining attention residual error network with LSTM |
CN117332317B (en) * | 2023-10-23 | 2024-04-19 | 昆明理工大学 | EEG emotion recognition method combining attention residual error network with LSTM |
CN117494013A (en) * | 2023-12-29 | 2024-02-02 | 南方医科大学南方医院 | Multi-scale weight sharing convolutional neural network and electroencephalogram emotion recognition method thereof |
CN117494013B (en) * | 2023-12-29 | 2024-04-16 | 南方医科大学南方医院 | Multi-scale weight sharing convolutional neural network and electroencephalogram emotion recognition method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | | |
SE01 | Entry into force of request for substantive examination | | |
CB03 | Change of inventor or designer information | Inventor after: Gan Lin; Huang Xiangdong; Liu Zeyu. Inventor before: Huang Xiangdong; Liu Zeyu; Gan Lin. |