CN115444420A - CCNN and stacked-BilSTM-based network emotion recognition method - Google Patents
CCNN and stacked-BiLSTM-based network emotion recognition method
- Publication number
- CN115444420A CN115444420A CN202211106511.2A CN202211106511A CN115444420A CN 115444420 A CN115444420 A CN 115444420A CN 202211106511 A CN202211106511 A CN 202211106511A CN 115444420 A CN115444420 A CN 115444420A
- Authority
- CN
- China
- Prior art keywords
- emotion recognition
- frequency
- electroencephalogram
- space
- bilstm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/16—Devices for psychotechnics; Testing reaction times ; Devices for evaluating the psychological state
- A61B5/165—Evaluating the state of mind, e.g. depression, anxiety
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/24—Detecting, measuring or recording bioelectric or biomagnetic signals of the body or parts thereof
- A61B5/316—Modalities, i.e. specific diagnostic methods
- A61B5/369—Electroencephalography [EEG]
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/24—Detecting, measuring or recording bioelectric or biomagnetic signals of the body or parts thereof
- A61B5/316—Modalities, i.e. specific diagnostic methods
- A61B5/369—Electroencephalography [EEG]
- A61B5/372—Analysis of electroencephalograms
- A61B5/374—Detecting the frequency distribution of signals, e.g. detecting delta, theta, alpha, beta or gamma waves
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/72—Signal processing specially adapted for physiological signals or for diagnostic purposes
- A61B5/7203—Signal processing specially adapted for physiological signals or for diagnostic purposes for noise prevention, reduction or removal
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/72—Signal processing specially adapted for physiological signals or for diagnostic purposes
- A61B5/7235—Details of waveform analysis
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/72—Signal processing specially adapted for physiological signals or for diagnostic purposes
- A61B5/7235—Details of waveform analysis
- A61B5/725—Details of waveform analysis using specific filters therefor, e.g. Kalman or adaptive filters
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/72—Signal processing specially adapted for physiological signals or for diagnostic purposes
- A61B5/7235—Details of waveform analysis
- A61B5/7264—Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems
- A61B5/7267—Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems involving training the classification device
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- Measurement And Recording Of Electrical Phenomena And Electrical Characteristics Of The Living Body (AREA)
Abstract
The invention relates to a CCNN and stacked-BiLSTM-based network emotion recognition method, and belongs to the technical field of electroencephalogram (EEG) emotion recognition. The method first preprocesses the data: a Butterworth filter decodes the four emotion-related frequency bands theta, alpha, beta, and gamma, and a 0.5 s non-overlapping sliding window is applied to the 60 s of physiological signal data recorded under audio stimulation, while a moving-average method suppresses the fluctuation caused by artifacts and noise during the experimental paradigm. Frequency-domain differential entropy features extracted on the four bands are then mapped, via a brain-plane topographic map, onto the spatial structure of the emotion-related EEG channels. The resulting windowed frequency and space features are fed in parallel into continuous convolutional neural networks to extract higher-semantic space-frequency features, and finally a stacked bidirectional long short-term memory network fully learns information on past and future time slices.
Description
Technical Field
The invention relates to a CCNN and stacked-BiLSTM-based network emotion recognition method, and belongs to the technical field of electroencephalogram emotion recognition.
Background
Human beings, as high-level creatures, have complex mental activities and can mask their real emotional state with facial expressions, voice, and body movements. Brain science is a recognized frontier of science and technology. Brain waves generated by the brain regions are objective and reflect a person's actual state under given conditions, so EEG-based emotion recognition is one of the current tasks in electroencephalogram research and has important practical significance, with wide application in industrial control, medical assistance, game entertainment, and other fields. Emotion-related EEG signals require external stimulation to evoke brain waves carrying the corresponding information. However, brain-wave amplitudes are tiny relative to external noise, so acquisition is always accompanied by noise and by artifacts such as electrooculogram and electrocardiogram signals. Meanwhile, EEG data for emotion recognition are very limited, and network learning often suffers from feature loss or insufficient learning, so researchers have carried out extensive experimental work to design models with high recognition accuracy and robustness. Extracting a relatively large quantity of high-quality emotion-recognition features from the EEG is therefore very important. Experimental studies nonetheless find three problems: 1. network training takes a long time and computation is costly; 2. rich features cannot be obtained from the high-dimensional EEG signal; 3. emotion-recognition EEG data are sparse, so some features cannot be extracted once the data enter the network, and training falls short of the expected effect. Reducing external interference while letting the neural network learn more and richer EEG characteristics thus remains a challenge.
Disclosure of Invention
The invention aims to solve the technical problem of providing a CCNN and stacked-BiLSTM-based network emotion recognition method that addresses the above problems and extracts richer, more comprehensive feature information for human-computer interaction systems.
The technical scheme of the invention is as follows. A network emotion recognition method based on CCNN (multi-parallel continuous convolutional neural network) and stacked-BiLSTM (stacked bidirectional long short-term memory) comprises the following specific steps:
step1: collecting an original brain electrical signal, and preprocessing the original brain electrical signal.
The preprocessing specifically comprises: decoding the four EEG bands theta, alpha, beta, and gamma with a Butterworth band-pass filter, and increasing the number of samples with a 0.5 s non-overlapping sliding window.
In the experiment, the 63 s of data from each subject are decoded with a Butterworth band-pass filter to obtain the four emotion-related EEG bands theta, alpha, beta, and gamma, and a 0.5 s non-overlapping sliding window increases the number of samples so that the network can learn sufficiently.
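As a sketch of this preprocessing step (the band cut-off frequencies and filter order are assumptions — the text names only the Butterworth filter, the four bands, and the 0.5 s window):

```python
import numpy as np
from scipy.signal import butter, filtfilt

# Band limits in Hz are an assumption; the patent only names theta/alpha/beta/gamma.
BANDS = {"theta": (4, 8), "alpha": (8, 14), "beta": (14, 31), "gamma": (31, 45)}

def decompose_bands(eeg, fs=128, order=3):
    """Band-pass one EEG channel into the 4 emotion-related bands
    with a zero-phase Butterworth filter."""
    out = {}
    for name, (lo, hi) in BANDS.items():
        b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        out[name] = filtfilt(b, a, eeg)
    return out

def sliding_windows(sig, fs=128, win_s=0.5):
    """Cut a signal into non-overlapping 0.5 s windows (any remainder is dropped)."""
    step = int(fs * win_s)
    n = len(sig) // step
    return np.stack([sig[i * step:(i + 1) * step] for i in range(n)])
```

At the DEAP sampling rate of 128 Hz, the 63 s record (8064 samples) yields 126 windows of 64 samples each.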
Step2: acquire the frequency-domain feature differential entropy (DE) on the four bands; reduce the EEG fluctuation caused by noise, artifacts, and similar factors during emotion recognition with a moving-average method by computing the average differential entropy of the first 3 s baseline on each of the 4 bands; then map the smoothed differential entropy features on each band onto a brain-plane topographic map associated with emotion recognition to obtain the space-frequency features.
Step3: the acquired space-frequency features are input into a multi-parallel continuous convolutional neural network to fully learn more higher-semantic space-frequency features. Because the EEG is a dynamic time-series signal, hidden information useful for emotion classification may exist over time, so a stacked bidirectional long short-term memory network (stacked-BiLSTM) is used to fully learn past and future feature information on different time slices, realizing accurate decoding of EEG-based emotion recognition information.
The Step2 specifically comprises the following steps:

Acquire the frequency-domain differential entropy feature on each band through formula (1):

h(X) = −∫_a^b p(x) ln p(x) dx   (1)

where p(x) is the probability density function of the continuous signal and [a, b] is its value interval. A fixed-length EEG segment is assumed to approximately obey a Gaussian distribution N(μ, σ²), in which case the differential entropy reduces to formula (2):

h(X) = (1/2) ln(2πeσ²)   (2)

Then calculate the average differential entropy of the first 3 s baseline on each of the 4 bands and, with the moving-average method, subtract it from every 0.5 s sliding-window feature on the corresponding band, formula (3):

DE′(k, j) = DE(k, j) − (1/N) Σ_{i=1}^{N} DE_base(i, j)   (3)

where k indexes the sliding windows under musical stimulation, j indexes the 4 frequency bands, i indexes the sliding windows of the baseline signal, and N is the number of baseline windows in the first 3 s.
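A minimal sketch of this feature-smoothing step under the Gaussian assumption (the six baseline windows follow from cutting the 3 s baseline into 0.5 s pieces; that windowing detail is an assumption):

```python
import numpy as np

def gaussian_de(window):
    """Differential entropy of a window assumed ~ N(mu, sigma^2):
    DE = 0.5 * ln(2*pi*e*sigma^2)."""
    return 0.5 * np.log(2 * np.pi * np.e * np.var(window))

def baseline_corrected_de(trial_windows, baseline_windows):
    """Subtract the mean baseline DE (e.g. six 0.5 s windows from the
    first 3 s) from the DE of every trial window on the same band."""
    base = np.mean([gaussian_de(w) for w in baseline_windows])
    return np.array([gaussian_de(w) for w in trial_windows]) - base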
The aim is to make the acquired brain waves less subject to fluctuation caused by noise, artifacts, and similar factors. The DEAP database used by the invention records 32-channel EEG signals; such high-dimensional signals carry rich spatial information, so the invention adopts an 8×9 brain-plane topographic map associated with emotion recognition to capture it.
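One way to sketch this mapping — the (row, col) grid positions below are hypothetical placeholders, since the true scalp-geometry layout of the 32 DEAP channels on the 8×9 plane is not spelled out in the text:

```python
import numpy as np

# Hypothetical grid positions for a few of the 32 DEAP channels;
# the real 8x9 layout follows the electrode positions on the scalp.
CHANNEL_POS = {"Fp1": (0, 3), "Fp2": (0, 5), "F3": (2, 2), "F4": (2, 6),
               "C3": (3, 2), "Cz": (3, 4), "C4": (3, 6), "O1": (7, 3)}

def to_topo_map(de_values):
    """Scatter per-channel DE features onto an 8x9 brain-plane map
    (cells with no electrode stay 0); one such map per frequency band."""
    grid = np.zeros((8, 9))
    for ch, v in de_values.items():
        r, c = CHANNEL_POS[ch]
        grid[r, c] = v
    return grid
```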
The Step3 is specifically as follows:
the space-frequency features are put into a plurality of parallel continuous convolutional neural network architectures, each continuous convolutional neural network comprises three different convolutional layers, namely 64 convolutional kernels of 5*5 size, 128 convolutional kernels of 4*4 size and 256 convolutional kernels of 4*4 size, and the three convolutional layers are built together to form four convolutional blocks. In order to prevent the convergence of the deep neural network from being slower and slower due to the fact that the data distribution approaches to the two ends of the upper limit and the lower limit of the nonlinear data distribution along with the deepening of the network, a batch of normalization layers (BN layers) are added after each convolution layer, input values of each layer of neural network are put into a standard normal distribution with the mean value of 0 and the variance of 1, the output of the network is not large, meanwhile, a large gradient can be obtained, and the training speed is accelerated. After 4 convolution blocks, a 64-1*1 two-dimensional convolution layer is used for realizing the interaction of channel information and reducing the parameters of the network, a 2*2 maximum pooling layer (maxporoling) is used for feature compression, a Flatten layer is used for carrying out multidimensional input one-dimensional operation, a Dense layer is placed at the tail of a continuous convolution neural network for carrying out refitting, and the loss of feature information is reduced.
The space-frequency features output by the multiple parallel continuous networks are joined with a Concatenate function and fed into the stacked bidirectional long short-term memory network, whose LSTM cells follow formulas (4)–(9), so that information on past and future time slices can be learned more fully. Specifically:
f_t = σ(W_f·[h_{t-1}, x_t] + b_f)   (4)

i_t = σ(W_i·[h_{t-1}, x_t] + b_i)   (5)

C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C)   (6)

C_t = f_t * C_{t-1} + i_t * C̃_t   (7)

o_t = σ(W_o·[h_{t-1}, x_t] + b_o)   (8)

h_t = o_t * tanh(C_t)   (9)
where f_t is the value of the forget gate, h_{t-1} the hidden state at the previous time step, x_t the input at the current time step, i_t the value of the memory gate, C̃_t the temporary cell state, C_t the cell state at the current time step, C_{t-1} the previous cell state, o_t the value of the output gate, and h_t the hidden state.
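A single LSTM step implementing these gate equations can be written directly in NumPy (the layout of the weight matrices W and biases b over the concatenated [h_{t-1}, x_t] is an assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell update following formulas (4)-(9); W maps the
    concatenated [h_{t-1}, x_t] to each gate."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])     # forget gate, (4)
    i_t = sigmoid(W["i"] @ z + b["i"])     # memory (input) gate, (5)
    c_hat = np.tanh(W["c"] @ z + b["c"])   # temporary cell state, (6)
    c_t = f_t * c_prev + i_t * c_hat       # cell state update, (7)
    o_t = sigmoid(W["o"] @ z + b["o"])     # output gate, (8)
    h_t = o_t * np.tanh(c_t)               # hidden state, (9)
    return h_t, c_t
```

With all weights and biases at zero, every gate outputs 0.5 and the candidate state is 0, so c_t = 0.5·c_{t-1} — a quick consistency check on the update rule.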
Finally, classification prediction is performed on the data with the softmax function:

S_i = exp(Z_i) / Σ_{j=1}^{n} exp(Z_j)

where Z_i is the output value of the i-th node and n is the number of output nodes, i.e. the number of classes. The softmax function converts the multi-class output values into a probability distribution over [0, 1] that sums to 1.
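The prediction head reduces to a few lines (the max-subtraction is a standard numerical-stability trick, not something the text specifies):

```python
import numpy as np

def softmax(z):
    """Map the n output-node values Z_i to a probability distribution on
    [0, 1] summing to 1; subtracting the max avoids overflow in exp."""
    e = np.exp(z - np.max(z))
    return e / e.sum()
```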
The beneficial effects of the invention are as follows. Compared with the prior art, the invention uses differential entropy features and brain topographic maps to obtain rich space-frequency features as input, then applies the multi-parallel continuous convolutional neural network with stacked bidirectional long short-term memory to obtain richer and more comprehensive feature information, achieving high accuracy on the recognition and classification task and providing a new idea for better emotion recognition.
Drawings
FIG. 1 is a diagram of a recognition neural network framework in an embodiment of the present invention;
FIG. 2 is a plan topographical view of the brain in an embodiment of the present invention;
FIG. 3 is a space-frequency diagram of 4 frequency bands in an embodiment of the present invention;
FIG. 4 is a convolution block diagram in an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following drawings and detailed description.
Example 1: as shown in fig. 1-4, a network emotion recognition method based on CCNN and stacked-BiLSTM includes the following specific steps:
step1: collecting original electroencephalogram signals, and preprocessing the original electroencephalogram signals;
the pretreatment specifically comprises the following steps: decoding theta, alpha, beta and gamma4 electroencephalogram wave bands by using a Butterworth band-pass filter, and increasing the number of samples by adopting a 0.5s non-overlapping sliding window;
Step2: acquire the frequency-domain feature differential entropy on the four bands; reduce the EEG fluctuation in the emotion recognition process with a moving-average method by computing the average differential entropy of the first 3 s baseline on each of the 4 bands; then map the smoothed differential entropy features on each band onto a brain-plane topographic map associated with emotion recognition to obtain its space-frequency features;
Step3: the obtained space-frequency features are input into a multi-parallel continuous convolutional neural network to learn higher-semantic space-frequency features, and a stacked bidirectional long short-term memory network (stacked-BiLSTM) then learns past and future feature information on different time slices, realizing accurate decoding of EEG-based emotion recognition information.
The following experiments and evaluations were carried out:
1. Experimental data and hyper-parameter settings
The public DEAP data set is used. Its experimental paradigm collected 32 healthy participants (16 male, 16 female); each watched forty one-minute music videos. Each participant's record contains physiological signal data of size 40 × 40 × 8064 (40 music videos, 40 electrode channels, 8064 sampling points) and, after each viewing, self-assessment labels of size 40 × 4 (four rated dimensions per trial: valence, arousal, dominance, and liking). Of the 40 recorded channels, the first 32 are EEG channels and the last 8 are other physiological channels; the original sampling rate was 512 Hz. The experiments use the officially preprocessed version of the data set, down-sampled to 128 Hz with electrooculogram and other noise removed, and test the valence and arousal dimensions. The experimental environment is Python 3.7 with an NVIDIA GeForce GTX 1660 Ti GPU and an 11th Gen Intel(R) Core(TM) i5-11400F CPU. The entire neural network is implemented with the TensorFlow framework.
The data set is evaluated with 5-fold cross-validation, training 100 epochs with a learning rate of 0.001; the data are divided into 6 parts and trained simultaneously with 6 parallel continuous convolutional neural networks.
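The sample counts implied by this setup can be checked with simple arithmetic — assuming the 0.5 s windows are cut from the 60 s trial portion and that folds are drawn from the pooled windows of all subjects, neither of which the text states explicitly:

```python
def sample_counts(subjects=32, trials=40, trial_s=60, win_s=0.5, folds=5):
    """Window and fold sizes implied by the experimental setup
    (pooled cross-subject split is an assumption)."""
    per_trial = int(trial_s / win_s)        # 0.5 s windows per 60 s trial
    total = subjects * trials * per_trial   # samples after sliding window
    return per_trial, total, total // folds

per_trial, total, per_fold = sample_counts()
```

Under these assumptions each trial yields 120 windows, for 153,600 samples overall and 30,720 per fold.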
2. Comparative analysis of experimental results
To verify the advantages of the invention, comparisons were made across different models. The experiments demonstrate that the proposed method is of practical significance.
Table 1: accuracy of classification model
As shown in Table 1, compared with conventional machine learning models such as SVM and ANN, the accuracy improves by about 20%, showing that the deep learning model extracts more detailed emotional features and spatial-structure features. The invention also examines advanced deep learning models such as Bi-LSTM and CNN-LSTM; compared with the single Bi-LSTM and the related fusion model CNN-LSTM, the proposed model shows a substantial improvement, since the hybrid model not only fully extracts the space-frequency features but also captures the dynamic temporal characteristics of the EEG.
The invention uses a multi-parallel branch structure to reduce computing cost and training time: the brain scalp map is mapped into a spatial information structure, continuous convolutional neural networks then obtain fuller, higher-semantic features, and the stacked bidirectional long short-term memory network learns information on past and future time slices. On the DEAP public database, the average recognition accuracy of the trained model reaches 94.8% and 95% on the Valence and Arousal dimensions respectively, fully learning relatively comprehensive feature information while reducing training time. The approach avoids the loss of some features, obtains more complete and comprehensive space-frequency feature information, and shortens network running time once a large amount of sample data is available.
While the present invention has been described in detail with reference to the embodiments, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.
Claims (3)
1. A CCNN and stacked-BiLSTM-based network emotion recognition method, characterized by comprising the following steps:
step1: collecting original electroencephalogram signals, and preprocessing the original electroencephalogram signals;
the pretreatment specifically comprises the following steps: decoding theta, alpha, beta and gamma4 electroencephalogram wave bands by using a Butterworth band-pass filter, and increasing the number of samples by adopting a 0.5s non-overlapping sliding window;
Step2: acquire the frequency-domain feature differential entropy on the four bands; reduce the EEG fluctuation in the emotion recognition process with a moving-average method by computing the average differential entropy of the first 3 s baseline on each of the 4 bands; then map the smoothed differential entropy features on each band onto a brain-plane topographic map associated with emotion recognition to obtain its space-frequency features;
Step3: the obtained space-frequency features are input into a multi-parallel continuous convolutional neural network to learn higher-semantic space-frequency features, and a stacked bidirectional long short-term memory network (stacked-BiLSTM) then learns past and future feature information on different time slices, realizing accurate decoding of EEG-based emotion recognition information.
2. The CCNN and stacked-BiLSTM-based network emotion recognition method of claim 1, wherein Step2 specifically is:

acquiring the frequency-domain differential entropy feature on each band through formula (1):

h(X) = −∫_a^b p(x) ln p(x) dx   (1)

where p(x) is the probability density function of the continuous signal and [a, b] is its value interval; a fixed-length EEG segment is assumed to approximately obey a Gaussian distribution N(μ, σ²), in which case the differential entropy reduces to formula (2):

h(X) = (1/2) ln(2πeσ²)   (2)

then calculating the average differential entropy of the first 3 s baseline on each of the 4 bands and, with the moving-average method, subtracting it from every 0.5 s sliding-window feature on the corresponding band, formula (3):

DE′(k, j) = DE(k, j) − (1/N) Σ_{i=1}^{N} DE_base(i, j)   (3)

where k indexes the sliding windows under musical stimulation, j indexes the 4 frequency bands, i indexes the sliding windows of the baseline signal, and N is the number of baseline windows in the first 3 s.
3. The CCNN and stacked-BiLSTM-based network emotion recognition method of claim 1, wherein Step3 specifically is:

connecting the space-frequency features output by the multiple parallel continuous networks with a Concatenate function and feeding them into the stacked bidirectional long short-term memory network according to formulas (4)–(9), specifically:

f_t = σ(W_f·[h_{t-1}, x_t] + b_f)   (4)

i_t = σ(W_i·[h_{t-1}, x_t] + b_i)   (5)

C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C)   (6)

C_t = f_t * C_{t-1} + i_t * C̃_t   (7)

o_t = σ(W_o·[h_{t-1}, x_t] + b_o)   (8)

h_t = o_t * tanh(C_t)   (9)

where f_t is the value of the forget gate, h_{t-1} the hidden state at the previous time step, x_t the input at the current time step, i_t the value of the memory gate, C̃_t the temporary cell state, C_t the cell state at the current time step, C_{t-1} the previous cell state, o_t the value of the output gate, and h_t the hidden state;

finally, classification prediction is performed on the data with the softmax function:

S_i = exp(Z_i) / Σ_{j=1}^{n} exp(Z_j)

where Z_i is the output value of the i-th node and n is the number of output nodes, i.e. the number of classes; the softmax function converts the multi-class output values into a probability distribution over [0, 1] that sums to 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211106511.2A CN115444420A (en) | 2022-09-12 | 2022-09-12 | CCNN and stacked-BilSTM-based network emotion recognition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211106511.2A CN115444420A (en) | 2022-09-12 | 2022-09-12 | CCNN and stacked-BilSTM-based network emotion recognition method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115444420A true CN115444420A (en) | 2022-12-09 |
Family
ID=84303607
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211106511.2A Pending CN115444420A (en) | 2022-09-12 | 2022-09-12 | CCNN and stacked-BilSTM-based network emotion recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115444420A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116898439A (en) * | 2023-07-07 | 2023-10-20 | 湖北大学 | Emotion recognition method and system for analyzing brain waves by deep learning model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Chen et al. | Accurate EEG-based emotion recognition on combined features using deep convolutional neural networks | |
CN106886792B (en) | Electroencephalogram emotion recognition method for constructing multi-classifier fusion model based on layering mechanism | |
CN110969108B (en) | Limb action recognition method based on autonomic motor imagery electroencephalogram | |
CN112381008B (en) | Electroencephalogram emotion recognition method based on parallel sequence channel mapping network | |
CN114052735B (en) | Deep field self-adaption-based electroencephalogram emotion recognition method and system | |
CN110353702A (en) | A kind of emotion identification method and system based on shallow-layer convolutional neural networks | |
CN110598793B (en) | Brain function network feature classification method | |
CN112800998B (en) | Multi-mode emotion recognition method and system integrating attention mechanism and DMCCA | |
CN113729707A (en) | FECNN-LSTM-based emotion recognition method based on multi-mode fusion of eye movement and PPG | |
Pan et al. | Emotion recognition based on EEG using generative adversarial nets and convolutional neural network | |
Zheng et al. | Adaptive neural decision tree for EEG based emotion recognition | |
CN111184509A (en) | Emotion-induced electroencephalogram signal classification method based on transfer entropy | |
CN111407243A (en) | Pulse signal pressure identification method based on deep learning | |
An et al. | Electroencephalogram emotion recognition based on 3D feature fusion and convolutional autoencoder | |
Jinliang et al. | EEG emotion recognition based on granger causality and capsnet neural network | |
Cheng et al. | Emotion recognition algorithm based on convolution neural network | |
Deng et al. | EEG-based emotion recognition via capsule network with channel-wise attention and LSTM models | |
CN109222966A (en) | A kind of EEG signals sensibility classification method based on variation self-encoding encoder | |
Xie et al. | Electroencephalogram emotion recognition based on a stacking classification model | |
CN114662547A (en) | MSCRNN emotion recognition method and device based on electroencephalogram signals | |
Wang et al. | Multiband decomposition and spectral discriminative analysis for motor imagery BCI via deep neural network | |
CN113951883B (en) | Gender difference detection method based on electroencephalogram signal emotion recognition | |
CN115444420A (en) | CCNN and stacked-BilSTM-based network emotion recognition method | |
Cao et al. | Emotion recognition of single-electrode EEG based on multi-feature combination in time-frequency domain | |
CN111466930A (en) | Audio-visual evoked emotion recognition method and system based on electroencephalogram signals |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||