CN112989920A - Electroencephalogram emotion classification system based on frame-level feature distillation neural network


Info

Publication number
CN112989920A
CN112989920A
Authority
CN
China
Prior art keywords
network
frame
networks
distillation
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011575538.7A
Other languages
Chinese (zh)
Other versions
CN112989920B (en)
Inventor
李冬冬
王喆
朱逸文
顾天昊
杜文莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China University of Science and Technology
Original Assignee
East China University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China University of Science and Technology
Priority to CN202011575538.7A
Publication of CN112989920A
Application granted
Publication of CN112989920B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2218/00 Aspects of pattern recognition specially adapted for signal processing
    • G06F 2218/12 Classification; Matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The invention discloses an electroencephalogram (EEG) emotion classification system based on a frame-level feature distillation neural network. The collected multi-channel EEG signals are first preprocessed by framing and channel concatenation. A frame-level distillation network then classifies and models the preprocessed data. Training samples are input for iterative optimization of the model, with the network parameters optimized by a back propagation algorithm and a distillation loss function. The prediction labels of the multiple networks are combined and trained with a one-dimensional convolutional neural network, which assigns different weights to the different networks and finally gives the ensemble decision of the networks. Finally, the test-set signals are input into the trained neural network for classification and recognition. By mining the internal temporal relations between signal frames and performing iterative multi-network distillation, the method continuously refines the features; using several simple neural networks reduces training time and improves the discriminability of the extracted features.

Description

Electroencephalogram emotion classification system based on frame-level feature distillation neural network
The technical field is as follows:
The invention relates to the technical field of signal processing, and in particular to an electroencephalogram emotion classification system based on a frame-level feature distillation neural network.
Background art:
Emotion recognition seeks to understand the psychological state of a subject, and automatic emotion recognition has been widely applied across many fields over the past decades. With the rise of deep learning methods, automatic emotion recognition has matured into a representative field known as "affective computing". Recently, studies of physiological signals such as the electroencephalogram (EEG), electromyogram (EMG), and electrocardiogram (ECG) have become one of the most exciting new directions in emotion recognition. Researchers have also built public EEG emotion classification databases such as DEAP and DREAMER, enabling later work to focus on the design of emotion recognition algorithms.
In recent years there have been many studies of EEG-based emotion recognition, but these studies still have the following drawbacks. First, EEG emotion features are usually designed by hand, without exploiting the representational power of deep learning; researchers and users must be familiar with, or even master, many different features in order to select suitable ones for the EEG signals they collect. Second, most previous work focuses on emotion classification of full-length EEG recordings; because human brain activity is rapid and frequent, emotion recognition on full-length signals is often redundant. Finally, most studies concentrate on correlations across the acquired EEG channels while ignoring the correlations between individual frames of the signal.
For the EEG emotion recognition task, many machine learning modeling techniques have been applied, including logistic regression, multi-layer perceptrons, and support vector machines. However, previous work does not resolve the drawbacks above. To solve these problems, a Frame-Level Distillation Neural Network (FLDNet) for EEG emotion classification is proposed. FLDNet improves performance in two respects: a frame-level gate unit mines the temporal correlations between signal frames, and a teacher-student scheme is introduced across three basic networks to automatically extract high-level EEG features for emotion recognition. In this process each basic network refines the features obtained from its teacher into improved new features; the process iterates from network to network, and each student network can be regarded as re-learning the knowledge of its teacher. This architecture is therefore named the frame-level feature distillation neural network (FLDNet).
Disclosure of Invention
In order to recognize emotion from electroencephalogram signals, the invention provides an electroencephalogram emotion classification system based on a frame-level feature distillation neural network.
In order to achieve the above object, the present invention provides an electroencephalogram emotion classification system based on a frame-level feature distillation neural network, comprising the following steps:
S1, perform framing and channel-concatenation preprocessing on the collected multi-channel EEG signals.
S2, build a new frame-level distillation network to classify and model the preprocessed data; by using several simple neural networks, the method reduces training time, converts the transverse depth of a deep neural network into longitudinal depth, and improves the discriminability of the extracted features.
S3, input training samples for iterative optimization of the model, using a back propagation algorithm and a distillation loss function to optimize the network parameters.
S4, combine the prediction labels of the multiple networks, train the combined labels with a one-dimensional convolutional neural network that assigns different weights to the different networks, and finally give the ensemble decision of the networks.
The aim of the invention is to mine the inter-frame correlation information in the data with a frame-level gate unit and to obtain condensed, abstract data representations through multi-network feature distillation, thereby improving the accuracy of EEG emotion recognition. The whole pipeline of data processing, feature engineering, and neural network classification constitutes the proposed EEG emotion classification system based on a frame-level feature distillation neural network, intended to support research on emotion recognition problems.
In a preferred embodiment of the present invention, step S1 includes the steps of:
S11, when collecting data, first record an EEG signal of a certain duration as the subject's baseline, and subtract this resting-period baseline signal from all acquired EEG recordings;
S12, segment the multi-channel EEG signal so that each segment is a 3 s multi-channel EEG signal; that is, an EEG signal with a total length of x seconds is decomposed into x/3 segments, each segment forming a frame denoted $F_t$, so the full-length EEG signal is converted into the multi-frame representation $\langle F_1, F_2, \ldots, F_N \rangle$, where N is the number of frames;
S13, combine the channels of the multi-frame EEG representation: the signals of each frame are concatenated into a one-dimensional vector in channel order, and the dimension of the spliced vector is D;
S14, repeat S12-S13 to obtain N feature vectors of dimension D; arranged vertically these form an N × D matrix, and taking the number of samples S into account, the EEG signal is finally represented as an S × N × D three-dimensional tensor.
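For concreteness, a minimal preprocessing sketch in Python follows. It is an illustration rather than the patented implementation: the NumPy array shapes, the hypothetical 128 Hz sampling rate, and the mean-baseline subtraction are our assumptions; it produces the S × N × D tensor described in S11-S14.

```python
import numpy as np

def preprocess_eeg(raw, baseline, fs=128, frame_sec=3):
    """Turn raw multi-channel EEG into the S x N x D frame tensor.

    raw      : (S, C, T) array, S trials, C channels, T samples
    baseline : (S, C, T_b) array, resting-period recording per trial
    """
    # S11: subtract the mean resting-period (baseline) signal per channel
    raw = raw - baseline.mean(axis=2, keepdims=True)

    S, C, T = raw.shape
    frame_len = fs * frame_sec              # samples in one 3 s frame
    N = T // frame_len                      # number of frames per trial
    raw = raw[:, :, :N * frame_len]         # drop any trailing remainder

    # S12: split each channel into N frames of 3 s -> (S, C, N, frame_len)
    frames = raw.reshape(S, C, N, frame_len)

    # S13-S14: flatten the channels of each frame in channel order,
    # giving one D = C * frame_len vector per frame -> (S, N, D)
    return frames.transpose(0, 2, 1, 3).reshape(S, N, C * frame_len)
```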
In a preferred embodiment of the present invention, step S2 includes the steps of:
S21, feature gate unit (Frame Gate): suppose the input data becomes $\langle h_1, h_2, \ldots, h_N \rangle$ after the LSTM network. The computation of the feature gate unit can be expressed as:

$$g_{t,t'} = \tanh(W_g h_t + W_{g'} h_{t'} + b_g)$$

$$e_{t,t'} = \sigma(W_e g_{t,t'} + b_e)$$

where $W_g$, $W_{g'}$ and $W_e$ are weight variables acting on the hidden states $h_t$ and $h_{t'}$, and $b_g$, $b_e$ are network biases; these parameters are obtained by training. The intermediate variable $g_{t,t'}$ is obtained through the tanh activation; $g_{t,t'}$ is then multiplied by the weight variable $W_e$ and normalized by a sigmoid to give the latent variable $e_{t,t'}$. Collecting these values gives

$$e_t = \langle e_{t,1}, e_{t,2}, \ldots, e_{t,N} \rangle$$

where each element is computed from the current hidden state $h_t$ and one of the other hidden states. The whole vector $e_t$ is then softmax-normalized:

$$a_t = \mathrm{softmax}(e_t)$$

from which the attention matrix A is finally derived:

$$A_{N \times N} = \langle a_1, a_2, a_3, \ldots, a_N \rangle$$

The attention matrix A has shape N × N, and the hidden states can also be written in matrix form, $H = \langle h_1, h_2, \ldots, h_N \rangle$. The output O of the feature gate unit is then

$$O = A_{N \times N} H^T$$

For clarity, each element $o_t \in O$ can be written out explicitly:

$$o_t = \sum_{t'} a_{t,t'} \, h_{t'}, \quad a_{t,t'} \in a_t$$
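The following PyTorch sketch shows one way to realize the Frame Gate equations above; the pairwise broadcasting and the exact layer shapes are our assumptions rather than the patent's reference implementation.

```python
import torch
import torch.nn as nn

class FrameGate(nn.Module):
    """Frame-level gate: pairwise attention over LSTM hidden states."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.W_g  = nn.Linear(hidden_dim, hidden_dim, bias=False)  # W_g h_t
        self.W_gp = nn.Linear(hidden_dim, hidden_dim)              # W_g' h_t' + b_g
        self.W_e  = nn.Linear(hidden_dim, 1)                       # W_e g_{t,t'} + b_e

    def forward(self, H):
        # H: (batch, N, hidden_dim), the hidden states <h_1, ..., h_N>
        # g_{t,t'} = tanh(W_g h_t + W_g' h_t' + b_g), for all pairs (t, t')
        g = torch.tanh(self.W_g(H).unsqueeze(2) + self.W_gp(H).unsqueeze(1))
        # e_{t,t'} = sigmoid(W_e g_{t,t'} + b_e)  -> (batch, N, N)
        e = torch.sigmoid(self.W_e(g)).squeeze(-1)
        # a_t = softmax(e_t): normalize each row over t'
        A = torch.softmax(e, dim=-1)
        # o_t = sum_{t'} a_{t,t'} h_{t'}, i.e. O = A H, same shape as H
        return torch.bmm(A, H)
```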
S22, frame-level feature distillation: initialize M basic networks, each containing an LSTM encoder-decoder structure and a Frame Gate feature gate unit. Select any two networks, defining the first as the teacher network and the second as the student network. The teacher network accepts the S × N × D 3D-tensor as input, and the output of its feature gate unit does not change the dimensionality of the original data: it is likewise an S × N × D three-dimensional tensor, but one in which channels and frames have been weighted and combined. A student network of identical structure can therefore naturally take this tensor as its input. The operation can be iterated continuously, i.e. new networks are initialized and the feature matrix extracted by the previous network is passed on, until the iteration terminates. Each iteration is in effect a further extraction of the original features; in this way a "deep" neural network is transformed into a "wide" one, with features continuously distilled between the networks. The process can be expressed as:

$$O_{teacher},\, p_{teacher} = net_{teacher}(F)$$

$$O_{student},\, p_{student} = net_{student}(O_{teacher})$$

where F is the input frame-level three-dimensional tensor, and O and p are the gate output and prediction defined above. The prediction result p of the teacher network is further used to correct the loss function the student network learns; the loss design draws on knowledge distillation and is expressed as:

$$L_{distill} = \lambda\, L(r, p_{student}) + (1 - \lambda)\, L(p^{T}_{teacher}, p^{T}_{student})$$

where r is the true label of the sample, p is the network prediction output, the superscript T denotes predictions softened by the temperature coefficient T (used as the coefficient of the soft labels), λ balances the two terms, and L is the cross-entropy loss. In this way the student model also obtains further information from the teacher model's prediction labels, which is why this is called "label distillation"; and because a feature distillation is performed on the frame-level data, the whole framework is named the frame-level distillation network.
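As an illustration, a hedged PyTorch sketch of this distillation loss follows; the temperature-scaled term is assumed to take the standard Hinton-style form (KL divergence between T-softened distributions with a T² factor), since the text only names the ingredients r, p, T, λ, and L.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target, T=2.0, lam=0.5):
    # hard-label term: cross entropy L(r, p_student) against the true labels r
    hard = F.cross_entropy(student_logits, target)
    # soft-label term: divergence between the T-softened teacher and student
    # predictions (the "label distillation" signal from the teacher)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # lambda balances the hard and soft terms
    return lam * hard + (1.0 - lam) * soft
```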
Drawings
In order to more clearly illustrate the embodiments of the present invention, the drawings used in their description are briefly introduced below: FIG. 1 is the overall flowchart of the invention, FIG. 2 is the network architecture diagram of the invention, and FIG. 3 is the network detail diagram of the invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
The invention provides a relevant specification of an electroencephalogram emotion classification system based on a frame-level feature distillation neural network, which comprises the following parts:
the first step is as follows: an electroencephalogram (EEG) of a subject is acquired through a human brain electroencephalogram acquisition system. The signal acquisition should follow the following criteria: (1) different stimuli are applied to different subjects, and the form of the stimuli may include, but is not limited to, video, audio, and images. (2) SAM self-emotion analysis evaluation is carried out on the test subject, and the test subject is required to give scores in three emotion dimensions according to different stimuli. (3) Researchers are asked to set a threshold on the subject's score, and subject emotions are classified according to the threshold. Based on the above criteria, corresponding raw data can be acquired from the electroencephalogram signal acquisition instrument.
The second step is as follows: preprocess the raw data acquired from the EEG acquisition instrument. Because of the complexity of human brain activity, EEG signals show activity even during non-stimulation periods when no strong emotional reaction occurs, so during data collection researchers usually record an EEG signal of a certain duration as the subject's baseline. The acquired EEG signals during stimulation therefore first have their corresponding baseline signals subtracted. Next, these multi-channel EEG signals are segmented, each segment being a 3 s multi-channel EEG signal. For example, a full-length EEG signal of x seconds is decomposed into x/3 segments; each segment is called a "frame", denoted $F_t$. In other words, the full-length EEG signal is segmented into the multi-frame representation $\langle F_1, F_2, \ldots, F_N \rangle$, where N is the number of frames. Each frame, however, is still multi-channel, which is inconvenient for the subsequent classification network. We therefore further process the multi-channel signal of each frame; although many methods exist in the literature for combining multi-channel signals, for runtime reasons the proposed method uses the simplest flattening (Flatten) preprocessing: the signals of each frame are concatenated into a one-dimensional vector in channel order, giving a spliced vector of dimension D. Finally, we obtain N feature vectors of dimension D, which arranged vertically form an N × D matrix. In other words, taking the number of samples S into account, the preprocessing turns the whole EEG recording into an S × N × D 3D-tensor.
The advantage of this preprocessing is that the processed data contains both multi-channel and inter-frame information while greatly shortening the original signal: the length of each frame is fixed at 3 seconds, and the neural network can give a comprehensive decision based on the different frames. The preprocessing also matches the proposed FLDNet, since the framed signals make it possible to mine the inter-frame relations of the original signal. Each frame $F_t$ expresses one mode of the polymorphic behavior of the human brain, the connections between frames exhibit a contextual relationship, and features extracted from such data have stronger expressive power.
The third step is as follows: construct a frame-level distillation network to process the acquired EEG signals.
We first initialize M basic networks, each with an encoding-decoding structure, a feature gate unit, and a multi-layer classification network. The specific structure of the network is shown in FIG. 2.
In FIG. 2, two LSTM layers are introduced into FLDNet, and the two play different roles. The first LSTM network serves as the encoding layer, encoding the raw input data into hidden states; this process can be expressed as:

$$\langle h_1, h_2, \ldots, h_N \rangle = LSTM_1(\langle F_1, F_2, \ldots, F_N \rangle)$$

where F denotes the feature sequence obtained from data preprocessing and $h_N$ is the N-th hidden state output by the LSTM.
The hidden states are then fed into the feature gate unit, which is intended to form an attention matrix A with elements $\alpha_{t,t'} \in A$, where $\alpha_{t,t'}$ measures the similarity between the hidden states $h_t$ and $h_{t'}$. Note that $h_t$ and $h_{t'}$ correspond to $F_t$ and $F_{t'}$ respectively, with t and t' indexing different points in the time series. The computation of the feature gate unit can be expressed as:

$$g_{t,t'} = \tanh(W_g h_t + W_{g'} h_{t'} + b_g)$$

$$e_{t,t'} = \sigma(W_e g_{t,t'} + b_e)$$

where $W_g$, $W_{g'}$ and $W_e$ are weight variables acting on the hidden states $h_t$ and $h_{t'}$, and $b_g$, $b_e$ are network biases; these parameters are obtained by training. The intermediate variable $g_{t,t'}$ is obtained through the tanh activation, then multiplied by the weight variable $W_e$ and normalized by a sigmoid to give the latent variable $e_{t,t'}$. Assuming the input data has N frames, the LSTM network correspondingly encodes N hidden states, and for each hidden state $h_t$ a series of latent variables $e_{t,t'}$ is computed; in other words, each hidden state $h_t$ enters N computations, yielding the latent vector

$$e_t = \langle e_{t,1}, e_{t,2}, \ldots, e_{t,N} \rangle$$

Each element is computed from the current hidden state $h_t$ and one of the other hidden states. We then softmax-normalize the whole vector $e_t$:

$$a_t = \mathrm{softmax}(e_t)$$

from which the attention matrix A is finally derived:

$$A_{N \times N} = \langle a_1, a_2, a_3, \ldots, a_N \rangle$$

The attention matrix A has shape N × N, and the hidden states can also be written in matrix form, $H = \langle h_1, h_2, \ldots, h_N \rangle$. The output O of the feature gate unit is then

$$O = A_{N \times N} H^T$$

For clarity, each element $o_t \in O$ can be written out explicitly:

$$o_t = \sum_{t'} a_{t,t'} \, h_{t'}, \quad a_{t,t'} \in a_t$$
This completes the derivation of the feature gate unit. The first two equations are derived from the gating form of the LSTM, which is why we also call it a gate unit. In essence, the feature gate unit performs a weighted sum over the input hidden states: different input frames receive corresponding weights, and a set of high-weight regions emerges through network learning. This is exactly the design motivation, namely that only parts of the time domain in a long time series contribute to classification. Looking again at the derivation, each output $o_t$ is computed from all the hidden states, i.e. the output of the feature gate unit is a combination and re-extraction of the original features, which is why this output can be handed to the student network for learning.
Subsequently, a second LSTM is introduced. Since the output of the gate unit contains N feature vectors, which is unfavorable for direct classification, a decoding network is needed to further merge and integrate the complex features; this is the role of the second LSTM. It differs from the first in that only its last hidden state is taken as the classification feature vector, a process that can be expressed as:

$$h_{last} = LSTM_2(\langle O_1, O_2, \ldots, O_N \rangle)$$
Finally, a multi-layer neural network is used to classify the extracted feature vector. It consists of three fully connected layers with 32, 16, and C neurons respectively, where C is the number of classes. The multi-layer neural network finally outputs the classification probability p of the input sample in one-hot form, that is:

$$p = \mathrm{softmax}(multidense(h_{last}))$$
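Putting the pieces together, the sketch below assembles one basic network as we read FIG. 2 (LSTM encoder, Frame Gate, LSTM decoder keeping only the last hidden state, then the three-layer classifier). It reuses the hypothetical FrameGate module sketched earlier; setting the hidden size equal to the frame dimension D is our reading of the dimensional invariance used for teacher-student chaining below.

```python
import torch
import torch.nn as nn

class BaseNet(nn.Module):
    """One FLDNet base network: LSTM_1 -> Frame Gate -> LSTM_2 -> classifier."""
    def __init__(self, dim, n_classes):
        super().__init__()
        # hidden size equals the frame dimension D, so the gate output keeps
        # the (S, N, D) shape and can feed the next (student) network
        self.encoder = nn.LSTM(dim, dim, batch_first=True)   # LSTM_1
        self.gate    = FrameGate(dim)                        # Frame Gate unit
        self.decoder = nn.LSTM(dim, dim, batch_first=True)   # LSTM_2
        self.classifier = nn.Sequential(                     # 32, 16, C neurons
            nn.Linear(dim, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, n_classes),
        )

    def forward(self, x):
        # x: (batch, N, D) frame tensor, or a teacher's gate output O
        H, _ = self.encoder(x)              # hidden states <h_1, ..., h_N>
        O = self.gate(H)                    # gated features, dims unchanged
        _, (h_last, _) = self.decoder(O)    # keep only the last hidden state
        return O, self.classifier(h_last[-1])  # logits; softmax in the loss
```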
further, as will be discussed, interaction between networks is illustrated in FIG. 3, which depicts a plurality of identical networks that differ in that they accept different input and loss functions, optionally wherein two of the networks form a set of networks, the first network being referred to as the teacher network and the second network being referred to as the student network. After a basic network containing an LSTM coding-decoding structure and a Frame Gate characteristic gating unit is trained, corresponding weight variables are reserved in the characteristic gating unit, and corresponding inter-Frame importance information is reserved in the output of the layer.
And then initializing a new basic network, wherein the basic network is used as a student network to receive the input of the original data no longer and receive the output of the teacher network gate control unit instead. The condition for the operation to be established comes from the dimensional invariance of the feature gate unit, as shown in the foregoing derivation and fig. 2, the teacher network accepts the 3D-tensor of S × N × D as input, and the output of the feature gate unit does not change the dimension of the original data, which is also the three-dimensional tensor of S × N × D, but the three-dimensional tensor is combined with weights between channels and frames, that is, the teacher network uses the feature matrix after it has been learned as input to teach the student network. Each iteration is actually an extraction of the original features, and in such a way we can transform a 'deep' neural network into a 'wide' neural network, with features being continuously distilled between networks. This process can be expressed as:
Oteacher,pteacher=netteacher(F)
Ostudent,pstudent=netstudent(Oteacher)
wherein F represents the input frame-level three-dimensional tensor data, and furthermore, the prediction result p of the teacher network is used for correcting the loss function which should be learned of the student network, and the loss function design takes the knowledge distillation loss as reference, and is expressed as:
Figure RE-GDA0003014593100000051
wherein r represents the true label of the prediction sample, p is the network prediction output obtained by formula (4-9), T is the temperature coefficient, an artificial given constant is used as the system of the soft label; λ represents the balance term of the two terms before and after, and L represents the formula cross entropy loss. In this way the student model learns further information from the teacher model's predictive labels, and this is therefore called "label distillation". And because the network performs a feature distillation from the frame-level data, the entire frame is named a frame-level distillation network.
The fourth step is as follows: train and test the constructed frame-level distillation network.
Specifically, we first initialize M basic networks with the encoding-decoding structure, feature gate unit, and multi-layer classification network. The first network is trained with the plain cross-entropy loss; its extracted features are then output to the student network, and each subsequent student network further optimizes the distillation objective based on the extracted features and the prediction labels. Finally a $conv_{1 \times 1}$ layer is introduced and trained on the true labels; with the weight variables of the base networks fixed, this layer converts the aggregated label matrix $P_{S \times M}$ into the probability vector $p_{ensemble}$, giving the final decision result. The model uses the Adam optimizer and dropout regularization; for time and performance reasons the number of networks M is set to three, all LSTM layers have the same number of hidden units, consistent with the input dimension, and the fully connected classification network has 32, 16, and C neurons, where C is the number of classes. The overall training procedure is as follows:

Input: the preprocessed S × N × D data tensor F, the true labels r, and the number of networks M.
Output: the classification probability vector $p_{ensemble}$.
1. Initialize M basic networks.
2. Train the neural networks:
For i = 1:
  the input of model_1 is F;
  train model_1 with the cross-entropy loss and the true labels r;
  compute the extracted features $O_1$ with the feature gate output formula;
  generate the prediction labels $p_1$ of model_1 with the softmax output formula.
For i = 2 to M:
  the input of model_i is $O_{i-1}$;
  train model_i with the distillation loss;
  compute the extracted features $O_i$ with the feature gate output formula;
  generate the prediction labels $p_i$ of model_i with the softmax output formula.
End iteration over i.
3. Aggregate all prediction labels into the sample label matrix $P_{S \times M} = \langle p_1, p_2, \ldots, p_M \rangle$.
4. Train the ensemble model $conv_{1 \times 1}$:
5. the input of the ensemble model is $P_{S \times M}$;
6. train the ensemble model with the cross-entropy loss and the true labels r.
7. Form the final decision $p_{ensemble} = conv_{1 \times 1}(P)$.
8. Return $p_{ensemble}$.
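A compressed, illustrative version of this training procedure follows; it reuses the hypothetical BaseNet and distillation_loss sketches from earlier, with M = 3 networks as in the text, while the optimizer settings and epoch counts are placeholder assumptions rather than the patent's values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_fldnet(frames, r, dim, n_classes, M=3, epochs=100, lr=1e-3):
    # frames: (S, N, D) float tensor; r: (S,) long tensor of class labels
    nets, preds = [BaseNet(dim, n_classes) for _ in range(M)], []
    inputs, teacher_logits = frames, None
    for i, net in enumerate(nets):
        opt = torch.optim.Adam(net.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            O, logits = net(inputs)
            if i == 0:                      # first network: plain cross entropy
                loss = F.cross_entropy(logits, r)
            else:                           # later networks: distillation loss
                loss = distillation_loss(logits, teacher_logits, r)
            loss.backward()
            opt.step()
        with torch.no_grad():               # hand the gate features onward
            inputs, teacher_logits = net(inputs)
            preds.append(torch.softmax(teacher_logits, dim=1))

    # aggregate the M prediction vectors and fuse them with a 1x1 convolution,
    # which learns one weight per network
    P = torch.stack(preds, dim=1)           # (S, M, n_classes)
    fuse = nn.Conv1d(M, 1, kernel_size=1)
    opt = torch.optim.Adam(fuse.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        p_ensemble = fuse(P).squeeze(1)     # (S, n_classes)
        F.cross_entropy(p_ensemble, r).backward()
        opt.step()
    return nets, fuse
```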
The test subset is then fed into the trained network to obtain the emotion classification results on the test set, so as to evaluate the quality of the trained model and decide whether further parameter tuning and retraining are needed.
EEG signal data collected from a subject is processed according to the data processing procedure of the system, the processed data is input into the system's neural network, and the corresponding prediction result is finally given.
The electroencephalogram emotion classification system based on a frame-level feature distillation neural network provided by the invention has been described in detail above. The principle and implementation of the invention are explained herein, and the description of the embodiments is intended only to help understand the method and its core idea; meanwhile, following the idea of the invention, variations, modifications and changes may be made to the embodiments and the scope of application. In summary, the content of this description should not be construed as limiting the invention.

Claims (3)

1. An electroencephalogram emotion classification system based on a frame-level feature distillation neural network, characterized by comprising the following steps:
S1, perform framing and channel-concatenation preprocessing on the collected multi-channel EEG signals;
S2, build a new frame-level distillation network to classify and model the preprocessed data;
S3, input training samples for iterative optimization of the model, using a back propagation algorithm and a distillation loss function to optimize the network parameters;
S31, the distillation loss function can be expressed as:

$$L_{distill} = \lambda\, L(r, p_{student}) + (1 - \lambda)\, L(p^{T}_{teacher}, p^{T}_{student}) \quad (10)$$

where r is the true label of the sample, p is the network prediction output, T is the temperature coefficient used as the coefficient of the soft labels, λ is the balance term between the two terms, and L is the cross-entropy loss;
S4, combine the prediction labels of the multiple networks, train the combined labels with a one-dimensional convolutional neural network that assigns different weights to the different networks, and finally give the ensemble decision of the networks;
S41, specifically, initialize M basic networks with the encoding-decoding structure, feature gate unit, and multi-layer classification network; train the first network with the cross-entropy function, then output the extracted features of its feature gate unit to the student network; each subsequent student network further optimizes formula (10) based on the extracted features and the prediction labels; finally a $conv_{1 \times 1}$ layer is introduced and trained on the true labels, and with the weight variables of the base networks fixed, it converts the aggregated label matrix $P_{S \times M}$ into the probability vector $p_{ensemble}$, giving the final decision result.
2. The electroencephalogram emotion classification system based on the frame-level feature distillation neural network of claim 1, wherein step S1 comprises the following steps:
S11, when collecting data, first record an EEG signal of a certain duration as the subject's baseline, and subtract this resting-period baseline signal from all acquired EEG recordings;
S12, segment the multi-channel EEG signal so that each segment is a 3 s multi-channel EEG signal; that is, an EEG signal with a total length of x seconds is decomposed into x/3 segments, each segment forming a frame denoted $F_t$, so the full-length EEG signal is converted into the multi-frame representation $\langle F_1, F_2, \ldots, F_N \rangle$, where N is the number of frames;
S13, combine the channels of the multi-frame EEG representation: the signals of each frame are concatenated into a one-dimensional vector in channel order, and the dimension of the spliced vector is D;
S14, repeat S12-S13 to obtain N feature vectors of dimension D; arranged vertically these form an N × D matrix, and taking the number of samples S into account, the EEG signal is finally represented as an S × N × D three-dimensional tensor.
3. The electroencephalogram emotion classification system based on the frame-level feature distillation neural network of claim 2, wherein step S2 comprises the following steps:
S21, feature gate unit (Frame Gate): suppose the input data becomes $\langle h_1, h_2, \ldots, h_N \rangle$ after the LSTM network. The computation of the feature gate unit can be expressed as:

$$g_{t,t'} = \tanh(W_g h_t + W_{g'} h_{t'} + b_g) \quad (1)$$

$$e_{t,t'} = \sigma(W_e g_{t,t'} + b_e) \quad (2)$$

where $W_g$, $W_{g'}$ and $W_e$ are weight variables acting on the hidden states $h_t$ and $h_{t'}$, and $b_g$, $b_e$ are network biases; these parameters are obtained by training, and the intermediate variable $g_{t,t'}$ is obtained through the tanh activation.

$g_{t,t'}$ is then multiplied by the weight variable $W_e$ and normalized by a sigmoid to give the latent variable $e_{t,t'}$, and collecting these values gives

$$e_t = \langle e_{t,1}, e_{t,2}, \ldots, e_{t,N} \rangle \quad (3)$$

Each element is computed from the current hidden state $h_t$ and one of the other hidden states. The whole vector $e_t$ is then softmax-normalized:

$$a_t = \mathrm{softmax}(e_t) \quad (4)$$

from which the attention matrix A is finally derived:

$$A_{N \times N} = \langle a_1, a_2, a_3, \ldots, a_N \rangle \quad (5)$$

The attention matrix A has shape N × N, and the hidden states can also be written in matrix form, $H = \langle h_1, h_2, \ldots, h_N \rangle$. The output O of the feature gate unit is then

$$O = A_{N \times N} H^T \quad (6)$$

For clarity, each element $o_t \in O$ can be written out explicitly:

$$o_t = \sum_{t'} a_{t,t'} \, h_{t'}, \quad a_{t,t'} \in a_t \quad (7)$$
S22, frame-level feature distillation: initialize M basic networks, each containing an LSTM encoder-decoder structure and a Frame Gate feature gate unit; select any two networks, defining the first as the teacher network and the second as the student network; the teacher network accepts the S × N × D 3D-tensor as input, and the output of its feature gate unit does not change the dimensionality of the original data, being likewise an S × N × D three-dimensional tensor in which channels and frames have been weighted and combined, so a student network of identical structure can naturally take this tensor as its input; the operation can be iterated continuously, i.e. new networks are initialized and the feature matrix extracted by the previous network is passed on, until the iteration terminates. Each iteration is in effect a further extraction of the original features; in this way a "deep" neural network is transformed into a "wide" one, with features continuously distilled between the networks. The process can be expressed as:

$$O_{teacher},\, p_{teacher} = net_{teacher}(F) \quad (8)$$

$$O_{student},\, p_{student} = net_{student}(O_{teacher}) \quad (9)$$
CN202011575538.7A 2020-12-28 2020-12-28 Electroencephalogram emotion classification system based on frame-level characteristic distillation neural network Active CN112989920B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011575538.7A CN112989920B (en) 2020-12-28 2020-12-28 Electroencephalogram emotion classification system based on frame-level characteristic distillation neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011575538.7A CN112989920B (en) 2020-12-28 2020-12-28 Electroencephalogram emotion classification system based on frame-level characteristic distillation neural network

Publications (2)

Publication Number Publication Date
CN112989920A (en) 2021-06-18
CN112989920B CN112989920B (en) 2023-08-11

Family

ID=76345182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011575538.7A Active CN112989920B (en) 2020-12-28 2020-12-28 Electroencephalogram emotion classification system based on frame-level characteristic distillation neural network

Country Status (1)

Country Link
CN (1) CN112989920B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113380271A (en) * 2021-08-12 2021-09-10 明品云(北京)数据科技有限公司 Emotion recognition method, system, device and medium
CN114510966A (en) * 2022-01-14 2022-05-17 电子科技大学 End-to-end brain causal network construction method based on graph neural network
CN114881206A (en) * 2022-04-21 2022-08-09 北京航空航天大学 General neural network distillation formula method
CN114970375A (en) * 2022-07-29 2022-08-30 山东飞扬化工有限公司 Rectification process monitoring method based on real-time sampling data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399857A (en) * 2019-08-01 2019-11-01 西安邮电大学 A kind of brain electricity emotion identification method based on figure convolutional neural networks
CN110490136A (en) * 2019-08-20 2019-11-22 电子科技大学 A kind of human body behavior prediction method of knowledge based distillation
US20200104717A1 (en) * 2018-10-01 2020-04-02 Neuralmagic Inc. Systems and methods for neural network pruning with accuracy preservation

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200104717A1 (en) * 2018-10-01 2020-04-02 Neuralmagic Inc. Systems and methods for neural network pruning with accuracy preservation
CN110399857A (en) * 2019-08-01 2019-11-01 西安邮电大学 A kind of brain electricity emotion identification method based on figure convolutional neural networks
CN110490136A (en) * 2019-08-20 2019-11-22 电子科技大学 A kind of human body behavior prediction method of knowledge based distillation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
官金安; 汪鹭汐; 赵瑞娟; 李东阁; 吴欢: "Research on IR-BCI EEG video decoding based on 3D convolutional neural networks", Journal of South-Central Minzu University (Natural Science Edition), no. 04
尹文枫; 梁玲燕; 彭慧民; 曹其春; 赵健; 董刚; 赵雅倩; 赵坤: "Research progress on convolutional neural network compression and acceleration techniques", Computer Systems & Applications, no. 09

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113380271A (en) * 2021-08-12 2021-09-10 明品云(北京)数据科技有限公司 Emotion recognition method, system, device and medium
CN114510966A (en) * 2022-01-14 2022-05-17 电子科技大学 End-to-end brain causal network construction method based on graph neural network
CN114510966B (en) * 2022-01-14 2023-04-28 电子科技大学 End-to-end brain causal network construction method based on graph neural network
CN114881206A (en) * 2022-04-21 2022-08-09 北京航空航天大学 General neural network distillation formula method
CN114970375A (en) * 2022-07-29 2022-08-30 山东飞扬化工有限公司 Rectification process monitoring method based on real-time sampling data
CN114970375B (en) * 2022-07-29 2022-11-04 山东飞扬化工有限公司 Rectification process monitoring method based on real-time sampling data

Also Published As

Publication number Publication date
CN112989920B (en) 2023-08-11

Similar Documents

Publication Publication Date Title
CN112989920B (en) Electroencephalogram emotion classification system based on frame-level characteristic distillation neural network
CN110575663B (en) Physical education auxiliary training method based on artificial intelligence
CN112508077B (en) Social media emotion analysis method and system based on multi-modal feature fusion
CN111134666A (en) Emotion recognition method of multi-channel electroencephalogram data and electronic device
CN112800998B (en) Multi-mode emotion recognition method and system integrating attention mechanism and DMCCA
CN110298303B (en) Crowd identification method based on long-time memory network glance path learning
Senthilkumar et al. Speech emotion recognition based on Bi-directional LSTM architecture and deep belief networks
CN111414461A (en) Intelligent question-answering method and system fusing knowledge base and user modeling
CN113180695B (en) Brain-computer interface signal classification method, system, equipment and storage medium
CN114550907A (en) Epilepsy detection system
Qiu et al. Data encoding visualization based cognitive emotion recognition with AC-GAN applied for denoising
Mittal et al. DL-ASD: A Deep Learning Approach for Autism Spectrum Disorder
Pandian et al. Effect of data preprocessing in the detection of epilepsy using machine learning techniques
CN113128459B (en) Feature fusion method based on multi-level electroencephalogram signal expression
CN112560811B (en) End-to-end automatic detection research method for audio-video depression
CN114936583A (en) Teacher-student model-based two-step field self-adaptive cross-user electromyogram pattern recognition method
CN112101095B (en) Suicide and violence tendency emotion recognition method based on language and limb characteristics
Ahmadieh et al. Visual image reconstruction based on EEG signals using a generative adversarial and deep fuzzy neural network
Di Nardo et al. EmoP3D: A brain like pyramidal deep neural network for emotion recognition
Jamal et al. Cloud-Based Human Emotion Classification Model from EEG Signals
Galphade et al. Understanding Deep Learning: Case Study Based Approach
Kumar et al. VGG 16 Based Human Emotion Classification Using Thermal Images Through Transfer Learning
Hagar et al. Emotion recognition in videos for low-memory systems using deep-learning
Anguraju et al. Adaptive feature selection based learning model for emotion recognition
CN114565972B (en) Skeleton action recognition method, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant