CN114330436A - Emotion recognition method based on twin network architecture and graph convolution - Google Patents
Emotion recognition method based on twin network architecture and graph convolution
- Publication number
- CN114330436A (application CN202111617915.3A)
- Authority
- CN
- China
- Prior art keywords
- input
- sample
- embedding
- channel
- convolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention relates to an emotion recognition method based on a twin network architecture and graph convolution, which belongs to the technical field of electroencephalogram emotion recognition.
Description
Technical Field
The invention relates to an emotion recognition method based on a twin network architecture and graph convolution, and belongs to the technical field of electroencephalogram emotion recognition.
Background
Emotion conveys information and regulates behavior in people's daily communication, work, study, and cognitive decision-making; recognizing emotion correctly helps people grasp accurate information. Emotion is a complex psychological and physiological state that arises from the brain's response to physiological changes, and it plays a crucial role in our lives. In recent years, a growing body of research has focused on emotion recognition, both to build affective interaction interfaces that allow machines to perceive human emotion and to assess the psychological condition of patients with neurological disorders such as Parkinson's disease, autism spectrum disorder, schizophrenia, and depression.
Emotion recognition methods fall into two main categories: those based on non-physiological signals and those based on physiological signals. Because some patients cannot express emotion through external features such as facial expressions and body postures, and some people can deliberately disguise their emotions, physiological signals are often used as the signal source for emotion classification. The electroencephalogram (EEG) signal, one of the most commonly used physiological signals, offers high temporal resolution, non-invasiveness, and low-cost, easy acquisition, and has been shown to reflect important information about human emotional states.
EEG data are not regular Euclidean data, and for such irregular brain network structures, graph convolution can better capture the complex connections among channels. However, stacking too many graph convolutional layers causes over-smoothing, which hurts accuracy, so how to extract more effective features and improve emotion classification accuracy is worth investigating. At present, methods such as convolutional neural networks (CNN), recurrent neural networks (RNN), and support vector machines (SVM) are used to recognize EEG emotion and have achieved considerable results, but their emotion recognition accuracy remains partly insufficient.
An existing research method performs emotion recognition and classification based on differential entropy features of EEG signals combined with an LSTM neural network model (publication number CN110897648A). It comprises the following steps: (1) collect 62-channel EEG signals from normal adults; (2) compute the differential entropy (DE) of the time series to form 62-dimensional temporal features; (3) feed the temporal features into an LSTM neural network for training and learning; (4) evaluate the training results using the average classification accuracy, standard deviation, and F1 score. The method works well: it can effectively recognize and classify three emotions by exploiting the non-stationarity, nonlinearity, time-frequency-domain, and complexity characteristics of EEG signals to find the heterogeneity among the three emotions, thereby distinguishing them and assisting the adjuvant treatment and recovery of various diseases. It differs from the present invention in that the present invention uses a twin network framework to train an auxiliary task and uses the intermediate-layer features for contrastive learning while the model performs emotion classification, thereby improving the performance and generalization capability of the model, with correspondingly different emotion recognition accuracy.
Disclosure of Invention
To address the above defects, the invention provides an emotion recognition method based on a twin network architecture and graph convolution, which uses a multi-head self-attention mechanism to adaptively assign different importance to the data and extract deeper, more effective information. In addition, to further improve model accuracy, the method uses a twin network framework to train an auxiliary task: while the model performs emotion classification, the intermediate-layer features are used for contrastive learning, improving the performance and generalization capability of the model.
In order to achieve the purpose, the technical scheme of the invention is as follows:
a twin network architecture and graph convolution-based emotion recognition method comprises the following steps:
and acquiring a data set. Acquiring electroencephalogram signals of 62 appointed electrode positions of a subject when watching the movie fragments, and immediately completing a questionnaire to report emotional reactions (neutral, negative and positive emotions) of the subject to each movie fragment after finishing watching each movie fragment;
data preprocessing and feature extraction. The original EEG signal is down sampled and artifact pre-processed. Filtering a time domain signal by using a Hamming window, performing Fast Fourier Transform (FFT), taking a signal every second as a sliding window, calculating the differential entropy characteristics of 62 channels of 5 frequency bands, and performing normalization processing;
and (4) generating a sample. For efficient use of time information, the DE feature of the 62 channels of 3s is used as a sample, and the dimension of one input sample is 3 × 62 × 5 (time × channel × frequency band).
Generating sample pairs. Traverse each sample generated in step three and take it as the reference, denoted input_1. Randomly select a sample under the same emotion (denoted input_2) to form a positive sample pair with input_1, and randomly select a sample under a different emotion (denoted input_3) to form a negative sample pair with input_1; that is, input_2 and input_1 in a positive pair carry the same emotion, while input_3 and input_1 in a negative pair carry different emotions.
Defining a base model. A spatio-temporal graph convolutional neural network model is taken as the base model. The base network consists of an adaptive graph learning layer and a spatio-temporal convolution module. The adaptive graph learning layer aims to learn the connectivity of the brain network; the spatio-temporal convolution module consists of a spatio-temporal self-attention mechanism, a graph convolution, and an ordinary convolution, where the spatio-temporal self-attention mechanism captures the dynamics in the spatial and temporal dimensions under different emotional states, the graph convolution aggregates neighboring nodes, and the ordinary convolution extracts features in the time dimension.
Defining a twin network architecture. The two samples of a pair, i.e., input_1 and input_2 (or input_3), are fed in turn into the same base model, producing two intermediate features embedding_1 and embedding_2, and the distance between embedding_1 and embedding_2 is computed. A multi-head self-attention layer then extracts deeper features from embedding_1, and after a fully connected layer and a softmax layer the probability of input_1 belonging to each category is output.
Defining the inputs and outputs of the model. The model input is either a positive sample pair (input_1 and input_2) or a negative sample pair (input_1 and input_3). The model has two outputs, output_1 and output_2, where output_1 is the distance between the intermediate features embedding_1 and embedding_2, and output_2 is the probability that sample input_1 belongs to each category.
Defining an objective function. The final objective function of the model consists of three loss terms. The first, in the adaptive graph learning layer, aims at learning the brain connectivity: a loss function constrains the relationship between the feature distance of two channels and their connection strength (the farther apart the features of two channels, the weaker the connection), and since the brain's connection structure is not fully connected, an L2-norm regularization term controls the sparsity of the learned graph. The second is the contrastive loss of the twin network, which pulls the positive pairs of step four closer together and pushes the negative pairs farther apart. The third is the cross-entropy loss, which measures the error between the predicted value for input_1 in step six and the true sample label.
Training and testing. Input the sample pairs generated in step four into the model for training. After the model is trained, each sample of the test set is taken as input_1, and input_2 is obtained by randomly initializing a tensor of the same dimensions as input_1, forming an input sample pair. output_2 is used to compute the classification accuracy.
And evaluating the learning result by using the average classification accuracy.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides an emotion recognition method based on twin network architecture and graph convolution, which can self-adaptively endow different importance to data by utilizing a multi-head self-attention mechanism and extract deeper and more effective information; meanwhile, a twin network framework is borrowed, and the characteristics of the middle layer are utilized for comparative learning, so that the performance and the generalization capability of the model are improved. In addition, compared with other graph convolution methods, the emotion recognition accuracy of the model based on the twin network architecture is improved, and the accuracy reaches 94.78 +/-05.97%.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of a twin network architecture and graph convolution based emotion recognition method provided by the invention;
FIG. 2 is a spatiotemporal graph convolution neural network model of a twin network architecture and graph convolution based emotion recognition method provided by the invention;
FIG. 3 is an experimental mode of a twin network architecture and graph convolution based emotion recognition method provided by the invention;
FIG. 4 shows the electrode channel positions for EEG acquisition in the emotion recognition method based on a twin network architecture and graph convolution provided by the invention;
FIG. 5 is a comparison chart of a test of the emotion recognition method based on a twin network architecture and graph convolution provided by the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-4, a method for emotion recognition based on twin network architecture and graph convolution specifically includes the following steps:
step (1) dataset acquisition
Step (1-1): select 15 film clips (containing positive, neutral, and negative emotions) from a material library as the stimuli used in the experiment;
Step (1-2): the subject watches the 15 clips in each experiment; a 5-second cue precedes each clip, each clip is presented for 4 minutes, a 45-second self-assessment follows each clip, and a 15-second rest follows the self-assessment. In the self-assessment phase, the subject reports his or her emotional response to each film clip by completing a questionnaire;
step (2) data preprocessing and feature extraction
Step (2-1): down-sample the raw signal to 200 Hz and remove segments contaminated by electrooculogram and electromyogram artifacts;
Step (2-2): extract DE features for each channel over 5 frequency bands: delta band (1-3 Hz), theta band (4-7 Hz), alpha band (8-13 Hz), beta band (14-30 Hz), and gamma band (31-50 Hz):
A. filtering original data by adopting a Hamming window, performing fast Fourier transform on data per second, and calculating the differential entropy of the five frequency bands;
B. The differential entropy is defined as follows:

Let $X = \{x_1, x_2, \ldots, x_n\}$, $n \geq 1$, with corresponding probabilities $p_1, p_2, \ldots, p_n$. According to the definition of Shannon information entropy, the information content of the nondeterministic system is given by equation (1):

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i \qquad (1)$$

Replacing the time-domain state probability $p_i$ in the above equation with the frequency-domain power spectral density $\hat{p}_i$ defined from the fast Fourier transform yields the definition of differential entropy, as shown in equation (2):

$$DE = -\sum_{i=1}^{n} \hat{p}_i \log \hat{p}_i \qquad (2)$$
and (2-3) normalizing the electroencephalogram signal by adopting z-score, wherein the normalization formula is shown as the formula (3):
wherein X is the EEG signal on each channel,the mean value of the brain electrical signals on each channel is S, and the standard deviation of the brain electrical signals on each channel is S.
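For concreteness, the following Python sketch illustrates steps (2-2) and (2-3) under stated assumptions: a per-second Hamming-windowed FFT, band power normalized to a probability distribution as in equation (2), and z-score normalization as in equation (3). The exact windowing and scaling used by the invention are not specified, so treat these details as illustrative.

```python
import numpy as np
from scipy.signal import get_window

BANDS = {"delta": (1, 3), "theta": (4, 7), "alpha": (8, 13),
         "beta": (14, 30), "gamma": (31, 50)}
FS = 200  # sampling rate after down-sampling (Hz)

def de_features(eeg):
    """eeg: (62, n_samples) preprocessed signal; returns (n_seconds, 62, 5) DE features."""
    n_ch, n_samp = eeg.shape
    n_sec = n_samp // FS
    win = get_window("hamming", FS)
    freqs = np.fft.rfftfreq(FS, d=1.0 / FS)
    feats = np.zeros((n_sec, n_ch, len(BANDS)))
    for t in range(n_sec):
        seg = eeg[:, t * FS:(t + 1) * FS] * win              # Hamming-windowed 1 s slice
        psd = np.abs(np.fft.rfft(seg, axis=1)) ** 2          # power spectrum via FFT
        for b, (lo, hi) in enumerate(BANDS.values()):
            p = psd[:, (freqs >= lo) & (freqs <= hi)]
            p = p / (p.sum(axis=1, keepdims=True) + 1e-12)   # band PSD as probabilities, eq. (2)
            feats[t, :, b] = -(p * np.log(p + 1e-12)).sum(axis=1)
    mean = feats.mean(axis=0, keepdims=True)                 # z-score over time, eq. (3)
    std = feats.std(axis=0, keepdims=True)
    return (feats - mean) / (std + 1e-12)
```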
Step (3) sample Generation
To make effective use of temporal information, the DE features of the 62 channels over 3 s are taken as one sample; the dimension of one input sample is 3 × 62 × 5 (time × channel × frequency band).
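A minimal sketch of this windowing is given below, assuming a stride of one second and that a window's label is the emotion of the clip it comes from (samples are drawn within a single clip, so the label is constant across the 3 s); both assumptions go beyond what the text states.

```python
import numpy as np

def make_samples(de, labels):
    """de: (n_seconds, 62, 5) DE features; labels: (n_seconds,) emotion ids."""
    xs = [de[t:t + 3] for t in range(de.shape[0] - 2)]   # 3 s window -> (3, 62, 5)
    ys = [labels[t + 2] for t in range(de.shape[0] - 2)]
    return np.stack(xs), np.asarray(ys)
```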
Step (4) sample pair generation
Traverse each sample generated in step (3) and take it as the reference (denoted input_1). Randomly select a sample under the same emotion (denoted input_2) to form a positive sample pair with input_1, and randomly select a sample under a different emotion (denoted input_3) to form a negative sample pair with input_1; that is, input_2 and input_1 in a positive pair carry the same emotion, while input_3 and input_1 in a negative pair carry different emotions.
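A sketch of the pairing procedure follows; building exactly one positive and one negative pair per reference sample is an assumption, since the text only requires one random same-emotion and one random different-emotion partner.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def make_pairs(xs, ys):
    pairs = []
    for i in range(len(xs)):
        same = np.flatnonzero(ys == ys[i])
        same = same[same != i]                            # candidates under the same emotion
        diff = np.flatnonzero(ys != ys[i])                # candidates under different emotions
        pairs.append((xs[i], xs[rng.choice(same)], 1))    # (input_1, input_2): positive pair
        pairs.append((xs[i], xs[rng.choice(diff)], 0))    # (input_1, input_3): negative pair
    return pairs
```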
Step (5) defining a base model
A spatio-temporal graph convolutional neural network model is taken as the base model. The base model consists of an adaptive graph learning layer and a spatio-temporal convolution module. The adaptive graph learning layer aims to learn the connectivity of the brain network. The spatio-temporal convolution module consists of a spatio-temporal self-attention mechanism, a graph convolution, and an ordinary convolution: the spatio-temporal self-attention mechanism captures the dynamics in the spatial and temporal dimensions under different emotional states, the graph convolution aggregates neighboring nodes, and the ordinary convolution extracts features in the time dimension;
step (5-1) adaptive image learning
Define a non-negative adjacency matrix based on the channel features, as shown in equation (4):

$$A_{pq} = g(x_p, x_q), \quad p, q \in \{1, 2, \ldots, N\} \qquad (4)$$

where $A_{pq}$ represents the connection between channel $p$ and channel $q$, i.e., the weight of the edge joining node $p$ and node $q$; $g(x_p, x_q)$ is expressed by learning a weight vector $w$, and the definition of $A$ is shown in equation (5).
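Equation (5) is not reproduced here, so the PyTorch sketch below uses a common softmax-normalized form (a learned weight vector w applied to absolute feature differences) as an assumption; it is not the patent's exact definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGraphLearn(nn.Module):
    """Learns a non-negative adjacency matrix A from per-channel features, eq. (4)."""

    def __init__(self, feat_dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(feat_dim))   # learned weight vector w

    def forward(self, x):                              # x: (N_channels, feat_dim)
        diff = (x[:, None, :] - x[None, :, :]).abs()   # pairwise |x_p - x_q|
        logits = F.relu(diff @ self.w)                 # g(x_p, x_q)
        return F.softmax(logits, dim=1)                # assumed row-normalized, non-negative A
```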
Step (5-2) space-time self-attention mechanism
A. Compute temporal self-attention. The states of different time slices are correlated in time, and the correlation differs from case to case; an attention mechanism is used to adaptively capture the dynamic correlation between nodes in the time dimension. Transposing the input yields $\chi_h$ with dimensions (62 × 5 × 3), and the temporal attention is defined in equations (6) and (7):

$$T = V_T \cdot \sigma\left(((\chi_h)^T U_1)\, U_2\, (U_3 \chi_h) + b_T\right) \qquad (6)$$

where $T'_{i,j}$ represents the similarity between time $i$ and time $j$; $V_T$, $U_1$, $U_2$, $U_3$, $b_T$ are learnable parameters, and $\sigma$ is the sigmoid activation function.
B. Compute spatial attention. Channels at different locations influence one another, and this influence is highly dynamic; an attention mechanism is used to adaptively capture the dynamic correlation between nodes in the spatial dimension. Transposing the input yields $\chi_h$ with dimensions (62 × 5 × 3), and the spatial attention is defined in equations (8) and (9):

$$S = V_S \cdot \sigma\left((\chi_h W_1)\, W_2\, (W_3 \chi_h)^T + b_S\right) \qquad (8)$$

where $S'_{p,q}$ represents the similarity between channel $p$ and channel $q$; $V_S$, $W_1$, $W_2$, $W_3$, $b_S$ are learnable parameters, and $\sigma$ is the sigmoid activation function.
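The temporal branch can be sketched as follows, with parameter shapes inferred from the stated (62 × 5 × 3) input; the softmax producing the normalized $T'$ of equation (7) is an assumption, and the spatial branch of equation (8) is analogous with the roles of channels and time steps exchanged.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Temporal self-attention of eq. (6), for x of shape (n=62, f=5, t=3)."""

    def __init__(self, n=62, f=5, t=3):
        super().__init__()
        self.U1 = nn.Parameter(torch.randn(n))
        self.U2 = nn.Parameter(torch.randn(f, n))
        self.U3 = nn.Parameter(torch.randn(f))
        self.V = nn.Parameter(torch.randn(t, t))
        self.b = nn.Parameter(torch.zeros(t, t))

    def forward(self, x):                                   # x: (n, f, t)
        lhs = (x.permute(2, 1, 0) @ self.U1) @ self.U2      # ((chi^T)U1)U2 -> (t, n)
        rhs = torch.einsum("f,nft->nt", self.U3, x)         # U3 * chi -> (n, t)
        T = self.V @ torch.sigmoid(lhs @ rhs + self.b)      # eq. (6), (t, t)
        return torch.softmax(T, dim=1)                      # assumed normalization, eq. (7)
```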
Step (5-3) spatial convolution
A. Compute the Laplacian matrix $L = D - A$, where $A$ is the adjacency matrix learned in step (5-1) and $D$ is the degree matrix computed from $A$; that is, $D$ is a diagonal matrix of the same dimension as $A$ whose diagonal element $D_{ii}$ is the sum of the $i$-th row of $A$. The scaled Laplacian is given by equation (10):

$$\tilde{L} = \frac{2}{\lambda_{max}} L - I \qquad (10)$$

where $I$ is the identity matrix and $\lambda_{max}$ is the largest eigenvalue of $L$.

B. Recursively compute the Chebyshev polynomials according to equation (11):

$$T_k(\tilde{L}) = 2\tilde{L}\, T_{k-1}(\tilde{L}) - T_{k-2}(\tilde{L}), \quad T_0(\tilde{L}) = I, \; T_1(\tilde{L}) = \tilde{L} \qquad (11)$$

C. Perform the graph convolution according to equation (12):

$$g_\theta *_G x = \sum_{k=0}^{K-1} \theta_k\, T_k(\tilde{L})\, x \qquad (12)$$

where $g_\theta$ denotes the convolution kernel, $*_G$ denotes the graph convolution operation, $\theta_k$ denotes the Chebyshev coefficients (obtained by learning), and $x$ is the input data.
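A sketch of this Chebyshev graph convolution follows; the rescaling of equation (10) and the order K = 3 are assumptions where the source is not explicit.

```python
import torch
import torch.nn as nn

class ChebGraphConv(nn.Module):
    """Chebyshev graph convolution over a learned adjacency, eqs. (10)-(12)."""

    def __init__(self, K=3):
        super().__init__()
        self.theta = nn.Parameter(torch.randn(K))        # Chebyshev coefficients theta_k

    def forward(self, x, A):                             # x: (N, F), A: (N, N)
        D = torch.diag(A.sum(dim=1))                     # degree matrix from A
        L = D - A                                        # Laplacian, L = D - A
        lam_max = torch.linalg.eigvalsh(L).max()
        Lt = 2.0 * L / lam_max - torch.eye(A.size(0))    # scaled Laplacian, eq. (10) (assumed)
        Tk = [torch.eye(A.size(0)), Lt]                  # T_0 = I, T_1 = L~
        for _ in range(2, len(self.theta)):
            Tk.append(2 * Lt @ Tk[-1] - Tk[-2])          # recursion, eq. (11)
        return sum(th * T @ x for th, T in zip(self.theta, Tk))   # eq. (12)
```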
Step (5-4) time convolution
At this layer, a 2D convolution is performed in the time dimension using a 3 × 1 convolution kernel with stride 1 and padding 1 to preserve the input height and width.
Step (5-5) residual join and layer normalization
To alleviate the vanishing-gradient problem and help the network train better, a residual connection is added, followed by layer normalization.
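A sketch of steps (5-4) and (5-5) together: a 3 × 1 temporal 2D convolution (stride 1, padding chosen to preserve height and width), a residual connection, and layer normalization. The channel count and the normalization shape are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TimeConvBlock(nn.Module):
    """3x1 temporal convolution with residual connection and layer normalization."""

    def __init__(self, ch=5, n=62, t=3):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, kernel_size=(3, 1), stride=1, padding=(1, 0))
        self.norm = nn.LayerNorm([ch, t, n])

    def forward(self, x):                       # x: (batch, ch, t, n)
        return self.norm(x + self.conv(x))      # residual join, then layer normalization
```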
Step (6) defining twin network architecture
Step (6-1): obtaining the intermediate-layer features
The two samples of the pair generated in step (4), i.e., input_1 and input_2 (or input_3), are fed in turn into the same base model, producing two intermediate features embedding_1 and embedding_2.
Step (6-2) of calculating the distance between the pair of samples
Compute the distance between the two outputs embedding_1 and embedding_2 of step (6-1), as shown in equation (13):

$$d(\mathrm{emb}_1, \mathrm{emb}_2) = \sqrt{\sum_{c=1}^{C} \sum_{t=1}^{T} \sum_{f=1}^{F} \left(\mathrm{emb}_{1,ctf} - \mathrm{emb}_{2,ctf}\right)^2} \qquad (13)$$

where emb_1 denotes embedding_1, emb_2 denotes embedding_2, $C$ is the number of channels, $T$ is the time length, $F$ is the number of features, and $\mathrm{emb}_{1,ctf}$ (resp. $\mathrm{emb}_{2,ctf}$) is the $f$-th feature element of channel $c$ at time $t$ in embedding_1 (resp. embedding_2);
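A sketch of the pair distance; the Euclidean form follows the later description of d in step (8), while the exact normalization of equation (13) is an assumption.

```python
import torch

def pair_distance(emb_1, emb_2):
    """Euclidean distance between two (C, T, F) embeddings, eq. (13)."""
    return torch.sqrt(((emb_1 - emb_2) ** 2).sum())
```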
step (6-3) Multi-headed self-attention layer
A. Randomly initialize a learnable position matrix $P$ and position-encode embedding_1 according to equation (14):

$$X_{embedding} = \mathrm{embedding\_1} + P \qquad (14)$$
B. Randomly initialize 8 sets (i.e., 8 heads) of matrices $W_q^i, W_k^i, W_v^i$ ($i = 0, 1, \ldots, 7$) and multiply each with embedding_1 (here denoted $X_{embedding}$) to obtain 8 sets of $Q$, $K$, $V$ matrices, as shown in equations (15)-(17):

$$Q_i = X_{embedding} W_q^i \qquad (15)$$
$$K_i = X_{embedding} W_k^i \qquad (16)$$
$$V_i = X_{embedding} W_v^i \qquad (17)$$

with $i = 0, 1, \ldots, 7$.
C. For each head, compute the attention weights by matrix multiplication of $Q$ and $K$, divide by the square root of the first dimension of the $W_k$ matrix, i.e., $\sqrt{d_k}$, and multiply by $V$ to obtain the attention-layer output, yielding 8 matrices $Z_0$-$Z_7$, as shown in equation (18):

$$Z_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right) V_i, \quad i = 0, 1, \ldots, 7 \qquad (18)$$
D. Concatenate the 8 matrices horizontally as $(Z_0, Z_1, \ldots, Z_7)$, randomly initialize a matrix $W^o$, and multiply the two to obtain the matrix $Z$, as shown in equation (19):

$$Z = \mathrm{concatenate}(Z_0, Z_1, Z_2, Z_3, Z_4, Z_5, Z_6, Z_7) \cdot W^o \qquad (19)$$
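A sketch of the 8-head self-attention of equations (14)-(19); the sizes d_model and d_k are illustrative assumptions (d_model must be divisible by the number of heads).

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """8-head self-attention with a learnable position matrix P, eqs. (14)-(19)."""

    def __init__(self, seq_len, d_model, heads=8):
        super().__init__()
        self.P = nn.Parameter(torch.randn(seq_len, d_model))   # position matrix, eq. (14)
        d_k = d_model // heads
        self.Wq = nn.Parameter(torch.randn(heads, d_model, d_k))
        self.Wk = nn.Parameter(torch.randn(heads, d_model, d_k))
        self.Wv = nn.Parameter(torch.randn(heads, d_model, d_k))
        self.Wo = nn.Parameter(torch.randn(heads * d_k, d_model))
        self.d_k = d_k

    def forward(self, emb):                                    # emb: (seq_len, d_model)
        x = emb + self.P                                       # eq. (14)
        q = torch.einsum("sd,hdk->hsk", x, self.Wq)            # eq. (15)
        k = torch.einsum("sd,hdk->hsk", x, self.Wk)            # eq. (16)
        v = torch.einsum("sd,hdk->hsk", x, self.Wv)            # eq. (17)
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(self.d_k), dim=-1)
        z = attn @ v                                           # eq. (18), per head
        z = z.transpose(0, 1).reshape(x.size(0), -1)           # concatenate the 8 heads
        return z @ self.Wo                                     # eq. (19)
```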
step (6-4) fully connecting the layer with the softmax layer
A. Flattening the output Z of the step (6-3) into a one-dimensional vector;
B. obtaining a vector with dimension of 16 through transformation of a full connection layer;
C. Pass through another fully connected layer to obtain a vector of dimension 3, and activate it with the softmax function to obtain the probability that sample input_1 belongs to each category.
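A sketch of this classification head for a single unbatched sample; the input width is resolved lazily since the flattened size of Z is not fixed by the text.

```python
import torch.nn as nn

head = nn.Sequential(
    nn.Flatten(start_dim=0),   # flatten Z into a one-dimensional vector
    nn.LazyLinear(16),         # fully connected layer to dimension 16
    nn.Linear(16, 3),          # fully connected layer to dimension 3
    nn.Softmax(dim=0),         # probability of input_1 belonging to each class
)
```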
Step (7) defining the input and output of the model
The model input is either a positive sample pair (input_1 and input_2) or a negative sample pair (input_1 and input_3). The model has two outputs, output_1 and output_2, where output_1 is the distance between the intermediate features embedding_1 and embedding_2, and output_2 is the probability that sample input_1 belongs to each category.
Step (8) defining an objective function
The final objective function of the model consists of three loss terms. The first, in the adaptive graph learning layer, aims at learning the brain connectivity graph: a loss function constrains the relationship between the feature distance of two channels and their connection strength (the farther apart the features of two channels, the weaker the connection), and since the brain's connection structure is not fully connected, an L2-norm regularization term controls the sparsity of the learned graph. The second is the contrastive loss of the twin network, which pulls the positive pairs of step (4) closer together and pushes the negative pairs farther apart. The third is the cross-entropy loss, which measures the error between the predicted value for input_1 in step (6) and the true sample label.
The final objective function form of the model is shown in equation (20):
$$L = L_{graph\_learn} + \eta L_{contrastive\_loss} + L_{cross\_entropy} \qquad (20)$$
where $\eta$ is a tuning parameter between the two loss terms: the larger $\eta$, the greater the weight of the contrastive loss, and vice versa. The three components of the objective function are shown in equations (21)-(23):

$$L_{graph\_learn} = \sum_{p,q=1}^{N} \|x_p - x_q\|_2^2\, A_{pq} + \lambda \|A\|_2^2 \qquad (21)$$

where $x_p$ is the feature of channel $p$, $x_q$ is the feature of channel $q$, $A_{pq}$ is the connection strength between channels $p$ and $q$, and $\lambda$ is the regularization coefficient.

$$L_{contrastive\_loss} = \frac{1}{2N} \sum \left[ y\, d^2 + (1 - y)\, \max(margin - d, 0)^2 \right] \qquad (22)$$

where $d$ is the Euclidean distance between the embeddings of the two samples $m$ and $n$ in a pair, $y$ is a binary label ($y = 0$ indicates that samples $m$ and $n$ are not from the same emotion, $y = 1$ indicates that they are), $N$ is the number of sample pairs in a batch, and $margin$ is a hyperparameter denoting the distance separating samples of different emotions.

$$L_{cross\_entropy} = -\sum_{i} \sum_{r} y_{i,r} \log \hat{y}_{i,r} \qquad (23)$$

where $y_{i,r}$ indicates whether sample $i$ belongs to category $r$ (1 if so, 0 otherwise), and $\hat{y}_{i,r}$ is the predicted probability that sample $i$ belongs to category $r$.
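The three terms can be sketched as below; the form of equation (21) follows its verbal description (feature-distance-weighted adjacency plus an L2 sparsity term) and is an assumption, while the contrastive and cross-entropy terms follow the standard forms the text describes.

```python
import torch
import torch.nn.functional as F

def graph_learn_loss(x, A, lam=1e-3):
    """x: (N, F) channel features; A: (N, N) learned adjacency. Assumed form of eq. (21)."""
    d2 = torch.cdist(x, x) ** 2                    # squared feature distances between channels
    return (d2 * A).sum() + lam * (A ** 2).sum()   # weaker links for distant channels + L2 sparsity

def contrastive_loss(d, y, margin=1.0):
    """d, y: (B,) pair distances and binary same-emotion labels. Eq. (22)."""
    return (y * d ** 2 + (1 - y) * F.relu(margin - d) ** 2).mean() / 2

def total_loss(x, A, d, y_pair, logits, target, eta=0.5):
    """Composite objective of eq. (20); eta weights the contrastive term."""
    return (graph_learn_loss(x, A)
            + eta * contrastive_loss(d, y_pair)
            + F.cross_entropy(logits, target))     # eq. (23)
```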
Step (9) training and testing
Input the sample pairs generated in step (4) into the model for training. After the model is trained, each sample of the test set is taken as input_1, and input_2 is obtained by randomly initializing a tensor of the same dimensions as input_1, forming an input sample pair. output_2 is used to compute the classification accuracy.
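A sketch of this test-time protocol, where `model` is a hypothetical twin network returning (output_1, output_2); the signature is an assumption for illustration.

```python
import torch

def predict(model, x_test):
    """x_test: one (3, 62, 5) DE sample used as input_1."""
    input_2 = torch.randn_like(x_test)     # random tensor matching input_1's dimensions
    _, output_2 = model(x_test, input_2)   # hypothetical model(input_1, input_2) signature
    return output_2.argmax()               # predicted emotion class from output_2
```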
And (10) evaluating the learning result by using the average classification accuracy.
Step (10-1): evaluate the model using accuracy, i.e., the proportion of correctly classified samples among all samples. Fifteen subjects participated in the experiment; each subject performed three experiments, and each experiment involved watching 15 clips, giving 45 experiments in total. The accuracy of the $i$-th experiment is computed by equation (24):

$$Acc_i = \frac{TP + TN}{TP + TN + FP + FN}, \quad i = 1, 2, \ldots, 45 \qquad (24)$$

where $TP$ is the number of positive samples predicted by the model as positive, $TN$ the number of negative samples predicted as negative, $FP$ the number of negative samples predicted as positive, and $FN$ the number of positive samples predicted as negative.

The average accuracy over the 15 × 2 tested experiments is given by equation (25):

$$\overline{Acc} = \frac{1}{30} \sum_{i=1}^{30} Acc_i \qquad (25)$$

The standard deviation of the experiments is given by equation (26):

$$S = \sqrt{\frac{1}{30} \sum_{i=1}^{30} \left(Acc_i - \overline{Acc}\right)^2} \qquad (26)$$
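A sketch of the evaluation metrics of equations (24)-(26), with 30 runs corresponding to the 15 × 2 tested experiments described below.

```python
import numpy as np

def accuracy(tp, tn, fp, fn):
    """Per-experiment accuracy, eq. (24)."""
    return (tp + tn) / (tp + tn + fp + fn)

accs = np.zeros(30)                  # placeholder for the 30 per-experiment accuracies
mean_acc = accs.mean()               # average accuracy, eq. (25)
std_acc = accs.std()                 # standard deviation, eq. (26)
```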
The accuracy of the invention was verified by training and testing; the comparison between the obtained results and the prior art (SVM, GCNN, DGCNN, BiDANN) is shown in Table 1 below. In the comparison, two experimental results of each subject were used for testing, and the mean of all the obtained accuracy values was used to measure the effect of the model.
From Fig. 5 it can be seen that the accuracy of the method of the present invention is higher than that of the SVM, GCNN, DGCNN, and BiDANN methods.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. It will be apparent to those skilled in the art that various changes, modifications, and substitutions can be made to these embodiments without departing from the principles and spirit of the invention, and such variations still fall within the protection scope of the invention.
Claims (9)
1. An emotion recognition method based on a twin network architecture and graph convolution, characterized in that the method comprises the following steps:
step one, acquiring a data set: acquiring electroencephalogram data of a subject when watching movie fragments, and immediately completing a questionnaire by the subject after watching each movie fragment to report emotional response of the subject to each movie fragment, wherein the emotional response comprises positive, neutral and negative, and the electroencephalogram data are 62-channel electroencephalogram signals of a designated electrode position acquired through a 10-20 international standard lead system;
step two, data preprocessing and feature extraction: carrying out down-sampling and artifact-removing preprocessing on an original EEG signal, filtering a time domain signal by using a Hamming window and carrying out fast Fourier transform, taking a signal at each second as a sliding window, calculating the differential entropy characteristics of 62 channels of 5 frequency bands, and carrying out normalization processing;
step three, sample generation: taking the differential entropy features of the 62 channels over 3 s as one sample, wherein the dimension of one input sample is 3 × 62 × 5 (time × channel × frequency band);
step four, generating sample pairs: traversing each sample generated in step three, taking the sample as the reference and denoting it input_1; randomly selecting a sample under the same emotion, denoted input_2, to form a positive sample pair with input_1; and randomly selecting a sample under a different emotion, denoted input_3, to form a negative sample pair with input_1;
step five, defining a basic model: taking a space-time graph convolution neural network model as a basic model, wherein the basic model comprises an adaptive graph learning layer and a space-time convolution module;
step six, defining a twin network architecture: the two samples of a pair generated in step four, i.e., input_1 and input_2 (or input_3), are fed in turn into the same base model, producing two intermediate features embedding_1 and embedding_2; the distance between embedding_1 and embedding_2 is then computed; a multi-head self-attention layer then extracts deeper features from embedding_1, and after a fully connected layer and a softmax layer the probability that input_1 belongs to each category is output;
step seven, defining the input and output of the model: the model input is a positive sample pair or a negative sample pair; the model has two outputs, output_1 and output_2, where output_1 is the distance between the intermediate features embedding_1 and embedding_2, and output_2 is the probability that sample input_1 belongs to each category;
step eight, defining an objective function: the final objective function of the model consists of three loss functions;
step nine, training and testing: during training, the sample pairs generated in step four are input into the base model for training; during testing, to keep the input scales consistent, each sample of the test set is taken as input_1, and input_2 is obtained by randomly initializing a tensor of the same dimensions as input_1, forming input sample pairs; the classification accuracy is computed using output_2;
and step ten, evaluating the learning result by using the average classification accuracy.
2. The emotion recognition method based on twin network architecture and graph convolution of claim 1, wherein: the adaptive graph learning layer aims to learn the connectivity of the brain network; the spatio-temporal convolution module comprises a spatio-temporal self-attention mechanism, a graph convolution, and an ordinary convolution, wherein the spatio-temporal self-attention mechanism captures the dynamics in the spatial and temporal dimensions under different emotional states, the graph convolution aggregates neighboring nodes, and the ordinary convolution extracts features in the time dimension.
3. The emotion recognition method based on twin network architecture and graph convolution of claim 2, wherein: the three loss functions include:
in the adaptive graph learning layer, a loss function constrains the relationship between the feature distance of two channels and their connection strength: the farther apart the features of two channels, the weaker the connection strength; a regularization term of the L2 norm controls the sparsity of the learned graph;
the contrastive loss of the twin network, which pulls the positive pairs of step four closer together and pushes the negative pairs farther apart;
and the cross-entropy loss, which measures the error between the predicted value for input_1 in step six and the true sample label.
4. A twin network architecture and graph convolution based emotion recognition method according to claim 1, 2 or 3, characterised in that: the first step comprises the following steps:
step (1-1): selecting 15 film clips from a material library as the stimuli used in the experiment, the clips respectively containing positive, neutral, and negative emotions;
step (1-2): the subject watches the 15 film clips in each experiment; a 5-second cue precedes each clip, each clip is presented for 4 minutes, a 45-second self-assessment follows each clip, and a 15-second rest follows; in the self-assessment phase, the subject reports his or her emotional response to each film clip by completing a questionnaire.
5. A twin network architecture and graph convolution based emotion recognition method according to claim 1, 2 or 3, characterised in that: the second step comprises the following steps:
step (2-1): down-sampling the raw signal to 200 Hz and removing segments contaminated by electrooculogram and electromyogram artifacts;
step (2-2): extracting DE features for each channel over 5 frequency bands: delta band (1-3 Hz), theta band (4-7 Hz), alpha band (8-13 Hz), beta band (14-30 Hz), and gamma band (31-50 Hz):
Filtering the original data by adopting a Hamming window, performing fast Fourier transform on the data per second, and calculating the differential entropy of the five frequency bands;
the differential entropy is defined as follows:

let $X = \{x_1, x_2, \ldots, x_n\}$, $n \geq 1$, with corresponding probabilities $p_1, p_2, \ldots, p_n$; according to the definition of Shannon information entropy, the information content of the nondeterministic system is given by equation (1):

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i \qquad (1)$$

the time-domain state probability $p_i$ in the above equation is replaced by the frequency-domain power spectral density $\hat{p}_i$ defined from the fast Fourier transform, yielding the definition of differential entropy as shown in equation (2):

$$DE = -\sum_{i=1}^{n} \hat{p}_i \log \hat{p}_i \qquad (2)$$

step (2-3): normalizing the EEG signal using the z-score, as shown in equation (3):

$$X_{norm} = \frac{X - \bar{X}}{S} \qquad (3)$$

where $X$ is the EEG signal on each channel, $\bar{X}$ is its mean, and $S$ is its standard deviation.
6. A twin network architecture and graph convolution based emotion recognition method according to claim 1, 2 or 3, characterised in that: the fifth step comprises the following steps:
step (5-1) adaptive graph learning:
defining a non-negative adjacency matrix based on the channel features, as shown in equation (4):

$$A_{pq} = g(x_p, x_q), \quad p, q \in \{1, 2, \ldots, N\} \qquad (4)$$

where $A_{pq}$ represents the connection between channel $p$ and channel $q$, i.e., the weight of the edge joining node $p$ and node $q$; $g(x_p, x_q)$ is expressed by learning a weight vector $w$, and the definition of $A$ is shown in equation (5);
Step (5-2) space-time self-attention mechanism:
calculating temporal self-attention: an attention mechanism is used to adaptively capture the dynamic correlation between nodes in the time dimension; transposing the input yields $\chi_h$ with dimensions (62 × 5 × 3), and the temporal attention is defined in equations (6) and (7):

$$T = V_T \cdot \sigma\left(((\chi_h)^T U_1)\, U_2\, (U_3 \chi_h) + b_T\right) \qquad (6)$$

where $T'_{i,j}$ represents the similarity between time $i$ and time $j$; $V_T$, $U_1$, $U_2$, $U_3$, $b_T$ are learnable parameters, and $\sigma$ is the sigmoid activation function;
then calculating spatial attention: an attention mechanism is used to adaptively capture the dynamic correlation between nodes in the spatial dimension; transposing the input yields $\chi_h$ with dimensions (62 × 5 × 3), and the spatial attention is defined in equations (8) and (9):

$$S = V_S \cdot \sigma\left((\chi_h W_1)\, W_2\, (W_3 \chi_h)^T + b_S\right) \qquad (8)$$

where $S'_{p,q}$ represents the similarity between channel $p$ and channel $q$; $V_S$, $W_1$, $W_2$, $W_3$, $b_S$ are learnable parameters, and $\sigma$ is the sigmoid activation function;
step (5-3) spatial convolution:
calculating the Laplacian matrix $L = D - A$, where $A$ is the adjacency matrix learned in step (5-1) and $D$ is the degree matrix computed from $A$, i.e., $D$ is a diagonal matrix of the same dimension as $A$ whose diagonal element $D_{ii}$ is the sum of the $i$-th row of $A$; the scaled Laplacian is given by equation (10):

$$\tilde{L} = \frac{2}{\lambda_{max}} L - I \qquad (10)$$

where $I$ is the identity matrix and $\lambda_{max}$ is the largest eigenvalue of $L$;

recursively computing the Chebyshev polynomials according to equation (11):

$$T_k(\tilde{L}) = 2\tilde{L}\, T_{k-1}(\tilde{L}) - T_{k-2}(\tilde{L}), \quad T_0(\tilde{L}) = I, \; T_1(\tilde{L}) = \tilde{L} \qquad (11)$$

performing the graph convolution according to equation (12):

$$g_\theta *_G x = \sum_{k=0}^{K-1} \theta_k\, T_k(\tilde{L})\, x \qquad (12)$$

where $g_\theta$ denotes the convolution kernel, $*_G$ denotes the graph convolution operation, $\theta_k$ denotes the Chebyshev coefficients (obtained by learning), and $x$ is the input data;
step (5-4) time convolution:
performing a 2D convolution in the time dimension using a 3 × 1 convolution kernel with stride 1 and padding 1 to preserve the input height and width;
step (5-5) residual connection and layer normalization:
adding a residual connection and performing layer normalization.
7. A twin network architecture and graph convolution based emotion recognition method according to claim 1, 2 or 3, characterised in that: the sixth step comprises the following steps:
step (6-1) of obtaining characteristics of the intermediate layer:
feeding the two samples of the pair generated in step four, i.e., input_1 and input_2 (or input_3), in turn into the same base model, producing two intermediate features embedding_1 and embedding_2;
calculating the distance of the sample pair in the step (6-2):
calculating the distance between the two outputs embedding_1 and embedding_2 of step (6-1), as shown in equation (13):

$$d(\mathrm{emb}_1, \mathrm{emb}_2) = \sqrt{\sum_{c=1}^{C} \sum_{t=1}^{T} \sum_{f=1}^{F} \left(\mathrm{emb}_{1,ctf} - \mathrm{emb}_{2,ctf}\right)^2} \qquad (13)$$

where emb_1 denotes embedding_1, emb_2 denotes embedding_2, $C$ is the number of channels, $T$ is the time length, $F$ is the number of features, and $\mathrm{emb}_{1,ctf}$ (resp. $\mathrm{emb}_{2,ctf}$) is the $f$-th feature element of channel $c$ at time $t$ in embedding_1 (resp. embedding_2);
step (6-3) multi-head self-attention layer:
position-encoding embedding_1 according to equation (14):

$$X_{embedding} = \mathrm{embedding\_1} + P \qquad (14)$$

where $P$ is a learnable position matrix;
randomly initializing 8 sets of matrices $W_q, W_k, W_v$ and multiplying each with $X_{embedding}$ to obtain 8 sets of $Q$, $K$, $V$ matrices, as shown in equations (15)-(17):

$$Q_i = X_{embedding} W_q^i \qquad (15)$$
$$K_i = X_{embedding} W_k^i \qquad (16)$$
$$V_i = X_{embedding} W_v^i \qquad (17)$$

with $i = 0, 1, \ldots, 7$, where $W_q^i$, $W_k^i$, $W_v^i$ are learnable matrices;
calculating the attention weights of each head from the $Q$ and $K$ matrices, dividing by the square root of the first dimension of the $W_k$ matrix, i.e., $\sqrt{d_k}$, and multiplying by $V$ to obtain the attention-layer output, yielding 8 matrices $Z_0$-$Z_7$, as shown in equation (18):

$$Z_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right) V_i, \quad i = 0, 1, \ldots, 7 \qquad (18)$$
splicing the 8 matrices together horizontally as $(Z_0, Z_1, \ldots, Z_7)$, then randomly initializing a learnable matrix $W^o$ and multiplying the two to obtain the matrix $Z$, as shown in equation (19):

$$Z = \mathrm{concatenate}(Z_0, Z_1, Z_2, Z_3, Z_4, Z_5, Z_6, Z_7) \cdot W^o \qquad (19)$$
and (6-4) connecting the full connection layer with the softmax layer:
flattening the output Z of step (6-3) into a one-dimensional vector; obtaining a vector of dimension 16 through a fully connected layer; obtaining a vector of dimension 3 through another fully connected layer, and activating with the softmax function to obtain the probability that sample input_1 belongs to each category.
8. A twin network architecture and graph convolution based emotion recognition method according to claim 1, 2 or 3, characterised in that: the eighth step comprises the following steps:
the final objective function of the model is shown in equation (20):

$$L = L_{graph\_learn} + \eta L_{contrastive\_loss} + L_{cross\_entropy} \qquad (20)$$

where $\eta$ is a tuning parameter between the two loss terms: the larger $\eta$, the greater the weight of the contrastive loss, and vice versa; the three components of the objective function are shown in equations (21)-(23):

$$L_{graph\_learn} = \sum_{p,q=1}^{N} \|x_p - x_q\|_2^2\, A_{pq} + \lambda \|A\|_2^2 \qquad (21)$$

where $x_p$ is the feature of channel $p$, $x_q$ is the feature of channel $q$, $A_{pq}$ is the connection strength between channels $p$ and $q$, and $\lambda$ is the regularization coefficient;

$$L_{contrastive\_loss} = \frac{1}{2N} \sum \left[ y\, d^2 + (1 - y)\, \max(margin - d, 0)^2 \right] \qquad (22)$$

where $d$ is the Euclidean distance between the embeddings of the two samples $m$ and $n$ in a pair, $y$ is a binary label ($y = 0$ indicates that samples $m$ and $n$ are not from the same emotion, $y = 1$ indicates that they are), $N$ is the number of sample pairs in a batch, and $margin$ is a hyperparameter denoting the distance separating samples of different emotions;

$$L_{cross\_entropy} = -\sum_{i} \sum_{r} y_{i,r} \log \hat{y}_{i,r} \qquad (23)$$

where $y_{i,r}$ indicates whether sample $i$ belongs to category $r$ (1 if so, 0 otherwise), and $\hat{y}_{i,r}$ is the predicted probability that sample $i$ belongs to category $r$.
9. A twin network architecture and graph convolution based emotion recognition method according to claim 1, 2 or 3, characterised in that: the step ten comprises the following steps:
evaluating the model using accuracy, where accuracy is the proportion of correctly classified samples among all samples; 15 subjects participated in the experiment, each subject performed three experiments, and each experiment involved watching 15 clips, giving 45 experiments in total, so the accuracy of the $i$-th experiment is computed by equation (24):

$$Acc_i = \frac{TP + TN}{TP + TN + FP + FN}, \quad i = 1, 2, \ldots, 45 \qquad (24)$$

the average accuracy over the 15 × 2 tested experiments is given by equation (25):

$$\overline{Acc} = \frac{1}{30} \sum_{i=1}^{30} Acc_i \qquad (25)$$

the standard deviation of the experiments is given by equation (26):

$$S = \sqrt{\frac{1}{30} \sum_{i=1}^{30} \left(Acc_i - \overline{Acc}\right)^2} \qquad (26)$$

where $TP$ is the number of positive samples predicted by the model as positive, $TN$ the number of negative samples predicted as negative, $FP$ the number of negative samples predicted as positive, and $FN$ the number of positive samples predicted as negative.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111617915.3A CN114330436A (en) | 2021-12-22 | 2021-12-22 | Emotion recognition method based on twin network architecture and graph convolution |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114330436A true CN114330436A (en) | 2022-04-12 |
Family
ID=81015432
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110765873A (en) * | 2019-09-19 | 2020-02-07 | 华中师范大学 | Facial expression recognition method and device based on expression intensity label distribution |
CN112686117A (en) * | 2020-12-24 | 2021-04-20 | 华中师范大学 | Face expression intensity recognition method and system based on hidden variable analysis |
CN113017630A (en) * | 2021-03-02 | 2021-06-25 | 贵阳像树岭科技有限公司 | Visual perception emotion recognition method |
KR20210099492A (en) * | 2020-02-04 | 2021-08-12 | 한국과학기술원 | Method and Apparatus for Speech Emotion Recognition Using a Top-Down Attention and Bottom-Up Attention Neural Network |
KR20210139119A (en) * | 2020-05-13 | 2021-11-22 | (주)사맛디 | System, method and program for recobnizing emotion of the object basen on deep-learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110598793B (en) | Brain function network feature classification method | |
CH716863A2 (en) | Depression detection system based on channel selection of multi-channel electroencephalography made using training sets. | |
Han et al. | A multimodal approach for identifying autism spectrum disorders in children | |
CN108959895B (en) | Electroencephalogram EEG (electroencephalogram) identity recognition method based on convolutional neural network | |
CN112990008B (en) | Emotion recognition method and system based on three-dimensional characteristic diagram and convolutional neural network | |
CN111714118A (en) | Brain cognition model fusion method based on ensemble learning | |
CN115804602A (en) | Electroencephalogram emotion signal detection method, equipment and medium based on attention mechanism and with multi-channel feature fusion | |
CN114947883A (en) | Time-frequency domain information fusion deep learning electroencephalogram noise reduction method | |
Jinliang et al. | EEG emotion recognition based on granger causality and capsnet neural network | |
Niu et al. | A brain network analysis-based double way deep neural network for emotion recognition | |
CN111772629A (en) | Brain cognitive skill transplantation method | |
CN113974627B (en) | Emotion recognition method based on brain-computer generated confrontation | |
Ji et al. | Cross-task cognitive workload recognition using a dynamic residual network with attention mechanism based on neurophysiological signals | |
CN114504331A (en) | Mood recognition and classification method fusing CNN and LSTM | |
Schwabedal et al. | Automated classification of sleep stages and EEG artifacts in mice with deep learning | |
Mohi-ud-Din et al. | Detection of Autism Spectrum Disorder from EEG signals using pre-trained deep convolution neural networks | |
CN116662782A (en) | MSFF-SENET-based motor imagery electroencephalogram decoding method | |
CN116662736A (en) | Human body state assessment method based on deep learning hybrid model | |
Vafaei et al. | Extracting a novel emotional EEG topographic map based on a stacked autoencoder network | |
CN114330436A (en) | Emotion recognition method based on twin network architecture and graph convolution | |
CN114081492A (en) | Electroencephalogram emotion recognition system based on learnable adjacency matrix | |
Singh et al. | Emotion recognition using deep convolutional neural network on temporal representations of physiological signals | |
Huang et al. | An Online Teaching Video Evaluation Scheme Based on EEG Signals and Machine Learning | |
Divya et al. | Identification of epileptic seizures using autoencoders and convolutional neural network | |
CN114997315B (en) | Multi-channel electroencephalogram integration-based error related potential classification method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||