CN113128353A - Emotion sensing method and system for natural human-computer interaction - Google Patents

Emotion sensing method and system for natural human-computer interaction

Info

Publication number
CN113128353A
CN113128353A (application CN202110327248.9A)
Authority
CN
China
Prior art keywords: layer, convolution, neural network, emotion, output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110327248.9A
Other languages
Chinese (zh)
Other versions
CN113128353B (en)
Inventor
吕钊
骆轩玉
张超
张磊
李平
胡世昂
裴胜兵
吴小培
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202110327248.9A priority Critical patent/CN113128353B/en
Publication of CN113128353A publication Critical patent/CN113128353A/en
Application granted granted Critical
Publication of CN113128353B publication Critical patent/CN113128353B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/12Classification; Matching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/015Input arrangements based on nervous system activity detection, e.g. brain waves [EEG] detection, electromyograms [EMG] detection, electrodermal response detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Dermatology (AREA)
  • Neurology (AREA)
  • Neurosurgery (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an emotion perception method oriented to natural human-computer interaction, comprising the following steps: S1: collecting scalp electroencephalogram (EEG) signals under positive, neutral and negative emotional states; S2: preprocessing the data collected in step S1; S3: constructing training and test samples; S4: training a deep convolutional neural network; S5: testing the classification effect on the test set. An emotion perception system oriented to natural human-computer interaction is also disclosed. By learning high-dimensional spatio-temporal feature information in the EEG signal with a spatio-temporal deep convolutional neural network, the invention improves the efficiency and accuracy of the emotion recognition task. The invention avoids the loss of useful EEG feature information, and the need to verify the usefulness of the extracted features, that arise when features are extracted first and then fed into a convolutional neural network; it thereby reduces the complexity of emotion recognition and improves its efficiency and accuracy.

Description

Emotion sensing method and system for natural human-computer interaction
Technical Field
The invention relates to the technical field of human-computer interaction, in particular to a natural human-computer interaction oriented emotion perception method.
Background
With the popularization of intelligent applications, machines play an important role in human life, and their service functions and social capabilities have attracted wide attention in recent years; how to improve the interactivity of machines is an important topic in the field of human-computer interaction. According to emotional psychology, the emotional changes a user undergoes during human-computer interaction, such as excitement, calm and anger, are psychological activities that reflect the user's subjective experience of use. In human-computer interaction the user not only gives instructions but, more importantly, is psychologically engaged, and a natural human-computer interaction method based on emotion perception plays an important role in improving human participation and satisfaction and in enriching the interaction experience.
Traditional natural human-computer interaction methods are mainly based on emotion perception from speech audio and images. For example, a robot may convert human speech into text for analysis: Taboada et al. proposed a dictionary-based method for sentiment analysis of text at the document and sentence levels to obtain emotion classifications, and Machajdik and Hanbury extracted compositional features of images and computed emotion by studying the image regions of human interest. In recent years, with the rise of artificial intelligence, researchers have combined emotion perception methods based on speech and image analysis with machine learning, and emotion perception methods based on deep learning have become mainstream.
The EEG signals generated by a user during human-computer interaction carry rich emotional information, and capturing the emotional features in the EEG signal with deep learning and classifying them enables a machine to perceive the user's emotional changes. The methods adopted in the prior art first extract features and then feed the extracted features into a convolutional neural network, which easily loses part of the useful EEG feature information and requires the usefulness of the extracted features to be verified; a deep convolutional neural network framework, and its effectiveness for EEG emotion recognition, are therefore worth exploring in depth.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an emotion perception method and system oriented to natural human-computer interaction, which can classify specific emotion tasks and improve the accuracy and efficiency of emotion classification.
In order to solve the above technical problems, the invention adopts the following technical solution: an emotion perception method oriented to natural human-computer interaction, comprising the following steps:
S1: collecting scalp electroencephalogram (EEG) signals under positive, neutral and negative emotional states;
S2: preprocessing the data collected in step S1;
S3: constructing training and test samples: dividing the preprocessed EEG signal into non-overlapping equal-length samples in units of 1 s, and dividing a training set and a test set according to a five-fold cross-validation scheme;
S4: training a deep convolutional neural network: inputting the divided training set into the deep convolutional neural network, using the convolution kernel structure of the network to extract high-dimensional temporal and spatial information from the raw EEG signal, updating and optimizing the network parameters of each layer by gradient descent and back-propagation, using learning-rate decay so that the network becomes more stable in the later stage of training, and stopping training when the loss function converges to a set condition;
S5: testing the classification effect on the test set: inputting the test-set samples into the trained deep convolutional neural network to obtain emotion recognition results, and checking the results against the labels of the samples to obtain the accuracy of the experiment; five-fold cross-validation is used, and the accuracies of the individual experiments are averaged to obtain the final accuracy.
In a preferred embodiment of the present invention, the step S2 includes the following steps:
sequentially removing the electro-ocular components in the acquired original signals and invalid and damaged segments in the electroencephalogram signals, and then removing noise and equipment artifacts of the electroencephalogram signals by using a band-pass filter and a notch filter.
In a preferred embodiment of the present invention, in step S4 the deep convolutional neural network is composed of six layers: a convolutional layer L0, a first convolution block layer L1, a second convolution block layer L2, a third fully-connected block layer L3, a fourth convolutional layer L4, and a fifth output layer L5.
Further, the L0 layer is a convolutional layer whose input is the training set of input samples constructed in step S3. Let w^0_{m,u} denote the weights of the L0 convolutional layer and b^0_m its biases; U_1 denotes the width of the convolution kernel (U_1 = 5); C^0_{m,(p,q)} denotes the intermediate variable produced by the layer-0 convolution; and K_0 denotes the number of feature maps (K_0 = 6). The convolution is

C^0_{m,(p,q)} = Σ_{u=0}^{U_1-1} w^0_{m,u} · I_{0,(p,q+u)} + b^0_m (1)

σ represents the activation function, for which the ReLU (rectified linear unit) is used:

σ(x) = max(0, x), x > 0 (2)

where x denotes the input to the ReLU. The final output of the L0 layer is

I_{1,m,(p,q)} = σ(C^0_{m,(p,q)}) = max(0, C^0_{m,(p,q)}) (3)

where C^0_{m,(p,q)} denotes the L0-layer intermediate variable, I_{1,m,(p,q)} denotes the final output of layer L0, and max takes the larger of 0 and x.
Further, the L1 layer is the first convolution block layer. Convolution is performed three times with kernels of size 1 × 3, and the matrix after each convolution is activated with the ReLU activation function, giving feature-map outputs of size 62 × 48 × 12, 62 × 48 × 12 and 62 × 48 × 12 respectively; the output is then batch-normalized so that the mean and variance of each input matrix remain consistent, and finally each feature-map matrix is reduced in dimension by max pooling with a 2 × 2 kernel, giving the final output of the L1 layer, a matrix of dimension 31 × 24 × 12.
Further, the L2 layer is the second convolution block layer. Convolution is performed three times with kernels of size 3 × 3, and the matrix after each convolution is activated with the ReLU activation function, giving feature-map outputs of size 29 × 22 × 24, 27 × 20 × 24 and 25 × 18 × 24 respectively; the output is then batch-normalized so that the mean and variance of each input matrix remain consistent, and finally each feature-map matrix is reduced in dimension by max pooling with a 2 × 2 kernel, giving the final output of the L2 layer, a matrix of dimension 12 × 9 × 24.
Further, the L3 layer is a fully-connected block layer composed of 6 convolution block layers. In each block, convolution is performed with kernels of size 1 × 1 and 5 × 5 in sequence, and the matrix after convolution is activated with the ReLU activation function, giving feature maps of size 12 × 9 × 96 and 12 × 9 × 24 per block and finally an output of size 12 × 9 × 168; the output is then batch-normalized and subjected to a random deactivation (dropout) operation, and finally each feature-map matrix is reduced in dimension by max pooling with a 3 × 3 kernel, giving the final output of the L3 layer, a matrix of dimension 4 × 3 × 168.
Further, the L4 layer is a convolutional layer: a global pooling operation yields a feature map of size 1 × 168, and a convolution with a 1 × 1 kernel yields an output of dimension 1 × 3; this matrix is then flattened into a one-dimensional vector of dimension 1 × 3.
Further, the L5 layer is the final output layer: the vector is input into the fully-connected layer and the final prediction result is output with a softmax function, giving a one-dimensional vector of dimension 1 × 3 that represents the probabilities of the three predicted emotions; the higher the probability of a given emotion in the result, the more likely the sample belongs to that emotion. Feedback correction is performed by comparison with the true label; the network parameters of each layer of the model are updated and optimized by batch training with gradient descent and back-propagation, learning-rate decay is used so that the network becomes more stable in the later stage of training, and training stops when the loss function converges to a set condition.
In order to solve the above technical problems, the invention adopts another technical solution: an emotion perception system oriented to natural human-computer interaction, mainly comprising:
a data acquisition module for acquiring scalp EEG signals in positive, neutral and negative emotional states;
a data preprocessing module for preprocessing the data acquired by the data acquisition module;
a sample construction module for dividing the EEG signals preprocessed by the data preprocessing module into non-overlapping equal-length samples in units of 1 s, and dividing a training set and a test set according to a five-fold cross-validation scheme;
a deep convolutional neural network composed of six layer modules: a convolutional layer L0, a first convolution block layer L1, a second convolution block layer L2, a third fully-connected block layer L3, a fourth convolutional layer L4 and a fifth output layer L5;
a model training module for inputting the divided training set into the deep convolutional neural network, using the convolution kernel structure of the network to extract high-dimensional temporal and spatial information from the raw EEG signal, updating and optimizing the network parameters of each layer by gradient descent and back-propagation, using learning-rate decay so that the network becomes more stable in the later stage of training, and stopping training when the loss function converges to a set condition;
a test module for inputting the test-set samples into the trained deep convolutional neural network to obtain emotion recognition results and checking them against the labels of the samples to obtain the accuracy of the experiment; five-fold cross-validation is used, and the accuracies of the individual experiments are averaged to obtain the final accuracy.
The invention has the beneficial effects that:
(1) Compared with EEG signals that have already undergone feature extraction, the preprocessed raw EEG signal contains the richest temporal and spatial information; by selecting a deep convolutional neural network structure capable of capturing the high-dimensional temporal and spatial features of the EEG signal, the EEG channels relevant to the emotion classification task are better characterized by their spatio-temporal features. Performing feature extraction and classification in a single model yields more relevant EEG feature representations while reducing the complexity of the emotion recognition method, thereby improving the efficiency and accuracy of emotion recognition;
(2) The invention uses convolution block layers at the L1 and L2 layers; these block layers process the EEG samples hierarchically, using convolution kernels of different sizes to extract higher-dimensional features from the single-channel EEG signals and from the two-dimensional EEG representation respectively. In the L3 layer, the feature maps of each convolution block are first compressed by a 1 × 1 convolution and then learned by a 5 × 5 convolution, and the output of each convolution block is stacked with the input of the current layer to serve as the input of the next layer, which gives the model stronger resistance to overfitting and better generalization. A pooling operation is used at the end of each layer to reduce computational cost while increasing the generalization ability of the model.
Drawings
FIG. 1 is a schematic diagram of the emotion signal generation and collection process;
fig. 2 is a schematic diagram of the electrode arrangement position of the wearable device;
FIG. 3 is a flow chart of the emotion perception method for natural human-computer interaction according to the present invention;
FIG. 4 is a schematic diagram of the hierarchy of the deep convolutional neural network structure;
fig. 5 is a block diagram of the emotion perception system facing natural human-computer interaction.
Detailed Description
The following detailed description of preferred embodiments of the invention, taken in conjunction with the accompanying drawings, will make the advantages and features of the invention easier for those skilled in the art to understand and thus more clearly define the scope of protection of the invention.
Referring to fig. 3, an embodiment of the invention includes:
a natural human-computer interaction-oriented emotion perception method comprises the following steps:
S1: collecting scalp EEG signals under positive, neutral and negative emotional states. With reference to fig. 1 and fig. 2, a subject wearing the wearable acquisition device sits in a quiet experimental environment free of significant interference and watches video clips of different emotional tones, which elicit different emotion categories such as positive, neutral and negative; EEG signals carrying these different emotional tones are acquired and recorded while the videos are watched.
S2: preprocessing the data collected in the step S1;
the method comprises the following specific steps: removing the electro-oculogram component in the electroencephalogram signal by ICA and other methods, and manually removing the damaged and invalid segment parts in the electroencephalogram signal by observation; then, a band-pass filter between 1 Hz and 50Hz is used for filtering the electroencephalogram signals, and then a 45Hz notch filter is used for processing the signals.
S3: constructing training and test samples: dividing the preprocessed EEG signal into non-overlapping equal-length samples in units of 1 s, and dividing a training set and a test set according to a five-fold cross-validation scheme;
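A sketch of this segmentation and split, under the same assumptions as the preprocessing sketch above (the window length fs and the shuffling are illustrative choices):

```python
from sklearn.model_selection import KFold

def make_samples(eeg: np.ndarray, fs: int = FS) -> np.ndarray:
    """Cut [channels, samples] into non-overlapping 1 s segments of shape [n, channels, fs]."""
    n = eeg.shape[-1] // fs
    return np.stack([eeg[:, i * fs:(i + 1) * fs] for i in range(n)])

def five_fold_indices(num_samples: int):
    """Yield (train_idx, test_idx) index pairs for five-fold cross-validation."""
    yield from KFold(n_splits=5, shuffle=True, random_state=0).split(np.arange(num_samples))
```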
S4: training the deep convolutional neural network: inputting the divided training set into the deep convolutional neural network, using the convolution kernel structure of the network to extract high-dimensional temporal and spatial information from the raw EEG signal, updating and optimizing the network parameters of each layer by gradient descent and back-propagation, using learning-rate decay so that the network becomes more stable in the later stage of training, and stopping training when the loss function converges to a set condition;
With reference to fig. 4, the deep convolutional neural network is composed of six layer modules: a convolutional layer L0, a first convolution block layer L1, a second convolution block layer L2, a third fully-connected block layer L3, a fourth convolutional layer L4 and a fifth output layer L5. The first and second convolution block layers each contain three convolutional layers; the third layer is a fully-connected block layer that extracts the information representing the temporal and spatial features in a sample; the fourth convolutional layer reduces the dimension of the feature map. To prevent overfitting, a dropout layer is added after the fully-connected block layer, which randomly deactivates neurons in the network with probability 0.5; a flatten operation is then performed and the resulting one-dimensional vector is input into the fully-connected layer. The ReLU rectified linear unit is selected as the activation function in the convolutional neural network; max pooling is used as the pooling operation of the first and second convolution block layers, and average pooling is selected for the pooling operation of the third block layer. The fifth layer uses a softmax function to obtain the final result, output as the probability of each emotion category for the classified sample, and the network parameters of each layer of the model are optimized by reducing, through the loss function, the difference between these probabilities and the corresponding labels. The network parameters of each layer are updated and optimized by gradient descent and back-propagation, learning-rate decay and similar techniques are used so that the network becomes more stable in the later stage of training, and training stops when the loss function converges to the set condition. The details of each layer are as follows:
L0, input layer. The training samples constructed through steps 1, 2 and 3 are input into the convolutional neural network; the dimension of a training sample is 62 × 100, where 62 is the number of leads (channels) and 100 is the number of sample points of the matrix. Convolution is performed with a kernel of size 1 × 5 and the matrix after convolution is activated with the ReLU activation function, giving an output of dimension 62 × 96 × 6; the output is batch-normalized, and each feature-map matrix is reduced in dimension by max pooling with a 1 × 2 kernel, giving the final output of the L0 layer, a matrix of dimension 62 × 48 × 6.
L1, first convolution block layer. Convolution is performed three times with kernels of size 1 × 3 and the matrix after each convolution is activated with the ReLU activation function, giving feature-map outputs of size 62 × 48 × 12, 62 × 48 × 12 and 62 × 48 × 12 respectively; the output is batch-normalized so that the mean and variance of each input matrix remain consistent, and finally each feature-map matrix is reduced in dimension by max pooling with a 2 × 2 kernel, giving the final output of the L1 layer, a matrix of dimension 31 × 24 × 12.
L2, second convolution block layer. Convolution is performed three times with kernels of size 3 × 3 and the matrix after each convolution is activated with the ReLU activation function, giving feature-map outputs of size 29 × 22 × 24, 27 × 20 × 24 and 25 × 18 × 24 respectively; the output is batch-normalized so that the mean and variance of each input matrix remain consistent, and finally each feature-map matrix is reduced in dimension by max pooling with a 2 × 2 kernel, giving the final output of the L2 layer, a matrix of dimension 12 × 9 × 24.
L3, third fully-connected block layer. Six convolution block layers are used; in each block, convolution is performed with kernels of size 1 × 1 and 5 × 5 in sequence and the matrix after convolution is activated with the ReLU activation function, giving feature maps of size 12 × 9 × 96 and 12 × 9 × 24 per block and finally an output of size 12 × 9 × 168. The output is then batch-normalized and subjected to a random deactivation (dropout) operation, and finally each feature-map matrix is reduced in dimension by max pooling with a 3 × 3 kernel, giving the final output of the L3 layer, a matrix of dimension 4 × 3 × 168.
L4, convolutional layer. Performing a global pooling operation to obtain a feature map of size 1 x 168, performing a convolution operation using a convolution kernel of size 1 x 1 to obtain an output of dimension 1 x 3. Then, performing a flatten operation on the matrix to obtain a one-dimensional vector with the dimension of 1 × 3.
L5, final output layer. The vector is input into the fully-connected layer and the final prediction result is output with a softmax function, giving a one-dimensional vector of dimension 1 × 3 that represents the probabilities of the three predicted emotions; the higher the probability of a given emotion in the result, the more likely the sample belongs to that emotion. Feedback correction is performed by comparison with the true label; the network parameters of each layer of the model are updated and optimized by batch training with gradient descent and back-propagation, learning-rate decay is used so that the network becomes more stable in the later stage of training, and training stops when the loss function converges to a set condition.
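For orientation only, the following PyTorch sketch reproduces the layer sequence described above (L0 convolution, two convolution block layers, a densely stacked block, global pooling with a 1 × 1 convolution, and a softmax output). It is a minimal illustration, not the patented implementation: padding, pooling strides and the exact placement of batch normalization and dropout are assumptions chosen so that the intermediate sizes roughly match the dimensions given in this embodiment.

```python
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    """Rough sketch of the L0-L5 pipeline for [batch, 1, 62, time] EEG samples."""
    def __init__(self, classes: int = 3):
        super().__init__()
        # L0: 1x5 convolution over time, batch norm, 1x2 max pooling
        self.l0 = nn.Sequential(
            nn.Conv2d(1, 6, (1, 5)), nn.ReLU(),
            nn.BatchNorm2d(6), nn.MaxPool2d((1, 2)))
        # L1: three 1x3 convolutions ("same" padding assumed), batch norm, 2x2 max pooling
        self.l1 = nn.Sequential(
            nn.Conv2d(6, 12, (1, 3), padding=(0, 1)), nn.ReLU(),
            nn.Conv2d(12, 12, (1, 3), padding=(0, 1)), nn.ReLU(),
            nn.Conv2d(12, 12, (1, 3), padding=(0, 1)), nn.ReLU(),
            nn.BatchNorm2d(12), nn.MaxPool2d(2))
        # L2: three 3x3 convolutions without padding, batch norm, 2x2 max pooling
        self.l2 = nn.Sequential(
            nn.Conv2d(12, 24, 3), nn.ReLU(),
            nn.Conv2d(24, 24, 3), nn.ReLU(),
            nn.Conv2d(24, 24, 3), nn.ReLU(),
            nn.BatchNorm2d(24), nn.MaxPool2d(2))
        # L3: six densely stacked bottleneck blocks (1x1 then 5x5), dropout, 3x3 max pooling
        self.l3_blocks = nn.ModuleList(
            [self._bottleneck(24 + i * 24, growth=24) for i in range(6)])
        self.l3_tail = nn.Sequential(nn.Dropout(0.5), nn.MaxPool2d(3))
        # L4/L5: global average pooling, 1x1 convolution to 3 values, softmax
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(24 + 6 * 24, classes, 1))

    @staticmethod
    def _bottleneck(in_ch: int, growth: int) -> nn.Module:
        return nn.Sequential(
            nn.Conv2d(in_ch, 96, 1), nn.ReLU(),
            nn.Conv2d(96, growth, 5, padding=2), nn.ReLU(),
            nn.BatchNorm2d(growth))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.l2(self.l1(self.l0(x)))
        for block in self.l3_blocks:
            x = torch.cat([x, block(x)], dim=1)   # stack each block output with its input
        x = self.l3_tail(x)
        logits = self.head(x).flatten(1)          # [batch, 3]
        return torch.softmax(logits, dim=1)       # probabilities of the three emotions
```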
Cross entropy is used as the loss function when training the entire network, in order to evaluate the classification loss. Its formula is:

E(ŷ, o; α′) = − Σ_i o_i · log(ŷ_i)

where ŷ denotes the final output generated for a sample by the softmax function, α′ denotes the set of hyper-parameters optimized by back-propagation of the loss function during training of the neural network, o denotes the expected output of the neural network, i.e. the label corresponding to the sample, and o_i denotes the result corresponding to a given class, the classes corresponding to the three emotions negative, neutral and positive.
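A minimal training loop matching this description (cross-entropy loss, back-propagation, learning-rate decay) might look as follows; the Adam optimizer, the step decay schedule and all numeric values are assumptions, since the patent only names gradient descent, back-propagation and learning-rate decay in general terms.

```python
import torch
import torch.nn.functional as F

def train(model, loader, epochs: int = 50, lr: float = 1e-3, device: str = "cpu"):
    """Batch training with back-propagation, step learning-rate decay and cross-entropy loss."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
    for epoch in range(epochs):
        total = 0.0
        for x, y in loader:                                # x: [B, 1, 62, T], y: [B]
            x, y = x.to(device), y.to(device)
            probs = model(x)                               # softmax probabilities from L5
            loss = F.nll_loss(torch.log(probs + 1e-12), y) # cross-entropy on probabilities
            optimizer.zero_grad()
            loss.backward()                                # back-propagation
            optimizer.step()
            total += loss.item()
        scheduler.step()                                   # learning-rate decay
        # training would be stopped here once `total` converges to the set condition
```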
S5: testing the classification effect on the test set: the test-set samples are input into the trained deep convolutional neural network to obtain the emotion recognition results, which are checked against the labels of the samples to obtain the accuracy of the experiment; five-fold cross-validation is used, and the accuracies of the individual experiments are averaged to obtain the final accuracy. The specific steps are as follows:

The samples processed through steps S1, S2 and S3 are divided evenly into 5 folds; in each test, one fold in turn serves as the test set of the experiment and the remaining four folds serve as the training set, and the results of the five-fold cross-validation are averaged to obtain the accuracy of the real experimental result. With the mean over the five folds taken as the experimental result, the accuracy is calculated as:

acc = (1/n) · Σ_{k=1}^{n} ( Σ [ŷ_{k,test} = o_{k,test}] / I_{k,test} )

where n denotes the number of folds (here n = 5); o_{k,test} denotes the labels corresponding to the samples of the k-th test set; ŷ_{k,test} denotes the output of the k-th-fold neural network, so that Σ [ŷ_{k,test} = o_{k,test}] counts the test samples whose prediction matches the label; and I_{k,test} denotes the total number of samples in the k-th test set.
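The averaging in this step can be written as a short routine; the sketch below assumes the hypothetical `EmotionCNN`, `train` and `five_fold_indices` helpers introduced earlier and tensors `samples` ([N, 1, 62, T]) and `labels` ([N]).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def evaluate_five_fold(samples, labels, splits, make_model):
    """Train on four folds, test on the fifth, and average the per-fold accuracy."""
    accs = []
    for train_idx, test_idx in splits:
        model = make_model()
        train_ds = TensorDataset(samples[train_idx], labels[train_idx])
        train(model, DataLoader(train_ds, batch_size=32, shuffle=True))
        model.eval()
        with torch.no_grad():
            pred = model(samples[test_idx]).argmax(dim=1)
        accs.append((pred == labels[test_idx]).float().mean().item())
    return sum(accs) / len(accs)   # final accuracy = mean over the five folds
```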
The specific model of the deep convolutional neural network is described in detail below.

L0: the input of the L0 layer is the training set of input samples constructed in step 3. I_{l,m,(p,q)} denotes the input to the convolutional unit at position (p, q) of the m-th feature map of layer l, so the input here is the matrix I_{0,m,(p,q)}. Let w^0_{m,u} denote the weights of the L0 convolutional layer and b^0_m its biases; U_1 denotes the width of the convolution kernel, where U_1 = 5; C^0_{m,(p,q)} denotes the intermediate variable produced by the layer-0 convolution; and K_0 denotes the number of feature maps, K_0 = 6. The sample width is denoted by C, where C = 62, and the sample length by n, where n = 100, so the input sample size is 62 × 100. Specifically, 6 convolution kernels of size 1 × 5 extract features from the L0 input I_{0,m,(p,q)}; after convolution and the activation function, a 62 × 96 × 6 output is obtained, namely:

C^0_{m,(p,q)} = Σ_{u=0}^{U_1-1} w^0_{m,u} · I_{0,(p,q+u)} + b^0_m (1)

σ represents the activation function; the ReLU (rectified linear unit) is used, with x denoting its input:

σ(x) = max(0, x), x > 0 (2)

Specifically, this layer then reduces the feature dimension by max pooling with a 1 × 2 kernel, giving the final output I_{1,m,(p,q)} of the L0 layer, of dimension 62 × 48 × 6:

I_{1,m,(p,q)} = maxpool_{1×2}( max(0, C^0_{m,(p,q)}) ) (3)

where C^0_{m,(p,q)} denotes the L0-layer intermediate variable, I_{1,m,(p,q)} denotes the final output of the L0 layer, and max takes the larger of 0 and x.
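As a quick editorial consistency check of the sizes quoted above (this arithmetic is not part of the original patent text): the 1 × 5 convolution without padding and the 1 × 2 max pooling act on the 100-point time axis as

100 − 5 + 1 = 96 and 96 / 2 = 48,

so, with K_0 = 6 feature maps, the L0 output is the stated 62 × 48 × 6.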
L1: the input is the matrix I_{1,m,(p,q)}. Let w^{1,α} denote the weights of the α-th convolution of the L1 layer, where 1 ≤ α ≤ 3, and b^{1,β} the biases of the β-th convolution, where 1 ≤ β ≤ 3; U_1 denotes the width of the convolution kernel, where U_1 = 3; C^{1,i} denotes the intermediate variable after the i-th convolution of the layer, where 1 ≤ i ≤ 3; and A^{1,j} denotes the intermediate variable after the activation function, where 1 ≤ j ≤ 3. Each convolution has the same form as formula (1).

In the first convolution, U_1 = 3 is the kernel width and K_1 = 12 is the number of feature maps: specifically, 12 convolution kernels of size 1 × 3 extract features from the L1 input I_{1,m,(p,q)}; after convolution and the activation function, a 62 × 48 × 12 output A^{1,1} is obtained.

In the second convolution, U_2 = 3 and K_2 = 12: in the invention, 12 convolution kernels of size 1 × 3 extract features from the feature map A^{1,1}; after convolution and the activation function, a 62 × 48 × 12 output A^{1,2} is obtained.

In the third convolution, U_3 = 3 and K_3 = 12: A^{1,3} is the output of the activation function, and passing A^{1,3} through batch normalization gives B^{1}_{m,(p,q)}.

I_{2,m,(p,q)} is the output of the pooling operation with a 2 × 2 kernel applied to B^{1}_{m,(p,q)}; in the invention, I_{2,m,(p,q)} has dimension 31 × 24 × 12.
L2: the input is the matrix I_{2,m,(p,q)}. Let w^{2,α} denote the weights of the α-th convolution of the L2 layer, where 1 ≤ α ≤ 3, and b^{2,β} the biases of the β-th convolution, where 1 ≤ β ≤ 3; C^{2,i} denotes the intermediate variable after the i-th convolution of the layer, where 1 ≤ i ≤ 3; and A^{2,j} denotes the intermediate variable after the activation function, where 1 ≤ j ≤ 3.

In the first convolution, U_1 = 3 denotes the width of the convolution kernel, U_2 = 3 its height, and K_1 = 24 the number of feature maps: specifically, 24 convolution kernels of size 3 × 3 extract features from I_{2,m,(p,q)}; after convolution and the activation function, an output A^{2,1} of dimension 29 × 22 × 24 is obtained.

In the second convolution, K_2 = 24: in the invention, 24 convolution kernels of size 3 × 3 extract features from the feature map A^{2,1}; after convolution and the activation function, an output A^{2,2} of dimension 27 × 20 × 24 is obtained.

In the third convolution, K_3 = 24: A^{2,3} is the output of the activation function, and B^{2}_{m,(p,q)} is the output obtained from A^{2,3} by batch normalization. In the invention, 24 convolution kernels of size 3 × 3 extract features from the feature map A^{2,2}; after convolution, the activation function and batch normalization, an output B^{2}_{m,(p,q)} of dimension 25 × 18 × 24 is obtained.

I_{3,m,(p,q)} is the output of the max pooling operation with a 2 × 2 kernel applied to B^{2}_{m,(p,q)}, where p = 2n with 0 ≤ n ≤ p/2; this is the final output of the L2 layer, and in the invention I_{3,m,(p,q)} has dimension 12 × 9 × 24.
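A similar editorial check for the L2 block (again not part of the original text): starting from the 31 × 24 output of L1, each unpadded 3 × 3 convolution removes two rows and two columns,

31 × 24 → 29 × 22 → 27 × 20 → 25 × 18,

and the final 2 × 2 max pooling gives ⌊25/2⌋ × ⌊18/2⌋ = 12 × 9, matching the stated 12 × 9 × 24 output.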
L3: the input is the matrix I_{3,m,(p,q)}. The L3 layer is composed of convolution block layers and a transition layer.

Convolution block layer CONV: let w^{3,α} denote the weights of the layer-3 convolutional neural network, where 1 ≤ α ≤ 2, and b^{3,β} its biases, where 1 ≤ β ≤ 2; U_1 denotes the width of the first convolution kernel, U_1 = 1, and U_2 its height, U_2 = 1; G_1 denotes the number of feature maps, G_1 = 4 × growth_rate, where growth_rate denotes the growth rate of the convolution block and growth_rate = 24; C^{3,i} denotes the intermediate variable after the i-th convolution of the block, where 1 ≤ i ≤ 2; and A^{3,j} denotes the intermediate variable after the activation function, where 1 ≤ j ≤ 2. In the invention, the convolution block extracts features from the feature map I_{3,m,(p,q)} with 96 convolution kernels of size 1 × 1; after convolution and the activation function, an output A^{3,1} of dimension 12 × 9 × 96 is obtained.

U_3 denotes the width of the second convolution kernel, U_3 = 5, and U_4 its height, U_4 = 5; G_2 denotes the number of feature maps, G_2 = growth_rate. In the invention, the convolution block then extracts features from A^{3,1} with G_2 = 24 convolution kernels of size 5 × 5; after batch normalization and the activation function, an output A^{3,2} of dimension 12 × 9 × 24 is obtained.

B denotes the number of convolution block layers, B = 6; x_i denotes the output of the i-th convolution block layer, and [x_1, x_2, ..., x_i] denotes the stacking (concatenation) of the outputs of the first i convolution blocks; that is, the output of each block is stacked with the input of the current layer to form the input of the next block. The stacked output is passed through batch normalization and a random deactivation (dropout) operation, where γ denotes the deactivation probability of the dropout operation, γ = 0.5. In the invention, the 6 fully-connected block layers extract features from the feature map I_{3,m,(p,q)} in this way; after convolution and the activation functions, an output of dimension 12 × 9 × 168 is obtained.

Transition layer: I_{4,m,(p,q)} is the output of the pooling operation with a 3 × 3 kernel applied to this output, where p = 3n with 0 ≤ n ≤ p/3; this is the final output of the L3 layer, and in the invention I_{4,m,(p,q)} has dimension 4 × 3 × 168.
L4: the input is the matrix I_{4,m,(p,q)}; w_{4,(x,1)} denotes the network weight at the corresponding vector position and b_4 denotes the network bias of the L4 layer. The input is first subjected to global average pooling, giving an intermediate output of dimension 1 × 168; the matrix is passed through a pooling operation with a 1 × 1 kernel, and a convolution operation with a 1 × 1 kernel, using the weights w_{4,(x,1)} and bias b_4, is then applied to the intermediate output, giving an output of dimension 1 × 3. In the invention, this matrix is then flattened to obtain a one-dimensional vector of dimension 1 × 3.
L5: first, the one-dimensional vector I_{5,(r,1)} is input into the fully-connected layer to obtain an output vector z_5; the vector z_5 is then input into the softmax layer to obtain the prediction result ŷ for the input sample, which represents the probabilities of the three predicted emotion distributions:

ŷ_s = e^{z_{5,s}} / Σ_{j=1}^{3} e^{z_{5,j}}

where w_{5,(r,1)} denotes the network weight corresponding to the vector I_{5,(r,1)}, b_5 denotes the network bias of the L5 layer, and s indexes the elements of the output vector z_5, with 1 ≤ s ≤ 3. The components o_s of the expected output denote the results corresponding to a given class, the classes corresponding to the three emotions negative, neutral and positive.
When training the network, cross entropy is used as the loss function to evaluate the classification loss. Its formula is:

E(ŷ, o; α′) = − Σ_{i=1}^{3} o_i · log(ŷ_i)

where ŷ denotes the prediction result generated by the neural network for the input sample, α′ denotes the set of hyper-parameters of the neural network after optimization by back-propagation of the loss function during training, o denotes the expected output of the neural network, i.e. the label corresponding to the sample, and i denotes the i-th emotion, i = 1, 2, 3.
With reference to fig. 5, an embodiment of the invention further provides an emotion perception system oriented to natural human-computer interaction, mainly comprising:
a data acquisition module for acquiring scalp EEG signals in positive, neutral and negative emotional states;
a data preprocessing module for preprocessing the data acquired by the data acquisition module;
a sample construction module for dividing the EEG signals preprocessed by the data preprocessing module into non-overlapping equal-length samples in units of 1 s, and dividing a training set and a test set according to a five-fold cross-validation scheme;
a deep convolutional neural network composed of six layer modules: a convolutional layer L0, a first convolution block layer L1, a second convolution block layer L2, a third fully-connected block layer L3, a fourth convolutional layer L4 and a fifth output layer L5;
a model training module for inputting the divided training set into the deep convolutional neural network, using the convolution kernel structure of the network to extract high-dimensional temporal and spatial information from the raw EEG signal, updating and optimizing the network parameters of each layer by gradient descent and back-propagation, using learning-rate decay so that the network becomes more stable in the later stage of training, and stopping training when the loss function converges to a set condition;
a test module for inputting the test-set samples into the trained deep convolutional neural network to obtain emotion recognition results and checking them against the labels of the samples to obtain the accuracy of the experiment; five-fold cross-validation is used, and the accuracies of the individual experiments are averaged to obtain the final accuracy.
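To make the module decomposition concrete, the fragment below is an editorial sketch that chains the hypothetical `preprocess`, `make_samples`, `five_fold_indices`, `EmotionCNN`, `train` and `evaluate_five_fold` helpers introduced earlier; the data source and label handling are placeholders, not part of the patent.

```python
import numpy as np
import torch

def run_pipeline(raw_eeg: np.ndarray, labels: np.ndarray) -> float:
    """Chain preprocessing, sample construction, model training and five-fold testing."""
    clean = preprocess(raw_eeg)                       # data preprocessing module
    samples = make_samples(clean)                     # sample construction module
    x = torch.tensor(samples, dtype=torch.float32).unsqueeze(1)   # [n, 1, 62, T]
    y = torch.tensor(labels[: x.shape[0]], dtype=torch.long)
    splits = list(five_fold_indices(x.shape[0]))
    return evaluate_five_fold(x, y, splits, make_model=EmotionCNN)  # training + test modules
```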
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An emotion perception method oriented to natural human-computer interaction, characterized by comprising the following steps:
S1: collecting scalp electroencephalogram (EEG) signals under positive, neutral and negative emotional states;
S2: preprocessing the data collected in step S1;
S3: constructing training and test samples: dividing the preprocessed EEG signal into non-overlapping equal-length samples in units of 1 s, and dividing a training set and a test set according to a five-fold cross-validation scheme;
S4: training a deep convolutional neural network: inputting the divided training set into the deep convolutional neural network, using the convolution kernel structure of the network to extract high-dimensional temporal and spatial information from the raw EEG signal, updating and optimizing the network parameters of each layer by gradient descent and back-propagation, using learning-rate decay so that the network becomes more stable in the later stage of training, and stopping training when the loss function converges to a set condition;
S5: testing the classification effect on the test set: inputting the test-set samples into the trained deep convolutional neural network to obtain emotion recognition results, and checking the results against the labels of the samples to obtain the accuracy of the experiment; five-fold cross-validation is used, and the accuracies of the individual experiments are averaged to obtain the final accuracy.
2. The method for emotion perception oriented to natural human-computer interaction according to claim 1, wherein the specific steps of step S2 include:
sequentially removing the electro-ocular components in the acquired original signals and invalid and damaged segments in the electroencephalogram signals, and then removing noise and equipment artifacts of the electroencephalogram signals by using a band-pass filter and a notch filter.
3. The emotion perception method oriented to natural human-computer interaction according to claim 1, wherein in step S4 the deep convolutional neural network is composed of six layers: a convolutional layer L0, a first convolution block layer L1, a second convolution block layer L2, a third fully-connected block layer L3, a fourth convolutional layer L4 and a fifth output layer L5.
4. The emotion perception method oriented to natural human-computer interaction according to claim 3, wherein the L0 layer is a convolutional layer whose input is the training set of input samples constructed in step S3; w^0_{m,u} denotes the weights of the L0 convolutional layer and b^0_m its biases; U_1 denotes the width of the convolution kernel (U_1 = 5); C^0_{m,(p,q)} denotes the intermediate variable produced by the layer-0 convolution; and K_0 denotes the number of feature maps (K_0 = 6); the convolution is

C^0_{m,(p,q)} = Σ_{u=0}^{U_1-1} w^0_{m,u} · I_{0,(p,q+u)} + b^0_m (1)

σ represents the activation function, for which the ReLU (rectified linear unit) is used:

σ(x) = max(0, x), x > 0 (2)

where x denotes the input to the ReLU; the final output of the L0 layer is

I_{1,m,(p,q)} = σ(C^0_{m,(p,q)}) = max(0, C^0_{m,(p,q)}) (3)

where C^0_{m,(p,q)} denotes the L0-layer intermediate variable, I_{1,m,(p,q)} denotes the final output of layer L0, and max takes the larger of 0 and x.
5. The emotion perception method oriented to natural human-computer interaction according to claim 3, wherein the L1 layer is the first convolution block layer; convolution is performed three times with kernels of size 1 × 3, the matrix after each convolution is activated with the ReLU activation function, giving feature-map outputs of size 62 × 48 × 12, 62 × 48 × 12 and 62 × 48 × 12 respectively, the output is then batch-normalized so that the mean and variance of each input matrix remain consistent, and finally each feature-map matrix is reduced in dimension by max pooling with a 2 × 2 kernel, giving the final output of the L1 layer, a matrix of dimension 31 × 24 × 12.
6. The emotion perception method oriented to natural human-computer interaction according to claim 3, wherein the L2 layer is the second convolution block layer; convolution is performed three times with kernels of size 3 × 3, the matrix after each convolution is activated with the ReLU activation function, giving feature-map outputs of size 29 × 22 × 24, 27 × 20 × 24 and 25 × 18 × 24 respectively, the output is then batch-normalized so that the mean and variance of each input matrix remain consistent, and finally each feature-map matrix is reduced in dimension by max pooling with a 2 × 2 kernel, giving the final output of the L2 layer, a matrix of dimension 12 × 9 × 24.
7. The emotion perception method oriented to natural human-computer interaction according to claim 3, wherein the L3 layer is a fully-connected block layer composed of 6 convolution block layers; in each block, convolution is performed with kernels of size 1 × 1 and 5 × 5 in sequence, the matrix after convolution is activated with the ReLU activation function, giving feature maps of size 12 × 9 × 96 and 12 × 9 × 24 per block and finally an output of size 12 × 9 × 168, the output is then batch-normalized and subjected to a random deactivation (dropout) operation, and finally each feature-map matrix is reduced in dimension by max pooling with a 3 × 3 kernel, giving the final output of the L3 layer, a matrix of dimension 4 × 3 × 168.
8. The method for emotion perception of natural human-computer interaction of claim 3, wherein said L4 is a convolution layer, performing a global pooling operation to obtain a feature map with a size of 1 × 168, and performing a convolution operation using a convolution kernel with a size of 1 × 1 to obtain an output with a dimension of 1 × 3. Then, performing a flatten operation on the matrix to obtain a one-dimensional vector with the dimension of 1 × 3.
9. The emotion perception method oriented to natural human-computer interaction according to claim 3, wherein L5 is the final output layer; the vector is input into the fully-connected layer and the final prediction result is output with a softmax function, giving a one-dimensional vector of dimension 1 × 3 that represents the probabilities of the three predicted emotions, and the higher the probability of a given emotion in the result, the more likely the sample belongs to that emotion; feedback correction is performed by comparison with the true label, the network parameters of each layer of the model are updated and optimized by batch training with gradient descent and back-propagation, learning-rate decay is used so that the network becomes more stable in the later stage of training, and training stops when the loss function converges to a set condition.
10. An emotion perception system oriented to natural human-computer interaction, characterized by mainly comprising:
a data acquisition module for acquiring scalp EEG signals in positive, neutral and negative emotional states;
a data preprocessing module for preprocessing the data acquired by the data acquisition module;
a sample construction module for dividing the EEG signals preprocessed by the data preprocessing module into non-overlapping equal-length samples in units of 1 s, and dividing a training set and a test set according to a five-fold cross-validation scheme;
a deep convolutional neural network composed of six layer modules: a convolutional layer L0, a first convolution block layer L1, a second convolution block layer L2, a third fully-connected block layer L3, a fourth convolutional layer L4 and a fifth output layer L5;
a model training module for inputting the divided training set into the deep convolutional neural network, using the convolution kernel structure of the network to extract high-dimensional temporal and spatial information from the raw EEG signal, updating and optimizing the network parameters of each layer by gradient descent and back-propagation, using learning-rate decay so that the network becomes more stable in the later stage of training, and stopping training when the loss function converges to a set condition;
a test module for inputting the test-set samples into the trained deep convolutional neural network to obtain emotion recognition results and checking them against the labels of the samples to obtain the accuracy of the experiment; five-fold cross-validation is used, and the accuracies of the individual experiments are averaged to obtain the final accuracy.
CN202110327248.9A 2021-03-26 2021-03-26 Emotion perception method and system oriented to natural man-machine interaction Active CN113128353B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110327248.9A CN113128353B (en) 2021-03-26 2021-03-26 Emotion perception method and system oriented to natural man-machine interaction

Publications (2)

Publication Number Publication Date
CN113128353A true CN113128353A (en) 2021-07-16
CN113128353B (en) 2023-10-24

Family

ID=76774204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110327248.9A Active CN113128353B (en) 2021-03-26 2021-03-26 Emotion perception method and system oriented to natural man-machine interaction

Country Status (1)

Country Link
CN (1) CN113128353B (en)

Patent Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107977665A (en) * 2017-12-15 2018-05-01 北京科摩仕捷科技有限公司 The recognition methods of key message and computing device in a kind of invoice
KR20190130808A (en) * 2018-05-15 2019-11-25 연세대학교 산학협력단 Emotion Classification Device and Method using Convergence of Features of EEG and Face
CN109783684A (en) * 2019-01-25 2019-05-21 科大讯飞股份有限公司 A kind of emotion identification method of video, device, equipment and readable storage medium storing program for executing
CN109902732A (en) * 2019-02-22 2019-06-18 哈尔滨工业大学(深圳) Automobile automatic recognition method and relevant apparatus
CN110059698A (en) * 2019-04-30 2019-07-26 福州大学 The semantic segmentation method and system based on the dense reconstruction in edge understood for streetscape
CN110353702A (en) * 2019-07-02 2019-10-22 华南理工大学 A kind of emotion identification method and system based on shallow-layer convolutional neural networks
CN110532878A (en) * 2019-07-26 2019-12-03 中山大学 A kind of driving behavior recognition methods based on lightweight convolutional neural networks
CN112395924A (en) * 2019-08-16 2021-02-23 阿里巴巴集团控股有限公司 Remote sensing monitoring method and device
CN110826411A (en) * 2019-10-10 2020-02-21 电子科技大学 Vehicle target rapid identification method based on unmanned aerial vehicle image
CN110909658A (en) * 2019-11-19 2020-03-24 北京工商大学 Method for recognizing human body behaviors in video based on double-current convolutional network
CN110826527A (en) * 2019-11-20 2020-02-21 南京邮电大学 Electroencephalogram negative emotion recognition method and system based on aggressive behavior prediction
CN110992349A (en) * 2019-12-11 2020-04-10 南京航空航天大学 Underground pipeline abnormity automatic positioning and identification method based on deep learning
CN111209972A (en) * 2020-01-09 2020-05-29 中国科学院计算技术研究所 Image classification method and system based on hybrid connectivity deep convolution neural network
CN111709267A (en) * 2020-03-27 2020-09-25 吉林大学 Electroencephalogram signal emotion recognition method of deep convolutional neural network
CN111616721A (en) * 2020-05-31 2020-09-04 天津大学 Emotion recognition system based on deep learning and brain-computer interface and application
CN111860171A (en) * 2020-06-19 2020-10-30 中国科学院空天信息创新研究院 Method and system for detecting irregular-shaped target in large-scale remote sensing image
CN112070212A (en) * 2020-08-26 2020-12-11 江苏建筑职业技术学院 Artificial intelligence CNN, LSTM neural network dynamic identification system
CN112101152A (en) * 2020-09-01 2020-12-18 西安电子科技大学 Electroencephalogram emotion recognition method and system, computer equipment and wearable equipment
CN112115898A (en) * 2020-09-24 2020-12-22 深圳市赛为智能股份有限公司 Multi-pointer instrument detection method and device, computer equipment and storage medium
CN112200043A (en) * 2020-09-30 2021-01-08 中邮通建设咨询有限公司 Intelligent danger source identification system and method for outdoor construction site
CN112446591A (en) * 2020-11-06 2021-03-05 太原科技大学 Evaluation system for student comprehensive capacity evaluation and zero sample evaluation method
CN112347984A (en) * 2020-11-27 2021-02-09 安徽大学 Olfactory stimulus-based EEG (electroencephalogram) acquisition and emotion recognition method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HENG CUI ET AL: "EEG-based emotion recognition using an end-to-end regional-asymmetric convolutional neural network", 《KNOWLEDGE-BASED SYSTEMS》, 15 July 2020 (2020-07-15), pages 1-9 *
S.-E. MOON ET AL: "Convolutional Neural Network Approach for Eeg-Based Emotion Recognition Using Brain Connectivity and its Spatial Information", 《ICASSP》, 31 December 2018 (2018-12-31), pages 2556-2560 *
HUANG ZEBIN: "Research on EEG signal emotion recognition methods based on deep learning", 《China Excellent Master's Theses Full-text Database, Basic Sciences》, vol. 2021, no. 2, 15 February 2021 (2021-02-15), pages 006-723 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673434A (en) * 2021-08-23 2021-11-19 合肥工业大学 Electroencephalogram emotion recognition method based on efficient convolutional neural network and contrast learning
CN113673434B (en) * 2021-08-23 2024-02-20 合肥工业大学 Electroencephalogram emotion recognition method based on efficient convolutional neural network and contrast learning
CN116687409A (en) * 2023-07-31 2023-09-05 武汉纺织大学 Emotion recognition method and system based on digital twin and deep learning
CN116687409B (en) * 2023-07-31 2023-12-12 武汉纺织大学 Emotion recognition method and system based on digital twin and deep learning

Also Published As

Publication number Publication date
CN113128353B (en) 2023-10-24

Similar Documents

Publication Publication Date Title
CN108491077B (en) Surface electromyographic signal gesture recognition method based on multi-stream divide-and-conquer convolutional neural network
CN110287805B (en) Micro-expression identification method and system based on three-stream convolutional neural network
CN111709267B (en) Electroencephalogram signal emotion recognition method of deep convolutional neural network
CN110353702A (en) A kind of emotion identification method and system based on shallow-layer convolutional neural networks
CN110555468A (en) Electroencephalogram signal identification method and system combining recursion graph and CNN
CN107066514A (en) The Emotion identification method and system of the elderly
CN112022153B (en) Electroencephalogram signal detection method based on convolutional neural network
CN113128353A (en) Emotion sensing method and system for natural human-computer interaction
An et al. Electroencephalogram emotion recognition based on 3D feature fusion and convolutional autoencoder
CN111544855A (en) Pure idea control intelligent rehabilitation method based on distillation learning and deep learning and application
CN111832431A (en) Emotional electroencephalogram classification method based on CNN
CN112932501B (en) Method for automatically identifying insomnia based on one-dimensional convolutional neural network
Jinliang et al. EEG emotion recognition based on granger causality and capsnet neural network
CN115804602A (en) Electroencephalogram emotion signal detection method, equipment and medium based on attention mechanism and with multi-channel feature fusion
CN113349801A (en) Imaginary speech electroencephalogram signal decoding method based on convolutional neural network
CN113076878A (en) Physique identification method based on attention mechanism convolution network structure
CN113180659A (en) Electroencephalogram emotion recognition system based on three-dimensional features and cavity full convolution network
CN114595725B (en) Electroencephalogram signal classification method based on addition network and supervised contrast learning
Elessawy et al. A long short-term memory autoencoder approach for EEG motor imagery classification
CN113974627B (en) Emotion recognition method based on brain-computer generated confrontation
CN117338313B (en) Multi-dimensional characteristic electroencephalogram signal identification method based on stacking integration technology
CN112998725A (en) Rehabilitation method and system of brain-computer interface technology based on motion observation
CN116421200A (en) Brain electricity emotion analysis method of multi-task mixed model based on parallel training
CN116522106A (en) Motor imagery electroencephalogram signal classification method based on transfer learning parallel multi-scale filter bank time domain convolution
CN115316955A (en) Light-weight and quick decoding method for motor imagery electroencephalogram signals

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant