CN114202787A - Multiframe micro-expression emotion recognition method based on deep learning and two-dimensional attention mechanism - Google Patents

Multiframe micro-expression emotion recognition method based on deep learning and two-dimensional attention mechanism

Info

Publication number
CN114202787A
Authority
CN
China
Prior art keywords
neural network
frame
attention mechanism
micro
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111421041.4A
Other languages
Chinese (zh)
Inventor
李俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202111421041.4A priority Critical patent/CN114202787A/en
Publication of CN114202787A publication Critical patent/CN114202787A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-frame micro-expression emotion recognition method based on deep learning and a two-dimensional attention mechanism, comprising four steps: micro-expression video picture preprocessing, convolutional neural network feature extraction, two-dimensional attention mechanism weight calculation, and recurrent neural network result prediction. The design is reasonable and the conception ingenious. Based on the idea that the recognition of each frame should be integrated into the analysis of the whole video segment, with frames influencing one another's results, a multi-frame micro-expression emotion recognition algorithm based on deep learning and a two-dimensional attention mechanism is proposed. In the prediction stage, the feature vector extracted from each frame image is used as the input of the recurrent neural network; the attention mechanism computes the similarity between the current frame's features and those of the other frames, gives higher weight to features with high similarity, and incorporates the other frames' features into the prediction, so that a more accurate recognition result can be obtained.

Description

Multiframe micro-expression emotion recognition method based on deep learning and two-dimensional attention mechanism
Technical Field
The invention relates to the technical field of computer intelligent recognition, in particular to a multi-frame micro-expression emotion recognition method based on deep learning and a two-dimensional attention mechanism.
Background
With the development of technology in the field of computer information technology, micro-expression recognition remains a very challenging task. First, the life cycle of an expression in nature is very short, and compared with ordinary expressions a micro expression lasts an extremely short time, usually less than one second. Second, its motion amplitude is low: unlike ordinary expressions, micro expressions are not easily perceived. They are behaviors produced in an unconscious state, and the movements are difficult to disguise or conceal, which makes micro expressions highly significant for criminal investigation and security.
A complete computer-based system requires feature modeling along both the spatial dimension and the dynamic time dimension, so the micro-expression recognition task is highly complex. Existing micro-expression recognition systems first correct the face image, then cut it into several small image blocks, and finally obtain a micro-expression recognition result using a pre-established neural network model.
In the patent "Micro-expression recognition method" (application (patent) No. CN202110347510.6), the detected face is corrected, cut into several blocks, and then passed directly through a pre-established convolutional neural network to obtain a recognition result. This prior-art method, however, first ignores the connections across the whole face image: when a person smiles, not only the mouth smiles but also the facial muscles change, and if the face image is divided into pieces, the connections between the parts are difficult to capture. Second, it ignores the relations among multiple frames of pictures in the time dimension: if a person cries from excitement, analyzing only a single frame easily produces a wrong judgment.
Disclosure of Invention
Aiming at the problems and defects of the prior art, the invention provides a multi-frame micro-expression emotion recognition method and system based on deep learning and a two-dimensional attention mechanism, to solve the poor recognition performance that results when micro-expression recognition cannot exploit the relations among multiple frames in the time dimension.
The technical scheme of the invention is as follows:
A multi-frame micro-expression emotion recognition method based on deep learning and a two-dimensional attention mechanism comprises four steps: micro-expression video picture preprocessing, convolutional neural network feature extraction, two-dimensional attention mechanism weight calculation, and recurrent neural network result prediction. Specifically, the method comprises the following steps:
s1, preprocessing a micro expression video picture, specifically comprising face registration of a human face, time dimension frame interpolation and micro expression action amplification;
s2, a convolutional neural network feature extraction step, wherein a feature extraction network is used for extracting features of a plurality of preprocessed micro-expression pictures by adopting a rolling machine neural network, wherein the feature extraction network is used for finding the significant features of slight muscle movement and slight change of the face, so that a feature vector for representing original information is obtained;
s3, a two-dimensional attention mechanism weight calculation step, wherein the similarity between the current frame feature and all frames of the whole video is calculated through a two-dimensional attention mechanism to obtain the attention weight;
and S4, a recurrent neural network result prediction step, wherein the gated recurrent neural network classifies the characteristics of each frame according to the attention weight obtained in the previous step to obtain a prediction result.
In the above technical solution, in step S1,
the face registration step specifically comprises selecting the first frame of the video segment as a reference, mapping the feature points in the first frame to the feature points of a standard image through a mapping function, and mapping all subsequent frames by the same method, so as to reduce the differences between different faces;
the time-dimension frame interpolation step extends the temporal length of the whole video by interpolating frames in the time domain: the whole video is regarded as a network in which each frame represents a node, adjacent frames in the video being adjacent nodes in the network; a high-dimensional continuous curve is obtained by network embedding, and sampling this curve yields the interpolated image sequence;
and the micro-expression motion magnification step performs motion-magnification preprocessing on the face images using the linear Eulerian video magnification method; the multiple frames of face images obtained through the above steps are uniformly enlarged or reduced to the same height and width.
In the above technical scheme, in step S2, the three-dimensional feature matrix obtained by the convolutional neural network is denoted F, F ∈ R^{C×H×W}, where C, H and W are respectively the depth, height, and width of the matrix, the unit being a single pixel; the feature matrix F represents the information in the original picture. The convolutional neural network uses three max-pooling layers with 2×2 windows, which reduces the height and width of the extracted features and the complexity of the model. The convolutional neural network adopts the residual-block idea from ResNet.
As a preferred aspect of the above technical solution, the parameters of the convolutional neural network are:
[Table: convolutional neural network parameters, reproduced only as an image (BDA0003377406670000031) in the original filing; see Table 1 in the description.]
In the above technical solution, in step S3, the two-dimensional attention mechanism is built on the prediction of step S4: the upper output of each step of the gated recurrent neural network is used as the input of the two-dimensional attention mechanism. The two-dimensional attention calculation specifically comprises: F contains all the information of a single input frame; let the total number of frames of the whole video be T; after the T face pictures pass through the convolutional neural network, T three-dimensional feature matrices F are obtained and concatenated along the width dimension into a larger three-dimensional feature matrix denoted H, H ∈ R^{C×H×(T×W)}, whose depth and height are the same as F's and whose width is T times that of F.
In the above technical solution, the input of the two-dimensional attention mechanism comprises two parts: the first is the upper output of each step of the gated recurrent neural network, which undergoes a 1×1 convolution and is then copied across the spatial dimensions to obtain a feature map with the same dimensionality as H;
and the second is the three-dimensional feature matrix H extracted from the whole video; the two feature matrices are added element-wise and passed through a Tanh operation, a softmax operation then yields the attention weight matrix α, the element-wise product of α and H is summed to obtain a matrix, this matrix is concatenated with the input matrix along the depth dimension, and the output result is finally obtained through a fully connected layer.
In the above technical solution, in step S4, the whole gated recurrent neural network comprises T gated recurrent units; each unit has two inputs, on the left and bottom, and two outputs, on the right and top. The left input of the first unit is a feature vector G, obtained by max-pooling the feature matrix H; its dimension is 1×C, where C is the depth of the recurrent neural network. Its bottom input is the matrix F1 extracted from the first frame by the convolutional neural network. The bottom input of the second gated recurrent unit is the matrix F2 extracted from the second frame by the convolutional neural network, and its left input is the right output of the first unit. By analogy, the whole gated sequence-decoding prediction consists of T such small units. The top output of each recurrent unit is not only passed to the next unit but is also the input of the attention mechanism module, and passing it through the attention mechanism module yields the prediction result for that frame.
By adopting the above scheme, the invention provides a multi-frame micro-expression emotion recognition method based on deep learning and a two-dimensional attention mechanism. The design is reasonable and the conception ingenious. Based on the idea that the recognition of each frame should be integrated into the analysis of the whole video clip, with frames influencing one another's results, a multi-frame micro-expression emotion recognition algorithm based on deep learning and a two-dimensional attention mechanism is proposed. In the prediction stage, the feature vector extracted from each frame image is used as the input of a recurrent neural network; the attention mechanism computes the similarity between the current frame's features and those of the other frames, gives higher weight to features with high similarity, and adds the other frames' features into the prediction, so that a more accurate recognition result can be obtained.
Drawings
Fig. 1 is a schematic flow chart illustrating steps of a multi-frame micro-expression emotion recognition method based on deep learning and a two-dimensional attention mechanism.
Fig. 2 is a schematic structural diagram of a gated recurrent neural network unit of the multi-frame micro-expression emotion recognition method based on deep learning and a two-dimensional attention mechanism.
Fig. 3 is a schematic diagram of a two-dimensional attention mechanism of a multi-frame micro-expression emotion recognition method based on deep learning and the two-dimensional attention mechanism.
Fig. 4 is a schematic diagram of a sequence prediction structure of a gated recurrent neural network of a multi-frame micro-expression emotion recognition method based on deep learning and a two-dimensional attention mechanism.
Detailed Description
In order to facilitate an understanding of the invention, the invention is described in more detail below with reference to the accompanying drawings and specific examples. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
As shown in fig. 1, the multi-frame micro-expression emotion recognition method based on deep learning and two-dimensional attention mechanism includes the following steps:
(1) Micro-expression picture preprocessing:
after the whole video is subjected to face detection and cutting, three steps of face registration, time dimension frame insertion and micro-expression action amplification of a face are required to be sequentially carried out.
Because face position, head deviation, and other movements in the video enlarge the differences between the detected pictures, and the distribution of facial features differs between people, keypoint-based face registration must first be performed so that the multi-frame micro-expression pictures share the same reference. The first frame of the video clip is selected as the reference; the feature points of the first frame are mapped to the feature points of a standard image through a mapping function, and all subsequent frames are mapped by the same method, which reduces the differences between different faces.
Because the duration of a micro expression is extremely short, the temporal length of the whole video must be extended by frame interpolation in the time domain: the whole video is regarded as a network in which each frame represents a node, adjacent frames in the video being adjacent nodes in the network; a high-dimensional continuous curve is obtained by network embedding, and sampling this curve yields the interpolated image sequence.
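The network-embedding interpolation described above is involved; as a rough stand-in that only illustrates the "sample a continuous curve" idea, here is a minimal sketch of plain linear temporal interpolation in Python. This is an assumption for illustration, not the patent's embedding method.

```python
# A minimal sketch: resample a clip of T frames to t_out frames by linear
# interpolation along the time axis. This is a simplification of the
# network-embedding interpolation described above, not the patent's method.
import numpy as np

def interpolate_frames(frames: np.ndarray, t_out: int) -> np.ndarray:
    """frames: (T, H, W, C) float array; returns (t_out, H, W, C)."""
    t_in = frames.shape[0]
    # positions of the output samples on the original time axis
    ts = np.linspace(0, t_in - 1, t_out)
    lo = np.floor(ts).astype(int)
    hi = np.minimum(lo + 1, t_in - 1)
    w = (ts - lo)[:, None, None, None]   # fractional position per output sample
    return (1 - w) * frames[lo] + w * frames[hi]
```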
Because the motion amplitude of micro expressions is small, motion magnification must be applied to the face images, and the linear Eulerian video magnification method is used for this preprocessing. Through the above steps, multiple frames of face images are obtained and uniformly enlarged or reduced to the same height and width.
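A heavily simplified sketch of linear Eulerian magnification follows, assuming NumPy/SciPy: it band-passes each pixel's temporal signal at full resolution and amplifies the filtered component, omitting the spatial pyramid of the full method. The band edges and amplification factor are illustrative assumptions.

```python
# A heavily simplified sketch of linear Eulerian video magnification.
# The full method filters each level of a spatial pyramid; here the temporal
# band-pass is applied at full resolution. The 0.4-3 Hz band and the factor
# alpha are illustrative, and the clip must span more than a few frames.
import numpy as np
from scipy.signal import butter, filtfilt

def magnify(frames: np.ndarray, fps: float, low=0.4, high=3.0, alpha=10.0):
    """frames: (T, H, W, C) float array in [0, 255]; returns magnified frames."""
    nyq = fps / 2.0
    b, a = butter(1, [low / nyq, high / nyq], btype="band")
    bandpassed = filtfilt(b, a, frames, axis=0)   # temporal band-pass per pixel
    return np.clip(frames + alpha * bandpassed, 0, 255)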
(2) Convolutional neural network feature extraction:
Feature extraction is performed on the preprocessed pictures by a specially designed convolutional neural network module. The parameters of the whole convolutional neural network are shown in Table 1. The obtained three-dimensional feature matrix is denoted F, F ∈ R^{C×H×W}, where C, H and W respectively represent the depth, height, and width of the matrix, the unit being a single pixel; that is, the feature matrix F can be considered to represent the information in the original picture. Three max-pooling layers with 2×2 windows are used in the network, which reduces the height and width of the extracted features and the complexity of the model. The network design adopts the residual-block idea from ResNet, which also ensures the stability of the feature extraction network.
[Table 1: feature extraction module operating parameters, reproduced only as an image (BDA0003377406670000071) in the original filing.]
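Since Table 1 survives only as an image, a minimal PyTorch sketch of a feature extractor matching the description is given below: three residual stages, each followed by 2×2 max pooling, mapping a 3×32×32 crop to a 128×4×4 matrix. The intermediate channel widths are illustrative assumptions, not the patent's exact Table 1 values.

```python
# A minimal sketch of the feature extractor described above, assuming PyTorch;
# layer widths are illustrative since Table 1 is only an image in the filing.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic ResNet-style block: two 3x3 convolutions with a skip connection."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.relu(self.conv1(x))
        y = self.conv2(y)
        return self.relu(y + self.skip(x))

class FrameFeatureExtractor(nn.Module):
    """Maps a 3x32x32 face crop to a 128x4x4 feature matrix F via three
    residual stages, each followed by 2x2 max pooling (three poolings total)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            ResidualBlock(3, 32),   nn.MaxPool2d(2),   # -> 32 x 16 x 16
            ResidualBlock(32, 64),  nn.MaxPool2d(2),   # -> 64 x 8 x 8
            ResidualBlock(64, 128), nn.MaxPool2d(2),   # -> 128 x 4 x 4
        )

    def forward(self, x):           # x: (B, 3, 32, 32)
        return self.net(x)          # F: (B, 128, 4, 4)
```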
(3) Two-dimensional attention mechanism weight calculation:
F contains all the information of a single input frame. Let the total number of frames of the whole video be T. After the T face pictures pass through the convolutional neural network, T three-dimensional feature matrices F are obtained; they are concatenated along the width dimension into a larger three-dimensional feature matrix denoted H, H ∈ R^{C×H×(T×W)}, whose depth and height are the same as F's and whose width is T times that of F.
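Assembling H is a plain concatenation; a small illustration in PyTorch, with shapes taken from the example in this document (C = 128 and a 4×4 spatial grid per frame), might look like:

```python
# Small illustration of building H from the per-frame feature matrices F,
# assuming PyTorch tensors; T = 8 is an arbitrary example frame count.
import torch

T = 8
frames_F = [torch.randn(128, 4, 4) for _ in range(T)]   # T per-frame matrices F
H = torch.cat(frames_F, dim=-1)                          # concatenate along width
assert H.shape == (128, 4, 4 * T)                        # C x h x (T*w)
```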
The two-dimensional attention mechanism is built on the step (4) prediction, with the upper output of each step of the gated recurrent neural network used as its input; the structure of the gated recurrent neural network unit is shown in Fig. 2.
The two-dimensional attention mechanism is shown schematically in Fig. 3; its input comprises two parts. The first is the upper output of each step of the gated recurrent neural network, which undergoes a 1×1 convolution and is then copied across the spatial dimensions to obtain a feature map with the same dimensionality as H. The second is the three-dimensional feature matrix H extracted from the whole video. The two feature matrices are added element-wise and passed through a Tanh operation; a softmax operation then yields the attention weight matrix α. The element-wise product of α and H is summed to obtain a matrix, which is concatenated with the input matrix along the depth dimension, and the output result is finally obtained through a fully connected layer.
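A minimal PyTorch sketch of such a module, under the reading above, could be the following; the exact fusion and normalization details are interpretive assumptions, since the patent describes them only in prose.

```python
# A minimal sketch of the two-dimensional attention module described above,
# assuming PyTorch; the fusion details are an interpretation of the prose.
import torch
import torch.nn as nn

class TwoDAttention(nn.Module):
    def __init__(self, channels=128):
        super().__init__()
        self.query_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.fc = nn.Linear(2 * channels, channels)

    def forward(self, hidden, H):
        # hidden: (B, C), upper output of one GRU step
        # H:      (B, C, h, T*w), feature matrix of the whole video
        B, C, h, w = H.shape
        q = self.query_conv(hidden.view(B, C, 1, 1))        # 1x1 convolution
        q = q.expand(B, C, h, w)                            # copy over spatial dims
        scores = torch.tanh(q + H)                          # matrix addition + Tanh
        alpha = torch.softmax(scores.view(B, C, -1), -1)    # attention weights
        ctx = (alpha * H.view(B, C, -1)).sum(-1)            # dot-multiply and sum
        out = torch.cat([ctx, hidden], dim=1)               # splice along depth
        return self.fc(out)                                 # fully connected layer
```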
(4) Recurrent neural network result prediction:
the whole gated recurrent neural network result comprises T gated neural network units, and each gated recurrent neural network unit has two inputs on the left side and the lower side and two outputs on the right side and the upper side, which is schematically shown in FIG. 2.
In this patent application, the first gated recurrent unit has two inputs. The left input is the feature vector G, obtained by max-pooling the feature matrix H; its dimension is 1×C, where C is the recurrent neural network depth. The bottom input is the matrix F1 extracted from the first frame by the convolutional neural network. The bottom input of the second gated recurrent unit is the matrix F2 extracted from the second frame by the convolutional neural network, and its left input is the right output of the first unit. By analogy, the whole gated sequence-decoding prediction consists of T such small units. Meanwhile, the top output of each unit is recorded: it is not only passed on to the next unit but is also the input of the attention mechanism module, which yields the prediction result for that frame. The entire gated recurrent neural network sequence is represented in Fig. 4.
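A minimal PyTorch sketch of this sequence stage, reusing the TwoDAttention module sketched earlier, might look as follows; the pooling of each per-frame matrix F_t down to a vector, the classifier head, and the class count are illustrative assumptions.

```python
# A minimal sketch of the sequence-prediction stage, assuming PyTorch and the
# TwoDAttention module sketched above; head and num_classes are illustrative.
import torch
import torch.nn as nn

class SequencePredictor(nn.Module):
    def __init__(self, channels=128, num_classes=5):
        super().__init__()
        self.cell = nn.GRUCell(channels, channels)  # one gated unit per frame
        self.attn = TwoDAttention(channels)
        self.head = nn.Linear(channels, num_classes)

    def forward(self, frame_feats, H):
        # frame_feats: list of T tensors (B, C); each is a per-frame matrix F_t
        #              assumed already pooled from (C, 4, 4) to a C-vector
        # H:           (B, C, h, T*w), concatenated video feature matrix
        B, C = frame_feats[0].shape
        # left input of the first unit: G = max pooling of H, dimension 1 x C
        state = H.view(B, C, -1).max(dim=-1).values
        logits = []
        for f_t in frame_feats:                # bottom input: per-frame features
            state = self.cell(f_t, state)      # right/top output of the unit
            enhanced = self.attn(state, H)     # top output feeds the attention
            logits.append(self.head(enhanced)) # per-frame prediction
        return torch.stack(logits, dim=1)      # (B, T, num_classes)
```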
Example 1
The following is a specific embodiment of the present invention:
the invention provides a method and a system for identifying multi-frame micro-expression emotions based on deep learning and a two-dimensional attention mechanism, which comprises the following specific processes:
(1) Micro-expression picture preprocessing:
the input to the explicit model is first a succession of video frames, each of which has a facial expression. In the first step, the face needs to be subjected to face registration based on key points, so that the multiple frames of image micro-expression pictures have the same reference. The method is characterized in that a first frame of a video segment is selected as a reference, feature points in the first frame are mapped to feature points of a standard image through a mapping function, all the subsequent frames are mapped through the same method, the difference between different human faces can be reduced, and the human face feature points and the facial feature calibration are identified through a dlib library of python.
Because the duration of a micro expression is extremely short, the temporal length of the whole video must be extended by frame interpolation in the time domain: the whole video is regarded as a network in which each frame represents a node, adjacent frames in the video being adjacent nodes in the network; a high-dimensional continuous curve is obtained by network embedding, and sampling this curve yields the interpolated image sequence.
Because the motion amplitude of micro expressions is small, motion magnification must be applied to the face images, and the linear Eulerian video magnification method is used for this preprocessing. Through the above steps, multiple frames of face images are obtained; in this example, the images are uniformly enlarged or reduced to 32 × 32 pixels in height and width.
(2) Convolutional neural network feature extraction:
Feature extraction is performed on the preprocessed pictures by a specially designed convolutional neural network module. The parameters of the whole convolutional neural network are shown in Table 1. The obtained three-dimensional feature matrix is denoted F, F ∈ R^{C×H×W}, where C, H and W respectively represent the depth, height, and width of the matrix, the unit being a single pixel; in this example C = 128, H = 4 and W = 4, so the feature matrix F can be considered to represent the information in the original picture. Three max-pooling layers with 2×2 windows are used in the network, which reduces the height and width of the extracted features and the complexity of the model: the original 32 × 32 × 3 image becomes a 4 × 4 × 128 matrix. The network design adopts the residual-block idea from ResNet, which also ensures the stability of the feature extraction network.
(3) Two-dimensional attention mechanism weight calculation:
The input of the two-dimensional attention mechanism consists of two parts: the first is the upper output of each step of the gated recurrent neural network, which undergoes a 1×1 convolution and is then copied across the spatial dimensions to obtain a feature map with the same dimensionality as H, of size 4 × 4 × 128 per frame position; the second is the three-dimensional feature matrix H extracted from the whole video. The two feature matrices are added element-wise and passed through a Tanh operation; a softmax operation then yields the attention weight matrix α. The element-wise product of α and H is summed to obtain a matrix, which is concatenated with the input matrix along the depth dimension; the output is finally obtained through a fully connected layer, and this result is the feature enhanced by the two-dimensional attention mechanism.
(4) Recurrent neural network result prediction: in this patent, the first gated recurrent unit has two inputs. The left input is the feature vector G, obtained by max-pooling the feature matrix H; its dimension is 1×C, where C is the recurrent neural network depth (in this example, C = 128). The bottom input is the matrix F1 extracted from the first frame by the convolutional neural network. The bottom input of the second gated recurrent unit is the matrix F2 extracted from the second frame by the convolutional neural network, and its left input is the right output of the first unit. By analogy, the whole gated sequence-decoding prediction consists of T such small units. Meanwhile, the top output of each unit is recorded: it is not only passed on to the next unit but is also the input of the attention mechanism module, which yields the prediction result for that frame.
The technical features mentioned above may be combined with one another to form various embodiments not listed above, all of which are regarded as within the scope of the invention described in this specification; likewise, modifications and variations may be suggested to those skilled in the art in light of the above teachings, and all such modifications and variations are intended to fall within the scope of the invention as defined by the appended claims.

Claims (7)

1. A multi-frame micro-expression emotion recognition method based on deep learning and a two-dimensional attention mechanism, characterized by comprising four steps: micro-expression video picture preprocessing, convolutional neural network feature extraction, two-dimensional attention mechanism weight calculation, and recurrent neural network result prediction, and specifically comprising the following steps:
s1, preprocessing a micro expression video picture, specifically comprising face registration of a human face, time dimension frame interpolation and micro expression action amplification;
s2, a convolutional neural network feature extraction step, wherein a feature extraction network is used for extracting features of a plurality of preprocessed micro-expression pictures by adopting a rolling machine neural network, wherein the feature extraction network is used for finding the significant features of slight muscle movement and slight change of the face, so that a feature vector for representing original information is obtained;
s3, a two-dimensional attention mechanism weight calculation step, wherein the similarity between the current frame feature and all frames of the whole video is calculated through a two-dimensional attention mechanism to obtain the attention weight;
and S4, a recurrent neural network result prediction step, wherein the gated recurrent neural network classifies the characteristics of each frame according to the attention weight obtained in the previous step to obtain a prediction result.
2. The multi-frame micro-expression emotion recognition method based on deep learning and a two-dimensional attention mechanism as claimed in claim 1, wherein, in step S1,
the face registration step specifically comprises selecting the first frame of the video segment as a reference, mapping the feature points in the first frame to the feature points of a standard image through a mapping function, and mapping all subsequent frames by the same method, so as to reduce the differences between different faces;
the time-dimension frame interpolation step extends the temporal length of the whole video by interpolating frames in the time domain: the whole video is regarded as a network in which each frame represents a node, adjacent frames in the video being adjacent nodes in the network; a high-dimensional continuous curve is obtained by network embedding, and sampling this curve yields the interpolated image sequence;
and the micro-expression motion magnification step performs motion-magnification preprocessing on the face images using the linear Eulerian video magnification method; the multiple frames of face images obtained through the above steps are uniformly enlarged or reduced to the same height and width.
3. The multi-frame micro-expression emotion recognition method based on deep learning and a two-dimensional attention mechanism as claimed in claim 1, wherein in step S2, the three-dimensional feature matrix obtained by the convolutional neural network is denoted F, F ∈ R^{C×H×W}, where C, H and W are respectively the depth, height, and width of the matrix, the unit being a single pixel; the feature matrix F represents the information in the original picture; three max-pooling layers with 2×2 windows are used in the convolutional neural network.
4. The method for recognizing the multi-frame micro-expression emotion based on the deep learning and two-dimensional attention mechanism as claimed in claim 1, wherein the parameters of the convolutional neural network are as follows:
[Table: convolutional neural network parameters, reproduced only as an image (FDA0003377406660000021) in the original filing.]
5. The multi-frame micro-expression emotion recognition method based on deep learning and a two-dimensional attention mechanism as claimed in claim 1, wherein in step S3, the two-dimensional attention mechanism is built on the prediction of step S4, the upper output of each step of the gated recurrent neural network being used as the input of the two-dimensional attention mechanism; the two-dimensional attention calculation specifically comprises: F contains all the information of a single input frame; the total number of frames of the whole video is T; after the T face pictures pass through the convolutional neural network, T three-dimensional feature matrices F are obtained and concatenated along the width dimension into a larger three-dimensional feature matrix denoted H, H ∈ R^{C×H×(T×W)}, whose depth and height are the same as F's and whose width is T times that of F.
6. The multi-frame micro-expression emotion recognition method based on deep learning and a two-dimensional attention mechanism as claimed in claim 1, wherein the input of the two-dimensional attention mechanism comprises two parts: the first is the upper output of each step of the gated recurrent neural network, which undergoes a 1×1 convolution and is then copied across the spatial dimensions to obtain a feature map with the same dimensionality as H;
and the second is the three-dimensional feature matrix H extracted from the whole video; the two feature matrices are added element-wise and passed through a Tanh operation, a softmax operation then yields the attention weight matrix α, the element-wise product of α and H is summed to obtain a matrix, this matrix is concatenated with the input matrix along the depth dimension, and the output result is finally obtained through a fully connected layer.
7. The multi-frame micro-expression emotion recognition method based on deep learning and a two-dimensional attention mechanism as claimed in claim 1, wherein in step S4, the whole gated recurrent neural network comprises T gated recurrent units, each unit having two inputs, on the left and bottom, and two outputs, on the right and top; the left input of the first unit is a feature vector G, obtained by max-pooling the feature matrix H, with dimension 1×C, where C is the depth of the recurrent neural network; its bottom input is the matrix F1 extracted from the first frame by the convolutional neural network; the bottom input of the second gated recurrent unit is the matrix F2 extracted from the second frame by the convolutional neural network, and its left input is the right output of the first unit; by analogy, the whole gated sequence-decoding prediction consists of T such small units; the top output of each recurrent unit is not only passed to the next unit but is also the input of the attention mechanism module, which yields the prediction result for that frame.
CN202111421041.4A 2021-11-26 2021-11-26 Multiframe micro-expression emotion recognition method based on deep learning and two-dimensional attention mechanism Pending CN114202787A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111421041.4A CN114202787A (en) 2021-11-26 2021-11-26 Multiframe micro-expression emotion recognition method based on deep learning and two-dimensional attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111421041.4A CN114202787A (en) 2021-11-26 2021-11-26 Multiframe micro-expression emotion recognition method based on deep learning and two-dimensional attention mechanism

Publications (1)

Publication Number Publication Date
CN114202787A true CN114202787A (en) 2022-03-18

Family

ID=80649135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111421041.4A Pending CN114202787A (en) 2021-11-26 2021-11-26 Multiframe micro-expression emotion recognition method based on deep learning and two-dimensional attention mechanism

Country Status (1)

Country Link
CN (1) CN114202787A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842539A (en) * 2022-05-30 2022-08-02 山东大学 Micro-expression discovery method and system based on attention mechanism and one-dimensional convolution sliding window
CN114842539B (en) * 2022-05-30 2023-04-07 山东大学 Micro-expression discovery method and system based on attention mechanism and one-dimensional convolution sliding window
CN115375665A (en) * 2022-08-31 2022-11-22 河南大学 Early Alzheimer disease development prediction method based on deep learning strategy
CN115375665B (en) * 2022-08-31 2024-04-16 河南大学 Advanced learning strategy-based early Alzheimer disease development prediction method

Similar Documents

Publication Publication Date Title
CN111639692B (en) Shadow detection method based on attention mechanism
Wang et al. Adaptive fusion for RGB-D salient object detection
CN113158723B (en) End-to-end video motion detection positioning system
CN109948692B (en) Computer-generated picture detection method based on multi-color space convolutional neural network and random forest
CN114202787A (en) Multiframe micro-expression emotion recognition method based on deep learning and two-dimensional attention mechanism
CN113642634A (en) Shadow detection method based on mixed attention
CN108921032B (en) Novel video semantic extraction method based on deep learning model
CN109063626B (en) Dynamic face recognition method and device
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN111488805B (en) Video behavior recognition method based on salient feature extraction
CN111738054A (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
CN111160356A (en) Image segmentation and classification method and device
CN112597824A (en) Behavior recognition method and device, electronic equipment and storage medium
CN111507138A (en) Image recognition method and device, computer equipment and storage medium
CN116012930B (en) Dimension expression recognition method based on deep learning convolutional neural network
CN112446348A (en) Behavior identification method based on characteristic spectrum flow
CN108376234B (en) Emotion recognition system and method for video image
Diyasa et al. Multi-face Recognition for the Detection of Prisoners in Jail using a Modified Cascade Classifier and CNN
CN113076905B (en) Emotion recognition method based on context interaction relation
CN110852271A (en) Micro-expression recognition method based on peak frame and deep forest
CN112528077B (en) Video face retrieval method and system based on video embedding
CN111310516A (en) Behavior identification method and device
CN111046213B (en) Knowledge base construction method based on image recognition
CN111144220B (en) Personnel detection method, device, equipment and medium suitable for big data
CN116403152A (en) Crowd density estimation method based on spatial context learning network

Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination