CN112800894B - Dynamic expression recognition method and system based on attention mechanism between space and time streams - Google Patents

Dynamic expression recognition method and system based on attention mechanism between space and time streams

Info

Publication number
CN112800894B
Authority
CN
China
Prior art keywords
convolution
space
time
layer
feature map
Prior art date
Legal status
Active
Application number
CN202110061153.7A
Other languages
Chinese (zh)
Other versions
CN112800894A (en)
Inventor
卢官明
陈浩侠
卢峻禾
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202110061153.7A
Publication of CN112800894A
Application granted
Publication of CN112800894B
Legal status: Active

Classifications

    • G06V 40/174 Facial expression recognition
    • G06V 40/176 Dynamic expression
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/253 Fusion techniques of extracted features
    • G06N 3/04 Neural networks; Architecture, e.g. interconnection topology
    • G06N 3/08 Neural networks; Learning methods
    • G06V 10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 40/168 Feature extraction; Face representation


Abstract

The invention discloses a dynamic expression recognition method and system based on an attention mechanism between the spatial and temporal streams. First, facial expression video clips are collected and a facial expression video library with expression category labels is established. Next, a two-stream convolutional neural network model with embedded inter-stream attention modules is constructed; the model comprises a data processing layer, a spatial stream branch, a temporal stream branch, inter-stream attention modules, a feature fusion layer, a fully connected layer and a classification layer. The model is then trained with the video samples in the facial expression video library, and finally the trained model is used to recognize facial expressions in newly input videos. By embedding an inter-stream attention module in the two-stream convolutional neural network, the method enables information interaction between spatial and temporal features, captures the dynamic correlation between them, obtains highly discriminative features, and improves the accuracy and robustness of facial expression recognition.

Description

Dynamic expression recognition method and system based on attention mechanism between space and time streams
Technical Field
The invention belongs to the field of machine learning and pattern recognition, relates to dynamic expression recognition methods and systems, and in particular to a dynamic expression recognition method and system based on an attention mechanism between spatial and temporal streams.
Background
With the rapid development of computer technology and artificial intelligence, the way humans interact with machines keeps changing, and people increasingly expect to communicate with computers directly. Knowing the emotional state of the other party is essential in human communication, and facial expressions convey roughly 55% of the information in human emotional expression. Recognizing facial expressions with a computer has therefore become a very active research topic.
Facial expression recognition is the process of extracting facial expression features from an image or a video and assigning an expression category label according to the feature information. It is an interdisciplinary subject spanning neuroscience, psychology, computer science and other fields, and its potential applications include any field with a need for affective human-computer interaction, such as distance education, safe driving and service robots. For example, some current smartphones use smile detection to trigger automatic photographing, and high-end automobiles use cameras to monitor the driver's emotional state and take corresponding safety measures. Deepening research on facial expression recognition and improving the recognition capability of computers will greatly improve the quality of human life.
Currently, a large part of the research in this field is carried out on static face images. Such methods capture only the spatial information of facial expressions and ignore how expressions change over time, although the temporal information may contain highly expressive cues. Compared with expression recognition methods based on static images, a two-stream convolutional neural network can recognize dynamic expressions: it uses a spatial stream branch and a temporal stream branch to extract features from a single face frame taken from the video and from stacked optical flow maps representing the expression change, respectively, so that the spatial and temporal features of the facial expression are obtained simultaneously and complement each other.
Similar to face recognition, recognizing facial expressions in an uncontrolled natural environment is strongly affected by occlusion and head pose. To alleviate the influence of these factors, exploiting local facial information is a widely recognized and effective strategy. Research on the human visual system and cognition shows that humans process image data by giving priority to salient regions and selectively ignoring non-salient ones. Introducing an attention mechanism into the facial expression recognition task allows a convolutional neural network to adaptively assign higher weights to salient regions of the face image, so that these regions have a greater influence on the higher-level deep features learned in the next stage while non-salient regions are suppressed, further improving recognition accuracy. However, for two-stream convolutional neural networks, the prior art only introduces attention mechanisms inside the spatial stream branch and the temporal stream branch separately. This realizes the most basic function of attention, but cannot fully exploit the complementarity of the two branches to achieve information interaction between them.
The Chinese patent application "A facial expression recognition method based on an attention mechanism module" (application No. CN202010783432.X, publication No. CN111967359A) feeds a cropped face image into a network model based on an attention mechanism module to obtain attention features of the face image, applies two convolutional layers for feature extraction to obtain a feature map, reduces the feature dimension with a global average pooling layer, and finally classifies the reduced features with a Softmax classifier to output the recognition result. The problem with this method is that it can only recognize a single static face image and ignores the temporal information of the facial expression, which makes it difficult to achieve the best recognition performance.
The Chinese patent application "A facial expression recognition method based on a spatio-temporal fusion network" (application No. CN202010221398.7, publication No. CN111709266A) preprocesses the input video sequence, extracts spatial features of the facial expression with a DMF module and temporal features with an LTCNN module, and finally fuses the spatio-temporal expression features learned by the two modules with a fine-tuning-based fusion strategy. The problem with this method is that the extraction of spatial features by the DMF module and of temporal features by the LTCNN module are two independent processes with no sufficient information interaction, which may affect the final recognition performance.
Disclosure of Invention
Purpose of the invention: in view of the problem that, in a two-stream convolutional neural network, the spatial stream branch and the temporal stream branch extract the spatial and temporal features of facial expressions in two relatively independent processes, the invention aims to provide a dynamic expression recognition method and system based on an attention mechanism between the spatial and temporal streams.
Technical solution: to achieve the above purpose, the invention adopts the following technical solution:
a dynamic expression recognition method based on a spatiotemporal attention mechanism comprises the following steps:
step 1: collecting facial expression video clips, and establishing a facial expression video library containing expression category labels;
step 2: constructing a two-stream convolutional neural network model with embedded inter-stream attention modules, wherein the model comprises a data processing layer, a spatial stream branch, a temporal stream branch, inter-stream attention modules, a feature fusion layer, a fully connected layer and a classification layer (a structural sketch showing how these components are connected is given after step 4 below);
the data processing layer preprocesses the input video as follows: split the video into frames and extract an image sequence of length u frames from the resulting images in temporal order, where u is the set sequence length; perform face detection, cropping and alignment on each image in the sequence and normalize each processed image, obtaining a facial expression image sequence of length u frames; randomly select one image from this sequence as the single face frame corresponding to the input video; compute an optical flow map between every two adjacent images in the sequence and stack the optical flow maps along the channel dimension in temporal order as the stacked optical flow maps corresponding to the input video;
the spatial stream branch comprises a plurality of convolution modules; its input is the single face frame output by the data processing layer, and it extracts the spatial features of the facial expression;
the temporal stream branch comprises the same number of convolution modules as the spatial stream branch; its input is the stacked optical flow maps output by the data processing layer, and it extracts the temporal features of the facial expression;
the inter-stream attention modules are embedded between the convolution modules of the spatial and temporal stream branches; the input of each module is the output of the preceding convolution modules in the two-stream network, and its output is the input of the following convolution modules, enabling information interaction between spatial and temporal features. An inter-stream attention module first computes the correlation weight matrices of the input spatial and temporal feature maps, then computes a mapping matrix for each of the two feature maps, computes a spatial residual feature map and a temporal residual feature map from the correlation weight matrices and the mapping matrices, and finally uses residual connections: the spatial residual feature map is added to the input spatial feature map to give the output spatial feature map, and the temporal residual feature map is added to the input temporal feature map to give the output temporal feature map;
the feature fusion layer applies global average pooling separately to the spatial feature map output by the spatial stream branch and the temporal feature map output by the temporal stream branch, concatenates the two resulting feature vectors, and outputs a single feature vector;
the fully connected layer fully connects the feature fusion layer to the classification layer;
the classification layer computes the probability that the facial expression in the input video belongs to each expression category;
step 3: train the constructed network model using the video samples in the established facial expression video library;
step 4: perform facial expression recognition on a newly input video using the trained network model.
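As a reading aid, the following is a minimal PyTorch-style sketch of how the components of step 2 could be wired together. The class and argument names (TwoStreamAttentionNet, the lists of convolution modules, InterStreamAttention, FusionHead) are illustrative assumptions defined in later sketches in this document, not names used in the patent.

```python
import torch
import torch.nn as nn

class TwoStreamAttentionNet(nn.Module):
    """Sketch of the two-stream CNN with inter-stream attention modules."""

    def __init__(self, conv_blocks_s, conv_blocks_t, attn_modules, fusion_head):
        super().__init__()
        self.blocks_s = nn.ModuleList(conv_blocks_s)   # spatial-stream convolution modules
        self.blocks_t = nn.ModuleList(conv_blocks_t)   # temporal-stream convolution modules
        self.attn = nn.ModuleList(attn_modules)        # one inter-stream attention module
                                                       # between consecutive convolution modules
        self.fusion_head = fusion_head                 # GAP + concat + FC + Softmax

    def forward(self, frame, stacked_flow):
        x_s, x_t = frame, stacked_flow
        for i, (bs, bt) in enumerate(zip(self.blocks_s, self.blocks_t)):
            x_s, x_t = bs(x_s), bt(x_t)
            if i < len(self.attn):                     # no attention after the last module
                x_s, x_t = self.attn[i](x_s, x_t)
        return self.fusion_head(x_s, x_t)              # expression class probabilities
```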
Further, the spatial stream branch comprises a plurality of sequentially connected convolution modules. Each convolution module comprises one or more convolutional layers and a pooling layer; each convolutional layer includes a ReLU nonlinear activation function and applies m_1 convolution kernels of size k_1 × k_1 to the output of the previous layer, where m_1 is chosen from {32, 64, 128, 256, 512} and k_1 from {3, 5, 7}; the pooling layer downsamples the output of the last convolutional layer with a k_2 × k_2 pooling kernel, where k_2 is chosen from {1, 2, 3}.
Further, the temporal stream branch comprises the same number of sequentially connected convolution modules as the spatial stream branch. Each convolution module comprises one or more convolutional layers and a pooling layer; each convolutional layer includes a ReLU nonlinear activation function and applies m_2 convolution kernels of size k_3 × k_3 to the output of the previous layer, where m_2 is chosen from {32, 64, 128, 256, 512} and k_3 from {3, 5, 7}; the pooling layer downsamples the output of the last convolutional layer with a k_4 × k_4 pooling kernel, where k_4 is chosen from {1, 2, 3}.
Further, in each inter-stream attention module, let X_S and X_T denote the spatial and temporal feature maps input to the module, with sizes H_S × W_S × C_S and H_T × W_T × C_T respectively. The module performs the following computation steps (a code sketch of the module follows this list):
(1) Compute the correlation weight matrices F_1 and F_2 of the spatial feature map X_S and the temporal feature map X_T: apply C_O 1 × 1 convolution kernels to X_S and to X_T, obtaining two feature maps of sizes H_S × W_S × C_O and H_T × W_T × C_O; reshape them into two matrices of sizes (H_S·W_S) × C_O and C_O × (H_T·W_T) respectively, and multiply these matrices to obtain a matrix F of size (H_S·W_S) × (H_T·W_T); apply a Softmax to each row vector of F to obtain the correlation weight matrix F_1, and a Softmax to each column vector of F to obtain the correlation weight matrix F_2.
(2) Compute the mapping matrix G_S of the spatial feature map X_S: apply C_O 1 × 1 convolution kernels to X_S, obtaining a feature map of size H_S × W_S × C_O, and reshape it into a matrix of size C_O × (H_S·W_S), which is G_S.
(3) Compute the mapping matrix G_T of the temporal feature map X_T: apply C_O 1 × 1 convolution kernels to X_T, obtaining a feature map of size H_T × W_T × C_O, and reshape it into a matrix of size (H_T·W_T) × C_O, which is G_T.
(4) Compute the spatial residual feature map Y_S: multiply F_1 by G_T to obtain a matrix of size (H_S·W_S) × C_O, reshape it to H_S × W_S × C_O, and apply C_S 1 × 1 convolution kernels to obtain the spatial residual feature map Y_S of size H_S × W_S × C_S.
(5) Compute the temporal residual feature map Y_T: multiply G_S by F_2 to obtain a matrix of size C_O × (H_T·W_T), reshape it to H_T × W_T × C_O, and apply C_T 1 × 1 convolution kernels to obtain the temporal residual feature map Y_T of size H_T × W_T × C_T.
(6) Compute the spatial feature map Z_S output by the module: using a residual connection, add the spatial residual feature map Y_S to the input spatial feature map X_S to obtain Z_S.
(7) Compute the temporal feature map Z_T output by the module: using a residual connection, add the temporal residual feature map Y_T to the input temporal feature map X_T to obtain Z_T.
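These steps map directly onto a few 1 × 1 convolutions, reshapes and matrix products. The PyTorch sketch below is one possible realization of steps (1)-(7); the class name InterStreamAttention and the internal layer names are assumptions made for illustration, not taken from the patent.

```python
import torch
import torch.nn as nn

class InterStreamAttention(nn.Module):
    """Sketch of one inter-stream attention module (steps (1)-(7) above)."""

    def __init__(self, c_s, c_t, c_o):
        super().__init__()
        self.theta_s = nn.Conv2d(c_s, c_o, kernel_size=1)  # 1x1 conv on X_S for the correlation matrix
        self.theta_t = nn.Conv2d(c_t, c_o, kernel_size=1)  # 1x1 conv on X_T for the correlation matrix
        self.phi_s = nn.Conv2d(c_s, c_o, kernel_size=1)    # 1x1 conv giving the mapping matrix G_S
        self.phi_t = nn.Conv2d(c_t, c_o, kernel_size=1)    # 1x1 conv giving the mapping matrix G_T
        self.out_s = nn.Conv2d(c_o, c_s, kernel_size=1)    # restores C_S channels for Y_S
        self.out_t = nn.Conv2d(c_o, c_t, kernel_size=1)    # restores C_T channels for Y_T

    def forward(self, x_s, x_t):
        b, _, h_s, w_s = x_s.shape
        _, _, h_t, w_t = x_t.shape
        # (1) correlation matrix F and its row-/column-wise Softmax F_1, F_2
        q = self.theta_s(x_s).flatten(2).transpose(1, 2)    # B x (H_S*W_S) x C_O
        k = self.theta_t(x_t).flatten(2)                    # B x C_O x (H_T*W_T)
        f = torch.bmm(q, k)                                 # B x (H_S*W_S) x (H_T*W_T)
        f1 = torch.softmax(f, dim=2)                        # Softmax over each row of F
        f2 = torch.softmax(f, dim=1)                        # Softmax over each column of F
        # (2)(3) mapping matrices G_S and G_T
        g_s = self.phi_s(x_s).flatten(2)                    # B x C_O x (H_S*W_S)
        g_t = self.phi_t(x_t).flatten(2).transpose(1, 2)    # B x (H_T*W_T) x C_O
        # (4) spatial residual Y_S = conv1x1(reshape(F_1 @ G_T))
        y_s = self.out_s(torch.bmm(f1, g_t).transpose(1, 2).reshape(b, -1, h_s, w_s))
        # (5) temporal residual Y_T = conv1x1(reshape(G_S @ F_2))
        y_t = self.out_t(torch.bmm(g_s, f2).reshape(b, -1, h_t, w_t))
        # (6)(7) residual connections give the module outputs Z_S and Z_T
        return x_s + y_s, x_t + y_t
```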
Based on the same inventive concept, the invention discloses a dynamic expression recognition system based on an attention mechanism between spatial and temporal streams, comprising the following modules:
a data preprocessing module, used to collect facial expression video clips and establish a facial expression video library containing expression category labels;
a network construction module, used to construct a two-stream convolutional neural network model with embedded inter-stream attention modules, the model comprising a data processing layer, a spatial stream branch, a temporal stream branch, inter-stream attention modules, a feature fusion layer, a fully connected layer and a classification layer;
the data processing layer preprocesses the input video as follows: split the video into frames and extract an image sequence of length u frames from the resulting images in temporal order, where u is the set sequence length; perform face detection, cropping and alignment on each image in the sequence and normalize each processed image, obtaining a facial expression image sequence of length u frames; randomly select one image from this sequence as the single face frame corresponding to the input video; compute an optical flow map between every two adjacent images in the sequence and stack the optical flow maps along the channel dimension in temporal order as the stacked optical flow maps corresponding to the input video;
the spatial stream branch comprises a plurality of convolution modules; its input is the single face frame output by the data processing layer, and it extracts the spatial features of the facial expression;
the temporal stream branch comprises the same number of convolution modules as the spatial stream branch; its input is the stacked optical flow maps output by the data processing layer, and it extracts the temporal features of the facial expression;
the inter-stream attention modules are embedded between the convolution modules of the spatial and temporal stream branches; the input of each module is the output of the preceding convolution modules in the two-stream network, and its output is the input of the following convolution modules, enabling information interaction between spatial and temporal features; each module first computes the correlation weight matrices of the input spatial and temporal feature maps, then computes a mapping matrix for each of the two feature maps, computes a spatial residual feature map and a temporal residual feature map from the correlation weight matrices and the mapping matrices, and finally uses residual connections to add the spatial residual feature map to the input spatial feature map, giving the output spatial feature map, and to add the temporal residual feature map to the input temporal feature map, giving the output temporal feature map;
the feature fusion layer applies global average pooling separately to the spatial feature map output by the spatial stream branch and the temporal feature map output by the temporal stream branch, concatenates the two resulting feature vectors, and outputs a single feature vector;
the fully connected layer fully connects the feature fusion layer to the classification layer;
the classification layer computes the probability that the facial expression in the input video belongs to each expression category;
a network training module, used to train the constructed network model with the video samples in the established facial expression video library;
an expression recognition module, used to perform facial expression recognition on a newly input video with the trained network model.
Based on the same inventive concept, the invention discloses a dynamic expression recognition system based on an attention mechanism between spatial and temporal streams, comprising at least one computing device; the computing device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, and the computer program, when loaded into the processor, implements the above dynamic expression recognition method based on an attention mechanism between spatial and temporal streams.
Beneficial effects: compared with the prior art, the invention has the following technical effects:
(1) The invention builds a deep learning network model, so no complicated manual feature extraction or feature dimension reduction is required; the parameters are adjusted adaptively by training the network model, the features reflecting the facial expressions in the video samples are extracted automatically, the extracted features better represent the changes of facial expressions, and the model has a stronger fitting capability than traditional facial expression recognition.
(2) The invention adopts a two-stream convolutional neural network, which uses a spatial stream branch and a temporal stream branch to extract features from a single face frame taken from the video and from stacked optical flow maps representing the expression change, respectively. It thus obtains the spatial and temporal features of the facial expression simultaneously, extends feature extraction from static images to video, captures the temporal information of the facial expression in addition to its spatial information, and has stronger representation and generalization capability.
(3) By embedding inter-stream attention modules between the convolution modules of the two-stream network, the invention enables information interaction between the spatial and temporal features at every stage of the model, so that the model learns the information that relates the spatial and temporal features and ignores the information that does not. This captures the dynamic correlation between spatial and temporal features (for example, when a person smiles, the corners of the mouth rise and the corners of the eyes are drawn down, which causes two optical flows in the optical flow map to move in two directions), further exploits the complementarity of the spatial and temporal stream branches, and yields more discriminative and representative features.
(4) By embedding the inter-stream attention modules between the two branches of the two-stream convolutional neural network, the network can adaptively assign higher weights to salient regions in the face image or the optical flow map, so that these regions have a greater influence on the higher-level deep features learned in the next stage while local features of non-salient regions are suppressed. A connection is thus established between the spatial and temporal stream branches, the two branches learn local key features from each other, highly discriminative features are obtained, and the accuracy and robustness of facial expression recognition are improved.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
fig. 2 is a diagram of a network model architecture used in an embodiment of the present invention.
Detailed Description
The following description will explain embodiments of the present invention in further detail with reference to the accompanying drawings.
As shown in fig. 1, the dynamic expression recognition method based on an attention mechanism between spatial and temporal streams provided in an embodiment of the present invention mainly comprises the following steps:
Step 1: collect facial expression video clips and establish a facial expression video library containing expression category labels.
In this embodiment, the AFEW facial expression video library is used. Its video samples come from different movies and total 1809 clips; the face in each video sample is labeled with one of seven expression categories: anger, disgust, fear, happiness, sadness, surprise and neutral. In practice, other facial expression video libraries may be used instead, or facial expression videos may be collected independently to build a video library with expression category labels.
Step 2: a two-stream convolutional neural network model with embedded inter-stream attention modules is constructed; the model comprises a data processing layer, a spatial stream branch, a temporal stream branch, inter-stream attention modules, a feature fusion layer, a fully connected layer and a classification layer.
The data processing layer preprocesses the input video as follows: split the video into frames and extract an image sequence of length u frames from the resulting images in temporal order, where u is the set sequence length; perform face detection, cropping and alignment on each image in the sequence and normalize each processed image, obtaining a facial expression image sequence of length u frames; randomly select one image from this sequence as the single face frame corresponding to the input video; compute an optical flow map between every two adjacent images in the sequence and stack the resulting optical flow maps along the channel dimension in temporal order as the stacked optical flow maps corresponding to the input video.
The spatial stream branch comprises a plurality of convolution modules; its input is the single face frame output by the data processing layer, and it extracts the spatial features of the facial expression. Each convolution module comprises one or more convolutional layers and a pooling layer; each convolutional layer includes a ReLU nonlinear activation function and applies m_1 convolution kernels of size k_1 × k_1 to the output of the previous layer, where m_1 is chosen from {32, 64, 128, 256, 512} and k_1 from {3, 5, 7}; the pooling layer downsamples the output of the last convolutional layer with a k_2 × k_2 pooling kernel, where k_2 is chosen from {1, 2, 3}.
The temporal stream branch comprises the same number of convolution modules as the spatial stream branch; its input is the stacked optical flow maps output by the data processing layer, and it extracts the temporal features of the facial expression. Each convolution module comprises one or more convolutional layers and a pooling layer; each convolutional layer includes a ReLU nonlinear activation function and applies m_2 convolution kernels of size k_3 × k_3 to the output of the previous layer, where m_2 is chosen from {32, 64, 128, 256, 512} and k_3 from {3, 5, 7}; the pooling layer downsamples the output of the last convolutional layer with a k_4 × k_4 pooling kernel, where k_4 is chosen from {1, 2, 3}.
The inter-stream attention modules are embedded between the convolution modules of the spatial and temporal stream branches; the input of each module is the output of the preceding convolution modules in the two-stream network, and its output is the input of the following convolution modules, enabling information interaction between spatial and temporal features. Each module first computes the correlation weight matrices of the input spatial and temporal feature maps, then computes a mapping matrix for each of the two feature maps, computes a spatial residual feature map and a temporal residual feature map from the correlation weight matrices and the mapping matrices, and finally uses residual connections to add the spatial residual feature map to the input spatial feature map, giving the output spatial feature map, and to add the temporal residual feature map to the input temporal feature map, giving the output temporal feature map.
The feature fusion layer applies global average pooling separately to the spatial feature map output by the spatial stream branch and the temporal feature map output by the temporal stream branch, concatenates the two resulting feature vectors, and outputs a single feature vector.
The fully connected layer fully connects the feature fusion layer to the classification layer.
The classification layer computes the probability that the facial expression in the input video belongs to each expression category.
The dynamic expression recognition network model constructed in this embodiment, based on an attention mechanism between spatial and temporal streams, has the specific structure shown in fig. 2:
(1) The data processing layer first splits the input video into frames with FFmpeg and extracts an image sequence of length 9 frames from the resulting images in temporal order; it then applies the Dlib face detection algorithm to each image in the sequence for face detection, cropping and alignment, and normalizes each processed image to 224 × 224 pixels, obtaining a facial expression image sequence of length 9 frames; it then randomly selects one image from this sequence as the single face frame corresponding to the input video; finally, it computes the optical flow map between every two adjacent images in the sequence with the TV-L1 algorithm and stacks the resulting optical flow maps along the channel dimension in temporal order as the stacked optical flow maps corresponding to the input video.
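For illustration, a rough sketch of this preprocessing with OpenCV and Dlib is shown below. It omits the face-alignment step, works on grayscale crops, and assumes the opencv-contrib module provides the TV-L1 optical flow implementation; it should be read as an approximation of the described pipeline, not the patentee's code.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()            # assumed face detector (Dlib)

def preprocess(video_path, u=9, size=224):
    """Return one face image and the stacked optical flow maps for a clip."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    idx = np.linspace(0, len(frames) - 1, u).astype(int)   # u frames spread over the clip
    faces = []
    for i in idx:
        gray = cv2.cvtColor(frames[i], cv2.COLOR_BGR2GRAY)
        det = detector(gray, 1)
        if det:                                            # crop the first detected face
            r = det[0]
            gray = gray[max(r.top(), 0):r.bottom(), max(r.left(), 0):r.right()]
        faces.append(cv2.resize(gray, (size, size)))
    # optical flow between adjacent frames, stacked along the channel dimension
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()        # assumes opencv-contrib is installed
    flows = [tvl1.calc(faces[i], faces[i + 1], None) for i in range(u - 1)]
    stacked_flow = np.concatenate(flows, axis=2)           # H x W x 2*(u-1)
    single_face = faces[np.random.randint(u)]              # randomly chosen single face frame
    return single_face, stacked_flow
```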
(2) The spatial stream branch and the temporal stream branch adopt two convolutional neural networks with the same structure but separate parameters, each comprising 5 sequentially connected convolution modules (a code sketch of these modules follows the list):
Convolution module A and convolution module A': each comprises 2 convolutional layers and 1 pooling layer; each of the 2 convolutional layers applies 64 convolution kernels of size 3 × 3 to the feature map with stride 1 and zero-padding 1, followed by a ReLU nonlinear mapping, and outputs a feature map of size 224 × 224 × 64; the pooling layer uses a 2 × 2 max-pooling kernel with stride 2 and outputs a feature map of size 112 × 112 × 64.
Convolution module B and convolution module B': each comprises 2 convolutional layers and 1 pooling layer; each of the 2 convolutional layers applies 128 convolution kernels of size 3 × 3 to the feature map with stride 1 and zero-padding 1, followed by a ReLU nonlinear mapping, and outputs a feature map of size 112 × 112 × 128; the pooling layer uses a 2 × 2 max-pooling kernel with stride 2 and outputs a feature map of size 56 × 56 × 128.
Convolution module C and convolution module C': each comprises 3 convolutional layers and 1 pooling layer; each of the 3 convolutional layers applies 256 convolution kernels of size 3 × 3 to the feature map with stride 1 and zero-padding 1, followed by a ReLU nonlinear mapping, and outputs a feature map of size 56 × 56 × 256; the pooling layer uses a 2 × 2 max-pooling kernel with stride 2 and outputs a feature map of size 28 × 28 × 256.
Convolution module D and convolution module D': each comprises 3 convolutional layers and 1 pooling layer; each of the 3 convolutional layers applies 512 convolution kernels of size 3 × 3 to the feature map with stride 1 and zero-padding 1, followed by a ReLU nonlinear mapping, and outputs a feature map of size 28 × 28 × 512; the pooling layer uses a 2 × 2 max-pooling kernel with stride 2 and outputs a feature map of size 14 × 14 × 512.
Convolution module E and convolution module E': each comprises 3 convolutional layers and 1 pooling layer; each of the 3 convolutional layers applies 512 convolution kernels of size 3 × 3 to the feature map with stride 1 and zero-padding 1, followed by a ReLU nonlinear mapping, and outputs a feature map of size 14 × 14 × 512; the pooling layer uses a 2 × 2 max-pooling kernel with stride 2 and outputs a feature map of size 7 × 7 × 512, which is the output feature map of the corresponding branch.
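Both branches thus follow the convolutional part of VGG-16. A minimal PyTorch sketch of these modules is given below; the helper name conv_block, the 3-channel input assumed for the face image and the 16 flow channels assumed for u = 9 frames are illustrative choices, not stated in the patent.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs):
    """One convolution module: n_convs (3x3 conv, stride 1, padding 1) + ReLU layers,
    followed by 2x2 max pooling with stride 2."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# Modules A-E of the spatial branch (input assumed to be a 3-channel face image)
spatial_branch = [conv_block(3, 64, 2),     # A: 224x224x3  -> 112x112x64
                  conv_block(64, 128, 2),   # B:            -> 56x56x128
                  conv_block(128, 256, 3),  # C:            -> 28x28x256
                  conv_block(256, 512, 3),  # D:            -> 14x14x512
                  conv_block(512, 512, 3)]  # E:            -> 7x7x512

# Modules A'-E' of the temporal branch; the first module takes the stacked optical
# flow maps, i.e. 2*(u-1) = 16 channels for u = 9 frames
temporal_branch = [conv_block(16, 64, 2), conv_block(64, 128, 2), conv_block(128, 256, 3),
                   conv_block(256, 512, 3), conv_block(512, 512, 3)]
```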
(3) The inter-stream attention modules are embedded between the convolution modules of the spatial and temporal stream branches; since each branch has 5 convolution modules, 4 inter-stream attention modules are embedded in total. Taking inter-stream attention module A as an example, the implementation details are as follows:
The inputs of inter-stream attention module A are the spatial feature map X_S output by convolution module A and the temporal feature map X_T output by convolution module A'; X_S and X_T both have size 112 × 112 × 64. The operation steps of the module are:
Compute the correlation weight matrices F_1 and F_2 of the spatial feature map X_S and the temporal feature map X_T: apply 32 1 × 1 convolution kernels to X_S and to X_T, obtaining two feature maps of size 112 × 112 × 32; reshape them into two matrices of sizes 112² × 32 and 32 × 112² and multiply these matrices to obtain a matrix F of size 112² × 112²; apply a Softmax to each row vector of F to obtain the correlation weight matrix F_1, and a Softmax to each column vector of F to obtain the correlation weight matrix F_2.
Compute the mapping matrix G_S of the spatial feature map X_S: apply 32 1 × 1 convolution kernels to X_S, obtaining a feature map of size 112 × 112 × 32, and reshape it into a matrix of size 32 × 112², which is G_S.
Compute the mapping matrix G_T of the temporal feature map X_T: apply 32 1 × 1 convolution kernels to X_T, obtaining a feature map of size 112 × 112 × 32, and reshape it into a matrix of size 112² × 32, which is G_T.
Compute the spatial residual feature map Y_S: multiply F_1 by G_T to obtain a matrix of size 112² × 32, reshape it to 112 × 112 × 32, and apply 64 1 × 1 convolution kernels to obtain the spatial residual feature map Y_S of size 112 × 112 × 64.
Compute the temporal residual feature map Y_T: multiply G_S by F_2 to obtain a matrix of size 32 × 112², reshape it to 112 × 112 × 32, and apply 64 1 × 1 convolution kernels to obtain the temporal residual feature map Y_T of size 112 × 112 × 64.
Compute the spatial feature map Z_S output by the module: using a residual connection, add the spatial residual feature map Y_S to the input spatial feature map X_S to obtain Z_S, of size 112 × 112 × 64, which is the input of convolution module B.
Compute the temporal feature map Z_T output by the module: using a residual connection, add the temporal residual feature map Y_T to the input temporal feature map X_T to obtain Z_T, of size 112 × 112 × 64, which is the input of convolution module B'.
Similarly, the inputs of inter-stream attention module B are the feature maps output by convolution modules B and B', of size 56 × 56 × 128; the inputs of inter-stream attention module C are the feature maps output by convolution modules C and C', of size 28 × 28 × 256; and the inputs of inter-stream attention module D are the feature maps output by convolution modules D and D', of size 14 × 14 × 512. The other implementation details are similar to those described above.
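Continuing the InterStreamAttention sketch given earlier, the four modules A-D could be instantiated for these channel counts; choosing the inner dimension as half the channel count matches the 32 channels used above for module A but is otherwise an assumption.

```python
# Inter-stream attention modules A-D, one per pair of convolution modules A-D / A'-D'
attn_modules = [InterStreamAttention(c_s=c, c_t=c, c_o=c // 2)
                for c in (64, 128, 256, 512)]
```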
(4) The feature fusion layer takes as input the spatial feature map output by the spatial stream branch and the temporal feature map output by the temporal stream branch, both of size 7 × 7 × 512; it applies global average pooling to each of the two feature maps to obtain two 512-dimensional feature vectors, concatenates them, and outputs a 1024-dimensional feature vector.
(5) The fully connected layer comprises 256 neurons and fully connects the feature fusion layer to the classification layer.
(6) The classification layer adopts a Softmax classifier and comprises 7 neurons; it outputs a 7-dimensional vector whose components are the probabilities that the facial expression in the input video belongs to each expression category, and the expression category corresponding to the largest component is the label assigned to the input video by the network model.
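A minimal PyTorch sketch of items (4)-(6), again with an assumed class name, is:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Feature fusion layer, fully connected layer and Softmax classification layer."""

    def __init__(self, c_s=512, c_t=512, hidden=256, n_classes=7):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)             # global average pooling
        self.fc = nn.Linear(c_s + c_t, hidden)         # 1024 -> 256
        self.cls = nn.Linear(hidden, n_classes)        # 256 -> 7 expression categories

    def forward(self, x_s, x_t):
        v = torch.cat([self.gap(x_s).flatten(1),       # 512-d spatial feature vector
                       self.gap(x_t).flatten(1)], 1)   # 512-d temporal feature vector
        return torch.softmax(self.cls(self.fc(v)), dim=1)   # class probabilities
```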
Step 3: train the constructed network model using the video samples in the established facial expression video library.
Step 4: perform facial expression recognition on a newly input video using the trained network model.
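The patent does not detail the training procedure beyond the use of the labeled video library; a conventional supervised setup, reusing the sketches above, might look roughly like this (the loss, optimizer and hyper-parameters are assumptions):

```python
import torch

model = TwoStreamAttentionNet(spatial_branch, temporal_branch, attn_modules, FusionHead())
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(frame, stacked_flow, label):
    probs = model(frame, stacked_flow)
    # NLL on the softmax output (equivalent to cross-entropy on the logits)
    loss = torch.nn.functional.nll_loss(torch.log(probs + 1e-8), label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def predict(frame, stacked_flow):
    with torch.no_grad():
        return model(frame, stacked_flow).argmax(dim=1)   # index of the predicted expression
```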
Based on the same inventive concept, an embodiment of the invention discloses a dynamic expression recognition system based on an attention mechanism between spatial and temporal streams, comprising: a data preprocessing module, used to collect facial expression video clips and establish a facial expression video library containing expression category labels; a network training module, used to train the constructed network model with the video samples in the established facial expression video library; and an expression recognition module, used to perform facial expression recognition on a newly input video with the trained network model.
The constructed two-stream convolutional neural network model with embedded inter-stream attention modules comprises a data processing layer, a spatial stream branch, a temporal stream branch, inter-stream attention modules, a feature fusion layer, a fully connected layer and a classification layer. Specifically: the data processing layer preprocesses the input video; the spatial stream branch comprises a plurality of convolution modules, its input is the single face frame output by the data processing layer, and it extracts the spatial features of the facial expression; the temporal stream branch comprises the same number of convolution modules as the spatial stream branch, its input is the stacked optical flow maps output by the data processing layer, and it extracts the temporal features of the facial expression; the inter-stream attention modules are embedded between the convolution modules of the spatial and temporal stream branches, the input of each module is the output of the preceding convolution modules in the two-stream network and its output is the input of the following convolution modules, enabling information interaction between spatial and temporal features; the feature fusion layer applies global average pooling separately to the spatial feature map output by the spatial stream branch and the temporal feature map output by the temporal stream branch, concatenates the two resulting feature vectors, and outputs a single feature vector; the fully connected layer fully connects the feature fusion layer to the classification layer; and the classification layer computes the probability that the facial expression in the input video belongs to each expression category.
Based on the same inventive concept, an embodiment of the invention discloses a dynamic expression recognition system based on an attention mechanism between spatial and temporal streams, comprising at least one computing device; the computing device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, and the computer program, when loaded into the processor, implements the above dynamic expression recognition method based on an attention mechanism between spatial and temporal streams.
The above description is only an embodiment of the present invention, and the scope of the present invention is not limited thereto; any modification or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed by the present invention falls within the scope of the present invention, which is therefore defined by the protection scope of the claims.

Claims (6)

1. A dynamic expression recognition method based on an attention mechanism between spatial and temporal streams, characterized by comprising the following steps:
step 1: collecting facial expression video clips, and establishing a facial expression video library containing expression category labels;
step 2: constructing a double-current convolutional neural network model embedded into a space-time inter-flow attention mechanism module, wherein the model comprises a data processing layer, a space flow branch, a time flow branch, a space-time inter-flow attention mechanism module, a feature fusion layer, a full connection layer and a classification layer;
the data processing layer is used for preprocessing an input video, and the preprocessing process comprises the following steps: performing framing processing on a video, and extracting an image sequence with the length of u frames from the obtained images according to the time sequence, wherein u is the set sequence length; carrying out face detection, cutting and alignment on each image in the image sequence, and normalizing each processed image to obtain a facial expression image sequence with the length of u frames; randomly selecting one image from the facial expression image sequence with the length of u frames as a single frame of facial image corresponding to the input video; calculating an optical flow graph between every two adjacent images in the facial expression image sequence with the length of u frames, and stacking the optical flow graphs on a channel dimension according to a time sequence to serve as a stacked optical flow graph corresponding to the input video;
the spatial stream branch comprises a plurality of convolution modules, and the input of the branch is a single-frame face image output by the data processing layer and used for extracting spatial domain characteristics of facial expressions;
the time flow branch comprises convolution modules with the same number as the spatial flow branch, and the input of the time flow branch is a stacked light flow diagram output by the data processing layer and used for extracting time domain characteristics of facial expressions;
the space-time inter-flow attention mechanism module is embedded between convolution modules of the space flow branch and the time flow branch, the input of the module is the output of a convolution module at the upper layer in the double-flow convolution neural network, and the output of the module is the input of a convolution module at the lower layer in the double-flow convolution neural network and is used for carrying out information interaction on space-domain characteristics and time-domain characteristics; the space-time inter-flow attention mechanism module firstly calculates a relevance weighting matrix of an input space domain characteristic diagram and a time domain characteristic diagram, then respectively calculates a mapping matrix of the space domain characteristic diagram and the time domain characteristic diagram, respectively calculates a space domain residual error characteristic diagram and a time domain residual error characteristic diagram according to the relevance weighting matrix and the mapping matrix, finally uses residual error connection to add the space domain residual error characteristic diagram and the input space domain characteristic diagram to obtain an output space domain characteristic diagram, and adds the time domain residual error characteristic diagram and the input time domain characteristic diagram to obtain an output time domain characteristic diagram;
the characteristic fusion layer is used for respectively carrying out global average pooling operation on the space domain characteristic diagram output by the space flow branch and the time domain characteristic diagram output by the time flow branch, and outputting a characteristic vector after splicing the two obtained characteristic vectors;
the full connection layer is used for fully connecting the characteristic fusion layer and the classification layer;
the classification layer is used for calculating the probability that the facial expression in the input video belongs to each expression category;
step 3: training the constructed network model with the video samples in the established facial expression video library;
step 4: performing facial expression recognition on a newly input video with the trained network model.
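The following is a minimal sketch, for illustration only, of how the data processing layer described in claim 1 could be realized with OpenCV and NumPy. The sequence length u = 16, the 112×112 crop size, uniform frame sampling, the Farneback optical flow algorithm and the detect_and_align_face helper are all assumptions of the sketch, not requirements of the claims.

```python
import cv2
import numpy as np

def preprocess_video(path, u=16, size=(112, 112), detect_and_align_face=lambda img: img):
    """Return (single-frame face image, stacked optical flow map) for one input video."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()

    # Extract an image sequence of length u frames in temporal order (uniform sampling assumed).
    idx = np.linspace(0, len(frames) - 1, u).astype(int)
    faces = []
    for i in idx:
        face = detect_and_align_face(frames[i])                    # face detection, cropping, alignment
        face = cv2.resize(face, size).astype(np.float32) / 255.0   # normalization
        faces.append(face)

    # Single-frame face image: one image chosen at random from the sequence.
    single_frame = faces[np.random.randint(u)]

    # Stacked optical flow map: u-1 flow fields (2 channels each) stacked along the channel dimension.
    flows = []
    for a, b in zip(faces[:-1], faces[1:]):
        g1 = cv2.cvtColor((a * 255).astype(np.uint8), cv2.COLOR_BGR2GRAY)
        g2 = cv2.cvtColor((b * 255).astype(np.uint8), cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(g1, g2, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)                                          # H x W x 2 (horizontal, vertical)
    stacked_flow = np.concatenate(flows, axis=2)                    # H x W x 2(u-1)

    return single_frame, stacked_flow
```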
2. The dynamic expression recognition method based on a spatio-temporal inter-stream attention mechanism according to claim 1, wherein the spatial stream branch comprises a plurality of sequentially connected convolution modules;
each convolution module comprises one or more convolution layers and a pooling layer; each convolution layer includes a ReLU nonlinear activation function and uses m_1 convolution kernels of size k_1×k_1 to perform a convolution operation on the output of the previous layer, where m_1 is selected from the values 32, 64, 128, 256 and 512, and k_1 is selected from the values 3, 5 and 7; the pooling layer uses a k_2×k_2 pooling kernel to perform a downsampling operation on the output of the preceding convolution layer, where k_2 is selected from the values 1, 2 and 3 (an illustrative sketch of this module follows claim 2).
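A minimal PyTorch sketch of the convolution module of claims 2 and 3. The concrete choices m = 64, k = 3, a 2×2 max-pooling kernel and two convolution layers per module are illustrative picks from the value ranges recited above; the temporal stream branch of claim 3 would reuse the same module with its own (m_2, k_3, k_4) choices.

```python
import torch.nn as nn

class ConvModule(nn.Module):
    """One or more k x k convolution layers, each followed by ReLU, then a pooling layer."""

    def __init__(self, in_channels, m=64, k=3, pool_k=2, num_convs=2):
        super().__init__()
        layers = []
        for i in range(num_convs):
            layers += [nn.Conv2d(in_channels if i == 0 else m, m, kernel_size=k, padding=k // 2),
                       nn.ReLU(inplace=True)]
        layers.append(nn.MaxPool2d(kernel_size=pool_k, stride=pool_k))
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)
```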
3. The dynamic expression recognition method based on a spatio-temporal inter-stream attention mechanism according to claim 1, wherein the temporal stream branch comprises the same number of sequentially connected convolution modules as the spatial stream branch;
each convolution module comprises one or more convolution layers and a pooling layer; each convolution layer includes a ReLU nonlinear activation function and uses m_2 convolution kernels of size k_3×k_3 to perform a convolution operation on the output of the previous layer, where m_2 is selected from the values 32, 64, 128, 256 and 512, and k_3 is selected from the values 3, 5 and 7; the pooling layer uses a k_4×k_4 pooling kernel to perform a downsampling operation on the output of the preceding convolution layer, where k_4 is selected from the values 1, 2 and 3.
4. The dynamic expression recognition method based on a spatio-temporal inter-stream attention mechanism according to claim 1, wherein, denoting the spatial-domain feature map and the temporal-domain feature map input to the spatio-temporal inter-stream attention mechanism module by X_S and X_T, whose sizes are H_S×W_S×C_S and H_T×W_T×C_T respectively, the computation performed by the module comprises the following steps (an illustrative sketch follows this claim):
(1) computing the correlation weight matrices F_1 and F_2 of the spatial-domain feature map X_S and the temporal-domain feature map X_T: C_O convolution kernels of size 1×1 are applied to X_S and, separately, to X_T, yielding two feature maps of sizes H_S×W_S×C_O and H_T×W_T×C_O; through dimension transformation the two feature maps are reshaped into two matrices of sizes H_S W_S×C_O and C_O×H_T W_T, which are multiplied to obtain a matrix F of size H_S W_S×H_T W_T; a Softmax operation applied to each row vector of F yields the correlation weight matrix F_1, and a Softmax operation applied to each column vector of F yields the correlation weight matrix F_2;
(2) computing the mapping matrix G_S of the spatial-domain feature map X_S: C_O convolution kernels of size 1×1 are applied to X_S, yielding a feature map of size H_S×W_S×C_O, which is reshaped through dimension transformation into a matrix of size C_O×H_S W_S, giving the mapping matrix G_S of X_S;
(3) computing the mapping matrix G_T of the temporal-domain feature map X_T: C_O convolution kernels of size 1×1 are applied to X_T, yielding a feature map of size H_T×W_T×C_O, which is reshaped through dimension transformation into a matrix of size H_T W_T×C_O, giving the mapping matrix G_T of X_T;
(4) computing the spatial-domain residual feature map Y_S: the matrix F_1 is multiplied by the matrix G_T to obtain a matrix of size H_S W_S×C_O, which is reshaped through dimension transformation into H_S×W_S×C_O; C_S convolution kernels of size 1×1 are then applied, yielding the spatial-domain residual feature map Y_S of size H_S×W_S×C_S;
(5) computing the temporal-domain residual feature map Y_T: the matrix G_S is multiplied by the matrix F_2 to obtain a matrix of size C_O×H_T W_T, which is reshaped through dimension transformation into H_T×W_T×C_O; C_T convolution kernels of size 1×1 are then applied, yielding the temporal-domain residual feature map Y_T of size H_T×W_T×C_T;
(6) computing the spatial-domain feature map Z_S output by the module: using a residual connection, the spatial-domain residual feature map Y_S is added to the spatial-domain feature map X_S input to the module, giving the output spatial-domain feature map Z_S;
(7) computing the temporal-domain feature map Z_T output by the module: using a residual connection, the temporal-domain residual feature map Y_T is added to the temporal-domain feature map X_T input to the module, giving the output temporal-domain feature map Z_T.
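A PyTorch sketch of the computation in steps (1)-(7) of claim 4, assuming the usual (batch, channels, height, width) tensor layout; the bottleneck width C_O = 64 and the layer/variable names are assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InterStreamAttention(nn.Module):
    """Spatio-temporal inter-stream attention: exchanges information between X_S and X_T."""

    def __init__(self, c_s, c_t, c_o=64):
        super().__init__()
        self.embed_s = nn.Conv2d(c_s, c_o, kernel_size=1)  # step (1): 1x1 conv on X_S
        self.embed_t = nn.Conv2d(c_t, c_o, kernel_size=1)  # step (1): 1x1 conv on X_T
        self.map_s = nn.Conv2d(c_s, c_o, kernel_size=1)    # step (2): mapping of X_S
        self.map_t = nn.Conv2d(c_t, c_o, kernel_size=1)    # step (3): mapping of X_T
        self.out_s = nn.Conv2d(c_o, c_s, kernel_size=1)    # step (4): back to C_S channels
        self.out_t = nn.Conv2d(c_o, c_t, kernel_size=1)    # step (5): back to C_T channels

    def forward(self, x_s, x_t):
        b, _, h_s, w_s = x_s.shape
        _, _, h_t, w_t = x_t.shape
        c_o = self.out_s.in_channels

        # (1) correlation matrix F (B, H_S*W_S, H_T*W_T) and its row/column Softmax forms F_1, F_2
        q = self.embed_s(x_s).flatten(2).transpose(1, 2)   # B, H_S*W_S, C_O
        k = self.embed_t(x_t).flatten(2)                   # B, C_O, H_T*W_T
        f = torch.bmm(q, k)                                # B, H_S*W_S, H_T*W_T
        f1 = F.softmax(f, dim=2)                           # Softmax over each row vector
        f2 = F.softmax(f, dim=1)                           # Softmax over each column vector

        # (2)(3) mapping matrices G_S (B, C_O, H_S*W_S) and G_T (B, H_T*W_T, C_O)
        g_s = self.map_s(x_s).flatten(2)
        g_t = self.map_t(x_t).flatten(2).transpose(1, 2)

        # (4) spatial-domain residual feature map Y_S
        y_s = torch.bmm(f1, g_t)                           # B, H_S*W_S, C_O
        y_s = y_s.transpose(1, 2).reshape(b, c_o, h_s, w_s)
        y_s = self.out_s(y_s)                              # B, C_S, H_S, W_S

        # (5) temporal-domain residual feature map Y_T
        y_t = torch.bmm(g_s, f2)                           # B, C_O, H_T*W_T
        y_t = y_t.reshape(b, c_o, h_t, w_t)
        y_t = self.out_t(y_t)                              # B, C_T, H_T, W_T

        # (6)(7) residual connections give the output feature maps Z_S and Z_T
        return x_s + y_s, x_t + y_t
```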
5. A dynamic expression recognition system based on a spatio-temporal inter-stream attention mechanism, the system comprising the following modules:
the data preprocessing module is used for acquiring facial expression video clips and establishing a facial expression video library containing expression category labels;
the network construction module is used for constructing a two-stream convolutional neural network model with embedded spatio-temporal inter-stream attention mechanism modules, wherein the model comprises a data processing layer, a spatial stream branch, a temporal stream branch, spatio-temporal inter-stream attention mechanism modules, a feature fusion layer, a fully connected layer and a classification layer;
the data processing layer is used for preprocessing an input video, the preprocessing comprising: dividing the video into frames and extracting, in temporal order, an image sequence of length u frames from the resulting images, where u is a preset sequence length; performing face detection, cropping and alignment on each image in the sequence, and normalizing each processed image to obtain a facial expression image sequence of length u frames; randomly selecting one image from the facial expression image sequence of length u frames as the single-frame face image corresponding to the input video; computing an optical flow map between every two adjacent images in the facial expression image sequence of length u frames, and stacking the optical flow maps along the channel dimension in temporal order to form the stacked optical flow map corresponding to the input video;
the spatial stream branch comprises a plurality of convolution modules; its input is the single-frame face image output by the data processing layer, and it is used for extracting spatial-domain features of the facial expression;
the temporal stream branch comprises the same number of convolution modules as the spatial stream branch; its input is the stacked optical flow map output by the data processing layer, and it is used for extracting temporal-domain features of the facial expression;
each spatio-temporal inter-stream attention mechanism module is embedded between convolution modules of the spatial stream branch and the temporal stream branch; its input is the output of the preceding convolution module in the two-stream convolutional neural network, and its output is the input of the following convolution module, so that information is exchanged between the spatial-domain features and the temporal-domain features; the spatio-temporal inter-stream attention mechanism module first computes correlation weight matrices between the input spatial-domain feature map and temporal-domain feature map, then computes mapping matrices of the spatial-domain feature map and the temporal-domain feature map, computes a spatial-domain residual feature map and a temporal-domain residual feature map from the correlation weight matrices and the mapping matrices, and finally, using residual connections, adds the spatial-domain residual feature map to the input spatial-domain feature map to obtain the output spatial-domain feature map and adds the temporal-domain residual feature map to the input temporal-domain feature map to obtain the output temporal-domain feature map;
the feature fusion layer is used for performing a global average pooling operation on the spatial-domain feature map output by the spatial stream branch and on the temporal-domain feature map output by the temporal stream branch, and for concatenating the two resulting feature vectors into a single output feature vector;
the fully connected layer connects the feature fusion layer to the classification layer;
the classification layer is used for computing the probability that the facial expression in the input video belongs to each expression category;
the network training module is used for training the constructed network model by using the video samples in the established facial expression video library;
the expression recognition module is used for performing facial expression recognition on a newly input video using the trained network model (a sketch assembling the full model follows this claim).
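For completeness, a sketch of how the modules above could be assembled into the two-stream network of claims 1 and 5, reusing the ConvModule and InterStreamAttention sketches given after claims 2 and 4. The four stage widths, the 112×112 input size and the seven expression classes are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class TwoStreamAttentionNet(nn.Module):
    def __init__(self, u=16, num_classes=7, widths=(64, 128, 256, 512)):
        super().__init__()
        self.spatial = nn.ModuleList()    # spatial stream branch: single-frame face image (3 channels)
        self.temporal = nn.ModuleList()   # temporal stream branch: stacked optical flow (2*(u-1) channels)
        self.attn = nn.ModuleList()       # inter-stream attention modules embedded between stages
        c_s, c_t = 3, 2 * (u - 1)
        for i, w in enumerate(widths):
            self.spatial.append(ConvModule(c_s, m=w))
            self.temporal.append(ConvModule(c_t, m=w))
            if i < len(widths) - 1:
                self.attn.append(InterStreamAttention(w, w))
            c_s = c_t = w
        self.fc = nn.Linear(2 * widths[-1], num_classes)   # fully connected layer

    def forward(self, frame, flow):
        x_s, x_t = frame, flow
        for i, (conv_s, conv_t) in enumerate(zip(self.spatial, self.temporal)):
            x_s, x_t = conv_s(x_s), conv_t(x_t)
            if i < len(self.attn):                          # information exchange between the two streams
                x_s, x_t = self.attn[i](x_s, x_t)
        # feature fusion layer: global average pooling on each stream, then concatenation
        v = torch.cat([x_s.mean(dim=(2, 3)), x_t.mean(dim=(2, 3))], dim=1)
        return self.fc(v).softmax(dim=1)                    # classification layer: per-class probabilities
```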
6. A dynamic expression recognition system based on a spatio-temporal inter-stream attention mechanism, comprising at least one computing device that comprises a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program, when loaded into the processor, implements the dynamic expression recognition method based on a spatio-temporal inter-stream attention mechanism according to any one of claims 1 to 4.
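Finally, a hypothetical usage sketch for the training step and the recognition step (steps 3 and 4 of claim 1), combining the sketches above; the data loader, learning rate and label set are placeholders, not part of the claims.

```python
import torch

model = TwoStreamAttentionNet(u=16, num_classes=7)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = torch.nn.NLLLoss()  # the model outputs probabilities, so train on their logarithm

# step 3: training on the facial expression video library (train_loader is a hypothetical DataLoader
# yielding (single-frame image, stacked optical flow, expression label) batches)
for frame, flow, label in train_loader:
    optimizer.zero_grad()
    probs = model(frame, flow)
    loss = criterion(torch.log(probs + 1e-8), label)
    loss.backward()
    optimizer.step()

# step 4: recognition on a newly input video
model.eval()
single_frame, stacked_flow = preprocess_video("new_clip.mp4")
to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1).unsqueeze(0).float()
with torch.no_grad():
    probs = model(to_tensor(single_frame), to_tensor(stacked_flow))
predicted_class = probs.argmax(dim=1).item()
```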
CN202110061153.7A 2021-01-18 2021-01-18 Dynamic expression recognition method and system based on attention mechanism between space and time streams Active CN112800894B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110061153.7A CN112800894B (en) 2021-01-18 2021-01-18 Dynamic expression recognition method and system based on attention mechanism between space and time streams

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110061153.7A CN112800894B (en) 2021-01-18 2021-01-18 Dynamic expression recognition method and system based on attention mechanism between space and time streams

Publications (2)

Publication Number Publication Date
CN112800894A CN112800894A (en) 2021-05-14
CN112800894B true CN112800894B (en) 2022-08-26

Family

ID=75809973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110061153.7A Active CN112800894B (en) 2021-01-18 2021-01-18 Dynamic expression recognition method and system based on attention mechanism between space and time streams

Country Status (1)

Country Link
CN (1) CN112800894B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255530B (en) * 2021-05-31 2024-03-29 合肥工业大学 Attention-based multichannel data fusion network architecture and data processing method
CN113705328A (en) * 2021-07-06 2021-11-26 合肥工业大学 Depression detection method and system based on facial feature points and facial movement units
CN113627349B (en) * 2021-08-12 2023-12-05 南京信息工程大学 Dynamic facial expression recognition method based on self-attention transformation network
CN113971826B (en) * 2021-09-02 2024-06-21 合肥工业大学 Dynamic emotion recognition method and system for estimating continuous titer and arousal level
CN116021506A (en) * 2021-10-26 2023-04-28 美智纵横科技有限责任公司 Robot control method, apparatus and storage medium
CN114067435A (en) * 2021-11-15 2022-02-18 山东大学 Sleep behavior detection method and system based on pseudo-3D convolutional network and attention mechanism
CN114494981B (en) * 2022-04-07 2022-08-05 之江实验室 Action video classification method and system based on multi-level motion modeling
CN114863520B (en) * 2022-04-25 2023-04-25 陕西师范大学 Video expression recognition method based on C3D-SA
CN115273186A (en) * 2022-07-18 2022-11-01 中国人民警察大学 Depth-forged face video detection method and system based on image feature fusion
CN115381467B (en) * 2022-10-31 2023-03-10 浙江浙大西投脑机智能科技有限公司 Attention mechanism-based time-frequency information dynamic fusion decoding method and device
CN115457643B (en) * 2022-11-09 2023-04-07 暨南大学 Fair facial expression recognition method based on increment technology and attention mechanism
CN116071809B (en) * 2023-03-22 2023-07-14 鹏城实验室 Face space-time representation generation method based on multi-class representation space-time interaction
CN116434343B (en) * 2023-04-25 2023-09-19 天津大学 Video motion recognition method based on high-low frequency double branches

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596069A (en) * 2018-04-18 2018-09-28 南京邮电大学 Neonatal pain expression recognition method and system based on depth 3D residual error networks
CN108596039A (en) * 2018-03-29 2018-09-28 南京邮电大学 A kind of bimodal emotion recognition method and system based on 3D convolutional neural networks
CN111401117A (en) * 2019-08-14 2020-07-10 南京邮电大学 Neonate pain expression recognition method based on double-current convolutional neural network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596039A (en) * 2018-03-29 2018-09-28 南京邮电大学 A kind of bimodal emotion recognition method and system based on 3D convolutional neural networks
CN108596069A (en) * 2018-04-18 2018-09-28 南京邮电大学 Neonatal pain expression recognition method and system based on depth 3D residual error networks
CN111401117A (en) * 2019-08-14 2020-07-10 南京邮电大学 Neonate pain expression recognition method based on double-current convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Video facial expression recognition based on attention mechanism; He Xiaoyun et al.; Information Technology (信息技术); 2020-02-20 (No. 02); full text *

Also Published As

Publication number Publication date
CN112800894A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
CN112800894B (en) Dynamic expression recognition method and system based on attention mechanism between space and time streams
CN112541503B (en) Real-time semantic segmentation method based on context attention mechanism and information fusion
Zhang et al. Multimodal learning for facial expression recognition
CN107527007B (en) Method for detecting object of interest in vehicle image processing system
CN112860888B (en) Attention mechanism-based bimodal emotion analysis method
CN112580617B (en) Expression recognition method and device in natural scene
CN111160350A (en) Portrait segmentation method, model training method, device, medium and electronic equipment
CN111428664B (en) Computer vision real-time multi-person gesture estimation method based on deep learning technology
CN115131880B (en) Multi-scale attention fusion double-supervision human face living body detection method
Podder et al. Time efficient real time facial expression recognition with CNN and transfer learning
Sharma et al. Deepfakes Classification of Faces Using Convolutional Neural Networks.
Singh et al. Feature based method for human facial emotion detection using optical flow based analysis
Kumar et al. Facial emotion recognition and detection using cnn
Gowada et al. Unethical human action recognition using deep learning based hybrid model for video forensics
Hong et al. Characterizing subtle facial movements via Riemannian manifold
Ruan et al. Facial expression recognition in facial occlusion scenarios: A path selection multi-network
CN114882405B (en) Video saliency detection method based on space-time double-flow pyramid network architecture
CN115965898A (en) Video emotion classification method combining multi-stage branch convolution and expansion interactive sampling
CN113221824B (en) Human body posture recognition method based on individual model generation
CN113887373B (en) Attitude identification method and system based on urban intelligent sports parallel fusion network
CN113159071B (en) Cross-modal image-text association anomaly detection method
CN114782995A (en) Human interaction behavior detection method based on self-attention mechanism
Aouayeb et al. Micro-expression recognition from local facial regions
Deshpande et al. Abnormal Activity Recognition with Residual Attention-based ConvLSTM Architecture for Video Surveillance.
CN115546885A (en) Motion recognition method and system based on enhanced space-time characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant