CN112800894B - Dynamic expression recognition method and system based on attention mechanism between space and time streams - Google Patents

Dynamic expression recognition method and system based on attention mechanism between space and time streams

Info

Publication number
CN112800894B
Authority
CN
China
Prior art keywords
convolution
space
time
layer
feature map
Prior art date
Legal status
Active
Application number
CN202110061153.7A
Other languages
Chinese (zh)
Other versions
CN112800894A (en)
Inventor
卢官明
陈浩侠
卢峻禾
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202110061153.7A
Publication of CN112800894A
Application granted
Publication of CN112800894B
Legal status: Active

Classifications

    • G06V 40/174 Facial expression recognition
    • G06V 40/176 Dynamic expression
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/253 Fusion techniques of extracted features
    • G06N 3/04 Neural networks; Architecture, e.g. interconnection topology
    • G06N 3/08 Neural networks; Learning methods
    • G06V 10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 40/168 Feature extraction; Face representation


Abstract

The invention discloses a dynamic expression recognition method and system based on an attention mechanism between the spatial and temporal streams. First, facial expression video clips are collected and a facial expression video library with expression category labels is established. Next, a two-stream convolutional neural network model with embedded inter-stream attention modules is constructed; the model comprises a data processing layer, a spatial stream branch, a temporal stream branch, inter-stream attention modules, a feature fusion layer, a fully connected layer and a classification layer. The model is then trained with the video samples in the facial expression video library, and finally the trained model is used to recognize facial expressions in newly input videos. By embedding an inter-stream attention module in the two-stream convolutional neural network, the method enables information interaction between spatial and temporal features, captures the dynamic correlation between them, obtains highly discriminative features, and improves the accuracy and robustness of facial expression recognition.

Description

Dynamic expression recognition method and system based on attention mechanism between space and time streams
Technical Field
The invention belongs to the field of machine learning and pattern recognition, relates to dynamic expression recognition methods and systems, and in particular to a dynamic expression recognition method and system based on an attention mechanism between spatial and temporal streams.
Background
With the rapid development of computer technology and artificial intelligence, the way humans interact with machines keeps changing, and people increasingly expect to communicate with computers directly. Knowing the emotional state of the other party is essential in human communication, and facial expressions convey roughly 55% of the information in human emotional expression. Recognizing facial expressions with a computer has therefore become a very active research topic.
Facial expression recognition is the process of extracting facial expression features from an image or a video and assigning an expression category label according to the feature information. It is an interdisciplinary subject spanning neuroscience, psychology, computer science and other fields, and its potential applications include any field with a need for affective human-computer interaction, such as distance education, safe driving and service robots. For example, some current smartphones use smile detection to trigger automatic photographing, and high-end automobiles use cameras to monitor the driver's emotional state and take corresponding safety measures. Deepening research on facial expression recognition and improving the recognition capability of computers will greatly improve the quality of human life.
Currently, a large part of the research in this field is carried out on static face images. Such methods capture only the spatial information of facial expressions and ignore how expressions change over time, although the temporal information may contain highly expressive cues. Compared with expression recognition methods based on static images, a two-stream convolutional neural network can recognize dynamic expressions: it uses a spatial stream branch and a temporal stream branch to extract features from a single face frame taken from the video and from stacked optical flow maps representing the expression change, respectively, so that the spatial and temporal features of the facial expression are obtained simultaneously and complement each other.
Similar to face recognition, recognizing facial expressions in an uncontrolled natural environment is strongly affected by occlusion and head pose. To alleviate the influence of these factors, exploiting local facial information is a widely recognized and effective strategy. Research on the human visual system and cognition shows that humans process image data by giving priority to salient regions and selectively ignoring non-salient ones. Introducing an attention mechanism into the facial expression recognition task allows a convolutional neural network to adaptively assign higher weights to salient regions of the face image, so that these regions have a greater influence on the higher-level deep features learned in the next stage while non-salient regions are suppressed, further improving recognition accuracy. However, for two-stream convolutional neural networks, the prior art only introduces attention mechanisms inside the spatial stream branch and the temporal stream branch separately. This realizes the most basic function of attention, but cannot fully exploit the complementarity of the two branches to achieve information interaction between them.
The Chinese patent application "A facial expression recognition method based on an attention mechanism module" (application No. CN202010783432.X, publication No. CN111967359A) feeds a cropped face image into a network model based on an attention mechanism module to obtain attention features of the face image, applies two convolutional layers for feature extraction to obtain a feature map, reduces the feature dimension with a global average pooling layer, and finally classifies the reduced features with a Softmax classifier to output the recognition result. The problem with this method is that it can only recognize a single static face image and ignores the temporal information of the facial expression, which makes it difficult to achieve the best recognition performance.
The Chinese patent application "A facial expression recognition method based on a spatio-temporal fusion network" (application No. CN202010221398.7, publication No. CN111709266A) preprocesses the input video sequence, extracts spatial features of the facial expression with a DMF module and temporal features with an LTCNN module, and finally fuses the spatio-temporal expression features learned by the two modules with a fine-tuning-based fusion strategy. The problem with this method is that the extraction of spatial features by the DMF module and of temporal features by the LTCNN module are two independent processes with no sufficient information interaction, which may affect the final recognition performance.
Disclosure of Invention
Purpose of the invention: in view of the problem that, in a two-stream convolutional neural network, the spatial stream branch and the temporal stream branch extract the spatial and temporal features of facial expressions in two relatively independent processes, the invention aims to provide a dynamic expression recognition method and system based on an attention mechanism between the spatial and temporal streams.
Technical solution: to achieve the above purpose, the invention adopts the following technical solution:
a dynamic expression recognition method based on a spatiotemporal attention mechanism comprises the following steps:
step 1: collecting facial expression video clips, and establishing a facial expression video library containing expression category labels;
step 2: constructing a two-stream convolutional neural network model with embedded inter-stream attention modules, wherein the model comprises a data processing layer, a spatial stream branch, a temporal stream branch, inter-stream attention modules, a feature fusion layer, a fully connected layer and a classification layer (a structural sketch showing how these components are connected is given after step 4 below);
the data processing layer preprocesses the input video as follows: split the video into frames and extract an image sequence of length u frames from the resulting images in temporal order, where u is the set sequence length; perform face detection, cropping and alignment on each image in the sequence and normalize each processed image, obtaining a facial expression image sequence of length u frames; randomly select one image from this sequence as the single face frame corresponding to the input video; compute an optical flow map between every two adjacent images in the sequence and stack the optical flow maps along the channel dimension in temporal order as the stacked optical flow maps corresponding to the input video;
the spatial stream branch comprises a plurality of convolution modules; its input is the single face frame output by the data processing layer, and it extracts the spatial features of the facial expression;
the temporal stream branch comprises the same number of convolution modules as the spatial stream branch; its input is the stacked optical flow maps output by the data processing layer, and it extracts the temporal features of the facial expression;
the inter-stream attention modules are embedded between the convolution modules of the spatial and temporal stream branches; the input of each module is the output of the preceding convolution modules in the two-stream network, and its output is the input of the following convolution modules, enabling information interaction between spatial and temporal features. An inter-stream attention module first computes the correlation weight matrices of the input spatial and temporal feature maps, then computes a mapping matrix for each of the two feature maps, computes a spatial residual feature map and a temporal residual feature map from the correlation weight matrices and the mapping matrices, and finally uses residual connections: the spatial residual feature map is added to the input spatial feature map to give the output spatial feature map, and the temporal residual feature map is added to the input temporal feature map to give the output temporal feature map;
the feature fusion layer applies global average pooling separately to the spatial feature map output by the spatial stream branch and the temporal feature map output by the temporal stream branch, concatenates the two resulting feature vectors, and outputs a single feature vector;
the fully connected layer fully connects the feature fusion layer to the classification layer;
the classification layer computes the probability that the facial expression in the input video belongs to each expression category;
step 3: train the constructed network model using the video samples in the established facial expression video library;
step 4: perform facial expression recognition on a newly input video using the trained network model.
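As a reading aid, the following is a minimal PyTorch-style sketch of how the components of step 2 could be wired together. The class and argument names (TwoStreamAttentionNet, the lists of convolution modules, InterStreamAttention, FusionHead) are illustrative assumptions defined in later sketches in this document, not names used in the patent.

```python
import torch
import torch.nn as nn

class TwoStreamAttentionNet(nn.Module):
    """Sketch of the two-stream CNN with inter-stream attention modules."""

    def __init__(self, conv_blocks_s, conv_blocks_t, attn_modules, fusion_head):
        super().__init__()
        self.blocks_s = nn.ModuleList(conv_blocks_s)   # spatial-stream convolution modules
        self.blocks_t = nn.ModuleList(conv_blocks_t)   # temporal-stream convolution modules
        self.attn = nn.ModuleList(attn_modules)        # one inter-stream attention module
                                                       # between consecutive convolution modules
        self.fusion_head = fusion_head                 # GAP + concat + FC + Softmax

    def forward(self, frame, stacked_flow):
        x_s, x_t = frame, stacked_flow
        for i, (bs, bt) in enumerate(zip(self.blocks_s, self.blocks_t)):
            x_s, x_t = bs(x_s), bt(x_t)
            if i < len(self.attn):                     # no attention after the last module
                x_s, x_t = self.attn[i](x_s, x_t)
        return self.fusion_head(x_s, x_t)              # expression class probabilities
```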
Further, the spatial stream branch comprises a plurality of sequentially connected convolution modules. Each convolution module comprises one or more convolutional layers and a pooling layer; each convolutional layer includes a ReLU nonlinear activation function and applies m_1 convolution kernels of size k_1 × k_1 to the output of the previous layer, where m_1 is chosen from {32, 64, 128, 256, 512} and k_1 from {3, 5, 7}; the pooling layer downsamples the output of the last convolutional layer with a k_2 × k_2 pooling kernel, where k_2 is chosen from {1, 2, 3}.
Further, the temporal stream branch comprises the same number of sequentially connected convolution modules as the spatial stream branch. Each convolution module comprises one or more convolutional layers and a pooling layer; each convolutional layer includes a ReLU nonlinear activation function and applies m_2 convolution kernels of size k_3 × k_3 to the output of the previous layer, where m_2 is chosen from {32, 64, 128, 256, 512} and k_3 from {3, 5, 7}; the pooling layer downsamples the output of the last convolutional layer with a k_4 × k_4 pooling kernel, where k_4 is chosen from {1, 2, 3}.
Further, in each inter-stream attention module, let X_S and X_T denote the spatial and temporal feature maps input to the module, with sizes H_S × W_S × C_S and H_T × W_T × C_T respectively. The module performs the following computation steps (a code sketch of the module follows this list):
(1) Compute the correlation weight matrices F_1 and F_2 of the spatial feature map X_S and the temporal feature map X_T: apply C_O 1 × 1 convolution kernels to X_S and to X_T, obtaining two feature maps of sizes H_S × W_S × C_O and H_T × W_T × C_O; reshape them into two matrices of sizes (H_S·W_S) × C_O and C_O × (H_T·W_T) respectively, and multiply these matrices to obtain a matrix F of size (H_S·W_S) × (H_T·W_T); apply a Softmax to each row vector of F to obtain the correlation weight matrix F_1, and a Softmax to each column vector of F to obtain the correlation weight matrix F_2.
(2) Compute the mapping matrix G_S of the spatial feature map X_S: apply C_O 1 × 1 convolution kernels to X_S, obtaining a feature map of size H_S × W_S × C_O, and reshape it into a matrix of size C_O × (H_S·W_S), which is G_S.
(3) Compute the mapping matrix G_T of the temporal feature map X_T: apply C_O 1 × 1 convolution kernels to X_T, obtaining a feature map of size H_T × W_T × C_O, and reshape it into a matrix of size (H_T·W_T) × C_O, which is G_T.
(4) Compute the spatial residual feature map Y_S: multiply F_1 by G_T to obtain a matrix of size (H_S·W_S) × C_O, reshape it to H_S × W_S × C_O, and apply C_S 1 × 1 convolution kernels to obtain the spatial residual feature map Y_S of size H_S × W_S × C_S.
(5) Compute the temporal residual feature map Y_T: multiply G_S by F_2 to obtain a matrix of size C_O × (H_T·W_T), reshape it to H_T × W_T × C_O, and apply C_T 1 × 1 convolution kernels to obtain the temporal residual feature map Y_T of size H_T × W_T × C_T.
(6) Compute the spatial feature map Z_S output by the module: using a residual connection, add the spatial residual feature map Y_S to the input spatial feature map X_S to obtain Z_S.
(7) Compute the temporal feature map Z_T output by the module: using a residual connection, add the temporal residual feature map Y_T to the input temporal feature map X_T to obtain Z_T.
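These steps map directly onto a few 1 × 1 convolutions, reshapes and matrix products. The PyTorch sketch below is one possible realization of steps (1)-(7); the class name InterStreamAttention and the internal layer names are assumptions made for illustration, not taken from the patent.

```python
import torch
import torch.nn as nn

class InterStreamAttention(nn.Module):
    """Sketch of one inter-stream attention module (steps (1)-(7) above)."""

    def __init__(self, c_s, c_t, c_o):
        super().__init__()
        self.theta_s = nn.Conv2d(c_s, c_o, kernel_size=1)  # 1x1 conv on X_S for the correlation matrix
        self.theta_t = nn.Conv2d(c_t, c_o, kernel_size=1)  # 1x1 conv on X_T for the correlation matrix
        self.phi_s = nn.Conv2d(c_s, c_o, kernel_size=1)    # 1x1 conv giving the mapping matrix G_S
        self.phi_t = nn.Conv2d(c_t, c_o, kernel_size=1)    # 1x1 conv giving the mapping matrix G_T
        self.out_s = nn.Conv2d(c_o, c_s, kernel_size=1)    # restores C_S channels for Y_S
        self.out_t = nn.Conv2d(c_o, c_t, kernel_size=1)    # restores C_T channels for Y_T

    def forward(self, x_s, x_t):
        b, _, h_s, w_s = x_s.shape
        _, _, h_t, w_t = x_t.shape
        # (1) correlation matrix F and its row-/column-wise Softmax F_1, F_2
        q = self.theta_s(x_s).flatten(2).transpose(1, 2)    # B x (H_S*W_S) x C_O
        k = self.theta_t(x_t).flatten(2)                    # B x C_O x (H_T*W_T)
        f = torch.bmm(q, k)                                 # B x (H_S*W_S) x (H_T*W_T)
        f1 = torch.softmax(f, dim=2)                        # Softmax over each row of F
        f2 = torch.softmax(f, dim=1)                        # Softmax over each column of F
        # (2)(3) mapping matrices G_S and G_T
        g_s = self.phi_s(x_s).flatten(2)                    # B x C_O x (H_S*W_S)
        g_t = self.phi_t(x_t).flatten(2).transpose(1, 2)    # B x (H_T*W_T) x C_O
        # (4) spatial residual Y_S = conv1x1(reshape(F_1 @ G_T))
        y_s = self.out_s(torch.bmm(f1, g_t).transpose(1, 2).reshape(b, -1, h_s, w_s))
        # (5) temporal residual Y_T = conv1x1(reshape(G_S @ F_2))
        y_t = self.out_t(torch.bmm(g_s, f2).reshape(b, -1, h_t, w_t))
        # (6)(7) residual connections give the module outputs Z_S and Z_T
        return x_s + y_s, x_t + y_t
```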
Based on the same inventive concept, the invention discloses a dynamic expression recognition system based on an attention mechanism between spatial and temporal streams, comprising the following modules:
a data preprocessing module, used to collect facial expression video clips and establish a facial expression video library containing expression category labels;
a network construction module, used to construct a two-stream convolutional neural network model with embedded inter-stream attention modules, the model comprising a data processing layer, a spatial stream branch, a temporal stream branch, inter-stream attention modules, a feature fusion layer, a fully connected layer and a classification layer;
the data processing layer preprocesses the input video as follows: split the video into frames and extract an image sequence of length u frames from the resulting images in temporal order, where u is the set sequence length; perform face detection, cropping and alignment on each image in the sequence and normalize each processed image, obtaining a facial expression image sequence of length u frames; randomly select one image from this sequence as the single face frame corresponding to the input video; compute an optical flow map between every two adjacent images in the sequence and stack the optical flow maps along the channel dimension in temporal order as the stacked optical flow maps corresponding to the input video;
the spatial stream branch comprises a plurality of convolution modules; its input is the single face frame output by the data processing layer, and it extracts the spatial features of the facial expression;
the temporal stream branch comprises the same number of convolution modules as the spatial stream branch; its input is the stacked optical flow maps output by the data processing layer, and it extracts the temporal features of the facial expression;
the inter-stream attention modules are embedded between the convolution modules of the spatial and temporal stream branches; the input of each module is the output of the preceding convolution modules in the two-stream network, and its output is the input of the following convolution modules, enabling information interaction between spatial and temporal features; each module first computes the correlation weight matrices of the input spatial and temporal feature maps, then computes a mapping matrix for each of the two feature maps, computes a spatial residual feature map and a temporal residual feature map from the correlation weight matrices and the mapping matrices, and finally uses residual connections to add the spatial residual feature map to the input spatial feature map, giving the output spatial feature map, and to add the temporal residual feature map to the input temporal feature map, giving the output temporal feature map;
the feature fusion layer applies global average pooling separately to the spatial feature map output by the spatial stream branch and the temporal feature map output by the temporal stream branch, concatenates the two resulting feature vectors, and outputs a single feature vector;
the fully connected layer fully connects the feature fusion layer to the classification layer;
the classification layer computes the probability that the facial expression in the input video belongs to each expression category;
a network training module, used to train the constructed network model with the video samples in the established facial expression video library;
an expression recognition module, used to perform facial expression recognition on a newly input video with the trained network model.
Based on the same inventive concept, the invention discloses a dynamic expression recognition system based on an attention mechanism between spatial and temporal streams, comprising at least one computing device; the computing device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, and the computer program, when loaded into the processor, implements the above dynamic expression recognition method based on an attention mechanism between spatial and temporal streams.
Beneficial effects: compared with the prior art, the invention has the following technical effects:
(1) The invention builds a deep learning network model, so no complicated manual feature extraction or feature dimension reduction is required; the parameters are adjusted adaptively by training the network model, the features reflecting the facial expressions in the video samples are extracted automatically, the extracted features better represent the changes of facial expressions, and the model has a stronger fitting capability than traditional facial expression recognition.
(2) The invention adopts a two-stream convolutional neural network, which uses a spatial stream branch and a temporal stream branch to extract features from a single face frame taken from the video and from stacked optical flow maps representing the expression change, respectively. It thus obtains the spatial and temporal features of the facial expression simultaneously, extends feature extraction from static images to video, captures the temporal information of the facial expression in addition to its spatial information, and has stronger representation and generalization capability.
(3) By embedding inter-stream attention modules between the convolution modules of the two-stream network, the invention enables information interaction between the spatial and temporal features at every stage of the model, so that the model learns the information that relates the spatial and temporal features and ignores the information that does not. This captures the dynamic correlation between spatial and temporal features (for example, when a person smiles, the corners of the mouth rise and the corners of the eyes are drawn down, which causes two optical flows in the optical flow map to move in two directions), further exploits the complementarity of the spatial and temporal stream branches, and yields more discriminative and representative features.
(4) By embedding the inter-stream attention modules between the two branches of the two-stream convolutional neural network, the network can adaptively assign higher weights to salient regions in the face image or the optical flow map, so that these regions have a greater influence on the higher-level deep features learned in the next stage while local features of non-salient regions are suppressed. A connection is thus established between the spatial and temporal stream branches, the two branches learn local key features from each other, highly discriminative features are obtained, and the accuracy and robustness of facial expression recognition are improved.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
fig. 2 is a diagram of a network model architecture used in an embodiment of the present invention.
Detailed Description
The following description will explain embodiments of the present invention in further detail with reference to the accompanying drawings.
As shown in fig. 1, the dynamic expression recognition method based on an attention mechanism between spatial and temporal streams provided in an embodiment of the present invention mainly comprises the following steps:
Step 1: collect facial expression video clips and establish a facial expression video library containing expression category labels.
In this embodiment, the AFEW facial expression video library is used. Its video samples come from different movies and total 1809 clips; the face in each video sample is labeled with one of seven expression categories: anger, disgust, fear, happiness, sadness, surprise and neutral. In practice, other facial expression video libraries may be used instead, or facial expression videos may be collected independently to build a video library with expression category labels.
Step 2: a two-stream convolutional neural network model with embedded inter-stream attention modules is constructed; the model comprises a data processing layer, a spatial stream branch, a temporal stream branch, inter-stream attention modules, a feature fusion layer, a fully connected layer and a classification layer.
The data processing layer preprocesses the input video as follows: split the video into frames and extract an image sequence of length u frames from the resulting images in temporal order, where u is the set sequence length; perform face detection, cropping and alignment on each image in the sequence and normalize each processed image, obtaining a facial expression image sequence of length u frames; randomly select one image from this sequence as the single face frame corresponding to the input video; compute an optical flow map between every two adjacent images in the sequence and stack the resulting optical flow maps along the channel dimension in temporal order as the stacked optical flow maps corresponding to the input video.
The spatial stream branch comprises a plurality of convolution modules; its input is the single face frame output by the data processing layer, and it extracts the spatial features of the facial expression. Each convolution module comprises one or more convolutional layers and a pooling layer; each convolutional layer includes a ReLU nonlinear activation function and applies m_1 convolution kernels of size k_1 × k_1 to the output of the previous layer, where m_1 is chosen from {32, 64, 128, 256, 512} and k_1 from {3, 5, 7}; the pooling layer downsamples the output of the last convolutional layer with a k_2 × k_2 pooling kernel, where k_2 is chosen from {1, 2, 3}.
The temporal stream branch comprises the same number of convolution modules as the spatial stream branch; its input is the stacked optical flow maps output by the data processing layer, and it extracts the temporal features of the facial expression. Each convolution module comprises one or more convolutional layers and a pooling layer; each convolutional layer includes a ReLU nonlinear activation function and applies m_2 convolution kernels of size k_3 × k_3 to the output of the previous layer, where m_2 is chosen from {32, 64, 128, 256, 512} and k_3 from {3, 5, 7}; the pooling layer downsamples the output of the last convolutional layer with a k_4 × k_4 pooling kernel, where k_4 is chosen from {1, 2, 3}.
The inter-stream attention modules are embedded between the convolution modules of the spatial and temporal stream branches; the input of each module is the output of the preceding convolution modules in the two-stream network, and its output is the input of the following convolution modules, enabling information interaction between spatial and temporal features. Each module first computes the correlation weight matrices of the input spatial and temporal feature maps, then computes a mapping matrix for each of the two feature maps, computes a spatial residual feature map and a temporal residual feature map from the correlation weight matrices and the mapping matrices, and finally uses residual connections to add the spatial residual feature map to the input spatial feature map, giving the output spatial feature map, and to add the temporal residual feature map to the input temporal feature map, giving the output temporal feature map.
The feature fusion layer applies global average pooling separately to the spatial feature map output by the spatial stream branch and the temporal feature map output by the temporal stream branch, concatenates the two resulting feature vectors, and outputs a single feature vector.
The fully connected layer fully connects the feature fusion layer to the classification layer.
The classification layer computes the probability that the facial expression in the input video belongs to each expression category.
The dynamic expression recognition network model constructed in this embodiment, based on an attention mechanism between spatial and temporal streams, has the specific structure shown in fig. 2:
(1) The data processing layer first splits the input video into frames with FFmpeg and extracts an image sequence of length 9 frames from the resulting images in temporal order; it then applies the Dlib face detection algorithm to each image in the sequence for face detection, cropping and alignment, and normalizes each processed image to 224 × 224 pixels, obtaining a facial expression image sequence of length 9 frames; it then randomly selects one image from this sequence as the single face frame corresponding to the input video; finally, it computes the optical flow map between every two adjacent images in the sequence with the TV-L1 algorithm and stacks the resulting optical flow maps along the channel dimension in temporal order as the stacked optical flow maps corresponding to the input video.
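For illustration, a rough sketch of this preprocessing with OpenCV and Dlib is shown below. It omits the face-alignment step, works on grayscale crops, and assumes the opencv-contrib module provides the TV-L1 optical flow implementation; it should be read as an approximation of the described pipeline, not the patentee's code.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()            # assumed face detector (Dlib)

def preprocess(video_path, u=9, size=224):
    """Return one face image and the stacked optical flow maps for a clip."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    idx = np.linspace(0, len(frames) - 1, u).astype(int)   # u frames spread over the clip
    faces = []
    for i in idx:
        gray = cv2.cvtColor(frames[i], cv2.COLOR_BGR2GRAY)
        det = detector(gray, 1)
        if det:                                            # crop the first detected face
            r = det[0]
            gray = gray[max(r.top(), 0):r.bottom(), max(r.left(), 0):r.right()]
        faces.append(cv2.resize(gray, (size, size)))
    # optical flow between adjacent frames, stacked along the channel dimension
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()        # assumes opencv-contrib is installed
    flows = [tvl1.calc(faces[i], faces[i + 1], None) for i in range(u - 1)]
    stacked_flow = np.concatenate(flows, axis=2)           # H x W x 2*(u-1)
    single_face = faces[np.random.randint(u)]              # randomly chosen single face frame
    return single_face, stacked_flow
```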
(2) The spatial stream branch and the temporal stream branch adopt two convolutional neural networks with the same structure but separate parameters, each comprising 5 sequentially connected convolution modules (a code sketch of these modules follows the list):
Convolution module A and convolution module A': each comprises 2 convolutional layers and 1 pooling layer; each of the 2 convolutional layers applies 64 convolution kernels of size 3 × 3 to the feature map with stride 1 and zero-padding 1, followed by a ReLU nonlinear mapping, and outputs a feature map of size 224 × 224 × 64; the pooling layer uses a 2 × 2 max-pooling kernel with stride 2 and outputs a feature map of size 112 × 112 × 64.
Convolution module B and convolution module B': each comprises 2 convolutional layers and 1 pooling layer; each of the 2 convolutional layers applies 128 convolution kernels of size 3 × 3 to the feature map with stride 1 and zero-padding 1, followed by a ReLU nonlinear mapping, and outputs a feature map of size 112 × 112 × 128; the pooling layer uses a 2 × 2 max-pooling kernel with stride 2 and outputs a feature map of size 56 × 56 × 128.
Convolution module C and convolution module C': each comprises 3 convolutional layers and 1 pooling layer; each of the 3 convolutional layers applies 256 convolution kernels of size 3 × 3 to the feature map with stride 1 and zero-padding 1, followed by a ReLU nonlinear mapping, and outputs a feature map of size 56 × 56 × 256; the pooling layer uses a 2 × 2 max-pooling kernel with stride 2 and outputs a feature map of size 28 × 28 × 256.
Convolution module D and convolution module D': each comprises 3 convolutional layers and 1 pooling layer; each of the 3 convolutional layers applies 512 convolution kernels of size 3 × 3 to the feature map with stride 1 and zero-padding 1, followed by a ReLU nonlinear mapping, and outputs a feature map of size 28 × 28 × 512; the pooling layer uses a 2 × 2 max-pooling kernel with stride 2 and outputs a feature map of size 14 × 14 × 512.
Convolution module E and convolution module E': each comprises 3 convolutional layers and 1 pooling layer; each of the 3 convolutional layers applies 512 convolution kernels of size 3 × 3 to the feature map with stride 1 and zero-padding 1, followed by a ReLU nonlinear mapping, and outputs a feature map of size 14 × 14 × 512; the pooling layer uses a 2 × 2 max-pooling kernel with stride 2 and outputs a feature map of size 7 × 7 × 512, which is the output feature map of the corresponding branch.
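Both branches thus follow the convolutional part of VGG-16. A minimal PyTorch sketch of these modules is given below; the helper name conv_block, the 3-channel input assumed for the face image and the 16 flow channels assumed for u = 9 frames are illustrative choices, not stated in the patent.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs):
    """One convolution module: n_convs (3x3 conv, stride 1, padding 1) + ReLU layers,
    followed by 2x2 max pooling with stride 2."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# Modules A-E of the spatial branch (input assumed to be a 3-channel face image)
spatial_branch = [conv_block(3, 64, 2),     # A: 224x224x3  -> 112x112x64
                  conv_block(64, 128, 2),   # B:            -> 56x56x128
                  conv_block(128, 256, 3),  # C:            -> 28x28x256
                  conv_block(256, 512, 3),  # D:            -> 14x14x512
                  conv_block(512, 512, 3)]  # E:            -> 7x7x512

# Modules A'-E' of the temporal branch; the first module takes the stacked optical
# flow maps, i.e. 2*(u-1) = 16 channels for u = 9 frames
temporal_branch = [conv_block(16, 64, 2), conv_block(64, 128, 2), conv_block(128, 256, 3),
                   conv_block(256, 512, 3), conv_block(512, 512, 3)]
```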
(3) The inter-stream attention modules are embedded between the convolution modules of the spatial and temporal stream branches; since each branch has 5 convolution modules, 4 inter-stream attention modules are embedded in total. Taking inter-stream attention module A as an example, the implementation details are as follows:
The inputs of inter-stream attention module A are the spatial feature map X_S output by convolution module A and the temporal feature map X_T output by convolution module A'; X_S and X_T both have size 112 × 112 × 64. The operation steps of the module are:
Compute the correlation weight matrices F_1 and F_2 of the spatial feature map X_S and the temporal feature map X_T: apply 32 1 × 1 convolution kernels to X_S and to X_T, obtaining two feature maps of size 112 × 112 × 32; reshape them into two matrices of sizes 112² × 32 and 32 × 112² and multiply these matrices to obtain a matrix F of size 112² × 112²; apply a Softmax to each row vector of F to obtain the correlation weight matrix F_1, and a Softmax to each column vector of F to obtain the correlation weight matrix F_2.
Compute the mapping matrix G_S of the spatial feature map X_S: apply 32 1 × 1 convolution kernels to X_S, obtaining a feature map of size 112 × 112 × 32, and reshape it into a matrix of size 32 × 112², which is G_S.
Compute the mapping matrix G_T of the temporal feature map X_T: apply 32 1 × 1 convolution kernels to X_T, obtaining a feature map of size 112 × 112 × 32, and reshape it into a matrix of size 112² × 32, which is G_T.
Compute the spatial residual feature map Y_S: multiply F_1 by G_T to obtain a matrix of size 112² × 32, reshape it to 112 × 112 × 32, and apply 64 1 × 1 convolution kernels to obtain the spatial residual feature map Y_S of size 112 × 112 × 64.
Compute the temporal residual feature map Y_T: multiply G_S by F_2 to obtain a matrix of size 32 × 112², reshape it to 112 × 112 × 32, and apply 64 1 × 1 convolution kernels to obtain the temporal residual feature map Y_T of size 112 × 112 × 64.
Compute the spatial feature map Z_S output by the module: using a residual connection, add the spatial residual feature map Y_S to the input spatial feature map X_S to obtain Z_S, of size 112 × 112 × 64, which is the input of convolution module B.
Compute the temporal feature map Z_T output by the module: using a residual connection, add the temporal residual feature map Y_T to the input temporal feature map X_T to obtain Z_T, of size 112 × 112 × 64, which is the input of convolution module B'.
Similarly, the inputs of inter-stream attention module B are the feature maps output by convolution modules B and B', of size 56 × 56 × 128; the inputs of inter-stream attention module C are the feature maps output by convolution modules C and C', of size 28 × 28 × 256; and the inputs of inter-stream attention module D are the feature maps output by convolution modules D and D', of size 14 × 14 × 512. The other implementation details are similar to those described above.
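Continuing the InterStreamAttention sketch given earlier, the four modules A-D could be instantiated for these channel counts; choosing the inner dimension as half the channel count matches the 32 channels used above for module A but is otherwise an assumption.

```python
# Inter-stream attention modules A-D, one per pair of convolution modules A-D / A'-D'
attn_modules = [InterStreamAttention(c_s=c, c_t=c, c_o=c // 2)
                for c in (64, 128, 256, 512)]
```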
(4) The feature fusion layer takes as input the spatial feature map output by the spatial stream branch and the temporal feature map output by the temporal stream branch, both of size 7 × 7 × 512; it applies global average pooling to each of the two feature maps to obtain two 512-dimensional feature vectors, concatenates them, and outputs a 1024-dimensional feature vector.
(5) The fully connected layer comprises 256 neurons and fully connects the feature fusion layer to the classification layer.
(6) The classification layer adopts a Softmax classifier and comprises 7 neurons; it outputs a 7-dimensional vector whose components are the probabilities that the facial expression in the input video belongs to each expression category, and the expression category corresponding to the largest component is the label assigned to the input video by the network model.
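A minimal PyTorch sketch of items (4)-(6), again with an assumed class name, is:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Feature fusion layer, fully connected layer and Softmax classification layer."""

    def __init__(self, c_s=512, c_t=512, hidden=256, n_classes=7):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)             # global average pooling
        self.fc = nn.Linear(c_s + c_t, hidden)         # 1024 -> 256
        self.cls = nn.Linear(hidden, n_classes)        # 256 -> 7 expression categories

    def forward(self, x_s, x_t):
        v = torch.cat([self.gap(x_s).flatten(1),       # 512-d spatial feature vector
                       self.gap(x_t).flatten(1)], 1)   # 512-d temporal feature vector
        return torch.softmax(self.cls(self.fc(v)), dim=1)   # class probabilities
```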
Step 3: train the constructed network model using the video samples in the established facial expression video library.
Step 4: perform facial expression recognition on a newly input video using the trained network model.
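The patent does not detail the training procedure beyond the use of the labeled video library; a conventional supervised setup, reusing the sketches above, might look roughly like this (the loss, optimizer and hyper-parameters are assumptions):

```python
import torch

model = TwoStreamAttentionNet(spatial_branch, temporal_branch, attn_modules, FusionHead())
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(frame, stacked_flow, label):
    probs = model(frame, stacked_flow)
    # NLL on the softmax output (equivalent to cross-entropy on the logits)
    loss = torch.nn.functional.nll_loss(torch.log(probs + 1e-8), label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def predict(frame, stacked_flow):
    with torch.no_grad():
        return model(frame, stacked_flow).argmax(dim=1)   # index of the predicted expression
```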
Based on the same inventive concept, an embodiment of the invention discloses a dynamic expression recognition system based on an attention mechanism between spatial and temporal streams, comprising: a data preprocessing module, used to collect facial expression video clips and establish a facial expression video library containing expression category labels; a network training module, used to train the constructed network model with the video samples in the established facial expression video library; and an expression recognition module, used to perform facial expression recognition on a newly input video with the trained network model.
The constructed two-stream convolutional neural network model with embedded inter-stream attention modules comprises a data processing layer, a spatial stream branch, a temporal stream branch, inter-stream attention modules, a feature fusion layer, a fully connected layer and a classification layer. Specifically: the data processing layer preprocesses the input video; the spatial stream branch comprises a plurality of convolution modules, its input is the single face frame output by the data processing layer, and it extracts the spatial features of the facial expression; the temporal stream branch comprises the same number of convolution modules as the spatial stream branch, its input is the stacked optical flow maps output by the data processing layer, and it extracts the temporal features of the facial expression; the inter-stream attention modules are embedded between the convolution modules of the spatial and temporal stream branches, the input of each module is the output of the preceding convolution modules in the two-stream network and its output is the input of the following convolution modules, enabling information interaction between spatial and temporal features; the feature fusion layer applies global average pooling separately to the spatial feature map output by the spatial stream branch and the temporal feature map output by the temporal stream branch, concatenates the two resulting feature vectors, and outputs a single feature vector; the fully connected layer fully connects the feature fusion layer to the classification layer; and the classification layer computes the probability that the facial expression in the input video belongs to each expression category.
Based on the same inventive concept, an embodiment of the invention discloses a dynamic expression recognition system based on an attention mechanism between spatial and temporal streams, comprising at least one computing device; the computing device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, and the computer program, when loaded into the processor, implements the above dynamic expression recognition method based on an attention mechanism between spatial and temporal streams.
The above description is only an embodiment of the present invention, and the scope of the present invention is not limited thereto; any modification or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed by the present invention falls within the scope of the present invention, which is therefore defined by the protection scope of the claims.

Claims (6)

1. A dynamic expression recognition method based on an attention mechanism between spatial and temporal streams, characterized by comprising the following steps:
step 1: collecting facial expression video clips, and establishing a facial expression video library containing expression category labels;
step 2: constructing a double-current convolutional neural network model embedded into a space-time inter-flow attention mechanism module, wherein the model comprises a data processing layer, a space flow branch, a time flow branch, a space-time inter-flow attention mechanism module, a feature fusion layer, a full connection layer and a classification layer;
the data processing layer is used for preprocessing an input video, and the preprocessing process comprises the following steps: performing framing processing on a video, and extracting an image sequence with the length of u frames from the obtained images according to the time sequence, wherein u is the set sequence length; carrying out face detection, cutting and alignment on each image in the image sequence, and normalizing each processed image to obtain a facial expression image sequence with the length of u frames; randomly selecting one image from the facial expression image sequence with the length of u frames as a single frame of facial image corresponding to the input video; calculating an optical flow graph between every two adjacent images in the facial expression image sequence with the length of u frames, and stacking the optical flow graphs on a channel dimension according to a time sequence to serve as a stacked optical flow graph corresponding to the input video;
the spatial stream branch comprises a plurality of convolution modules, and the input of the branch is a single-frame face image output by the data processing layer and used for extracting spatial domain characteristics of facial expressions;
the time flow branch comprises convolution modules with the same number as the spatial flow branch, and the input of the time flow branch is a stacked light flow diagram output by the data processing layer and used for extracting time domain characteristics of facial expressions;
the space-time inter-flow attention mechanism module is embedded between convolution modules of the space flow branch and the time flow branch, the input of the module is the output of a convolution module at the upper layer in the double-flow convolution neural network, and the output of the module is the input of a convolution module at the lower layer in the double-flow convolution neural network and is used for carrying out information interaction on space-domain characteristics and time-domain characteristics; the space-time inter-flow attention mechanism module firstly calculates a relevance weighting matrix of an input space domain characteristic diagram and a time domain characteristic diagram, then respectively calculates a mapping matrix of the space domain characteristic diagram and the time domain characteristic diagram, respectively calculates a space domain residual error characteristic diagram and a time domain residual error characteristic diagram according to the relevance weighting matrix and the mapping matrix, finally uses residual error connection to add the space domain residual error characteristic diagram and the input space domain characteristic diagram to obtain an output space domain characteristic diagram, and adds the time domain residual error characteristic diagram and the input time domain characteristic diagram to obtain an output time domain characteristic diagram;
the characteristic fusion layer is used for respectively carrying out global average pooling operation on the space domain characteristic diagram output by the space flow branch and the time domain characteristic diagram output by the time flow branch, and outputting a characteristic vector after splicing the two obtained characteristic vectors;
the full connection layer is used for fully connecting the characteristic fusion layer and the classification layer;
the classification layer is used for calculating the probability that the facial expression in the input video belongs to each expression category;
step 3: training the constructed network model with the video samples in the established facial expression video library;
step 4: performing facial expression recognition on a newly input video with the trained network model.
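The following is a minimal sketch, for illustration only, of how the data processing layer described in claim 1 could be realized with OpenCV and NumPy. The sequence length u = 16, the 112×112 crop size, uniform frame sampling, the Farneback optical flow algorithm and the detect_and_align_face helper are all assumptions of the sketch, not requirements of the claims.

```python
import cv2
import numpy as np

def preprocess_video(path, u=16, size=(112, 112), detect_and_align_face=lambda img: img):
    """Return (single-frame face image, stacked optical flow map) for one input video."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()

    # Extract an image sequence of length u frames in temporal order (uniform sampling assumed).
    idx = np.linspace(0, len(frames) - 1, u).astype(int)
    faces = []
    for i in idx:
        face = detect_and_align_face(frames[i])                    # face detection, cropping, alignment
        face = cv2.resize(face, size).astype(np.float32) / 255.0   # normalization
        faces.append(face)

    # Single-frame face image: one image chosen at random from the sequence.
    single_frame = faces[np.random.randint(u)]

    # Stacked optical flow map: u-1 flow fields (2 channels each) stacked along the channel dimension.
    flows = []
    for a, b in zip(faces[:-1], faces[1:]):
        g1 = cv2.cvtColor((a * 255).astype(np.uint8), cv2.COLOR_BGR2GRAY)
        g2 = cv2.cvtColor((b * 255).astype(np.uint8), cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(g1, g2, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)                                          # H x W x 2 (horizontal, vertical)
    stacked_flow = np.concatenate(flows, axis=2)                    # H x W x 2(u-1)

    return single_frame, stacked_flow
```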
2. The dynamic expression recognition method based on a spatio-temporal inter-stream attention mechanism according to claim 1, wherein the spatial stream branch comprises a plurality of sequentially connected convolution modules;
each convolution module comprises one or more convolution layers and a pooling layer; each convolution layer includes a ReLU nonlinear activation function and uses m_1 convolution kernels of size k_1×k_1 to perform a convolution operation on the output of the previous layer, where m_1 is selected from the values 32, 64, 128, 256 and 512, and k_1 is selected from the values 3, 5 and 7; the pooling layer uses a k_2×k_2 pooling kernel to perform a downsampling operation on the output of the preceding convolution layer, where k_2 is selected from the values 1, 2 and 3 (an illustrative sketch of this module follows claim 2).
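A minimal PyTorch sketch of the convolution module of claims 2 and 3. The concrete choices m = 64, k = 3, a 2×2 max-pooling kernel and two convolution layers per module are illustrative picks from the value ranges recited above; the temporal stream branch of claim 3 would reuse the same module with its own (m_2, k_3, k_4) choices.

```python
import torch.nn as nn

class ConvModule(nn.Module):
    """One or more k x k convolution layers, each followed by ReLU, then a pooling layer."""

    def __init__(self, in_channels, m=64, k=3, pool_k=2, num_convs=2):
        super().__init__()
        layers = []
        for i in range(num_convs):
            layers += [nn.Conv2d(in_channels if i == 0 else m, m, kernel_size=k, padding=k // 2),
                       nn.ReLU(inplace=True)]
        layers.append(nn.MaxPool2d(kernel_size=pool_k, stride=pool_k))
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)
```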
3. The dynamic expression recognition method based on a spatio-temporal inter-stream attention mechanism according to claim 1, wherein the temporal stream branch comprises the same number of sequentially connected convolution modules as the spatial stream branch;
each convolution module comprises one or more convolution layers and a pooling layer; each convolution layer includes a ReLU nonlinear activation function and uses m_2 convolution kernels of size k_3×k_3 to perform a convolution operation on the output of the previous layer, where m_2 is selected from the values 32, 64, 128, 256 and 512, and k_3 is selected from the values 3, 5 and 7; the pooling layer uses a k_4×k_4 pooling kernel to perform a downsampling operation on the output of the preceding convolution layer, where k_4 is selected from the values 1, 2 and 3.
4. The dynamic expression recognition method based on a spatio-temporal inter-stream attention mechanism according to claim 1, wherein, denoting the spatial-domain feature map and the temporal-domain feature map input to the spatio-temporal inter-stream attention mechanism module by X_S and X_T, whose sizes are H_S×W_S×C_S and H_T×W_T×C_T respectively, the computation performed by the module comprises the following steps (an illustrative sketch follows this claim):
(1) computing the correlation weight matrices F_1 and F_2 of the spatial-domain feature map X_S and the temporal-domain feature map X_T: C_O convolution kernels of size 1×1 are applied to X_S and, separately, to X_T, yielding two feature maps of sizes H_S×W_S×C_O and H_T×W_T×C_O; through dimension transformation the two feature maps are reshaped into two matrices of sizes H_S W_S×C_O and C_O×H_T W_T, which are multiplied to obtain a matrix F of size H_S W_S×H_T W_T; a Softmax operation applied to each row vector of F yields the correlation weight matrix F_1, and a Softmax operation applied to each column vector of F yields the correlation weight matrix F_2;
(2) computing the mapping matrix G_S of the spatial-domain feature map X_S: C_O convolution kernels of size 1×1 are applied to X_S, yielding a feature map of size H_S×W_S×C_O, which is reshaped through dimension transformation into a matrix of size C_O×H_S W_S, giving the mapping matrix G_S of X_S;
(3) computing the mapping matrix G_T of the temporal-domain feature map X_T: C_O convolution kernels of size 1×1 are applied to X_T, yielding a feature map of size H_T×W_T×C_O, which is reshaped through dimension transformation into a matrix of size H_T W_T×C_O, giving the mapping matrix G_T of X_T;
(4) computing the spatial-domain residual feature map Y_S: the matrix F_1 is multiplied by the matrix G_T to obtain a matrix of size H_S W_S×C_O, which is reshaped through dimension transformation into H_S×W_S×C_O; C_S convolution kernels of size 1×1 are then applied, yielding the spatial-domain residual feature map Y_S of size H_S×W_S×C_S;
(5) computing the temporal-domain residual feature map Y_T: the matrix G_S is multiplied by the matrix F_2 to obtain a matrix of size C_O×H_T W_T, which is reshaped through dimension transformation into H_T×W_T×C_O; C_T convolution kernels of size 1×1 are then applied, yielding the temporal-domain residual feature map Y_T of size H_T×W_T×C_T;
(6) computing the spatial-domain feature map Z_S output by the module: using a residual connection, the spatial-domain residual feature map Y_S is added to the spatial-domain feature map X_S input to the module, giving the output spatial-domain feature map Z_S;
(7) computing the temporal-domain feature map Z_T output by the module: using a residual connection, the temporal-domain residual feature map Y_T is added to the temporal-domain feature map X_T input to the module, giving the output temporal-domain feature map Z_T.
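A PyTorch sketch of the computation in steps (1)-(7) of claim 4, assuming the usual (batch, channels, height, width) tensor layout; the bottleneck width C_O = 64 and the layer/variable names are assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InterStreamAttention(nn.Module):
    """Spatio-temporal inter-stream attention: exchanges information between X_S and X_T."""

    def __init__(self, c_s, c_t, c_o=64):
        super().__init__()
        self.embed_s = nn.Conv2d(c_s, c_o, kernel_size=1)  # step (1): 1x1 conv on X_S
        self.embed_t = nn.Conv2d(c_t, c_o, kernel_size=1)  # step (1): 1x1 conv on X_T
        self.map_s = nn.Conv2d(c_s, c_o, kernel_size=1)    # step (2): mapping of X_S
        self.map_t = nn.Conv2d(c_t, c_o, kernel_size=1)    # step (3): mapping of X_T
        self.out_s = nn.Conv2d(c_o, c_s, kernel_size=1)    # step (4): back to C_S channels
        self.out_t = nn.Conv2d(c_o, c_t, kernel_size=1)    # step (5): back to C_T channels

    def forward(self, x_s, x_t):
        b, _, h_s, w_s = x_s.shape
        _, _, h_t, w_t = x_t.shape
        c_o = self.out_s.in_channels

        # (1) correlation matrix F (B, H_S*W_S, H_T*W_T) and its row/column Softmax forms F_1, F_2
        q = self.embed_s(x_s).flatten(2).transpose(1, 2)   # B, H_S*W_S, C_O
        k = self.embed_t(x_t).flatten(2)                   # B, C_O, H_T*W_T
        f = torch.bmm(q, k)                                # B, H_S*W_S, H_T*W_T
        f1 = F.softmax(f, dim=2)                           # Softmax over each row vector
        f2 = F.softmax(f, dim=1)                           # Softmax over each column vector

        # (2)(3) mapping matrices G_S (B, C_O, H_S*W_S) and G_T (B, H_T*W_T, C_O)
        g_s = self.map_s(x_s).flatten(2)
        g_t = self.map_t(x_t).flatten(2).transpose(1, 2)

        # (4) spatial-domain residual feature map Y_S
        y_s = torch.bmm(f1, g_t)                           # B, H_S*W_S, C_O
        y_s = y_s.transpose(1, 2).reshape(b, c_o, h_s, w_s)
        y_s = self.out_s(y_s)                              # B, C_S, H_S, W_S

        # (5) temporal-domain residual feature map Y_T
        y_t = torch.bmm(g_s, f2)                           # B, C_O, H_T*W_T
        y_t = y_t.reshape(b, c_o, h_t, w_t)
        y_t = self.out_t(y_t)                              # B, C_T, H_T, W_T

        # (6)(7) residual connections give the output feature maps Z_S and Z_T
        return x_s + y_s, x_t + y_t
```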
5. A dynamic expression recognition system based on a spatio-temporal inter-stream attention mechanism, the system comprising the following modules:
the data preprocessing module is used for acquiring facial expression video clips and establishing a facial expression video library containing expression category labels;
the network construction module is used for constructing a two-stream convolutional neural network model with embedded spatio-temporal inter-stream attention mechanism modules, wherein the model comprises a data processing layer, a spatial stream branch, a temporal stream branch, spatio-temporal inter-stream attention mechanism modules, a feature fusion layer, a fully connected layer and a classification layer;
the data processing layer is used for preprocessing an input video, the preprocessing comprising: dividing the video into frames and extracting, in temporal order, an image sequence of length u frames from the resulting images, where u is a preset sequence length; performing face detection, cropping and alignment on each image in the sequence, and normalizing each processed image to obtain a facial expression image sequence of length u frames; randomly selecting one image from the facial expression image sequence of length u frames as the single-frame face image corresponding to the input video; computing an optical flow map between every two adjacent images in the facial expression image sequence of length u frames, and stacking the optical flow maps along the channel dimension in temporal order to form the stacked optical flow map corresponding to the input video;
the spatial stream branch comprises a plurality of convolution modules; its input is the single-frame face image output by the data processing layer, and it is used for extracting spatial-domain features of the facial expression;
the temporal stream branch comprises the same number of convolution modules as the spatial stream branch; its input is the stacked optical flow map output by the data processing layer, and it is used for extracting temporal-domain features of the facial expression;
each spatio-temporal inter-stream attention mechanism module is embedded between convolution modules of the spatial stream branch and the temporal stream branch; its input is the output of the preceding convolution module in the two-stream convolutional neural network, and its output is the input of the following convolution module, so that information is exchanged between the spatial-domain features and the temporal-domain features; the spatio-temporal inter-stream attention mechanism module first computes correlation weight matrices between the input spatial-domain feature map and temporal-domain feature map, then computes mapping matrices of the spatial-domain feature map and the temporal-domain feature map, computes a spatial-domain residual feature map and a temporal-domain residual feature map from the correlation weight matrices and the mapping matrices, and finally, using residual connections, adds the spatial-domain residual feature map to the input spatial-domain feature map to obtain the output spatial-domain feature map and adds the temporal-domain residual feature map to the input temporal-domain feature map to obtain the output temporal-domain feature map;
the feature fusion layer is used for performing a global average pooling operation on the spatial-domain feature map output by the spatial stream branch and on the temporal-domain feature map output by the temporal stream branch, and for concatenating the two resulting feature vectors into a single output feature vector;
the fully connected layer connects the feature fusion layer to the classification layer;
the classification layer is used for computing the probability that the facial expression in the input video belongs to each expression category;
the network training module is used for training the constructed network model by using the video samples in the established facial expression video library;
the expression recognition module is used for performing facial expression recognition on a newly input video using the trained network model (a sketch assembling the full model follows this claim).
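For completeness, a sketch of how the modules above could be assembled into the two-stream network of claims 1 and 5, reusing the ConvModule and InterStreamAttention sketches given after claims 2 and 4. The four stage widths, the 112×112 input size and the seven expression classes are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class TwoStreamAttentionNet(nn.Module):
    def __init__(self, u=16, num_classes=7, widths=(64, 128, 256, 512)):
        super().__init__()
        self.spatial = nn.ModuleList()    # spatial stream branch: single-frame face image (3 channels)
        self.temporal = nn.ModuleList()   # temporal stream branch: stacked optical flow (2*(u-1) channels)
        self.attn = nn.ModuleList()       # inter-stream attention modules embedded between stages
        c_s, c_t = 3, 2 * (u - 1)
        for i, w in enumerate(widths):
            self.spatial.append(ConvModule(c_s, m=w))
            self.temporal.append(ConvModule(c_t, m=w))
            if i < len(widths) - 1:
                self.attn.append(InterStreamAttention(w, w))
            c_s = c_t = w
        self.fc = nn.Linear(2 * widths[-1], num_classes)   # fully connected layer

    def forward(self, frame, flow):
        x_s, x_t = frame, flow
        for i, (conv_s, conv_t) in enumerate(zip(self.spatial, self.temporal)):
            x_s, x_t = conv_s(x_s), conv_t(x_t)
            if i < len(self.attn):                          # information exchange between the two streams
                x_s, x_t = self.attn[i](x_s, x_t)
        # feature fusion layer: global average pooling on each stream, then concatenation
        v = torch.cat([x_s.mean(dim=(2, 3)), x_t.mean(dim=(2, 3))], dim=1)
        return self.fc(v).softmax(dim=1)                    # classification layer: per-class probabilities
```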
6. A dynamic expression recognition system based on a spatio-temporal inter-stream attention mechanism, comprising at least one computing device that comprises a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program, when loaded into the processor, implements the dynamic expression recognition method based on a spatio-temporal inter-stream attention mechanism according to any one of claims 1 to 4.
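Finally, a hypothetical usage sketch for the training step and the recognition step (steps 3 and 4 of claim 1), combining the sketches above; the data loader, learning rate and label set are placeholders, not part of the claims.

```python
import torch

model = TwoStreamAttentionNet(u=16, num_classes=7)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = torch.nn.NLLLoss()  # the model outputs probabilities, so train on their logarithm

# step 3: training on the facial expression video library (train_loader is a hypothetical DataLoader
# yielding (single-frame image, stacked optical flow, expression label) batches)
for frame, flow, label in train_loader:
    optimizer.zero_grad()
    probs = model(frame, flow)
    loss = criterion(torch.log(probs + 1e-8), label)
    loss.backward()
    optimizer.step()

# step 4: recognition on a newly input video
model.eval()
single_frame, stacked_flow = preprocess_video("new_clip.mp4")
to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1).unsqueeze(0).float()
with torch.no_grad():
    probs = model(to_tensor(single_frame), to_tensor(stacked_flow))
predicted_class = probs.argmax(dim=1).item()
```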
CN202110061153.7A 2021-01-18 2021-01-18 Dynamic expression recognition method and system based on attention mechanism between space and time streams Active CN112800894B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110061153.7A CN112800894B (en) 2021-01-18 2021-01-18 Dynamic expression recognition method and system based on attention mechanism between space and time streams

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110061153.7A CN112800894B (en) 2021-01-18 2021-01-18 Dynamic expression recognition method and system based on attention mechanism between space and time streams

Publications (2)

Publication Number Publication Date
CN112800894A CN112800894A (en) 2021-05-14
CN112800894B true CN112800894B (en) 2022-08-26

Family

ID=75809973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110061153.7A Active CN112800894B (en) 2021-01-18 2021-01-18 Dynamic expression recognition method and system based on attention mechanism between space and time streams

Country Status (1)

Country Link
CN (1) CN112800894B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255530B (en) * 2021-05-31 2024-03-29 合肥工业大学 Attention-based multichannel data fusion network architecture and data processing method
CN113705328A (en) * 2021-07-06 2021-11-26 合肥工业大学 Depression detection method and system based on facial feature points and facial movement units
CN113627349B (en) * 2021-08-12 2023-12-05 南京信息工程大学 Dynamic facial expression recognition method based on self-attention transformation network
CN113971826B (en) * 2021-09-02 2024-06-21 合肥工业大学 Dynamic emotion recognition method and system for estimating continuous titer and arousal level
CN116021506A (en) * 2021-10-26 2023-04-28 美智纵横科技有限责任公司 Robot control method, apparatus and storage medium
CN114067435A (en) * 2021-11-15 2022-02-18 山东大学 Sleep behavior detection method and system based on pseudo-3D convolutional network and attention mechanism
CN114494981B (en) * 2022-04-07 2022-08-05 之江实验室 Action video classification method and system based on multi-level motion modeling
CN114863520B (en) * 2022-04-25 2023-04-25 陕西师范大学 Video expression recognition method based on C3D-SA
CN115273186A (en) * 2022-07-18 2022-11-01 中国人民警察大学 Depth-forged face video detection method and system based on image feature fusion
CN115381467B (en) * 2022-10-31 2023-03-10 浙江浙大西投脑机智能科技有限公司 Attention mechanism-based time-frequency information dynamic fusion decoding method and device
CN115457643B (en) * 2022-11-09 2023-04-07 暨南大学 Fair facial expression recognition method based on increment technology and attention mechanism
CN116071809B (en) * 2023-03-22 2023-07-14 鹏城实验室 Face space-time representation generation method based on multi-class representation space-time interaction
CN116434343B (en) * 2023-04-25 2023-09-19 天津大学 Video motion recognition method based on high-low frequency double branches

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596069A (en) * 2018-04-18 2018-09-28 南京邮电大学 Neonatal pain expression recognition method and system based on depth 3D residual error networks
CN108596039A (en) * 2018-03-29 2018-09-28 南京邮电大学 A kind of bimodal emotion recognition method and system based on 3D convolutional neural networks
CN111401117A (en) * 2019-08-14 2020-07-10 南京邮电大学 Neonate pain expression recognition method based on double-current convolutional neural network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596039A (en) * 2018-03-29 2018-09-28 南京邮电大学 A kind of bimodal emotion recognition method and system based on 3D convolutional neural networks
CN108596069A (en) * 2018-04-18 2018-09-28 南京邮电大学 Neonatal pain expression recognition method and system based on depth 3D residual error networks
CN111401117A (en) * 2019-08-14 2020-07-10 南京邮电大学 Neonate pain expression recognition method based on double-current convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Video facial expression recognition based on attention mechanism; He Xiaoyun et al.; Information Technology (信息技术); 2020-02-20 (No. 02); full text *

Also Published As

Publication number Publication date
CN112800894A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
CN112800894B (en) Dynamic expression recognition method and system based on attention mechanism between space and time streams
CN112541503B (en) Real-time semantic segmentation method based on context attention mechanism and information fusion
Zhang et al. Multimodal learning for facial expression recognition
CN107527007B (en) Method for detecting object of interest in vehicle image processing system
CN112860888B (en) Attention mechanism-based bimodal emotion analysis method
CN112580617B (en) Expression recognition method and device in natural scene
CN111160350A (en) Portrait segmentation method, model training method, device, medium and electronic equipment
CN111428664B (en) Computer vision real-time multi-person gesture estimation method based on deep learning technology
CN115131880B (en) Multi-scale attention fusion double-supervision human face living body detection method
Podder et al. Time efficient real time facial expression recognition with CNN and transfer learning
Sharma et al. Deepfakes Classification of Faces Using Convolutional Neural Networks.
Singh et al. Feature based method for human facial emotion detection using optical flow based analysis
Kumar et al. Facial emotion recognition and detection using cnn
Gowada et al. Unethical human action recognition using deep learning based hybrid model for video forensics
Hong et al. Characterizing subtle facial movements via Riemannian manifold
Ruan et al. Facial expression recognition in facial occlusion scenarios: A path selection multi-network
CN114882405B (en) Video saliency detection method based on space-time double-flow pyramid network architecture
CN115965898A (en) Video emotion classification method combining multi-stage branch convolution and expansion interactive sampling
CN113221824B (en) Human body posture recognition method based on individual model generation
CN113887373B (en) Attitude identification method and system based on urban intelligent sports parallel fusion network
CN113159071B (en) Cross-modal image-text association anomaly detection method
CN114782995A (en) Human interaction behavior detection method based on self-attention mechanism
Aouayeb et al. Micro-expression recognition from local facial regions
Deshpande et al. Abnormal Activity Recognition with Residual Attention-based ConvLSTM Architecture for Video Surveillance.
CN115546885A (en) Motion recognition method and system based on enhanced space-time characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant