CN111797683A - Video expression recognition method based on depth residual error attention network - Google Patents

Video expression recognition method based on depth residual error attention network

Info

Publication number
CN111797683A
Authority
CN
China
Prior art keywords
video
expression
network
depth residual
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010436500.5A
Other languages
Chinese (zh)
Inventor
赵小明 (Zhao Xiaoming)
张石清 (Zhang Shiqing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taizhou University
Original Assignee
Taizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taizhou University filed Critical Taizhou University
Priority to CN202010436500.5A priority Critical patent/CN111797683A/en
Publication of CN111797683A publication Critical patent/CN111797683A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Human Computer Interaction (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a video expression recognition method based on a depth residual error attention network, which comprises the following steps: S1, preprocessing the video data of each video sample; S2, extracting expression features from the face images with a depth residual error attention network; and S3, after certain processing, training and testing on the features extracted in step S2 and outputting the final facial expression classification result. The invention is realized with a spatial attention mechanism: weights distributed over the spatial dimensions are generated for the input feature map and then combined with the feature map by weighted summation, so that the network is supervised to assign different attention (weights) to the regions of the face image that are closely related to the expression. Feature learning can therefore focus on the expression-relevant target regions, which improves the feature representation capability of the depth residual network and further improves the performance of video expression recognition.

Description

Video expression recognition method based on depth residual error attention network
Technical Field
The invention relates to the technical field of image processing and pattern recognition, in particular to a video expression recognition method based on a depth residual error attention network.
Background
Communication between people is rich in emotion; expressing emotion is one of the most basic human instincts, and emotion is conveyed as an aggregate of various expressions. In the past, people recorded their lives with text or pictures; nowadays, important memories and expressions of emotion such as joy, anger and sadness are mostly recorded in video blogs, short videos and the like.
Feature extraction is a key step in video expression recognition. In early work, researchers mostly used hand-crafted features to classify video expressions. Representative hand-crafted features include the local binary pattern (LBP), local phase quantization (LPQ), histogram of oriented gradients (HOG) and scale-invariant feature transform (SIFT). For expression recognition in dynamic video sequences, these descriptors have been extended to LBP-TOP, LPQ-TOP and 3D-SIFT. Although hand-crafted features are widely used in the field of video expression recognition, they remain low-level features. In video emotion recognition, videos contain rich emotional information that needs to be expressed by high-level deep features, so there is a semantic gap between hand-crafted features and high-level subjective emotion.
To address the above shortcomings of hand-crafted features, researchers have in recent years proposed a series of deep neural networks for video expression recognition. Representative deep neural network models include: AlexNet, which won the 2012 ImageNet image classification competition; VGG, which improves performance by deepening the network; GoogleNet, which widens the network structure with the Inception module; and ResNet, which uses the identity mapping principle in residual modules so that performance can be improved by making the network much deeper. Researchers have tried to apply these networks to video expression recognition and have obtained good results.
Although existing deep neural networks have strong feature extraction capability, they ignore the difference in emotion expression intensity between the local regions of an image, which limits their feature representation capability; that is, existing deep neural networks do not consider that different local regions of a face image convey emotion with different intensity.
For example, a video expression sequence recognition method based on hybrid deep learning disclosed in the Chinese patent literature (application No. CN201810880749.8) uses two deep convolutional neural network models, a temporal convolutional neural network and a spatial convolutional neural network, to extract high-level temporal and spatial features from a video expression sequence; a deep belief network then deeply fuses the temporal and spatial features, an average pooling operation yields the global features of the video sequence, and finally a support vector machine classifies the video expression sequence.
Disclosure of Invention
The invention aims to overcome the technical problems in the prior art that video expression recognition does not consider the difference in emotion expression intensity between the local regions of a face image, and that there is a semantic gap between hand-crafted features and the subjective emotion in a video, and provides a video expression recognition method based on a depth residual attention network.
In order to achieve the purpose, the invention adopts the following technical scheme:
a video expression recognition method based on a depth residual attention network comprises the following steps:
s1, preprocessing video data of the video sample;
s2, extracting expression features of the face image by adopting a depth residual error attention network;
and S3, training and testing the features extracted in the step S2 after certain processing, and outputting the final classification result of the facial expressions.
The scheme of the invention is realized with a spatial attention mechanism: a depth residual attention network is used to extract the expression features of the face image, so that the network is supervised to assign different attention (weights) to the regions of the face image that are closely related to the expression. Feature learning can therefore focus on the expression-relevant target regions, which improves the feature representation capability of the depth residual network and further improves the performance of video expression recognition.
Preferably, the step S1 includes the steps of:
s1.1, screening out image frames in a peak intensity (apex) period for each video sample;
s1.2, adopting a haar-cascades detection model to carry out face detection.
Preferably, the face detection in step S1.2 includes the following steps:
step 1, firstly, converting an input picture into a gray image, and removing color interference;
step 2, setting the size of a face searching frame, sequentially searching faces in an input image, intercepting and storing after finding the faces;
step 3, cutting out an image containing key expression parts such as the mouth, nose and forehead from the original facial expression image according to the standard distance between the two eyes, and taking this image as the input of the depth residual error attention network.
Preferably, the step S2 includes the steps of:
s2.1, establishing a depth residual error attention network, extracting the characteristics of each frame of face image from the preprocessed video data, and establishing a video emotion data set;
and S2.2, performing fine tuning training on the video expression data set by using the pre-trained models on other data sets.
Fine-tuning is widely used for transfer learning in computer vision and alleviates the problem of data insufficiency.
Preferably, the depth residual attention network comprises three residual attention modules, each residual attention module comprises a trunk branch and a mask branch, and each trunk branch comprises a residual unit.
In a network structure formed by simply stacking attention modules, gradient back-propagation is blocked during training: the mask branch outputs a weight-normalized feature map through a sigmoid activation function and then takes the element-wise product with the trunk-branch feature map, so the output response of the feature map is gradually reduced and the network cannot be trained effectively. The residual attention module is therefore proposed, which enables the neural network to extract more effective facial features.
Preferably, said step S2.2 comprises the steps of:
step 1, copying a depth residual error attention network model parameter pre-trained on a cifar-10 data set;
step 2, changing the 10-type image category number of the cifar-10 into the expression category number of the video emotion data set;
step 3, retraining the network model by using a back propagation algorithm to update the weight parameters of the network model;
step 4, after the fine-tuning training is finished, taking the output of the last fully connected layer of the depth residual error attention network as the learned high-level facial expression features for the subsequent expression classification by the multi-layer perceptron.
Preferably, the fine tuning training procedure in step S2.2 is specifically shown by the following formula:
X = {x_i | i = 1, 2, ..., N}    (1)

min H(P(x_i), y_i) = -∑_x (P(x_i) log y_i)    (2)

wherein: i denotes the i-th frame in the video, x_i denotes the face image of the i-th frame, y_i denotes the expression label of the video, H denotes the loss function to be minimized, and P(x_i) denotes the predicted output of the network model when the face image x_i is input.
Preferably, the residual attention module is represented by the following formula:
O_{i,k,c}(x) = (1 + S_{i,k,c}(x)) · T_{i,k,c}(x)

wherein O_{i,k,c}(x) denotes the output feature of the residual attention module, T_{i,k,c}(x) denotes the output feature of the trunk branch, and S_{i,k,c}(x) ∈ [0, 1] denotes the output feature of the mask branch; (i, k) is the spatial position coordinate of the feature, and c ∈ {1, ..., C} is the index value of the feature channel.
Preferably, the step S3 includes the steps of: after the feature extraction of each frame of face image in the video is completed, performing average pooling operation on the learned attention features of all the frames of images in one video, calculating global video expression feature parameters with fixed lengths, inputting the global video expression feature parameters into a multilayer perceptron for training and testing, and obtaining the classification result of the face expression.
The invention has the beneficial effects that: weights distributed over the spatial dimensions are generated for the input feature map and then combined with the feature map by weighted summation, so that the network is supervised to assign different attention (weights) to the regions of the face image closely related to the expression. Feature learning can therefore focus on the expression-relevant target regions, which improves the feature representation capability of the depth residual network and further improves the performance of video expression recognition.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a schematic diagram of a video expression recognition model according to the present invention.
FIG. 3 is a facial expression image in the BAUM-1s dataset of the present invention.
Fig. 4 is a facial expression image in the RML dataset of the present invention.
FIG. 5 is the confusion matrix of the final recognition results obtained by the invention on the BAUM-1s dataset.
FIG. 6 is the confusion matrix of the final recognition results obtained by the invention on the RML dataset.
Detailed Description
The technical scheme of the invention is further specifically described by the following embodiments and the accompanying drawings.
Example 1: in this embodiment, as shown in fig. 1, a video expression recognition method based on a depth residual attention network includes the following steps:
s1, preprocessing video data of the video sample;
step S1 includes the following steps:
s1.1, screening out image frames in a peak intensity (apex) period for each video sample;
s1.2, adopting a haar-cascades detection model to carry out face detection; the face detection in step S1.2 comprises the following steps:
step 1, firstly, converting an input picture into a gray image, and removing color interference;
step 2, setting the size of a face searching frame, sequentially searching faces in an input image, intercepting and storing after finding the faces;
step 3, cutting out an image containing key expression parts such as the mouth, nose and forehead from the original facial expression image according to the standard distance between the two eyes, and taking this image as the input of the depth residual error attention network.
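As an illustration of steps 1 to 3 above, the following sketch uses OpenCV's bundled haar-cascades frontal-face model; the crop simply takes the detected face box and resizes it to the 32 × 32 input resolution assumed for the pre-trained network below, while the eye-distance-based cropping of the mouth, nose and forehead region is left out of this simplified sketch.

```python
import cv2

# Sketch of the preprocessing in steps 1-3 (the 32x32 output size and the
# detector parameters are illustrative assumptions, not values fixed by the patent).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_face(frame_bgr, out_size=(32, 32)):
    # Step 1: convert the input picture to grayscale to remove colour interference.
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # Step 2: slide the face search window over the image and keep detections.
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]                      # first detected face box
    # Step 3: cut out the face region (mouth, nose, forehead) and resize it
    # to the input resolution of the depth residual attention network.
    face = frame_bgr[y:y + h, x:x + w]
    return cv2.resize(face, out_size)
```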
S2, extracting expression features of the face image by adopting a depth residual error attention network; step S2 includes the following steps:
s2.1, establishing a depth residual error attention network, extracting the characteristics of each frame of face image from the preprocessed video data, and establishing a video emotion data set;
as shown in fig. 2, the depth residual attention network includes three residual attention modules ( attention modules 1,2,3), each of which includes a trunk branch (trunk branch) and a mask branch (soft mask branch), wherein the trunk branch is composed of residual units (residual units) and is mainly used for extracting facial features, and the mask branch outputs a mask with the same size as the characteristic dimension of the trunk branch by combining a significant attention from bottom to top (up-sample) and a focused attention from top to bottom (down-sample) to the residual units, the mask outputs a feature map normalized by convolution (conv) and sigmoid function output weights, and then performs an element-wise product with the feature map of the trunk branch, however, when a network structure formed by such a purely stacked residual attention module is trained, gradient pass back is easily, the output of the feature graph correspondingly becomes smaller, and for the above defects, the output is inspired by a short-circuit mechanism in the residual error network, assuming that the input face picture is x, and the residual error attention module is represented by the following formula:
O_{i,k,c}(x) = (1 + S_{i,k,c}(x)) · T_{i,k,c}(x)

wherein the output feature of the residual attention module is O_{i,k,c}(x), the output feature of the trunk branch is T_{i,k,c}(x), and the output feature of the mask branch is S_{i,k,c}(x) ∈ [0, 1], where (i, k) is the spatial position coordinate of the feature and c ∈ {1, ..., C} is the index value of the feature channel.
The attention provided by the mask branch in the residual attention module helps the neural network extract more effective facial features; in addition, combined with the short-circuit mechanism, the network can be trained deeper, and stacking residual attention modules in this way further enables the network to extract more effective facial features.
In this embodiment, a 92-layer depth residual attention network is adopted, which gives good results.
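As a minimal, non-authoritative sketch of the residual attention computation O(x) = (1 + S(x)) · T(x) described above, the following PyTorch-style code builds one simplified residual attention module with a two-unit trunk branch and a single down-sample/up-sample mask branch; the real 92-layer network stacks three such modules with deeper branches, and its exact layer configuration is not reproduced here.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Simplified residual unit used in the trunk branch."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class ResidualAttentionModule(nn.Module):
    """One residual attention module: O = (1 + S) * T."""
    def __init__(self, channels):
        super().__init__()
        # Trunk branch: residual units that extract facial features.
        self.trunk = nn.Sequential(ResidualUnit(channels), ResidualUnit(channels))
        # Mask branch: down-sample / up-sample around a residual unit, then a
        # 1x1 convolution and a sigmoid that normalise the weights to [0, 1].
        self.mask = nn.Sequential(
            nn.MaxPool2d(2), ResidualUnit(channels),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x):
        t = self.trunk(x)        # T_{i,k,c}(x)
        s = self.mask(x)         # S_{i,k,c}(x) in [0, 1]
        return (1 + s) * t       # short-circuit keeps the trunk response from vanishing
```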
S2.2, because the video emotion data set contains few samples, directly training the depth residual attention network on it easily leads to overfitting; a transfer learning method is therefore adopted, and fine-tuning training is performed on the video expression data set with a model pre-trained on another data set. This embodiment uses a depth residual attention network model pre-trained on the cifar-10 image data set, whose input-layer picture resolution is 32 × 32 × 3 and whose last fully connected layer has 1024 nodes.
Step S2.2 comprises the following steps:
step 1, copying a depth residual error attention network model parameter pre-trained on a cifar-10 data set;
step 2, changing the 10-type image category number of the cifar-10 into the expression category number of the video emotion data set;
step 3, retraining the network model by using a back propagation algorithm to update the weight parameters of the network model;
step 4, after the fine-tuning training is finished, taking the output of the last fully connected layer of the depth residual error attention network as the learned high-level facial expression features for the subsequent expression classification by the multi-layer perceptron.
The fine tuning training process is specifically shown by the following formula:
X = {x_i | i = 1, 2, ..., N}    (1)

min H(P(x_i), y_i) = -∑_x (P(x_i) log y_i)    (2)

wherein: i denotes the i-th frame in the video, x_i denotes the face image of the i-th frame, y_i denotes the expression label of the video, H denotes the loss function to be minimized, and P(x_i) denotes the predicted output of the network model when the face image x_i is input. In this way, after fine-tuning training on the video emotion data set, the output (1024-D) of the last fully connected layer of the depth residual attention network is taken as the learned high-level facial expression feature for the expression classification of the subsequent multi-layer perceptron (MLP).
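As a minimal sketch of the fine-tuning procedure in steps 1 to 4 and formulas (1)-(2), the following PyTorch code loads pre-trained weights, replaces the 10-way cifar-10 classifier with an expression classifier, and retrains with a cross-entropy loss. The backbone class, checkpoint path and number of expression classes are illustrative assumptions; the actual 92-layer residual attention network is not reproduced.

```python
import torch
import torch.nn as nn

# Stand-in backbone whose penultimate layer is the 1024-d fully connected layer;
# the patent's 92-layer residual attention network is not reproduced here.
class Backbone(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 1024), nn.ReLU(inplace=True))
        self.classifier = nn.Linear(1024, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x))

model = Backbone(num_classes=10)
# Step 1: copy the parameters pre-trained on cifar-10 (the path is an assumption).
model.load_state_dict(torch.load("attention92_cifar10.pth"), strict=False)
# Step 2: replace the 10-way cifar-10 classifier with the expression classes.
model.classifier = nn.Linear(1024, 6)
# Step 3: retrain with back-propagation, minimising a cross-entropy loss
# corresponding to formula (2).
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

def finetune_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

def frame_feature(image):
    # Step 4: the 1024-d penultimate output is the learned high-level
    # facial expression feature used by the downstream MLP.
    with torch.no_grad():
        return model.features(image.unsqueeze(0)).squeeze(0)
```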
S3, after certain processing, the features extracted in step S2 are used for training and testing, and the final facial expression classification result is output: after the feature extraction of every face image frame in the video is completed, an average pooling operation is performed on the learned attention features of all the frames in one video to compute a fixed-length (1024-D) global video expression feature, which is then input into a multi-layer perceptron (MLP) for training and testing to obtain the facial expression classification result.
The input layer nodes of the MLP are 1024, the middle hidden layer has 512 nodes, and the number of the nodes of the output layer is the category number of the video emotion data set.
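A minimal sketch of this pooling and classification stage is given below: the per-frame 1024-D features are average-pooled into one fixed-length video descriptor and fed to the 1024-512-C multi-layer perceptron, where C is the number of expression classes (6 is assumed here to match the data sets used).

```python
import torch
import torch.nn as nn

def video_feature(frame_features):
    """Average-pool the per-frame 1024-d attention features of one video into a
    single fixed-length global descriptor ([num_frames, 1024] -> [1024])."""
    return torch.stack(frame_features).mean(dim=0)

class ExpressionMLP(nn.Module):
    """MLP with 1024 input nodes, a 512-node hidden layer and one output per class."""
    def __init__(self, num_classes=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(inplace=True),
            nn.Linear(512, num_classes))

    def forward(self, x):
        return self.net(x)

# Usage sketch: logits = ExpressionMLP()(video_feature(per_frame_features))
```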
Experimental results and analysis:
two common RML and bamu-1 s video emotion datasets were used to evaluate the video expression recognition performance of the method of the present invention. During deep residual attention network training, the size of batch is set to 64, the learning rate is set to 0.1, the number of cycles (epoch) reaches 10, the learning rate is reduced by 10%, and the maximum number of cycles is set to 40.
The experimental platform is an NVIDIA GPU with 24GB of video memory, and the experiments use subject-independent cross-validation. The BAUM-1s data set (more than 10 subjects) was divided evenly into 5 groups for 5-fold cross-validation, while the RML data set, which contains 8 subjects, was cross-validated 8 times; finally, the average accuracy of all cross-validation results is taken as the final result of the experiment.
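The subject-independent protocol can be sketched as follows with scikit-learn's GroupKFold, which guarantees that videos of the same subject never appear in both the training and test folds; the train_and_eval callback and the data arrays are placeholders for the actual model training.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def subject_independent_cv(features, labels, subject_ids, n_splits, train_and_eval):
    """n_splits-fold cross-validation grouped by subject (5 folds for BAUM-1s,
    8 for RML); returns the mean accuracy over all folds."""
    accuracies = []
    splitter = GroupKFold(n_splits=n_splits)
    for train_idx, test_idx in splitter.split(features, labels, groups=subject_ids):
        acc = train_and_eval(features[train_idx], labels[train_idx],
                             features[test_idx], labels[test_idx])
        accuracies.append(acc)
    return float(np.mean(accuracies))
```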
As shown in fig. 3, the BAUM-1s data set consists of 1222 video segments of 8 basic expressions from 31 individuals. In the experiments, only the 520 video segments of the 6 basic expressions Anger, Disgust, Fear, Joy, Sadness and Surprise are used. The original resolution of each video frame is 720 × 576 × 3.
As shown in fig. 4, the RML data set consists of 720 video segments from 8 persons of different nationalities and contains the 6 basic expressions Anger, Disgust, Fear, Joy, Sadness and Surprise; each video segment lasts about 5 s, and the original resolution of each video frame is 720 × 480 × 3.
to test the performance of the deep residual attention network, table 1 gives a comparison of the performance with the ResNet and VGG16 networks without the attention mechanism. The ResNet used also contains 92 layers, consistent with the number of layers of the deep residual attention network described above. As can be seen from Table 1, the method of the present invention achieves a correct recognition rate of 56.72% and 68.50% on BAUM-1s and RML respectively, which is significantly better than ResNet and VGG16 without attention mechanism, which indicates that adding attention mechanism in ResNet helps to improve the feature expression capability of the network model.
TABLE 1  Comparison of recognition results of different network models

Data set    ResNet     VGG16      Ours
BAUM-1s     52.25%     51.01%     56.72%
RML         62.56%     64.04%     68.50%
To further illustrate the effectiveness of the method, Table 2 compares the experimental results of the method of the invention with those reported in the prior literature. As can be seen from Table 2, the method of the invention achieves a correct recognition rate of 56.72% on BAUM-1s, which is superior to the recognition performance reported in the prior literature.
TABLE 2  Comparison with recognition results reported in the prior art
For example, Shiqing Zhang et al. performed facial expression recognition on the BAUM-1s data set by extracting features with a 3D convolutional neural network (3D-CNN) and achieved a correct recognition rate of 50.11% (see: Zhang S, Pan X, Cui Y, et al.). Zhalehpour et al. achieved a correct recognition rate of 45.04% on the BAUM-1s data set by extracting LPQ features (see: Zhalehpour S, Onder O, Akhtar Z, et al. BAUM-1: A spontaneous audio-visual face database of affective and mental states. IEEE Transactions on Affective Computing, 2016, 8(3): 300-). Pan Xianza et al. extracted deep spatio-temporal features on the BAUM-1s data set with a multimodal deep convolutional neural network and obtained a correct recognition rate of 52.18% (see: Panxianza, Zhanqing, Guoweiping. A multimodal deep convolutional neural network applied to video expression recognition. Optics and Precision Engineering, 2019, 27(04): 230-). Similarly, the method of the invention achieves 68.50% on the RML data set, which is also better than the results reported in other references. For example, Elmadany et al. obtained a correct recognition rate of 64.58% on the RML data set by extracting Gabor wavelet features (see: Elmadany N E D, He Y, Guan L. Multiview emotion recognition via multi-set locality preserving canonical correlation analysis. 2016 IEEE International Symposium on Circuits and Systems (ISCAS), 2016: 590-), and the deep spatio-temporal features extracted by Pan Xianza et al. on the RML data set achieved a correct recognition rate of 65.72%. The comparison with the methods reported in the above prior literature fully illustrates the advantages of the method of the invention.
To observe more intuitively how the depth residual attention network recognizes each expression, fig. 5 and fig. 6 show the confusion matrices of the final recognition results obtained by the method on the BAUM-1s and RML data sets, respectively. As can be seen from fig. 5, Joy and Surprise are recognized well, with correct recognition rates of 78.74% and 83.67% respectively, while the recognition accuracy of Anger and Fear is low, 44.12% and 42.5% respectively. These two expressions are easily misjudged as Sadness, possibly because the three expressions are not easily discriminated from one another, which causes misjudgment by the network model.
As can be seen from fig. 6, on the RML data set the recognition performance for the Fear expression is the lowest, with a correct recognition rate of 33.04%, while the other expressions are recognized better, with correct recognition rates exceeding 70%. The reason may be that the number of Fear samples in the RML data set is much smaller than the number of samples of the other expressions, so the network model cannot recognize this expression well.
The invention obtains good correct recognition rates on the BAUM-1s and RML data sets, which shows that combining a spatial attention mechanism with a residual network can effectively improve video expression recognition performance.
Considering the difference in emotion expression intensity between the local regions of a face image, the invention provides a video expression recognition method based on a depth residual attention network. It is realized with a spatial attention (weight) mechanism: specifically, weights distributed over the spatial dimensions are generated for the input feature map and then combined with the feature map by weighted summation, so that the network is supervised to assign different attention (weights) to the regions of the face image closely related to the expression. The invention can therefore focus feature learning on the expression-relevant target regions, which improves the feature representation capability of the depth residual network and further improves the performance of video expression recognition.

Claims (9)

1. A video expression recognition method based on a depth residual attention network is characterized by comprising the following steps:
s1, preprocessing video data of the video sample;
s2, extracting expression features of the face image by adopting a depth residual error attention network;
and S3, training and testing the features extracted in the step S2 after certain processing, and outputting the final classification result of the facial expressions.
2. The method for video expression recognition based on the depth residual attention network of claim 1, wherein the step S1 comprises the following steps:
s1.1, screening out image frames in an apex period for each video sample;
s1.2, adopting a haar-cascades detection model to carry out face detection.
3. The method for video expression recognition based on the depth residual attention network of claim 2, wherein the face detection in the step S1.2 comprises the following steps:
step 1, firstly, converting an input picture into a gray image, and removing color interference;
step 2, setting the size of a face searching frame, sequentially searching faces in an input image, intercepting and storing after finding the faces;
step 3, cutting out an image containing key expression parts such as the mouth, nose and forehead from the original facial expression image according to the standard distance between the two eyes, and taking this image as the input of the depth residual error attention network.
4. The method for video expression recognition based on the depth residual attention network of claim 1, wherein the step S2 comprises the following steps:
s2.1, establishing a depth residual error attention network, extracting the characteristics of each frame of face image from the preprocessed video data, and establishing a video emotion data set;
and S2.2, performing fine tuning training on the video expression data set by using the pre-trained models on other data sets.
5. The method according to claim 4, wherein the depth residual attention network comprises three residual attention modules, each residual attention module comprises a trunk branch and a mask branch, and each trunk branch comprises a residual unit.
6. The method of claim 4, wherein the step S2.2 comprises the steps of:
step 1, copying a depth residual error attention network model parameter pre-trained on a cifar-10 data set;
step 2, changing the 10-type image category number of the cifar-10 into the expression category number of the video emotion data set;
step 3, retraining the network model by using a back propagation algorithm to update the weight parameters of the network model;
step 4, after the fine-tuning training is finished, taking the output of the last fully connected layer of the depth residual error attention network as the learned high-level facial expression features for the subsequent expression classification by the multi-layer perceptron.
7. The method according to claim 6, wherein the fine-tuning training procedure in step S2.2 is specifically represented by the following formula:
X = {x_i | i = 1, 2, ..., N}    (1)

min H(P(x_i), y_i) = -∑_x (P(x_i) log y_i)    (2)

wherein: i denotes the i-th frame in the video, x_i denotes the face image of the i-th frame, y_i denotes the expression label of the video, H denotes the loss function to be minimized, and P(x_i) denotes the predicted output of the network model when the face image x_i is input.
8. The method of claim 5, wherein the residual attention module is represented by the following formula:
O_{i,k,c}(x) = (1 + S_{i,k,c}(x)) · T_{i,k,c}(x)

wherein O_{i,k,c}(x) denotes the output feature of the residual attention module, T_{i,k,c}(x) denotes the output feature of the trunk branch, S_{i,k,c}(x) ∈ [0, 1] denotes the output feature of the mask branch, (i, k) is the spatial position coordinate of the feature, and c ∈ {1, ..., C} is the index value of the feature channel.
9. The method for video expression recognition based on the depth residual attention network of claim 1, wherein the step S3 comprises the following steps: after the feature extraction of each frame of face image in the video is completed, performing average pooling operation on the learned attention features of all the frames of images in one video, calculating global video expression feature parameters with fixed lengths, inputting the global video expression feature parameters into a multilayer perceptron for training and testing, and obtaining the classification result of the face expression.
CN202010436500.5A 2020-05-21 2020-05-21 Video expression recognition method based on depth residual error attention network Pending CN111797683A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010436500.5A CN111797683A (en) 2020-05-21 2020-05-21 Video expression recognition method based on depth residual error attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010436500.5A CN111797683A (en) 2020-05-21 2020-05-21 Video expression recognition method based on depth residual error attention network

Publications (1)

Publication Number Publication Date
CN111797683A true CN111797683A (en) 2020-10-20

Family

ID=72805869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010436500.5A Pending CN111797683A (en) 2020-05-21 2020-05-21 Video expression recognition method based on depth residual error attention network

Country Status (1)

Country Link
CN (1) CN111797683A (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110415815A (en) * 2019-07-19 2019-11-05 银丰基因科技有限公司 The hereditary disease assistant diagnosis system of deep learning and face biological information
CN112329683A (en) * 2020-11-16 2021-02-05 常州大学 Attention mechanism fusion-based multi-channel convolutional neural network facial expression recognition method
CN112381061A (en) * 2020-12-04 2021-02-19 中国科学院大学 Facial expression recognition method and system
CN112541409A (en) * 2020-11-30 2021-03-23 北京建筑大学 Attention-integrated residual network expression recognition method
CN112766172A (en) * 2021-01-21 2021-05-07 北京师范大学 Face continuous expression recognition method based on time sequence attention mechanism
CN112836589A (en) * 2021-01-13 2021-05-25 苏州元启创人工智能科技有限公司 Method for recognizing facial expressions in video based on feature fusion
CN112949570A (en) * 2021-03-26 2021-06-11 长春工业大学 Finger vein identification method based on residual attention mechanism
CN113065402A (en) * 2021-03-05 2021-07-02 四川翼飞视科技有限公司 Face detection method based on deformed attention mechanism
CN113111779A (en) * 2021-04-13 2021-07-13 东南大学 Expression recognition method based on attention mechanism
CN113159002A (en) * 2021-05-26 2021-07-23 重庆大学 Facial expression recognition method based on self-attention weight auxiliary module
CN113158872A (en) * 2021-04-16 2021-07-23 中国海洋大学 Online learner emotion recognition method
CN113313048A (en) * 2021-06-11 2021-08-27 北京百度网讯科技有限公司 Facial expression recognition method and device
CN113420703A (en) * 2021-07-03 2021-09-21 西北工业大学 Dynamic facial expression recognition method based on multi-scale feature extraction and multi-attention mechanism modeling
CN113855020A (en) * 2021-09-18 2021-12-31 中国信息通信研究院 Method and device for emotion recognition, computer equipment and storage medium
CN114038041A (en) * 2021-11-17 2022-02-11 杭州电子科技大学 Micro-expression identification method based on residual error neural network and attention mechanism
CN114038037A (en) * 2021-11-09 2022-02-11 合肥工业大学 Expression label correction and identification method based on separable residual attention network
WO2022111236A1 (en) * 2020-11-24 2022-06-02 华中师范大学 Facial expression recognition method and system combined with attention mechanism
WO2023185243A1 (en) * 2022-03-29 2023-10-05 河南工业大学 Expression recognition method based on attention-modulated contextual spatial information

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503654A (en) * 2016-10-24 2017-03-15 中国地质大学(武汉) A kind of face emotion identification method based on the sparse autoencoder network of depth
US20190311188A1 (en) * 2018-12-05 2019-10-10 Sichuan University Face emotion recognition method based on dual-stream convolutional neural network
CN110363156A (en) * 2019-07-17 2019-10-22 北京师范大学 A kind of Facial action unit recognition methods that posture is unrelated
CN110427867A (en) * 2019-07-30 2019-11-08 华中科技大学 Human facial expression recognition method and system based on residual error attention mechanism
CN111091045A (en) * 2019-10-25 2020-05-01 重庆邮电大学 Sign language identification method based on space-time attention mechanism

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503654A (en) * 2016-10-24 2017-03-15 中国地质大学(武汉) A kind of face emotion identification method based on the sparse autoencoder network of depth
US20190311188A1 (en) * 2018-12-05 2019-10-10 Sichuan University Face emotion recognition method based on dual-stream convolutional neural network
CN110363156A (en) * 2019-07-17 2019-10-22 北京师范大学 A kind of Facial action unit recognition methods that posture is unrelated
CN110427867A (en) * 2019-07-30 2019-11-08 华中科技大学 Human facial expression recognition method and system based on residual error attention mechanism
CN111091045A (en) * 2019-10-25 2020-05-01 重庆邮电大学 Sign language identification method based on space-time attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
何俊 等: "基于改进的深度残差网络的表情识别研究", 计算机应用研究, pages 1578 - 1581 *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110415815A (en) * 2019-07-19 2019-11-05 银丰基因科技有限公司 The hereditary disease assistant diagnosis system of deep learning and face biological information
CN112329683A (en) * 2020-11-16 2021-02-05 常州大学 Attention mechanism fusion-based multi-channel convolutional neural network facial expression recognition method
CN112329683B (en) * 2020-11-16 2024-01-26 常州大学 Multi-channel convolutional neural network facial expression recognition method
US11967175B2 (en) 2020-11-24 2024-04-23 Central China Normal University Facial expression recognition method and system combined with attention mechanism
WO2022111236A1 (en) * 2020-11-24 2022-06-02 华中师范大学 Facial expression recognition method and system combined with attention mechanism
CN112541409B (en) * 2020-11-30 2021-09-14 北京建筑大学 Attention-integrated residual network expression recognition method
CN112541409A (en) * 2020-11-30 2021-03-23 北京建筑大学 Attention-integrated residual network expression recognition method
CN112381061B (en) * 2020-12-04 2022-07-12 中国科学院大学 Facial expression recognition method and system
CN112381061A (en) * 2020-12-04 2021-02-19 中国科学院大学 Facial expression recognition method and system
CN112836589A (en) * 2021-01-13 2021-05-25 苏州元启创人工智能科技有限公司 Method for recognizing facial expressions in video based on feature fusion
CN112766172B (en) * 2021-01-21 2024-02-02 北京师范大学 Facial continuous expression recognition method based on time sequence attention mechanism
CN112766172A (en) * 2021-01-21 2021-05-07 北京师范大学 Face continuous expression recognition method based on time sequence attention mechanism
CN113065402A (en) * 2021-03-05 2021-07-02 四川翼飞视科技有限公司 Face detection method based on deformed attention mechanism
CN112949570A (en) * 2021-03-26 2021-06-11 长春工业大学 Finger vein identification method based on residual attention mechanism
CN113111779A (en) * 2021-04-13 2021-07-13 东南大学 Expression recognition method based on attention mechanism
CN113158872A (en) * 2021-04-16 2021-07-23 中国海洋大学 Online learner emotion recognition method
CN113159002A (en) * 2021-05-26 2021-07-23 重庆大学 Facial expression recognition method based on self-attention weight auxiliary module
CN113159002B (en) * 2021-05-26 2023-04-07 重庆大学 Facial expression recognition method based on self-attention weight auxiliary module
CN113313048A (en) * 2021-06-11 2021-08-27 北京百度网讯科技有限公司 Facial expression recognition method and device
CN113313048B (en) * 2021-06-11 2024-04-09 北京百度网讯科技有限公司 Facial expression recognition method and device
CN113420703A (en) * 2021-07-03 2021-09-21 西北工业大学 Dynamic facial expression recognition method based on multi-scale feature extraction and multi-attention mechanism modeling
CN113855020A (en) * 2021-09-18 2021-12-31 中国信息通信研究院 Method and device for emotion recognition, computer equipment and storage medium
CN114038037A (en) * 2021-11-09 2022-02-11 合肥工业大学 Expression label correction and identification method based on separable residual attention network
CN114038037B (en) * 2021-11-09 2024-02-13 合肥工业大学 Expression label correction and identification method based on separable residual error attention network
CN114038041A (en) * 2021-11-17 2022-02-11 杭州电子科技大学 Micro-expression identification method based on residual error neural network and attention mechanism
WO2023185243A1 (en) * 2022-03-29 2023-10-05 河南工业大学 Expression recognition method based on attention-modulated contextual spatial information

Similar Documents

Publication Publication Date Title
CN111797683A (en) Video expression recognition method based on depth residual error attention network
CN112446270B (en) Training method of pedestrian re-recognition network, pedestrian re-recognition method and device
CN108520535B (en) Object classification method based on depth recovery information
CN110188239B (en) Double-current video classification method and device based on cross-mode attention mechanism
Rahmon et al. Motion U-Net: Multi-cue encoder-decoder network for motion segmentation
US20220148291A1 (en) Image classification method and apparatus, and image classification model training method and apparatus
CN113496217A (en) Method for identifying human face micro expression in video image sequence
CN112464865A (en) Facial expression recognition method based on pixel and geometric mixed features
CN112784763A (en) Expression recognition method and system based on local and overall feature adaptive fusion
Wu et al. Feedback weight convolutional neural network for gait recognition
Luo et al. Shape constrained network for eye segmentation in the wild
CN113989890A (en) Face expression recognition method based on multi-channel fusion and lightweight neural network
CN113780249B (en) Expression recognition model processing method, device, equipment, medium and program product
CN114842238B (en) Identification method of embedded breast ultrasonic image
CN112861718A (en) Lightweight feature fusion crowd counting method and system
KR20190128933A (en) Emotion recognition apparatus and method based on spatiotemporal attention
CN112329784A (en) Correlation filtering tracking method based on space-time perception and multimodal response
CN115410254A (en) Multi-feature expression recognition method based on deep learning
Lee et al. Face and facial expressions recognition system for blind people using ResNet50 architecture and CNN
CN113076905B (en) Emotion recognition method based on context interaction relation
Reddi et al. CNN Implementing Transfer Learning for Facial Emotion Recognition
CN116311472B (en) Micro-expression recognition method and device based on multi-level graph convolution network
Liu RETRACTED ARTICLE: Video Face Detection Based on Deep Learning
Uddin et al. A convolutional neural network for real-time face detection and emotion & gender classification
Abayomi-Alli et al. Facial image quality assessment using an ensemble of pre-trained deep learning models (EFQnet)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination