CN114581971A - Emotion recognition method and device based on facial action combination detection - Google Patents

Emotion recognition method and device based on facial action combination detection

Info

Publication number
CN114581971A
CN114581971A
Authority
CN
China
Prior art keywords
facial
convolution
module
emotion
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210105954.3A
Other languages
Chinese (zh)
Inventor
王淑欣
刘小青
俞益洲
李一鸣
乔昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shenrui Bolian Technology Co Ltd
Shenzhen Deepwise Bolian Technology Co Ltd
Original Assignee
Beijing Shenrui Bolian Technology Co Ltd
Shenzhen Deepwise Bolian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenrui Bolian Technology Co Ltd, Shenzhen Deepwise Bolian Technology Co Ltd filed Critical Beijing Shenrui Bolian Technology Co Ltd
Priority to CN202210105954.3A priority Critical patent/CN114581971A/en
Publication of CN114581971A publication Critical patent/CN114581971A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides an emotion recognition method and device based on facial action combination detection. The method comprises the following steps: acquiring a face image and performing feature extraction on the image; inputting the extracted image features into an attention module to obtain the spatial position of each facial action on the face and the facial region-of-interest features; based on the spatial positions and the region-of-interest features, obtaining the associated features of different facial actions by using a Transformer self-attention mechanism, and further predicting the facial actions; and predicting the combination of facial actions by fusing the associated features, thereby obtaining the emotion change. By fusing the associated features to predict the combination of facial actions, the invention obtains the emotion change and improves emotion recognition accuracy, solving the problem of the prior art, which cannot attend to the associations between different facial actions and predicts emotion changes only from simple combinations of facial actions, resulting in low emotion recognition accuracy.

Description

Emotion recognition method and device based on facial action combination detection
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to an emotion recognition method and device based on facial action combination detection.
Background
Facial expressions are the primary means of conveying non-verbal information. Psychological studies have shown that the emotional information conveyed by facial micro-expressions can account for up to 55% of the total information in everyday human communication, so learning to read facial expressions is one of the important ways to coordinate interpersonal relationships. For the blind, building a device that can recognize facial micro-expressions and perceive the emotions of others is important: it can help them sense the emotional states of other people and improve the effectiveness of communication.
Facial expressions are states of the muscles and features of the human face, such as smiling or anger. Facial expressions can be represented by facial actions: each facial action describes a basic facial movement or expression change, and combinations of facial actions can describe complex emotion changes. With the development of artificial intelligence technology, it has become possible to recognize facial expressions with deep learning methods and thus obtain people's current emotion. Fig. 2 shows a network for detecting facial action units based on deep learning, called EAC-Net. Its principle is as follows: on top of a VGG network, two modules are added, an enhancing (attention) learning module and a cropping learning module. The enhancing learning module adds an attention mechanism so that, during training, the network assigns higher learning weights to the facial action units of interest; the cropping learning module crops the facial regions around the detected targets and designs convolutional layers to learn deeper features for each facial region. Finally, a facial action value is predicted for each detected facial region, and the emotion change is obtained by simple combination. The problems are: although an attention mechanism is used to learn the regions of interest on the face, each region is learned independently and the associations between different facial actions are not considered; and the emotion change conclusion is obtained by simply combining facial actions, so if a facial action is predicted incorrectly, the emotion conclusion is directly wrong, and the robustness is poor.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides a method and an apparatus for emotion recognition based on facial motion combination detection.
In order to achieve the above object, the present invention adopts the following technical solutions.
In a first aspect, the present invention provides a method for emotion recognition based on facial motion combination detection, comprising the steps of:
acquiring a face image, and performing feature extraction on the image;
inputting the extracted image features into an attention module to obtain the spatial position of each facial action on the face and the feature of the face region of interest;
based on the spatial positions and the region-of-interest features, obtaining the associated features of different facial actions by using a Transformer self-attention mechanism, and further predicting the facial actions;
and predicting the combination of facial actions by fusing the associated features, thereby obtaining the emotion change.
Further, feature extraction is performed on the input image by using the residual network ResNet34. ResNet34 is composed of 5 convolution modules: the first convolution module consists of a 7 × 7 convolution and a 3 × 3 downsampling convolution, and the second to fifth convolution modules are residual modules each composed of two 3 × 3 convolutions; in a residual module, the input x is passed through the two 3 × 3 convolutions, the result is added to x, and the sum is passed through a ReLU activation layer; the output of the fifth convolution module is the output of ResNet34.
Further, the facial actions include: raising the inner eyebrow corner, raising the outer eyebrow corner, frowning, raising the upper eyelid, lifting the cheek, tightening the eyelid, wrinkling the nose, raising the upper lip, pulling the mouth corner, tightening the mouth corner, pulling the mouth corner down, raising the lower lip, pouting, stretching the mouth corner, tightening the lips, pressing the lips, parting the lips, lowering the chin, opening the mouth wide, and closing the eyes.
Furthermore, the method for obtaining the spatial position of the facial motion on the face and the facial region-of-interest features comprises the following steps:
inputting the N_T extracted image features T into the attention module; the features pass sequentially through 3 residual modules, N_A 1 × 1 convolutions, a ReLU activation layer and another N_A 1 × 1 convolutions, and finally a Sigmoid activation function gives the spatial position of each facial action on the face; the N_A attention feature maps obtained by the attention module are each multiplied by the N_T image features T and then added to the N_T image features T, yielding N_A groups of feature maps with N_T feature maps per group; N_A is equal to the number of facial actions.
Further, obtaining the associated features of different facial actions by using a Transformer self-attention mechanism and thereby predicting the facial actions specifically includes:
each of the N_A groups of feature maps is passed through a 1 × 1 convolution, batch normalization and ReLU, and each group outputs a one-dimensional vector of length N, giving N_A one-dimensional vectors in total; the N_A one-dimensional vectors are fused into a feature C of length N × N_A; C is input into a Transformer encoder consisting of a multi-head attention module and a feed-forward network, and the output of the multi-head attention module, after passing through two fully-connected layers, gives the predicted values of the facial actions.
Still further, the method of recognizing a change in emotion includes:
inputting the associated features output by the Transformer self-attention mechanism into a feature fusion module consisting of two fully-connected layers, outputting a plurality of feature layers, and predicting combinations of facial actions through a Softmax function, each combination corresponding to one emotion change;
The emotion changes corresponding to the combinations of facial actions are: inner eyebrow raise + outer eyebrow raise + frown: fear; inner eyebrow raise + outer eyebrow raise: surprise; inner eyebrow raise + frown: sadness; frown + upper eyelid raise: anger; cheek raise + mouth corner pull: happiness; upper lip raise + lips parting: disgust; pouting + lip tightening: confusion.
In a second aspect, the present invention provides an emotion recognition apparatus based on facial motion combination detection, including:
the image feature extraction module is used for acquiring a face image and extracting features of the image;
the region-of-interest extraction module is used for inputting the extracted image features into the attention module to obtain the spatial position of each facial action on the face and the facial region-of-interest features;
the facial action prediction module is used for obtaining the associated features of different facial actions by using a Transformer self-attention mechanism based on the spatial positions and the region-of-interest features, and further predicting the facial actions;
and the emotion change prediction module is used for predicting the combination of facial actions by fusing the associated features so as to obtain emotion changes.
Further, the residual network ResNet34 is used to perform feature extraction on the input image. ResNet34 is composed of 5 convolution modules: the first convolution module consists of a 7 × 7 convolution and a 3 × 3 downsampling convolution, and the second to fifth convolution modules are residual modules each composed of two 3 × 3 convolutions; in a residual module, the input x is passed through the two 3 × 3 convolutions, the result is added to x, and the sum is passed through a ReLU activation layer; the output of the fifth convolution module is the output of ResNet34.
Further, the facial actions include: raising the inner eyebrow corner, raising the outer eyebrow corner, frowning, raising the upper eyelid, lifting the cheek, tightening the eyelid, wrinkling the nose, raising the upper lip, pulling the mouth corner, tightening the mouth corner, pulling the mouth corner down, raising the lower lip, pouting, stretching the mouth corner, tightening the lips, pressing the lips, parting the lips, lowering the chin, opening the mouth wide, and closing the eyes.
Further, the method for obtaining the spatial position of the facial motion on the face and the facial region-of-interest features comprises the following steps:
inputting the N_T extracted image features T into the attention module; the features pass sequentially through 3 residual modules, N_A 1 × 1 convolutions, a ReLU activation layer and another N_A 1 × 1 convolutions, and finally a Sigmoid activation function gives the spatial position of each facial action on the face; the N_A attention feature maps obtained by the attention module are each multiplied by the N_T image features T and then added to the N_T image features T, yielding N_A groups of feature maps with N_T feature maps per group; N_A is equal to the number of facial actions.
Compared with the prior art, the invention has the following beneficial effects.
According to the emotion recognition method and device, feature extraction is performed on the facial image, the extracted image features are input into the attention module to obtain the spatial position of each facial action on the face and the facial region-of-interest features, the Transformer self-attention mechanism is used to obtain the associated features of different facial actions, and the associated features are fused to predict the combination of facial actions and thus the emotion change. This improves emotion recognition accuracy and solves the problem of the prior art, which cannot attend to the associations between different facial actions and predicts emotion changes only from simple combinations of facial actions, resulting in low emotion recognition accuracy.
Drawings
Fig. 1 is a flowchart of an emotion recognition method based on facial motion combination detection according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a conventional facial motion detection network.
Fig. 3 is a schematic general flow chart of emotion recognition according to an embodiment of the present invention.
Fig. 4 is a schematic flow chart of obtaining the associated features of facial actions by using the Transformer self-attention mechanism.
Fig. 5 is a block diagram of an emotion recognition apparatus based on facial motion combination detection according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer and more obvious, the present invention is further described below with reference to the accompanying drawings and the detailed description. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of an emotion recognition method based on facial motion combination detection according to an embodiment of the present invention, including the following steps:
Step 101, acquiring a face image and performing feature extraction on the image;
Step 102, inputting the extracted image features into an attention module to obtain the spatial position of each facial action on the face and the facial region-of-interest features;
Step 103, obtaining the associated features of different facial actions by using a Transformer self-attention mechanism based on the spatial positions and the region-of-interest features, and further predicting the facial actions;
Step 104, predicting the combination of facial actions by fusing the associated features, thereby obtaining the emotion change.
In this embodiment, step 101 is mainly used to perform facial image feature extraction. This embodiment performs emotion recognition based on the face image, and therefore takes the face image as input. A convolutional neural network (CNN) is generally used for image feature extraction. In the visual nervous system, the receptive field of a neuron is a specific region on the retina; only stimulation within this region can activate the neuron. CNNs were proposed on the basis of this biological receptive-field mechanism. A CNN is a feed-forward neural network, but unlike a general fully-connected feed-forward network, its convolutional layers have the properties of local connectivity and weight sharing, which greatly reduces the number of weight parameters, thereby reducing model complexity and increasing computation speed. A typical CNN is formed by stacking convolutional layers, pooling (downsampling) layers and fully-connected layers. The convolutional layers extract features of local regions by convolving convolution kernels with the input image; different convolution kernels correspond to different feature extractors. The pooling layers perform feature selection and reduce the number of features, further reducing the number of parameters; max pooling and average pooling are commonly used. The fully-connected layers fuse the different features obtained. Typical CNN architectures include LeNet, AlexNet and ResNet.
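To make the stacking of convolutional, pooling and fully-connected layers described above concrete, the following is a minimal PyTorch sketch; the layer sizes, channel counts and input resolution are illustrative assumptions and not part of the invention.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal conv -> pooling -> fully-connected stack (illustrative only)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # local connections, shared weights
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                              # max pooling halves the spatial size
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),                      # average pooling down to 1x1
        )
        self.classifier = nn.Linear(32, num_classes)      # fully-connected fusion layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

if __name__ == "__main__":
    logits = TinyCNN()(torch.randn(1, 3, 224, 224))
    print(logits.shape)  # torch.Size([1, 10])
```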
In this embodiment, step 102 is mainly used to extract the image features of the facial regions of interest. This embodiment inputs the image features extracted in the previous step into the attention module to extract the region-of-interest features and obtain the spatial position of each facial action on the face. Facial expressions are states of the muscles and features of the human face, such as smiling or anger. Facial expressions can be represented by facial actions; common facial actions include frowning, pouting and so on. The activity of the face is described in terms of different facial actions, each representing a basic facial movement or expression change, and combinations of facial actions can describe complex emotion changes. The attention mechanism imitates the attention of the human brain under limited computing capacity: only some key information of interest in the input is attended to and processed, which improves the efficiency of the neural network. The computation of the attention mechanism can be divided into two steps: first, compute the attention distribution over all input information, α_i = softmax(f_i W_att q), which represents the degree of correlation between the input vector f_i and the query vector q; second, compute the weighted sum of the input information according to the attention distribution, with the attention distribution as the weighting coefficients.
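The two-step attention computation just described can be sketched as follows; the tensor shapes and the helper name attention_pool are illustrative assumptions.

```python
import torch

def attention_pool(f: torch.Tensor, q: torch.Tensor, W_att: torch.Tensor):
    """Two-step attention: alpha_i = softmax(f_i W_att q), then a weighted sum of the inputs."""
    # f: (n, d) input vectors, W_att: (d, k) learned matrix, q: (k,) query vector
    scores = f @ W_att @ q                     # (n,) relevance of each f_i to the query
    alpha = torch.softmax(scores, dim=0)       # attention distribution over the inputs
    pooled = (alpha.unsqueeze(1) * f).sum(0)   # weighted sum with alpha as coefficients
    return alpha, pooled

# Example: 5 input vectors of dimension 8, query of dimension 4
alpha, pooled = attention_pool(torch.randn(5, 8), torch.randn(4), torch.randn(8, 4))
print(alpha.sum().item(), pooled.shape)  # ~1.0, torch.Size([8])
```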
In this embodiment, step 103 is mainly used to predict the facial actions by obtaining the associated features of different facial actions. Based on the spatial position and region-of-interest features of each facial action obtained in the previous step, this embodiment uses the Transformer self-attention mechanism to obtain the associated features of different facial actions and thereby predict the facial actions. The self-attention mechanism is used to "notice" the correlations between different parts of the overall input. Self-attention is implemented as follows: first, the input vectors undergo 3 different linear transformations to obtain K (key), V (value) and Q (query); the attention values are then computed from K, V and Q as

Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V,

where d is the input vector dimension.
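A minimal sketch of the scaled dot-product self-attention described above, with the three linear transformations producing Q, K and V; the single-head formulation and the dimensions used are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head self-attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V."""
    def __init__(self, d: int):
        super().__init__()
        # Three different linear transformations of the input
        self.q_proj = nn.Linear(d, d)
        self.k_proj = nn.Linear(d, d)
        self.v_proj = nn.Linear(d, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d)
        Q, K, V = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(x.size(-1))  # (batch, seq, seq)
        return torch.softmax(scores, dim=-1) @ V                  # (batch, seq, d)

out = SelfAttention(d=64)(torch.randn(2, 20, 64))
print(out.shape)  # torch.Size([2, 20, 64])
```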
In this embodiment, step 104 is mainly used for emotion prediction. The prior art generally predicts the facial actions first and then obtains the emotion change by simply combining them. Because an emotion change is the joint result of different facial actions on the face, and the facial actions are related to one another, the existing emotion prediction methods generally have difficulty meeting the accuracy requirements. Therefore, this embodiment predicts the combination of facial actions by fusing the associated features of different facial actions, and obtains the emotion change from the predicted combination. For example, if the predicted combination of facial actions is "inner eyebrow raise + outer eyebrow raise", the emotion is "surprise"; if the predicted combination is "inner eyebrow raise + frown", the emotion is "sadness".
As an alternative embodiment, feature extraction is performed on the input image using the residual network ResNet34; ResNet34 is composed of 5 convolution modules: the first convolution module consists of a 7 × 7 convolution and a 3 × 3 downsampling convolution, the second to fifth convolution modules are residual modules each composed of two 3 × 3 convolutions, and in a residual module the input x is passed through the two 3 × 3 convolutions, added to x, and then passed through a ReLU activation layer; the output of the fifth convolution module is the output of ResNet34.
This embodiment provides a technical solution for extracting image features. It uses the residual network ResNet34 as the base network to extract the basic features of the input image. ResNet was proposed to alleviate the vanishing of gradients as the network deepens: the input x of a block is passed through two 3 × 3 convolutions to give F(x), which is added to x and passed through a ReLU activation layer, so the output can be expressed as y = ReLU(x + F(x)). ResNet34 is composed of 5 convolution modules: the first consists of a 7 × 7 convolution and a 3 × 3 downsampling convolution, and the remaining 4 convolution modules, the second to the fifth, are ResNet residual modules.
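A minimal PyTorch sketch of the residual module y = ReLU(x + F(x)) and the first convolution module described above; batch normalization and the projection shortcut used when the shape changes are assumed implementation details, not stated in the text.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual module: two 3x3 convolutions, then y = ReLU(x + F(x))."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Projection shortcut when the shape changes (assumed detail, standard in ResNet)
        self.shortcut = (nn.Identity() if stride == 1 and in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.shortcut(x) + self.f(x))

# First convolution module described above: a 7x7 convolution followed by a 3x3 downsampling convolution
stem = nn.Sequential(
    nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, 3, stride=2, padding=1, bias=False),  # 3x3 downsampling
)

x = torch.randn(1, 3, 224, 224)
print(ResidualBlock(64, 64)(stem(x)).shape)  # torch.Size([1, 64, 56, 56])
```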
As an alternative embodiment, the facial actions include: raising the inner eyebrow corner, raising the outer eyebrow corner, frowning, raising the upper eyelid, lifting the cheek, tightening the eyelid, wrinkling the nose, raising the upper lip, pulling the mouth corner, tightening the mouth corner, pulling the mouth corner down, raising the lower lip, pouting, stretching the mouth corner, tightening the lips, pressing the lips, parting the lips, lowering the chin, opening the mouth wide, and closing the eyes.
This embodiment gives 20 kinds of facial movements. Among the 20 facial movements, there are the most common and simplest facial movements, such as frowning, puckering, closing eyes, opening mouth, etc.; there are also facial movements that are less common and difficult to make, such as upper eyelid lift, cheek lift, etc. The 20 facial movements basically comprise most of the facial movements for expressing emotion, and can meet the requirement of emotion prediction. It should be noted that this embodiment is only a preferred embodiment and does not negate or exclude other possible embodiments, such as other facial movements than the 20 facial movements described.
As an alternative embodiment, the method for obtaining the spatial position of the facial motion on the face and the facial region-of-interest features comprises:
inputting the N_T extracted image features T into the attention module; the features pass sequentially through 3 residual modules, N_A 1 × 1 convolutions, a ReLU activation layer and another N_A 1 × 1 convolutions, and finally a Sigmoid activation function gives the spatial position of each facial action on the face; the N_A attention feature maps obtained by the attention module are each multiplied by the N_T image features T and then added to the N_T image features T, yielding N_A groups of feature maps with N_T feature maps per group; N_A is equal to the number of facial actions.
This embodiment provides a technical solution for extracting the features of the facial regions of interest. During training, the attention mechanism automatically weights the different facial action regions, thereby extracting the region-of-interest features. Specifically, assume that step 101 extracts N_T image features T in total. These features first pass through 3 residual modules, then through N_A convolution kernels of size 1 × 1 (N_A equals the number of facial actions) and a ReLU activation layer, then through another N_A convolution kernels of size 1 × 1, and finally through a Sigmoid activation function, which gives the spatial position of each facial action on the face. The N_A attention maps obtained by the region-of-interest attention module are each multiplied by the N_T image features T, yielding N_A groups of feature maps with N_T maps per group; finally, the image features T are added to each group of feature maps as the final output of the region-of-interest attention module.
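A minimal sketch of the region-of-interest attention module described above (3 residual modules, N_A 1 × 1 convolutions, ReLU, another N_A 1 × 1 convolutions and a Sigmoid, followed by multiplying each attention map with T and adding T back); the residual modules are simplified to plain convolution blocks here, and the values of N_T and N_A are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ROIAttention(nn.Module):
    """N_A spatial attention maps applied to the N_T backbone feature maps T."""
    def __init__(self, n_t: int, n_a: int):
        super().__init__()
        # Stand-in for the 3 residual modules (see the ResidualBlock sketch above)
        self.res = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(n_t, n_t, 3, padding=1), nn.ReLU(inplace=True))
            for _ in range(3)
        ])
        self.att = nn.Sequential(
            nn.Conv2d(n_t, n_a, 1),   # N_A 1x1 convolutions
            nn.ReLU(inplace=True),
            nn.Conv2d(n_a, n_a, 1),   # another N_A 1x1 convolutions
            nn.Sigmoid(),             # spatial position of each facial action
        )

    def forward(self, T: torch.Tensor):
        # T: (batch, N_T, H, W) image features from the backbone
        A = self.att(self.res(T))                                # (batch, N_A, H, W) attention maps
        # Each attention map weights the N_T feature maps, then T is added back
        roi = A.unsqueeze(2) * T.unsqueeze(1) + T.unsqueeze(1)   # (batch, N_A, N_T, H, W)
        return A, roi

A, roi = ROIAttention(n_t=512, n_a=20)(torch.randn(1, 512, 7, 7))
print(A.shape, roi.shape)  # torch.Size([1, 20, 7, 7]) torch.Size([1, 20, 512, 7, 7])
```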
As an alternative embodiment, obtaining the associated features of different facial actions by using a Transformer self-attention mechanism and thereby predicting the facial actions specifically includes:
each of the N_A groups of feature maps is passed through a 1 × 1 convolution, batch normalization and ReLU, and each group outputs a one-dimensional vector of length N, giving N_A one-dimensional vectors in total; the N_A one-dimensional vectors are fused into a feature C of length N × N_A; C is input into a Transformer encoder consisting of a multi-head attention module and a feed-forward network, and the output of the multi-head attention module, after passing through two fully-connected layers, gives the predicted values of the facial actions.
This embodiment provides a technical solution for obtaining the associated features of different facial actions using a Transformer self-attention mechanism. Based on the region-of-interest features and positions of the facial actions, the Transformer self-attention mechanism is used to establish relations between different facial actions, and the predicted values of the facial actions are obtained by learning the facial action features and the relations between different facial actions. The processing flow is shown in Fig. 4: the N_A groups of feature maps each pass through a 1 × 1 convolution + batch normalization (BatchNorm) + ReLU block, and each group outputs a one-dimensional vector, giving N_A one-dimensional vectors in total. The N_A one-dimensional vectors are fused (assuming each one-dimensional vector has length N, the fused vector has length N × N_A; fusion here means concatenating the vectors in order) to obtain the feature C, which is input into a Transformer encoder. The Transformer encoder mainly consists of a multi-head attention mechanism (Multi-Head Attention) and a feed-forward network (Feed-Forward Network) and works as follows: the one-dimensional feature C passes through 3 fully-connected layers and is split into three parts, V, K and Q, which subsequently realize the self-attention mechanism through multiplicative computation. Each of these three parts has N heads, and the single-head outputs are finally concatenated to form the output. Multi-head attention is equivalent to replicating several single heads; because of different initialization, the weight coefficients of each single head are different. The output of the network, after passing through two fully-connected layers, gives the predicted values of the facial action units.
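A minimal sketch of the facial action prediction branch described above; it uses PyTorch's built-in nn.TransformerEncoder in place of the hand-written V/K/Q split, treats the fused feature C as a sequence of N_A tokens of dimension N, and reduces each group of feature maps to a length-N vector with a 1 × 1 convolution, batch normalization, ReLU and global average pooling; the pooling step and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class AUPredictor(nn.Module):
    """Facial action prediction from the N_A groups of region-of-interest feature maps."""
    def __init__(self, n_t: int, n_a: int, n: int = 128, heads: int = 4):
        super().__init__()
        # 1x1 conv + BatchNorm + ReLU per group, then global average pooling to a length-n vector
        self.reduce = nn.Sequential(
            nn.Conv2d(n_t, n, 1), nn.BatchNorm2d(n), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        layer = nn.TransformerEncoderLayer(d_model=n, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)  # multi-head attention + feed-forward
        self.head = nn.Sequential(nn.Linear(n, n), nn.ReLU(inplace=True), nn.Linear(n, 1))  # two FC layers

    def forward(self, roi: torch.Tensor):
        b, n_a, n_t, h, w = roi.shape
        v = self.reduce(roi.reshape(b * n_a, n_t, h, w)).reshape(b, n_a, -1)  # one token per facial action
        assoc = self.encoder(v)                    # associated features across facial actions
        au_logits = self.head(assoc).squeeze(-1)   # (b, N_A): one predicted value per facial action
        return assoc, au_logits

assoc, au = AUPredictor(n_t=512, n_a=20)(torch.randn(1, 20, 512, 7, 7))
print(assoc.shape, au.shape)  # torch.Size([1, 20, 128]) torch.Size([1, 20])
```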
As an alternative embodiment, the method for recognizing emotion change includes:
inputting the associated features output by the Transformer self-attention mechanism into a feature fusion module consisting of two fully-connected layers, outputting a plurality of feature layers, and predicting combinations of facial actions through a Softmax function, each combination corresponding to one emotion change;
The emotion changes corresponding to the combinations of facial actions are: inner eyebrow raise + outer eyebrow raise + frown: fear; inner eyebrow raise + outer eyebrow raise: surprise; inner eyebrow raise + frown: sadness; frown + upper eyelid raise: anger; cheek raise + mouth corner pull: happiness; upper lip raise + lips parting: disgust; pouting + lip tightening: confusion.
This embodiment provides a technical solution for predicting emotion changes. The features output by the Transformer self-attention mechanism are passed through two fully-connected layers that further learn deep features and directly predict the combination of facial actions. This also imposes a certain constraint on the training of the facial actions during model training, making the facial action predictions more accurate. The two fully-connected layers finally output a plurality of feature layers, and the combinations of facial actions are predicted through Softmax activation. Each combination of facial actions corresponds to one emotion change; Table 1 gives the emotion changes for 7 combinations.
TABLE 1 Emotion changes represented by facial action combinations
Inner eyebrow raise + outer eyebrow raise + frown: fear
Inner eyebrow raise + outer eyebrow raise: surprise
Inner eyebrow raise + frown: sadness
Frown + upper eyelid raise: anger
Cheek raise + mouth corner pull: happiness
Upper lip raise + lips parting: disgust
Pouting + lip tightening: confusion
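A minimal sketch of the emotion-change head described above: the associated features are fused by two fully-connected layers and a Softmax predicts which of the Table 1 facial-action combinations is present; the averaging over facial actions and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Facial-action combinations and the emotion change each one represents (Table 1)
EMOTIONS = ["fear", "surprise", "sadness", "anger", "happiness", "disgust", "confusion"]

class EmotionHead(nn.Module):
    """Predict the facial-action combination (and hence the emotion change) from the associated features."""
    def __init__(self, n: int = 128, n_combos: int = len(EMOTIONS)):
        super().__init__()
        self.fuse = nn.Sequential(            # feature fusion module: two fully-connected layers
            nn.Linear(n, n), nn.ReLU(inplace=True),
            nn.Linear(n, n_combos),
        )

    def forward(self, assoc: torch.Tensor) -> torch.Tensor:
        # assoc: (batch, N_A, n) associated features from the Transformer encoder;
        # averaged over facial actions here (an assumed fusion step), then Softmax over combinations
        return torch.softmax(self.fuse(assoc.mean(dim=1)), dim=-1)

probs = EmotionHead()(torch.randn(1, 20, 128))
print(EMOTIONS[probs.argmax(dim=-1).item()], probs.shape)  # e.g. "surprise", torch.Size([1, 7])
```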
Fig. 5 is a schematic composition diagram of an emotion recognition device based on facial action combination detection according to an embodiment of the present invention, where the device includes:
the image feature extraction module 11 is configured to acquire a face image and perform feature extraction on the image;
the region-of-interest extraction module 12 is configured to input the extracted image features into the attention module, and obtain a spatial position of each facial action on the face and facial region-of-interest features;
the facial action prediction module 13 is configured to obtain the associated features of different facial actions by using a Transformer self-attention mechanism based on the spatial positions and the region-of-interest features, and further predict the facial actions;
and an emotion change prediction module 14, configured to predict a combination of facial actions by fusing the correlation features, so as to obtain an emotion change.
The apparatus of this embodiment may be used to implement the technical solution of the method embodiment shown in fig. 1, and the implementation principle and the technical effect are similar, which are not described herein again. The same applies to the following embodiments, which are not further described.
As an alternative embodiment, the residual network ResNet34 is used to perform feature extraction on the input image; ResNet34 is composed of 5 convolution modules: the first convolution module consists of a 7 × 7 convolution and a 3 × 3 downsampling convolution, the second to fifth convolution modules are residual modules each composed of two 3 × 3 convolutions, and in a residual module the input x is passed through the two 3 × 3 convolutions, added to x, and then passed through a ReLU activation layer; the output of the fifth convolution module is the output of ResNet34.
As an alternative embodiment, the facial actions include: raising the inner eyebrow corner, raising the outer eyebrow corner, frowning, raising the upper eyelid, lifting the cheek, tightening the eyelid, wrinkling the nose, raising the upper lip, pulling the mouth corner, tightening the mouth corner, pulling the mouth corner down, raising the lower lip, pouting, stretching the mouth corner, tightening the lips, pressing the lips, parting the lips, lowering the chin, opening the mouth wide, and closing the eyes.
As an alternative embodiment, the method for obtaining the spatial position of the facial motion on the face and the facial region-of-interest features comprises:
inputting the N_T extracted image features T into the attention module; the features pass sequentially through 3 residual modules, N_A 1 × 1 convolutions, a ReLU activation layer and another N_A 1 × 1 convolutions, and finally a Sigmoid activation function gives the spatial position of each facial action on the face; the N_A attention feature maps obtained by the attention module are each multiplied by the N_T image features T and then added to the N_T image features T, yielding N_A groups of feature maps with N_T feature maps per group; N_A is equal to the number of facial actions.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for recognizing emotion based on facial action combination detection is characterized by comprising the following steps:
acquiring a face image, and performing feature extraction on the image;
inputting the extracted image features into an attention module to obtain the spatial position of each facial action on the face and the feature of the face region of interest;
based on the spatial positions and the region-of-interest features, obtaining the associated features of different facial actions by using a Transformer self-attention mechanism, and further predicting the facial actions;
and predicting the combination of facial actions by fusing the associated features, thereby obtaining the emotion change.
2. The emotion recognition method based on facial action combination detection according to claim 1, wherein feature extraction is performed on the input image using the residual network ResNet34; ResNet34 is composed of 5 convolution modules: the first convolution module consists of a 7 × 7 convolution and a 3 × 3 downsampling convolution, the second to fifth convolution modules are residual modules each composed of two 3 × 3 convolutions, and in a residual module the input x is passed through the two 3 × 3 convolutions, added to x, and then passed through a ReLU activation layer; the output of the fifth convolution module is the output of ResNet34.
3. The emotion recognition method based on facial action combination detection as claimed in claim 2, wherein the facial actions include: raising the inner eyebrow corner, raising the outer eyebrow corner, frowning, raising the upper eyelid, lifting the cheek, tightening the eyelid, wrinkling the nose, raising the upper lip, pulling the mouth corner, tightening the mouth corner, pulling the mouth corner down, raising the lower lip, pouting, stretching the mouth corner, tightening the lips, pressing the lips, parting the lips, lowering the chin, opening the mouth wide, and closing the eyes.
4. The emotion recognition method based on facial motion combination detection as claimed in claim 3, wherein the method for obtaining the spatial position of the facial motion on the face and the facial region-of-interest features comprises:
inputting the N_T extracted image features T into the attention module; the features pass sequentially through 3 residual modules, N_A 1 × 1 convolutions, a ReLU activation layer and another N_A 1 × 1 convolutions, and finally a Sigmoid activation function gives the spatial position of each facial action on the face; the N_A attention feature maps obtained by the attention module are each multiplied by the N_T image features T and then added to the N_T image features T, yielding N_A groups of feature maps with N_T feature maps per group; N_A is equal to the number of facial actions.
5. The emotion recognition method based on facial action combination detection as claimed in claim 4, wherein obtaining the associated features of different facial actions by using a Transformer self-attention mechanism to predict the facial actions specifically comprises:
each of the N_A groups of feature maps is passed through a 1 × 1 convolution, batch normalization and ReLU, and each group outputs a one-dimensional vector of length N, giving N_A one-dimensional vectors in total; the N_A one-dimensional vectors are fused into a feature C of length N × N_A; C is input into a Transformer encoder consisting of a multi-head attention module and a feed-forward network, and the output of the multi-head attention module, after passing through two fully-connected layers, gives the predicted values of the facial actions.
6. The method of recognizing emotion based on facial motion combination detection as set forth in claim 5, wherein the method of recognizing emotion change includes:
inputting the associated features output by the Transformer self-attention mechanism into a feature fusion module consisting of two fully-connected layers, outputting a plurality of feature layers, and predicting combinations of facial actions through a Softmax function, each combination corresponding to one emotion change;
The emotion changes corresponding to the combinations of facial actions are: inner eyebrow raise + outer eyebrow raise + frown: fear; inner eyebrow raise + outer eyebrow raise: surprise; inner eyebrow raise + frown: sadness; frown + upper eyelid raise: anger; cheek raise + mouth corner pull: happiness; upper lip raise + lips parting: disgust; pouting + lip tightening: confusion.
7. An emotion recognition apparatus based on facial motion combination detection, characterized by comprising:
the image feature extraction module is used for acquiring a face image and extracting features of the image;
the interesting region extracting module is used for inputting the extracted image characteristics into the attention module to obtain the spatial position of each facial action on the face and the interesting region characteristics of the face;
the facial action prediction module is used for obtaining the associated features of different facial actions by using a Transformer self-attention mechanism based on the spatial positions and the region-of-interest features, and further predicting the facial actions;
and the emotion change prediction module is used for predicting the combination of facial actions by fusing the associated features so as to obtain emotion changes.
8. The emotion recognition device based on facial action combination detection according to claim 7, wherein feature extraction is performed on the input image using the residual network ResNet34; ResNet34 is composed of 5 convolution modules: the first convolution module consists of a 7 × 7 convolution and a 3 × 3 downsampling convolution, the second to fifth convolution modules are residual modules each composed of two 3 × 3 convolutions, and in a residual module the input x is passed through the two 3 × 3 convolutions, added to x, and then passed through a ReLU activation layer; the output of the fifth convolution module is the output of ResNet34.
9. The emotion recognition device based on facial action combination detection of claim 8, wherein the facial actions include: raising the inner eyebrow corner, raising the outer eyebrow corner, frowning, raising the upper eyelid, lifting the cheek, tightening the eyelid, wrinkling the nose, raising the upper lip, pulling the mouth corner, tightening the mouth corner, pulling the mouth corner down, raising the lower lip, pouting, stretching the mouth corner, tightening the lips, pressing the lips, parting the lips, lowering the chin, opening the mouth wide, and closing the eyes.
10. The emotion recognition device based on facial motion combination detection as claimed in claim 9, wherein the method for obtaining the spatial position of the facial motion on the face and the facial region-of-interest features comprises:
inputting the N_T extracted image features T into the attention module; the features pass sequentially through 3 residual modules, N_A 1 × 1 convolutions, a ReLU activation layer and another N_A 1 × 1 convolutions, and finally a Sigmoid activation function gives the spatial position of each facial action on the face; the N_A attention feature maps obtained by the attention module are each multiplied by the N_T image features T and then added to the N_T image features T, yielding N_A groups of feature maps with N_T feature maps per group; N_A is equal to the number of facial actions.
CN202210105954.3A 2022-01-28 2022-01-28 Emotion recognition method and device based on facial action combination detection Pending CN114581971A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210105954.3A CN114581971A (en) 2022-01-28 2022-01-28 Emotion recognition method and device based on facial action combination detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210105954.3A CN114581971A (en) 2022-01-28 2022-01-28 Emotion recognition method and device based on facial action combination detection

Publications (1)

Publication Number Publication Date
CN114581971A (en) 2022-06-03

Family

ID=81768918

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210105954.3A Pending CN114581971A (en) 2022-01-28 2022-01-28 Emotion recognition method and device based on facial action combination detection

Country Status (1)

Country Link
CN (1) CN114581971A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210081673A1 (en) * 2019-09-12 2021-03-18 Nec Laboratories America, Inc Action recognition with high-order interaction through spatial-temporal object tracking
CN111563417A (en) * 2020-04-13 2020-08-21 华南理工大学 Pyramid structure convolutional neural network-based facial expression recognition method
CN112329683A (en) * 2020-11-16 2021-02-05 常州大学 Attention mechanism fusion-based multi-channel convolutional neural network facial expression recognition method
CN112418095A (en) * 2020-11-24 2021-02-26 华中师范大学 Facial expression recognition method and system combined with attention mechanism
CN113887487A (en) * 2021-10-20 2022-01-04 河海大学 Facial expression recognition method and device based on CNN-Transformer

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WENYU SONG, SHUZE SHI, GAOYUN AN: "Facial Action Unit Detection Based on Transformer and Attention Mechanism", ICIG 2021: Image and Graphics *
ZHIWEN SHAO, ZHILEI LIU, JIANFEI CAI, LIZHUANG MA: "Deep Adaptive Attention for Joint Facial Action Unit Detection and Face Alignment", European Conference on Computer Vision 2018 *

Similar Documents

Publication Publication Date Title
CN108932500B (en) A kind of dynamic gesture identification method and system based on deep neural network
JP6788264B2 (en) Facial expression recognition method, facial expression recognition device, computer program and advertisement management system
WO2022111236A1 (en) Facial expression recognition method and system combined with attention mechanism
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
CN109635727A (en) A kind of facial expression recognizing method and device
KR101893554B1 (en) Method and apparatus of recognizing facial expression base on multi-modal
Rázuri et al. Automatic emotion recognition through facial expression analysis in merged images based on an artificial neural network
CN108803874A (en) A kind of human-computer behavior exchange method based on machine vision
Owayjan et al. Face detection with expression recognition using artificial neural networks
CN111476178A (en) Micro-expression recognition method based on 2D-3D CNN
CN111028319A (en) Three-dimensional non-photorealistic expression generation method based on facial motion unit
Xu et al. Face expression recognition based on convolutional neural network
Podder et al. Time efficient real time facial expression recognition with CNN and transfer learning
CN114550270A (en) Micro-expression identification method based on double-attention machine system
CN111950373B (en) Method for micro expression recognition based on transfer learning of optical flow input
CN110598719A (en) Method for automatically generating face image according to visual attribute description
CN112800979B (en) Dynamic expression recognition method and system based on characterization flow embedded network
Xie et al. Convolutional neural networks for facial expression recognition with few training samples
CN113159002A (en) Facial expression recognition method based on self-attention weight auxiliary module
Singh et al. Face emotion identification by fusing neural network and texture features: facial expression
CN114581971A (en) Emotion recognition method and device based on facial action combination detection
Li et al. Multimodal information-based broad and deep learning model for emotion understanding
CN115131876A (en) Emotion recognition method and system based on human body movement gait and posture
Dembani et al. UNSUPERVISED FACIAL EXPRESSION DETECTION USING GENETIC ALGORITHM.
Yao et al. Facial expression recognition method based on convolutional neural network and data enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220603