CN114581971A - Emotion recognition method and device based on facial action combination detection - Google Patents

Emotion recognition method and device based on facial action combination detection

Info

Publication number
CN114581971A
CN114581971A
Authority
CN
China
Prior art keywords
facial
convolution
module
emotion
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210105954.3A
Other languages
Chinese (zh)
Inventor
王淑欣
刘小青
俞益洲
李一鸣
乔昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shenrui Bolian Technology Co Ltd
Shenzhen Deepwise Bolian Technology Co Ltd
Original Assignee
Beijing Shenrui Bolian Technology Co Ltd
Shenzhen Deepwise Bolian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenrui Bolian Technology Co Ltd, Shenzhen Deepwise Bolian Technology Co Ltd filed Critical Beijing Shenrui Bolian Technology Co Ltd
Priority to CN202210105954.3A priority Critical patent/CN114581971A/en
Publication of CN114581971A publication Critical patent/CN114581971A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides an emotion recognition method and device based on facial action combination detection. The method comprises the following steps: acquiring a face image and performing feature extraction on the image; inputting the extracted image features into an attention module to obtain the spatial position of each facial action on the face and the facial region-of-interest features; based on the spatial positions and the region-of-interest features, obtaining the associated features of different facial actions by using a Transformer self-attention mechanism, and further predicting the facial actions; and predicting the combination of facial actions by fusing the associated features, thereby obtaining the emotion change. By fusing the associated features to predict the combination of facial actions, the invention obtains the emotion change and improves emotion recognition accuracy, solving the problem of the prior art, which cannot attend to the associations between different facial actions and predicts emotion changes only from simple combinations of facial actions, resulting in low emotion recognition accuracy.

Description

Emotion recognition method and device based on facial action combination detection
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to an emotion recognition method and device based on facial action combination detection.
Background
Facial expressions are the primary means of conveying non-verbal information. Psychological studies have shown that the emotional information conveyed by facial micro-expressions can account for up to 55% of the total information in everyday human communication, so learning to read facial expressions is one of the important ways to coordinate interpersonal relationships. For the blind, building a device that can recognize facial micro-expressions and perceive the emotions of others is important: it can help them sense the emotional states of other people and improve the effectiveness of communication.
Facial expressions are states of the muscles and features of the human face, such as smiling or anger. Facial expressions can be represented by facial actions: each facial action describes a basic facial movement or expression change, and combinations of facial actions can describe complex emotion changes. With the development of artificial intelligence technology, it has become possible to recognize facial expressions with deep learning methods and thus obtain people's current emotion. Fig. 2 shows a network for detecting facial action units based on deep learning, called EAC-Net. Its principle is as follows: on top of a VGG network, two modules are added, an enhancing (attention) learning module and a cropping learning module. The enhancing learning module adds an attention mechanism so that, during training, the network assigns higher learning weights to the facial action units of interest; the cropping learning module crops the facial regions around the detected targets and designs convolutional layers to learn deeper features for each facial region. Finally, a facial action value is predicted for each detected facial region, and the emotion change is obtained by simple combination. The problems are: although an attention mechanism is used to learn the regions of interest on the face, each region is learned independently and the associations between different facial actions are not considered; and the emotion change conclusion is obtained by simply combining facial actions, so if a facial action is predicted incorrectly, the emotion conclusion is directly wrong, and the robustness is poor.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides a method and an apparatus for emotion recognition based on facial motion combination detection.
In order to achieve the above object, the present invention adopts the following technical solutions.
In a first aspect, the present invention provides a method for emotion recognition based on facial motion combination detection, comprising the steps of:
acquiring a face image, and performing feature extraction on the image;
inputting the extracted image features into an attention module to obtain the spatial position of each facial action on the face and the feature of the face region of interest;
based on the spatial positions and the region-of-interest features, obtaining the associated features of different facial actions by using a Transformer self-attention mechanism, and further predicting the facial actions;
and predicting the combination of facial actions by fusing the associated features, thereby obtaining the emotion change.
Further, feature extraction is performed on the input image by using the residual network ResNet34. ResNet34 is composed of 5 convolution modules: the first convolution module consists of a 7 × 7 convolution and a 3 × 3 downsampling convolution, and the second to fifth convolution modules are residual modules each composed of two 3 × 3 convolutions; in a residual module, the input x is passed through the two 3 × 3 convolutions, the result is added to x, and the sum is passed through a ReLU activation layer; the output of the fifth convolution module is the output of ResNet34.
Further, the facial actions include: raising the inner eyebrow corner, raising the outer eyebrow corner, frowning, raising the upper eyelid, lifting the cheek, tightening the eyelid, wrinkling the nose, raising the upper lip, pulling the mouth corner, tightening the mouth corner, pulling the mouth corner down, raising the lower lip, pouting, stretching the mouth corner, tightening the lips, pressing the lips, parting the lips, lowering the chin, opening the mouth wide, and closing the eyes.
Furthermore, the method for obtaining the spatial position of the facial motion on the face and the facial region-of-interest features comprises the following steps:
inputting the N_T extracted image features T into the attention module; the features pass sequentially through 3 residual modules, N_A 1 × 1 convolutions, a ReLU activation layer and another N_A 1 × 1 convolutions, and finally a Sigmoid activation function gives the spatial position of each facial action on the face; the N_A attention feature maps obtained by the attention module are each multiplied by the N_T image features T and then added to the N_T image features T, yielding N_A groups of feature maps with N_T feature maps per group; N_A is equal to the number of facial actions.
Further, obtaining the associated features of different facial actions by using a Transformer self-attention mechanism and thereby predicting the facial actions specifically includes:
each of the N_A groups of feature maps is passed through a 1 × 1 convolution, batch normalization and ReLU, and each group outputs a one-dimensional vector of length N, giving N_A one-dimensional vectors in total; the N_A one-dimensional vectors are fused into a feature C of length N × N_A; C is input into a Transformer encoder consisting of a multi-head attention module and a feed-forward network, and the output of the multi-head attention module, after passing through two fully-connected layers, gives the predicted values of the facial actions.
Still further, the method of recognizing a change in emotion includes:
inputting the associated features output by the Transformer self-attention mechanism into a feature fusion module consisting of two fully-connected layers, outputting a plurality of feature layers, and predicting combinations of facial actions through a Softmax function, each combination corresponding to one emotion change;
The emotion changes corresponding to the combinations of facial actions are: inner eyebrow raise + outer eyebrow raise + frown: fear; inner eyebrow raise + outer eyebrow raise: surprise; inner eyebrow raise + frown: sadness; frown + upper eyelid raise: anger; cheek raise + mouth corner pull: happiness; upper lip raise + lips parting: disgust; pouting + lip tightening: confusion.
In a second aspect, the present invention provides an emotion recognition apparatus based on facial motion combination detection, including:
the image feature extraction module is used for acquiring a face image and extracting features of the image;
the region-of-interest extraction module is used for inputting the extracted image features into the attention module to obtain the spatial position of each facial action on the face and the facial region-of-interest features;
the facial action prediction module is used for obtaining the associated features of different facial actions by using a Transformer self-attention mechanism based on the spatial positions and the region-of-interest features, and further predicting the facial actions;
and the emotion change prediction module is used for predicting the combination of facial actions by fusing the associated features so as to obtain emotion changes.
Further, the residual network ResNet34 is used to perform feature extraction on the input image. ResNet34 is composed of 5 convolution modules: the first convolution module consists of a 7 × 7 convolution and a 3 × 3 downsampling convolution, and the second to fifth convolution modules are residual modules each composed of two 3 × 3 convolutions; in a residual module, the input x is passed through the two 3 × 3 convolutions, the result is added to x, and the sum is passed through a ReLU activation layer; the output of the fifth convolution module is the output of ResNet34.
Further, the facial actions include: raising the inner eyebrow corner, raising the outer eyebrow corner, frowning, raising the upper eyelid, lifting the cheek, tightening the eyelid, wrinkling the nose, raising the upper lip, pulling the mouth corner, tightening the mouth corner, pulling the mouth corner down, raising the lower lip, pouting, stretching the mouth corner, tightening the lips, pressing the lips, parting the lips, lowering the chin, opening the mouth wide, and closing the eyes.
Further, the method for obtaining the spatial position of the facial motion on the face and the facial region-of-interest features comprises the following steps:
inputting the N_T extracted image features T into the attention module; the features pass sequentially through 3 residual modules, N_A 1 × 1 convolutions, a ReLU activation layer and another N_A 1 × 1 convolutions, and finally a Sigmoid activation function gives the spatial position of each facial action on the face; the N_A attention feature maps obtained by the attention module are each multiplied by the N_T image features T and then added to the N_T image features T, yielding N_A groups of feature maps with N_T feature maps per group; N_A is equal to the number of facial actions.
Compared with the prior art, the invention has the following beneficial effects.
According to the emotion recognition method and device, feature extraction is performed on the facial image, the extracted image features are input into the attention module to obtain the spatial position of each facial action on the face and the facial region-of-interest features, the Transformer self-attention mechanism is used to obtain the associated features of different facial actions, and the associated features are fused to predict the combination of facial actions and thus the emotion change. This improves emotion recognition accuracy and solves the problem of the prior art, which cannot attend to the associations between different facial actions and predicts emotion changes only from simple combinations of facial actions, resulting in low emotion recognition accuracy.
Drawings
Fig. 1 is a flowchart of an emotion recognition method based on facial motion combination detection according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a conventional facial motion detection network.
Fig. 3 is a schematic general flow chart of emotion recognition according to an embodiment of the present invention.
Fig. 4 is a schematic flow chart of obtaining the associated features of facial actions by using the Transformer self-attention mechanism.
Fig. 5 is a block diagram of an emotion recognition apparatus based on facial motion combination detection according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer and more obvious, the present invention is further described below with reference to the accompanying drawings and the detailed description. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of an emotion recognition method based on facial motion combination detection according to an embodiment of the present invention, including the following steps:
Step 101, acquiring a face image and performing feature extraction on the image;
Step 102, inputting the extracted image features into an attention module to obtain the spatial position of each facial action on the face and the facial region-of-interest features;
Step 103, obtaining the associated features of different facial actions by using a Transformer self-attention mechanism based on the spatial positions and the region-of-interest features, and further predicting the facial actions;
Step 104, predicting the combination of facial actions by fusing the associated features, thereby obtaining the emotion change.
In this embodiment, step 101 is mainly used to perform facial image feature extraction. This embodiment performs emotion recognition based on the face image, and therefore takes the face image as input. A convolutional neural network (CNN) is generally used for image feature extraction. In the visual nervous system, the receptive field of a neuron is a specific region on the retina; only stimulation within this region can activate the neuron. CNNs were proposed on the basis of this biological receptive-field mechanism. A CNN is a feed-forward neural network, but unlike a general fully-connected feed-forward network, its convolutional layers have the properties of local connectivity and weight sharing, which greatly reduces the number of weight parameters, thereby reducing model complexity and increasing computation speed. A typical CNN is formed by stacking convolutional layers, pooling (downsampling) layers and fully-connected layers. The convolutional layers extract features of local regions by convolving convolution kernels with the input image; different convolution kernels correspond to different feature extractors. The pooling layers perform feature selection and reduce the number of features, further reducing the number of parameters; max pooling and average pooling are commonly used. The fully-connected layers fuse the different features obtained. Typical CNN architectures include LeNet, AlexNet and ResNet.
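To make the stacking of convolutional, pooling and fully-connected layers described above concrete, the following is a minimal PyTorch sketch; the layer sizes, channel counts and input resolution are illustrative assumptions and not part of the invention.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal conv -> pooling -> fully-connected stack (illustrative only)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # local connections, shared weights
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                              # max pooling halves the spatial size
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),                      # average pooling down to 1x1
        )
        self.classifier = nn.Linear(32, num_classes)      # fully-connected fusion layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

if __name__ == "__main__":
    logits = TinyCNN()(torch.randn(1, 3, 224, 224))
    print(logits.shape)  # torch.Size([1, 10])
```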
In this embodiment, step 102 is mainly used to extract the image features of the facial regions of interest. This embodiment inputs the image features extracted in the previous step into the attention module to extract the region-of-interest features and obtain the spatial position of each facial action on the face. Facial expressions are states of the muscles and features of the human face, such as smiling or anger. Facial expressions can be represented by facial actions; common facial actions include frowning, pouting and so on. The activity of the face is described in terms of different facial actions, each representing a basic facial movement or expression change, and combinations of facial actions can describe complex emotion changes. The attention mechanism imitates the attention of the human brain under limited computing capacity: only some key information of interest in the input is attended to and processed, which improves the efficiency of the neural network. The computation of the attention mechanism can be divided into two steps: first, compute the attention distribution over all input information, α_i = softmax(f_i W_att q), which represents the degree of correlation between the input vector f_i and the query vector q; second, compute the weighted sum of the input information according to the attention distribution, with the attention distribution as the weighting coefficients.
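The two-step attention computation just described can be sketched as follows; the tensor shapes and the helper name attention_pool are illustrative assumptions.

```python
import torch

def attention_pool(f: torch.Tensor, q: torch.Tensor, W_att: torch.Tensor):
    """Two-step attention: alpha_i = softmax(f_i W_att q), then a weighted sum of the inputs."""
    # f: (n, d) input vectors, W_att: (d, k) learned matrix, q: (k,) query vector
    scores = f @ W_att @ q                     # (n,) relevance of each f_i to the query
    alpha = torch.softmax(scores, dim=0)       # attention distribution over the inputs
    pooled = (alpha.unsqueeze(1) * f).sum(0)   # weighted sum with alpha as coefficients
    return alpha, pooled

# Example: 5 input vectors of dimension 8, query of dimension 4
alpha, pooled = attention_pool(torch.randn(5, 8), torch.randn(4), torch.randn(8, 4))
print(alpha.sum().item(), pooled.shape)  # ~1.0, torch.Size([8])
```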
In this embodiment, step 103 is mainly used to predict the facial actions by obtaining the associated features of different facial actions. Based on the spatial position and region-of-interest features of each facial action obtained in the previous step, this embodiment uses the Transformer self-attention mechanism to obtain the associated features of different facial actions and thereby predict the facial actions. The self-attention mechanism is used to "notice" the correlations between different parts of the overall input. Self-attention is implemented as follows: first, the input vectors undergo 3 different linear transformations to obtain K (key), V (value) and Q (query); the attention values are then computed from K, V and Q as

Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V,

where d is the input vector dimension.
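A minimal sketch of the scaled dot-product self-attention described above, with the three linear transformations producing Q, K and V; the single-head formulation and the dimensions used are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head self-attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V."""
    def __init__(self, d: int):
        super().__init__()
        # Three different linear transformations of the input
        self.q_proj = nn.Linear(d, d)
        self.k_proj = nn.Linear(d, d)
        self.v_proj = nn.Linear(d, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d)
        Q, K, V = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(x.size(-1))  # (batch, seq, seq)
        return torch.softmax(scores, dim=-1) @ V                  # (batch, seq, d)

out = SelfAttention(d=64)(torch.randn(2, 20, 64))
print(out.shape)  # torch.Size([2, 20, 64])
```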
In this embodiment, step 104 is mainly used for emotion prediction. The prior art generally predicts the facial actions first and then obtains the emotion change by simply combining them. Because an emotion change is the joint result of different facial actions on the face, and the facial actions are related to one another, the existing emotion prediction methods generally have difficulty meeting the accuracy requirements. Therefore, this embodiment predicts the combination of facial actions by fusing the associated features of different facial actions, and obtains the emotion change from the predicted combination. For example, if the predicted combination of facial actions is "inner eyebrow raise + outer eyebrow raise", the emotion is "surprise"; if the predicted combination is "inner eyebrow raise + frown", the emotion is "sadness".
As an alternative embodiment, feature extraction is performed on the input image using the residual network ResNet34; ResNet34 is composed of 5 convolution modules: the first convolution module consists of a 7 × 7 convolution and a 3 × 3 downsampling convolution, the second to fifth convolution modules are residual modules each composed of two 3 × 3 convolutions, and in a residual module the input x is passed through the two 3 × 3 convolutions, added to x, and then passed through a ReLU activation layer; the output of the fifth convolution module is the output of ResNet34.
This embodiment provides a technical solution for extracting image features. It uses the residual network ResNet34 as the base network to extract the basic features of the input image. ResNet was proposed to alleviate the vanishing of gradients as the network deepens: the input x of a block is passed through two 3 × 3 convolutions to give F(x), which is added to x and passed through a ReLU activation layer, so the output can be expressed as y = ReLU(x + F(x)). ResNet34 is composed of 5 convolution modules: the first consists of a 7 × 7 convolution and a 3 × 3 downsampling convolution, and the remaining 4 convolution modules, the second to the fifth, are ResNet residual modules.
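A minimal PyTorch sketch of the residual module y = ReLU(x + F(x)) and the first convolution module described above; batch normalization and the projection shortcut used when the shape changes are assumed implementation details, not stated in the text.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual module: two 3x3 convolutions, then y = ReLU(x + F(x))."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Projection shortcut when the shape changes (assumed detail, standard in ResNet)
        self.shortcut = (nn.Identity() if stride == 1 and in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.shortcut(x) + self.f(x))

# First convolution module described above: a 7x7 convolution followed by a 3x3 downsampling convolution
stem = nn.Sequential(
    nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, 3, stride=2, padding=1, bias=False),  # 3x3 downsampling
)

x = torch.randn(1, 3, 224, 224)
print(ResidualBlock(64, 64)(stem(x)).shape)  # torch.Size([1, 64, 56, 56])
```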
As an alternative embodiment, the facial actions include: raising the inner eyebrow corner, raising the outer eyebrow corner, frowning, raising the upper eyelid, lifting the cheek, tightening the eyelid, wrinkling the nose, raising the upper lip, pulling the mouth corner, tightening the mouth corner, pulling the mouth corner down, raising the lower lip, pouting, stretching the mouth corner, tightening the lips, pressing the lips, parting the lips, lowering the chin, opening the mouth wide, and closing the eyes.
This embodiment gives 20 kinds of facial movements. Among the 20 facial movements, there are the most common and simplest facial movements, such as frowning, puckering, closing eyes, opening mouth, etc.; there are also facial movements that are less common and difficult to make, such as upper eyelid lift, cheek lift, etc. The 20 facial movements basically comprise most of the facial movements for expressing emotion, and can meet the requirement of emotion prediction. It should be noted that this embodiment is only a preferred embodiment and does not negate or exclude other possible embodiments, such as other facial movements than the 20 facial movements described.
As an alternative embodiment, the method for obtaining the spatial position of the facial motion on the face and the facial region-of-interest features comprises:
inputting the N_T extracted image features T into the attention module; the features pass sequentially through 3 residual modules, N_A 1 × 1 convolutions, a ReLU activation layer and another N_A 1 × 1 convolutions, and finally a Sigmoid activation function gives the spatial position of each facial action on the face; the N_A attention feature maps obtained by the attention module are each multiplied by the N_T image features T and then added to the N_T image features T, yielding N_A groups of feature maps with N_T feature maps per group; N_A is equal to the number of facial actions.
This embodiment provides a technical solution for extracting the features of the facial regions of interest. During training, the attention mechanism automatically weights the different facial action regions, thereby extracting the region-of-interest features. Specifically, assume that step 101 extracts N_T image features T in total. These features first pass through 3 residual modules, then through N_A convolution kernels of size 1 × 1 (N_A equals the number of facial actions) and a ReLU activation layer, then through another N_A convolution kernels of size 1 × 1, and finally through a Sigmoid activation function, which gives the spatial position of each facial action on the face. The N_A attention maps obtained by the region-of-interest attention module are each multiplied by the N_T image features T, yielding N_A groups of feature maps with N_T maps per group; finally, the image features T are added to each group of feature maps as the final output of the region-of-interest attention module.
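A minimal sketch of the region-of-interest attention module described above (3 residual modules, N_A 1 × 1 convolutions, ReLU, another N_A 1 × 1 convolutions and a Sigmoid, followed by multiplying each attention map with T and adding T back); the residual modules are simplified to plain convolution blocks here, and the values of N_T and N_A are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ROIAttention(nn.Module):
    """N_A spatial attention maps applied to the N_T backbone feature maps T."""
    def __init__(self, n_t: int, n_a: int):
        super().__init__()
        # Stand-in for the 3 residual modules (see the ResidualBlock sketch above)
        self.res = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(n_t, n_t, 3, padding=1), nn.ReLU(inplace=True))
            for _ in range(3)
        ])
        self.att = nn.Sequential(
            nn.Conv2d(n_t, n_a, 1),   # N_A 1x1 convolutions
            nn.ReLU(inplace=True),
            nn.Conv2d(n_a, n_a, 1),   # another N_A 1x1 convolutions
            nn.Sigmoid(),             # spatial position of each facial action
        )

    def forward(self, T: torch.Tensor):
        # T: (batch, N_T, H, W) image features from the backbone
        A = self.att(self.res(T))                                # (batch, N_A, H, W) attention maps
        # Each attention map weights the N_T feature maps, then T is added back
        roi = A.unsqueeze(2) * T.unsqueeze(1) + T.unsqueeze(1)   # (batch, N_A, N_T, H, W)
        return A, roi

A, roi = ROIAttention(n_t=512, n_a=20)(torch.randn(1, 512, 7, 7))
print(A.shape, roi.shape)  # torch.Size([1, 20, 7, 7]) torch.Size([1, 20, 512, 7, 7])
```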
As an alternative embodiment, obtaining the associated features of different facial actions by using a Transformer self-attention mechanism and thereby predicting the facial actions specifically includes:
each of the N_A groups of feature maps is passed through a 1 × 1 convolution, batch normalization and ReLU, and each group outputs a one-dimensional vector of length N, giving N_A one-dimensional vectors in total; the N_A one-dimensional vectors are fused into a feature C of length N × N_A; C is input into a Transformer encoder consisting of a multi-head attention module and a feed-forward network, and the output of the multi-head attention module, after passing through two fully-connected layers, gives the predicted values of the facial actions.
This embodiment provides a technical solution for obtaining the associated features of different facial actions using a Transformer self-attention mechanism. Based on the region-of-interest features and positions of the facial actions, the Transformer self-attention mechanism is used to establish relations between different facial actions, and the predicted values of the facial actions are obtained by learning the facial action features and the relations between different facial actions. The processing flow is shown in Fig. 4: the N_A groups of feature maps each pass through a 1 × 1 convolution + batch normalization (BatchNorm) + ReLU block, and each group outputs a one-dimensional vector, giving N_A one-dimensional vectors in total. The N_A one-dimensional vectors are fused (assuming each one-dimensional vector has length N, the fused vector has length N × N_A; fusion here means concatenating the vectors in order) to obtain the feature C, which is input into a Transformer encoder. The Transformer encoder mainly consists of a multi-head attention mechanism (Multi-Head Attention) and a feed-forward network (Feed-Forward Network) and works as follows: the one-dimensional feature C passes through 3 fully-connected layers and is split into three parts, V, K and Q, which subsequently realize the self-attention mechanism through multiplicative computation. Each of these three parts has N heads, and the single-head outputs are finally concatenated to form the output. Multi-head attention is equivalent to replicating several single heads; because of different initialization, the weight coefficients of each single head are different. The output of the network, after passing through two fully-connected layers, gives the predicted values of the facial action units.
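A minimal sketch of the facial action prediction branch described above; it uses PyTorch's built-in nn.TransformerEncoder in place of the hand-written V/K/Q split, treats the fused feature C as a sequence of N_A tokens of dimension N, and reduces each group of feature maps to a length-N vector with a 1 × 1 convolution, batch normalization, ReLU and global average pooling; the pooling step and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class AUPredictor(nn.Module):
    """Facial action prediction from the N_A groups of region-of-interest feature maps."""
    def __init__(self, n_t: int, n_a: int, n: int = 128, heads: int = 4):
        super().__init__()
        # 1x1 conv + BatchNorm + ReLU per group, then global average pooling to a length-n vector
        self.reduce = nn.Sequential(
            nn.Conv2d(n_t, n, 1), nn.BatchNorm2d(n), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        layer = nn.TransformerEncoderLayer(d_model=n, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)  # multi-head attention + feed-forward
        self.head = nn.Sequential(nn.Linear(n, n), nn.ReLU(inplace=True), nn.Linear(n, 1))  # two FC layers

    def forward(self, roi: torch.Tensor):
        b, n_a, n_t, h, w = roi.shape
        v = self.reduce(roi.reshape(b * n_a, n_t, h, w)).reshape(b, n_a, -1)  # one token per facial action
        assoc = self.encoder(v)                    # associated features across facial actions
        au_logits = self.head(assoc).squeeze(-1)   # (b, N_A): one predicted value per facial action
        return assoc, au_logits

assoc, au = AUPredictor(n_t=512, n_a=20)(torch.randn(1, 20, 512, 7, 7))
print(assoc.shape, au.shape)  # torch.Size([1, 20, 128]) torch.Size([1, 20])
```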
As an alternative embodiment, the method for recognizing emotion change includes:
inputting the associated features output by the Transformer self-attention mechanism into a feature fusion module consisting of two fully-connected layers, outputting a plurality of feature layers, and predicting combinations of facial actions through a Softmax function, each combination corresponding to one emotion change;
The emotion changes corresponding to the combinations of facial actions are: inner eyebrow raise + outer eyebrow raise + frown: fear; inner eyebrow raise + outer eyebrow raise: surprise; inner eyebrow raise + frown: sadness; frown + upper eyelid raise: anger; cheek raise + mouth corner pull: happiness; upper lip raise + lips parting: disgust; pouting + lip tightening: confusion.
This embodiment provides a technical solution for predicting emotion changes. The features output by the Transformer self-attention mechanism are passed through two fully-connected layers that further learn deep features and directly predict the combination of facial actions. This also imposes a certain constraint on the training of the facial actions during model training, making the facial action predictions more accurate. The two fully-connected layers finally output a plurality of feature layers, and the combinations of facial actions are predicted through Softmax activation. Each combination of facial actions corresponds to one emotion change; Table 1 gives the emotion changes for 7 combinations.
TABLE 1 Emotion changes represented by facial action combinations
Inner eyebrow raise + outer eyebrow raise + frown: fear
Inner eyebrow raise + outer eyebrow raise: surprise
Inner eyebrow raise + frown: sadness
Frown + upper eyelid raise: anger
Cheek raise + mouth corner pull: happiness
Upper lip raise + lips parting: disgust
Pouting + lip tightening: confusion
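A minimal sketch of the emotion-change head described above: the associated features are fused by two fully-connected layers and a Softmax predicts which of the Table 1 facial-action combinations is present; the averaging over facial actions and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Facial-action combinations and the emotion change each one represents (Table 1)
EMOTIONS = ["fear", "surprise", "sadness", "anger", "happiness", "disgust", "confusion"]

class EmotionHead(nn.Module):
    """Predict the facial-action combination (and hence the emotion change) from the associated features."""
    def __init__(self, n: int = 128, n_combos: int = len(EMOTIONS)):
        super().__init__()
        self.fuse = nn.Sequential(            # feature fusion module: two fully-connected layers
            nn.Linear(n, n), nn.ReLU(inplace=True),
            nn.Linear(n, n_combos),
        )

    def forward(self, assoc: torch.Tensor) -> torch.Tensor:
        # assoc: (batch, N_A, n) associated features from the Transformer encoder;
        # averaged over facial actions here (an assumed fusion step), then Softmax over combinations
        return torch.softmax(self.fuse(assoc.mean(dim=1)), dim=-1)

probs = EmotionHead()(torch.randn(1, 20, 128))
print(EMOTIONS[probs.argmax(dim=-1).item()], probs.shape)  # e.g. "surprise", torch.Size([1, 7])
```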
Fig. 5 is a schematic composition diagram of an emotion recognition device based on facial action combination detection according to an embodiment of the present invention, where the device includes:
the image feature extraction module 11 is configured to acquire a face image and perform feature extraction on the image;
the region-of-interest extraction module 12 is configured to input the extracted image features into the attention module, and obtain a spatial position of each facial action on the face and facial region-of-interest features;
the facial action prediction module 13 is configured to obtain the associated features of different facial actions by using a Transformer self-attention mechanism based on the spatial positions and the region-of-interest features, and further predict the facial actions;
and an emotion change prediction module 14, configured to predict a combination of facial actions by fusing the correlation features, so as to obtain an emotion change.
The apparatus of this embodiment may be used to implement the technical solution of the method embodiment shown in fig. 1, and the implementation principle and the technical effect are similar, which are not described herein again. The same applies to the following embodiments, which are not further described.
As an alternative embodiment, the residual network ResNet34 is used to perform feature extraction on the input image; ResNet34 is composed of 5 convolution modules: the first convolution module consists of a 7 × 7 convolution and a 3 × 3 downsampling convolution, the second to fifth convolution modules are residual modules each composed of two 3 × 3 convolutions, and in a residual module the input x is passed through the two 3 × 3 convolutions, added to x, and then passed through a ReLU activation layer; the output of the fifth convolution module is the output of ResNet34.
As an alternative embodiment, the facial actions include: raising the inner eyebrow corner, raising the outer eyebrow corner, frowning, raising the upper eyelid, lifting the cheek, tightening the eyelid, wrinkling the nose, raising the upper lip, pulling the mouth corner, tightening the mouth corner, pulling the mouth corner down, raising the lower lip, pouting, stretching the mouth corner, tightening the lips, pressing the lips, parting the lips, lowering the chin, opening the mouth wide, and closing the eyes.
As an alternative embodiment, the method for obtaining the spatial position of the facial motion on the face and the facial region-of-interest features comprises:
inputting the N_T extracted image features T into the attention module; the features pass sequentially through 3 residual modules, N_A 1 × 1 convolutions, a ReLU activation layer and another N_A 1 × 1 convolutions, and finally a Sigmoid activation function gives the spatial position of each facial action on the face; the N_A attention feature maps obtained by the attention module are each multiplied by the N_T image features T and then added to the N_T image features T, yielding N_A groups of feature maps with N_T feature maps per group; N_A is equal to the number of facial actions.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for recognizing emotion based on facial action combination detection is characterized by comprising the following steps:
acquiring a face image, and performing feature extraction on the image;
inputting the extracted image features into an attention module to obtain the spatial position of each facial action on the face and the feature of the face region of interest;
based on the spatial positions and the region-of-interest features, obtaining the associated features of different facial actions by using a Transformer self-attention mechanism, and further predicting the facial actions;
and predicting the combination of facial actions by fusing the associated features, thereby obtaining the emotion change.
2. The emotion recognition method based on facial action combination detection according to claim 1, wherein feature extraction is performed on the input image using the residual network ResNet34; ResNet34 is composed of 5 convolution modules: the first convolution module consists of a 7 × 7 convolution and a 3 × 3 downsampling convolution, the second to fifth convolution modules are residual modules each composed of two 3 × 3 convolutions, and in a residual module the input x is passed through the two 3 × 3 convolutions, added to x, and then passed through a ReLU activation layer; the output of the fifth convolution module is the output of ResNet34.
3. The emotion recognition method based on facial action combination detection as claimed in claim 2, wherein the facial actions include: raising the inner eyebrow corner, raising the outer eyebrow corner, frowning, raising the upper eyelid, lifting the cheek, tightening the eyelid, wrinkling the nose, raising the upper lip, pulling the mouth corner, tightening the mouth corner, pulling the mouth corner down, raising the lower lip, pouting, stretching the mouth corner, tightening the lips, pressing the lips, parting the lips, lowering the chin, opening the mouth wide, and closing the eyes.
4. The emotion recognition method based on facial motion combination detection as claimed in claim 3, wherein the method for obtaining the spatial position of the facial motion on the face and the facial region-of-interest features comprises:
inputting the N_T extracted image features T into the attention module; the features pass sequentially through 3 residual modules, N_A 1 × 1 convolutions, a ReLU activation layer and another N_A 1 × 1 convolutions, and finally a Sigmoid activation function gives the spatial position of each facial action on the face; the N_A attention feature maps obtained by the attention module are each multiplied by the N_T image features T and then added to the N_T image features T, yielding N_A groups of feature maps with N_T feature maps per group; N_A is equal to the number of facial actions.
5. The emotion recognition method based on facial action combination detection as claimed in claim 4, wherein obtaining the associated features of different facial actions by using a Transformer self-attention mechanism to predict the facial actions specifically comprises:
each of the N_A groups of feature maps is passed through a 1 × 1 convolution, batch normalization and ReLU, and each group outputs a one-dimensional vector of length N, giving N_A one-dimensional vectors in total; the N_A one-dimensional vectors are fused into a feature C of length N × N_A; C is input into a Transformer encoder consisting of a multi-head attention module and a feed-forward network, and the output of the multi-head attention module, after passing through two fully-connected layers, gives the predicted values of the facial actions.
6. The method of recognizing emotion based on facial motion combination detection as set forth in claim 5, wherein the method of recognizing emotion change includes:
inputting the associated features output by the Transformer self-attention mechanism into a feature fusion module consisting of two fully-connected layers, outputting a plurality of feature layers, and predicting combinations of facial actions through a Softmax function, each combination corresponding to one emotion change;
The emotion changes corresponding to the combinations of facial actions are: inner eyebrow raise + outer eyebrow raise + frown: fear; inner eyebrow raise + outer eyebrow raise: surprise; inner eyebrow raise + frown: sadness; frown + upper eyelid raise: anger; cheek raise + mouth corner pull: happiness; upper lip raise + lips parting: disgust; pouting + lip tightening: confusion.
7. An emotion recognition apparatus based on facial motion combination detection, characterized by comprising:
the image feature extraction module is used for acquiring a face image and extracting features of the image;
the interesting region extracting module is used for inputting the extracted image characteristics into the attention module to obtain the spatial position of each facial action on the face and the interesting region characteristics of the face;
the facial action prediction module is used for obtaining the associated features of different facial actions by using a Transformer self-attention mechanism based on the spatial positions and the region-of-interest features, and further predicting the facial actions;
and the emotion change prediction module is used for predicting the combination of facial actions by fusing the associated features so as to obtain emotion changes.
8. The emotion recognition device based on facial action combination detection according to claim 7, wherein feature extraction is performed on the input image using the residual network ResNet34; ResNet34 is composed of 5 convolution modules: the first convolution module consists of a 7 × 7 convolution and a 3 × 3 downsampling convolution, the second to fifth convolution modules are residual modules each composed of two 3 × 3 convolutions, and in a residual module the input x is passed through the two 3 × 3 convolutions, added to x, and then passed through a ReLU activation layer; the output of the fifth convolution module is the output of ResNet34.
9. The emotion recognition device based on facial action combination detection of claim 8, wherein the facial actions include: raising the inner eyebrow corner, raising the outer eyebrow corner, frowning, raising the upper eyelid, lifting the cheek, tightening the eyelid, wrinkling the nose, raising the upper lip, pulling the mouth corner, tightening the mouth corner, pulling the mouth corner down, raising the lower lip, pouting, stretching the mouth corner, tightening the lips, pressing the lips, parting the lips, lowering the chin, opening the mouth wide, and closing the eyes.
10. The emotion recognition device based on facial motion combination detection as claimed in claim 9, wherein the method for obtaining the spatial position of the facial motion on the face and the facial region-of-interest features comprises:
inputting the N_T extracted image features T into the attention module; the features pass sequentially through 3 residual modules, N_A 1 × 1 convolutions, a ReLU activation layer and another N_A 1 × 1 convolutions, and finally a Sigmoid activation function gives the spatial position of each facial action on the face; the N_A attention feature maps obtained by the attention module are each multiplied by the N_T image features T and then added to the N_T image features T, yielding N_A groups of feature maps with N_T feature maps per group; N_A is equal to the number of facial actions.
CN202210105954.3A 2022-01-28 2022-01-28 Emotion recognition method and device based on facial action combination detection Pending CN114581971A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210105954.3A CN114581971A (en) 2022-01-28 2022-01-28 Emotion recognition method and device based on facial action combination detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210105954.3A CN114581971A (en) 2022-01-28 2022-01-28 Emotion recognition method and device based on facial action combination detection

Publications (1)

Publication Number Publication Date
CN114581971A (en) 2022-06-03

Family

ID=81768918

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210105954.3A Pending CN114581971A (en) 2022-01-28 2022-01-28 Emotion recognition method and device based on facial action combination detection

Country Status (1)

Country Link
CN (1) CN114581971A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210081673A1 (en) * 2019-09-12 2021-03-18 Nec Laboratories America, Inc Action recognition with high-order interaction through spatial-temporal object tracking
CN111563417A (en) * 2020-04-13 2020-08-21 华南理工大学 Pyramid structure convolutional neural network-based facial expression recognition method
CN112329683A (en) * 2020-11-16 2021-02-05 常州大学 Attention mechanism fusion-based multi-channel convolutional neural network facial expression recognition method
CN112418095A (en) * 2020-11-24 2021-02-26 华中师范大学 Facial expression recognition method and system combined with attention mechanism
CN113887487A (en) * 2021-10-20 2022-01-04 河海大学 Facial expression recognition method and device based on CNN-Transformer

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WENYU SONG, SHUZE SHI, GAOYUN AN: "Facial Action Unit Detection Based on Transformer and Attention Mechanism", ICIG 2021: Image and Graphics *
ZHIWEN SHAO, ZHILEI LIU, JIANFEI CAI, LIZHUANG MA: "Deep Adaptive Attention for Joint Facial Action Unit Detection and Face Alignment", European Conference on Computer Vision 2018 *

Similar Documents

Publication Publication Date Title
CN108932500B (en) A kind of dynamic gesture identification method and system based on deep neural network
JP6788264B2 (en) Facial expression recognition method, facial expression recognition device, computer program and advertisement management system
WO2022111236A1 (en) Facial expression recognition method and system combined with attention mechanism
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
CN109635727A (en) A kind of facial expression recognizing method and device
KR101893554B1 (en) Method and apparatus of recognizing facial expression base on multi-modal
Rázuri et al. Automatic emotion recognition through facial expression analysis in merged images based on an artificial neural network
CN108803874A (en) A kind of human-computer behavior exchange method based on machine vision
Owayjan et al. Face detection with expression recognition using artificial neural networks
CN111476178A (en) Micro-expression recognition method based on 2D-3D CNN
CN111028319A (en) Three-dimensional non-photorealistic expression generation method based on facial motion unit
Xu et al. Face expression recognition based on convolutional neural network
Podder et al. Time efficient real time facial expression recognition with CNN and transfer learning
CN114550270A (en) Micro-expression identification method based on double-attention machine system
CN111950373B (en) Method for micro expression recognition based on transfer learning of optical flow input
CN110598719A (en) Method for automatically generating face image according to visual attribute description
CN112800979B (en) Dynamic expression recognition method and system based on characterization flow embedded network
Xie et al. Convolutional neural networks for facial expression recognition with few training samples
CN113159002A (en) Facial expression recognition method based on self-attention weight auxiliary module
Singh et al. Face emotion identification by fusing neural network and texture features: facial expression
CN114581971A (en) Emotion recognition method and device based on facial action combination detection
Li et al. Multimodal information-based broad and deep learning model for emotion understanding
CN115131876A (en) Emotion recognition method and system based on human body movement gait and posture
Dembani et al. UNSUPERVISED FACIAL EXPRESSION DETECTION USING GENETIC ALGORITHM.
Yao et al. Facial expression recognition method based on convolutional neural network and data enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220603