CN113792572A - Facial expression recognition method based on local representation - Google Patents
Facial expression recognition method based on local representation
- Publication number: CN113792572A (application CN202110670264.8A)
- Authority: CN (China)
- Prior art keywords: facial, local, expressions, regions, face
- Legal status: Granted (the legal status is an assumption by Google, not a legal conclusion)
Classifications
- G06N3/045 — Combinations of networks (G—Physics; G06—Computing; G06N—Computing arrangements based on specific computational models; G06N3/00—Biological models; G06N3/02—Neural networks; G06N3/04—Architecture, e.g. interconnection topology)
- G06N3/047 — Probabilistic or stochastic networks
- G06N3/08 — Learning methods
Abstract
The invention relates to a facial expression recognition method based on local representation, belonging to the field of face recognition. Expressions are an important embodiment of changes in human inner emotion, yet current expression recognition methods usually rely on global facial features and ignore local feature extraction. Motivated by psychologists' observation that different facial expressions correspond to different local muscle movement regions, the invention proposes an expression recognition algorithm based on local representations, called the Expression Action Unit Convolutional Neural Network (EAU-CNN). To extract local facial features, the whole face image is first divided into 43 sub-regions according to 68 acquired facial feature points; the muscle movement regions and 8 local candidate regions covering the salient facial organs are then selected as the input of the convolutional neural network. To balance the features of the local candidate regions, 8 parallel feature extraction branches are employed, each governing a fully connected layer of a different dimension. The outputs of the branches are adaptively connected by attention to highlight the relative importance of the local candidate regions, and expressions are finally classified by a Softmax function into seven categories: neutral, anger, disgust, surprise, happiness, sadness and fear.
Description
Technical Field
The invention belongs to the field of face recognition, and particularly relates to a facial expression recognition method based on local representation.
Background
Facial expression recognition identifies human facial expressions such as surprise, sadness, happiness and anger. It has wide potential application and can be used for human-computer interaction, shopping recommendation, criminal investigation, medical assistance and so on. For example: during human-computer interaction, commodities can be recommended and preferences inferred from the user's expression changes while browsing goods; in criminal investigation, the facial expressions of suspects are used to infer their psychological changes; in medical assistance, a patient's facial expression is observed to adjust medicine dosage. Mehrabian proposed that 55% of human emotional information is transmitted through the face; if a computer could read this information, human-computer interaction would offer a better experience. On this basis, the American psychologist Ekman et al. proposed the Facial Action Coding System (FACS) in 1987, which elaborated and codified the common facial muscle Action Units (AU), and classified facial expressions into 6 basic expressions: anger, disgust, surprise, happiness, sadness and fear. Most methods divide facial expression recognition into two steps: feature extraction and classification. The face image is analyzed to obtain latent features, and a classifier then performs the classification according to those features. Feature extraction is the key step of expression recognition and directly affects its performance; existing research on feature extraction falls into two types: geometric features and texture features.
With the advent of Convolutional Neural Networks (CNN), some scholars proposed using CNN features for expression recognition, extracting global facial features by iterative regression from pre-annotated labels, with good results. However, global facial features alone ignore local detail features, making similar expressions, such as surprise and fear, difficult to distinguish. Routing adopted facial key-point localization and extracted LBP features of salient facial patches (SFP) around the key points; Kaleekal et al. extracted Hahn features of 8 SFPs around 68 facial feature points and, after cascading, classified them with an SVM, reaching highest average accuracies of 91.33% and 93.16% on the CK+ and JAFFE datasets. The SFP method highlights local expression features; however, existing methods select SFPs without a principled basis and ignore the different proportions of the local features of different expressions. According to the study of Li et al., all 6 types of basic facial expressions have corresponding facial muscle action units (AU) in FACS coding; the AUs of the 6 expression types are finely classified, and expressions are classified by a subsequent Bayesian algorithm after identifying and counting AUs. However, this method must identify a large number of AUs, and its accuracy in identifying a single AU is low.
Disclosure of Invention
The invention relates to a facial expression recognition method based on local representation, which realizes recognition of the facial expressions neutral, anger, disgust, surprise, happiness, sadness and fear. The specific technical scheme comprises the following 4 parts.
(1) Face partition: the 68 feature points of the human face divide the whole facial image into 43 sub-regions, and the 6 basic facial expressions have corresponding single or combined facial muscle action units.
(2) Convolutional neural network: convolutional-neural-network features are used for expression recognition, and global facial features are extracted by iterative regression from pre-calibrated labels. The muscle movement regions and the 8 local candidate regions covering the salient facial organs are selected as the input of the convolutional neural network.
(3) Feature extraction: to balance the features of the local candidate regions, 8 parallel feature extraction branches are employed, each governing a fully connected layer of a different dimension. The outputs of the branches are adaptively connected according to attention to highlight the importance of the different local candidate regions.
(4) Classification by Softmax function: expressions are classified into seven categories of neutral, anger, disgust, surprise, happiness, sadness and fear by the Softmax function.
Compared with other facial expression recognition methods, the invention has the following advantages: 1. Based on facial muscle action units, local features are extracted in parallel and in a balanced way and are adaptively connected according to attention, which can greatly improve the accuracy of facial expression recognition. 2. The method overcomes the shortcomings of the SFP method, in which local facial features are not prominent and facial muscle movements are ignored. 3. Adopting AU partitions effectively improves recognition accuracy.
Drawings
FIG. 1 is a general flowchart of a facial expression recognition method based on local characterization according to the present invention
FIG. 2 is a facial region partition
FIG. 3 is a schematic view of a minimum rectangular area
FIG. 4 CK + data set EAU-CNN algorithm effect graph
FIG. 5 JAFFE data set EAU-CNN algorithm effect graph
Detailed Description
The present invention is directed to a facial expression recognition method based on local representations. To make the technical solutions and effects of the present invention clearer, specific embodiments of the invention are described in detail below with reference to the accompanying drawings.
As shown in FIG. 1, the flow of the facial expression recognition method based on local representation is as follows. To extract local facial features, the whole face image is first divided into 43 sub-regions according to the 68 acquired facial feature points, and the muscle movement regions and 8 local candidate regions covering the salient facial organs are then selected as the input of the convolutional neural network. To balance the features of the local candidate regions, 8 parallel feature extraction branches are employed, each governing a fully connected layer of a different dimension. The outputs of the branches are adaptively connected according to attention to highlight the relative importance of the local candidate regions, and expressions are finally classified by a Softmax function into seven categories: neutral, anger, disgust, surprise, happiness, sadness and fear.
1. Face partition
Facial muscle movements form facial expressions. The Facial Action Coding System (FACS) divides facial muscle movements into 45 types, whose combinations can constitute various expressions, and each of the 6 basic facial expressions has a corresponding single or combined facial muscle action unit. In anger, the eyebrows are wrinkled together, vertical wrinkles appear between the eyebrows, and the lower eyelids are tense and may or may not be raised; the facial muscle action units are represented by one or more of AU4, AU5, AU7, AU23 and AU24. In disgust, the eyebrows are pressed down, the upper lip is raised, and striations appear below the lower eyelids; the facial muscle action units are AU9 and AU17. In fear, the eyebrows are wrinkled and raised, the upper eyelids are raised, the lower eyelids are tense, and the lips are slightly tense; the facial muscle action units are AU4, AU1+AU5 and AU5+AU7. In happiness, the eyebrows may bend downwards, the lower edges of the lower eyelids bulge or wrinkle, and the corners of the mouth are pulled back and raised with teeth exposed; the facial muscle action units are AU6, AU12 and AU25. In sadness, the inner corners of the eyebrows are wrinkled together and raised, the corners of the mouth are pulled down, and the eyelids at the inner corners of the eyes are raised; the facial muscle action units are AU1, AU4, AU15 and AU17. In surprise, the eyebrows are raised and bent, the skin under the eyebrows is stretched, the eyes are opened wide with the upper eyelids raised, and the mouth is opened with lips and teeth separated; the facial muscle action units are AU5, AU26, AU27 and AU1+AU2.
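The expression-to-AU mapping above can be transcribed as a small lookup table. This is an illustrative sketch, not part of the patented method, and the helper name `aus_for` is hypothetical:

```python
# Transcription of the expression-to-action-unit mapping described above.
# Tuples denote AU combinations (e.g. AU1+AU5 in fear); the helper name
# aus_for is hypothetical, for illustration only.
EXPRESSION_AUS = {
    "anger":     ["AU4", "AU5", "AU7", "AU23", "AU24"],
    "disgust":   ["AU9", "AU17"],
    "fear":      ["AU4", ("AU1", "AU5"), ("AU5", "AU7")],
    "happiness": ["AU6", "AU12", "AU25"],
    "sadness":   ["AU1", "AU4", "AU15", "AU17"],
    "surprise":  ["AU5", "AU26", "AU27", ("AU1", "AU2")],
}

def aus_for(expression: str):
    """Return the single or combined action units linked to one expression."""
    return EXPRESSION_AUS[expression]
```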
According to the FACS definitions, regional statistics of the facial muscle action units of the 6 basic facial expressions show that the regions generating facial muscle movement are concentrated on the eyebrows, part of the forehead, the eyes and the lower half of the face. Other parts of the face, such as the forehead, the temples and parts of the side face, contain no facial muscle action units, yet their features are too similar across different expressions and degrade the overall recognition accuracy. To address this, existing methods construct a number of SFPs to extract features for expression recognition; however, FACS is not considered in SFP selection, so the extracted local facial features are not distinctive enough. For this purpose, the face is divided into 43 feature regions according to the generating regions of the muscle action units of the 6 basic expressions, and, as shown in FIG. 2, 8 local candidate regions AUg_i (i = 1, ..., 8) are constructed according to the regions where the eyebrows, eyes, nose and mouth are located. Each candidate region contains certain feature regions and is responsible for extracting only the muscle action unit features belonging to its group.
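As a hedged illustration of the partition mechanics (the patent's actual 43-sub-region layout is defined by its FIG. 2 and is not reproduced here), the following sketch groups 68 landmarks in the common dlib index convention into named facial regions; the group boundaries and the helper `region_points` are assumptions for illustration only:

```python
import numpy as np

# Illustrative only: group the 68 landmark indices (dlib numbering assumed)
# into named facial regions. The patent's 43 sub-regions and 8 candidate
# regions AUg_1..AUg_8 are defined by its FIG. 2; these groups are a sketch.
LANDMARK_GROUPS = {
    "jaw": list(range(0, 17)),
    "eyebrows": list(range(17, 27)),
    "nose": list(range(27, 36)),
    "eyes": list(range(36, 48)),
    "mouth": list(range(48, 68)),
}

def region_points(landmarks: np.ndarray, name: str) -> np.ndarray:
    """Select the (x, y) coordinates of the landmarks in one region."""
    return landmarks[LANDMARK_GROUPS[name]]
```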
2. Face recognition
Based on the 68 facial feature points and the above partition, 8 local candidate regions are formed, and each AUg_i image is cropped using the idea of a minimum bounding rectangle, as shown in FIG. 3. The images are normalized to different fixed sizes and, after features are extracted by the parallel CNN branches, spliced into a 4096-dimensional fully connected layer. The spliced fully connected layer is multiplied by different expression weight values to highlight the local features of different expressions, and expressions are divided into 7 categories by subsequent feature extraction and a Softmax function: neutral, anger, disgust, surprise, happiness, sadness and fear.
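The minimum-rectangle cropping step can be sketched as follows. This is a minimal illustration assuming landmarks are given as integer (x, y) pixel coordinates; `min_rect_crop` is a hypothetical helper, and the resize to each region's fixed input size (e.g. 45×90 for AUg_1) is omitted:

```python
import numpy as np

# Minimal sketch of the minimum-bounding-rectangle crop described above:
# take the tight axis-aligned rectangle around a candidate region's landmark
# points. In the full pipeline the crop would then be resized to that
# region's fixed input size before entering its CNN branch.
def min_rect_crop(image: np.ndarray, pts: np.ndarray) -> np.ndarray:
    """Crop the minimal axis-aligned rectangle enclosing (x, y) points."""
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    return image[y0:y1 + 1, x0:x1 + 1]

face = np.zeros((220, 220), dtype=np.uint8)       # placeholder grayscale face
eye_pts = np.array([[60, 90], [100, 110]])        # hypothetical landmarks
patch = min_rect_crop(face, eye_pts)              # tight patch around the points
```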
AUg_1 splices the left-eye and right-eye regions into 45×90 pixels; features are extracted through 3 convolutional and pooling layers to generate a 64-channel feature map of size 3×9, and the branch governs a 129-dimensional fully connected output layer. AUg_2 is normalized to 45×220 pixels, similarly passes through 3 convolutional and pooling layers to generate a 64-channel 3×25 feature map, and outputs a 290-dimensional fully connected layer according to the area proportion of the local candidate region. AUg_3 is normalized to 110×220 pixels, passes through 5 convolutional layers and 4 pooling layers to generate a 64-channel 5×11 feature map, and outputs a 716-dimensional fully connected layer by area proportion. AUg_4 is normalized to 140×170 pixels, passes through 5 convolutional layers and 4 pooling layers to generate a 64-channel 6×8 feature map, and outputs a 704-dimensional fully connected layer by area proportion. AUg_5 is normalized to 120×220 pixels, passes through 5 convolutional layers and 4 pooling layers to generate a 64-channel 5×11 feature map, and outputs a 782-dimensional fully connected layer by area proportion. AUg_6 is normalized to 110×220 pixels, passes through 5 convolutional layers and 4 pooling layers to generate a 64-channel 5×11 feature map, and outputs a 717-dimensional fully connected layer by area proportion. AUg_7 is normalized to 80×190 pixels, passes through 5 convolutional layers and 4 pooling layers to generate a 64-channel 3×10 feature map, and outputs a 451-dimensional fully connected layer by area proportion. AUg_8 is normalized to 80×130 pixels, passes through 3 convolutional and pooling layers to generate a 64-channel 8×14 feature map, and governs a 758-dimensional fully connected layer by area proportion.
The outputs of all the candidate regions are spliced to obtain a 4096-dimensional fully connected layer.
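The splicing of the branch outputs can be sketched as a simple concatenation followed by elementwise attention weighting. The branch dimensions below repeat the per-branch fully connected sizes listed above; the uniform weight vector is a placeholder assumption, not the learned attention of the patent:

```python
import numpy as np

# Hedged sketch of the splicing step: concatenate the 8 branch feature
# vectors into one fully connected representation and reweight it with a
# per-expression attention weight vector. BRANCH_DIMS repeats the per-branch
# fully connected sizes given above; the uniform 0.5 weights are placeholders.
BRANCH_DIMS = [129, 290, 716, 704, 782, 717, 451, 758]

def fuse(branch_feats, weights):
    """Concatenate branch outputs, then apply elementwise attention weights."""
    return weights * np.concatenate(branch_feats)

feats = [np.ones(d) for d in BRANCH_DIMS]       # stand-in branch outputs
weights = np.full(sum(BRANCH_DIMS), 0.5)        # placeholder attention weights
fused = fuse(feats, weights)
```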
3. Loss function design
To highlight the local features of different expressions, the spliced 4096-dimensional fully connected layer is multiplied by different expression weight values W_fj, where j = 1, ..., 7 indexes the expressions. The feature X_i and the weight W_fj are normalized, so that the logit of each class becomes cos θ_j = W_fj^T X_i. Computing θ_j = arccos(W_fj^T X_i) gives the angle between the feature X_i and the class weight W_fj; in effect, W_fj provides a center for each class. An angular margin penalty m is then added to the target (ground-truth) angle θ_{y_i}, cos(θ_{y_i} + m) is computed, all logits are multiplied by the feature scale s, and the logits pass through the softmax function to yield the cross-entropy loss. Applying the margin to the angle, the loss function is shown in Equation 1:

L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}} \qquad (1)

Whether a cosine margin should be used depends on the similarity measure (or distance) the final loss function optimizes. Clearly, the modified softmax loss optimizes cosine similarity rather than the angle. With the traditional softmax loss this is not a problem, since the decision boundaries of the two forms coincide (cos θ_1 = cos θ_2 → θ_1 = θ_2). When we try to push the boundary, however, the two similarities (distances) have different densities: cosine values are denser where the angle is close to 0 or π. If we want to optimize the angle instead, the inner product W^T f must be passed through the arccosine, which is more computationally expensive. In general, the angular margin is conceptually better than the cosine margin, but the cosine margin is attractive in terms of computational cost, since it achieves the same goal with less effort. The mathematical descriptions of the intra-class distance and the inter-class distance are given in Equations 2 and 3:

W is the parameter to be learned, and the feature X_i is likewise obtained through the weight learning of the preceding layers; during training, both X_i and W change, each driven by the gradient in the direction of decreasing loss. Introducing the margin intentionally lowers the value of the component corresponding to the class label, 'squeezing' out as much of the model's potential as possible: positions that would already have converged under plain softmax must be pushed further down, which can be achieved by raising the component corresponding to the class label or by lowering the other components. While X_i approaches the direction of W_{y_i}, the W_j with j ≠ y_i may move away from X_i; the effect finally achieved is that X_i is as close as possible to W_{y_i} and the W_j, j ≠ y_i, are far from X_i.
To verify whether the AUg_i selection is reasonable, another 2 groups of local candidate regions were selected for comparison and verification, as shown in Table 1 below. To test the accuracy of the method, the CK+ and JAFFE datasets were used; the effect graphs are shown in FIGS. 4 and 5. The average accuracy on the CK+ dataset reaches 99.85%, which is 6.83% and 9.03% higher than that of the selected feature region 1 and feature region 2; the average accuracy on the JAFFE dataset reaches 96.61%, which is 11.87% and 18.65% higher than that of feature region 1 and feature region 2.
Experiments show that the method achieves higher accuracy for expression recognition; it overcomes the shortcomings of the SFP method, in which overall facial features are not prominent and facial muscle movement is neglected. With AU partitions the recognition accuracy is higher, the processing speed is fast, and the real-time monitoring requirement can be better met.
Claims (6)
1. A facial expression recognition method based on local characterization, comprising:
Step 1: face partition. The 68 feature points of the human face divide the whole face image into 43 sub-regions, and the 6 basic facial expressions have corresponding single or combined facial muscle action units.
Step 2: the muscle movement regions and the 8 local candidate regions covering the salient facial organs are selected as the input of the convolutional neural network.
Step 3: feature extraction. 8 parallel feature extraction branches are adopted, each governing a fully connected layer of a different dimension; the outputs of the branches are adaptively connected according to attention to highlight the importance of the different local candidate regions.
Step 4: feature classification. Expressions are classified into seven categories of neutral, anger, disgust, surprise, happiness, sadness and fear by a Softmax function.
Step 5: loss function. To highlight the local features of different expressions, the spliced 4096-dimensional fully connected layer is multiplied by different expression weight values W_fj.
2. The facial expression recognition method based on local characterization according to claim 1, wherein facial muscle movements form facial expressions; the Facial Action Coding System (FACS) divides facial muscle movements into 45 types, whose combinations can constitute various expressions, and each of the 6 basic facial expressions has a corresponding single or combined facial muscle action unit. Regional statistics of the facial muscle action units of the 6 basic facial expressions show that the regions generating facial muscle movement are concentrated on the eyebrows, part of the forehead, the eyes and the lower half of the face; other parts of the face, such as the forehead, the temples and parts of the side face, contain no facial muscle action units, yet their features are too similar across different expressions and affect the overall recognition accuracy. Existing methods construct a number of SFPs to extract features for expression recognition; however, FACS is not considered in SFP selection, so the extracted local facial features are not distinctive enough. For this purpose, the face is divided into 43 feature regions according to the generating regions of the muscle action units of the 6 basic expressions, and 8 local candidate regions AUg_i (i = 1, ..., 8) are constructed according to the regions where the eyebrows, eyes, nose and mouth are located. Each candidate region contains certain feature regions and is responsible for extracting only the muscle action unit features belonging to its group.
3. The facial expression recognition method based on local characterization according to claim 2, wherein: the muscle movement regions and the 8 local candidate regions covering the salient facial organs are selected and input into the convolutional neural network; convolutional layers are the core of a CNN and convolve the input to extract higher-level features.
4. The facial expression recognition method based on local characterization according to claim 3, wherein: based on the 68 facial feature points and the above partition, 8 local candidate regions are formed, and each AUg_i image is cropped using the idea of a minimum bounding rectangle. The images are normalized to different fixed sizes and, after features are extracted by each parallel CNN branch, spliced into a 4096-dimensional fully connected layer. The spliced fully connected layer is multiplied by different expression weight values to highlight the local features of different expressions.
5. The facial expression recognition method based on local characterization according to claim 4, wherein the feature classification comprises: expressions are divided into seven categories of neutral, anger, disgust, surprise, happiness, sadness and fear by a Softmax function; Softmax first converts the model's prediction with an exponential function, guaranteeing the non-negativity of the probabilities, and then normalizes the converted results.
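The Softmax step of claim 5 (exponentiation for non-negativity, then normalization) can be sketched as follows; the score values are placeholders:

```python
import numpy as np

# Sketch of the Softmax classification step: exponentiate the raw scores
# (guaranteeing non-negativity) and normalize them into probabilities over
# the seven expression categories. The score values below are placeholders.
def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()

CLASSES = ["neutral", "anger", "disgust", "surprise",
           "happiness", "sadness", "fear"]
probs = softmax(np.array([2.0, 0.5, 0.1, 0.3, 1.2, 0.2, 0.4]))
predicted = CLASSES[int(np.argmax(probs))]       # highest-probability class
```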
6. The local token-based facial expression recognition method of claim 5, wherein: the loss function comprises the following steps:
in order to highlight local features of different expressions, the spliced 4096-dimensional full-connected layer is multiplied by different expression weight values WfjAnd j represents 1 to 7 different expressions. Based on the characteristic value XiSum weight value WfjNormalizing to obtain cos thetajFor each class as WfjXi. ComputingAnd obtaining a characteristic value XiAnd true weight WfjThe angle therebetween. In fact, WfjA hub is provided for each class. Then, at the target (ground truth) anglePlus an angular margin penalty m. Calculate cos (θ)yi+ m) and all logits are multiplied by the characteristic scale s and then logarithmically passed the softmax function and resulted in cross-entropy loss. Applying margin to the angle, the loss function is shown in equation 1:
whether a cosine margin should be used depends on the similarity measure (or distance) that the final loss function is optimizing. Obviously, the modified softmax loss function is to optimize cosine similarity, not angle. This may not be a problem if we use the traditional softmax penalty, since the decision boundaries of both forms are the same (cos θ @)1=cosθ2→θ1=θ2). However, when we try to push the boundary, we will face the problem that these two similarities (distances) have different densities. When the angle approaches0 or n, the cosine values are more dense. If we want to optimize the angle, we get the inner product WTThe value of f may then need to be inverted cosine. It may be more computationally expensive. In general, the angle margin is conceptually better than the cosine margin, but the cosine margin is more attractive in view of the computational cost because it can achieve the same goal with less effort. The mathematical description of the intra-class distance and the inter-class distance is shown in equations 2 and 3:
w is the parameter to be learned, feature XiIs also obtained through weight learning of the front layer, and X is obtained in the training processiAnd W will both change, both being driven by the gradient in the direction of decreasing Loss. The introduction of margin intentionally lowers the value of the component corresponding to the class label, the potential of 'squeezing' the model is reduced as much as possible, the position which can be originally converged in softmax needs to be continuously lowered, and the lowering can be realized by improving the value of the component corresponding to the class label or reducing the values of other components. XiIn the direction ofAt the same time as the approach is made,or may be far away from XiMay be moved in the direction of (c), the effect ultimately achieved may be XiAs close as possible toAnd Wj,j≠yiFar away from
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110670264.8A CN113792572B (en) | 2021-06-17 | 2021-06-17 | Facial expression recognition method based on local characterization |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110670264.8A CN113792572B (en) | 2021-06-17 | 2021-06-17 | Facial expression recognition method based on local characterization |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113792572A true CN113792572A (en) | 2021-12-14 |
CN113792572B CN113792572B (en) | 2024-08-02 |
Family
ID=78876952
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110670264.8A Active CN113792572B (en) | 2021-06-17 | 2021-06-17 | Facial expression recognition method based on local characterization |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113792572B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2009251634A (en) * | 2008-04-01 | 2009-10-29 | Seiko Epson Corp | Image processor, image processing method, and program |
CN107491726A (en) * | 2017-07-04 | 2017-12-19 | Chongqing University of Posts and Telecommunications | Real-time expression recognition method based on multi-channel parallel convolutional neural networks |
CN109271930A (en) * | 2018-09-14 | 2019-01-25 | GCI Science & Technology Co., Ltd. | Micro-expression recognition method, device and storage medium |
CN110188615A (en) * | 2019-04-30 | 2019-08-30 | Institute of Computing Technology, Chinese Academy of Sciences | Facial expression recognition method, device, medium and system |
CN111626113A (en) * | 2020-04-20 | 2020-09-04 | Peizhi Central School, Xicheng District, Beijing | Facial expression recognition method and device based on facial action units |
WO2020252903A1 (en) * | 2019-06-18 | 2020-12-24 | Ping An Technology (Shenzhen) Co., Ltd. | AU detection method and apparatus, electronic device, and storage medium |
CN112668378A (en) * | 2019-10-16 | 2021-04-16 | Harbin University of Science and Technology | Facial expression recognition method based on combination of image fusion and convolutional neural network |
- 2021-06-17: CN application CN202110670264.8A granted as patent CN113792572B (status: Active)
Non-Patent Citations (6)
Title |
---|
JASON C. DESKA等: "The Face of Fear and Anger: Facial Width-to-Height Ratio Biases Recognition of Angry and Fearful Expressions", 《JOURNAL INFORMATION JOURNAL TOC》, vol. 18, no. 4, 1 April 2018 (2018-04-01), pages 453 - 464 * |
YONGQIANG LI等: "Simultaneous Facial Feature Tracking and Facial Expression Recognition", IEEE TRANSACTIONS ON IMAGE PROCESSING, vol. 22, no. 7, pages 2559 - 2573, XP011510290, DOI: 10.1109/TIP.2013.2253477 * |
日拱一卒: "Review and Analysis of Face Recognition Loss Functions" [in Chinese], https://www.cnblogs.com/shine-lee/p/13435621.html#arcface-loss---cvpr2019, 4 August 2020 (2020-08-04), pages 8 - 10 * |
LI SHUAI: "Research on Facial Expression Recognition Based on Convolutional Neural Networks" [in Chinese], Wanfang Data, 4 February 2021 (2021-02-04), pages 3 - 5 * |
LI HUI et al.: "Face Recognition Algorithm Based on Convolutional Neural Networks" [in Chinese], Software Guide (软件导刊), vol. 16, no. 3, 15 March 2017 (2017-03-15), pages 26 - 29 * |
SU ZHIMING et al.: "Expression Recognition Based on Angular Distance Loss and Small-Scale Kernel Network" [in Chinese], Telecommunication Engineering (电讯技术), vol. 64, no. 4, pages 1 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Sun et al. | A visual attention based ROI detection method for facial expression recognition | |
Wen et al. | Ensemble of deep neural networks with probability-based fusion for facial expression recognition | |
Zhang et al. | Adaptive facial point detection and emotion recognition for a humanoid robot | |
CN112800903B (en) | Dynamic expression recognition method and system based on space-time diagram convolutional neural network | |
Ng et al. | A review of facial gender recognition | |
Shan | Learning local features for age estimation on real-life faces | |
CN111523462A (en) | Video sequence list situation recognition system and method based on self-attention enhanced CNN | |
CN110188708A (en) | A kind of facial expression recognizing method based on convolutional neural networks | |
Liu et al. | A 3 GAN: an attribute-aware attentive generative adversarial network for face aging | |
Huang et al. | Multimodal framework for analyzing the affect of a group of people | |
Ali et al. | Facial emotion detection using neural network | |
CN108830237A (en) | A kind of recognition methods of human face expression | |
Wang | Effect of subject's age and gender on face recognition results | |
Fan et al. | Hierarchical scale convolutional neural network for facial expression recognition | |
Prabhu et al. | Facial Expression Recognition Using Enhanced Convolution Neural Network with Attention Mechanism. | |
Ullah et al. | Improved deep CNN-based two stream super resolution and hybrid deep model-based facial emotion recognition | |
Upadhyay et al. | A review on different facial feature extraction methods for face emotions recognition system | |
CN113743389B (en) | Facial expression recognition method and device and electronic equipment | |
Tautkutė et al. | Classifying and visualizing emotions with emotional DAN | |
Ullah et al. | Emotion recognition from occluded facial images using deep ensemble model | |
Albrici et al. | G2-VER: Geometry guided model ensemble for video-based facial expression recognition | |
Alrabiah et al. | Computer-based approach to detect wrinkles and suggest facial fillers | |
Guo et al. | Facial expression recognition: a review | |
Dimlo et al. | Innovative method for face emotion recognition using hybrid deep neural networks | |
Aslam et al. | Gender classification based on isolated facial features and foggy faces using jointly trained deep convolutional neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |