CN109344744B - Face micro-expression action unit detection method based on deep convolutional neural network - Google Patents

Face micro-expression action unit detection method based on deep convolutional neural network

Info

Publication number
CN109344744B
Authority
CN
China
Prior art keywords
face
layer
neural network
action
action unit
Prior art date
Legal status
Active
Application number
CN201811076388.8A
Other languages
Chinese (zh)
Other versions
CN109344744A (en)
Inventor
樊亚春 (Fan Yachun)
税午阳 (Shui Wuyang)
邓擎琼 (Deng Qingqiong)
Current Assignee
Beijing Normal University
Original Assignee
Beijing Normal University
Priority date
Filing date
Publication date
Application filed by Beijing Normal University filed Critical Beijing Normal University
Priority to CN201811076388.8A priority Critical patent/CN109344744B/en
Publication of CN109344744A publication Critical patent/CN109344744A/en
Application granted granted Critical
Publication of CN109344744B publication Critical patent/CN109344744B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation
    • G06V40/172 - Classification, e.g. identification
    • G06V40/174 - Facial expression recognition
    • G06V40/176 - Dynamic expression

Abstract

The invention discloses a facial micro-expression action unit detection method based on a deep convolutional neural network, which comprises the following steps. Step 1: design a deep convolutional neural network structure. Step 1.1: mark the rectangular regions of the face and of the different action units within the face. Step 1.2: design and implement a deep convolutional neural network comprising convolution layers, shortcut layers and action unit detection layers, so as to learn the region information of the face and of its different expression action units and to obtain the forward-propagation parameters of the network. Step 1.3: take the sample data in the face sample data set as the neural network input data. Step 2: detect the facial expression action units according to the network parameters learned in step 1. Step 3: produce a visual output from the face action units detected in step 2. The detection method provided by the invention relies on a deep convolutional neural network to detect and identify the action units in a face image, which improves both detection accuracy and detection speed.

Description

Face micro-expression action unit detection method based on deep convolutional neural network
Technical Field
The invention relates to the technical field of face recognition and affective computing, and in particular to a facial micro-expression action unit detection method based on a deep convolutional neural network.
Background
A facial micro-expression is a natural exposure of a person's inner emotion. Compared with ordinary expressions, a micro-expression is not easy to perceive: its motion amplitude is small and its duration is short. An upturned mouth corner expresses inner pleasure; an inadvertent one-sided tilt of the mouth corner may hide inner contempt; the lower lip pushing the upper lip upward may hide slight discontent. At the same time, micro-expression recognition can also be applied to various service fields involving human-computer interaction technology, such as automatic driving, entertainment and shopping.
Automatic micro-expression recognition involves not only computer technology but also physiology and psychology. The American psychologist Paul Ekman generalized micro-expressions as combinations of different Action Units (AUs), so the detection of action units is the basis of micro-expression recognition. Facial action units are richer and subtler than the movements of other parts of the body and are therefore difficult to detect and recognize. Based on many years of accumulated work in psychology and physiology, Paul Ekman proposed the Facial Action Coding System (FACS), in which the different micro-expressions of a face can be decomposed into one or more different action units, so facial micro-expression recognition can be realized from the detection results of the different action units in the face. The detection accuracy of traditional geometric shape feature methods for action units is very low, because the action units are strongly influenced by typical geometric structures of the face such as the mouth, nose and eyebrows and are therefore difficult to detect.
Facial micro-expression action units describe the different movements of the forehead, eyebrows, eyes, nose, cheeks, mouth and jaw, producing effects with different geometric structures in different local regions. According to their location, the action units can be divided into an upper, a middle and a lower region: the upper region mainly covers the movements of the forehead, eyebrows and eyes, the middle region mainly covers the movements of the nose and cheeks, and the lower region mainly covers the movements of the mouth and chin.
In view of this, the invention provides a method for detecting a facial micro-expression action unit based on a deep convolutional neural network.
Disclosure of Invention
The invention provides a human face micro-expression action unit detection method based on a deep convolutional neural network.
In order to achieve the purpose, the invention adopts the following technical scheme:
a face micro-expression action unit detection method based on a deep convolutional neural network comprises the following steps:
step 1: designing a deep convolutional neural network structure, taking a face sample data set as input, taking an automatically marked micro expression action unit as output, training the network structure, and learning appropriate network parameters;
step 1.1: marking a face and rectangular areas of different action units in the face according to images in the sample data set;
step 1.2: designing and implementing a deep convolutional neural network, wherein the neural network comprises a convolutional layer, a shortcut layer and an action unit detection layer so as to learn the regional information of a face and different expression action units of the face and acquire a network forward propagation parameter;
step 1.3: taking sample data in a face sample data set as neural network input data, wherein the sample data comprises face image data and action unit mark xml data;
step 2: realizing facial expression action unit detection according to the network parameters learned in step 1: the image to be detected is taken as the input of the neural network of step 1.2, the convolution layers and detection layers are computed for the input image using the network parameters, and whether a face is present in the image is judged from the output of the detection layer; if no face is present, no action unit result is valid; if a face is present, the region positions identified for the action units are corrected according to the positional relationship between the face and the different action units, wherein the judgment threshold for the probability value of an action unit is set to 0.4 so that low-intensity action units are not missed;
step 3: performing visual output according to the face action units detected in step 2, and calculating and outputting the micro-expression expressed by the face;
step 3.1: judging which action units are contained in the input face according to the probability value of each action unit in the detection layer of step 2 and the threshold range; when the probability value exceeds the judgment threshold, the class name of the action unit is read from the detection layer, the absolute pixel position of the action unit on the image is calculated from the face position and the relative position of the action unit, the absolute position of the action unit is drawn on the image with a rectangular frame, and the name of the action unit is drawn at the same time;
step 3.2: outputting the micro-expression state of the current face according to the combination of action units appearing in the face;
step 3.3: outputting the micro-expression state of the face according to the recognition result of the face action units in the current image.
Further, in step 1.1, the face is marked by first computing the facial feature points and then, according to the definitions of the different action units and the corresponding facial muscle changes, defining the local rectangular region positions of the different action units as well as the rectangular region position of the face.
Further, step 1.1 comprises the steps of:
step 1.1.1: detecting the face and the positions of its feature points using the supervised descent method, and numbering each facial feature point: starting from the upper-left contour of the face, the facial contour points are numbered first, then the feature points on the eyebrows and eyes from left to right, then the feature points of the nose bridge and nostril wings, and finally all the feature points of the mouth;
step 1.1.2: defining the face region and the action unit regions based on the positions of the facial feature points, the action unit regions reflecting the movements of the forehead, eyebrows, eyes, nose, cheeks, mouth and jaw;
step 1.1.3: calculating the face region from the feature point positions and using it as the sample region for model learning.
Further, in step 3.2, the micro-expression states of the face include happy, depressed, surprised, afraid, angry, disgusted and neutral expressions.
Further, in step 1.2, each convolution layer performs a convolution operation on the feature images of the previous layer with a group of convolution parameter templates and outputs as many feature images as there are templates; the activation function of the convolution layers is a leaky linear rectification (leaky ReLU) function:

φ(x) = x, for x > 0;  φ(x) = αx, for x ≤ 0    (1)

in the formula (1), x is the input value of the activation function, φ(x) is the output value of the activation function, and α is the small leak coefficient applied to negative inputs;
for the shortcut layers, in order to weaken the influence of the gradient-vanishing problem during back-propagation, a shortcut layer is added after every few convolution layers; that is, the initial input is added to the output of the three convolution layers. The shortcut layer is formalized as equations (2) and (3):

F(x) = φ(w3 * φ(w2 * φ(w1 * x)))    (2)
y = F(x) + x    (3)

in the formula (2) and the formula (3), w3, w2 and w1 are the convolution template parameters of the 3rd, 2nd and 1st convolution layers, x is the input of the convolution layers, F(x) is the output of the three-layer convolution operation, and y is the output of the shortcut layer after the input has been superposed.
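As an illustrative sketch only (PyTorch is assumed as the framework; the channel count, 3x3 kernels and leak coefficient 0.1 are assumptions not given in the text), a shortcut block following equations (1) to (3) could be written as:

import torch
import torch.nn as nn

class ShortcutBlock(nn.Module):
    """Three convolution layers with leaky-ReLU activations (eq. 2)
    plus an identity shortcut added to the result (eq. 3)."""
    def __init__(self, channels: int, leak: float = 0.1):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(leak)      # leaky linear rectification, eq. (1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.act(self.conv1(x))        # phi(w1 * x)
        f = self.act(self.conv2(f))        # phi(w2 * ...)
        f = self.act(self.conv3(f))        # F(x) = phi(w3 * ...)
        return f + x                       # y = F(x) + x

# usage: y = ShortcutBlock(64)(torch.randn(1, 64, 32, 32))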
Compared with the prior art, the invention has the following advantages:
1. In the design of the deep neural network of the proposed facial micro-expression action unit detection method, convolution layers are used to learn low-level geometric features, shortcut layers are used to alleviate the problem of vanishing gradients, and several detection layers at different scales are designed to learn the classification and detection parameters of the different action units; the multi-scale detection layers improve detection accuracy and avoid missing valid action units, yielding a high-accuracy action unit detection method based on a deep neural network;
2. In the proposed method the face is aligned only in the network parameter learning stage, and no face alignment is needed in the recognition stage, which greatly improves detection efficiency.
Drawings
Fig. 1 is a schematic diagram of distribution of feature points of a human face in an embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention may be more clearly understood, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments; the embodiments of the present application and the features of the embodiments may be combined with each other provided that they do not conflict.
Example 1
A face micro-expression action unit detection method based on a deep convolutional neural network comprises the following steps:
step 1: designing a deep convolutional neural network structure, taking a face sample data set as input, taking an automatically marked micro expression action unit as output, training the network structure, and learning appropriate network parameters;
step 1.1: marking a face and rectangular areas of different action units in the face according to images in the sample data set;
step 1.2: designing and implementing a deep convolutional neural network, wherein the neural network comprises convolution layers, shortcut layers and action unit detection layers, and learning the region information of the face and of its different expression action units to obtain the network forward-propagation parameters; each convolution layer performs a convolution operation on the feature images of the previous layer with a group of convolution parameter templates and outputs as many feature images as there are templates; the activation function of the convolution layers is a leaky linear rectification (leaky ReLU) function, where x is the input value of the activation function, φ(x) is the output value of the activation function, and α is the small leak coefficient applied to negative inputs:

φ(x) = x, for x > 0;  φ(x) = αx, for x ≤ 0    (1)
For the shortcut layers, in order to weaken the influence of the gradient-vanishing problem during back-propagation, a shortcut layer is added after every few convolution layers; that is, the initial input is added to the output of the three convolution layers. The shortcut layer is formalized as equations (2) and (3):

F(x) = φ(w3 * φ(w2 * φ(w1 * x)))    (2)
y = F(x) + x    (3)

in the formula (2) and the formula (3), w3, w2 and w1 are the convolution template parameters of the 3rd, 2nd and 1st convolution layers, x is the input of the convolution layers, F(x) is the output of the three-layer convolution operation, and y is the output of the shortcut layer after the input has been superposed.
The detection layer outputs the action unit detection result of the method. Unlike most convolutional networks, the method does not use a fully connected layer for feature classification; the output of the last convolution layer serves as the input of the detection layer, and the activation function of the detection layer is the logistic function. The output at each position comprises seventy-five neurons in total, grouped according to the action units. The first neuron indicates whether a face is detected at the corresponding pixel position of the feature image: 1 if detected, 0 otherwise. The following four neurons give the absolute position information of the face on the image, namely the coordinates of the top-left vertex and the length and width of the rectangular area. The remaining seventy neurons are divided into fourteen groups that record the information of the fourteen action units; each action unit records its detected probability value and its positional relation to the face, the position information being the horizontal and vertical coordinate offsets from the top-left point of the face region expressed relative to the face width and height, together with the width and height ratios relative to the face region.
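As an illustrative decoding sketch (NumPy assumed; the field ordering below follows the layout described above, but the exact in-memory ordering of the seventy-five values is an assumption), the output at one feature-map position could be unpacked as follows:

import numpy as np

AU_NAMES = ["AU1", "AU2", "AU4", "AU5", "AU6", "AU7", "AU9",
            "AU12", "AU15", "AU17", "AU20", "AU23", "AU25", "AU26"]

def decode_detection_vector(v: np.ndarray):
    """Unpack one 75-value detection output:
    [face_flag, face_x, face_y, face_w, face_h] + 14 x [prob, dx, dy, rw, rh],
    where dx, dy are offsets from the face top-left point relative to the face
    width/height, and rw, rh are width/height ratios relative to the face box."""
    assert v.shape == (75,)
    face_present = v[0] >= 0.5          # first neuron: 1 if a face is detected, else 0
    face_box = v[1:5]                   # top-left x, y, width, height (absolute)
    aus = {}
    for i, name in enumerate(AU_NAMES):
        prob, dx, dy, rw, rh = v[5 + 5 * i: 10 + 5 * i]
        aus[name] = {"prob": float(prob),
                     "rel_box": (float(dx), float(dy), float(rw), float(rh))}
    return face_present, face_box, aus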
In the method, the number of convolution layers and shortcut layers can be increased as far as the available hardware allows, i.e. the network depth is not limited, while the detection layer is in principle a single output layer. To improve the detection accuracy of the action units, two detection layers can be used, with convolution layers and shortcut layers between them, forming a multi-scale detection setup. The network hierarchy scheme is set as follows: fifteen rounds of convolution layers plus shortcut layers are applied, followed by three convolution layers and one detection layer for the first output; then, starting from the shortcut layer after the most recent convolution round, four further rounds of three convolution layers plus one shortcut layer are applied, followed by another three convolution layers and the second detection layer; the sampling stride of the convolution layers and the filter sizes are set as required;
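One possible arrangement of this layer scheme is sketched below, purely for illustration (PyTorch assumed; the channel width, kernel sizes, strides and leak coefficient are placeholders, since the patent leaves the sampling interval and filter sizes to be set according to requirements):

import torch
import torch.nn as nn

def conv(cin, cout, k=3, s=1):
    # convolution followed by leaky linear rectification (eq. 1); sizes are placeholders
    return nn.Sequential(nn.Conv2d(cin, cout, k, stride=s, padding=k // 2),
                         nn.LeakyReLU(0.1))

class ShortcutBlock(nn.Module):
    # three convolutions whose output is superposed with the block input (eqs. 2-3)
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(conv(c, c), conv(c, c), conv(c, c))
    def forward(self, x):
        return self.body(x) + x

class AUDetector(nn.Module):
    """Illustrative two-scale detector: rounds of convolution + shortcut layers feed a
    first detection head, further rounds feed a second head; each head emits 75 values
    per feature-map position (face flag, face box, 14 x AU fields) through a logistic
    (sigmoid) activation."""
    def __init__(self, c=64, rounds1=15, rounds2=4):
        super().__init__()
        self.stem = conv(3, c)
        self.stage1 = nn.Sequential(*[ShortcutBlock(c) for _ in range(rounds1)])
        self.head1 = nn.Sequential(conv(c, c), conv(c, c), conv(c, c),
                                   nn.Conv2d(c, 75, 1), nn.Sigmoid())
        self.stage2 = nn.Sequential(*[ShortcutBlock(c) for _ in range(rounds2)])
        self.head2 = nn.Sequential(conv(c, c), conv(c, c), conv(c, c),
                                   nn.Conv2d(c, 75, 1), nn.Sigmoid())

    def forward(self, x):
        x = self.stage1(self.stem(x))
        out1 = self.head1(x)    # first-scale detection output
        x = self.stage2(x)
        out2 = self.head2(x)    # second-scale detection output
        return out1, out2

# usage: o1, o2 = AUDetector()(torch.randn(1, 3, 128, 128))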
step 1.3: taking the sample data in the face sample data set as the neural network input data, where each sample comprises the face image data and the action unit annotation xml data; the annotated action data are used to iteratively correct the result data predicted by the network, and the choice of loss function has a great influence on the convergence speed of the iteration;
step 2: realizing facial expression action unit detection according to the network parameters learned in step 1: the image to be detected is taken as the input of the neural network of step 1.2, the convolution layers and detection layers are computed for the input image using the network parameters, and whether a face is present in the image is judged from the output of the detection layer; if no face is present, no action unit result is valid; if a face is present, the region positions identified for the action units are corrected according to the positional relationship between the face and the different action units; in this embodiment the judgment threshold for the probability value of an action unit is set to 0.4, so that low-intensity action units are not missed;
step 3: performing visual output according to the face action units detected in step 2, and calculating and outputting the micro-expression expressed by the face;
step 3.1: judging which action units are contained in the input face according to the probability value of each action unit in the detection layer of step 2 and the threshold range; when the probability value exceeds the judgment threshold, the class name of the action unit is read from the detection layer, the absolute pixel position of the action unit on the image is calculated from the face position and the relative position of the action unit, the absolute position of the action unit is drawn on the image with a rectangular frame, and the name of the action unit is drawn at the same time (a sketch of this step follows step 3.3 below);
step 3.2: outputting the micro-expression state of the current face according to the combination of action units appearing in the face;
step 3.3: outputting the micro-expression state of the face according to the recognition result of the face action units in the current image.
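A minimal sketch of the decision and visualization of steps 2 and 3.1 (OpenCV and NumPy assumed; the 0.4 threshold and the face-relative coordinate layout follow the description above, while the helper names and drawing parameters are illustrative):

import cv2
import numpy as np

AU_THRESHOLD = 0.4   # probability judgment threshold from step 2

def draw_action_units(image, face_box, aus, threshold=AU_THRESHOLD):
    """Convert AU boxes from face-relative coordinates to absolute pixels and draw them.
    face_box = (fx, fy, fw, fh); each AU carries (prob, dx, dy, rw, rh) relative to the face."""
    fx, fy, fw, fh = face_box
    for name, info in aus.items():
        if info["prob"] < threshold:      # skip AUs below the judgment threshold
            continue
        dx, dy, rw, rh = info["rel_box"]
        x = int(fx + dx * fw)             # offsets are relative to face width/height
        y = int(fy + dy * fh)
        w = int(rw * fw)                  # box size is a ratio of the face box
        h = int(rh * fh)
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 1)
        cv2.putText(image, name, (x, max(y - 2, 0)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.4, (0, 255, 0), 1)
    return image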
In step 1.1 of this embodiment, to achieve automatic labelling, the facial feature points are computed first and then, according to the definitions of the different action units and the corresponding facial muscle changes, the local rectangular regions of the different action units and the rectangular region of the face are defined. This is implemented as follows. First, the supervised descent method is used to detect the face and the positions of its feature points; 66 facial feature points are detected, whose distribution and numbering are shown in fig. 1. Taking the upper-left contour as the starting point, the facial contour points are numbered first, then the feature points on the eyebrows and eyes from left to right, then the feature points of the nose bridge and nostril wings, and finally all the feature points of the mouth. Second, the face region and the action unit regions are defined from the feature point positions: the face region and 14 action unit regions are computed for each sample image. The 14 action units are all taken from Ekman's action unit descriptions of the face, namely AU1, AU2, AU4, AU5, AU6, AU7, AU9, AU12, AU15, AU17, AU20, AU23, AU25 and AU26, reflecting the movements of the forehead, eyebrows, eyes, cheeks, nose, mouth and jaw. According to the local shape changes of the face involved in each action unit and the positions of the feature points, the shape regions of the facial action units are defined as follows:
the action unit described in AU1 is characterized by upward tilting of the middle eyebrow, and includes not only the main action of the eyebrow but also the variation of forehead wrinkles, and its local area is defined as follows: the X coordinate of the feature point No. 17 is taken as the X coordinate of the top left vertex of the local rectangular region, the Y coordinate of the feature point No. 19 is taken as the Y coordinate of the top left vertex, the X coordinate of the feature point No. 26 is taken as the X coordinate of the bottom right vertex of the rectangular region, and the Y coordinate of the feature point No. 41 is taken as the Y coordinate of the bottom right vertex of the rectangular region.
The action unit described in AU2 is mainly characterized by pulling the outer part of the eyebrow upwards, and has a local area similar to AU1, and the definition method of the two areas is the same.
The action unit described in AU4 is mainly the lowering of the eyebrows, and its local region is defined as follows: the X coordinate of feature point No. 17 is taken as the X coordinate of the top-left vertex of the rectangular region; the Y coordinate of feature point No. 19, offset by the difference between the Y coordinates of feature points No. 30 and No. 27, is taken as the Y coordinate of the top-left vertex; the X coordinate of feature point No. 26 is taken as the X coordinate of the bottom-right vertex; and the Y coordinate of feature point No. 41 is taken as the Y coordinate of the bottom-right vertex.
The action unit described in AU5 is mainly characterized by the widening of the eyelid fissure, so its local region mainly covers the area around the eyes and is defined as follows: the X coordinate of feature point No. 17 is taken as the X coordinate of the top-left vertex of the rectangular region; the Y coordinate of feature point No. 19, moved up by two pixels, as the Y coordinate of the top-left vertex; the X coordinate of feature point No. 26 as the X coordinate of the bottom-right vertex; and the Y coordinate of feature point No. 41, shifted by five pixels, as the Y coordinate of the bottom-right vertex.
The action unit described in AU6 mainly involves contraction of the muscles around the eyes, pulling the skin from the temples and cheeks towards the eyes, and its local region is defined as follows: the X coordinate of feature point No. 0 is taken as the X coordinate of the top-left vertex of the rectangular region, the Y coordinate of feature point No. 19 as the Y coordinate of the top-left vertex, the X coordinate of feature point No. 16 as the X coordinate of the bottom-right vertex, and the Y coordinate of feature point No. 33 as the Y coordinate of the bottom-right vertex.
The action unit described in AU7 mainly involves eyelid movement, changing the area covered by the upper and lower eyelids, and its local region is defined as follows: the X coordinate of feature point No. 17 is taken as the X coordinate of the top-left vertex of the rectangular region; the Y coordinate of feature point No. 38, moved up by five pixels, as the Y coordinate of the top-left vertex; the X coordinate of feature point No. 26 as the X coordinate of the bottom-right vertex; and the Y coordinate of feature point No. 40, moved down by five pixels, as the Y coordinate of the bottom-right vertex.
The action unit described in AU9 is mainly nose wrinkling: the skin along both sides of the nose is pulled up towards the nose root, forming folds on both sides of the nose and across the nose root and causing folds around the eyelids. Its local region is defined as follows: the X coordinate of feature point No. 36 is taken as the X coordinate of the top-left vertex of the rectangular region, the Y coordinate of feature point No. 22 as the Y coordinate of the top-left vertex, the X coordinate of feature point No. 45 as the X coordinate of the bottom-right vertex, and the Y coordinate of feature point No. 51 as the Y coordinate of the bottom-right vertex.
The action unit described in AU12 mainly consists of the mouth curving upwards, resulting in a deepening of the nasolabial folds and a lifting of the lower triangular region. Its local region is defined as follows: the X coordinate of feature point No. 2 is taken as the X coordinate of the top-left vertex of the rectangular region, the Y coordinate of feature point No. 28 as the Y coordinate of the top-left vertex, the X coordinate of feature point No. 14 as the X coordinate of the bottom-right vertex, and the Y coordinate of feature point No. 57 as the Y coordinate of the bottom-right vertex.
The action unit described in AU15 is dominated by the lips stretching and the mouth corners sinking, while the skin below the mouth changes under the pull of the mouth. Its local region is defined as follows: the X coordinate of feature point No. 4 is taken as the X coordinate of the top-left vertex of the rectangular region, the Y coordinate of feature point No. 3 as the Y coordinate of the top-left vertex, the X coordinate of feature point No. 12 as the X coordinate of the bottom-right vertex, and the Y coordinate of feature point No. 5 as the Y coordinate of the bottom-right vertex.
The action unit described in AU17 is mainly characterized by the lower lip pushing upwards, which causes the chin to wrinkle. Its local region is defined as follows: the X coordinate of feature point No. 4 is taken as the X coordinate of the top-left vertex of the rectangular region, the Y coordinate of feature point No. 3 as the Y coordinate of the top-left vertex, the X coordinate of feature point No. 12 as the X coordinate of the bottom-right vertex, and the Y coordinate of feature point No. 5 as the Y coordinate of the bottom-right vertex.
The action unit described in AU20 is mainly the lateral stretching of the lips: the mouth is elongated and flattened, which pulls the skin beyond the corners of the mouth. Its local region is defined as follows: the X coordinate of feature point No. 3 is taken as the X coordinate of the top-left vertex of the rectangular region, the Y coordinate of feature point No. 30 as the Y coordinate of the top-left vertex, the X coordinate of feature point No. 13 as the X coordinate of the bottom-right vertex, and the Y coordinate of feature point No. 10 as the Y coordinate of the bottom-right vertex.
The action unit described in AU23 is the tightening of the lips, and its local region is defined as follows: the X coordinate of feature point No. 48, shifted left by five pixels, is taken as the X coordinate of the top-left vertex of the rectangular region; the Y coordinate of feature point No. 33 as the Y coordinate of the top-left vertex; the X coordinate of feature point No. 54, shifted right by five pixels, as the X coordinate of the bottom-right vertex; and the Y coordinate of feature point No. 10 as the Y coordinate of the bottom-right vertex.
The action unit described in AU25 is mainly the parting of the lips with exposure of the teeth or gums, and its local region is defined as follows: the X coordinate of feature point No. 48 is taken as the X coordinate of the top-left vertex of the rectangular region, the Y coordinate of feature point No. 3 as the Y coordinate of the top-left vertex, the X coordinate of feature point No. 54 as the X coordinate of the bottom-right vertex, and the Y coordinate of feature point No. 5 as the Y coordinate of the bottom-right vertex.
The action unit described in AU26 is mainly jaw relaxation, leading to the lips parting and the teeth separating; its local region is defined in the same way as AU25. In order to learn the face region, which is computed from the feature point positions and used as the sample region for model learning, the face region is defined as follows: the X coordinate of feature point No. 0 is taken as the X coordinate of the top-left vertex of the rectangular region; the Y coordinate of feature point No. 19, moved up by the difference between the Y coordinates of feature points No. 33 and No. 27, is taken as the Y coordinate of the top-left vertex; the X coordinate of feature point No. 16 is taken as the X coordinate of the bottom-right vertex; and the Y coordinate of feature point No. 8 is taken as the Y coordinate of the bottom-right vertex. The local regions of the different action units are marked fully automatically, and the marking result of each image is stored in an xml file that contains not only the names and region coordinates of the action units in the image but also the region coordinates of the face. A computational sketch of these region definitions follows below.
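The region definitions above reduce to simple coordinate arithmetic over the numbered feature points. The sketch below (NumPy assumed; the function names are illustrative) computes the AU1 region and the face sample region from a 66x2 array of feature points numbered as in fig. 1; the other action unit regions follow the same pattern:

import numpy as np

def au1_region(pts: np.ndarray):
    """AU1 region: top-left = (x of point 17, y of point 19),
    bottom-right = (x of point 26, y of point 41)."""
    x1, y1 = pts[17, 0], pts[19, 1]
    x2, y2 = pts[26, 0], pts[41, 1]
    return x1, y1, x2, y2

def face_region(pts: np.ndarray):
    """Face sample region: the top-left y is point 19 moved up by the vertical
    distance between points 33 and 27; bottom-right = (x of point 16, y of point 8)."""
    x1 = pts[0, 0]
    y1 = pts[19, 1] - (pts[33, 1] - pts[27, 1])
    x2, y2 = pts[16, 0], pts[8, 1]
    return x1, y1, x2, y2

# pts = np.asarray(landmarks)  # shape (66, 2), as produced by the supervised descent method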
In step 3.2 of this example, the micro-expression states are divided into seven, respectively happy, depressed, surprised, afraid, angry, disgust and neutral expressions:
the evaluation criterion for happy expressions is that the action units on the face must include AU12, and in combination with AU6, the intensity is large.
The evaluation criterion for the depressed expressions is that the action unit on the human face must include AU15 and one of AU1 or AU4, and the inclusion is strong.
The evaluation criterion for the surprise expression is that the action units on the face must include AU26, and must include one of AU1, AU2 or AU5, and the inclusion is strong.
The evaluation criterion for fear of expression is that the action units on the face must include one of AU20 or AU26, and must include one of AU1 or AU2 or AU4 or AU5 or AU7, and the intensity is high when the action units are included.
The evaluation criterion of the anger expression is that the action units on the face must include AU23 and must include one of AU4, AU5 or AU7, and the intensity is high if the action units are included.
The evaluation criterion for aversive expression is that for fear expression, the action unit on the face must include AU9 and must include one of AU15 or AU16, and the intensity is large when the action unit is included.
In the case of not the above six expressions, the neutral expression is assumed.
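The seven evaluation criteria can be expressed as a simple rule table over the set of detected action unit names. The sketch below assumes the criteria are checked in the order listed above (the text does not specify a tie-breaking order) and omits the intensity handling:

def classify_micro_expression(detected: set) -> str:
    """Map a set of detected AU names (e.g. {"AU12", "AU6"}) to one of the
    seven micro-expression states according to the criteria above."""
    def any_of(*names):
        return any(n in detected for n in names)

    if "AU12" in detected:                                          # happy (AU6 raises intensity)
        return "happy"
    if "AU15" in detected and any_of("AU1", "AU4"):                 # depressed
        return "depressed"
    if "AU26" in detected and any_of("AU1", "AU2", "AU5"):          # surprised
        return "surprised"
    if any_of("AU20", "AU26") and any_of("AU1", "AU2", "AU4", "AU5", "AU7"):
        return "afraid"
    if "AU23" in detected and any_of("AU4", "AU5", "AU7"):          # angry
        return "angry"
    if "AU9" in detected and any_of("AU15", "AU16"):                # disgusted
        return "disgusted"
    return "neutral"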
The present invention is not limited to the above-described embodiments, which are described in the specification only to illustrate the principle of the invention; various changes and modifications may be made within the scope of the invention as claimed without departing from its spirit and scope. The scope of the invention is defined by the appended claims.

Claims (5)

1. The method for detecting the human face micro-expression action unit based on the deep convolutional neural network is characterized by comprising the following steps of:
step 1: designing a deep convolutional neural network structure, taking a face sample data set as input, taking an automatically marked micro expression action unit as output, training the network structure, and learning appropriate network parameters;
step 1.1: marking a face and rectangular areas of different action units in the face according to images in the sample data set;
step 1.2: designing and implementing a deep convolutional neural network, wherein the neural network comprises a convolutional layer, a shortcut layer and an action unit detection layer so as to learn the regional information of a face and different expression action units of the face and acquire a network forward propagation parameter;
step 1.3: taking sample data in a face sample data set as neural network input data, wherein the sample data comprises face image data and action unit mark xml data;
step 2: realizing facial expression action unit detection according to the network parameters learned in step 1: the image to be detected is taken as the input of the neural network of step 1.2, the convolution layers and detection layers are computed for the input image using the network parameters, and whether a face is present in the image is judged from the output of the detection layer; if no face is present, no action unit result is valid; if a face is present, the region positions identified for the action units are corrected according to the positional relationship between the face and the different action units;
step 3: performing visual output according to the face action units detected in step 2, and calculating and outputting the micro-expression expressed by the face;
step 3.1: judging which action units are contained in the input face according to the probability value of each action unit in the detection layer of step 2 and the threshold range; when the probability value exceeds the judgment threshold, the class name of the action unit is read from the detection layer, the absolute pixel position of the action unit on the image is calculated from the face position and the relative position of the action unit, the absolute position of the action unit is drawn on the image with a rectangular frame, and the name of the action unit is drawn at the same time;
step 3.2: outputting the micro-expression state of the current face according to the combination of action units appearing in the face;
step 3.3: outputting the micro-expression state of the face according to the recognition result of the face action units in the current image.
2. The method for detecting the facial micro-expression action unit based on the deep convolutional neural network as claimed in claim 1, wherein in step 1.1 the face is marked by first computing the facial feature points and then, according to the definitions of the different action units and the corresponding facial muscle changes, defining the local rectangular region positions of the different action units and the rectangular region position of the face.
3. The method for detecting the facial micro-expression action unit based on the deep convolutional neural network as claimed in claim 1, wherein the step 1.1 comprises the following steps:
step 1.1.1: detecting the face and the positions of its feature points using the supervised descent method, and numbering each feature point of the face;
step 1.1.2: according to the positions of the characteristic points of the human face, defining a human face and an action unit area based on the positions of the characteristic points, wherein the action unit area can reflect the actions of the forehead, the eyebrow, the eyes, the nose, the cheeks, the mouth and the jaw of the face;
step 1.1.3: and calculating a face region as a sample region by using the positions of the feature points for model learning.
4. The method for detecting the facial micro-expression action unit based on the deep convolutional neural network of claim 1, wherein in step 3.2 the micro-expression states of the face comprise happy, depressed, surprised, afraid, angry, disgusted and neutral expressions.
5. The method for detecting the facial micro-expression action unit based on the deep convolutional neural network as claimed in claim 1, wherein in step 1.2 each convolution layer performs a convolution operation on the feature images of the previous layer with a group of convolution parameter templates and outputs as many feature images as there are templates, and the activation function of the convolution layers is a leaky linear rectification function:

φ(x) = x, for x > 0;  φ(x) = αx, for x ≤ 0    (1)

in the formula (1), x is the input value of the activation function, φ(x) is the output value of the activation function, and α is the small leak coefficient applied to negative inputs;
for the shortcut layers, in order to weaken the influence of the gradient-vanishing problem during back-propagation, a shortcut layer is added after every few convolution layers, that is, the initial input is added to the output of the three convolution layers, and the shortcut layer is formalized as equations (2) and (3):

F(x) = φ(w3 * φ(w2 * φ(w1 * x)))    (2)
y = F(x) + x    (3)

in the formula (2) and the formula (3), w3, w2 and w1 are the convolution template parameters of the 3rd, 2nd and 1st convolution layers, x is the input of the convolution layers, F(x) is the output of the three-layer convolution operation, and y is the output of the shortcut layer after the input has been superposed.
CN201811076388.8A 2018-09-14 2018-09-14 Face micro-expression action unit detection method based on deep convolutional neural network Active CN109344744B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811076388.8A CN109344744B (en) 2018-09-14 2018-09-14 Face micro-expression action unit detection method based on deep convolutional neural network

Publications (2)

Publication Number Publication Date
CN109344744A CN109344744A (en) 2019-02-15
CN109344744B true CN109344744B (en) 2021-10-29

Family

ID=65305306

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811076388.8A Active CN109344744B (en) 2018-09-14 2018-09-14 Face micro-expression action unit detection method based on deep convolutional neural network

Country Status (1)

Country Link
CN (1) CN109344744B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902656B (en) * 2019-03-12 2020-10-23 吉林大学 Method and system for identifying facial action unit
CN110147822B (en) * 2019-04-16 2021-04-02 北京师范大学 Emotion index calculation method based on face action unit detection
CN111081375B (en) * 2019-12-27 2023-04-18 北京深测科技有限公司 Early warning method and system for health monitoring
CN111209867A (en) * 2020-01-08 2020-05-29 上海商汤临港智能科技有限公司 Expression recognition method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10515393B2 (en) * 2016-06-30 2019-12-24 Paypal, Inc. Image data detection for micro-expression analysis and targeted data services

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105654049A (en) * 2015-12-29 2016-06-08 中国科学院深圳先进技术研究院 Facial expression recognition method and device
CN107273876A (en) * 2017-07-18 2017-10-20 山东大学 A kind of micro- expression automatic identifying method of ' the grand micro- transformation models of to ' based on deep learning
CN107679526A (en) * 2017-11-14 2018-02-09 北京科技大学 A kind of micro- expression recognition method of face
CN108304826A (en) * 2018-03-01 2018-07-20 河海大学 Facial expression recognizing method based on convolutional neural networks

Also Published As

Publication number Publication date
CN109344744A (en) 2019-02-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant