CN112766159A - Cross-database micro-expression identification method based on multi-feature fusion - Google Patents

Cross-database micro-expression identification method based on multi-feature fusion

Info

Publication number
CN112766159A
CN112766159A
Authority
CN
China
Prior art keywords
micro
optical flow
frame
feature
face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110073665.5A
Other languages
Chinese (zh)
Inventor
唐宏
朱龙娇
范森
刘红梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202110073665.5A priority Critical
Publication of CN112766159A publication Critical
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of computer vision and particularly relates to a cross-database micro-expression recognition method based on multi-feature fusion, which comprises the following steps: acquiring the start frame and the vertex frame of samples in two data sets; cropping the face region of the start frame and the vertex frame to obtain face region images; calculating a face key point feature map and optical flow features from the preprocessed start frame and vertex frame, and calculating the direction, magnitude and optical strain of the optical flow; binning the optical flow direction feature map and weighting it by the magnitude feature map to obtain a HOOF feature map; stacking the face key point feature map, the optical strain feature map and the HOOF feature map into a three-dimensional feature block and inputting it into a three-dimensional convolutional neural network for feature learning to obtain the micro-expression recognition result. The method allows a shallow convolutional neural network to effectively learn the fused multi-feature representation of the micro-expression, thereby improving the accuracy and efficiency of facial micro-expression recognition.

Description

Cross-database micro-expression identification method based on multi-feature fusion
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a cross-database micro-expression identification method based on multi-feature fusion.
Background
Human facial expressions are mainly classified into macro-expressions and micro-expressions. Facial expressions are a form of non-verbal communication produced by the contraction of facial muscles in an emotional state, and different muscle movements and patterns ultimately reflect different types of emotion. A macro-expression, however, does not reliably reveal a person's emotional state, because it is easily faked. A micro-expression is a transient facial expression that cannot be controlled by the nervous system; it reflects a person's true emotion and cannot be concealed. Micro-expressions have many potential applications in psychology, public safety, clinical medicine, criminal interrogation and other fields. However, a micro-expression lasts only about 0.04 s to 0.2 s and its intensity is weak, so it is difficult to recognize with the naked eye; even professionally trained observers achieve a low recognition rate and spend a great deal of time. Research on computer-based automatic recognition of micro-expressions has therefore become a focus.
At present, automatic micro-expression recognition mainly relies on traditional machine learning methods and deep learning methods. Traditional micro-expression recognition algorithms use different feature extractors to extract micro-expression features, feed the extracted features into a classifier, and let the classifier learn from training samples to recognize micro-expressions. However, the traditional feature extraction pipeline is complex and the feature descriptors must be designed by hand; because micro-expression motion is local, traditional feature extraction methods struggle to fully capture the subtle changes of the face; at the same time, the feature computation is expensive, and the choice of classifier has a large influence on classification performance. The traditional micro-expression recognition methods therefore have clear limitations.
In recent years, deep neural networks have made remarkable progress in facial micro-expression recognition, and the robustness of deep learning promises better performance than traditional hand-crafted methods. However, deep learning requires a large amount of data, while micro-expression data are relatively scarce, which makes recognition difficult.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a cross-database micro-expression recognition method based on multi-feature fusion, which comprises the following steps: acquiring a facial micro-expression image and preprocessing it; calculating three feature maps from the preprocessed face image and fusing them; and inputting the fused feature maps into a trained micro-expression recognition model to obtain the micro-expression recognition result;
the process of training the micro-expression recognition model comprises the following steps:
S1: acquiring the start frame and the vertex frame of each sample in the CASMEII and SAMM data sets, together with the emotion label corresponding to each sample;
S2: cropping the face region of the start frame and the vertex frame to obtain images containing only the face region;
S3: normalizing the face region images; the normalization comprises scale normalization and gray-level normalization;
S4: calculating the face key point feature map LMF from the normalized start frame and vertex frame of the face region images;
S5: calculating the optical flow features of the face region images from the normalized start frame and vertex frame; calculating the optical flow direction feature map p, the magnitude feature map m and the optical strain feature map ε from the optical flow features; the optical flow features comprise horizontal and vertical optical flow features;
S6: weighting the optical flow direction feature map p by the magnitude feature map m to obtain the HOOF feature map;
S7: stacking the face key point feature map LMF, the optical strain feature map ε and the HOOF feature map into a three-dimensional feature block, and inputting the block into a three-dimensional convolutional neural network for feature learning to obtain the micro-expression recognition result;
S8: calculating the loss function of the micro-expression recognition model from the recognition result, and adjusting the model parameters continuously; training is finished when the value of the loss function reaches its minimum.
Preferably, the process of acquiring the start frame and the vertex frame of the CASMEII data set and the SAMM data set includes: given the indices of the start frame and the vertex frame of each micro-expression video sequence in the data set, the two frames and the emotion label are obtained directly from the indices; the start frame is the frame in which the observed person's micro-expression first appears, and the vertex frame is the frame in which the intensity of the micro-expression is greatest.
Preferably, cropping the face regions of the start frame and the vertex frame includes: detecting the face region of the start frame and the vertex frame with the face detector of the dlib library, detecting 68 face key points of the start frame and the vertex frame with a facial landmark detector, and obtaining the position and size of the cropping region from the face key points so as to obtain an image containing only the face region.
Preferably, the formula for calculating the face key point feature map LMF is as follows:
LMF_t(i, j) = ||p(i, t) - p(j, t)||_2 - ||p(i, t - Δt) - p(j, t - Δt)||_2
Preferably, the process of calculating the optical flow features and the corresponding feature maps of the face region images includes:
Step 1: computing the optical flow between the face region images of the start frame and the vertex frame with the TV-L1 optical flow method to obtain the horizontal and vertical optical flows of the optical flow field;
Step 2: converting the horizontal and vertical optical flows into an optical flow vector p = (u, v), converting the optical flow vector into polar coordinates, and calculating the direction θ(x, y) and magnitude ρ(x, y) of the optical flow motion from the polar coordinates;
Step 3: calculating the optical strain feature map ε from the optical flow vector p = (u, v);
Step 4: calculating the magnitude feature map m from the optical strain feature map ε.
Further, the direction θ(x, y) and magnitude ρ(x, y) of the optical flow motion are calculated as:

ρ(x, y) = √( u(x, y)² + v(x, y)² )

θ(x, y) = arctan( v(x, y) / u(x, y) )
Further, the formula for calculating the optical strain feature map ε is:

ε = (1/2) [ ∇u + (∇u)^T ]
Preferably, the process of obtaining the HOOF feature map includes: equally dividing the optical flow magnitude feature map and the optical flow direction feature map into 5 × 5 non-overlapping blocks; binning the direction θ(x, y) ∈ [-π, π] in each block and weighting each bin by the corresponding magnitude ρ(x, y) to obtain the HOOF feature map; wherein the range of the c-th histogram bin is:

[ -π + 2π(c - 1)/C , -π + 2πc/C )
preferably, the three-dimensional convolutional neural network comprises: the three-dimensional convolutional neural network comprises three parallel streams, a convolutional layer, a maximum pooling layer, an FC full-link layer and a softmax layer; each of the three parallel flows consists of a convolution layer and a maximum pooling layer, the number of convolution kernels of the convolution layer of each flow is 3, 5 and 8 respectively, and the sizes of the convolution kernels are 3 x 3; adopting same padding to keep the input and output sizes the same, and stride being 1; inputting the three-dimensional characteristic block of 28 multiplied by 3 into a three-dimensional convolution neural network, wherein the outputs obtained by the three parallel streams are respectively data of 28 multiplied by 3, 28 multiplied by 5 and 28 multiplied by 8; the data output by the convolution layer respectively passes through the maximum pooling layers of 2 × 2 and stride 2, and data of 14 × 14 × 3, 14 × 14 × 5 and 14 × 14 × 8 are output; connecting the outputs of the three streams to form a 14 × 14 × 16 feature block, and outputting a 7 × 16-dimensional feature through a maximum pooling layer of 2 × 2 and stride 2, and flattening the feature into a 784 × 1-dimensional feature vector; and (4) subjecting the feature vector of 784 x 1 dimension to softmax function to micro-expression recognition classification.
Preferably, the loss function of the micro-expression recognition model is:

L = -(1/N) Σ_{i=1..N} Σ_{c=1..M} y_ic · log(p_ic)
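For illustration, a minimal NumPy sketch of this multi-class cross-entropy loss is given below; the function name and the variable names (y_true for the one-hot emotion labels, y_pred for the softmax outputs) are illustrative assumptions and are not part of the patent text.

```python
import numpy as np

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    """Multi-class cross-entropy: L = -(1/N) * sum_i sum_c y_ic * log(p_ic).
    y_true: (N, M) one-hot emotion labels; y_pred: (N, M) softmax probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
```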
aiming at the problems of short duration, weak intensity and small sample data amount of micro expression, the invention provides a traditional micro expression recognition method combined with a deep learning technology for micro expression recognition on the basis of a cross-database experiment. A plurality of feature graphs of the micro expression are obtained through calculation by a traditional method, and the plurality of features are fused and sent to a convolutional neural network for further feature learning, so that the defect that the micro expression is recognized by directly using an original micro expression sequence is overcome. The method effectively improves the accuracy and efficiency of micro-expression recognition.
Drawings
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a sample image used by the present invention;
FIG. 3 is an image containing only the face region according to the present invention;
FIG. 4 is an optical flow feature map calculated by the present invention;
FIG. 5 is an optical strain feature map calculated by the present invention;
FIG. 6 is an example of a HOOF feature map;
FIG. 7 is the 3D convolutional neural network designed by the present invention.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and the described embodiments are only a part of the embodiments of the present invention, but not all of the embodiments. Other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present invention without creative efforts belong to the protection scope of the present invention.
A cross-database micro-expression recognition method based on multi-feature fusion comprises the following steps: acquiring a facial micro-expression image and preprocessing it; calculating three feature maps from the preprocessed face image and fusing them; and inputting the fused feature maps into a trained micro-expression recognition model to obtain the micro-expression recognition result; the process of training the micro-expression recognition model is shown in FIG. 1, and the specific process includes:
S1: acquiring the start frame and the vertex frame of each sample in the CASMEII and SAMM data sets, together with the emotion label corresponding to each sample;
S2: cropping the face region of the start frame and the vertex frame with the face detector of the dlib library to obtain images containing only the face region;
S3: normalizing the face region images; the normalization comprises scale normalization and gray-level normalization;
S4: calculating the face key point feature map LMF from the normalized start frame and vertex frame of the face region images;
S5: calculating the optical flow features of the face region images from the normalized start frame and vertex frame; calculating the optical flow direction feature map p, the magnitude feature map m and the optical strain feature map ε from the optical flow features; the optical flow features comprise horizontal and vertical optical flow features;
S6: equally dividing the optical flow direction feature map p and the magnitude feature map m into 5 × 5 non-overlapping blocks, binning the direction within each block, and weighting the bins by the motion magnitude of each pixel to obtain the HOOF feature map;
S7: stacking the face key point feature map LMF, the optical strain feature map ε and the HOOF feature map into a three-dimensional feature block, and inputting the block into a three-dimensional convolutional neural network for feature learning to obtain the micro-expression recognition result;
S8: calculating the loss function of the micro-expression recognition model from the recognition result, and adjusting the model parameters continuously; training is finished when the value of the loss function reaches its minimum.
As shown in FIG. 2, image (a) in FIG. 2 is the start frame of a Happiness sample and image (b) in FIG. 2 is the vertex frame of the same sample. Since the study of micro-expressions is based on video sequences rather than single pictures, the start frame, the vertex frame and the corresponding emotion label of a given sample sequence can be selected directly from the indices.
Face region images are extracted from the start frame and the vertex frame respectively, and scale normalization and gray-level normalization are then performed. The specific steps are as follows (a sketch of this preprocessing is given after step 2):
Step 1: the face detector and the facial landmark detector of the dlib library are used to detect the face region of the start frame and of the vertex frame and to detect 68 face key points; the position and size of the cropping region are obtained from the detected key points, giving an image containing only the face region, as shown in FIG. 3, where image (a) in FIG. 3 is the face region image of the start frame and image (b) in FIG. 3 is the face region image of the vertex frame.
Step 2: geometric normalization and gray-level normalization are applied to the image containing only the face region.
The face key point feature map LMF is calculated from the preprocessed start frame and vertex frame. Using key points has the advantage of being insensitive to factors such as the shape, gender, age and brightness of the face in the input video and to background features, so attention is focused on the facial motion. The formula for calculating the face key point feature map LMF is:
LMF_t(i, j) = ||p(i, t) - p(j, t)||_2 - ||p(i, t - Δt) - p(j, t - Δt)||_2
where p(i, t) is the i-th face key point in the t-th frame, i = 1, 2, ..., 68, the start frame corresponds to the frame at time t - Δt, and the vertex frame corresponds to the frame at time t.
Preferably, since the face has 68 key points, 68 key points are collected for each face image, and the size of the face key point feature map LMF is therefore 68 × 68.
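As an illustrative reading of the formula above, the 68 × 68 LMF matrix can be computed from the two landmark sets as sketched below; the helper name landmark_feature_map is an assumption.

```python
import numpy as np

def landmark_feature_map(pts_onset, pts_apex):
    """LMF_t(i, j) = ||p(i,t) - p(j,t)||_2 - ||p(i,t-dt) - p(j,t-dt)||_2,
    where pts_apex holds the 68 key points of the vertex frame (time t) and
    pts_onset those of the start frame (time t - dt). Returns a 68 x 68 map."""
    def pairwise(pts):
        diff = pts[:, None, :] - pts[None, :, :]
        return np.linalg.norm(diff, axis=-1)
    return pairwise(pts_apex) - pairwise(pts_onset)
```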
The process of calculating the optical flow features and the corresponding feature maps of the face region images from the preprocessed start frame and vertex frame includes the following steps:
step 1: calculating facial area images of a start frame and a vertex frame by adopting a TV-L1 approximate optical flow method to obtain a horizontal optical flow and a vertical optical flow of an optical flow field; as shown in fig. 4, the image a in fig. 4 is a horizontal light flow diagram, and the image b in fig. 4 is a vertical light flow diagram. Assuming that the intensity of the image pixel at the point (x, y, t) is I (x, y, t), the pixel point moves to the position of the point (x + Δ x, y + Δ y, t + Δ t) after the time Δ t, and the intensity of the pixel is I (x + Δ x, y + Δ y, t + Δ t); from the conservation of brightness:
I(x,y,t)=I(x+Δx,y+Δy,t+Δt)
wherein Δ x ═ ux,yΔt,Δy=vx,yΔt,ux,yAnd vx,yRespectively, a horizontal optical flow and a vertical optical flow to be estimated, Δ x represents a displacement of a pixel in a horizontal direction, and Δ y represents a displacement of a pixel in a vertical direction.
Step 2: the horizontal and vertical optical flows are converted into an optical flow vector p = (u, v), the optical flow vector is converted into polar coordinates, and the direction θ(x, y) and magnitude ρ(x, y) of the optical flow motion are calculated from the polar coordinates; the calculation formulas are:

ρ(x, y) = √( u(x, y)² + v(x, y)² )

θ(x, y) = arctan( v(x, y) / u(x, y) )

where ρ(x, y) is the magnitude of the pixel displacement, θ(x, y) is the direction of the pixel motion, and u(x, y) and v(x, y) are the horizontal and vertical optical flow components.
Step 3: the optical strain feature map ε is calculated from the optical flow vector p = (u, v); optical strain is the derivative of the optical flow and properly characterizes the subtle motion of a deformable object between consecutive frames, as shown in FIG. 5. The optical strain ε is defined as:

ε = (1/2) [ ∇u + (∇u)^T ]

where ∇ denotes the gradient of the displacement vector, T denotes transposition, and u = [u, v]^T is the displacement vector. The optical strain ε can also be written in component form as:

ε_xx = ∂u/∂x,  ε_yy = ∂v/∂y,  ε_xy = ε_yx = (1/2)(∂u/∂y + ∂v/∂x)

where the diagonal components (ε_xx, ε_yy) are the normal strain components and (ε_xy, ε_yx) are the tangential strain components. The normal strain components measure the change of the optical flow in the x and y directions, while the tangential strain components measure the angular change caused by deformation along the two axes.
Step 4: the magnitude feature map m is calculated from the optical strain feature map ε. The sum of the squares of the normal and tangential strain components gives the optical strain magnitude of each pixel:

m = |ε| = √( ε_xx² + ε_yy² + ε_xy² + ε_yx² )

A combined sketch of steps 1 to 4 is given below.
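The combined sketch below illustrates steps 1 to 4: TV-L1 optical flow between the start and vertex frames, conversion to polar magnitude and direction, the optical strain components from the flow gradients, and the strain magnitude map. It assumes the opencv-contrib-python package for cv2.optflow.DualTVL1OpticalFlow_create; all function and variable names are illustrative.

```python
import cv2
import numpy as np

def flow_and_strain(onset_gray, apex_gray):
    """Steps 1-4: TV-L1 optical flow, polar direction/magnitude, optical strain."""
    # Step 1: TV-L1 optical flow between start (onset) and vertex (apex) frames.
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()
    flow = tvl1.calc(onset_gray, apex_gray, None)          # shape (H, W, 2)
    u, v = flow[..., 0], flow[..., 1]

    # Step 2: polar coordinates -> magnitude rho(x,y) and direction theta(x,y) in [-pi, pi].
    rho = np.sqrt(u ** 2 + v ** 2)
    theta = np.arctan2(v, u)

    # Step 3: optical strain components from the spatial gradients of the flow.
    du_dy, du_dx = np.gradient(u)                          # axis 0 = y, axis 1 = x
    dv_dy, dv_dx = np.gradient(v)
    e_xx, e_yy = du_dx, dv_dy                              # normal strain components
    e_xy = 0.5 * (du_dy + dv_dx)                           # tangential component (= e_yx)

    # Step 4: strain magnitude m = sqrt(e_xx^2 + e_yy^2 + e_xy^2 + e_yx^2).
    m = np.sqrt(e_xx ** 2 + e_yy ** 2 + 2.0 * e_xy ** 2)
    return {"rho": rho, "theta": theta,
            "e_xx": e_xx, "e_yy": e_yy, "e_xy": e_xy,
            "strain_magnitude": m}
```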
equally dividing the optical flow amplitude feature map and the optical flow direction feature map into 5-by-5 non-overlapping blocks, and aligning the direction thetax,y∈[-π,π]Bin processing is performed and dependent on the corresponding amplitude ρx,yIs weighted to obtain the HOOF feature, as shown in fig. 6. The range of each histogram bin is:
Figure BDA0002906798850000083
where bin C is e {1,2 … C }, and C represents the total number of histogram bins.
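The block-wise HOOF computation can be sketched as follows: the magnitude and direction maps are split into a 5 × 5 grid of non-overlapping blocks, the directions in each block are binned over [-π, π], and each bin accumulates the corresponding magnitudes. The number of bins C = 8 is an illustrative assumption.

```python
import numpy as np

def hoof_feature_map(rho, theta, grid=5, n_bins=8):
    """Magnitude-weighted histogram of oriented optical flow per block.
    Returns a (grid, grid, n_bins) array; bin c covers
    [-pi + 2*pi*(c-1)/C, -pi + 2*pi*c/C)."""
    h, w = theta.shape
    bh, bw = h // grid, w // grid
    edges = np.linspace(-np.pi, np.pi, n_bins + 1)
    hoof = np.zeros((grid, grid, n_bins))
    for i in range(grid):
        for j in range(grid):
            t = theta[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw].ravel()
            r = rho[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw].ravel()
            hist, _ = np.histogram(t, bins=edges, weights=r)  # magnitude-weighted bins
            hoof[i, j] = hist
    return hoof
```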
As shown in FIG. 7, the three-dimensional convolutional neural network of the invention is a small and shallow convolutional neural network. The network has three parallel streams, each consisting of a convolutional layer and a pooling layer; the three streams use 3, 5 and 8 convolution kernels respectively, all of size 3 × 3, with same padding so that the input and output sizes are equal and stride 1. The three-dimensional feature block Θ = {LMF, ε, HOOF}, obtained by resampling the face key point feature map LMF, the optical strain map ε and the HOOF feature map to 28 × 28 × 3, is fed into the network; the outputs of the three parallel streams are 28 × 28 × 3, 28 × 28 × 5 and 28 × 28 × 8 respectively. These outputs then pass through 2 × 2 max-pooling layers with stride 2, giving 14 × 14 × 3, 14 × 14 × 5 and 14 × 14 × 8 outputs. The outputs of the three streams are concatenated into a 14 × 14 × 16 feature block, which passes through another 2 × 2 max-pooling layer with stride 2 to give a 7 × 7 × 16 feature; this is flattened into a 784 × 1 feature vector, connected to a fully-connected layer with 400 nodes, and finally classified by the softmax function for micro-expression recognition.
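A minimal Keras sketch of the three-stream network described above is given below. The patent calls the network three-dimensional because its input is the 28 × 28 × 3 feature block; the sketch realizes the 3 × 3 convolutions as 2-D convolutions over the 3-channel block, which is one plausible reading. The ReLU activations, the Adam optimizer and the number of emotion classes n_classes are assumptions not specified in the patent.

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Concatenate, Flatten, Dense

def build_three_stream_cnn(n_classes=3):
    """Three parallel conv/pool streams (3, 5 and 8 kernels of size 3x3, same padding,
    stride 1), concatenation, a final 2x2 max pooling, a 400-node FC layer and softmax."""
    inp = Input(shape=(28, 28, 3))                      # {LMF, strain, HOOF} feature block
    streams = []
    for n_kernels in (3, 5, 8):
        x = Conv2D(n_kernels, 3, strides=1, padding="same", activation="relu")(inp)
        x = MaxPooling2D(pool_size=2, strides=2)(x)     # -> 14 x 14 x n_kernels
        streams.append(x)
    x = Concatenate(axis=-1)(streams)                   # -> 14 x 14 x 16
    x = MaxPooling2D(pool_size=2, strides=2)(x)         # -> 7 x 7 x 16
    x = Flatten()(x)                                    # -> 784-dimensional vector
    x = Dense(400, activation="relu")(x)                # fully-connected layer
    out = Dense(n_classes, activation="softmax")(x)     # emotion classification
    return Model(inp, out)

model = build_three_stream_cnn()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```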
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by instructions associated with hardware via a program, which may be stored in a computer-readable storage medium, and the storage medium may include: ROM, RAM, magnetic or optical disks, and the like.
The above-mentioned embodiments, which further illustrate the objects, technical solutions and advantages of the present invention, should be understood that the above-mentioned embodiments are only preferred embodiments of the present invention, and should not be construed as limiting the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A cross-database micro-expression recognition method based on multi-feature fusion, characterized by comprising the following steps: acquiring a facial micro-expression image and preprocessing it; calculating three feature maps from the preprocessed face image and fusing them; and inputting the fused feature maps into a trained micro-expression recognition model to obtain the micro-expression recognition result;
the process of training the micro-expression recognition model comprises the following steps:
S1: acquiring the start frame and the vertex frame of each sample in the CASMEII and SAMM data sets, together with the emotion label corresponding to each sample;
S2: cropping the face region of the start frame and the vertex frame to obtain images containing only the face region;
S3: normalizing the face region images; the normalization comprises scale normalization and gray-level normalization;
S4: calculating the face key point feature map LMF from the normalized start frame and vertex frame of the face region images;
S5: calculating the optical flow features of the face region images from the normalized start frame and vertex frame; calculating the optical flow direction feature map p, the magnitude feature map m and the optical strain feature map ε from the optical flow features; the optical flow features comprise horizontal and vertical optical flow features;
S6: weighting the optical flow direction feature map p by the magnitude feature map m to obtain the HOOF feature map;
S7: stacking the face key point feature map LMF, the optical strain feature map ε and the HOOF feature map into a three-dimensional feature block, and inputting the block into a three-dimensional convolutional neural network for feature learning to obtain the micro-expression recognition result;
S8: calculating the loss function of the micro-expression recognition model from the recognition result, and adjusting the model parameters continuously; training is finished when the value of the loss function reaches its minimum.
2. The cross-database micro-expression recognition method based on multi-feature fusion according to claim 1, wherein the process of acquiring the start frame and the vertex frame of the CASMEII data set and the SAMM data set includes: given the indices of the start frame and the vertex frame of each micro-expression video sequence in the data set, the two frames and the emotion label are obtained directly from the indices; the start frame is the frame in which the observed person's micro-expression first appears, and the vertex frame is the frame in which the intensity of the micro-expression is greatest.
3. The method for cross-database micro-expression recognition based on multi-feature fusion of claim 1, wherein the cropping of the face region of the start frame and the vertex frame comprises detecting the face region of the start frame and the vertex frame by a face detector of a dlib library, detecting 68 face key points of the start frame and the vertex frame by a face landmark detector, and obtaining the position of the cropping region and the size of the cropping region according to the face key points to obtain an image only containing the face region.
4. The cross-database micro-expression recognition method based on multi-feature fusion as claimed in claim 1, wherein the formula for calculating the face key point feature map LMF is as follows:
LMF_t(i, j) = ||p(i, t) - p(j, t)||_2 - ||p(i, t - Δt) - p(j, t - Δt)||_2
where p(i, t) represents the i-th face key point in the t-th frame, and Δt represents the time interval.
5. The cross-database micro-expression recognition method based on multi-feature fusion according to claim 1, wherein the process of calculating the optical flow features and the corresponding feature maps of the face region images comprises:
Step 1: computing the optical flow between the face region images of the start frame and the vertex frame with the TV-L1 optical flow method to obtain the horizontal and vertical optical flows of the optical flow field;
Step 2: converting the horizontal and vertical optical flows into an optical flow vector p = (u, v), converting the optical flow vector into polar coordinates, and calculating the direction θ(x, y) and magnitude ρ(x, y) of the optical flow motion from the polar coordinates;
Step 3: calculating the optical strain feature map ε from the optical flow vector p = (u, v);
Step 4: calculating the magnitude feature map m from the optical strain feature map ε.
6. The cross-database micro-expression recognition method based on multi-feature fusion according to claim 5, wherein the direction θ(x, y) and magnitude ρ(x, y) of the optical flow motion are calculated as:

ρ(x, y) = √( u(x, y)² + v(x, y)² )

θ(x, y) = arctan( v(x, y) / u(x, y) )

where ρ(x, y) is the magnitude of the pixel displacement, θ(x, y) is the direction of the pixel motion, u(x, y) represents the horizontal optical flow component and v(x, y) represents the vertical optical flow component.
7. The cross-database micro-expression recognition method based on multi-feature fusion according to claim 5, wherein the formula for calculating the optical strain feature map ε is:

ε = (1/2) [ ∇u + (∇u)^T ]

where u = [u, v]^T represents the displacement vector and T denotes transposition.
8. The cross-database micro-expression recognition method based on multi-feature fusion according to claim 1, wherein the process of obtaining the HOOF feature map includes: equally dividing the optical flow magnitude feature map and the optical flow direction feature map into 5 × 5 non-overlapping blocks; binning the direction θ(x, y) ∈ [-π, π] within each block and weighting each bin by the corresponding magnitude ρ(x, y) to obtain the HOOF feature map; wherein the range of the c-th histogram bin is:

[ -π + 2π(c - 1)/C , -π + 2πc/C )

where bin c ∈ {1, 2, ..., C}, C represents the total number of histogram bins, and θ(x, y) represents the direction of the optical flow motion.
9. The cross-database micro-expression recognition method based on multi-feature fusion according to claim 1, wherein the three-dimensional convolutional neural network comprises three parallel streams, convolutional layers, max-pooling layers, an FC fully-connected layer and a softmax layer; each of the three parallel streams consists of a convolutional layer and a max-pooling layer, the numbers of convolution kernels in the three streams are 3, 5 and 8 respectively, and all kernels are of size 3 × 3; same padding is used so that the input and output sizes are equal, with stride 1; the 28 × 28 × 3 three-dimensional feature block is input into the three-dimensional convolutional neural network, and the outputs of the three parallel streams are 28 × 28 × 3, 28 × 28 × 5 and 28 × 28 × 8 respectively; the outputs of the convolutional layers each pass through a 2 × 2 max-pooling layer with stride 2, producing 14 × 14 × 3, 14 × 14 × 5 and 14 × 14 × 8 outputs; the outputs of the three streams are concatenated into a 14 × 14 × 16 feature block, which passes through another 2 × 2 max-pooling layer with stride 2 to give a 7 × 7 × 16 feature, which is flattened into a 784 × 1 feature vector; the 784 × 1 feature vector is passed through the fully-connected layer and the softmax function for micro-expression recognition and classification.
10. The cross-database micro expression recognition method based on multi-feature fusion as claimed in claim 1, wherein the loss function of the micro expression recognition model is:
L = -(1/N) Σ_{i=1..N} Σ_{c=1..M} y_ic · log(p_ic)

where N is the number of samples, M is the number of categories in the data set, y_ic is an indicator variable equal to 1 if sample i belongs to class c and 0 otherwise, and p_ic is the predicted probability that sample i belongs to class c.
CN202110073665.5A 2021-01-20 2021-01-20 Cross-database micro-expression identification method based on multi-feature fusion Pending CN112766159A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110073665.5A CN112766159A (en) 2021-01-20 2021-01-20 Cross-database micro-expression identification method based on multi-feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110073665.5A CN112766159A (en) 2021-01-20 2021-01-20 Cross-database micro-expression identification method based on multi-feature fusion

Publications (1)

Publication Number Publication Date
CN112766159A true CN112766159A (en) 2021-05-07

Family

ID=75703436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110073665.5A Pending CN112766159A (en) 2021-01-20 2021-01-20 Cross-database micro-expression identification method based on multi-feature fusion

Country Status (1)

Country Link
CN (1) CN112766159A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361415A (en) * 2021-06-08 2021-09-07 浙江工商大学 Micro-expression data set collection method based on crowdsourcing
CN113469153A (en) * 2021-09-03 2021-10-01 中国科学院自动化研究所 Multi-modal emotion recognition method based on micro-expressions, limb actions and voice
CN113537008A (en) * 2021-07-02 2021-10-22 江南大学 Micro-expression identification method based on adaptive motion amplification and convolutional neural network
CN113901895A (en) * 2021-09-18 2022-01-07 武汉未来幻影科技有限公司 Door opening action recognition method and device for vehicle and processing equipment
CN113902774A (en) * 2021-10-08 2022-01-07 无锡锡商银行股份有限公司 Method for detecting facial expression of dense optical flow characteristics in video
CN113920571A (en) * 2021-11-06 2022-01-11 北京九州安华信息安全技术有限公司 Micro-expression identification method and device based on multi-motion feature fusion
CN114005157A (en) * 2021-10-15 2022-02-01 武汉烽火信息集成技术有限公司 Micro-expression recognition method of pixel displacement vector based on convolutional neural network
CN114333002A (en) * 2021-12-27 2022-04-12 南京邮电大学 Micro-expression recognition method based on deep learning of image and three-dimensional reconstruction of human face
CN114913461A (en) * 2022-05-19 2022-08-16 广东电网有限责任公司 Emotion recognition method and device, terminal equipment and computer readable storage medium
CN115661850A (en) * 2022-12-29 2023-01-31 武汉大学 Seal identification method integrating multiple characteristics
CN116311472A (en) * 2023-04-07 2023-06-23 湖南工商大学 Micro-expression recognition method and device based on multi-level graph convolution network
CN116824280A (en) * 2023-08-30 2023-09-29 安徽爱学堂教育科技有限公司 Psychological early warning method based on micro-expression change
CN116825365A (en) * 2023-08-30 2023-09-29 安徽爱学堂教育科技有限公司 Mental health analysis method based on multi-angle micro-expression
CN117392727A (en) * 2023-11-02 2024-01-12 长春理工大学 Facial micro-expression recognition method based on contrast learning and feature decoupling
CN118015686A (en) * 2024-04-10 2024-05-10 济南超级计算技术研究院 Emotion recognition device and method based on full-simulation photonic neural network
CN118314617A (en) * 2024-06-11 2024-07-09 东北大学 Micro-expression recognition method based on segmented multi-scale convolutional neural network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516571A (en) * 2019-08-16 2019-11-29 东南大学 Inter-library micro- expression recognition method and device based on light stream attention neural network
AU2020102556A4 (en) * 2020-10-01 2020-11-19 Ci, Yuming Mr Psychological state analysis method based on facial micro-expression
CN112115796A (en) * 2020-08-21 2020-12-22 西北大学 Attention mechanism-based three-dimensional convolution micro-expression recognition algorithm

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516571A (en) * 2019-08-16 2019-11-29 东南大学 Inter-library micro- expression recognition method and device based on light stream attention neural network
CN112115796A (en) * 2020-08-21 2020-12-22 西北大学 Attention mechanism-based three-dimensional convolution micro-expression recognition algorithm
AU2020102556A4 (en) * 2020-10-01 2020-11-19 Ci, Yuming Mr Psychological state analysis method based on facial micro-expression

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
RIZWAN CHAUDHRY等: "Histograms of Oriented Optical Flow and Binet-Cauchy Kernels on Nonlinear Dynamical Systems for the Recognition of Human Actions", 《2009 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *
梁正友 et al.: "A Three-Dimensional Convolutional Neural Network Evolution Method for Automatic Micro-Expression Recognition", Computer Science (《计算机科学》) *
苏育挺 et al.: "Micro-Expression Recognition Algorithm Based on Multi-Motion Feature Fusion", Laser & Optoelectronics Progress (《激光与光电子学进展》) *
赵突: "Research on Key Technologies of Video-Based Facial Micro-Expression Recognition", China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》) *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361415A (en) * 2021-06-08 2021-09-07 浙江工商大学 Micro-expression data set collection method based on crowdsourcing
CN113537008A (en) * 2021-07-02 2021-10-22 江南大学 Micro-expression identification method based on adaptive motion amplification and convolutional neural network
CN113537008B (en) * 2021-07-02 2024-03-29 江南大学 Micro expression recognition method based on self-adaptive motion amplification and convolutional neural network
CN113469153A (en) * 2021-09-03 2021-10-01 中国科学院自动化研究所 Multi-modal emotion recognition method based on micro-expressions, limb actions and voice
CN113469153B (en) * 2021-09-03 2022-01-11 中国科学院自动化研究所 Multi-modal emotion recognition method based on micro-expressions, limb actions and voice
CN113901895A (en) * 2021-09-18 2022-01-07 武汉未来幻影科技有限公司 Door opening action recognition method and device for vehicle and processing equipment
CN113901895B (en) * 2021-09-18 2022-09-27 武汉未来幻影科技有限公司 Door opening action recognition method and device for vehicle and processing equipment
CN113902774A (en) * 2021-10-08 2022-01-07 无锡锡商银行股份有限公司 Method for detecting facial expression of dense optical flow characteristics in video
CN113902774B (en) * 2021-10-08 2024-04-02 无锡锡商银行股份有限公司 Facial expression detection method of thick and dense optical flow characteristics in video
CN114005157A (en) * 2021-10-15 2022-02-01 武汉烽火信息集成技术有限公司 Micro-expression recognition method of pixel displacement vector based on convolutional neural network
CN114005157B (en) * 2021-10-15 2024-05-10 武汉烽火信息集成技术有限公司 Micro-expression recognition method for pixel displacement vector based on convolutional neural network
CN113920571A (en) * 2021-11-06 2022-01-11 北京九州安华信息安全技术有限公司 Micro-expression identification method and device based on multi-motion feature fusion
CN114333002A (en) * 2021-12-27 2022-04-12 南京邮电大学 Micro-expression recognition method based on deep learning of image and three-dimensional reconstruction of human face
CN114913461A (en) * 2022-05-19 2022-08-16 广东电网有限责任公司 Emotion recognition method and device, terminal equipment and computer readable storage medium
CN115661850A (en) * 2022-12-29 2023-01-31 武汉大学 Seal identification method integrating multiple characteristics
CN116311472A (en) * 2023-04-07 2023-06-23 湖南工商大学 Micro-expression recognition method and device based on multi-level graph convolution network
CN116311472B (en) * 2023-04-07 2023-10-31 湖南工商大学 Micro-expression recognition method and device based on multi-level graph convolution network
CN116824280B (en) * 2023-08-30 2023-11-24 安徽爱学堂教育科技有限公司 Psychological early warning method based on micro-expression change
CN116825365B (en) * 2023-08-30 2023-11-28 安徽爱学堂教育科技有限公司 Mental health analysis method based on multi-angle micro-expression
CN116825365A (en) * 2023-08-30 2023-09-29 安徽爱学堂教育科技有限公司 Mental health analysis method based on multi-angle micro-expression
CN116824280A (en) * 2023-08-30 2023-09-29 安徽爱学堂教育科技有限公司 Psychological early warning method based on micro-expression change
CN117392727A (en) * 2023-11-02 2024-01-12 长春理工大学 Facial micro-expression recognition method based on contrast learning and feature decoupling
CN117392727B (en) * 2023-11-02 2024-04-12 长春理工大学 Facial micro-expression recognition method based on contrast learning and feature decoupling
CN118015686A (en) * 2024-04-10 2024-05-10 济南超级计算技术研究院 Emotion recognition device and method based on full-simulation photonic neural network
CN118314617A (en) * 2024-06-11 2024-07-09 东北大学 Micro-expression recognition method based on segmented multi-scale convolutional neural network

Similar Documents

Publication Publication Date Title
CN112766159A (en) Cross-database micro-expression identification method based on multi-feature fusion
Jain et al. Extended deep neural network for facial emotion recognition
Mahmood et al. WHITE STAG model: Wise human interaction tracking and estimation (WHITE) using spatio-temporal and angular-geometric (STAG) descriptors
CN110427867B (en) Facial expression recognition method and system based on residual attention mechanism
Happy et al. Fuzzy histogram of optical flow orientations for micro-expression recognition
CN106960202B (en) Smiling face identification method based on visible light and infrared image fusion
Youssif et al. Arabic sign language (arsl) recognition system using hmm
CN105574510A (en) Gait identification method and device
CN110796101A (en) Face recognition method and system of embedded platform
Maruyama et al. Word-level sign language recognition with multi-stream neural networks focusing on local regions
CN111666845A (en) Small sample deep learning multi-mode sign language recognition method based on key frame sampling
Paul et al. Extraction of facial feature points using cumulative histogram
Littlewort et al. Fully automatic coding of basic expressions from video
CN114550270A (en) Micro-expression identification method based on double-attention machine system
Chen et al. A multi-scale fusion convolutional neural network for face detection
Assiri et al. Face emotion recognition based on infrared thermal imagery by applying machine learning and parallelism
Mosayyebi et al. Gender recognition in masked facial images using EfficientNet and transfer learning approach
Elbarawy et al. Facial expressions recognition in thermal images based on deep learning techniques
CN117636436A (en) Multi-person real-time facial expression recognition method and system based on attention mechanism
Sajid et al. Facial asymmetry-based feature extraction for different applications: a review complemented by new advances
Kumar et al. Facial emotion recognition and detection using cnn
Jindal et al. Sign Language Detection using Convolutional Neural Network (CNN)
Mohan Object detection in images by components
Khubchandani et al. Sign Language Recognition
Bhardwaj et al. A review: facial expression detection with its techniques and application

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20210507