CN113902774B - Facial expression detection method based on dense optical flow features in video - Google Patents

Facial expression detection method based on dense optical flow features in video

Info

Publication number
CN113902774B
CN113902774B (application CN202111171053.6A)
Authority
CN
China
Prior art keywords: face, detected, image, facial expression, src
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111171053.6A
Other languages
Chinese (zh)
Other versions
CN113902774A (en)
Inventor
杨赛
顾全林
曹攀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Xishang Bank Co ltd
Original Assignee
Wuxi Xishang Bank Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Xishang Bank Co ltd filed Critical Wuxi Xishang Bank Co ltd
Priority to CN202111171053.6A priority Critical patent/CN113902774B/en
Publication of CN113902774A publication Critical patent/CN113902774A/en
Application granted granted Critical
Publication of CN113902774B publication Critical patent/CN113902774B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/20 - Analysis of motion
    • G06T7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T5/70
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/11 - Region-based segmentation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/90 - Determination of colour characteristics
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; Image sequence
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/30 - Subject of image; Context of image processing
    • G06T2207/30196 - Human being; Person
    • G06T2207/30201 - Face

Abstract

The invention relates to the technical field of expression detection and discloses a facial expression detection method based on dense optical flow features in video, which comprises the following steps: extracting multiple frames of face images to be detected from a face video, processing them to obtain corrected face images, and performing face key point detection to obtain 68 face key points; extracting the dense optical flow between the corrected i-th frame and (i+k)-th frame face images and converting it into a BGR-space image; extracting eyebrow and mouth region images from the BGR-space image according to the 68 face key points to obtain a target image; constructing a facial expression recognition model from the training target images among the target images and their corresponding labels, and inputting the test target images among the target images into the facial expression recognition model to obtain a facial expression recognition result. The facial expression detection method based on dense optical flow features in video can detect micro-/macro-expressions in video.

Description

Facial expression detection method based on dense optical flow features in video
Technical Field
The invention relates to the technical field of expression detection, and in particular to a facial expression detection method based on dense optical flow features in video.
Background
Emotion is one of the three basic psychological processes (alongside cognition and volition); it is grounded in individual wishes and needs and manifests as a person's attitudinal experience of objective things together with the corresponding behavioral responses. Facial expressions are the most prominent behavioral manifestation of emotion and can be divided into macro-expressions and micro-expressions. Micro-expressions are spontaneous expressions that cannot be faked or suppressed and therefore reflect a person's true emotions. In some scenarios micro-expressions convey more, and more credible, information than body language or speech, so research on micro-expression emotion recognition has considerable practical value in applications such as background investigation, interview recruitment, suspect interrogation, and loan review interviews.
Expression research currently falls into two broad directions: expression detection and expression recognition. Expression recognition has been studied for a long time, whereas the expression detection task has only recently begun to attract researchers' attention. Micro-expressions last only 1/25 s to 1/3 s and have very small motion amplitude; when the micro-expressions in a long video are interwoven with ordinary macro-expressions, detecting both micro- and macro-expressions becomes considerably more challenging.
Most current research targets scenes containing only micro-expressions or only macro-expressions, or handles the joint micro-/macro-expression detection task coarsely: the facial expression regions of interest are underused, the methods depend heavily on the scene, detection results are strongly affected by camera shake or head movement, there is heavy reliance on feature extraction and on the model, and the post-processing of the expression detection results in video after model prediction is insufficient.
Disclosure of Invention
The invention aims to overcome the above deficiencies of the prior art and provides a facial expression detection method based on dense optical flow features in video, so as to address the problems that detection results are strongly affected by camera shake or head movement, that there is heavy reliance on feature extraction and on the model, and that the post-processing of expression detection results in video after model prediction is insufficient.
As a first aspect of the present invention, there is provided a facial expression detection method based on dense optical flow features in video, comprising:
Step S1: acquiring a plurality of face videos to be detected, wherein each face video to be detected S comprises a plurality of frames of face images to be detected, S = {src_1, ..., src_i, ..., src_N}, and src_i is the i-th frame face image to be detected in the face video to be detected;
Step S2: extracting the plurality of frames of face images to be detected from each face video to be detected and processing them separately to obtain a corrected face image set S' = {src'_1, ..., src'_i, ..., src'_N}, wherein src'_i is the corrected i-th frame face image to be detected;
Step S3: performing face key point detection on each corrected face image to be detected to obtain 68 face key points;
Step S4: extracting the dense optical flow f_i between the corrected face images to be detected src'_i and src'_{i+k}, and converting the dense optical flow f_i from HSV space into a BGR-space image img_i, wherein the label of the BGR-space image img_i is the label of the corrected i-th frame face image to be detected src'_i;
Step S5: extracting an eyebrow region image and a mouth region image from the BGR-space image img_i according to the 68 face key points, and processing the eyebrow region image and the mouth region image to obtain a final target image;
Step S6: dividing the final target images of the plurality of face videos to be detected into training target images and test target images, constructing a facial expression recognition model from the training target images and their corresponding labels, and inputting the test target images into the facial expression recognition model to obtain a facial expression recognition result.
Further, steps S2 and S3 further comprise:
performing frame-by-frame face detection with RetinaFace on the plurality of frames of face images to be detected in the face video to be detected S = {src_1, ..., src_i, ..., src_N}, to obtain a face bounding box set bbox = {bbox_1, ..., bbox_i, ..., bbox_N} and a 5-point face key point set lmk = {lmk_1, ..., lmk_i, ..., lmk_N}, wherein src_i is the i-th frame face image to be detected in the face video to be detected, bbox_i is the face bounding box of the i-th frame, lmk_i is the 5-point face key points of the i-th frame face image to be detected, and N is the total number of frames of face images to be detected in the face video to be detected;
performing face alignment on the plurality of frames of face images to be detected in the face video to be detected according to the 5-point face key points lmk_i and a transformation matrix M, to obtain a corrected face image set S' = {src'_1, ..., src'_i, ..., src'_N} and a corrected face bounding box set bbox' = {bbox'_1, ..., bbox'_i, ..., bbox'_N}, wherein src'_i is the corrected i-th frame face image to be detected, src'_i ∈ 224×224×3, bbox'_i is the face bounding box of the corrected i-th frame face image to be detected, and the transformation matrix M is computed as shown in formula (1);
performing face key point detection with the 3DDFA_V2 algorithm on the face image within the corrected face bounding box bbox'_i, to obtain a 68-point face key point set lmk' = {lmk'_1, ..., lmk'_i, ..., lmk'_N}, wherein lmk'_i denotes the 68 face key points within the face bounding box bbox'_i of the corrected i-th frame face image to be detected.
Further, step S4 further comprises:
when the corrected i-th frame face image to be detected src'_i falls within a facial expression interval, its label is defined as label_i = 1, otherwise label_i = 0, wherein the value of k is chosen as half the average length of a facial expression in the face video data set to be detected and is computed as shown in formula (2):
k = (1/2) · (1/T) · Σ_{j=1}^{T} n_j    (2)
wherein T is the total number of videos in the face video data set to be detected, and n_j is the number of frames containing a facial expression in the j-th face video to be detected.
Further, step S5 further comprises:
the eyebrow region image ROI_1 is the image of the region delimited by a coordinate frame computed as shown in formula (3), wherein ω_1 is the number of padding pixels;
the mouth region image ROI_2 is the image of the region delimited by a coordinate frame computed as shown in formula (4), wherein ω_2 is the number of padding pixels;
the two region images are normalized to size H×W and then combined to obtain the final target image, wherein H and W are the normalized height and width, respectively.
Further, in step S6 the plurality of training target images form a training set IMG_train and the plurality of test target images form a test set IMG_test, and step S6 further comprises:
Step S61: inputting the test target images of the j-th test set IMG_test·j into the facial expression recognition model to obtain, for each test target image, a predicted label label'_{i·j} ∈ {0, 1} and a confidence value_{i·j} ∈ [0, 1]; computing the facial expression score s_{i·j} of the i-th frame test target image in the j-th test set IMG_test·j according to formula (5); the facial expression score set of the N frames of test target images in the j-th test set IMG_test·j is then S_j = {s_{0·j}, ..., s_{i·j}, ..., s_{N·j}};
s_{i·j} = value_{i·j} * label'_{i·j}    (5)
Step S62: smoothing the facial expression score set S_j of the N frames of test target images in the j-th test set IMG_test·j with Savitzky-Golay convolution so that it becomes a continuous curve S'_j;
Step S63: using a dynamic threshold T as the score threshold of the curve S'_j, wherein T is computed as shown in formula (6), S_mean is the mean of the facial expression scores in the facial expression score set S_j, S_max is the maximum facial expression score in the facial expression score set S_j, and η is a weight coefficient;
T = S_mean + η * (S_max - S_mean)    (6)
Step S64: finding the peak points of the curve S'_j, with the threshold T and the nearest-neighbour distance k as constraints: a peak of the curve S'_j is a target facial expression peak if and only if its value is greater than the threshold T and its distance from adjacent peak points is greater than k; the target peaks satisfying these constraints are grouped by adjacent intervals to obtain the final set of predicted facial expression label intervals;
Step S65: when the overlap (IOU) between a predicted interval in the facial expression label interval set and a real interval is greater than or equal to 0.5, the predicted interval in the facial expression label interval set is judged to be correct.
The facial expression detection method based on dense optical flow features in video has the following advantages: the face is normalized and corrected, useless noise is removed from the expression regions of interest, and micro-/macro-expressions in video are detected by combining the dense optical flow features of the regions of interest with an expression-detection post-processing method.
Drawings
The accompanying drawings are included to provide a further understanding of the invention, and are incorporated in and constitute a part of this specification, illustrate the invention and together with the description serve to explain, without limitation, the invention.
Fig. 1 is a flowchart of the facial expression detection method based on dense optical flow features in video provided by the present invention.
Fig. 2 is a flowchart of processing a face image to be detected provided by the present invention.
Fig. 3 is a schematic diagram of label division provided in the present invention.
Fig. 4 is a flowchart for detecting facial expression according to the present invention.
Fig. 5 is a schematic diagram of the 68 face key points provided by the present invention.
Detailed Description
In order to further explain the technical means and effects adopted by the invention to achieve its intended purpose, the specific implementation, structure, features and effects of the facial expression detection method based on dense optical flow features in video are described in detail below with reference to the accompanying drawings and preferred embodiments. It is apparent that the described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate in order to describe the embodiments of the invention herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In this embodiment, a facial expression detection method based on dense optical flow features in video is provided; as shown in fig. 1, the method comprises:
Step S1: acquiring a plurality of face videos to be detected, wherein each face video to be detected S comprises a plurality of frames of face images to be detected, S = {src_1, ..., src_i, ..., src_N}, and src_i is the i-th frame face image to be detected in the face video to be detected;
Step S2: extracting the plurality of frames of face images to be detected from each face video to be detected and processing them separately to obtain a corrected face image set S' = {src'_1, ..., src'_i, ..., src'_N}, wherein src'_i is the corrected i-th frame face image to be detected;
Step S3: performing face key point detection on each corrected face image to be detected to obtain 68 face key points;
It should be noted that the 68 face key points are extracted because they are distributed over the key facial expression regions (eyebrows, eyes, mouth, etc.), as shown in fig. 5;
Step S4: converting the corrected face images to be detected src'_i and src'_{i+k} into grayscale images, extracting their dense optical flow f_i (Farneback's Two-Frame Motion Estimation Based on Polynomial Expansion), and converting the dense optical flow f_i from HSV space into a BGR-space image img_i, wherein the label of the BGR-space image img_i is the label of the corrected i-th frame face image to be detected src'_i;
It should be noted that in step S4 the corrected face images to be detected of the i-th frame and the (i+k)-th frame are selected, where k is a sliding-window value;
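A minimal Python/OpenCV sketch of step S4 is given below; it uses OpenCV's Farneback dense optical flow and the usual HSV encoding of flow direction and magnitude. The Farneback parameters are illustrative assumptions, since they are not specified in the patent.

```python
import cv2
import numpy as np

def dense_flow_to_bgr(src_i, src_ik):
    """Step S4: Farneback dense optical flow between two corrected frames, encoded as a BGR image."""
    g1 = cv2.cvtColor(src_i, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(src_ik, cv2.COLOR_BGR2GRAY)
    # pyr_scale=0.5, levels=3, winsize=15, iterations=3, poly_n=5, poly_sigma=1.2 (illustrative)
    flow = cv2.calcOpticalFlowFarneback(g1, g2, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((g1.shape[0], g1.shape[1], 3), dtype=np.uint8)
    hsv[..., 0] = ang * 180 / np.pi / 2                               # hue: flow direction
    hsv[..., 1] = 255                                                 # saturation: constant
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)   # value: flow magnitude
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)                       # img_i, passed to step S5
```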
Step S5: extracting an eyebrow region image and a mouth region image from the BGR-space image img_i according to the 68 face key points, and processing the eyebrow region image and the mouth region image to obtain the final target image;
Step S6: processing the videos in batches, dividing the final target images of the plurality of face videos to be detected into training target images and test target images according to leave-one-out cross-validation, constructing a MobileNet_v2-based facial expression recognition model from the training target images and their corresponding labels, and inputting the test target images into the facial expression recognition model to obtain the facial expression recognition results.
It should be noted that during training the facial expression recognition model applies random data augmentation with probability 0.5 (e.g. flipping, adding noise, contrast adjustment, etc.).
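The following PyTorch sketch illustrates one way to build the MobileNet_v2-based expression/no-expression classifier with roughly 0.5-probability random augmentation; the specific transforms, optimizer and learning rate are assumptions, not values given in the patent.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Random augmentation applied during training; the transforms below stand in for
# "flipping, adding noise, contrast, etc." and are illustrative only.
train_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomApply([transforms.ColorJitter(brightness=0.2, contrast=0.2)], p=0.5),
    transforms.ToTensor(),
])

def build_expression_model(num_classes=2):
    """Binary (expression / no expression) classifier built on MobileNet_v2."""
    net = models.mobilenet_v2()                               # randomly initialised backbone
    net.classifier[1] = nn.Linear(net.last_channel, num_classes)
    return net

model = build_expression_model()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)     # illustrative optimiser settings
```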
Preferably, as shown in fig. 2, steps S2 and S3 further comprise:
performing frame-by-frame face detection with RetinaFace (Single-stage Dense Face Localisation in the Wild) on the plurality of frames of face images to be detected in the face video to be detected S = {src_1, ..., src_i, ..., src_N}, to obtain the face bounding box set bbox = {bbox_1, ..., bbox_i, ..., bbox_N} and the 5-point face key point set lmk = {lmk_1, ..., lmk_i, ..., lmk_N}, wherein src_i is the i-th frame face image to be detected in the face video to be detected, bbox_i is the face bounding box of the i-th frame, lmk_i is the 5-point face key points of the i-th frame face image to be detected, and N is the total number of frames of face images to be detected in the face video to be detected;
in order to reduce the influence of camera shake or head movement, all face images to be detected are normalized to the same scale and a frontal view: face alignment is performed on the plurality of frames of face images to be detected in the face video to be detected according to the 5-point face key points lmk_i and a transformation matrix M, giving the corrected face image set S' = {src'_1, ..., src'_i, ..., src'_N} and the corrected face bounding box set bbox' = {bbox'_1, ..., bbox'_i, ..., bbox'_N}, wherein src'_i is the corrected i-th frame face image to be detected, src'_i ∈ 224×224×3, bbox'_i is the face bounding box of the corrected i-th frame face image to be detected, and the transformation matrix M is computed as shown in formula (1);
the 3DDFA_V2 algorithm (Towards Fast, Accurate and Stable 3D Dense Face Alignment) is used to perform face key point detection on the face image within the corrected face bounding box bbox'_i, giving the 68-point face key point set lmk' = {lmk'_1, ..., lmk'_i, ..., lmk'_N}, wherein lmk'_i denotes the 68 face key points within the face bounding box bbox'_i of the corrected i-th frame face image to be detected.
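The patent computes the transformation matrix M by formula (1); as a stand-in, the sketch below uses a standard 5-point similarity alignment to a 224×224 canonical template. The ArcFace-style reference points and the use of cv2.estimateAffinePartial2D are assumptions, and the 5 landmarks lmk5 are taken to come from the RetinaFace detection above.

```python
import cv2
import numpy as np

# Canonical 5-point template (eye centres, nose tip, mouth corners) on a 112x112
# face crop, scaled to 224x224; this template is an assumption standing in for formula (1).
REF_5PTS_112 = np.array([[38.2946, 51.6963], [73.5318, 51.5014], [56.0252, 71.7366],
                         [41.5493, 92.3655], [70.7299, 92.2041]], dtype=np.float32)
REF_5PTS_224 = REF_5PTS_112 * (224.0 / 112.0)

def align_face(frame, lmk5):
    """Warp a frame so that its 5 detected landmarks match the canonical template (224x224x3)."""
    src_pts = np.asarray(lmk5, dtype=np.float32)
    M, _ = cv2.estimateAffinePartial2D(src_pts, REF_5PTS_224, method=cv2.LMEDS)
    return cv2.warpAffine(frame, M, (224, 224), flags=cv2.INTER_LINEAR)
```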
Preferably, step S4 further comprises:
when the corrected i-th frame face image to be detected src'_i falls within a facial expression (micro-/macro-expression) interval (onset to offset), its label is defined as label_i = 1, otherwise label_i = 0, wherein the value of k is chosen as half the average length of a facial expression in the face video data set to be detected and is computed as shown in formula (2):
k = (1/2) · (1/T) · Σ_{j=1}^{T} n_j    (2)
wherein T is the total number of videos in the face video data set to be detected, and n_j is the number of frames containing a facial expression in the j-th face video to be detected.
In this embodiment a label-division schematic is provided, as shown in fig. 3: (1) when the sliding window k = 3, the BGR-space image img_1 obtained from the 1st frame and 4th frame face images to be detected takes the label of the 1st frame face image to be detected; (2) when F_onset ≤ F_i ≤ F_offset, the label of the BGR-space image of the i-th frame F_i is 1, otherwise it is 0, where F_onset is the onset frame and F_offset is the offset frame of the expression interval of the BGR-space image.
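A short sketch of the sliding-window value k of formula (2) and of the per-frame label assignment follows; the helper names are illustrative.

```python
import numpy as np

def sliding_window_k(expression_frame_counts):
    """Formula (2): k is half the average expression length n_j over the T videos of the data set."""
    n = np.asarray(expression_frame_counts, dtype=float)   # n_j for j = 1..T
    return max(1, int(round(n.mean() / 2.0)))

def frame_labels(num_frames, intervals):
    """label_i = 1 when frame i lies inside an annotated [F_onset, F_offset] interval, else 0."""
    labels = np.zeros(num_frames, dtype=np.int64)
    for f_onset, f_offset in intervals:
        labels[f_onset:f_offset + 1] = 1
    return labels

# Example matching fig. 3: with k = 3, the flow image img_1 built from frames 1 and 4
# inherits labels[1], i.e. the label of frame 1.
```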
It should be noted that a facial expression may be a micro-expression or a macro-expression, and the facial expression recognition model is correspondingly a micro-expression recognition model or a macro-expression recognition model.
Preferably, as shown in fig. 2, step S5 further comprises:
the eyebrow region image ROI_1 is the image of the region delimited by a coordinate frame computed as shown in formula (3), wherein ω_1 is the number of padding pixels; since the eye region contains no effective facial action unit and irregular eye blinking would disturb the eyebrow region image ROI_1, the region delimited by the eye key points lmk'_i[37:48] of the 68-point face key point set lmk'_i is filled in;
the mouth region image ROI_2 is the image of the region delimited by a coordinate frame computed as shown in formula (4), wherein ω_2 is the number of padding pixels;
the two region images are normalized to size H×W and then combined to obtain the final target image, wherein H and W are the normalized height and width, respectively; here H = 96 and W = 128 are set.
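The exact coordinate frames are given by formulas (3) and (4); the sketch below instead approximates the eyebrow and mouth regions with padded landmark bounding boxes, where the padding values (standing in for ω_1 and ω_2) and the black fill used for the eye region are assumptions. It produces the 96×128 ROIs and stacks them into the final target image.

```python
import cv2
import numpy as np

H, W = 96, 128   # normalised ROI height and width used in this embodiment

def _padded_box(points, pad, shape):
    """Axis-aligned bounding box of the given landmarks, expanded by `pad` pixels."""
    x1 = max(int(points[:, 0].min()) - pad, 0)
    y1 = max(int(points[:, 1].min()) - pad, 0)
    x2 = min(int(points[:, 0].max()) + pad, shape[1])
    y2 = min(int(points[:, 1].max()) + pad, shape[0])
    return x1, y1, x2, y2

def extract_target_image(img_bgr, lmk68, pad_brow=10, pad_mouth=10):
    """Eyebrow and mouth ROIs of the flow image img_i, resized to H x W and stacked vertically."""
    lmk = np.asarray(lmk68, dtype=np.float32)
    img = img_bgr.copy()
    # Fill the eye region (landmarks 37-48, 1-based) so that blinking does not disturb ROI_1.
    eye_hull = cv2.convexHull(lmk[36:48].astype(np.int32))
    cv2.fillConvexPoly(img, eye_hull, (0, 0, 0))
    bx1, by1, bx2, by2 = _padded_box(lmk[17:27], pad_brow, img.shape)    # eyebrows (pad ~ omega_1)
    mx1, my1, mx2, my2 = _padded_box(lmk[48:68], pad_mouth, img.shape)   # mouth (pad ~ omega_2)
    brow = cv2.resize(img[by1:by2, bx1:bx2], (W, H))
    mouth = cv2.resize(img[my1:my2, mx1:mx2], (W, H))
    return np.vstack([brow, mouth])   # final target image fed to the classifier
```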
Preferably, this embodiment provides a flowchart of the post-processing after expression detection (taking micro-expressions as an example), as shown in fig. 4. In step S6 the plurality of training target images form a training set IMG_train and the plurality of test target images form a test set IMG_test, and step S6 further comprises:
Step S61: inputting the test target images of the j-th test set IMG_test·j into the facial expression recognition model to obtain, for each test target image, a predicted label label'_{i·j} ∈ {0, 1} and a confidence value_{i·j} ∈ [0, 1]; computing the facial expression score s_{i·j} of the i-th frame test target image in the j-th test set IMG_test·j according to formula (5); the facial expression score set of the N frames of test target images in the j-th test set IMG_test·j is then S_j = {s_{0·j}, ..., s_{i·j}, ..., s_{N·j}};
s_{i·j} = value_{i·j} * label'_{i·j}    (5)
Step S62: since the facial expression score set S_j tends to be discrete, in order to eliminate errors caused by the model's classification, the facial expression score set S_j of the N frames of test target images in the j-th test set IMG_test·j is smoothed with Savitzky-Golay convolution so that it becomes a continuous curve S'_j;
Step S63: since the facial expression score sets of different videos differ greatly, a dynamic threshold T is used as the score threshold of the curve S'_j; T is computed as shown in formula (6), where S_mean is the mean of the facial expression scores in the facial expression score set S_j, S_max is the maximum facial expression score in the facial expression score set S_j, and η is a weight coefficient;
T = S_mean + η * (S_max - S_mean)    (6)
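Steps S62 and S63 can be sketched with SciPy as follows; the Savitzky-Golay window length and polynomial order, and the weight η, are illustrative values rather than values taken from the patent.

```python
import numpy as np
from scipy.signal import savgol_filter

def smooth_scores(scores, window=11, polyorder=3):
    """Step S62: Savitzky-Golay smoothing of the score set S_j into a continuous curve S'_j."""
    s = np.asarray(scores, dtype=float)
    window = min(window, len(s) if len(s) % 2 == 1 else len(s) - 1)   # window length must be odd
    return savgol_filter(s, window_length=window, polyorder=min(polyorder, window - 1))

def dynamic_threshold(scores, eta=0.6):
    """Step S63 / formula (6): T = S_mean + eta * (S_max - S_mean)."""
    s = np.asarray(scores, dtype=float)
    return s.mean() + eta * (s.max() - s.mean())
```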
Step S64: the peak points of the curve S'_j are found, with the threshold T and the nearest-neighbour distance k as constraints: a peak of the curve S'_j is a target facial expression peak if and only if its value is greater than the threshold T and its distance from adjacent peak points is greater than k; the target peaks satisfying these constraints are grouped by adjacent intervals to obtain the final set of predicted facial expression label intervals;
Step S65: when the overlap (IOU) between a predicted interval in the facial expression label interval set and a real interval is greater than or equal to 0.5, the predicted interval in the facial expression label interval set is judged to be correct; the balanced F1-score is used as the evaluation metric, and parameter optimization is performed by grid search.
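A hedged sketch of steps S64 and S65 follows; turning each retained peak into a ±k candidate interval and merging adjacent candidates is one reasonable reading of the grouping rule described above, and the IOU is computed over frame indices.

```python
import numpy as np
from scipy.signal import find_peaks

def predict_intervals(curve, threshold, k):
    """Step S64: peaks of S'_j above T and at least k frames apart, grouped into intervals."""
    peaks, _ = find_peaks(np.asarray(curve, dtype=float), height=threshold, distance=k)
    candidates = [(max(p - k, 0), min(p + k, len(curve) - 1)) for p in peaks]
    merged = []
    for start, end in sorted(candidates):       # merge adjacent / overlapping candidates
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def interval_iou(pred, gt):
    """Step S65: overlap-over-union of a predicted and a ground-truth frame interval."""
    inter = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]) + 1)
    union = (pred[1] - pred[0] + 1) + (gt[1] - gt[0] + 1) - inter
    return inter / union

# A predicted interval counts as correct (a true positive for the F1-score) when
# interval_iou(pred, gt) >= 0.5 for some ground-truth interval.
```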
The facial expression detection method based on dense optical flow features in video provided by the invention normalizes and corrects the face, removes useless noise from the expression regions of interest, and detects micro-/macro-expressions in video by combining the dense optical flow features of the regions of interest with an expression-detection post-processing method.
The present invention is not limited to the above embodiments; any modifications, equivalent substitutions and improvements made within the spirit and principles of the invention shall fall within the protection scope of the invention.

Claims (5)

1. A facial expression detection method based on dense optical flow features in video, characterized by comprising the following steps:
Step S1: acquiring a plurality of face videos to be detected, wherein each face video to be detected S comprises a plurality of frames of face images to be detected, S = {src_1, ..., src_i, ..., src_N}, and src_i is the i-th frame face image to be detected in the face video to be detected;
Step S2: extracting the plurality of frames of face images to be detected from each face video to be detected and processing them separately to obtain a corrected face image set S' = {src'_1, ..., src'_i, ..., src'_N}, wherein src'_i is the corrected i-th frame face image to be detected;
Step S3: performing face key point detection on each corrected face image to be detected to obtain 68 face key points;
Step S4: extracting the dense optical flow f_i between the corrected face images to be detected src'_i and src'_{i+k}, and converting the dense optical flow f_i from HSV space into a BGR-space image img_i, wherein the label of the BGR-space image img_i is the label of the corrected i-th frame face image to be detected src'_i;
Step S5: extracting an eyebrow region image and a mouth region image from the BGR-space image img_i according to the 68 face key points, and processing the eyebrow region image and the mouth region image to obtain a final target image;
Step S6: dividing the final target images of the plurality of face videos to be detected into training target images and test target images, constructing a facial expression recognition model from the training target images and their corresponding labels, and inputting the test target images into the facial expression recognition model to obtain a facial expression recognition result.
2. The facial expression detection method based on dense optical flow features in video according to claim 1, wherein steps S2 and S3 further comprise:
performing frame-by-frame face detection with RetinaFace on the plurality of frames of face images to be detected in the face video to be detected S = {src_1, ..., src_i, ..., src_N}, to obtain a face bounding box set bbox = {bbox_1, ..., bbox_i, ..., bbox_N} and a 5-point face key point set lmk = {lmk_1, ..., lmk_i, ..., lmk_N}, wherein src_i is the i-th frame face image to be detected in the face video to be detected, bbox_i is the face bounding box of the i-th frame, lmk_i is the 5-point face key points of the i-th frame face image to be detected, and N is the total number of frames of face images to be detected in the face video to be detected;
performing face alignment on the plurality of frames of face images to be detected in the face video to be detected according to the 5-point face key points lmk_i and a transformation matrix M, to obtain a corrected face image set S' = {src'_1, ..., src'_i, ..., src'_N} and a corrected face bounding box set bbox' = {bbox'_1, ..., bbox'_i, ..., bbox'_N}, wherein src'_i is the corrected i-th frame face image to be detected, src'_i ∈ 224×224×3, bbox'_i is the face bounding box of the corrected i-th frame face image to be detected, and the transformation matrix M is computed as shown in formula (1);
performing face key point detection with the 3DDFA_V2 algorithm on the face image within the corrected face bounding box bbox'_i, to obtain a 68-point face key point set lmk' = {lmk'_1, ..., lmk'_i, ..., lmk'_N}, wherein lmk'_i denotes the 68 face key points within the face bounding box bbox'_i of the corrected i-th frame face image to be detected.
3. The facial expression detection method based on dense optical flow features in video according to claim 2, wherein step S4 further comprises:
when the corrected i-th frame face image to be detected src'_i falls within a facial expression interval, its label is defined as label_i = 1, otherwise label_i = 0, wherein the value of k is chosen as half the average length of a facial expression in the face video data set to be detected and is computed as shown in formula (2):
k = (1/2) · (1/T) · Σ_{j=1}^{T} n_j    (2)
wherein T is the total number of videos in the face video data set to be detected, and n_j is the number of frames containing a facial expression in the j-th face video to be detected.
4. The facial expression detection method based on dense optical flow features in video according to claim 3, wherein step S5 further comprises:
the eyebrow region image ROI_1 is the image of the region delimited by a coordinate frame computed as shown in formula (3), wherein ω_1 is the number of padding pixels;
the mouth region image ROI_2 is the image of the region delimited by a coordinate frame computed as shown in formula (4), wherein ω_2 is the number of padding pixels;
the two region images are normalized to size H×W and then combined to obtain the final target image, wherein H and W are the normalized height and width, respectively.
5. The facial expression detection method based on dense optical flow features in video according to claim 4, wherein in step S6 the plurality of training target images form a training set IMG_train and the plurality of test target images form a test set IMG_test, and step S6 further comprises:
Step S61: inputting the test target images of the j-th test set IMG_test·j into the facial expression recognition model to obtain, for each test target image, a predicted label label'_{i·j} ∈ {0, 1} and a confidence value_{i·j} ∈ [0, 1]; computing the facial expression score s_{i·j} of the i-th frame test target image in the j-th test set IMG_test·j according to formula (5); the facial expression score set of the N frames of test target images in the j-th test set IMG_test·j is then S_j = {s_{0·j}, ..., s_{i·j}, ..., s_{N·j}};
s_{i·j} = value_{i·j} * label'_{i·j}    (5)
Step S62: smoothing the facial expression score set S_j of the N frames of test target images in the j-th test set IMG_test·j with Savitzky-Golay convolution so that it becomes a continuous curve S'_j;
Step S63: using a dynamic threshold T as the score threshold of the curve S'_j, wherein T is computed as shown in formula (6), S_mean is the mean of the facial expression scores in the facial expression score set S_j, S_max is the maximum facial expression score in the facial expression score set S_j, and η is a weight coefficient;
T = S_mean + η * (S_max - S_mean)    (6)
Step S64: finding the peak points of the curve S'_j, with the threshold T and the nearest-neighbour distance k as constraints: a peak of the curve S'_j is a target facial expression peak if and only if its value is greater than the threshold T and its distance from adjacent peak points is greater than k; the target peaks satisfying these constraints are grouped by adjacent intervals to obtain the final set of predicted facial expression label intervals;
Step S65: when the overlap (IOU) between a predicted interval in the facial expression label interval set and a real interval is greater than or equal to 0.5, the predicted interval in the facial expression label interval set is judged to be correct.
CN202111171053.6A 2021-10-08 2021-10-08 Facial expression detection method based on dense optical flow features in video Active CN113902774B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111171053.6A CN113902774B (en) Facial expression detection method based on dense optical flow features in video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111171053.6A CN113902774B (en) Facial expression detection method based on dense optical flow features in video

Publications (2)

Publication Number Publication Date
CN113902774A CN113902774A (en) 2022-01-07
CN113902774B true CN113902774B (en) 2024-04-02

Family

ID=79190355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111171053.6A Active CN113902774B (en) Facial expression detection method based on dense optical flow features in video

Country Status (1)

Country Link
CN (1) CN113902774B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117153403A (en) * 2023-09-13 2023-12-01 安徽爱学堂教育科技有限公司 Mental health evaluation method based on micro-expressions and physical indexes

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110991348A (en) * 2019-12-05 2020-04-10 河北工业大学 Face micro-expression detection method based on optical flow gradient amplitude characteristics
CN112766159A (en) * 2021-01-20 2021-05-07 重庆邮电大学 Cross-database micro-expression identification method based on multi-feature fusion
CN113158978A (en) * 2021-05-14 2021-07-23 无锡锡商银行股份有限公司 Risk early warning method for micro-expression recognition in video auditing

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514432B (en) * 2012-06-25 2017-09-01 诺基亚技术有限公司 Face feature extraction method, equipment and computer program product

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110991348A (en) * 2019-12-05 2020-04-10 河北工业大学 Face micro-expression detection method based on optical flow gradient amplitude characteristics
CN112766159A (en) * 2021-01-20 2021-05-07 重庆邮电大学 Cross-database micro-expression identification method based on multi-feature fusion
CN113158978A (en) * 2021-05-14 2021-07-23 无锡锡商银行股份有限公司 Risk early warning method for micro-expression recognition in video auditing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of a Facial Expression Recognition System Based on LBP and SVM; 姚丽莎; 张军委; 房波; 张绍雷; 周欢; 赵凤; Journal of Guizhou Normal University (Natural Science Edition); 2020-01-15 (Issue 01); full text *
Micro-expression Recognition Based on the Mean Optical Flow Direction Histogram Descriptor; 马浩原; 安高云; 阮秋琦; Journal of Signal Processing; 2018-03-25 (Issue 03); full text *

Also Published As

Publication number Publication date
CN113902774A (en) 2022-01-07

Similar Documents

Publication Publication Date Title
CN112506342B (en) Man-machine interaction method and system based on dynamic gesture recognition
Najibi et al. G-cnn: an iterative grid based object detector
WO2021043168A1 (en) Person re-identification network training method and person re-identification method and apparatus
CN106778687B (en) Fixation point detection method based on local evaluation and global optimization
CN104616316B (en) Personage's Activity recognition method based on threshold matrix and Fusion Features vision word
CN105718873B (en) Stream of people's analysis method based on binocular vision
CN105740758A (en) Internet video face recognition method based on deep learning
KR20160101973A (en) System and method for identifying faces in unconstrained media
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
MX2013002904A (en) Person image processing apparatus and person image processing method.
CN104700078B (en) A kind of robot scene recognition methods based on scale invariant feature extreme learning machine
CN110889865B (en) Video target tracking method based on local weighted sparse feature selection
CN107862240A (en) A kind of face tracking methods of multi-cam collaboration
CN111028216A (en) Image scoring method and device, storage medium and electronic equipment
CN110705366A (en) Real-time human head detection method based on stair scene
CN113902774B (en) Facial expression detection method based on dense optical flow features in video
Bian et al. Conditional adversarial consistent identity autoencoder for cross-age face synthesis
CN113762009B (en) Crowd counting method based on multi-scale feature fusion and double-attention mechanism
CN111144220B (en) Personnel detection method, device, equipment and medium suitable for big data
CN112487948A (en) Multi-space fusion-based concentration perception method for learner in learning process
CN110766093A (en) Video target re-identification method based on multi-frame feature fusion
Fan et al. A high-precision correction method in non-rigid 3D motion poses reconstruction
Yang et al. Video system for human attribute analysis using compact convolutional neural network
CN107016675A (en) A kind of unsupervised methods of video segmentation learnt based on non local space-time characteristic
CN113901915B (en) Expression detection method of light-weight network and MagFace in video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant