CN113901915A - Expression detection method for light-weight network and Magface in video - Google Patents

Expression detection method for light-weight network and Magface in video

Info

Publication number
CN113901915A
CN113901915A
Authority
CN
China
Prior art keywords
face
detected
frame
image
facial expression
Prior art date
Legal status
Granted
Application number
CN202111172275.XA
Other languages
Chinese (zh)
Other versions
CN113901915B (en)
Inventor
杨赛
顾全林
曹攀
Current Assignee
Wuxi Xishang Bank Co ltd
Original Assignee
Wuxi Xishang Bank Co ltd
Priority date
Filing date
Publication date
Application filed by Wuxi Xishang Bank Co ltd filed Critical Wuxi Xishang Bank Co ltd
Priority to CN202111172275.XA priority Critical patent/CN113901915B/en
Publication of CN113901915A publication Critical patent/CN113901915A/en
Application granted granted Critical
Publication of CN113901915B publication Critical patent/CN113901915B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F 18/24133 Distances to prototypes
    • G06F 18/24137 Distances to cluster centroïds
    • G06F 18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention relates to the technical field of expression detection, and in particular discloses a method for detecting expressions in video using a light-weight network and MagFace, which comprises the following steps: extracting multiple frames of face images to be detected from a face video, processing them to obtain corrected face images, and detecting 68 facial key points on the corrected images; computing the horizontal and vertical optical-flow components between the corrected i-th frame and (i+k)-th frame and converting them into a BGR-space image; extracting the eyebrow and mouth region images from the BGR-space image according to the 68 facial key points and combining them into a target image; and constructing a facial expression recognition model from the training target images and their corresponding labels, then inputting the test target images into the model to obtain the facial expression recognition result. The method can detect micro/macro expressions in video.

Description

Expression detection method for light-weight network and Magface in video
Technical Field
The invention relates to the technical field of expression detection, in particular to a method for detecting expressions in video using a light-weight network and MagFace.
Background
Facial expressions are the most direct visual expression of emotion; they reflect a person's subjective attitude toward external objects and the corresponding behavioral responses. Facial expressions are divided into macro-expressions and micro-expressions. A micro-expression is a spontaneous expression whose duration is only about 1/25 s to 1/5 s and whose motion amplitude is very small. In some scenarios, micro-expressions can convey more, and more trustworthy, information than body language or speech. A macro-expression lasts 0.5 to 4.0 seconds and has a higher intensity. Macro- and micro-expressions may occur together or separately, and the two are distinguished by their relative durations and intensities. Detecting micro- and macro-expressions becomes especially challenging when they are interleaved in long videos.
In general, a facial expression goes through three distinct phases: onset, apex, and offset. The task of expression detection is to locate the onset and offset frames of each micro/macro expression.
Prior expression detection research has established a basic technical framework: first, face alignment is performed via facial key point detection, the eye regions are masked according to the key points, and regions of interest with large expression-related motion (e.g., eyebrows and mouth) are segmented; dense optical flow features are then extracted from the regions of interest and converted into representations such as optical strain images or 3D gradient histograms (3D-HOG) to serve as model input; a neural-network macro/micro expression recognition model is trained with the label of the current frame sample as supervision; finally, a post-processing pipeline over expression units locates the positions of macro/micro expressions in the long video.
Owing to the characteristics of expressive motion, the intensity of an action unit decreases as it approaches the onset or offset frame. Previous work has focused on the imbalance between positive and negative samples, building complex models and mitigating the problem with more robust facial features. However, this does not improve the discriminability between positive and negative samples during model training.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and provides a method for detecting expressions in video using a light-weight network and MagFace, so as to solve the problem in the related art that the discriminability between positive and negative samples is not improved during model training.
The invention provides a method for detecting expressions in video using a light-weight network and MagFace, which comprises the following steps:
step S1: acquiring a plurality of face videos to be detected, wherein each face video to be detected comprises multiple frames of face images to be detected, denoted S = {src_1, ..., src_i, ..., src_N}, where src_i is the i-th frame of face image to be detected in the face video to be detected;
step S2: extracting the multiple frames of face images to be detected from each face video to be detected and processing them respectively to obtain a corrected face image set S' = {src'_1, ..., src'_i, ..., src'_N}, where src'_i is the corrected i-th frame of face image to be detected;
step S3: performing face key point detection on the corrected multi-frame face images to be detected, respectively, to obtain 68 face key points;
step S4: calculating the horizontal component u and the vertical component v between the corrected i-th frame face image to be detected src'_i and the (i+k)-th frame face image src'_(i+k), and converting the horizontal component u and the vertical component v into a BGR-space image img_i, wherein the label of the BGR-space image img_i is assigned by a dynamic threshold;
step S5: extracting the eyebrow region image and the mouth region image of the BGR-space image img_i according to the 68 face key points, and processing the eyebrow region image and the mouth region image respectively to obtain a final target image img_i^roi;
Step S6: dividing a plurality of final target images of the face video to be detected into a plurality of training target images and a plurality of testing target images, constructing a facial expression recognition model according to the training target images and labels corresponding to the training target images, and inputting the testing target images into the facial expression recognition model to obtain a facial expression recognition result of the face video to be detected;
the facial expression recognition model is a deep convolutional neural network model, and the deep convolutional neural network model comprises a skeleton network, a neck network and a head network;
the skeletal network comprises a series of convolutional and pooling layers;
the neck network comprises two parts, a first part comprising a Conv2d convolutional layer, a BN layer and a ReLU6 layer, and a second part comprising a Dropout layer and a Linear layer;
the head network comprises two parts, a first part comprising a loss function consisting of Cross Entropy and MagFace, and a second part comprising a classification layer.
Further, the steps S2 and S3 further include:
performing frame-by-frame face detection with FaceX-Zoo on the multiple frames of face images to be detected in the face video S = {src_1, ..., src_i, ..., src_N}, to obtain a face coordinate frame set bbox = {bbox_1, ..., bbox_i, ..., bbox_N} and a 5-point face key point set lmk = {lmk_1, ..., lmk_i, ..., lmk_N}; wherein src_i is the i-th frame of face image to be detected in the face video to be detected, bbox_i is the face coordinate frame of the i-th frame, lmk_i is the set of 5 face key points of the i-th frame, and N is the total number of frames of face images in the face video to be detected;
performing face alignment on the multiple frames of face images to be detected according to the 5-point face key points lmk_i, to obtain a corrected face image set S' = {src'_1, ..., src'_i, ..., src'_N} and a corrected face coordinate frame set bbox' = {bbox'_1, ..., bbox'_i, ..., bbox'_N}, where src'_i is the corrected i-th frame of face image to be detected, src'_i ∈ 112×112×3, and bbox'_i is the face coordinate frame of the corrected i-th frame of face image to be detected;
performing face key point detection with FaceX-Zoo on the face image region within the corrected face coordinate frame bbox'_i, to obtain a 68-point face key point set lmk' = {lmk'_1, ..., lmk'_i, ..., lmk'_N}, where lmk'_i is the 68 face key points within the face coordinate frame bbox'_i of the corrected i-th frame of face image to be detected.
Further, the step S4 further includes:
the value of k is taken as half of the average length of the facial expressions in the face video data set to be detected, and is calculated as shown in (1):
k = (1/(2M)) * Σ_{j=1}^{M} (F_j·offset - F_j·onset)   (1)
wherein M is the total number of videos in the face video data set to be detected, and [F_j·onset, F_j·offset] is the interval between the onset frame and the offset frame of the facial expression in the j-th face video to be detected;
the label of the BGR-space image img_i is assigned according to the intersection-over-union IOU_i and a dynamic threshold T_IOU: when the intersection-over-union IOU_i is greater than the dynamic threshold T_IOU, the BGR-space image img_i belongs to a facial expression region and its label is defined as label_i = 1; otherwise label_i = 0;
wherein the intersection-over-union IOU_i is computed, for the j-th face video to be detected, between the facial expression interval [F_j·onset, F_j·offset] and the interval [F_i, F_(i+k)] determined by the value of k, as shown in (2), and the dynamic threshold T_IOU is computed as shown in (3):
IOU_i = |[F_i, F_(i+k)] ∩ [F_j·onset, F_j·offset]| / |[F_i, F_(i+k)] ∪ [F_j·onset, F_j·offset]|   (2)
in the formula, F_j·onset is the onset frame of the facial expression in the j-th face video to be detected, F_j·offset is the offset frame of the facial expression in the j-th face video to be detected, and [F_i, F_(i+k)] is the interval between the i-th frame and the (i+k)-th frame corresponding to the BGR-space image img_i in the j-th face video to be detected.
Further, the step S5 further includes:
the eyebrow region image ROI_1 is the region image cropped by a coordinate frame computed from the corresponding face key points and expanded outward, as shown in (4); in the formula, ω_1 is the number of expanded pixels;
the mouth region image ROI_2 is the region image cropped by a coordinate frame computed from the corresponding face key points and expanded outward, as shown in (5); in the formula, ω_2 is the number of expanded pixels;
the images of the two regions are respectively normalized to H × W and then combined to obtain the final target image img_i^roi, where H and W are the normalized height and width, respectively.
Further, in the step S6, the plurality of training target images constitute a training set IMG_train, the plurality of test target images constitute a test set IMG_test, and the method further comprises the following steps:
step S61: inputting the test target images of the j-th test set IMG_test·j into the facial expression recognition model to obtain, for each test target image of the j-th test set IMG_test·j, the predicted label label'_j ∈ {0, 1} and the confidence value_j ∈ [0, 1]; calculating the facial expression score s_i·j of the i-th frame of test target image of the j-th test set IMG_test·j according to formula (6); the facial expression scores of the N frames of test target images in the j-th test set IMG_test·j then form the set S_j = {s_0·j, ..., s_i·j, ..., s_N·j};
s_i·j = value_i·j * label'_i·j   (6)
step S62: smoothing the facial expression score set S_j of the N frames of test target images in the j-th test set IMG_test·j with Savitzky-Golay convolution smoothing into a continuous curve S'_j;
step S63: adopting a dynamic threshold T for the curve S'_j, wherein the dynamic threshold T is calculated as shown in (7), S_mean is the mean of the facial expression scores in the score set S_j, S_max is the maximum of the facial expression scores in the score set S_j, and η is a weight coefficient;
T = S_mean + η * (S_max - S_mean)   (7)
step S64: searching the curve S'_j for peaks with the threshold T_j and the nearest-neighbor distance 2k as constraints: a peak is a target facial expression peak if and only if its value on the curve S'_j is greater than the threshold T_j and its distance to adjacent peak points is greater than 2k; for each peak p_i in the target peak set satisfying these constraints, the peak interval [p_i - k, p_i + k] is computed, p_i being the peak of the i-th target facial expression, yielding the final predicted facial expression label interval set {[p_0 - k, p_0 + k], ..., [p_n - k, p_n + k]};
step S65: when the overlap IOU between a predicted interval in the facial expression label interval set {[p_0 - k, p_0 + k], ..., [p_n - k, p_n + k]} and a real interval is greater than or equal to 0.5, the predicted interval in the facial expression label interval set is judged to be correct.
The method for detecting expressions in video using the light-weight network and MagFace has the following advantages: inspired by the performance measurement criteria of detection tasks, the labels of the data are assigned by means of an overlap (IOU) measure, which alleviates errors in the annotation; at the core, following the idea of MagFace for handling intra-class feature distribution and face image quality in face recognition, a general low-dimensional feature-embedding loss model, LGNMNet, is constructed for expression classification and recognition; in addition, the classification task is converted into a regression problem, and polynomial fitting is combined with peak detection to predict the likelihood that a video frame belongs to a macro- or micro-expression.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
Fig. 1 is a flowchart of a method for detecting expressions of a light-weight network and MagFace in a video according to the present invention.
Fig. 2 is a flowchart for processing a face image to be detected according to the present invention.
Fig. 3 is a schematic structural diagram of a facial expression recognition model provided by the present invention.
Fig. 4 is a schematic diagram of a facial expression recognition result provided by the present invention.
Fig. 5 is a schematic diagram of 68 key points of a human face according to the present invention.
Detailed Description
To further illustrate the technical means and effects of the present invention adopted to achieve the predetermined objects, the following detailed description will be given to the embodiments, structures, features and effects of the method for detecting expressions in videos of a lightweight network and MagFace according to the present invention with reference to the accompanying drawings and preferred embodiments. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without any inventive step, are within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged under appropriate circumstances in order to facilitate the description of the embodiments of the invention herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In this embodiment, a method for detecting expressions of a lightweight network and a MagFace in a video is provided, and as shown in fig. 1, the method for detecting expressions of a lightweight network and a MagFace in a video includes:
step S1: acquiring a plurality of face videos to be detected, wherein each face video to be detected comprises multiple frames of face images to be detected, denoted S = {src_1, ..., src_i, ..., src_N}, where src_i is the i-th frame of face image to be detected in the face video to be detected;
step S2: extracting the multiple frames of face images to be detected from each face video to be detected and processing them respectively to obtain a corrected face image set S' = {src'_1, ..., src'_i, ..., src'_N}, where src'_i is the corrected i-th frame of face image to be detected;
step S3: respectively carrying out face key point detection on the corrected multi-frame face images to be detected to obtain 68 face key points;
it should be noted that the purpose of extracting 68 key points of the face is that the key points are distributed in key areas (eyebrows, eyes, mouth, etc.) of facial expressions, as shown in fig. 5;
step S4: calculating, with TV-L1 optical flow, the horizontal component u and the vertical component v between the corrected i-th frame face image to be detected src'_i and the (i+k)-th frame face image src'_(i+k), and converting the horizontal component u and the vertical component v into a BGR-space image img_i, wherein the label of the BGR-space image img_i is assigned by a dynamic threshold;
it should be noted that, in step S4, the corrected face images to be detected of the i-th frame and the (i+k)-th frame are selected, where k is half of the average length of the facial expressions (see the optical-flow sketch below);
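One possible realization of this step uses the TV-L1 implementation from the opencv-contrib Python package and an HSV encoding of the flow to obtain the BGR image; both the library choice and the HSV-based conversion below are assumptions for illustration, not the exact procedure fixed by the invention.

import cv2
import numpy as np

def flow_to_bgr(prev_bgr, next_bgr):
    # Compute TV-L1 dense optical flow between two aligned face frames and
    # convert its (u, v) components into a BGR image: hue encodes direction,
    # value encodes normalized magnitude.
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()   # requires opencv-contrib
    flow = tvl1.calc(prev_gray, next_gray, None)      # H x W x 2 array of (u, v)
    u, v = flow[..., 0], flow[..., 1]
    mag, ang = cv2.cartToPolar(u, v, angleInDegrees=True)
    hsv = np.zeros((*prev_gray.shape, 3), dtype=np.uint8)
    hsv[..., 0] = (ang / 2).astype(np.uint8)          # OpenCV hue range is [0, 180)
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)       # img_i in the text

# usage: img_i = flow_to_bgr(src_corrected[i], src_corrected[i + k])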
step S5: extracting the eyebrow region image and the mouth region image of the BGR-space image img_i according to the 68 face key points, and processing the eyebrow region image and the mouth region image respectively to obtain the final target image img_i^roi;
Step S6: processing videos in batches, dividing a plurality of final target images of the face video to be detected into a plurality of training target images and a plurality of testing target images according to leave-one-out cross validation, constructing a facial expression recognition model based on LGNMNet according to the training target images and corresponding dynamic labels thereof, and inputting the testing target images into the facial expression recognition model to obtain a facial expression recognition result;
the facial expression recognition model is a deep convolutional neural network model, as shown in fig. 3, the deep convolutional neural network model includes a skeleton network, a neck network and a head network;
the skeleton network is a MobileNet_v2 network comprising a series of convolutional layers and pooling layers;
the neck network comprises two parts, a first part comprising a Conv2d convolutional layer, a BN layer and a ReLU6 layer, and a second part comprising a Dropout layer and a Linear layer;
the head network comprises two parts, a first part comprising a loss function consisting of Cross Entropy and MagFace, and a second part comprising a classification layer.
In the embodiment of the present invention, as shown in fig. 3, the deep convolutional neural network model specifically includes:
(1) first, a forward-propagation deep convolutional neural network model is constructed; the input of the network model is the preprocessed training target images and their corresponding dynamic labels, and the overall network structure is divided into a skeleton network, a neck network and a head network;
(2) the skeleton network comprises a series of convolutional layers and pooling layers and is obtained from a convolutional neural network by removing its classification layer; the first part of the neck network consists of a Conv2d convolutional layer, a BN layer and a ReLU6 layer, whose output is normalized; the second part of the neck network adds a Dropout layer to mitigate overfitting and a Linear layer to remove redundant information;
(3) the head network covers two tasks: a loss function consisting of Cross Entropy and MagFace computes the error between the feature vector output by the neck network and the ground-truth classes and optimizes the model to learn the features; the magnitude of the feature embedding in MagFace represents the intensity of a motion unit, while an adaptive margin pulls easy samples closer to the class center and pushes hard samples further away, yielding a better intra-class feature distribution, avoiding model overfitting and improving generalization; the other part is output by MagFace to the classification layer, and the predicted label and confidence of the data are obtained through Softmax (a simplified sketch follows).
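The following minimal PyTorch code illustrates such a skeleton/neck/head arrangement; it assumes a torchvision MobileNetV2 backbone, and the layer widths, the 128-dimensional embedding and the plain cosine head (a stand-in for the full MagFace margin) are illustrative assumptions rather than the exact configuration of the invention.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v2

class LGNMNetSketch(nn.Module):
    # Skeleton (MobileNetV2 features) + neck (Conv2d/BN/ReLU6, then Dropout/Linear)
    # + head (cosine classifier whose logits can be fed to a MagFace-style loss).
    def __init__(self, num_classes=2, embed_dim=128):
        super().__init__()
        self.skeleton = mobilenet_v2().features                  # conv + pooling stages
        self.neck_conv = nn.Sequential(
            nn.Conv2d(1280, 512, kernel_size=1),
            nn.BatchNorm2d(512),
            nn.ReLU6(inplace=True),
        )
        self.neck_fc = nn.Sequential(nn.Dropout(p=0.5), nn.Linear(512, embed_dim))
        # class-center weights for a margin-based (MagFace-like) softmax head
        self.centers = nn.Parameter(torch.randn(num_classes, embed_dim))

    def forward(self, x):
        feat = self.skeleton(x)                        # B x 1280 x h x w
        feat = self.neck_conv(feat).mean(dim=(2, 3))   # global average pooling
        emb = self.neck_fc(feat)                       # low-dimensional embedding
        # cosine logits; a full MagFace loss would add a magnitude-aware margin here
        logits = F.normalize(emb) @ F.normalize(self.centers).t()
        return emb, logits

# usage:
# emb, logits = LGNMNetSketch()(torch.randn(4, 3, 112, 112))
# loss = F.cross_entropy(logits * 64.0, torch.tensor([0, 1, 0, 1]))  # scaled cosine + CE stand-in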
Preferably, as shown in fig. 2, the steps S2 and S3 further include:
performing frame-by-frame face detection with FaceX-Zoo on the multiple frames of face images to be detected in the face video S = {src_1, ..., src_i, ..., src_N}, to obtain a face coordinate frame set bbox = {bbox_1, ..., bbox_i, ..., bbox_N} and a 5-point face key point set lmk = {lmk_1, ..., lmk_i, ..., lmk_N}; wherein src_i is the i-th frame of face image to be detected in the face video to be detected, bbox_i is the face coordinate frame of the i-th frame, lmk_i is the set of 5 face key points of the i-th frame, and N is the total number of frames of face images in the face video to be detected;
in order to reduce the influence of camera shake or face swing, all face images to be detected are normalized to the same scale and angle: face alignment is performed on the multiple frames of face images to be detected according to the 5-point face key points lmk_i and the images are normalized to 112 × 112, yielding a corrected face image set S' = {src'_1, ..., src'_i, ..., src'_N} and a corrected face coordinate frame set bbox' = {bbox'_1, ..., bbox'_i, ..., bbox'_N}, where src'_i is the corrected i-th frame of face image to be detected, src'_i ∈ 112×112×3, and bbox'_i is the face coordinate frame of the corrected i-th frame;
performing face key point detection with FaceX-Zoo on the face image region within the corrected face coordinate frame bbox'_i, to obtain a 68-point face key point set lmk' = {lmk'_1, ..., lmk'_i, ..., lmk'_N}, where lmk'_i is the 68 face key points within the face coordinate frame bbox'_i of the corrected i-th frame of face image to be detected (the alignment step is illustrated below).
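A minimal sketch of the 5-point alignment to 112 × 112 is given below; it assumes the face boxes and 5-point landmarks have already been produced by the detector (for example by FaceX-Zoo, whose API is not reproduced here), and the reference landmark template is the commonly used ArcFace-style 112 × 112 layout, which is an assumption rather than a template prescribed by the invention.

import cv2
import numpy as np

# Widely used 112x112 reference positions of the 5 landmarks
# (left eye, right eye, nose tip, left mouth corner, right mouth corner) -- assumed template.
REFERENCE_5PTS = np.float32([
    [38.2946, 51.6963], [73.5318, 51.5014], [56.0252, 71.7366],
    [41.5493, 92.3655], [70.7299, 92.2041],
])

def align_face_112(frame_bgr, lmk5):
    # Warp one frame so that its 5 landmarks match the reference template,
    # producing the corrected 112x112x3 face image src'_i.
    lmk5 = np.float32(lmk5).reshape(5, 2)
    # similarity transform: rotation + uniform scale + translation
    M, _ = cv2.estimateAffinePartial2D(lmk5, REFERENCE_5PTS, method=cv2.LMEDS)
    return cv2.warpAffine(frame_bgr, M, (112, 112), borderValue=0)

# usage: src_corrected_i = align_face_112(src_i, lmk_i)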
Preferably, in step S4, the method further includes:
the value of k is taken as half of the average length of the facial expressions in the face video data set to be detected, and is calculated as shown in (1):
k = (1/(2M)) * Σ_{j=1}^{M} (F_j·offset - F_j·onset)   (1)
wherein M is the total number of videos in the face video data set to be detected, and [F_j·onset, F_j·offset] is the interval between the onset frame and the offset frame of the facial expression in the j-th face video to be detected;
the label of the BGR-space image img_i is assigned according to the intersection-over-union IOU_i between the current facial expression interval [F_j·onset, F_j·offset] and the interval determined by k, and a dynamic threshold T_IOU: when the intersection-over-union IOU_i is greater than the dynamic threshold T_IOU, the BGR-space image img_i belongs to a facial expression (micro/macro expression) region and its label is defined as label_i = 1; otherwise label_i = 0;
wherein the intersection-over-union IOU_i is computed, for the j-th face video to be detected, between the facial expression interval [F_j·onset, F_j·offset] and the interval [F_i, F_(i+k)] determined by the value of k, as shown in (2), and the dynamic threshold T_IOU is computed as shown in (3):
IOU_i = |[F_i, F_(i+k)] ∩ [F_j·onset, F_j·offset]| / |[F_i, F_(i+k)] ∪ [F_j·onset, F_j·offset]|   (2)
in the formula, F_j·onset is the onset frame of the facial expression in the j-th face video to be detected, F_j·offset is the offset frame of the facial expression in the j-th face video to be detected, and [F_i, F_(i+k)] is the interval between the i-th frame and the (i+k)-th frame corresponding to the BGR-space image img_i in the j-th face video to be detected.
It should be understood that a sample could simply be judged positive whenever the intersection-over-union IOU_i satisfies a fixed threshold, as in equation (2). However, expression action units differ in length, and a fixed threshold is unfavorable for the current interval [F_j·onset, F_j·offset] when its length is far larger than k; therefore the dynamic threshold T_IOU is adopted and the label is defined as shown in equation (3), with labels assigned as sketched below.
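A small sketch of this IOU-based labeling follows; because formula (3) is reproduced only as an image in the source text, the dynamic threshold is passed in as an argument with a fixed placeholder default (an assumption), while the interval IOU follows formula (2).

def interval_iou(a_start, a_end, b_start, b_end):
    # Intersection-over-union of the frame intervals [a_start, a_end] and [b_start, b_end].
    inter = max(0, min(a_end, b_end) - max(a_start, b_start))
    union = max(a_end, b_end) - min(a_start, b_start)
    return inter / union if union > 0 else 0.0

def label_frames(num_frames, k, expressions, iou_threshold=0.5):
    # Assign label_i to each optical-flow image img_i built from frames i and i+k.
    # expressions: list of (onset, offset) ground-truth intervals of one video.
    # iou_threshold: placeholder standing in for the dynamic threshold T_IOU of formula (3).
    labels = []
    for i in range(num_frames - k):
        iou = max((interval_iou(i, i + k, onset, offset)
                   for onset, offset in expressions), default=0.0)
        labels.append(1 if iou > iou_threshold else 0)
    return labels

# usage (k per formula (1), i.e. about half the mean expression length of the data set):
# labels = label_frames(num_frames=len(video_frames), k=6, expressions=[(120, 151)])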
It should be noted that the facial expression includes a micro expression or a macro expression, and the facial expression recognition model includes a micro expression recognition model or a macro expression recognition model.
Preferably, as shown in fig. 2, the step S5 further includes:
the eyebrow region image ROI_1 is the region image cropped by a coordinate frame computed from the corresponding face key points and expanded outward, as shown in (4); in the formula, ω_1 is the number of expanded pixels;
the mouth region image ROI_2 is the region image cropped by a coordinate frame computed from the corresponding face key points and expanded outward, as shown in (5); in the formula, ω_2 is the number of expanded pixels;
the images of the two regions are respectively normalized to H × W and then combined to obtain the final target image img_i^roi, where H and W are the normalized height and width, respectively; here H = 112 and W = 112 are used.
In the embodiment of the invention, the facial motion units of micro/macro expressions are mainly distributed in the eyebrow, mouth and eye-corner regions, so three ROI images are extracted from these parts. To reduce the effect of global motion, mean and standard deviation processing is applied to the pixels of the nasal bridge region. Meanwhile, because optical-flow features are highly sensitive to blinking, the left and right eye regions are masked with a polygon whose boundary is expanded by 6 pixels. The regions are specifically: (i) left eye and left eyebrow; (ii) right eye and right eyebrow; (iii) mouth. The images of regions (i) and (ii) are normalized to 56 × 56 each and stitched into a 56 × 112 image, the image of region (iii) is normalized to 56 × 112 (height × width), and the combination yields a new 112 × 112 (height × width) image that retains the dominant facial motion units (a cropping sketch follows).
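The cropping and stitching just described can be sketched as follows; the 68-point index ranges (eyebrows 17-26, eyes 36-47, mouth 48-67) follow the common iBUG/dlib numbering, and the padding values are illustrative assumptions.

import cv2
import numpy as np

def crop_box(img, points, pad):
    # Crop a rectangle around the given landmark points, expanded by `pad` pixels.
    x0, y0 = np.min(points, axis=0).astype(int) - pad
    x1, y1 = np.max(points, axis=0).astype(int) + pad
    h, w = img.shape[:2]
    return img[max(y0, 0):min(y1, h), max(x0, 0):min(x1, w)]

def build_target_image(flow_bgr, lmk68, pad_brow=12, pad_mouth=12):
    # Assemble the 112x112 target image img_i^roi from the optical-flow image:
    # (i) left eye + eyebrow and (ii) right eye + eyebrow -> 56x56 each, stitched
    # side by side; (iii) mouth -> 56x112; the two rows are stacked vertically.
    lmk68 = np.asarray(lmk68, dtype=np.float32)
    left = crop_box(flow_bgr, np.vstack([lmk68[17:22], lmk68[36:42]]), pad_brow)
    right = crop_box(flow_bgr, np.vstack([lmk68[22:27], lmk68[42:48]]), pad_brow)
    mouth = crop_box(flow_bgr, lmk68[48:68], pad_mouth)
    top = np.hstack([cv2.resize(left, (56, 56)), cv2.resize(right, (56, 56))])
    bottom = cv2.resize(mouth, (112, 56))              # cv2.resize takes (width, height)
    return np.vstack([top, bottom])                    # 112 x 112 x 3

# usage: img_roi = build_target_image(img_i, lmk68_i)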
Preferably, the present embodiment provides a post-processing flow for expression detection (micro-expressions are taken as an example). As shown in fig. 4, in the step S6 the plurality of training target images form a training set IMG_train, the plurality of test target images constitute a test set IMG_test, and the method further comprises the following steps.
To show the detection effect of the method on micro/macro expressions in long videos, a long-video detection example is constructed. Because the data set sample SAMM-Long-video (034_7) simultaneously contains 2 macro-expression intervals and 2 micro-expression intervals, it allows the macro/micro expression detection tasks to be evaluated visually at the same time and is therefore used as the reference; the two curves in fig. 4 respectively represent the score-set curves of the micro/macro expressions of the N frames of test target images in the j-th test set IMG_test·j;
step S61: inputting the test target images of the j-th test set IMG_test·j into the facial expression recognition model to obtain, for each test target image of the j-th test set IMG_test·j, the predicted label label'_j ∈ {0, 1} and the confidence value_j ∈ [0, 1]; calculating the facial expression score s_i·j of the i-th frame of test target image of the j-th test set IMG_test·j according to formula (6); the facial expression scores of the N frames of test target images in the j-th test set IMG_test·j then form the set S_j = {s_0·j, ..., s_i·j, ..., s_N·j};
s_i·j = value_i·j * label'_i·j   (6)
step S62: because the facial expression score set S_j tends to be discrete, and in order to eliminate errors caused by model classification, the facial expression score set S_j of the N frames of test target images in the j-th test set IMG_test·j is smoothed with Savitzky-Golay convolution smoothing into a continuous curve S'_j;
step S63: the two horizontal lines in fig. 4 are the dynamic thresholds for micro/macro expressions; because the facial expression score sets of different videos differ greatly, a dynamic threshold T is adopted for the curve S'_j; the dynamic threshold T is calculated as shown in (7), where S_mean is the mean of the facial expression scores in the score set S_j, S_max is the maximum of the facial expression scores in the score set S_j, and η is a weight coefficient;
T = S_mean + η * (S_max - S_mean)   (7)
step S64: searching the curve S'_j for peaks with scipy.signal.find_peaks, using the threshold T_j and the nearest-neighbor distance 2k as constraints: a peak is a target facial expression peak if and only if its value on the curve S'_j is greater than the threshold T_j and its distance to adjacent peak points is greater than 2k; for each peak p_i in the target peak set satisfying these constraints, the peak interval [p_i - k, p_i + k] is computed, p_i being the peak of the i-th target facial expression, yielding the final predicted facial expression label interval set {[p_0 - k, p_0 + k], ..., [p_n - k, p_n + k]};
step S65: when the overlap IOU between a predicted interval in the facial expression label interval set {[p_0 - k, p_0 + k], ..., [p_n - k, p_n + k]} and a real interval is greater than or equal to 0.5, the predicted interval in the facial expression label interval set is judged to be correct; the balanced average F1-score is taken as the evaluation index, and the weight coefficient η is tuned by grid search. A compact sketch of steps S62 to S65 is given below.
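The sketch uses scipy's savgol_filter and find_peaks; the smoothing window, polynomial order and η value are illustrative assumptions.

import numpy as np
from scipy.signal import savgol_filter, find_peaks

def spot_intervals(scores, k, eta=0.6, window=11, polyorder=3):
    # Steps S62-S64: smooth the per-frame scores s_i, build the dynamic threshold
    # T = S_mean + eta * (S_max - S_mean), and return the intervals [p - k, p + k]
    # around the peaks that clear the threshold and the 2k spacing constraint.
    scores = np.asarray(scores, dtype=float)
    curve = savgol_filter(scores, window, polyorder)
    threshold = scores.mean() + eta * (scores.max() - scores.mean())
    peaks, _ = find_peaks(curve, height=threshold, distance=2 * k)
    return [(p - k, p + k) for p in peaks]

def interval_iou(a, b):
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def is_correct(predicted, ground_truth):
    # Step S65: a predicted interval counts as correct if it overlaps some
    # ground-truth interval with IOU >= 0.5.
    return any(interval_iou(predicted, gt) >= 0.5 for gt in ground_truth)

# usage:
# intervals = spot_intervals(scores=s_j, k=6)
# hits = [is_correct(p, [(120, 151), (300, 340)]) for p in intervals]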
It should be noted that the lower part of fig. 4 shows the predicted intervals (Predicted) and the real intervals (Ground Truth) of the expression units. The proposed LGNMNet method produces multiple peaks when detecting expressions in the long video, and most of the peaks appear near the expression units, showing that the model can effectively identify expression frames. Second, several expression unit intervals are found in the long video; the detection results are (Precision = 100%, Recall = 100%) for macro-expressions and (Precision = 100%, Recall = 66.67%) for micro-expressions, with 1 false micro-expression detection occurring inside a macro-expression region, possibly caused by the influence of the macro-expression. In addition, the score curves of the macro/micro expressions both have multiple peaks, many of which lie outside the expression intervals; these falsely identified peaks are effectively removed after the post-processing method is applied, which demonstrates that the post-processing is effective in eliminating false peaks.
The invention provides a method for detecting expressions in video using a light-weight network and MagFace, and particularly relates to detecting micro/macro expressions in long videos based on dense optical flow features combined with a Lite General Network And MagFace CNN (LGNMNet) model. Inspired by the performance measurement criteria of detection tasks, the data labels are assigned by means of an IOU measure to alleviate errors in the annotation; meanwhile, the classification task is converted into a regression problem, and polynomial fitting is combined to predict the probability that a frame belongs to a macro- or micro-expression. At its core, following the idea of MagFace for handling intra-class feature distribution and face image quality in face recognition, a general low-dimensional feature-embedding loss model, LGNMNet, is constructed; the weight of the face features in the peak region increases monotonically, easy samples are pulled toward the class center and hard samples are pushed away, preventing the model from overfitting on noisy, low-quality samples.
The expression detection method of the light-weight network and the Magface in the video provided by the invention has the following advantages:
(1) a dynamic-threshold label assignment method is used to reduce annotation errors for samples near the onset and offset;
(2) LGNMNet is provided: low-dimensional features extracted by a lightweight model are embedded into a MagFace loss, improving the learning of hard samples and alleviating model overfitting on low-quality samples;
(3) the classification problem is converted into a regression problem by combining the predicted label and the confidence as the test result, so that the results are more separable and expression localization is facilitated.
Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (5)

1. A method for detecting expressions of a light-weight network and Magface in a video is characterized by comprising the following steps:
step S1: acquiring a plurality of face videos to be detected, wherein each face video to be detected comprises multiple frames of face images to be detected, denoted S = {src_1, ..., src_i, ..., src_N}, where src_i is the i-th frame of face image to be detected in the face video to be detected;
step S2: extracting the multiple frames of face images to be detected from each face video to be detected and processing them respectively to obtain a corrected face image set S' = {src'_1, ..., src'_i, ..., src'_N}, where src'_i is the corrected i-th frame of face image to be detected;
step S3: respectively carrying out face key point detection on the corrected multi-frame face images to be detected to obtain 68 face key points;
step S4: calculating the horizontal component u and the vertical component v between the corrected i-th frame face image to be detected src'_i and the (i+k)-th frame face image src'_(i+k), and converting the horizontal component u and the vertical component v into a BGR-space image img_i, wherein the label of the BGR-space image img_i is assigned by a dynamic threshold;
step S5: extracting the eyebrow region image and the mouth region image of the BGR-space image img_i according to the 68 face key points, and processing the eyebrow region image and the mouth region image respectively to obtain a final target image img_i^roi;
Step S6: dividing a plurality of final target images of the face video to be detected into a plurality of training target images and a plurality of testing target images, constructing a facial expression recognition model according to the training target images and labels corresponding to the training target images, and inputting the testing target images into the facial expression recognition model to obtain a facial expression recognition result of the face video to be detected;
the facial expression recognition model is a deep convolutional neural network model, and the deep convolutional neural network model comprises a skeleton network, a neck network and a head network;
the skeletal network comprises a series of convolutional and pooling layers;
the neck network comprises two parts, a first part comprising a Conv2d convolutional layer, a BN layer and a ReLU6 layer, and a second part comprising a Dropout layer and a Linear layer;
the head network comprises two parts, a first part comprising a loss function consisting of Cross Entropy and MagFace, and a second part comprising a classification layer.
2. The method for detecting expressions in videos by a lightweight network and Magface according to claim 1, wherein the steps S2 and S3 further include:
performing frame-by-frame face detection with FaceX-Zoo on the multiple frames of face images to be detected in the face video S = {src_1, ..., src_i, ..., src_N}, to obtain a face coordinate frame set bbox = {bbox_1, ..., bbox_i, ..., bbox_N} and a 5-point face key point set lmk = {lmk_1, ..., lmk_i, ..., lmk_N}; wherein src_i is the i-th frame of face image to be detected in the face video to be detected, bbox_i is the face coordinate frame of the i-th frame, lmk_i is the set of 5 face key points of the i-th frame, and N is the total number of frames of face images in the face video to be detected;
performing face alignment on the multiple frames of face images to be detected according to the 5-point face key points lmk_i, to obtain a corrected face image set S' = {src'_1, ..., src'_i, ..., src'_N} and a corrected face coordinate frame set bbox' = {bbox'_1, ..., bbox'_i, ..., bbox'_N}, where src'_i is the corrected i-th frame of face image to be detected, src'_i ∈ 112×112×3, and bbox'_i is the face coordinate frame of the corrected i-th frame of face image to be detected;
performing face key point detection with FaceX-Zoo on the face image region within the corrected face coordinate frame bbox'_i, to obtain a 68-point face key point set lmk' = {lmk'_1, ..., lmk'_i, ..., lmk'_N}, where lmk'_i is the 68 face key points within the face coordinate frame bbox'_i of the corrected i-th frame of face image to be detected.
3. The method for detecting expressions in videos by using a lightweight network and MagFace according to claim 2, wherein the step S4 further includes:
the value of k is taken as half of the average length of the facial expressions in the face video data set to be detected, and is calculated as shown in (1):
k = (1/(2M)) * Σ_{j=1}^{M} (F_j·offset - F_j·onset)   (1)
wherein M is the total number of videos in the face video data set to be detected, and [F_j·onset, F_j·offset] is the interval between the onset frame and the offset frame of the facial expression in the j-th face video to be detected;
the label of the BGR-space image img_i is assigned according to the intersection-over-union IOU_i and a dynamic threshold T_IOU: when the intersection-over-union IOU_i is greater than the dynamic threshold T_IOU, the BGR-space image img_i belongs to a facial expression region and its label is defined as label_i = 1; otherwise label_i = 0;
wherein the intersection-over-union IOU_i is computed, for the j-th face video to be detected, between the facial expression interval [F_j·onset, F_j·offset] and the interval [F_i, F_(i+k)] determined by the value of k, as shown in (2), and the dynamic threshold T_IOU is computed as shown in (3):
IOU_i = |[F_i, F_(i+k)] ∩ [F_j·onset, F_j·offset]| / |[F_i, F_(i+k)] ∪ [F_j·onset, F_j·offset]|   (2)
in the formula, F_j·onset is the onset frame of the facial expression in the j-th face video to be detected, F_j·offset is the offset frame of the facial expression in the j-th face video to be detected, and [F_i, F_(i+k)] is the interval between the i-th frame and the (i+k)-th frame corresponding to the BGR-space image img_i in the j-th face video to be detected.
4. The method for detecting expressions in videos by using a lightweight network and MagFace according to claim 3, wherein the step S5 further includes:
the eyebrow region image ROI_1 is the region image cropped by a coordinate frame computed from the corresponding face key points and expanded outward, as shown in (4); in the formula, ω_1 is the number of expanded pixels;
the mouth region image ROI_2 is the region image cropped by a coordinate frame computed from the corresponding face key points and expanded outward, as shown in (5); in the formula, ω_2 is the number of expanded pixels;
the images of the two regions are respectively normalized to H × W and then combined to obtain the final target image img_i^roi, where H and W are the normalized height and width, respectively.
5. The method for detecting expressions in videos by using a lightweight network and Magface as claimed in claim 4, wherein in step S6 the plurality of training target images constitute a training set IMG_train, the plurality of test target images constitute a test set IMG_test, and the method further comprises the following steps:
step S61: inputting the test target images of the j-th test set IMG_test·j into the facial expression recognition model to obtain, for each test target image of the j-th test set IMG_test·j, the predicted label label'_j ∈ {0, 1} and the confidence value_j ∈ [0, 1]; calculating the facial expression score s_i·j of the i-th frame of test target image of the j-th test set IMG_test·j according to formula (6); the facial expression scores of the N frames of test target images in the j-th test set IMG_test·j then form the set S_j = {s_0·j, ..., s_i·j, ..., s_N·j};
s_i·j = value_i·j * label'_i·j   (6)
step S62: smoothing the facial expression score set S_j of the N frames of test target images in the j-th test set IMG_test·j with Savitzky-Golay convolution smoothing into a continuous curve S'_j;
step S63: adopting a dynamic threshold T for the curve S'_j, wherein the dynamic threshold T is calculated as shown in (7), S_mean is the mean of the facial expression scores in the score set S_j, S_max is the maximum of the facial expression scores in the score set S_j, and η is a weight coefficient;
T = S_mean + η * (S_max - S_mean)   (7)
step S64: searching the curve S'_j for peaks with the threshold T_j and the nearest-neighbor distance 2k as constraints: a peak is a target facial expression peak if and only if its value on the curve S'_j is greater than the threshold T_j and its distance to adjacent peak points is greater than 2k; for each peak p_i in the target peak set satisfying these constraints, the peak interval [p_i - k, p_i + k] is computed, p_i being the peak of the i-th target facial expression, yielding the final predicted facial expression label interval set {[p_0 - k, p_0 + k], ..., [p_n - k, p_n + k]};
step S65: when the overlap IOU between a predicted interval in the facial expression label interval set {[p_0 - k, p_0 + k], ..., [p_n - k, p_n + k]} and a real interval is greater than or equal to 0.5, the predicted interval in the facial expression label interval set is judged to be correct.
CN202111172275.XA 2021-10-08 2021-10-08 Expression detection method of light-weight network and MagFace in video Active CN113901915B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111172275.XA CN113901915B (en) 2021-10-08 2021-10-08 Expression detection method of light-weight network and MagFace in video


Publications (2)

Publication Number Publication Date
CN113901915A true CN113901915A (en) 2022-01-07
CN113901915B CN113901915B (en) 2024-04-02

Family

ID=79190411

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111172275.XA Active CN113901915B (en) 2021-10-08 2021-10-08 Expression detection method of light-weight network and MagFace in video

Country Status (1)

Country Link
CN (1) CN113901915B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019184125A1 (en) * 2018-03-30 2019-10-03 平安科技(深圳)有限公司 Micro-expression-based risk identification method and device, equipment and medium
US20200272806A1 (en) * 2019-02-22 2020-08-27 Image Metrics, Ltd. Real-Time Tracking of Facial Features in Unconstrained Video
CN110991348A (en) * 2019-12-05 2020-04-10 河北工业大学 Face micro-expression detection method based on optical flow gradient amplitude characteristics
CN111210415A (en) * 2020-01-06 2020-05-29 浙江大学 Method for detecting facial expression coma of Parkinson patient
CN112541422A (en) * 2020-12-08 2021-03-23 北京科技大学 Expression recognition method and device with robust illumination and head posture and storage medium
CN112861809A (en) * 2021-03-22 2021-05-28 南京大学 Classroom new line detection system based on multi-target video analysis and working method thereof
CN113158978A (en) * 2021-05-14 2021-07-23 无锡锡商银行股份有限公司 Risk early warning method for micro-expression recognition in video auditing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
姚丽莎; 张军委; 房波; 张绍雷; 周欢; 赵凤: "Design and implementation of a facial expression recognition system based on LBP and SVM", Journal of Guizhou Normal University (Natural Sciences), no. 01, 15 January 2020 (2020-01-15), pages 69-78 *
潘仙张; 张石清; 郭文平: "Multi-modal deep convolutional neural networks for video expression recognition", Optics and Precision Engineering, no. 04, 15 April 2019 (2019-04-15), pages 230-237 *
黄俊; 张娜娜; 章惠: "Interactive liveness detection combining head pose and facial expression", Journal of Computer Applications, no. 07, 31 December 2020 (2020-12-31), pages 233-239 *

Also Published As

Publication number Publication date
CN113901915B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
CN108921100B (en) Face recognition method and system based on visible light image and infrared image fusion
CN109815826B (en) Method and device for generating face attribute model
JP4318465B2 (en) Person detection device and person detection method
CN103942577B (en) Based on the personal identification method for establishing sample database and composite character certainly in video monitoring
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
CN109101865A (en) A kind of recognition methods again of the pedestrian based on deep learning
CN110674785A (en) Multi-person posture analysis method based on human body key point tracking
JP2019121374A (en) Facial expression recognition method, object recognition method, facial expression recognition apparatus, facial expression training method
MX2012010602A (en) Face recognizing apparatus, and face recognizing method.
CN101710383A (en) Method and device for identity authentication
CN110532850B (en) Fall detection method based on video joint points and hybrid classifier
Morade et al. A novel lip reading algorithm by using localized ACM and HMM: Tested for digit recognition
CN112989889B (en) Gait recognition method based on gesture guidance
CN110956141B (en) Human body continuous action rapid analysis method based on local recognition
CN113869276B (en) Lie recognition method and system based on micro-expression
Zhao et al. Head movement recognition based on Lucas-Kanade algorithm
CN110287829A (en) A kind of video face identification method of combination depth Q study and attention model
CN111860117A (en) Human behavior recognition method based on deep learning
JP2005351814A (en) Detector and detecting method
CN113902774B (en) Facial expression detection method of thick and dense optical flow characteristics in video
CN110675312B (en) Image data processing method, device, computer equipment and storage medium
CN112560618A (en) Behavior classification method based on skeleton and video feature fusion
CN112488165A (en) Infrared pedestrian identification method and system based on deep learning model
CN113901915A (en) Expression detection method for light-weight network and Magface in video
CN116030516A (en) Micro-expression recognition method and device based on multi-task learning and global circular convolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant