CN113901915A - Expression detection method for light-weight network and Magface in video - Google Patents

Expression detection method for light-weight network and Magface in video

Info

Publication number
CN113901915A
CN113901915A
Authority
CN
China
Prior art keywords
face
detected
frame
image
facial expression
Prior art date
Legal status
Granted
Application number
CN202111172275.XA
Other languages
Chinese (zh)
Other versions
CN113901915B (en)
Inventor
杨赛
顾全林
曹攀
Current Assignee
Wuxi Xishang Bank Co ltd
Original Assignee
Wuxi Xishang Bank Co ltd
Priority date
Filing date
Publication date
Application filed by Wuxi Xishang Bank Co ltd filed Critical Wuxi Xishang Bank Co ltd
Priority to CN202111172275.XA priority Critical patent/CN113901915B/en
Publication of CN113901915A publication Critical patent/CN113901915A/en
Application granted granted Critical
Publication of CN113901915B publication Critical patent/CN113901915B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F 18/24133 Distances to prototypes
    • G06F 18/24137 Distances to cluster centroïds
    • G06F 18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention relates to the technical field of expression detection, and in particular discloses a method for detecting expressions in video using a light-weight network and MagFace, which comprises the following steps: extracting multiple frames of face images to be detected from a face video, processing them to obtain corrected face images, and detecting 68 facial key points on the corrected images; computing the horizontal and vertical optical-flow components between the corrected i-th frame and (i+k)-th frame and converting them into a BGR-space image; extracting the eyebrow and mouth region images from the BGR-space image according to the 68 facial key points and combining them into a target image; and constructing a facial expression recognition model from the training target images and their corresponding labels, then inputting the test target images into the model to obtain the facial expression recognition result. The method can detect micro/macro expressions in video.

Description

Expression detection method for light-weight network and Magface in video
Technical Field
The invention relates to the technical field of expression detection, in particular to a method for detecting expressions in video using a light-weight network and MagFace.
Background
Facial expressions are the most direct visual expression of emotion; they reflect a person's subjective attitude toward external objects and the corresponding behavioral responses. Facial expressions are divided into macro-expressions and micro-expressions. A micro-expression is a spontaneous expression whose duration is only about 1/25 s to 1/5 s and whose motion amplitude is very small. In some scenarios, micro-expressions can convey more, and more trustworthy, information than body language or speech. A macro-expression lasts 0.5 to 4.0 seconds and has a higher intensity. Macro- and micro-expressions may occur together or separately, and the two are distinguished by their relative durations and intensities. Detecting micro- and macro-expressions becomes especially challenging when they are interleaved in long videos.
In general, a facial expression goes through three distinct phases: onset, apex, and offset. The task of expression detection is to locate the onset and offset frames of each micro/macro expression.
Prior expression detection research has established a basic technical framework: first, face alignment is performed via facial key point detection, the eye regions are masked according to the key points, and regions of interest with large expression-related motion (e.g., eyebrows and mouth) are segmented; dense optical flow features are then extracted from the regions of interest and converted into representations such as optical strain images or 3D gradient histograms (3D-HOG) to serve as model input; a neural-network macro/micro expression recognition model is trained with the label of the current frame sample as supervision; finally, a post-processing pipeline over expression units locates the positions of macro/micro expressions in the long video.
Owing to the characteristics of expressive motion, the intensity of an action unit decreases as it approaches the onset or offset frame. Previous work has focused on the imbalance between positive and negative samples, building complex models and mitigating the problem with more robust facial features. However, this does not improve the discriminability between positive and negative samples during model training.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and provides a method for detecting expressions in video using a light-weight network and MagFace, so as to solve the problem in the related art that the discriminability between positive and negative samples is not improved during model training.
The invention provides a method for detecting expressions in video using a light-weight network and MagFace, which comprises the following steps:
step S1: acquiring a plurality of face videos to be detected, wherein each face video to be detected comprises multiple frames of face images to be detected, denoted S = {src_1, ..., src_i, ..., src_N}, where src_i is the i-th frame of face image to be detected in the face video to be detected;
step S2: extracting the multiple frames of face images to be detected from each face video to be detected and processing them respectively to obtain a corrected face image set S' = {src'_1, ..., src'_i, ..., src'_N}, where src'_i is the corrected i-th frame of face image to be detected;
step S3: performing face key point detection on the corrected multi-frame face images to be detected, respectively, to obtain 68 face key points;
step S4: calculating the horizontal component u and the vertical component v between the corrected i-th frame face image to be detected src'_i and the (i+k)-th frame face image src'_(i+k), and converting the horizontal component u and the vertical component v into a BGR-space image img_i, wherein the label of the BGR-space image img_i is assigned by a dynamic threshold;
step S5: extracting the eyebrow region image and the mouth region image of the BGR-space image img_i according to the 68 face key points, and processing the eyebrow region image and the mouth region image respectively to obtain a final target image img_i^roi;
Step S6: dividing a plurality of final target images of the face video to be detected into a plurality of training target images and a plurality of testing target images, constructing a facial expression recognition model according to the training target images and labels corresponding to the training target images, and inputting the testing target images into the facial expression recognition model to obtain a facial expression recognition result of the face video to be detected;
the facial expression recognition model is a deep convolutional neural network model, and the deep convolutional neural network model comprises a skeleton network, a neck network and a head network;
the skeletal network comprises a series of convolutional and pooling layers;
the neck network comprises two parts, a first part comprising a Conv2d convolutional layer, a BN layer and a ReLU6 layer, and a second part comprising a Dropout layer and a Linear layer;
the head network comprises two parts, a first part comprising a loss function consisting of Cross Entropy and MagFace, and a second part comprising a classification layer.
Further, the steps S2 and S3 further include:
performing frame-by-frame face detection with FaceX-Zoo on the multiple frames of face images to be detected in the face video S = {src_1, ..., src_i, ..., src_N}, to obtain a face coordinate frame set bbox = {bbox_1, ..., bbox_i, ..., bbox_N} and a 5-point face key point set lmk = {lmk_1, ..., lmk_i, ..., lmk_N}; wherein src_i is the i-th frame of face image to be detected in the face video to be detected, bbox_i is the face coordinate frame of the i-th frame, lmk_i is the set of 5 face key points of the i-th frame, and N is the total number of frames of face images in the face video to be detected;
performing face alignment on the multiple frames of face images to be detected according to the 5-point face key points lmk_i, to obtain a corrected face image set S' = {src'_1, ..., src'_i, ..., src'_N} and a corrected face coordinate frame set bbox' = {bbox'_1, ..., bbox'_i, ..., bbox'_N}, where src'_i is the corrected i-th frame of face image to be detected, src'_i ∈ 112×112×3, and bbox'_i is the face coordinate frame of the corrected i-th frame of face image to be detected;
performing face key point detection with FaceX-Zoo on the face image region within the corrected face coordinate frame bbox'_i, to obtain a 68-point face key point set lmk' = {lmk'_1, ..., lmk'_i, ..., lmk'_N}, where lmk'_i is the 68 face key points within the face coordinate frame bbox'_i of the corrected i-th frame of face image to be detected.
Further, the step S4 further includes:
the value of k is taken as half of the average length of the facial expressions in the face video data set to be detected, and is calculated as shown in (1):
k = (1/(2M)) * Σ_{j=1}^{M} (F_j·offset - F_j·onset)   (1)
wherein M is the total number of videos in the face video data set to be detected, and [F_j·onset, F_j·offset] is the interval between the onset frame and the offset frame of the facial expression in the j-th face video to be detected;
the label of the BGR-space image img_i is assigned according to the intersection-over-union IOU_i and a dynamic threshold T_IOU: when the intersection-over-union IOU_i is greater than the dynamic threshold T_IOU, the BGR-space image img_i belongs to a facial expression region and its label is defined as label_i = 1; otherwise label_i = 0;
wherein the intersection-over-union IOU_i is computed, for the j-th face video to be detected, between the facial expression interval [F_j·onset, F_j·offset] and the interval [F_i, F_(i+k)] determined by the value of k, as shown in (2), and the dynamic threshold T_IOU is computed as shown in (3):
IOU_i = |[F_i, F_(i+k)] ∩ [F_j·onset, F_j·offset]| / |[F_i, F_(i+k)] ∪ [F_j·onset, F_j·offset]|   (2)
in the formula, F_j·onset is the onset frame of the facial expression in the j-th face video to be detected, F_j·offset is the offset frame of the facial expression in the j-th face video to be detected, and [F_i, F_(i+k)] is the interval between the i-th frame and the (i+k)-th frame corresponding to the BGR-space image img_i in the j-th face video to be detected.
Further, the step S5 further includes:
the eyebrow region image ROI_1 is the region image cropped by a coordinate frame computed from the corresponding face key points and expanded outward, as shown in (4); in the formula, ω_1 is the number of expanded pixels;
the mouth region image ROI_2 is the region image cropped by a coordinate frame computed from the corresponding face key points and expanded outward, as shown in (5); in the formula, ω_2 is the number of expanded pixels;
the images of the two regions are respectively normalized to H × W and then combined to obtain the final target image img_i^roi, where H and W are the normalized height and width, respectively.
Further, in the step S6, the plurality of training target images constitute a training set IMG_train, the plurality of test target images constitute a test set IMG_test, and the method further comprises the following steps:
step S61: inputting the test target images of the j-th test set IMG_test·j into the facial expression recognition model to obtain, for each test target image of the j-th test set IMG_test·j, the predicted label label'_j ∈ {0, 1} and the confidence value_j ∈ [0, 1]; calculating the facial expression score s_i·j of the i-th frame of test target image of the j-th test set IMG_test·j according to formula (6); the facial expression scores of the N frames of test target images in the j-th test set IMG_test·j then form the set S_j = {s_0·j, ..., s_i·j, ..., s_N·j};
s_i·j = value_i·j * label'_i·j   (6)
step S62: smoothing the facial expression score set S_j of the N frames of test target images in the j-th test set IMG_test·j with Savitzky-Golay convolution smoothing into a continuous curve S'_j;
step S63: adopting a dynamic threshold T for the curve S'_j, wherein the dynamic threshold T is calculated as shown in (7), S_mean is the mean of the facial expression scores in the score set S_j, S_max is the maximum of the facial expression scores in the score set S_j, and η is a weight coefficient;
T = S_mean + η * (S_max - S_mean)   (7)
step S64: searching the curve S'_j for peaks with the threshold T_j and the nearest-neighbor distance 2k as constraints: a peak is a target facial expression peak if and only if its value on the curve S'_j is greater than the threshold T_j and its distance to adjacent peak points is greater than 2k; for each peak p_i in the target peak set satisfying these constraints, the peak interval [p_i - k, p_i + k] is computed, p_i being the peak of the i-th target facial expression, yielding the final predicted facial expression label interval set {[p_0 - k, p_0 + k], ..., [p_n - k, p_n + k]};
step S65: when the overlap IOU between a predicted interval in the facial expression label interval set {[p_0 - k, p_0 + k], ..., [p_n - k, p_n + k]} and a real interval is greater than or equal to 0.5, the predicted interval in the facial expression label interval set is judged to be correct.
The method for detecting expressions in video using the light-weight network and MagFace has the following advantages: inspired by the performance measurement criteria of detection tasks, the labels of the data are assigned by means of an overlap (IOU) measure, which alleviates errors in the annotation; at the core, following the idea of MagFace for handling intra-class feature distribution and face image quality in face recognition, a general low-dimensional feature-embedding loss model, LGNMNet, is constructed for expression classification and recognition; in addition, the classification task is converted into a regression problem, and polynomial fitting is combined with peak detection to predict the likelihood that a video frame belongs to a macro- or micro-expression.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
Fig. 1 is a flowchart of a method for detecting expressions of a light-weight network and MagFace in a video according to the present invention.
Fig. 2 is a flowchart for processing a face image to be detected according to the present invention.
Fig. 3 is a schematic structural diagram of a facial expression recognition model provided by the present invention.
Fig. 4 is a schematic diagram of a facial expression recognition result provided by the present invention.
Fig. 5 is a schematic diagram of 68 key points of a human face according to the present invention.
Detailed Description
To further illustrate the technical means and effects of the present invention adopted to achieve the predetermined objects, the following detailed description will be given to the embodiments, structures, features and effects of the method for detecting expressions in videos of a lightweight network and MagFace according to the present invention with reference to the accompanying drawings and preferred embodiments. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without any inventive step, are within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged under appropriate circumstances in order to facilitate the description of the embodiments of the invention herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In this embodiment, a method for detecting expressions of a lightweight network and a MagFace in a video is provided, and as shown in fig. 1, the method for detecting expressions of a lightweight network and a MagFace in a video includes:
step S1: acquiring a plurality of face videos to be detected, wherein each face video to be detected comprises multiple frames of face images to be detected, denoted S = {src_1, ..., src_i, ..., src_N}, where src_i is the i-th frame of face image to be detected in the face video to be detected;
step S2: extracting the multiple frames of face images to be detected from each face video to be detected and processing them respectively to obtain a corrected face image set S' = {src'_1, ..., src'_i, ..., src'_N}, where src'_i is the corrected i-th frame of face image to be detected;
step S3: respectively carrying out face key point detection on the corrected multi-frame face images to be detected to obtain 68 face key points;
it should be noted that the purpose of extracting 68 key points of the face is that the key points are distributed in key areas (eyebrows, eyes, mouth, etc.) of facial expressions, as shown in fig. 5;
step S4: calculating, with TV-L1 optical flow, the horizontal component u and the vertical component v between the corrected i-th frame face image to be detected src'_i and the (i+k)-th frame face image src'_(i+k), and converting the horizontal component u and the vertical component v into a BGR-space image img_i, wherein the label of the BGR-space image img_i is assigned by a dynamic threshold;
it should be noted that, in step S4, the corrected face images to be detected of the i-th frame and the (i+k)-th frame are selected, where k is half of the average length of the facial expressions (see the optical-flow sketch below);
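One possible realization of this step uses the TV-L1 implementation from the opencv-contrib Python package and an HSV encoding of the flow to obtain the BGR image; both the library choice and the HSV-based conversion below are assumptions for illustration, not the exact procedure fixed by the invention.

import cv2
import numpy as np

def flow_to_bgr(prev_bgr, next_bgr):
    # Compute TV-L1 dense optical flow between two aligned face frames and
    # convert its (u, v) components into a BGR image: hue encodes direction,
    # value encodes normalized magnitude.
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()   # requires opencv-contrib
    flow = tvl1.calc(prev_gray, next_gray, None)      # H x W x 2 array of (u, v)
    u, v = flow[..., 0], flow[..., 1]
    mag, ang = cv2.cartToPolar(u, v, angleInDegrees=True)
    hsv = np.zeros((*prev_gray.shape, 3), dtype=np.uint8)
    hsv[..., 0] = (ang / 2).astype(np.uint8)          # OpenCV hue range is [0, 180)
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)       # img_i in the text

# usage: img_i = flow_to_bgr(src_corrected[i], src_corrected[i + k])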
step S5: extracting the eyebrow region image and the mouth region image of the BGR-space image img_i according to the 68 face key points, and processing the eyebrow region image and the mouth region image respectively to obtain the final target image img_i^roi;
Step S6: processing videos in batches, dividing a plurality of final target images of the face video to be detected into a plurality of training target images and a plurality of testing target images according to leave-one-out cross validation, constructing a facial expression recognition model based on LGNMNet according to the training target images and corresponding dynamic labels thereof, and inputting the testing target images into the facial expression recognition model to obtain a facial expression recognition result;
the facial expression recognition model is a deep convolutional neural network model, as shown in fig. 3, the deep convolutional neural network model includes a skeleton network, a neck network and a head network;
the skeleton network is a MobileNet_v2 network comprising a series of convolutional layers and pooling layers;
the neck network comprises two parts, a first part comprising a Conv2d convolutional layer, a BN layer and a ReLU6 layer, and a second part comprising a Dropout layer and a Linear layer;
the head network comprises two parts, a first part comprising a loss function consisting of Cross Entropy and MagFace, and a second part comprising a classification layer.
In the embodiment of the present invention, as shown in fig. 3, the deep convolutional neural network model specifically includes:
(1) first, a forward-propagation deep convolutional neural network model is constructed; the input of the network model is the preprocessed training target images and their corresponding dynamic labels, and the overall network structure is divided into a skeleton network, a neck network and a head network;
(2) the skeleton network comprises a series of convolutional layers and pooling layers and is obtained from a convolutional neural network by removing its classification layer; the first part of the neck network consists of a Conv2d convolutional layer, a BN layer and a ReLU6 layer, whose output is normalized; the second part of the neck network adds a Dropout layer to mitigate overfitting and a Linear layer to remove redundant information;
(3) the head network covers two tasks: a loss function consisting of Cross Entropy and MagFace computes the error between the feature vector output by the neck network and the ground-truth classes and optimizes the model to learn the features; the magnitude of the feature embedding in MagFace represents the intensity of a motion unit, while an adaptive margin pulls easy samples closer to the class center and pushes hard samples further away, yielding a better intra-class feature distribution, avoiding model overfitting and improving generalization; the other part is output by MagFace to the classification layer, and the predicted label and confidence of the data are obtained through Softmax (a simplified sketch follows).
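The following minimal PyTorch code illustrates such a skeleton/neck/head arrangement; it assumes a torchvision MobileNetV2 backbone, and the layer widths, the 128-dimensional embedding and the plain cosine head (a stand-in for the full MagFace margin) are illustrative assumptions rather than the exact configuration of the invention.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v2

class LGNMNetSketch(nn.Module):
    # Skeleton (MobileNetV2 features) + neck (Conv2d/BN/ReLU6, then Dropout/Linear)
    # + head (cosine classifier whose logits can be fed to a MagFace-style loss).
    def __init__(self, num_classes=2, embed_dim=128):
        super().__init__()
        self.skeleton = mobilenet_v2().features                  # conv + pooling stages
        self.neck_conv = nn.Sequential(
            nn.Conv2d(1280, 512, kernel_size=1),
            nn.BatchNorm2d(512),
            nn.ReLU6(inplace=True),
        )
        self.neck_fc = nn.Sequential(nn.Dropout(p=0.5), nn.Linear(512, embed_dim))
        # class-center weights for a margin-based (MagFace-like) softmax head
        self.centers = nn.Parameter(torch.randn(num_classes, embed_dim))

    def forward(self, x):
        feat = self.skeleton(x)                        # B x 1280 x h x w
        feat = self.neck_conv(feat).mean(dim=(2, 3))   # global average pooling
        emb = self.neck_fc(feat)                       # low-dimensional embedding
        # cosine logits; a full MagFace loss would add a magnitude-aware margin here
        logits = F.normalize(emb) @ F.normalize(self.centers).t()
        return emb, logits

# usage:
# emb, logits = LGNMNetSketch()(torch.randn(4, 3, 112, 112))
# loss = F.cross_entropy(logits * 64.0, torch.tensor([0, 1, 0, 1]))  # scaled cosine + CE stand-in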
Preferably, as shown in fig. 2, the steps S2 and S3 further include:
performing frame-by-frame face detection with FaceX-Zoo on the multiple frames of face images to be detected in the face video S = {src_1, ..., src_i, ..., src_N}, to obtain a face coordinate frame set bbox = {bbox_1, ..., bbox_i, ..., bbox_N} and a 5-point face key point set lmk = {lmk_1, ..., lmk_i, ..., lmk_N}; wherein src_i is the i-th frame of face image to be detected in the face video to be detected, bbox_i is the face coordinate frame of the i-th frame, lmk_i is the set of 5 face key points of the i-th frame, and N is the total number of frames of face images in the face video to be detected;
in order to reduce the influence of camera shake or face swing, all face images to be detected are normalized to the same scale and angle: face alignment is performed on the multiple frames of face images to be detected according to the 5-point face key points lmk_i and the images are normalized to 112 × 112, yielding a corrected face image set S' = {src'_1, ..., src'_i, ..., src'_N} and a corrected face coordinate frame set bbox' = {bbox'_1, ..., bbox'_i, ..., bbox'_N}, where src'_i is the corrected i-th frame of face image to be detected, src'_i ∈ 112×112×3, and bbox'_i is the face coordinate frame of the corrected i-th frame;
performing face key point detection with FaceX-Zoo on the face image region within the corrected face coordinate frame bbox'_i, to obtain a 68-point face key point set lmk' = {lmk'_1, ..., lmk'_i, ..., lmk'_N}, where lmk'_i is the 68 face key points within the face coordinate frame bbox'_i of the corrected i-th frame of face image to be detected (the alignment step is illustrated below).
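A minimal sketch of the 5-point alignment to 112 × 112 is given below; it assumes the face boxes and 5-point landmarks have already been produced by the detector (for example by FaceX-Zoo, whose API is not reproduced here), and the reference landmark template is the commonly used ArcFace-style 112 × 112 layout, which is an assumption rather than a template prescribed by the invention.

import cv2
import numpy as np

# Widely used 112x112 reference positions of the 5 landmarks
# (left eye, right eye, nose tip, left mouth corner, right mouth corner) -- assumed template.
REFERENCE_5PTS = np.float32([
    [38.2946, 51.6963], [73.5318, 51.5014], [56.0252, 71.7366],
    [41.5493, 92.3655], [70.7299, 92.2041],
])

def align_face_112(frame_bgr, lmk5):
    # Warp one frame so that its 5 landmarks match the reference template,
    # producing the corrected 112x112x3 face image src'_i.
    lmk5 = np.float32(lmk5).reshape(5, 2)
    # similarity transform: rotation + uniform scale + translation
    M, _ = cv2.estimateAffinePartial2D(lmk5, REFERENCE_5PTS, method=cv2.LMEDS)
    return cv2.warpAffine(frame_bgr, M, (112, 112), borderValue=0)

# usage: src_corrected_i = align_face_112(src_i, lmk_i)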
Preferably, in step S4, the method further includes:
the value of k is taken as half of the average length of the facial expressions in the face video data set to be detected, and is calculated as shown in (1):
k = (1/(2M)) * Σ_{j=1}^{M} (F_j·offset - F_j·onset)   (1)
wherein M is the total number of videos in the face video data set to be detected, and [F_j·onset, F_j·offset] is the interval between the onset frame and the offset frame of the facial expression in the j-th face video to be detected;
the label of the BGR-space image img_i is assigned according to the intersection-over-union IOU_i between the current facial expression interval [F_j·onset, F_j·offset] and the interval determined by k, and a dynamic threshold T_IOU: when the intersection-over-union IOU_i is greater than the dynamic threshold T_IOU, the BGR-space image img_i belongs to a facial expression (micro/macro expression) region and its label is defined as label_i = 1; otherwise label_i = 0;
wherein the intersection-over-union IOU_i is computed, for the j-th face video to be detected, between the facial expression interval [F_j·onset, F_j·offset] and the interval [F_i, F_(i+k)] determined by the value of k, as shown in (2), and the dynamic threshold T_IOU is computed as shown in (3):
IOU_i = |[F_i, F_(i+k)] ∩ [F_j·onset, F_j·offset]| / |[F_i, F_(i+k)] ∪ [F_j·onset, F_j·offset]|   (2)
in the formula, F_j·onset is the onset frame of the facial expression in the j-th face video to be detected, F_j·offset is the offset frame of the facial expression in the j-th face video to be detected, and [F_i, F_(i+k)] is the interval between the i-th frame and the (i+k)-th frame corresponding to the BGR-space image img_i in the j-th face video to be detected.
It should be understood that a sample could simply be judged positive whenever the intersection-over-union IOU_i satisfies a fixed threshold, as in equation (2). However, expression action units differ in length, and a fixed threshold is unfavorable for the current interval [F_j·onset, F_j·offset] when its length is far larger than k; therefore the dynamic threshold T_IOU is adopted and the label is defined as shown in equation (3), with labels assigned as sketched below.
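A small sketch of this IOU-based labeling follows; because formula (3) is reproduced only as an image in the source text, the dynamic threshold is passed in as an argument with a fixed placeholder default (an assumption), while the interval IOU follows formula (2).

def interval_iou(a_start, a_end, b_start, b_end):
    # Intersection-over-union of the frame intervals [a_start, a_end] and [b_start, b_end].
    inter = max(0, min(a_end, b_end) - max(a_start, b_start))
    union = max(a_end, b_end) - min(a_start, b_start)
    return inter / union if union > 0 else 0.0

def label_frames(num_frames, k, expressions, iou_threshold=0.5):
    # Assign label_i to each optical-flow image img_i built from frames i and i+k.
    # expressions: list of (onset, offset) ground-truth intervals of one video.
    # iou_threshold: placeholder standing in for the dynamic threshold T_IOU of formula (3).
    labels = []
    for i in range(num_frames - k):
        iou = max((interval_iou(i, i + k, onset, offset)
                   for onset, offset in expressions), default=0.0)
        labels.append(1 if iou > iou_threshold else 0)
    return labels

# usage (k per formula (1), i.e. about half the mean expression length of the data set):
# labels = label_frames(num_frames=len(video_frames), k=6, expressions=[(120, 151)])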
It should be noted that the facial expression includes a micro expression or a macro expression, and the facial expression recognition model includes a micro expression recognition model or a macro expression recognition model.
Preferably, as shown in fig. 2, the step S5 further includes:
the eyebrow region image ROI_1 is the region image cropped by a coordinate frame computed from the corresponding face key points and expanded outward, as shown in (4); in the formula, ω_1 is the number of expanded pixels;
the mouth region image ROI_2 is the region image cropped by a coordinate frame computed from the corresponding face key points and expanded outward, as shown in (5); in the formula, ω_2 is the number of expanded pixels;
the images of the two regions are respectively normalized to H × W and then combined to obtain the final target image img_i^roi, where H and W are the normalized height and width, respectively; here H = 112 and W = 112 are used.
In the embodiment of the invention, the facial motion units of micro/macro expressions are mainly distributed in the eyebrow, mouth and eye-corner regions, so three ROI images are extracted from these parts. To reduce the effect of global motion, mean and standard deviation processing is applied to the pixels of the nasal bridge region. Meanwhile, because optical-flow features are highly sensitive to blinking, the left and right eye regions are masked with a polygon whose boundary is expanded by 6 pixels. The regions are specifically: (i) left eye and left eyebrow; (ii) right eye and right eyebrow; (iii) mouth. The images of regions (i) and (ii) are normalized to 56 × 56 each and stitched into a 56 × 112 image, the image of region (iii) is normalized to 56 × 112 (height × width), and the combination yields a new 112 × 112 (height × width) image that retains the dominant facial motion units (a cropping sketch follows).
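The cropping and stitching just described can be sketched as follows; the 68-point index ranges (eyebrows 17-26, eyes 36-47, mouth 48-67) follow the common iBUG/dlib numbering, and the padding values are illustrative assumptions.

import cv2
import numpy as np

def crop_box(img, points, pad):
    # Crop a rectangle around the given landmark points, expanded by `pad` pixels.
    x0, y0 = np.min(points, axis=0).astype(int) - pad
    x1, y1 = np.max(points, axis=0).astype(int) + pad
    h, w = img.shape[:2]
    return img[max(y0, 0):min(y1, h), max(x0, 0):min(x1, w)]

def build_target_image(flow_bgr, lmk68, pad_brow=12, pad_mouth=12):
    # Assemble the 112x112 target image img_i^roi from the optical-flow image:
    # (i) left eye + eyebrow and (ii) right eye + eyebrow -> 56x56 each, stitched
    # side by side; (iii) mouth -> 56x112; the two rows are stacked vertically.
    lmk68 = np.asarray(lmk68, dtype=np.float32)
    left = crop_box(flow_bgr, np.vstack([lmk68[17:22], lmk68[36:42]]), pad_brow)
    right = crop_box(flow_bgr, np.vstack([lmk68[22:27], lmk68[42:48]]), pad_brow)
    mouth = crop_box(flow_bgr, lmk68[48:68], pad_mouth)
    top = np.hstack([cv2.resize(left, (56, 56)), cv2.resize(right, (56, 56))])
    bottom = cv2.resize(mouth, (112, 56))              # cv2.resize takes (width, height)
    return np.vstack([top, bottom])                    # 112 x 112 x 3

# usage: img_roi = build_target_image(img_i, lmk68_i)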
Preferably, the present embodiment provides a post-processing flow for expression detection (micro-expressions are taken as an example). As shown in fig. 4, in the step S6 the plurality of training target images form a training set IMG_train, the plurality of test target images constitute a test set IMG_test, and the method further comprises the following steps.
To show the detection effect of the method on micro/macro expressions in long videos, a long-video detection example is constructed. Because the data set sample SAMM-Long-video (034_7) simultaneously contains 2 macro-expression intervals and 2 micro-expression intervals, it allows the macro/micro expression detection tasks to be evaluated visually at the same time and is therefore used as the reference; the two curves in fig. 4 respectively represent the score-set curves of the micro/macro expressions of the N frames of test target images in the j-th test set IMG_test·j;
step S61: inputting the test target images of the j-th test set IMG_test·j into the facial expression recognition model to obtain, for each test target image of the j-th test set IMG_test·j, the predicted label label'_j ∈ {0, 1} and the confidence value_j ∈ [0, 1]; calculating the facial expression score s_i·j of the i-th frame of test target image of the j-th test set IMG_test·j according to formula (6); the facial expression scores of the N frames of test target images in the j-th test set IMG_test·j then form the set S_j = {s_0·j, ..., s_i·j, ..., s_N·j};
s_i·j = value_i·j * label'_i·j   (6)
step S62: because the facial expression score set S_j tends to be discrete, and in order to eliminate errors caused by model classification, the facial expression score set S_j of the N frames of test target images in the j-th test set IMG_test·j is smoothed with Savitzky-Golay convolution smoothing into a continuous curve S'_j;
step S63: the two horizontal lines in fig. 4 are the dynamic thresholds for micro/macro expressions; because the facial expression score sets of different videos differ greatly, a dynamic threshold T is adopted for the curve S'_j; the dynamic threshold T is calculated as shown in (7), where S_mean is the mean of the facial expression scores in the score set S_j, S_max is the maximum of the facial expression scores in the score set S_j, and η is a weight coefficient;
T = S_mean + η * (S_max - S_mean)   (7)
step S64: searching the curve S'_j for peaks with scipy.signal.find_peaks, using the threshold T_j and the nearest-neighbor distance 2k as constraints: a peak is a target facial expression peak if and only if its value on the curve S'_j is greater than the threshold T_j and its distance to adjacent peak points is greater than 2k; for each peak p_i in the target peak set satisfying these constraints, the peak interval [p_i - k, p_i + k] is computed, p_i being the peak of the i-th target facial expression, yielding the final predicted facial expression label interval set {[p_0 - k, p_0 + k], ..., [p_n - k, p_n + k]};
step S65: when the overlap IOU between a predicted interval in the facial expression label interval set {[p_0 - k, p_0 + k], ..., [p_n - k, p_n + k]} and a real interval is greater than or equal to 0.5, the predicted interval in the facial expression label interval set is judged to be correct; the balanced average F1-score is taken as the evaluation index, and the weight coefficient η is tuned by grid search. A compact sketch of steps S62 to S65 is given below.
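The sketch uses scipy's savgol_filter and find_peaks; the smoothing window, polynomial order and η value are illustrative assumptions.

import numpy as np
from scipy.signal import savgol_filter, find_peaks

def spot_intervals(scores, k, eta=0.6, window=11, polyorder=3):
    # Steps S62-S64: smooth the per-frame scores s_i, build the dynamic threshold
    # T = S_mean + eta * (S_max - S_mean), and return the intervals [p - k, p + k]
    # around the peaks that clear the threshold and the 2k spacing constraint.
    scores = np.asarray(scores, dtype=float)
    curve = savgol_filter(scores, window, polyorder)
    threshold = scores.mean() + eta * (scores.max() - scores.mean())
    peaks, _ = find_peaks(curve, height=threshold, distance=2 * k)
    return [(p - k, p + k) for p in peaks]

def interval_iou(a, b):
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def is_correct(predicted, ground_truth):
    # Step S65: a predicted interval counts as correct if it overlaps some
    # ground-truth interval with IOU >= 0.5.
    return any(interval_iou(predicted, gt) >= 0.5 for gt in ground_truth)

# usage:
# intervals = spot_intervals(scores=s_j, k=6)
# hits = [is_correct(p, [(120, 151), (300, 340)]) for p in intervals]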
It should be noted that the lower part of fig. 4 shows the predicted intervals (Predicted) and the real intervals (Ground Truth) of the expression units. The proposed LGNMNet method produces multiple peaks when detecting expressions in the long video, and most of the peaks appear near the expression units, showing that the model can effectively identify expression frames. Second, several expression unit intervals are found in the long video; the detection results are (Precision = 100%, Recall = 100%) for macro-expressions and (Precision = 100%, Recall = 66.67%) for micro-expressions, with 1 false micro-expression detection occurring inside a macro-expression region, possibly caused by the influence of the macro-expression. In addition, the score curves of the macro/micro expressions both have multiple peaks, many of which lie outside the expression intervals; these falsely identified peaks are effectively removed after the post-processing method is applied, which demonstrates that the post-processing is effective in eliminating false peaks.
The invention provides a method for detecting expressions in video using a light-weight network and MagFace, and particularly relates to detecting micro/macro expressions in long videos based on dense optical flow features combined with a Lite General Network And MagFace CNN (LGNMNet) model. Inspired by the performance measurement criteria of detection tasks, the data labels are assigned by means of an IOU measure to alleviate errors in the annotation; meanwhile, the classification task is converted into a regression problem, and polynomial fitting is combined to predict the probability that a frame belongs to a macro- or micro-expression. At its core, following the idea of MagFace for handling intra-class feature distribution and face image quality in face recognition, a general low-dimensional feature-embedding loss model, LGNMNet, is constructed; the weight of the face features in the peak region increases monotonically, easy samples are pulled toward the class center and hard samples are pushed away, preventing the model from overfitting on noisy, low-quality samples.
The expression detection method of the light-weight network and the Magface in the video provided by the invention has the following advantages:
(1) a dynamic-threshold label assignment method is used to reduce annotation errors for samples near the onset and offset;
(2) LGNMNet is provided: low-dimensional features extracted by a lightweight model are embedded into a MagFace loss, improving the learning of hard samples and alleviating model overfitting on low-quality samples;
(3) the classification problem is converted into a regression problem by combining the predicted label and the confidence as the test result, so that the results are more separable and expression localization is facilitated.
Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (5)

1. A method for detecting expressions of a light-weight network and Magface in a video is characterized by comprising the following steps:
step S1: acquiring a plurality of face videos to be detected, wherein each face video to be detected comprises multiple frames of face images to be detected, denoted S = {src_1, ..., src_i, ..., src_N}, where src_i is the i-th frame of face image to be detected in the face video to be detected;
step S2: extracting the multiple frames of face images to be detected from each face video to be detected and processing them respectively to obtain a corrected face image set S' = {src'_1, ..., src'_i, ..., src'_N}, where src'_i is the corrected i-th frame of face image to be detected;
step S3: respectively carrying out face key point detection on the corrected multi-frame face images to be detected to obtain 68 face key points;
step S4: calculating the horizontal component u and the vertical component v between the corrected i-th frame face image to be detected src'_i and the (i+k)-th frame face image src'_(i+k), and converting the horizontal component u and the vertical component v into a BGR-space image img_i, wherein the label of the BGR-space image img_i is assigned by a dynamic threshold;
step S5: extracting the eyebrow region image and the mouth region image of the BGR-space image img_i according to the 68 face key points, and processing the eyebrow region image and the mouth region image respectively to obtain a final target image img_i^roi;
Step S6: dividing a plurality of final target images of the face video to be detected into a plurality of training target images and a plurality of testing target images, constructing a facial expression recognition model according to the training target images and labels corresponding to the training target images, and inputting the testing target images into the facial expression recognition model to obtain a facial expression recognition result of the face video to be detected;
the facial expression recognition model is a deep convolutional neural network model, and the deep convolutional neural network model comprises a skeleton network, a neck network and a head network;
the skeletal network comprises a series of convolutional and pooling layers;
the neck network comprises two parts, a first part comprising a Conv2d convolutional layer, a BN layer and a ReLU6 layer, and a second part comprising a Dropout layer and a Linear layer;
the head network comprises two parts, a first part comprising a loss function consisting of Cross Entropy and MagFace, and a second part comprising a classification layer.
2. The method for detecting expressions in videos by a lightweight network and Magface according to claim 1, wherein the steps S2 and S3 further include:
performing frame-by-frame face detection with FaceX-Zoo on the multiple frames of face images to be detected in the face video S = {src_1, ..., src_i, ..., src_N}, to obtain a face coordinate frame set bbox = {bbox_1, ..., bbox_i, ..., bbox_N} and a 5-point face key point set lmk = {lmk_1, ..., lmk_i, ..., lmk_N}; wherein src_i is the i-th frame of face image to be detected in the face video to be detected, bbox_i is the face coordinate frame of the i-th frame, lmk_i is the set of 5 face key points of the i-th frame, and N is the total number of frames of face images in the face video to be detected;
performing face alignment on the multiple frames of face images to be detected according to the 5-point face key points lmk_i, to obtain a corrected face image set S' = {src'_1, ..., src'_i, ..., src'_N} and a corrected face coordinate frame set bbox' = {bbox'_1, ..., bbox'_i, ..., bbox'_N}, where src'_i is the corrected i-th frame of face image to be detected, src'_i ∈ 112×112×3, and bbox'_i is the face coordinate frame of the corrected i-th frame of face image to be detected;
performing face key point detection with FaceX-Zoo on the face image region within the corrected face coordinate frame bbox'_i, to obtain a 68-point face key point set lmk' = {lmk'_1, ..., lmk'_i, ..., lmk'_N}, where lmk'_i is the 68 face key points within the face coordinate frame bbox'_i of the corrected i-th frame of face image to be detected.
3. The method for detecting expressions in videos by using a lightweight network and MagFace according to claim 2, wherein the step S4 further includes:
the value of k is taken as half of the average length of the facial expressions in the face video data set to be detected, and is calculated as shown in (1):
k = (1/(2M)) * Σ_{j=1}^{M} (F_j·offset - F_j·onset)   (1)
wherein M is the total number of videos in the face video data set to be detected, and [F_j·onset, F_j·offset] is the interval between the onset frame and the offset frame of the facial expression in the j-th face video to be detected;
the label of the BGR-space image img_i is assigned according to the intersection-over-union IOU_i and a dynamic threshold T_IOU: when the intersection-over-union IOU_i is greater than the dynamic threshold T_IOU, the BGR-space image img_i belongs to a facial expression region and its label is defined as label_i = 1; otherwise label_i = 0;
wherein the intersection-over-union IOU_i is computed, for the j-th face video to be detected, between the facial expression interval [F_j·onset, F_j·offset] and the interval [F_i, F_(i+k)] determined by the value of k, as shown in (2), and the dynamic threshold T_IOU is computed as shown in (3):
IOU_i = |[F_i, F_(i+k)] ∩ [F_j·onset, F_j·offset]| / |[F_i, F_(i+k)] ∪ [F_j·onset, F_j·offset]|   (2)
in the formula, F_j·onset is the onset frame of the facial expression in the j-th face video to be detected, F_j·offset is the offset frame of the facial expression in the j-th face video to be detected, and [F_i, F_(i+k)] is the interval between the i-th frame and the (i+k)-th frame corresponding to the BGR-space image img_i in the j-th face video to be detected.
4. The method for detecting expressions in videos by using a lightweight network and MagFace according to claim 3, wherein the step S5 further includes:
the eyebrow region image ROI_1 is the region image cropped by a coordinate frame computed from the corresponding face key points and expanded outward, as shown in (4); in the formula, ω_1 is the number of expanded pixels;
the mouth region image ROI_2 is the region image cropped by a coordinate frame computed from the corresponding face key points and expanded outward, as shown in (5); in the formula, ω_2 is the number of expanded pixels;
the images of the two regions are respectively normalized to H × W and then combined to obtain the final target image img_i^roi, where H and W are the normalized height and width, respectively.
5. The method for detecting expressions in videos by using a lightweight network and Magface as claimed in claim 4, wherein in step S6 the plurality of training target images constitute a training set IMG_train, the plurality of test target images constitute a test set IMG_test, and the method further comprises the following steps:
step S61: inputting the test target images of the j-th test set IMG_test·j into the facial expression recognition model to obtain, for each test target image of the j-th test set IMG_test·j, the predicted label label'_j ∈ {0, 1} and the confidence value_j ∈ [0, 1]; calculating the facial expression score s_i·j of the i-th frame of test target image of the j-th test set IMG_test·j according to formula (6); the facial expression scores of the N frames of test target images in the j-th test set IMG_test·j then form the set S_j = {s_0·j, ..., s_i·j, ..., s_N·j};
s_i·j = value_i·j * label'_i·j   (6)
step S62: smoothing the facial expression score set S_j of the N frames of test target images in the j-th test set IMG_test·j with Savitzky-Golay convolution smoothing into a continuous curve S'_j;
step S63: adopting a dynamic threshold T for the curve S'_j, wherein the dynamic threshold T is calculated as shown in (7), S_mean is the mean of the facial expression scores in the score set S_j, S_max is the maximum of the facial expression scores in the score set S_j, and η is a weight coefficient;
T = S_mean + η * (S_max - S_mean)   (7)
step S64: searching the curve S'_j for peaks with the threshold T_j and the nearest-neighbor distance 2k as constraints: a peak is a target facial expression peak if and only if its value on the curve S'_j is greater than the threshold T_j and its distance to adjacent peak points is greater than 2k; for each peak p_i in the target peak set satisfying these constraints, the peak interval [p_i - k, p_i + k] is computed, p_i being the peak of the i-th target facial expression, yielding the final predicted facial expression label interval set {[p_0 - k, p_0 + k], ..., [p_n - k, p_n + k]};
step S65: when the overlap IOU between a predicted interval in the facial expression label interval set {[p_0 - k, p_0 + k], ..., [p_n - k, p_n + k]} and a real interval is greater than or equal to 0.5, the predicted interval in the facial expression label interval set is judged to be correct.
CN202111172275.XA 2021-10-08 2021-10-08 Expression detection method of light-weight network and MagFace in video Active CN113901915B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111172275.XA CN113901915B (en) 2021-10-08 2021-10-08 Expression detection method of light-weight network and MagFace in video


Publications (2)

Publication Number Publication Date
CN113901915A true CN113901915A (en) 2022-01-07
CN113901915B CN113901915B (en) 2024-04-02

Family

ID=79190411

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111172275.XA Active CN113901915B (en) 2021-10-08 2021-10-08 Expression detection method of light-weight network and MagFace in video

Country Status (1)

Country Link
CN (1) CN113901915B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019184125A1 (en) * 2018-03-30 2019-10-03 平安科技(深圳)有限公司 Micro-expression-based risk identification method and device, equipment and medium
US20200272806A1 (en) * 2019-02-22 2020-08-27 Image Metrics, Ltd. Real-Time Tracking of Facial Features in Unconstrained Video
CN110991348A (en) * 2019-12-05 2020-04-10 河北工业大学 Face micro-expression detection method based on optical flow gradient amplitude characteristics
CN111210415A (en) * 2020-01-06 2020-05-29 浙江大学 Method for detecting facial expression coma of Parkinson patient
CN112541422A (en) * 2020-12-08 2021-03-23 北京科技大学 Expression recognition method and device with robust illumination and head posture and storage medium
CN112861809A (en) * 2021-03-22 2021-05-28 南京大学 Classroom new line detection system based on multi-target video analysis and working method thereof
CN113158978A (en) * 2021-05-14 2021-07-23 无锡锡商银行股份有限公司 Risk early warning method for micro-expression recognition in video auditing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
姚丽莎; 张军委; 房波; 张绍雷; 周欢; 赵凤: "Design and implementation of a facial expression recognition system based on LBP and SVM", Journal of Guizhou Normal University (Natural Sciences), no. 01, 15 January 2020 (2020-01-15), pages 69-78 *
潘仙张; 张石清; 郭文平: "Multi-modal deep convolutional neural networks for video expression recognition", Optics and Precision Engineering, no. 04, 15 April 2019 (2019-04-15), pages 230-237 *
黄俊; 张娜娜; 章惠: "Interactive liveness detection combining head pose and facial expression", Journal of Computer Applications, no. 07, 31 December 2020 (2020-12-31), pages 233-239 *

Also Published As

Publication number Publication date
CN113901915B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
CN108921100B (en) Face recognition method and system based on visible light image and infrared image fusion
CN109815826B (en) Method and device for generating face attribute model
JP4318465B2 (en) Person detection device and person detection method
CN103942577B (en) Based on the personal identification method for establishing sample database and composite character certainly in video monitoring
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
CN109101865A (en) A kind of recognition methods again of the pedestrian based on deep learning
CN110674785A (en) Multi-person posture analysis method based on human body key point tracking
JP2019121374A (en) Facial expression recognition method, object recognition method, facial expression recognition apparatus, facial expression training method
MX2012010602A (en) Face recognizing apparatus, and face recognizing method.
CN101710383A (en) Method and device for identity authentication
CN110532850B (en) Fall detection method based on video joint points and hybrid classifier
Morade et al. A novel lip reading algorithm by using localized ACM and HMM: Tested for digit recognition
CN112989889B (en) Gait recognition method based on gesture guidance
CN110956141B (en) Human body continuous action rapid analysis method based on local recognition
CN113869276B (en) Lie recognition method and system based on micro-expression
Zhao et al. Head movement recognition based on Lucas-Kanade algorithm
CN110287829A (en) A kind of video face identification method of combination depth Q study and attention model
CN111860117A (en) Human behavior recognition method based on deep learning
JP2005351814A (en) Detector and detecting method
CN113902774B (en) Facial expression detection method of thick and dense optical flow characteristics in video
CN110675312B (en) Image data processing method, device, computer equipment and storage medium
CN112560618A (en) Behavior classification method based on skeleton and video feature fusion
CN112488165A (en) Infrared pedestrian identification method and system based on deep learning model
CN113901915A (en) Expression detection method for light-weight network and Magface in video
CN116030516A (en) Micro-expression recognition method and device based on multi-task learning and global circular convolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant