CN109902553B - Multi-angle face alignment method based on face pixel difference - Google Patents


Info

Publication number
CN109902553B
CN109902553B (application CN201910003381.1A)
Authority
CN
China
Prior art keywords
face
key points
points
regression
human face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910003381.1A
Other languages
Chinese (zh)
Other versions
CN109902553A (en
Inventor
宫恩来
杭丽君
何远彬
赵兴文
叶锋
丁明旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN201910003381.1A priority Critical patent/CN109902553B/en
Publication of CN109902553A publication Critical patent/CN109902553A/en
Application granted granted Critical
Publication of CN109902553B publication Critical patent/CN109902553B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a multi-angle face alignment method based on facial pixel differences. For faces inclined at different angles, initial key-point positions are predicted from 5 different starting angles, achieving an excellent fit for faces at various inclinations. Pixel differences at different facial positions can, to a certain extent, characterize different regions, with the eye region differing most markedly; a regression-shape selection rule based on maximizing facial pixel differences is therefore proposed, achieving a more accurate face alignment result.

Description

Multi-angle face alignment method based on face pixel difference
Technical Field
The invention relates to the field of face recognition, in particular to a multi-angle face alignment method based on facial pixel differences.
Background
The introduction of deep learning and the maturation of machine learning have driven great progress in computer-vision tasks, improving application technologies across many detection and localization fields. Face alignment is a task of significant research value in these areas, related both to detection and to regression-based localization. It matters greatly as an extension of the face detection task and as a foundation for face calibration and face recognition. Beyond face-recognition research, face alignment underpins many other fields. In expression recognition, for example, it makes it possible to explore the emotions conveyed by human expressions. Likewise, in the many applications with photo-beautification features, face smoothing and beautification, dynamic face-swap effects, and the like all rely on face alignment to obtain feature points or regions of interest on the face before performing their operations. This means a face alignment technique must both regress feature-point shapes accurately enough and run fast enough for the real-time scenarios of many applications.
Face alignment algorithms have many implementation schemes, typically combining a face-detection architecture with a face-alignment technique. Mainstream face-detection schemes are concentrated almost entirely in deep learning. One class is the classic two-stage networks RCNN, Fast R-CNN, and Faster R-CNN, which obtain regions of interest through candidate-box generation schemes such as SS (Selective Search) or an RPN, then feed them into a classification network for scoring. The other class is one-stage networks with higher real-time speed, such as SSD and the YOLO series, which omit candidate-box generation and complete classification and coordinate-box regression directly after extracting features from the whole image, giving the model a better speed-accuracy balance. Deep networks achieve extremely high accuracy in detection; in particular, one-stage networks preserve the coordinate-regression precision of the target box while meeting the speed requirements of real-time application scenarios, cementing deep learning's position in today's target-detection applications.
One line of face-alignment work aligns the face with a deep model, typically implementing face detection and face alignment jointly with various CNN architectures. Such schemes achieve competitive alignment accuracy, but deep models are burdened by huge parameter counts and heavy deep hierarchies; even after model compression and miniaturization, they greatly hinder later integration into hardware. By contrast, most machine-learning-based face-alignment schemes are shallow, easy-to-implement models; in this direction, classic techniques such as the LBF scheme and the SDM optimization strategy all adopt lightweight models far smaller than deep models while achieving feature-point regression accuracy comparable to deep-learning-based face alignment.
In practical applications the face is not always at a fixed angle: some faces tilt left, some tilt right, and few are perfectly centered, while the overall mean shape tends to stay within a few degrees of an upright face. Initializing faces tilted at multiple angles uniformly with the overall mean shape is therefore accurate only for near-upright faces and performs extremely poorly for faces with a pronounced tilt. This indiscriminate initialization makes later regression very difficult, so applications of the model require robustness to tilted and side faces.
Disclosure of Invention
Addressing the shortcomings of the prior art, the invention provides a multi-angle face alignment method based on facial pixel differences.
A face alignment method based on face pixel difference from multiple angles comprises the following steps:
step 1), generating a face-frame model: based on the SSD, a total of eight uniformly distributed feature layers are reselected to perform cascade regression prediction, and a plurality of prediction-frame scales which accord with the face proportion are selected to form a robust model MR-SSD; the selected 8 feature layers are respectively: conv3_3, conv4_3, conv5_3, fc7, conv8_2, conv9_2, conv10_2, conv11_2.
Step 2), initializing key points of the human face at multiple angles: selecting 5 key points of the human face, taking the mean value of the key points of the training set as the initialized coordinates of the 5 points, and rotating the key points +/-30 degrees and +/-60 degrees through affine transformation to form another 4 initialized angles;
step 3), random forest training: randomly selecting a plurality of pairs of pixel points in different radius ranges r of key points of the human face, solving the difference value of the pairs of pixel points, training the difference value as the input of a random forest, combining sparse 0, 1 binary features output by leaf nodes of the random forest to obtain a one-dimensional local binary feature vector, wherein a random forest consisting of M decision trees is trained around each key point;
step 4), global linear regression: obtaining local binary feature vectors of the key points through the step 3), performing global linear regression training on all the features, predicting the deviation of the key points by using a regressor obtained by training, and continuously correcting the coordinates of the predicted points;
step 5), regression-shape selection rule based on facial pixel-difference maximization: after regression prediction, N pixel points around the eye key points of each of the 5 differently initialized angles are selected to compute the mean and the mean square error; the initialization angle with the largest mean square error fits the face best, and the predicted points after regression at that angle are selected as the finally calibrated key points.
Preferably, in step 1), the face prediction frame ratios are respectively: 1:1,1:1.3,1:1.5.
Preferably, in step 2), the key points of the face are selected as the left eye pupil, right eye pupil, nose tip, left mouth corner and right mouth corner respectively; the perpendicular bisector of the line connecting the pupils of the upright face is taken as the 0-degree reference line, the five key points on the 0-degree reference line are taken as the standard shape, tilt to the right relative to the reference line is defined as positive and tilt to the left as negative, and five initialization schemes at angles of 0°, +/-30°, +/-60° are generated.
Preferably, in step 3), a random forest is trained around each key point, and each random forest is composed of M decision trees.
The invention provides a method for back-and-forth prediction of initial key point positions of 5 different angles aiming at faces inclined at different angles in face alignment, so that excellent fitting effect of the faces inclined at different angles is realized, pixel difference of different positions of the faces can be used as representation of different areas to a certain extent, eye difference is most obvious, regression shape selection rules based on maximization of facial pixel difference are provided, and more accurate face alignment effect is realized.
Drawings
FIG. 1 is an overall architecture of a face alignment method;
FIG. 2 shows the feature layers of the original SSD framework on the left and the reselected feature layers on the right;
FIG. 3 is an initialization diagram of 5 key points of the human face from different angles;
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and not restrictive.
as shown in fig. 1, a multi-angle human face alignment method based on facial pixel difference includes the following specific steps:
step 1: as shown in fig. 2, based on the SSD overall architecture, a con3_3 layer added to a lower layer and a conv5_3 layer of a fifth convolution series are selected, conv4_3, fc7, conv8_2, conv9_2, conv10_2 and conv11_2 uniform level features in an additional layer are fused, a fully connected layer at the end of the VGG architecture is cut out, and a final pooled layer is changed into a convolutional layer; and predicting each picture through an MR-SSD model to obtain a face frame.
Step 2: calculating the coordinate mean value of 5 key points by the face frame obtained in the step 1 and a training set to be used as 0-degree initial coordinates of the key points of the predicted face, and then obtaining initial coordinates at +/-30 degrees and +/-60 degrees through affine transformation, wherein the method for calculating the coordinates through affine transformation comprises the following steps:
Figure BDA0001934501010000051
0 DEG initialization coordinates are (X, Y), wherein
Figure BDA0001934501010000052
Figure BDA0001934501010000053
Respectively are the x-axis coordinate mean value of 5 key points in the training set, and y 1-y 5 are respectively the training setThe mean value of y-axis coordinates of 5 key points, theta is a rotation angle, theta is +/-30 degrees and +/-60 degrees, and the obtained U and V are coordinates of 4 rotated key points;
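The affine initialization of step 2 can be sketched as follows; rotating the mean shape about its centroid is an illustrative assumption (the exact rotation center is not spelled out here), and the 5-point mean shape values are made up.

```python
import math

# Sketch of step 2: rotate the 0-degree mean shape (5 key points) by
# +/-30 and +/-60 degrees to form the other four initializations.
def rotate_shape(xs, ys, theta_deg):
    """Return (U, V): key-point coordinates rotated by theta_deg degrees."""
    t = math.radians(theta_deg)
    cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)  # rotation center (assumed)
    U = [cx + (x - cx) * math.cos(t) - (y - cy) * math.sin(t)
         for x, y in zip(xs, ys)]
    V = [cy + (x - cx) * math.sin(t) + (y - cy) * math.cos(t)
         for x, y in zip(xs, ys)]
    return U, V

# Made-up 5-point mean shape: eyes, nose tip, mouth corners.
mean_x = [30.0, 70.0, 50.0, 35.0, 65.0]
mean_y = [40.0, 40.0, 60.0, 80.0, 80.0]
inits = {0: (mean_x, mean_y)}
for angle in (-60, -30, 30, 60):
    inits[angle] = rotate_shape(mean_x, mean_y, angle)  # 5 initializations total
```

A rigid rotation preserves inter-point distances, so each initialization is the same shape at a different inclination.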
step 201: the coordinates in step 2 are defined as follows: the midperpendicular of the connecting line of the pupils of the median face is taken as a 0-degree datum line, and five key points of the 0-degree datum line are taken as standard shapes. The inclination to the right with respect to the reference line is defined as a positive direction, and the inclination to the left is defined as a negative direction. As shown in fig. 3, five initialization schemes of minus 30 degrees (cross group), minus 60 degrees (pentagram group), plus 30 degrees (square group), plus 60 degrees (triangle group), plus one standard group face (circle) are generated. The five initialization schemes can perform fitting with high coverage on the faces with different inclination angles, and it can be seen that key points of each group of initialization schemes can cover different areas of the face;
step 202: selecting key points of the human face as a left eye pupil, a right eye pupil, a nose tip, a left mouth corner and a right mouth corner respectively;
and step 3: randomly selecting a plurality of pairs of pixel points within different radius ranges r of the initial shape key points, solving the difference value of the pairs of pixel points, taking the difference value as the input of a random forest for training, and combining sparse 0, 1 binary features output by leaf nodes of the random forest to obtain a one-dimensional local binary feature vector;
step 301: the random forest training proceeds in t stages in total; the radius range r shrinks stage by stage, so the key-point regression gradually approaches the true points;
step 302: feature mapping function of pixel difference obtained by random forest training
Figure BDA0001934501010000061
Further acquiring local binary features of the facial pixel difference;
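A toy illustration of the step-3 feature construction: sample pixel pairs within radius r of a key point, take their intensity differences, and encode them through trees as a sparse 0/1 vector. A real implementation trains M decision trees per key point; here each "tree" is a single threshold stump, and all names and thresholds are illustrative.

```python
import random

def pixel_pair_diffs(img, kx, ky, r, n_pairs, rng):
    """img: 2D list of grey values; sample n_pairs pixel-pair differences
    within radius r of key point (kx, ky), clamped to the image bounds."""
    h, w = len(img), len(img[0])
    diffs = []
    for _ in range(n_pairs):
        x1 = min(max(kx + rng.randint(-r, r), 0), w - 1)
        y1 = min(max(ky + rng.randint(-r, r), 0), h - 1)
        x2 = min(max(kx + rng.randint(-r, r), 0), w - 1)
        y2 = min(max(ky + rng.randint(-r, r), 0), h - 1)
        diffs.append(img[y1][x1] - img[y2][x2])
    return diffs

def local_binary_feature(diffs, thresholds):
    """One-hot leaf indicator per stump, concatenated into a sparse 0/1 vector
    (stand-in for the leaf outputs of M trained trees)."""
    vec = []
    for d, t in zip(diffs, thresholds):
        vec += [1, 0] if d <= t else [0, 1]  # left leaf vs right leaf
    return vec
```

Concatenating such leaf codes across all key points yields the one-dimensional local binary feature vector used in step 4.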
Step 4: obtain the local binary features of the key points from step 3, perform global linear-regression training on all features, predict the key-point deviations with the trained regressor, and continually correct the predicted coordinates; the deviation is expressed as:

\Delta S^t = W^t \Phi^t(I, S^{t-1})

where I is the input image matrix, S^{t-1} is the shape at stage t-1, Φ^t is the feature mapping function of this stage, and W^t is the linear regression matrix. The regression stage takes the generated LBF features as input to train the linear regression matrix W^t, yielding the trained model;
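Because the LBF vector is sparse and binary, the product W^t Φ^t reduces to summing the columns of W^t at the active indices; a minimal sketch with made-up shapes and values:

```python
# Sketch of the update Delta S^t = W^t * Phi^t(I, S^{t-1}) for a 0/1 LBF vector.
def apply_global_regression(W, lbf, shape):
    """W: one row per coordinate offset; lbf: 0/1 feature list;
    shape: flat coordinates [x1, y1, x2, y2, ...]. Returns corrected shape."""
    # Matrix-vector product collapses to a column sum over active bits.
    delta = [sum(row[j] for j, bit in enumerate(lbf) if bit) for row in W]
    return [s + d for s, d in zip(shape, delta)]
```

Iterating this update across stages, each with its own W^t and feature mapping, is what "continually correcting the predicted coordinates" amounts to.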
Step 5: after the 5 initialized groups of key points are regressed, they lie closer to the true key-point positions, and the most suitable group is selected as the final prediction using the regression-shape selection rule based on facial pixel-difference maximization: from each of the 5 differently initialized shapes, take the N pixel points in the eye regions as the key statistical regions and compute their mean μ and mean square error σ:

\mu = \frac{1}{N}\sum_{i=1}^{N} x_i

\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}

where x_i are all the pixel points within the designated range of the two eye regions of each initialization scheme and σ is the objective function, representing how far the region's pixels deviate from the mean. Unlike conventional approaches, the algorithm maximizes this objective, finding the region with the strongest pixel variation and hence the optimal regression shape closest to the true eye region. Since the mean square error of the pixels around the eyes is markedly larger than the pixel differences around the remaining groups of key points, the eye region can serve as a feature distinguishing different areas of the face. Training the five initialization schemes with different inclination angles yields five groups of regression predictions, each more accurate than before regression; because each scheme covers the face at a different angle, the scheme with the largest eye-region σ is selected as the final result.
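A small sketch of the step-5 selection rule under simplifying assumptions (square pixel windows around the two eye key points, a grey-level image as nested lists; all names are illustrative):

```python
def eye_region_sigma(img, eyes, radius):
    """Mean square error (std dev) of pixels in square windows around both eyes."""
    h, w = len(img), len(img[0])
    pix = []
    for ex, ey in eyes:
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                x, y = ex + dx, ey + dy
                if 0 <= x < w and 0 <= y < h:
                    pix.append(img[y][x])
    mu = sum(pix) / len(pix)
    return (sum((p - mu) ** 2 for p in pix) / len(pix)) ** 0.5

def select_shape(img, candidates, radius=2):
    """candidates: regressed shapes, each a list of 5 (x, y) key points with
    the eyes at indices 0 and 1. Pick the shape maximizing eye-region sigma."""
    return max(candidates, key=lambda s: eye_region_sigma(img, s[:2], radius))
```

The intuition: a candidate whose "eye" points actually land on the textured eye region sees high pixel variance there, while a misaligned candidate sampling smooth skin sees low variance.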

Claims (4)

1. A multi-angle human face alignment method based on facial pixel difference is characterized in that: the method comprises the following steps:
step 1), generating a model by a face frame: based on the SSD, a total of eight uniformly distributed feature layers are reselected to perform cascade regression prediction, and a plurality of prediction frame scales which accord with the face proportion are selected to form a robustness model MR-SSD;
step 2), initializing key points of the human face at multiple angles: selecting 5 key points of the human face, taking the mean value of the key points of the training set as the initialized coordinates of the 5 points, and rotating the key points +/-30 degrees and +/-60 degrees through affine transformation to form another 4 initialized angles;
step 3), random forest training: randomly selecting a plurality of pairs of pixel points in different radius ranges r of key points of the human face, solving the difference value of the pairs of pixel points, training the difference value as the input of a random forest, combining sparse 0, 1 binary features output by leaf nodes of the random forest to obtain a one-dimensional local binary feature vector, wherein a random forest consisting of M decision trees is trained around each key point;
step 4), global linear regression: obtaining local binary feature vectors of the key points through the step 3), performing global linear regression training on all the features, predicting the deviation of the key points by using a regressor obtained by training, and continuously correcting the coordinates of the predicted points;
step 5), regression-shape selection rule based on facial pixel-difference maximization: after regression prediction, N pixel points around the eye key points of each of the 5 differently initialized angles are selected to compute the mean and the mean square error; the initialization angle with the largest mean square error fits the face best, and the predicted points after regression at that angle are selected as the finally calibrated key points.
2. The method for aligning the human face from multiple angles based on the facial pixel difference as claimed in claim 1, wherein in the step 1), the selected 8 feature layers are respectively: conv3_3, conv4_3, conv5_3, fc7, conv8_2, conv9_2, conv10_2, conv11_2.
3. The method for aligning human faces from multiple angles based on facial pixel differences as claimed in claim 1, wherein in the step 1), the proportions of the human face prediction frames are respectively as follows: 1:1,1:1.3,1:1.5.
4. The method according to claim 1, wherein in step 2), the key points of the face are selected as the left eye pupil, right eye pupil, nose tip, left mouth corner and right mouth corner respectively; the perpendicular bisector of the line connecting the pupils of the upright face is used as the 0-degree reference line, the five key points on the 0-degree reference line are used as the standard shape, tilt to the right relative to the reference line is defined as positive and tilt to the left as negative, and five initialization schemes at angles of 0°, +/-30°, +/-60° are generated.
CN201910003381.1A 2019-01-03 2019-01-03 Multi-angle face alignment method based on face pixel difference Active CN109902553B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910003381.1A CN109902553B (en) 2019-01-03 2019-01-03 Multi-angle face alignment method based on face pixel difference

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910003381.1A CN109902553B (en) 2019-01-03 2019-01-03 Multi-angle face alignment method based on face pixel difference

Publications (2)

Publication Number Publication Date
CN109902553A CN109902553A (en) 2019-06-18
CN109902553B true CN109902553B (en) 2020-11-17

Family

ID=66943485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910003381.1A Active CN109902553B (en) 2019-01-03 2019-01-03 Multi-angle face alignment method based on face pixel difference

Country Status (1)

Country Link
CN (1) CN109902553B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401257B (en) * 2020-03-17 2022-10-04 天津理工大学 Face recognition method based on cosine loss under non-constraint condition
CN113052064B (en) * 2021-03-23 2024-04-02 北京思图场景数据科技服务有限公司 Attention detection method based on face orientation, facial expression and pupil tracking

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105404861A (en) * 2015-11-13 2016-03-16 中国科学院重庆绿色智能技术研究院 Training and detecting methods and systems for key human facial feature point detection model
CN105760836A (en) * 2016-02-17 2016-07-13 厦门美图之家科技有限公司 Multi-angle face alignment method based on deep learning and system thereof and photographing terminal
CN108108677A (en) * 2017-12-12 2018-06-01 重庆邮电大学 One kind is based on improved CNN facial expression recognizing methods

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10796178B2 (en) * 2016-12-15 2020-10-06 Beijing Kuangshi Technology Co., Ltd. Method and device for face liveness detection
US10417483B2 (en) * 2017-01-25 2019-09-17 Imam Abdulrahman Bin Faisal University Facial expression recognition

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105404861A (en) * 2015-11-13 2016-03-16 中国科学院重庆绿色智能技术研究院 Training and detecting methods and systems for key human facial feature point detection model
CN105760836A (en) * 2016-02-17 2016-07-13 厦门美图之家科技有限公司 Multi-angle face alignment method based on deep learning and system thereof and photographing terminal
CN108108677A (en) * 2017-12-12 2018-06-01 重庆邮电大学 One kind is based on improved CNN facial expression recognizing methods

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
《Digitized Feedforward Compensation Method for High-Power-Density Three-Phase Vienna PFC Converter》;Lijun Hang et al.;《 IEEE Transactions on Industrial Electronics》;20130430;第60卷(第4期);第1512-1519页 *
《Face Alignment via Regressing Local Binary Features》;Shaoqing Ren et al.;《IEEE TRANSACTIONS ON IMAGE PROCESSING》;20160331;第25卷(第3期);第1233-1245页 *
《Facial Landmark Detection by Deep Multi-task Learning》;Zhanpeng Zhang et al.;《ECCV 2014: Computer Vision》;20141231;第94-108页 *
《多角度人脸检测与识别方法研究》;曾建凡;《电子设计工程》;20171103;第25卷(第11期);第41-44页 *

Also Published As

Publication number Publication date
CN109902553A (en) 2019-06-18

Similar Documents

Publication Publication Date Title
CN109816012B (en) Multi-scale target detection method fusing context information
Mishkin et al. Repeatability is not enough: Learning affine regions via discriminability
CN104392223B (en) Human posture recognition method in two-dimensional video image
CN103824050A (en) Cascade regression-based face key point positioning method
CN106980809B (en) Human face characteristic point detection method based on ASM
CN109902553B (en) Multi-angle face alignment method based on face pixel difference
US20200327726A1 (en) Method of Generating 3D Facial Model for an Avatar and Related Device
CN111998862B (en) BNN-based dense binocular SLAM method
Jouaber et al. Nnakf: A neural network adapted kalman filter for target tracking
Robles-Kelly et al. String edit distance, random walks and graph matching
CN112541468A (en) Target tracking method based on dual-template response fusion
CN104091148B (en) A kind of man face characteristic point positioning method and device
CN111820545A (en) Method for automatically generating sole glue spraying track by combining offline and online scanning
Sun et al. Multi-stage refinement feature matching using adaptive ORB features for robotic vision navigation
Xi et al. Learning temporal-correlated and channel-decorrelated Siamese networks for visual tracking
CN113393524A (en) Target pose estimation method combining deep learning and contour point cloud reconstruction
CN116823885A (en) End-to-end single target tracking method based on pyramid pooling attention mechanism
Fanani et al. Keypoint trajectory estimation using propagation based tracking
CN114067128A (en) SLAM loop detection method based on semantic features
CN111339342B (en) Three-dimensional model retrieval method based on angle ternary center loss
CN111144497B (en) Image significance prediction method under multitasking depth network based on aesthetic analysis
CN112614161A (en) Three-dimensional object tracking method based on edge confidence
CN109887012B (en) Point cloud registration method combined with self-adaptive search point set
CN113674332B (en) Point cloud registration method based on topological structure and multi-scale features
CN107194947B (en) Target tracking method with self-adaptive self-correction function

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20190618

Assignee: RUIMO INTELLIGENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

Assignor: HANGZHOU DIANZI University

Contract record no.: X2023330000383

Denomination of invention: A Multi angle Face Alignment Method Based on Facial Pixel Difference

Granted publication date: 20201117

License type: Common License

Record date: 20230707