CN110659596A - Face key point positioning method in a case management scenario, computer storage medium and device - Google Patents

Face key point positioning method in a case management scenario, computer storage medium and device

Info

Publication number
CN110659596A
Authority
CN
China
Prior art keywords
face
key points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910860218.7A
Other languages
Chinese (zh)
Inventor
谢良
毛亮
朱婷婷
许丹丹
林焕凯
黄仝宇
汪刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gosuncn Technology Group Co Ltd
Original Assignee
Gosuncn Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gosuncn Technology Group Co Ltd filed Critical Gosuncn Technology Group Co Ltd
Priority to CN201910860218.7A priority Critical patent/CN110659596A/en
Publication of CN110659596A publication Critical patent/CN110659596A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161: Detection; Localisation; Normalisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168: Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a face key point positioning method in a case management scenario, a computer storage medium and an electronic device, wherein the method comprises the following steps: S1, acquiring a face key point data set in the case management scenario; S2, cropping the face pictures of the public data set into single faces and storing the face frames; S3, converting the input pictures into BGR format; S4, inputting the data preprocessed in step S3 into a key point model for training; S5, establishing a test library containing face samples in various poses from multiple case management scenes; S6, inputting data into the key point detector to predict key points; S7, obtaining the coordinates of the actually predicted key points on the original picture; S8, obtaining a first network model according to the coordinates of the key points on the original picture; and S9, estimating the face pose and locating the face key points according to the first network model.

Description

Face key point positioning method in a case management scenario, computer storage medium and device
Technical Field
The invention relates to the field of face key point detection, and in particular to a face key point positioning method in a case management scenario, a computer storage medium and an electronic device.
Background
Face key point detection, also called face key point localization or face alignment, refers to locating the key regions of a face, including the eyebrows, eyes, nose, mouth and facial contour, given a face image.
the human face key point detection method is roughly divided into three types, namely the traditional methods based on ASM (active Shape model) and AAM (active appearance model); a cascade shape regression-based method; a method based on deep learning. At present, the method is widely applied and the method based on deep learning has the highest effect precision.
The invention patent with publication number CN105868769A proposes a method and device for locating face key points in an image, which performs face detection on a target image to determine the face region. The method learns key point features in a cascaded manner and locates key points well on frontal faces, but its positioning degrades under variations such as illumination and pose.
The invention patent with publication number CN108446619A provides a face key point detection method and device based on deep reinforcement learning, modeling the face key point detection problem mathematically as a Markov decision process. Because of this modeling approach, end-to-end learning cannot be realized, the efficiency is low, and key points on side faces are located inaccurately.
At present, face key point positioning technology faces two main problems. First, existing face key point detection techniques locate key points well only under normal conditions; they adapt poorly across scenes and generalize poorly. For example, in a case management scene the installation position of the mobile device is not fixed, the surrounding environment exhibits various illumination and occlusion effects, and side faces appear at various angles. Complex scenes therefore place higher demands on key point positioning accuracy, and the prior art largely cannot meet the application requirements of the case-handling area in a case management scenario. Second, existing face key point detectors are not only complex in model structure but also computationally heavy, occupying considerable video memory and CPU resources, which makes them unsuitable for deployment on mobile terminals such as embedded devices.
Existing methods, such as the patent application CN201811109879, provide a multi-task scheme for face detection and key point positioning based on cascaded convolutional neural networks. Although the network structure is simple and meets real-time requirements, the method generalizes poorly to faces in complex scenes, detects key points on side faces inaccurately, and cannot meet the key point detection requirements of a case management scenario. The patent application with publication number CN108304765A performs multi-task detection of face key point positioning and semantic segmentation; although the adopted feature extraction network locates key points well in normal environments, the model structure is complex, video memory usage is high and real-time performance is low, so it is unsuitable for deployment on mobile terminal devices.
Disclosure of Invention
In view of the above, the present invention provides a face key point positioning method in a case management scenario, a computer storage medium and an electronic device, which improve positioning accuracy with simple steps.
To solve the above technical problems, in one aspect, the present invention provides a face key point positioning method in a case management scenario, comprising the following steps:
S1, acquiring a face key point data set in the case management scenario;
S2, cropping the face pictures of the data set into single faces, storing the face frames, expanding the face frames, labeling key points, and combining the face frames and the key points;
S3, converting the input RGB pictures into BGR format, expanding the face frame regions detected by the detector, and normalizing;
S4, inputting the data preprocessed in step S3 into a key point model for training, wherein the key point model is a network built on MobileNet_v2, to obtain a key point detector;
S5, establishing a test library containing face samples in various poses from multiple case management scenes;
S6, inputting the test data constructed in S5 into the key point detector to predict key points, obtaining key point prediction values;
S7, inversely transforming the key point prediction values according to the normalization scheme to obtain the coordinates of the actually predicted key points on the original picture;
S8, training the MobileNet_v2 network a second time according to the coordinates of the key points on the original picture, and adjusting the parameters of the MobileNet_v2 network to reduce the root mean square error to a preset value, obtaining a first network model;
and S9, estimating the face pose and locating the face key points according to the first network model.
According to the face key point positioning method in a case management scenario, given that the environment in such scenes is complex and faces appear in varied poses, especially side faces, the face key points are divided into two regions: contour points and facial-feature points. Features are learned for the specific facial-feature points with a convolutional neural network, and the weight of the facial-feature points is increased as training proceeds, so that the network learns these key points better and positioning accuracy in complex environments improves. The method is simple and feasible, generalizes well, and locates key points accurately on faces under various illumination conditions, such as dim-light and backlight environments. Moreover, the network structure is simple, the amount of computation is small and the consumed CPU resources are low, so the real-time requirements of embedded devices can be fully met.
According to some embodiments of the invention, the data set comprises a public data set and a case management data set, and step S2 comprises: S21, cropping the face pictures of the public data set and the case management data set into single faces and storing the face frames; S22, expanding each face frame by 50% and sending it to a labeling system to label 68 key points; and S23, combining the face frame and the key points and storing them as a label file in the form image name + face detection frame (x, y, w, h) + face key points.
According to some embodiments of the invention, step S2 further comprises: S24, performing data augmentation on all face pictures and key points, the augmentation comprising mirroring, scaling, translation and addition of Gaussian noise.
According to some embodiments of the present invention, in step S3, the original picture is expanded by 10% on each of the top, bottom, left and right sides of the face frame region, and the corresponding key points are shifted according to the offset of the face frame and normalized.
According to some embodiments of the invention, in step S4, MobileNet_v2, a depthwise separable convolutional neural network, is used as the backbone network for the key points; the hyper-parameters are configured, a weighted loss function is written, different weights are assigned to the facial-feature regions and the contour region, and the data preprocessed in step S3 is input into the key point model for training.
According to some embodiments of the invention, in step S8, if the root mean square error does not meet the requirement, steps S2-S7 are repeated until the root mean square error meets the requirement.
According to some embodiments of the invention, step S8 further comprises: pruning the network to meet the model performance requirements of embedded devices in the case management scenario.
According to some embodiments of the present invention, in step S9, face pose estimation is performed according to the key point model by converting key points into Euler angles, obtaining the pitch, yaw and roll angles.
In a second aspect, embodiments of the present invention provide a computer storage medium comprising one or more computer instructions that, when executed, implement the method of the above embodiments.
An electronic device according to an embodiment of the third aspect of the invention comprises a memory for storing one or more computer instructions and a processor configured to invoke and execute the one or more computer instructions to implement the method according to any of the embodiments described above.
Drawings
FIG. 1 is a flowchart of a face key point positioning method in a case management scenario according to an embodiment of the present invention;
FIG. 2 is a flowchart of cropping multiple face regions into single faces in the face key point positioning method according to an embodiment of the present invention;
FIG. 3 is a diagram of the overall MobileNet_v2 network structure in the face key point positioning method according to an embodiment of the present invention;
FIG. 4 is a distribution diagram of the face key points in the face key point positioning method according to an embodiment of the present invention;
FIG. 5 is a table of root mean square errors under different normalization factors in the face key point positioning method according to an embodiment of the present invention;
FIG. 6 is a flowchart of face key point detection in the case management scenario in the face key point positioning method according to an embodiment of the present invention;
FIG. 7 is a test chart of the face key point data set in the case management environment in the face key point positioning method according to an embodiment of the present invention;
FIG. 8 is a diagram of face preference test results in the face key point positioning method according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of an electronic device according to an embodiment of the invention.
Reference numerals:
an electronic device 300;
a memory 310; an operating system 311; an application 312;
a processor 320; a network interface 330; an input device 340; a hard disk 350; a display device 360.
Detailed Description
The following detailed description of embodiments of the present invention will be made with reference to the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
First, a face key point positioning method in a case management scenario according to an embodiment of the present invention is described in detail with reference to the accompanying drawings.
As shown in FIG. 1, the face key point positioning method in a case management scenario according to an embodiment of the present invention includes the following steps:
S1, acquiring a public data set and a case management data set of face key points in the case management scenario.
S2, cropping the face pictures of the public data set and the case management data set into single faces, storing the face frames, expanding the face frames, labeling key points, and combining the face frames and the key points.
S3, converting the input RGB pictures into BGR format, expanding according to the face frame regions detected by the detector, and normalizing.
S4, inputting the data preprocessed in step S3 into a key point model for training, wherein the key point model is a network built on MobileNet_v2, to obtain a key point detector.
S5, establishing a test library containing face samples in various poses from multiple case management scenes. It should be noted that the test library can also be screened, with its labels tidied, when the training data is constructed in step S1.
S6, inputting the test data constructed in S5 into the key point detector to predict key points, obtaining key point prediction values.
S7, inversely transforming the key point prediction values according to the normalization scheme to obtain the coordinates of the actually predicted key points on the original picture.
S8, training the MobileNet_v2 network a second time according to the coordinates of the key points on the original picture, and adjusting its parameters to reduce the root mean square error to a preset value, obtaining a first network model. The preset value can be the value at which the root mean square error is minimal, in which case the first network model is the network model with the minimal root mean square error.
S9, estimating the face pose and locating the face key points according to the first network model.
Therefore, according to the face key point positioning method in a case management scenario of the invention, given that the environment in such scenes is complex and faces appear in varied poses, especially side faces, the face key points are divided into two regions: contour points and facial-feature points. Features are learned for the specific facial-feature points with a convolutional neural network, and the weight of the facial-feature points is increased as training proceeds, so that the network learns these key points better and positioning accuracy in complex environments improves. The method is simple and feasible, generalizes well, and locates key points accurately on faces under various illumination conditions, such as dim-light and backlight environments; moreover, the network structure is simple, the amount of computation is small and the consumed CPU resources are low, so the real-time requirements of embedded devices can be fully met.
According to an embodiment of the present invention, step S2 includes:
S21, cropping the face pictures of the public data set and the case management data set into single faces and storing the face frames.
S22, expanding each face frame by 50% and sending it to a labeling system to label 68 key points.
S23, combining the face frame and the key points and storing them as a label file in the form image name + face detection frame (x, y, w, h) + face key points.
Preferably, step S2 further includes:
S24, performing data augmentation on all face pictures and key points, the augmentation comprising mirroring, scaling, translation and addition of Gaussian noise.
In some embodiments of the present invention, in step S3, the original picture is expanded by 10% on each of the top, bottom, left and right sides of the face frame region, and the corresponding key points are shifted according to the offset of the face frame and normalized.
Optionally, in step S4, MobileNet_v2, a depthwise separable convolutional neural network, is used as the backbone network for the key points; the hyper-parameters are configured, a weighted loss function is written, different weights are assigned to the facial-feature regions and the contour region, and the data preprocessed in step S3 is input into the key point model for training.
Preferably, in step S8, if the root mean square error does not meet the requirement, steps S2-S7 are repeated until it does.
Further, step S8 further includes: pruning the network to meet the model performance requirements of embedded devices in the case management scenario.
According to some embodiments of the present invention, in step S9, face pose estimation is performed according to the key point model by converting key points into Euler angles, obtaining the pitch, yaw and roll angles.
In other words, the face key point positioning method in a case management scenario according to the embodiment of the present invention is proposed because existing face key point detection cannot meet the key point positioning requirements of such scenes; it locates face key points in complex case management scenes based on a MobileNet_v2 convolutional neural network structure. Supported by a large number of existing case management scene pictures, the method designs a corresponding loss function and prunes the network, thereby improving the key point positioning accuracy in the case management scenario to meet the requirements of the actual scene.
The face key point positioning method in a case management scenario according to the embodiment of the invention mainly includes the following steps.
The first step: data collection. Collect public face key point data sets, and select face data captured by face tablets or other devices in case management scenes.
the second step is that: and (3) data preprocessing, cutting the face pictures of the public data set and the case and tube data set into a single face, storing a face frame, expanding the face frame by 50%, and sending the expanded face frame to a labeling system to label 68 key points. And combining the stored face frame and the key points, and storing the face frame and the key points as a label file of the image name, the face detection frame (x, y, w, h) and the face key points.
The third step: data augmentation. Because the data set is limited, apply the following operations to all face pictures and key points: mirroring, scaling, translation, addition of Gaussian noise, and so on.
The fourth step: input preprocessing. Convert the input picture into BGR format, expand the original picture by 10% on each of the top, bottom, left and right sides of the face frame region, and normalize. Shift the corresponding key points according to the offset of the face frame and normalize them.
The fifth step: network training. Use MobileNet_v2, a depthwise separable convolutional neural network, as the backbone network for the key points; configure the hyper-parameters, write a weighted loss function, assign different weights to the facial-feature regions and the contour region, and input the data preprocessed in the fourth step into the key point model for training to obtain a key point detector.
The sixth step: test library construction. The library should contain face samples in as many poses as possible from all case management scenes to ensure the rationality of the test. The 300W test set is also tested so the results can be compared with those reported in prior papers. The root mean square error (NME) is used as the evaluation metric, with the interpupillary distance as the normalization factor.
The seventh step: key point prediction. Using the test library and labels established in the sixth step, input the data into the key point detector trained in the fifth step for key point prediction, and take out the key point prediction values for output post-processing.
The eighth step: key point post-processing. After the key point prediction values are taken out, the coordinate points need to be converted: because the key points used for training were normalized, the predictions must be mapped back to the original image in the same manner to obtain the actual predicted key point values.
The ninth step: model optimization. Analyze the test results from the post-processed key point data to ensure the minimal NME value; if the requirement is not met, adjust the network structure and related parameters and continue training until the optimal model is obtained. Then prune the network to meet the performance requirements of the case management tablet, ensuring minimal computation with little loss of accuracy.
The tenth step: pose estimation. According to the key point model with the best prediction result from the ninth step, perform face pose estimation by converting the key points into Euler angles, obtaining the pitch, yaw and roll angles. The three angles can serve as the face-preference criteria in the case management scenario. The predicted key points can also be used for face alignment, after which face feature extraction and the like can be performed.
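The patent does not fix a particular key-point-to-Euler-angle algorithm; a common realization is a PnP solve against a generic 3D face template followed by rotation matrix decomposition. The sketch below assumes OpenCV; the six-point choice, the 3D template coordinates and the focal length approximation are illustrative assumptions, not the patent's method.

    # Hedged sketch: pitch/yaw/roll from 2D landmarks via cv2.solvePnP against
    # a generic 3D face template. Template values and focal length are assumed.
    import cv2
    import numpy as np

    MODEL_3D = np.array([
        (0.0, 0.0, 0.0),           # nose tip
        (0.0, -330.0, -65.0),      # chin
        (-225.0, 170.0, -135.0),   # left outer eye corner
        (225.0, 170.0, -135.0),    # right outer eye corner
        (-150.0, -150.0, -125.0),  # left mouth corner
        (150.0, -150.0, -125.0),   # right mouth corner
    ], dtype=np.float64)

    def euler_angles(image_points, img_w, img_h):
        """image_points: 6x2 array of 2D landmarks matching MODEL_3D order."""
        focal = img_w  # crude focal length approximation
        cam = np.array([[focal, 0, img_w / 2],
                        [0, focal, img_h / 2],
                        [0, 0, 1]], dtype=np.float64)
        ok, rvec, _ = cv2.solvePnP(MODEL_3D,
                                   np.asarray(image_points, np.float64),
                                   cam, np.zeros(4))
        rot, _ = cv2.Rodrigues(rvec)
        # Decompose the rotation matrix into Euler angles in degrees.
        sy = np.hypot(rot[0, 0], rot[1, 0])
        pitch = np.degrees(np.arctan2(rot[2, 1], rot[2, 2]))
        yaw = np.degrees(np.arctan2(-rot[2, 0], sy))
        roll = np.degrees(np.arctan2(rot[1, 0], rot[0, 0]))
        return pitch, yaw, roll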
The data set processing and key point model training steps of the face key point positioning method according to the embodiment of the present invention are described in detail below with reference to the drawings.
First, in the data set processing stage, all collected data are divided into a public data set and a case management data set, where the case management data set should cover pitch, yaw and roll variations, occlusion, backlight, weak light and strong light as far as possible, and the balance of the data is ensured. The pictures may be single-face photos or scenes containing multiple faces.
The public data set contains 68-point annotation files, for example the 300W data set. To ensure test accuracy and eliminate interference from multiple faces, all faces are cropped into single-face regions. For the public data set, a face detector detects the face frame, which is then matched with the labeled key points; for the case management data, the face detector detects the face frame region, the largest face is kept and its region expanded by 50% to ensure a single face per picture, and the 68 key points are labeled with a labeling tool. The data label input format is: image name + face detection frame (x, y, w, h) + face key points (x1, y1, x2, y2, x3, y3, ...). The public and case management sets are divided into training and test sets. The process of cropping multiple face regions into single regions is shown in FIG. 2.
To obtain sufficient training data, all training sets undergo augmentation operations such as mirroring, random rotation (random angles from -30 to 30 degrees to generate new data at all angles), random translation and random scaling; Gaussian noise is added during augmentation to approximate occluded data. All key points are transformed accordingly.
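A compact sketch of these augmentations under stated and assumed parameters: the plus-or-minus-30-degree rotation range comes from the text, while the scale, translation and noise magnitudes are assumptions, and the left/right landmark index re-mapping needed after mirroring is omitted for brevity.

    # Hedged sketch of the augmentations described above (mirror, random
    # rotation in [-30, 30] degrees, random translation/scale, Gaussian noise).
    # Keypoints are carried through every transform; magnitudes other than the
    # stated rotation range are illustrative assumptions.
    import cv2
    import numpy as np

    def augment(img, kps, rng=np.random.default_rng()):
        """img: HxWx3 uint8; kps: Nx2 float array of (x, y) keypoints."""
        h, w = img.shape[:2]
        kps = kps.astype(np.float64).copy()

        if rng.random() < 0.5:  # horizontal mirror
            img = img[:, ::-1].copy()
            kps[:, 0] = w - 1 - kps[:, 0]
            # (left/right landmark index re-mapping omitted for brevity)

        angle = rng.uniform(-30, 30)                    # stated range
        scale = rng.uniform(0.9, 1.1)                   # assumed range
        tx, ty = rng.uniform(-0.05, 0.05, 2) * (w, h)   # assumed range
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
        M[:, 2] += (tx, ty)
        img = cv2.warpAffine(img, M, (w, h))
        kps = kps @ M[:, :2].T + M[:, 2]  # same affine applied to keypoints

        noise = rng.normal(0, 8, img.shape)             # assumed sigma
        img = np.clip(img.astype(np.float64) + noise, 0, 255).astype(np.uint8)
        return img, kps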
The training set preprocessing converts the input picture into BGR format, expands the detection frame by 10% on each of the top, bottom, left and right sides according to the size of the face detection frame to obtain the crop region, and crops this region from the original image as input data. The main purpose of the expansion is to ensure that all labeled key points fall inside the face frame region. The cropped image is scaled to 64x64, the mean value 127.5 is subtracted, and the result is multiplied by the normalization factor 0.0078125; the corresponding key points are shifted according to the offset of the face frame and normalized to (-0.5, 0.5).
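This preprocessing pipeline in code form, as a sketch: the 10% expansion, 64x64 size, 127.5 mean and 0.0078125 factor are the values stated above, while the function boundaries and variable names are assumptions.

    # Hedged sketch of the input preprocessing described above: BGR conversion,
    # 10% box expansion, crop, resize to 64x64, (x - 127.5) * 0.0078125, and
    # keypoint normalization to (-0.5, 0.5).
    import cv2
    import numpy as np

    def preprocess(img_rgb, box, kps):
        """box: (x, y, w, h) face frame; kps: 68x2 (x, y) in original coords."""
        img = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2BGR)
        x, y, w, h = box
        x0 = int(max(0, x - 0.1 * w)); y0 = int(max(0, y - 0.1 * h))
        x1 = int(min(img.shape[1], x + 1.1 * w))
        y1 = int(min(img.shape[0], y + 1.1 * h))
        crop = img[y0:y1, x0:x1]
        cw, ch = x1 - x0, y1 - y0

        inp = cv2.resize(crop, (64, 64)).astype(np.float32)
        inp = (inp - 127.5) * 0.0078125              # roughly maps to (-1, 1)

        norm_kps = (np.asarray(kps, np.float32) - (x0, y0)) / (cw, ch) - 0.5
        return inp, norm_kps                         # norm_kps in (-0.5, 0.5)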
For the key point model training process, the face key point positioning method of the invention uses MobileNet_V2 as the backbone network for key point training. Its main idea is to combine MobileNet_V1 with the residual units of a residual network: depthwise separable convolutions replace the standard convolutions in the residual units, and a 1x1 expansion layer is added before the depthwise convolution to increase the number of channels and obtain more features. To avoid damage to the features by ReLU, the nonlinear activation after layers with few channels is replaced by a linear layer. FIG. 3 shows the overall MobileNet_v2 network structure; the input size is 64x64, because the model is intended for mobile terminal devices and a larger input would increase the amount of computation. According to the network architecture of FIG. 3, the hyper-parameters are configured and an optimization algorithm is used to drive the loss to its minimum.
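A minimal sketch of a MobileNet_v2-based 68-point regressor, using torchvision's stock MobileNetV2 as a stand-in; the patent configures and prunes its own variant, which is not reproduced here, so only the overall shape (64x64 input, 136 regression outputs) is illustrated.

    # Hedged sketch: a 68-point regression head on a stock torchvision
    # MobileNetV2 as a stand-in for the patent's trimmed MobileNet_v2 variant.
    # Input: preprocessed 64x64 crop; output: 136 normalized coordinates.
    import torch
    import torchvision

    def build_keypoint_model():
        # 68 keypoints x (x, y) = 136 regression outputs in (-0.5, 0.5).
        return torchvision.models.mobilenet_v2(num_classes=136)

    model = build_keypoint_model()
    dummy = torch.randn(1, 3, 64, 64)        # one preprocessed face crop
    coords = model(dummy).view(-1, 68, 2)    # per-point (x, y) predictions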
The root mean square error NME (Normalized Mean Error) is used as the error evaluation index of key point accuracy, computed as shown in Equation 1:
    NME = \frac{1}{M} \sum_{j=1}^{M} \frac{\sqrt{(X_j - x_j)^2 + (Y_j - y_j)^2}}{r}    (1)

In the formula, NME denotes the root mean square error of a test picture, M the number of key points, X_j and Y_j the abscissa and ordinate of the j-th key point of a test-set picture as detected by the model, and x_j and y_j the abscissa and ordinate of the j-th key point in the test-set picture's label. r denotes the normalization factor, and two normalization factors are used here. One is the interpupillary distance (denoted r_1; the left pupil coordinate (x_l, y_l) is the center of points 38, 39, 41 and 42 in FIG. 4, and the right pupil coordinate (x_r, y_r) is the center of points 44, 45, 47 and 48 in FIG. 4), so the interpupillary distance is:

    r_1 = \sqrt{(x_l - x_r)^2 + (y_l - y_r)^2}

The other is the distance between the outer eye corners (denoted r_2; the left and right outer corners are points 37 and 46 in FIG. 4):

    r_2 = \sqrt{(x_{37} - x_{46})^2 + (y_{37} - y_{46})^2}
To evaluate the key point errors of each face region in detail, the 68 face key points (distributed as shown in FIG. 4 and numbered 1-68) are divided into the following regions for testing, and the root mean square error is computed per region: whole face (68 points: 1-68), without face contour (51 points: 18-68), face contour (17 points: 1-17), eyebrows (10 points: 18-27), nose (9 points: 28-36), eyes (12 points: 37-48), mouth (20 points: 49-68).
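A small sketch of this per-region NME evaluation, using the 1-based point ranges listed above converted to 0-based slices; the helper itself is an illustrative assumption.

    # Hedged sketch of the per-region NME evaluation described above.
    import numpy as np

    REGIONS = {  # 1-based inclusive point ranges from the text
        "whole": (1, 68), "no_contour": (18, 68), "contour": (1, 17),
        "eyebrows": (18, 27), "nose": (28, 36),
        "eyes": (37, 48), "mouth": (49, 68),
    }

    def nme_per_region(pred, gt, r):
        """pred, gt: Nx68x2 arrays; r: length-N normalization factors (e.g. r1)."""
        err = np.linalg.norm(pred - gt, axis=-1)   # Nx68 per-point distances
        out = {}
        for name, (lo, hi) in REGIONS.items():     # Equation 1, per region
            out[name] = float((err[:, lo - 1:hi].mean(axis=1) / r).mean())
        return out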
In summary, the items to be tested are shown in FIG. 5.
Based on the face key point distribution shown in FIG. 4, the weights are divided into three parts: the primary weight main_weight covers points 17-68, representing the key points of the facial features and the upper half of the contour; the contour weight contour_weight covers key points 0-16, representing the lower half of the contour; and the coarse key point weight coarse_weight covers the points that need emphasized learning, such as the left and right eyebrow ends at points 17, 21, 22 and 27, the eye corners at points 36, 39, 42 and 45, the nose bridge at point 27, the nose wings at points 31 and 35, and the mouth corners at points 48 and 54. (Note that this weight scheme numbers the points from 0.) As training starts, the three weights are initialized according to the importance of these positions; their values are then refined during back-propagation, and the face regions are learned according to their weights: the larger the weight, the better the learning effect and the more accurate the key point positioning.
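A hedged sketch of such a three-tier weighted loss follows. The concrete weight values, the growth schedule and the squared-error form are illustrative assumptions; the patent only states that the emphasized points' weights increase as training proceeds.

    # Hedged sketch of the three-tier weighted keypoint loss described above,
    # with 0-based indices as in the weight scheme. Weight values and the
    # growth schedule are assumptions.
    import torch

    CONTOUR = list(range(0, 17))   # lower contour: contour_weight
    MAIN = list(range(17, 68))     # facial features + upper contour: main_weight
    COARSE = [17, 21, 22, 27, 36, 39, 42, 45, 31, 35, 48, 54]  # coarse_weight

    def weighted_keypoint_loss(pred, gt, epoch):
        """pred, gt: Bx68x2 tensors of normalized coordinates."""
        w = torch.ones(68, device=pred.device)
        w[CONTOUR] = 0.5               # assumed contour_weight
        w[MAIN] = 1.0                  # assumed main_weight
        w[COARSE] = 1.0 + 0.1 * epoch  # grows as training proceeds (assumed rate)
        sq = ((pred - gt) ** 2).sum(dim=-1)   # Bx68 squared point errors
        return (sq * w).mean()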
After training, testing takes the predicted key points and transforms their coordinates: all output test data must be inversely transformed according to the coordinate normalization used during training, so as to map back to the original image coordinates. Transformation steps: 1) add 0.5 to the output coordinates to move them into the (0, 1) interval; 2) multiply by the width and height of the cropped face and add the crop offset relative to the original picture. This yields the coordinates of the actually predicted points on the original image.
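The inverse mapping in code form, a direct transcription of the two steps above; the variable names are assumptions.

    # Hedged sketch of the inverse transform above: (-0.5, 0.5) predictions
    # back to original-image coordinates. (x0, y0, cw, ch) is the crop box
    # recorded during preprocessing.
    import numpy as np

    def denormalize(norm_kps, crop_box):
        x0, y0, cw, ch = crop_box
        kps = np.asarray(norm_kps, np.float32) + 0.5   # step 1: into (0, 1)
        kps = kps * (cw, ch) + (x0, y0)                # step 2: scale + offset
        return kps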
The test results are analyzed from the post-processed key point data to ensure the minimal root mean square error. If the corresponding NME does not reach the expectation, the samples, network parameters, network structure and weight distribution are adjusted and training resumes from S5; after training finishes, the best key point model is selected, the network is pruned, and the training and testing are repeated, so as to meet the performance requirements of the embedded device while shrinking the network as much as possible.
After the above steps, accurate key point values are obtained, and these values can be used for face pose estimation and the face-preference scheme of the case management tablet. For face pose estimation, seven points including the facial-feature points are used in a matrix transformation that computes the three Euler angles: pitch, yaw and roll.
In the face-preference scheme, the interpupillary distance serves as the criterion for filtering out small faces. Side face filtering can be applied while the tablet captures faces; this experiment mainly adopts a midline-proportion filtering strategy, implemented as follows. The left-eye points (37, 38, 40, 41) are averaged into a single left-eye point and the right-eye points (43, 44, 46, 47) into a single right-eye point; adding points 30, 48 and 54 gives five points, denoted: left eye (point 0), right eye (point 1), nose (point 2), left mouth corner (point 3) and right mouth corner (point 4). Take the midpoint of the segment joining points 0 and 3 and the midpoint of the segment joining points 1 and 4, and compare the x values of these two midpoints with the x coordinate of point 2: if both fall on the same side of point 2, the face must be filtered out as a side face. For proportional filtering of the left and right midlines: take the same two midpoints, compute the distance from point 2 to each midpoint, and divide the two distances to obtain the side face proportion. FIG. 6 shows the flowchart of face key point training and testing in the case management scenario of this application.
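A compact sketch of this midline check, using the 0-based indices given in the text; the ratio threshold is an illustrative assumption.

    # Hedged sketch of the midline side-face filter described above, with the
    # 0-based landmark indices from the text. The ratio threshold is assumed.
    import numpy as np

    def side_face_check(kps, ratio_thresh=0.5):
        """kps: 68x2 array; returns True if the face is filtered as a side face."""
        left_eye = kps[[37, 38, 40, 41]].mean(axis=0)
        right_eye = kps[[43, 44, 46, 47]].mean(axis=0)
        nose, l_mouth, r_mouth = kps[30], kps[48], kps[54]

        m_left = (left_eye + l_mouth) / 2    # midpoint of points 0 and 3
        m_right = (right_eye + r_mouth) / 2  # midpoint of points 1 and 4

        # Both midline x values on the same side of the nose x -> side face.
        if (m_left[0] - nose[0]) * (m_right[0] - nose[0]) > 0:
            return True

        # Proportional check: ratio of the nose-to-midpoint distances.
        d_left = np.linalg.norm(nose - m_left)
        d_right = np.linalg.norm(nose - m_right)
        return min(d_left, d_right) / max(d_left, d_right) < ratio_thresh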
The accuracy of the above-described keypoint model is tested in connection with specific embodiments below.
To test the accuracy of the key point model, data were collected from case management scenes in several environments, mainly comprising: 1) 1296 faces from case management ID photos, 2) 1263 side faces from the Kaishi case management scene, 3) 538 frontal faces and 1049 side faces from the Nanshan case management scene, and 4) 456 side faces from the Portulaca Bridge case management scene. The test environment configuration is: CPU: i7-4790, 3.60 GHz; memory: 16 GB; GPU: GTX1080 with 8 GB of video memory.
(1) Key point NME testing
The test results are shown in FIG. 7: with the interpupillary distance as the normalization factor, the NME of all faces is tested under five scenes. miniVGG_L7_224_20180625 denotes the initial in-house key point model trained in June 2018 with 224x224 input; miniVGG_L7_64_20190328 denotes a key point model trained with a VGG network and 64x64 input; Mobilenetv2_L55_64_20190402 denotes a key point model using Mobilenetv2 as the backbone network; and Mobilenetv2_L30_64_20190404 denotes the final network model with the pruned network structure and adjusted key point weights. The results are as follows:
NME test: mobilenetv2_ L30_64_20190404 performed on the front face dataset just as well as the company's previously trained model, but the keypoint localization accuracy on the side faces was improved by an entire 14 points over the previous model.
Size of the model: after training and pruning, the model shrinks from the original 60 MB to 1.3 MB, and testing a single picture on the case management tablet takes only about 30 ms.
(2) Key point positioning accuracy
To measure the key point positioning accuracy in real scenes, face data at every angle in the case management scenes were used and the predicted key points were rendered in the face regions; experiments show that all key points of both frontal and side faces are correctly located in all case management scenes.
(3) Face preference test
After the key point model training is completed, a face preference test is added, mainly to filter out wide-angle side face data. The criteria for judging a wide-angle side face are: a horizontal face yaw angle within ±60 degrees counts as a positive sample; a pitch angle within ±30 degrees counts as a positive sample; and a roll angle within ±50 degrees counts as a positive sample. Data that are neither overly blurred nor too dark are taken as the test set. The face preference test set is composed as follows (a code sketch of these pose criteria follows the composition).
sample correction: the face recognition system is composed of three direction regular samples of a yaw angle +/-60 degrees, a pitch angle +/-30 degrees, a roll angle +/-50 degrees and the like, comprises snap pictures of scenes such as case tubes, flat panel acquisition, company exhibition halls, five-floor bayonet machines and the like, is not overlapped with key point training set data, and is 9382 single face pictures in total.
Negative samples: composed of negative examples of the yaw, pitch and roll angles, including snapshot pictures from scenes such as case management, tablet capture, exhibition halls and gate machines; they do not overlap the key point training set data and total 5216 face pictures.
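For reference, the positive-sample criteria above reduce to a single check; a minimal sketch using the stated thresholds (the function name is an assumption):

    # Hedged sketch of the positive-sample pose filter, using the stated
    # thresholds: |yaw| <= 60, |pitch| <= 30, |roll| <= 50 degrees.
    def is_positive_sample(pitch, yaw, roll):
        return abs(yaw) <= 60 and abs(pitch) <= 30 and abs(roll) <= 50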
In this test, face filtering adopts the original cross-line and proportional-midline filtering modes; the final test results are shown in FIG. 8. The pruned and weight-trained MobileNet network structure achieves essentially 97% classification accuracy on this preference test.
In summary, the face key point positioning method in a case management scenario according to the embodiment of the present invention is a MobileNet_v2-based face key point detection method proposed for the insufficient generalization of existing key point detection in actual scenes and the high performance demands that mobile terminal devices place on a model. It solves two problems of key point positioning. First, for complex scene environments with varied illumination, occlusion, blur and side faces, the face is divided into two key point regions, the contour and the facial-feature points; features are learned for the specific facial-feature points with a convolutional neural network, and the weight of the facial-feature points is increased as training proceeds, so that the network fits these key points better and positioning accuracy in complex environments improves. Second, a pruned MobileNet_v2 convolutional neural network structure reduces the amount of network computation, improving real-time performance and lowering CPU resource occupation on mobile devices. The algorithm generalizes well and locates face key points accurately under various illumination conditions, such as dim-light and backlight environments; its simple network structure, small computation and low CPU consumption meet the performance requirements of embedded devices.
In addition, the present invention provides a computer storage medium comprising one or more computer instructions which, when executed, implement any of the above face key point positioning methods in a case management scenario.
That is, the computer storage medium stores a computer program which, when executed by a processor, causes the processor to perform any of the above face key point positioning methods in a case management scenario.
As shown in fig. 9, an embodiment of the present invention provides an electronic device 300, which includes a memory 310 and a processor 320, where the memory 310 is configured to store one or more computer instructions, and the processor 320 is configured to call and execute the one or more computer instructions, so as to implement any one of the methods described above.
That is, the electronic device 300 includes: a processor 320 and a memory 310, in which memory 310 computer program instructions are stored, wherein the computer program instructions, when executed by the processor, cause the processor 320 to perform any of the methods described above.
Further, as shown in fig. 9, the electronic device 300 further includes a network interface 330, an input device 340, a hard disk 350, and a display device 360.
The various interfaces and devices described above may be interconnected by a bus architecture. A bus architecture may be any architecture that may include any number of interconnected buses and bridges. Various circuits of one or more Central Processing Units (CPUs), represented in particular by processor 320, and one or more memories, represented by memory 310, are coupled together. The bus architecture may also connect various other circuits such as peripherals, voltage regulators, power management circuits, and the like. It will be appreciated that a bus architecture is used to enable communications among the components. The bus architecture includes a power bus, a control bus, and a status signal bus, in addition to a data bus, all of which are well known in the art and therefore will not be described in detail herein.
The network interface 330 may be connected to a network (e.g., the internet, a local area network, etc.), and may obtain relevant data from the network and store the relevant data in the hard disk 350.
The input device 340 may receive various commands input by an operator and send the commands to the processor 320 for execution. The input device 340 may include a keyboard or a pointing device (e.g., a mouse, a trackball, a touch pad, a touch screen, or the like).
The display device 360 may display the result of the instructions executed by the processor 320.
The memory 310 is used for storing programs and data necessary for operating the operating system, and data such as intermediate results in the calculation process of the processor 320.
It will be appreciated that memory 310 in embodiments of the invention may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), or a flash memory. Volatile memory can be Random Access Memory (RAM), which acts as external cache memory. The memory 310 of the apparatus and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some embodiments, memory 310 stores the following elements, executable modules or data structures, or a subset thereof, or an expanded set thereof: an operating system 311 and application programs 312.
The operating system 311 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application programs 312 include various application programs, such as a Browser (Browser), and are used for implementing various application services. A program implementing methods of embodiments of the present invention may be included in application 312.
The method disclosed by the above embodiment of the present invention can be applied to the processor 320, or implemented by the processor 320. Processor 320 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 320. The processor 320 may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM or registers. The storage medium is located in the memory 310, and the processor 320 reads the information in the memory 310 and completes the steps of the method in combination with the hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units designed to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
In particular, the processor 320 is also configured to read the computer program and execute any of the methods described above.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be physically included alone, or two or more units may be integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) to execute some steps of the transceiving method according to various embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A face key point positioning method in a case management scenario, characterized by comprising the following steps:
S1, acquiring a face key point data set in the case management scenario;
S2, cropping the face pictures of the data set into single faces, storing the face frames, expanding the face frames, labeling key points, and combining the face frames and the key points;
S3, converting the input RGB pictures into BGR format, expanding the face frame regions detected by the detector, and normalizing;
S4, inputting the data preprocessed in step S3 into a key point model for training, wherein the key point model is a network built on MobileNet_v2, to obtain a key point detector;
S5, establishing a test library containing face samples in various poses from multiple case management scenes;
S6, inputting the test data constructed in S5 into the key point detector to predict key points, obtaining key point prediction values;
S7, inversely transforming the key point prediction values according to the normalization scheme to obtain the coordinates of the actually predicted key points on the original picture;
S8, training the MobileNet_v2 network a second time according to the coordinates of the key points on the original picture, and adjusting the parameters of the MobileNet_v2 network to reduce the root mean square error to a preset value, obtaining a first network model;
and S9, estimating the face pose and locating the face key points according to the first network model.
2. The method of claim 1, wherein the data set comprises a public data set and a case management data set, and step S2 comprises:
S21, cropping the face pictures of the public data set and the case management data set into single faces and storing the face frames;
S22, expanding each face frame by 50% and sending it to a labeling system to label 68 key points;
and S23, combining the face frame and the key points and storing them as a label file in the form image name + face detection frame (x, y, w, h) + face key points.
3. The method according to claim 2, wherein step S2 further comprises:
S24, performing data augmentation on all face pictures and key points, the augmentation comprising mirroring, scaling, translation and addition of Gaussian noise.
4. The method according to claim 1, wherein in step S3, the original picture is expanded by 10% on each of the top, bottom, left and right sides of the face frame region, and the corresponding key points are shifted according to the offset of the face frame and normalized.
5. The method as claimed in claim 1, wherein in step S4, MobileNet_v2, a depthwise separable convolutional neural network, is used as the backbone network for the key points; the hyper-parameters are configured, a weighted loss function is written, different weights are assigned to the facial-feature regions and the contour region, and the data preprocessed in step S3 is input into the key point model for training.
6. The method of claim 1, wherein in step S8, if the root mean square error does not meet the requirement, steps S2-S7 are repeated until it does.
7. The method according to claim 1, wherein step S8 further comprises: pruning the network to meet the model performance requirements of embedded devices in the case management scenario.
8. The method according to claim 1, wherein in step S9, face pose estimation is performed according to the key point model by converting key points into Euler angles, obtaining the pitch, yaw and roll angles.
9. A computer storage medium comprising one or more computer instructions which, when executed, implement the method of any one of claims 1-8.
10. An electronic device comprising a memory and a processor, wherein,
the memory is to store one or more computer instructions;
the processor is configured to invoke and execute the one or more computer instructions to implement the method of any one of claims 1-8.
CN201910860218.7A 2019-09-11 2019-09-11 Face key point positioning method under case and management scene, computer storage medium and equipment Pending CN110659596A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910860218.7A CN110659596A (en) 2019-09-11 2019-09-11 Face key point positioning method under case and management scene, computer storage medium and equipment


Publications (1)

Publication Number Publication Date
CN110659596A (en) 2020-01-07

Family

ID=69038196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910860218.7A Pending CN110659596A (en) 2019-09-11 2019-09-11 Face key point positioning method under case and management scene, computer storage medium and equipment

Country Status (1)

Country Link
CN (1) CN110659596A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106295476A * 2015-05-29 2017-01-04 Tencent Technology (Shenzhen) Company Limited Face key point localization method and device
US20170161551A1 * 2015-05-29 2017-06-08 Tencent Technology (Shenzhen) Company Limited Face key point positioning method and terminal
CN109919049A * 2019-02-21 2019-06-21 Beijing Yisa Technology Co., Ltd. Fatigue detection method based on deep-learning face modeling
CN109919048A * 2019-02-21 2019-06-21 Beijing Yisa Technology Co., Ltd. Method for face key point detection based on cascaded MobileNet-V2
CN109858466A * 2019-03-01 2019-06-07 Beijing Shizhen Intelligent Technology Co., Ltd. Face key point detection method and device based on convolutional neural networks

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021175069A1 * 2020-03-06 2021-09-10 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Photographing method and apparatus, electronic device, and storage medium
JP2022544635A * 2020-06-29 2022-10-20 Beijing Baidu Netcom Science Technology Co., Ltd. Dangerous driving behavior recognition method, device, electronic device and storage medium
CN113971822A * 2020-07-22 2022-01-25 Wuhan TCL Group Industrial Research Institute Co., Ltd. Face detection method, intelligent terminal and storage medium
CN112329516A * 2020-09-15 2021-02-05 Shenzhen Dianchuang Technology Co., Ltd. Method, device and medium for detecting driver mask wearing based on key point positioning and image classification
CN114557685A * 2020-11-27 2022-05-31 Shanghai Jiao Tong University Non-contact motion-robust heart rate measurement method and device
CN114557685B * 2020-11-27 2023-11-14 Shanghai Jiao Tong University Non-contact motion-robust heart rate measurement method and device
CN112633084A * 2020-12-07 2021-04-09 Shenzhen Intellifusion Technologies Co., Ltd. Face frame determination method and device, terminal equipment and storage medium
CN112633084B * 2020-12-07 2024-06-11 Shenzhen Intellifusion Technologies Co., Ltd. Face frame determination method and device, terminal equipment and storage medium
CN113011286A * 2021-03-02 2021-06-22 Chongqing University of Posts and Telecommunications Video-based squint discrimination method and system using a deep neural network regression model
CN114332956A * 2022-03-15 2022-04-12 Huazhong Agricultural University Cattle face detection and cattle face key point positioning method based on a convolutional neural network
CN114332956B * 2022-03-15 2023-04-07 Huazhong Agricultural University Cattle face detection and cattle face key point positioning method based on a convolutional neural network


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200107