CN113313010A - Face key point detection model training method, device and equipment - Google Patents

Face key point detection model training method, device and equipment

Info

Publication number
CN113313010A
CN113313010A
Authority
CN
China
Prior art keywords
face
training sample
net network
trained
key point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110579215.3A
Other languages
Chinese (zh)
Inventor
Liu Chang
Liu Siwei
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Weaving Point Intelligent Technology Co ltd
Original Assignee
Guangzhou Weaving Point Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Weaving Point Intelligent Technology Co ltd filed Critical Guangzhou Weaving Point Intelligent Technology Co ltd
Priority to CN202110579215.3A priority Critical patent/CN113313010A/en
Publication of CN113313010A publication Critical patent/CN113313010A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/165Detection; Localisation; Normalisation using facial parts and geometric relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Geometry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a face key point detection model training method, device and equipment. In the process of training a P-Net network, an R-Net network and an O-Net network with a first, a second and a third face training sample respectively, a first loss value of the R-Net network is adjusted by the face angle information of the second face training sample, and a second loss value of the O-Net network is adjusted by the face angle information of the third face training sample. Directionally adjusting the loss values of the face key point detection task under different poses through the face angle information improves the detection accuracy of the face key point detection model in large-pose states, and solves the technical problem that existing face key point detection methods do not consider the extra information of the face key point task and, in particular, have low face key point detection accuracy when the face is at a large pose angle.

Description

Face key point detection model training method, device and equipment
Technical Field
The application relates to the technical field of face key point detection, in particular to a face key point detection model training method, device and equipment.
Background
Existing face key point detection technology performs well on faces in relatively ideal environments. However, face key point detection applied on mobile terminals has strict real-time requirements, so large-scale neural network models cannot be adopted there. The prior art therefore usually adopts a lightweight network model and splits the work into two tasks: a face detection task, and a face key point detection task performed on its result. With a lightweight network model, face key point detection can meet the real-time requirement, but the accuracy requirement is difficult to meet.
Existing lightweight networks for face key point detection are biased toward the face detection task and do not consider extra information available to the face key point task; in particular, when the face is at a large pose angle, such as a side face, a raised head or a lowered head, their face key point detection accuracy is low.
Disclosure of Invention
The application provides a face key point detection model training method, device and equipment, which are used to solve the technical problem that existing face key point detection methods do not consider the extra information of the face key point task and, in particular, have low face key point detection accuracy when the face is at a large pose angle.
In view of this, a first aspect of the present application provides a method for training a face keypoint detection model, including:
acquiring a first output result of a P-Net network trained by a first face training sample, and acquiring a second face training sample based on the first output result and an original face training sample, wherein the second face training sample is marked with face angle information, and the first face training sample is obtained by cutting the original face training sample;
training an R-Net network through the second face training sample, and acquiring a first loss value of the R-Net network in the training process;
adjusting the first loss value according to the face angle information of the second face training sample, and updating the network parameters of the R-Net network according to the adjusted first loss value to obtain the trained R-Net network;
inputting the second face training sample into the trained R-Net network to obtain a second output result, and obtaining a third face training sample according to the second output result and the original face training sample, wherein the third face training sample is marked with face angle information;
training an O-Net network through the third face training sample, and acquiring a second loss value of the O-Net network in the training process;
adjusting the second loss value according to the face angle information of the third face training sample, and updating the network parameters of the O-Net network according to the adjusted second loss value to obtain the trained O-Net network;
and combining the trained P-Net network, the trained R-Net network and the trained O-Net network to obtain a face key point detection model, wherein the face key point detection model is used for face detection, face frame detection and face key point detection.
Optionally, the obtaining a first output result of a P-Net network trained by a first face training sample, and obtaining a second face training sample based on the first output result and an original face training sample, where the second face training sample is marked with face angle information, and the first face training sample is obtained by cutting the original face training sample, and includes:
inputting a first face training sample into a trained P-Net network for face frame prediction, and obtaining a face frame prediction result of the first face training sample output by the trained P-Net network, wherein the first face training sample is obtained by cutting an original face training sample;
and cutting the original face training sample according to the face frame prediction result of the first face training sample to obtain a second face training sample, and acquiring the face angle information of the second face training sample according to the face key point coordinates of the second face training sample.
Optionally, the inputting the second face training sample into the trained R-Net network to obtain a second output result, and obtaining a third face training sample according to the second output result and the original face training sample, where the third face training sample is labeled with face angle information, includes:
inputting a second face training sample into the trained R-Net network to perform face frame prediction, and acquiring a face frame prediction result of the second face training sample output by the trained R-Net network;
and cutting the original face training sample according to the face frame prediction result of the second face training sample to obtain a third face training sample, and acquiring the face angle information of the third face training sample according to the face key point coordinates of the third face training sample.
Optionally, the method further includes:
according to the face angle information of the second face training sample or the third face training sample, taking the second face training sample or the third face training sample with a face angle exceeding a preset angle range as a non-target training sample, and taking the second face training sample or the third face training sample with a face angle within the preset angle range as a target training sample;
performing data enhancement on the non-target training sample to obtain an enhanced training sample;
and fusing the enhanced training sample and the target training sample to obtain the preprocessed second face training sample or the preprocessed third face training sample.
Optionally, the adjusted first loss value or the adjusted second loss value is:

[formula given only as image BDA0003085407220000031 in the original]

where L is the first loss value or the second loss value before adjustment, L̂ is the adjusted first loss value or the adjusted second loss value, C = 1, 2, 3 indexes the angles, and θ1, θ2 and θ3 are respectively the pitch angle, the yaw angle and the roll angle in the face angle information.
The second aspect of the present application provides a face key point detection model training device, including:
a first acquisition unit, configured to acquire a first output result of a P-Net network trained by a first face training sample, and to acquire a second face training sample based on the first output result and an original face training sample, wherein the second face training sample is marked with face angle information, and the first face training sample is obtained by cutting the original face training sample;
the first training unit is used for training an R-Net network through the second face training sample and acquiring a first loss value of the R-Net network in the training process;
the first adjusting unit is used for adjusting the first loss value according to the face angle information of the second face training sample, and updating the network parameters of the R-Net network according to the adjusted first loss value to obtain the trained R-Net network;
a second obtaining unit, configured to input the second face training sample to the trained R-Net network to obtain a second output result, and obtain a third face training sample according to the second output result and the original face training sample, where the third face training sample is marked with face angle information;
the second training unit is used for training an O-Net network through the third face training sample and acquiring a second loss value of the O-Net network in the training process;
a second adjusting unit, configured to adjust the second loss value according to the face angle information of the third face training sample, and update the network parameter of the O-Net network according to the adjusted second loss value, so as to obtain the trained O-Net network;
and the combination unit is used for combining the trained P-Net network, the trained R-Net network and the trained O-Net network to obtain a face key point detection model, and the face key point detection model is used for face detection, face frame detection and face key point detection.
Optionally, the first obtaining unit is specifically configured to:
inputting a first face training sample into a trained P-Net network for face frame prediction, and obtaining a face frame prediction result of the first face training sample output by the trained P-Net network, wherein the first face training sample is obtained by cutting an original face training sample;
and cutting the original face training sample according to the face frame prediction result of the first face training sample to obtain a second face training sample, and acquiring the face angle information of the second face training sample according to the face key point coordinates of the second face training sample.
Optionally, the second obtaining unit is specifically configured to:
inputting a second face training sample into the trained R-Net network to perform face frame prediction, and acquiring a face frame prediction result of the second face training sample output by the trained R-Net network;
and cutting the original face training sample according to the face frame prediction result of the second face training sample to obtain a third face training sample, and acquiring the face angle information of the third face training sample according to the face key point coordinates of the third face training sample.
Optionally, the method further includes: a data enhancement unit to:
according to the face angle information of the second face training sample or the third face training sample, taking the second face training sample or the third face training sample with a face angle exceeding a preset angle range as a non-target training sample, and taking the second face training sample or the third face training sample with a face angle within the preset angle range as a target training sample;
performing data enhancement on the non-target training sample to obtain an enhanced training sample;
and fusing the enhanced training sample and the target training sample to obtain the preprocessed second face training sample or the preprocessed third face training sample.
A third aspect of the present application provides a training device for a face keypoint detection model, which includes a processor and a memory;
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute any one of the face keypoint detection model training methods according to instructions in the program code.
According to the technical scheme, the method has the following advantages:
the application provides a face key point detection model training method, which comprises the following steps: acquiring a first output result of a P-Net network trained by a first face training sample, and acquiring a second face training sample based on the first output result and an original face training sample, wherein the second face training sample is marked with face angle information, and the first face training sample is obtained by cutting the original face training sample; training an R-Net network through a second face training sample, and acquiring a first loss value of the R-Net network in the training process; adjusting a first loss value according to the face angle information of the second face training sample, and updating the network parameters of the R-Net network through the adjusted first loss value to obtain a trained R-Net network; inputting a second face training sample into the trained R-Net network to obtain a second output result, and acquiring a third face training sample according to the second output result and the original face training sample, wherein the third face training sample is marked with face angle information; training an O-Net network through a third face training sample, and acquiring a second loss value of the O-Net network in the training process; adjusting a second loss value according to the face angle information of the third face training sample, and updating the network parameters of the O-Net network according to the adjusted second loss value to obtain a trained O-Net network; and combining the trained P-Net network, the trained R-Net network and the trained O-Net network to obtain a face key point detection model, wherein the face key point detection model is used for face detection, face frame detection and face key point detection.
According to the method, when the R-Net network and the O-Net network are trained, the loss values of the face key point detection task under different poses are directionally adjusted through the additionally annotated face angle information of the face key points. This improves the detection accuracy of the face key point detection model in large-pose states and solves the technical problem that existing face key point detection methods do not consider the extra information of the face key point task and, in particular, have low face key point detection accuracy when the face is at a large pose angle.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic flow chart of a method for training a face key point detection model according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a face keypoint detection model training device according to an embodiment of the present application.
Detailed Description
The application provides a face key point detection model training method, device and equipment, which are used to solve the technical problem that existing face key point detection methods do not consider the extra information of the face key point task and, in particular, have low face key point detection accuracy when the face is at a large pose angle.
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
For easy understanding, please refer to fig. 1, an embodiment of a method for training a face keypoint detection model provided by the present application includes:
step 101, obtaining a first output result of a P-Net network trained by a first face training sample, and obtaining a second face training sample based on the first output result and an original face training sample, wherein the second face training sample is marked with face angle information, and the first face training sample is obtained by cutting the original face training sample.
The embodiment of the application considers that most existing face key point detection models are embedded into mobile terminals for face key point detection, which imposes high requirements on both real-time performance and accuracy. Therefore, the embodiment of the application preferably uses the lightweight MTCNN (Multi-task Cascaded Convolutional Networks) for face key point detection. The MTCNN is formed by cascading a P-Net network, an R-Net network and an O-Net network; its specific structure belongs to the prior art and is not described here again. To further improve the detection accuracy of the MTCNN in large-pose states without affecting the detection speed, the embodiment of the application does not change the network structure of the MTCNN but improves its training process.
First, the P-Net network in the MTCNN is trained. An original face training sample is acquired; the original training sample is annotated with the face frame position and the face key point coordinates. The original face training sample is cropped according to IoU (Intersection over Union) to obtain first face training samples, which comprise positive samples, negative samples and partial samples: a cropped face region (i.e., a first face training sample) whose IoU with the face frame in the original face training sample is greater than 0.65 is a positive sample, one whose IoU is less than 0.3 is a negative sample, and one whose IoU lies between 0.4 and 0.65 is a partial sample. After the first face training samples are obtained by cropping, they are input into the P-Net network for training; the P-Net network focuses on face detection, that is, foreground and background classification. When training the P-Net network, only face classification training and face frame detection training are performed; face key point detection training is not performed.
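For illustration, the IoU computation and the threshold-based sample labelling might look like the following sketch (the helper names are hypothetical; the thresholds follow the description above):

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def label_crop(crop_box, gt_box):
    """Assign the sample type of a cropped region from its IoU with the face frame."""
    v = iou(crop_box, gt_box)
    if v > 0.65:
        return "positive"
    if v < 0.3:
        return "negative"
    if 0.4 < v <= 0.65:
        return "partial"
    return "discard"  # IoU in [0.3, 0.4] falls outside all three categories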
Inputting the first face training sample into a trained P-Net network for face detection and face frame prediction, and acquiring a face frame prediction result of the first face training sample output by the trained P-Net network; and cutting the original face training sample according to the face frame prediction result of the first face training sample to obtain a second face training sample, and acquiring the face angle information of the second face training sample according to the face key point coordinates of the second face training sample.
The first face training sample is processed through the trained P-Net network, which detects whether a face exists in the first face sample and gives the corresponding face frame. The original face training sample is then cropped according to the face frame prediction result output by the trained P-Net network, cutting out the face region corresponding to the face frame to obtain the second face training sample; the second face training sample can likewise be divided into positive, negative and partial samples according to its IoU with the face frame. The face key point coordinates of the second face training sample can be obtained from the face key point coordinates of the original face training sample, and the three-dimensional face angles of the second face training sample can then be calculated from those key point coordinates. The three-dimensional face angles comprise the pitch angle (up-and-down rotation), the yaw angle (left-and-right rotation) and the roll angle (in-plane tilt), which together constitute the face angle information of the second face training sample.
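One common way to compute the three Euler angles from 2D key point coordinates is to solve a PnP problem against a generic 3D face model; the patent does not state its method, so the 3D reference points, the pinhole intrinsics and the PnP approach below are all assumptions:

```python
import cv2
import numpy as np

# Generic 3D reference positions (mm) for the five MTCNN-style key points:
# left eye, right eye, nose tip, left mouth corner, right mouth corner.
# These coordinates are an assumed model, not taken from the patent.
MODEL_3D = np.array([
    [-36.0,  39.5, -27.0],   # left eye
    [ 36.0,  39.5, -27.0],   # right eye
    [  0.0,   0.0,   0.0],   # nose tip
    [-28.9, -28.9, -24.1],   # left mouth corner
    [ 28.9, -28.9, -24.1],   # right mouth corner
], dtype=np.float64)

def face_angles(landmarks_2d, img_w, img_h):
    """Estimate (pitch, yaw, roll) in degrees from the five 2D key points."""
    cam = np.array([[img_w, 0.0, img_w / 2.0],   # rough pinhole intrinsics
                    [0.0, img_w, img_h / 2.0],
                    [0.0, 0.0, 1.0]])
    ok, rvec, _tvec = cv2.solvePnP(MODEL_3D, np.asarray(landmarks_2d, np.float64),
                                   cam, None, flags=cv2.SOLVEPNP_EPNP)
    rot, _ = cv2.Rodrigues(rvec)
    # ZYX Euler decomposition (one common convention; signs depend on axis setup)
    pitch = np.degrees(np.arctan2(rot[2, 1], rot[2, 2]))
    yaw   = np.degrees(np.arcsin(np.clip(-rot[2, 0], -1.0, 1.0)))
    roll  = np.degrees(np.arctan2(rot[1, 0], rot[0, 0]))
    return pitch, yaw, roll
```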
And 102, training the R-Net network through a second face training sample, and acquiring a first loss value of the R-Net network in the training process.
Further, before training the R-Net network, data enhancement can be performed on the second face training sample, specifically: according to the face angle information of the second face training sample, a second face training sample whose face angle exceeds a preset angle range is taken as a non-target training sample, and one whose face angle is within the preset angle range is taken as a target training sample; data enhancement is performed on the non-target training samples to obtain enhanced training samples; the enhanced training samples and the target training samples are fused to obtain the preprocessed second face training sample; the R-Net network is then trained with the preprocessed second face training sample, and the first loss value of the R-Net network is acquired during training.
The preset angle range bounds the face angle according to the normal turning limits of the human face. Taking the second face training samples whose face angle exceeds the preset angle range as non-target training samples and performing targeted data enhancement on them improves the training effect of the face key point detection model; the enhanced training samples obtained after data enhancement are then fused with the target training samples before the R-Net network is trained.
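A sketch of this split-and-fuse preprocessing is given below; the ±30° range, the oversampling factor and the brightness-jitter enhancement are illustrative assumptions, since the patent leaves all three unspecified:

```python
import random

ANGLE_RANGE = 30.0  # assumed preset range (degrees); the patent does not fix a value

def augment(sample):
    """Illustrative enhancement: brightness jitter. (A geometric augmentation
    such as mirroring would also need the key points and angles remapped.)"""
    out = dict(sample)
    out["image"] = sample["image"] * random.uniform(0.8, 1.2)
    return out

def preprocess(samples):
    """Split by pose, enhance the large-pose (non-target) samples, then fuse."""
    target, non_target = [], []
    for s in samples:
        pitch, yaw, roll = s["angles"]
        big_pose = max(abs(pitch), abs(yaw), abs(roll)) > ANGLE_RANGE
        (non_target if big_pose else target).append(s)
    enhanced = [augment(s) for s in non_target for _ in range(3)]  # oversample
    return target + enhanced  # the fused, preprocessed training set
```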
Inputting the preprocessed second face training sample into an R-Net network for multi-task training, including face classification training, face frame detection training and face key point detection training, to obtain a training result of the second face training sample; and then calculating a first loss value L through a loss function according to the training result and the label information of the second face training sample, wherein the loss function comprises a face classification loss function, a face frame loss function and a face key point loss function.
The face classification loss function is:

[formula given only as image BDA0003085407220000081 in the original]

where L^det is the face classification loss value, ŷ_m^det is the predicted face class of the second face training sample m, and y_m^det is the true face class of the second face training sample m. The partial samples do not participate in the calculation of the face classification loss value.
The face frame loss function is:

[formula given only as image BDA0003085407220000085 in the original]

where L^box is the face frame loss value, ŷ_m^box is the predicted face frame position of the second face training sample m, y_m^box is the true face frame position of the second face training sample m, (x1, y1) are the coordinates of the upper left corner of the face frame, and (x2, y2) are the coordinates of the lower right corner of the face frame.
The face key point loss function is:

[formula given only as image BDA0003085407220000092 in the original]

where L^landmark is the face key point loss value, y_{m,n}^landmark are the true coordinates of face key point n of the second face training sample m, ŷ_{m,n}^landmark are the predicted coordinates of face key point n of the second face training sample m, and K, ω and ε are network hyper-parameters whose values can be set flexibly according to the actual situation.
The first loss value L is calculated as:

L = w1·L^det + w2·L^box + w3·L^landmark

where w1, w2 and w3 are weight parameters.
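A minimal sketch of this weighted multi-task loss follows, assuming the standard MTCNN forms (cross-entropy for classification, squared error for the frame and key point regressions), since the per-task formulas appear only as images in the original; the weight values and the sample-type encoding are likewise illustrative:

```python
import torch
import torch.nn.functional as F

def total_loss(cls_logits, cls_target, box_pred, box_target,
               lmk_pred, lmk_target, sample_type,
               w1=1.0, w2=0.5, w3=1.0):
    """Weighted sum L = w1*L_det + w2*L_box + w3*L_landmark (weights illustrative)."""
    # Sample-type encoding (assumed): 0 = negative, 1 = positive, 2 = partial.
    cls_mask = sample_type != 2          # partial samples skip the classification loss
    l_det = F.cross_entropy(cls_logits[cls_mask], cls_target[cls_mask])
    box_mask = sample_type != 0          # negatives carry no face frame
    l_box = F.mse_loss(box_pred[box_mask], box_target[box_mask])
    lmk_mask = sample_type == 1          # key points regressed on positives only
    l_lmk = F.mse_loss(lmk_pred[lmk_mask], lmk_target[lmk_mask])
    return w1 * l_det + w2 * l_box + w3 * l_lmk
```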
And 103, adjusting a first loss value according to the face angle information of the second face training sample, and updating the network parameters of the R-Net network according to the adjusted first loss value to obtain the trained R-Net network.
After the first loss value of the R-Net network is obtained, the first loss value is adjusted through the face angle information of the second face training sample. The adjusted first loss value is:

[formula given only as image BDA0003085407220000095 in the original]

where L is the first loss value before adjustment, L̂ is the adjusted first loss value, C = 1, 2, 3 indexes the angles, and θ1, θ2 and θ3 are respectively the pitch angle, the yaw angle and the roll angle in the face angle information.
The network parameters of the R-Net network are updated with the adjusted first loss value until the R-Net network converges, giving the trained R-Net network.
And 104, inputting the second face training sample into the trained R-Net network to obtain a second output result, and acquiring a third face training sample according to the second output result and the original face training sample, wherein the third face training sample is marked with face angle information.
Inputting the second face training sample into the trained R-Net network to perform face frame prediction, and acquiring a face frame prediction result of the second face training sample output by the trained R-Net network; and cutting the original face training sample according to the face frame prediction result of the second face training sample to obtain a third face training sample, wherein the third face training sample can also define a positive sample, a negative sample and a partial sample according to Iou of the face frame.
And 105, training the O-Net network through the third face training sample, and acquiring a second loss value of the O-Net network in the training process.
Further, before the O-Net network is trained, data enhancement can be performed on a third face training sample, specifically: according to the face angle information of the third face training sample, taking the third face training sample with the face angle exceeding the preset angle range as a non-target training sample, and taking the third face training sample with the face angle within the preset angle range as a target training sample; performing data enhancement on the non-target training sample to obtain an enhanced training sample; fusing the enhanced training sample and the target training sample to obtain a preprocessed third face training sample; and training the O-Net network through the preprocessed third face training sample, and acquiring a second loss value of the O-Net network in the training process.
The preset angle range bounds the face angle according to the normal turning limits of the human face. Taking the third face training samples whose face angle exceeds the preset angle range as non-target training samples and performing targeted data enhancement on them improves the training effect of the face key point detection model; the enhanced training samples obtained after data enhancement are then fused with the target training samples before the O-Net network is trained.
Inputting the preprocessed third face training sample into an O-Net network for multi-task training, including face classification training, face frame detection training and face key point detection training, to obtain a training result of the third face training sample; and then calculating a second loss value through a loss function according to the training result and the label information of the third face training sample, wherein the loss function also comprises a face classification loss function, a face frame loss function and a face key point loss function.
And 106, adjusting a second loss value according to the face angle information of the third face training sample, and updating the network parameters of the O-Net network according to the adjusted second loss value to obtain the trained O-Net network.
After the second loss value of the O-Net network is obtained, the second loss value is adjusted through the face angle information of the third face training sample. The adjusted second loss value is:

[formula given only as image BDA0003085407220000101 in the original]

where L is the second loss value before adjustment, L̂ is the adjusted second loss value, C = 1, 2, 3 indexes the angles, and θ1, θ2 and θ3 are respectively the pitch angle, the yaw angle and the roll angle in the face angle information.
The network parameters of the O-Net network are updated with the adjusted second loss value until the O-Net network converges, giving the trained O-Net network.
And step 107, combining the trained P-Net network, the trained R-Net network and the trained O-Net network to obtain a face key point detection model, wherein the face key point detection model is used for face detection, face frame detection and face key point detection.
The trained P-Net network, the trained R-Net network and the trained O-Net network are cascaded in sequence to obtain the face key point detection model. The model can perform face detection on an image to be detected, determine whether a face exists in it, and, once a face is detected, output the face frame and the face key point coordinates corresponding to that face. Performing face detection, face frame detection and face key point detection through this model improves the detection accuracy for faces in large-pose states, while the lightweight MTCNN still meets the real-time requirement of the mobile terminal.
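At inference time the three networks run as a cascade. A schematic sketch follows; the network callables and their interfaces are hypothetical, and the NMS and image-cropping details of the standard MTCNN pipeline are folded into them:

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def detect_keypoints(image,
                     p_net: Callable, r_net: Callable,
                     o_net: Callable) -> List[Tuple[Box, list]]:
    """Run the cascade: P-Net proposes candidate face frames, R-Net rejects
    false positives and refines the frames, O-Net outputs the final frames
    and the face key point coordinates."""
    candidates = p_net(image)              # coarse face frame proposals
    refined = r_net(image, candidates)     # filtered and refined frames
    boxes, landmarks = o_net(image, refined)
    return list(zip(boxes, landmarks))
```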
In the embodiment of the application, when the R-Net network and the O-Net network are trained, the loss values of the face key point detection task under different poses are directionally adjusted through the additionally annotated face angle information of the face key points. This improves the detection accuracy of the face key point detection model in large-pose states and solves the technical problem that existing face key point detection methods do not consider the extra information of the face key point task and, in particular, have low face key point detection accuracy when the face is at a large pose angle.
The above is an embodiment of a training method for a face keypoint detection model provided by the present application, and the following is an embodiment of a training device for a face keypoint detection model provided by the present application.
Referring to fig. 2, an embodiment of the present application provides a face keypoint detection model training apparatus, including:
the first acquisition unit is used for acquiring a first output result of a P-Net network trained by a first face training sample, and acquiring a second face training sample based on the first output result and an original face training sample, wherein the second face training sample is marked with face angle information, and the first face training sample is obtained by cutting the original face training sample;
the first training unit is used for training the R-Net network through the second face training sample and acquiring a first loss value of the R-Net network in the training process;
the first adjusting unit is used for adjusting a first loss value according to the face angle information of the second face training sample, and updating the network parameters of the R-Net network according to the adjusted first loss value to obtain a trained R-Net network;
the second acquisition unit is used for inputting the second face training sample into the trained R-Net network to obtain a second output result, and acquiring a third face training sample according to the second output result and the original face training sample, wherein the third face training sample is marked with face angle information;
the second training unit is used for training the O-Net network through a third face training sample and acquiring a second loss value of the O-Net network in the training process;
the second adjusting unit is used for adjusting a second loss value according to the face angle information of the third face training sample, and updating the network parameters of the O-Net network according to the adjusted second loss value to obtain a trained O-Net network;
and the combination unit is used for combining the trained P-Net network, the trained R-Net network and the trained O-Net network to obtain a face key point detection model, and the face key point detection model is used for face detection, face frame detection and face key point detection.
As a further improvement, the first obtaining unit is specifically configured to:
inputting the first face training sample into a trained P-Net network to perform face frame prediction, and obtaining a face frame prediction result of the first face training sample output by the trained P-Net network, wherein the first face training sample is obtained by cutting an original face training sample;
and cutting the original face training sample according to the face frame prediction result of the first face training sample to obtain a second face training sample, and acquiring the face angle information of the second face training sample according to the face key point coordinates of the second face training sample.
As a further improvement, the second obtaining unit is specifically configured to:
inputting the second face training sample into the trained R-Net network to perform face frame prediction, and acquiring a face frame prediction result of the second face training sample output by the trained R-Net network;
and cutting the original face training sample according to the face frame prediction result of the second face training sample to obtain a third face training sample, and acquiring the face angle information of the third face training sample according to the face key point coordinates of the third face training sample.
As a further aspect, the apparatus further comprises: a data enhancement unit to:
according to the face angle information of the second face training sample or the third face training sample, taking the second face training sample or the third face training sample with the face angle exceeding a preset angle range as a non-target training sample, and taking the second face training sample or the third face training sample with the face angle within the preset angle range as a target training sample;
performing data enhancement on the non-target training sample to obtain an enhanced training sample;
and fusing the enhanced training sample and the target training sample to obtain a preprocessed second face training sample or a preprocessed third face training sample.
In the embodiment of the application, when the R-Net network and the O-Net network are trained, the loss values of the face key point detection task under different poses are directionally adjusted through the additionally annotated face angle information of the face key points. This improves the detection accuracy of the face key point detection model in large-pose states and solves the technical problem that existing face key point detection methods do not consider the extra information of the face key point task and, in particular, have low face key point detection accuracy when the face is at a large pose angle.
The embodiment of the application also provides a face key point detection model training device, which comprises a processor and a memory;
the memory is used for storing the program codes and transmitting the program codes to the processor;
the processor is used for executing the training method of the face key point detection model in the foregoing method embodiments according to instructions in the program code.
The embodiment of the application also provides a computer-readable storage medium, which is used for storing a program code, and the program code is used for executing the face key point detection model training method in the foregoing method embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for executing all or part of the steps of the method described in the embodiments of the present application through a computer device (which may be a personal computer, a server, or a network device). And the aforementioned storage medium includes: a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A face key point detection model training method is characterized by comprising the following steps:
acquiring a first output result of a P-Net network trained by a first face training sample, and acquiring a second face training sample based on the first output result and an original face training sample, wherein the second face training sample is marked with face angle information, and the first face training sample is obtained by cutting the original face training sample;
training an R-Net network through the second face training sample, and acquiring a first loss value of the R-Net network in the training process;
adjusting the first loss value according to the face angle information of the second face training sample, and updating the network parameters of the R-Net network according to the adjusted first loss value to obtain the trained R-Net network;
inputting the second face training sample into the trained R-Net network to obtain a second output result, and obtaining a third face training sample according to the second output result and the original face training sample, wherein the third face training sample is marked with face angle information;
training an O-Net network through the third face training sample, and acquiring a second loss value of the O-Net network in the training process;
adjusting the second loss value according to the face angle information of the third face training sample, and updating the network parameters of the O-Net network according to the adjusted second loss value to obtain the trained O-Net network;
and combining the trained P-Net network, the trained R-Net network and the trained O-Net network to obtain a face key point detection model, wherein the face key point detection model is used for face detection, face frame detection and face key point detection.
2. The training method of the face keypoint detection model according to claim 1, wherein said obtaining a first output result of a P-Net network trained by a first face training sample, and obtaining a second face training sample based on the first output result and an original face training sample, the second face training sample being labeled with face angle information, the first face training sample being obtained by clipping the original face training sample, comprises:
inputting a first face training sample into a trained P-Net network for face frame prediction, and obtaining a face frame prediction result of the first face training sample output by the trained P-Net network, wherein the first face training sample is obtained by cutting an original face training sample;
and cutting the original face training sample according to the face frame prediction result of the first face training sample to obtain a second face training sample, and acquiring the face angle information of the second face training sample according to the face key point coordinates of the second face training sample.
3. The method for training the face keypoint detection model according to claim 1, wherein the inputting the second face training sample to the trained R-Net network to obtain a second output result, and obtaining a third face training sample according to the second output result and the original face training sample, the third face training sample being labeled with face angle information comprises:
inputting a second face training sample into the trained R-Net network to perform face frame prediction, and acquiring a face frame prediction result of the second face training sample output by the trained R-Net network;
and cutting the original face training sample according to the face frame prediction result of the second face training sample to obtain a third face training sample, and acquiring the face angle information of the third face training sample according to the face key point coordinates of the third face training sample.
4. The training method of the face keypoint detection model according to any one of claims 1 to 3, characterized in that it further comprises:
according to the face angle information of the second face training sample or the third face training sample, taking the second face training sample or the third face training sample with a face angle exceeding a preset angle range as a non-target training sample, and taking the second face training sample or the third face training sample with a face angle within the preset angle range as a target training sample;
performing data enhancement on the non-target training sample to obtain an enhanced training sample;
and fusing the enhanced training sample and the target training sample to obtain the preprocessed second face training sample or the preprocessed third face training sample.
5. The training method of the face keypoint detection model according to claim 1, wherein the adjusted first loss value or the adjusted second loss value is:
[formula given only as image FDA0003085407210000021 in the original]

wherein L is the first loss value or the second loss value before adjustment, L̂ is the adjusted first loss value or the adjusted second loss value, C = 1, 2, 3 indexes the angles, and θ1, θ2 and θ3 are respectively the pitch angle, the yaw angle and the roll angle in the face angle information.
6. A face key point detection model training device, characterized by comprising:
a first acquisition unit, configured to acquire a first output result of a P-Net network trained by a first face training sample, and to acquire a second face training sample based on the first output result and an original face training sample, wherein the second face training sample is marked with face angle information, and the first face training sample is obtained by cutting the original face training sample;
the first training unit is used for training an R-Net network through the second face training sample and acquiring a first loss value of the R-Net network in the training process;
the first adjusting unit is used for adjusting the first loss value according to the face angle information of the second face training sample, and updating the network parameters of the R-Net network according to the adjusted first loss value to obtain the trained R-Net network;
a second obtaining unit, configured to input the second face training sample to the trained R-Net network to obtain a second output result, and obtain a third face training sample according to the second output result and the original face training sample, where the third face training sample is marked with face angle information;
the second training unit is used for training an O-Net network through the third face training sample and acquiring a second loss value of the O-Net network in the training process;
a second adjusting unit, configured to adjust the second loss value according to the face angle information of the third face training sample, and update the network parameter of the O-Net network according to the adjusted second loss value, so as to obtain the trained O-Net network;
and the combination unit is used for combining the trained P-Net network, the trained R-Net network and the trained O-Net network to obtain a face key point detection model, and the face key point detection model is used for face detection, face frame detection and face key point detection.
7. The device for training the face keypoint detection model according to claim 6, wherein the first obtaining unit is specifically configured to:
inputting a first face training sample into a trained P-Net network for face frame prediction, and obtaining a face frame prediction result of the first face training sample output by the trained P-Net network, wherein the first face training sample is obtained by cutting an original face training sample;
and cutting the original face training sample according to the face frame prediction result of the first face training sample to obtain a second face training sample, and acquiring the face angle information of the second face training sample according to the face key point coordinates of the second face training sample.
8. The device for training the face keypoint detection model according to claim 6, wherein the second obtaining unit is specifically configured to:
inputting a second face training sample into the trained R-Net network to perform face frame prediction, and acquiring a face frame prediction result of the second face training sample output by the trained R-Net network;
and cutting the original face training sample according to the face frame prediction result of the second face training sample to obtain a third face training sample, and acquiring the face angle information of the third face training sample according to the face key point coordinates of the third face training sample.
9. The training device for the face key point detection model according to any one of claims 6 to 8, further comprising: a data enhancement unit to:
according to the face angle information of the second face training sample or the third face training sample, taking the second face training sample or the third face training sample with a face angle exceeding a preset angle range as a non-target training sample, and taking the second face training sample or the third face training sample with a face angle within the preset angle range as a target training sample;
performing data enhancement on the non-target training sample to obtain an enhanced training sample;
and fusing the enhanced training sample and the target training sample to obtain the preprocessed second face training sample or the preprocessed third face training sample.
10. A human face key point detection model training device is characterized by comprising a processor and a memory;
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is used for executing the human face key point detection model training method of any one of claims 1 to 5 according to instructions in the program code.
CN202110579215.3A 2021-05-26 2021-05-26 Face key point detection model training method, device and equipment Pending CN113313010A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110579215.3A CN113313010A (en) 2021-05-26 2021-05-26 Face key point detection model training method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110579215.3A CN113313010A (en) 2021-05-26 2021-05-26 Face key point detection model training method, device and equipment

Publications (1)

Publication Number Publication Date
CN113313010A true CN113313010A (en) 2021-08-27

Family

ID=77375124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110579215.3A Pending CN113313010A (en) 2021-05-26 2021-05-26 Face key point detection model training method, device and equipment

Country Status (1)

Country Link
CN (1) CN113313010A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114596637A (en) * 2022-03-23 2022-06-07 北京百度网讯科技有限公司 Image sample data enhancement training method and device and electronic equipment

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657548A (en) * 2018-11-13 2019-04-19 深圳神目信息技术有限公司 A kind of method for detecting human face and system based on deep learning
CN109858466A (en) * 2019-03-01 2019-06-07 北京视甄智能科技有限公司 A kind of face critical point detection method and device based on convolutional neural networks
CN110188730A (en) * 2019-06-06 2019-08-30 山东大学 Face datection and alignment schemes based on MTCNN
CN111160269A (en) * 2019-12-30 2020-05-15 广东工业大学 Face key point detection method and device
CN111401257A (en) * 2020-03-17 2020-07-10 天津理工大学 Non-constraint condition face recognition method based on cosine loss
CN111738080A (en) * 2020-05-19 2020-10-02 云知声智能科技股份有限公司 Face detection and alignment method and device
CN111814573A (en) * 2020-06-12 2020-10-23 深圳禾思众成科技有限公司 Face information detection method and device, terminal equipment and storage medium
CN112633084A (en) * 2020-12-07 2021-04-09 深圳云天励飞技术股份有限公司 Face frame determination method and device, terminal equipment and storage medium
CN112651490A (en) * 2020-12-28 2021-04-13 深圳万兴软件有限公司 Training method and device for face key point detection model and readable storage medium
WO2021068323A1 (en) * 2019-10-12 2021-04-15 平安科技(深圳)有限公司 Multitask facial action recognition model training method, multitask facial action recognition method and apparatus, computer device, and storage medium
CN112801043A (en) * 2021-03-11 2021-05-14 河北工业大学 Real-time video face key point detection method based on deep learning

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657548A (en) * 2018-11-13 2019-04-19 深圳神目信息技术有限公司 A kind of method for detecting human face and system based on deep learning
CN109858466A (en) * 2019-03-01 2019-06-07 北京视甄智能科技有限公司 A kind of face critical point detection method and device based on convolutional neural networks
CN110188730A (en) * 2019-06-06 2019-08-30 山东大学 Face datection and alignment schemes based on MTCNN
WO2021068323A1 (en) * 2019-10-12 2021-04-15 平安科技(深圳)有限公司 Multitask facial action recognition model training method, multitask facial action recognition method and apparatus, computer device, and storage medium
CN111160269A (en) * 2019-12-30 2020-05-15 广东工业大学 Face key point detection method and device
CN111401257A (en) * 2020-03-17 2020-07-10 天津理工大学 Non-constraint condition face recognition method based on cosine loss
CN111738080A (en) * 2020-05-19 2020-10-02 云知声智能科技股份有限公司 Face detection and alignment method and device
CN111814573A (en) * 2020-06-12 2020-10-23 深圳禾思众成科技有限公司 Face information detection method and device, terminal equipment and storage medium
CN112633084A (en) * 2020-12-07 2021-04-09 深圳云天励飞技术股份有限公司 Face frame determination method and device, terminal equipment and storage medium
CN112651490A (en) * 2020-12-28 2021-04-13 深圳万兴软件有限公司 Training method and device for face key point detection model and readable storage medium
CN112801043A (en) * 2021-03-11 2021-05-14 河北工业大学 Real-time video face key point detection method based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Kaipeng Zhang et al.: "Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks", IEEE *
Xu Lihuai et al.: "High-precision lightweight face key point detection algorithm", Laser & Optoelectronics Progress *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114596637A (en) * 2022-03-23 2022-06-07 北京百度网讯科技有限公司 Image sample data enhancement training method and device and electronic equipment
CN114596637B (en) * 2022-03-23 2024-02-06 北京百度网讯科技有限公司 Image sample data enhancement training method and device and electronic equipment

Similar Documents

Publication Publication Date Title
WO2022161286A1 (en) Image detection method, model training method, device, medium, and program product
CN107169421B (en) Automobile driving scene target detection method based on deep convolutional neural network
CN111160269A (en) Face key point detection method and device
CN112084856A (en) Face posture detection method and device, terminal equipment and storage medium
CN108734078B (en) Image processing method, image processing apparatus, electronic device, storage medium, and program
JP2023541752A (en) Neural network model training methods, image retrieval methods, equipment and media
Santhalingam et al. Sign language recognition analysis using multimodal data
CN111209811B (en) Method and system for detecting eyeball attention position in real time
WO2023151237A1 (en) Face pose estimation method and apparatus, electronic device, and storage medium
CN110222572A (en) Tracking, device, electronic equipment and storage medium
WO2022148248A1 (en) Image processing model training method, image processing method and apparatus, electronic device, and computer program product
CN110569775A (en) Method, system, storage medium and electronic device for recognizing human body posture
CN110598647B (en) Head posture recognition method based on image recognition
CN113239849B (en) Body-building action quality assessment method, body-building action quality assessment system, terminal equipment and storage medium
CN113313010A (en) Face key point detection model training method, device and equipment
CN117372604A (en) 3D face model generation method, device, equipment and readable storage medium
CN110490165B (en) Dynamic gesture tracking method based on convolutional neural network
CN115205750B (en) Motion real-time counting method and system based on deep learning model
CN113807430B (en) Model training method, device, computer equipment and storage medium
Nappi et al. Introduction to the special section on biometric systems and applications
CN113362249A (en) Text image synthesis method and device, computer equipment and storage medium
CN113205530A (en) Shadow area processing method and device, computer readable medium and electronic equipment
CN112200169A (en) Method, apparatus, device and storage medium for training a model
WO2020237674A1 (en) Target tracking method and apparatus, and unmanned aerial vehicle
CN111985510B (en) Generative model training method, image generation device, medium, and terminal

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned
AD01 Patent right deemed abandoned

Effective date of abandoning: 20240112