WO2022089360A1 - Face detection neural network and training method, face detection method, and storage medium - Google Patents

Face detection neural network and training method, face detection method, and storage medium

Info

Publication number
WO2022089360A1
WO2022089360A1 (PCT/CN2021/126065)
Authority
WO
WIPO (PCT)
Prior art keywords
network
face
key points
sample
dimensional
Prior art date
Application number
PCT/CN2021/126065
Other languages
French (fr)
Chinese (zh)
Inventor
芦爱余
Original Assignee
广州虎牙科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州虎牙科技有限公司 filed Critical 广州虎牙科技有限公司
Publication of WO2022089360A1 publication Critical patent/WO2022089360A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/403D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Definitions

  • the present disclosure relates to the field of computer technologies, and in particular, to a face detection neural network and a training method thereof, a face detection method, a storage medium and a terminal device.
  • In recent years, with the rapid development of deep learning technology, face detection technology has also made great progress.
  • However, face detection based on deep learning requires a large amount of training data containing the three-dimensional coordinates of key points, and the key points marked in each piece of training data also need to be dense enough.
  • The collection of 3D face data and the annotation of key points are far more difficult than conventional image collection, and consume considerable manpower, material and financial resources.
  • To overcome these problems, the present disclosure provides a face detection neural network and a training method thereof, a face detection method, a storage medium and a terminal device.
  • According to a first aspect of the embodiments of the present disclosure, a neural network for face detection is provided, comprising: an encoding network for acquiring a sample including a face image and passing the sample on as its output result, wherein the face image in the sample is marked with the three-dimensional or two-dimensional coordinates of at least one first key point; a first decoding network for extracting, according to the output result of the encoding network, the predicted coordinate value of each first key point on the face image in the sample; a second decoding network for obtaining, according to the output result of the encoding network, the face pose change parameter corresponding to the face image in the sample; and a prediction network that is trained with the output result of the encoding network, the predicted coordinate values obtained by the first decoding network and the face pose change parameter obtained by the second decoding network as inputs, and that outputs a face model containing the three-dimensional or two-dimensional coordinate values of at least two second key points on the face image in the sample, wherein the number of second key points in the output face model is greater than the number of first key points marked in the sample.
  • According to a second aspect, a method for training a neural network for face detection is provided, comprising: acquiring a sample including a face image, wherein the face image in the sample is marked with the three-dimensional or two-dimensional coordinates of at least one first key point; feeding the sample to the neural network to obtain the predicted coordinate value of each first key point and the face pose change parameter corresponding to the face image; and training the neural network based on the obtained predicted coordinate values and the face pose change parameter, so that the trained neural network outputs a face model containing the three-dimensional or two-dimensional coordinates of a first number of second key points, wherein the first number is greater than the number of first key points marked in the sample.
  • According to a third aspect, a method for face detection is provided, implemented by the encoding network and the trained prediction network of the first aspect: the encoding network performs key point annotation on an acquired unlabeled face image and outputs the face image, marked with at least one first key point, to the prediction network as its processing result; the prediction network outputs a face model containing the three-dimensional or two-dimensional coordinates of at least two second key points on the face image, wherein the number of second key points is greater than the number of first key points.
  • According to a fourth aspect, a computer storage medium is provided, in which computer program code is stored; when the computer program code runs on a processor, the processor is caused to execute the steps of the method of the second or third aspect of the embodiments of the present disclosure.
  • According to a fifth aspect, a terminal device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method of the second or third aspect.
  • In the embodiments of the present disclosure, face image samples marked with the three-dimensional coordinates of key points are used as the input of a neural network for face detection, and the network is trained by obtaining the predicted coordinate values of the samples' key points and the face pose change parameters.
  • With this neural network, a face image sample marked with only a small number of key-point three-dimensional coordinates can be input, and the three-dimensional coordinates of a large number of the sample's key points can be obtained, which alleviates the difficulty and the heavy manpower, material and financial cost of collecting 3D face data and annotating key points.
  • FIG. 1 is a system architecture diagram to which a face detection neural network, a training method thereof, and a face detection method can be applied according to an exemplary embodiment of the present disclosure.
  • FIG. 2 is a structural diagram of a neural network for face detection according to an exemplary embodiment of the present disclosure.
  • FIG. 3 is a flowchart of another method for obtaining predicted coordinate values of key points according to an exemplary embodiment of the present disclosure.
  • FIG. 4 is a flowchart of a method for obtaining face pose change parameters of key points according to an exemplary embodiment of the present disclosure.
  • FIG. 5 is a flowchart of a method of training a prediction network according to an exemplary embodiment of the present disclosure.
  • FIG. 6 is a flowchart of a training method of a neural network for face detection according to an exemplary embodiment of the present disclosure.
  • FIG. 7 is a flowchart of a method of applying a prediction network according to an exemplary embodiment of the present disclosure.
  • Although the terms first, second, third, etc. may be used in the present disclosure to describe various pieces of information, such information should not be limited by these terms; the terms are only used to distinguish information of the same type from each other.
  • For example, the first information may also be referred to as the second information, and similarly the second information may also be referred to as the first information, without departing from the scope of the present disclosure.
  • Depending on the context, the word "if" as used herein can be interpreted as "at the time of", "when" or "in response to determining".
  • Face detection refers to detecting the location of key points of the face given a face image.
  • the key points typically include points at the eyebrows, eyes, nose, mouth, face contours, and the like.
  • Face detection technology is the basic technology for application scenarios such as face dressing, beauty makeup, face special effects, and face AR (Augmented Reality).
  • Taking face special effects as an example: in some live-streaming video applications there is demand for 3D animation special effects, such as adding 3D rabbit ears or a pig mask to the faces in the video.
  • Such 3D animation special effects depend on accurate positioning of the face key points; that is, the accurate three-dimensional coordinates of the face key points must be obtainable.
  • In one approach, a given face image is input to a convolutional neural network, and the coordinates of the face key points are regressed directly by the convolutional neural network.
  • In another approach, a given face image is input to a convolutional neural network so as to regress a feature map corresponding to the face key points, and the positions representing the face key points are determined from the feature map.
  • As noted above, face detection technology based on deep learning needs a large amount of training data marked with the three-dimensional coordinates of key points to be fed into the network to be trained, and the key points marked in each piece of training data also need to be dense enough (that is, dense three-dimensional data is required).
  • However, 3D face data with key point annotations is scarce and costly, because the collection of 3D face data and the annotation of key points are far more difficult than conventional image collection and consume considerable manpower, material and financial resources.
  • In view of this, the present disclosure proposes a face detection neural network, a training method thereof, a face detection method, a storage medium and a terminal device.
  • FIG. 1 shows a schematic diagram of an exemplary system architecture of a face detection neural network, a training method thereof, and a face detection method to which embodiments of the present disclosure can be applied.
  • the system architecture 1000 may include one or more of terminal devices 1001 , 1002 , and 1003 , as well as a network 1004 and a server 1005 .
  • the network 1004 is the medium used to provide the communication link between the terminal devices 1001 , 1002 , 1003 and the server 1005 .
  • the network 1004 may include various connection types, such as wired connections, wireless communication links, or fiber optic cables, among others.
  • the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks and servers according to implementation needs.
  • the server 1005 may be a server cluster composed of multiple servers, or the like.
  • the user can use the terminal devices 1001, 1002, 1003 to interact with the server 1005 through the network 1004 to receive or send messages and the like.
  • the terminal devices 1001, 1002, 1003 may be various electronic devices with display screens, including but not limited to smart phones, tablet computers, portable computers, desktop computers, and the like.
  • the server 1005 may be a server that provides various services.
  • For example, the server 1005 can provide the training of the neural network for face detection of the present disclosure: the server 1005 obtains face images from the terminal devices 1001, 1002 and 1003, and these can be face images marked with the three-dimensional coordinates of a small number of key points (i.e., sparse 3D data).
  • the server trains the neural network for face detection based on the acquired face images. Based on the trained neural network, the server can output the three-dimensional coordinates of a large number of key points in the face image (ie, dense three-dimensional data).
  • The server 1005 can also execute the face detection method of the present disclosure: the server 1005 is pre-configured with all or part of the trained neural network for face detection, obtains face images without key point annotations from the terminal devices 1001, 1002 and 1003, and obtains the three-dimensional coordinates of the key points in the face images through all or part of the trained neural network for face detection.
  • the terminal devices 1001, 1002, and 1003 may be terminal devices that provide various services.
  • the terminal device may be a terminal device with an image acquisition unit, and therefore, the terminal device may acquire face images for use by the terminal device itself or to send to other devices.
  • In this way, the terminal device can obtain, from an input face image without key point annotations, the 3D coordinates of the key points in the face image.
  • the training method of the neural network for face detection disclosed in the embodiments of the present disclosure can also be executed by the terminal devices 1001 , 1002 , and 1003 .
  • All or part of the neural network trained by the terminal devices 1001 , 1002 and 1003 can directly obtain the three-dimensional coordinates of the key points of the face image from the input face image without labels.
  • Alternatively, all or part of the neural networks trained by the terminal devices 1001, 1002 and 1003 can be deployed on the server 1005 in advance, and the server 1005 can then obtain the three-dimensional coordinates of the face key points from input face images without labels.
  • FIG. 2 is a structural diagram of a neural network for face detection according to an exemplary embodiment of the present disclosure.
  • the neural network for face detection includes an encoding network 210 , a first decoding network 220 , a second decoding network 230 and a prediction network 240 .
  • The encoding network 210 is used to obtain face image samples and send the samples, as its output result, to the first decoding network 220, the second decoding network 230 and the prediction network 240, the samples including face images marked with the three-dimensional coordinates of at least one first key point. The first decoding network 220 extracts the predicted coordinate values of the first key points in the sample according to the output result of the encoding network 210. The second decoding network 230 obtains, according to the output result of the encoding network 210, the face pose change parameter corresponding to the predicted coordinate values of the first key points in the sample. The prediction network 240 is trained with the output result of the encoding network 210, the predicted coordinate values obtained by the first decoding network 220 and the face pose change parameter obtained by the second decoding network 230 as inputs, so that the trained prediction network 240 outputs a face model containing the three-dimensional coordinates of at least two second key points in the sample, the number of second key points output by the prediction network 240 being greater than the number of first key points marked in the sample.
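As an illustration of this four-module layout, the sketch below wires an encoder, two decoders and a prediction network together in PyTorch. This is a minimal sketch under stated assumptions, not the patent's implementation: the class name, layer sizes, key-point counts (68 sparse, 1220 dense) and the way the predictor consumes its three inputs are all hypothetical.

```python
import torch
import torch.nn as nn

class FaceDetectionNet(nn.Module):
    """Hypothetical sketch of the encoder / two-decoder / predictor layout."""

    def __init__(self, feat_dim=256, n_sparse=68, n_dense=1220, n_pose=80):
        super().__init__()
        self.n_sparse, self.n_dense = n_sparse, n_dense
        # Encoding network 210: shared feature from the input face image.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # First decoding network 220: predicted (x, y, z) per first key point.
        self.decoder_kpts = nn.Linear(feat_dim, n_sparse * 3)
        # Second decoding network 230: face pose change parameters (shape + expression).
        self.decoder_pose = nn.Linear(feat_dim, n_pose)
        # Prediction network 240: dense second-key-point coordinates.
        self.predictor = nn.Linear(feat_dim + n_sparse * 3 + n_pose, n_dense * 3)

    def forward(self, image):
        feat = self.encoder(image)
        sparse = self.decoder_kpts(feat)    # predicted coordinate values
        pose = self.decoder_pose(feat)      # face pose change parameters
        dense = self.predictor(torch.cat([feat, sparse, pose], dim=1))
        return sparse.view(-1, self.n_sparse, 3), pose, dense.view(-1, self.n_dense, 3)
```

Note that n_dense exceeds n_sparse, matching the requirement that the second key points outnumber the first.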
  • the training data of the neural network for face detection described in the present disclosure may be a face image sample marked with three-dimensional coordinates of key points.
  • the face image samples marked with the three-dimensional coordinates of key points may come from a public data set, or may be face images collected by a user and marked with three-dimensional coordinates of key points.
  • the present disclosure does not limit the source of training data.
  • the marked key points in the face image sample may be points at the eyes, nose, mouth, face contour, or points at other positions on the face, which are not limited in this disclosure.
  • In some embodiments, the activation values (the mask in the loss functions below) used in the neural network for face detection for the three-dimensional coordinates of key points are positive numbers; that is, the activation values corresponding to the X-axis, Y-axis and Z-axis are all positive.
  • For example, if the three-dimensional coordinates of a key point are (x, y, z), the activation values for the key point may be (1, 1, 1), indicating that all three coordinates of the key point are to be learned by the neural network.
  • The activation values corresponding to the X-axis, Y-axis and Z-axis may also differ and may be other positive numbers; for example, for a key point with coordinates (x, y, z), the activation values may be (2, 2, 10).
  • Activation values greater than 0 on all three axes indicate that the neural network needs to learn the three-dimensional coordinates of the key point; an activation value on the Z-axis larger than those on the X-axis and Y-axis indicates that learning the Z-axis coordinate occupies a relatively large weight in the training of the neural network for face detection.
  • In some embodiments, the training data of the neural network for face detection may also be mixed data, i.e., face image samples marked with the three-dimensional coordinates of key points together with face image samples marked with only the two-dimensional coordinates of key points.
  • For samples marked only with two-dimensional coordinates, the Z-axis coordinate of a key point can be set to a negative number, for example -1, indicating that the Z-axis coordinate of the key point does not exist.
  • In this case, the activation values corresponding to the X-axis, Y-axis and Z-axis of the key points can be the same as described above; alternatively, the activation values for the X-axis and Y-axis can be as described above while the activation value for the Z-axis is set to 0, indicating that the Z-axis coordinate of the key point is not something the neural network for face detection needs to learn.
  • Although face image samples marked with only the two-dimensional coordinates of key points have no Z-axis coordinate values, their X-axis and Y-axis coordinate values still contain useful information. Therefore, when samples marked with three-dimensional coordinates are insufficient, training on mixed data lets the network learn more useful information about the face key points and still achieve a good training effect, as sketched in the example below.
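A minimal sketch of how such mixed labels and per-axis activation values could be encoded. The helper name and the z_weight default are assumptions of this sketch; only the sign conventions come from the text above (positive weights for coordinates to learn, Z = -1 and weight 0 for samples without Z annotations).

```python
import numpy as np

def make_label_and_mask(kpts, has_z, z_weight=1.0):
    """Build (x, y, z) labels and per-axis activation values for one sample.

    kpts: (N, 3) array for 3D-annotated samples, or (N, 2) for 2D-only ones.
    has_z: whether the sample carries Z-axis annotations.
    """
    n = kpts.shape[0]
    label = np.zeros((n, 3), dtype=np.float32)
    mask = np.ones((n, 3), dtype=np.float32)   # positive: coordinate is learned
    label[:, :2] = kpts[:, :2]
    if has_z:
        label[:, 2] = kpts[:, 2]
        mask[:, 2] = z_weight                  # e.g. >1 to weight Z more heavily
    else:
        label[:, 2] = -1.0                     # negative Z marks "no Z annotation"
        mask[:, 2] = 0.0                       # Z is excluded from learning
    return label, mask
```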
  • the encoding network 210 of the neural network for face detection is used to perform preliminary processing after obtaining face image samples to obtain samples marked with key points, and send the samples to the Other parts of the neural network for face detection except the encoding network 210.
  • the encoding network 210 in the neural network for face detection may be a lightweight neural network.
  • the encoding network 210 may be a network of the Mobilenet series designed by Google for deep learning applications on mobile terminals and embedded terminals, including MobilenetV1, MobilenetV2 and MobilenetV3.
  • the encoding network 210 can also be other lightweight neural networks, such as SqueezeNet, ShuffleNet, Xception and other networks.
  • Using a lightweight neural network as the encoding network 210 reduces the size of the neural network for face detection, improves its running speed, and makes it easier to port part or all of the face detection neural network to mobile terminals, effectively expanding its application scope and application scenarios.
  • extracting the predicted coordinate value of the first key point in the sample according to the output result of the encoding network 210 can be implemented by the first decoding network 220 in the neural network for face detection.
  • FIG. 3 is a schematic diagram of an embodiment of realizing extraction of predicted coordinate values of key points of a face image through the first decoding network 220 in a neural network for face detection.
  • As shown in FIG. 3, the encoding network 210 obtains samples with key point annotations after processing the face image samples.
  • the output of the encoding network 210 is sent to the first decoding network 220 .
  • the first decoding network 220 can extract the predicted coordinate value of the key point according to the preset mapping relationship between the UV coordinate map and the three-dimensional coordinate of the key point.
  • UV here refers to U,V texture-map coordinates: U is the horizontal coordinate and V is the vertical coordinate.
  • the UV coordinate map is composed of pixel values of three RGB channels, and each pixel channel represents the coordinates of the X-axis, Y-axis and Z-axis of the three-dimensional face key points respectively.
  • Each point on the UV coordinate map corresponds to a point on the surface of the three-dimensional face model; that is, each point on the three-dimensional face model has a unique counterpart on the UV coordinate map.
  • the coordinates of the key points marked in the face image sample can be restored from the UV coordinate map.
  • For samples marked with three-dimensional coordinates, the predicted coordinate values of the key points are valid three-dimensional coordinates.
  • When the face image samples passed from the encoding network 210 to the first decoding network 220 include both images marked with three-dimensional key-point coordinates and images marked with two-dimensional key-point coordinates, the predicted coordinate values corresponding to the samples marked with three-dimensional coordinates are valid three-dimensional coordinates, while those corresponding to the samples marked with two-dimensional coordinates are pseudo three-dimensional coordinates lacking a Z-axis coordinate value. The lookup sketched below shows how coordinates can be read back from such a map.
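Since each key point owns a fixed, unique position on the UV coordinate map, recovering its predicted coordinates amounts to an indexed lookup. The sketch below assumes the map is decoded as an H x W x 3 array whose channels hold X, Y and Z; the uv_index table mapping key points to map pixels, and the example pixel positions, are hypothetical.

```python
import numpy as np

def keypoints_from_uv_map(uv_map, uv_index):
    """Recover 3D key-point coordinates from a UV position map.

    uv_map:   (H, W, 3) array; channels hold the X, Y, Z coordinates.
    uv_index: (N, 2) integer array of (row, col) map positions, one per key point.
    """
    rows, cols = uv_index[:, 0], uv_index[:, 1]
    return uv_map[rows, cols, :]               # (N, 3) predicted coordinate values

# Usage: with a 256x256 map and a fixed index table, the lookup is O(N).
uv_map = np.random.rand(256, 256, 3).astype(np.float32)    # stand-in for a decoded map
uv_index = np.array([[120, 100], [120, 156], [150, 128]])  # hypothetical eye/nose pixels
print(keypoints_from_uv_map(uv_map, uv_index).shape)        # (3, 3)
```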
  • the encoding network 210 and the first decoding network 220 in the above embodiments may be pre-trained.
  • The first decoding network 220 may be a standard neural network (SNN), a convolutional neural network (CNN), a recurrent neural network (RNN) or another type of network.
  • the pre-training of the encoding network 210 and the first decoding network 220 can be implemented by any technique well known to those skilled in the art, so that the first decoding network 220 can obtain the predicted coordinates of the key points according to the face image samples marked with the coordinates of the key points. value, which is not repeated here.
  • the parameters of the encoding network 210 and the first decoding network 220 may be determined through training.
  • the encoding network 210 and the first decoding network 220 as described above, may be networks in various forms, which will not be repeated here.
  • an exemplary embodiment in which the parameters of the encoding network 210 and the first decoding network 220 are determined through training is given.
  • the encoding network 210 obtains face image samples marked with key points and uses the samples as output results, and the first decoding network 220 extracts predicted coordinate values of the marked key points in the samples according to the output results of the encoding network 210 .
  • The parameters of the encoding network 210 and the first decoding network 220 are jointly trained based on the first loss function until the first loss function satisfies the preset training conditions; the network parameters at that point are the trained parameters of the encoding network 210 and the first decoding network 220.
  • the preset training condition may be that the value of the first loss function is lower than the threshold, or the first loss function converges, or other training conditions, which are not limited in the present disclosure.
  • the first loss function may be a loss function based on the minimum absolute value deviation, may also be a loss function based on the minimum mean square value deviation, or may be other forms of loss functions, which are not limited in the present disclosure.
  • the first loss function may be the deviation between the predicted coordinate value of the key point and the labeled coordinate value of the key point, or may be other measures representing the accuracy of the predicted coordinate value of the key point.
  • An exemplary first loss function may be:

    Loss_1 = | predict − label |

  • where Loss_1 is the first loss function, predict is the predicted coordinate value of the key point output by the first decoding network 220, and label is the labeled coordinate value of the key point in the face image sample input to the encoding network 210.
  • The above first loss function is only an exemplary embodiment; those skilled in the art can also design other first loss functions to train the encoding network 210 and the first decoding network 220, so that the predicted coordinate values of the key points output by the first decoding network 220 are more accurate.
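As a sketch of the minimum-absolute-deviation form reconstructed above, the first loss can be written as a masked L1 term. Carrying the per-axis activation values through (so that mixed 2D/3D batches drop the missing Z axis) is an assumption of this sketch, since the text defines the mask explicitly only for the second loss function.

```python
import torch

def loss1(predict, label, mask=None):
    """L1 deviation between predicted and labeled key-point coordinates.

    predict, label: (B, N, 3) tensors; mask, if given, holds per-axis
    activation values so that axes with weight 0 drop out of the loss.
    """
    diff = predict - label
    if mask is not None:
        diff = diff * mask
    return diff.abs().mean()
```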
  • obtaining the face pose change parameters corresponding to the key points according to the output result of the encoding network 210 can be realized by the second decoding network 230 in the neural network for face detection.
  • the face pose change parameter is used to characterize the change of the face pose relative to the standard face model.
  • The face pose change parameters may include shape parameters and expression parameters, and may also include texture parameters and the like. Since human faces have many commonalities (for example, a fixed number of eyes, mouths, noses and ears, with the relative positions of the parts unchanged and each part in a definite topological relationship), a parametrically represented standard 3D face model can be established. The face pose change parameters are then obtained from the change of the predicted coordinate values of the key points relative to the standard three-dimensional face model.
  • the standard three-dimensional face model of the parametric representation may be a three-dimensional face model of parametric representation obtained based on a 3DMM (3D Morphable Models, three-dimensional deformable face model) method.
  • S_model = S_mean + Σ_i α_i · S_i   (2);  T_model = T_mean + Σ_i β_i · T_i   (3)
  • where S_mean and T_mean represent the average shape and average texture of the face, and the discriminative characteristics of each individual face are reflected in the linear combinations of the sets of orthogonal bases S_i and T_i on the right-hand side; the bases are the eigenvectors of the covariance matrix, sorted in descending order of eigenvalue.
  • By collecting the head scans of a plurality of 3D human faces, the principal components representing the shape and texture information of the face, i.e., the eigenvectors S_i and T_i, can be obtained using the Principal Components Analysis (PCA) method; different coefficients α_i and β_i then characterize 3D faces with different shapes and textures.
  • M = BFM_mean + shape · shape_std + exp · exp_std   (4)
  • where BFM_mean is the average face obtained from the BFM database, shape_std and exp_std are the eigenvector bases of face shape and facial expression, and shape and exp are the shape parameters and expression parameters, respectively.
  • the BFM database is a technology well known to those skilled in the art, and will not be repeated here.
  • In formula (4), by determining the shape parameter shape and the expression parameter exp, the standard three-dimensional face model can be transformed to obtain a three-dimensional face model with pose (referring to changes in shape and expression).
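Formula (4) is a linear morphable-model evaluation, sketched below under the assumption that the BFM mean and the shape/expression bases are available as flat arrays. The helper name and the array sizes are illustrative only; the real BFM bases have their own dimensions.

```python
import numpy as np

def posed_face_model(bfm_mean, shape_basis, exp_basis, shape, exp):
    """Evaluate formula (4): M = BFM_mean + shape_basis @ shape + exp_basis @ exp.

    bfm_mean:    (3V,) mean face vertices, flattened (x, y, z per vertex).
    shape_basis: (3V, Ks) shape eigenvectors (shape_std).
    exp_basis:   (3V, Ke) expression eigenvectors (exp_std).
    shape, exp:  (Ks,), (Ke,) coefficients from the second decoding network.
    """
    return bfm_mean + shape_basis @ shape + exp_basis @ exp

# Illustrative sizes only; the actual BFM database uses its own dimensions.
V, Ks, Ke = 1000, 40, 10
M = posed_face_model(np.zeros(3 * V),
                     np.random.rand(3 * V, Ks) * 0.01,
                     np.random.rand(3 * V, Ke) * 0.01,
                     np.random.rand(Ks), np.random.rand(Ke))
print(M.reshape(V, 3).shape)   # (1000, 3) posed vertices
```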
  • the parametrically represented standard three-dimensional face model may also be a parametrically represented three-dimensional face model established by other methods.
  • For example, a parametrically represented 3D face model may be established, using statistical methods, from actually collected 3D face data or from 3D face data obtained by other means.
  • the face pose change parameters can be obtained through a pre-trained second decoding network 230.
  • The encoding network 210 sends the samples with key point annotations, as its output result, to the second decoding network 230, and the second decoding network 230 obtains the face pose change parameter based on the output result of the encoding network 210.
  • The second decoding network 230 may be a standard neural network (SNN), a convolutional neural network (CNN), a recurrent neural network (RNN) or the like.
  • the pre-training of the second decoding network 230 can be implemented by using techniques well known to those skilled in the art, and details are not described here.
  • the second decoding network 230 may be determined through training.
  • The training process of the second decoding network 230 includes: training the second decoding network 230 based on the parameters of the encoding network 210 and the second loss function, until the second loss function satisfies the preset training conditions, thereby determining the parameters of the second decoding network 230.
  • the parameters of the encoding network 210 may be determined by the training method described above, or may be trained by other methods, which are not limited in the present disclosure.
  • The preset training condition may be that the value of the second loss function is lower than a threshold, or that the second loss function converges, or other training conditions, which are not limited in the present disclosure.
  • the form of the second loss function may be a loss function based on the minimum absolute value deviation, a loss function based on the minimum mean square value deviation, or a loss function in other forms, which is not limited in the present disclosure .
  • the second loss function may be the deviation between the predicted coordinate value of the key point and the first coordinate value of the key point, or may be another measure representing the accuracy of the face change parameter output by the second decoding network.
  • An exemplary second loss function can be:

    Loss_2 = | mask · ( predict − { (BFM_mean + shape · shape_std + exp · exp_std) · affine } ) |   (5)

  • where Loss_2 is the second loss function, predict is the predicted coordinate value of the key point output by the first decoding network 220, mask is the activation value corresponding to the three-dimensional coordinates of the key point, and affine represents the transformation matrix of the standard three-dimensional face model relative to the predicted coordinate values.
  • FIG. 4 is a schematic diagram of realizing the acquisition of face pose change parameters by training the second decoding network 230.
  • FIG. 4 schematically shows the influence of the second loss function on training, which is explained in combination with the expression of the second loss function in formula (5).
  • The bracketed { } expression in formula (5) represents the first coordinates of the key points of the face image sample, which can be obtained as follows: from the predicted coordinate values of the key points obtained by the first decoding network 220 and the standard three-dimensional face model, the transformation matrix of the standard three-dimensional face model relative to the predicted coordinate values is calculated, the transformation matrix representing the rotation, translation, scaling and the like required to transform the standard three-dimensional face model into the three-dimensional face model corresponding to the predicted coordinate values; the standard three-dimensional face model is then transformed according to the transformation matrix to obtain the first coordinates of the key points.
  • During training, the parameters of the second decoding network 230 are adjusted based on the transformation matrix, so that the face pose change parameters output by the second decoding network 230 change, causing the standard three-dimensional face model to change in pose and thereby match the predicted coordinate values of the key points output by the first decoding network 220 as closely as possible.
  • Taking the parametrically represented standard three-dimensional face model in formula (4) as an example: as the parameters of the second decoding network 230 change, the shape parameter shape and the expression parameter exp in formula (4) change, and the pose of the parametrically represented standard 3D face model changes accordingly, with the parametric expression:

    M_2 = { (BFM_mean + shape · shape_std + exp · exp_std) · affine }
  • The second decoding network 230 is trained based on the second loss function; that is, the parameters of the second decoding network 230 are continuously adjusted so that the face pose change parameters it outputs transform the standard three-dimensional face model, until the face model transformed from the standard three-dimensional face model is aligned with the predicted coordinate values of the key points output by the first decoding network 220 (that is, the deviation is small enough), yielding a three-dimensional face model whose pose is the same as or similar to that of the predicted key-point coordinates; the second decoding network 230 then completes its training. The result output by the trained second decoding network 230 is thus the face pose change parameter that characterizes the expression and/or shape change, relative to the standard face model, of the face image corresponding to the predicted coordinate values of the key points.
  • Through the trained second decoding network 230, a three-dimensional face model corresponding to the predicted coordinate values of the key points of the input sample is in essence obtained. Based on that three-dimensional face model, the three-dimensional coordinates of any point on the model can be obtained, and hence the three-dimensional coordinates of a large number of the sample's key points.
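The transformation matrix (affine) used above is a similarity transform (rotation, translation, scale) aligning the standard model's key points with the predicted coordinates. One standard way to estimate such a transform is the Umeyama/Procrustes solution sketched below; the patent does not prescribe the estimation method, so this is an illustrative choice.

```python
import numpy as np

def similarity_transform(src, dst):
    """Estimate scale s, rotation R, translation t with dst ≈ s * src @ R.T + t.

    src: (N, 3) standard-model key points; dst: (N, 3) predicted coordinates.
    Classic Umeyama solution via SVD of the cross-covariance matrix.
    """
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))       # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / src_c.var(0).sum()
    t = mu_d - s * (R @ mu_s)
    return s, R, t
```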
  • The parameters of the prediction network 240 are determined through training based on the determined parameters of the encoding network 210 and the third loss function: the parameters of the prediction network 240 are trained against the third loss function until the third loss function satisfies the preset training conditions, at which point the network parameters are the trained parameters of the prediction network 240, i.e., the prediction network 240 completes its training.
  • the training of the prediction network 240 is performed based on the third loss function, and the preset training condition may be that the value of the third loss function is lower than the threshold, or the third loss function converges, or other training conditions, which are not limited in the present disclosure.
  • the form of the third loss function may be a loss function based on minimum absolute deviation, a loss function based on minimum mean square deviation, or a loss function in other forms, which is not limited in the present disclosure .
  • The content of the third loss function may be the deviation between the three-dimensional coordinates of the key points (i.e., the second key points) output by the prediction network 240 and the first coordinates, or may be another measure representing the accuracy of the three-dimensional coordinates of the key points output by the prediction network.
  • An exemplary third loss function can be:

    Loss_3 = | mask · ( predict_pts − { (BFM_mean + shape · shape_std + exp · exp_std) · affine } ) |

  • where Loss_3 is the third loss function, predict_pts is the three-dimensional coordinates of the key points output by the prediction network, and the meanings of the remaining parameters are as described above; the bracketed { } expression, as in formula (5), represents the first coordinates of the key points and is obtained in the same way as described above.
  • After training, the following can be achieved: an unmarked face image is input to the encoding network 210, the encoding network 210 obtains a face image marked with a small number of key points, and the face image with the key point annotations is sent to the prediction network 240 as the processing result, so that the prediction network 240 can output the three-dimensional coordinates of a large number of key points in the face image.
  • Regularization terms can also be added to the first, second and third loss functions to improve the overall performance of the trained model.
  • The neural network for face detection includes an encoding network, a first decoding network, a second decoding network and a prediction network; some of these networks may be trained networks while others may be untrained networks.
  • the encoding network, the first decoding network, the second decoding network and the prediction network included in the neural network for face detection may all be untrained networks.
  • In this case, the following method can be used: fix the initial parameters of the second decoding network and the prediction network, and jointly train the parameters of the encoding network and the first decoding network based on the first loss function; fix the trained parameters of the encoding network, and train the parameters of the second decoding network based on the second loss function; fix the trained parameters of the encoding network, and train the parameters of the prediction network based on the third loss function, as sketched in the example below.
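The staged schedule above (freeze some modules, train the rest against each loss in turn) can be expressed by toggling requires_grad. A minimal sketch, reusing the hypothetical FaceDetectionNet from the earlier sketch; the stage driver, optimizer choice and learning rate are assumptions, and loss_fn stands in for the first, second or third loss of the corresponding stage.

```python
import itertools
import torch

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def run_stage(net, trainable, loss_fn, loader, lr=1e-4):
    """One stage: freeze all four modules, unfreeze `trainable`, optimize `loss_fn`."""
    all_modules = (net.encoder, net.decoder_kpts, net.decoder_pose, net.predictor)
    for m in all_modules:
        set_trainable(m, m in trainable)
    params = itertools.chain(*(m.parameters() for m in trainable))
    opt = torch.optim.Adam(params, lr=lr)
    for images, targets in loader:
        opt.zero_grad()
        loss_fn(net, images, targets).backward()
        opt.step()

# Stage 1: encoder + first decoder vs. Loss_1; stage 2: second decoder vs. Loss_2;
# stage 3: prediction network vs. Loss_3 (loss functions as reconstructed above).
```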
  • the face image samples marked with the three-dimensional coordinates of the key points are used as the input of the neural network for face detection, and the neural network is trained by obtaining the predicted coordinate values of the key points of the samples and the change parameters of the face posture.
  • Using the neural network for face detection, a three-dimensional face model corresponding to the predicted coordinate values of the key points of the input sample can be obtained; it then becomes possible to input a face image sample marked with only a small number of key-point three-dimensional coordinates and obtain the three-dimensional coordinates of a large number of the sample's key points.
  • In this way, the following can be achieved: an unlabeled face image is input to the encoding network, the encoding network obtains a face image marked with a small number of key points, and the face image marked with the small number of key points is sent to the prediction network as the processing result, so that the prediction network can output the three-dimensional coordinates of a large number of key points in the face image. The detection of the three-dimensional coordinates of the key points of unlabeled face images is thus effectively realized.
  • FIG. 6 is a flowchart of a training method of a neural network for face detection according to an exemplary embodiment of the present disclosure. As shown in FIG. 6 , the training method includes the following steps:
  • In step 602, a sample including a face image is obtained, wherein the face image in the sample is marked with the three-dimensional coordinates of at least one first key point.
  • In step 604, the sample is used as the input of the neural network, and the predicted coordinate values of the first key points and the face pose change parameters are obtained.
  • In step 606, the neural network is trained based on the obtained predicted coordinate values and the face pose change parameters, so that the output result of the trained neural network is a face model containing the three-dimensional coordinates of a first number of second key points of the sample, where the first number is greater than the number of first key points marked in the sample.
  • the above training method can be used in various neural networks that need to be trained based on the predicted coordinate values of the key points and the change parameters of the face pose to obtain the three-dimensional coordinates of the key points of the face image.
  • the present disclosure does not limit the structure of the network trained using the training method.
  • the method can also be used in the training of the aforementioned neural network for face detection.
  • In some embodiments, the encoding network, the first decoding network and the second decoding network may be pre-trained networks: the encoding network outputs the processing result of the input face image sample, i.e., the face image sample marked with key points, and sends it to the first decoding network, the second decoding network and the prediction network; the first decoding network is used to obtain the predicted coordinate values of the key points marked in the sample, and the second decoding network is used to obtain the face pose change parameter corresponding to the sample.
  • the encoding network, the first decoding network, and the second decoding network may be untrained networks, and the training method described above may be used to perform the encoding network, the first decoding network, the second decoding network, and the prediction network. to train.
  • the specific training method of the network has been described in detail above, and will not be repeated here.
  • Based on the training described above, an encoding network and a prediction network with determined parameters can be obtained.
  • With the trained networks, a face detection method can be realized: the encoding network processes acquired unlabeled face images and outputs face image samples marked with a small number of key points; for the face image samples input from the encoding network, the prediction network outputs the three-dimensional coordinates of a large number of face key points. Based on the acquired three-dimensional coordinates of the face key points, three-dimensional face reconstruction can be performed, as shown in FIG. 7.
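A sketch of the deployed pipeline, reusing the hypothetical FaceDetectionNet from above. In that toy layout the decoder outputs feed the predictor internally, so the full forward pass is run and only the dense output is kept; the function name is illustrative.

```python
import torch

@torch.no_grad()
def detect_dense_keypoints(net, image):
    """Unlabeled face image in, dense 3D key-point coordinates out.

    image: (3, H, W) float tensor. Returns an (n_dense, 3) tensor of
    three-dimensional coordinates, usable for 3D face reconstruction.
    """
    net.eval()
    _, _, dense = net(image.unsqueeze(0))   # encoder -> decoders -> predictor
    return dense.squeeze(0)
```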
  • The above face detection method can be deployed on a server to realize the detection of the key points of face images, or the trained encoding network and first decoding network can be installed on a mobile terminal, so as to realize face detection on the mobile terminal.
  • Using the trained model, an unlabeled face image can be input to the encoding network and the prediction network to obtain the three-dimensional coordinates of the face key points; based on these three-dimensional coordinates, 3D face reconstruction can further be realized.
  • the present disclosure also provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, implements the steps of the method described in any of the foregoing embodiments.
  • the present disclosure also provides a terminal device, including a memory, a processor, and a computer program stored in the memory and running on the processor, and when the program is executed by the processor, the method described in any of the foregoing embodiments is implemented. step.
  • the present disclosure may take the form of a computer program product embodied on one or more storage media having program code embodied therein, including but not limited to disk storage, CD-ROM, optical storage, and the like.
  • Computer-usable storage media includes permanent and non-permanent, removable and non-removable media, and storage of information can be accomplished by any method or technology.
  • Information may be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Flash Memory or other memory technology, Compact Disc Read Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, Magnetic tape cassettes, magnetic tape magnetic disk storage or other magnetic storage devices or any other non-transmission medium that can be used to store information that can be accessed by a computing device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a face detection neural network and a training method, a face detection method, and a storage medium. The neural network comprises: an encoding network (210), being configured to obtain a sample comprising a face image and take the sample as an output result of the encoding network; a first decoding network (220), being configured to extract prediction coordinate values of first key points on the face image in the sample according to the output result of the encoding network (210); a second decoding network (230), being configured to obtain, according to the output result of the encoding network (210), a human face posture change parameter corresponding to the face image in the sample; and a prediction network (240), being configured to train by taking the output result of the encoding network (210), the prediction coordinate values obtained by the first decoding network (220), and the face posture change parameter obtained by the second decoding network (230) as inputs, and output a face model containing three-dimensional or two-dimensional coordinate values of at least two second key points on the face image in the sample.

Description

人脸检测神经网络及训练方法、人脸检测方法、存储介质Face detection neural network and training method, face detection method, storage medium
相关申请的交叉引用CROSS-REFERENCE TO RELATED APPLICATIONS
本公开要求于2020年10月28日提交的、申请号为2020111738293、发明名称为“人脸检测神经网络及训练方法、人脸检测方法、存储介质”的中国专利申请的优先权,该申请以引用的方式并入本文中。This disclosure claims the priority of the Chinese patent application filed on October 28, 2020, with the application number of 2020111738293 and the invention titled "Face Detection Neural Network and Training Method, Face Detection Method, and Storage Medium", which is entitled to Incorporated herein by reference.
技术领域technical field
本公开涉及计算机技术领域,尤其涉及一种人脸检测神经网络及其训练方法、人脸检测方法、存储介质及终端设备。The present disclosure relates to the field of computer technologies, and in particular, to a face detection neural network and a training method thereof, a face detection method, a storage medium and a terminal device.
背景技术Background technique
近年来,随着深度学习技术的迅猛发展,人脸检测技术也取得了巨大的进步。但是基于深度学习的人脸检测技术需要大量的包含关键点三维坐标的训练数据,且同一训练数据中所标注的关键点也需要足够密集。但是,三维人脸数据的采集和关键点的标注相比于传统图像的采集来讲具有较大难度,且需要花费大量的人力、物力和财力。In recent years, with the rapid development of deep learning technology, face detection technology has also made great progress. However, face detection technology based on deep learning requires a large amount of training data containing the three-dimensional coordinates of key points, and the key points marked in the same training data also need to be dense enough. However, the collection of 3D face data and the labeling of key points are more difficult than traditional image collection, and require a lot of manpower, material and financial resources.
发明内容SUMMARY OF THE INVENTION
为克服相关技术存在的三维人脸数据的采集和关键点的标注相比于传统图像的采集来讲具有较大难度,且需要花费大量的人力、物力和财力的问题,本公开提供了一种人脸检测神经网络及其训练方法、人脸检测方法、存储介质及终端设备。In order to overcome the problems existing in the related art that the collection of 3D face data and the labeling of key points are more difficult than traditional image collection and require a lot of manpower, material resources and financial resources, the present disclosure provides a A face detection neural network and a training method thereof, a face detection method, a storage medium and a terminal device.
根据本公开实施例的第一方面,提供一种人脸检测的神经网络,所述神经网络包括:编码网络,用于获取包括人脸图像的样本,并将所述样本作为其输出结果,其中,所述样本中的所述人脸图像标注有至少一个第一关键点的三维或二维坐标;第一解码网络,用于根据所述编码网络的输出结果,提取出所述样本中所述人脸图像上的各所述第一关键点的预测坐标值;第二解码网络,用于根据所述编码网络的输出结果,获取所述样本中所述人脸图像对应的人脸姿态变化参数;以及预测网络,用于将所述编码网络的输出结果以及所述第一解码网络获取的预测坐标值和所述第二解码网络获得的人脸姿态变化参数作为输入进行训练,输出包含所述样本中所述人脸图像上的至少两个第二关键点的三维或二维坐标值的人脸模型,其中,所输出的所述人脸模型中所述第二关键点 的数量大于所述样本中标注的所述第一关键点的数量。According to a first aspect of the embodiments of the present disclosure, there is provided a neural network for face detection, the neural network comprising: an encoding network for acquiring a sample including a face image, and using the sample as an output result thereof, wherein , the face image in the sample is marked with three-dimensional or two-dimensional coordinates of at least one first key point; the first decoding network is used to extract the The predicted coordinate value of each of the first key points on the face image; the second decoding network is used to obtain the face pose change parameter corresponding to the face image in the sample according to the output result of the encoding network And prediction network, be used for the output result of described coding network and the predicted coordinate value that described first decoding network obtains and the facial posture change parameter that described second decoding network obtains as input and carry out training, output comprises described A face model with three-dimensional or two-dimensional coordinate values of at least two second key points on the face image in the sample, wherein the number of the second key points in the output face model is greater than the The number of the first keypoints annotated in the sample.
根据本公开实施例的第二方面,提供一种人脸检测的神经网络的训练方法,所述训练方法包括:获取包括人脸图像的样本,其中,所述样本中的所述人脸图像标注有至少一个第一关键点的三维或二维坐标;将所述样本作为所述神经网络的输入,获取各所述第一关键点的预测坐标值以及与所述人脸图像对应的人脸姿态变化参数;基于所获取的预测坐标值和所述人脸姿态变化参数训练所述神经网络,以使训练后的所述神经网络的输出结果为包含第一数量的第二关键点的三维或二维坐标的人脸模型,其中,所述第一数量大于样本中标注的所述第一关键点的数量。According to a second aspect of the embodiments of the present disclosure, there is provided a method for training a neural network for face detection, the training method comprising: acquiring a sample including a face image, wherein the face image in the sample is labeled There are three-dimensional or two-dimensional coordinates of at least one first key point; the sample is used as the input of the neural network to obtain the predicted coordinate value of each of the first key points and the face pose corresponding to the face image Change parameters; train the neural network based on the obtained predicted coordinate values and the face pose change parameters, so that the output result of the neural network after training is a three-dimensional or two-dimensional image containing a first number of second key points. A face model of dimensional coordinates, wherein the first number is greater than the number of the first key points marked in the sample.
根据本公开实施例的第三方面,提供一种人脸检测的方法,所述方法由本公开实施例第一方面的编码网络和训练获得的预测网络实现,所述方法包括:所述编码网络对所获取的未标注的人脸图像进行关键点标注的处理,并将标注有至少一个第一关键点的所述人脸图像作为处理结果输出给所述预测网络;所述预测网络输出包括所述人脸图像上至少两个第二关键点的三维或二维坐标的人脸模型,其中,所述第二关键点的数量大于所述第一关键点的数量。According to a third aspect of the embodiments of the present disclosure, there is provided a method for face detection, the method is implemented by the encoding network and the prediction network obtained by training in the first aspect of the embodiments of the present disclosure, and the method includes: the encoding network pairing The acquired unlabeled face image is processed for key point labeling, and the face image marked with at least one first key point is output to the prediction network as a processing result; the prediction network output includes the A face model with three-dimensional or two-dimensional coordinates of at least two second key points on a face image, wherein the number of the second key points is greater than the number of the first key points.
根据本公开实施例的第四方面,提供一种计算机存储介质,所述计算机存储介质中存储有计算机程序代码,当所述计算机程序代码在处理器上运行时,使得所述处理器执行本公开实施例的第二方面或本公开实施例的第三方面所述方法的步骤。According to a fourth aspect of the embodiments of the present disclosure, there is provided a computer storage medium, where computer program codes are stored in the computer storage medium, and when the computer program codes are executed on a processor, the processor causes the processor to execute the present disclosure The steps of the method described in the second aspect of the embodiment or the third aspect of the embodiment of the present disclosure.
根据本公开实施例的第五方面,提供一种终端设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,其中,所述处理器执行所述程序时实现本公开实施例的第二方面或第三方面所述方法的步骤。According to a fifth aspect of the embodiments of the present disclosure, a terminal device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the present invention when executing the program. The steps of the method of the second or third aspect of the disclosed embodiments.
在本公开实施例中,将标注有关键点三维坐标的人脸图像样本作为人脸检测的神经网络的输入,通过获取样本的关键点预测坐标值和人脸姿态变化参数,训练所述神经网络。利用所述人脸检测的神经网络,能够实现向所述神经网络输入仅标注有少量关键点三维坐标的人脸图像样本,就可以获得该样本大量关键点的三维坐标。能够解决三维人脸数据的采集和关键点的标注难度大,且需要花费大量的人力、物力和财力的问题。In the embodiment of the present disclosure, a face image sample marked with three-dimensional coordinates of key points is used as the input of a neural network for face detection, and the neural network is trained by obtaining key points of the sample to predict coordinate values and face pose change parameters. . Using the neural network for face detection, it is possible to input a face image sample marked with only a small number of three-dimensional coordinates of key points into the neural network, and obtain three-dimensional coordinates of a large number of key points of the sample. It can solve the problem that the collection of 3D face data and the labeling of key points are difficult and require a lot of manpower, material resources and financial resources.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only and do not limit the present disclosure.
Description of the Drawings

The accompanying drawings herein illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure.

FIG. 1 is a system architecture diagram to which a face detection neural network, its training method, and a face detection method can be applied, according to an exemplary embodiment of the present disclosure.

FIG. 2 is a structural diagram of a neural network for face detection according to an exemplary embodiment of the present disclosure.

FIG. 3 is a flowchart of another method for obtaining predicted coordinate values of key points according to an exemplary embodiment of the present disclosure.

FIG. 4 is a flowchart of a method for obtaining face pose change parameters of key points according to an exemplary embodiment of the present disclosure.

FIG. 5 is a flowchart of a method for training a prediction network according to an exemplary embodiment of the present disclosure.

FIG. 6 is a flowchart of a method for training a neural network for face detection according to an exemplary embodiment of the present disclosure.

FIG. 7 is a flowchart of a method for applying a prediction network according to an exemplary embodiment of the present disclosure.

Detailed Description
Exemplary embodiments will be described in detail here, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in this specification and the appended claims, the singular forms "a", "said" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the present disclosure to describe various pieces of information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present disclosure, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "at the time of", "when" or "in response to determining".
Face detection refers to detecting the positions of key points of a face in a given face image. The key points typically include points at the eyebrows, eyes, nose, mouth, facial contour, and the like. Face detection technology is the foundation of application scenarios such as face dress-up, beauty makeup, face special effects, and face AR (Augmented Reality). Taking face special effects as an example, some live video streaming applications require three-dimensional animation effects, such as adding three-dimensional rabbit ears or a pig mask to a face in the video. These three-dimensional animation effects must be built on the accurate positioning of face key points; that is, the accurate three-dimensional coordinates of the face key points need to be obtained.
In recent years, with the development of deep learning technology, a number of face detection techniques based on deep learning have emerged. For example, a given face image is input into a convolutional neural network, which regresses the coordinates of the face key points. As another example, a given face image is input into a convolutional neural network, which regresses feature maps corresponding to the face key points, and the positions representing the face key points are determined from the feature maps. However, deep-learning-based face detection techniques all require feeding the network to be trained with a large amount of training data annotated with the three-dimensional coordinates of key points, and the key points annotated in a single training sample also need to be sufficiently dense (that is, dense three-dimensional data is required). Yet three-dimensional face data with key-point annotations is scarce and precious, because compared with conventional image collection, collecting three-dimensional face data and annotating key points is considerably more difficult and consumes a large amount of manpower, material resources and financial resources.
To address the problem that collecting three-dimensional face data and annotating key points is considerably more difficult than conventional image collection and consumes a large amount of manpower, material resources and financial resources, the present disclosure proposes a face detection neural network and a training method thereof, a face detection method, a storage medium, and a terminal device.
The embodiments of the present disclosure are described in detail below.
FIG. 1 is a schematic diagram of an exemplary system architecture to which the face detection neural network, its training method, and the face detection method of the embodiments of the present disclosure can be applied.
As shown in FIG. 1, the system architecture 1000 may include one or more of terminal devices 1001, 1002 and 1003, as well as a network 1004 and a server 1005. The network 1004 is the medium that provides communication links between the terminal devices 1001, 1002, 1003 and the server 1005. The network 1004 may include various connection types, such as wired connections, wireless communication links, or fiber-optic cables.
It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks and servers according to implementation needs. For example, the server 1005 may be a server cluster composed of multiple servers.
Users may use the terminal devices 1001, 1002 and 1003 to interact with the server 1005 through the network 1004 to receive or send messages and the like. The terminal devices 1001, 1002 and 1003 may be various electronic devices with a display screen, including but not limited to smartphones, tablet computers, portable computers, desktop computers, and the like.
The server 1005 may be a server that provides various services. For example, the server 1005 may carry out the training of the neural network for face detection of the present disclosure: the server 1005 acquires face images from the terminal devices 1001, 1002 and 1003, where the face images may be face images annotated with the three-dimensional coordinates of a small number of key points (i.e., sparse three-dimensional data). The server trains the neural network for face detection based on the acquired face images. With the trained neural network, the server can output the three-dimensional coordinates of a large number of key points in a face image (i.e., dense three-dimensional data). As another example, the server 1005 may execute the face detection method of the present disclosure: all or part of the trained neural network for face detection is pre-installed on the server 1005; the server 1005 acquires a face image without key-point annotations from the terminal devices 1001, 1002 and 1003, and all or part of the trained neural network obtains the three-dimensional coordinates of the key points in the face image.
The terminal devices 1001, 1002 and 1003 may be terminal devices that provide various services. For example, a terminal device may be equipped with an image acquisition unit, so that it can capture face images for its own use or for sending to other devices. As another example, by pre-installing on the terminal devices 1001, 1002 and 1003 all or part of the neural network for face detection trained by the server 1005, a terminal device can obtain the three-dimensional coordinates of the key points of an input face image without key-point annotations.
However, it should be understood that for terminal devices 1001, 1002 and 1003 whose computing capability meets the training requirements, the method for training the neural network for face detection disclosed in the embodiments of the present disclosure may also be executed by the terminal devices 1001, 1002 and 1003. All or part of the neural network trained by the terminal devices 1001, 1002 and 1003 can directly obtain the three-dimensional coordinates of the key points of an input unannotated face image. Of course, all or part of the neural network trained by the terminal devices 1001, 1002 and 1003 may also be deployed on the server 1005 in advance, so that the server 1005 obtains the three-dimensional coordinates of the key points of an input unannotated face image.
FIG. 2 is a structural diagram of a neural network for face detection according to an exemplary embodiment of the present disclosure. As shown in FIG. 2, the neural network for face detection includes an encoding network 210, a first decoding network 220, a second decoding network 230 and a prediction network 240. The encoding network 210 is configured to acquire a face image sample and send the sample as its output result to the first decoding network 220, the second decoding network 230 and the prediction network 240, where the sample includes a face image annotated with the three-dimensional coordinates of at least one first key point. The first decoding network 220 extracts the predicted coordinate values of the first key points in the sample according to the output result of the encoding network 210. The second decoding network 230 is configured to obtain, according to the output result of the encoding network 210, the face pose change parameters for the predicted coordinate values of the first key points in the sample. The output result of the encoding network 210, together with the predicted coordinate values obtained by the first decoding network 220 and the face pose change parameters obtained by the second decoding network 230, is used as input to train the prediction network 240, so that the trained prediction network 240 outputs a face model containing the three-dimensional coordinates of at least two second key points in the sample, where the number of second key points output by the prediction network 240 is greater than the number of first key points whose three-dimensional coordinates are annotated in the sample.
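To make the data flow between the four sub-networks concrete, the following is a minimal PyTorch sketch of this layout. All module sizes and names (FaceKeypointNet, feat_dim, num_dense, and so on) are illustrative assumptions, not the configuration of the disclosed network.

    import torch
    import torch.nn as nn

    class FaceKeypointNet(nn.Module):
        # Sketch of the encoder / two-decoder / predictor layout of FIG. 2.
        def __init__(self, feat_dim=256, num_pose=62, uv_size=64, num_dense=1000):
            super().__init__()
            # Encoding network 210: a small convolutional backbone
            # (a stand-in for a lightweight network such as MobileNet).
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, feat_dim),
            )
            # First decoding network 220: regresses a UV position map holding
            # the predicted X/Y/Z values of the annotated key points.
            self.decoder_coords = nn.Linear(feat_dim, 3 * uv_size * uv_size)
            # Second decoding network 230: regresses face pose change
            # parameters (e.g., shape and expression coefficients).
            self.decoder_pose = nn.Linear(feat_dim, num_pose)
            # Prediction network 240: outputs a dense set of key points.
            self.predictor = nn.Linear(feat_dim + num_pose, 3 * num_dense)

        def forward(self, image):
            feat = self.encoder(image)
            uv_map = self.decoder_coords(feat)   # sparse key-point coordinates
            pose = self.decoder_pose(feat)       # face pose change parameters
            dense = self.predictor(torch.cat([feat, pose], dim=1))
            return uv_map, pose, dense.view(-1, dense.shape[1] // 3, 3)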
The training data of the neural network for face detection described in the present disclosure may be face image samples annotated with the three-dimensional coordinates of key points. It should be understood that such samples may come from public datasets, or may be face images collected by users and annotated with the three-dimensional coordinates of key points; the present disclosure places no limitation on the source of the training data. The annotated key points in a face image sample may be points at the eyes, nose, mouth or facial contour, or points at other positions on the face, which is likewise not limited by the present disclosure.
In one embodiment, when the training data used are face image samples annotated with the three-dimensional coordinates of key points, the activation functions adopted in the neural network for face detection are positive numbers corresponding to the three-dimensional coordinates of the key points; that is, the activation functions corresponding to the X axis, Y axis and Z axis are all positive. For example, if the three-dimensional coordinates of a key point are x, y, z, the activation function corresponding to that key point may be 1, 1, 1, indicating that all three coordinate dimensions of the key point are to be learned by the neural network. Of course, those skilled in the art should understand that the activation functions corresponding to the X, Y and Z axes may differ and may be other positive numbers. For example, for a key point with three-dimensional coordinates x, y, z, the corresponding activation function may be 2, 2, 10. Here, the activation functions for all three axes being greater than 0 indicates that the network needs to learn all three coordinate dimensions of the key point, while the Z-axis activation function being larger than those of the X and Y axes indicates that learning the Z coordinate carries a larger weight during training; in other words, the training of the neural network for face detection places more emphasis on learning the Z-axis coordinates of the key points.
In one embodiment, to ensure a better training effect, so that the trained neural network for face detection generalizes better and predicts more accurately across different input face images, the training data of the neural network may also be mixed data, which includes face image samples annotated with the three-dimensional coordinates of key points and face image samples annotated with the two-dimensional coordinates of key points. For a face image annotated with the two-dimensional coordinates of key points, the Z-axis coordinate value of each key point can be set to a negative number, for example -1, indicating that the Z-axis coordinate of that key point does not exist. Correspondingly, when mixed data is used, for samples annotated with three-dimensional coordinates, the activation functions corresponding to the X, Y and Z axes of the key points may be the same as described above, which is not repeated here. For face images annotated with two-dimensional coordinates, the activation functions for the X and Y axes may likewise be as described above, while the activation function for the Z axis may be set to 0, indicating that the Z-axis coordinate of that key point is not to be learned by the neural network for face detection.
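A short sketch of how such per-axis activation values might be assembled for a mixed batch; the concrete weights and the -1 depth sentinel follow the examples above, while the helper name axis_mask is purely illustrative:

    import numpy as np

    def axis_mask(has_z, z_weight=1.0):
        # Per-axis learning weights ("activation functions") for one key point.
        # For 2-D-annotated samples the Z weight is 0, so the missing depth
        # does not contribute to training.
        if has_z:
            return np.array([1.0, 1.0, z_weight])   # learn X, Y and Z
        return np.array([1.0, 1.0, 0.0])            # Z is not learned

    # One 3-D-annotated point and one 2-D-annotated point (Z stored as -1).
    label_3d = np.array([101.0, 87.0, 12.0])
    label_2d = np.array([98.0, 90.0, -1.0])
    masks = np.stack([axis_mask(True, z_weight=10.0), axis_mask(False)])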
With mixed data used as the training data of the neural network for face detection, the face image samples annotated with the two-dimensional coordinates of key points lack Z-axis values but still carry X-axis and Y-axis values, and these two-dimensional coordinates contain a certain amount of useful information. Therefore, training with mixed data makes it possible, when face image samples annotated with the three-dimensional coordinates of key points are insufficient, to learn more useful information about face key points, thereby achieving a better training effect for the neural network for face detection.
The encoding network 210 of the neural network for face detection is configured to acquire face image samples and perform preliminary processing on them to obtain samples annotated with key points, and to send those samples, as the result of the preliminary processing, to the parts of the neural network for face detection other than the encoding network 210. In one embodiment, the encoding network 210 may be a lightweight neural network. For example, the encoding network 210 may be a network of the MobileNet series designed by Google for mobile and embedded deep learning applications, including MobileNetV1, MobileNetV2 and MobileNetV3. The encoding network 210 may also be another lightweight neural network, such as SqueezeNet, ShuffleNet or Xception.
Adopting a lightweight neural network as the encoding network 210 reduces the size of the neural network for face detection and increases its running speed, which facilitates porting part or all of the network to mobile terminals, thereby effectively expanding the application scope and application scenarios of the neural network for face detection.
In one embodiment, extracting the predicted coordinate values of the first key points in the sample according to the output result of the encoding network 210 may be implemented by the first decoding network 220 of the neural network for face detection.
FIG. 3 is a schematic diagram of an embodiment in which the first decoding network 220 of the neural network for face detection extracts the predicted coordinate values of face image key points. In FIG. 3, face image samples without key-point annotations are input into the encoding network 210; after processing them, the encoding network 210 obtains samples with key-point annotations and sends these samples to the first decoding network 220 as its output result. The first decoding network 220 can then extract the predicted coordinate values of the key points according to a preset mapping between the UV coordinate map and the three-dimensional coordinates of the key points. Here, UV is short for the U and V axes of a texture map, where U denotes the horizontal coordinate and V the vertical coordinate. The UV coordinate map is composed of the pixel values of the three RGB channels, each pixel channel representing the X-axis, Y-axis or Z-axis coordinate of a three-dimensional face key point. Through the two-dimensional UV coordinate map, every point on the map can be associated with a point on the surface of the three-dimensional face model; that is, every point on the three-dimensional face model has a unique corresponding point on the UV coordinate map. With a preset mapping between the UV coordinate map and the three-dimensional coordinates, the coordinates of the key points annotated in a face image sample can be recovered from the UV coordinate map.
When the face image samples passed from the encoding network 210 to the first decoding network 220 include only face images annotated with the three-dimensional coordinates of key points, the predicted coordinate values of the key points are valid three-dimensional coordinates. When the samples include both face images annotated with the three-dimensional coordinates of key points and face images annotated with the two-dimensional coordinates of key points, the predicted coordinate values corresponding to the three-dimensional-annotated samples are valid three-dimensional coordinates, while those corresponding to the two-dimensional-annotated samples are pseudo three-dimensional coordinates, which have no Z-axis values.
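A minimal sketch of reading key-point coordinates back out of such a UV position map, assuming the map stores X/Y/Z per pixel and that each key point has a fixed, preset (u, v) location (both assumptions for illustration):

    import numpy as np

    def keypoints_from_uv_map(uv_position_map, uv_coords):
        # uv_position_map: (H, W, 3) array; the three channels hold the X/Y/Z
        # coordinates of the face surface point mapped to each UV pixel.
        # uv_coords: (N, 2) integer (u, v) locations of the N key points,
        # i.e., the preset UV-to-3-D mapping mentioned above.
        u, v = uv_coords[:, 0], uv_coords[:, 1]
        return uv_position_map[v, u, :]   # (N, 3) predicted coordinate values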
The encoding network 210 and the first decoding network 220 in the above embodiments may be pre-trained. The first decoding network 220 may be a standard neural network (SNN), a convolutional neural network (CNN), a recursive neural network (RNN), or the like. The pre-training of the encoding network 210 and the first decoding network 220 may be implemented with any technique well known to those skilled in the art, so that the first decoding network 220 can obtain the predicted coordinate values of the key points from face image samples annotated with key-point coordinates; details are omitted here.
In one embodiment, the parameters of the encoding network 210 and the first decoding network 220 may be determined through training. As described above, the encoding network 210 and the first decoding network 220 may take various forms, which are not repeated here. An exemplary embodiment in which their parameters are determined through training is given below.
The encoding network 210 acquires face image samples annotated with key points and takes the samples as its output result, and the first decoding network 220 extracts the predicted coordinate values of the annotated key points in the samples according to that output. The parameters of the encoding network 210 and the first decoding network 220 are jointly trained based on a first loss function until the first loss function satisfies a preset training condition; the network parameters at that point are the parameters of the trained encoding network 210 and first decoding network 220. The preset training condition may be that the value of the first loss function falls below a threshold, that the first loss function converges, or another training condition, which is not limited by the present disclosure.
In one embodiment, the first loss function may be a loss function based on least absolute deviation, a loss function based on least mean-square deviation, or a loss function of another form, which is not limited by the present disclosure. The first loss function may be the deviation between the predicted coordinate values of the key points and their annotated coordinate values, or another measure characterizing the accuracy of the predicted coordinate values. An exemplary first loss function may be:
Loss_1 = ∑|predict − label|  (1);
where Loss_1 is the first loss function, predict is the predicted coordinate value of a key point output by the first decoding network 220, label is the annotated coordinate value of the key point in the face image sample input to the encoding network 210, and the summation runs over the annotated key-point coordinates. It should be understood that the above first loss function is only an exemplary embodiment; those skilled in the art may also design other first loss functions to train the encoding network 210 and the first decoding network 220, so that the predicted coordinate values output by the first decoding network 220 become more accurate.
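As a concrete reading of formula (1), the following sketch implements the joint update of the encoder and first decoder. The mean reduction, learning rate, dummy batch, and key-point count are assumptions, and FaceKeypointNet is the sketch given earlier:

    import torch

    def loss1(predict, label):
        # Formula (1) in least-absolute-deviation form; averaging rather than
        # summing is an implementation choice, not fixed by the text.
        return (predict - label).abs().mean()

    model = FaceKeypointNet()                    # sketch defined earlier
    images = torch.randn(2, 3, 128, 128)         # dummy annotated batch
    labels = torch.randn(2, 68, 3)               # 68 annotated key points
    opt = torch.optim.Adam(
        list(model.encoder.parameters()) +
        list(model.decoder_coords.parameters()), lr=1e-4)

    uv_map, _, _ = model(images)
    sparse = uv_map.view(images.size(0), -1, 3)[:, :labels.size(1)]
    loss = loss1(sparse, labels)
    opt.zero_grad(); loss.backward(); opt.step()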
In one embodiment, obtaining the face pose change parameters corresponding to the key points according to the output result of the encoding network 210 may be implemented by the second decoding network 230 of the neural network for face detection.
The face pose change parameters characterize the change of the face pose relative to a standard face model. They may include shape parameters and expression parameters, and may further include texture parameters and the like. Since human faces share many commonalities, for example, a fixed number of eyes, mouths, noses and ears whose relative positions do not change and whose parts stand in a certain topological relationship, a parameterized standard three-dimensional face model can be established. In this way, the face pose change parameters are obtained from the change of the predicted coordinate values of the key points relative to the standard three-dimensional face model.
In one embodiment, the parameterized standard three-dimensional face model may be one obtained with the 3DMM (3D Morphable Models) method. This method assumes that the three-dimensional faces have been densely aligned; that is, all three-dimensional faces can be represented with the same point cloud or mesh data, and points with the same index carry the same semantics. Under dense alignment, every textured three-dimensional face can be represented as:
S_model = S̄ + ∑ α_i S_i  (2);

T_model = T̄ + ∑ β_i T_i  (3);
where S̄ and T̄ denote the average face shape and the average texture, and the discriminative characteristics of each face are captured by the linear combination of the set of orthogonal bases S_i or T_i to the right of the plus sign, S_i and T_i being the eigenvectors of the covariance matrix arranged in descending order of eigenvalue. By collecting the heads of multiple three-dimensional faces and applying Principal Components Analysis (PCA), the principal components representing face shape and texture information, i.e., the eigenvectors S_i and T_i, can be obtained. Different coefficients α_i and β_i characterize three-dimensional faces of different shapes and textures. By optimizing the above parameterized face model with the data in the BFM database, another parameterized standard three-dimensional face model can be obtained:
M = BFM_mean + shape * shape_std + exp * exp_std  (4);
where BFM_mean is the average face obtained from the BFM database, shape_std and exp_std are the eigenvector bases of face shape and facial expression, and shape and exp are the shape parameters and expression parameters, respectively. The BFM database is a technique well known to those skilled in the art and is not described further here. In formula (4), by determining the shape parameter shape and the expression parameter exp, the standard three-dimensional face model can be transformed into a three-dimensional face model with a pose (i.e., with shape and expression changes).
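Read as a linear model over PCA bases, formula (4) can be sketched as follows; the vertex count and basis dimensions are placeholders:

    import numpy as np

    def morphable_face(bfm_mean, shape_std, exp_std, shape, exp):
        # Formula (4): M = BFM_mean + shape * shape_std + exp * exp_std.
        # bfm_mean: (3N,) stacked XYZ of the mean face's N vertices.
        # shape_std: (3N, Ks) identity basis; exp_std: (3N, Ke) expression basis.
        # shape: (Ks,) and exp: (Ke,) coefficient vectors.
        return bfm_mean + shape_std @ shape + exp_std @ exp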
Of course, those skilled in the art should understand that the parameterized standard three-dimensional face model may also be a parameterized three-dimensional face model established with other methods, for example, one built with statistical methods from actually collected three-dimensional face data or from three-dimensional face data obtained in other ways.
In one embodiment, the face pose change parameters may be obtained through a pre-trained second decoding network 230. Specifically, the encoding network 210 sends the samples annotated with key points to the second decoding network 230 as its output result, and the second decoding network 230 obtains the face pose change parameters based on that output. The second decoding network 230 may be a standard neural network (SNN), a convolutional neural network (CNN), a recursive neural network (RNN), or the like. Its pre-training may be implemented with techniques well known to those skilled in the art and is not detailed here.
In one embodiment, the second decoding network 230 may be determined through training. The training process includes: training the second decoding network 230 based on the parameters of the encoding network 210 and a second loss function, until the second loss function satisfies a preset training condition, thereby determining the parameters of the second decoding network 230. The parameters of the encoding network 210 may be determined with the training method described above or trained in another way, which is not limited by the present disclosure. The preset training condition may be that the value of the second loss function falls below a threshold, that the second loss function converges, or another training condition, which is likewise not limited by the present disclosure.
In one embodiment, the second loss function may take the form of a loss function based on least absolute deviation, a loss function based on least mean-square deviation, or another form, which is not limited by the present disclosure. The second loss function may be the deviation between the predicted coordinate values of the key points and the first coordinate values of the key points, or another measure characterizing the accuracy of the face pose change parameters output by the second decoding network. An exemplary second loss function may be:
Loss_2 = ∑|(predict − {(BFM_mean + shape * shape_std + exp * exp_std) * affine}) * mask|  (5);
where Loss_2 is the second loss function, predict is the predicted coordinate value of a key point output by the first decoding network 220, mask is the activation function corresponding to the three-dimensional coordinates of the key point, affine denotes the transformation matrix of the standard three-dimensional face model relative to the predicted coordinate values, and the remaining parameters have the meanings given above.
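A sketch of formula (5) under the reconstruction above, reusing the morphable_face sketch; the homogeneous 4x4 form of affine, the key-point index list kpt_idx, and the mean reduction are assumptions:

    import numpy as np

    def loss2(predict, bfm_mean, shape_std, exp_std, shape, exp, affine,
              mask, kpt_idx):
        # {...} in formula (5): the standard model, posed by shape/exp and
        # mapped by the affine transform, evaluated at the key-point vertices.
        verts = morphable_face(bfm_mean, shape_std, exp_std, shape, exp)
        pts = verts.reshape(-1, 3)[kpt_idx]              # (N, 3)
        homog = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
        first_coords = (homog @ affine.T)[:, :3]
        return np.mean(np.abs((predict - first_coords) * mask))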
FIG. 4 is a schematic diagram of obtaining the face pose change parameters by training the second decoding network 230; it schematically shows the influence of the second loss function on training, explained here in conjunction with the expression of the second loss function, formula (5). The expression within the braces {} in formula (5) represents the first coordinates of the key points of the face image sample, which can be obtained as follows: from the predicted coordinate values of the key points obtained by the first decoding network 220 and the standard three-dimensional face model, the transformation matrix of the standard model relative to the predicted coordinate values is computed; this transformation matrix characterizes the rotation, translation, scaling and the like required to transform the standard three-dimensional face model into the three-dimensional face model corresponding to the predicted coordinate values. The standard three-dimensional face model is then transformed according to this matrix to obtain the first coordinates of the key points. Transforming the standard model according to the transformation matrix to obtain the first coordinates is achieved by adjusting the parameters of the second decoding network 230 based on the transformation matrix, so that the face pose change parameters output by the second decoding network 230 change, causing the standard three-dimensional face model to change its pose and thereby match, as closely as possible, the predicted coordinate values of the key points output by the first decoding network 220. Taking the parameterized standard three-dimensional face model of formula (4) as an example, the parameter changes of the second decoding network 230 change the shape parameter shape and the expression parameter exp in formula (4), which in turn changes the pose of the parameterized standard three-dimensional face model; the parameterized expression is:
M_2 = {(BFM_mean + shape * shape_std + exp * exp_std) * affine},

which is exactly the content within the braces in formula (5).
Training the second decoding network 230 based on the second loss function means continually adjusting its parameters to obtain the face pose change parameters it outputs, and transforming the standard three-dimensional face model accordingly, until the face model obtained from the standard model is aligned with the predicted coordinate values of the key points output by the first decoding network 220 (i.e., the deviation is sufficiently small), so that a three-dimensional face model whose pose is identical or similar to that of the predicted coordinate values is obtained; the second decoding network 230 then completes training. In this way, the result output by the trained second decoding network 230 is a set of face pose change parameters that characterize the expression and/or shape changes, relative to the standard face model, of the face image corresponding to the predicted coordinate values of the key points.
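The text leaves the solver for the rotation/translation/scale unspecified; a common least-squares choice, shown here purely as a stand-in, is Umeyama-style similarity alignment between corresponding point sets:

    import numpy as np

    def similarity_transform(src, dst):
        # Least-squares scale s, rotation R, translation t with
        # dst ≈ s * R @ src + t, for corresponding (N, 3) point sets
        # (Umeyama's method).
        mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
        src_c, dst_c = src - mu_s, dst - mu_d
        cov = dst_c.T @ src_c / len(src)
        U, S, Vt = np.linalg.svd(cov)
        D = np.eye(3)
        if np.linalg.det(U @ Vt) < 0:      # guard against a reflection
            D[2, 2] = -1.0
        R = U @ D @ Vt
        s = np.trace(np.diag(S) @ D) / src_c.var(axis=0).sum()
        t = mu_d - s * R @ mu_s
        return s, R, t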
Based on the face pose change parameters output by the trained second decoding network 230, what is obtained is, in essence, a three-dimensional face model corresponding to the predicted coordinate values of the key points of the input sample. From this three-dimensional face model, the three-dimensional coordinates of any point on the model can be derived, and therefore the three-dimensional coordinates of a large number of key points of the sample can be obtained.
In one embodiment, the parameters of the prediction network 240 are determined through training based on the parameters determined for the encoding network 210 and a third loss function.
As shown in FIG. 5, based on the face image samples annotated with key points output by the encoding network 210, combined with the obtained predicted coordinate values and face pose change parameters, the parameters of the prediction network 240 are trained with the third loss function until the third loss function satisfies a preset training condition; the network parameters at that point are the parameters of the trained prediction network 240, i.e., the prediction network 240 completes training. The training of the prediction network 240 is performed based on the third loss function, and the preset training condition may be that the value of the third loss function falls below a threshold, that the third loss function converges, or another training condition, which is not limited by the present disclosure.
In one embodiment, the third loss function may take the form of a loss function based on least absolute deviation, a loss function based on least mean-square deviation, or another form, which is not limited by the present disclosure. The third loss function may be the deviation between the three-dimensional coordinates of the key points (i.e., the second key points) output by the prediction network 240 and the first coordinates, or another measure characterizing the accuracy of the three-dimensional coordinates of the key points output by the prediction network. An exemplary third loss function may be:
Loss_3 = ∑|(predict_pts − {(BFM_mean + shape * shape_std + exp * exp_std) * affine}) * mask|  (6);
where Loss_3 is the third loss function, predict_pts is the three-dimensional coordinates of the key points output by the prediction network, and the remaining parameters have the meanings given above. As noted above, the expression within the braces {} (the same as in formula (5)) represents the first coordinates of the key points; it is obtained as described above and is not repeated here.
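Under the same reconstruction, formula (6) differs from formula (5) only in comparing the dense output of the prediction network against the first coordinates; the dense mask handling is again an assumption:

    import numpy as np

    def loss3(predict_pts, first_coords, mask):
        # Formula (6): deviation between the dense key-point coordinates from
        # the prediction network and the first coordinates {...} of formula (5).
        return np.mean(np.abs((predict_pts - first_coords) * mask))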
Based on the trained prediction network 240, the following can be achieved: an unannotated face image is input to the encoding network 210, the encoding network 210 obtains a face image annotated with a small number of key points and sends it as the processing result to the prediction network 240, and the prediction network 240 can then output the three-dimensional coordinates of a large number of key points in the face image.
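In terms of the earlier FaceKeypointNet sketch, inference then reduces to a single forward pass; the input size and the dense output dimension are illustrative:

    import torch

    model.eval()
    image = torch.randn(3, 128, 128)   # stand-in for an unannotated face image
    with torch.no_grad():
        _, _, dense_pts = model(image.unsqueeze(0))   # (1, num_dense, 3)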
Those skilled in the art should understand that, to give the trained model better generalization ability and more accurate predictions while also preventing overfitting, regularization terms may be added to the first, second and third loss functions to improve the overall performance of the trained model.
In one embodiment, among the encoding network, first decoding network, second decoding network and prediction network included in the neural network for face detection, some networks may already be trained while others are not. For the specific training methods and the designed loss functions of the untrained networks, reference may be made to the foregoing description, which is not repeated here.
In one embodiment, the encoding network, first decoding network, second decoding network and prediction network included in the neural network for face detection may all be untrained. The networks may then be trained as follows: fix the initial parameters of the second decoding network and the prediction network, and jointly train the parameters of the encoding network and the first decoding network based on the first loss function; fix the trained parameters of the encoding network, and train the parameters of the second decoding network based on the second loss function; fix the trained parameters of the encoding network, and train the parameters of the prediction network based on the third loss function.
For the specific training method and the designed loss function of each network, reference may be made to the foregoing description, which is likewise not repeated here.
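One way to express this staged schedule, assuming the FaceKeypointNet sketch above, is to toggle which sub-modules receive gradients:

    def set_trainable(module, flag):
        for p in module.parameters():
            p.requires_grad = flag

    # Stage 1: joint training of encoder 210 + first decoder 220 (Loss_1).
    set_trainable(model.decoder_pose, False)
    set_trainable(model.predictor, False)
    set_trainable(model.encoder, True)
    set_trainable(model.decoder_coords, True)

    # Stage 2: encoder fixed, second decoder 230 trained (Loss_2).
    set_trainable(model.encoder, False)
    set_trainable(model.decoder_pose, True)

    # Stage 3: encoder fixed, prediction network 240 trained (Loss_3).
    set_trainable(model.decoder_pose, False)
    set_trainable(model.predictor, True)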
In the above embodiments, face image samples annotated with the three-dimensional coordinates of key points are used as the input of the neural network for face detection, and the network is trained by obtaining the predicted coordinate values of the key points of the samples and the face pose change parameters. With this neural network, a three-dimensional face model corresponding to the predicted coordinate values of the key points of the input sample can be obtained, so that inputting a face image sample annotated with the three-dimensional coordinates of only a small number of key points yields the three-dimensional coordinates of a large number of key points of that sample. Furthermore, based on the trained encoding network and prediction network, the following can be achieved: an unannotated face image is input to the encoding network, the encoding network obtains a face image annotated with a small number of key points and sends it as the processing result to the prediction network, and the prediction network can then output the three-dimensional coordinates of a large number of key points in the face image. In this way, detection of the three-dimensional coordinates of the key points of unannotated face images is effectively realized.
FIG. 6 is a flowchart of a method for training a neural network for face detection according to an exemplary embodiment of the present disclosure. As shown in FIG. 6, the training method includes the following steps:
In step 602, a sample including a face image is acquired, wherein the sample includes a face image annotated with the three-dimensional coordinates of at least one first key point.

In step 604, the sample is taken as the input of the neural network, and the predicted coordinate values of the first key points and the face pose change parameters are obtained.

In step 606, the neural network is trained based on the obtained predicted coordinate values and face pose change parameters, so that the output result of the trained neural network is the three-dimensional coordinates of a first number of second key points contained in the sample, wherein the first number is greater than the number of first key points annotated in the sample.
The above training method can be used with any neural network that needs to be trained on the predicted coordinate values of key points and the face pose change parameters in order to obtain the three-dimensional coordinates of the key points of a face image; the present disclosure places no limitation on the structure of the network trained with this method. Of course, those skilled in the art should understand that the method can also be used to train the neural network for face detection described above.
In one embodiment, the encoding network, the first decoding network and the second decoding network may be pre-trained networks. The encoding network outputs the processing result for the input face image samples, i.e., face image samples annotated with key points, and sends them to the first decoding network, the second decoding network and the prediction network; the first decoding network obtains the predicted coordinate values of the key points annotated in the samples, and the second decoding network obtains the face pose change parameters corresponding to the samples. In this case, only the prediction network needs to be trained; its specific training method has been detailed above and is not repeated here.
In one embodiment, the encoding network, the first decoding network and the second decoding network may be untrained networks, and the training methods described above may be used to train the encoding network, the first decoding network, the second decoding network and the prediction network. The specific training methods have been detailed above and are likewise not repeated here.
In addition, those skilled in the art should understand that the above training method can likewise obtain the predicted coordinate values of the key points and the face pose change parameters; how they are obtained has been detailed above and is not repeated here.
When the above neural network for face detection completes training, an encoding network and a prediction network with determined parameters are obtained. Based on them, a face detection method can be realized: the encoding network processes an acquired unannotated face image and outputs a face image sample annotated with a small number of key points, and the prediction network, taking the face image sample input from the encoding network, outputs the three-dimensional coordinates of a large number of key points of the face image. Based on the obtained three-dimensional coordinates of the face image key points, three-dimensional face reconstruction can be performed, as shown in FIG. 7.
The above face detection method can be deployed on a server to detect the key points of face images, or the trained encoding network and prediction network can be installed on a mobile terminal to perform face detection on the mobile terminal.
In this embodiment, with the trained face model, an unannotated face image can be input to the encoding network and the prediction network to obtain the three-dimensional coordinates of the key points of that face image, and based on those three-dimensional coordinates, reconstruction of the three-dimensional face can further be realized.
Correspondingly, the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the steps of the method of any of the above embodiments.
Correspondingly, the present disclosure further provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when executed by the processor, the program implements the steps of the method of any of the above embodiments.
The present disclosure may take the form of a computer program product implemented on one or more storage media containing program code (including but not limited to disk storage, CD-ROM, optical storage, and the like). Computer-usable storage media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to: phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
Specific embodiments of the present disclosure have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that of the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
本领域技术人员在考虑说明书及实践这里申请的发明后,将容易想到本公开的其它实施方案。本公开旨在涵盖本公开的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本公开的一般性原理并包括本公开未申请的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本公开的真正范围和精神由下面的权利要求指出。Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention claimed herein. This disclosure is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common general knowledge or techniques in the technical field to which this disclosure is not claimed . The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
应当理解的是,本公开并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本公开的范围仅由所附的权利要求来限制。It is to be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
以上所述仅为本公开的较佳实施例而已,并不用以限制本公开,凡在本公开的精神和原则之内,所做的任何修改、等同替换、改进等,均应包含在本公开保护的范围之内。The above descriptions are only preferred embodiments of the present disclosure, and are not intended to limit the present disclosure. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of the present disclosure shall be included in the present disclosure. within the scope of protection.

Claims (22)

  1. A neural network for face detection, the neural network comprising:
    an encoding network, configured to acquire a sample including a face image and take the sample as its output, wherein the face image in the sample is annotated with three-dimensional or two-dimensional coordinates of at least one first key point;
    a first decoding network, configured to extract, according to the output of the encoding network, a predicted coordinate value of each first key point on the face image in the sample;
    a second decoding network, configured to acquire, according to the output of the encoding network, a face pose change parameter corresponding to the face image in the sample; and
    a prediction network, configured to be trained by taking as input the output of the encoding network, the predicted coordinate values acquired by the first decoding network, and the face pose change parameter acquired by the second decoding network, and to output a face model containing three-dimensional or two-dimensional coordinate values of at least two second key points on the face image in the sample, wherein the number of the second key points in the output face model is greater than the number of the first key points annotated in the sample.
  2. The neural network according to claim 1, wherein the sample further includes a face image annotated with two-dimensional coordinates of key points.
  3. The neural network according to claim 1, wherein the encoding network is a lightweight neural network.
  4. The neural network according to claim 1, wherein the first decoding network being configured to extract, according to the output of the encoding network, the predicted coordinate value of each first key point on the face image in the sample comprises:
    the first decoding network acquiring, according to the output of the encoding network, a UV coordinate map corresponding to the face image in the sample; and
    extracting, based on a mapping relationship between the UV coordinate map and three-dimensional or two-dimensional coordinates, the predicted coordinate value of each first key point on the face image in the sample.
  5. The neural network according to claim 1, wherein the parameters of the encoding network and the first decoding network are obtained by:
    jointly training the encoding network and the first decoding network based on a first loss function.
  6. The neural network according to claim 5, wherein the parameters of the second decoding network are obtained by:
    training the second decoding network based on the determined parameters of the encoding network and a second loss function.
  7. The neural network according to claim 6, wherein the parameters of the prediction network are determined by training based on the determined parameters of the encoding network and a third loss function.
  8. The neural network according to claim 6, wherein the first loss function is the deviation between the predicted coordinate value of each first key point on the face image in the sample and the annotated coordinate value of each first key point on the face image in the sample.
  9. The neural network according to claim 7, wherein training the second decoding network based on the determined parameters of the encoding network and the second loss function comprises:
    acquiring, according to the predicted coordinate values of each first key point acquired by the first decoding network and a standard three-dimensional face model, a transformation matrix of the standard three-dimensional face model relative to the predicted coordinate values of each first key point;
    transforming the standard three-dimensional face model according to the transformation matrix to obtain a first coordinate value of each first key point; and
    training the parameters of the second decoding network based on the second loss function according to the predicted coordinate value and the first coordinate value of each first key point.
  10. The neural network according to claim 9, wherein the standard three-dimensional face model is a parametrically represented three-dimensional face model obtained based on the 3DMM method.
  11. The neural network according to claim 9, wherein the second loss function is the deviation between the predicted coordinate value of each first key point and the first coordinate value.
  12. The neural network according to claim 9, wherein the third loss function is the deviation between the three-dimensional or two-dimensional coordinate value of each second key point output by the prediction network and the first coordinate value.
  13. A training method for a neural network for face detection, the training method comprising:
    acquiring a sample including a face image, wherein the face image in the sample is annotated with three-dimensional or two-dimensional coordinates of at least one first key point;
    taking the sample as input to the neural network, and acquiring a predicted coordinate value of each first key point and a face pose change parameter corresponding to the face image; and
    training the neural network based on the acquired predicted coordinate values and the face pose change parameter, so that the output of the trained neural network is a face model containing three-dimensional or two-dimensional coordinates of a first number of second key points, wherein the first number is greater than the number of the first key points annotated in the sample.
  14. The training method according to claim 13, wherein the neural network for face detection comprises:
    an encoding network, configured to acquire a sample including a face image and take the sample as its output, wherein the face image in the sample is annotated with three-dimensional or two-dimensional coordinates of at least one first key point; and
    a first decoding network, configured to take the sample as input, acquire, according to the output of the encoding network, a UV coordinate map corresponding to the face image in the sample, and extract, based on a mapping relationship between the UV coordinate map and three-dimensional coordinates, the predicted coordinate value of each first key point on the face image in the sample.
  15. The training method according to claim 14, further comprising:
    jointly training the encoding network and the first decoding network based on a first loss function, and determining the parameters of the encoding network and the parameters of the first decoding network.
  16. The training method according to claim 15, wherein the neural network for face detection further comprises a second decoding network configured to acquire, based on the output of the encoding network, the face pose change parameter corresponding to the face image in the sample, the training method further comprising:
    training the second decoding network based on the determined parameters of the encoding network and a second loss function, and determining the parameters of the second decoding network.
  17. The training method according to claim 16, wherein the neural network for face detection further comprises a prediction network configured to output, based on the predicted coordinate value of each first key point and the face pose change parameter, a face model containing three-dimensional or two-dimensional coordinates of a first number of second key points, the training method further comprising:
    training the prediction network based on the determined parameters of the encoding network and a third loss function, and determining the parameters of the prediction network.
  18. The training method according to claim 16, wherein the second decoding network acquiring the face pose change parameter corresponding to the face image in the sample comprises:
    acquiring, according to the predicted coordinate values of each first key point acquired by the first decoding network and a standard three-dimensional face model, a transformation matrix of the standard three-dimensional face model relative to the predicted coordinate values;
    transforming the standard three-dimensional face model according to the transformation matrix to obtain a first coordinate value of each first key point;
    training the parameters of the second decoding network based on the second loss function according to the predicted coordinate value and the first coordinate value of each first key point, and determining the parameters of the second decoding network; and
    outputting, by the trained second decoding network and based on the sample, the face pose change parameter corresponding to the face image in the sample.
  19. A method of face detection, wherein the method is implemented by the encoding network according to any one of claims 1 to 12 and a prediction network obtained by training, the method comprising:
    performing, by the encoding network, key point annotation processing on an acquired unlabeled face image, and outputting the face image annotated with at least one first key point to the prediction network as a processing result; and
    outputting, by the prediction network, a face model including three-dimensional or two-dimensional coordinates of at least two second key points on the face image, wherein the number of the second key points is greater than the number of the first key points.
  20. The method of face detection according to claim 19, wherein the encoding network and the prediction network obtained by training are installed on a mobile terminal.
  21. A computer storage medium storing computer program code which, when run on a processor, causes the processor to perform the steps of the method according to any one of claims 13 to 20.
  22. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method according to any one of claims 13 to 20.
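For readers who want to see the structure recited in claims 1, 3, and 4 as running code, below is a minimal, non-authoritative sketch in PyTorch of the four recited modules. All layer shapes, the 68-point/1220-point sparse-dense split, the 7-dimensional pose vector, and the fixed UV-index lookup table are assumptions for illustration; the claims do not bind the network to any of these choices.

```python
import torch
import torch.nn as nn

class EncodingNetwork(nn.Module):
    """Lightweight encoder (claim 3); depths and widths are illustrative."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, image):
        return self.backbone(image)

class FirstDecodingNetwork(nn.Module):
    """Regresses a UV coordinate map from the encoder output (claim 4)."""
    def __init__(self, feat_dim=128, uv_size=64):
        super().__init__()
        self.uv_size = uv_size
        self.fc = nn.Linear(feat_dim, 3 * uv_size * uv_size)

    def forward(self, feat):
        # One 3D coordinate per UV-map cell.
        return self.fc(feat).view(-1, 3, self.uv_size, self.uv_size)

def extract_first_keypoints(uv_map, uv_indices):
    """Look up the predicted coordinates of the annotated first key points
    via the fixed UV-map-to-keypoint mapping (claim 4)."""
    b, _, h, w = uv_map.shape
    flat = uv_map.reshape(b, 3, h * w)
    return flat[:, :, uv_indices].permute(0, 2, 1)  # (B, K, 3)

class SecondDecodingNetwork(nn.Module):
    """Regresses the face pose change parameters; a 7-vector
    (rotation quaternion + translation) is assumed here."""
    def __init__(self, feat_dim=128, pose_dim=7):
        super().__init__()
        self.fc = nn.Linear(feat_dim, pose_dim)

    def forward(self, feat):
        return self.fc(feat)

class PredictionNetwork(nn.Module):
    """Outputs a face model with more second key points than the
    annotated first key points (claim 1)."""
    def __init__(self, feat_dim=128, num_sparse=68, pose_dim=7, num_dense=1220):
        super().__init__()
        self.num_dense = num_dense
        self.fc = nn.Linear(feat_dim + num_sparse * 3 + pose_dim, num_dense * 3)

    def forward(self, feat, sparse_kpts, pose):
        x = torch.cat([feat, sparse_kpts.flatten(1), pose], dim=1)
        return self.fc(x).view(-1, self.num_dense, 3)

# Wiring the modules together on a dummy batch:
enc, dec1 = EncodingNetwork(), FirstDecodingNetwork()
dec2, pred = SecondDecodingNetwork(), PredictionNetwork()
image = torch.randn(2, 3, 128, 128)
feat = enc(image)
uv_indices = torch.randint(0, 64 * 64, (68,))  # stand-in for a real UV table
sparse = extract_first_keypoints(dec1(feat), uv_indices)
dense = pred(feat, sparse, dec2(feat))          # (2, 1220, 3)
```

The only structural constraint the claims actually impose on this part is that the number of second key points (1220 here) be larger than the number of annotated first key points (68 here).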
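Claims 8 to 12 and 18 define the three losses in terms of a standard 3D face model aligned to the predicted key points. The claims do not say how the transformation matrix is computed; the Umeyama similarity alignment and the mean-absolute deviations below are one plausible reading, sketched in NumPy, with all function names invented for illustration.

```python
import numpy as np

def make_3dmm_face(mean_shape, shape_basis, coeffs):
    """Parametric 3DMM face (claim 10): vertices = mean + basis @ coeffs.
    mean_shape: (3N,), shape_basis: (3N, M), coeffs: (M,)."""
    return (mean_shape + shape_basis @ coeffs).reshape(-1, 3)

def estimate_transform(standard_kpts, predicted_kpts):
    """Estimate scale s, rotation R, translation t mapping the standard
    model's key points onto the predicted key points -- one way to obtain
    the 'transformation matrix' of claim 9. Umeyama (1991) closed form;
    both inputs are (K, 3) arrays."""
    mu_s = standard_kpts.mean(axis=0)
    mu_p = predicted_kpts.mean(axis=0)
    src = standard_kpts - mu_s
    dst = predicted_kpts - mu_p
    cov = dst.T @ src / len(src)
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    D = np.diag([1.0, 1.0, d])          # guard against reflections
    R = U @ D @ Vt
    scale = np.trace(np.diag(S) @ D) / src.var(axis=0).sum()
    t = mu_p - scale * R @ mu_s
    return scale, R, t

def first_coordinates(standard_kpts, predicted_kpts):
    """'First coordinate values' of claim 9: the standard model's key
    points after applying the estimated transform."""
    s, R, t = estimate_transform(standard_kpts, predicted_kpts)
    return s * (R @ standard_kpts.T).T + t

# Claim 8: first loss = deviation between predicted and annotated coords.
def loss_first(predicted_kpts, annotated_kpts):
    return np.abs(predicted_kpts - annotated_kpts).mean()

# Claim 11: second loss = deviation between the predicted coordinates
# and the first coordinate values.
def loss_second(predicted_kpts, standard_kpts):
    aligned = first_coordinates(standard_kpts, predicted_kpts)
    return np.abs(predicted_kpts - aligned).mean()

# Claim 12: third loss = deviation between the prediction network's
# dense output and the corresponding first coordinate values.
def loss_third(dense_pred, dense_first_coords):
    return np.abs(dense_pred - dense_first_coords).mean()
```

Mean absolute error is used purely as a stand-in for "deviation"; an L2 or Wing loss would satisfy the claims equally well. The staged schedule of claims 5 to 7 (train the encoding and first decoding networks jointly, then freeze the encoder's determined parameters to train the second decoding network, then the prediction network) follows by toggling which parameters receive gradient updates at each stage.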
PCT/CN2021/126065 2020-10-28 2021-10-25 Face detection neural network and training method, face detection method, and storage medium WO2022089360A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011173829.3A CN112287820A (en) 2020-10-28 2020-10-28 Face detection neural network, face detection neural network training method, face detection method and storage medium
CN202011173829.3 2020-10-28

Publications (1)

Publication Number Publication Date
WO2022089360A1

Family

ID=74373633

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/126065 WO2022089360A1 (en) 2020-10-28 2021-10-25 Face detection neural network and training method, face detection method, and storage medium

Country Status (2)

Country Link
CN (1) CN112287820A (en)
WO (1) WO2022089360A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287820A (en) * 2020-10-28 2021-01-29 广州虎牙科技有限公司 Face detection neural network, face detection neural network training method, face detection method and storage medium
CN112465278A (en) * 2021-02-01 2021-03-09 聚时科技(江苏)有限公司 Video prediction method, device, computer equipment and readable storage medium
CN113096001A (en) * 2021-04-01 2021-07-09 咪咕文化科技有限公司 Image processing method, electronic device and readable storage medium
CN113129362B (en) * 2021-04-23 2024-05-10 北京地平线机器人技术研发有限公司 Method and device for acquiring three-dimensional coordinate data
WO2023097479A1 (en) * 2021-11-30 2023-06-08 华为技术有限公司 Model training method and apparatus, and method and apparatus for constructing three-dimensional structure of auricle
CN115345931B (en) * 2021-12-15 2023-05-26 禾多科技(北京)有限公司 Object attitude key point information generation method and device, electronic equipment and medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200285837A1 (en) * 2015-06-24 2020-09-10 Samsung Electronics Co., Ltd. Face recognition method and apparatus
CN107122705A (en) * 2017-03-17 2017-09-01 中国科学院自动化研究所 Face critical point detection method based on three-dimensional face model
CN109960986A (en) * 2017-12-25 2019-07-02 北京市商汤科技开发有限公司 Human face posture analysis method, device, equipment, storage medium and program
US20190295302A1 (en) * 2018-03-22 2019-09-26 Northeastern University Segmentation Guided Image Generation With Adversarial Networks
CN108805977A (en) * 2018-06-06 2018-11-13 浙江大学 A kind of face three-dimensional rebuilding method based on end-to-end convolutional neural networks
CN109034131A (en) * 2018-09-03 2018-12-18 福州海景科技开发有限公司 A kind of semi-automatic face key point mask method and storage medium
CN110287846A (en) * 2019-06-19 2019-09-27 南京云智控产业技术研究院有限公司 A kind of face critical point detection method based on attention mechanism
CN110516642A (en) * 2019-08-30 2019-11-29 电子科技大学 A kind of lightweight face 3D critical point detection method and system
CN110705355A (en) * 2019-08-30 2020-01-17 中国科学院自动化研究所南京人工智能芯片创新研究院 Face pose estimation method based on key point constraint
CN111652105A (en) * 2020-05-28 2020-09-11 南京审计大学 Face feature point positioning method based on depth measurement learning
CN112287820A (en) * 2020-10-28 2021-01-29 广州虎牙科技有限公司 Face detection neural network, face detection neural network training method, face detection method and storage medium

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114845067A (en) * 2022-07-04 2022-08-02 中科计算技术创新研究院 Hidden space decoupling-based depth video propagation method for face editing
CN115393487A (en) * 2022-10-27 2022-11-25 科大讯飞股份有限公司 Virtual character model processing method and device, electronic equipment and storage medium
CN115426505A (en) * 2022-11-03 2022-12-02 北京蔚领时代科技有限公司 Preset expression special effect triggering method based on face capture and related equipment
CN115426505B (en) * 2022-11-03 2023-03-24 北京蔚领时代科技有限公司 Preset expression special effect triggering method based on face capture and related equipment
CN115578392A (en) * 2022-12-09 2023-01-06 深圳智能思创科技有限公司 Line detection method, device and storage medium
CN115578392B (en) * 2022-12-09 2023-03-03 深圳智能思创科技有限公司 Line detection method, device and storage medium
CN116055211B (en) * 2023-02-14 2023-11-17 成都理工大学工程技术学院 Method and system for identifying identity and automatically logging in application based on neural network
CN116055211A (en) * 2023-02-14 2023-05-02 成都理工大学工程技术学院 Method and system for identifying identity and automatically logging in application based on neural network
CN116229008A (en) * 2023-03-06 2023-06-06 北京百度网讯科技有限公司 Image processing method and device
CN116229008B (en) * 2023-03-06 2023-12-12 北京百度网讯科技有限公司 Image processing method and device
CN116309591A (en) * 2023-05-19 2023-06-23 杭州健培科技有限公司 Medical image 3D key point detection method, model training method and device
CN116309591B (en) * 2023-05-19 2023-08-25 杭州健培科技有限公司 Medical image 3D key point detection method, model training method and device
CN116469175B (en) * 2023-06-20 2023-08-29 青岛黄海学院 Visual interaction method and system for infant education
CN116469175A (en) * 2023-06-20 2023-07-21 青岛黄海学院 Visual interaction method and system for infant education
CN116665284A (en) * 2023-08-02 2023-08-29 深圳宇石科技有限公司 Face modeling and mask model partition matching method, device, terminal and medium
CN116665284B (en) * 2023-08-02 2023-11-28 深圳宇石科技有限公司 Face modeling and mask model partition matching method, device, terminal and medium

Also Published As

Publication number Publication date
CN112287820A (en) 2021-01-29

Similar Documents

Publication Publication Date Title
WO2022089360A1 (en) Face detection neural network and training method, face detection method, and storage medium
CN109214343B (en) Method and device for generating face key point detection model
CN108509915B (en) Method and device for generating face recognition model
KR102663519B1 (en) Cross-domain image transformation techniques
US11880927B2 (en) Three-dimensional object reconstruction from a video
US11238272B2 (en) Method and apparatus for detecting face image
CN110148085B (en) Face image super-resolution reconstruction method and computer readable storage medium
CN111553267B (en) Image processing method, image processing model training method and device
CN111401216B (en) Image processing method, model training method, image processing device, model training device, computer equipment and storage medium
US11620521B2 (en) Smoothing regularization for a generative neural network
CN111898696A (en) Method, device, medium and equipment for generating pseudo label and label prediction model
CN109712234B (en) Three-dimensional human body model generation method, device, equipment and storage medium
WO2021208601A1 (en) Artificial-intelligence-based image processing method and apparatus, and device and storage medium
CN111754541A (en) Target tracking method, device, equipment and readable storage medium
US11960570B2 (en) Learning contrastive representation for semantic correspondence
CN109754464B (en) Method and apparatus for generating information
CN111524216B (en) Method and device for generating three-dimensional face data
CN111275784A (en) Method and device for generating image
CN113704531A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN112861575A (en) Pedestrian structuring method, device, equipment and storage medium
US20220222832A1 (en) Machine learning framework applied in a semi-supervised setting to perform instance tracking in a sequence of image frames
CN115861462B (en) Training method and device for image generation model, electronic equipment and storage medium
WO2021127916A1 (en) Facial emotion recognition method, smart device and computer-readabel storage medium
CN110866469A (en) Human face facial features recognition method, device, equipment and medium
CN114298997B (en) Fake picture detection method, fake picture detection device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21885087

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21885087

Country of ref document: EP

Kind code of ref document: A1