CN110321821B - Human face alignment initialization method and device based on three-dimensional projection and storage medium - Google Patents
Human face alignment initialization method and device based on three-dimensional projection and storage medium
- Publication number
- CN110321821B CN201910550111.2A
- Authority
- CN
- China
- Prior art keywords
- dimensional
- face
- test image
- neural network
- convolutional neural
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/64—Three-dimensional objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/169—Holistic features and representations, i.e. based on the facial image taken as a whole
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Data Mining & Analysis (AREA)
- Human Computer Interaction (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Molecular Biology (AREA)
- Evolutionary Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
The embodiments of the invention disclose a face alignment initialization method and device based on three-dimensional projection, and a storage medium. The method comprises the following steps: estimating three-dimensional face position parameters of a test image based on a pre-trained convolutional neural network model; projecting the three-dimensional face position parameters to a standard two-dimensional position to obtain a corresponding two-dimensional projection; and carrying out scaling processing on the two-dimensional projection according to the face frame of the test image to obtain an initialization result of the face position in the test image. With the invention, the initialization can be kept within the convergence radius even under large head-pose variation, thereby improving the accuracy of face alignment in that case.
Description
Technical Field
The invention relates to the technical field of head pose estimation and face alignment, in particular to a face alignment initialization method and device based on three-dimensional projection and a storage medium.
Background
Head pose estimation and face alignment are widely used in human-computer interaction, avatar animation, face recognition/verification, and similar applications, and have therefore been studied extensively in recent years. The two problems are closely intertwined and, when addressed together, complement each other. Head pose estimation from two-dimensional images remains challenging because of the high diversity of face images, and recent approaches have attempted to estimate head pose from depth data instead. Face alignment technology, in contrast, has advanced greatly, and various methods now perform well on in-the-wild images. These methods nevertheless have shortcomings. A study of their failure cases reveals an important common attribute: the head (face) in these images is usually rotated by a large angle away from the frontal pose. The best face alignment techniques proposed in recent years share a similar cascaded pose regression framework, in which face alignment starts from an initial shape (a vector representation of the target landmark positions) and updates the shape from coarse to fine. Alignment schemes under this framework depend strongly on the initialization method; even for the same input image, the output of a cascaded face alignment system may differ under different initialization methods. Each model has a convergence radius: if the initialization lies within the convergence radius of the actual shape, the model can output a reasonable alignment result, otherwise the shape may be placed at the wrong location, as shown in fig. 1. Conventional initializations, such as using an average face shape placed within the face frame or a shape randomly selected from the training set, cannot guarantee that the initialization is within the convergence radius, especially when the head pose varies greatly.
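As a minimal illustration of the cascaded regression framework just described, the sketch below shows how the shape estimate is refined stage by stage from an initialization; the regressor list `stages` and the feature extractor `extract_features` are hypothetical placeholders, not components defined by this patent.

```python
import numpy as np

def cascaded_alignment(image, s_init, stages, extract_features):
    """Coarse-to-fine cascaded shape regression: starting from the
    initialization s_init (a 68x2 array of landmark coordinates), each
    stage predicts an additive shape update from shape-indexed features.
    The result therefore depends on s_init lying within the convergence
    radius of the true shape."""
    s = np.asarray(s_init, dtype=float).copy()
    for regressor in stages:
        features = extract_features(image, s)  # features indexed by the current shape
        s = s + regressor.predict(features)    # refine the shape estimate
    return s
```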
To improve the performance of face alignment under large head-pose variation, cascaded face alignment therefore requires a better initialization scheme based on explicit head pose estimation.
Disclosure of Invention
The embodiments of the invention provide a face alignment initialization method and device based on three-dimensional projection, and a storage medium, which can ensure that the initialization remains within the convergence radius even under large head-pose variation.
The first aspect of the embodiments of the present invention provides a method for initializing face alignment based on three-dimensional projection, which may include:
estimating three-dimensional face position parameters of the test image based on a pre-trained convolutional neural network model;
projecting the three-dimensional face position parameters to a standard two-dimensional position to obtain a corresponding two-dimensional projection;
and carrying out scaling processing on the two-dimensional projection according to the face frame of the test image to obtain an initialization result of the face position in the test image.
Further, the projecting the three-dimensional face position parameter to the standard two-dimensional position to obtain a corresponding two-dimensional projection includes:
preprocessing the three-dimensional face position parameters, wherein the preprocessing comprises translation and focusing processing of the three-dimensional face position parameters;
and projecting the preprocessed three-dimensional face position parameters to a standard two-dimensional position to obtain a corresponding two-dimensional projection.
Further, the method further comprises:
carrying out enhanced head posture labeling on the training data;
and training the convolutional neural network model by using the labeled training data.
Further, the method further comprises:
and estimating the head pose of the test image based on the convolutional neural network model, and labeling the human face frame in the test image.
Further, the method further comprises:
and performing face alignment on the test image based on the initialization result.
A second aspect of the embodiments of the present invention provides a human face alignment initialization apparatus based on three-dimensional projection, which may include:
the three-dimensional parameter estimation module is used for estimating three-dimensional face position parameters of the test image based on a pre-trained convolutional neural network model;
the parameter projection module is used for projecting the three-dimensional face position parameters to a standard two-dimensional position to obtain corresponding two-dimensional projection;
and the initialization processing module is used for carrying out scaling processing on the two-dimensional projection according to the face frame of the test image to obtain an initialization result of the face position in the test image.
Further, the parameter projection module includes:
the parameter preprocessing unit is used for preprocessing the three-dimensional face position parameters, and the preprocessing comprises translation and focusing processing of the three-dimensional face position parameters;
and the parameter projection unit is used for projecting the preprocessed three-dimensional face position parameters to a standard two-dimensional position to obtain a corresponding two-dimensional projection.
Further, the above apparatus further comprises:
the training data labeling module is used for performing enhanced head posture labeling on the training data;
and the model training module is used for training the convolutional neural network model by adopting the marked training data.
Further, the above apparatus further comprises:
and the face frame labeling module is used for estimating the head pose of the test image based on the convolutional neural network model and labeling the face frame in the test image.
Further, the above apparatus further comprises:
and the face alignment module is used for carrying out face alignment on the test image based on the initialization result.
A third aspect of embodiments of the present invention provides a computer storage medium storing a plurality of instructions, the instructions being adapted to be loaded by a processor and to perform the following steps:
estimating three-dimensional face position parameters of the test image based on a pre-trained convolutional neural network model;
projecting the three-dimensional face position parameters to a standard two-dimensional position to obtain a corresponding two-dimensional projection;
and carrying out scaling processing on the two-dimensional projection according to the face frame of the test image to obtain an initialization result of the face position in the test image.
In the embodiments of the invention, the three-dimensional face position parameters of the test image are estimated with the trained convolutional neural network model and projected to the standard two-dimensional position to obtain a two-dimensional projection, and the two-dimensional projection is then scaled according to the face frame of the test image to obtain the initialization result of the face position in the test image. This ensures that the initialization stays within the convergence radius even under large head-pose variation, thereby improving face alignment accuracy in that case.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description are only some embodiments of the present invention, and other drawings can be derived from these drawings by those skilled in the art without creative effort.
Fig. 1 is a schematic flowchart of a human face alignment initialization method based on three-dimensional projection according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a convolutional neural network model architecture provided in an embodiment of the present invention;
fig. 3 is a schematic diagram of a three-dimensional face projection effect provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of an initialization process from head pose estimation to face position according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a human face alignment initialization apparatus based on three-dimensional projection according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a parameter projection module according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of another human face alignment initialization apparatus based on three-dimensional projection according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "including" and "having," and any variations thereof, in the description and claims of this invention and the above-described drawings are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements expressly listed, but may also include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The face alignment initialization device based on three-dimensional projection can be terminal equipment such as a mobile phone or a computer.
As shown in fig. 1, the initialization method for face alignment based on three-dimensional projection may at least include the following steps:
and S101, estimating three-dimensional face position parameters of the test image based on a pre-trained convolutional neural network model.
It can be understood that the above-mentioned apparatus can perform enhanced head pose labeling on the training samples in the training sample set, and then train a convolutional neural network model using the labeled training sample set for performing head pose estimation on the samples in the training set. Specifically, when the model is used for estimating the head pose of the sample, small arrangement can be performed on the frame of the face, so that the sample is increased by 3 times. The trained convolutional neural network model can be as shown in fig. 2, the network inputs a 96x96 gray-scale face image, and the normalization is between 0 and 1. The feature extraction stage includes 3 convolutional layers, 3 collection layers, 2 fully-connected layers, and 3 exit layers. When presented as a regression problem, the output layer is 3x1, representing the pitch, yaw and roll angles of the head pose, respectively. These angles are normalized between-1 and 1. Alternatively, the Nesterov accelerated gradient descent method may be used for parameter optimization, and the momentum may be set to 0.9 and the learning rate to 0.01. The training is controlled by an early stop strategy, which ends after 1300 iterations on a TeslaK40c GPU for about two hours in a specific training example.
Specifically, the above device may estimate the three-dimensional face position parameters of the test image based on the trained convolutional neural network model. Optionally, when estimating the head pose of the test image, the device may label a face frame in the test image, and then determine the three-dimensional face position parameters corresponding to the test image based on the labeled face frame. It is understood that the three-dimensional face position parameters may be three-dimensional key point information, for example, 68 key points.
In an alternative embodiment, the apparatus may use the Viola-Jones detector and the HeadHunter detector to provide different face frames for the test image for fair comparison, and for input images on which face detection fails, a reasonable manually set frame may be used.
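For illustration, a Viola-Jones face frame could be obtained with OpenCV roughly as in the sketch below; the cascade file and detection parameters are conventional defaults rather than values given here, and the HeadHunter detector is not shown.

```python
import cv2

# OpenCV's stock Viola-Jones frontal-face cascade.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face_frame(image_path):
    """Return one face frame (x, y, w, h), or None when detection fails,
    in which case a manually set frame would be used instead."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(boxes) == 0:
        return None
    return tuple(boxes[0])
```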
S102: projecting the three-dimensional face position parameters to a standard two-dimensional position to obtain a corresponding two-dimensional projection.
Specifically, a three-dimensional average face shape represented by 68 keypoint positions, i.e., the three-dimensional face position parameters, can be given, as shown in fig. 4. The shape can then be projected to a set of canonical two-dimensional locations under the estimated head pose. More specifically, in order to obtain reasonable projections for all images, the device can preprocess the three-dimensional face position parameters, namely apply a constant translation and focal length, and then project the preprocessed three-dimensional face position parameters to a standard two-dimensional position to obtain a corresponding two-dimensional projection.
To illustrate the process from the three-dimensional face position parameters to the two-dimensional projection, refer to the process shown in fig. 3: a three-dimensional parametric sphere representing the face position in a three-dimensional scene is projected onto a two-dimensional image to obtain a two-dimensional plane map.
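The projection step can be sketched as follows; the rotation convention, the translation value tz, and the focal length f are illustrative assumptions, the text stating only that a constant translation and focal length are used.

```python
import numpy as np

def euler_to_rotation(pitch, yaw, roll):
    """Rotation matrix built from the estimated head-pose angles (radians)."""
    cx, sx = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cz, sz = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def project_mean_shape(S_3d, pose, tz=5.0, f=1000.0):
    """Rotate the 68x3 three-dimensional average face shape by the estimated
    pose, apply a constant translation along the camera axis and a fixed
    focal length, and perspective-project to 68x2 canonical 2D positions."""
    pitch, yaw, roll = pose
    rotated = np.asarray(S_3d, dtype=float) @ euler_to_rotation(pitch, yaw, roll).T
    rotated[:, 2] += tz                          # constant translation
    return f * rotated[:, :2] / rotated[:, 2:3]  # pinhole projection
```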
S103: carrying out scaling processing on the two-dimensional projection according to the face frame of the test image to obtain an initialization result of the face position in the test image.
It can be understood that the above device can scale the standard two-dimensional projection according to the face frame proportion of the test image to obtain the initialization result.
In a specific implementation, the initialization function Γ is represented as follows:

s_0 = Γ(θ, bb, S_3D)

where bb denotes the face frame, S_3D is the three-dimensional average face shape, and θ is the head pose estimated by the convolutional neural network model described above.
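A sketch of the initialization function Γ is given below, building on project_mean_shape from the previous sketch. The specific rule used here, fitting the projection's bounding box into the face frame bb, is an assumption; the text states only that the two-dimensional projection is scaled according to the face frame.

```python
import numpy as np

def gamma_init(theta, bb, S_3d):
    """s0 = Γ(θ, bb, S_3D): project the 3D average face shape under the
    estimated pose θ, then rescale and translate the projection so that
    it fits inside the face frame bb = (x, y, w, h)."""
    proj = project_mean_shape(S_3d, theta)         # 68x2 canonical projection
    x, y, w, h = bb
    pmin, pmax = proj.min(axis=0), proj.max(axis=0)
    scale = min(w / (pmax[0] - pmin[0]), h / (pmax[1] - pmin[1]))
    s0 = (proj - pmin) * scale + np.array([x, y])  # scale, then place in the frame
    return s0                                      # initial 68x2 face shape
```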
It should be noted that, with the embodiment of the present invention, the processes from the head pose estimation to the generation of the 3D average face shape to the face position initialization may be as shown in fig. 4.
Further, the device may perform face alignment on the test image according to the initialization result, and a specific alignment method may be the same as that in the prior art, and is not described in detail here.
In the embodiments of the invention, the three-dimensional face position parameters of the test image are estimated with the trained convolutional neural network model and projected to the standard two-dimensional position to obtain a two-dimensional projection, and the two-dimensional projection is then scaled according to the face frame of the test image to obtain the initialization result of the face position in the test image. This ensures that the initialization stays within the convergence radius even under large head-pose variation, thereby improving face alignment accuracy in that case.
The following describes in detail a face alignment initialization apparatus based on three-dimensional projection according to an embodiment of the present invention with reference to fig. 5 and fig. 6. It should be noted that the face alignment initialization apparatus shown in fig. 5 is used for executing the method of the embodiments of the present invention shown in fig. 1-4. For convenience of description, only the portions related to the embodiments of the present invention are shown; for specific technical details that are not disclosed here, please refer to the embodiments shown in fig. 1-4 of the present invention.
Referring to fig. 5, a schematic structural diagram of a face alignment initialization apparatus based on three-dimensional projection is provided for an embodiment of the present invention. As shown in fig. 5, the face alignment initialization apparatus 10 according to an embodiment of the present invention may include: the three-dimensional parameter estimation module 101, the parameter projection module 102, the initialization processing module 103, the training data labeling module 104, the model training module 105, the face frame labeling module 106 and the face alignment module 107. As shown in fig. 6, the parameter projection module 102 includes a parameter preprocessing unit 1021 and a parameter projection unit 1022.
The three-dimensional parameter estimation module 101 is configured to estimate three-dimensional face position parameters of the test image based on a pre-trained convolutional neural network model.
It is understood that the training data labeling module 104 may perform enhanced head pose labeling on the training samples in the training sample set, and the model training module 105 may then train a convolutional neural network model on the labeled training sample set for estimating the head pose of the samples in the training sample set. Specifically, when this model is used to estimate the head pose of a sample, small perturbations can be applied to the face frame, increasing the number of samples by a factor of 3. The trained convolutional neural network model can be as shown in fig. 2: the network input is a 96x96 grayscale face image, normalized to between 0 and 1. The feature extraction stage includes 3 convolutional layers, 3 pooling layers, 2 fully-connected layers, and 3 dropout layers. Cast as a regression problem, the output layer is 3x1, representing the pitch, yaw and roll angles of the head pose, each normalized to between -1 and 1. Optionally, the Nesterov accelerated gradient descent method may be used for parameter optimization, with the momentum set to 0.9 and the learning rate to 0.01. Training is controlled by an early-stopping strategy; in a specific training example, it ended after 1300 iterations on a Tesla K40c GPU, taking about two hours.
In a specific implementation, the three-dimensional parameter estimation module 101 may estimate the three-dimensional face position parameters of the test image based on the trained convolutional neural network model. Optionally, the face frame labeling module 106 may label a face frame in the test image when estimating the head pose of the test image, and the three-dimensional parameter estimation module 101 may then determine the three-dimensional face position parameters corresponding to the test image based on the labeled face frame. It is understood that the three-dimensional face position parameters may be three-dimensional key point information, for example, 68 key points.
In an alternative embodiment, the apparatus 10 may use the Viola-Jones detector and the HeadHunter detector to provide different face frames for the test image for fair comparison, and for input images on which face detection fails, a reasonable manually set frame may be used.
The parameter projection module 102 is configured to project the three-dimensional face position parameters to a standard two-dimensional position to obtain a corresponding two-dimensional projection.
In a specific implementation, a three-dimensional average face shape represented by 68 keypoint locations, i.e., the three-dimensional face position parameters, can be given, as shown in fig. 4. The parameter projection module 102 may then project the shape to a set of canonical two-dimensional locations under the estimated head pose. More specifically, in order to obtain reasonable projections for all images, the parameter preprocessing unit 1021 may preprocess the three-dimensional face position parameters, namely apply a constant translation and focal length, and the parameter projection unit 1022 may then project the preprocessed three-dimensional face position parameters to a standard two-dimensional position to obtain a corresponding two-dimensional projection.
To illustrate the process from the three-dimensional face position parameters to the two-dimensional projection, refer to the process shown in fig. 3: a three-dimensional parametric sphere representing the face position in a three-dimensional scene is projected onto a two-dimensional image to obtain a two-dimensional plane map.
The initialization processing module 103 is configured to perform scaling processing on the two-dimensional projection according to the face frame of the test image to obtain an initialization result of the face position in the test image.
It can be understood that the initialization processing module 103 may scale the standard two-dimensional projection according to the face frame proportion of the test image to obtain an initialization result.
In a specific implementation, the initialization function Γ is represented as follows:

s_0 = Γ(θ, bb, S_3D)

where bb denotes the face frame, S_3D is the three-dimensional average face shape, and θ is the head pose estimated by the convolutional neural network model described above.
It should be noted that, with the embodiment of the present invention, the processes from the head pose estimation to the generation of the 3D average face shape to the face position initialization may be as shown in fig. 4.
Further, the face alignment module 107 may perform face alignment on the test image according to the initialization result, and a specific alignment method may be the same as that in the prior art, which is not described in detail herein.
In the embodiments of the invention, the three-dimensional face position parameters of the test image are estimated with the trained convolutional neural network model and projected to the standard two-dimensional position to obtain a two-dimensional projection, and the two-dimensional projection is then scaled according to the face frame of the test image to obtain the initialization result of the face position in the test image. This ensures that the initialization stays within the convergence radius even under large head-pose variation, thereby improving face alignment accuracy in that case.
An embodiment of the present invention further provides a computer storage medium, which may store a plurality of instructions adapted to be loaded by a processor to execute the method steps of the embodiments shown in fig. 1 to 5; for the specific execution process, refer to the specific descriptions of the embodiments shown in fig. 1 to 5, which are not repeated here.
The embodiment of the application also provides another face alignment initialization apparatus based on three-dimensional projection. As shown in fig. 7, the face alignment initialization apparatus 20 based on three-dimensional projection may include: at least one processor 201 (e.g., a CPU), at least one network interface 204, a user interface 203, a memory 205, at least one communication bus 202, and optionally a display 206. The communication bus 202 is used to enable connection and communication between these components. The user interface 203 may include a touch screen, a keyboard or a mouse, among others. The network interface 204 may optionally include a standard wired interface or a wireless interface (e.g., a WI-FI interface), and a communication connection may be established with the server via the network interface 204. The memory 205 may be a high-speed RAM or a non-volatile memory, such as at least one disk memory; in the embodiment of the present invention, the memory 205 includes a flash memory. The memory 205 may optionally be at least one storage system located remotely from the processor 201. As shown in fig. 7, the memory 205, as a type of computer storage medium, may include an operating system, a network communication module, a user interface module, and program instructions.
It should be noted that the network interface 204 may be connected to a receiver, a transmitter, or other communication module, and the other communication module may include, but is not limited to, a WiFi module, a bluetooth module, and the like.
The processor 201 may be configured to call program instructions stored in the memory 205 and cause the three-dimensional projection based face alignment initialization apparatus 20 to perform the following operations:
estimating three-dimensional face position parameters of the test image based on a pre-trained convolutional neural network model;
projecting the three-dimensional face position parameters to a standard two-dimensional position to obtain a corresponding two-dimensional projection;
and carrying out scaling processing on the two-dimensional projection according to the face frame of the test image to obtain an initialization result of the face position in the test image.
In some embodiments, the apparatus 20 is further configured to perform preprocessing on the three-dimensional face position parameters, where the preprocessing includes translation and focusing processing on the three-dimensional face position parameters;
and projecting the preprocessed three-dimensional face position parameters to a standard two-dimensional position to obtain a corresponding two-dimensional projection.
In some embodiments, the apparatus 20 is further configured to perform enhanced head pose labeling on the training data;
and to train the convolutional neural network model using the labeled training data.
In some embodiments, the apparatus 20 is further configured to estimate the head pose of the test image based on the convolutional neural network model and to label a face frame in the test image.
In some embodiments, the apparatus 20 is further configured to perform face alignment on the test image based on the initialization result.
In the embodiments of the invention, the three-dimensional face position parameters of the test image are estimated with the trained convolutional neural network model and projected to the standard two-dimensional position to obtain a two-dimensional projection, and the two-dimensional projection is then scaled according to the face frame of the test image to obtain the initialization result of the face position in the test image. This ensures that the initialization stays within the convergence radius even under large head-pose variation, thereby improving face alignment accuracy in that case.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure describes only preferred embodiments of the present invention and, of course, is not intended to limit the scope of the claims of the present invention.
Claims (10)
1. A human face alignment initialization method based on three-dimensional projection is characterized by comprising the following steps:
estimating three-dimensional face position parameters of the test image based on a pre-trained convolutional neural network model;
projecting the three-dimensional face position parameters to a standard two-dimensional position to obtain a corresponding two-dimensional projection;
scaling the two-dimensional projection according to the face frame of the test image to obtain an initialization result of the face position in the test image;
when the convolutional neural network model is adopted to estimate the head pose of a sample, small perturbations are applied to the face frame, increasing the number of samples by a factor of 3; the network input of the trained convolutional neural network model is a 96x96 grayscale face image, normalized to between 0 and 1; the output layer is 3x1 and respectively represents the pitch angle, the yaw angle and the roll angle of the head pose, and these angles are normalized to between -1 and 1; and parameter optimization is performed using the Nesterov accelerated gradient descent method.
2. The method of claim 1, wherein the projecting of the three-dimensional face position parameters to a standard two-dimensional position to obtain a corresponding two-dimensional projection comprises:
preprocessing the three-dimensional face position parameters, wherein the preprocessing comprises translation and focusing processing of the three-dimensional face position parameters;
and projecting the preprocessed three-dimensional face position parameters to a standard two-dimensional position to obtain a corresponding two-dimensional projection.
3. The method of claim 1, further comprising:
carrying out enhanced head posture labeling on the training data;
and training the convolutional neural network model by using the labeled training data.
4. The method of claim 3, further comprising:
and estimating the head pose of the test image based on the convolutional neural network model, and labeling the human face frame in the test image.
5. The method of claim 1, further comprising:
and performing face alignment on the test image based on the initialization result.
6. A face alignment initialization device based on three-dimensional projection is characterized by comprising:
the three-dimensional parameter estimation module is used for estimating three-dimensional face position parameters of the test image based on a pre-trained convolutional neural network model;
the parameter projection module is used for projecting the three-dimensional face position parameters to a standard two-dimensional position to obtain corresponding two-dimensional projection;
the initialization processing module is used for carrying out scaling processing on the two-dimensional projection according to the face frame of the test image to obtain an initialization result of the face position in the test image;
when the convolutional neural network model is adopted to estimate the head pose of a sample, small perturbations are applied to the face frame, increasing the number of samples by a factor of 3; the network input of the trained convolutional neural network model is a 96x96 grayscale face image, normalized to between 0 and 1; the output layer is 3x1 and respectively represents the pitch angle, the yaw angle and the roll angle of the head pose, and these angles are normalized to between -1 and 1; and parameter optimization is performed using the Nesterov accelerated gradient descent method.
7. The apparatus of claim 6, wherein the parameter projection module comprises:
the parameter preprocessing unit is used for preprocessing the three-dimensional face position parameters, and the preprocessing comprises translation and focusing processing of the three-dimensional face position parameters;
and the parameter projection unit is used for projecting the preprocessed three-dimensional face position parameters to a standard two-dimensional position to obtain a corresponding two-dimensional projection.
8. The apparatus of claim 6, further comprising:
the training data labeling module is used for carrying out enhanced head posture labeling on the training data;
and the model training module is used for training the convolutional neural network model by adopting the marked training data.
9. The apparatus of claim 8, further comprising:
and the face frame labeling module is used for estimating the head pose of the test image based on the convolutional neural network model and labeling a face frame in the test image.
10. A computer storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor and to perform the steps of:
estimating three-dimensional face position parameters of the test image based on a pre-trained convolutional neural network model;
projecting the three-dimensional face position parameters to a standard two-dimensional position to obtain a corresponding two-dimensional projection;
scaling the two-dimensional projection according to the face frame of the test image to obtain an initialization result of the face position in the test image;
when the convolutional neural network model is adopted to estimate the head pose of a sample, small perturbations are applied to the face frame, increasing the number of samples by a factor of 3; the network input of the trained convolutional neural network model is a 96x96 grayscale face image, normalized to between 0 and 1; the output layer is 3x1 and respectively represents the pitch angle, the yaw angle and the roll angle of the head pose, and these angles are normalized to between -1 and 1; and parameter optimization is performed using the Nesterov accelerated gradient descent method.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910550111.2A CN110321821B (en) | 2019-06-24 | 2019-06-24 | Human face alignment initialization method and device based on three-dimensional projection and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910550111.2A CN110321821B (en) | 2019-06-24 | 2019-06-24 | Human face alignment initialization method and device based on three-dimensional projection and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110321821A CN110321821A (en) | 2019-10-11 |
CN110321821B (en) | 2022-10-25
Family
ID=68121171
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910550111.2A Active CN110321821B (en) | 2019-06-24 | 2019-06-24 | Human face alignment initialization method and device based on three-dimensional projection and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110321821B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112308764A (en) * | 2020-10-12 | 2021-02-02 | 杭州三坛医疗科技有限公司 | Image registration method and device |
CN114360031B (en) * | 2022-03-15 | 2022-06-21 | 南京甄视智能科技有限公司 | Head pose estimation method, computer device, and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108038474A (en) * | 2017-12-28 | 2018-05-15 | 深圳云天励飞技术有限公司 | Method for detecting human face, the training method of convolutional neural networks parameter, device and medium |
CN108256459A (en) * | 2018-01-10 | 2018-07-06 | 北京博睿视科技有限责任公司 | Library algorithm is built in detector gate recognition of face and face based on multiple-camera fusion automatically |
CN109753875A (en) * | 2018-11-28 | 2019-05-14 | 北京的卢深视科技有限公司 | Face identification method, device and electronic equipment based on face character perception loss |
CN109858433A (en) * | 2019-01-28 | 2019-06-07 | 四川大学 | A kind of method and device based on three-dimensional face model identification two-dimension human face picture |
- 2019-06-24: CN CN201910550111.2A, patent CN110321821B (en), status Active
Also Published As
Publication number | Publication date |
---|---|
CN110321821A (en) | 2019-10-11 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |