CN117333914A - Single-image-based face key point extraction method and electronic equipment - Google Patents


Info

Publication number
CN117333914A
Authority
CN
China
Prior art keywords
face
face key
key point
key points
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211430645.XA
Other languages
Chinese (zh)
Inventor
许瀚誉
杨智远
吴连朋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Juhaokan Technology Co Ltd
Original Assignee
Juhaokan Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Juhaokan Technology Co Ltd filed Critical Juhaokan Technology Co Ltd
Priority to CN202211430645.XA
Publication of CN117333914A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00Manipulating 3D models or images for computer graphics
    • G06T19/20Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The application provides a face key point extraction method based on a single image and an electronic device, which are used for improving the accuracy of 3D face key points. The method comprises: inputting the face image into a pre-trained face key point extraction network to obtain the position of each 2D face key point; projecting each labeled 3D face key point in an initial three-dimensional face model into the face image to obtain the position of each 2D projected face key point; obtaining an error value based on the positions of the 2D projected face key points and the positions of the 2D face key points; if the error value is greater than a first specified threshold, adjusting the three-dimensional face parameters of the initial three-dimensional face model and returning to the step of projecting each labeled 3D face key point in the preset initial three-dimensional face model into the face image, until the error value is not greater than the first specified threshold; and obtaining the position of each 3D face key point according to each labeled 3D face key point and each 2D face key point in the adjusted initial three-dimensional face model.

Description

Single-image-based face key point extraction method and electronic equipment
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a face key point extraction method based on a single image and an electronic device.
Background
The face key point extraction method is to identify a face image based on a face detection technology, and determine position information of face key feature points such as eyes, nose, mouth, face contours and the like in the face image. At present, the mainstream face key point extraction method mainly comprises a deformable template method, a point distribution model method, a graph model method, a cascade shape regression method and the like.
In practical applications, most face key point extraction methods in the prior art first obtain 2D (two-dimensional) face key points and then estimate 3D (three-dimensional) face key points according to a virtually defined camera. However, the virtually defined camera is fixed and cannot share the same perspective and distortion relationship as the real camera that captured each picture. Therefore, the accuracy of the determined 3D face key points is low.
Disclosure of Invention
The application provides a face key point extraction method based on a single image and electronic equipment, which are used for extracting 3D face key points and improving the accuracy of the extracted 3D face key points.
In a first aspect, an embodiment of the present application provides a face key point extraction method based on a single image, including:
For any face image, inputting the face image into a pre-trained face key point extraction network model for 2D face key point extraction, and obtaining the position of each 2D face key point in the face image;
projecting each marked 3D face key point in a preset initial three-dimensional face model into the face image to obtain the position of each 2D projected face key point in the face image;
obtaining error values based on the positions of the 2D projection face key points and the positions of the 2D face key points;
if the error value is greater than a first specified threshold, adjusting three-dimensional face parameters of the initial three-dimensional face model and then returning to the step of projecting each labeled 3D face key point in the preset initial three-dimensional face model into the face image to obtain the position of each 2D projected face key point in the face image, until the error value is not greater than the first specified threshold, and determining the adjusted initial three-dimensional face model as a target three-dimensional face model;
and according to the marked 3D face key points and the 2D face key points in the target three-dimensional face model, obtaining the positions of the 3D face key points in the face image.
A second aspect of the present application provides an electronic device, including a processor and a memory, the processor and the memory being connected by a bus;
the memory has stored therein a computer program, the processor being configured to perform the following operations based on the computer program:
for any face image, inputting the face image into a pre-trained face key point extraction network model for 2D face key point extraction, and obtaining the position of each 2D face key point in the face image;
projecting each marked 3D face key point in a preset initial three-dimensional face model into the face image to obtain the position of each 2D projected face key point in the face image;
obtaining error values based on the positions of the 2D projection face key points and the positions of the 2D face key points;
if the error value is greater than a first specified threshold, adjusting three-dimensional face parameters of the initial three-dimensional face model and then returning to the step of projecting each labeled 3D face key point in the preset initial three-dimensional face model into the face image to obtain the position of each 2D projected face key point in the face image, until the error value is not greater than the first specified threshold, and determining the adjusted initial three-dimensional face model as a target three-dimensional face model;
And according to the marked 3D face key points and the 2D face key points in the target three-dimensional face model, obtaining the positions of the 3D face key points in the face image.
According to a third aspect provided by an embodiment of the present invention, there is provided a computer storage medium storing a computer program for executing the method according to the first aspect.
In the above embodiments of the present application, a face image is input into a pre-trained face key point extraction network model to perform 2D face key point extraction, so as to obtain the position of each 2D face key point in the face image. Each labeled 3D face key point in a preset initial three-dimensional face model is then projected into the face image to obtain the position of each 2D projected face key point in the face image. The three-dimensional face parameters of the initial three-dimensional face model are adjusted based on the error value obtained from the positions of the 2D projected face key points and the positions of the 2D face key points, and the adjusted initial three-dimensional face model is determined as a target three-dimensional face model. Finally, the positions of the 3D face key points in the face image are obtained according to the labeled 3D face key points and the 2D face key points in the target three-dimensional face model. In this way, the three-dimensional face parameters of the initial three-dimensional face model are continuously adjusted based on the error value obtained from the positions of the 2D projected face key points and the positions of the 2D face key points, so as to overcome the influence of the perspective and distortion of the virtual camera, thereby improving the accuracy of the extracted 3D face key points.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort to a person skilled in the art.
Fig. 1 schematically illustrates a first application scenario provided in an embodiment of the present application;
fig. 2 schematically illustrates a second application scenario provided in an embodiment of the present application;
fig. 3 schematically illustrates a third application scenario provided in an embodiment of the present application;
fig. 4 schematically illustrates a first flowchart of a single image-based face key point extraction method according to an embodiment of the present application;
FIG. 5 schematically illustrates dense 2D face keypoints provided by embodiments of the present application;
fig. 6 is a schematic flow chart of training a face key point extraction network model according to an embodiment of the present application;
fig. 7 is a schematic flow chart of determining a training sample of a virtual face image according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a face key point extraction network model according to an embodiment of the present application;
FIG. 9 is a flow chart illustrating the determination of error values provided by embodiments of the present application;
fig. 10 schematically illustrates 3D face key points provided in an embodiment of the present application;
fig. 11 illustrates a second flowchart of a face key point extraction method based on a single image according to an embodiment of the present application;
fig. 12 schematically illustrates a structural diagram of a face key point extraction device based on a single image according to an embodiment of the present application;
fig. 13 is an exemplary hardware configuration diagram of an electronic device according to an embodiment of the present application.
Detailed Description
To make the purposes, embodiments and advantages of the present application clearer, the exemplary embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described exemplary embodiments are only some, not all, of the embodiments of the present application.
All other embodiments obtained by a person of ordinary skill in the art based on the exemplary embodiments described herein without inventive effort fall within the scope of the appended claims. Furthermore, while the disclosure is presented in the context of one or more exemplary embodiments, it should be appreciated that individual aspects of the disclosure may separately constitute a complete embodiment.
It should be noted that the brief description of the terms in the present application is only for convenience in understanding the embodiments described below, and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms first, second and the like in the description and in the claims of the present application and in the above-described figures, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprise" and "have," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to those elements expressly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The term "module" as used in this application refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and/or software code that is capable of performing the function associated with that element.
The ideas of the embodiments of the present application are summarized below.
In the prior art, face key point extraction methods mainly obtain 2D face key points first and then estimate 3D face key points according to a virtually defined camera. However, the virtually defined camera is fixed and cannot share the same perspective and distortion relationship as the real camera that captured each picture. Therefore, the accuracy of the determined 3D face key points is low.
To address the low accuracy of 3D face key points in the prior art, the embodiments of the present application provide a face key point extraction method based on a single image. The face image is input into a pre-trained face key point extraction network model for 2D face key point extraction, and the positions of the 2D face key points in the face image are obtained. Each labeled 3D face key point in a preset initial three-dimensional face model is then projected into the face image to obtain the position of each 2D projected face key point in the face image. The three-dimensional face parameters of the initial three-dimensional face model are adjusted based on the error value obtained from the positions of the 2D projected face key points and the positions of the 2D face key points, and the adjusted initial three-dimensional face model is determined as a target three-dimensional face model. Finally, the positions of the 3D face key points in the face image are obtained according to the labeled 3D face key points and the 2D face key points in the target three-dimensional face model. In this way, the three-dimensional face parameters of the initial three-dimensional face model are continuously adjusted based on the error value obtained from the positions of the 2D projected face key points and the positions of the 2D face key points, so as to overcome the influence of the perspective and distortion of the virtual camera, thereby improving the accuracy of the extracted 3D face key points.
Embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 schematically illustrates an application scenario provided in an embodiment of the present application; as shown in fig. 1, the application scenario is described by taking an electronic device as a server. The application scene includes a camera 110 and a server 120. The server 120 may be implemented by a single server or by a plurality of servers. The server 120 may be implemented by a physical server or may be implemented by a virtual server.
In a possible application scenario, the camera 110 sends an acquired face image to the server 120, and after the server 120 receives the face image, the face image is input into a pre-trained face key point extraction network model to perform 2D face key point extraction, so as to obtain the position of each 2D face key point in the face image; then the server 120 projects each marked 3D face key point in the preset initial three-dimensional face model into the face image to obtain the position of each 2D projected face key point in the face image; obtaining error values based on the positions of the 2D projection face key points and the positions of the 2D face key points; if the error value is greater than a first specified threshold, after three-dimensional face parameters of the initial three-dimensional face model are adjusted, returning to the step of projecting each marked 3D face key point in the preset initial three-dimensional face model into the face image to obtain the position of each 2D projected face key point in the face image, and determining the adjusted initial three-dimensional face model as a target three-dimensional face model until the error value is not greater than the first specified threshold; finally, the server 120 obtains the position of each 3D face key point in the face image according to each labeled 3D face key point and each 2D face key point in the target three-dimensional face model.
Fig. 2 is a schematic diagram of another application scenario of the present application, where the application includes a camera 110, a server 120, and a terminal device 130. The camera 110 sends the acquired face image to the server 120, and after the server 120 receives the face image, the face image is input into a pre-trained face key point extraction network model to extract 2D face key points, so that the positions of all 2D face key points in the face image are obtained; then the server 120 projects each marked 3D face key point in the preset initial three-dimensional face model into the face image to obtain the position of each 2D projected face key point in the face image; obtaining error values based on the positions of the 2D projection face key points and the positions of the 2D face key points; if the error value is greater than a first specified threshold, after three-dimensional face parameters of the initial three-dimensional face model are adjusted, returning to the step of projecting each marked 3D face key point in the preset initial three-dimensional face model into the face image to obtain the position of each 2D projected face key point in the face image, and determining the adjusted initial three-dimensional face model as a target three-dimensional face model until the error value is not greater than the first specified threshold; finally, the server 120 obtains the positions of the 3D face key points in the face image according to the 3D face key points and the 2D face key points in the target three-dimensional face model, and sends the positions of the 3D face key points in the face image to the terminal device 130 for display.
As shown in fig. 3, another application scenario of the present application is schematically illustrated, where the application scenario includes a camera 110, a server 120, a terminal device 130, and a memory 140. The camera 110 sends the acquired face image to the server 120, and after the server 120 receives the face image, the face image is input into a pre-trained face key point extraction network model to extract 2D face key points, so that the positions of all 2D face key points in the face image are obtained; then the server 120 projects each marked 3D face key point in the preset initial three-dimensional face model into the face image to obtain the position of each 2D projected face key point in the face image; obtaining error values based on the positions of the 2D projection face key points and the positions of the 2D face key points; if the error value is greater than a first specified threshold, after three-dimensional face parameters of the initial three-dimensional face model are adjusted, returning to the step of projecting each marked 3D face key point in the preset initial three-dimensional face model into the face image to obtain the position of each 2D projected face key point in the face image, and determining the adjusted initial three-dimensional face model as a target three-dimensional face model until the error value is not greater than the first specified threshold; finally, the server 120 obtains the positions of the 3D face key points in the face image according to the 3D face key points and the 2D face key points in the target three-dimensional face model, stores the positions of the 3D face key points in the face image in the memory 140, and then sends the positions of the 3D face key points in the face image to the terminal device 130 for display.
In this description, only a single camera 110, a single server 120, a single terminal device 130, and a single memory 140 are described in detail, but it should be understood by those skilled in the art that the illustrated camera 110, server 120, terminal device 130, and memory 140 are intended to represent the operations of these components that relate to the technical solution of the present application, rather than implying a limitation on the number, type, or location of cameras 110, servers 120, terminal devices 130, and memories 140. It should be noted that the underlying concepts of the example embodiments of the present application are not altered if additional modules are added to or individual modules are removed from the illustrated environment.
Exemplary terminal devices 130 include, but are not limited to: a visual large screen, a tablet, a notebook, a palmtop, a mobile internet device (Mobile Internet Device, MID), a wearable device, a Virtual Reality (VR) device, an Augmented Reality (AR) device, a wireless terminal device in industrial control, a wireless terminal device in unmanned driving, a wireless terminal device in a smart grid, a wireless terminal device in transportation security, a wireless terminal device in a smart city, or a wireless terminal device in a smart home; the terminal device may have an associated client installed thereon, which may be software (e.g., a browser, short video software, etc.), a web page, an applet, etc.
It should be noted that the single-image-based face key point extraction method provided by the present application is applicable not only to the application scenarios shown in fig. 1, fig. 2 and fig. 3, but also to any device capable of single-image-based face key point extraction.
In the following, a single image-based face key point extraction method according to an exemplary embodiment of the present application will be described with reference to the accompanying drawings in conjunction with the above-described application scenario, it should be noted that the above-described application scenario is only shown for the convenience of understanding the method and principle of the present application, and embodiments of the present application are not limited in this respect.
As shown in fig. 4, a flow chart of a face key point extraction method based on a single image may include the following steps:
step 401: for any face image, inputting the face image into a pre-trained face key point extraction network model for 2D face key point extraction, and obtaining the position of each 2D face key point in the face image;
the number of the extracted 2D face key points in the embodiment is greater than that in the prior art, and the extracted 2D face key points in the embodiment are dense. As shown in fig. 5, the extracted dense 2D face key points are respectively.
Next, a detailed description will be given of a training manner of the face key point extraction network model, as shown in fig. 6, which is a flow chart for training the face key point extraction network model, and includes the following steps:
step 601: obtaining a virtual face image training sample, wherein the virtual face image training sample comprises each virtual face image and the labeling position of each 2D face key point in each virtual face image;
fig. 7 is a schematic diagram of a specific flow for determining a training sample of a virtual face image, including the following steps:
step 701: obtaining an initial three-dimensional face model based on preset initial three-dimensional face parameters;
the initial three-dimensional face parameters comprise attitude parameters, shape parameters and expression parameters.
In one embodiment, in this embodiment, the initial three-dimensional face model is obtained through a deformable three-dimensional face template, that is, a preset initial parameter is input into the deformable three-dimensional face template, so as to obtain the initial three-dimensional face model. The deformed three-dimensional face template can be any template parameterized model with topological consistency, and a FLAME model is taken as an example for description:
the FLAME model is a parameterized linear face model based on statistics and comprising posture parameters, shape parameters and expression parameters. Wherein the pose parameters include global head nodes, neck nodes, chin nodes, and left and right eye nodes. By giving out the attitude parameters to carry out linear binding skin (Linear Blending Skinning, LBS) and giving out the attitude parameters and the expression parameters to carry out linear combination, the final three-dimensional face reconstruction based on the flame template can be obtained, wherein the three-dimensional face reconstruction based on the flame template can be obtained through a formula (1):
Where beta is a physical parameter, theta is the attitude parameter,the expression parameters, namely the preset initial three-dimensional face parameters in the embodiment, T P And W is a preset skin weight, J is a preset function related to the shape parameters in the flame template, and W is a linear binding skin function.
The function T_P can be determined by formula (2):
T_P(β, θ, ψ) = T + B_S(β; S) + B_P(θ; P) + B_E(ψ; ζ)    (2)
where T is the FLAME template in its initial state; S is the preset shape parameter basis and B_S is the linear function that combines the shape parameter β with the shape basis S of the FLAME template; P is the preset pose parameter basis and B_P is the linear function that combines the pose parameter θ with the pose basis P of the FLAME template; and ζ is the preset expression parameter basis and B_E is the linear function that combines the expression parameter ψ with the expression basis ζ of the FLAME template.
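To make the linear composition of formulas (1) and (2) concrete, the following sketch assembles the deformed template vertices from shape, pose and expression blendshape offsets and applies a heavily simplified per-joint skinning; the array shapes, the simplified skinning and all function names are illustrative assumptions, not the FLAME implementation itself.

```python
import numpy as np

def blendshape_offset(params, basis):
    """Linearly combine parameters with a blendshape basis.
    basis: (num_params, num_vertices, 3); params: (num_params,)."""
    return np.tensordot(params, basis, axes=1)          # -> (num_vertices, 3)

def deformed_template(T_bar, beta, S, theta, P, psi, E):
    """Formula (2): T_P = T + B_S(beta; S) + B_P(theta; P) + B_E(psi; E)."""
    return (T_bar
            + blendshape_offset(beta, S)
            + blendshape_offset(theta, P)
            + blendshape_offset(psi, E))

def linear_blend_skinning(T_P, joints, joint_rotations, skin_weights):
    """Crude stand-in for formula (1): rotate each vertex about the joints it
    is bound to and blend the results with the skinning weights 𝒲."""
    V = np.zeros_like(T_P)
    for j, (R, J_j) in enumerate(zip(joint_rotations, joints)):
        V += skin_weights[:, j:j + 1] * ((T_P - J_j) @ R.T + J_j)
    return V
```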
Step 702: determining the positions of all the marked 3D face key points in the initial three-dimensional face model based on the received marking instructions of the 3D face key points;
the labeling instruction comprises the position of each labeling 3D face key point in the initial three-dimensional face model, the index of the face piece where each labeling 3D face point is located and the gravity center coordinate of each face piece.
Step 703: respectively carrying out deformation update on the initial three-dimensional face model by utilizing a plurality of preset intermediate three-dimensional face parameters to obtain a plurality of updated three-dimensional face models;
the intermediate three-dimensional face parameters in this embodiment include a plurality of different combination shape parameters, and the initial three-dimensional face model is deformed by the plurality of different combination shape parameters to generate different individuals. 5000 individuals were generated in this example. And the geometrical similarity of the individual bodies to each other is less than 0.5. However, the number of generated individuals is not limited in this embodiment, and the specific number of individuals may be set according to actual situations.
Secondly, the intermediate three-dimensional face parameters in the embodiment comprise exp expression parameters of a plurality of different combinations, and the expression reconstruction is performed on the initial three-dimensional face model through the plurality of different combination shape expression parameters, so that different expressions are generated. And the intermediate three-dimensional face parameters comprise a plurality of different combinations of phase gesture parameters, and the gesture reconstruction is carried out on the initial three-dimensional face model through the plurality of different combinations of phase gesture parameters to obtain a plurality of different gestures.
And then each individual randomly combines the first appointed number of expressions, and randomly combines the second appointed number of gesture parameters to obtain a plurality of updated three-dimensional face models.
It should be noted that: the first specified number is 200 in this embodiment and the second specified number is 20. However, the first specified number and the second specified number in the present embodiment may be set according to actual conditions, and the specific values of the first specified number and the second specified number are not limited here.
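Under the stated numbers (5000 individuals, 200 expressions, 20 poses), the random combination of identities, expressions and poses can be sketched as follows; the parameter dimensions, the random distributions and the sample counts in the loop are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_INDIVIDUALS = 5000     # distinct identities (shape combinations)
NUM_EXPRESSIONS = 200      # first specified number
NUM_POSES = 20             # second specified number

# Illustrative parameter dimensions (assumed, not from the embodiment).
SHAPE_DIM, EXPR_DIM, POSE_DIM = 100, 50, 15

shape_params = rng.normal(0.0, 1.0, size=(NUM_INDIVIDUALS, SHAPE_DIM))
expr_params  = rng.normal(0.0, 0.5, size=(NUM_EXPRESSIONS, EXPR_DIM))
pose_params  = rng.normal(0.0, 0.2, size=(NUM_POSES, POSE_DIM))

updated_models = []
for beta in shape_params[:10]:                     # a few identities for brevity
    expr_ids = rng.choice(NUM_EXPRESSIONS, size=5, replace=False)
    pose_ids = rng.choice(NUM_POSES, size=3, replace=False)
    for psi in expr_params[expr_ids]:
        for theta in pose_params[pose_ids]:
            # deformed_template(...) from the previous sketch would produce
            # the updated three-dimensional face model for (beta, theta, psi).
            updated_models.append((beta, theta, psi))
```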
Step 704: respectively reconstructing the plurality of updated three-dimensional face models and preset virtual face texture images to obtain a plurality of reconstructed three-dimensional face models;
In this embodiment, a plurality of different high-precision face texture images are acquired. During acquisition, the subjects wear hair covers to occlude the hair, wear no glasses, keep a plain face, and are under uniform illumination. The generated high-precision face texture images are then manually post-processed and beautified to produce the virtual face texture images.
It should be noted that: the manner of generating the virtual face image in this embodiment may be adjusted according to the actual situation, and this embodiment is not limited herein.
Step 705: rendering the reconstructed three-dimensional face models with a plurality of preset background images respectively to obtain a plurality of virtual face images and labeling positions of 2D face key points in the virtual face images, and determining the virtual face images and the labeling positions of the 2D face key points in the virtual face images as the virtual face image training samples, wherein the labeling positions of the 2D face key points in the virtual face images are obtained based on the labeling 3D face key points in the reconstructed three-dimensional face models.
In this embodiment, different background images are added to the reconstructed three-dimensional face models with a rendering tool for rendering, so as to generate large-scale virtual face images. During rendering, 1000 types of male and female hairstyles with different styles and colors are randomly combined, together with randomly combined hair ornaments, glasses and the like.
In addition, during rendering, the labeled 3D face key points in the reconstructed three-dimensional face model are also projected into the image. During projection, each projected 2D face key point is labeled according to the perspective relation of the camera projection: the pixel position of each 2D face key point visible in the image is labeled and its confidence is set to 1, and the pixel position of each invisible 2D face key point is labeled and its confidence is set to 0. A large-scale virtual face image training sample is finally generated.
It should be noted that: in this embodiment, the 3D face key points are projected using prior-art methods, and the specific projection method is not limited. Whether a 2D face key point is visible or invisible is determined based on instructions input by a user, and visible and invisible 2D face key points may be distinguished by different colors.
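As an illustration of projecting labeled 3D key points and attaching a visibility confidence of 1 or 0, a minimal pinhole-camera sketch is given below; the intrinsic matrix, the externally supplied visibility mask (e.g. from the renderer) and all names are assumptions, not the projection method prescribed by the embodiment.

```python
import numpy as np

def project_keypoints(points_3d, K, R, t, visible_mask):
    """Project labeled 3D face key points into the image and attach confidences.
    points_3d: (N, 3) model coordinates; K: (3, 3) camera intrinsics;
    R, t: camera extrinsics; visible_mask: (N,) booleans from the renderer."""
    cam = points_3d @ R.T + t                      # model -> camera coordinates
    uv = cam @ K.T                                 # perspective projection
    uv = uv[:, :2] / uv[:, 2:3]                    # normalize by depth
    conf = visible_mask.astype(np.float32)         # 1 for visible, 0 for occluded
    return uv, conf

# Hypothetical usage with a single key point.
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
uv, conf = project_keypoints(np.array([[0.0, 0.0, 0.5]]), K,
                             np.eye(3), np.array([0.0, 0.0, 1.0]),
                             np.array([True]))
```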
Step 602: inputting the virtual face image training sample into a face key point extraction network model to extract 2D face key points, and obtaining predicted positions of the 2D face key points in each virtual face image;
step 603, obtaining training error values based on the labeling positions of the 2D face key points and the predicted positions of the 2D face key points;
in one embodiment, for any 2D face key point, based on the labeling position of the 2D face key point and the predicted position of the 2D face key point, a sub error value corresponding to the 2D face key point is obtained, and the sub error values corresponding to the 2D face key points are added to obtain the training error value. The sub-error value of any 2D face key point can be obtained by the formula (3):
e = √((x_1 − x_2)² + (y_1 − y_2)²)    (3)
where x_1 is the abscissa of the labeling position of the 2D face key point, x_2 is the abscissa of the predicted position of the 2D face key point, y_1 is the ordinate of the labeling position of the 2D face key point, and y_2 is the ordinate of the predicted position of the 2D face key point.
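A minimal sketch of formula (3) and of summing the sub-error values over all 2D face key points; the array shapes and names are assumed.

```python
import numpy as np

def training_error(labeled_pos, predicted_pos):
    """labeled_pos, predicted_pos: (N, 2) arrays of (x, y) pixel positions.
    Each sub-error is the Euclidean distance of formula (3); the training
    error is the sum of the sub-errors over all 2D face key points."""
    sub_errors = np.linalg.norm(labeled_pos - predicted_pos, axis=1)
    return sub_errors.sum()
```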
Step 604: judging whether the training error value is greater than a second designated threshold, if so, executing step 605, and if not, executing step 606;
It should be noted that: the second specified threshold value in the present embodiment may be set according to actual situations, and the specific value of the second specified threshold value is not limited in the present embodiment.
Step 605: after the specified parameters in the face key point extraction network model are adjusted, returning to the execution step 602;
the adjustment manner of the specified parameter in this embodiment may be to increase or decrease the specified value for the specified parameter each time, and the adjustment manner of different specified parameters may be the same or different, and the specific adjustment manner may be set according to the actual situation, which is not limited in this embodiment.
Step 606: and finishing training the face key point extraction network model to obtain the trained face key point extraction network model.
In order to make the face key point extraction network model of the present application more accurate and thus further improve extraction accuracy, in one embodiment, after step 606 is executed, the trained face key point extraction network model is retrained with a real face image training sample to obtain a retrained face key point extraction network model, and the retrained model is determined as the trained face key point extraction network model, where the real face image training sample is obtained based on the initial three-dimensional face model and the acquired real three-dimensional face models.
The real face training sample in the embodiment includes each real face image and the labeling position of each 2D face key point in each real face image.
The real face images are obtained by fitting the initial three-dimensional face model to each acquired real three-dimensional face model to obtain fitted real three-dimensional face models, and projecting each fitted real three-dimensional face model into the acquired frontal face image, thereby obtaining each real face image and the labeling position of each 2D face key point in each real face image.
Since the labeled 3D face key points exist in the initial three-dimensional face model, they also exist in the fitted real three-dimensional face model. When the fitted real three-dimensional face model is projected into the acquired frontal face image, the labeled 3D face key points are projected as well, thereby obtaining the labeling positions of the 2D face key points.
It should be noted that: the real three-dimensional face model and the frontal face image are obtained by MVS (Multiple View Stereo, multi-view stereoscopic reconstruction) from multi-view real images. The real three-dimensional face model and the face images of all viewing angles can be accurately aligned and projected through the calibrated camera relations. In addition, the fitting of the three-dimensional face model generally builds a loss from the vertex-to-surface distance between the initial three-dimensional face model and the real three-dimensional face model, and continuously optimizes this loss to a minimum, thereby obtaining the pose parameters, shape parameters, expression parameters and the like of the fitted real three-dimensional face model. The specific fitting method is not limited in this embodiment and may be set according to the actual situation.
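As a rough illustration of the fitting described above, the sketch below approximates the vertex-to-surface distance with a vertex-to-nearest-vertex distance and minimizes it by naive finite-difference descent; the loss form, the optimizer and the assumed helper `reconstruct_face` are illustrative only, not the fitting method of the embodiment.

```python
import numpy as np

def fit_loss(params, real_vertices, reconstruct_face):
    """Distance from each template vertex to its nearest vertex on the real
    scan -- a crude stand-in for the vertex-to-surface distance loss."""
    template_vertices = reconstruct_face(params)              # (N, 3), assumed helper
    d = np.linalg.norm(template_vertices[:, None, :] -
                       real_vertices[None, :, :], axis=-1)    # (N, M) pairwise distances
    return d.min(axis=1).mean()

def fit_parameters(params, real_vertices, reconstruct_face,
                   lr=1e-2, steps=100, eps=1e-4):
    """Naive finite-difference descent over pose/shape/expression parameters."""
    params = params.copy()
    for _ in range(steps):
        grad = np.zeros_like(params)
        base = fit_loss(params, real_vertices, reconstruct_face)
        for i in range(params.size):
            p = params.copy()
            p[i] += eps
            grad[i] = (fit_loss(p, real_vertices, reconstruct_face) - base) / eps
        params -= lr * grad
    return params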
Therefore, in this embodiment, the face key point extraction network model is further trained with the real face image training sample. The network structure remains unchanged; on the basis of the previously trained face key point extraction network model, training continues with the real face image training sample, which refines the model so that the resulting face key point extraction network model has higher extraction accuracy and better generalization.
It should be noted that: the method for training the network model by using the real face image training sample is the same as the above-mentioned method, and the embodiment will not be described here again.
After introducing the training of the face key point extraction network model, the specific flow of 2D face key point extraction (as in step 602) based on the face key point extraction network model is described below. As shown in fig. 8, which is a schematic structural diagram of the face key point extraction network model, the face key point extraction network model 800 includes a max pooling layer 801, a feature pyramid (FPN) layer 802, a Transformer layer 803 and a fully connected layer 804.
The step of 2D face key point extraction from a face image is described below in combination with the structure of the face key point extraction network model. First, the face image is downsampled with the max pooling layer 801 to obtain a downsampled face image. Features are then extracted from the downsampled face image with the feature pyramid (FPN) layer 802 to obtain a feature vector set. The vectors in the feature vector set are screened by the Transformer layer 803 to obtain an intermediate feature vector set. The intermediate feature vector set is then fused with the visibility feature of the face image to obtain a target feature vector set, where the visibility feature is preset or obtained based on the previous frame of the face image. Finally, feature stitching is performed on the target feature vector set with the fully connected layer 804 to obtain the positions of the 2D face key points in the face image.
If 3D face key points need to be extracted for every frame of face image, the visibility feature of the current frame is the average of the confidences of all 2D face key points in the previous frame, where the confidences of the 2D face key points are obtained from the pre-trained face key point extraction network model; if 3D face key points are extracted for only one frame of face image, the visibility feature of that face image is preset.
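The described structure (max pooling, an FPN feature extractor, a Transformer layer for screening, fusion with a visibility feature, and a fully connected head) can be sketched in PyTorch as follows; the channel sizes, the stand-in backbone, the number of key points and the way the visibility scalar is concatenated are all assumptions rather than the exact network of the embodiment.

```python
import torch
import torch.nn as nn

class KeypointNet(nn.Module):
    """Sketch of the described structure: max pooling -> FPN-style features ->
    Transformer screening -> fusion with a visibility feature -> FC head."""
    def __init__(self, num_keypoints=468, dim=128):
        super().__init__()
        self.pool = nn.MaxPool2d(2)                           # downsample the input image
        self.backbone = nn.Sequential(                        # stand-in for an FPN
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((16, 16)))
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                                   batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(dim * 16 * 16 + 1, num_keypoints * 3)  # x, y, confidence

    def forward(self, image, visibility):
        """image: (B, 3, H, W); visibility: (B, 1), e.g. the mean confidence of
        the key points in the previous frame, or a preset value."""
        x = self.pool(image)
        feats = self.backbone(x)                              # (B, dim, 16, 16)
        tokens = feats.flatten(2).transpose(1, 2)             # (B, 256, dim)
        tokens = self.transformer(tokens)                     # screened feature vectors
        fused = torch.cat([tokens.flatten(1), visibility], dim=1)
        out = self.head(fused).view(-1, self.head.out_features // 3, 3)
        return out[..., :2], out[..., 2]                      # positions, confidences

# Hypothetical usage on a single 256x256 face image.
net = KeypointNet()
pos, conf = net(torch.randn(1, 3, 256, 256), torch.tensor([[1.0]]))
```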
Step 402: projecting each marked 3D face key point in a preset initial three-dimensional face model into the face image to obtain the position of each 2D projected face key point in the face image;
step 403: obtaining error values based on the positions of the 2D projection face key points and the positions of the 2D face key points;
as shown in fig. 9, a specific flow chart for determining an error value includes the following steps:
step 901: screening the 2D face key points by using the confidence coefficient of the 2D face key points to obtain screened 2D face key points, wherein the confidence coefficient of the 2D face key points is obtained when face key point extraction is performed in the pre-trained face key point extraction network model;
and determining each 2D face key point with the confidence coefficient larger than the appointed confidence coefficient as each 2D face key point after screening. And the specified confidence in this embodiment is 0.8. However, the specific value of the specified confidence is not limited, and may be set according to actual situations in this embodiment.
Step 902: aiming at any 2D face key point, obtaining the distance between the 2D face key point and the 2D projection face key point according to the position coordinates of the 2D face key point and the position coordinates of the 2D projection face key point with the same index as the 2D face key point, and determining the distance as an intermediate error value corresponding to the 2D face key point;
The 2D face key points and the labeled 3D face key points each carry an index and are in one-to-one correspondence; the 2D projected face key point obtained by projecting any labeled 3D face key point has the same index as that labeled 3D face key point.
Step 903: and obtaining the error value according to each intermediate error value corresponding to each 2D face key point.
In one embodiment, the error value may be obtained in two ways:
mode one: and adding the intermediate error values corresponding to the 2D face key points to obtain the error values.
Mode two: and carrying out weighted summation on each intermediate error value corresponding to each 2D face key point to obtain the error value.
It should be noted that: the specific manner of determining the error value may be selected according to practical situations, and the present embodiment is not limited herein.
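A sketch of steps 901-903 under assumed array shapes: key points whose confidence exceeds the specified threshold are kept, the distance to the 2D projected key point with the same index is computed for each, and either the plain sum (mode one) or a weighted sum (mode two) is returned.

```python
import numpy as np

def reprojection_error(kp_2d, kp_2d_conf, kp_2d_proj,
                       conf_threshold=0.8, weights=None):
    """kp_2d, kp_2d_proj: (N, 2) positions indexed identically;
    kp_2d_conf: (N,) confidences from the key point extraction network."""
    keep = kp_2d_conf > conf_threshold                             # step 901: screening
    dist = np.linalg.norm(kp_2d[keep] - kp_2d_proj[keep], axis=1)  # step 902: distances
    if weights is None:                                            # step 903, mode one
        return dist.sum()
    return (np.asarray(weights)[keep] * dist).sum()                # step 903, mode two
```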
Step 404: if the error value is greater than a first specified threshold, adjusting the three-dimensional face parameters of the initial three-dimensional face model and then returning to the step of projecting each labeled 3D face key point in the preset initial three-dimensional face model into the face image to obtain the position of each 2D projected face key point in the face image, until the error value is not greater than the first specified threshold, and determining the adjusted initial three-dimensional face model as a target three-dimensional face model;
The method for adjusting the three-dimensional face parameters of the initial three-dimensional face model may be to increase or decrease each parameter by a specified value, the specified values of increasing or decreasing different three-dimensional face parameters may be the same or different, and the specific adjustment method may be set according to the actual situation, which is not limited in this embodiment.
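Step 404 can be sketched as the following fitting loop; the per-parameter increase/decrease strategy, the step size and the helper functions are illustrative assumptions consistent with adjusting each three-dimensional face parameter by a specified value.

```python
import numpy as np

def fit_face_parameters(params, kp_2d, kp_2d_conf,
                        project_keypoints_fn, error_fn,
                        first_threshold=1.0, step=0.05, max_iters=500):
    """Repeat: project the labeled 3D key points, measure the error against the
    detected 2D key points, and adjust each three-dimensional face parameter by
    a specified value until the error is not greater than the threshold."""
    params = params.copy()
    for _ in range(max_iters):
        proj = project_keypoints_fn(params)                  # step 402
        err = error_fn(kp_2d, kp_2d_conf, proj)              # step 403
        if err <= first_threshold:                           # stop condition of step 404
            break
        for i in range(params.size):                         # try +step / -step per parameter
            for delta in (step, -step):
                trial = params.copy()
                trial[i] += delta
                trial_err = error_fn(kp_2d, kp_2d_conf,
                                     project_keypoints_fn(trial))
                if trial_err < err:
                    params, err = trial, trial_err
    return params   # parameters of the target three-dimensional face model
```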
Step 405: and according to the marked 3D face key points and the 2D face key points in the target three-dimensional face model, obtaining the positions of the 3D face key points in the face image.
In this embodiment, the obtained 3D face key points are also dense 3D face key points, as shown in fig. 10, which is a schematic diagram of each dense 3D face key point, and as can be seen from fig. 10, the number of the obtained 3D face key points in this embodiment is relatively large.
In one embodiment, step 405 may be implemented as: for any 2D face key point, determining the labeled 3D face key point with the same index as that 2D face key point as the target labeled 3D face key point corresponding to the 2D face key point; determining the target labeled 3D face key points corresponding to the 2D face key points as the 3D face key points in the face image; and determining the current positions of the target labeled 3D face key points in the target three-dimensional face model as the positions of the 3D face key points in the face image, where the current position of each target labeled 3D face key point is obtained based on the positions of the vertices of the face patch on which it is located.
The current position of each target labeling 3D face key point in the target three-dimensional face model is determined by the following steps:
For any target labeled 3D face key point, the face patch corresponding to the index of that target labeled 3D face key point is determined using the preset correspondence between 3D face key points and face patches, and the current position of the target labeled 3D face key point is obtained from the position coordinates of the vertices of that face patch.
As described above, the labeling instruction includes the position of each labeled 3D face key point in the initial three-dimensional face model, the index of the face patch on which each labeled 3D face key point is located, and the barycentric coordinates of each face patch. Therefore, the spatial relative position (distance and direction) between each labeled 3D face key point and the barycenter of the face patch on which it is located can be determined from the barycentric coordinates and the labeled 3D face key point. Because this spatial relative position between the labeled 3D face key point and its face patch does not change, the current position of the target labeled 3D face key point can be obtained based on the barycentric coordinates of the face patch on which it is located and this spatial relative position.
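The barycentric computation described above can be sketched as follows; the array shapes and names are assumptions.

```python
import numpy as np

def keypoint_position(face_vertex_positions, barycentric):
    """face_vertex_positions: (3, 3) current positions of the three vertices of
    the face patch the labeled key point lies on; barycentric: (3,) coordinates
    recorded in the labeling instruction. The spatial relation between the key
    point and its face patch does not change, so the same coordinates are
    reused after every deformation of the model."""
    return np.asarray(barycentric) @ np.asarray(face_vertex_positions)

# Hypothetical example: a key point at the centroid of a face patch.
pos = keypoint_position([[0, 0, 0], [1, 0, 0], [0, 1, 0]],
                        [1 / 3, 1 / 3, 1 / 3])
```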
To further illustrate the technical solution of the present application, a detailed description is given below with reference to fig. 11, which may include the following steps:
step 1101: obtaining a virtual face image training sample, wherein the virtual face image training sample comprises each virtual face image and the labeling position of each 2D face key point in each virtual face image;
step 1102, inputting the virtual face image training sample into a face key point extraction network model to extract 2D face key points, and obtaining predicted positions of the 2D face key points in each virtual face image;
step 1103 obtains a training error value based on the labeling positions of the 2D face key points and the predicted positions of the 2D face key points;
step 1104: judging whether the training error value is greater than a second designated threshold, if so, executing step 1105, and if not, executing step 1106;
step 1105: after the specified parameters in the face key point extraction network model are adjusted, returning to the execution step 1102;
step 1106: finishing training the face key point extraction network model to obtain a trained face key point extraction network model;
step 1107: retraining the trained face key point extraction network model by using a real face image training sample to obtain a retrained face key point extraction network model, and determining the retrained face key point extraction network model as the trained face key point extraction network model, wherein the real face image training sample is obtained based on the initial three-dimensional face model and the acquired real three-dimensional face model;
Step 1108: inputting the face image into a pre-trained face key point extraction network model for 2D face key point extraction aiming at any face image, and obtaining the position of each 2D face key point in the face image;
step 1109, projecting each marked 3D face key point in a preset initial three-dimensional face model into the face image to obtain the position of each 2D projected face key point in the face image;
step 1110: obtaining error values based on the positions of the 2D projection face key points and the positions of the 2D face key points;
step 1111: judging whether the error value is greater than a first designated threshold, if so, executing step 1112, and if not, executing step 1113;
step 1112, after adjusting the three-dimensional face parameters of the initial three-dimensional face model, returning to step 1109;
step 1113: determining the adjusted initial three-dimensional face model as a target three-dimensional face model;
step 1114: and according to the marked 3D face key points and the 2D face key points in the target three-dimensional face model, obtaining the positions of the 3D face key points in the face image.
Based on the same inventive concept, the single-image-based face key point extraction method of the present disclosure can also be implemented by a single-image-based face key point extraction device. The principle by which the device extracts face key points from a single image is similar to that of the foregoing method, so its implementation is not repeated here.
Fig. 12 is a schematic structural diagram of a single image-based face key point extraction device according to one embodiment of the present disclosure.
As shown in fig. 12, the single image-based face keypoint extraction apparatus 1200 of the present disclosure may include a 2D face keypoint extraction module 1210, a projection module 1220, an error value determination module 1230, an adjustment module 1240, and a 3D face keypoint determination module 1250.
The 2D face key point extraction module 1210 is configured to input, for any one face image, the face image into a pre-trained face key point extraction network model to perform 2D face key point extraction, so as to obtain positions of 2D face key points in the face image;
the projection module 1220 is configured to project each labeled 3D face key point in the preset initial three-dimensional face model into the face image, so as to obtain a position of each 2D projected face key point in the face image;
an error value determining module 1230, configured to obtain an error value based on the positions of the 2D projection face key points and the positions of the 2D face key points;
an adjustment module 1240, configured to, if the error value is greater than a first specified threshold, adjust three-dimensional face parameters of the initial three-dimensional face model, and then return to a step of projecting each labeled 3D face key point in the preset initial three-dimensional face model into the face image to obtain a position of each 2D projected face key point in the face image, until the error value is not greater than the first specified threshold, and determine the adjusted initial three-dimensional face model as a target three-dimensional face model;
And a 3D face key point determining module 1250, configured to obtain the position of each 3D face key point in the face image according to each labeled 3D face key point and each 2D face key point in the target three-dimensional face model.
In one embodiment, the apparatus further comprises:
a training module 1260, configured to obtain the trained face key point extraction network model by:
obtaining a virtual face image training sample, wherein the virtual face image training sample comprises each virtual face image and the labeling position of each 2D face key point in each virtual face image;
inputting the virtual face image training sample into a face key point extraction network model to extract 2D face key points, and obtaining predicted positions of the 2D face key points in each virtual face image;
obtaining training error values based on the labeling positions of the 2D face key points and the predicted positions of the 2D face key points;
and if the training error value is greater than a second specified threshold, after adjusting the specified parameters in the face key point extraction network model, returning to the step of inputting the virtual face image training sample into the face key point extraction network model for 2D face key point extraction to obtain the predicted positions of the 2D face key points in each virtual face image, until the training error value is not greater than the second specified threshold, and ending the training of the face key point extraction network model to obtain the trained face key point extraction network model.
In one embodiment, the apparatus further comprises:
a retraining module 1270, configured to, after training of the face key point extraction network model is finished and the trained face key point extraction network model is obtained, retrain the trained face key point extraction network model with a real face image training sample to obtain a retrained face key point extraction network model, and determine the retrained face key point extraction network model as the trained face key point extraction network model, where the real face image training sample is obtained based on the initial three-dimensional face model and the acquired real three-dimensional face model.
In one embodiment, the apparatus further comprises:
the virtual face image training sample determining module 1280 is configured to determine the virtual face image training sample before the virtual face image training sample is obtained by:
obtaining an initial three-dimensional face model based on preset initial three-dimensional face parameters;
determining the positions of all the marked 3D face key points in the initial three-dimensional face model based on the received marking instructions of the 3D face key points;
Respectively carrying out deformation update on the initial three-dimensional face model by utilizing a plurality of preset intermediate three-dimensional face parameters to obtain a plurality of updated three-dimensional face models;
respectively reconstructing the plurality of updated three-dimensional face models and preset virtual face texture images to obtain a plurality of reconstructed three-dimensional face models;
rendering the reconstructed three-dimensional face models with a plurality of preset background images respectively to obtain a plurality of virtual face images and labeling positions of 2D face key points in the virtual face images, and determining the virtual face images and the labeling positions of the 2D face key points in the virtual face images as the virtual face image training samples, wherein the labeling positions of the 2D face key points in the virtual face images are obtained based on the labeling 3D face key points in the reconstructed three-dimensional face models.
In one embodiment, the face key point extraction network model includes a max pooling layer, a feature pyramid FPN layer, a Transformer layer, and a fully connected layer; the 2D face key point extraction module 1210 is specifically configured to:
downsampling the face image by using a max pooling layer to obtain a downsampled face image; and
extracting features of the downsampled face image by using a feature pyramid FPN layer to obtain a feature vector set;
screening vectors in the feature vector set through a Transformer layer to obtain an intermediate feature vector set;
fusing the intermediate feature vector set with a face image visibility feature to obtain a target feature vector set, wherein the visibility feature is preset or is obtained based on the previous frame of face image;
and performing feature stitching on the target feature vector set by using a fully connected layer to obtain the positions of the 2D face key points in the face image. A simplified code sketch of this forward pass is given below.
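A simplified PyTorch sketch of this forward pass follows. The small convolutional backbone standing in for the FPN, the number of Transformer layers and heads, the mean-pooling of tokens, and the concatenation-based fusion of the visibility feature are all assumptions; only the ordering of the stages (max pooling, FPN features, Transformer screening, visibility fusion, fully connected regression) is taken from the description.

```python
import torch
import torch.nn as nn

class KeypointNetSketch(nn.Module):
    """Simplified stand-in for the described model, not the patent's exact architecture."""
    def __init__(self, num_keypoints=68, feat_dim=256):
        super().__init__()
        self.num_keypoints = num_keypoints
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)              # max pooling layer
        self.backbone = nn.Sequential(                                  # placeholder for the FPN layer
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU())
        encoder_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)   # Transformer layer
        self.fuse = nn.Linear(feat_dim + num_keypoints, feat_dim)       # fuse with the visibility feature
        self.head = nn.Linear(feat_dim, num_keypoints * 2)              # fully connected layer

    def forward(self, image, visibility):
        x = self.pool(image)                                 # downsampled face image
        feats = self.backbone(x)                              # convolutional feature maps
        tokens = feats.flatten(2).transpose(1, 2)             # feature vector set, shape (B, N, C)
        tokens = self.transformer(tokens)                     # intermediate (screened) feature vector set
        pooled = tokens.mean(dim=1)                           # global descriptor, shape (B, C)
        fused = self.fuse(torch.cat([pooled, visibility], dim=-1))   # target feature vectors
        return self.head(fused).view(-1, self.num_keypoints, 2)      # 2D face key point positions
```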
In one embodiment, the error value determining module 1230 is specifically configured to:
screening the 2D face key points by using the confidence coefficient of the 2D face key points to obtain screened 2D face key points, wherein the confidence coefficient of the 2D face key points is obtained when face key point extraction is performed in the pre-trained face key point extraction network model;
for any 2D face key point, obtaining the distance between the 2D face key point and the 2D projected face key point according to the position coordinates of the 2D face key point and the position coordinates of the 2D projected face key point with the same index as the 2D face key point, and determining the distance as an intermediate error value corresponding to the 2D face key point;
and obtaining the error value according to each intermediate error value corresponding to each 2D face key point. A code sketch of this error computation is given below.
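A small sketch of the error value computation follows, assuming NumPy arrays in which row i of the detected key points corresponds to row i of the projected key points (i.e., matching indices). The confidence threshold and the mean aggregation are assumptions; the patent leaves both unspecified.

```python
import numpy as np

def reprojection_error(points_2d, confidences, projected_2d, conf_threshold=0.5):
    """Screen 2D face key points by confidence, measure each one's distance to the
    2D projected face key point with the same index, and aggregate the intermediate
    error values (here by averaging, which is an assumed choice)."""
    keep = confidences > conf_threshold               # screened 2D face key points
    if not np.any(keep):
        return np.inf                                 # nothing reliable to compare against
    distances = np.linalg.norm(points_2d[keep] - projected_2d[keep], axis=1)
    return float(distances.mean())
```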
In one embodiment, the 3D face key point determining module 1250 is specifically configured to:
for any 2D face key point, determining the labeled 3D face key point with the same index as the 2D face key point as a target labeled 3D face key point corresponding to the 2D face key point; and
and determining the target labeled 3D face key points corresponding to the 2D face key points as the 3D face key points in the face image, and determining the current positions of the target labeled 3D face key points in the target three-dimensional face model as the positions of the 3D face key points in the face image, wherein the current positions of the target labeled 3D face key points are obtained based on the positions of the vertices of the face patches where the target labeled 3D face key points are located.
In one embodiment, the apparatus further comprises:
a current position determining module 1290, configured to determine the current position of the target labeled 3D face key point in the target three-dimensional face model by:
for any target labeled 3D face key point, determining the face patch corresponding to the index of the target labeled 3D face key point by using a preset correspondence between 3D face key points and face patches;
and obtaining the current position of the target labeled 3D face key point according to the position coordinates of the vertices of the face patch. A code sketch of one plausible reading of this step is given below.
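The sketch below shows one plausible reading of this step: the key point's current position is interpolated from the vertices of the face patch (triangle) it lies on. The barycentric weights, and the centroid fallback when none are given, are assumptions; the patent only states that the position is obtained from the vertex coordinates of the face patch.

```python
import numpy as np

def keypoint_current_position(vertices, face_vertex_indices, bary_weights=None):
    """Recover a labeled 3D face key point's current position from the vertices of
    the face patch it lies on. Barycentric interpolation is an assumed reading."""
    tri = vertices[face_vertex_indices]                # (3, 3) positions of the patch's three vertices
    if bary_weights is None:
        bary_weights = np.full(3, 1.0 / 3.0)           # fall back to the triangle centroid
    return bary_weights @ tri                          # weighted combination of the vertex coordinates
```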
Having described a face key point extraction method and apparatus based on a single image according to an exemplary embodiment of the present invention, next, an electronic device according to another exemplary embodiment of the present invention is described.
Those skilled in the art will appreciate that the various aspects of the invention may be implemented as a system, method, or program product. Accordingly, aspects of the invention may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein collectively as a "circuit," "module," or "system."
In some possible embodiments, an electronic device according to the invention may comprise at least one processor, and at least one computer storage medium. Wherein the computer storage medium stores program code which, when executed by a processor, causes the processor to perform the steps in the single image-based face key point extraction method according to various exemplary embodiments of the present invention described above in the present specification. For example, the processor may perform steps 401-405 as shown in fig. 4.
An electronic device 1300 according to this embodiment of the invention is described below with reference to fig. 13. The electronic device 1300 shown in fig. 13 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 13, the electronic device 1300 is embodied in the form of a general-purpose electronic device. The components of the electronic device 1300 may include, but are not limited to: at least one processor 1301, at least one computer storage medium 1302, and a bus 1303 connecting the different system components (including the computer storage medium 1302 and the processor 1301).
Bus 1303 represents one or more of several types of bus structures, including a computer storage medium bus or computer storage medium controller, a peripheral bus, a processor, or a local bus using any of a variety of bus architectures.
Computer storage media 1302 can include readable media in the form of volatile computer storage media, such as random access computer storage media (RAM) 1321 and/or cache storage media 1322, and can further include read only computer storage media (ROM) 1323.
Computer storage media 1302 may also include a program/utility 1325 having a set (at least one) of program modules 1324, such program modules 1324 including, but not limited to: an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.
The electronic device 1300 may also communicate with one or more external devices 1304 (e.g., keyboard, pointing device, etc.), one or more devices that enable a user to interact with the electronic device 1300, and/or any device (e.g., router, modem, etc.) that enables the electronic device 1300 to communicate with one or more other electronic devices. Such communication may occur through an input/output (I/O) interface 1305. Also, the electronic device 1300 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through a network adapter 1306. As shown, the network adapter 1306 communicates with the other modules of the electronic device 1300 over the bus 1303. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 1300, including, but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
In some possible embodiments, aspects of a single image-based face key point extraction method provided by the present invention may also be implemented as a program product, which includes program code for causing a computer device to perform the steps of the single image-based face key point extraction method according to the various exemplary embodiments of the present invention as described herein when the program product is run on the computer device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a random access computer storage medium (RAM), a read-only computer storage medium (ROM), an erasable programmable read-only computer storage medium (EPROM or flash memory), an optical fiber, a portable compact disc read-only computer storage medium (CD-ROM), an optical computer storage medium, a magnetic computer storage medium, or any suitable combination of the foregoing.
The program product for single-image-based face key point extraction of embodiments of the present invention may employ a portable compact disc read-only computer storage medium (CD-ROM), include program code, and be run on an electronic device. However, the program product of the present invention is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the consumer electronic device, partly on the consumer electronic device, as a stand-alone software package, partly on the consumer electronic device and partly on the remote electronic device, or entirely on the remote electronic device or server. In the case of remote electronic devices, the remote electronic device may be connected to the consumer electronic device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external electronic device (e.g., connected through the Internet using an Internet service provider).
It should be noted that although several modules of the apparatus are mentioned in the detailed description above, this division is merely exemplary and not mandatory. Indeed, the features and functions of two or more modules described above may be embodied in one module in accordance with embodiments of the present invention. Conversely, the features and functions of one module described above may be further divided into a plurality of modules to be embodied.
Furthermore, although the operations of the methods of the present invention are depicted in the drawings in a particular order, this should not be understood as requiring that the operations be performed in that particular order or that all of the illustrated operations be performed to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk computer storage media, CD-ROM, optical computer storage media, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable computer storage medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable computer storage medium produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. The face key point extraction method based on the single image is characterized by comprising the following steps of:
for any face image, inputting the face image into a pre-trained face key point extraction network model for 2D face key point extraction to obtain the position of each 2D face key point in the face image;
projecting each labeled 3D face key point in a preset initial three-dimensional face model into the face image to obtain the position of each 2D projected face key point in the face image;
obtaining an error value based on the positions of the 2D projected face key points and the positions of the 2D face key points;
if the error value is greater than a first specified threshold, adjusting three-dimensional face parameters of the initial three-dimensional face model and returning to the step of projecting each labeled 3D face key point in the preset initial three-dimensional face model into the face image to obtain the position of each 2D projected face key point in the face image; and when the error value is not greater than the first specified threshold, determining the adjusted initial three-dimensional face model as a target three-dimensional face model;
and obtaining the position of each 3D face key point in the face image according to each labeled 3D face key point in the target three-dimensional face model and each 2D face key point.
2. The method of claim 1, wherein the trained face key point extraction network model is obtained by:
obtaining a virtual face image training sample, wherein the virtual face image training sample comprises each virtual face image and the labeling position of each 2D face key point in each virtual face image;
inputting the virtual face image training sample into a face key point extraction network model to extract 2D face key points, and obtaining predicted positions of the 2D face key points in each virtual face image;
obtaining training error values based on the labeling positions of the 2D face key points and the predicted positions of the 2D face key points;
and if the training error value is greater than a second specified threshold, adjusting the specified parameters in the face key point extraction network model and returning to the step of inputting the virtual face image training sample into the face key point extraction network model for 2D face key point extraction to obtain the predicted positions of the 2D face key points in each virtual face image; training of the face key point extraction network model ends when the training error value is not greater than the second specified threshold, and the trained face key point extraction network model is obtained.
3. The method of claim 2, wherein after the training of the face key point extraction network model is finished, the method further comprises:
Retraining the trained face key point extraction network model by using a real face image training sample to obtain a retrained face key point extraction network model, and determining the retrained face key point extraction network model as the trained face key point extraction network model, wherein the real face image training sample is obtained based on the initial three-dimensional face model and the acquired real three-dimensional face model.
4. The method of claim 2, wherein prior to the obtaining the training samples of the virtual face image, the method further comprises:
determining the virtual face image training sample by:
obtaining an initial three-dimensional face model based on preset initial three-dimensional face parameters;
determining the positions of all the labeled 3D face key points in the initial three-dimensional face model based on received labeling instructions for the 3D face key points;
respectively carrying out deformation update on the initial three-dimensional face model by utilizing a plurality of preset intermediate three-dimensional face parameters to obtain a plurality of updated three-dimensional face models;
respectively reconstructing the plurality of updated three-dimensional face models and preset virtual face texture images to obtain a plurality of reconstructed three-dimensional face models;
rendering the reconstructed three-dimensional face models with a plurality of preset background images respectively to obtain a plurality of virtual face images and labeling positions of 2D face key points in the virtual face images, and determining the virtual face images and the labeling positions of the 2D face key points in the virtual face images as the virtual face image training samples, wherein the labeling positions of the 2D face key points in the virtual face images are obtained based on the labeled 3D face key points in the reconstructed three-dimensional face models.
5. The method of claim 1, wherein the face key point extraction network model comprises a max pooling layer, a feature pyramid FPN layer, a Transformer layer, and a fully connected layer;
inputting the face image into a pre-trained face key point extraction network model for 2D face key point extraction, and obtaining the positions of all 2D face key points in the face image, wherein the method comprises the following steps:
downsampling the face image by using a max pooling layer to obtain a downsampled face image; and
extracting features of the downsampled face image by using a feature pyramid FPN layer to obtain a feature vector set;
screening vectors in the feature vector set through a Transformer layer to obtain an intermediate feature vector set;
fusing the intermediate feature vector set with a face image visibility feature to obtain a target feature vector set, wherein the visibility feature is preset or is obtained based on the previous frame of face image;
and performing feature stitching on the target feature vector set by using a fully connected layer to obtain the positions of the 2D face key points in the face image.
6. The method of claim 1, wherein the obtaining an error value based on the positions of the 2D projected face key points and the positions of the 2D face key points comprises:
screening the 2D face key points by using the confidence coefficient of the 2D face key points to obtain screened 2D face key points, wherein the confidence coefficient of the 2D face key points is obtained when face key point extraction is performed in the pre-trained face key point extraction network model;
for any 2D face key point, obtaining the distance between the 2D face key point and the 2D projected face key point according to the position coordinates of the 2D face key point and the position coordinates of the 2D projected face key point with the same index as the 2D face key point, and determining the distance as an intermediate error value corresponding to the 2D face key point;
and obtaining the error value according to each intermediate error value corresponding to each 2D face key point.
7. The method according to claim 1, wherein the obtaining the position of each 3D face key point in the face image according to each labeled 3D face key point in the target three-dimensional face model and each 2D face key point includes:
for any 2D face key point, determining the labeled 3D face key point with the same index as the 2D face key point as a target labeled 3D face key point corresponding to the 2D face key point; and
and determining the target labeled 3D face key points corresponding to the 2D face key points as the 3D face key points in the face image, and determining the current positions of the target labeled 3D face key points in the target three-dimensional face model as the positions of the 3D face key points in the face image, wherein the current positions of the target labeled 3D face key points are obtained based on the positions of the vertices of the face patches where the target labeled 3D face key points are located.
8. The method according to any one of claims 1-7, wherein the current position of the target labeled 3D face key point in the target three-dimensional face model is determined by:
for any target labeled 3D face key point, determining the face patch corresponding to the index of the target labeled 3D face key point by using a preset correspondence between 3D face key points and face patches;
and obtaining the current position of the target labeled 3D face key point according to the position coordinates of the vertices of the face patch.
9. An electronic device, comprising a processor and a memory, wherein the processor and the memory are connected by a bus;
the memory has stored therein a computer program, the processor being configured to perform the following operations based on the computer program:
for any face image, inputting the face image into a pre-trained face key point extraction network model for 2D face key point extraction to obtain the position of each 2D face key point in the face image;
projecting each labeled 3D face key point in a preset initial three-dimensional face model into the face image to obtain the position of each 2D projected face key point in the face image;
obtaining an error value based on the positions of the 2D projected face key points and the positions of the 2D face key points; if the error value is greater than a first specified threshold, adjusting three-dimensional face parameters of the initial three-dimensional face model and returning to the step of projecting each labeled 3D face key point in the preset initial three-dimensional face model into the face image to obtain the position of each 2D projected face key point in the face image; and when the error value is not greater than the first specified threshold, determining the adjusted initial three-dimensional face model as a target three-dimensional face model;
and obtaining the position of each 3D face key point in the face image according to each labeled 3D face key point in the target three-dimensional face model and each 2D face key point.
10. The electronic device of claim 9, wherein, when executing the step of obtaining the position of each 3D face key point in the face image according to each labeled 3D face key point in the target three-dimensional face model and each 2D face key point, the processor is specifically configured to:
for any 2D face key point, determining the labeled 3D face key point with the same index as the 2D face key point as a target labeled 3D face key point corresponding to the 2D face key point; and
and determining the target labeled 3D face key points corresponding to the 2D face key points as the 3D face key points in the face image, and determining the current positions of the target labeled 3D face key points in the target three-dimensional face model as the positions of the 3D face key points in the face image, wherein the current positions of the target labeled 3D face key points are obtained based on the positions of the vertices of the face patches where the target labeled 3D face key points are located.
CN202211430645.XA 2022-11-15 2022-11-15 Single-image-based face key point extraction method and electronic equipment Pending CN117333914A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211430645.XA CN117333914A (en) 2022-11-15 2022-11-15 Single-image-based face key point extraction method and electronic equipment

Publications (1)

Publication Number Publication Date
CN117333914A true CN117333914A (en) 2024-01-02

Family

ID=89276094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211430645.XA Pending CN117333914A (en) 2022-11-15 2022-11-15 Single-image-based face key point extraction method and electronic equipment

Country Status (1)

Country Link
CN (1) CN117333914A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination