CN116993929B - Three-dimensional face reconstruction method and device based on human eye dynamic change and storage medium - Google Patents

Three-dimensional face reconstruction method and device based on human eye dynamic change and storage medium

Info

Publication number
CN116993929B
Authority
CN
China
Prior art keywords
face
image
eye
training
dimensional face
Prior art date
Legal status
Active
Application number
CN202311263345.1A
Other languages
Chinese (zh)
Other versions
CN116993929A
Inventor
刘梦源
王璇
丁润伟
冯卫妮
孟凡阳
Current Assignee
Peking University Shenzhen Graduate School
Original Assignee
Peking University Shenzhen Graduate School
Priority date
Filing date
Publication date
Application filed by Peking University Shenzhen Graduate School filed Critical Peking University Shenzhen Graduate School
Priority to CN202311263345.1A
Publication of CN116993929A
Application granted
Publication of CN116993929B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/10 Geometric effects
    • G06T15/20 Perspective computation
    • G06T15/205 Image-based rendering
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 Eye characteristics, e.g. of the iris

Abstract

The application discloses a three-dimensional face reconstruction method and device based on human eye dynamic change, and a storage medium, wherein the method comprises the following steps: determining a rendered image using the three-dimensional face produced by an initial three-dimensional face model; determining the eye closing probability using an eye state detector, and adjusting the eye key points based on that probability to obtain an adjusted training image; training the initial three-dimensional face model based on the hybrid loss function and the dynamic loss function determined from the rendered image, the training image and the adjusted training image to obtain a three-dimensional face model; and generating a reconstructed three-dimensional face based on the three-dimensional face model. By determining the eye closing probability from the dynamic details captured by the eye state detector, adjusting the eye key points according to that probability, and introducing a dynamic loss function over the adjusted eye key points into the weakly supervised learning process, the application resolves the inconsistency of three-dimensional reconstruction in local facial regions and improves the accuracy of the reconstructed three-dimensional face.

Description

Three-dimensional face reconstruction method and device based on human eye dynamic change and storage medium
Technical Field
The present disclosure relates to the field of computer vision, and in particular, to a three-dimensional face reconstruction method and apparatus based on dynamic human eye changes, and a storage medium.
Background
With the development of Virtual Reality (VR), the Metaverse and related concepts in recent years, three-dimensional face reconstruction from a single image has attracted wide attention. It is a promising task in the field of computer vision whose goal is to quickly and accurately reconstruct a lifelike three-dimensional face from a two-dimensional RGB face image.
The current mainstream approach is CNN-based three-dimensional face reconstruction. It emphasizes the overall facial contour and expression-driven details and can provide distinct characteristics for different 3D face meshes, but it cannot guarantee the consistency of the dynamic features of the eye region, which affects the accuracy of the reconstructed three-dimensional face.
Disclosure of Invention
In view of the defects of the prior art, the technical problem to be solved by the application is to provide a three-dimensional face reconstruction method, device and storage medium based on human eye dynamic change.
In order to solve the above technical problems, a first aspect of an embodiment of the present application provides a three-dimensional face reconstruction method based on dynamic changes of human eyes, where the method includes:
determining a three-dimensional face corresponding to the training image by using an initial three-dimensional face model, and determining a rendered image based on the three-dimensional face;
determining eye closing probability corresponding to the training image by using a preset eye state detector, and adjusting human eye key points in the training image based on the eye closing probability to obtain an adjusted training image;
determining a hybrid loss function based on the rendered image and the training image, and determining a dynamic loss function based on the rendered image and the adjusted training image;
training the initial three-dimensional face model based on the hybrid loss function and the dynamic loss function to obtain a three-dimensional face model;
and generating a reconstructed three-dimensional face corresponding to the image to be reconstructed based on the three-dimensional face model.
The three-dimensional face reconstruction method based on human eye dynamic change, wherein the determining the three-dimensional face corresponding to the training image by using the initial three-dimensional face model specifically comprises the following steps:
inputting a training image into a face shape parameter regression network, and determining face parameters corresponding to the training image through the face shape parameter regression network;
and inputting the face parameters into a face reconstruction network, and generating a three-dimensional face through the face reconstruction network.
The three-dimensional face reconstruction method based on human eye dynamic change, wherein the adjusting the human eye key points in the training image based on the eye closing probability specifically comprises the following steps:
for each upper eyelid key point in the human eye key points, acquiring a lower eyelid key point corresponding to the upper eyelid key point;
calculating the distance between the upper eyelid key point and the lower eyelid key point in a preset direction;
and adjusting the key points of the upper eyelid along a preset direction based on the distance and the eye closing probability.
The three-dimensional face reconstruction method based on human eye dynamic change, wherein the adjustment formula of the key points of the upper eyelid is as follows:
$y_i' = y_i + p\,(y_j - y_i)$

where $y_i$ represents the upper eyelid key point, $y_j$ represents the lower eyelid key point corresponding to the upper eyelid key point, and $p$ represents the eye closure probability.
The three-dimensional face reconstruction method based on human eye dynamic change, wherein the determining the hybrid loss function based on the rendered image and the training image specifically comprises the following steps:
acquiring the rendering face information of the rendered image and the face information of the training image, wherein the rendering face information comprises rendering face key points, a rendering face region and rendering face features, and the face information comprises the face key points, the face region and the face features;
calculating an image layer loss function based on the rendered face key points, the rendered face region, the face key points and the face region;
and calculating a perception layer loss function based on the rendered face features and the face features, and calculating the hybrid loss function based on the image layer loss function and the perception layer loss function.
The three-dimensional face reconstruction method based on human eye dynamic change, wherein the calculating the image layer loss function based on the rendering face key points, the rendering face region, the face key points and the face region specifically comprises the following steps:
acquiring the skin color probability of each training pixel point in the face region, and determining an attention mask of the face region based on the skin color probability of each pixel point;
calculating a light sensation loss term based on the attention mask and pixel differences of each training pixel point in the face region and a rendering pixel point corresponding to each training pixel point, wherein the rendering pixel point is a pixel point matched with the pixel position of the training pixel point in the rendering face region;
calculating a face landmark point loss term based on the rendered face key points and the face key points;
and calculating an image layer loss function based on the light sensation loss term and the face landmark point loss term.
The three-dimensional face reconstruction method based on human eye dynamic change, wherein the determining the dynamic loss function based on the rendered image and the adjusted training image specifically comprises the following steps:
selecting a plurality of key point pairs from the rendered image and the adjusted training image respectively to form a rendered key point pair set and a training key point pair set, wherein the rendered key point pair set and the training key point pair set correspond to each other, and the rendered key point pair set and the training key point pair set at least comprise eye key point pairs;
and calculating a dynamic loss function based on the key point distance between each rendering key point pair in the rendering key point pair set and the key point distance between each training key point pair in the training key point pair set.
A second aspect of the embodiments of the present application provides a three-dimensional face reconstruction device based on dynamic changes of human eyes, where the device includes:
the training module is used for determining a three-dimensional face corresponding to the training image by utilizing the initial three-dimensional face model and determining a rendered image based on the three-dimensional face; determining the eye closing probability corresponding to the training image by using a preset eye state detector, and adjusting the human eye key points in the training image based on the eye closing probability to obtain an adjusted training image; determining a hybrid loss function based on the rendered image and the training image, and determining a dynamic loss function based on the rendered image and the adjusted training image; and training the initial three-dimensional face model based on the hybrid loss function and the dynamic loss function to obtain a three-dimensional face model;
and the reconstruction module is used for generating a reconstructed three-dimensional face corresponding to the image to be reconstructed based on the three-dimensional face model.
A third aspect of the embodiments of the present application provides a computer readable storage medium storing one or more programs executable by one or more processors to implement the steps in the three-dimensional face reconstruction method based on human eye dynamic change as described in any one of the above.
A fourth aspect of the present embodiment provides a terminal device, including: a processor and a memory;
the memory has stored thereon a computer readable program executable by the processor;
the steps in the three-dimensional face reconstruction method based on human eye dynamic change as described in any one of the above are realized when the processor executes the computer readable program.
The beneficial effects are that: compared with the prior art, the application provides a three-dimensional face reconstruction method and device based on human eye dynamic change, and a storage medium, wherein the method comprises the following steps: determining a three-dimensional face corresponding to the training image by using an initial three-dimensional face model, and determining a rendered image based on the three-dimensional face; determining the eye closing probability corresponding to the training image by using a preset eye state detector, and adjusting the human eye key points in the training image based on the eye closing probability to obtain an adjusted training image; determining a hybrid loss function based on the rendered image and the training image, and determining a dynamic loss function based on the rendered image and the adjusted training image; training the initial three-dimensional face model based on the hybrid loss function and the dynamic loss function to obtain a three-dimensional face model; and generating a reconstructed three-dimensional face corresponding to the image to be reconstructed based on the three-dimensional face model. By determining the eye closing probability from the dynamic details of the eye region captured by the eye state detector, then adjusting the eye key points according to that probability, and introducing a dynamic loss function over the adjusted eye key points into the weakly supervised learning process, the application resolves the inconsistency of three-dimensional reconstruction in local facial regions and improves the accuracy of the reconstructed three-dimensional face.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description concern only some embodiments of the present application, and other drawings may be obtained from these drawings without creative effort by a person of ordinary skill in the art.
Fig. 1 is a flowchart of a three-dimensional face reconstruction method based on human eye dynamic change provided by the application.
Fig. 2 is a schematic flow chart of a training process of a three-dimensional face model.
Fig. 3 is a schematic flow chart of a face reconstruction network.
Fig. 4 is a schematic structural diagram of a three-dimensional face reconstruction device based on human eye dynamic change provided by the application.
Fig. 5 is a schematic structural diagram of a terminal device provided in the present application.
Detailed Description
The application provides a three-dimensional face reconstruction method, device and storage medium based on human eye dynamic change, and in order to make the purposes, technical schemes and effects of the application clearer and more definite, the application is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
It should be understood that the sequence numbers of the steps in the embodiments of the present application do not imply an order of execution; the execution order of each process is determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
Through research, in recent years, with the development of Virtual Reality (VR), the Metaverse and other concepts, three-dimensional face reconstruction from a single image has attracted wide attention and is a promising task in the field of computer vision; its goal is to quickly and accurately reconstruct a lifelike three-dimensional face from a two-dimensional RGB face image.
The current mainstream approach is CNN-based three-dimensional face reconstruction. It emphasizes the overall facial contour and expression-driven details and can provide distinct characteristics for different 3D face meshes, but it cannot guarantee the consistency of the dynamic features of the eye region, which affects the accuracy of the reconstructed three-dimensional face.
In order to solve the above problems, the inventors analyzed their causes and found the main ones to be:
1) the dynamic information of the face is not fully used, particularly the motion details of the eye region, which are critical to accurately reproducing the 3D face mesh;
2) the detector used to obtain the face landmark points predicts the positions of the eye key points with low precision; especially for face images with smaller eyes, the positions of the upper and lower eyelids are difficult to distinguish accurately;
3) the eye region occupies only a small area of the whole face, and applying a landmark point loss function based on local absolute distances to such a local region cannot capture its opening and closing characteristics well, leading to erroneous reconstruction results.
Based on the above, the embodiment of the application determines a three-dimensional face corresponding to the training image by using an initial three-dimensional face model and determines a rendered image based on the three-dimensional face; determines the eye closing probability corresponding to the training image by using a preset eye state detector and adjusts the human eye key points in the training image based on the eye closing probability to obtain an adjusted training image; determines a hybrid loss function based on the rendered image and the training image, and a dynamic loss function based on the rendered image and the adjusted training image; trains the initial three-dimensional face model based on the hybrid loss function and the dynamic loss function to obtain a three-dimensional face model; and generates a reconstructed three-dimensional face corresponding to the image to be reconstructed based on the three-dimensional face model. By determining the eye closing probability from the dynamic details of the eye region captured by the eye state detector, then adjusting the eye key points according to that probability, and introducing a dynamic loss function over the adjusted eye key points into the weakly supervised learning process, the embodiment resolves the inconsistency of three-dimensional reconstruction in local facial regions and improves the accuracy of the reconstructed three-dimensional face.
The application will be further described by the description of embodiments with reference to the accompanying drawings.
The embodiment of the application provides a three-dimensional face reconstruction method based on human eye dynamic change, as shown in fig. 1 and fig. 2, the method comprises the following steps:
s10, determining a three-dimensional face corresponding to the training image by using the initial three-dimensional face model, and determining a rendering image based on the three-dimensional face.
Specifically, the training image is used for training the initial three-dimensional face model; it is a two-dimensional image carrying a face, for example a two-dimensional RGB face image. The initial three-dimensional face model is used for generating a three-dimensional face corresponding to the training image and can comprise a face shape parameter regression network and a face reconstruction network. The face shape parameter regression network identifies the training image to obtain the face parameters corresponding to it, and the face reconstruction network reconstructs a three-dimensional face mesh based on those parameters to obtain the three-dimensional face. The face parameters comprise shape, texture, expression, illumination and pose parameters and can be expressed as $x = (\alpha, \beta, \delta, \gamma, \rho)$, where $\alpha$ represents shape, $\beta$ represents expression, $\delta$ represents texture, $\gamma$ represents illumination information, and $\rho$ represents the face pose. According to the method, feature extraction is performed on the training image through the face shape parameter regression network to obtain the face parameters used to generate the three-dimensional face, and the three-dimensional face is then obtained through the face reconstruction network by combining the extracted face parameters with the shape, texture and expression basis vectors provided by BFM (Basel Face Model).
In an implementation manner of the embodiment of the present application, the determining, by using the initial three-dimensional face model, the three-dimensional face corresponding to the training image specifically includes:
inputting a training image into a face shape parameter regression network, and determining face parameters corresponding to the training image through the face shape parameter regression network;
and inputting the face parameters into a face reconstruction network, and generating a three-dimensional face through the face reconstruction network.
Specifically, the face shape parameter regression network is a deep-learning-based network used for extracting features of the input two-dimensional image so as to obtain the face parameters corresponding to it. In a typical implementation of the embodiment of the application, the face shape parameter regression network can adopt a ResNet-50 deep network, and its initial parameters can be pre-training parameters from the ImageNet dataset, which can improve the training efficiency of the three-dimensional face model.
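For illustration only, a minimal PyTorch sketch of such a regression network is given below; it swaps the ImageNet classification head of a pretrained ResNet-50 for a fully connected regression layer. The coefficient dimensions are assumptions borrowed from common BFM-based pipelines, not values disclosed by the patent.

```python
import torch
import torchvision

# Assumed coefficient dimensions for (shape, expression, texture, illumination, pose);
# the patent does not disclose these sizes, so the values are illustrative.
PARAM_DIMS = {"shape": 80, "expression": 64, "texture": 80, "illumination": 27, "pose": 6}

def build_param_regressor() -> torch.nn.Module:
    """ResNet-50 backbone whose classification head is replaced by a regression layer."""
    net = torchvision.models.resnet50(
        weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V1
    )
    net.fc = torch.nn.Linear(net.fc.in_features, sum(PARAM_DIMS.values()))
    return net

def split_params(x: torch.Tensor) -> dict:
    """Split the regressed (B, D) vector into the groups of x = (alpha, beta, delta, gamma, rho)."""
    out, start = {}, 0
    for name, dim in PARAM_DIMS.items():
        out[name] = x[:, start:start + dim]
        start += dim
    return out
```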
The face reconstruction network reconstructs the three-dimensional face mesh by combining the shape, texture and expression basis vectors provided by BFM (Basel Face Model). The core idea of the BFM model is the three-dimensional deformable face model (3D Morphable Model, 3DMM); as shown in fig. 3, its principle is that any face can be expressed as a linear superposition of shape vectors and texture vectors, i.e., the abstract problem of three-dimensional reconstruction is converted into the mathematical problem of solving the basis coefficients of the face model vectors. The average face is deformed through changes in shape $\alpha$, expression $\beta$ and texture $\delta$, and an arbitrary three-dimensional face mesh shape $S$ and texture $T$ can be expressed by the following formulas:

$$S = \bar{S} + B_{id}\,\alpha + B_{exp}\,\beta, \qquad T = \bar{T} + B_t\,\delta$$

where $\bar{S}$ represents the average face shape of the database and $\bar{T}$ represents the average face texture of the database; $B_{id}$ represents the PCA vector basis of shape, $B_t$ represents the PCA vector basis of texture, and $B_{exp}$ represents the PCA vector basis of expression; $\alpha$ represents shape, $\beta$ represents expression, and $\delta$ represents texture.
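As a non-limiting sketch, this linear 3DMM assembly can be written as follows, assuming the BFM mean vectors and PCA bases have already been loaded as tensors; all variable names are illustrative rather than taken from the patent.

```python
import torch

def assemble_face(mean_shape: torch.Tensor, mean_tex: torch.Tensor,
                  B_id: torch.Tensor, B_exp: torch.Tensor, B_tex: torch.Tensor,
                  alpha: torch.Tensor, beta: torch.Tensor, delta: torch.Tensor):
    """Deform the average face with shape, expression and texture coefficients.

    mean_shape, mean_tex: (3N,) average face shape / texture of the database
    B_id:  (3N, d_id)  PCA vector basis of shape
    B_exp: (3N, d_exp) PCA vector basis of expression
    B_tex: (3N, d_tex) PCA vector basis of texture
    alpha, beta, delta: coefficient vectors for shape, expression, texture
    """
    S = mean_shape + B_id @ alpha + B_exp @ beta  # mesh vertices, flattened (x, y, z)
    T = mean_tex + B_tex @ delta                  # per-vertex texture, flattened (r, g, b)
    return S, T
```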
Further, the rendered image is a two-dimensional image obtained by rendering the three-dimensional face determined from the training image, and it is used for weakly supervised learning. Generating a two-dimensional image from the three-dimensional face may use techniques known in the art, for example rendering the three-dimensional face with the camera model, illumination model and differentiable renderer Nvdiffrast corresponding to the training image to generate the rendered image.
S20, determining eye closing probability corresponding to the training image by using a preset eye state detector, and adjusting human eye key points in the training image based on the eye closing probability to obtain an adjusted training image.
Specifically, the eye state detector is used for capturing the dynamic features of the eye region in a face image to obtain the motion details of eye opening and closing, and for predicting the eye closing probability of the eyes in the face image based on the captured dynamic features. That is, the dynamic features of the eye region in the training image may be captured by the eye state detector, and the eye closing probability of the eyes in the training image may be determined based on those features. In one implementation, the eye state detector is a trained deep network, e.g., a ResNet-50 network pre-trained on the closed-eye dataset CEW (Closed Eyes in the Wild), where the number of output neurons of the last fully connected layer is adjusted to 2 and the eye closing probability is obtained by a softmax activation function. The embodiment of the application effectively captures the dynamic features of the eye region through the deep convolutional layers of the eye state detector and obtains the motion details of eye opening and closing, thereby remedying the insufficient use of eye dynamic details. Meanwhile, even when closed-eye face data are scarce, the dynamic features of human eyes are well enhanced and fully utilized.
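A hedged sketch of the detector head described above follows: a ResNet-50 whose final fully connected layer is narrowed to two output neurons, with softmax yielding the eye closing probability. The fine-tuning on CEW and the [open, closed] output ordering are assumptions.

```python
import torch
import torchvision

def build_eye_state_detector() -> torch.nn.Module:
    """ResNet-50 with a 2-way head; index 0 = open, index 1 = closed (assumed ordering)."""
    net = torchvision.models.resnet50(
        weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V1
    )
    net.fc = torch.nn.Linear(net.fc.in_features, 2)
    return net

@torch.no_grad()
def eye_closing_probability(detector: torch.nn.Module, crops: torch.Tensor) -> torch.Tensor:
    """Return p(closed) for a batch of (B, 3, H, W) face or eye crops."""
    logits = detector(crops.float())
    return torch.softmax(logits, dim=1)[:, 1]
```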
Further, after the eye closing probability is obtained, the eye key points in the training image can be adjusted according to it; these key points are obtained by performing face recognition on the training image. When adjusting the eye key points based on the eye closing probability, each eye key point can be adjusted directly according to that probability; alternatively, only the upper eyelid key points, or only the lower eyelid key points, may be adjusted. The adjustment may change the distance between a human eye key point and a designated key point according to the eye closing probability, or change the relative distance between an upper eyelid key point and a lower eyelid key point according to the eye closing probability, and so on.
In an implementation manner of the embodiment of the present application, the adjusting the eye keypoints in the training image based on the eye closing probability specifically includes:
for each upper eyelid key point in the human eye key points, acquiring a lower eyelid key point corresponding to the upper eyelid key point;
calculating the distance between the upper eyelid key point and the lower eyelid key point in a preset direction;
And adjusting the key points of the upper eyelid along a preset direction based on the distance and the eye closing probability.
Specifically, the lower eyelid key point corresponding to an upper eyelid key point is the lower eyelid key point whose connecting line with that upper eyelid key point has the smallest inclination angle relative to the preset direction. That is, for each upper eyelid key point, the connecting lines between it and the lower eyelid key points are obtained, the inclination angle of each connecting line relative to the preset direction is computed, and the lower eyelid key point whose connecting line has the smallest inclination angle is selected as the corresponding one. It is worth noting that the numbers of upper and lower eyelid key points in the human eye are the same, and each upper eyelid key point corresponds to a different lower eyelid key point. For example, the embodiment of the present application uses 68 face key points, where the upper eyelid key points are those with index numbers 37, 38, 43 and 44 and the lower eyelid key points are those with index numbers 41, 40, 47 and 46; the upper eyelid key point 37 corresponds to the lower eyelid key point 41, 38 corresponds to 40, 43 corresponds to 47, and 44 corresponds to 46.
Further, because the relative positions of the eyelids change mainly in the up-down direction, the embodiment of the application takes the y-axis direction as the preset direction and adjusts the relative distance between each upper eyelid key point and its corresponding lower eyelid key point along the y-axis. This provides adaptive key point position adjustment for training images with different degrees of eye closure, expands the boundary of the training images at the eye level, compensates for errors in the detected positions of the human eye key points, and ensures a reasonable reconstruction of the closed-eye state.
Further, when the upper eyelid key point is adjusted along the preset direction based on the distance and the eye closing probability, an adjustment distance in the preset direction can be determined from the eye closing probability and the distance, and the position of the upper eyelid key point in the preset direction is then adjusted by that amount to obtain the adjusted upper eyelid key point. Based on this, the adjustment formula of the upper eyelid key points is as follows:
$$y_i' = y_i + p\,(y_j - y_i)$$

where $y_i$ represents the upper eyelid key point (its coordinate in the preset direction), $y_j$ represents the lower eyelid key point corresponding to the upper eyelid key point, $p$ represents the eye closure probability, $i$ is the index number of the upper eyelid key point, and $j$ is the index number of the lower eyelid key point. For example, with 68 face key points, $i \in \{37, 38, 43, 44\}$ and $j \in \{41, 40, 47, 46\}$.
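Read as the interpolation above, the adjustment can be sketched as follows; the pairing map follows the 1-based 68-point indices cited earlier, and the linear form is a reconstruction from the surrounding description rather than the patent's verbatim formula.

```python
import numpy as np

# 1-based 68-point indices: upper eyelid key point -> corresponding lower eyelid key point.
EYELID_PAIRS = {37: 41, 38: 40, 43: 47, 44: 46}

def adjust_upper_eyelids(landmarks: np.ndarray, p_close: float) -> np.ndarray:
    """Move each upper eyelid key point toward its lower eyelid partner along the y-axis.

    landmarks: (68, 2) array of (x, y) key points.
    p_close:   eye closing probability in [0, 1] from the eye state detector.
    """
    adjusted = landmarks.copy()
    for i, j in EYELID_PAIRS.items():
        y_up, y_low = landmarks[i - 1, 1], landmarks[j - 1, 1]
        # p_close = 0 leaves the eyelid unchanged; p_close = 1 closes it completely.
        adjusted[i - 1, 1] = y_up + p_close * (y_low - y_up)
    return adjusted
```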
In the embodiment of the application, adjusting the upper eyelid key point along the y-axis changes the relative positions of the upper and lower eyelid key points so that they are consistent with the eye state in the training image; this brings adaptive key point adjustment to images with different degrees of eye opening, enlarges the boundary of the input image at the eye level, and at the same time alleviates the problem of inaccurate coordinate positions.
S30, determining a hybrid loss function based on the rendered image and the training image, and determining a dynamic loss function based on the rendered image and the adjusted training image.
Specifically, the hybrid loss function reflects the difference between the rendered image and the training image, and improves the overall reconstruction effect of the three-dimensional face model; the dynamic loss function reflects the difference between the relative eye key point distances in the rendered image and those in the adjusted training image, and improves the reconstruction accuracy of the local eye region.
In one implementation of the embodiment of the present application, the determining the hybrid loss function based on the rendered image and the training image specifically includes:
acquiring the rendering face information of the rendered image and the face information of the training image;
calculating an image layer loss function based on the rendered face key points, the rendered face region, the face key points and the face region;
and calculating a perception layer loss function based on the rendered face features and the face features, and calculating the hybrid loss function based on the image layer loss function and the perception layer loss function.
Specifically, the rendering face information comprises the rendering face key points, the rendering face region and the rendering face features, and the face information comprises the face key points, the face region and the face features. The face key points and the face region can be detected by a pre-trained face landmark point detector, and the face features can be identified by a pre-trained face recognition model; the rendering face key points and the rendering face region can be formed by projecting the face key points and the face region of the three-dimensional face, and the rendering face features can likewise be identified by the face recognition model. Both the rendering face key points and the face key points comprise 68 key point positions; the face recognition model can be the deep face recognition network ArcFace, and the rendering face features and the face features are taken from the last layer of that network.
The perception layer loss function reflects the difference between the rendered face features and the face features; it alleviates the problem that the reconstruction of some images looks unrealistic and achieves a more accurate reconstruction at the texture perception level. In the embodiment of the application, the perception layer loss function can measure the difference between the rendered face features and the face features through cosine similarity: the smaller the cosine similarity, the larger the difference between the two features and the lower the similarity; conversely, the larger the cosine similarity, the smaller the difference and the higher the similarity. Thus, in one exemplary implementation, the expression of the perception layer loss function may be:
$$L_{per} = 1 - \frac{\langle f(I),\, f(I') \rangle}{\|f(I)\|\;\|f(I')\|}$$

where $f(\cdot)$ represents the features identified by the face recognition model, $\langle \cdot,\cdot \rangle$ represents the vector inner product used to calculate the cosine distance between two features, $I$ represents the training image, and $I'$ represents the rendered image.
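A minimal sketch of this cosine-similarity term, assuming $f(I)$ and $f(I')$ are the last-layer embeddings of a pretrained recognition network such as ArcFace:

```python
import torch

def perception_layer_loss(feat_train: torch.Tensor, feat_render: torch.Tensor) -> torch.Tensor:
    """1 - cosine similarity between the (B, D) recognition features of the two images."""
    cos = torch.nn.functional.cosine_similarity(feat_train, feat_render, dim=1)
    return (1.0 - cos).mean()
```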
The image layer loss function reflects the dense photometric difference and the key point position difference of the face between the training image and the rendered image, and improves the accuracy of the overall facial contour and the five sense organs. In one implementation manner, the calculating the image layer loss function based on the rendered face key points, the rendered face region, the face key points and the face region specifically includes:
acquiring the skin color probability of each training pixel point in the face region, and determining an attention mask of the face region based on the skin color probability of each pixel point;
calculating a light sensation loss term based on the attention mask and pixel differences of each training pixel point in the face region and a rendering pixel point corresponding to each training pixel point, wherein the rendering pixel point is a pixel point matched with the pixel position of the training pixel point in the rendering face region;
calculating a face landmark point loss term based on the rendered face key points and the face key points;
and calculating an image layer loss function based on the light sensation loss term and the human face landmark point loss term.
In particular, skin color probabilities may be obtained by a pre-trained recognition network model, e.g., a naive Bayes classifier pre-trained on a skin image dataset. The attention mask is determined from the skin color probability of each pixel point and reflects the skin color importance of each pixel point in the face region. The image size of the attention mask matches that of the face region, the attention pixel points in the mask correspond one-to-one with the region pixel points in the face region, and the pixel value of each attention pixel point reflects the skin color attention value of its corresponding region pixel point. The pixel value of an attention pixel point may be the skin color probability of its corresponding region pixel point, or a value determined based on that probability.
In this embodiment of the present application, a calculation formula of a pixel value of each attention pixel point in the attention mask may be:
$$A_i = \begin{cases} 1, & P_i > 0.5 \\ P_i, & P_i \le 0.5 \end{cases}$$

where $i$ represents the index of the attention pixel point, $P_i$ represents the skin color probability, and $A_i$ represents the pixel value of the attention pixel point.
Further, after the attention mask is obtained, the pixel difference between each region pixel point in the face region and the corresponding rendering pixel point in the rendering face region can be calculated, and the light sensation loss term is then computed from the attention mask and all the calculated pixel differences; the term may be obtained by weighting all the pixel differences with the attention mask, or by weighting and normalizing them with the attention mask. In an exemplary implementation of the embodiment of the present application, the calculation formula of the light sensation loss term may be:
$$L_{photo} = \frac{\sum_{i \in M} A_i \, \| I_i - I_i' \|_2}{\sum_{i \in M} A_i}$$

where $L_{photo}$ represents the light sensation loss term, $A_i$ represents the pixel value of the attention pixel point, $i$ represents the index of the attention pixel point, $I_i$ represents a region pixel point in the face region, $I_i'$ represents the corresponding rendering pixel point in the rendering face region, and $M$ represents the mapped face region.
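A sketch of the attention-weighted light sensation term under the reconstruction above; the tensor layouts are assumptions.

```python
import torch

def light_sensation_loss(img: torch.Tensor, render: torch.Tensor,
                         attention: torch.Tensor, face_mask: torch.Tensor) -> torch.Tensor:
    """Attention-weighted photometric loss over the reprojected face region M.

    img, render: (B, 3, H, W) training and rendered images.
    attention:   (B, 1, H, W) skin-attention values A_i.
    face_mask:   (B, 1, H, W) binary mask of the region M.
    """
    diff = torch.linalg.vector_norm(img - render, dim=1, keepdim=True)  # per-pixel L2 over RGB
    weight = attention * face_mask
    return (weight * diff).sum() / weight.sum().clamp(min=1e-8)
```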
The face landmark point loss term reflects the differences between face key points and improves the accuracy of the overall facial contour and the five sense organs. The calculation formula of the face landmark point loss term may be:
$$L_{lan} = \frac{1}{N} \sum_{n=1}^{N} \omega_n \, \| q_n - q_n' \|^2$$

where $L_{lan}$ represents the face landmark point loss term, $N$ represents the number of face landmark points, $\omega_n$ represents the weight of the key point, $q_n$ is a face key point, and $q_n'$ represents the rendered face key point.
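A sketch of the weighted landmark term as reconstructed above; the per-landmark weight values are assumptions left to the caller.

```python
import torch

def landmark_loss(q: torch.Tensor, q_render: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """Weighted mean squared distance over N face landmark points.

    q, q_render: (B, N, 2) training and rendered key points.
    weights:     (N,) per-landmark weights (e.g., larger on eye and mouth points).
    """
    sq_dist = ((q - q_render) ** 2).sum(dim=-1)  # (B, N) squared Euclidean distances
    return (weights * sq_dist).mean()            # (1/N) * sum_n w_n * d_n^2, averaged over the batch
```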
Further, after the light sensation loss term and the face landmark point loss term are obtained, the image layer loss function may be taken as their sum, as their average value, or as a weighted combination of the two.
In one implementation manner of the embodiment of the application, the dynamic loss function guides the three-dimensional face model based on the relative distance between the key point pairs of human eyes, so that the dynamic characteristics of the local area are effectively amplified, and the eye area is prevented from being ignored in the whole face reconstruction process. Correspondingly, the determining the dynamic loss function based on the rendered image and the adjusted training image specifically includes:
selecting a plurality of key point pairs from the rendered image and the adjusted training image respectively to form a rendered key point pair set and a training key point pair set;
and calculating a dynamic loss function based on the key point distance between each rendering key point pair in the rendering key point pair set and the key point distance between each training key point pair in the training key point pair set.
Specifically, in order to enhance the motion characteristics of small local regions within the whole face, the dynamic loss function is designed around the opening and closing characteristics of the eyes; by combining the relative distances between upper and lower landmark point pairs with a flexible selection rule, an imperfect landmark detector can be used adaptively. The rendering key point pair set corresponds to the training key point pair set, and both contain at least eye key point pairs. Because the relative distance within an eye key point pair influences the reconstruction of the face, introducing a dynamic loss function determined from these relative distances guides a more complete and accurate reproduction of the eyes in different opening and closing states. In the embodiment of the application, both the rendering key point pair set and the training key point pair set comprise eye key point pairs and mouth key point pairs, where an eye key point pair consists of an upper eyelid key point and its corresponding lower eyelid key point, and a mouth key point pair consists of an upper lip key point and the lower lip key point corresponding to it. In addition, the corresponding points in both the eye and mouth pairs are determined in the same way as the lower eyelid key point corresponding to an upper eyelid key point, which is not repeated here.
Further, in order to reduce the dependency of the dynamic loss function on the face landmark point detector used to detect the face key points, a weight may be added to adjust the importance of each rendering key point pair in the calculation of the dynamic loss function. The weight $\omega_k$ of the $k$-th rendering key point pair $p_k'$ is determined by the accuracy of the rendered face key points in the pair (e.g., the positional deviation of the rendered face key points relative to the training face key points of the training image).
After the weights are acquired, the dynamic loss function is calculated based on the weights, the relative distances within the rendering key point pairs and the relative distances within the training key point pairs, where the calculation formula of the dynamic loss function is as follows:
$$L_{dyn} = \sum_{k \in K} \omega_k \, \big| d(p_k') - d(p_k) \big|$$

where $L_{dyn}$ represents the dynamic loss function, $K$ represents the index set, $k$ represents the index of the key point pair, $d(\cdot)$ is the relative distance calculation function, for which the Euclidean distance is adopted, $p_k'$ represents the rendering face key point pair, and $p_k$ represents the training key point pair.
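A sketch of this relative-distance term; the pair tensor layout and the absolute-difference form are assumptions consistent with the reconstruction above.

```python
import torch

def dynamic_loss(render_pairs: torch.Tensor, train_pairs: torch.Tensor,
                 weights: torch.Tensor) -> torch.Tensor:
    """Compare the within-pair Euclidean distances of corresponding key point pairs.

    render_pairs, train_pairs: (K, 2, 2) tensors: K pairs, each with two (x, y) points.
    weights:                   (K,) per-pair weights.
    """
    d_render = torch.linalg.vector_norm(render_pairs[:, 0] - render_pairs[:, 1], dim=-1)
    d_train = torch.linalg.vector_norm(train_pairs[:, 0] - train_pairs[:, 1], dim=-1)
    return (weights * (d_render - d_train).abs()).sum()
```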
S40, training the initial three-dimensional face model based on the hybrid loss function and the dynamic loss function to obtain a three-dimensional face model.
Specifically, after the hybrid loss function and the dynamic loss function are obtained, they can be weighted to obtain a target loss function, which is then used to train the initial three-dimensional face model by back propagation so as to obtain the three-dimensional face model. In addition, it is worth noting that the face reconstruction network in the initial three-dimensional face model is the three-dimensional deformable face model adjusted by the face parameters, so only the face shape parameter regression network needs to be trained by back propagation.
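Putting the terms together, a hedged sketch of one training step is shown below; the loss weights and learning rate are hyperparameters not disclosed by the patent, and only the parameter regression network is updated because the face reconstruction network is the fixed 3DMM.

```python
import torch

def target_loss(l_photo, l_lan, l_per, l_dyn,
                w_photo=1.0, w_lan=1.0, w_per=1.0, w_dyn=1.0):
    """Weighted sum of the hybrid terms (light sensation, landmark, perception) and the dynamic term."""
    return w_photo * l_photo + w_lan * l_lan + w_per * l_per + w_dyn * l_dyn

def train_step(optimizer: torch.optim.Optimizer, losses) -> float:
    """One weakly supervised update; only the regression network's parameters are in the optimizer."""
    loss = target_loss(*losses)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example wiring (the learning rate is an assumption, not a disclosed value):
# optimizer = torch.optim.Adam(regressor.parameters(), lr=1e-4)
```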
S50, generating a reconstructed three-dimensional face corresponding to the image to be reconstructed based on the three-dimensional face model.
Specifically, the three-dimensional face model is obtained through the above training steps, where steps S10-S40 may be performed multiple times in the process of obtaining it. After the three-dimensional face model is obtained, the image to be reconstructed can be input into the model, and the reconstructed three-dimensional face is obtained from its output.
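At inference time the pipeline reduces to a single forward pass through the trained regression network followed by the 3DMM assembly; the following hedged sketch reuses split_params() and assemble_face() from the earlier sketches.

```python
import torch

@torch.no_grad()
def reconstruct(regressor, image, mean_shape, mean_tex, B_id, B_exp, B_tex):
    """Single-image reconstruction: regress face parameters, then assemble the mesh.

    `image` is a (1, 3, H, W) tensor of the image to be reconstructed; the helpers
    split_params() and assemble_face() are defined in the sketches above.
    """
    params = split_params(regressor(image))
    S, T = assemble_face(mean_shape, mean_tex, B_id, B_exp, B_tex,
                         params["shape"][0], params["expression"][0], params["texture"][0])
    return S, T
```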
In summary, the embodiment of the application provides a three-dimensional face reconstruction method based on human eye dynamic change, which includes: determining a three-dimensional face corresponding to the training image by using an initial three-dimensional face model, and determining a rendered image based on the three-dimensional face; determining the eye closing probability corresponding to the training image by using a preset eye state detector, and adjusting the human eye key points in the training image based on the eye closing probability to obtain an adjusted training image; determining a hybrid loss function based on the rendered image and the training image, and determining a dynamic loss function based on the rendered image and the adjusted training image; training the initial three-dimensional face model based on the hybrid loss function and the dynamic loss function to obtain a three-dimensional face model; and generating a reconstructed three-dimensional face corresponding to the image to be reconstructed based on the three-dimensional face model. By determining the eye closing probability from the dynamic details of the eye region captured by the eye state detector, then adjusting the eye key points according to that probability, and introducing a dynamic loss function over the adjusted eye key points into the weakly supervised learning process, the embodiment resolves the inconsistency of three-dimensional reconstruction in local facial regions and improves the accuracy of the reconstructed three-dimensional face.
Based on the three-dimensional face reconstruction method based on the human eye dynamic change, the embodiment of the application provides a three-dimensional face reconstruction device based on the human eye dynamic change, as shown in fig. 4, the device comprises:
the training module 100 is configured to determine a three-dimensional face corresponding to the training image by using the initial three-dimensional face model, and determine a rendered image based on the three-dimensional face; determine the eye closing probability corresponding to the training image by using a preset eye state detector, and adjust the human eye key points in the training image based on the eye closing probability to obtain an adjusted training image; determine a hybrid loss function based on the rendered image and the training image, and determine a dynamic loss function based on the rendered image and the adjusted training image; and train the initial three-dimensional face model based on the hybrid loss function and the dynamic loss function to obtain a three-dimensional face model;
the reconstruction module 200 is configured to generate a reconstructed three-dimensional face corresponding to the image to be reconstructed based on the three-dimensional face model.
Based on the three-dimensional face reconstruction method based on human eye dynamic change, the embodiment of the application provides a computer readable storage medium, wherein one or more programs are stored in the computer readable storage medium, and the one or more programs can be executed by one or more processors to realize the steps in the three-dimensional face reconstruction method based on human eye dynamic change as described in the embodiment.
Based on the three-dimensional face reconstruction method based on human eye dynamic change, the application also provides a terminal device, as shown in fig. 5, which comprises at least one processor (processor) 20; a display screen 21; and a memory (memory) 22, which may also include a communication interface (Communications Interface) 23 and a bus 24. Wherein the processor 20, the display 21, the memory 22 and the communication interface 23 may communicate with each other via a bus 24. The display screen 21 is configured to display a user guidance interface preset in the initial setting mode. The communication interface 23 may transmit information. The processor 20 may invoke logic instructions in the memory 22 to perform the methods of the embodiments described above.
Further, the logic instructions in the memory 22 described above may be implemented in the form of software functional units and stored in a computer readable storage medium when sold or used as a stand alone product.
The memory 22, as a computer readable storage medium, may be configured to store a software program, a computer executable program, such as program instructions or modules corresponding to the methods in the embodiments of the present disclosure. The processor 20 performs functional applications and data processing, i.e. implements the methods of the embodiments described above, by running software programs, instructions or modules stored in the memory 22.
The memory 22 may include a storage program area that may store an operating system and at least one application program required for functions, and a storage data area; the storage data area may store data created according to the use of the terminal device, etc. In addition, the memory 22 may include high-speed random access memory and may also include nonvolatile memory. For example, media capable of storing program codes, such as a USB disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, or a transitory storage medium, may be used.
In addition, the specific processes that the storage medium and the plurality of instruction processors in the terminal device load and execute are described in detail in the above method, and are not stated here.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting thereof; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (8)

1. The three-dimensional face reconstruction method based on human eye dynamic change is characterized by comprising the following steps:
determining a three-dimensional face corresponding to the training image by using an initial three-dimensional face model, and determining a rendered image based on the three-dimensional face;
determining eye closing probability corresponding to the training image by using a preset eye state detector, and adjusting human eye key points in the training image based on the eye closing probability to obtain an adjusted training image;
determining a hybrid loss function based on the rendered image and the training image, and determining a dynamic loss function based on the rendered image and the adjusted training image;
training the initial three-dimensional face model based on the hybrid loss function and the dynamic loss function to obtain a three-dimensional face model;
generating a reconstructed three-dimensional face corresponding to the image to be reconstructed based on the three-dimensional face model;
the adjusting the human eye key points in the training image based on the eye closing probability specifically comprises:
for each upper eyelid key point in the human eye key points, acquiring a lower eyelid key point corresponding to the upper eyelid key point;
calculating the distance between the upper eyelid key point and the lower eyelid key point in a preset direction;
and adjusting the upper eyelid key points along a preset direction based on the distance and the eye closing probability, wherein the adjustment formula of the upper eyelid key points is as follows:

$y_i' = y_i + p\,(y_j - y_i)$

where $y_i$ represents the upper eyelid key point, $y_j$ represents the lower eyelid key point corresponding to the upper eyelid key point, and $p$ represents the eye closure probability.
2. The three-dimensional face reconstruction method based on human eye dynamic change according to claim 1, wherein the initial three-dimensional face model comprises a face shape parameter regression network and a face reconstruction network; the determining the three-dimensional face corresponding to the training image by using the initial three-dimensional face model specifically comprises the following steps:
inputting a training image into a face shape parameter regression network, and determining face parameters corresponding to the training image through the face shape parameter regression network;
and inputting the face parameters into a face reconstruction network, and generating a three-dimensional face through the face reconstruction network.
3. The three-dimensional face reconstruction method based on human eye dynamic change according to claim 1, wherein the determining a hybrid loss function based on the rendered image and the training image specifically comprises:
acquiring the rendering face information of the rendered image and the face information of the training image, wherein the rendering face information comprises rendering face key points, a rendering face region and rendering face features, and the face information comprises the face key points, the face region and the face features;
Calculating an image layer loss function based on the rendered face key points, the rendered face region, the face key points and the face region;
and calculating a perception layer loss function based on the rendered face features and the face features, and calculating a mixed loss function based on the image layer loss function and the perception layer loss function.
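The composition of the hybrid loss in claim 3 can be sketched as follows, assuming a cosine-similarity form for the perception layer term and illustrative weights; both choices are common in weakly supervised face reconstruction but are not taken from the patent.

```python
import torch
import torch.nn.functional as F

def perception_layer_loss(feat_render: torch.Tensor,
                          feat_train: torch.Tensor) -> torch.Tensor:
    """Perception-level term: 1 minus the cosine similarity between the
    rendered face features and the training face features."""
    return 1.0 - F.cosine_similarity(feat_render, feat_train, dim=-1).mean()

def hybrid_loss(image_layer: torch.Tensor,
                perception_layer: torch.Tensor,
                w_img: float = 1.0, w_per: float = 0.2) -> torch.Tensor:
    """Weighted sum of the image layer and perception layer losses."""
    return w_img * image_layer + w_per * perception_layer

# Example with identity features from any face recognition backbone.
feat_r, feat_t = torch.randn(4, 512), torch.randn(4, 512)
loss = hybrid_loss(torch.tensor(0.3), perception_layer_loss(feat_r, feat_t))
```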
4. The three-dimensional face reconstruction method based on human eye dynamic change according to claim 3, wherein the calculating of the image layer loss function based on the rendered face key points, the rendered face region, the face key points and the face region specifically comprises:
acquiring the skin color probability of each training pixel point in the face region, and determining an attention mask of the face region based on the skin color probability of each training pixel point;
calculating a light sensation loss term based on the attention mask and the pixel difference between each training pixel point in the face region and its corresponding rendered pixel point, wherein the rendered pixel point is the pixel point in the rendered face region whose position matches that of the training pixel point;
calculating a face landmark point loss term based on the rendered face key points and the face key points;
and calculating the image layer loss function based on the light sensation loss term and the face landmark point loss term.
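A sketch of the image layer loss in claim 4, assuming the skin color probabilities are already available as a per-pixel attention mask and that the landmark term is a squared distance; the weights and both functional forms are assumptions.

```python
import torch

def light_sensation_loss(render, target, skin_mask):
    """Masked photometric term: per-pixel colour error between the
    rendered and training face regions, weighted by the skin color
    attention mask so non-skin pixels contribute less."""
    diff = torch.norm(render - target, dim=1)          # (B, H, W)
    return (skin_mask * diff).sum() / skin_mask.sum().clamp(min=1.0)

def landmark_loss(lm_render, lm_target):
    """Face landmark term: mean squared distance between rendered and
    detected face key points."""
    return ((lm_render - lm_target) ** 2).sum(dim=-1).mean()

def image_layer_loss(render, target, skin_mask, lm_render, lm_target,
                     w_photo=1.0, w_lm=1e-3):
    """Image layer loss as a weighted sum of the two terms."""
    return (w_photo * light_sensation_loss(render, target, skin_mask)
            + w_lm * landmark_loss(lm_render, lm_target))
```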
5. The three-dimensional face reconstruction method based on human eye dynamic change according to claim 1, wherein the determining of the dynamic loss function based on the rendered image and the adjusted training image specifically comprises:
selecting a plurality of key point pairs from the rendered image and the adjusted training image respectively to form a rendered key point pair set and a training key point pair set, wherein the rendered key point pair set corresponds to the training key point pair set and each set comprises at least eye key point pairs;
and calculating the dynamic loss function based on the key point distance of each rendered key point pair in the rendered key point pair set and the key point distance of each training key point pair in the training key point pair set.
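A sketch of the dynamic loss in claim 5, assuming 68-point landmarks, index pairs such as eyelid pairs, and an L1 penalty on the discrepancy between within-pair distances; all of these specifics are illustrative.

```python
import torch

def dynamic_loss(kp_render: torch.Tensor,
                 kp_adjusted: torch.Tensor,
                 pairs: torch.Tensor) -> torch.Tensor:
    """Compare the distance inside each key point pair (e.g. an
    upper/lower eyelid pair) between the rendered image and the
    adjusted training image, and penalise the difference."""
    i, j = pairs[:, 0], pairs[:, 1]
    d_render = torch.norm(kp_render[:, i] - kp_render[:, j], dim=-1)
    d_train = torch.norm(kp_adjusted[:, i] - kp_adjusted[:, j], dim=-1)
    return (d_render - d_train).abs().mean()

# Example: one eyelid pair and one lip pair (indices follow the common
# 68-point convention but are only illustrative here).
pairs = torch.tensor([[37, 41], [62, 66]])
kp_r, kp_a = torch.randn(4, 68, 2), torch.randn(4, 68, 2)
print(dynamic_loss(kp_r, kp_a, pairs))
```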
6. A three-dimensional face reconstruction device based on human eye dynamic change, characterized in that the device comprises:
the training module is used for determining a three-dimensional face corresponding to the training image by using the initial three-dimensional face model, and determining a rendered image based on the three-dimensional face; determining an eye closing probability corresponding to the training image by using a preset eye state detector, and adjusting human eye key points in the training image based on the eye closing probability to obtain an adjusted training image; determining a hybrid loss function based on the rendered image and the training image, and determining a dynamic loss function based on the rendered image and the adjusted training image; and training the initial three-dimensional face model based on the hybrid loss function and the dynamic loss function to obtain a three-dimensional face model;
the reconstruction module is used for generating a reconstructed three-dimensional face corresponding to the image to be reconstructed based on the three-dimensional face model;
wherein the adjusting of the human eye key points in the training image based on the eye closing probability specifically comprises:
for each upper eyelid key point in the human eye key points, acquiring a lower eyelid key point corresponding to the upper eyelid key point;
calculating the distance between the upper eyelid key point and the lower eyelid key point in a preset direction;
and adjusting the upper eyelid key point along the preset direction based on the distance and the eye closing probability, wherein the adjusting formula of the upper eyelid key point is:

k_up' = k_up + p · (k_low − k_up)

where k_up represents the upper eyelid key point, k_low represents the lower eyelid key point corresponding to the upper eyelid key point, and p represents the eye closing probability.
7. A computer readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps in the three-dimensional face reconstruction method based on human eye dynamic change according to any one of claims 1-5.
8. A terminal device, comprising: a processor and a memory;
the memory stores a computer readable program executable by the processor;
and the processor, when executing the computer readable program, implements the steps in the three-dimensional face reconstruction method based on human eye dynamic change according to any one of claims 1-5.
CN202311263345.1A 2023-09-27 2023-09-27 Three-dimensional face reconstruction method and device based on human eye dynamic change and storage medium Active CN116993929B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311263345.1A CN116993929B (en) 2023-09-27 2023-09-27 Three-dimensional face reconstruction method and device based on human eye dynamic change and storage medium

Publications (2)

Publication Number Publication Date
CN116993929A CN116993929A (en) 2023-11-03
CN116993929B true CN116993929B (en) 2024-01-16

Family

ID=88523635

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427110A (en) * 2019-08-01 2019-11-08 广州华多网络科技有限公司 A kind of live broadcasting method, device and direct broadcast server
CN111243626A (en) * 2019-12-30 2020-06-05 清华大学 Speaking video generation method and system
CN112085836A (en) * 2020-09-03 2020-12-15 华南师范大学 Three-dimensional face reconstruction method based on graph convolution neural network
CN113506367A (en) * 2021-08-24 2021-10-15 广州虎牙科技有限公司 Three-dimensional face model training method, three-dimensional face reconstruction method and related device
CN113887293A (en) * 2021-08-31 2022-01-04 际络科技(上海)有限公司 Visual human face three-dimensional reconstruction method based on linear solution
CN114842136A (en) * 2022-04-08 2022-08-02 华南理工大学 Single-image three-dimensional face reconstruction method based on differentiable renderer
CN115471611A (en) * 2022-09-22 2022-12-13 杭州师范大学 Method for improving visual effect of 3DMM face model
CN116563457A (en) * 2023-04-11 2023-08-08 山东科技大学 Three-dimensional face reconstruction method based on CLIP model
CN116703719A (en) * 2023-01-10 2023-09-05 长春理工大学 Face super-resolution reconstruction device and method based on face 3D priori information

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102564479B1 (en) * 2016-11-22 2023-08-07 삼성전자주식회사 Method and apparatus of 3d rendering user' eyes
CN109753850B (en) * 2017-11-03 2022-10-25 富士通株式会社 Training method and training device for face recognition model
CN109086719A (en) * 2018-08-03 2018-12-25 北京字节跳动网络技术有限公司 Method and apparatus for output data
CN112102477A (en) * 2020-09-15 2020-12-18 腾讯科技(深圳)有限公司 Three-dimensional model reconstruction method and device, computer equipment and storage medium
CN112733794B (en) * 2021-01-22 2021-10-15 腾讯科技(深圳)有限公司 Method, device and equipment for correcting sight of face image and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Li Ma et al., "Neural Parameterization for Dynamic Human Head Editing," arXiv, pp. 1-15. *
Jingxiang Sun et al., "Next3D: Generative Neural Texture Rasterization for 3D-Aware Head Avatars," arXiv, pp. 1-12. *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant