CN114898034A - Three-dimensional face generation method and device and three-dimensional face replay method and device

Info

Publication number: CN114898034A
Application number: CN202210402505.5A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Prior art keywords: image, target, network model, face, expression
Inventors: 曾豪, 张智勐, 张唯, 丁彧, 吕唐杰, 范长杰, 胡志鹏
Assignee: Netease Hangzhou Network Co Ltd
Application filed by Netease Hangzhou Network Co Ltd

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00: Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168: Feature extraction; Face representation
    • G06V 40/174: Facial expression recognition
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00: Details of television systems
    • H04N 5/222: Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262: Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N 5/265: Mixing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computer Graphics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Medical Informatics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a three-dimensional face generation method and device and a three-dimensional face replay method and device. The generation method comprises the following steps: acquiring an expression vector of a first image containing a first identity object based on a target expression coding network model, the first image being an image in which the face is partially occluded; and inputting the expression vector and a second image containing a second identity object into a target face generation network model to determine a target image in which the expression vector of the first image is transferred to the face of the second identity object in the second image. Even if the facial area of the first identity object in the first image is occluded, the expression vector can still be recognized and transferred to the face of the second identity object in the second image, which improves both the accuracy of facial expression transfer and its robustness under occlusion.

Description

Three-dimensional face generation method and device and three-dimensional face replay method and device
Technical Field
The application relates to the field of computer application, in particular to a three-dimensional face generation method and device and a three-dimensional face replay method and device. The application also relates to a computer storage medium and an electronic device.
Background
The face replay (reenactment) technique migrates the expression and head pose of one person (the source face) to another person's face (the target face) while keeping the identity of the target face unchanged. Face replay has wide application scenarios, such as virtual anchors and virtual idols. In plain terms, given a talking video of a source face and a static picture of a target face, face replay generates a new video of the target face from the static picture by referring to the source video, with the same expression changes and head pose changes as the source video.
Disclosure of Invention
The application provides a three-dimensional face generation method, which aims to solve the problem in the prior art that, when a human face is occluded, the face cannot be generated or is generated inaccurately, so that three-dimensional face replay cannot be achieved.
The application provides a three-dimensional face generation method, which comprises the following steps:
acquiring an expression vector of a first image containing a first identity object based on a target expression coding network model, wherein the target expression coding network model is a model determined by training on expression images whose faces are partially occluded, and the first image is an image in which the face is occluded;
inputting the expression vector and a second image containing a second identity object into a target face generation network model, and determining a target image in which the expression vector of the first image is transferred to the face of the second identity object in the second image, wherein the target face generation network model is a model determined by expression transfer training on images of the same identity with different corresponding expressions.
In some embodiments, the method further comprises:
acquiring, from an expression image sample data set, a target sample image, a positive sample image whose expression is similar to that of the target sample image, and a negative sample image whose expression is dissimilar to that of the target sample image, wherein part of the facial area in at least one of the target sample image, the positive sample image and the negative sample image is occluded, and the target sample image, the positive sample image and the negative sample image are sample images with different identity information;
training an initial expression coding network model based on the target sample image, the positive sample image and the negative sample image, wherein the expression coding model is used for calculating expression vectors corresponding to input images;
and when the preset model convergence condition is met according to the expression vector, obtaining the target expression coding network model.
In some embodiments, the training of the initial expression coding model based on the target sample image, the positive sample image, and the negative sample image comprises:
inputting the target sample image, the positive sample image and the negative sample image into an initial expression coding network model respectively to obtain corresponding expression vectors respectively;
obtaining a loss value based on the expression vector corresponding to the target sample image, the expression vector corresponding to the positive sample image, the expression vector corresponding to the negative sample image and a preset loss function;
and determining whether a preset convergence condition is met or not according to the loss value, if not, adjusting the parameters of the initial expression coding network model, and performing the next round of training on the adjusted expression coding network model.
In some embodiments, said inputting said expression vector and a second image comprising a second identity object into a target face generation network model, determining a target image to migrate the expression vector of said first image to the face of said second image, comprises:
inputting the second image into an encoder of the target face generation network model, and determining facial features corresponding to the second image;
inputting the expression vector of the first image and the facial features corresponding to the second image into an embedding module of the target face generation network model, and determining that the expression vector of the first image is embedded into the target facial features of the second image;
inputting the target facial features into a decoder of the target face generation network model, and determining the target image.
In some embodiments, further comprising:
acquiring a head pose vector of the first image based on a target pose coding network model, wherein the target pose coding network model is a model determined by training on head pose images in which the head is partially occluded;
inputting the expression vector and a second image containing a second identity object into a target face generation network model, determining to transfer the expression vector of the first image to a target image of the second image, and further comprising: inputting the head pose vector into the target face generation network model into which the expression vector and the second image are input, and determining the target image of a second identity object face that migrates the expression vector and the head pose vector to the second image.
In some embodiments, the method further comprises:
acquiring a head pose image with an occluder, the head pose image having a pose ground-truth label;
inputting the head pose image into an initial pose coding network model, and determining a predicted pose label;
obtaining a loss value based on the pose ground-truth label, the predicted pose label and a preset loss function;
and determining whether a preset convergence condition is met according to the loss value; if not, adjusting the parameters of the initial pose coding network model and performing the next round of training on the adjusted initial pose coding model until the preset convergence condition is met, so as to obtain the target pose coding network model.
In some embodiments, the method further comprises:
inputting a third image containing a third identity object into the target expression coding network model, and acquiring an expression vector of the third image;
inputting a fourth image containing a third identity object into an encoder in an initial face generation network model, and determining a face feature corresponding to the fourth image;
inputting the expression vector of the third image and the facial features corresponding to the fourth image into an embedding module of the initial face generation network model, and determining the facial features embedded with the expression vector of the third image;
inputting facial features embedded with expression vectors of the third image into a decoder of the initial face generation network model, and determining a fifth image embedded with expression vectors of the third image;
determining whether a loss value between the fifth image and the third image meets a preset convergence condition;
and if not, performing next round of training on the initial face generation network model according to the reconstructed loss function until the requirement of a convergence condition is met.
In some embodiments, the method further comprises:
inputting the third image containing the third identity object into the target head pose coding network model, and acquiring a head pose vector of the third image;
inputting a fourth image containing a third identity object into an encoder in an initial face generation network model, and determining a face feature corresponding to the fourth image;
inputting the head pose vector of the third image into an embedding module of the initial face generation network model, and determining the facial features embedded with the head pose vector of the third image;
inputting the facial features embedded with the expression vector of the third image into a decoder of the initial face generation network model, and determining a fifth image embedded with the expression vector of the third image, wherein the method comprises the following steps:
inputting the facial features of the expression vector and the head pose vector embedded with the third image into a decoder of the initial face generation network model, and determining a fifth image embedded with the expression vector and the head pose vector.
The present application also provides a three-dimensional face generating device, including:
an acquisition unit, configured to acquire an expression vector of a first image containing a first identity object based on a target expression coding network model, wherein the target expression coding network model is a model trained and determined on expression images whose faces are occluded, and the first image is an image in which the face is occluded;
and a determining unit, configured to input the expression vector and a second image containing a second identity object into a target face generation network model and determine a target image in which the expression vector of the first image is transferred to the face of the second image, wherein the target face generation network model is a model determined by expression transfer training on images of the same identity with different corresponding expressions.
The present application further provides a three-dimensional face replay method, including:
acquiring video information containing a first identity object and a facial still image containing a second identity object; wherein the video information comprises a video frame image of which the face has an obstruction;
inputting the extracted video frame image in the video information into a target expression coding network model to obtain an expression vector of the video frame image;
inputting the expression vector and the target static image into a face generation network model according to the playing time sequence of the video information, and acquiring a target face sequence frame image for transferring the expression vector to the target static image face;
and merging the target face sequence frame images according to the playing time sequence to obtain a face replay video for transferring the expression vectors to the target static image.
In some embodiments, further comprising:
inputting the extracted video frame image in the video information into a target attitude coding network model to obtain a head attitude vector of the video frame image;
according to the playing time sequence of the video information, inputting the expression vector and the target static image into a face generation network model, and generating a target face sequence frame image for transferring the expression vector to the target static image face, further comprising:
and inputting the expression vector, the head pose vector and the target static image into the face generation network model according to the playing time sequence of the video information, and acquiring a target face sequence frame image for transferring the expression vector and the head pose vector to the target static image face.
The present application further provides a three-dimensional face replay device, including:
a first acquisition unit configured to acquire video information including a first identity object and a facial still image including a second identity object; wherein the video information comprises a video frame image with a face having an obstruction;
a second obtaining unit, configured to input the extracted video frame image in the video information into a target expression coding network model, and obtain an expression vector of the video frame image;
a third obtaining unit, configured to input the expression vector and the target still image into a face generation network model according to a playing timing sequence of the video information, and obtain a target face sequence frame image in which the expression vector is migrated to the target still image face;
and the replay unit is used for merging the target face sequence frame images according to the playing time sequence and determining a face replay video for transferring the expression vectors to the target static image.
The application also provides a computer storage medium for storing the data generated by the network platform and a program for processing the data generated by the network platform;
the program, when read and executed by a processor, executes the three-dimensional face generation method as described above; alternatively, the three-dimensional face reenactment method described above is performed.
The present application further provides an electronic device, comprising:
a processor;
a memory for storing a program for processing network platform production data, which when read and executed by the processor, performs the three-dimensional face generation method as described above; alternatively, the three-dimensional face reenactment method described above is performed.
Compared with the prior art, the method has the following advantages:
In the three-dimensional face generation method provided by the application, the expression vector of the first image containing the first identity object can be obtained through the target expression coding network model, where the first image is an image in which the face is occluded; the target expression coding network model can obtain the expression vector even from an occluded image. The expression vector obtained through the target expression coding network model is not affected by identity information, that is, identity decoupling is realized: the obtained expression vector is not limited by identity, and the accuracy and efficiency of expression recognition are improved. The expression vector and a second image containing a second identity object are then input into a target face generation network model, and a target image in which the expression vector of the first image is transferred to the face of the second identity object in the second image is determined. Thus, even when the facial area of the first identity object in the first image is occluded, the expression vector can still be recognized and transferred to the face of the second identity object in the second image, which improves the accuracy of facial expression transfer and, at the same time, the robustness of facial expression transfer under occlusion.
The application also provides a three-dimensional face replay method, in which a video frame image from the video information of a first identity object is input into a target expression coding network model to obtain an expression vector of the video frame image, where the video frame image may be an image in which the face of the first identity object is occluded. The expression vector and a facial still image containing a second identity object are input into the face generation network model to obtain target face sequence frame images in which the expression vector is transferred to the face of the second identity object in the facial still image; these frame images are merged according to the video playing order to obtain a face replay video in which the expression vectors are transferred to the facial still image, which ensures the accuracy and robustness of facial expression transfer in the face replay video.
Drawings
FIG. 1 is a flow chart of an embodiment of a three-dimensional face generation method provided by the present application;
FIG. 2 is a schematic structural diagram of expression code network model training in an embodiment of a three-dimensional face generation method provided by the present application;
FIG. 3 is a schematic structural diagram of a posture coding network model training in an embodiment of a three-dimensional face generation method provided by the present application;
FIG. 4 is a schematic structural diagram of face generation network model training in an embodiment of a three-dimensional face generation method provided in the present application;
FIG. 5 is a schematic structural diagram of an embodiment of a three-dimensional face generation apparatus provided in the present application;
FIG. 6 is a schematic structural diagram of an embodiment of a three-dimensional face generation model provided by the present application;
fig. 7 is a flowchart of an embodiment of a three-dimensional face replaying method provided by the present application;
fig. 8 is a schematic structural diagram of an embodiment of a three-dimensional face replaying apparatus provided by the present application;
fig. 9 is a schematic structural diagram of an embodiment of an electronic device provided in the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The application can, however, be implemented in many ways other than those described herein, and those skilled in the art can make similar modifications without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.
The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to limit the application. Terms used in this application and in the appended claims such as "a", "an", "first" and "second" are not intended to limit number or order, but rather to distinguish one type of information from another.
As can be seen from the above background art, the goal of face replay is to migrate the expression and head pose of a source face to a target face, and the prior art related to face replay can generally include two methods, namely a method based on three-dimensional face reconstruction coefficients and a method based on face key points.
In the method based on the three-dimensional face reconstruction coefficient, the expression and the head pose of a face are represented by the coefficient of the three-dimensional face, the method firstly carries out three-dimensional face reconstruction on a source face providing the expression and the pose, then obtains the reconstructed expression coefficient and the reconstructed head pose coefficient, and then embeds the expression coefficient and the head pose coefficient into a face generator for controlling the face generator to generate a target face with specific expression and pose.
In the method based on the face key points, firstly, the face key points of a source face are obtained through a face key point module, and the key points comprise expression information of the face or expression information and head posture information of the face; then, the face key points are embedded into the face generator in a corresponding manner for generating the target face.
When a human face is occluded, both the three-dimensional face reconstruction coefficient method and the face key point method can fail or produce an inaccurate reconstruction of the three-dimensional face, because the information of the occluded part cannot be detected or is detected inaccurately.
Based on the above problems, the present application provides a three-dimensional face generation method, a three-dimensional face generation network model, a model training method, and the like, so as to overcome the problems in the prior art that information acquisition is inaccurate due to occlusion of a face, and further face generation and subsequent replay fail or are inaccurate, which are described in detail below.
As shown in fig. 1, fig. 1 is a flowchart of an embodiment of a three-dimensional face generation method provided in the present application; the embodiment comprises the following steps:
Step S101: acquiring an expression vector of a first image containing a first identity object based on a target expression coding network model, wherein the target expression coding network model is a model determined by training on expression images whose faces are occluded, and the first image is an image in which the face is occluded;
Step S102: inputting the expression vector and a second image containing a second identity object into a target face generation network model, and determining a target image in which the expression vector of the first image is transferred to the face of the second identity object in the second image, wherein the target face generation network model is a model determined by expression transfer training on images of the same identity with different corresponding expressions.
In step S101, obtaining the expression vector relies on a target expression coding network model, which is a pre-trained expression coding network model. The first image of the first identity object may be a frame image extracted from video information containing face information, or it may be a still or dynamic image containing face information; the first image may be an image in which the face has an obstruction (also referred to as an occluded region). Specifically, in this embodiment, the first image may be a video frame image extracted from the video information, for example an extracted JPEG-format image; a frame-extraction sketch is given below. It is understood that the occluded frames may make up the whole of the video information, or only part of the video frame images in the video information may have facial occlusion.
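The following is a minimal sketch, not taken from the patent, of extracting video frames as JPEG images with OpenCV; the function name, output layout and sampling interval are illustrative assumptions.

```python
import os
import cv2

def extract_frames(video_path, out_dir, every_n=1):
    """Read a video file and save every n-th frame as a JPEG image."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:06d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```

Each saved frame can then serve as a candidate first image for the expression coding network model.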
The following describes, with reference to fig. 2, the expression coding network model training process in the three-dimensional face generation method provided by the present application.
As shown in fig. 2, fig. 2 is a schematic structural diagram of expression coding network model training in an embodiment of the three-dimensional face generation method provided by the present application, and a specific implementation process may include:
Step S101-1: acquiring, from an expression image sample data set, a target sample image, a positive sample image whose expression is similar to that of the target sample image, and a negative sample image whose expression is dissimilar to that of the target sample image, wherein part of the facial area in at least one of the target sample image, the positive sample image and the negative sample image is occluded, and the target sample image, the positive sample image and the negative sample image are sample images with different identity information.
The expression image sample data set can be obtained through a publicly published expression data set, or a related expression image sample data set is obtained through expression keyword search. The expression can be expression information that the face presents the emotion change of the face similar to joy, anger, sadness and the like.
Since the expression information representing facial emotion changes can be similar across images, the expression images in the collected expression image sample data set can be labelled with expression similarity. For example, suppose the expression image sample data set includes expression images A, B, C and D: expression image B is similar in expression to expression image A, with a label value of 1; expression image C is similar in expression to expression image A, with a label value of 2; expression image D is only roughly close in expression to expression image A, with a label value of 0.5; and a label value of 0 is used when the degree of similarity between expression images cannot be judged. The expression images labelled with similarity are determined as the expression similarity image sample data.
Occluders can be added randomly to the expression images in the sample data, and the form and number of added occluders are not limited; a triplet-sampling and occluder-augmentation sketch is given below.
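As a non-authoritative illustration of this data preparation step, the sketch below assumes NumPy face images indexed by id and a hypothetical similarity table of annotated label values; it draws an (anchor, positive, negative) triplet and pastes a random rectangular occluder onto one of the images. It is not the patent's implementation.

```python
import random

def paste_random_occluder(img, max_frac=0.4):
    """Overwrite a random rectangle of a face image (NumPy array, H x W x 3) with a grey patch."""
    h, w = img.shape[:2]
    oh = random.randint(h // 8, int(h * max_frac))
    ow = random.randint(w // 8, int(w * max_frac))
    y, x = random.randint(0, h - oh), random.randint(0, w - ow)
    img = img.copy()
    img[y:y + oh, x:x + ow] = 127  # solid patch; real occluders could be hands, objects, etc.
    return img

def sample_triplet(images, similarity, anchor_id):
    """images: dict id -> image; similarity[(i, j)]: annotated label value (e.g. 0, 0.5, 1, 2)."""
    others = [j for j in images if j != anchor_id]
    pos = max(others, key=lambda j: similarity.get((anchor_id, j), 0))   # most similar expression
    neg = min(others, key=lambda j: similarity.get((anchor_id, j), 0))   # least similar expression
    # at least one of the three images carries an occluder during training
    return paste_random_occluder(images[anchor_id]), images[pos], images[neg]
```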
Step S101-2: training an initial expression coding network model based on the target sample image, the positive sample image and the negative sample image, wherein the expression coding model is used for calculating expression vectors corresponding to the input images.
Step S101-3: and when the preset model convergence condition is met according to the expression vector, obtaining the target expression coding network model.
Step S101-2 of training an initial expression coding network model based on the target sample image, the positive sample image and the negative sample image may include:
step S101-21: inputting the target sample image, the positive sample image and the negative sample image into an initial expression coding network model respectively to obtain corresponding expression vectors respectively;
step S101-22: obtaining a loss value based on the expression vector corresponding to the target sample image, the expression vector corresponding to the positive sample image, the expression vector corresponding to the negative sample image and a preset loss function;
step S101-23: and determining whether a preset convergence condition is met or not according to the loss value, if not, adjusting the parameters of the initial expression coding network model, and performing the next round of training on the adjusted expression coding network model.
In this embodiment, loss data (a loss value) may be determined by a contrastive loss function or a ternary (triplet) loss function, so that the expression coding network model is optimized according to the loss data until it meets the convergence requirement, and the expression coding network model meeting the convergence requirement is determined as the target expression coding network model. In this embodiment, a similarity-based ternary loss function is used:
L_expression = max(||f(a) - f(p)||^2 - ||f(a) - f(n)||^2 + α, 0)
Here a, p and n denote the anchor sample image, the positive sample image and the negative sample image respectively, and f(·) denotes the expression vector output by the expression coding network; the positive sample image has an expression more similar to the anchor, while the negative sample image has lower expression similarity to (a larger expression distance from) the anchor. The loss value of the ternary loss function indicates whether the similarity predicted by the expression coding network model is close to the labelled ground-truth similarity value; if it is, the expression coding network model has converged, otherwise the model needs to be optimized. The optimization adjusts the model parameters of the expression coding network model according to the ternary loss function until the output prediction is close or equal to the ground-truth similarity label value, and the adjusted expression coding network model is then determined as the target expression coding network model.
It should be noted that, in this embodiment, the input of the expression coding network model is an image and the output is an expression vector; the mapping between expressions and vectors is realized by constraining the distances between feature vectors.
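A minimal PyTorch sketch of the similarity-based ternary (triplet) loss above follows; the margin value and tensor shapes are assumptions, and f(a), f(p), f(n) are the expression vectors produced by the (shared-weight) expression coding network.

```python
import torch

def expression_triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """f_a, f_p, f_n: expression vectors of the anchor, positive and negative images, shape (batch, dim)."""
    d_pos = (f_a - f_p).pow(2).sum(dim=-1)   # squared distance to the similar expression
    d_neg = (f_a - f_n).pow(2).sum(dim=-1)   # squared distance to the dissimilar expression
    return torch.clamp(d_pos - d_neg + alpha, min=0).mean()
```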
Based on the above, the specific content of step S102 may include:
step S102-11: inputting the second image into an encoder of the target face generation network model, and determining facial features corresponding to the second image;
step S102-12: inputting the expression vector of the first image and the facial features corresponding to the second image into an embedding module of the target face generation network model, and determining that the expression vector of the first image is embedded into the target facial features of the second image;
step S102-13: inputting the target facial features into a decoder of the target face generation network model, and determining the target image.
In order to further improve the accuracy of the target image and the robustness of its generation, the head pose of the first image may additionally be used as a generation parameter for generating the target image, and the method may further include:
Step S101-a1: acquiring a head pose vector of the first image based on a target pose coding network model, wherein the target pose coding network model is a model determined by training on head pose images in which the head is partially occluded;
the step S102-12 may further include: inputting the head pose vector into the target face generation network model into which the expression vector and the second image are input, and determining the target image of a second identity object face that migrates the expression vector and the head pose vector to the second image.
The target pose coding network model may be a pre-trained model, as shown in fig. 3, and fig. 3 is a schematic structural diagram of the pose coding network model training in the embodiment of the three-dimensional face generation method provided by the present application.
In this embodiment, the training of the target pose coding network model may include:
Step Sa1: acquiring a head pose image with an occluder, the head pose image having a pose ground-truth label. The head pose image can be obtained from a publicly released head pose data set, or related head pose images can be obtained through head pose keyword search. Typically, the acquired head pose image carries pose ground-truth labels such as yaw, pitch and roll, i.e. the Euler angles of the head pose.
Step Sa2: inputting the head pose image into an initial pose coding network model, and determining a predicted pose label. The head pose image is input into the initial pose coding network model, which outputs the Euler angles of the head orientation, i.e. yaw, pitch and roll; in this embodiment these three Euler angles can be understood as the angle of turning the head left or right, the angle of nodding or raising the head, and the angle of tilting the head to the left or right.
Step Sa3: obtaining a loss value based on the pose ground-truth label, the predicted pose label and a preset loss function, wherein the loss value is determined by using a Euclidean distance loss function. The Euclidean distance loss function may be:
L_pose = ||p̄ - p||^2
where p̄ denotes the predicted pose (yaw, pitch, roll) output by the pose coding network model and p denotes the ground-truth pose label values.
Step Sa4: determining whether a preset convergence condition is met according to the loss value; if not, adjusting the parameters of the initial pose coding network model and performing the next round of training on the adjusted initial pose coding model until the preset convergence condition is met, so as to obtain the target pose coding network model. According to the Euclidean distance loss function, the closer the predicted values output by the initial pose coding network model are to the ground-truth pose label values, the smaller L_pose becomes and the better the model meets the convergence requirement; the resulting target pose coding network model also has better robustness. Otherwise, the parameters are adjusted and training continues, and the process may be iterative.
Based on the above steps S101-a1 and S101-a2, the step S102 may further include:
step S102-a 1: inputting the second image into an encoder of a target face generation network model, and determining facial features corresponding to the second image;
step S102-a 2: inputting the expression vector, the head pose vector and the facial features into an embedding module of the face generation network model, and determining target facial features for embedding the expression vector and the head pose vector of the first image into the second image;
step S102-a 3: inputting the target facial features into a decoder of the face generation network model, and determining the target image.
Steps S102-a1 to S102-a3 can be interpreted as follows: the expression vector, the head pose vector and the facial features are input into the face generation network model, which outputs a target image of the second image into which the expression vector and head pose vector of the first image have been embedded. That is, the target image is an image in which the expression features and head pose features of the first image are embedded into the second image, as sketched below.
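The following sketch summarizes this generation pipeline; the module and attribute names (encoder, embed, decoder) mirror the components named in the text, but their interfaces are assumptions rather than the patent's implementation.

```python
import torch

@torch.no_grad()
def generate_target_image(expression_encoder, pose_encoder, face_generator, first_image, second_image):
    expr_vec = expression_encoder(first_image)                         # expression vector of the occluded source face
    pose_vec = pose_encoder(first_image)                               # head pose vector (yaw, pitch, roll)
    face_feat = face_generator.encoder(second_image)                   # facial features of the target identity
    target_feat = face_generator.embed(face_feat, expr_vec, pose_vec)  # embed expression and pose into the features
    return face_generator.decoder(target_feat)                         # decoded target image
```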
It is understood that the face generation network model may also be a pre-trained model, and the specific training process may refer to the following process as shown in fig. 4, which includes:
step S10b 1: inputting a third image containing a third identity into the target expression coding network model to obtain an expression vector of the third image;
step S10b 2: inputting a fourth image containing a third identity into an encoder in the initial face generation network model, and determining facial features corresponding to the fourth image;
step S10b 3: inputting the expression vector of the third image and the facial features corresponding to the fourth image into an embedding module of the initial face generation network model, and determining the facial features embedded with the expression vector of the third image;
step S10b 4: inputting facial features embedded with expression vectors of the third image into a decoder of the initial face generation network model, and determining a fifth image embedded with expression vectors of the third image;
step S10b 5: determining whether a loss value between the fifth image and the third image meets a preset convergence condition;
step S10b 6: and if not, performing next round of training on the initial face generation network model according to the reconstructed loss function until the requirement of a convergence condition is met.
Correspondingly, a head posture vector can be obtained on the basis of obtaining the expression vector, so that the robustness of the face generation network model is further improved, and therefore the method can further comprise the following steps:
step S10c 1: inputting the third image containing the third identity into the target head pose coding network model, and obtaining a head pose vector of the third image;
step S10c 2: inputting a fourth image containing a third identity into an encoder in the initial face generation network model, and determining facial features corresponding to the fourth image;
step S10c 3: inputting the head pose vector of the third image into an embedding module of the initial face generation network model, and determining the facial features embedded with the head pose vector of the third image;
based on the above steps S10c1 to S10c3, the step S10b4 includes:
inputting the facial features of the expression vector and the head pose vector embedded with the third image into a decoder of the initial face generation network model, and determining a fifth image embedded with the expression vector and the head pose vector.
It should be noted that the sample data used when training the initial face generation network model may be images of the same identity object or images of different identity objects. That is, the image input into the target expression coding network model and the target head pose coding network model may be an image of one identity object containing a face, and the image input into the initial face generation network model may be an image of that same identity object, with the same identity but a different expression and head pose. Training the initial face generation network model with images of the same identity object under different expressions and head poses can improve the accuracy of the output result; of course, training the initial face generation model is not limited to images of the same identity object. In steps S10b2 and S10c2, the fourth image of the third identity object may be input into the initial face generation network model, which encodes the fourth image into hidden-layer feature data and determines that hidden-layer feature data as the facial feature data corresponding to the fourth image. In this embodiment, the fourth image may be a still image containing face information, typically a still image containing the complete face.
In this embodiment, the facial feature data into which the expression vector, or the expression vector and the head pose vector, has been embedded is decoded, and the facial feature data is mapped to an image for output. Whether the output image allows the initial face generation network model to meet the model convergence requirement, that is, whether the output image matches the third image of the third identity object, is determined using a pixel reconstruction loss function and/or a perceptual loss function.
The pixel reconstruction loss function may be:
L_pixel = ||ȳ - y||^2
The perceptual loss function may be:
L_perceptual = ||F(ȳ) - F(y)||^2
In the above formulas, ȳ is the output image of the model, y is the ground-truth image, and F denotes a pre-trained perceptual network such as VGGNet. The pixel reconstruction loss function and the perceptual loss function may be used individually; to further ensure the sharpness of the output image, the perceptual loss function may be combined with the pixel reconstruction loss function to determine whether the face generation network model meets the convergence requirement. The convergence requirement may be based on whether the loss function has decreased to a sufficiently small and stable value, or on whether the image generated by the face generation network model matches the source face image: if so, the convergence requirement is met; if not, the model parameters of the face generation network model are adjusted according to the loss function to optimize it, and once the convergence requirement is met the model is determined to be the target face generation network model. In this embodiment, the choice of reconstruction loss function may be made in combination with the characteristics of the input images and is not limited to the pixel reconstruction loss and the perceptual loss; for example, when the input images are images of the same identity object in the same environment with different expressions, or different expressions and poses, the pixel reconstruction loss function may not be adopted and the adjustment can be made according to other parameters in the model.
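A hedged sketch of the two losses follows; using VGG16 features up to relu3_3 as the perceptual network F is an assumption (the text only says "VGGNet or the like"), and the inputs are assumed to be ImageNet-normalized image batches.

```python
import torch
import torchvision.models as models

# frozen feature extractor F (VGG16 up to relu3_3 is an illustrative choice)
vgg_features = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def reconstruction_loss(y_hat, y, perceptual_weight=1.0):
    """y_hat: generated image batch; y: ground-truth image batch."""
    pixel = ((y_hat - y) ** 2).mean()                                     # pixel reconstruction term
    perceptual = ((vgg_features(y_hat) - vgg_features(y)) ** 2).mean()    # perceptual term
    return pixel + perceptual_weight * perceptual
```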
The above is a description of an embodiment of the three-dimensional face generation method provided by the present application. With this method, the target expression coding network model and the target pose coding network model can determine the expression feature data (expression vector) and the head pose feature data (head pose vector) in a first image of a first identity object whose face is occluded; the expression feature data, or the expression feature data and the head pose feature data, are embedded into the target facial feature data in the target face generation network model, which can then output a target image in which they are migrated onto the second image of the second identity object, without the occlusion in the first image of the first identity object degrading the accuracy of the embedded data and thereby making the generated target image unclear or inaccurate.
The above is a specific description of an embodiment of a three-dimensional face generation method provided by the present application, and corresponds to the foregoing provided embodiment of a three-dimensional face generation method, the present application also discloses an embodiment of a three-dimensional face generation apparatus, please refer to fig. 5, since the apparatus embodiment is basically similar to the method embodiment, the description is relatively simple, and related points can be referred to the partial description of the method embodiment. The device embodiments described below are merely illustrative.
As shown in fig. 5, fig. 5 is a schematic structural diagram of an embodiment of a three-dimensional face generation apparatus provided in the present application; the embodiment of the training device comprises:
an obtaining unit 501, configured to obtain an expression vector of a first image including a first identity object based on a target expression code network model, where the target expression code network model is a model trained and determined based on an expression image of a face with a blocking object; the first image is an image of which the face is provided with a shelter;
the determining unit 502 is configured to input the expression vector and a second image including a second identity object into a target face generation network model, and determine a target image for transferring the expression vector of the first image to the face of the second image, where the target face generation network model is a model for performing expression transfer training determination based on different expressions corresponding to images with the same identity.
Since the target expression coding network model is a model determined by training on expression images with occlusions, the device may further include units for training an initial expression coding network model to determine the target expression coding network model, and specifically may include:
the system comprises a sample acquisition unit, a comparison unit and a display unit, wherein the sample acquisition unit is used for acquiring a target sample image, a positive sample image with a similar expression to the target sample image and a negative sample image with a dissimilar expression to the target sample image from an expression image sample data set, wherein a part of a face area in at least one of the target sample image, the positive sample image and the negative sample image is provided with a shelter, and the target sample image, the positive sample image and the negative sample image are sample images with different identity information;
the expression training unit is used for training an initial expression coding network model based on the target sample image, the positive sample image and the negative sample image, and the expression coding model is used for calculating an expression vector corresponding to an input image;
and the model obtaining unit is used for obtaining the target expression coding network model when the preset model convergence condition is met according to the expression vector.
Wherein the training unit may include: an expression vector obtaining subunit, a loss value obtaining subunit and a determining subunit;
the expression vector obtaining subunit is configured to input the target sample image, the positive sample image, and the negative sample image to an initial expression coding network model respectively, and obtain corresponding expression vectors respectively;
the loss value obtaining subunit is configured to obtain a loss value based on the expression vector corresponding to the target sample image, the expression vector corresponding to the positive sample image, the expression vector corresponding to the negative sample image, and a preset loss function;
and the determining subunit is configured to determine whether a preset convergence condition is met according to the loss value, adjust parameters of an initial expression coding network model if the preset convergence condition is not met, and perform a next round of training on the adjusted expression coding network model.
In this embodiment, the determining unit 502 includes: a facial feature determination subunit, an input subunit and a target image determination subunit;
the facial feature determining subunit is configured to input the second image into an encoder of the target face generation network model, and determine a facial feature corresponding to the second image;
the input subunit is configured to input the expression vector of the first image and the facial features corresponding to the second image into the embedding module of the target face generation network model, and determine to embed the expression vector of the first image into the target facial features of the second image;
the target image determining subunit is configured to input the target facial features into a decoder of the target face generation network model, and determine the target image.
In order to further improve the robustness of the face generation network model, this embodiment may further include: a pose vector acquisition unit and a pose input unit;
the pose vector acquisition unit is used for acquiring a head pose vector of the first image based on a target pose coding network model, wherein the target pose coding network model is a model determined by training on head pose images in which the head is partially occluded;
the pose input unit is used for inputting the head pose vector into the target face generation network model, and determining the target image in which the expression vector of the first image and the head pose vector are transferred to the face of the second identity object in the second image.
Since the target pose coding network model is a model determined by training on head pose images with occlusions, this embodiment may further include units for training an initial pose coding network model to determine the target pose coding network model, and specifically may include: a pose image acquisition unit, a predicted label determination unit, a loss value acquisition unit and a target pose determination unit;
the pose image acquisition unit is used for acquiring a head pose image with an occluder, the head pose image having a pose ground-truth label;
the predicted label determination unit is used for inputting the head pose image into an initial pose coding network model and determining a predicted pose label;
the loss value acquisition unit is used for obtaining a loss value based on the pose ground-truth label, the predicted pose label and a preset loss function;
and the target pose determination unit is used for determining whether a preset convergence condition is met according to the loss value; if not, adjusting the parameters of the initial pose coding network model and performing the next round of training on the adjusted initial pose coding model until the preset convergence condition is met, so as to obtain the target pose coding network model.
Based on the determination unit 502, it can be known that the target face generation network model is a model for performing expression migration training and determination based on different expressions corresponding to images with the same identity, and therefore, this embodiment may further include: an expression vector acquisition unit, a facial feature determination unit, an input unit, a fifth image determination unit, a loss determination unit and a function reconstruction unit;
the expression vector acquiring unit is used for inputting a third image containing a third identity object into the target expression coding network model and acquiring an expression vector of the third image;
the facial feature determining unit is used for inputting a fourth image containing a third identity object into an encoder of an initial face generation network model, and determining a facial feature corresponding to the fourth image;
the input unit is used for inputting the expression vector of the third image and the facial features corresponding to the fourth image into the embedding module of the initial face generation network model, and determining the facial features embedded with the expression vector of the third image;
the fifth image determining unit is used for inputting the facial features embedded with the expression vectors of the third image into a decoder of the initial face generation network model and determining the fifth image embedded with the expression vectors of the third image;
the loss determining unit is used for determining whether a loss value between the fifth image and the third image meets a preset convergence condition;
and the function reconstruction unit is used for carrying out the next round of training on the initial face generation network model according to the reconstructed loss function until the requirement of a convergence condition is met when the determination result of the loss determination unit is negative.
The above is a description of an embodiment of the three-dimensional face generation apparatus provided in the present application, and for specific contents of the embodiment of the generation apparatus, reference may be made to detailed descriptions of step S101 to step S102 in the embodiment of the generation method, and repeated descriptions of corresponding or the same contents are not repeated here.
Based on the above, the present application further provides a three-dimensional face generation model, as shown in fig. 6; fig. 6 is a schematic structural diagram of an embodiment of the three-dimensional face generation model provided in the present application. The model specifically includes: an expression coding network model 601 and a face generation network model 603; or it may include: an expression coding network model 601, a pose coding network model 602 and a face generation network model 603. The expression coding network model 601, the pose coding network model 602, and the face generation network model 603 are described below as an example.
The expression coding network model 601 is configured to output expression feature data (which may also be referred to as expression vectors or expression feature vectors) of the face of the first identity object according to the input first image of the first identity object; wherein the first image comprises an image of a face having an obstruction. The training process of the expression coding network model 601 is described in the above generation method, and is not described in detail here.
The pose coding network model 602 is configured to output head pose feature data (which may also be referred to as a head pose vector or a head pose feature vector) of the head of the first identity object based on the input first image of the first identity object. The training process of the pose coding network model 602 is described in the above generation method, and is not described in detail here.
The face generation network model 603 is configured to migrate the input expression feature data and head pose feature data onto the input second image of a second identity object, generating a target image of the face of the second identity object.
The face generating network model 603 may include: an encoder 603-1, an embedding module 603-2, and a decoder 603-3.
The encoder 603-1 is configured to perform encoding processing on an input second image of a second identity object, and output facial feature data corresponding to the second image;
the embedding module 603-2 is configured to embed the obtained expression feature data, or the expression feature data and the head pose feature data, into the facial feature data;
the decoder 603-3 is configured to decode the feature data output by the embedding module 603-2, and generate a target image in which expression feature data in the first image is migrated to the second image, or generate a target image in which expression feature data and head pose feature data in the first image are migrated to the second image.
For details of the face generation network model 603, reference may be made to the description of the above generation method embodiment, and details thereof are not described here.
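As an illustrative sketch of how the three sub-models of fig. 6 fit together at inference time, the following function composes an expression coding model (601), a pose coding model (602) and a face generation model with encoder, embedding module and decoder (603-1, 603-2, 603-3). The model interfaces (attributes named encoder, embed, decoder; the vector dimensions; the broadcasting of the vectors onto the feature map) are assumptions for this sketch, and the embedding module of 603 must be dimensioned to accept both vectors when the pose branch is used.

```python
import torch

def generate_target_image(expr_model, pose_model, face_gen, first_img, second_img):
    """Sketch of the forward path: expression and pose from the first image are
    embedded into the facial features of the second image and then decoded."""
    with torch.no_grad():
        expr_vec = expr_model(first_img)                    # expression feature data (601)
        pose_vec = pose_model(first_img)                    # head pose feature data (602)
        face_feats = face_gen.encoder(second_img)           # facial feature data (603-1)
        cond = torch.cat([expr_vec, pose_vec], dim=1)       # vectors to be embedded
        c = cond[:, :, None, None].expand(-1, -1, *face_feats.shape[2:])
        embedded = face_gen.embed(torch.cat([face_feats, c], dim=1))   # embedding module (603-2)
        target_img = face_gen.decoder(embedded)             # target image (603-3)
    return target_img
```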
The above is a description of an embodiment of the three-dimensional face generation model provided by the present application. The model obtains expression feature data and head pose feature data from a first image containing a first identity object through the expression coding network model and the pose coding network model. Because the expression feature data is obtained even when an obstruction exists on the face and head (or only the face) in the first image, and because the expression coding network model is trained using images containing different identity objects, the expression information and head pose information output by the model are not affected by the obstruction; more accurate expression feature data can be output, and the robustness of the expression coding network model is improved. The expression feature data and head pose feature data are then embedded into the facial feature data of the second image of the second identity object obtained by the encoder 603-1, so that identity decoupling can be achieved and the accuracy of the generated three-dimensional face image can be ensured.
Based on the above, the present application further provides a three-dimensional face replaying method, as shown in fig. 7, fig. 7 is a flowchart of an embodiment of the three-dimensional face replaying method provided in the present application; the method embodiment may include:
step S701: acquiring video information containing a first identity object and a facial still image containing a second identity object (the facial still image is also referred to below as the target still image); wherein the video information comprises video frame images in which the face has an obstruction.
Step S702: inputting the extracted video frame image in the video information into a target expression coding network model to obtain an expression vector of the video frame image;
step S703: inputting the expression vector and the target still image into a face generation network model according to the playing time sequence of the video information, and acquiring target face sequence frame images in which the expression vector is migrated onto the face of the target still image;
step S704: merging the target face sequence frame images according to the playing time sequence to obtain a face replay video in which the expression vector is migrated onto the target still image.
In this embodiment, the relationship between the video information and the facial still image in step S701 may be understood as follows: the video information is video of the first identity object speaking, and the facial expression information of the first object in the video information, or the facial expression information together with the head pose information, is migrated onto the face of the second object in the facial still image.
In this embodiment, the video frame images with obstructions may be only some of the video frame images in the video information, or may be all of the video frame images.
In this embodiment, the target expression coding network model in step S702 is a pre-trained model, that is, a model meeting the convergence requirement and/or the test requirement. For the training process of the target expression coding network model, reference may be made to the related contents of the three-dimensional face generation method, and details are not described here. In addition to obtaining the expression vector by inputting the video frame images of the video information into the target expression coding network model, the method may further include:
inputting the extracted video frame images of the video information into a target pose coding network model to obtain head pose vectors of the video frame images.
In this embodiment, the target pose coding network model may also be a pre-trained model, that is, a model meeting the convergence requirement and/or the test requirement. For the training process of the target pose coding network model, reference may be made to the related contents of the three-dimensional face generation method, and details are not described here.
In step S703, the specific implementation in this embodiment may be to embed the expression vector and the head pose vector into hidden layer features (the facial feature data of the target still image), and to decode the hidden layer features into sequence frame images carrying the expression vector and the head pose vector. Specifically, the target still image is input into the encoder of the face generation network model to obtain facial feature data; the expression vector and the head pose vector are input into the embedding module of the face generation network model, which embeds them into the facial feature data; and the decoder of the face generation network model decodes the embedded data to obtain sequence frame images in which the expression vector and the head pose vector are migrated onto the facial feature data.
It will be appreciated that the expression vector may be embedded into the facial feature data alone, or the expression vector and the head pose vector may be embedded into the facial feature data together.
The above steps S701 to S704 may be combined with the contents of the above steps S101 to S102 and the description of the three-dimensional face generation network model embodiment, and details are not repeated here. The main purpose of this embodiment is as follows: from an acquired video segment containing a first identity object and a facial still image containing a second identity object, the expression feature data and head pose feature data of each video frame image are extracted in the video playing order and input into the target face generation network model, sequence frame images corresponding to the video playing time sequence are generated, and the obtained sequence frame images are then merged into the final target video according to the playing time sequence. The target video is a face replay video formed by migrating the expression vectors and the head pose vectors onto the face of the second identity object in the facial still image. Because the expression coding network model has been trained on faces with obstructed regions and is not limited to a particular identity, corresponding expression vectors and head pose vectors can be obtained even from video frame images in which the face is obstructed; therefore, the accuracy of the face replay video or image can be improved.
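The per-frame pipeline summarized above can be sketched as follows: the still image is encoded once, and each video frame contributes an expression vector and a head pose vector that are embedded and decoded into one target sequence frame. The model interfaces, the broadcasting of the vectors, and the decision to leave video encoding to an external writer are assumptions of this sketch rather than requirements of the embodiment.

```python
import torch

def replay_faces(expr_model, pose_model, face_gen, video_frames, still_img):
    """Sketch of steps S701-S704: video_frames is an iterable of decoded frame
    tensors in playing order; still_img is the facial still image of the second identity."""
    generated = []
    with torch.no_grad():
        face_feats = face_gen.encoder(still_img)            # encode the still image once
        for frame in video_frames:                          # follow the playing time sequence
            expr_vec = expr_model(frame)                    # per-frame expression vector
            pose_vec = pose_model(frame)                    # per-frame head pose vector
            cond = torch.cat([expr_vec, pose_vec], dim=1)
            c = cond[:, :, None, None].expand(-1, -1, *face_feats.shape[2:])
            embedded = face_gen.embed(torch.cat([face_feats, c], dim=1))
            generated.append(face_gen.decoder(embedded))    # one target sequence frame
    # Step S704: merge the generated frames into the face replay video with any
    # standard video writer, keeping the original playing order and frame rate.
    return generated
```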
It should be noted that, in this embodiment, the video frame images and the facial still image may show faces of different persons; that is, the first identity object is person A, the video frame images may include the expression and/or head pose of person A, and the second identity object is person B. Alternatively, they may show the same person; that is, the video frame images may show the expression and/or head pose of person A, and the facial still image may be an image of person A containing facial information.
The above is a description of an embodiment of the three-dimensional face replaying method provided by the present application, and the description process is relatively schematic, and specific contents may be combined with the description of the embodiments of the three-dimensional face generating method and apparatus and the description of the embodiment of the three-dimensional face generating network model.
Based on the above, correspondingly, the present application further provides a three-dimensional face replaying device, as shown in fig. 8, fig. 8 is a schematic structural diagram of an embodiment of the three-dimensional face replaying device provided in the present application, where the embodiment of the device includes:
a first acquisition unit 801 for acquiring video information containing a first identity object and a facial still image containing a second identity object; wherein the video information comprises a video frame image with a face having an obstruction;
a second obtaining unit 802, configured to input the extracted video frame image in the video information into a target expression coding network model, and obtain an expression vector of the video frame image;
a third obtaining unit 803, configured to input the expression vector and the target still image into a face generation network model according to a playing timing sequence of the video information, and obtain a target face sequence frame image in which the expression vector is migrated to the target still image face;
a replay unit 804, configured to perform merging processing on the target face sequence frame images according to the playing time sequence, and determine a face replay video for transferring the expression vector to the target still image.
Specific contents of the above three-dimensional face replaying apparatus embodiment may be combined with the above three-dimensional face replaying method embodiment, the description of the three-dimensional face generation method and apparatus embodiments, and the description of the three-dimensional face generation network model embodiment; repeated descriptions are not provided here.
Based on the above, the present application further provides a computer storage medium for storing data generated by a network platform and a program for processing the data generated by the network platform;
the program, when read and executed by a processor, performs the steps in the above-described three-dimensional face generation method embodiment; alternatively, the steps in the above-described three-dimensional face replay method embodiment are performed.
Based on the above, the present application further provides an electronic device, as shown in fig. 9, fig. 9 is a schematic structural diagram of an embodiment of the electronic device provided in the present application, where the embodiment includes:
a processor 901;
a memory 902 for storing a program for processing network platform production data, which when read and executed by the processor, performs the steps involved in the above-described three-dimensional face generation method embodiments; alternatively, the steps involved in the three-dimensional face reenactment method embodiment described above are performed.
Although the present application has been described with reference to preferred embodiments, these are not intended to limit the present application. Those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application; therefore, the protection scope of the present application shall be determined by the appended claims.

Claims (14)

1. A three-dimensional face generation method, comprising:
acquiring an expression vector of a first image containing a first identity object based on a target expression coding network model, wherein the target expression coding network model is a model trained and determined based on expression images in which the face has an obstruction; and the first image is an image in which the face has an obstruction;
inputting the expression vector and a second image containing a second identity object into a target face generation network model, and determining a target image in which the expression vector of the first image is migrated onto the face of the second identity object in the second image, wherein the target face generation network model is a model determined through expression migration training based on different expressions corresponding to images of the same identity.
2. The method of claim 1, further comprising:
acquiring, from an expression image sample data set, a target sample image, a positive sample image with an expression similar to that of the target sample image, and a negative sample image with an expression dissimilar to that of the target sample image, wherein a part of the facial area in at least one of the target sample image, the positive sample image and the negative sample image has an obstruction, and the target sample image, the positive sample image and the negative sample image are sample images with different identity information;
training an initial expression coding network model based on the target sample image, the positive sample image and the negative sample image, wherein the expression coding network model is used for calculating expression vectors corresponding to input images;
and when the preset model convergence condition is met according to the expression vector, obtaining the target expression coding network model.
3. The method of claim 2, wherein training the initial expression coding network model based on the target sample image, the positive sample image, and the negative sample image comprises:
inputting the target sample image, the positive sample image and the negative sample image into an initial expression coding network model respectively to obtain corresponding expression vectors respectively;
obtaining a loss value based on the expression vector corresponding to the target sample image, the expression vector corresponding to the positive sample image, the expression vector corresponding to the negative sample image and a preset loss function;
and determining whether a preset convergence condition is met or not according to the loss value, if not, adjusting the parameters of the initial expression coding network model, and performing the next round of training on the adjusted expression coding network model.
4. The method of claim 1, wherein inputting the expression vector and a second image containing a second identity object into a target face generation network model, determining a target image that migrates the expression vector of the first image to the face of the second image comprises:
inputting the second image into an encoder of the target face generation network model, and determining facial features corresponding to the second image;
inputting the expression vector of the first image and the facial features corresponding to the second image into an embedding module of the target face generation network model, and determining that the expression vector of the first image is embedded into the target facial features of the second image;
inputting the target facial features into a decoder of the target face generation network model, and determining the target image.
5. The method of claim 1, further comprising:
acquiring a head pose vector of the first image based on a target pose coding network model, wherein the target pose coding network model is a model trained and determined based on head pose images in which the head has an obstruction;
the inputting of the expression vector and the second image containing the second identity object into the target face generation network model and the determining of the target image further comprise: inputting the head pose vector, together with the expression vector and the second image, into the target face generation network model, and determining the target image in which the expression vector and the head pose vector are migrated onto the face of the second identity object in the second image.
6. The method of claim 5, further comprising:
acquiring a head pose image with an obstruction, wherein the head pose image has a ground-truth pose label;
inputting the head pose image into an initial pose coding network model, and determining a predicted pose label;
obtaining a loss value based on the ground-truth pose label, the predicted pose label and a preset loss function;
and determining whether a preset convergence condition is met or not according to the loss value; if not, adjusting parameters of the initial pose coding network model, and performing the next round of training on the adjusted initial pose coding network model until the preset convergence condition is met, so as to obtain the target pose coding network model.
7. The method of claim 1, further comprising:
inputting a third image containing a third identity object into the target expression coding network model, and acquiring an expression vector of the third image;
inputting a fourth image containing a third identity object into an encoder in an initial face generation network model, and determining a face feature corresponding to the fourth image;
inputting the expression vector of the third image and the facial features corresponding to the fourth image into an embedding module of the initial face generation network model, and determining the facial features embedded with the expression vector of the third image;
inputting facial features embedded with expression vectors of the third image into a decoder of the initial face generation network model, and determining a fifth image embedded with expression vectors of the third image;
determining whether a loss value between the fifth image and the third image meets a preset convergence condition;
and if not, performing the next round of training on the initial face generation network model according to the reconstruction loss function until the convergence condition is met.
8. The method of claim 7, further comprising:
inputting the third image containing the third identity object into the target head pose coding network model, and acquiring a head pose vector of the third image;
inputting a fourth image containing a third identity object into an encoder in an initial face generation network model, and determining a face feature corresponding to the fourth image;
inputting the head pose vector of the third image into an embedding module of the initial face generation network model, and determining the facial features embedded with the head pose vector of the third image;
inputting the facial features embedded with the expression vector of the third image into a decoder of the initial face generation network model, and determining a fifth image embedded with the expression vector of the third image, wherein the method comprises the following steps:
inputting the facial features of the expression vector and the head pose vector embedded with the third image into a decoder of the initial face generation network model, and determining a fifth image embedded with the expression vector and the head pose vector.
9. A three-dimensional face generation apparatus, characterized by comprising:
an acquisition unit, used for acquiring an expression vector of a first image containing a first identity object based on a target expression coding network model, wherein the target expression coding network model is a model trained and determined based on expression images in which the face has an obstruction, and the first image is an image in which the face has an obstruction;
and a determining unit, used for inputting the expression vector and a second image containing a second identity object into a target face generation network model, and determining a target image in which the expression vector of the first image is migrated onto the face of the second image, wherein the target face generation network model is a model determined through expression migration training based on different expressions corresponding to images of the same identity.
10. A three-dimensional face replay method, comprising:
acquiring video information containing a first identity object and a facial still image containing a second identity object, the facial still image being also referred to as a target still image; wherein the video information comprises video frame images in which the face has an obstruction;
inputting the extracted video frame image in the video information into a target expression coding network model to obtain an expression vector of the video frame image;
inputting the expression vector and the target still image into a face generation network model according to the playing time sequence of the video information, and acquiring target face sequence frame images in which the expression vector is migrated onto the face of the target still image;
and merging the target face sequence frame images according to the playing time sequence to obtain a face replay video in which the expression vector is migrated onto the target still image.
11. The method of claim 10, further comprising:
inputting the extracted video frame image in the video information into a target pose coding network model to obtain a head pose vector of the video frame image;
wherein the inputting of the expression vector and the target still image into the face generation network model according to the playing time sequence of the video information, and the acquiring of the target face sequence frame images in which the expression vector is migrated onto the face of the target still image, further comprise:
inputting the expression vector, the head pose vector and the target still image into the face generation network model according to the playing time sequence of the video information, and acquiring target face sequence frame images in which the expression vector and the head pose vector are migrated onto the face of the target still image.
12. A three-dimensional face replay apparatus, comprising:
a first acquisition unit configured to acquire video information containing a first identity object and a facial still image containing a second identity object, the facial still image being also referred to as a target still image; wherein the video information comprises video frame images in which the face has an obstruction;
the second acquisition unit is used for inputting the extracted video frame image in the video information into a target expression coding network model and acquiring an expression vector of the video frame image;
a third obtaining unit, configured to input the expression vector and the target still image into a face generation network model according to a playing timing sequence of the video information, and obtain a target face sequence frame image in which the expression vector is migrated to the target still image face;
and the replay unit is used for merging the target face sequence frame images according to the playing time sequence and determining a face replay video for transferring the expression vectors to the target static image.
13. A computer storage medium for storing network platform generated data and a program for processing the network platform generated data;
the program, when read and executed by a processor, performs the three-dimensional face generation method according to any one of claims 1 to 8; alternatively, the three-dimensional face replay method according to claim 10 or 11 is performed.
14. An electronic device, comprising:
a processor;
a memory for storing a program for processing network platform production data, which, when read and executed by the processor, performs the three-dimensional face generation method according to any one of claims 1 to 8; alternatively, the three-dimensional face replay method according to claim 10 or 11 is performed.
CN202210402505.5A 2022-04-18 2022-04-18 Three-dimensional face generation method and device and three-dimensional face replay method and device Pending CN114898034A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210402505.5A CN114898034A (en) 2022-04-18 2022-04-18 Three-dimensional face generation method and device and three-dimensional face replay method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210402505.5A CN114898034A (en) 2022-04-18 2022-04-18 Three-dimensional face generation method and device and three-dimensional face replay method and device

Publications (1)

Publication Number Publication Date
CN114898034A true CN114898034A (en) 2022-08-12

Family

ID=82716584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210402505.5A Pending CN114898034A (en) 2022-04-18 2022-04-18 Three-dimensional face generation method and device and three-dimensional face replay method and device

Country Status (1)

Country Link
CN (1) CN114898034A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024082950A1 (en) * 2022-10-20 2024-04-25 广州市百果园信息技术有限公司 Occlusion segmentation-based three-dimensional face reconstruction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination