CN114898020A - 3D character real-time face driving method and device, electronic equipment and storage medium


Info

Publication number
CN114898020A
CN114898020A (application CN202210589964.9A)
Authority
CN
China
Prior art keywords
actor
role
character
picture set
facial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210589964.9A
Other languages
Chinese (zh)
Inventor
邱戴飞
范勇
吴永辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Weiwu Hangzhou Technology Co Ltd
Original Assignee
Weiwu Hangzhou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Weiwu Hangzhou Technology Co Ltd filed Critical Weiwu Hangzhou Technology Co Ltd
Priority to CN202210589964.9A priority Critical patent/CN114898020A/en
Publication of CN114898020A publication Critical patent/CN114898020A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 Image coding
    • G06T 9/002 Image coding using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a real-time face driving method and device for a 3D character, electronic equipment and a storage medium. The method comprises: obtaining a character facial animation file and a facial expression video file of a corresponding actor; obtaining the controller values corresponding to each animation frame from the character facial animation file; rendering the character facial animation file into a character video file, and extracting a corresponding character picture set and an actor face picture set from the character video file and the actor facial expression video file; constructing a VAE model; training the VAE model on the character picture set, the actor face picture set and the controller values corresponding to the character picture set; after training, inputting actor face pictures into the trained VAE model to obtain controller coefficients; and transmitting the controller coefficients to rendering software, which drives the 3D avatar in real time to obtain high-quality facial animation. The invention can obtain high-precision facial animation through the VAE model.

Description

3D character real-time face driving method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of 3D image processing, in particular to a real-time face driving method and device for a 3D character, electronic equipment and a storage medium.
Background
With the advent of the metaverse era, there is great demand for real-time face driving of high-precision 3D face models (30,000 to 40,000 model vertices). Current video-based face driving schemes have problems. Mobile-phone solutions, represented by the iPhone X, lack expressiveness, for example in the mouth-shape changes of a speaking person, and cannot get past the uncanny valley effect when used to drive a high-precision realistic virtual human. Face-capture helmets generally involve complicated 3D character setup, have a high barrier to use in the actor calibration stage, and require recalibration every time the actor changes.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides a method and an apparatus for driving a 3D character real-time face, an electronic device, and a storage medium.
The invention discloses a real-time face driving method for a 3D role, which comprises the following steps:
acquiring a 3D character facial animation file and a facial expression video file of a corresponding actor;
acquiring a controller value corresponding to each frame of animation based on the 3D character facial animation file;
rendering the 3D character facial animation file into a 3D character video file, and extracting a corresponding 3D character picture set and an actor facial picture set from the 3D character video file and the actor facial expression video file;
constructing a VAE model;
training the VAE model based on the 3D character picture set, the actor face picture set and the controller values corresponding to the 3D character picture set;
after training, inputting an actor face picture into the trained VAE model to obtain the controller coefficients;
and transmitting the controller coefficients to rendering software, the rendering software driving the 3D avatar in real time to obtain high-quality facial animation.
Preferably, the VAE model includes an encoder and two decoders;
the encoder encodes the input pictures in the 3D character picture set and the actor face picture set;
one of the decoders performs decoding optimization training on the encoded pictures of the 3D character picture set and the actor face picture set to obtain the optimal weights of the encoder and that decoder;
and the other decoder performs decoding training on the encoded 3D character picture set against the controller values corresponding to the 3D character picture set to obtain the controller coefficients.
Preferably, the encoder's encoding function is:

Enc(x) = (f_z(x), f_id(x)),

where x is an input picture; f_z(x) is the VAE encoding, whose output z is the code of the expression information in the latent space; and f_id(x) is the AE encoding, whose output id is the code of the identity information in the latent space.
Preferably, the loss function of the decoding optimization training is:

L = L_vae + L_rec + L_cycle,

$$L_{vae} = -\mathbb{E}_{q_\phi(z\mid x)}\left[\log p_\theta(x\mid z)\right] + D_{KL}\left(q_\phi(z\mid x)\,\|\,p(z)\right)$$

$$L_{rec} = \left\|x - \hat{x}\right\|_2^2 + L_{dssim}(x,\hat{x})$$

$$L_{cycle} = \left\|f_z\big(\mathrm{Dec}(f_z(x_1), f_{id}(x_2))\big) - f_z(x_1)\right\|_2^2 + \left\|f_z\big(\mathrm{Dec}(f_z(x_2), f_{id}(x_1))\big) - f_z(x_2)\right\|_2^2$$

in the formulas: x_1 is a 3D character picture; x_2 is an actor face picture; x̂ is the output of the decoder; q_φ(z|x) is f_z(x); p_θ(x|z) is the reconstruction of the picture from z; p(z) is the prior distribution of z; and L_dssim is the structural similarity error between x and x̂.
Preferably, the error function of the other decoder's training is:

$$L_{reg} = \left\|y - \hat{y}\right\|_2^2$$

where y is the controller value vector corresponding to one animation frame, and ŷ is the vector of predicted controller values.
Preferably, the 3D character facial animation file is an FBX file, and the controller values corresponding to each animation frame are read out through the Autodesk FBX SDK.
The invention further provides a real-time face driving device for a 3D character, including:
an acquiring module, used for acquiring a 3D character facial animation file and a facial expression video file of a corresponding actor;
a reading module, used for acquiring the controller values corresponding to each animation frame based on the 3D character facial animation file;
an extraction module, used for rendering the 3D character facial animation file into a 3D character video file and extracting a corresponding 3D character picture set and an actor face picture set from the 3D character video file and the actor facial expression video file;
a building module, used for building the VAE model;
a training module, used for training the VAE model based on the 3D character picture set, the actor face picture set and the controller values corresponding to the 3D character picture set;
a computing module, used for inputting actor face pictures into the trained VAE model after training is finished and acquiring the controller coefficients;
and a driving module, used for transmitting the controller coefficients to rendering software, the rendering software driving the 3D avatar in real time to obtain high-quality facial animation.
The invention also provides an electronic device comprising at least one processing unit and at least one memory unit, wherein the memory unit stores a computer program which, when executed by the processing unit, causes the processing unit to perform the above-mentioned method.
The invention also provides a storage medium storing a computer program executable by an electronic device, which when run on the electronic device causes the electronic device to perform the above-mentioned method.
Compared with the prior art, the invention has the beneficial effects that:
according to the invention, high-precision facial animation can be obtained through the VAE model, and the threshold is lower than that of a facial helmet.
Drawings
FIG. 1 is a schematic flow diagram of the 3D character real-time face driving method of the present invention;
FIG. 2 is a schematic structural diagram of the model in the 3D character real-time face driving method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The invention is described in further detail below with reference to the attached drawing figures:
referring to fig. 1, the invention discloses a 3D character real-time face driving method, including:
acquiring a 3D character facial animation file and a facial expression video file of a corresponding actor;
acquiring a controller value corresponding to each frame of animation based on the 3D character facial animation file;
specifically, the 3D character facial animation file is an FBX file, and the controller names used by the 3D character model and the controller values corresponding to each animation frame are read out through the Autodesk FBX SDK.
Rendering the 3D character facial animation file into a 3D character video file, and extracting a corresponding 3D character picture set and an actor face picture set from the 3D character video file and the actor facial expression video file;
specifically, frames are extracted at a certain frame rate from the 3D character video file and the actor's facial expression video file to form picture sets, and the picture of each 3D character animation frame corresponds to that frame's controller values.
Constructing a VAE model, as shown in FIG. 2;
specifically, the model comprises an encoder and two decoders, all of which are multilayer convolutional neural networks; gradient descent is used to update the model weights during training. The encoder encodes the input pictures from the 3D character picture set and the actor face picture set. Decoder 1 performs decoding optimization training on the encoded pictures from the 3D character picture set and the actor face picture set to obtain the optimal weights of the encoder and decoder 1. Then, with the encoder parameters fixed and decoder 1 and the id part discarded, the other decoder (decoder 2) performs decoding training on the encoded 3D character pictures against the controller values corresponding to the 3D character picture set to obtain the controller coefficients. The encoder encodes an input picture into a latent space: Enc(x) = (f_z(x), f_id(x)), where x is the input picture and Enc denotes the whole encoder. Here f_z(x) = z is a typical VAE encoding, and f_id(x) = id is a typical AE encoding; f_z(x) and f_id(x) share the structure and weights of the earlier encoder layers and differ only in the last layer. z is the code of the expression information in the latent space, and id is the code of the identity information in the latent space.
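The encoder split just described, a shared trunk whose last layer forks into a VAE head (z) and an AE head (id), can be sketched as follows. This is a minimal NumPy illustration, not the patented network: the linear layers stand in for the convolutional layers, and all sizes (input resolution, hidden width, D_Z, D_ID) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes: a 64x64x3 input stands in for the aligned face pictures;
# D_Z is the width of the expression code z, D_ID that of the identity code id.
D_IN, D_H, D_Z, D_ID = 64 * 64 * 3, 128, 32, 16

# Shared trunk (stands in for the shared convolutional layers).
W_trunk = rng.normal(0.0, 0.01, (D_IN, D_H))
# Separate last layers: the VAE head outputs (mu, log_var), the AE head outputs id.
W_mu = rng.normal(0.0, 0.01, (D_H, D_Z))
W_logvar = rng.normal(0.0, 0.01, (D_H, D_Z))
W_id = rng.normal(0.0, 0.01, (D_H, D_ID))

def encode(x):
    """Enc(x) = (f_z(x), f_id(x)): expression code z and identity code id."""
    h = np.tanh(x.reshape(-1) @ W_trunk)                   # shared layers
    mu, log_var = h @ W_mu, h @ W_logvar                   # VAE head
    z = mu + np.exp(0.5 * log_var) * rng.normal(size=D_Z)  # reparameterization trick
    id_code = h @ W_id                                     # plain AE head
    return z, id_code

x = rng.random((64, 64, 3))  # one (downscaled) aligned face picture
z, id_code = encode(x)
```

Decoder 1 would then take the concatenation np.concatenate([z, id_code]) as its input, while decoder 2 takes z alone.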
Training the VAE model based on the 3D character picture set, the actor face picture set and the controller values corresponding to the 3D character picture set;
specifically, during training the 3D character pictures and the actor face pictures do not need to correspond frame by frame. Training is divided into two stages. The first stage is self-supervised learning: the optimal weights of the encoder and decoder 1 are learned by reconstructing the input pictures. The loss function of the first-stage model is:

L = L_vae + L_rec + L_cycle.

L_vae is the ELBO loss of the VAE:

$$L_{vae} = -\mathbb{E}_{q_\phi(z\mid x)}\left[\log p_\theta(x\mid z)\right] + D_{KL}\left(q_\phi(z\mid x)\,\|\,p(z)\right)$$

where q_φ(z|x), i.e. f_z(x), takes an input picture x and produces the VAE code z; p_θ(x|z) denotes the reconstruction of the picture from z, corresponding to decoder 1; and p(z) is the prior distribution of z. The first term is the expected log-likelihood of x, and the second term, the KL divergence, acts as a regularization term that pushes q_φ(z|x) as close as possible to p(z).
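With the usual VAE choice of a diagonal-Gaussian q_φ(z|x) and a standard-normal prior p(z) (the text does not spell out the prior, so this is an assumption), the KL regularizer has a closed form:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# A posterior equal to the prior gives zero divergence...
assert kl_to_standard_normal(np.zeros(8), np.zeros(8)) == 0.0
# ...and any other posterior is penalized.
print(kl_to_standard_normal(np.full(8, 0.5), np.zeros(8)))  # → 1.0
```

During training this term is what keeps the expression codes z well distributed in the latent space rather than collapsing to arbitrary values.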
L_rec is the error between the reconstructed picture and the original picture:

$$L_{rec} = \left\|x - \hat{x}\right\|_2^2 + L_{dssim}(x,\hat{x})$$

where x̂ is the output of the decoder and L_dssim is the structural similarity error between x and x̂. The first term of the error focuses on per-pixel consistency, and the second term focuses on the structural similarity of the two pictures.
L_cycle is the cycle-consistency error:

$$L_{cycle} = \left\|f_z\big(\mathrm{Dec}(f_z(x_1), f_{id}(x_2))\big) - f_z(x_1)\right\|_2^2 + \left\|f_z\big(\mathrm{Dec}(f_z(x_2), f_{id}(x_1))\big) - f_z(x_2)\right\|_2^2$$

where x_1 is a 3D character picture and x_2 is an actor face picture. The first term requires that the picture obtained by passing the concatenation of the expression code f_z(x_1) of x_1 and the identity code f_id(x_2) of x_2 through the decoder Dec should, after going back through the encoder, be as close as possible to f_z(x_1); the second term is the same with the roles swapped, taking the expression code f_z(x_2) from x_2 and the identity code f_id(x_1) from x_1.
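The cycle term can be written down directly once Enc and Dec are available; below, toy linear maps stand in for them (the shapes and the linear stand-ins are illustrative assumptions, not the patented network):

```python
import numpy as np

rng = np.random.default_rng(1)
D_X, D_Z, D_ID = 64, 8, 4
E_z = rng.normal(0.0, 0.1, (D_X, D_Z))        # stands in for f_z
E_id = rng.normal(0.0, 0.1, (D_X, D_ID))      # stands in for f_id
D_w = rng.normal(0.0, 0.1, (D_Z + D_ID, D_X)) # stands in for Dec

f_z = lambda x: x @ E_z
f_id = lambda x: x @ E_id
dec = lambda z, i: np.concatenate([z, i]) @ D_w

def l_cycle(x1, x2):
    """Re-encode the cross-identity reconstructions and compare expression codes."""
    a = f_z(dec(f_z(x1), f_id(x2))) - f_z(x1)  # x1's expression on x2's identity
    b = f_z(dec(f_z(x2), f_id(x1))) - f_z(x2)  # x2's expression on x1's identity
    return np.sum(a**2) + np.sum(b**2)

x1 = rng.random(D_X)  # flattened 3D character picture
x2 = rng.random(D_X)  # flattened actor face picture
loss = l_cycle(x1, x2)
```

Note that the term is symmetric in the two pictures: swapping x1 and x2 swaps the two summands but leaves the total unchanged.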
The second stage of training is supervised learning; the data are the 3D character pictures and their corresponding controller values. A 3D character picture is input into the encoder, and the latent code z it outputs is input into decoder 2 to obtain the model-predicted controller values. The error function is

$$L_{reg} = \left\|y - \hat{y}\right\|_2^2$$

where y is the controller value vector corresponding to one animation frame and ŷ is the vector of predicted controller values.
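The second-stage objective is a plain squared error on the controller vector (the 169 dimensions used here come from the embodiment described below; the loss itself does not depend on that number):

```python
import numpy as np

def l_reg(y, y_hat):
    """Squared error between ground-truth and predicted controller vectors."""
    return np.sum((y - y_hat) ** 2)

rng = np.random.default_rng(2)
y = rng.random(169)            # controller values for one animation frame
assert l_reg(y, y.copy()) == 0.0  # a perfect prediction gives zero loss
```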
Further, to improve the robustness of the model, the training data, i.e. the 3D character pictures and the actor face pictures, are augmented, including but not limited to rotation, flipping, cropping, adding noise, and changing brightness, chrominance, contrast, saturation, simulated illumination and distortion.
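A few of the listed augmentations can be sketched with NumPy alone; the crop size, noise scale and brightness range below are illustrative assumptions, and a production pipeline would more likely use a dedicated augmentation library:

```python
import numpy as np

rng = np.random.default_rng(3)

def augment(img):
    """Apply a random subset of the augmentations named above to an HxWx3 float image."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                      # horizontal flip
    img = img * rng.uniform(0.8, 1.2)           # brightness change
    img = img + rng.normal(0, 0.02, img.shape)  # additive noise
    # Random crop back to a fixed 224x224 window (the crop size is an assumption).
    h, w = img.shape[:2]
    top = rng.integers(0, h - 224)
    left = rng.integers(0, w - 224)
    img = img[top:top + 224, left:left + 224]
    return np.clip(img, 0.0, 1.0)  # keep values in the valid [0, 1] range

out = augment(rng.random((256, 256, 3)))
```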
That is, in the first stage the VAE is trained by inputting the actor face pictures and the 3D character pictures into the model and optimizing the loss function to obtain the optimal weights of the encoder and decoder 1. So that the expression information encoded in z does not spill over into id, during training the id codes obtained from the actor photos within a batch are randomly shuffled before entering the decoder, and likewise the id codes obtained from the character photos within a batch are randomly shuffled before entering the decoder. z is sampled with the reparameterization trick and concatenated with id during training.
The second stage trains the controller decoding module. The encoder parameters are fixed, decoder 1 and the id part are discarded, and the output z of the encoder is connected to decoder 2 for training; that is, supervised training is performed on input 3D character pictures and the corresponding controller values, with the controller values serving as the output labels. The optimal weights of the controller decoder are obtained by optimizing the L_reg loss function. After training is completed, only the encoder and decoder 2 of the VAE model are used.
After training is finished, inputting the actor face pictures into the trained VAE model to obtain the controller coefficients;
and transmitting the controller coefficients to rendering software, the rendering software driving the 3D avatar in real time to obtain high-quality facial animation.
Specifically, in use, the actor face video stream is obtained from a camera, each frame of the actor's face is extracted and input into the trained model, and the result output by decoder 2 is fed into a 3D rendering engine as the controller coefficients to render the animation. The whole pipeline performs well: with GPU acceleration, and provided the camera frame rate is high enough, the whole process can run at more than 60 frames per second. Rendering software includes, but is not limited to, UE and Maya.
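The runtime loop then reduces to: grab a frame, run the encoder plus decoder 2, and push the coefficients to the renderer. A stub sketch, with synthetic frames and stub functions standing in for the camera, the trained network and the Live Link transport (all three stand-ins are assumptions):

```python
import time
import numpy as np

rng = np.random.default_rng(4)
N_CONTROLLERS = 169  # controller dimensionality from the embodiment below

def stub_model(frame):
    """Stands in for encoder + decoder 2: face picture -> controller coefficients."""
    return np.tanh(frame.mean(axis=(0, 1))[:1].repeat(N_CONTROLLERS))

def stub_send_to_renderer(coeffs):
    """Stands in for pushing coefficients to the rendering software (e.g. via Live Link)."""
    assert coeffs.shape == (N_CONTROLLERS,)

# Stands in for the camera stream: 30 synthetic 256x256 frames.
frames = (rng.random((256, 256, 3)) for _ in range(30))

start = time.perf_counter()
n = 0
for frame in frames:
    stub_send_to_renderer(stub_model(frame))
    n += 1
fps = n / (time.perf_counter() - start)  # throughput of the stub pipeline
```

In the real system the per-frame cost is dominated by the network forward pass, which is why GPU acceleration matters for reaching 60 fps.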
In this embodiment, the actor facial expression video files may come from the same actor or from multiple actors.
Examples
A ready-made realistic face from MetaHuman Creator is downloaded via Quixel Bridge and imported into Maya and UE as the 3D character.
A performance script is prepared, covering common expressions, Chinese pronunciations and Chinese sentences. Actor A performs according to the script, and a frontal face video is recorded at 30 frames per second as the reference template for subsequent actor performances.
Keyframe animation is created in Maya from actor A's performance video using MetaHuman's controllers, about 8 minutes in total; the animation is exported and stored as an FBX file, and is rendered in Maya into video files containing both frontal and side views at 30 frames per second.
Actor A, or another actor B, performs according to the performance script, and frontal and side face videos are recorded at 30 frames per second.
The controller coefficients of each frame in the FBX animation file are read out with the FBX SDK; the dimensionality is 169. Photos are extracted from the 3D character videos and the videos of actor A or other actor B, the face in each photo is aligned, and 256x256 pictures are cropped out.
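The alignment-and-crop step can be sketched as a clamped fixed-size window around a face centre; here the centre is supplied directly, whereas in practice it would come from a face detector or landmark alignment, which the text does not specify (an assumption):

```python
import numpy as np

def crop_face(img, cx, cy, size=256):
    """Cut a size x size patch centred on (cx, cy), clamped to the image bounds."""
    h, w = img.shape[:2]
    half = size // 2
    top = min(max(cy - half, 0), h - size)
    left = min(max(cx - half, 0), w - size)
    return img[top:top + size, left:left + size]

frame = np.zeros((720, 1280, 3))         # one video frame
face = crop_face(frame, cx=640, cy=360)  # centre from a (hypothetical) face detector
```

Clamping keeps the window inside the frame even when the face sits near an edge, so every training picture comes out exactly 256x256.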
The VAE model is built: the encoder comprises multiple convolution blocks, decoder 1 comprises multiple deconvolution blocks, and decoder 2 is a multilayer MLP.
The VAE model is trained with the obtained pictures; after training, the controller coefficients can be obtained from it.
The model client is started, and actor A or another actor B faces the camera; the camera video stream is read, each actor frame is input into the model, and the controller coefficients are transmitted to UE (Unreal Engine) through Live Link, thereby driving the MetaHuman 3D character in UE to perform facial animation in real time, with an effect exceeding that driven by the iPhone X.
The invention further provides a real-time face driving device for a 3D character, including:
an acquiring module, used for acquiring a 3D character facial animation file and a facial expression video file of a corresponding actor;
a reading module, used for acquiring the controller values corresponding to each animation frame based on the 3D character facial animation file;
an extraction module, used for rendering the 3D character facial animation file into a 3D character video file and extracting a corresponding 3D character picture set and an actor face picture set from the 3D character video file and the actor facial expression video file;
a building module, used for building the VAE model;
a training module, used for training the VAE model based on the 3D character picture set, the actor face picture set and the controller values corresponding to the 3D character picture set;
a computing module, used for inputting actor face pictures into the trained VAE model after training is finished and acquiring the controller coefficients;
and a driving module, used for transmitting the controller coefficients to rendering software, the rendering software driving the 3D avatar in real time to obtain high-quality facial animation.
The invention also provides an electronic device comprising at least one processing unit and at least one memory unit, wherein the memory unit stores a computer program which, when executed by the processing unit, causes the processing unit to perform the above-mentioned method.
The present invention also provides a storage medium storing a computer program executable by an electronic device, which when run on the electronic device causes the electronic device to perform the above-mentioned method.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A real-time face driving method for a 3D character is characterized by comprising the following steps:
acquiring a 3D character facial animation file and a facial expression video file of a corresponding actor;
acquiring a controller value corresponding to each frame of animation based on the 3D character facial animation file;
rendering the 3D character facial animation file into a 3D character video file, and extracting a corresponding 3D character picture set and an actor facial picture set from the 3D character video file and the actor facial expression video file;
constructing a VAE model;
training the VAE model based on the 3D character picture set, the actor face picture set and the controller values corresponding to the 3D character picture set;
after training, inputting an actor face picture into the trained VAE model to obtain the controller coefficients;
and transmitting the controller coefficients to rendering software, the rendering software driving the 3D avatar in real time to obtain high-quality facial animation.
2. The 3D character real-time face driving method according to claim 1, wherein the VAE model includes an encoder and two decoders;
the encoder encodes the input pictures in the 3D character picture set and the actor face picture set;
one of the decoders performs decoding optimization training on the encoded pictures of the 3D character picture set and the actor face picture set to obtain the optimal weights of the encoder and that decoder;
and the other decoder performs decoding training on the encoded 3D character picture set against the controller values corresponding to the 3D character picture set to obtain the controller coefficients.
3. The 3D character real-time face driving method according to claim 2, wherein the encoder's encoding function is:

Enc(x) = (f_z(x), f_id(x)),

where x is an input picture; f_z(x) is the VAE encoding, whose output z is the code of the expression information in the latent space; and f_id(x) is the AE encoding, whose output id is the code of the identity information in the latent space.
4. The 3D character real-time face driving method according to claim 3, wherein the loss function of the decoding optimization training is:

L = L_vae + L_rec + L_cycle,

$$L_{vae} = -\mathbb{E}_{q_\phi(z\mid x)}\left[\log p_\theta(x\mid z)\right] + D_{KL}\left(q_\phi(z\mid x)\,\|\,p(z)\right)$$

$$L_{rec} = \left\|x - \hat{x}\right\|_2^2 + L_{dssim}(x,\hat{x})$$

$$L_{cycle} = \left\|f_z\big(\mathrm{Dec}(f_z(x_1), f_{id}(x_2))\big) - f_z(x_1)\right\|_2^2 + \left\|f_z\big(\mathrm{Dec}(f_z(x_2), f_{id}(x_1))\big) - f_z(x_2)\right\|_2^2$$

in the formulas: x_1 is a 3D character picture; x_2 is an actor face picture; x̂ is the output of the decoder; q_φ(z|x) is f_z(x); p_θ(x|z) is the reconstruction of the picture from z; p(z) is the prior distribution of z; and L_dssim is the structural similarity error between x and x̂.
5. The 3D character real-time face driving method according to claim 4, wherein the error function of the other decoder's training is:

$$L_{reg} = \left\|y - \hat{y}\right\|_2^2$$

where y is the controller value vector corresponding to one animation frame, and ŷ is the vector of predicted controller values.
6. The 3D character real-time face driving method according to claim 1, wherein the 3D character facial animation file is an FBX file, and the controller values corresponding to each animation frame are read out through the Autodesk FBX SDK.
7. A 3D character real-time face-driving apparatus, comprising:
an acquiring module, used for acquiring a 3D character facial animation file and a facial expression video file of a corresponding actor;
a reading module, used for acquiring the controller values corresponding to each animation frame based on the 3D character facial animation file;
an extraction module, used for rendering the 3D character facial animation file into a 3D character video file and extracting a corresponding 3D character picture set and an actor face picture set from the 3D character video file and the actor facial expression video file;
a building module, used for building the VAE model;
a training module, used for training the VAE model based on the 3D character picture set, the actor face picture set and the controller values corresponding to the 3D character picture set;
a computing module, used for inputting actor face pictures into the trained VAE model after training is finished and acquiring the controller coefficients;
and a driving module, used for transmitting the controller coefficients to rendering software, the rendering software driving the 3D avatar in real time to obtain high-quality facial animation.
8. An electronic device, comprising at least one processing unit and at least one memory unit, wherein the memory unit stores a computer program that, when executed by the processing unit, causes the processing unit to perform the method of any of claims 1 to 6.
9. A storage medium storing a computer program executable by an electronic device, the program, when run on the electronic device, causing the electronic device to perform the method of any one of claims 1 to 6.
CN202210589964.9A 2022-05-26 2022-05-26 3D character real-time face driving method and device, electronic equipment and storage medium Pending CN114898020A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210589964.9A CN114898020A (en) 2022-05-26 2022-05-26 3D character real-time face driving method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114898020A true CN114898020A (en) 2022-08-12

Family

ID=82725991

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210589964.9A Pending CN114898020A (en) 2022-05-26 2022-05-26 3D character real-time face driving method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114898020A (en)

Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title

Patent Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100141663A1 (en) * 2008-12-04 2010-06-10 Total Immersion Software, Inc. System and methods for dynamically injecting expression information into an animated facial mesh
US9104908B1 (en) * 2012-05-22 2015-08-11 Image Metrics Limited Building systems for adaptive tracking of facial features across individuals and groups
CN106600667A (en) * 2016-12-12 2017-04-26 Nanjing University Method for driving face animation with video based on a convolutional neural network
CN109934767A (en) * 2019-03-06 2019-06-25 Central South University Facial expression conversion method based on identity and expression feature transfer
US20200294294A1 (en) * 2019-03-15 2020-09-17 NeoCortext Inc. Face-swapping apparatus and method
US20210327038A1 (en) * 2020-04-16 2021-10-21 Disney Enterprises, Inc. Tunable models for changing faces in images
CN111598979A (en) * 2020-04-30 2020-08-28 Tencent Technology (Shenzhen) Co., Ltd. Method, device and equipment for generating facial animation of virtual character and storage medium
WO2021232876A1 (en) * 2020-05-18 2021-11-25 Beijing Sogou Technology Development Co., Ltd. Method and apparatus for driving virtual human in real time, and electronic device and medium
CN113822437A (en) * 2020-06-18 2021-12-21 NVIDIA Corp. Deep hierarchical variational autoencoder
US20220005248A1 (en) * 2020-07-02 2022-01-06 Weta Digital Limited Generating an Animation Rig for Use in Animating a Computer-Generated Character Based on Facial Scans of an Actor and a Muscle Model
CN112200894A (en) * 2020-12-07 2021-01-08 Jiangsu Yuanli Digital Technology Co., Ltd. Automatic digital human facial expression animation migration method based on a deep learning framework
CN112541445A (en) * 2020-12-16 2021-03-23 China United Network Communications Group Co., Ltd. Facial expression migration method and device, electronic equipment and storage medium
CN112541958A (en) * 2020-12-21 2021-03-23 Tsinghua University Parametric modeling method and device for three-dimensional face
CN112700523A (en) * 2020-12-31 2021-04-23 Mofa (Shanghai) Information Technology Co., Ltd. Virtual object face animation generation method and device, storage medium and terminal
CN112700524A (en) * 2021-03-25 2021-04-23 Jiangsu Yuanli Digital Technology Co., Ltd. 3D character facial expression animation real-time generation method based on deep learning
CN113255457A (en) * 2021-04-28 2021-08-13 Shanghai Jiao Tong University Animation character facial expression generation method and system based on facial expression recognition
CN113633983A (en) * 2021-08-16 2021-11-12 Shanghai Jiao Tong University Method, device, electronic equipment and medium for controlling expression of virtual character
CN113807265A (en) * 2021-09-18 2021-12-17 Shandong University of Finance and Economics Diversified human face image synthesis method and system
CN113781616A (en) * 2021-11-08 2021-12-10 Jiangsu Yuanli Digital Technology Co., Ltd. Facial animation binding acceleration method based on a neural network
CN114494542A (en) * 2022-01-24 2022-05-13 Guangzhou Zhazha Technology Co., Ltd. Character driving animation method and system based on a convolutional neural network
CN114531561A (en) * 2022-01-25 2022-05-24 Alibaba (China) Co., Ltd. Face video coding method, decoding method and device

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
JIANG Z H, WU Q, CHEN K, ET AL.: "Disentangled representation learning for 3D face shape", IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 31 December 2019 (2019-12-31), pages 11957 - 11966 *
LI JIAHAO: "Research and Application of 3D Face Deformation Based on Deep Learning", CHINA MASTERS' THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY, no. 2021, 15 December 2021 (2021-12-15), pages 3 *
WANG HAOTONG: "Research on Face Cartoon Style Transfer Based on Latent Vector Control", CHINA MASTERS' THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY, no. 2022, 15 January 2022 (2022-01-15), pages 1 - 78 *
CAI QIANQIAN: "Research on Personalized Style Transfer Algorithms for Cartoon Avatars", CHINA MASTERS' THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY, no. 2021, 15 December 2021 (2021-12-15), pages 1 - 56 *
FEI JIANWEI, XIA ZHIHUA, YU PEIPENG, ET AL.: "A Survey of Face Synthesis Technology", JOURNAL OF FRONTIERS OF COMPUTER SCIENCE AND TECHNOLOGY, vol. 15, no. 11, 15 July 2021 (2021-07-15), pages 2025 - 2047 *
CHEN SONG, YUAN XUNMING: "A Survey of Model-Feature-Driven Algorithms for Dynamic Facial Expression Synthesis", COMPUTER AND MODERNIZATION, no. 2019, 4 July 2019 (2019-07-04), pages 47 - 54 *
CHEN KEYU: "Face Representation and Animation Driving for Digital Human Applications", CHINA MASTERS' THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY, no. 2021, 15 August 2021 (2021-08-15), pages 1 - 83 *

Similar Documents

Publication Publication Date Title
CN108596024B (en) Portrait generation method based on face structure information
WO2022267641A1 (en) Image defogging method and system based on cyclic generative adversarial network
CN110751649B (en) Video quality evaluation method and device, electronic equipment and storage medium
CN113261013A (en) System and method for realistic head rotation and facial animation synthesis on mobile devices
CN113901894A (en) Video generation method, device, server and storage medium
CN115914505B (en) Video generation method and system based on voice-driven digital human model
WO2023185395A1 (en) Facial expression capturing method and apparatus, computer device, and storage medium
US11640687B2 (en) Volumetric capture and mesh-tracking based machine learning 4D face/body deformation training
US20220414838A1 (en) Image dehazing method and system based on cyclegan
WO2024098685A1 (en) Face driving method and apparatus for virtual character, and terminal device and readable storage medium
CN114820341A (en) Image blind denoising method and system based on enhanced transform
Elgharib et al. Egocentric videoconferencing
CN114202615A (en) Facial expression reconstruction method, device, equipment and storage medium
CN114187165A (en) Image processing method and device
CN115082300A (en) Training method of image generation model, image generation method and device
Zhang et al. Adaptive affine transformation: A simple and effective operation for spatial misaligned image generation
CN116091955A (en) Segmentation method, segmentation device, segmentation equipment and computer readable storage medium
US20230206955A1 (en) Re-Timing Objects in Video Via Layered Neural Rendering
CN114898020A (en) 3D character real-time face driving method and device, electronic equipment and storage medium
CN117119123A (en) Method and system for generating digital human video based on video material
CN115578298A (en) Depth portrait video synthesis method based on content perception
CN112950501B (en) Noise field-based image noise reduction method, device, equipment and storage medium
CN115496843A (en) Local realistic-writing cartoon style migration system and method based on GAN
Oshiba et al. Face image generation of anime characters using an advanced first order motion model with facial landmarks
Wang Cartoon‐Style Image Rendering Transfer Based on Neural Networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination