CN115512014A - Method for training expression driving generation model, expression driving method and device


Info

Publication number
CN115512014A
Authority
CN
China
Prior art keywords
expression
face
parameters
facial
image
Prior art date
Legal status
Pending
Application number
CN202211228272.8A
Other languages
Chinese (zh)
Inventor
王鹏程
冀志龙
Current Assignee
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd
Priority to CN202211228272.8A
Publication of CN115512014A

Classifications

    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/20 Finite element generation, e.g. wire-frame surface description, tesselation
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G06V40/174 Facial expression recognition
    • G06V40/176 Dynamic expression


Abstract

The disclosure provides a method for training an expression driving generation model, an expression driving method, and a device. The method for training the expression driving generation model includes: acquiring a sample facial image set, wherein the sample facial image set comprises facial images of the same object with different expressions; processing the facial images in the sample facial image set using the expression driving generation model to obtain expression driving parameters of the facial images; generating a three-dimensional face mesh structure of each facial image according to the expression driving parameters; determining first facial key points of the facial image according to the three-dimensional face mesh structure; acquiring second facial key points of the facial image, the second facial key points being real facial key points; projecting the first and second facial key points from three-dimensional key points into two-dimensional key points; and updating network parameters of the expression driving generation model to minimize the key point error between the first and second facial key points. In this way, large expressions can be represented more accurately and different expressions of the same object can be driven better.

Description

Method for training expression driving generation model, expression driving method and device
Technical Field
The disclosure relates to the technical field of image processing, in particular to a method for training an expression driving generation model, an expression driving method and an expression driving device.
Background
With the development of virtual reality (VR), augmented reality (AR), and the concept of the metaverse, how to drive the expressions of a virtual character in a virtual space has become an important research issue.
In the related art, a trained convolutional neural network processes a person image and outputs expression driving parameters of the person, which are then used to drive a real face or the face of a virtual character.
This approach has problems such as failing to accurately express large expressions (also called exaggerated expressions) during actual expression driving.
Disclosure of Invention
According to an aspect of the present disclosure, there is provided a method of training an expression-driven generative model, comprising:
acquiring a sample face image set, wherein the sample face image set comprises face images of the same object with different expressions;
processing the facial images in the sample facial image set by using the expression driving generation model to obtain expression driving parameters of the facial images;
generating a three-dimensional face mesh structure of the facial image according to the expression driving parameters;
determining a first face key point of the face image according to the three-dimensional face mesh structure;
acquiring a second face key point of the face image, wherein the second face key point is a real face key point;
projecting the first face key points and the second face key points into two-dimensional key points from three-dimensional key points according to a preset projection function;
updating at least a portion of network parameters of the expression driven generative model to minimize keypoint errors between the first facial keypoints and the second facial keypoints.
According to another aspect of the present disclosure, there is provided an expression driving method including:
acquiring a face image to be processed;
processing the facial image by using an expression driving generation model trained with the method of the present disclosure, to obtain expression driving parameters;
and performing expression driving according to the expression driving parameters.
According to another aspect of the present disclosure, there is provided an apparatus for training an expression-driven generative model, comprising:
the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring a sample facial image set, and the sample facial image set comprises facial images of the same object with different expressions;
a training module to:
processing the facial images in the sample facial image set by using the expression driving generation model to obtain expression driving parameters of the facial images;
generating a three-dimensional face mesh structure of the facial image according to the expression driving parameters;
determining a first face key point of the face image according to the three-dimensional face mesh structure;
acquiring a second face key point of the face image, wherein the second face key point is a real face key point;
projecting the first face key points and the second face key points into two-dimensional key points from three-dimensional key points according to a preset projection function; and
updating at least a portion of network parameters of the expression driven generative model to minimize keypoint errors between the first facial keypoints and the second facial keypoints.
According to another aspect of the present disclosure, there is provided an electronic device including:
a processor; and
a memory for storing a program,
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the method of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the present disclosure.
According to one or more technical solutions provided in the embodiments of the present application, the expression driving generation model is trained on facial images of different expressions of the same object, so that the trained expression driving generation model can represent large expressions more accurately and better drive different expressions of the same object.
Drawings
Further details, features and advantages of the disclosure are disclosed in the following description of exemplary embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 shows a flow diagram of a method of training an expression driven generative model according to an exemplary embodiment of the present disclosure;
FIG. 2 shows a flowchart for acquiring facial images of different expressions of the same subject according to an exemplary embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of a conditional generative adversarial network generating facial images of different expressions of the same subject, according to an example embodiment of the present disclosure;
FIG. 4 shows a flowchart of a method of training based on a keypoint loss function, according to an example embodiment of the present disclosure;
FIG. 5 shows a flowchart of a keypoint selection method according to an exemplary embodiment of the present disclosure;
FIG. 6 shows a flow chart of a shape consistency error based training method according to an example embodiment of the present disclosure;
FIG. 7 shows a flowchart of a training method of errors between an input face and a synthesized face according to an example embodiment of the present disclosure;
FIG. 8 shows a flowchart of a method of rendering a two-dimensional facial image according to an example embodiment of the present disclosure;
FIG. 9 shows a schematic diagram of rendering a two-dimensional facial image using an intermediate network, according to an example embodiment of the present disclosure;
FIG. 10 illustrates a flow diagram of a method of rendering a two-dimensional facial image using an intermediate network according to an exemplary embodiment of the present disclosure;
FIG. 11 shows a flowchart of an expression parameter error-based training method according to an example embodiment of the present disclosure;
FIG. 12 shows a flow diagram of a neural network training and expression driving method according to an example embodiment of the present disclosure;
FIG. 13 shows a schematic diagram of face image extraction according to an exemplary embodiment of the present disclosure;
fig. 14 illustrates a flowchart of an expression driving method according to an exemplary embodiment of the present disclosure;
FIG. 15 is a schematic block diagram of an apparatus for training an expression driven generative model according to an exemplary embodiment of the present disclosure;
fig. 16 shows a schematic block diagram of an expression driver apparatus according to an exemplary embodiment of the present disclosure;
FIG. 17 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and the embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein is intended to be open-ended, i.e., "including but not limited to". The term "according to" is "at least partially according to". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description. It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Aspects of the present disclosure are described below with reference to the accompanying drawings.
The embodiment of the disclosure provides a method for training an expression-driven generation model.
Fig. 1 shows a flowchart of a method of training an expression-driven generative model according to an exemplary embodiment of the present disclosure, which includes steps S101 to S102, as shown in fig. 1.
Step S101, a sample facial image set is obtained, wherein the sample facial image set comprises facial images of the same object with different expressions.
The object may include, but is not limited to, a person and, correspondingly, the facial image includes, but is not limited to, a human face image.
The training data may include facial images of one or more subjects.
And S102, training an expression driving generation model according to the sample face image set.
The expression driving generation model takes a facial image as input and outputs expression driving parameters. The expression driving parameters may include shape parameters, expression parameters, texture parameters, and the like. Optionally, the expression driving parameters further include one or more of a pose parameter, a camera parameter, and a lighting parameter.
The expression driving generation model may be any of various convolutional neural networks, which is not limited in this embodiment.
With this exemplary embodiment, the expression driving generation model is trained on facial images of the same object with different expressions, so the trained expression driving generation model can represent large expressions more accurately and better drive different expressions of the same object. In contrast, the related art cannot provide various expressions of the same object, so large expressions cannot be accurately expressed during expression driving.
In some cases, the illumination parameters, background information, and camera parameters of the facial image of the same object are greatly changed, and the expression driving effect under the condition of fixed illumination parameters and camera parameters in an actual use scene cannot be guaranteed. For this reason, in some embodiments, as shown in fig. 2, facial images of different expressions of the same subject in step S101 are acquired, including step S201 to step S202.
In step S201, a face image of a subject is acquired.
As one embodiment, acquiring a face image of a subject includes: the method comprises the steps of acquiring an image of a subject, detecting and extracting a face image area in the image, and obtaining a face image of the subject.
Step S202, the facial image of the object is processed according to different expression conditions by using the expression generation network, and facial images of different expressions of the object are obtained.
Because the facial images of different expressions of the same object are generated based on the same facial image, the illumination parameters, the background information and the camera parameters of the facial image are consistent, and the expression driving effect under the condition of fixing the illumination parameters and the camera parameters in an actual use scene can be improved.
As an embodiment, the expression generation network may be a conditional generative adversarial network (conditional GAN), and an expression condition is a combination of expression action units (facial action units). Facial expressions are described using expression action units, which are anatomically related to specific facial muscle contractions; a variety of facial expressions can be obtained by combining expression action units. The correspondence between combinations of expression action units and expressions can be found in the prior art and is not described in detail here.
An exemplary conditional generative adversarial network for generating facial images of the same subject with different expressions is described below in conjunction with fig. 3.
Fig. 3 is a schematic diagram of a conditional generative adversarial network generating facial images of different expressions of the same object according to an exemplary embodiment of the present disclosure. Referring to fig. 3, the conditional generative adversarial network has an encoder-decoder structure, denoted G_I in fig. 3. The face image of the subject is input into the conditional generative adversarial network, and the expression condition is a combination of expression action units; by changing the input expression action units, facial images of different expressions can be output. Referring to fig. 3, the input face image is denoted I_yr and the expression conditions are Y_g1, Y_g2 and Y_g3, which are processed by G_I to obtain images of different expressions, I_yg1, I_yg2 and I_yg3 respectively. Because I_yg1, I_yg2 and I_yg3 are all generated from I_yr, their illumination parameters, background information and camera parameters are the same, which improves the expression driving effect under fixed illumination and camera parameters. Moreover, because I_yg1, I_yg2 and I_yg3 are facial images of different expressions of the same object, the driving effect for different expressions of the same person is improved; having multiple expressions of the same object also improves the representation of large expressions.
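As a concrete illustration (not part of the patent), the following Python sketch shows one way such a conditioned encoder-decoder generator G_I could be assembled in PyTorch. The class name ExpressionGenerator, the layer sizes, the 17-dimensional action-unit vector and the conditioning-by-concatenation scheme are all assumptions made only for this example.

# Illustrative sketch only: the patent gives no code. All class and variable
# names (ExpressionGenerator, au, etc.) and the layer sizes are assumptions.
import torch
import torch.nn as nn

class ExpressionGenerator(nn.Module):
    """Encoder-decoder generator G_I conditioned on an action-unit vector."""
    def __init__(self, au_dim: int = 17):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3 + au_dim, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, face: torch.Tensor, au: torch.Tensor) -> torch.Tensor:
        # Broadcast the AU condition to a spatial map and concatenate it with the image.
        b, _, h, w = face.shape
        au_map = au.view(b, -1, 1, 1).expand(b, au.shape[1], h, w)
        return self.decoder(self.encoder(torch.cat([face, au_map], dim=1)))

g = ExpressionGenerator()
face = torch.rand(1, 3, 128, 128)            # I_yr: one face image of the subject
for au in (torch.rand(1, 17), torch.rand(1, 17), torch.rand(1, 17)):  # Y_g1..Y_g3
    generated = g(face, au)                  # I_yg1..I_yg3 share lighting and background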
As an embodiment, the conditional generative adversarial network is trained. An exemplary training method for the conditional generative adversarial network is described below.
In this example, the network includes a generator (G_I) and a discriminator (D_I). The generator is used to obtain the image corresponding to an expression, and the discriminator is used to judge whether the generated image is realistic enough. Meanwhile, in order to eliminate the interference of the remaining background information in the image, a mask area m corresponding to the face is obtained using a face segmentation network, and the generated image is multiplied by m pixel by pixel to obtain the output images (I_yg1, I_yg2 and I_yg3).
After the output images are obtained, the face perception loss L_idt between the original image (i.e., the input face image) and the output images (see formulas (1) and (2)) and the adversarial generation loss are calculated.

L_idt_i = ||f(I) - f(I_ygi)||_2  (1)

L_idt = ∑_{i=1}^{N} L_idt_i  (2)

In formulas (1) and (2), the function f denotes a face feature extraction structure; I denotes the original image, I_ygi denotes the i-th output image, and N denotes the number of output images corresponding to the original image; L_idt_i denotes the face perception loss of the i-th of the N output images corresponding to the original image I, and L_idt sums L_idt_i over the N output images; ||·||_2 denotes the Euclidean norm (also known as the 2-norm).
As an embodiment, the function f may use a VGG network (Visual Geometry Group Net), which is not limited in this embodiment.
The optimization function L of the model balances these two losses, as shown in formula (3).

L = L_adv + λ·L_idt  (3)

In formula (3), λ is the weight of the face perception loss L_idt between the original image and the output images, and L_adv is the adversarial generation loss.
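The following sketch illustrates formulas (1) to (3) under the assumption that the feature extractor f is a VGG-style network, as mentioned above, and that the adversarial loss L_adv is computed elsewhere by the discriminator; the weight value is an arbitrary illustrative choice.

import torch
import torchvision.models as models

# Stand-in for the face feature extraction structure f; a face-specific network
# could be substituted. weights=None avoids downloading pretrained weights here.
f = models.vgg16(weights=None).features.eval()

def face_perception_loss(original, outputs):
    # L_idt = sum_i || f(I) - f(I_ygi) ||_2 over the N generated images (formulas (1)-(2)).
    ref = f(original)
    return sum(torch.norm(f(out) - ref, p=2) for out in outputs)

lam = 10.0  # weight λ of L_idt, an illustrative value

def total_loss(l_adv, original, outputs):
    # L = L_adv + λ·L_idt (formula (3))
    return l_adv + lam * face_perception_loss(original, outputs)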
In some embodiments, facial images of a plurality of subjects may be acquired, facial images of different expressions of each subject may be generated, and a sample set of facial images including facial images of different expressions of each subject may be formed.
The following describes training of the expression-driven generative model based on a sample set of facial images.
Fig. 4 shows a flowchart of a training method based on a keypoint loss function according to an exemplary embodiment of the present disclosure, and as shown in fig. 4, the method includes steps S401 to S406.
Step S401, the facial images in the sample facial image set are processed using the expression driving generation model to obtain expression driving parameters of the facial images.
Wherein facial images of at least some of the subjects with different expressions are obtained based on the method shown in fig. 2. The facial images of the same object with different expressions obtained by the method of fig. 2 are generated based on the same facial image, so that the illumination parameters, the background information and the camera parameters of the facial images are consistent, and the expression driving effect under the condition of fixing the illumination parameters and the camera parameters in an actual use scene can be improved.
Step S402, generating a three-dimensional face mesh structure of the face image according to the expression driving parameters.
With the expression driving parameters known, a three-dimensional face mesh structure of the face image can be generated. In some embodiments, a three-dimensional face mesh structure of the face image is generated based on the shape parameters. In other embodiments, a three-dimensional face mesh structure of the facial image is generated based on shape parameters, expression parameters, texture parameters, and the like.
As an embodiment, the expression driving parameters include shape parameters, and wherein generating a three-dimensional face mesh structure of the face image from the expression driving parameters includes:
for facial images of the same object with different expressions, fusing shape parameters corresponding to the facial images of the object with different expressions through a full connection layer to obtain fused shape parameters;
and generating a three-dimensional face mesh structure of the object according to the fusion shape parameters so as to obtain the three-dimensional face mesh structure corresponding to the facial images of different expressions of the object.
By the embodiment, because the shapes of the facial images with different expressions of the same object are consistent, the shape parameters are consistent, and therefore, the fused shape parameters can be obtained based on the shape parameters corresponding to different expressions of the same object, and the accuracy of the three-dimensional face mesh structure of the object is improved.
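A minimal sketch of such a fusion step is shown below, assuming the shape parameters are 100-dimensional (as in the example later in this disclosure) and that three expressions of the subject are available; the class name ShapeFusion and the use of a single fully connected layer over the concatenated vectors are assumptions for illustration.

import torch
import torch.nn as nn

class ShapeFusion(nn.Module):
    def __init__(self, shape_dim: int = 100, num_expressions: int = 3):
        super().__init__()
        # Concatenate the shape vectors of all expressions and map back to one vector.
        self.fc = nn.Linear(shape_dim * num_expressions, shape_dim)

    def forward(self, shape_params: torch.Tensor) -> torch.Tensor:
        # shape_params: (batch, num_expressions, shape_dim) for each subject
        b = shape_params.shape[0]
        return self.fc(shape_params.reshape(b, -1))   # fused shape parameters

fusion = ShapeFusion()
alphas = torch.rand(2, 3, 100)          # shape parameters of 3 expressions per subject
alpha_fused = fusion(alphas)            # (2, 100), shared by all expressions of each subject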
The three-dimensional face mesh structure of the face image may include three-dimensional key point coordinates of the face.
Step S403, determining a first face key point of the face image according to the three-dimensional face mesh structure.
The first face keypoints are three-dimensional keypoints.
As an implementation manner, the three-dimensional face mesh structure is used to obtain a three-dimensional face key point corresponding to the designated index as a first face key point.
Step S404, a second face key point of the face image is obtained, wherein the second face key point is a real face key point.
The second face key points refer to the real key points of the face image, and the first face key points refer to the key points of the face synthesized based on the expression driving parameters of the expression driving generation model. Based on the difference between the first facial key point and the second facial key point, the accuracy of the expression driving parameters generated by the expression driving generation model can be determined.
The second face keypoints can be pre-labeled, for example, by determining the keypoints of the face image through a face keypoint detection algorithm as the actual keypoints of the face image.
The accuracy of the second facial key points influences the training effect of the expression driving generation model. If the second facial key points have large errors, training the expression driving generation model based on the difference between the second and first facial key points will degrade the model. To this end, as an embodiment, coordinated key point selection is adopted to improve the accuracy of the selected key points.
Fig. 5 is a flowchart illustrating a keypoint selection method according to an exemplary embodiment of the present disclosure, and as shown in fig. 5, includes steps S501 to S502.
Step S501, using a plurality of key point prediction models to process the face image, and obtaining a plurality of prediction results corresponding to each face image, where each prediction result includes the prediction coordinates of the second face key point.
As an embodiment, processing the face image using multiple key point prediction models to obtain multiple prediction results for each face image includes: processing the face image with a dlib model to obtain predicted coordinates of two-dimensional key points of the face image; processing the face image with a three-dimensional face alignment network (3DFAN) to obtain predicted coordinates of three-dimensional facial key points of the face image; and processing the face image with a practical facial landmark detector (PFLD) to obtain predicted coordinates of three-dimensional facial key points of the face image. It should be understood that the embodiment of the present application does not limit the key point prediction models, and one or more other key point prediction models may be used.
Step S502, for each second facial key point, among the multiple predicted coordinates corresponding to that key point, predicted coordinates whose distance from one another is smaller than a preset value are regarded as the same coordinate, and the predicted coordinate that occurs most often is taken as the coordinate of the second facial key point.
In one embodiment, the coordinates of the second facial keypoints are determined from the multiple predictions by a voting method.
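The following NumPy sketch illustrates the idea of steps S501 to S502 for a single key point: predictions from several models are treated as the same coordinate when their distance is below a preset value, and the coordinate supported by the most predictions wins. The distance threshold and array shapes are illustrative assumptions.

import numpy as np

def select_keypoint(preds: np.ndarray, dist_thresh: float = 3.0) -> np.ndarray:
    """preds: (m, 2) predicted coordinates of one key point from m models."""
    counts = []
    for p in preds:
        # Predictions closer than the threshold are counted as the same coordinate.
        same = np.linalg.norm(preds - p, axis=1) < dist_thresh
        counts.append(same.sum())
    best = int(np.argmax(counts))            # coordinate that "occurs" most often
    return preds[best]

preds = np.array([[100.2, 50.1], [101.0, 50.4], [180.0, 90.0]])  # e.g. dlib / 3DFAN / PFLD
print(select_keypoint(preds))                # picks one of the two agreeing predictions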
Step S405, the first face key points and the second face key points are projected into two-dimensional key points from the three-dimensional key points according to a preset projection function.
Step S406, at least part of the network parameters of the expression driven generative model are updated to minimize keypoint errors between the first facial keypoints and the second facial keypoints.
As an embodiment, the second face keypoints comprise one or more of eye keypoints and mouth keypoints, and the keypoint errors comprise one or more of eye opening and closing errors and mouth opening and closing errors.
Further, in step S406, updating at least part of the network parameters of the expression-driven generative model to minimize the keypoint error between the first facial keypoint and the second facial keypoint comprises: updating at least a portion of network parameters of the expression driven generative model to minimize one or more of eye opening and mouth opening and closing errors.
As an embodiment, the eye opening and closing error L_eye is determined according to the loss function shown in formula (4).

L_eye = ∑_{(i,j)∈E} ||(e_i - e_j) - (pe_i - pe_j)||_2  (4)

In formula (4), e_i denotes the coordinates of an upper-eyelid key point among the second facial key points, e_j denotes the coordinates of a lower-eyelid key point among the second facial key points, pe_i denotes the coordinates of an upper-eyelid key point among the first facial key points, and pe_j denotes the coordinates of a lower-eyelid key point among the first facial key points. E denotes the set of eye key points, i.e., the set of upper- and lower-eyelid key point pairs.
As an embodiment, the mouth opening and closing error L_mouth is determined according to the loss function shown in formula (5).

L_mouth = ∑_{(i,j)∈M} ||(m_i - m_j) - (pm_i - pm_j)||_2  (5)

In formula (5), m_i denotes the coordinates of an upper-lip key point among the second facial key points, m_j denotes the coordinates of a lower-lip key point among the second facial key points, pm_i denotes the coordinates of an upper-lip key point among the first facial key points, and pm_j denotes the coordinates of a lower-lip key point among the first facial key points. M denotes the set of mouth key points.
For example, the mouth opening and closing error is determined using the coordinates of the key points of the upper inner lip and the lower inner lip, and M represents a set of the coordinates of the key points of the upper inner lip and the lower inner lip.
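A compact sketch of formulas (4) and (5) is given below; the eyelid and inner-lip index pairs follow the common 68-point landmark layout and are assumptions, since the patent does not fix the indices.

import torch

def open_close_loss(real, pred, pairs):
    # sum over (i, j) of || (real_i - real_j) - (pred_i - pred_j) ||_2
    loss = real.new_zeros(())
    for i, j in pairs:
        loss = loss + torch.norm((real[i] - real[j]) - (pred[i] - pred[j]), p=2)
    return loss

real_kpts = torch.rand(68, 2)               # second (real) facial key points
pred_kpts = torch.rand(68, 2)               # first (reconstructed) facial key points
eye_pairs = [(37, 41), (38, 40), (43, 47), (44, 46)]     # upper/lower eyelid pairs (assumed)
mouth_pairs = [(61, 67), (62, 66), (63, 65)]             # inner upper/lower lip pairs (assumed)
l_eye = open_close_loss(real_kpts, pred_kpts, eye_pairs)       # formula (4)
l_mouth = open_close_loss(real_kpts, pred_kpts, mouth_pairs)   # formula (5)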
Further, a key point re-projection loss is used to further optimize the face reconstruction effect. The related-art method uses the three-dimensional face key point coordinates alone to supervise the training of the neural network; however, during rendering the three-dimensional face has to be projected onto the two-dimensional image according to the predicted camera parameters, and in this process the camera parameters and the key point coordinates are coupled, which can degrade the face reconstruction effect. Therefore, the embodiment of the present disclosure proposes a key point re-projection loss L_re-projection, defined as shown in formula (6).

L_re-projection = ∑_{(i,j)∈N} ||R(k_i - k_j) - R(p_i - p_j)||_2  (6)

In formula (6), R denotes a three-dimensional coordinate re-projection function, N denotes a set of key points, k_i and k_j denote second facial key points, and p_i and p_j denote first facial key points. N may be a set of mouth key points or a set of eye key points.
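As one possible reading of formula (6), the sketch below applies a simple pinhole projection as the re-projection function R to the pairwise key point offsets; the projection model and focal length are assumptions, since the patent only states that R is a three-dimensional coordinate re-projection function.

import torch

def project(points3d: torch.Tensor, focal: float = 1015.0) -> torch.Tensor:
    """Simple perspective projection of (n, 3) points to (n, 2) coordinates."""
    z = points3d[:, 2:3].clamp(min=1e-6)
    return focal * points3d[:, :2] / z

def reprojection_loss(real3d, pred3d, pairs):
    # Formula (6): compare projected pairwise offsets of real and reconstructed key points.
    loss = real3d.new_zeros(())
    for i, j in pairs:
        loss = loss + torch.norm(project(real3d[i:i+1] - real3d[j:j+1])
                                 - project(pred3d[i:i+1] - pred3d[j:j+1]), p=2)
    return loss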
In step S406, when the key point error is an eye opening and closing error, the network parameters of the expression drive generation model are updated to minimize the eye opening and closing error. And when the key point error is the mouth opening and closing error, updating network parameters of the expression driving generation model to minimize the mouth opening and closing error.
In step S406, when the key point error includes an eye opening and closing error and a mouth opening and closing error, as an example, the network parameters of the expression driving generation model may be updated to minimize the eye opening and closing error, and then the network parameters of the expression driving generation model may be further updated to minimize the mouth opening and closing error; as another example, the network parameters of the expression-driven generative model may be updated to minimize the mouth opening and closing error, and then the network parameters of the expression-driven generative model may be further updated to minimize the eye opening and closing error. As still another example, the eye opening and closing error and the mouth opening and closing error may be converted into a training error by means of weighted averaging, and the network parameters of the expression driving generation model may be updated to minimize the converted training error. This embodiment is not limited to this.
In some embodiments, further comprising: determining a face error according to a first face key point and a second face key point of the face image, wherein the key points of different parts have different influence weights on the face error. During the training process, at least part of the network parameters of the expression driven generative model are updated to minimize facial errors.
As an embodiment, the facial error L_landmark is determined using the key point loss function shown in formula (7).

L_landmark = ∑_{n=1}^{N} w_n ||q_n - q'_n||_2  (7)

In formula (7), w_n denotes the weight of the n-th key point, q_n denotes the n-th second facial key point, q'_n denotes the n-th first facial key point, and N is the number of key points.
In one embodiment, in order to better represent a large expression, the influence weight of the key points of at least one of the eyes, lips, and nose is greater than the influence weight of the key points of the other parts. Illustratively, the weights of the key points of the left and right eyes, the lip and the nose are set to a first value, and the weights of the key points of the rest parts are set to a second value, wherein the first value is greater than the second value, for example, the first value is 20 and the second value is 1.
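A sketch of formula (7) with the weighting scheme just described is shown below; the 68-point indices chosen for the eye, inner-lip and nose key points are assumptions.

import torch

def landmark_loss(real, pred, high_weight_idx, w_high=20.0, w_low=1.0):
    # Formula (7): weighted 2-norm between real (second) and reconstructed (first) key points.
    w = torch.full((real.shape[0],), w_low)
    w[high_weight_idx] = w_high
    return (w * torch.norm(real - pred, dim=1)).sum()

real_kpts = torch.rand(68, 2)    # second (real) facial key points
pred_kpts = torch.rand(68, 2)    # first (reconstructed) facial key points
# Assumed 68-point indices: nose 27-35, eyes 36-47, inner lip 60-67.
high_idx = list(range(27, 36)) + list(range(36, 48)) + list(range(60, 68))
l_landmark = landmark_loss(real_kpts, pred_kpts, high_idx)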
The expression driving parameters comprise shape parameters, and the shape parameters of the facial images of different expressions of the same object have consistency. In some embodiments, the expression-driven generative model is trained based on shape consistency errors.
Fig. 6 shows a flowchart of a training method based on shape consistency errors according to an exemplary embodiment of the present disclosure, and as shown in fig. 6, the method includes steps S601 to S603.
Step S601, aiming at the facial images of the same object with different expressions in the sample facial image set, the facial images are processed by using an expression driving generation model to obtain the shape parameters of the facial images.
Step S602, determining the shape consistency error according to the shape parameters corresponding to the facial images of different expressions of the same object.
The shape parameters corresponding to the facial images of different expressions of the same object are kept consistent, the shape consistency errors are determined according to the shape parameters corresponding to the facial images of different expressions of the same object, and the prediction accuracy of the expression driving parameters is optimized based on the shape consistency errors.
As an embodiment, the shape consistency error L_shape is determined according to formula (8).

L_shape = ∑ ||α_shape,i - α_shape,j||  (8)

In formula (8), α_shape,i and α_shape,j denote the shape parameters of the i-th and j-th expressions of the same object, where i and j range over the number of expressions of the object.
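A sketch of formula (8) is shown below: the shape vectors predicted for every pair of expressions of the same subject are compared directly; the 100-dimensional shape vector is an assumption taken from the example later in this disclosure.

import torch
from itertools import combinations

def shape_consistency_loss(shape_params: torch.Tensor) -> torch.Tensor:
    """shape_params: (num_expressions, shape_dim) predicted for one subject."""
    loss = shape_params.new_zeros(())
    for i, j in combinations(range(shape_params.shape[0]), 2):
        loss = loss + torch.norm(shape_params[i] - shape_params[j])
    return loss

alphas = torch.rand(3, 100)      # shape parameters of three expressions of one subject
l_shape = shape_consistency_loss(alphas)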
In one embodiment, facial images of different expressions of the same subject are obtained based on the method shown in fig. 2.
Step S603, at least part of the network parameters of the expression driver generation model are updated to minimize the shape consistency error.
As an implementation mode, the shape consistency error is subjected to gradient back propagation during training, the reconstructed three-dimensional face shape parameters of the same object are constrained, and finally the prediction accuracy of the expression parameters can be optimized.
In the embodiment of the present disclosure, at least part of the network parameters of the expression-driven generative model may be updated first to minimize the shape consistency error, on which basis at least part of the network parameters of the expression-driven generative model may be further updated to minimize the keypoint error (one or more of the eye opening and closing error, mouth opening and closing error, and face error). It should be understood that the disclosed embodiments do not limit the order in which the network parameters are updated using errors.
Fig. 7 shows a flowchart of a training method of an error between an input face and a synthesized face according to an exemplary embodiment of the present disclosure, which includes steps S701 to S705, as shown in fig. 7.
Step S701, processing the facial images in the sample facial image set by using the expression driving generation model to obtain expression driving parameters of the facial images.
Step S702, generating a three-dimensional face mesh structure of the face image according to the expression driving parameters.
Step S703 is to render each face image according to the three-dimensional face mesh structure of the face image to obtain a two-dimensional face image corresponding to the face image.
The two-dimensional facial image is a facial image rendered based on the expression driving parameters, and the rendered facial image should be consistent with the original facial image. Therefore, the expression-driven generative model is trained based on the error of the two.
For facial images of the same object with different expressions, the shape parameters of the facial images with different expressions are fused, so that the accuracy of the shape parameters is improved, and the accuracy of the facial images obtained by rendering is improved.
Fig. 8 illustrates a flowchart of a method of rendering a two-dimensional face image according to an exemplary embodiment of the present disclosure. As illustrated in fig. 8, the method includes steps S801 to S803.
Step S801, fusing shape parameters corresponding to facial images of the same object with different expressions to obtain fused shape parameters.
Step S802, generating a three-dimensional face mesh structure of the object according to the fusion shape parameters.
Step S803, a two-dimensional facial image corresponding to each expression of the object is generated according to the three-dimensional facial mesh structure of the object and the expression parameters and texture parameters corresponding to each expression of the object.
As an embodiment, as shown in fig. 9, an intermediate network is constructed, wherein the intermediate network includes: a plurality of identical neural networks in parallel, and a fusion unit. The intermediate network outputs expression parameters, texture parameters, shape parameters, and fusion shape parameters.
Fig. 10 is a flowchart illustrating a method of rendering a two-dimensional face image using an intermediate network according to an exemplary embodiment of the present disclosure, which includes steps S1001 to S1004, shown in conjunction with fig. 9 and 10.
Step S1001, face images of the same object with different expressions are respectively processed by using a plurality of parallel same neural networks, and expression driving parameters of the face images of the object with different expressions are obtained.
Step S1002, the shape parameters of the facial images of different expressions of the object are processed by the fusion unit, and fusion shape parameters are obtained.
As an embodiment, the fusion unit includes a full link layer.
And step S1003, generating a three-dimensional face mesh structure of the object according to the fusion shape parameters.
Step S1004, according to the three-dimensional face mesh structure of the object and the expression parameters and texture parameters corresponding to each expression of the object, a differentiable renderer is used to generate a two-dimensional facial image corresponding to each expression of the object.
Step S704 is to determine, for each face image, a perceptual loss and/or a pixel-by-pixel loss between the face image and the rendered two-dimensional face image.
Step S705, at least part of the network parameters of the expression driver generation model are updated to minimize perceptual loss and/or pixel-by-pixel loss.
In one embodiment, the perceptual loss and the pixel-by-pixel loss are combined into a single training error, for example as a weighted sum, and at least part of the network parameters of the expression driving generation model are updated to minimize the combined training error.
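A minimal sketch of the weighted combination described above is given below; the loss weights, the L1 choice for the pixel-by-pixel term and the feature extractor argument are assumptions.

import torch
import torch.nn.functional as F

def reconstruction_loss(input_face, rendered_face, feature_extractor,
                        w_perc: float = 0.2, w_pix: float = 1.0):
    # Perceptual loss between input and rendered faces plus a pixel-by-pixel term.
    perc = torch.norm(feature_extractor(input_face) - feature_extractor(rendered_face), p=2)
    pix = F.l1_loss(rendered_face, input_face)      # pixel-by-pixel loss
    return w_pix * pix + w_perc * perc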
The expression driving parameters include expression parameters. Facial images of different expressions of at least one object are generated from one facial image of the object and different expression conditions using the conditional generative adversarial network, and each expression condition is composed of expression action units. The generation of facial images of different expressions of a subject from one facial image of the subject is described earlier in this disclosure. In some embodiments, the expression driving generation model is further trained based on the error between the predicted expression parameters and the real expression parameters.
Fig. 11 shows a flowchart of a training method based on expression parameter errors according to an exemplary embodiment of the present disclosure, and as shown in fig. 11, the method includes steps S1101 to S1103.
In step S1101, for each face image generated based on the expression condition, the face image is processed using an expression-driven generation model, and expression parameters of the face image are obtained.
The face image in the training data is generated based on the method shown in fig. 2.
Step S1102, determining real expression parameters of the facial image according to the expression conditions.
The expression condition is a combination of expression action units. Facial images of different expressions are generated based on expression conditions, and thus the expression conditions can be regarded as real expression parameters of the facial images.
Step S1103, updating at least part of the network parameters of the expression driver generation model to minimize an expression parameter error between the expression parameter and the real expression parameter.
It should be understood that the embodiments of the present disclosure list multiple training errors separately and do not limit how they are combined. The training errors may be merged into one comprehensive error, and the network parameters of the expression driving generation model may be updated based on the comprehensive error. Each error may also be used separately to update the network parameters. Alternatively, some of the errors may be merged into a comprehensive error while one or more of the remaining errors are used separately. On the basis of an expression driving generation model trained with one or more errors, the network parameters may be further updated with one or more other errors to obtain the final expression driving generation model.
An example of the embodiment of the present disclosure is described below with a human face as an example.
Referring to fig. 12, this example includes steps S1 to S7.
S1: and detecting a face area by using a face detection network for the whole picture.
The face detection network may be any network, and this example is not limiting.
As an example, an arbitrary face detection network is used to detect the face region in the whole picture, obtain the specific region of the face in the picture, and cut the face region out of the original image, as shown in fig. 13. Optionally, the face detection algorithm may use, but is not limited to, RetinaFace, YOLOv5-Face, a multi-task cascaded convolutional neural network (MTCNN), and the like.
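The following sketch shows the shape of step S1 only; detect_face() is a placeholder for any of the detectors named above and simply returns a dummy box here, so that the cropping logic can be shown without inventing a specific detector API.

import numpy as np

def detect_face(image: np.ndarray):
    """Placeholder: return one (x1, y1, x2, y2) box from a real face detector."""
    h, w = image.shape[:2]
    return (w // 4, h // 4, 3 * w // 4, 3 * h // 4)   # dummy box for illustration

def crop_face(image: np.ndarray) -> np.ndarray:
    x1, y1, x2, y2 = detect_face(image)
    return image[y1:y2, x1:x2].copy()                 # face region cut from the original

frame = np.zeros((480, 640, 3), dtype=np.uint8)
face = crop_face(frame)                               # fed to the expression pipeline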
S2: and performing expression enhancement by using the conditional countermeasure generation network, and generating images corresponding to different expressions through the expression action unit.
Referring to fig. 3, S2 includes:
The original face image is input into the conditional generative adversarial network, and the expression conditions in the network are set to combinations of expression action units. In the use stage, by changing the input expression action units y_g, image outputs of different expressions can be obtained. The network G_I has an encoder-decoder structure.
In the training phase, the training method of an adversarial generation network is used, including a generator (G_I) and a discriminator (D_I). The generator is used to obtain the image corresponding to an expression, and the discriminator is used to judge whether the generated image is realistic enough. Meanwhile, in order to eliminate the interference of other background information in the image, a mask area m corresponding to the face is obtained using a face segmentation network, and the generated image is then multiplied by m pixel by pixel to obtain the final output image.
After the generated images are obtained, the face perception loss L_idt between the original image and the generated images (see the aforementioned formulas (1) and (2)) and the adversarial generation loss L_adv are calculated. The final optimization function of the model is the balance of L_idt and L_adv, as shown in formula (3).
S3: different images obtained through the conditional countermeasure generating network are sent to a convolutional neural network (expression driving generating model).
The corresponding network model size is selected according to different purposes, and the network finally outputs 100 shape parameters, 50 texture parameters, 43 expression parameters, 6 posture parameters, 3 camera parameters and 27 illumination parameters.
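For illustration, the 229-dimensional output described above can be split into the six parameter groups as sketched below; the order of the groups in the output vector is an assumption.

import torch

sizes = {"shape": 100, "texture": 50, "expression": 43,
         "pose": 6, "camera": 3, "illumination": 27}
output = torch.rand(1, sum(sizes.values()))                    # 229-dimensional network output
chunks = output.split(list(sizes.values()), dim=1)
params = dict(zip(sizes.keys(), chunks))
print({k: v.shape[1] for k, v in params.items()})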
When a plurality of images obtained by passing one image through step S2 are simultaneously fed into the convolutional neural network, the shape output from the network should be kept consistent, and in order to supervise the process, a shape consistency error based on equation (8) is used. The shape consistency error is subjected to gradient back propagation during model training, the shape parameters of the three-dimensional face Mesh structure of the same character are constrained, and finally the accuracy of expression parameter prediction can be further optimized.
Referring to fig. 9, the plurality of expression images I_yg1, I_yg2 and I_yg3 of the same person obtained in S2 are each passed through an encoding network (convolutional neural network) E_C; the encoding network can use, but is not limited to, MobileNet, ResNet, Inception, etc. as a backbone. The encoding network outputs the shape parameter α_i, the expression parameter β_i and the texture parameter c_i corresponding to each image. All shape parameters are fed into a fusion fully connected layer (Fused Fully Connected Layer), which outputs the final shape parameters of the current person.
A corresponding three-dimensional face mesh structure is obtained from the shape parameters. The three-dimensional face mesh structure, expression parameters and texture parameters are fed into a differentiable renderer to obtain a rendered two-dimensional face.
The reconstructed face is obtained by training a neural network on a large amount of data; the training is iteratively optimized according to a cost function until the optimal parameter values of the network model are obtained. The cost function (training error) includes: the perceptual loss between the synthesized face and the original input face, the pixel-by-pixel loss between the synthesized face and the original input face, the parameter loss between the predicted expression parameters and the real expression parameters, the key point loss between the reconstructed three-dimensional face mesh and the real mesh, and the consistency loss (shape consistency error) between the shape parameters predicted for different expression images.
S4: and generating a three-dimensional face model with texture based on the expression parameters, the shape parameters and the texture parameters.
And calculating a corresponding three-dimensional face mesh structure by using the shape parameters, the expression parameters and the like obtained by the calculation in the S3. Further, the three-dimensional face mesh structure and the texture parameters are used to output RGB pixel values corresponding to each vertex, that is, three-dimensional face texture information T.
S5: and obtaining accurate three-dimensional key point coordinates of the original input face image by using a plurality of screening strategies.
Two-dimensional face key points are obtained using dlib, three-dimensional face key points are obtained using the 3D face alignment network (3DFAN) algorithm, and face key points are also generated using the practical facial landmark detector (PFLD) algorithm. Accurate face key point data are then selected using a screening strategy based on majority (Moore) voting.
Specifically, the two-dimensional face key points are predicted with the dlib model, and three-dimensional face key points are predicted with 3DFAN and with PFLD. Because the contour points among the two-dimensional face key points predicted by the dlib algorithm collapse toward the visible contour when the pose changes greatly, which affects the prediction of the face contour, the dlib algorithm is used only to obtain the coordinates of the facial-feature key points. The mean squared error (MSE) between the facial-feature key points predicted by dlib and those predicted by 3DFAN is calculated, and a threshold is defined according to the calculated error.
Furthermore, the coordinated key point screening method applies a majority (Moore) voting algorithm in the screening process: for the three-dimensional face key points predicted by the m models, the key point coordinates that occur most often are obtained by majority voting.
S6: use ofS4, obtaining a three-dimensional face mesh structure to obtain three-dimensional key points of the face corresponding to the designated index (and obtaining two-dimensional key points through camera (camera parameter) mapping), and calculating a face error L landmark As shown in reference formula (7).
Wherein, the weight coefficients of key points corresponding to the left eye, the right eye, the inner lip and the nose are set as 20, and the weights of the key points of the rest parts are set as 1. And during calculation, selecting 68 accurate face key points obtained by screening in the step S6, and obtaining the coordinates of the three-dimensional face key points corresponding to the three-dimensional face mesh structure reconstructed in the step S4.
Determining the eye opening and closing error L according to the loss function shown in the formula (4) eye . Determining mouth opening and closing error L according to loss function shown in equation (5) mouth . Determining the key point reprojection loss L according to the formula (6) re-projection
S7: according to the above process, a trained convolutional neural network can be obtained. And sending the image to be driven into a convolutional neural network, outputting expression parameters, shape parameters and texture parameters of the character in the image, and sending the expression parameters into a drivable rendering engine to obtain an expression real-time driving effect.
And cutting the video to be driven into frames to obtain the image corresponding to each frame. And then, inputting the image into the trained convolutional neural network, outputting expression parameters, shape parameters and texture parameters of the character in the image, combining the expression parameters with a predefined three-dimensional real face model to obtain a real character driving effect, and combining the expression parameters with a predefined cartoon character to obtain a cartoon character driving effect.
The embodiment of the present disclosure further provides an expression driving method, as shown in fig. 14, including step S1401 to step S1403.
In step S1401, a face image to be processed is acquired.
In one embodiment, the video to be driven is cut into frames, and an image corresponding to each frame is obtained as a face image to be processed.
Step S1402, the facial image is processed by the expression driving generation model obtained through training, and expression driving parameters are obtained.
And inputting the image into the trained neural network, and outputting the expression parameters, the shape parameters and the texture parameters of the character in the image.
And step S1403, performing expression driving according to the expression driving parameters.
The expression driving parameters and the predefined three-dimensional real face model are combined to obtain a real character driving effect, and the expression parameters and the predefined cartoon character are combined to obtain a cartoon character driving effect.
The embodiment of the disclosure also provides a device for training the expression driving generation model. As shown in fig. 15, the apparatus for training an expression-driven generative model includes:
an obtaining module 1510, configured to obtain a sample facial image set, where the sample facial image set includes facial images of the same object with different expressions;
the training module 1520, configured to process the facial images in the sample facial image set using the expression driving generation model, to obtain expression driving parameters of the facial images; generate a three-dimensional face mesh structure of the facial image according to the expression driving parameters; determine first face key points of the facial image according to the three-dimensional face mesh structure; acquire second face key points of the facial image, where the second face key points are real face key points; project the first face key points and the second face key points from three-dimensional key points into two-dimensional key points according to a preset projection function; and update at least part of the network parameters of the expression driving generation model to minimize key point errors between the first face key points and the second face key points.
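The "preset projection function" is not spelled out in this excerpt; as an illustration only, a weak-perspective camera mapping of the kind commonly used in 3D face fitting could be written as follows, with the scale and translation standing in for the camera parameters:

```python
import torch

def project_keypoints(points_3d, scale, translation):
    """Weak-perspective projection: drop the depth coordinate, then scale and
    translate. points_3d: (N, 3); scale: scalar; translation: (2,) tensor."""
    return scale * points_3d[:, :2] + translation
```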
The expression driving parameters include shape parameters, and in some embodiments the training module 1520 is further configured to:
for the facial images of the same object with different expressions in the sample facial image set, processing the facial images by using the expression driving generation model to obtain shape parameters of the facial images;
determining a shape consistency error according to shape parameters corresponding to facial images of the same object with different expressions;
updating at least part of network parameters of the expression driven generative model to minimize the shape consistency error.
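The exact form of the shape consistency error is not given in this excerpt; one hedged way to realize it is to penalize the spread of the per-image shape parameters of the same person around their mean:

```python
import torch

def shape_consistency_loss(shape_params):
    """shape_params: (n, d) shape parameters predicted from n images of the same
    person under different expressions; the identity shape should not change."""
    mean_shape = shape_params.mean(dim=0, keepdim=True)
    return ((shape_params - mean_shape) ** 2).mean()
```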
The expression driving parameters include expression parameters; at least one facial image of the object with a different expression is generated based on one facial image of the object and different expression conditions by using a conditional generative adversarial network, and each expression condition is composed of expression action units. In some embodiments, the training module 1520 is further configured to:
processing each face image generated based on the expression condition by using the expression driving generation model to obtain expression parameters of the face image;
determining real expression parameters of the facial image according to the expression conditions;
updating at least part of network parameters of the expression drive generation model to minimize expression parameter errors between the expression parameters and the real expression parameters.
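A sketch of this expression-parameter supervision is shown below; the `au_to_expr` mapping from expression action units to "real" expression parameters is a hypothetical placeholder, since the patent does not specify how the conditions are converted:

```python
import torch

def expression_parameter_loss(pred_expr, au_condition, au_to_expr):
    """pred_expr: (B, E) expression parameters predicted for images synthesized by
    the conditional GAN; au_condition: (B, A) action-unit activations used as the
    generation condition; au_to_expr: callable mapping conditions to target values."""
    real_expr = au_to_expr(au_condition)
    return torch.nn.functional.mse_loss(pred_expr, real_expr)
```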
In some embodiments, the training module 1520 is further configured to:
for each face image, rendering according to the three-dimensional face mesh structure of the face image to obtain a two-dimensional face image corresponding to the face image;
determining a perceptual loss and/or a pixel-by-pixel loss between the facial image and its rendered two-dimensional facial image;
updating at least part of the network parameters of the expression driving generation model to minimize the perceptual loss and/or the pixel-by-pixel loss.
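The patent does not name a backbone for the perceptual term; a sketch using VGG16 features (assuming a recent torchvision) together with an L1 pixel-by-pixel term might look like this:

```python
import torch
import torchvision

# Frozen feature extractor for the perceptual term (backbone choice is an assumption).
_vgg = torchvision.models.vgg16(weights=None).features[:16].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def photometric_losses(real_img, rendered_img, mask=None):
    """real_img, rendered_img: (B, 3, H, W) in [0, 1]; `mask` could restrict both
    terms to the rendered face region."""
    if mask is not None:
        real_img, rendered_img = real_img * mask, rendered_img * mask
    pixel_loss = (real_img - rendered_img).abs().mean()
    perceptual_loss = ((_vgg(real_img) - _vgg(rendered_img)) ** 2).mean()
    return pixel_loss, perceptual_loss
```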
In some embodiments, the second facial keypoints comprise one or more of eye keypoints and mouth keypoints, and the keypoint errors comprise one or more of eye opening and closing errors and mouth opening and closing errors; and the training module 1520 is configured to update at least a portion of the network parameters of the expression driving generation model to minimize one or more of the eye opening and closing error and the mouth opening and closing error.
In some embodiments, the training module 1520 is specifically configured to obtain the second facial keypoints of the facial image as follows:
processing the face image by using a plurality of key point prediction models to obtain a plurality of prediction results corresponding to each face image, wherein each prediction result comprises a prediction coordinate of a second face key point;
and regarding each second face key point as a same coordinate, and regarding the predicted coordinate with the key point spacing smaller than a preset value in a plurality of predicted coordinates corresponding to the second face key point as the coordinate of the second face key point, wherein the predicted coordinate with the largest occurrence frequency is the coordinate of the second face key point.
The expression driving parameters include shape parameters, and the training module 1520 is specifically configured to generate a three-dimensional face mesh structure of the facial image according to the expression driving parameters in the following manner:
for facial images of the same object with different expressions, fusing the shape parameters corresponding to the facial images of the object with different expressions through a fully connected layer to obtain fused shape parameters;
and generating a three-dimensional face mesh structure of the object according to the fused shape parameters, so as to obtain the three-dimensional face mesh structure corresponding to the facial images of the object with different expressions.
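As a hedged sketch of fusing the per-expression shape parameters through a fully connected layer (the layer size and the number of images per person are assumptions introduced here):

```python
import torch
import torch.nn as nn

class ShapeFusion(nn.Module):
    """Fuse the shape parameters predicted from several images of the same person
    with different expressions into one shared shape code."""
    def __init__(self, shape_dim=100, num_images=4):
        super().__init__()
        self.fc = nn.Linear(shape_dim * num_images, shape_dim)

    def forward(self, shape_params):            # shape_params: (num_images, shape_dim)
        return self.fc(shape_params.flatten())  # shared shape code for all expressions
```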
The embodiment of the disclosure also provides an expression driving device. As shown in fig. 16, the expression driving apparatus includes:
an obtaining module 1610 configured to obtain a face image to be processed;
the processing module 1620 is configured to process the facial image by using the expression driving generation model obtained by training in the method of the present disclosure, so as to obtain expression driving parameters;
and the driving module 1630 is configured to perform expression driving according to the expression driving parameter.
An exemplary embodiment of the present disclosure also provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor; when executed by the at least one processor, the computer program causes the electronic device to perform a method according to an embodiment of the present disclosure.
The disclosed exemplary embodiments also provide a non-transitory computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, is adapted to cause the computer to perform a method according to an embodiment of the present disclosure.
The exemplary embodiments of the present disclosure also provide a computer program product comprising a computer program, wherein the computer program, when executed by a processor of a computer, is adapted to cause the computer to perform a method according to an embodiment of the present disclosure.
Referring to fig. 17, a block diagram of the structure of an electronic device 1700, which may be a server or a client of the present disclosure and is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 17, the electronic apparatus 1700 includes a computing unit 1701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1702 or a computer program loaded from a storage unit 1708 into a Random Access Memory (RAM) 1703. In the RAM 1703, various programs and data required for the operation of the device 1700 can also be stored. The computing unit 1701, the ROM 1702, and the RAM 1703 are connected to each other through a bus 1704. An input/output (I/O) interface 1705 is also connected to bus 1704.
Various components in the electronic device 1700 are connected to the I/O interface 1705, including: an input unit 1706, an output unit 1707, a storage unit 1708, and a communication unit 1709. The input unit 1706 may be any type of device capable of inputting information to the electronic device 1700, and the input unit 1706 may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device. Output unit 1707 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. Storage unit 1708 may include, but is not limited to, a magnetic disk, an optical disk. The communication unit 1709 allows the electronic device 1700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver, and/or a chipset, such as a bluetooth device, a WiFi device, a WiMax device, a cellular communication device, and/or the like.
The computing unit 1701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 1701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 1701 executes the various methods and processes described above. For example, in some embodiments, the method of training an expression driving generation model and the expression driving method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 1708. In some embodiments, part or all of a computer program may be loaded and/or installed onto the electronic device 1700 via the ROM 1702 and/or the communication unit 1709. In some embodiments, the computing unit 1701 may be configured in any other suitable manner (e.g., by means of firmware) to perform the method of training the expression driving generation model and the expression driving method.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection according to one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Claims (11)

1. A method of training an expression-driven generative model, comprising:
acquiring a sample face image set, wherein the sample face image set comprises face images of the same object with different expressions;
processing the facial images in the sample facial image set by using the expression driving generation model to obtain expression driving parameters of the facial images;
generating a three-dimensional face mesh structure of the facial image according to the expression driving parameters;
determining a first face key point of the face image according to the three-dimensional face mesh structure;
acquiring a second face key point of the face image, wherein the second face key point is a real face key point;
projecting the first face key points and the second face key points from three-dimensional key points into two-dimensional key points according to a preset projection function;
updating at least a portion of network parameters of the expression driven generative model to minimize keypoint errors between the first facial keypoints and the second facial keypoints.
2. The method of claim 1, wherein the expression driving parameters comprise shape parameters, the method further comprising:
for the facial images of the same object with different expressions in the sample facial image set, processing the facial images by using the expression driving generation model to obtain shape parameters of the facial images;
determining a shape consistency error according to shape parameters corresponding to facial images of the same object with different expressions;
updating at least part of network parameters of the expression driven generative model to minimize the shape consistency error.
3. The method according to claim 1 or 2, wherein the expression driving parameters include expression parameters, at least one facial image of the object with a different expression is generated based on one facial image of the object and different expression conditions by using a conditional generative adversarial network, each expression condition being composed of expression action units,
and wherein the method further comprises:
processing each face image generated based on the expression condition by using the expression driving generation model to obtain expression parameters of the face image;
determining real expression parameters of the facial image according to the expression conditions;
updating at least part of network parameters of the expression driving generation model to minimize expression parameter errors between the expression parameters and the real expression parameters.
4. The method of any of claims 1 to 3, wherein the method further comprises:
for each face image, rendering according to the three-dimensional face mesh structure of the face image to obtain a two-dimensional face image corresponding to the face image;
determining a perceptual loss and/or a pixel-by-pixel loss between the facial image and its rendered two-dimensional facial image;
updating at least part of network parameters of the expression driven generative model to minimize the perceptual loss and/or the pixel-by-pixel loss.
5. The method of claim 1, wherein the second facial keypoints comprise one or more of eye keypoints and mouth keypoints, and the keypoint errors comprise one or more of eye opening and closing errors and mouth opening and closing errors;
and wherein said updating at least part of the network parameters of the expression driven generative model to minimize keypoint errors between the first facial keypoints and the second facial keypoints comprises:
updating at least part of network parameters of the expression driven generative model to minimize one or more of the eye opening and closing error and the mouth opening and closing error.
6. The method of claim 1 or 5, wherein said obtaining second face keypoints for a face image comprises:
processing the face image by using a plurality of key point prediction models to obtain a plurality of prediction results corresponding to each face image, wherein each prediction result comprises a prediction coordinate of a second face key point;
and, for each second face key point, regarding predicted coordinates whose spacing is smaller than a preset value among the plurality of predicted coordinates corresponding to the second face key point as the same coordinate, and taking the predicted coordinate with the largest number of occurrences as the coordinate of the second face key point.
7. The method of claim 1, wherein the expression driving parameters comprise shape parameters,
and wherein the generating a three-dimensional face mesh structure of a facial image according to the expression driving parameters comprises:
for facial images of the same object with different expressions, fusing the shape parameters corresponding to the facial images of the object with different expressions through a fully connected layer to obtain fused shape parameters;
and generating a three-dimensional face mesh structure of the object according to the fused shape parameters, so as to obtain the three-dimensional face mesh structure corresponding to the facial images of the object with different expressions.
8. An expression driving method, comprising:
acquiring a face image to be processed;
processing the facial image using an expression-driven generation model trained using the method of any one of claims 1 to 7, resulting in expression-driven parameters;
and performing expression driving according to the expression driving parameters.
9. An apparatus for training an expression-driven generative model, comprising:
the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring a sample face image set, and the sample face image set comprises face images of the same object with different expressions;
a training module to:
processing the facial images in the sample facial image set by using the expression driving generation model to obtain expression driving parameters of the facial images;
generating a three-dimensional face mesh structure of the face image according to the expression driving parameters;
determining a first face key point of the face image according to the three-dimensional face mesh structure;
acquiring a second face key point of the face image, wherein the second face key point is a real face key point;
projecting the first face key points and the second face key points into two-dimensional key points from three-dimensional key points according to a preset projection function; and
updating at least a portion of network parameters of the expression driven generative model to minimize keypoint errors between the first facial keypoints and the second facial keypoints.
10. An electronic device, comprising:
a processor; and
a memory for storing a program, wherein the program is stored in the memory,
wherein the program comprises instructions which, when executed by the processor, cause the processor to carry out the method according to any one of claims 1-7.
11. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
CN202211228272.8A 2022-10-08 2022-10-08 Method for training expression driving generation model, expression driving method and device Pending CN115512014A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211228272.8A CN115512014A (en) 2022-10-08 2022-10-08 Method for training expression driving generation model, expression driving method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211228272.8A CN115512014A (en) 2022-10-08 2022-10-08 Method for training expression driving generation model, expression driving method and device

Publications (1)

Publication Number Publication Date
CN115512014A true CN115512014A (en) 2022-12-23

Family

ID=84508267

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211228272.8A Pending CN115512014A (en) 2022-10-08 2022-10-08 Method for training expression driving generation model, expression driving method and device

Country Status (1)

Country Link
CN (1) CN115512014A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111311712A (en) * 2020-02-24 2020-06-19 北京百度网讯科技有限公司 Video frame processing method and device
CN111311712B (en) * 2020-02-24 2023-06-16 北京百度网讯科技有限公司 Video frame processing method and device
CN117218499A (en) * 2023-09-29 2023-12-12 北京百度网讯科技有限公司 Training method of facial expression capturing model, facial expression driving method and device

Similar Documents

Publication Publication Date Title
CN113272870A (en) System and method for realistic real-time portrait animation
JP2022547769A (en) Image colorization using machine learning
CN115512014A (en) Method for training expression driving generation model, expression driving method and device
CN113628327B (en) Head three-dimensional reconstruction method and device
US11887235B2 (en) Puppeteering remote avatar by facial expressions
US20230048906A1 (en) Method for reconstructing three-dimensional model, method for training three-dimensional reconstruction model, and apparatus
CN111008927B (en) Face replacement method, storage medium and terminal equipment
CN109961496B (en) Expression driving method and expression driving device
US20230401672A1 (en) Video processing method and apparatus, computer device, and storage medium
CN111832745A (en) Data augmentation method and device and electronic equipment
CN112233212A (en) Portrait editing and composition
CN113689538A (en) Video generation method and device, electronic equipment and storage medium
CN115529835A (en) Neural blending for novel view synthesis
US11451758B1 (en) Systems, methods, and media for colorizing grayscale images
US20220292690A1 (en) Data generation method, data generation apparatus, model generation method, model generation apparatus, and program
GB2612881A (en) Techniques for re-aging faces in images and video frames
CN106909904B (en) Human face obverse method based on learnable deformation field
CN113380269B (en) Video image generation method, apparatus, device, medium, and computer program product
CN114708374A (en) Virtual image generation method and device, electronic equipment and storage medium
CN114399424A (en) Model training method and related equipment
CN113379877A (en) Face video generation method and device, electronic equipment and storage medium
WO2022026603A1 (en) Object recognition neural network training using multiple data sources
US20240013464A1 (en) Multimodal disentanglement for generating virtual human avatars
CN115359166B (en) Image generation method and device, electronic equipment and medium
US11734889B2 (en) Method of gaze estimation with 3D face reconstructing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination