CN111783658A - Two-stage expression animation generation method based on double generation countermeasure network - Google Patents
- Publication number: CN111783658A
- Application number: CN202010621885.2A
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/174 — Facial expression recognition
- G06V40/176 — Dynamic expression
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods
- G06T13/40 — 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
- G06V40/168 — Feature extraction; Face representation
- G06V40/171 — Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to a two-stage expression animation generation method based on a dual generative adversarial network. In the first stage, an expression migration network, faceGAN, extracts the expression features from a target expression contour map and migrates them to the source face, generating a first-stage prediction image. In the second stage, a detail generation network, FineGAN, supplements and enriches the details of the eye and mouth regions, which contribute most to expression change, in the first-stage prediction image, generating a fine-grained second-stage prediction image that is synthesized into a facial video animation; both the expression migration network faceGAN and the detail generation network FineGAN are implemented as generative adversarial networks. The method generates the expression animation with adversarial networks in two stages, performing expression conversion in the first stage and optimizing image details in the second; the designated regions of the image are extracted by mask vectors for targeted optimization, and the combined use of local discriminators further improves the generation of these important parts.
Description
Technical Field
The technical scheme of the invention relates to image data processing in computer vision, and in particular to a two-stage expression animation generation method based on a dual generative adversarial network.
Background
Facial expression synthesis refers to transferring an expression from a target reference face to a source face: the identity information of the newly synthesized source face image remains unchanged, while its expression is kept consistent with the target reference face. The technique is gradually being applied in film and television production, virtual reality, criminal investigation, and other fields. Facial expression synthesis has important research value in both academia and industry, and how to robustly synthesize natural, vivid facial expressions has become a challenging and popular research topic.
Existing facial expression synthesis methods fall into two categories: traditional graphics methods and deep-learning-based image generation methods. Traditional graphics methods generally parameterize the source face image with a parametric model and design a model to perform the expression conversion and generate a new image, or warp the face image using feature correspondences and optical-flow maps to assemble a face patch from existing expression data. However, designing such models is intricate and laborious, incurs a very high computational cost, and generalizes poorly.
The second category is expression synthesis based on deep learning. A deep neural network first extracts facial features, mapping the image from high-dimensional space to a feature vector; the source expression features are then changed by adding an expression label, and a deep neural network synthesizes the target face image by mapping back to high-dimensional space. The appearance of GAN networks then brought the dawn of sharp image synthesis and attracted great attention as soon as they were proposed. In the field of image synthesis, a large number of GAN variants have been introduced to generate images. For example, a conditional generative adversarial network (CGAN) can generate an image under specific supervision information; in facial expression generation, an expression label can serve as the conditional supervision information to generate face images with different expressions. At present, GAN-based methods still have shortcomings: when generating expression animation, they may produce unreasonable artifacts, blurry generated images, or low resolution.
Facial expression generation is image-to-image conversion; the present invention aims to generate facial animation, which is image-to-video conversion and, compared with the facial expression generation task, adds the challenge of the time dimension. Xing et al. use a gender-preserving network in "GP-GAN for Synthesizing Faces from Landmarks" so that the network can learn more gender information, but the method still falls short in preserving face identity information, and the generated face may have identity characteristics different from the target face. CN108288072A discloses a facial expression synthesis method based on a generative adversarial network that does not consider fine-grained generation of the face image and omits the extraction of the source face image's detail features, so its generated results are blurry and low-resolution. CN110084121A discloses a facial expression migration method using a cycle-consistent generative adversarial network based on spectral normalization; it supervises the training process with an expression one-hot vector, whose discreteness limits the learning ability of the network, so that the network can only learn the target emotion category, such as happiness, sadness, or surprise, but not the emotion's intensity, and the method therefore falls short in generating emotion continuously. CN105069830A discloses an expression animation generation method and device that can only generate expression animations from six fixed templates; since human expressions are very rich and complex, the method has poor extensibility and cannot generate an arbitrary specified expression animation according to user requirements.
CN107944358A discloses a face generation method based on a deep convolutional adversarial network model that cannot ensure the invariance of face identity information during the expression generation process, so the generated face may be inconsistent with the target face.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in the first stage, an expression migration network extracts the features of the target expression and migrates them to the source face to generate a first-stage prediction image; this first-stage expression migration network is named FaceGAN (Face Generative Adversarial Network). In the second stage, a detail generation network enriches certain facial details in the first-stage prediction image, generates a fine-grained second-stage prediction image, and synthesizes the video animation; this second-stage detail generation network is named FineGAN (Fine Generative Adversarial Network). The method of the invention overcomes problems of the prior art such as blurry or low-resolution generated images and unreasonable artifacts in the generated results.
The technical scheme adopted by the invention to solve this technical problem is as follows: in the first stage, driven by the target expression contour map, the expression migration network faceGAN captures the expression features in the target expression contour map and migrates them to the source face to generate a first-stage prediction image; in the second stage, the detail generation network FineGAN serves as a supplement, enriching the details of the eye and mouth regions, which contribute relatively most to expression change, in the first-stage prediction image, generating a fine-grained second-stage prediction image, and synthesizing the facial animation. The specific steps are as follows:
firstly, acquiring a facial expression profile of each frame of image in a data set:
collecting a facial expression video sequence data set, using the Dlib machine learning library to extract the face in each frame of a video sequence and simultaneously obtain a set of feature points for each face, and then connecting the feature points in order with line segments to obtain the expression contour map of each frame of the video sequence, recorded as e = (e_1, e_2, ···, e_i, ···, e_n), where e denotes the set of all expression contour maps in a video sequence, i.e. the expression contour map sequence, n denotes the number of video frames, and e_i denotes the expression contour map of the i-th frame of the video sequence;
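The step above can be sketched as follows. The 68-point grouping is Dlib's standard landmark layout, but the closed-curve handling for eyes and mouth is a simplification of ours, not the patent's code, and the landmark values used below are hypothetical:

```python
# Illustrative sketch: building an expression contour map e_i from facial
# feature points by connecting them in order with line segments.

# Index ranges of Dlib's 68-point layout, grouped by facial component.
COMPONENTS = {
    "jaw": range(0, 17),
    "right_brow": range(17, 22),
    "left_brow": range(22, 27),
    "nose": range(27, 36),
    "right_eye": range(36, 42),
    "left_eye": range(42, 48),
    "mouth": range(48, 68),
}

def contour_segments(landmarks):
    """Connect the feature points of each component in order with line
    segments, yielding the polylines that form the expression contour map."""
    segments = []
    for name, idx in COMPONENTS.items():
        pts = [landmarks[i] for i in idx]
        # Treat eyes and mouth as closed curves (our simplification):
        if name in ("right_eye", "left_eye", "mouth"):
            pts.append(pts[0])
        segments.extend(list(zip(pts, pts[1:])))
    return segments

def contour_sequence(landmark_frames):
    """e = (e_1, ..., e_n): one contour map per video frame."""
    return [contour_segments(frame) for frame in landmark_frames]
```

In a real pipeline the `landmarks` would come from Dlib's face detector and shape predictor applied to each video frame.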
the first stage is to build an expression migration network faceGAN, and the method comprises the following steps:
secondly, extracting the identity characteristics of the source face and the expression characteristics of the target expression profile graph, and preliminarily generating a first-stage prediction graph:
the faceshift network faceGAN comprises a generator G1And a discriminator D1Wherein the generator G1Comprising three sub-networks, respectively two encoders Encid and EncexpA decoder Dec1;
Firstly, a neutral expressionless image I_N of the source face and a target expression contour map sequence e are input; the identity encoder Enc_id then extracts the identity feature vector f_id of the neutral expressionless source-face image I_N, while the expression encoder Enc_exp extracts the set of expression feature vectors f_exp of the target expression contour map sequence e, where f_exp = (f_exp_1, f_exp_2, ···, f_exp_i, ···, f_exp_n). In formulas:

f_id = Enc_id(I_N) (1),

f_exp_i = Enc_exp(e_i) (2),

The identity feature vector f_id and the expression feature vector f_exp_i of the i-th frame are concatenated to obtain the feature vector f, i.e. f = f_id + f_exp_i; the feature vector f is fed to the decoder Dec1 for decoding, generating the first-stage prediction image I_pre-target, with I_pre-target = Dec1(f); finally, I_pre-target is input to the discriminator D1 to judge whether the image is real or fake;
and thirdly, taking the prediction image of the first stage as input, and adopting the concept of cycleGAN to reconstruct a neutral image of the source face:
the first-stage prediction image I_pre-target and the expression contour map e_N corresponding to the neutral expressionless image I_N in the second step are used as the input of faceGAN again; the identity encoder Enc_id extracts the identity features of the image I_pre-target, the expression encoder Enc_exp extracts the expression features of the expression contour map e_N, the processing of the second step is repeated, and the decoder generates the reconstructed image I_recon of I_N. The generation of the reconstructed image I_recon is expressed as:

I_recon = Dec1(Enc_id(I_pre-target) + Enc_exp(e_N)) (3);
fourthly, calculating a loss function in the faceGAN of the first-stage expression migration network:
the generator G in the FaceGAN of the first-stage expression migration network1The specific formula of the loss function is as follows:
wherein ,
wherein ,IrealFor the target true value, equation (5) is the penalty of the generator, D1(. represents) the discriminator D1The method comprises the steps that an object is true probability, an SSIM (structure description language) function in a formula (6) is used for measuring similarity between two images, a formula (7) is pixel loss, an MAE (mean square error) function is a mean square error function and is used for measuring a difference between a true value and a predicted value, a formula (8) is sensing loss, sensing characteristics of the images are extracted by using VGG-19, characteristics output by a last convolution layer in a VGG-19 network are used as sensing characteristics of the images, the sensing loss between the images and the true images is calculated and generated by the method, a formula (9) is reconstruction loss, and a neutral expressionless image I of a source face is calculatedNAnd its reconstructed image IreconThe distance between them;
The loss function of the discriminator D1 in the first-stage expression migration network faceGAN is specified by formulas (10) to (12), where formula (11) is the adversarial loss and formula (12) is the adversarial loss of the reconstructed image; λ1 and λ2 are the weight parameters of the similarity loss and the perceptual loss in the faceGAN generator G1, and λ3 is the weight parameter of the reconstructed image's adversarial loss in the faceGAN discriminator loss;
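The original formula images for (4) to (12) do not survive in this text. The following standard forms are a reconstruction consistent with the terms described above; the patent's exact formulations and weight placements may differ. Here φ(·) denotes the last-convolutional-layer features of VGG-19:

```latex
\begin{aligned}
L_{G_1} &= L^{G_1}_{adv} + \lambda_1 L_{ssim} + L_{pix} + \lambda_2 L_{per} + L_{recon} &(4)\\
L^{G_1}_{adv} &= -\log D_1(I_{pre\text{-}target}) &(5)\\
L_{ssim} &= 1 - \mathrm{SSIM}(I_{pre\text{-}target},\, I_{real}) &(6)\\
L_{pix} &= \mathrm{MAE}(I_{pre\text{-}target},\, I_{real}) &(7)\\
L_{per} &= \lVert \phi(I_{pre\text{-}target}) - \phi(I_{real}) \rVert_1 &(8)\\
L_{recon} &= \lVert I_{recon} - I_N \rVert_1 &(9)\\
L_{D_1} &= L^{D_1}_{adv} + \lambda_3 L^{recon}_{adv} &(10)\\
L^{D_1}_{adv} &= -\log D_1(I_{real}) - \log\bigl(1 - D_1(I_{pre\text{-}target})\bigr) &(11)\\
L^{recon}_{adv} &= -\log\bigl(1 - D_1(I_{recon})\bigr) &(12)
\end{aligned}
```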
and building a detail generation network FineGAN of the second stage, wherein the steps from the fifth step to the seventh step are as follows:
fifthly, local mask vectors adaptive to individuals are generated:
using the feature points of each face obtained in the first step, the eye region I_eye and the mouth region I_mouth are extracted, and an eye mask vector M_eye and a mouth mask vector M_mouth are set respectively. Taking the eyes as an example, the eye mask vector M_eye is constructed by setting the pixel values of the eye region in the image to 1 and the pixel values of all other regions to 0; the mouth mask vector M_mouth is formed in the same way as the eye mask vector M_eye;
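A minimal sketch of such an individual-adaptive mask, assuming a simple padded bounding box around hypothetical landmark points (the patent does not specify the exact mask geometry; a real implementation might use a polygon fill over the landmark contour instead):

```python
# Sketch: build the binary eye mask M_eye (1 inside the eye region, 0 elsewhere)
# from landmark points. The bounding-box construction is our assumption.
import numpy as np

def region_mask(shape, points, pad=2):
    """Binary mask: 1 inside the padded bounding box of the points, else 0."""
    mask = np.zeros(shape, dtype=np.float32)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    y0, y1 = max(min(ys) - pad, 0), min(max(ys) + pad, shape[0] - 1)
    x0, x1 = max(min(xs) - pad, 0), min(max(xs) + pad, shape[1] - 1)
    mask[y0:y1 + 1, x0:x1 + 1] = 1.0
    return mask

# Hypothetical eye landmarks (x, y) on a 64x64 image:
eye_points = [(20, 24), (28, 22), (36, 24), (28, 27)]
M_eye = region_mask((64, 64), eye_points)
```

Multiplying an image elementwise by `M_eye` keeps only the eye region, which is exactly how the local losses and local discriminators below consume the mask.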
and sixthly, inputting the prediction graph of the first stage into a network of a second stage to carry out detail optimization:
the detail generation network FineGAN comprises a generator G2 and a discriminator D2, where D2 is formed by a global discriminator D_global and two local discriminators, D_eye and D_mouth;
the first-stage prediction image I_pre-target and the neutral expressionless image I_N from the second step are input to the generator G2, which generates a second-stage prediction image I_target with richer facial detail; the second-stage prediction image I_target is then input to the three discriminators simultaneously: the global discriminator D_global performs global discrimination on I_target, driving it as close as possible to the target real image I_real, while the eye local discriminator D_eye and the mouth local discriminator D_mouth further optimize the eye and mouth regions of I_target to make it more lifelike. The second-stage prediction image I_target is expressed as:

I_target = G2(I_pre-target, I_N) (13);
and seventhly, calculating the loss functions in the second-stage FineGAN:
The loss function of the generator G2 is specified by formulas (14) to (19), where formula (15) is the adversarial loss, including a global adversarial loss and local adversarial losses, and the operator ⊙ is the Hadamard product; formula (16) is the pixel loss; formulas (17) and (18) are the local pixel losses, computing the L1 norm of the pixel difference between a local region of the generated image and the corresponding local region of the real image; formula (19) is the local perceptual loss; the total loss function of the generator G2 is the weighted sum of these loss functions;
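The mask-based local pixel loss of formulas (17) and (18) can be sketched as follows; normalizing by the mask area is our assumption, since the original formulas are not reproduced here:

```python
# Sketch: local pixel loss as the L1 norm of the Hadamard product of the mask
# with the pixel difference between generated and real images.
import numpy as np

def local_pixel_loss(generated, real, mask):
    diff = mask * np.abs(generated - real)   # Hadamard product M ⊙ |I_target - I_real|
    return float(diff.sum() / max(mask.sum(), 1.0))

gen = np.zeros((4, 4))                       # toy "generated" image
real = np.ones((4, 4))                       # toy "real" image
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                         # toy 2x2 local region, e.g. an eye
```

Because the mask zeroes out everything outside the region, gradients from this term only push the generator to improve the masked area, which is the point of the targeted optimization.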
The loss function of the discriminator D2 is specified by formulas (20) to (23), where formula (21) is the adversarial loss of the global discriminator and formulas (22) and (23) are the adversarial losses of the local discriminators; λ4 and λ5 are the weight parameters of the local adversarial losses in the FineGAN generator G2, λ6 and λ7 are the weight parameters of the eye pixel loss L_eye and the mouth pixel loss L_mouth in the FineGAN generator G2, λ8 is the weight parameter of the local perceptual loss in the FineGAN generator G2, and λ9 is the weight parameter of the global adversarial loss in the FineGAN discriminator D2;
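As with the first stage, the formula images for (14) to (23) are missing from this text. The following standard forms are a reconstruction consistent with the description above, and the patent's exact formulations may differ. M_eye and M_mouth are the mask vectors, ⊙ the Hadamard product, and φ(·) the VGG-19 perceptual features:

```latex
\begin{aligned}
L_{G_2} &= L^{G_2}_{adv} + L_{pix} + \lambda_6 L_{eye} + \lambda_7 L_{mouth} + \lambda_8 L^{local}_{per} &(14)\\
L^{G_2}_{adv} &= -\log D_{global}(I_{target}) - \lambda_4 \log D_{eye}(M_{eye} \odot I_{target}) - \lambda_5 \log D_{mouth}(M_{mouth} \odot I_{target}) &(15)\\
L_{pix} &= \lVert I_{target} - I_{real} \rVert_1 &(16)\\
L_{eye} &= \lVert M_{eye} \odot (I_{target} - I_{real}) \rVert_1 &(17)\\
L_{mouth} &= \lVert M_{mouth} \odot (I_{target} - I_{real}) \rVert_1 &(18)\\
L^{local}_{per} &= \textstyle\sum_{r \in \{eye,\,mouth\}} \lVert \phi(M_r \odot I_{target}) - \phi(M_r \odot I_{real}) \rVert_1 &(19)\\
L_{D_2} &= \lambda_9 L^{global}_{adv} + L^{eye}_{adv} + L^{mouth}_{adv} &(20)\\
L^{global}_{adv} &= -\log D_{global}(I_{real}) - \log\bigl(1 - D_{global}(I_{target})\bigr) &(21)\\
L^{eye}_{adv} &= -\log D_{eye}(M_{eye} \odot I_{real}) - \log\bigl(1 - D_{eye}(M_{eye} \odot I_{target})\bigr) &(22)\\
L^{mouth}_{adv} &= -\log D_{mouth}(M_{mouth} \odot I_{real}) - \log\bigl(1 - D_{mouth}(M_{mouth} \odot I_{target})\bigr) &(23)
\end{aligned}
```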
and eighth step, synthesizing a video:
each frame is generated independently; after all n frames (I_target_1, I_target_2, ···, I_target_i, ···, I_target_n) have been generated, the sequence of video frames is synthesized into the final facial animation;
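The independent per-frame generation and final assembly can be sketched as follows, with a stub standing in for the complete two-stage pipeline (faceGAN followed by FineGAN); the actual video encoding step is omitted:

```python
# Sketch: every frame is generated from the same neutral image (no recursion
# on previous frames), then the frames are collected into the animation.

def generate_frame(neutral_image, contour):
    """Stub for the two-stage pipeline: faceGAN then FineGAN."""
    return {"base": neutral_image, "expression": contour}

def synthesize_animation(neutral_image, contour_seq):
    # Because no frame depends on its predecessor, errors cannot propagate
    # forward through the sequence.
    frames = [generate_frame(neutral_image, e_i) for e_i in contour_seq]
    return frames  # to be written out as a video file by an encoder

video = synthesize_animation("I_N", ["e_1", "e_2", "e_3"])
```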
therefore, the generation of the two-stage expression animation based on the double-generation countermeasure network is completed, the expression in the face image is converted, and the image details are optimized.
In particular, the identity encoder Enc_id comprises 4 convolutional blocks, with a CBAM attention module added to the first 3 convolutional blocks; the expression encoder Enc_exp comprises 3 convolutional blocks, with a CBAM attention module added to the last convolutional block; and the decoder Dec1 comprises 4 deconvolutional blocks, with a CBAM attention module added to the first 3 blocks. The encoder and decoder of the network are joined by skip connections: specifically, the layer-1 output of the identity encoder Enc_id is connected to the input of the last layer of the decoder Dec1, the layer-2 output of Enc_id to the input of the second-to-last layer of Dec1, and the layer-3 output of Enc_id to the input of the third-to-last layer of Dec1. The CBAM attention modules let the network focus more on learning the important regions of the image, while the skip connections combine the high and low layers of the network so that it can learn low-level detail information such as facial texture.
In the two-stage expression animation generation method based on the dual generative adversarial network, GAN is the English abbreviation of the generative adversarial network model, in full Generative Adversarial Networks, a well-known algorithm in this technical field, and Dlib is a public library.
The beneficial effects and significant improvements of the invention compared with the prior art are as follows:
(1) compared with CN108288072A, the method of the invention has the advantages that the detail generation network can ensure the fine-grained generation of the human face animation, and two important areas of the mouth and eyes are optimized, so that the generation effect is more vivid and natural.
(2) Compared with CN110084121A, the method of the invention uses the expression contour map to supervise the learning process of the faceGAN network, so that the network can learn the continuous evolution of an expression, learn the degree of emotion, and generate smooth facial animation.
(3) Compared with CN105069830A, the method of the invention has the advantages that the target expression profile is used to guide the expression of the target expression of the network learning, the method is not limited to the type limitation of the expression, and the expression animation of any emotion required by the user can be generated.
(4) Compared with CN107944358A, the method of the invention trains the model with the ring network structure of CycleGAN and adds skip connections in faceGAN to ensure that the identity information of the generated face is consistent with the source face.
(5) According to the method, the global discriminator, the local discriminator and the local loss function (the formula (17) and the formula (18)) are arranged, so that the real degree of the whole generated image can be ensured, and two important areas, namely eyes and a mouth, can be generated in a refined mode.
(6) According to the method, the attention module and the second-stage detail generation network are added into the faceGAN, so that the local detail generation and fine-grained expression of the image are guaranteed.
The prominent substantive features of the invention are:
1) The method generates expression animation with adversarial networks in two stages: the first stage converts the expression and the second stage optimizes the image details. A mask-based local loss function is proposed: the designated region of the image is extracted by a mask vector for targeted optimization, and the combined use of local discriminators makes the generation of the important parts better.
2) In the present application, each frame of the video sequence is generated from the neutral image rather than recursively from preceding frames, which avoids errors in earlier frames propagating to later frames and progressively degrading generation quality. This input mode does require the network to learn the larger change from the neutral expression to other expressions, which increases training difficulty. After the first-stage network generates the prediction image, it is fed into the network again and the source input image is reconstructed using the ring-network concept of CycleGAN; this forces the network to retain identity features without increasing the number of model parameters. The loss functions of the model comprise adversarial loss, SSIM similarity loss, pixel loss, VGG perceptual loss, and reconstruction loss. The second-stage network of the present application comprises a generator, a global discriminator, and two local discriminators, with mask-based local discriminators and local loss functions added.
3) In faceGAN, the method applies the CycleGAN concept: the image after expression conversion is fed into the network again to reconstruct the source face image, forcing the network to keep the identity characteristics of the face and change only the expression. Meanwhile, skip connections in faceGAN fuse the network's high-level and low-level features, so that more facial identity information is learned from the low-level features. The application can thus perform expression conversion without changing the identity information of the face.
4) The invention provides a detail optimization network, FineGAN, which focuses on generating image details and emphatically optimizes the important eye and mouth regions. Proper weights are set to balance the pixel loss and the adversarial loss, and a perceptual loss is added to remove artifacts, so that the generated image contains no unreasonable artifacts and the network produces high-quality, vivid images with rich detail that accord with human vision.
5) The method has a relatively small number of network parameters and low space and time complexity; a single unified network can learn the migration of any expression type and the continuous change of emotional intensity, giving it good application prospects.
Drawings
The invention is further illustrated with reference to the following figures and examples.
FIG. 1 is a schematic block flow diagram of the method of the present invention.
In fig. 2, the odd rows are schematic diagrams of the facial feature points of the method of the present invention, and the even rows are facial expression contour diagrams.
Fig. 3 is a mask diagram of the present invention, wherein the first row is a face region image extracted after preprocessing the original data set, the second and fourth rows are visualizations of an eye mask vector and a mouth mask vector, respectively, and the third and fifth rows are partial region images extracted after applying the eye mask vector and the mouth mask vector to the source image.
FIG. 4 is a graph of the 3 experimental effects of the present invention, wherein the odd rows are the input to the method of the present invention, including a sequence of neutral images of the source face and a silhouette image of the target expression; the even rows are experimental results, i.e., the sequence of video frames that output the expressive animation.
Detailed Description
The embodiment shown in fig. 1 shows that the two-stage expression animation generation method based on the dual generation countermeasure network of the present invention has the following processes:
acquiring a facial expression profile of each frame of image in a data set → extracting the identity characteristic of a source face and the expression characteristic of a target expression profile, preliminarily generating a first-stage prediction image → taking the first-stage prediction image as input, adopting the concept of CycleGAN to reconstruct a neutral image of the source face → calculating a loss function in the first-stage faceGAN → generating a local mask vector adapted to an individual → inputting the first-stage prediction image into a network of a second stage, and carrying out detail optimization → calculating a loss function in the second-stage FineGAN → synthesizing a video.
Example 1
The two-stage expression animation generation method based on the double generation countermeasure network of the embodiment specifically comprises the following steps:
firstly, acquiring a facial expression profile of each frame of image in a data set:
collecting a facial expression video sequence data set, using the Dlib machine learning library to extract the face in each frame of a video sequence and simultaneously obtain 68 feature points per face (in the expression migration field, 68 feature points outline the face contour and the eye, mouth, and nose contours; 5- or 81-point configurations also exist), as shown in the odd rows of FIG. 2; the feature points are then connected in order with line segments to obtain the expression contour map of each frame of the video sequence, as shown in the even rows of FIG. 2, recorded as e = (e_1, e_2, ···, e_i, ···, e_n), where e denotes the set of all facial expression contour maps in a video sequence, n denotes the number of video frames, and e_i denotes the facial expression contour map of the i-th frame of a given video sequence;
The first stage builds the expression migration network FaceGAN, comprising the following steps:
secondly, extracting the identity characteristics of the source face and the expression characteristics of the target expression profile graph, and preliminarily generating a first-stage prediction graph:
FaceGAN includes a generator G1 and a discriminator D1, where the generator G1 comprises three sub-networks: two encoders Enc_id and Enc_exp, and a decoder Dec1;
First, the neutral expressionless image I_N of the source face and the target expression contour map sequence e are input. In this embodiment the input is the neutral face of user S010, and the target contour sequence runs from an expressionless face to a broad smile; the expression contour map extracted from the neutral expressionless image I_N is denoted e_N, with the specific inputs shown in the first row of FIG. 4. The identity encoder Enc_id then extracts the identity feature vector f_id of user S010, while the expression encoder Enc_exp extracts the expression feature vector set f_exp of the target expression contour map sequence, where f_exp = (f_exp_1, f_exp_2, …, f_exp_i, …, f_exp_n), expressed as:
f_id = Enc_id(I_N) (1),
f_exp_i = Enc_exp(e_i) (2),
The identity feature vector f_id and the i-th frame's expression feature vector f_exp_i are concatenated in series to obtain the feature vector f, i.e. f = f_id + f_exp_i. The feature vector f is fed to the decoder Dec1 for decoding, generating the first-stage prediction image I_pre-target, with I_pre-target = Dec1(f). Finally, I_pre-target is input to the discriminator D1 to judge whether the image is real or fake;
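The data flow of the first-stage generator, two encoders whose outputs are concatenated and then decoded, can be sketched with random matrices standing in for the learned sub-networks. All dimensions here (128-d codes, 64×64 grayscale images) are illustrative assumptions, not values from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the trained sub-networks of G1.
W_id  = rng.standard_normal((128, 64 * 64))   # Enc_id
W_exp = rng.standard_normal((128, 64 * 64))   # Enc_exp
W_dec = rng.standard_normal((64 * 64, 256))   # Dec1

def enc_id(img):   return np.tanh(W_id @ img.ravel())    # eq. (1)
def enc_exp(cmap): return np.tanh(W_exp @ cmap.ravel())  # eq. (2)

def generator_g1(i_n, e_i):
    f_id  = enc_id(i_n)                 # identity feature vector
    f_exp = enc_exp(e_i)                # expression feature vector
    f = np.concatenate([f_id, f_exp])   # serial concatenation f = f_id + f_exp_i
    # I_pre-target = Dec1(f)
    return np.tanh(W_dec @ f).reshape(64, 64)

i_n  = rng.random((64, 64))   # neutral source image I_N
e_i  = rng.random((64, 64))   # target expression contour map e_i
i_pre = generator_g1(i_n, e_i)
```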
and thirdly, taking the first-stage prediction image as input and adopting the idea of CycleGAN to reconstruct a neutral image of the source face:
The first-stage prediction image I_pre-target and the expression contour map e_N extracted from the neutral expressionless image I_N in the second step are again used as FaceGAN inputs, and the second step is repeated to generate the reconstructed image I_recon with the neutral expression of user S010; the generation of I_recon is expressed as:
I_recon = Dec1(Enc_id(I_pre-target) + Enc_exp(e_N)) (3);
fourthly, calculating a loss function in the FaceGAN in the first stage:
The loss function of the generator G1 in the first-stage FaceGAN is as follows:
wherein I_real is the target ground-truth value (Ground truth: the source face image bearing the target expression, i.e. the real image corresponding to the model's final prediction), here the real smiling image of user S010; equation (5) is the adversarial loss of the generator; the SSIM (structural similarity) function in equation (6) measures the similarity between two images; equation (7) is the pixel loss, in which the MAE (mean absolute error) function measures the difference between the ground truth and the predicted value; equation (8) is the perceptual loss, using VGG-19 to extract the perceptual features of the images; equation (9) is the reconstruction loss, measuring the distance between the neutral expressionless image I_N and the reconstructed image I_recon; the loss function of the generator G1 is the weighted sum of these loss terms;
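A numeric sketch of the first-stage generator loss terms described above (SSIM similarity, MAE pixel loss, cycle reconstruction). The SSIM here is a single-window simplification of the real sliding-window computation, the VGG-19 perceptual term is omitted, and all weights are placeholder assumptions:

```python
import numpy as np

def mae(a, b):
    """Mean absolute error, the eq. (7)-style pixel loss."""
    return float(np.mean(np.abs(a - b)))

def ssim_global(a, b, c1=0.01**2, c2=0.03**2):
    """Single-window (global) SSIM; real implementations slide a
    Gaussian window over the image and average the local scores."""
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return float(((2 * mu_a * mu_b + c1) * (2 * cov + c2)) /
                 ((mu_a**2 + mu_b**2 + c1) * (va + vb + c2)))

def g1_loss(i_pre, i_real, i_n, i_recon, d_score,
            w_sim=1.0, w_pix=1.0, w_rec=1.0):
    """Weighted sum of the FaceGAN generator loss terms (placeholder
    weights; the VGG-19 perceptual term of eq. (8) is omitted)."""
    l_adv = -np.log(d_score + 1e-8)            # eq. (5): fool D1
    l_sim = 1.0 - ssim_global(i_pre, i_real)   # eq. (6): SSIM similarity
    l_pix = mae(i_pre, i_real)                 # eq. (7): MAE pixel loss
    l_rec = mae(i_n, i_recon)                  # eq. (9): cycle reconstruction
    return float(l_adv + w_sim * l_sim + w_pix * l_pix + w_rec * l_rec)
```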
The loss function of the discriminator D1 in the first-stage FaceGAN is as follows:
wherein equation (11) is the adversarial loss, and equation (12) is the adversarial loss of the reconstructed image;
The identity encoder Enc_id comprises 4 convolution blocks, with a CBAM attention module added to the first 3; the expression encoder Enc_exp comprises 3 convolution blocks, with a CBAM attention module added to the last one; the decoder Dec1 comprises 4 deconvolution blocks, with a CBAM attention module added to the first 3. Skip connections join the high and low layers of the network: the output of layer 1 of the identity encoder Enc_id is connected to the input of the last layer of the decoder Dec1, the output of layer 2 to the input of the second-to-last layer, and the output of layer 3 to the input of the third-to-last layer. The convolution kernel size in this patent is 3 × 3.
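A possible PyTorch sketch of this skip-connected encoder/decoder wiring. The CBAM attention modules and the expression branch are omitted for brevity, and the channel widths are illustrative assumptions, not the patent's values:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """3x3 stride-2 convolution block (CBAM omitted in this sketch)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(c_in, c_out, 3, 2, 1), nn.ReLU())
    def forward(self, x):
        return self.net(x)

class DeconvBlock(nn.Module):
    """Stride-2 transposed-convolution upsampling block."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.net = nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 4, 2, 1), nn.ReLU())
    def forward(self, x):
        return self.net(x)

class G1Sketch(nn.Module):
    """Encoder layer k of Enc_id feeds the decoder's k-th-from-last
    deconv block, as described in the text."""
    def __init__(self):
        super().__init__()
        self.enc = nn.ModuleList([ConvBlock(1, 32), ConvBlock(32, 64),
                                  ConvBlock(64, 128), ConvBlock(128, 256)])
        # decoder inputs widened to accept the concatenated skip features
        self.dec = nn.ModuleList([DeconvBlock(256, 128),
                                  DeconvBlock(128 + 128, 64),
                                  DeconvBlock(64 + 64, 32),
                                  DeconvBlock(32 + 32, 1)])
    def forward(self, x):
        skips = []
        for layer in self.enc:
            x = layer(x)
            skips.append(x)
        x = self.dec[0](x)
        for k, layer in enumerate(self.dec[1:]):
            x = layer(torch.cat([x, skips[2 - k]], dim=1))
        return x
```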
The second stage builds the detail generation network FineGAN, covering the fifth through seventh steps:
fifthly, generating local mask vectors adapted to the individual:
Using the 68 feature points of each face obtained in the first step, the eye region I_eye and the mouth region I_mouth are extracted, and the eye mask vector M_eye and the mouth mask vector M_mouth are constructed separately, as shown in the second and fourth rows of FIG. 3. Taking the eye as an example, M_eye is formed by setting the pixel values of the eye region in the image to 1 and those of all other regions to 0; the mouth mask vector M_mouth is formed in the same way as M_eye;
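Mask construction can be sketched as below; a bounding-box fill over the landmark points is used here as a simple stand-in for a true polygon fill of the eye or mouth region:

```python
import numpy as np

def region_mask(points, size=(256, 256)):
    """Binary mask M: 1 inside the bounding box of the given landmark
    points (x, y), 0 elsewhere. A real implementation would fill the
    landmark polygon (e.g. with cv2.fillPoly) rather than its box."""
    ys, xs = np.mgrid[0:size[0], 0:size[1]]
    pts = np.asarray(points)
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    return ((xs >= x0) & (xs <= x1) & (ys >= y0) & (ys <= y1)).astype(np.uint8)

# two illustrative corner landmarks of an eye region
m_eye = region_mask([(10, 20), (30, 40)])
```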
and sixthly, inputting the prediction graph of the first stage into a network of a second stage to carry out detail optimization:
FineGAN includes a generator G2 and a discriminator D2, where D2 consists of a global discriminator D_global and two local discriminators D_eye and D_mouth;
The first-stage prediction image I_pre-target and the neutral expressionless image I_N from the second step are input to the generator G2, generating the second-stage prediction image I_target, which contains more facial detail of user S010. I_target is then fed simultaneously to the three discriminators: D_global makes a global judgment on the generated I_target so that it is as close as possible to I_real, the real smiling image of user S010, while the eye local discriminator D_eye and the mouth local discriminator D_mouth further emphasize optimization of the eye and mouth regions of I_target, making the generated image I_target more realistic. The formula is:
I_target = G2(I_pre-target, I_N) (13);
and seventhly, calculating the loss functions of the second-stage FineGAN:
The loss function of the generator G2 is as follows:
wherein equation (15) is the adversarial loss, comprising a global adversarial loss and local adversarial losses, the operator ⊙ being the Hadamard product; equation (16) is the pixel loss; equations (17) and (18) are the local pixel losses, computing the L1 norm of the pixel difference between a local region of the generated image and the corresponding region of the real image; equation (19) is the local perceptual loss; the total loss function of the generator is the weighted sum of these loss terms;
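The Hadamard-masked local pixel loss of equations (17) and (18) can be sketched as follows; the adversarial and perceptual terms of the FineGAN objective are omitted in this sketch:

```python
import numpy as np

def local_pixel_loss(i_gen, i_real, mask):
    """L1 norm of the masked pixel difference:
    || M ⊙ I_target - M ⊙ I_real ||_1, where ⊙ is the Hadamard
    (element-wise) product with the binary region mask M."""
    return float(np.abs(mask * i_gen - mask * i_real).sum())

def fine_local_losses(i_gen, i_real, m_eye, m_mouth):
    """The two local pixel terms of eqs. (17)-(18), one per region."""
    return (local_pixel_loss(i_gen, i_real, m_eye),
            local_pixel_loss(i_gen, i_real, m_mouth))
```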
The loss function of the discriminator D2 is as follows:
wherein equation (21) is the adversarial loss of the global discriminator, and equations (22) and (23) are the adversarial losses of the local discriminators;
and eighth step, synthesizing a video:
Each frame is generated independently; once the n frames (I_target_1, I_target_2, …, I_target_i, …, I_target_n) have been generated, i.e. the gradual expression change of user S010 from expressionless to smiling, the video frame sequence is synthesized into the facial animation of user S010, as shown in the second row of FIG. 4;
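Frame assembly can be sketched as follows; actually encoding the sequence to a video file would use e.g. OpenCV's `VideoWriter` or `imageio.mimsave`, omitted here to keep the sketch dependency-free:

```python
import numpy as np

def synthesize_video(frames, fps=25):
    """Stack independently generated frames I_target_1..I_target_n into
    an (n, H, W) clip array and report its duration in seconds; the fps
    value here is an illustrative assumption."""
    clip = np.stack(frames, axis=0)
    duration = len(frames) / fps
    return clip, duration

# 25 dummy frames standing in for the generated I_target sequence
frames = [np.full((64, 64), i, dtype=np.uint8) for i in range(25)]
clip, dur = synthesize_video(frames)
```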
This completes the two-stage expression animation generation based on the dual generative adversarial network: the expression in the face image is converted and the image details are optimized.
In this embodiment, the weight parameter settings involved in the above steps are shown in Table 1, and good results are obtained over the whole sample database.
TABLE 1 Weight parameter settings for each loss in this embodiment
In the two-stage expression animation generation method based on the dual generative adversarial network, the generative adversarial network model is abbreviated GAN, for Generative Adversarial Networks, a well-known algorithm in the technical field.
Figure 4 shows the results of 3 embodiments of the invention. The second row is the generated video frame sequence of user S010 from neutral expression to smiling, the fourth row is the generated sequence of user S022 from neutral expression to a surprised open mouth, and the sixth row is the generated sequence of user S032 from neutral expression to a mouth turned down to the left. Fig. 4 shows that the method of the present invention can complete expression migration while retaining the face identity information, and can generate a continuously gradual video frame sequence to synthesize an animation video with the specified identity and the specified expression.
Nothing in this specification is said to apply to the prior art.
Claims (6)
1. A two-stage expression animation generation method based on a dual generative adversarial network, characterized in that: in the first stage, the expression migration network FaceGAN extracts the expression features from a target expression contour map, migrates them onto the source face, and generates a first-stage prediction image; in the second stage, the detail generation network FineGAN supplements and enriches the details of the eye and mouth regions, which contribute most to expression change, in the first-stage prediction image, generating a fine-grained second-stage prediction image that is synthesized into a facial video animation; both the expression migration network FaceGAN and the detail generation network FineGAN are implemented as generative adversarial networks.
2. The generation method of claim 1, wherein the expression migration network FaceGAN comprises a generator G1 and a discriminator D1, the generator G1 containing three sub-networks: an identity encoder Enc_id, an expression encoder Enc_exp, and a decoder Dec1;
the detail generation network FineGAN comprises a generator G2 and a discriminator D2, D2 consisting of a global discriminator D_global, an eye local discriminator D_eye, and a mouth local discriminator D_mouth.
3. The generation method of claim 1, characterized in that the method comprises the following specific steps:
firstly, acquiring a facial expression profile of each frame of image in a data set:
collecting a facial expression video sequence data set, extracting the face in each frame of each video sequence with the Dlib machine learning library while obtaining a number of feature points for each face, then connecting the feature points in sequence with line segments to obtain the expression contour map of each frame of the video sequence, denoted e = (e_1, e_2, …, e_i, …, e_n), where e represents the set of all expression contour maps in a video sequence, i.e. the expression contour map sequence; n is the number of video frames, and e_i is the expression contour map of the i-th frame;
The first stage builds the expression migration network FaceGAN, comprising the following steps:
secondly, extracting the identity characteristics of the source face and the expression characteristics of the target expression profile graph, and preliminarily generating a first-stage prediction graph:
the expression migration network FaceGAN comprises a generator G1 and a discriminator D1, where the generator G1 contains three sub-networks: two encoders Enc_id and Enc_exp, and a decoder Dec1;
First, the neutral expressionless image I_N of the source face and the target expression contour map sequence e are input; the identity encoder Enc_id extracts the identity feature vector f_id of the neutral expressionless image I_N of the source face, while the expression encoder Enc_exp extracts the expression feature vector set f_exp of the target expression contour map sequence e, where f_exp = (f_exp_1, f_exp_2, …, f_exp_i, …, f_exp_n), expressed as:
f_id = Enc_id(I_N) (1),
f_exp_i = Enc_exp(e_i) (2),
The identity feature vector f_id and the i-th frame's expression feature vector f_exp_i are concatenated in series to obtain the feature vector f, i.e. f = f_id + f_exp_i. The feature vector f is fed to the decoder Dec1 for decoding, generating the first-stage prediction image I_pre-target, with I_pre-target = Dec1(f). Finally, I_pre-target is input to the discriminator D1 to judge whether the image is real or fake;
and thirdly, taking the first-stage prediction image as input and adopting the idea of CycleGAN to reconstruct a neutral image of the source face:
The first-stage prediction image I_pre-target and the expression contour map e_N corresponding to the neutral expressionless image I_N in the second step are again used as FaceGAN inputs: the identity encoder Enc_id extracts the identity feature vector of the image I_pre-target, the expression encoder Enc_exp extracts the expression feature vector of the contour map e_N, the processing of the second step is repeated, and the decoder generates I_recon, the reconstructed image of I_N; the generation of the reconstructed image I_recon is expressed as:
I_recon = Dec1(Enc_id(I_pre-target) + Enc_exp(e_N)) (3);
fourthly, calculating the loss functions of the first-stage expression migration network FaceGAN:
The loss function of the generator G1 in the first-stage expression migration network FaceGAN is as follows:
wherein I_real is the target ground-truth value; equation (5) is the adversarial loss of the generator, D1(·) representing the probability assigned by the discriminator D1 that its input is real; the SSIM (structural similarity) function in equation (6) measures the similarity between two images; equation (7) is the pixel loss, in which the MAE (mean absolute error) function measures the difference between the ground truth and the predicted value; equation (8) is the perceptual loss: perceptual features are extracted with VGG-19, the features output by the last convolution layer of the VGG-19 network serve as the perceptual features of an image, and the perceptual loss between the generated and real images is computed; equation (9) is the reconstruction loss, computing the distance between the neutral expressionless image I_N of the source face and its reconstructed image I_recon;
The loss function of the discriminator D1 in the first-stage expression migration network FaceGAN is as follows:
wherein equation (11) is the adversarial loss and equation (12) is the adversarial loss of the reconstructed image, λ1 and λ2 being the weight parameters of the similarity loss and the perceptual loss in the FaceGAN generator G1 loss, and λ3 the weight parameter of the adversarial loss of the reconstructed image in the FaceGAN discriminator loss;
The second stage builds the detail generation network FineGAN, covering the fifth through seventh steps:
fifthly, generating local mask vectors adapted to the individual:
Using the feature points of each face obtained in the first step, the eye region I_eye and the mouth region I_mouth are extracted, and the eye mask vector M_eye and the mouth mask vector M_mouth are constructed separately; taking the eye as an example, the eye mask vector M_eye is constructed by setting the pixel values of the eye region in the image to 1 and those of all other regions to 0, and the mouth mask vector M_mouth is constructed in the same way as the eye mask vector M_eye;
and sixthly, inputting the prediction graph of the first stage into a network of a second stage to carry out detail optimization:
The detail generation network FineGAN comprises a generator G2 and a discriminator D2, where D2 consists of a global discriminator D_global and two local discriminators D_eye and D_mouth;
The first-stage prediction image I_pre-target and the neutral expressionless image I_N from the second step are input to the generator G2, generating the second-stage prediction image I_target with richer facial detail; the second-stage prediction image I_target is then fed simultaneously to the three discriminators: the global discriminator D_global makes a global judgment on I_target so that it is as close as possible to the target real image I_real, while the eye local discriminator D_eye and the mouth local discriminator D_mouth further optimize the eye and mouth regions of I_target, making the second-stage prediction image I_target more vivid; the second-stage prediction image I_target is expressed as:
I_target = G2(I_pre-target, I_N) (13);
and seventhly, calculating the loss functions of the second-stage FineGAN:
generator G2The specific formula of the loss function is as follows:
wherein ,
equation (15) is a penalty, including a global penalty and a local penalty, operatorIs a Hadamard product, formula (16) is a pixel loss, formula (17) and formula (18) are local pixel losses, an L1 norm of a pixel difference between a local region of the generated image and a local region of the real image is calculated, formula (19) is a local perceptual loss, and generator G is a maximum value of a local perceptual loss2The total loss function is the weighted sum of the loss functions;
The loss function of the discriminator D2 is as follows:
wherein equation (21) is the adversarial loss of the global discriminator and equations (22) and (23) are the adversarial losses of the local discriminators, λ4 and λ5 being the weight parameters of the local adversarial losses in the FineGAN generator G2 loss, λ6 and λ7 the weight parameters of the eye pixel loss and the mouth pixel loss in the FineGAN generator G2 loss, λ8 the weight parameter of the local perceptual loss in the FineGAN generator G2 loss, and λ9 the weight parameter of the global adversarial loss in the FineGAN discriminator D2 loss;
and eighth step, synthesizing a video:
Each frame is generated independently; once the n frames (I_target_1, I_target_2, …, I_target_i, …, I_target_n) have been generated, the video frame sequence is synthesized into the final facial animation;
This completes the two-stage expression animation generation based on the dual generative adversarial network: the expression in the face image is converted and the image details are optimized.
4. The generation method according to claim 2 or 3, characterized in that the identity encoder Enc_id comprises 4 convolution blocks, with a CBAM attention module added to the first 3; the expression encoder Enc_exp comprises 3 convolution blocks, with a CBAM attention module added to the last one; the decoder Dec1 comprises 4 deconvolution blocks, with a CBAM attention module added to the first 3; skip connections simultaneously join the high and low layers of the network: the output of layer 1 of the identity encoder Enc_id is connected to the input of the last layer of the decoder Dec1, the output of layer 2 to the input of the second-to-last layer, and the output of layer 3 to the input of the third-to-last layer.
6. The generation method according to claim 2, wherein the number of feature points obtained for each face in the first step is 68, the 68 feature points constituting the face contour and the eye, mouth and nose contours.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010621885.2A CN111783658B (en) | 2020-07-01 | 2020-07-01 | Two-stage expression animation generation method based on dual-generation reactance network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010621885.2A CN111783658B (en) | 2020-07-01 | 2020-07-01 | Two-stage expression animation generation method based on dual-generation reactance network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111783658A true CN111783658A (en) | 2020-10-16 |
CN111783658B CN111783658B (en) | 2023-08-25 |
Family
ID=72761358
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010621885.2A Active CN111783658B (en) | 2020-07-01 | 2020-07-01 | Two-stage expression animation generation method based on dual-generation reactance network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111783658B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113033288A (en) * | 2021-01-29 | 2021-06-25 | 浙江大学 | Method for generating front face picture based on side face picture for generating confrontation network |
CN113326934A (en) * | 2021-05-31 | 2021-08-31 | 上海哔哩哔哩科技有限公司 | Neural network training method, and method and device for generating images and videos |
CN113343761A (en) * | 2021-05-06 | 2021-09-03 | 武汉理工大学 | Real-time facial expression migration method based on generation confrontation |
CN115100329A (en) * | 2022-06-27 | 2022-09-23 | 太原理工大学 | Multi-mode driving-based emotion controllable facial animation generation method |
CN115311261A (en) * | 2022-10-08 | 2022-11-08 | 石家庄铁道大学 | Method and system for detecting abnormality of cotter pin of suspension device of high-speed railway contact network |
US20230154088A1 (en) * | 2021-11-17 | 2023-05-18 | Adobe Inc. | Disentangling latent representations for image reenactment |
US11875601B2 (en) | 2020-12-24 | 2024-01-16 | Beijing Baidu Netcom Science and Technology Co., Ltd | Meme generation method, electronic device and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002304638A (en) * | 2001-04-03 | 2002-10-18 | Atr Ningen Joho Tsushin Kenkyusho:Kk | Device and method for generating expression animation |
WO2019228317A1 (en) * | 2018-05-28 | 2019-12-05 | 华为技术有限公司 | Face recognition method and device, and computer readable medium |
CN110689480A (en) * | 2019-09-27 | 2020-01-14 | 腾讯科技(深圳)有限公司 | Image transformation method and device |
CN111243066A (en) * | 2020-01-09 | 2020-06-05 | 浙江大学 | Facial expression migration method based on self-supervision learning and confrontation generation mechanism |
-
2020
- 2020-07-01 CN CN202010621885.2A patent/CN111783658B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002304638A (en) * | 2001-04-03 | 2002-10-18 | Atr Ningen Joho Tsushin Kenkyusho:Kk | Device and method for generating expression animation |
WO2019228317A1 (en) * | 2018-05-28 | 2019-12-05 | 华为技术有限公司 | Face recognition method and device, and computer readable medium |
CN110689480A (en) * | 2019-09-27 | 2020-01-14 | 腾讯科技(深圳)有限公司 | Image transformation method and device |
CN111243066A (en) * | 2020-01-09 | 2020-06-05 | 浙江大学 | Facial expression migration method based on self-supervision learning and confrontation generation mechanism |
Non-Patent Citations (1)
Title |
---|
陈军波;刘蓉;刘明;冯杨;: "基于条件生成式对抗网络的面部表情迁移模型", 计算机工程, no. 04 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11875601B2 (en) | 2020-12-24 | 2024-01-16 | Beijing Baidu Netcom Science and Technology Co., Ltd | Meme generation method, electronic device and storage medium |
CN113033288A (en) * | 2021-01-29 | 2021-06-25 | 浙江大学 | Method for generating front face picture based on side face picture for generating confrontation network |
CN113033288B (en) * | 2021-01-29 | 2022-06-24 | 浙江大学 | Method for generating front face picture based on side face picture for generating confrontation network |
CN113343761A (en) * | 2021-05-06 | 2021-09-03 | 武汉理工大学 | Real-time facial expression migration method based on generation confrontation |
CN113326934A (en) * | 2021-05-31 | 2021-08-31 | 上海哔哩哔哩科技有限公司 | Neural network training method, and method and device for generating images and videos |
CN113326934B (en) * | 2021-05-31 | 2024-03-29 | 上海哔哩哔哩科技有限公司 | Training method of neural network, method and device for generating images and videos |
US20230154088A1 (en) * | 2021-11-17 | 2023-05-18 | Adobe Inc. | Disentangling latent representations for image reenactment |
US11900519B2 (en) * | 2021-11-17 | 2024-02-13 | Adobe Inc. | Disentangling latent representations for image reenactment |
CN115100329A (en) * | 2022-06-27 | 2022-09-23 | 太原理工大学 | Multi-mode driving-based emotion controllable facial animation generation method |
CN115311261A (en) * | 2022-10-08 | 2022-11-08 | 石家庄铁道大学 | Method and system for detecting abnormality of cotter pin of suspension device of high-speed railway contact network |
Also Published As
Publication number | Publication date |
---|---|
CN111783658B (en) | 2023-08-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111783658A (en) | Two-stage expression animation generation method based on double generation countermeasure network | |
US11367239B2 (en) | Textured neural avatars | |
CN112149459B (en) | Video saliency object detection model and system based on cross attention mechanism | |
CN111275518A (en) | Video virtual fitting method and device based on mixed optical flow | |
CN111798369A (en) | Face aging image synthesis method for generating confrontation network based on circulation condition | |
CN113807265B (en) | Diversified human face image synthesis method and system | |
JP2022513858A (en) | Data processing methods, data processing equipment, computer programs, and computer equipment for facial image generation | |
CN111612687B (en) | Automatic makeup method for face image | |
CN114581992A (en) | Human face expression synthesis method and system based on pre-training StyleGAN | |
Zhou et al. | Generative adversarial network for text-to-face synthesis and manipulation with pretrained bert model | |
CN116071494A (en) | High-fidelity three-dimensional face reconstruction and generation method based on implicit nerve function | |
CA3180427A1 (en) | Synthesizing sequences of 3d geometries for movement-based performance | |
CN114549387A (en) | Face image highlight removal method based on pseudo label | |
Sun et al. | A unified framework for biphasic facial age translation with noisy-semantic guided generative adversarial networks | |
CN111767842B (en) | Micro-expression type discrimination method based on transfer learning and self-encoder data enhancement | |
Hou et al. | Lifelong age transformation with a deep generative prior | |
Wang et al. | Fine-grained image style transfer with visual transformers | |
CN114565624A (en) | Image processing method for liver focus segmentation based on multi-phase stereo primitive generator | |
CN114627293A (en) | Image matting method based on multi-task learning | |
CN113343761A (en) | Real-time facial expression migration method based on generation confrontation | |
Quan et al. | Facial Animation Using CycleGAN | |
CN116542292B (en) | Training method, device, equipment and storage medium of image generation model | |
CN117036893B (en) | Image fusion method based on local cross-stage and rapid downsampling | |
Sreekala et al. | Human Imitation in Images and Videos using GANs. | |
Bo et al. | Style Transfer Analysis Based on Generative Adversarial Networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||