CN113096206A - Human face generation method, device, equipment and medium based on attention mechanism network


Info

Publication number
CN113096206A
Authority
CN
China
Prior art keywords
expression
image
bidirectional
determining
generated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110277161.5A
Other languages
Chinese (zh)
Other versions
CN113096206B (en)
Inventor
文永明
黄绮恒
成慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202110277161.5A priority Critical patent/CN113096206B/en
Publication of CN113096206A publication Critical patent/CN113096206A/en
Application granted granted Critical
Publication of CN113096206B publication Critical patent/CN113096206B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention discloses a face generation method, device, equipment and medium based on an attention mechanism network, wherein the method comprises the following steps: acquiring a training image and a training expression target domain; determining a mapping relation table according to the relation between the expression basic categories and the activation vectors; determining a mapping rule from the probability vector of the expression basic categories to the activation vector according to the mapping relation table; inputting the training image and the training expression target domain into a bidirectional adversarial network for training to obtain a bidirectional adversarial model; and inputting an initial facial expression image to be generated and an expression basic category probability vector to be generated into the bidirectional adversarial model according to the mapping rule, and determining a continuous target facial expression image. The method can process continuous expression distributions, generate continuous facial expression images, and improve the robustness of the model to changes in background and illumination conditions.

Description

Human face generation method, device, equipment and medium based on attention mechanism network
Technical Field
The invention relates to the technical field of deep learning and image processing, and in particular to a face generation method, device, equipment and medium based on an attention mechanism network.
Background
There are two main approaches to facial expression generation. The first is image deformation based on face modeling, such as the three-dimensional morphable face model (3DMM). The three-dimensional deformation model is built on a three-dimensional face database, uses statistics of face shape and texture as constraints, and takes the influence of face pose and illumination into account, so the generated three-dimensional face model has high precision. A 3DMM can be used to construct a complete facial expression result, enabling functions such as angle conversion and continuous expression conversion, and facilitating model-based operations such as face swapping and expression changing. However, the accuracy depends heavily on the model used, and training such a model imposes relatively high requirements on data acquisition and processing.
The second is the neural-network-based generative model, to which the present invention belongs. A generative model, broadly defined, takes given training data and generates new samples that follow the original data distribution: the training data are drawn from some distribution, the trained model learns that same distribution, and realistic samples can then be drawn from it. Generative models include the pixel-sequence prediction models PixelCNN and PixelRNN, the variational autoencoder (VAE) and the generative adversarial network (GAN). PixelCNN defines a tractable density function and directly optimizes the likelihood of the training data. The VAE defines an intractable density function, models it through latent variables, and generates data resembling real samples by sampling; on top of an autoencoder, the latent vectors of the image encoding are made to follow a Gaussian distribution, image generation is achieved, and the lower bound of the data log-likelihood is optimized. However, generative models based on likelihood estimation of real samples depend on the chosen sample distribution, and generative models using approximate inference cannot reach the optimal solution and can only approach a feasible lower bound of the objective function, so the generated pictures tend to be blurry overall.
The generative adversarial network (GAN) is a generative model based on game theory. A GAN typically consists of two parts, a generator network and a discriminator network. The input to the generator is a random vector z and the output is an image, i.e. a fake sample. The fake samples should approximate the data distribution of the real samples as closely as possible, so that the discriminator cannot tell real samples from fake ones. The input of the discriminator is an image, and the output is a probability value indicating whether the input is a real or a fake sample (a value above 0.5 indicates a real sample, below 0.5 a fake sample). The GAN architecture has been shown to generate realistic, highly detailed images and has been successfully applied to image translation, image super-resolution, indoor scene modeling and more. The generator and the discriminator are trained alternately; during this alternating training, the fake samples produced by the generator approach the distribution of the real data ever more closely.
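Purely as background illustration (not the patent's method), the alternating training described above can be sketched in PyTorch as follows; the generator G, discriminator D, optimizers, the latent size z_dim and the assumption that D outputs probabilities are all placeholders.

import torch
import torch.nn.functional as F

def train_gan_step(G, D, opt_G, opt_D, real_images, z_dim=128):
    """One alternating update of the discriminator and the generator (illustrative sketch)."""
    batch = real_images.size(0)
    device = real_images.device
    ones = torch.ones(batch, 1, device=device)
    zeros = torch.zeros(batch, 1, device=device)

    # Discriminator step: push D(real) towards 1 and D(fake) towards 0.
    z = torch.randn(batch, z_dim, device=device)
    fake = G(z).detach()  # stop gradients flowing into G
    d_loss = F.binary_cross_entropy(D(real_images), ones) + F.binary_cross_entropy(D(fake), zeros)
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Generator step: push D(G(z)) towards 1, so fake samples approach the real data distribution.
    z = torch.randn(batch, z_dim, device=device)
    g_loss = F.binary_cross_entropy(D(G(z)), ones)
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()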
Existing methods based on generative adversarial networks have made remarkable progress in facial expression synthesis. Although these prior-art methods are effective at synthesizing discrete facial expressions, they can only generate a discrete number of expression classes, determined by the annotations in the data set. Because the expression category is treated as a discrete variable, the generated results are also discrete. The prior art cannot generate smoothly transitioning expressions and is not good at handling continuous expression distributions.
Disclosure of Invention
In view of this, embodiments of the present invention provide a face generation method, device, equipment and medium based on an attention mechanism network, so as to process continuous expression distributions and improve the robustness of the model to changes in background and illumination conditions.
In one aspect, the present invention provides a face generation method based on an attention mechanism network, including:
acquiring a training image and a training expression target domain;
determining a mapping relation table according to the relation between the expression basic category and the activation vector;
determining a mapping rule from the probability vector of the expression basic category to the activation vector according to the mapping relation table;
inputting the training image and the training expression target domain into a bidirectional adversarial network for training to obtain a bidirectional adversarial model, wherein the bidirectional adversarial model includes attention and color masks and evaluation of the generated image;
and inputting the initial facial expression image to be generated and the expression basic category probability vector to be generated into the bidirectional adversarial model according to the mapping rule, and determining a continuous target facial expression image.
Optionally, the determining a mapping relationship table according to the relationship between the expression basic category and the activation vector includes:
determining the contraction state of specific facial muscles according to the expression basic categories;
determining different combinations of the activation vectors based on a contraction status of the particular facial muscle;
and determining the mapping relation table according to different combinations of the activation vectors.
Optionally, the inputting of the training image and the training expression target domain into a bidirectional adversarial network for training to obtain a bidirectional adversarial model, wherein the bidirectional adversarial model includes attention and color masks and is used for evaluating a generated image, includes:
determining a color mask for the training image;
determining an attention mask of the training image according to the training expression target domain;
determining the generated image according to the color mask and the attention mask.
Optionally, the inputting of the training image and the training expression target domain into a bidirectional adversarial network for training to obtain a bidirectional adversarial model, wherein the bidirectional adversarial model includes attention and color masks and is used for evaluating a generated image, includes:
obtaining a confidence parameter, and evaluating the generated image according to the confidence parameter;
mapping the generated image to a probability matrix and determining a loss value in combination with a loss function;
and modifying the bidirectional adversarial network according to the loss value to determine the bidirectional adversarial model.
Optionally, the inputting, according to the mapping rule, of an initial facial expression image to be generated and an expression basic category probability vector to be generated into the bidirectional adversarial model to determine a continuous target facial expression image includes:
inputting the probability vector of the expression basic category, and determining an expression target domain to be generated according to the mapping rule;
and inputting the expression target domain to be generated into the bidirectional adversarial model, and outputting the continuous target facial expression image.
Optionally, the inputting of the probability vector of the expression basic category and determining of an expression target domain to be generated according to the mapping rule includes:
determining a set of hyper-parameters according to the mapping rule;
determining the expression target domain to be generated according to an expression target domain generating formula and by combining the hyper-parameter and the probability vector of the expression basic category;
the expression target domain generation formula is as follows:
y_g = F(v) = Σ_i α_i·T(v);
where y_g is the expression target domain to be generated, F is the mapping function, α_i are the hyper-parameters, and T(v) is the corresponding activation vector obtained from the probability vector v of the expression basic category according to the mapping relation table T.
On the other hand, an embodiment of the invention further discloses a face generation device based on the attention mechanism network, which comprises:
the first module is used for acquiring a training image and a training expression target domain;
the second module is used for determining a mapping relation table according to the relation between the expression basic category and the activation vector;
a third module, configured to determine, according to the mapping relationship table, a mapping rule from the probability vector of the expression basic category to the activation vector;
a fourth module, configured to input the training image and the training expression target domain into a bidirectional adversarial network for training to obtain a bidirectional adversarial model, wherein the bidirectional adversarial model includes attention and color masks and is used for evaluating a generated image;
and a fifth module, configured to input the initial facial expression image to be generated and the expression basic category probability vector to be generated into the bidirectional adversarial model according to the mapping rule and determine a continuous target facial expression image.
On the other hand, the embodiment of the invention also discloses an electronic device, which comprises a processor and a memory;
the memory is used for storing programs;
the processor executes the program to implement the method as described above.
On the other hand, the embodiment of the invention also discloses a computer readable storage medium, wherein the storage medium stores a program, and the program is executed by a processor to realize the method.
The embodiment of the invention also discloses a computer program product or a computer program, which comprises computer instructions, and the computer instructions are stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and the computer instructions executed by the processor cause the computer device to perform the foregoing method.
Compared with the prior art, the invention adopting the above technical scheme has the following technical effects: a training image and a training expression target domain are acquired; a mapping relation table is determined according to the relation between the expression basic categories and the activation vectors; a mapping rule from the probability vector of the expression basic categories to the activation vector is determined according to the mapping relation table, so that probability vectors of the expression basic categories can be processed and continuous facial expressions can be generated with smoother transitions; the training image and the training expression target domain are input into a bidirectional adversarial network for training to obtain a bidirectional adversarial model, wherein the bidirectional adversarial model includes attention and color masks and evaluation of the generated image; the initial facial expression image to be generated and the expression basic category probability vector to be generated are input into the bidirectional adversarial model according to the mapping rule, and a continuous target facial expression image is determined; based on the attention mechanism, the robustness of the model to changes in background and illumination conditions can be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a detailed flow chart of an embodiment of the present invention;
FIG. 2 is a flow chart of a training model according to an embodiment of the present invention;
FIG. 3 is a flow chart of generating a facial expression according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating the face generation effect according to an embodiment of the present invention;
fig. 5 is a mapping table created according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Embodiments of the invention provide a face generation method, device, equipment and medium based on an attention mechanism network, so as to process continuous expression distributions and improve the robustness of the model to changes in background and illumination conditions.
The embodiment of the invention discloses a face generation method based on an attention mechanism network, which comprises the following steps of:
acquiring a training image and a training expression target domain;
determining a mapping relation table according to the relation between the expression basic category and the activation vector;
determining a mapping rule from the probability vector of the expression basic category to the activation vector according to the mapping relation table;
inputting the training image and the training expression target domain into a bidirectional adversarial network for training to obtain a bidirectional adversarial model, wherein the bidirectional adversarial model includes attention and color masks and evaluation of the generated image;
and inputting the initial facial expression image to be generated and the expression basic category probability vector to be generated into the bidirectional adversarial model according to the mapping rule, and determining a continuous target facial expression image.
Further as a preferred embodiment, the determining a mapping relationship table according to the relationship between the expression basic category and the activation vector includes:
determining the contraction state of specific facial muscles according to the expression basic categories;
determining different combinations of the activation vectors based on a contraction status of the particular facial muscle;
and determining the mapping relation table according to different combinations of the activation vectors.
Referring to fig. 5, seven expression basic categories are provided, namely disgust, happiness, surprise, fear, anger, contempt and sadness, plus a neutral state. Each expression basic category is related to the contraction state of specific facial muscles: the 0 state indicates that the specific facial muscle is relaxed, and the 1 state indicates that it is contracted. The activation vectors are used to describe the expression basic categories; different combinations of activation vectors represent different contraction states of the specific facial muscles, and the mapping relation table is determined accordingly.
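For illustration only, such a mapping relation table can be represented as a small dictionary from expression basic categories to binary activation vectors; the concrete vectors and vector length below are placeholders, not the values of fig. 5.

# Hypothetical activation vectors: 1 means the specific facial muscle is contracted, 0 means relaxed.
# The real vectors are given by the mapping relation table of fig. 5; these are placeholders.
EXPRESSION_TO_ACTIVATION = {
    "happiness": [0, 0, 1, 0, 1, 0],
    "sadness":   [1, 0, 0, 1, 0, 0],
    "surprise":  [0, 1, 0, 0, 0, 1],
    # ... one entry per expression basic category
}

def activation_vector(category):
    """Look up the activation vector T(category) for one expression basic category."""
    return EXPRESSION_TO_ACTIVATION[category]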
As a further preferred embodiment, the inputting of the training image and the training expression target domain into a bidirectional adversarial network for training to obtain a bidirectional adversarial model, wherein the bidirectional adversarial model includes attention and color masks and is used for evaluating the generated image, includes:
determining a color mask for the training image;
determining an attention mask of the training image according to the training expression target domain;
determining the generated image according to the color mask and the attention mask.
Referring to fig. 2, the training image and the training expression target domain are input into the bidirectional adversarial model. The generation module in the bidirectional adversarial model comprises two generators, namely an attention mask generator and a color mask generator. The attention mask produced by the attention mask generator lets the neural network keep its attention on the parts related to the expression, focusing on the image regions meaningful for synthesizing the new expression, while reducing attention to parts unrelated to expression generation and keeping the rest of the image, such as hair, glasses, a hat or ornaments, unchanged. The attention mask generator renders only the elements related to the expression and focuses on the pixels defining the facial expression, and the color mask is used to handle the illumination conditions and remove the influence of illumination on facial expression generation. According to the training expression target domain, after the training image is processed with the color mask and the attention mask, the size of the activation vector is changed gradually to generate corresponding face images with complex emotions.
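A minimal sketch of this two-generator composition, assuming the common attention-based blending in which the attention mask A decides, pixel by pixel, how much of the input image is kept and how much of the color mask C is rendered; the module names, layer sizes, backbone and blending formula are assumptions, not details taken from the patent.

import torch
import torch.nn as nn

class MaskedGenerator(nn.Module):
    """Illustrative generator that outputs a color mask C and an attention mask A."""
    def __init__(self, backbone: nn.Module, img_channels: int = 3, feat_channels: int = 64):
        super().__init__()
        self.backbone = backbone  # shared encoder-decoder producing feat_channels feature maps (assumed)
        self.color_head = nn.Conv2d(feat_channels, img_channels, 7, padding=3)
        self.attn_head = nn.Conv2d(feat_channels, 1, 7, padding=3)

    def forward(self, image, target_domain):
        # Tile the target-domain activation vector and concatenate it with the image channels.
        cond = target_domain[:, :, None, None].expand(-1, -1, image.size(2), image.size(3))
        feat = self.backbone(torch.cat([image, cond], dim=1))
        color = torch.tanh(self.color_head(feat))      # color mask C
        attn = torch.sigmoid(self.attn_head(feat))     # attention mask A in [0, 1]
        # Keep expression-irrelevant regions from the input; render only expression-relevant pixels.
        generated = attn * image + (1.0 - attn) * color
        return generated, attn, color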
As a further preferred embodiment, the inputting of the training image and the training expression target domain into a bidirectional adversarial network for training to obtain a bidirectional adversarial model, wherein the bidirectional adversarial model includes attention and color masks and is used for evaluating the generated image, includes:
obtaining a confidence parameter, and evaluating the generated image according to the confidence parameter;
mapping the generated image to a probability matrix and determining a loss value in combination with a loss function;
and modifying the bidirectional adversarial network according to the loss value to determine the bidirectional adversarial model.
Referring to fig. 2, after the image is generated in the previous step, it is evaluated. The evaluation module in the bidirectional adversarial model comprises a conditional discriminator used during training to evaluate the fidelity of the generated image and the degree to which the expected expression is achieved. The conditional discriminator maps the generated image to a probability matrix Y_I whose dimensions are determined by the size of the generated image, where Y_I[i, j] represents the probability that the overlapping patch [i, j] is real, H is the height of the generated image and W is its width. In addition, to evaluate the expression constraint, a confidence parameter for the activation values is added on top of the conditional discriminator to estimate the activation vector values in the image, ŷ = (y_1, y_2, …, y_N)^T, where N denotes the number of activation vectors used and T denotes the transpose operation.
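For illustration only, such a conditional discriminator can be sketched as a PatchGAN-style critic with an auxiliary regression head; the layer sizes, the class name ConditionalDiscriminator and the use of PyTorch are assumptions, not details taken from the patent.

import torch
import torch.nn as nn

class ConditionalDiscriminator(nn.Module):
    """Illustrative critic: patch realism map Y_I plus an activation-vector regression head."""
    def __init__(self, n_activations: int, base_channels: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, base_channels, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base_channels, base_channels * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base_channels * 2, base_channels * 4, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        self.patch_head = nn.Conv2d(base_channels * 4, 1, 3, padding=1)  # Y_I[i, j]: patch realism score
        self.act_head = nn.Sequential(                                   # estimate of the activation vector
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(base_channels * 4, n_activations)
        )

    def forward(self, image):
        feat = self.features(image)
        return self.patch_head(feat), self.act_head(feat)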
The overall loss function combines four loss terms: an image adversarial loss L_I, an attention loss L_A, a conditional expression loss L_y and an identity loss L_idt.
Image adversarial loss term L_I:
L_I = E[D_I(G(I_{y_o} | y_f))] - E[D_I(I_{y_o})] + λ_gp·L_gp,
where D_I represents the image discriminator, G represents the generator, I_{y_o} represents the input original face image, y_f represents the target expression domain, λ_gp represents the gradient penalty factor, E[·] denotes the expectation, G(I_{y_o} | y_f) represents the corresponding facial expression image produced by the generator, and L_gp is the gradient penalty term. The effect of this term is to push the distribution of the generated images towards the distribution of the training images, i.e. to make the generated images look more realistic. The loss is based on WGAN, since the original GAN is hard to train with the JS divergence and is prone to gradient vanishing or gradient explosion. The meaning of the loss is to maximize the discriminator output on the generated image and minimize it on the original image, with a gradient penalty term added to keep the gradient within a certain range.
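A sketch of a critic loss of this kind, following the standard WGAN-GP recipe (interpolated samples, unit-gradient-norm penalty) and assuming the discriminator of the previous sketch; the function name and the default penalty weight are placeholders rather than the patent's exact formulation.

import torch

def wgan_gp_critic_loss(D, real_img, fake_img, lambda_gp=10.0):
    """Standard WGAN-GP critic loss; D returns (patch score map, activation estimate)."""
    score_real, _ = D(real_img)
    score_fake, _ = D(fake_img)
    adv = score_fake.mean() - score_real.mean()

    # Gradient penalty on random interpolations between the real and the generated image.
    eps = torch.rand(real_img.size(0), 1, 1, 1, device=real_img.device)
    interp = (eps * real_img + (1 - eps) * fake_img).requires_grad_(True)
    score_interp, _ = D(interp)
    grads = torch.autograd.grad(outputs=score_interp.sum(), inputs=interp, create_graph=True)[0]
    penalty = ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    return adv + lambda_gp * penalty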
Attention loss term L_A:
L_A = λ_TV·E[Σ_{i,j} ((A_{i+1,j} - A_{i,j})^2 + (A_{i,j+1} - A_{i,j})^2)] + E[‖A‖_2],
where λ_TV represents a hyper-parameter, E[·] denotes the expectation, y_o represents the original expression domain, H and W represent the height and width of the image, and A represents the attention mask. Since the data set has no ground-truth values for the attention mask, the attention mask easily becomes oversaturated, i.e. all values tend to 1. The first term of the attention loss is a total-variation loss, originally used for image smoothing, and the second term is an L2 penalty.
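A minimal sketch of this attention regularization, assuming the usual total-variation smoothness term plus an L2 penalty that keeps the mask from saturating; the weighting value and function name are placeholders.

import torch

def attention_loss(attn_mask, lambda_tv=1e-4):
    """attn_mask: (B, 1, H, W) attention mask A produced by the generator."""
    # Total-variation term: penalize differences between neighbouring mask values (smoothing).
    tv = ((attn_mask[:, :, 1:, :] - attn_mask[:, :, :-1, :]) ** 2).mean() + \
         ((attn_mask[:, :, :, 1:] - attn_mask[:, :, :, :-1]) ** 2).mean()
    # L2 penalty: discourage the mask from saturating towards all ones.
    l2 = (attn_mask ** 2).mean()
    return lambda_tv * tv + l2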
Conditional expression loss term L_y: the original image and the generated image are each input into the discriminator, and the loss between the estimated expression (activation) vector and its ground-truth expression vector is computed for each.
Identity loss term L_idt: it brings the output of the second generator closer to the original image, ensuring that the face carrying the generated expression and the original image belong to the same person.
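Hedged sketches of these last two terms under a common interpretation: the conditional expression loss regresses the discriminator's activation estimates towards the target and original activation vectors, and the identity loss translates the image to the target expression and back and compares it with the original; the L1 reconstruction, MSE regression and the generator/discriminator of the earlier sketches are assumptions.

import torch
import torch.nn.functional as F

def conditional_expression_loss(D, real_img, fake_img, y_origin, y_target):
    """Regress the discriminator's activation estimates towards their ground-truth vectors."""
    _, y_hat_real = D(real_img)
    _, y_hat_fake = D(fake_img)
    return F.mse_loss(y_hat_real, y_origin) + F.mse_loss(y_hat_fake, y_target)

def identity_loss(G, real_img, y_origin, y_target):
    """Translate to the target expression and back; the result should match the original person."""
    fake_img, _, _ = G(real_img, y_target)
    reconstructed, _, _ = G(fake_img, y_origin)
    return F.l1_loss(reconstructed, real_img)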
As a further preferred embodiment, the determining of a continuous target facial expression image by inputting an initial facial expression image to be generated and an expression basic category probability vector to be generated into the bidirectional adversarial model according to the mapping rule includes:
inputting the probability vector of the expression basic category, and determining an expression target domain to be generated according to the mapping rule;
and inputting the expression target domain to be generated into the bidirectional adversarial model, and outputting the continuous target facial expression image.
Referring to fig. 3, the probability vector of the expression basic category is a vector of the percentage that each expression category contributes to the facial expression, and the expression target domains to be generated are obtained in one-to-one correspondence according to the mapping rule. The expression target domain to be generated is input into the bidirectional adversarial model trained as described above, combined with the facial image to be generated, the image is synthesized by the attention mask generator and the color mask generator, and the continuous target facial expression image is output. Referring to fig. 4, fig. 4 shows the generation effect for a complex face with 20% disgust, 50% anger and 30% sadness.
Further as a preferred embodiment, the inputting the probability vector of the expression basic category and determining an expression target domain to be generated according to the mapping rule includes:
determining a set of hyper-parameters according to the mapping rule;
determining the expression target domain to be generated according to an expression target domain generating formula and by combining the hyper-parameter and the probability vector of the expression basic category;
the expression target domain generation formula is as follows:
y_g = F(v) = Σ_i α_i·T(v);
where y_g is the expression target domain to be generated, F is the mapping function, α_i are the hyper-parameters, and T(v) is the corresponding activation vector obtained from the probability vector v of the expression basic category according to the mapping relation table T.
Referring to fig. 5, a set of hyper-parameters α = {α_1, α_2, α_3, …, α_i} is set to adjust the size of the activation vector corresponding to each expression category probability, the expression category probability vector v = {v_1, v_2, v_3, …, v_7} is varied, fig. 5 shows the mapping relation table T, and the corresponding expression target domain is finally obtained according to the expression target domain generation formula.
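As a purely numerical illustration of the generation formula y_g = Σ_i α_i·T(v), one plausible reading is that each category's activation vector, looked up from the mapping table, is scaled by its hyper-parameter and its probability and the results are summed; the table values, the α values and this reading itself are assumptions, not the contents of fig. 5.

import numpy as np

# Placeholder mapping table T: one activation vector per expression basic category (not the fig. 5 values).
T = {
    "disgust": np.array([1, 0, 0, 1, 0, 0], dtype=float),
    "anger":   np.array([0, 1, 1, 0, 0, 0], dtype=float),
    "sadness": np.array([0, 0, 1, 0, 1, 1], dtype=float),
}
alpha = {"disgust": 1.0, "anger": 1.0, "sadness": 1.0}  # hyper-parameters scaling each contribution

def target_domain(prob_vector):
    """y_g = sum over categories of alpha_i * probability_i * T_i (one plausible reading of F(v))."""
    y_g = np.zeros_like(next(iter(T.values())))
    for category, p in prob_vector.items():
        y_g += alpha[category] * p * T[category]
    return y_g

# Example: the 20% disgust, 50% anger, 30% sadness mix shown in fig. 4.
y_g = target_domain({"disgust": 0.2, "anger": 0.5, "sadness": 0.3})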
The specific operation flow of the embodiment of the present invention is further described below with reference to fig. 1: the embodiment of the invention constructs a mapping relation table for the relation between the seven expression basic categories and the activation vectors, designs a mapping rule from the probability vectors of the expression basic categories to the activation vectors, trains an attention-mechanism-based bidirectional adversarial model in an unsupervised manner, processes probability vectors of continuous expression categories and images through the attention-based model, and outputs continuous facial expression pictures containing the corresponding expression components.
The embodiment of the invention also discloses a face generation device based on the attention mechanism, which comprises:
the first module is used for acquiring a training image and a training expression target domain;
the second module is used for determining a mapping relation table according to the relation between the expression basic category and the activation vector;
a third module, configured to determine, according to the mapping relationship table, a mapping rule from the probability vector of the expression basic category to the activation vector;
a fourth module, configured to input the training image and the training expression target domain into a bidirectional adversarial network for training to obtain a bidirectional adversarial model, wherein the bidirectional adversarial model includes attention and color masks and is used for evaluating a generated image;
and a fifth module, configured to input the initial facial expression image to be generated and the expression basic category probability vector to be generated into the bidirectional adversarial model according to the mapping rule and determine a continuous target facial expression image.
Corresponding to the method of fig. 1, an embodiment of the present invention further provides an electronic device, including a processor and a memory; the memory is used for storing programs; the processor executes the program to implement the method as described above.
Corresponding to the method of fig. 1, the embodiment of the present invention also provides a computer-readable storage medium, which stores a program, and the program is executed by a processor to implement the method as described above.
The embodiment of the invention also discloses a computer program product or a computer program, which comprises computer instructions, and the computer instructions are stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and executed by the processor to cause the computer device to perform the method illustrated in fig. 1.
In the prior art, methods based on generative adversarial networks can only effectively synthesize discrete facial expressions, generating only a discrete number of expression classes determined by the annotations of the data set. Because the expression category is treated as a discrete variable, the generated results are also discrete. The prior art cannot generate smoothly transitioning expressions and is not good at handling continuous expression distributions.
In summary, the face generation method, apparatus, device and medium based on attention mechanism network of the present invention have the following advantages:
1) in the generation of the facial expression, a mapping rule from the probability vector of the expression basic category to the activation vector is designed, so that the probability vector of continuous expression basic categories is processed, and continuous facial expression pictures containing corresponding expression components are generated.
2) When the probability vectors of the expression basic categories are processed, the attention mechanism principle is used in the generator: the neural network's attention to the parts relevant to expression generation is maintained while its attention to the irrelevant parts is reduced, so the generator focuses only on generating the continuous new expression and keeps the other elements unchanged, which improves the robustness of the model to changes in background and illumination conditions.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the described functions and/or features may be integrated in a single physical device and/or software module, or one or more functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (9)

1. A face generation method based on an attention mechanism network is characterized by comprising the following steps:
acquiring a training image and a training expression target domain;
determining a mapping relation table according to the relation between the expression basic category and the activation vector;
determining a mapping rule from the probability vector of the expression basic category to the activation vector according to the mapping relation table;
inputting the training image and the training expression target domain into a bidirectional adversarial network for training to obtain a bidirectional adversarial model, wherein the bidirectional adversarial model includes attention and color masks and is used for evaluating a generated image;
and inputting the initial facial expression image to be generated and the expression basic category probability vector to be generated into the bidirectional adversarial model according to the mapping rule, and determining a continuous target facial expression image.
2. The method of claim 1, wherein the determining a mapping relationship table according to the relationship between the expression basic categories and the activation vectors comprises:
determining the contraction state of specific facial muscles according to the expression basic categories;
determining different combinations of the activation vectors based on a contraction status of the particular facial muscle;
and determining the mapping relation table according to different combinations of the activation vectors.
3. The method of claim 1, wherein the inputting of the training image and the training expression target domain into a bidirectional adversarial network for training to obtain a bidirectional adversarial model, wherein the bidirectional adversarial model includes attention and color masks and is used for evaluating a generated image, comprises:
determining a color mask for the training image;
determining an attention mask of the training image according to the training expression target domain;
determining the generated image according to the color mask and the attention mask.
4. The method for generating a human face based on an attention mechanism network according to claim 1 or 3, wherein the inputting of the training image and the training expression target domain into a bidirectional adversarial network for training to obtain a bidirectional adversarial model, wherein the bidirectional adversarial model includes attention and color masks and is used for evaluating a generated image, comprises:
obtaining a confidence parameter, and evaluating the generated image according to the confidence parameter;
mapping the generated image to a probability matrix and determining a loss value in combination with a loss function;
and modifying the bidirectional adversarial network according to the loss value to determine the bidirectional adversarial model.
5. The method of claim 1, wherein the inputting of an initial facial expression image to be generated and an expression basic category probability vector to be generated into the bidirectional adversarial model according to the mapping rule to determine a continuous target facial expression image comprises:
inputting the probability vector of the expression basic category, and determining an expression target domain to be generated according to the mapping rule;
and inputting the expression target domain to be generated into the bidirectional adversarial model, and outputting the continuous target facial expression image.
6. The method as claimed in claim 5, wherein the inputting the probability vector of the expression basic category and determining the expression target domain to be generated according to the mapping rule comprises:
determining a set of hyper-parameters according to the mapping rule;
determining the expression target domain to be generated according to an expression target domain generating formula and by combining the hyper-parameter and the probability vector of the expression basic category;
the expression target domain generation formula is as follows:
y_g = F(v) = Σ_i α_i·T(v);
where y_g is the expression target domain to be generated, F is the mapping function, α_i are the hyper-parameters, and T(v) is the corresponding activation vector obtained from the probability vector v of the expression basic category according to the mapping relation table T.
7. A human face generation apparatus based on attention mechanism network, comprising:
the first module is used for acquiring a training image and a training expression target domain;
the second module is used for determining a mapping relation table according to the relation between the expression basic category and the activation vector;
a third module, configured to determine, according to the mapping relationship table, a mapping rule from the probability vector of the expression basic category to the activation vector;
a fourth module, configured to input the training image and the training expression target domain into a bidirectional adversarial network for training to obtain a bidirectional adversarial model, wherein the bidirectional adversarial model includes attention and color masks and is used for evaluating a generated image;
and a fifth module, configured to input the initial facial expression image to be generated and the expression basic category probability vector to be generated into the bidirectional adversarial model according to the mapping rule and determine a continuous target facial expression image.
8. An electronic device comprising a processor and a memory;
the memory is used for storing programs;
the processor executing the program realizes the method of any one of claims 1-6.
9. A computer-readable storage medium, characterized in that the storage medium stores a program, which is executed by a processor to implement the method according to any one of claims 1-6.
CN202110277161.5A 2021-03-15 2021-03-15 Human face generation method, device, equipment and medium based on attention mechanism network Active CN113096206B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110277161.5A CN113096206B (en) 2021-03-15 2021-03-15 Human face generation method, device, equipment and medium based on attention mechanism network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110277161.5A CN113096206B (en) 2021-03-15 2021-03-15 Human face generation method, device, equipment and medium based on attention mechanism network

Publications (2)

Publication Number Publication Date
CN113096206A 2021-07-09
CN113096206B (en) 2022-09-23

Family

ID=76667410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110277161.5A Active CN113096206B (en) 2021-03-15 2021-03-15 Human face generation method, device, equipment and medium based on attention mechanism network

Country Status (1)

Country Link
CN (1) CN113096206B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109923559A (en) * 2016-11-04 2019-06-21 易享信息技术有限公司 Quasi- Recognition with Recurrent Neural Network
US20200218937A1 (en) * 2019-01-03 2020-07-09 International Business Machines Corporation Generative adversarial network employed for decentralized and confidential ai training
CN111028305A (en) * 2019-10-18 2020-04-17 平安科技(深圳)有限公司 Expression generation method, device, equipment and storage medium
CN111027425A (en) * 2019-11-28 2020-04-17 深圳市木愚科技有限公司 Intelligent expression synthesis feedback interaction system and method
CN111652121A (en) * 2020-06-01 2020-09-11 腾讯科技(深圳)有限公司 Training method of expression migration model, and expression migration method and device
CN112287858A (en) * 2020-11-03 2021-01-29 北京享云智汇科技有限公司 Recognition method and system for generating confrontation network expression

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ALBERT PUMAROLA et al.: "GANimation: Anatomically-Aware Facial Animation from a Single Image", 15th European Conference on Computer Vision (ECCV 2018): Computer Vision - ECCV 2018 *
GU TIANCHENG (顾天成): "Research on Expression Generation with Generative Adversarial Networks", China Excellent Master's and Doctoral Dissertations Full-text Database (Master's), Information Science and Technology Series *

Also Published As

Publication number Publication date
CN113096206B (en) 2022-09-23

Similar Documents

Publication Publication Date Title
Zhang et al. Stackgan++: Realistic image synthesis with stacked generative adversarial networks
Li et al. Controllable text-to-image generation
Goodfellow Nips 2016 tutorial: Generative adversarial networks
Habibie et al. A recurrent variational autoencoder for human motion synthesis
Turhan et al. Recent trends in deep generative models: a review
Reed et al. Deep visual analogy-making
Taylor et al. Two Distributed-State Models For Generating High-Dimensional Time Series.
CN109166144B (en) Image depth estimation method based on generation countermeasure network
CN110728219A (en) 3D face generation method based on multi-column multi-scale graph convolution neural network
KR102602112B1 (en) Data processing method, device, and medium for generating facial images
Yu et al. A video, text, and speech-driven realistic 3-D virtual head for human–machine interface
Ververas et al. Slidergan: Synthesizing expressive face images by sliding 3d blendshape parameters
CN114330736A (en) Latent variable generative model with noise contrast prior
Yang et al. Multiscale mesh deformation component analysis with attention-based autoencoders
Taylor Composable, distributed-state models for high-dimensional time series
CN114494543A (en) Action generation method and related device, electronic equipment and storage medium
CN116385667B (en) Reconstruction method of three-dimensional model, training method and device of texture reconstruction model
Liu et al. Palm up: Playing in the latent manifold for unsupervised pretraining
CN113096206B (en) Human face generation method, device, equipment and medium based on attention mechanism network
Jang et al. Observational learning algorithm for an ensemble of neural networks
Guo et al. Optimizing latent distributions for non-adversarial generative networks
Zhang et al. On Open-Set, High-Fidelity and Identity-Specific Face Transformation
Rakesh et al. Generative Adversarial Network: Concepts, Variants, and Applications
Ardino Exploring Deep generative models for Structured Object Generation and Complex Scenes Manipulation
Lu et al. Cdvae: Co-embedding deep variational auto encoder for conditional variational generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant