WO2021254499A1 - Editing model generation method and apparatus, face image editing method and apparatus, device, and medium - Google Patents


Info

Publication number
WO2021254499A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
generator
training
discriminator
editing
Application number
PCT/CN2021/101007
Other languages
French (fr)
Chinese (zh)
Inventor
吴臻志
祝夭龙
Original Assignee
北京灵汐科技有限公司
Application filed by 北京灵汐科技有限公司
Publication of WO2021254499A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 - 2D [Two Dimensional] image generation
    • G06T11/001 - Texturing; Colouring; Generation of texture or colour
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning

Definitions

  • the embodiment of the present invention relates to the field of artificial intelligence, in particular to the editing of face images and the generation of corresponding editing models.
  • a Generative Adversarial Network (GAN) including a generator and a discriminator can be used to generate realistic face images.
  • the generator is used to generate the face image
  • the discriminator is used to determine whether the generated face image is real or fake.
  • training the generative adversarial network actually means training the generator and the discriminator within it.
  • compared with the generator, the discriminator usually finishes training faster. Soon after training begins, the discriminator can be trained to judge the authenticity of generated images relatively accurately, but at that point the generator has not yet learned how to generate realistic images, so none of the images it generates can pass the discriminator's judgment, causing the training of the entire generative adversarial network to fail.
  • when training fails, the authenticity of images generated by the trained generative adversarial network cannot be guaranteed, nor can the editing effect of an image editing model built on that network.
  • the embodiments of the present invention provide an editing model generation method, a face image editing method, and corresponding apparatuses, devices, and media, which can improve the training consistency of the generator and the discriminator and improve the authenticity of generated images.
  • an embodiment of the present invention provides an editing model generation method, including: performing iterative training on a generative adversarial network, the network including a generator and a discriminator; in the iterative training, updating the generative adversarial network according to gradient update configuration information of the discriminator, wherein the gradient update configuration information is determined by a Lipschitz constraint condition; and, when it is determined that the generative adversarial network satisfies a training end condition, generating an image editing model from the generator in the trained generative adversarial network.
  • an embodiment of the present invention provides a face image editing method, including: acquiring a face image to be edited; and inputting the face image to be edited into an image editing model to obtain the edited face image output by the image editing model.
  • the image editing model is generated by the above-mentioned editing model generation method.
  • an embodiment of the present invention also provides an editing model generation device, including: a network training module for iteratively training a generative adversarial network, the network including a generator and a discriminator; a network update module for updating the generative adversarial network according to gradient update configuration information of the discriminator during the iterative training, the gradient update configuration information being determined by a Lipschitz constraint condition; and a model generation module for generating an image editing model from the generator in the trained generative adversarial network when it is determined that the network satisfies the training end condition.
  • an embodiment of the present invention also provides a face image editing device, including: a face image acquisition module for acquiring a face image to be edited; and a face image editing module for inputting the face image to be edited into an image editing model to obtain the edited face image output by the image editing model, wherein the image editing model is generated by the above-mentioned editing model generation method.
  • an embodiment of the present invention also provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein, when the processor executes the program, the editing model generation method or the face image editing method according to any embodiment of the present invention is implemented.
  • an embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the editing model generation method or the face image editing method according to any embodiment of the present invention is implemented.
  • when training a generative adversarial network including a generator and a discriminator, the embodiments of the present invention limit the learning rate of the discriminator's parameter items according to a Lipschitz constraint condition, so as to slow down the training of the discriminator and improve the training consistency of the discriminator and the generator. This ensures the accuracy of the discriminator in identifying real and fake images while enabling the generator to quickly learn how to generate realistic images, thereby improving the authenticity of the editing effect of the image editing model built from the generator.
  • FIG. 1A is a flowchart of an editing model generation method in Embodiment 1 of the present invention.
  • FIG. 1B is a schematic diagram of an application scenario of training a generative adversarial network in Embodiment 1 of the present invention.
  • FIG. 2 is a flowchart of an editing model generation method in the second embodiment of the present invention.
  • FIG. 3A is a flowchart of an editing model generation method in Embodiment 3 of the present invention.
  • FIG. 3B is a schematic diagram of an application scenario of self-supervised training of a convolutional neural network in Embodiment 3 of the present invention.
  • FIG. 4A is a flowchart of a method for editing a face image in the fourth embodiment of the present invention.
  • FIG. 4B is a schematic diagram of a face image editing example in the fourth embodiment of the present invention.
  • Figure 5 is a schematic structural diagram of an editing model generating device in the fifth embodiment of the present invention.
  • Fig. 6 is a schematic structural diagram of a face image editing device in the sixth embodiment of the present invention.
  • Fig. 7 is a schematic structural diagram of a computer device in the seventh embodiment of the present invention.
  • FIG. 1A is a flowchart of a method for generating an editing model in Embodiment 1 of the present invention.
  • This embodiment may be suitable for training a generative adversarial network and generating an image editing model from the generator in the trained network.
  • the method can be executed by the editing model generation device provided in the embodiment of the present invention, and the device can be implemented in software and/or hardware, and generally can be integrated in computer equipment. As shown in FIG. 1A, the method of this embodiment specifically includes:
  • S110: Perform iterative training on a generative adversarial network (GAN), which includes a generator and a discriminator.
  • the generator to be trained and the discriminator to be trained constitute a GAN.
  • the generator and the discriminator are actually trained at the same time.
  • samples are used to train the generative confrontation network.
  • training the generative adversarial network includes: inputting samples of real images or noise images into the generative adversarial network, and training it iteratively.
  • the samples include noisy images and real images.
  • the noise image may be a random noise image
  • the real image may include images with real attributes such as real people, real animals, or real scenes.
  • the real image may include a real face image, for example, a face photo.
  • multiple samples can be formed into a sample group, and multiple rounds of iterative training can be performed on the generative adversarial network. Each round of training can use a set number of samples, and the set number can be selected according to the actual situation, for example eight; this is not limited in the embodiments of the present invention.
  • a set number of samples can be determined as a sample group, and in one round of training, a sample group is used to train the generative adversarial network.
  • the generative adversarial network includes a generator 101 and a discriminator 102.
  • random noise images or real images can be input into the generator 101 as samples, and the generated images output by the generator 101 can be obtained.
  • the generated image output by the generator 101 and the corresponding random noise image or real image sample can be input into the discriminator 102 to obtain the discrimination result output by the discriminator 102; updating the parameter items of the generator 101 and the discriminator 102 based on the discrimination result constitutes the training process. A minimal sketch of one such training step is given below.
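  • as a rough illustration of this training loop (a minimal sketch, not the patented procedure itself), the following assumes PyTorch, simple fully connected stand-ins for the generator 101 and discriminator 102, and illustrative hyperparameters:

```python
# Minimal GAN training step (illustrative sketch; architectures, optimizers,
# and hyperparameters are assumptions, not the patent's prescription).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_images):                       # real_images: (batch, 784)
    batch = real_images.size(0)
    noise = torch.randn(batch, 64)                 # random noise samples

    # Discriminator: judge real samples as real (1) and generated ones as fake (0).
    loss_d = bce(D(real_images), torch.ones(batch, 1)) + \
             bce(D(G(noise).detach()), torch.zeros(batch, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: try to make the discriminator judge generated images as real.
    loss_g = bce(D(G(noise)), torch.ones(batch, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```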
  • the generator 101 can be used to edit any image and output the generated image
  • the discriminator 102 is used to determine whether the generated image output by the generator 101 meets the real conditions (or rules). It should be noted that the discriminator 102 is not used to determine whether the generated image is accurate, that is, whether the original image has been edited into the required image effect, but to determine the degree of authenticity of the generated image, that is, whether it is real or fake.
  • the generated image is a face image
  • the authenticity of the generated image can be determined according to the positional relationship between the nose and the mouth, and the corresponding real conditions may include that the nose is located directly above the mouth.
  • if the nose is not located above the mouth in the generated image output by the generator 101, the generated image is determined to be fake; if the nose is located above the mouth, the generated image is determined to be real.
  • the real condition can be used by the discriminator 102 to determine whether the generated image output by the generator is real.
  • the discriminator 102 can learn real features to determine whether the image is true or false.
  • the gradient update configuration information is used to determine the learning rate at which the parameter items learn from each sample, where the learning rate measures the rate of change of a parameter item. Updating the generative adversarial network actually means updating the parameter items of the generator and/or the parameter items of the discriminator. In this step, the learning rate of each parameter item can be determined according to the gradient update configuration information of the discriminator, so as to update the generative adversarial network based on that learning rate.
  • the target learning rate of each parameter item may be determined according to the gradient update configuration information, where the target learning rate indicates the fastest learning rate achievable for each parameter item, or defines an upper bound on a suitable learning rate. After determining the target learning rate, the generative adversarial network can be updated according to the current learning rate and the target learning rate.
  • the value of a parameter item when entering the current round of training can be taken as the pre-update value, and the value of the parameter item determined in the current round of training as the proposed update value; the current learning rate of the parameter item is then calculated from the proposed update value and the pre-update value, and it is judged whether the current learning rate matches the target learning rate.
  • if the current learning rate matches the target learning rate, the proposed update value is used to update the parameter item, that is, the proposed update value becomes the final value of the parameter item in the current round of training and the pre-update value in the next round. If the current learning rate does not match the target learning rate, the target value of the parameter item is determined according to the target learning rate, and the target value is used to update the parameter item, that is, the target value becomes the final value of the parameter item in the current round of training and the pre-update value in the next round.
  • the target learning rate can be determined according to Lipschitz constraints.
  • if a function f(x) satisfies the Lipschitz condition, then f(x) is uniformly continuous.
  • the Lipschitz constraint limits the rate of change of the function f(x), that is, it requires that the variation of f(x) not exceed a certain constant: the slope must be bounded by the Lipschitz constant L. The learning rate can be determined based on the Lipschitz constant L, as expressed below.
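  • stated formally, the Lipschitz condition referred to here is the standard one:

$$|f(x_1) - f(x_2)| \le L \, |x_1 - x_2| \quad \text{for all } x_1, x_2$$

  • bounding the constant L bounds how quickly f can change, and it is this bound that is carried over to cap the learning rate of the discriminator's parameter items.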
  • in this way, the learning rate of each parameter item in the discriminator can be reduced, so as to ensure the accuracy of the discriminator in identifying real and fake images while allowing the generator to quickly learn how to generate realistic images, so that the generator can be effectively applied to an image editing model for real images.
  • updating the generative adversarial network according to the gradient update configuration information of the discriminator includes: determining, according to the gradient update configuration information of the discriminator, the maximum parameter learning rate threshold corresponding to each of one or more feature extraction layers included in the discriminator; and, for each feature extraction layer in the discriminator, updating the parameter items of that feature extraction layer according to its maximum parameter learning rate threshold, so that the update rate of the parameter items associated with the feature extraction layer matches the maximum parameter learning rate threshold corresponding to that layer.
  • the maximum parameter learning rate threshold is used to determine the maximum learning rate of the parameter item.
  • the parameter items refer to the parameter items of the generative adversarial network; specifically, they can refer to one or more parameter items corresponding to one or more feature extraction layers in the discriminator.
  • the feature extraction layer is used to extract feature information from the input and output it.
  • the discriminator can be a learning model of any depth, and is usually a structure including multiple feature extraction layers.
  • the learning rate of the parameter item to be updated relative to the value before the update needs to be less than or equal to the maximum learning rate determined according to the maximum threshold of the parameter learning rate.
  • the maximum threshold of the parameter learning rate can be configured for some or all of the parameter items; or, the parameter item that requires the configuration of the maximum threshold of the parameter learning rate can be customized according to the actual situation.
  • the embodiment of the present invention does not make a specific limitation.
  • updating the parameter items of a feature extraction layer in the discriminator may specifically be: determining, according to the gradient update configuration information, the target learning rate of each of the one or more parameter items associated with the feature extraction layer; for each parameter item, obtaining its proposed update value; calculating the learning rate of the parameter item from the proposed update value and the pre-update value; and comparing that learning rate with the target learning rate of the parameter item.
  • if the learning rate is less than or equal to the target learning rate, it is determined that the learning rate matches the target learning rate, and the parameter item is updated with the proposed update value; if the learning rate is greater than the target learning rate, it is determined that the learning rate does not match the target learning rate, the target value of the parameter item is calculated according to the target learning rate, and the parameter item is updated with the target value, as illustrated in the sketch below.
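  • a minimal sketch of this clamping step is given below, assuming PyTorch tensors; measuring a parameter item's current learning rate by the largest element-wise change is an interpretive assumption, since the embodiment does not fix the exact metric:

```python
import torch

def clamped_update(theta_before: torch.Tensor,
                   theta_proposed: torch.Tensor,
                   target_rate: float) -> torch.Tensor:
    """Apply a proposed parameter update, capping its rate of change.

    If the change already satisfies the layer's maximum parameter learning
    rate threshold, the proposed value is kept; otherwise the step is scaled
    down so its magnitude matches the target rate (one assumed realization of
    "determine the target value according to the target learning rate").
    """
    delta = theta_proposed - theta_before
    rate = delta.abs().max().item()     # current "learning rate" of this item
    if rate <= target_rate:             # matches the target: accept as-is
        return theta_proposed
    return theta_before + delta * (target_rate / rate)  # capped target value
```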
  • the target value of the parameter item can be calculated based on the following formula:

$$\theta_1 = \theta_0 - \alpha \, \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_0}$$

  • where α is the learning rate, J(θ0, θ1) is the fitting function, θ0 is the pre-update value of the parameter item, and θ1 is the target value of the parameter item.
  • the value of α can be determined according to the value of the constant L in the aforementioned Lipschitz constraint condition; for example, it can be equal to the aforementioned target learning rate.
  • in this way, the maximum learning rate of each parameter item can be limited and the learning rate of each parameter item of the discriminator slowed down, which can effectively improve the learning consistency of the discriminator and the generator in the generative adversarial network. This enables the generator to quickly learn how to generate realistic images while the accuracy of the discriminator in identifying real and fake images is ensured, so that the generator can be effectively applied to the construction of an image editing model for real images.
  • the training end condition is used to judge whether the training of the generative adversarial network is completed.
  • the loss function will converge to a set value
  • the training end condition can be configured to be that the calculated value of the loss function is less than the set value, or that the update rate of the loss function is less than the set threshold, etc.
  • the generator in it can generate real images relatively accurately.
  • the image editing model can be obtained by adjusting the generator.
  • the real image can be edited using the image editing model, and the edited image output correspondingly is the real image.
  • the editing mode of the image editing model may include changes in attributes such as the position, size, brightness, and color of pixels in the image.
  • the editing method of the image editing model does not change the true nature of the image, and usually the image obtained after editing the real image is still the real image.
  • the editing method includes editing at least one of the skin color, age, gender, and organ region of the human face. For example, edit the skin color of the face from yellow to white; edit the age feature of the face from 50 to 10; edit the gender feature of the face from male to female; edit the single eyelid of the face to double eyelid, etc.
  • the generator includes an encoder and a decoder.
  • the generator usually contains multiple intermediate layers, and the intermediate results corresponding to these intermediate layers can affect the final output result of the generator, that is, the final image editing effect.
  • the output results of one or more specific intermediate layers can be obtained from the generator as a hidden space (Latent Space); the hidden space can be adjusted and fed into the cascade structure behind it in the generator to achieve the image editing effect. That is, the image editing model can be generated by adjusting the parameters of the hidden space of the generator.
  • the gender characteristics of the face image can be adjusted by editing the hidden space.
  • a female face is input, and a male face is output.
  • the hidden space can be selected according to the specific structure of the generator.
  • the generator includes an encoder and a decoder, and the hidden space is a neural network layer in the decoder. Editing the hidden space may be: obtaining the parameter items of a pre-trained image editing model and using them to update the parameter items of the generator's hidden space, as in the sketch below.
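  • as an illustration, the parameter transfer into the hidden-space layer might be sketched as follows; the layer shapes and module names are assumptions made only for this example:

```python
import torch.nn as nn

# Stand-ins: in practice these come from the trained generator's decoder and
# from a pre-trained image editing model (names and shapes are assumptions).
latent_layer = nn.Linear(512, 512)           # hidden-space layer in the decoder
pretrained_edit_layer = nn.Linear(512, 512)  # layer holding learned edit params

# Editing the hidden space: overwrite the layer's parameter items with the
# pre-trained editing parameters.
latent_layer.load_state_dict(pretrained_edit_layer.state_dict())
```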
  • image editing samples can be used to continue training the generator to generate the image editing model; a sketch of this fine-tuning step is given after this passage.
  • the image editing sample includes the real image before editing and the real image after editing.
  • the image editing sample may include a face image before editing and a face image after editing.
  • the correlation between the face image after editing and the face image before editing can be selected according to the actual situation.
  • the correlation may include gender, age, skin color, etc., which is not limited in the embodiment of the present invention.
  • a pre-trained standard encoder can be used to replace the encoder in the generator to extract effective features from the input image.
  • the standard encoder is used to learn how to extract features that can characterize the input image from the input image.
  • the input size of the decoder in the generator matches the output size of the standard encoder, where the size can be the dimension of the vector.
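  • a sketch of this continued training on paired image editing samples is given below; the L1 reconstruction loss between the generator output and the edited ground-truth image is an assumption made for illustration, since the embodiment does not specify the fine-tuning objective:

```python
import torch.nn as nn

l1 = nn.L1Loss()

def finetune_step(generator, optimizer, img_before, img_after):
    """One fine-tuning step on an image editing sample.

    img_before: the real image before editing; img_after: the real image
    after editing. The L1 objective is an illustrative assumption.
    """
    edited = generator(img_before)   # generator proposes the edited image
    loss = l1(edited, img_after)     # match the ground-truth edited image
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```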
  • in the embodiment of the present invention, the generative adversarial network is trained by inputting sample groups including noise images and/or real images into a generative adversarial network that includes a discriminator and a generator, and the learning rates of the discriminator's parameter items are limited according to a Lipschitz constraint. Slowing down the learning rate of each parameter item of the discriminator can effectively improve the consistency of the training rates of the discriminator and the generator in the generative adversarial network.
  • FIG. 2 is a flowchart of a method for generating an editing model in Embodiment 2 of the present invention. This embodiment is embodied on the basis of Embodiment 1 described above.
  • the method of this embodiment specifically includes:
  • S210: Perform iterative training on the generative adversarial network, which includes a generator and a discriminator.
  • the loss function configuration information can be used to add the Euclidean distance norm to the initial loss function, and the elements included in the Euclidean distance norm are the parameter items of the encoder in the generator.
  • the training process of the generative adversarial network is actually the process of solving the algorithm that maps the input to the output, and solving the algorithm actually means determining the numerical value of each parameter item in the algorithm.
  • the algorithm has an objective function, and the solution process of the algorithm is an optimization process of the objective function.
  • the loss function can be used as the objective function.
  • the loss function is used to express the degree to which the predicted value of the generative adversarial network differs from the true value. For example, the smaller the value of the loss function, the better the performance of the corresponding generative adversarial network.
  • different models use different loss functions.
  • the loss function is used as the training target of the generative adversarial network.
  • the loss function can be of the following form:

$$\min_G \max_D V(D, G) = \mathbb{E}_{m \sim p_{\text{data}}(m)}[\log D(m)] + \mathbb{E}_{n \sim p_n(n)}[\log(1 - D(G(n)))]$$

  • the discriminator D is trained with maximizing log D(m) as the training goal, so as to continuously improve the accuracy with which it judges whether the generated image output by the generator is real, while the generator G is trained with minimizing log(1 − D(G(n))) as the training goal, so as to continuously reduce the difference between the generated image output by the generator and the real image.
  • in this way, the effect of adversarial training between the discriminator and the generator can be achieved.
  • the Euclidean distance norm can be added as a constraint condition on the basis of the initial loss function. Since the Euclidean distance norm can be decomposed into a combination of two low-dimensional parameter matrices, adding the Euclidean distance norm as a constraint condition can effectively reduce the dimension of the parameter matrix and the sample requirement.
  • the training of the generative adversarial network may also suffer from the problem of over-fitting.
  • that is, the trained generative adversarial network has a good generation effect and discrimination accuracy only for certain types of real images, while its generation effect and discrimination accuracy for unknown types of real images are poor.
  • therefore, the Euclidean distance norm can also be added as a constraint condition on the basis of the initial loss function, so that the distribution of the mapping to the hidden space is more even, thereby reducing the coupling of the feature vectors and correspondingly improving the generalization ability of the generative adversarial network.
  • the loss function configuration information is used to add the Euclidean distance norm on the basis of the initial loss function.
  • the Euclidean distance norm can also be called the regularization term, or the L2 norm, which refers to the square root of the sum of the squares of the elements.
  • adding the Euclidean distance norm is equivalent to adding a constraint to the initial loss function. In effect, it heavily penalizes large-value weight vectors in favor of more dispersed weight vectors, so as to achieve a more uniform weight distribution and avoid the weights being concentrated in a small number of vectors, making the generative adversarial network closer to a low-dimensional model. The lower the dimensionality, the smaller the amount of data needed for training. Therefore, adding the Euclidean distance norm to the initial loss function as a constraint condition can reduce the amount of data used in training the generative adversarial network, thereby reducing the complexity of its training.
  • the updated loss function can be of the following form:

$$\min_G \max_D V(D, G) = \mathbb{E}_{m \sim p_{\text{data}}(m)}[\log D(m)] + \mathbb{E}_{n \sim p_n(n)}[\log(1 - D(G(n)))] + \lambda \, \lVert \theta_g \rVert_F^2$$

  • where θg represents the parameter item matrix of the hidden space of the generator G (specifically, it can be the parameter item matrix of the hidden space of the encoder in the generator G); λ is the penalty coefficient, used to adjust the training complexity of the generative adversarial network, and can be set according to the actual situation; ‖·‖F denotes the norm operation; and ‖θg‖²F represents the Euclidean distance norm of the hidden-space parameter matrix θg. A code sketch of adding this term follows.
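  • in code, adding this regularization term might look like the following sketch (assuming PyTorch, with `encoder` standing for the generator's encoder whose parameter items form θg):

```python
import torch
import torch.nn as nn

def regularized_generator_loss(gen_loss: torch.Tensor,
                               encoder: nn.Module,
                               penalty: float) -> torch.Tensor:
    """Add the squared Frobenius (Euclidean distance) norm of the encoder's
    parameter items to the generator loss, weighted by the penalty coefficient."""
    reg = sum(p.pow(2).sum() for p in encoder.parameters())
    return gen_loss + penalty * reg
```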
  • the stability condition is used to judge whether the loss function tends to be stable or convergent.
  • the stable condition is used to determine whether the change rate of the loss function in adjacent training rounds is less than a set change rate threshold, and the size of the change rate threshold can be limited according to actual conditions. It can be understood that the value of the loss function changes very little with the number of training rounds, which indicates that the loss function is stable.
  • the rate of change of the loss function may be: the ratio of the difference between the current value of the loss function calculated in the current round of training and the historical value calculated in the previous round, to the current value of the loss function.
  • alternatively, the stability condition may be to determine whether the number of training rounds exceeds a set round-number threshold; if the generative adversarial network has been trained for enough rounds, it can be determined that its training is completed. A sketch combining both checks is given below.
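  • a sketch of such a stability check, combining the relative-change criterion and the round-count criterion (both threshold values are illustrative):

```python
def training_finished(loss_history, eps=1e-4, max_rounds=100_000):
    """Stability condition sketch: stop when the loss's relative change
    between adjacent rounds falls below eps, or when the number of
    training rounds exceeds a set threshold."""
    if len(loss_history) >= max_rounds:
        return True
    if len(loss_history) < 2:
        return False
    current, previous = loss_history[-1], loss_history[-2]
    change_rate = abs(current - previous) / max(abs(current), 1e-12)
    return change_rate < eps
```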
  • in this way, the weight distribution of the vectors can be made more uniform and the weights prevented from being concentrated on a few vectors, which not only reduces the amount of data used in training the generative adversarial network and its computational complexity, but also improves the generalization ability of the trained network, that is, expands the range of real images to which the trained generative adversarial network is applicable, thereby ensuring the accuracy and editing effect when it is applied to images of unknown categories.
  • FIG. 3A is a flowchart of a method for generating an editing model in Embodiment 3 of the present invention. This embodiment is embodied on the basis of the above-mentioned embodiment.
  • the method of this embodiment specifically includes:
  • the generative adversarial network includes a generator and a discriminator.
  • the image feature detection model is obtained by training based on image feature samples.
  • the image feature sample may include two regional image blocks in the same image and relationship data between the two regional image blocks.
  • the image feature detection model may include two convolutional neural networks sharing weights, a feature vector splicer, and a fully connected network classifier.
  • the convolutional neural networks are used to extract the feature information of the regional image blocks and form feature vectors; the feature vector splicer is used to splice the feature vectors generated by the convolutional neural networks into a target feature vector; and the fully connected network classifier is used to classify the target feature vector and output the relationship data between the regional image blocks.
  • the image feature detection model is used to extract features from the image, and can be a pre-trained deep learning model. Specifically, the image feature detection model can learn to extract the features of image blocks in different regions and the relationship between image blocks in different regions in a self-supervised manner.
  • the regional image blocks are partial image regions in the same image, and the two regional image blocks do not overlap each other.
  • the number and specific settings of regional image blocks in the same image can be selected according to actual conditions.
  • the target object is detected in the image, and the target object is divided into nine equal parts (for example, in the form of nine square grids).
  • the embodiment of the present invention does not make a specific limitation.
  • the relationship data is used to describe the relationship between the two regional image blocks.
  • the relationship data may be at least one of the position relationship, size relationship, shape relationship, and color relationship of the regional image blocks.
  • the relationship data includes a position relationship.
  • the positional relationship in the relationship data may include upper left, upper middle, upper right, left, middle, right, bottom left, bottom middle, and bottom right.
  • the feature information of the regional image block is used to represent the regional image block in the form of data, for example, a feature vector.
  • the feature information represents regional image blocks from different dimensions, and the feature vector can be used to represent the corresponding dimensional information.
  • the convolutional neural network and the feature vector stitcher are used to map the original image data to the hidden space, and the fully connected network classifier is used to map the learned distributed feature representation to the sample label space, so that the classification of the sample can be determined based on the sample label.
  • the convolutional neural network can use the PixelShuffle method to upsample feature maps, which reduces the artifact effects caused by transposed convolution or ordinary linear interpolation upsampling and can thereby improve the authenticity of the generated image output by a generator built on this convolutional neural network structure, as in the sketch below.
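  • as a concrete sketch of PixelShuffle-based upsampling (the channel count and upscale factor are illustrative):

```python
import torch
import torch.nn as nn

r = 2  # upsampling factor
upsample = nn.Sequential(
    # Expand channels by r*r, then rearrange them into spatial resolution.
    nn.Conv2d(64, 64 * r * r, kernel_size=3, padding=1),
    nn.PixelShuffle(r),  # (N, 64*r*r, H, W) -> (N, 64, H*r, W*r)
)

x = torch.randn(1, 64, 16, 16)
print(upsample(x).shape)  # torch.Size([1, 64, 32, 32])
```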
  • the image feature detection model may include a first convolutional neural network 301 and a second convolutional neural network 302 that share weights, a feature vector splicer 303, and a fully connected network classifier 304.
  • the convolutional neural network used to construct the generator may be any one of the first convolutional neural network 301 and the second convolutional neural network 302.
  • the specific operation of the image feature detection model may include: dividing the face image into at least two regional image blocks (for example, a mouth region image block and a right-eye region image block); inputting the mouth region image block into the first convolutional neural network 301 for feature extraction to obtain the first feature vector output by the first convolutional neural network 301; inputting the right-eye region image block into the second convolutional neural network 302 for feature extraction to obtain the second feature vector output by the second convolutional neural network 302; inputting the first feature vector and the second feature vector into the feature vector splicer 303 for splicing to obtain the spliced feature vector output by the feature vector splicer 303; and inputting the spliced feature vector into the fully connected network classifier 304 for classification to obtain the relationship data between the mouth region image block and the right-eye region image block.
  • for example, the fully connected network classifier 304 can determine that the right-eye region image block is at the upper right of the mouth region image block. A sketch of this architecture is given below.
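  • the following sketch illustrates this architecture, with a single CNN applied to both patches to realize weight sharing, feature splicing, and a fully connected classifier over the nine relative-position classes; the layer widths and the 64x64 patch size are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PatchRelationModel(nn.Module):
    """Two weight-sharing CNN branches + feature splicer + FC classifier.

    Predicts the relationship data (here, one of nine relative positions)
    between two regional image blocks taken from the same image.
    """
    def __init__(self, num_relations: int = 9):
        super().__init__()
        # Applying one CNN to both patches realizes weight sharing.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),    # -> (N, 64)
        )
        self.classifier = nn.Sequential(              # FC network classifier
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, num_relations),
        )

    def forward(self, patch_a, patch_b):
        feat = torch.cat([self.cnn(patch_a), self.cnn(patch_b)], dim=1)  # splice
        return self.classifier(feat)                  # relationship logits

model = PatchRelationModel()
logits = model(torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64))
print(logits.shape)  # torch.Size([4, 9])
```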
  • the image feature sample may include two face organ region image blocks in the same face image and the relationship data between the two face organ region image blocks.
  • the facial organ region image blocks may be image blocks divided according to facial organs, for example, a nose region image block and a mouth region image block.
  • the relationship data between the image blocks of the face organ region may indicate the relative positional relationship of the two face organ region image blocks in the face image. For example, if the nose area image block is in the middle and the mouth area image block is in the middle and bottom, the relational data may be that the nose area image block is located above the mouth area image block.
  • by using the facial organ region image blocks in the face image as the image feature samples, the feature information used to distinguish the facial organs can be accurately extracted from the face image and learned. In this way, the various organs of the face image to be edited can be accurately identified, so that the authenticity of face editing can be improved.
  • the decoder of the generator may include a convolutional neural network.
  • the convolutional neural network in the pre-trained image feature detection model can be used as the convolutional neural network in the decoder of the generator; or the parameter items of the convolutional neural network in the pre-trained image feature detection model can be migrated to the convolutional neural network in the decoder of the generator.
  • alternatively, the convolutional neural network in the pre-trained image feature detection model may be added to the existing feature extraction network of the decoder of the generator. In that case, the convolutional neural network and the other feature extraction networks can share weights, the output feature vector of the convolutional neural network and the output feature vectors of the other feature extraction layers can be spliced, and the spliced feature vector can be input into the module that originally received the feature extraction layers' output feature vectors, for example a fully connected network classifier. A sketch of the parameter migration is given below.
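  • one way this parameter migration might be sketched (the stand-in modules below assume the pre-trained CNN and the decoder's CNN have identical architectures):

```python
import torch.nn as nn

# Stand-ins: in practice these are the CNN from the pre-trained image feature
# detection model and the CNN inside the generator's decoder.
pretrained_cnn = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
decoder_cnn = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())

# Migrate the pre-trained parameter items into the decoder's CNN.
decoder_cnn.load_state_dict(pretrained_cnn.state_dict())
```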
  • S340 Generate an image editing model according to the updated generator.
  • because the updated generator uses a convolutional neural network trained by self-supervised learning, a small number of samples can complete the training of the convolutional neural network, which effectively reduces the generator's demand for training samples and improves the training speed.
  • in the embodiment of the present invention, the generator in the generative adversarial network is updated with a convolutional neural network trained in advance by a self-supervised learning method, and the image editing model is built on the updated generator. This allows features to be extracted effectively from the input image of the image editing model, reduces the demand for labeled samples, and reduces the training sample size of the image editing model, thereby increasing the generation speed of the image editing model and reducing its labeling labor cost.
  • Fig. 4A is a flowchart of a method for editing a face image in the fourth embodiment of the present invention.
  • This embodiment is applicable to a situation where an image editing model is used to edit a face image.
  • the method may be executed by the face image editing apparatus provided by the embodiment of the present invention, and the apparatus may be implemented in software and/or hardware, and may generally be integrated into computer equipment.
  • the method of this embodiment specifically includes:
  • the human face image is a real image including the human face. For example, photos taken by users themselves. It should be noted that the face images of cartoon characters are not real images.
  • S420 Input the face image to be edited into the image editing model to obtain the edited face image output by the image editing model.
  • the image editing model is generated by the editing model generation method in any of the foregoing embodiments of the present invention.
  • the generator in the image editing model, or the decoder in that generator, is derived from the generative adversarial network obtained by the editing model generation method of any of the foregoing embodiments of the present invention; a minimal inference sketch is given below.
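  • at inference time, the editing step itself is straightforward; a minimal sketch follows (the preprocessing pipeline and tensor shape are assumptions):

```python
import torch

@torch.no_grad()
def edit_face(image_editing_model, face_image: torch.Tensor) -> torch.Tensor:
    """Run the trained image editing model on a face image to be edited.

    face_image: a (1, 3, H, W) tensor holding the real face photo, already
    preprocessed into the model's expected input format.
    """
    image_editing_model.eval()
    return image_editing_model(face_image)  # the edited face image
```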
  • the generative adversarial network includes a generator and a discriminator, and a Lipschitz constraint condition is used to determine the gradient update configuration information of the discriminator so as to slow down the learning rate of each parameter item of the discriminator. In this way, the rate at which the discriminator learns how to distinguish real images and the rate at which the generator learns how to generate real images are kept as consistent as possible, ensuring both the accuracy of the generative adversarial network in distinguishing real images and the authenticity of the generated images.
  • the first image on the left is a standard processed image commonly used in textbooks, and this image can be used as a real face image.
  • the second image in the middle is a video frame in a dynamic video.
  • the third image on the right is the edited image formed by editing the first image to simulate the mouth opening action of the intermediate video frame.
  • in the training of the generative adversarial network, the embodiment of the present invention uses a Lipschitz constraint condition to constrain the gradient update configuration information of the discriminator, which can effectively slow down the learning rate of each parameter item of the discriminator and improve the training consistency of the discriminator and the generator, guaranteeing the authenticity of the generated image output by the generator while ensuring the discrimination accuracy of the finally trained discriminator. In this way, when the editing model built from the generator of the finally trained generative adversarial network is used to obtain an edited image of a real face image, the authenticity of the image editing effect can be effectively ensured, further improving the user experience.
  • Fig. 5 is a schematic diagram of an editing model generating device in the fifth embodiment of the present invention.
  • the fifth embodiment is a corresponding device that implements the editing model generation method provided in the foregoing embodiment of the present invention.
  • the device can be implemented in software and/or hardware, and generally can be integrated with computer equipment.
  • the apparatus of this embodiment may include:
  • the network training module 510 is used for iteratively training a generative adversarial network, the generative adversarial network including a generator and a discriminator;
  • the network update module 520 is configured to update the generative adversarial network according to the gradient update configuration information of the discriminator in the iterative training, the gradient update configuration information being determined by a Lipschitz constraint condition;
  • the model generation module 530 is configured to generate an image editing model from the generator in the trained generative adversarial network when it is determined that the network meets the training end condition.
  • the embodiment of the present invention inputs real images and/or noise images as samples into a generative adversarial network including a generator and a discriminator to train it iteratively, and limits the learning rates of the discriminator's parameter items according to a Lipschitz constraint in order to improve the learning consistency of the discriminator and the generator. This ensures the accuracy of the discriminator in identifying real and fake images while ensuring the authenticity of the generated image output by the generator, so that the generator can be effectively applied to the construction of an image editing model for real images, improving the authenticity of the effect of editing face images with such a model.
  • the model generation module 530 includes a loss function calculation unit configured to: add the Euclidean distance norm as a constraint condition on the basis of the initial loss function according to the loss function configuration information, to obtain the loss function of the generative adversarial network, where the elements included in the Euclidean distance norm are the parameter items of the encoder in the generator; and, when it is determined that the loss function meets the convergence condition, determine that the generative adversarial network meets the training end condition and generate the image editing model from the generator in the trained network.
  • the network update module 520 includes a discriminator parameter item update unit configured to: determine, according to the gradient update configuration information of the discriminator, the maximum parameter learning rate threshold corresponding to each of the one or more feature extraction layers included in the discriminator; and, for each feature extraction layer in the discriminator, update the parameter items of that feature extraction layer according to its maximum parameter learning rate threshold, so that the update rate of the parameter items associated with the feature extraction layer matches the maximum parameter learning rate threshold corresponding to that layer.
  • the model generation module 530 includes a self-supervised generation unit configured to: update the generator in the trained generative adversarial network, specifically the decoder in the generator, based on the convolutional neural network of the pre-trained image feature detection model; and generate the image editing model from the updated generator.
  • the image feature detection model is obtained by training based on image feature samples, and the image feature samples include two regional image blocks in the same image and the relationship data between the two regional image blocks.
  • the image feature detection model can include two convolutional neural networks that share weights, a feature vector splicer, and a fully connected network classifier.
  • the convolutional neural network is used to extract the feature information of the regional image block and form a feature vector; the feature vector splicer is used to synthesize the feature vector generated by each convolutional neural network into the target feature vector; the fully connected network classifier is used to perform the target feature vector Categorize and output the relationship data between image blocks in each area.
  • the image feature sample includes two face organ region image blocks in the same face image and the relationship data between the two face organ region image blocks.
  • the network training module 510 includes a training unit for inputting samples including real images and/or noise images into the generative adversarial network and performing a round of training on it.
  • the above-mentioned editing model generating device can execute the editing model generating method provided by any one of the embodiments of the present invention, and achieve the same beneficial effects.
  • Fig. 6 is a schematic diagram of a face image editing device in the sixth embodiment of the present invention.
  • the sixth embodiment is a corresponding device that implements the face image editing method provided in the foregoing embodiment of the present invention.
  • the device can be implemented in software and/or hardware, and generally can be integrated with computer equipment.
  • the apparatus of this embodiment may include:
  • the face image acquisition module 610 is used to acquire the face image to be edited
  • the face image editing module 620 is configured to input the face image to be edited into the image editing model to obtain the edited face image output by the image editing model; wherein, the image editing model is adopted as in any of the foregoing embodiments of the present invention Generated by editing model generation method.
  • in the embodiment of the present invention, the gradient update configuration information of the discriminator in the generative adversarial network is determined according to a Lipschitz constraint condition, and the learning rates of the discriminator's parameter items are restricted based on that configuration information. This can improve the training consistency of the discriminator and the generator in the generative adversarial network and ensure the authenticity of the image editing effect of the editing model built from the generator of the finally trained network, effectively improving the user experience.
  • the aforementioned facial image editing device can execute the facial image editing method provided by any one of the embodiments of the present invention, and achieve the same beneficial effects.
  • FIG. 7 is a schematic structural diagram of a computer device according to Embodiment 7 of the present invention.
  • Figure 7 shows a block diagram of an exemplary computer device 12 suitable for implementing embodiments of the present invention.
  • the computer device 12 shown in FIG. 7 is only an example, and should not bring any limitation to the function and application scope of the embodiment of the present invention.
  • the computer device 12 is represented in the form of a general-purpose computing device.
  • the components of the computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting different system components (including the system memory 28 and the processing unit 16).
  • the bus 18 represents one or more of several types of bus structures, including a memory bus, a peripheral bus, or a local bus using any of a variety of bus structures.
  • these bus structures include but are not limited to the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
  • the computer device 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by the computer device 12, including volatile and nonvolatile media, removable and non-removable media.
  • the system memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32.
  • the computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media.
  • the storage system 34 may be used to read and write non-removable, non-volatile magnetic media (not shown in FIG. 7 and generally referred to as a "hard drive").
  • a disk drive for reading and writing a removable non-volatile disk (such as a "floppy disk") and an optical disk drive for reading and writing a removable non-volatile optical disk (such as a Compact Disc Read-Only Memory (CD-ROM)) can also be provided.
  • each drive can be connected to the bus 18 through one or more data media interfaces.
  • the system memory 28 may store at least one program product, the program product having a set (for example, at least one) program modules, and these program modules are configured to perform the functions of the various embodiments of the present invention.
  • the program/utility tool 40 having a set of (at least one) program module 42 may be stored in the system memory 28, for example.
  • the program module 42 includes, but is not limited to, an operating system, one or more application programs, other program modules, and program data. Each of these examples or some combination may include the implementation of a network environment.
  • the program module 42 generally executes the functions and/or methods in the described embodiments of the present invention.
  • the computer device 12 can also communicate with one or more external devices 14 (such as a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable users to interact with the computer device 12, and/or with any device (such as a network card, a modem, etc.) that enables the computer device 12 to communicate with one or more other computing devices. Such communication can be performed through an input/output (I/O) interface 22.
  • the computer device 12 may also communicate with one or more networks (for example, a local area network (LAN) or a wide area network (WAN)) through the network adapter 20. As shown in the figure, the network adapter 20 communicates with the other modules of the computer device 12 through the bus 18. It should be understood that, although not shown in FIG. 7, other hardware and/or software modules may be used in conjunction with the computer device 12.
  • the processing unit 16 executes various functional applications and data processing by running the program module 42 stored in the system memory 28, for example, to implement an editing model generation method and/or a face image editing method provided by any embodiment of the present invention .
  • the eighth embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, it implements the editing model generation method provided in any embodiment of this application, or implements the face image editing method provided in any embodiment of this application.
  • the computer storage medium of the embodiment of the present invention may adopt any combination of one or more computer-readable media.
  • the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer disk, a hard disk, a RAM, a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash memory, an optical fiber, a portable CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the computer-readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, apparatus, or device.
  • the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and computer-readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium.
  • the computer-readable medium may send, propagate, or transmit the program for use by or in combination with the instruction execution system, apparatus, or device.
  • the program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to wireless, wire, optical cable, radio frequency, etc., or any suitable combination of the foregoing.
  • the computer program code used to perform the operations of the present invention can be written in one or more programming languages or a combination thereof.
  • the programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code can be executed entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to a user computer through any kind of network including a LAN or WAN, or may be connected to an external computer (for example, using an Internet service provider to connect through the Internet).

Abstract

An editing model generation method and apparatus, a face image editing method and apparatus, a device, and a medium. The editing model generation method comprises: iteratively training a generative adversarial network comprising a generator and a discriminator (S110); in the iterative training, updating the generative adversarial network according to gradient update configuration information of the discriminator, the gradient update configuration information being determined by means of a Lipschitz constraint (S120); and when it is determined that the generative adversarial network satisfies a training end condition, generating an image editing model according to the generator in the trained generative adversarial network (S130).

Description

Editing model generation method and apparatus, face image editing method and apparatus, device, and medium
Technical Field
The embodiments of the present invention relate to the field of artificial intelligence, and in particular to the editing of face images and the generation of corresponding editing models.
Background
In recent years, people have increasingly high requirements for the realism of synthesized images, and it is hoped that more realistic and natural images can be generated by algorithms. In particular, people often edit face images and expect the edited face image to still be a realistic face.
At present, a generative adversarial network (GAN) including a generator and a discriminator can be used to generate realistic face images. In the training process of the generative adversarial network, the generator is used to generate face images, and the discriminator is used to judge whether the generated face images are real or fake.
Training a generative adversarial network is in fact training the generator and the discriminator in the network. In the actual training process, the discriminator usually finishes training faster than the generator. Soon after training begins, the discriminator can already judge the authenticity of generated images relatively accurately, but at that point the generator has not yet learned how to generate realistic images, and none of the images it generates can pass the discriminator's judgment, causing the training of the entire generative adversarial network to fail. When training fails, the authenticity of images generated by the trained generative adversarial network cannot be guaranteed, nor can the editing effect of an image editing model obtained from the generative adversarial network.
Summary of the Invention
Embodiments of the present invention provide an editing model generation method, a face image editing method, and corresponding apparatuses, devices, and media, which can improve the training consistency between the generator and the discriminator and improve the authenticity of generated images.
In a first aspect, an embodiment of the present invention provides an editing model generation method, including: performing iterative training on a generative adversarial network, the generative adversarial network including a generator and a discriminator; in the iterative training, updating the generative adversarial network according to gradient update configuration information of the discriminator, wherein the gradient update configuration information is determined through a Lipschitz constraint; and when it is determined that the generative adversarial network satisfies a training end condition, generating an image editing model according to the generator in the trained generative adversarial network.
In a second aspect, an embodiment of the present invention provides a face image editing method, including: acquiring a face image to be edited; and inputting the face image to be edited into an image editing model to obtain an edited face image output by the image editing model, wherein the image editing model is generated by the editing model generation method described above.
In a third aspect, an embodiment of the present invention further provides an editing model generation apparatus, including: a network training module configured to perform iterative training on a generative adversarial network, the generative adversarial network including a generator and a discriminator; a network update module configured to, in the iterative training, update the generative adversarial network according to gradient update configuration information of the discriminator, the gradient update configuration information being determined through a Lipschitz constraint; and a model generation module configured to, when it is determined that the generative adversarial network satisfies a training end condition, generate an image editing model according to the generator in the trained generative adversarial network.
In a fourth aspect, an embodiment of the present invention further provides a face image editing apparatus, including: a face image acquisition module configured to acquire a face image to be edited; and a face image editing module configured to input the face image to be edited into an image editing model to obtain an edited face image output by the image editing model, wherein the image editing model is generated by the editing model generation method described above.
In a fifth aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the editing model generation method or the face image editing method according to any embodiment of the present invention.
In a sixth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the editing model generation method or the face image editing method according to any embodiment of the present invention.
In the embodiments of the present invention, when a generative adversarial network including a generator and a discriminator is trained, the learning rate of the discriminator's parameters is limited according to a Lipschitz constraint so as to slow down the training of the discriminator, which can improve the training consistency between the discriminator and the generator. In this way, the generator can quickly learn how to generate realistic images while the discriminator maintains its accuracy in distinguishing real images from fake ones, thereby improving the realism of the editing results of an image editing model built on the generator.
Brief Description of the Drawings
FIG. 1A is a flowchart of an editing model generation method in Embodiment 1 of the present invention;
FIG. 1B is a schematic diagram of an application scenario of training a generative adversarial network in Embodiment 1 of the present invention;
FIG. 2 is a flowchart of an editing model generation method in Embodiment 2 of the present invention;
FIG. 3A is a flowchart of an editing model generation method in Embodiment 3 of the present invention;
FIG. 3B is a schematic diagram of an application scenario of self-supervised training of a convolutional neural network in Embodiment 3 of the present invention;
FIG. 4A is a flowchart of a face image editing method in Embodiment 4 of the present invention;
FIG. 4B is a schematic diagram of face image editing in Embodiment 4 of the present invention;
FIG. 5 is a schematic structural diagram of an editing model generation apparatus in Embodiment 5 of the present invention;
FIG. 6 is a schematic structural diagram of a face image editing apparatus in Embodiment 6 of the present invention;
FIG. 7 is a schematic structural diagram of a computer device in Embodiment 7 of the present invention.
Detailed Description
The present invention will be further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention, not to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure. In addition, the scope of disclosure of the present invention is not limited to technical solutions formed by the specific combinations of the technical features described above; it should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept.
Embodiment 1
FIG. 1A is a flowchart of an editing model generation method in Embodiment 1 of the present invention. This embodiment is applicable to training a generative adversarial network and generating an image editing model from the generator of the trained network. The method can be executed by the editing model generation apparatus provided in the embodiments of the present invention; the apparatus can be implemented in software and/or hardware, and can generally be integrated into computer equipment. As shown in FIG. 1A, the method of this embodiment specifically includes:
S110: Iteratively train a generative adversarial network (GAN), the GAN including a generator and a discriminator.
In this embodiment, the generator to be trained and the discriminator to be trained constitute the GAN. Training the GAN is in fact training the generator and the discriminator at the same time.
In this embodiment, samples are used to train the GAN.
Optionally, training the GAN includes: inputting samples that are real images or noise images into the GAN, and iteratively training the GAN.
The samples include noise images and real images. A noise image may be a random noise image, and a real image may be an image with real attribute features, such as an image of a real person, a real animal, or a real scene. Exemplarily, a real image may include a real face image, for example, a face photo. Exemplarily, multiple samples may form one sample group, and the GAN may be trained in multiple iterative rounds; each round may use a set number of samples, and the set number may be chosen according to the actual situation, for example, 8, which is not limited in the embodiments of the present invention. The set number of samples may be taken as one sample group, and in one training round, one sample group is used to train the GAN.
As shown in FIG. 1B, the GAN includes a generator 101 and a discriminator 102. During training, a random noise image or a real image can be input into the generator 101 as a sample, and the generated image output by the generator 101 is obtained. Then, the generated image output by the generator 101 and the corresponding sample (the random noise image or real image) can be input into the discriminator 102 to obtain the discrimination result output by the discriminator 102, and the parameters of the generator 101 and the discriminator 102 are updated based on the discrimination result.
In the embodiments of the present invention, the generator 101 can be used to edit an arbitrary image and output a generated image, and the discriminator 102 is used to judge whether the generated image output by the generator 101 satisfies realism conditions (or rules). It should be noted that the discriminator 102 does not judge whether the generated image is accurate, that is, whether the original image has been edited into the desired effect; it judges the degree of realism of the generated image, that is, whether it is real or fake. For example, when the generated image is a face image, its authenticity can be judged from the positional relationship between the nose and the mouth, and the corresponding realism condition may include that the nose is located directly above the mouth. Exemplarily, if the nose is below the mouth in the generated image output by the generator 101, the generated image is determined to be fake; if the nose is above the mouth, the generated image is determined to be real. The realism conditions can be used by the discriminator 102 to judge whether the generated image output by the generator is realistic. For example, the discriminator 102 can learn real features in order to judge whether an image is real or fake.
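To make the alternating update concrete, the following is a minimal sketch of one training round under stated assumptions: PyTorch-style `generator` and `discriminator` modules, a discriminator that outputs a realness probability, and pre-built optimizers. All names are illustrative, not the patent's own code.

```python
# A minimal sketch of one adversarial training round (module, optimizer,
# and label names are assumptions, not part of the original disclosure).
import torch
import torch.nn as nn

bce = nn.BCELoss()  # discriminator outputs a probability of being real

def train_round(generator, discriminator, g_opt, d_opt, sample):
    batch = sample.size(0)
    real_label = torch.ones(batch, 1)
    fake_label = torch.zeros(batch, 1)

    # Discriminator step: score the sample as real, the generated image as fake.
    d_opt.zero_grad()
    generated = generator(sample).detach()
    d_loss = bce(discriminator(sample), real_label) + \
             bce(discriminator(generated), fake_label)
    d_loss.backward()
    d_opt.step()

    # Generator step: try to make the generated image pass the discriminator.
    g_opt.zero_grad()
    g_loss = bce(discriminator(generator(sample)), real_label)
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```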
S120: In the iterative training, update the GAN according to gradient update configuration information of the discriminator, wherein the gradient update configuration information is determined through a Lipschitz constraint.
The gradient update configuration information is used to determine the learning rate of the parameters learned from each sample, where the learning rate measures the rate of change of a parameter. Updating the GAN in fact means updating the parameters of the generator and/or the parameters of the discriminator. In this step, the learning rate of each parameter can be determined according to the gradient update configuration information of the discriminator, and the GAN is then updated based on the learning rate.
In some optional embodiments, a target learning rate can be determined for each parameter according to the gradient update configuration information. The target learning rate represents the fastest learning rate that a parameter may reach, or is used to judge whether a learning rate is appropriate. After the target learning rate is determined, the GAN can be updated according to the current learning rate and the target learning rate.
For example, the value of a parameter when entering the current training round can be taken as its pre-update value, and the value of the parameter determined in the current round can be taken as its proposed update value. The current learning rate of the parameter is calculated from the proposed update value and the pre-update value, and it is judged whether the current learning rate matches the target learning rate. If they match, the parameter is updated with the proposed update value; that is, the proposed update value becomes the final value of the parameter for the current round and the pre-update value for the next round. If they do not match, a target value of the parameter is determined according to the target learning rate, and the parameter is updated with that target value, which becomes the final value for the current round and the pre-update value for the next round.
The target learning rate can be determined according to the Lipschitz constraint.
Definition of the Lipschitz constraint:
If there exists a constant L such that, for any two distinct real numbers x1 and x2 in the domain D, the following inequality holds:
|f(x1) − f(x2)| ≤ L‖x1 − x2‖
then the function f(x) is said to satisfy the Lipschitz constraint on D. Here, L is called the Lipschitz constant, and its specific value depends on the function f(x).
Obviously, if the function f(x) satisfies the Lipschitz condition, then f(x) is uniformly continuous. In fact, the Lipschitz constraint limits the rate of change of f(x): the magnitude of change of f(x) cannot exceed a certain constant, and its slope must be smaller than the Lipschitz constant L, so the learning rate can be determined according to the Lipschitz constant L.
The inventors found through research that, when training a generative adversarial network that includes a generator and a discriminator, if the update step is unconstrained, the discriminator learns how to judge whether images are real or fake too quickly, while the generator learns how to generate realistic images too slowly. Because of this difference in learning rates, the discriminator can usually judge the authenticity of the generator's output relatively accurately before the generator has finished training, for example judging the outputs of the not-yet-trained generator to be fake, so that no matter how much further the generator trains and learns, it cannot produce a generated image that the discriminator determines to be real. In other words, after the discriminator completes training, most of its judgments on the generator's outputs are "this generated image is fake", causing the generator's gradient to vanish so that it cannot continue learning how to generate realistic images. As a result, the authenticity of the images output by the trained generator cannot be guaranteed.
In view of this, the Lipschitz constraint can be used to reduce the learning rate of each parameter in the discriminator, so that the generator can quickly learn how to generate realistic images while the discriminator maintains its accuracy in distinguishing real images from fake ones, and the generator can then be effectively applied in an image editing model for real images.
Optionally, updating the GAN according to the gradient update configuration information of the discriminator includes: determining, according to the gradient update configuration information of the discriminator, a maximum parameter learning rate threshold for each of one or more feature extraction layers included in the discriminator; and, for each feature extraction layer in the discriminator, updating the parameters of that layer according to its maximum parameter learning rate threshold, so that the update rate of the parameters associated with that layer matches the layer's maximum parameter learning rate threshold.
The maximum parameter learning rate threshold is used to determine the maximum learning rate of a parameter. A parameter here refers to a parameter of the GAN, specifically one or more parameters corresponding to each of the one or more feature extraction layers in the discriminator. A feature extraction layer extracts feature information from its input and outputs it. The discriminator can be a learning model of any depth, and usually has a structure that includes multiple feature extraction layers.
The learning rate of a parameter's proposed update value relative to its pre-update value needs to be less than or equal to the maximum learning rate determined by the maximum parameter learning rate threshold. In one possible implementation, the maximum parameter learning rate threshold can be configured for some or all of the parameters; alternatively, the parameters that need such a threshold can be customized according to the actual situation. This is not specifically limited in the embodiments of the present invention.
Updating the parameters of a feature extraction layer in the discriminator according to the layer's maximum parameter learning rate threshold may specifically be as follows: the target learning rate of each of the one or more parameters associated with the layer is determined according to the gradient update configuration information; for each parameter, its proposed update value is obtained; the learning rate of the parameter is calculated from the proposed update value and the pre-update value; and the magnitude relationship between this learning rate and the parameter's target learning rate is judged. When the learning rate is less than or equal to the target learning rate, it is determined that the learning rate matches the target learning rate, and the parameter is updated with its proposed update value; when the learning rate is greater than the target learning rate, it is determined that the learning rate does not match the target learning rate, the target value of the parameter is calculated according to the target learning rate, and the parameter is updated with the target value.
Exemplarily, the target value of the parameter can be calculated based on the following formula:
θ1 = θ0 − α · ∂J(θ0, θ1)/∂θ0
where α is the learning rate, J(θ0, θ1) is the fitting function, θ0 is the pre-update value of the parameter, and θ1 is the target value of the parameter. The value of α can be determined according to the value of the constant L in the aforementioned Lipschitz constraint; for example, it can be equal to the above target learning rate.
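As a hedged illustration of the matching step above, the following sketch clamps each layer's proposed updates to its maximum learning-rate threshold. The layer-to-threshold mapping, the snapshot dictionary, and the reading of "learning rate" as relative change per round are all assumptions, not the patent's own code.

```python
# A minimal sketch of capping each discriminator layer's parameter update
# rate at that layer's maximum learning-rate threshold (names and the
# definition of "rate" as relative change are assumptions).
import torch

@torch.no_grad()
def clamp_layer_updates(discriminator, pre_update, max_rate):
    """pre_update: name -> parameter snapshot taken before the optimizer step;
    max_rate: feature-extraction-layer name -> maximum allowed update rate."""
    for name, param in discriminator.named_parameters():
        layer = name.split(".")[0]
        if layer not in max_rate:
            continue
        old = pre_update[name]
        delta = param - old                           # proposed update
        limit = max_rate[layer] * (old.abs() + 1e-8)  # largest allowed change
        # Fall back to the target value when the proposed rate is too fast.
        param.copy_(old + delta.clamp(-limit, limit))
```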
By configuring a maximum parameter learning rate threshold for each feature extraction layer in the discriminator, the maximum learning rate of each parameter can be limited, the learning rate of each parameter of the discriminator can be slowed down, and the learning consistency between the discriminator and the generator in the GAN can be effectively improved. This allows the generator to quickly learn how to generate realistic images while the discriminator maintains its accuracy in distinguishing real images from fake ones, so that the generator can be effectively applied in an image editing model structure for real images.
S130: When it is determined that the GAN satisfies the training end condition, generate an image editing model according to the generator in the trained GAN.
The training end condition is used to judge whether the training of the GAN is complete. Usually, the loss function converges to a set value, so the training end condition can be configured as the calculated value of the loss function being less than a set value, or the update change rate of the loss function being less than a set threshold, and so on.
When the training of the GAN is completed, its generator can generate realistic images relatively accurately. The image editing model can be obtained by adjusting the generator. The image editing model can then be used to edit real images, and the corresponding output edited images are realistic images.
The editing operations of the image editing model may include changes to attributes such as the position, size, brightness, and color of pixels in the image. These editing operations do not change the realistic nature of the image: the image obtained by editing a real image is usually still a realistic image. Exemplarily, the editing operations include editing at least one of the skin color, age, gender, and organ regions of a face. For example: editing the skin color of a face from yellow to white; editing the age characteristics of a face from 50 years old to 10 years old; editing the gender characteristics of a face from male to female; editing a single eyelid into a double eyelid; and so on.
In some optional embodiments, the generator includes an encoder and a decoder. In fact, the generator structure contains multiple cascaded intermediate layers, and the intermediate results of these layers can affect the final output of the generator, that is, the final image editing effect. The outputs of one or more specific intermediate layers can be taken from the generator as a latent space; the latent space is adjusted and then fed into the cascade structure that follows it in the generator, so as to achieve the image editing effect. That is, an image editing model can be generated by adjusting the parameters of the generator's latent space.
For example, the gender characteristics of a face image can be adjusted by editing the latent space. Exemplarily, a female face is input and a male face is output. The latent space can be selected according to the specific structure of the generator. Optionally, the generator includes an encoder and a decoder, and the latent space is a neural network layer in the decoder. Editing the latent space may be: obtaining the parameters of a pre-trained image editing model, and updating the parameters of the generator's latent space with them.
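One common way to realize such a latent-space adjustment is to shift the latent code along a learned attribute direction before decoding; the sketch below assumes that setup, and the `encoder`/`decoder` split, the `direction` vector, and `strength` are all illustrative assumptions rather than the patent's exact method.

```python
# A minimal sketch of latent-space editing (module names and the learned
# attribute direction are assumptions, not the patent's own code).
def edit_image(generator, image, direction, strength=1.0):
    z = generator.encoder(image)          # map the input image to the latent space
    z_edited = z + strength * direction   # adjust the latent code (e.g., gender)
    return generator.decoder(z_edited)    # decode the edited latent code
```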
For another example, image editing samples can be used to continue training the generator so as to generate the image editing model. An image editing sample includes a real image before editing and a real image after editing. For example, an image editing sample may include a face image before editing and the corresponding face image after editing. The relationship between the edited face image and the face image before editing can be selected according to the actual situation; for example, the relationship may involve gender, age, skin color, and so on, which is not limited in the embodiments of the present invention.
In addition, a pre-trained standard encoder can be used to replace the encoder in the generator, in order to extract effective features from the input image. The standard encoder learns how to extract features that can characterize the input image. The input size of the decoder in the generator matches the output size of the standard encoder, where the size can be the dimension of a vector.
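A hedged sketch of that replacement, where the module references and the `out_dim`/`in_dim` attributes are assumptions:

```python
# A minimal sketch of swapping in a pre-trained standard encoder
# (all attribute names are assumptions).
assert standard_encoder.out_dim == generator.decoder.in_dim  # sizes must match
generator.encoder = standard_encoder
```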
In the embodiments of the present invention, a sample group including noise images and/or real images is input into a GAN that includes a discriminator and a generator, the GAN is trained, and the learning rate of the discriminator's parameters is limited according to the Lipschitz constraint to slow down the learning rate of each parameter of the discriminator, which can effectively improve the consistency of the training rates of the discriminator and the generator in the GAN. This not only makes the changes of the discriminator's parameters more continuous and smooth, but also maintains the discriminator's accuracy in distinguishing real images from fake ones while allowing the generator to quickly learn how to generate realistic images. The generator can then be effectively applied in an image editing model structure for real images, improving the realism of the results of editing real images based on that image editing model structure.
Embodiment 2
FIG. 2 is a flowchart of an editing model generation method in Embodiment 2 of the present invention. This embodiment is elaborated on the basis of Embodiment 1 above.
As shown in FIG. 2, the method of this embodiment specifically includes:
S210: Iteratively train a generative adversarial network, the GAN including a generator and a discriminator.
For details not described in this embodiment, reference may be made to the foregoing embodiments.
S220: In the iterative training, update the GAN according to gradient update configuration information of the discriminator, wherein the gradient update configuration information is determined through a Lipschitz constraint.
S230: Calculate the value of the loss function of the GAN according to loss function configuration information. The loss function configuration information can be used to add a Euclidean distance norm to the initial loss function, and the elements included in the Euclidean distance norm are the parameters of the encoder in the generator.
The training process of the GAN is in fact the process of solving the algorithm that realizes the mapping from input to output, and solving the algorithm in fact means solving for the values of its parameters. The algorithm has an objective function, and solving the algorithm is a process of optimizing this objective function. Usually, the loss function can be used as the objective function. The loss function expresses the degree to which the predicted value of the GAN differs from the true value. For example, the smaller the value of the loss function, the better the performance of the corresponding GAN usually is. Different models generally use different loss functions.
In the embodiments of the present invention, the loss function serves as the training target of the GAN. The loss function can take the following form:
LOSS = E_{m∼P_data(m)}[log D(m, θ_d)] + E_{n∼P_noise(n)}[log(1 − D(G(n, θ_g), θ_d))]
where LOSS is the initial loss function, representing the sum of the loss function LOSS_D of the discriminator D and the loss function LOSS_G of the generator G; E(·) denotes the expected value over a distribution; m denotes a real image; θ_d denotes the parameter matrix of the discriminator D; P_data(m) denotes the distribution of the real-image samples, which can be mapped by the discriminator D to a higher-dimensional data space to obtain D(m, θ_d); n denotes random noise; θ_g denotes the parameter matrix of the generator G; and P_noise(n) denotes the noise distribution, which can be mapped by the generator G to a higher-dimensional data space to obtain G(n, θ_g).
The term E_{m∼P_data(m)}[log D(m, θ_d)] denotes the loss function LOSS_D of the discriminator D, and maximizing LOSS_D can be taken as the training target of the discriminator D; the term E_{n∼P_noise(n)}[log(1 − D(G(n, θ_g), θ_d))] denotes the loss function LOSS_G of the generator G, and minimizing LOSS_G can be taken as the training target of the generator G. In other words, the discriminator D is trained with maximizing log D(m) as its target, continuously improving its accuracy in judging whether the generated image output by the generator is real, while the generator G is trained with minimizing 1 − log D(G(n)) as its target, continuously reducing the difference between the generated images output by the generator and real images. In this way, by taking the maximization of the discriminator D's loss function and the simultaneous minimization of the generator G's loss function as the training target of the generative adversarial network, the effect of adversarial training between the discriminator and the generator is achieved.
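A hedged sketch of these two terms, assuming the discriminator outputs a probability in (0, 1); the function and argument names are illustrative:

```python
# A minimal sketch of LOSS_D and LOSS_G as reconstructed above; d_real and
# d_fake are the discriminator's outputs on real and generated images.
import torch

def loss_d(d_real):
    # LOSS_D = E[log D(m)]; the discriminator maximizes this term.
    return torch.log(d_real).mean()

def loss_g(d_fake):
    # LOSS_G = E[log(1 - D(G(n)))]; the generator minimizes this term.
    return torch.log(1 - d_fake).mean()
```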
Using only the initial loss function as the training target of the GAN requires a large number of image feature extraction operations, which is computationally expensive and slow to solve. In view of this, a Euclidean distance norm can be added to the initial loss function as a constraint. Since the Euclidean distance norm can be decomposed into a combination of two low-dimensional parameter matrices, adding it as a constraint can effectively reduce the dimension of the parameter matrices and the number of samples required.
Training a GAN may also suffer from overfitting, so that the trained GAN has a good generation effect and discrimination accuracy only for certain categories of real images, while its generation effect and discrimination accuracy for real images of unknown categories are poor. In view of this, adding the Euclidean distance norm to the initial loss function as a constraint can also be considered, so that the distribution mapped into the latent space is more even, which reduces the coupling between feature vectors and correspondingly improves the generalization ability of the GAN, ensuring its applicability to real images of unknown categories.
The loss function configuration information is used to add the Euclidean distance norm on the basis of the initial loss function. The Euclidean distance norm, which may also be called a regularization term or L2 norm, is the square root of the sum of the squares of the elements. Adding the Euclidean distance norm is equivalent to adding a constraint to the initial loss function; in effect, it severely penalizes weight vectors with large values in favor of more dispersed weight vectors, so that the weights are distributed more evenly and are not concentrated on a few vectors, which brings the GAN closer to a low-dimensional model. The lower the dimension, the smaller the amount of data used for training. Therefore, adding the Euclidean distance norm to the initial loss function as a constraint can reduce the amount of data used to train the GAN, and thus reduce the complexity of training the GAN.
Specifically, the updated loss function can take the following form:
LOSS′ = LOSS + λ‖θ_g‖_F
where θ_g denotes the parameter matrix of the latent space of the generator G, specifically the parameter matrix of the latent space of the encoder in the generator G; λ is a penalty coefficient used to adjust the complexity of training the GAN, and can be set according to the actual situation; ‖·‖_F denotes the norm operation; and ‖θ_g‖_F denotes the Euclidean distance norm of the latent-space parameter matrix θ_g.
The elements included in the Euclidean distance norm can be the parameter matrix θ_g of the latent space of the generator G, specifically the parameters of the encoder in the generator.
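A hedged sketch of adding this norm to the loss, assuming the regularized parameters are those of the generator's encoder; `lam` stands in for the penalty coefficient λ:

```python
# A minimal sketch of adding the Euclidean distance (L2/Frobenius) norm of
# the encoder's latent-space parameters to the loss (names are assumptions).
import torch

def regularized_loss(initial_loss, encoder, lam=1e-4):
    # Frobenius norm: square root of the sum of squares of all elements.
    l2 = torch.sqrt(sum((p ** 2).sum() for p in encoder.parameters()))
    return initial_loss + lam * l2
```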
S240: If it is determined that the loss function satisfies a stability condition (which may also be called a convergence condition), it is determined that the GAN satisfies the training end condition, and the image editing model can be generated according to the generator in the trained GAN.
The stability condition is used to judge whether the loss function tends to be stable, which may also be described as tending to converge. For example, the stability condition is used to judge whether the change rate of the loss function between adjacent training rounds is less than a set change-rate threshold, where the size of the threshold can be determined according to the actual situation. It can be understood that if the value of the loss function changes very little with the number of training rounds, the loss function is stable. The change rate of the loss function may be: the difference between the current value of the loss function calculated in the current round and the historical value calculated in the previous round, divided by the current value of the loss function. If this ratio is less than the set threshold, it is determined that the loss function would change only slightly even with further training, indicating that the loss function has stabilized, that is, converged. At this point, it is determined that the training of the GAN is complete. Alternatively, the stability condition may be judging whether the number of training rounds exceeds a set round-number threshold; if the GAN has been trained for enough rounds, it can be determined that its training is complete.
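A hedged sketch of the relative-change stability check described above; the threshold value is an illustrative assumption:

```python
# A minimal sketch of the relative-change stability (convergence) check.
def has_converged(prev_loss, curr_loss, threshold=1e-3):
    if prev_loss is None:
        return False
    # Relative change of the loss between adjacent training rounds.
    return abs(curr_loss - prev_loss) / abs(curr_loss) < threshold
```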
In the embodiments of the present invention, adding a norm to the initial loss function as a constraint makes the weight distribution of the vectors more even and prevents the weights from being concentrated on a few vectors. This not only reduces the amount of data and the computational complexity of training the GAN, but also improves the generalization ability of the trained GAN: it expands the range of real-image categories to which the trained GAN is applicable, thereby ensuring its discrimination accuracy and editing effect when applied to images of unknown categories.
Embodiment 3
FIG. 3A is a flowchart of an editing model generation method in Embodiment 3 of the present invention. This embodiment is elaborated on the basis of the above embodiments.
As shown in FIG. 3A, the method of this embodiment specifically includes:
S310: Iteratively train a generative adversarial network, the GAN including a generator and a discriminator.
For details not described in this embodiment, reference may be made to the foregoing embodiments.
S320: In the iterative training, update the GAN according to gradient update configuration information of the discriminator, wherein the gradient update configuration information is determined through a Lipschitz constraint.
S330: When it is determined that the GAN satisfies the training end condition, update the generator in the trained GAN according to the convolutional neural network in a pre-trained image feature detection model. The image feature detection model is trained on image feature samples; an image feature sample may include two regional image blocks from the same image and the relationship data between the two regional image blocks.
Exemplarily, the image feature detection model may include two convolutional neural networks that share weights, a feature vector splicer, and a fully connected network classifier. The convolutional neural networks extract the feature information of the regional image blocks and form feature vectors; the feature vector splicer combines the feature vectors generated by the convolutional neural networks into a target feature vector; and the fully connected network classifier classifies the target feature vector and outputs the relationship data between the regional image blocks.
The image feature detection model is used to extract features from images and can be a pre-trained deep learning model. Specifically, the image feature detection model can learn, in a self-supervised manner, to extract the features of image blocks in different regions and the relationships between image blocks in different regions.
The image blocks of different regions are local image regions within the same image, and no two regional image blocks overlap. The number and specific arrangement of regional image blocks in an image can be chosen according to the actual situation; for example, a target object can be detected in the image and divided into nine equal parts (for example, a three-by-three grid). This is not specifically limited in the embodiments of the present invention.
The relationship data describes the relationship between two regional image blocks and can be at least one of their positional, size, shape, and color relationships. Exemplarily, the relationship data includes a positional relationship. For example, when an image is divided into a three-by-three grid of regional image blocks, the positional relationships in the relationship data may include upper left, upper middle, upper right, middle left, center, middle right, lower left, lower middle, and lower right.
The feature information of a regional image block represents the block in data form, for example, as a feature vector. In fact, the feature information represents the regional image block along different dimensions, and a feature vector can represent the corresponding dimensional information.
The convolutional neural networks and the feature vector splicer map the original image data to a latent space, and the fully connected network classifier maps the learned distributed feature representation to the sample label space, so that the classification of a sample can be determined from the sample label. The convolutional neural network can use the PixelShuffle method to upsample feature maps, reducing the artifact effects caused by transposed convolution or ordinary linear-interpolation upsampling, thereby improving the realism of the generated images output by a generator built on the convolutional neural network.
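As a hedged illustration of PixelShuffle-style upsampling (the channel sizes and upscale factor are assumptions):

```python
# A minimal sketch of PixelShuffle upsampling: a convolution produces
# r*r times the channels, and PixelShuffle rearranges them into a feature
# map r times larger in height and width (here r = 2).
import torch.nn as nn

upsample = nn.Sequential(
    nn.Conv2d(64, 64 * 4, kernel_size=3, padding=1),
    nn.PixelShuffle(2),  # (B, 256, H, W) -> (B, 64, 2H, 2W)
)
```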
Specifically, as shown in FIG. 3B, the image feature detection model may include a first convolutional neural network 301 and a second convolutional neural network 302 that share weights, a feature vector splicer 303, and a fully connected network classifier 304. The convolutional neural network used to construct the generator can be either the first convolutional neural network 301 or the second convolutional neural network 302.
The specific operation of the image feature detection model may include: dividing a face image into at least two regional image blocks (for example, a mouth region image block and a right-eye region image block). In this embodiment, the mouth region image block can be input into the first convolutional neural network 301 for feature extraction to obtain the first feature vector output by the first convolutional neural network 301; the right-eye region image block is input into the second convolutional neural network 302 for feature extraction to obtain the second feature vector output by the second convolutional neural network 302; the first feature vector and the second feature vector are input into the feature vector splicer 303 for splicing to obtain the spliced feature vector output by the feature vector splicer 303; and the spliced feature vector is input into the fully connected network classifier 304 for classification to obtain the relationship data between the mouth region image block and the right-eye region image block. For example, the fully connected network classifier 304 can conclude that the right-eye region image block is at the upper right of the mouth region image block.
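A hedged end-to-end sketch of this weight-sharing two-branch structure follows; the layer sizes, the nine-way relation set, and all names are assumptions rather than the patent's exact architecture:

```python
# A minimal sketch of the relative-position prediction model: two branches
# that share one CNN, feature splicing, and a fully connected classifier.
import torch
import torch.nn as nn

class RelativePositionModel(nn.Module):
    def __init__(self, num_relations=9, feat_dim=128):
        super().__init__()
        # A single CNN applied to both patches, so the branches share weights.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Classifier over the spliced (concatenated) feature vectors.
        self.classifier = nn.Linear(2 * feat_dim, num_relations)

    def forward(self, patch_a, patch_b):
        f_a = self.cnn(patch_a)                 # first branch (e.g., mouth patch)
        f_b = self.cnn(patch_b)                 # second branch (e.g., eye patch)
        spliced = torch.cat([f_a, f_b], dim=1)  # feature vector splicer
        return self.classifier(spliced)         # relationship logits
```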
Optionally, an image feature sample may include two facial-organ region image blocks from the same face image and the relationship data between the two facial-organ region image blocks.
A facial-organ region image block can be an image block divided according to facial organs, for example, a nose region image block and a mouth region image block. The relationship data between facial-organ region image blocks can represent the relative positional relationship of the two blocks within the face image. For example, if the nose region image block is in the center and the mouth region image block is in the lower middle, the relationship data may be that the nose region image block is located above the mouth region image block.
By using the facial-organ region image blocks of face images as image feature samples, the feature information that distinguishes the facial organs can be accurately extracted from face images and learned. This helps to accurately identify the individual organs of the face image to be edited, thereby improving the realism of face editing.
生成器的解码器中可以包括卷积神经网络。本步骤可以将预先训练的图像特征检测模型中的卷积神经网络,作为生成器的解码器中的卷积神经网络;或者,将预先训练的图像特征检测模型中的卷积神经网络的参数项,迁移到生成器的解码器中的卷积神经网络。The decoder of the generator may include a convolutional neural network. In this step, the convolutional neural network in the pre-trained image feature detection model can be used as the convolutional neural network in the decoder of the generator; or, the parameter items of the convolutional neural network in the pre-trained image feature detection model , Migrate to the convolutional neural network in the decoder of the generator.
在一些可选实施例中,可以在生成器的解码器既有存在的特征提取网络中额外添加预先训练的图像特征检测模型中的卷积神经网络。例如,可以将卷积神经网络和其他特征提取网络进行共享权重,将卷积神经网络的输出特征向量和其他特征提取层的输出特征向量进行拼接,并将拼接后的特征向量输入到原特征提取层输出特征向量的模块中,例如,全连接网络分类器中等。In some optional embodiments, the convolutional neural network in the pre-trained image feature detection model may be additionally added to the existing feature extraction network of the decoder of the generator. For example, the convolutional neural network and other feature extraction networks can share weights, the output feature vector of the convolutional neural network and the output feature vector of other feature extraction layers can be spliced, and the spliced feature vector can be input to the original feature extraction Layer output feature vector module, for example, fully connected network classifier, etc.
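By way of illustration only, the two reuse variants described above could look as follows in PyTorch, reusing the PatchRelationNet sketch above; the checkpoint path and the generator object with a decoder.cnn attribute are assumptions of this sketch:

import torch

detector = PatchRelationNet()
detector.load_state_dict(torch.load("feature_detector.pt"))

# Variant 1: use the pre-trained CNN directly as the decoder's CNN.
generator.decoder.cnn = detector.cnn

# Variant 2: migrate only the parameter items into an existing decoder
# CNN of identical architecture.
generator.decoder.cnn.load_state_dict(detector.cnn.state_dict())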
S340: Generate an image editing model according to the updated generator.
The updated generator uses a convolutional neural network trained by self-supervised learning; the training of this convolutional neural network can be completed with a small number of samples, which effectively reduces the number of training samples required by the generator and increases the training speed.
In this embodiment of the present invention, the generator in the generative adversarial network is updated with a convolutional neural network pre-trained by self-supervised learning, and the image editing model is constructed based on the updated generator. This effectively extracts the features in the input image of the image editing model, reduces the number of labeled samples required, and reduces the number of training samples of the image editing model, thereby increasing the generation speed of the image editing model and reducing its labeling labor cost.
Embodiment 4
Fig. 4A is a flowchart of a face image editing method in Embodiment 4 of the present invention. This embodiment is applicable to the case where an image editing model is used to edit a face image. The method may be executed by the face image editing apparatus provided by an embodiment of the present invention; the apparatus may be implemented in software and/or hardware and may generally be integrated into a computer device. As shown in Fig. 4A, the method of this embodiment specifically includes:
S410: Acquire a face image to be edited.
A face image is a real image that includes a human face, for example, a photo taken by a user. It should be noted that the face image of a cartoon character is not a real image.
For details not exhaustively described in this embodiment, reference may be made to the foregoing embodiments.
S420: Input the face image to be edited into the image editing model to obtain an edited face image output by the image editing model, wherein the image editing model is generated by the editing model generation method of any of the foregoing embodiments of the present invention.
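By way of illustration only, inference with such a model could look as follows in PyTorch; the checkpoint path, the 256x256 input size, and the [0, 1] output range are assumptions of this sketch:

import torch
from PIL import Image
import torchvision.transforms.functional as TF

model = torch.load("editing_model.pt")
model.eval()

face = TF.to_tensor(Image.open("face.jpg").resize((256, 256))).unsqueeze(0)
with torch.no_grad():
    edited = model(face)  # edited face image tensor
TF.to_pil_image(edited.squeeze(0).clamp(0, 1)).save("face_edited.jpg")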
In this embodiment, the image editing model may be generated by the editing model generation method of any of the foregoing embodiments of the present invention. In other words, the generator in the image editing model, or the decoder in that generator, is derived from the generative adversarial network obtained by the editing model generation method of any of the foregoing embodiments. The generative adversarial network includes a generator and a discriminator, and a Lipschitz constraint is used to determine the gradient update configuration information of the discriminator, so as to slow down the learning rate of each parameter item of the discriminator and keep the rate at which the discriminator learns to distinguish real images as consistent as possible with the rate at which the generator learns to generate real images, thereby ensuring both the accuracy of the generative adversarial network in distinguishing real images and the realism of the generated images.
Specifically, as shown in Fig. 4B, among the three images, the first image on the left is a standard test image commonly used in textbooks and can serve as a real face image. The second image in the middle is a video frame from a dynamic video. The third image on the right is the edited image formed by editing the first image to imitate the mouth-opening action of the middle video frame.
In this embodiment of the present invention, using a Lipschitz constraint to constrain the gradient update configuration information of the discriminator during the training of the generative adversarial network can effectively slow down the learning rate of each parameter item of the discriminator and improve the training consistency between the discriminator and the generator, guaranteeing the realism of the images output by the generator while ensuring the discrimination accuracy of the finally trained discriminator. In this way, when the editing model built from the generator of the finally trained generative adversarial network is used to obtain edited versions of real face images, the realism of the editing effect is effectively guaranteed, further improving the user experience.
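By way of illustration only, one way to realize such a learning-rate cap on the discriminator is sketched below in PyTorch. The cap value, the use of the relative parameter change as the learning-rate measure, and the post-step rescaling are assumptions of this sketch, not the exact procedure of this disclosure:

import torch

def capped_discriminator_step(discriminator, loss, optimizer, max_rate=0.01):
    """Apply one optimizer update, then rescale each parameter's change so
    its relative change stays at or below the maximum learning-rate cap."""
    before = {n: p.detach().clone() for n, p in discriminator.named_parameters()}
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        for name, param in discriminator.named_parameters():
            delta = param - before[name]
            rate = delta.norm() / (before[name].norm() + 1e-12)
            if rate > max_rate:
                # Shrink the step so the effective update rate hits the cap,
                # which slows the discriminator relative to the generator.
                param.copy_(before[name] + delta * (max_rate / rate))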
Embodiment 5
Fig. 5 is a schematic diagram of an editing model generation apparatus in Embodiment 5 of the present invention. Embodiment 5 provides the apparatus that implements the editing model generation method of the foregoing embodiments of the present invention; the apparatus may be implemented in software and/or hardware and may generally be integrated into a computer device.
As shown in Fig. 5, the apparatus of this embodiment may include:
a network training module 510, configured to perform iterative training on a generative adversarial network, the generative adversarial network including a generator and a discriminator;
a network update module 520, configured to, in the iterative training, update the generative adversarial network according to gradient update configuration information of the discriminator, the gradient update configuration information being determined by a Lipschitz constraint; and
a model generation module 530, configured to generate an image editing model according to the generator in the trained generative adversarial network when it is determined that the generative adversarial network satisfies a training end condition.
In this embodiment of the present invention, real images and/or noise images are input as samples into the generative adversarial network to iteratively train the generative adversarial network including the generator and the discriminator, and the learning rate of the parameter items of the discriminator is limited according to a Lipschitz constraint to improve the learning consistency between the discriminator and the generator. This ensures the discriminator's accuracy in distinguishing real from fake images while guaranteeing the realism of the images output by the generator, so that the generator can be effectively applied in an image editing model structure for real images, improving the realism of face image editing based on that structure.
Further, the model generation module 530 includes a loss function calculation unit, configured to: obtain the loss function of the generative adversarial network by adding a Euclidean distance norm as a constraint to the initial loss function according to loss function configuration information, the elements included in the Euclidean distance norm being the parameter items of the encoder in the generator; and, when it is determined that the loss function satisfies a convergence condition, determine that the generative adversarial network satisfies the training end condition and generate the image editing model according to the generator in the trained generative adversarial network.
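By way of illustration only, adding the encoder-parameter norm to the loss could look as follows in PyTorch; the penalty coefficient value is an assumption of this sketch:

import torch

def loss_with_encoder_norm(initial_loss, encoder, penalty=1e-4):
    """Add the Euclidean norm over the generator encoder's parameter items
    to the initial loss of the generative adversarial network."""
    theta_g = torch.cat([p.flatten() for p in encoder.parameters()])
    return initial_loss + penalty * theta_g.norm()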
Further, the network update module 520 includes a discriminator parameter item update unit, configured to: determine, according to the gradient update configuration information of the discriminator, a maximum parameter learning rate threshold corresponding to each of one or more feature extraction layers included in the discriminator; and, for each feature extraction layer in the discriminator, update the parameter items of that feature extraction layer according to its maximum parameter learning rate threshold, so that the update rate of the parameter items associated with that feature extraction layer matches the maximum parameter learning rate threshold corresponding to that layer.
Further, the model generation module 530 includes a self-supervised generation unit, configured to: update the generator in the trained generative adversarial network, specifically the decoder in the generator, based on the convolutional neural network in a pre-trained image feature detection model; and generate the image editing model according to the updated generator. The image feature detection model is obtained by training on image feature samples, each of which includes two regional image blocks in the same image and the relationship data between the two regional image blocks. The image feature detection model may include two weight-sharing convolutional neural networks, a feature vector splicer, and a fully connected network classifier. The convolutional neural networks are used to extract the feature information of the regional image blocks and form feature vectors; the feature vector splicer is used to combine the feature vectors generated by the convolutional neural networks into a target feature vector; and the fully connected network classifier is used to classify the target feature vector and output the relationship data between the regional image blocks.
Further, an image feature sample includes two facial organ region image blocks from the same face image and the relationship data between the two facial organ region image blocks.
Further, the network training module 510 includes a training unit, configured to input samples including real images and/or noise images into the generative adversarial network and perform a round of training on the generative adversarial network.
The above editing model generation apparatus can execute the editing model generation method provided by any embodiment of the present invention and achieve the same beneficial effects.
Embodiment 6
Fig. 6 is a schematic diagram of a face image editing apparatus in Embodiment 6 of the present invention. Embodiment 6 provides the apparatus that implements the face image editing method of the foregoing embodiments of the present invention; the apparatus may be implemented in software and/or hardware and may generally be integrated into a computer device.
As shown in Fig. 6, the apparatus of this embodiment may include:
a face image acquisition module 610, configured to acquire a face image to be edited; and
a face image editing module 620, configured to input the face image to be edited into an image editing model to obtain an edited face image output by the image editing model, wherein the image editing model is generated by the editing model generation method of any of the foregoing embodiments of the present invention.
In this embodiment of the present invention, the gradient update configuration information of the discriminator in the generative adversarial network is determined according to a Lipschitz constraint, and the learning rate of the parameter items of the discriminator is constrained based on that gradient update configuration information. This improves the training consistency between the discriminator and the generator in the generative adversarial network, thereby ensuring the realism of the image editing effect of the editing model built from the generator of the finally trained generative adversarial network and effectively improving the user experience.
The above face image editing apparatus can execute the face image editing method provided by any embodiment of the present invention and achieve the same beneficial effects.
Embodiment 7
Fig. 7 is a schematic structural diagram of a computer device provided by Embodiment 7 of the present invention. Fig. 7 shows a block diagram of an exemplary computer device 12 suitable for implementing embodiments of the present invention. The computer device 12 shown in Fig. 7 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
As shown in Fig. 7, the computer device 12 takes the form of a general-purpose computing device. The components of the computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting the different system components (including the system memory 28 and the processing unit 16).
The bus 18 represents one or more of several types of bus structures, including a memory bus, a peripheral bus, or any of a variety of bus structures. For example, these bus structures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
The computer device 12 typically includes a variety of computer-system-readable media. These media may be any available media that can be accessed by the computer device 12, including volatile and non-volatile media and removable and non-removable media.
The system memory 28 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 30 and/or a cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 34 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in Fig. 7 and commonly called a "hard drive"). Although not shown in Fig. 7, a disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disc drive for reading from and writing to a removable non-volatile optical disc (e.g., a Compact Disc Read-Only Memory (CD-ROM), a Digital Video Disc Read-Only Memory (DVD-ROM), or other optical media), may be provided. In these cases, each drive may be connected to the bus 18 through one or more data media interfaces. The system memory 28 may store at least one program product having a set (e.g., at least one) of program modules configured to perform the functions of the embodiments of the present invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in the system memory 28. The program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described in the present invention.
The computer device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the computer device 12, and/or with any device (e.g., a network card, a modem, etc.) that enables the computer device 12 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 22. In addition, the computer device 12 may also communicate with one or more networks (e.g., a local area network (LAN) or a wide area network (WAN)) through a network adapter 20. As shown in the figure, the network adapter 20 communicates with the other modules of the computer device 12 through the bus 18. It should be understood that, although not shown in Fig. 7, other hardware and/or software modules may be used in conjunction with the computer device 12, including but not limited to microcode, device drivers, redundant processing units, external disk drive arrays, Redundant Arrays of Inexpensive Disks (RAID) systems, tape drives, and data backup storage systems.
The processing unit 16 executes various functional applications and data processing by running the program modules 42 stored in the system memory 28, for example, implementing an editing model generation method and/or a face image editing method provided by any embodiment of the present invention.
Embodiment 8
Embodiment 8 of the present invention provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the program implements the editing model generation method provided by any embodiment of the present application, or implements the face image editing method provided by any embodiment of the present application.
The computer storage medium of the embodiments of the present invention may adopt any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer disk, a hard disk, a RAM, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a portable CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; such a medium may send, propagate, or transmit the program for use by or in combination with an instruction execution system, apparatus, or device.
The program code contained on a computer-readable medium may be transmitted by any suitable medium, including but not limited to wireless, wireline, optical cable, radio frequency, etc., or any suitable combination of the above.
Computer program code for carrying out the operations of the present invention may be written in one or more programming languages or a combination thereof. The programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a LAN or a WAN, or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments herein, and that various obvious changes, readjustments, and substitutions can be made by those skilled in the art without departing from the protection scope of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, the present invention is not limited to the above embodiments and may include more other equivalent embodiments without departing from the concept of the present invention; the scope of the present invention is determined by the scope of the appended claims.

Claims (18)

  1. An editing model generation method, comprising:
    performing iterative training on a generative adversarial network, the generative adversarial network comprising a generator and a discriminator;
    in the iterative training, updating the generative adversarial network according to gradient update configuration information of the discriminator, wherein the gradient update configuration information is determined by a Lipschitz constraint; and
    when it is determined that the generative adversarial network satisfies a training end condition, generating an image editing model according to the generator in the trained generative adversarial network.
  2. The method according to claim 1, wherein updating the generative adversarial network according to the gradient update configuration information of the discriminator comprises:
    determining, according to the gradient update configuration information, a maximum parameter learning rate threshold corresponding to each of one or more feature extraction layers included in the discriminator; and
    updating parameter items of the discriminator according to the maximum parameter learning rate threshold of each feature extraction layer.
  3. The method according to claim 2, wherein updating the parameter items of the discriminator according to the maximum parameter learning rate threshold of each feature extraction layer comprises:
    acquiring the value of a parameter item associated with the feature extraction layer upon entering the current round of training, as a pre-update value;
    acquiring the value of the parameter item associated with the feature extraction layer calculated in the current round of training, as a candidate update value;
    determining a learning rate of the parameter item according to the candidate update value and the pre-update value;
    in a case where the learning rate is less than or equal to the maximum parameter learning rate threshold, updating the parameter item associated with the feature extraction layer with the candidate update value; and
    in a case where the learning rate is greater than the maximum parameter learning rate threshold, updating the parameter item associated with the feature extraction layer with a target value, the target value being obtained according to the maximum parameter learning rate threshold.
  4. The method according to claim 3, wherein the target value is calculated according to the following formula:
    $\theta_1 = \theta_0 - \alpha \cdot \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
    where $\theta_1$ denotes the target value, $\theta_0$ denotes the pre-update value, $\alpha$ denotes the maximum parameter learning rate threshold, and $J(\theta_0, \theta_1)$ is the fitting function.
  5. The method according to any one of claims 1 to 4, wherein the training end condition comprises:
    a loss function converging to a set value,
    wherein the loss function is obtained by adding a Euclidean distance norm to an initial loss function according to loss function configuration information, and the elements included in the Euclidean distance norm are parameter items of an encoder in the generator.
  6. The method according to claim 5, wherein the Euclidean distance norm is calculated as follows:
    $\lambda \, \|\theta_g\|_F$
    where $\lambda$ denotes a penalty coefficient, $\|\cdot\|_F$ denotes a norm operation, and $\theta_g$ denotes the parameter items of the encoder in the generator.
  7. The method according to claim 5, wherein the initial loss function comprises:
    a loss function of the discriminator and a loss function of the generator, wherein the training objective for the discriminator is to maximize the loss function of the discriminator, and the training objective for the generator is to minimize the loss function of the generator.
  8. The method according to any one of claims 1 to 7, wherein generating the image editing model according to the generator in the trained generative adversarial network comprises:
    updating the generator according to a convolutional neural network in a pre-trained image feature detection model; and
    generating the image editing model according to the updated generator,
    wherein the image feature detection model is obtained by training on image feature samples, and an image feature sample comprises two regional image blocks in a same image and relationship data between the two regional image blocks.
  9. The method according to claim 8, wherein the image feature detection model comprises:
    two weight-sharing convolutional neural networks, configured to respectively extract feature information of the two regional image blocks and form feature vectors;
    a feature vector splicer, configured to combine the feature vectors generated by the convolutional neural networks into a target feature vector; and
    a fully connected network classifier, configured to classify the target feature vector and output the relationship data between the two regional image blocks.
  10. The method according to claim 8 or 9, wherein updating the generator according to the convolutional neural network in the pre-trained image feature detection model comprises at least one of the following:
    using the convolutional neural network in the image feature detection model as a convolutional neural network in a decoder of the generator;
    migrating parameter items of the convolutional neural network in the image feature detection model to the convolutional neural network in the decoder of the generator; and
    adding the convolutional neural network in the image feature detection model to a feature extraction network of the decoder of the generator.
  11. The method according to any one of claims 8 to 10, wherein the image feature sample comprises:
    two facial organ region image blocks in a same face image, and relationship data between the two facial organ region image blocks.
  12. The method according to any one of claims 8 to 11, wherein the relationship data represents any one or more of the following relationships between the two regional image blocks: a positional relationship, a size relationship, a shape relationship, and a color relationship.
  13. The method according to any one of claims 1 to 12, wherein performing iterative training on the generative adversarial network comprises:
    inputting samples into the generative adversarial network to train the generative adversarial network, wherein the samples comprise at least one of a real image and a noise image.
  14. A face image editing method, comprising:
    acquiring a face image to be edited; and
    inputting the face image to be edited into an image editing model to obtain an edited face image output by the image editing model,
    wherein the image editing model is generated by the editing model generation method according to any one of claims 1 to 13.
  15. An editing model generation apparatus, comprising:
    a network training module, configured to perform iterative training on a generative adversarial network, the generative adversarial network comprising a generator and a discriminator;
    a network update module, configured to, in the iterative training, update the generative adversarial network according to gradient update configuration information of the discriminator, the gradient update configuration information being determined by a Lipschitz constraint; and
    a model generation module, configured to generate an image editing model according to the generator in the trained generative adversarial network when it is determined that the generative adversarial network satisfies a training end condition.
  16. A face image editing apparatus, comprising:
    a face image acquisition module, configured to acquire a face image to be edited; and
    a face image editing module, configured to input the face image to be edited into an image editing model to obtain an edited face image output by the image editing model, wherein the image editing model is generated by the editing model generation method according to any one of claims 1 to 13.
  17. A computer device, comprising:
    a memory;
    a processor; and
    a computer program stored on the memory and executable on the processor,
    wherein, when executing the program, the processor implements the editing model generation method according to any one of claims 1 to 13, or the face image editing method according to claim 14.
  18. A computer-readable storage medium on which a computer program is stored,
    wherein, when executed by a processor, the program implements the editing model generation method according to any one of claims 1 to 13, or the face image editing method according to claim 14.
PCT/CN2021/101007 2020-06-19 2021-06-18 Editing model generation method and apparatus, face image editing method and apparatus, device, and medium WO2021254499A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010568177.7 2020-06-19
CN202010568177.7A CN111754596B (en) 2020-06-19 2020-06-19 Editing model generation method, device, equipment and medium for editing face image

Publications (1)

Publication Number Publication Date
WO2021254499A1 true WO2021254499A1 (en) 2021-12-23

Family

ID=72675543

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/101007 WO2021254499A1 (en) 2020-06-19 2021-06-18 Editing model generation method and apparatus, face image editing method and apparatus, device, and medium

Country Status (2)

Country Link
CN (1) CN111754596B (en)
WO (1) WO2021254499A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111754596B (en) * 2020-06-19 2023-09-19 北京灵汐科技有限公司 Editing model generation method, device, equipment and medium for editing face image
CN112232281B (en) * 2020-11-04 2024-06-11 深圳大学 Face attribute editing method and device, intelligent terminal and storage medium
CN112651915B (en) * 2020-12-25 2023-08-29 百果园技术(新加坡)有限公司 Face image synthesis method, system, electronic equipment and storage medium
CN112668529A (en) * 2020-12-31 2021-04-16 神思电子技术股份有限公司 Dish sample image enhancement identification method
CN112819689B (en) * 2021-02-02 2024-08-27 百果园技术(新加坡)有限公司 Training method of human face attribute editing model, human face attribute editing method and human face attribute editing equipment
CN113158977B (en) * 2021-05-12 2022-07-29 河南师范大学 Image character editing method for improving FANnet generation network

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108537152B (en) * 2018-03-27 2022-01-25 百度在线网络技术(北京)有限公司 Method and apparatus for detecting living body
CN109308450A (en) * 2018-08-08 2019-02-05 杰创智能科技股份有限公司 A kind of face's variation prediction method based on generation confrontation network
CN110457994B (en) * 2019-06-26 2024-05-10 平安科技(深圳)有限公司 Face image generation method and device, storage medium and computer equipment
CN110659582A (en) * 2019-08-29 2020-01-07 深圳云天励飞技术有限公司 Image conversion model training method, heterogeneous face recognition method, device and equipment
CN110889370B (en) * 2019-11-26 2023-10-24 上海大学 System and method for synthesizing face by end-to-end side face based on condition generation countermeasure network
CN111275784B (en) * 2020-01-20 2023-06-13 北京百度网讯科技有限公司 Method and device for generating image
CN111275613A (en) * 2020-02-27 2020-06-12 辽宁工程技术大学 Editing method for generating confrontation network face attribute by introducing attention mechanism

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170316281A1 (en) * 2016-04-28 2017-11-02 Microsoft Technology Licensing, Llc Neural network image classifier
CN107943784A (en) * 2017-11-02 2018-04-20 南华大学 Relation extraction method based on generation confrontation network
CN108564119A (en) * 2018-04-04 2018-09-21 华中科技大学 A kind of any attitude pedestrian Picture Generation Method
CN110197514A (en) * 2019-06-13 2019-09-03 南京农业大学 A kind of mushroom phenotype image generating method based on production confrontation network
CN110689480A (en) * 2019-09-27 2020-01-14 腾讯科技(深圳)有限公司 Image transformation method and device
CN111754596A (en) * 2020-06-19 2020-10-09 北京灵汐科技有限公司 Editing model generation method, editing model generation device, editing method, editing device, editing equipment and editing medium

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114359034A (en) * 2021-12-24 2022-04-15 北京航空航天大学 Method and system for generating face picture based on hand drawing
CN114359034B (en) * 2021-12-24 2023-08-08 北京航空航天大学 Face picture generation method and system based on hand drawing
CN114359667B (en) * 2021-12-30 2024-01-30 西安交通大学 Intensity coherent identification method and equipment based on generation type countermeasure network
CN114359667A (en) * 2021-12-30 2022-04-15 西安交通大学 Strength coherent identification method and equipment based on generating type countermeasure network
CN114492176A (en) * 2022-01-12 2022-05-13 北京仿真中心 Dynamic model parameter identification method and system based on generation countermeasure network
CN114549287B (en) * 2022-01-27 2024-03-01 西北大学 Method and system for constructing arbitrary attribute editing model of human face
CN114549287A (en) * 2022-01-27 2022-05-27 西北大学 Method and system for constructing human face arbitrary attribute editing model
WO2023143126A1 (en) * 2022-01-30 2023-08-03 北京字跳网络技术有限公司 Image processing method and apparatus, electronic device, and storage medium
CN114663539B (en) * 2022-03-09 2023-03-14 东南大学 2D face restoration technology under mask based on audio drive
CN114663539A (en) * 2022-03-09 2022-06-24 东南大学 2D face restoration technology under mask based on audio drive
CN114724214B (en) * 2022-03-31 2024-05-14 华南理工大学 Micro-expression editing method and system based on facial action unit
CN114724214A (en) * 2022-03-31 2022-07-08 华南理工大学 Micro-expression editing method and system based on face action unit
WO2024108472A1 (en) * 2022-11-24 2024-05-30 北京京东方技术开发有限公司 Model training method and apparatus, text image processing method, device, and medium
CN116415687B (en) * 2022-12-29 2023-11-21 江苏东蓝信息技术有限公司 Artificial intelligent network optimization training system and method based on deep learning
CN116415687A (en) * 2022-12-29 2023-07-11 江苏东蓝信息技术有限公司 Artificial intelligent network optimization training system and method based on deep learning
CN116187294B (en) * 2023-04-24 2023-07-07 开元华创科技(集团)有限公司 Method and system for rapidly generating electronic file of informationized detection laboratory
CN116187294A (en) * 2023-04-24 2023-05-30 开元华创科技(集团)有限公司 Method and system for rapidly generating electronic file of informationized detection laboratory
CN117853638A (en) * 2024-03-07 2024-04-09 厦门大学 End-to-end 3D face rapid generation and editing method based on text driving

Also Published As

Publication number Publication date
CN111754596B (en) 2023-09-19
CN111754596A (en) 2020-10-09

Similar Documents

Publication Publication Date Title
WO2021254499A1 (en) Editing model generation method and apparatus, face image editing method and apparatus, device, and medium
US11481869B2 (en) Cross-domain image translation
US11508169B2 (en) System and method for synthetic image generation with localized editing
CN110785767B (en) Compact linguistics-free facial expression embedding and novel triple training scheme
WO2023082882A1 (en) Pose estimation-based pedestrian fall action recognition method and device
WO2020216033A1 (en) Data processing method and device for facial image generation, and medium
CN112085041B (en) Training method and training device of neural network and electronic equipment
CN114240735B (en) Arbitrary style migration method, system, storage medium, computer equipment and terminal
CN111091010A (en) Similarity determination method, similarity determination device, network training device, network searching device and storage medium
KR20210147507A (en) Image generation system and image generation method using the system
CN113177572A (en) Method and computer readable medium for automatic learning from sensors
Liu et al. Learning explicit shape and motion evolution maps for skeleton-based human action recognition
CN112801107A (en) Image segmentation method and electronic equipment
Chen et al. A unified framework for generative data augmentation: A comprehensive survey
AU2023204419A1 (en) Multidimentional image editing from an input image
CN115019053A (en) Dynamic graph semantic feature extraction method for point cloud classification and segmentation
US20240290022A1 (en) Automatic avatar generation using semi-supervised machine learning
WO2023240583A1 (en) Cross-media corresponding knowledge generating method and apparatus
KR20210063171A (en) Device and method for image translation
CN114781642B (en) Cross-media corresponding knowledge generation method and device
CN118071867B (en) Method and device for converting text data into image data
WO2023178801A1 (en) Image description method and apparatus, computer device, and storage medium
Chen et al. Unsupervised Learning: Deep Generative Model
CN116823991A (en) Image processing method and device
Cai Monocular visual scene analysis: saliency detection and 3D face reconstruction using GAN.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21825129

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 31/03/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21825129

Country of ref document: EP

Kind code of ref document: A1