CN114332119A - Method and related device for generating face attribute change image

Method and related device for generating face attribute change image

Info

Publication number
CN114332119A
CN114332119A (application CN202111597494.2A)
Authority
CN
China
Prior art keywords
image
face
latent code
feature
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111597494.2A
Other languages
Chinese (zh)
Inventor
陈仿雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Original Assignee
Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Priority to CN202111597494.2A
Publication of CN114332119A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The embodiments of the application relate to the technical field of facial image attribute editing, and disclose a method and a related device for generating a face attribute change image. In the first stage, a feature latent code generation network is trained using a first face image and a preliminarily trained generative adversarial network to obtain a feature latent code generation model, which outputs a first feature latent code. The first feature latent code can be edited and modified according to the desired change of face attributes to obtain a second feature latent code. In the second stage, the preliminarily trained generative adversarial network is trained using the first feature latent code and the first face image to obtain an image generation model. The second feature latent code is input into the image generation model to obtain a face attribute change image. By performing this secondary fine-tuning training on the preliminarily trained generative adversarial network, an image with the corresponding attribute change can be generated without distortion.

Description

Method and related device for generating face attribute change image
Technical Field
The embodiment of the application relates to the technical field of facial image attribute editing, in particular to a method for generating a facial attribute change image and a related device.
Background
With the rise of photography and short videos, many users have higher requirements for the quality of face photographs and hope that face attributes can be controlled and edited in a personalized manner, i.e., modified and adjusted according to their own wishes during shooting, to achieve rich image effects.
At present, electronic products often provide several preset modes for users to select from to edit face attributes. However, such editing is not only limited in variety and unable to provide personalized editing, but the edited images are also prone to distortion.
Disclosure of Invention
The embodiments of the application mainly solve the technical problem of providing a method and a related device for generating a face attribute change image, which can provide rich face attribute editability while ensuring that the image is not distorted.
In order to solve the above technical problem, in a first aspect, an embodiment of the present application provides a method for generating a face attribute change image, including:
acquiring a first face image;
training a feature latent code generation network using the first face image and a preliminarily trained generative adversarial network to obtain a feature latent code generation model, the feature latent code generation model outputting a first feature latent code;
training the preliminarily trained generative adversarial network using the first feature latent code and the first face image to obtain an image generation model;
acquiring a second feature latent code, wherein the second feature latent code is obtained by editing the first feature latent code; and
inputting the second feature latent code into the image generation model to obtain a face attribute change image.
In some embodiments, the method further comprises:
constructing an image pyramid of the first face image;
the training of the feature latent code generation network using the first face image and the preliminarily trained generative adversarial network to obtain a feature latent code generation model includes:
training the feature latent code generation network using the image pyramid and the preliminarily trained generative adversarial network to obtain the feature latent code generation model.
In some embodiments, the training of the feature latent code generation network using the image pyramid and the preliminarily trained generative adversarial network to obtain the feature latent code generation model includes:
cyclically traversing the image layers in the image pyramid and inputting each image layer into the feature latent code generation network to obtain an intermediate feature latent code;
inputting the intermediate feature latent code into the preliminarily trained generative adversarial network to obtain an intermediate face image;
calculating a first loss between the intermediate face image and the first face image using a first loss function; and
iteratively adjusting parameters of the feature latent code generation network according to the first loss until the feature latent code generation network converges, to obtain the feature latent code generation model.
In some embodiments, the first loss function includes a first image perceptual loss and a structural similarity loss, wherein the structural similarity loss is used to measure the structural similarity between the intermediate face image and the image layer.
In some embodiments, the structural similarity loss is the product of a luminance similarity value, a contrast similarity value, and a structure similarity value.
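The luminance-contrast-structure product described above follows the standard SSIM decomposition. A minimal NumPy sketch, using global image statistics rather than the usual sliding window (an illustrative simplification, not the patent's exact loss; the stabilizing constants c1, c2 are assumptions):

```python
import numpy as np

def ssim_global(x, y, c1=1e-4, c2=9e-4):
    """Product of luminance, contrast, and structure similarity terms,
    computed over whole images (standard SSIM uses local windows)."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    c3 = c2 / 2
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * np.sqrt(var_x * var_y) + c2) / (var_x + var_y + c2)
    structure = (cov_xy + c3) / (np.sqrt(var_x * var_y) + c3)
    return luminance * contrast * structure

# As a training loss, one would typically minimize 1 - SSIM.
```

Identical images yield a value of 1, and any mismatch in brightness, contrast, or structure pulls the product below 1.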
In some embodiments, the training of the preliminarily trained generative adversarial network using the first feature latent code and the first face image to obtain an image generation model includes:
inputting the first feature latent code into the preliminarily trained generative adversarial network to obtain a restored image;
calculating a second loss between the first face image and the restored image using a second loss function; and
iteratively adjusting parameters of the preliminarily trained generative adversarial network according to the second loss until it converges, to obtain the image generation model.
In some embodiments, the second loss function includes a second image perceptual loss and a regularization term loss.
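The shape of such a composite loss can be sketched as follows. The feature extractor producing the perceptual features, the squared-norm regularizer, and the weight `reg_weight` are all assumptions for illustration; the patent does not specify them:

```python
import numpy as np

def second_loss(real_feats, restored_feats, latent, reg_weight=0.01):
    """Illustrative composite loss: a feature-space (perceptual) distance
    between the first face image and the restored image, plus a
    regularization term on the feature latent code."""
    perceptual = np.mean((real_feats - restored_feats) ** 2)
    regularization = reg_weight * np.mean(latent ** 2)
    return perceptual + regularization
```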
In some embodiments, the method further comprises:
acquiring an image comprising a human face;
acquiring key points of the face in the image using a face key point algorithm; and
adjusting the face in the image to a frontal face according to the key points of the face, and cropping the effective face area to obtain the first face image.
In order to solve the above technical problem, in a second aspect, an electronic device is provided in an embodiment of the present application, and includes a memory and one or more processors, where the one or more processors are configured to execute one or more computer programs stored in the memory, and when the one or more processors execute the one or more computer programs, the electronic device is enabled to implement the method according to the first aspect.
In order to solve the above technical problem, in a third aspect, the present application provides a computer-readable storage medium storing a computer program, the computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method according to the first aspect.
The beneficial effects of the embodiments of the present application are as follows. Different from the prior art, in the method for generating a face attribute change image provided by the embodiments of the present application, in the first stage, the first face image and the preliminarily trained generative adversarial network are used to train the feature latent code generation network, obtaining the feature latent code generation model. The feature latent code generation model outputs a first feature latent code that can restore the first face image to a high degree, which amounts to providing an accurate editable vector for subsequent face attribute editing. Specifically, the first feature latent code can be edited and modified according to the desired change of face attributes, so as to obtain a second feature latent code.
In the second stage, the preliminarily trained generative adversarial network is trained using the first feature latent code and the first face image to obtain an image generation model. The second feature latent code is input into the image generation model to obtain a face attribute change image. This secondary fine-tuning of the preliminarily trained generative adversarial network maps the first face image, which lies outside the domain, into the latent space within the domain, so that the attribute-edited face attribute change image can both change the face attributes according to the attribute change parameters carried in the second feature latent code and restore the detail information of the first face image; that is, an image with the corresponding attribute change can be generated while ensuring no distortion.
Drawings
One or more embodiments are illustrated by way of example in the accompanying drawings; elements with the same reference numerals in the figures denote similar elements, and the figures are not drawn to scale unless otherwise specified.
FIG. 1 is a schematic diagram of a portion of a generative adversarial network in some embodiments of the present application;
FIG. 2 is an application scenario for editing face attributes in some embodiments of the present application;
FIG. 3 is a schematic flow chart of a method for generating an image with changed facial attributes according to some embodiments of the present application;
FIG. 4 is an overall process of generating a face property change image according to some embodiments of the present application;
fig. 5 is a schematic structural diagram of an electronic device in some embodiments of the present application.
Detailed Description
The present application will be described in detail below with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present application, but are not intended to limit it in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the application; all such changes and modifications fall within the scope of protection of the present application.
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It should be noted that, provided they do not conflict, the various features of the embodiments of the present application may be combined with each other within the scope of protection of the present application. Additionally, although functional module divisions are made in the device schematics and logical sequences are shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the module divisions in the devices or the sequences in the flowcharts. Further, the terms "first," "second," "third," and the like used herein do not limit the data or the execution order, but merely distinguish identical or similar items having substantially the same functions and effects.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
In addition, the technical features mentioned in the embodiments of the present application described below may be combined with each other as long as they do not conflict with each other.
For ease of understanding the technical solution of the present application, the principles of the generative adversarial network and the feature latent code involved in the present application are described first.
Referring to fig. 1, fig. 1 is a schematic partial structural diagram of a generative adversarial network provided in an embodiment of the present application. As shown in fig. 1, the generative adversarial network includes a mapping network S1 and an image generator S2. The mapping network S1 is configured to perform feature decoupling on the composite features contained in the feature latent code, so as to map the feature latent code into multiple sets of feature control vectors for the image generator, and to input these mapped control vectors into the image generator S2 for style control (i.e., face attribute control) of the image generator. The image generator S2 is configured to perform style control and processing on a constant tensor based on the control vectors input from the mapping network S1, thereby generating an image. When the feature latent code reflects face features, the image generated by the image generator S2 is a face image.
As shown in fig. 1, the mapping network S1 includes 8 fully-connected layers, connected in sequence, which perform a nonlinear mapping on the feature latent code to obtain an intermediate vector w. The intermediate vector w can reflect various facial features, such as eye features, mouth features, and nose features.
The image generator S2 includes N sequentially arranged generation networks. The first generation network includes a constant tensor const and a convolution layer, each followed by an adaptive instance normalization layer. It can be understood that the constant tensor const corresponds to the initial data from which the image is generated. Each of the remaining generation networks includes two convolution layers, each followed by an adaptive instance normalization layer. In the image generator S2, each generation network outputs a feature map that serves as the input of the next generation network; as the generation networks recur, the size of the output feature map grows, and the feature map output by the last generation network is the generated face image. It can be understood that the target size of each generation network's output feature map is preset: for example, in fig. 1, the 1st generation network outputs a 4 × 4 feature map and the 2nd outputs an 8 × 8 feature map, and if the size of the final image is 1024 × 1024, the last generation network outputs a 1024 × 1024 feature map.
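The doubling schedule above fixes the number of generation networks once the base and final sizes are chosen; a quick sketch (the 4 × 4 base size is taken from the figure description):

```python
def feature_map_sizes(final_size, base_size=4):
    """Sizes output by successive generation networks, doubling from
    base_size up to final_size, as in the progressive structure above."""
    sizes = [base_size]
    while sizes[-1] < final_size:
        sizes.append(sizes[-1] * 2)
    return sizes

print(feature_map_sizes(1024))  # [4, 8, 16, 32, 64, 128, 256, 512, 1024]
```

For a 1024 × 1024 output this gives N = 9 generation networks.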
Within a generation network, convolution layers and adaptive instance normalization layers are interleaved, the output of each layer being the input of the next. As shown in fig. 1, the convolution layer uses 3 × 3 convolution kernels to perform a deconvolution operation on its input and outputs a feature map of increased size. The feature map output by the convolution layer is then input into the instance normalization layer, together with the intermediate vector w and random noise B. The processing of the instance normalization layer (i.e., AdaIN) is shown in fig. 1: the intermediate vector w is expanded, through a learnable affine transformation (i.e., a fully connected layer), into a scaling factor y(s, i) and a bias factor y(b, i); these are combined, by weighted summation, with the normalized feature map output by the preceding convolution layer and with the random noise B, so that the intermediate vector w influences the style of the feature map output by the convolution layer, i.e., style control is achieved. In addition, the random noise B is used to enrich the details of the feature map.
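The AdaIN step above can be sketched in NumPy. The per-channel normalization followed by the learned scale and bias is standard AdaIN; the exact point at which noise B enters is an assumption (here it is simply added before normalization), not necessarily the patent's arrangement:

```python
import numpy as np

def adain(feature_map, y_scale, y_bias, noise=None, eps=1e-5):
    """Adaptive instance normalization: normalize each channel of the
    feature map, then scale and shift with factors derived (via a learned
    affine transform) from the intermediate vector w.
    feature_map: (C, H, W); y_scale, y_bias: (C,)."""
    if noise is not None:
        feature_map = feature_map + noise  # random noise B enriches details
    mu = feature_map.mean(axis=(1, 2), keepdims=True)
    sigma = feature_map.std(axis=(1, 2), keepdims=True)
    normalized = (feature_map - mu) / (sigma + eps)
    return y_scale[:, None, None] * normalized + y_bias[:, None, None]
```

With unit scale and zero bias the output has zero mean per channel; the scale and bias shift those statistics, which is exactly how w steers the style of the feature map.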
The feature latent code is a feature vector, which may also be called a feature map, and may be multidimensional, for example 18 × 512, with each value in the range [−1, 1]. It can be understood that by inputting the feature latent code into the generative adversarial network, an image corresponding to the feature latent code can be generated. The feature latent code can also be understood as the features of an image extracted from it by a neural network: the feature latent code can represent the image, and once the feature latent code is determined, the image generated from it is also determined. From another perspective, the feature latent code can be understood as the vector output after the image passes through the convolution layers of a neural network.
The image generation process described above is one of many application scenarios of the generative adversarial network, in which a trained generative adversarial network is used to generate images. During training, a large number of training images are used to train the generative adversarial network, yielding the trained generative adversarial network.
In the training process, the network learns all the training images and maps them into a domain, which can be understood as a space. When the trained generative adversarial network is then used to generate an image, if the image to be generated has been learned by the network in advance, i.e., lies within the above domain or space, the generated image will not differ too much from the real image represented by the feature latent code.
Based on the image generation capability of a pre-trained generative adversarial network, a large number of face attribute editing functions have appeared in smart devices, allowing face attributes to be controlled and edited in a personalized manner to meet users' image optimization needs. Fig. 2 shows an example of editing face attributes. As shown in fig. 2, when a smartphone captures an original image A and the user wants to edit the "eye size and hair length" of the original image A, an application (app) on the smartphone can implement this function: the application receives an instruction reflecting "eye size and hair length" and edits the attributes of the original image A accordingly, obtaining an image B with changed face attributes. It can be understood that the instruction reflecting the user's editing requirement may take the form of voice, text, or parameters entered via virtual buttons; the interaction mode of the instruction is not limited here.
In order to successfully edit the original image, the original image must first be projected (or inverted) into the domain of the pre-trained generative adversarial network, i.e., projected as a feature latent code (vector). The question is therefore how to convert the original image into a feature latent code z and project it into the pre-trained domain such that the feature latent code z can generate an image identical to the original image; that is, the feature latent code z can accurately restore the original image, providing the basis for subsequently modifying the feature latent code z to achieve face attribute changes.
However, the domain or space of the generative adversarial network imposes a trade-off between distortion and editability. In practical applications, ensuring editability often causes the original image to be distorted, while ensuring that the original image is not distorted makes the attributes of the face image uneditable.
In view of this, the technical solution of the present application provides a method for generating a face attribute change image. In the first stage, the first face image and the preliminarily trained generative adversarial network are used to train the feature latent code generation network, obtaining the feature latent code generation model. The feature latent code generation model outputs a first feature latent code that can restore the first face image to a high degree, which amounts to providing an accurate editable vector for subsequent face attribute editing. Specifically, according to the desired change of face attributes, the corresponding attribute parameters in the first feature latent code can be edited and modified to obtain a second feature latent code.
In the second stage, the preliminarily trained generative adversarial network is trained using the first feature latent code and the first face image to obtain an image generation model. The second feature latent code is input into the image generation model to obtain a face attribute change image. This secondary fine-tuning of the preliminarily trained generative adversarial network maps the first face image, which lies outside the domain, into the latent space within the domain, so that the attribute-edited face attribute change image can both change the face attributes according to the attribute change parameters carried in the second feature latent code and restore the detail information of the first face image; that is, an image with the corresponding attribute change can be generated while ensuring no distortion.
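The two-stage procedure above can be sketched end to end as follows. This is a schematic skeleton showing only the order of operations; every callable passed in is a hypothetical placeholder, not part of the patent:

```python
def generate_attribute_changed_image(first_face_image, pretrained_gan,
                                     train_encoder, finetune_gan, edit_code):
    """Schematic two-stage pipeline (all callables are hypothetical stand-ins
    for the steps described in the text).

    Stage 1: train the feature latent code generation network against the
    frozen, preliminarily trained GAN; it yields the first feature latent code.
    Stage 2: fine-tune the GAN with that code and the original image, then
    feed in the edited (second) code to get the attribute-changed image."""
    # Stage 1: obtain the first feature latent code.
    encoder = train_encoder(first_face_image, pretrained_gan)
    first_code = encoder(first_face_image)

    # Edit the code according to the desired attribute change.
    second_code = edit_code(first_code)

    # Stage 2: fine-tune the GAN into the image generation model.
    image_generation_model = finetune_gan(pretrained_gan, first_code,
                                          first_face_image)
    return image_generation_model(second_code)
```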
The technical solution of the present application is specifically described below.
Referring first to fig. 3, fig. 3 is a schematic flowchart of a method for generating a face attribute change image according to an embodiment of the present application, and as shown in fig. 3, the method S100 includes, but is not limited to, the following steps:
s10: a first face image is acquired.
In this embodiment, a feature latent code needs to be extracted from the first face image, that is, a one-dimensional vector capable of reflecting the features of the first face image is obtained from the first face image, and equivalently, the feature latent code is a vector expression form of the first face image.
In some embodiments, the first face image is preprocessed. It can be understood that, in actual image collection, an image containing a face includes not only the face but also a large amount of complicated background, which affects the training of the subsequent feature latent code network and hinders its computation and detail-feature learning. Therefore, the image containing the face is preprocessed to obtain the first face image.
In some embodiments, the method further includes:
S11: An image including a human face is acquired.
S12: Key points of the face in the image are acquired using a face key point algorithm.
S13: The face in the image is adjusted to a frontal face according to the key points of the face, and the effective face area is cropped to obtain the first face image.
It can be understood that images including a human face may be collected, and that such images contain not only the face but also a large amount of intricate background.
In order to remove background interference, the face in the image must first be recognized. A face key point algorithm can be used to recognize the face in the image and obtain its key points, which include points in the regions of the eyebrows, eyes, nose, mouth, and face contour. The face key point algorithm may be Active Appearance Models (AAMs), Constrained Local Models (CLMs), Explicit Shape Regression (ESR), or the Supervised Descent Method (SDM).
Then, the face is adjusted to a frontal face according to the key points of the face in the image. Specifically, the centers (x1, y1) and (x2, y2) of the two eyeballs are obtained, the angle θ between the line connecting the two eyeball centers and the horizontal direction is calculated, and the image is rotated counterclockwise by θ, with the midpoint of (x1, y1) and (x2, y2) as the base point, to obtain the rotated image.
Specifically, the following similarity transformation is used to compute the face-corrected image:

x' = s(x·cosθ + y·sinθ) + tx
y' = s(−x·sinθ + y·cosθ) + ty

where (x, y) are the coordinates of a pixel in the original image, (x', y') are the coordinates of that pixel in the rotated image, and θ is the rotation angle; the rotation is clockwise when s is positive and counterclockwise when s is negative, s also represents a scaling factor, and tx and ty form a translation vector.
The face in the rotated image is a frontal face. For the rotated image (i.e., the face-adjusted image), the midpoint between the two eye centers is taken as the center of the cropping box, the maximum distance between the face key points is taken as the side length of the cropping box, and the effective face area is cropped to obtain a face image containing only the face.
Then, the first face image is obtained by performing scale normalization on this face image, for example, compressing it to a size of 1024 × 1024 × 3.
In this embodiment, the key points of the face in the image are obtained through a face key point algorithm; then face adjustment, cropping of the effective face area, and resolution compression are performed based on those key points, so that the first face image contains only the face region. On one hand, this removes background interference; on the other hand, it facilitates the neural network's learning of face features, accelerates model convergence, and reduces the amount of computation during training.
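The alignment steps above (angle from the eye centers, rotation about their midpoint, square crop from the key points) can be sketched with NumPy. The helpers below act on keypoint coordinates only and their names are illustrative, not from the patent:

```python
import numpy as np

def eye_angle(eye1, eye2):
    """Angle (radians) between the line through the two eyeball centers
    and the horizontal direction."""
    (x1, y1), (x2, y2) = eye1, eye2
    return np.arctan2(y2 - y1, x2 - x1)

def rotate_points(points, theta, base_point, s=1.0, t=(0.0, 0.0)):
    """Similarity transform about base_point, matching the formula above:
    x' = s(x cos(theta) + y sin(theta)) + tx,
    y' = s(-x sin(theta) + y cos(theta)) + ty."""
    c, d = np.cos(theta), np.sin(theta)
    rot = np.array([[c, d], [-d, c]])
    p = np.asarray(points, dtype=float) - np.asarray(base_point)
    return s * p @ rot.T + np.asarray(base_point) + np.asarray(t)

def crop_box(keypoints, eye1, eye2):
    """Square crop: centered at the eye midpoint, with side length equal to
    the maximum pairwise distance between face key points."""
    pts = np.asarray(keypoints, dtype=float)
    diffs = pts[:, None, :] - pts[None, :, :]
    side = np.sqrt((diffs ** 2).sum(-1)).max()
    center = (np.asarray(eye1) + np.asarray(eye2)) / 2.0
    return center, side
```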
S30: The feature latent code generation network is trained using the first face image and the preliminarily trained generative adversarial network to obtain a feature latent code generation model. The feature latent code generation model outputs a first feature latent code.
The feature latent code generation network is used to extract features of the first face image and includes convolution layers, activation function layers, and normalization layers, so as to reduce the dimensionality of the input first face image and output a feature latent code (a one-dimensional vector). It can be understood that the feature latent code generation network may be an existing MobileNet, ResNet, or VGG network, or the like; its specific structure is not limited here, as long as feature extraction and dimensionality reduction are achieved.
The feature latent code generation network can be trained using the first face image and the preliminarily trained generative adversarial network to obtain the feature latent code generation model. Specifically, one iteration of training includes: the first face image is input into the feature latent code generation network, which outputs an intermediate feature latent code (i.e., the feature latent code during training); the intermediate feature latent code is input into the preliminarily trained generative adversarial network, which processes it and outputs a generated image; the model parameters of the feature latent code generation network are then adjusted backward according to the difference between the generated image and the first face image, and in this backpropagation of the difference, the feature latent code generation network learns the mapping between the first face image and the intermediate feature latent code.
In this stage, the model parameters of the preliminarily trained generative adversarial network are kept fixed and are not adjusted, which prevents artifacts and blotches in the generated image, ensures that the generated image is of high quality, and facilitates the optimization of the feature latent code generation network.
The actual training runs for multiple iterations, and the specific number can be set. The model parameters are continuously optimized according to the difference between the generated image and the first face image until the feature latent code generation network converges, yielding the feature latent code generation model.
In some embodiments, the Adam algorithm may be used to optimize the model parameters, with the number of iterations set to 200, the initial learning rate set to 0.001, and the weight decay set to 0.0005; the learning rate is decayed to 1/10 of its current value during training. The feature latent code generation network is trained until convergence, and the feature latent code generation model is saved.
It can be understood that the feature latent code generation model outputs the first feature latent code, which can effectively restore the first face image; the trained feature latent code generation model is not applicable to other face images. The feature latent code generation model is used to extract the first feature latent code from the first face image. During training, as the feature latent code generation network is gradually trained, the similarity between the image restored from the intermediate feature latent code and the first face image becomes higher and higher; after training is completed, the first feature latent code is output. The first feature latent code is thus obtained through this training method.
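The key structural point of this stage, optimizing only the encoder while the generator stays frozen, can be illustrated with a deliberately tiny linear toy. Nothing here matches the real network architectures; it only shows the shape of the loop:

```python
import numpy as np

# Frozen toy "generator": maps a 2-d feature latent code to a 3-d "image".
G = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
true_code = np.array([0.5, -1.0])
target_image = G @ true_code          # plays the role of the first face image

# Trainable toy "encoder" (the feature latent code generation network).
W = np.zeros((2, 3))
lr = 0.01

for _ in range(1000):
    code = W @ target_image           # intermediate feature latent code
    generated = G @ code              # image from the frozen generator
    residual = generated - target_image
    # Gradient of 0.5 * ||G W x - x||^2 with respect to W only;
    # G is never updated, mirroring the frozen GAN in this stage.
    W -= lr * np.outer(G.T @ residual, target_image)

first_code = W @ target_image         # the "first feature latent code"
error = np.linalg.norm(G @ first_code - target_image)
```

After training, the frozen generator reproduces the target from the learned code, which is exactly the "accurate editable vector" the text describes.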
In some embodiments, the method S100 further comprises:
s20: and constructing an image pyramid of the first face image.
Specifically, in some embodiments, the convolution operation may be performed on the first face image with an existing gaussian pyramid operator to obtain an L-layer image pyramid. For example: the first face image is taken as image layer 0 of the pyramid; it is convolved with a 5 × 5 gaussian kernel and the even rows and columns are removed, yielding the next layer, namely image layer 1; repeating the operation on image layer 1 gives image layer 2, and so on. After each operation an M × N image becomes an M/2 × N/2 image, i.e. it is halved in each dimension. In this embodiment, the image pyramid is a gaussian image pyramid.
In some embodiments, the convolution operation may also be performed on the first face image by using the existing laplacian pyramid operator to obtain an image pyramid, and the processing procedure is not described in detail. In this embodiment, the image pyramid is a laplacian image pyramid.
It can be understood that the image pyramid includes L image layers, the 0 th image layer is a first face image, the resolution is the largest, and the resolution is smaller and smaller from the 0 th image layer to the L-1 th image layer, so that the image pyramid can carry feature information of different levels.
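A minimal sketch of the construction described above, in NumPy. The 5 × 5 binomial kernel weights are an assumption, since the embodiment only states that a 5 × 5 gaussian kernel is used, and the sketch handles a single-channel image for brevity.

```python
import numpy as np

# Standard 5x5 binomial approximation of a gaussian kernel; the exact
# weights are an assumption -- the embodiment only says "5x5 gaussian kernel".
K1 = np.array([1, 4, 6, 4, 1], dtype=float)
K2 = np.outer(K1, K1) / 256.0     # normalized so the kernel sums to 1

def pyr_down(img):
    """Blur with the 5x5 kernel, then keep every other row and column."""
    h, w = img.shape
    padded = np.pad(img.astype(float), 2, mode='edge')
    blurred = np.zeros((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            blurred[i, j] = np.sum(padded[i:i + 5, j:j + 5] * K2)
    return blurred[::2, ::2]       # M x N  ->  M/2 x N/2

def gaussian_pyramid(img, levels):
    """Image layer 0 is the input; each further layer halves the resolution."""
    layers = [img.astype(float)]
    for _ in range(levels - 1):
        layers.append(pyr_down(layers[-1]))
    return layers
```

For a 16 × 16 input and three levels, the layers have resolutions 16 × 16, 8 × 8, and 4 × 4, matching the halving described above.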
In some embodiments, the step S30 specifically includes:
s31: and training the feature latent code generation network by adopting the image pyramid and the preliminarily trained generation type countermeasure network to obtain the feature latent code generation model.
In this embodiment, one iteration of training of the feature latent code generation network proceeds as follows: the L image layers of the image pyramid are input in turn, starting from layer 0, into the feature latent code generation network for training. For the i-th image layer (0 ≤ i ≤ L−1), the training process is the same as directly inputting the first face image into the feature latent code generation network, and is not repeated here. Training proceeds layer by layer from image layer 0 until image layer L−1 completes, which finishes one iteration of training.
For example, the iteration number is set to 200, each iteration training needs to perform alternate training on all the L image layers in the image pyramid, and equivalently, each iteration needs to complete L training parameter adjustments.
Because the L image layers of the pyramid differ in resolution, their feature granularity differs as well. Feeding them cyclically into the feature latent code generation network makes the network learn feature information from shallow to deep, so that the resulting feature latent code generation model is more accurate and faithful and the first feature latent code it outputs has a higher degree of restoration; that is, the similarity between the image restored from the first feature latent code and the first face image is as high as possible.
It can be understood that the resolution of each image layer in the pyramid image is different, and each image layer with different resolution is input into the feature latent code generation network for training, so that the feature latent code generation network can support images with any size as input. Aiming at input images with different sizes, each layer of the network for generating the characteristic latent codes is unchanged, and the resolution setting of the input images is dynamically adjusted along with the alternate input of the image layers in the pyramid. In some embodiments, the structure of the feature latent code generating network is shown in table 1 below, and the feature latent code generating network includes 7 convolutional layers of 3 × 3, 5 pooling layers, and 1 fully-connected layer. In table 1, c denotes the number of output channels, and s denotes the convolution kernel step size.
It is understood that, as the image layers of the pyramid are input in turn, the parameters w and h change to match the resolution of the input layer. For example, image layer 0 has resolution 1024 × 1024 × 3, so when it is input, w and h are both 1024; image layer 1 has resolution 512 × 512 × 3, so when it is input, w and h are both 512.
For any input image layer, after 7 convolution layers of 3 × 3, 5 pooling layers and 1 fully-connected layer in table 1 are sequentially processed, an 18 × 512 feature vector (i.e. intermediate feature latent code) is output.
TABLE 1

(The body of Table 1 is provided in the original as an image and is not reproduced here; as described above, the network consists of seven 3 × 3 convolution layers, five pooling layers, and one fully-connected layer, with c denoting the number of output channels and s the convolution kernel stride.)
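Since the body of Table 1 is only available as an image, the following sketch assumes the usual arrangement: the seven 3 × 3 convolutions preserve spatial size, the five pooling layers each halve it, and the fully-connected layer produces the 18 × 512 intermediate feature latent code.

```python
def encoder_output_shapes(input_res, n_pools=5):
    """Trace the spatial size of a w x h input through the network of Table 1.

    Assumptions (the table body is an image in the original): the seven 3x3
    convolution layers use 'same' padding and do not change the spatial
    size, so only the five pooling layers do, each halving it; the
    fully-connected layer then maps the flattened feature map to the
    18 x 512 intermediate feature latent code.
    """
    size = input_res
    for _ in range(n_pools):
        size //= 2
    return size, (18, 512)
```

Under these assumptions, both a 1024 × 1024 layer (spatial size 32 before the fully-connected layer) and a 512 × 512 layer (spatial size 16) end in an 18 × 512 feature vector.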
In some embodiments, the step S31 specifically includes:
s311: and circularly traversing the image layers in the image pyramid, inputting the image layers into the characteristic latent code generation network, and obtaining the intermediate characteristic latent code.
S312: and inputting the intermediate characteristic latent code into a preliminarily trained generative confrontation network to obtain an intermediate face image.
S313: a first loss between the intermediate face image and the first face image is calculated using a first loss function.
S314: and according to the first loss, iteratively adjusting parameters of the characteristic latent code generation network until the characteristic latent code generation network converges to obtain a characteristic latent code generation model.
Cyclic traversal means that the image layers are input one by one, in layer order, into the feature latent code generation network for training; once all layers of the pyramid have been used in turn, training starts again from layer 0, and the process repeats cyclically.
The following describes, as an example, the training process for one image layer. The image layer is input into the feature latent code generation network to obtain an intermediate feature latent code. The intermediate feature latent code is then input into the preliminarily trained generative confrontation network to obtain an intermediate face image. It will be appreciated that the intermediate face image is an image restored from the intermediate feature latent code.
In order to obtain the difference between the intermediate face image and the first face image, a first loss between them is calculated with a first loss function. Finally, the parameters of the feature latent code generation network are iteratively adjusted according to the first loss, so that the first loss gradually decreases during training until it reaches a minimum or fluctuates within a certain range; at that point the feature latent code generation network has converged, yielding the feature latent code generation model. It can be understood that at convergence the first loss is minimal, so that the first feature latent code output by the feature latent code generation model restores the first face image as faithfully as possible.
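The freeze-the-generator training described above can be illustrated on a toy linear model: the "generator" G is a fixed matrix, the "encoder" E is trainable, and the loss gradient flows back through G without ever updating it. All dimensions and the learning rate are illustrative placeholders, not values from this embodiment.

```python
import numpy as np

# Toy stand-ins: a frozen linear "generator" G and a trainable linear
# "encoder" E.  Dimensions and the learning rate are illustrative only.
rng = np.random.default_rng(0)
d, k = 8, 4                          # "image" and "latent" dimensions
G = rng.normal(size=(d, k))          # preliminarily trained generator: frozen
E = 0.1 * rng.normal(size=(k, d))    # feature latent code generation network
x = rng.normal(size=(d, 1))          # stand-in for the first face image

def first_loss(E):
    residual = G @ (E @ x) - x       # generated image minus first face image
    return float(residual.T @ residual)

lr = 1e-3
G_before = G.copy()
losses = [first_loss(E)]
for _ in range(200):
    residual = G @ (E @ x) - x
    grad_E = 2.0 * G.T @ residual @ x.T   # gradient passes back through frozen G
    E -= lr * grad_E                      # only the encoder's parameters move
    losses.append(first_loss(E))
# G is never touched: the generator's parameters stay frozen throughout.
```

The loss decreases while G remains bit-for-bit identical, mirroring how the preliminarily trained network guides the encoder without being adjusted itself.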
In order to improve the accuracy of the feature latent code generation model and the true restoration degree of the first feature latent code, in some embodiments, the first loss function is set to include a first image perception loss and a structural similarity loss, wherein the structural similarity loss is used for measuring the structural similarity between the intermediate face image and the first face image.
For example, the first loss function is expressed by the following formula:

L1 = α·L_LPIPS(Y, Y') + β·L_s(Y, Y'),

where α and β are loss weights, which can be tuned according to the effect of model training; Y denotes the first face image, Y' denotes the intermediate face image, L_LPIPS(Y, Y') is the first image perception loss, and L_s(Y, Y') is the structural similarity loss.
The first image perception loss L_LPIPS(Y, Y') is used to evaluate perceptual differences between the intermediate face image and the first face image; typical perceptual differences include differences in content and in style. Introducing the first image perception loss into the first loss function lets the intermediate face image approach the first face image ever more closely in content and style as training proceeds, continually improving the accuracy with which the feature latent code generation network maps contour and style features. It can be appreciated that the first image perception loss constrains the feature latent code generation network to learn coarser-grained features such as content contour and style.
The structural similarity loss L_s(Y, Y') is used to evaluate the structural similarity between the intermediate face image and the first face image. Introducing the structural similarity loss into the first loss function lets the structure of the intermediate face image approach that of the first face image as training proceeds, continually improving the accuracy with which the feature latent code generation network maps detail features. It can be appreciated that the structural similarity loss constrains the network to learn finer-grained features such as brightness, contrast, and structure.
In some embodiments, the structural similarity loss is a product of a luminance similarity value, a contrast value, and a structural similarity value.
For example, the structural similarity loss L_s(Y, Y') is calculated by the following formula:

L_s(Y, Y') = r(Y, Y') · c(Y, Y') · s(Y, Y'),

where r(Y, Y') is the luminance similarity value, c(Y, Y') is the contrast value, s(Y, Y') is the structural similarity value, Y is the first face image, and Y' is the output intermediate face image.
The luminance similarity value r(Y, Y') is an index reflecting the degree of similarity between the luminance of the two images. The contrast value c(Y, Y') is an index reflecting the degree of similarity between the contrasts of the two images. The structural similarity value s(Y, Y') is an index measuring the similarity between the structures of the two images. In this embodiment, the above structural similarity loss is the Structural SIMilarity (SSIM) index.
The luminance similarity value r(Y, Y'), the contrast value c(Y, Y'), and the structural similarity value s(Y, Y') are respectively calculated by the following formulas:

r(Y, Y') = (2·μ_Y·μ_Y' + c1) / (μ_Y² + μ_Y'² + c1),

c(Y, Y') = (2·σ_Y·σ_Y' + c2) / (σ_Y² + σ_Y'² + c2),

s(Y, Y') = (σ_YY' + c2/2) / (σ_Y·σ_Y' + c2/2),

where Y denotes the first face image and Y' the intermediate face image; μ_Y and μ_Y' denote the means of the pixel values of the first face image and the intermediate face image, respectively; σ_Y and σ_Y' denote the corresponding standard deviations of the pixel values; and σ_YY' denotes the covariance of the pixel values of the two images;

wherein c1 = (k1·LO)² and c2 = (k2·LO)², k1 and k2 are preset constants (for example, k1 may be 0.01 and k2 may be 0.03), and LO is the dynamic range of the pixel values, which may generally be 255.
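Under the definitions above, the structural similarity loss can be sketched in NumPy as a single global window over the whole image (the patch-wise averaging of common SSIM implementations is omitted for brevity), taking the stabilizing constant of the structure term as c2/2, the usual SSIM convention.

```python
import numpy as np

def structural_similarity_loss(Y, Yp, k1=0.01, k2=0.03, LO=255.0):
    """Global (whole-image) SSIM: the product r * c * s.

    c3, the stabilizing constant of the structure term, is taken as c2/2,
    the usual SSIM convention.
    """
    Y = np.asarray(Y, dtype=float)
    Yp = np.asarray(Yp, dtype=float)
    mu_y, mu_yp = Y.mean(), Yp.mean()
    sig_y, sig_yp = Y.std(), Yp.std()
    cov = ((Y - mu_y) * (Yp - mu_yp)).mean()
    c1, c2 = (k1 * LO) ** 2, (k2 * LO) ** 2
    c3 = c2 / 2.0
    r = (2 * mu_y * mu_yp + c1) / (mu_y ** 2 + mu_yp ** 2 + c1)      # luminance
    c = (2 * sig_y * sig_yp + c2) / (sig_y ** 2 + sig_yp ** 2 + c2)  # contrast
    s = (cov + c3) / (sig_y * sig_yp + c3)                           # structure
    return r * c * s
```

An image compared against itself scores 1; any structural discrepancy pulls the score below 1.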
In this embodiment, the accuracy of the feature latent code generation model and the true restoration degree of the first feature latent code can be improved by constructing the multidimensional first loss function and constraining the feature latent code generation network to learn various feature differences.
In addition, because the L image layers of the pyramid differ in resolution and hence in feature granularity, cyclically feeding them into the feature latent code generation network encodes the image through a cyclic, coarse-to-fine mechanism. The feature latent code generation network can thus better learn feature information of the first face image at different levels, and in turn better obtain the corresponding first feature latent code.
S50: and training the preliminarily trained generative confrontation network by adopting the first characteristic latent code and the first face image to obtain an image generation model.
It will be appreciated that the first feature latent code restores the first face image with high fidelity. As can be seen from the structure and training of the generative confrontation network described above, the first face image itself was never learned during the initial training of the generative confrontation network; that is, the first face image lies outside the mapping domain of the preliminarily trained generative confrontation network.
The preliminarily trained generative confrontation network is trained a second time with the first feature latent code and the first face image, fine-tuning the model parameters to obtain the image generation model. The first face image is thereby learned by the image generation model and brought into its mapping domain. As a result, when the image generation model restores the first feature latent code, it can produce a restored image extremely similar to the first face image, providing a distortion-free basis for subsequent face attribute editing.
The secondary training of the preliminarily trained generative confrontation network may likewise use a second loss function with back-propagation and iterative parameter adjustment. It can be understood that when the model parameters are adjusted, only part of them are adjusted, and the changes are small; for example, parameters reflecting the details of the facial features may be adjusted.
In some embodiments, the step S50 specifically includes:
s51: and inputting the first characteristic latent code into the preliminarily trained generative confrontation network to obtain a restored image.
S52: a second loss between the first face image and the restored image is calculated using a second loss function.
S53: and according to the second loss, iteratively adjusting parameters of the preliminarily trained generative confrontation network until the preliminarily trained generative confrontation network converges to obtain the image generation model.
Taking a training process as an example, it can be understood that in the secondary training of the initially trained generative confrontation network, a plurality of training times, for example, 100 training times, may be set.
Specifically, the first feature latent code is input into a preliminarily trained generative confrontation network to obtain a restored image. Ideally, the restored image and the first face image have minimal differences. To obtain a difference between the restored image and the first face image, a second loss between the first face image and the restored image is calculated using a second loss function. And finally, carrying out iterative fine-tuning on the preliminarily trained generative confrontation network according to the second loss, so that the second loss in the training process is gradually reduced to reach the minimum value or fluctuate within a certain range, namely, the preliminarily trained generative confrontation network is converged to obtain the image generation model. It can be understood that the corresponding second loss in convergence is minimized, so that the restored image output by the image generation model can approach the first face image infinitely.
In some embodiments, the adam algorithm may be used to optimize the model parameters, the number of iterations is set to 100, the initial learning rate is set to 0.001, the weight attenuation is set to 0.0001, the parameter fine tuning is performed, the secondary training of the preliminarily trained generative confrontation network is performed until convergence, and the image generation model is saved.
It can be understood that, in this embodiment, the fine tuning is performed on part of model parameters of the preliminarily trained generative confrontation network (for example, shape parameters reflecting details of a part of the face), that is, only some shape parameters are enhanced without affecting the structure of the preliminarily trained generative confrontation network, and the reconstruction can be completed. Therefore, a second loss function adapted to fine-tune the model parameters is employed.
In some embodiments, the second loss function includes a second image perception loss and a regularization term loss. Specifically, the second loss function is the following formula:

L2 = L_LPIPS(Y, G_style(wp, θ*)) + (1 / (C·W·H)) · Σ_i (Y_i − G_style(wp, θ*)_i)²,

where Y denotes the first face image; θ* denotes the model parameters of the preliminarily trained generative confrontation network; G_style(wp, θ*) denotes the image generated when the first feature latent code wp is input and the network parameters are θ*; L_LPIPS(Y, G_style(wp, θ*)) is the second image perception loss; Y_i denotes the i-th pixel value of the first face image, and G_style(wp, θ*)_i denotes the i-th pixel value of the generated image G_style(wp, θ*); C·W·H denotes the size of the generated image. In some embodiments, C·W·H = 1024 × 1024 × 3.
In this embodiment, the second loss function is constructed, and the second loss function is adopted to perform secondary fine-tuning model parameter training on the preliminarily trained generative confrontation network, so that some appearance parameters affecting the details of the face part can be enhanced, the image generation model can restore the first feature latent code, a restored image with extremely high similarity to the first face image can be restored, and a distortionless basis is provided for subsequent face attribute editing.
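A sketch of such a second loss, assuming the regularization term is the squared pixel difference averaged over the C·W·H pixel values and leaving the second image perception loss as a caller-supplied function (in practice an LPIPS network). The weight `lam` between the two terms is a hypothetical parameter; the embodiment does not give a relative weighting.

```python
import numpy as np

def second_loss(Y, Y_generated, perception_loss, lam=1.0):
    """Sketch of the second loss: perception term + pixel regularization term.

    `perception_loss` stands in for the second image perception loss (in
    practice an LPIPS network); `lam`, the relative weight of the
    regularization term, is a hypothetical parameter.  The regularization
    term is the squared pixel difference averaged over the C*W*H pixels.
    """
    Y = np.asarray(Y, dtype=float)
    Yg = np.asarray(Y_generated, dtype=float)
    reg = np.mean((Y - Yg) ** 2)   # (1 / CWH) * sum_i (Y_i - G(...)_i)^2
    return perception_loss(Y, Yg) + lam * reg
```

When the restored image equals the first face image, both terms vanish and the loss is zero, which is the convergence target of the secondary fine-tuning.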
S70: and acquiring a second characteristic latent code, wherein the second characteristic latent code is obtained by editing the first characteristic latent code.
S90: and inputting the second characteristic latent code into the image generation model to obtain a face attribute change image.
When the face attributes (such as age, skin color, or hair) of the first face image are edited, the first feature latent code is edited and adjusted. Each such face attribute has a fixed parameter, and the edited second feature latent code can be obtained by multiplying the first feature latent code by that parameter.
To see the face editing effect, the second feature latent code is input into the image generation model to obtain the face attribute change image.
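The multiplicative edit described above can be sketched as follows; the scale value used for an attribute such as age is a hypothetical placeholder, since the embodiment does not list the fixed attribute parameters.

```python
import numpy as np

def edit_latent(first_latent, attribute_parameter):
    """Multiply the first feature latent code by a face attribute's fixed
    parameter to obtain the second feature latent code, as described above.
    The scale value 1.2 used below for an "age" edit is a hypothetical
    placeholder."""
    return first_latent * attribute_parameter

w1 = np.random.default_rng(1).normal(size=(18, 512))  # first feature latent code
w2 = edit_latent(w1, 1.2)                             # second feature latent code
```

The edit preserves the 18 × 512 shape of the latent code, so the result can be fed directly into the image generation model.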
In this embodiment, the preliminarily trained generative confrontation network is trained with the first feature latent code and the first face image to obtain the image generation model, and the second feature latent code is input into the image generation model to obtain the face attribute change image. The secondary fine-tuning maps the out-of-domain first face image into the in-domain latent space, so that the attribute-edited face image not only changes the face attributes according to the attribute change parameters carried in the second feature latent code but also restores the detail information of the first face image; that is, an image with the corresponding attribute change is generated without distortion.
Referring to fig. 4, fig. 4 shows the overall process of generating the face attribute change image. As shown in fig. 4, the first face image is gaussian-filtered to obtain an image pyramid. In the first stage, each image layer of the pyramid is input into the feature latent code generation network, which is trained with the help of the preliminarily trained generative confrontation network; at this stage, the model parameters of the preliminarily trained generative confrontation network are frozen and not adjusted. When the feature latent code generation network converges, the first stage outputs the feature latent code generation model. In the second stage, the first feature latent code output by the feature latent code generation model is used as the input for the secondary training of the preliminarily trained generative confrontation network, yielding the image generation model. The first feature latent code is edited and modified to obtain the second feature latent code. Finally, the second feature latent code is input into the image generation model to obtain a face attribute change image that realizes face attribute editing without distortion.
In this embodiment, in the first stage, the feature latent code generation network is trained with the image pyramid of the first face image and the preliminarily trained generative confrontation network to obtain the feature latent code generation model, which outputs the first feature latent code. Because the L image layers of the pyramid differ in resolution and hence in feature granularity, cyclically feeding them into the network encodes the image through a coarse-to-fine mechanism, letting the feature latent code generation network learn feature information of the first face image at different levels. The first feature latent code can therefore restore the first face image to a high degree, which amounts to providing an accurate editable vector for subsequent face attribute editing. Specifically, the first feature latent code can be edited and modified according to the desired face attribute change to obtain the second feature latent code.
In the second stage, the preliminarily trained generative confrontation network is trained with the first feature latent code and the first face image to obtain the image generation model, and the second feature latent code is input into the image generation model to obtain the face attribute change image. The secondary fine-tuning maps the out-of-domain first face image into the in-domain latent space, so that the attribute-edited face image not only changes the face attributes according to the attribute change parameters carried in the second feature latent code but also restores the detail information of the first face image; that is, an image with the corresponding attribute change is generated without distortion.
Having described the methods of the present application, and in order to better practice the methods of the present application, the apparatus of the present application is now described.
Referring to fig. 5, fig. 5 is a hardware structure diagram of an electronic device 60 provided by an embodiment of the present application. Specifically, as shown in fig. 5, the electronic device 60 includes at least one processor 61 and a memory 62 that are communicatively connected (connected by a bus, with one processor taken as an example in fig. 5).
The processor 61 is configured to provide computing and control capabilities to control the electronic device 60 to perform corresponding tasks, and control the electronic device 60 to perform any one of the methods for generating a face attribute change image provided in the above embodiments.
It is understood that the processor 61 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The memory 62, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the method for generating an image of a change in a human face attribute according to the embodiments of the present invention. The processor 61 may implement any of the methods for generating a face attribute change image provided by the above embodiments by running non-transitory software programs, instructions, and modules stored in the memory 62. In particular, the memory 62 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 62 may also include memory located remotely from the processor, which may be connected to the processor via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiments of the present application further provide a computer-readable storage medium, in which a computer program is stored, where the computer program includes program instructions, and when the program instructions are executed by a computer, the computer executes any one of the methods for generating a face attribute change image.
It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware related to instructions of a computer program, which can be stored in a computer readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; within the context of the present application, where technical features in the above embodiments or in different embodiments can also be combined, the steps can be implemented in any order and there are many other variations of the different aspects of the present application as described above, which are not provided in detail for the sake of brevity; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A method for generating a face attribute change image, comprising:
acquiring a first face image;
training a feature latent code generation network by adopting the first face image and the preliminarily trained generation type countermeasure network to obtain a feature latent code generation model, and outputting a first feature latent code by the feature latent code generation model;
training the preliminarily trained generative confrontation network by adopting the first characteristic latent code and the first face image to obtain an image generation model;
acquiring a second characteristic latent code, wherein the second characteristic latent code is obtained by editing the first characteristic latent code;
and inputting the second characteristic latent code into the image generation model to obtain a face attribute change image.
2. The method of claim 1, further comprising:
constructing an image pyramid of the first face image;
the method for training the characteristic latent code generation network by adopting the first face image and the preliminarily trained generation type confrontation network to obtain a characteristic latent code generation model comprises the following steps of
And training the feature latent code generation network by adopting the image pyramid and the preliminarily trained generation type countermeasure network to obtain the feature latent code generation model.
3. The method of claim 2, wherein the training the latent feature code generation network using the image pyramid and the preliminarily trained generative confrontation network to obtain the latent feature code generation model comprises:
circularly traversing the image layers in the image pyramid, and inputting the image layers into the feature latent code generation network to obtain intermediate feature latent codes;
inputting the intermediate characteristic latent code into the preliminarily trained generative confrontation network to obtain an intermediate face image;
calculating a first loss between the intermediate face image and the first face image by using a first loss function;
and according to the first loss, iteratively adjusting parameters of the characteristic latent code generation network until the characteristic latent code generation network is converged to obtain the characteristic latent code generation model.
4. The method of claim 3, wherein the first loss function comprises a first image perception loss and a structural similarity loss, wherein the structural similarity loss is used to measure the structural similarity between the intermediate face image and the image layer.
5. The method of claim 4, wherein the structural similarity loss is a product of a luminance similarity value, a contrast value, and a structural similarity value.
6. The method according to any one of claims 1 to 5, wherein the training the preliminarily trained generative confrontation network using the first feature latent code and the first face image to obtain an image generation model comprises:
inputting the first characteristic latent code into the preliminarily trained generative confrontation network to obtain a restored image;
calculating a second loss between the first face image and the restored image using a second loss function;
and iteratively adjusting parameters of the preliminarily trained generative confrontation network according to the second loss until the preliminarily trained generative confrontation network converges to obtain the image generation model.
7. The method of claim 6, wherein the second loss function comprises a second image perception loss and a regularization term loss.
8. The method of claim 1, further comprising:
acquiring an image containing a human face;
detecting key points of the face in the image using a face key-point detection algorithm;
and aligning the face in the image to a frontal pose according to the detected key points, then cropping the effective face region to obtain the first face image.
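The alignment-and-crop preprocessing of claim 8 is commonly done by levelling the eye key points and taking a padded bounding box around the landmarks. A small numpy sketch (the function names and the eye-based roll correction are assumptions, not the patent's specific procedure):

```python
import numpy as np

def alignment_transform(left_eye, right_eye) -> np.ndarray:
    """2x3 affine matrix that rotates the face so the eyes are level,
    pivoting about the midpoint between the eyes."""
    left_eye, right_eye = np.asarray(left_eye, float), np.asarray(right_eye, float)
    dx, dy = right_eye - left_eye
    angle = np.arctan2(dy, dx)                 # roll angle of the face
    c, s = np.cos(-angle), np.sin(-angle)
    rot = np.array([[c, -s], [s, c]])
    center = (left_eye + right_eye) / 2
    offset = center - rot @ center             # keep the pivot point fixed
    return np.hstack([rot, offset[:, None]])

def crop_region(points: np.ndarray, margin: float = 0.2):
    """Bounding box around the key points, expanded by a relative margin."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    pad = (hi - lo) * margin
    return lo - pad, hi + pad
```

In practice the affine matrix would be applied to the image with a warping routine (e.g. `cv2.warpAffine`) before cropping the padded landmark box.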
9. An electronic device comprising a memory and one or more processors configured to execute one or more computer programs stored in the memory, wherein the one or more processors, when executing the one or more computer programs, cause the electronic device to implement the method of any one of claims 1-8.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1-8.
CN202111597494.2A 2021-12-24 2021-12-24 Method and related device for generating face attribute change image Pending CN114332119A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111597494.2A CN114332119A (en) 2021-12-24 2021-12-24 Method and related device for generating face attribute change image


Publications (1)

Publication Number Publication Date
CN114332119A true CN114332119A (en) 2022-04-12

Family

ID=81013132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111597494.2A Pending CN114332119A (en) 2021-12-24 2021-12-24 Method and related device for generating face attribute change image

Country Status (1)

Country Link
CN (1) CN114332119A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116402916A (en) * 2023-06-08 2023-07-07 北京瑞莱智慧科技有限公司 Face image restoration method and device, computer equipment and storage medium
CN116402916B (en) * 2023-06-08 2023-09-05 北京瑞莱智慧科技有限公司 Face image restoration method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
US20200250491A1 (en) Image classification method, computer device, and computer-readable storage medium
CN106778928B (en) Image processing method and device
US20210174072A1 (en) Microexpression-based image recognition method and apparatus, and related device
US20220261968A1 (en) Image optimization method and apparatus, computer storage medium, and electronic device
WO2017193906A1 (en) Image processing method and processing system
WO2021027759A1 (en) Facial image processing
CN113838176A (en) Model training method, three-dimensional face image generation method and equipment
CN109002763B (en) Method and device for simulating human face aging based on homologous continuity
Portilla et al. Efficient and robust image restoration using multiple-feature L2-relaxed sparse analysis priors
CN111383232A (en) Matting method, matting device, terminal equipment and computer-readable storage medium
CN113763535A (en) Characteristic latent code extraction method, computer equipment and storage medium
US20220207790A1 (en) Image generation method and apparatus, and computer
CN117197627B (en) Multi-mode image fusion method based on high-order degradation model
CN111127309A (en) Portrait style transfer model training method, portrait style transfer method and device
CN113327191A (en) Face image synthesis method and device
CN114648787A (en) Face image processing method and related equipment
CN113658091A (en) Image evaluation method, storage medium and terminal equipment
CN114332119A (en) Method and related device for generating face attribute change image
DE102021124537A1 (en) ENERGY-BASED VARIATIONAL AUTOENCODER
CN110570375A (en) image processing method, image processing device, electronic device and storage medium
CN113762117A (en) Training method of image processing model, image processing model and computer equipment
CN110489634A (en) A kind of build information recommended method, device, system and terminal device
CN112150608B (en) Three-dimensional face reconstruction method based on graph convolution neural network
CN115861122A (en) Face image processing method and device, computer equipment and storage medium
DE102021124428A1 (en) TRAIN ENERGY-BASED VARIATIONAL AUTOENCODERS

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination