CN111914617A - Face attribute editing method based on a balanced stacked generative adversarial network - Google Patents
Face attribute editing method based on a balanced stacked generative adversarial network
- Publication number
- CN111914617A (application number CN202010521351.2A)
- Authority
- CN
- China
- Prior art keywords
- attribute
- image
- loss
- face
- discriminator
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
      - G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
        - G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
          - G06V40/16—Human faces, e.g. facial parts, sketches or expressions
            - G06V40/168—Feature extraction; Face representation
    - G06F—ELECTRIC DIGITAL DATA PROCESSING
      - G06F18/00—Pattern recognition
        - G06F18/20—Analysing
          - G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
            - G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    - G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
      - G06V10/00—Arrangements for image or video recognition or understanding
        - G06V10/20—Image preprocessing
          - G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
Abstract
The invention discloses a face attribute editing method based on a balanced stacked generative adversarial network, which comprises the following steps: 1) acquiring a data set containing face images and attribute labels, and preprocessing it; 2) constructing, according to the size of the face images, several conditional generative adversarial networks, each consisting of a paired generator and discriminator; 3) using the preprocessed data set, independently training all conditional generative adversarial networks for the different face attributes in a weighted-learning and residual-image-generation manner; 4) stacking all the trained generators into a stacked structure, and sequentially editing the corresponding face attributes of a preprocessed unknown face image. By applying weighted learning and residual image generation during training and a stacked structure during attribute editing, the model effectively handles the data imbalance problem, strengthens the editing of minority-class samples, improves image generation in attribute-irrelevant regions, and avoids the attribute entanglement problem.
Description
Technical Field
The invention relates to the technical field of image editing and machine learning, in particular to a face attribute editing method based on a balanced stacked generative adversarial network.
Background
Face attribute editing aims to change specific attributes of a given face image, such as adding glasses, removing a mustache, whitening skin, or even changing gender. This visual task enables semantic control of images and fine-grained image transformation. Meanwhile, with the rise of the selfie wave on social media, the wide spread of online video, and the intelligent design of game and animation characters, large-scale face image data are generated on the Internet every year, and the demand for face attribute editing grows increasingly strong. Face attribute editing is therefore widely used in application scenarios such as facial beautification, video restoration, and character synthesis. In addition, face images with edited attributes can be used for data augmentation in other machine learning vision tasks, such as face recognition, face detection, and face tracking.
In recent years, the generative adversarial network (GAN) has become a popular model in face attribute editing research by virtue of its strong image generation capability, producing high-fidelity and highly diverse editing results. Given a data set containing face images and attribute labels, the generator of a GAN is trained to manipulate specific image attributes so as to fool the discriminator, which learns to distinguish real images from fake images synthesized by the generator. The quality of the edited image improves through this mutual competition between generator and discriminator. However, face multi-attribute editing remains a challenging task because of the large number of attribute combinations: there are many attributes, and each attribute takes multiple values. As in other machine learning tasks, collecting samples, i.e., pairs of face images and attribute labels, is expensive and limited, so the samples cannot represent all attribute combinations. Because of this lack and imbalance of samples, current multi-attribute editing methods suffer from three problems. The first is attribute entanglement: editing one attribute unintentionally changes other attributes, because existing methods do not have enough samples to distinguish different attributes. Second, for lack of sufficient training samples, the generated images are unsatisfactory in attribute-irrelevant regions, which shows up as high noise and inaccurate editing. Finally, existing methods edit minority attribute values poorly: they treat all samples equally, so samples with minority attribute values are ignored, and learning multiple attributes simultaneously prevents them from adjusting the balance of each attribute individually.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a face attribute editing method based on a balanced stacked generative adversarial network, which uses several conditional generative adversarial networks to learn the editing of several attributes independently in the training stage, avoiding the attribute entanglement problem. In addition, the direct learning target of each conditional generative adversarial network is a residual image, which improves generation in attribute-irrelevant regions. Meanwhile, weighted learning gives samples with minority-class attribute values a larger learning weight, improving the editing of minority attribute values and properly addressing the challenges brought by unbalanced data sets.
In order to achieve this purpose, the technical scheme provided by the invention is as follows: a face attribute editing method based on a balanced stacked generative adversarial network, which adopts weighted learning and trains several conditional generative adversarial networks to solve the attribute imbalance problem, stacks the generators of all trained conditional generative adversarial networks into a stacked structure to solve the attribute entanglement problem, and uses residual image generation to solve the problem of inaccurate image editing; the method comprises the following steps:
1) acquiring a data set containing face images and attribute labels, and preprocessing it;
2) constructing, according to the size of the face images, several conditional generative adversarial networks, each consisting of a paired generator and discriminator;
3) using the preprocessed data set, independently training all conditional generative adversarial networks for the different face attributes in a weighted-learning and residual-image-generation manner;
4) stacking all the trained generators into a stacked structure, and sequentially editing the corresponding face attributes of a preprocessed unknown face image.
In step 1), the data set is obtained from a face data set published on the Internet; each face image contains a single face, and the preprocessing comprises cropping, scaling, and normalization, so that the face occupies the main frame of the image and pixel values lie between -1 and 1. A face image is denoted by $x$, and all $x$ form the face image set $\mathcal{X}$, i.e. $x \in \mathcal{X}$. The attribute label $y_j$ denotes the label of the value of the j-th attribute of the face, where $j = 1, 2, \ldots, m$ and $m$ is the number of attributes; all attribute labels $y_j$ form the attribute label set $\mathcal{Y}$, i.e. $y_j \in \mathcal{Y}$. The $m$ attribute labels combined with 1 face image form 1 sample pair.
In step 2), several conditional generative adversarial networks, each consisting of a paired generator and discriminator, are constructed according to the size of the face image, specifically as follows:
I. Constructing a generator $G_j$ with an encoder-decoder structure:

The encoder in generator $G_j$ consists of several convolutional layers and receives two variables: a face image $x$ from the face image set $\mathcal{X}$, and a target attribute label from the target attribute label set $\hat{\mathcal{Y}}$, namely a value $v_k$ of the j-th attribute different from the current attribute label $y_j$, where $y_j$ comes from the attribute label set $\mathcal{Y}$, $v_k$ is the k-th value of the j-th attribute, $j = 1, 2, \ldots, m$, and $m$ is the number of attributes; 1 attribute label combined with 1 face image forms 1 sample pair. $G_j$ then maps the two received variables into a hidden code $z$, which extracts the features of the face image and removes redundant information. The code $z$ is fed into a decoder consisting of several deconvolution layers to generate a residual image $\hat{r}_{v_k}$ that changes the target face attribute, where $\hat{r}_{v_k}$ is defined as the difference between the face image $x$ and the edited image $\hat{x}_{v_k}$; all $\hat{r}_{v_k}$ form the residual image set $\hat{\mathcal{R}}$, and all $\hat{x}_{v_k}$ form the edited image set $\hat{\mathcal{X}}$. Finally, the edited image is obtained by pixel-level superposition: $\hat{x}_{v_k} = x + \hat{r}_{v_k}$. Since minority-class samples are easily ignored in an unbalanced data set, every sample in the data set is given a learning weight $\omega$ representing its importance in training; samples with minority-class attribute values receive a large $\omega$, and vice versa. For a face image $x$ whose j-th attribute takes value $v_k$, $\omega$ is defined as:

$$\omega = \frac{\max_{1 \le l \le |V_j|} \left| \mathcal{X}^{j}_{v_l} \right|}{\left| \mathcal{X}^{j}_{v_k} \right|}$$
where $|V_j|$ is the number of distinct values of the j-th attribute, and $|\mathcal{X}^{j}_{v_k}|$ is the number of samples whose j-th attribute takes value $v_k$, i.e. $\mathcal{X}^{j}_{v_k} = \{ x \mid (x, y_j) \in (\mathcal{X}, \mathcal{Y}),\ y_j = v_k \}$, where $(x, y_j)$ is a sample pair formed by face image $x$ and attribute label $y_j$, $(\mathcal{X}, \mathcal{Y})$ is the set of such pairs, and $|\cdot|$ denotes the cardinality of a set. Thus $\omega$ is the ratio between the largest sample count among the values of the j-th attribute and the sample count for value $v_k$, and is always greater than or equal to 1: the smaller $|\mathcal{X}^{j}_{v_k}|$, the larger the weight $\omega$, and vice versa. This weight encourages the model to pay more attention to samples with minority-class attribute values during learning.
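By way of illustration, the learning weight could be computed from the label counts as in the following sketch, assuming the j-th attribute column is stored as a NumPy array; the function name and the example counts are illustrative, not part of the invention.

```python
import numpy as np

def attribute_weights(labels_j):
    """Per-sample learning weight for one attribute column.

    weight = (count of the most frequent value of the attribute) divided by
    (count of this sample's own value), so minority-value samples get
    weights > 1 and majority-value samples get weight 1.
    """
    values, counts = np.unique(labels_j, return_counts=True)
    count_of = dict(zip(values, counts))
    max_count = counts.max()
    return np.array([max_count / count_of[v] for v in labels_j])

# Example: 90 samples with value 0 and 10 with value 1.
labels = np.array([0] * 90 + [1] * 10)
w = attribute_weights(labels)
print(w[0], w[-1])  # 1.0 9.0
```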
II. Constructing the complete discriminator $D_j$ from two sub-discriminators:

Discriminator $D_j$ comprises two sub-discriminators: an authenticity discriminator $D^{real}_j$ and a class discriminator $D^{cls}_j$. The authenticity discriminator $D^{real}_j$ judges the authenticity of the input image, predicting the probability that it is a real image. The class discriminator $D^{cls}_j$ identifies whether the attribute value of the input image matches the target attribute value and predicts the degree of conformity, where different values of the same attribute are regarded as different classes. The two sub-discriminators share one multi-layer convolutional neural network but have independent two-layer fully connected heads.
III. Constructing the training target of the generator:

The adversarial loss $L^{G}_{adv}$, together with three other loss components, the classification loss $L^{G}_{cls}$, the reconstruction loss $L_{rec}$, and the regularization loss $L_{reg}$, is taken into account in generator $G_j$. The adversarial loss $L^{G}_{adv}$ ensures that the generated image is realistic; the classification loss $L^{G}_{cls}$ controls that the face image is edited correctly according to the target attribute value $v_k$; the reconstruction loss $L_{rec}$ strengthens the ability of generator $G_j$ to retain the information of attribute-irrelevant regions during editing; and the regularization loss $L_{reg}$ measures the L-1 norm of the residual image to enhance its sparsity, since the residual image should contain a large number of zero-valued pixels. Finally, the training target $L_G$ of generator $G_j$ comprises the four components and is defined as:

$$L_G = L^{G}_{adv} + \lambda_1 L^{G}_{cls} + \lambda_2 L_{rec} + \lambda_3 L_{reg}$$

where $\lambda_1, \lambda_2, \lambda_3$ are balance parameters indicating the importance of the corresponding losses. Notably, all four loss components take the learning weight into account so as to emphasize samples with minority-class attribute values. These loss components are as follows:
Adversarial loss: this loss quantifies the realism of the generated image and is defined as:

$$L^{G}_{adv} = -\,\mathbb{E}_{\hat{x}_{v_k} \sim \hat{\mathcal{X}}}\!\left[ \omega \log D^{real}_j(\hat{x}_{v_k}) \right]$$

where $\mathbb{E}$ denotes the mathematical expectation, the attribute label $y_j$ comes from the attribute label set $\mathcal{Y}$, the edited image $\hat{x}_{v_k}$ comes from the edited image set $\hat{\mathcal{X}}$, $v_k$ is the k-th value of the j-th attribute, $j = 1, 2, \ldots, m$, $m$ is the number of attributes, and $\omega$ is the learning weight. The authenticity discriminator $D^{real}_j$ judges the authenticity of the input image; minimizing this loss improves the similarity between images synthesized by $G_j$ and real images.
Classification loss: this loss measures the degree to which the edited value of the j-th attribute of face image $x$ matches the target attribute value $v_k$. It takes the form of a weighted binary cross-entropy loss, defined as:

$$L^{G}_{cls} = \mathbb{E}\!\left[ \omega \, \ell(v_k, \hat{x}_{v_k}) \right]$$

where $\mathbb{E}$ denotes the mathematical expectation, the attribute label $y_j$ comes from the attribute label set $\mathcal{Y}$, the target attribute label comes from the target attribute label set $\hat{\mathcal{Y}}$, the edited image $\hat{x}_{v_k}$ comes from the edited image set $\hat{\mathcal{X}}$, $v_k$ is the k-th value of the j-th attribute, $j = 1, 2, \ldots, m$, $m$ is the number of attributes, $\omega$ is the learning weight, and $\ell$ is the binary cross-entropy function, defined as:

$$\ell(y_j, x) = -\,y_j \log D^{cls}_j(x) - (1 - y_j) \log\!\left( 1 - D^{cls}_j(x) \right)$$

The class discriminator $D^{cls}_j$ judges whether the j-th attribute of the input image has been edited correctly.
Reconstruction loss: this loss avoids information loss when attribute-irrelevant regions are reconstructed in the generated image. The reconstruction loss $L_{rec}$ measures the difference between the original image $x$ and the reconstructed image $g_j$ from the reconstructed image set $\mathcal{G}$, where $g_j$ is the result of passing the edited image $\hat{x}_{v_k}$ through generator $G_j$ again, conditioned on the original attribute value, i.e. $g_j = \hat{x}_{v_k} + G_j(\hat{x}_{v_k}, y_j)$. The reconstruction loss is defined as:

$$L_{rec} = \mathbb{E}\!\left[ \omega \left\| x - g_j \right\|_1 \right]$$

where $\mathbb{E}$ denotes the mathematical expectation, the attribute label $y_j$ comes from the attribute label set $\mathcal{Y}$, the face image $x$ comes from the face image set $\mathcal{X}$, the reconstructed image $g_j$ comes from the reconstructed image set $\mathcal{G}$, $m$ is the number of attributes, $\omega$ is the learning weight, and $\|\cdot\|_1$ denotes the L-1 norm.
Regularization loss: in generator $G_j$, the direct learning target is the residual image rather than a complete image. The residual image represents local pixel changes for the target attribute; theoretically it should be sparse, with a large number of zero-valued pixels. A regularization loss is therefore introduced, defined as:

$$L_{reg} = \mathbb{E}\!\left[ \omega \left\| \hat{r}_{v_k} \right\|_1 \right]$$

where $\mathbb{E}$ denotes the mathematical expectation, the attribute label $y_j$ comes from the attribute label set $\mathcal{Y}$, the residual image $\hat{r}_{v_k}$ comes from the residual image set $\hat{\mathcal{R}}$, $m$ is the number of attributes, $\omega$ is the learning weight, and $\|\cdot\|_1$ denotes the L-1 norm.
IV. Constructing the training target of discriminator $D_j$:

The adversarial loss and the classification loss are taken into account in the training target of discriminator $D_j$, defined as:

$$L_D = L^{D}_{adv} + \lambda_4 L^{D}_{cls}$$

where $\lambda_4$ is a balance parameter indicating the importance of the classification loss. The two losses are defined as follows:
i. Adversarial loss: this loss encourages the authenticity discriminator $D^{real}_j$ to distinguish real images from generated fake images. Similar to the adversarial loss applied to generator $G_j$ above, it quantifies the realism of the input image via the weighted logarithm of the $D^{real}_j$ output; a smaller adversarial loss indicates poorer generator performance. It is defined as:

$$L^{D}_{adv} = -\,\mathbb{E}_{x \sim \mathcal{X}}\!\left[ \omega \log D^{real}_j(x) \right] - \mathbb{E}_{\hat{x}_{v_k} \sim \hat{\mathcal{X}}}\!\left[ \omega \log\!\left( 1 - D^{real}_j(\hat{x}_{v_k}) \right) \right]$$

where $\mathbb{E}$ denotes the mathematical expectation, the attribute label $y_j$ comes from the attribute label set $\mathcal{Y}$, the face image $x$ comes from the face image set $\mathcal{X}$, the edited image $\hat{x}_{v_k}$ comes from the edited image set $\hat{\mathcal{X}}$, $v_k$ is the k-th value of the j-th attribute, $j = 1, 2, \ldots, m$, $m$ is the number of attributes, and $\omega$ is the learning weight.
ii. Classification loss: this loss quantifies the agreement between the predicted and true values of the j-th attribute of the input image. Minimizing the classification loss of the class discriminator $D^{cls}_j$ ensures that $D^{cls}_j$ judges the attribute values of input images accurately. The classification loss is defined as:

$$L^{D}_{cls} = \mathbb{E}_{x \sim \mathcal{X}}\!\left[ \omega \, \ell(y_j, x) \right]$$

where $\mathbb{E}$ denotes the mathematical expectation, the attribute label $y_j$ comes from the attribute label set $\mathcal{Y}$, the face image $x$ comes from the face image set $\mathcal{X}$, $j = 1, 2, \ldots, m$, $m$ is the number of attributes, $\omega$ is the learning weight, and $\ell$ is the binary cross-entropy function defined above as $\ell(y_j, x) = -\,y_j \log D^{cls}_j(x) - (1 - y_j) \log\!\left( 1 - D^{cls}_j(x) \right)$. The class discriminator $D^{cls}_j$ judges whether the j-th attribute of the input image is correct.
V. Constructing the respective optimizers of the generator and the discriminator:

To improve training stability and speed, both the generator and the discriminator use the Adam optimizer, which combines first-moment and second-moment estimates of the gradient to update the learning step and automatically adjust the learning rate.
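A minimal setup sketch, assuming G_j and D_j are the generator and discriminator modules; the learning rate and beta values below are common GAN defaults, not values prescribed by the invention.

```python
import torch

G_opt = torch.optim.Adam(G_j.parameters(), lr=2e-4, betas=(0.5, 0.999))
D_opt = torch.optim.Adam(D_j.parameters(), lr=2e-4, betas=(0.5, 0.999))
```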
In step 3), each conditional generative adversarial network is trained through the following steps (a training-loop sketch in code follows the list):
3.1) loading the preprocessed data set and the constructed conditional generative adversarial network;
3.2) inputting the face images in the data set and artificially specified target attribute values to the generator and the discriminator in batches;
3.3) calculating the respective loss values from the respective outputs of the generator and the discriminator and their respective training targets;
3.4) calculating the respective gradient values from the respective loss values of the generator and the discriminator and the Adam optimizer;
3.5) performing back-propagation according to the respective gradient values of the generator and the discriminator and adjusting the respective parameters.
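The following PyTorch sketch puts steps 3.1-3.5 together for one conditional network, reusing the loss functions and optimizers sketched above; the loader contract (yielding image, current binary attribute value, and learning weight) and the attribute-flipping rule are our assumptions.

```python
def train_one_network(G_j, D_j, loader, epochs, device, G_opt, D_opt):
    """Independently train one conditional GAN for a single binary attribute."""
    for _ in range(epochs):
        for x, y_j, w in loader:                     # steps 3.1 / 3.2
            x, y_j, w = x.to(device), y_j.to(device), w.to(device)
            target_v = 1.0 - y_j                     # target attribute value
            residual = G_j(x, target_v)              # residual image
            edited = (x + residual).clamp(-1, 1)     # pixel-level overlay

            # Discriminator step (3.3-3.5 for D_j).
            d_real_x, d_cls_x = D_j(x)
            d_real_fake, _ = D_j(edited.detach())
            d_loss = discriminator_loss(d_real_x, d_real_fake, d_cls_x, y_j, w)
            D_opt.zero_grad(); d_loss.backward(); D_opt.step()

            # Generator step (3.3-3.5 for G_j), including reconstruction.
            d_real_fake, d_cls_fake = D_j(edited)
            recon = edited + G_j(edited, y_j)        # edit back to original
            g_loss = generator_loss(d_real_fake, d_cls_fake, target_v,
                                    x, recon, residual, w)
            G_opt.zero_grad(); g_loss.backward(); G_opt.step()
```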
In step 4), all trained generators are stacked into a stacked structure, and the corresponding face attributes of a preprocessed unknown face image are edited sequentially through the following steps (see the sketch after this list):
4.1) stacking all generators;
4.2) preprocessing the unknown face image and artificially setting several target attribute values;
4.3) inputting the face image and the target attribute values to the generators one by one, each generator changing its corresponding attribute, generating a residual image, combining it with its input face image, and outputting a complete face image; the first generator receives the preprocessed unknown face image, each subsequent generator receives the complete face image output by the previous generator, and the output of the last generator is the complete face image with all attributes edited.
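A minimal sketch of the stacked editing pass, assuming each generator follows the interface used above; treating a None target as "attribute left unchanged" is our convention, not the patent's.

```python
def edit_attributes(generators, x, target_values):
    """Apply the stacked generators one by one (steps 4.1-4.3)."""
    out = x
    for G_j, v_k in zip(generators, target_values):
        if v_k is None:                       # attribute not edited
            continue
        residual = G_j(out, v_k)              # residual image for this attribute
        out = (out + residual).clamp(-1, 1)   # complete face image passed on
    return out
```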
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention solves the face multi-attribute editing task with a stacked structure for the first time, so that each face attribute can be edited independently and the adjustment of one attribute does not affect the other attributes.
2. The invention introduces residual image generation into the multi-attribute editing scenario for the first time, avoiding the information loss caused by reconstructing attribute-irrelevant parts of the image.
3. By applying a learning weight to each sample, the invention suppresses the influence of the unbalanced-data problem and improves the model's ability to edit minority-class attribute values.
4. When an additional attribute is considered or an existing attribute is dropped, the invention does not need to retrain the whole model from scratch; only part of the conditional generative adversarial networks need to be added or removed, which makes the method flexible.
5. Training each conditional generative adversarial network independently for a single attribute, together with residual image generation, reduces the learning difficulty of the single-attribute editing task, so each conditional generative adversarial network can adopt a lighter-weight structure, reducing training cost.
6. The invention finally generates an image using only the generators of the conditional generative adversarial networks of the relevant attributes, rather than all generators, which makes the model more efficient.
Drawings
FIG. 1 is a logic flow diagram of the present invention.
FIG. 2 is a schematic diagram of the training phase of the present invention.
FIG. 3 is a schematic diagram of the image generation stage according to the present invention.
FIG. 4 is a diagram illustrating an example of independently editing a single attribute according to the present invention.
FIG. 5 is a diagram illustrating an example of editing 3 attributes one by one according to the present invention.
Detailed Description
The present invention will be further described with reference to the following specific examples.
As shown in FIGS. 1 to 3, the face attribute editing method based on a balanced stacked generative adversarial network provided in this embodiment uses a stacked structure, residual image generation, and weighted learning, and comprises the following steps:
1) Acquiring face images $x$ and attribute labels $y_j$, and preprocessing them. The data set is obtained from a face data set published on the Internet. Each face image contains a single face and is preprocessed by cropping, scaling, and normalization, so that the face occupies the main frame of the image and pixel values lie between -1 and 1; $x$ denotes a face image, and all $x$ form the face image set $\mathcal{X}$, i.e. $x \in \mathcal{X}$. The attribute labels indicate the values (also called attribute values) of several attributes of a face and are denoted $y_j$, where $j = 1, 2, \ldots, m$. In this example only 8 attributes are taken, i.e. $m = 8$; each is a binary attribute, such as mouth open/closed or glasses present/absent, so after processing each attribute has only two values, 0 and 1, with different imbalance ratios (the ratio between the numbers of samples with the two values of the same attribute). All attribute labels $y_j$ form the attribute label set $\mathcal{Y}$, i.e. $y_j \in \mathcal{Y}$; the 8 attribute labels combined with 1 face image form 1 sample pair. The samples are obtained from the face data set CelebA published on the web and divided into training data and test data in a certain proportion, containing 182,637 and 19,962 sample pairs, respectively.
Each obtained face image contains a human face, has three RGB channels, a size of 128 × 128, and jpg format.
The 8 obtained attributes are shown in Table 1:

TABLE 1 Attributes and corresponding imbalance ratios (the numeric ratios of the original table are not reproduced in this text)

| Attribute | Two values |
|---|---|
| Mouth | open / closed |
| Eye bags | present / absent |
| Eyebrows | bushy / thin |
| Cheek color | rosy / ordinary |
| Glasses | present / absent |
| Sideburns | present / absent |
| Skin tone | pale / normal |
| Mustache | present / absent |

Each attribute in Table 1 has two expression forms, corresponding to its two attribute values.
2) According to the size of the face image, constructing several conditional generative adversarial networks, each consisting of a paired generator and discriminator, as follows:
I. Constructing a generator $G_j$ with an encoder-decoder structure containing four convolutional layers and four deconvolutional layers, as shown in Table 2.
Table 2 Generator structure

| Generator |
|---|
| Conv(64,3,2), BN, Leaky ReLU |
| Conv(128,3,2), BN, Leaky ReLU |
| Conv(256,3,2), BN, Leaky ReLU |
| Conv(512,3,2), BN, Leaky ReLU |
| DeConv(256,3,2), BN, Leaky ReLU |
| DeConv(128,3,2), BN, Leaky ReLU |
| DeConv(64,3,2), BN, Leaky ReLU |
| DeConv(3,3,2), Tanh |
In Table 2, Conv(d, k, s) and DeConv(d, k, s) denote a convolutional layer and a deconvolution layer with d output channels, kernel size k, and stride s. BN is batch normalization. Leaky ReLU and Tanh are two different activation functions.
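As one possible reading of Table 2, the PyTorch sketch below builds the encoder-decoder for 128 × 128 inputs; tiling the target attribute label into an extra input channel is a common conditioning choice that the patent text does not fix, and the padding values are our choices to make the spatial sizes work out.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Encoder-decoder of Table 2; outputs a residual image via Tanh."""

    def __init__(self):
        super().__init__()
        def conv(i, o):    # Conv(d, k=3, s=2), BN, Leaky ReLU
            return nn.Sequential(nn.Conv2d(i, o, 3, 2, 1),
                                 nn.BatchNorm2d(o), nn.LeakyReLU(0.2))
        def deconv(i, o):  # DeConv(d, k=3, s=2), BN, Leaky ReLU
            return nn.Sequential(
                nn.ConvTranspose2d(i, o, 3, 2, 1, output_padding=1),
                nn.BatchNorm2d(o), nn.LeakyReLU(0.2))
        self.encoder = nn.Sequential(conv(4, 64), conv(64, 128),
                                     conv(128, 256), conv(256, 512))
        self.decoder = nn.Sequential(
            deconv(512, 256), deconv(256, 128), deconv(128, 64),
            nn.ConvTranspose2d(64, 3, 3, 2, 1, output_padding=1), nn.Tanh())

    def forward(self, x, v):                # x: (B,3,128,128), v: (B,)
        v_map = v.view(-1, 1, 1, 1).expand(-1, 1, x.size(2), x.size(3))
        z = self.encoder(torch.cat([x, v_map], dim=1))   # hidden code z
        return self.decoder(z)                           # residual image
```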
II. Discriminator $D_j$ comprises two sub-discriminators: the authenticity discriminator $D^{real}_j$ and the class discriminator $D^{cls}_j$; see Table 3 for details.
TABLE 3 Discriminator structure (the rows of the original table are not reproduced in this text)
In Table 3, Conv(d, k, s) denotes a convolutional layer with d output channels, kernel size k, and stride s. BN is batch normalization. FC(c) is a fully connected layer with c-dimensional output. m is the number of target attributes. Leaky ReLU and Sigmoid are two different activation functions.
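Since the rows of Table 3 are not reproduced, the sketch below assumes a backbone mirroring the generator's encoder, with two independent two-layer fully connected heads ending in Sigmoid as the note describes; all channel and hidden sizes are assumptions.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Shared convolutional backbone with two independent FC heads."""

    def __init__(self, img_size=128):
        super().__init__()
        def conv(i, o):
            return nn.Sequential(nn.Conv2d(i, o, 3, 2, 1),
                                 nn.BatchNorm2d(o), nn.LeakyReLU(0.2))
        self.backbone = nn.Sequential(conv(3, 64), conv(64, 128),
                                      conv(128, 256), conv(256, 512),
                                      nn.Flatten())
        feat = 512 * (img_size // 16) ** 2
        def head():       # two-layer fully connected head, FC(256) -> FC(1)
            return nn.Sequential(nn.Linear(feat, 256), nn.LeakyReLU(0.2),
                                 nn.Linear(256, 1), nn.Sigmoid())
        self.real_head = head()   # D_real: probability the image is real
        self.cls_head = head()    # D_cls: probability of the attribute value

    def forward(self, x):
        h = self.backbone(x)
        return self.real_head(h).squeeze(1), self.cls_head(h).squeeze(1)
```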
III. Constructing the training target of generator $G_j$. The adversarial loss $L^{G}_{adv}$, together with three other loss components, the classification loss $L^{G}_{cls}$, the reconstruction loss $L_{rec}$, and the regularization loss $L_{reg}$, is taken into account in generator $G_j$. Finally, the training target $L_G$ of $G_j$ comprises the four components and is defined as:

$$L_G = L^{G}_{adv} + \lambda_1 L^{G}_{cls} + \lambda_2 L_{rec} + \lambda_3 L_{reg}$$

where $\lambda_1, \lambda_2, \lambda_3$ are balance parameters indicating the importance of the corresponding losses. Notably, all four loss components take the learning weight into account so as to emphasize samples with minority-class attribute values. These losses are defined as follows:
Adversarial loss: this loss quantifies the realism of the generated image and is defined as:

$$L^{G}_{adv} = -\,\mathbb{E}_{\hat{x}_{v_k} \sim \hat{\mathcal{X}}}\!\left[ \omega \log D^{real}_j(\hat{x}_{v_k}) \right]$$

where $\mathbb{E}$ denotes the mathematical expectation, the attribute label $y_j$ comes from the attribute label set $\mathcal{Y}$, the edited image $\hat{x}_{v_k}$ comes from the edited image set $\hat{\mathcal{X}}$, $v_k$ is the k-th value of the j-th attribute, $j = 1, 2, \ldots, m$, $m$ is the number of attributes, and $\omega$ is the learning weight. The authenticity discriminator $D^{real}_j$ judges the authenticity of the input image; minimizing this loss improves the similarity between images synthesized by $G_j$ and real images.
Classification loss: this loss measures the degree to which the edited value of the j-th attribute of image $x$ matches the target attribute value $v_k$. It takes the form of a weighted binary cross-entropy loss, defined as:

$$L^{G}_{cls} = \mathbb{E}\!\left[ \omega \, \ell(v_k, \hat{x}_{v_k}) \right]$$

where $\mathbb{E}$ denotes the mathematical expectation, the attribute label $y_j$ comes from the attribute label set $\mathcal{Y}$, the target attribute label comes from the target attribute label set $\hat{\mathcal{Y}}$, the edited image $\hat{x}_{v_k}$ comes from the edited image set $\hat{\mathcal{X}}$, $v_k$ is the k-th value of the j-th attribute, $j = 1, 2, \ldots, m$, $m$ is the number of attributes, $\omega$ is the learning weight, and $\ell$ is the binary cross-entropy function, defined as $\ell(y_j, x) = -\,y_j \log D^{cls}_j(x) - (1 - y_j) \log\!\left( 1 - D^{cls}_j(x) \right)$. The class discriminator $D^{cls}_j$ judges whether the j-th attribute of the input image has been edited correctly.
Reconstruction loss: this loss avoids information loss when attribute-irrelevant regions are reconstructed in the generated image. The reconstruction loss $L_{rec}$ measures the difference between the original image $x$ and the reconstructed image $g_j$ from the reconstructed image set $\mathcal{G}$, where $g_j$ is the result of passing the edited image $\hat{x}_{v_k}$ through generator $G_j$ again, conditioned on the original attribute value, i.e. $g_j = \hat{x}_{v_k} + G_j(\hat{x}_{v_k}, y_j)$. The reconstruction loss is defined as:

$$L_{rec} = \mathbb{E}\!\left[ \omega \left\| x - g_j \right\|_1 \right]$$

where $\mathbb{E}$ denotes the mathematical expectation, the attribute label $y_j$ comes from the attribute label set $\mathcal{Y}$, the face image $x$ comes from the face image set $\mathcal{X}$, the reconstructed image $g_j$ comes from the reconstructed image set $\mathcal{G}$, $m$ is the number of attributes, $\omega$ is the learning weight, and $\|\cdot\|_1$ denotes the L-1 norm.
Regularization loss: in generator $G_j$, the residual image, rather than the full image, acts as the direct learning target and represents local pixel changes for the target attribute. The residual image should theoretically be sparse, with a large number of zero-valued pixels. A regularization loss is therefore introduced, defined as:

$$L_{reg} = \mathbb{E}\!\left[ \omega \left\| \hat{r}_{v_k} \right\|_1 \right]$$

where $\mathbb{E}$ denotes the mathematical expectation, the attribute label $y_j$ comes from the attribute label set $\mathcal{Y}$, the residual image $\hat{r}_{v_k}$ comes from the residual image set $\hat{\mathcal{R}}$, $m$ is the number of attributes, $\omega$ is the learning weight, and $\|\cdot\|_1$ denotes the L-1 norm.
IV. Constructing the training target of discriminator $D_j$. The adversarial loss and the classification loss are taken into account in the training target of discriminator $D_j$, defined as:

$$L_D = L^{D}_{adv} + \lambda_4 L^{D}_{cls}$$

where $\lambda_4$ is a balance parameter indicating the importance of the classification loss. The two losses are defined as follows:
i. Adversarial loss: this loss encourages the authenticity discriminator $D^{real}_j$ to distinguish real images from generated fake images. Similar to the adversarial loss applied to generator $G_j$ above, it quantifies the realism of the input image via the weighted logarithm of the $D^{real}_j$ output; a smaller adversarial loss indicates poorer generator performance. It is defined as:

$$L^{D}_{adv} = -\,\mathbb{E}_{x \sim \mathcal{X}}\!\left[ \omega \log D^{real}_j(x) \right] - \mathbb{E}_{\hat{x}_{v_k} \sim \hat{\mathcal{X}}}\!\left[ \omega \log\!\left( 1 - D^{real}_j(\hat{x}_{v_k}) \right) \right]$$

where $\mathbb{E}$ denotes the mathematical expectation, the attribute label $y_j$ comes from the attribute label set $\mathcal{Y}$, the face image $x$ comes from the face image set $\mathcal{X}$, the edited image $\hat{x}_{v_k}$ comes from the edited image set $\hat{\mathcal{X}}$, $v_k$ is the k-th value of the j-th attribute, $j = 1, 2, \ldots, m$, $m$ is the number of attributes, and $\omega$ is the learning weight.
ii. Classification loss: this loss quantifies the agreement between the predicted and true values of the j-th attribute of the input image. Minimizing the classification loss of the class discriminator $D^{cls}_j$ ensures that $D^{cls}_j$ judges the attribute values of input images accurately. The classification loss is defined as:

$$L^{D}_{cls} = \mathbb{E}_{x \sim \mathcal{X}}\!\left[ \omega \, \ell(y_j, x) \right]$$

where $\mathbb{E}$ denotes the mathematical expectation, the attribute label $y_j$ comes from the attribute label set $\mathcal{Y}$, the face image $x$ comes from the face image set $\mathcal{X}$, $m$ is the number of attributes, $\omega$ is the learning weight, and $\ell$ is the binary cross-entropy function defined above. The class discriminator $D^{cls}_j$ judges whether the j-th attribute of the input image is correct.
V. Constructing the optimizers of generator $G_j$ and discriminator $D_j$. To improve training stability and speed, both the generator and the discriminator use the Adam optimizer.
3) As shown in FIG. 2, the method utilizes the preprocessed data set to independently train all conditional generative adversarial networks for the different face attributes, adopting weighted learning and residual image generation.
Residual image generation and weighted learning are adopted in the training process to address attribute entanglement and the poor editing of minority-class attribute values. For a face image $x$ whose j-th attribute takes value $v_k$, the learning weight $\omega$ is defined as:

$$\omega = \frac{\max_{1 \le l \le |V_j|} \left| \mathcal{X}^{j}_{v_l} \right|}{\left| \mathcal{X}^{j}_{v_k} \right|}$$

where $|V_j|$ is the number of distinct values of the j-th attribute, and $|\mathcal{X}^{j}_{v_k}|$ is the number of samples whose j-th attribute takes value $v_k$, i.e. $\mathcal{X}^{j}_{v_k} = \{ x \mid (x, y_j) \in (\mathcal{X}, \mathcal{Y}),\ y_j = v_k \}$, where $(x, y_j)$ is a sample pair formed by face image $x$ and attribute label $y_j$, $(\mathcal{X}, \mathcal{Y})$ is the set of such pairs, and $|\cdot|$ denotes the cardinality of a set. The learning weight has two characteristics: the more samples a value has, the smaller the weight; and the weight is always greater than or equal to one. From this formula and the numbers of samples (or imbalance ratios) of the different attribute values, the weights shown in Table 4 below can be calculated.
TABLE 4 Weight values of all attribute values (the numeric values of the original table are not reproduced in this text)
The whole training process comprises the following steps:
3.1) loading the preprocessed data set and the constructed conditional generative adversarial network;
3.2) inputting the face images in the data set and artificially specified target attribute values to the generator and the discriminator in batches;
3.3) calculating the respective loss values from the respective outputs of the generator and the discriminator and their respective training targets;
3.4) calculating the gradient values of the respective parameters from the respective loss values of the generator and the discriminator and the Adam optimizer;
3.5) performing back-propagation according to the respective gradient values of the generator and the discriminator and adjusting the respective parameters.
4) As shown in FIG. 3, the method stacks all trained generators into a stacked structure and sequentially edits the corresponding face attributes of the preprocessed unknown face image through the following steps:
4.1) stacking all generators;
4.2) preprocessing the unknown face image and artificially setting several target attribute values;
4.3) inputting the face image and the target attribute values to the generators one by one, each generator changing its corresponding attribute, generating a residual image, combining it with its input face image, and outputting a complete face image. Except for the first generator, which receives the preprocessed unknown face image, each generator receives the complete face image output by the previous one; the output of the last generator is the complete face image with all attributes edited.
Finally, the example results shown in FIGS. 4 and 5 can be obtained by the method of the invention. FIG. 4 shows examples of editing single attributes independently: when the conditional generative adversarial networks are not stacked, the invention can independently and correctly edit the 8 attributes of the input image (eye bags, eyebrows, mouth, mustache, sideburns, glasses, cheek color, skin tone), with good editing of minority-class attribute values, accurate edits, low noise, and no mutual interference between different attributes, which can be verified by means of the residual images. FIG. 5 shows an example of editing 3 attributes (mouth, mustache, skin tone) one by one: when the conditional generative adversarial networks are stacked, the invention edits the 3 attributes of the input image one after another and finally completes all 3 edits, again with good editing of minority-class attribute values, accurate edits, low noise, and no interference between attributes, as the residual images confirm. These advantages jointly reflect the roles of the stacked structure, weighted learning, and residual image generation of the invention.
The above-mentioned embodiments are merely preferred embodiments of the invention, and the scope of the invention is not limited thereto; changes made according to the shape and principle of the invention should all be covered within its protection scope.
Claims (5)
1. A face attribute editing method based on a balanced stacked generative adversarial network, characterized in that the method adopts weighted learning and trains several conditional generative adversarial networks to solve the attribute imbalance problem, stacks the generators of all trained conditional generative adversarial networks into a stacked structure to solve the attribute entanglement problem, and uses residual image generation to solve the problem of inaccurate image editing; the method comprises the following steps:
1) acquiring a data set containing face images and attribute labels, and preprocessing it;
2) constructing, according to the size of the face images, several conditional generative adversarial networks, each consisting of a paired generator and discriminator;
3) using the preprocessed data set, independently training all conditional generative adversarial networks for the different face attributes in a weighted-learning and residual-image-generation manner;
4) stacking all the trained generators into a stacked structure, and sequentially editing the corresponding face attributes of a preprocessed unknown face image.
2. The face attribute editing method based on a balanced stacked generative adversarial network according to claim 1, wherein: in step 1), the data set is obtained from a face data set published on the Internet; each face image contains a single face, and the preprocessing comprises cropping, scaling, and normalization, so that the face occupies the main frame of the image and pixel values lie between -1 and 1; a face image is denoted by $x$, and all $x$ form the face image set $\mathcal{X}$, i.e. $x \in \mathcal{X}$; the attribute label $y_j$ denotes the label of the value of the j-th attribute of the face, where $j = 1, 2, \ldots, m$ and $m$ is the number of attributes; all attribute labels $y_j$ form the attribute label set $\mathcal{Y}$, i.e. $y_j \in \mathcal{Y}$; the $m$ attribute labels combined with 1 face image form 1 sample pair.
3. The face attribute editing method based on a balanced stacked generative adversarial network according to claim 1, wherein: in step 2), several conditional generative adversarial networks, each consisting of a paired generator and discriminator, are constructed according to the size of the face image, specifically as follows:
I. Constructing a generator $G_j$ with an encoder-decoder structure:

The encoder in generator $G_j$ consists of several convolutional layers and receives two variables: a face image $x$ from the face image set $\mathcal{X}$, and a target attribute label from the target attribute label set $\hat{\mathcal{Y}}$, namely a value $v_k$ of the j-th attribute different from the current attribute label $y_j$, where $y_j$ comes from the attribute label set $\mathcal{Y}$, $v_k$ is the k-th value of the j-th attribute, $j = 1, 2, \ldots, m$, and $m$ is the number of attributes; 1 attribute label combined with 1 face image forms 1 sample pair. $G_j$ then maps the two received variables into a hidden code $z$, which extracts the features of the face image and removes redundant information. The code $z$ is fed into a decoder consisting of several deconvolution layers to generate a residual image $\hat{r}_{v_k}$ that changes the target face attribute, where $\hat{r}_{v_k}$ is defined as the difference between the face image $x$ and the edited image $\hat{x}_{v_k}$; all $\hat{r}_{v_k}$ form the residual image set $\hat{\mathcal{R}}$, and all $\hat{x}_{v_k}$ form the edited image set $\hat{\mathcal{X}}$. Finally, the edited image is obtained by pixel-level superposition: $\hat{x}_{v_k} = x + \hat{r}_{v_k}$. Since minority-class samples are easily ignored in an unbalanced data set, every sample in the data set is given a learning weight $\omega$ representing its importance in training; samples with minority-class attribute values receive a large $\omega$, and vice versa. For a face image $x$ whose j-th attribute takes value $v_k$, $\omega$ is defined as:

$$\omega = \frac{\max_{1 \le l \le |V_j|} \left| \mathcal{X}^{j}_{v_l} \right|}{\left| \mathcal{X}^{j}_{v_k} \right|}$$

where $|V_j|$ is the number of distinct values of the j-th attribute, and $|\mathcal{X}^{j}_{v_k}|$ is the number of samples whose j-th attribute takes value $v_k$, i.e. $\mathcal{X}^{j}_{v_k} = \{ x \mid (x, y_j) \in (\mathcal{X}, \mathcal{Y}),\ y_j = v_k \}$, where $(x, y_j)$ is a sample pair formed by face image $x$ and attribute label $y_j$, $(\mathcal{X}, \mathcal{Y})$ is the set of such pairs, and $|\cdot|$ denotes the cardinality of a set; $\omega$ is the ratio between the largest sample count among the values of the j-th attribute and the sample count for value $v_k$, and is always greater than or equal to 1: the smaller $|\mathcal{X}^{j}_{v_k}|$, the larger the weight $\omega$, and vice versa; this weight encourages the model to pay more attention to samples with minority-class attribute values during learning;
II. Constructing the complete discriminator $D_j$ from two sub-discriminators:

Discriminator $D_j$ comprises two sub-discriminators: an authenticity discriminator $D^{real}_j$ and a class discriminator $D^{cls}_j$; the authenticity discriminator $D^{real}_j$ judges the authenticity of the input image, predicting the probability that it is a real image; the class discriminator $D^{cls}_j$ identifies whether the attribute value of the input image matches the target attribute value and predicts the degree of conformity, where different values of the same attribute are regarded as different classes; the two sub-discriminators share one multi-layer convolutional neural network but have independent two-layer fully connected heads;
III. Constructing the training target of the generator:

The adversarial loss $L^{G}_{adv}$, together with three other loss components, the classification loss $L^{G}_{cls}$, the reconstruction loss $L_{rec}$, and the regularization loss $L_{reg}$, is taken into account in generator $G_j$; the adversarial loss $L^{G}_{adv}$ ensures that the generated image is realistic; the classification loss $L^{G}_{cls}$ controls that the face image is edited correctly according to the target attribute value $v_k$; the reconstruction loss $L_{rec}$ strengthens the ability of generator $G_j$ to retain the information of attribute-irrelevant regions during editing; the regularization loss $L_{reg}$ measures the L-1 norm of the residual image to enhance its sparsity, since the residual image should contain a large number of zero-valued pixels; finally, the training target $L_G$ of generator $G_j$ comprises the four components and is defined as:

$$L_G = L^{G}_{adv} + \lambda_1 L^{G}_{cls} + \lambda_2 L_{rec} + \lambda_3 L_{reg}$$

where $\lambda_1, \lambda_2, \lambda_3$ are balance parameters indicating the importance of the corresponding losses; notably, all four loss components take the learning weight into account so as to emphasize samples with minority-class attribute values; these loss components are as follows:
Adversarial loss: this loss quantifies the realism of the generated image and is defined as:

$$L^{G}_{adv} = -\,\mathbb{E}_{\hat{x}_{v_k} \sim \hat{\mathcal{X}}}\!\left[ \omega \log D^{real}_j(\hat{x}_{v_k}) \right]$$

where $\mathbb{E}$ denotes the mathematical expectation, the attribute label $y_j$ comes from the attribute label set $\mathcal{Y}$, the edited image $\hat{x}_{v_k}$ comes from the edited image set $\hat{\mathcal{X}}$, $v_k$ is the k-th value of the j-th attribute, $j = 1, 2, \ldots, m$, $m$ is the number of attributes, and $\omega$ is the learning weight; the authenticity discriminator $D^{real}_j$ judges the authenticity of the input image; minimizing this loss improves the similarity between images synthesized by $G_j$ and real images;
Classification loss: this loss measures the degree to which the edited value of the j-th attribute of face image $x$ matches the target attribute value $v_k$; it takes the form of a weighted binary cross-entropy loss, defined as:

$$L^{G}_{cls} = \mathbb{E}\!\left[ \omega \, \ell(v_k, \hat{x}_{v_k}) \right]$$

where $\mathbb{E}$ denotes the mathematical expectation, the attribute label $y_j$ comes from the attribute label set $\mathcal{Y}$, the target attribute label comes from the target attribute label set $\hat{\mathcal{Y}}$, the edited image $\hat{x}_{v_k}$ comes from the edited image set $\hat{\mathcal{X}}$, $v_k$ is the k-th value of the j-th attribute, $j = 1, 2, \ldots, m$, $m$ is the number of attributes, $\omega$ is the learning weight, and $\ell$ is the binary cross-entropy function, defined as $\ell(y_j, x) = -\,y_j \log D^{cls}_j(x) - (1 - y_j) \log\!\left( 1 - D^{cls}_j(x) \right)$; the class discriminator $D^{cls}_j$ judges whether the j-th attribute of the input image has been edited correctly;
Reconstruction loss: this loss avoids information loss when attribute-irrelevant regions are reconstructed in the generated image; the reconstruction loss $L_{rec}$ measures the difference between the original image $x$ and the reconstructed image $g_j$ from the reconstructed image set $\mathcal{G}$, where $g_j$ is the result of passing the edited image $\hat{x}_{v_k}$ through generator $G_j$ again, conditioned on the original attribute value, i.e. $g_j = \hat{x}_{v_k} + G_j(\hat{x}_{v_k}, y_j)$; the reconstruction loss is defined as:

$$L_{rec} = \mathbb{E}\!\left[ \omega \left\| x - g_j \right\|_1 \right]$$

where $\mathbb{E}$ denotes the mathematical expectation, the attribute label $y_j$ comes from the attribute label set $\mathcal{Y}$, the face image $x$ comes from the face image set $\mathcal{X}$, the reconstructed image $g_j$ comes from the reconstructed image set $\mathcal{G}$, $m$ is the number of attributes, $\omega$ is the learning weight, and $\|\cdot\|_1$ denotes the L-1 norm;
Regularization loss: in generator $G_j$, the direct learning target is the residual image rather than a complete image; the residual image represents local pixel changes for the target attribute and theoretically should be sparse, with a large number of zero-valued pixels; a regularization loss is therefore introduced, defined as:

$$L_{reg} = \mathbb{E}\!\left[ \omega \left\| \hat{r}_{v_k} \right\|_1 \right]$$

where $\mathbb{E}$ denotes the mathematical expectation, the attribute label $y_j$ comes from the attribute label set $\mathcal{Y}$, the residual image $\hat{r}_{v_k}$ comes from the residual image set $\hat{\mathcal{R}}$, $m$ is the number of attributes, $\omega$ is the learning weight, and $\|\cdot\|_1$ denotes the L-1 norm;
IV. Constructing the training target of discriminator $D_j$:

The adversarial loss and the classification loss are taken into account in the training target of discriminator $D_j$, defined as:

$$L_D = L^{D}_{adv} + \lambda_4 L^{D}_{cls}$$

where $\lambda_4$ is a balance parameter indicating the importance of the classification loss; the two losses are defined as follows:
i. Adversarial loss: this loss encourages the authenticity discriminator $D^{real}_j$ to distinguish real images from generated fake images; similar to the adversarial loss applied to generator $G_j$ above, it quantifies the realism of the input image via the weighted logarithm of the $D^{real}_j$ output, and a smaller adversarial loss indicates poorer generator performance; it is defined as:

$$L^{D}_{adv} = -\,\mathbb{E}_{x \sim \mathcal{X}}\!\left[ \omega \log D^{real}_j(x) \right] - \mathbb{E}_{\hat{x}_{v_k} \sim \hat{\mathcal{X}}}\!\left[ \omega \log\!\left( 1 - D^{real}_j(\hat{x}_{v_k}) \right) \right]$$

where $\mathbb{E}$ denotes the mathematical expectation, the attribute label $y_j$ comes from the attribute label set $\mathcal{Y}$, the face image $x$ comes from the face image set $\mathcal{X}$, the edited image $\hat{x}_{v_k}$ comes from the edited image set $\hat{\mathcal{X}}$, $v_k$ is the k-th value of the j-th attribute, $j = 1, 2, \ldots, m$, $m$ is the number of attributes, and $\omega$ is the learning weight; the authenticity discriminator $D^{real}_j$ judges the authenticity of the input image;
ii. Classification loss: this loss quantifies the agreement between the predicted and true values of the j-th attribute of the input image; minimizing the classification loss of the class discriminator $D^{cls}_j$ ensures that $D^{cls}_j$ judges the attribute values of input images accurately; the classification loss is defined as:

$$L^{D}_{cls} = \mathbb{E}_{x \sim \mathcal{X}}\!\left[ \omega \, \ell(y_j, x) \right]$$

where $\mathbb{E}$ denotes the mathematical expectation, the attribute label $y_j$ comes from the attribute label set $\mathcal{Y}$, the face image $x$ comes from the face image set $\mathcal{X}$, $m$ is the number of attributes, $\omega$ is the learning weight, and $\ell(y_j, x)$ is the binary cross-entropy function defined as $\ell(y_j, x) = -\,y_j \log D^{cls}_j(x) - (1 - y_j) \log\!\left( 1 - D^{cls}_j(x) \right)$; the class discriminator $D^{cls}_j$ judges whether the j-th attribute of the input image is correct;
V. Constructing the respective optimizers of the generator and the discriminator:

To improve training stability and speed, both the generator and the discriminator use the Adam optimizer, which combines first-moment and second-moment estimates of the gradient to update the learning step and automatically adjust the learning rate.
4. The face attribute editing method based on a balanced stacked generative adversarial network according to claim 1, wherein: in step 3), each conditional generative adversarial network is trained through the following steps:
3.1) loading the preprocessed data set and the constructed conditional generative adversarial network;
3.2) inputting the face images in the data set and artificially specified target attribute values to the generator and the discriminator in batches;
3.3) calculating the respective loss values from the respective outputs of the generator and the discriminator and their respective training targets;
3.4) calculating the respective gradient values from the respective loss values of the generator and the discriminator and the Adam optimizer;
3.5) performing back-propagation according to the respective gradient values of the generator and the discriminator and adjusting the respective parameters.
5. The face attribute editing method based on a balanced stacked generative adversarial network according to claim 1, wherein: in step 4), all trained generators are stacked into a stacked structure, and the corresponding face attributes of a preprocessed unknown face image are edited sequentially through the following steps:
4.1) stacking all generators;
4.2) preprocessing the unknown face image and artificially setting several target attribute values;
4.3) inputting the face image and the target attribute values to the generators one by one, each generator changing its corresponding attribute, generating a residual image, combining it with its input face image, and outputting a complete face image; the first generator receives the preprocessed unknown face image, each subsequent generator receives the complete face image output by the previous generator, and the output of the last generator is the complete face image with all attributes edited.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010521351.2A CN111914617B (en) | 2020-06-10 | 2020-06-10 | Face attribute editing method based on a balanced stacked generative adversarial network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111914617A true CN111914617A (en) | 2020-11-10 |
CN111914617B CN111914617B (en) | 2024-05-07 |
Family
ID=73237577
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010521351.2A Active CN111914617B (en) | 2020-06-10 | 2020-06-10 | Face attribute editing method based on balanced stack type generation type countermeasure network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111914617B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105678340A (en) * | 2016-01-20 | 2016-06-15 | 福州大学 | Automatic image marking method based on enhanced stack type automatic encoder |
CN108932693A (en) * | 2018-06-15 | 2018-12-04 | 中国科学院自动化研究所 | Face editor complementing method and device based on face geological information |
CN109377535A (en) * | 2018-10-24 | 2019-02-22 | 电子科技大学 | Facial attribute automatic edition system, method, storage medium and terminal |
CN109615582A (en) * | 2018-11-30 | 2019-04-12 | 北京工业大学 | A kind of face image super-resolution reconstruction method generating confrontation network based on attribute description |
Non-Patent Citations (1)
Title |
---|
YU HE; YU NANNAN: "Fast-convergence GAN for chest X-ray image data augmentation based on multi-size convolution and residual units", Journal of Signal Processing (信号处理), no. 12, 25 December 2019 (2019-12-25) *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112991150A (en) * | 2021-02-08 | 2021-06-18 | 北京字跳网络技术有限公司 | Style image generation method, model training method, device and equipment |
CN112990078A (en) * | 2021-04-02 | 2021-06-18 | 深圳先进技术研究院 | Facial expression generation method based on generation type confrontation network |
Also Published As
Publication number | Publication date |
---|---|
CN111914617B (en) | 2024-05-07 |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |