CN113888399B - Face age synthesis method based on style fusion and domain selection structure - Google Patents


Info

Publication number
CN113888399B
Authority
CN
China
Prior art keywords
domain
layer
style
representing
age
Prior art date
Legal status: Active
Application number
CN202111240317.9A
Other languages
Chinese (zh)
Other versions
CN113888399A
Inventor
郭迎春
夏伟毅
于洋
朱叶
阎刚
郝小可
师硕
刘依
吕华
于明
Current Assignee
Hebei University of Technology
Original Assignee
Hebei University of Technology
Priority date
Filing date
Publication date
Application filed by Hebei University of Technology
Priority to CN202111240317.9A
Publication of CN113888399A
Application granted
Publication of CN113888399B


Classifications

    • G06T 3/04 — Geometric image transformations in the plane of the image; context-preserving transformations, e.g. by using an importance map
    • G06T 3/02 — Geometric image transformations in the plane of the image; affine transformations
    • G06N 3/045 — Neural network architectures; combinations of networks
    • G06N 3/084 — Neural network learning methods; backpropagation, e.g. using gradient descent


Abstract

The invention relates to a face age synthesis method based on style fusion and a domain selection structure, comprising the following steps: preprocessing a public face data set, where each face sample carries an age label and the number of age domains is set; constructing an adversarial network with style fusion and multi-domain discrimination, the network comprising a style-fusion-based generator network and a domain selection discriminator network; the domain selection discriminator network comprises a plurality of domain selection structures and a fully connected layer, where each domain selection structure is composed of two types of functions: a base function and a plurality of domain functions. For each batch of input images, features are extracted using only the base function and one specific domain function, where the number of domain functions equals the number of age domains. The invention effectively alleviates the problems of lost face identity information and unstable training.

Description

Face age synthesis method based on style fusion and domain selection structure
Technical Field
The technical scheme of the invention relates to the field of face generation, and in particular to a face age synthesis method based on style fusion and a domain selection structure.
Background
Face age synthesis, i.e., face aging/rejuvenation, aims to generate the facial appearance of a given face image at different ages while preserving the identity characteristics of the source image in the generated result. In recent years, with the rapid development of deep learning, face age synthesis has made significant breakthroughs: it can provide data augmentation for face recognition systems, assist criminal investigation by predicting the future appearance of missing children and suspects, and be applied to special effects in film and entertainment.
Existing face age synthesis methods fall into two main categories: traditional methods and deep-learning-based methods. Traditional methods can be further divided into physical-model-based and template-based approaches. CN101556699A discloses a face aging image synthesis method that matches face feature points against multiple images of the same face at different ages in a database and obtains the final aged image through texture enhancement and color transformation. Traditional face age synthesis methods require complex simulation steps, carry a heavy computational burden, offer poor age synthesis quality and identity preservation, and cannot be put into practical use. Deep-learning-based face age synthesis methods mainly adopt the idea of generative adversarial networks, whose training involves two networks: a generator and a discriminator. The generator is responsible for image generation; the discriminator learns the real data distribution and, by judging the authenticity of generated images, constrains the generator to fit that distribution so that it can produce realistic face images. The paper "Age Progression/Regression by Conditional Adversarial Autoencoder" published by Zhifei Zhang at the IEEE Conference on Computer Vision and Pattern Recognition in 2017 first applied convolutional neural networks and generative adversarial networks to face age synthesis, simplifying the synthesis pipeline. CN111612872A discloses a face age-varying adversarial image generation method and system that uses a spatial attention mechanism to make the generator focus on age-related face regions, but it injects the condition by concatenation along the channel dimension, which limits the age synthesis effect. CN110322394A discloses an attribute-guided face aging adversarial image generation method and device, in which the discriminator network takes multi-scale wavelet-packet-transformed features as input and uses face attribute feature vectors as condition input to learn the distributions of different attributes; however, the features the model learns are overly redundant, which limits the generator's age synthesis effect. The paper "Learning Continuous Face Age Progression: A Pyramid of GANs" by Hongyu Yang, published in IEEE Transactions on Pattern Analysis and Machine Intelligence, proposes a pyramid discriminator structure that assigns one discriminator to each age group, each learning the data distribution of its age group; this provides finer feature distributions to the generator, but age groups with few samples suffer from mode collapse.
Disclosure of Invention
In view of the shortcomings of the prior art, the technical problem the invention aims to solve is: to provide a face age synthesis method based on style fusion and a domain selection structure, designing a style fusion normalization module and a domain selection discriminator that supplies multi-domain distributions. The style fusion normalization module separately acquires identity style information and condition style information, fuses them through a multi-layer fully connected structure to obtain affine parameters that carry the source image's identity characteristics together with the target age information, performs an instance normalization operation on the decoder features and the corresponding encoder output features, and modulates the normalized features with the affine parameters. Style fusion normalization modules are placed on features of different scales throughout the generator, achieving an effective face age synthesis effect. In addition, to obtain a more stable synthesis effect, the method designs a domain selection discriminator with a convolutional neural network structure in which every layer except the fully connected output layer is a domain selection structure; during training, each domain selection structure selects a specific domain function to learn the distribution of the corresponding age domain. The invention effectively alleviates the problems of lost face identity information and unstable training.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A face age synthesis method based on style fusion and a domain selection structure, built on a style fusion normalization module and a domain selection discriminator that supplies multi-domain distributions, comprising the following:
preprocessing a public face data set to obtain a 256×256-resolution face data set through facial key point alignment, rotation, and cropping, where each face sample carries an age label and the number of age domains is set;
constructing an adversarial network with style fusion and multi-domain discrimination, the network comprising two parts: a style-fusion-based generator network and a domain selection discriminator network;
The style-fusion-based generator network adopts an encoder-decoder structure: the encoder consists of 5 convolution modules and the decoder of 5 transposed convolution modules, giving an overall structure similar to U-Net. The encoder extracts multi-scale features from the input face image; taking the last encoder layer's features as the axis of symmetry, the first four encoder feature layers and the first four decoder feature layers are symmetric in scale. The decoder comprises up-sampling layers and style fusion normalization modules; each style fusion normalization module is skip-connected to the encoder output features of the corresponding scale so as to combine them with style information. The style fusion normalization module consists of three parts: an identity extractor, a condition mapper, and a normalization module. An identity extractor is placed on the encoder output features at each scale to obtain the corresponding identity style code, a one-dimensional vector; the condition mapper maps the target age label into a condition style code, also a one-dimensional vector; the normalization module fuses the identity style and the condition style and produces the modulated features. The style fusion normalization module modulates the decoder features; through this feature modulation and fusion at different scales, the network can synthesize a face image that matches the target age while preserving the identity information of the source image;
The domain selection discriminator network adopts a convolutional neural network structure and comprises a plurality of domain selection structures and a fully connected layer, where each domain selection structure is composed of two types of functions: a base function and a plurality of domain functions. For each batch of input images, features are extracted using only the base function and one specific domain function, where the number of domain functions equals the number of age domains. Since the discriminator network involves convolution modules and fully connected layers, both are abstracted collectively as functions.
The base function is concretely implemented as a base convolution module and a base fully connected module, and the domain function as a domain selection convolution module and a domain selection fully connected module; a base convolution module and the domain selection convolution modules form a domain-selection-structure convolution module, and a base fully connected module and the domain selection fully connected modules form a domain-selection-structure fully connected module;
the base convolution module and the domain selection convolution module each consist of a convolution layer with a 4×4 kernel, stride 2, and padding 1, an instance normalization layer, and a LeakyReLU activation function;
the base fully connected module and the domain selection fully connected module each consist of a fully connected layer, a pixel normalization layer, and a LeakyReLU activation function.
The convolution module of the encoder consists of a convolution layer with a 4×4 kernel, stride 2, and padding 1, an instance normalization layer, and a LeakyReLU activation function; the up-sampling layer is a transposed convolution layer with a 4×4 kernel, stride 2, and padding 1;
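As an illustration, the two convolutional recipes above can be written in PyTorch as follows; this is a minimal sketch in which the LeakyReLU slope (0.2) and channel widths are assumptions, not values fixed by the patent.

```python
# Minimal PyTorch sketch of the convolution and up-sampling recipes above;
# kernel/stride/padding follow the text, the LeakyReLU slope is assumed.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # 4x4 conv, stride 2, padding 1 + instance normalization + LeakyReLU;
    # the same recipe is used by the encoder and the discriminator conv modules
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.InstanceNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

def upsample_layer(in_ch, out_ch):
    # transposed 4x4 conv, stride 2, padding 1 (doubles spatial resolution)
    return nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)
```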
the style fusion normalization module consists of an identity extractor, a condition mapper, and a normalization module: the identity extractor comprises a convolution layer with a 1×1 kernel and a fully connected layer; the condition mapper comprises a two-layer feature mapping structure built from fully connected layers, pixel normalization layers, and LeakyReLU activation functions; and the normalization module comprises an instance normalization layer and fully connected layers;
the specific steps of the face age synthesis method based on the style fusion and domain selection structure are as follows:
firstly, extracting multi-scale characteristics of an input face image through an encoder network:
Step 1.1: normalize the face image and convert it to a tensor, adjusting the image resolution to 256×256 to obtain the preprocessed face image.
Step 1.2: input the preprocessed face image into the first-layer convolution module to obtain the first-layer encoder output features, denoted $F_{enc}^{1}$.
Step 1.3: extract depth features of different scales from the first-layer encoder output features through the encoder network, as shown in formula (1):

$$F_{enc}^{i}=\mathrm{Conv}\left(F_{enc}^{i-1}\right),\quad i\in\{2,\dots,N_{enc}\}\qquad(1)$$

where $F_{enc}^{i}$ denotes the output features of the $i$-th encoder layer, $F_{enc}^{i-1}$ denotes the output features of the $(i-1)$-th encoder layer, Conv denotes the convolution module, and $N_{enc}$ denotes the number of encoder network layers;
thus, the extraction of the multi-scale features of the input face image is completed;
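A compact sketch of this multi-scale extraction (formula (1)), reusing the conv_block helper from the earlier sketch; the channel widths are illustrative assumptions, with $N_{enc}=5$ as in the preferred embodiment.

```python
# Encoder sketch for formula (1): each stage halves the resolution and the
# list `feats` collects F_enc^1 ... F_enc^{N_enc}. Channel widths assumed.
class Encoder(nn.Module):
    def __init__(self, chans=(3, 64, 128, 256, 512, 512)):
        super().__init__()
        self.blocks = nn.ModuleList(
            conv_block(chans[i], chans[i + 1]) for i in range(len(chans) - 1)
        )

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)       # formula (1): F_enc^i = Conv(F_enc^{i-1})
            feats.append(x)
        return feats           # multi-scale encoder output features
```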
second, extracting identity style codes by using an identity extractor:
Step 2.1: for the encoder output features of each layer, apply a convolution layer with a 1×1 kernel to obtain the corresponding learnable matrix, as shown in formula (2):

$$M_{i}=\mathrm{Conv}_{1\times1}\left(F_{enc}^{i}\right),\quad i\in\{1,\dots,N_{enc}\}\qquad(2)$$

where $M_{i}$ denotes the learnable matrix corresponding to the $i$-th layer encoder output features $F_{enc}^{i}$, $\mathrm{Conv}_{1\times1}$ denotes a convolution layer with a 1×1 kernel, and $N_{enc}$ denotes the number of encoder network layers;
Step 2.2: taking the learnable matrix from step 2.1 as an operation parameter, multiply it element-wise with the corresponding encoder output features to obtain the re-weighted features, as shown in formula (3):

$$\tilde{F}_{enc}^{i}=F_{enc}^{i}\odot M_{i},\quad i\in\{1,\dots,N_{enc}\}\qquad(3)$$

where $\tilde{F}_{enc}^{i}$ denotes the result of multiplying the $i$-th layer encoder output features $F_{enc}^{i}$ with the learnable matrix, $\odot$ denotes element-wise matrix multiplication, and $N_{enc}$ denotes the number of encoder network layers;
Step 2.3: pass the re-weighted features from step 2.2 through global average pooling and a fully connected layer to obtain the final corresponding identity style code, as shown in formula (4):

$$\mathrm{id}_{i}=\mathrm{FC}\left(\mathrm{SumPooling}\left(\tilde{F}_{enc}^{i}\right)\right),\quad i\in\{1,\dots,N_{enc}\}\qquad(4)$$

where $\mathrm{id}_{i}$ denotes the identity style code corresponding to the $i$-th layer encoder output features, FC denotes the fully connected layer, SumPooling denotes global average pooling, and $N_{enc}$ denotes the number of encoder network layers;
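The identity extractor of formulas (2)-(4) can be sketched as below (reusing the imports above); the style code dimension is an assumption.

```python
# Identity extractor sketch for formulas (2)-(4).
class IdentityExtractor(nn.Module):
    def __init__(self, in_ch, style_dim=64):
        super().__init__()
        self.to_matrix = nn.Conv2d(in_ch, in_ch, kernel_size=1)  # M_i, formula (2)
        self.fc = nn.Linear(in_ch, style_dim)                    # FC of formula (4)

    def forward(self, f_enc):
        m = self.to_matrix(f_enc)           # learnable matrix from a 1x1 conv
        weighted = f_enc * m                # element-wise re-weighting, formula (3)
        pooled = weighted.mean(dim=(2, 3))  # global average pooling
        return self.fc(pooled)              # identity style code id_i, formula (4)
```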
thirdly, extracting the conditional style codes by using a conditional mapper:
Step 3.1: convert the one-hot condition label into a one-dimensional vector and add smoothing noise $z\sim N(0,0.2^{2})$ to obtain the condition vector, as shown in formula (5):

$$\bar{y}_{n}=y_{n}+z,\quad z\sim N\left(0,0.2^{2}\right),\quad n\in\{1,\dots,N_{a}\}\qquad(5)$$

where $N_{a}$ denotes the number of age domains, $\bar{y}_{n}$ denotes the condition vector of the $n$-th attribute condition, and $y_{n}$ denotes the converted one-dimensional condition vector;
Step 3.2: apply a fully connected layer, a pixel normalization layer, and a LeakyReLU activation function to $\bar{y}_{n}$ for feature mapping, obtaining the condition mapping feature $F_{map}$, as shown in formula (6):

$$F_{map}=\mathrm{LeakyReLU}\left(\mathrm{PixelNorm}\left(\mathrm{FC}\left(\bar{y}_{n}\right)\right)\right)\qquad(6)$$

where FC denotes the fully connected layer, PixelNorm denotes the pixel normalization layer, and LeakyReLU denotes the LeakyReLU activation function;
Step 3.3: apply a fully connected layer, a pixel normalization layer, and a LeakyReLU activation function to $F_{map}$ for feature mapping, obtaining the condition style code $a$, as shown in formula (7):

$$a=\mathrm{LeakyReLU}\left(\mathrm{PixelNorm}\left(\mathrm{FC}\left(F_{map}\right)\right)\right)\qquad(7)$$

where FC denotes the fully connected layer, PixelNorm denotes the pixel normalization layer, and LeakyReLU denotes the LeakyReLU activation function;
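A sketch of the condition mapper of formulas (5)-(7); the PixelNorm definition (RMS normalization over the feature dimension) and the style dimension are assumptions.

```python
# Condition mapper sketch for formulas (5)-(7).
class PixelNorm(nn.Module):
    def forward(self, x):
        # assumed definition: normalize each feature vector by its RMS
        return x / torch.sqrt(torch.mean(x ** 2, dim=1, keepdim=True) + 1e-8)

class ConditionMapper(nn.Module):
    def __init__(self, n_age_domains, style_dim=64):
        super().__init__()
        self.map1 = nn.Sequential(nn.Linear(n_age_domains, style_dim),
                                  PixelNorm(), nn.LeakyReLU(0.2))  # formula (6)
        self.map2 = nn.Sequential(nn.Linear(style_dim, style_dim),
                                  PixelNorm(), nn.LeakyReLU(0.2))  # formula (7)

    def forward(self, y_onehot):
        # smoothing noise z ~ N(0, 0.2^2) of formula (5)
        cond = y_onehot + 0.2 * torch.randn_like(y_onehot)
        return self.map2(self.map1(cond))  # condition style code a
```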
fourth, fusing identity style codes and conditional style codes, and mapping fusion features into normalized affine parameters:
Step 4.1: channel-concatenate the identity style code $\mathrm{id}_{i}$ with the condition style code $a$ to obtain the concatenated feature $S_{i}$ corresponding to the $i$-th layer encoder output features, as shown in formula (8):

$$S_{i}=\mathrm{Concat}\left(\mathrm{id}_{i},a\right),\quad i\in\{1,\dots,N_{enc}\}\qquad(8)$$

where Concat denotes the channel concatenation operation and $N_{enc}$ denotes the number of encoder network layers;
Step 4.2: use fully connected layers to map the concatenated feature into the corresponding affine parameters, obtaining the affine parameters $\gamma_{i}$ and $\beta_{i}$ corresponding to the $i$-th layer encoder output features, as shown in formulas (9) and (10):

$$\gamma_{i}=\mathrm{FC}\left(S_{i}\right)\qquad(9)$$
$$\beta_{i}=\mathrm{FC}\left(S_{i}\right)\qquad(10)$$

where FC denotes the fully connected layer and $N_{enc}$ denotes the number of encoder network layers; $\gamma_{i}$ is the offset factor and $\beta_{i}$ is the scaling factor;
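The fusion of formulas (8)-(10) can be sketched as follows; per the labeling above, $\gamma_{i}$ is read as the offset and $\beta_{i}$ as the scale.

```python
# Style fusion sketch for formulas (8)-(10): the two style codes are
# concatenated and mapped to per-channel affine parameters.
class StyleToAffine(nn.Module):
    def __init__(self, style_dim, num_ch):
        super().__init__()
        self.to_gamma = nn.Linear(2 * style_dim, num_ch)  # offset, formula (9)
        self.to_beta = nn.Linear(2 * style_dim, num_ch)   # scale, formula (10)

    def forward(self, id_i, a):
        s = torch.cat([id_i, a], dim=1)                   # S_i, formula (8)
        return self.to_gamma(s), self.to_beta(s)
```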
fifth step, modulating decoder features with affine parameters:
Step 5.1: input the last-layer encoder output feature map into the first transposed convolution module of the decoder, then channel-concatenate the result with the skip-connected second-to-last encoder output feature map to obtain the first-layer decoding feature $F_{dec}^{1}$ (i.e., the feature to be normalized), as shown in formula (11):

$$F_{dec}^{1}=\mathrm{Concat}\left(\mathrm{TConv}\left(F_{enc}^{N_{enc}}\right),F_{enc}^{N_{enc}-1}\right)\qquad(11)$$

where TConv denotes the transposed convolution, Concat denotes the channel concatenation operation, and $N_{enc}$ denotes the number of encoder network layers;
Step 5.2: use the corresponding affine parameters $\gamma_{N_{enc}-1}$ and $\beta_{N_{enc}-1}$ to modulate the first-layer decoding feature $F_{dec}^{1}$, obtaining the modulated feature $\hat{F}_{dec}^{1}$, as shown in formula (12):

$$\hat{F}_{dec}^{1}=\beta_{N_{enc}-1}\otimes\mathrm{ChannelNorm}\left(F_{dec}^{1}\right)+\gamma_{N_{enc}-1}\qquad(12)$$

where ChannelNorm denotes the channel normalization operation and $\otimes$ denotes channel-wise multiplication;
Step 5.3: pass the modulated feature $\hat{F}_{dec}^{1}$ through multiple transposed convolution modules and style fusion normalization modules to obtain the final modulated feature $\hat{F}_{dec}^{N_{dec}-1}$; the specific operations are shown in formulas (13) and (14):

$$F_{dec}^{j}=\mathrm{Concat}\left(\mathrm{TConv}\left(\hat{F}_{dec}^{j-1}\right),F_{enc}^{N_{enc}-j}\right),\quad j\in\{2,\dots,N_{dec}-1\}\qquad(13)$$
$$\hat{F}_{dec}^{j}=\beta_{N_{enc}-j}\otimes\mathrm{ChannelNorm}\left(F_{dec}^{j}\right)+\gamma_{N_{enc}-j}\qquad(14)$$

where TConv denotes the transposed convolution, Concat denotes the channel concatenation operation, ChannelNorm denotes the channel normalization operation, $N_{enc}$ denotes the number of encoder network layers, and $N_{dec}$ denotes the number of decoder network layers;
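A sketch of one decoder modulation step (formulas (11)-(14)); ChannelNorm is assumed here to normalize each spatial position over its channels, the step the patent also describes as an instance-normalization operation.

```python
# Decoder modulation sketch for formulas (11)-(14).
def channel_norm(f, eps=1e-8):
    # assumed: per-position normalization over the channel dimension
    mu = f.mean(dim=1, keepdim=True)
    sigma = (f.var(dim=1, keepdim=True) + eps).sqrt()
    return (f - mu) / sigma

def modulate(f_up, f_enc_skip, gamma, beta):
    # skip connection + channel concatenation, formulas (11)/(13)
    f = torch.cat([f_up, f_enc_skip], dim=1)
    f = channel_norm(f)
    # affine modulation of formulas (12)/(14): beta scales, gamma offsets
    return beta[:, :, None, None] * f + gamma[:, :, None, None]
```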
sixth, obtaining a final generated image:
input the final modulated feature $\hat{F}_{dec}^{N_{dec}-1}$ into the last transposed convolution module of the decoder, and obtain the final generated image $x_{t}$ through a tanh activation function, as shown in formula (15):

$$x_{t}=\tanh\left(\mathrm{TConv}\left(\hat{F}_{dec}^{N_{dec}-1}\right)\right)\qquad(15)$$

where TConv denotes the transposed convolution;
seventh, the discriminating process of the domain selection discriminator;
Step 7.1: input the image sample into the base convolution module of the discriminator's first domain-selection-structure convolution module to obtain the first-layer base feature $F_{base}^{1}$, as shown in formula (16):

$$F_{base}^{1}=\mathrm{BaseConv}\left(x\right)\qquad(16)$$

where BaseConv denotes the base convolution module and $x$ denotes the input image sample;
Step 7.2: input the image sample into the domain selection convolution module of the discriminator's first domain-selection-structure convolution module to obtain the first-layer domain feature $F_{dom}^{1}$, as shown in formula (17):

$$F_{dom}^{1}=\mathrm{DomConv}_{y}\left(x\right)\qquad(17)$$

where $\mathrm{DomConv}_{y}$ denotes the domain selection convolution module of the corresponding condition $y$;
Step 7.3: concatenate the first-layer base feature and the first-layer domain feature to obtain the first-layer discriminator feature $F_{dis}^{1}$, as shown in formula (18):

$$F_{dis}^{1}=\mathrm{Concat}\left(F_{base}^{1},F_{dom}^{1}\right)\qquad(18)$$

where Concat denotes the channel concatenation operation;
Step 7.4: pass the first-layer discriminator feature sequentially through the base convolution modules and domain selection convolution modules of the remaining domain-selection-structure convolution modules, obtaining the final convolutional feature $F_{dis}^{N_{dc}}$ in the same manner, as shown in formula (19):

$$F_{dis}^{i}=\mathrm{Concat}\left(\mathrm{BaseConv}\left(F_{dis}^{i-1}\right),\mathrm{DomConv}_{y}\left(F_{dis}^{i-1}\right)\right),\quad i\in\{2,\dots,N_{dc}\}\qquad(19)$$

where Concat denotes the channel concatenation operation, BaseConv denotes the base convolution module, $\mathrm{DomConv}_{y}$ denotes the domain selection convolution module of the corresponding condition $y$, and $N_{dc}$ denotes the number of domain-selection-structure convolution modules in the discriminator network;
Step 7.5: apply the Flatten function to $F_{dis}^{N_{dc}}$ for dimension reduction, obtaining the output feature $F_{flatten}$, as shown in formula (20):

$$F_{flatten}=\mathrm{Flatten}\left(F_{dis}^{N_{dc}}\right)\qquad(20)$$

where Flatten denotes the operation of compressing a tensor into a one-dimensional vector;
Step 7.6: input $F_{flatten}$ into the base fully connected module and the domain selection fully connected module of the domain-selection-structure fully connected module, and obtain the final discriminator feature $F_{dis}$ through a concatenation operation, as shown in formula (21):

$$F_{dis}=\mathrm{Concat}\left(\mathrm{BaseFC}\left(F_{flatten}\right),\mathrm{DomFC}_{y}\left(F_{flatten}\right)\right)\qquad(21)$$

where Concat denotes the channel concatenation operation, BaseFC denotes the base fully connected module, and $\mathrm{DomFC}_{y}$ denotes the domain selection fully connected module of the corresponding condition $y$;
Step 7.7: apply a fully connected layer to $F_{dis}$ to obtain the final discriminator output $F_{logit}$, as shown in formula (22):

$$F_{logit}=\mathrm{FC}\left(F_{dis}\right)\qquad(22)$$

where FC denotes the fully connected layer;
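Putting the pieces together, a sketch of the full domain-selection discriminator forward pass (formulas (16)-(22)), reusing conv_block and PixelNorm from the earlier sketches; the channel widths, FC width, and the half/half split between base and domain branches are assumptions.

```python
# Domain-selection discriminator sketch: every stage holds one shared base
# module and N_a domain modules; only the domain module indexed by the
# age-domain condition y participates in a given forward pass.
class DomainSelectConv(nn.Module):
    def __init__(self, in_ch, out_ch, n_domains):
        super().__init__()
        self.base = conv_block(in_ch, out_ch // 2)  # BaseConv
        self.domains = nn.ModuleList(
            conv_block(in_ch, out_ch // 2) for _ in range(n_domains)  # DomConv_y
        )

    def forward(self, f, y):
        # formulas (16)-(19): concatenate shared and selected domain features
        return torch.cat([self.base(f), self.domains[y](f)], dim=1)

class DomainSelectDiscriminator(nn.Module):
    def __init__(self, n_domains=6, chans=(3, 64, 128, 256, 512), fc_dim=512):
        super().__init__()
        self.stages = nn.ModuleList(
            DomainSelectConv(chans[i], chans[i + 1], n_domains)
            for i in range(len(chans) - 1)            # N_dc = 4 stages
        )
        in_feats = chans[-1] * 16 * 16  # 256 -> 16 after four stride-2 stages
        self.base_fc = nn.Sequential(nn.Linear(in_feats, fc_dim // 2),
                                     PixelNorm(), nn.LeakyReLU(0.2))
        self.dom_fc = nn.ModuleList(
            nn.Sequential(nn.Linear(in_feats, fc_dim // 2),
                          PixelNorm(), nn.LeakyReLU(0.2))
            for _ in range(n_domains)
        )
        self.out = nn.Linear(fc_dim, 1)

    def forward(self, x, y):
        f = x
        for stage in self.stages:
            f = stage(f, y)                                         # formula (19)
        f = f.flatten(1)                                            # formula (20)
        f = torch.cat([self.base_fc(f), self.dom_fc[y](f)], dim=1)  # formula (21)
        return self.out(f)                                          # formula (22)
```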
The operations of the first through seventh steps complete the construction of the style-fusion-based generator network and the domain selection discriminator network, yielding the generated image that meets the target condition and the final output of the domain selection discriminator;
eighth step, training a face age synthesis method based on style fusion and domain selection structures;
The style-fusion-based generator network and the domain selection discriminator network form an adversarial network. Training adopts a multi-domain alternating strategy: in each training iteration all age domains are traversed, and each training step involves two age domains, with the sampling rule shown in formula (23):

$$t=(s+p+1)\bmod N_{a}\qquad(23)$$

where $t$ denotes the index of the target domain, $s$ denotes the index of the source domain, $p$ denotes the current iteration number, mod denotes the remainder operation, and $N_{a}$ denotes the number of age domains used in the method;
calculating the total loss of one batch for the generator network and optimizing the generator network through a gradient descent algorithm; then calculating the total loss of the domain selection discriminator under the corresponding condition and optimizing the discriminator network through the gradient descent algorithm; finally, through this alternating optimization over different domains, the generator's loss function reaches convergence, ensuring that the generator network can generate face images that meet the target condition;
The total loss function of the style-fusion-based generator network is shown in formula (24):

$$L_{G}=\lambda_{1}L_{rec}+\lambda_{2}L_{ip}+\lambda_{3}L_{adv}^{t}\qquad(24)$$

where $\lambda_{i},\ i\in\{1,2,3\}$ denote the weight coefficients between the losses, $L_{G}$ denotes the total loss function of the generator network, $L_{rec}$ denotes the reconstruction loss function, $L_{ip}$ denotes the identity-aware loss function, and $L_{adv}^{t}$ is the generator adversarial loss for the corresponding target condition;
The reconstruction loss of the style-fusion-based generator $G$ is shown in formula (25):

$$L_{rec}=\mathbb{E}_{x_{s}}\left[\left\|G\left(x_{s},\bar{y}_{s}\right)-x_{s}\right\|_{1}\right]\qquad(25)$$

where $\mathbb{E}_{x_{s}}$ denotes the mean over the current source-domain image batch, $\|\cdot\|_{1}$ denotes the L1 norm, $x_{s}$ denotes a face image sampled from the source domain, and $\bar{y}_{s}$ denotes the condition vector of the source domain, obtained by formula (5);
The identity-aware loss function of the style-fusion-based generator $G$ is shown in formula (26):

$$L_{ip}=\mathbb{E}_{x_{s}}\left[\left\|F\left(G\left(x_{s},\bar{y}_{t}\right)\right)-F\left(x_{s}\right)\right\|_{1}\right]\qquad(26)$$

where $\mathbb{E}_{x_{s}}$ denotes the mean over the current source-domain image batch, $\|\cdot\|_{1}$ denotes the L1 norm, $F$ denotes the relu5_3 layer of the VGG16 model, $x_{s}$ denotes a face image sampled from the source domain, and $\bar{y}_{t}$ denotes the condition vector of the target domain, obtained by formula (5);
The adversarial loss function of the style-fusion-based generator $G$ is shown in formula (27):

$$L_{adv}^{t}=\mathbb{E}_{x_{s}}\left[\left(D_{t}\left(G\left(x_{s},\bar{y}_{t}\right)\right)-1\right)^{2}\right]\qquad(27)$$

where $\mathbb{E}_{x_{s}}$ denotes the mean over the current source-domain image batch, $D_{t}$ denotes the domain selection discriminator under the corresponding target condition, $x_{s}$ denotes a face image sampled from the source domain, and $\bar{y}_{t}$ denotes the condition vector of the target domain;
The loss function of the domain selection discriminator is shown in formula (28):

$$L_{D}=\mathbb{E}_{x_{t}}\left[\left(D_{t}\left(x_{t}\right)-1\right)^{2}\right]+\mathbb{E}_{x_{s}}\left[\left(D_{t}\left(G\left(x_{s},\bar{y}_{t}\right)\right)\right)^{2}\right]\qquad(28)$$

where $\mathbb{E}_{x_{s}}$ denotes the mean over the current source-domain image batch, $\mathbb{E}_{x_{t}}$ denotes the mean over the current target-domain image batch, $D_{t}$ denotes the domain selection discriminator that selects the domain function of target domain $t$, $x_{s}$ denotes a face image sampled from the source domain, $x_{t}$ denotes a face image sampled from the target domain, and $\bar{y}_{t}$ denotes the condition vector of the target domain;
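The four losses can be sketched as follows; the least-squares adversarial form is an assumption (the patent text does not pin the exact adversarial formulation down), as is the vgg_relu5_3 feature-extractor handle.

```python
# Loss sketches for formulas (25)-(28), under the assumptions stated above.
def rec_loss(G, x_s, y_s_bar):
    # formula (25): L1 reconstruction under the source condition
    return (G(x_s, y_s_bar) - x_s).abs().mean()

def ip_loss(vgg_relu5_3, x_fake, x_s):
    # formula (26): L1 distance between deep identity-aware features
    return (vgg_relu5_3(x_fake) - vgg_relu5_3(x_s)).abs().mean()

def adv_loss_G(D, x_fake, t):
    # formula (27): the generator pushes D_t's score on fakes toward "real"
    return ((D(x_fake, t) - 1) ** 2).mean()

def adv_loss_D(D, x_t, x_fake, t):
    # formula (28): D_t separates real target-domain images from fakes
    return ((D(x_t, t) - 1) ** 2).mean() + (D(x_fake, t) ** 2).mean()
```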
For the generator, the source domain indicates which domain the input face sample comes from and the target domain indicates which domain the output face sample should belong to; each training iteration is a source-domain-to-target-domain transformation, and the domain selection discriminator provides the data distribution of the target domain, so the domain function corresponding to the target domain is selected: in each step only the base function and the domain function of the target domain participate in the computation, while the other domain functions do not. A training-step sketch under these rules follows.
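A sketch of one alternating multi-domain training iteration (eighth step, using formula (23)); sample_batch, the optimizers, and the loss weights lam1..lam3 are assumed scaffolding around the loss sketches above.

```python
# Alternating multi-domain training sketch for the eighth step.
def train_iteration(G, D, opt_G, opt_D, sample_batch, vgg, p, N_a=6,
                    lam1=1.0, lam2=1.0, lam3=1.0):
    for s in range(N_a):                 # traverse all source domains
        t = (s + p + 1) % N_a            # target-domain sampling, formula (23)
        x_s, y_s_bar = sample_batch(s)   # source images + condition vectors
        x_t, y_t_bar = sample_batch(t)   # target-domain images
        # 1) generator update on the total loss of formula (24)
        x_fake = G(x_s, y_t_bar)
        loss_G = (lam1 * rec_loss(G, x_s, y_s_bar)
                  + lam2 * ip_loss(vgg, x_fake, x_s)
                  + lam3 * adv_loss_G(D, x_fake, t))
        opt_G.zero_grad(); loss_G.backward(); opt_G.step()
        # 2) discriminator update: only the base function and the domain
        #    function of target domain t receive gradients
        loss_D = adv_loss_D(D, x_t, x_fake.detach(), t)
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()
```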
Through the operation, training of the generator network and the domain selection discriminator network based on style fusion is completed;
ninth, measuring a human face age synthesizing method based on style fusion and domain selection structures;
After the eighth step of training, once training is stable, the trained style-fusion-based generator network is obtained. Test face images from different age domains are input into the generator network to obtain face images that meet the target condition; a third-party face attribute detection API is then used for detection, and the average synthesized age and the face identity preservation rate are calculated;
the age prediction interface of the face attribute detection API provided by Face++ is used to obtain the age of the generated face corresponding to each test image; by detecting and averaging over the test set samples, the average synthesized ages of the different age domains are obtained;
the face comparison API provided by Face++ is used to determine whether a generated face image and the source test face image are the same person; if so, the pair is recorded as a positive example. By detecting, comparing, and counting over the test set samples, the face identity preservation rate of the test set is obtained;
thus, the face age synthesis method based on the style fusion and domain selection structure is completed.
Specifically, in the face age synthesis method based on the style fusion and domain selection structure, the number of encoder network layers in the second step is $N_{enc}=5$, the number of age domains in the third step is $N_{a}=6$, the number of decoder network layers in the fifth step is $N_{dec}=5$, and the number of domain-selection-structure convolution modules of the discriminator in the seventh step is $N_{dc}=4$.
The beneficial effects of the invention are as follows. Compared with the prior art, the outstanding substantive features and significant progress of the invention are:
(1) The invention provides a face age synthesis method, specifically a face age synthesis method based on style fusion and a domain selection structure. The generator network places a style fusion normalization module on the decoder features at each scale. Within the module, the identity extractor extracts the source-image identity information contained in the encoder output features to obtain the identity style of the corresponding scale, and the condition mapper maps the age label into a condition style code; the two styles are fused and used to modulate the decoder features of the corresponding scale. Through this normalized modulation at multiple scales, age transformation across different age domains is achieved while the identity semantics of the source image are maintained.
(2) The invention provides a domain selection structure, applied in the feature extraction stages of the discriminator network, with one base function and a plurality of domain functions. The base function is shared throughout training and extracts features common to the different domains; in each forward pass, the domain selection structure selects only the domain function of the current age domain, and by switching between domain functions over the course of training, the distributions of the different domains are provided. Because the domain selection discriminator contains both a base function that learns the data distribution of all domains and domain functions that learn the data distributions of specific domains, the common features of the different domains are fused with the distinguishing features of a specific domain. Since only the domain function of the current age domain is selected in each forward pass, that function extracts the distinguishing features of its age domain; combining common and distinguishing features supplements the data distribution of domains with few samples, avoiding mode collapse, effectively alleviating overfitting on such domains, and making the generator more robust.
(3) CN111612872A discloses a face age-varying adversarial image generation method and system that uses a spatial attention mechanism to make the generator focus on age-related face regions and injects the condition information by channel-dimension concatenation, but the age transformation effect of that approach is limited. Compared with CN111612872A, the invention fuses the identity style and the condition style to guide the age transformation, obtaining more realistic results that meet the target condition.
(4) CN110322394A discloses an attribute-guided face aging adversarial image generation method and device, in which the discriminator network takes multi-scale wavelet-packet-transformed features as input and uses face attribute feature vectors as condition input to learn the distributions of different attributes; however, the features the model learns are overly redundant, limiting the generator's age synthesis effect. Compared with CN110322394A, the invention uses different domain functions in the domain selection discriminator to learn the data distributions of different age domains and combines them with the base features, significantly improving the generation effect and training stability.
In summary, the invention makes corresponding improvements in both the generator network and the discriminator network. In the generator, identity styles are extracted from encoder output features at different scales and combined with the condition style to modulate the corresponding decoder features, thereby balancing identity preservation and target-condition expression; in the discriminator, domain selection structures extract features at different scales, and the complementarity of common and domain-specific features makes the training process more stable.
Drawings
The invention will be further described with reference to the drawings and examples.
FIG. 1 is an overall training flow chart of a face age synthesis method based on style fusion and domain selection structures.
Fig. 2 is a flow chart of the generation process of the generator of the face age synthesis method based on the style fusion and domain selection structure of the present invention.
Fig. 3 is a flowchart of a style fusion normalization module of the face age synthesis method based on the style fusion and domain selection structure.
Fig. 4 is a flowchart of a discriminator network of the face age synthesizing method based on the style fusion and domain selection structure of the invention.
Fig. 5 is a flow chart of a domain selection structure of a face age synthesis method based on style fusion and domain selection structure of the present invention.
Fig. 6 shows example generation results of the face age synthesis method based on the style fusion and domain selection structure.
Detailed Description
The embodiment shown in fig. 1 shows that the overall training flow of the face age synthesis method based on the style fusion and domain selection structure of the invention is as follows:
the method comprises the steps of inputting an image, preprocessing the input image, inputting the preprocessed face image into a generator network, firstly extracting depth features of the image, namely encoder output features, from encoder output features of different scales, extracting identity styles, inputting age conditions, mapping the age conditions into conditional styles, fusing the identity styles and the conditional styles through a style fusion normalization module, modulating corresponding decoder features, generating the face image conforming to the age conditions, inputting the generated image into a discriminator for discrimination, calculating a loss function according to output results of the generator and the discriminator, and finally performing optimization training through back propagation.
The embodiment shown in fig. 2 shows that the generation flow of the generator of the face age synthesis method based on the style fusion and domain selection structure of the invention is as follows:
the generator comprises an encoder network and a decoder network, wherein the encoder network is composed of a plurality of layers of convolution modules, the decoder network comprises up-sampling layers and style fusion normalization modules, the up-sampling layers are sequentially arranged, the style fusion normalization modules are arranged between adjacent up-sampling layers, each up-sampling layer outputs a decoder characteristic diagram of a corresponding layer, and each convolution module outputs an encoder output characteristic diagram of the corresponding layer. The convolution module of the encoder network is composed of a convolution layer with a convolution kernel size of 4×4, a step size of 2 and a filling of 1, an instance normalization layer and a LeakyReLU activation function. The up-sampling layer is a transposed convolution layer with a transposed convolution kernel of 4 x 4, a step size of 2, and a padding of 1. Inputting a face image, obtaining the output characteristics of the encoder of the next layer by using a convolution layer with the convolution kernel size of 4 multiplied by 4, a step length of 2 and filling with 1, an example normalization layer and a LeakyReLU activation function, extracting the output characteristics of the encoders of different scales layer by layer, and extracting five layers altogether; the method comprises the steps of inputting the characteristics of a first layer of decoder, combining age condition vectors and the output characteristics of a coder with a corresponding scale into a style fusion normalization module together for characteristic modulation, obtaining a next layer of decoder characteristic diagram through an up-sampling layer, fusing the characteristics of different scales layer by layer through style fusion normalization modules of different layers, and fusing four layers altogether; and inputting the fusion characteristic of the last layer into the transposition convolution and tanh activation function of the last layer to obtain a final generated image.
The embodiment shown in fig. 3 shows the flow of the style fusion normalization module of the face age synthesis method based on the style fusion and domain selection structure of the invention, in which the channel normalization operation, identity style extraction, and condition style mapping proceed in parallel:
Channel normalization: channel-concatenate the input decoder feature map with the encoder output feature map of the corresponding scale, then apply the channel normalization operation to obtain the normalized features;
Identity style extraction: the encoder output feature map is passed through the identity extractor to obtain the identity style;
Condition style mapping: the age condition vector is passed through the condition mapper to obtain the condition style;
Feature modulation: channel-concatenate the identity style and the condition style, obtain the affine parameters through a fully connected layer, modulate the normalized features with the affine parameters, and output the modulated output feature map.
The embodiment shown in fig. 4 shows the flow of the domain selection discriminator of the face age synthesis method based on the style fusion and domain selection structure: the input face image passes through four domain-selection-structure convolution modules and then one domain-selection-structure fully connected module to obtain the final output. The four domain-selection-structure convolution modules have the same structure and form a down-sampling cascade.
The embodiment shown in fig. 5 shows the flow of the domain selection structure of the face age synthesis method based on style fusion and a domain selection structure: each domain selection structure is a down-sampling module composed of two types of functions, a base function and a plurality of domain functions, with each domain function corresponding to one age domain. At any time, only the base function and one specific domain function are used and their outputs concatenated; the number of domain functions equals the number of configured age domains. The domain selection structure takes two inputs: the first is the age-domain information, according to which the domain function of the corresponding age domain is selected; the second is the input feature map from the previous layer, from which the selected domain function and the base function produce the domain features and base features. These are channel-concatenated into the output feature map of the domain selection structure, which, together with the domain condition information, serves as input to the next domain selection structure; after several domain selection structures, the final output is obtained through a fully connected layer.
The generated results shown in fig. 6 have the input test face sample images in the first column; the other columns correspond to the synthesized faces of different age groups.
The age domains are the divided age groups; their number can be set manually, and they are divided at equal age intervals starting from age 0.
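For illustration, such an equal-interval division can be expressed as below; the 10-year interval width is an assumption, since the preferred embodiment only fixes the domain count ($N_a=6$).

```python
# Illustrative equal-interval age-to-domain mapping starting from age 0.
def age_to_domain(age: int, n_domains: int = 6, interval: int = 10) -> int:
    return min(age // interval, n_domains - 1)
```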
The domain selection structure consists of a base function and a plurality of domain functions. The face samples involved in each forward propagation come from the same age domain; the input samples use the base function and the domain function corresponding to that age domain for feature extraction, while the other domain functions take part in neither the forward pass nor the optimization. Over the whole training process, the base function processes face samples from all age domains, while each domain function only processes face samples from its corresponding age domain.
Example 1
The face age synthesis method based on the style fusion and domain selection structure of the embodiment comprises the following specific steps:
firstly, extracting multi-scale characteristics of an input face image through an encoder network:
step 1.1, carrying out normalization and tensor processing on the face image, and adjusting the resolution of the image to 256 multiplied by 256 to obtain the preprocessed face image.
Step 1.2: input the preprocessed face image into the first-layer convolution module to obtain the first-layer encoder output features, denoted $F_{enc}^{1}$.
Step 1.3: extract depth features of different scales from the first-layer encoder output features through the encoder network, as shown in formula (1):

$$F_{enc}^{i}=\mathrm{Conv}\left(F_{enc}^{i-1}\right),\quad i\in\{2,\dots,N_{enc}\}\qquad(1)$$

where $F_{enc}^{i}$ denotes the output features of the $i$-th encoder layer, Conv denotes the convolution module, and $N_{enc}$ denotes the number of encoder network layers;
thus, the extraction of the multi-scale features of the input face image is completed;
second, extracting identity style codes by using an identity extractor:
Step 2.1: for the encoder output features of each layer, apply a convolution layer with a 1×1 kernel to obtain the corresponding learnable matrix, as shown in formula (2):

$$M_{i}=\mathrm{Conv}_{1\times1}\left(F_{enc}^{i}\right),\quad i\in\{1,\dots,N_{enc}\}\qquad(2)$$

where $M_{i}$ denotes the learnable matrix corresponding to the $i$-th layer encoder output features $F_{enc}^{i}$, $\mathrm{Conv}_{1\times1}$ denotes a convolution layer with a 1×1 kernel, and $N_{enc}$ denotes the number of encoder network layers;
Step 2.2: taking the learnable matrix as an operation parameter, multiply it element-wise with the corresponding encoder output features to obtain the re-weighted features, as shown in formula (3):

$$\tilde{F}_{enc}^{i}=F_{enc}^{i}\odot M_{i},\quad i\in\{1,\dots,N_{enc}\}\qquad(3)$$

where $\tilde{F}_{enc}^{i}$ denotes the result of multiplying the $i$-th layer encoder output features $F_{enc}^{i}$ with the learnable matrix, $\odot$ denotes element-wise matrix multiplication, and $N_{enc}$ denotes the number of encoder network layers;
Step 2.3: pass the re-weighted features through global average pooling and a fully connected layer to obtain the final corresponding identity style code, as shown in formula (4):

$$\mathrm{id}_{i}=\mathrm{FC}\left(\mathrm{SumPooling}\left(\tilde{F}_{enc}^{i}\right)\right),\quad i\in\{1,\dots,N_{enc}\}\qquad(4)$$

where $\mathrm{id}_{i}$ denotes the identity style code corresponding to the $i$-th layer encoder output features, FC denotes the fully connected layer, SumPooling denotes global average pooling, and $N_{enc}$ denotes the number of encoder network layers;
thirdly, extracting the conditional style codes by using a conditional mapper:
Step 3.1: convert the one-hot condition label into a one-dimensional vector and add smoothing noise $z\sim N(0,0.2^{2})$ to obtain the condition vector, as shown in formula (5):

$$\bar{y}_{n}=y_{n}+z,\quad z\sim N\left(0,0.2^{2}\right),\quad n\in\{1,\dots,N_{a}\}\qquad(5)$$

where $N_{a}$ denotes the number of age domains used in the method, $\bar{y}_{n}$ denotes the condition vector of the $n$-th attribute condition, and $y_{n}$ denotes the converted one-dimensional condition vector;
Step 3.2: apply a fully connected layer, a pixel normalization layer, and a LeakyReLU activation function to $\bar{y}_{n}$ for feature mapping, obtaining the condition mapping feature $F_{map}$, as shown in formula (6):

$$F_{map}=\mathrm{LeakyReLU}\left(\mathrm{PixelNorm}\left(\mathrm{FC}\left(\bar{y}_{n}\right)\right)\right)\qquad(6)$$

where FC denotes the fully connected layer, PixelNorm denotes the pixel normalization layer, and LeakyReLU denotes the LeakyReLU activation function;
Step 3.3: apply a fully connected layer, a pixel normalization layer, and a LeakyReLU activation function to $F_{map}$ for feature mapping, obtaining the condition style code $a$, as shown in formula (7):

$$a=\mathrm{LeakyReLU}\left(\mathrm{PixelNorm}\left(\mathrm{FC}\left(F_{map}\right)\right)\right)\qquad(7)$$

where FC denotes the fully connected layer, PixelNorm denotes the pixel normalization layer, and LeakyReLU denotes the LeakyReLU activation function;
fourth, fusing identity style codes and conditional style codes, and mapping fusion features into normalized affine parameters:
Step 4.1: channel-concatenate the identity style code $\mathrm{id}_{i}$ with the condition style code $a$ to obtain the concatenated feature $S_{i}$ corresponding to the $i$-th layer encoder output features, as shown in formula (8):

$$S_{i}=\mathrm{Concat}\left(\mathrm{id}_{i},a\right),\quad i\in\{1,\dots,N_{enc}\}\qquad(8)$$

where Concat denotes the channel concatenation operation and $N_{enc}$ denotes the number of encoder network layers;
Step 4.2: use fully connected layers to map the concatenated feature into the corresponding affine parameters, obtaining the affine parameters $\gamma_{i}$ and $\beta_{i}$ corresponding to the $i$-th layer encoder output features, as shown in formulas (9) and (10):

$$\gamma_{i}=\mathrm{FC}\left(S_{i}\right)\qquad(9)$$
$$\beta_{i}=\mathrm{FC}\left(S_{i}\right)\qquad(10)$$

where FC denotes the fully connected layer and $N_{enc}$ denotes the number of encoder network layers;
fifth step, modulating decoder features with affine parameters:
Step 5.1: input the last-layer encoder output feature map into the first transposed convolution module of the decoder, then channel-concatenate the result with the skip-connected second-to-last encoder output features to obtain the first-layer decoding feature $F_{dec}^{1}$, as shown in formula (11):

$$F_{dec}^{1}=\mathrm{Concat}\left(\mathrm{TConv}\left(F_{enc}^{N_{enc}}\right),F_{enc}^{N_{enc}-1}\right)\qquad(11)$$

where TConv denotes the transposed convolution, Concat denotes the channel concatenation operation, and $N_{enc}$ denotes the number of encoder network layers;
Step 5.2: use the corresponding affine parameters $\gamma_{N_{enc}-1}$ and $\beta_{N_{enc}-1}$ to modulate the first-layer decoding feature $F_{dec}^{1}$, obtaining the modulated feature $\hat{F}_{dec}^{1}$, as shown in formula (12):

$$\hat{F}_{dec}^{1}=\beta_{N_{enc}-1}\otimes\mathrm{ChannelNorm}\left(F_{dec}^{1}\right)+\gamma_{N_{enc}-1}\qquad(12)$$

where ChannelNorm denotes the channel normalization operation and $\otimes$ denotes channel-wise multiplication;
Step 5.3: pass the modulated feature $\hat{F}_{dec}^{1}$ through multiple transposed convolution modules and style fusion normalization modules to obtain the final modulated feature $\hat{F}_{dec}^{N_{dec}-1}$; the specific operations are shown in formulas (13) and (14):

$$F_{dec}^{j}=\mathrm{Concat}\left(\mathrm{TConv}\left(\hat{F}_{dec}^{j-1}\right),F_{enc}^{N_{enc}-j}\right),\quad j\in\{2,\dots,N_{dec}-1\}\qquad(13)$$
$$\hat{F}_{dec}^{j}=\beta_{N_{enc}-j}\otimes\mathrm{ChannelNorm}\left(F_{dec}^{j}\right)+\gamma_{N_{enc}-j}\qquad(14)$$

where TConv denotes the transposed convolution, Concat denotes the channel concatenation operation, ChannelNorm denotes the channel normalization operation, $N_{enc}$ denotes the number of encoder network layers, and $N_{dec}$ denotes the number of decoder network layers;
sixth, obtaining a final generated image:
input the final modulated feature $\hat{F}_{dec}^{N_{dec}-1}$ into the last transposed convolution module of the decoder, and obtain the final generated image $x_{t}$ through a tanh activation function, as shown in formula (15):

$$x_{t}=\tanh\left(\mathrm{TConv}\left(\hat{F}_{dec}^{N_{dec}-1}\right)\right)\qquad(15)$$

where TConv denotes the transposed convolution;
seventh, the discriminating process of the domain selection discriminator;
Step 7.1: input the image sample into the base convolution module of the discriminator to obtain the first-layer base feature $F_{base}^{1}$, as shown in formula (16):

$$F_{base}^{1}=\mathrm{BaseConv}\left(x\right)\qquad(16)$$

where BaseConv denotes the base convolution module and $x$ denotes the input image sample;
Step 7.2: input the image sample into the domain selection convolution module of the discriminator to obtain the first-layer domain feature $F_{dom}^{1}$, as shown in formula (17):

$$F_{dom}^{1}=\mathrm{DomConv}_{y}\left(x\right)\qquad(17)$$

where $\mathrm{DomConv}_{y}$ denotes the domain selection convolution module of the corresponding condition $y$;
Step 7.3: concatenate the first-layer base feature and the first-layer domain feature to obtain the first-layer discriminator feature $F_{dis}^{1}$, as shown in formula (18):

$$F_{dis}^{1}=\mathrm{Concat}\left(F_{base}^{1},F_{dom}^{1}\right)\qquad(18)$$

where Concat denotes the channel concatenation operation;
Step 7.4: pass the first-layer discriminator feature through the remaining base convolution modules and domain selection convolution modules to obtain the final convolutional feature, as shown in formula (19):

$$F_{dis}^{i}=\mathrm{Concat}\left(\mathrm{BaseConv}\left(F_{dis}^{i-1}\right),\mathrm{DomConv}_{y}\left(F_{dis}^{i-1}\right)\right),\quad i\in\{2,\dots,N_{dc}\}\qquad(19)$$

where Concat denotes the channel concatenation operation, BaseConv denotes the base convolution module, $\mathrm{DomConv}_{y}$ denotes the domain selection convolution module of the corresponding condition $y$, and $N_{dc}$ denotes the number of domain-selection-structure convolution modules in the discriminator network;
Step 7.5: apply the Flatten function to $F_{dis}^{N_{dc}}$ for dimension reduction, obtaining the output feature $F_{flatten}$, as shown in formula (20):

$$F_{flatten}=\mathrm{Flatten}\left(F_{dis}^{N_{dc}}\right)\qquad(20)$$

where Flatten denotes the operation of compressing a tensor into a one-dimensional vector;
Step 7.6: input $F_{flatten}$ into the base fully connected module and the domain selection fully connected module, and obtain the final discriminator feature $F_{dis}$ through a concatenation operation, as shown in formula (21):

$$F_{dis}=\mathrm{Concat}\left(\mathrm{BaseFC}\left(F_{flatten}\right),\mathrm{DomFC}_{y}\left(F_{flatten}\right)\right)\qquad(21)$$

where Concat denotes the channel concatenation operation, BaseFC denotes the base fully connected module, and $\mathrm{DomFC}_{y}$ denotes the domain selection fully connected module of the corresponding condition $y$;
Step 7.7: apply a fully connected layer to $F_{dis}$ to obtain the final output $F_{logit}$, as shown in formula (22):

$$F_{logit}=\mathrm{FC}\left(F_{dis}\right)\qquad(22)$$

where FC denotes the fully connected layer;
The operations of the first through seventh steps complete the construction of the style-fusion-based generator network and the domain selection discriminator network, yielding the generated image that meets the target condition and the final output of the domain selection discriminator;
eighth step, training a face age synthesis method based on style fusion and domain selection structures;
The style-fusion-based generator network and the domain selection discriminator network form an adversarial network. Training adopts a multi-domain alternating strategy: in each training iteration all age domains are traversed, and each training step involves two age domains, with the sampling rule shown in formula (23):

$$t=(s+p+1)\bmod N_{a}\qquad(23)$$

where $t$ denotes the index of the target domain, $s$ denotes the index of the source domain, $p$ denotes the current iteration number, mod denotes the remainder operation, and $N_{a}$ denotes the number of age domains used in the method;
calculating the total loss of one batch for the generator network and optimizing the generator network through a gradient descent algorithm; then calculating the total loss of the domain selection discriminator under the corresponding condition and optimizing the discriminator network through the gradient descent algorithm; finally, through this alternating optimization over different domains, the generator's loss function reaches convergence, ensuring that the generator network can generate face images that meet the target condition;
The total loss function of the style-fusion-based generator network is shown in formula (24):

$$L_{G}=\lambda_{1}L_{rec}+\lambda_{2}L_{ip}+\lambda_{3}L_{adv}^{t}\qquad(24)$$

where $\lambda_{i},\ i\in\{1,2,3\}$ denote the weight coefficients between the losses, $L_{G}$ denotes the total loss function of the generator network, $L_{rec}$ denotes the reconstruction loss function, $L_{ip}$ denotes the identity-aware loss function, and $L_{adv}^{t}$ is the generator adversarial loss for the corresponding target condition;
The reconstruction loss of the style-fusion-based generator $G$ is shown in formula (25):

$$L_{rec}=\mathbb{E}_{x_{s}}\left[\left\|G\left(x_{s},\bar{y}_{s}\right)-x_{s}\right\|_{1}\right]\qquad(25)$$

where $\mathbb{E}_{x_{s}}$ denotes the mean over the current source-domain image batch, $\|\cdot\|_{1}$ denotes the L1 norm, $x_{s}$ denotes a face image sampled from the source domain, and $\bar{y}_{s}$ denotes the condition vector of the source domain;
The identity-aware loss function of the style-fusion-based generator $G$ is shown in formula (26):

$$L_{ip}=\mathbb{E}_{x_{s}}\left[\left\|F\left(G\left(x_{s},\bar{y}_{t}\right)\right)-F\left(x_{s}\right)\right\|_{1}\right]\qquad(26)$$

where $\mathbb{E}_{x_{s}}$ denotes the mean over the current source-domain image batch, $\|\cdot\|_{1}$ denotes the L1 norm, $F$ denotes the relu5_3 layer of the VGG16 model, $x_{s}$ denotes a face image sampled from the source domain, and $\bar{y}_{t}$ denotes the condition vector of the target domain;
The adversarial loss function of the style-fusion-based generator $G$ is shown in formula (27):

$$L_{adv}^{t}=\mathbb{E}_{x_{s}}\left[\left(D_{t}\left(G\left(x_{s},\bar{y}_{t}\right)\right)-1\right)^{2}\right]\qquad(27)$$

where $\mathbb{E}_{x_{s}}$ denotes the mean over the current source-domain image batch, $D_{t}$ denotes the domain selection discriminator under the corresponding target condition, $x_{s}$ denotes a face image sampled from the source domain, and $\bar{y}_{t}$ denotes the condition vector of the target domain;
The loss function of the domain selection discriminator is shown in formula (28):

$$L_{D}=\mathbb{E}_{x_{t}}\left[\left(D_{t}\left(x_{t}\right)-1\right)^{2}\right]+\mathbb{E}_{x_{s}}\left[\left(D_{t}\left(G\left(x_{s},\bar{y}_{t}\right)\right)\right)^{2}\right]\qquad(28)$$

where $\mathbb{E}_{x_{s}}$ denotes the mean over the current source-domain image batch, $\mathbb{E}_{x_{t}}$ denotes the mean over the current target-domain image batch, $D_{t}$ denotes the domain selection discriminator under the corresponding target condition, $x_{s}$ denotes a face image sampled from the source domain, $x_{t}$ denotes a face image sampled from the target domain, and $\bar{y}_{t}$ denotes the condition vector of the target domain;
through the operation, training of the generator network and the domain selection discriminator network based on style fusion is completed;
Ninth step: use the trained model for inference; input a test face image and a target age label, and generate the final face image;
thus, the face age synthesis method based on the style fusion and domain selection structure is completed.
Specifically, in the face age synthesis method based on the style fusion and domain selection structure, the number of encoder network layers in the second step is $N_{enc}=5$, the number of age domains in the third step is $N_{a}=6$, the number of decoder network layers in the fifth step is $N_{dec}=5$, and the number of discriminator convolution module layers in the seventh step is $N_{dc}=4$.
Table 1. Identity retention (%)
Table 2. Age distribution on the FFHQ-Aging dataset
Tables 1 and 2 compare the experimental results of this embodiment with those of prior-art methods on the FFHQ-Aging dataset, where CAAE is the abbreviation of the method described in the paper "Age Progression/Regression by Conditional Adversarial Autoencoder" and LATS is the abbreviation of the method described in the paper "Lifespan Age Transformation Synthesis".
For face age synthesis, the designed style fusion normalization module preserves the identity information of the source face while completing an age conversion that satisfies the target condition; the designed domain selection structure effectively combines the features common to all age-domain distributions with the distinctive features of specific domains, improving training stability.
For aspects not described herein, the invention follows the prior art.

Claims (7)

1. A face age synthesis method based on style fusion and domain selection structures, which comprises the following steps:
preprocessing a public face data set, wherein each face sample corresponds to an age label, and setting the number of age domains;
constructing an adversarial network of style fusion and multi-domain discrimination, wherein the adversarial network comprises a style fusion based generator network and a domain selection discriminator network;
the style fusion based generator network adopts an encoder-decoder structure; the encoder part has N_enc layers of convolution modules for extracting multi-scale features of the input face image, and the decoder part has N_dec up-sampling layers; taking the last-layer feature of the encoder as the symmetry axis, the first N_enc - 1 encoder features and the first N_dec - 1 decoder features are axisymmetric in scale; the decoder comprises up-sampling layers and style fusion normalization modules, and each style fusion normalization module is skip-connected to the encoder output features of the corresponding scale so as to combine them with style information; the style fusion normalization module comprises an identity extractor, a condition mapper and a normalization module, wherein the identity extractor acts on the encoder output features of the different scales to obtain the corresponding identity style codes, the condition mapper maps the target age label into a condition style code, and the normalization module fuses the identity style and the condition style and obtains the modulated features;
the domain selection discriminator network comprises a plurality of domain selection structures and a fully connected layer, wherein each domain selection structure is composed of two types of functions: a basic function and a plurality of domain functions; for each batch of input images, features are extracted using only the basic function and one specific domain function, and the number of domain functions is equal to the number of age domains;
the specific implementation of the basic function is divided into a basic convolution module and a basic full-connection module, the specific implementation of the domain function is divided into a domain selection convolution module and a domain selection full-connection module, the basic convolution module and the domain selection convolution module form a domain selection structure convolution module, and the basic full-connection module and the domain selection full-connection module form a domain selection structure full-connection module;
the basic convolution module and the domain selection convolution module comprise a convolution layer, an instance normalization layer and a LeakyReLU activation function; the basic full-connection module and the domain selection full-connection module are composed of a full-connection layer, a pixel normalization layer and a LeakyReLU activation function.
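As an illustrative aid (not part of the claims), a minimal PyTorch sketch of one domain selection structure convolution module follows; the even split of output channels between the basic branch and the selected domain branch is an assumption.

import torch
import torch.nn as nn

class DomainSelectConv(nn.Module):
    # one basic branch shared by all domains plus N_a per-domain branches;
    # each branch is Conv(4x4, stride 2, padding 1) + InstanceNorm + LeakyReLU,
    # and only the branch of the batch's age domain y runs in a forward pass
    def __init__(self, in_ch, out_ch, n_domains=6):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch // 2, 4, stride=2, padding=1),
                nn.InstanceNorm2d(out_ch // 2),
                nn.LeakyReLU(0.2),
            )
        self.base = branch()                                  # basic function
        self.domains = nn.ModuleList(branch() for _ in range(n_domains))

    def forward(self, x, y):
        # channel-splice shared features with the selected domain's features;
        # the other domain branches take no part in this pass
        return torch.cat([self.base(x), self.domains[y](x)], dim=1)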
2. The method for face age synthesis based on style fusion and domain selection structure according to claim 1, wherein the basic convolution module and the domain selection convolution module are composed of a convolution layer with a convolution kernel size of 4×4, a stride of 2 and a padding of 1, an instance normalization layer and a LeakyReLU activation function; the convolution module of the encoder consists of a convolution layer with a convolution kernel size of 4×4, a stride of 2 and a padding of 1, an instance normalization layer and a LeakyReLU activation function; the up-sampling layer is a transposed convolution layer with a 4×4 transposed convolution kernel, a stride of 2 and a padding of 1;
The identity extractor comprises a convolution layer with a convolution kernel size of 1×1 and a full-connection layer;
the condition mapper comprises a two-layer feature mapping structure, each layer consisting of a full-connection layer, a pixel normalization layer and a LeakyReLU activation function;
the normalization module includes an instance normalization layer and a full connection layer.
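For illustration only, PyTorch sketches of the identity extractor and the condition mapper follow; the code dimensions id_dim and cond_dim and the pixel_norm helper are assumptions.

import torch
import torch.nn as nn

def pixel_norm(x, eps=1e-8):
    # pixel normalization: rescale each sample to unit average magnitude
    return x / torch.sqrt(x.pow(2).mean(dim=1, keepdim=True) + eps)

class IdentityExtractor(nn.Module):
    # 1x1 convolution, element-wise reweighting, global average pooling
    # and a fully connected layer yielding the identity style code
    def __init__(self, feat_ch, id_dim=64):
        super().__init__()
        self.to_matrix = nn.Conv2d(feat_ch, feat_ch, kernel_size=1)
        self.fc = nn.Linear(feat_ch, id_dim)

    def forward(self, f_enc):
        weighted = f_enc * self.to_matrix(f_enc)   # element-wise reweighting
        return self.fc(weighted.mean(dim=(2, 3)))  # identity style code id_i

class ConditionMapper(nn.Module):
    # one-hot age label plus smoothing noise z ~ N(0, 0.2^2), mapped by two
    # FC + PixelNorm + LeakyReLU stages into the condition style code a
    def __init__(self, n_domains=6, cond_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(n_domains, cond_dim)
        self.fc2 = nn.Linear(cond_dim, cond_dim)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, y_onehot):
        c = y_onehot + 0.2 * torch.randn_like(y_onehot)  # condition vector
        f_map = self.act(pixel_norm(self.fc1(c)))
        return self.act(pixel_norm(self.fc2(f_map)))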
3. The method for face age synthesis based on style fusion and domain selection structures according to claim 1, wherein,
the style fusion normalization module is arranged between adjacent up-sampling layers; it acquires identity style information and condition style information respectively, fuses the style information through a multi-layer fully connected structure to obtain affine parameters that contain the identity characteristics of the source image together with the target age information, performs an instance normalization operation on the decoder features and the corresponding encoder output features, and modulates the normalized features with the affine parameters;
in the training process, each domain selection structure selects a specific domain function to learn the corresponding age domain distribution, while the basic function in the domain selection structure learns the basic characteristics shared by different domains; the face samples involved in each forward propagation of the network come from the same age domain, so the input samples are processed by the basic function and the domain function corresponding to that age domain, and the other domain functions take no part in the forward propagation and optimization; over the whole training process, the basic function processes face samples of all age domains, and each domain function only processes face samples from its corresponding age domain.
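A minimal sketch of the style fusion normalization module follows, assuming the fused style is mapped to per-channel affine parameters by two fully connected layers and that instance normalization stands in for the normalization operation.

import torch
import torch.nn as nn

class StyleFusionNorm(nn.Module):
    # fuse the identity style code with the condition style code, map the
    # result to affine parameters (gamma, beta), then modulate the
    # normalized decoder feature
    def __init__(self, feat_ch, id_dim=64, cond_dim=64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_ch, affine=False)
        self.to_gamma = nn.Linear(id_dim + cond_dim, feat_ch)
        self.to_beta = nn.Linear(id_dim + cond_dim, feat_ch)

    def forward(self, feat, id_code, cond_code):
        style = torch.cat([id_code, cond_code], dim=1)   # style fusion
        gamma = self.to_gamma(style)[:, :, None, None]   # affine parameter
        beta = self.to_beta(style)[:, :, None, None]     # affine parameter
        return gamma * self.norm(feat) + beta            # modulated feature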
4. The method for face age synthesis based on style fusion and domain selection structure according to claim 1, wherein the specific steps of the method are as follows:
firstly, extracting multi-scale characteristics of an input face image through an encoder network:
step 1.1, carrying out normalization and tensor processing on a face image, and adjusting the resolution of the image to 256 multiplied by 256 to obtain a preprocessed face image;
step 1.2, inputting the preprocessed face image into a first layer convolution module to obtain a first layer encoder output characteristic which is recorded as
Step 1.3, extracting depth features of different scales from the first-layer encoder output feature through the encoder network, the specific operation being shown in formula (1):

F_enc^i = Conv(F_enc^(i-1)), i = 2, ..., N_enc  (1)

wherein F_enc^i represents the output features of the i-th layer encoder, F_enc^(i-1) represents the output features of the (i-1)-th layer encoder, Conv represents the convolution module, and N_enc represents the number of encoder network layers;
thus, the extraction of the multi-scale features of the input face image is completed;
second, extracting identity style codes by using an identity extractor:
step 2.1, for the output characteristics of the encoders of different layers, performing convolution operation by using convolution layers with convolution kernel size of 1×1, so as to obtain a corresponding learnable matrix, as shown in formula (2):
Wherein M is i Representing i-layer encoder output characteristicsCorresponding learnable matrix, < >>Representing a convolution layer with a convolution kernel size of 1 x 1, N enc Indicating the number of encoder network layers;
Step 2.2, taking the learnable matrix of step 2.1 as an operation parameter and multiplying it element-wise with the corresponding encoder output features to obtain the re-weighted features, the specific operation being shown in formula (3):

F'_enc^i = F_enc^i ⊙ M_i, i = 1, ..., N_enc  (3)

wherein F'_enc^i represents the result of the operation of the i-th layer encoder output features F_enc^i with the learnable matrix, ⊙ represents matrix element-wise multiplication, and N_enc represents the number of encoder network layers;
step 2.3, the characteristics after the weighted allocation in the step 2.2 are subjected to global average pooling and a full connection layer to obtain the final corresponding identity style codes, as shown in a formula (4):
wherein, id i Representing output characteristics of an i-th layer encoderCorresponding identity style coding, FC represents full connection layer, sumPooling represents global average pooling, N enc Indicating the number of encoder network layers;
thirdly, extracting the conditional style codes by using a conditional mapper:
step 3.1, converting the conditional label in the form of one-hot code into one-dimensional vector, and adding smoothing noise z-N (0,0.2) 2 ) Obtaining a condition vector as shown in a formula (5):
Wherein N is a The number of the age fields is represented,condition vector, y representing nth attribute condition n Representing the converted one-dimensional condition vector;
step 3.2, using the full connection layer, pixel normalization layer and the LeakyReLU activation function pairsPerforming feature mapping to obtain conditional mapping feature F map As shown in formula (6):
wherein FC represents a fully connected layer, pixelNorm represents a pixel normalization layer, and LeakyReLU represents a LeakyReLU activation function;
step 3.3, using the full connection layer, the pixel normalization layer and the LeakyReLU activation function pair F map And performing feature mapping to obtain a conditional style code a, as shown in a formula (7):
a=LeakyReLU(PixelNorm(FC(F map )) (7)
wherein FC represents a fully connected layer, pixelNorm represents a pixel normalization layer, and LeakyReLU represents a LeakyReLU activation function;
fourth, fusing identity style codes and conditional style codes, and mapping fusion features into normalized affine parameters:
step 4.1, encoding id by identity style i Channel splicing is carried out on the conditional style code a, and splicing characteristics corresponding to the output characteristics of the ith layer encoder are obtainedAs shown in formula (8):
wherein Concat represents channel splicing operation, N enc Layer representing encoder networkA number;
step 4.2, mapping the splicing characteristic into a corresponding affine parameter by using the full connection layer to obtain an affine parameter gamma corresponding to the output characteristic of the ith layer encoder i And beta i As shown in formulas (9), (10):
wherein FC represents a fully connected layer, N enc Representing the number of layers of the encoder network;
fifth step, modulating decoder features with affine parameters:
step 5.1, inputting the output characteristic diagram of the last layer of encoder into a first layer transposition convolution module of a decoder, and performing channel splicing on the output characteristic diagram of the decoder and the output characteristic diagram of the jump-connected penultimate layer encoder to obtain a first layer decoding characteristicAs shown in formula (11):
wherein TConv represents transpose convolution, concat represents channel splicing operation, N enc Representing the number of layers of the encoder network;
step 5.2, using the corresponding affine parametersAnd->Decoding features for the first layer->Modulating to obtain modulated characteristic->As shown in formula (12):
wherein ChannelNorm represents a channel normalization operation;
step 5.3, the modulation characteristics are as described aboveThe final modulation characteristic is obtained through a multi-layer transposition convolution module and a style fusion normalization module>The specific operation is shown in formulas (13) and (14):
wherein TConv represents transpose convolution, concat represents channel splicing operation, channelNorm represents channel normalization operation, N enc Indicating the number of layers, N, of the encoder network dec Indicating the number of layers of the decoder network;
sixth, obtaining a final generated image:
the final modulated feature F'_dec^(N_dec - 1) is input into the last-layer transposed convolution module of the decoder, and the final generated image x_t is obtained through a tanh activation function, as shown in formula (15):

x_t = tanh(TConv(F'_dec^(N_dec - 1)))  (15)

wherein TConv represents transposed convolution;
seventh, the discriminating process of the domain selection discriminator;
step 7.1, inputting the image sample into a basic convolution module of a first domain selection structure convolution module of the discriminator to obtain a first layer of basic featuresAs shown in equation (16):
wherein BaseConv represents a basic convolution module and x represents an input image sample;
step 7.2, inputting the image sample into a domain selection convolution module of a first domain selection structure convolution module of the discriminator to obtain a first layer domain featureAs shown in formula (17):
wherein, domConv y A domain selection convolution module representing a corresponding condition y;
step 7.3, splicing the first layer basic features and the first layer domain features together to obtain first layer identifier featuresAs shown in equation (18):
wherein Concat represents a channel splicing operation;
step 7.4, the first layer of the discriminator features sequentially pass through a basic convolution module and a domain selection convolution module in a plurality of domain selection structure convolution modules, and final convolution function features are obtained according to the mode As shown in formula (19):
wherein Concat represents channel splicing operation, baseConv represents a basic convolution module, and DomConv y Domain selection convolution module representing corresponding condition y, N dc Representing the number of domain selection structure convolution modules in the discriminator network;
step 7.5, use of the Flatten function pairPerforming dimension reduction processing to obtain output characteristic F Flatten As shown in formula (20):
wherein, flat represents the operation of performing dimension reduction compression on tensors;
step 7.6, F Flatten Respectively inputting into a basic full-connection module and a domain selection full-connection module in a domain selection structure full-connection module, and obtaining a final discriminator characteristic F through splicing operation dis As shown in formula (21):
F dis =Concat(BaseFC(F flatten ),DomFC y (F flatten )) (21)
wherein Concat represents channel splicing operation, baseFC represents a basic full-connection module, and DomConv y A domain selection full-connection module for representing a corresponding condition y;
step 7.7Using a fully connected layer pair F dis Performing operation to obtain output result F of final discriminator logit As shown in equation (22):
F logit =FC(F dis ) (22)
wherein FC represents a fully connected layer;
the operations from the first step to the seventh step finish the construction of the generator network and the domain selection discriminant network based on style fusion, and the final output result of the generated image and the domain selection discriminant meeting the target conditions is obtained;
Eighth step, training a face age synthesis method based on style fusion and domain selection structures;
the style fusion based generator network and the domain selection discriminator network form an adversarial network; the training of the adversarial network adopts a strategy of multi-domain alternating training: in each training iteration all age domains are traversed, each training pass involves two age domains, and the sampling rule is shown in formula (23):

t = (s + p + 1) mod N_a  (23)

wherein t represents the index of the target domain, s represents the index of the source domain, p represents the current iteration number, mod represents the remainder operation, and N_a represents the number of age domains involved in the method;
calculating the total loss of the generator network over a batch and optimizing the generator network by a gradient descent algorithm, then calculating the total loss of the domain selection discriminator under the corresponding condition and optimizing the domain selection discriminator network by the gradient descent algorithm; through this alternating optimization across different domains, the loss function of the generator finally reaches a convergence state, ensuring that the generator network generates face images satisfying the target condition.
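Read together, steps 1 through 6 amount to the following PyTorch sketch of the generator's forward pass; it reuses the IdentityExtractor, ConditionMapper and StyleFusionNorm sketches above, and the channel widths are assumptions.

import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, n_enc=5, base_ch=64, id_dim=64, cond_dim=64):
        super().__init__()
        chans = [3] + [min(base_ch * 2 ** i, 512) for i in range(n_enc)]
        self.enc = nn.ModuleList(
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 4, 2, 1),
                          nn.InstanceNorm2d(chans[i + 1]),
                          nn.LeakyReLU(0.2))
            for i in range(n_enc))                      # formula (1)
        self.ids = nn.ModuleList(
            IdentityExtractor(chans[i + 1], id_dim) for i in range(n_enc))
        self.cond = ConditionMapper(cond_dim=cond_dim)
        self.up, self.sfn = nn.ModuleList(), nn.ModuleList()
        c = chans[-1]
        for j in range(n_enc - 1):                      # N_dec - 1 modulated stages
            skip = chans[n_enc - 1 - j]
            self.up.append(nn.ConvTranspose2d(c, skip, 4, 2, 1))
            self.sfn.append(StyleFusionNorm(2 * skip, id_dim, cond_dim))
            c = 2 * skip
        self.out = nn.ConvTranspose2d(c, 3, 4, 2, 1)    # last decoder layer

    def forward(self, x, y_onehot):
        feats = []
        for block in self.enc:                          # multi-scale features
            x = block(x)
            feats.append(x)
        a = self.cond(y_onehot)                         # condition style code
        ids = [ext(f) for ext, f in zip(self.ids, feats)]
        h = feats[-1]
        for j, (up, sfn) in enumerate(zip(self.up, self.sfn)):
            skip = feats[-(j + 2)]                      # axisymmetric skip scale
            h = torch.cat([up(h), skip], dim=1)         # formulas (11), (13)
            h = sfn(h, ids[-(j + 2)], a)                # formulas (12), (14)
        return torch.tanh(self.out(h))                  # formula (15)

Step 7 likewise condenses to the following sketch of the domain selection discriminator, continuing the same sketch and reusing the DomainSelectConv and pixel_norm helpers above; the channel widths and the 256×256 input resolution are assumptions.

class DomainSelectDiscriminator(nn.Module):
    def __init__(self, n_dc=4, n_domains=6, base_ch=64, fc_dim=256, res=256):
        super().__init__()
        chans = [3] + [base_ch * 2 ** i for i in range(n_dc)]
        self.convs = nn.ModuleList(
            DomainSelectConv(chans[i], chans[i + 1], n_domains)
            for i in range(n_dc))
        flat = chans[-1] * (res // 2 ** n_dc) ** 2
        self.base_fc = nn.Linear(flat, fc_dim)          # BaseFC
        self.dom_fc = nn.ModuleList(
            nn.Linear(flat, fc_dim) for _ in range(n_domains))  # DomFC_y
        self.act = nn.LeakyReLU(0.2)
        self.fc = nn.Linear(2 * fc_dim, 1)

    def forward(self, x, y):
        for conv in self.convs:                         # formulas (16)-(19)
            x = conv(x, y)
        f = torch.flatten(x, 1)                         # formula (20)
        f = torch.cat([self.act(pixel_norm(self.base_fc(f))),
                       self.act(pixel_norm(self.dom_fc[y](f)))],
                      dim=1)                            # formula (21)
        return self.fc(f)                               # F_logit, formula (22)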
5. The method of face age synthesis based on style fusion and domain selection structure according to claim 4, wherein the total loss function of the style fusion based generator network is as shown in formula (24):

L_G = λ_1 L_rec + λ_2 L_ip + λ_3 L_adv^t  (24)

wherein λ_i, i ∈ {1, 2, 3}, represents the weight coefficient between the losses, L_G represents the total loss function of the generator network, L_rec represents the reconstruction loss function, L_ip represents the identity-aware loss function, and L_adv^t represents the adversarial loss of generation under the corresponding target condition;
the reconstruction loss of the style fusion based generator G is shown in equation (25):

L_rec = E_(x_s)[ || x_s - G(x_s, c_s) ||_1 ]  (25)

wherein E_(x_s) represents the mean over the current source domain image batch, ||·||_1 represents the L1 norm, x_s represents a face image sampled from the source domain, and c_s represents the condition vector of the source domain;
the identity-aware loss function of the style fusion based generator G is shown in equation (26):

L_ip = E_(x_s)[ || F(x_s) - F(G(x_s, c_t)) ||_1 ]  (26)

wherein E_(x_s) represents the mean over the current source domain image batch, ||·||_1 represents the L1 norm, F represents the relu5_3 layer network of the VGG16 model, x_s represents a face image sampled from the source domain, and c_t represents the condition vector of the target domain;
the adversarial loss function of the style fusion based generator G is shown in equation (27):

L_adv^t = E_(x_s)[ log(1 - D_t(G(x_s, c_t))) ]  (27)

wherein E_(x_s) represents the mean over the current source domain image batch, D_t represents the domain selection discriminator of the corresponding target condition, x_s represents a face image sampled from the source domain, and c_t represents the condition vector of the target domain;
the loss function of the domain selection discriminator is shown in equation (28):

L_D = -E_(x_t)[ log D_t(x_t) ] - E_(x_s)[ log(1 - D_t(G(x_s, c_t))) ]  (28)

wherein E_(x_s) represents the mean over the current source domain image batch, E_(x_t) represents the mean over the current target domain image batch, D_t represents the domain selection discriminator that selects the domain function of the target domain t, x_s represents a face image sampled from the source domain, x_t represents a face image sampled from the target domain, and c_t represents the condition vector of the target domain;
for the generator, the source domain indicates from which domain the input face sample comes, and the target domain indicates which domain the output face sample represents; each training iteration runs from a source domain to a target domain, and the domain selection discriminator provides the data distribution of the target domain, so the domain function corresponding to the target domain is selected; that is, in each pass only the basic function and the domain function corresponding to the target domain participate in the computation, and the other domain functions do not.
6. The method for face age synthesis based on style fusion and domain selection structure according to claim 4, wherein after training stabilizes, the trained style fusion based generator network is obtained; test face images of different age domains are input into the style fusion based generator network to obtain face images satisfying the target condition, detection is performed with a third-party face attribute detection API, and the average synthesized age and the face identity retention are calculated;
the age prediction interface of the face attribute detection API provided by Face++ is used to obtain the age of the generated face corresponding to each test image, and by detecting and averaging over the test set samples the average synthesized ages of the different age domains are finally obtained;
the face comparison API provided by Face++ is used to compare whether the generated face image and the source test face image are the same person; if so, the pair is recorded as a positive example, and the face identity retention on the test set is finally obtained through detection, comparison and statistics over the test set samples.
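As an illustrative aggregation sketch only: detect_age and same_person below are hypothetical stand-ins for the Face++ age-estimation and face-comparison calls, not that service's real API.

def evaluate(pairs, detect_age, same_person):
    # pairs: iterable of (source_image, generated_image) test samples
    ages, positives, total = [], 0, 0
    for src, gen in pairs:
        ages.append(detect_age(gen))        # predicted age of the generated face
        positives += int(same_person(src, gen))
        total += 1
    mean_age = sum(ages) / len(ages)        # average synthesized age
    retention = positives / total           # face identity retention
    return mean_age, retention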
7. The method for face age synthesis based on style fusion and domain selection structure according to any one of claims 1-6, wherein the encoder network layer number N_enc is 5, the number of age domains N_a is 6, the decoder network layer number N_dec is 5, and the number of domain selection structure convolution modules N_dc is 4.