CN108596024B - Portrait generation method based on face structure information - Google Patents

Info

Publication number
CN108596024B
CN108596024B
Authority
CN
China
Prior art keywords
image
face
sketch
network
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810206139.XA
Other languages
Chinese (zh)
Other versions
CN108596024A (en)
Inventor
俞俊 (Yu Jun)
施圣洁 (Shi Shengjie)
高飞 (Gao Fei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN201810206139.XA
Publication of CN108596024A
Application granted
Publication of CN108596024B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a portrait generation method based on face structure information. The method comprises the following steps: 1. Data preprocessing is performed on the original image, the target image, and the face structure information. 2. Feature extraction and fusion are performed with a face structure information model at the input of the image generator. 3. A combined loss function based on the face structure components is used in the loss function of the image generator. 4. A generative adversarial network is used, in which images are produced by the generator and judged by the discriminator. 5. Model training: the neural network parameters are trained with the back-propagation algorithm. The invention provides a neural network model for generating a sketch portrait from a face photo, in particular a method that guides the portrait generator with face-part information and computes a per-part loss from that information to optimize the network parameters.

Description

Portrait generation method based on face structure information
Technical Field
The invention relates to a generative adversarial network for face photo-to-sketch synthesis (Photo-Sketch Synthesis), and mainly relates to modeling the generation of face sketch portraits and using face structure information to guide the optimization of the generated images.
Background
The problem of face sketch portrait generation (Sketch Portrait Generation) is to generate a corresponding face sketch from a given face photo; it is also called photo-sketch conversion (Photo-Sketch Conversion) or face sketch synthesis (Face Sketch Synthesis). Face sketch generation has many applications, for example in entertainment and criminal investigation. An ideal generated face sketch has two characteristics: first, it preserves the person's appearance, so that the identity information in the sketch can be recognized with high accuracy; second, it should look like a genuine sketch drawing, so that it is visually convincing. Although some successful methods have been proposed in this field, existing generation methods, even those based on deep learning, still produce blurring or severe deformation during face sketch generation.
In recent years, generative adversarial networks (GANs) have been highly successful on problems such as image style transfer (Image Style Transfer), image super-resolution (Image Super-Resolution), and image-to-image translation (Image-to-Image Translation). The sketch generation problem can be cast as a photo-to-sketch translation problem, which is well handled by modeling with a conditional generative adversarial network (cGAN). Although a cGAN can achieve good performance, for example generating good textures, it is difficult for it to effectively model the relationships between the parts of a face without a given face structure.
In real scenes, photo-to-sketch conversion is also widely applied, particularly in criminal investigation and security, where it can help investigators locate a suspect or narrow the search scope. Although surveillance video is now widely deployed, in practice the suspect is often not captured or the resolution is too low; in such cases a witness describes the suspect's facial features, a professional artist draws a face sketch portrait, and the sketch is compared against a police database to find the suspect. Face recognition on face photos is mature, but matching sketches against photos is still not well solved. On the other hand, sketch generation also pursues visual quality: a good generated sketch preserves identity information and has sketch-like texture and clear detail in each part, which makes it widely applicable to entertainment.
Because real face images have complex content, many facial parts, and differences in the detail each part must present, photo-to-sketch conversion faces great challenges. Specifically, there are two main difficulties:
(1) Modeling the face photo image, extracting features, and preserving identity information: in traditional face sketch generation, preserving identity information is an important problem. However, extracting and retaining identity features during sketch generation remains difficult, especially when visual quality is also pursued. In criminal investigation applications, face identity information is indispensable, so it must be preserved throughout the photo-to-sketch generation process.
(2) The structure of each face part and the overall visual effect: problems that commonly occur in face sketch generation include deformation of the face structure and blurring of details, especially hair texture; a blurred or unrealistic overall drawing is also frequently encountered. In particular, when the goal is better recognition, the influence of the identity features on the visual effect is even more pronounced. For applications such as entertainment, improving the visual quality of the face sketch is also an important aspect.
Disclosure of Invention
The object of the invention is to provide a portrait generation method based on face structure information, addressing the deficiencies of the prior art.
The technical solution adopted by the invention to solve this technical problem comprises the following steps:
step (1), data preprocessing
The original face photo image has size 250 × 200 with 3 RGB channels; the corresponding original face sketch image also has size 250 × 200 and is a grayscale image with 1 channel. The face photo image and its sketch image Y are aligned uniformly: the aligned face photo image keeps the size and channel number of the original, with an inter-eye distance of 50 and a distance of 125 from the eyes to the top of the image. From the aligned face photo image, a face structure part probability map is obtained with an existing method, namely the semantic parsing network (P-Net) in Fig. 1; this map has size 250 × 200 and 11 channels, each channel giving the probability of one of 11 parts. The probability maps of the left and right eyebrows, the left and right eyes, and the upper and lower lips among the 11 parts are merged by pixel-wise addition, finally yielding probability maps for 8 parts in total. During training, the three kinds of images are zero-padded at the edges to 286 × 286, and images of size 256 × 256 at the same random position are taken from each for every training step.
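For illustration, the following is a minimal NumPy sketch of this padding and cropping step, applying one shared random 256 × 256 crop to the photo, the sketch, and the part probability map; the helper name and array shapes are assumptions, not part of the patent.

```python
import numpy as np

def pad_and_random_crop(photo, sketch, prob_map, pad_to=286, crop=256):
    """photo: (250, 200, 3); sketch: (250, 200, 1); prob_map: (250, 200, 8)."""
    def pad(img):
        h, w = img.shape[:2]
        top, left = (pad_to - h) // 2, (pad_to - w) // 2  # equal top/bottom, left/right
        return np.pad(img, ((top, pad_to - h - top),
                            (left, pad_to - w - left),
                            (0, 0)))
    photo, sketch, prob_map = pad(photo), pad(sketch), pad(prob_map)
    y = np.random.randint(0, pad_to - crop + 1)           # one offset shared by all three
    x = np.random.randint(0, pad_to - crop + 1)
    return (photo[y:y + crop, x:x + crop],
            sketch[y:y + crop, x:x + crop],
            prob_map[y:y + crop, x:x + crop])
```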
Step (2), feature extraction and fusion based on face structure information
Based on the existing original U-shaped network (U-Net), face structure information is added at the input to improve the U-shaped network; the model is implemented as a neural network. The structure is shown in Fig. 1: the face photo image X is input into the appearance encoder (Appearance Encoder), finally yielding a feature map of size 1 × 1, while the feature map after each convolution operation is retained; the face structure part probability map P is input into the structure encoder (Composition Encoder) and processed in the same way as the face photo. In the decoder (Decoder), each input is concatenated with the same-size feature maps retained in the encoders before the next operation. Finally, the required face sketch image is obtained at the end of the decoder.
The aligned face photo image X, the corresponding sketch image Y, and the face structure part probability map P form triples (X, Y, P), which serve as the training set.
Step (3), a combined loss function based on the face structure component:
By the method of step (2), a face sketch image of size 256 × 256 has been obtained. Based on the existing face structure part probability maps, each part is optimized separately: the generated image is multiplied pixel-wise by the probability map of each part, the same is done for the original face sketch image, and the Manhattan distance ($L_1$ distance) between the two results is used to optimize the network model.
Step (4) generating a countermeasure network
The network is divided into a generator and a discriminator; the generator makes the generated portrait approach the distribution of real portraits, and the discriminator calculates and optimizes a loss function by judging whether a portrait is a real original portrait or a generated one.
Step (5), model training
Using a training set consisting of the existing "photo-structure information-portrait" triples, portraits are generated with the model of step (2), the loss of the network is calculated with steps (3) and (4), and the neural network model parameters of steps (2) and (4) are trained with the back-propagation algorithm until the whole neural network model converges.
Preprocessing the data in the step (1):
First, face alignment is performed; the face photo image X obtained after alignment has the same size and channel number as the original image, with an inter-eye distance of 50 and a distance of 125 from the eyes to the top edge of the image.
Secondly, the face photo image X is decomposed into 8 parts of probability graph based on pixel points by a face semantic analysis method
Figure BDA0001595926880000041
Wherein
Figure BDA0001595926880000042
The probability output for each component is separately,
Figure BDA0001595926880000043
the probability that the pixel Xi, j belongs to the C-th component is shown, wherein C is 1, …, C, and C is 8. The 8 parts are respectively: eyes, eyebrows, nose, upper and lower lips, mouth, face, hair, background;
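As a hedged NumPy illustration of merging the 11 parsing channels into 8 by pixel-wise addition of the paired eyebrow, eye, and lip channels; the channel ordering used below is an assumption for the sketch, not taken from the patent.

```python
import numpy as np

# assumed channel layout of the 11-channel parsing output (illustrative only)
PAIRS = [(1, 2), (3, 4), (5, 6)]   # left/right eyebrows, left/right eyes, upper/lower lips
SINGLES = [0, 7, 8, 9, 10]         # e.g. face, nose, mouth, hair, background

def merge_parts(p11):
    """p11: (h, w, 11) probability map -> (h, w, 8) by adding paired channels."""
    merged = [p11[..., a] + p11[..., b] for a, b in PAIRS]
    singles = [p11[..., i] for i in SINGLES]
    return np.stack(merged + singles, axis=-1)
```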
The aligned face photo image is $X \in \mathbb{R}^{h \times w \times c}$ and the face sketch image is $Y \in \mathbb{R}^{h \times w \times 1}$, where h, w, and c denote the height, width, and channel number of the face photo image, respectively.
The feature extraction and fusion based on the face structure information in the step (2) are specifically as follows:
2-1. First, the original U-shaped network (U-Net) has the following specific structure:
The U-shaped network is divided into two parts: an encoder and a decoder.
The encoder (Encoder) consists of 8 modules (Blocks); modules 2-7 each consist of 3 operations, in order: a Leaky rectified linear unit (Leaky ReLU), a convolution (CNN), and batch normalization (BN); the first module contains only a convolution, and the last module contains a Leaky rectified linear unit and a convolution. Meanwhile, the output of each module is retained as a feature map for use in the decoder.
The decoder (Decoder) likewise consists of 8 modules; modules 1-7 each consist of 3 operations, in order: a rectified linear unit (ReLU), a convolution (CNN), and batch normalization (BN); the last module consists of a rectified linear unit, a convolution, and a Tanh.
In the decoder part, the last feature map (Feature map) of the encoder is used as the input of the first decoder module, and the input of each subsequent decoder module is the previous module's output concatenated (Concatenated) with the encoder feature map of the corresponding size. The required output image is obtained at the end of the decoder.
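As an illustration only, here is a minimal PyTorch sketch of the encoder and decoder modules just described, using the layer parameters given later in the detailed description (kernel size 4, stride 2, padding 1, Leaky ReLU negative slope 0.2); the helper names and the use of transposed convolutions for the decoder's upsampling are assumptions, not stated in the patent.

```python
import torch.nn as nn

def encoder_block(c_in, c_out, first=False, last=False):
    # modules 2-7: LeakyReLU -> Conv -> BatchNorm;
    # first module: Conv only; last module: LeakyReLU -> Conv
    layers = [] if first else [nn.LeakyReLU(0.2)]
    layers.append(nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1))
    if not (first or last):
        layers.append(nn.BatchNorm2d(c_out))
    return nn.Sequential(*layers)

def decoder_block(c_in, c_out, last=False):
    # modules 1-7: ReLU -> transposed Conv -> BatchNorm;
    # last module: ReLU -> transposed Conv -> Tanh
    layers = [nn.ReLU(),
              nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1)]
    layers.append(nn.Tanh() if last else nn.BatchNorm2d(c_out))
    return nn.Sequential(*layers)
```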
2-2. the U-shaped network added with the face structure information has the following specific structure:
the U-shaped network added with the face structure information comprises two encoders and a decoder.
The two encoders process the face photo image X and the face structure information P, respectively; that is, X and P serve as the inputs of the two encoders, and the specific network structure and the retained feature maps of each are the same as in the original U-shaped network.
The encoder for the face photo image X yields a feature-map set $F_X = \{F_X^1, \dots, F_X^S\}$, and the encoder for the face structure information P yields a feature-map set $F_P = \{F_P^1, \dots, F_P^S\}$, where S = 8.
In the decoder part, the operation of each module is the same as in the original U-shaped network. At the input of the first module, the last feature maps of the two encoders are concatenated to obtain the feature map $I_1 = [F_X^S; F_P^S]$, which serves as the input of the first module. Likewise, the output O of each module is concatenated with the feature maps of corresponding size from the two encoders; that is, for the output $O_1$ of the first decoder module, $F_X^{S-1}$, $F_P^{S-1}$, and $O_1$ are concatenated to form the input $I_2 = [O_1; F_X^{S-1}; F_P^{S-1}]$ of the next module, and the later outputs follow by analogy. Finally, the required 256 × 256 face sketch image $\hat{Y}$ is obtained.
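A hedged PyTorch sketch of this two-encoder U-shaped generator: both encoders retain their per-module feature maps, the deepest maps are concatenated as $I_1$, and each later decoder input is the previous output concatenated with the same-resolution maps from both encoders. The class name and the assumption that the modules are supplied as nn.ModuleList objects are illustrative.

```python
import torch
import torch.nn as nn

class TwoEncoderUNet(nn.Module):
    """Sketch of the two-encoder U-shaped generator (names are illustrative)."""

    def __init__(self, enc_x, enc_p, dec):
        super().__init__()
        # enc_x, enc_p, dec: nn.ModuleList objects with 8 modules each,
        # e.g. built from the encoder_block/decoder_block helpers above
        self.enc_x, self.enc_p, self.dec = enc_x, enc_p, dec

    def forward(self, x, p):
        feats_x, feats_p = [], []
        for ex, ep in zip(self.enc_x, self.enc_p):
            x, p = ex(x), ep(p)
            feats_x.append(x)
            feats_p.append(p)
        out = torch.cat([feats_x[-1], feats_p[-1]], dim=1)   # I_1
        for i, block in enumerate(self.dec):
            out = block(out)
            if i < len(self.dec) - 1:                        # skips for modules 1..7
                out = torch.cat([out, feats_x[-2 - i], feats_p[-2 - i]], dim=1)
        return out   # the generated 256 x 256 sketch
```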
The combined loss function based on the face structure part probability map P in step (3) is as follows:
The loss function comprises two parts: a global loss and a loss for the individual components, denoted $\mathcal{L}_{global}$ and $\mathcal{L}_{comp}$, respectively.
For $\mathcal{L}_{global}$, the specific formula of the loss function is:
$$\mathcal{L}_{global}(\hat{Y}, Y) = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\left|\hat{Y}_{i,j} - Y_{i,j}\right|,$$
where $\hat{Y}$ denotes the generated face sketch image, Y denotes the target sketch image, and m and n denote the height and width of the sketch image, respectively; $\hat{Y}$ is computed as $\hat{Y} = G(X, P)$, where G denotes the generator.
For $\mathcal{L}_{comp}$: each part is optimized separately on the basis of the existing face structure part probability map P, as follows.
First, a weight factor is introduced to balance the losses of parts containing different numbers of pixels. For each part c, the specific formula is:
$$\mathcal{L}_{c}(\hat{Y}, Y) = \frac{1}{N_c}\left\|P^{c} \odot \hat{Y} - P^{c} \odot Y\right\|_{1},$$
where $N_c = \sum_{i,j} P^{c}_{i,j}$ denotes the sum of the probabilities of all pixels in the c-th component, and $\odot$ denotes multiplication of corresponding pixel points.
The total loss function of the 8 components is therefore:
$$\mathcal{L}_{comp}(\hat{Y}, Y) = \sum_{c=1}^{C}\mathcal{L}_{c}(\hat{Y}, Y), \qquad C = 8.$$
The total loss function of the resulting generator is:
$$\mathcal{L}_{comb} = \alpha\,\mathcal{L}_{global} + (1 - \alpha)\,\mathcal{L}_{comp},$$
where α preferably ranges from 0 to 1.
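A minimal sketch of this combined loss as reconstructed above: a global $L_1$ term plus per-part $L_1$ terms normalized by each part's probability mass $N_c$. The convex combination via α matches the stated range 0 to 1, but the exact weighting form is an assumption.

```python
import torch

def combined_loss(y_hat, y, prob_map, alpha=0.7, eps=1e-8):
    """y_hat, y: (B, 1, H, W) sketches; prob_map: (B, 8, H, W) part probabilities."""
    diff = (y_hat - y).abs()
    l_global = diff.mean()                    # global L1 term
    weighted = prob_map * diff                # broadcast over the 8 parts
    n_c = prob_map.sum(dim=(2, 3)) + eps      # probability mass N_c per part
    l_comp = (weighted.sum(dim=(2, 3)) / n_c).sum(dim=1).mean()
    return alpha * l_global + (1 - alpha) * l_comp
```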
The generative adversarial network of step (4) is specifically as follows:
The generative adversarial network is divided into two parts as a whole: a generator (Generator) and a discriminator (Discriminator); the two encoders and the decoder of step (2) together form the generator of the generative adversarial network.
The input of the discriminator is the pair $(X, Y)$ or $(X, \hat{Y})$, where $\hat{Y}$ denotes the generated sketch image; the discriminator judges whether the sketch is real or generated, and the discrimination loss function formula is:
$$\mathcal{L}_{adv}(G, D) = \mathbb{E}_{X,Y}\left[\log D(X, Y)\right] + \mathbb{E}_{X}\left[\log\left(1 - D\left(X, G(X, P)\right)\right)\right].$$
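A hedged sketch of this discrimination loss in the usual conditional-GAN form; conditioning the discriminator on the photo X and the use of logits with binary cross-entropy are assumptions consistent with the cGAN framing in the background, not details given by the patent.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d, x, y_real, y_fake):
    """d(x, y) is assumed to return raw logits for a (photo, sketch) pair."""
    real = d(x, y_real)
    fake = d(x, y_fake.detach())   # detach: no gradient into the generator here
    return (F.binary_cross_entropy_with_logits(real, torch.ones_like(real)) +
            F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))
```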
The training model in step (5) is as follows:
The loss value $\mathcal{L}_G$ of the generator combines the combined loss $\mathcal{L}_{comb}$ of step (3) with the adversarial loss $\mathcal{L}_{adv}$ of step (4); the loss value $\mathcal{L}_D$ of the discriminator is the discrimination loss above.
According to the calculated loss values $\mathcal{L}_G$ and $\mathcal{L}_D$, the parameters in the network are adjusted using the back-propagation (BP) algorithm.
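For illustration, a minimal alternating training step under the assumptions above, reusing the combined_loss sketch from step (3); the relative weight lambda_adv between the combined loss and the adversarial term is not specified in the text, so it is an illustrative knob.

```python
import torch
import torch.nn.functional as F

def adv_loss(logits, real):
    # standard GAN binary cross-entropy on discriminator logits
    target = torch.ones_like(logits) if real else torch.zeros_like(logits)
    return F.binary_cross_entropy_with_logits(logits, target)

def train_step(g, d, opt_g, opt_d, x, p, y, combined_loss, alpha=0.7, lambda_adv=1.0):
    y_hat = g(x, p)

    # discriminator update: real pair vs. detached generated pair
    opt_d.zero_grad()
    loss_d = adv_loss(d(x, y), True) + adv_loss(d(x, y_hat.detach()), False)
    loss_d.backward()
    opt_d.step()

    # generator update: combined L1 loss plus the adversarial term
    opt_g.zero_grad()
    loss_g = combined_loss(y_hat, y, p, alpha) + lambda_adv * adv_loss(d(x, y_hat), True)
    loss_g.backward()
    opt_g.step()
    return loss_g.item(), loss_d.item()
```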
The invention has the following beneficial effects:
because the human face has strong geometric constraint and very complex structural details, the introduction of the human face structural information to assist the generation of the human face sketch is very promising. Recently, the face part marking technology based on face pixel points is rapidly developed, and with the inspiration, face structure information is introduced to generate faces. In addition, we add structure information not only at the input, but also at the Loss function part at the output, and use the upgraded version of the Loss function, which we call composite Loss.
The invention provides a deep neural network architecture for generating sketches from face photo images based on face structure information, aiming to solve the two difficult problems above: 1. generating a face sketch image with good visual quality, avoiding unreasonable structure while preserving details, so that it looks more like a hand-drawn picture; 2. preserving face identity information, i.e. achieving very high accuracy on the face recognition problem.
Drawings
FIG. 1 is a schematic view of the present invention.
Detailed Description
The detailed parameters of the invention are described in more detail below.
As shown in fig. 1, a portrait generation method based on face structure information includes the following steps:
step (1), data preprocessing:
Face alignment is applied uniformly to the face photo of original size 250 × 200 and the portrait; the aligned face photo has size 250 × 200, an inter-eye distance of 50, and a distance of 125 from the eyes to the top of the image; the images are zero-padded at the edges to obtain 286 × 286 images, and a face photo image X of size 256 × 256 is randomly taken for each training step; features with spatial and texture information are extracted using a U-Net network;
the aligned face photo image X, the corresponding sketch image Y, and the face structure part probability map P form triples (X, Y, P), which serve as the training set;
step (2) based on the probability graph of the face structure part
P, the feature extraction and fusion:
the face photo image X and the face structure part probability map P are encoded with two different encoders, the features extracted by the two encoders are concatenated with the features of the decoder, and the finally generated sketch image Y is output;
step (3) probability graph based on face structure part
P, the combined loss function:
for the sketch image Y generated in step (2), loss functions between each part and the target image are calculated according to the existing face structure part probability map P, and the network parameters are optimized by adding the loss function between the whole sketch image Y and the target image;
step (4), generating a countermeasure network:
the network is divided into a generator and a discriminator; the generator makes the generated sketch image Y approach the distribution of real images, and the discriminator calculates and optimizes a loss function by judging whether an image is a real original sketch or the generated sketch image Y;
step (5), model training:
and (3) generating a sketch image according to the model in the step (2) by using a training set consisting of the existing 'photo-structure information-sketch image' triple, calculating the loss of the network by using the steps (3) and (4), and training the model parameters of the neural network in the steps (2) and (4) by using a back propagation algorithm until the whole neural network model converges.
The data preprocessing in the step (1) comprises the following specific steps:
the CUFS data set is used here as training and testing data.
1-1. The face photo image is $X \in \mathbb{R}^{h \times w \times c}$, where c is the number of channels of the image and h = 250 and w = 200 are the height and width of the face photo image, respectively; the face sketch image is $Y \in \mathbb{R}^{h \times w \times 1}$, with the same height and width. First, face alignment is applied to X and Y; the images obtained after alignment have the same size and channel number as the originals, with an inter-eye distance of 50 and a distance of 125 from the eyes to the top edge of the image.
1-2. From the face photo X obtained after alignment in 1-1, the structural components are predicted to obtain the structure probability map of the face photo, $P \in \mathbb{R}^{h \times w \times C}$, where C is the number of channels and h = 250 and w = 200 are the height and width of the face photo image, respectively; initially C = 11, each channel giving the probability of one of 11 parts. The probability maps of the left and right eyebrows, the left and right eyes, and the upper and lower lips among the 11 parts are merged by pixel-wise addition, finally giving probability maps for 8 parts in total, i.e. $P \in \mathbb{R}^{h \times w \times C}$ with C = 8.
1-3. After obtaining X, Y, and P, the sizes of the training samples are processed uniformly: X, Y, and P are each zero-padded at the edges to obtain 286 × 286 images, and images of size 256 × 256 at the same random position are taken for each training step. When padding with 0, the numbers of zeros at the top and bottom of the image are equal, as are the numbers on the left and right.
The feature extraction and fusion based on the face structure information in the step (2) are specifically as follows:
2-1. In the encoder part, the negative slope (Negative Slope) parameter of the Leaky rectified linear units is 0.2; the convolution operations all have kernel size (Kernel Size) 4, stride (Stride) 2, and zero padding (Zero Padding) 1, and the number of feature-map channels increases in powers of 2 from 64 up to a maximum of 512.
The combined loss function based on the face structure component in the step (3) is specifically as follows:
In the loss function described in step (3), α preferably ranges from 0 to 1; here α = 0.7.

Claims (3)

1. A portrait generation method based on face structure information is characterized by comprising the following steps:
step (1), data preprocessing:
carrying out face alignment uniformly on a face photo of original size 250 × 200 and the portrait, the aligned face photo having size 250 × 200, an inter-eye distance of 50, and a distance of 125 from the eyes to the top of the image; zero-padding the images at the edges to obtain 286 × 286 images and randomly taking a face photo image X of size 256 × 256 for each training step; extracting features with spatial and texture information using a U-Net network;
forming triples (X, Y, P) from the aligned face photo image X, the corresponding sketch image Y, and the face structure part probability map P, and using them as the training set;
step (2) based on the probability graph of the face structure part
P, the feature extraction and fusion:
encoding the face photo image X and the face structure part probability map P with two different encoders, concatenating the features extracted by the two encoders with the features of the decoder, and outputting the finally generated sketch image Y;
step (3) probability graph based on face structure part
P, the combined loss function:
for the sketch image Y generated in step (2), calculating loss functions between each part and the target image according to the existing face structure part probability map P, and optimizing the network parameters by adding the loss function between the whole sketch image Y and the target image;
step (4), generating a countermeasure network:
the network is divided into a generator and a discriminator; the generator makes the generated sketch image Y approach the distribution of real images, and the discriminator calculates and optimizes a loss function by judging whether an image is a real original sketch or the generated sketch image Y;
step (5), model training:
using a training set consisting of the existing "photo-structure information-sketch image" triples, generating sketch images with the model of step (2), calculating the loss of the network with steps (3) and (4), and training the neural network model parameters of steps (2) and (4) with the back-propagation algorithm until the whole neural network model converges;
the feature extraction and fusion based on the face structure information in the step (2) are specifically as follows:
2-1. First, the original U-shaped network (U-Net) has the following specific structure:
the U-shaped network is divided into two parts: an encoder and a decoder;
in the encoder part, there are 8 modules; modules 2 to 7 each consist of 3 operations, in order: a Leaky rectified linear unit, a convolution, and batch normalization; the first module contains only a convolution, and the last module contains a Leaky rectified linear unit and a convolution; meanwhile, the output result of each module is retained as a feature map and used in the decoder;
in the decoder part, there are 8 modules; modules 1 to 7 each consist of 3 operations, in order: a rectified linear unit, a convolution, and batch normalization; the last module consists of a rectified linear unit, a convolution, and a hyperbolic tangent;
in the decoder part, the last feature map of the encoder is used as the input of the first decoder module, and the input of each decoder module is the previous output concatenated with the encoder feature map of corresponding size; the required output image is obtained at the end of the decoder;
2-2. the U-shaped network added with the face structure information has the following specific structure:
the U-shaped network added with the face structure information comprises two encoders and a decoder;
the two encoders process the face photo image X and the face structure information P, respectively, i.e. X and P serve as the inputs of the two encoders; the specific network structure and the retained feature maps are the same as in the original U-shaped network;
the encoder for the face photo image X yields a feature-map set $F_X = \{F_X^1, \dots, F_X^S\}$, and the encoder for the face structure information P yields a feature-map set $F_P = \{F_P^1, \dots, F_P^S\}$, where S = 8;
in the decoder part, the operation of each module is the same as in the original U-shaped network; at the input of the first module, the last feature maps of the two encoders are concatenated to obtain the feature map $I_1 = [F_X^S; F_P^S]$, which serves as the input of the first module; likewise, the output O of each module is concatenated with the feature maps of corresponding size from the two encoders, i.e. for the output $O_1$ of the first decoder module, $F_X^{S-1}$, $F_P^{S-1}$, and $O_1$ are concatenated to form the input $I_2 = [O_1; F_X^{S-1}; F_P^{S-1}]$ of the next module, and the subsequent outputs follow by analogy; finally, the required 256 × 256 face sketch image $\hat{Y}$ is obtained;
the combined loss function based on the face structure part probability map P in step (3) is as follows:
the loss function comprises two parts: a global loss and a loss for the individual components, denoted $\mathcal{L}_{global}$ and $\mathcal{L}_{comp}$, respectively;
for $\mathcal{L}_{global}$, the specific formula of the loss function is:
$$\mathcal{L}_{global}(\hat{Y}, Y) = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\left|\hat{Y}_{i,j} - Y_{i,j}\right|,$$
where $\hat{Y}$ denotes the generated face sketch image, Y denotes the target sketch image, m and n denote the height and width of the sketch image, respectively, and $\hat{Y} = G(X, P)$ with G the generator;
for $\mathcal{L}_{comp}$: each part is optimized separately on the basis of the existing face structure part probability map P, as follows:
first, a weight factor is introduced to balance the losses of parts containing different numbers of pixels; for each part c, the specific formula is:
$$\mathcal{L}_{c}(\hat{Y}, Y) = \frac{1}{N_c}\left\|P^{c} \odot \hat{Y} - P^{c} \odot Y\right\|_{1},$$
where $N_c = \sum_{i,j} P^{c}_{i,j}$ denotes the sum of the probabilities of all pixels in the c-th component, and $\odot$ denotes multiplication of corresponding pixels;
the total loss function of the 8 components is therefore:
$$\mathcal{L}_{comp}(\hat{Y}, Y) = \sum_{c=1}^{C}\mathcal{L}_{c}(\hat{Y}, Y), \qquad C = 8;$$
the total loss function of the resulting generator is:
$$\mathcal{L}_{comb} = \alpha\,\mathcal{L}_{global} + (1 - \alpha)\,\mathcal{L}_{comp},$$
where α preferably ranges from 0 to 1.
2. The portrait generation method based on face structure information as claimed in claim 1, wherein the data preprocessing of step (1):
first, face alignment is performed; the face photo image X obtained after alignment has the same size and channel number as the original image, with an inter-eye distance of 50 and a distance of 125 from the eyes to the top edge of the image;
second, using a face semantic parsing method, the face photo image X is decomposed pixel-wise into a probability map over 8 parts, $P = \{P^1, \dots, P^C\}$, where $P^c$ is the probability output for the c-th component and $P^c_{i,j}$ denotes the probability that pixel $X_{i,j}$ belongs to the c-th component, with $c = 1, \dots, C$ and $C = 8$; the 8 parts are: eyes, eyebrows, nose, upper and lower lips, mouth, face, hair, background;
the aligned face photo image is $X \in \mathbb{R}^{h \times w \times c}$ and the face sketch image is $Y \in \mathbb{R}^{h \times w \times 1}$, where h, w, and c denote the height, width, and channel number of the face photo image, respectively.
3. The portrait generation method based on face structure information as claimed in claim 2, wherein the generative adversarial network of step (4) is as follows:
the generative adversarial network is divided into two parts as a whole: a generator and a discriminator;
the two encoders and one decoder of step (2) together form the generator of the generative adversarial network;
the input of the discriminator is the pair $(X, Y)$ or $(X, \hat{Y})$, where $\hat{Y}$ denotes the generated sketch image; the discriminator judges whether the sketch is real or generated, and the discrimination loss function formula is:
$$\mathcal{L}_{adv}(G, D) = \mathbb{E}_{X,Y}\left[\log D(X, Y)\right] + \mathbb{E}_{X}\left[\log\left(1 - D\left(X, G(X, P)\right)\right)\right];$$
the training model of step (5) is as follows:
the loss value $\mathcal{L}_G$ of the generator combines the combined loss $\mathcal{L}_{comb}$ of step (3) with the adversarial loss $\mathcal{L}_{adv}$ of step (4); the loss value $\mathcal{L}_D$ of the discriminator is the discrimination loss above;
according to the calculated loss values $\mathcal{L}_G$ and $\mathcal{L}_D$, the parameters in the network are adjusted using the back propagation algorithm.
CN201810206139.XA 2018-03-13 2018-03-13 Portrait generation method based on face structure information Active CN108596024B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810206139.XA CN108596024B (en) 2018-03-13 2018-03-13 Portrait generation method based on face structure information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810206139.XA CN108596024B (en) 2018-03-13 2018-03-13 Portrait generation method based on face structure information

Publications (2)

Publication Number Publication Date
CN108596024A CN108596024A (en) 2018-09-28
CN108596024B true CN108596024B (en) 2021-05-04

Family

ID=63626274

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810206139.XA Active CN108596024B (en) 2018-03-13 2018-03-13 Portrait generation method based on face structure information

Country Status (1)

Country Link
CN (1) CN108596024B (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109448083B (en) * 2018-09-29 2019-09-13 浙江大学 A method of human face animation is generated from single image
WO2020062120A1 (en) 2018-09-29 2020-04-02 浙江大学 Method for generating facial animation from single image
CN109448079A (en) * 2018-10-25 2019-03-08 广东智媒云图科技股份有限公司 A kind of drawing bootstrap technique and equipment
CN109360231B (en) * 2018-10-25 2022-01-07 哈尔滨工程大学 Sea ice remote sensing image simulation method for generating confrontation network based on fractal depth convolution
CN109472838A (en) * 2018-10-25 2019-03-15 广东智媒云图科技股份有限公司 A kind of sketch generation method and device
CN109640068A (en) * 2018-10-31 2019-04-16 百度在线网络技术(北京)有限公司 Information forecasting method, device, equipment and the storage medium of video frame
CN111127304B (en) * 2018-10-31 2024-02-20 微软技术许可有限责任公司 Cross-domain image conversion
CN109741247B (en) * 2018-12-29 2020-04-21 四川大学 Portrait cartoon generating method based on neural network
CN109920021B (en) * 2019-03-07 2023-05-23 华东理工大学 Face sketch synthesis method based on regularized width learning network
CN110069992B (en) * 2019-03-18 2021-02-09 西安电子科技大学 Face image synthesis method and device, electronic equipment and storage medium
CN111860041A (en) * 2019-04-26 2020-10-30 北京陌陌信息技术有限公司 Face conversion model training method, device, equipment and medium
CN110619315B (en) * 2019-09-24 2020-10-30 重庆紫光华山智安科技有限公司 Training method and device of face recognition model and electronic equipment
CN112861579B (en) * 2019-11-27 2022-10-18 四川大学 Automatic detection method for three-dimensional facial markers
CN111127309B (en) * 2019-12-12 2023-08-11 杭州格像科技有限公司 Portrait style migration model training method, portrait style migration method and device
CN111223057B (en) * 2019-12-16 2023-09-22 杭州电子科技大学 Incremental focused image-to-image conversion method based on generation of countermeasure network
CN111242837B (en) * 2020-01-03 2023-05-12 杭州电子科技大学 Face anonymity privacy protection method based on generation countermeasure network
CN111275778B (en) * 2020-01-08 2023-11-21 杭州未名信科科技有限公司 Face simple drawing generation method and device
CN111243051B (en) * 2020-01-08 2023-08-18 杭州未名信科科技有限公司 Portrait photo-based simple drawing generation method, system and storage medium
CN111243050B (en) * 2020-01-08 2024-02-27 杭州未名信科科技有限公司 Portrait simple drawing figure generation method and system and painting robot
CN111353546B (en) * 2020-03-09 2022-12-23 腾讯科技(深圳)有限公司 Training method and device of image processing model, computer equipment and storage medium
CN111402407B (en) * 2020-03-23 2023-05-02 杭州相芯科技有限公司 High-precision portrait model rapid generation method based on single RGBD image
CN111523413B (en) * 2020-04-10 2023-06-23 北京百度网讯科技有限公司 Method and device for generating face image
CN111667007A (en) * 2020-06-08 2020-09-15 大连民族大学 Face pencil drawing image generation method based on confrontation generation network
CN111783647B (en) * 2020-06-30 2023-11-03 北京百度网讯科技有限公司 Training method of face fusion model, face fusion method, device and equipment
US20220191027A1 (en) * 2020-12-16 2022-06-16 Kyndryl, Inc. Mutual multi-factor authentication technology
CN112633288B (en) * 2020-12-29 2024-02-13 杭州电子科技大学 Face sketch generation method based on painting brush touch guidance
CN112800898A (en) * 2021-01-18 2021-05-14 深圳市网联安瑞网络科技有限公司 Pedestrian re-identification data set enhancement method, system, terminal, camera and medium
CN112907692B (en) * 2021-04-09 2023-04-14 吉林大学 SFRC-GAN-based sketch-to-face reconstruction method

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709875A (en) * 2016-12-30 2017-05-24 北京工业大学 Compressed low-resolution image restoration method based on combined deep network
CN107066969A (en) * 2017-04-12 2017-08-18 南京维睛视空信息科技有限公司 A kind of face identification method
CN107038429A (en) * 2017-05-03 2017-08-11 四川云图睿视科技有限公司 A kind of multitask cascade face alignment method based on deep learning
CN107358626A (en) * 2017-07-17 2017-11-17 清华大学深圳研究生院 A kind of method that confrontation network calculations parallax is generated using condition
CN107527318A (en) * 2017-07-17 2017-12-29 复旦大学 A kind of hair style replacing options based on generation confrontation type network model
CN107577985A (en) * 2017-07-18 2018-01-12 南京邮电大学 The implementation method of the face head portrait cartooning of confrontation network is generated based on circulation
CN107633218A (en) * 2017-09-08 2018-01-26 百度在线网络技术(北京)有限公司 Method and apparatus for generating image
CN107665339A (en) * 2017-09-22 2018-02-06 中山大学 A kind of method changed by neural fusion face character
CN107633232A (en) * 2017-09-26 2018-01-26 四川长虹电器股份有限公司 A kind of low-dimensional faceform's training method based on deep learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Portrait sketch generation based on facial features and line integral convolution; Zhao Yandan et al.; Journal of Computer-Aided Design & Computer Graphics; 2014-10-30; Vol. 26, No. 10; pp. 160-168 *
Automatic portrait generation algorithm based on example learning; Chen Hong et al.; Chinese Journal of Computers; 2003-02-28; Vol. 26, No. 2; pp. 148-156 *
Cartoon face portrait generation based on feature discovery; Zhou Renqin et al.; Journal of Computer-Aided Design & Computer Graphics; 2006-09-30; Vol. 18, No. 9; pp. 1362-1369 *
Portrait sketch caricature generation system based on correlation analysis; Hua Bo et al.; Computer Applications and Software; 2015-07-30; Vol. 32, No. 7; pp. 1712-1716 *

Also Published As

Publication number Publication date
CN108596024A (en) 2018-09-28

Similar Documents

Publication Publication Date Title
CN108596024B (en) Portrait generation method based on face structure information
CN109815826B (en) Method and device for generating face attribute model
US20200401842A1 (en) Human Hairstyle Generation Method Based on Multi-Feature Retrieval and Deformation
CN112766160A (en) Face replacement method based on multi-stage attribute encoder and attention mechanism
CN107680158A (en) A kind of three-dimensional facial reconstruction method based on convolutional neural networks model
CN110827312B (en) Learning method based on cooperative visual attention neural network
CN113240691B (en) Medical image segmentation method based on U-shaped network
CN107067429A (en) Video editing system and method that face three-dimensional reconstruction and face based on deep learning are replaced
CN109816011A (en) Generate the method and video key frame extracting method of portrait parted pattern
CN112580515B (en) Lightweight face key point detection method based on Gaussian heat map regression
CN111223057B (en) Incremental focused image-to-image conversion method based on generation of countermeasure network
Liu et al. Image decolorization combining local features and exposure features
CN113034355B (en) Portrait image double-chin removing method based on deep learning
CN109753996A (en) Hyperspectral image classification method based on D light quantisation depth network
CN113378812A (en) Digital dial plate identification method based on Mask R-CNN and CRNN
CN115731597A (en) Automatic segmentation and restoration management platform and method for mask image of face mask
CN109712095B (en) Face beautifying method with rapid edge preservation
CN114066871A (en) Method for training new coronary pneumonia focus region segmentation model
CN110555379B (en) Human face pleasure degree estimation method capable of dynamically adjusting features according to gender
CN105069767A (en) Image super-resolution reconstruction method based on representational learning and neighbor constraint embedding
CN114783039B (en) Motion migration method driven by 3D human body model
CN116311472A (en) Micro-expression recognition method and device based on multi-level graph convolution network
CN113688698B (en) Face correction recognition method and system based on artificial intelligence
CN112784800B (en) Face key point detection method based on neural network and shape constraint
CN111064905A (en) Video scene conversion method for automatic driving

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant