CN113160035A - Human body image generation method based on posture guidance, style and shape feature constraints

Info

Publication number: CN113160035A
Application number: CN202110413125.7A
Authority: CN
Inventors: 卢书芳; 卢富男; 朱翔; 寿旭峰; 陶相艳; 高飞
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2021-04-16
Filing date: 2021-04-16
Publication date: 2021-07-23

Abstract

The invention discloses a human body image generation method based on posture guidance, style and shape feature constraints, comprising the following steps: (1) acquiring a source human body image I_s and a target human body image I_t, and computing their pose images P_s, P_t and human body semantic segmentation images S_s, S_t; (2) constructing a generator G and discriminators D_I, D_P; (3) inputting I_s and S_s into a style encoder, P_t into a pose encoder, and S_t into a shape encoder, and inputting the resulting style features, pose features and shape features into a decoder to obtain a virtual target human body image I_f; (4) taking (I_s, I_t) and (I_s, I_f) as inputs of the discriminator D_I, and (P_t, I_t) and (P_t, I_f) as inputs of the discriminator D_P, computing the respective adversarial losses, computing the image reconstruction loss, perceptual loss and semantic loss from I_f and I_t, and optimizing G; (5) training iteratively to obtain a generator G for generating human body images. With the invention, style features can be extracted per semantic region, and the pose and shape of the human body image can be controlled.

Description

Human body image generation method based on posture guidance, style and shape feature constraints

Technical Field

The invention belongs to the technical field of human body image generation, and particularly relates to a human body image generation method based on posture guidance, style and shape feature constraints.

Background

Human body image generation is an important branch of computer vision and can be widely applied to data augmentation for pedestrian re-identification, film character creation, virtual fitting, augmented reality and other fields. Pose-guided human body image generation means that, given a target pose and a source image (or a group of source images), a target human body image that has the style features of the source image and is in the target pose is generated under the guidance of the target pose.

For example, Chinese patent publication No. CN112116673A discloses a method for generating a virtual human body image based on structural similarity under posture guidance; Chinese patent publication No. CN109191366A discloses a method and apparatus for synthesizing multi-view human body images based on human body posture.

Current human body image generation methods have two problems: (1) in style feature extraction, a global style feature is usually extracted from the whole source image, so the features of a specific semantic region cannot be extracted independently; (2) the control mode is limited: only the pose of the source image can be changed, and the style and shape of a specific semantic region cannot be controlled.

Therefore, there is a need for a human body image generation method that can provide a variety of image synthesis control methods.

Disclosure of Invention

The invention provides a human body image generation method based on posture guidance, style and shape feature constraints, which can extract style features according to semantic regions and control the posture and the shape of a human body image.

A human body image generation method based on posture guidance, style and shape feature constraints comprises the following steps:

(1) acquiring a source human body image I_s and a target human body image I_t, and obtaining from them the pose images P_s, P_t and the human body semantic segmentation images S_s, S_t of the source and target human body images, respectively;

(2) constructing a generator G and discriminators D_I, D_P, wherein the generator G comprises a style encoder Encoder_style, a pose encoder Encoder_pose, a shape encoder Encoder_shape and a decoder Decoder; the discriminator D_I discriminates the texture similarity between the virtual target image I_f and the source human body image I_s; the discriminator D_P discriminates the consistency of the virtual target image I_f with the target pose P_t;

(3) inputting the source human body image I_s obtained in step (1), fused with the source human body semantic segmentation image S_s, into the style encoder Encoder_style, the target pose image P_t into the pose encoder Encoder_pose, and the target human body semantic segmentation image S_t into the shape encoder Encoder_shape;

inputting the extracted style features, pose features and shape features into the decoder Decoder to obtain a virtual target human body image I_f;

(4) taking (I_s, I_t) and (I_s, I_f) as inputs of the discriminator D_I, and (P_t, I_t) and (P_t, I_f) as inputs of the discriminator D_P, computing the respective adversarial losses L_adv, computing the image reconstruction loss L_{reconstruction}, the perceptual loss L_perceptual and the semantic loss L_CX from I_f and I_t, and optimizing G;

(5) repeating steps (3) and (4) until a preset number of iterations is reached to obtain a trained generator G, which is used to generate virtual target images in real scenes.
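For illustration only, the data flow of steps (3) and (4) can be sketched in Python as follows. The function names generate and discriminator_pairs and the idea of passing the encoders and decoder as callables are hypothetical choices made for this sketch; the networks themselves are not implemented here.

```python
def generate(I_s, S_s, P_t, S_t, style_enc, pose_enc, shape_enc, decoder):
    """Wire the three encoders into the decoder as described in step (3).

    The four callables are placeholders (assumptions of this sketch) for
    Encoder_style, Encoder_pose, Encoder_shape and Decoder.
    """
    f_style = style_enc(I_s, S_s)  # style from source image and its segmentation
    f_pose = pose_enc(P_t)         # pose from the target pose image
    f_shape = shape_enc(S_t)       # shape from the target segmentation
    return decoder(f_style, f_pose, f_shape)  # virtual target image I_f


def discriminator_pairs(I_s, I_t, P_t, I_f):
    """Real and fake input pairs for D_I and D_P as described in step (4)."""
    return {
        "D_I": {"real": (I_s, I_t), "fake": (I_s, I_f)},
        "D_P": {"real": (P_t, I_t), "fake": (P_t, I_f)},
    }
```

Note that only the target pose P_t and the target segmentation S_t enter the generator; P_s is computed in step (1) but is not an input of G in the formula given for the reconstruction loss.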

In the step (1), the number N of key points of the pose image is 18, and the number C of classes of the human semantic segmentation image is 8.

The specific steps of the step (2) are as follows:

(2-1) construction of the style encoder Encoder_style

Encoder_style comprises five 3 × 3 convolutional layers and a pre-trained VGG network; the sizes of the feature maps extracted by the first 4 convolutional layers correspond to the sizes of the feature maps of layers {1_1, 2_1, 3_1, 4_1} in the VGG, respectively; the features extracted by each convolutional layer are combined in turn with the features extracted by the VGG network and input to the next convolutional layer; the last convolutional layer maps the features from 1024 dimensions to 64 dimensions;

in use, the semantic segmentation image is first used to segment the source image into 8 independent semantic images;

the 8 independent semantic images are then input into Encoder_style separately to output the corresponding style features, and finally the style features are concatenated in order to obtain the final 512-dimensional style feature.

(2-2) construction of the pose encoder Encoder_pose and the shape encoder Encoder_shape

Encoder_pose and Encoder_shape have the same network structure: each comprises four 3 × 3 convolutional layers with ReLU activation layers, and they extract 512-dimensional pose features and shape features, respectively;

(2-3) construction of Decoder

The decoder takes the pose features as input and computes its normalization parameters from the style features and the shape features; the input first passes through 4 ResBlocks, with the number of channels kept unchanged, and then through 3 groups of upsampling layers and ResBlock layers; the last activation layer is tanh, and all other activation layers are ReLU layers.
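The text does not specify how the normalization parameters are computed from the style and shape features. As a hedged illustration, the following Python sketch assumes a common scheme in which each feature channel is normalized to zero mean and unit variance and then rescaled by a scale gamma and a shift beta; the function name modulated_norm, and the assumption that gamma and beta are predicted elsewhere from the style and shape features, are not taken from the original.

```python
import math


def modulated_norm(channel, gamma, beta, eps=1e-5):
    """Normalize one feature channel, then apply scale and shift.

    channel: flat list of activations of a single channel.
    gamma, beta: scale and shift, assumed (in this sketch only) to be
    predicted from the style and shape features by some other layer.
    """
    n = len(channel)
    mean = sum(channel) / n
    var = sum((x - mean) ** 2 for x in channel) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in channel]
```

Under this assumption, the pose features determine the spatial layout of the activations, while the style and shape features only change the per-channel statistics.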

(2-4) construction of the discriminator D_I、D_P

PatchGAN is used for the discriminators; each comprises four 3 × 3 convolutional layers and 3 residual blocks, and the Dropout rate of the discriminators is set to 0.5.

In step (4), the adversarial loss function is defined as:

where E denotes the expectation.

In step (4), the image reconstruction loss L_{reconstruction} is the L₁ loss between the virtual target image and the real target image, defined as:

L_{reconstruction} = ||G(I_s, S_s, P_t, S_t) - I_t||₁.
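The L₁ reconstruction loss above can be illustrated with a minimal Python sketch. Images are represented as nested lists of pixel values for simplicity, and the function name l1_loss is an assumption of this sketch. The loss is computed as a plain sum of absolute differences, as the norm is written; many implementations average over pixels instead, and the text does not say which is used.

```python
def l1_loss(generated, target):
    """Sum of absolute pixel differences, i.e. ||generated - target||_1.

    generated, target: 2-D nested lists of equal shape
    (a single-channel simplification assumed for this sketch).
    """
    return sum(
        abs(g - t)
        for row_g, row_t in zip(generated, target)
        for g, t in zip(row_g, row_t)
    )
```

For example, l1_loss([[1, 2], [3, 4]], [[1, 1], [1, 1]]) evaluates to 0 + 1 + 2 + 3 = 6.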

The perceptual loss L_perceptual is computed from Gram matrices of feature maps, using the i-th layer feature map of I_t extracted with a pre-trained VGG19 network, where i ∈ {relu3_2, relu4_2};
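For illustration, the Gram matrix used in the perceptual loss can be computed from a feature map as in the following Python sketch. The function name gram_matrix and the flat list representation are assumptions of this sketch, and no normalization factor is applied because the text does not state one.

```python
def gram_matrix(features):
    """Gram matrix of a feature map.

    features: list of C channels, each a flat list of H*W activations.
    Returns the C x C matrix whose (i, j) entry is the inner product of
    channel i and channel j over all spatial positions.
    """
    return [
        [sum(a * b for a, b in zip(f_i, f_j)) for f_j in features]
        for f_i in features
    ]
```

Because each entry sums over all spatial positions, the Gram matrix is unchanged when the positions are permuted identically in every channel. It therefore describes which channels co-occur rather than where they occur, which is why it is commonly used to compare texture or style.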

The semantic loss L_CX is computed from feature maps extracted with a pre-trained VGG19 network, including the l-th layer feature map of I_f.

In step (5), during training, the learning rate is initially 0.0001 and decays linearly to 0 over 1000 iterations.

Compared with the prior art, the invention has the following beneficial effects:

1. In the human body image generation method based on posture guidance, style and shape feature constraints, the style encoder based on the semantic segmentation image extracts the features of each semantic region independently and combines them into the style feature in a preset order, so that the features of different semantic regions are independent of one another; style features can therefore be recombined given a group of source images, which makes the method more flexible in practical applications.

2. In the human body image generation method based on posture guidance, style and shape feature constraints, the decoder uses the shape feature of the target semantic segmentation image for normalization and can therefore output an image consistent with the target semantic segmentation.

Drawings

FIG. 1 is a schematic flow diagram of the process of the present invention;

FIG. 2 is a schematic diagram of human body image pose according to the present invention;

FIG. 3 is a schematic diagram of human body image semantic segmentation according to the present invention;

FIG. 4 is a schematic diagram of the style encoder of the present invention.

Detailed Description

The invention will be described in further detail below with reference to the drawings and examples, which are intended to facilitate the understanding of the invention without limiting it in any way.

As shown in fig. 1, a human body image generation method based on posture guidance, style and shape feature constraints includes the following steps:

Step 1, acquiring a source human body image I_s and a target human body image I_t, and obtaining from them the pose images P_s, P_t and the human body semantic segmentation images S_s, S_t of the source and target human body images, respectively.

Specifically, as shown in fig. 2, the number N of pose image key points is 18; as shown in fig. 3, the number of classes C of human semantic segmentation images is 8.

Step 2, constructing a generator G and discriminators D_I, D_P, wherein the generator G comprises a style encoder Encoder_style, a pose encoder Encoder_pose, a shape encoder Encoder_shape and a decoder Decoder.

The method comprises the following specific steps:

Step 2.1, constructing Encoder_style, as shown in fig. 4.

Encoder_style comprises five 3 × 3 convolutional layers and a pre-trained VGG network, and the sizes of the feature maps extracted by the first 4 convolutional layers correspond to the sizes of the feature maps of layers {1_1, 2_1, 3_1, 4_1} in the VGG, respectively. The features extracted by each convolutional layer are combined in turn with the features extracted by the VGG network and input to the next convolutional layer. The last convolutional layer maps the features from 1024 dimensions to 64 dimensions.

In use, the source image is first segmented into 8 independent semantic images using the semantic segmentation image. The 8 independent semantic images are then input into Encoder_style separately to output the corresponding style features, and finally the style features are concatenated in order to obtain the final 512-dimensional style feature.
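The per-region extraction and concatenation can be sketched in Python as follows, which also shows why 8 regions of 64 dimensions give a 512-dimensional style feature. This is a minimal sketch under stated assumptions: images are single-channel nested lists, toy_encoder is a hypothetical stand-in for Encoder_style (it is not the network described above), and swap_region is a hypothetical helper illustrating recombination of style features from different sources.

```python
NUM_CLASSES = 8       # C, number of semantic classes (from the text)
DIM_PER_REGION = 64   # style dimension per region (from the text)


def split_by_region(image, seg):
    """Return NUM_CLASSES images, each keeping only the pixels of one class."""
    return [
        [[px if lab == c else 0.0 for px, lab in zip(row_i, row_s)]
         for row_i, row_s in zip(image, seg)]
        for c in range(NUM_CLASSES)
    ]


def toy_encoder(region):
    """Hypothetical stand-in for Encoder_style: a 64-dim vector per region."""
    flat = [px for row in region for px in row]
    return [sum(flat) / len(flat)] * DIM_PER_REGION


def style_feature(image, seg, encoder=toy_encoder):
    """Encode each region separately and concatenate in fixed class order."""
    feats = []
    for region in split_by_region(image, seg):
        feats.extend(encoder(region))
    return feats  # length NUM_CLASSES * DIM_PER_REGION = 512


def swap_region(feat_a, feat_b, c):
    """Replace the slice of class c in feat_a with the one from feat_b."""
    lo, hi = c * DIM_PER_REGION, (c + 1) * DIM_PER_REGION
    return feat_a[:lo] + feat_b[lo:hi] + feat_a[hi:]
```

Because each class occupies a fixed 64-dimensional slice, the style of one semantic region can be replaced without touching the others.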

Step 2.2, constructing Encoder_pose and Encoder_shape

Encoder_pose and Encoder_shape have the same network structure: each comprises four 3 × 3 convolutional layers with ReLU activation layers, and they extract 512-dimensional pose features and shape features, respectively.

Step 2.3, constructing the Decoder

With the pose features as input, normalization parameters are calculated using the style features and the shape features.

The input first passes through 4 ResBlocks, with the number of channels kept unchanged, and then through 3 groups of upsampling layers and ResBlock layers; the last activation layer is tanh, and all other activation layers are ReLU layers.

Step 2.4, constructing the discriminators D_I, D_P

Step 3, inputting the source human body image I_s obtained in step 1, fused with the source human body semantic segmentation image S_s, into the style encoder Encoder_style, the target pose image P_t into the pose encoder Encoder_pose, and the target human body semantic segmentation image S_t into the shape encoder Encoder_shape; inputting the extracted style features, pose features and shape features into the decoder Decoder to obtain a virtual target human body image I_f.

Step 4, taking (I_s, I_t) and (I_s, I_f) as inputs of the discriminator D_I, and (P_t, I_t) and (P_t, I_f) as inputs of the discriminator D_P, computing the respective adversarial losses L_adv, computing the image reconstruction loss L_{reconstruction}, the perceptual loss L_perceptual and the semantic loss L_CX from I_f and I_t, and optimizing G.

Specifically, the adversarial loss function is defined as:

where E denotes the expectation.

The image reconstruction loss is the L₁ loss between the virtual target image and the real target image, defined as:

L_{reconstruction} = ||G(I_s, S_s, P_t, S_t) - I_t||₁.

The perceptual loss L_perceptual is computed from Gram matrices of feature maps, using the i-th layer feature map of I_t extracted with a pre-trained VGG19 network, where i ∈ {relu3_2, relu4_2}.

The semantic loss L_CX is computed from feature maps extracted with a pre-trained VGG19 network.

Step 5, repeating step 3 and step 4 until a preset number of iterations is reached to obtain a trained generator G, which is used to generate virtual target images in real scenes.

Specifically, during training, the learning rate is initially 0.0001 and decays linearly to 0 over 1000 iterations.
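The learning rate schedule can be written out as a short Python sketch. It assumes that the decay starts at iteration 0 and reaches 0 exactly at iteration 1000; the function name learning_rate and its parameter names are hypothetical, and the text does not say whether an iteration is a single update step or an epoch.

```python
def learning_rate(iteration, base_lr=1e-4, total_iters=1000):
    """Linear decay from base_lr at iteration 0 to 0 at total_iters.

    Iterations outside [0, total_iters] are clamped (a choice made for
    this sketch only).
    """
    clamped = min(max(iteration, 0), total_iters)
    return base_lr * (1.0 - clamped / total_iters)
```

Under these assumptions, the learning rate is 0.0001 at iteration 0, 0.00005 at iteration 500 and 0 at iteration 1000.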

The embodiments described above are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only specific embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions and equivalents made within the scope of the principles of the present invention should be included in the scope of the present invention.

Claims

1. A human body image generation method based on posture guidance, style and shape feature constraints is characterized by comprising the following steps:

(4) taking (I_s, I_t) and (I_s, I_f) as inputs of the discriminator D_I, and (P_t, I_t) and (P_t, I_f) as inputs of the discriminator D_P, computing the respective adversarial losses L_adv, computing the image reconstruction loss L_{reconstruction}, the perceptual loss L_perceptual and the semantic loss L_CX from I_f and I_t, and optimizing G;

2. The method for generating a human body image based on pose guidance, style and shape feature constraints according to claim 1, wherein in step (1), the number of key points N of the pose image is 18, and the number of classes C of the human body semantic segmentation image is 8.

3. The human body image generation method based on the posture guidance, style and shape feature constraints as claimed in claim 1, wherein the specific steps of step (2) are:

(2-1) construction of the style encoder Encoder_style

Encoder_style comprises five 3 × 3 convolutional layers and a pre-trained VGG network, wherein the sizes of the feature maps extracted by the first 4 convolutional layers correspond to the sizes of the feature maps of layers {1_1, 2_1, 3_1, 4_1} in the VGG, respectively; the features extracted by each convolutional layer are combined in turn with the features extracted by the VGG network and input to the next convolutional layer; the last convolutional layer maps the features from 1024 dimensions to 64 dimensions;

the 8 independent semantic images are then input into Encoder_style separately to output the corresponding style features, and finally the style features are concatenated in order to obtain the final 512-dimensional style feature;

(2-2) construction of the pose encoder Encoder_pose and the shape encoder Encoder_shape

(2-3) construction of Decoder

(2-4) construction of the discriminator D_I、D_P

4. The human body image generation method based on posture guidance, style and shape feature constraints according to claim 1, wherein in step (4), the adversarial loss function is defined as:

where E denotes the expectation.

5. The human body image generation method based on posture guidance, style and shape feature constraints according to claim 1, wherein in step (4), the image reconstruction loss L_{reconstruction} is the L₁ loss between the virtual target image and the real target image, defined as:

L_{reconstruction} = ||G(I_s, S_s, P_t, S_t) - I_t||₁.

the perceptual loss L_perceptual is computed from Gram matrices of feature maps;

the semantic loss L_CX is computed from feature maps extracted with a pre-trained VGG19 network, including the l-th layer feature map of I_f.

6. The human body image generation method based on posture guidance, style and shape feature constraints according to claim 1, wherein in step (5), during training, the learning rate is initially 0.0001 and decays linearly to 0 over 1000 iterations.