CN113393550A - Fashion garment design synthesis method guided by postures and textures - Google Patents

Fashion garment design synthesis method guided by postures and textures

Info

Publication number
CN113393550A
Authority
CN
China
Prior art keywords
texture
semantic
fashion
loss
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110660701.8A
Other languages
Chinese (zh)
Other versions
CN113393550B (en)
Inventor
顾晓玲
俞俊
黄洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110660701.8A priority Critical patent/CN113393550B/en
Publication of CN113393550A publication Critical patent/CN113393550A/en
Application granted granted Critical
Publication of CN113393550B publication Critical patent/CN113393550B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/001 Texturing; Colouring; Generation of texture or colour

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a pose- and texture-guided fashion garment design synthesis method. The method comprises the following steps: 1. collect task data from an existing fashion data set, preprocess the data, and construct a data set of fashion images, pose information and semantic information; 2. construct a two-stage generative model targeting natural and accurate fashion images; the generative model comprises a semantic layout generation network and a texture generation network, which realize effective texture transfer and generate diverse fashion images; 3. train the semantic layout generation network and the texture transfer network on the collected data set by minimizing adversarial loss, cross-entropy loss, pixel-level loss, perceptual loss and style loss; 4. train the network parameters of the generative model by back-propagation until the whole model converges, and generate the corresponding fashion images. Experiments have been performed on the Fashion-Gen data set, with good results both quantitatively and qualitatively.

Description

Fashion garment design synthesis method guided by postures and textures
Technical Field
The invention provides a novel pose- and texture-guided fashion garment design synthesis method (Pose and Texture Guided Multi-View Fashion Design Synthesis). It mainly relates to converting an input human pose into a series of human semantic layouts with a semantic layout generation network, realizing texture transfer with a texture transfer network, and generating realistic fashion images with a generative adversarial network.
Background
Owing to strong demand from real-life applications and breakthroughs in related theories and technologies such as deep learning, machine learning, computer vision and multimedia, tasks that combine artificial intelligence with fashion have received considerable attention in recent years, for example garment recognition, garment retrieval, fashion recommendation and fashion trend prediction, all of which take clothing as the subject of research. In recent years, thanks to the remarkable results obtained by generative models (e.g., GANs, VAEs) in image synthesis, computer researchers have also developed a wide range of research applications in fashion image synthesis, such as human-pose-guided garment image generation, text-guided garment image generation, virtual try-on based on image generation models, and garment design applications based on image generation models.
A human-pose-guided garment image generation algorithm takes a human pose as an input condition, modifies an existing garment image containing a person model, and synthesizes a brand-new garment image. A text-guided garment image generation method takes a text description containing garment semantics as the input condition, modifies an existing garment image containing a person model, and synthesizes a brand-new garment image. A virtual try-on algorithm based on an image generation model is given a picture of a person model and a picture of a target garment, and first generates a rough try-on result in which the deformed target garment is transferred to the correct region of the person model. A garment design application based on an image generation model controls the output garment design through information such as color, texture and shape. Our method belongs to this last category: it generates diverse fashion garment designs controlled by pose and texture information, thereby reducing designers' workload and accelerating the design cycle of fashion products.
On the pose- and texture-guided fashion image generation task, a simple idea is to directly apply standard image-to-image translation models, such as pix2pix and pix2pixHD, to the problem we propose. However, these methods essentially learn a mapping from a source image to a target image, and experimental results show that this does not fulfill our task. Furthermore, our task requires solving several challenging problems.
1) The guiding pose contains too little information
The human pose is usually represented by two-dimensional joint points, which contain only the joint locations and no shape information, so it is difficult for existing methods to infer the human body structure and the garment structure from such rough pose information.
2) Texture transfer is difficult to realize
Because an ordinary convolutional network processes features locally, existing fashion image generation methods have no dedicated texture transmission mechanism to effectively transfer the texture of a fashion image. Moreover, since garment regions are usually irregular, accurately transferring texture blocks of arbitrary size to the corresponding garment regions is another challenge in synthesizing natural and realistic fashion images. Existing fashion image generation methods can generate solid-color textures, but cannot effectively transfer complex textures, and typically produce only local textures or incorrect textures.
3) Limited diversity of the generated fashion garments
Existing fashion image generation methods are usually guided by the pose information or the semantic information of the human body, so the type of garment structure is fixed, and they cannot generate fashion images covering the variety of garment types and fashion styles found in real scenarios.
Our approach addresses these problems and synthesizes diverse and accurate fashion images under the guidance of pose and texture information.
Disclosure of Invention
The invention provides a pose- and texture-guided fashion garment design synthesis method.
A pose- and texture-guided fashion garment design synthesis method comprises the following steps:
Step (1): collect task data from an existing fashion data set, preprocess the data, and construct a data set of fashion images, pose information and semantic information.
Step (2): on the existing fashion data set, construct a two-stage generative model targeting natural and accurate fashion images; the generative model comprises a semantic layout generation network and a texture generation network, which realize effective texture transfer and generate diverse fashion images.
Step (3): train the semantic layout generation network and the texture transfer network on the collected data set by minimizing adversarial loss, cross-entropy loss, pixel-level loss, perceptual loss and style loss.
Step (4): train the network parameters of the generative model in step (3) through back-propagation until the whole model converges, and generate the corresponding fashion images.
Collecting task data with an existing fashion data set in step (1) means that we evaluate our method on the Fashion-Gen data set, because it contains various complex garment textures. We select 4 major garment categories (i.e., dress, shirt, sweater, and coat) from the 48 fashion categories in the Fashion-Gen data set for evaluation.
Constructing the data set of fashion images, pose information and semantic information in step (1) means that, for each fashion image, the pose of the person is estimated with a state-of-the-art pose estimator; the computed pose information consists of 18 joint coordinate points. In addition, an advanced human parser is used to compute a human semantic map containing 20 labels, each representing a specific part of the body, such as the face, hair, arms, legs and clothing regions.
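For illustration, a minimal sketch of how such pose annotations might be rendered into a network-ready input is given below; the 18-channel Gaussian heatmap representation, the helper name and the sigma value are assumptions of this sketch, since the text only specifies the 18 joint points and the 20-label parsing map.

```python
import numpy as np

NUM_JOINTS = 18  # joint coordinate points computed by the pose estimator

def joints_to_pose_map(joints, height, width, sigma=6.0):
    """Render (x, y) joint coordinates into an 18-channel pose map.

    joints: array of shape (18, 2); a negative coordinate marks a missing joint.
    Returns: float32 array of shape (18, height, width).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.zeros((NUM_JOINTS, height, width), dtype=np.float32)
    for k, (x, y) in enumerate(joints):
        if x < 0 or y < 0:          # skip joints the estimator did not detect
            continue
        maps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    return maps
```

The 20-label semantic map can be one-hot encoded in the same spirit and stacked with this pose map where an image-like pose input is needed.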
Constructing the two-stage generative model in step (2), which comprises a semantic layout generation network and a texture generation network so as to realize effective texture transfer and generate diverse fashion images, specifically proceeds as follows:
The first stage: semantic layout generation network
In the semantic layout generation network, our goal is to map the guiding pose p to a series of human semantic layouts $\{H_1, H_2, \ldots, H_N\}$. These semantic layouts provide sufficient prior knowledge of the shape of the human body and the structure of the garment.
We take the pose information and the corresponding semantic information as input and learn to generate diverse semantic information. A simple UNet can also produce a corresponding semantic output, but it cannot satisfy the diversity requirement. We therefore build the semantic layout generation network on the BicycleGAN model, because it encourages multiple outputs to be generated from a single source image in image-to-image translation. The semantic layout generation network comprises a conditional variational auto-encoding sub-network and a conditional latent-regression sub-network.
The conditional variational auto-encoding sub-network takes the pose information and the semantic information together as input: an encoder processes the semantic information and encodes it into a latent vector of controlling features, and the latent vector and the pose information are then fed into a generator to produce the corresponding reconstructed semantic information. A KL loss constrains the latent vector to follow a Gaussian distribution, which makes sampling convenient at test time.
The conditional latent-regression sub-network takes the pose information and a randomly sampled Gaussian vector as the generator input, generates a realistic semantic layout under the constraint of a discriminator, processes the generated semantic layout with the encoder, and applies an L1 loss between the recovered vector and the original Gaussian vector to guarantee a one-to-one mapping and thus diverse semantic outputs.
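A minimal PyTorch sketch of one training step of this two-branch design is given below; the module interfaces (generator G, encoder E, discriminator D), the binary-cross-entropy GAN loss and the tensor shapes are illustrative assumptions, not the exact patented implementation.

```python
import torch
import torch.nn.functional as F

def semantic_layout_step(G, E, D, pose, layout, z_dim=8):
    """pose: (B, P, H, W) pose maps; layout: (B, C, H, W) one-hot semantic layouts."""
    # cVAE-GAN branch: encode the real layout into a latent code, then reconstruct it.
    mu, logvar = E(layout)
    z_enc = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    recon = G(pose, z_enc)                                    # layout logits
    loss_seg = F.cross_entropy(recon, layout.argmax(dim=1))   # pixel-wise cross entropy
    loss_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    d_recon = D(recon, pose)
    loss_gan_vae = F.binary_cross_entropy_with_logits(d_recon, torch.ones_like(d_recon))

    # cLR-GAN branch: generate from a random Gaussian code, then recover that code.
    z_rand = torch.randn(pose.size(0), z_dim, device=pose.device)
    fake = G(pose, z_rand)
    d_fake = D(fake, pose)
    loss_gan = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    mu_fake, _ = E(fake)
    loss_latent = F.l1_loss(mu_fake, z_rand)                  # one-to-one latent recovery

    return {"gan_vae": loss_gan_vae, "seg": loss_seg, "kl": loss_kl,
            "gan": loss_gan, "latent": loss_latent}
```

In training, these terms are weighted and minimized jointly (see the overall loss defined later), with the discriminator updated with the usual opposing objective.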
The second stage: texture generation network
In the texture generation network, the aim is to generate texture on the semantic layout produced by the semantic layout generation network: the texture synthesized in the clothing region should be consistent with the guiding texture example, and the synthesized human appearance should be perceptually convincing. The diverse semantic layouts output by the semantic layout generation network provide multi-modal input for our texture generation network.
Texture generation for the upper and lower garments is handled separately: the upper and lower garments are generated respectively, with the texture block region mask and the clothing region mask as input, and the texture generation network is realized by an encoder, a texture generation block, a decoder and a Patch-GAN discriminator. The encoder encodes the input texture block, the texture generation block transfers the local texture features to the corresponding clothing region, and the decoder decodes the reconstructed features into the corresponding fashion image. To make the generated fashion image more realistic, a Patch-GAN discriminator is added and trained together with the encoder, the texture generation block and the decoder.
The encoder of the texture generation network:
the Encoder adopts a common Encoder structure to decode the input texture block, and compared with other methods, partial convolution is used in the Encoder to replace a standard convolution layer, so that artifacts such as blurring and color difference are avoided. The partial convolution at each position is expressed as:
Figure BDA0003115117680000051
wherein X is the characteristic value of the current convolution (sliding) window, M is the binary mask of the texture block area mask corresponding to the current convolution window, W is the weight of the convolution filter, and b is the offset. sum (M) is the number of 1's in the binary mask.
After each partial convolution operation, updating the mask by marking the corresponding position of the mask after the window convolution operation as valid if at least one valid input value exists in the binary mask of the current convolution window, and expressing as follows:
Figure BDA0003115117680000052
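Below is a minimal PyTorch sketch of a partial-convolution layer consistent with the two formulas above; the layer hyper-parameters and the mask-handling details are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Convolution that ignores invalid (mask = 0) inputs and rescales by sum(1)/sum(M)."""

    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=True)
        # Fixed all-ones kernel used only to count valid inputs per sliding window.
        self.register_buffer("ones", torch.ones(1, in_ch, kernel_size, kernel_size))
        self.stride, self.padding = stride, padding

    def forward(self, x, mask):
        # x: (B, C, H, W) features, mask: (B, 1, H, W) binary validity mask.
        mask = mask.expand(-1, x.size(1), -1, -1)
        with torch.no_grad():
            valid = F.conv2d(mask, self.ones, stride=self.stride, padding=self.padding)
        out = self.conv(x * mask)                          # W^T (X ⊙ M) + b
        bias = self.conv.bias.view(1, -1, 1, 1)
        scale = self.ones.numel() / valid.clamp(min=1.0)   # sum(1) / sum(M)
        out = torch.where(valid > 0, (out - bias) * scale + bias, torch.zeros_like(out))
        new_mask = (valid > 0).float()                     # updated mask m'
        return out, new_mask
```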
the texture generation block of the texture generation network:
we have found that previous work has achieved the effect of texture generation using solely convolution to model the correlation between different image regions. However, because the convolution operation has a local acceptance domain, the long-distance dependency relationship must be processed through several convolution layers, the learning effect is not good, and the texture generation effect is difficult to realize. We introduce a texture generation block that reconstructs the texture features of the existing encoder output by using an attention map. And (3) forming a similarity matrix by calculating the cosine value similarity among the texture feature blocks, and activating by using a softmax function to obtain an attention map, so that feature information is copied from the existing texture feature blocks, and the texture of the missing part of the garment is generated. To better learn the correlation between textures, i use features one layer higher than the reconstructed features to compute cosine similarity between features. The similarity matrix is calculated as follows:
Figure BDA0003115117680000053
Figure BDA0003115117680000054
and
Figure BDA0003115117680000055
respectively extracted texture features
Figure BDA0003115117680000056
The ith texture feature block and the jth texture feature block in the block, and
Figure BDA0003115117680000057
is composed of
Figure BDA0003115117680000058
And
Figure BDA0003115117680000059
is scored. We apply the softmax function to activate and obtain the initial attention map of the ith texture feature block
Figure BDA0003115117680000061
From texture features according to similarity calculation formula
Figure BDA0003115117680000062
Initial attention map AS for extracting whole texture featureslWe then use an attention-seeking scheme to reconstruct each block within the texture feature separately by a deconvolution operation:
Figure BDA0003115117680000063
wherein the content of the first and second substances,
Figure BDA0003115117680000064
is the ith block extracted within the texture feature,
Figure BDA0003115117680000065
is the jth block extracted within the texture feature. Reconstructing all blocks through attention scores to finally obtain reconstructed features
Figure BDA0003115117680000066
Wherein L is E [1, L-1 ∈]L is the characteristic number output by the encoder, and L is the corresponding characteristic layer serial number. After that time, the user can use the device,
Figure BDA0003115117680000067
further refinement is achieved by four sets of dilation convolutions at different rates.
The decoder of the texture generation network:
the SPADE structure (space adaptive normalization method) and the Decoder structure are combined, so that the introduction of human body information is realized, the generated clothing shape is further constrained through semantic information, and the characteristics generated by the reconstructed texture after coding and the semantic information are combined and decoded into a corresponding fashion image. The calculation process of the spatial adaptive normalization is as follows:
Figure BDA0003115117680000068
wherein, for the input semantic layout HsExtracting features by convolution, and obtaining normalized scaling coefficient by two convolution layers respectively
Figure BDA0003115117680000069
And bias term
Figure BDA00031151176800000610
Wherein x, y and c are the height, width and channel number of the feature respectively, and n is the number of samples participating in training.
Figure BDA00031151176800000611
And
Figure BDA00031151176800000612
are respectively input features
Figure BDA00031151176800000613
Mean and standard deviation of. The calculation formula is as follows, and this part is the same as the calculation in BN.
Figure BDA00031151176800000614
Figure BDA00031151176800000615
H, W, C are the height, width, and number of channels, respectively, of the semantic layout input. x, y and c are respectively the height, width and channel number of the input feature, and n is the number of samples participating in training. N is the number of samples involved in training.
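A minimal PyTorch sketch of such a spatially-adaptive normalization layer is shown below; the hidden width, kernel sizes and the nearest-neighbour resizing of the layout are assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    """Normalize features with BN statistics, then modulate them with a per-pixel
    scale γ(H_s) and bias β(H_s) predicted from the semantic layout."""

    def __init__(self, feat_channels, layout_channels, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(feat_channels, affine=False)      # (h - μ_c) / σ_c
        self.shared = nn.Sequential(
            nn.Conv2d(layout_channels, hidden, 3, padding=1), nn.ReLU(inplace=True))
        self.gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)  # scaling γ(H_s)
        self.beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)   # bias β(H_s)

    def forward(self, h, layout):
        # h: (N, C, H, W) decoder features; layout: (N, L, H', W') semantic layout.
        layout = F.interpolate(layout, size=h.shape[2:], mode="nearest")
        ctx = self.shared(layout)
        return self.norm(h) * self.gamma(ctx) + self.beta(ctx)
```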
Step (3): construct a deep learning framework, and train the semantic layout generation network and the texture generation network on the collected data set by minimizing adversarial loss, cross-entropy loss, pixel-level loss, perceptual loss and style loss. The details are as follows:
Because the details of fashion images are complex, training the generator well is a major challenge. To address this, we train with multiple losses that constrain the model from different aspects, namely the adversarial loss, the cross-entropy loss, the pixel-level loss, the perceptual loss and the Gram-matrix-based style loss.
The overall loss of the semantic layout generation network is defined as follows:

$$\mathcal{L}_{total} = \lambda_{vae}\mathcal{L}_{GAN}^{VAE} + \lambda_{seg}\mathcal{L}_{seg} + \lambda_{kl}\mathcal{L}_{KL} + \lambda_{gan}\mathcal{L}_{GAN} + \lambda_{latent}\mathcal{L}_{latent}$$

The first three terms correspond to the objective of the conditional variational auto-encoder GAN, and the latter two terms correspond to the objective of the conditional latent-regression GAN. $\lambda_{vae}=2$, $\lambda_{seg}=3$, $\lambda_{kl}=0.01$, $\lambda_{gan}=2$, $\lambda_{latent}=30$ are the weights of the respective loss terms. Unlike the original BicycleGAN model, we use a softmax activation at the last layer of the generator and adopt a cross-entropy loss to predict the human semantic layout. In the semantic layout transformation, the cross-entropy loss constrains pixel-level precision and is defined as:

$$\mathcal{L}_{seg} = -\frac{1}{H \times W}\sum_{x,y}\sum_{c=1}^{C} \hat{H}_s^{(x,y,c)} \log H_s^{(x,y,c)}$$

where $H$, $W$ and $C$ are the height, width and number of channels of the semantic layout; $H_s$ is the generated semantic layout and $\hat{H}_s$ is the corresponding ground-truth semantic layout.
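As an illustration, a small sketch of the pixel-level cross-entropy term and of the weighted total is given below, assuming the generated layout is provided as per-class probabilities and the reference layout as a one-hot map of the same shape; the weighting reuses the five terms sketched earlier and the λ values stated above.

```python
import torch

def layout_cross_entropy(h_gen, h_real, eps=1e-8):
    """h_gen, h_real: (N, C, H, W); h_gen sums to 1 over the class channel C."""
    # Average the per-pixel cross entropy over spatial positions and the batch.
    return -(h_real * torch.log(h_gen + eps)).sum(dim=1).mean()

def semantic_layout_total(losses):
    """Weighted total of the five branch losses (keys as in the earlier sketch)."""
    return (2 * losses["gan_vae"] + 3 * losses["seg"] + 0.01 * losses["kl"]
            + 2 * losses["gan"] + 30 * losses["latent"])
```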
The overall loss of the texture generation network is defined as follows:

$$\mathcal{L}_{total} = \lambda_{adv}\mathcal{L}_{adv} + \lambda_{rec}\mathcal{L}_{rec} + \lambda_{per}\mathcal{L}_{per} + \lambda_{sty}\mathcal{L}_{sty}$$

where $\mathcal{L}_{adv}$ is the adversarial loss, $\mathcal{L}_{rec}$ is the $L_1$ loss between the generated fashion image $\hat{I}$ and the real image $I$, $\mathcal{L}_{per}$ is the perceptual loss between $\hat{I}$ and $I$, and $\mathcal{L}_{sty}$ is the style loss between $\hat{I}$ and $I$. $\lambda_{adv}=0.1$, $\lambda_{rec}=6$, $\lambda_{per}=0.5$, $\lambda_{sty}=50$ are the weights of the respective loss terms.
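A hedged sketch of this combined objective follows; the VGG-style feature extractor interface and the discriminator call are assumptions, and only the loss structure and weighting follow the formula above.

```python
import torch
import torch.nn.functional as F

def gram(feat):
    """Gram matrix of a (B, C, H, W) feature map, normalized by its size."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def texture_total_loss(fake, real, disc, vgg_feats,
                       lam_adv=0.1, lam_rec=6.0, lam_per=0.5, lam_sty=50.0):
    """fake/real: (B, 3, H, W) images; disc: Patch-GAN discriminator;
    vgg_feats: callable returning a list of feature maps for an image."""
    d_fake = disc(fake)
    loss_adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    loss_rec = F.l1_loss(fake, real)                                   # pixel-level L1
    f_fake, f_real = vgg_feats(fake), vgg_feats(real)
    loss_per = sum(F.l1_loss(a, b) for a, b in zip(f_fake, f_real))    # perceptual
    loss_sty = sum(F.l1_loss(gram(a), gram(b))
                   for a, b in zip(f_fake, f_real))                    # Gram-matrix style
    return (lam_adv * loss_adv + lam_rec * loss_rec
            + lam_per * loss_per + lam_sty * loss_sty)
```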
The invention has the beneficial effects that:
the invention provides a method for designing and synthesizing fashion clothes guided by postures and textures aiming at the practical problems of poor and single generation effect of the existing fashion images, solves the problems of too little information contained in the guided postures, locality and inaccuracy of texture transfer and single generation of the fashion images in the existing method, and realizes the generation of the diversity and the accuracy of the fashion images to a great extent. In addition, the task of combining artificial intelligence and fashion is taken as a current research hotspot, the reasonable use also enables the invention to have more advanced and innovative scientific research, and corresponding real and various fashion images are automatically designed and generated according to input control conditions (posture information and texture information) of a plurality of modes, so that the design inspiration of clothing designers can be further stimulated, and the development and application popularization of creative design related research in the fashion field can be promoted.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention.
FIG. 2 is a semantic layout generating network model in the method of the present invention.
FIG. 3 is a model of a texture transfer network in the method of the present invention.
FIG. 4 is a schematic of the data set of the present invention.
Detailed Description
The invention is further illustrated by the following figures and examples.
The invention provides a pose- and texture-guided fashion garment design synthesis method.
As shown in FIG. 1, a pose- and texture-guided fashion garment design synthesis method comprises the following steps:
Step (1): collect task data from an existing fashion data set, preprocess the data, and construct a data set of fashion images, pose information and semantic information.
Step (2): on the existing fashion data set, construct a two-stage generative model targeting natural and accurate fashion images; the generative model comprises a semantic layout generation network and a texture generation network, which realize effective texture transfer and generate diverse fashion images.
Step (3): train the semantic layout generation network and the texture transfer network on the collected data set by minimizing adversarial loss, cross-entropy loss, pixel-level loss, perceptual loss and style loss.
Step (4): train the network parameters of the generative model in step (3) through back-propagation until the whole model converges, and generate the corresponding fashion images.
Collecting task data with an existing fashion data set in step (1) means that we evaluate our method on the Fashion-Gen data set, because it contains various complex garment textures. We select 4 major garment categories (i.e., dress, shirt, sweater, and coat) from the 48 fashion categories in the Fashion-Gen data set for evaluation.
Constructing the data set of fashion images, pose information and semantic information in step (1) means that, for each fashion image, the pose of the person is estimated with a state-of-the-art pose estimator; the computed pose information consists of 18 joint coordinate points. In addition, an advanced human parser is used to compute a human semantic map containing 20 labels, each representing a specific part of the body, such as the face, hair, arms, legs and clothing regions.
Constructing the two-stage generative model in step (2), which comprises a semantic layout generation network and a texture generation network so as to realize effective texture transfer and generate diverse fashion images, specifically proceeds as follows:
As shown in FIG. 2, the first stage: semantic layout generation network
In the semantic layout generation network, our goal is to map the guiding pose p to a series of human semantic layouts $\{H_1, H_2, \ldots, H_N\}$. These semantic layouts provide sufficient prior knowledge of the shape of the human body and the structure of the garment.
We take the pose information and the corresponding semantic information as input and learn to generate diverse semantic information. A simple UNet can also produce a corresponding semantic output, but it cannot satisfy the diversity requirement. We therefore build the semantic layout generation network on the BicycleGAN model, because it encourages multiple outputs to be generated from a single source image in image-to-image translation. The semantic layout generation network comprises a conditional variational auto-encoding sub-network and a conditional latent-regression sub-network.
The conditional variational auto-encoding sub-network takes the pose information and the semantic information together as input: an encoder processes the semantic information and encodes it into a latent vector of controlling features, and the latent vector and the pose information are then fed into a generator to produce the corresponding reconstructed semantic information. A KL loss constrains the latent vector to follow a Gaussian distribution, which makes sampling convenient at test time.
The conditional latent-regression sub-network takes the pose information and a randomly sampled Gaussian vector as the generator input, generates a realistic semantic layout under the constraint of a discriminator, processes the generated semantic layout with the encoder, and applies an L1 loss between the recovered vector and the original Gaussian vector to guarantee a one-to-one mapping and thus diverse semantic outputs.
As shown in FIG. 3, the second stage: texture generation network
In the texture generation network, the aim is to generate texture on the semantic layout produced by the semantic layout generation network: the texture synthesized in the clothing region should be consistent with the guiding texture example, and the synthesized human appearance should be perceptually convincing. The diverse semantic layouts output by the semantic layout generation network provide multi-modal input for our texture generation network.
Texture generation for the upper and lower garments is handled separately: the upper and lower garments are generated respectively, with the texture block region mask and the clothing region mask as input, and the texture generation network is realized by an encoder, a texture generation block, a decoder and a Patch-GAN discriminator. The encoder encodes the input texture block, the texture generation block transfers the local texture features to the corresponding clothing region, and the decoder decodes the reconstructed features into the corresponding fashion image. To make the generated fashion image more realistic, a Patch-GAN discriminator is added and trained together with the encoder, the texture generation block and the decoder.
The encoder of the texture generation network:
the Encoder adopts a common Encoder structure to decode the input texture block, and compared with other methods, partial convolution is used in the Encoder to replace a standard convolution layer, so that artifacts such as blurring and color difference are avoided. The partial convolution at each position is expressed as:
Figure BDA0003115117680000101
wherein X is the characteristic value of the current convolution (sliding) window, M is the binary mask of the texture block area mask corresponding to the current convolution window, W is the weight of the convolution filter, and b is the offset. sum (M) is the number of 1's in the binary mask.
After each partial convolution operation, updating the mask by marking the corresponding position of the mask after the window convolution operation as valid if at least one valid input value exists in the binary mask of the current convolution window, and expressing as follows:
Figure BDA0003115117680000111
the texture generation block of the texture generation network:
we have found that previous work has achieved the effect of texture generation using solely convolution to model the correlation between different image regions. However, because the convolution operation has a local acceptance domain, the long-distance dependency relationship must be processed through several convolution layers, the learning effect is not good, and the texture generation effect is difficult to realize. We introduce a texture generation block that reconstructs the texture features of the existing encoder output by using an attention map. And (3) forming a similarity matrix by calculating the cosine value similarity among the texture feature blocks, and activating by using a softmax function to obtain an attention map, so that feature information is copied from the existing texture feature blocks, and the texture of the missing part of the garment is generated. To better learn the correlation between textures, i use features one layer higher than the reconstructed features to compute cosine similarity between features. The similarity matrix is calculated as follows:
Figure BDA0003115117680000112
Figure BDA0003115117680000113
and
Figure BDA0003115117680000114
respectively extracted texture features
Figure BDA0003115117680000115
The ith texture feature block and the jth texture feature block in the block, and
Figure BDA0003115117680000116
is composed of
Figure BDA0003115117680000117
And
Figure BDA0003115117680000118
is scored. We apply the softmax function to activate and obtain the initial attention map of the ith texture feature block
Figure BDA0003115117680000119
From texture features according to similarity calculation formula
Figure BDA00031151176800001110
Initial attention map AS for extracting whole texture featureslWe then use an attention-seeking scheme to reconstruct each block within the texture feature separately by a deconvolution operation:
Figure BDA00031151176800001111
wherein the content of the first and second substances,
Figure BDA00031151176800001112
is the ith block extracted within the texture feature,
Figure BDA00031151176800001113
is the jth block extracted within the texture feature. Reconstructing all blocks through attention scores to finally obtain reconstructed features
Figure BDA00031151176800001114
Wherein L is E [1, L-1 ∈]L is the characteristic number output by the encoder, and L is the corresponding characteristic layer serial number. After that time, the user can use the device,
Figure BDA00031151176800001115
further refinement is achieved by four sets of dilation convolutions at different rates.
The decoder of the texture generation network:
the SPADE structure (space adaptive normalization method) and the Decoder structure are combined, so that the introduction of human body information is realized, the generated clothing shape is further constrained through semantic information, and the characteristics generated by the reconstructed texture after coding and the semantic information are combined and decoded into a corresponding fashion image. The calculation process of the spatial adaptive normalization is as follows:
Figure BDA0003115117680000121
wherein, for the input semantic layout HsBy convolutional extractionTaking characteristics, and obtaining normalized scaling coefficient through two convolution layers respectively
Figure BDA0003115117680000122
And bias term
Figure BDA0003115117680000123
Wherein x, y and c are the height, width and channel number of the feature respectively, and n is the number of samples participating in training.
Figure BDA0003115117680000124
And
Figure BDA0003115117680000125
are respectively input features
Figure BDA0003115117680000126
Mean and standard deviation of. The calculation formula is as follows, and this part is the same as the calculation in BN.
Figure BDA0003115117680000127
Figure BDA0003115117680000128
H, W, C are the height, width, and number of channels, respectively, of the semantic layout input. x, y and c are respectively the height, width and channel number of the input feature, and n is the number of samples participating in training. N is the number of samples involved in training.
Step (3): construct a deep learning framework and, as shown in FIG. 3, train the semantic layout generation network and the texture generation network on the collected data set by minimizing adversarial loss, cross-entropy loss, pixel-level loss, perceptual loss and style loss. The details are as follows:
Because the details of fashion images are complex, training the generator well is a major challenge. To address this, we train with multiple losses that constrain the model from different aspects, namely the adversarial loss, the cross-entropy loss, the pixel-level loss, the perceptual loss and the Gram-matrix-based style loss.
The overall loss of the semantic layout generation network is defined as follows:

$$\mathcal{L}_{total} = \lambda_{vae}\mathcal{L}_{GAN}^{VAE} + \lambda_{seg}\mathcal{L}_{seg} + \lambda_{kl}\mathcal{L}_{KL} + \lambda_{gan}\mathcal{L}_{GAN} + \lambda_{latent}\mathcal{L}_{latent}$$

The first three terms correspond to the objective of the conditional variational auto-encoder GAN, and the latter two terms correspond to the objective of the conditional latent-regression GAN. $\lambda_{vae}=2$, $\lambda_{seg}=3$, $\lambda_{kl}=0.01$, $\lambda_{gan}=2$, $\lambda_{latent}=30$ are the weights of the respective loss terms. Unlike the original BicycleGAN model, we use a softmax activation at the last layer of the generator and adopt a cross-entropy loss to predict the human semantic layout. In the semantic layout transformation, the cross-entropy loss constrains pixel-level precision and is defined as:

$$\mathcal{L}_{seg} = -\frac{1}{H \times W}\sum_{x,y}\sum_{c=1}^{C} \hat{H}_s^{(x,y,c)} \log H_s^{(x,y,c)}$$

where $H$, $W$ and $C$ are the height, width and number of channels of the semantic layout; $H_s$ is the generated semantic layout and $\hat{H}_s$ is the corresponding ground-truth semantic layout.
The overall loss of the texture generation network is defined as follows:

$$\mathcal{L}_{total} = \lambda_{adv}\mathcal{L}_{adv} + \lambda_{rec}\mathcal{L}_{rec} + \lambda_{per}\mathcal{L}_{per} + \lambda_{sty}\mathcal{L}_{sty}$$

where $\mathcal{L}_{adv}$ is the adversarial loss, $\mathcal{L}_{rec}$ is the $L_1$ loss between the generated fashion image $\hat{I}$ and the real image $I$, $\mathcal{L}_{per}$ is the perceptual loss between $\hat{I}$ and $I$, and $\mathcal{L}_{sty}$ is the style loss between $\hat{I}$ and $I$. $\lambda_{adv}=0.1$, $\lambda_{rec}=6$, $\lambda_{per}=0.5$, $\lambda_{sty}=50$ are the weights of the respective loss terms.

Claims (7)

1. A pose- and texture-guided fashion garment design synthesis method, characterized by comprising the following steps:
step (1): collecting task data from an existing fashion data set, preprocessing the data, and constructing a data set of fashion images, pose information and semantic information;
step (2): on the existing fashion data set, constructing a two-stage generative model targeting natural and accurate fashion images; the generative model comprises a semantic layout generation network and a texture generation network, which realize effective texture transfer and generate diverse fashion images;
step (3): training the semantic layout generation network and the texture transfer network on the collected data set by minimizing adversarial loss, cross-entropy loss, pixel-level loss, perceptual loss and style loss;
step (4): training the network parameters of the generative model in step (3) through back-propagation until the whole model converges, and generating the corresponding fashion images.
2. The pose- and texture-guided fashion garment design synthesis method according to claim 1, characterized in that constructing the data set of fashion images, pose information and semantic information in step (1) means that, for each fashion image, the pose of the person is estimated with a state-of-the-art pose estimator, the computed pose information comprising 18 joint coordinate points; in addition, an advanced human parser is used to compute a human semantic map containing 20 labels, each representing a specific part of the body, such as the face, hair, arms, legs and clothing regions.
3. The pose- and texture-guided fashion garment design synthesis method according to claim 2, characterized in that the two-stage generative model constructed in step (2) comprises a semantic layout generation network and a texture generation network, realizing effective texture transfer and generating diverse fashion images; the semantic layout generation network is specifically implemented as follows:
in the semantic layout generation network, the goal is to map the guiding pose p to a series of human semantic layouts $\{H_1, H_2, \ldots, H_N\}$; these semantic layouts provide sufficient prior knowledge of the shape of the human body and the structure of the garment;
the pose information and the corresponding semantic information are used as input, and diverse semantic information is learned and generated; the semantic layout generation network is built on the BicycleGAN model and comprises a conditional variational auto-encoding sub-network and a conditional latent-regression sub-network;
the conditional variational auto-encoding sub-network takes the pose information and the semantic information together as input: an encoder processes the semantic information and encodes it into a latent vector of controlling features, and the latent vector and the pose information are then fed into a generator to produce the corresponding reconstructed semantic information; a KL loss constrains the latent vector to follow a Gaussian distribution, which makes sampling convenient at test time;
the conditional latent-regression sub-network takes the pose information and a randomly sampled Gaussian vector as the generator input, generates a realistic semantic layout under the constraint of a discriminator, processes the generated semantic layout with the encoder, and applies an L1 loss between the recovered vector and the original Gaussian vector to guarantee a one-to-one mapping and thus diverse semantic outputs.
4. The pose- and texture-guided fashion garment design synthesis method according to claim 3, characterized by the second stage: the texture generation network is specifically implemented as follows:
in the texture generation network, the aim is to generate texture on the semantic layout produced by the semantic layout generation network, wherein the texture synthesized in the clothing region is consistent with the guiding texture example and the synthesized human appearance is perceptually convincing; the diverse semantic layouts output by the semantic layout generation network provide multi-modal input for the texture generation network;
texture generation for the upper and lower garments is handled separately: the upper and lower garments are generated respectively, with the texture block region mask and the clothing region mask as input, and the texture generation network is realized by an encoder, a texture generation block, a decoder and a Patch-GAN discriminator; the encoder encodes the input texture block, the texture generation block transfers the local texture features to the corresponding clothing region, and the decoder decodes the reconstructed features into the corresponding fashion image; meanwhile, a Patch-GAN discriminator is added and trained together with the encoder, the texture generation block and the decoder;
the encoder of the texture generation network:
the encoder adopts a common encoder structure to encode the input texture block; compared with other methods, partial convolution is used in the encoder instead of standard convolution layers, which avoids blurring and color shift; the partial convolution at each position is expressed as:

$$x' = \begin{cases} W^{T}(X \odot M)\,\dfrac{\operatorname{sum}(\mathbf{1})}{\operatorname{sum}(M)} + b, & \text{if } \operatorname{sum}(M) > 0 \\ 0, & \text{otherwise} \end{cases}$$

wherein X is the feature values in the current convolution (sliding) window, M is the binary mask of the texture block region mask corresponding to the current window, W is the weight of the convolution filter, and b is the bias; sum(M) is the number of 1's in the binary mask;
after each partial convolution operation the mask is updated: if the binary mask of the current window contains at least one valid input value, the corresponding output position is marked as valid, i.e.:

$$m' = \begin{cases} 1, & \text{if } \operatorname{sum}(M) > 0 \\ 0, & \text{otherwise} \end{cases}$$
5. The pose- and texture-guided fashion garment design synthesis method according to claim 3, characterized in that, in the texture generation block, the texture features output by the encoder are reconstructed by using an attention map; a similarity matrix is formed by computing the cosine similarity between texture feature blocks, and a softmax activation yields the attention map, so that feature information is copied from the existing texture feature blocks to generate the texture of the missing clothing regions; in order to better learn the correlation between textures, the features one layer higher than the reconstructed features are used to compute the cosine similarity between features; the similarity matrix is calculated as follows:

$$s_{i,j}^{l} = \left\langle \frac{f_i^{l}}{\lVert f_i^{l}\rVert},\; \frac{f_j^{l}}{\lVert f_j^{l}\rVert} \right\rangle$$

wherein $f_i^{l}$ and $f_j^{l}$ are the $i$-th and $j$-th texture feature blocks extracted from the texture features $F^{l}$, and $s_{i,j}^{l}$ is the similarity score between $f_i^{l}$ and $f_j^{l}$; the softmax function is applied to obtain the initial attention map $A_i^{s_l}$ of the $i$-th texture feature block; after the initial attention map $A^{s_l}$ of the whole texture feature is extracted from the texture features $F^{l}$ according to the similarity formula, each block within the texture feature is reconstructed separately by a deconvolution operation using the attention map:

$$\hat{f}_i^{\,l-1} = \sum_{j} A_{i,j}^{s_l}\, f_j^{\,l-1}$$

wherein $f_i^{\,l-1}$ is the $i$-th block extracted within the texture feature and $f_j^{\,l-1}$ is the $j$-th block extracted within the texture feature; all blocks are reconstructed through the attention scores, finally obtaining the reconstructed feature $\hat{F}^{\,l-1}$, wherein $l \in [1, L-1]$, $L$ is the number of feature maps output by the encoder, and $l$ is the index of the corresponding feature layer; afterwards, $\hat{F}^{\,l-1}$ is further refined by four sets of dilated convolutions with different rates.
6. The pose- and texture-guided fashion garment design synthesis method according to claim 4 or 5, characterized by the decoder of the texture generation network:
the SPADE structure is combined with the decoder structure, which introduces the human body information, further constrains the generated clothing shape through the semantic information, and decodes the encoded reconstructed texture features, combined with the semantic information, into the corresponding fashion image; the spatially-adaptive normalization is computed as:

$$\gamma_{x,y,c}(H_s)\,\frac{h_{x,y,c}^{\,n} - \mu_c}{\sigma_c} + \beta_{x,y,c}(H_s)$$

wherein, for the input semantic layout $H_s$, features are extracted by convolution, and two further convolution layers produce the normalization scaling coefficient $\gamma_{x,y,c}(H_s)$ and the bias term $\beta_{x,y,c}(H_s)$; $x$, $y$ and $c$ index the height, width and channel of the feature, and $n$ indexes the samples participating in training; $\mu_c$ and $\sigma_c$ are the mean and standard deviation of the input feature $h$, computed as in BN:

$$\mu_c = \frac{1}{N H W}\sum_{n,x,y} h_{x,y,c}^{\,n}, \qquad \sigma_c = \sqrt{\frac{1}{N H W}\sum_{n,x,y}\left(h_{x,y,c}^{\,n}\right)^2 - \mu_c^2}$$

wherein $H$, $W$ and $C$ are the height, width and number of channels of the semantic layout input; $x$, $y$ and $c$ index the height, width and channel of the input feature; and $N$ is the number of samples participating in training.
7. The pose- and texture-guided fashion garment design synthesis method according to claim 6, characterized in that step (3) is implemented as follows:
training is performed with multiple losses that constrain the model from different aspects, namely the adversarial loss, the cross-entropy loss, the pixel-level loss, the perceptual loss and the Gram-matrix-based style loss;
the overall loss of the semantic layout generation network is defined as follows:

$$\mathcal{L}_{total} = \lambda_{vae}\mathcal{L}_{GAN}^{VAE} + \lambda_{seg}\mathcal{L}_{seg} + \lambda_{kl}\mathcal{L}_{KL} + \lambda_{gan}\mathcal{L}_{GAN} + \lambda_{latent}\mathcal{L}_{latent}$$

wherein the first three terms correspond to the objective of the conditional variational auto-encoder GAN, and the latter two terms correspond to the objective of the conditional latent-regression GAN; $\lambda_{vae}=2$, $\lambda_{seg}=3$, $\lambda_{kl}=0.01$, $\lambda_{gan}=2$, $\lambda_{latent}=30$ are the weights of the respective loss terms; the last layer of the generator uses a softmax activation, and a cross-entropy loss is adopted to predict the human semantic layout; in the semantic layout transformation, the cross-entropy loss constrains pixel-level precision and is defined as:

$$\mathcal{L}_{seg} = -\frac{1}{H \times W}\sum_{x,y}\sum_{c=1}^{C} \hat{H}_s^{(x,y,c)} \log H_s^{(x,y,c)}$$

wherein $H$, $W$ and $C$ are the height, width and number of channels of the semantic layout; $H_s$ is the generated semantic layout and $\hat{H}_s$ is the corresponding ground-truth semantic layout;
the overall loss of the texture generation network is defined as follows:

$$\mathcal{L}_{total} = \lambda_{adv}\mathcal{L}_{adv} + \lambda_{rec}\mathcal{L}_{rec} + \lambda_{per}\mathcal{L}_{per} + \lambda_{sty}\mathcal{L}_{sty}$$

wherein $\mathcal{L}_{adv}$ is the adversarial loss, $\mathcal{L}_{rec}$ is the $L_1$ loss between the generated fashion image $\hat{I}$ and the real image $I$, $\mathcal{L}_{per}$ is the perceptual loss between $\hat{I}$ and $I$, and $\mathcal{L}_{sty}$ is the style loss between $\hat{I}$ and $I$; $\lambda_{adv}=0.1$, $\lambda_{rec}=6$, $\lambda_{per}=0.5$, $\lambda_{sty}=50$ are the weights of the respective loss terms.
CN202110660701.8A 2021-06-15 2021-06-15 Fashion garment design synthesis method guided by postures and textures Active CN113393550B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110660701.8A CN113393550B (en) 2021-06-15 2021-06-15 Fashion garment design synthesis method guided by postures and textures

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110660701.8A CN113393550B (en) 2021-06-15 2021-06-15 Fashion garment design synthesis method guided by postures and textures

Publications (2)

Publication Number Publication Date
CN113393550A true CN113393550A (en) 2021-09-14
CN113393550B CN113393550B (en) 2022-09-20

Family

ID=77621042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110660701.8A Active CN113393550B (en) 2021-06-15 2021-06-15 Fashion garment design synthesis method guided by postures and textures

Country Status (1)

Country Link
CN (1) CN113393550B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325952A (en) * 2018-09-17 2019-02-12 上海宝尊电子商务有限公司 Fashion clothing image partition method based on deep learning
US20200151807A1 (en) * 2018-11-14 2020-05-14 Beijing Jingdong Shangke Information Technology Co., Ltd. System and method for automatically generating three-dimensional virtual garment model using product description
CN109559287A (en) * 2018-11-20 2019-04-02 北京工业大学 A kind of semantic image restorative procedure generating confrontation network based on DenseNet
US20210065418A1 (en) * 2019-08-27 2021-03-04 Shenzhen Malong Technologies Co., Ltd. Appearance-flow-based image generation
CN111476241A (en) * 2020-03-04 2020-07-31 上海交通大学 Character clothing conversion method and system
CN111445426A (en) * 2020-05-09 2020-07-24 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Target garment image processing method based on generation countermeasure network model

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
XIAOLING GU; FEI GAO; MIN TAN; PAI PENG: "Fashion analysis and understanding with artificial intelligence", INFORMATION PROCESSING & MANAGEMENT *
XIAOLING GU; JUN YU; YONGKANG WONG; MOHAN S. KANKANHALLI: "Toward Multi-Modal Conditioned Fashion Image Translation", IEEE TRANSACTIONS ON MULTIMEDIA *
徐俊哲; 陈佳; 何儒汉; 胡新荣: "Research on pose-based fashion image synthesis", Modern Computer (《现代计算机》) *
李锵 et al.: "Clothing key-point localization algorithm based on cascaded convolutional neural networks", Journal of Tianjin University (Science and Technology) (《天津大学学报(自然科学与工程技术版)》) *
黄菲 et al.: "Heterogeneous face image synthesis based on generative adversarial networks: progress and challenges", Journal of Nanjing University of Information Science & Technology (Natural Science Edition) (《南京信息工程大学学报(自然科学版)》) *
黄韬 et al.: "Text-guided person image editing method based on generative adversarial networks", Journal of Guangdong Polytechnic Normal University (《广东技术师范大学学报》) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113838166A (en) * 2021-09-22 2021-12-24 网易(杭州)网络有限公司 Image feature migration method and device, storage medium and terminal equipment
CN113838166B (en) * 2021-09-22 2023-08-29 网易(杭州)网络有限公司 Image feature migration method and device, storage medium and terminal equipment
CN114723843A (en) * 2022-06-01 2022-07-08 广东时谛智能科技有限公司 Method, device, equipment and storage medium for generating virtual clothing through multi-mode fusion
CN114723843B (en) * 2022-06-01 2022-12-06 广东时谛智能科技有限公司 Method, device, equipment and storage medium for generating virtual clothing through multi-mode fusion
CN115147526A (en) * 2022-06-30 2022-10-04 北京百度网讯科技有限公司 Method and device for training clothing generation model and method and device for generating clothing image
CN115147526B (en) * 2022-06-30 2023-09-26 北京百度网讯科技有限公司 Training of clothing generation model and method and device for generating clothing image
CN115659852A (en) * 2022-12-26 2023-01-31 浙江大学 Layout generation method and device based on discrete potential representation
CN116229229A (en) * 2023-05-11 2023-06-06 青岛科技大学 Multi-domain image fusion method and system based on deep learning

Also Published As

Publication number Publication date
CN113393550B (en) 2022-09-20

Similar Documents

Publication Publication Date Title
CN113393550B (en) Fashion garment design synthesis method guided by postures and textures
CN110211196B (en) Virtual fitting method and device based on posture guidance
Zhang et al. Pise: Person image synthesis and editing with decoupled gan
CN111275518A (en) Video virtual fitting method and device based on mixed optical flow
CN108288072A (en) A kind of facial expression synthetic method based on generation confrontation network
Kolotouros et al. Dreamhuman: Animatable 3d avatars from text
Tang et al. Multi-channel attention selection gans for guided image-to-image translation
US11282256B2 (en) Crowdshaping realistic 3D avatars with words
CN113496507A (en) Human body three-dimensional model reconstruction method
Li et al. Learning symmetry consistent deep cnns for face completion
Sheng et al. Deep neural representation guided face sketch synthesis
CN113255457A (en) Animation character facial expression generation method and system based on facial expression recognition
CN111476241B (en) Character clothing conversion method and system
CN113538608B (en) Controllable figure image generation method based on generation countermeasure network
WO2023088277A1 (en) Virtual dressing method and apparatus, and device, storage medium and program product
CN111462274A (en) Human body image synthesis method and system based on SMP L model
Zeng et al. Avatarbooth: High-quality and customizable 3d human avatar generation
Du et al. VTON-SCFA: A virtual try-on network based on the semantic constraints and flow alignment
Kwolek et al. Recognition of JSL fingerspelling using deep convolutional neural networks
CN113076918A (en) Video-based facial expression cloning method
Liu et al. Multimodal face aging framework via learning disentangled representation
CN116777738A (en) Authenticity virtual fitting method based on clothing region alignment and style retention modulation
CN116168186A (en) Virtual fitting chart generation method with controllable garment length
Kuo et al. Generating ambiguous figure-ground images
Kim et al. Development of an IGA-based fashion design aid system with domain specific knowledge

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant