CN113393550A - Fashion garment design synthesis method guided by postures and textures - Google Patents

Fashion garment design synthesis method guided by postures and textures

Info

Publication number
CN113393550A
Authority
CN
China
Prior art keywords
texture
semantic
fashion
loss
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110660701.8A
Other languages
Chinese (zh)
Other versions
CN113393550B (en)
Inventor
顾晓玲
俞俊
黄洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110660701.8A priority Critical patent/CN113393550B/en
Publication of CN113393550A publication Critical patent/CN113393550A/en
Application granted granted Critical
Publication of CN113393550B publication Critical patent/CN113393550B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/001 Texturing; Colouring; Generation of texture or colour

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a pose- and texture-guided fashion garment design synthesis method. The method comprises the following steps: 1. collect task data from an existing fashion data set, preprocess the data, and construct a data set of fashion images, pose information and semantic information; 2. construct a two-stage generative model targeting natural and accurate fashion images; the generative model comprises a semantic layout generation network and a texture generation network, which realize effective texture transfer and generate diverse fashion images; 3. train the semantic layout generation network and the texture transfer network on the collected data set by minimizing adversarial loss, cross-entropy loss, pixel-level loss, perceptual loss and style loss; 4. train the network parameters of the generative model by back-propagation until the whole model converges, and generate the corresponding fashion images. Experiments have been performed on the Fashion-Gen data set, with good results both quantitatively and qualitatively.

Description

Fashion garment design synthesis method guided by postures and textures
Technical Field
The invention provides a novel pose- and texture-guided fashion garment design synthesis method (Pose and Texture Guided Multi-View Fashion Design Synthesis). It mainly relates to converting an input human pose into a series of human semantic layouts with a semantic layout generation network, realizing texture transfer with a texture transfer network, and generating realistic fashion images with a generative adversarial network.
Background
Owing to strong demand from real-life applications and breakthroughs in related theories and technologies such as deep learning, machine learning, computer vision and multimedia, tasks that combine artificial intelligence with fashion have received considerable attention in recent years, for example garment recognition, garment retrieval, fashion recommendation and fashion trend prediction, all of which take clothing as the subject of research. In recent years, thanks to the remarkable results obtained by generative models (e.g., GANs, VAEs) in image synthesis, computer researchers have also developed a wide range of research applications in fashion image synthesis, such as human-pose-guided garment image generation, text-guided garment image generation, virtual try-on based on image generation models, and garment design applications based on image generation models.
A human-pose-guided garment image generation algorithm takes a human pose as an input condition, modifies an existing garment image containing a person model, and synthesizes a brand-new garment image. A text-guided garment image generation method takes a text description containing garment semantics as the input condition, modifies an existing garment image containing a person model, and synthesizes a brand-new garment image. A virtual try-on algorithm based on an image generation model is given a picture of a person model and a picture of a target garment, and first generates a rough try-on result in which the deformed target garment is transferred to the correct region of the person model. A garment design application based on an image generation model controls the output garment design through information such as color, texture and shape. Our method belongs to this last category: it generates diverse fashion garment designs controlled by pose and texture information, thereby reducing designers' workload and accelerating the design cycle of fashion products.
On the pose- and texture-guided fashion image generation task, a simple idea is to directly apply standard image-to-image translation models, such as pix2pix and pix2pixHD, to the problem we propose. However, these methods essentially learn a mapping from a source image to a target image, and experimental results show that this does not fulfill our task. Furthermore, our task requires solving several challenging problems.
1) The guiding pose contains too little information
The human pose is usually represented by two-dimensional joint points, which contain only the joint locations and no shape information, so it is difficult for existing methods to infer the human body structure and the garment structure from such rough pose information.
2) Texture transfer is difficult to realize
Because an ordinary convolutional network processes features locally, existing fashion image generation methods have no dedicated texture transmission mechanism to effectively transfer the texture of a fashion image. Moreover, since garment regions are usually irregular, accurately transferring texture blocks of arbitrary size to the corresponding garment regions is another challenge in synthesizing natural and realistic fashion images. Existing fashion image generation methods can generate solid-color textures, but cannot effectively transfer complex textures, and typically produce only local textures or incorrect textures.
3) Limited diversity of the generated fashion garments
Existing fashion image generation methods are usually guided by the pose information or the semantic information of the human body, so the type of garment structure is fixed, and they cannot generate fashion images covering the variety of garment types and fashion styles found in real scenarios.
Our approach addresses these problems and synthesizes diverse and accurate fashion images under the guidance of pose and texture information.
Disclosure of Invention
The invention provides a pose- and texture-guided fashion garment design synthesis method.
A pose- and texture-guided fashion garment design synthesis method comprises the following steps:
Step (1): collect task data from an existing fashion data set, preprocess the data, and construct a data set of fashion images, pose information and semantic information.
Step (2): on the existing fashion data set, construct a two-stage generative model targeting natural and accurate fashion images; the generative model comprises a semantic layout generation network and a texture generation network, which realize effective texture transfer and generate diverse fashion images.
Step (3): train the semantic layout generation network and the texture transfer network on the collected data set by minimizing adversarial loss, cross-entropy loss, pixel-level loss, perceptual loss and style loss.
Step (4): train the network parameters of the generative model in step (3) through back-propagation until the whole model converges, and generate the corresponding fashion images.
Collecting task data with an existing fashion data set in step (1) means that we evaluate our method on the Fashion-Gen data set, because it contains various complex garment textures. We select 4 major garment categories (i.e., dress, shirt, sweater, and coat) from the 48 fashion categories in the Fashion-Gen data set for evaluation.
Constructing the data set of fashion images, pose information and semantic information in step (1) means that, for each fashion image, the pose of the person is estimated with a state-of-the-art pose estimator; the computed pose information consists of 18 joint coordinate points. In addition, an advanced human parser is used to compute a human semantic map containing 20 labels, each representing a specific part of the body, such as the face, hair, arms, legs and clothing regions.
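For illustration, a minimal sketch of how such pose annotations might be rendered into a network-ready input is given below; the 18-channel Gaussian heatmap representation, the helper name and the sigma value are assumptions of this sketch, since the text only specifies the 18 joint points and the 20-label parsing map.

```python
import numpy as np

NUM_JOINTS = 18  # joint coordinate points computed by the pose estimator

def joints_to_pose_map(joints, height, width, sigma=6.0):
    """Render (x, y) joint coordinates into an 18-channel pose map.

    joints: array of shape (18, 2); a negative coordinate marks a missing joint.
    Returns: float32 array of shape (18, height, width).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.zeros((NUM_JOINTS, height, width), dtype=np.float32)
    for k, (x, y) in enumerate(joints):
        if x < 0 or y < 0:          # skip joints the estimator did not detect
            continue
        maps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    return maps
```

The 20-label semantic map can be one-hot encoded in the same spirit and stacked with this pose map where an image-like pose input is needed.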
Constructing the two-stage generative model in step (2), which comprises a semantic layout generation network and a texture generation network so as to realize effective texture transfer and generate diverse fashion images, specifically proceeds as follows:
The first stage: semantic layout generation network
In the semantic layout generation network, our goal is to map the guiding pose p to a series of human semantic layouts $\{H_1, H_2, \ldots, H_N\}$. These semantic layouts provide sufficient prior knowledge of the shape of the human body and the structure of the garment.
We take the pose information and the corresponding semantic information as input and learn to generate diverse semantic information. A simple UNet can also produce a corresponding semantic output, but it cannot satisfy the diversity requirement. We therefore build the semantic layout generation network on the BicycleGAN model, because it encourages multiple outputs to be generated from a single source image in image-to-image translation. The semantic layout generation network comprises a conditional variational auto-encoding sub-network and a conditional latent-regression sub-network.
The conditional variational auto-encoding sub-network takes the pose information and the semantic information together as input: an encoder processes the semantic information and encodes it into a latent vector of controlling features, and the latent vector and the pose information are then fed into a generator to produce the corresponding reconstructed semantic information. A KL loss constrains the latent vector to follow a Gaussian distribution, which makes sampling convenient at test time.
The conditional latent-regression sub-network takes the pose information and a randomly sampled Gaussian vector as the generator input, generates a realistic semantic layout under the constraint of a discriminator, processes the generated semantic layout with the encoder, and applies an L1 loss between the recovered vector and the original Gaussian vector to guarantee a one-to-one mapping and thus diverse semantic outputs.
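A minimal PyTorch sketch of one training step of this two-branch design is given below; the module interfaces (generator G, encoder E, discriminator D), the binary-cross-entropy GAN loss and the tensor shapes are illustrative assumptions, not the exact patented implementation.

```python
import torch
import torch.nn.functional as F

def semantic_layout_step(G, E, D, pose, layout, z_dim=8):
    """pose: (B, P, H, W) pose maps; layout: (B, C, H, W) one-hot semantic layouts."""
    # cVAE-GAN branch: encode the real layout into a latent code, then reconstruct it.
    mu, logvar = E(layout)
    z_enc = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    recon = G(pose, z_enc)                                    # layout logits
    loss_seg = F.cross_entropy(recon, layout.argmax(dim=1))   # pixel-wise cross entropy
    loss_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    d_recon = D(recon, pose)
    loss_gan_vae = F.binary_cross_entropy_with_logits(d_recon, torch.ones_like(d_recon))

    # cLR-GAN branch: generate from a random Gaussian code, then recover that code.
    z_rand = torch.randn(pose.size(0), z_dim, device=pose.device)
    fake = G(pose, z_rand)
    d_fake = D(fake, pose)
    loss_gan = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    mu_fake, _ = E(fake)
    loss_latent = F.l1_loss(mu_fake, z_rand)                  # one-to-one latent recovery

    return {"gan_vae": loss_gan_vae, "seg": loss_seg, "kl": loss_kl,
            "gan": loss_gan, "latent": loss_latent}
```

In training, these terms are weighted and minimized jointly (see the overall loss defined later), with the discriminator updated with the usual opposing objective.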
The second stage: texture generation network
In the texture generation network, the aim is to generate texture on the semantic layout produced by the semantic layout generation network: the texture synthesized in the clothing region should be consistent with the guiding texture example, and the synthesized human appearance should be perceptually convincing. The diverse semantic layouts output by the semantic layout generation network provide multi-modal input for our texture generation network.
Texture generation for the upper and lower garments is handled separately: the upper and lower garments are generated respectively, with the texture block region mask and the clothing region mask as input, and the texture generation network is realized by an encoder, a texture generation block, a decoder and a Patch-GAN discriminator. The encoder encodes the input texture block, the texture generation block transfers the local texture features to the corresponding clothing region, and the decoder decodes the reconstructed features into the corresponding fashion image. To make the generated fashion image more realistic, a Patch-GAN discriminator is added and trained together with the encoder, the texture generation block and the decoder.
The encoder of the texture generation network:
the Encoder adopts a common Encoder structure to decode the input texture block, and compared with other methods, partial convolution is used in the Encoder to replace a standard convolution layer, so that artifacts such as blurring and color difference are avoided. The partial convolution at each position is expressed as:
Figure BDA0003115117680000051
wherein X is the characteristic value of the current convolution (sliding) window, M is the binary mask of the texture block area mask corresponding to the current convolution window, W is the weight of the convolution filter, and b is the offset. sum (M) is the number of 1's in the binary mask.
After each partial convolution operation, updating the mask by marking the corresponding position of the mask after the window convolution operation as valid if at least one valid input value exists in the binary mask of the current convolution window, and expressing as follows:
Figure BDA0003115117680000052
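Below is a minimal PyTorch sketch of a partial-convolution layer consistent with the two formulas above; the layer hyper-parameters and the mask-handling details are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Convolution that ignores invalid (mask = 0) inputs and rescales by sum(1)/sum(M)."""

    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=True)
        # Fixed all-ones kernel used only to count valid inputs per sliding window.
        self.register_buffer("ones", torch.ones(1, in_ch, kernel_size, kernel_size))
        self.stride, self.padding = stride, padding

    def forward(self, x, mask):
        # x: (B, C, H, W) features, mask: (B, 1, H, W) binary validity mask.
        mask = mask.expand(-1, x.size(1), -1, -1)
        with torch.no_grad():
            valid = F.conv2d(mask, self.ones, stride=self.stride, padding=self.padding)
        out = self.conv(x * mask)                          # W^T (X ⊙ M) + b
        bias = self.conv.bias.view(1, -1, 1, 1)
        scale = self.ones.numel() / valid.clamp(min=1.0)   # sum(1) / sum(M)
        out = torch.where(valid > 0, (out - bias) * scale + bias, torch.zeros_like(out))
        new_mask = (valid > 0).float()                     # updated mask m'
        return out, new_mask
```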
the texture generation block of the texture generation network:
we have found that previous work has achieved the effect of texture generation using solely convolution to model the correlation between different image regions. However, because the convolution operation has a local acceptance domain, the long-distance dependency relationship must be processed through several convolution layers, the learning effect is not good, and the texture generation effect is difficult to realize. We introduce a texture generation block that reconstructs the texture features of the existing encoder output by using an attention map. And (3) forming a similarity matrix by calculating the cosine value similarity among the texture feature blocks, and activating by using a softmax function to obtain an attention map, so that feature information is copied from the existing texture feature blocks, and the texture of the missing part of the garment is generated. To better learn the correlation between textures, i use features one layer higher than the reconstructed features to compute cosine similarity between features. The similarity matrix is calculated as follows:
Figure BDA0003115117680000053
Figure BDA0003115117680000054
and
Figure BDA0003115117680000055
respectively extracted texture features
Figure BDA0003115117680000056
The ith texture feature block and the jth texture feature block in the block, and
Figure BDA0003115117680000057
is composed of
Figure BDA0003115117680000058
And
Figure BDA0003115117680000059
is scored. We apply the softmax function to activate and obtain the initial attention map of the ith texture feature block
Figure BDA0003115117680000061
From texture features according to similarity calculation formula
Figure BDA0003115117680000062
Initial attention map AS for extracting whole texture featureslWe then use an attention-seeking scheme to reconstruct each block within the texture feature separately by a deconvolution operation:
Figure BDA0003115117680000063
wherein the content of the first and second substances,
Figure BDA0003115117680000064
is the ith block extracted within the texture feature,
Figure BDA0003115117680000065
is the jth block extracted within the texture feature. Reconstructing all blocks through attention scores to finally obtain reconstructed features
Figure BDA0003115117680000066
Wherein L is E [1, L-1 ∈]L is the characteristic number output by the encoder, and L is the corresponding characteristic layer serial number. After that time, the user can use the device,
Figure BDA0003115117680000067
further refinement is achieved by four sets of dilation convolutions at different rates.
The decoder of the texture generation network:
the SPADE structure (space adaptive normalization method) and the Decoder structure are combined, so that the introduction of human body information is realized, the generated clothing shape is further constrained through semantic information, and the characteristics generated by the reconstructed texture after coding and the semantic information are combined and decoded into a corresponding fashion image. The calculation process of the spatial adaptive normalization is as follows:
Figure BDA0003115117680000068
wherein, for the input semantic layout HsExtracting features by convolution, and obtaining normalized scaling coefficient by two convolution layers respectively
Figure BDA0003115117680000069
And bias term
Figure BDA00031151176800000610
Wherein x, y and c are the height, width and channel number of the feature respectively, and n is the number of samples participating in training.
Figure BDA00031151176800000611
And
Figure BDA00031151176800000612
are respectively input features
Figure BDA00031151176800000613
Mean and standard deviation of. The calculation formula is as follows, and this part is the same as the calculation in BN.
Figure BDA00031151176800000614
Figure BDA00031151176800000615
H, W, C are the height, width, and number of channels, respectively, of the semantic layout input. x, y and c are respectively the height, width and channel number of the input feature, and n is the number of samples participating in training. N is the number of samples involved in training.
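A minimal PyTorch sketch of such a spatially-adaptive normalization layer is shown below; the hidden width, kernel sizes and the nearest-neighbour resizing of the layout are assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    """Normalize features with BN statistics, then modulate them with a per-pixel
    scale γ(H_s) and bias β(H_s) predicted from the semantic layout."""

    def __init__(self, feat_channels, layout_channels, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(feat_channels, affine=False)      # (h - μ_c) / σ_c
        self.shared = nn.Sequential(
            nn.Conv2d(layout_channels, hidden, 3, padding=1), nn.ReLU(inplace=True))
        self.gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)  # scaling γ(H_s)
        self.beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)   # bias β(H_s)

    def forward(self, h, layout):
        # h: (N, C, H, W) decoder features; layout: (N, L, H', W') semantic layout.
        layout = F.interpolate(layout, size=h.shape[2:], mode="nearest")
        ctx = self.shared(layout)
        return self.norm(h) * self.gamma(ctx) + self.beta(ctx)
```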
Step (3): construct a deep learning framework, and train the semantic layout generation network and the texture generation network on the collected data set by minimizing adversarial loss, cross-entropy loss, pixel-level loss, perceptual loss and style loss. The details are as follows:
Because the details of fashion images are complex, training the generator well is a major challenge. To address this, we train with multiple losses that constrain the model from different aspects, namely the adversarial loss, the cross-entropy loss, the pixel-level loss, the perceptual loss and the Gram-matrix-based style loss.
The overall loss of the semantic layout generation network is defined as follows:

$$\mathcal{L}_{total} = \lambda_{vae}\mathcal{L}_{GAN}^{VAE} + \lambda_{seg}\mathcal{L}_{seg} + \lambda_{kl}\mathcal{L}_{KL} + \lambda_{gan}\mathcal{L}_{GAN} + \lambda_{latent}\mathcal{L}_{latent}$$

The first three terms correspond to the objective of the conditional variational auto-encoder GAN, and the latter two terms correspond to the objective of the conditional latent-regression GAN. $\lambda_{vae}=2$, $\lambda_{seg}=3$, $\lambda_{kl}=0.01$, $\lambda_{gan}=2$, $\lambda_{latent}=30$ are the weights of the respective loss terms. Unlike the original BicycleGAN model, we use a softmax activation at the last layer of the generator and adopt a cross-entropy loss to predict the human semantic layout. In the semantic layout transformation, the cross-entropy loss constrains pixel-level precision and is defined as:

$$\mathcal{L}_{seg} = -\frac{1}{H \times W}\sum_{x,y}\sum_{c=1}^{C} \hat{H}_s^{(x,y,c)} \log H_s^{(x,y,c)}$$

where $H$, $W$ and $C$ are the height, width and number of channels of the semantic layout; $H_s$ is the generated semantic layout and $\hat{H}_s$ is the corresponding ground-truth semantic layout.
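As an illustration, a small sketch of the pixel-level cross-entropy term and of the weighted total is given below, assuming the generated layout is provided as per-class probabilities and the reference layout as a one-hot map of the same shape; the weighting reuses the five terms sketched earlier and the λ values stated above.

```python
import torch

def layout_cross_entropy(h_gen, h_real, eps=1e-8):
    """h_gen, h_real: (N, C, H, W); h_gen sums to 1 over the class channel C."""
    # Average the per-pixel cross entropy over spatial positions and the batch.
    return -(h_real * torch.log(h_gen + eps)).sum(dim=1).mean()

def semantic_layout_total(losses):
    """Weighted total of the five branch losses (keys as in the earlier sketch)."""
    return (2 * losses["gan_vae"] + 3 * losses["seg"] + 0.01 * losses["kl"]
            + 2 * losses["gan"] + 30 * losses["latent"])
```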
The overall loss of the texture generation network is defined as follows:

$$\mathcal{L}_{total} = \lambda_{adv}\mathcal{L}_{adv} + \lambda_{rec}\mathcal{L}_{rec} + \lambda_{per}\mathcal{L}_{per} + \lambda_{sty}\mathcal{L}_{sty}$$

where $\mathcal{L}_{adv}$ is the adversarial loss, $\mathcal{L}_{rec}$ is the $L_1$ loss between the generated fashion image $\hat{I}$ and the real image $I$, $\mathcal{L}_{per}$ is the perceptual loss between $\hat{I}$ and $I$, and $\mathcal{L}_{sty}$ is the style loss between $\hat{I}$ and $I$. $\lambda_{adv}=0.1$, $\lambda_{rec}=6$, $\lambda_{per}=0.5$, $\lambda_{sty}=50$ are the weights of the respective loss terms.
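A hedged sketch of this combined objective follows; the VGG-style feature extractor interface and the discriminator call are assumptions, and only the loss structure and weighting follow the formula above.

```python
import torch
import torch.nn.functional as F

def gram(feat):
    """Gram matrix of a (B, C, H, W) feature map, normalized by its size."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def texture_total_loss(fake, real, disc, vgg_feats,
                       lam_adv=0.1, lam_rec=6.0, lam_per=0.5, lam_sty=50.0):
    """fake/real: (B, 3, H, W) images; disc: Patch-GAN discriminator;
    vgg_feats: callable returning a list of feature maps for an image."""
    d_fake = disc(fake)
    loss_adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    loss_rec = F.l1_loss(fake, real)                                   # pixel-level L1
    f_fake, f_real = vgg_feats(fake), vgg_feats(real)
    loss_per = sum(F.l1_loss(a, b) for a, b in zip(f_fake, f_real))    # perceptual
    loss_sty = sum(F.l1_loss(gram(a), gram(b))
                   for a, b in zip(f_fake, f_real))                    # Gram-matrix style
    return (lam_adv * loss_adv + lam_rec * loss_rec
            + lam_per * loss_per + lam_sty * loss_sty)
```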
The invention has the beneficial effects that:
the invention provides a method for designing and synthesizing fashion clothes guided by postures and textures aiming at the practical problems of poor and single generation effect of the existing fashion images, solves the problems of too little information contained in the guided postures, locality and inaccuracy of texture transfer and single generation of the fashion images in the existing method, and realizes the generation of the diversity and the accuracy of the fashion images to a great extent. In addition, the task of combining artificial intelligence and fashion is taken as a current research hotspot, the reasonable use also enables the invention to have more advanced and innovative scientific research, and corresponding real and various fashion images are automatically designed and generated according to input control conditions (posture information and texture information) of a plurality of modes, so that the design inspiration of clothing designers can be further stimulated, and the development and application popularization of creative design related research in the fashion field can be promoted.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention.
FIG. 2 is a semantic layout generating network model in the method of the present invention.
FIG. 3 is a model of a texture transfer network in the method of the present invention.
FIG. 4 is a schematic of the data set of the present invention.
Detailed Description
The invention is further illustrated by the following figures and examples.
The invention provides a pose- and texture-guided fashion garment design synthesis method.
As shown in FIG. 1, a pose- and texture-guided fashion garment design synthesis method comprises the following steps:
Step (1): collect task data from an existing fashion data set, preprocess the data, and construct a data set of fashion images, pose information and semantic information.
Step (2): on the existing fashion data set, construct a two-stage generative model targeting natural and accurate fashion images; the generative model comprises a semantic layout generation network and a texture generation network, which realize effective texture transfer and generate diverse fashion images.
Step (3): train the semantic layout generation network and the texture transfer network on the collected data set by minimizing adversarial loss, cross-entropy loss, pixel-level loss, perceptual loss and style loss.
Step (4): train the network parameters of the generative model in step (3) through back-propagation until the whole model converges, and generate the corresponding fashion images.
Collecting task data with an existing fashion data set in step (1) means that we evaluate our method on the Fashion-Gen data set, because it contains various complex garment textures. We select 4 major garment categories (i.e., dress, shirt, sweater, and coat) from the 48 fashion categories in the Fashion-Gen data set for evaluation.
Constructing the data set of fashion images, pose information and semantic information in step (1) means that, for each fashion image, the pose of the person is estimated with a state-of-the-art pose estimator; the computed pose information consists of 18 joint coordinate points. In addition, an advanced human parser is used to compute a human semantic map containing 20 labels, each representing a specific part of the body, such as the face, hair, arms, legs and clothing regions.
Constructing the two-stage generative model in step (2), which comprises a semantic layout generation network and a texture generation network so as to realize effective texture transfer and generate diverse fashion images, specifically proceeds as follows:
As shown in FIG. 2, the first stage: semantic layout generation network
In the semantic layout generation network, our goal is to map the guiding pose p to a series of human semantic layouts $\{H_1, H_2, \ldots, H_N\}$. These semantic layouts provide sufficient prior knowledge of the shape of the human body and the structure of the garment.
We take the pose information and the corresponding semantic information as input and learn to generate diverse semantic information. A simple UNet can also produce a corresponding semantic output, but it cannot satisfy the diversity requirement. We therefore build the semantic layout generation network on the BicycleGAN model, because it encourages multiple outputs to be generated from a single source image in image-to-image translation. The semantic layout generation network comprises a conditional variational auto-encoding sub-network and a conditional latent-regression sub-network.
The conditional variational auto-encoding sub-network takes the pose information and the semantic information together as input: an encoder processes the semantic information and encodes it into a latent vector of controlling features, and the latent vector and the pose information are then fed into a generator to produce the corresponding reconstructed semantic information. A KL loss constrains the latent vector to follow a Gaussian distribution, which makes sampling convenient at test time.
The conditional latent-regression sub-network takes the pose information and a randomly sampled Gaussian vector as the generator input, generates a realistic semantic layout under the constraint of a discriminator, processes the generated semantic layout with the encoder, and applies an L1 loss between the recovered vector and the original Gaussian vector to guarantee a one-to-one mapping and thus diverse semantic outputs.
As shown in FIG. 3, the second stage: texture generation network
In the texture generation network, the aim is to generate texture on the semantic layout produced by the semantic layout generation network: the texture synthesized in the clothing region should be consistent with the guiding texture example, and the synthesized human appearance should be perceptually convincing. The diverse semantic layouts output by the semantic layout generation network provide multi-modal input for our texture generation network.
Texture generation for the upper and lower garments is handled separately: the upper and lower garments are generated respectively, with the texture block region mask and the clothing region mask as input, and the texture generation network is realized by an encoder, a texture generation block, a decoder and a Patch-GAN discriminator. The encoder encodes the input texture block, the texture generation block transfers the local texture features to the corresponding clothing region, and the decoder decodes the reconstructed features into the corresponding fashion image. To make the generated fashion image more realistic, a Patch-GAN discriminator is added and trained together with the encoder, the texture generation block and the decoder.
The encoder of the texture generation network:
the Encoder adopts a common Encoder structure to decode the input texture block, and compared with other methods, partial convolution is used in the Encoder to replace a standard convolution layer, so that artifacts such as blurring and color difference are avoided. The partial convolution at each position is expressed as:
Figure BDA0003115117680000101
wherein X is the characteristic value of the current convolution (sliding) window, M is the binary mask of the texture block area mask corresponding to the current convolution window, W is the weight of the convolution filter, and b is the offset. sum (M) is the number of 1's in the binary mask.
After each partial convolution operation, updating the mask by marking the corresponding position of the mask after the window convolution operation as valid if at least one valid input value exists in the binary mask of the current convolution window, and expressing as follows:
Figure BDA0003115117680000111
the texture generation block of the texture generation network:
we have found that previous work has achieved the effect of texture generation using solely convolution to model the correlation between different image regions. However, because the convolution operation has a local acceptance domain, the long-distance dependency relationship must be processed through several convolution layers, the learning effect is not good, and the texture generation effect is difficult to realize. We introduce a texture generation block that reconstructs the texture features of the existing encoder output by using an attention map. And (3) forming a similarity matrix by calculating the cosine value similarity among the texture feature blocks, and activating by using a softmax function to obtain an attention map, so that feature information is copied from the existing texture feature blocks, and the texture of the missing part of the garment is generated. To better learn the correlation between textures, i use features one layer higher than the reconstructed features to compute cosine similarity between features. The similarity matrix is calculated as follows:
Figure BDA0003115117680000112
Figure BDA0003115117680000113
and
Figure BDA0003115117680000114
respectively extracted texture features
Figure BDA0003115117680000115
The ith texture feature block and the jth texture feature block in the block, and
Figure BDA0003115117680000116
is composed of
Figure BDA0003115117680000117
And
Figure BDA0003115117680000118
is scored. We apply the softmax function to activate and obtain the initial attention map of the ith texture feature block
Figure BDA0003115117680000119
From texture features according to similarity calculation formula
Figure BDA00031151176800001110
Initial attention map AS for extracting whole texture featureslWe then use an attention-seeking scheme to reconstruct each block within the texture feature separately by a deconvolution operation:
Figure BDA00031151176800001111
wherein the content of the first and second substances,
Figure BDA00031151176800001112
is the ith block extracted within the texture feature,
Figure BDA00031151176800001113
is the jth block extracted within the texture feature. Reconstructing all blocks through attention scores to finally obtain reconstructed features
Figure BDA00031151176800001114
Wherein L is E [1, L-1 ∈]L is the characteristic number output by the encoder, and L is the corresponding characteristic layer serial number. After that time, the user can use the device,
Figure BDA00031151176800001115
further refinement is achieved by four sets of dilation convolutions at different rates.
The decoder of the texture generation network:
the SPADE structure (space adaptive normalization method) and the Decoder structure are combined, so that the introduction of human body information is realized, the generated clothing shape is further constrained through semantic information, and the characteristics generated by the reconstructed texture after coding and the semantic information are combined and decoded into a corresponding fashion image. The calculation process of the spatial adaptive normalization is as follows:
Figure BDA0003115117680000121
wherein, for the input semantic layout HsBy convolutional extractionTaking characteristics, and obtaining normalized scaling coefficient through two convolution layers respectively
Figure BDA0003115117680000122
And bias term
Figure BDA0003115117680000123
Wherein x, y and c are the height, width and channel number of the feature respectively, and n is the number of samples participating in training.
Figure BDA0003115117680000124
And
Figure BDA0003115117680000125
are respectively input features
Figure BDA0003115117680000126
Mean and standard deviation of. The calculation formula is as follows, and this part is the same as the calculation in BN.
Figure BDA0003115117680000127
Figure BDA0003115117680000128
H, W, C are the height, width, and number of channels, respectively, of the semantic layout input. x, y and c are respectively the height, width and channel number of the input feature, and n is the number of samples participating in training. N is the number of samples involved in training.
Step (3): construct a deep learning framework and, as shown in FIG. 3, train the semantic layout generation network and the texture generation network on the collected data set by minimizing adversarial loss, cross-entropy loss, pixel-level loss, perceptual loss and style loss. The details are as follows:
Because the details of fashion images are complex, training the generator well is a major challenge. To address this, we train with multiple losses that constrain the model from different aspects, namely the adversarial loss, the cross-entropy loss, the pixel-level loss, the perceptual loss and the Gram-matrix-based style loss.
The overall loss of the semantic layout generation network is defined as follows:

$$\mathcal{L}_{total} = \lambda_{vae}\mathcal{L}_{GAN}^{VAE} + \lambda_{seg}\mathcal{L}_{seg} + \lambda_{kl}\mathcal{L}_{KL} + \lambda_{gan}\mathcal{L}_{GAN} + \lambda_{latent}\mathcal{L}_{latent}$$

The first three terms correspond to the objective of the conditional variational auto-encoder GAN, and the latter two terms correspond to the objective of the conditional latent-regression GAN. $\lambda_{vae}=2$, $\lambda_{seg}=3$, $\lambda_{kl}=0.01$, $\lambda_{gan}=2$, $\lambda_{latent}=30$ are the weights of the respective loss terms. Unlike the original BicycleGAN model, we use a softmax activation at the last layer of the generator and adopt a cross-entropy loss to predict the human semantic layout. In the semantic layout transformation, the cross-entropy loss constrains pixel-level precision and is defined as:

$$\mathcal{L}_{seg} = -\frac{1}{H \times W}\sum_{x,y}\sum_{c=1}^{C} \hat{H}_s^{(x,y,c)} \log H_s^{(x,y,c)}$$

where $H$, $W$ and $C$ are the height, width and number of channels of the semantic layout; $H_s$ is the generated semantic layout and $\hat{H}_s$ is the corresponding ground-truth semantic layout.
The overall loss of the texture generation network is defined as follows:

$$\mathcal{L}_{total} = \lambda_{adv}\mathcal{L}_{adv} + \lambda_{rec}\mathcal{L}_{rec} + \lambda_{per}\mathcal{L}_{per} + \lambda_{sty}\mathcal{L}_{sty}$$

where $\mathcal{L}_{adv}$ is the adversarial loss, $\mathcal{L}_{rec}$ is the $L_1$ loss between the generated fashion image $\hat{I}$ and the real image $I$, $\mathcal{L}_{per}$ is the perceptual loss between $\hat{I}$ and $I$, and $\mathcal{L}_{sty}$ is the style loss between $\hat{I}$ and $I$. $\lambda_{adv}=0.1$, $\lambda_{rec}=6$, $\lambda_{per}=0.5$, $\lambda_{sty}=50$ are the weights of the respective loss terms.

Claims (7)

1. A pose- and texture-guided fashion garment design synthesis method, characterized by comprising the following steps:
step (1): collecting task data from an existing fashion data set, preprocessing the data, and constructing a data set of fashion images, pose information and semantic information;
step (2): on the existing fashion data set, constructing a two-stage generative model targeting natural and accurate fashion images; the generative model comprises a semantic layout generation network and a texture generation network, which realize effective texture transfer and generate diverse fashion images;
step (3): training the semantic layout generation network and the texture transfer network on the collected data set by minimizing adversarial loss, cross-entropy loss, pixel-level loss, perceptual loss and style loss;
step (4): training the network parameters of the generative model in step (3) through back-propagation until the whole model converges, and generating the corresponding fashion images.
2. The pose- and texture-guided fashion garment design synthesis method according to claim 1, characterized in that constructing the data set of fashion images, pose information and semantic information in step (1) means that, for each fashion image, the pose of the person is estimated with a state-of-the-art pose estimator, the computed pose information comprising 18 joint coordinate points; in addition, an advanced human parser is used to compute a human semantic map containing 20 labels, each representing a specific part of the body, such as the face, hair, arms, legs and clothing regions.
3. The pose- and texture-guided fashion garment design synthesis method according to claim 2, characterized in that the two-stage generative model constructed in step (2) comprises a semantic layout generation network and a texture generation network, realizing effective texture transfer and generating diverse fashion images; the semantic layout generation network is specifically implemented as follows:
in the semantic layout generation network, the goal is to map the guiding pose p to a series of human semantic layouts $\{H_1, H_2, \ldots, H_N\}$; these semantic layouts provide sufficient prior knowledge of the shape of the human body and the structure of the garment;
the pose information and the corresponding semantic information are used as input, and diverse semantic information is learned and generated; the semantic layout generation network is built on the BicycleGAN model and comprises a conditional variational auto-encoding sub-network and a conditional latent-regression sub-network;
the conditional variational auto-encoding sub-network takes the pose information and the semantic information together as input: an encoder processes the semantic information and encodes it into a latent vector of controlling features, and the latent vector and the pose information are then fed into a generator to produce the corresponding reconstructed semantic information; a KL loss constrains the latent vector to follow a Gaussian distribution, which makes sampling convenient at test time;
the conditional latent-regression sub-network takes the pose information and a randomly sampled Gaussian vector as the generator input, generates a realistic semantic layout under the constraint of a discriminator, processes the generated semantic layout with the encoder, and applies an L1 loss between the recovered vector and the original Gaussian vector to guarantee a one-to-one mapping and thus diverse semantic outputs.
4. The pose- and texture-guided fashion garment design synthesis method according to claim 3, characterized by the second stage: the texture generation network is specifically implemented as follows:
in the texture generation network, the aim is to generate texture on the semantic layout produced by the semantic layout generation network, wherein the texture synthesized in the clothing region is consistent with the guiding texture example and the synthesized human appearance is perceptually convincing; the diverse semantic layouts output by the semantic layout generation network provide multi-modal input for the texture generation network;
texture generation for the upper and lower garments is handled separately: the upper and lower garments are generated respectively, with the texture block region mask and the clothing region mask as input, and the texture generation network is realized by an encoder, a texture generation block, a decoder and a Patch-GAN discriminator; the encoder encodes the input texture block, the texture generation block transfers the local texture features to the corresponding clothing region, and the decoder decodes the reconstructed features into the corresponding fashion image; meanwhile, a Patch-GAN discriminator is added and trained together with the encoder, the texture generation block and the decoder;
the encoder of the texture generation network:
the encoder adopts a common encoder structure to encode the input texture block; compared with other methods, partial convolution is used in the encoder instead of standard convolution layers, which avoids blurring and color shift; the partial convolution at each position is expressed as:

$$x' = \begin{cases} W^{T}(X \odot M)\,\dfrac{\operatorname{sum}(\mathbf{1})}{\operatorname{sum}(M)} + b, & \text{if } \operatorname{sum}(M) > 0 \\ 0, & \text{otherwise} \end{cases}$$

wherein X is the feature values in the current convolution (sliding) window, M is the binary mask of the texture block region mask corresponding to the current window, W is the weight of the convolution filter, and b is the bias; sum(M) is the number of 1's in the binary mask;
after each partial convolution operation the mask is updated: if the binary mask of the current window contains at least one valid input value, the corresponding output position is marked as valid, i.e.:

$$m' = \begin{cases} 1, & \text{if } \operatorname{sum}(M) > 0 \\ 0, & \text{otherwise} \end{cases}$$
5. The pose- and texture-guided fashion garment design synthesis method according to claim 3, characterized in that, in the texture generation block, the texture features output by the encoder are reconstructed by using an attention map; a similarity matrix is formed by computing the cosine similarity between texture feature blocks, and a softmax activation yields the attention map, so that feature information is copied from the existing texture feature blocks to generate the texture of the missing clothing regions; in order to better learn the correlation between textures, the features one layer higher than the reconstructed features are used to compute the cosine similarity between features; the similarity matrix is calculated as follows:

$$s_{i,j}^{l} = \left\langle \frac{f_i^{l}}{\lVert f_i^{l}\rVert},\; \frac{f_j^{l}}{\lVert f_j^{l}\rVert} \right\rangle$$

wherein $f_i^{l}$ and $f_j^{l}$ are the $i$-th and $j$-th texture feature blocks extracted from the texture features $F^{l}$, and $s_{i,j}^{l}$ is the similarity score between $f_i^{l}$ and $f_j^{l}$; the softmax function is applied to obtain the initial attention map $A_i^{s_l}$ of the $i$-th texture feature block; after the initial attention map $A^{s_l}$ of the whole texture feature is extracted from the texture features $F^{l}$ according to the similarity formula, each block within the texture feature is reconstructed separately by a deconvolution operation using the attention map:

$$\hat{f}_i^{\,l-1} = \sum_{j} A_{i,j}^{s_l}\, f_j^{\,l-1}$$

wherein $f_i^{\,l-1}$ is the $i$-th block extracted within the texture feature and $f_j^{\,l-1}$ is the $j$-th block extracted within the texture feature; all blocks are reconstructed through the attention scores, finally obtaining the reconstructed feature $\hat{F}^{\,l-1}$, wherein $l \in [1, L-1]$, $L$ is the number of feature maps output by the encoder, and $l$ is the index of the corresponding feature layer; afterwards, $\hat{F}^{\,l-1}$ is further refined by four sets of dilated convolutions with different rates.
6. The pose- and texture-guided fashion garment design synthesis method according to claim 4 or 5, characterized by the decoder of the texture generation network:
the SPADE structure is combined with the decoder structure, which introduces the human body information, further constrains the generated clothing shape through the semantic information, and decodes the encoded reconstructed texture features, combined with the semantic information, into the corresponding fashion image; the spatially-adaptive normalization is computed as:

$$\gamma_{x,y,c}(H_s)\,\frac{h_{x,y,c}^{\,n} - \mu_c}{\sigma_c} + \beta_{x,y,c}(H_s)$$

wherein, for the input semantic layout $H_s$, features are extracted by convolution, and two further convolution layers produce the normalization scaling coefficient $\gamma_{x,y,c}(H_s)$ and the bias term $\beta_{x,y,c}(H_s)$; $x$, $y$ and $c$ index the height, width and channel of the feature, and $n$ indexes the samples participating in training; $\mu_c$ and $\sigma_c$ are the mean and standard deviation of the input feature $h$, computed as in BN:

$$\mu_c = \frac{1}{N H W}\sum_{n,x,y} h_{x,y,c}^{\,n}, \qquad \sigma_c = \sqrt{\frac{1}{N H W}\sum_{n,x,y}\left(h_{x,y,c}^{\,n}\right)^2 - \mu_c^2}$$

wherein $H$, $W$ and $C$ are the height, width and number of channels of the semantic layout input; $x$, $y$ and $c$ index the height, width and channel of the input feature; and $N$ is the number of samples participating in training.
7. The pose- and texture-guided fashion garment design synthesis method according to claim 6, characterized in that step (3) is implemented as follows:
training is performed with multiple losses that constrain the model from different aspects, namely the adversarial loss, the cross-entropy loss, the pixel-level loss, the perceptual loss and the Gram-matrix-based style loss;
the overall loss of the semantic layout generation network is defined as follows:

$$\mathcal{L}_{total} = \lambda_{vae}\mathcal{L}_{GAN}^{VAE} + \lambda_{seg}\mathcal{L}_{seg} + \lambda_{kl}\mathcal{L}_{KL} + \lambda_{gan}\mathcal{L}_{GAN} + \lambda_{latent}\mathcal{L}_{latent}$$

wherein the first three terms correspond to the objective of the conditional variational auto-encoder GAN, and the latter two terms correspond to the objective of the conditional latent-regression GAN; $\lambda_{vae}=2$, $\lambda_{seg}=3$, $\lambda_{kl}=0.01$, $\lambda_{gan}=2$, $\lambda_{latent}=30$ are the weights of the respective loss terms; the last layer of the generator uses a softmax activation, and a cross-entropy loss is adopted to predict the human semantic layout; in the semantic layout transformation, the cross-entropy loss constrains pixel-level precision and is defined as:

$$\mathcal{L}_{seg} = -\frac{1}{H \times W}\sum_{x,y}\sum_{c=1}^{C} \hat{H}_s^{(x,y,c)} \log H_s^{(x,y,c)}$$

wherein $H$, $W$ and $C$ are the height, width and number of channels of the semantic layout; $H_s$ is the generated semantic layout and $\hat{H}_s$ is the corresponding ground-truth semantic layout;
the overall loss of the texture generation network is defined as follows:

$$\mathcal{L}_{total} = \lambda_{adv}\mathcal{L}_{adv} + \lambda_{rec}\mathcal{L}_{rec} + \lambda_{per}\mathcal{L}_{per} + \lambda_{sty}\mathcal{L}_{sty}$$

wherein $\mathcal{L}_{adv}$ is the adversarial loss, $\mathcal{L}_{rec}$ is the $L_1$ loss between the generated fashion image $\hat{I}$ and the real image $I$, $\mathcal{L}_{per}$ is the perceptual loss between $\hat{I}$ and $I$, and $\mathcal{L}_{sty}$ is the style loss between $\hat{I}$ and $I$; $\lambda_{adv}=0.1$, $\lambda_{rec}=6$, $\lambda_{per}=0.5$, $\lambda_{sty}=50$ are the weights of the respective loss terms.
CN202110660701.8A 2021-06-15 2021-06-15 Fashion garment design synthesis method guided by postures and textures Active CN113393550B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110660701.8A CN113393550B (en) 2021-06-15 2021-06-15 Fashion garment design synthesis method guided by postures and textures

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110660701.8A CN113393550B (en) 2021-06-15 2021-06-15 Fashion garment design synthesis method guided by postures and textures

Publications (2)

Publication Number Publication Date
CN113393550A true CN113393550A (en) 2021-09-14
CN113393550B CN113393550B (en) 2022-09-20

Family

ID=77621042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110660701.8A Active CN113393550B (en) 2021-06-15 2021-06-15 Fashion garment design synthesis method guided by postures and textures

Country Status (1)

Country Link
CN (1) CN113393550B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325952A (en) * 2018-09-17 2019-02-12 上海宝尊电子商务有限公司 Fashion clothing image partition method based on deep learning
US20200151807A1 (en) * 2018-11-14 2020-05-14 Beijing Jingdong Shangke Information Technology Co., Ltd. System and method for automatically generating three-dimensional virtual garment model using product description
CN109559287A (en) * 2018-11-20 2019-04-02 北京工业大学 A kind of semantic image restorative procedure generating confrontation network based on DenseNet
US20210065418A1 (en) * 2019-08-27 2021-03-04 Shenzhen Malong Technologies Co., Ltd. Appearance-flow-based image generation
CN111476241A (en) * 2020-03-04 2020-07-31 上海交通大学 Character clothing conversion method and system
CN111445426A (en) * 2020-05-09 2020-07-24 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Target garment image processing method based on generation countermeasure network model

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
XIAOLING GU; FEI GAO; MIN TAN; PAI PENG: "Fashion analysis and understanding with artificial intelligence", INFORMATION PROCESSING & MANAGEMENT *
XIAOLING GU; JUN YU; YONGKANG WONG; MOHAN S. KANKANHALLI: "Toward Multi-Modal Conditioned Fashion Image Translation", IEEE TRANSACTIONS ON MULTIMEDIA *
徐俊哲; 陈佳; 何儒汉; 胡新荣: "Research on pose-based fashion image synthesis", Modern Computer (《现代计算机》) *
李锵 et al.: "Clothing key-point localization algorithm based on cascaded convolutional neural networks", Journal of Tianjin University (Science and Technology) (《天津大学学报(自然科学与工程技术版)》) *
黄菲 et al.: "Heterogeneous face image synthesis based on generative adversarial networks: progress and challenges", Journal of Nanjing University of Information Science & Technology (Natural Science Edition) (《南京信息工程大学学报(自然科学版)》) *
黄韬 et al.: "Text-guided person image editing method based on generative adversarial networks", Journal of Guangdong Polytechnic Normal University (《广东技术师范大学学报》) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113838166A (en) * 2021-09-22 2021-12-24 网易(杭州)网络有限公司 Image feature migration method and device, storage medium and terminal equipment
CN113838166B (en) * 2021-09-22 2023-08-29 网易(杭州)网络有限公司 Image feature migration method and device, storage medium and terminal equipment
CN114723843A (en) * 2022-06-01 2022-07-08 广东时谛智能科技有限公司 Method, device, equipment and storage medium for generating virtual clothing through multi-mode fusion
CN114723843B (en) * 2022-06-01 2022-12-06 广东时谛智能科技有限公司 Method, device, equipment and storage medium for generating virtual clothing through multi-mode fusion
CN115147526A (en) * 2022-06-30 2022-10-04 北京百度网讯科技有限公司 Method and device for training clothing generation model and method and device for generating clothing image
CN115147526B (en) * 2022-06-30 2023-09-26 北京百度网讯科技有限公司 Training of clothing generation model and method and device for generating clothing image
CN115659852A (en) * 2022-12-26 2023-01-31 浙江大学 Layout generation method and device based on discrete potential representation
CN116229229A (en) * 2023-05-11 2023-06-06 青岛科技大学 Multi-domain image fusion method and system based on deep learning

Also Published As

Publication number Publication date
CN113393550B (en) 2022-09-20

Similar Documents

Publication Publication Date Title
CN113393550B (en) Fashion garment design synthesis method guided by postures and textures
CN110211196B (en) Virtual fitting method and device based on posture guidance
Zhang et al. Pise: Person image synthesis and editing with decoupled gan
CN111275518A (en) Video virtual fitting method and device based on mixed optical flow
CN108288072A (en) A kind of facial expression synthetic method based on generation confrontation network
Kolotouros et al. Dreamhuman: Animatable 3d avatars from text
Tang et al. Multi-channel attention selection gans for guided image-to-image translation
US11282256B2 (en) Crowdshaping realistic 3D avatars with words
CN113496507A (en) Human body three-dimensional model reconstruction method
Li et al. Learning symmetry consistent deep cnns for face completion
Sheng et al. Deep neural representation guided face sketch synthesis
CN113255457A (en) Animation character facial expression generation method and system based on facial expression recognition
CN111476241B (en) Character clothing conversion method and system
CN113538608B (en) Controllable figure image generation method based on generation countermeasure network
WO2023088277A1 (en) Virtual dressing method and apparatus, and device, storage medium and program product
CN111462274A (en) Human body image synthesis method and system based on SMP L model
Zeng et al. Avatarbooth: High-quality and customizable 3d human avatar generation
Du et al. VTON-SCFA: A virtual try-on network based on the semantic constraints and flow alignment
Kwolek et al. Recognition of JSL fingerspelling using deep convolutional neural networks
CN113076918A (en) Video-based facial expression cloning method
Liu et al. Multimodal face aging framework via learning disentangled representation
CN116777738A (en) Authenticity virtual fitting method based on clothing region alignment and style retention modulation
CN116168186A (en) Virtual fitting chart generation method with controllable garment length
Kuo et al. Generating ambiguous figure-ground images
Kim et al. Development of an IGA-based fashion design aid system with domain specific knowledge

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant