CN111476241B - Character clothing conversion method and system - Google Patents

Character clothing conversion method and system

Info

Publication number
CN111476241B
Authority
CN
China
Prior art keywords
network
map
original
target
character
Prior art date
Legal status
Active
Application number
CN202010143086.9A
Other languages
Chinese (zh)
Other versions
CN111476241A (en)
Inventor
宋利
张义诚
解蓉
张文军
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202010143086.9A priority Critical patent/CN111476241B/en
Publication of CN111476241A publication Critical patent/CN111476241A/en
Application granted granted Critical
Publication of CN111476241B publication Critical patent/CN111476241B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/60 Editing figures and text; Combining figures or text
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a character clothing conversion method and a character clothing conversion system. The method comprises the following steps: according to an input sentence, applying the corresponding shape change to the original segmentation map of the original image and converting it into a target segmentation map; then handling the synthesis problem with a second-level generative adversarial network: taking the target segmentation map as a semantic guidance and shape constraint condition and, together with the input sentence, training a generator to learn the multi-domain mapping from the original image to the target picture, so as to synthesize the target picture and complete the character clothing conversion. The second-level generative adversarial network simultaneously fuses three attention layers: a soft attention layer to strengthen the relevance between the target picture and the input sentence; a self-attention layer to explicitly capture long-range correlations on the image; and a stylized attention layer to establish the dependencies between features by channel-by-channel inner products and feature map recalibration. By fusing the three attention layers, the character clothing conversion method and system achieve high-quality clothing generation.

Description

Character clothing conversion method and system
Technical Field
The invention relates to the technical field of computer vision, in particular to a character clothing conversion method and system.
Background
Character clothing conversion is a very challenging task in the field of computer vision. It aims to convert the clothing of a person in an original image according to an input text description while keeping the person's pose, identity, body shape and other information unchanged. The task has quite wide application and can be extended to various emerging scenarios such as photo editing, movie production and virtual fitting. Although generative adversarial networks have achieved excellent performance in recent years on domain migration tasks such as face attribute conversion and makeup transfer, there is still great room for improvement on the character clothing conversion task.
The challenge of the clothing conversion task lies first in the high difficulty of the task itself. The core problems are as follows. First, the input sentence descriptions cover a wide variety of clothing categories and styles with different shapes, such as short-sleeved shirts, sleeveless dresses and long-sleeved jackets, which causes significant shape changes during clothing conversion. Second, character clothing involves rich texture and color information but, unlike face pictures, contains no shared structure such as skin color and facial features, so a finer-grained generation method is needed to achieve high-quality clothing conversion.
Second, existing methods struggle to meet the requirements of high-quality clothing generation. Conventional character clothing conversion methods still adopt a traditional fully convolutional generator, a network structure with very limited ability to capture long-range correlations that cannot satisfy high-quality generation. Furthermore, existing methods train the network with the overall representation of the input sentence as the conditional information and do not exploit semantic information down to the word level, which is insufficient to support fine-grained texture and color generation. In addition, character clothing conversion may require extensive inference and imagination from the network; for example, when converting from long sleeves to short sleeves, the network must generate new arm regions. How to generate information absent from the original image is therefore a major problem of this task, yet existing methods have not explored it deeply enough.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a character clothing conversion method and a character clothing conversion system that fuse three attention layers and achieve high-quality clothing generation.
In order to solve the technical problems, the invention is realized by the following technical scheme:
The invention provides a character clothing conversion method, which comprises the following steps:

S11: handling the deformation problem with a first-level generative adversarial network: according to the input sentence, applying the corresponding shape change to the original segmentation map of the original image and converting it into a target segmentation map;

S12: handling the synthesis problem with a second-level generative adversarial network: taking the target segmentation map obtained in step S11 as a semantic guidance and shape constraint condition and, together with the input sentence, training a generator to learn the multi-domain mapping from the original image to the target picture, so as to synthesize the target picture and complete the character clothing conversion;

further, the second-level generative adversarial network simultaneously fuses:

S121: a soft attention layer to strengthen the relevance between the target segmentation map and the input sentence;

S122: a self-attention layer to explicitly capture long-range correlations on the target segmentation map;

S123: a stylized attention layer to establish the dependencies between features by channel-by-channel inner products and feature map recalibration.
Preferably, the step S121 further includes: the soft attention layer receives two inputs, the word embedding matrix w and the original feature map x, and the soft attention context feature map c_{soft} is obtained by computing the context vectors c_j and concatenating them:

s_{ji} = W_q(x_j)^{T} W_k(w_i)

\beta_{ji} = \frac{\exp(s_{ji})}{\sum_{i=1}^{T} \exp(s_{ji})}

c_j = \sum_{i=1}^{T} \beta_{ji} W_v(w_i)

c_{soft} = \mathrm{concat}(c_1, c_2, \ldots, c_j, \ldots, c_N)

where W_q, W_k, W_v are all convolution layer parameters, \beta is the attention weight, T is the number of words, and N is the number of spatial positions on the feature map.
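As a concrete illustration, the soft attention layer above can be sketched in PyTorch (an embodiment below states the implementation uses PyTorch). The 1x1-convolution parameterization of W_q, W_k and W_v, the tensor shapes, and the flattening of the feature map are illustrative assumptions, not the patented implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    def __init__(self, feat_dim, word_dim):
        super().__init__()
        self.W_q = nn.Conv1d(feat_dim, feat_dim, 1)  # maps image features (queries)
        self.W_k = nn.Conv1d(word_dim, feat_dim, 1)  # maps word embeddings (keys)
        self.W_v = nn.Conv1d(word_dim, feat_dim, 1)  # maps word embeddings (values)

    def forward(self, x, w):
        # x: (B, C, N) feature map flattened over its N spatial positions
        # w: (B, D, T) word embedding matrix for T words
        q = self.W_q(x)                         # (B, C, N)
        k = self.W_k(w)                         # (B, C, T)
        v = self.W_v(w)                         # (B, C, T)
        s = torch.bmm(q.transpose(1, 2), k)     # (B, N, T): s_ji = W_q(x_j)^T W_k(w_i)
        beta = F.softmax(s, dim=-1)             # attention weights over words
        c = torch.bmm(v, beta.transpose(1, 2))  # (B, C, N): c_j = sum_i beta_ji W_v(w_i)
        return c                                # c_soft: context vectors at all N positions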
Preferably, the step S122 further includes: mapping the original feature map x into several feature spaces with convolution layers, computing the correlations between different sub-regions by inner products, expressing those correlations as attention weights, and finally obtaining the self-attention context feature map c_{self} by weighted summation.
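A minimal sketch of such a self-attention layer follows (in the style of SAGAN self-attention; the channel reduction factor of 8 and the learned residual scale gamma are assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, in_dim):
        super().__init__()
        self.query = nn.Conv2d(in_dim, in_dim // 8, 1)  # feature-space projections
        self.key = nn.Conv2d(in_dim, in_dim // 8, 1)
        self.value = nn.Conv2d(in_dim, in_dim, 1)
        self.gamma = nn.Parameter(torch.zeros(1))       # learned residual scale

    def forward(self, x):
        B, C, H, W = x.shape
        q = self.query(x).view(B, -1, H * W)         # (B, C', N)
        k = self.key(x).view(B, -1, H * W)           # (B, C', N)
        v = self.value(x).view(B, -1, H * W)         # (B, C, N)
        s = torch.bmm(q.transpose(1, 2), k)          # (B, N, N) inner-product correlations
        attn = F.softmax(s, dim=-1)                  # attention weights between sub-regions
        c_self = torch.bmm(v, attn.transpose(1, 2))  # weighted summation
        return self.gamma * c_self.view(B, C, H, W) + x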
Preferably, the step S123 further includes: computing the Gram matrix of the original feature map x, normalizing it with a softmax function, and then recalibrating the features by a weighted summation over all channels, finally obtaining the stylized attention context feature map c_{style}:

G_{kl} = \sum_{j} x_{kj} x_{lj}

\alpha_{kl} = \frac{\exp(G_{kl})}{\sum_{l=1}^{C} \exp(G_{kl})}

f_k = \sum_{l=1}^{C} \alpha_{kl} x_l

c_{style} = \mathrm{concat}(f_1, f_2, \ldots, f_C)

where G is the Gram matrix, \alpha is the attention weight, x is the feature map, f is the context vector, and C is the number of channels.
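A hedged PyTorch sketch of this stylized attention computation (the batch x channels x height x width shape convention is an assumption):

import torch
import torch.nn as nn
import torch.nn.functional as F

class StylizedAttention(nn.Module):
    def forward(self, x):
        B, C, H, W = x.shape
        flat = x.view(B, C, -1)                       # (B, C, N)
        gram = torch.bmm(flat, flat.transpose(1, 2))  # (B, C, C): G_kl = <x_k, x_l>
        alpha = F.softmax(gram, dim=-1)               # normalize the Gram matrix
        f = torch.bmm(alpha, flat)                    # (B, C, N): f_k = sum_l alpha_kl x_l
        return f.view(B, C, H, W)                     # c_style after channel recalibration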
Preferably, the step S11 further includes:

training a generator to approximate a mapping function, and converting the original segmentation map into the target segmentation map through this mapping function, conditioned on the target sentence representation vector; further,

the training process of S11 includes: combining the target sentence representation vector with the segmentation map features obtained after the residual blocks, and then feeding the result to the upsampling stage.

Preferably, the steps S11 and S12 further include a stabilized training strategy: performing spectral normalization on the weight matrix of each layer in the first-level generative adversarial network and the second-level generative adversarial network respectively.
The invention also provides a character clothing conversion system, which comprises: a deformation network generator, a synthesis network generator, a deformation network discriminator and a synthesis network discriminator; wherein,

the deformation network generator and the deformation network discriminator form a first-level generative adversarial network for handling the deformation problem: according to the input sentence, applying the corresponding shape change to the original segmentation map of the original image and converting it into a target segmentation map;

the synthesis network generator and the synthesis network discriminator form a second-level generative adversarial network for handling the synthesis problem: taking the target segmentation map obtained by the first-level generative adversarial network as a semantic guidance and shape constraint condition and, together with the input sentence, training the generator to learn the multi-domain mapping from the original image to the target picture, so as to synthesize the target picture and complete the character clothing conversion;

further, the second-level generative adversarial network simultaneously fuses:

a soft attention layer to strengthen the relevance between the target picture and the input sentence;

a self-attention layer to explicitly capture long-range correlations on the image;

a stylized attention layer to establish the dependencies between features by channel-by-channel inner products and feature map recalibration.
Preferably, the deformation network discriminator and/or the synthesis network discriminator is a projection-based discriminator for calculating the matching loss between the image and the condition by an element-wise inner product; further,

the deformation network discriminator and/or the synthesis network discriminator judges different patches of a patch-processed image and finally takes the average score; and/or,

the deformation network generator adopts a least-squares function as the adversarial loss; further,

the deformation network generator adopts a cycle-consistency loss to ensure that the body shape, pose and identity of the character in the target segmentation map and the original segmentation map do not change; and/or,

the deformation network generator combines the target sentence representation vector with the segmentation map features obtained after the residual blocks and then feeds the result to the upsampling stage.
Preferably, the synthesis network generator comprises two encoding branches for extracting the features of the target segmentation map and of the original image respectively.
Preferably, the system further comprises: a stabilized training strategy module for stabilizing the training process of the first-level generative adversarial network and the second-level generative adversarial network; further,

the stabilized training strategy module performs spectral normalization on the weight matrix of each layer of the first-level generative adversarial network and the second-level generative adversarial network respectively; further,

the spectral norms in the stabilized training strategy module are approximately estimated by the power iteration method.
Preferably, the top of the convolutional layer stack of the synthesis network generator further comprises a matting layer to retain the head.

Preferably, the deformation network generator and/or the synthesis network generator comprises several noise layers for improving the diversity and randomness of generation, thereby suppressing overfitting and preventing mode collapse.
Compared with the prior art, the invention has the following advantages:
(1) The character clothing conversion method and system provided by the invention fuse a soft attention layer, a self-attention layer and a stylized attention layer. The soft attention layer strengthens the relevance between the generated image and the sentence, so that each position on the feature map can find the most relevant word in the sentence, effectively promoting fine-grained word-to-image synthesis. The self-attention layer compensates for the locality of traditional convolutional networks and can explicitly capture long-range correlations on the image, which not only supports fine-grained generation but also strengthens the overall harmony and consistency of the image. The stylized attention layer effectively promotes texture generation and fine coloring and improves the network's ability to make reasonable inference and imagination.

(2) In the character clothing conversion method and system provided by the invention, the target sentence representation vector is combined with the segmentation map features obtained after the residual blocks and then fed to the upsampling stage, instead of being combined with the segmentation map directly at the input. When the dimension of the sentence representation vector is much higher than the number of segmentation map channels, combining the two directly is unfavorable to feature learning and loses a great deal of original picture information; by combining the features and the sentence condition information in the middle stage of the network, the invention avoids these problems.

(3) The character clothing conversion method and system adopt a projection-based discriminator structure that calculates the matching loss between the image and the condition by an element-wise inner product, which can effectively identify the two error sources that existing methods confuse. Many existing methods design the discriminator by direct concatenation or with an auxiliary classifier, but both approaches have drawbacks: directly concatenating the image and the condition cannot help the discriminator explicitly distinguish the two different error sources of unrealism and mismatch, while adding an auxiliary classifier branch at the top of the discriminator may inadvertently lead the generator to produce pictures that are easy for the discriminator to classify; this phenomenon is especially obvious when the condition information has many dimensions.

(4) In the character clothing conversion method and system, the discriminators judge different patches of a patch-processed image and finally take the average score. Incorporating the idea of patch processing into the discriminator accelerates the convergence of the network and provides very effective generation guidance for the texture and style of the image.

(5) In the character clothing conversion method and system, a matting layer is added at the top of the convolutional layers of the synthesis network generator to retain the head, so that the attention of the synthesis network generator can be further focused on the clothing and body parts.

(6) In the character clothing conversion method and system, several noise layers are added to the deformation network generator and the synthesis network generator to improve the diversity and randomness of generation, thereby suppressing overfitting and preventing mode collapse.
Of course, it is not necessary for any one product to practice the invention to achieve all of the advantages set forth above at the same time.
Drawings
Embodiments of the present invention are further described below with reference to the accompanying drawings:
FIG. 1 is a flow chart of a method for transforming character apparel in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of a soft attention layer according to an embodiment of the present invention;
FIG. 3 is a comparison of the results obtained in an embodiment of the present invention with the effects of the prior art method.
Detailed Description
The following describes embodiments of the present invention in detail. The embodiments are implemented on the premise of the technical solution of the present invention, and detailed implementation modes and specific operation procedures are given, but the protection scope of the present invention is not limited to the following embodiments.

Fig. 1 is a flowchart of a character clothing conversion method according to an embodiment of the present invention; it is a character clothing conversion method based on semantic guidance and fused attention mechanisms.
Referring to fig. 1, the character clothing conversion method of the present embodiment includes the following steps:
S11: handling the deformation problem with a first-level generative adversarial network: according to the input sentence, applying the corresponding shape change to the original segmentation map of the original image and converting it into a target segmentation map;

S12: handling the synthesis problem with a second-level generative adversarial network: taking the target segmentation map obtained in step S11 as a semantic guidance and shape constraint condition and, together with the input sentence, training a generator to learn the multi-domain mapping from the original image to the target picture, so as to synthesize the target picture and complete the character clothing conversion;

further, the second-level generative adversarial network simultaneously fuses:

S121: a soft attention layer to strengthen the relevance between the target segmentation map and the input sentence, so that each position on the feature map can find the most relevant word in the sentence, effectively promoting fine-grained word-to-image synthesis;

S122: a self-attention layer to explicitly capture long-range correlations on the target segmentation map, compensating for the locality of traditional convolutional networks, providing support for fine-grained generation and strengthening the overall harmony and consistency of the image;

S123: a stylized attention layer to establish the dependencies between features by channel-by-channel inner products and feature map recalibration, effectively promoting texture generation and fine coloring and improving the network's ability to make reasonable inference and imagination.

In the first-level network, the original segmentation map of the original image is first shape-converted according to the input sentence description; the generated target segmentation map describes the rough outline of the desired target image and is fed to the next-level network. In the second-level network, not only is the input sentence description used as a condition, but the converted target segmentation map also serves as semantic guidance to help the network learn the conversion from the original image to the target image.
The detailed technical operation in each of the above steps is described below with reference to specific examples.
(1) The first-level generative adversarial network: the deformation network

Compared with directly generating the desired target picture, we first decompose the problem and apply the corresponding shape change to the segmentation map of the original image according to the input sentence. As shown in fig. 1, the sentence description first passes through a two-layer recurrent neural network based on long short-term memory (LSTM) cells to extract semantic codes. We use the hidden state of each time step of the LSTM as the representation vector of its corresponding word and combine these vectors to form the word embedding matrix. In addition, the last hidden state of the second LSTM layer serves as the representation vector of the entire sentence.
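A minimal sketch of such a sentence encoder is given below; the vocabulary size and the 128-dimensional hidden size are assumptions (128 is the sentence vector dimension mentioned later in this description):

import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)

    def forward(self, tokens):
        # tokens: (B, T) word indices of the input sentence description
        h_steps, (h_n, _) = self.lstm(self.embed(tokens))
        word_matrix = h_steps   # (B, T, H): per-word representation vectors
        sentence_vec = h_n[-1]  # (B, H): last hidden state of the second layer
        return word_matrix, sentence_vec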
Because the input sentence descriptions cover a wide variety of garment categories and styles, the deformation network is essentially an extensible multi-domain conversion model. Our goal is to train a generator to approximate a mapping function by which, conditioned on the target sentence representation vector, the original segmentation map can be converted into the target segmentation map.

During training, we can learn in an unsupervised manner using only the triplet data. The target sentence of each iteration is randomly selected from the training set, which improves the adaptability and robustness of the generator.
A standard encoding-decoding structure containing several residual blocks is used in the deformation network. In a preferred embodiment, unlike some prior approaches, the target sentence representation vector is combined with the segmentation map features obtained after the residual blocks and then fed to the upsampling stage, instead of combining the two directly at the input. This is because when the dimension of the sentence representation vector (128) is far higher than the number of segmentation map channels (1), directly combining the two is unfavorable to feature learning and causes a great loss of original picture information; the invention therefore improves on this and chooses to combine the features and the sentence condition information in the middle stage of the network.
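The mid-network conditioning described above might look as follows; the tiling-and-concatenation scheme is an assumed realization of "combining" the vector with the features:

import torch

def inject_sentence(features, sentence_vec):
    # features: (B, C, H, W) output of the residual blocks
    # sentence_vec: (B, 128) sentence representation vector
    B, _, H, W = features.shape
    tiled = sentence_vec.view(B, -1, 1, 1).expand(B, sentence_vec.size(1), H, W)
    return torch.cat([features, tiled], dim=1)  # (B, C + 128, H, W), fed to upsampling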
Although many existing methods use direct concatenation or auxiliary classifier approaches to design the discriminator, both approaches have drawbacks. Directly concatenating the image and the condition cannot help the discriminator explicitly distinguish the two different error sources of unrealism and mismatch, while adding an auxiliary classifier branch at the top of the discriminator may inadvertently lead the generator to produce pictures that are easy for the discriminator to classify; this phenomenon is especially obvious when the condition information has many dimensions. To solve these problems, the invention adopts a projection-based discriminator structure that effectively identifies the two errors by calculating the matching loss between the image and the condition with an element-wise inner product.
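For illustration, a projection discriminator head in the sense of Miyato and Koyama can be sketched as below; the decomposition into an unconditional realism score plus an inner-product matching score follows that design, and the layer sizes are assumptions:

import torch.nn as nn

class ProjectionHead(nn.Module):
    def __init__(self, feat_dim, cond_dim):
        super().__init__()
        self.linear = nn.Linear(feat_dim, 1)        # unconditional realism branch
        self.embed = nn.Linear(cond_dim, feat_dim)  # condition embedding

    def forward(self, feat, cond):
        # feat: (B, feat_dim) pooled discriminator features
        # cond: (B, cond_dim) condition, e.g. the sentence representation vector
        real_score = self.linear(feat)
        match_score = (self.embed(cond) * feat).sum(dim=1, keepdim=True)  # inner product
        return real_score + match_score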
In a preferred embodiment, the idea of patch processing is incorporated into the discriminator, so that the discriminator evaluates different patches of the image and finally takes the average score, which not only accelerates the convergence of the network but also provides very effective generation guidance for the texture and style of the image.
In a preferred embodiment, in order to make the generated target segmentation map highly realistic, a least-squares function is used as the adversarial loss, and a loss function that penalizes mismatches is designed. Since the multi-domain conversion task without paired data is inherently ill-posed, additional constraints need to be added to the network; in a preferred embodiment, a cycle-consistency loss is employed to ensure that the body shape, pose and identity of the character in the segmentation map do not change. The cycle-consistency loss can be expressed as an L1 loss between the original segmentation map and the reconstructed segmentation map.
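A hedged sketch of these two losses (least-squares adversarial terms plus an L1 cycle-consistency term; the weighting coefficient lambda_cyc is an assumption):

import torch.nn.functional as F

def lsgan_g_loss(d_fake):
    # generator pushes the discriminator's score on fakes toward 1
    return ((d_fake - 1.0) ** 2).mean()

def lsgan_d_loss(d_real, d_fake):
    # discriminator pushes real scores toward 1 and fake scores toward 0
    return ((d_real - 1.0) ** 2).mean() + (d_fake ** 2).mean()

def cycle_loss(seg_original, seg_reconstructed, lambda_cyc=10.0):
    # L1 between the original segmentation map and its reconstruction
    return lambda_cyc * F.l1_loss(seg_reconstructed, seg_original)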
In a preferred embodiment, the same generator is used for the whole generation and reconstruction process, which greatly reduces the number of parameters and the memory consumption.

(2) The second-level generative adversarial network: the synthesis network

The target segmentation map obtained from the first-level deformation network delineates the rough contour shape of the target picture. To generate the target picture, a generator is trained to learn a multi-domain mapping from the original image to the target picture, with the target segmentation map serving as semantic guidance and shape constraint together with the input sentence. In the second-level synthesis network, the three mechanisms of soft attention, self-attention and stylized attention are fused to promote fine-grained synthesis and the harmony and consistency of the whole image, and to strengthen the network's ability to make reasonable inference and imagination. The synthesis network has two separate encoding branches for extracting the features of the segmentation map and of the real picture respectively.

To address the insufficiency of existing methods for high-quality clothing generation, the three mechanisms of soft attention, self-attention and stylized attention are fused.
As shown in fig. 2, the soft attention layer receives two inputs, the word embedding matrix w and the feature map x, and the soft attention context feature map c_{soft} is obtained by computing the context vectors c_j and concatenating them:

s_{ji} = W_q(x_j)^{T} W_k(w_i)

\beta_{ji} = \frac{\exp(s_{ji})}{\sum_{i=1}^{T} \exp(s_{ji})}

c_j = \sum_{i=1}^{T} \beta_{ji} W_v(w_i)

c_{soft} = \mathrm{concat}(c_1, c_2, \ldots, c_j, \ldots, c_N)

where W_q, W_k, W_v are all convolution layer parameters and \beta is the attention weight.
The self-attention layer maps the feature map x into several feature spaces with convolution layers, computes the correlations between different sub-regions by inner products, expresses them as attention weights, and finally obtains the self-attention context feature map c_{self} by weighted summation.
In the stylized attention layer, the Gram matrix of the feature map is computed and normalized by a softmax function, and the features are then recalibrated by a weighted summation over all channels, finally obtaining the stylized attention context feature map c_{style}. The whole process can be expressed as:

G_{kl} = \sum_{j} x_{kj} x_{lj}

\alpha_{kl} = \frac{\exp(G_{kl})}{\sum_{l=1}^{C} \exp(G_{kl})}

f_k = \sum_{l=1}^{C} \alpha_{kl} x_l

c_{style} = \mathrm{concat}(f_1, f_2, \ldots, f_C)

where G is the Gram matrix, \alpha is the attention weight, x is the feature map, and f is the context vector.
In a preferred embodiment, since the head of the character is irrelevant information in the clothing conversion task, a matting layer is added at the top of the convolutional layers to preserve the head. The matting mask of the head-related portion can be expressed as the intersection of the segmentation map and the corresponding labels. Similarly, in a preferred embodiment, a background retention loss is introduced, passing the task of retaining the background to the generator for learning.
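One possible realization of the matting step is sketched below; the head-related class indices and the compositing formula are illustrative assumptions:

import torch

def keep_head(generated, original, seg_map, head_classes=(1, 2)):
    # generated, original: (B, 3, H, W); seg_map: (B, 1, H, W) integer class map
    # head_classes: assumed label indices for the face/hair regions
    mask = torch.zeros_like(generated[:, :1])
    for c in head_classes:
        mask = mask + (seg_map == c).to(mask.dtype)
    mask = mask.clamp(max=1.0)
    # composite: original head pixels kept over the generated body/clothing
    return mask * original + (1.0 - mask) * generated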
In a preferred embodiment, a stabilized training strategy is also added.

Because of the instability of generative adversarial networks, many existing face attribute conversion methods adopt the WGAN-GP strategy to stabilize training. However, in the character clothing conversion task the performance of WGAN-GP is not ideal, because in WGAN-GP the gradient penalty term is computed by sampling on the straight line between a real sample and a generated sample in input space and feeding the result to the discriminator, whereas person pictures essentially lie on a non-convex low-dimensional manifold of a high-dimensional space, so the samples thus drawn may already have left the manifold. Furthermore, WGAN-GP computes the gradient of the discriminator output with respect to its input in each iteration, which increases training time to a certain extent. Based on this analysis, this embodiment adopts spectral normalization, which has a smaller computational cost, in the two-level network to ensure that the discriminator satisfies the Lipschitz continuity condition.
Specifically, the weight matrix of each layer of the network is spectrally normalized, that is, divided by its spectral norm, so that the whole network satisfies the Lipschitz continuity condition. However, if a conventional singular value decomposition were used to solve the spectral norm of each layer at every iteration, the computation would be quite large; therefore, in a preferred embodiment, the power iteration method is used to approximately estimate the spectral norm. First, a vector u is randomly initialized for each weight matrix; provided the dominant singular value is not degenerate and u is not orthogonal to the first left singular vector, the power iteration update rule generates the first left and right singular vectors. If the network is optimized with stochastic gradient descent, the weight matrix changes little at each iteration and the maximum singular value also changes little, so in actual training u can be reused as the initial vector of the next step.
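A sketch of one power-iteration step is given below; in practice PyTorch's torch.nn.utils.spectral_norm wrapper performs this kind of update, reusing the persistent vector u across iterations:

import torch
import torch.nn.functional as F

def spectral_normalize(W, u, n_iter=1, eps=1e-12):
    # W: (out, in) weight matrix; u: (out,) persistent left-vector estimate
    for _ in range(n_iter):
        v = F.normalize(torch.mv(W.t(), u), dim=0, eps=eps)  # right singular direction
        u = F.normalize(torch.mv(W, v), dim=0, eps=eps)      # left singular direction
    sigma = torch.dot(u, torch.mv(W, v))                     # largest singular value estimate
    return W / sigma, u  # normalized weight and updated u for the next iteration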
In one embodiment, the code is implemented with PyTorch. In the training phase, the learning rates of the generator and the discriminator are both set to 0.0002, the Adam optimizer is used, and the batch size is set to 32 samples. The parameters of the synthesis network are fixed first and the deformation network is trained for a total of 15 epochs, with the learning rate linearly decayed to 0 over the last 5 epochs. Then the parameters of the deformation network are fixed and the synthesis network is trained for 20 epochs, again with the learning rate linearly decayed to 0 over the last 5 epochs.
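These hyperparameters could be wired up as in the following sketch; the Adam beta coefficients are assumptions not stated in this embodiment:

import torch

def make_optimizers(generator, discriminator, epochs=15, decay_last=5):
    g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

    def linear_decay(epoch):
        # full rate until (epochs - decay_last), then linear decay to 0
        return min(1.0, max(0.0, (epochs - epoch) / decay_last))

    g_sched = torch.optim.lr_scheduler.LambdaLR(g_opt, lr_lambda=linear_decay)
    d_sched = torch.optim.lr_scheduler.LambdaLR(d_opt, lr_lambda=linear_decay)
    return g_opt, d_opt, g_sched, d_sched  # call each scheduler's step() once per epoch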
The results of the character clothing conversion method of the above embodiments are evaluated below, with DeepFashion selected as the training and testing dataset and FashionGAN, the current state-of-the-art method, used for quantitative and qualitative comparison with the method of the above embodiments of the invention.

Regarding the quality evaluation index, the Fréchet Inception Distance (FID) is adopted because it better matches human perception in evaluating the realism and diversity of generated samples. A lower FID indicates that the generated samples are closer to the real samples, i.e., the generation quality is higher. For each model we randomly generate 5000 samples to calculate the FID. The final quantitative comparison results are shown in Table 1. The FID of the samples generated by the method of the invention is far smaller than that of FashionGAN, dropping from 35.18 to 30.54, indicating that the method of the invention achieves more advanced results on the DeepFashion dataset.
TABLE 1 Fréchet Inception Distance comparison of the embodiment of the invention with the existing method

Model                      Fréchet Inception Distance
FashionGAN                 35.18
Method of the invention    30.54
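For reference, the FID compares Gaussian statistics (mean mu and covariance sigma) of Inception activations of real and generated samples; a minimal NumPy sketch, assuming the activation statistics have already been extracted, is:

import numpy as np
from scipy.linalg import sqrtm

def fid(mu1, sigma1, mu2, sigma2):
    # mu: (2048,) means and sigma: (2048, 2048) covariances of Inception features
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from sqrtm
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))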
To show that the method of the invention not only generates high-quality pictures but also makes the pictures highly conform to the input sentence description, an attribute prediction experiment is carried out. Specifically, we use an R-CNN model as a predictor of character clothing attributes and fine-tune its parameters on the DeepFashion dataset. We select 5 attributes, namely "T-shirt", "long sleeve", "shorts", "jeans" and "trousers", and classify the generated samples of the different models with the predictor; the results are shown in Table 2. It can be seen that the method of the invention exceeds the FashionGAN model on all 5 attributes, indicating that the method of the invention can generate very realistic pictures with a high degree of harmony and consistency.
TABLE 2 Comparison of the attribute prediction results of the embodiment of the invention with the existing method

[Table 2 appears as an image in the original publication; it reports the per-attribute classification results for the five selected attributes, with the method of the invention exceeding the FashionGAN model on all five.]
For a qualitative comparison of generation quality, we choose the same original image and input sentence and observe the generated results of both methods. As shown in fig. 3, each column represents the same sentence description. It can be seen intuitively that the FashionGAN model learns a mapping from paired segmentation maps to real pictures rather than taking the original image as input, so it cannot keep the background information of the original image and the network can only learn a blank background. In contrast, our model is designed with a background retention loss and does not suffer from this problem; the background is retained intact. In addition, as is clear from fig. 3, the method of the invention generates the most natural and realistic person images, with very consistent colors and fine texture details. Although the FashionGAN model keeps the actions and identity of the person in the original image unchanged, its results lack sufficient texture detail and therefore lack a three-dimensional quality. The FashionGAN model also fails to produce true colors; for example, in the sixth image of the first row of fig. 3, the "green" in the sentence description is not reflected in the generated image, and large areas of artifacts appear instead. Moreover, the generated samples of the FashionGAN model look very similar, lack diversity and show signs of mode collapse. In contrast, the method of the invention is superior in texture and color detail and in diversity.
In an embodiment, the invention further provides a character clothing conversion system corresponding to the character clothing conversion method of the above embodiments, comprising: a deformation network generator, a synthesis network generator, a deformation network discriminator and a synthesis network discriminator. The deformation network generator and the deformation network discriminator form a first-level generative adversarial network for handling the deformation problem: according to the input sentence, applying the corresponding shape change to the original segmentation map of the original image and converting it into a target segmentation map. The synthesis network generator and the synthesis network discriminator form a second-level generative adversarial network for handling the synthesis problem: taking the target segmentation map obtained by the first-level generative adversarial network as a semantic guidance and shape constraint condition and, together with the input sentence, training the generator to learn the multi-domain mapping from the original image to the target picture, so as to synthesize the target picture and complete the character clothing conversion. Further, the second-level generative adversarial network fuses the three mechanisms of soft attention, self-attention and stylized attention:

(1) a soft attention layer to strengthen the relevance between the target picture and the input sentence;

(2) a self-attention layer to explicitly capture long-range correlations on the image;

(3) a stylized attention layer to establish the dependencies between features by channel-by-channel inner products and feature map recalibration.
In a preferred embodiment, the deformation network discriminator and the synthesis network discriminator are projection-based discriminators for calculating the matching loss between the image and the condition by an element-wise inner product. Further, the deformation network discriminator and the synthesis network discriminator judge different patches of a patch-processed image and finally take the average score, which accelerates the convergence of the network and provides very effective generation guidance for the texture and style of the image.

In a preferred embodiment, the deformation network generator uses a least-squares function as the adversarial loss; further, the deformation network generator adopts a cycle-consistency loss to ensure that the body shape, pose and identity of the character in the target segmentation map and the original segmentation map do not change.

In a preferred embodiment, the deformation network generator combines the target sentence representation vector with the segmentation map features obtained after the residual blocks and then feeds the result to the upsampling stage.
In a preferred embodiment, the synthesis network generator comprises two encoding branches for extracting the features of the target segmentation map and of the original image respectively.

In a preferred embodiment, the system further comprises a stabilized training strategy module for stabilizing the training process of the first-level and second-level generative adversarial networks. Further, the stabilized training strategy module performs spectral normalization on the weight matrix of each layer of the first-level and second-level generative adversarial networks respectively. Further, the spectral norms in the stabilized training strategy module are approximately estimated with the power iteration method.

In a preferred embodiment, to further focus the attention of the synthesis network generator on the clothing and body parts, the top of the convolutional layer stack of the synthesis network generator also includes a matting layer to preserve the head. The matting mask of the head-related portion can be expressed as the intersection of the segmentation map and the corresponding labels. Similarly, a background retention loss can be introduced, passing the task of retaining the background to the generator for learning.
In a preferred embodiment, an encoding-bottleneck-decoding architecture is adopted for all generators and the reconstruction network. The deformation network generator and the reconstruction network comprise 2 stride-2 convolutional layers for downsampling, 6 residual blocks and 2 deconvolution layers for upsampling. To enhance the synthesis capability of the second-level network, we add 1 convolutional layer, 3 residual blocks and 1 deconvolution layer to the generator. Instance normalization layers are adopted in all generators to learn the individual characteristics of samples, and several noise layers are added to improve the diversity and randomness of generation, thereby suppressing overfitting and preventing mode collapse.
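A hedged sketch of the deformation-network generator skeleton described in this paragraph (2 stride-2 convolutions, 6 residual blocks, 2 deconvolutions, instance normalization) follows; the base channel width of 64, the kernel sizes, and the 1-channel segmentation input are assumptions:

import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

class WarpGenerator(nn.Module):
    def __init__(self, in_ch=1, base=64):
        super().__init__()
        layers = [  # 2 stride-2 convolutional layers for downsampling
            nn.Conv2d(in_ch, base, 4, stride=2, padding=1), nn.ReLU(True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.ReLU(True)]
        layers += [ResBlock(base * 2) for _ in range(6)]  # 6 residual blocks
        layers += [  # 2 deconvolution layers for upsampling
            nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1), nn.ReLU(True),
            nn.ConvTranspose2d(base, in_ch, 4, stride=2, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, seg):
        return self.net(seg)  # deformed segmentation map logits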
The embodiments disclosed herein were chosen and described in detail in order to best explain the principles and practical application of the invention, not to limit it. Any modifications or variations within the scope of the description that would be apparent to a person skilled in the art are intended to be included within the scope of the invention.

Claims (10)

1. A method of character apparel conversion comprising:
S11: handling the deformation problem with a first-level generative adversarial network: according to the input sentence, applying the corresponding shape change to the original segmentation map of the original image and converting it into a target segmentation map;

S12: handling the synthesis problem with a second-level generative adversarial network: taking the target segmentation map obtained in step S11 as a semantic guidance and shape constraint condition and, together with the input sentence, training a generator to learn the multi-domain mapping from the original image to the target picture, so as to synthesize the target picture and complete the character clothing conversion;

the second-level generative adversarial network simultaneously fuses:

S121: a soft attention layer to strengthen the relevance between the target segmentation map and the input sentence;

S122: a self-attention layer to explicitly capture long-range correlations on the target segmentation map;

S123: a stylized attention layer to establish the dependencies between features by channel-by-channel inner products and feature map recalibration.
2. The character clothing conversion method according to claim 1, wherein S121 further comprises: the soft attention layer receives two inputs, the word embedding matrix w and the original feature map x, and the soft attention context feature map c_{soft} is obtained by computing the context vectors c_j and concatenating them:

s_{ji} = W_q(x_j)^{T} W_k(w_i)

\beta_{ji} = \frac{\exp(s_{ji})}{\sum_{i=1}^{T} \exp(s_{ji})}

c_j = \sum_{i=1}^{T} \beta_{ji} W_v(w_i)

c_{soft} = \mathrm{concat}(c_1, c_2, \ldots, c_j, \ldots, c_N)

wherein W_q, W_k, W_v are all convolution layer parameters and \beta is the attention weight.
3. The character clothing conversion method according to claim 2, wherein S122 further comprises: mapping the original feature map x into several feature spaces with convolution layers, computing the correlations between different sub-regions by inner products, expressing those correlations as attention weights, and finally obtaining the self-attention context feature map c_{self} by weighted summation.
4. The character clothing conversion method according to claim 1, wherein S123 further comprises: computing the Gram matrix of the original feature map x, normalizing it with a softmax function, and then recalibrating the features by a weighted summation over all channels, finally obtaining the stylized attention context feature map c_{style}:

G_{kl} = \sum_{j} x_{kj} x_{lj}

\alpha_{kl} = \frac{\exp(G_{kl})}{\sum_{l=1}^{C} \exp(G_{kl})}

f_k = \sum_{l=1}^{C} \alpha_{kl} x_l

c_{style} = \mathrm{concat}(f_1, f_2, \ldots, f_C)

wherein G is the Gram matrix, \alpha is the attention weight, x is the feature map, and f is the context vector.
5. The character clothing conversion method according to claim 1, wherein S11 further comprises:

training a generator to approximate a mapping function, and converting the original segmentation map into the target segmentation map through this mapping function, conditioned on the target sentence representation vector; further,

the training process of S11 includes: combining the target sentence representation vector with the segmentation map features obtained after the residual blocks, and then feeding the result to the upsampling stage.

6. The character clothing conversion method according to any one of claims 1 to 5, wherein the steps S11 and S12 further include a stabilized training strategy: performing spectral normalization on the weight matrix of each layer in the first-level generative adversarial network and the second-level generative adversarial network respectively.
7. A character clothing conversion system, comprising: a deformation network generator, a synthesis network generator, a deformation network discriminator and a synthesis network discriminator; wherein,

the deformation network generator and the deformation network discriminator form a first-level generative adversarial network for handling the deformation problem: according to the input sentence, applying the corresponding shape change to the original segmentation map of the original image and converting it into a target segmentation map;

the synthesis network generator and the synthesis network discriminator form a second-level generative adversarial network for handling the synthesis problem: taking the target segmentation map obtained by the first-level generative adversarial network as a semantic guidance and shape constraint condition and, together with the input sentence, training the generator to learn the multi-domain mapping from the original image to the target picture, so as to synthesize the target picture and complete the character clothing conversion;

further, the second-level generative adversarial network simultaneously fuses:

a soft attention layer to strengthen the relevance between the target picture and the input sentence;

a self-attention layer to explicitly capture long-range correlations on the image;

a stylized attention layer to establish the dependencies between features by channel-by-channel inner products and feature map recalibration.
8. The character clothing conversion system of claim 7, wherein the deformation network discriminator and/or the synthesis network discriminator is a projection-based discriminator for calculating the matching loss between the image and the condition by an element-wise inner product;

the deformation network discriminator and/or the synthesis network discriminator judges different patches of a patch-processed image and finally takes the average score; and/or,

the deformation network generator adopts a least-squares function as the adversarial loss; the deformation network generator adopts a cycle-consistency loss to ensure that the body shape, pose and identity of the character in the target segmentation map and the original segmentation map do not change; further,

the deformation network generator combines the target sentence representation vector with the segmentation map features obtained after the residual blocks and then feeds the result to the upsampling stage; and/or,

the synthesis network generator comprises two encoding branches for extracting the features of the target segmentation map and of the original image respectively.
9. The character clothing conversion system of claim 7, further comprising: a stabilized training strategy module for stabilizing the training process of the first-level generative adversarial network and the second-level generative adversarial network;

the stabilized training strategy module performs spectral normalization on the weight matrix of each layer of the first-level generative adversarial network and the second-level generative adversarial network respectively;

the spectral norms in the stabilized training strategy module are approximately estimated by the power iteration method.
10. The character clothing conversion system of any one of claims 7 to 9, wherein the top of the convolutional layer stack of the synthesis network generator further comprises a matting layer to preserve the head;

the deformation network generator and/or the synthesis network generator comprises several noise layers for increasing the diversity and randomness of generation.
CN202010143086.9A 2020-03-04 2020-03-04 Character clothing conversion method and system Active CN111476241B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010143086.9A CN111476241B (en) 2020-03-04 2020-03-04 Character clothing conversion method and system


Publications (2)

Publication Number Publication Date
CN111476241A CN111476241A (en) 2020-07-31
CN111476241B true CN111476241B (en) 2023-04-21

Family

ID=71748048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010143086.9A Active CN111476241B (en) 2020-03-04 2020-03-04 Character clothing conversion method and system

Country Status (1)

Country Link
CN (1) CN111476241B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112598806A (en) * 2020-12-28 2021-04-02 深延科技(北京)有限公司 Virtual fitting method and device based on artificial intelligence, computer equipment and medium
CN112967293B (en) * 2021-03-04 2024-07-12 江苏中科重德智能科技有限公司 Image semantic segmentation method, device and storage medium
CN113393550B (en) * 2021-06-15 2022-09-20 杭州电子科技大学 Fashion garment design synthesis method guided by postures and textures
CN114862666B (en) * 2022-06-22 2022-10-04 阿里巴巴达摩院(杭州)科技有限公司 Image conversion system, method, storage medium and electronic device


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020029356A1 (en) * 2018-08-08 2020-02-13 杰创智能科技股份有限公司 Method employing generative adversarial network for predicting face change
CN110021051A (en) * 2019-04-01 2019-07-16 浙江大学 One kind passing through text Conrad object image generation method based on confrontation network is generated
CN110211196A (en) * 2019-05-28 2019-09-06 山东大学 A kind of virtually trying method and device based on posture guidance
CN110675353A (en) * 2019-08-31 2020-01-10 电子科技大学 Selective segmentation image synthesis method based on conditional generation countermeasure network
CN110659958A (en) * 2019-09-06 2020-01-07 电子科技大学 Clothing matching generation method based on generation of countermeasure network
CN110706300A (en) * 2019-09-19 2020-01-17 腾讯科技(深圳)有限公司 Virtual image generation method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Huang Yongcheng, Song Li, Xie Rong. A fast VP9 superblock partitioning algorithm based on deep residual networks. Video Engineering (《电视技术》), 2019, vol. 43, no. 43, pp. 10-14. *

Also Published As

Publication number Publication date
CN111476241A (en) 2020-07-31


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant