CN111476241B - Character clothing conversion method and system - Google Patents

Character clothing conversion method and system

Info

Publication number
CN111476241B
Authority
CN
China
Prior art keywords
network
map
original
target
character
Prior art date
Legal status
Active
Application number
CN202010143086.9A
Other languages
Chinese (zh)
Other versions
CN111476241A (en)
Inventor
宋利
张义诚
解蓉
张文军
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202010143086.9A priority Critical patent/CN111476241B/en
Publication of CN111476241A publication Critical patent/CN111476241A/en
Application granted granted Critical
Publication of CN111476241B publication Critical patent/CN111476241B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/60 Editing figures and text; Combining figures or text
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a character clothing conversion method and a character clothing conversion system. The method comprises the following steps: according to an input sentence, applying the corresponding shape change to the original segmentation map of the original image and converting it into a target segmentation map; then handling the synthesis problem with a second-level generative adversarial network: taking the target segmentation map as a semantic guidance and shape constraint condition and, together with the input sentence, training a generator to learn the multi-domain mapping from the original image to the target picture, so as to synthesize the target picture and complete the character clothing conversion. The second-level generative adversarial network simultaneously fuses three attention layers: a soft attention layer to strengthen the relevance between the target picture and the input sentence; a self-attention layer to explicitly capture long-range correlations on the image; and a stylized attention layer to establish the dependencies between features by channel-by-channel inner products and feature map recalibration. By fusing the three attention layers, the character clothing conversion method and system achieve high-quality clothing generation.

Description

Character clothing conversion method and system
Technical Field
The invention relates to the technical field of computer vision, in particular to a character clothing conversion method and system.
Background
Character clothing conversion is a very challenging task in the field of computer vision. It aims to convert the clothing of a person in an original image according to an input text description while keeping the person's pose, identity, body shape and other information unchanged. The task has quite wide application and can be extended to various emerging scenarios such as photo editing, movie production and virtual fitting. Although generative adversarial networks have achieved excellent performance in recent years on domain migration tasks such as face attribute conversion and makeup transfer, there is still great room for improvement on the character clothing conversion task.
The challenge of the clothing conversion task lies first in the high difficulty of the task itself. The core problems are as follows. First, the input sentence descriptions cover a wide variety of clothing categories and styles with different shapes, such as short-sleeved shirts, sleeveless dresses and long-sleeved jackets, which causes significant shape changes during clothing conversion. Second, character clothing involves rich texture and color information but, unlike face pictures, contains no shared structure such as skin color and facial features, so a finer-grained generation method is needed to achieve high-quality clothing conversion.
Second, existing methods struggle to meet the requirements of high-quality clothing generation. Conventional character clothing conversion methods still adopt a traditional fully convolutional generator, a network structure with very limited ability to capture long-range correlations that cannot satisfy high-quality generation. Furthermore, existing methods train the network with the overall representation of the input sentence as the conditional information and do not exploit semantic information down to the word level, which is insufficient to support fine-grained texture and color generation. In addition, character clothing conversion may require extensive inference and imagination from the network; for example, when converting from long sleeves to short sleeves, the network must generate new arm regions. How to generate information absent from the original image is therefore a major problem of this task, yet existing methods have not explored it deeply enough.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a character clothing conversion method and a character clothing conversion system that fuse three attention layers and achieve high-quality clothing generation.
In order to solve the technical problems, the invention is realized by the following technical scheme:
The invention provides a character clothing conversion method, which comprises the following steps:

S11: handling the deformation problem with a first-level generative adversarial network: according to the input sentence, applying the corresponding shape change to the original segmentation map of the original image and converting it into a target segmentation map;

S12: handling the synthesis problem with a second-level generative adversarial network: taking the target segmentation map obtained in step S11 as a semantic guidance and shape constraint condition and, together with the input sentence, training a generator to learn the multi-domain mapping from the original image to the target picture, so as to synthesize the target picture and complete the character clothing conversion;

further, the second-level generative adversarial network simultaneously fuses:

S121: a soft attention layer to strengthen the relevance between the target segmentation map and the input sentence;

S122: a self-attention layer to explicitly capture long-range correlations on the target segmentation map;

S123: a stylized attention layer to establish the dependencies between features by channel-by-channel inner products and feature map recalibration.
Preferably, the step S121 further includes: the soft attention layer receives two inputs, the word embedding matrix w and the original feature map x, and the soft attention context feature map c_{soft} is obtained by computing the context vectors c_j and concatenating them:

s_{ji} = W_q(x_j)^{T} W_k(w_i)

\beta_{ji} = \frac{\exp(s_{ji})}{\sum_{i=1}^{T} \exp(s_{ji})}

c_j = \sum_{i=1}^{T} \beta_{ji} W_v(w_i)

c_{soft} = \mathrm{concat}(c_1, c_2, \ldots, c_j, \ldots, c_N)

where W_q, W_k, W_v are all convolution layer parameters, \beta is the attention weight, T is the number of words, and N is the number of spatial positions on the feature map.
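As a concrete illustration, the soft attention layer above can be sketched in PyTorch (an embodiment below states the implementation uses PyTorch). The 1x1-convolution parameterization of W_q, W_k and W_v, the tensor shapes, and the flattening of the feature map are illustrative assumptions, not the patented implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    def __init__(self, feat_dim, word_dim):
        super().__init__()
        self.W_q = nn.Conv1d(feat_dim, feat_dim, 1)  # maps image features (queries)
        self.W_k = nn.Conv1d(word_dim, feat_dim, 1)  # maps word embeddings (keys)
        self.W_v = nn.Conv1d(word_dim, feat_dim, 1)  # maps word embeddings (values)

    def forward(self, x, w):
        # x: (B, C, N) feature map flattened over its N spatial positions
        # w: (B, D, T) word embedding matrix for T words
        q = self.W_q(x)                         # (B, C, N)
        k = self.W_k(w)                         # (B, C, T)
        v = self.W_v(w)                         # (B, C, T)
        s = torch.bmm(q.transpose(1, 2), k)     # (B, N, T): s_ji = W_q(x_j)^T W_k(w_i)
        beta = F.softmax(s, dim=-1)             # attention weights over words
        c = torch.bmm(v, beta.transpose(1, 2))  # (B, C, N): c_j = sum_i beta_ji W_v(w_i)
        return c                                # c_soft: context vectors at all N positions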
Preferably, the step S122 further includes: mapping the original feature map x into several feature spaces with convolution layers, computing the correlations between different sub-regions by inner products, expressing those correlations as attention weights, and finally obtaining the self-attention context feature map c_{self} by weighted summation.
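A minimal sketch of such a self-attention layer follows (in the style of SAGAN self-attention; the channel reduction factor of 8 and the learned residual scale gamma are assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, in_dim):
        super().__init__()
        self.query = nn.Conv2d(in_dim, in_dim // 8, 1)  # feature-space projections
        self.key = nn.Conv2d(in_dim, in_dim // 8, 1)
        self.value = nn.Conv2d(in_dim, in_dim, 1)
        self.gamma = nn.Parameter(torch.zeros(1))       # learned residual scale

    def forward(self, x):
        B, C, H, W = x.shape
        q = self.query(x).view(B, -1, H * W)         # (B, C', N)
        k = self.key(x).view(B, -1, H * W)           # (B, C', N)
        v = self.value(x).view(B, -1, H * W)         # (B, C, N)
        s = torch.bmm(q.transpose(1, 2), k)          # (B, N, N) inner-product correlations
        attn = F.softmax(s, dim=-1)                  # attention weights between sub-regions
        c_self = torch.bmm(v, attn.transpose(1, 2))  # weighted summation
        return self.gamma * c_self.view(B, C, H, W) + x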
Preferably, the step S123 further includes: computing the Gram matrix of the original feature map x, normalizing it with a softmax function, and then recalibrating the features by a weighted summation over all channels, finally obtaining the stylized attention context feature map c_{style}:

G_{kl} = \sum_{j} x_{kj} x_{lj}

\alpha_{kl} = \frac{\exp(G_{kl})}{\sum_{l=1}^{C} \exp(G_{kl})}

f_k = \sum_{l=1}^{C} \alpha_{kl} x_l

c_{style} = \mathrm{concat}(f_1, f_2, \ldots, f_C)

where G is the Gram matrix, \alpha is the attention weight, x is the feature map, f is the context vector, and C is the number of channels.
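A hedged PyTorch sketch of this stylized attention computation (the batch x channels x height x width shape convention is an assumption):

import torch
import torch.nn as nn
import torch.nn.functional as F

class StylizedAttention(nn.Module):
    def forward(self, x):
        B, C, H, W = x.shape
        flat = x.view(B, C, -1)                       # (B, C, N)
        gram = torch.bmm(flat, flat.transpose(1, 2))  # (B, C, C): G_kl = <x_k, x_l>
        alpha = F.softmax(gram, dim=-1)               # normalize the Gram matrix
        f = torch.bmm(alpha, flat)                    # (B, C, N): f_k = sum_l alpha_kl x_l
        return f.view(B, C, H, W)                     # c_style after channel recalibration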
Preferably, the step S11 further includes:

training a generator to approximate a mapping function, and converting the original segmentation map into the target segmentation map through this mapping function, conditioned on the target sentence representation vector; further,

the training process of S11 includes: combining the target sentence representation vector with the segmentation map features obtained after the residual blocks, and then feeding the result to the upsampling stage.

Preferably, the steps S11 and S12 further include a stabilized training strategy: performing spectral normalization on the weight matrix of each layer in the first-level generative adversarial network and the second-level generative adversarial network respectively.
The invention also provides a character clothing conversion system, which comprises: a deformation network generator, a synthesis network generator, a deformation network discriminator and a synthesis network discriminator; wherein,

the deformation network generator and the deformation network discriminator form a first-level generative adversarial network for handling the deformation problem: according to the input sentence, applying the corresponding shape change to the original segmentation map of the original image and converting it into a target segmentation map;

the synthesis network generator and the synthesis network discriminator form a second-level generative adversarial network for handling the synthesis problem: taking the target segmentation map obtained by the first-level generative adversarial network as a semantic guidance and shape constraint condition and, together with the input sentence, training the generator to learn the multi-domain mapping from the original image to the target picture, so as to synthesize the target picture and complete the character clothing conversion;

further, the second-level generative adversarial network simultaneously fuses:

a soft attention layer to strengthen the relevance between the target picture and the input sentence;

a self-attention layer to explicitly capture long-range correlations on the image;

a stylized attention layer to establish the dependencies between features by channel-by-channel inner products and feature map recalibration.
Preferably, the deformation network discriminator and/or the synthesis network discriminator is a projection-based discriminator for calculating the matching loss between the image and the condition by an element-wise inner product; further,

the deformation network discriminator and/or the synthesis network discriminator judges different patches of a patch-processed image and finally takes the average score; and/or,

the deformation network generator adopts a least-squares function as the adversarial loss; further,

the deformation network generator adopts a cycle-consistency loss to ensure that the body shape, pose and identity of the character in the target segmentation map and the original segmentation map do not change; and/or,

the deformation network generator combines the target sentence representation vector with the segmentation map features obtained after the residual blocks and then feeds the result to the upsampling stage.
Preferably, the synthesis network generator comprises two encoding branches for extracting the features of the target segmentation map and of the original image respectively.
Preferably, the system further comprises: a stabilized training strategy module for stabilizing the training process of the first-level generative adversarial network and the second-level generative adversarial network; further,

the stabilized training strategy module performs spectral normalization on the weight matrix of each layer of the first-level generative adversarial network and the second-level generative adversarial network respectively; further,

the spectral norms in the stabilized training strategy module are approximately estimated by the power iteration method.
Preferably, the top of the convolutional layer stack of the synthesis network generator further comprises a matting layer to retain the head.

Preferably, the deformation network generator and/or the synthesis network generator comprises several noise layers for improving the diversity and randomness of generation, thereby suppressing overfitting and preventing mode collapse.
Compared with the prior art, the invention has the following advantages:
(1) The character clothing conversion method and system provided by the invention fuse a soft attention layer, a self-attention layer and a stylized attention layer. The soft attention layer strengthens the relevance between the generated image and the sentence, so that each position on the feature map can find the most relevant word in the sentence, effectively promoting fine-grained word-to-image synthesis. The self-attention layer compensates for the locality of traditional convolutional networks and can explicitly capture long-range correlations on the image, which not only supports fine-grained generation but also strengthens the overall harmony and consistency of the image. The stylized attention layer effectively promotes texture generation and fine coloring and improves the network's ability to make reasonable inference and imagination.

(2) In the character clothing conversion method and system provided by the invention, the target sentence representation vector is combined with the segmentation map features obtained after the residual blocks and then fed to the upsampling stage, instead of being combined with the segmentation map directly at the input. When the dimension of the sentence representation vector is much higher than the number of segmentation map channels, combining the two directly is unfavorable to feature learning and loses a great deal of original picture information; by combining the features and the sentence condition information in the middle stage of the network, the invention avoids these problems.

(3) The character clothing conversion method and system adopt a projection-based discriminator structure that calculates the matching loss between the image and the condition by an element-wise inner product, which can effectively identify the two error sources that existing methods confuse. Many existing methods design the discriminator by direct concatenation or with an auxiliary classifier, but both approaches have drawbacks: directly concatenating the image and the condition cannot help the discriminator explicitly distinguish the two different error sources of unrealism and mismatch, while adding an auxiliary classifier branch at the top of the discriminator may inadvertently lead the generator to produce pictures that are easy for the discriminator to classify; this phenomenon is especially obvious when the condition information has many dimensions.

(4) In the character clothing conversion method and system, the discriminators judge different patches of a patch-processed image and finally take the average score. Incorporating the idea of patch processing into the discriminator accelerates the convergence of the network and provides very effective generation guidance for the texture and style of the image.

(5) In the character clothing conversion method and system, a matting layer is added at the top of the convolutional layers of the synthesis network generator to retain the head, so that the attention of the synthesis network generator can be further focused on the clothing and body parts.

(6) In the character clothing conversion method and system, several noise layers are added to the deformation network generator and the synthesis network generator to improve the diversity and randomness of generation, thereby suppressing overfitting and preventing mode collapse.
Of course, it is not necessary for any one product to practice the invention to achieve all of the advantages set forth above at the same time.
Drawings
Embodiments of the present invention are further described below with reference to the accompanying drawings:
FIG. 1 is a flow chart of a method for transforming character apparel in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of a soft attention layer according to an embodiment of the present invention;
FIG. 3 is a comparison of the results obtained in an embodiment of the present invention with the effects of the prior art method.
Detailed Description
The following describes embodiments of the present invention in detail. The embodiments are implemented on the premise of the technical solution of the present invention, and detailed implementation modes and specific operation procedures are given, but the protection scope of the present invention is not limited to the following embodiments.

Fig. 1 is a flowchart of a character clothing conversion method according to an embodiment of the present invention; it is a character clothing conversion method based on semantic guidance and fused attention mechanisms.
Referring to fig. 1, the character clothing conversion method of the present embodiment includes the following steps:
S11: handling the deformation problem with a first-level generative adversarial network: according to the input sentence, applying the corresponding shape change to the original segmentation map of the original image and converting it into a target segmentation map;

S12: handling the synthesis problem with a second-level generative adversarial network: taking the target segmentation map obtained in step S11 as a semantic guidance and shape constraint condition and, together with the input sentence, training a generator to learn the multi-domain mapping from the original image to the target picture, so as to synthesize the target picture and complete the character clothing conversion;

further, the second-level generative adversarial network simultaneously fuses:

S121: a soft attention layer to strengthen the relevance between the target segmentation map and the input sentence, so that each position on the feature map can find the most relevant word in the sentence, effectively promoting fine-grained word-to-image synthesis;

S122: a self-attention layer to explicitly capture long-range correlations on the target segmentation map, compensating for the locality of traditional convolutional networks, providing support for fine-grained generation and strengthening the overall harmony and consistency of the image;

S123: a stylized attention layer to establish the dependencies between features by channel-by-channel inner products and feature map recalibration, effectively promoting texture generation and fine coloring and improving the network's ability to make reasonable inference and imagination.

In the first-level network, the original segmentation map of the original image is first shape-converted according to the input sentence description; the generated target segmentation map describes the rough outline of the desired target image and is fed to the next-level network. In the second-level network, not only is the input sentence description used as a condition, but the converted target segmentation map also serves as semantic guidance to help the network learn the conversion from the original image to the target image.
The detailed technical operation in each of the above steps is described below with reference to specific examples.
(1) The first-level generative adversarial network: the deformation network

Compared with directly generating the desired target picture, we first decompose the problem and apply the corresponding shape change to the segmentation map of the original image according to the input sentence. As shown in fig. 1, the sentence description first passes through a two-layer recurrent neural network based on long short-term memory (LSTM) cells to extract semantic codes. We use the hidden state of each time step of the LSTM as the representation vector of its corresponding word and combine these vectors to form the word embedding matrix. In addition, the last hidden state of the second LSTM layer serves as the representation vector of the entire sentence.
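A minimal sketch of such a sentence encoder is given below; the vocabulary size and the 128-dimensional hidden size are assumptions (128 is the sentence vector dimension mentioned later in this description):

import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)

    def forward(self, tokens):
        # tokens: (B, T) word indices of the input sentence description
        h_steps, (h_n, _) = self.lstm(self.embed(tokens))
        word_matrix = h_steps   # (B, T, H): per-word representation vectors
        sentence_vec = h_n[-1]  # (B, H): last hidden state of the second layer
        return word_matrix, sentence_vec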
Because the input sentence descriptions cover a wide variety of garment categories and styles, the deformation network is essentially an extensible multi-domain conversion model. Our goal is to train a generator to approximate a mapping function by which, conditioned on the target sentence representation vector, the original segmentation map can be converted into the target segmentation map.

During training, we can learn in an unsupervised manner using only the triplet data. The target sentence of each iteration is randomly selected from the training set, which improves the adaptability and robustness of the generator.
A standard encoding-decoding structure containing several residual blocks is used in the deformation network. In a preferred embodiment, unlike some prior approaches, the target sentence representation vector is combined with the segmentation map features obtained after the residual blocks and then fed to the upsampling stage, instead of combining the two directly at the input. This is because when the dimension of the sentence representation vector (128) is far higher than the number of segmentation map channels (1), directly combining the two is unfavorable to feature learning and causes a great loss of original picture information; the invention therefore improves on this and chooses to combine the features and the sentence condition information in the middle stage of the network.
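The mid-network conditioning described above might look as follows; the tiling-and-concatenation scheme is an assumed realization of "combining" the vector with the features:

import torch

def inject_sentence(features, sentence_vec):
    # features: (B, C, H, W) output of the residual blocks
    # sentence_vec: (B, 128) sentence representation vector
    B, _, H, W = features.shape
    tiled = sentence_vec.view(B, -1, 1, 1).expand(B, sentence_vec.size(1), H, W)
    return torch.cat([features, tiled], dim=1)  # (B, C + 128, H, W), fed to upsampling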
Although many existing methods use direct concatenation or auxiliary classifier approaches to design the discriminator, both approaches have drawbacks. Directly concatenating the image and the condition cannot help the discriminator explicitly distinguish the two different error sources of unrealism and mismatch, while adding an auxiliary classifier branch at the top of the discriminator may inadvertently lead the generator to produce pictures that are easy for the discriminator to classify; this phenomenon is especially obvious when the condition information has many dimensions. To solve these problems, the invention adopts a projection-based discriminator structure that effectively identifies the two errors by calculating the matching loss between the image and the condition with an element-wise inner product.
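For illustration, a projection discriminator head in the sense of Miyato and Koyama can be sketched as below; the decomposition into an unconditional realism score plus an inner-product matching score follows that design, and the layer sizes are assumptions:

import torch.nn as nn

class ProjectionHead(nn.Module):
    def __init__(self, feat_dim, cond_dim):
        super().__init__()
        self.linear = nn.Linear(feat_dim, 1)        # unconditional realism branch
        self.embed = nn.Linear(cond_dim, feat_dim)  # condition embedding

    def forward(self, feat, cond):
        # feat: (B, feat_dim) pooled discriminator features
        # cond: (B, cond_dim) condition, e.g. the sentence representation vector
        real_score = self.linear(feat)
        match_score = (self.embed(cond) * feat).sum(dim=1, keepdim=True)  # inner product
        return real_score + match_score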
In a preferred embodiment, the idea of patch processing is incorporated into the discriminator, so that the discriminator evaluates different patches of the image and finally takes the average score, which not only accelerates the convergence of the network but also provides very effective generation guidance for the texture and style of the image.
In a preferred embodiment, in order to make the generated target segmentation map highly realistic, a least-squares function is used as the adversarial loss, and a loss function that penalizes mismatches is designed. Since the multi-domain conversion task without paired data is inherently ill-posed, additional constraints need to be added to the network; in a preferred embodiment, a cycle-consistency loss is employed to ensure that the body shape, pose and identity of the character in the segmentation map do not change. The cycle-consistency loss can be expressed as an L1 loss between the original segmentation map and the reconstructed segmentation map.
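A hedged sketch of these two losses (least-squares adversarial terms plus an L1 cycle-consistency term; the weighting coefficient lambda_cyc is an assumption):

import torch.nn.functional as F

def lsgan_g_loss(d_fake):
    # generator pushes the discriminator's score on fakes toward 1
    return ((d_fake - 1.0) ** 2).mean()

def lsgan_d_loss(d_real, d_fake):
    # discriminator pushes real scores toward 1 and fake scores toward 0
    return ((d_real - 1.0) ** 2).mean() + (d_fake ** 2).mean()

def cycle_loss(seg_original, seg_reconstructed, lambda_cyc=10.0):
    # L1 between the original segmentation map and its reconstruction
    return lambda_cyc * F.l1_loss(seg_reconstructed, seg_original)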
In a preferred embodiment, the same generator is used for the whole generation and reconstruction process, which greatly reduces the number of parameters and the memory consumption.

(2) The second-level generative adversarial network: the synthesis network

The target segmentation map obtained from the first-level deformation network delineates the rough contour shape of the target picture. To generate the target picture, a generator is trained to learn a multi-domain mapping from the original image to the target picture, with the target segmentation map serving as semantic guidance and shape constraint together with the input sentence. In the second-level synthesis network, the three mechanisms of soft attention, self-attention and stylized attention are fused to promote fine-grained synthesis and the harmony and consistency of the whole image, and to strengthen the network's ability to make reasonable inference and imagination. The synthesis network has two separate encoding branches for extracting the features of the segmentation map and of the real picture respectively.

To address the insufficiency of existing methods for high-quality clothing generation, the three mechanisms of soft attention, self-attention and stylized attention are fused.
As shown in fig. 2, the soft attention layer receives two inputs, the word embedding matrix w and the feature map x, and the soft attention context feature map c_{soft} is obtained by computing the context vectors c_j and concatenating them:

s_{ji} = W_q(x_j)^{T} W_k(w_i)

\beta_{ji} = \frac{\exp(s_{ji})}{\sum_{i=1}^{T} \exp(s_{ji})}

c_j = \sum_{i=1}^{T} \beta_{ji} W_v(w_i)

c_{soft} = \mathrm{concat}(c_1, c_2, \ldots, c_j, \ldots, c_N)

where W_q, W_k, W_v are all convolution layer parameters and \beta is the attention weight.
The self-attention layer maps the feature map x into several feature spaces with convolution layers, computes the correlations between different sub-regions by inner products, expresses them as attention weights, and finally obtains the self-attention context feature map c_{self} by weighted summation.
In the stylized attention layer, the Gram matrix of the feature map is computed and normalized by a softmax function, and the features are then recalibrated by a weighted summation over all channels, finally obtaining the stylized attention context feature map c_{style}. The whole process can be expressed as:

G_{kl} = \sum_{j} x_{kj} x_{lj}

\alpha_{kl} = \frac{\exp(G_{kl})}{\sum_{l=1}^{C} \exp(G_{kl})}

f_k = \sum_{l=1}^{C} \alpha_{kl} x_l

c_{style} = \mathrm{concat}(f_1, f_2, \ldots, f_C)

where G is the Gram matrix, \alpha is the attention weight, x is the feature map, and f is the context vector.
In a preferred embodiment, since the head of the character is irrelevant information in the clothing conversion task, a matting layer is added at the top of the convolutional layers to preserve the head. The matting mask of the head-related portion can be expressed as the intersection of the segmentation map and the corresponding labels. Similarly, in a preferred embodiment, a background retention loss is introduced, passing the task of retaining the background to the generator for learning.
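One possible realization of the matting step is sketched below; the head-related class indices and the compositing formula are illustrative assumptions:

import torch

def keep_head(generated, original, seg_map, head_classes=(1, 2)):
    # generated, original: (B, 3, H, W); seg_map: (B, 1, H, W) integer class map
    # head_classes: assumed label indices for the face/hair regions
    mask = torch.zeros_like(generated[:, :1])
    for c in head_classes:
        mask = mask + (seg_map == c).to(mask.dtype)
    mask = mask.clamp(max=1.0)
    # composite: original head pixels kept over the generated body/clothing
    return mask * original + (1.0 - mask) * generated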
In a preferred embodiment, a stabilized training strategy is also added.

Because of the instability of generative adversarial networks, many existing face attribute conversion methods adopt the WGAN-GP strategy to stabilize training. However, in the character clothing conversion task the performance of WGAN-GP is not ideal, because in WGAN-GP the gradient penalty term is computed by sampling on the straight line between a real sample and a generated sample in input space and feeding the result to the discriminator, whereas person pictures essentially lie on a non-convex low-dimensional manifold of a high-dimensional space, so the samples thus drawn may already have left the manifold. Furthermore, WGAN-GP computes the gradient of the discriminator output with respect to its input in each iteration, which increases training time to a certain extent. Based on this analysis, this embodiment adopts spectral normalization, which has a smaller computational cost, in the two-level network to ensure that the discriminator satisfies the Lipschitz continuity condition.
Specifically, the weight matrix of each layer of the network is spectrally normalized, that is, divided by its spectral norm, so that the whole network satisfies the Lipschitz continuity condition. However, if a conventional singular value decomposition were used to solve the spectral norm of each layer at every iteration, the computation would be quite large; therefore, in a preferred embodiment, the power iteration method is used to approximately estimate the spectral norm. First, a vector u is randomly initialized for each weight matrix; provided the dominant singular value is not degenerate and u is not orthogonal to the first left singular vector, the power iteration update rule generates the first left and right singular vectors. If the network is optimized with stochastic gradient descent, the weight matrix changes little at each iteration and the maximum singular value also changes little, so in actual training u can be reused as the initial vector of the next step.
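A sketch of one power-iteration step is given below; in practice PyTorch's torch.nn.utils.spectral_norm wrapper performs this kind of update, reusing the persistent vector u across iterations:

import torch
import torch.nn.functional as F

def spectral_normalize(W, u, n_iter=1, eps=1e-12):
    # W: (out, in) weight matrix; u: (out,) persistent left-vector estimate
    for _ in range(n_iter):
        v = F.normalize(torch.mv(W.t(), u), dim=0, eps=eps)  # right singular direction
        u = F.normalize(torch.mv(W, v), dim=0, eps=eps)      # left singular direction
    sigma = torch.dot(u, torch.mv(W, v))                     # largest singular value estimate
    return W / sigma, u  # normalized weight and updated u for the next iteration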
In one embodiment, the code is implemented with PyTorch. In the training phase, the learning rates of the generator and the discriminator are both set to 0.0002, the Adam optimizer is used, and the batch size is set to 32 samples. The parameters of the synthesis network are fixed first and the deformation network is trained for a total of 15 epochs, with the learning rate linearly decayed to 0 over the last 5 epochs. Then the parameters of the deformation network are fixed and the synthesis network is trained for 20 epochs, again with the learning rate linearly decayed to 0 over the last 5 epochs.
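These hyperparameters could be wired up as in the following sketch; the Adam beta coefficients are assumptions not stated in this embodiment:

import torch

def make_optimizers(generator, discriminator, epochs=15, decay_last=5):
    g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

    def linear_decay(epoch):
        # full rate until (epochs - decay_last), then linear decay to 0
        return min(1.0, max(0.0, (epochs - epoch) / decay_last))

    g_sched = torch.optim.lr_scheduler.LambdaLR(g_opt, lr_lambda=linear_decay)
    d_sched = torch.optim.lr_scheduler.LambdaLR(d_opt, lr_lambda=linear_decay)
    return g_opt, d_opt, g_sched, d_sched  # call each scheduler's step() once per epoch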
The results of the character clothing conversion method of the above embodiments are evaluated below, with DeepFashion selected as the training and testing dataset and FashionGAN, the current state-of-the-art method, used for quantitative and qualitative comparison with the method of the above embodiments of the invention.

Regarding the quality evaluation index, the Fréchet Inception Distance (FID) is adopted because it better matches human perception in evaluating the realism and diversity of generated samples. A lower FID indicates that the generated samples are closer to the real samples, i.e., the generation quality is higher. For each model we randomly generate 5000 samples to calculate the FID. The final quantitative comparison results are shown in Table 1. The FID of the samples generated by the method of the invention is far smaller than that of FashionGAN, dropping from 35.18 to 30.54, indicating that the method of the invention achieves more advanced results on the DeepFashion dataset.
TABLE 1 Fréchet Inception Distance comparison of the embodiment of the invention with the existing method

Model                      Fréchet Inception Distance
FashionGAN                 35.18
Method of the invention    30.54
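For reference, the FID compares Gaussian statistics (mean mu and covariance sigma) of Inception activations of real and generated samples; a minimal NumPy sketch, assuming the activation statistics have already been extracted, is:

import numpy as np
from scipy.linalg import sqrtm

def fid(mu1, sigma1, mu2, sigma2):
    # mu: (2048,) means and sigma: (2048, 2048) covariances of Inception features
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from sqrtm
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))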
To show that the method of the invention not only generates high-quality pictures but also makes the pictures highly conform to the input sentence description, an attribute prediction experiment is carried out. Specifically, we use an R-CNN model as a predictor of character clothing attributes and fine-tune its parameters on the DeepFashion dataset. We select 5 attributes, namely "T-shirt", "long sleeve", "shorts", "jeans" and "trousers", and classify the generated samples of the different models with the predictor; the results are shown in Table 2. It can be seen that the method of the invention exceeds the FashionGAN model on all 5 attributes, indicating that the method of the invention can generate very realistic pictures with a high degree of harmony and consistency.
TABLE 2 Comparison of the attribute prediction results of the embodiment of the invention with the existing method

[Table 2 appears as an image in the original publication; it reports the per-attribute classification results for the five selected attributes, with the method of the invention exceeding the FashionGAN model on all five.]
For a qualitative comparison of generation quality, we choose the same original image and input sentence and observe the generated results of both methods. As shown in fig. 3, each column represents the same sentence description. It can be seen intuitively that the FashionGAN model learns a mapping from paired segmentation maps to real pictures rather than taking the original image as input, so it cannot keep the background information of the original image and the network can only learn a blank background. In contrast, our model is designed with a background retention loss and does not suffer from this problem; the background is retained intact. In addition, as is clear from fig. 3, the method of the invention generates the most natural and realistic person images, with very consistent colors and fine texture details. Although the FashionGAN model keeps the actions and identity of the person in the original image unchanged, its results lack sufficient texture detail and therefore lack a three-dimensional quality. The FashionGAN model also fails to produce true colors; for example, in the sixth image of the first row of fig. 3, the "green" in the sentence description is not reflected in the generated image, and large areas of artifacts appear instead. Moreover, the generated samples of the FashionGAN model look very similar, lack diversity and show signs of mode collapse. In contrast, the method of the invention is superior in texture and color detail and in diversity.
In an embodiment, the invention further provides a character clothing conversion system corresponding to the character clothing conversion method of the above embodiments, comprising: a deformation network generator, a synthesis network generator, a deformation network discriminator and a synthesis network discriminator. The deformation network generator and the deformation network discriminator form a first-level generative adversarial network for handling the deformation problem: according to the input sentence, applying the corresponding shape change to the original segmentation map of the original image and converting it into a target segmentation map. The synthesis network generator and the synthesis network discriminator form a second-level generative adversarial network for handling the synthesis problem: taking the target segmentation map obtained by the first-level generative adversarial network as a semantic guidance and shape constraint condition and, together with the input sentence, training the generator to learn the multi-domain mapping from the original image to the target picture, so as to synthesize the target picture and complete the character clothing conversion. Further, the second-level generative adversarial network fuses the three mechanisms of soft attention, self-attention and stylized attention:

(1) a soft attention layer to strengthen the relevance between the target picture and the input sentence;

(2) a self-attention layer to explicitly capture long-range correlations on the image;

(3) a stylized attention layer to establish the dependencies between features by channel-by-channel inner products and feature map recalibration.
In a preferred embodiment, the deformation network discriminator and the synthesis network discriminator are projection-based discriminators for calculating the matching loss between the image and the condition by an element-wise inner product. Further, the deformation network discriminator and the synthesis network discriminator judge different patches of a patch-processed image and finally take the average score, which accelerates the convergence of the network and provides very effective generation guidance for the texture and style of the image.

In a preferred embodiment, the deformation network generator uses a least-squares function as the adversarial loss; further, the deformation network generator adopts a cycle-consistency loss to ensure that the body shape, pose and identity of the character in the target segmentation map and the original segmentation map do not change.

In a preferred embodiment, the deformation network generator combines the target sentence representation vector with the segmentation map features obtained after the residual blocks and then feeds the result to the upsampling stage.
In a preferred embodiment, the synthesis network generator comprises two encoding branches for extracting the features of the target segmentation map and of the original image respectively.

In a preferred embodiment, the system further comprises a stabilized training strategy module for stabilizing the training process of the first-level and second-level generative adversarial networks. Further, the stabilized training strategy module performs spectral normalization on the weight matrix of each layer of the first-level and second-level generative adversarial networks respectively. Further, the spectral norms in the stabilized training strategy module are approximately estimated with the power iteration method.

In a preferred embodiment, to further focus the attention of the synthesis network generator on the clothing and body parts, the top of the convolutional layer stack of the synthesis network generator also includes a matting layer to preserve the head. The matting mask of the head-related portion can be expressed as the intersection of the segmentation map and the corresponding labels. Similarly, a background retention loss can be introduced, passing the task of retaining the background to the generator for learning.
In a preferred embodiment, an encoding-bottleneck-decoding architecture is adopted for all generators and the reconstruction network. The deformation network generator and the reconstruction network comprise 2 stride-2 convolutional layers for downsampling, 6 residual blocks and 2 deconvolution layers for upsampling. To enhance the synthesis capability of the second-level network, we add 1 convolutional layer, 3 residual blocks and 1 deconvolution layer to the generator. Instance normalization layers are adopted in all generators to learn the individual characteristics of samples, and several noise layers are added to improve the diversity and randomness of generation, thereby suppressing overfitting and preventing mode collapse.
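A hedged sketch of the deformation-network generator skeleton described in this paragraph (2 stride-2 convolutions, 6 residual blocks, 2 deconvolutions, instance normalization) follows; the base channel width of 64, the kernel sizes, and the 1-channel segmentation input are assumptions:

import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

class WarpGenerator(nn.Module):
    def __init__(self, in_ch=1, base=64):
        super().__init__()
        layers = [  # 2 stride-2 convolutional layers for downsampling
            nn.Conv2d(in_ch, base, 4, stride=2, padding=1), nn.ReLU(True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.ReLU(True)]
        layers += [ResBlock(base * 2) for _ in range(6)]  # 6 residual blocks
        layers += [  # 2 deconvolution layers for upsampling
            nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1), nn.ReLU(True),
            nn.ConvTranspose2d(base, in_ch, 4, stride=2, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, seg):
        return self.net(seg)  # deformed segmentation map logits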
The embodiments disclosed herein were chosen and described in detail in order to best explain the principles and practical application of the invention, not to limit it. Any modifications or variations within the scope of the description that would be apparent to a person skilled in the art are intended to be included within the scope of the invention.

Claims (10)

1. A method of character apparel conversion comprising:
S11: handling the deformation problem with a first-level generative adversarial network: according to the input sentence, applying the corresponding shape change to the original segmentation map of the original image and converting it into a target segmentation map;

S12: handling the synthesis problem with a second-level generative adversarial network: taking the target segmentation map obtained in step S11 as a semantic guidance and shape constraint condition and, together with the input sentence, training a generator to learn the multi-domain mapping from the original image to the target picture, so as to synthesize the target picture and complete the character clothing conversion;

the second-level generative adversarial network simultaneously fuses:

S121: a soft attention layer to strengthen the relevance between the target segmentation map and the input sentence;

S122: a self-attention layer to explicitly capture long-range correlations on the target segmentation map;

S123: a stylized attention layer to establish the dependencies between features by channel-by-channel inner products and feature map recalibration.
2. The character clothing conversion method according to claim 1, wherein S121 further comprises: the soft attention layer receives two inputs, the word embedding matrix w and the original feature map x, and the soft attention context feature map c_{soft} is obtained by computing the context vectors c_j and concatenating them:

s_{ji} = W_q(x_j)^{T} W_k(w_i)

\beta_{ji} = \frac{\exp(s_{ji})}{\sum_{i=1}^{T} \exp(s_{ji})}

c_j = \sum_{i=1}^{T} \beta_{ji} W_v(w_i)

c_{soft} = \mathrm{concat}(c_1, c_2, \ldots, c_j, \ldots, c_N)

wherein W_q, W_k, W_v are all convolution layer parameters and \beta is the attention weight.
3. The character clothing conversion method according to claim 2, wherein S122 further comprises: mapping the original feature map x into several feature spaces with convolution layers, computing the correlations between different sub-regions by inner products, expressing those correlations as attention weights, and finally obtaining the self-attention context feature map c_{self} by weighted summation.
4. The character clothing conversion method according to claim 1, wherein S123 further comprises: computing the Gram matrix of the original feature map x, normalizing it with a softmax function, and then recalibrating the features by a weighted summation over all channels, finally obtaining the stylized attention context feature map c_{style}:

G_{kl} = \sum_{j} x_{kj} x_{lj}

\alpha_{kl} = \frac{\exp(G_{kl})}{\sum_{l=1}^{C} \exp(G_{kl})}

f_k = \sum_{l=1}^{C} \alpha_{kl} x_l

c_{style} = \mathrm{concat}(f_1, f_2, \ldots, f_C)

wherein G is the Gram matrix, \alpha is the attention weight, x is the feature map, and f is the context vector.
5. The character clothing conversion method according to claim 1, wherein S11 further comprises:

training a generator to approximate a mapping function, and converting the original segmentation map into the target segmentation map through this mapping function, conditioned on the target sentence representation vector; further,

the training process of S11 includes: combining the target sentence representation vector with the segmentation map features obtained after the residual blocks, and then feeding the result to the upsampling stage.

6. The character clothing conversion method according to any one of claims 1 to 5, wherein the steps S11 and S12 further include a stabilized training strategy: performing spectral normalization on the weight matrix of each layer in the first-level generative adversarial network and the second-level generative adversarial network respectively.
7. A character clothing conversion system, comprising: a deformation network generator, a synthesis network generator, a deformation network discriminator and a synthesis network discriminator; wherein,

the deformation network generator and the deformation network discriminator form a first-level generative adversarial network for handling the deformation problem: according to the input sentence, applying the corresponding shape change to the original segmentation map of the original image and converting it into a target segmentation map;

the synthesis network generator and the synthesis network discriminator form a second-level generative adversarial network for handling the synthesis problem: taking the target segmentation map obtained by the first-level generative adversarial network as a semantic guidance and shape constraint condition and, together with the input sentence, training the generator to learn the multi-domain mapping from the original image to the target picture, so as to synthesize the target picture and complete the character clothing conversion;

further, the second-level generative adversarial network simultaneously fuses:

a soft attention layer to strengthen the relevance between the target picture and the input sentence;

a self-attention layer to explicitly capture long-range correlations on the image;

a stylized attention layer to establish the dependencies between features by channel-by-channel inner products and feature map recalibration.
8. The character clothing conversion system of claim 7, wherein the deformation network discriminator and/or the synthesis network discriminator is a projection-based discriminator for calculating the matching loss between the image and the condition by an element-wise inner product;

the deformation network discriminator and/or the synthesis network discriminator judges different patches of a patch-processed image and finally takes the average score; and/or,

the deformation network generator adopts a least-squares function as the adversarial loss; the deformation network generator adopts a cycle-consistency loss to ensure that the body shape, pose and identity of the character in the target segmentation map and the original segmentation map do not change; further,

the deformation network generator combines the target sentence representation vector with the segmentation map features obtained after the residual blocks and then feeds the result to the upsampling stage; and/or,

the synthesis network generator comprises two encoding branches for extracting the features of the target segmentation map and of the original image respectively.
9. The character clothing conversion system of claim 7, further comprising: a stabilized training strategy module for stabilizing the training process of the first-level generative adversarial network and the second-level generative adversarial network;

the stabilized training strategy module performs spectral normalization on the weight matrix of each layer of the first-level generative adversarial network and the second-level generative adversarial network respectively;

the spectral norms in the stabilized training strategy module are approximately estimated by the power iteration method.
10. The character clothing conversion system of any one of claims 7 to 9, wherein the top of the convolutional layer stack of the synthesis network generator further comprises a matting layer to preserve the head;

the deformation network generator and/or the synthesis network generator comprises several noise layers for increasing the diversity and randomness of generation.
CN202010143086.9A 2020-03-04 2020-03-04 Character clothing conversion method and system Active CN111476241B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010143086.9A CN111476241B (en) 2020-03-04 2020-03-04 Character clothing conversion method and system


Publications (2)

Publication Number Publication Date
CN111476241A CN111476241A (en) 2020-07-31
CN111476241B true CN111476241B (en) 2023-04-21

Family

ID=71748048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010143086.9A Active CN111476241B (en) 2020-03-04 2020-03-04 Character clothing conversion method and system

Country Status (1)

Country Link
CN (1) CN111476241B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112598806A (en) * 2020-12-28 2021-04-02 深延科技(北京)有限公司 Virtual fitting method and device based on artificial intelligence, computer equipment and medium
CN112967293B (en) * 2021-03-04 2024-07-12 江苏中科重德智能科技有限公司 Image semantic segmentation method, device and storage medium
CN113393550B (en) * 2021-06-15 2022-09-20 杭州电子科技大学 Fashion garment design synthesis method guided by postures and textures
CN114862666B (en) * 2022-06-22 2022-10-04 阿里巴巴达摩院(杭州)科技有限公司 Image conversion system, method, storage medium and electronic device


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020029356A1 (en) * 2018-08-08 2020-02-13 杰创智能科技股份有限公司 Method employing generative adversarial network for predicting face change
CN110021051A (en) * 2019-04-01 2019-07-16 浙江大学 One kind passing through text Conrad object image generation method based on confrontation network is generated
CN110211196A (en) * 2019-05-28 2019-09-06 山东大学 A kind of virtually trying method and device based on posture guidance
CN110675353A (en) * 2019-08-31 2020-01-10 电子科技大学 Selective segmentation image synthesis method based on conditional generation countermeasure network
CN110659958A (en) * 2019-09-06 2020-01-07 电子科技大学 Clothing matching generation method based on generation of countermeasure network
CN110706300A (en) * 2019-09-19 2020-01-17 腾讯科技(深圳)有限公司 Virtual image generation method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Huang Yongcheng, Song Li, Xie Rong. A fast VP9 superblock partitioning algorithm based on deep residual networks. Video Engineering (《电视技术》), 2019, vol. 43, no. 43, pp. 10-14. *

Also Published As

Publication number Publication date
CN111476241A (en) 2020-07-31


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant