CN110956579B - Text picture rewriting method based on generation of semantic segmentation map - Google Patents

Text picture rewriting method based on generation of semantic segmentation map

Info

Publication number
CN110956579B
Authority
CN
China
Prior art keywords
semantic segmentation
picture
text
semantic
generated
Prior art date
Legal status
Active
Application number
CN201911181726.9A
Other languages
Chinese (zh)
Other versions
CN110956579A (en)
Inventor
印鉴 (Yin Jian)
周晨星 (Zhou Chenxing)
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN201911181726.9A
Publication of CN110956579A
Application granted
Publication of CN110956579B

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 - Geometric image transformations in the plane of the image
    • G06T3/04 - Context-preserving transformations, e.g. by using an importance map
    • G06T5/00 - Image enhancement or restoration
    • G06T5/50 - Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/11 - Region-based segmentation
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10004 - Still image; Photographic image
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20212 - Image combination
    • G06T2207/20221 - Image fusion; Image merging
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a text picture rewriting method based on the generation of semantic segmentation maps, which modifies the clothing of a person in a picture according to a text description. Unlike previous methods that generate the modified picture directly, the method first encodes the text with a bidirectional LSTM to obtain its semantic features and obtains the semantic segmentation map of the original picture with an existing semantic segmentation model; it then splices the segmentation map with the text encoding and feeds them into a resnet network that learns a joint representation of the text encoding and the original segmentation map, thereby generating the semantic segmentation map of the modified picture; finally, the generated segmentation map is spliced with the original text encoding again and fed into another resnet network that learns the relation between the text encoding and the generated segmentation map, thereby generating the final modified picture.

Description

Text picture rewriting method based on generation of semantic segmentation map
Technical Field
The invention relates to the field of computer application technology and image processing, in particular to a text picture rewriting method based on generation of a semantic segmentation map.
Background
In recent years, Internet technology has matured and people have become accustomed to shopping online, with clothing the most frequently purchased category. Being able to buy the clothes one likes without leaving home is very convenient, but not being able to try them on in person is a real problem. If a piece of text describing the appearance of clothing could be used to automatically modify the clothes in a picture of a person, so that the clothing in the generated picture matches the text description while still fitting the figure of the original picture, this would be of great practical value. The text-to-picture rewriting task therefore arises: given a picture of a person and a text description, create a new picture in which the person remains consistent with the input picture and the clothing worn matches the content of the text description.
The text-to-picture rewriting task is a conditional picture generation task in which the text acts as a strong condition controlling generation. The network most commonly used for picture generation at present is the generative adversarial network (GAN). A GAN pairs a generator with a discriminator (both neural networks): the generator tries to produce pictures the discriminator cannot tell from real ones, the discriminator tries to tell real from generated, and through this adversarial learning both become stronger until the generated pictures are nearly indistinguishable from real ones. Using such a network directly for this conditional generation task, however, tends to work poorly: the goal of the task is to modify the clothing while keeping the person of the input picture unchanged. It is therefore better to generate pictures via semantic segmentation maps. A semantic segmentation map classifies every pixel at the pixel level, the classes being the type of region the pixel belongs to (clothes, trousers, face, head, etc.). Learning to produce the target picture directly from the original picture is very difficult, but if the semantic segmentation map of the target picture can first be generated from the segmentation map of the original picture plus the semantic information of the text, and the target picture then generated from that segmentation map and the text, the learning difficulty is reduced. Both the segmentation map of the target picture and the target picture itself are generated with the idea of a generative adversarial network. To further reduce the learning difficulty, the task is split into two stages: the first stage takes the text and the segmentation map of the original picture as input and outputs a segmentation map that matches the text description; the second stage takes the segmentation map generated in the first stage and the text as input and outputs a picture that matches the text description.
Disclosure of Invention
The invention provides a text picture rewriting method based on the generation of a semantic segmentation map, which reduces the learning difficulty of the model network.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
a text picture rewriting method based on generation of semantic segmentation graphs comprises the following steps:
S1: establishing a model G for generating the semantic segmentation map of an input picture, a feature extractor T for the semantic segmentation map, and a bidirectional LSTM encoder network for generating text semantic information;
S2: constructing a resnet1 network, inputting the semantic segmentation features and the text semantic features generated in S1 into the network, and generating a semantic segmentation map P of the modified picture through a GAN training method;
S3: constructing a resnet2 network, inputting the semantic segmentation map P generated in S2 and the text semantic features generated in S1 into the network, and generating the modified picture through a GAN training method.
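Taken together, S1-S3 form a two-stage pipeline: stage 1 produces the modified semantic segmentation map, stage 2 produces the modified picture. The following is a minimal sketch of that pipeline, assuming PyTorch-style callables; all names (text_encoder, seg_model, seg_feat_extractor, resnet1, resnet2, rewrite_picture) are illustrative placeholders, not the patent's actual implementation.

```python
def rewrite_picture(picture, sentence, text_encoder, seg_model,
                    seg_feat_extractor, resnet1, resnet2):
    """Hypothetical end-to-end sketch of steps S1-S3 (all names are placeholders)."""
    text_feat = text_encoder(sentence)               # S1: [h_fn; h_b1] text feature
    seg_map = seg_model(picture)                     # S1: per-pixel labels, 20 classes
    seg_feat = seg_feat_extractor(seg_map)           # S1: [height, width, 3] features
    modified_seg = resnet1(seg_feat, text_feat)      # S2: GAN-trained resnet1 generator
    modified_pic = resnet2(modified_seg, text_feat)  # S3: GAN-trained resnet2 generator
    return modified_pic
```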
Further, the specific process of the step S1 is:
S11: predefining 20 labels, including hair, face and coat, with the aim of classifying each pixel of the input picture: if the input picture is represented by a matrix as [height, width, channel], the output is represented as [height, width];
S12: extracting the head and face features from the segmentation map, scaling the body part down and back up to blur it, and stitching these representations together to form a [height, width, 3] semantic segmentation feature matrix;
S13: each word of the input text is first represented by a low-dimensional dense real vector obtained with the word2vec tool, so the whole sentence can be represented as X = [x_1, …, x_t, …, x_n]; to let the model learn the context of every word in the sentence, each word corresponds to a time step t, and the input of each LSTM unit is the word vector x_t at the current time t together with the hidden-layer output h_f(t-1) of the LSTM unit at time t-1, which yields the forward LSTM outputs H_f = [h_f1, …, h_ft, …, h_fn]; similarly the backward LSTM outputs are H_b = [h_b1, …, h_bt, …, h_bn]; finally h_fn and h_b1 are spliced together as the semantic feature representation of the text.
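A minimal sketch of the S13 text encoder, assuming PyTorch and pretrained 300-dimensional word2vec vectors; the class name and the hidden size of 128 (which makes the spliced [h_fn; h_b1] feature 256-dimensional, consistent with the embodiment described later) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Bidirectional LSTM encoder sketch for S13: the last forward hidden state h_fn
    and the first backward hidden state h_b1 are spliced as the sentence feature."""
    def __init__(self, embed_dim=300, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, word_vectors):           # word_vectors: [batch, n_words, 300]
        outputs, _ = self.lstm(word_vectors)   # [batch, n_words, 2 * hidden_dim]
        half = outputs.size(2) // 2
        h_fn = outputs[:, -1, :half]           # forward LSTM output at the last step
        h_b1 = outputs[:, 0, half:]            # backward LSTM output at the first step
        return torch.cat([h_fn, h_b1], dim=1)  # e.g. 256-dimensional text feature
```

A sentence of n words would be passed in as a [1, n, 300] tensor of its word2vec vectors.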
Further, the specific process of step S2 is as follows:
S21: S12 yields the head, face and body features of the input picture's semantic segmentation map, and these must be spliced with the text semantic features obtained in S13 for joint learning; since the picture features have dimension [height, width, 3] while the text feature is [h_fn; h_b1], the text feature is first expanded so that its dimension becomes [height, width, h_fn + h_b1], and after splicing the overall feature dimension is [height, width, 3 + h_fn + h_b1];
S22: this stage generates a new semantic segmentation map from the segmentation-map features and text semantic features obtained in S21; the task is very similar to that of the pix2pix model, so a resnet network is used as the generator, called the resnet1 network; its structure resembles an encoder-decoder and consists of two parts, a feature-extraction part (the encoder) and an up-sampling part (the decoder): the feature-extraction part applies convolution and pooling operations to the input segmentation-map and text features, and the up-sampling part uses transposed convolutions fused with the same-scale channels of the corresponding feature-extraction layers, thereby generating the new semantic segmentation map;
S23: to make the generated semantic segmentation map approach the real one, adversarial training is adopted; a discriminator is first designed to judge whether an input semantic segmentation map is real or generated, built from a stack of convolution layers followed by two fully connected layers and outputting a binary classification probability; the discriminator's goal is to distinguish real from generated segmentation maps as well as possible, while the generator's goal is that the discriminator cannot tell whether the segmentation map generated in S22 is real or fake; in addition, an analysis loss is added so the generator learns better, namely the cross entropy computed over every pixel between the real segmentation map and the generated one, and minimising this loss guides the generator; after multiple rounds of training, the generator resnet1 can generate the semantic segmentation map of the target picture;
S24: during training, the weight of the analysis loss is set to 0.01, the length of each clothing-description sentence is limited to 10 words, an ADAM optimizer is used to optimise the network, the discriminator and the generator resnet1 are trained alternately, and the parameters of resnet1 are saved after training for use in the test stage.
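A hedged sketch of one stage-1 training step covering S21-S24, assuming PyTorch with channels-first tensors (rather than the [height, width, channel] layout used in the text); G1 and D1 stand for the resnet1 generator and the convolutional discriminator described above, and the exact form of the GAN loss is an assumption.

```python
import torch
import torch.nn.functional as F

def stage1_step(seg_feat, text_feat, real_seg, G1, D1, opt_g, opt_d, parse_weight=0.01):
    """One adversarial training step for stage 1 (sketch; loss form is an assumption).
    seg_feat:  [B, 3, H, W]  head/face/blurred-body features from S12
    text_feat: [B, C_txt]    [h_fn; h_b1] sentence feature from S13
    real_seg:  [B, H, W]     ground-truth segmentation map, integer labels 0..19
    """
    B, _, H, W = seg_feat.shape
    device = seg_feat.device

    # S21: broadcast the text feature over the spatial grid and concatenate channels.
    text_map = text_feat.view(B, -1, 1, 1).expand(-1, -1, H, W)
    x = torch.cat([seg_feat, text_map], dim=1)        # [B, 3 + C_txt, H, W]

    fake_logits = G1(x)                               # [B, 20, H, W] class scores

    # Discriminator step (S23): real segmentation map vs. generated one.
    real_onehot = F.one_hot(real_seg, 20).permute(0, 3, 1, 2).float()
    ones = torch.ones(B, 1, device=device)
    zeros = torch.zeros(B, 1, device=device)
    d_loss = (F.binary_cross_entropy_with_logits(D1(real_onehot), ones)
              + F.binary_cross_entropy_with_logits(
                    D1(fake_logits.softmax(1).detach()), zeros))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step (S23/S24): fool the discriminator, plus the per-pixel
    # cross-entropy "analysis" loss weighted by 0.01.
    g_adv = F.binary_cross_entropy_with_logits(D1(fake_logits.softmax(1)), ones)
    g_loss = g_adv + parse_weight * F.cross_entropy(fake_logits, real_seg)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```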
Further, the specific process of step S3 is as follows:
S31: after the S2 training is complete, the semantic segmentation map P of the modified picture can be obtained, and it matches the text description; the network that generates the modified picture can then be designed; first the features of the modified segmentation map generated in S2 are extracted, with dimension [height, width, 3]; to make the generated picture fit the text description better, the text semantic features obtained in S13, of dimension [h_fn; h_b1], are added: following the splicing method of S21, the text feature is expanded to [height, width, h_fn + h_b1] and spliced with the generated segmentation-map features, giving an overall feature of dimension [height, width, 3 + h_fn + h_b1];
S32: the objective of this stage is to generate the final modified picture; since this task is similar to the earlier task of generating the modified semantic segmentation map, a resnet network is again used as the generator, called the resnet2 network, whose structure is basically the same as the network of S22 but whose input and output differ;
S33: as in S23, adversarial training is adopted and a discriminator is designed to judge whether the input picture is real or generated; its output is a binary classification probability, its goal is to distinguish real from generated pictures as well as possible, and the generator's goal is that the discriminator cannot tell whether the picture generated in S32 is real or fake; an L1 regularization loss function is added to prevent over-fitting;
S34: during training, the regularization loss weight is set to 100, an ADAM optimizer with learning rate 0.0002 is used to optimise the network, the learning rate is decayed exponentially during training, and the discriminator and the generator resnet2 are trained alternately.
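Stage 2 mirrors the stage-1 training step sketched above; only the reconstruction term and hyper-parameters change. A hedged sketch of the differing pieces, assuming PyTorch; the ADAM beta values and the exponential decay rate are assumptions not specified in S34.

```python
import torch
import torch.nn.functional as F

def build_stage2_optimizers(G2, D2):
    """ADAM at learning rate 0.0002 with exponential decay for the resnet2 generator
    (sketch); the beta values and decay rate are assumptions, not stated in S34."""
    opt_g = torch.optim.Adam(G2.parameters(), lr=2e-4, betas=(0.5, 0.999))
    opt_d = torch.optim.Adam(D2.parameters(), lr=2e-4, betas=(0.5, 0.999))
    sched = torch.optim.lr_scheduler.ExponentialLR(opt_g, gamma=0.95)
    return opt_g, opt_d, sched

def stage2_generator_loss(fake_pic, real_pic, d_fake_logits, l1_weight=100.0):
    """Adversarial term plus L1 reconstruction term with weight 100 (S33/S34)."""
    adv = F.binary_cross_entropy_with_logits(d_fake_logits,
                                             torch.ones_like(d_fake_logits))
    return adv + l1_weight * F.l1_loss(fake_pic, real_pic)
```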
Further, in step S11, the value of each pixel is between 0 and 19.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the aim of the method is to modify the clothing of the person in the picture through the text description. Different from the previous method for directly generating the modified picture, the method firstly codes the text through a bidirectional LSTM to obtain the semantic features of the text, then obtains the semantic segmentation map of the original picture through an existing semantic segmentation model, then splices the semantic segmentation map and the text code and puts the semantic segmentation map and the text code into a resnet network to learn the joint representation of the text code and the primitive semantic segmentation map, thereby generating the semantic segmentation map of the modified picture, and finally splices the generated semantic segmentation map and the original text code again and puts the generated semantic segmentation map into another resnet network to learn the relation representation between the text code and the generated semantic segmentation map, thereby generating the finally modified picture.
Drawings
FIG. 1 is a flow chart of the overall network architecture of the present invention;
FIG. 2 is a schematic diagram of a complete model of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
for the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the actual product dimensions;
it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
As shown in fig. 1, a text picture rewriting method based on generation of a semantic segmentation map includes the following steps:
S1: establishing a model G for generating the semantic segmentation map of an input picture, a feature extractor T for the semantic segmentation map, and a bidirectional LSTM encoder network for generating text semantic information;
S2: constructing a resnet1 network, inputting the semantic segmentation features and the text semantic features generated in S1 into the network, and generating a semantic segmentation map P of the modified picture through a GAN training method;
S3: constructing a resnet2 network, inputting the semantic segmentation map P generated in S2 and the text semantic features generated in S1 into the network, and generating the modified picture through a GAN training method.
The specific process of step S1 is:
S11: predefining 20 labels, including hair, face and coat, with the aim of classifying each pixel of the input picture: if the input picture is represented by a matrix as [height, width, channel], the output is represented as [height, width], and the value of each pixel is between 0 and 19;
S12: extracting the head and face features from the segmentation map, scaling the body part down and back up to blur it, and stitching these representations together to form a [height, width, 3] semantic segmentation feature matrix;
S13: each word of the input text is first represented by a low-dimensional dense real vector obtained with the word2vec tool, so the whole sentence can be represented as X = [x_1, …, x_t, …, x_n]; to let the model learn the context of every word in the sentence, each word corresponds to a time step t, and the input of each LSTM unit is the word vector x_t at the current time t together with the hidden-layer output h_f(t-1) of the LSTM unit at time t-1, which yields the forward LSTM outputs H_f = [h_f1, …, h_ft, …, h_fn]; similarly the backward LSTM outputs are H_b = [h_b1, …, h_bt, …, h_bn]; finally h_fn and h_b1 are spliced together as the semantic feature representation of the text.
The specific process of step S2 is:
S21: S12 yields the head, face and body features of the input picture's semantic segmentation map, and these must be spliced with the text semantic features obtained in S13 for joint learning; since the picture features have dimension [height, width, 3] while the text feature is [h_fn; h_b1], the text feature is first expanded so that its dimension becomes [height, width, h_fn + h_b1], and after splicing the overall feature dimension is [height, width, 3 + h_fn + h_b1];
S22: this stage generates a new semantic segmentation map from the segmentation-map features and text semantic features obtained in S21; the task is very similar to that of the pix2pix model, so a resnet network is used as the generator, called the resnet1 network; its structure resembles an encoder-decoder and consists of two parts, a feature-extraction part (the encoder) and an up-sampling part (the decoder): the feature-extraction part applies convolution and pooling operations to the input segmentation-map and text features, and the up-sampling part uses transposed convolutions fused with the same-scale channels of the corresponding feature-extraction layers, thereby generating the new semantic segmentation map;
S23: to make the generated semantic segmentation map approach the real one, adversarial training is adopted; a discriminator is first designed to judge whether an input semantic segmentation map is real or generated, built from a stack of convolution layers followed by two fully connected layers and outputting a binary classification probability; the discriminator's goal is to distinguish real from generated segmentation maps as well as possible, while the generator's goal is that the discriminator cannot tell whether the segmentation map generated in S22 is real or fake; in addition, an analysis loss is added so the generator learns better, namely the cross entropy computed over every pixel between the real segmentation map and the generated one, and minimising this loss guides the generator; after multiple rounds of training, the generator resnet1 can generate the semantic segmentation map of the target picture;
S24: during training, the weight of the analysis loss is set to 0.01, the length of each clothing-description sentence is limited to 10 words, an ADAM optimizer is used to optimise the network, the discriminator and the generator resnet1 are trained alternately, and the parameters of resnet1 are saved after training for use in the test stage.
The specific process of step S3 is:
S31: after the S2 training is complete, the semantic segmentation map P of the modified picture can be obtained, and it matches the text description; the network that generates the modified picture can then be designed; first the features of the modified segmentation map generated in S2 are extracted, with dimension [height, width, 3]; to make the generated picture fit the text description better, the text semantic features obtained in S13, of dimension [h_fn; h_b1], are added: following the splicing method of S21, the text feature is expanded to [height, width, h_fn + h_b1] and spliced with the generated segmentation-map features, giving an overall feature of dimension [height, width, 3 + h_fn + h_b1];
S32: the objective of this stage is to generate the final modified picture; since this task is similar to the earlier task of generating the modified semantic segmentation map, a resnet network is again used as the generator, called the resnet2 network, whose structure is basically the same as the network of S22 but whose input and output differ;
S33: as in S23, adversarial training is adopted and a discriminator is designed to judge whether the input picture is real or generated; its output is a binary classification probability, its goal is to distinguish real from generated pictures as well as possible, and the generator's goal is that the discriminator cannot tell whether the picture generated in S32 is real or fake; an L1 regularization loss function is added to prevent over-fitting;
S34: during training, the regularization loss weight is set to 100, an ADAM optimizer with learning rate 0.0002 is used to optimise the network, the learning rate is decayed exponentially during training, and the discriminator and the generator resnet2 are trained alternately.
The data set used comes from the 2019 Fashion-Gen competition. Each sample consists of a picture, a category describing what the picture shows, and a paragraph of text that supplements the category information, i.e. a full description of the picture. Since the original data set contains many categories unrelated to clothes and trousers (e.g. shoes, bags), it is screened before use; 8 categories are kept in total: TOPS, SWEATERS, PANTS, JEANS, SHIRTS, DRESSES, SHORTS and SKIRTS. The basic statistics of the training and test sets are shown in the following table (a minimal filtering sketch follows the table):
Selected Train Test Topics
Before 260490 32528 8
After 164352 17984 8
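A minimal sketch of this category screening, assuming each Fashion-Gen record exposes a category label; the record structure and field name are illustrative assumptions.

```python
# Hypothetical Fashion-Gen filtering sketch; the record structure (dicts with a
# "category" field) is an assumption for illustration only.
KEPT_CATEGORIES = {"TOPS", "SWEATERS", "PANTS", "JEANS",
                   "SHIRTS", "DRESSES", "SHORTS", "SKIRTS"}

def filter_records(records):
    """Keep only the 8 clothing categories used for training and testing."""
    return [r for r in records if r["category"].upper() in KEPT_CATEGORIES]
```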
The network that generates the modified semantic segmentation map and the network that generates the modified picture are shown in the left and right parts of fig. 2, respectively. Since the input picture and the input text description are paired in the training phase, the modified segmentation map, the modified picture and the segmentation map of the input picture all remain consistent with the input picture during training. Take this sentence as an example: "Long sleeve flannel plaid shirt in tones of white and brown" (the input picture is consistent with this description). Each word of the sentence is first encoded, giving a 300×11 matrix, which is then fed into the bidirectional LSTM to learn the context between words, producing the forward outputs H_f = [h_f1, …, h_ft, …, h_fn] and the backward outputs H_b = [h_b1, …, h_bt, …, h_bn]. Then h_fn and h_b1 are spliced together as the semantic feature representation of the text, a 256×1 vector.
Next, the input picture is fed into a Graphonomy network to obtain its semantic segmentation map. Because the input picture is 256×256, the output segmentation map is a 256×256 matrix in which each pixel takes a value between 0 and 19 indicating the category it belongs to, i.e. 20 categories of information. The model should learn the parts of the picture related to the text description (clothes and trousers) while keeping the other content of the picture, so the head and face features are taken from the segmentation map, while the body features are blurred so that the model can reconstruct the body from the text description: the body is scaled down and then scaled back up, which blurs its features. The three features are then stitched together; since each has dimension [256, 256, 1], the stitched feature is [256, 256, 3]. Finally, the features of the input segmentation map are spliced with the features of the input text description (the text feature must first be expanded from [256, 1] to [256, 256, 256]), giving an overall input of dimension [256, 256, 259].
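A hedged sketch of this feature preparation, assuming PyTorch and channels-first tensors (so the result is [259, 256, 256] rather than [256, 256, 259]); the label indices passed for head, face and body, and the down-scaling factor used to blur the body, are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def prepare_stage1_input(seg_map, text_feat, head_ids, face_ids, body_ids):
    """Build the stage-1 input from a [256, 256] segmentation map (values 0..19)
    and a 256-d text feature (sketch, channels-first layout).
    head_ids / face_ids / body_ids are illustrative lists of label indices."""
    def mask_of(ids):
        m = torch.zeros_like(seg_map, dtype=torch.float32)
        for i in ids:
            m = m + (seg_map == i).float()
        return m                                              # [256, 256]

    head, face, body = mask_of(head_ids), mask_of(face_ids), mask_of(body_ids)
    # Blur the body channel by scaling it down and back up (factor is assumed).
    body = F.interpolate(body[None, None], scale_factor=0.125, mode="bilinear")
    body = F.interpolate(body, size=seg_map.shape, mode="bilinear")[0, 0]
    seg_feat = torch.stack([head, face, body], dim=0)         # [3, 256, 256]

    # Expand the 256-d text feature over the spatial grid and concatenate.
    text_map = text_feat.view(-1, 1, 1).expand(-1, *seg_map.shape)  # [256, 256, 256]
    return torch.cat([seg_feat, text_map], dim=0)             # [259, 256, 256]
```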
A resnet network is then constructed to extract these input features and generate the modified semantic segmentation map; the generated map is still [256, 256], with each pixel taking a value between 0 and 19. The whole network is trained adversarially: the discriminator judges whether an input segmentation map is real or generated by the network, and the training target of the resnet is that the segmentation maps it generates are judged real, confusing the discriminator. To make the generated segmentation map more stable, an analysis loss is added, computed as the cross entropy over every pixel between the real and generated segmentation maps. Once the loss function is constructed, iterative training can be carried out, and after multiple training rounds the resulting resnet network can generate the modified semantic segmentation map.
After the modified semantic segmentation map is obtained, the last step can be completed, i.e. generating the modified picture. Its inputs are the modified segmentation map generated in the previous stage and the input text description. Another resnet network is designed to generate the modified picture, trained in the same way as the previous stage. After training, the test data are fed through the trained resnet1 and resnet2 networks for testing.
To demonstrate the effect of the experiment, a new test index is designed to verify that the generated pictures match the text descriptions. Several representative textual attributes are chosen, such as 'Has T-shirt', 'Has Long sleeves' and 'Has Shorts'. If the input text mentions a T-shirt but the generated picture has none, the generated picture does not match the input text, which indicates a poor generation model. The evaluation index is accuracy: the higher the accuracy, the better the generated pictures match the text. To verify that the two-stage generation proposed here works better, it is compared with a one-stage generation model; a baseline is also provided that predicts whether the real pictures match their text descriptions. Real pictures of course match their descriptions, but the prediction model cannot distinguish them with 100% accuracy, so a good generation model should score close to the baseline. The experimental results are shown below (a sketch of the attribute-accuracy check follows the results table):
Images from Has T-shirt Has Long Sleeves Has Shorts
Baseline 77.6% 88.2% 90.4%
One-Step 57.1% 72.4% 80.1%
Our model 63.4% 87.1% 90.1%
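A hedged sketch of this attribute-accuracy check; the pretrained attribute classifier is assumed to exist and is not part of the method itself.

```python
def attribute_accuracy(generated_pictures, text_attributes, classifier):
    """Fraction of generated pictures whose predicted attribute (e.g. 'Has T-shirt')
    matches the attribute stated in the input text (sketch).
    classifier(picture, attribute) -> bool is an assumed, pretrained predictor."""
    correct = sum(1 for pic, attr in zip(generated_pictures, text_attributes)
                  if classifier(pic, attr))
    return correct / max(len(generated_pictures), 1)
```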
The results show that, compared with the prior art, the invention achieves a certain improvement: by generating the semantic segmentation map first instead of generating the modified picture directly, the model learns the internal relationship between picture and text better and generates pictures that better match the text description.
The same or similar reference numerals correspond to the same or similar components;
the positional relationship depicted in the drawings is for illustrative purposes only and is not to be construed as limiting the present patent;
it is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Other variations or modifications of the above teachings will be apparent to those of ordinary skill in the art. It is not necessary here nor is it exhaustive of all embodiments. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are desired to be protected by the following claims.

Claims (4)

1. A text picture rewriting method based on generation of a semantic segmentation map is characterized by comprising the following steps:
S1: establishing a model G for generating the semantic segmentation map of an input picture, a feature extractor T for the semantic segmentation map, and a bidirectional LSTM encoder network for generating text semantic information; the specific process of step S1 is as follows:
S11: predefining 20 labels, including hair, face and coat, with the aim of classifying each pixel of the input picture: if the input picture is represented by a matrix as [height, width, channel], the output is represented as [height, width];
S12: extracting the head and face features from the segmentation map, scaling the body part down and back up to blur it, and stitching these representations together to form a [height, width, 3] semantic segmentation feature matrix;
S13: each word of the input text is first represented by a low-dimensional dense real vector obtained with the word2vec tool, so the whole sentence can be represented as X = [x_1, …, x_t, …, x_n]; to let the model learn the context of every word in the sentence, each word corresponds to a time step t, and the input of each LSTM unit is the word vector x_t at the current time t together with the hidden-layer output h_f(t-1) of the LSTM unit at time t-1, which yields the forward LSTM outputs H_f = [h_f1, …, h_ft, …, h_fn]; similarly the backward LSTM outputs are H_b = [h_b1, …, h_bt, …, h_bn]; finally h_fn and h_b1 are spliced together as the semantic feature representation of the text;
S2: constructing a resnet1 network, inputting the semantic segmentation features and the text semantic features generated in S1 into the network, and generating a semantic segmentation map P of the modified picture through a GAN training method;
S3: constructing a resnet2 network, inputting the semantic segmentation map P generated in S2 and the text semantic features generated in S1 into the network, and generating the modified picture through a GAN training method.
2. The text-to-picture rewriting method based on semantic segmentation map generation according to claim 1, wherein the specific process of step S2 is:
S21: S12 yields the head, face and body features of the input picture's semantic segmentation map, and these must be spliced with the text semantic features obtained in S13 for joint learning; since the picture features have dimension [height, width, 3] while the text feature is [h_fn; h_b1], the text feature is first expanded so that its dimension becomes [height, width, h_fn + h_b1], and after splicing the overall feature dimension is [height, width, 3 + h_fn + h_b1];
S22: this stage generates a new semantic segmentation map from the segmentation-map features and text semantic features obtained in S21; the task is very similar to that of the pix2pix model, so a resnet network is used as the generator, called the resnet1 network; its structure resembles an encoder-decoder and consists of two parts, a feature-extraction part (the encoder) and an up-sampling part (the decoder): the feature-extraction part applies convolution and pooling operations to the input segmentation-map and text features, and the up-sampling part uses transposed convolutions fused with the same-scale channels of the corresponding feature-extraction layers, thereby generating the new semantic segmentation map;
S23: to make the generated semantic segmentation map approach the real one, adversarial training is adopted; a discriminator is first designed to judge whether an input semantic segmentation map is real or generated, built from a stack of convolution layers followed by two fully connected layers and outputting a binary classification probability; the discriminator's goal is to distinguish real from generated segmentation maps as well as possible, while the generator's goal is that the discriminator cannot tell whether the segmentation map generated in S22 is real or fake; in addition, an analysis loss is added so the generator learns better, namely the cross entropy computed over every pixel between the real segmentation map and the generated one, and minimising this loss guides the generator; after multiple rounds of training, the generator resnet1 can generate the semantic segmentation map of the target picture;
S24: during training, the weight of the analysis loss is set to 0.01, the length of each clothing-description sentence is limited to 10 words, an ADAM optimizer is used to optimise the network, the discriminator and the generator resnet1 are trained alternately, and the parameters of resnet1 are saved after training for use in the test stage.
3. The text-to-picture rewriting method based on the generation of the semantic segmentation map according to claim 2, wherein the specific process of step S3 is:
S31: after the S2 training is complete, the semantic segmentation map P of the modified picture can be obtained, and it matches the text description; the network that generates the modified picture can then be designed; first the features of the modified segmentation map generated in S2 are extracted, with dimension [height, width, 3]; to make the generated picture fit the text description better, the text semantic features obtained in S13, of dimension [h_fn; h_b1], are added: following the splicing method of S21, the text feature is expanded to [height, width, h_fn + h_b1] and spliced with the generated segmentation-map features, giving an overall feature of dimension [height, width, 3 + h_fn + h_b1];
S32: the objective of this stage is to generate the final modified picture; since this task is similar to the earlier task of generating the modified semantic segmentation map, a resnet network is again used as the generator, called the resnet2 network, whose structure is basically the same as the network of S22 but whose input and output differ;
S33: as in S23, adversarial training is adopted and a discriminator is designed to judge whether the input picture is real or generated; its output is a binary classification probability, its goal is to distinguish real from generated pictures as well as possible, and the generator's goal is that the discriminator cannot tell whether the picture generated in S32 is real or fake; an L1 regularization loss function is added to prevent over-fitting;
S34: during training, the regularization loss weight is set to 100, an ADAM optimizer with learning rate 0.0002 is used to optimise the network, the learning rate is decayed exponentially during training, and the discriminator and the generator resnet2 are trained alternately.
4. The text-to-picture method based on semantic segmentation map generation according to claim 2, wherein in step S11, the value of each pixel is between 0 and 19.
CN201911181726.9A 2019-11-27 2019-11-27 Text picture rewriting method based on generation of semantic segmentation map Active CN110956579B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911181726.9A CN110956579B (en) 2019-11-27 2019-11-27 Text picture rewriting method based on generation of semantic segmentation map

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911181726.9A CN110956579B (en) 2019-11-27 2019-11-27 Text picture rewriting method based on generation of semantic segmentation map

Publications (2)

Publication Number Publication Date
CN110956579A CN110956579A (en) 2020-04-03
CN110956579B true CN110956579B (en) 2023-05-23

Family

ID=69978574

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911181726.9A Active CN110956579B (en) 2019-11-27 2019-11-27 Text picture rewriting method based on generation of semantic segmentation map

Country Status (1)

Country Link
CN (1) CN110956579B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626918B (en) * 2020-04-29 2023-05-09 杭州火烧云科技有限公司 Method and system for carrying out style change on digital image based on semantic segmentation network technology
CN111627055B (en) * 2020-05-07 2023-11-24 浙江大学 Scene depth completion method combining semantic segmentation
CN111563899B (en) * 2020-06-09 2020-10-02 南京汇百图科技有限公司 Bone segmentation method in hip joint CT image
CN112801911B (en) * 2021-02-08 2024-03-26 苏州长嘴鱼软件有限公司 Method and device for removing text noise in natural image and storage medium
CN113268991B (en) * 2021-05-19 2022-09-23 北京邮电大学 CGAN model-based user personality privacy protection method
CN114723843B (en) * 2022-06-01 2022-12-06 广东时谛智能科技有限公司 Method, device, equipment and storage medium for generating virtual clothing through multi-mode fusion
CN115293109B (en) * 2022-08-03 2024-03-19 合肥工业大学 Text image generation method and system based on fine granularity semantic fusion

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110111335A (en) * 2019-05-08 2019-08-09 南昌航空大学 A kind of the urban transportation Scene Semantics dividing method and system of adaptive confrontation study
CN110245229A (en) * 2019-04-30 2019-09-17 中山大学 A kind of deep learning theme sensibility classification method based on data enhancing
WO2019179100A1 (en) * 2018-03-20 2019-09-26 苏州大学张家港工业技术研究院 Medical text generation method based on generative adversarial network technology


Also Published As

Publication number Publication date
CN110956579A (en) 2020-04-03

Similar Documents

Publication Publication Date Title
CN110956579B (en) Text picture rewriting method based on generation of semantic segmentation map
Bossard et al. Apparel classification with style
CN108960959B (en) Multi-mode complementary clothing matching method, system and medium based on neural network
Zhou et al. Fashion recommendations through cross-media information retrieval
CN110598017A (en) Self-learning-based commodity detail page generation method
CN111967930A (en) Clothing style recognition recommendation method based on multi-network fusion
Mameli et al. Deep learning approaches for fashion knowledge extraction from social media: a review
Mohammadi et al. Smart fashion: a review of AI applications in the Fashion & Apparel Industry
CN116091667B (en) Character artistic image generation system based on AIGC technology
CN111476241A (en) Character clothing conversion method and system
CN115908657A (en) Method, device and equipment for generating virtual image and storage medium
CN113034237A (en) Dress suit recommendation system and method
Li et al. Toward accurate and realistic virtual try-on through shape matching and multiple warps
Zhang et al. M6-UFC: Unifying multi-modal controls for conditional image synthesis via non-autoregressive generative transformers
CN111728302A (en) Garment design method and device
US11983248B2 (en) Apparatus and method for classifying clothing attributes based on deep learning
CN114821202B (en) Clothing recommendation method based on user preference
Vozáriková et al. Clothing Parsing using Extended U-Net.
KR102366127B1 (en) Apparatus and method for classifying style based on deep learning using fashion attribute
Choi et al. Improving diffusion models for virtual try-on
Huang et al. Clothing image retrieval based on parts detection and segmentation
JP7263864B2 (en) Image processing device, image processing method, image processing program and product development system
CN108763203B (en) Method for expressing film comments by feature vectors by using feature word sets in film comment emotion analysis
Park et al. Clothing classification using CNN and shopping mall search system
Zhang et al. Stylized Text-to-Fashion Image Generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant