CN110956579B - Text picture rewriting method based on generation of semantic segmentation map - Google Patents

Text picture rewriting method based on generation of semantic segmentation map

Info

Publication number
CN110956579B
Authority
CN
China
Prior art keywords
semantic segmentation
picture
text
semantic
generated
Prior art date
Legal status
Active
Application number
CN201911181726.9A
Other languages
Chinese (zh)
Other versions
CN110956579A (en)
Inventor
印鉴 (Yin Jian)
周晨星 (Zhou Chenxing)
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN201911181726.9A
Publication of CN110956579A
Application granted
Publication of CN110956579B

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 - Geometric image transformations in the plane of the image
    • G06T3/04 - Context-preserving transformations, e.g. by using an importance map
    • G06T5/00 - Image enhancement or restoration
    • G06T5/50 - Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/11 - Region-based segmentation
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10004 - Still image; Photographic image
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20212 - Image combination
    • G06T2207/20221 - Image fusion; Image merging
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a text picture rewriting method based on the generation of semantic segmentation maps, which modifies the clothing of a person in a picture according to a text description. Unlike previous methods that generate the modified picture directly, the method first encodes the text with a bidirectional LSTM to obtain its semantic features and obtains the semantic segmentation map of the original picture with an existing semantic segmentation model; it then splices the segmentation map with the text encoding and feeds them into a resnet network that learns a joint representation of the text encoding and the original segmentation map, thereby generating the semantic segmentation map of the modified picture; finally, the generated segmentation map is spliced with the original text encoding again and fed into another resnet network that learns the relation between the text encoding and the generated segmentation map, thereby generating the final modified picture.

Description

Text picture rewriting method based on generation of semantic segmentation map
Technical Field
The invention relates to the field of computer application technology and image processing, in particular to a text picture rewriting method based on generation of a semantic segmentation map.
Background
In recent years, Internet technology has matured and people have become accustomed to shopping online, with clothing the most frequently purchased category. Being able to buy the clothes one likes without leaving home is very convenient, but not being able to try them on in person is a real problem. If a piece of text describing the appearance of clothing could be used to automatically modify the clothes in a picture of a person, so that the clothing in the generated picture matches the text description while still fitting the figure of the original picture, this would be of great practical value. The text-to-picture rewriting task therefore arises: given a picture of a person and a text description, create a new picture in which the person remains consistent with the input picture and the clothing worn matches the content of the text description.
The text-to-picture rewriting task is a conditional picture generation task in which the text acts as a strong condition controlling generation. The network most commonly used for picture generation at present is the generative adversarial network (GAN). A GAN pairs a generator with a discriminator (both neural networks): the generator tries to produce pictures the discriminator cannot tell from real ones, the discriminator tries to tell real from generated, and through this adversarial learning both become stronger until the generated pictures are nearly indistinguishable from real ones. Using such a network directly for this conditional generation task, however, tends to work poorly: the goal of the task is to modify the clothing while keeping the person of the input picture unchanged. It is therefore better to generate pictures via semantic segmentation maps. A semantic segmentation map classifies every pixel at the pixel level, the classes being the type of region the pixel belongs to (clothes, trousers, face, head, etc.). Learning to produce the target picture directly from the original picture is very difficult, but if the semantic segmentation map of the target picture can first be generated from the segmentation map of the original picture plus the semantic information of the text, and the target picture then generated from that segmentation map and the text, the learning difficulty is reduced. Both the segmentation map of the target picture and the target picture itself are generated with the idea of a generative adversarial network. To further reduce the learning difficulty, the task is split into two stages: the first stage takes the text and the segmentation map of the original picture as input and outputs a segmentation map that matches the text description; the second stage takes the segmentation map generated in the first stage and the text as input and outputs a picture that matches the text description.
Disclosure of Invention
The invention provides a text picture rewriting method based on the generation of a semantic segmentation map, which reduces the learning difficulty of the model network.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
a text picture rewriting method based on generation of semantic segmentation graphs comprises the following steps:
S1: establishing a model G for generating the semantic segmentation map of an input picture, a feature extractor T for the semantic segmentation map, and a bidirectional LSTM encoder network for generating text semantic information;
S2: constructing a resnet1 network, inputting the semantic segmentation features and the text semantic features generated in S1 into the network, and generating a semantic segmentation map P of the modified picture through a GAN training method;
S3: constructing a resnet2 network, inputting the semantic segmentation map P generated in S2 and the text semantic features generated in S1 into the network, and generating the modified picture through a GAN training method.
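Taken together, S1-S3 form a two-stage pipeline: stage 1 produces the modified semantic segmentation map, stage 2 produces the modified picture. The following is a minimal sketch of that pipeline, assuming PyTorch-style callables; all names (text_encoder, seg_model, seg_feat_extractor, resnet1, resnet2, rewrite_picture) are illustrative placeholders, not the patent's actual implementation.

```python
def rewrite_picture(picture, sentence, text_encoder, seg_model,
                    seg_feat_extractor, resnet1, resnet2):
    """Hypothetical end-to-end sketch of steps S1-S3 (all names are placeholders)."""
    text_feat = text_encoder(sentence)               # S1: [h_fn; h_b1] text feature
    seg_map = seg_model(picture)                     # S1: per-pixel labels, 20 classes
    seg_feat = seg_feat_extractor(seg_map)           # S1: [height, width, 3] features
    modified_seg = resnet1(seg_feat, text_feat)      # S2: GAN-trained resnet1 generator
    modified_pic = resnet2(modified_seg, text_feat)  # S3: GAN-trained resnet2 generator
    return modified_pic
```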
Further, the specific process of the step S1 is:
S11: predefining 20 labels, including hair, face and coat, with the aim of classifying each pixel of the input picture: if the input picture is represented by a matrix as [height, width, channel], the output is represented as [height, width];
S12: extracting the head and face features from the segmentation map, scaling the body part down and back up to blur it, and stitching these representations together to form a [height, width, 3] semantic segmentation feature matrix;
S13: each word of the input text is first represented by a low-dimensional dense real vector obtained with the word2vec tool, so the whole sentence can be represented as X = [x_1, …, x_t, …, x_n]; to let the model learn the context of every word in the sentence, each word corresponds to a time step t, and the input of each LSTM unit is the word vector x_t at the current time t together with the hidden-layer output h_f(t-1) of the LSTM unit at time t-1, which yields the forward LSTM outputs H_f = [h_f1, …, h_ft, …, h_fn]; similarly the backward LSTM outputs are H_b = [h_b1, …, h_bt, …, h_bn]; finally h_fn and h_b1 are spliced together as the semantic feature representation of the text.
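A minimal sketch of the S13 text encoder, assuming PyTorch and pretrained 300-dimensional word2vec vectors; the class name and the hidden size of 128 (which makes the spliced [h_fn; h_b1] feature 256-dimensional, consistent with the embodiment described later) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Bidirectional LSTM encoder sketch for S13: the last forward hidden state h_fn
    and the first backward hidden state h_b1 are spliced as the sentence feature."""
    def __init__(self, embed_dim=300, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, word_vectors):           # word_vectors: [batch, n_words, 300]
        outputs, _ = self.lstm(word_vectors)   # [batch, n_words, 2 * hidden_dim]
        half = outputs.size(2) // 2
        h_fn = outputs[:, -1, :half]           # forward LSTM output at the last step
        h_b1 = outputs[:, 0, half:]            # backward LSTM output at the first step
        return torch.cat([h_fn, h_b1], dim=1)  # e.g. 256-dimensional text feature
```

A sentence of n words would be passed in as a [1, n, 300] tensor of its word2vec vectors.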
Further, the specific process of step S2 is as follows:
S21: S12 yields the head, face and body features of the input picture's semantic segmentation map, and these must be spliced with the text semantic features obtained in S13 for joint learning; since the picture features have dimension [height, width, 3] while the text feature is [h_fn; h_b1], the text feature is first expanded so that its dimension becomes [height, width, h_fn + h_b1], and after splicing the overall feature dimension is [height, width, 3 + h_fn + h_b1];
S22: this stage generates a new semantic segmentation map from the segmentation-map features and text semantic features obtained in S21; the task is very similar to that of the pix2pix model, so a resnet network is used as the generator, called the resnet1 network; its structure resembles an encoder-decoder and consists of two parts, a feature-extraction part (the encoder) and an up-sampling part (the decoder): the feature-extraction part applies convolution and pooling operations to the input segmentation-map and text features, and the up-sampling part uses transposed convolutions fused with the same-scale channels of the corresponding feature-extraction layers, thereby generating the new semantic segmentation map;
S23: to make the generated semantic segmentation map approach the real one, adversarial training is adopted; a discriminator is first designed to judge whether an input semantic segmentation map is real or generated, built from a stack of convolution layers followed by two fully connected layers and outputting a binary classification probability; the discriminator's goal is to distinguish real from generated segmentation maps as well as possible, while the generator's goal is that the discriminator cannot tell whether the segmentation map generated in S22 is real or fake; in addition, an analysis loss is added so the generator learns better, namely the cross entropy computed over every pixel between the real segmentation map and the generated one, and minimising this loss guides the generator; after multiple rounds of training, the generator resnet1 can generate the semantic segmentation map of the target picture;
S24: during training, the weight of the analysis loss is set to 0.01, the length of each clothing-description sentence is limited to 10 words, an ADAM optimizer is used to optimise the network, the discriminator and the generator resnet1 are trained alternately, and the parameters of resnet1 are saved after training for use in the test stage.
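A hedged sketch of one stage-1 training step covering S21-S24, assuming PyTorch with channels-first tensors (rather than the [height, width, channel] layout used in the text); G1 and D1 stand for the resnet1 generator and the convolutional discriminator described above, and the exact form of the GAN loss is an assumption.

```python
import torch
import torch.nn.functional as F

def stage1_step(seg_feat, text_feat, real_seg, G1, D1, opt_g, opt_d, parse_weight=0.01):
    """One adversarial training step for stage 1 (sketch; loss form is an assumption).
    seg_feat:  [B, 3, H, W]  head/face/blurred-body features from S12
    text_feat: [B, C_txt]    [h_fn; h_b1] sentence feature from S13
    real_seg:  [B, H, W]     ground-truth segmentation map, integer labels 0..19
    """
    B, _, H, W = seg_feat.shape
    device = seg_feat.device

    # S21: broadcast the text feature over the spatial grid and concatenate channels.
    text_map = text_feat.view(B, -1, 1, 1).expand(-1, -1, H, W)
    x = torch.cat([seg_feat, text_map], dim=1)        # [B, 3 + C_txt, H, W]

    fake_logits = G1(x)                               # [B, 20, H, W] class scores

    # Discriminator step (S23): real segmentation map vs. generated one.
    real_onehot = F.one_hot(real_seg, 20).permute(0, 3, 1, 2).float()
    ones = torch.ones(B, 1, device=device)
    zeros = torch.zeros(B, 1, device=device)
    d_loss = (F.binary_cross_entropy_with_logits(D1(real_onehot), ones)
              + F.binary_cross_entropy_with_logits(
                    D1(fake_logits.softmax(1).detach()), zeros))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step (S23/S24): fool the discriminator, plus the per-pixel
    # cross-entropy "analysis" loss weighted by 0.01.
    g_adv = F.binary_cross_entropy_with_logits(D1(fake_logits.softmax(1)), ones)
    g_loss = g_adv + parse_weight * F.cross_entropy(fake_logits, real_seg)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```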
Further, the specific process of step S3 is as follows:
S31: after the S2 training is complete, the semantic segmentation map P of the modified picture can be obtained, and it matches the text description; the network that generates the modified picture can then be designed; first the features of the modified segmentation map generated in S2 are extracted, with dimension [height, width, 3]; to make the generated picture fit the text description better, the text semantic features obtained in S13, of dimension [h_fn; h_b1], are added: following the splicing method of S21, the text feature is expanded to [height, width, h_fn + h_b1] and spliced with the generated segmentation-map features, giving an overall feature of dimension [height, width, 3 + h_fn + h_b1];
S32: the objective of this stage is to generate the final modified picture; since this task is similar to the earlier task of generating the modified semantic segmentation map, a resnet network is again used as the generator, called the resnet2 network, whose structure is basically the same as the network of S22 but whose input and output differ;
S33: as in S23, adversarial training is adopted and a discriminator is designed to judge whether the input picture is real or generated; its output is a binary classification probability, its goal is to distinguish real from generated pictures as well as possible, and the generator's goal is that the discriminator cannot tell whether the picture generated in S32 is real or fake; an L1 regularization loss function is added to prevent over-fitting;
S34: during training, the regularization loss weight is set to 100, an ADAM optimizer with learning rate 0.0002 is used to optimise the network, the learning rate is decayed exponentially during training, and the discriminator and the generator resnet2 are trained alternately.
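Stage 2 mirrors the stage-1 training step sketched above; only the reconstruction term and hyper-parameters change. A hedged sketch of the differing pieces, assuming PyTorch; the ADAM beta values and the exponential decay rate are assumptions not specified in S34.

```python
import torch
import torch.nn.functional as F

def build_stage2_optimizers(G2, D2):
    """ADAM at learning rate 0.0002 with exponential decay for the resnet2 generator
    (sketch); the beta values and decay rate are assumptions, not stated in S34."""
    opt_g = torch.optim.Adam(G2.parameters(), lr=2e-4, betas=(0.5, 0.999))
    opt_d = torch.optim.Adam(D2.parameters(), lr=2e-4, betas=(0.5, 0.999))
    sched = torch.optim.lr_scheduler.ExponentialLR(opt_g, gamma=0.95)
    return opt_g, opt_d, sched

def stage2_generator_loss(fake_pic, real_pic, d_fake_logits, l1_weight=100.0):
    """Adversarial term plus L1 reconstruction term with weight 100 (S33/S34)."""
    adv = F.binary_cross_entropy_with_logits(d_fake_logits,
                                             torch.ones_like(d_fake_logits))
    return adv + l1_weight * F.l1_loss(fake_pic, real_pic)
```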
Further, in step S11, the value of each pixel is between 0 and 19.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the aim of the method is to modify the clothing of the person in the picture through the text description. Different from the previous method for directly generating the modified picture, the method firstly codes the text through a bidirectional LSTM to obtain the semantic features of the text, then obtains the semantic segmentation map of the original picture through an existing semantic segmentation model, then splices the semantic segmentation map and the text code and puts the semantic segmentation map and the text code into a resnet network to learn the joint representation of the text code and the primitive semantic segmentation map, thereby generating the semantic segmentation map of the modified picture, and finally splices the generated semantic segmentation map and the original text code again and puts the generated semantic segmentation map into another resnet network to learn the relation representation between the text code and the generated semantic segmentation map, thereby generating the finally modified picture.
Drawings
FIG. 1 is a flow chart of the overall network architecture of the present invention;
FIG. 2 is a schematic diagram of a complete model of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
for the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the actual product dimensions;
it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
As shown in fig. 1, a text picture rewriting method based on generation of a semantic segmentation map includes the following steps:
S1: establishing a model G for generating the semantic segmentation map of an input picture, a feature extractor T for the semantic segmentation map, and a bidirectional LSTM encoder network for generating text semantic information;
S2: constructing a resnet1 network, inputting the semantic segmentation features and the text semantic features generated in S1 into the network, and generating a semantic segmentation map P of the modified picture through a GAN training method;
S3: constructing a resnet2 network, inputting the semantic segmentation map P generated in S2 and the text semantic features generated in S1 into the network, and generating the modified picture through a GAN training method.
The specific process of step S1 is:
S11: predefining 20 labels, including hair, face and coat, with the aim of classifying each pixel of the input picture: if the input picture is represented by a matrix as [height, width, channel], the output is represented as [height, width], and the value of each pixel is between 0 and 19;
S12: extracting the head and face features from the segmentation map, scaling the body part down and back up to blur it, and stitching these representations together to form a [height, width, 3] semantic segmentation feature matrix;
S13: each word of the input text is first represented by a low-dimensional dense real vector obtained with the word2vec tool, so the whole sentence can be represented as X = [x_1, …, x_t, …, x_n]; to let the model learn the context of every word in the sentence, each word corresponds to a time step t, and the input of each LSTM unit is the word vector x_t at the current time t together with the hidden-layer output h_f(t-1) of the LSTM unit at time t-1, which yields the forward LSTM outputs H_f = [h_f1, …, h_ft, …, h_fn]; similarly the backward LSTM outputs are H_b = [h_b1, …, h_bt, …, h_bn]; finally h_fn and h_b1 are spliced together as the semantic feature representation of the text.
The specific process of step S2 is:
S21: S12 yields the head, face and body features of the input picture's semantic segmentation map, and these must be spliced with the text semantic features obtained in S13 for joint learning; since the picture features have dimension [height, width, 3] while the text feature is [h_fn; h_b1], the text feature is first expanded so that its dimension becomes [height, width, h_fn + h_b1], and after splicing the overall feature dimension is [height, width, 3 + h_fn + h_b1];
S22: this stage generates a new semantic segmentation map from the segmentation-map features and text semantic features obtained in S21; the task is very similar to that of the pix2pix model, so a resnet network is used as the generator, called the resnet1 network; its structure resembles an encoder-decoder and consists of two parts, a feature-extraction part (the encoder) and an up-sampling part (the decoder): the feature-extraction part applies convolution and pooling operations to the input segmentation-map and text features, and the up-sampling part uses transposed convolutions fused with the same-scale channels of the corresponding feature-extraction layers, thereby generating the new semantic segmentation map;
S23: to make the generated semantic segmentation map approach the real one, adversarial training is adopted; a discriminator is first designed to judge whether an input semantic segmentation map is real or generated, built from a stack of convolution layers followed by two fully connected layers and outputting a binary classification probability; the discriminator's goal is to distinguish real from generated segmentation maps as well as possible, while the generator's goal is that the discriminator cannot tell whether the segmentation map generated in S22 is real or fake; in addition, an analysis loss is added so the generator learns better, namely the cross entropy computed over every pixel between the real segmentation map and the generated one, and minimising this loss guides the generator; after multiple rounds of training, the generator resnet1 can generate the semantic segmentation map of the target picture;
S24: during training, the weight of the analysis loss is set to 0.01, the length of each clothing-description sentence is limited to 10 words, an ADAM optimizer is used to optimise the network, the discriminator and the generator resnet1 are trained alternately, and the parameters of resnet1 are saved after training for use in the test stage.
The specific process of step S3 is:
S31: after the S2 training is complete, the semantic segmentation map P of the modified picture can be obtained, and it matches the text description; the network that generates the modified picture can then be designed; first the features of the modified segmentation map generated in S2 are extracted, with dimension [height, width, 3]; to make the generated picture fit the text description better, the text semantic features obtained in S13, of dimension [h_fn; h_b1], are added: following the splicing method of S21, the text feature is expanded to [height, width, h_fn + h_b1] and spliced with the generated segmentation-map features, giving an overall feature of dimension [height, width, 3 + h_fn + h_b1];
S32: the objective of this stage is to generate the final modified picture; since this task is similar to the earlier task of generating the modified semantic segmentation map, a resnet network is again used as the generator, called the resnet2 network, whose structure is basically the same as the network of S22 but whose input and output differ;
S33: as in S23, adversarial training is adopted and a discriminator is designed to judge whether the input picture is real or generated; its output is a binary classification probability, its goal is to distinguish real from generated pictures as well as possible, and the generator's goal is that the discriminator cannot tell whether the picture generated in S32 is real or fake; an L1 regularization loss function is added to prevent over-fitting;
S34: during training, the regularization loss weight is set to 100, an ADAM optimizer with learning rate 0.0002 is used to optimise the network, the learning rate is decayed exponentially during training, and the discriminator and the generator resnet2 are trained alternately.
The data set used comes from the 2019 Fashion-Gen competition. Each sample consists of a picture, a category describing what the picture shows, and a paragraph of text that supplements the category information, i.e. a full description of the picture. Since the original data set contains many categories unrelated to clothes and trousers (e.g. shoes, bags), it is screened before use; 8 categories are kept in total: TOPS, SWEATERS, PANTS, JEANS, SHIRTS, DRESSES, SHORTS and SKIRTS. The basic statistics of the training and test sets are shown in the following table (a minimal filtering sketch follows the table):
Selected Train Test Topics
Before 260490 32528 8
After 164352 17984 8
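A minimal sketch of this category screening, assuming each Fashion-Gen record exposes a category label; the record structure and field name are illustrative assumptions.

```python
# Hypothetical Fashion-Gen filtering sketch; the record structure (dicts with a
# "category" field) is an assumption for illustration only.
KEPT_CATEGORIES = {"TOPS", "SWEATERS", "PANTS", "JEANS",
                   "SHIRTS", "DRESSES", "SHORTS", "SKIRTS"}

def filter_records(records):
    """Keep only the 8 clothing categories used for training and testing."""
    return [r for r in records if r["category"].upper() in KEPT_CATEGORIES]
```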
The network that generates the modified semantic segmentation map and the network that generates the modified picture are shown in the left and right parts of fig. 2, respectively. Since the input picture and the input text description are paired in the training phase, the modified segmentation map, the modified picture and the segmentation map of the input picture all remain consistent with the input picture during training. Take this sentence as an example: "Long sleeve flannel plaid shirt in tones of white and brown" (the input picture is consistent with this description). Each word of the sentence is first encoded, giving a 300×11 matrix, which is then fed into the bidirectional LSTM to learn the context between words, producing the forward outputs H_f = [h_f1, …, h_ft, …, h_fn] and the backward outputs H_b = [h_b1, …, h_bt, …, h_bn]. Then h_fn and h_b1 are spliced together as the semantic feature representation of the text, a 256×1 vector.
Next, the input picture is fed into a Graphonomy network to obtain its semantic segmentation map. Because the input picture is 256×256, the output segmentation map is a 256×256 matrix in which each pixel takes a value between 0 and 19 indicating the category it belongs to, i.e. 20 categories of information. The model should learn the parts of the picture related to the text description (clothes and trousers) while keeping the other content of the picture, so the head and face features are taken from the segmentation map, while the body features are blurred so that the model can reconstruct the body from the text description: the body is scaled down and then scaled back up, which blurs its features. The three features are then stitched together; since each has dimension [256, 256, 1], the stitched feature is [256, 256, 3]. Finally, the features of the input segmentation map are spliced with the features of the input text description (the text feature must first be expanded from [256, 1] to [256, 256, 256]), giving an overall input of dimension [256, 256, 259].
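A hedged sketch of this feature preparation, assuming PyTorch and channels-first tensors (so the result is [259, 256, 256] rather than [256, 256, 259]); the label indices passed for head, face and body, and the down-scaling factor used to blur the body, are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def prepare_stage1_input(seg_map, text_feat, head_ids, face_ids, body_ids):
    """Build the stage-1 input from a [256, 256] segmentation map (values 0..19)
    and a 256-d text feature (sketch, channels-first layout).
    head_ids / face_ids / body_ids are illustrative lists of label indices."""
    def mask_of(ids):
        m = torch.zeros_like(seg_map, dtype=torch.float32)
        for i in ids:
            m = m + (seg_map == i).float()
        return m                                              # [256, 256]

    head, face, body = mask_of(head_ids), mask_of(face_ids), mask_of(body_ids)
    # Blur the body channel by scaling it down and back up (factor is assumed).
    body = F.interpolate(body[None, None], scale_factor=0.125, mode="bilinear")
    body = F.interpolate(body, size=seg_map.shape, mode="bilinear")[0, 0]
    seg_feat = torch.stack([head, face, body], dim=0)         # [3, 256, 256]

    # Expand the 256-d text feature over the spatial grid and concatenate.
    text_map = text_feat.view(-1, 1, 1).expand(-1, *seg_map.shape)  # [256, 256, 256]
    return torch.cat([seg_feat, text_map], dim=0)             # [259, 256, 256]
```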
A resnet network is then constructed to extract these input features and generate the modified semantic segmentation map; the generated map is still [256, 256], with each pixel taking a value between 0 and 19. The whole network is trained adversarially: the discriminator judges whether an input segmentation map is real or generated by the network, and the training target of the resnet is that the segmentation maps it generates are judged real, confusing the discriminator. To make the generated segmentation map more stable, an analysis loss is added, computed as the cross entropy over every pixel between the real and generated segmentation maps. Once the loss function is constructed, iterative training can be carried out, and after multiple training rounds the resulting resnet network can generate the modified semantic segmentation map.
After the modified semantic segmentation map is obtained, the last step can be completed, i.e. generating the modified picture. Its inputs are the modified segmentation map generated in the previous stage and the input text description. Another resnet network is designed to generate the modified picture, trained in the same way as the previous stage. After training, the test data are fed through the trained resnet1 and resnet2 networks for testing.
To demonstrate the effect of the experiment, a new test index is designed to verify that the generated pictures match the text descriptions. Several representative textual attributes are chosen, such as 'Has T-shirt', 'Has Long sleeves' and 'Has Shorts'. If the input text mentions a T-shirt but the generated picture has none, the generated picture does not match the input text, which indicates a poor generation model. The evaluation index is accuracy: the higher the accuracy, the better the generated pictures match the text. To verify that the two-stage generation proposed here works better, it is compared with a one-stage generation model; a baseline is also provided that predicts whether the real pictures match their text descriptions. Real pictures of course match their descriptions, but the prediction model cannot distinguish them with 100% accuracy, so a good generation model should score close to the baseline. The experimental results are shown below (a sketch of the attribute-accuracy check follows the results table):
Images from Has T-shirt Has Long Sleeves Has Shorts
Baseline 77.6% 88.2% 90.4%
One-Step 57.1% 72.4% 80.1%
Our model 63.4% 87.1% 90.1%
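A hedged sketch of this attribute-accuracy check; the pretrained attribute classifier is assumed to exist and is not part of the method itself.

```python
def attribute_accuracy(generated_pictures, text_attributes, classifier):
    """Fraction of generated pictures whose predicted attribute (e.g. 'Has T-shirt')
    matches the attribute stated in the input text (sketch).
    classifier(picture, attribute) -> bool is an assumed, pretrained predictor."""
    correct = sum(1 for pic, attr in zip(generated_pictures, text_attributes)
                  if classifier(pic, attr))
    return correct / max(len(generated_pictures), 1)
```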
The results show that, compared with the prior art, the invention achieves a certain improvement: by generating the semantic segmentation map first instead of generating the modified picture directly, the model learns the internal relationship between picture and text better and generates pictures that better match the text description.
The same or similar reference numerals correspond to the same or similar components;
the positional relationship depicted in the drawings is for illustrative purposes only and is not to be construed as limiting the present patent;
it is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Other variations or modifications of the above teachings will be apparent to those of ordinary skill in the art. It is not necessary here nor is it exhaustive of all embodiments. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are desired to be protected by the following claims.

Claims (4)

1. A text picture rewriting method based on generation of a semantic segmentation map is characterized by comprising the following steps:
S1: establishing a model G for generating the semantic segmentation map of an input picture, a feature extractor T for the semantic segmentation map, and a bidirectional LSTM encoder network for generating text semantic information; the specific process of step S1 is as follows:
S11: predefining 20 labels, including hair, face and coat, with the aim of classifying each pixel of the input picture: if the input picture is represented by a matrix as [height, width, channel], the output is represented as [height, width];
S12: extracting the head and face features from the segmentation map, scaling the body part down and back up to blur it, and stitching these representations together to form a [height, width, 3] semantic segmentation feature matrix;
S13: each word of the input text is first represented by a low-dimensional dense real vector obtained with the word2vec tool, so the whole sentence can be represented as X = [x_1, …, x_t, …, x_n]; to let the model learn the context of every word in the sentence, each word corresponds to a time step t, and the input of each LSTM unit is the word vector x_t at the current time t together with the hidden-layer output h_f(t-1) of the LSTM unit at time t-1, which yields the forward LSTM outputs H_f = [h_f1, …, h_ft, …, h_fn]; similarly the backward LSTM outputs are H_b = [h_b1, …, h_bt, …, h_bn]; finally h_fn and h_b1 are spliced together as the semantic feature representation of the text;
S2: constructing a resnet1 network, inputting the semantic segmentation features and the text semantic features generated in S1 into the network, and generating a semantic segmentation map P of the modified picture through a GAN training method;
S3: constructing a resnet2 network, inputting the semantic segmentation map P generated in S2 and the text semantic features generated in S1 into the network, and generating the modified picture through a GAN training method.
2. The text-to-picture rewriting method based on semantic segmentation map generation according to claim 1, wherein the specific process of step S2 is:
S21: S12 yields the head, face and body features of the input picture's semantic segmentation map, and these must be spliced with the text semantic features obtained in S13 for joint learning; since the picture features have dimension [height, width, 3] while the text feature is [h_fn; h_b1], the text feature is first expanded so that its dimension becomes [height, width, h_fn + h_b1], and after splicing the overall feature dimension is [height, width, 3 + h_fn + h_b1];
S22: this stage generates a new semantic segmentation map from the segmentation-map features and text semantic features obtained in S21; the task is very similar to that of the pix2pix model, so a resnet network is used as the generator, called the resnet1 network; its structure resembles an encoder-decoder and consists of two parts, a feature-extraction part (the encoder) and an up-sampling part (the decoder): the feature-extraction part applies convolution and pooling operations to the input segmentation-map and text features, and the up-sampling part uses transposed convolutions fused with the same-scale channels of the corresponding feature-extraction layers, thereby generating the new semantic segmentation map;
S23: to make the generated semantic segmentation map approach the real one, adversarial training is adopted; a discriminator is first designed to judge whether an input semantic segmentation map is real or generated, built from a stack of convolution layers followed by two fully connected layers and outputting a binary classification probability; the discriminator's goal is to distinguish real from generated segmentation maps as well as possible, while the generator's goal is that the discriminator cannot tell whether the segmentation map generated in S22 is real or fake; in addition, an analysis loss is added so the generator learns better, namely the cross entropy computed over every pixel between the real segmentation map and the generated one, and minimising this loss guides the generator; after multiple rounds of training, the generator resnet1 can generate the semantic segmentation map of the target picture;
S24: during training, the weight of the analysis loss is set to 0.01, the length of each clothing-description sentence is limited to 10 words, an ADAM optimizer is used to optimise the network, the discriminator and the generator resnet1 are trained alternately, and the parameters of resnet1 are saved after training for use in the test stage.
3. The text-to-picture rewriting method based on the generation of the semantic segmentation map according to claim 2, wherein the specific process of step S3 is:
S31: after the S2 training is complete, the semantic segmentation map P of the modified picture can be obtained, and it matches the text description; the network that generates the modified picture can then be designed; first the features of the modified segmentation map generated in S2 are extracted, with dimension [height, width, 3]; to make the generated picture fit the text description better, the text semantic features obtained in S13, of dimension [h_fn; h_b1], are added: following the splicing method of S21, the text feature is expanded to [height, width, h_fn + h_b1] and spliced with the generated segmentation-map features, giving an overall feature of dimension [height, width, 3 + h_fn + h_b1];
S32: the objective of this stage is to generate the final modified picture; since this task is similar to the earlier task of generating the modified semantic segmentation map, a resnet network is again used as the generator, called the resnet2 network, whose structure is basically the same as the network of S22 but whose input and output differ;
S33: as in S23, adversarial training is adopted and a discriminator is designed to judge whether the input picture is real or generated; its output is a binary classification probability, its goal is to distinguish real from generated pictures as well as possible, and the generator's goal is that the discriminator cannot tell whether the picture generated in S32 is real or fake; an L1 regularization loss function is added to prevent over-fitting;
S34: during training, the regularization loss weight is set to 100, an ADAM optimizer with learning rate 0.0002 is used to optimise the network, the learning rate is decayed exponentially during training, and the discriminator and the generator resnet2 are trained alternately.
4. The text-to-picture method based on semantic segmentation map generation according to claim 2, wherein in step S11, the value of each pixel is between 0 and 19.
CN201911181726.9A 2019-11-27 2019-11-27 Text picture rewriting method based on generation of semantic segmentation map Active CN110956579B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911181726.9A CN110956579B (en) 2019-11-27 2019-11-27 Text picture rewriting method based on generation of semantic segmentation map

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911181726.9A CN110956579B (en) 2019-11-27 2019-11-27 Text picture rewriting method based on generation of semantic segmentation map

Publications (2)

Publication Number Publication Date
CN110956579A CN110956579A (en) 2020-04-03
CN110956579B true CN110956579B (en) 2023-05-23

Family

ID=69978574

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911181726.9A Active CN110956579B (en) 2019-11-27 2019-11-27 Text picture rewriting method based on generation of semantic segmentation map

Country Status (1)

Country Link
CN (1) CN110956579B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626918B (en) * 2020-04-29 2023-05-09 杭州火烧云科技有限公司 Method and system for carrying out style change on digital image based on semantic segmentation network technology
CN111627055B (en) * 2020-05-07 2023-11-24 浙江大学 Scene depth completion method combining semantic segmentation
CN111563899B (en) * 2020-06-09 2020-10-02 南京汇百图科技有限公司 Bone segmentation method in hip joint CT image
CN112801911B (en) * 2021-02-08 2024-03-26 苏州长嘴鱼软件有限公司 Method and device for removing text noise in natural image and storage medium
CN113268991B (en) * 2021-05-19 2022-09-23 北京邮电大学 CGAN model-based user personality privacy protection method
CN114723843B (en) * 2022-06-01 2022-12-06 广东时谛智能科技有限公司 Method, device, equipment and storage medium for generating virtual clothing through multi-mode fusion
CN115293109B (en) * 2022-08-03 2024-03-19 合肥工业大学 Text image generation method and system based on fine granularity semantic fusion

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110111335A (en) * 2019-05-08 2019-08-09 南昌航空大学 A kind of the urban transportation Scene Semantics dividing method and system of adaptive confrontation study
CN110245229A (en) * 2019-04-30 2019-09-17 中山大学 A kind of deep learning theme sensibility classification method based on data enhancing
WO2019179100A1 (en) * 2018-03-20 2019-09-26 苏州大学张家港工业技术研究院 Medical text generation method based on generative adversarial network technology


Also Published As

Publication number Publication date
CN110956579A (en) 2020-04-03

Similar Documents

Publication Publication Date Title
CN110956579B (en) Text picture rewriting method based on generation of semantic segmentation map
Bossard et al. Apparel classification with style
CN108960959B (en) Multi-mode complementary clothing matching method, system and medium based on neural network
Zhou et al. Fashion recommendations through cross-media information retrieval
CN110598017A (en) Self-learning-based commodity detail page generation method
CN111967930A (en) Clothing style recognition recommendation method based on multi-network fusion
Mameli et al. Deep learning approaches for fashion knowledge extraction from social media: a review
Mohammadi et al. Smart fashion: a review of AI applications in the Fashion & Apparel Industry
CN116091667B (en) Character artistic image generation system based on AIGC technology
CN111476241A (en) Character clothing conversion method and system
CN115908657A (en) Method, device and equipment for generating virtual image and storage medium
CN113034237A (en) Dress suit recommendation system and method
Li et al. Toward accurate and realistic virtual try-on through shape matching and multiple warps
Zhang et al. M6-UFC: Unifying multi-modal controls for conditional image synthesis via non-autoregressive generative transformers
CN111728302A (en) Garment design method and device
US11983248B2 (en) Apparatus and method for classifying clothing attributes based on deep learning
CN114821202B (en) Clothing recommendation method based on user preference
Vozáriková et al. Clothing Parsing using Extended U-Net.
KR102366127B1 (en) Apparatus and method for classifying style based on deep learning using fashion attribute
Choi et al. Improving diffusion models for virtual try-on
Huang et al. Clothing image retrieval based on parts detection and segmentation
JP7263864B2 (en) Image processing device, image processing method, image processing program and product development system
CN108763203B (en) Method for expressing film comments by feature vectors by using feature word sets in film comment emotion analysis
Park et al. Clothing classification using CNN and shopping mall search system
Zhang et al. Stylized Text-to-Fashion Image Generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant