CN114022582A - Text image generation method - Google Patents
Text image generation method
- Publication number
- CN114022582A (application CN202111109265.1A)
- Authority
- CN
- China
- Prior art keywords
- module
- image
- text
- features
- generator
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL > G06T11/00—2D [Two Dimensional] image generation > G06T11/001—Texturing; Colouring; Generation of texture or colour
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS > G06N3/00—Computing arrangements based on biological models > G06N3/02—Neural networks > G06N3/04—Architecture, e.g. interconnection topology > G06N3/045—Combinations of networks
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS > G06N3/00—Computing arrangements based on biological models > G06N3/02—Neural networks > G06N3/08—Learning methods
Abstract
The invention relates to a text-to-image generation method based on a Transformer module and the AttnGAN network. A text encoder encodes the text to obtain sentence features and word features; the sentence features pass through a conditional enhancement module, are fused with a random noise vector, and are input into a Transformer module for learning, which outputs an improved feature vector. The improved feature vector is input into a generator to produce a coarse 64 × 64-pixel initial image, the initial synthetic image and the improved feature vector are input into a discriminator for discrimination, and the generator is trained according to a loss function. The improved feature vector and the word features of the previous stage are then successively input into a neural network for up-sampling to obtain fusion vectors, which are input into further generators to obtain 128 × 128-pixel and 256 × 256-pixel images. Compared with images generated by the traditional AttnGAN method, the images generated by this method have clearer detail contours.
Description
Technical Field
The invention relates to the technical field of general image data processing or generation, and in particular to a text-to-image generation method based on a Transformer module and the AttnGAN network, belonging to the fields of computer vision and natural language processing.
Background
The rapid development of modern science and technology has driven theoretical and technical progress in computer vision and natural language processing. Generating images from text descriptions is a comprehensive task spanning both fields; it has great application potential and is expected to play an important role in areas such as criminal investigation, data augmentation, and design.
Early text-to-image generation mainly combined retrieval with supervised learning, but such methods could only alter specific image characteristics. Reed et al. were the first to use a generative adversarial network to generate images from text, which not only allowed the image characteristics to change according to the text content but also laid the foundation for subsequent development.
StackGAN generates a low-resolution image in a first stage and then gradually refines its details in a second stage to improve the resolution. StackGAN++ reduces the instability of network training by adding a color-consistency regularization term to the loss. However, GAN-INT-CLS, StackGAN, and StackGAN++ all use only the whole sentence as the text feature, which can cause important details of the synthesized image to be lost. AttnGAN was proposed on this basis: it introduces a global attention mechanism and extracts features from the text at both the sentence level and the word level, and both kinds of features are used as the text input. This improves the correspondence between text and image, and the attention mechanism also improves the quality of the generated image.
Although the above methods have continuously improved the quality and resolution of generated images, problems of unreasonable image structure and unclear details remain.
Disclosure of Invention
The invention addresses the problems in the prior art by providing a text-to-image generation method based on a Transformer module and the AttnGAN network, alleviating problems such as unreasonable image structure and blurred details.
The technical scheme adopted by the invention is a method for generating an image from text, comprising the following steps:
step 1: acquiring a data set consisting of texts and corresponding images, and preprocessing the data set;
step 2: constructing an AttnGAN-based text generation image network model, wherein the network model comprises a pre-training network and a multi-stage generation network, and the pre-training network introduces a Transformer module;
Step 3: extracting text features of the text description corresponding to the image, wherein the text features comprise word features and sentence features, combining the sentence features with random noise after conditional enhancement, inputting the result into a Transformer module, and learning to obtain spatial and positional information;
Step 4: inputting the feature information learned in step 3 into a first generator, outputting a 64 × 64 low-resolution initial generated image, and inputting the low-resolution initial generated image and the sentence features into an initial discriminator for discrimination;
Step 5: down-sampling the low-resolution initial generated image generated in step 4 to obtain features, inputting the word features into a global attention module to obtain new word features, inputting the two sets of features into a convolutional neural network for learning to obtain fusion features, inputting the fusion features into a second generator, outputting a 128 × 128 image, and inputting the 128 × 128 image and the sentence features into a second-level discriminator for discrimination;
Step 6: down-sampling the 128 × 128 image generated in step 5 to obtain features, inputting the word features into a global attention module to obtain new word features, inputting the two sets of features into a convolutional neural network for learning to obtain new fusion features, inputting the new fusion features into a third generator, outputting a 256 × 256 image, and inputting the 256 × 256 image and the sentence features into a three-level discriminator for discrimination;
Step 7: outputting the image generated in step 6 as the final generated image.
Preferably, in step 2, the Transformer module comprises an Encoder module and a Decoder module;
the Encoder module comprises three first sub-modules connected in sequence, wherein each first sub-module comprises a self-attention layer, a normalization layer and a fully connected layer connected in sequence;
the Decoder module comprises three second sub-modules connected in sequence, wherein each second sub-module comprises a first self-attention layer, a first normalization layer, a second self-attention layer, a second normalization layer and a fully connected layer connected in sequence;
the output of the Encoder module is input into the second self-attention layer of each of the three second sub-modules;
the first of the first sub-modules in the Encoder module and the first of the second sub-modules in the Decoder module serve as the respective input ends, and the third of the second sub-modules in the Decoder module serves as the output end.
Preferably, in step 3, combining the sentence features with the random noise, inputting the result into the Transformer module, and learning spatial and positional information comprises the following steps:
Step 3.1: applying conditional enhancement to the sentence features to obtain the feature vector ĉ = μ(s) + σ(s) ⊙ ε, where s is the sentence feature of the text, μ(s) is the mean vector of the sentence feature vector of the text, σ(s) is the covariance matrix of the sentence feature vector of the text, and ε is a sample drawn from the unit Gaussian distribution N(0, 1);
Step 3.2: combining the obtained feature vector ĉ with the random noise vector z to obtain e, and taking e as the input of the Transformer module;
Step 3.3: in the Transformer module, e is transformed in attention space to obtain three representation vectors Q, K and V; the weight information is computed as α_{j,i} = exp(q_j · k_i) / Σ_t exp(q_j · k_t), where α_{j,i} denotes the weight information of the i-th position when synthesizing the j-th region of the image; the image feature matrix with the attention mechanism is finally obtained as m_j = Σ_i α_{j,i} · v_i.
preferably, in the step 4, the loss function of the first generator is
Wherein,representing a mathematical expectation, G1And D1The method comprises the steps that a first generator and an initial discriminator are respectively provided, lambda represents a hyper-parameter for determining the influence of a DAMSM module on a generator loss function, h represents an output characteristic vector of a Transformer, e represents a text characteristic vector after sentence characteristics and random noise vectors are combined, and LDAMSMRepresenting the loss function derived by the DAMSM module of the training network.
Preferably, in step 4, the loss function of the initial discriminator is L_{D1} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -E_x[log D_1(x)] - E_{x̂}[log(1 - D_1(x̂))],
L_2 = -E_x[log D_1(x, e)] - E_{x̂}[log(1 - D_1(x̂, e))],
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the first generator for that text description, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
Preferably, in step 5, the loss function of the second generator is
L_{G2} = -E[log D_2(G_2(K))] - E[log D_2(G_2(K), e)] + λ_1·L_DAMSM,
where E denotes the mathematical expectation, K = S(F(h, e_1)), G_2 and D_2 are the second generator and the second-level discriminator respectively, λ_1 is a hyper-parameter that determines the magnitude of the influence of the DAMSM module on the second generator loss function, h is the output feature vector of the Transformer, e_1 denotes the word features, e is the text feature vector obtained by combining the sentence features with the random noise vector, L_DAMSM is the loss function obtained from the DAMSM module of the pre-training network, F denotes the global attention module, and S denotes the down-sampling module of the neural network.
Preferably, in step 5, the loss function of the second-level discriminator is L_{D2} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -E_x[log D_2(x)] - E_{x̂}[log(1 - D_2(x̂))],
L_2 = -E_x[log D_2(x, e)] - E_{x̂}[log(1 - D_2(x̂, e))],
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the second generator for that text description, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
Preferably, in step 6, the loss function of the third generator is
L_{G3} = -E[log D_3(G_3(K))] - E[log D_3(G_3(K), e)] + λ_2·L_DAMSM,
where E denotes the mathematical expectation, K = S(F(h, e_1)), G_3 and D_3 are the third generator and the three-level discriminator respectively, λ_2 is a hyper-parameter that determines the magnitude of the influence of the DAMSM module on the third generator loss function, h is the output feature vector of the Transformer, e_1 denotes the word features, e is the text feature vector obtained by combining the sentence features with the random noise vector, L_DAMSM is the loss function obtained from the DAMSM module of the pre-training network, and F(h, e_1) denotes the feature vector learned by the global attention module.
Preferably, in step 6, the loss function of the three-level discriminator is L_{D3} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -E_x[log D_3(x)] - E_{x̂}[log(1 - D_3(x̂))],
L_2 = -E_x[log D_3(x, e)] - E_{x̂}[log(1 - D_3(x̂, e))],
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the third generator for that text description, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
Preferably, in the training phase of the AttnGAN-based text-to-image network model, the result of passing the word features through the DAMSM module is compared with the result of passing the 256 × 256 generated image through the image encoder, and the text-to-image network model is adjusted based on the comparison result.
The invention provides a method for generating an image from text, based on a Transformer module and the AttnGAN network. The text is encoded by a text encoder to obtain sentence features and word features; the sentence features pass through a conditional enhancement module to obtain a feature vector, which is fused with a random noise vector and input into a Transformer module for learning, yielding an improved feature vector. The improved feature vector is input into a generator to produce a coarse 64 × 64-pixel initial image, the initial synthetic image and the improved feature vector are input into a discriminator for discrimination, and the generator is trained according to a loss function. The improved feature vector and the word features of the previous stage are then successively input into a neural network for up-sampling to obtain fusion vectors, which are input into further generators to obtain 128 × 128-pixel and 256 × 256-pixel images.
Compared with images generated by the traditional AttnGAN method, the images generated by the method of the invention have clearer detail contours.
Drawings
FIG. 1 is a diagram of a network architecture of the present invention;
FIG. 2 is a structural diagram of a Transformer module in the present invention.
Detailed Description
The present invention is described in further detail with reference to the following examples, but the scope of the present invention is not limited thereto.
The invention relates to a text image generation method based on a Transformer module and AttnGAN, which comprises the following steps.
Step 1: acquiring a data set consisting of texts and corresponding images, and preprocessing it.
In the invention, the data in the data set comprise texts and corresponding images; the preprocessing comprises manually screening the images and texts in the data set, removing text and image data whose meaning is ambiguous, and correcting text that describes its picture inaccurately.
Step 2: constructing an AttnGAN-based text generation image network model, wherein the network model comprises a pre-training network and a multi-stage generation network, and the pre-training network introduces a Transformer module;
in the step 2, the Transformer module comprises an Encoder module and a Decoder module;
the Encoder module comprises three first sub-modules connected in sequence, wherein each first sub-module comprises a self-attention layer, a normalization layer and a fully connected layer connected in sequence;
the Decoder module comprises three second sub-modules connected in sequence, wherein each second sub-module comprises a first self-attention layer, a first normalization layer, a second self-attention layer, a second normalization layer and a fully connected layer connected in sequence;
the output of the Encoder module is input into the second self-attention layer of each of the three second sub-modules;
the first of the first sub-modules in the Encoder module and the first of the second sub-modules in the Decoder module serve as the respective input ends, and the third of the second sub-modules in the Decoder module serves as the output end.
In the training phase of the AttnGAN-based text-to-image network model, the result of passing the word features through the DAMSM module is compared with the result of passing the 256 × 256 generated image through the image encoder, and the text-to-image network model is adjusted based on the comparison result.
In the invention, the data in the data set are divided into a training set and a test set; the training set is used to train the text-to-image network model, and the test set is used to test and evaluate network performance.
In the invention, in order to better extract features and generate images, a Transformer module, a DAMSM module, a text encoder and a picture encoder are introduced into the pre-training network, and the multi-stage generation part comprises three generators and three discriminators;
the system comprises a transform module, a picture encoder, a DAMSM module and a text recognition module, wherein the transform module is a neural network based on self attention, the text encoder is used for extracting text features, the picture encoder is used for extracting image features, the DAMSM module is used for inputting a final synthesized image into the picture encoder to obtain local image features and character features for relevance comparison so as to improve the relevance of the image and the text;
the text generation image network model of the invention uses three generators and three discriminators to form three groups, and the generators and the discriminators are Convolutional Neural Networks (CNN) respectively.
Fig. 2 shows the structure of the Transformer module used in the present invention.
In the subsequent step 3, the feature vector obtained by conditional enhancement of the sentence features and combination with random noise (denoted A here) is input into the Transformer module.
In the Encoder module, A is flattened into a one-dimensional vector and position information is embedded to obtain the three corresponding matrices Q, K and V, which enter the self-attention layer; in the self-attention layer, the weights of the Q and K matrices are computed to obtain the corresponding scores, which are then applied to the V matrix. After the self-attention layer, a normalization operation is carried out and the result is sent to the fully connected layer, which outputs a one-dimensional vector. This is repeated three times, and the vector output by the Encoder module is denoted B.
In the Decoder module, A is likewise flattened into a one-dimensional vector and position information is embedded to obtain the three corresponding matrices Q, K and V, which enter the first self-attention layer; after the first self-attention layer, a normalization operation is carried out, the result is combined with the vector B from the Encoder and enters the second self-attention layer, yielding a vector C, which is then sent to the fully connected layer;
the vector output by the last Decoder sub-module is the output of the Transformer module.
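A minimal PyTorch sketch of the sub-module structure just described is given below, assuming single-head self-attention, an embedding dimension of 256 and the three-fold repetition of each sub-module; the class names, layer sizes and the way the sub-modules are chained are illustrative assumptions rather than the exact configuration of the invention.

```python
import torch
import torch.nn as nn

DIM = 256  # assumed embedding dimension

class EncoderSubModule(nn.Module):
    """Self-attention -> normalization -> fully connected layer, as in each Encoder sub-module."""
    def __init__(self, dim=DIM):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):
        a, _ = self.attn(x, x, x)          # self-attention over the flattened vector
        return self.fc(self.norm(a))

class DecoderSubModule(nn.Module):
    """First self-attention/norm, second attention fed with the Encoder output B, norm, then FC."""
    def __init__(self, dim=DIM):
        super().__init__()
        self.attn1 = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.attn2 = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, dim)

    def forward(self, x, enc_out):
        a, _ = self.attn1(x, x, x)
        x = self.norm1(a)
        a, _ = self.attn2(x, enc_out, enc_out)  # Encoder output enters the second attention layer
        return self.fc(self.norm2(a))

class TransformerModule(nn.Module):
    """Three Encoder sub-modules followed by three Decoder sub-modules."""
    def __init__(self, dim=DIM):
        super().__init__()
        self.encoders = nn.ModuleList([EncoderSubModule(dim) for _ in range(3)])
        self.decoders = nn.ModuleList([DecoderSubModule(dim) for _ in range(3)])

    def forward(self, a):                  # a: (batch, seq_len, dim)
        b = a
        for enc in self.encoders:
            b = enc(b)                     # vector B from the Encoder module
        c = a
        for dec in self.decoders:
            c = dec(c, b)
        return c                           # output feature vector h

h = TransformerModule()(torch.randn(2, 16, DIM))
print(h.shape)  # torch.Size([2, 16, 256])
```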
Step 3: extracting text features of the text description corresponding to the image, wherein the text features comprise word features and sentence features, combining the sentence features with random noise after conditional enhancement, inputting the result into a Transformer module, and learning to obtain spatial and positional information; in particular, more spatial and positional information is obtained in this way.
In step 3, combining the sentence features with the random noise, inputting the result into the Transformer module, and learning spatial and positional information comprises the following steps:
Step 3.1: applying conditional enhancement to the sentence features to obtain the feature vector ĉ = μ(s) + σ(s) ⊙ ε, where s is the sentence feature of the text, μ(s) is the mean vector of the sentence feature vector of the text, σ(s) is the covariance matrix of the sentence feature vector of the text, and ε is a sample drawn from the unit Gaussian distribution N(0, 1);
Step 3.2: combining the obtained feature vector ĉ with the random noise vector z to obtain e, and taking e as the input of the Transformer module;
Step 3.3: in the Transformer module, e is transformed in attention space to obtain three representation vectors Q, K and V; the weight information is computed as α_{j,i} = exp(q_j · k_i) / Σ_t exp(q_j · k_t), where α_{j,i} denotes the weight information of the i-th position when synthesizing the j-th region of the image; the image feature matrix with the attention mechanism is finally obtained as m_j = Σ_i α_{j,i} · v_i;
Step 3.4: the Transformer module outputs the feature vector h, which carries the learned spatial and positional information.
In the invention, in step 3.3, Query refers to each value to be queried in the feature vector, Key refers to the other values of the feature vector, and Value can be understood as the correlation between Query and Key; Query, Key and Value are all obtained directly, and α_{j,i} and m_j are then computed from them.
In the invention, in step 3.4, the output feature vector h of the Transformer carries the spatial and positional information.
In the invention, i and j each take positive integer values from 1 to n; i and j index different sub-samples within the same sample.
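The conditional enhancement and the combination with the noise vector z described in steps 3.1 and 3.2 can be sketched as follows; the reparameterisation through a single linear layer and the dimensions (256-dimensional sentence features, 128-dimensional condition, 100-dimensional noise) are illustrative assumptions, not the exact design of the invention.

```python
import torch
import torch.nn as nn

class ConditionalEnhancement(nn.Module):
    """Step 3.1: sample a feature vector around the sentence embedding (reparameterisation)."""
    def __init__(self, sent_dim=256, cond_dim=128):
        super().__init__()
        self.fc = nn.Linear(sent_dim, cond_dim * 2)  # predicts mean and log-variance

    def forward(self, sentence_feat):
        mu, logvar = self.fc(sentence_feat).chunk(2, dim=-1)
        eps = torch.randn_like(mu)                   # epsilon ~ N(0, 1)
        return mu + torch.exp(0.5 * logvar) * eps    # c_hat = mu + sigma * eps

sentence = torch.randn(4, 256)      # sentence features from the text encoder
z = torch.randn(4, 100)             # random noise vector z
c_hat = ConditionalEnhancement()(sentence)
e = torch.cat([c_hat, z], dim=-1)   # step 3.2: combined vector e, input of the Transformer module
print(e.shape)                      # torch.Size([4, 228])
```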
Step 4: inputting the feature information learned in step 3 into a first generator, outputting a 64 × 64 low-resolution initial generated image, and inputting the low-resolution initial generated image and the sentence features into an initial discriminator for discrimination;
In step 4, the loss function of the first generator is
L_{G1} = -E[log D_1(G_1(h))] - E[log D_1(G_1(h), e)] + λ·L_DAMSM,
where E denotes the mathematical expectation, G_1 and D_1 are the first generator and the initial discriminator respectively, λ is a hyper-parameter that determines the magnitude of the influence of the DAMSM module on the generator loss function, h is the output feature vector of the Transformer, e is the text feature vector obtained by combining the sentence features with the random noise vector, and L_DAMSM is the loss function obtained from the DAMSM module of the pre-training network.
In step 4, the loss function of the initial discriminator is L_{D1} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -E_x[log D_1(x)] - E_{x̂}[log(1 - D_1(x̂))],
L_2 = -E_x[log D_1(x, e)] - E_{x̂}[log(1 - D_1(x̂, e))],
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the first generator for that text description, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
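A hedged sketch of how stage-1 losses of this kind can be computed is shown below; the discriminator is assumed to return both an unconditional real/fake score and a score conditioned on the text vector e, `damsm` stands in for the DAMSM loss value, and the default weight `lam=5.0` is an illustrative assumption rather than the hyper-parameter of the invention.

```python
import torch
import torch.nn.functional as F

def generator_loss(d_fake_uncond, d_fake_cond, damsm, lam=5.0):
    """Adversarial loss of the generator plus the DAMSM term weighted by lambda."""
    ones = torch.ones_like(d_fake_uncond)
    adv = F.binary_cross_entropy(d_fake_uncond, ones) + \
          F.binary_cross_entropy(d_fake_cond, ones)
    return adv + lam * damsm

def discriminator_loss(d_real_uncond, d_fake_uncond, d_real_cond, d_fake_cond):
    """L_D = L1 + L2: realness of the image plus semantic consistency with the text."""
    ones = torch.ones_like(d_real_uncond)
    zeros = torch.zeros_like(d_real_uncond)
    l1 = F.binary_cross_entropy(d_real_uncond, ones) + \
         F.binary_cross_entropy(d_fake_uncond, zeros)
    l2 = F.binary_cross_entropy(d_real_cond, ones) + \
         F.binary_cross_entropy(d_fake_cond, zeros)
    return l1 + l2

# Example with dummy discriminator outputs in (0, 1).
scores = [torch.rand(8, 1) for _ in range(4)]
print(generator_loss(scores[0], scores[1], damsm=torch.tensor(0.3)))
print(discriminator_loss(*scores))
```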
Step 5: down-sampling the low-resolution initial generated image generated in step 4 to obtain features, inputting the word features into a global attention module to obtain new word features, inputting the two sets of features into a convolutional neural network for learning to obtain fusion features, inputting the fusion features into a second generator, outputting a 128 × 128 image, and inputting the 128 × 128 image and the sentence features into a second-level discriminator for discrimination;
In step 5, the loss function of the second generator is
L_{G2} = -E[log D_2(G_2(K))] - E[log D_2(G_2(K), e)] + λ_1·L_DAMSM,
where E denotes the mathematical expectation, K = S(F(h, e_1)), G_2 and D_2 are the second generator and the second-level discriminator respectively, λ_1 is a hyper-parameter that determines the magnitude of the influence of the DAMSM module on the second generator loss function, h is the output feature vector of the Transformer, e_1 denotes the word features, e is the text feature vector obtained by combining the sentence features with the random noise vector, L_DAMSM is the loss function obtained from the DAMSM module of the pre-training network, F denotes the global attention module, and S denotes the down-sampling module of the neural network.
In step 5, the loss function of the second-level discriminator is L_{D2} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -E_x[log D_2(x)] - E_{x̂}[log(1 - D_2(x̂))],
L_2 = -E_x[log D_2(x, e)] - E_{x̂}[log(1 - D_2(x̂, e))],
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the second generator for that text description, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
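The refinement stage described in step 5 (and repeated in step 6 at higher resolution) can be sketched as follows: the previous image is down-sampled to features, the word features pass through a global attention module, the two are fused by a convolutional network, and a generator head up-samples the fusion to the next resolution. All layer sizes, channel counts and module names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RefinementStage(nn.Module):
    """One 'down-sample, attend to words, fuse, generate' stage of the multi-stage network."""
    def __init__(self, word_dim=256, feat_ch=64):
        super().__init__()
        # S: down-sample the previous image into feature maps.
        self.downsample = nn.Sequential(
            nn.Conv2d(3, feat_ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.LeakyReLU(0.2))
        # F: project word features so each spatial location can attend over the words.
        self.word_proj = nn.Linear(word_dim, feat_ch)
        # Fusion CNN over concatenated image features and word-context features.
        self.fuse = nn.Conv2d(feat_ch * 2, feat_ch, 3, padding=1)
        # Generator head: up-sample to twice the input resolution and output an RGB image.
        self.generate = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="nearest"),
            nn.Conv2d(feat_ch, 3, 3, padding=1), nn.Tanh())

    def forward(self, prev_img, word_feats):        # prev_img: (B,3,H,W); word_feats: (B,T,word_dim)
        img_f = self.downsample(prev_img)            # (B, C, H/2, W/2)
        B, C, H, W = img_f.shape
        w = self.word_proj(word_feats)               # (B, T, C)
        q = img_f.flatten(2).transpose(1, 2)         # (B, H*W, C) queries from image regions
        attn = torch.softmax(q @ w.transpose(1, 2), dim=-1)   # (B, H*W, T)
        ctx = (attn @ w).transpose(1, 2).reshape(B, C, H, W)  # new word features per region
        fused = self.fuse(torch.cat([img_f, ctx], dim=1))
        return self.generate(fused)                  # next-resolution image

stage = RefinementStage()
img128 = stage(torch.randn(2, 3, 64, 64), torch.randn(2, 12, 256))
print(img128.shape)  # torch.Size([2, 3, 128, 128])
```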
Step 6: down-sampling the 128 × 128 images generated in the step 5 to obtain features, inputting the word features into a global attention module to obtain new word features, inputting the two features into a convolutional neural network for learning to obtain new fusion features, inputting the new fusion features into a third generator, outputting 256 × 256 images, and inputting the 256 × 256 images and sentence features into a three-level discriminator for discrimination;
In step 6, the loss function of the third generator is
L_{G3} = -E[log D_3(G_3(K))] - E[log D_3(G_3(K), e)] + λ_2·L_DAMSM,
where E denotes the mathematical expectation, K = S(F(h, e_1)), G_3 and D_3 are the third generator and the three-level discriminator respectively, λ_2 is a hyper-parameter that determines the magnitude of the influence of the DAMSM module on the third generator loss function, h is the output feature vector of the Transformer, e_1 denotes the word features, e is the text feature vector obtained by combining the sentence features with the random noise vector, L_DAMSM is the loss function obtained from the DAMSM module of the pre-training network, and F(h, e_1) denotes the feature vector learned by the global attention module.
In step 6, the loss function of the three-level discriminator is L_{D3} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -E_x[log D_3(x)] - E_{x̂}[log(1 - D_3(x̂))],
L_2 = -E_x[log D_3(x, e)] - E_{x̂}[log(1 - D_3(x̂, e))],
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the third generator for that text description, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
In the present invention, regarding the judgment results of the three discriminators: L_1 judges whether the input image is real, i.e. a number between 0 and 1 is computed, where 0 means the input image is not real and 1 means it is real; similarly, L_2 judges whether the input image is semantically consistent with the text, again yielding a number between 0 and 1, where 0 means inconsistent and 1 means consistent.
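As just described, each discriminator outputs numbers between 0 and 1 for both judgments. A minimal sketch of such a two-headed discriminator is given below; the convolutional backbone, dimensions and class name are illustrative assumptions rather than the exact architecture of the invention.

```python
import torch
import torch.nn as nn

class TwoHeadDiscriminator(nn.Module):
    """Outputs an L1-style realness score and an L2-style text-image consistency score in (0, 1)."""
    def __init__(self, text_dim=228, feat_ch=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, feat_ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(feat_ch, feat_ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.uncond_head = nn.Sequential(nn.Linear(feat_ch, 1), nn.Sigmoid())
        self.cond_head = nn.Sequential(nn.Linear(feat_ch + text_dim, 1), nn.Sigmoid())

    def forward(self, image, text_vec):
        f = self.features(image)
        real_score = self.uncond_head(f)                           # 1 = real, 0 = not real
        match_score = self.cond_head(torch.cat([f, text_vec], 1))  # 1 = consistent with the text
        return real_score, match_score

d = TwoHeadDiscriminator()
r, m = d(torch.randn(2, 3, 64, 64), torch.randn(2, 228))
print(r.shape, m.shape)  # torch.Size([2, 1]) torch.Size([2, 1])
```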
Step 7: outputting the image generated in step 6 as the final generated image.
Claims (10)
1. A method for generating an image from text, comprising the following steps:
step 1: acquiring a data set consisting of texts and corresponding images, and preprocessing the data set;
step 2: constructing an AttnGAN-based text generation image network model, wherein the network model comprises a pre-training network and a multi-stage generation network, and the pre-training network introduces a Transformer module;
Step 3: extracting text features of the text description corresponding to the image, wherein the text features comprise word features and sentence features, combining the sentence features with random noise after conditional enhancement, inputting the result into a Transformer module, and learning to obtain spatial and positional information;
Step 4: inputting the feature information learned in step 3 into a first generator, outputting a 64 × 64 low-resolution initial generated image, and inputting the low-resolution initial generated image and the sentence features into an initial discriminator for discrimination;
Step 5: down-sampling the low-resolution initial generated image generated in step 4 to obtain features, inputting the word features into a global attention module to obtain new word features, inputting the two sets of features into a convolutional neural network for learning to obtain fusion features, inputting the fusion features into a second generator, outputting a 128 × 128 image, and inputting the 128 × 128 image and the sentence features into a second-level discriminator for discrimination;
Step 6: down-sampling the 128 × 128 image generated in step 5 to obtain features, inputting the word features into a global attention module to obtain new word features, inputting the two sets of features into a convolutional neural network for learning to obtain new fusion features, inputting the new fusion features into a third generator, outputting a 256 × 256 image, and inputting the 256 × 256 image and the sentence features into a three-level discriminator for discrimination;
Step 7: outputting the image generated in step 6 as the final generated image.
2. The method of claim 1, wherein: in step 2, the Transformer module comprises an Encoder module and a Decoder module; the Encoder module comprises three first sub-modules connected in sequence, wherein each first sub-module comprises a self-attention layer, a normalization layer and a fully connected layer connected in sequence;
the Decoder module comprises three second sub-modules connected in sequence, wherein each second sub-module comprises a first self-attention layer, a first normalization layer, a second self-attention layer, a second normalization layer and a fully connected layer connected in sequence;
the output of the Encoder module is input into the second self-attention layer of each of the three second sub-modules;
the first of the first sub-modules in the Encoder module and the first of the second sub-modules in the Decoder module serve as the respective input ends, and the third of the second sub-modules in the Decoder module serves as the output end.
3. The method of claim 1, wherein: in step 3, combining the sentence features with the random noise, inputting the result into the Transformer module, and learning spatial and positional information comprises the following steps:
Step 3.1: applying conditional enhancement to the sentence features to obtain the feature vector ĉ = μ(s) + σ(s) ⊙ ε, where s is the sentence feature of the text, μ(s) is the mean vector of the sentence feature vector of the text, σ(s) is the covariance matrix of the sentence feature vector of the text, and ε is a sample drawn from the unit Gaussian distribution N(0, 1);
Step 3.2: combining the obtained feature vector ĉ with the random noise vector z to obtain e, and taking e as the input of the Transformer module;
Step 3.3: in the Transformer module, e is transformed in attention space to obtain three representation vectors Q, K and V; the weight information is computed as α_{j,i} = exp(q_j · k_i) / Σ_t exp(q_j · k_t), where α_{j,i} denotes the weight information of the i-th position when synthesizing the j-th region of the image; finally the image feature matrix with the attention mechanism is obtained as m_j = Σ_i α_{j,i} · v_i.
4. The method of claim 1, wherein: in step 4, the loss function of the first generator is
L_{G1} = -E[log D_1(G_1(h))] - E[log D_1(G_1(h), e)] + λ·L_DAMSM,
where E denotes the mathematical expectation, G_1 and D_1 are the first generator and the initial discriminator respectively, λ is a hyper-parameter that determines the magnitude of the influence of the DAMSM module on the generator loss function, h is the output feature vector of the Transformer, e is the text feature vector obtained by combining the sentence features with the random noise vector, and L_DAMSM is the loss function obtained from the DAMSM module of the pre-training network.
5. The method of claim 1, wherein: in step 4, the loss function of the initial discriminator is L_{D1} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -E_x[log D_1(x)] - E_{x̂}[log(1 - D_1(x̂))],
L_2 = -E_x[log D_1(x, e)] - E_{x̂}[log(1 - D_1(x̂, e))],
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the first generator for that text description, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
6. The method of claim 1, wherein: in step 5, the loss function of the second generator is
L_{G2} = -E[log D_2(G_2(K))] - E[log D_2(G_2(K), e)] + λ_1·L_DAMSM,
where E denotes the mathematical expectation, K = S(F(h, e_1)), G_2 and D_2 are the second generator and the second-level discriminator respectively, λ_1 is a hyper-parameter that determines the magnitude of the influence of the DAMSM module on the second generator loss function, h is the output feature vector of the Transformer, e_1 denotes the word features, e is the text feature vector obtained by combining the sentence features with the random noise vector, L_DAMSM is the loss function obtained from the DAMSM module of the pre-training network, F denotes the global attention module, and S denotes the down-sampling module of the neural network.
7. The method of claim 1, wherein: in step 5, the loss function of the second-level discriminator is L_{D2} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -E_x[log D_2(x)] - E_{x̂}[log(1 - D_2(x̂))],
L_2 = -E_x[log D_2(x, e)] - E_{x̂}[log(1 - D_2(x̂, e))],
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the second generator for that text description, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
8. The method of claim 1, wherein: in step 6, the loss function of the third generator is
L_{G3} = -E[log D_3(G_3(K))] - E[log D_3(G_3(K), e)] + λ_2·L_DAMSM,
where E denotes the mathematical expectation, K = S(F(h, e_1)), G_3 and D_3 are the third generator and the three-level discriminator respectively, λ_2 is a hyper-parameter that determines the magnitude of the influence of the DAMSM module on the third generator loss function, h is the output feature vector of the Transformer, e_1 denotes the word features, e is the text feature vector obtained by combining the sentence features with the random noise vector, L_DAMSM is the loss function obtained from the DAMSM module of the pre-training network, and F(h, e_1) denotes the feature vector learned by the global attention module.
9. The method of claim 1, wherein: in step 6, the loss function of the three-level discriminator is L_{D3} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -E_x[log D_3(x)] - E_{x̂}[log(1 - D_3(x̂))],
L_2 = -E_x[log D_3(x, e)] - E_{x̂}[log(1 - D_3(x̂, e))],
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the third generator for that text description, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
10. The method of claim 1, wherein: in the training phase of the AttnGAN-based text-to-image network model, the result of passing the word features through the DAMSM module is compared with the result of passing the 256 × 256 generated image through the image encoder, and the text-to-image network model is adjusted based on the comparison result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111109265.1A CN114022582A (en) | 2021-09-22 | 2021-09-22 | Text image generation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111109265.1A CN114022582A (en) | 2021-09-22 | 2021-09-22 | Text image generation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114022582A true CN114022582A (en) | 2022-02-08 |
Family
ID=80054526
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111109265.1A Pending CN114022582A (en) | 2021-09-22 | 2021-09-22 | Text image generation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114022582A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116108157A (en) * | 2023-04-11 | 2023-05-12 | 阿里巴巴达摩院(杭州)科技有限公司 | Method for training text generation model, text generation method and device |
CN116108157B (en) * | 2023-04-11 | 2023-09-12 | 阿里巴巴达摩院(杭州)科技有限公司 | Method for training text generation model, text generation method and device |
CN116630465A (en) * | 2023-07-24 | 2023-08-22 | 海信集团控股股份有限公司 | Model training and image generating method and device |
CN116630465B (en) * | 2023-07-24 | 2023-10-24 | 海信集团控股股份有限公司 | Model training and image generating method and device |
CN118097318A (en) * | 2024-04-28 | 2024-05-28 | 武汉大学 | Controllable defect image generation method and device based on visual semantic fusion |
CN118314246A (en) * | 2024-06-11 | 2024-07-09 | 西南科技大学 | Training method and training system for text synthesized image |
CN118314246B (en) * | 2024-06-11 | 2024-08-20 | 西南科技大学 | Training method and training system for text synthesized image |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110490946B (en) | Text image generation method based on cross-modal similarity and antagonism network generation | |
CN111340122B (en) | Multi-modal feature fusion text-guided image restoration method | |
CN114022582A (en) | Text image generation method | |
CN110706302B (en) | System and method for synthesizing images by text | |
CN111160343B (en) | Off-line mathematical formula symbol identification method based on Self-Attention | |
CN111968193B (en) | Text image generation method based on StackGAN (secure gas network) | |
CN113343705B (en) | Text semantic based detail preservation image generation method and system | |
CN108765279A (en) | A kind of pedestrian's face super-resolution reconstruction method towards monitoring scene | |
CN111325660B (en) | Remote sensing image style conversion method based on text data | |
CN113140020B (en) | Method for generating image based on text of countermeasure network generated by accompanying supervision | |
Naveen et al. | Transformer models for enhancing AttnGAN based text to image generation | |
CN113837229B (en) | Knowledge-driven text-to-image generation method | |
CN113140023B (en) | Text-to-image generation method and system based on spatial attention | |
CN111402365A (en) | Method for generating picture from characters based on bidirectional architecture confrontation generation network | |
CN113378949A (en) | Dual-generation confrontation learning method based on capsule network and mixed attention | |
CN117058673A (en) | Text generation image model training method and system and text generation image method and system | |
CN116721176A (en) | Text-to-face image generation method and device based on CLIP supervision | |
CN111339734A (en) | Method for generating image based on text | |
Kasi et al. | A deep learning based cross model text to image generation using DC-GAN | |
CN113658285B (en) | Method for generating face photo to artistic sketch | |
CN115496134A (en) | Traffic scene video description generation method and device based on multi-modal feature fusion | |
Hou et al. | A single-stage multi-class object detection method for remote sensing images | |
Bayoumi et al. | An intelligent hybrid text-to-image synthesis model for generating realistic human faces | |
Gajendran et al. | Text to Image Synthesis Using Bridge Generative Adversarial Network and Char CNN Model | |
Kaddoura | Real-world applications |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |