CN114022582A - Text image generation method - Google Patents

Text image generation method Download PDF

Info

Publication number
CN114022582A
Authority
CN
China
Prior art keywords
module
image
text
features
generator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111109265.1A
Other languages
Chinese (zh)
Inventor
姚信威
张馨戈
王佐响
杨啸天
齐楚锋
邢伟伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202111109265.1A priority Critical patent/CN114022582A/en
Publication of CN114022582A publication Critical patent/CN114022582A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/001 Texturing; Colouring; Generation of texture or colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a text-to-image generation method based on a Transformer module and the AttnGAN network. A text encoder encodes the text into sentence features and word features; the sentence features pass through a conditioning augmentation module, are fused with a random noise vector, and are input into the Transformer module for learning, which outputs an improved feature vector. The improved feature vector is input into a generator to produce a rough 64×64-pixel initial image, and the initial synthesized image and the improved feature vector are input into a discriminator for discrimination; the generator is trained according to a loss function. The improved feature vector and the word features are then fed in turn into a neural network for up-sampling to obtain fusion vectors, which are input into further generators to obtain 128×128-pixel and 256×256-pixel images. Compared with images generated by the conventional AttnGAN method, the images generated by this method have clearer details and contours.

Description

Text image generation method
Technical Field
The invention relates to the technical field of general image data processing or generation, and in particular to a text-to-image generation method based on a Transformer module and the AttnGAN network, belonging to the fields of computer vision and natural language processing.
Background
The rapid development of modern science and technology has driven theoretical and technical progress in computer vision and natural language processing. Generating images from text descriptions is a comprehensive task spanning these two fields; it has great application potential and is expected to play an important role in criminal investigation, data augmentation, design, and other areas.
Early text-to-image generation mainly combined retrieval with supervised learning, but such methods could only alter specific image attributes. Reed et al. were the first to use a generative adversarial network (GAN) to generate images from text, which not only changes image content according to the text but also laid the foundation for subsequent development.
StackGAN generates a low-resolution image in a first stage and then refines its details in a second stage to increase the resolution; StackGAN++ reduces training instability by adding a color-consistency regularization term to the loss. However, GAN-INT-CLS, StackGAN and StackGAN++ all use only whole-sentence information as the text feature, which can cause important details to be lost in the synthesized image. AttnGAN was therefore proposed: it introduces a global attention mechanism and extracts features from the text at both the sentence level and the word level, using both as input text features. This improves the relevance between text and image, and the attention mechanism also improves the quality of the generated images.
Although these methods have continuously improved the quality and resolution of generated images, problems of implausible layouts and unclear details remain.
Disclosure of Invention
The invention addresses the problems of the prior art by providing a text-to-image generation method based on a Transformer module and the AttnGAN network, alleviating issues such as implausible images and blurred details.
The technical solution adopted by the invention is a text-to-image generation method comprising the following steps:
Step 1: acquiring a data set consisting of texts and corresponding images, and preprocessing the data set;
Step 2: constructing an AttnGAN-based text-to-image network model, wherein the network model comprises a pre-training network and a multi-stage generation network, and the pre-training network introduces a Transformer module;
Step 3: extracting text features of the text description corresponding to the image, the text features comprising word features and sentence features, combining the sentence features, after conditioning augmentation, with random noise, inputting the result into the Transformer module, and learning spatial and positional information;
Step 4: inputting the feature information learned in step 3 into a first generator, outputting a 64×64 low-resolution initial generated image, and inputting the initial generated image and the sentence features into an initial discriminator for discrimination;
Step 5: down-sampling the initial generated image from step 4 to obtain image features, inputting the word features into a global attention module to obtain new word features, inputting the two kinds of features into a convolutional neural network to learn fusion features, inputting the fusion features into a second generator to output a 128×128 image, and inputting the 128×128 image and the sentence features into a second-level discriminator for discrimination;
Step 6: down-sampling the 128×128 image from step 5 to obtain image features, inputting the word features into the global attention module to obtain new word features, inputting the two kinds of features into a convolutional neural network to learn new fusion features, inputting the new fusion features into a third generator to output a 256×256 image, and inputting the 256×256 image and the sentence features into a third-level discriminator for discrimination;
Step 7: outputting the image generated in step 6 as the final generated image.
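Steps 1 to 7 can be summarized as a single forward pass. The sketch below is a hypothetical PyTorch wiring of the pipeline, assuming that the component modules (text_encoder, the conditioning augmentation ca, the Transformer module, generators G1–G3, the global attention module attn_F, the down-sampling module down_S and the fusion CNNs fuse2/fuse3) are defined and trained elsewhere; the names and interfaces are illustrative and do not reproduce the exact implementation of the invention.

```python
import torch

def generate_images(text_tokens, text_encoder, ca, transformer,
                    G1, G2, G3, attn_F, down_S, fuse2, fuse3, z_dim=100):
    """Hypothetical forward pass of the three-stage pipeline (a sketch only)."""
    sent_feat, word_feats = text_encoder(text_tokens)            # step 3: sentence and word features
    e_hat = ca(sent_feat)                                        # conditioning augmentation
    z = torch.randn(e_hat.size(0), z_dim, device=e_hat.device)   # random noise vector
    e = torch.cat([e_hat, z], dim=1)                             # combined Transformer input
    h = transformer(e)                                           # improved feature vector

    img64 = G1(h)                                                # step 4: 64x64 initial image
    feat64 = down_S(img64)                                       # step 5: down-sample the image
    k2 = fuse2(torch.cat([feat64, attn_F(word_feats, feat64)], dim=1))
    img128 = G2(k2)                                              # 128x128 image

    feat128 = down_S(img128)                                     # step 6: down-sample again
    k3 = fuse3(torch.cat([feat128, attn_F(word_feats, feat128)], dim=1))
    img256 = G3(k3)                                              # 256x256 final image (step 7)
    return img64, img128, img256
```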
Preferably, in step 2, the Transformer module comprises an Encoder module and a Decoder module;
the Encoder module comprises three first sub-modules connected in sequence, each first sub-module comprising a self-attention layer, a normalization layer and a fully connected layer connected in sequence;
the Decoder module comprises three second sub-modules connected in sequence, each second sub-module comprising a first self-attention layer, a first normalization layer, a second self-attention layer, a second normalization layer and a fully connected layer connected in sequence;
the output of the Encoder module is fed into the second self-attention layer of each of the three second sub-modules;
the first of the three first sub-modules in the Encoder module and the first of the three second sub-modules in the Decoder module serve as the input ends, and the third second sub-module in the Decoder module provides the output end.
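For illustration, a minimal PyTorch sketch of this Encoder/Decoder layout is given below. The class names, layer widths and head counts are assumptions, and the residual additions around the attention layers follow the data flow described later in the detailed description; this is not the exact configuration of the invention.

```python
import torch
import torch.nn as nn

class EncoderSubModule(nn.Module):
    """First sub-module: self-attention -> normalization -> fully connected layer."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):
        a, _ = self.attn(x, x, x)          # self-attention over the input sequence
        return self.fc(self.norm(x + a))   # normalize, then fully connected layer

class DecoderSubModule(nn.Module):
    """Second sub-module: two attention layers, two normalizations, one fully connected layer."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn1 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.attn2 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, dim)

    def forward(self, x, enc_out):
        a1, _ = self.attn1(x, x, x)                 # first self-attention layer
        x = self.norm1(x + a1)
        a2, _ = self.attn2(x, enc_out, enc_out)     # second attention layer receives the Encoder output
        x = self.norm2(x + a2)
        return self.fc(x)

class TransformerModule(nn.Module):
    """Encoder and Decoder, each built from three sub-modules connected in sequence."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.encoder = nn.ModuleList([EncoderSubModule(dim, heads) for _ in range(3)])
        self.decoder = nn.ModuleList([DecoderSubModule(dim, heads) for _ in range(3)])

    def forward(self, e):
        # e: (batch, tokens, dim) -- the fused feature arranged as a token sequence (illustrative)
        enc = e
        for block in self.encoder:
            enc = block(enc)               # output B of the Encoder
        dec = e
        for block in self.decoder:
            dec = block(dec, enc)          # Encoder output enters every second attention layer
        return dec                         # Transformer output feature vector h
```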
Preferably, in step 3, combining the sentence features with random noise, inputting the result into the Transformer module, and learning the spatial and positional information comprises the following steps:
Step 3.1: passing the sentence feature through the conditioning augmentation module to obtain the feature vector ê:
ê = μ(ē) + σ(ē) ⊙ ε
where ē is the sentence feature of the text, μ(ē) is the mean vector of the sentence feature vector of the text, σ(ē) is the covariance matrix of the sentence feature vector of the text, and ε is a sample drawn from the unit Gaussian distribution N(0, 1).
Step 3.2: combining the obtained feature vector ê with the random noise vector z to obtain e, which is taken as the input of the Transformer module:
e = ê ⊕ z
Step 3.3: in the Transformer module, e is transformed in attention space to obtain three representation vectors
q = W_q e,  k = W_k e,  v = W_v e,
and the weight information is calculated as
α_{j,i} = exp(q_j · k_i) / Σ_{l=1}^{n} exp(q_j · k_l)
where α_{j,i} denotes the weight of the i-th position when synthesizing the j-th region of the image. Finally the image feature matrix m_j with the attention mechanism is obtained:
m_j = Σ_{i=1}^{n} α_{j,i} v_i
Step 3.4: integrating the feature matrices to obtain the Transformer output feature vector h:
h = (m_1, m_2, …, m_n)
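A small sketch of steps 3.1 and 3.2 (conditioning augmentation followed by concatenation with noise) is shown below, assuming a PyTorch implementation in which the augmentation network predicts a mean and a log-variance from the sentence feature; the dimensions and class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Maps the sentence feature e_bar to mu and sigma and samples e_hat = mu + sigma * eps."""
    def __init__(self, sent_dim=256, cond_dim=100):
        super().__init__()
        self.fc = nn.Linear(sent_dim, cond_dim * 2)

    def forward(self, e_bar):
        stats = self.fc(e_bar)
        mu, log_sigma = stats.chunk(2, dim=1)
        eps = torch.randn_like(mu)            # sample from the unit Gaussian N(0, 1)
        return mu + log_sigma.exp() * eps     # e_hat

# Step 3.2: combine e_hat with a random noise vector z as the Transformer input.
ca = ConditioningAugmentation()
e_bar = torch.randn(4, 256)                   # batch of sentence features (illustrative)
e_hat = ca(e_bar)
z = torch.randn(4, 100)
e = torch.cat([e_hat, z], dim=1)              # input e of the Transformer module
```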
preferably, in the step 4, the loss function of the first generator is
Figure BDA0003273673700000042
Wherein,
Figure BDA0003273673700000043
representing a mathematical expectation, G1And D1The method comprises the steps that a first generator and an initial discriminator are respectively provided, lambda represents a hyper-parameter for determining the influence of a DAMSM module on a generator loss function, h represents an output characteristic vector of a Transformer, e represents a text characteristic vector after sentence characteristics and random noise vectors are combined, and LDAMSMRepresenting the loss function derived by the DAMSM module of the training network.
Preferably, in step 4, the loss function of the initial discriminator is L_{D1} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -(1/2) E_{x∼p_data}[log D_1(x)] - (1/2) E_{x̂∼p_{G1}}[log(1 - D_1(x̂))]
L_2 = -(1/2) E_{x∼p_data}[log D_1(x, e)] - (1/2) E_{x̂∼p_{G1}}[log(1 - D_1(x̂, e))]
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the first generator, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
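As an illustration of the stage-1 objectives above, the following sketch computes the generator and initial-discriminator losses under the assumption that D1 can be called with or without the text feature e and returns a probability in (0, 1); the formulas follow the reconstruction given above (standard conditional and unconditional GAN terms plus a DAMSM term) and are not guaranteed to match the exact implementation.

```python
import torch

def generator_loss_stage1(D1, fake_img, e, damsm_loss, lam=5.0):
    # -1/2 E[log D1(x_hat)] - 1/2 E[log D1(x_hat, e)] + lambda * L_DAMSM
    p_uncond = D1(fake_img)               # probability that the image is real
    p_cond = D1(fake_img, e)              # probability that the image matches the text
    adv = (-0.5 * torch.log(p_uncond + 1e-8).mean()
           - 0.5 * torch.log(p_cond + 1e-8).mean())
    return adv + lam * damsm_loss

def discriminator_loss_stage1(D1, real_img, fake_img, e):
    # L1: real vs. generated;  L2: semantic consistency with the text feature e
    l1 = (-0.5 * torch.log(D1(real_img) + 1e-8).mean()
          - 0.5 * torch.log(1 - D1(fake_img.detach()) + 1e-8).mean())
    l2 = (-0.5 * torch.log(D1(real_img, e) + 1e-8).mean()
          - 0.5 * torch.log(1 - D1(fake_img.detach(), e) + 1e-8).mean())
    return l1 + l2
```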
Preferably, in step 5, the loss function of the second generator is
L_{G2} = -(1/2) E_{x̂∼p_{G2}}[log D_2(x̂)] - (1/2) E_{x̂∼p_{G2}}[log D_2(x̂, e)] + λ_1 L_DAMSM,  with x̂ = G_2(K) and K = S(F(h, e_1)),
where E denotes the mathematical expectation, K = S(F(h, e_1)), G_2 and D_2 are the second generator and the second-level discriminator respectively, λ_1 is a hyper-parameter determining the influence of the DAMSM module on the second generator loss function, h is the output feature vector of the Transformer, e_1 is the word feature, L_DAMSM is the loss function obtained from the DAMSM module of the pre-training network, F is the global attention module, and S is the down-sampling module of the neural network.
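The term K = S(F(h, e_1)) can be illustrated with the hypothetical modules below, where F is modeled as a word-to-region attention (in the spirit of AttnGAN's attention module) and S as a strided-convolution down-sampling block; the shapes, module names and the exact form of the fusion are assumptions made only for this sketch.

```python
import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    """F: attends word features e1 over region features derived from h."""
    def __init__(self, word_dim=256, region_dim=256):
        super().__init__()
        self.proj = nn.Linear(word_dim, region_dim)   # map words into the region space

    def forward(self, h, e1):
        # h: (batch, regions, region_dim)   e1: (batch, words, word_dim)
        w = self.proj(e1)                                       # (batch, words, region_dim)
        attn = torch.softmax(h @ w.transpose(1, 2), dim=-1)     # region-to-word weights
        return attn @ w                                         # word context per region

class DownSample(nn.Module):
    """S: a simple strided-convolution down-sampling block."""
    def __init__(self, channels=256):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

F_mod, S_mod = GlobalAttention(), DownSample()
h = torch.randn(4, 64, 256)        # Transformer output arranged as region features (assumed shape)
e1 = torch.randn(4, 18, 256)       # word features (assumed shape)
K = S_mod(F_mod(h, e1))            # fusion feature fed to the second generator
```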
Preferably, in step 5, the loss function of the second-level discriminator is L_{D2} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -(1/2) E_{x∼p_data}[log D_2(x)] - (1/2) E_{x̂∼p_{G2}}[log(1 - D_2(x̂))]
L_2 = -(1/2) E_{x∼p_data}[log D_2(x, e)] - (1/2) E_{x̂∼p_{G2}}[log(1 - D_2(x̂, e))]
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the second generator, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
Preferably, in step 6, the loss function of the third generator is
L_{G3} = -(1/2) E_{x̂∼p_{G3}}[log D_3(x̂)] - (1/2) E_{x̂∼p_{G3}}[log D_3(x̂, e)] + λ_2 L_DAMSM,  with x̂ = G_3(K) and K = S(F(h, e_1)),
where K = S(F(h, e_1)), G_3 and D_3 are the third generator and the third-level discriminator respectively, λ_2 is a hyper-parameter determining the influence of the DAMSM module on the third generator loss function, h is the output feature vector of the Transformer, e_1 is the word feature, e is the text feature vector obtained by combining the sentence features with the random noise vector, L_DAMSM is the loss function obtained from the DAMSM module of the pre-training network, and F(h, e_1) is the feature vector learned by the global attention module.
Preferably, in step 6, the loss function of the third-level discriminator is L_{D3} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -(1/2) E_{x∼p_data}[log D_3(x)] - (1/2) E_{x̂∼p_{G3}}[log(1 - D_3(x̂))]
L_2 = -(1/2) E_{x∼p_data}[log D_3(x, e)] - (1/2) E_{x̂∼p_{G3}}[log(1 - D_3(x̂, e))]
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the third generator, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
Preferably, in the training phase of the AttnGAN-based text-to-image network model, the result of passing the word features through the DAMSM module is compared with the result of passing the 256×256 image through the image encoder, and the text-to-image network model is adjusted based on the comparison result.
The invention provides a text-to-image generation method based on a Transformer module and the AttnGAN network. The text is encoded by a text encoder to obtain sentence features and word features; the sentence features pass through a conditioning augmentation module to obtain a feature vector, which is fused with a random noise vector and input into the Transformer module for learning; the improved feature vector output by the Transformer is input into a generator to produce a rough 64×64-pixel initial image; the initial synthesized image and the improved feature vector are input into a discriminator for discrimination, and the generator is trained according to the loss function. The improved feature vector and the word features are then fed in turn into a neural network for up-sampling to obtain fusion vectors, which are input into further generators to obtain 128×128-pixel and 256×256-pixel images.
Compared with the details and contours of images generated by the conventional AttnGAN method, the images generated by the method of the invention are clearer.
Drawings
FIG. 1 is a diagram of a network architecture of the present invention;
FIG. 2 is a structural diagram of a Transformer module in the present invention.
Detailed Description
The present invention is described in further detail with reference to the following examples, but the scope of the present invention is not limited thereto.
The invention relates to a text-to-image generation method based on a Transformer module and AttnGAN, comprising the following steps.
Step 1: acquiring a data set consisting of texts and corresponding images, and preprocessing the data set.
In the invention, the data set consists of texts and corresponding images. Preprocessing comprises manually screening the images and texts in the data set, removing ambiguous text–image pairs, and correcting texts that describe their images inaccurately.
Step 2: constructing an AttnGAN-based text generation image network model, wherein the network model comprises a pre-training network and a multi-stage generation network, and the pre-training network introduces a Transformer module;
In step 2, the Transformer module comprises an Encoder module and a Decoder module;
the Encoder module comprises three first sub-modules connected in sequence, each first sub-module comprising a self-attention layer, a normalization layer and a fully connected layer connected in sequence;
the Decoder module comprises three second sub-modules connected in sequence, each second sub-module comprising a first self-attention layer, a first normalization layer, a second self-attention layer, a second normalization layer and a fully connected layer connected in sequence;
the output of the Encoder module is fed into the second self-attention layer of each of the three second sub-modules;
the first of the three first sub-modules in the Encoder module and the first of the three second sub-modules in the Decoder module serve as the input ends, and the third second sub-module in the Decoder module provides the output end.
In the training phase of the AttnGAN-based text-to-image network model, the result of passing the word features through the DAMSM module is compared with the result of passing the 256×256 generated image through the image encoder, and the text-to-image network model is adjusted based on the comparison result.
In the invention, the data in the data set is divided into a training set and a test set; the training set is used for training the text-to-image network model, and the test set is used for evaluating network performance.
In the invention, in order to better extract features and generate images, a Transformer module, a DAMSM module, a text encoder and an image encoder are introduced into the pre-training network, and the multi-stage generation part comprises three generators and three discriminators.
The Transformer module is a neural network based on self-attention; the text encoder is used for extracting text features; the image encoder is used for extracting image features; and the DAMSM module feeds the final synthesized image into the image encoder to obtain local image features, which are compared with the word features to measure relevance and thereby improve the correspondence between image and text.
The text-to-image network model of the invention uses three generators and three discriminators, forming three pairs; each generator and each discriminator is a convolutional neural network (CNN).
In the present invention, the structure of the Transformer module is shown in FIG. 2.
In the subsequent step 3, the feature vector (denoted A) obtained by conditioning augmentation of the sentence features and combination with random noise is input into the Transformer module.
In the Encoder module, A is flattened into a one-dimensional vector and position information is embedded, yielding the three corresponding matrices Q, K and V, which enter the self-attention layer. In the self-attention layer the weights between Q and K are computed to obtain attention scores, and the scores are applied to the V matrix. After the self-attention layer, a normalization operation is performed, and the result is sent to the fully connected layer, which outputs a one-dimensional vector. Repeating this three times, the vector output by the Encoder module is denoted B.
In the Decoder module, A is likewise flattened into a one-dimensional vector, position information is embedded to obtain the corresponding Q, K and V matrices, and these enter the first self-attention layer. After the first self-attention layer a normalization operation is performed; the result is combined with the vector B from the Encoder in the second self-attention layer and normalized again to obtain a vector C, which is then sent to the fully connected layer.
The vector output by the Decoder is the final output of the Transformer module.
Step 3: extracting text features of the text description corresponding to the image, the text features comprising word features and sentence features, combining the sentence features, after conditioning augmentation, with random noise, inputting the result into the Transformer module, and learning spatial and positional information; in particular, the Transformer module captures more spatial and positional information.
In step 3, combining the sentence features with random noise, inputting the result into the Transformer module, and learning the spatial and positional information comprises the following steps:
Step 3.1: passing the sentence feature through the conditioning augmentation module to obtain the feature vector ê:
ê = μ(ē) + σ(ē) ⊙ ε
where ē is the sentence feature of the text, μ(ē) is the mean vector of the sentence feature vector of the text, σ(ē) is the covariance matrix of the sentence feature vector of the text, and ε is a sample drawn from the unit Gaussian distribution N(0, 1).
Step 3.2: combining the obtained feature vector ê with the random noise vector z to obtain e, which is taken as the input of the Transformer module:
e = ê ⊕ z
Step 3.3: in the Transformer module, e is transformed in attention space to obtain three representation vectors
q = W_q e,  k = W_k e,  v = W_v e,
and the weight information is calculated as
α_{j,i} = exp(q_j · k_i) / Σ_{l=1}^{n} exp(q_j · k_l)
where α_{j,i} denotes the weight of the i-th position when synthesizing the j-th region of the image. Finally the image feature matrix m_j with the attention mechanism is obtained:
m_j = Σ_{i=1}^{n} α_{j,i} v_i
Step 3.4: integrating the feature matrices to obtain the Transformer output feature vector h:
h = (m_1, m_2, …, m_n)
in the invention, in step 3.3, Query refers to each Value (meaning to be searched) in the feature vector, Key is other values of the feature vector, and Value can be understood as the correlation between Query and Key; query, KeyK and Value are all directly available, and then alpha is obtainedj,iAnd mj
In the present invention, in step 3.4, the transform output eigenvector h refers to the space and position information.
In the present invention, i and j refer to one value (positive integer) from 1 to n, but the i and j values are different subsamples in the same large sample.
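The computation of α_{j,i} and m_j in step 3.3 can be sketched as follows, under the assumption that q, k and v are obtained from e by learned linear projections as in the reconstruction above; the dimensions are illustrative.

```python
import torch

def attention_features(q, k, v):
    """q, k, v: (n, d) representation vectors derived from e.
    Returns the attention weights alpha[j, i] and the attended features m[j]."""
    scores = q @ k.t()                        # score of position i for region j
    alpha = torch.softmax(scores, dim=-1)     # alpha[j, i]: weight of position i for region j
    m = alpha @ v                             # m[j] = sum_i alpha[j, i] * v[i]
    return alpha, m

# Illustrative usage with random projections of e (dimensions are assumptions).
e = torch.randn(16, 256)
Wq, Wk, Wv = (torch.randn(256, 64) for _ in range(3))
alpha, m = attention_features(e @ Wq, e @ Wk, e @ Wv)
h = m.flatten()                               # integrate the feature matrix into the output vector h
```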
Step 4: inputting the feature information learned in step 3 into the first generator, outputting a 64×64 low-resolution initial generated image, and inputting the initial generated image and the sentence features into the initial discriminator for discrimination.
In step 4, the loss function of the first generator is
L_{G1} = -(1/2) E_{x̂∼p_{G1}}[log D_1(x̂)] - (1/2) E_{x̂∼p_{G1}}[log D_1(x̂, e)] + λ L_DAMSM,  with x̂ = G_1(h),
where E denotes the mathematical expectation, G_1 and D_1 are the first generator and the initial discriminator respectively, λ is a hyper-parameter determining the influence of the DAMSM module on the generator loss function, h is the output feature vector of the Transformer, e is the text feature vector obtained by combining the sentence features with the random noise vector, and L_DAMSM is the loss function obtained from the DAMSM module of the pre-training network.
In step 4, the loss function of the initial discriminator is L_{D1} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -(1/2) E_{x∼p_data}[log D_1(x)] - (1/2) E_{x̂∼p_{G1}}[log(1 - D_1(x̂))]
L_2 = -(1/2) E_{x∼p_data}[log D_1(x, e)] - (1/2) E_{x̂∼p_{G1}}[log(1 - D_1(x̂, e))]
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the first generator, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
Step 5: down-sampling the low-resolution initial image generated in step 4 to obtain image features, inputting the word features into the global attention module to obtain new word features, inputting the two kinds of features into a convolutional neural network to learn fusion features, inputting the fusion features into the second generator to output a 128×128 image, and inputting the 128×128 image and the sentence features into the second-level discriminator for discrimination.
In step 5, the loss function of the second generator is
L_{G2} = -(1/2) E_{x̂∼p_{G2}}[log D_2(x̂)] - (1/2) E_{x̂∼p_{G2}}[log D_2(x̂, e)] + λ_1 L_DAMSM,  with x̂ = G_2(K) and K = S(F(h, e_1)),
where E denotes the mathematical expectation, K = S(F(h, e_1)), G_2 and D_2 are the second generator and the second-level discriminator respectively, λ_1 is a hyper-parameter determining the influence of the DAMSM module on the second generator loss function, h is the output feature vector of the Transformer, e_1 is the word feature, L_DAMSM is the loss function obtained from the DAMSM module of the pre-training network, F is the global attention module, and S is the down-sampling module of the neural network.
In step 5, the loss function of the second-level discriminator is L_{D2} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -(1/2) E_{x∼p_data}[log D_2(x)] - (1/2) E_{x̂∼p_{G2}}[log(1 - D_2(x̂))]
L_2 = -(1/2) E_{x∼p_data}[log D_2(x, e)] - (1/2) E_{x̂∼p_{G2}}[log(1 - D_2(x̂, e))]
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the second generator, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
Step 6: down-sampling the 128×128 image generated in step 5 to obtain image features, inputting the word features into the global attention module to obtain new word features, inputting the two kinds of features into a convolutional neural network to learn new fusion features, inputting the new fusion features into the third generator to output a 256×256 image, and inputting the 256×256 image and the sentence features into the third-level discriminator for discrimination.
In step 6, the loss function of the third generator is
L_{G3} = -(1/2) E_{x̂∼p_{G3}}[log D_3(x̂)] - (1/2) E_{x̂∼p_{G3}}[log D_3(x̂, e)] + λ_2 L_DAMSM,  with x̂ = G_3(K) and K = S(F(h, e_1)),
where K = S(F(h, e_1)), G_3 and D_3 are the third generator and the third-level discriminator respectively, λ_2 is a hyper-parameter determining the influence of the DAMSM module on the third generator loss function, h is the output feature vector of the Transformer, e_1 is the word feature, e is the text feature vector obtained by combining the sentence features with the random noise vector, L_DAMSM is the loss function obtained from the DAMSM module of the pre-training network, and F(h, e_1) is the feature vector learned by the global attention module.
In step 6, the loss function of the third-level discriminator is L_{D3} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -(1/2) E_{x∼p_data}[log D_3(x)] - (1/2) E_{x̂∼p_{G3}}[log(1 - D_3(x̂))]
L_2 = -(1/2) E_{x∼p_data}[log D_3(x, e)] - (1/2) E_{x̂∼p_{G3}}[log(1 - D_3(x̂, e))]
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the third generator, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
In the present invention, for all three discriminators, L_1 judges whether the input image is real by producing a number between 0 and 1, where 0 means the image is not real and 1 means it is real; in the same way, L_2 judges whether the input image is semantically consistent with the text by producing a number between 0 and 1, where 0 means inconsistent and 1 means consistent.
Step 7: outputting the image generated in step 6 as the final generated image.

Claims (10)

1. A method for generating an image from a text, characterized in that the method comprises the following steps:
step 1: acquiring a data set consisting of texts and corresponding images, and preprocessing the data set;
step 2: constructing an AttnGAN-based text-to-image network model, wherein the network model comprises a pre-training network and a multi-stage generation network, and the pre-training network introduces a Transformer module;
step 3: extracting text features of the text description corresponding to the image, the text features comprising word features and sentence features, combining the sentence features, after conditioning augmentation, with random noise, inputting the result into the Transformer module, and learning spatial and positional information;
step 4: inputting the feature information learned in step 3 into a first generator, outputting a 64×64 low-resolution initial generated image, and inputting the initial generated image and the sentence features into an initial discriminator for discrimination;
step 5: down-sampling the initial generated image from step 4 to obtain image features, inputting the word features into a global attention module to obtain new word features, inputting the two kinds of features into a convolutional neural network to learn fusion features, inputting the fusion features into a second generator to output a 128×128 image, and inputting the 128×128 image and the sentence features into a second-level discriminator for discrimination;
step 6: down-sampling the 128×128 image from step 5 to obtain image features, inputting the word features into the global attention module to obtain new word features, inputting the two kinds of features into a convolutional neural network to learn new fusion features, inputting the new fusion features into a third generator to output a 256×256 image, and inputting the 256×256 image and the sentence features into a third-level discriminator for discrimination;
step 7: outputting the image generated in step 6 as the final generated image.
2. The method of claim 1, wherein: in step 2, the Transformer module comprises an Encoder module and a Decoder module; the Encoder module comprises three first sub-modules connected in sequence, each first sub-module comprising a self-attention layer, a normalization layer and a fully connected layer connected in sequence;
the Decoder module comprises three second sub-modules connected in sequence, each second sub-module comprising a first self-attention layer, a first normalization layer, a second self-attention layer, a second normalization layer and a fully connected layer connected in sequence;
the output of the Encoder module is fed into the second self-attention layer of each of the three second sub-modules;
the first of the three first sub-modules in the Encoder module and the first of the three second sub-modules in the Decoder module serve as the input ends, and the third second sub-module in the Decoder module provides the output end.
3. The method of claim 1, wherein: in step 3, combining the sentence features with random noise, inputting the result into the Transformer module, and learning the spatial and positional information comprises the following steps:
step 3.1: passing the sentence feature through the conditioning augmentation module to obtain the feature vector ê:
ê = μ(ē) + σ(ē) ⊙ ε
where ē is the sentence feature of the text, μ(ē) is the mean vector of the sentence feature vector of the text, σ(ē) is the covariance matrix of the sentence feature vector of the text, and ε is a sample drawn from the unit Gaussian distribution N(0, 1);
step 3.2: combining the obtained feature vector ê with the random noise vector z to obtain e, which is taken as the input of the Transformer module:
e = ê ⊕ z;
step 3.3: in the Transformer module, transforming e in attention space to obtain three representation vectors
q = W_q e,  k = W_k e,  v = W_v e,
and calculating the weight information
α_{j,i} = exp(q_j · k_i) / Σ_{l=1}^{n} exp(q_j · k_l),
where α_{j,i} denotes the weight of the i-th position when synthesizing the j-th region of the image; finally obtaining the image feature matrix m_j with the attention mechanism:
m_j = Σ_{i=1}^{n} α_{j,i} v_i;
step 3.4: integrating the feature matrices to obtain the Transformer output feature vector h:
h = (m_1, m_2, …, m_n).
4. The method of claim 1, wherein: in step 4, the loss function of the first generator is
L_{G1} = -(1/2) E_{x̂∼p_{G1}}[log D_1(x̂)] - (1/2) E_{x̂∼p_{G1}}[log D_1(x̂, e)] + λ L_DAMSM,  with x̂ = G_1(h),
where E denotes the mathematical expectation, G_1 and D_1 are the first generator and the initial discriminator respectively, λ is a hyper-parameter determining the influence of the DAMSM module on the generator loss function, h is the output feature vector of the Transformer, e is the text feature vector obtained by combining the sentence features with the random noise vector, and L_DAMSM is the loss function obtained from the DAMSM module of the pre-training network.
5. The method of claim 1, wherein: in step 4, the loss function of the initial discriminator is L_{D1} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -(1/2) E_{x∼p_data}[log D_1(x)] - (1/2) E_{x̂∼p_{G1}}[log(1 - D_1(x̂))]
L_2 = -(1/2) E_{x∼p_data}[log D_1(x, e)] - (1/2) E_{x̂∼p_{G1}}[log(1 - D_1(x̂, e))]
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the first generator, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
6. The method of claim 1, wherein: in step 5, the loss function of the second generator is
L_{G2} = -(1/2) E_{x̂∼p_{G2}}[log D_2(x̂)] - (1/2) E_{x̂∼p_{G2}}[log D_2(x̂, e)] + λ_1 L_DAMSM,  with x̂ = G_2(K) and K = S(F(h, e_1)),
where E denotes the mathematical expectation, K = S(F(h, e_1)), G_2 and D_2 are the second generator and the second-level discriminator respectively, λ_1 is a hyper-parameter determining the influence of the DAMSM module on the second generator loss function, h is the output feature vector of the Transformer, e_1 is the word feature, L_DAMSM is the loss function obtained from the DAMSM module of the pre-training network, F is the global attention module, and S is the down-sampling module of the neural network.
7. The method of claim 1, wherein: in step 5, the loss function of the second-level discriminator is L_{D2} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -(1/2) E_{x∼p_data}[log D_2(x)] - (1/2) E_{x̂∼p_{G2}}[log(1 - D_2(x̂))]
L_2 = -(1/2) E_{x∼p_data}[log D_2(x, e)] - (1/2) E_{x̂∼p_{G2}}[log(1 - D_2(x̂, e))]
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the second generator, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
8. The method of claim 1, wherein: in step 6, the loss function of the third generator is
L_{G3} = -(1/2) E_{x̂∼p_{G3}}[log D_3(x̂)] - (1/2) E_{x̂∼p_{G3}}[log D_3(x̂, e)] + λ_2 L_DAMSM,  with x̂ = G_3(K) and K = S(F(h, e_1)),
where K = S(F(h, e_1)), G_3 and D_3 are the third generator and the third-level discriminator respectively, λ_2 is a hyper-parameter determining the influence of the DAMSM module on the third generator loss function, h is the output feature vector of the Transformer, e_1 is the word feature, e is the text feature vector obtained by combining the sentence features with the random noise vector, L_DAMSM is the loss function obtained from the DAMSM module of the pre-training network, and F(h, e_1) is the feature vector learned by the global attention module.
9. The method of claim 1, wherein: in step 6, the loss function of the third-level discriminator is L_{D3} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -(1/2) E_{x∼p_data}[log D_3(x)] - (1/2) E_{x̂∼p_{G3}}[log(1 - D_3(x̂))]
L_2 = -(1/2) E_{x∼p_data}[log D_3(x, e)] - (1/2) E_{x̂∼p_{G3}}[log(1 - D_3(x̂, e))]
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the third generator, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
10. The method of claim 1, wherein: in the training phase of the AttnGAN-based text-to-image network model, the result of passing the word features through the DAMSM module is compared with the result of passing the 256×256 image through the image encoder, and the text-to-image network model is adjusted based on the comparison result.
CN202111109265.1A 2021-09-22 2021-09-22 Text image generation method Pending CN114022582A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111109265.1A CN114022582A (en) 2021-09-22 2021-09-22 Text image generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111109265.1A CN114022582A (en) 2021-09-22 2021-09-22 Text image generation method

Publications (1)

Publication Number Publication Date
CN114022582A true CN114022582A (en) 2022-02-08

Family

ID=80054526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111109265.1A Pending CN114022582A (en) 2021-09-22 2021-09-22 Text image generation method

Country Status (1)

Country Link
CN (1) CN114022582A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116108157A (en) * 2023-04-11 2023-05-12 阿里巴巴达摩院(杭州)科技有限公司 Method for training text generation model, text generation method and device
CN116108157B (en) * 2023-04-11 2023-09-12 阿里巴巴达摩院(杭州)科技有限公司 Method for training text generation model, text generation method and device
CN116630465A (en) * 2023-07-24 2023-08-22 海信集团控股股份有限公司 Model training and image generating method and device
CN116630465B (en) * 2023-07-24 2023-10-24 海信集团控股股份有限公司 Model training and image generating method and device
CN118097318A (en) * 2024-04-28 2024-05-28 武汉大学 Controllable defect image generation method and device based on visual semantic fusion
CN118314246A (en) * 2024-06-11 2024-07-09 西南科技大学 Training method and training system for text synthesized image
CN118314246B (en) * 2024-06-11 2024-08-20 西南科技大学 Training method and training system for text synthesized image

Similar Documents

Publication Publication Date Title
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN111340122B (en) Multi-modal feature fusion text-guided image restoration method
CN114022582A (en) Text image generation method
CN110706302B (en) System and method for synthesizing images by text
CN111160343B (en) Off-line mathematical formula symbol identification method based on Self-Attention
CN111968193B (en) Text image generation method based on StackGAN (secure gas network)
CN113343705B (en) Text semantic based detail preservation image generation method and system
CN108765279A (en) A kind of pedestrian's face super-resolution reconstruction method towards monitoring scene
CN111325660B (en) Remote sensing image style conversion method based on text data
CN113140020B (en) Method for generating image based on text of countermeasure network generated by accompanying supervision
Naveen et al. Transformer models for enhancing AttnGAN based text to image generation
CN113837229B (en) Knowledge-driven text-to-image generation method
CN113140023B (en) Text-to-image generation method and system based on spatial attention
CN111402365A (en) Method for generating picture from characters based on bidirectional architecture confrontation generation network
CN113378949A (en) Dual-generation confrontation learning method based on capsule network and mixed attention
CN117058673A (en) Text generation image model training method and system and text generation image method and system
CN116721176A (en) Text-to-face image generation method and device based on CLIP supervision
CN111339734A (en) Method for generating image based on text
Kasi et al. A deep learning based cross model text to image generation using DC-GAN
CN113658285B (en) Method for generating face photo to artistic sketch
CN115496134A (en) Traffic scene video description generation method and device based on multi-modal feature fusion
Hou et al. A single-stage multi-class object detection method for remote sensing images
Bayoumi et al. An intelligent hybrid text-to-image synthesis model for generating realistic human faces
Gajendran et al. Text to Image Synthesis Using Bridge Generative Adversarial Network and Char CNN Model
Kaddoura Real-world applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination