CN114022582A - Text image generation method - Google Patents

Text image generation method Download PDF

Info

Publication number
CN114022582A
Authority
CN
China
Prior art keywords
module
image
text
features
generator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111109265.1A
Other languages
Chinese (zh)
Inventor
姚信威
张馨戈
王佐响
杨啸天
齐楚锋
邢伟伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202111109265.1A priority Critical patent/CN114022582A/en
Publication of CN114022582A publication Critical patent/CN114022582A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/001 Texturing; Colouring; Generation of texture or colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a text-to-image generation method based on a Transformer module and the AttnGAN network. A text encoder encodes the text into sentence features and word features; the sentence features pass through a conditioning augmentation module, are fused with a random noise vector, and are input into the Transformer module for learning, which outputs an improved feature vector. The improved feature vector is input into a generator to produce a rough 64×64-pixel initial image, and the initial synthesized image and the improved feature vector are input into a discriminator for discrimination; the generator is trained according to a loss function. The improved feature vector and the word features are then fed in turn into a neural network for up-sampling to obtain fusion vectors, which are input into further generators to obtain 128×128-pixel and 256×256-pixel images. Compared with images generated by the conventional AttnGAN method, the images generated by this method have clearer details and contours.

Description

Text image generation method
Technical Field
The invention relates to the technical field of general image data processing or generation, and in particular to a text-to-image generation method based on a Transformer module and the AttnGAN network, belonging to the fields of computer vision and natural language processing.
Background
The rapid development of modern science and technology has driven theoretical and technical progress in computer vision and natural language processing. Generating images from text descriptions is a comprehensive task spanning these two fields; it has great application potential and is expected to play an important role in criminal investigation, data augmentation, design, and other areas.
Early text-to-image generation mainly combined retrieval with supervised learning, but such methods could only alter specific image attributes. Reed et al. were the first to use a generative adversarial network (GAN) to generate images from text, which not only changes image content according to the text but also laid the foundation for subsequent development.
StackGAN generates a low-resolution image in a first stage and then refines its details in a second stage to increase the resolution; StackGAN++ reduces training instability by adding a color-consistency regularization term to the loss. However, GAN-INT-CLS, StackGAN and StackGAN++ all use only whole-sentence information as the text feature, which can cause important details to be lost in the synthesized image. AttnGAN was therefore proposed: it introduces a global attention mechanism and extracts features from the text at both the sentence level and the word level, using both as input text features. This improves the relevance between text and image, and the attention mechanism also improves the quality of the generated images.
Although these methods have continuously improved the quality and resolution of generated images, problems of implausible layouts and unclear details remain.
Disclosure of Invention
The invention addresses the problems of the prior art by providing a text-to-image generation method based on a Transformer module and the AttnGAN network, alleviating issues such as implausible images and blurred details.
The technical solution adopted by the invention is a text-to-image generation method comprising the following steps:
Step 1: acquiring a data set consisting of texts and corresponding images, and preprocessing the data set;
Step 2: constructing an AttnGAN-based text-to-image network model, wherein the network model comprises a pre-training network and a multi-stage generation network, and the pre-training network introduces a Transformer module;
Step 3: extracting text features of the text description corresponding to the image, the text features comprising word features and sentence features, combining the sentence features, after conditioning augmentation, with random noise, inputting the result into the Transformer module, and learning spatial and positional information;
Step 4: inputting the feature information learned in step 3 into a first generator, outputting a 64×64 low-resolution initial generated image, and inputting the initial generated image and the sentence features into an initial discriminator for discrimination;
Step 5: down-sampling the initial generated image from step 4 to obtain image features, inputting the word features into a global attention module to obtain new word features, inputting the two kinds of features into a convolutional neural network to learn fusion features, inputting the fusion features into a second generator to output a 128×128 image, and inputting the 128×128 image and the sentence features into a second-level discriminator for discrimination;
Step 6: down-sampling the 128×128 image from step 5 to obtain image features, inputting the word features into the global attention module to obtain new word features, inputting the two kinds of features into a convolutional neural network to learn new fusion features, inputting the new fusion features into a third generator to output a 256×256 image, and inputting the 256×256 image and the sentence features into a third-level discriminator for discrimination;
Step 7: outputting the image generated in step 6 as the final generated image.
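Steps 1 to 7 can be summarized as a single forward pass. The sketch below is a hypothetical PyTorch wiring of the pipeline, assuming that the component modules (text_encoder, the conditioning augmentation ca, the Transformer module, generators G1–G3, the global attention module attn_F, the down-sampling module down_S and the fusion CNNs fuse2/fuse3) are defined and trained elsewhere; the names and interfaces are illustrative and do not reproduce the exact implementation of the invention.

```python
import torch

def generate_images(text_tokens, text_encoder, ca, transformer,
                    G1, G2, G3, attn_F, down_S, fuse2, fuse3, z_dim=100):
    """Hypothetical forward pass of the three-stage pipeline (a sketch only)."""
    sent_feat, word_feats = text_encoder(text_tokens)            # step 3: sentence and word features
    e_hat = ca(sent_feat)                                        # conditioning augmentation
    z = torch.randn(e_hat.size(0), z_dim, device=e_hat.device)   # random noise vector
    e = torch.cat([e_hat, z], dim=1)                             # combined Transformer input
    h = transformer(e)                                           # improved feature vector

    img64 = G1(h)                                                # step 4: 64x64 initial image
    feat64 = down_S(img64)                                       # step 5: down-sample the image
    k2 = fuse2(torch.cat([feat64, attn_F(word_feats, feat64)], dim=1))
    img128 = G2(k2)                                              # 128x128 image

    feat128 = down_S(img128)                                     # step 6: down-sample again
    k3 = fuse3(torch.cat([feat128, attn_F(word_feats, feat128)], dim=1))
    img256 = G3(k3)                                              # 256x256 final image (step 7)
    return img64, img128, img256
```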
Preferably, in step 2, the Transformer module comprises an Encoder module and a Decoder module;
the Encoder module comprises three first sub-modules connected in sequence, each first sub-module comprising a self-attention layer, a normalization layer and a fully connected layer connected in sequence;
the Decoder module comprises three second sub-modules connected in sequence, each second sub-module comprising a first self-attention layer, a first normalization layer, a second self-attention layer, a second normalization layer and a fully connected layer connected in sequence;
the output of the Encoder module is fed into the second self-attention layer of each of the three second sub-modules;
the first of the three first sub-modules in the Encoder module and the first of the three second sub-modules in the Decoder module serve as the input ends, and the third second sub-module in the Decoder module provides the output end.
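For illustration, a minimal PyTorch sketch of this Encoder/Decoder layout is given below. The class names, layer widths and head counts are assumptions, and the residual additions around the attention layers follow the data flow described later in the detailed description; this is not the exact configuration of the invention.

```python
import torch
import torch.nn as nn

class EncoderSubModule(nn.Module):
    """First sub-module: self-attention -> normalization -> fully connected layer."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):
        a, _ = self.attn(x, x, x)          # self-attention over the input sequence
        return self.fc(self.norm(x + a))   # normalize, then fully connected layer

class DecoderSubModule(nn.Module):
    """Second sub-module: two attention layers, two normalizations, one fully connected layer."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn1 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.attn2 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, dim)

    def forward(self, x, enc_out):
        a1, _ = self.attn1(x, x, x)                 # first self-attention layer
        x = self.norm1(x + a1)
        a2, _ = self.attn2(x, enc_out, enc_out)     # second attention layer receives the Encoder output
        x = self.norm2(x + a2)
        return self.fc(x)

class TransformerModule(nn.Module):
    """Encoder and Decoder, each built from three sub-modules connected in sequence."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.encoder = nn.ModuleList([EncoderSubModule(dim, heads) for _ in range(3)])
        self.decoder = nn.ModuleList([DecoderSubModule(dim, heads) for _ in range(3)])

    def forward(self, e):
        # e: (batch, tokens, dim) -- the fused feature arranged as a token sequence (illustrative)
        enc = e
        for block in self.encoder:
            enc = block(enc)               # output B of the Encoder
        dec = e
        for block in self.decoder:
            dec = block(dec, enc)          # Encoder output enters every second attention layer
        return dec                         # Transformer output feature vector h
```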
Preferably, in step 3, combining the sentence features with random noise, inputting the result into the Transformer module, and learning the spatial and positional information comprises the following steps:
Step 3.1: passing the sentence feature through the conditioning augmentation module to obtain the feature vector ê:
ê = μ(ē) + σ(ē) ⊙ ε
where ē is the sentence feature of the text, μ(ē) is the mean vector of the sentence feature vector of the text, σ(ē) is the covariance matrix of the sentence feature vector of the text, and ε is a sample drawn from the unit Gaussian distribution N(0, 1).
Step 3.2: combining the obtained feature vector ê with the random noise vector z to obtain e, which is taken as the input of the Transformer module:
e = ê ⊕ z
Step 3.3: in the Transformer module, e is transformed in attention space to obtain three representation vectors
q = W_q e,  k = W_k e,  v = W_v e,
and the weight information is calculated as
α_{j,i} = exp(q_j · k_i) / Σ_{l=1}^{n} exp(q_j · k_l)
where α_{j,i} denotes the weight of the i-th position when synthesizing the j-th region of the image. Finally the image feature matrix m_j with the attention mechanism is obtained:
m_j = Σ_{i=1}^{n} α_{j,i} v_i
Step 3.4: integrating the feature matrices to obtain the Transformer output feature vector h:
h = (m_1, m_2, …, m_n)
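A small sketch of steps 3.1 and 3.2 (conditioning augmentation followed by concatenation with noise) is shown below, assuming a PyTorch implementation in which the augmentation network predicts a mean and a log-variance from the sentence feature; the dimensions and class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Maps the sentence feature e_bar to mu and sigma and samples e_hat = mu + sigma * eps."""
    def __init__(self, sent_dim=256, cond_dim=100):
        super().__init__()
        self.fc = nn.Linear(sent_dim, cond_dim * 2)

    def forward(self, e_bar):
        stats = self.fc(e_bar)
        mu, log_sigma = stats.chunk(2, dim=1)
        eps = torch.randn_like(mu)            # sample from the unit Gaussian N(0, 1)
        return mu + log_sigma.exp() * eps     # e_hat

# Step 3.2: combine e_hat with a random noise vector z as the Transformer input.
ca = ConditioningAugmentation()
e_bar = torch.randn(4, 256)                   # batch of sentence features (illustrative)
e_hat = ca(e_bar)
z = torch.randn(4, 100)
e = torch.cat([e_hat, z], dim=1)              # input e of the Transformer module
```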
preferably, in the step 4, the loss function of the first generator is
Figure BDA0003273673700000042
Wherein,
Figure BDA0003273673700000043
representing a mathematical expectation, G1And D1The method comprises the steps that a first generator and an initial discriminator are respectively provided, lambda represents a hyper-parameter for determining the influence of a DAMSM module on a generator loss function, h represents an output characteristic vector of a Transformer, e represents a text characteristic vector after sentence characteristics and random noise vectors are combined, and LDAMSMRepresenting the loss function derived by the DAMSM module of the training network.
Preferably, in step 4, the loss function of the initial discriminator is L_{D1} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -(1/2) E_{x∼p_data}[log D_1(x)] - (1/2) E_{x̂∼p_{G1}}[log(1 - D_1(x̂))]
L_2 = -(1/2) E_{x∼p_data}[log D_1(x, e)] - (1/2) E_{x̂∼p_{G1}}[log(1 - D_1(x̂, e))]
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the first generator, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
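As an illustration of the stage-1 objectives above, the following sketch computes the generator and initial-discriminator losses under the assumption that D1 can be called with or without the text feature e and returns a probability in (0, 1); the formulas follow the reconstruction given above (standard conditional and unconditional GAN terms plus a DAMSM term) and are not guaranteed to match the exact implementation.

```python
import torch

def generator_loss_stage1(D1, fake_img, e, damsm_loss, lam=5.0):
    # -1/2 E[log D1(x_hat)] - 1/2 E[log D1(x_hat, e)] + lambda * L_DAMSM
    p_uncond = D1(fake_img)               # probability that the image is real
    p_cond = D1(fake_img, e)              # probability that the image matches the text
    adv = (-0.5 * torch.log(p_uncond + 1e-8).mean()
           - 0.5 * torch.log(p_cond + 1e-8).mean())
    return adv + lam * damsm_loss

def discriminator_loss_stage1(D1, real_img, fake_img, e):
    # L1: real vs. generated;  L2: semantic consistency with the text feature e
    l1 = (-0.5 * torch.log(D1(real_img) + 1e-8).mean()
          - 0.5 * torch.log(1 - D1(fake_img.detach()) + 1e-8).mean())
    l2 = (-0.5 * torch.log(D1(real_img, e) + 1e-8).mean()
          - 0.5 * torch.log(1 - D1(fake_img.detach(), e) + 1e-8).mean())
    return l1 + l2
```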
Preferably, in step 5, the loss function of the second generator is
L_{G2} = -(1/2) E_{x̂∼p_{G2}}[log D_2(x̂)] - (1/2) E_{x̂∼p_{G2}}[log D_2(x̂, e)] + λ_1 L_DAMSM,  with x̂ = G_2(K) and K = S(F(h, e_1)),
where E denotes the mathematical expectation, K = S(F(h, e_1)), G_2 and D_2 are the second generator and the second-level discriminator respectively, λ_1 is a hyper-parameter determining the influence of the DAMSM module on the second generator loss function, h is the output feature vector of the Transformer, e_1 is the word feature, L_DAMSM is the loss function obtained from the DAMSM module of the pre-training network, F is the global attention module, and S is the down-sampling module of the neural network.
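The term K = S(F(h, e_1)) can be illustrated with the hypothetical modules below, where F is modeled as a word-to-region attention (in the spirit of AttnGAN's attention module) and S as a strided-convolution down-sampling block; the shapes, module names and the exact form of the fusion are assumptions made only for this sketch.

```python
import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    """F: attends word features e1 over region features derived from h."""
    def __init__(self, word_dim=256, region_dim=256):
        super().__init__()
        self.proj = nn.Linear(word_dim, region_dim)   # map words into the region space

    def forward(self, h, e1):
        # h: (batch, regions, region_dim)   e1: (batch, words, word_dim)
        w = self.proj(e1)                                       # (batch, words, region_dim)
        attn = torch.softmax(h @ w.transpose(1, 2), dim=-1)     # region-to-word weights
        return attn @ w                                         # word context per region

class DownSample(nn.Module):
    """S: a simple strided-convolution down-sampling block."""
    def __init__(self, channels=256):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

F_mod, S_mod = GlobalAttention(), DownSample()
h = torch.randn(4, 64, 256)        # Transformer output arranged as region features (assumed shape)
e1 = torch.randn(4, 18, 256)       # word features (assumed shape)
K = S_mod(F_mod(h, e1))            # fusion feature fed to the second generator
```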
Preferably, in step 5, the loss function of the second-level discriminator is L_{D2} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -(1/2) E_{x∼p_data}[log D_2(x)] - (1/2) E_{x̂∼p_{G2}}[log(1 - D_2(x̂))]
L_2 = -(1/2) E_{x∼p_data}[log D_2(x, e)] - (1/2) E_{x̂∼p_{G2}}[log(1 - D_2(x̂, e))]
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the second generator, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
Preferably, in step 6, the loss function of the third generator is
L_{G3} = -(1/2) E_{x̂∼p_{G3}}[log D_3(x̂)] - (1/2) E_{x̂∼p_{G3}}[log D_3(x̂, e)] + λ_2 L_DAMSM,  with x̂ = G_3(K) and K = S(F(h, e_1)),
where K = S(F(h, e_1)), G_3 and D_3 are the third generator and the third-level discriminator respectively, λ_2 is a hyper-parameter determining the influence of the DAMSM module on the third generator loss function, h is the output feature vector of the Transformer, e_1 is the word feature, e is the text feature vector obtained by combining the sentence features with the random noise vector, L_DAMSM is the loss function obtained from the DAMSM module of the pre-training network, and F(h, e_1) is the feature vector learned by the global attention module.
Preferably, in step 6, the loss function of the third-level discriminator is L_{D3} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -(1/2) E_{x∼p_data}[log D_3(x)] - (1/2) E_{x̂∼p_{G3}}[log(1 - D_3(x̂))]
L_2 = -(1/2) E_{x∼p_data}[log D_3(x, e)] - (1/2) E_{x̂∼p_{G3}}[log(1 - D_3(x̂, e))]
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the third generator, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
Preferably, in the training phase of the AttnGAN-based text-to-image network model, the result of passing the word features through the DAMSM module is compared with the result of passing the 256×256 image through the image encoder, and the text-to-image network model is adjusted based on the comparison result.
The invention provides a text-to-image generation method based on a Transformer module and the AttnGAN network. The text is encoded by a text encoder to obtain sentence features and word features; the sentence features pass through a conditioning augmentation module to obtain a feature vector, which is fused with a random noise vector and input into the Transformer module for learning; the improved feature vector output by the Transformer is input into a generator to produce a rough 64×64-pixel initial image; the initial synthesized image and the improved feature vector are input into a discriminator for discrimination, and the generator is trained according to the loss function. The improved feature vector and the word features are then fed in turn into a neural network for up-sampling to obtain fusion vectors, which are input into further generators to obtain 128×128-pixel and 256×256-pixel images.
Compared with the details and contours of images generated by the conventional AttnGAN method, the images generated by the method of the invention are clearer.
Drawings
FIG. 1 is a diagram of a network architecture of the present invention;
FIG. 2 is a structural diagram of a Transformer module in the present invention.
Detailed Description
The present invention is described in further detail with reference to the following examples, but the scope of the present invention is not limited thereto.
The invention relates to a text-to-image generation method based on a Transformer module and AttnGAN, comprising the following steps.
Step 1: acquiring a data set consisting of texts and corresponding images, and preprocessing the data set.
In the invention, the data set consists of texts and corresponding images. Preprocessing comprises manually screening the images and texts in the data set, removing ambiguous text–image pairs, and correcting texts that describe their images inaccurately.
Step 2: constructing an AttnGAN-based text generation image network model, wherein the network model comprises a pre-training network and a multi-stage generation network, and the pre-training network introduces a Transformer module;
In step 2, the Transformer module comprises an Encoder module and a Decoder module;
the Encoder module comprises three first sub-modules connected in sequence, each first sub-module comprising a self-attention layer, a normalization layer and a fully connected layer connected in sequence;
the Decoder module comprises three second sub-modules connected in sequence, each second sub-module comprising a first self-attention layer, a first normalization layer, a second self-attention layer, a second normalization layer and a fully connected layer connected in sequence;
the output of the Encoder module is fed into the second self-attention layer of each of the three second sub-modules;
the first of the three first sub-modules in the Encoder module and the first of the three second sub-modules in the Decoder module serve as the input ends, and the third second sub-module in the Decoder module provides the output end.
In the training phase of the AttnGAN-based text-to-image network model, the result of passing the word features through the DAMSM module is compared with the result of passing the 256×256 generated image through the image encoder, and the text-to-image network model is adjusted based on the comparison result.
In the invention, the data in the data set is divided into a training set and a test set; the training set is used for training the text-to-image network model, and the test set is used for evaluating network performance.
In the invention, in order to better extract features and generate images, a Transformer module, a DAMSM module, a text encoder and an image encoder are introduced into the pre-training network, and the multi-stage generation part comprises three generators and three discriminators.
The Transformer module is a neural network based on self-attention; the text encoder is used for extracting text features; the image encoder is used for extracting image features; and the DAMSM module feeds the final synthesized image into the image encoder to obtain local image features, which are compared with the word features to measure relevance and thereby improve the correspondence between image and text.
The text-to-image network model of the invention uses three generators and three discriminators, forming three pairs; each generator and each discriminator is a convolutional neural network (CNN).
In the present invention, the structure of the Transformer module is shown in FIG. 2.
In the subsequent step 3, the feature vector (denoted A) obtained by conditioning augmentation of the sentence features and combination with random noise is input into the Transformer module.
In the Encoder module, A is flattened into a one-dimensional vector and position information is embedded, yielding the three corresponding matrices Q, K and V, which enter the self-attention layer. In the self-attention layer the weights between Q and K are computed to obtain attention scores, and the scores are applied to the V matrix. After the self-attention layer, a normalization operation is performed, and the result is sent to the fully connected layer, which outputs a one-dimensional vector. Repeating this three times, the vector output by the Encoder module is denoted B.
In the Decoder module, A is likewise flattened into a one-dimensional vector, position information is embedded to obtain the corresponding Q, K and V matrices, and these enter the first self-attention layer. After the first self-attention layer a normalization operation is performed; the result is combined with the vector B from the Encoder in the second self-attention layer and normalized again to obtain a vector C, which is then sent to the fully connected layer.
The vector output by the Decoder is the final output of the Transformer module.
Step 3: extracting text features of the text description corresponding to the image, the text features comprising word features and sentence features, combining the sentence features, after conditioning augmentation, with random noise, inputting the result into the Transformer module, and learning spatial and positional information; in particular, the Transformer module captures more spatial and positional information.
In step 3, combining the sentence features with random noise, inputting the result into the Transformer module, and learning the spatial and positional information comprises the following steps:
Step 3.1: passing the sentence feature through the conditioning augmentation module to obtain the feature vector ê:
ê = μ(ē) + σ(ē) ⊙ ε
where ē is the sentence feature of the text, μ(ē) is the mean vector of the sentence feature vector of the text, σ(ē) is the covariance matrix of the sentence feature vector of the text, and ε is a sample drawn from the unit Gaussian distribution N(0, 1).
Step 3.2: combining the obtained feature vector ê with the random noise vector z to obtain e, which is taken as the input of the Transformer module:
e = ê ⊕ z
Step 3.3: in the Transformer module, e is transformed in attention space to obtain three representation vectors
q = W_q e,  k = W_k e,  v = W_v e,
and the weight information is calculated as
α_{j,i} = exp(q_j · k_i) / Σ_{l=1}^{n} exp(q_j · k_l)
where α_{j,i} denotes the weight of the i-th position when synthesizing the j-th region of the image. Finally the image feature matrix m_j with the attention mechanism is obtained:
m_j = Σ_{i=1}^{n} α_{j,i} v_i
Step 3.4: integrating the feature matrices to obtain the Transformer output feature vector h:
h = (m_1, m_2, …, m_n)
in the invention, in step 3.3, Query refers to each Value (meaning to be searched) in the feature vector, Key is other values of the feature vector, and Value can be understood as the correlation between Query and Key; query, KeyK and Value are all directly available, and then alpha is obtainedj,iAnd mj
In the present invention, in step 3.4, the transform output eigenvector h refers to the space and position information.
In the present invention, i and j refer to one value (positive integer) from 1 to n, but the i and j values are different subsamples in the same large sample.
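The computation of α_{j,i} and m_j in step 3.3 can be sketched as follows, under the assumption that q, k and v are obtained from e by learned linear projections as in the reconstruction above; the dimensions are illustrative.

```python
import torch

def attention_features(q, k, v):
    """q, k, v: (n, d) representation vectors derived from e.
    Returns the attention weights alpha[j, i] and the attended features m[j]."""
    scores = q @ k.t()                        # score of position i for region j
    alpha = torch.softmax(scores, dim=-1)     # alpha[j, i]: weight of position i for region j
    m = alpha @ v                             # m[j] = sum_i alpha[j, i] * v[i]
    return alpha, m

# Illustrative usage with random projections of e (dimensions are assumptions).
e = torch.randn(16, 256)
Wq, Wk, Wv = (torch.randn(256, 64) for _ in range(3))
alpha, m = attention_features(e @ Wq, e @ Wk, e @ Wv)
h = m.flatten()                               # integrate the feature matrix into the output vector h
```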
Step 4: inputting the feature information learned in step 3 into the first generator, outputting a 64×64 low-resolution initial generated image, and inputting the initial generated image and the sentence features into the initial discriminator for discrimination.
In step 4, the loss function of the first generator is
L_{G1} = -(1/2) E_{x̂∼p_{G1}}[log D_1(x̂)] - (1/2) E_{x̂∼p_{G1}}[log D_1(x̂, e)] + λ L_DAMSM,  with x̂ = G_1(h),
where E denotes the mathematical expectation, G_1 and D_1 are the first generator and the initial discriminator respectively, λ is a hyper-parameter determining the influence of the DAMSM module on the generator loss function, h is the output feature vector of the Transformer, e is the text feature vector obtained by combining the sentence features with the random noise vector, and L_DAMSM is the loss function obtained from the DAMSM module of the pre-training network.
In step 4, the loss function of the initial discriminator is L_{D1} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -(1/2) E_{x∼p_data}[log D_1(x)] - (1/2) E_{x̂∼p_{G1}}[log(1 - D_1(x̂))]
L_2 = -(1/2) E_{x∼p_data}[log D_1(x, e)] - (1/2) E_{x̂∼p_{G1}}[log(1 - D_1(x̂, e))]
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the first generator, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
Step 5: down-sampling the low-resolution initial image generated in step 4 to obtain image features, inputting the word features into the global attention module to obtain new word features, inputting the two kinds of features into a convolutional neural network to learn fusion features, inputting the fusion features into the second generator to output a 128×128 image, and inputting the 128×128 image and the sentence features into the second-level discriminator for discrimination.
In step 5, the loss function of the second generator is
L_{G2} = -(1/2) E_{x̂∼p_{G2}}[log D_2(x̂)] - (1/2) E_{x̂∼p_{G2}}[log D_2(x̂, e)] + λ_1 L_DAMSM,  with x̂ = G_2(K) and K = S(F(h, e_1)),
where E denotes the mathematical expectation, K = S(F(h, e_1)), G_2 and D_2 are the second generator and the second-level discriminator respectively, λ_1 is a hyper-parameter determining the influence of the DAMSM module on the second generator loss function, h is the output feature vector of the Transformer, e_1 is the word feature, L_DAMSM is the loss function obtained from the DAMSM module of the pre-training network, F is the global attention module, and S is the down-sampling module of the neural network.
In step 5, the loss function of the second-level discriminator is L_{D2} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -(1/2) E_{x∼p_data}[log D_2(x)] - (1/2) E_{x̂∼p_{G2}}[log(1 - D_2(x̂))]
L_2 = -(1/2) E_{x∼p_data}[log D_2(x, e)] - (1/2) E_{x̂∼p_{G2}}[log(1 - D_2(x̂, e))]
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the second generator, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
Step 6: down-sampling the 128×128 image generated in step 5 to obtain image features, inputting the word features into the global attention module to obtain new word features, inputting the two kinds of features into a convolutional neural network to learn new fusion features, inputting the new fusion features into the third generator to output a 256×256 image, and inputting the 256×256 image and the sentence features into the third-level discriminator for discrimination.
In step 6, the loss function of the third generator is
L_{G3} = -(1/2) E_{x̂∼p_{G3}}[log D_3(x̂)] - (1/2) E_{x̂∼p_{G3}}[log D_3(x̂, e)] + λ_2 L_DAMSM,  with x̂ = G_3(K) and K = S(F(h, e_1)),
where K = S(F(h, e_1)), G_3 and D_3 are the third generator and the third-level discriminator respectively, λ_2 is a hyper-parameter determining the influence of the DAMSM module on the third generator loss function, h is the output feature vector of the Transformer, e_1 is the word feature, e is the text feature vector obtained by combining the sentence features with the random noise vector, L_DAMSM is the loss function obtained from the DAMSM module of the pre-training network, and F(h, e_1) is the feature vector learned by the global attention module.
In step 6, the loss function of the third-level discriminator is L_{D3} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -(1/2) E_{x∼p_data}[log D_3(x)] - (1/2) E_{x̂∼p_{G3}}[log(1 - D_3(x̂))]
L_2 = -(1/2) E_{x∼p_data}[log D_3(x, e)] - (1/2) E_{x̂∼p_{G3}}[log(1 - D_3(x̂, e))]
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the third generator, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
In the present invention, for all three discriminators, L_1 judges whether the input image is real by producing a number between 0 and 1, where 0 means the image is not real and 1 means it is real; in the same way, L_2 judges whether the input image is semantically consistent with the text by producing a number between 0 and 1, where 0 means inconsistent and 1 means consistent.
Step 7: outputting the image generated in step 6 as the final generated image.

Claims (10)

1. A method for generating an image from a text, characterized in that the method comprises the following steps:
step 1: acquiring a data set consisting of texts and corresponding images, and preprocessing the data set;
step 2: constructing an AttnGAN-based text-to-image network model, wherein the network model comprises a pre-training network and a multi-stage generation network, and the pre-training network introduces a Transformer module;
step 3: extracting text features of the text description corresponding to the image, the text features comprising word features and sentence features, combining the sentence features, after conditioning augmentation, with random noise, inputting the result into the Transformer module, and learning spatial and positional information;
step 4: inputting the feature information learned in step 3 into a first generator, outputting a 64×64 low-resolution initial generated image, and inputting the initial generated image and the sentence features into an initial discriminator for discrimination;
step 5: down-sampling the initial generated image from step 4 to obtain image features, inputting the word features into a global attention module to obtain new word features, inputting the two kinds of features into a convolutional neural network to learn fusion features, inputting the fusion features into a second generator to output a 128×128 image, and inputting the 128×128 image and the sentence features into a second-level discriminator for discrimination;
step 6: down-sampling the 128×128 image from step 5 to obtain image features, inputting the word features into the global attention module to obtain new word features, inputting the two kinds of features into a convolutional neural network to learn new fusion features, inputting the new fusion features into a third generator to output a 256×256 image, and inputting the 256×256 image and the sentence features into a third-level discriminator for discrimination;
step 7: outputting the image generated in step 6 as the final generated image.
2. The method of claim 1, wherein: in step 2, the Transformer module comprises an Encoder module and a Decoder module; the Encoder module comprises three first sub-modules connected in sequence, each first sub-module comprising a self-attention layer, a normalization layer and a fully connected layer connected in sequence;
the Decoder module comprises three second sub-modules connected in sequence, each second sub-module comprising a first self-attention layer, a first normalization layer, a second self-attention layer, a second normalization layer and a fully connected layer connected in sequence;
the output of the Encoder module is fed into the second self-attention layer of each of the three second sub-modules;
the first of the three first sub-modules in the Encoder module and the first of the three second sub-modules in the Decoder module serve as the input ends, and the third second sub-module in the Decoder module provides the output end.
3. The method of claim 1, wherein: in step 3, combining the sentence features with random noise, inputting the result into the Transformer module, and learning the spatial and positional information comprises the following steps:
step 3.1: passing the sentence feature through the conditioning augmentation module to obtain the feature vector ê:
ê = μ(ē) + σ(ē) ⊙ ε
where ē is the sentence feature of the text, μ(ē) is the mean vector of the sentence feature vector of the text, σ(ē) is the covariance matrix of the sentence feature vector of the text, and ε is a sample drawn from the unit Gaussian distribution N(0, 1);
step 3.2: combining the obtained feature vector ê with the random noise vector z to obtain e, which is taken as the input of the Transformer module:
e = ê ⊕ z;
step 3.3: in the Transformer module, transforming e in attention space to obtain three representation vectors
q = W_q e,  k = W_k e,  v = W_v e,
and calculating the weight information
α_{j,i} = exp(q_j · k_i) / Σ_{l=1}^{n} exp(q_j · k_l),
where α_{j,i} denotes the weight of the i-th position when synthesizing the j-th region of the image; finally obtaining the image feature matrix m_j with the attention mechanism:
m_j = Σ_{i=1}^{n} α_{j,i} v_i;
step 3.4: integrating the feature matrices to obtain the Transformer output feature vector h:
h = (m_1, m_2, …, m_n).
4. The method of claim 1, wherein: in step 4, the loss function of the first generator is
L_{G1} = -(1/2) E_{x̂∼p_{G1}}[log D_1(x̂)] - (1/2) E_{x̂∼p_{G1}}[log D_1(x̂, e)] + λ L_DAMSM,  with x̂ = G_1(h),
where E denotes the mathematical expectation, G_1 and D_1 are the first generator and the initial discriminator respectively, λ is a hyper-parameter determining the influence of the DAMSM module on the generator loss function, h is the output feature vector of the Transformer, e is the text feature vector obtained by combining the sentence features with the random noise vector, and L_DAMSM is the loss function obtained from the DAMSM module of the pre-training network.
5. The method of claim 1, wherein: in step 4, the loss function of the initial discriminator is L_{D1} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -(1/2) E_{x∼p_data}[log D_1(x)] - (1/2) E_{x̂∼p_{G1}}[log(1 - D_1(x̂))]
L_2 = -(1/2) E_{x∼p_data}[log D_1(x, e)] - (1/2) E_{x̂∼p_{G1}}[log(1 - D_1(x̂, e))]
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the first generator, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
6. The method of claim 1, wherein: in step 5, the loss function of the second generator is
L_{G2} = -(1/2) E_{x̂∼p_{G2}}[log D_2(x̂)] - (1/2) E_{x̂∼p_{G2}}[log D_2(x̂, e)] + λ_1 L_DAMSM,  with x̂ = G_2(K) and K = S(F(h, e_1)),
where E denotes the mathematical expectation, K = S(F(h, e_1)), G_2 and D_2 are the second generator and the second-level discriminator respectively, λ_1 is a hyper-parameter determining the influence of the DAMSM module on the second generator loss function, h is the output feature vector of the Transformer, e_1 is the word feature, L_DAMSM is the loss function obtained from the DAMSM module of the pre-training network, F is the global attention module, and S is the down-sampling module of the neural network.
7. The method of claim 1, wherein: in step 5, the loss function of the second-level discriminator is L_{D2} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -(1/2) E_{x∼p_data}[log D_2(x)] - (1/2) E_{x̂∼p_{G2}}[log(1 - D_2(x̂))]
L_2 = -(1/2) E_{x∼p_data}[log D_2(x, e)] - (1/2) E_{x̂∼p_{G2}}[log(1 - D_2(x̂, e))]
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the second generator, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
8. The method of claim 1, wherein: in step 6, the loss function of the third generator is
L_{G3} = -(1/2) E_{x̂∼p_{G3}}[log D_3(x̂)] - (1/2) E_{x̂∼p_{G3}}[log D_3(x̂, e)] + λ_2 L_DAMSM,  with x̂ = G_3(K) and K = S(F(h, e_1)),
where K = S(F(h, e_1)), G_3 and D_3 are the third generator and the third-level discriminator respectively, λ_2 is a hyper-parameter determining the influence of the DAMSM module on the third generator loss function, h is the output feature vector of the Transformer, e_1 is the word feature, e is the text feature vector obtained by combining the sentence features with the random noise vector, L_DAMSM is the loss function obtained from the DAMSM module of the pre-training network, and F(h, e_1) is the feature vector learned by the global attention module.
9. The method of claim 1, wherein: in step 6, the loss function of the third-level discriminator is L_{D3} = L_1 + L_2, where L_1 discriminates whether the input image is real and L_2 discriminates whether the input image is semantically consistent with the text:
L_1 = -(1/2) E_{x∼p_data}[log D_3(x)] - (1/2) E_{x̂∼p_{G3}}[log(1 - D_3(x̂))]
L_2 = -(1/2) E_{x∼p_data}[log D_3(x, e)] - (1/2) E_{x̂∼p_{G3}}[log(1 - D_3(x̂, e))]
where E denotes the mathematical expectation, x is the real image corresponding to the text description, x̂ is the image generated by the third generator, and e is the text feature vector obtained by combining the sentence features with the random noise vector.
10. The method of claim 1, wherein: in the training phase of the AttnGAN-based text-to-image network model, the result of passing the word features through the DAMSM module is compared with the result of passing the 256×256 image through the image encoder, and the text-to-image network model is adjusted based on the comparison result.
CN202111109265.1A 2021-09-22 2021-09-22 Text image generation method Pending CN114022582A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111109265.1A CN114022582A (en) 2021-09-22 2021-09-22 Text image generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111109265.1A CN114022582A (en) 2021-09-22 2021-09-22 Text image generation method

Publications (1)

Publication Number Publication Date
CN114022582A true CN114022582A (en) 2022-02-08

Family

ID=80054526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111109265.1A Pending CN114022582A (en) 2021-09-22 2021-09-22 Text image generation method

Country Status (1)

Country Link
CN (1) CN114022582A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116108157A (en) * 2023-04-11 2023-05-12 阿里巴巴达摩院(杭州)科技有限公司 Method for training text generation model, text generation method and device
CN116108157B (en) * 2023-04-11 2023-09-12 阿里巴巴达摩院(杭州)科技有限公司 Method for training text generation model, text generation method and device
CN116630465A (en) * 2023-07-24 2023-08-22 海信集团控股股份有限公司 Model training and image generating method and device
CN116630465B (en) * 2023-07-24 2023-10-24 海信集团控股股份有限公司 Model training and image generating method and device
CN118097318A (en) * 2024-04-28 2024-05-28 武汉大学 Controllable defect image generation method and device based on visual semantic fusion
CN118314246A (en) * 2024-06-11 2024-07-09 西南科技大学 Training method and training system for text synthesized image
CN118314246B (en) * 2024-06-11 2024-08-20 西南科技大学 Training method and training system for text synthesized image

Similar Documents

Publication Publication Date Title
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN111340122B (en) Multi-modal feature fusion text-guided image restoration method
CN114022582A (en) Text image generation method
CN110706302B (en) System and method for synthesizing images by text
CN111160343B (en) Off-line mathematical formula symbol identification method based on Self-Attention
CN111968193B (en) Text image generation method based on StackGAN (secure gas network)
CN113343705B (en) Text semantic based detail preservation image generation method and system
CN108765279A (en) A kind of pedestrian's face super-resolution reconstruction method towards monitoring scene
CN111325660B (en) Remote sensing image style conversion method based on text data
CN113140020B (en) Method for generating image based on text of countermeasure network generated by accompanying supervision
Naveen et al. Transformer models for enhancing AttnGAN based text to image generation
CN113837229B (en) Knowledge-driven text-to-image generation method
CN113140023B (en) Text-to-image generation method and system based on spatial attention
CN111402365A (en) Method for generating picture from characters based on bidirectional architecture confrontation generation network
CN113378949A (en) Dual-generation confrontation learning method based on capsule network and mixed attention
CN117058673A (en) Text generation image model training method and system and text generation image method and system
CN116721176A (en) Text-to-face image generation method and device based on CLIP supervision
CN111339734A (en) Method for generating image based on text
Kasi et al. A deep learning based cross model text to image generation using DC-GAN
CN113658285B (en) Method for generating face photo to artistic sketch
CN115496134A (en) Traffic scene video description generation method and device based on multi-modal feature fusion
Hou et al. A single-stage multi-class object detection method for remote sensing images
Bayoumi et al. An intelligent hybrid text-to-image synthesis model for generating realistic human faces
Gajendran et al. Text to Image Synthesis Using Bridge Generative Adversarial Network and Char CNN Model
Kaddoura Real-world applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination