CN111260740B - Text-to-image generation method based on a generative adversarial network


Info

Publication number: CN111260740B
Application number: CN202010046540.9A
Authority: CN (China)
Prior art keywords: image, word, feature matrix, matrix, text
Priority/filing date: 2020-01-16
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN111260740A
Inventors: 田安捷, 陆璐
Current Assignee: South China University of Technology SCUT
Original Assignee: South China University of Technology SCUT
Application filed by South China University of Technology SCUT
Publication of CN111260740A: 2020-06-09
Publication of CN111260740B (grant): 2023-05-23

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding
    • G06T9/002 Image coding using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a text-to-image generation method based on a generative adversarial network, which comprises the following steps: 1) Inputting a text description into the network, and generating a word feature matrix and a sentence feature vector from it; 2) Applying conditioning augmentation and concatenating a noise vector to the sentence feature vector to obtain an image feature matrix; 3) Computing the word context matrix of the image features; 4) Computing in the generative adversarial network using the image feature matrix and the word context matrix, progressively generating images of increasing resolution in three stages; 5) Extracting a local image feature matrix from the generated image; 6) Evaluating the similarity between the generated image and the text description to optimize the next round of image generation. The method keeps the content of the generated image semantically consistent with the text description, gives the generated image more refined details, effectively improves the resolution of the generated image, and increases the diversity of the generated images.

Description

Text-to-image generation method based on a generative adversarial network
Technical Field
The invention relates to the field of image generation, and in particular to a text-to-image generation method based on a generative adversarial network.
Background
Generating high-resolution, realistic images from textual descriptions is a compelling research problem. In industry, it not only supports deeper visual understanding for related research in computer vision but also has a wide range of practical applications. In academia, it has become one of the most popular research directions in computer vision in recent years, with remarkable results. Recurrent neural networks (RNNs) and generative adversarial networks (GANs) are often combined to generate realistic images from natural language descriptions. These methods can already produce satisfactory results in restricted domains, such as generating images of flowers or birds.
The original GAN model consists of a generator and a discriminator. The generator is optimized to produce samples that match the real data distribution, thereby fooling the discriminator. The trained discriminator learns to separate samples drawn from the real data distribution from the fake samples produced by the generator. The generator and the discriminator are optimized against each other in a minimax game, so the generated results improve progressively.
Although impressive results have been achieved, many challenges remain when training conditional generative adversarial networks. Most models tend to learn only one mode of the data distribution, which makes them prone to mode collapse: the generator produces the same image every time. The image may be sharp, but there is no variation. Another major challenge is the instability of the training process, where the training loss fails to converge. In addition, most existing image generation methods focus on the global sentence vector, ignoring useful fine-grained image features and word-level text information. Furthermore, when evaluating the generated image, they do not account for the fact that different sub-regions of the image contribute differently to the whole. Such methods hinder the generation of high-quality images on the one hand and reduce the diversity of the generated images on the other, and the problem becomes more serious as the scenes and objects to be generated become more complex.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art, and provides a text-to-image generation method based on a generative adversarial network that keeps the content of the generated image semantically consistent with the text description, gives the generated image more refined details, effectively improves the resolution of the generated image, and increases the diversity of the generated images.
The aim of the invention is achieved by the following technical scheme:
a text-to-image generation method based on generating a countermeasure network, comprising the steps of:
1) Inputting a text description into the network, and generating a word feature matrix and a sentence feature vector from it;
2) Applying conditioning augmentation and concatenating a noise vector to the sentence feature vector to obtain an image feature matrix;
3) Computing the word context matrix of the image features;
4) Computing in the generative adversarial network using the image feature matrix and the word context matrix, progressively generating images of increasing resolution in three stages;
5) Extracting a local image feature matrix from the generated image;
6) Evaluating the similarity between the generated image and the text description to optimize the next round of image generation.
In step 1), the text description is a description of attributes of one or more objects; the two hidden states corresponding to each word in the text description are concatenated through a bidirectional long short-term memory (LSTM) network to represent the semantics of the word. The attributes include category, size, number, shape, and location. The final two hidden states are concatenated to obtain the global sentence vector, and the remaining hidden states are concatenated to obtain the word feature matrix.
Step 2) is specifically as follows:
2.1) applying conditioning augmentation to the sentence feature vector to augment the training data and avoid overfitting;
2.2) concatenating the conditioning-augmented vector with a noise vector sampled from a standard normal distribution to obtain the image feature matrix.
In step 3), the word context matrix of the image features is computed from the image feature matrix obtained in step 2) and the word feature matrix obtained in step 1); each column of the word context matrix represents a word context vector associated with one sub-region of the image. Specifically:
first, the word features are mapped into the common semantic space of the image features by adding a new perceptron layer;
then the weight of the j-th image sub-region with respect to the i-th word is computed by normalizing the dot product of the j-th image feature vector (i.e., one column of the image feature matrix) and the i-th word feature vector (i.e., one column of the word feature matrix);
finally, the word context vector of an image sub-region is obtained as the weighted sum of the word feature vectors, using the weights computed for that sub-region; each column of the word context matrix is the word context vector of one image sub-region. The computation is summarized in symbols below.
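In symbols (a compact restatement; the notation below is ours, not the patent's): let e'_i denote the i-th word feature after the perceptron projection and v_j the image feature of sub-region j. Then

    s_{j,i} = v_j^{\top} e'_i, \qquad
    \beta_{j,i} = \frac{\exp(s_{j,i})}{\sum_k \exp(s_{j,k})}, \qquad
    c_j = \sum_i \beta_{j,i}\, e'_i,

where c_j is the word context vector of sub-region j, and stacking the vectors c_j as columns yields the word context matrix.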
Step 4) is specifically as follows:
4.1) inputting the image feature matrix into the first-stage generative adversarial network to obtain a once-optimized image feature matrix, and applying a 3x3 convolution to it to output an image at 64x64 resolution;
4.2) inputting the once-optimized image feature matrix and the word context matrix into the second-stage generative adversarial network to obtain a twice-optimized image feature matrix, and applying a 3x3 convolution to it to output an image at 128x128 resolution;
4.3) applying an attention mechanism to the image feature matrix, strengthening the key sub-regions of the image and weakening the unimportant regions, and then updating the word context matrix as in step 3);
4.4) inputting the twice-optimized image feature matrix and the updated word context matrix into the third-stage generative adversarial network to obtain the final image feature matrix, and applying a 3x3 convolution to it to output an image at 256x256 resolution.
In step 5), the local image feature matrix is extracted from the generated image by an image encoder; the image encoder is essentially a convolutional neural network, an Inception-v3 model pre-trained on the ImageNet dataset.
In step 6), the specific process of evaluating the similarity between the generated image and the text description is as follows:
6.1) applying an attention mechanism to the local image feature matrix, strengthening the key sub-regions of the image and weakening the unimportant regions;
6.2) computing the cosine similarity between the optimized local image feature matrix and the word feature matrix, which is used to evaluate the similarity between the text description and the generated image and thereby guide the optimization of the generator in the generative adversarial network.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention adopts the attention mechanism, wherein the attention mechanism is used for distinguishing the information of a plurality of parts, and the attention mechanism adds attention of different degrees to different parts so as to pay attention to the information which needs to be focused. Based on the method, the invention provides a text-to-image generation method based on a generation countermeasure network, so that the focus area of the generated image is more focused, and the image with more and more rich details is generated through a plurality of stages.
Most existing text-to-image generation methods focus on the global sentence vector when training the conditional generative adversarial network, ignoring useful fine-grained image features and word-level text information. Likewise, when evaluating the quality of a generated image, they ignore the fact that different sub-regions contribute differently to the whole image. As a result, less important regions (e.g., the background of the image) receive too much attention, while fine-grained details that need continual refinement are overlooked. In contrast, the invention provides a generative adversarial network with an image attention mechanism that concentrates on optimizing the important sub-regions of the image, i.e., the key and content-rich sub-regions, when generating the image, so as to produce images with higher resolution and richer details.
Drawings
Fig. 1 is a block diagram of the text-to-image generation method based on a generative adversarial network according to the present invention.
Fig. 2 is a flow chart of the text-to-image generation method based on a generative adversarial network according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
As shown in Fig. 1 and Fig. 2, a text-to-image generation method based on a generative adversarial network includes the following steps:
1) A meaningful text description is input into the network; it may describe representative attributes of one or more physical objects, such as kind, size, number, color, shape, and location. Using a bidirectional LSTM, the two hidden states corresponding to each word in the text description are concatenated to represent the semantics of the word. The last hidden states are concatenated to obtain the global sentence vector, and the remaining hidden states are concatenated to obtain the word feature matrix. A minimal sketch of such a text encoder follows.
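The following PyTorch sketch illustrates one way to implement this step; the vocabulary, embedding, and hidden sizes are assumptions for illustration, not values specified by the patent.

    import torch
    import torch.nn as nn

    class TextEncoder(nn.Module):
        """Bidirectional LSTM text encoder (illustrative sketch)."""
        def __init__(self, vocab_size=5000, embed_dim=300, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            # Bidirectional: the forward and backward hidden states of each
            # word are concatenated, giving a 2 * hidden_dim word vector.
            self.lstm = nn.LSTM(embed_dim, hidden_dim,
                                batch_first=True, bidirectional=True)

        def forward(self, token_ids):
            # token_ids: (batch, seq_len) integer word indices
            outputs, (h_n, _) = self.lstm(self.embed(token_ids))
            # Word feature matrix: one column per word.
            word_features = outputs.transpose(1, 2)              # (batch, 2*hidden_dim, seq_len)
            # Global sentence vector: concatenation of the final forward
            # and backward hidden states.
            sentence_vector = torch.cat([h_n[0], h_n[1]], dim=1) # (batch, 2*hidden_dim)
            return word_features, sentence_vector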
2) The specific process of acquiring the image feature matrix is as follows (see the sketch after these sub-steps):
2.1) conditioning augmentation is applied to the sentence feature vector obtained above, augmenting the training data and avoiding overfitting;
2.2) the conditioning-augmented vector is concatenated with a noise vector sampled from a standard normal distribution to obtain the image feature matrix.
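A minimal sketch of this step, assuming a reparameterized conditioning augmentation of the kind popularized by StackGAN; all dimensions are illustrative.

    import torch
    import torch.nn as nn

    class ConditioningAugmentation(nn.Module):
        """Conditioning augmentation (illustrative sketch).

        A mean and a diagonal log-variance are predicted from the sentence
        vector, and a conditioning vector is sampled from the resulting
        Gaussian; this smooths the conditioning manifold, augmenting the
        training data and helping to avoid overfitting.
        """
        def __init__(self, sent_dim=256, cond_dim=100, noise_dim=100):
            super().__init__()
            self.noise_dim = noise_dim
            self.fc = nn.Linear(sent_dim, cond_dim * 2)

        def forward(self, sentence_vector):
            mu, log_var = self.fc(sentence_vector).chunk(2, dim=1)
            # Reparameterization: sample c ~ N(mu, sigma^2).
            c = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)
            # Concatenate a noise vector sampled from a standard normal
            # distribution to form the generator input.
            z = torch.randn(c.size(0), self.noise_dim, device=c.device)
            return torch.cat([c, z], dim=1)  # (batch, cond_dim + noise_dim)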
3) The word context matrix of the image features is computed from the image feature matrix obtained in step 2) and the word feature matrix obtained in step 1); each column of this matrix represents a word context vector associated with one sub-region of the image. A minimal sketch follows.
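A minimal sketch of the word context computation; passing the perceptron layer in as a 1x1 convolution is our assumption.

    import torch
    import torch.nn.functional as F

    def word_context_matrix(image_features, word_features, proj):
        """Word context vectors for image sub-regions (illustrative sketch).

        image_features: (batch, D, N) image feature matrix, one column per sub-region
        word_features:  (batch, E, T) word feature matrix, one column per word
        proj: the added perceptron layer, e.g. nn.Conv1d(E, D, kernel_size=1),
              mapping word features into the common D-dimensional space
        """
        words = proj(word_features)                                # (batch, D, T)
        # Dot product of the j-th image column and the i-th word column,
        # normalized over words, gives the weight of sub-region j for word i.
        scores = torch.bmm(words.transpose(1, 2), image_features)  # (batch, T, N)
        weights = F.softmax(scores, dim=1)
        # Word context vector of each sub-region: weighted sum of word vectors.
        return torch.bmm(words, weights)                           # (batch, D, N)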
4) The image feature matrix is computed and refined by a three-stage generative adversarial network to generate the image (a skeleton of the generator is sketched after these sub-steps). Each stage operates as follows:
4.1) the image feature matrix is input into the first-stage generative adversarial network to obtain a once-optimized image feature matrix, and a 3x3 convolution is applied to output an image at 64x64 resolution;
4.2) the once-optimized image feature matrix and the word context matrix are input into the second-stage generative adversarial network to obtain a twice-optimized image feature matrix, and a 3x3 convolution is applied to output an image at 128x128 resolution;
4.3) an attention mechanism is applied to the image feature matrix, strengthening the key sub-regions of the image and weakening the unimportant regions, and the word context matrix is updated as in step 3);
4.4) the twice-optimized image feature matrix and the updated word context matrix are input into the third-stage generative adversarial network to obtain the final image feature matrix, and a 3x3 convolution is applied to output an image at 256x256 resolution.
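A structural skeleton of the three-stage coarse-to-fine generator, under assumed channel counts; fusing the word context features by channel-wise concatenation, and reshaping them to each stage's spatial size, are our simplifications rather than details taken from the patent.

    import torch
    import torch.nn as nn

    def up_block(in_ch, out_ch):
        # Nearest-neighbour upsampling followed by a 3x3 convolution.
        return nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True))

    def to_image(in_ch):
        # 3x3 convolution producing an RGB image in [-1, 1].
        return nn.Sequential(nn.Conv2d(in_ch, 3, 3, padding=1), nn.Tanh())

    class StagedGenerator(nn.Module):
        """Three-stage coarse-to-fine generator (illustrative sketch)."""
        def __init__(self, in_dim=200, ch=32):
            super().__init__()
            self.ch = ch
            self.fc = nn.Sequential(nn.Linear(in_dim, ch * 4 * 4), nn.ReLU(True))
            # Stage 1: 4x4 -> 64x64 via four upsampling blocks.
            self.stage1 = nn.Sequential(*[up_block(ch, ch) for _ in range(4)])
            # Stages 2 and 3 fuse the (reshaped) word context features.
            self.stage2 = up_block(ch * 2, ch)   # 64x64   -> 128x128
            self.stage3 = up_block(ch * 2, ch)   # 128x128 -> 256x256
            self.out1, self.out2, self.out3 = to_image(ch), to_image(ch), to_image(ch)

        def forward(self, z_c, ctx64, ctx128):
            # z_c: conditioning + noise vector; ctx64/ctx128: word context
            # features reshaped to (batch, ch, 64, 64) and (batch, ch, 128, 128).
            h1 = self.stage1(self.fc(z_c).view(-1, self.ch, 4, 4))
            h2 = self.stage2(torch.cat([h1, ctx64], dim=1))
            h3 = self.stage3(torch.cat([h2, ctx128], dim=1))
            return self.out1(h1), self.out2(h2), self.out3(h3)   # 64, 128, 256 px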
5) The generated high-resolution image is mapped to a local image feature matrix using an Inception-v3 model pre-trained on the ImageNet dataset as the image encoder; the image encoder is essentially a convolutional neural network. A sketch follows.
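A sketch of such an image encoder using torchvision; taking local features from the Mixed_6e layer (a 17x17 grid of sub-regions) and projecting them to 256 dimensions are our assumptions, not choices specified by the patent.

    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision import models

    class ImageEncoder(nn.Module):
        """Local image feature extractor (illustrative sketch).

        Runs an ImageNet-pretrained Inception-v3 up to an intermediate
        convolutional layer so that each spatial position corresponds to
        one image sub-region.
        """
        def __init__(self, out_dim=256):
            super().__init__()
            self.cnn = models.inception_v3(weights='IMAGENET1K_V1')  # torchvision >= 0.13 API
            self.proj = nn.Conv2d(768, out_dim, kernel_size=1)

        def forward(self, x):
            # x: generated image resized to (batch, 3, 299, 299) and
            # normalized with the ImageNet statistics.
            b = self.cnn
            x = b.Conv2d_1a_3x3(x); x = b.Conv2d_2a_3x3(x); x = b.Conv2d_2b_3x3(x)
            x = F.max_pool2d(x, 3, stride=2)
            x = b.Conv2d_3b_1x1(x); x = b.Conv2d_4a_3x3(x)
            x = F.max_pool2d(x, 3, stride=2)
            for name in ('Mixed_5b', 'Mixed_5c', 'Mixed_5d', 'Mixed_6a',
                         'Mixed_6b', 'Mixed_6c', 'Mixed_6d', 'Mixed_6e'):
                x = getattr(b, name)(x)          # (batch, 768, 17, 17)
            local = self.proj(x)                 # (batch, out_dim, 17, 17)
            return local.flatten(2)              # (batch, out_dim, 289): local image feature matrix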
6) The similarity between the generated image and the text description is evaluated as follows (a sketch follows these sub-steps):
6.1) an attention mechanism is applied to the local image feature matrix, strengthening the key sub-regions of the image and weakening the unimportant regions;
6.2) the cosine similarity between the optimized local image feature matrix and the word feature matrix is computed; it evaluates the similarity between the text description and the generated image and guides the optimization of the generator in the generative adversarial network.
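A minimal sketch of this attention-weighted matching, assuming the word features have already been projected into the image feature space; the sharpening factor gamma is an assumed hyperparameter.

    import torch
    import torch.nn.functional as F

    def image_text_similarity(local_features, word_features, gamma=5.0):
        """Attention-weighted image-text similarity (illustrative sketch).

        local_features: (batch, D, N) local image feature matrix
        word_features:  (batch, D, T) word feature matrix, already mapped
                        into the same D-dimensional space
        """
        img = F.normalize(local_features, dim=1)
        words = F.normalize(word_features, dim=1)
        # Attend over sub-regions for each word: key regions are
        # strengthened, unimportant regions weakened.
        scores = torch.bmm(words.transpose(1, 2), img)            # (batch, T, N)
        attn = F.softmax(gamma * scores, dim=2)
        region_ctx = torch.bmm(img, attn.transpose(1, 2))         # (batch, D, T)
        # Cosine similarity between each word and its attended region
        # context, averaged over words, scores the text-image match and
        # can serve as a training signal for the generator.
        return F.cosine_similarity(region_ctx, words, dim=1).mean(dim=1)  # (batch,)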
In summary, the invention provides a new method for the text-to-image generation process: an image is generated by a generative adversarial network equipped with an attention mechanism, which keeps the content of the generated image semantically consistent with the text description, gives the generated image more refined details, effectively improves the resolution of the generated image, and increases the diversity of the generated images.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited thereto; any other change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention is an equivalent replacement and falls within the protection scope of the present invention.

Claims (7)

1. A text-to-image generation method based on a generative adversarial network, comprising the steps of:
1) Inputting a text description into the network, and generating a word feature matrix and a sentence feature vector from it;
2) Applying conditioning augmentation and concatenating a noise vector to the sentence feature vector to obtain an image feature matrix;
3) Computing the word context matrix of the image features;
4) Computing in the generative adversarial network using the image feature matrix and the word context matrix, progressively generating images of increasing resolution in three stages;
5) Extracting a local image feature matrix from the generated image;
6) Evaluating the similarity between the generated image and the text description to optimize the next round of image generation;
the step 4) is specifically as follows:
4.1 Inputting the image feature matrix into a first layer generation countermeasure network to obtain an optimized image feature matrix, and carrying out 3x3 convolution on the optimized image feature matrix to output an image with the resolution of 64 x 64;
4.2 Inputting the image feature matrix and the word context matrix after primary optimization into a second layer to generate an countermeasure network, obtaining the image feature matrix after secondary optimization, and carrying out 3x3 convolution on the image feature matrix to output an image with 128 x 128 resolution;
4.3 Adding an attention mechanism to the image feature matrix, strengthening key subregions of the image, weakening unimportant regions of the image, and then updating the word context matrix by using the step 3);
4.4 Inputting the secondarily optimized image feature matrix and the updated word context matrix into a third layer of generation countermeasure network to obtain a final image feature matrix, and carrying out 3x3 convolution on the final image feature matrix to output 256 x 256 resolution images.
2. The text-to-image generation method based on a generative adversarial network according to claim 1, wherein in step 1), the text description is a description of attributes of one or more objects, and the two hidden states corresponding to each word in the text description are concatenated through a bidirectional long short-term memory network to represent the semantics of the word; the attributes include category, size, number, shape, and location; the final two hidden states are concatenated to obtain the global sentence vector, and the remaining hidden states are concatenated to obtain the word feature matrix.
3. The text-to-image generation method based on a generative adversarial network according to claim 1, wherein step 2) is specifically as follows:
2.1) applying conditioning augmentation to the sentence feature vector to augment the training data and avoid overfitting;
2.2) concatenating the conditioning-augmented vector with a noise vector sampled from a standard normal distribution to obtain the image feature matrix.
4. The text-to-image generation method based on a generative adversarial network according to claim 1, wherein in step 3) the word context matrix of the image features is computed from the image feature matrix obtained in step 2) and the word feature matrix obtained in step 1), each column of the word context matrix representing a word context vector associated with one sub-region of the image.
5. The text-to-image generation method based on a generative adversarial network according to claim 4, wherein the word context matrix of the image features is computed from the image feature matrix obtained in step 2) and the word feature matrix obtained in step 1), specifically:
first, the word features are mapped into the common semantic space of the image features by adding a new perceptron layer;
then the weight of the j-th image sub-region with respect to the i-th word is computed by normalizing the dot product of the j-th image feature vector and the i-th word feature vector;
finally, the word context vector of an image sub-region is obtained as the weighted sum of the word feature vectors, using the weights computed for that sub-region; each column of the word context matrix is the word context vector of one image sub-region.
6. The text-to-image generation method based on a generative adversarial network according to claim 1, wherein in step 5) the local image feature matrix is extracted from the generated image by an image encoder; the image encoder is essentially a convolutional neural network, an Inception-v3 model pre-trained on the ImageNet dataset.
7. The text-to-image generation method based on a generative adversarial network according to claim 1, wherein in step 6) the specific process of evaluating the similarity between the generated image and the text description is as follows:
6.1) applying an attention mechanism to the local image feature matrix, strengthening the key sub-regions of the image and weakening the unimportant regions;
6.2) computing the cosine similarity between the optimized local image feature matrix and the word feature matrix, which is used to evaluate the similarity between the text description and the generated image and thereby guide the optimization of the generator in the generative adversarial network.
CN202010046540.9A 2020-01-16 2020-01-16 Text-to-image generation method based on a generative adversarial network Active CN111260740B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010046540.9A CN111260740B (en) 2020-01-16 2020-01-16 Text-to-image generation method based on a generative adversarial network


Publications (2)

Publication Number Publication Date
CN111260740A CN111260740A (en) 2020-06-09
CN111260740B (en) 2023-05-23

Family

ID=70950653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010046540.9A Active CN111260740B (en) 2020-01-16 2020-01-16 Text-to-image generation method based on a generative adversarial network

Country Status (1)

Country Link
CN (1) CN111260740B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111918071A (en) * 2020-06-29 2020-11-10 北京大学 Data compression method, device, equipment and storage medium
US20220005235A1 (en) * 2020-07-06 2022-01-06 Ping An Technology (Shenzhen) Co., Ltd. Method and device for text-based image generation
CN114078172B (en) * 2020-08-19 2023-04-07 四川大学 Text image generation method for progressively generating confrontation network based on resolution
CN112348911B (en) * 2020-10-28 2023-04-18 山东师范大学 Semantic constraint-based method and system for generating fine-grained image by stacking texts
CN113343705B (en) * 2021-04-26 2022-07-05 山东师范大学 Text semantic based detail preservation image generation method and system
CN113361250A (en) * 2021-05-12 2021-09-07 山东师范大学 Bidirectional text image generation method and system based on semantic consistency
CN113361251B (en) * 2021-05-13 2023-06-30 山东师范大学 Text generation image method and system based on multi-stage generation countermeasure network
CN113191375B (en) * 2021-06-09 2023-05-09 北京理工大学 Text-to-multi-object image generation method based on joint embedding
CN113674374B (en) * 2021-07-20 2022-07-01 广东技术师范大学 Chinese text image generation method and device based on generation type countermeasure network
CN113793404B (en) * 2021-08-19 2023-07-04 西南科技大学 Manual controllable image synthesis method based on text and contour
CN113837229B (en) * 2021-08-30 2024-03-15 厦门大学 Knowledge-driven text-to-image generation method
CN113537416A (en) * 2021-09-17 2021-10-22 深圳市安软科技股份有限公司 Method and related equipment for converting text into image based on generative confrontation network
CN114332288B (en) * 2022-03-15 2022-06-14 武汉大学 Method for generating text generation image of confrontation network based on phrase drive and network
CN115797495B (en) * 2023-02-07 2023-04-25 武汉理工大学 Method for generating image by sentence-character semantic space fusion perceived text
CN117095083B (en) * 2023-10-17 2024-03-15 华南理工大学 Text-image generation method, system, device and storage medium
CN117152370B (en) * 2023-10-30 2024-02-02 碳丝路文化传播(成都)有限公司 AIGC-based 3D terrain model generation method, system, equipment and storage medium


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3040165A1 (en) * 2016-11-18 2018-05-24 Salesforce.Com, Inc. Spatial attention model for image captioning
CN110135441A (en) * 2019-05-17 2019-08-16 北京邮电大学 A kind of text of image describes method and device
CN110609891A (en) * 2019-09-18 2019-12-24 合肥工业大学 Visual dialog generation method based on context awareness graph neural network

Also Published As

Publication number Publication date
CN111260740A (en) 2020-06-09

Similar Documents

Publication Publication Date Title
CN111260740B (en) Text-to-image generation method based on a generative adversarial network
CN108875807B (en) Image description method based on multiple attention and multiple scales
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN111858954B (en) Task-oriented text-generated image network model
CN110163299B (en) Visual question-answering method based on bottom-up attention mechanism and memory network
CN109783666B (en) Image scene graph generation method based on iterative refinement
CN113343705B (en) Text semantic based detail preservation image generation method and system
CN111177376A (en) Chinese text classification method based on BERT and CNN hierarchical connection
CN112036276B (en) Artificial intelligent video question-answering method
CN110264407B (en) Image super-resolution model training and reconstruction method, device, equipment and storage medium
CN111598183A (en) Multi-feature fusion image description method
Qi et al. Personalized sketch-based image retrieval by convolutional neural network and deep transfer learning
CN115222998B (en) Image classification method
CN113140023B (en) Text-to-image generation method and system based on spatial attention
CN111949824A (en) Visual question answering method and system based on semantic alignment and storage medium
WO2020228536A1 (en) Icon generation method and apparatus, method for acquiring icon, electronic device, and storage medium
CN109740012A (en) The method that understanding and question and answer are carried out to image, semantic based on deep neural network
CN115018941A (en) Text-to-image generation algorithm based on improved version text parser
CN117094395B (en) Method, device and computer storage medium for complementing knowledge graph
CN113869007A (en) Text generation image learning model based on deep learning
CN113420833A (en) Visual question-answering method and device based on question semantic mapping
CN117033609A (en) Text visual question-answering method, device, computer equipment and storage medium
CN113239678A (en) Multi-angle attention feature matching method and system for answer selection
Kasi et al. A Deep Learning Based Cross Model Text to Image Generation using DC-GAN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant