CN111260740A - Text-to-image generation method based on a generative adversarial network - Google Patents


Info

Publication number
CN111260740A
CN111260740A
Authority
CN
China
Prior art keywords
image
matrix
word
generation
feature matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010046540.9A
Other languages
Chinese (zh)
Other versions
CN111260740B (en)
Inventor
田安捷
陆璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202010046540.9A priority Critical patent/CN111260740B/en
Publication of CN111260740A publication Critical patent/CN111260740A/en
Application granted granted Critical
Publication of CN111260740B publication Critical patent/CN111260740B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • G06T9/002Image coding using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a text-to-image generation method based on a generative adversarial network (GAN), which comprises the following steps: 1) inputting a text description into the network and generating a word feature matrix and a sentence feature vector from the text description; 2) applying conditioning augmentation and a noise vector to the sentence feature vector to obtain an image feature matrix; 3) calculating a word context matrix of the image features; 4) performing computation in the generative adversarial network using the image feature matrix and the word context matrix, gradually generating images of increasing resolution in three stages; 5) obtaining a local image feature matrix from the generated image; 6) evaluating the similarity between the generated image and the text description, and using it to optimize the next round of image generation. The image generation method of the invention ensures that the content of the generated image is semantically consistent with the text description, gives the generated image finer image details, effectively improves the resolution of the generated images, and increases their diversity.

Description

Text-to-image generation method based on a generative adversarial network
Technical Field
The invention relates to the field of image generation, and in particular to a text-to-image generation method based on a generative adversarial network (GAN).
Background
Generating high-resolution, realistic images from textual descriptions is a very meaningful line of research. In industry, it not only supports deeper visual understanding for related research in computer vision, but also has wide practical application. In academia, it has become one of the most popular research directions in computer vision in recent years, with significant results. Recurrent neural networks (RNNs) and generative adversarial networks (GANs) are often combined to generate realistic images from natural language descriptions. These methods can already produce satisfactory results in certain domains, such as generating fine-grained images of flowers or birds.
The original GAN model contains a generator and a discriminator. Through optimization, the generator learns to produce samples that follow the real data distribution, thereby deceiving the discriminator. The trained discriminator, in turn, learns to separate samples drawn from the real data distribution from the fake samples produced by the generator. The generator and the discriminator each approach their optimum in this mutual game, so the generated results become better and better.
While impressive results have been achieved, many challenges remain in training conditional generative adversarial networks. Most models tend to learn only one mode of the data distribution, which leads to mode collapse: the generator produces the same image every time. Although that image may be sharp, it never changes. Another major challenge is that the training process is unstable and the losses obtained during training do not converge. In addition, most existing image generation methods focus on a global sentence vector, ignoring useful fine-grained image features and word-level text information. Furthermore, when evaluating the generated image, they do not account for the fact that each sub-region of the image contributes differently to the overall image. Such methods on the one hand hinder the generation of high-quality images, and on the other hand reduce the diversity of the generated images. The problem becomes more severe as the scenes and objects to be generated grow more complex.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a text-to-image generation method based on a generative adversarial network, which ensures that the content of the generated image is semantically consistent with the text description, gives the generated image finer image details, effectively improves the resolution of the generated image, and increases image diversity.
The purpose of the invention is realized by the following technical scheme:
a text-to-image generation method based on generation of a countermeasure network, comprising the steps of:
1) inputting a text description into a network, and generating a word feature matrix and a sentence feature vector according to the text description;
2) adding conditions and noise vectors to the sentence characteristic vectors to obtain an image characteristic matrix;
3) calculating a word context matrix of the image features;
4) calculating in the generation of the countermeasure network by utilizing the image characteristic matrix and the word context matrix, and gradually generating images with higher and higher resolutions in three stages;
5) acquiring a local image feature matrix according to the generated image;
6) and evaluating the similarity between the generated image and the text description, and optimizing the next image generation.
In step 1), the text description describes the attributes of one or more objects; the attributes comprise type, size, number, shape and position. A bidirectional long short-term memory (LSTM) network produces two hidden states for each word in the text description, and these are concatenated to represent the word's semantics. The final hidden states of the two directions are concatenated to obtain a global sentence vector, and the remaining hidden states are concatenated to obtain the word feature matrix.
Step 2) is specifically as follows:
2.1) applying conditioning augmentation to the sentence feature vector to augment the training data and avoid overfitting;
2.2) concatenating a noise vector sampled from a standard normal distribution to obtain the image feature matrix.
In step 3), the word context matrix of the image features is calculated from the image feature matrix obtained in step 2) and the word feature matrix obtained in step 1); each column of the word context matrix represents a word context vector associated with one sub-region of the image.
The word context matrix of the image features is calculated from the image feature matrix obtained in step 2) and the word feature matrix obtained in step 1), specifically:
first, the word features are mapped into the common semantic space of the image features by adding a new perceptron layer;
then the weight of the i-th word for the j-th sub-region of the image is calculated: it is obtained by normalizing the product of the j-th image feature vector (i.e. a column vector of the image feature matrix) and the i-th word feature vector (i.e. a column vector of the word feature matrix);
finally, the weighted sum of the word features, using the weights of the corresponding image sub-region, is computed to obtain the word context vector of that sub-region; each sub-region of the image corresponds to one word context vector.
Step 4) is specifically as follows:
4.1) inputting the image feature matrix into the first-stage generative adversarial network to obtain a once-optimized image feature matrix, and applying a 3×3 convolution to it to output an image at 64×64 resolution;
4.2) inputting the once-optimized image feature matrix and the word context matrix into the second-stage generative adversarial network to obtain a twice-optimized image feature matrix, and applying a 3×3 convolution to it to output an image at 128×128 resolution;
4.3) applying an attention mechanism to the image feature matrix, strengthening key sub-regions of the image and weakening unimportant regions, and updating the word context matrix using step 3);
4.4) inputting the twice-optimized image feature matrix and the updated word context matrix into the third-stage generative adversarial network to obtain the final image feature matrix, and applying a 3×3 convolution to it to output an image at 256×256 resolution.
In step 5), the local image feature matrix is obtained from the generated image by an image encoder; the image encoder is the Inception-v3 model pre-trained on the ImageNet dataset, which is essentially a convolutional neural network.
In step 6), the specific process of evaluating the similarity between the generated image and the text description is as follows:
6.1) applying an attention mechanism to the local image feature matrix, strengthening key sub-regions of the image and weakening unimportant regions;
6.2) calculating the cosine similarity between the optimized local image feature matrix and the word feature matrix, and using it to evaluate the similarity between the text description and the generated image, which in turn helps optimize the generator in the generative adversarial network.
Compared with the prior art, the invention has the following advantages and beneficial effects:
The invention adopts an attention mechanism, whose central idea is to distinguish the information carried by different parts and assign different degrees of attention to them, so that the information that needs to be focused on receives due weight. On this basis, the proposed text-to-image generation method based on a generative adversarial network pays more attention to the key regions of the generated image, producing images with increasingly rich details across multiple stages.
In conventional text-to-image generation, when training a conditional generative adversarial network, most existing methods focus on a global sentence vector and ignore useful fine-grained image features and word-level text information. Likewise, when evaluating the quality of a generated image, they neglect that each sub-region of the image contributes differently to the whole. These methods may cause less important sub-regions (e.g. the background of the image) to receive excessive attention, while fine-grained details that need continual optimization are ignored. In contrast, the present invention provides a generative adversarial network with an added image attention mechanism, which generates higher-resolution, more detailed images by focusing on optimizing the important, content-rich sub-regions of the image during generation.
Drawings
Fig. 1 is an architecture diagram of the text-to-image generation method based on a generative adversarial network according to the present invention.
Fig. 2 is a flow chart of the text-to-image generation method based on a generative adversarial network according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
As shown in Figs. 1 and 2, a text-to-image generation method based on a generative adversarial network includes the following steps:
1) A meaningful text description is input into the network; the text description may describe representative attributes of one or more entity objects, such as type, size, number, color, shape and position. Using a bidirectional long short-term memory network (bi-directional LSTM), the two hidden states corresponding to each word in the text description are concatenated to represent the word's semantics. The global sentence vector is obtained by concatenating the final hidden states, and the word feature matrix is obtained by concatenating the remaining hidden states.
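The hidden-state bookkeeping of step 1) can be sketched as follows. This is a minimal numpy illustration, not the patented implementation: the hidden states are random stand-ins, and placing the backward direction's final state at word index 0 follows the usual bi-LSTM convention and is an assumption here.

```python
import numpy as np

T, H = 5, 4                      # T words, hidden size H per direction
rng = np.random.default_rng(0)
h_fwd = rng.normal(size=(T, H))  # forward hidden states, one per word
h_bwd = rng.normal(size=(T, H))  # backward hidden states, one per word

# Each word's semantics: its two hidden states concatenated -> one
# column of the word feature matrix, shape (2H, T)
word_features = np.concatenate([h_fwd, h_bwd], axis=1).T

# Global sentence vector: the final hidden state of each direction,
# concatenated (the backward pass ends at the first word)
sentence_vector = np.concatenate([h_fwd[-1], h_bwd[0]])  # shape (2H,)
```

In a trained system the hidden states would of course come from an actual LSTM over word embeddings; only the concatenation scheme is shown here.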
2) Obtain the image feature matrix. The specific process is as follows:
2.1) apply conditioning augmentation to the obtained sentence feature vector to augment the training data and avoid overfitting;
2.2) concatenate the augmented condition with a noise vector sampled from a standard normal distribution to obtain the image feature matrix.
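Conditioning augmentation is commonly realized by resampling the condition from a Gaussian whose mean and variance are predicted from the sentence vector (the reparameterization trick). The sketch below is a hedged numpy illustration: `W_mu` and `W_logvar` stand in for learned linear layers, and all dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
sentence_vector = rng.normal(size=(8,))

# Learned linear layers (random stand-ins here) predict a mean and a
# log-variance from the sentence vector
W_mu = rng.normal(size=(4, 8)) * 0.1
W_logvar = rng.normal(size=(4, 8)) * 0.1
mu = W_mu @ sentence_vector
logvar = W_logvar @ sentence_vector

# Reparameterization: resample the condition, which smooths the
# conditioning manifold and augments the training data
eps = rng.standard_normal(4)
c_hat = mu + np.exp(0.5 * logvar) * eps

# Concatenate with noise z ~ N(0, I) to form the generator input
z = rng.standard_normal(10)
generator_input = np.concatenate([c_hat, z])   # shape (14,)
```

The resampling step is what distinguishes conditioning augmentation from simply feeding the raw sentence vector: nearby sentence embeddings map to overlapping condition distributions, so the generator sees more varied conditions per caption.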
3) Calculate the word context matrix of the image features from the image feature matrix obtained in step 2) and the word feature matrix obtained in step 1); each column of this matrix represents a word context vector associated with one sub-region of the image.
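The word-context computation can be sketched in numpy as follows. The perceptron layer `U`, the softmax normalization, and all dimensions are illustrative assumptions; a trained model would learn `U` jointly with the rest of the network.

```python
import numpy as np

rng = np.random.default_rng(2)
Dw, D, T, N = 8, 4, 5, 6     # word dim, image dim, T words, N sub-regions
e = rng.normal(size=(Dw, T)) # word feature matrix (one column per word)
h = rng.normal(size=(D, N))  # image feature matrix (one column per sub-region)

# Perceptron layer mapping word features into the image feature space
U = rng.normal(size=(D, Dw)) * 0.1
e_prime = U @ e              # (D, T)

# Weight of word i for sub-region j: normalized product of the j-th
# image feature column and the i-th (mapped) word feature column
s = h.T @ e_prime                                        # (N, T) raw scores
beta = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)  # normalize over words

# Word context vector of sub-region j: weighted sum of the mapped word
# features; the columns together form the word context matrix
context = e_prime @ beta.T   # (D, N)
```

Each column of `context` summarizes, for one image sub-region, the words most relevant to it, which is what the later generation stages consume alongside the image features.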
4) Compute and optimize the image feature matrix through the three stages of generative adversarial networks to generate images. The specific operation of each stage is as follows:
4.1) input the image feature matrix into the first-stage generative adversarial network to obtain a once-optimized image feature matrix, and apply a 3×3 convolution to it to output an image at 64×64 resolution;
4.2) input the once-optimized image feature matrix and the word context matrix into the second-stage generative adversarial network to obtain a twice-optimized image feature matrix, and apply a 3×3 convolution to it to output an image at 128×128 resolution;
4.3) apply an attention mechanism to the image feature matrix, strengthening key sub-regions of the image and weakening unimportant regions, and update the word context matrix using step 3);
4.4) input the twice-optimized image feature matrix and the updated word context matrix into the third-stage generative adversarial network to obtain the final image feature matrix, and apply a 3×3 convolution to it to output an image at 256×256 resolution.
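The coarse-to-fine progression of steps 4.1)–4.4) can be sketched as below. This is a shape-level toy, not the trained model: `refine` stands in for a full GAN stage (here simply 2x nearest-neighbour upsampling), the 3×3 convolution weights are random, and only the 64 → 128 → 256 resolution schedule mirrors the method.

```python
import numpy as np

def refine(feat):
    # stand-in for one GAN stage: 2x nearest-neighbour upsampling
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def to_rgb(feat, kernel):
    # 3x3 convolution (stride 1, zero padding) from C channels to RGB
    C, H, W = feat.shape
    pad = np.pad(feat, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((3, H, W))
    for i in range(H):
        for j in range(W):
            out[:, i, j] = np.tensordot(kernel, pad[:, i:i + 3, j:j + 3], 3)
    return out

rng = np.random.default_rng(3)
feat = rng.normal(size=(8, 32, 32))           # initial image feature matrix
kernel = rng.normal(size=(3, 8, 3, 3)) * 0.1  # 3x3 conv weights (random)
images = []
for stage in range(3):                        # three stages: 64, 128, 256
    feat = refine(feat)
    images.append(to_rgb(feat, kernel))
```

In the actual method each stage also consumes the word context matrix and is trained against its own discriminator; the loop above only demonstrates how a single feature map threads through three stages with a 3×3 projection to an image at each resolution.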
5) Map the generated high-resolution image to a local image feature matrix, using the Inception-v3 model pre-trained on the ImageNet dataset as the image encoder. The image encoder is essentially a convolutional neural network.
6) Evaluate the similarity between the generated image and the text description. The specific process is as follows:
6.1) apply an attention mechanism to the local image feature matrix, strengthening key sub-regions of the image and weakening unimportant regions;
6.2) calculate the cosine similarity between the optimized local image feature matrix and the word feature matrix, and use it to evaluate the similarity between the text description and the generated image, which in turn helps optimize the generator in the generative adversarial network.
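Steps 6.1) and 6.2) can be sketched as follows: attention over the sub-regions builds a region-context vector per word, and cosine similarity scores each word against its context. The softmax attention and the log-sum-exp aggregation below are assumptions in the style of attention-driven image-text matching losses; the patent itself only specifies cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(4)
D, T, N = 4, 5, 6
v = rng.normal(size=(D, N))  # local image features (one column per sub-region)
e = rng.normal(size=(D, T))  # word features, assumed already in the same space

# Attention of each word over the sub-regions: key regions get more weight
s = e.T @ v                                               # (T, N) raw scores
alpha = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)  # normalize over regions
c = v @ alpha.T                                           # (D, T) region contexts

# Cosine similarity between each word and its attended image context
cos = (e * c).sum(axis=0) / (
    np.linalg.norm(e, axis=0) * np.linalg.norm(c, axis=0))

# Aggregate into one image-text matching score used to guide the generator
score = np.log(np.exp(5 * cos).sum()) / 5
```

During training, a score of this kind would be turned into a loss (e.g. over matching and mismatching image-caption pairs) that backpropagates into the generator.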
In summary, the invention provides a new method for text-to-image generation: images are generated by a generative adversarial network with an added attention mechanism, which ensures that the content of the generated image is semantically consistent with the text description, gives the generated image finer image details, effectively improves the resolution of the generated images, and increases their diversity.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to them; any change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the present invention should be construed as an equivalent and is intended to be included within the scope of the present invention.

Claims (8)

1. A text-to-image generation method based on a generative adversarial network, characterized by comprising the following steps:
1) inputting a text description into the network and generating a word feature matrix and a sentence feature vector from the text description;
2) applying conditioning augmentation and a noise vector to the sentence feature vector to obtain an image feature matrix;
3) calculating a word context matrix of the image features;
4) performing computation in the generative adversarial network using the image feature matrix and the word context matrix, gradually generating images of increasing resolution in three stages;
5) obtaining a local image feature matrix from the generated image;
6) evaluating the similarity between the generated image and the text description, and using it to optimize the next round of image generation.
2. The text-to-image generation method based on a generative adversarial network according to claim 1, characterized in that in step 1), the text description describes the attributes of one or more objects, the attributes comprising type, size, number, shape and position; the two hidden states produced for each word in the text description by a bidirectional long short-term memory network are concatenated to represent the word's semantics; the final hidden states of the two directions are concatenated to obtain a global sentence vector, and the remaining hidden states are concatenated to obtain the word feature matrix.
3. The text-to-image generation method based on a generative adversarial network according to claim 1, characterized in that step 2) is specifically:
2.1) applying conditioning augmentation to the sentence feature vector to augment the training data and avoid overfitting;
2.2) concatenating a noise vector sampled from a standard normal distribution to obtain the image feature matrix.
4. The text-to-image generation method based on a generative adversarial network according to claim 1, characterized in that in step 3), the word context matrix of the image features is calculated from the image feature matrix obtained in step 2) and the word feature matrix obtained in step 1), and each column of the word context matrix represents a word context vector associated with one sub-region of the image.
5. The text-to-image generation method based on a generative adversarial network according to claim 4, characterized in that the word context matrix of the image features is calculated from the image feature matrix obtained in step 2) and the word feature matrix obtained in step 1), specifically:
first, the word features are mapped into the common semantic space of the image features by adding a new perceptron layer;
then the weight of the i-th word for the j-th sub-region of the image is calculated, obtained by normalizing the product of the j-th image feature vector and the i-th word feature vector;
finally, the weighted sum of the word features, using the weights of the corresponding image sub-region, is computed to obtain the word context vector of that sub-region; each sub-region of the image corresponds to one word context vector.
6. The text-to-image generation method based on a generative adversarial network according to claim 1, characterized in that step 4) is specifically:
4.1) inputting the image feature matrix into the first-stage generative adversarial network to obtain a once-optimized image feature matrix, and applying a 3×3 convolution to it to output an image at 64×64 resolution;
4.2) inputting the once-optimized image feature matrix and the word context matrix into the second-stage generative adversarial network to obtain a twice-optimized image feature matrix, and applying a 3×3 convolution to it to output an image at 128×128 resolution;
4.3) applying an attention mechanism to the image feature matrix, strengthening key sub-regions of the image and weakening unimportant regions, and updating the word context matrix using step 3);
4.4) inputting the twice-optimized image feature matrix and the updated word context matrix into the third-stage generative adversarial network to obtain the final image feature matrix, and applying a 3×3 convolution to it to output an image at 256×256 resolution.
7. The text-to-image generation method based on a generative adversarial network according to claim 1, characterized in that in step 5), the local image feature matrix is obtained from the generated image by an image encoder; the image encoder is the Inception-v3 model pre-trained on the ImageNet dataset, which is essentially a convolutional neural network.
8. The text-to-image generation method based on a generative adversarial network according to claim 1, characterized in that in step 6), the specific process of evaluating the similarity between the generated image and the text description is:
6.1) applying an attention mechanism to the local image feature matrix, strengthening key sub-regions of the image and weakening unimportant regions;
6.2) calculating the cosine similarity between the optimized local image feature matrix and the word feature matrix, and using it to evaluate the similarity between the text description and the generated image, which in turn helps optimize the generator in the generative adversarial network.
CN202010046540.9A 2020-01-16 2020-01-16 Text-to-image generation method based on generation countermeasure network Active CN111260740B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010046540.9A CN111260740B (en) 2020-01-16 2020-01-16 Text-to-image generation method based on generation countermeasure network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010046540.9A CN111260740B (en) 2020-01-16 2020-01-16 Text-to-image generation method based on generation countermeasure network

Publications (2)

Publication Number Publication Date
CN111260740A true CN111260740A (en) 2020-06-09
CN111260740B CN111260740B (en) 2023-05-23

Family

ID=70950653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010046540.9A Active CN111260740B (en) 2020-01-16 2020-01-16 Text-to-image generation method based on generation countermeasure network

Country Status (1)

Country Link
CN (1) CN111260740B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111918071A (en) * 2020-06-29 2020-11-10 北京大学 Data compression method, device, equipment and storage medium
CN112348911A (en) * 2020-10-28 2021-02-09 山东师范大学 Semantic constraint-based method and system for generating fine-grained image by stacking texts
CN113191375A (en) * 2021-06-09 2021-07-30 北京理工大学 Text-to-multi-object image generation method based on joint embedding
CN113343705A (en) * 2021-04-26 2021-09-03 山东师范大学 Text semantic based detail preservation image generation method and system
CN113361250A (en) * 2021-05-12 2021-09-07 山东师范大学 Bidirectional text image generation method and system based on semantic consistency
CN113361251A (en) * 2021-05-13 2021-09-07 山东师范大学 Text image generation method and system based on multi-stage generation countermeasure network
CN113537416A (en) * 2021-09-17 2021-10-22 深圳市安软科技股份有限公司 Method and related equipment for converting text into image based on generative confrontation network
CN113674374A (en) * 2021-07-20 2021-11-19 广东技术师范大学 Chinese text image generation method and device based on generation type countermeasure network
CN113793404A (en) * 2021-08-19 2021-12-14 西南科技大学 Artificially controllable image synthesis method based on text and outline
CN113837229A (en) * 2021-08-30 2021-12-24 厦门大学 Knowledge-driven text-to-image generation method
WO2022007685A1 (en) * 2020-07-06 2022-01-13 Ping An Technology (Shenzhen) Co., Ltd. Method and device for text-based image generation
CN114078172A (en) * 2020-08-19 2022-02-22 四川大学 Text image generation method for progressively generating confrontation network based on resolution
CN114332288A (en) * 2022-03-15 2022-04-12 武汉大学 Method for generating text generation image of confrontation network based on phrase driving and network
CN115797495A (en) * 2023-02-07 2023-03-14 武汉理工大学 Method for generating image by text sensed by sentence-character semantic space fusion
CN116710910A (en) * 2020-12-29 2023-09-05 迪真诺有限公司 Design generating method based on condition generated by learning and device thereof
CN117095083A (en) * 2023-10-17 2023-11-21 华南理工大学 Text-image generation method, system, device and storage medium
CN117152370A (en) * 2023-10-30 2023-12-01 碳丝路文化传播(成都)有限公司 AIGC-based 3D terrain model generation method, system, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3040165A1 (en) * 2016-11-18 2018-05-24 Salesforce.Com, Inc. Spatial attention model for image captioning
CN110135441A (en) * 2019-05-17 2019-08-16 北京邮电大学 A kind of text of image describes method and device
CN110609891A (en) * 2019-09-18 2019-12-24 合肥工业大学 Visual dialog generation method based on context awareness graph neural network



Also Published As

Publication number Publication date
CN111260740B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN111260740A (en) Text-to-image generation method based on generation countermeasure network
CN108875807B (en) Image description method based on multiple attention and multiple scales
CN109344288B (en) Video description combining method based on multi-modal feature combining multi-layer attention mechanism
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
US11966839B2 (en) Auto-regressive neural network systems with a soft attention mechanism using support data patches
CN110163299B (en) Visual question-answering method based on bottom-up attention mechanism and memory network
CN109635883A (en) The Chinese word library generation method of the structural information guidance of network is stacked based on depth
CN109712108B (en) Visual positioning method for generating network based on diversity discrimination candidate frame
CN114119975A (en) Language-guided cross-modal instance segmentation method
CN111598183A (en) Multi-feature fusion image description method
CN115797495B (en) Method for generating image by sentence-character semantic space fusion perceived text
CN113140023B (en) Text-to-image generation method and system based on spatial attention
CN115222998B (en) Image classification method
CN116363261A (en) Training method of image editing model, image editing method and device
Agrawal et al. Image Caption Generator Using Attention Mechanism
CN112989843B (en) Intention recognition method, device, computing equipment and storage medium
CN117033609A (en) Text visual question-answering method, device, computer equipment and storage medium
CN116704506A (en) Cross-environment-attention-based image segmentation method
CN115171052B (en) Crowded crowd attitude estimation method based on high-resolution context network
Luhman et al. High fidelity image synthesis with deep vaes in latent space
WO2023154192A1 (en) Video synthesis via multimodal conditioning
CN116434058A (en) Image description generation method and system based on visual text alignment
Zhang et al. CT-GAN: A conditional Generative Adversarial Network of transformer architecture for text-to-image
CN115512191A (en) Question and answer combined image natural language description method
Kasi et al. A Deep Learning Based Cross Model Text to Image Generation using DC-GAN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant