CN111260740B - Text-to-image generation method based on a generative adversarial network


Info

Publication number: CN111260740B
Application number: CN202010046540.9A
Authority: CN (China)
Prior art keywords: image, word, feature matrix, matrix, text
Priority/filing date: 2020-01-16
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN111260740A
Inventors: 田安捷, 陆璐
Current Assignee: South China University of Technology SCUT
Original Assignee: South China University of Technology SCUT
Application filed by South China University of Technology SCUT
Publication of CN111260740A: 2020-06-09
Publication of CN111260740B (grant): 2023-05-23

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding
    • G06T9/002 Image coding using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a text-to-image generation method based on a generative adversarial network, which comprises the following steps: 1) Inputting a text description into the network, and generating a word feature matrix and a sentence feature vector from it; 2) Applying conditioning augmentation and concatenating a noise vector to the sentence feature vector to obtain an image feature matrix; 3) Computing the word context matrix of the image features; 4) Computing in the generative adversarial network using the image feature matrix and the word context matrix, progressively generating images of increasing resolution in three stages; 5) Extracting a local image feature matrix from the generated image; 6) Evaluating the similarity between the generated image and the text description to optimize the next round of image generation. The method keeps the content of the generated image semantically consistent with the text description, gives the generated image more refined details, effectively improves the resolution of the generated image, and increases the diversity of the generated images.

Description

Text-to-image generation method based on a generative adversarial network
Technical Field
The invention relates to the field of image generation, and in particular to a text-to-image generation method based on a generative adversarial network.
Background
Generating high-resolution, realistic images from textual descriptions is a compelling research problem. In industry, it not only supports deeper visual understanding for related research in computer vision but also has a wide range of practical applications. In academia, it has become one of the most popular research directions in computer vision in recent years, with remarkable results. Recurrent neural networks (RNNs) and generative adversarial networks (GANs) are often combined to generate realistic images from natural language descriptions. These methods can already produce satisfactory results in restricted domains, such as generating images of flowers or birds.
The original GAN model consists of a generator and a discriminator. The generator is optimized to produce samples that match the real data distribution, thereby fooling the discriminator. The trained discriminator learns to separate samples drawn from the real data distribution from the fake samples produced by the generator. The generator and the discriminator are optimized against each other in a minimax game, so the generated results improve progressively.
Although impressive results have been achieved, many challenges remain when training conditional generative adversarial networks. Most models tend to learn only one mode of the data distribution, which makes them prone to mode collapse: the generator produces the same image every time. The image may be sharp, but there is no variation. Another major challenge is the instability of the training process, where the training loss fails to converge. In addition, most existing image generation methods focus on the global sentence vector, ignoring useful fine-grained image features and word-level text information. Furthermore, when evaluating the generated image, they do not account for the fact that different sub-regions of the image contribute differently to the whole. Such methods hinder the generation of high-quality images on the one hand and reduce the diversity of the generated images on the other, and the problem becomes more serious as the scenes and objects to be generated become more complex.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art, and provides a text-to-image generation method based on a generative adversarial network that keeps the content of the generated image semantically consistent with the text description, gives the generated image more refined details, effectively improves the resolution of the generated image, and increases the diversity of the generated images.
The aim of the invention is achieved by the following technical scheme:
a text-to-image generation method based on generating a countermeasure network, comprising the steps of:
1) Inputting a text description into the network, and generating a word feature matrix and a sentence feature vector from it;
2) Applying conditioning augmentation and concatenating a noise vector to the sentence feature vector to obtain an image feature matrix;
3) Computing the word context matrix of the image features;
4) Computing in the generative adversarial network using the image feature matrix and the word context matrix, progressively generating images of increasing resolution in three stages;
5) Extracting a local image feature matrix from the generated image;
6) Evaluating the similarity between the generated image and the text description to optimize the next round of image generation.
In step 1), the text description is a description of attributes of one or more objects; the two hidden states corresponding to each word in the text description are concatenated through a bidirectional long short-term memory (LSTM) network to represent the semantics of the word. The attributes include category, size, number, shape, and location. The final two hidden states are concatenated to obtain the global sentence vector, and the remaining hidden states are concatenated to obtain the word feature matrix.
Step 2) is specifically as follows:
2.1) applying conditioning augmentation to the sentence feature vector to augment the training data and avoid overfitting;
2.2) concatenating the conditioning-augmented vector with a noise vector sampled from a standard normal distribution to obtain the image feature matrix.
In step 3), the word context matrix of the image features is computed from the image feature matrix obtained in step 2) and the word feature matrix obtained in step 1); each column of the word context matrix represents a word context vector associated with one sub-region of the image. Specifically:
first, the word features are mapped into the common semantic space of the image features by adding a new perceptron layer;
then the weight of the j-th image sub-region with respect to the i-th word is computed by normalizing the dot product of the j-th image feature vector (i.e., one column of the image feature matrix) and the i-th word feature vector (i.e., one column of the word feature matrix);
finally, the word context vector of an image sub-region is obtained as the weighted sum of the word feature vectors, using the weights computed for that sub-region; each column of the word context matrix is the word context vector of one image sub-region. The computation is summarized in symbols below.
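In symbols (a compact restatement; the notation below is ours, not the patent's): let e'_i denote the i-th word feature after the perceptron projection and v_j the image feature of sub-region j. Then

    s_{j,i} = v_j^{\top} e'_i, \qquad
    \beta_{j,i} = \frac{\exp(s_{j,i})}{\sum_k \exp(s_{j,k})}, \qquad
    c_j = \sum_i \beta_{j,i}\, e'_i,

where c_j is the word context vector of sub-region j, and stacking the vectors c_j as columns yields the word context matrix.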
Step 4) is specifically as follows:
4.1) inputting the image feature matrix into the first-stage generative adversarial network to obtain a once-optimized image feature matrix, and applying a 3x3 convolution to it to output an image at 64x64 resolution;
4.2) inputting the once-optimized image feature matrix and the word context matrix into the second-stage generative adversarial network to obtain a twice-optimized image feature matrix, and applying a 3x3 convolution to it to output an image at 128x128 resolution;
4.3) applying an attention mechanism to the image feature matrix, strengthening the key sub-regions of the image and weakening the unimportant regions, and then updating the word context matrix as in step 3);
4.4) inputting the twice-optimized image feature matrix and the updated word context matrix into the third-stage generative adversarial network to obtain the final image feature matrix, and applying a 3x3 convolution to it to output an image at 256x256 resolution.
In step 5), the local image feature matrix is extracted from the generated image by an image encoder; the image encoder is essentially a convolutional neural network, an Inception-v3 model pre-trained on the ImageNet dataset.
In step 6), the specific process of evaluating the similarity between the generated image and the text description is as follows:
6.1) applying an attention mechanism to the local image feature matrix, strengthening the key sub-regions of the image and weakening the unimportant regions;
6.2) computing the cosine similarity between the optimized local image feature matrix and the word feature matrix, which is used to evaluate the similarity between the text description and the generated image and thereby guide the optimization of the generator in the generative adversarial network.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention adopts the attention mechanism, wherein the attention mechanism is used for distinguishing the information of a plurality of parts, and the attention mechanism adds attention of different degrees to different parts so as to pay attention to the information which needs to be focused. Based on the method, the invention provides a text-to-image generation method based on a generation countermeasure network, so that the focus area of the generated image is more focused, and the image with more and more rich details is generated through a plurality of stages.
Most existing text-to-image generation methods focus on the global sentence vector when training the conditional generative adversarial network, ignoring useful fine-grained image features and word-level text information. Likewise, when evaluating the quality of a generated image, they ignore the fact that different sub-regions contribute differently to the whole image. As a result, less important regions (e.g., the background of the image) receive too much attention, while fine-grained details that need continual refinement are overlooked. In contrast, the invention provides a generative adversarial network with an image attention mechanism that concentrates on optimizing the important sub-regions of the image, i.e., the key and content-rich sub-regions, when generating the image, so as to produce images with higher resolution and richer details.
Drawings
Fig. 1 is a block diagram of the text-to-image generation method based on a generative adversarial network according to the present invention.
Fig. 2 is a flow chart of the text-to-image generation method based on a generative adversarial network according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
As shown in Fig. 1 and Fig. 2, a text-to-image generation method based on a generative adversarial network includes the following steps:
1) A meaningful text description is input into the network; it may describe representative attributes of one or more physical objects, such as kind, size, number, color, shape, and location. Using a bidirectional LSTM, the two hidden states corresponding to each word in the text description are concatenated to represent the semantics of the word. The last hidden states are concatenated to obtain the global sentence vector, and the remaining hidden states are concatenated to obtain the word feature matrix. A minimal sketch of such a text encoder follows.
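The following PyTorch sketch illustrates one way to implement this step; the vocabulary, embedding, and hidden sizes are assumptions for illustration, not values specified by the patent.

    import torch
    import torch.nn as nn

    class TextEncoder(nn.Module):
        """Bidirectional LSTM text encoder (illustrative sketch)."""
        def __init__(self, vocab_size=5000, embed_dim=300, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            # Bidirectional: the forward and backward hidden states of each
            # word are concatenated, giving a 2 * hidden_dim word vector.
            self.lstm = nn.LSTM(embed_dim, hidden_dim,
                                batch_first=True, bidirectional=True)

        def forward(self, token_ids):
            # token_ids: (batch, seq_len) integer word indices
            outputs, (h_n, _) = self.lstm(self.embed(token_ids))
            # Word feature matrix: one column per word.
            word_features = outputs.transpose(1, 2)              # (batch, 2*hidden_dim, seq_len)
            # Global sentence vector: concatenation of the final forward
            # and backward hidden states.
            sentence_vector = torch.cat([h_n[0], h_n[1]], dim=1) # (batch, 2*hidden_dim)
            return word_features, sentence_vector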
2) The specific process of acquiring the image feature matrix is as follows (see the sketch after these sub-steps):
2.1) conditioning augmentation is applied to the sentence feature vector obtained above, augmenting the training data and avoiding overfitting;
2.2) the conditioning-augmented vector is concatenated with a noise vector sampled from a standard normal distribution to obtain the image feature matrix.
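A minimal sketch of this step, assuming a reparameterized conditioning augmentation of the kind popularized by StackGAN; all dimensions are illustrative.

    import torch
    import torch.nn as nn

    class ConditioningAugmentation(nn.Module):
        """Conditioning augmentation (illustrative sketch).

        A mean and a diagonal log-variance are predicted from the sentence
        vector, and a conditioning vector is sampled from the resulting
        Gaussian; this smooths the conditioning manifold, augmenting the
        training data and helping to avoid overfitting.
        """
        def __init__(self, sent_dim=256, cond_dim=100, noise_dim=100):
            super().__init__()
            self.noise_dim = noise_dim
            self.fc = nn.Linear(sent_dim, cond_dim * 2)

        def forward(self, sentence_vector):
            mu, log_var = self.fc(sentence_vector).chunk(2, dim=1)
            # Reparameterization: sample c ~ N(mu, sigma^2).
            c = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)
            # Concatenate a noise vector sampled from a standard normal
            # distribution to form the generator input.
            z = torch.randn(c.size(0), self.noise_dim, device=c.device)
            return torch.cat([c, z], dim=1)  # (batch, cond_dim + noise_dim)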
3) The word context matrix of the image features is computed from the image feature matrix obtained in step 2) and the word feature matrix obtained in step 1); each column of this matrix represents a word context vector associated with one sub-region of the image. A minimal sketch follows.
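A minimal sketch of the word context computation; passing the perceptron layer in as a 1x1 convolution is our assumption.

    import torch
    import torch.nn.functional as F

    def word_context_matrix(image_features, word_features, proj):
        """Word context vectors for image sub-regions (illustrative sketch).

        image_features: (batch, D, N) image feature matrix, one column per sub-region
        word_features:  (batch, E, T) word feature matrix, one column per word
        proj: the added perceptron layer, e.g. nn.Conv1d(E, D, kernel_size=1),
              mapping word features into the common D-dimensional space
        """
        words = proj(word_features)                                # (batch, D, T)
        # Dot product of the j-th image column and the i-th word column,
        # normalized over words, gives the weight of sub-region j for word i.
        scores = torch.bmm(words.transpose(1, 2), image_features)  # (batch, T, N)
        weights = F.softmax(scores, dim=1)
        # Word context vector of each sub-region: weighted sum of word vectors.
        return torch.bmm(words, weights)                           # (batch, D, N)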
4) The image feature matrix is computed and refined by a three-stage generative adversarial network to generate the image (a skeleton of the generator is sketched after these sub-steps). Each stage operates as follows:
4.1) the image feature matrix is input into the first-stage generative adversarial network to obtain a once-optimized image feature matrix, and a 3x3 convolution is applied to output an image at 64x64 resolution;
4.2) the once-optimized image feature matrix and the word context matrix are input into the second-stage generative adversarial network to obtain a twice-optimized image feature matrix, and a 3x3 convolution is applied to output an image at 128x128 resolution;
4.3) an attention mechanism is applied to the image feature matrix, strengthening the key sub-regions of the image and weakening the unimportant regions, and the word context matrix is updated as in step 3);
4.4) the twice-optimized image feature matrix and the updated word context matrix are input into the third-stage generative adversarial network to obtain the final image feature matrix, and a 3x3 convolution is applied to output an image at 256x256 resolution.
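A structural skeleton of the three-stage coarse-to-fine generator, under assumed channel counts; fusing the word context features by channel-wise concatenation, and reshaping them to each stage's spatial size, are our simplifications rather than details taken from the patent.

    import torch
    import torch.nn as nn

    def up_block(in_ch, out_ch):
        # Nearest-neighbour upsampling followed by a 3x3 convolution.
        return nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True))

    def to_image(in_ch):
        # 3x3 convolution producing an RGB image in [-1, 1].
        return nn.Sequential(nn.Conv2d(in_ch, 3, 3, padding=1), nn.Tanh())

    class StagedGenerator(nn.Module):
        """Three-stage coarse-to-fine generator (illustrative sketch)."""
        def __init__(self, in_dim=200, ch=32):
            super().__init__()
            self.ch = ch
            self.fc = nn.Sequential(nn.Linear(in_dim, ch * 4 * 4), nn.ReLU(True))
            # Stage 1: 4x4 -> 64x64 via four upsampling blocks.
            self.stage1 = nn.Sequential(*[up_block(ch, ch) for _ in range(4)])
            # Stages 2 and 3 fuse the (reshaped) word context features.
            self.stage2 = up_block(ch * 2, ch)   # 64x64   -> 128x128
            self.stage3 = up_block(ch * 2, ch)   # 128x128 -> 256x256
            self.out1, self.out2, self.out3 = to_image(ch), to_image(ch), to_image(ch)

        def forward(self, z_c, ctx64, ctx128):
            # z_c: conditioning + noise vector; ctx64/ctx128: word context
            # features reshaped to (batch, ch, 64, 64) and (batch, ch, 128, 128).
            h1 = self.stage1(self.fc(z_c).view(-1, self.ch, 4, 4))
            h2 = self.stage2(torch.cat([h1, ctx64], dim=1))
            h3 = self.stage3(torch.cat([h2, ctx128], dim=1))
            return self.out1(h1), self.out2(h2), self.out3(h3)   # 64, 128, 256 px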
5) The generated high-resolution image is mapped to a local image feature matrix using an Inception-v3 model pre-trained on the ImageNet dataset as the image encoder; the image encoder is essentially a convolutional neural network. A sketch follows.
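A sketch of such an image encoder using torchvision; taking local features from the Mixed_6e layer (a 17x17 grid of sub-regions) and projecting them to 256 dimensions are our assumptions, not choices specified by the patent.

    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision import models

    class ImageEncoder(nn.Module):
        """Local image feature extractor (illustrative sketch).

        Runs an ImageNet-pretrained Inception-v3 up to an intermediate
        convolutional layer so that each spatial position corresponds to
        one image sub-region.
        """
        def __init__(self, out_dim=256):
            super().__init__()
            self.cnn = models.inception_v3(weights='IMAGENET1K_V1')  # torchvision >= 0.13 API
            self.proj = nn.Conv2d(768, out_dim, kernel_size=1)

        def forward(self, x):
            # x: generated image resized to (batch, 3, 299, 299) and
            # normalized with the ImageNet statistics.
            b = self.cnn
            x = b.Conv2d_1a_3x3(x); x = b.Conv2d_2a_3x3(x); x = b.Conv2d_2b_3x3(x)
            x = F.max_pool2d(x, 3, stride=2)
            x = b.Conv2d_3b_1x1(x); x = b.Conv2d_4a_3x3(x)
            x = F.max_pool2d(x, 3, stride=2)
            for name in ('Mixed_5b', 'Mixed_5c', 'Mixed_5d', 'Mixed_6a',
                         'Mixed_6b', 'Mixed_6c', 'Mixed_6d', 'Mixed_6e'):
                x = getattr(b, name)(x)          # (batch, 768, 17, 17)
            local = self.proj(x)                 # (batch, out_dim, 17, 17)
            return local.flatten(2)              # (batch, out_dim, 289): local image feature matrix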
6) The similarity between the generated image and the text description is evaluated as follows (a sketch follows these sub-steps):
6.1) an attention mechanism is applied to the local image feature matrix, strengthening the key sub-regions of the image and weakening the unimportant regions;
6.2) the cosine similarity between the optimized local image feature matrix and the word feature matrix is computed; it evaluates the similarity between the text description and the generated image and guides the optimization of the generator in the generative adversarial network.
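A minimal sketch of this attention-weighted matching, assuming the word features have already been projected into the image feature space; the sharpening factor gamma is an assumed hyperparameter.

    import torch
    import torch.nn.functional as F

    def image_text_similarity(local_features, word_features, gamma=5.0):
        """Attention-weighted image-text similarity (illustrative sketch).

        local_features: (batch, D, N) local image feature matrix
        word_features:  (batch, D, T) word feature matrix, already mapped
                        into the same D-dimensional space
        """
        img = F.normalize(local_features, dim=1)
        words = F.normalize(word_features, dim=1)
        # Attend over sub-regions for each word: key regions are
        # strengthened, unimportant regions weakened.
        scores = torch.bmm(words.transpose(1, 2), img)            # (batch, T, N)
        attn = F.softmax(gamma * scores, dim=2)
        region_ctx = torch.bmm(img, attn.transpose(1, 2))         # (batch, D, T)
        # Cosine similarity between each word and its attended region
        # context, averaged over words, scores the text-image match and
        # can serve as a training signal for the generator.
        return F.cosine_similarity(region_ctx, words, dim=1).mean(dim=1)  # (batch,)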
In summary, the invention provides a new method for the text-to-image generation process: an image is generated by a generative adversarial network equipped with an attention mechanism, which keeps the content of the generated image semantically consistent with the text description, gives the generated image more refined details, effectively improves the resolution of the generated image, and increases the diversity of the generated images.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited thereto; any other change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention is an equivalent replacement and falls within the protection scope of the present invention.

Claims (7)

1. A text-to-image generation method based on a generative adversarial network, comprising the steps of:
1) Inputting a text description into the network, and generating a word feature matrix and a sentence feature vector from it;
2) Applying conditioning augmentation and concatenating a noise vector to the sentence feature vector to obtain an image feature matrix;
3) Computing the word context matrix of the image features;
4) Computing in the generative adversarial network using the image feature matrix and the word context matrix, progressively generating images of increasing resolution in three stages;
5) Extracting a local image feature matrix from the generated image;
6) Evaluating the similarity between the generated image and the text description to optimize the next round of image generation;
the step 4) is specifically as follows:
4.1 Inputting the image feature matrix into a first layer generation countermeasure network to obtain an optimized image feature matrix, and carrying out 3x3 convolution on the optimized image feature matrix to output an image with the resolution of 64 x 64;
4.2 Inputting the image feature matrix and the word context matrix after primary optimization into a second layer to generate an countermeasure network, obtaining the image feature matrix after secondary optimization, and carrying out 3x3 convolution on the image feature matrix to output an image with 128 x 128 resolution;
4.3 Adding an attention mechanism to the image feature matrix, strengthening key subregions of the image, weakening unimportant regions of the image, and then updating the word context matrix by using the step 3);
4.4 Inputting the secondarily optimized image feature matrix and the updated word context matrix into a third layer of generation countermeasure network to obtain a final image feature matrix, and carrying out 3x3 convolution on the final image feature matrix to output 256 x 256 resolution images.
2. The text-to-image generation method based on a generative adversarial network according to claim 1, wherein in step 1), the text description is a description of attributes of one or more objects, and the two hidden states corresponding to each word in the text description are concatenated through a bidirectional long short-term memory network to represent the semantics of the word; the attributes include category, size, number, shape, and location; the final two hidden states are concatenated to obtain the global sentence vector, and the remaining hidden states are concatenated to obtain the word feature matrix.
3. The text-to-image generation method based on a generative adversarial network according to claim 1, wherein step 2) is specifically as follows:
2.1) applying conditioning augmentation to the sentence feature vector to augment the training data and avoid overfitting;
2.2) concatenating the conditioning-augmented vector with a noise vector sampled from a standard normal distribution to obtain the image feature matrix.
4. The text-to-image generation method based on a generative adversarial network according to claim 1, wherein in step 3) the word context matrix of the image features is computed from the image feature matrix obtained in step 2) and the word feature matrix obtained in step 1), each column of the word context matrix representing a word context vector associated with one sub-region of the image.
5. The text-to-image generation method based on a generative adversarial network according to claim 4, wherein the word context matrix of the image features is computed from the image feature matrix obtained in step 2) and the word feature matrix obtained in step 1), specifically:
first, the word features are mapped into the common semantic space of the image features by adding a new perceptron layer;
then the weight of the j-th image sub-region with respect to the i-th word is computed by normalizing the dot product of the j-th image feature vector and the i-th word feature vector;
finally, the word context vector of an image sub-region is obtained as the weighted sum of the word feature vectors, using the weights computed for that sub-region; each column of the word context matrix is the word context vector of one image sub-region.
6. The text-to-image generation method based on a generative adversarial network according to claim 1, wherein in step 5) the local image feature matrix is extracted from the generated image by an image encoder; the image encoder is essentially a convolutional neural network, an Inception-v3 model pre-trained on the ImageNet dataset.
7. The text-to-image generation method based on a generative adversarial network according to claim 1, wherein in step 6) the specific process of evaluating the similarity between the generated image and the text description is as follows:
6.1) applying an attention mechanism to the local image feature matrix, strengthening the key sub-regions of the image and weakening the unimportant regions;
6.2) computing the cosine similarity between the optimized local image feature matrix and the word feature matrix, which is used to evaluate the similarity between the text description and the generated image and thereby guide the optimization of the generator in the generative adversarial network.
CN202010046540.9A 2020-01-16 2020-01-16 Text-to-image generation method based on a generative adversarial network Active CN111260740B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010046540.9A CN111260740B (en) 2020-01-16 2020-01-16 Text-to-image generation method based on a generative adversarial network


Publications (2)

Publication Number Publication Date
CN111260740A CN111260740A (en) 2020-06-09
CN111260740B (en) 2023-05-23

Family

ID=70950653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010046540.9A Active CN111260740B (en) 2020-01-16 2020-01-16 Text-to-image generation method based on a generative adversarial network

Country Status (1)

Country Link
CN (1) CN111260740B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111918071A (en) * 2020-06-29 2020-11-10 北京大学 Data compression method, device, equipment and storage medium
US20220005235A1 (en) * 2020-07-06 2022-01-06 Ping An Technology (Shenzhen) Co., Ltd. Method and device for text-based image generation
CN114078172B (en) * 2020-08-19 2023-04-07 四川大学 Text image generation method for progressively generating confrontation network based on resolution
CN112348911B (en) * 2020-10-28 2023-04-18 山东师范大学 Semantic constraint-based method and system for generating fine-grained image by stacking texts
CN113343705B (en) * 2021-04-26 2022-07-05 山东师范大学 Text semantic based detail preservation image generation method and system
CN113361250A (en) * 2021-05-12 2021-09-07 山东师范大学 Bidirectional text image generation method and system based on semantic consistency
CN113361251B (en) * 2021-05-13 2023-06-30 山东师范大学 Text generation image method and system based on multi-stage generation countermeasure network
CN113191375B (en) * 2021-06-09 2023-05-09 北京理工大学 Text-to-multi-object image generation method based on joint embedding
CN113674374B (en) * 2021-07-20 2022-07-01 广东技术师范大学 Chinese text image generation method and device based on generation type countermeasure network
CN113793404B (en) * 2021-08-19 2023-07-04 西南科技大学 Manual controllable image synthesis method based on text and contour
CN113837229B (en) * 2021-08-30 2024-03-15 厦门大学 Knowledge-driven text-to-image generation method
CN113537416A (en) * 2021-09-17 2021-10-22 深圳市安软科技股份有限公司 Method and related equipment for converting text into image based on generative confrontation network
CN114332288B (en) * 2022-03-15 2022-06-14 武汉大学 Method for generating text generation image of confrontation network based on phrase drive and network
CN115797495B (en) * 2023-02-07 2023-04-25 武汉理工大学 Method for generating image by sentence-character semantic space fusion perceived text
CN117095083B (en) * 2023-10-17 2024-03-15 华南理工大学 Text-image generation method, system, device and storage medium
CN117152370B (en) * 2023-10-30 2024-02-02 碳丝路文化传播(成都)有限公司 AIGC-based 3D terrain model generation method, system, equipment and storage medium


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3040165A1 (en) * 2016-11-18 2018-05-24 Salesforce.Com, Inc. Spatial attention model for image captioning
CN110135441A (en) * 2019-05-17 2019-08-16 北京邮电大学 A kind of text of image describes method and device
CN110609891A (en) * 2019-09-18 2019-12-24 合肥工业大学 Visual dialog generation method based on context awareness graph neural network

Also Published As

Publication number Publication date
CN111260740A (en) 2020-06-09

Similar Documents

Publication Publication Date Title
CN111260740B (en) Text-to-image generation method based on a generative adversarial network
CN108875807B (en) Image description method based on multiple attention and multiple scales
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN111858954B (en) Task-oriented text-generated image network model
CN110163299B (en) Visual question-answering method based on bottom-up attention mechanism and memory network
CN109783666B (en) Image scene graph generation method based on iterative refinement
CN113343705B (en) Text semantic based detail preservation image generation method and system
CN111177376A (en) Chinese text classification method based on BERT and CNN hierarchical connection
CN112036276B (en) Artificial intelligent video question-answering method
CN110264407B (en) Image super-resolution model training and reconstruction method, device, equipment and storage medium
CN111598183A (en) Multi-feature fusion image description method
Qi et al. Personalized sketch-based image retrieval by convolutional neural network and deep transfer learning
CN115222998B (en) Image classification method
CN113140023B (en) Text-to-image generation method and system based on spatial attention
CN111949824A (en) Visual question answering method and system based on semantic alignment and storage medium
WO2020228536A1 (en) Icon generation method and apparatus, method for acquiring icon, electronic device, and storage medium
CN109740012A (en) The method that understanding and question and answer are carried out to image, semantic based on deep neural network
CN115018941A (en) Text-to-image generation algorithm based on improved version text parser
CN117094395B (en) Method, device and computer storage medium for complementing knowledge graph
CN113869007A (en) Text generation image learning model based on deep learning
CN113420833A (en) Visual question-answering method and device based on question semantic mapping
CN117033609A (en) Text visual question-answering method, device, computer equipment and storage medium
CN113239678A (en) Multi-angle attention feature matching method and system for answer selection
Kasi et al. A Deep Learning Based Cross Model Text to Image Generation using DC-GAN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant