CN114078172B - Text-to-image generation method based on a resolution-progressive generative adversarial network - Google Patents
- Publication number
- CN114078172B (application CN202010836037.3A)
- Authority
- CN
- China
- Prior art keywords
- mask
- resolution
- generated
- text
- feature vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—2D [Two Dimensional] image generation
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
Abstract
Aiming at the instability of generative networks, the invention provides a text-to-image generation method based on a resolution-progressive generative adversarial network. In the field of text-to-image generation, generative networks can already produce high-resolution pictures with clear details. In the method of the invention, a semantic separation-fusion generation module at the low-resolution layer separates the text features into three feature vectors, generates a corresponding feature map for each with its own generator, and fuses the feature maps under the guidance of a mask map to obtain the low-resolution image; the mask picture serves as a semantic constraint that improves the stability of the low-resolution generator. Meanwhile, a resolution-progressive residual structure at the high-resolution layers further improves the quality of the generated pictures. The method offers new ideas and broad application prospects in the field of text-to-image generation.
Description
Technical Field
The invention relates to a text-to-image generation method based on a resolution-progressive generative adversarial network, and belongs to the technical fields of deep learning and computer vision.
Background
Text-to-Image Synthesis (Text2Image) is a relatively new frontier in the field of computer vision. Text-to-image generation aims to produce a natural image corresponding to an input descriptive sentence. It lies at the intersection of computer vision and natural language processing, helps mine the latent relations between text and images, and contributes to forming a visual-semantic mechanism for computers.
The task of generating images from text was first proposed in 2016: for each input sentence, an image corresponding to the text description must be generated automatically. Reed et al. built GAN-INT-CLS and related networks on top of the conditional adversarial generative network to address this problem. Although these networks can generate images that are related to the description and have a certain clarity, the generated images are of low resolution, and the semantic consistency between the text and the generated image is largely ignored.
Text-to-image generation is a very challenging problem with two main goals: (1) generating realistic images; and (2) matching the generated image to the input text description. Most current text-to-image frameworks adopt the conditional generative adversarial network (cGAN): a pretrained text encoder encodes the input descriptive sentence into a semantic vector, which is concatenated with a noise vector drawn from a normal distribution and fed to the cGAN as its condition to generate a natural image. To produce clear high-resolution pictures, multi-scale outputs and multi-scale discriminators are adopted to improve picture quality. For semantic consistency, the high-resolution maps are typically fine-tuned with attention mechanisms and similar techniques.
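As a minimal illustration of the cGAN conditioning scheme described above (all sizes are assumptions for the sketch, not taken from the patent):

```python
import numpy as np

# Sketch of cGAN-style conditioning: a pretrained text encoder is assumed
# to yield a sentence embedding c, which is concatenated with Gaussian
# noise z to form the generator input. All dimensions are illustrative.
rng = np.random.default_rng(0)
c = rng.standard_normal(256)   # text semantic feature vector (assumed size)
z = rng.standard_normal(100)   # noise drawn from a normal distribution
s = np.concatenate([c, z])     # conditional input to the generator
print(s.shape)                 # (356,)
```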
Because of the instability of generative adversarial networks, most text-to-image networks are prone to producing semantically unreasonable pictures. Taking bird pictures as an example, when the structure of the generated target is not constrained, some generated pictures are unrealistic: two-headed birds, missing parts, disconnected target regions, or fuzzy boundaries caused by blurring of foreground and background, all of which degrade the generated result. At present, most research on text-based picture generation focuses on improving the high-resolution generator, correcting and fine-tuning the generated pictures with attention mechanisms and the like. To generate clear high-resolution natural pictures, generative networks often cascade several generators, progressively refining low-resolution pictures into high-resolution ones. Meanwhile, studies have shown that low-resolution generators determine structure and layout while high-resolution generators handle detail and random variation; if generation fails on the spatial structure of the picture, no amount of detail correction can save it.
Therefore, the picture initially produced by the low-resolution generator has the greatest impact on the spatial semantic structure of the final result. A better low-resolution generator can ensure the semantic plausibility of the low-resolution picture and thereby improve the stability of the whole generative network.
Disclosure of Invention
The invention provides a text-to-image generation method based on a resolution-progressive generative adversarial network, aiming to improve the stability of image generation. At the low-resolution layer, a semantic separation-fusion generation module separates the text features into three feature vectors under the guidance of a self-attention mechanism; corresponding feature maps are generated by dedicated generators and fused into a low-resolution map, with the mask picture serving as a semantic constraint that improves the stability of the low-resolution generator. At the high-resolution layers, a resolution-progressive residual structure, combined with a word-attention mechanism and pixel shuffling, further improves the quality of the generated picture. The method reduces structural errors in the generated target to a certain extent and further improves the quality of the generated image.
The invention realizes the purpose through the following technical scheme:
The method comprises the following steps:
Step one: encode the input descriptive sentence into a text semantic feature vector c through the Text-Encoder and concatenate it with noise z drawn from a normal distribution to obtain a new feature vector s;
Step two: using a semantic separation module, compute the attention weights of the feature vector output by the encoding end with a self-attention module, and multiply the attention weights by the original semantic feature vector to obtain the separated foreground feature vector s_fore, background feature vector s_back, and mask feature vector s_mask;
Step three: through three different first-stage generators G_fore, G_back, G_mask, generate feature maps R_fore, R_back, R_mask of size 64×64; compute the generated binary mask image I_mask from R_mask; the first-stage generator outputs a feature map R_0 and a first-stage generated picture I_0;
Step four: pass the first-stage feature map through the second- and third-stage generators G_1, G_2, combined with the resolution-progressive residual structure, to obtain generated pictures I_1 and I_2 of sizes 128×128 and 256×256, respectively;
Step five: each generation stage has a corresponding discriminator D_0, D_1, D_2; the mask picture generated in the first stage also has a corresponding discriminator D_mask to constrain the generated result;
Step six: compute the DAMSM loss using the 256×256 image generated by the last generator.
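Steps one through four can be sketched end-to-end as follows. The generators, separation weights, and binarisation are trivial stand-ins, and the fusion rule is an assumption consistent with the mask-guided fusion described here, not the patent's exact operator:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 16, 64  # feature-vector length and first-stage resolution (illustrative)

def toy_G(s, seed, hw):
    """Stand-in for a generator: maps a feature vector to a hw x hw map."""
    g = np.random.default_rng(seed)
    return (g.standard_normal((hw * hw, s.size)) @ s).reshape(hw, hw)

def upsample2x(x):
    """Nearest-neighbour 2x upsampling, standing in for one residual stage."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

s = rng.standard_normal(D)                           # step 1: c ++ z, abstracted
s_fore, s_back, s_mask = 0.5 * s, 0.3 * s, 0.2 * s   # step 2: toy attention split
R_fore = toy_G(s_fore, 1, H)                         # step 3: three generators
R_back = toy_G(s_back, 2, H)
R_mask = toy_G(s_mask, 3, H)
I_mask = (R_mask > 0).astype(float)                  # toy binarisation of the mask
R_0 = I_mask * R_fore + (1 - I_mask) * R_back        # assumed mask-guided fusion
I_1 = upsample2x(R_0)                                # step 4: 64 -> 128
I_2 = upsample2x(I_1)                                # 128 -> 256
print(R_0.shape, I_1.shape, I_2.shape)               # (64, 64) (128, 128) (256, 256)
```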
It should be noted that:
In step two, the attention weight of the i-th semantic feature vector in the semantic attention separation module is calculated as follows:
α_{i,j} = exp(W_i s^T s) / Σ_j exp(W_j s^T s)
where W_i is the weight of a linear transformation;
The procedure in step three, in which the generated binary mask image I_mask is computed from R_mask and the first-stage generator outputs the feature map R_0 and the first-stage generated picture I_0, is as follows:
(1) pass R_mask through a convolution layer and an activation layer to obtain the single-channel binary mask image I_mask;
(2) compute the first-stage feature map R_0 by the formula:
R_0 = I_mask ⊙ R_fore + (1 − I_mask) ⊙ R_back
(3) pass R_0 through a convolution layer and an activation layer to finally obtain the first-stage generated picture I_0.
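A numerical check of the attention-weight formula in the note above, under one assumed reading (the softmax runs over the branch index j of the three separated vectors; the patent's exact indexing is ambiguous in this text, and all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8                                   # illustrative feature length
s = rng.standard_normal(D)              # encoded sentence vector
W = rng.standard_normal((3, D, D))      # one linear weight W_j per branch

logits = np.array([s @ Wj @ s for Wj in W])
alpha = np.exp(logits - logits.max())
alpha /= alpha.sum()                    # softmax normalisation over branches

# Each separated vector is its attention weight times the original vector.
s_fore, s_back, s_mask = (a * s for a in alpha)
print(round(float(alpha.sum()), 6))     # 1.0
```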
The invention provides a text-to-image generation method based on a resolution-progressive generative adversarial network. A semantic feature separation-fusion module in the low-resolution generation layer improves the stability of image structure generation, a resolution-progressive residual structure in the high-resolution generation layers improves the quality of the generated images, and the effectiveness of the proposed network is verified on the public datasets CUB and Oxford-102.
Drawings
Fig. 1 is a diagram of the network architecture of the present invention.
FIG. 2 is a diagram of the self-attention separation architecture of the present invention.
Fig. 3 is a high resolution residual network architecture of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings in which:
FIG. 1 is the network architecture diagram of the text-to-image generation method based on a resolution-progressive generative adversarial network.
Text encoding end: the text encoding end of the generator consists of a pretrained text encoder (Text-Encoder). The input descriptive sentence is encoded by the Text-Encoder into a text semantic feature vector c, which is concatenated with noise z drawn from a normal distribution to form a new feature vector used as the input of the image decoding end of the generator. The Text-Encoder is also responsible for encoding the words of the text description for computing attention maps, which serve as one of the inputs to the last two stages of the image decoding end (64×64 to 128×128 and 128×128 to 256×256).
Image decoding end: the encoded semantic feature vector is passed through a conditioning augmentation module to obtain the condition vector. At the low-resolution layer, the feature vector is passed through the self-attention separation module to obtain three semantic feature vectors with different attention weights. Three different generators generate three different semantic feature maps, and the generated low-resolution map is obtained by feature fusion. At the high-resolution layers, a residual structure combined with an attention mechanism fine-tunes the high-resolution maps, realizing generation from low resolution to high resolution and finally yielding a high-quality picture.
Discriminators: each generation stage has a corresponding discriminator, D_0, D_1, D_2 respectively. In the final generation stage, the generated 256×256 image is also used to compute the DAMSM loss.
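The per-stage discriminator arrangement can be sketched as below; the scoring function is a trivial stand-in for a real convolutional discriminator, and the mask discriminator D_mask from step five is included. All inputs are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)

def toy_D(img):
    """Stand-in discriminator: image -> scalar realness score in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-img.mean()))

# One generated image per stage (D0: 64x64, D1: 128x128, D2: 256x256).
stages = {64: rng.standard_normal((64, 64)),
          128: rng.standard_normal((128, 128)),
          256: rng.standard_normal((256, 256))}
scores = {res: toy_D(img) for res, img in stages.items()}

# The first-stage binary mask gets its own discriminator (D_mask).
mask_score = toy_D((rng.standard_normal((64, 64)) > 0).astype(float))
print(sorted(scores))                   # [64, 128, 256]
```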
FIG. 2 is a diagram of the self-attention separation architecture of the present invention. The semantic separation module adopts a self-attention mechanism: the feature vector output by the encoding end is passed through the self-attention module to compute the corresponding attention weights, which are then multiplied by the original semantic feature vector to obtain the separated foreground, background, and mask feature vectors.
FIG. 3 is the high-resolution residual network architecture of the present invention. In the residual network, an attention map is first obtained under the guidance of the word vectors: the attention weights between the previous feature map and the word vectors are computed, and multiplying these weights by the feature map yields the attention map. The attention map is then concatenated with the previous feature map as the input of the generator. At the same time, the previous feature map is upsampled by a factor of two, the output of the generator is added to the upsampled result, and the picture of the corresponding scale for this stage is obtained through an activation layer.
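The residual stage of FIG. 3 can be sketched as follows under assumed shapes: word vectors attend over the spatial positions of the previous feature map, the attended map is concatenated with that map as generator input, and the generator output is added to a 2x-upsampled skip path. The toy generator and all sizes are illustrative, not the patent's:

```python
import numpy as np

rng = np.random.default_rng(3)
C, H, T = 8, 4, 5                            # channels, spatial size, word count

F_prev = rng.standard_normal((C, H, H))      # previous-stage feature map
words = rng.standard_normal((T, C))          # word vectors (assumed projected to C)

# Word attention: softmax over the T words at every spatial position.
flat = F_prev.reshape(C, H * H)              # C x N with N = H*H positions
logits = words @ flat                        # T x N similarity scores
logits = logits - logits.max(axis=0)
attn = np.exp(logits) / np.exp(logits).sum(axis=0)
F_attn = (words.T @ attn).reshape(C, H, H)   # attention map, same shape as F_prev

gen_in = np.concatenate([F_attn, F_prev])    # concatenated generator input (2C ch.)

def toy_generator(x):
    """Stand-in generator: mix the two halves back to C channels, upsample 2x."""
    y = x[:C] + x[C:]
    return y.repeat(2, axis=1).repeat(2, axis=2)

skip = F_prev.repeat(2, axis=1).repeat(2, axis=2)  # 2x-upsampled skip path
out = skip + toy_generator(gen_in)                 # residual addition
print(out.shape)                                   # (8, 8, 8)
```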
Claims (1)
1. A text-to-image generation method based on a resolution-progressive generative adversarial network, characterized in that the method comprises the following steps:
Step one: encode an input descriptive sentence into a text semantic feature vector c through a Text-Encoder and concatenate it with noise z drawn from a normal distribution to obtain a new feature vector s;
Step two: using a semantic separation module, compute the attention weights of the feature vector output by the encoding end with a self-attention module, and multiply the attention weights by the original semantic feature vector to obtain the separated foreground feature vector s_fore, background feature vector s_back, and mask feature vector s_mask; in the semantic attention separation module, the attention weight of the i-th semantic feature vector is calculated as follows:
α_{i,j} = exp(W_i s^T s) / Σ_j exp(W_j s^T s)
where W_i is the weight of a linear transformation;
step three: by a first stage of three different generators G fore ,G back ,G mask Respectively generating feature maps R with the size of 64 multiplied by 64 fore ,R back ,R mask Through R mask Calculating to obtain a generated binary mask image I mask The first stage generator outputs a feature map R 0 And first level generating picture I 0 The method comprises the following steps:
(1) R is to be mask Obtaining a single-channel binary mask image I through the convolution layer and the activation layer mask ;
(2) compute the first-stage feature map R_0 by the formula:
R_0 = I_mask ⊙ R_fore + (1 − I_mask) ⊙ R_back
(3) pass R_0 through a convolution layer and an activation layer to finally obtain the first-stage generated picture I_0;
Step four: pass the first-stage feature map through the second- and third-stage generators G_1, G_2, combined with the resolution-progressive residual structure, to obtain generated pictures I_1 and I_2 of sizes 128×128 and 256×256, respectively;
Step five: each generation stage has a corresponding discriminator D_0, D_1, D_2; the mask picture generated in the first stage also has a corresponding discriminator D_mask to constrain the generated result;
Step six: compute the DAMSM loss using the 256×256 image generated by the last generator.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010836037.3A CN114078172B (en) | 2020-08-19 | 2020-08-19 | Text-to-image generation method based on a resolution-progressive generative adversarial network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114078172A CN114078172A (en) | 2022-02-22 |
CN114078172B (en) | 2023-04-07 |
Family
ID=80282441
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010836037.3A Active CN114078172B (en) | 2020-08-19 | 2020-08-19 | Text image generation method for progressively generating confrontation network based on resolution |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114078172B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115713680B (en) * | 2022-11-18 | 2023-07-25 | 山东省人工智能研究院 | Semantic guidance-based face image identity synthesis method |
CN116246331B (en) * | 2022-12-05 | 2024-08-16 | 苏州大学 | Automatic keratoconus grading method, device and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109495741A (en) * | 2018-11-29 | 2019-03-19 | 四川大学 | Method for compressing image based on adaptive down-sampling and deep learning |
CN110443863A (en) * | 2019-07-23 | 2019-11-12 | 中国科学院深圳先进技术研究院 | Method, electronic equipment and the storage medium of text generation image |
CN110706302A (en) * | 2019-10-11 | 2020-01-17 | 中山市易嘀科技有限公司 | System and method for text synthesis image |
CN111260740A (en) * | 2020-01-16 | 2020-06-09 | 华南理工大学 | Text-to-image generation method based on generation countermeasure network |
CN111340907A (en) * | 2020-03-03 | 2020-06-26 | 曲阜师范大学 | Text-to-image generation method of self-adaptive attribute and instance mask embedded graph |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11074495B2 (en) * | 2013-02-28 | 2021-07-27 | Z Advanced Computing, Inc. (Zac) | System and method for extremely efficient image and pattern recognition and artificial intelligence platform |
Non-Patent Citations (3)
Title |
---|
Rintaro Yanagi et al. Scene Retrieval from Multiple Resolution Generated Images Based on Text-to-Image GAN. 2019 IEEE International Symposium on Circuits and Systems (ISCAS), 2019, pp. 1-5. *
Xu Heyao. Research on Image Translation Based on Cycle Generative Adversarial Networks and Text Information. CNKI Outstanding Master's Theses Full-text Database, Information Science and Technology, 2020, No. 02, pp. I138-1742. *
Xu Yining et al. Text-to-image generation method based on multi-level resolution-progressive generative adversarial networks. Journal of Computer Applications, 2020, No. 12, pp. 3612-3617. *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113139907B (en) | Generation method, system, device and storage medium for visual resolution enhancement | |
CN111179167B (en) | Image super-resolution method based on multi-stage attention enhancement network | |
CN113096017B (en) | Image super-resolution reconstruction method based on depth coordinate attention network model | |
CN114078172B (en) | Text image generation method for progressively generating confrontation network based on resolution | |
CN113361250A (en) | Bidirectional text image generation method and system based on semantic consistency | |
CN111402365B (en) | Method for generating picture from characters based on bidirectional architecture confrontation generation network | |
CN113140020B (en) | Method for generating image based on text of countermeasure network generated by accompanying supervision | |
CN112016604A (en) | Zero-resource machine translation method applying visual information | |
CN109034198B (en) | Scene segmentation method and system based on feature map recovery | |
CN111768354A (en) | Face image restoration system based on multi-scale face part feature dictionary | |
CN112381716A (en) | Image enhancement method based on generation type countermeasure network | |
Cheng et al. | Hybrid transformer and cnn attention network for stereo image super-resolution | |
CN113850718A (en) | Video synchronization space-time super-resolution method based on inter-frame feature alignment | |
CN117173219A (en) | Video target tracking method based on hintable segmentation model | |
CN112396554A (en) | Image super-resolution algorithm based on generation countermeasure network | |
CN117689592A (en) | Underwater image enhancement method based on cascade self-adaptive network | |
CN116823610A (en) | Deep learning-based underwater image super-resolution generation method and system | |
WO2023010981A1 (en) | Encoding and decoding methods and apparatus | |
CN114881858A (en) | Lightweight binocular image super-resolution method based on multi-attention machine system fusion | |
CN109657589B (en) | Human interaction action-based experiencer action generation method | |
Yang et al. | Depth map super-resolution via multilevel recursive guidance and progressive supervision | |
CN118037898B (en) | Text generation video method based on image guided video editing | |
US20240161250A1 (en) | Techniques for denoising diffusion using an ensemble of expert denoisers | |
CN113628108B (en) | Image super-resolution method and system based on discrete representation learning and terminal | |
Chen et al. | Pyramid attention dense network for image super-resolution |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||