CN114078172A - Text-to-image generation method based on a resolution-progressive generative adversarial network - Google Patents


Info

Publication number
CN114078172A
CN114078172A (application CN202010836037.3A)
Authority
CN
China
Prior art keywords
mask
resolution
generated
text
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010836037.3A
Other languages
Chinese (zh)
Other versions
CN114078172B (en
Inventor
何小海
许一宁
卿粼波
张津
罗晓东
滕奇志
吴小强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202010836037.3A priority Critical patent/CN114078172B/en
Publication of CN114078172A publication Critical patent/CN114078172A/en
Application granted granted Critical
Publication of CN114078172B publication Critical patent/CN114078172B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/002D [Two Dimensional] image generation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Processing (AREA)
  • Compression Of Band Width Or Redundancy In Fax (AREA)

Abstract

Aiming at the instability of generative networks, the invention designs a text-to-image generation method based on a resolution-progressive generative adversarial network. In the field of text-to-image generation, generative networks can already produce high-resolution pictures with clear details. At the low-resolution layer, a semantic separation-fusion generation module separates the feature vector into three feature vectors; generators produce the corresponding feature maps, which are fused under the guidance of a Mask map to obtain the low-resolution map, and the Mask picture serves as a semantic constraint that improves the stability of the low-resolution generator. Meanwhile, a resolution-progressive residual structure is adopted at the high-resolution layers, further improving the quality of the generated picture. The method offers new ideas and broad application prospects in the field of text-to-image generation.

Description

Text-to-image generation method based on a resolution-progressive generative adversarial network
Technical Field
The invention designs a text-to-image generation method based on a resolution-progressive generative adversarial network, and relates to the technical fields of deep learning and computer vision.
Background
Text-to-Image Synthesis (Text2Image) is a frontier direction in the field of computer vision. The task aims to generate a natural image corresponding to an input descriptive sentence; it lies at the intersection of computer vision and natural language processing, and helps mine the latent relationship between text and images and build a visual-semantic mechanism for computers.
The text-to-image task was first proposed in 2016: for each input sentence, an image matching the text description must be generated automatically. Reed et al. built GAN-INT-CLS and related networks on top of conditional generative adversarial networks to address it. Although these networks can generate images that are broadly related to the description and have some clarity, the generated images are of low resolution, and the semantic consistency between the text and the generated image is largely ignored.
Text-to-image generation is a very challenging problem with two main goals: (1) generate a vivid image; (2) make the generated image match the input text description. Most current text-to-image frameworks adopt the conditional generative adversarial network (cGAN) paradigm: a pre-trained text encoder encodes the input descriptive sentence into a semantic vector, which is concatenated with a noise vector drawn from a normal distribution and fed to the cGAN as its condition to generate a natural image. To produce sharp high-resolution pictures, multi-scale outputs and multi-scale discriminators are used to improve quality. For semantic consistency, the high-resolution maps are usually fine-tuned with attention mechanisms and similar techniques.
Because of the instability of generative adversarial networks, most text-to-image networks tend to produce semantically implausible pictures. Taking bird pictures as an example, when the structure of the generated target is not constrained, some generated pictures are unrealistic: two-headed birds, missing parts, disconnected target regions, or blurred boundaries caused by an indistinct foreground and background, all of which degrade the result. Current research mostly focuses on improving the high-resolution generators, correcting and fine-tuning the generated pictures with attention mechanisms and similar techniques. To generate clear high-resolution natural pictures, generation networks often cascade several generators, refining progressively from low-resolution to high-resolution pictures. Meanwhile, studies show that the low-resolution generator attends to structure and layout while the high-resolution generators attend to detail and stochastic variation; if generation fails at the spatial structure of the picture, no amount of detail correction can save it.
Therefore, the picture initially produced by the low-resolution generator has the greater influence on the spatial semantic structure of the final result. A better low-resolution generator can guarantee the semantic plausibility of the low-resolution picture and, to a certain extent, improve the stability of the whole generation network.
Disclosure of Invention
The invention provides a text-to-image generation method based on a resolution-progressive generative adversarial network, aiming to improve the stability of image generation. At the low-resolution layer, a semantic separation-fusion generation module separates the text features into three feature vectors under the guidance of a self-attention mechanism; generators produce the corresponding feature maps, which are fused into a low-resolution map, and the Mask picture serves as a semantic constraint that improves the stability of the low-resolution generator. Meanwhile, a resolution-progressive residual structure is adopted at the high-resolution layers, combined with a word-attention mechanism and pixel shuffling, to further improve the quality of the generated picture. The method reduces structural errors in the generated targets to a certain extent and further improves the quality of the generated images.
The invention realizes the purpose through the following technical scheme:
the method comprises the following steps: coding an input description sentence into a Text semantic feature vector c through a Text-Encoder and obtaining a new feature vector s through a noise z which follows normal distribution;
step two: using semantic separation modules to output from the encoding sideThe feature vector is subjected to self-attention module calculation to obtain corresponding attention weight, and then the attention weight is multiplied by the original semantic feature vector to obtain a separated foreground feature vector sforeBackground feature vector sbackAnd Mask feature vector smask
Step three: by a first stage of three different generators Gfore,Gback,GmaskRespectively generating feature maps R with the size of 64 multiplied by 64fore,Rback,RmaskThrough RmaskCalculating to obtain a generated binary mask image ImaskThe first stage generator outputs a feature map R0And first level generating picture I0
Step four: the first-stage feature map passes through a second-stage generator G and a third-stage generator G1,G2Finally, the generated pictures I of 128 × 128 and 256 × 256 are obtained respectively by combining the resolution progressive residual structure1,I2
Step five: for each generation stage, there is a corresponding discriminator, D0,D1,D2Meanwhile, the Mask picture generated in the first stage also has a corresponding discriminator DmaskConstraining the generated result;
step six: DAMSM loss was calculated using the 256 x 256 size image generated by the last generator.
It should be noted that:
The semantic attention separation module in step two computes the i-th separated semantic feature vector s_i as follows:

s_i = α_i · s

α_i = exp(W_i s^T s) / Σ_j exp(W_j s^T s)

where W_i is a linear-transformation weight;
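A rough NumPy sketch of this separation step (treating each W_i as a scalar weight and softmax-normalizing over the three branches is a simplifying assumption, not the patent's exact parameterization):

```python
import numpy as np

def separate_semantics(s, W):
    """Split one semantic feature vector s (shape (D,)) into len(W)
    attention-weighted copies (foreground, background, mask).

    Scores follow the formula above: score_i = W_i * (s^T s), normalized
    over the branches with a softmax to give alpha_i; then s_i = alpha_i * s.
    """
    sTs = float(s @ s)                       # scalar s^T s
    scores = np.array([w * sTs for w in W])  # W_i s^T s, one per branch
    alpha = np.exp(scores - scores.max())    # numerically stable softmax
    alpha = alpha / alpha.sum()
    return [a * s for a in alpha], alpha     # s_i = alpha_i * s

rng = np.random.default_rng(0)
s = rng.standard_normal(8)
branches, alpha = separate_semantics(s, W=[0.01, 0.02, 0.03])
```

The three returned branches play the roles of s_fore, s_back, and s_mask; each is the original vector re-weighted by its branch's attention share.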
The computation in step three of the generated binary mask image I_mask from R_mask, the first-stage output feature map R_0, and the first-stage generated picture I_0 proceeds as follows:
(1) pass R_mask through a convolution layer and an activation layer to obtain the single-channel binary mask image I_mask;
(2) compute the first-stage feature map R_0 by the formula:

R_0 = I_mask ⊙ R_fore + (1 - I_mask) ⊙ R_back

where ⊙ denotes element-wise multiplication;
(3) pass R_0 through a convolution layer and an activation layer to finally obtain the first-stage generated picture I_0.
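A minimal sketch of this mask-guided fusion (the sigmoid-plus-threshold mask and the element-wise combination rule are assumptions inferred from the description, not code from the patent):

```python
import numpy as np

def fuse_with_mask(R_fore, R_back, R_mask):
    """Fuse foreground/background feature maps with a generated mask.

    I_mask stands in for the single-channel binary mask obtained from
    R_mask via convolution + activation; here a sigmoid followed by a
    0.5 threshold plays that role.
    """
    I_mask = (1.0 / (1.0 + np.exp(-R_mask)) > 0.5).astype(R_fore.dtype)
    R_0 = I_mask * R_fore + (1.0 - I_mask) * R_back
    return R_0, I_mask

rng = np.random.default_rng(1)
R_fore = rng.standard_normal((64, 64))
R_back = rng.standard_normal((64, 64))
R_mask = rng.standard_normal((64, 64))
R_0, I_mask = fuse_with_mask(R_fore, R_back, R_mask)
```

Wherever the mask is 1 the fused map takes the foreground value, and elsewhere the background value, which is what lets the Mask picture constrain the spatial layout.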
The invention mainly provides a text-to-image generation method based on a resolution-progressive generative adversarial network. A semantic feature separation-fusion module in the low-resolution generation layer improves the structural stability of image generation, a resolution-progressive residual structure in the high-resolution generation layers improves image quality, and the effectiveness of the proposed network is verified on the public datasets CUB and Oxford-102.
Drawings
Fig. 1 is a diagram of the network architecture of the present invention.
FIG. 2 is a diagram of the self-attention separation architecture of the present invention.
Fig. 3 is a high resolution residual network architecture of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings:
FIG. 1 is the network architecture diagram of the text-to-image generation method based on a resolution-progressive generative adversarial network.
Text encoding end: the text encoding end of the generator consists of a pre-trained text encoder, Text-Encoder. The input descriptive sentence is encoded by the Text-Encoder into a text semantic feature vector c, which is concatenated with noise z drawn from a normal distribution to form a new feature vector used as input to the image decoding end of the generator. The Text-Encoder is also responsible for computing word-level attention maps from the text description, used as one of the inputs of the last two stages (64×64 to 128×128, 128×128 to 256×256) of the image decoding end.
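At shape level, this conditioning input is a simple concatenation of the text vector and the noise (all dimensions below are illustrative placeholders, not the patent's):

```python
import numpy as np

def build_condition(c, z_dim=100, rng=None):
    """Concatenate the text semantic vector c with noise z ~ N(0, 1)
    to form the generator input vector s, as in step one."""
    rng = rng if rng is not None else np.random.default_rng()
    z = rng.standard_normal(z_dim)
    return np.concatenate([c, z])

c = np.ones(256)  # stand-in for a Text-Encoder sentence embedding
s = build_condition(c, z_dim=100, rng=np.random.default_rng(2))
```

Re-sampling z while holding c fixed is what lets one description yield many different generated pictures.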
Image decoding end: the condition enhancement module produces a condition vector from the encoded semantic feature vector. At the low-resolution layer, the self-attention separation module derives three semantic feature vectors with different attention weights from this vector; three different generators produce three different semantic feature maps, and feature fusion yields the generated low-resolution map. At the high-resolution layers, a residual structure combined with an attention mechanism fine-tunes the high-resolution map, realizing the progression from low to high resolution and finally producing a high-quality picture.
Discriminators: each generation stage has a corresponding discriminator D0, D1, D2. In the final generation stage, the generated 256×256 image is also used to compute the DAMSM loss.
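DAMSM is the deep attentional multimodal similarity model introduced with AttnGAN; the patent does not spell out its internals here. A much-simplified, sentence-level sketch of such a matching loss (the word-level attention term is omitted, the scaling factor and all shapes are placeholders):

```python
import numpy as np

def damsm_sentence_loss(img_feats, txt_feats, gamma=10.0):
    """Simplified sentence-level DAMSM-style matching loss: cosine
    similarities between all image/text pairs in a batch are turned into
    a softmax, and matched pairs (the diagonal) are pushed up via
    cross-entropy."""
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    sim = gamma * img @ txt.T                   # (B, B) similarity matrix
    sim = sim - sim.max(axis=1, keepdims=True)  # stabilized softmax
    p = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
    return -np.mean(np.log(np.diag(p)))         # matched pairs on diagonal

rng = np.random.default_rng(5)
img, txt = rng.standard_normal((4, 16)), rng.standard_normal((4, 16))
loss = damsm_sentence_loss(img, txt)
```

Minimizing this loss drives the generated 256×256 image's features toward the features of its own sentence rather than the other sentences in the batch.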
FIG. 2 shows the self-attention separation architecture of the invention. The semantic separation module adopts a self-attention mechanism: the feature vector output by the encoding end passes through the self-attention module to compute the corresponding attention weights, which are then multiplied with the original semantic feature vector to obtain the separated foreground feature vector, background feature vector, and Mask feature vector.
Fig. 3 shows the high-resolution residual network architecture of the invention. In the residual network, an attention map is first obtained under word-vector guidance: the attention weights between the previous feature map and the word vectors are computed and multiplied with the feature map to give the attention map, which is concatenated with the previous feature map as the generator input. Meanwhile, the previous feature map is up-sampled by a factor of two, the generator output is added to the up-sampled result, and an activation layer yields the picture at this stage's scale.
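A minimal sketch of the residual path just described (nearest-neighbor up-sampling and a toy generator branch stand in for the real modules, neither of which is specified at this level of detail):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x up-sampling of an (H, W) feature map."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def residual_stage(prev_feat, generator_branch):
    """One high-resolution stage: the generator branch's output is added
    to the 2x up-sampled previous feature map, forming the residual
    connection of Fig. 3 (attention concatenation omitted)."""
    skip = upsample2x(prev_feat)
    return skip + generator_branch(skip)

prev = np.random.default_rng(3).standard_normal((64, 64))
out = residual_stage(prev, lambda f: 0.1 * f)  # toy generator branch
```

Because the up-sampled map is passed through unchanged, the generator branch only has to learn a correction, which is what makes the progression from 64×64 to 128×128 to 256×256 stable.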

Claims (2)

1. A text-to-image generation method based on a resolution-progressive generative adversarial network, characterized by comprising the following steps:
step one: encoding the input descriptive sentence into a text semantic feature vector c through a Text-Encoder and obtaining a new feature vector s by concatenating noise z drawn from a normal distribution;
step two: using a semantic separation module, computing the corresponding attention weights of the feature vector output by the encoding end through a self-attention module, and multiplying the attention weights with the original semantic feature vector to obtain the separated foreground feature vector s_fore, background feature vector s_back, and Mask feature vector s_mask;
step three: generating, by three different first-stage generators G_fore, G_back, G_mask, feature maps R_fore, R_back, R_mask of size 64×64 respectively; computing a generated binary mask image I_mask from R_mask; the first-stage generator outputting a feature map R_0 and a first-stage generated picture I_0;
step four: passing the first-stage feature map through the second- and third-stage generators G_1, G_2, combined with the resolution-progressive residual structure, to finally obtain the 128×128 and 256×256 generated pictures I_1, I_2 respectively;
step five: providing a corresponding discriminator D_0, D_1, D_2 for each generation stage; meanwhile, the Mask picture generated in the first stage also having a corresponding discriminator D_mask constraining the generated result;
step six: computing the DAMSM loss using the 256×256 image generated by the last generator.
2. The text-to-image generation method based on a resolution-progressive generative adversarial network of claim 1, characterized in that:
the semantic attention separation module in step two computes the i-th separated semantic feature vector s_i as follows:

s_i = α_i · s

α_i = exp(W_i s^T s) / Σ_j exp(W_j s^T s)

where W_i is a linear-transformation weight;
the computation in step three of the generated binary mask image I_mask from R_mask, the first-stage output feature map R_0, and the first-stage generated picture I_0 comprises:
(1) passing R_mask through a convolution layer and an activation layer to obtain the single-channel binary mask image I_mask;
(2) computing the first-stage feature map R_0 by the formula:

R_0 = I_mask ⊙ R_fore + (1 - I_mask) ⊙ R_back

where ⊙ denotes element-wise multiplication;
(3) passing R_0 through a convolution layer and an activation layer to finally obtain the first-stage generated picture I_0.
CN202010836037.3A 2020-08-19 2020-08-19 Text image generation method for progressively generating confrontation network based on resolution Active CN114078172B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010836037.3A CN114078172B (en) 2020-08-19 2020-08-19 Text image generation method for progressively generating confrontation network based on resolution


Publications (2)

Publication Number Publication Date
CN114078172A true CN114078172A (en) 2022-02-22
CN114078172B CN114078172B (en) 2023-04-07

Family

ID=80282441

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010836037.3A Active CN114078172B (en) 2020-08-19 2020-08-19 Text image generation method for progressively generating confrontation network based on resolution

Country Status (1)

Country Link
CN (1) CN114078172B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115713680A (en) * 2022-11-18 2023-02-24 山东省人工智能研究院 Semantic guidance-based face image identity synthesis method
CN116246331A (en) * 2022-12-05 2023-06-09 苏州大学 Automatic keratoconus grading method, device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180204111A1 (en) * 2013-02-28 2018-07-19 Z Advanced Computing, Inc. System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform
CN109495741A (en) * 2018-11-29 2019-03-19 四川大学 Method for compressing image based on adaptive down-sampling and deep learning
CN110443863A (en) * 2019-07-23 2019-11-12 中国科学院深圳先进技术研究院 Method, electronic equipment and the storage medium of text generation image
CN110706302A (en) * 2019-10-11 2020-01-17 中山市易嘀科技有限公司 System and method for text synthesis image
CN111260740A (en) * 2020-01-16 2020-06-09 华南理工大学 Text-to-image generation method based on generation countermeasure network
CN111340907A (en) * 2020-03-03 2020-06-26 曲阜师范大学 Text-to-image generation method of self-adaptive attribute and instance mask embedded graph


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
RINTARO YANAGI et al.: "Scene Retrieval from Multiple Resolution Generated Images Based on Text-to-Image GAN" *
Xu Heyao: "Research on image translation based on cyclic generative adversarial networks and text information" *
Xu Yining et al.: "Text-to-image generation method based on multi-level resolution-progressive generative adversarial networks" *


Also Published As

Publication number Publication date
CN114078172B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN111179167B (en) Image super-resolution method based on multi-stage attention enhancement network
CN113139907B (en) Generation method, system, device and storage medium for visual resolution enhancement
CN113096017B (en) Image super-resolution reconstruction method based on depth coordinate attention network model
CN114078172B (en) Text image generation method for progressively generating confrontation network based on resolution
CN112016604B (en) Zero-resource machine translation method applying visual information
CN113361250A (en) Bidirectional text image generation method and system based on semantic consistency
CN111402365B (en) Method for generating picture from characters based on bidirectional architecture confrontation generation network
CN113140020B (en) Method for generating image based on text of countermeasure network generated by accompanying supervision
CN113076957A (en) RGB-D image saliency target detection method based on cross-modal feature fusion
CN115512368B (en) Cross-modal semantic generation image model and method
CN111768354A (en) Face image restoration system based on multi-scale face part feature dictionary
CN112381716A (en) Image enhancement method based on generation type countermeasure network
CN114022582A (en) Text image generation method
CN115330620A (en) Image defogging method based on cyclic generation countermeasure network
CN114140322A (en) Attention-guided interpolation method and low-delay semantic segmentation method
Zhu et al. Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99%
CN116823610A (en) Deep learning-based underwater image super-resolution generation method and system
CN116451649A (en) Text image generation method based on affine transformation
CN115862039A (en) Text-to-image algorithm based on multi-scale features
CN114881858A (en) Lightweight binocular image super-resolution method based on multi-attention machine system fusion
CN115330601A (en) Multi-scale cultural relic point cloud super-resolution method and system
CN109657589B (en) Human interaction action-based experiencer action generation method
CN115705493A (en) Image defogging modeling method based on multi-feature attention neural network
CN115482302A (en) Method for generating image from text based on cross attention coding
Yang et al. Depth map super-resolution via multilevel recursive guidance and progressive supervision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant