CN116452687A - Image generation method for fine-grained text based on contrast learning and dynamic convolution - Google Patents


Info

Publication number
CN116452687A
Authority
CN
China
Prior art keywords
image
text
layer
sub
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310249612.3A
Other languages
Chinese (zh)
Inventor
那巍
杨冰
徐楚阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202310249612.3A priority Critical patent/CN116452687A/en
Publication of CN116452687A publication Critical patent/CN116452687A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/02 Affine transformations
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a fine-grained text-to-image generation method based on contrastive learning and dynamic convolution. A text-to-image generation model is built that, in addition to sentence-vector information, introduces word-level context information during generation to help the generator produce fine-grained images. Because word information is introduced, the generator can synthesize each sub-region of the image according to the relevant words, so the generated image has richer details; a contrastive loss is added so that the given text and the generated image are more consistent in latent semantics; and the generator adopts dynamic convolution, which strengthens the expressive ability of the generator and makes the generative adversarial network converge faster while adding only a small computational cost.

Description

Image generation method for fine-grained text based on contrast learning and dynamic convolution
Technical Field
The invention belongs to the field of image generation, and particularly relates to a fine-grained text-to-image generation method based on contrastive learning and dynamic convolution.
Background
When hearing or seeing a sentence, people involuntarily picture the scene the sentence describes; humans convert between linguistic and visual information with ease. Understanding the complex relation between language and pictures and realizing the conversion from textual information to image information is an important step toward genuine artificial intelligence.
Images carry information that text cannot and are more vivid than text. Images convey information simply and intuitively: children's books contain many pictures because children understand and accept pictures more easily. An image conveys more information in a short time than text, and people from different countries can understand the same image. Humans also remember images better than text; many memory techniques convert textual information into a pictured scene so that more text can be memorized in a short time. Despite these advantages, producing and obtaining image information is harder than producing text. If the mapping between text and images could be solved computationally, a large number of high-resolution images conforming to a given text semantics could be obtained easily. Text-to-image generation produces an image matching the text semantics from the text information and a noise vector; by changing the input noise vector, many images matching the same text can be obtained. This greatly reduces the difficulty of acquiring images. Text-to-image generation has great potential in visual question answering, entertainment and interaction, computer-aided design, and other applications.
Text-to-image generation first encodes the text according to the text-image pairs in the training set, and then generates an image from that encoding. A text encoding extracted from a single sentence alone may miss key detail descriptions and may not provide enough semantic information to help a generative adversarial network (GAN) produce an image rich in detail. AttnGAN introduced word information to help the generator synthesize different sub-regions of the image; by providing more text information it expressed the text semantics better, helped the generator produce high-quality images, and achieved the state of the art at the time. More recently, DF-GAN effectively fuses sentence information into visual features by cascading multiple deep text-image fusion blocks. Although DF-GAN uses only sentence information, the quality of its generated images is far higher than that of AttnGAN, which shows that fusing image features and word information by direct concatenation, as AttnGAN does, cannot achieve a deep fusion of text information and visual features. Using text information sensibly during generation, so that it is fused fully and effectively into the image features, is therefore a key factor in making the generated image both visually realistic and faithful to the text semantics. Providing as much text information as possible to represent the text semantics, and integrating that information effectively into the image features, are thus two key factors in the text-to-image generation task. The invention therefore adds word information to the deep text-image fusion blocks of DF-GAN in a principled way, fully mining the text semantics so that word information can be fused effectively into the image features.
Secondly, to further push the generated image and the text toward consistency in latent semantics, the invention applies a contrastive loss between real images and generated images, reducing the distance between image pairs with the same semantics and increasing the distance between image pairs with different semantics. Making the generated image more "like" the real image helps the generated image conform to the text semantics.
Finally, DF-GAN uses ordinary convolution while generating images, so text inputs with different semantics share the same convolution kernel parameters. To let the generator produce images dynamically according to the given text, the invention introduces dynamic convolution. Because images corresponding to texts with different semantics differ greatly, dynamic convolution can dynamically adjust the weights of several convolution kernels according to the image features into which the semantics have been fused, so the generator indirectly generates images dynamically from the input text information, further improving the quality of the generated images.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a fine-grained text-to-image generation method based on contrastive learning and dynamic convolution. Word information is introduced so that the generator can synthesize each sub-region of the image according to the word information and the generated image has richer details; a contrastive loss is added so that the given text and the generated image are more consistent in latent semantics; and the generator adopts dynamic convolution, which strengthens the expressive ability of the generator and makes the generative adversarial network converge faster while adding only a small computational cost.
In a first aspect, there is provided a fine-grained text-to-image generation method based on contrastive learning and dynamic convolution, comprising the following steps:
step (1), acquiring an English text and a real image having the same semantics as the English text;
step (2), constructing a text encoder and an image encoder, and training the text encoder and the image encoder;
step (3), constructing a text-to-image generation model and training it;
the text generation image model adopts DF-GAN as a reference model and comprises a generator and a discriminator;
the generator comprises a full connection layer, a Reshape function, 6 up-sampling text fusion blocks, a dynamic convolution layer and a Tanh function which are sequentially connected in series;
the first three upsampling text fusion blocks in the 6 upsampling text fusion blocks connected in series have the same structure, and comprise a first upsampling layer, a first fusion block, a first dynamic convolution layer, a second improved fusion block and a second dynamic convolution layer which are connected in series in sequence; the last three upsampling text fusion blocks have the same structure and comprise a second upsampling layer, a third fusion block, a third dynamic convolution layer, a fourth fusion block and a fourth dynamic convolution layer which are sequentially connected in series;
the first fusion block, the third fusion block and the fourth fusion block have the same structure and respectively comprise a first Affine layer, a first Relu function, a second Affine layer and a second Relu function which are sequentially connected in series;
the second improved fusion block comprises a third improved Affine layer, a third Relu function, a fourth improved Affine layer and a fifth Relu function which are sequentially connected in series;
The first Affine layer and the second Affine layer have the same structure and are used for fusing sentence information into the image features; specifically:
(1) The sentence encoding s and the noise vector z are concatenated as the text condition of the current sub-region; a channel scaling parameter γ_j^1 and a shifting parameter θ_j^1 are obtained from this text condition through a multilayer perceptron MLP_j^1; each sub-region j has its own multilayer perceptron MLP_j^1, so each sub-region has its own scaling parameter and shifting parameter:
γ_j^1 = MLP_j^1(concat(z, s))    (1)
θ_j^1 = MLP_j^1(concat(z, s))    (2)
where concat() denotes the concatenation function and j denotes the j-th sub-region of the image feature.
(2) The image feature h^{last1} ∈ R^{N_1 × D_1} output by the previous layer is affine-transformed so that each image sub-region is fused with the sentence information:
AFF(h_j^{last1} | concat(z, s)) = γ_j^1 · h_j^{last1} + θ_j^1    (3)
where AFF denotes the affine transformation, N_1 denotes the number of sub-regions of the current image feature, and D_1 denotes the sub-region dimension of the current image feature h^{last1}.
The third improved Affine layer and the fourth improved Affine layer have the same structure and are used for fusing both sentence information and word information into the image features; specifically:
(1) The word encoding e ∈ R^{D×L} is converted by a 1 × 1 convolution into a word encoding e' with the same sub-region dimension as the current image feature:
e' = U_1 e    (4)
where U_1 denotes a dimension-transformation matrix, so that e' ∈ R^{D_2 × L}, with D_2 the sub-region dimension of the current image feature h^{last2}.
(2) e' is multiplied with the image feature h^{last2} ∈ R^{N_2 × D_2} output by the previous layer and normalized to obtain the weights of the current image feature over the words:
α_{j,i} = exp((h_j^{last2})^T e'_i) / Σ_{k=1}^{L} exp((h_j^{last2})^T e'_k)    (5)
where α_{j,i} denotes the weight of the j-th sub-region of the current image feature on the i-th word, the superscript T denotes transposition, and N_2 denotes the number of sub-regions of the current image feature h^{last2}.
(3) α_{j,i} is multiplied with the word encodings e' to obtain the dynamic representation c_j of all the words associated with the current image sub-region, i.e. the word context vector:
c_j = Σ_{i=1}^{L} α_{j,i} e'_i    (6)
(4) The sentence encoding s, the word context vector c_j and the noise vector z are concatenated as the text condition of the current sub-region; a channel scaling parameter γ_j^2 and a shifting parameter θ_j^2 are obtained from this text condition through a multilayer perceptron MLP_j^2, so each sub-region has its own scaling parameter and shifting parameter:
γ_j^2 = MLP_j^2(concat(z, s, c_j))    (7)
θ_j^2 = MLP_j^2(concat(z, s, c_j))    (8)
(5) The image feature h^{last2} output by the previous layer is affine-transformed so that each image sub-region is fused with the sentence information and with the word information most relevant to the current sub-region; this alleviates the problem that word information may be ignored when the original framework uses only sentence information, and enables the generator to generate images at a fine granularity:
AFF(h_j^{last2} | concat(z, s, c_j)) = γ_j^2 · h_j^{last2} + θ_j^2    (9)
The discriminator comprises 6 identical downsampling layers, a first convolution layer, a LeakyRelu layer and a second convolution layer which are sequentially connected in series; each downsampling layer comprises 2 convolution layers and two LeakyRelu layers;
and step (4), generating the image from the text by using the trained text-to-image generation model.
In a second aspect, a computer readable storage medium is provided, on which a computer program is stored which, when executed in a computer, causes the computer to perform the method.
In a third aspect, a computing device is provided, including a memory having executable code stored therein and a processor, which when executing the executable code, implements the method.
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention introduces word information into DF-GAN through an attention mechanism, so that the generator can synthesize each sub-region of the image according to the word information, and the generated image has richer details.
(2) The invention adds a contrastive loss, which pulls images with the same semantics closer together and pushes images with different semantics further apart, thereby better ensuring semantic consistency between the text and the generated image.
(3) The invention adopts dynamic convolution in the generator, which strengthens the expressive ability of the generator and makes the generative adversarial network converge faster while adding only a small computational cost.
Drawings
Fig. 1 is a structural framework diagram of a text-generated image model of the present invention.
Fig. 2 is a structural architecture diagram of a third modified Affine layer and a fourth modified Affine layer.
Fig. 3 is a structural architecture diagram of a first dynamic convolution layer, a second dynamic convolution layer, a third dynamic convolution layer, and a fourth dynamic convolution layer.
FIG. 4 is a comparison graph of CUB dataset effects.
Fig. 5 is a graph comparing the effect of the COCO dataset.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The method generates an image from a text of the test set, as shown in fig. 1, and specifically comprises the following steps:
step (1), acquiring an English text and a real image having the same semantics as the English text;
the invention adopts Caltech-UCSD Birds 200-2011 (CUB) and 2014 COCO data sets to train and test the proposed model. CUB is a dataset of bird-only pictures. There were 200 birds in the dataset, with 150 8855 images in the training set and 50 2933 images in the test set. Each image has 10 corresponding English sentences with the same semantics and different expression modes. The COCO dataset has richer image categories and more complex scenes, and has 171 image categories in total, wherein the training dataset contains 80k images and the testing dataset contains 40k images. Each image corresponds to 5 English sentences with the same semantics and different expression modes.
Step (2), constructing a text encoder and an image encoder, and training the text encoder and the image encoder;
the text encoder adopts a two-way long-short-term memory network (LSTM), and the input of the text encoder is English text; the two-way long-short-term memory network comprises L hidden layers, L is a positive integer, the connection of two directions of each hidden layer is used as a word code, L word codes are owned altogether, and the connection of two directions of the last hidden layer is used as a current sentence code; all words in a sentence are encoded as e R D×L Sentence code s e R D Wherein L represents the number of words in the sentence and D represents the dimension;
The image encoder adopts an Inception-V3 model pre-trained on the ImageNet dataset; its input is the real image scaled to a fixed size. The last average pooling layer of the model yields the global encoding of the real image, and the "mixed_6e" layer of the model yields the local encoding f ∈ R^{768×289} of the real image, where 768 is the dimension of a sub-region and 289 is the number of sub-regions of the image.
The image encoder and text encoder are trained by minimizing the DAMSM (Deep Attentional Multimodal Similarity Model) loss.
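As a hedged sketch of how the global and local image encodings could be extracted with torchvision (the weights enum, attribute names and hook-based extraction are assumptions of this example, not prescribed by the text):

```python
import torch
from torchvision import models

# Inception-v3 pre-trained on ImageNet (weights enum assumed available in this torchvision version).
net = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1, aux_logits=True)
net.eval()

feats = {}
net.Mixed_6e.register_forward_hook(lambda m, i, o: feats.__setitem__("local", o))
net.avgpool.register_forward_hook(lambda m, i, o: feats.__setitem__("global", o))

with torch.no_grad():
    img = torch.randn(1, 3, 299, 299)        # real image scaled to the fixed input size
    net(img)

local = feats["local"].flatten(2)             # (1, 768, 289): 289 sub-regions of dimension 768
global_feat = feats["global"].flatten(1)      # (1, 2048): global image encoding
print(local.shape, global_feat.shape)
```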
Step (3), constructing a text to generate an image model;
the text generation image model adopts DF-GAN as a reference model and comprises a generator and a discriminator;
the generator comprises a full connection layer, a Reshape function, 6 up-sampling text fusion blocks, a dynamic convolution layer and a Tanh function which are sequentially connected in series; specifically, a 100-dimensional noise vector z is input into a full connection layer and remodeled into a size of 4 multiplied by 256 through Reshape function processing, the size of the full connection layer is input into a combined 6 up-sampling text fusion blocks, and the output image features are subjected to dynamic convolution layer and Tanh function to generate an image with the size of 256 multiplied by 256.
The first three upsampling text fusion blocks in the 6 upsampling text fusion blocks connected in series have the same structure, and comprise a first upsampling layer, a first fusion block, a first dynamic convolution layer, a second improved fusion block and a second dynamic convolution layer which are connected in series in sequence; the last three upsampling text fusion blocks have the same structure and comprise a second upsampling layer, a third fusion block, a third dynamic convolution layer, a fourth fusion block and a fourth dynamic convolution layer which are sequentially connected in series;
the first fusion block, the third fusion block and the fourth fusion block have the same structure and respectively comprise a first Affine layer, a first Relu function, a second Affine layer and a second Relu function which are sequentially connected in series;
the second improved fusion block comprises a third improved Affine layer, a third Relu function, a fourth improved Affine layer and a fifth Relu function which are sequentially connected in series;
The first Affine layer and the second Affine layer have the same structure and are used for fusing sentence information into the image features; specifically:
(1) The sentence encoding s and the noise vector z are concatenated as the text condition of the current sub-region; a channel scaling parameter γ_j^1 and a shifting parameter θ_j^1 are obtained from this text condition through a multilayer perceptron MLP_j^1; each sub-region j has its own multilayer perceptron MLP_j^1, so each sub-region has its own scaling parameter and shifting parameter:
γ_j^1 = MLP_j^1(concat(z, s))    (1)
θ_j^1 = MLP_j^1(concat(z, s))    (2)
where concat() denotes the concatenation function and j denotes the j-th sub-region of the image feature.
(2) The image feature h^{last1} ∈ R^{N_1 × D_1} output by the previous layer is affine-transformed so that each image sub-region is fused with the sentence information:
AFF(h_j^{last1} | concat(z, s)) = γ_j^1 · h_j^{last1} + θ_j^1    (3)
where AFF denotes the affine transformation, N_1 denotes the number of sub-regions of the current image feature, and D_1 denotes the sub-region dimension of the current image feature h^{last1}.
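A minimal PyTorch sketch of such a sentence-conditioned Affine layer is given below; for brevity it predicts one channel-wise scale and shift shared by all sub-regions, rather than one MLP per sub-region as written in Eqs. (1)-(3), and the hidden width is an assumption.

```python
import torch
import torch.nn as nn

class SentenceAffine(nn.Module):
    """Channel-wise affine conditioning on concat(z, s); a simplification of Eqs. (1)-(3)."""
    def __init__(self, cond_dim, channels, hidden=256):
        super().__init__()
        self.gamma_mlp = nn.Sequential(nn.Linear(cond_dim, hidden), nn.ReLU(), nn.Linear(hidden, channels))
        self.theta_mlp = nn.Sequential(nn.Linear(cond_dim, hidden), nn.ReLU(), nn.Linear(hidden, channels))

    def forward(self, h, z, s):
        cond = torch.cat([z, s], dim=1)                           # text condition concat(z, s)
        gamma = self.gamma_mlp(cond).unsqueeze(-1).unsqueeze(-1)  # (batch, C, 1, 1)
        theta = self.theta_mlp(cond).unsqueeze(-1).unsqueeze(-1)
        return gamma * h + theta                                  # AFF(h | concat(z, s))

aff = SentenceAffine(cond_dim=100 + 256, channels=256)
out = aff(torch.randn(2, 256, 8, 8), torch.randn(2, 100), torch.randn(2, 256))
print(out.shape)   # torch.Size([2, 256, 8, 8])
```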
The third improved Affine layer and the fourth improved Affine layer have the same structure and are used for fusing both sentence information and word information into the image features; see fig. 2 for details:
(1) The word encoding e ∈ R^{D×L} is converted by a 1 × 1 convolution into a word encoding e' with the same sub-region dimension as the current image feature:
e' = U_1 e    (4)
where U_1 denotes a dimension-transformation matrix, so that e' ∈ R^{D_2 × L}, with D_2 the sub-region dimension of the current image feature h^{last2}.
(2) e' is multiplied with the image feature h^{last2} ∈ R^{N_2 × D_2} output by the previous layer and normalized to obtain the weights of the current image feature over the words:
α_{j,i} = exp((h_j^{last2})^T e'_i) / Σ_{k=1}^{L} exp((h_j^{last2})^T e'_k)    (5)
where α_{j,i} denotes the weight of the j-th sub-region of the current image feature on the i-th word, the superscript T denotes transposition, and N_2 denotes the number of sub-regions of the current image feature h^{last2}.
(3) α_{j,i} is multiplied with the word encodings e' to obtain the dynamic representation c_j of all the words associated with the current image sub-region, i.e. the word context vector:
c_j = Σ_{i=1}^{L} α_{j,i} e'_i    (6)
(4) The sentence encoding s, the word context vector c_j and the noise vector z are concatenated as the text condition of the current sub-region; a channel scaling parameter γ_j^2 and a shifting parameter θ_j^2 are obtained from this text condition through a multilayer perceptron MLP_j^2, so each sub-region has its own scaling parameter and shifting parameter:
γ_j^2 = MLP_j^2(concat(z, s, c_j))    (7)
θ_j^2 = MLP_j^2(concat(z, s, c_j))    (8)
(5) The image feature h^{last2} output by the previous layer is affine-transformed so that each image sub-region is fused with the sentence information and with the word information most relevant to the current sub-region; this alleviates the problem that word information may be ignored when the original framework uses only sentence information, and enables the generator to generate images at a fine granularity:
AFF(h_j^{last2} | concat(z, s, c_j)) = γ_j^2 · h_j^{last2} + θ_j^2    (9)
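A hedged PyTorch sketch of the improved Affine layer (Eqs. (4)-(9)) follows; for brevity it predicts a single scale and shift per sub-region (a channel-wise variant follows the same pattern), and all dimensions are example values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordAffine(nn.Module):
    """Word attention per sub-region, then scale/shift predicted from concat(z, s, c_j)."""
    def __init__(self, word_dim, channels, z_dim, s_dim, hidden=256):
        super().__init__()
        self.proj = nn.Conv1d(word_dim, channels, kernel_size=1)     # e' = U_1 e  (Eq. 4)
        cond_dim = z_dim + s_dim + channels
        self.gamma_mlp = nn.Sequential(nn.Linear(cond_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.theta_mlp = nn.Sequential(nn.Linear(cond_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, h, z, s, e):
        # h: (B, C, H, W) image feature, e: (B, D, L) word encodings
        B, C, H, W = h.shape
        e_proj = self.proj(e)                               # (B, C, L)
        h_flat = h.flatten(2).transpose(1, 2)               # (B, N, C), N = H*W sub-regions
        attn = torch.bmm(h_flat, e_proj)                    # (B, N, L) scores (h_j)^T e'_i
        alpha = F.softmax(attn, dim=2)                      # Eq. (5)
        c = torch.bmm(alpha, e_proj.transpose(1, 2))        # (B, N, C) word context vectors c_j (Eq. 6)
        zs = torch.cat([z, s], dim=1).unsqueeze(1).expand(B, H * W, -1)
        cond = torch.cat([zs, c], dim=2)                    # concat(z, s, c_j) per sub-region
        gamma = self.gamma_mlp(cond)                        # (B, N, 1)  Eq. (7)
        theta = self.theta_mlp(cond)                        # (B, N, 1)  Eq. (8)
        out = gamma * h_flat + theta                        # Eq. (9)
        return out.transpose(1, 2).view(B, C, H, W)

wa = WordAffine(word_dim=256, channels=256, z_dim=100, s_dim=256)
out = wa(torch.randn(2, 256, 8, 8), torch.randn(2, 100), torch.randn(2, 256), torch.randn(2, 256, 18))
print(out.shape)   # torch.Size([2, 256, 8, 8])
```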
The first dynamic convolution layer, the second dynamic convolution layer, the third dynamic convolution layer and the fourth dynamic convolution layer have the same structure, see fig. 3; specifically:
The image feature h^{last} output by the preceding fusion block or improved fusion block first passes through global average pooling, which reduces the spatial size of the image feature to 1 × 1; a first fully connected layer then reduces the channel size to one quarter of the original, a second fully connected layer maps the channel size to K, and finally normalization yields the attention π_k(h^{last}) corresponding to each of the K convolution kernels.
The K convolution kernels and biases are aggregated, and the aggregated convolution kernel W̃ and bias b̃ are used to perform the convolution on the input image feature:
W̃(h^{last}) = Σ_{k=1}^{K} π_k(h^{last}) W̃_k    (10)
b̃(h^{last}) = Σ_{k=1}^{K} π_k(h^{last}) b̃_k    (11)
y = W̃(h^{last})^T h^{last} + b̃(h^{last})    (12)
s.t. 0 ≤ π_k(h^{last}) ≤ 1,  Σ_{k=1}^{K} π_k(h^{last}) = 1    (13)
where W̃_k and b̃_k denote the k-th convolution kernel and bias respectively, and W̃^T denotes the transpose of W̃.
In an ordinary convolution operation, all inputs share one set of convolution kernel parameters, whereas dynamic convolution can dynamically aggregate several convolution kernels according to the input. The datasets used for text-to-image generation contain many different kinds of images; the COCO dataset, for example, contains people, scenery, cars and many other categories, and images of different categories differ greatly. The convolution operation should therefore adapt to the differences between image features after the text information has been fused in. Different text information fused into the image features yields different convolution kernel weights, so images can be generated dynamically according to the text information during generation, improving the dynamic generation ability of the network. In the generator, the invention replaces ordinary convolution with dynamic convolution; after doing so, the generator produces images of better quality and converges faster. Because the cost of computing the attention of each convolution kernel is low, and each kernel is small relative to the input, the cost of aggregating the kernels is also small, so replacing ordinary convolution with dynamic convolution adds little computation.
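A hedged sketch of a dynamic convolution layer in PyTorch is given below; the kernel count K, reduction ratio and initialization are assumptions, and the grouped-convolution trick is one possible way to apply a different aggregated kernel to each sample in the batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """K kernels aggregated with input-dependent attention, as in Eqs. (10)-(13)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, K=4, reduction=4):
        super().__init__()
        self.K, self.in_ch, self.out_ch, self.ks = K, in_ch, out_ch, kernel_size
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, in_ch // reduction), nn.ReLU(),
            nn.Linear(in_ch // reduction, K))
        self.weight = nn.Parameter(torch.randn(K, out_ch, in_ch, kernel_size, kernel_size) * 0.02)
        self.bias = nn.Parameter(torch.zeros(K, out_ch))

    def forward(self, x):
        B, C, H, W = x.shape
        pi = F.softmax(self.attn(x), dim=1)                          # (B, K) attention pi_k(x)
        w = torch.einsum("bk,koihw->boihw", pi, self.weight)         # aggregated kernels per sample
        b = torch.einsum("bk,ko->bo", pi, self.bias)                 # aggregated biases per sample
        # grouped-convolution trick: treat the batch as groups so each sample uses its own kernel
        x = x.view(1, B * C, H, W)
        w = w.reshape(B * self.out_ch, C, self.ks, self.ks)
        out = F.conv2d(x, w, bias=b.reshape(-1), padding=self.ks // 2, groups=B)
        return out.view(B, self.out_ch, H, W)

dc = DynamicConv2d(256, 256)
y = dc(torch.randn(2, 256, 16, 16))
print(y.shape)   # torch.Size([2, 256, 16, 16])
```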
The objective loss function of the generator network is defined as:
L = L_G + λ_1 L_DAMSM + λ_2 L_real    (14)
where λ_1 is the weight of the DAMSM loss and λ_2 is the weight of the contrastive loss between images; when only L_real is used, λ_3 is 0. Because the invention introduces word-level information, the DAMSM loss is used: DAMSM computes a loss between the words and the regions of the generated image, measuring at the word level how well the generated image matches the text.
A contrastive loss L_real is added to the generator loss function to push the text and the generated image toward semantic consistency; specifically:
NT-Xent is used as the contrastive learning loss, making a generated image and the real image with the same semantics more similar and generated and real images with different semantics further apart, so that the generated image is more "like" the real image and conforms better to the text semantics. With the cosine similarity
sim(u, v) = u^T v / (‖u‖ ‖v‖)    (15)
the contrastive loss between the real images and the generated images is
L_real = −(1/M) Σ_{i=1}^{M} log [ exp(sim(f_img(x_i), f̂_img(x̂_i)) / η) / Σ_{j=1}^{M} exp(sim(f_img(x_i), f̂_img(x̂_j)) / η) ]    (16)
where M is the batch size, i and j are image indices, pairs with i ≠ j act as negatives, η is the temperature hyper-parameter, f_img(x_i) and f̂_img(x̂_i) denote the encoding of the real image and the encoding of the generated image output by the text-to-image model for the same text, and f̂_img(x̂_j) denotes the encoding of a generated image corresponding to a different text in the same batch.
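A minimal sketch of such an NT-Xent-style loss between real and generated image encodings (assuming cosine similarity and one positive pair per batch index) could look as follows:

```python
import torch
import torch.nn.functional as F

def contrastive_loss_real(real_feat, fake_feat, eta=0.1):
    """NT-Xent-style loss between real-image codes and generated-image codes of one batch.
    real_feat, fake_feat: (M, D) image encodings; pairs (i, i) share the same text."""
    real = F.normalize(real_feat, dim=1)
    fake = F.normalize(fake_feat, dim=1)
    logits = real @ fake.t() / eta                       # (M, M) cosine similarities / temperature
    targets = torch.arange(real.size(0), device=real.device)
    # for each real image, the generated image from the same text is the positive,
    # all other generated images in the batch are negatives
    return F.cross_entropy(logits, targets)

loss = contrastive_loss_real(torch.randn(32, 256), torch.randn(32, 256))
print(loss.item())
```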
The discriminator comprises 6 identical downsampling layers, a first convolution layer, a LeakyReLU layer and a second convolution layer connected in series in sequence; each downsampling layer comprises 2 convolution layers and 2 LeakyReLU layers. The real image of size 256 × 256 × 3 and the generated image are input to the discriminator; after the 6 downsampling layers the output image feature size is 4 × 4 × 256. The sentence vector is then spatially replicated and concatenated behind the image features. Finally, the discriminator outputs the adversarial loss L_D. The discriminator judges from the image features whether an image is real, and judges from the concatenated sentence vector whether the generated image conforms to the text semantics.
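For illustration, a structural sketch of such a discriminator is given below; the channel schedule and kernel sizes are assumptions of the example.

```python
import torch
import torch.nn as nn

class DiscriminatorSkeleton(nn.Module):
    """256x256x3 image -> 6 downsampling layers -> 4x4 feature -> concat replicated sentence -> 2 convs."""
    def __init__(self, ch=256, s_dim=256):
        super().__init__()
        chans = [3, 32, 64, 128, 256, 256, ch]
        blocks = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            blocks += [nn.Conv2d(cin, cout, 4, 2, 1), nn.LeakyReLU(0.2),
                       nn.Conv2d(cout, cout, 3, 1, 1), nn.LeakyReLU(0.2)]
        self.down = nn.Sequential(*blocks)                       # 256 -> 4 after 6 stride-2 stages
        self.joint = nn.Sequential(
            nn.Conv2d(ch + s_dim, ch, 3, 1, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, 1, 4, 1, 0))                           # one-way adversarial output

    def forward(self, img, s):
        h = self.down(img)                                       # (B, ch, 4, 4)
        s_map = s.unsqueeze(-1).unsqueeze(-1).expand(-1, -1, 4, 4)
        return self.joint(torch.cat([h, s_map], dim=1))          # (B, 1, 1, 1)

d = DiscriminatorSkeleton()
out = d(torch.randn(2, 3, 256, 256), torch.randn(2, 256))
print(out.shape)   # torch.Size([2, 1, 1, 1])
```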
the discriminator network uses the same loss function as the DF-GAN discriminator, and the objective function is defined as:
the invention uses the same hinge loss as DF-GAN to stabilize the process of countermeasure training, and simultaneously uses matching perception gradient penalty (MA-GP) and unidirectional output to promote a generator to generate images with higher text image semantic consistency.
And step (4), generating the image from the text by using the trained text-to-image generation model.
The model of the present invention is implemented in PyTorch. The text encoder and image encoder parameters provided in the literature are used. Both the CUB and the COCO models are trained on a single NVIDIA V100 32 GB GPU. The Adam optimizer with β_1 = 0.0 and β_2 = 0.9 is used for training. η in equation (16) is set to 0.1, and the batch size is 32. In equation (14) the coefficient weights are set to λ_1 = 0.1, λ_2 = 0.1 and λ_3 = 0.2. The final model on the CUB dataset is the one that achieves the smallest FID value between iterations 600 and 700. The final model on the COCO dataset is the first model whose FID value falls below 13, at iteration 219.
The invention uses the FID to quantitatively analyze the quality of the generated images and evaluate the performance of the model. The generated images and the real images can each be regarded as a distribution, and the FID measures the distance between the generated-image distribution and the real-image distribution; a lower FID therefore indicates that the generated images are closer to the real images, i.e. visually realistic and consistent with the text semantics.
FID = ‖μ_r − μ_g‖² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2})
where r and g denote the real images and the generated images, Tr denotes the trace, μ_r and μ_g denote the mean of the real image features and the mean of the generated image features respectively, and Σ_r and Σ_g denote the covariance of the real image features and the covariance of the generated image features.
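A small sketch of the FID computation from two sets of image features (using numpy and scipy; the feature extractor itself is omitted) could look as follows:

```python
import numpy as np
from scipy import linalg

def fid(real_feats, fake_feats):
    """FID between two sets of image features (e.g. Inception activations), each of shape (N, D)."""
    mu_r, mu_g = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(fake_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)      # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real                                     # drop tiny imaginary parts
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2.0 * covmean))

print(fid(np.random.randn(500, 64), np.random.randn(500, 64)))
```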
The invention tests the generated images with the same protocol as DF-GAN. When testing on the CUB dataset, 10 images are generated for each text of the test set, giving 29,330 images in total for computing the FID score. When testing on the COCO dataset, one image is generated for each text, giving 40k images in total for computing the FID score. The results of the quantitative comparison with currently popular models such as AttnGAN, DM-GAN, KD-GAN and DF-GAN are shown in Table 1.
TABLE 1 FID score comparison for various models on CUB and COCO datasets
Table 1 lists the FID scores of the currently popular models and of the model of the present invention on the CUB and COCO datasets. On the CUB dataset, the proposed model reduces the FID score from 12.10 to 10.36 compared with DF-GAN; on the COCO dataset, it reduces the FID score from 15.41 to 12.74. From the perspective of quantitative analysis, the images generated by the proposed model are closer to real images than those generated by other methods, the latent semantic consistency between the text and the generated images is better, and the image quality is greatly improved.
The real images, the images generated by the model of the present invention and the images generated by the DF-GAN model are also compared visually; the comparison results are shown in FIGS. 4 and 5.
FIG. 4 shows examples of image generation on the CUB dataset. In the first image, the image generated by the proposed model clearly shows the white bird and its gray wings, whereas the image generated by the DF-GAN model does not reflect the text semantics and is less similar to the real image. In the second image, the model of the invention clearly shows the red eyes and white abdomen. In the fifth image, the white eye-rings are clearly visible. The model therefore makes better use of word information to achieve fine-grained image generation, and the generated images match the textual description more closely. In the remaining groups, the images generated by both models express the semantics of the text, but the images generated by the proposed model are more natural and more similar to the real images.
FIG. 5 shows examples of image generation on the COCO dataset. Because the COCO dataset has rich image categories and complex scenes, the generated images can only be made as realistic as possible. In the first image, the model of the invention outlines the general contour of the skier. In the fifth image, the model of the invention restores well the tall tower with a clock among the buildings. Overall, compared with the images generated by DF-GAN, the contours of the images generated by the proposed method are clearer, the detail features of the scene mentioned in the textual description are captured better, the sharpness of the pictures is obviously improved, and the generated images better match the expectation of the human eye. Evaluating the generated images against the text again shows that the images generated by the proposed model match the text semantics more closely.

Claims (10)

1. A fine-grained text-to-image generation method based on contrastive learning and dynamic convolution, the method comprising the following steps:
step (1), acquiring an English text and a real image having the same semantics as the English text;
step (2), constructing a text encoder and an image encoder, and training the text encoder and the image encoder;
the text encoder adopts a bidirectional long short-term memory network whose input is the English text; the bidirectional long short-term memory network produces L hidden states, L being a positive integer; the concatenation of the two directions at each hidden state is used as the encoding of one word, giving L word encodings in total, and the concatenation of the two directions at the last hidden state is used as the sentence encoding; all the words of a sentence are encoded as e ∈ R^{D×L} and the sentence is encoded as s ∈ R^D, where L denotes the number of words in the sentence and D denotes the encoding dimension;
step (3), constructing a text-to-image generation model and training it;
the text generation image model adopts DF-GAN as a reference model and comprises a generator and a discriminator;
the generator comprises a full connection layer, a Reshape function, 6 up-sampling text fusion blocks, a dynamic convolution layer and a Tanh function which are sequentially connected in series;
the first three upsampling text fusion blocks in the 6 upsampling text fusion blocks connected in series have the same structure, and comprise a first upsampling layer, a first fusion block, a first dynamic convolution layer, a second improved fusion block and a second dynamic convolution layer which are connected in series in sequence; the last three upsampling text fusion blocks have the same structure and comprise a second upsampling layer, a third fusion block, a third dynamic convolution layer, a fourth fusion block and a fourth dynamic convolution layer which are sequentially connected in series;
the first fusion block, the third fusion block and the fourth fusion block have the same structure and respectively comprise a first Affine layer, a first Relu function, a second Affine layer and a second Relu function which are sequentially connected in series;
the second improved fusion block comprises a third improved Affine layer, a third Relu function, a fourth improved Affine layer and a fifth Relu function which are sequentially connected in series;
the first and second Affine layers have the same structure and are used for fusing sentence information to image features;
the third improved Affine layer and the fourth improved Affine layer have the same structure and are used for fusing both sentence information and word information into the image features; specifically:
(1) the word encoding e ∈ R^{D×L} is converted by a 1 × 1 convolution into a word encoding e' with the same sub-region dimension as the current image feature:
e' = U_1 e    (1)
where U_1 denotes a dimension-transformation matrix, so that e' ∈ R^{D_2 × L}, with D_2 the sub-region dimension of the current image feature h^{last2};
(2) e' is multiplied with the image feature h^{last2} ∈ R^{N_2 × D_2} output by the previous layer and normalized to obtain the weights of the current image feature over the words:
α_{j,i} = exp((h_j^{last2})^T e'_i) / Σ_{k=1}^{L} exp((h_j^{last2})^T e'_k)    (2)
where α_{j,i} denotes the weight of the j-th sub-region of the current image feature on the i-th word, the superscript T denotes transposition, and N_2 denotes the number of sub-regions of the current image feature h^{last2};
(3) α_{j,i} is multiplied with the word encodings e' to obtain the dynamic representation c_j of all the words associated with the current image sub-region, i.e. the word context vector:
c_j = Σ_{i=1}^{L} α_{j,i} e'_i    (3)
(4) the sentence encoding s, the word context vector c_j and the noise vector z are concatenated as the text condition of the current sub-region; a channel scaling parameter γ_j^2 and a shifting parameter θ_j^2 are obtained from this text condition through a multilayer perceptron MLP_j^2, so each sub-region has its own scaling parameter and shifting parameter:
γ_j^2 = MLP_j^2(concat(z, s, c_j))    (4)
θ_j^2 = MLP_j^2(concat(z, s, c_j))    (5)
(5) the image feature h^{last2} output by the previous layer is affine-transformed so that each image sub-region is fused with the sentence information and with the word information most relevant to the current sub-region; this alleviates the problem that word information may be ignored when the original framework uses only sentence information, and enables the generator to generate images at a fine granularity:
AFF(h_j^{last2} | concat(z, s, c_j)) = γ_j^2 · h_j^{last2} + θ_j^2    (6)
The discriminator comprises 6 identical downsampling layers, a first convolution layer, a LeakyRelu layer and a second convolution layer which are sequentially connected in series; each downsampling layer comprises 2 convolution layers and two LeakyRelu layers;
and (4) realizing the text generation image by using the trained text generation image model.
2. The method of claim 1, wherein the image encoder in step (2) adopts an Inception-V3 model pre-trained on the ImageNet dataset; its input is the real image scaled to a fixed size; the last average pooling layer of the model yields the global encoding of the real image, and the "mixed_6e" layer of the model yields the local encoding f ∈ R^{768×289} of the real image, where 768 is the dimension of a sub-region and 289 is the number of sub-regions of the image.
3. The method according to claim 1 or 2, characterized in that step (2) trains the image encoder and the text encoder by minimizing DAMSM loss.
4. The method of claim 1, wherein in step (3) the generator specifically inputs the 100-dimensional noise vector z into the fully connected layer and reshapes it by the Reshape function into a 4 × 4 × 256 feature map, which is passed through the 6 cascaded upsampling text fusion blocks; the output image features then pass through the dynamic convolution layer and the Tanh function to generate an image of size 256 × 256.
5. The method according to claim 1, wherein in step (3) the first Affine layer and the second Affine layer specifically are:
(1) the sentence encoding s and the noise vector z are concatenated as the text condition of the current sub-region; a channel scaling parameter γ_j^1 and a shifting parameter θ_j^1 are obtained from this text condition through a multilayer perceptron MLP_j^1; each sub-region j has its own multilayer perceptron MLP_j^1, so each sub-region has its own scaling parameter and shifting parameter:
γ_j^1 = MLP_j^1(concat(z, s))    (7)
θ_j^1 = MLP_j^1(concat(z, s))    (8)
where concat() denotes the concatenation function and j denotes the j-th sub-region of the image feature;
(2) the image feature h^{last1} ∈ R^{N_1 × D_1} output by the previous layer is affine-transformed so that each image sub-region is fused with the sentence information:
AFF(h_j^{last1} | concat(z, s)) = γ_j^1 · h_j^{last1} + θ_j^1    (9)
where AFF denotes the affine transformation, N_1 denotes the number of sub-regions of the current image feature, and D_1 denotes the sub-region dimension of the current image feature h^{last1}.
6. The method according to claim 1, wherein in step (3) the first dynamic convolution layer, the second dynamic convolution layer, the third dynamic convolution layer and the fourth dynamic convolution layer have the same structure, specifically:
the image feature h^{last} output by the preceding fusion block or improved fusion block first passes through global average pooling, which reduces the spatial size of the image feature to 1 × 1; a first fully connected layer then reduces the channel size to one quarter of the original, a second fully connected layer maps the channel size to K, and finally normalization yields the attention π_k(h^{last}) corresponding to each of the K convolution kernels;
the K convolution kernels and biases are aggregated, and the aggregated convolution kernel W̃ and bias b̃ are used to perform the convolution on the input image feature:
W̃(h^{last}) = Σ_{k=1}^{K} π_k(h^{last}) W̃_k
b̃(h^{last}) = Σ_{k=1}^{K} π_k(h^{last}) b̃_k
y = W̃(h^{last})^T h^{last} + b̃(h^{last})
where W̃_k and b̃_k denote the k-th convolution kernel and bias respectively, and W̃^T denotes the transpose of W̃.
7. The method according to claim 1, characterized in that in step (3) the discriminator specifically is: the real image of size 256 × 256 × 3 and the generated image are input to the discriminator; after the 6 downsampling layers the output image feature size is 4 × 4 × 256; the sentence vector is then spatially replicated and concatenated behind the image features; finally, the discriminator outputs the adversarial loss L_D; the discriminator judges from the image features whether an image is real, and judges from the concatenated sentence vector whether the generated image conforms to the text semantics.
8. The method according to claim 1, characterized in that a contrastive loss L_real is added to the generator loss function to push the text and the generated image toward semantic consistency; specifically:
L_real = −(1/M) Σ_{i=1}^{M} log [ exp(sim(f_img(x_i), f̂_img(x̂_i)) / η) / Σ_{j=1}^{M} exp(sim(f_img(x_i), f̂_img(x̂_j)) / η) ]
where sim(u, v) = u^T v / (‖u‖ ‖v‖), M is the batch size, i and j are image indices, pairs with i ≠ j act as negatives, η is the temperature hyper-parameter, f_img(x_i) and f̂_img(x̂_i) denote the encoding of the real image and the encoding of the generated image output by the text-to-image model for the same text, and f̂_img(x̂_j) denotes the encoding of a generated image corresponding to a different text in the same batch.
9. A computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any of claims 1-8.
10. A computing device comprising a memory having executable code stored therein and a processor, which when executing the executable code, implements the method of any of claims 1-8.
CN202310249612.3A 2023-03-15 2023-03-15 Image generation method for fine-grained text based on contrast learning and dynamic convolution Pending CN116452687A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310249612.3A CN116452687A (en) 2023-03-15 2023-03-15 Image generation method for fine-grained text based on contrast learning and dynamic convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310249612.3A CN116452687A (en) 2023-03-15 2023-03-15 Image generation method for fine-grained text based on contrast learning and dynamic convolution

Publications (1)

Publication Number Publication Date
CN116452687A true CN116452687A (en) 2023-07-18

Family

ID=87132796

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310249612.3A Pending CN116452687A (en) 2023-03-15 2023-03-15 Image generation method for fine-grained text based on contrast learning and dynamic convolution

Country Status (1)

Country Link
CN (1) CN116452687A (en)


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination