CN113191375B

CN113191375B - Text-to-multi-object image generation method based on joint embedding

Info

Publication number: CN113191375B
Application number: CN202110642098.0A
Authority: CN
Inventors: 余月; 王孟岚; 杨越
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2021-06-09
Filing date: 2021-06-09
Publication date: 2023-05-09
Anticipated expiration: 2041-06-09
Also published as: CN113191375A

Abstract

The invention discloses a text-to-multi-object image generation method based on joint embedding, and belongs to the field of cross-mode generation from text to image. The implementation method of the invention comprises the following steps: the text description is input into a semantic encoder to obtain joint semantic features of the text and the image, the text is input into a spatial layout encoder to obtain joint spatial features of the text and the segmentation map, and the joint semantic features and the spatial features both comprise sentence levels and word levels. And respectively fusing the word-level features and the sentence-level features by using a dynamic fusion module. Feeding the sentence-level features obtained through fusion into an initial generator in a generating type countermeasure network to generate a low-resolution image, and feeding the word-level features obtained through fusion into a subsequent generator to generate a fine high-resolution image. And constructing a cascade generation type countermeasure network consisting of a plurality of pairs of generators and discriminators, training on a data set through a design loss function to generate the countermeasure network, and generating corresponding images by using the trained generators.

Description

Text-to-multi-object image generation method based on joint embedding

Technical Field

The invention relates to a text-to-multi-object image generation method based on joint embedding, and belongs to the field of cross-mode generation from text to image.

Background

In our daily lives, we are not able to rely on a single data form to convey some information in most cases, and we often need to express it in combination with multi-modal data. For example, when we describe a thing, we usually state it in the form of a text-assisted image. However, such pair-matched data requires a significant amount of financial and effort to collect. The generation problem is different from the retrieval problem, and the retrieved resources are all existing data, so that the generation is more prone to creating data. Collecting corresponding text and images is not an easy task and text-to-image generation studies help solve these problems. The obscure words on the book are often painful for students lacking in imagination, and we want to configure matching images or three-dimensional scenes for the texts by means of a deep learning method, and combine the texts, the corresponding images and the three-dimensional scenes to help the students understand the knowledge more deeply. Generating corresponding images from textual descriptions is a challenging and meaningful study.

For cross-modal generation problems, the key is the extraction of joint features and the design of the generation model. Just as in the text-to-image task, text and images are two different modalities of data, how to derive the joint features of text and images from input and how to design a reasonable model to generate an image is key to solving this problem. The purpose of generating images from text is to generate high quality images that are reasonable in shape, color, layout, etc. and that conform to the text description.

Previous studies have simplified the text-to-image generation task into two parts to process: extracting text-image joint semantic features from the text; and feeding the obtained feature vector into a generated network model to obtain a corresponding image. However, visual space is high-dimensional, structured, covering visual features of different aspects, including high-level abstract semantics, layout features, and low-level textures, color features, and so forth. Previous methods have achieved good results in text-to-single object image generation tasks, but they are not suitable for processing multi-object image generation corresponding to complex text. Mapping directly from semantic features into a visual space with a reasonable layout is a very difficult challenge for multi-object image generation tasks.

Disclosure of Invention

Aiming at the problem that the generated multi-object image has no reasonable space layout, the invention discloses a text-to-multi-object image generation method based on joint embedding, which aims to solve the technical problems that: a network framework capable of generating corresponding multi-object images from textual descriptions is provided, consisting essentially of a semantic encoder, a spatial layout encoder, a dynamic feature fusion module, and a cascade-generated countermeasure network with attention modules. The method comprises the steps of extracting joint semantic features of a text and an image from the text through a semantic encoder, extracting joint spatial features of the text and a segmentation map from the text through a spatial layout encoder, fusing the semantic features and the spatial features through a dynamic fusion module, and feeding the fused features into a generation type countermeasure network to generate an image which accords with the text description and has reasonable layout. The invention has the advantages of convenience, wide applicability and good generation effect. The invention uses the corresponding image generated from the text in the cross-modal generation field, and solves the technical problem of related engineering.

The method comprises the steps of multimedia education resource construction, image editing and computer teaching assistance.

In order to achieve the above purpose, the present invention adopts the following technical scheme.

The invention discloses a method for generating a text-to-multi-object image based on joint embedding. And respectively fusing the word-level features and the sentence-level features by using a dynamic fusion module. Feeding the sentence-level features obtained through fusion into an initial generator in a generating type countermeasure network to generate a low-resolution image, and feeding the word-level features obtained through fusion into a subsequent generator to generate a fine high-resolution image. And constructing a cascade generation type countermeasure network consisting of a plurality of pairs of generators and discriminators, training on a data set through a design loss function to generate the countermeasure network, and generating corresponding images by using the trained generators. The invention uses the corresponding image generated from the text in the cross-modal generation field, and solves the technical problem of related engineering.

The invention discloses a text-to-multi-object image generation method based on joint embedding, which comprises the following steps:

step 1: the text description is input into a semantic encoder to obtain joint semantic features of the text and the image, and the semantic encoder is realized by the text encoder which is obtained by pre-training in a semantic encoder architecture.

The semantic encoder architecture described in step 1 includes a text encoder and an image encoder. The training text encoder is guided in the semantic encoder architecture by using the pre-trained image encoder, and the text encoder obtained through training can extract the joint semantic features of the text and the image from the text.

The text encoder obtained through training can extract the joint semantic features of the text and the image from the text, and the realization method comprises the following steps: the image encoder is trained on the ImageNet data set in advance, and the trained image encoder can obtain image semantic features from the image. By minimizing the distance between text features and semantic features, the text encoder can be forced to extract joint semantic features of text and images from the text. Considering that the image encoder pre-trained on the ImageNet dataset is not fully applicable to other datasets, the text encoder is optimized, and meanwhile, part of the network layer of the image encoder is continuously optimized by minimizing the distance between the text features and the image semantic features, and the two encoders are synchronously trained and optimized until the joint semantic features of the text and the image can be extracted from the text. The text features include sentence-level features e _I And word-level features w _I The image semantic features also comprise global features g _I And local feature v _I . Thus, the distance between the text feature and the semantic feature also includes sentence level and word level.

Using word-level penalty

And sentence level penalty->

Optimizing a semantic encoder architecture, wherein the semantic encoder architecture loses a function L ^I Is defined as follows:

image global semantic feature g using pairing _I And text sentence feature e _I Dot product between to calculate sentence levelSentence-level penalty in semantic encoder architecture

Is defined as:

to calculate word level loss

The method for calculating the similarity of the text and the image by the citation DAMSM (Deep Attentional Multimodal Similarity Model) calculates the similarity of the text feature and the image semantic feature on the word level, and the specific implementation method is as follows:

calculating to obtain a similarity matrix s of words in the text and the image subregion:

wherein ,w_I Word characteristics, v, representing text _I Representing the sub-region features of the image.

The degree of similarity between the word and the image block can be obtained by calculation of the similarity matrix s,

representing the degree of similarity of the ith word to the jth image block, where T represents the number of words in the text. / >

By introducing an attention module, a region context vector is calculated for each word in the text

Said region context vector +.>

The dynamic representation of the image subarea associated with the ith word in the sentence is the weighted sum of all the visual characteristics of the image subareas, and the calculation formula is as follows:

wherein N represents the number of sub-region blocks of the image,

representing the j-th sub-region feature of the image. The similarity between text and image local features is calculated using the region context vector and the word vector. The formula is as follows:

wherein ,

calculating the region context vector of the i-th word-associated image subregion>

And word characteristics of the i-th word +.>

Cosine similarity of (c). The trained text encoder can extract the joint semantic features of the text and the image from the text.

Step 2: the text is input into a spatial layout encoder to obtain joint spatial features of the text and the segmentation map, and the spatial layout encoder is realized by a text encoder which is obtained by pre-training in a spatial encoder architecture.

The spatial encoder architecture described in step 2 consists of a text encoder and a segmentation map encoder. The text encoder is guided to be trained by using a pre-trained segmentation map encoder in a space encoder architecture, and the text encoder obtained through training can extract joint space layout characteristics of the text and the segmentation map from the text.

The text encoder obtained through training can extract the joint space layout characteristics of the text and the segmentation map from the text, and the realization method comprises the following steps: the segmentation map encoder is trained on the ImageNet data set in advance, and the segmentation map encoder obtained through training can obtain spatial layout features from the segmentation map. By minimizing the distance between the text features and the segmentation map features, the text encoder can be forced to extract from the text joint spatial layout features of the text and the segmentation map. Considering that the segmentation map encoder pre-trained on the ImageNet dataset is not fully applicable to other datasets, the text encoder is optimized, and meanwhile, part of the network layer of the segmentation map encoder is continuously optimized by minimizing the distance between the text features and the segmentation map features, the two encoders are synchronously trained and optimized until the joint spatial layout features of the text and the segmentation map can be extracted from the text. The text features include sentence-level features e _S And word-level features w _S The spatial layout features also include global features g _S And local feature v _S . Thus, the distance between the text feature and the spatial layout feature also includes a sentence-level distance and a word-level distance.

Using word-level penalty

And sentence level penalty->

Optimizing spatial layout encoder architecture, spatial encoder architecture loss function L ^S Is defined as follows:

global spatial feature g of segmentation map using pairing _S And text sentence feature e _S The dot product between them calculates sentence-level losses, sentence-level losses in the spatial encoder architecture

Is defined as: />

To calculate word level loss

The method for calculating the similarity of the text and the image by the citation DAMSM (Deep Attentional Multimodal Similarity Model) calculates the similarity of the text characteristic and the space characteristic of the segmentation map on the word level, and the specific implementation method is as follows:

calculating to obtain a similar matrix of words in the text and the subareas of the segmentation map:

wherein ,w_S Word characteristics, v, representing text _S Representing the sub-region features of the segmentation map.

The similarity between the word and the segmented block is calculated by the similarity matrix s,

representing the degree of similarity of the ith word to the jth segmented tile, where T represents the number of words in the text.

Said region context vector +.>

The dynamic representation of the segmentation map subarea associated with the ith word in the sentence is the weighted sum of the visual characteristics of all the segmentation map subareas, and the calculation formula is as follows:

Where N represents the number of sub-region blocks of the segmentation map. The similarity between the text and the segmentation map local features is calculated using the region context vector and the word vector. The formula is as follows:

wherein ,

calculating the region context vector of the i-th word-associated segmentation map sub-region>

And word characteristics of the i-th word +.>

Cosine similarity of (c). The trained text encoder can extract joint spatial features of the text and the segmentation map from the text.

Step 3: and (3) fusing the semantic features obtained in the step (1) and the spatial features obtained in the step (2) through a dynamic fusion module, wherein the semantic features and the spatial features comprise sentence-level features and word-level features. The fused sentence-level features are fed into an initial generator in a generative antagonism network to generate a low resolution image. The fused word-level features are then used in subsequent generators in the generative antagonism network to constrain the generation of the high resolution image.

In the step 3, concat operation is adopted, and sentence-level semantic features e are fused _I And spatial feature e _S As input to the initial stage generator. The specific implementation formula is as follows:

e _I-S ＝concat(e _I ,e _S ,z)

where z represents a random noise vector.

Step 3, fusing word-level semantic features w by adopting a dynamic fusion mode _I And spatial features w _S The obtained characteristics

Is fed into a high-stage generator to generate a high-resolution image. The specific implementation method comprises the following steps:

word-level semantic feature w _I And spatial features w _S Respectively with the image characteristics h obtained in the previous stage _k-1 And (3) performing dynamic fusion, and re-fusing the two features obtained in the previous step by adopting dynamic fusion again to serve as semantic-space constraint of a generator in the subsequent stage. The formula is as follows:

wherein DF (.) represents the multilayer perceptron. Firstly, DF (·) is used for obtaining the image characteristic h at the previous stage _k-1 And semantic features w _I Obtaining a fusion ratio

And then DF (DEG) is used as the image characteristic h obtained in the last stage _k-1 And spatial features w _S A fusion ratio is determined>

Finally, get->

And fusing the semantic features and the spatial features obtained in the kth stage. The fused word-level features ensure that high-pixel image generation is constrained in both semantic and spatial layout aspects in the high-stage generator.

Step 4: the sentence-level feature e obtained by fusion in the step 3 is processed _I-S Into an initial generator in a cascade generation type countermeasure network, a low resolution image is generated. Processing the word-level semantic features w obtained in step 2 using an attention module _I And spatial features w _S Obtaining word-level attention semantic features

And attention space features->

Calculate +.>

Is to calculate word-level attention semantic-space constraint +.>

Image feature h generated by the generator of the previous stage _k-1 And word-level attention semantic-space constraint +.>

Together into a subsequent generator to generate a finer high resolution image.

Generating word-level attention semantic-space constraints in step 4

The specific implementation mode of (a) is as follows:

word-level semantic feature w _I And spatial features w _S Respectively with the image characteristics h obtained in the previous stage _k-1 And (3) performing dynamic fusion, and using the fused characteristics as a value vector in the attention module. Generating word-level attention semantic features using attention modules, respectively

And attention space features->

The median vector of the attention module is the same as the keyword vector, and the query vector uses the image feature h obtained in the previous stage _k-1 . Finally, a dynamic fusion module is adopted to fuse and obtain word-level attention semantic-space constraint +.>

Using the attention module proposed by AttnGAN, taking the image feature h generated in the last stage as a query vector, and taking the word-level space feature w _S Treated as a keyword vector and a value vector. Generating an attention space feature for each image block using word-level space features

wherein />

Is defined as:

wherein

Representing the degree of attention of the jth image to the ith word-level spatial feature, synthesizing an attention-space feature c for the image by weighted summing the word-level spatial features _S . The above formula Attn (h, w _S ) Keyword vector and value vector w in (a) _S Substitution to word-level semantic features w _I To obtain Attn (h, w _I ) Also for image synthesis an attention semantic feature c _I . Fusing the attention space features c by adopting a word-level feature fusion mode in the step 3 _S And attention to semantic feature c _I . The specific formula is defined as:

in step 4, sentence-level feature e obtained by fusion in step 3 is used _I-s As an input to the initial stage generator, a low resolution image is generated, concretely implemented as:

h ₀ ＝F ₀ (e _I-S )

word-level attention semantic-space constraint c obtained by using image features obtained in the previous stage and dynamic fusion _I-S As input to the subsequent generator, a high resolution image is generated. Consider c _I-S The method belongs to high-level features of the image, and low-level features such as colors, textures, structures and the like may be more concerned when the image is generated at a low stage, so that two high-level information of semantic and spatial layout cannot necessarily play a good role in guiding. The network determines whether the current generation stage needs higher-level information or middle-low-level information to obtain

The image characteristic h obtained in the previous stage is obtained _k-1 Constraint information with the kth phase->

Fusion, send to F _k In the module, image feature h is generated _k . Finally, the image features are sent to a corresponding generator G _k A high resolution image is generated. The specific implementation formula is as follows:

wherein F_k (. Cndot.) image feature generation Module representing the kth stage, G _k (·) represents the kth generator.

Step 5: and constructing a cascade generation type countermeasure network consisting of a plurality of pairs of generators and discriminators, and training the generation countermeasure network on a data set through designing a loss function, namely training the generation countermeasure network with minimized loss function.

To optimize the overall neural network model, generator penalty L is employed _G Sum discriminator loss L _D To alternate training. Loss function L of generator _G And loss function L of the arbiter _D Is defined as follows:

L _G ＝L _advG +λ ₁ L _DAMSM L _D ＝L _advD

wherein L_advG Is defined as follows:

p _G representing the data distribution of the generated image, s represents the text description of the input, and D (·) represents the discriminant in the generated countermeasure network. Wherein the first term ensures that the generated image is as realistic as possible, and the second term ensures that the generated image matches the input text. Loss L using DAMSM in step 2 _DAMSM To calculate the local and global similarity of the input text and the generated image, and to ensure that the generated image and the input text maintain semantic consistency by minimizing the loss.

L _advD Is defined as:

p _data the data distribution representing the real image, the first term in the formula being used to distinguish the generated image from the real image, the second term determining whether the input image and text match.

Step 6: and 3, fusing the word characteristics and the sentence characteristics obtained in the step 1 and the step 2, feeding the word characteristics and the sentence characteristics into a cascade generation type countermeasure network according to the mode described in the step 4, training a generator according to the mode described in the step 5, and generating images with different resolutions by the trained generator.

Further comprising the step 7 of: and (3) applying the image which is generated in the step (6) and accords with the text description to the field of cross-mode generation, and solving the technical problem of related engineering.

The related engineering technical problems in the step 7 comprise multimedia education resource construction, image editing and computer teaching assistance.

The beneficial effects are that:

1. the invention discloses a text-to-multi-object image generation method based on joint embedding, which provides a network framework capable of generating a corresponding multi-object image from text description, wherein the network framework mainly comprises a semantic encoder, a spatial layout encoder, a dynamic characteristic fusion module and a cascade generation type countermeasure network with an attention module. The method comprises the steps of extracting joint semantic features of a text and an image from the text through a semantic encoder, extracting joint spatial features of the text and a segmentation map from the text through a spatial layout encoder, fusing the semantic features and the spatial features through a dynamic fusion module, and feeding the fused features into a generation type countermeasure network to generate an image conforming to text description. Because the generation of the image is constrained in terms of both semantics and spatial layout when the high-resolution image is generated, the generated image has the advantages of conforming to the text description, reasonable spatial layout and realistic visual effect.

2. The existing mode of obtaining the text paired images is mostly realized through retrieval, but no text description exists in the existing corresponding images, and professional image editing software is needed to manually create a corresponding image. The method for generating the text-to-multi-object image based on joint embedding can realize the simple text description input, can quickly generate corresponding images through a network obtained through training, and has the advantages of convenience and rapidness.

3. Visual space is high-dimensional, structured, covering visual features of different aspects, including high-level abstract semantics, layout features, and low-level textures, color features, and the like. Existing research has simplified text-to-image generation tasks into two parts to process: extracting text and image joint semantic features from the text; and feeding the obtained feature vector into a generated network model to obtain a corresponding image. Mapping directly from semantic features into visual space with reasonable layout is a very difficult challenge. The method for generating the text-to-multi-object image based on joint embedding disclosed by the invention has the advantages that the spatial layout features are extracted from the text when the image is generated, the semantic features and the spatial features are used for restraining the image generation, the problem of unreasonable image layout generated by the previous method is solved, and the generation effect is good.

4. The invention discloses a method for generating a corresponding multi-object image from a text, which is used for applying the multi-object image which is generated by the invention and accords with the text description to the cross-mode generation field and solving the technical problem of related engineering. For example: the method comprises the following steps of building multimedia education resources, editing images, assisting computer teaching and the like.

Drawings

FIG. 1 is a flow chart of an implementation of the joint embedding-based text-to-multi-object image generation method of the present invention;

FIG. 2 is a diagram of the overall network architecture of the present invention;

FIG. 3 is a block diagram of an encoder architecture according to the present invention, where FIG. 3 (a) is a semantic encoder architecture and FIG. 3 (b) is a spatial encoder architecture;

FIG. 4 is a block diagram of a dynamic fusion module according to the present invention;

FIG. 5 is an exemplary graph of the results of the generation on the MSCOCO data set in the present invention;

FIG. 6 is an exemplary view of an image generated on an MSCOCO data set according to the present invention.

Detailed Description

Embodiments of the present invention will be described below with reference to the drawings.

As shown in fig. 1, the method for generating a text-to-multi-object image based on joint embedding disclosed in this embodiment can perform education-related application on an MSCOCO dataset, for example, input a text description, and a model can generate a corresponding image, and understand knowledge contained in the text by combining the multi-object image and the text. The training and image generation flow of this embodiment is shown in fig. 1.

The semantic encoder architecture described in step 1 includes a text encoder and an image encoder, as shown in fig. 3 (a). The training text encoder is guided in the semantic encoder architecture by using the pre-trained image encoder, and the text encoder obtained through training can extract the joint semantic features of the text and the image from the text.

The text encoder is realized by a two-way long-short term memory neural network (LSTM), and the final hidden layer output of the model is used for representing the text sentence characteristic e _I The intermediate hidden layer output then represents the text word feature w _I . The image encoder is implemented by using an acceptance-v 3 model pre-trained on an image net data set, and the last layer of the model learns to obtain the global feature g of the image _I The middle convolution layer learns to obtain the subarea characteristic v of the image _I 。

The two encoders are continuously optimized and trained by minimizing the distance between text features and image features until joint semantic features of text and images can be extracted from the text. The distance between the text feature and the image semantic feature contains sentence level and word level as shown in fig. 3 (a).

Using word-level penalty

And sentence level penalty->

image global semantic feature g using pairing _I And text sentence feature e _I The dot product between them calculates sentence-level losses, sentence-level losses in the semantic encoder architecture

Is defined as:

to calculate word level loss

The method for calculating the similarity of the text and the image by the reference DAMSM (Deep Attentional Multimodal Similarity Model) calculates the similarity of the text characteristics and the image characteristics on the word level, and the specific implementation method is as follows:

Then represents the similarity of the ith word to the jth image block and T represents the word number in the textA number. By introducing an attention module, a region context vector is calculated for each word in the text>

Said region context vector +.>

Where N represents the number of sub-region blocks of the image. The similarity between text and image local features is calculated using the region context vector and the word vector. The formula is as follows:

wherein ,

calculate->

and />

The spatial encoder architecture described in step 2 consists of a text encoder and a segmentation map encoder, as shown in fig. 3 (b). The text encoder is guided to be trained by using a pre-trained segmentation map encoder in a space encoder architecture, and the text encoder obtained through training can extract joint space layout characteristics of the text and the segmentation map from the text.

The text encoder is realized by a two-way long-short term memory neural network (LSTM), and the final hidden layer output of the model is used for representing the text sentence characteristic e _S The intermediate hidden layer outputs then represent text word features w, respectively _S . The segmentation map encoder is implemented using an acceptance-v 3 model pre-trained on the ImageNet dataset, and considering the differences between the segmentation map and the image, only the first few layers of the model are used in the paper to extract the segmentation features. The last layer of the model is learned to obtain global space feature g of the segmentation map _S Intermediate convolution layer learning to obtain subarea feature v of segmentation map _S 。

The segmentation map encoder and the text encoder are continuously trained and optimized by minimizing the distance between the text features and the segmentation map features until the joint spatial layout features of the text and the segmentation map can be extracted from the text. The distance between the text feature and the spatial layout feature comprises sentence level and word level.

Using word-level penalty

And sentence level penalty->

sentence-level penalty calculation using dot products between paired segmentation map global spatial features and text sentence features, sentence-level penalty in a spatial encoder architecture

Is defined as:

to calculate word level loss

Then it represents the degree of similarity of the ith word to the jth segmented tile and T represents the number of words in the text. By introducing an attention module, a region context vector is calculated for each word in the text>

Said region context vector +.>

wherein ,

calculate->

and />

Step 3: and (3) fusing the semantic features obtained in the step (1) and the spatial features obtained in the step (2) through a dynamic fusion module, wherein the semantic features and the spatial features comprise sentence levels and word levels. The fused sentence-level features are fed into an initial generator in a generative antagonism network to generate a low resolution image. The fused word-level features are then used in subsequent generators in the generative antagonism network to constrain the generation of the high resolution image.

In the step 3, concat operation is adopted, and sentence-level semantic features e are fused _I And spatial feature e _S As input to the initial stage generator, as shown in fig. 4. The specific implementation formula is as follows:

e _I-S ＝concat(e _I ,e _S ,z)

where z represents a random noise vector.

Step 3, fusing word-level semantic features w by adopting a dynamic fusion mode _I And spatial features w _S The resulting features are fed into the high stageThe generator generates a high resolution image. The specific implementation method comprises the following steps:

/>

Finally, get->

And fusing the semantic features and the spatial features obtained in the kth stage. Fused word-level features->

It is ensured that in the high-stage generator the high-pixel image generation is constrained both in terms of semantics and spatial layout.

And attention space features->

Calculate +.>

Is to calculate word-level attention semantic-space constraint +.>

Together into a subsequent generator to generate a finer high resolution image.

Generating word-level attention semantic-space constraints using an attention module in step 4

The specific implementation mode of (a) is as follows:

word-level semantic feature w _I And spatial features w _S Respectively with the image characteristics h obtained in the previous stage _k-1 And (3) performing dynamic fusion, and using the fused characteristics as a value vector in the attention module. Generating word-level attention semantic features and attention space features respectively by using an attention module, wherein the median vector of the attention module is the same as the keyword vector, and the query vector uses the image features h obtained in the previous stage _k-1 . Finally, fusing by adopting a dynamic fusion module to obtain word-level attention semantic-space constraint

wherein />

Is defined as:

wherein

as shown in fig. 2, the sentence-level feature e obtained by the fusion in step 3 is used in step 4 _I-S As an input to the initial stage generator, a low resolution image is generated, concretely implemented as:

h ₀ ＝F ₀ (e _I-S )

word-level attention semantic-space constraint c obtained by using image features obtained in the previous stage and dynamic fusion _I-S As an input to the subsequent generator,a high resolution image is generated. Consider c _I-S The method belongs to high-level features of the image, and low-level features such as colors, textures, structures and the like may be more concerned when the image is generated at a low stage, so that two high-level information of semantic and spatial layout cannot necessarily play a good role in guiding. As shown in FIG. 4, the network determines whether the current generation stage requires higher-level information or middle-lower-level information by itself to obtain

Fusion, send to F _k In the module, image feature h is generated _k . Finally, the image features are sent to corresponding generators G _k As shown in fig. 2. The specific implementation formula is as follows:

To optimize the overall neural network model, generator penalty L is employed _G Sum discriminator loss L _D Alternate training. Loss function L of generator _G And loss function L of the arbiter _D Is defined as follows:

L _G ＝L _advG +λ ₁ L _DAMSM L _D ＝L _advD

wherein L_advG Is defined as follows:

L _advD Is defined as:

In the training process of the invention, 210 generations are trained altogether, the whole network model is not end-to-end in the training process, and the text encoder and the image encoder are trained synchronously first, and the learning rate of both is 0.002. The other text encoder and the segmentation map encoder were resynchronized, and the learning rate of both was also 0.002. The training process of the generated countermeasure network is a process of constantly optimizing and countering between the generator and the arbiter, and their learning rate is set to 0.0002.

In step 6, the present embodiment obtains a good generation result on the MSCOCO public data set. The MSCOCO dataset is a large, rich object detection, segmentation and caption dataset. This dataset is targeted at scene understanding, mainly capturing images from complex daily scenes. MSCOCO provides very detailed data labeling for each image, including text description of the entire image, bounding box information of objects in the image, class labels, and segmentation map, etc. The image contains 91 classes of objects, of which 82 classifications all have over 5,000 instance objects. An effect diagram of generating a corresponding image from a text description is shown in fig. 5, and an example of a generated image of a network is shown in fig. 6.

In summary, in this embodiment, the text description is input into the generating countermeasure network, and the generating countermeasure network model is trained, so as to obtain a well-trained generator, and at this time, the generator can generate a multi-object image conforming to the text description. The method and the device can solve the problems that the time cost and the labor cost of the generation in the traditional method are high, and the effect cannot be guaranteed.

While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims

1. A text-to-multi-object image generation method based on joint embedding is characterized by comprising the following steps of: comprises the following steps of the method,

step 1: inputting the text description into a semantic encoder to obtain joint semantic features of the text and the image, wherein the semantic encoder is realized by a text encoder which is obtained by pre-training in a semantic encoder architecture;

the semantic encoder architecture described in step 1 includes a text encoder and an image encoder; guiding a training text encoder by using an image encoder obtained by pre-training in a semantic encoder architecture, wherein the text encoder obtained by training can extract joint semantic features of a text and an image from the text;

step 2: inputting the text into a space layout encoder to obtain joint space characteristics of the text and the segmentation map, wherein the space layout encoder is realized by a text encoder which is obtained by pre-training in a space encoder architecture;

The space encoder architecture described in step 2 consists of a text encoder and a segmentation map encoder; the text encoder is guided to be trained by using a segmentation map encoder obtained through pre-training in a space encoder framework, and the text encoder obtained through training can extract joint space layout characteristics of the text and the segmentation map from the text;

step 3: the semantic features obtained in the step 1 and the spatial features obtained in the step 2 are fused through a dynamic fusion module, wherein the semantic features and the spatial features comprise sentence-level features and word-level features; the sentence-level features obtained by fusion are sent to an initial generator in a cascade generation type countermeasure network to generate a low-resolution image; the word-level features obtained by fusion are used in a subsequent generator in a cascade generation type countermeasure network to restrict the generation of high-resolution images;

step 4: the sentence-level features obtained by fusion in the step 3 are sent to an initial generator in a cascade generation type countermeasure network, and a low-resolution image is generated; processing the word-level semantic features obtained in the step 1 and the space features obtained in the step 2 by using an attention module to obtain word-level attention semantic features and attention space features; calculating word-level attention semantic-space constraints using the method of step 3 calculation; the image features generated by the generator at the previous stage and the word-level attention semantic-space constraint are fed into the subsequent generator together to generate a finer high-resolution image;

Step 5: constructing a cascade generation type countermeasure network consisting of a plurality of pairs of generators and discriminators, and training on a data set through designing a loss function to generate the countermeasure network, namely training to obtain a generated countermeasure network with minimized loss function;

step 6: and 3, fusing the word-level features and the sentence-level features obtained in the step 1 and the step 2, sending the word-level features and the sentence-level features into a cascade generation type countermeasure network in a manner described in the step 4, training a generator in a manner described in the step 5, and generating images with different resolutions by the trained generator.

2. A method for generating a text-to-multi-object image based on joint embedding as recited in claim 1, wherein: the method also comprises a step 7 of applying the image which is generated in the step 6 and accords with the text description to the cross-mode generation field to solve the technical problem of related engineering;

3. A method of generating a text-to-multi-object image based on joint embedding as claimed in claim 1 or 2, wherein: the text encoder obtained through training in the semantic encoder architecture can extract the joint semantic features of the text and the image from the text, and the realization method comprises the following steps: training an image encoder on an ImageNet data set in advance, wherein the trained image encoder can obtain image semantic features from images; the text encoder can extract the joint semantic features of the text and the image from the text by minimizing the distance between the text features and the semantic features; while optimizing the text encoder, continuously optimizing part of the network layer of the image encoder by minimizing the distance between the text features and the image semantic features, and synchronously training and optimizing the two encoders until the joint semantic features of the text and the image can be extracted from the text; the text features include sentence-level features e _I And word-level features w _I The image semantic features also comprise global features g _I And local feature v _I The method comprises the steps of carrying out a first treatment on the surface of the Thus, the distance between the text feature and the semantic feature also includes sentence-level distance and word-level distance;

using word-level penalty

And sentence level penalty->

Is defined as:

to calculate word level loss

wherein ,w_I Word characteristics, v, representing text _I Sub-representing imageRegional characteristics;

representing the similarity of the ith word and the jth image block, wherein T represents the number of words in the text;

Said region context vector +.>

wherein N represents the number of sub-region blocks of the image,

representing the j-th sub-region feature of the image, calculating the similarity between the text and the image local feature using the region context vector and the word vector; the formula is as follows:

wherein ,

And word characteristics of the i-th word +.>

Cosine similarity of (c); the trained text encoder can extract the joint semantic features of the text and the image from the text.

4. A method for generating a text-to-multi-object image based on joint embedding as recited in claim 3, wherein: the text encoder obtained through training in the space encoder architecture can extract the joint space layout characteristics of the text and the segmentation map from the text, and the realization method comprises the following steps: training a segmentation map encoder on an ImageNet data set in advance, wherein the segmentation map encoder obtained through training can obtain spatial layout characteristics from the segmentation map; by minimizing the distance between the text features and the segmentation map features, the text encoder can be enabled to extract the joint spatial layout features of the text and the segmentation map from the text; while optimizing the text encoder, continuously optimizing part of the network layer of the segmentation map encoder by minimizing the distance between the text features and the segmentation map features, and synchronously training and optimizing the two encoders until the joint space layout features of the text and the segmentation map can be extracted from the text; the text features include sentence-level features e _S And word-level features w _S The spatial layout features also include global features g _S And local feature v _S The method comprises the steps of carrying out a first treatment on the surface of the Thus, the distance between the text feature and the spatial layout feature also includes sentence-level distance and word-level distance;

using word-level penalty

And sentence level penalty->

Is defined as:

to calculate word level loss

wherein ,w_S Word characteristics, v, representing text _S A sub-region feature representing a segmentation map;

representing the similarity of the ith word and the jth segmented block, wherein T represents the number of words in the text;

Said region context vector +.>

wherein N represents the number of sub-region blocks of the segmentation map; calculating a similarity between the text and the segmentation map local features using the region context vector and the word vector; the formula is as follows:

wherein ,

And word characteristics of the i-th word +.>

Cosine similarity of (c); the trained text encoder can extract joint spatial features of the text and the segmentation map from the text. />

5. A method for generating a text-to-multi-object image based on joint embedding as recited in claim 4, wherein: in the step 3, concat operation is adopted, and sentence-level semantic features e are fused _I And spatial feature e _S As input to the initial stage generator; the specific implementation formula is as follows:

e _I-S ＝concat(e _I ，e _S ，z)

wherein z represents a random noise vector;

Is fed into a high-stage generator to generate a high-resolution image; the specific implementation method comprises the following steps:

word-level semantic feature w _I And spatial features w _S Respectively with the image characteristics h obtained in the previous stage _k-1 Dynamic fusion is carried out, and the two features obtained in the previous step are fused again by adopting the dynamic fusion to serve as semantic-space constraint of a generator in the subsequent stage; the formula is as follows:

wherein DF (& gt) represents the multi-layer perceptron; firstly, DF (·) is used for obtaining the image characteristic h at the previous stage _k-1 And semantic features w _I Obtaining a fusion ratio

Finally, get->

Fusing semantic features and space features obtained in the kth stage; the word-level features obtained through fusion ensure that high-pixel image generation is constrained in a high-stage generator from two aspects of semantics and spatial layout;

step 4: the sentence-level feature e obtained by fusion in the step 3 is processed _I-S Sending the low-resolution images into an initial generator in a cascade generation type countermeasure network; processing the word-level semantic features w obtained in step 2 using an attention module _I And spatial features w _S Obtaining word-level attention semantic features

And attention space features->

Calculate +.>

Is to calculate word-level attention semantic-space constraint +.>

Together into a subsequent generator to generate a finer high resolution image.

6. The method for generating a text-to-multi-object image based on joint embedding of claim 5, wherein:

generating word-level attention semantic-space constraints in step 4

The specific implementation mode of (a) is as follows:

word-level semantic feature w _I And spatial features w _S Respectively with the image characteristics h obtained in the previous stage _k-1 Dynamic fusion is carried out, and the characteristics obtained by fusion are used as value vectors in the attention module; generating word-level attention semantic features using attention modules, respectively

And attention space features->

The median vector of the attention module is the same as the keyword vector, and the query vector uses the image feature h obtained in the previous stage _k-1 The method comprises the steps of carrying out a first treatment on the surface of the Finally, a dynamic fusion module is adopted to fuse and obtain word-level attention semantic-space constraint +.>

Using the attention module proposed by AttnGAN to make the image feature h generated in the last stage _k-1 Regarded as a query vector, the word-level spatial features w _S Treating as a keyword vector and a value vector; generating an attention space feature for each image block using word-level space features

wherein />

Is defined as:

wherein

Representing the degree of attention of the jth image to the ith word-level spatial feature, synthesizing an attention-space feature c for the image by weighted summing the word-level spatial features _S The method comprises the steps of carrying out a first treatment on the surface of the The above formula Attn (h, w _S ) Keyword vector and value vector w in (a) _S Substitution to word-level semantic features w _I To obtain Attn (h, w _I ) Also for image synthesis an attention semantic feature c _I The method comprises the steps of carrying out a first treatment on the surface of the Fusing the attention space features c by adopting a word-level feature fusion mode in the step 3 _S And attention to semantic feature c _I The method comprises the steps of carrying out a first treatment on the surface of the The specific formula is defined as:

h ₀ ＝F ₀ (e _I-S )

word-level attention semantic-space constraint c obtained by using image features obtained in the previous stage and dynamic fusion _I-S Generating a high resolution image as input to a subsequent generator; the network determines that the current generation stage needs higher-level information or middle-low-level information to obtain

The image characteristic h obtained in the previous stage is obtained _k-1 Constraint information with the kth phase- >

Fusion, send to F _k In the module, image feature h is generated _k The method comprises the steps of carrying out a first treatment on the surface of the Finally, the image features are sent to a corresponding generator G _k A high resolution image is generated; the specific implementation formula is as follows:

7. The method for generating a text-to-multi-object image based on joint embedding of claim 6, wherein: to optimize the overall neural network model, generator penalty L is employed _G Sum discriminator loss L _D To train alternately; loss function L of generator _G And loss function L of the arbiter _D Is defined as follows:

L _G ＝L _advG +λ ₁ L _DAMSM L _D ＝L _advD

wherein L_advG Is defined as follows:

p _G representing the data distribution of the generated image, s representing the input text description, and D (-) representing the discriminators in the cascade generation type countermeasure network; wherein the first term ensures that the generated image is authentic and the second term ensures that the generated image matches the input text; loss L using DAMSM in step 2 _DAMSM Calculating the local similarity and the global similarity of the input text and the generated image, and ensuring that the generated image and the input text keep semantic consistency by minimizing the loss;

L _advD is defined as: