CN113239961A - Method for generating sequence images from text based on a generative adversarial network - Google Patents

Method for generating sequence images from text based on a generative adversarial network

Info

Publication number
CN113239961A
Authority
CN
China
Prior art keywords
image
text
training
initial
training text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110384686.9A
Other languages
Chinese (zh)
Other versions
CN113239961B (en)
Inventor
胡伏原
张玮琪
李林燕
徐峰磊
夏振平
顾敏明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Jiatu Intelligent Drawing Information Technology Co ltd
Suzhou University of Science and Technology
Original Assignee
Suzhou Jiatu Intelligent Drawing Information Technology Co ltd
Suzhou University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Jiatu Intelligent Drawing Information Technology Co ltd, Suzhou University of Science and Technology filed Critical Suzhou Jiatu Intelligent Drawing Information Technology Co ltd
Priority to CN202110384686.9A priority Critical patent/CN113239961B/en
Publication of CN113239961A publication Critical patent/CN113239961A/en
Application granted granted Critical
Publication of CN113239961B publication Critical patent/CN113239961B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to a method for generating sequence images from text based on a generative adversarial network, which comprises the following steps: constructing a training database, wherein the training database comprises training texts and original images, and training a generative adversarial network model by using the training texts and the original images; the generative adversarial network model comprises a hybrid generator and a discriminator, wherein the hybrid generator comprises a scene-graph-guided image generator and a sequence-conditioned image generator; and inputting the text to be processed into the trained generative adversarial network model, which generates and outputs an image corresponding to the text to be processed. The invention can generate visually realistic images matched with the text description, thereby avoiding disordered object layouts and improving the accuracy of the output images.

Description

Method for generating sequence images from text based on a generative adversarial network
Technical Field
The invention relates to the technical field of image generation, and in particular to a method for generating sequence images from text based on a generative adversarial network.
Background
Text-to-image generation is a popular research area in computer vision. It means that a user provides a text description and the system automatically generates one or more images whose content is consistent with that description. This greatly improves the flexibility and comprehensiveness of image information acquisition and has good development prospects and practical significance, for example: concept illustration in education, generation of illustrations for literature, visual creation in the arts, and so on.
Existing methods for generating images from text fall mainly into the following three categories:
Methods based on the variational autoencoder model: such a method builds a statistical model that maximizes a lower bound on the data likelihood in order to generate images; it is an unsupervised learning scheme in which the input data are compressed so that the output can reproduce them. The encoder encodes the data, mapping high-dimensional data to a low-dimensional representation; the decoder does the opposite, converting the low-dimensional data back into high-dimensional data and thereby reconstructing the input.
Methods based on the deep recurrent attention model: such a method uses a recurrent neural network together with an attention mechanism; at each step it focuses on one object to be generated, producing image patches sequentially and superimposing them into the final result. The encoder determines the distribution of the latent-variable space in order to capture the salient input information; the decoder receives samples drawn from the encoded distribution and uses them to condition its own distribution over images.
Methods based on generative adversarial networks: the generative adversarial network (GAN) is a deep learning model whose most distinctive feature, unlike traditional machine learning methods, is the introduction of an adversarial mechanism. Referring to fig. 1, the two adversarial parties are a generator G and a discriminator D. The generator learns the real data distribution: it takes as input random noise obeying a prior distribution and generates data similar to the real training samples. The discriminator is a binary classifier that estimates the probability that a sample comes from the training data rather than from the generated data, and decides from this output probability whether the input is a real image or a generated image.
However, the variational autoencoder approach essentially encodes an image into a mean and a standard deviation, uses these two parameters to randomly sample a latent variable from the resulting distribution, and then decodes that latent variable to reconstruct the image; in practical applications it cannot accurately generate high-resolution images. The deep recurrent attention approach introduces a recurrent neural network and an attention mechanism; although this can improve generation accuracy to some extent, the accuracy remains limited for texts containing complex scenes.
Compared with the variational autoencoder approach and the deep recurrent attention approach, the approach based on generative adversarial networks can overcome the defects of both, and has therefore been widely applied in image generation research in recent years. However, when the input text involves multiple objects and relationships, the context information of the text sequence is difficult to extract, the object layout of the generated image easily becomes disordered, and the details of the generated objects are insufficient, so the requirement of outputting high-accuracy images cannot be met.
Therefore, existing methods for generating images from text suffer from low accuracy of the output images and cannot meet practical requirements.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the inability of prior-art text-to-image methods to output images with high accuracy.
In order to solve this technical problem, the invention provides a method for generating sequence images from text based on a generative adversarial network, which comprises the following steps:
S1) constructing a training database, wherein the training database comprises training texts and original images corresponding to the training texts, and training a generative adversarial network model by using the training texts and the original images;
the generative adversarial network model comprises a hybrid generator and a discriminator, wherein the hybrid generator comprises a scene-graph-guided image generator and a sequence-conditioned image generator; the scene-graph-guided image generator processes the training text based on a scene graph generation model to output a first generated image, and the sequence-conditioned image generator processes the training text according to the context information of the training text to output a second generated image;
when the generative adversarial network model is trained, the scene-graph-guided image generator processes a training text to output a first generated image, the sequence-conditioned image generator processes the training text to output a second generated image, the first generated image and the second generated image are superimposed in the hybrid generator to produce an initial generated image, the initial generated image is output to the discriminator, the discriminator judges the similarity of the initial generated image relative to the original image, and the hybrid generator is optimized according to the discrimination result;
S2) inputting the text to be processed into the trained generative adversarial network model, which generates and outputs an image corresponding to the text to be processed.
In one embodiment of the invention, the hybrid generator generates at least one initial generated image from the training text;
the discriminator comprises an image discriminator, an object discriminator and a story discriminator;
the image discriminator is used for judging whether the initial generated image matches the content of the training text and for judging the sharpness of the initial generated image relative to the original image;
the object discriminator is used for judging whether any object in the initial generated image is missing relative to the original image;
the story discriminator is used for judging the consistency of objects across the several initial generated images corresponding to the whole training text.
In an embodiment of the present invention, in step S1), the method for processing the training text by the image generator guided by the scene graph and outputting the first generated image includes: each sentence in the training text is converted into a corresponding initial scene graph, the scene layout of the initial scene graph is calculated, and then the scene layout is input into a cascade refinement network to generate a corresponding first generated image.
In an embodiment of the present invention, the method for calculating the scene layout of the initial scene graph includes: processing the initial scene graph by using a graph convolution network to obtain an embedded vector of each object in the initial scene graph, predicting a segmentation mask and a boundary box of each object according to the embedded vector of each object, and then calculating scene layout by using the segmentation masks and the boundary boxes of all the objects.
In an embodiment of the present invention, when each sentence in the training text is converted into a corresponding initial scene graph, the objects, object relationships and object attributes of each sentence in the training text are extracted, and the initial scene graph of the sentence is then generated from the objects, object relationships and object attributes of that sentence.
In an embodiment of the present invention, the method for extracting the object, the object relationship, and the object attribute of each sentence in the training text comprises: the training text is parsed into a syntactic dependency tree using a dependency parser, which is then parsed to extract the objects, object relationships, and object attributes for each sentence.
In one embodiment of the present invention, in step S1), the method for processing the training text by the image generator based on the sequence condition to output the second generated image includes: the method comprises the steps that a story encoder is used for encoding the whole training text into embedded vectors and outputting the embedded vectors to a context encoder, the context encoder captures context information from the embedded vectors output by the story encoder and outputs the context information to a plurality of first image generators, each first image generator corresponds to one sentence in the training text, and each first image generator combines the context information and the information of each sentence in the training text to convert each sentence into the corresponding second generated image.
Compared with the prior art, the technical scheme of the invention has the following advantages:
the method for generating sequence images from text based on a generative adversarial network can effectively extract the context information of the text sequence, generate visually realistic images that match the given text description, avoid disordered object layouts, and improve the accuracy of the output images.
Drawings
In order that the present disclosure may be more readily and clearly understood, the following detailed description of the present disclosure is provided in connection with specific embodiments thereof and the accompanying drawings, in which,
FIG. 1 is a block diagram of a prior art architecture for generating a countermeasure network model;
FIG. 2 is a block diagram of the structure of the method for generating sequence images from text based on a generative adversarial network according to the present invention;
Detailed Description
The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.
Referring to fig. 2, a method for generating sequence images from text based on a generative adversarial network according to the present invention includes the following steps:
S1) constructing a training database, wherein the training database comprises training texts and original images corresponding to the training texts, and training the generative adversarial network model by using the training texts and the original images;
the generative adversarial network model comprises a hybrid generator and a discriminator, wherein the hybrid generator comprises a scene-graph-guided image generator and a sequence-conditioned image generator; the scene-graph-guided image generator processes the training text based on a scene graph generation model to output a first generated image, and the sequence-conditioned image generator processes the training text according to its context information to output a second generated image;
when the generative adversarial network model is trained, the scene-graph-guided image generator processes a training text to output a first generated image, the sequence-conditioned image generator processes the training text to output a second generated image, the first generated image and the second generated image are superimposed in the hybrid generator to produce an initial generated image, the initial generated image is output to the discriminator, the discriminator judges the similarity of the initial generated image relative to the original image, and the hybrid generator is optimized according to the discrimination result so that the images generated by the optimized hybrid generator finally reach a high similarity to the original images;
it is understood that a training text is composed of at least one sentence, and each of the initial generated images corresponds to one sentence in the training text, that is, when the initial generated images are generated by superimposing, the first generated image and the second generated image belonging to the same sentence are superimposed to obtain the initial generated image of the sentence.
S2) inputting the text to be processed into the trained generation confrontation network model, and generating and outputting an image corresponding to the text to be processed by the trained generation confrontation network model.
In the method, by adopting the mixing generator, the scene graph guided image generator can clearly express the information conveyed by the complex sentence with a plurality of objects, the sequence condition based image generator can express the information conveyed by the sentence on the basis of combining the context information, and through the mixing of the scene graph guided image generator and the sequence condition based image generator, the details of each object in the generated image are increased, the context information of the text can be effectively captured, and the context information of the text is matched with a plurality of objects related to the text, so that the accuracy of the object layout of the generated image is improved, and the requirement of outputting high-accuracy images is further realized.
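The text above does not pin down how the two per-sentence images are superimposed. Purely as a sketch of one possible fusion step, the merge could be written in PyTorch as follows (the module name HybridMerge, the 1×1-convolution blend and the tanh output range are assumptions of this illustration, not the patent's actual operator):

```python
import torch
import torch.nn as nn

class HybridMerge(nn.Module):
    """Fuse the scene-graph-guided image and the sequence-conditioned image
    generated for the same sentence into one initial generated image.
    The 1x1-convolution blend used here is an assumption for illustration;
    the patent only states that the two images are superimposed."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, img_sg: torch.Tensor, img_seq: torch.Tensor) -> torch.Tensor:
        # img_sg, img_seq: (B, 3, H, W) images produced for the same sentence
        stacked = torch.cat([img_sg, img_seq], dim=1)   # (B, 6, H, W)
        return torch.tanh(self.fuse(stacked))           # (B, 3, H, W) initial image

# Usage sketch: merge the per-sentence outputs of the two generators.
# merge = HybridMerge()
# initial_image = merge(first_generated_image, second_generated_image)
```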
For example, referring to fig. 2, a training text (paragraph) S is composed of a series of sentences S = [s1, s2, …, sT], where the length T may vary. The scene-graph-guided image generator produces one generated image for each sentence from its scene graph, and the sequence-conditioned image generator likewise produces one generated image for each sentence. The initial generated images obtained by merging the outputs of the two generators are denoted [I1, I2, …, IT]. During training, the original images are denoted p = [p1, p2, …, pT].
The generative adversarial network model is defined by the objective

min_G max_D V(D, G) = E_{x~Pdata(x)}[log D(x)] + E_{z~Pz(z)}[log(1 - D(G(z)))]

where G denotes the hybrid generator, D denotes the discriminator, E(·) denotes the expected value over the indicated distribution, Pdata(x) represents the distribution of real samples, and Pz(z) represents the noise distribution defined in the lower-dimensional space. D(x) is the output of the discriminator D, and G(z) is the mapping of the input noise z to data. During training, the discriminator is generally trained for several steps before the generator. In practice the discriminator is expected to perform well so that it can supervise the generator; if the discriminator performs poorly and judges generated (fake) data to be real, the overall result will be poor.
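To make the objective concrete, a minimal PyTorch sketch of one adversarial update under that formula is shown below (the generator and discriminator modules, the optimizers and the noise dimension are placeholders rather than the hybrid generator and the three discriminators of this application; the discriminator is assumed to output a probability in (0, 1)):

```python
import torch
import torch.nn as nn

def gan_training_step(generator, discriminator, real_images, noise_dim, opt_g, opt_d):
    """One adversarial update for the objective above.
    `generator`, `discriminator` and the optimizers are placeholder modules;
    the discriminator is assumed to end in a sigmoid so BCELoss applies."""
    bce = nn.BCELoss()
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # --- discriminator step: maximize log D(x) + log(1 - D(G(z))) ---
    z = torch.randn(batch, noise_dim)
    fake_images = generator(z).detach()              # block gradients into G
    d_loss = bce(discriminator(real_images), real_labels) \
           + bce(discriminator(fake_images), fake_labels)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # --- generator step: fool D (non-saturating form of min log(1 - D(G(z)))) ---
    z = torch.randn(batch, noise_dim)
    g_loss = bce(discriminator(generator(z)), real_labels)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```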
In one embodiment, the hybrid generator generates at least one initial generated image from the training text, wherein each initial generated image corresponds to one sentence in the training text;
the discriminator comprises an image discriminator, an object discriminator and a story discriminator;
the image discriminator is used for judging whether the initial generated image matches the content of the training text and for judging the sharpness of the initial generated image relative to the original image;
the object discriminator is used for judging whether any object in the initial generated image is missing relative to the original image;
the story discriminator is used for judging the consistency of objects across the several initial generated images corresponding to the whole training text.
For example, suppose the training text is the following paragraph: "There is a tree under the sun. The girl arrived with the owl on her shoulder. The boy sat by the tree and talked to the girl." The paragraph comprises three sentences, so the hybrid generator generates three initial generated images, one per sentence, each obtained by superimposing the first generated image and the second generated image produced for that sentence. The training text contains objects such as "tree", "sun", "the girl", "the owl" and "the boy". For the initial generated image corresponding to the first sentence, the image discriminator judges whether the objects "tree" and "sun" and the relationship "under" are embodied in the image, and discriminates the difference in sharpness between the initial generated image and the original image, for example when the original image has sharp outlines but the generated image is blurred and its shapes cannot be made out. The object discriminator judges whether an object in the initial generated image is missing relative to the original image, for example when the object "girl" exists in the real image but is missing from the initial generated image. The story discriminator judges the consistency of objects across the several initial generated images corresponding to the whole training text, for example when the object "the girl" appears in the initial generated images for both the second and third sentences but her appearance differs between them, i.e. the same object is not consistent.
The three discriminators thus evaluate the initial generated image from different angles and can judge its similarity to the original image more accurately, which improves discrimination precision; optimizing the hybrid generator according to the combined discrimination result is therefore more effective.
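As an illustration of how the three judgements could be combined during training, the sketch below accumulates the scores of the image, object and story discriminators into a single adversarial loss (the module interfaces, input shapes and equal weighting are assumptions; the patent specifies only what each discriminator judges):

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def three_discriminator_loss(d_img, d_obj, d_story,
                             images, text_embeds, object_crops, story_embed,
                             is_real: bool):
    """Accumulate the three judgements into one adversarial loss.
    `d_img`, `d_obj` and `d_story` are placeholder modules returning raw scores:
      d_img(image, sentence_embedding) -> text/image matching and sharpness score,
      d_obj(object_crop)               -> per-object presence score,
      d_story(image_sequence, story)   -> cross-image consistency score.
    Their architectures and the equal weighting used here are assumptions."""
    def target_like(score):
        return torch.ones_like(score) if is_real else torch.zeros_like(score)

    loss = 0.0
    for img, txt in zip(images, text_embeds):          # one image per sentence
        score = d_img(img.unsqueeze(0), txt.unsqueeze(0))
        loss = loss + bce(score, target_like(score))
    for crop in object_crops:                          # cropped object regions
        score = d_obj(crop.unsqueeze(0))
        loss = loss + bce(score, target_like(score))
    score = d_story(torch.stack(list(images)), story_embed)  # whole sequence at once
    return loss + bce(score, target_like(score))
```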
In one embodiment, in step S1), the method for processing the training text by the image generator guided by the scene graph and outputting the first generated image includes: each sentence in the training text is converted into a corresponding initial scene graph, the scene layout of the initial scene graph is calculated, and then the scene layout is input into a Cascade Refinement Network (CRN) to generate a corresponding first generated image.
A scene graph is a data structure in which each node represents an object and the edges connecting objects represent the relationships between them.
In one embodiment, the method for calculating the scene layout of the initial scene graph comprises the following steps: processing the initial scene graph by using a graph convolution network to obtain an embedded vector of each object in the initial scene graph, predicting a segmentation mask and a boundary box of each object according to the embedded vector of each object, and then calculating scene layout by using the segmentation masks and the boundary boxes of all the objects.
For example, a graph convolution network composed of multiple graph convolution layers may be used for the processing. Given an initial scene graph, each node and edge is associated with a vector of dimension Din, and the layer computes a new vector of dimension Dout for each node and edge. Each output vector is a function of the neighbourhood of its corresponding input, so each graph convolution layer propagates information along the edges of the graph. A graph convolution layer applies the same function to all edges of the graph, allowing a single layer to operate on graphs of arbitrary shape. A series of graph convolution layers processes the input scene graph and yields an embedded vector for each object; this embedded vector aggregates the information of all objects and relations in the graph and is fed into the object layout network, which predicts the object's segmentation mask and bounding box.
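A compact PyTorch sketch of this layout step is given below: object embeddings from the graph convolution network are turned into a soft mask and a bounding box per object, and the objects are then composed into a layout tensor for the cascade refinement network (head sizes, box parameterisation and the pasting scheme are illustrative assumptions, not the patent's exact object layout network):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectLayoutNet(nn.Module):
    """From each object's embedding vector, predict a soft segmentation mask
    and a normalised bounding box, then compose all objects into a coarse
    scene layout tensor for the cascade refinement network."""
    def __init__(self, embed_dim: int = 128, mask_size: int = 16, layout_size: int = 64):
        super().__init__()
        self.mask_head = nn.Linear(embed_dim, mask_size * mask_size)
        self.box_head = nn.Linear(embed_dim, 4)          # (x0, y0, x1, y1) in [0, 1]
        self.mask_size = mask_size
        self.layout_size = layout_size

    def forward(self, obj_embeddings: torch.Tensor) -> torch.Tensor:
        # obj_embeddings: (num_objects, embed_dim) vectors from the graph convolution network
        n, d = obj_embeddings.shape
        s = self.layout_size
        masks = torch.sigmoid(self.mask_head(obj_embeddings)).view(n, 1, self.mask_size, self.mask_size)
        boxes = torch.sigmoid(self.box_head(obj_embeddings))            # normalised boxes
        layout = obj_embeddings.new_zeros(1, d, s, s)
        for i in range(n):
            x0, y0 = int(boxes[i, 0] * s), int(boxes[i, 1] * s)
            w = min(max(int(boxes[i, 2] * s) - x0, 1), s - x0)
            h = min(max(int(boxes[i, 3] * s) - y0, 1), s - y0)
            m = F.interpolate(masks[i:i + 1], size=(h, w), mode="bilinear", align_corners=False)
            # paste the object's embedding, weighted by its resized mask, into its box
            layout[:, :, y0:y0 + h, x0:x0 + w] += obj_embeddings[i].view(1, d, 1, 1) * m
        return layout   # passed to the cascade refinement network to render the image
```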
In one embodiment, when each sentence in the training text is converted into its corresponding initial scene graph, the objects, object relationships and object attributes of that sentence are extracted, and the initial scene graph of the sentence is then generated from these objects, object relationships and object attributes.
Further, the method for extracting the objects, object relationships and object attributes of each sentence in the training text is as follows: the training text is parsed into a syntactic dependency tree using a dependency parser, and the dependency tree is then analysed according to a set of linguistic rules (nine simple linguistic rules) to extract the objects, object relationships and object attributes of each sentence. The object relationship mainly refers to the positional relationship between objects, and the object attribute mainly refers to categories such as material, for example the material of a sphere.
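The parser and the nine rules are not spelled out in the text. Purely to illustrate the idea, two such rules could be implemented with the spaCy dependency parser as follows (spaCy, the model name and both rules are assumptions of this sketch):

```python
import spacy

# Illustration only: the patent says a dependency parser plus nine simple
# linguistic rules are used, but names neither the parser nor the rules.
nlp = spacy.load("en_core_web_sm")

def parse_sentence(sentence: str):
    doc = nlp(sentence)
    # objects: noun tokens in the sentence
    objects = [tok.text for tok in doc if tok.pos_ in ("NOUN", "PROPN")]
    # attribute rule: an adjectival modifier attaches to its head noun
    attributes = [(tok.head.text, tok.text) for tok in doc if tok.dep_ == "amod"]
    # relation rule: noun --prep--> pobj yields a positional relation triple
    relations = []
    for tok in doc:
        if tok.dep_ == "prep" and tok.head.pos_ in ("NOUN", "PROPN"):
            for child in tok.children:
                if child.dep_ == "pobj":
                    relations.append((tok.head.text, tok.text, child.text))
    return objects, relations, attributes

# parse_sentence("There is a tree under the sun.") might yield
# objects ["tree", "sun"] and the relation ("tree", "under", "sun"),
# depending on how the parser attaches the prepositional phrase.
```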
In one embodiment, in step S1), the method for processing the training text by the image generator based on the sequence condition to output the second generated image includes: the method comprises the steps that a story encoder is used for encoding the whole training text into embedded vectors and outputting the embedded vectors to a context encoder, the context encoder captures context information from the embedded vectors output by the story encoder and outputs the context information to a plurality of first image generators, each first image generator corresponds to one sentence in the training text, and each first image generator combines the context information and the information of each sentence in the training text to convert each sentence into the corresponding second generated image.
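A rough PyTorch sketch of this sequence-conditioned branch is shown below (the mean-pooled story encoder, the single-layer GRU context encoder, the fully connected image head and all dimensions are simplifying assumptions; the actual story and context encoders are not detailed in this text):

```python
import torch
import torch.nn as nn

class SequenceConditionedGenerator(nn.Module):
    """Sketch of the sequence-conditioned branch: a story encoder embeds the
    whole paragraph, a recurrent context encoder propagates context over the
    sentences, and a per-sentence image generator turns (context, sentence)
    into the second generated image."""
    def __init__(self, sent_dim=128, ctx_dim=128, img_channels=3, img_size=64):
        super().__init__()
        self.story_encoder = nn.Linear(sent_dim, ctx_dim)       # embeds the paragraph mean
        self.context_encoder = nn.GRU(sent_dim, ctx_dim, batch_first=True)
        self.image_generator = nn.Sequential(                   # shared over sentences
            nn.Linear(ctx_dim + sent_dim, img_channels * img_size * img_size),
            nn.Tanh(),
        )
        self.img_shape = (img_channels, img_size, img_size)

    def forward(self, sentence_embeds: torch.Tensor) -> torch.Tensor:
        # sentence_embeds: (B, T, sent_dim), one embedding per sentence of the paragraph
        story = self.story_encoder(sentence_embeds.mean(dim=1))      # (B, ctx_dim)
        h0 = story.unsqueeze(0)                                      # init the GRU with the story code
        ctx, _ = self.context_encoder(sentence_embeds, h0)           # (B, T, ctx_dim)
        feats = torch.cat([ctx, sentence_embeds], dim=-1)            # combine context + sentence
        imgs = self.image_generator(feats)                           # (B, T, C*H*W)
        return imgs.view(*sentence_embeds.shape[:2], *self.img_shape)  # (B, T, C, H, W)
```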
The verification results for the method of generating sequence images from text based on a generative adversarial network of the above embodiment are as follows:
The deep learning framework PyTorch is adopted; the experimental environment is an Ubuntu 14.04 operating system, and four NVIDIA 1080Ti graphics processing units (GPUs) are used to accelerate computation and obtain the trained model.
Models are trained on the CLEVR-SV and CoDraw-SV data sets to generate 64 × 64 images, which are quantitatively compared with ImageGAN and StoryGAN.
the comparison of the different methods on the CLEVR-SV data set is described as follows: the input is the attribute and relative position of the current object, given by two real numbers representing its coordinates. For example, an image is generated from the description "blue, small, metal, cylinder, (-2.3, 2.6)". All objects are described in the same way. Given the description, the appearance of the generated objects should differ very little from the ground-truth (reference image) case, and their relative positions should be similar.
The comparison shows that the ImageGAN algorithm cannot maintain consistency with the input text, and when the number of objects increases, attributes are confused between objects, for example a sphere is given a wrong "rubber" attribute. The StoryGAN algorithm can solve the paragraph-consistency problem through its story encoder; however, it cannot correctly generate the image when judging the relationships between objects and their relative positions. Enlarged comparisons of the images generated by the different methods show that, compared with the ground truth (reference image), ImageGAN has difficulty generating a coherent image: for example, the silver sphere is changed from "metal" to "rubber" relative to the input condition, the positional relationship between the purple cylinder and the blue cube differs greatly, and the size of the yellow cube is inaccurate. StoryGAN can produce more coherent images, but when multiple objects are generated the object layout is cluttered. Compared with the ImageGAN and StoryGAN algorithms, the text-to-image method of the present application, SGGAN, generates smoother images that differ less from the reference images, captures the relationships between objects more accurately, and produces sequence images of higher quality.
The comparison of the different methods on the CoDraw-SV data set is described below: ImageGAN cannot generate a consistent image sequence, and the appearance of a character is not consistent across the image sequence. In contrast, the SGGAN image generation method of the present application achieves higher image quality and better captures the relationships among multiple objects during image generation.
The method for generating sequence images from text based on a generative adversarial network according to this embodiment can be applied in many ways, such as generating customized images and videos according to a user's personal preferences, helping artists or graphic designers to illustrate text efficiently, generating images of different styles for the same text, and so on. The method is based on a scene-graph generative adversarial network structure, can effectively extract the context information of the text sequence, effectively solves the problem of disordered object layout in text-generated images, and can generate visually realistic images that match the given text description, making the generated images more diverse and accurate.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. And obvious variations or modifications of the invention may be made without departing from the spirit or scope of the invention.

Claims (7)

1. A method for generating sequence images from text based on a generative adversarial network, comprising the steps of:
S1) constructing a training database, wherein the training database comprises training texts and original images corresponding to the training texts, and training a generative adversarial network model by using the training texts and the original images;
the generative adversarial network model comprises a hybrid generator and a discriminator, wherein the hybrid generator comprises a scene-graph-guided image generator and a sequence-conditioned image generator; the scene-graph-guided image generator processes the training text based on a scene graph generation model to output a first generated image, and the sequence-conditioned image generator processes the training text according to the context information of the training text to output a second generated image;
when the generative adversarial network model is trained, the scene-graph-guided image generator processes a training text to output a first generated image, the sequence-conditioned image generator processes the training text to output a second generated image, the first generated image and the second generated image are superimposed in the hybrid generator to produce an initial generated image, the initial generated image is output to the discriminator, the discriminator judges the similarity of the initial generated image relative to the original image, and the hybrid generator is optimized according to the discrimination result;
S2) inputting the text to be processed into the trained generative adversarial network model, which generates and outputs an image corresponding to the text to be processed.
2. The method for generating sequence images from text based on a generative adversarial network of claim 1, wherein
the hybrid generator generates at least one initial generated image from the training text;
the discriminator comprises an image discriminator, an object discriminator and a story discriminator;
the image discriminator is used for judging whether the initial generated image matches the content of the training text and for judging the sharpness of the initial generated image relative to the original image;
the object discriminator is used for judging whether any object in the initial generated image is missing relative to the original image;
the story discriminator is used for judging the consistency of objects across the several initial generated images corresponding to the whole training text.
3. The method of claim 1, wherein in step S1), the method for processing the training text by the scene graph-guided image generator to output the first generated image comprises: each sentence in the training text is converted into a corresponding initial scene graph, the scene layout of the initial scene graph is calculated, and then the scene layout is input into a cascade refinement network to generate a corresponding first generated image.
4. The method of claim 3, wherein the method of calculating the scene layout of the initial scene graph comprises: processing the initial scene graph by using a graph convolution network to obtain an embedded vector of each object in the initial scene graph, predicting a segmentation mask and a boundary box of each object according to the embedded vector of each object, and then calculating scene layout by using the segmentation masks and the boundary boxes of all the objects.
5. The method of claim 3, wherein when each sentence in the training text is converted into the corresponding initial scene graph, the objects, object relationships and object attributes of each sentence in the training text are extracted, and the initial scene graph of the sentence is then generated from the objects, object relationships and object attributes of that sentence.
6. The method for generating sequence images from text based on a generative adversarial network of claim 5, wherein the method for extracting the objects, object relationships and object attributes of each sentence in the training text is: the training text is parsed into a syntactic dependency tree using a dependency parser, and the dependency tree is then analysed to extract the objects, object relationships and object attributes of each sentence.
7. The method of claim 1, wherein in step S1), the method for processing the training text by the image generator based on the sequence condition to output the second generated image comprises: the method comprises the steps that a story encoder is used for encoding the whole training text into embedded vectors and outputting the embedded vectors to a context encoder, the context encoder captures context information from the embedded vectors output by the story encoder and outputs the context information to a plurality of first image generators, each first image generator corresponds to one sentence in the training text, and each first image generator combines the context information and the information of each sentence in the training text to convert each sentence into the corresponding second generated image.
CN202110384686.9A 2021-04-09 2021-04-09 Method for generating sequence image based on text of generating countermeasure network Active CN113239961B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110384686.9A CN113239961B (en) 2021-04-09 2021-04-09 Method for generating sequence image based on text of generating countermeasure network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110384686.9A CN113239961B (en) 2021-04-09 2021-04-09 Method for generating sequence image based on text of generating countermeasure network

Publications (2)

Publication Number Publication Date
CN113239961A true CN113239961A (en) 2021-08-10
CN113239961B CN113239961B (en) 2023-10-20

Family

ID=77127888

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110384686.9A Active CN113239961B (en) 2021-04-09 2021-04-09 Method for generating sequence image based on text of generating countermeasure network

Country Status (1)

Country Link
CN (1) CN113239961B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113570695A (en) * 2021-09-27 2021-10-29 清华大学 Image generation method and device and electronic equipment
CN113688927A (en) * 2021-08-31 2021-11-23 中国平安人寿保险股份有限公司 Picture sample generation method and device, computer equipment and storage medium
CN116246288A (en) * 2023-05-10 2023-06-09 浪潮电子信息产业股份有限公司 Text coding method, model training method, model matching method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330364A (en) * 2017-05-27 2017-11-07 上海交通大学 A kind of people counting method and system based on cGAN networks
CN109410239A (en) * 2018-11-07 2019-03-01 南京大学 A kind of text image super resolution ratio reconstruction method generating confrontation network based on condition
US20200097554A1 (en) * 2018-09-26 2020-03-26 Huawei Technologies Co., Ltd. Systems and methods for multilingual text generation field
US20200302231A1 (en) * 2019-03-22 2020-09-24 Royal Bank Of Canada System and method for generation of unseen composite data objects

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330364A (en) * 2017-05-27 2017-11-07 上海交通大学 A kind of people counting method and system based on cGAN networks
US20200097554A1 (en) * 2018-09-26 2020-03-26 Huawei Technologies Co., Ltd. Systems and methods for multilingual text generation field
CN109410239A (en) * 2018-11-07 2019-03-01 南京大学 A kind of text image super resolution ratio reconstruction method generating confrontation network based on condition
US20200302231A1 (en) * 2019-03-22 2020-09-24 Royal Bank Of Canada System and method for generation of unseen composite data objects

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
兰红; 刘秦邑: "Scene graph to image generation model based on graph attention network", Journal of Image and Graphics (中国图象图形学报), no. 08 *
林胜; 巩名轶; 牟文芊; 董伯男: "Crop disease and pest image augmentation based on generative adversarial networks", Electronic Technology & Software Engineering (电子技术与软件工程), no. 03 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688927A (en) * 2021-08-31 2021-11-23 中国平安人寿保险股份有限公司 Picture sample generation method and device, computer equipment and storage medium
CN113570695A (en) * 2021-09-27 2021-10-29 清华大学 Image generation method and device and electronic equipment
CN116246288A (en) * 2023-05-10 2023-06-09 浪潮电子信息产业股份有限公司 Text coding method, model training method, model matching method and device
CN116246288B (en) * 2023-05-10 2023-08-04 浪潮电子信息产业股份有限公司 Text coding method, model training method, model matching method and device

Also Published As

Publication number Publication date
CN113239961B (en) 2023-10-20

Similar Documents

Publication Publication Date Title
CN111858954B (en) Task-oriented text-generated image network model
Zhang et al. Text-to-image diffusion models in generative ai: A survey
CN110111236B (en) Multi-target sketch image generation method based on progressive confrontation generation network
CN113239961B (en) Method for generating sequence image based on text of generating countermeasure network
CN111967533B (en) Sketch image translation method based on scene recognition
CN105868706A (en) Method for identifying 3D model based on sparse coding
Galatolo et al. TeTIm-Eval: a novel curated evaluation data set for comparing text-to-image models
CN117011875A (en) Method, device, equipment, medium and program product for generating multimedia page
Song et al. Exploring explicit and implicit visual relationships for image captioning
Mei et al. Vision and language: from visual perception to content creation
CN115631285B (en) Face rendering method, device, equipment and storage medium based on unified driving
Wang et al. Generative model with coordinate metric learning for object recognition based on 3D models
Han et al. Attribute-sentiment-guided summarization of user opinions from online reviews
Jia et al. Facial expression synthesis based on motion patterns learned from face database
KR100544684B1 (en) A feature-based approach to facial expression cloning method
Ye et al. HAO‐CNN: Filament‐aware hair reconstruction based on volumetric vector fields
Manushree et al. XCI-Sketch: Extraction of Color Information from Images for Generation of Colored Outlines and Sketches
Lan et al. LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation
Song et al. Virtual Human Talking-Head Generation
CN118014086B (en) Data processing method, device, equipment, storage medium and product
Gao et al. Generating face images from fine-grained sketches based on GAN with global-local joint discriminator
조시현 Interactive Storyboarding System Leveraging Large-Scale Pre-trained Model
Mitsouras et al. U-Sketch: An Efficient Approach for Sketch to Image Diffusion Models
Wang et al. A new Gaussian mixture conditional random field model for indoor image labeling
CN106097373A (en) A kind of smiling face's synthetic method based on branch's formula sparse component analysis model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant