CN113239961B - Method for generating sequence images from text based on a generative adversarial network - Google Patents
- Publication number
- CN113239961B (application CN202110384686.9A)
- Authority
- CN
- China
- Prior art keywords
- image
- text
- training
- generated
- initial
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Evolutionary Biology (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Processing Or Creating Images (AREA)
Abstract
The application relates to a method for generating sequence images from text based on a generative adversarial network, comprising: constructing a training database, wherein the training database comprises training texts and original images, and training a generative adversarial network model with the training texts and the original images; the generative adversarial network model includes a hybrid generator and a discriminator, the hybrid generator comprising a scene-graph-guided image generator and a sequence-conditional image generator; and inputting the text to be processed into the trained generative adversarial network model, which generates and outputs an image corresponding to the text to be processed. The application can generate visually realistic images matching the text description, avoids confused object layouts, and improves the accuracy of the output image.
Description
Technical Field
The application relates to the technical field of image generation, in particular to a method for generating sequence images from text based on a generative adversarial network.
Background
Text-to-image generation is a popular research area in computer vision. It refers to the task where, given a textual description provided by a user, the system automatically generates one or more images whose content matches the description. This greatly improves the flexibility and comprehensiveness of image information acquisition and has good development prospects and practical significance, for example: concept illustration in the education field, illustration generation in the literature field, visual creation in the art field, and so on.
The existing methods of generating images from text mainly fall into the following three categories:
Methods based on the variational auto-encoder model: this method models the data statistically by maximizing the likelihood of the data, and realizes reproduction of the input at the output through an unsupervised learning scheme that compresses the input data. It consists of an encoder and a decoder: the encoder encodes the data, mapping high-dimensional data to low-dimensional data; the decoder, conversely, converts the low-dimensional data back into high-dimensional data, enabling reconstruction of the input.
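As an illustration of this encoder/decoder structure, the following is a minimal PyTorch sketch of a variational auto-encoder; all layer sizes and names (TinyVAE, x_dim, z_dim) are illustrative assumptions rather than any particular published model:

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Minimal VAE sketch: the encoder maps the input to a mean and a
    log-variance, a latent vector is sampled via the reparameterisation
    trick, and the decoder reconstructs the input from that latent."""
    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)       # mean of q(z|x)
        self.logvar = nn.Linear(256, z_dim)   # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # sample z
        return self.dec(z), mu, logvar

def vae_loss(x, x_rec, mu, logvar):
    # reconstruction term plus KL divergence to the standard-normal prior
    rec = nn.functional.binary_cross_entropy(x_rec, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```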
Methods based on the deep recurrent attention model: this method uses a recurrent neural network and, through an attention mechanism, focuses on the object being generated at each step, generating image patches sequentially and superimposing them into the final result. The encoder determines the distribution of the latent variable space to capture the salient input information; the decoder receives samples drawn from the encoded distribution and uses them to condition its own distribution over images.
Methods based on the generative adversarial network: the generative adversarial network (Generative Adversarial Network, GAN) is a deep learning model whose distinguishing feature, unlike traditional machine learning methods, is the introduction of an adversarial mechanism. Referring to fig. 1, the two adversarial parties are a generator G and a discriminator D. The generator learns the real data distribution, takes as input random noise conforming to a prior distribution, and generates data resembling real training samples; the discriminator is a classifier that estimates the probability that a sample comes from the training data rather than from the generator, and judges from the output probability value whether the input object is a real image or a generated image.
The method based on the variational auto-encoder model essentially reduces the image to a mean and a standard deviation, samples latent variables from the distribution defined by these two parameters, and decodes and reconstructs those latent variables; in practical applications this method cannot accurately generate high-resolution images. The method based on the deep recurrent attention model introduces a recurrent neural network and an attention mechanism and can improve the accuracy of the generated images to a certain extent, but its accuracy remains limited for texts describing complex scenes.
Compared with the "method based on the variational auto-encoder model" and the "method based on the deep recurrent attention model", the "method based on the generative adversarial network" can overcome the drawbacks of both, and has therefore been widely used in image generation research in recent years. However, when the input text involves multiple objects and relations, the context information of the text sequence is difficult to extract, the object layout of the generated image easily becomes confused, the details of generated objects are insufficient, and the requirement for outputting high-accuracy images cannot be met.
Therefore, existing text-to-image methods suffer from low accuracy of the output image and cannot meet practical needs.
Disclosure of Invention
Therefore, the technical problem to be solved by the application is that prior-art methods of generating images from text cannot output high-accuracy images.
To solve this technical problem, the application provides a method for generating sequence images from text based on a generative adversarial network, comprising the following steps:
S1) constructing a training database, wherein the training database comprises training texts and original images corresponding to the training texts, and training a generative adversarial network model with the training texts and the original images;
the generative adversarial network model includes a hybrid generator and a discriminator, the hybrid generator comprising a scene-graph-guided image generator and a sequence-conditional image generator; the scene-graph-guided image generator outputs a first generated image by processing the training text with a scene-graph generation model, and the sequence-conditional image generator outputs a second generated image by processing the training text according to its context information;
when training the generative adversarial network model, the scene-graph-guided image generator processes the training text to output a first generated image, the sequence-conditional image generator processes the training text to output a second generated image, the first generated image and the second generated image are superimposed in the hybrid generator to produce an initial generated image, the initial generated image is output to the discriminator, the discriminator judges the similarity of the initial generated image to the original image, and the hybrid generator is optimized according to the judgment result;
S2) inputting the text to be processed into the trained generative adversarial network model, which generates and outputs an image corresponding to the text to be processed.
In one embodiment of the application, the hybrid generator generates at least one initial generated image from the training text;
the discriminator includes an image discriminator, an object discriminator, and a story discriminator;
the image discriminator judges whether the initial generated image matches the content of the training text, and judges the sharpness of the initial generated image relative to the original image;
the object discriminator judges whether any object in the initial generated image is missing relative to the original image;
the story discriminator judges the consistency of objects across the multiple initial generated images corresponding to the whole training text.
In one embodiment of the present application, in the step S1), the scene-graph-guided image generator outputs the first generated image as follows: each sentence in the training text is converted into a corresponding initial scene graph, the scene layout of the initial scene graph is computed, and the scene layout is then input into a cascaded refinement network to generate the corresponding first generated image.
In one embodiment of the application, the scene layout of the initial scene graph is computed as follows: the initial scene graph is processed with a graph convolutional network to obtain an embedding vector for each object in the initial scene graph, a segmentation mask and a bounding box are predicted for each object from its embedding vector, and the scene layout is computed from the segmentation masks and bounding boxes of all objects.
In one embodiment of the present application, when each sentence in the training text is converted into a corresponding initial scene graph, the objects, object relations and object attributes of each sentence in the training text are extracted first, and the initial scene graph of the sentence is then generated from the objects, object relations and object attributes of that sentence.
In one embodiment of the present application, the objects, object relations and object attributes of each sentence in the training text are extracted as follows: the training text is parsed into a syntactic dependency tree with a dependency parser, and the dependency tree is then analysed to extract the objects, object relations and object attributes of each sentence.
In one embodiment of the present application, in the step S1), the sequence-conditional image generator outputs the second generated image as follows: a story encoder encodes the whole training text into an embedding vector and outputs it to a context encoder; the context encoder captures context information from the embedding vector output by the story encoder and outputs it to a plurality of first image generators, each corresponding to one sentence of the training text; each first image generator converts its sentence into a corresponding second generated image by combining the context information with the information of that sentence.
Compared with the prior art, the technical scheme of the application has the following advantages:
the method for generating the sequence image based on the text of the generation countermeasure network can effectively extract the context information of the text sequence, generate the visually real image matched with the given text description, avoid the problem of disordered object layout and improve the accuracy of the output image.
Drawings
In order that the application may be more readily understood, a more particular description follows with reference to the specific embodiments illustrated in the appended drawings, in which:
FIG. 1 is a block diagram of a prior-art generative adversarial network model;
FIG. 2 is a block diagram of the method of generating sequence images from text based on a generative adversarial network according to the present application.
Detailed Description
The present application will be further described below with reference to the accompanying drawings and specific examples, which are not intended to be limiting, so that those skilled in the art may better understand and practice it.
Referring to fig. 2, the method of generating sequence images from text based on a generative adversarial network of the present application includes the following steps:
S1) constructing a training database, wherein the training database comprises training texts and original images corresponding to the training texts, and training a generative adversarial network model with the training texts and the original images;
the generative adversarial network model includes a hybrid generator and a discriminator, the hybrid generator comprising a scene-graph-guided image generator and a sequence-conditional image generator; the scene-graph-guided image generator outputs a first generated image by processing the training text with a generation model based on a scene graph, and the sequence-conditional image generator outputs a second generated image by processing the training text according to its context information;
when training the generative adversarial network model, the scene-graph-guided image generator processes the training text to output a first generated image, the sequence-conditional image generator processes the training text to output a second generated image, the first generated image and the second generated image are superimposed in the hybrid generator to produce an initial generated image, the initial generated image is output to the discriminator, the discriminator judges the similarity of the initial generated image to the original image, and the hybrid generator is optimized according to the judgment result, so that the images generated by the optimized hybrid generator finally reach a high similarity to the original images;
it should be understood that a training text consists of at least one sentence, and each initial generated image corresponds to one sentence of the training text; that is, during superposition, the first generated image and the second generated image belonging to the same sentence are superimposed to obtain the initial generated image of that sentence.
S2) inputting the text to be processed into the trained generative adversarial network model, which generates and outputs an image corresponding to the text to be processed.
In this method, a hybrid generator is adopted: the scene-graph-guided image generator can clearly express the information conveyed by a complex sentence containing several objects, while the sequence-conditional image generator expresses the information conveyed by each sentence in combination with its context. Mixing the two increases the detail of each object in the generated image and effectively captures the context information of the text, matching it with the multiple objects the text involves, thereby improving the accuracy of the object layout of the generated image and meeting the requirement for high-accuracy output images.
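The patent does not specify the exact superposition operator used to merge the two generators' outputs; as one simple possibility, a per-sentence weighted sum could look like the following sketch, where the blending weight alpha is an assumption:

```python
import torch

def blend_images(img_sg: torch.Tensor, img_seq: torch.Tensor,
                 alpha: float = 0.5) -> torch.Tensor:
    """Superimpose the scene-graph-guided image and the sequence-conditional
    image of one sentence into that sentence's initial generated image.
    The weighted sum is an assumed stand-in for the unspecified operator."""
    return alpha * img_sg + (1.0 - alpha) * img_seq

# one initial generated image per sentence of the paragraph:
# initial_imgs = [blend_images(a, b) for a, b in zip(sg_imgs, seq_imgs)]
```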
For example, referring to fig. 2, the training text is a paragraph S formed of a series of sentences, S = [s_1, s_2, ..., s_T], where the length T may vary. In the scene-graph-guided image generator, the image generated from the scene graph of each sentence is denoted Î_t; in the sequence-conditional image generator, the image generated for each sentence is denoted Ĩ_t. The initial generated images obtained by combining the outputs of the two generators are denoted [I_1, I_2, ..., I_T]. During training, the original images are denoted p = [p_1, p_2, ..., p_T].
The generative adversarial network model is defined as:

$$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim P_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))]$$

where G denotes the hybrid generator, D denotes the discriminator, E denotes the expected value under the indicated distribution, P_data(x) denotes the distribution of real samples, and P_z(z) denotes the noise distribution defined in the low-dimensional space. D(x) is the output of the discriminator D, and G(z) is the mapping of the input noise z to data. During training, the discriminator is generally trained several times for each training step of the generator: in practice one wants the discriminator to be the stronger party so that it can supervise the effect of the generator, for if the discriminator is weak and judges generated fake data to be real, the overall effect deteriorates.
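A minimal PyTorch sketch of this min-max training schedule, with the discriminator updated several times per generator update as recommended above, might look as follows; the networks G and D, their optimisers, and the d_steps value are assumptions for illustration, and D is assumed to end in a sigmoid so that BCELoss applies:

```python
import torch

bce = torch.nn.BCELoss()

def train_step(G, D, opt_G, opt_D, x, z, d_steps=2):
    """One adversarial round: several discriminator updates, then one
    generator update, on real images x and prior noise z."""
    real = torch.ones(x.size(0), 1)
    fake = torch.zeros(x.size(0), 1)
    for _ in range(d_steps):                     # D is trained first and more often
        opt_D.zero_grad()
        loss_D = bce(D(x), real) + bce(D(G(z).detach()), fake)
        loss_D.backward()
        opt_D.step()
    opt_G.zero_grad()                            # G tries to make D output "real"
    loss_G = bce(D(G(z)), real)
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```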
In one embodiment, the hybrid generator generates at least one initial generated image from the training text, each initial generated image corresponding to one sentence of the training text;
the discriminator includes an image discriminator, an object discriminator, and a story discriminator;
the image discriminator is used for discriminating whether the initial generated image is matched with the content of the training text or not and discriminating the definition of the initial generated image relative to the original image;
the object discriminator is used for discriminating whether the object in the initial generated image has a deletion relative to the original image;
the story discriminator is used for discriminating the consistency of the objects in the plurality of initial generated images corresponding to the whole training text.
For example, suppose the training text is the paragraph: "There is a tent under the sun. The girl approached with the owl on her shoulder. The boy sat by the tree and talked to the girl." The paragraph comprises three sentences, so the hybrid generator produces three initial generated images, one per sentence, each formed by superimposing the first generated image and the second generated image of the corresponding sentence. The training text includes objects such as "tent", "sun", "tree", "the girl" and "the boy". For the initial generated image corresponding to the first sentence, the image discriminator judges whether "tent", "sun" and the object relation "under" are represented in it, and identifies the difference in sharpness between the initial generated image and the original image; for example, the original image is clear with distinct contours while the generated image is blurred and its shapes cannot be distinguished. The object discriminator judges whether an object in the initial generated image is missing relative to the original image; for example, the object "the girl" exists in the real image but is missing from the initial generated image. The story discriminator judges the consistency of objects across the initial generated images of the whole training text; for example, the initial generated images corresponding to the second and third sentences both contain the object "the girl", but her appearance differs between the two images, i.e. the same object has not remained consistent.
the three discriminators can discriminate the initial generated image from different angles, so that the similarity of the initial generated image relative to the initial image can be discriminated more accurately, the discrimination precision is improved, the hybrid generator is optimized according to the discrimination result, and the optimization effect of the hybrid generator is improved more favorably.
In one embodiment, in step S1), the scene-graph-guided image generator outputs the first generated image as follows: each sentence in the training text is converted into a corresponding initial scene graph, the scene layout of the initial scene graph is computed, and the scene layout is then input into a cascaded refinement network (CRN) to generate the corresponding first generated image.
A scene graph is a data structure in which each node represents an object and each edge between two connected objects represents their relationship.
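For example, a toy in-memory representation of the scene graph for the first sentence of the paragraph above might look like this; the dictionary layout is purely illustrative:

```python
# scene graph for "There is a tent under the sun."
scene_graph = {
    "objects": ["tent", "sun"],               # nodes
    "relations": [("tent", "under", "sun")],  # edges as (subject, predicate, object)
    "attributes": {"tent": [], "sun": []},    # per-object attributes (none here)
}
```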
In one embodiment, the scene layout of the initial scene graph is computed as follows: the initial scene graph is processed with a graph convolutional network to obtain an embedding vector for each object in the initial scene graph, a segmentation mask and a bounding box are predicted for each object from its embedding vector, and the scene layout is then computed from the segmentation masks and bounding boxes of all objects.
For example, the initial scene graph may be processed by a graph convolutional network consisting of multiple graph convolution layers. Given an input graph whose node and edge vectors have dimension D_in, a layer computes a new vector of dimension D_out for each node and edge. Each output vector is a function of the neighbourhood of its corresponding input, so each graph convolution layer propagates information along the edges of the graph. A layer applies the same function to every edge of the graph, allowing a single layer to operate on graphs of arbitrary shape. Processing the input scene graph with a series of graph convolution layers yields, for each object, an embedding vector that aggregates information about all objects and relations in the graph; feeding this embedding into an object layout network then predicts the segmentation mask and bounding box of the object.
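A compact PyTorch sketch of one such graph convolution layer over (subject, predicate, object) triples is given below, in the spirit of scene-graph-to-image models; the hidden size, the message-averaging scheme and the class name are assumptions, not the patent's exact layer. The final object embeddings would then feed small box- and mask-prediction heads to produce the scene layout.

```python
import torch
import torch.nn as nn

class SceneGraphConv(nn.Module):
    """One graph convolution step: each edge gathers its endpoint vectors,
    an MLP produces updated subject, predicate and object vectors, and the
    per-node messages are averaged back onto the nodes."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 * d_in, 512), nn.ReLU(),
                                 nn.Linear(512, 3 * d_out))
        self.d_out = d_out

    def forward(self, obj_vecs, pred_vecs, edges):
        # obj_vecs: (N, d_in); pred_vecs: (E, d_in)
        # edges: (E, 2) long tensor of (subject_idx, object_idx) pairs
        s, o = edges[:, 0], edges[:, 1]
        triples = torch.cat([obj_vecs[s], pred_vecs, obj_vecs[o]], dim=1)
        new_s, new_p, new_o = self.mlp(triples).chunk(3, dim=1)
        out = torch.zeros(obj_vecs.size(0), self.d_out)
        cnt = torch.zeros(obj_vecs.size(0), 1)
        out.index_add_(0, s, new_s)               # scatter subject messages
        out.index_add_(0, o, new_o)               # scatter object messages
        cnt.index_add_(0, s, torch.ones(s.size(0), 1))
        cnt.index_add_(0, o, torch.ones(o.size(0), 1))
        return out / cnt.clamp(min=1.0), new_p    # averaged nodes, new edge vectors
```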
In one embodiment, when each sentence in the training text is converted into a corresponding initial scene graph, the objects, object relations and object attributes of each sentence in the training text are extracted first, and the initial scene graph of the sentence is then generated from the objects, object relations and object attributes of that sentence.
Further, the objects, object relations and object attributes of each sentence in the training text are extracted as follows: the training text is parsed into a syntactic dependency tree with a dependency parser, and the dependency tree is then analysed according to language rules (nine simple language rules) to extract the objects, object relations and object attributes of each sentence. Here, object relations mainly refer to the positional relations between objects, and object attributes mainly refer to material classes, such as the material of a sphere.
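A minimal sketch of this parsing step using the spaCy dependency parser is shown below; the two extraction patterns merely stand in for the nine language rules, which the patent does not enumerate, and `en_core_web_sm` assumes an installed English model:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def extract_triples(sentence):
    """Extract objects (nouns), attributes (adjectival modifiers) and
    positional relations (prepositional phrases) from one sentence."""
    doc = nlp(sentence)
    objects, relations, attributes = [], [], []
    for tok in doc:
        if tok.pos_ == "NOUN":
            objects.append(tok.text)
        if tok.dep_ == "amod":                    # e.g. "blue" -> ("sphere", "blue")
            attributes.append((tok.head.text, tok.text))
        if tok.dep_ == "prep":                    # e.g. "under" linking two objects
            for child in tok.children:
                if child.dep_ == "pobj":
                    relations.append((tok.head.text, tok.text, child.text))
    return objects, relations, attributes

print(extract_triples("There is a tent under the sun."))
# expected, parse permitting: objects ['tent', 'sun'] and an ('under', 'sun') relation
```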
In one embodiment, in step S1), the sequence-conditional image generator outputs the second generated image as follows: a story encoder encodes the whole training text into an embedding vector and outputs it to a context encoder; the context encoder captures context information from the embedding vector output by the story encoder and outputs it to a plurality of first image generators, each corresponding to one sentence of the training text; each first image generator converts its sentence into a corresponding second generated image by combining the context information with the information of that sentence.
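A hedged PyTorch sketch of this branch is given below, with mean pooling standing in for the story encoder, a GRU for the context encoder, and a single fully connected head shared across sentences for the first image generators; every size and module choice is an illustrative assumption:

```python
import torch
import torch.nn as nn

class SequenceConditionalGenerator(nn.Module):
    """Sketch: a story encoder embeds the whole paragraph, a GRU context
    encoder unrolls one step per sentence, and a shared image head turns
    each (sentence, context) pair into a 64x64 RGB image."""
    def __init__(self, sent_dim=128, ctx_dim=128):
        super().__init__()
        self.story_enc = nn.Linear(sent_dim, ctx_dim)          # story encoder
        self.ctx_enc = nn.GRU(sent_dim, ctx_dim, batch_first=True)
        self.img_gen = nn.Sequential(                          # first image generator
            nn.Linear(sent_dim + ctx_dim, 64 * 64 * 3), nn.Tanh())

    def forward(self, sent_embs):                # sent_embs: (B, T, sent_dim)
        story = self.story_enc(sent_embs.mean(dim=1))          # whole-text embedding
        ctx, _ = self.ctx_enc(sent_embs, story.unsqueeze(0))   # context per sentence
        feats = torch.cat([sent_embs, ctx], dim=-1)
        imgs = self.img_gen(feats)                             # one image per sentence
        B, T = sent_embs.shape[:2]
        return imgs.view(B, T, 3, 64, 64)
```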
The method of generating sequence images from text based on a generative adversarial network of the above embodiment was verified as follows.
The model was trained with the deep learning framework PyTorch, in an experimental environment running the Ubuntu 14.04 operating system with operations accelerated by 4 NVIDIA 1080Ti graphics processing units (GPUs).
The model was trained on the CLEVR-SV and CoDraw-SV datasets to generate 64×64 images, which were compared quantitatively with the image generation methods ImageGAN and StoryGAN.
comparison of the different methods on the CLEVR-SV dataset is described as follows: the input is the property and relative position of the current object, given by two real numbers representing its coordinates. For example, an image is generated from the description "blue, small, metal, cylinder, (-2.3,2.6)". All objects are described in the same way. Given the description, the appearance of the generated objects should be very small in contrast to the group-try (reference image) case, and their relative positions should be similar.
By contrast, the ImageGAN algorithm cannot maintain consistency with the input text, and as the number of objects increases it confuses attributes, e.g. generating spheres with an incorrect rubber material. The StoryGAN algorithm solves the paragraph-consistency problem through its story encoder, but cannot correctly generate images when judging the relations between objects and their relative positions. Magnified comparisons of the images generated by the different methods show that, compared with the ground truth (reference image), ImageGAN struggles to generate a coherent image: a silver sphere changes from the "metal" of the input condition to "rubber", the positional relationship between the purple cylinder and the blue cube differs greatly, and the size of the yellow cube is inaccurate. StoryGAN can generate a more coherent image, but when several objects are generated their layout becomes confused. Compared with the ImageGAN and StoryGAN algorithms, the text-to-image method of the application, SGGAN, generates smoother images closer to the reference images, captures the relations between objects more accurately, and produces higher-quality sequence images.
The comparison of the different methods on the CoDraw-SV dataset is as follows: ImageGAN cannot generate a consistent image sequence, and the appearance of the characters varies across the sequence. In contrast, the images generated by SGGAN, the text-to-image method of the application, are of higher quality, and the relations among multiple objects and between objects are grasped better during image generation.
The method of generating sequence images from text based on a generative adversarial network of this embodiment can be applied in many settings, for example generating customized images and videos according to a user's personal preferences, helping artists or graphic designers render characters efficiently, and generating images of different styles for the same character. Based on a scene-graph generative adversarial network structure, the method effectively extracts the context information of the text sequence, effectively resolves the problem of confused object layout in text-to-image generation, and generates visually realistic images matching the given text description, making the generated images more diverse and accurate.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It is apparent that the above examples are given by way of illustration only and are not limiting of the embodiments. Other variations and modifications of the present application will be apparent to those of ordinary skill in the art in light of the foregoing description. It is neither necessary nor possible to exhaustively list all embodiments here, and obvious variations or modifications derived therefrom are contemplated as falling within the scope of the present application.
Claims (5)
1. A method of generating sequence images from text based on a generative adversarial network, comprising the following steps:
S1) constructing a training database, wherein the training database comprises training texts and original images corresponding to the training texts, and training a generative adversarial network model with the training texts and the original images;
the generative adversarial network model includes a hybrid generator and a discriminator, the hybrid generator comprising a scene-graph-guided image generator and a sequence-conditional image generator; the scene-graph-guided image generator outputs a first generated image by processing the training text with a scene-graph generation model, and the sequence-conditional image generator outputs a second generated image by processing the training text according to its context information;
when training the generative adversarial network model, the scene-graph-guided image generator processes the training text to output a first generated image, the sequence-conditional image generator processes the training text to output a second generated image, the first generated image and the second generated image are superimposed in the hybrid generator to produce an initial generated image, the initial generated image is output to the discriminator, the discriminator judges the similarity of the initial generated image to the original image, and the hybrid generator is optimized according to the judgment result;
S2) inputting the text to be processed into the trained generative adversarial network model, which generates and outputs an image corresponding to the text to be processed;
the hybrid generator generates at least one initial generated image from the training text;
the discriminator includes an image discriminator, an object discriminator, and a story discriminator;
the image discriminator judges whether the initial generated image matches the content of the training text, and judges the sharpness of the initial generated image relative to the original image;
the object discriminator judges whether any object in the initial generated image is missing relative to the original image;
the story discriminator judges the consistency of objects across the multiple initial generated images corresponding to the whole training text;
in the step S1), the sequence-conditional image generator outputs the second generated image as follows: a story encoder encodes the whole training text into an embedding vector and outputs it to a context encoder; the context encoder captures context information from the embedding vector output by the story encoder and outputs it to a plurality of first image generators, each corresponding to one sentence of the training text; each first image generator converts its sentence into a corresponding second generated image by combining the context information with the information of that sentence.
2. The method of generating sequence images from text based on a generative adversarial network according to claim 1, wherein in the step S1) the scene-graph-guided image generator outputs the first generated image as follows: each sentence in the training text is converted into a corresponding initial scene graph, the scene layout of the initial scene graph is computed, and the scene layout is then input into a cascaded refinement network to generate the corresponding first generated image.
3. The method of generating sequence images from text based on a generative adversarial network according to claim 2, wherein the scene layout of the initial scene graph is computed as follows: the initial scene graph is processed with a graph convolutional network to obtain an embedding vector for each object in the initial scene graph, a segmentation mask and a bounding box are predicted for each object from its embedding vector, and the scene layout is computed from the segmentation masks and bounding boxes of all objects.
4. The method of generating sequence images from text based on a generative adversarial network according to claim 2, wherein, when each sentence in the training text is converted into a corresponding initial scene graph, the objects, object relations and object attributes of each sentence in the training text are extracted first, and the initial scene graph of the sentence is then generated from the objects, object relations and object attributes of that sentence.
5. The method of generating sequence images from text based on a generative adversarial network according to claim 4, wherein the objects, object relations and object attributes of each sentence in the training text are extracted as follows: the training text is parsed into a syntactic dependency tree with a dependency parser, and the dependency tree is then analysed to extract the objects, object relations and object attributes of each sentence.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110384686.9A | 2021-04-09 | 2021-04-09 | Method for generating sequence images from text based on a generative adversarial network |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN113239961A | 2021-08-10 |
| CN113239961B | 2023-10-20 |
Family
ID=77127888
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110384686.9A | Method for generating sequence images from text based on a generative adversarial network | 2021-04-09 | 2021-04-09 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN113239961B (en) |
Families Citing this family (3)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113688927B | 2021-08-31 | 2024-10-18 | Ping An Life Insurance Company of China, Ltd. | Picture sample generation method and device, computer equipment and storage medium |
| CN113570695B | 2021-09-27 | 2021-12-24 | Tsinghua University | Image generation method and device and electronic equipment |
| CN116246288B | 2023-05-10 | 2023-08-04 | Inspur Electronic Information Industry Co., Ltd. | Text encoding method, model training method, model matching method and device |
Family Cites Families (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11151334B2 | 2018-09-26 | 2021-10-19 | Huawei Technologies Co., Ltd. | Systems and methods for multilingual text generation field |
| CA3076646A1 | 2019-03-22 | 2020-09-22 | Royal Bank of Canada | System and method for generation of unseen composite data objects |

2021-04-09: application CN202110384686.9A filed in China; granted as patent CN113239961B (status: Active).
Patent Citations (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107330364A | 2017-05-27 | 2017-11-07 | Shanghai Jiao Tong University | A crowd counting method and system based on cGAN networks |
| CN109410239A | 2018-11-07 | 2019-03-01 | Nanjing University | A text-image super-resolution reconstruction method based on a conditional generative adversarial network |

Non-Patent Citations (2)

| Title |
|---|
| Scene-graph-to-image generation model with graph attention networks; Lan Hong; Liu Qinyi; Journal of Image and Graphics, Issue 08; full text |
| Crop pest and disease image augmentation based on generative adversarial networks; Lin Sheng; Gong Mingyi; Mou Wenqian; Dong Bonan; Electronic Technology & Software Engineering, Issue 03; full text |
Legal Events

| Date | Code | Title |
|---|---|---|
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |
| | GR01 | Patent grant |