CN113239961A - Method for generating sequence images from text based on a generative adversarial network - Google Patents

Method for generating sequence images from text based on a generative adversarial network

Info

Publication number
CN113239961A
Authority
CN
China
Prior art keywords
image
text
training
initial
training text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110384686.9A
Other languages
Chinese (zh)
Other versions
CN113239961B (en)
Inventor
胡伏原
张玮琪
李林燕
徐峰磊
夏振平
顾敏明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Jiatu Intelligent Drawing Information Technology Co ltd
Suzhou University of Science and Technology
Original Assignee
Suzhou Jiatu Intelligent Drawing Information Technology Co ltd
Suzhou University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Jiatu Intelligent Drawing Information Technology Co ltd, Suzhou University of Science and Technology filed Critical Suzhou Jiatu Intelligent Drawing Information Technology Co ltd
Priority to CN202110384686.9A priority Critical patent/CN113239961B/en
Publication of CN113239961A publication Critical patent/CN113239961A/en
Application granted granted Critical
Publication of CN113239961B publication Critical patent/CN113239961B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to a method for generating sequence images from text based on a generative adversarial network, which comprises the following steps: constructing a training database, wherein the training database comprises training texts and original images, and training a generative adversarial network model by using the training texts and the original images; the generative adversarial network model comprises a hybrid generator and a discriminator, wherein the hybrid generator comprises a scene-graph-guided image generator and a sequence-conditioned image generator; and inputting the text to be processed into the trained generative adversarial network model, which generates and outputs an image corresponding to the text to be processed. The invention can generate visually realistic images matched with the text description, thereby avoiding disordered object layouts and improving the accuracy of the output images.

Description

Method for generating sequence images from text based on a generative adversarial network
Technical Field
The invention relates to the technical field of image generation, and in particular to a method for generating sequence images from text based on a generative adversarial network.
Background
Text-to-image generation is a popular research area in computer vision. It means that a user provides a text description and the system automatically generates one or more images whose content is consistent with that description. This greatly improves the flexibility and comprehensiveness of image information acquisition and has good development prospects and practical significance, for example: concept illustration in education, generation of illustrations for literature, visual creation in the arts, and so on.
Existing methods for generating images from text fall mainly into the following three categories:
Methods based on the variational autoencoder model: such a method builds a statistical model that maximizes a lower bound on the data likelihood in order to generate images; it is an unsupervised learning scheme in which the input data are compressed so that the output can reproduce them. The encoder encodes the data, mapping high-dimensional data to a low-dimensional representation; the decoder does the opposite, converting the low-dimensional data back into high-dimensional data and thereby reconstructing the input.
Methods based on the deep recurrent attention model: such a method uses a recurrent neural network together with an attention mechanism; at each step it focuses on one object to be generated, producing image patches sequentially and superimposing them into the final result. The encoder determines the distribution of the latent-variable space in order to capture the salient input information; the decoder receives samples drawn from the encoded distribution and uses them to condition its own distribution over images.
Methods based on generative adversarial networks: the generative adversarial network (GAN) is a deep learning model whose most distinctive feature, unlike traditional machine learning methods, is the introduction of an adversarial mechanism. Referring to fig. 1, the two adversarial parties are a generator G and a discriminator D. The generator learns the real data distribution: it takes as input random noise obeying a prior distribution and generates data similar to the real training samples. The discriminator is a binary classifier that estimates the probability that a sample comes from the training data rather than from the generated data, and decides from this output probability whether the input is a real image or a generated image.
However, the variational autoencoder approach essentially encodes an image into a mean and a standard deviation, uses these two parameters to randomly sample a latent variable from the resulting distribution, and then decodes that latent variable to reconstruct the image; in practical applications it cannot accurately generate high-resolution images. The deep recurrent attention approach introduces a recurrent neural network and an attention mechanism; although this can improve generation accuracy to some extent, the accuracy remains limited for texts containing complex scenes.
Compared with the variational autoencoder approach and the deep recurrent attention approach, the approach based on generative adversarial networks can overcome the defects of both, and has therefore been widely applied in image generation research in recent years. However, when the input text involves multiple objects and relationships, the context information of the text sequence is difficult to extract, the object layout of the generated image easily becomes disordered, and the details of the generated objects are insufficient, so the requirement of outputting high-accuracy images cannot be met.
Therefore, existing methods for generating images from text suffer from low accuracy of the output images and cannot meet practical requirements.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the inability of prior-art text-to-image methods to output images with high accuracy.
In order to solve this technical problem, the invention provides a method for generating sequence images from text based on a generative adversarial network, which comprises the following steps:
S1) constructing a training database, wherein the training database comprises training texts and original images corresponding to the training texts, and training a generative adversarial network model by using the training texts and the original images;
the generative adversarial network model comprises a hybrid generator and a discriminator, wherein the hybrid generator comprises a scene-graph-guided image generator and a sequence-conditioned image generator; the scene-graph-guided image generator processes the training text based on a scene graph generation model to output a first generated image, and the sequence-conditioned image generator processes the training text according to the context information of the training text to output a second generated image;
when the generative adversarial network model is trained, the scene-graph-guided image generator processes a training text to output a first generated image, the sequence-conditioned image generator processes the training text to output a second generated image, the first generated image and the second generated image are superimposed in the hybrid generator to produce an initial generated image, the initial generated image is output to the discriminator, the discriminator judges the similarity of the initial generated image relative to the original image, and the hybrid generator is optimized according to the discrimination result;
S2) inputting the text to be processed into the trained generative adversarial network model, which generates and outputs an image corresponding to the text to be processed.
In one embodiment of the invention, the hybrid generator generates at least one initial generated image from the training text;
the discriminator comprises an image discriminator, an object discriminator and a story discriminator;
the image discriminator is used for judging whether the initial generated image matches the content of the training text and for judging the sharpness of the initial generated image relative to the original image;
the object discriminator is used for judging whether any object in the initial generated image is missing relative to the original image;
the story discriminator is used for judging the consistency of objects across the several initial generated images corresponding to the whole training text.
In an embodiment of the present invention, in step S1), the method for processing the training text by the image generator guided by the scene graph and outputting the first generated image includes: each sentence in the training text is converted into a corresponding initial scene graph, the scene layout of the initial scene graph is calculated, and then the scene layout is input into a cascade refinement network to generate a corresponding first generated image.
In an embodiment of the present invention, the method for calculating the scene layout of the initial scene graph includes: processing the initial scene graph by using a graph convolution network to obtain an embedded vector of each object in the initial scene graph, predicting a segmentation mask and a boundary box of each object according to the embedded vector of each object, and then calculating scene layout by using the segmentation masks and the boundary boxes of all the objects.
In an embodiment of the present invention, when each sentence in the training text is converted into a corresponding initial scene graph, the objects, object relationships and object attributes of each sentence in the training text are extracted, and the initial scene graph of the sentence is then generated from the objects, object relationships and object attributes of that sentence.
In an embodiment of the present invention, the method for extracting the object, the object relationship, and the object attribute of each sentence in the training text comprises: the training text is parsed into a syntactic dependency tree using a dependency parser, which is then parsed to extract the objects, object relationships, and object attributes for each sentence.
In one embodiment of the present invention, in step S1), the method for processing the training text by the image generator based on the sequence condition to output the second generated image includes: the method comprises the steps that a story encoder is used for encoding the whole training text into embedded vectors and outputting the embedded vectors to a context encoder, the context encoder captures context information from the embedded vectors output by the story encoder and outputs the context information to a plurality of first image generators, each first image generator corresponds to one sentence in the training text, and each first image generator combines the context information and the information of each sentence in the training text to convert each sentence into the corresponding second generated image.
Compared with the prior art, the technical scheme of the invention has the following advantages:
the method for generating sequence images from text based on a generative adversarial network can effectively extract the context information of the text sequence, generate visually realistic images that match the given text description, avoid disordered object layouts, and improve the accuracy of the output images.
Drawings
In order that the present disclosure may be more readily and clearly understood, the following detailed description of the present disclosure is provided in connection with specific embodiments thereof and the accompanying drawings, in which,
FIG. 1 is a block diagram of a prior art architecture for generating a countermeasure network model;
FIG. 2 is a block diagram of the structure of the method for generating sequence images from text based on a generative adversarial network according to the present invention;
Detailed Description
The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.
Referring to fig. 2, a method for generating sequence images from text based on a generative adversarial network according to the present invention includes the following steps:
S1) constructing a training database, wherein the training database comprises training texts and original images corresponding to the training texts, and training the generative adversarial network model by using the training texts and the original images;
the generative adversarial network model comprises a hybrid generator and a discriminator, wherein the hybrid generator comprises a scene-graph-guided image generator and a sequence-conditioned image generator; the scene-graph-guided image generator processes the training text based on a scene graph generation model to output a first generated image, and the sequence-conditioned image generator processes the training text according to its context information to output a second generated image;
when the generative adversarial network model is trained, the scene-graph-guided image generator processes a training text to output a first generated image, the sequence-conditioned image generator processes the training text to output a second generated image, the first generated image and the second generated image are superimposed in the hybrid generator to produce an initial generated image, the initial generated image is output to the discriminator, the discriminator judges the similarity of the initial generated image relative to the original image, and the hybrid generator is optimized according to the discrimination result so that the images generated by the optimized hybrid generator finally reach a high similarity to the original images;
it is understood that a training text is composed of at least one sentence, and each of the initial generated images corresponds to one sentence in the training text, that is, when the initial generated images are generated by superimposing, the first generated image and the second generated image belonging to the same sentence are superimposed to obtain the initial generated image of the sentence.
S2) inputting the text to be processed into the trained generation confrontation network model, and generating and outputting an image corresponding to the text to be processed by the trained generation confrontation network model.
In the method, by adopting the mixing generator, the scene graph guided image generator can clearly express the information conveyed by the complex sentence with a plurality of objects, the sequence condition based image generator can express the information conveyed by the sentence on the basis of combining the context information, and through the mixing of the scene graph guided image generator and the sequence condition based image generator, the details of each object in the generated image are increased, the context information of the text can be effectively captured, and the context information of the text is matched with a plurality of objects related to the text, so that the accuracy of the object layout of the generated image is improved, and the requirement of outputting high-accuracy images is further realized.
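The text above does not pin down how the two per-sentence images are superimposed. Purely as a sketch of one possible fusion step, the merge could be written in PyTorch as follows (the module name HybridMerge, the 1×1-convolution blend and the tanh output range are assumptions of this illustration, not the patent's actual operator):

```python
import torch
import torch.nn as nn

class HybridMerge(nn.Module):
    """Fuse the scene-graph-guided image and the sequence-conditioned image
    generated for the same sentence into one initial generated image.
    The 1x1-convolution blend used here is an assumption for illustration;
    the patent only states that the two images are superimposed."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, img_sg: torch.Tensor, img_seq: torch.Tensor) -> torch.Tensor:
        # img_sg, img_seq: (B, 3, H, W) images produced for the same sentence
        stacked = torch.cat([img_sg, img_seq], dim=1)   # (B, 6, H, W)
        return torch.tanh(self.fuse(stacked))           # (B, 3, H, W) initial image

# Usage sketch: merge the per-sentence outputs of the two generators.
# merge = HybridMerge()
# initial_image = merge(first_generated_image, second_generated_image)
```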
For example, referring to fig. 2, a training text (paragraph) S is composed of a series of sentences S = [s1, s2, …, sT], where the length T may vary. The scene-graph-guided image generator produces one generated image for each sentence from its scene graph, and the sequence-conditioned image generator likewise produces one generated image for each sentence. The initial generated images obtained by merging the outputs of the two generators are denoted [I1, I2, …, IT]. During training, the original images are denoted p = [p1, p2, …, pT].
The generative adversarial network model is defined by the objective

min_G max_D V(D, G) = E_{x~Pdata(x)}[log D(x)] + E_{z~Pz(z)}[log(1 - D(G(z)))]

where G denotes the hybrid generator, D denotes the discriminator, E(·) denotes the expected value over the indicated distribution, Pdata(x) represents the distribution of real samples, and Pz(z) represents the noise distribution defined in the lower-dimensional space. D(x) is the output of the discriminator D, and G(z) is the mapping of the input noise z to data. During training, the discriminator is generally trained for several steps before the generator. In practice the discriminator is expected to perform well so that it can supervise the generator; if the discriminator performs poorly and judges generated (fake) data to be real, the overall result will be poor.
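To make the objective concrete, a minimal PyTorch sketch of one adversarial update under that formula is shown below (the generator and discriminator modules, the optimizers and the noise dimension are placeholders rather than the hybrid generator and the three discriminators of this application; the discriminator is assumed to output a probability in (0, 1)):

```python
import torch
import torch.nn as nn

def gan_training_step(generator, discriminator, real_images, noise_dim, opt_g, opt_d):
    """One adversarial update for the objective above.
    `generator`, `discriminator` and the optimizers are placeholder modules;
    the discriminator is assumed to end in a sigmoid so BCELoss applies."""
    bce = nn.BCELoss()
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # --- discriminator step: maximize log D(x) + log(1 - D(G(z))) ---
    z = torch.randn(batch, noise_dim)
    fake_images = generator(z).detach()              # block gradients into G
    d_loss = bce(discriminator(real_images), real_labels) \
           + bce(discriminator(fake_images), fake_labels)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # --- generator step: fool D (non-saturating form of min log(1 - D(G(z)))) ---
    z = torch.randn(batch, noise_dim)
    g_loss = bce(discriminator(generator(z)), real_labels)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```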
In one embodiment, the hybrid generator generates at least one initial generated image from the training text, wherein each initial generated image corresponds to one sentence in the training text;
the discriminator comprises an image discriminator, an object discriminator and a story discriminator;
the image discriminator is used for judging whether the initial generated image matches the content of the training text and for judging the sharpness of the initial generated image relative to the original image;
the object discriminator is used for judging whether any object in the initial generated image is missing relative to the original image;
the story discriminator is used for judging the consistency of objects across the several initial generated images corresponding to the whole training text.
For example, suppose the training text is the following paragraph: "There is a tree under the sun. The girl arrived with the owl on her shoulder. The boy sat by the tree and talked to the girl." The paragraph comprises three sentences, so the hybrid generator generates three initial generated images, one per sentence, each obtained by superimposing the first generated image and the second generated image produced for that sentence. The training text contains objects such as "tree", "sun", "the girl", "the owl" and "the boy". For the initial generated image corresponding to the first sentence, the image discriminator judges whether the objects "tree" and "sun" and the relationship "under" are embodied in the image, and discriminates the difference in sharpness between the initial generated image and the original image, for example when the original image has sharp outlines but the generated image is blurred and its shapes cannot be made out. The object discriminator judges whether an object in the initial generated image is missing relative to the original image, for example when the object "girl" exists in the real image but is missing from the initial generated image. The story discriminator judges the consistency of objects across the several initial generated images corresponding to the whole training text, for example when the object "the girl" appears in the initial generated images for both the second and third sentences but her appearance differs between them, i.e. the same object is not consistent.
The three discriminators thus evaluate the initial generated image from different angles and can judge its similarity to the original image more accurately, which improves discrimination precision; optimizing the hybrid generator according to the combined discrimination result is therefore more effective.
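As an illustration of how the three judgements could be combined during training, the sketch below accumulates the scores of the image, object and story discriminators into a single adversarial loss (the module interfaces, input shapes and equal weighting are assumptions; the patent specifies only what each discriminator judges):

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def three_discriminator_loss(d_img, d_obj, d_story,
                             images, text_embeds, object_crops, story_embed,
                             is_real: bool):
    """Accumulate the three judgements into one adversarial loss.
    `d_img`, `d_obj` and `d_story` are placeholder modules returning raw scores:
      d_img(image, sentence_embedding) -> text/image matching and sharpness score,
      d_obj(object_crop)               -> per-object presence score,
      d_story(image_sequence, story)   -> cross-image consistency score.
    Their architectures and the equal weighting used here are assumptions."""
    def target_like(score):
        return torch.ones_like(score) if is_real else torch.zeros_like(score)

    loss = 0.0
    for img, txt in zip(images, text_embeds):          # one image per sentence
        score = d_img(img.unsqueeze(0), txt.unsqueeze(0))
        loss = loss + bce(score, target_like(score))
    for crop in object_crops:                          # cropped object regions
        score = d_obj(crop.unsqueeze(0))
        loss = loss + bce(score, target_like(score))
    score = d_story(torch.stack(list(images)), story_embed)  # whole sequence at once
    return loss + bce(score, target_like(score))
```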
In one embodiment, in step S1), the method for processing the training text by the image generator guided by the scene graph and outputting the first generated image includes: each sentence in the training text is converted into a corresponding initial scene graph, the scene layout of the initial scene graph is calculated, and then the scene layout is input into a Cascade Refinement Network (CRN) to generate a corresponding first generated image.
A scene graph is a data structure in which each node represents an object and the edges connecting objects represent the relationships between them.
In one embodiment, the method for calculating the scene layout of the initial scene graph comprises the following steps: processing the initial scene graph by using a graph convolution network to obtain an embedded vector of each object in the initial scene graph, predicting a segmentation mask and a boundary box of each object according to the embedded vector of each object, and then calculating scene layout by using the segmentation masks and the boundary boxes of all the objects.
For example, a graph convolution network composed of multiple graph convolution layers may be used for the processing. Given an initial scene graph, each node and edge is associated with a vector of dimension Din, and the layer computes a new vector of dimension Dout for each node and edge. Each output vector is a function of the neighbourhood of its corresponding input, so each graph convolution layer propagates information along the edges of the graph. A graph convolution layer applies the same function to all edges of the graph, allowing a single layer to operate on graphs of arbitrary shape. A series of graph convolution layers processes the input scene graph and yields an embedded vector for each object; this embedded vector aggregates the information of all objects and relations in the graph and is fed into the object layout network, which predicts the object's segmentation mask and bounding box.
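A compact PyTorch sketch of this layout step is given below: object embeddings from the graph convolution network are turned into a soft mask and a bounding box per object, and the objects are then composed into a layout tensor for the cascade refinement network (head sizes, box parameterisation and the pasting scheme are illustrative assumptions, not the patent's exact object layout network):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectLayoutNet(nn.Module):
    """From each object's embedding vector, predict a soft segmentation mask
    and a normalised bounding box, then compose all objects into a coarse
    scene layout tensor for the cascade refinement network."""
    def __init__(self, embed_dim: int = 128, mask_size: int = 16, layout_size: int = 64):
        super().__init__()
        self.mask_head = nn.Linear(embed_dim, mask_size * mask_size)
        self.box_head = nn.Linear(embed_dim, 4)          # (x0, y0, x1, y1) in [0, 1]
        self.mask_size = mask_size
        self.layout_size = layout_size

    def forward(self, obj_embeddings: torch.Tensor) -> torch.Tensor:
        # obj_embeddings: (num_objects, embed_dim) vectors from the graph convolution network
        n, d = obj_embeddings.shape
        s = self.layout_size
        masks = torch.sigmoid(self.mask_head(obj_embeddings)).view(n, 1, self.mask_size, self.mask_size)
        boxes = torch.sigmoid(self.box_head(obj_embeddings))            # normalised boxes
        layout = obj_embeddings.new_zeros(1, d, s, s)
        for i in range(n):
            x0, y0 = int(boxes[i, 0] * s), int(boxes[i, 1] * s)
            w = min(max(int(boxes[i, 2] * s) - x0, 1), s - x0)
            h = min(max(int(boxes[i, 3] * s) - y0, 1), s - y0)
            m = F.interpolate(masks[i:i + 1], size=(h, w), mode="bilinear", align_corners=False)
            # paste the object's embedding, weighted by its resized mask, into its box
            layout[:, :, y0:y0 + h, x0:x0 + w] += obj_embeddings[i].view(1, d, 1, 1) * m
        return layout   # passed to the cascade refinement network to render the image
```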
In one embodiment, when each sentence in the training text is converted into its corresponding initial scene graph, the objects, object relationships and object attributes of that sentence are extracted, and the initial scene graph of the sentence is then generated from these objects, object relationships and object attributes.
Further, the method for extracting the objects, object relationships and object attributes of each sentence in the training text is as follows: the training text is parsed into a syntactic dependency tree using a dependency parser, and the dependency tree is then analysed according to a set of linguistic rules (nine simple linguistic rules) to extract the objects, object relationships and object attributes of each sentence. The object relationship mainly refers to the positional relationship between objects, and the object attribute mainly refers to categories such as material, for example the material of a sphere.
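The parser and the nine rules are not spelled out in the text. Purely to illustrate the idea, two such rules could be implemented with the spaCy dependency parser as follows (spaCy, the model name and both rules are assumptions of this sketch):

```python
import spacy

# Illustration only: the patent says a dependency parser plus nine simple
# linguistic rules are used, but names neither the parser nor the rules.
nlp = spacy.load("en_core_web_sm")

def parse_sentence(sentence: str):
    doc = nlp(sentence)
    # objects: noun tokens in the sentence
    objects = [tok.text for tok in doc if tok.pos_ in ("NOUN", "PROPN")]
    # attribute rule: an adjectival modifier attaches to its head noun
    attributes = [(tok.head.text, tok.text) for tok in doc if tok.dep_ == "amod"]
    # relation rule: noun --prep--> pobj yields a positional relation triple
    relations = []
    for tok in doc:
        if tok.dep_ == "prep" and tok.head.pos_ in ("NOUN", "PROPN"):
            for child in tok.children:
                if child.dep_ == "pobj":
                    relations.append((tok.head.text, tok.text, child.text))
    return objects, relations, attributes

# parse_sentence("There is a tree under the sun.") might yield
# objects ["tree", "sun"] and the relation ("tree", "under", "sun"),
# depending on how the parser attaches the prepositional phrase.
```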
In one embodiment, in step S1), the method for processing the training text by the image generator based on the sequence condition to output the second generated image includes: the method comprises the steps that a story encoder is used for encoding the whole training text into embedded vectors and outputting the embedded vectors to a context encoder, the context encoder captures context information from the embedded vectors output by the story encoder and outputs the context information to a plurality of first image generators, each first image generator corresponds to one sentence in the training text, and each first image generator combines the context information and the information of each sentence in the training text to convert each sentence into the corresponding second generated image.
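A rough PyTorch sketch of this sequence-conditioned branch is shown below (the mean-pooled story encoder, the single-layer GRU context encoder, the fully connected image head and all dimensions are simplifying assumptions; the actual story and context encoders are not detailed in this text):

```python
import torch
import torch.nn as nn

class SequenceConditionedGenerator(nn.Module):
    """Sketch of the sequence-conditioned branch: a story encoder embeds the
    whole paragraph, a recurrent context encoder propagates context over the
    sentences, and a per-sentence image generator turns (context, sentence)
    into the second generated image."""
    def __init__(self, sent_dim=128, ctx_dim=128, img_channels=3, img_size=64):
        super().__init__()
        self.story_encoder = nn.Linear(sent_dim, ctx_dim)       # embeds the paragraph mean
        self.context_encoder = nn.GRU(sent_dim, ctx_dim, batch_first=True)
        self.image_generator = nn.Sequential(                   # shared over sentences
            nn.Linear(ctx_dim + sent_dim, img_channels * img_size * img_size),
            nn.Tanh(),
        )
        self.img_shape = (img_channels, img_size, img_size)

    def forward(self, sentence_embeds: torch.Tensor) -> torch.Tensor:
        # sentence_embeds: (B, T, sent_dim), one embedding per sentence of the paragraph
        story = self.story_encoder(sentence_embeds.mean(dim=1))      # (B, ctx_dim)
        h0 = story.unsqueeze(0)                                      # init the GRU with the story code
        ctx, _ = self.context_encoder(sentence_embeds, h0)           # (B, T, ctx_dim)
        feats = torch.cat([ctx, sentence_embeds], dim=-1)            # combine context + sentence
        imgs = self.image_generator(feats)                           # (B, T, C*H*W)
        return imgs.view(*sentence_embeds.shape[:2], *self.img_shape)  # (B, T, C, H, W)
```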
The verification results for the method of generating sequence images from text based on a generative adversarial network of the above embodiment are as follows:
The deep learning framework PyTorch is adopted; the experimental environment is an Ubuntu 14.04 operating system, and four NVIDIA 1080Ti graphics processing units (GPUs) are used to accelerate computation and obtain the trained model.
Models are trained on the CLEVR-SV and CoDraw-SV data sets to generate 64 × 64 images, which are quantitatively compared with ImageGAN and StoryGAN.
the comparison of the different methods on the CLEVR-SV data set is described as follows: the input is the attribute and relative position of the current object, given by two real numbers representing its coordinates. For example, an image is generated from the description "blue, small, metal, cylinder, (-2.3, 2.6)". All objects are described in the same way. Given the description, the appearance of the generated objects should differ very little from the ground-truth (reference image) case, and their relative positions should be similar.
The comparison shows that the ImageGAN algorithm cannot maintain consistency with the input text, and when the number of objects increases, attributes are confused between objects, for example a sphere is given a wrong "rubber" attribute. The StoryGAN algorithm can solve the paragraph-consistency problem through its story encoder; however, it cannot correctly generate the image when judging the relationships between objects and their relative positions. Enlarged comparisons of the images generated by the different methods show that, compared with the ground truth (reference image), ImageGAN has difficulty generating a coherent image: for example, the silver sphere is changed from "metal" to "rubber" relative to the input condition, the positional relationship between the purple cylinder and the blue cube differs greatly, and the size of the yellow cube is inaccurate. StoryGAN can produce more coherent images, but when multiple objects are generated the object layout is cluttered. Compared with the ImageGAN and StoryGAN algorithms, the text-to-image method of the present application, SGGAN, generates smoother images that differ less from the reference images, captures the relationships between objects more accurately, and produces sequence images of higher quality.
The comparison of the different methods on the CoDraw-SV data set is described below: ImageGAN cannot generate a consistent image sequence, and the appearance of a character is not consistent across the image sequence. In contrast, the SGGAN image generation method of the present application achieves higher image quality and better captures the relationships among multiple objects during image generation.
The method for generating sequence images from text based on a generative adversarial network according to this embodiment can be applied in many ways, such as generating customized images and videos according to a user's personal preferences, helping artists or graphic designers to illustrate text efficiently, generating images of different styles for the same text, and so on. The method is based on a scene-graph generative adversarial network structure, can effectively extract the context information of the text sequence, effectively solves the problem of disordered object layout in text-generated images, and can generate visually realistic images that match the given text description, making the generated images more diverse and accurate.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. And obvious variations or modifications of the invention may be made without departing from the spirit or scope of the invention.

Claims (7)

1. A method for generating sequence images from text based on a generative adversarial network, comprising the steps of:
S1) constructing a training database, wherein the training database comprises training texts and original images corresponding to the training texts, and training a generative adversarial network model by using the training texts and the original images;
the generative adversarial network model comprises a hybrid generator and a discriminator, wherein the hybrid generator comprises a scene-graph-guided image generator and a sequence-conditioned image generator; the scene-graph-guided image generator processes the training text based on a scene graph generation model to output a first generated image, and the sequence-conditioned image generator processes the training text according to the context information of the training text to output a second generated image;
when the generative adversarial network model is trained, the scene-graph-guided image generator processes a training text to output a first generated image, the sequence-conditioned image generator processes the training text to output a second generated image, the first generated image and the second generated image are superimposed in the hybrid generator to produce an initial generated image, the initial generated image is output to the discriminator, the discriminator judges the similarity of the initial generated image relative to the original image, and the hybrid generator is optimized according to the discrimination result;
S2) inputting the text to be processed into the trained generative adversarial network model, which generates and outputs an image corresponding to the text to be processed.
2. The method for generating sequence images from text based on a generative adversarial network of claim 1, wherein
the hybrid generator generates at least one initial generated image from the training text;
the discriminator comprises an image discriminator, an object discriminator and a story discriminator;
the image discriminator is used for judging whether the initial generated image matches the content of the training text and for judging the sharpness of the initial generated image relative to the original image;
the object discriminator is used for judging whether any object in the initial generated image is missing relative to the original image;
the story discriminator is used for judging the consistency of objects across the several initial generated images corresponding to the whole training text.
3. The method of claim 1, wherein in step S1), the method for processing the training text by the scene graph-guided image generator to output the first generated image comprises: each sentence in the training text is converted into a corresponding initial scene graph, the scene layout of the initial scene graph is calculated, and then the scene layout is input into a cascade refinement network to generate a corresponding first generated image.
4. The method of claim 3, wherein the method of calculating the scene layout of the initial scene graph comprises: processing the initial scene graph by using a graph convolution network to obtain an embedded vector of each object in the initial scene graph, predicting a segmentation mask and a boundary box of each object according to the embedded vector of each object, and then calculating scene layout by using the segmentation masks and the boundary boxes of all the objects.
5. The method of claim 3, wherein when each sentence in the training text is converted into the corresponding initial scene graph, the objects, object relationships and object attributes of each sentence in the training text are extracted, and the initial scene graph of the sentence is then generated from the objects, object relationships and object attributes of that sentence.
6. The method for generating sequence images from text based on a generative adversarial network of claim 5, wherein the method for extracting the objects, object relationships and object attributes of each sentence in the training text is: the training text is parsed into a syntactic dependency tree using a dependency parser, and the dependency tree is then analysed to extract the objects, object relationships and object attributes of each sentence.
7. The method of claim 1, wherein in step S1), the method for processing the training text by the image generator based on the sequence condition to output the second generated image comprises: the method comprises the steps that a story encoder is used for encoding the whole training text into embedded vectors and outputting the embedded vectors to a context encoder, the context encoder captures context information from the embedded vectors output by the story encoder and outputs the context information to a plurality of first image generators, each first image generator corresponds to one sentence in the training text, and each first image generator combines the context information and the information of each sentence in the training text to convert each sentence into the corresponding second generated image.
CN202110384686.9A 2021-04-09 2021-04-09 Method for generating sequence image based on text of generating countermeasure network Active CN113239961B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110384686.9A CN113239961B (en) 2021-04-09 2021-04-09 Method for generating sequence image based on text of generating countermeasure network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110384686.9A CN113239961B (en) 2021-04-09 2021-04-09 Method for generating sequence image based on text of generating countermeasure network

Publications (2)

Publication Number Publication Date
CN113239961A true CN113239961A (en) 2021-08-10
CN113239961B CN113239961B (en) 2023-10-20

Family

ID=77127888

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110384686.9A Active CN113239961B (en) 2021-04-09 2021-04-09 Method for generating sequence image based on text of generating countermeasure network

Country Status (1)

Country Link
CN (1) CN113239961B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113570695A (en) * 2021-09-27 2021-10-29 清华大学 Image generation method and device and electronic equipment
CN113688927A (en) * 2021-08-31 2021-11-23 中国平安人寿保险股份有限公司 Picture sample generation method and device, computer equipment and storage medium
CN116246288A (en) * 2023-05-10 2023-06-09 浪潮电子信息产业股份有限公司 Text coding method, model training method, model matching method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330364A (en) * 2017-05-27 2017-11-07 上海交通大学 A kind of people counting method and system based on cGAN networks
CN109410239A (en) * 2018-11-07 2019-03-01 南京大学 A kind of text image super resolution ratio reconstruction method generating confrontation network based on condition
US20200097554A1 (en) * 2018-09-26 2020-03-26 Huawei Technologies Co., Ltd. Systems and methods for multilingual text generation field
US20200302231A1 (en) * 2019-03-22 2020-09-24 Royal Bank Of Canada System and method for generation of unseen composite data objects

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330364A (en) * 2017-05-27 2017-11-07 上海交通大学 A kind of people counting method and system based on cGAN networks
US20200097554A1 (en) * 2018-09-26 2020-03-26 Huawei Technologies Co., Ltd. Systems and methods for multilingual text generation field
CN109410239A (en) * 2018-11-07 2019-03-01 南京大学 A kind of text image super resolution ratio reconstruction method generating confrontation network based on condition
US20200302231A1 (en) * 2019-03-22 2020-09-24 Royal Bank Of Canada System and method for generation of unseen composite data objects

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
兰红; 刘秦邑: "Scene graph to image generation model based on graph attention network", Journal of Image and Graphics (中国图象图形学报), no. 08 *
林胜; 巩名轶; 牟文芊; 董伯男: "Crop disease and pest image augmentation based on generative adversarial networks", Electronic Technology & Software Engineering (电子技术与软件工程), no. 03 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688927A (en) * 2021-08-31 2021-11-23 中国平安人寿保险股份有限公司 Picture sample generation method and device, computer equipment and storage medium
CN113570695A (en) * 2021-09-27 2021-10-29 清华大学 Image generation method and device and electronic equipment
CN116246288A (en) * 2023-05-10 2023-06-09 浪潮电子信息产业股份有限公司 Text coding method, model training method, model matching method and device
CN116246288B (en) * 2023-05-10 2023-08-04 浪潮电子信息产业股份有限公司 Text coding method, model training method, model matching method and device

Also Published As

Publication number Publication date
CN113239961B (en) 2023-10-20

Similar Documents

Publication Publication Date Title
CN111858954B (en) Task-oriented text-generated image network model
Zhang et al. Text-to-image diffusion models in generative ai: A survey
CN110111236B (en) Multi-target sketch image generation method based on progressive confrontation generation network
CN113239961B (en) Method for generating sequence image based on text of generating countermeasure network
CN111967533B (en) Sketch image translation method based on scene recognition
CN105868706A (en) Method for identifying 3D model based on sparse coding
Galatolo et al. TeTIm-Eval: a novel curated evaluation data set for comparing text-to-image models
CN117011875A (en) Method, device, equipment, medium and program product for generating multimedia page
Song et al. Exploring explicit and implicit visual relationships for image captioning
Mei et al. Vision and language: from visual perception to content creation
CN115631285B (en) Face rendering method, device, equipment and storage medium based on unified driving
Wang et al. Generative model with coordinate metric learning for object recognition based on 3D models
Han et al. Attribute-sentiment-guided summarization of user opinions from online reviews
Jia et al. Facial expression synthesis based on motion patterns learned from face database
KR100544684B1 (en) A feature-based approach to facial expression cloning method
Ye et al. HAO‐CNN: Filament‐aware hair reconstruction based on volumetric vector fields
Manushree et al. XCI-Sketch: Extraction of Color Information from Images for Generation of Colored Outlines and Sketches
Lan et al. LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation
Song et al. Virtual Human Talking-Head Generation
CN118014086B (en) Data processing method, device, equipment, storage medium and product
Gao et al. Generating face images from fine-grained sketches based on GAN with global-local joint discriminator
조시현 Interactive Storyboarding System Leveraging Large-Scale Pre-trained Model
Mitsouras et al. U-Sketch: An Efficient Approach for Sketch to Image Diffusion Models
Wang et al. A new Gaussian mixture conditional random field model for indoor image labeling
CN106097373A (en) A kind of smiling face's synthetic method based on branch's formula sparse component analysis model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant