CN111858954A - Task-oriented text-generated image network model - Google Patents

Task-oriented text-generated image network model

Info

Publication number
CN111858954A
Authority
CN
China
Prior art keywords
entity
network
text
image
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010609005.XA
Other languages
Chinese (zh)
Other versions
CN111858954B (en)
Inventor
Li Chunbao (李春豹)
Cui Ying (崔莹)
Dai Xiang (代翔)
Liu Xin (刘鑫)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Electronic Technology Institute No 10 Institute of Cetc
Original Assignee
Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority to CN202010609005.XA
Publication of CN111858954A
Application granted
Publication of CN111858954B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 - Semantic analysis
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods
    • G06N5/00 - Computing arrangements using knowledge-based models
    • G06N5/04 - Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a task-oriented text-generated image network model for generating high-quality images that are consistent with the text semantics and rich in content. It is realized by the following technical scheme: a common sense reasoning module enriches and expands the entities and entity attributes in a natural language text description, after which an entity relationship scene graph and entity attribute semantic vectors are constructed for the description; a global generation model feeds the entity relationship scene graph into a graph convolution network (GCN), inputs the resulting embedded vectors into a mask regression network and a bounding box regression network to estimate the segmentation mask and bounding box of each entity, fuses all entity layouts into a scene layout, and generates an initial image with a convolutional neural network; a local refinement model takes the entity attribute semantic vectors and the initial image feature maps as input and combines a recurrent residual refinement network (RRRN) with a convolutional neural network to generate a high-quality image with rich and harmonious content.

Description

Task-oriented text-generated image network model
Technical Field
The invention relates to the field of computer science, in particular to generative adversarial network (GAN) technology in the field of deep learning, and specifically to a task-oriented text-generated image network model.
Background
With the widespread use of mobile intelligent terminals capable of taking pictures and the rapid development of the internet, multimodal data fusing visual and text information, such as photos with text labels, image-text content in newspaper articles, videos with titles, and multimodal interaction data appearing in social media, are increasing dramatically. Image text description methods can effectively organize such image data and, combined with text information retrieval technology, conveniently search massive image collections. In addition, image text description can be used to read out the content a speaker presents in slide images and to help the visually impaired understand images. The text description of images is a cross-cutting task between computer vision and natural language processing that completes the modal conversion from image to text, and mainly comprises three kinds of methods: generation-based methods, retrieval-based methods, and encoding-decoding-based methods. A generation-based method is divided into a detection process and a generation process: the detection process detects, on the basis of image features, information such as the objects appearing in the image, object attributes, and the scenes and behaviors expressed by the image content; the generation process uses this information to drive a natural language generation system to output a textual description of the image. A retrieval-based method searches the database for a set of images similar to the input image and organizes the text description of the input image from the descriptions of the most similar retrieved images. An encoding-decoding-based method is based on deep learning and directly generates the text description in an encoder-decoder manner. Generation-based methods depend on the quality of concept detection in the detection process and, in the generation process, are limited by manually designed templates, incomplete language models, and limited syntactic models, so the descriptions they generate are monotonous and lack diversity. Retrieval-based methods regard the text description problem as an information retrieval problem, i.e., looking up the query image in a data set; they depend on a large-scale corpus, and the generated descriptions are limited by the similarities between images. At present, many fusion methods based on deep neural networks exist, but they do not truly address the fusion of image and text at the level of high-level semantics, so how to perform multimodal high-level semantic fusion of image and text modal information remains a key problem and research difficulty in the text description of images.
Text-to-image generation has been a popular research field in recent years; the problem to be solved is to generate, from a descriptive text, an image corresponding to that text. Text-to-Image Synthesis refers to computational methods that translate natural language descriptions (in text form, such as keywords, sentences, or paragraphs) into images whose semantics are similar to the text. Text-to-image generation mainly relies on correlation analysis between text and images, combined with a supervised approach to find the best alignment between visual content and text; the main limitation of such approaches is the lack of ability to generate new image content, as they can only vary the features of given or training images. Furthermore, the most important and difficult problem for text-to-image generation is semantic consistency between text and image. Adding a semantic expression vector to a generative adversarial network as a condition for image generation and performing semantic supervision on the generated samples is effective, but depends heavily on the quality of the preceding processing. Text-to-image generation is also a complex and challenging computer vision and machine learning problem, as it requires not only knowing how to generate but also understanding the content of the task.
Current text-to-image generation methods include the VAE (Variational Auto-Encoder), DRAW (Deep Recurrent Attentive Writer), and the generative adversarial network (GAN). Both VAE and DRAW perform image generation: the VAE models the data statistically and maximizes a lower bound on the data likelihood to generate an image, while DRAW uses a recurrent neural network with an attention mechanism, focusing on one part of the generated object at each step, generating each patch in turn and superimposing the final result. GAN is the most popular text-to-image method in recent years; its main idea is to generate data by an adversarial process. It is based on a new generative-model framework consisting of a Generator, which attempts to generate realistic images that can trick the Discriminator, and a Discriminator, which attempts to distinguish real images from generated images. The generator uses a neural network to generate the desired data, i.e., it obtains training samples and trains a model that can generate data according to the defined target data distribution. For many tasks, the generator can map one input to multiple correct outputs while filling in missing information; for example, given a two-dimensional image, the generator may be used to fill in more image information to produce a plausible three-dimensional image. Compared with other neural network models, GAN is attractive because it introduces an adversarial mechanism into the neural network. Most existing GAN models process the natural language text to obtain text features, which then act as constraints on the subsequent image generation process. The generator in the GAN generates an image according to the text features, the discriminator evaluates the generation effect, the generator produces a more realistic image according to the discriminator's judgment, the discriminator evaluates the new image again, and these steps repeat until a high-quality image is generated.
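For illustration, the generator-discriminator game described above can be sketched in a few lines of PyTorch. This is a minimal sketch of the adversarial principle only; the fully connected module shapes, dimensions, and optimizer settings are illustrative assumptions, not the networks of the present invention.

import torch
import torch.nn as nn

z_dim, text_dim, img_dim = 100, 128, 64 * 64 * 3

generator = nn.Sequential(                      # maps noise + text features to an image
    nn.Linear(z_dim + text_dim, 512), nn.ReLU(),
    nn.Linear(512, img_dim), nn.Tanh())
discriminator = nn.Sequential(                  # scores an (image, text) pair as real/fake
    nn.Linear(img_dim + text_dim, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 1))

bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_img, text_feat):
    batch = real_img.size(0)
    z = torch.randn(batch, z_dim)
    fake_img = generator(torch.cat([z, text_feat], dim=1))

    # Discriminator: push real (image, text) pairs toward 1, generated ones toward 0.
    d_real = discriminator(torch.cat([real_img, text_feat], dim=1))
    d_fake = discriminator(torch.cat([fake_img.detach(), text_feat], dim=1))
    loss_d = bce(d_real, torch.ones(batch, 1)) + bce(d_fake, torch.zeros(batch, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: fool the discriminator into scoring generated images as real.
    d_fake = discriminator(torch.cat([fake_img, text_feat], dim=1))
    loss_g = bce(d_fake, torch.ones(batch, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

In each step the discriminator is pushed to separate real from generated images while the generator is pushed to close that gap, which is the repeated cycle described above.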
In recent years, a variety of GAN variants have been applied to the text-to-image task; typically, methods based on conditional generative adversarial networks can generate images semantically related to a text by taking the given text description as a condition. Although GAN has enjoyed significant success in generating images from textual descriptions, there are still some deficiencies. First, textual descriptions containing complex entity relationships are not handled well: most existing methods generate images for simple text descriptions (a sentence with simple entities and relations) and cannot generate rich and harmonious images from complex text descriptions (a sentence or paragraph with complex entities and relations). Second, the resolution of the generated images needs improvement: existing methods can only generate blurry low-resolution images and cannot generate clear high-resolution images with textures, edges, and the like. Third, the quality of local detail in the generated images needs improvement: existing methods mostly use a global text description vector as input and pay little attention to the local entities in the text description, so the local details of each entity in the generated image cannot be handled well.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and to provide a task-oriented text-generated image network model that generates high-quality images consistent with the text semantics and rich and harmonious in content, thereby solving the prior-art problems of poor ability to handle multiple entities and the complex relationships among them, low resolution of the generated images, and poor quality of details such as textures and edges.
In order to solve the above technical problems, the technical scheme adopted by the invention is as follows. A task-oriented text-generated image network model comprises a task-oriented common sense reasoning module, a text-based feature expression module, a global generation model, and a local refinement model, and is characterized in that: the common sense reasoning module combines the natural language text description, the natural language task description, and a common sense knowledge base to carry out task-oriented common sense reasoning, reasonably enriching and expanding the entities and entity attributes in the text description to obtain an expanded text description and feature expression; the text-based feature expression module, by analyzing and processing the text description, constructs an entity relationship scene graph for the natural language text description and extracts a semantic vector for each entity attribute in it; the global generation model inputs the entity relationship scene graph constructed by the text-based feature expression module into a graph convolution network (GCN) to generate embedded vectors containing each entity and its relationships, inputs these embedded vectors into a mask regression network and a bounding box regression network to estimate the segmentation mask and bounding box of each entity, fuses all entity layouts to form a scene layout, and then generates a low-resolution image with a convolutional neural network; the local refinement model combines the feature maps of the low-resolution image generated by the global generation model with the semantic vectors of the entity attributes and uses a recurrent residual refinement network (RRRN) to optimize the local details of each entity, taking the embedded vectors of each entity and its attributes together with the image generated by the previous refinement network as the input of the recurrent residual refinement network; refined images of steadily increasing resolution are generated, the refined image produced by each residual refinement network corresponds to one discriminator, and after end-to-end training a high-quality image with rich and harmonious content is generated.
The invention has the beneficial effects that:
(1) Complex entity relationships in the textual description can be processed accurately. For a given natural language text description, the invention first carries out task-oriented common sense reasoning: it combines the text description, the task description (the type, scene, entities, and so on of the task), and a common sense knowledge base to perform common sense reasoning, reasonably enriches and expands the entities and entity attributes in the text description, and comprehensively considers the global information of the text description (the entity relationship scene graph) and its local information (the semantic vector of each entity attribute) to obtain an expanded text description and feature expression. The complex entity relationships in the text description can thus be processed accurately; even for complex text descriptions, high-quality images covering all entities can still be generated.
(2) The richness, harmony, and realism of the generated image are high. The invention adopts a task-oriented common sense reasoning module that combines the text description, the task description (the type, scene, entities, and so on of the task), and a common sense knowledge base to perform common sense reasoning and reasonably enrich the entities and entity attributes in the text description; this enhances the scene graph and the entity attributes obtained by the text-based feature expression module, optimizes the background, the entities, and the entity attributes in the generated image, improves the richness, harmony, and realism of the generated image, and can also generate images whose overall outline and style are consistent with real samples.
(3) End-to-end training with good model stability and overall fit. Aiming at the insufficient diversity caused by the non-uniform distribution of images generated by text-to-image models, the invention combines a stacked text-to-image generative adversarial network model with local-global mutual information maximization: the generation model decouples the global vector to obtain feature maps of different scales and strengthens the correlation between the global image features and the text description by maximizing the mutual information between these feature maps and the global vector; the feature maps are then reduced to local position feature vectors, and the correlation between the local position features and the text description is strengthened by maximizing the average mutual information between the local position feature vectors and the global vector, yielding a tighter text-to-image mapping. The text-based feature expression module analyzes and processes the text description to construct the entity relationship scene graph and extract the semantic vector of each entity attribute; the global generation model inputs the constructed entity relationship scene graph and entity attribute semantic vectors into a graph convolution network (GCN) to generate embedded vectors containing each entity and its relationships, inputs the embedded vectors into a mask regression network and a bounding box regression network to estimate the segmentation mask and bounding box of each entity, fuses all entities to form a scene layout, and then generates a low-resolution image with a convolutional neural network; and the local refinement model combines the feature maps of the low-resolution image generated by the global generation model with the semantic vectors of the entity attributes and uses a recurrent residual refinement network to optimize the local details of each entity. The whole network model supports end-to-end training and validation, improving the stability and overall fit of the model. Experimental analysis and results show that the diversity and semantic accuracy of the generated images can be effectively improved.
(4) The generated image has good local detail and high resolution. Aiming at the problem that the discriminator in a generative adversarial network converges too fast to provide gradients for the generator, which makes the diversity and quality of the generated images hard to improve, the invention combines the feature maps of the low-resolution image generated by the global generation model with the semantic vectors of the entity attributes and uses a recurrent residual refinement network (RRRN) to optimize the local details of each entity. The recurrent residual refinement network is used in the local refinement model, taking the embedded vectors of each entity and its attributes together with the image generated by the previous refinement network as input, and generates refined images of steadily increasing resolution; the refined image generated by each residual refinement network corresponds to one discriminator, so the local details of each entity are continuously optimized and images of progressively increasing resolution are generated at different stages, improving the quality of the generated images. The method offers a clear improvement in alleviating edge blurring and unclear local textures in the generated images, making them closer to real images.
Drawings
FIG. 1 is a schematic diagram of the structure of a task-oriented text-to-image network model of the present invention;
FIG. 2 is a schematic diagram of the working principle of the task-oriented common sense inference module of FIG. 1;
FIG. 3 is a schematic diagram of the working principle of the text-based feature expression module of FIG. 1;
FIG. 4 is a schematic diagram of the working principle of the global generative model of FIG. 1;
FIG. 5 is a schematic diagram of the working principle of the local refinement model of FIG. 1;
fig. 6 is a schematic diagram of the operation of the residual convolution block of fig. 5.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.
Detailed Description
In the preferred embodiment described below, as shown in FIG. 1, a task-oriented text-generated image network model comprises a task-oriented common sense reasoning module, a text-based feature expression module, a global generation model, and a local refinement model, wherein: the task-oriented common sense reasoning module combines the natural language text description, the natural language task description, and a common sense knowledge base to carry out task-oriented common sense reasoning, reasonably enriching and expanding the entities and entity attributes in the text description to obtain an expanded text description and feature expression; the common sense reasoning over the task description (the type, scene, entities, and so on of the task) and the common sense knowledge base reasonably enriches the entities and entity attributes in the natural language text description so as to enhance the harmony, diversity, and realism of the scene graph and the entity attributes obtained by the text-based feature expression module. The feature expression module, by analyzing and processing the text description, constructs an entity relationship scene graph for the natural language text description and extracts a semantic vector for each entity attribute in it, corresponding to the text-based feature expression module in the figure; the entity relationship scene graph gives the positional relationships among the entities in the natural language text description, and the entity attribute semantic vectors give the semantic vectors of each entity and its attributes. The global generation model inputs the constructed entity relationship scene graph into a graph convolution network (GCN) to generate embedded vectors containing each entity and its relationships, inputs the embedded vectors into a mask regression network and a bounding box regression network to estimate the segmentation mask and bounding box of each entity, fuses all entities to form a scene layout, and then generates a low-resolution image with a convolutional neural network. The local refinement model combines the feature maps of the low-resolution image generated by the global generation model with the semantic vectors of the entity attributes and uses a recurrent residual refinement network (RRRN) to optimize the local details of each entity; the embedded vectors of each entity and its attributes, together with the image generated by the previous refinement network, are the input of the recurrent residual refinement network, refined images of steadily increasing resolution are generated, the refined image produced by each residual refinement network corresponds to one discriminator, and a high-quality image with rich and harmonious content is generated. The global generation model comprises one generator-discriminator pair and the local refinement model comprises two generator-discriminator pairs; each generator generates images of a given resolution that are as realistic as possible, and each discriminator distinguishes real (ground-truth) images from generated images as far as possible.
During training, for the whole image or an object in the image, the discriminator introduces word-vector constraints to change the condition vectors of the sub-generators at each layer of the network, and the loss function is expanded and adjusted correspondingly to the generator's loss function. A cross-entropy loss is used, and the cross-entropy loss, pixel loss, bounding-box loss, and segmentation-mask loss of the whole image or of objects in the image are combined for training on the CUB-Birds, Oxford-102, or COCO-Stuff data sets; finally, the generators at different levels generate images of the corresponding text at different scales, with clearer and more vivid edge details and local textures. Jointly training the generators and discriminators makes the generated image distributions at multiple levels approach the real image distribution together, which increases the variance of the generated images and therefore their diversity.
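The combination of losses described above can be written as a weighted sum. The following sketch assumes sigmoid-normalized masks, an L1 pixel term, and illustrative lambda_* weights; the exact terms and weights are not specified in this embodiment.

import torch
import torch.nn.functional as F

def total_generator_loss(d_logits_fake, fake_img, real_img,
                         pred_boxes, gt_boxes, pred_masks, gt_masks,
                         lambda_pix=1.0, lambda_box=10.0, lambda_mask=1.0):
    # Adversarial term: cross-entropy of the discriminator logits against the "real" label.
    adv = F.binary_cross_entropy_with_logits(
        d_logits_fake, torch.ones_like(d_logits_fake))
    pix = F.l1_loss(fake_img, real_img)                   # pixel loss
    box = F.mse_loss(pred_boxes, gt_boxes)                # bounding-box regression loss
    mask = F.binary_cross_entropy(pred_masks, gt_masks)   # segmentation-mask loss
    return adv + lambda_pix * pix + lambda_box * box + lambda_mask * mask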
As shown in FIG. 2, in this embodiment the task-oriented common sense reasoning module constructs a text description semantic vector and a task description semantic vector for the given natural language text description and natural language task description, respectively, and connects them to obtain a joint semantic vector. This semantic vector is then input into a common sense reasoning model, which performs single-step or multi-step reasoning in combination with the common sense knowledge base; the common sense reasoning model may be based on traditional rule reasoning, distributed representation reasoning, neural network reasoning, or a hybrid of these methods. The semantic vector after common sense reasoning is obtained from the inference and output as a text description in which the entities or attributes have been enriched and expanded.
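A minimal sketch of this vector connection step, assuming illustrative dimensions and a placeholder multilayer perceptron standing in for whichever reasoning model (rule-based, distributed, neural, or hybrid) is chosen:

import torch
import torch.nn as nn

text_vec = torch.randn(1, 256)   # semantic vector of the text description
task_vec = torch.randn(1, 128)   # semantic vector of the task description
joint = torch.cat([text_vec, task_vec], dim=1)  # connected (joint) semantic vector

reasoner = nn.Sequential(        # placeholder for the common sense reasoning model
    nn.Linear(256 + 128, 512), nn.ReLU(),
    nn.Linear(512, 256))
enriched = reasoner(joint)       # semantic vector after common sense reasoning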
As shown in FIG. 3, in this embodiment the feature expression module takes the enriched and expanded text description as input, segments the natural language text description into words using a dictionary-based, statistics-based, rule-based, tagging-based, or understanding-based word segmentation method, removes stop words to obtain the word sequence of a given text description s, and constructs a word list, where entities are represented in the form of a word list and the relationships between entities and the attributes of entities are represented in the form of tuples. The feature expression module then performs dependency analysis with a dependency parsing algorithm over the word sequence and the text description, constructs a dependency tree, extracts the relationships among the entities from the generated dependency tree, and constructs the entity relationship scene graph; it also extracts the entities and their corresponding attributes from the text description and inputs them into a bidirectional Long Short-Term Memory (LSTM) neural network model to obtain feature representations of the entity attributes in the text description, yielding the semantic vectors of the entity attributes. The output is the constructed entity relationship scene graph and the extracted, serially connected semantic vectors of the entity attributes.
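As a rough illustration of the dependency-based relation extraction step, the following sketch uses spaCy as a stand-in dependency parser; the embodiment does not name a specific parser, and the small English model is assumed to be installed.

import spacy

nlp = spacy.load("en_core_web_sm")

def extract_triples(sentence):
    """Collect (subject, relation, object) triples from subject/object dependency arcs."""
    doc = nlp(sentence)
    triples = []
    for token in doc:
        if token.dep_ == "nsubj":                      # subject of a verb
            verb = token.head
            for child in verb.children:
                if child.dep_ in ("dobj", "attr"):     # direct object or attribute
                    triples.append((token.text, verb.lemma_, child.text))
    return triples

print(extract_triples("A man rides a horse beside a lake."))
# e.g. [('man', 'ride', 'horse')]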
In an optional embodiment, the entity relationship scene graph may have two types of nodes, entity nodes and relationship nodes. The feature expression module sets these as an entity class set O and a relationship type set R; each node aggregates the feature information of its neighboring nodes and applies a nonlinear transformation after the aggregation. A given text description s is parsed into an entity relationship scene graph G(s) = <O(s), R(s)>, where O(s) = {O1(s), O2(s), …, Oi(s)}, Oi(s) belongs to the entity class set O and represents the i-th entity in the sentence or paragraph s, and R(s) ⊆ O(s) × R × O(s) represents the relationships between pairs of entities in s. The tuple corresponding to the extracted entity attributes can be represented as <O(s), A(s)>, where A(s) is the set of attribute types in the sentence or paragraph s.
In the bidirectional LSTM model, each word corresponds to two hidden states, one for each direction; the two hidden states are connected to represent the semantics of the word, so a feature matrix of each entity attribute can be obtained, and at the same time the last hidden states of the two directions of the bidirectional LSTM model are connected in series to serve as the semantic vector of the entity attributes in the text description.
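A minimal PyTorch sketch of this bidirectional LSTM encoding, with illustrative vocabulary size and dimensions:

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 5000, 128, 256
embed = nn.Embedding(vocab_size, embed_dim)
bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

word_ids = torch.randint(0, vocab_size, (1, 7))   # one tokenized attribute phrase
outputs, (h_n, _) = bilstm(embed(word_ids))

# outputs: (1, 7, 2*hidden_dim) -- per-word forward/backward states, connected
per_word_features = outputs
# h_n: (2, 1, hidden_dim) -- last hidden state of each direction; connect in series
attribute_vector = torch.cat([h_n[0], h_n[1]], dim=1)   # (1, 2*hidden_dim)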
As shown in FIG. 4, in this embodiment the global generation model inputs the entity relationship scene graph constructed by the text-based feature expression module into a graph convolution network, generating embedded vectors for each entity and its relationships; the embedded vectors contain the information of all entities and their relationships in the scene graph.
First, in the global generation model, the entity relationship scene graph, whose nodes and edges as constructed by the text-based feature expression module all have dimension Di, is input into the graph convolution network. Each graph convolution layer propagates information along the edges of the graph; the same graph convolution layer applies the same functions to all nodes and edges of the input graph, and uses functions over the neighborhoods corresponding to the inputs of all nodes and edges to compute each entity and its relationship information with dimension Do, allowing a single layer to operate on an input graph of arbitrary shape; this may be implemented by a multilayer perceptron network. Then the embedded vector Vi of each entity is input into the mask regression network and the bounding box regression network to estimate each entity's segmentation mask Mi and bounding box Bi; the mask regression network uses multiple deconvolution layers to obtain a binary segmentation mask of dimension m × m for each entity, and the bounding box regression network uses a multilayer perceptron network to estimate the entity's bounding box Bi = (x, y, w, h), where (x, y) is the center of the bounding box and w and h are its width and height, respectively. After the entity embedded vector Vi is multiplied element-wise with the entity's binary segmentation mask Mi, bilinear interpolation is used to fuse the result into the bounding box Bi to represent the layout of the entity. After the entity layouts are obtained in this way, the matrix elements corresponding to the entities are added to obtain an overall scene layout matrix. Finally, the overall scene layout matrix and random Gaussian noise are connected channel-wise and input into a convolutional neural network to generate an initial image with a resolution of 32 × 32; the convolutional neural network first uses three groups of a 3 × 3 convolutional layer and an upsampling layer to repeatedly double the resolution of the scene layout matrix features, with the upsampling layers implemented by bilinear interpolation, and then uses two 3 × 3 convolutional layers to complete the generation of the image.
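The two regression heads described above can be sketched as follows: the mask side upsamples an entity embedding to an m × m mask with deconvolution (transposed convolution) layers, and the box side is a small multilayer perceptron predicting (x, y, w, h). The layer counts and dimensions here are illustrative assumptions.

import torch
import torch.nn as nn

D_o, m = 128, 16   # entity embedding dimension and mask size (illustrative)

mask_net = nn.Sequential(                   # D_o-dim vector -> m x m soft mask
    nn.Unflatten(1, (D_o, 1, 1)),
    nn.ConvTranspose2d(D_o, 64, 4),         # 1x1 -> 4x4
    nn.ReLU(),
    nn.ConvTranspose2d(64, 1, 4, stride=4), # 4x4 -> 16x16
    nn.Sigmoid())

box_net = nn.Sequential(                    # D_o-dim vector -> (x, y, w, h)
    nn.Linear(D_o, 64), nn.ReLU(),
    nn.Linear(64, 4))

v = torch.randn(3, D_o)                     # embedded vectors for 3 entities
masks = mask_net(v)                         # (3, 1, 16, 16) segmentation masks
boxes = box_net(v)                          # (3, 4) bounding boxes

In the embodiment the masked embedding is then warped into the predicted bounding box by bilinear interpolation and the per-entity layouts are summed into the scene layout matrix; that compositing step is omitted here.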
As shown in FIG. 5, in this embodiment the input of the local refinement module is the semantic vectors of the entity attributes in the text description and the feature maps of the initially generated image with resolution 32 × 32, and the output is a high-quality image (with a resolution of 512 × 512) with rich and harmonious content. First, the generator inputs the entity attribute semantic vectors of the text description into a fully connected layer and a reshape layer to construct a feature map of the same scale as the initial image feature map, and a concat layer connects the two channel-wise to obtain a new feature map. The new feature map is then input into a recurrent residual refinement network formed by two residual refinement networks of identical structure, in which the quality of the local edges, textures, colors, and so on of the initial image is gradually improved; each residual refinement network consists of two groups each comprising a 3 × 3 convolutional layer, residual convolution blocks, and an upsampling layer, followed by two 3 × 3 convolutional layers used for image generation, producing the final high-quality image. In the training process these parts correspond to the generators in the local refinement module; the generated images of different resolutions are each connected to a discriminator that evaluates the difference between the generated image and the real image. The discriminators respectively downsample the generated image with resolution 128 × 128, the generated image with resolution 512 × 512, and the ground-truth images, and the evaluation result is returned through the discriminator's score to the generators and the recurrent residual refinement network to improve the generators' generation capability.
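The conditioning step at the start of the refinement generator (fully connected layer, reshape, channel-wise concat) can be sketched as follows, with illustrative dimensions:

import torch
import torch.nn as nn

attr_dim, feat_ch, H, W = 256, 64, 32, 32
fc = nn.Linear(attr_dim, feat_ch * H * W)          # fully connected layer

attr_vec = torch.randn(1, attr_dim)                # entity attribute semantic vector
img_feat = torch.randn(1, feat_ch, H, W)           # initial image feature map

attr_feat = fc(attr_vec).reshape(1, feat_ch, H, W) # reshape layer
fused = torch.cat([img_feat, attr_feat], dim=1)    # concat layer, (1, 128, 32, 32)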
As shown in FIG. 6, in this embodiment, for the feature map input from the 3 × 3 convolutional layer, the residual convolution block first uses a 1 × 1 convolutional layer to adjust the number of feature map channels; then one branch of the residual convolution block is processed by two 3 × 3 convolutional layers with a rectified linear unit (ReLU) activation layer in between, and the feature map obtained after this processing and the feature map obtained after the channel adjustment are added element-wise by an accumulation layer to give the feature map output by the residual convolution block.
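A minimal PyTorch sketch of this residual convolution block, with illustrative channel counts:

import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.adjust = nn.Conv2d(in_ch, out_ch, kernel_size=1)   # 1x1 channel adjustment
        self.branch = nn.Sequential(                            # two 3x3 convs with ReLU between
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1))

    def forward(self, x):
        x = self.adjust(x)
        return x + self.branch(x)   # element-wise accumulation of the two branches

block = ResidualConvBlock(128, 64)
out = block(torch.randn(1, 128, 32, 32))   # (1, 64, 32, 32)

The 1 × 1 adjustment keeps the identity branch and the convolution branch at the same channel count so the element-wise addition is well-defined.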
The foregoing is directed to the preferred embodiment of the present invention. It is noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit and substance of the invention, and these modifications and improvements are also considered to be within the scope of the invention.

Claims (10)

1. A task-oriented text-generating image network model, comprising a task-oriented common sense reasoning module, a text-based feature expression module, a global generation model, and a local refinement model, characterized in that: the common sense reasoning module combines the natural language text description, the natural language task description, and a common sense knowledge base to carry out task-oriented common sense reasoning, reasonably enriching and expanding the entities and entity attributes in the text description to obtain an expanded text description and feature expression; the text-based feature expression module, by analyzing and processing the text description, constructs an entity relationship scene graph for the natural language text description and extracts a semantic vector for each entity attribute in it; the global generation model inputs the entity relationship scene graph constructed by the text-based feature expression module into a graph convolution network (GCN) to generate embedded vectors containing each entity and its relationships, inputs the embedded vectors into a mask regression network and a bounding box regression network to estimate the segmentation mask and bounding box of each entity, fuses all entity layouts to form a scene layout, and then generates a low-resolution image with a convolutional neural network; the local refinement model combines the feature maps of the low-resolution image generated by the global generation model with the semantic vectors of the entity attributes, uses a recurrent residual refinement network (RRRN) to optimize the local details of each entity, and takes the embedded vectors of each entity and its attributes together with the image generated by the previous refinement network as the input of the recurrent residual refinement network; refined images of steadily increasing resolution can be generated, the refined image produced by each residual refinement network corresponds to one discriminator, and after end-to-end training a high-quality image with rich and harmonious content is generated.
2. The task-oriented text-generating image network model of claim 1, wherein: the global generation model comprises one generator-discriminator pair and the local refinement model comprises two generator-discriminator pairs; each generator generates images of a given resolution that are as realistic as possible, and each discriminator distinguishes real images from generated images; during training, for the whole image or an object in the image, the discriminator introduces word-vector constraints to change the condition vectors of the sub-generators at each layer of the network, and the loss function is expanded and adjusted correspondingly to the generator's loss function; a cross-entropy loss is used, and the cross-entropy loss, pixel loss, bounding-box loss, and segmentation-mask loss of the whole image or of objects in the image are combined for training on the CUB-Birds, Oxford-102, or COCO-Stuff data sets; finally, the generators at different levels generate images of the corresponding text at different scales.
3. The task-oriented text-generating image network model of claim 1, wherein: the task-oriented common sense reasoning module constructs a text description semantic vector and a task description semantic vector for the given natural language text description and natural language task description, respectively, and connects them to obtain a joint semantic vector; this semantic vector is then input into a common sense reasoning model.
4. The task-oriented text-generating image network model of claim 1, wherein: the feature expression module takes the enriched and expanded text description as input, segments the natural language text description into words using a dictionary-based, statistics-based, rule-based, tagging-based, or understanding-based word segmentation method, removes stop words to obtain the word sequence of a given text description s, and constructs a word list, where entities are represented in the form of a word list and the relationships between entities and the attributes of entities are represented in the form of tuples.
5. The task-oriented text-generating image network model of claim 1, wherein: the feature expression module performs dependency analysis with a dependency parsing algorithm over the word sequence and the text description, constructs a dependency tree, extracts the relationships among the entities from the generated dependency tree, and constructs the entity relationship scene graph; it extracts the entities and their corresponding attributes from the text description and inputs them into a bidirectional Long Short-Term Memory (LSTM) neural network model to obtain feature representations of the entity attributes in the text description, yielding the semantic vectors of the entity attributes; the output is the constructed entity relationship scene graph and the extracted, serially connected semantic vectors of the entity attributes.
6. The task-oriented text-generating image network model of claim 1, wherein: the feature expression module sets the entity nodes and relationship nodes as an entity class set O and a relationship type set R, respectively; each node aggregates the feature information of its neighboring nodes and applies a nonlinear transformation after the aggregation, and a given text description s is parsed into an entity relationship scene graph G(s) = <O(s), R(s)>, where O(s) = {O1(s), O2(s), …, Oi(s)}, Oi(s) belongs to the entity class set O and represents the i-th entity in the sentence or paragraph s, and R(s) ⊆ O(s) × R × O(s) represents the relationships between pairs of entities in s; the tuple corresponding to the extracted entity attributes can be represented as <O(s), A(s)>, where A(s) is the set of attribute types in the sentence or paragraph s.
7. The task-oriented text-generating image network model of claim 1, wherein: in the bidirectional LSTM model, each word corresponds to two hidden states, one for each direction; the two hidden states are connected to represent the semantics of the word, a feature matrix of each entity attribute is obtained, and at the same time the last hidden states of the two directions of the bidirectional LSTM model are connected in series to serve as the semantic vector of the entity attributes in the text description.
8. The task-oriented text-generating image network model of claim 1, wherein: the global generation model inputs the entity relationship scene graph constructed by the text-based feature expression module into a graph convolution network to generate embedded vectors for each entity and its relationships, the embedded vectors containing the information of all entities and their relationships in the scene graph.
9. The task-oriented text-generating image network model of claim 1, wherein: first, in the global generation model, the entity relationship scene graph, whose nodes and edges as constructed by the text-based feature expression module all have dimension Di, is input into the graph convolution network; each graph convolution layer propagates information along the edges of the graph, the same graph convolution layer applies the same functions to all nodes and edges of the input graph, and functions over the neighborhoods corresponding to the inputs of all nodes and edges are used to compute each entity and its relationship information with dimension Do; the embedded vector Vi of each entity is input into the mask regression network and the bounding box regression network to estimate each entity's segmentation mask Mi and bounding box Bi, the mask regression network using multiple deconvolution layers to obtain a binary segmentation mask of dimension m × m for each entity and the bounding box regression network using a multilayer perceptron network to estimate the entity's bounding box (x, y, w, h); after the entity embedded vector Vi is multiplied element-wise with the entity's binary segmentation mask Mi, bilinear interpolation is used to fuse the result into the bounding box Bi to represent the entity layout, and after the entity layouts are obtained, the matrix elements corresponding to the entities are added to obtain an overall scene layout matrix; finally, the overall scene layout matrix and random Gaussian noise are connected channel-wise and input into a convolutional neural network to generate an initial image with a resolution of 32 × 32, the convolutional neural network first using three groups of a 3 × 3 convolutional layer and an upsampling layer to repeatedly double the resolution of the scene layout matrix features, the upsampling layers being implemented by bilinear interpolation, and then using two 3 × 3 convolutional layers to complete the generation of the image, where (x, y) is the center of the bounding box and w and h are its width and height, respectively.
10. The task-oriented text-generating image network model of claim 1, wherein: the local refinement module first inputs the semantic vectors of the entity attributes in the text description into a fully connected layer and a reshape layer to construct a feature map of the same scale as the initial image feature map, and uses a concat layer to connect the constructed feature map with the initial image feature map channel-wise to obtain a new feature map; the new feature map is then input into a recurrent residual refinement network consisting of two residual refinement networks of identical structure, in which the quality of the local edges, textures, colors, and so on of the initial image is gradually improved; each residual refinement network consists of two groups each comprising a 3 × 3 convolutional layer, residual convolution blocks, and an upsampling layer, followed by two 3 × 3 convolutional layers, generating images with resolutions of 128 × 128 and 512 × 512, respectively; the images of different resolutions are each connected to a discriminator that evaluates the difference between the generated image and the real image, the discriminators respectively downsample the generated image with resolution 128 × 128, the generated image with resolution 512 × 512, and the ground-truth images, and the evaluation result is returned through the discriminator's score to the recurrent residual refinement network and the generators to improve their generation capability.
CN202010609005.XA 2020-06-29 2020-06-29 Task-oriented text-generated image network model Active CN111858954B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010609005.XA CN111858954B (en) 2020-06-29 2020-06-29 Task-oriented text-generated image network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010609005.XA CN111858954B (en) 2020-06-29 2020-06-29 Task-oriented text-generated image network model

Publications (2)

Publication Number Publication Date
CN111858954A (en) 2020-10-30
CN111858954B (en) 2022-12-13

Family

ID=72989184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010609005.XA Active CN111858954B (en) 2020-06-29 2020-06-29 Task-oriented text-generated image network model

Country Status (1)

Country Link
CN (1) CN111858954B (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101330A (en) * 2020-11-20 2020-12-18 北京沃东天骏信息技术有限公司 Image processing method, image processing apparatus, electronic device, and storage medium
CN112417539A (en) * 2020-11-16 2021-02-26 杭州群核信息技术有限公司 Method, device and system for designing house type based on language description
CN112634405A (en) * 2020-11-30 2021-04-09 南京大学 Image-text generation method for releasing crowd-sourcing task
CN112734881A (en) * 2020-12-01 2021-04-30 北京交通大学 Text synthesis image method and system based on significance scene graph analysis
CN112765956A (en) * 2021-01-22 2021-05-07 大连民族大学 Dependency syntax analysis method based on multi-task learning and application
CN113111329A (en) * 2021-06-11 2021-07-13 四川大学 Password dictionary generation method and system based on multi-sequence long-term and short-term memory network
CN113140019A (en) * 2021-05-13 2021-07-20 电子科技大学 Method for generating text-generated image of confrontation network based on fusion compensation
CN113254828A (en) * 2021-05-24 2021-08-13 北京邮电大学 Seamless multi-mode content mixing exhibition method based on nonlinear editing technology
CN113408619A (en) * 2021-06-21 2021-09-17 江苏苏云信息科技有限公司 Language model pre-training method and device
CN113793404A (en) * 2021-08-19 2021-12-14 西南科技大学 Artificially controllable image synthesis method based on text and outline
CN113837229A (en) * 2021-08-30 2021-12-24 厦门大学 Knowledge-driven text-to-image generation method
CN113934890A (en) * 2021-12-16 2022-01-14 之江实验室 Method and system for automatically generating scene video by characters
CN114240811A (en) * 2021-11-29 2022-03-25 浙江大学 Method for generating new image based on multiple images
CN114648681A (en) * 2022-05-20 2022-06-21 浪潮电子信息产业股份有限公司 Image generation method, device, equipment and medium
CN115527216A (en) * 2022-11-09 2022-12-27 中国矿业大学(北京) Text image generation method based on modulation fusion and generation countermeasure network
CN115984293A (en) * 2023-02-09 2023-04-18 中国科学院空天信息创新研究院 Spatial target segmentation network and method based on edge perception attention mechanism
CN116132756A (en) * 2023-01-06 2023-05-16 重庆大学 End-to-end video subtitle generating method based on deep learning
CN116152647A (en) * 2023-04-18 2023-05-23 中国科学技术大学 Scene graph generation method based on multi-round iteration strategy and difference perception
CN116797684A (en) * 2023-08-21 2023-09-22 腾讯科技(深圳)有限公司 Image generation method, device, electronic equipment and storage medium
CN116883528A (en) * 2023-06-12 2023-10-13 阿里巴巴(中国)有限公司 Image generation method and device
CN116992493A (en) * 2023-09-01 2023-11-03 滨州八爪鱼网络科技有限公司 Digital blind box generation method and system
WO2023246822A1 (en) * 2022-06-22 2023-12-28 华为技术有限公司 Image processing method and terminal device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0308234D0 (en) * 2002-04-20 2003-05-14 Virtual Mirrors Ltd Body models from scanned data
CN109885723A (en) * 2019-02-20 2019-06-14 腾讯科技(深圳)有限公司 A kind of generation method of video dynamic thumbnail, the method and device of model training
CN110472616A (en) * 2019-08-22 2019-11-19 腾讯科技(深圳)有限公司 Image-recognizing method, device, computer equipment and storage medium
CN110675329A (en) * 2019-08-06 2020-01-10 厦门大学 Image deblurring method based on visual semantic guidance
CN111340907A (en) * 2020-03-03 2020-06-26 曲阜师范大学 Text-to-image generation method of self-adaptive attribute and instance mask embedded graph

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0308234D0 (en) * 2002-04-20 2003-05-14 Virtual Mirrors Ltd Body models from scanned data
CN109885723A (en) * 2019-02-20 2019-06-14 腾讯科技(深圳)有限公司 A kind of generation method of video dynamic thumbnail, the method and device of model training
CN110675329A (en) * 2019-08-06 2020-01-10 厦门大学 Image deblurring method based on visual semantic guidance
CN110472616A (en) * 2019-08-22 2019-11-19 腾讯科技(深圳)有限公司 Image-recognizing method, device, computer equipment and storage medium
CN111340907A (en) * 2020-03-03 2020-06-26 曲阜师范大学 Text-to-image generation method of self-adaptive attribute and instance mask embedded graph

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JUSTIN JOHNSON: "Image Generation from Scene Graphs", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition *
SEBASTIAN SCHUSTER: "Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval", Association for Computational Linguistics *
WU HAOYU: "Research and Application of Text-Description-to-Image Generation Algorithms Based on Generative Adversarial Networks", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417539A (en) * 2020-11-16 2021-02-26 杭州群核信息技术有限公司 Method, device and system for designing house type based on language description
CN112417539B (en) * 2020-11-16 2023-10-03 杭州群核信息技术有限公司 House type design method, device and system based on language description
CN112101330B (en) * 2020-11-20 2021-04-30 北京沃东天骏信息技术有限公司 Image processing method, image processing apparatus, electronic device, and storage medium
CN112101330A (en) * 2020-11-20 2020-12-18 北京沃东天骏信息技术有限公司 Image processing method, image processing apparatus, electronic device, and storage medium
CN112634405A (en) * 2020-11-30 2021-04-09 南京大学 Image-text generation method for releasing crowd-sourcing task
CN112734881A (en) * 2020-12-01 2021-04-30 北京交通大学 Text synthesis image method and system based on significance scene graph analysis
CN112734881B (en) * 2020-12-01 2023-09-22 北京交通大学 Text synthesized image method and system based on saliency scene graph analysis
CN112765956A (en) * 2021-01-22 2021-05-07 大连民族大学 Dependency syntax analysis method based on multi-task learning and application
CN112765956B (en) * 2021-01-22 2023-06-20 大连民族大学 Dependency syntax analysis method based on multitask learning and application
CN113140019B (en) * 2021-05-13 2022-05-31 电子科技大学 Method for generating text-generated image of confrontation network based on fusion compensation
CN113140019A (en) * 2021-05-13 2021-07-20 电子科技大学 Method for generating text-generated image of confrontation network based on fusion compensation
CN113254828B (en) * 2021-05-24 2022-09-16 北京邮电大学 Seamless multi-mode content mixing exhibition method based on nonlinear editing technology
CN113254828A (en) * 2021-05-24 2021-08-13 北京邮电大学 Seamless multi-mode content mixing exhibition method based on nonlinear editing technology
CN113111329A (en) * 2021-06-11 2021-07-13 四川大学 Password dictionary generation method and system based on multi-sequence long-term and short-term memory network
CN113111329B (en) * 2021-06-11 2021-08-13 四川大学 Password dictionary generation method and system based on multi-sequence long-term and short-term memory network
CN113408619B (en) * 2021-06-21 2024-02-13 江苏苏云信息科技有限公司 Language model pre-training method and device
CN113408619A (en) * 2021-06-21 2021-09-17 江苏苏云信息科技有限公司 Language model pre-training method and device
CN113793404A (en) * 2021-08-19 2021-12-14 西南科技大学 Artificially controllable image synthesis method based on text and outline
CN113837229A (en) * 2021-08-30 2021-12-24 厦门大学 Knowledge-driven text-to-image generation method
CN113837229B (en) * 2021-08-30 2024-03-15 厦门大学 Knowledge-driven text-to-image generation method
CN114240811A (en) * 2021-11-29 2022-03-25 浙江大学 Method for generating new image based on multiple images
CN113934890A (en) * 2021-12-16 2022-01-14 之江实验室 Method and system for automatically generating scene video by characters
CN113934890B (en) * 2021-12-16 2022-04-15 之江实验室 Method and system for automatically generating scene video by characters
CN114648681A (en) * 2022-05-20 2022-06-21 浪潮电子信息产业股份有限公司 Image generation method, device, equipment and medium
CN114648681B (en) * 2022-05-20 2022-10-28 浪潮电子信息产业股份有限公司 Image generation method, device, equipment and medium
WO2023246822A1 (en) * 2022-06-22 2023-12-28 华为技术有限公司 Image processing method and terminal device
CN115527216B (en) * 2022-11-09 2023-05-23 中国矿业大学(北京) Text image generation method based on modulation fusion and antagonism network generation
CN115527216A (en) * 2022-11-09 2022-12-27 中国矿业大学(北京) Text image generation method based on modulation fusion and generation countermeasure network
CN116132756A (en) * 2023-01-06 2023-05-16 重庆大学 End-to-end video subtitle generating method based on deep learning
CN116132756B (en) * 2023-01-06 2024-05-03 重庆大学 End-to-end video subtitle generating method based on deep learning
CN115984293A (en) * 2023-02-09 2023-04-18 中国科学院空天信息创新研究院 Spatial target segmentation network and method based on edge perception attention mechanism
CN115984293B (en) * 2023-02-09 2023-11-07 中国科学院空天信息创新研究院 Spatial target segmentation network and method based on edge perception attention mechanism
CN116152647A (en) * 2023-04-18 2023-05-23 中国科学技术大学 Scene graph generation method based on multi-round iteration strategy and difference perception
CN116152647B (en) * 2023-04-18 2023-07-18 中国科学技术大学 Scene graph generation method based on multi-round iteration strategy and difference perception
CN116883528A (en) * 2023-06-12 2023-10-13 阿里巴巴(中国)有限公司 Image generation method and device
CN116797684B (en) * 2023-08-21 2024-01-05 腾讯科技(深圳)有限公司 Image generation method, device, electronic equipment and storage medium
CN116797684A (en) * 2023-08-21 2023-09-22 腾讯科技(深圳)有限公司 Image generation method, device, electronic equipment and storage medium
CN116992493A (en) * 2023-09-01 2023-11-03 滨州八爪鱼网络科技有限公司 Digital blind box generation method and system
CN116992493B (en) * 2023-09-01 2024-02-06 滨州八爪鱼网络科技有限公司 Digital blind box generation method and system

Also Published As

Publication number Publication date
CN111858954B (en) 2022-12-13

Similar Documents

Publication Publication Date Title
CN111858954B (en) Task-oriented text-generated image network model
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
CN110516096A (en) Synthesis perception digital picture search
Ni et al. Learning to photograph: A compositional perspective
CN110599592A (en) Three-dimensional indoor scene reconstruction method based on text
CN114419642A (en) Method, device and system for extracting key value pair information in document image
Xu et al. Multi-modal transformer with global-local alignment for composed query image retrieval
Zhang et al. Retargeting semantically-rich photos
CN113761105A (en) Text data processing method, device, equipment and medium
CN115131698A (en) Video attribute determination method, device, equipment and storage medium
Zhang et al. Online modeling of esthetic communities using deep perception graph analytics
CN116975615A (en) Task prediction method and device based on video multi-mode information
CN113240033B (en) Visual relation detection method and device based on scene graph high-order semantic structure
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
Li et al. KBHN: A knowledge-aware bi-hypergraph network based on visual-knowledge features fusion for teaching image annotation
CN116597267B (en) Image recognition method, device, computer equipment and storage medium
Yu et al. A novel multi-feature representation of images for heterogeneous IoTs
CN117312594A (en) Sketching mechanical part library retrieval method integrating double-scale features
Bi et al. C^2Net: a complementary co-saliency detection network
Xu et al. Estimating similarity of rich internet pages using visual information
CN114677569A (en) Character-image pair generation method and device based on feature decoupling
Li et al. Image aesthetic assessment using a saliency symbiosis network
Wang et al. Deep learning for font recognition and retrieval
CN117156078B (en) Video data processing method and device, electronic equipment and storage medium
CN116311275B (en) Text recognition method and system based on seq2seq language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant