CN111858954A - Task-oriented text-generated image network model - Google Patents

Task-oriented text-generated image network model

Info

Publication number
CN111858954A
Authority
CN
China
Prior art keywords
entity
network
text
image
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010609005.XA
Other languages
Chinese (zh)
Other versions
CN111858954B (en)
Inventor
Li Chunbao (李春豹)
Cui Ying (崔莹)
Dai Xiang (代翔)
Liu Xin (刘鑫)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Electronic Technology Institute No 10 Institute of Cetc
Original Assignee
Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority to CN202010609005.XA
Publication of CN111858954A
Application granted
Publication of CN111858954B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 - Semantic analysis
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods
    • G06N5/00 - Computing arrangements using knowledge-based models
    • G06N5/04 - Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a task-oriented text-generated image network model for generating high-quality images that are consistent with the text semantics and rich in content. It is realized by the following technical scheme: a common sense reasoning module enriches and expands the entities and entity attributes in a natural language text description, after which an entity relationship scene graph and entity attribute semantic vectors are constructed for the description; a global generation model feeds the entity relationship scene graph into a graph convolution network (GCN), inputs the resulting embedded vectors into a mask regression network and a bounding box regression network to estimate the segmentation mask and bounding box of each entity, fuses all entity layouts into a scene layout, and generates an initial image with a convolutional neural network; a local refinement model takes the entity attribute semantic vectors and the initial image feature maps as input and combines a recurrent residual refinement network (RRRN) with a convolutional neural network to generate a high-quality image with rich and harmonious content.

Description

Task-oriented text-generated image network model
Technical Field
The invention relates to the field of computer science, in particular to generative adversarial network (GAN) technology in the field of deep learning, and specifically to a task-oriented text-generated image network model.
Background
With the widespread use of mobile intelligent terminals capable of taking pictures and the rapid development of the internet, multimodal data fusing visual and text information, such as photos with text labels, image-text content in newspaper articles, videos with titles, and multimodal interaction data appearing in social media, are increasing dramatically. Image text description methods can effectively organize such image data and, combined with text information retrieval technology, conveniently search massive image collections. In addition, image text description can be used to read out the content a speaker presents in slide images and to help the visually impaired understand images. The text description of images is a cross-cutting task between computer vision and natural language processing that completes the modal conversion from image to text, and mainly comprises three kinds of methods: generation-based methods, retrieval-based methods, and encoding-decoding-based methods. A generation-based method is divided into a detection process and a generation process: the detection process detects, on the basis of image features, information such as the objects appearing in the image, object attributes, and the scenes and behaviors expressed by the image content; the generation process uses this information to drive a natural language generation system to output a textual description of the image. A retrieval-based method searches the database for a set of images similar to the input image and organizes the text description of the input image from the descriptions of the most similar retrieved images. An encoding-decoding-based method is based on deep learning and directly generates the text description in an encoder-decoder manner. Generation-based methods depend on the quality of concept detection in the detection process and, in the generation process, are limited by manually designed templates, incomplete language models, and limited syntactic models, so the descriptions they generate are monotonous and lack diversity. Retrieval-based methods regard the text description problem as an information retrieval problem, i.e., looking up the query image in a data set; they depend on a large-scale corpus, and the generated descriptions are limited by the similarities between images. At present, many fusion methods based on deep neural networks exist, but they do not truly address the fusion of image and text at the level of high-level semantics, so how to perform multimodal high-level semantic fusion of image and text modal information remains a key problem and research difficulty in the text description of images.
Text-to-image generation has been a popular research field in recent years; the problem to be solved is to generate, from a descriptive text, an image corresponding to that text. Text-to-Image Synthesis refers to computational methods that translate natural language descriptions (in text form, such as keywords, sentences, or paragraphs) into images whose semantics are similar to the text. Text-to-image generation mainly relies on correlation analysis between text and images, combined with a supervised approach to find the best alignment between visual content and text; the main limitation of such approaches is the lack of ability to generate new image content, as they can only vary the features of given or training images. Furthermore, the most important and difficult problem for text-to-image generation is semantic consistency between text and image. Adding a semantic expression vector to a generative adversarial network as a condition for image generation and performing semantic supervision on the generated samples is effective, but depends heavily on the quality of the preceding processing. Text-to-image generation is also a complex and challenging computer vision and machine learning problem, as it requires not only knowing how to generate but also understanding the content of the task.
Current text-to-image generation methods include the VAE (Variational Auto-Encoder), DRAW (Deep Recurrent Attentive Writer), and the generative adversarial network (GAN). Both VAE and DRAW perform image generation: the VAE models the data statistically and maximizes a lower bound on the data likelihood to generate an image, while DRAW uses a recurrent neural network with an attention mechanism, focusing on one part of the generated object at each step, generating each patch in turn and superimposing the final result. GAN is the most popular text-to-image method in recent years; its main idea is to generate data by an adversarial process. It is based on a new generative-model framework consisting of a Generator, which attempts to generate realistic images that can trick the Discriminator, and a Discriminator, which attempts to distinguish real images from generated images. The generator uses a neural network to generate the desired data, i.e., it obtains training samples and trains a model that can generate data according to the defined target data distribution. For many tasks, the generator can map one input to multiple correct outputs while filling in missing information; for example, given a two-dimensional image, the generator may be used to fill in more image information to produce a plausible three-dimensional image. Compared with other neural network models, GAN is attractive because it introduces an adversarial mechanism into the neural network. Most existing GAN models process the natural language text to obtain text features, which then act as constraints on the subsequent image generation process. The generator in the GAN generates an image according to the text features, the discriminator evaluates the generation effect, the generator produces a more realistic image according to the discriminator's judgment, the discriminator evaluates the new image again, and these steps repeat until a high-quality image is generated.
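For illustration, the generator-discriminator game described above can be sketched in a few lines of PyTorch. This is a minimal sketch of the adversarial principle only; the fully connected module shapes, dimensions, and optimizer settings are illustrative assumptions, not the networks of the present invention.

import torch
import torch.nn as nn

z_dim, text_dim, img_dim = 100, 128, 64 * 64 * 3

generator = nn.Sequential(                      # maps noise + text features to an image
    nn.Linear(z_dim + text_dim, 512), nn.ReLU(),
    nn.Linear(512, img_dim), nn.Tanh())
discriminator = nn.Sequential(                  # scores an (image, text) pair as real/fake
    nn.Linear(img_dim + text_dim, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 1))

bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_img, text_feat):
    batch = real_img.size(0)
    z = torch.randn(batch, z_dim)
    fake_img = generator(torch.cat([z, text_feat], dim=1))

    # Discriminator: push real (image, text) pairs toward 1, generated ones toward 0.
    d_real = discriminator(torch.cat([real_img, text_feat], dim=1))
    d_fake = discriminator(torch.cat([fake_img.detach(), text_feat], dim=1))
    loss_d = bce(d_real, torch.ones(batch, 1)) + bce(d_fake, torch.zeros(batch, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: fool the discriminator into scoring generated images as real.
    d_fake = discriminator(torch.cat([fake_img, text_feat], dim=1))
    loss_g = bce(d_fake, torch.ones(batch, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

In each step the discriminator is pushed to separate real from generated images while the generator is pushed to close that gap, which is the repeated cycle described above.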
In recent years, a variety of GAN variants have been applied to the text-to-image task; typically, methods based on conditional generative adversarial networks can generate images semantically related to a text by taking the given text description as a condition. Although GAN has enjoyed significant success in generating images from textual descriptions, there are still some deficiencies. First, textual descriptions containing complex entity relationships are not handled well: most existing methods generate images for simple text descriptions (a sentence with simple entities and relations) and cannot generate rich and harmonious images from complex text descriptions (a sentence or paragraph with complex entities and relations). Second, the resolution of the generated images needs improvement: existing methods can only generate blurry low-resolution images and cannot generate clear high-resolution images with textures, edges, and the like. Third, the quality of local detail in the generated images needs improvement: existing methods mostly use a global text description vector as input and pay little attention to the local entities in the text description, so the local details of each entity in the generated image cannot be handled well.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and to provide a task-oriented text-generated image network model that generates high-quality images consistent with the text semantics and rich and harmonious in content, thereby solving the prior-art problems of poor ability to handle multiple entities and the complex relationships among them, low resolution of the generated images, and poor quality of details such as textures and edges.
In order to solve the above technical problems, the technical scheme adopted by the invention is as follows. A task-oriented text-generated image network model comprises a task-oriented common sense reasoning module, a text-based feature expression module, a global generation model, and a local refinement model, and is characterized in that: the common sense reasoning module combines the natural language text description, the natural language task description, and a common sense knowledge base to carry out task-oriented common sense reasoning, reasonably enriching and expanding the entities and entity attributes in the text description to obtain an expanded text description and feature expression; the text-based feature expression module, by analyzing and processing the text description, constructs an entity relationship scene graph for the natural language text description and extracts a semantic vector for each entity attribute in it; the global generation model inputs the entity relationship scene graph constructed by the text-based feature expression module into a graph convolution network (GCN) to generate embedded vectors containing each entity and its relationships, inputs these embedded vectors into a mask regression network and a bounding box regression network to estimate the segmentation mask and bounding box of each entity, fuses all entity layouts to form a scene layout, and then generates a low-resolution image with a convolutional neural network; the local refinement model combines the feature maps of the low-resolution image generated by the global generation model with the semantic vectors of the entity attributes and uses a recurrent residual refinement network (RRRN) to optimize the local details of each entity, taking the embedded vectors of each entity and its attributes together with the image generated by the previous refinement network as the input of the recurrent residual refinement network; refined images of steadily increasing resolution are generated, the refined image produced by each residual refinement network corresponds to one discriminator, and after end-to-end training a high-quality image with rich and harmonious content is generated.
The invention has the beneficial effects that:
(1) Complex entity relationships in the textual description can be processed accurately. For a given natural language text description, the invention first carries out task-oriented common sense reasoning: it combines the text description, the task description (the type, scene, entities, and so on of the task), and a common sense knowledge base to perform common sense reasoning, reasonably enriches and expands the entities and entity attributes in the text description, and comprehensively considers the global information of the text description (the entity relationship scene graph) and its local information (the semantic vector of each entity attribute) to obtain an expanded text description and feature expression. The complex entity relationships in the text description can thus be processed accurately; even for complex text descriptions, high-quality images covering all entities can still be generated.
(2) The richness, harmony, and realism of the generated image are high. The invention adopts a task-oriented common sense reasoning module that combines the text description, the task description (the type, scene, entities, and so on of the task), and a common sense knowledge base to perform common sense reasoning and reasonably enrich the entities and entity attributes in the text description; this enhances the scene graph and the entity attributes obtained by the text-based feature expression module, optimizes the background, the entities, and the entity attributes in the generated image, improves the richness, harmony, and realism of the generated image, and can also generate images whose overall outline and style are consistent with real samples.
(3) End-to-end training with good model stability and overall fit. Aiming at the insufficient diversity caused by the non-uniform distribution of images generated by text-to-image models, the invention combines a stacked text-to-image generative adversarial network model with local-global mutual information maximization: the generation model decouples the global vector to obtain feature maps of different scales and strengthens the correlation between the global image features and the text description by maximizing the mutual information between these feature maps and the global vector; the feature maps are then reduced to local position feature vectors, and the correlation between the local position features and the text description is strengthened by maximizing the average mutual information between the local position feature vectors and the global vector, yielding a tighter text-to-image mapping. The text-based feature expression module analyzes and processes the text description to construct the entity relationship scene graph and extract the semantic vector of each entity attribute; the global generation model inputs the constructed entity relationship scene graph and entity attribute semantic vectors into a graph convolution network (GCN) to generate embedded vectors containing each entity and its relationships, inputs the embedded vectors into a mask regression network and a bounding box regression network to estimate the segmentation mask and bounding box of each entity, fuses all entities to form a scene layout, and then generates a low-resolution image with a convolutional neural network; and the local refinement model combines the feature maps of the low-resolution image generated by the global generation model with the semantic vectors of the entity attributes and uses a recurrent residual refinement network to optimize the local details of each entity. The whole network model supports end-to-end training and validation, improving the stability and overall fit of the model. Experimental analysis and results show that the diversity and semantic accuracy of the generated images can be effectively improved.
(4) The generated image has good local detail and high resolution. Aiming at the problem that the discriminator in a generative adversarial network converges too fast to provide gradients for the generator, which makes the diversity and quality of the generated images hard to improve, the invention combines the feature maps of the low-resolution image generated by the global generation model with the semantic vectors of the entity attributes and uses a recurrent residual refinement network (RRRN) to optimize the local details of each entity. The recurrent residual refinement network is used in the local refinement model, taking the embedded vectors of each entity and its attributes together with the image generated by the previous refinement network as input, and generates refined images of steadily increasing resolution; the refined image generated by each residual refinement network corresponds to one discriminator, so the local details of each entity are continuously optimized and images of progressively increasing resolution are generated at different stages, improving the quality of the generated images. The method offers a clear improvement in alleviating edge blurring and unclear local textures in the generated images, making them closer to real images.
Drawings
FIG. 1 is a schematic diagram of the structure of a task-oriented text-to-image network model of the present invention;
FIG. 2 is a schematic diagram of the working principle of the task-oriented common sense inference module of FIG. 1;
FIG. 3 is a schematic diagram of the working principle of the text-based feature expression module of FIG. 1;
FIG. 4 is a schematic diagram of the working principle of the global generative model of FIG. 1;
FIG. 5 is a schematic diagram of the working principle of the local refinement model of FIG. 1;
fig. 6 is a schematic diagram of the operation of the residual convolution block of fig. 5.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.
Detailed Description
In the preferred embodiment described below, as shown in FIG. 1, a task-oriented text-generated image network model comprises a task-oriented common sense reasoning module, a text-based feature expression module, a global generation model, and a local refinement model, wherein: the task-oriented common sense reasoning module combines the natural language text description, the natural language task description, and a common sense knowledge base to carry out task-oriented common sense reasoning, reasonably enriching and expanding the entities and entity attributes in the text description to obtain an expanded text description and feature expression; the common sense reasoning over the task description (the type, scene, entities, and so on of the task) and the common sense knowledge base reasonably enriches the entities and entity attributes in the natural language text description so as to enhance the harmony, diversity, and realism of the scene graph and the entity attributes obtained by the text-based feature expression module. The feature expression module, by analyzing and processing the text description, constructs an entity relationship scene graph for the natural language text description and extracts a semantic vector for each entity attribute in it, corresponding to the text-based feature expression module in the figure; the entity relationship scene graph gives the positional relationships among the entities in the natural language text description, and the entity attribute semantic vectors give the semantic vectors of each entity and its attributes. The global generation model inputs the constructed entity relationship scene graph into a graph convolution network (GCN) to generate embedded vectors containing each entity and its relationships, inputs the embedded vectors into a mask regression network and a bounding box regression network to estimate the segmentation mask and bounding box of each entity, fuses all entities to form a scene layout, and then generates a low-resolution image with a convolutional neural network. The local refinement model combines the feature maps of the low-resolution image generated by the global generation model with the semantic vectors of the entity attributes and uses a recurrent residual refinement network (RRRN) to optimize the local details of each entity; the embedded vectors of each entity and its attributes, together with the image generated by the previous refinement network, are the input of the recurrent residual refinement network, refined images of steadily increasing resolution are generated, the refined image produced by each residual refinement network corresponds to one discriminator, and a high-quality image with rich and harmonious content is generated. The global generation model comprises one generator-discriminator pair and the local refinement model comprises two generator-discriminator pairs; each generator generates images of a given resolution that are as realistic as possible, and each discriminator distinguishes real (ground-truth) images from generated images as far as possible.
During training, for the whole image or an object in the image, the discriminator introduces word-vector constraints to change the condition vectors of the sub-generators at each layer of the network, and the loss function is expanded and adjusted correspondingly to the generator's loss function. A cross-entropy loss is used, and the cross-entropy loss, pixel loss, bounding-box loss, and segmentation-mask loss of the whole image or of objects in the image are combined for training on the CUB-Birds, Oxford-102, or COCO-Stuff data sets; finally, the generators at different levels generate images of the corresponding text at different scales, with clearer and more vivid edge details and local textures. Jointly training the generators and discriminators makes the generated image distributions at multiple levels approach the real image distribution together, which increases the variance of the generated images and therefore their diversity.
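The combination of losses described above can be written as a weighted sum. The following sketch assumes sigmoid-normalized masks, an L1 pixel term, and illustrative lambda_* weights; the exact terms and weights are not specified in this embodiment.

import torch
import torch.nn.functional as F

def total_generator_loss(d_logits_fake, fake_img, real_img,
                         pred_boxes, gt_boxes, pred_masks, gt_masks,
                         lambda_pix=1.0, lambda_box=10.0, lambda_mask=1.0):
    # Adversarial term: cross-entropy of the discriminator logits against the "real" label.
    adv = F.binary_cross_entropy_with_logits(
        d_logits_fake, torch.ones_like(d_logits_fake))
    pix = F.l1_loss(fake_img, real_img)                   # pixel loss
    box = F.mse_loss(pred_boxes, gt_boxes)                # bounding-box regression loss
    mask = F.binary_cross_entropy(pred_masks, gt_masks)   # segmentation-mask loss
    return adv + lambda_pix * pix + lambda_box * box + lambda_mask * mask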
As shown in FIG. 2, in this embodiment the task-oriented common sense reasoning module constructs a text description semantic vector and a task description semantic vector for the given natural language text description and natural language task description, respectively, and connects them to obtain a joint semantic vector. This semantic vector is then input into a common sense reasoning model, which performs single-step or multi-step reasoning in combination with the common sense knowledge base; the common sense reasoning model may be based on traditional rule reasoning, distributed representation reasoning, neural network reasoning, or a hybrid of these methods. The semantic vector after common sense reasoning is obtained from the inference and output as a text description in which the entities or attributes have been enriched and expanded.
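A minimal sketch of this vector connection step, assuming illustrative dimensions and a placeholder multilayer perceptron standing in for whichever reasoning model (rule-based, distributed, neural, or hybrid) is chosen:

import torch
import torch.nn as nn

text_vec = torch.randn(1, 256)   # semantic vector of the text description
task_vec = torch.randn(1, 128)   # semantic vector of the task description
joint = torch.cat([text_vec, task_vec], dim=1)  # connected (joint) semantic vector

reasoner = nn.Sequential(        # placeholder for the common sense reasoning model
    nn.Linear(256 + 128, 512), nn.ReLU(),
    nn.Linear(512, 256))
enriched = reasoner(joint)       # semantic vector after common sense reasoning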
As shown in FIG. 3, in this embodiment the feature expression module takes the enriched and expanded text description as input, segments the natural language text description into words using a dictionary-based, statistics-based, rule-based, tagging-based, or understanding-based word segmentation method, removes stop words to obtain the word sequence of a given text description s, and constructs a word list, where entities are represented in the form of a word list and the relationships between entities and the attributes of entities are represented in the form of tuples. The feature expression module then performs dependency analysis with a dependency parsing algorithm over the word sequence and the text description, constructs a dependency tree, extracts the relationships among the entities from the generated dependency tree, and constructs the entity relationship scene graph; it also extracts the entities and their corresponding attributes from the text description and inputs them into a bidirectional Long Short-Term Memory (LSTM) neural network model to obtain feature representations of the entity attributes in the text description, yielding the semantic vectors of the entity attributes. The output is the constructed entity relationship scene graph and the extracted, serially connected semantic vectors of the entity attributes.
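As a rough illustration of the dependency-based relation extraction step, the following sketch uses spaCy as a stand-in dependency parser; the embodiment does not name a specific parser, and the small English model is assumed to be installed.

import spacy

nlp = spacy.load("en_core_web_sm")

def extract_triples(sentence):
    """Collect (subject, relation, object) triples from subject/object dependency arcs."""
    doc = nlp(sentence)
    triples = []
    for token in doc:
        if token.dep_ == "nsubj":                      # subject of a verb
            verb = token.head
            for child in verb.children:
                if child.dep_ in ("dobj", "attr"):     # direct object or attribute
                    triples.append((token.text, verb.lemma_, child.text))
    return triples

print(extract_triples("A man rides a horse beside a lake."))
# e.g. [('man', 'ride', 'horse')]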
In an optional embodiment, the entity relationship scene graph may have two types of nodes, entity nodes and relationship nodes. The feature expression module sets these as an entity class set O and a relationship type set R; each node aggregates the feature information of its neighboring nodes and applies a nonlinear transformation after the aggregation. A given text description s is parsed into an entity relationship scene graph G(s) = <O(s), R(s)>, where O(s) = {O1(s), O2(s), …, Oi(s)}, Oi(s) belongs to the entity class set O and represents the i-th entity in the sentence or paragraph s, and R(s) ⊆ O(s) × R × O(s) represents the relationships between pairs of entities in s. The tuple corresponding to the extracted entity attributes can be represented as <O(s), A(s)>, where A(s) is the set of attribute types in the sentence or paragraph s.
In the bidirectional LSTM model, each word corresponds to two hidden states, one for each direction; the two hidden states are connected to represent the semantics of the word, so a feature matrix of each entity attribute can be obtained, and at the same time the last hidden states of the two directions of the bidirectional LSTM model are connected in series to serve as the semantic vector of the entity attributes in the text description.
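A minimal PyTorch sketch of this bidirectional LSTM encoding, with illustrative vocabulary size and dimensions:

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 5000, 128, 256
embed = nn.Embedding(vocab_size, embed_dim)
bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

word_ids = torch.randint(0, vocab_size, (1, 7))   # one tokenized attribute phrase
outputs, (h_n, _) = bilstm(embed(word_ids))

# outputs: (1, 7, 2*hidden_dim) -- per-word forward/backward states, connected
per_word_features = outputs
# h_n: (2, 1, hidden_dim) -- last hidden state of each direction; connect in series
attribute_vector = torch.cat([h_n[0], h_n[1]], dim=1)   # (1, 2*hidden_dim)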
As shown in FIG. 4, in this embodiment the global generation model inputs the entity relationship scene graph constructed by the text-based feature expression module into a graph convolution network, generating embedded vectors for each entity and its relationships; the embedded vectors contain the information of all entities and their relationships in the scene graph.
First, in the global generation model, the entity relationship scene graph, whose nodes and edges as constructed by the text-based feature expression module all have dimension Di, is input into the graph convolution network. Each graph convolution layer propagates information along the edges of the graph; the same graph convolution layer applies the same functions to all nodes and edges of the input graph, and uses functions over the neighborhoods corresponding to the inputs of all nodes and edges to compute each entity and its relationship information with dimension Do, allowing a single layer to operate on an input graph of arbitrary shape; this may be implemented by a multilayer perceptron network. Then the embedded vector Vi of each entity is input into the mask regression network and the bounding box regression network to estimate each entity's segmentation mask Mi and bounding box Bi; the mask regression network uses multiple deconvolution layers to obtain a binary segmentation mask of dimension m × m for each entity, and the bounding box regression network uses a multilayer perceptron network to estimate the entity's bounding box Bi = (x, y, w, h), where (x, y) is the center of the bounding box and w and h are its width and height, respectively. After the entity embedded vector Vi is multiplied element-wise with the entity's binary segmentation mask Mi, bilinear interpolation is used to fuse the result into the bounding box Bi to represent the layout of the entity. After the entity layouts are obtained in this way, the matrix elements corresponding to the entities are added to obtain an overall scene layout matrix. Finally, the overall scene layout matrix and random Gaussian noise are connected channel-wise and input into a convolutional neural network to generate an initial image with a resolution of 32 × 32; the convolutional neural network first uses three groups of a 3 × 3 convolutional layer and an upsampling layer to repeatedly double the resolution of the scene layout matrix features, with the upsampling layers implemented by bilinear interpolation, and then uses two 3 × 3 convolutional layers to complete the generation of the image.
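The two regression heads described above can be sketched as follows: the mask side upsamples an entity embedding to an m × m mask with deconvolution (transposed convolution) layers, and the box side is a small multilayer perceptron predicting (x, y, w, h). The layer counts and dimensions here are illustrative assumptions.

import torch
import torch.nn as nn

D_o, m = 128, 16   # entity embedding dimension and mask size (illustrative)

mask_net = nn.Sequential(                   # D_o-dim vector -> m x m soft mask
    nn.Unflatten(1, (D_o, 1, 1)),
    nn.ConvTranspose2d(D_o, 64, 4),         # 1x1 -> 4x4
    nn.ReLU(),
    nn.ConvTranspose2d(64, 1, 4, stride=4), # 4x4 -> 16x16
    nn.Sigmoid())

box_net = nn.Sequential(                    # D_o-dim vector -> (x, y, w, h)
    nn.Linear(D_o, 64), nn.ReLU(),
    nn.Linear(64, 4))

v = torch.randn(3, D_o)                     # embedded vectors for 3 entities
masks = mask_net(v)                         # (3, 1, 16, 16) segmentation masks
boxes = box_net(v)                          # (3, 4) bounding boxes

In the embodiment the masked embedding is then warped into the predicted bounding box by bilinear interpolation and the per-entity layouts are summed into the scene layout matrix; that compositing step is omitted here.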
As shown in FIG. 5, in this embodiment the input of the local refinement module is the semantic vectors of the entity attributes in the text description and the feature maps of the initially generated image with resolution 32 × 32, and the output is a high-quality image (with a resolution of 512 × 512) with rich and harmonious content. First, the generator inputs the entity attribute semantic vectors of the text description into a fully connected layer and a reshape layer to construct a feature map of the same scale as the initial image feature map, and a concat layer connects the two channel-wise to obtain a new feature map. The new feature map is then input into a recurrent residual refinement network formed by two residual refinement networks of identical structure, in which the quality of the local edges, textures, colors, and so on of the initial image is gradually improved; each residual refinement network consists of two groups each comprising a 3 × 3 convolutional layer, residual convolution blocks, and an upsampling layer, followed by two 3 × 3 convolutional layers used for image generation, producing the final high-quality image. In the training process these parts correspond to the generators in the local refinement module; the generated images of different resolutions are each connected to a discriminator that evaluates the difference between the generated image and the real image. The discriminators respectively downsample the generated image with resolution 128 × 128, the generated image with resolution 512 × 512, and the ground-truth images, and the evaluation result is returned through the discriminator's score to the generators and the recurrent residual refinement network to improve the generators' generation capability.
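The conditioning step at the start of the refinement generator (fully connected layer, reshape, channel-wise concat) can be sketched as follows, with illustrative dimensions:

import torch
import torch.nn as nn

attr_dim, feat_ch, H, W = 256, 64, 32, 32
fc = nn.Linear(attr_dim, feat_ch * H * W)          # fully connected layer

attr_vec = torch.randn(1, attr_dim)                # entity attribute semantic vector
img_feat = torch.randn(1, feat_ch, H, W)           # initial image feature map

attr_feat = fc(attr_vec).reshape(1, feat_ch, H, W) # reshape layer
fused = torch.cat([img_feat, attr_feat], dim=1)    # concat layer, (1, 128, 32, 32)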
As shown in FIG. 6, in this embodiment, for the feature map input from the 3 × 3 convolutional layer, the residual convolution block first uses a 1 × 1 convolutional layer to adjust the number of feature map channels; then one branch of the residual convolution block is processed by two 3 × 3 convolutional layers with a rectified linear unit (ReLU) activation layer in between, and the feature map obtained after this processing and the feature map obtained after the channel adjustment are added element-wise by an accumulation layer to give the feature map output by the residual convolution block.
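A minimal PyTorch sketch of this residual convolution block, with illustrative channel counts:

import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.adjust = nn.Conv2d(in_ch, out_ch, kernel_size=1)   # 1x1 channel adjustment
        self.branch = nn.Sequential(                            # two 3x3 convs with ReLU between
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1))

    def forward(self, x):
        x = self.adjust(x)
        return x + self.branch(x)   # element-wise accumulation of the two branches

block = ResidualConvBlock(128, 64)
out = block(torch.randn(1, 128, 32, 32))   # (1, 64, 32, 32)

The 1 × 1 adjustment keeps the identity branch and the convolution branch at the same channel count so the element-wise addition is well-defined.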
The foregoing is directed to the preferred embodiment of the present invention. It is noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit and substance of the invention, and these modifications and improvements are also considered to be within the scope of the invention.

Claims (10)

1. A task-oriented text-generating image network model, comprising a task-oriented common sense reasoning module, a text-based feature expression module, a global generation model, and a local refinement model, characterized in that: the common sense reasoning module combines the natural language text description, the natural language task description, and a common sense knowledge base to carry out task-oriented common sense reasoning, reasonably enriching and expanding the entities and entity attributes in the text description to obtain an expanded text description and feature expression; the text-based feature expression module, by analyzing and processing the text description, constructs an entity relationship scene graph for the natural language text description and extracts a semantic vector for each entity attribute in it; the global generation model inputs the entity relationship scene graph constructed by the text-based feature expression module into a graph convolution network (GCN) to generate embedded vectors containing each entity and its relationships, inputs the embedded vectors into a mask regression network and a bounding box regression network to estimate the segmentation mask and bounding box of each entity, fuses all entity layouts to form a scene layout, and then generates a low-resolution image with a convolutional neural network; the local refinement model combines the feature maps of the low-resolution image generated by the global generation model with the semantic vectors of the entity attributes, uses a recurrent residual refinement network (RRRN) to optimize the local details of each entity, and takes the embedded vectors of each entity and its attributes together with the image generated by the previous refinement network as the input of the recurrent residual refinement network; refined images of steadily increasing resolution can be generated, the refined image produced by each residual refinement network corresponds to one discriminator, and after end-to-end training a high-quality image with rich and harmonious content is generated.
2. The task-oriented text-generating image network model of claim 1, wherein: the global generation model comprises one generator-discriminator pair and the local refinement model comprises two generator-discriminator pairs; each generator generates images of a given resolution that are as realistic as possible, and each discriminator distinguishes real images from generated images; during training, for the whole image or an object in the image, the discriminator introduces word-vector constraints to change the condition vectors of the sub-generators at each layer of the network, and the loss function is expanded and adjusted correspondingly to the generator's loss function; a cross-entropy loss is used, and the cross-entropy loss, pixel loss, bounding-box loss, and segmentation-mask loss of the whole image or of objects in the image are combined for training on the CUB-Birds, Oxford-102, or COCO-Stuff data sets; finally, the generators at different levels generate images of the corresponding text at different scales.
3. The task-oriented text-generating image network model of claim 1, wherein: the task-oriented common sense reasoning module constructs a text description semantic vector and a task description semantic vector for the given natural language text description and natural language task description, respectively, and connects them to obtain a joint semantic vector; this semantic vector is then input into a common sense reasoning model.
4. The task-oriented text-generating image network model of claim 1, wherein: the feature expression module takes the enriched and expanded text description as input, segments the natural language text description into words using a dictionary-based, statistics-based, rule-based, tagging-based, or understanding-based word segmentation method, removes stop words to obtain the word sequence of a given text description s, and constructs a word list, where entities are represented in the form of a word list and the relationships between entities and the attributes of entities are represented in the form of tuples.
5. The task-oriented text-generating image network model of claim 1, wherein: the feature expression module performs dependency analysis with a dependency parsing algorithm over the word sequence and the text description, constructs a dependency tree, extracts the relationships among the entities from the generated dependency tree, and constructs the entity relationship scene graph; it extracts the entities and their corresponding attributes from the text description and inputs them into a bidirectional Long Short-Term Memory (LSTM) neural network model to obtain feature representations of the entity attributes in the text description, yielding the semantic vectors of the entity attributes; the output is the constructed entity relationship scene graph and the extracted, serially connected semantic vectors of the entity attributes.
6. The task-oriented text-generating image network model of claim 1, wherein: the feature expression module sets the entity nodes and relationship nodes as an entity class set O and a relationship type set R, respectively; each node aggregates the feature information of its neighboring nodes and applies a nonlinear transformation after the aggregation, and a given text description s is parsed into an entity relationship scene graph G(s) = <O(s), R(s)>, where O(s) = {O1(s), O2(s), …, Oi(s)}, Oi(s) belongs to the entity class set O and represents the i-th entity in the sentence or paragraph s, and R(s) ⊆ O(s) × R × O(s) represents the relationships between pairs of entities in s; the tuple corresponding to the extracted entity attributes can be represented as <O(s), A(s)>, where A(s) is the set of attribute types in the sentence or paragraph s.
7. The task-oriented text-generating image network model of claim 1, wherein: in the bidirectional LSTM model, each word corresponds to two hidden states, one for each direction; the two hidden states are connected to represent the semantics of the word, a feature matrix of each entity attribute is obtained, and at the same time the last hidden states of the two directions of the bidirectional LSTM model are connected in series to serve as the semantic vector of the entity attributes in the text description.
8. The task-oriented text-generating image network model of claim 1, wherein: the global generation model inputs the entity relationship scene graph constructed by the text-based feature expression module into a graph convolution network to generate embedded vectors for each entity and its relationships, the embedded vectors containing the information of all entities and their relationships in the scene graph.
9. The task-oriented text-generating image network model of claim 1, wherein: first, in the global generation model, the entity relationship scene graph, whose nodes and edges as constructed by the text-based feature expression module all have dimension Di, is input into the graph convolution network; each graph convolution layer propagates information along the edges of the graph, the same graph convolution layer applies the same functions to all nodes and edges of the input graph, and functions over the neighborhoods corresponding to the inputs of all nodes and edges are used to compute each entity and its relationship information with dimension Do; the embedded vector Vi of each entity is input into the mask regression network and the bounding box regression network to estimate each entity's segmentation mask Mi and bounding box Bi, the mask regression network using multiple deconvolution layers to obtain a binary segmentation mask of dimension m × m for each entity and the bounding box regression network using a multilayer perceptron network to estimate the entity's bounding box (x, y, w, h); after the entity embedded vector Vi is multiplied element-wise with the entity's binary segmentation mask Mi, bilinear interpolation is used to fuse the result into the bounding box Bi to represent the entity layout, and after the entity layouts are obtained, the matrix elements corresponding to the entities are added to obtain an overall scene layout matrix; finally, the overall scene layout matrix and random Gaussian noise are connected channel-wise and input into a convolutional neural network to generate an initial image with a resolution of 32 × 32, the convolutional neural network first using three groups of a 3 × 3 convolutional layer and an upsampling layer to repeatedly double the resolution of the scene layout matrix features, the upsampling layers being implemented by bilinear interpolation, and then using two 3 × 3 convolutional layers to complete the generation of the image, where (x, y) is the center of the bounding box and w and h are its width and height, respectively.
10. The task-oriented text-generating image network model of claim 1, wherein: the local refinement module first inputs the semantic vectors of the entity attributes in the text description into a fully connected layer and a reshape layer to construct a feature map of the same scale as the initial image feature map, and uses a concat layer to connect the constructed feature map with the initial image feature map channel-wise to obtain a new feature map; the new feature map is then input into a recurrent residual refinement network consisting of two residual refinement networks of identical structure, in which the quality of the local edges, textures, colors, and so on of the initial image is gradually improved; each residual refinement network consists of two groups each comprising a 3 × 3 convolutional layer, residual convolution blocks, and an upsampling layer, followed by two 3 × 3 convolutional layers, generating images with resolutions of 128 × 128 and 512 × 512, respectively; the images of different resolutions are each connected to a discriminator that evaluates the difference between the generated image and the real image, the discriminators respectively downsample the generated image with resolution 128 × 128, the generated image with resolution 512 × 512, and the ground-truth images, and the evaluation result is returned through the discriminator's score to the recurrent residual refinement network and the generators to improve their generation capability.
CN202010609005.XA 2020-06-29 2020-06-29 Task-oriented text-generated image network model Active CN111858954B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010609005.XA CN111858954B (en) 2020-06-29 2020-06-29 Task-oriented text-generated image network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010609005.XA CN111858954B (en) 2020-06-29 2020-06-29 Task-oriented text-generated image network model

Publications (2)

Publication Number Publication Date
CN111858954A (en) 2020-10-30
CN111858954B (en) 2022-12-13

Family

ID=72989184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010609005.XA Active CN111858954B (en) 2020-06-29 2020-06-29 Task-oriented text-generated image network model

Country Status (1)

Country Link
CN (1) CN111858954B (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101330A (en) * 2020-11-20 2020-12-18 北京沃东天骏信息技术有限公司 Image processing method, image processing apparatus, electronic device, and storage medium
CN112417539A (en) * 2020-11-16 2021-02-26 杭州群核信息技术有限公司 Method, device and system for designing house type based on language description
CN112634405A (en) * 2020-11-30 2021-04-09 南京大学 Image-text generation method for releasing crowd-sourcing task
CN112734881A (en) * 2020-12-01 2021-04-30 北京交通大学 Text synthesis image method and system based on significance scene graph analysis
CN112765956A (en) * 2021-01-22 2021-05-07 大连民族大学 Dependency syntax analysis method based on multi-task learning and application
CN113111329A (en) * 2021-06-11 2021-07-13 四川大学 Password dictionary generation method and system based on multi-sequence long-term and short-term memory network
CN113140019A (en) * 2021-05-13 2021-07-20 电子科技大学 Method for generating text-generated image of confrontation network based on fusion compensation
CN113254828A (en) * 2021-05-24 2021-08-13 北京邮电大学 Seamless multi-mode content mixing exhibition method based on nonlinear editing technology
CN113408619A (en) * 2021-06-21 2021-09-17 江苏苏云信息科技有限公司 Language model pre-training method and device
CN113793404A (en) * 2021-08-19 2021-12-14 西南科技大学 Artificially controllable image synthesis method based on text and outline
CN113837229A (en) * 2021-08-30 2021-12-24 厦门大学 Knowledge-driven text-to-image generation method
CN113934890A (en) * 2021-12-16 2022-01-14 之江实验室 Method and system for automatically generating scene video by characters
CN114240811A (en) * 2021-11-29 2022-03-25 浙江大学 Method for generating new image based on multiple images
CN114648681A (en) * 2022-05-20 2022-06-21 浪潮电子信息产业股份有限公司 Image generation method, device, equipment and medium
CN115527216A (en) * 2022-11-09 2022-12-27 中国矿业大学(北京) Text image generation method based on modulation fusion and generation countermeasure network
CN115984293A (en) * 2023-02-09 2023-04-18 中国科学院空天信息创新研究院 Spatial target segmentation network and method based on edge perception attention mechanism
CN116132756A (en) * 2023-01-06 2023-05-16 重庆大学 End-to-end video subtitle generating method based on deep learning
CN116152647A (en) * 2023-04-18 2023-05-23 中国科学技术大学 Scene graph generation method based on multi-round iteration strategy and difference perception
CN116797684A (en) * 2023-08-21 2023-09-22 腾讯科技(深圳)有限公司 Image generation method, device, electronic equipment and storage medium
CN116883528A (en) * 2023-06-12 2023-10-13 阿里巴巴(中国)有限公司 Image generation method and device
CN116992493A (en) * 2023-09-01 2023-11-03 滨州八爪鱼网络科技有限公司 Digital blind box generation method and system
WO2023246822A1 (en) * 2022-06-22 2023-12-28 华为技术有限公司 Image processing method and terminal device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0308234D0 (en) * 2002-04-20 2003-05-14 Virtual Mirrors Ltd Body models from scanned data
CN109885723A (en) * 2019-02-20 2019-06-14 腾讯科技(深圳)有限公司 A kind of generation method of video dynamic thumbnail, the method and device of model training
CN110472616A (en) * 2019-08-22 2019-11-19 腾讯科技(深圳)有限公司 Image-recognizing method, device, computer equipment and storage medium
CN110675329A (en) * 2019-08-06 2020-01-10 厦门大学 Image deblurring method based on visual semantic guidance
CN111340907A (en) * 2020-03-03 2020-06-26 曲阜师范大学 Text-to-image generation method of self-adaptive attribute and instance mask embedded graph

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0308234D0 (en) * 2002-04-20 2003-05-14 Virtual Mirrors Ltd Body models from scanned data
CN109885723A (en) * 2019-02-20 2019-06-14 腾讯科技(深圳)有限公司 A kind of generation method of video dynamic thumbnail, the method and device of model training
CN110675329A (en) * 2019-08-06 2020-01-10 厦门大学 Image deblurring method based on visual semantic guidance
CN110472616A (en) * 2019-08-22 2019-11-19 腾讯科技(深圳)有限公司 Image-recognizing method, device, computer equipment and storage medium
CN111340907A (en) * 2020-03-03 2020-06-26 曲阜师范大学 Text-to-image generation method of self-adaptive attribute and instance mask embedded graph

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JUSTIN JOHNSON: "Image Generation from Scene Graphs", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition *
SEBASTIAN SCHUSTER: "Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval", Association for Computational Linguistics *
WU HAOYU: "Research and Application of Text-Description-to-Image Generation Algorithms Based on Generative Adversarial Networks", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417539A (en) * 2020-11-16 2021-02-26 杭州群核信息技术有限公司 Method, device and system for designing house type based on language description
CN112417539B (en) * 2020-11-16 2023-10-03 杭州群核信息技术有限公司 House type design method, device and system based on language description
CN112101330B (en) * 2020-11-20 2021-04-30 北京沃东天骏信息技术有限公司 Image processing method, image processing apparatus, electronic device, and storage medium
CN112101330A (en) * 2020-11-20 2020-12-18 北京沃东天骏信息技术有限公司 Image processing method, image processing apparatus, electronic device, and storage medium
CN112634405A (en) * 2020-11-30 2021-04-09 南京大学 Image-text generation method for releasing crowd-sourcing task
CN112734881A (en) * 2020-12-01 2021-04-30 北京交通大学 Text synthesis image method and system based on significance scene graph analysis
CN112734881B (en) * 2020-12-01 2023-09-22 北京交通大学 Text synthesized image method and system based on saliency scene graph analysis
CN112765956A (en) * 2021-01-22 2021-05-07 大连民族大学 Dependency syntax analysis method based on multi-task learning and application
CN112765956B (en) * 2021-01-22 2023-06-20 大连民族大学 Dependency syntax analysis method based on multitask learning and application
CN113140019B (en) * 2021-05-13 2022-05-31 电子科技大学 Method for generating text-generated image of confrontation network based on fusion compensation
CN113140019A (en) * 2021-05-13 2021-07-20 电子科技大学 Method for generating text-generated image of confrontation network based on fusion compensation
CN113254828B (en) * 2021-05-24 2022-09-16 北京邮电大学 Seamless multi-mode content mixing exhibition method based on nonlinear editing technology
CN113254828A (en) * 2021-05-24 2021-08-13 北京邮电大学 Seamless multi-mode content mixing exhibition method based on nonlinear editing technology
CN113111329A (en) * 2021-06-11 2021-07-13 四川大学 Password dictionary generation method and system based on multi-sequence long-term and short-term memory network
CN113111329B (en) * 2021-06-11 2021-08-13 四川大学 Password dictionary generation method and system based on multi-sequence long-term and short-term memory network
CN113408619B (en) * 2021-06-21 2024-02-13 江苏苏云信息科技有限公司 Language model pre-training method and device
CN113408619A (en) * 2021-06-21 2021-09-17 江苏苏云信息科技有限公司 Language model pre-training method and device
CN113793404A (en) * 2021-08-19 2021-12-14 西南科技大学 Artificially controllable image synthesis method based on text and outline
CN113837229A (en) * 2021-08-30 2021-12-24 厦门大学 Knowledge-driven text-to-image generation method
CN113837229B (en) * 2021-08-30 2024-03-15 厦门大学 Knowledge-driven text-to-image generation method
CN114240811A (en) * 2021-11-29 2022-03-25 浙江大学 Method for generating new image based on multiple images
CN113934890A (en) * 2021-12-16 2022-01-14 之江实验室 Method and system for automatically generating scene video by characters
CN113934890B (en) * 2021-12-16 2022-04-15 之江实验室 Method and system for automatically generating scene video by characters
CN114648681A (en) * 2022-05-20 2022-06-21 浪潮电子信息产业股份有限公司 Image generation method, device, equipment and medium
CN114648681B (en) * 2022-05-20 2022-10-28 浪潮电子信息产业股份有限公司 Image generation method, device, equipment and medium
WO2023246822A1 (en) * 2022-06-22 2023-12-28 华为技术有限公司 Image processing method and terminal device
CN115527216B (en) * 2022-11-09 2023-05-23 中国矿业大学(北京) Text image generation method based on modulation fusion and antagonism network generation
CN115527216A (en) * 2022-11-09 2022-12-27 中国矿业大学(北京) Text image generation method based on modulation fusion and generation countermeasure network
CN116132756A (en) * 2023-01-06 2023-05-16 重庆大学 End-to-end video subtitle generating method based on deep learning
CN116132756B (en) * 2023-01-06 2024-05-03 重庆大学 End-to-end video subtitle generating method based on deep learning
CN115984293A (en) * 2023-02-09 2023-04-18 中国科学院空天信息创新研究院 Spatial target segmentation network and method based on edge perception attention mechanism
CN115984293B (en) * 2023-02-09 2023-11-07 中国科学院空天信息创新研究院 Spatial target segmentation network and method based on edge perception attention mechanism
CN116152647A (en) * 2023-04-18 2023-05-23 中国科学技术大学 Scene graph generation method based on multi-round iteration strategy and difference perception
CN116152647B (en) * 2023-04-18 2023-07-18 中国科学技术大学 Scene graph generation method based on multi-round iteration strategy and difference perception
CN116883528A (en) * 2023-06-12 2023-10-13 阿里巴巴(中国)有限公司 Image generation method and device
CN116797684B (en) * 2023-08-21 2024-01-05 腾讯科技(深圳)有限公司 Image generation method, device, electronic equipment and storage medium
CN116797684A (en) * 2023-08-21 2023-09-22 腾讯科技(深圳)有限公司 Image generation method, device, electronic equipment and storage medium
CN116992493A (en) * 2023-09-01 2023-11-03 滨州八爪鱼网络科技有限公司 Digital blind box generation method and system
CN116992493B (en) * 2023-09-01 2024-02-06 滨州八爪鱼网络科技有限公司 Digital blind box generation method and system

Also Published As

Publication number Publication date
CN111858954B (en) 2022-12-13

Similar Documents

Publication Publication Date Title
CN111858954B (en) Task-oriented text-generated image network model
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
CN110516096A (en) Synthesis perception digital picture search
Ni et al. Learning to photograph: A compositional perspective
CN110599592A (en) Three-dimensional indoor scene reconstruction method based on text
CN114419642A (en) Method, device and system for extracting key value pair information in document image
Xu et al. Multi-modal transformer with global-local alignment for composed query image retrieval
Zhang et al. Retargeting semantically-rich photos
CN113761105A (en) Text data processing method, device, equipment and medium
CN115131698A (en) Video attribute determination method, device, equipment and storage medium
Zhang et al. Online modeling of esthetic communities using deep perception graph analytics
CN116975615A (en) Task prediction method and device based on video multi-mode information
CN113240033B (en) Visual relation detection method and device based on scene graph high-order semantic structure
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
Li et al. KBHN: A knowledge-aware bi-hypergraph network based on visual-knowledge features fusion for teaching image annotation
CN116597267B (en) Image recognition method, device, computer equipment and storage medium
Yu et al. A novel multi-feature representation of images for heterogeneous IoTs
CN117312594A (en) Sketching mechanical part library retrieval method integrating double-scale features
Bi et al. C^2Net: a complementary co-saliency detection network
Xu et al. Estimating similarity of rich internet pages using visual information
CN114677569A (en) Character-image pair generation method and device based on feature decoupling
Li et al. Image aesthetic assessment using a saliency symbiosis network
Wang et al. Deep learning for font recognition and retrieval
CN117156078B (en) Video data processing method and device, electronic equipment and storage medium
CN116311275B (en) Text recognition method and system based on seq2seq language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant