CN112612900A - Knowledge graph guided multi-scene image generation method - Google Patents

Knowledge graph guided multi-scene image generation method

Info

Publication number: CN112612900A
Application number: CN202011434422.1A
Authority: CN (China)
Prior art keywords: knowledge, graph, matrix, layout, layer
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 肖贺文, 孔雨秋, 刘秀平, 尹宝才
Assignee (current and original): Dalian University of Technology
Application filed by Dalian University of Technology

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation

Abstract

The invention provides a knowledge graph guided multi-scene image generation method, belonging to the field of image generation. The method uses a knowledge graph to assist the image generation task: first, a knowledge graph containing object layout relations is constructed; then a group of object labels is input into the graph, and a layout search module produces multiple layout relation graphs that conform to the facts; finally, as each layout relation graph passes through the image generation module, the generator and discriminator are trained together with the object knowledge matrix and global knowledge vector obtained from the knowledge module, so as to generate a scene image corresponding to each relation graph. Using the knowledge graph, the method realizes a one-to-many task in which one group of labels generates multiple images, and the embedded knowledge representation information improves image generation quality. The method is evaluated on a real image dataset and shows improvement over state-of-the-art baselines.

Description

Knowledge graph guided multi-scene image generation method
Technical Field
The invention belongs to the field of image generation, and particularly relates to a method for generating a plurality of scene images guided by a knowledge graph.
Background
A knowledge graph is a database organized as triples, each storing entity information and the relation between entities. Among its applications, the KG2E method in the trans series is one of the classical knowledge representation methods: it embeds the entities and relations in the graph as high-dimensional Gaussian distributions, and during training uses KL divergence to make the distribution of the difference between the head and tail entities approach the distribution of the relation as closely as possible. Such a knowledge representation method can introduce the information in the graph into other models in distributed form, and it is the method the invention adopts to extract graph information.
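For reference, the KL-based KG2E scoring summarized above can be written compactly. This is a sketch of the standard formulation (He et al., 2015); the notation μ, Σ is ours rather than the patent's, and the sign convention for the entity difference follows the TransE-style reading:

```latex
% Sketch of the asymmetric (KL-based) KG2E energy; notation is ours.
% Entities and relations are Gaussians:
%   h ~ N(mu_h, Sigma_h),  t ~ N(mu_t, Sigma_t),  r ~ N(mu_r, Sigma_r)
\mathcal{P}_{e} = \mathcal{N}\!\left(\mu_t - \mu_h,\; \Sigma_t + \Sigma_h\right),
\qquad
E(h, r, t) = D_{\mathrm{KL}}\!\left(\mathcal{P}_{e} \,\Vert\, \mathcal{P}_{r}\right)
```

Training lowers this energy for observed triples, which is the "distribution approaches the distribution of the relation" behavior described above.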
Cross-modality conversion is a classic task in multi-modality learning; generating images from modalities such as text and sound belongs to this field. At present, image generation is mainly realized with generative adversarial networks. A generative adversarial network consists of a generator and a discriminator, whose specific design is determined by the task. The generator generally consists of a multilayer perceptron and a deep convolutional network: it takes a feature vector extracted from text or sound as input and outputs a generated image. The discriminator consists of a shallow convolutional network: it takes an image as input and outputs a real/fake score for the image, and can more finely output the category of the image. During training, the discriminator aims to give generated images a low score and real images a high score, playing the role of a critic; the generator aims to have its generated images judged high-scoring by the discriminator, passing the fake off as real. The generator and the discriminator are trained alternately, confronting each other, thereby ensuring the quality of the generated images.
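In code, the alternating scheme described above reduces to a pair of optimizer steps per batch. The following is a minimal PyTorch sketch; G, D, their optimizers, and the conditioning input `cond` are illustrative placeholders, not the networks defined later in this patent:

```python
import torch
import torch.nn as nn

def gan_step(G, D, opt_G, opt_D, real_images, cond):
    """One round of alternating adversarial training, as described above."""
    bce = nn.BCEWithLogitsLoss()

    # Discriminator step: real images should score high, generated ones low.
    opt_D.zero_grad()
    fake_images = G(cond).detach()  # block gradients into the generator
    d_real, d_fake = D(real_images), D(fake_images)
    loss_D = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))
    loss_D.backward()
    opt_D.step()

    # Generator step: generated images should be judged high-scoring.
    opt_G.zero_grad()
    d_fake = D(G(cond))
    loss_G = bce(d_fake, torch.ones_like(d_fake))
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```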
At present, most methods for synthesizing scene images from text suffer from the following problems: (1) the input text is usually a sentence, and requiring users to compose a sentence in order to generate an image is inconvenient in practical applications; (2) more than one image can conform to a given description, but most current methods can only realize a one-to-one generation task, perform poorly when generating complex scenes with many objects, and cannot produce good layouts; (3) text and images belong to different modalities, and the amount of information text can provide is insufficient to support the generation of high-quality images.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the shortcomings of text-to-image generation methods, the invention provides a knowledge graph guided multi-scene image generation method that takes labels as input, obtains layout relations from a knowledge graph to realize one-to-many generation, and adds knowledge information to the generative adversarial network to improve image generation quality.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for generating a plurality of scene images guided by a knowledge graph comprises the following steps:
Step S1: constructing a knowledge graph: extracting the required triples in the form (head entity, relation, tail entity) and integrating them into a small knowledge graph;
Step S2: inputting a group of object labels into a layout search module to obtain multiple layout relation graphs that conform to the facts;
Further, step S2 specifically comprises:
Step S21: inputting the object labels into the constructed knowledge graph for graph search, retrieving all triples containing relations between the input labels, and sorting the retrieved triples by occurrence frequency;
Step S22: according to the parameter settings, selecting the required number of triples to form the most probable layout relation graph, while generating multiple other different layout relation graphs by random combination;
Step S3: inputting each layout relation graph into a pre-trained knowledge module to obtain an object knowledge matrix and a global knowledge vector;
Further, step S3 specifically comprises:
Step S31: pre-training the knowledge graph with the classical knowledge representation method KG2E, expressing the knowledge representations corresponding to all objects and relations in the graph as different Gaussian distributions;
Step S32: performing data processing on the layout relation graph and decomposing it into in-graph object labels and in-graph relation labels;
Step S33: inputting the in-graph object labels and in-graph relation labels, and sampling from the pre-trained KG2E knowledge representations of objects and relations to obtain an object knowledge matrix and a relation knowledge matrix;
Step S34: the sum of the object knowledge matrix and the relation knowledge matrix is called the global knowledge matrix, which generates a global knowledge vector through a fully connected layer; the global knowledge vector represents the knowledge information that the whole layout relation graph extracts from the knowledge graph;
Step S4: adding the object knowledge matrix and the global knowledge vector to the generator;
Further, step S4 specifically comprises:
Step S41: initializing and embedding the in-graph object labels and in-graph relation labels obtained by decomposition to obtain an object initial matrix and a relation initial matrix;
Step S42: inputting the object and relation initial matrices into a graph convolution network 5 layers deep to obtain object and relation update matrices;
Step S43: connecting the object knowledge matrix output by the knowledge module with the object update matrix to obtain an object prediction matrix;
Step S44: the object prediction matrix generates the object bounding box positions through multilayer perceptron 1 and the object shape masks through multilayer perceptron 2, which are mapped and combined to generate the scene layout tensor;
Step S45: automatically expanding the global knowledge vector output by the knowledge module to the same size as the picture, connecting it with the scene layout tensor, and inputting the result into a cascade generation network to generate a scene image;
Step S5: adding the object knowledge matrix and the global knowledge vector to the discriminator;
Further, step S5 specifically comprises:
Step S51: when identifying different objects in the image, the object image slices obtained by data processing of the scene image and the object knowledge matrix are input together into convolutional neural network 1 to obtain the real/fake score and the object class prediction of each object image slice;
Step S52: when identifying the whole image, the scene image and the global knowledge vector are input together into convolutional neural network 2 to obtain the real/fake score of the image;
Step S6: training the generator and the discriminator alternately according to the overall loss function, ensuring both the generation quality of the whole image and that the object image slices conform to the categories of their labels. The trained generator is the tool that completes the generation from layout relation graph to scene image.
Compared with the prior art, the invention has the following beneficial effects:
(1) Unlike most methods, whose input is a sentence of text, the input of the invention is a set of selectable labels, which is more convenient for the user; (2) most methods can only complete a one-to-one generation task, but by introducing the constructed knowledge graph, the invention generates multiple layout relation graphs from one group of labels and thus multiple scene images, completing a one-to-many generation task while ensuring that each image has a reasonable layout; (3) in the generative adversarial network, knowledge information obtained from the graph by the knowledge representation method KG2E is added to the generator and the discriminator: the object knowledge matrix is added from the perspective of local objects, and the global knowledge vector from the perspective of global layout, making up for the insufficiency of text information and improving image generation quality. This is also the first application of knowledge-graph knowledge representation in the field of image generation.
Drawings
FIG. 1 shows the overall structure designed by the present invention.
FIG. 2 shows the layout search module designed by the present invention.
FIG. 3 shows the knowledge module designed by the present invention.
FIG. 4 shows the generator structure in the image generation module designed by the present invention.
FIG. 5 shows the discriminator structure in the image generation module designed by the present invention.
Detailed description of the invention
The technical solution of the present invention will be further described with reference to the following specific embodiments and accompanying drawings.
A method for generating a plurality of scene images guided by a knowledge graph comprises the following steps:
Step S1: extracting all triples (head entity, relation, tail entity) from the VG (Visual Genome) dataset, where the set of head and tail entities covers all label objects and the relations include words that can express object layout relations, such as 'adjacent', 'above', and 'behind'; all such triples are extracted and integrated into a small knowledge graph;
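A minimal sketch of this construction step follows, assuming the VG relationship annotations have already been parsed into (head, relation, tail) string triples; the relation whitelist is an illustrative assumption:

```python
from collections import Counter

# Illustrative layout relations; the patent names words such as
# "adjacent", "above", "behind" -- the exact whitelist is an assumption.
LAYOUT_RELATIONS = {"adjacent", "above", "behind", "below", "near", "inside"}

def build_small_kg(triples):
    """Integrate (head, relation, tail) triples into a small knowledge graph.

    Returns a Counter mapping each layout triple to its occurrence
    frequency, which the layout search module later uses for ranking.
    """
    kg = Counter()
    for head, rel, tail in triples:
        if rel in LAYOUT_RELATIONS:
            kg[(head, rel, tail)] += 1
    return kg
```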
Step S2: as shown in FIG. 2, a group of n object labels is input into the layout search module to obtain m layout relation graphs that conform to the facts;
Further, step S2 specifically comprises:
Step S21: inputting the n object labels into the constructed knowledge graph for graph search, retrieving all triples containing relations between the input labels, and sorting the retrieved triples by occurrence frequency from high to low;
Step S22: according to the parameter settings, selecting the required number of top-ranked triples to form the most probable layout relation graph for the n labels, while generating the other different layout relation graphs by random combination, yielding m layout relation graphs in total;
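Steps S21 and S22 might be sketched as follows, reusing the frequency Counter built in step S1; the function and parameter names are illustrative:

```python
import random

def layout_search(kg, labels, num_triples, m, seed=0):
    """Return m layout relation graphs (lists of triples) for the input
    labels, each conforming to the facts stored in `kg`."""
    label_set = set(labels)
    # All triples whose head and tail are both among the input labels,
    # sorted by occurrence frequency from high to low.
    candidates = sorted(
        ((t, c) for t, c in kg.items()
         if t[0] in label_set and t[2] in label_set),
        key=lambda tc: tc[1], reverse=True)
    triples = [t for t, _ in candidates]

    # Most probable layout: the top-ranked triples.
    layouts = [triples[:num_triples]]
    # Remaining layouts: random combinations of fact-conforming triples.
    rng = random.Random(seed)
    for _ in range(1000 * m):  # retry cap avoids spinning on tiny graphs
        if len(layouts) >= m or len(triples) < num_triples:
            break
        layout = rng.sample(triples, num_triples)
        if layout not in layouts:
            layouts.append(layout)
    return layouts
```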
Step S3: as shown in FIG. 3, each layout relation graph is input into the knowledge module pre-trained on the knowledge graph to obtain the corresponding object knowledge matrix and global knowledge vector;
Further, step S3 specifically comprises:
Step S31: pre-training the knowledge graph with the classical knowledge representation method KG2E to obtain d-dimensional Gaussian distributions (μ_i, σ_i), i = 1, …, N, corresponding to all N entities in the graph, and d-dimensional Gaussian distributions (μ_j, σ_j), j = 1, …, K, corresponding to all K relations in the graph; these are the KG2E knowledge representations of the objects and relations.
Step S32: and performing data processing on the layout relationship diagram, and decomposing the layout relationship diagram into an in-diagram object label and an in-diagram relationship label.
Step S33: inputting object labels and relation labels in the graph, sampling from KG2E knowledge representation of pre-trained objects and relations, and obtaining an object knowledge matrix Ok∈Rn×dAnd relation knowledge matrix Pk∈Rk×d. Wherein n is the number of object labels in the layout relational graph, k is the number of relational labels in the layout relational graph, and d is the embedding dimension represented by the map knowledge.
Step S34: knowledge matrix O of objectk∈Rn×dAnd relation knowledge matrix Pk∈Rk×dThe global knowledge matrix S is obtained by adding the column directionsk∈R1×dGenerating a global knowledge vector G through the full connection layerk∈Rd
Step S4: as shown in fig. 4, the object knowledge matrix and the global knowledge vector are added to the generator, and a layout relationship diagram is input to generate a scene image.
Further, the step S4 specifically includes:
Step S41: initializing and embedding the n in-graph object labels and k in-graph relation labels obtained by decomposition to obtain the object initial matrix O_o ∈ R^{n×d} and the relation initial matrix P_o ∈ R^{k×d}, where d is the embedding dimension, consistent with the embedding dimension of the knowledge module.
Step S42: initial matrix O of object and relationo∈Rn×dAnd Po∈Rk×dInput into the graph convolution networkTo obtain an object update matrix On∈Rn×dAnd relation initial matrix Pn∈Rk×dThe graph convolution network is formed by stacking 5 layers of same graph convolution blocks, and each block is formed by combining a full connection layer, a Relu layer, a full connection layer and a Relu layer in sequence.
Step S43: the object knowledge matrix O output in the knowledge module of step S3k∈Rn×dUpdate matrix O with objectn∈Rn×dConnected together according to the direction of the row to obtain an object prediction matrix Op∈Rn×2dAnd the knowledge information of each object is integrated into the generator.
Step S44: predicting the object with a matrix Op∈Rn×2dGenerating a value B epsilon of the position of an object frame through a multilayer perceptron 1n×4Generating an object shape mask M E R through a multilayer perceptron 2n×s×s×dMapping and combining the two, and setting a scene layout tensor L epsilon RH×W×dThe multilayer perceptron 1 is composed of a full connection layer, a Relu layer and a full connection layer, and the multilayer perceptron 2 is composed of an upper sampling layer, a BN layer, a volume data layer and a Relu layer which are stacked for 4 times in sequence. Object frame position B is corresponding to Rn×4In the drawing, n represents the number of objects in the drawing, and 4 represents the position values of the objects at the lower left corner, the lower right corner, the upper left corner and the lower right corner of the bounding box. M ∈ R in object shape maskn×s×s×dS represents the size of the object mask, d is the embedding dimension of the object matrix input, and the scene layout tensor L is the RH×W×dIn the above description, H represents the height of the scene image to be generated, and W represents the width of the scene image to be generated.
Step S45: global knowledge vector G output in the knowledge module of step S3k∈RdAutomatically extending dimensionality with the same size as the picture to obtain G'k∈RH×W×dAnd is associated with the scene layout tensor L ∈ RH×W×dConnected together, input into a cascade generation network to generate a scene image I e RH×W×3. The cascade generation network consists of 5 cascade generation modules, and the structure of each cascade generation module comprises an average pooling layer, an upsampling layer, a convolutional layer, a BN layer, a Relu layer, a convolutional layer, a BN layer and RelThe u layers are 8 layers in total.
Step S5: as shown in fig. 5, an object knowledge matrix and a global knowledge vector are added to the discriminator.
Further, the step S5 specifically includes:
Step S51: when identifying different objects in the image, the scene image I ∈ R^{H×W×3} is processed to obtain object image slices C' ∈ R^{n×L×L×3}, where L is the size of an image slice. The slices are connected with the object knowledge matrix O_k ∈ R^{n×d} and grouped along the first dimension, each group being the pair of an object image slice and its corresponding knowledge vector; the groups are input into convolutional neural network 1 to obtain the real/fake score and the object class prediction of each object image slice. Convolutional neural network 1 is structured as a convolutional layer, a BN layer, a ReLU layer, a convolutional layer, an average pooling layer, and a fully connected layer.
Step S52: when the whole image is identified, the scene image I belongs to RH×W×3And global knowledge vector Gk∈RdAnd simultaneously inputting the image into a convolutional neural network 2 to obtain the true and false scores of the image, wherein the convolutional neural network 2 has the structure of a convolutional layer, a BN layer, a Relu layer, a convolutional layer, a BN layer, a Relu layer and a convolutional layer.
Step S6: training the generator and discriminator alternately, minimizing the overall loss function:
L=λ1Lbox2Lpixel3LGAN4Limg-per5Lobj-per
wherein L isboxFor the L1 loss between the predicted object bounding box position and the real object bounding box, LpixelFor the L1 loss between the generated push image and the real image, LGANFor the generation of the generator and discriminator, Limg-perTo generate a loss of perception at the feature level of the image and the real image, Lobj-perFor generating a loss of perception at the feature level, λ, of an object slice of an image with an object slice of a real image1,λ2,λ3,λ4,λ5And the hyper-parameters are manually set in the training process. And the generator part after training is used for generating a scene image from the layout relation diagram.
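Assembling the overall loss is then a weighted sum; a trivial sketch, with placeholder λ defaults rather than the patent's values:

```python
def total_loss(l_box, l_pixel, l_gan, l_img_per, l_obj_per,
               lambdas=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the five loss terms above. The lambda values are
    illustrative; the patent sets them manually per training run."""
    terms = (l_box, l_pixel, l_gan, l_img_per, l_obj_per)
    return sum(lam * t for lam, t in zip(lambdas, terms))
```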
The above model, which extends sg2im as the baseline for the generator and discriminator in steps S4 and S5, is only a preferred embodiment of the present invention; all equivalent changes and modifications made within the scope claimed by the present invention should be covered by the present invention.

Claims (7)

1. A method for generating a plurality of scene images guided by a knowledge graph is characterized by comprising the following steps:
step S1: extracting required triples in the form of (head entity, relation and tail entity) and integrating the triples into a knowledge graph;
step S2: inputting a group of object tags into a layout searching module to obtain a plurality of layout relational graphs which accord with facts;
step S3: inputting each layout relation diagram into a pre-trained knowledge module to obtain an object knowledge matrix and a global knowledge vector;
step S4: adding an object knowledge matrix and a global knowledge vector into a generator;
step S5: adding an object knowledge matrix and a global knowledge vector into a discriminator;
step S6: training a generator and a discriminator alternately according to an overall loss function, ensuring both the generation quality of the whole image and that the object image slices conform to the categories of their labels; the obtained generator is the tool that completes the generation from the layout relation graph to the scene image.
2. The method for generating a plurality of scene images under knowledge-graph guidance according to claim 1, wherein the step S2 is specifically as follows:
step S21: inputting a group of object tags into the knowledge graph constructed in the step S1 for graph search, searching all triples containing the relationship between the input tags, and sequencing the searched triples according to the occurrence frequency;
step S22: according to the parameter settings, selecting the required number of triples to form the most probable layout relation graph, while generating multiple other different layout relation graphs by random combination.
3. The method for generating a plurality of scene images under knowledge-graph guidance according to claim 1 or 2, wherein the step S3 is specifically as follows:
step S31: pre-training the knowledge graph constructed in step S1 with the knowledge representation method KG2E, representing the knowledge representations corresponding to all objects and relations in the graph by different Gaussian distributions;
step S32: performing data processing on the layout relation diagram obtained in the step S2, and decomposing the layout relation diagram into an in-diagram object label and an in-diagram relation label;
step S33: inputting in-graph object labels and in-graph relation labels, and sampling from KG2E knowledge representation of pre-trained objects and relations to obtain an object knowledge matrix and a relation knowledge matrix;
step S34: the sum of the object knowledge matrix and the relation knowledge matrix is called the global knowledge matrix, which generates a global knowledge vector through a fully connected layer; the global knowledge vector represents the knowledge information that the whole layout relation graph extracts from the knowledge graph.
4. The method for generating a plurality of scene images under knowledge-graph guidance according to claim 3, wherein the step S4 is specifically as follows:
step S41: initializing and embedding the label of the object in the graph and the label of the relation in the graph obtained by decomposition to obtain an initial matrix of the object and an initial matrix of the relation;
step S42: inputting the initial matrix of the object and the relation into a graph convolution network with 5 layers of depth to respectively obtain an object updating matrix and a relation updating matrix;
step S43: connecting the object knowledge matrix output from the knowledge module with the object update matrix to obtain an object prediction matrix;
step S44: the object prediction matrix generates the object bounding box positions through multilayer perceptron 1 and the object shape masks through multilayer perceptron 2, and the object bounding box positions and object shape masks are mapped and combined to generate the scene layout tensor;
step S45: automatically expanding the global knowledge vector output by the knowledge module to the same size as the picture, connecting it with the scene layout tensor, and inputting the result into a cascade generation network to generate a scene image.
5. The method for generating a plurality of scene images under knowledge-graph guidance according to claim 4, wherein the step S5 is specifically as follows:
step S51: when identifying different objects in the image, the object image slices obtained by data processing of the scene image and the object knowledge matrix output by the knowledge module are input together into convolutional neural network 1 to obtain the real/fake score and the object class prediction of each object image slice;
step S52: when identifying the whole image, the scene image and the global knowledge vector output by the knowledge module are input together into convolutional neural network 2 to obtain the real/fake score of the image.
6. The method for generating a plurality of scene images guided by a knowledge graph according to claim 4, wherein multilayer perceptron 1 consists of a fully connected layer, a ReLU layer, and a fully connected layer, and multilayer perceptron 2 consists of an upsampling layer, a BN layer, a convolutional layer, and a ReLU layer stacked 4 times in sequence.
7. The method for generating a plurality of scene images guided by a knowledge graph according to claim 5, wherein convolutional neural network 1 is structured as a convolutional layer, a BN layer, a ReLU layer, a convolutional layer, an average pooling layer, and a fully connected layer; convolutional neural network 2 is structured as a convolutional layer, a BN layer, a ReLU layer, a convolutional layer, a BN layer, a ReLU layer, and a convolutional layer.
CN202011434422.1A 2020-12-10 2020-12-10 Knowledge graph guided multi-scene image generation method Pending CN112612900A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011434422.1A CN112612900A (en) 2020-12-10 2020-12-10 Knowledge graph guided multi-scene image generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011434422.1A CN112612900A (en) 2020-12-10 2020-12-10 Knowledge graph guided multi-scene image generation method

Publications (1)

Publication Number Publication Date
CN112612900A (en) 2021-04-06

Family

ID=75232566

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011434422.1A Pending CN112612900A (en) 2020-12-10 2020-12-10 Knowledge graph guided multi-scene image generation method

Country Status (1)

Country Link
CN (1) CN112612900A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113407645A (en) * 2021-05-19 2021-09-17 福建福清核电有限公司 Intelligent sound image archive compiling and researching method based on knowledge graph
CN114299194A (en) * 2021-12-23 2022-04-08 北京百度网讯科技有限公司 Training method of image generation model, image generation method and device


Similar Documents

Publication Publication Date Title
US9558268B2 (en) Method for semantically labeling an image of a scene using recursive context propagation
CN109344285B (en) Monitoring-oriented video map construction and mining method and equipment
CN109783666B (en) Image scene graph generation method based on iterative refinement
Alush et al. Ensemble segmentation using efficient integer linear programming
CN108875076B (en) Rapid trademark image retrieval method based on Attention mechanism and convolutional neural network
CN110188228A (en) Cross-module state search method based on Sketch Searching threedimensional model
CN112200266B (en) Network training method and device based on graph structure data and node classification method
CN109993102A (en) Similar face retrieval method, apparatus and storage medium
CN114419642A (en) Method, device and system for extracting key value pair information in document image
CN114419304A (en) Multi-modal document information extraction method based on graph neural network
Guo et al. Using multi-scale and hierarchical deep convolutional features for 3D semantic classification of TLS point clouds
CN112612900A (en) Knowledge graph guided multi-scene image generation method
CN111325237A (en) Image identification method based on attention interaction mechanism
CN113159067A (en) Fine-grained image identification method and device based on multi-grained local feature soft association aggregation
CN115098675A (en) Emotion triple generation method based on multi-class table filling
CN112861970A (en) Fine-grained image classification method based on feature fusion
CN113392244A (en) Three-dimensional model retrieval method and system based on depth measurement learning
CN113240033B (en) Visual relation detection method and device based on scene graph high-order semantic structure
CN110111365B (en) Training method and device based on deep learning and target tracking method and device
CN114387608B (en) Table structure identification method combining convolution and graph neural network
Varlik et al. Filtering airborne LIDAR data by using fully convolutional networks
Ma et al. Compound exemplar based object detection by incremental random forest
CN114998647A (en) Breast cancer full-size pathological image classification method based on attention multi-instance learning
Bakhtiarnia et al. PromptMix: Text-to-image diffusion models enhance the performance of lightweight networks
CN114170460A (en) Multi-mode fusion-based artwork classification method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination