CN113052784A - Image generation method based on multiple auxiliary information - Google Patents
Image generation method based on multiple auxiliary information
- Publication number
- CN113052784A CN113052784A CN202110301738.1A CN202110301738A CN113052784A CN 113052784 A CN113052784 A CN 113052784A CN 202110301738 A CN202110301738 A CN 202110301738A CN 113052784 A CN113052784 A CN 113052784A
- Authority
- CN
- China
- Prior art keywords
- image
- information
- text
- stage
- scene graph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 18
- 230000004927 fusion Effects 0.000 claims abstract description 18
- 239000013598 vector Substances 0.000 claims description 22
- 239000011159 matrix material Substances 0.000 claims description 12
- 238000005070 sampling Methods 0.000 claims description 11
- 238000012549 training Methods 0.000 claims description 11
- 230000006870 function Effects 0.000 claims description 4
- 238000002372 labelling Methods 0.000 claims description 4
- 238000000605 extraction Methods 0.000 claims description 3
- 238000011160 research Methods 0.000 description 5
- 230000008569 process Effects 0.000 description 4
- 238000013461 design Methods 0.000 description 3
- 238000004519 manufacturing process Methods 0.000 description 3
- 238000013473 artificial intelligence Methods 0.000 description 2
- 238000005034 decoration Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 239000003086 colorant Substances 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 230000003750 conditioning effect Effects 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000008520 organization Effects 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 230000001737 promoting effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10024—Color image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20212—Image combination
- G06T2207/20221—Image fusion; Image merging
Abstract
The invention belongs to the field of image generation within computer vision tasks and provides an image generation method based on multiple kinds of auxiliary information. The invention is the first to use multiple kinds of auxiliary information to guide a model through the image generation task. The task is completed in two stages. In the first stage, the model input is the fused feature of scene graph information and text information, with the scene graph information as the primary signal and the text information as the auxiliary signal, and a rough image is generated using a GAN network model as the prototype. In the second stage, the model input is the text information together with the output of the first stage; the aim is to enrich the image details with the text information and generate a high-quality image. The invention trains and evaluates on a real data set and compares against current mainstream image generation models to measure the performance improvement.
Description
Technical Field
The invention belongs to the field of image generation within computer vision tasks and relates to a method for guiding image generation with the participation of multiple kinds of auxiliary information.
Background
Scenes such as the following are ubiquitous in daily production and life: a poster designer cannot fully understand a customer's description, so the two communicate ineffectively for a long time and efficiency is low; witnesses at a crime scene can describe the appearance of a suspect, and the public security organ needs to produce a likeness of the suspect from that description in order to solve the case; when a house is being renovated, if a rendering of the result could be produced quickly from the owner's description, the owner's satisfaction with the renovation plan would be greatly improved. People have long pursued the richness of combining pictures and text when aesthetics are needed: an image delivers direct visual impact and can convey meanings that words cannot describe, while text, through ornate diction, can express at the semantic level a beauty that the senses alone cannot capture. Only when pictures and text appear together can a scene be presented comprehensively from different angles. In real-life settings, however, text and voice data are easy to obtain, while image data is to a certain extent difficult to obtain. Against the background of artificial intelligence continuously producing new results, how to use emerging technology to render the picture that a text describes is therefore an important research direction for promoting production and improving quality of life. In recent years, machine learning and deep learning have developed continuously and achieved many results in practical applications, and with progress across many fields, the exploration and application of multi-modal learning has gradually become a hot topic in artificial intelligence.
In current academic research, the most widely studied problem is the interaction between images and text, for example taking a passage of text as input and outputting the image that corresponds to it. Generating images from text is a common application in multi-modal learning tasks; this research can bring a great driving force to the field of data intelligence, and its deployment can bring great convenience to production and life.
At present, mainstream image generation methods use only a single kind of information in the training process of the model. For example, the sg2im model uses scene graph information as the model input to guide image generation; mainstream models such as StackGAN and AttnGAN use text descriptions to guide the model to generate images that meet the requirements. sg2im models each object in a text and the relations between them through a scene graph; on the basis of the scene graph it obtains a bounding box and a mask for each object in the semantics, yielding a scene layout related to the text semantics, and the scene layout is then fed into a subsequent GAN network to generate a picture. StackGAN generates the image step by step with two GANs. Because simply adding up-sampling to the network cannot improve the quality of the generated pictures, a two-stage GAN network was proposed: the first stage generates a low-resolution (64×64) image and mainly focuses on basic information such as the background, colors, and contours of the image; the second stage takes the output of the first stage as input and uses the text embedding again, recovering the detail information lost in the first stage and generating a finer 256×256 picture. A CA (Conditioning Augmentation) module is also added to inject useful random noise into the text features, so the generated images have more variability. AttnGAN adds an attention mechanism: it not only extracts a sentence-level feature of the text as a global constraint but also feeds word-level embeddings into the network as a local constraint, and the generator and discriminator are optimized precisely with respect to the word embeddings each time, so the generated image can highlight the details in the text.
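The CA (Conditioning Augmentation) module mentioned above amounts to a reparameterized Gaussian sampling step. The following is a minimal NumPy sketch of that idea, not the StackGAN implementation itself; the weight shapes, dimensions, and function names are illustrative assumptions:

```python
import numpy as np

def conditioning_augmentation(text_emb, W, b, rng):
    """Map a text embedding to (mu, log_sigma) with one fully connected
    layer, then sample c = mu + sigma * eps (reparameterization trick)."""
    h = W @ text_emb + b                 # 2*d outputs: first d = mu, last d = log sigma
    d = h.shape[0] // 2
    mu, log_sigma = h[:d], h[d:]
    eps = rng.standard_normal(d)         # fresh noise each call -> variability
    return mu + np.exp(log_sigma) * eps

rng = np.random.default_rng(0)
emb_dim, cond_dim = 8, 4                 # illustrative sizes
W = rng.standard_normal((2 * cond_dim, emb_dim)) * 0.1
b = np.zeros(2 * cond_dim)
c = conditioning_augmentation(rng.standard_normal(emb_dim), W, b, rng)
print(c.shape)  # (4,)
```

Because the noise is resampled on every call, the same text embedding yields a different conditioning variable each time, which is the source of the extra variability described above.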
Disclosure of Invention
The method provided by the invention performs image generation based on multiple kinds of auxiliary information: by extracting and fusing the features of the various kinds of information and making full use of all the auxiliary information, the generated image restores the described scene as faithfully as possible. The method takes scene graph and text description information as examples to introduce the research content.
The research goal of the task has two important aspects:
(1) Feature extraction and fusion: the input data of the task are a scene graph and a text description; the scene graph provides the positional relation of each object in the image, and the text description provides the implementation details of each object. Efficient feature extraction and fusion of the input data are required to generate a high-quality image. The aim is to realize a high-quality feature fusion algorithm that retains as much of the original information of the two kinds of data as possible.
(2) Use of the fused features: the obtained fused features largely retain the original information of the data; the features are applied to layout generation, then mask generation, and finally image generation. What is studied here is where and how the fused features should be applied, i.e., in which stage and in what way adding the feature makes it most useful, so that a satisfactory image is finally generated.
The technical scheme of the invention is as follows:
An image generation method based on multiple auxiliary information comprises the following steps:
step S1: performing representation learning on the scene graph and text information using currently mainstream methods;
step S2: the first stage of image generation: a GAN network model is established and trained with the obtained scene graph and text information as model input. The first stage focuses on the scene graph information, and a feature fusion algorithm module is designed so that the scene graph information and the text information can be fully utilized to assist the training process of the model. The first stage generates a rough image that meets the requirements;
step S3: the second stage of image generation: the features are processed before being input to the second-stage generation model. The input of the second stage is the output image of the first stage together with the text information, and this stage focuses on making full use of the text information;
The process of performing representation learning on the scene graph and text information using currently mainstream methods in step S1 is as follows:
step S11: the scene graph information is embedded with a GCN network; each scene graph is trained on, and finally a vector representation of each object is obtained;
step S12: the text information is encoded with a CNN-RNN text encoder; the text descriptions of each image are input to the model to obtain an embedding vector for each text description;
the specific steps of establishing the first-stage generation model in the step S2 are as follows:
step S21: performing feature fusion on the obtained scene graph information and text information, and guiding the image generation in the first stage by taking the scene graph information as a main part and the text information as an auxiliary part;
step S22: establishing an image generation model taking a GAN network as a prototype, wherein the image generation model comprises a generator and a discriminator, fusion characteristics are taken as model input, and output is a rough image with lower quality;
the specific steps of establishing the second-stage generation model in the step S3 are as follows:
step S31: processing the text information to enable the second stage to fully use the text information and capture more information;
step S32: and constructing an image generation model comprising a generator and a discriminator, taking the processed text information and the generated image in the first stage as input, and outputting a high-quality image.
The invention has the following beneficial effects: (1) traditional image generation algorithms use only a single kind of information for model training, whereas the invention guides image generation with multiple kinds of information; (2) the first stage of the generation task mainly uses the scene graph information with the text information as assistance, so the positional relations of the objects in the image are captured, while the second stage uses the text information to further refine the details of the objects and improve the quality of the image.
Drawings
FIG. 1 is a block diagram of the overall module design of the present invention.
FIG. 2 is a design diagram of a multi-information fusion module according to the present invention.
FIG. 3 is a design diagram of the text-information-guided two-stage generation model of the present invention.
Detailed Description
The following further describes a specific embodiment of the present invention with reference to the drawings and technical solutions.
An image generation method based on multiple auxiliary information comprises the following steps:
step S1: taking the COCO data set, a scene graph is extracted for each image from the image's annotation information to obtain a training set of scene graph data; the text information corresponding to each image is likewise extracted from the annotation information to obtain a corresponding training set of text information;
step S11: the objects and relations in the scene graph are first initialized and embedded to obtain an initial object matrix and an initial relation matrix; these are then input to a GCN network to obtain updated object and relation matrices, realizing the embedding of the scene graph information and yielding the scene graph vector matrix; the GCN network is formed by stacking five convolution blocks, each consisting of a fully connected layer, a ReLU layer, a fully connected layer, and a ReLU layer;
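The block structure described in step S11 (fully connected → ReLU → fully connected → ReLU, stacked five times) can be sketched as follows. The weights are random and the feature width is illustrative; this is only the per-node transform, not the full patented GCN, which also propagates relation features along graph edges:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv_block(x, W1, b1, W2, b2):
    # One block: fully connected -> ReLU -> fully connected -> ReLU
    return relu(relu(x @ W1 + b1) @ W2 + b2)

rng = np.random.default_rng(1)
num_objects, dim = 6, 16
h = rng.standard_normal((num_objects, dim))   # initial object matrix
for _ in range(5):                            # five stacked blocks
    W1, W2 = rng.standard_normal((2, dim, dim)) * 0.1
    b1, b2 = np.zeros(dim), np.zeros(dim)
    h = conv_block(h, W1, b1, W2, b2)
print(h.shape)  # (6, 16) -- updated object matrix
```

Each row of the final matrix is the vector representation of one object, which is what step S11 feeds into the later fusion step.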
step S12: for the obtained text information, character-level embedding is performed with a char-CNN-RNN text encoder model, which consists of two parts: a convolutional autoencoder (ConvAutoencoder) for image feature extraction and a character embedding module (CharEmbedding) for obtaining the text embedding; the final output is a text embedding vector containing image information;
step S2: the model of the first stage: the main structure is a generative adversarial network (GAN) comprising a generator and a discriminator; feature fusion is performed on the obtained scene graph vector matrix and text embedding vector to obtain the fused feature; in the generator, the fused feature is passed through a fully connected layer to produce a Gaussian distribution from which a condition variable is obtained; the condition variable is then concatenated with random noise as the generator input, and an image is finally generated through a group of up-sampling layers; in the discriminator, the text embedding vector is compressed and spatially replicated to obtain a feature tensor, the image generated by the generator is passed through down-sampling layers to obtain an image tensor, and the feature tensor and image tensor are then passed through a convolutional layer and a single-node fully connected layer to obtain a confidence score;
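The "spatial replication" used by the discriminator simply tiles the compressed text feature across the spatial grid so it can be concatenated channel-wise with the image tensor. A small NumPy sketch with illustrative dimensions (the 128/512-channel sizes are assumptions, not values from the invention):

```python
import numpy as np

def spatially_replicate(text_feat, h, w):
    """Tile a (d,) text feature into a (d, h, w) tensor."""
    return np.tile(text_feat[:, None, None], (1, h, w))

text_feat = np.arange(128, dtype=np.float64)   # compressed text embedding
image_feat = np.zeros((512, 4, 4))             # output of the down-sampling layers
fused = np.concatenate([image_feat, spatially_replicate(text_feat, 4, 4)], axis=0)
print(fused.shape)  # (640, 4, 4) -- fed to the convolutional layer
```

Every spatial position of the fused tensor now carries the same text feature, so the following convolution can compare image content against the text condition locally.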
step S21: the fusion of the scene graph information and text information is realized with the scene graph information as the primary signal and the text information as the auxiliary signal; the text information is passed through a fully connected layer with a reduced number of nodes so that partial text information is retained, and it is then concatenated with the scene graph information;
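The fusion in step S21 — a narrower fully connected layer over the text followed by concatenation with the scene graph feature — can be sketched as follows. The layer widths and random weights are illustrative assumptions, not the values used by the invention:

```python
import numpy as np

def fuse(scene_vec, text_vec, W_reduce, b_reduce):
    """Keep partial text information via a reduced fully connected layer,
    then concatenate with the scene-graph feature (scene graph primary,
    text auxiliary)."""
    reduced = np.maximum(W_reduce @ text_vec + b_reduce, 0.0)  # FC + ReLU
    return np.concatenate([scene_vec, reduced])

rng = np.random.default_rng(2)
scene_vec = rng.standard_normal(128)              # scene-graph embedding
text_vec = rng.standard_normal(256)               # text embedding
W_reduce = rng.standard_normal((32, 256)) * 0.1   # 256 -> 32: fewer nodes
fused = fuse(scene_vec, text_vec, W_reduce, np.zeros(32))
print(fused.shape)  # (160,)
```

Because the text passes through the bottleneck while the scene graph feature is carried through untouched, the scene graph dominates the fused representation, matching the primary/auxiliary roles described above.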
step S22: a condition variable $\hat{c}_0$ is sampled from the Gaussian distribution $\mathcal{N}(\mu_0(\varphi_t), \Sigma_0(\varphi_t))$ and concatenated with randomly sampled noise $z$ as input to train the generator $G_0$ and the discriminator $D_0$; the objective functions are as follows:

$$\mathcal{L}_{D_0} = \mathbb{E}_{(I_0, t) \sim p_{data}}\big[\log D_0(I_0, \varphi_t)\big] + \mathbb{E}_{z \sim p_z,\, t \sim p_{data}}\big[\log\big(1 - D_0(G_0(z, \hat{c}_0), \varphi_t)\big)\big]$$

$$\mathcal{L}_{G_0} = \mathbb{E}_{z \sim p_z,\, t \sim p_{data}}\big[\log\big(1 - D_0(G_0(z, \hat{c}_0), \varphi_t)\big)\big] + \lambda\, D_{KL}\big(\mathcal{N}(\mu_0(\varphi_t), \Sigma_0(\varphi_t))\,\big\|\,\mathcal{N}(0, I)\big)$$

where the real image $I_0$ and the text input $t$ are drawn from the true data distribution $p_{data}$; $p_z$ is the standard normal prior distribution; $\varphi_t$ is the text embedding vector obtained by the pre-trained encoder; $z$ denotes noise randomly sampled from $p_z$; $\mu_0(\varphi_t)$ and $\Sigma_0(\varphi_t)$ are the mean and covariance of the Gaussian distribution produced from $\varphi_t$ by the fully connected layer; and $\lambda$ is the regularization coefficient;
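The KL regularization term weighted by lambda has a closed form when the Gaussian is diagonal and compared against the standard normal. A small NumPy sketch, under the assumption that the covariance is parameterized through log sigma:

```python
import numpy as np

def kl_to_standard_normal(mu, log_sigma):
    """D_KL( N(mu, diag(sigma^2)) || N(0, I) )
       = 0.5 * sum( mu^2 + sigma^2 - 1 - log sigma^2 )."""
    return 0.5 * np.sum(mu**2 + np.exp(2.0 * log_sigma) - 1.0 - 2.0 * log_sigma)

# The term vanishes exactly when the conditional matches the prior:
print(kl_to_standard_normal(np.zeros(4), np.zeros(4)))  # 0.0
```

Adding this term to the generator loss keeps the conditioning distribution close to the prior, which smooths the conditioning manifold and prevents the text features from collapsing to point estimates.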
step S3: the network model of the second stage also takes a GAN as its main body and consists of a generator and a discriminator; the model input is a text embedding vector and the image generated in the first stage, and the second stage emphasizes the use of the text information to generate a high-resolution image; the structure of the discriminator is roughly identical to that of the first-stage discriminator, except that to handle the input size the stride of the convolutional layers is doubled, so the number of down-sampling layers changes from 3 to 4; in the generator, the text embedding vector is passed through a fully connected layer to produce a Gaussian distribution from which a condition variable is obtained and then spatially replicated into a feature tensor; at the same time, the output of the first stage is down-sampled to obtain a 1 × 1 feature tensor; the two feature tensors are concatenated, passed through a series of residual blocks, and up-sampled to obtain the image;
step S31: each image has multiple text descriptions, so multiple text embedding vectors are obtained; at each training step, one text embedding vector is selected together with the image generated in the first stage as the input of the second-stage generator; the discriminator retains the image with the highest confidence score as the final image;
step S32: with the second-stage Gaussian latent conditioning variable $\hat{c}$ and the first-stage generator output $s_0 = G_0(z, \hat{c}_0)$ as input, the generator $G_1$ and the discriminator $D_1$ are trained; the objective functions are respectively:

$$\mathcal{L}_{D_1} = \mathbb{E}_{(I, t) \sim p_{data}}\big[\log D_1(I, \varphi_t)\big] + \mathbb{E}_{s_0 \sim p_{G_0},\, t \sim p_{data}}\big[\log\big(1 - D_1(G_1(s_0, \hat{c}), \varphi_t)\big)\big]$$

$$\mathcal{L}_{G_1} = \mathbb{E}_{s_0 \sim p_{G_0},\, t \sim p_{data}}\big[\log\big(1 - D_1(G_1(s_0, \hat{c}), \varphi_t)\big)\big] + \lambda\, D_{KL}\big(\mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t))\,\big\|\,\mathcal{N}(0, I)\big)$$
the above-mentioned expanding model using stackGAN as the baseline in step S2 and step S3 is only a preferred embodiment of the present invention, and all equivalent changes and modifications made according to the claimed scope of the present invention should be covered by the present invention.
Claims (1)
1. An image generation method based on multiple auxiliary information, characterized by comprising the following steps:
step S1: taking the COCO data set, a scene graph is extracted for each image from the image's annotation information to obtain a training set of scene graph data; the text information corresponding to each image is likewise extracted from the annotation information to obtain a corresponding training set of text information;
step S11: the objects and relations in the scene graph are first initialized and embedded to obtain an initial object matrix and an initial relation matrix; these are then input to a GCN network to obtain updated object and relation matrices, realizing the embedding of the scene graph information and yielding the scene graph vector matrix; the GCN network is formed by stacking five convolution blocks, each consisting of a fully connected layer, a ReLU layer, a fully connected layer, and a ReLU layer;
step S12: for the obtained text information, character-level embedding is performed with a char-CNN-RNN text encoder model, which consists of two parts: a convolutional autoencoder (ConvAutoencoder) for image feature extraction and a character embedding module (CharEmbedding) for obtaining the text embedding; the final output is a text embedding vector containing image information;
step S2: the model of the first stage: the main structure is a generative adversarial network (GAN) comprising a generator and a discriminator; feature fusion is performed on the obtained scene graph vector matrix and text embedding vector to obtain the fused feature; in the generator, the fused feature is passed through a fully connected layer to produce a Gaussian distribution from which a condition variable is obtained; the condition variable is then concatenated with random noise as the generator input, and an image is finally generated through a group of up-sampling layers; in the discriminator, the text embedding vector is compressed and spatially replicated to obtain a feature tensor, the image generated by the generator is passed through down-sampling layers to obtain an image tensor, and the feature tensor and image tensor are then passed through a convolutional layer and a single-node fully connected layer to obtain a confidence score;
step S21: the fusion of the scene graph information and text information is realized with the scene graph information as the primary signal and the text information as the auxiliary signal; the text information is passed through a fully connected layer with a reduced number of nodes so that partial text information is retained, and it is then concatenated with the scene graph information;
step S22: a condition variable $\hat{c}_0$ is sampled from the Gaussian distribution $\mathcal{N}(\mu_0(\varphi_t), \Sigma_0(\varphi_t))$ and concatenated with randomly sampled noise $z$ as input to train the generator $G_0$ and the discriminator $D_0$; the objective functions are as follows:

$$\mathcal{L}_{D_0} = \mathbb{E}_{(I_0, t) \sim p_{data}}\big[\log D_0(I_0, \varphi_t)\big] + \mathbb{E}_{z \sim p_z,\, t \sim p_{data}}\big[\log\big(1 - D_0(G_0(z, \hat{c}_0), \varphi_t)\big)\big]$$

$$\mathcal{L}_{G_0} = \mathbb{E}_{z \sim p_z,\, t \sim p_{data}}\big[\log\big(1 - D_0(G_0(z, \hat{c}_0), \varphi_t)\big)\big] + \lambda\, D_{KL}\big(\mathcal{N}(\mu_0(\varphi_t), \Sigma_0(\varphi_t))\,\big\|\,\mathcal{N}(0, I)\big)$$

where the real image $I_0$ and the text input $t$ are drawn from the true data distribution $p_{data}$; $p_z$ is the standard normal prior distribution; $\varphi_t$ is the text embedding vector obtained by the pre-trained encoder; $z$ denotes noise randomly sampled from $p_z$; $\mu_0(\varphi_t)$ and $\Sigma_0(\varphi_t)$ are the mean and covariance of the Gaussian distribution produced from $\varphi_t$ by the fully connected layer; and $\lambda$ is the regularization coefficient;
step S3: the network model of the second stage also takes a GAN as its main body and consists of a generator and a discriminator; the model input is a text embedding vector and the image generated in the first stage, and the second stage emphasizes the use of the text information to generate a high-resolution image; the structure of the discriminator is roughly identical to that of the first-stage discriminator, except that to handle the input size the stride of the convolutional layers is doubled, so the number of down-sampling layers changes from 3 to 4; in the generator, the text embedding vector is passed through a fully connected layer to produce a Gaussian distribution from which a condition variable is obtained and then spatially replicated into a feature tensor; at the same time, the output of the first stage is down-sampled to obtain a 1 × 1 feature tensor; the two feature tensors are concatenated, passed through a series of residual blocks, and up-sampled to obtain the image;
step S31: each image has multiple text descriptions, so multiple text embedding vectors are obtained; at each training step, one text embedding vector is selected together with the image generated in the first stage as the input of the second-stage generator; the discriminator retains the image with the highest confidence score as the final image;
step S32: with the second-stage Gaussian latent conditioning variable $\hat{c}$ and the first-stage generator output $s_0 = G_0(z, \hat{c}_0)$ as input, the generator $G_1$ and the discriminator $D_1$ are trained; the objective functions are respectively:

$$\mathcal{L}_{D_1} = \mathbb{E}_{(I, t) \sim p_{data}}\big[\log D_1(I, \varphi_t)\big] + \mathbb{E}_{s_0 \sim p_{G_0},\, t \sim p_{data}}\big[\log\big(1 - D_1(G_1(s_0, \hat{c}), \varphi_t)\big)\big]$$

$$\mathcal{L}_{G_1} = \mathbb{E}_{s_0 \sim p_{G_0},\, t \sim p_{data}}\big[\log\big(1 - D_1(G_1(s_0, \hat{c}), \varphi_t)\big)\big] + \lambda\, D_{KL}\big(\mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t))\,\big\|\,\mathcal{N}(0, I)\big)$$
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110301738.1A CN113052784B (en) | 2021-03-22 | 2021-03-22 | Image generation method based on multiple auxiliary information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110301738.1A CN113052784B (en) | 2021-03-22 | 2021-03-22 | Image generation method based on multiple auxiliary information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113052784A true CN113052784A (en) | 2021-06-29 |
CN113052784B CN113052784B (en) | 2024-03-08 |
Family
ID=76514125
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110301738.1A Active CN113052784B (en) | 2021-03-22 | 2021-03-22 | Image generation method based on multiple auxiliary information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113052784B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113918754A (en) * | 2021-11-01 | 2022-01-11 | 中国石油大学(华东) | Image subtitle generating method based on scene graph updating and feature splicing |
CN116958766A (en) * | 2023-07-04 | 2023-10-27 | 阿里巴巴(中国)有限公司 | Image processing method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110751698A (en) * | 2019-09-27 | 2020-02-04 | 太原理工大学 | Text-to-image generation method based on hybrid network model |
CN111340122A (en) * | 2020-02-29 | 2020-06-26 | 复旦大学 | Multi-modal feature fusion text-guided image restoration method |
CN111968193A (en) * | 2020-07-28 | 2020-11-20 | 西安工程大学 | Text image generation method based on StackGAN network |
- 2021-03-22: CN application CN202110301738.1A granted as patent CN113052784B (status: active)
Also Published As
Publication number | Publication date |
---|---|
CN113052784B (en) | 2024-03-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||