CN115661603B

CN115661603B - Image generation method based on modeless layout completion

Info

Publication number: CN115661603B
Application number: CN202211612018.8A
Authority: CN
Inventors: 吴敬宇; 李泽健; 孙凌云
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2022-12-15
Filing date: 2022-12-15
Publication date: 2023-04-25
Anticipated expiration: 2042-12-15
Also published as: CN115661603A

Abstract

The invention discloses an image generation method based on modeless layout completion, which comprises the steps of classifying, combining, extracting and scaling standard frames in a modeless layout diagram to obtain a training sample, inputting the training sample into a training model to complete the standard frames to be completed, wherein the training model comprises a category hidden space module, a boundary frame hidden space module and a modal standard frame deriving module, inputting the modeless layout diagram into the modeless completion model to obtain a completed modal layout diagram through a modeless completion model trained by a loss function, and inputting the modal layout diagram into a generation model to obtain a scene image. The method can accurately generate the scene image based on the modeless layout.

Description

Image generation method based on modeless layout completion

Technical Field

The invention belongs to the field of image data processing, and particularly relates to an image generation method based on non-modal layout completion.

Background

In recent years, a layout-diagram (layout) -based generation model has received great attention because it can more explicitly represent scene information. The layout is a very important concept in the image generation process, and the layout information contains object types and spatial position information in a scene, and is a powerful structural representation of the image. Compared with other scene priori information, the biggest characteristic of the layout is that the category and the spatial position of each object in the complex scene can be described. Therefore, the generation network based on the prior of the layout diagram is expected to solve the problems of lower precision and lower accuracy in the generated image.

Chinese patent CN114241052a discloses a method for generating a new view image of a multi-object scene based on a layout, which includes inputting the layout of a plurality of images to a layout predictor to obtain a layout under a new view; inputting a plurality of images, sampling each object instance in the images, connecting the images with a camera pose matrix along a channel direction to construct an input tensor, and inputting the constructed tensor to a pixel predictor to obtain images of all objects under a new view angle; inputting the layout diagram under the new view angle and the images of all objects under the new view angle into a scene generator, sequentially passing through an encoder and a fusion device to obtain a fusion characteristic containing all object information, and generating a scene image through a decoder. According to the method, the network is guided to generate the scene image through the layout image information of the scene, the depth image of the input image is not relied on, the generated image is clearer and more real, and the problems of lower precision and lower accuracy in the existing generated image are solved.

The Chinese patent CN114241052A discloses a semantic image analogy method based on a single image generation countermeasure network, and the technical scheme provided by the invention can train a generation model special for a given image under the condition of giving any image and a semantic segmentation diagram thereof, and the model can recombine source images according to different expected semantic layouts to generate images conforming to target semantic layouts so as to achieve the effect of semantic image analogy. The visual quality and the coincidence accuracy of the result generated by the method are both optimal.

However, the above patent must divide the picture by using the original picture and the corresponding real/pre-training model, and in many application scenarios, for example, the user draws a modeless layout, and can obtain a relatively accurate scene image only by the modeless layout, and determine whether the thought of the layout is accurate by the scene image, where the modeless layout is a layout with a shielding relationship between objects, and the existing modeless layout is marked with only a visible part of the object in the picture, and the masking part is not considered, so that the incompleteness of the scene marking information is caused, and the model defaults that each layout represents a complete object when training, so that the masking relationship existing in the real scene is ignored, and the model cannot understand the relationship between objects in the complex real scene accurately.

Disclosure of Invention

The invention provides an image generation method based on modeless layout completion, which can accurately generate a scene image based on a modeless layout.

An image generation method based on modeless layout completion, comprising:

constructing a training sample set to obtain a real scene image, and a modeless layout diagram and a modal layout diagram corresponding to the real scene image; combining frames with overlapping areas or intersecting edges in a modeless layout chart into a first modeless frame group, combining the first modeless frame groups with the same frames to obtain a second modeless frame group, sequentially extracting and scaling the second modeless frame group to obtain modeless frame combined images, taking each modeless frame combined image as a training sample, and constructing a training sample set by a plurality of modeless frame combined images;

the training model is constructed and comprises a category hidden space module, a boundary frame hidden space module and a modal frame deriving module, any non-modal frame in the training sample is used as a frame to be complemented, other frames are used as a covering frame, the object categories of the frame to be complemented and the covering frame are converted into category hidden space features through label embedding in the category hidden space module, and the category hidden space features are fully connected to obtain category hidden space feature vectors; respectively downsampling the boundary frames of the to-be-complemented frame and the hidden frame by using the boundary frame hidden space module to obtain boundary frame hidden space feature vectors of the to-be-complemented frame and the hidden frame; combining the boundary frame hidden space feature vector and the category hidden space feature vector through a mode frame derivation module to obtain a prediction mode frame hidden space feature vector, and upsampling the prediction mode frame hidden space feature vector to obtain a prediction mode frame;

constructing a loss function based on the prediction mode standard frame and the corresponding standard frame in the mode layout diagram, training a training model based on a training sample set through the loss function to obtain a modeless layout completion model, inputting the modeless layout diagram into the modeless layout completion model to obtain a prediction mode layout diagram, and inputting the prediction mode layout diagram into an image generation model to obtain a scene image.

The frame in the modeless layout is used for marking the category of the object and the size and the position of the visible range;

the frame in the modal layout is used for marking the category of the object, the size of the visible range and the shielding range, and the positions of the visible range and the shielding range.

The step of sequentially extracting and scaling the second modeless frame combination to obtain a modeless frame combination image comprises the following steps:

and expanding the boundary of the second modeless frame combination by adopting a maximum value method based on the extreme values of the height, the width and the abscissa of the second modeless frame group, extracting the expanded second modeless frame combination to obtain a second modeless frame combination image, and scaling the second modeless frame combination image to a given resolution to obtain a modeless frame combination image.

The class hidden space module comprises a label embedding layer and a full connection layer, the object classes of the to-be-complemented standard frame and the hidden standard frame are converted into class hidden space features through the label embedding layer, and the class hidden space features are fully connected through the full connection layer to obtain class hidden space feature vectors.

The method comprises the steps that a boundary frame hidden space module respectively downsamples a boundary frame to be complemented and a boundary frame of a covering boundary frame to obtain boundary frame hidden space feature vectors of the boundary frame to be complemented and the covering boundary frame, wherein:

the boundary frame hidden space module comprises a plurality of downsampling submodules which are connected in sequence, each downsampling submodule comprises a downsampling unit and a maximum pooling layer which are connected in sequence, each downsampling unit comprises a plurality of downsampling subunits which are connected in sequence, and each downsampling subunit comprises a convolution layer, a regularization layer and an activation layer in sequence.

The mode frame deriving module comprises a plurality of full-connection layers and a plurality of upsampling submodules, wherein the boundary frame hidden space feature vector and the category hidden space feature vector are combined through the plurality of full-connection layers to obtain a prediction mode frame hidden space feature vector, and the prediction mode frame hidden space feature vector is upsampled through the plurality of upsampling submodules to obtain a prediction mode frame.

Constructing a loss function based on the prediction mode frame and the corresponding frame in the mode layout diagram

The method comprises the following steps:

wherein ,

is the parameter of the ultrasonic wave to be used as the ultrasonic wave,N ₁ for the number of bounding boxes of the modality frames in the frame to be complemented,N ₂ for the number of bounding boxes of the modeless boxes to be filled,

is the firstoThe bounding boxes of the label frames are covered,

is the first to be complemented in the framesThe bounding box of the modeless bounding box,

is the first to be complemented in the framerThe bounding boxes of the individual modality frames,

is the first to be complemented in the framerThe category of the frame of each modality,

is the first to be complemented in the framesThe category of the non-modal box,

is the firstoThe category of the individual mask frames,

in order to train the model,

as a function of the cross-over loss,M _t is a real mode frame.

Measuring accuracy of a modeless layout completion model based on IoU variant indicators, the IoU variant indicators comprising a first IoU variant indicator and a second IoU variant indicator, wherein:

first IoU variant index

The method comprises the following steps:

second IoU variant index

The method comprises the following steps:

wherein ,

is the first in the modeless layoutiThe number of bounding boxes is a function of the number of bounding boxes,mrepresenting the bounding box as in the original modeless layout,

is the first in the true module layoutiThe number of bounding boxes is a function of the number of bounding boxes,arepresenting that the bounding box is the bounding box in the real module layout, the bounding box of the original modeless layout corresponds to the bounding box of the real modal layout one by one,Fthe model is completed for the modeless layout,Nis the number of bounding boxes.

Compared with the prior art, the invention has the beneficial effects that:

according to the method, the prediction mode layout is obtained by complementing each modeless frame in the modeless layout one by one, and the scene image is accurately obtained through the generator based on the prediction mode layout.

According to the invention, the classification relation vector and the boundary frame relation vector are respectively obtained by fusing the classification of the to-be-complemented modeless frame and other frames and the characteristics of the boundary frame in the modeless layout diagram in the hidden space, and the classification relation vector and the boundary frame relation vector are fused and then up-sampled, so that the to-be-complemented modeless frame is complemented to obtain the accurate prediction mode frame, and the corresponding object and the position relation of the corresponding object and other objects in the scene diagram can be completely presented based on the prediction mode frame.

Drawings

FIG. 1 is a flowchart of an image generation method based on modeless layout completion according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a training sample set obtained according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an image generating method based on modeless layout completion according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a loss function construction according to an embodiment of the present invention;

fig. 5 is a graph showing the comparison of effects provided by the embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and technical effects of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings.

The invention provides an image generation method based on modeless layout completion, as shown in fig. 1, comprising the following steps:

(1) And obtaining a training sample set and a label based on the real image and the modal layout and the modeless layout corresponding to the real image.

Performing frame labeling on a real image to obtain a modeless layout and a modal layout, wherein the modeless layout is the object type of the object in the real image and the size and the position of the object in a visible range; the modal layout is a model of the object class that labels the objects in the real image, the size of the visible and obscured ranges, and the location of the visible and obscured ranges,the invention takes the standard frame in the modal layout as a label, and the standard frame comprises an object category and a framebbox) The bezel includes a position (the coordinates of the upper left corner) and a size (height and width).

As shown in fig. 2, first, the present invention provides a method for classifying and combining frames in a modeless layout, which specifically includes the following steps: combining two frames with overlapping areas into a first modeless frame group, and then combining two frames with intersecting edges in a modeless layout diagram into the first modeless frame group; traversing the first modeless frame group, and combining the first modeless frame group with the same frame to obtain a second modeless frame group.

Then, the invention provides a method for extracting and scaling the obtained second modeless frame group, which comprises the following specific steps: obtaining maximum height value of second non-modal frame group

Maximum width value

And a minimum abscissa and ordinate value

And expanding the boundary frame of the second modeless frame group based on the extremum to obtain the boundary frame of the second modeless frame group after leaving the edge

And extracting the expanded second modeless frame combination to obtain a second modeless frame combination image, and then scaling the second modeless frame combination image to a picture with the resolution of 256 multiplied by 256. Each modeless frame combination image is used as a training sample, and a plurality of modeless frame combination images construct a training sample set.

(2) Constructing a training model which comprises a hidden space modulebranch _cate ) Boundary frame hidden space modulebranch _modal ) And a mode frame deduction modulebranch _amodal ) Any non-modal frame in the training sample is used as the frame to be complemented, and other frames are used as the mask frames, wherein, as shown in figure 3,branch _cate is used for analyzing the interrelationship between the categories according to the category C of the given complement frame and the mask frame to obtain the category hidden space feature vector,branch _modal for analyzing the hidden spatial relationship between the bounding box to be complemented and the bounding box in the mask bounding box and expressing the relationship as a bounding box hidden spatial feature vector,branch _amodal and combining the obtained two feature vectors to derive a possible hidden space feature vector of the modal boundary frame, namely, a hidden space feature vector of the prediction modal boundary frame, deriving a prediction modal frame through a series of up-sampling and full connection, taking each frame in the training sample as a frame to be complemented, taking other frames as a covering frame, and complementing the frame to be complemented through the steps to complete the training sample.

The hidden space module provided by the inventionbranch _cate ) The method comprises a label embedding layer and a full connection layer, wherein object categories of a to-be-complemented standard frame and a hidden standard frame are converted into category hidden space features through the label embedding layer, the category hidden space features are fully connected through the full connection layer to obtain category hidden space feature vectors, and the vectors are 512 multiplied by 1 dimensionality, so that category relations of the to-be-complemented standard frame and the hidden standard frame are obtained.

The invention provides a boundary frame hidden space modulebranch _modal ) The device comprises 5 downsampling sub-modules which are sequentially connected, wherein each downsampling sub-module comprises downsampling units and a maximum pooling layer which are sequentially connected, each downsampling unit comprises 2 downsampling sub-units which are sequentially connected, and each downsampling sub-unit sequentially comprises a convolution layer, a regularization layer and an activation layer. Finally, the hidden space feature vector of the boundary box with the dimension of 512 multiplied by 16 is obtained.

The modal frame deducing module provided by the inventionbranch _amodal ) Comprising 2 full connection layersAnd 5 upsampling submodules, wherein the boundary frame hidden space feature vector and the category hidden space feature vector are combined through 2 full-connection layers to obtain a prediction mode frame hidden space feature vector with the dimension of 512 multiplied by 17, and the prediction mode frame hidden space feature vector is upsampled through the 5 upsampling submodules to obtain the prediction mode frame.

(3) Training the training model based on the training sample set by adopting the constructed loss function to obtain a non-modal layout completion model (ALCN), and constructing the loss function by predicting a modal frame and a corresponding frame in the modal layout as shown in figure 4

The method comprises the following steps:

wherein ,

is the firstoThe bounding boxes of the label frames are covered,

is the first to be complemented in the framesThe category of the non-modal box,

is the firstoThe category of the individual mask frames,

in order to train the model,

as a function of the cross-over loss,M _t is a real mode frame. By regulating and controlling the super-parameters, the model can complement the non-modal frame in the frame to be complemented, and the complement of the modal frame is reduced.

According to the method, a training model is trained through a loss function based on a training sample set to obtain a modeless layout completion model, a modeless layout diagram is input into the modeless layout completion model to obtain a prediction mode layout diagram, and the prediction mode layout diagram is input into a layout diagram to an image generation model to obtain a scene image.

(4) The invention also provides IoU variant indexes for evaluating the non-modal layout completion model completion effect, the non-modal layout completion model obtained in the step (3) is evaluated through IoU variant indexes, the accuracy of the non-modal layout completion model is measured based on IoU variant indexes, ioU variant indexes comprise a first IoU variant index and a second IoU variant index, wherein:

first IoU variant index

The method comprises the following steps:

second IoU variant index

The method comprises the following steps:

wherein ,

is the first in the true module layoutiThe number of bounding boxes is a function of the number of bounding boxes,arepresenting that the bounding box is the bounding box in the real module layout, the bounding box of the original modeless layout corresponds to the bounding box of the real modal layout one by one,Fthe model is completed for the modeless layout,Nis the number of bounding boxes. . Of these two indices, the two indices are,

the complementation effect of the modeless layout complementation model under different difficulty levels is measured, because IoU between the modeless layout and the real modal layout is lower, the part of the model needing complementation is relatively more;

the accuracy of the modeless layout completion model is measured because IoU between the modeless layout and the real modeless layout is very high, meaning that it is very close, and a little erroneous change will result in a drop in the results of the index.

(5) And (3) generating a corresponding complement modal layout according to the non-modal frame layout diagram of the arbitrarily input object by using the non-modal layout complement model obtained in the step (3), and visualizing the shielding relation between the objects in the scene. The method comprises the following specific steps:

(5-1) drawing bounding boxes of the modeless markup boxes to be completed and categories of each bounding box.

And (5-2) inputting the drawn modeless layout completion model obtained in the step (3) into the modeless annotation frame to be completed to obtain a completed modal layout diagram, comparing differences between the modeless layout diagram and the modal layout diagram, and highlighting the shielding relation between objects in the scene.

(6) Generating a high-quality scene image by using the completed modal layout obtained in the step (5), wherein the method comprises the following specific steps of: and (5) inputting the complement annotation frame obtained in the step (5) into an image generation model to obtain a generated scene image. Fig. 5 shows an example of a set of generated images, which are from left to right, respectively a modeless layout, a picture generated based on the modeless layout, a picture generated by the completed modeless layout, and a real picture, as shown in fig. 5, it can be found from the result that the quality effect of the picture generated after the modeless layout is completed into the modeless layout by using the method provided by the invention is relatively better.

Claims

1. An image generation method based on modeless layout completion, comprising:

constructing a loss function based on the prediction mode standard frame and the corresponding standard frame in the mode layout diagram, training a training model based on a training sample set through the loss function to obtain a modeless layout completion model, inputting the modeless layout diagram to the modeless layout completion model to obtain a prediction mode layout diagram, and inputting the prediction mode layout diagram to an image generation model to obtain a scene image;

The method comprises the following steps:

wherein ,

is the parameter of the ultrasonic wave to be used as the ultrasonic wave,N ₁ for the number of bounding boxes of the modality frames in the frame to be complemented,N ₂ for the number of bounding boxes of the modeless boxes in the boxes to be complemented, +.>

Is the firstoBoundary boxes of the mask frame +.>

Is the first to be complemented in the framesBounding box of each modeless frame, +.>

Is the first to be complemented in the framerBoundary box of each mode frame, +.>

Is the first to be complemented in the framerCategory of individual modality frame->

Is the first to be complemented in the framesCategory of modeless frame +.>

Is the firstoCategory of individual mask frame->

For training the model->

As a function of the cross-over loss,M _t is a real mode frame.

2. The modeless layout completion-based image generation method of claim 1, wherein a frame in the modeless layout diagram is used to label a category of an object, and a size and a position of a visual range; the frame in the modal layout is used for marking the category of the object, the size of the visible range and the shielding range, and the positions of the visible range and the shielding range.

3. The method for generating an image based on modeless layout completion of claim 1, wherein the sequentially extracting and scaling the second modeless frame combination to obtain the modeless frame combination image comprises:

4. The image generation method based on modeless layout completion of claim 1, wherein the class hidden space module comprises a label embedding layer and a full connection layer, object classes of a frame to be completed and a hidden frame are converted into class hidden space features through the label embedding layer, and the class hidden space features are fully connected through the full connection layer to obtain class hidden space feature vectors.

5. The modeless layout completion-based image generation method of claim 1, wherein the boundary frames of the frame to be completed and the mask frame are respectively downsampled by the boundary frame hidden space module to obtain boundary frame hidden space feature vectors of the frame to be completed and the mask frame, wherein:

6. The modeless layout completion-based image generation method of claim 1, wherein the modality frame derivation module comprises a plurality of fully connected layers and a plurality of upsampling submodules, wherein the bounding box hidden spatial feature vector and the class hidden spatial feature vector are combined through the plurality of fully connected layers to obtain a prediction modality frame hidden spatial feature vector, and the prediction modality frame hidden spatial feature vector is upsampled through the plurality of upsampling submodules to obtain the prediction modality frame.

7. The modeless layout completion based image generation method of claim 1 wherein the accuracy of the modeless layout completion model is measured based on a IoU variant indicator, the IoU variant indicator comprising a first IoU variant indicator and a second IoU variant indicator, wherein:

first IoU variant index

The method comprises the following steps: />

Second IoU variant index

The method comprises the following steps: />

wherein ,

is the first in the modeless layoutiThe number of bounding boxes is a function of the number of bounding boxes,mrepresenting the bounding box as the bounding box in the original modeless layout, +.>

Is the first in the true module layoutiThe number of bounding boxes is a function of the number of bounding boxes,arepresenting that the bounding box is the bounding box in the real module layout, the bounding box of the original modeless layout corresponds to the bounding box of the real modal layout one by one,Fthe model is completed for the modeless layout,Nis the number of bounding boxes. />