CN112686816A - Image completion method based on content attention mechanism and mask prior - Google Patents

Image completion method based on content attention mechanism and mask prior

Info

Publication number
CN112686816A
CN112686816A (application number CN202011565269.6A)
Authority
CN
China
Prior art keywords: image, loss function, attention mechanism, binary mask, content attention
Prior art date
Legal status: Pending
Application number
CN202011565269.6A
Other languages
Chinese (zh)
Inventor
Ma Xin (马鑫)
Hou Luanxuan (侯峦轩)
He Ran (赫然)
Sun Zhenan (孙哲南)
Current Assignee
Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co ltd
Original Assignee
Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co ltd
Priority date
Filing date
Publication date
Application filed by Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co ltd
Priority to CN202011565269.6A
Publication of CN112686816A

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an image completion method based on a content attention mechanism and a mask prior. The method comprises the following steps: preprocessing an image, generating a binary mask image with an algorithm, and synthesizing a damaged image from it; training a generative adversarial network model based on content attention and the mask prior, with the damaged image and the binary mask image as input; and inputting a test image and a binary mask image into the trained model to perform completion of the damaged image. The generative adversarial network model based on the content attention mechanism and the mask prior uses the binary mask image as additional guiding information and learns jointly from the input images, so that the completion result contains rich detail information while preserving structural continuity.

Description

Image completion method based on content attention mechanism and mask prior
Technical Field
The invention relates to the technical field of image completion, in particular to an image completion method based on a content attention mechanism and a mask prior.
Background
Image inpainting refers to generating substitute content for the missing regions of a damaged image, such that the repaired image is visually realistic and semantically reasonable. The image completion task also supports other applications, such as image editing: when scene elements that distract human attention, such as people or objects (often unavoidable at capture time), are present in an image, it allows a user to remove the unwanted elements and fill the blank areas with visually and semantically reasonable content.
The generative adversarial network originates from the idea of the two-player zero-sum game in game theory. It consists of two networks, a generator and a discriminator, which compete with each other, continuously improving their performance until an equilibrium is reached. Many variant networks have been derived from the adversarial idea, and they have made significant advances in image synthesis, image super-resolution, image style transfer, and image inpainting. Research related to image completion, including image restoration, image watermark removal, image deraining, and image defogging, has accordingly received wide attention.
The attention mechanism is inspired by human intuition: it is the means by which humans, with limited attentional resources, rapidly screen high-value information out of a large amount of information. The content attention mechanism and the mask prior in deep learning borrow this human mode of attention; attention has been widely applied to deep learning tasks of many different types, such as natural language processing (NLP), image classification and speech recognition, with remarkable results.
With the continuous development of science and technology, demands in different fields, including film, advertising and animation production and online games, have risen correspondingly, and realistic image restoration technology is of great significance to a good user experience.
Against this background, developing an image completion method based on a content attention mechanism and a mask prior, so that the repaired image is visually realistic and semantically reasonable, is of great significance.
Disclosure of Invention
The invention aims to improve the generation quality of images in the image completion task (including rich texture details and structural continuity), and provides an image completion method based on a content attention mechanism and a mask prior; the method has broad application value.
The technical scheme adopted for realizing the purpose of the invention is as follows:
An image completion method based on a content attention mechanism and a mask prior comprises the following steps:
s1, preprocessing an image, generating a binary mask image M, and synthesizing a damaged image x by using the binary mask image M;
s2, training a generative adversarial network model based on the content attention mechanism and the mask prior for image completion, wherein the model comprises a generator and a discriminator; the damaged image x and the corresponding binary mask image M serve as network input, an undamaged image serves as the target real image y, and the network learns the complex nonlinear transformation mapping from the damaged image to the target real image; the encoder in the generator encodes the input damaged image and binary mask image M, and the decoder, guided by the content attention mechanism, decodes the resulting latent code into the completed image ŷ;
calculating the adversarial loss against the target real image in the discriminator; after repeated iterations reach stability, the training of the model is completed;
and S3, using the trained generative adversarial network model to perform completion processing on the test data.
Wherein, step S2 includes:
s21: initializing the network weight parameters for the image completion task, wherein the loss function of the generator is L_total and the loss function of the discriminator is L_D;
S22: inputting the damaged image and the binary mask image into a generator network G for image completion task, and generating a completed image and a targetInputting the true images into a discriminator network D together, and sequentially carrying out iterative training to ensure a loss function L of a generatortotalLoss function L of sum discriminatorDAll reduce to tend to be stable;
s23: training continues until none of the loss functions decreases further, thereby obtaining the final generative adversarial network model.
Wherein the output value of the partial convolution layer depends only on the undamaged region, described mathematically as follows:
$$
x' = \begin{cases} W^{\top}(F \odot M)\,\dfrac{\operatorname{sum}(\mathbf{1})}{\operatorname{sum}(M)} + b, & \operatorname{sum}(M) > 0 \\ 0, & \text{otherwise} \end{cases}
$$
wherein 1 denotes an all-ones matrix with the same shape as the binary mask map M, W denotes the weights of the convolution layer, F denotes the output feature map of the previous convolution layer, b denotes the bias of the convolution layer, M denotes the corresponding binary mask map, and sum(1)/sum(M) is a scaling factor that adjusts the weight given to the known region;
the binary mask map M is updated after the partial convolution is performed, and is mathematically described as follows:
$$
m' = \begin{cases} 1, & \operatorname{sum}(M) > 0 \\ 0, & \text{otherwise} \end{cases}
$$
That is, if the partial convolution layer can produce an output from valid inputs, the corresponding position in the binary mask map M is marked as 1.
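For illustration, the following is a minimal PyTorch sketch of such a partial convolution layer. The class name, the single-channel mask convention (1 = valid pixel) and the clamping constant are assumptions of this sketch, not details taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Conv2d):
    """Convolution whose output depends only on valid (undamaged) pixels,
    with the sum(1)/sum(M) re-weighting and mask update described above."""

    def forward(self, x, mask):
        # mask: (B, 1, H, W), 1 = valid pixel; broadcast to all input channels
        m = mask.expand(-1, x.size(1), -1, -1)
        # sum(M): number of valid inputs seen by each sliding window
        with torch.no_grad():
            valid = F.conv2d(m, torch.ones_like(self.weight),
                             stride=self.stride, padding=self.padding)
        # sum(1): total number of inputs in a full window
        window = self.in_channels * self.kernel_size[0] * self.kernel_size[1]
        # W^T (F ⊙ M), then scale by sum(1)/sum(M); zero where nothing was valid
        out = F.conv2d(x * m, self.weight, stride=self.stride,
                       padding=self.padding)
        keep = (valid > 0).float()
        out = out * (window / valid.clamp(min=1)) * keep
        if self.bias is not None:
            out = out + self.bias.view(1, -1, 1, 1) * keep
        # mask update: a position becomes 1 once any valid input contributed
        return out, keep[:, :1]
```

An encoder layer would then be created as, e.g., PartialConv2d(3, 64, kernel_size=7, stride=2, padding=3), with the image and the mask threaded through the network together.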
Wherein the content attention mechanism forms the output of the missing region as follows:
First, the feature similarity between the missing part and the known part is calculated: blocks of the known region are extracted and reshaped to serve as convolution kernel parameters; the cosine similarity between the known-region blocks {f_{x,y}} and the unknown-region blocks {b_{x',y'}} is calculated by the following equation:
$$
s_{x,y,x',y'} = \left\langle \frac{f_{x,y}}{\lVert f_{x,y} \rVert},\; \frac{b_{x',y'}}{\lVert b_{x',y'} \rVert} \right\rangle
$$
The similarity is then weighted with a scaled softmax along the x'y' dimension to obtain the attention value of each pixel:
$$
s^{*}_{x,y,x',y'} = \operatorname{softmax}_{x'y'}\!\left(\lambda\, s_{x,y,x',y'}\right)
$$
where λ is a constant; finally, the selected unknown-region blocks {b_{x',y'}} are used as deconvolution kernel parameters to reconstruct the missing region.
To achieve consistency of the attention maps, attention propagation is first performed from left to right and then from top to bottom with kernel size k:
$$
\hat{s}_{x,y,x',y'} = \sum_{i\in\{-k,\dots,k\}} s^{*}_{x+i,\,y,\,x'+i,\,y'}\,, \qquad \hat{s}_{x,y,x',y'} = \sum_{j\in\{-k,\dots,k\}} s^{*}_{x,\,y+j,\,x',\,y'+j}
$$
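The following is a hedged single-image sketch of this attention step (batch of one, 3 × 3 blocks, mask equal to 1 inside the hole); the left-to-right/top-down attention propagation is omitted for brevity, and all names and the scale λ = 10 are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def contextual_attention(feat, mask, patch=3, lam=10.0):
    """Sketch of the content attention step: reconstruct the missing region
    of `feat` (1, C, H, W) from its known region. `mask` is (1, 1, H, W)
    with 1 inside the hole."""
    # extract known-region blocks and reshape them into convolution kernels
    kernels = F.unfold(feat * (1 - mask), patch, padding=patch // 2)  # (1, C*p*p, L)
    L = kernels.size(-1)
    kernels = kernels.transpose(1, 2).reshape(L, -1, patch, patch)    # (L, C, p, p)
    # cosine similarity via correlation with L2-normalised kernels
    # (in practice, blocks overlapping the hole would be skipped)
    norm = kernels.reshape(L, -1).norm(dim=1).clamp(min=1e-8)
    sim = F.conv2d(feat, kernels / norm.view(L, 1, 1, 1), padding=patch // 2)
    # scaled softmax over the known-block dimension -> attention values s*
    attn = F.softmax(lam * sim, dim=1)                                # (1, L, H, W)
    # deconvolution with the selected blocks as kernels rebuilds the hole
    recon = F.conv_transpose2d(attn, kernels, padding=patch // 2) / patch ** 2
    return feat * (1 - mask) + recon * mask
```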
wherein the total loss function in the image completion is:
$$
L_{total} = \lambda_{rec} L_{rec} + \lambda_{per} L_{per} + \lambda_{style} L_{style} + \lambda_{tv} L_{tv} + \lambda_{adv} L_{adv}
$$
where L_rec denotes the reconstruction loss function, L_per the perceptual loss function, L_style the style loss function, L_tv the total variation loss function, and L_adv the adversarial loss function; λ_rec, λ_per, λ_style, λ_tv and λ_adv are weighting factors.
Wherein the reconstruction loss function is expressed as:
$$
L_{rec} = \left\lVert \hat{y} - y \right\rVert_{1}
$$

where ‖·‖₁ denotes the L1 norm, ŷ = G(cat(x, M)) is the completed image produced by the generator from the concatenated damaged image and binary mask, and cat denotes the join (concatenation) operation.
Wherein the perceptual loss function is expressed as:
$$
L_{per} = \frac{1}{N}\sum_{i=1}^{N} \left\lVert \phi_{i}(\hat{y}) - \phi_{i}(y) \right\rVert_{1}
$$

where φ is the pre-trained VGG-16 network, φ_i is the feature map output by the i-th pooling layer (the pool-1, pool-2 and pool-3 layers of VGG-16 are used), and N is the number of selected layers.
Wherein the style loss function is expressed as:
$$
L_{style} = \sum_{i=1}^{N} \frac{1}{C_{i}\,C_{i}} \left\lVert \phi_{i}(\hat{y})^{\top}\phi_{i}(\hat{y}) - \phi_{i}(y)^{\top}\phi_{i}(y) \right\rVert_{1}
$$

where C_i denotes the number of channels of the feature map output by the i-th layer of the pre-trained VGG-16 network; the products φ_i(·)^T φ_i(·) are the Gram matrices of the feature maps.
Wherein the total variation loss function is expressed as:
$$
L_{tv} = \sum_{(i,j)\in\Omega} \left( \left\lVert \hat{y}_{i,j+1} - \hat{y}_{i,j} \right\rVert_{1} + \left\lVert \hat{y}_{i+1,j} - \hat{y}_{i,j} \right\rVert_{1} \right)
$$

where Ω denotes the damaged region in the image; the total variation loss is a smoothness penalty defined on the one-pixel dilation of the damaged region, and (i, j) indexes a pixel in the image.
Wherein the adversarial loss function is expressed as:
$$
L_{adv} = \mathbb{E}_{\hat{y}}\!\left[D(\hat{y})\right] - \mathbb{E}_{y\sim P_{Y}}\!\left[D(y)\right] + \lambda\,\mathbb{E}_{\bar{y}}\!\left[\left(\lVert \nabla_{\bar{y}} D(\bar{y}) \rVert_{2} - 1\right)^{2}\right]
$$

where D denotes the discriminator, ȳ is a random interpolation between a generated sample ŷ and a real sample y, λ is set to 10, E(·) denotes the expectation, and y ~ P_Y denotes a sample y drawn from the distribution P_Y.
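As a concrete reference, the sketch below assembles the five terms of L_total in PyTorch. The VGG-16 slice indices for pool-1/2/3, the Gram-matrix normalization in the style term, and the mask convention (1 inside the damaged region Ω) are assumptions of this sketch; only the generator-side adversarial term appears here, the gradient penalty belonging to the discriminator update:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class CompletionLoss(nn.Module):
    """Sketch of L_total = λ_rec L_rec + λ_per L_per + λ_style L_style
    + λ_tv L_tv + λ_adv L_adv. `mask` is 1 inside the damaged region Ω."""

    def __init__(self, w=(6.0, 0.1, 240.0, 0.1, 0.001)):
        super().__init__()
        self.w_rec, self.w_per, self.w_style, self.w_tv, self.w_adv = w
        feats = vgg16(pretrained=True).features.eval()  # newer torchvision: weights=...
        for p in feats.parameters():
            p.requires_grad_(False)
        # slices ending at pool-1, pool-2, pool-3 (φ_1..φ_3)
        self.slices = nn.ModuleList([feats[:5], feats[5:10], feats[10:17]])

    def _phi(self, x):
        out = []
        for s in self.slices:
            x = s(x)
            out.append(x)
        return out

    @staticmethod
    def _gram(f):
        b, c, h, w = f.shape
        f = f.reshape(b, c, h * w)
        return f @ f.transpose(1, 2) / (c * h * w)

    def forward(self, pred, target, mask, d_fake):
        l_rec = (pred - target).abs().mean()                       # L_rec
        phi_p, phi_t = self._phi(pred), self._phi(target)
        l_per = sum((a - b).abs().mean()
                    for a, b in zip(phi_p, phi_t)) / len(self.slices)
        l_style = sum((self._gram(a) - self._gram(b)).abs().mean()
                      for a, b in zip(phi_p, phi_t))
        hole = pred * mask                                         # region Ω
        l_tv = ((hole[:, :, :, 1:] - hole[:, :, :, :-1]).abs().mean()
                + (hole[:, :, 1:, :] - hole[:, :, :-1, :]).abs().mean())
        l_adv = -d_fake.mean()            # generator side of the WGAN loss
        return (self.w_rec * l_rec + self.w_per * l_per
                + self.w_style * l_style + self.w_tv * l_tv
                + self.w_adv * l_adv)
```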
According to the method, the generative adversarial network can exploit the prior information of the binary mask through the partial convolution layers and complete the damaged image more accurately, thereby improving the quality of the generated image.
Through the content attention mechanism, the model can reconstruct the unknown region of an image from its known region, generating rich detail information and improving the generation of high-resolution images.
The invention introduces the reconstruction loss function, the style loss function, the total variation loss function and the adversarial loss function as constraints at both the image level and the feature level, improving the robustness and accuracy of the network.
The generative adversarial network model provided by the invention uses multi-objective optimization, so the model converges faster, performs better, and generalizes more strongly.
Drawings
FIG. 1 is a flow chart of the image completion method based on the content attention mechanism and the mask prior, in which partial conv denotes a partial convolution layer, Concatenate denotes the join operation, e and d denote the encoder and the decoder, and z denotes the input of the decoder, i.e., the features of the input image;
FIG. 2 is a flow chart of the content attention mechanism of the invention, in which Background and Foreground denote the known-region feature map and the missing part respectively, Input feature denotes the input feature map, Extract patches denotes extracting blocks (patches), Reshape denotes resizing, Conv for Matching denotes computing the cosine similarity, and Softmax for Comparison denotes selecting the most similar blocks according to the attention values;
FIG. 3 shows image completion results of the invention on public data sets; from left to right: the damaged image, the binary mask image, the completed image, and the target real image.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention learns a set of highly nonlinear transformations with a generative adversarial network based on a content attention mechanism and a mask prior to perform the image completion task, so that the completed image contains rich texture details and continuous structures.
As shown in fig. 1, the image completion method based on content attention mechanism and mask prior includes the steps of:
step S1, first, a binary mask image is generated offline by using a binary mask algorithm, and the binary mask image is multiplied by the undamaged image to obtain a damaged image.
For face images, the image is normalized according to the positions of the two eyes and cropped to a uniform size of 256 × 256. For natural images, the image is first enlarged to 350 × 350 and then randomly cropped to a uniform size of 256 × 256. A binary mask image M generated offline is selected at random and multiplied by the undamaged image to obtain the damaged image.
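A short sketch of this preprocessing for the natural-image branch (the face branch depends on an eye-landmark alignment step and is omitted); the function name and the mask convention (1 = valid pixel) are illustrative:

```python
import numpy as np
from PIL import Image

def make_training_pair(img_path, mask, size=256):
    """Resize to 350x350, randomly crop to 256x256, and synthesise the
    damaged image x = y * M from the undamaged image y (step S1)."""
    img = Image.open(img_path).convert("RGB").resize((350, 350))
    top, left = np.random.randint(0, 350 - size, size=2)
    y = np.asarray(img, dtype=np.float32)[top:top + size, left:left + size] / 255.0
    x = y * mask[..., None]          # mask: (256, 256), 1 = valid, 0 = hole
    return x, y
```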
In step S2, the damaged image and the corresponding binary mask image are combined as input training data to train the generative adversarial network model based on the content attention mechanism and the mask prior for the image completion task.
To expand the input sample size and improve the generalization capability of the network, the invention adopts data augmentation operations, including random flipping. When the generative adversarial network completes an image, features are extracted from the input data by an encoder composed of partial convolution layers, the resulting latent code is decoded into an image by a decoder, and the final completed image is output after processing by the content attention mechanism, as shown in FIG. 1.
The encoder and the decoder each consist of 8 convolution layers. The filter sizes of the convolution layers in the encoder are 7, 5, 3, 3, 3, 3, 3 and 3; the filters of the convolution layers in the decoder all have size 3.
In this example, the feature maps are upsampled with conventional methods. The number of convolution layers and the number and size of the filters in each layer can be selected according to actual conditions. The discriminator adopts a convolutional neural network structure that takes the real image pair and the generated completed image pair as input, and its output is judged real or fake with a patch-based adversarial loss function.
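Such a patch-based discriminator can be sketched as a plain convolutional network whose output is a grid of per-patch real/fake scores rather than a single scalar; the layer widths and the normalization chosen below are assumptions, not the patent's exact architecture:

```python
import torch.nn as nn

def patch_discriminator(in_ch=3):
    """Plain CNN whose output is a grid of per-patch real/fake scores,
    for use with the patch adversarial loss."""
    def block(cin, cout, norm=True):
        layers = [nn.Conv2d(cin, cout, 4, stride=2, padding=1)]
        if norm:
            layers.append(nn.InstanceNorm2d(cout))
        layers.append(nn.LeakyReLU(0.2))
        return layers
    return nn.Sequential(
        *block(in_ch, 64, norm=False), *block(64, 128),
        *block(128, 256), *block(256, 512),
        nn.Conv2d(512, 1, 4, padding=1))   # (B, 1, h, w) score map
```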
The method exploits the strong nonlinear fitting capacity of the generative adversarial network based on the content attention mechanism and the mask prior, and for the image completion task the partial convolution layers exploit the prior information in the binary mask image. Second, the invention proposes a content attention module so that the algorithm can reconstruct an unknown region from the known region of the image and gradually increase the texture detail in the generated image. In particular, under the constraint of the applied loss functions, the network produces high-quality images.
Thus, a model for image completion can be trained with the network shown in FIG. 1. In the testing phase, the binary mask and the damaged image are likewise used as model input to obtain the image completion result, as shown in FIG. 3.
Specifically, in the present invention, the total objective function in the image completion task is expressed as follows:
$$
L_{total} = \lambda_{rec} L_{rec} + \lambda_{per} L_{per} + \lambda_{style} L_{style} + \lambda_{tv} L_{tv} + \lambda_{adv} L_{adv}
$$
where L_rec denotes the reconstruction loss function, L_per the perceptual loss function, L_style the style loss function, L_tv the total variation loss function, and L_adv the adversarial loss function; λ_rec, λ_per, λ_style, λ_tv and λ_adv are weighting factors.
The generative adversarial network based on the content attention mechanism and the mask prior performs the image completion task; its final objective is for the loss function L_total to decrease and stabilize at a minimum.
Wherein the reconstruction loss function is expressed as:
$$
L_{rec} = \left\lVert \hat{y} - y \right\rVert_{1}
$$

where ‖·‖₁ denotes the L1 norm, ŷ = G(cat(x, M)) is the completed image produced by the generator from the concatenated damaged image and binary mask, and cat denotes the join (concatenation) operation.
Wherein the perceptual loss function is expressed as:
$$
L_{per} = \frac{1}{N}\sum_{i=1}^{N} \left\lVert \phi_{i}(\hat{y}) - \phi_{i}(y) \right\rVert_{1}
$$

where φ is the pre-trained VGG-16 network and φ_i is the feature map output by the i-th pooling layer; the pool-1, pool-2 and pool-3 layers of VGG-16 are used.
Wherein the style loss function is expressed as:
$$
L_{style} = \sum_{i=1}^{N} \frac{1}{C_{i}\,C_{i}} \left\lVert \phi_{i}(\hat{y})^{\top}\phi_{i}(\hat{y}) - \phi_{i}(y)^{\top}\phi_{i}(y) \right\rVert_{1}
$$

where C_i denotes the number of channels of the feature map output by the i-th layer of the pre-trained VGG-16 network.
Wherein the total variation loss function is expressed as:
$$
L_{tv} = \sum_{(i,j)\in\Omega} \left( \left\lVert \hat{y}_{i,j+1} - \hat{y}_{i,j} \right\rVert_{1} + \left\lVert \hat{y}_{i+1,j} - \hat{y}_{i,j} \right\rVert_{1} \right)
$$

where Ω denotes the damaged region in the image; the total variation loss is a smoothness penalty defined on the one-pixel dilation of the damaged region.
Wherein the adversarial loss function is expressed as:
$$
L_{adv} = \mathbb{E}_{\hat{y}}\!\left[D(\hat{y})\right] - \mathbb{E}_{y\sim P_{Y}}\!\left[D(y)\right] + \lambda\,\mathbb{E}_{\bar{y}}\!\left[\left(\lVert \nabla_{\bar{y}} D(\bar{y}) \rVert_{2} - 1\right)^{2}\right]
$$

where D denotes the discriminator, ȳ is a random interpolation between a generated sample ŷ and a real sample y, and λ is set to 10.
Wherein the generative adversarial network based on the content attention mechanism and the mask prior is trained as follows:
step S21: initializing a weight parameter of the network, wherein λrec、λper、λstyle、λtvAnd λadv6, 0.1, 240, 0.1, 0.001, batch size 32, learning rate 10-4
Step S22: the damaged image and the binary mask image are input into the generator G for image completion. The generated completed image and the real target image are input into the discriminator D, and iterations are carried out in sequence so that the total network loss function L_total decreases and tends to be stable.
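One training iteration of step S22 might then look as follows, pairing a WGAN-GP discriminator update (matching the L_adv definition above with λ = 10) with a generator update on L_total; CompletionLoss refers to the loss sketch given earlier, and all remaining names are illustrative:

```python
import torch

def train_step(G, D, opt_g, opt_d, x, mask, y, loss_fn, gp_weight=10.0):
    """One iteration of step S22. `mask` is the validity mask M (1 = valid);
    `loss_fn` is the CompletionLoss sketched earlier."""
    # --- discriminator update: WGAN loss with gradient penalty (λ = 10) ---
    fake = G(x, mask).detach()
    eps = torch.rand(y.size(0), 1, 1, 1, device=y.device)
    interp = (eps * y + (1 - eps) * fake).requires_grad_(True)  # interpolation ȳ
    grad = torch.autograd.grad(D(interp).sum(), interp, create_graph=True)[0]
    gp = ((grad.flatten(1).norm(dim=1) - 1) ** 2).mean()
    loss_d = D(fake).mean() - D(y).mean() + gp_weight * gp
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- generator update: minimise L_total ---
    fake = G(x, mask)
    loss_g = loss_fn(fake, y, 1 - mask, D(fake))  # 1 - M marks the hole Ω
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_g.item(), loss_d.item()
```

With the hyperparameters above, both networks would typically be driven by torch.optim.Adam with learning rate 10⁻⁴ and a batch size of 32.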
Step S3: completion processing is performed on the test data using the trained generative adversarial network model based on the content attention mechanism and the mask prior.
To explain the specific implementation of the invention in detail and verify its effectiveness, the proposed method was applied to four public databases (one face database and three natural-image databases): CelebA-HQ, ImageNet, Places2 and Paris StreetView. CelebA-HQ contains 30000 high-quality face images. Places2 contains 365 scene categories and more than 8000000 images in total. Paris StreetView contains 15000 Paris street-view images.
ImageNet is a large data set containing more than 14 million images. For Places2, Paris StreetView and ImageNet, the original validation and test sets were used in the invention. For CelebA-HQ, 28000 images were randomly selected for training and the remaining images were used for testing. 60000 binary mask maps were generated offline with the binary mask algorithm; 55000 were randomly selected for training and the remaining 5000 for testing (the binary mask maps are used to generate the damaged images).
Using the generative adversarial network based on the content attention mechanism and the mask prior together with the objective function above, the deep neural network is trained through the adversarial game between the generator and the discriminator and gradient back-propagation, with a damaged image and the corresponding binary mask image as input. The weights of the different loss terms are adjusted continuously during training until the network converges, yielding the image completion model.
To test the effectiveness of the model, image completion was performed on the test-set data; the visualization results in FIG. 3 demonstrate that the proposed method can generate high-quality images.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the invention, and such improvements and modifications should also be regarded as falling within the protection scope of the invention.

Claims (10)

1. An image completion method based on a content attention mechanism and a mask prior, characterized by comprising the following steps:
s1, preprocessing an image, generating a binary mask image M, and synthesizing a damaged image x by using the binary mask image M;
s2, training a generative adversarial network model based on the content attention mechanism and the mask prior for image completion, wherein the model comprises a generator and a discriminator; the damaged image x and the corresponding binary mask image M serve as network input, an undamaged image serves as the target real image y, and the network learns the complex nonlinear transformation mapping from the damaged image to the target real image; the encoder in the generator encodes the input damaged image and binary mask image M through partial convolution layers, and the decoder, guided by the content attention mechanism, decodes the resulting latent code into the completed image ŷ;
calculating the adversarial loss against the target real image in the discriminator; after repeated iterations reach stability, the training of the model is completed;
and S3, using the trained generative adversarial network model to perform completion processing on the test data.
2. The method for image completion based on content attention mechanism and mask prior as claimed in claim 1, wherein step S2 comprises:
s21: initializing the network weight parameters for the image completion task, wherein the loss function of the generator is L_total and the loss function of the discriminator is L_D;
s22: inputting the damaged image and the binary mask image into the generator network G for the image completion task, inputting the generated completed image and the target real image into the discriminator network D, and performing iterative training in sequence so that the loss function L_total of the generator and the loss function L_D of the discriminator both decrease and tend to be stable;
s23: training continues until none of the loss functions decreases further, thereby obtaining the final generative adversarial network model.
3. The image completion method based on the content attention mechanism and the mask prior according to claim 2, wherein the output values of the partial convolution layer depend only on the undamaged region and are described mathematically as follows:
$$
x' = \begin{cases} W^{\top}(F \odot M)\,\dfrac{\operatorname{sum}(\mathbf{1})}{\operatorname{sum}(M)} + b, & \operatorname{sum}(M) > 0 \\ 0, & \text{otherwise} \end{cases}
$$
wherein 1 denotes an all-ones matrix with the same shape as the binary mask map M, W denotes the weights of the convolution layer, F denotes the output feature map of the previous convolution layer, b denotes the bias of the convolution layer, M denotes the corresponding binary mask map, and sum(1)/sum(M) is a scaling factor that adjusts the weight given to the known region;
the binary mask map M is updated after each partial convolution, described mathematically as follows:
$$
m' = \begin{cases} 1, & \operatorname{sum}(M) > 0 \\ 0, & \text{otherwise} \end{cases}
$$
that is, if the partial convolution layer can produce an output from valid inputs, the corresponding position in the binary mask map M is marked as 1.
4. The method of claim 3, wherein the content attention mechanism forms the output of the missing region as follows:
first, the feature similarity between the missing part and the known part is calculated: blocks of the known region are extracted and reshaped to serve as convolution kernel parameters; the cosine similarity between the known-region blocks {f_{x,y}} and the unknown-region blocks {b_{x',y'}} is calculated by the following equation:
$$
s_{x,y,x',y'} = \left\langle \frac{f_{x,y}}{\lVert f_{x,y} \rVert},\; \frac{b_{x',y'}}{\lVert b_{x',y'} \rVert} \right\rangle
$$
the similarity is then weighted with a scaled softmax along the x'y' dimension to obtain the attention value of each pixel:
$$
s^{*}_{x,y,x',y'} = \operatorname{softmax}_{x'y'}\!\left(\lambda\, s_{x,y,x',y'}\right)
$$
wherein λ is a constant; finally, the selected unknown-region blocks {b_{x',y'}} are used as deconvolution kernel parameters to reconstruct the missing region;
to achieve consistency of the attention maps, attention propagation is first performed from left to right and then from top to bottom with kernel size k:
$$
\hat{s}_{x,y,x',y'} = \sum_{i\in\{-k,\dots,k\}} s^{*}_{x+i,\,y,\,x'+i,\,y'}\,, \qquad \hat{s}_{x,y,x',y'} = \sum_{j\in\{-k,\dots,k\}} s^{*}_{x,\,y+j,\,x',\,y'+j}
$$
5. The method of claim 4, wherein the total loss function in the image completion is:
$$
L_{total} = \lambda_{rec} L_{rec} + \lambda_{per} L_{per} + \lambda_{style} L_{style} + \lambda_{tv} L_{tv} + \lambda_{adv} L_{adv}
$$
where L_rec denotes the reconstruction loss function, L_per the perceptual loss function, L_style the style loss function, L_tv the total variation loss function, and L_adv the adversarial loss function; λ_rec, λ_per, λ_style, λ_tv and λ_adv are weighting factors.
6. The image completion method based on the content attention mechanism and the mask prior according to claim 5, wherein the reconstruction loss function is expressed as:
$$
L_{rec} = \left\lVert \hat{y} - y \right\rVert_{1}
$$

where ‖·‖₁ denotes the L1 norm, ŷ = G(cat(x, M)) is the completed image produced by the generator from the concatenated damaged image and binary mask, and cat denotes the join (concatenation) operation.
7. The image completion method based on the content attention mechanism and the mask prior according to claim 6, wherein the perceptual loss function is expressed as:
$$
L_{per} = \frac{1}{N}\sum_{i=1}^{N} \left\lVert \phi_{i}(\hat{y}) - \phi_{i}(y) \right\rVert_{1}
$$

where φ is the pre-trained VGG-16 network, φ_i is the feature map output by the i-th pooling layer (the pool-1, pool-2 and pool-3 layers of VGG-16 are used), and N is the number of selected layers.
8. The image completion method based on the content attention mechanism and the mask prior according to claim 7, wherein the style loss function is expressed as:
$$
L_{style} = \sum_{i=1}^{N} \frac{1}{C_{i}\,C_{i}} \left\lVert \phi_{i}(\hat{y})^{\top}\phi_{i}(\hat{y}) - \phi_{i}(y)^{\top}\phi_{i}(y) \right\rVert_{1}
$$

where C_i denotes the number of channels of the feature map output by the i-th layer of the pre-trained VGG-16 network.
9. The image completion method based on the content attention mechanism and the mask prior according to claim 8, wherein the total variation loss function is expressed as:
$$
L_{tv} = \sum_{(i,j)\in\Omega} \left( \left\lVert \hat{y}_{i,j+1} - \hat{y}_{i,j} \right\rVert_{1} + \left\lVert \hat{y}_{i+1,j} - \hat{y}_{i,j} \right\rVert_{1} \right)
$$

where Ω denotes the damaged region in the image; the total variation loss is a smoothness penalty defined on the one-pixel dilation of the damaged region, and (i, j) indexes a pixel in the image.
10. The image completion method based on the content attention mechanism and the mask prior according to claim 9, wherein the adversarial loss function is expressed as:
$$
L_{adv} = \mathbb{E}_{\hat{y}}\!\left[D(\hat{y})\right] - \mathbb{E}_{y\sim P_{Y}}\!\left[D(y)\right] + \lambda\,\mathbb{E}_{\bar{y}}\!\left[\left(\lVert \nabla_{\bar{y}} D(\bar{y}) \rVert_{2} - 1\right)^{2}\right]
$$

where D denotes the discriminator, ȳ is a random interpolation between a generated sample ŷ and a real sample y, λ is set to 10, E(·) denotes the expectation, and y ~ P_Y denotes a sample y drawn from the distribution P_Y.
CN202011565269.6A 2020-12-25 2020-12-25 Image completion method based on content attention mechanism and mask prior Pending CN112686816A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011565269.6A CN112686816A (en) 2020-12-25 2020-12-25 Image completion method based on content attention mechanism and mask prior

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011565269.6A CN112686816A (en) 2020-12-25 2020-12-25 Image completion method based on content attention mechanism and mask prior

Publications (1)

Publication Number Publication Date
CN112686816A 2021-04-20

Family

ID=75451766

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011565269.6A Pending CN112686816A (en) Image completion method based on content attention mechanism and mask prior

Country Status (1)

Country Link
CN (1) CN112686816A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288537A (en) * 2019-05-20 2019-09-27 湖南大学 Facial image complementing method based on the depth production confrontation network from attention
CN110458939A (en) * 2019-07-24 2019-11-15 大连理工大学 The indoor scene modeling method generated based on visual angle
CN111127346A (en) * 2019-12-08 2020-05-08 复旦大学 Multi-level image restoration method based on partial-to-integral attention mechanism
CN111738940A (en) * 2020-06-02 2020-10-02 大连理工大学 Human face image eye completing method for generating confrontation network based on self-attention mechanism model
CN111915522A (en) * 2020-07-31 2020-11-10 天津中科智能识别产业技术研究院有限公司 Image restoration method based on attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"GENERATIVE IMAGE INPAINTING WITH CONTEXTUAL ATTENTION": ""Generative Image Inpainting with Contextual Attention"", 《ARXIV》 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221757A (en) * 2021-05-14 2021-08-06 上海交通大学 Method, terminal and medium for improving accuracy rate of pedestrian attribute identification
CN113221757B (en) * 2021-05-14 2022-09-02 上海交通大学 Method, terminal and medium for improving accuracy rate of pedestrian attribute identification
CN113298733A (en) * 2021-06-09 2021-08-24 华南理工大学 Implicit edge prior based scale progressive image completion method
CN113298733B (en) * 2021-06-09 2023-02-14 华南理工大学 Implicit edge prior based scale progressive image completion method
CN114022372A (en) * 2021-10-25 2022-02-08 大连理工大学 Mask image repairing method for context encoder introducing semantic loss
CN114022372B (en) * 2021-10-25 2024-04-16 大连理工大学 Mask image patching method for introducing semantic loss context encoder
CN115294185A (en) * 2022-06-14 2022-11-04 中国农业科学院北京畜牧兽医研究所 Pig weight estimation method and related equipment
CN115294185B (en) * 2022-06-14 2023-10-03 中国农业科学院北京畜牧兽医研究所 Pig weight estimation method and related equipment
CN115659797A (en) * 2022-10-24 2023-01-31 大连理工大学 Self-learning method for generating anti-multi-head attention neural network aiming at aeroengine data reconstruction
CN115659797B (en) * 2022-10-24 2023-03-28 大连理工大学 Self-learning method for generating anti-multi-head attention neural network aiming at aeroengine data reconstruction
CN115841681A (en) * 2022-11-01 2023-03-24 南通大学 Pedestrian re-identification anti-attack method based on channel attention
CN117994172A (en) * 2024-04-03 2024-05-07 中国海洋大学 Sea temperature image robust complement method and system based on time sequence dependence and edge refinement

Similar Documents

Publication Publication Date Title
CN112686816A (en) Image completion method based on content attention mechanism and mask prior
Guo et al. Auto-embedding generative adversarial networks for high resolution image synthesis
CN112686817B (en) Image completion method based on uncertainty estimation
CN111368662B (en) Method, device, storage medium and equipment for editing attribute of face image
CN110390638B (en) High-resolution three-dimensional voxel model reconstruction method
CN112184582B (en) Attention mechanism-based image completion method and device
CN111986075B (en) Style migration method for target edge clarification
CN111861945B (en) Text-guided image restoration method and system
CN112818764B (en) Low-resolution image facial expression recognition method based on feature reconstruction model
CN109903236A (en) Facial image restorative procedure and device based on VAE-GAN to similar block search
CN113298734B (en) Image restoration method and system based on mixed hole convolution
CN112801914A (en) Two-stage image restoration method based on texture structure perception
CN111161405A (en) Three-dimensional reconstruction method for animal hair
CN115049556A (en) StyleGAN-based face image restoration method
CN111368734B (en) Micro expression recognition method based on normal expression assistance
CN110415261B (en) Expression animation conversion method and system for regional training
CN117788629B (en) Image generation method, device and storage medium with style personalization
CN117611428A (en) Fashion character image style conversion method
Yu et al. MagConv: Mask-guided convolution for image inpainting
CN114331894A (en) Face image restoration method based on potential feature reconstruction and mask perception
CN115375537A (en) Nonlinear sensing multi-scale super-resolution image generation system and method
CN113343761A (en) Real-time facial expression migration method based on generation confrontation
Yang Super resolution using dual path connections
Wu et al. Semantic image inpainting based on generative adversarial networks
Zeng et al. Swin-CasUNet: cascaded U-Net with Swin Transformer for masked face restoration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20210420)