CN115829880A

CN115829880A - Image restoration method based on context structure attention pyramid network

Info

Publication number: CN115829880A
Application number: CN202211664365.5A
Authority: CN
Inventors: 黄蓉; 郑钰辉
Original assignee: Nanjing University of Information Science and Technology
Current assignee: Nanjing University of Information Science and Technology
Priority date: 2022-12-23
Filing date: 2022-12-23
Publication date: 2023-03-21

Abstract

The invention discloses an image restoration method based on a context structure attention pyramid network, which comprises the following steps: using CelebA-HQ and Places2 data sets, sorting and dividing the data sets into a training set and a data set, and preprocessing the data sets; constructing a pyramid network based on a structural attention mechanism, and performing training modeling by using a training set to obtain an initial image restoration model; repairing the test set by using the model; and evaluating the repair capability of the model through indexes. The invention uses the U-Net structure as a backbone, encodes the context from low-level pixels into high-level semantic features, and decodes the context back into an image. By shifting the attention of the structure from deep to shallow layer by layer to fill the missing area, the consistency of the synthesized texture and the generated structure can be improved, and an image with fine-grained details can be restored. Compared with the existing algorithm, the algorithm has strong robustness and universality and better repairing effect.

Description

Image restoration method based on context structure attention pyramid network

Technical Field

The invention belongs to the field of computer image restoration, and particularly relates to an attention pyramid network image restoration method based on a context structure.

Background

Image restoration is a technical means for estimating and restoring the content of a damaged or missing area according to the known content of an image so that the restored image can meet the human visual perception requirement as much as possible. As an important research content of computer vision and computer graphics, the method is widely applied to the fields of culture, life and the like, such as digital cultural heritage protection, target removal, old photo restoration, movie and television special effect production and the like.

Image inpainting is divided into a conventional method and a deep learning-based method. The traditional image restoration algorithm can be mainly divided into two types, firstly, the method is based on diffusion, and mainly aims to calculate the pixels to be filled by utilizing a differential equation based on the pixels of the edge, but the method is usually suitable for narrow missing areas such as cracks, scratches and the like, the image texture details cannot be presumed, and the defects exist in detail restoration. The other method is based on texture synthesis, and the core idea of the method is to construct a pixel block at the edge of the missing area, search a sample block which is most similar to the pixel block in the intact image area, fill the missing area in the image by using the found sample block, and repeatedly iterate the process until the whole missing area is filled.

The traditional algorithms have certain limitations, and when a missing area is large, or a scene with strong semantics is repaired, such as the five sense organs of a human face, the obtained result is general and does not have good robustness and universality.

Disclosure of Invention

The invention aims to provide an image restoration method based on a context structure attention pyramid network aiming at the defects in the prior art.

In order to achieve the purpose, the invention provides the following technical scheme: the image restoration method based on the context structure attention pyramid network comprises the following steps:

s1, constructing an original real image data set and an image data set to be restored based on an original real image and a mask corresponding to the original real image, and dividing the original real image data set and the image data set to be restored into a training set and a testing set;

s2, constructing a convolution module by taking the images to be restored in the training set as input and the scale characteristic diagrams respectively corresponding to the convolution layers as output and combining a style loss function and a perception loss function;

s3, constructing a structure attention module by taking each scale characteristic diagram as input and the initial characteristic restoration diagram as output;

s4, constructing a layered decoder by taking the image to be repaired in the training set as input and the initial characteristic repair image as output;

s5, taking the scale feature map and the initial feature restoration map as input, taking the restoration images of all scales as output, and combining a scale reconstruction loss function to construct a multi-scale decoder;

s6, based on a layered decoder and a multi-scale decoder, combining an edge preserving loss function with the training set to be repaired as input and the repaired image as output to construct a generator of a countermeasure network;

and S7, according to a generator of the countermeasure network, combining the discriminator and the countermeasure training loss function, constructing and training an image restoration model by taking the images to be restored in the training set as input and the corresponding original real images as output, and using the global loss function in the training to obtain the image restoration model.

Further, in the step S2, the convolution module includes 7 layers of convolution, a convolution kernel of each convolution layer is 3*3, a step size is 2, and a padding number is 1, wherein a deep-to-shallow feature map is

Further, the aforementioned step S3 includes the following sub-steps:

s3.1, respectively selecting n-n feature blocks in the missing areas of the two adjacent scale feature maps, and calculating the structural similarity between the two feature blocks, wherein the structural similarity is as follows:

where d () is the euclidean distance and m and σ are the mean and standard deviation, respectively.

S3.2, applying a softmax function to the similarity to obtain the attention score of each feature block, wherein the attention score is as follows:

and S3.3, carrying out attention transfer after obtaining the attention score from the high-level feature map, and filling the missing area of the adjacent bottom-level feature map by weighting according to the attention score, wherein the formula is as follows:

wherein l is the number of layers,

is p ^l The fill-in area of (a) is,

is the padded area of the initial feature repair map.

Further, the step S5 is specifically: the multi-scale decoder takes the initial feature repair map output by the structural attention module and the scale feature map output by the convolution module as input at the same time, and then decodes layer by layer as follows:

wherein psi ^L To characterize the L-th layer reconstruction of the attention transfer network,

for the feature mapping of the L-th layer of the encoder,

for the characteristics of the L-th layer of the multi-scale decoder, h is the transposed volumeThe volume of the mixture is accumulated,

denotes a characteristic connection, λ ₁ 、λ ₂ Are corresponding parameters.

Further, in the foregoing step S6, the repaired image is represented by the following formula:

wherein, x is a real image,

is the element-wise multiplied symbol and M is the mask.

Further, the penalty function of the aforementioned discriminator in S7 is as follows:

further, in step S2, when constructing the convolution module, the perceptual loss function is as follows:

wherein the real image is compared to the restored image by the corresponding activation feature map of ReLU _ i _1 (i =1,2,3,4,5) trained in VGG-19 network trained on ImageNet, N _j Indicating the number of elements in the jth active layer,

representing the corresponding activation map, x being the real image and z being the restored image;

when constructing the convolution module, the style loss function is used as follows:

wherein the content of the first and second substances,

is a composed of

C of (a) _j ×C _j The size Gram matrix, x is the real image and z is the restored image.

Further, in step S6, when constructing the generator of the countermeasure network, the edge preserving loss function is used as follows:

L _edge ＝‖E(z)-E(x)‖ ₁ ，

wherein E (-) is a sobel filter, the image edge is extracted, x is a real image, and z is a model generation image.

Further, in step S5, when constructing the multi-scale decoder, the scale reconstruction loss function is as follows:

wherein x is ^l Is scaled to

The real picture of the same size is taken,

is to be

Decoding into RGB images of the same size, lambda _l Is the weight for each scale.

Further, in the aforementioned step S7, an image inpainting model is constructed and trained, and a global loss function is used in the training as follows:

L＝α ₁ L _m +α ₂ L _adv +α ₃ L _edge +α ₄ L _perc +α ₅ L _style ，

wherein L is _m Is a multi-scale reconstruction loss function, L _adv Is a antagonistic training loss function, L _edge Is an edge-preserving loss function, L _perc Is a perceptual loss function, L _style Is a style loss function, alpha ₁ 、α ₂ 、α ₃ 、α ₄ 、α ₅ Are corresponding parameters.

Compared with the prior art, the invention has the following beneficial effects: by adopting the method, the repairing quality can be further improved on the premise of ensuring the repairing effect. Richer texture details can be obtained through a layered encoder; the designed structure attention transfer module improves the consistency of the synthesized texture and the generated structure; the multi-scale decoder inputs two parts of characteristics and decodes the characteristics layer by layer, so that the consistency of vision and semantics can be realized; in addition, the method ensures that the repairing result is more real and natural through a plurality of loss functions.

Drawings

FIG. 1 is a schematic general flow diagram of the present invention.

FIG. 2 is a schematic diagram of the model structure of the present invention.

FIG. 3 is a schematic diagram of the structural attention transfer mechanism of the present invention.

FIG. 4 is a schematic diagram of the repair results of the present invention.

Detailed Description

In order to better understand the technical content of the present invention, specific embodiments are described below with reference to the accompanying drawings.

Aspects of the invention are described herein with reference to the accompanying drawings, in which a number of illustrative embodiments are shown. Embodiments of the invention are not limited to those illustrated in the drawings. It is to be understood that the invention is capable of implementation in any of the numerous concepts and embodiments described hereinabove or described in the following detailed description, since the disclosed concepts and embodiments are not limited to any embodiment. In addition, some aspects of the present disclosure may be used alone, or in any suitable combination with other aspects of the present disclosure.

As shown in fig. 1, the attention pyramid network image restoration method based on the context structure includes the following steps:

s2, constructing a convolution module by taking the image to be repaired in the training set as input and the scale characteristic graph corresponding to each convolution layer as output and combining a style loss function and a perception loss function;

s5, constructing a multi-scale decoder by taking the scale characteristic diagram and the initial characteristic restoration diagram as input and the restoration images of all scales as output and combining a scale reconstruction loss function;

s6, based on a layered decoder and a multi-scale decoder, constructing a generator of an anti-network by combining an edge-preserving loss function with the training set to be repaired as input and the repaired image as output;

The real image data set used in the invention adopts CelebA-HQ face data set and Places2 scene data set; an irregular mask dataset from an Nvidia team was used simultaneously, containing six different mask areas of the mask; the regular mask is formed by a rectangular mask, i.e. a white rectangular area is taken at the center of the image. And multiplying the real image and the mask element by element to obtain the image to be restored.

As shown in fig. 2, the model structure of the present invention is a coding-decoding structure, the coder part inputs an image to be repaired, the image is subjected to feature extraction by a convolution layer with a seven-layer convolution kernel of 3*3, a step length of 2 and a filling number of 1, and is activated by a nonlinear activation function leak ReLu, and as the number of layers of convolution layers continuously increases, the extracted features are gradually converted from low-level features such as texture, color and the like into high-level features such as semantic information and the like; and obtaining feature maps of different scales through the convolution operation.

The perception loss is that the characteristics obtained by convolution of the real image are compared with the characteristics obtained by convolution of the generated image, so that the image content is closer to the global structure. Wherein

Representing the characteristics of the i-th layer in the pre-trained VGG-19 network. In this experiment, relu1_1, relu2_1, relu3_1, relu4_1, relu5 _1were used as the number of layers for extracting features, respectively.

In the step S2, a convolution module is constructed by taking the images to be restored in the training set as input and the scale characteristic maps respectively corresponding to the convolution layers as output and combining the style loss function and the perception loss function;

when constructing the convolution module, the perceptual loss function is used as follows:

the Gram matrix is first solved at each layer,

extraction from VGG-19 networks in representing perceived lossAnd taking a Gram matrix of the feature vector, calculating Euclidean distances among corresponding layers, and finally adding the Euclidean distances of different layers to obtain the final style loss.

wherein the content of the first and second substances,

is a composed of

And S3, constructing a structural attention module by taking each scale characteristic diagram as input and the initial characteristic restoration diagram as output.

Under the assumption that pixels with similar semantics should have similar details, the encoder uses a structural attention transfer network (SAT) layer by layer in a pyramidal fashion to fill in missing regions from the high-level feature map to the low-level feature map; as shown in fig. 3, the network consists of two parts, namely attention calculation and attention transfer, and firstly, the feature blocks 3*3 are respectively selected from the inside and the outside of the missing area, and then the structural similarity between the two feature blocks is calculated:

where d () is the euclidean distance and m and σ are the mean and standard deviation. Then applying the softmax function to the similarity yields the attention score of each feature block:

and acquiring an attention score from the high-level feature map, performing attention transfer, and filling the missing area of the adjacent bottom-level feature map by using the attention score in a weighting manner.

Wherein l is the number of layers,

is p ^l The fill-in area of (a) is,

are the padded areas of the reconstructed feature map. The final reconstructed feature map is input to the decoder via a skip connection.

The decoder receives the reconstructed features from the SAT module and the potential features from the encoder as input at the same time, and then decodes layer by layer, and the specific formula is as follows:

…

for the feature mapping of the encoder L-th layer,

for the features of the L-th layer of the multi-scale decoder, h is the transposed convolution,

The loss function of the invention consists of five parts: (1) Multi-scale reconstruction loss function L _m The method is used for perfecting the generation of the image of each scale missing region; (2) Antagonism training loss function L _adv Generating a more real picture through the confrontation training; (3) Edge-preserving loss function L _edge Controlling the edge structure of the generated image; (4) Perceptual loss function L _perc The pre-trained VGG model promotes better repairing effect; (5) Style loss function L _style And through the pre-trained covariance matrix of the VGG model characteristics, overall common semantic information is kept and image restoration is promoted.

Multi-scale reconstruction loss function L _m Calculating the L1 distance between a predicted image and a real image in each scale, gradually perfecting the prediction of a missing region in each scale by controlling the distance, and constructing a multi-scale decoder by using a scale reconstruction loss function as follows:

wherein x is ^l Is scaled to

The real picture of the same size is taken,

is to be

Decoding into RGB images of the same size, lambda _l Is the weight of each scale;

in step S6, based on a layered decoder and a multi-scale decoder, a generator for constructing a countermeasure network by combining an edge-preserving loss function with an image to be repaired as input and a repaired image as output in a training set. In the invention, the SN-PatchGAN is used for constructing a discriminator, and in the process of resistance training, x is a real picture,

is a symbol of multiplication element by element, M is a mask, z represents the whole of the generated image and the area not missing in the original image, that is:

the penalty function of the final discriminator is expressed as:

when a generator of the countermeasure network is constructed, the sobel filter is used as the edge preserving loss function, and the edge preserving loss function is expressed as follows:

L _edge ＝‖E(z)-E(x)‖ ₁ ，

in step S7, an image inpainting model is constructed and trained, and a global loss function is used in the training as the following total loss function:

wherein alpha is ₁ 、α ₂ 、α ₃ 、α ₄ 、α ₅ To correspond to the parameters, L _m Is a multi-scale reconstruction loss function, L _adv Is a antagonistic training loss function, L _edge Is an edge-preserving loss function, L _perc Is a perceptual loss function, L _style Is a style loss function.

And finally, testing the trained repairing model, and inputting the pre-trained test image into the model. And then evaluating the repairing effect of the model through L1Loss, peak signal to noise ratio (PSNR) and Structural Similarity (SSIM) indexes.

Fig. 4 shows the repair result of the damaged image by using the method of the present invention, and it can be seen that a good repair effect is achieved.

Although the present invention has been described with reference to the preferred embodiments, it is not intended to be limited thereto. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the invention. Therefore, the protection scope of the present invention should be determined by the appended claims.

Claims

1. The image restoration method based on the context structure attention pyramid network is characterized by comprising the following steps of:

2. The method for repairing an image based on a context and structure attention pyramid network as claimed in claim 1, wherein in step S2, the convolution module comprises 7 layers of convolution, the convolution kernel of each convolution layer is 3*3, the step size is 2, the padding number is 1, and the features from deep to light are mapped into

3. The method for image inpainting based on the context structure attention pyramid network as claimed in claim 1, wherein the step S3 comprises the following sub-steps:

wherein d () is the euclidean distance, and m and σ are the mean and standard deviation, respectively;

s3.2, applying the softmax function to the similarity to obtain the attention score of each feature block, wherein the attention score is as follows:

s3.3, acquiring an attention score from the high-level feature map, then carrying out attention transfer, and filling the missing region of the adjacent low-level feature map in a weighting mode by using the attention score, wherein the formula is as follows:

wherein, l is the number of layers,

is p ^l The fill-in area of (a) is,

is the padded area of the initial feature repair map.

4. The method for image restoration based on the context-based structure attention pyramid network as claimed in claim 3, wherein the step S5 specifically comprises: the multi-scale decoder takes the initial feature repair map output by the structural attention module and the scale feature map output by the convolution module as input at the same time, and then decodes layer by layer as follows:

for the feature mapping of the L-th layer of the encoder,

5. The method for image restoration based on the context-based structure attention pyramid network of claim 4, wherein in step S6, the restored image is represented by the following formula:

wherein, x is a real image,

is the element-wise multiplied symbol and M is the mask.

6. The method of image inpainting based on the context-based structure attention pyramid network of claim 5, wherein in S7, the fighting loss function of the discriminator is as follows:

7. the method for restoring an image based on a context structure attention pyramid network according to claim 1, wherein in step S2, when constructing the convolution module, the perceptual loss function is defined as follows:

representing the corresponding activation map, x being the real image, z being the repair image;

wherein, the first and the second end of the pipe are connected with each other,

is a composed of

8. The method for image inpainting based on context and structure attention pyramid network as claimed in claim 1, wherein in step S6, when constructing the generator of the countermeasure network, the edge preserving loss function is as follows:

L _edge ＝‖E(z)-E(x)‖ ₁ ，

9. The method for image restoration based on the context-based structure attention pyramid network of claim 1, wherein in step S5, when constructing the multi-scale decoder, the scale reconstruction loss function is as follows:

wherein x is ^l Is scaled to

The real picture of the same size is taken,

is to be

10. The method for image inpainting based on the context and structure attention pyramid network as claimed in claim 1, wherein in step S7, an image inpainting model is constructed and trained, and a global loss function is used in the training as follows:

wherein L is _m Is a multi-scale reconstruction loss function, L _adv Is a resistance training loss function, L _edge Is an edge-preserving loss function, L _perc Is a perceptual loss function, L _style Is a style loss function, alpha ₁ 、α ₂ 、α ₃ 、α ₄ 、α ₅ Are corresponding parameters.