CN114943646A - Gradient weight loss and attention mechanism super-resolution method based on texture guidance - Google Patents

Gradient weight loss and attention mechanism super-resolution method based on texture guidance

Info

Publication number
CN114943646A
Authority
CN
China
Prior art keywords
image
resolution
loss
generator
network
Prior art date
Legal status
Pending
Application number
CN202210636553.0A
Other languages
Chinese (zh)
Inventor
孙建德 (Sun Jiande)
王海涛 (Wang Haitao)
李静 (Li Jing)
万文博 (Wan Wenbo)
Current Assignee
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date
Filing date
Publication date
Application filed by Shandong Normal University filed Critical Shandong Normal University
Priority to CN202210636553.0A
Publication of CN114943646A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00: Geometric image transformation in the plane of the image
    • G06T 3/40: Scaling the whole image or part thereof
    • G06T 3/4007: Interpolation-based scaling, e.g. bilinear interpolation
    • G06T 3/4046: Scaling the whole image or part thereof using neural networks
    • G06T 3/4053: Super resolution, i.e. output image resolution higher than sensor resolution
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/048: Activation functions
    • G06N 3/08: Learning methods

Abstract

The invention discloses a texture-guided super-resolution method that mainly addresses the severe loss of high-frequency detail when a high-resolution image is generated from a low-resolution image in the prior art. The method mainly comprises the following steps: (1) obtain paired high-resolution and low-resolution training images; (2) construct an image super-resolution network that uses an attention mechanism to weigh and fuse the features of each channel, train the generator network, and constrain the training with a gradient weight loss under texture guidance; (3) using the generator trained in step (2), train a relative discriminator to fine-tune the generator network and obtain the final image super-resolution network model; (4) input a low-resolution image and output a high-resolution generated image through the image super-resolution model. The method recovers high-frequency detail information from the low-resolution image more accurately and can be used in fields such as target recognition and image classification.

Description

Gradient weight loss and attention mechanism super-resolution method based on texture guidance
Technical Field
The invention relates to a super-resolution method in image processing technology, specifically single-image super-resolution; it is used to improve image quality and can be applied in fields such as target recognition and image classification.
Background
With the rapid development of image processing technology, ever more ultra-high-definition display devices are emerging, and the demand for higher-resolution images and videos keeps growing. Image super-resolution is the task of generating a high-resolution image from a low-resolution image, and as a fundamental vision task it has received continuous attention.
In recent years, super-resolution methods have mainly fallen into three categories: interpolation-based methods, reconstruction-based methods, and learning-based methods. Interpolation-based methods such as bicubic and Lanczos interpolation are fast and direct, but they lose the high-frequency detail of the image, the recovered high-resolution image is accompanied by unnatural details such as artifacts, and the image quality is poor. Reconstruction-based methods use prior knowledge to limit the possible solution space and generate sharper details; however, as the magnification factor increases, their performance degrades, the quality of the recovered high-resolution image drops, and they generally take more time and are computationally expensive. Learning-based methods typically use a machine learning algorithm to obtain a nonlinear mapping model between low-resolution and high-resolution images, for example Markov random fields, neighborhood embedding, sparse coding, and random forests. In recent years, learning-based methods have attracted much attention because of their superior performance compared with other super-resolution methods.
Deep-learning-based super-resolution is one kind of learning-based method: a convolutional neural network (CNN) is used to obtain a nonlinear mapping model between the low-resolution and high-resolution images, producing sharper high-resolution images of higher quality. The pioneering work that applied deep learning to super-resolution was SRCNN. Using a CNN to solve the image super-resolution problem is superior to traditional methods because the network can learn richer features from large amounts of data. After SRCNN, VDSR further deepened the network to address single-image super-resolution. EDSR proposed removing the batch normalization (BN) layers, because they normalize the features in a way that may adversely affect final performance. However, the objective functions of these methods focus mainly on minimizing the mean-square reconstruction error, which leads to super-resolved images lacking high-frequency information. To address this, the super-resolution generative adversarial network (SRGAN) was proposed, which can reconstruct finer texture details to some extent; although its scores on evaluation indexes such as peak signal-to-noise ratio and structural similarity are not high, the resulting images are visually more acceptable.
Existing super-resolution methods do not consider the texture relationship between the low-resolution and high-resolution images or the aggregation of global features, and they ignore the fact that a low-resolution image can yield more detailed textures through texture guidance and global feature aggregation.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a gradient weight loss and attention mechanism super-resolution method based on texture guidance so as to improve the detail information of a generated image.
The technical scheme for realizing the purpose is as follows:
A texture-guided gradient weight loss and attention mechanism super-resolution method constructs a CNN-based image super-resolution network model comprising a generator that produces images and a discriminator that judges whether a generated image is real. The generator extracts features from the input low-resolution image and performs nonlinear feature mapping, computes the correlation among feature maps through an attention mechanism to redistribute the weight of each feature map, and obtains the high-resolution image through the attention-based image reconstruction module; the gradient weight loss constrains the training process under texture guidance. The method specifically comprises the following steps:
(1) constructing a data set: acquire high-resolution images from a data set; the low-resolution images are obtained by down-sampling the high-resolution images, yielding paired training data, with the specific formula:
x=F(y)
where F(·) denotes the down-sampling operation, x the low-resolution image, and y the high-resolution image;
(2) constructing an image super-resolution model and training the generator: the model consists of two branch networks, a generator and a discriminator, and comprises a shallow feature extraction module, a nonlinear mapping module, an attention-based image reconstruction module, and a discriminator module; the shallow feature extraction module consists of convolutional layers, the nonlinear mapping module consists of residual dense networks, and the attention-based image reconstruction module consists of an attention module and an up-sampling module; features are extracted and mapped by the CNN, and the resulting features are finally weighted and reconstructed to obtain the high-resolution image;
constructing the objective equation: the generator is trained under a generative adversarial network loss L_G and a gradient weight loss L_gw, specifically:
L = L_G + L_gw
where the generative adversarial network loss is:
L_G = L_1 + λ·L_per + L_adv
in which L_1 = ‖y − f(x)‖, f(x) denotes the generated image, y denotes the high-resolution image, λ is a loss weighting parameter, and L_per is the perceptual loss:
L_per = ‖φ(y) − φ(f(x))‖
where φ(y) and φ(f(x)) denote the features extracted by a CNN from the real image and the generated image, respectively;
L_adv is the adversarial loss of the generator:
L_adv = −E_{X_r}[log(1 − D(X_r, X_f))] − E_{X_f}[log D(X_f, X_r)]
where D(X_r, X_f) is the discriminator, X_r and X_f denote the real image and the generated image respectively, the discriminator outputs the probability that its input is judged to be the real image, and E[·] denotes the expectation over the corresponding distribution;
the gradient weight loss is formulated as:
L_gw = D_gw ⊙ ‖y − f(x)‖
where f(x) denotes the generated image and y the real image, and the weight map is D_gw = (1 + α·D_x)(1 + α·D_y), with
D_x = |∇_x f(x) − ∇_x y|,  D_y = |∇_y f(x) − ∇_y y|
where D_x and D_y are the gradient difference maps between the generated image and the high-resolution image in the horizontal and vertical directions, ∇f(x) denotes the gradient image of the generated image, ∇y denotes the gradient image of the real image, and α is a weight coefficient in the loss function;
(3) training the discriminator: after the initial training of the generator is completed, train the discriminator to fine-tune the generator; a relative discriminator is used instead of a standard discriminator to generate images that are more realistic and of better quality. The discriminator loss function is:
L_D = −E_{X_r}[log D(X_r, X_f)] − E_{X_f}[log(1 − D(X_f, X_r))]
where D(X_r, X_f) is the discriminator, which takes the real image and the generated image as input and outputs the probability that the input is judged to be the real image, and E[·] denotes the expectation over the corresponding distribution.
(4) high-resolution image output: input a low-resolution image and output a high-resolution image through the trained image super-resolution network model, with the formula:
Y = F_SR(x)
where x is the input low-resolution image, F_SR(·) is the trained image super-resolution model, and Y is the output high-resolution image.
More particularly, the specific steps of step (2) are:
(2a) obtaining a feature map m after an input image passes through a shallow feature extraction module, namely two convolution layers;
(2b) carrying out nonlinear mapping on the feature map m through a plurality of residual dense network modules to obtain the feature map m_1;
(2c) constructing the attention-based image reconstruction module; the global description of the feature map u over the whole spatial extent is obtained through global pooling, with the formula:
z_c = F_sq(u_c) = (1/(H×W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)
where H and W are the height and width of the feature map and u_c(i, j) is the value of feature map u at position (i, j); after obtaining the global feature description of u, the relationship between the channels is obtained using:
z = F_ex(z, W) = σ(g(z, W)) = σ(W_2 ReLU(W_1 z))
where W_1 ∈ R^{(C/r)×C} and W_2 ∈ R^{C×(C/r)}, r is the dimensionality-reduction coefficient, ReLU is an activation function, σ(·) is the Sigmoid activation function, and z is the final output scalar;
finally, the activation value (i.e. the weight) of each channel is multiplied by the original feature map u:
x̃_c = F_scale(u_c, s_c) = s_c · u_c
where F_scale(u_c, s_c) is the channel-wise product of the feature map u_c and the scalar s_c;
(2e) training the generator network; the loss function combines the generative adversarial network loss and the gradient weight loss, with the specific formula:
L = L_G + L_gw
The generator loss is:
L_G = L_1 + λ·L_per + L_adv
where L_1 = ‖y − f(x)‖, f(x) denotes the generated image, y denotes the high-resolution image, L_per is the perceptual loss, λ is a loss weighting parameter, and the perceptual loss is computed on the feature map output by layer relu3_3 of a VGG16 network trained on the ImageNet dataset, with the specific formula L_per = ‖φ(y) − φ(f(x))‖, where φ(y) and φ(f(x)) denote the features extracted by the CNN from the real image and the generated image;
L_adv is the adversarial loss of the generator network:
L_adv = −E_{X_r}[log(1 − D(X_r, X_f))] − E_{X_f}[log D(X_f, X_r)]
where D(X_r, X_f) is the discriminator, X_r and X_f are the real image and the generated image, the output is the probability of being judged the real image, and E[·] denotes the expectation over the corresponding distribution;
the gradient weight loss is formulated as:
L_gw = D_gw ⊙ ‖y − f(x)‖
where f(x) denotes the generated image and y the high-resolution image, and D_gw = (1 + α·D_x)(1 + α·D_y), with
D_x = |∇_x f(x) − ∇_x y|,  D_y = |∇_y f(x) − ∇_y y|
where D_x and D_y are the texture (gradient) difference maps between the generated image and the high-resolution image in the horizontal and vertical directions, ∇f(x) denotes the gradient image of the generated image, ∇y denotes the gradient image of the real image, and α is a weight coefficient in the loss function, taken as 4 in the present example.
Compared with the prior art, the invention has the following advantages:
in generating a high resolution image from an input low resolution image, a guiding role of texture should be more emphasized. The correlation among the channels is learned by the introduction of an attention mechanism to emphasize the relation among different channels of the feature map. Because texture edges in the image are more emphasized, the image generated by the embodiment of the invention has more authenticity, the texture details are clearer, and the visual perception quality of the generated image is effectively improved; in addition, the gradient weight loss and the attention mechanism module used in the embodiment of the invention have lower calculation cost, but the effect is obviously improved.
Drawings
FIG. 1 is a flow chart of an image super-resolution network according to an embodiment of the present invention;
FIG. 2 is a diagram of a generator network architecture according to an embodiment of the present invention;
FIG. 3 is a diagram of a residual dense network architecture of an embodiment of the present invention;
FIG. 4 is a diagram of an arbiter network according to an embodiment of the present invention;
FIG. 5 is a graph comparing the effects of the examples of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings, and the embodiments described herein are only for the purpose of more clearly illustrating the present invention and are not intended to limit the scope of the present invention.
Referring to fig. 1, the specific implementation steps of the present invention are as follows:
step 1, constructing a data set.
The high-resolution image y and the corresponding down-sampled low-resolution image x are input separately. The embodiment of the invention uses images from the DIV2K dataset; the input high-resolution images are of size 128 × 128 and the input low-resolution images are of size 32 × 32, where each low-resolution image is obtained by down-sampling the high-resolution image with bicubic interpolation, according to:
x=F(y)
where F(·) denotes the down-sampling operation, x the low-resolution image, and y the high-resolution image. Paired training data are thus obtained.
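The pairing step can be illustrated with a minimal sketch, assuming PyTorch; the function name make_lr_hr_pair, the 4x scale factor, and the value range are illustrative assumptions, not part of the patent.

```python
# Minimal sketch of building a paired sample x = F(y) by bicubic down-sampling.
import torch
import torch.nn.functional as F

def make_lr_hr_pair(hr: torch.Tensor, scale: int = 4):
    """hr: high-resolution patch of shape (C, H, W) with values in [0, 1]."""
    lr = F.interpolate(hr.unsqueeze(0), scale_factor=1.0 / scale,
                       mode="bicubic", align_corners=False)
    return lr.squeeze(0).clamp(0, 1), hr          # (C, H/s, W/s), (C, H, W)

# Example: a 128 x 128 HR patch yields a 32 x 32 LR input, as in the embodiment.
hr_patch = torch.rand(3, 128, 128)
lr_patch, hr_patch = make_lr_hr_pair(hr_patch, scale=4)
```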
Step 2, constructing the image super-resolution network and training the generator.
Construct the image super-resolution model, which consists of a generator and a discriminator and comprises a shallow feature extraction module, a nonlinear mapping module, an image reconstruction module, and a discriminator module; the nonlinear mapping module consists of residual dense networks, and the image reconstruction module consists of an attention module and an up-sampling module. The image super-resolution network constructed by the invention contains several convolutional modules that extract and map features between the low-resolution and high-resolution images, and the extracted features are finally weighted and fused to obtain the final high-resolution image. An objective equation is constructed to train the generator, with the gradient weight loss additionally added to constrain the training process. The generator network structure is shown in FIG. 2.
(2a) The low-resolution image passes through shallow feature extraction, i.e. two convolutional layers, to obtain a feature map m of spatial size 32 × 32.
(2b) The feature map is nonlinearly mapped by the constructed nonlinear mapping module, whose structure is shown in FIG. 3. The module is composed of several residual dense networks; the number of residual blocks is 16 and the number of residual dense networks is 23. Each residual dense network module is composed of three dense network modules fused by residual scaling and maps the feature map to 64 channels of size 32 × 32; a dense network module is composed of a convolutional layer, a LeakyReLU activation layer, a convolutional layer, a LeakyReLU activation layer, and a convolutional layer, built with dense connections. The nonlinearly mapped feature map m_1, of size 64 × 32 × 32, is finally obtained and fed to the attention-based image reconstruction module.
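A minimal sketch of the shallow feature extractor and one residual dense module described in (2a) and (2b), assuming PyTorch; the growth-channel count gc, the residual scaling factor 0.2, and the class names are illustrative assumptions rather than values specified by the patent text.

```python
# Minimal sketch: two-conv shallow extractor + residual dense modules.
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Conv-LeakyReLU-Conv-LeakyReLU-Conv with dense (concatenated) connections."""
    def __init__(self, nf=64, gc=32):
        super().__init__()
        self.conv1 = nn.Conv2d(nf, gc, 3, 1, 1)
        self.conv2 = nn.Conv2d(nf + gc, gc, 3, 1, 1)
        self.conv3 = nn.Conv2d(nf + 2 * gc, nf, 3, 1, 1)
        self.lrelu = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        c1 = self.lrelu(self.conv1(x))
        c2 = self.lrelu(self.conv2(torch.cat([x, c1], dim=1)))
        return self.conv3(torch.cat([x, c1, c2], dim=1))

class ResidualDenseModule(nn.Module):
    """Three dense blocks fused by residual scaling."""
    def __init__(self, nf=64, scaling=0.2):
        super().__init__()
        self.blocks = nn.ModuleList([DenseBlock(nf) for _ in range(3)])
        self.scaling = scaling

    def forward(self, x):
        out = x
        for block in self.blocks:
            out = out + self.scaling * block(out)   # residual scaling fusion
        return x + self.scaling * out

class ShallowAndBody(nn.Module):
    """Two-conv shallow feature extractor followed by a stack of residual dense modules."""
    def __init__(self, in_ch=3, nf=64, n_modules=23):
        super().__init__()
        self.shallow = nn.Sequential(nn.Conv2d(in_ch, nf, 3, 1, 1),
                                     nn.Conv2d(nf, nf, 3, 1, 1))
        self.body = nn.Sequential(*[ResidualDenseModule(nf) for _ in range(n_modules)])

    def forward(self, x):
        m = self.shallow(x)      # feature map m
        return self.body(m)      # non-linearly mapped feature map m_1
```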
(2c) The global description of the feature map u over the whole spatial extent is obtained through global pooling:
z_c = F_sq(u_c) = (1/(H×W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)
where H and W are the height and width of the feature map and u_c(i, j) is the value of feature map u at position (i, j).
(2d) After obtaining the global feature description of u, the relationship between the channels is obtained using:
z = F_ex(z, W) = σ(g(z, W)) = σ(W_2 ReLU(W_1 z))
where W_1 ∈ R^{(C/r)×C} and W_2 ∈ R^{C×(C/r)}, r is the dimensionality-reduction coefficient, ReLU is an activation function, σ(·) is the Sigmoid activation function, and z is the final output scalar.
(2e) Finally, the activation value (i.e. the weight) of each channel is multiplied by the original feature map u:
x̃_c = F_scale(u_c, s_c) = s_c · u_c
where F_scale(u_c, s_c) is the channel-wise product of the feature map u_c and the scalar s_c.
(2f) After the convolutional up-sampling module, the feature map obtained in (2e) grows to 64 × 128 × 128, and the last convolution generates the image, changing the feature map from 64 × 128 × 128 to 3 × 128 × 128.
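Steps (2c) to (2f) can be sketched as follows, assuming PyTorch; the reduction ratio r = 16 and the PixelShuffle-based up-sampling are assumptions, since the text only specifies global pooling, the σ(W_2 ReLU(W_1 z)) excitation, channel-wise rescaling, and a convolutional up-sampling module.

```python
# Minimal sketch of the attention-based reconstruction module.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels=64, r=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)         # z_c = (1/HW) sum u_c(i, j)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),     # W1 z
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),     # W2 ReLU(W1 z)
            nn.Sigmoid(),                           # sigma(.)
        )

    def forward(self, u):
        b, c, _, _ = u.shape
        z = self.pool(u).view(b, c)
        s = self.fc(z).view(b, c, 1, 1)
        return u * s                                # F_scale(u_c, s_c) = s_c * u_c

class AttentionReconstruction(nn.Module):
    """Channel attention followed by x4 up-sampling and the final RGB convolution."""
    def __init__(self, nf=64, scale=4):
        super().__init__()
        self.attn = ChannelAttention(nf)
        self.upsample = nn.Sequential(
            nn.Conv2d(nf, nf * scale * scale, 3, 1, 1),
            nn.PixelShuffle(scale),                 # 32x32 -> 128x128
            nn.LeakyReLU(0.2, inplace=True),
        )
        self.last = nn.Conv2d(nf, 3, 3, 1, 1)       # 64 channels -> 3 (RGB)

    def forward(self, u):
        return self.last(self.upsample(self.attn(u)))
```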
(2g) Train the network with the training samples generated in step 1 using the Adam stochastic gradient descent algorithm; the loss function combines the generative adversarial network loss and the gradient weight loss, with the specific formula:
L = L_G + L_gw
The generator loss is:
L_G = L_1 + λ·L_per + L_adv
where L_1 = ‖y − f(x)‖, f(x) denotes the generated image, y denotes the high-resolution image, L_per is the perceptual loss, λ is a loss weighting parameter, and the perceptual loss is computed on the feature map output by layer relu3_3 of a VGG16 network trained on the ImageNet dataset, with the specific formula L_per = ‖φ(y) − φ(f(x))‖, where φ(y) and φ(f(x)) denote the features extracted by the CNN from the real image and the generated image;
L_adv is the adversarial loss of the generator network:
L_adv = −E_{X_r}[log(1 − D(X_r, X_f))] − E_{X_f}[log D(X_f, X_r)]
where D(X_r, X_f) is the discriminator, X_r and X_f are the real image and the generated image, the output is the probability of being judged the real image, and E[·] denotes the expectation over the corresponding distribution.
The gradient weight loss is formulated as:
L_gw = D_gw ⊙ ‖y − f(x)‖
where f(x) denotes the generated image and y the high-resolution image, and D_gw = (1 + α·D_x)(1 + α·D_y), with
D_x = |∇_x f(x) − ∇_x y|,  D_y = |∇_y f(x) − ∇_y y|
where D_x and D_y are the texture (gradient) difference maps between the generated image and the high-resolution image in the horizontal and vertical directions, ∇f(x) denotes the gradient image of the generated image, ∇y denotes the gradient image of the real image, and α is a weight coefficient in the loss function, taken as 4 in the present example.
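The gradient weight loss of (2g) can be sketched as below, assuming PyTorch; the forward-difference gradient operator is an assumption (any gradient filter could be substituted), while α = 4 follows the embodiment.

```python
# Minimal sketch of L_gw = D_gw * |y - f(x)| with D_gw = (1 + a*D_x)(1 + a*D_y).
import torch
import torch.nn.functional as F

def image_gradients(img: torch.Tensor):
    """Horizontal and vertical gradient images of a (B, C, H, W) tensor."""
    gx = img[:, :, :, 1:] - img[:, :, :, :-1]     # horizontal differences
    gy = img[:, :, 1:, :] - img[:, :, :-1, :]     # vertical differences
    gx = F.pad(gx, (0, 1, 0, 0))                  # pad back to H x W
    gy = F.pad(gy, (0, 0, 0, 1))
    return gx, gy

def gradient_weight_loss(fake: torch.Tensor, real: torch.Tensor, alpha: float = 4.0):
    """L1 error re-weighted by the gradient difference maps D_x, D_y."""
    gx_f, gy_f = image_gradients(fake)
    gx_r, gy_r = image_gradients(real)
    d_x = (gx_f - gx_r).abs()                     # gradient difference, horizontal
    d_y = (gy_f - gy_r).abs()                     # gradient difference, vertical
    d_gw = (1 + alpha * d_x) * (1 + alpha * d_y)  # texture-guided weight map
    return (d_gw * (real - fake).abs()).mean()
```

In training, this term is simply added to the L1, perceptual, and adversarial terms to form the total generator loss L = L_G + L_gw.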
Step 3, training the discriminator and fine-tuning the generator.
Using the preliminarily trained generator of the image super-resolution network obtained in step 2, train the discriminator so as to fine-tune the generator and obtain the final image super-resolution network model. The discriminator network structure is shown in FIG. 4: it comprises 8 convolutional layers; as the network deepens, the number of features keeps increasing while the feature size keeps shrinking; the activation function is LeakyReLU; finally, two fully connected layers and a final Sigmoid activation yield the probability of being predicted as a natural image, driving the generated image closer to the real image. A relative discriminator is used instead of a standard discriminator to generate more realistic, better-quality images, with the discriminator network loss function:
L_D = −E_{X_r}[log D(X_r, X_f)] − E_{X_f}[log(1 − D(X_f, X_r))]
where D(X_r, X_f) is the discriminator, X_r and X_f are the real image and the generated image, the output is the probability of being judged the real image, and E[·] denotes the expectation over the corresponding distribution.
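A minimal sketch of the relative discriminator loss and the corresponding generator adversarial term, assuming PyTorch and an ESRGAN-style relativistic average formulation; the exact variant of the relative discriminator is an assumption, as the text does not spell it out. Here c_real and c_fake are the raw (pre-Sigmoid) discriminator outputs for the real and generated images.

```python
# Minimal sketch of relativistic discriminator / generator adversarial losses.
import torch
import torch.nn.functional as F

def relativistic_d_loss(c_real: torch.Tensor, c_fake: torch.Tensor):
    """Discriminator loss: real should score above the average fake, and vice versa."""
    loss_real = F.binary_cross_entropy_with_logits(
        c_real - c_fake.mean(), torch.ones_like(c_real))
    loss_fake = F.binary_cross_entropy_with_logits(
        c_fake - c_real.mean(), torch.zeros_like(c_fake))
    return loss_real + loss_fake

def relativistic_g_loss(c_real: torch.Tensor, c_fake: torch.Tensor):
    """Adversarial term of the generator loss under the same relativistic formulation."""
    loss_real = F.binary_cross_entropy_with_logits(
        c_real - c_fake.mean(), torch.zeros_like(c_real))
    loss_fake = F.binary_cross_entropy_with_logits(
        c_fake - c_real.mean(), torch.ones_like(c_fake))
    return loss_real + loss_fake
```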
Step 4, inputting a low-resolution image and outputting a high-resolution generated image.
A low-resolution image is input and passed through the final image super-resolution network model obtained in step 3 to obtain the high-resolution image, with the specific formula:
Y = F_SR(x)
where x is the input low-resolution image, F_SR(·) is the trained image super-resolution model, and Y is the output high-resolution image.
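The inference step Y = F_SR(x) can be sketched as follows, assuming PyTorch and that the fine-tuned generator was serialized as a whole module; the file names and image I/O utilities are illustrative assumptions.

```python
# Minimal inference sketch: load the trained generator and super-resolve one image.
import torch
from torchvision.io import read_image
from torchvision.utils import save_image

generator = torch.load("generator.pth", map_location="cpu")  # assumes torch.save(generator, ...)
generator.eval()

x = read_image("input_lr.png").float() / 255.0       # low-resolution input, (C, H, W)
with torch.no_grad():
    y = generator(x.unsqueeze(0)).clamp(0, 1)        # high-resolution output, (1, C, 4H, 4W)
save_image(y, "output_hr.png")
```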
To verify the effect of the invention, the proposed method is compared with other existing image super-resolution network methods. The evaluation indexes are as follows:
PI (perceptual index) is a comprehensive perceptual image quality criterion proposed in the PIRM super-resolution challenge and is a currently mainstream super-resolution quality index, with the specific formula:
PI = (1/2)((10 − Ma) + NIQE)
where the Ma score uses spatial- and frequency-domain statistics as the feature representation of the image; each group of extracted features is trained in a separate ensemble of regression trees, and the quality score is predicted by regression from a large number of human visual perception scores; a larger Ma score indicates better visual quality of the image.
NIQE (natural image quality evaluator) is based on quality-aware features; a smaller NIQE indicates better image quality. The smaller the resulting perceptual index PI obtained by combining these two quality measures, the better the image quality and visual quality.
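As a worked example of the perceptual index, assuming the Ma and NIQE scores have already been computed by their reference implementations:

```python
# PI = 0.5 * ((10 - Ma) + NIQE); lower PI indicates better perceptual quality.
def perceptual_index(ma_score: float, niqe_score: float) -> float:
    return 0.5 * ((10.0 - ma_score) + niqe_score)

# Example: Ma = 8.2, NIQE = 3.1  ->  PI = 0.5 * (1.8 + 3.1) = 2.45
print(perceptual_index(8.2, 3.1))
```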
The results of this deep-learning-based super-resolution method are shown in FIG. 5. Compared with other super-resolution methods of the same type, the high-resolution images generated by the method recover texture details better and achieve a better perceptual index.

Claims (2)

1. A texture-guided gradient weight loss and attention mechanism super-resolution method, which constructs a CNN-based image super-resolution network model comprising a generator for generating images and a discriminator for judging whether a generated image is real, wherein the generator extracts features from the input low-resolution image and performs nonlinear feature mapping, computes the correlation among feature maps through an attention mechanism and redistributes the weight of each feature map, and obtains the high-resolution image through an attention-based image reconstruction module, and wherein the gradient weight loss constrains the training process under texture guidance, the method specifically comprising the following steps:
(1) constructing a data set: acquire high-resolution images from a data set; the low-resolution images are obtained by down-sampling the high-resolution images, yielding paired training data, with the specific formula:
x=F(y)
where F(·) denotes the down-sampling operation, x the low-resolution image, and y the high-resolution image;
(2) constructing an image super-resolution model and training the generator: the model consists of two branch networks, a generator and a discriminator, and comprises a shallow feature extraction module, a nonlinear mapping module, an attention-based image reconstruction module, and a discriminator module; the shallow feature extraction module consists of convolutional layers, the nonlinear mapping module consists of residual dense networks, and the attention-based image reconstruction module consists of an attention module and an up-sampling module; features are extracted and mapped by the CNN, and the resulting features are finally weighted and reconstructed to obtain the high-resolution image;
constructing the objective equation: the generator is trained under a generative adversarial network loss L_G and a gradient weight loss L_gw, specifically:
L = L_G + L_gw
where the generative adversarial network loss is:
L_G = L_1 + λ·L_per + L_adv
in which L_1 = ‖y − f(x)‖, f(x) denotes the generated image, y denotes the high-resolution image, λ is a loss weighting parameter, and L_per is the perceptual loss:
L_per = ‖φ(y) − φ(f(x))‖
where φ(y) and φ(f(x)) denote the features extracted by a CNN from the real image and the generated image, respectively;
L_adv is the adversarial loss of the generator:
L_adv = −E_{X_r}[log(1 − D(X_r, X_f))] − E_{X_f}[log D(X_f, X_r)]
where D(X_r, X_f) is the discriminator, X_r and X_f denote the real image and the generated image respectively, the discriminator outputs the probability that its input is judged to be the real image, and E[·] denotes the expectation over the corresponding distribution;
the gradient weight loss is formulated as:
L_gw = D_gw ⊙ ‖y − f(x)‖
where f(x) denotes the generated image and y the real image, and the weight map is D_gw = (1 + α·D_x)(1 + α·D_y), with
D_x = |∇_x f(x) − ∇_x y|,  D_y = |∇_y f(x) − ∇_y y|
where D_x and D_y are the gradient difference maps between the generated image and the high-resolution image in the horizontal and vertical directions, ∇f(x) denotes the gradient image of the generated image, ∇y denotes the gradient image of the real image, and α is a weight coefficient in the loss function;
(3) training the discriminator: after the generator is initially trained, train the discriminator to fine-tune the generator; a relative discriminator is used instead of a standard discriminator to generate images that are more realistic and of better quality, the discriminator loss function being:
L_D = −E_{X_r}[log D(X_r, X_f)] − E_{X_f}[log(1 − D(X_f, X_r))]
where D(X_r, X_f) is the discriminator, which takes the real image and the generated image as input and outputs the probability that the input is judged to be the real image, and E[·] denotes the expectation over the corresponding distribution;
(4) high-resolution image output: input a low-resolution image and output a high-resolution image through the trained image super-resolution network model, with the formula:
Y = F_SR(x)
where x is the input low-resolution image, F_SR(·) is the trained image super-resolution model, and Y is the output high-resolution image.
2. The texture-guided gradient weight loss and attention mechanism super-resolution method of claim 1, wherein the specific steps of step (2) are:
(2a) obtaining a feature map m after the input image passes through a shallow feature extraction module, namely two convolution layers;
(2b) carrying out nonlinear mapping on the feature map m through a plurality of residual dense network modules to obtain the feature map m_1;
(2c) constructing the attention-based image reconstruction module; the global description of the feature map u over the whole spatial extent is obtained through global pooling, with the formula:
z_c = F_sq(u_c) = (1/(H×W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)
where H and W are the height and width of the feature map and u_c(i, j) is the value of feature map u at position (i, j); after obtaining the global feature description of u, the relationship between the channels is obtained using:
z = F_ex(z, W) = σ(g(z, W)) = σ(W_2 ReLU(W_1 z))
where W_1 ∈ R^{(C/r)×C} and W_2 ∈ R^{C×(C/r)}, r is the dimensionality-reduction coefficient, ReLU is an activation function, σ(·) is the Sigmoid activation function, and z is the final output scalar;
finally, the activation value (i.e. the weight) of each channel is multiplied by the original feature map u:
x̃_c = F_scale(u_c, s_c) = s_c · u_c
where F_scale(u_c, s_c) is the channel-wise product of the feature map u_c and the scalar s_c;
(2d) the feature map obtained in the step (2c) is processed by a convolution up-sampling module to obtain a result with higher resolution;
(2e) training the generator network; the loss function combines the generative adversarial network loss and the gradient weight loss, with the specific formula:
L = L_G + L_gw
The generator loss is:
L_G = L_1 + λ·L_per + L_adv
where L_1 = ‖y − f(x)‖, f(x) denotes the generated image, y denotes the high-resolution image, L_per is the perceptual loss, λ is a loss weighting parameter, and the perceptual loss is computed on the feature map output by layer relu3_3 of a VGG16 network trained on the ImageNet dataset, with the specific formula L_per = ‖φ(y) − φ(f(x))‖, where φ(y) and φ(f(x)) denote the features extracted by the CNN from the real image and the generated image;
L_adv is the adversarial loss of the generator network:
L_adv = −E_{X_r}[log(1 − D(X_r, X_f))] − E_{X_f}[log D(X_f, X_r)]
where D(X_r, X_f) is the discriminator, X_r and X_f are the real image and the generated image, the output is the probability of being judged the real image, and E[·] denotes the expectation over the corresponding distribution;
the gradient weight loss is formulated as:
L_gw = D_gw ⊙ ‖y − f(x)‖
where f(x) denotes the generated image and y the high-resolution image, and D_gw = (1 + α·D_x)(1 + α·D_y), with
D_x = |∇_x f(x) − ∇_x y|,  D_y = |∇_y f(x) − ∇_y y|
where D_x and D_y are the texture (gradient) difference maps between the generated image and the high-resolution image in the horizontal and vertical directions, ∇f(x) denotes the gradient image of the generated image, ∇y denotes the gradient image of the real image, and α is a weight coefficient in the loss function.
CN202210636553.0A 2022-06-07 2022-06-07 Gradient weight loss and attention mechanism super-resolution method based on texture guidance Pending CN114943646A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210636553.0A CN114943646A (en) 2022-06-07 2022-06-07 Gradient weight loss and attention mechanism super-resolution method based on texture guidance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210636553.0A CN114943646A (en) 2022-06-07 2022-06-07 Gradient weight loss and attention mechanism super-resolution method based on texture guidance

Publications (1)

Publication Number Publication Date
CN114943646A true CN114943646A (en) 2022-08-26

Family

ID=82909874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210636553.0A Pending CN114943646A (en) 2022-06-07 2022-06-07 Gradient weight loss and attention mechanism super-resolution method based on texture guidance

Country Status (1)

Country Link
CN (1) CN114943646A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115131214A (en) * 2022-08-31 2022-09-30 南京邮电大学 Indoor aged person image super-resolution reconstruction method and system based on self-attention
CN115131214B (en) * 2022-08-31 2022-11-29 南京邮电大学 Indoor old man image super-resolution reconstruction method and system based on self-attention
CN117115064A (en) * 2023-10-17 2023-11-24 南昌大学 Image synthesis method based on multi-mode control
CN117115064B (en) * 2023-10-17 2024-02-02 南昌大学 Image synthesis method based on multi-mode control

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination