CN113627487A - Super-resolution reconstruction method based on deep attention mechanism - Google Patents

Super-resolution reconstruction method based on deep attention mechanism

Info

Publication number
CN113627487A
CN113627487A
Authority
CN
China
Prior art keywords
feature map
resolution
deep
image
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110790131.4A
Other languages
Chinese (zh)
Other versions
CN113627487B (en)
Inventor
Liu Jing (刘晶)
Yang Hui (杨慧)
Xue Yuxin (薛雨馨)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology
Priority to CN202110790131.4A
Publication of CN113627487A
Application granted
Publication of CN113627487B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • G06T3/4053 Super resolution, i.e. output image resolution higher than sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • G06T3/4046 Scaling the whole image or part thereof using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a super-resolution reconstruction method based on a deep attention mechanism, which specifically comprises the following steps: step 1, acquiring a low-resolution LR image; step 2, inputting the low-resolution LR image into a deep attention mechanism network to obtain a shallow feature map; step 3, inputting the shallow feature map from step 2 into a deep feature extraction module to obtain a deep feature map, cascading the shallow feature map and the deep feature map to obtain a cascade feature map, assigning weights to the cascade feature map, and reducing the dimensionality of the feature map to obtain a dimension-reduced feature map; step 4, adding the dimension-reduced feature map obtained in step 3 to the LR image obtained in step 1 and learning a feature residual to obtain a global feature map; and step 5, inputting the global feature map obtained in step 4 into an up-sampling module, magnifying the low-resolution feature map to the output scale, and finally performing super-resolution reconstruction of the image in a reconstruction module.

Description

Super-resolution reconstruction method based on deep attention mechanism
Technical Field
The invention belongs to the technical field of image processing methods, and relates to a super-resolution reconstruction method based on a deep attention mechanism.
Background
The concept of an attention mechanism can be explained with everyday examples: when a person reads a book, they attend to the text rather than the blank areas of the page, and when a page contains a color illustration, their gaze is drawn to it. The human visual system tends to focus on the information in an image that assists judgment and to ignore irrelevant information. In computer vision tasks, the attention mechanism likewise ensures that a deep network learns the relatively important information while ignoring the irrelevant. A Google team first applied attention to an image classification task in 2014. In the same year, Bahdanau et al. applied their attention model to the machine translation task. Attention mechanisms are now widely applied in many fields: Xu et al. applied attention to the image captioning task in 2015 and proposed the concepts of hard and soft attention; Hu et al. applied an attention mechanism to a target detection task in 2017, improving the recognition performance of the model. Attention mechanisms can be divided into hard and soft attention mechanisms.
The hard attention mechanism (hard attention) is a 0/1 problem: it preserves the important features in the input information and discards those that are unimportant. For example, in some fine-grained object classification tasks, hard attention can locate the key information of the image input and take it as the next input. Hard attention effectively reduces the computation of the network model: by selecting only the feature information useful for the classification result, it markedly improves the computational efficiency and classification accuracy of the network.
The soft attention mechanism (soft attention) deals with a continuous distribution over [0,1]: each region of interest is weighted between 0 and 1 according to its degree of interest. Because the weighting is differentiable, its training can be attached directly to the neural network, and the more important channel regions receive more attention; every attended region contributes to the result, at the cost of more computation than a hard attention mechanism.
According to the difference of the focus of the soft attention mechanism, the soft attention mechanism is generally divided into a channel attention mechanism and a space attention mechanism.
Disclosure of Invention
The invention aims to provide a super-resolution reconstruction method based on a deep attention mechanism.
The invention adopts the technical scheme that a super-resolution reconstruction method based on a deep attention mechanism specifically comprises the following steps:
step 1, acquiring a low-resolution LR image;
step 2, inputting the low-resolution LR image in the step 1 into a deep attention mechanism network, and extracting the low-resolution LR image through a shallow feature extraction module to obtain a shallow feature map;
step 3, inputting the low-resolution LR image in the step 1 into a deep attention mechanism network, and performing deep feature extraction through a deep feature extraction module to obtain a deep feature map; cascading the shallow feature map and the deep feature map to obtain a cascading feature map, performing weight distribution on the cascading feature map, and reducing the dimension of the feature map to obtain a dimension-reduced feature map;
step 4, adding the dimension reduction feature map obtained in the step 3 and the LR image obtained in the step 1, and learning a feature residual error to obtain a global feature map;
and 5, inputting the global feature map obtained in the step 4 into an up-sampling module, amplifying the low-resolution feature map to an output scale, and finally performing super-resolution reconstruction on the image in a reconstruction module.
The invention is also characterized in that:
the specific operation of the step 1 is as follows:
step 1.1, respectively downloading Set5, Set14, BSD100, URBAN100 and MANGA109 data sets on the network;
and step 1.2, performing 4× downsampling preprocessing on each data set in step 1.1 to obtain the low-resolution LR images corresponding to each data set.
The step 2 comprises the following specific steps:
inputting the low-resolution LR image from step 1 into the deep attention mechanism network, where the shallow feature extraction module transforms the input low-resolution image into feature-map space through two convolutional layers with ReLU activation functions; the shallow feature extraction process is given by equations (1) and (2):

F_{-1} = H_{SFEB}(I_{LR})   (1);

F_0 = H_{SFEB}(F_{-1})   (2);

where I_{LR} denotes the low-resolution image, H_{SFEB}(\cdot) denotes the shallow feature extraction operation, F_{-1} denotes the result of the first shallow feature extraction, and F_0 the result of the second.
The specific process of the step 3 is as follows:
step 3.1, performing depth feature extraction on the feature map by utilizing a plurality of DFEB modules containing a channel attention mechanism;
step 3.2, cascading the shallow feature maps F_{-1}, F_0 extracted by the shallow feature extraction module and the deep feature maps F_1, \ldots, F_d, \ldots, F_D extracted by the deep feature extraction module to obtain a cascade feature map, denoted F_{CON}, as expressed in equation (3):

F_{CON} = [F_{-1}, F_0, F_1, \ldots, F_d, \ldots, F_D]   (3);

and step 3.3, using the DAB module to assign weights to the cascade feature map along the depth dimension of the network model, and reducing the dimensionality of the cascade feature map to obtain a dimension-reduced feature map.
The specific process of the step 4 is as follows: the global feature map is obtained using equation (4):

F_{GF} = F_{DF} + I_{LR}   (4);

where I_{LR} denotes the low-resolution image LR, F_{DF} denotes the weighted feature map after dimensionality reduction, and F_{GF} denotes the global feature map.
The specific process of the step 5 is as follows:
step 5.1, inputting the global feature map F_{GF} obtained in step 4 into the up-sampling module, and magnifying the low-resolution feature image to the output scale by deconvolution;
and 5.2, performing super-resolution reconstruction on the image obtained in the step 5.1 by adopting a convolution layer in a reconstruction module to finally obtain a reconstructed high-resolution picture.
Compared with traditional attention-based super-resolution reconstruction methods, the deep attention module (DAB) proposed by the method performs weight optimization across the shallow and deep parts of the network. The traditional attention mechanism optimizes only channel-dimension weights and performs no weight optimization between shallow and deep layers. The method converges faster during training and, at the same number of iterations, improves the PSNR and SSIM evaluation indexes compared with a method without the DAB module.
Drawings
FIG. 1 is a network structure diagram of a super-resolution method based on a deep attention mechanism according to the present invention;
FIG. 2 is a block diagram of a deep layer feature extraction module (DFEB) of a super-resolution method network structure diagram based on a deep layer attention mechanism according to the present invention;
FIG. 3 is a diagram of a CA module in a deep layer feature extraction module (DFEB) module in a network structure of a super-resolution method based on a deep layer attention mechanism according to the present invention;
FIG. 4 is a DAB module structure diagram of a super-resolution method network structure diagram based on a deep attention mechanism of the present invention;
FIGS. 5(a), (b) are graphs comparing the results obtained by using the super-resolution method based on the deep attention mechanism of the present invention;
fig. 6(a) and (b) are experimental results of validity verification of the DAB module by using the super-resolution method based on the deep attention mechanism of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention relates to a super-resolution method based on a deep attention mechanism, as shown in FIG. 1. I_{LR} denotes the input low-resolution picture; SFEB denotes the shallow feature extraction module, and F_{-1}, F_0 are the shallow feature maps it produces. DFEB denotes the deep feature extraction module, and F_1 to F_D are the deep feature maps it produces. Concat denotes the cascade operation: the shallow feature maps F_{-1}, F_0 and the deep feature maps F_1, \ldots, F_d, \ldots, F_D are cascaded to obtain the cascade feature map F_{CON}. The DAB (Deep Attention Block) module then assigns new weights to these feature maps, establishing an attention mechanism in the depth dimension and yielding the weighted feature map F_{DAB}. Two convolution operations follow: the first performs a dimensionality-reduction operation on the weighted feature map from the DAB module, reducing it to 64 dimensions, and the second reduces the feature map to 3 channels, giving the feature map F_{DF}. The learned feature residual is added to LR to obtain the global feature map F_{GF}, which is then input to the up-sampling module; the up-sampling module magnifies the low-resolution feature image to the output scale by deconvolution. Finally, one convolutional layer in the reconstruction module performs super-resolution reconstruction of the image to obtain the high-resolution picture I_{HR}.
The method is implemented according to the following steps:
step 1, acquiring a low-resolution LR image;
step 1.1, Set5, Set14, BSD100, URBAN100, MANGA109 data sets are downloaded over the network.
And step 1.2, performing 4× downsampling preprocessing on each data set in step 1.1 to obtain the low-resolution LR pictures corresponding to each data set.
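By way of illustration, a small sketch of the 4× downsampling preprocessing is given below. The patent states the factor but not the resampling kernel, so bicubic interpolation is an assumption (it is the common choice for these benchmarks); the directory layout and function name are hypothetical.

```python
from pathlib import Path
from PIL import Image

def make_lr_images(hr_dir, lr_dir, scale=4):
    """Downsample every HR image in hr_dir by `scale` and save the LR copy."""
    Path(lr_dir).mkdir(parents=True, exist_ok=True)
    for hr_path in Path(hr_dir).glob("*.png"):
        hr = Image.open(hr_path).convert("RGB")
        w, h = hr.size
        lr = hr.resize((w // scale, h // scale), Image.BICUBIC)  # 4x downsampling
        lr.save(Path(lr_dir) / hr_path.name)

# usage with a hypothetical layout: make_lr_images("Set5/HR", "Set5/LR_x4")
```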
Step 2, inputting the low-resolution LR image in the step 1 into a deep attention mechanism network, and extracting the low-resolution LR image through a shallow feature extraction module to obtain a shallow feature map; the method specifically comprises the following steps:
and (3) inputting the low-resolution LR image in the step (1) into a deep attention mechanism network, wherein the deep attention mechanism network can be divided into a shallow feature extraction module, a deep feature extraction module, an up-sampling module and a reconstruction module. The low-resolution LR image is first processed by a Shallow Feature Extraction Block (SFEB), and the low-resolution image of the input network is transformed into a Feature map space by two convolutional layers with ReLU activation functions, where the number of channels of the transformed Feature map is 64. The shallow feature extraction module is mainly responsible for extracting low-frequency information of the image and transmitting the low-frequency information on the network. Inputting the low-resolution LR image into a shallow feature extraction module (SFEB) to obtain a shallow feature map F-1,F0The formula is expressed by the following mathematical formula (1) and (2):
F-1=HSFEB(ILR) (1);
F0=HSFEB(F-1) (2);
in the formula ILRRepresenting a low resolution image, HSFEB() Representing a latent layer feature extraction operation, F-1,F0Representing the resulting shallow feature map after shallow feature extraction.
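A minimal PyTorch sketch of the SFEB of equations (1) and (2) follows, assuming a 3-channel RGB input and 3×3 kernels (the text fixes the 64-channel feature space but not the kernel size); layer names are illustrative, not from the patent.

```python
import torch
import torch.nn as nn

class SFEB(nn.Module):
    """Shallow feature extraction: two conv+ReLU layers (eqs. (1)-(2))."""
    def __init__(self, in_channels=3, channels=64):
        super().__init__()
        self.conv1 = nn.Sequential(  # H_SFEB applied to I_LR, producing F_-1
            nn.Conv2d(in_channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(  # H_SFEB applied to F_-1, producing F_0
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True))

    def forward(self, i_lr):
        f_m1 = self.conv1(i_lr)   # F_-1
        f_0 = self.conv2(f_m1)    # F_0
        return f_m1, f_0

# usage: f_m1, f_0 = SFEB()(torch.randn(1, 3, 48, 48))
```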
Step 3, inputting the shallow feature map in the step 2 into a deep feature extraction module, performing depth feature extraction on the shallow feature map to obtain a deep feature map, cascading the shallow feature map and the deep feature map to obtain a cascading feature map, performing weight distribution on the cascading feature map, and reducing the dimension of the feature map to obtain a dimension-reduced feature map;
and 3.1, performing deep feature extraction on the feature map by using a plurality of DFEB modules containing a channel attention mechanism, wherein a sub-module DFEB structure of the deep extraction module is shown in figure 2. Upper layer characteristic diagram Fd-1Inputting the DFEB module, wherein the CA module is a channel attention mechanism, and endowing different weights to the characteristic diagram of the previous layer through the channel attention mechanism; training residual components of the CA module through 4 layers of convolution and a ReLU activation function; finally, the output of the CA module, namely the characteristic graphs with different weights, and the residual error of the CA moduleThe components are added to obtain the output Fd of the DFEB block. Wherein d represents the d-th DFEB module. The mathematical expression of a DFEB module is as in equation (3):
Fd=HDFEB,d(Fd-1)
=HDFEB,d(HDFEB,d-1(…(HDFEB,1(F0))…)) (3);
in the formula, Fd-1Is the profile output of the d-1 level DFEB module, HDFEBRepresenting a deep feature extraction operation, d representing the d-th DFEB module, F0Feature map output, F, for a layer 0 DFEB ModuledAnd (3) representing the characteristic diagram output of the d-th layer DFEB module.
Step 3.1.1, as shown in FIG. 3, the structure diagram of the CA module is a schematic diagram of the channel attention module. N denotes a feature map of size h' × w' × c'; after the corresponding convolution operation, a feature map U of size h × w × c is obtained. F_{sq}, the Squeeze operation, performs global average pooling on the feature map U to obtain a feature map of size 1 × 1 × c. F_{ex}, the Excitation operation, uses a fully connected network to perform a nonlinear transformation on the result of the Squeeze. F_{scale} takes the result of the Excitation as weights and multiplies them onto the input features of the corresponding channels, i.e. channel weighting; finally, the feature map weighted per channel is obtained.

The specific operation of global average pooling is as follows: average pooling over the global range is performed on the feature map of each channel, converting a two-dimensional feature map into a real number, denoted z_c, as expressed in equation (4):

z_c = F_{sq}(u_c) = \frac{1}{h \times w} \sum_{i=1}^{h} \sum_{j=1}^{w} u_c(i,j)   (4);

where h and w denote the height and width of the input feature map, u_c denotes the feature map of channel c, F_{sq} denotes the global average pooling operation on the feature map U, and u_c(i,j) denotes the pixel in row i, column j.
The specific operation of the Excitation is as follows: two fully connected layers generate a different excitation for each real number. W_1 denotes the first fully connected operation, which reduces the number of channels from C_2 dimensions to C_2/r dimensions; δ denotes the ReLU activation layer, where r is a hyperparameter set to 16. W_2 denotes the second fully connected operation, which restores the number of channels to C_2 dimensions. Finally, a Sigmoid function mapping yields S, as expressed in equation (5):

S = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_2\,\delta(W_1 z))   (5);

where W_1, W_2 denote the first and second fully connected operations respectively, δ denotes the ReLU activation layer, σ denotes the Sigmoid function mapping, z denotes the pooled real numbers, and S denotes the learned weights.
The specific operation of F_{scale} is as follows: the weights S output in (5) are applied per channel, achieving the purpose of the channel attention mechanism, as expressed in equation (6):

x_c = F_{scale}(u_c, s_c) = s_c \cdot u_c   (6);

where u_c denotes the feature map of channel c (of a feature map with C_2 channels), and s_c denotes the learned weight for channel c, a component of the weight vector S.
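A minimal PyTorch sketch of the CA module of equations (4)–(6) and the DFEB block of step 3.1 follows, assuming 64 channels and 3×3 kernels. The exact wiring of FIG. 2 — whether the 4-layer residual branch takes the CA output or the block input — is not fixed by the text, so the wiring below is an assumption, as are all layer names.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CA module: Squeeze (eq. 4), Excitation (eq. 5), Scale (eq. 6)."""
    def __init__(self, channels=64, r=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)   # F_sq: global average pooling
        self.excite = nn.Sequential(             # F_ex: sigma(W_2 delta(W_1 z))
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid())

    def forward(self, u):
        b, c, _, _ = u.shape
        s = self.excite(self.squeeze(u).view(b, c))  # weights S
        return u * s.view(b, c, 1, 1)                # F_scale: channel weighting

class DFEB(nn.Module):
    """Deep feature extraction block: CA output plus a 4-conv residual branch."""
    def __init__(self, channels=64):
        super().__init__()
        self.ca = ChannelAttention(channels)
        layers = []
        for _ in range(4):                           # 4 conv layers with ReLU
            layers += [nn.Conv2d(channels, channels, 3, padding=1),
                       nn.ReLU(inplace=True)]
        self.residual = nn.Sequential(*layers)

    def forward(self, f_prev):
        weighted = self.ca(f_prev)                   # differently weighted maps
        return weighted + self.residual(weighted)    # F_d (assumed wiring)

# usage: f_d = DFEB()(torch.randn(1, 64, 48, 48))
```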
Step 3.2, after shallow and deep feature extraction, several groups of feature maps are obtained: the shallow feature maps F_{-1}, F_0 and the deep feature maps F_1, \ldots, F_d, \ldots, F_D. These feature maps are cascaded together to obtain the cascade feature map, denoted F_{CON}, as in equation (7):

F_{CON} = [F_{-1}, F_0, F_1, \ldots, F_d, \ldots, F_D]   (7);
step 3.3, under the initiation of an Attention mechanism, utilizing the depth dimension of the proposed DAB (deep Attention Block) module in a network model to carry out cascade connection on a plurality of groupsThe deep profile is given a weight assignment. The cascade feature maps are the image feature information extracted from different depth convolution layers, and when they have the same weight in the depth direction, we establish the Attention mechanism in the depth dimension by giving new weight to the feature maps through a DAB (deep Attention Block) module, wherein the DAB module is schematically shown in FIG. 4. Through FsqI.e. the Squeeze operation vs. channel number C1 cascade feature diagram FCONCarrying out average pooling to obtain a characteristic diagram Z consisting of c1 real numbers; then obtaining a weight S through two full-connection layers of W1 and W2; through FscaleMultiplying the weight S to be learned by the cascade characteristic diagram to obtain the final output F of the depth dimension DAB moduleDAB
Pixel-by-pixel loss is combined with perceptual loss as the objective function for optimizing the network. The pixel-by-pixel loss function is the mean squared error (MSE) loss function, the mean of the squared differences between predicted and true values; it is minimized when the prediction equals the truth, and its value grows steeply as the error increases. With the success of image style-transfer tasks, it was found that the feature maps produced inside a convolutional neural network can be used as part of the target loss. The perceptual loss function optimizes the difference between the feature map of the generated image and the feature map of the ground-truth image passed through the same network, which makes the generated image semantically closer to the ground truth. The significance of the perceptual loss is this: if a predicted image and a ground-truth image look alike but are offset from each other by one row of pixels, the pixel-by-pixel loss reports a large error that does not reflect the true reconstruction quality and slows network convergence; perceptual loss was proposed as part of the objective function against this background, helping reconstruction obtain better predicted images. The loss function of the method is equation (8):

L = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \frac{1}{C_j H_j W_j}\left\lVert V_{DAB}(\hat{Y}) - V_{DAB}(Y)\right\rVert_2^2   (8);

where y_i denotes the predicted value, \hat{y}_i the true value, and n the dimensionality of the data; V_{DAB}(\hat{Y}) denotes the predicted feature map and V_{DAB}(Y) the ground-truth feature map output by the DAB module; and C_j, H_j, W_j are the shape of the output feature values. The MSE term has a smooth, everywhere-differentiable curve, convenient for forward and backward propagation of the network. The first term is the pixel-by-pixel loss, the second the perceptual loss between the predicted and ground-truth feature maps output by the DAB module; optimizing the sum of the two continuously updates the network structure until training is complete.
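A hedged PyTorch sketch of this combined objective follows. The text leaves open exactly how V_{DAB} is evaluated on the two images, so `dab_features` below stands in for whatever callable produces the DAB-output feature map; its name and the batched-MSE normalization are assumptions.

```python
import torch.nn.functional as F

def combined_loss(sr, hr, dab_features):
    """Pixel-wise MSE plus a perceptual term on DAB feature maps (sketch of eq. (8)).

    sr, hr:       predicted and ground-truth image batches, shape (B, 3, H, W)
    dab_features: callable mapping an image batch to its DAB-output feature map
    """
    pixel_loss = F.mse_loss(sr, hr)        # first term: pixel-by-pixel MSE
    f_sr = dab_features(sr)                # V_DAB(Y_hat), predicted features
    f_hr = dab_features(hr)                # V_DAB(Y), ground-truth features
    perceptual = F.mse_loss(f_sr, f_hr)    # second term, normalized over C_j*H_j*W_j
    return pixel_loss + perceptual
```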
Step 3.3.1, the cascade feature map F_{CON} with channel number C_1 is average-pooled; the result Z is a feature map consisting of C_1 real numbers, as expressed in equation (9):

z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i,j)   (9);

where H and W denote the height and width of the input feature map, u_c denotes channel c of the feature map with C_1 channels, F_{sq} denotes the global average pooling operation on the feature map U, and u_c(i,j) denotes the pixel in row i, column j.
Step 3.3.2, new weights for the C_1 channels are generated by two fully connected layers: W_1 denotes the first fully connected layer and δ the ReLU activation function; the C_1-dimensional feature map is nonlinearly mapped to C_2 dimensions. W_2 denotes the second fully connected layer, which maps the C_2-dimensional feature map back to C_1 dimensions; σ denotes the sigmoid activation function. The result is denoted S, as in equation (10):

S = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_2\,\delta(W_1 z))   (10);

where W_1, W_2 denote the first and second fully connected operations respectively, δ denotes the ReLU activation layer, σ denotes the Sigmoid function mapping, and z denotes the real numbers from step 3.3.1. F_{ex}, the Excitation operation, uses a fully connected network to perform a nonlinear transformation on the result of the Squeeze.
Step 3.3.3, multiplying the learned weights onto the cascade feature map gives the final output F_{DAB} of the depth-dimension DAB module, as in equation (11):

F_{DAB} = F_{scale}(u_c, s) = u_c \cdot s   (11);

where u_c denotes the feature map with C_1 channels and s denotes the weights learned in step 3.3.2.
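A minimal PyTorch sketch of steps 3.3.1–3.3.3 follows, treating the DAB as a squeeze-and-excitation applied over the C_1 channels of the cascade feature map, as the text describes; the reduced dimension C_2 and the example channel counts are assumptions (the patent does not fix their values here).

```python
import torch
import torch.nn as nn

class DeepAttentionBlock(nn.Module):
    """DAB: depth-dimension attention over the cascade feature map F_CON."""
    def __init__(self, c1, c2):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)  # eq. (9): C1 real numbers Z
        self.w1 = nn.Linear(c1, c2)             # first fully connected layer W_1
        self.w2 = nn.Linear(c2, c1)             # second fully connected layer W_2

    def forward(self, f_con):
        b, c, _, _ = f_con.shape
        z = self.squeeze(f_con).view(b, c)
        s = torch.sigmoid(self.w2(torch.relu(self.w1(z))))  # eq. (10): weights S
        return f_con * s.view(b, c, 1, 1)                   # eq. (11): F_DAB

# usage sketch, assuming D = 8 DFEB maps plus F_-1 and F_0, each 64 channels:
# f_con = torch.cat([f_m1, f_0, *deep_maps], dim=1)         # eq. (7), C1 = 64 * 10
# f_dab = DeepAttentionBlock(c1=640, c2=64)(f_con)
```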
Step 4, adding the dimension reduction feature map obtained in the step 3 and the LR image obtained in the step 1, and learning a feature residual error to obtain a global feature map;
and 4.1, after the network model is processed by the DAB module, giving different importance degrees to the information extracted by the networks with different depths. And (4) performing dimensionality reduction operation on the weighted feature map obtained by the DAB module to reduce the feature map to 64 dimensions.
Step 4.2, performing a layer of convolution operation on the 64-dimensional feature map obtained in the step 4.1, reducing the dimension of the feature map to three channels, adding the learned feature residual error to LR to obtain a global feature map FGFThe formula can be expressed as formula (12):
FGF=FDF+ILR
(12);
in the formula ILRRepresenting a low resolution picture, FDFFeature map representing 3 channels, FGFRepresenting a global feature map.
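A small sketch of step 4 in PyTorch under the same assumptions as above; the kernel sizes (1×1 for the first reduction, 3×3 for the second) and the example 640-channel cascade width are assumptions, not stated in the patent.

```python
import torch
import torch.nn as nn

# dimensionality reduction after the DAB (steps 4.1 and 4.2)
reduce_to_64 = nn.Conv2d(640, 64, kernel_size=1)           # F_DAB -> 64 channels
reduce_to_3 = nn.Conv2d(64, 3, kernel_size=3, padding=1)   # 64 -> 3 channels, F_DF

f_dab = torch.randn(1, 640, 48, 48)   # weighted cascade map (illustrative shape)
i_lr = torch.randn(1, 3, 48, 48)      # low-resolution input

f_df = reduce_to_3(reduce_to_64(f_dab))
f_gf = f_df + i_lr                    # eq. (12): global residual learning
```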
And 5, inputting the global feature map obtained in the step 4 into an up-sampling module, amplifying the low-resolution feature map to an output scale, and finally performing super-resolution reconstruction on the image in a reconstruction module.
Step 5.1, the global feature map F_{GF} obtained in step 4.2 is input into the up-sampling module, which magnifies the low-resolution feature image to the output scale by deconvolution.
And step 5.2, a convolutional layer in the reconstruction module performs super-resolution reconstruction on the image obtained in step 5.1; a comparison of reconstruction results is shown in FIG. 5, where FIG. 5(a) is the low-resolution image and FIG. 5(b) is the high-resolution image.
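A sketch of step 5 in PyTorch, assuming a single 4× transposed convolution for the deconvolution (kernel 8, stride 4, padding 2 — an assumption consistent with the 4× factor of step 1.2) followed by the one reconstruction convolution:

```python
import torch
import torch.nn as nn

upsample = nn.ConvTranspose2d(3, 64, kernel_size=8, stride=4, padding=2)  # deconvolution, x4
reconstruct = nn.Conv2d(64, 3, kernel_size=3, padding=1)                  # final conv layer

f_gf = torch.randn(1, 3, 48, 48)      # global feature map from step 4
i_hr = reconstruct(upsample(f_gf))    # reconstructed high-resolution image
print(i_hr.shape)                     # torch.Size([1, 3, 192, 192])
```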
Verifying the validity of the DAB module:
The method proposes a deep attention module (DAB); the traditional attention mechanism optimizes only channel-dimension weights and performs no weight optimization across the shallow and deep parts of the network. The DAB effectiveness experiment compares the method with an algorithm of the same network structure but without the DAB module; the results are shown in FIG. 6, where FIG. 6(a) is the PSNR curve and FIG. 6(b) is the SSIM curve. The experimental results show that the algorithm containing the DAB module converges faster during training and, at the same number of iterations, improves the PSNR and SSIM evaluation indexes compared with the algorithm without the DAB module.

Claims (6)

1. A super-resolution reconstruction method based on a deep attention mechanism, characterized in that the method specifically comprises the following steps:
step 1, acquiring a low-resolution LR image;
step 2, inputting the low-resolution LR image in the step 1 into a deep attention mechanism network, and extracting the low-resolution LR image through a shallow feature extraction module to obtain a shallow feature map;
step 3, inputting the low-resolution LR image in the step 1 into a deep attention mechanism network, and performing deep feature extraction through a deep feature extraction module to obtain a deep feature map; cascading the shallow feature map and the deep feature map to obtain a cascading feature map, performing weight distribution on the cascading feature map, and reducing the dimension of the feature map to obtain a dimension-reduced feature map;
step 4, adding the dimension reduction feature map obtained in the step 3 and the LR image obtained in the step 1, and learning a feature residual error to obtain a global feature map;
and 5, inputting the global feature map obtained in the step 4 into an up-sampling module, amplifying the low-resolution feature map to an output scale, and finally performing super-resolution reconstruction on the image in a reconstruction module.
2. The super-resolution reconstruction method based on the deep attention mechanism as claimed in claim 1, wherein: the specific operation of the step 1 is as follows:
step 1.1, respectively downloading Set5, Set14, BSD100, URBAN100 and MANGA109 data sets on the network;
and step 1.2, performing 4× downsampling preprocessing on each data set in step 1.1 to obtain the low-resolution LR images corresponding to each data set.
3. The super-resolution reconstruction method based on the deep attention mechanism as claimed in claim 1, wherein: the step 2 comprises the following specific steps:
inputting the low-resolution LR image from step 1 into the deep attention mechanism network, where the shallow feature extraction module transforms the input low-resolution image into feature-map space through two convolutional layers with ReLU activation functions; the shallow feature extraction process is given by equations (1) and (2):

F_{-1} = H_{SFEB}(I_{LR})   (1);

F_0 = H_{SFEB}(F_{-1})   (2);

where I_{LR} denotes the low-resolution image, H_{SFEB}(\cdot) denotes the shallow feature extraction operation, F_{-1} denotes the result of the first shallow feature extraction, and F_0 the result of the second.
4. The super-resolution reconstruction method based on the deep attention mechanism as claimed in claim 1, wherein: the specific process of the step 3 is as follows:
step 3.1, performing depth feature extraction on the feature map by utilizing a plurality of DFEB modules containing a channel attention mechanism;
step 3.2, cascading the shallow feature maps F_{-1}, F_0 extracted by the shallow feature extraction module and the deep feature maps F_1, \ldots, F_d, \ldots, F_D extracted by the deep feature extraction module to obtain a cascade feature map, denoted F_{CON}, as expressed in equation (3):

F_{CON} = [F_{-1}, F_0, F_1, \ldots, F_d, \ldots, F_D]   (3);
and step 3.3, based on the attention mechanism, using the DAB module to assign weights to the cascade feature map along the depth dimension of the network model, and reducing the dimensionality of the cascade feature map to obtain a dimension-reduced feature map.
5. The deep attention mechanism-based super-resolution reconstruction method according to claim 4, wherein: the specific process of the step 4 is as follows: obtaining a global feature map by using the following formula (4):
F_{GF} = F_{DF} + I_{LR}   (4);

where I_{LR} denotes the low-resolution image LR, F_{DF} denotes the weighted feature map after dimensionality reduction, and F_{GF} denotes the global feature map.
6. The deep attention mechanism-based super-resolution reconstruction method according to claim 5, wherein: the specific process of the step 5 is as follows:
step 5.1, inputting the global feature map F_{GF} obtained in step 4 into the up-sampling module, and magnifying the low-resolution feature image to the output scale by deconvolution;
and 5.2, performing super-resolution reconstruction on the image obtained in the step 5.1 by adopting a convolution layer in a reconstruction module to finally obtain a reconstructed high-resolution picture.
CN202110790131.4A 2021-07-13 2021-07-13 Super-resolution reconstruction method based on deep attention mechanism Active CN113627487B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110790131.4A CN113627487B (en) 2021-07-13 2021-07-13 Super-resolution reconstruction method based on deep attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110790131.4A CN113627487B (en) 2021-07-13 2021-07-13 Super-resolution reconstruction method based on deep attention mechanism

Publications (2)

Publication Number Publication Date
CN113627487A true CN113627487A (en) 2021-11-09
CN113627487B CN113627487B (en) 2023-09-05

Family

ID=78379709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110790131.4A Active CN113627487B (en) 2021-07-13 2021-07-13 Super-resolution reconstruction method based on deep attention mechanism

Country Status (1)

Country Link
CN (1) CN113627487B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200065646A1 (en) * 2018-08-23 2020-02-27 Samsung Electronics Co., Ltd. Method and device with convolution neural network processing
CN110717856A (en) * 2019-09-03 2020-01-21 天津大学 Super-resolution reconstruction algorithm for medical imaging
AU2020100200A4 (en) * 2020-02-08 2020-06-11 Huang, Shuying DR Content-guide Residual Network for Image Super-Resolution
CN112329912A (en) * 2020-10-21 2021-02-05 广州工程技术职业学院 Convolutional neural network training method, image reconstruction method, device and medium
CN112330542A (en) * 2020-11-18 2021-02-05 重庆邮电大学 Image reconstruction system and method based on CRCSAN network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Deng Mengdi; Jia Ruisheng; Tian Yu; Liu Qingming: "Super-resolution reconstruction of seismic profile images based on deep learning", Computer Engineering and Design, no. 08 *
Lei Pengcheng; Liu Cong; Tang Jiangang; Peng Dunlu: "Image super-resolution reconstruction via hierarchical feature fusion attention network", Journal of Image and Graphics, no. 09 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116402682A (en) * 2023-03-29 2023-07-07 Liaoning University of Technology Image reconstruction method and system based on differential value dense residual super-resolution
CN116402682B (en) * 2023-03-29 2024-02-09 Liaoning University of Technology Image reconstruction method and system based on differential value dense residual super-resolution
CN117173025A (en) * 2023-11-01 2023-12-05 Huaqiao University Single-frame image super-resolution method and system based on cross-layer mixed attention Transformer
CN117173025B (en) * 2023-11-01 2024-03-01 Huaqiao University Single-frame image super-resolution method and system based on cross-layer mixed attention Transformer

Also Published As

Publication number Publication date
CN113627487B (en) 2023-09-05

Similar Documents

Publication Publication Date Title
CN113240580B (en) Lightweight image super-resolution reconstruction method based on multi-dimensional knowledge distillation
CN112651973B (en) Semantic segmentation method based on cascade of feature pyramid attention and mixed attention
CN110020989B (en) Depth image super-resolution reconstruction method based on deep learning
CN111275618B (en) Depth map super-resolution reconstruction network construction method based on double-branch perception
CN110119780B Hyper-spectral image super-resolution reconstruction method based on generative adversarial network
CN111242288B (en) Multi-scale parallel deep neural network model construction method for lesion image segmentation
Wen et al. Image recovery via transform learning and low-rank modeling: The power of complementary regularizers
CN113627487B (en) Super-resolution reconstruction method based on deep attention mechanism
CN114972746B (en) Medical image segmentation method based on multi-resolution overlapping attention mechanism
CN112991350B (en) RGB-T image semantic segmentation method based on modal difference reduction
CN113554032B (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN113240683B (en) Attention mechanism-based lightweight semantic segmentation model construction method
CN114862731B (en) Multi-hyperspectral image fusion method guided by low-rank priori and spatial spectrum information
CN111986085A (en) Image super-resolution method based on depth feedback attention network system
Gendy et al. Lightweight image super-resolution based on deep learning: State-of-the-art and future directions
CN115641285A (en) Binocular vision stereo matching method based on dense multi-scale information fusion
CN116563682A (en) Attention scheme and strip convolution semantic line detection method based on depth Hough network
Zhu et al. Stereoscopic image super-resolution with interactive memory learning
CN116883679B (en) Ground object target extraction method and device based on deep learning
CN111275076B (en) Image significance detection method based on feature selection and feature fusion
CN115797181A (en) Image super-resolution reconstruction method for mine fuzzy environment
Zhuge et al. Single image denoising with a feature-enhanced network
Jin et al. Fusion of remote sensing images based on pyramid decomposition with Baldwinian Clonal Selection Optimization
CN115512393A (en) Human body posture estimation method based on improved HigherHRNet
CN115660979A (en) Attention mechanism-based double-discriminator image restoration method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant