CN114926343A - Image super-resolution method based on pyramid fusion attention network


Info

Publication number: CN114926343A
Authority: CN (China)
Prior art keywords: module, resolution, network, pyramid, output
Legal status: Pending
Application number: CN202210639755.0A
Other languages: Chinese (zh)
Inventors: 唐杰 (Tang Jie), 何昊 (He Hao), 郑秀 (Zheng Xiu), 武港山 (Wu Gangshan)
Current Assignee: Nanjing University
Original Assignee: Nanjing University
Priority date: 2022-06-08
Filing date: 2022-06-08
Publication date: 2022-08-19
Application filed by Nanjing University
Priority to CN202210639755.0A
Publication of CN114926343A


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00: Geometric image transformation in the plane of the image
    • G06T 3/40: Scaling the whole image or part thereof
    • G06T 3/4053: Super resolution, i.e. output image resolution higher than sensor resolution
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/048: Activation functions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00: Geometric image transformation in the plane of the image
    • G06T 3/40: Scaling the whole image or part thereof
    • G06T 3/4046: Scaling the whole image or part thereof using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

An image super-resolution method based on a pyramid fusion attention network introduces a pyramid fusion attention mechanism to recover a high-resolution image from a given low-resolution image. A pyramid fusion structure is adopted first: each pyramid layer is built by stacking residual blocks, and a down-sampling operation together with a multi-scale fusion strategy is used to ensure a complete receptive field and capture more contextual detail. In addition, the invention provides a progressive backward fusion strategy to fully utilize the hierarchical features generated by the intermediate pyramid fusion attention modules. By modeling the relationships between pixels, the disclosed pyramid fusion attention module better enhances the discriminative ability of the network, so that high-frequency information can be recovered more faithfully.

Description

Image super-resolution method based on pyramid fusion attention network
Technical Field
The invention belongs to the technical fields of deep learning, computer vision and computer image processing, relates to single-image super-resolution, and discloses an image super-resolution method based on a pyramid fusion attention network.
Background
Single Image Super-Resolution (SISR), one of the long-standing fundamental tasks in computer vision, has recently received much attention. The task is to restore a low-resolution (LR) image to a high-resolution (HR) image by technical means, so as to obtain clearer image details. In real scenes, higher-resolution images often cannot be obtained directly because of constraints such as imaging-device capability, network bandwidth and storage cost; reconstructing a high-resolution image from a low-resolution image is therefore particularly important in these situations. Image super-resolution has very broad application prospects in fields such as military, medicine, security, satellite imaging and HDTV, and thus has significant research value. However, the SISR task is highly ill-posed, because the mapping from LR images to HR images is one-to-many in the solution space.
In recent years, deep learning techniques have achieved results superior to conventional techniques in many fields. Since AlexNet's breakthrough in the ImageNet competition in 2012, deep-learning-based methods have found wide application across computer vision. The image super-resolution task, a long-standing low-level vision task, has been deeply influenced by this trend. Dong et al. first introduced a Convolutional Neural Network (CNN) into the field in 2014 with SRCNN. The model first upsamples the LR image to the target scale and then maps it to the HR image domain using a network built from three convolutional layers. Compared with traditional algorithms, SRCNN achieved markedly better results, and CNN-based methods have seen sustained development in this area ever since.
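For illustration, the following PyTorch sketch (not part of the patent text) shows an SRCNN-style pipeline of this kind: bicubic upsampling followed by a three-layer convolutional mapping. The 9-1-5 kernel sizes and 64/32 channel widths follow the configuration reported by Dong et al., and the single-channel (luminance) input is the setting used in that paper; everything else here is an illustrative assumption.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SRCNN(nn.Module):
        """SRCNN-style sketch: bicubic upsampling, then a 3-layer conv mapping."""
        def __init__(self, scale=2):
            super().__init__()
            self.scale = scale
            self.net = nn.Sequential(
                nn.Conv2d(1, 64, 9, padding=4), nn.ReLU(inplace=True),  # patch extraction
                nn.Conv2d(64, 32, 1), nn.ReLU(inplace=True),            # non-linear mapping
                nn.Conv2d(32, 1, 5, padding=2))                         # reconstruction
        def forward(self, lr):
            # SRCNN first upsamples the LR image to the target scale.
            x = F.interpolate(lr, scale_factor=self.scale,
                              mode='bicubic', align_corners=False)
            return self.net(x)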
To further improve network performance, most CNN-based models adopt an attention mechanism. Attention mechanisms attempt to mimic the human visual system's ability to capture information, making the network focus more on salient region features while reducing attention to irrelevant ones. Traditional attention mechanisms are mainly based on two designs: 1. compute correlations along the channel domain, producing one weight per channel, so the features within a channel are treated equally; 2. compute correlations along the spatial domain, producing one weight per pixel position, so the same coordinate position is weighted identically across channels. On this basis, a number of attention mechanisms have been proposed.
Disclosure of Invention
The invention aims to solve the following problems. Most existing attention mechanisms only capture intrinsic feature correlations along the channel domain or the spatial domain and treat the features within the corresponding dimension equally, which limits the capability of the attention mechanism. Moreover, a complete image super-resolution architecture contains many modules whose outputs are intermediate features; most existing methods cannot fully exploit these intermediate features, even though they are important for reconstructing spatial context details, so performance remains relatively weak.
The technical scheme of the invention is as follows: a single-image super-resolution method based on a pyramid fusion attention network constructs an image super-resolution network PFAN to recover a high-resolution image from a given low-resolution image. The network consists of a shallow feature extraction module, stacked feature extraction basic groups BG, an up-sampling module and a reconstruction module arranged in sequence, and the method comprises the following steps:
1) The shallow feature extraction module uses one convolutional layer to extract the shallow feature F_0 from the low-resolution image I_LR:

F_0 = H_0(I_LR)

where H_0(·) denotes a convolution function;
2) Deep feature extraction is carried out by the stacked feature extraction basic groups BG:

F_D = H_{BG,D}(F_{D-1}) = H_{BG,D}(H_{BG,D-1}(··· H_{BG,1}(F_0) ···))

where H_{BG,D}(·) denotes the D-th BG module and F_{D-1} denotes the output of the (D-1)-th BG module;
3) Sub-pixel convolution is used as the up-sampling module to increase resolution while fusing the deep feature F_D and the shallow feature F_0:

F_↑ = H_↑(F_D + F_0)

where H_↑(·) denotes the up-sampling module and F_↑ denotes the up-sampled feature;
4) One convolutional layer performs feature reconstruction on the up-sampled result F_↑:

I_SR = H_R(F_↑)

The final output I_SR of the pyramid fusion attention network is the reconstructed high-resolution image;
the basic group BG of feature extraction is composed of n pyramid fusion attention modules PFAB and a progressive backward fusion module PBFM, the pyramid fusion attention module PFAB is composed of a basic residual block Plain RB and a pyramid fusion attention network PFA, the PFA adopts a pyramid structure, the output of the basic residual block Plain RB is firstly convoluted and pooled after the PFA is input, then the relation between pixels is modeled by using stacked standard residual blocks RB, the pyramid middle layer receives the output from corresponding upper and lower layers RB as input, the output of each layer of the pyramid is up-sampled to the same size and then is cascaded, and is sent to a convolution layer and a sigmoid layer, so that an attention mask is obtained, and the output of the current PFAB is obtained after the attention mask and a characteristic diagram of the current PFAB module are subjected to pixel level product operation; n PFABs are connected in series in sequence, a progressive backward fusion module PBFM is used for cascade operation on a channel domain for the output of two adjacent PFABs, a channel attention CCA module based on contrast is sent to the feature after each cascade operation to further enhance key information, a 1 multiplied by 1 convolutional layer is used for fusing the feature of the channel domain of the feature enhanced by the CCA module, wherein the PFAB n Output result of (B) n And PFAB n-1 Output result of (B) n-1 Directly cascaded, then convolved with 1 × 1 through a CCA module to obtain the corresponding PFAB n-1 Fused output of B' n-1 For PFAB i I-1, …, n-1, adjacent PFABs i Is cascaded to B i And B' i+1 Cascade, represented as:
B′ i =H F (concat(B i +B′ i+1 ))
B′ 1 for the final output of the progressive backward fusion module PBFM, the output characteristic of the jth BG module is recorded as F j Then F is j From B' 1 And F j-1 And performing pixel-level addition to obtain the product.
The present invention recovers a high-resolution image from a given low-resolution image by introducing a novel pyramid fusion attention mechanism. The stacked feature extraction basic groups BG effectively utilize intermediate features, and the pyramid fusion attention module PFAB better enhances the discriminative ability of the network by modeling the relationships between pixels, so that high-frequency information can be recovered more faithfully. In addition, the invention provides a progressive backward fusion strategy that uses deep features to guide feature fusion with shallow features, making better use of the network's intermediate-layer features.
The invention provides an image super-resolution network PFAN based on a pyramid fusion attention network, which builds stronger feature representations and enhances the discriminative ability of the network. The invention proposes the pyramid fusion attention network PFA and constructs the pyramid fusion attention module PFAB from it. PFA uses several residual modules RB to extract intermediate features, and recalibrates the resulting feature information with a pyramid structure that combines down-sampling operations and a multi-scale fusion strategy. This design has three main advantages. First, the bottom pyramid level maintains the size of the feature map to learn correlations between pixels; compared with channel attention or spatial attention, this makes the network more flexible in handling different types of information. Second, the higher pyramid levels have larger receptive fields and can acquire more global context information at relatively low cost. Third, feature fusion between adjacent pyramid levels provides information exchange across multiple resolutions, so multi-scale information is exploited more effectively. The pyramid fusion attention module can therefore predict a more accurate attention mask. In addition, to fully exploit the hierarchical features generated by each PFAB, the invention further provides a progressive backward fusion module PBFM that fuses the features of each intermediate layer to produce more discriminative features. Finally, several PFABs and one PBFM are combined into a feature extraction basic group BG, and several BGs plus a global skip connection form the final PFAN. In general, the larger a model's parameter count, the stronger its expressive ability and the better its results; however, fully utilizing all parameters is very difficult, so designing a network structure that achieves better performance under a fixed parameter budget is important. Compared with prior methods, the proposed PFAN achieves superior performance with a modest parameter count (11.9M).
Compared with existing methods, the single-image super-resolution method based on the pyramid fusion attention network has the following advantages:
1. An image super-resolution network PFAN based on a pyramid fusion attention network is proposed for the large-model single-image super-resolution task. Extensive experiments on public datasets under different degradation models demonstrate the effectiveness of PFAN, which surpasses existing state-of-the-art methods in both quantitative metrics and visual results.
2. A pixel-level attention mechanism PFA is proposed that adaptively captures inter-pixel correlations using a pyramid fusion structure, while allowing information exchange across multiple resolutions and providing a larger, more complete receptive field. This design enables the network to generate more accurate attention masks and thereby obtain more pixel-level and global information.
3. A progressive backward fusion module PBFM is proposed to fully exploit the multiple intermediate-layer features and acquire more context information for better image restoration.
Drawings
Fig. 1 is a flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of a basic architecture of a single-image super-resolution network based on a pyramid fusion attention network according to the present invention.
Fig. 3 is a schematic structural diagram of the pyramid fusion attention module in the present invention.
Fig. 4 is a schematic structural diagram of a basic residual block in the present invention.
Fig. 5 is a schematic structural diagram of a standard residual block in the present invention.
Fig. 6 is a schematic structural diagram of a feature extraction basic group in the present invention.
Detailed Description
The invention provides a single-image super-resolution method based on a pyramid fusion attention network. Most recent approaches use an attention mechanism to focus on high-frequency information. However, these approaches only consider interdependencies between channels or between spatial positions, so channel or spatial features are processed equally, which limits the attention mechanism's ability. The invention provides an image super-resolution network PFAN based on a pyramid fusion attention network to build stronger feature representations and enhance the discriminative ability of the network. Specifically, the high-resolution image HR is recovered from a given low-resolution image LR by introducing a novel pyramid fusion attention mechanism PFA. The PFA module enhances the discriminative ability of the network by modeling relationships between pixels, so high-frequency information can be recovered more faithfully. PFA adopts a pyramid fusion structure: each pyramid level is built by stacking standard residual blocks that reintroduce a BN layer, and a down-sampling operation together with a multi-scale fusion strategy ensures a complete receptive field and captures more contextual detail. Besides exploiting feature correlations between pixels, the invention also provides a progressive backward fusion strategy PBFM to fully utilize the hierarchical features generated by the intermediate pyramid fusion attention modules PFAB.
During training of the image super-resolution network PFAN, the method computes the error between the output image and the high-resolution ground-truth image with an MAE function, computes the gradients of the network parameters from this error, updates the parameters, and trains the network with an Adam optimizer. The invention is described in further detail below with reference to the figures and the detailed description.
Step 1: data augmentation. To make better use of public datasets, before model training the invention first augments the training set with random horizontal flips and random rotations of 90°, 180° or 270°, and crops it into 48×48 image blocks, thereby strengthening the generalization ability of network training.
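A minimal sketch of this augmentation strategy follows; the 48×48 patch size and the flip/rotation choices come from the text, while the function name, the C×H×W tensor layout and the scale factor are illustrative assumptions.

    import random
    import torch

    def augment_pair(hr: torch.Tensor, lr: torch.Tensor, patch: int = 48, scale: int = 2):
        """Crop a random 48x48 LR block (with the matching HR block), then apply a
        random horizontal flip and a random rotation of 0/90/180/270 degrees."""
        _, h, w = lr.shape
        x = random.randrange(w - patch + 1)
        y = random.randrange(h - patch + 1)
        lr = lr[:, y:y + patch, x:x + patch]
        hr = hr[:, y * scale:(y + patch) * scale, x * scale:(x + patch) * scale]
        if random.random() < 0.5:                 # random horizontal flip
            lr, hr = lr.flip(-1), hr.flip(-1)
        k = random.randrange(4)                   # k quarter-turns: 0/90/180/270 degrees
        return hr.rot90(k, (-2, -1)), lr.rot90(k, (-2, -1))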
Step 2: construct the single-image super-resolution network model and train it with supervision. The basic framework of the network model is shown in Fig. 2 and includes four key parts: a shallow feature extraction module, stacked feature extraction basic groups BGs, an up-sampling module and a reconstruction module. The overall structure can be formulated as the following flow:
(1) The shallow feature extraction module uses one convolutional layer to extract the shallow feature F_0 from the low-resolution image I_LR:

F_0 = H_0(I_LR)    (1)

where H_0(·) denotes a convolution function.
(2) Deep feature extraction is performed by the stacked feature extraction basic groups:

F_D = H_{BG,D}(F_{D-1}) = H_{BG,D}(H_{BG,D-1}(··· H_{BG,1}(F_0) ···))    (2)

where H_{BG,D}(·) denotes the operation of the D-th BG module, F_{D-1} denotes the output of the (D-1)-th BG module, and F_D is the output of the last BG module of the stack.
(3) Sub-pixel convolution is used as the up-sampling module to increase resolution, while a global skip connection fuses the deep and shallow features:

F_↑ = H_↑(F_D + F_0)    (3)

where H_↑(·) denotes the up-sampling module and F_↑ denotes the up-sampled feature.
(4) A final feature reconstruction is performed using one convolutional layer:

I_SR = H_R(F_↑)    (4)

The final output I_SR of the pyramid fusion attention network is the reconstructed high-resolution image.
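For illustration, the four formulas above can be rendered as a top-level module. The following PyTorch sketch assumes the BG class sketched after Eq. (6) below; the channel width (64), the number of BG modules and the number of PFABs per group are assumptions, since the patent does not fix these hyperparameters here.

    import torch.nn as nn

    class PFAN(nn.Module):
        """Top-level sketch of Eqs. (1)-(4): shallow conv H_0, D stacked BG modules,
        sub-pixel up-sampling H_up with a global skip, and a reconstruction conv H_R."""
        def __init__(self, ch=64, num_bg=4, n_pfab=8, scale=2):
            super().__init__()
            self.head = nn.Conv2d(3, ch, 3, padding=1)                        # Eq. (1)
            self.body = nn.Sequential(*[BG(ch, n_pfab) for _ in range(num_bg)])  # Eq. (2)
            self.up = nn.Sequential(                                          # Eq. (3)
                nn.Conv2d(ch, ch * scale ** 2, 3, padding=1),
                nn.PixelShuffle(scale))                                       # sub-pixel conv
            self.tail = nn.Conv2d(ch, 3, 3, padding=1)                        # Eq. (4)
        def forward(self, i_lr):
            f0 = self.head(i_lr)                 # shallow feature F_0
            fd = self.body(f0)                   # deep feature F_D
            return self.tail(self.up(fd + f0))   # global skip, up-sample, reconstruct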
The feature extraction basic group BG consists of several pyramid fusion attention modules PFAB and one progressive backward fusion module PBFM. Taking the 2nd BG module as an example:

F_{2,1} = H_{PFAB,1}(F_1)
F_{2,2} = H_{PFAB,2}(F_{2,1})    (5)
…
F_{2,n} = H_{PFAB,n}(F_{2,n-1})

F_2 = H_{PBFM}(F_{2,1}, F_{2,2}, …, F_{2,n}) + F_1    (6)

where H_{PFAB,i}(·) denotes the i-th of the n PFAB modules in the 2nd BG module, F_{2,i} denotes the output of that module, and H_{PBFM}(·) denotes the PBFM module. The output F_2 of the 2nd BG module is obtained by pixel-wise addition of the PBFM fusion result and the output F_1 of the first BG module.
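A sketch of one BG module following Eqs. (5)-(6); it assumes the PFAB and PBFM classes sketched in the paragraphs that follow.

    import torch.nn as nn

    class BG(nn.Module):
        """Feature extraction basic group: n PFABs in series, PBFM fusion, plus a skip."""
        def __init__(self, ch, n):
            super().__init__()
            self.pfabs = nn.ModuleList(PFAB(ch) for _ in range(n))
            self.pbfm = PBFM(ch, n)
        def forward(self, x):
            feats, f = [], x
            for pfab in self.pfabs:              # Eq. (5): F_{2,i} = H_{PFAB,i}(F_{2,i-1})
                f = pfab(f)
                feats.append(f)                  # keep every intermediate output for PBFM
            return self.pbfm(feats) + x          # Eq. (6): PBFM fusion + pixel-wise skip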
The PFAB is composed of a basic residual block Plain RB and a pyramid fusion attention network PFA (Pyramid Fusion Attention); the specific structure is shown in Fig. 3, and the architecture of the Plain RB is shown in Fig. 4. As discussed in EDSR (B. Lim, S. Son, H. Kim, S. Nah and K. M. Lee, "Enhanced Deep Residual Networks for Single Image Super-Resolution," 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 1132-1140, doi: 10.1109/CVPRW.2017.151), the batch normalization layer BN normalizes the features and thereby removes the range flexibility of the network, a flexibility that the image super-resolution task requires. The BN layer is therefore removed from the basic residual block structure here, which is crucial for low-level tasks such as SR.
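A minimal sketch of the Plain RB of Fig. 4 (conv, ReLU, conv, with an identity skip and no BN); the 3×3 kernel size is an assumption consistent with EDSR-style blocks.

    import torch.nn as nn

    class PlainRB(nn.Module):
        """Basic residual block without BN, following the EDSR observation above."""
        def __init__(self, ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(ch, ch, 3, padding=1))
        def forward(self, x):
            return x + self.body(x)              # identity skip, no normalization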
Previous work on attention mechanisms focused primarily on intrinsic feature correlations at the channel level or the spatial level, which results in equal treatment of features within these dimensions. The invention proposes PFA to address this drawback. To keep the overall structure light, PFA first uses a 1×1 convolutional layer to reduce the number of channels. After this convolutional layer, PFA uses stacked residual blocks to model the relationships between pixels. Because of how existing deep learning frameworks define the convolution operation, every output channel of a 2-D convolutional layer (with its in_channels and out_channels) depends on all input channels. Therefore, simply by using residual blocks composed of 2-D convolutional layers and skip connections, PFA can capture correlations between feature channels and obtain a complete receptive field. Unlike the basic residual block Plain RB, the standard residual block RB adopted in PFA, shown in Fig. 5, reintroduces the batch normalization layer BN. Since EDSR explored the role of BN layers in low-level vision tasks, almost all subsequent approaches have removed BN layers from their networks. However, the attention module is intended to direct the network to focus more on salient regions rather than to compute the super-resolution result directly, which means these features may be high-level. Reintroducing the BN layer into the attention module therefore does not degrade performance; instead it helps accelerate the convergence of network training, prevents gradient explosion or vanishing, and can even further improve performance.
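The standard RB of Fig. 5 differs from the Plain RB only in reintroducing BN after each convolution; a sketch under the same kernel-size assumption:

    import torch.nn as nn

    class StandardRB(nn.Module):
        """Standard residual block used inside PFA: BN is reintroduced after each conv."""
        def __init__(self, ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
        def forward(self, x):
            return x + self.body(x)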
To further enlarge the receptive field and improve the attention mechanism, the PFA of the invention adopts a pyramid structure. The structure of each level in the pyramid is similar. Note that in each pyramid level PFA places a max-pooling layer after the 1×1 convolutional layer; pooling-n in Fig. 3 denotes a max-pooling layer that reduces the size of the input feature map by a factor of n, and different levels use different values of n, so the feature maps of different levels have different sizes. Thanks to this design, the bottom pyramid level maintains the size of the feature map to learn correlations between pixels, which makes the network more flexible when processing different types of information, while the higher levels obtain larger receptive fields and more global context information. In addition, each middle pyramid level receives the outputs of the RBs in the adjacent upper and lower levels as input and fuses them using one up-sampling layer and one 1×1 convolutional layer, providing information exchange across multiple resolutions. Finally, the outputs of all pyramid levels are up-sampled to the same size, concatenated (the concat layer), and fed into a convolutional layer and a sigmoid layer to obtain the final attention mask. The pixel-wise product of the attention mask and the feature map of the current PFAB module gives the output of the current PFAB.
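The following sketch condenses the pyramid of Fig. 3 to two levels (a full-size bottom level and one pooled level) to keep the code short. The number of levels, the reduced channel width and the single RB per level are simplifying assumptions; the 1×1 channel reduction, pooling, cross-scale up-sampling, concatenation, conv + sigmoid mask and pixel-wise product follow the description above. It reuses the PlainRB and StandardRB sketches.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PFA(nn.Module):
        """Two-level sketch of the pyramid fusion attention of Fig. 3."""
        def __init__(self, ch, reduced=16):
            super().__init__()
            self.reduce = nn.Conv2d(ch, reduced, 1)      # 1x1 conv shrinks channels
            self.pool = nn.MaxPool2d(2)                  # pooling-2 for the upper level
            self.rb_bottom = StandardRB(reduced)         # full size: pixel correlations
            self.rb_top = StandardRB(reduced)            # 1/2 size: larger receptive field
            self.mask = nn.Sequential(
                nn.Conv2d(2 * reduced, ch, 3, padding=1), nn.Sigmoid())
        def forward(self, x):
            r = self.reduce(x)
            b = self.rb_bottom(r)
            t = self.rb_top(self.pool(r))
            t = F.interpolate(t, size=b.shape[-2:], mode='nearest')  # back to full size
            m = self.mask(torch.cat([b, t], dim=1))      # concat -> conv -> sigmoid
            return x * m                                 # pixel-wise product with input

    class PFAB(nn.Module):
        """Pyramid fusion attention block: Plain RB followed by PFA."""
        def __init__(self, ch):
            super().__init__()
            self.rb, self.pfa = PlainRB(ch), PFA(ch)
        def forward(self, x):
            return self.pfa(self.rb(x))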
For the connection strategy of the multiple PFAB modules within a feature extraction basic group, the invention provides the progressive backward fusion module PBFM (Progressive Backward Fusion Module), whose structure is shown in Fig. 6. Rather than simply concatenating the PFABs, the PBFM aims to take full advantage of the hierarchical features produced by each PFAB and to produce a more discriminative output. Experiments verify that the output features of deeper modules help guide the features of shallower modules, so those features serve image restoration better. The outputs of two adjacent PFABs are concatenated along the channel domain, and the concatenated features are fed into a contrast-based channel attention CCA module to further enhance key information; the CCA (contrast-aware channel attention) module is described in the paper "Lightweight Image Super-Resolution with Information Multi-distillation Network" (arXiv:1909.11856 [eess.IV]). The channel-domain features of the CCA-enhanced feature are then fused with a 1×1 convolutional layer. For n PFABs, the output B_n of PFAB_n and the output B_{n-1} of PFAB_{n-1} are concatenated directly and then passed through a CCA module and a 1×1 convolution to obtain the fused output B′_{n-1} corresponding to PFAB_{n-1}; for PFAB_i, i = 1, …, n-1, the cascade of the adjacent PFAB_i concatenates B_i with B′_{i+1}, expressed as:

B′_i = H_F(concat(B_i, B′_{i+1}))

B′_1 is the final output of the progressive backward fusion module PBFM.
After each progressive fusion step above, an activation function layer (ReLU) is also applied to preserve the non-linearity of the network. Similarly, after all PFAB fusion operations, a skip connection is used to fully utilize the shallow feature and finally generate the output feature of the current BG module: denoting the output feature of the j-th BG module as F_j, F_j is obtained by pixel-wise addition of B′_1 and F_{j-1}.
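A sketch of the PBFM together with the CCA gate it relies on; the contrast statistic (per-channel standard deviation plus mean) follows the IMDN paper cited above, while the reduction ratio of 16 is an assumption.

    import torch
    import torch.nn as nn

    class CCA(nn.Module):
        """Contrast-aware channel attention: per-channel std + mean, bottleneck, sigmoid."""
        def __init__(self, ch, reduction=16):
            super().__init__()
            self.gate = nn.Sequential(
                nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
                nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())
        def forward(self, x):
            contrast = x.std(dim=(2, 3), keepdim=True) + x.mean(dim=(2, 3), keepdim=True)
            return x * self.gate(contrast)

    class PBFM(nn.Module):
        """Progressive backward fusion over the PFAB outputs [B_1, ..., B_n]."""
        def __init__(self, ch, n):
            super().__init__()
            self.ccas = nn.ModuleList(CCA(2 * ch) for _ in range(n - 1))
            self.fuses = nn.ModuleList(nn.Conv2d(2 * ch, ch, 1) for _ in range(n - 1))
            self.act = nn.ReLU(inplace=True)
        def forward(self, feats):
            fused = feats[-1]                             # start from the deepest B_n
            for i in range(len(feats) - 2, -1, -1):       # walk backward to B_1
                cat = torch.cat([feats[i], fused], dim=1)  # channel-domain concat
                fused = self.act(self.fuses[i](self.ccas[i](cat)))  # CCA, 1x1 conv, ReLU
            return fused                                  # B'_1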
For the supervised training in this step, the invention uses the Adam optimizer with parameters β1 = 0.9, β2 = 0.99 and ε = 10^-8. The initial learning rate is set to 10^-4 and is then halved every 200 epochs.
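A minimal training-loop sketch under these settings; the optimizer parameters, the MAE (L1) objective and the 200-epoch halving schedule are as stated, while the data loader, total epoch count and device placement are assumptions outside the patent text.

    import torch

    model = PFAN().cuda()                               # network from the sketch above
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.99), eps=1e-8)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)
    criterion = torch.nn.L1Loss()                       # MAE between SR output and HR target

    for epoch in range(1000):                           # total epoch count is an assumption
        for lr_img, hr_img in train_loader:             # assumed loader of 16-patch batches
            sr = model(lr_img.cuda())
            loss = criterion(sr, hr_img.cuda())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()                                # halve the learning rate every 200 epochs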
Step 3: testing. The parameters of the trained network model are fixed, and the low-resolution test image is fed into the network model to obtain the corresponding high-resolution image.
The invention provides a brand-new pyramid fusion attention network PFAN. Conventional attention mechanisms only compute correlations along the channel dimension or the spatial dimension and treat the features within the corresponding dimension equally. The invention therefore proposes the pyramid fusion attention mechanism PFA, which models correlations at pixel granularity, uses a pyramid structure to guarantee the receptive field, acquires global and pixel-level information, and improves the network's ability to recover high-frequency information. In addition, the invention provides the progressive backward fusion module PBFM, which uses deep features to guide feature fusion with shallow features and makes better use of the network's intermediate-layer features. The invention achieves leading performance on the public datasets Set5 and Set14: PSNR 38.32 dB and SSIM 0.9617 on Set5, and PSNR 34.21 dB and SSIM 0.9224 on Set14.

Claims (3)

1. A single-image super-resolution method based on a pyramid fusion attention network, characterized in that an image super-resolution network PFAN is constructed to recover a high-resolution image from a given low-resolution image, the network consisting of a shallow feature extraction module, stacked feature extraction basic groups BG, an up-sampling module and a reconstruction module arranged in sequence, the method comprising the following steps:
1) The shallow feature extraction module uses one convolutional layer to extract the shallow feature F_0 from the low-resolution image I_LR:

F_0 = H_0(I_LR)

where H_0(·) denotes a convolution function;
2) Deep feature extraction is carried out by the stacked feature extraction basic groups BG:

F_D = H_{BG,D}(F_{D-1}) = H_{BG,D}(H_{BG,D-1}(··· H_{BG,1}(F_0) ···))

where H_{BG,D}(·) denotes the D-th BG module and F_{D-1} denotes the output of the (D-1)-th BG module;
3) Sub-pixel convolution is used as the up-sampling module to increase resolution while fusing the deep feature F_D and the shallow feature F_0:

F_↑ = H_↑(F_D + F_0)

where H_↑(·) denotes the up-sampling module and F_↑ denotes the up-sampled feature;
4) One convolutional layer performs feature reconstruction on the up-sampled result F_↑:

I_SR = H_R(F_↑)

The final output I_SR of the pyramid fusion attention network is the reconstructed high-resolution image;
the basic group BG consists of n pyramid fusion attention modules PFAB and a progressive backward fusion module PBFM, and the pyramid fusion attention module PFAB consists of a basic residual errorThe method comprises the steps that a block Plain RB and a pyramid fusion attention network PFA are formed, the PFA adopts a pyramid structure, the output of a basic residual block Plain RB is convoluted and pooled after the PFA is input, then the relation between pixels is modeled by using stacked standard residual blocks RB, the pyramid middle layer receives the output from corresponding upper and lower layers RB as input, the output of each layer of the pyramid is cascaded after being up-sampled to the same size, and is sent to a convolution layer and a sigmoid layer, so that an attention mask is obtained, and the attention mask and a feature diagram of a current PFAB module are subjected to pixel level multiplication operation to obtain the output of the current PFAB; n PFABs are connected in series in sequence, a progressive backward fusion module PBFM is used for cascade operation on a channel domain for the output of two adjacent PFABs, a channel attention CCA module based on contrast is sent to the feature after each cascade operation to further enhance key information, a 1 multiplied by 1 convolutional layer is used for fusing the feature of the channel domain of the feature enhanced by the CCA module, wherein the PFAB n Output result B of n And PFAB n-1 Output result B of n-1 Directly cascaded and then convolved with 1 × 1 through a CCA module to obtain the corresponding PFAB n-1 Fused output of B' n-1 For PFAB i I-1, …, n-1, adjacent PFABs i Is cascaded to B i And B' i+1 Cascade, represented as:
B′ i =H F (concat(B i +B′ i+1 ))
B′ 1 for the final output of the progressive backward fusion module PBFM, the output characteristic of the jth BG module is recorded as F j Then F is j From B' 1 And F j-1 And carrying out pixel-level addition to obtain the product.
2. The single-image super-resolution method based on the pyramid fusion attention network according to claim 1, characterized in that training data are used to train the image super-resolution network; during training, an MAE function computes the error between the output image and the high-resolution ground-truth image, the gradients of the network parameters are computed from this error, the parameters are updated, and the network is trained with an Adam optimizer.
3. The single-image super-resolution method based on the pyramid fusion attention network according to claim 2, characterized in that the training data first undergo data enhancement through a strategy of random horizontal flipping and random rotation by 90°, 180° or 270°, and each mini-batch uses 16 low-resolution RGB image blocks of size 48×48 as input during training.
CN202210639755.0A (priority date 2022-06-08, filing date 2022-06-08): Image super-resolution method based on pyramid fusion attention network. Status: Pending. Published as CN114926343A.

Priority Applications (1)

CN202210639755.0A, priority date 2022-06-08, filing date 2022-06-08: Image super-resolution method based on pyramid fusion attention network

Applications Claiming Priority (1)

CN202210639755.0A, priority date 2022-06-08, filing date 2022-06-08: Image super-resolution method based on pyramid fusion attention network

Publications (1)

Publication Number: CN114926343A; Publication Date: 2022-08-19

Family

Family ID: 82813056

Family Applications (1)

CN202210639755.0A, priority date 2022-06-08, filing date 2022-06-08: Image super-resolution method based on pyramid fusion attention network (Pending)

Country Status (1)

CN: CN114926343A


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115358931A (en) * 2022-10-20 2022-11-18 运易通科技有限公司 Image reconstruction method and device for warehouse logistics system
CN115358931B (en) * 2022-10-20 2023-01-03 运易通科技有限公司 Image reconstruction method and device for warehouse logistics system
CN115797789A (en) * 2023-02-20 2023-03-14 成都东方天呈智能科技有限公司 Cascade detector-based rice pest monitoring system and method and storage medium
CN116362972A (en) * 2023-05-22 2023-06-30 飞狐信息技术(天津)有限公司 Image processing method, device, electronic equipment and storage medium
CN116362972B (en) * 2023-05-22 2023-08-08 飞狐信息技术(天津)有限公司 Image processing method, device, electronic equipment and storage medium
CN117173024A (en) * 2023-09-20 2023-12-05 中国矿业大学 Mine image super-resolution reconstruction system and method based on overall attention
CN117173024B (en) * 2023-09-20 2024-04-16 中国矿业大学 Mine image super-resolution reconstruction system and method based on overall attention

Similar Documents

Publication Publication Date Title
CN110570353B (en) Super-resolution reconstruction method for generating single image of countermeasure network by dense connection
CN111062872B (en) Image super-resolution reconstruction method and system based on edge detection
CN110119780B (en) Hyper-spectral image super-resolution reconstruction method based on generation countermeasure network
CN114926343A (en) Image super-resolution method based on pyramid fusion attention network
CN113240613B (en) Image restoration method based on edge information reconstruction
CN113592718A (en) Mine image super-resolution reconstruction method and system based on multi-scale residual error network
CN107154023A (en) Face super-resolution reconstruction method based on generation confrontation network and sub-pix convolution
CN115222601A (en) Image super-resolution reconstruction model and method based on residual mixed attention network
CN111861961A (en) Multi-scale residual error fusion model for single image super-resolution and restoration method thereof
CN113012172A (en) AS-UNet-based medical image segmentation method and system
CN110363068B (en) High-resolution pedestrian image generation method based on multiscale circulation generation type countermeasure network
CN109214989A (en) Single image super resolution ratio reconstruction method based on Orientation Features prediction priori
CN111768340B (en) Super-resolution image reconstruction method and system based on dense multipath network
CN109949217B (en) Video super-resolution reconstruction method based on residual learning and implicit motion compensation
KR101977067B1 (en) Method for reconstructing diagnosis map by deep neural network-based feature extraction and apparatus using the same
CN115484410B (en) Event camera video reconstruction method based on deep learning
CN109961407A (en) Facial image restorative procedure based on face similitude
CN112699844A (en) Image super-resolution method based on multi-scale residual error level dense connection network
CN113469884A (en) Video super-resolution method, system, equipment and storage medium based on data simulation
CN112949636A (en) License plate super-resolution identification method and system and computer readable medium
CN112950475A (en) Light field super-resolution reconstruction method based on residual learning and spatial transformation network
CN115393191A (en) Method, device and equipment for reconstructing super-resolution of lightweight remote sensing image
CN115293986A (en) Multi-temporal remote sensing image cloud region reconstruction method
Cheng et al. Hybrid transformer and cnn attention network for stereo image super-resolution
Bao et al. S 2 net: Shadow mask-based semantic-aware network for single-image shadow removal

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination