CN114897690A - Lightweight image super-resolution method based on serial high-frequency attention - Google Patents
- Publication number: CN114897690A
- Application number: CN202210466344.6A
- Authority: CN (China)
- Prior art keywords: convolution, attention, frequency, resolution, image
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T3/4053 — Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution (G06T — Image data processing or generation, in general)
- G06N3/045 — Combinations of networks (G06N3 — Computing arrangements based on biological models; neural networks)
- G06N3/048 — Activation functions
- G06N3/08 — Learning methods
Abstract
A lightweight image super-resolution method based on serial high-frequency attention. A serial high-frequency attention module is constructed from a dimension-reducing convolution, an edge-detection convolution, a dimension-raising convolution, a batch normalization layer and a Sigmoid layer; by learning a weight between 0 and 1 for each pixel, it strengthens the recovery of high-frequency edge information by the convolutional neural network. The method makes full use of the attention mechanism and, unlike prior methods, balances performance and efficiency: a trainable Laplacian edge-detection operator greatly strengthens the attention module, while a serial structure and efficient operators keep it fast. The method alleviates the blurred edge information common in reconstructed images and achieves better reconstruction quality; compared with the best current lightweight image super-resolution method, it reduces peak video memory occupation by 72% and improves inference speed by 38%.
Description
Technical Field
The invention belongs to the field of computer technology and relates to deep learning, computer vision and computer image understanding, in particular to image super-resolution. It guides a deep neural network to focus on high-frequency features so as to improve reconstruction quality, and is a lightweight image super-resolution method based on serial high-frequency attention.
Background
Lightweight image super-resolution is a technology that restores a low-resolution image to a clear high-resolution image using a model that runs fast and occupies little video memory. The technology can directly enhance image quality in everyday use, and can also serve as an effective preprocessing step for downstream tasks such as small-object detection, segmentation and human keypoint detection. In practical scenarios, models are often expected to reconstruct images quickly, for example super-resolution of magnetic resonance images for diagnosing patient conditions, image super-resolution in Microsoft 365, super-resolution of underwater cruise imagery, and preprocessing for object detection. These practical requirements make lightweight image super-resolution a major research focus.
Since the rise of deep learning, convolutional neural network (CNN) based methods have made great progress in image super-resolution. SRCNN creatively designed a three-layer CNN to learn the mapping from low resolution to high resolution, improving markedly over traditional methods. Later, more and more heuristic ideas, such as residual learning, feature fusion, carefully designed loss functions and, recently, attention mechanisms, were introduced and advanced the field.
Attention mechanisms have proven effective in various computer vision tasks; their goal is to direct the network to focus on important signals while suppressing unimportant ones. Since SENet's great success in image classification, researchers in image super-resolution have proposed a variety of attention mechanisms. The Residual Channel Attention Network (RCAN) first incorporated channel attention into the residual block. Residual Non-local Attention Networks (RNAN) introduced local and global attention to rescale intermediate features. The Second-order Attention Network (SAN) designed a channel attention mechanism using second-order statistics of the features, achieving better results than first-order channel attention. The Residual Feature Aggregation network (RFANet) proposed an enhanced spatial attention mechanism to obtain feature maps with a larger receptive field. The Holistic Attention Network (HAN) proposed layer attention and channel-spatial attention mechanisms to model relations across convolutional layers, channels and spatial positions. Despite the great advances these attention-based methods have made, their multi-branch structures and inefficient operators, such as 7x7 convolution, are suboptimal for lightweight image super-resolution. Attention modules that are both strong and efficient still need further study.
Disclosure of Invention
The invention aims to solve the following problems: current mainstream attention mechanisms occupy too much video memory, infer too slowly, and provide weak guidance toward high-frequency features.
The technical scheme of the invention is as follows: in a lightweight image super-resolution method based on serial high-frequency attention, an image super-resolution model based on high-frequency attention is built; a low-resolution image undergoes feature extraction, then high-frequency learning, and is then reconstructed into a high-resolution image. The high-frequency learning module consists of a serial ERB+HFAB structure, in which a high-frequency attention block HFAB follows each enhanced residual block ERB. The HFAB is composed of a dimension-reducing convolution, an edge-detection convolution, a dimension-raising convolution, a batch normalization layer and a Sigmoid layer, and strengthens the network's recovery of high-frequency edge information by learning a weight between 0 and 1 for each pixel. The input feature map is first reduced in dimension; a rough edge map is then obtained through the edge-detection convolution and refined by the enhanced residual block; the dimension-raising convolution transforms the result back to the input space; finally the batch normalization layer BN brings the features into the non-saturated region of the Sigmoid function, which learns a weight between 0 and 1 for each pixel to obtain an attention map, and the attention map is multiplied pixel by pixel with the input feature map to rescale the features. During training of the image super-resolution model, the distance between the reconstructed high-resolution image and the high-definition ground-truth image is computed with an L1 loss function, the gradients of the network parameters in each layer are derived from this distance, and an Adam optimizer is used for supervised training.
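A minimal single-channel NumPy sketch of this pipeline is given below. The helper names (`conv2d`, `hfab`), the LeakyReLU slope and the single-channel simplification are illustrative assumptions rather than the patent's implementation, and the ERB refinement step between the edge map and the dimension-raising convolution is omitted for brevity.

```python
import numpy as np

def conv2d(x, k):
    """3x3 convolution with zero padding, keeping spatial size."""
    p = np.pad(x.astype(float), 1)
    h, w = x.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(p[i:i + 3, j:j + 3] * k)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 4-neighbour Laplacian template used to initialize the edge-detection convolution
LAPLACIAN = np.array([[0., -1., 0.],
                      [-1., 4., -1.],
                      [0., -1., 0.]])

def hfab(x, w_squeeze, w_expand, gamma=1.0, beta=0.0, eps=1e-5):
    """Single-channel sketch: squeeze -> edge detection -> expand -> BN -> Sigmoid -> rescale."""
    f = conv2d(x, w_squeeze)                   # dimension-reducing convolution (stand-in)
    f = np.where(f > 0, f, 0.05 * f)           # LeakyReLU
    edge = conv2d(f, LAPLACIAN)                # edge-detection convolution
    f2 = conv2d(edge, w_expand)                # dimension-raising convolution (stand-in)
    f2 = gamma * (f2 - f2.mean()) / np.sqrt(f2.var() + eps) + beta  # batch normalization
    attn = sigmoid(f2)                         # a weight in (0, 1) per pixel
    return attn * x                            # pixel-wise feature rescaling
```

Because the attention map lies strictly in (0, 1), the output never exceeds the input in magnitude; the module only rescales features, emphasizing pixels near edges.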
When designing the high-frequency attention module, the inference time of the meta-operators is tested first and the most efficient operators are selected to build the attention module; second, a video memory analysis motivates a serial structure, rather than a parallel one, to reduce memory occupation; then, a learnable Laplacian edge-detection operator is introduced explicitly to strengthen the learning of high-frequency features; finally, a batch normalization layer is introduced to accelerate convergence of the module.
The invention has the following key innovations: (1) it finds that 3x3 convolution is more efficient and brings a larger receptive field, and adopts 3x3 convolution for dimension raising and reduction instead of the 1x1 convolution used by existing methods; (2) it adopts a fully serial structure to reduce video memory occupation and improve inference speed, rather than the parallel structures used by existing attention modules; (3) it applies a learnable Laplacian edge-detection operator inside the attention module to strengthen features in high-frequency regions; (4) it finds that a batch normalization layer helps the attention module converge, whereas the attention modules of existing mechanisms do not use batch normalization.
The invention has the beneficial effects that:
1. The attention module of the invention is more efficient: compared with prior methods, peak video memory occupation is reduced by 72% and inference speed is improved by 38%, which suits the lightweight image super-resolution task well.
2. The reconstruction quality is higher: the blurred edge information of reconstructed images is alleviated, and the best reconstruction quality, measured by peak signal-to-noise ratio (PSNR), is obtained on the five datasets Set5, Set14, B100, Urban100 and Manga109.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a test of the inference time of the meta-operator on GTX 1080Ti according to the present invention.
FIG. 3 is a diagram illustrating the video memory analysis of the serial and parallel modules according to the present invention.
Fig. 4 is a high frequency attention-based network structure proposed by the present invention.
Fig. 5 is a comparison diagram of the enhanced residual block ERB and the normal residual block RB employed in the present invention.
Detailed Description
The invention provides a lightweight image super-resolution method based on serial high-frequency attention, which greatly improves reconstruction quality while guaranteeing the model's inference efficiency. A serial high-frequency attention module is constructed from a dimension-reducing convolution, an edge-detection convolution, a dimension-raising convolution, a batch normalization layer and a Sigmoid layer; by learning a weight between 0 and 1 for each pixel, it strengthens the recovery of high-frequency edge information by the convolutional neural network.
Fig. 1 shows the main process of building the high-frequency-attention-based image super-resolution model; the implementation of the invention is described in further detail below with reference to the accompanying drawings.
Step 1: meta-operator timing test. A convolutional neural network model is composed of basic components such as convolution layers, activation layers, up/down-sampling layers, normalization layers and elementwise arithmetic, and the running time of these meta-operators is a major component of the model's inference time. Mainstream deep learning frameworks optimize different operators to different degrees; for example, because 3x3 convolution is ubiquitous, it is highly parallelized on hardware, while larger kernels such as 5x5 achieve far lower compute density than 3x3 convolution. Selecting efficient operators is therefore essential for building a lightweight image super-resolution model: it achieves the lowest inference-time overhead at the same model capacity. Starting from the super-resolution baseline model EDSR-baseline, the method adds several repetitions of an operator to obtain a new model and estimates each operator's approximate inference time from the difference. The input image size is set to 3x480x480 and inference is run 5 times on a GTX 1080Ti, averaging the times to reduce error, yielding fig. 2. From the figure, the following conclusions can be drawn:
(1) Prefer convolution layers with 3x3 kernels. A lightweight super-resolution model has only a few hundred K parameters, and 3x3 convolution brings a larger receptive field than 1x1 convolution. More importantly, 3x3 convolution costs 9 times the computation of 1x1 convolution but only about 2 times the wall-clock time, so it is more efficient;
(2) choose LeakyReLU as the activation function. LeakyReLU has a larger nonlinear mapping space than ReLU, and its inference time is clearly better than PReLU;
(3) avoid memory-bound operators. Although many methods downsample before upsampling to obtain a larger receptive field, a single downsampling plus a single upsampling costs about half the inference time of a 3x3 convolution, and since an image super-resolution model repeats many substructures, these up/down-sampling pairs add non-negligible overhead to the final model. In addition, channel split and channel concatenation are memory-bound operators, common in the current best models IMDN and RFDN, but we find them to be a main factor limiting performance;
(4) use sub-pixel convolution in the upsampling module; it is clearly better than other upsampling approaches.
Finally, the invention selects the 6 most efficient operators, namely 3x3 convolution, LeakyReLU, standalone element-wise addition, element-wise multiplication, Sigmoid and sub-pixel convolution, to build its modules.
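The differencing procedure of Step 1 can be sketched as follows. The operators timed here and the repeat counts are illustrative stand-ins (the patent benchmarks real convolution layers inside EDSR-baseline on a GPU); the point is only the measurement technique: time two network depths and divide the difference by the number of added operators.

```python
import time
import numpy as np

def time_repeated(op, x, repeats, runs=5):
    """Average wall-clock time of applying `op` `repeats` times in sequence."""
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        y = x
        for _ in range(repeats):
            y = op(y)
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)

def per_op_time(op, x, base=1, extra=20):
    """Estimate one operator's cost from the time difference of two depths."""
    return (time_repeated(op, x, base + extra) - time_repeated(op, x, base)) / extra

x = np.random.rand(256, 256).astype(np.float32)
t_lrelu = per_op_time(lambda y: np.where(y > 0, y, 0.05 * y), x)  # LeakyReLU
t_sigmoid = per_op_time(lambda y: 1.0 / (1.0 + np.exp(-y)), x)    # Sigmoid
```

Averaging over several runs, as the patent does, reduces the noise that otherwise dominates such small time differences.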
Step 2: peak video memory analysis. Besides the commonly used indices of parameter count, computation and inference time, the lightweight image super-resolution international challenge AIM also includes peak video memory as a performance index. This index is reasonable: GPU memory is limited, several models often cooperate to complete tasks in real applications, and each model should occupy as little memory as possible; otherwise a momentary spike in memory use may abort the program. In addition, when deployed on mobile devices, excessive memory use causes problems such as heat. Let M_i denote the video memory occupied by the i-th node during inference, n the total number of nodes in the model, and M the peak video memory during inference, defined as:
M = max(M_1, M_2, …, M_n)
M i the device is composed of four parts: display memory occupied by input characteristicsDisplay memory occupied by output characteristicsVideo memory occupied by temporary storage feature accessed in futureAnd video memory occupied by model parametersIs defined as:
in the task of the light-weight image super-resolution, the video memory occupied by the model is negligible compared with the video memory occupied by the features. Under the condition that the input resolution and the parameters are consistent, the video memory occupation of the layer mainly depends on the video memory size occupied by the temporary storage characteristic diagram. In order to analyze the reasons behind the current best model, which is due to the fact that the feature fusion layer causes the display memory to multiply, consider a compact serial structure and a fused parallel topology adopted by the RFDN, as shown in fig. 3. For convenience of description, it is assumed that the size of each dimension of the feature map is unchanged after the feature map is subjected to convolution by 3 × 3. Multiple intermediate features are stitched along the channel dimension, and 1x1 convolution is used to reduce the number of feature maps to be consistent with the number of input image channels after stitching. Because the ReLU function can be performed directly on the original feature map, meaning that the input and output of the ReLU layer share the memory,the active layer is omitted from the figure. For convolution kernel size of C in ×C out The input and output characteristics of the xKxK convolution cannot share video memory because each pixel needs to be C out xK K visits and Winograd algorithm.
First consider the serial structure of fig. 3(a). In a serial structure, each node involves only its current input and output feature maps; feature maps of preceding layers need not be kept after forward propagation, so the memory occupation of each convolution node is about M_input + M_output. Stacking identical serial structures only increases the number of network parameters, whose memory is negligible. With the feature map size unchanged, the peak memory M_serial of the serial structure is 2 x M_input.
Now consider the fusion-based parallel structure of fig. 3(b). The feature maps feeding the 1x1 convolution fusion layer must stay in video memory after they are first computed, which significantly increases M_kept. Taking the second 3x3 convolution layer as an example, three features occupy memory: the input of that layer, its output, and the output of the first convolution layer kept for fusion, so the node occupies 3 times the memory of the input feature. Similarly, at the feature concatenation layer, concatenating 3 features along the channel dimension occupies 6 x M_input. If N features of the same size participate in fusion, the concatenation node occupies 2 x N x M_input. The ratio between the memory M_parallel of the fusion-based parallel structure and the memory M_serial of the simple serial structure is therefore:

M_parallel / M_serial ≥ (2 x N x M_input) / (2 x M_input) = N

i.e., at least N times. This analysis is well validated on the RFDN model: a 400K-parameter serial structure occupies about 30M of video memory, while the RFDN parallel structure with global fusion occupies about 200M, roughly 7 times the peak memory. To reduce peak memory as much as possible, the invention prefers serial structures when designing network blocks and avoids multiple parallel connections at any node.
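The counting argument above reduces to simple arithmetic. The sketch below (the helper names and the MiB conversion are illustrative, not from the patent) reproduces the at-least-N-times ratio for same-sized feature maps.

```python
def feature_map_mib(channels, height, width, bytes_per_elem=4):
    """Memory of one float32 feature map, in MiB."""
    return channels * height * width * bytes_per_elem / 2**20

def serial_peak(m_input):
    """Serial chain: only the current input and output feature maps are alive."""
    return 2 * m_input

def parallel_peak(m_input, n_branches):
    """Fusion topology: N same-sized branch outputs stay alive for the concat-fuse layer."""
    return 2 * n_branches * m_input

m_in = feature_map_mib(64, 480, 480)               # a 64-channel map at the test resolution
ratio = parallel_peak(m_in, 7) / serial_peak(m_in)  # 7.0, matching the ~7x observed on RFDN
```

The ratio is independent of the feature map size itself, which is why the serial design wins regardless of input resolution.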
Step 3: model design based on the serial high-frequency attention module. After the analysis of the first two steps, with the operators and the feature fusion structure determined, the high-frequency-attention-based image super-resolution model is built, in particular the high-frequency attention block HFAB.
The invention builds the network on the enhanced residual block ERB (enhanced residual block), as shown in fig. 5: fig. 5(a) is the ordinary residual block RB and fig. 5(b) the enhanced residual block ERB. ERB matches RB in reconstruction accuracy, but its two skip connections bracket purely linear convolutions and can be merged with the parallel convolutions at the inference stage, reducing memory-access overhead; in RB, the skip connection spans a nonlinear operation and cannot be merged. ERB improves inference speed by about 10% over RB. The overall structure of the proposed image super-resolution model is largely consistent with existing methods; the difference lies in the high-frequency learning part, which adopts the serial ERB+HFAB design and applies a high-frequency attention block HFAB after each ERB to strengthen high-frequency information.
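The re-parameterization idea behind ERB can be checked on a single channel: a skip connection around a purely linear 3x3 convolution equals a single convolution whose kernel has 1 added at the centre. The helper below is an illustrative sketch, not the patent's implementation.

```python
import numpy as np

def conv2d(x, k):
    """3x3 convolution with zero padding, keeping spatial size."""
    p = np.pad(x, 1)
    h, w = x.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(p[i:i + 3, j:j + 3] * k)
    return out

rng = np.random.default_rng(1)
k = rng.normal(size=(3, 3))
x = rng.normal(size=(6, 6))

identity = np.zeros((3, 3))
identity[1, 1] = 1.0                 # the identity mapping as a 3x3 kernel

y_skip = conv2d(x, k) + x            # convolution plus skip connection
y_merged = conv2d(x, k + identity)   # skip folded into the kernel at inference time
```

If a nonlinearity sat between the convolution and the addition, as in an ordinary RB, the two branches could not be folded into one kernel this way, which is exactly the distinction the patent draws between ERB and RB.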
The task of the high-frequency attention block HFAB is to assign each pixel of the feature map a weight between 0 and 1, representing its importance in the model's learning process. The goal of HFAB is that edge-detail pixels are repaired more finely during recovery. To achieve this, a Laplacian operator is introduced in the HFAB module to guide the attention branch toward edge details. The Laplacian convolution template is defined as:
the 3x3 convolution is initialized with the weight and can be updated continuously in the later learning process, so that the template has stronger correction capability.
The overall structure of the image super-resolution model and the structure of the high-frequency attention block are shown in fig. 4. Denote the input of the k-th HFAB as F_{k-1} and its output as F_k; the learning process of HFAB can then be described formally. To reduce the parameter overhead of the attention branch, the feature map is first reduced in dimension by a 3x3 convolution Conv_squeeze; the reduced feature is:

F_squeeze = LReLU(Conv_squeeze(F_{k-1}))

where LReLU(·) is the LeakyReLU nonlinear activation function. A rough edge map is then obtained through the edge detection layer:

F_edge = W_Laplacian * F_squeeze

The edge map is refined by the enhanced residual block and transformed back to the input space by the dimension-raising convolution Conv_expand. A batch normalization layer then brings the features into the non-saturated region of the Sigmoid function, which learns a weight between 0 and 1 for each pixel, yielding the attention map Attention_k:

Attention_k = Sigmoid(BN(Conv_expand(ERB(F_edge))))

Finally, the attention map is multiplied pixel by pixel with the input features to rescale them:

F_k = Attention_k ⊙ F_{k-1}
Batch normalization is a linear operation; in the HFAB designed by the invention, the batch normalization layer BN can be fused with the preceding 3x3 convolution at inference time, accelerating inference. Let the mean, variance and numerical-stability parameters of BN be μ, σ² and ε, the learned scale factor and offset be γ and β, and the weight and bias of the 3x3 convolution be W_3 and b_3, with input X. The fusion of the BN layer with the preceding convolution layer is described as:

BN(W_3 * X + b_3) = γ · (W_3 * X + b_3 − μ) / sqrt(σ² + ε) + β = W_fused * X + b_fused

where

W_fused = γ · W_3 / sqrt(σ² + ε),  b_fused = γ · (b_3 − μ) / sqrt(σ² + ε) + β
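The fusion can be verified numerically on a single output channel. The shapes and the BN statistics below are illustrative values (in practice μ and σ² are the running statistics accumulated during training).

```python
import numpy as np

rng = np.random.default_rng(0)

W3 = rng.normal(size=(1, 9))      # one 3x3 kernel, flattened
b3 = rng.normal(size=(1, 1))
X = rng.normal(size=(9, 100))     # 100 flattened 3x3 input patches

mu, var, eps = 0.3, 1.7, 1e-5     # BN running statistics (illustrative)
gamma, beta = 1.2, -0.4           # learned scale and offset

# Unfused: convolution followed by batch normalization.
y = W3 @ X + b3
y_bn = gamma * (y - mu) / np.sqrt(var + eps) + beta

# Fused: fold BN into the convolution weight and bias.
scale = gamma / np.sqrt(var + eps)
W_fused = scale * W3
b_fused = scale * (b3 - mu) + beta
y_fused = W_fused @ X + b_fused
```

Because both the convolution and BN are affine, the fused layer is mathematically identical to the pair, so the BN layer costs nothing at inference.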
After reparameterization merges these linear operations, the whole network contains only the following 6 efficient operators: 3x3 convolution, the LeakyReLU activation function, element-wise addition, element-wise multiplication, Sigmoid and sub-pixel convolution. The efficient operators and serial modules guarantee the model's running efficiency, and the high-frequency attention module that adaptively rescales features guarantees its reconstruction performance.
Step 4: model training. The batch size of each iteration is set to 16; the learning rate is initialized to 1x10^-5 and halved every 20 million iterations, for 100 million iterations in total. The distance between the reconstructed high-resolution image and the high-definition ground-truth image is computed with the L1 loss function and regularized, the gradients of the network parameters in each layer are derived, and the model is optimized with the Adam optimizer.
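The step-decay schedule described in Step 4 can be written as a one-line helper; the default constants mirror the figures stated above and are otherwise illustrative.

```python
def learning_rate(iteration, base_lr=1e-5, half_every=20_000_000):
    """Halve the base learning rate once per `half_every` iterations."""
    return base_lr * 0.5 ** (iteration // half_every)
```

For example, the rate stays at the base value through the first decay interval and is a quarter of it after two intervals.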
Step 5: model testing. After training, the network weights are saved and the test-set images are reconstructed. Tests show that, compared with the winning solution of the lightweight image super-resolution challenge AIM2020-ESR, RFDN (Residual Feature Distillation Network), the method reduces video memory occupation by 72% and improves inference speed by 38% in the inference stage.
Claims (4)
1. A lightweight image super-resolution method based on serial high-frequency attention, characterized in that an image super-resolution model based on high-frequency attention is built: a low-resolution image undergoes feature extraction, then high-frequency learning, and is then reconstructed into a high-resolution image; the high-frequency learning module comprises a serial ERB+HFAB structure, in which a high-frequency attention block HFAB follows each enhanced residual block ERB; the high-frequency attention block HFAB is composed of a dimension-reducing convolution, an edge-detection convolution, a dimension-raising convolution, a batch normalization layer and a Sigmoid layer, and strengthens the recovery of high-frequency edge information by the convolutional neural network by learning a weight between 0 and 1 for each pixel; the input feature map is first reduced in dimension, a rough edge map is then obtained through the edge-detection convolution and refined by the enhanced residual block, and the dimension-raising convolution transforms the result back to the input space; finally the batch normalization layer BN introduces global information of the image and brings the features into the non-saturated region of the Sigmoid function, which learns a weight between 0 and 1 for each pixel to obtain an attention map, and the attention map is multiplied pixel by pixel with the input feature map to rescale the features; during training of the image super-resolution model, the distance between the reconstructed high-resolution image and the high-definition ground-truth image is computed with an L1 loss function, the gradients of the network parameters in each layer are derived from this distance, and an Adam optimizer is used for supervised training.
2. The lightweight image super-resolution method based on serial high-frequency attention according to claim 1, characterized in that the meta-operators are analyzed, their inference times tested, and the most efficient operators selected to build the attention module, as follows:
1) convolution layer with kernel size 3x 3;
2) the activation function selects LeakyReLU;
3) avoiding the use of bound operators;
4) the upsampling module uses sub-pixel convolution.
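Sub-pixel convolution (item 4) upsamples by letting an ordinary convolution produce C·r² channels and then rearranging those channels into space. A minimal NumPy sketch of that rearrangement step (pixel shuffle), with illustrative variable names, under the assumption of a channel-first (C·r², H, W) layout:

```python
# Pixel-shuffle rearrangement used by sub-pixel convolution:
# (C*r*r, H, W) -> (C, H*r, W*r). NumPy stand-in for a framework op.
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) tensor into (C, H*r, W*r)."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)   # -> (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)
```

Each output 2x2 block (for r = 2) gathers one value from each of the r² input channels at the same spatial position, so the upsampling is learned by the preceding convolution rather than by interpolation.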
3. The lightweight image super-resolution method based on serial high-frequency attention according to claim 2, characterized in that the high-frequency attention module HFAB is constructed from six operators: 3x3 convolution, LeakyReLU, independent element-wise addition, element-wise multiplication, Sigmoid and sub-pixel convolution.
4. The lightweight image super-resolution method based on serial high-frequency attention according to claim 1, characterized in that the high-frequency attention block HFAB operates as follows:

let the input of the k-th HFAB be F_{k-1} and its output be F_k; the feature map is first reduced in dimension by a 3x3 convolution Conv_squeeze, giving the squeezed features

F_s = LReLU(Conv_squeeze(F_{k-1}))

where LReLU(·) is the LeakyReLU non-linear activation function; a rough edge map is then obtained through the edge-detection layer

F_e = W_Laplacian * F_s

where W_Laplacian is a convolution kernel built from the Laplacian operator, which guides the attention branch toward edge details; the edge map is refined by an enhanced residual block and transformed back to the input dimension by a 3x3 convolution Conv_excite, then the batch normalization layer scales the result into the non-saturated region of the Sigmoid function, which learns a weight between 0 and 1 for each pixel, yielding the attention map Attention_k:

Attention_k = Sigmoid(BN(Conv_excite(ERB(F_e))))

finally, the attention map is multiplied pixel by pixel with the input feature map to realize feature correction:

F_k = Attention_k × F_{k-1}

In the inference stage, the batch normalization layer BN in the HFAB is fused into the preceding 3x3 convolution; let the mean, variance and numerical-stability parameter of BN be μ, σ² and ε respectively, the learned scale factor and offset be γ and β, the weight and bias of the 3x3 convolution be W_3 and b_3, and the input be X; the parameter fusion of the convolution layer and the BN layer is then

BN(W_3 * X + b_3) = W_fused * X + b_fused, with
W_fused = (γ / √(σ² + ε)) · W_3
b_fused = (γ / √(σ² + ε)) · (b_3 − μ) + β
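The conv–BN fusion described in claim 4 can be checked numerically. The sketch below folds BN's scale and shift into a weight matrix and bias; for simplicity the "convolution" is a plain matrix product on one spatial position, but the same per-output-channel algebra applies to each element of a 3x3 kernel. Function and variable names are illustrative.

```python
# Fold an inference-time batch-normalization layer into the preceding
# convolution so that BN(W3 @ x + b3) == W_fused @ x + b_fused.
import numpy as np

def fuse_conv_bn(W3, b3, mu, sigma2, gamma, beta, eps=1e-5):
    """Return (W_fused, b_fused) for BN with running mean mu, running
    variance sigma2, scale gamma, shift beta and stability epsilon."""
    scale = gamma / np.sqrt(sigma2 + eps)   # per-output-channel scale
    W_fused = W3 * scale[:, None]           # fold scale into the weights
    b_fused = (b3 - mu) * scale + beta      # fold mean/shift into the bias
    return W_fused, b_fused
```

After fusion, inference skips the BN layer entirely, which is why the claim restricts BN to training-time use without adding runtime cost.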
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210466344.6A CN114897690A (en) | 2022-04-29 | 2022-04-29 | Lightweight image super-resolution method based on serial high-frequency attention |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114897690A true CN114897690A (en) | 2022-08-12 |
Family
ID=82719365
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114897690A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116029943A (en) * | 2023-03-28 | 2023-04-28 | 国科天成科技股份有限公司 | Infrared image super-resolution enhancement method based on deep learning |
CN117974660A (en) * | 2024-03-29 | 2024-05-03 | 深圳市华汉伟业科技有限公司 | Method and device for detecting defects in industrial product image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||