CN115578638A - Method for constructing multi-level feature interactive defogging network based on U-Net - Google Patents

Method for constructing multi-level feature interactive defogging network based on U-Net Download PDF

Info

Publication number
CN115578638A
Authority
CN
China
Prior art keywords
feature
layer
image
local information
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211340900.1A
Other languages
Chinese (zh)
Inventor
孙航
李勃辉
但志平
余梅
郑锐林
杨雯
方帅领
刘致远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Three Gorges University CTGU
Original Assignee
China Three Gorges University CTGU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Three Gorges University CTGU filed Critical China Three Gorges University CTGU
Priority to CN202211340900.1A priority Critical patent/CN115578638A/en
Publication of CN115578638A publication Critical patent/CN115578638A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

A method for constructing a multi-level feature interaction defogging network based on U-Net uses a U-shaped basic framework that contains a multi-level feature interaction module and a channel non-local information enhanced attention module. A foggy image is input into the U-shaped network and convolutionally downsampled to obtain the encoder features EB1, EB2 and EB3, which are then fused by the multi-level feature interaction module into the fusion features EF1, EF2 and EF3. After passing through the channel non-local information enhanced attention module, these fusion features are fused with the decoding stages DB1, DB2 and DB3 respectively, and upsampling then yields the final defogged image. On dense-fog images, non-uniform-fog images and remote sensing images with large scale changes, the proposed method restores defogged images with better color, brightness and detail information.

Description

Method for constructing multi-level feature interactive defogging network based on U-Net
Technical Field
The invention belongs to the field of image processing, and particularly relates to a method for constructing a multi-level feature interactive defogging network based on U-Net.
Background
Haze is a common atmospheric phenomenon caused by tiny particles such as smoke and dust suspended in the atmosphere, and it is a major factor degrading visual quality, for example the appearance and contrast of objects. Pictures taken in foggy weather often suffer from blurring, information loss and reduced contrast, and the severe loss of image information negatively affects many high-level vision tasks such as face recognition, image segmentation, object detection and object tracking. Over the past decade, image defogging has therefore received a great deal of attention in the vision community.
At present, image defogging algorithms fall mainly into two categories: defogging methods based on parameter estimation and end-to-end defogging methods. Parameter-estimation methods rely on the atmospheric scattering model and achieve defogging by estimating parameters such as the global atmospheric light and the transmission map. Although these prior-based methods have made remarkable progress, under unconstrained conditions defogging methods based on intermediate parameter estimation easily produce large errors, leading to substantial image degradation such as artifacts and color distortion. With the development of deep learning, end-to-end defogging has become mainstream in recent years, and researchers have proposed many end-to-end methods that use a convolutional neural network to directly learn the mapping between a foggy image and a fog-free image without estimating any intermediate parameters. "Enhanced Pix2Pix Dehazing Network", published by Qu et al., proposes an enhanced Pix2Pix defogging algorithm built on a U-shaped network, constructing a multi-resolution generator and a multi-scale discriminator and adding an enhancement module at the end of the generator to improve the recovery of image texture and color. Dong et al. published "Multi-Scale Boosted Dehazing Network with Dense Feature Fusion" and proposed the MSBDN defogging network, which combines a U-shaped architecture with dense feature fusion to densely skip-connect the encoding and decoding layers and achieves excellent defogging performance. Wu et al., in "Contrastive Learning for Compact Single Image Dehazing", propose the defogging network AECR-Net based on the idea of contrastive learning; built on a U-shaped framework, the algorithm pulls the network's defogged image closer to the positive-sample GT image in representation space while pushing it away from the input foggy image, further improving the defogging effect. "FFA-Net: Feature Fusion Attention Network for Single Image Dehazing", published by Qin et al., proposes the feature fusion attention network FFA-Net, which uses channel attention and pixel attention to assign weights along the spatial and channel dimensions of the feature maps and obtains good defogging performance.
Although deep learning-based end-to-end defogging methods achieve excellent defogging performance, the following problems remain when U-type and non-U-type networks are used for image defogging.
(1) Most defogging algorithms adopt a U-shaped network structure in which each decoding layer is directly fused with the encoding layer of the corresponding scale. This not only neglects the effective use of information from the different encoding layers but also suffers from feature-information dilution, so the edge details and the overall scene (color, brightness and so on) of the defogged image are unsatisfactory.
(2) The two-layer fully connected dimensionality-reduction operation in channel attention can negatively affect the prediction of feature-channel weights, thereby reducing the performance of the defogging network.
Disclosure of Invention
The invention aims to solve the following technical problems in the prior art: when a U-type network is used for image defogging, each decoding layer is directly fused with the encoding layer of the corresponding scale, so the information of encoding layers at different levels is not used effectively, and because downsampling in the U-shaped network destroys the spatial detail information of the image, the upsampling process suffers from feature dilution. In addition, the invention addresses the problem that the two-layer fully connected dimensionality-reduction operation in SE channel attention negatively affects feature-channel weight prediction and reduces defogging performance. To this end, an image defogging method with multi-level feature interaction and efficient channel non-local information enhanced attention is provided.
A method for constructing a multi-level feature interactive defogging network based on U-Net comprises the following steps:
step S1: constructing a U-shaped image defogging network, wherein the network comprises a coding layer feature extraction module, a feature restoration module, a decoding layer image restoration module, a single feature-channel non-local information enhancement attention module SF-NEA and a multi-feature-channel non-local information enhancement attention module MF-NEA;
step S2: constructing a channel non-local information enhancement attention module NEA, adding the channel non-local information enhancement attention module NEA into a U-type network, and enhancing the performance of a defogging network, wherein the module comprises two sub-modules, namely a single-feature-channel non-local information enhancement attention module SF-NEA and a multi-feature-channel non-local information enhancement attention module MF-NEA;
and step S3: sending the foggy image into a U-shaped image defogging network, outputting a clear fogless image through a multi-level feature fusion module and a channel non-local information enhancement attention module, and calculating loss by using the output clear image to constrain network training;
and constructing the multi-level characteristic interactive defogging network based on the U-Net through the steps.
In step S1, the U-shaped image defogging network is constructed as follows:
the first layer EB1 of the coding layer feature extraction module → InstanceNorm layer IN1 → the second layer EB2 of the coding layer feature extraction module → InstanceNorm layer IN2 → the third layer EB3 of the coding layer feature extraction module → InstanceNorm layer IN3;
InstanceNorm layer IN1, instanceNorm layer IN2, instanceNorm layer IN3 → third multi-level feature interaction module MFS3 → fusion feature EF3 → third multi-feature channel non-local information enhancement attention MF-NEA3;
the third multi-feature channel non-local information enhances attention MF-NEA3, decode recovery layer DB3 → fusion operation → first deconvolution layer → InstanceNorm layer IN4 → SF-NEA module → decode recovery layer DB2;
InstanceNorm layer IN2, instanceNorm layer IN1, instanceNorm layer IN3 → second multi-level feature interaction module MFS2 → fusion feature EF2 → second multi-feature channel non-local information enhancement attention MF-NEA2;
the second multi-feature channel non-local information enhances attention MF-NEA2, decode recovery layer DB2 → fusion operation → second deconvolution layer → InstanceNorm layer IN5 → SF-NEA module → decode recovery layer DB1;
the method comprises the following steps of preparing an InstanceNorm layer IN3, an InstanceNorm layer IN2, an InstanceNorm layer IN1 → a first multi-level feature interaction module MFS1 → an obtained fusion feature EF1 → a first multi-feature channel non-local information enhancement attention MF-NEA1;
the first multi-feature channel non-local information enhances attention MF-NEA1, decoding recovery layer DB1 → fusion operation → third deconvolution layer → fogless image.
The specific operation of MFS3 is: a 1 × 1 convolution is applied to InstanceNorm layer IN3, a 3 × 3 convolution to InstanceNorm layer IN2 and a 3 × 3 convolution to InstanceNorm layer IN1, after which the results are fused;
the specific operation of MFS2 is: a 1 × 1 convolution is applied to InstanceNorm layer IN2, a 3 × 3 convolution to InstanceNorm layer IN1 and a 3 × 3 deconvolution to InstanceNorm layer IN3, after which the results are fused;
the specific operation of MFS1 is: a 1 × 1 convolution is applied to InstanceNorm layer IN1, a 3 × 3 deconvolution to InstanceNorm layer IN2 and a 3 × 3 deconvolution to InstanceNorm layer IN3, after which the results are fused.
The single feature-channel non-local information enhanced attention module SF-NEA in the channel non-local information enhanced attention module in step S2 is structured as follows:
input feature F → subjected to global average pooling GAP operation → channel descriptor vector S → 1D convolution operation → vector S containing local information lc
Channel descriptor vector S → transpose operation gets transposed vector S of channel descriptor T → channel descriptor vector S, transposed vector S of channel descriptors T Dot product → vector S containing non-local information gc
Vector S containing non-local information gc Vector S containing local information lc → fusion operation → 1D convolution operation → feature assignment weight W;
assigning weight W to the feature, inputting feature map F → multiplication operation pixel by pixel → feature map F
The multi-feature-channel non-local information enhancement attention module MF-NEA in the channel non-local information enhancement attention module in step S2 is structured as follows:
multi-level features EC1, EC2, EC3 → fusion operation → fused feature EF → global average pooling (GAP) operation → three 1D convolution operations → vectors containing local information, one per level;
fused feature channel descriptor S_EF → transpose operation → transposed fused feature channel descriptor S_EF^T → dot-product operation between S_EF and S_EF^T → vector containing non-local information;
the vectors containing local information are added to the vector containing non-local information → three 1D convolution operations → refined vectors containing local information → Concat operation → Softmax activation function → weights W1, W2, W3 → W1 is multiplied with input feature EC1, W2 with input feature EC2 and W3 with input feature EC3 → fusion operation → fused feature F.
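The following sketch mirrors the MF-NEA flow above under the same caveats: it fuses the three level features, derives local-information vectors with 1D convolutions and a non-local term from the S_EF·S_EF^T autocorrelation, turns them into softmax weights W1-W3, and re-fuses the weighted inputs. The kernel size and the reduction of the autocorrelation matrix to a vector are illustrative assumptions rather than the patent's exact design.

import torch
import torch.nn as nn


class MFNEA(nn.Module):
    # Sketch of the MF-NEA "aggregation-dispersion-aggregation" flow described above.
    def __init__(self, channels: int, levels: int = 3, k: int = 3):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.local_convs = nn.ModuleList(
            [nn.Conv1d(1, 1, k, padding=k // 2, bias=False) for _ in range(levels)]
        )
        self.refine_convs = nn.ModuleList(
            [nn.Conv1d(1, 1, k, padding=k // 2, bias=False) for _ in range(levels)]
        )

    def forward(self, ec1, ec2, ec3):
        feats = [ec1, ec2, ec3]
        ef = feats[0] + feats[1] + feats[2]                       # first aggregation
        b, c = ef.shape[:2]
        s_ef = self.gap(ef).view(b, c)                            # fused descriptor S_EF
        gram = torch.bmm(s_ef.unsqueeze(2), s_ef.unsqueeze(1))    # S_EF S_EF^T
        s_nl = torch.bmm(gram, s_ef.unsqueeze(2)).squeeze(2)      # non-local vector (assumed reduction)
        branch = []
        for conv_l, conv_r in zip(self.local_convs, self.refine_convs):
            s_lc = conv_l(s_ef.unsqueeze(1)).squeeze(1)           # per-level local-information vector
            branch.append(conv_r((s_lc + s_nl).unsqueeze(1)).squeeze(1))
        w = torch.softmax(torch.stack(branch, dim=1), dim=1)      # dispersion: weights W1..W3
        out = sum(w[:, i].view(b, c, 1, 1) * feats[i] for i in range(len(feats)))
        return out                                                # second aggregation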
In step S4, the image defogging network with multi-level feature interaction and efficient channel non-local information enhanced attention, the multi-level feature fusion module and the channel non-local information enhancement attention module are used with the following steps:
step 1), inputting a foggy image into a U-shaped image defogging network;
step 2) inputting the feature information extracted by the U-shaped network into a multi-level feature fusion module to obtain fusion features;
step 3) fusing the fusion features with corresponding coding layers respectively to obtain features with more detailed texture semantics, and enhancing the defogging network performance through a channel non-local information enhancement attention module to obtain a final output clear fog-free image;
step 4) using four losses to constrain the network training process.
In step 4), the four loss constraints are as follows:
(1) The L1 loss, with the specific formula:
L_l1 = (1/N) Σ_{i=1}^{N} |G(x_i) − y_i|
where x_i and y_i denote the values of the foggy image and the GT image at pixel i, G(·) denotes the defogging network, G(x_i) is the value obtained by passing the input-image pixel i through the defogging network, and N is the number of pixels in the image.
(2) The perceptual loss, computed with a VGG16 model pre-trained on ImageNet, with the specific formula:
L_perc = Σ_{i=1}^{N} (1 / (C_i H_i W_i)) ‖φ_i(G(x)) − φ_i(y)‖²
where x and y denote the foggy image and the GT image respectively, i indexes the i-th feature layer, and C_i, H_i and W_i are the number of channels, the length and the width of the i-th layer feature map; φ_i(·) denotes the i-th layer feature map, of length H_i, width W_i and C_i channels, extracted by the VGG16 pre-trained model. ‖·‖ denotes the L2 norm, and N is the number of VGG16 feature layers used in the perceptual loss.
(3) The multi-scale structural similarity (MS-SSIM) loss, with the specific formula:
L_ms-ssim = 1 − Π_{m=1}^{M} [ (2 μ_x μ_y + C_1) / (μ_x² + μ_y² + C_1) ]^{β_m} · [ (2 σ_xy + C_2) / (σ_x² + σ_y² + C_2) ]^{γ_m}
where x denotes the generated image and y the clear image; μ_x and μ_y are the means of the generated image and the GT image respectively, σ_x and σ_y are their standard deviations, and σ_xy is the covariance of the generated image and the clear image; β_m and γ_m are two relative-importance exponents, C_1 and C_2 are constant terms, and M is the total number of scales.
(4) The adversarial loss, with the specific formula:
L_adv = −(1/N) Σ_{n=1}^{N} log D(y_n)
where y_n is the n-th defogged image, D(y) denotes the probability that the defogged image y is judged to be a clear image, and N is the total number of images.
The loss function of the overall network is expressed as:
L_loss = λ_1 L_adv + λ_2 L_l1 + λ_3 L_perc + λ_4 L_ms-ssim
where λ_1, λ_2, λ_3 and λ_4 are the hyperparameters weighting each loss term.
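A minimal PyTorch sketch of how these four loss terms could be combined is given below. It is illustrative only: the VGG16 layers chosen for the perceptual loss (relu1_2, relu2_2, relu3_3) are an assumption, the MS-SSIM term relies on the third-party pytorch_msssim package as a stand-in, and the default weights follow the values λ_1 = 0.5, λ_2 = λ_3 = λ_4 = 1 given later in the embodiment.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16
from pytorch_msssim import ms_ssim  # third-party MS-SSIM implementation, used here as a stand-in


class PerceptualLoss(nn.Module):
    # VGG16 feature loss as in L_perc; the layer choice (relu1_2, relu2_2, relu3_3)
    # is an assumption, the text only specifies a VGG16 model pre-trained on ImageNet.
    def __init__(self):
        super().__init__()
        vgg = vgg16(pretrained=True).features.eval()
        for p in vgg.parameters():
            p.requires_grad = False
        self.slices = nn.ModuleList([vgg[:4], vgg[4:9], vgg[9:16]])

    def forward(self, pred, target):
        loss, x, y = 0.0, pred, target
        for block in self.slices:
            x, y = block(x), block(y)
            loss = loss + F.mse_loss(x, y)  # mean of ||phi_i(G(x)) - phi_i(y)||^2 over C_i*H_i*W_i
        return loss / len(self.slices)


def total_loss(pred, gt, d_pred, perceptual, lambdas=(0.5, 1.0, 1.0, 1.0)):
    # L_loss = l1*L_adv + l2*L_l1 + l3*L_perc + l4*L_ms-ssim;
    # d_pred is the discriminator output D(defogged image) in (0, 1).
    l_adv = -(torch.log(d_pred + 1e-8)).mean()
    l_l1 = F.l1_loss(pred, gt)
    l_perc = perceptual(pred, gt)
    l_msssim = 1.0 - ms_ssim(pred, gt, data_range=1.0)  # inputs assumed scaled to [0, 1]
    l1, l2, l3, l4 = lambdas
    return l1 * l_adv + l2 * l_l1 + l3 * l_perc + l4 * l_msssim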
Compared with the prior art, the invention has the following technical effects:
1) The invention provides a defogging network with multi-level feature interaction and efficient channel non-local information enhanced attention, which achieves high-quality image defogging through effective information interaction between encoding and decoding layers and channel non-local information enhanced attention, and obtains the best performance on several natural-image and remote sensing datasets such as RESIDE, Dense-Haze, NH-Haze and SateHaze1k;
2) The multi-level feature interaction module provided by the invention fuses shallow layer and deep layer information of the encoding stage in each layer of features of the decoding stage, so that the dilution of feature information is reduced, and the recovery capability of a defogging network on details, semantics and scene information can be effectively improved;
3) The efficient channel non-local information enhanced attention mechanism provided by the invention learns the channel weight distribution effectively by using 1D convolution together with non-local information fusion; this reduces the number of learnable parameters while strengthening the channel features that are important for defogging and suppressing unimportant ones, thereby improving the network's defogging performance.
Drawings
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
FIG. 1 is a diagram of an overall network architecture according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the single feature-channel non-local information enhancement attention module SF-NEA of FIG. 1;
fig. 3 is a schematic structural diagram of the multi-feature-channel non-local information enhancement attention module MF-NEA of fig. 1.
Detailed Description
As shown in fig. 1 to 3, a method for constructing an image defogging network with multi-level feature interaction and high-efficiency channel non-local information attention enhancement includes the following steps:
s1, constructing a U-shaped image defogging network, wherein the network comprises: the device comprises a coding layer feature extraction module, a feature restoration module and a decoding layer image restoration module.
And S2, constructing a multi-level feature fusion module, and fusing features of different levels by using the features extracted by the coding layer feature extraction module in the S1.
And S3, constructing a channel non-local information enhancement attention module, wherein the module comprises two sub-modules, namely a single feature-channel non-local information enhancement attention module (SF-NEA) and a multi-feature-channel non-local information enhancement attention module (MF-NEA).
And S4, sending the foggy image into a U-shaped image defogging network, outputting a clear fogless image through a multi-level feature fusion module and a channel non-local information enhancement attention module, and finally calculating loss by using the output clear image to constrain the training of the network.
The step S1 specifically includes:
as shown in fig. 1, the coding layer feature extraction module performs feature extraction by performing 4-fold down-sampling using 3 convolution operations. Each layer of the encoding stage consists of a convolutional layer, an active layer, and an instant-Norm. Secondly, feature extraction is further enhanced by combining 6 continuous residual blocks on the extracted low-resolution features. And finally, decoding and reconstructing the features containing a large amount of semantic scene information extracted from the continuous residual block by adopting deconvolution and convolution operations, and recovering to the resolution of the original image.
In step S2, a multi-level feature fusion module is constructed to fuse features of different levels using the features extracted by the coding layer feature extraction module in S1. As shown in the boxed region of fig. 1, in the multi-level feature interaction module, taking the EB2 coding layer in fig. 1 as an example, the EB1 layer is downsampled to the same resolution as the EB2 layer with a 3 × 3 convolution kernel, the EB3 layer is restored to the same resolution as the EB2 layer with a 3 × 3 deconvolution layer, and finally the features containing different information are fused to obtain the feature EF2. The multi-level feature interaction module provided by the invention contains 3 such multi-level interaction processes, expressed by the following formulas:
EF1 = Conv(EB1) + TConv(EB2) + TConv(EB3)
EF2 = Conv(EB1) + Conv(EB2) + TConv(EB3)
EF3 = Conv(EB1) + Conv(EB2) + Conv(EB3)
Here EBi denotes the encoding feature of the i-th layer, EFi the fusion feature of the i-th layer, and DBi the decoding-layer feature of the i-th layer; since the multi-level feature interaction module performs 3 multi-level interaction processes, i ∈ {1, 2, 3}, and Conv and TConv denote convolution and deconvolution respectively. In this way, the shallow detail textures and deep semantic features of the coding layers are fully fused into each decoding layer, which effectively alleviates the feature-dilution problem introduced by upsampling in the decoding stage and ultimately improves the defogging network's ability to recover detail, semantic and scene information.
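As a concrete illustration of one interaction process, the sketch below implements the EF2 branch of the formulas above: EB1 is downsampled with a stride-2 3×3 convolution, EB2 is adjusted with a 1×1 convolution, EB3 is upsampled with a 3×3 deconvolution, and the three results are summed. The channel widths and any kernel/stride choices beyond what the text states are assumptions.

import torch.nn as nn


class MFS(nn.Module):
    # Sketch of EF2 = Conv(EB1) + Conv(EB2) + TConv(EB3); widths are assumed.
    def __init__(self, c1, c2, c3, cout):
        super().__init__()
        self.down1 = nn.Conv2d(c1, cout, 3, stride=2, padding=1)           # EB1 -> 1/2 resolution
        self.keep2 = nn.Conv2d(c2, cout, 1)                                # EB2 kept at its resolution
        self.up3 = nn.ConvTranspose2d(c3, cout, 3, stride=2, padding=1,
                                      output_padding=1)                    # EB3 -> 2x resolution

    def forward(self, eb1, eb2, eb3):
        return self.down1(eb1) + self.keep2(eb2) + self.up3(eb3)           # fusion feature EF2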
Step S3 specifically includes: constructing a channel non-local information enhancement attention module, which comprises two sub-modules, a single feature-channel non-local information enhancement attention module (SF-NEA) and a multi-feature-channel non-local information enhancement attention module (MF-NEA).
As shown in fig. 2 and fig. 3, the SF-NEA module first applies global average pooling and a one-dimensional convolution to the features to capture local channel dependencies. Next, non-local information is obtained through an autocorrelation operation and fused into the vector representing the local channel dependencies, compensating for the missing non-local information. Finally, a one-dimensional convolution is used once more to learn the feature-channel weights. By introducing non-local channel information and two rounds of 1D convolution, the SF-NEA module can mine more effective feature-channel information for assigning feature-map weights. The overall MF-NEA process is divided into three stages: aggregation, dispersion and aggregation. The first aggregation fuses features of different levels to obtain a fused feature with more complete texture, detail and semantic information. The dispersion stage adaptively learns the weights of the features at different levels from the fused feature statistics and distributes the learned weights to the features of the corresponding levels. The second aggregation fuses the same channel features of the different hierarchical features according to these weights for reconstructing the defogged image.
Step S4 specifically includes:
inputting the foggy image into the U-shaped image defogging network, outputting a clear fog-free image through the multi-level feature fusion module and the channel non-local information enhancement attention module, and finally calculating the losses with the output clear image; four losses are used to constrain the network training process, as follows:
the first is L1 loss, and the specific formula is as follows:
Figure BDA0003916240670000061
x i and y i Respectively, the values of the hazy image and the GT image at pixel i, G () represents the defogging network parameter, G (x) i ) Representing the value of the pixel at the input image i, which is then computed by the defogging network parameters. N represents the number of pixels in the image.
The second is the perceptual loss, computed with a VGG16 model pre-trained on ImageNet, with the specific formula:
L_perc = Σ_{i=1}^{N} (1 / (C_i H_i W_i)) ‖φ_i(G(x)) − φ_i(y)‖²
where x and y denote the foggy image and the GT image respectively, i indexes the i-th feature layer, and C_i, H_i and W_i are the number of channels, the length and the width of the i-th layer feature map; φ_i(·) denotes the i-th layer feature map, of length H_i, width W_i and C_i channels, extracted by the VGG16 pre-trained model. ‖·‖ denotes the L2 norm, and N is the number of VGG16 feature layers used in the perceptual loss.
The third is the multi-scale structural similarity (MS-SSIM) loss, with the specific formula:
L_ms-ssim = 1 − Π_{m=1}^{M} [ (2 μ_x μ_y + C_1) / (μ_x² + μ_y² + C_1) ]^{β_m} · [ (2 σ_xy + C_2) / (σ_x² + σ_y² + C_2) ]^{γ_m}
where x denotes the generated image and y the clear image; μ_x and μ_y are the means of the generated image and the GT image respectively, σ_x and σ_y are their standard deviations, and σ_xy is the covariance of the generated image and the clear image; β_m and γ_m are two relative-importance exponents, C_1 and C_2 are constant terms, and M is the total number of scales.
The fourth is the adversarial loss, with the specific formula:
L_adv = −(1/N) Σ_{n=1}^{N} log D(y_n)
where y_n is the n-th defogged image, D(y) denotes the probability that the defogged image y is judged to be a clear image, and N is the total number of images.
The loss function of the overall network is expressed as:
L_loss = λ_1 L_adv + λ_2 L_l1 + λ_3 L_perc + λ_4 L_ms-ssim
where λ_1, λ_2, λ_3 and λ_4 are the hyperparameters weighting each loss term; in this embodiment λ_1 = 0.5, λ_2 = 1, λ_3 = 1 and λ_4 = 1.
Examples
1. Parameter setting
The code of the invention is implemented with the PyTorch framework, and the experiments are carried out on an NVIDIA RTX 3090Ti GPU. An Adam optimizer is used to optimize the network, the learning rate and batch size are set to 0.001 and 8 respectively, and the momentum decay exponents are β1 = 0.9 and β2 = 0.999. The initial learning rate is set to 0.001 and adjusted with a cosine annealing strategy, with the half period of the cosine function set to 5. In addition, the invention evaluates the various defogging algorithms on the synthetic dataset RESIDE, on the NTIRE defogging challenge real-world datasets DenseHaze and NHHaze21, and on the SateHaze1k remote sensing dataset. In the RESIDE dataset, we use the outdoor training set OTS to train the network and SOTS-outdoor as the test set; OTS contains 8970 clear pictures and 313950 foggy pictures, and SOTS contains 500 indoor and 500 outdoor test images. DenseHaze contains 45 dense-fog image pairs, comprising 35 training pairs, 5 validation pairs and 5 test pairs. NHHaze2021 contains 25 non-uniformly hazy images; since the GT images of its validation and test sets have not been published, we choose the first 20 as the training set and the remaining 5 as the test set for evaluation. The public remote sensing dataset SateHaze1k comprises three sub-datasets representing different fog concentrations: Thin denotes the light-fog sub-dataset, Moderate the medium-fog sub-dataset and Thick the dense-fog sub-dataset. Each sub-training set contains 320 images, each validation set 35 images and each test set 45 images. To verify the correctness and effectiveness of the method, currently strong defogging algorithms are compared with the proposed method, namely: DCP, AOD-Net, GCA-Net, EPDN, GridDehaze-Net, MSBDN, FFA, AECR and TBN.
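The optimizer and schedule described above can be set up as in the following minimal sketch; the toy model and data are placeholders so the snippet runs on its own, and only the hyperparameters (Adam with learning rate 0.001, β1 = 0.9, β2 = 0.999, batch size 8, cosine annealing with half period 5) come from this paragraph.

import torch
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

# Stand-ins so the snippet runs on its own; in practice `model` would be the
# defogging network and the loader would yield (hazy, GT) image pairs.
model = nn.Conv2d(3, 3, 3, padding=1)
train_loader = [(torch.rand(8, 3, 64, 64), torch.rand(8, 3, 64, 64))]  # batch size 8

optimizer = Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))  # settings from the paragraph above
scheduler = CosineAnnealingLR(optimizer, T_max=5)                  # cosine annealing, half period 5

for epoch in range(2):                                  # epoch count is illustrative
    for hazy, gt in train_loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.l1_loss(model(hazy), gt)  # full training would use L_loss above
        loss.backward()
        optimizer.step()
    scheduler.step()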
2. Results of the experiment
TABLE 1. Comparison with SOTA methods on the SOTS-outdoor dataset and real-scene datasets.
[Table content is provided as an image in the original publication and is not reproduced here.]
Our method achieves the best PSNR and the second-best SSIM on the SOTS (outdoor) dataset. As shown in column 2 of Table 1, end-to-end defogging algorithms are generally superior to parameter-estimation-based algorithms. Among the learning-based methods, compared with the U-shaped-structure methods EPDN, MSBDN and AECR, the PSNR of our algorithm improves by 11.31 dB, 1.72 dB and 2.98 dB respectively, and the SSIM improves by 0.118, 0.005 and 0.013. Those algorithms only fuse the corresponding encoding and decoding features or densely connect the encoding and decoding layers for feature fusion, neglecting the utilization of all layers in the encoding stage; our method fuses the encoding features of different layers and uses the proposed attention to mine the channels that matter most for defogging, thereby improving defogging performance and producing higher-quality defogged images. In addition, the proposed defogging algorithm reaches 18.34 dB PSNR / 0.609 SSIM on the dense-fog dataset (Dense-Haze) and 23.47 dB PSNR / 0.873 SSIM on the non-uniform-fog dataset (NH-Haze21), outperforming all compared defogging methods and achieving excellent performance.
3. Ablation analysis
To assess the effectiveness of the various modules of the invention, ablation experiments were designed around the framework innovation and the attention innovation, comprising 7 configurations in total: (1) Base, the U-shaped basic framework, mainly composed of two downsampling layers, six residual blocks and two upsampling layers, with the coding and decoding layers directly connected by skip connections; (2) Base+1MFS, the basic framework plus one multi-scale feature fusion skip connection; (3) Base+2MFS, the basic framework plus two multi-scale feature fusion skip connections; (4) Base+3MFS, the basic framework plus three multi-scale feature fusion skip connections; (5) Base+3MFS+CA, the basic framework plus three multi-scale feature fusion skip connections plus CA+PA; (6) Base+3MFS+ECA, the basic framework plus three multi-scale feature fusion skip connections plus ECA+PA; (7) Base+3MFS+NEA, the basic framework plus three multi-scale feature fusion skip connections plus NEA+PA.
TABLE 2. PSNR and SSIM results on the SOTS outdoor dataset.
[Table content is provided as an image in the original publication and is not reproduced here.]
Ablation experiments were performed on the SOTS outdoor dataset, comparing the seven configurations described above; the PSNR and SSIM results are shown in Table 2. The basic framework alone achieves a PSNR of 29.55 dB and an SSIM of 0.963. Adding one multi-scale feature fusion skip connection to the Base reference improves PSNR by 2.03 dB and SSIM by 0.005. PSNR rises further as more multi-scale feature fusion skip connections are added: with three such connections, PSNR improves by 2.66 dB over Base and SSIM by 0.007. These experiments verify the effectiveness of the multi-scale feature fusion skip connections. Next, to verify the performance of our attention module, we compare CA, ECA and the proposed attention model on top of the three multi-scale feature fusion skip connections. As shown in Table 2, adding CA on top of the three feature-fusion skip connections improves PSNR by 1.3 dB and SSIM by 0.008, and using ECA (efficient channel attention), which avoids channel dimensionality reduction, improves performance slightly further. Finally, the efficient channel non-local information enhanced attention model designed here improves PSNR by 1.67 dB and SSIM by 0.011 compared with no attention, and by 0.37 dB PSNR and 0.003 SSIM compared with the CA module.
The invention provides a single-image defogging algorithm based on multi-level feature interaction and efficient channel non-local information enhanced attention. The multi-level feature fusion module makes full use of information from the different levels of the coding layer, and the efficient channel non-local information enhanced attention module mines more effective channel features under the guidance of the newly introduced non-local information. The invention verifies the effectiveness of multi-scale feature fusion, effectively improves the defogging performance of the network, and recovers clearer, higher-quality images.

Claims (7)

1. A method for constructing a multi-level feature interactive defogging network based on U-Net comprises the following steps:
step S1: constructing a U-shaped image defogging network;
step S2: constructing a channel non-local information enhancement attention module NEA and adding the NEA into a U-type network;
and step S3: sending the foggy image into a U-shaped image defogging network, outputting a clear fogless image through a multi-level feature fusion module and a channel non-local information enhancement attention module, and calculating loss by using the output clear image to constrain network training;
and constructing a multi-level characteristic interactive defogging network based on the U-Net through the steps.
2. The method according to claim 1, wherein in step S1, a U-shaped image defogging network is constructed as follows:
the first layer EB1 of the coding layer feature extraction module → InstanceNorm layer IN1 → the second layer EB2 of the coding layer feature extraction module → InstanceNorm layer IN2 → the third layer EB3 of the coding layer feature extraction module → InstanceNorm layer IN3;
InstanceNorm layer IN1, instanceNorm layer IN2, instanceNorm layer IN3 → third multi-level feature interaction module MFS3 → fusion feature EF3 → third multi-feature channel non-local information enhancement attention MF-NEA3;
the third multi-feature channel non-local information enhances attention MF-NEA3, decoding recovery layer DB3 → merge operation → first deconvolution layer → InstanceNorm layer IN4 → SF-NEA module → decoding recovery layer DB2;
InstanceNorm layer IN2, instanceNorm layer IN1, instanceNorm layer IN3 → second multi-level feature interaction module MFS2 → fusion feature EF2 → second multi-feature channel non-local information enhancement attention MF-NEA2;
the second multi-feature channel non-local information enhances attention MF-NEA2, decode recovery layer DB2 → fusion operation → second deconvolution layer → InstanceNorm layer IN5 → SF-NEA module → decode recovery layer DB1;
InstanceNorm layer IN3, instanceNorm layer IN2, instanceNorm layer IN1 → first multi-level feature interaction module MFS1 → fusion feature EF1 → first multi-feature channel non-local information enhancement attention MF-NEA1;
the first multi-feature channel non-local information enhances attention MF-NEA1, decoding recovery layer DB1 → fusion operation → third deconvolution layer → fogless image.
3. The method of claim 1, wherein the specific operation of the third multi-level feature interaction module MFS3 is: a 1 × 1 convolution is applied to InstanceNorm layer IN3, a 3 × 3 convolution to InstanceNorm layer IN2 and a 3 × 3 convolution to InstanceNorm layer IN1, after which the results are fused;
the specific operation of the second multi-level feature interaction module MFS2 is: a 1 × 1 convolution is applied to InstanceNorm layer IN2, a 3 × 3 convolution to InstanceNorm layer IN1 and a 3 × 3 deconvolution to InstanceNorm layer IN3, after which the results are fused;
the specific operation of the first multi-level feature interaction module MFS1 is: a 1 × 1 convolution is applied to InstanceNorm layer IN1, a 3 × 3 deconvolution to InstanceNorm layer IN2 and a 3 × 3 deconvolution to InstanceNorm layer IN3, after which the results are fused.
4. The method according to claim 1, wherein the single feature-channel non-local information enhanced attention module SF-NEA in the channel non-local information enhanced attention module in step S2 is structured as follows, as shown in fig. 2:
input feature F → global average pooling (GAP) operation → channel descriptor vector S → 1D convolution operation → vector S_lc containing local information;
channel descriptor vector S → transpose operation → transposed channel descriptor vector S^T → dot-product operation between S and S^T → vector S_gc containing non-local information;
vector S_gc containing non-local information and vector S_lc containing local information → fusion operation → 1D convolution operation → feature weight W;
feature weight W and input feature map F → pixel-by-pixel multiplication operation → weighted feature map.
5. The method according to claim 1, characterized in that the multi-feature-channel non-local information enhancement attention module MF-NEA in the channel non-local information enhancement attention module in step S2 is structured as follows:
multi-level features EC1, EC2, EC3 → fusion operation → fused feature EF → global average pooling (GAP) operation → three 1D convolution operations → vectors containing local information, one per level;
fused feature channel descriptor S_EF → transpose operation → transposed fused feature channel descriptor S_EF^T → dot-product operation between S_EF and S_EF^T → vector containing non-local information;
the vectors containing local information are added to the vector containing non-local information → three 1D convolution operations → refined vectors containing local information → Concat operation → Softmax activation function → weights W1, W2, W3 → W1 is multiplied with input feature EC1, W2 with input feature EC2 and W3 with input feature EC3 → fusion operation → fused feature F.
6. The method according to claim 1, wherein in step S1, the U-shaped image defogging network comprises: the device comprises a coding layer feature extraction module, a feature restoration module, a decoding layer image restoration module, a single feature-channel non-local information enhancement attention module SF-NEA and a multi-feature-channel non-local information enhancement attention module MF-NEA; in step S2, the channel non-local information enhancement attention module NEA comprises two sub-modules, a single feature-channel non-local information enhancement attention module SF-NEA and a multi-feature-channel non-local information enhancement attention module MF-NEA.
7. The method according to claim 1, wherein in step S4, the multi-level feature interaction and efficient channel non-local information enhanced attention image defogging network, the multi-level feature fusion module and the channel non-local information enhancement attention module are used with the following steps:
step 1) inputting the foggy image into a U-shaped image defogging network;
step 2) inputting the feature information extracted by the U-shaped network into a multi-level feature fusion module to obtain fusion features;
step 3) fusing the fusion features with the corresponding coding layers respectively to obtain features with more detail texture semantics, and improving the defogging network performance through a channel non-local information enhancement attention module to obtain a final output clear fog-free image;
step 4) using four losses to constrain the network training process;
in step 4), the four loss constraints are as follows:
(1) The L1 loss, with the specific formula:
L_l1 = (1/N) Σ_{i=1}^{N} |G(x_i) − y_i|
where x_i and y_i respectively denote the values of the foggy image and the GT image at pixel i, G(x_i) denotes the value obtained by passing the input-image pixel i through the defogging network, and N denotes the number of pixels in the image;
(2) The perceptual loss, computed with a VGG16 model pre-trained on ImageNet, with the specific formula:
L_perc = Σ_{i=1}^{N} (1 / (C_i H_i W_i)) ‖φ_i(G(x)) − φ_i(y)‖²
where x and y respectively denote the foggy image and the GT image, i denotes the i-th feature layer, and C_i, H_i and W_i denote the number of channels, the length and the width of the i-th layer feature map; φ_i(·) denotes the i-th layer feature map, of length H_i, width W_i and C_i channels, extracted by the VGG16 pre-trained model; ‖·‖ denotes the L2 norm, and N denotes the number of VGG16 feature layers used in the perceptual loss;
(3) The multi-scale structural similarity loss, with the specific formula:
L_ms-ssim = 1 − Π_{m=1}^{M} [ (2 μ_x μ_y + C_1) / (μ_x² + μ_y² + C_1) ]^{β_m} · [ (2 σ_xy + C_2) / (σ_x² + σ_y² + C_2) ]^{γ_m}
where x denotes the generated image and y the clear image; μ_x and μ_y respectively denote the means of the generated image and the GT image, σ_x and σ_y respectively denote their standard deviations, and σ_xy denotes the covariance of the generated image and the clear image; β_m and γ_m denote two relative-importance exponents, C_1 and C_2 are constant terms, and M denotes the total number of scales;
(4) The adversarial loss, with the specific formula:
L_adv = −(1/N) Σ_{n=1}^{N} log D(y_n)
where D(y) denotes the probability that the defogged image y is judged to be a clear image, N denotes the total number of images, and n indexes the images starting from the first image;
the loss function of the overall network is expressed as:
L_loss = λ_1 L_adv + λ_2 L_l1 + λ_3 L_perc + λ_4 L_ms-ssim
where λ_1, λ_2, λ_3 and λ_4 are the hyperparameters weighting each loss term.
CN202211340900.1A 2022-10-30 2022-10-30 Method for constructing multi-level feature interactive defogging network based on U-Net Pending CN115578638A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211340900.1A CN115578638A (en) 2022-10-30 2022-10-30 Method for constructing multi-level feature interactive defogging network based on U-Net

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211340900.1A CN115578638A (en) 2022-10-30 2022-10-30 Method for constructing multi-level feature interactive defogging network based on U-Net

Publications (1)

Publication Number Publication Date
CN115578638A true CN115578638A (en) 2023-01-06

Family

ID=84586215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211340900.1A Pending CN115578638A (en) 2022-10-30 2022-10-30 Method for constructing multi-level feature interactive defogging network based on U-Net

Country Status (1)

Country Link
CN (1) CN115578638A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115937647A (en) * 2023-01-31 2023-04-07 西南石油大学 Multi-feature fusion image significance detection method
CN115937647B (en) * 2023-01-31 2023-05-19 西南石油大学 Multi-feature fusion image saliency detection method

Similar Documents

Publication Publication Date Title
CN111882002B (en) MSF-AM-based low-illumination target detection method
Jiang et al. Decomposition makes better rain removal: An improved attention-guided deraining network
CN110363716B (en) High-quality reconstruction method for generating confrontation network composite degraded image based on conditions
CN112991350B (en) RGB-T image semantic segmentation method based on modal difference reduction
CN113392711B (en) Smoke semantic segmentation method and system based on high-level semantics and noise suppression
CN115908205A (en) Image restoration method and device, electronic equipment and storage medium
CN113870124B (en) Weak supervision-based double-network mutual excitation learning shadow removing method
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN116205962B (en) Monocular depth estimation method and system based on complete context information
CN115578280A (en) Construction method of double-branch remote sensing image defogging network
CN117408924A (en) Low-light image enhancement method based on multiple semantic feature fusion network
CN115578638A (en) Method for constructing multi-level feature interactive defogging network based on U-Net
CN117541505A (en) Defogging method based on cross-layer attention feature interaction and multi-scale channel attention
CN117952883A (en) Backlight image enhancement method based on bilateral grid and significance guidance
CN113628143A (en) Weighted fusion image defogging method and device based on multi-scale convolution
CN117036436A (en) Monocular depth estimation method and system based on double encoder-decoder
CN117036182A (en) Defogging method and system for single image
CN116703750A (en) Image defogging method and system based on edge attention and multi-order differential loss
CN116597142A (en) Satellite image semantic segmentation method and system based on full convolution neural network and converter
CN116012349A (en) Hyperspectral image unmixing method based on minimum single-body volume constraint and transducer structure
CN115861749A (en) Remote sensing image fusion method based on window cross attention
CN115187775A (en) Semantic segmentation method and device for remote sensing image
Prema et al. OptiLCD: an optimal lossless compression and denoising technique for satellite images using hybrid optimization and deep learning techniques
CN114140334A (en) Complex coal mine image defogging method based on improved generation countermeasure network
US20240289928A1 (en) Single image dehazing method based on detail recovery

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination