CN117541505A - Defogging method based on cross-layer attention feature interaction and multi-scale channel attention


Info

Publication number
CN117541505A
Authority
CN
China
Prior art keywords
layer
feature
channel
convolution
attention
Prior art date
Legal status
Pending
Application number
CN202311472766.5A
Other languages
Chinese (zh)
Inventor
但志平
付秋月
孙航
朱佳
韩进
管诗衡
赵宇辉
谢黎
Current Assignee
China Three Gorges University CTGU
Yichang Power Supply Co of State Grid Hubei Electric Power Co Ltd
Original Assignee
China Three Gorges University CTGU
Yichang Power Supply Co of State Grid Hubei Electric Power Co Ltd
Priority date
Filing date
Publication date
Application filed by China Three Gorges University CTGU, Yichang Power Supply Co of State Grid Hubei Electric Power Co Ltd filed Critical China Three Gorges University CTGU
Priority to CN202311472766.5A
Publication of CN117541505A


Classifications

    • G06T3/4007 — Scaling of whole images or parts thereof, e.g. expanding or contracting, based on interpolation, e.g. bilinear interpolation
    • G06N3/0455 — Auto-encoder networks; encoder-decoder networks
    • G06N3/0464 — Convolutional networks [CNN, ConvNet]
    • G06N3/048 — Activation functions
    • G06N3/08 — Learning methods
    • G06T9/002 — Image coding using neural networks
    • G06T2207/20081 — Training; learning
    • G06T2207/20084 — Artificial neural networks [ANN]


Abstract

The defogging method based on cross-layer attention feature interaction and multi-scale channel attention uses a cross-layer attention feature interaction module to learn hierarchical weights from the multi-scale cross-layer features of the encoding layers; the cross-layer features are then aggregated and passed to the corresponding decoding layers, reducing feature dilution while the defogging network reconstructs a clear image. In addition, to mine the feature-channel information most important to the defogging network, a multi-scale channel attention mechanism is designed: dilated convolutions with different dilation rates extract multi-scale feature information, forming a channel attention mechanism that learns context in parallel and distributes weights over the network's features more effectively.

Description

Defogging method based on cross-layer attention feature interaction and multi-scale channel attention
Technical Field
The invention belongs to the field of image processing, and particularly relates to a defogging method based on cross-layer attention feature interaction and multi-scale channel attention.
Background
In the past few decades, with rapid industrialization, haze has occurred frequently across many regions, so that images captured by visible-light systems suffer from color distortion, reduced contrast, loss of detail, and similar problems. These problems seriously hinder downstream vision algorithms such as object detection, object tracking, and face detection. Image defogging has therefore drawn extensive attention from researchers. Early image defogging algorithms typically estimated the parameters of the atmospheric scattering model from prior information to achieve defogging. Although these prior-based methods have made some progress, they generally cannot recover well-defined, high-quality images in unconstrained conditions. In recent years, with the continuous development of deep learning, researchers have proposed many defogging methods based on convolutional neural networks, which fall mainly into two categories: defogging based on parameter estimation and end-to-end defogging.
Currently, most defogging algorithms use a U-shaped network structure as the basis for end-to-end defogging. To achieve better defogging performance, researchers have designed various information-fusion schemes within the U-shaped network to enhance feature expression. For example, Dong et al., in "Multi-Scale Boosted Dehazing Network with Dense Feature Fusion", proposed the MSBDN defogging algorithm, which borrows the idea of DenseNet and performs dense feature fusion between the encoding and decoding layers to achieve multi-scale feature interaction. Although these end-to-end defogging methods have made some progress, they tend to ignore cross-layer information interaction between non-corresponding decoding and encoding layers in the network, which causes feature dilution and degrades the reconstruction of a clear image. Hu et al. proposed the SE channel attention mechanism in "Squeeze-and-Excitation Networks"; it has been widely used in vision tasks such as object detection, semantic segmentation, and image segmentation. The core idea of SE is to learn the weights of different channels by evaluating the importance of each feature channel, enhancing or suppressing particular features to suit different tasks. However, SE channel attention extracts channel features by global average pooling over the feature map, which suffers from a limited receptive field and insufficient consideration of contextual information, making accurate channel-feature weighting difficult.
Although the above methods achieve reasonable defogging results, they still lack sufficient constraints: when complex haze images are processed, poor defogging, artifacts, and color distortion remain. In particular, two problems stand out.
(1) Most defogging networks based on U-shaped structures pass the encoding-layer features directly to the decoding layers of the corresponding scale, ignoring the effective use of feature information from different layers.
(2) The channel attention widely used in defogging networks is limited by its receptive field and does not make full use of contextual information, which negatively affects the learning of channel weights; as a result, the defogging quality of these methods still has room for improvement.
Disclosure of Invention
The invention aims to solve two technical problems of existing image defogging methods: first, they generally pass the encoding-layer features directly to the decoding layer of the corresponding scale, neglecting the effective use of feature information from different layers; second, the channel attention widely used in defogging networks is limited by its receptive field and does not make full use of contextual information, which negatively affects the learning of channel weights, leaving the defogging quality of these methods with room for improvement. To this end, a defogging method based on cross-layer attention feature interaction and multi-scale channel attention is provided.
In order to solve the technical problems, the invention adopts the following technical scheme:
The defogging method based on cross-layer attention feature interaction and multi-scale channel attention comprises the following steps:
step S1: constructing a defogging network frame based on cross-layer attention feature interaction and multi-scale channel attention;
step S2: constructing a U-shaped defogging network A, wherein the U-shaped defogging network A comprises: the device comprises an encoding layer characteristic extraction module, a residual error module and a decoding layer image recovery module;
step S3: constructing a cross-layer attention interaction module B, which acquires features of different layers in the encoding layer and fuses these cross-layer features; the aggregated features containing edge texture and semantic information are passed to the corresponding decoding layers with appropriate weight distribution;
step S4: constructing a multi-scale channel attention module C, which extracts different receptive-field information from the same feature in the network and then learns channel weights separately, effectively fusing the features containing rich contextual information.
Through the steps, a defogging network frame based on cross-layer attention characteristic interaction and multi-scale channel attention is constructed, and defogging of images is carried out by adopting the frame.
In step S1, the specific steps are as follows:
the defogging network framework based on cross-layer attention feature interaction and multi-scale channel attention comprises a U-shaped defogging network A, a cross-layer attention feature interaction module B and a multi-scale channel attention module C, and the output of the network is restrained through a series of loss functions, so that the network learns the mapping relation between a foggy image I and a foggy image gt, and the defogging capability of the image is improved; specifically, inputting a foggy image I, then inputting the foggy image I into a defogging network frame based on cross-layer attention feature interaction and multi-scale channel attention, and aggregating feature information of different layers of an encoder, wherein the aggregated feature comprises high-level semantic information and low-level edge texture information; and then, merging the aggregation features containing more abundant context information into each decoding layer, and relieving the problem of feature dilution in the up-sampling process of the U-shaped defogging network A so as to generate a clear defogging image with higher quality.
In step S2, the encoding-layer feature extraction module downsamples the input by a factor of four for feature extraction; each layer of the encoding layer first uses a convolution downsampling operation, then normalization, and then a ReLU activation function; residual blocks extract features after the downsampling operation, and the decoding layer reconstructs the image from the semantically rich feature maps with deconvolution and convolution operations, finally restoring the original image size. A minimal sketch of one such downsampling block follows.
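As an illustration only, here is a minimal PyTorch sketch of one such encoding-layer downsampling block (strided convolution, then normalization, then ReLU). The class name, the InstanceNorm2d choice, and the example channel counts are assumptions, not taken from the patent:

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    # One assumed encoding-layer block: strided conv -> norm -> ReLU.
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),  # halves H and W
            nn.InstanceNorm2d(out_ch),  # normalization layer (type assumed)
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)

# e.g. an H x W x C feature map becomes H/2 x W/2 x 2C, matching the sizes given later
down = DownBlock(64, 128)
```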
In step S3, in the encoding layer of the U-shaped defogging network A, different feature layers contain corresponding information; for example, shallow features contain more detailed texture information, and deep features contain semantic information such as color and brightness. Therefore, to use the information of the different encoding layers more effectively, reduce the dilution of feature information, and improve the reconstruction of the defogged image during decoding, the features $X_L$ of the different encoding layers are first compressed with a global average pooling operation $H_{gap}$ to obtain a channel descriptor $G_L \in \{G_1, G_2, G_3\}$ for each layer's feature map, thereby preserving the most significant features of each layer's feature map.
The channel descriptor obtained at the $L$-th encoding layer is expressed as:

$$G_L^n = H_{gap}\big((X_L)^n\big) = \frac{1}{H_L \times W_L} \sum_{i=1}^{H_L} \sum_{j=1}^{W_L} (X_L)^n(i,j)$$

where $(X_L)^n(i,j)$ denotes the value of the $n$-th single-channel feature map at position $(i,j)$ in the $L$-th encoding layer; the feature map $X_L$ is compressed from $H_L \times W_L \times C_L$ to $1 \times 1 \times C_L$, and the number of channels of the resulting descriptor is $C_L \in \{64, 128, 256\}$. The channel descriptors are then concatenated along the channel dimension:
$$D = \mathrm{Concat}(G_1, G_2, G_3)$$

where $D$ is the concatenated channel descriptor. Feature weights are then learned for each encoding layer according to its importance to the aggregated features, which combine high-level semantic information with low-level edge texture information. The weight of each encoding layer is learned in a convolution-activation-convolution manner:

$$[\omega_1, \omega_2, \omega_3] = \mathrm{Chunk}\big(\mathrm{Conv}(\delta(\mathrm{Conv}(D)))\big)$$

where Conv is a 2D convolution with a $1 \times 1$ kernel, $\delta$ is the activation function, and Chunk splits the resulting $1 \times 1 \times 3$ one-dimensional vector into the three weights $\omega_1, \omega_2, \omega_3$. Each layer's weight $\omega_L$ is then multiplied with the features $(X_L)^n$ of the original layer, assigning a weight to each encoding layer:

$$Z_L = \omega_L \cdot (X_L)^n$$
$Z_L$ denotes the feature map of each encoding layer after weight assignment. To fuse the encoding layers of different sizes into the decoding layers, the feature maps are enlarged to the same size with the bilinear interpolation function $B$ and concatenated (Concat) to obtain the aggregated feature $F$:

$$F = \mathrm{Concat}(B(Z_1), B(Z_2), B(Z_3))$$

Downsampling operations with different convolutions then resize the aggregated feature to the size of the corresponding decoding layer, so that it is fused into every decoding layer, where $\mathrm{Conv}_L(F)$ denotes passing the aggregated feature $F$ through convolution kernels of different strides and $Y_L$ denotes the $L$-th layer of the decoder:

$$Y_L = \mathrm{Concat}(\mathrm{Conv}_L(F), Y_{L+1})$$
The cross-layer attention feature interaction module thus aggregates the feature information of the different encoding layers so that it contains richer detail texture and structural semantic information, and downsamples the aggregated features to the corresponding decoding layers; fusing the aggregated features into each decoding layer guides the network to generate higher-quality defogged images and effectively alleviates the feature-dilution problem during encoding and decoding in the U-shaped defogging network A. A hedged sketch of this module follows.
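For illustration, here is a hedged PyTorch sketch of the cross-layer attention feature interaction module described by the formulas above. The channel counts (64/128/256), the hidden width of 128, ReLU as the activation $\delta$, and F.interpolate as the bilinear function $B$ are assumptions consistent with the text, not the patented implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerAttention(nn.Module):
    def __init__(self, chs=(64, 128, 256), hidden=128):
        super().__init__()
        total = sum(chs)  # 448, as in the text
        # convolution-activation-convolution weight learning on the 1x1 descriptors
        self.weight_mlp = nn.Sequential(
            nn.Conv2d(total, hidden, kernel_size=1),     # 1x1x448 -> 1x1x128
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, len(chs), kernel_size=1),  # -> 1x1x3
        )
        # strided convolutions that resize the aggregated feature F per decoder scale
        self.out_convs = nn.ModuleList([
            nn.Conv2d(total, chs[2], 3, stride=4, padding=1),  # -> H/4 x W/4 x 4C
            nn.Conv2d(total, chs[1], 3, stride=2, padding=1),  # -> H/2 x W/2 x 2C
            nn.Conv2d(total, chs[0], 3, stride=1, padding=1),  # -> H x W x C
        ])

    def forward(self, feats):
        # feats: [X1, X2, X3], shallow to deep encoding-layer features
        descs = [F.adaptive_avg_pool2d(x, 1) for x in feats]  # G_L via global avg pooling
        d = torch.cat(descs, dim=1)                           # D = Concat(G1, G2, G3)
        w = self.weight_mlp(d).chunk(len(feats), dim=1)       # Chunk -> omega_1..3
        z = [wi * xi for wi, xi in zip(w, feats)]             # Z_L = omega_L * X_L
        size = feats[0].shape[-2:]                            # largest spatial scale
        f = torch.cat([F.interpolate(zi, size=size, mode='bilinear',
                                     align_corners=False) for zi in z], dim=1)  # F
        return [conv(f) for conv in self.out_convs]  # aggregated feature per decoding layer
```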
In step S4, the multi-scale channel attention C mainly comprises two stages, dispersion and aggregation. In the dispersion stage, dilated convolutions of different scales produce multi-scale feature expressions; in the aggregation stage, the obtained weights are multiplied with the corresponding feature maps, and the features of different receptive fields are fused. Through this multi-scale, parallel-learning channel attention network, the rich contextual semantic information contained in features of different scales can be mined more fully, so that the attention distribution meets the demands of the defogging task.
For the multi-scale channel attention C, the input feature map $X$ is convolved with kernels $\mathrm{Conv}_r$ of different dilation rates $r \in \{1,2,3\}$ to extract features $SF_r \in \{SF_1, SF_2, SF_3\}$ of different receptive fields:

$$SF_r = \mathrm{Conv}_r(X), \quad r \in \{1,2,3\}$$
The multi-scale feature information of the feature map $X$ is extracted with parallel dilated convolutions, which capture broader, richer, and more complex contextual information. Global average pooling is then applied separately to the features of each receptive field:

$$(SG_r)^n = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} (SF_r)^n(i,j)$$

where $(SF_r)^n(i,j)$ is the value of the $n$-th single-channel feature map at position $(i,j)$ after convolution with dilation rate $r$, and $SG_r \in \{SG_1, SG_2, SG_3\}$ are the channel descriptors of the different receptive-field features. SE channel attention learns the per-channel weights with fully connected layers, but the channel dimensionality reduction may lose important features; here a 1D convolution learns the channel weights, reducing the parameter count while avoiding the negative effects of dimensionality reduction. An activation function then normalizes each layer's weights, giving $(M_r)^n$, the $n$-th channel weight obtained after convolution with dilation rate $r$ and channel attention:

$$(M_r)^n = \delta\big(\mathrm{Conv1d}((SG_r)^n)\big)$$

where Conv1d denotes a one-dimensional convolution and $\delta$ the activation function. Finally, the obtained weights are multiplied with the corresponding features, and the features of different receptive fields are fused to obtain the multi-scale channel features of the different receptive fields. A hedged sketch of this mechanism follows.
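As an illustration, a hedged PyTorch sketch of this multi-scale channel attention follows. The Sigmoid used to normalize the weights (as in the framework description below), the 1D kernel size of 3, and summation as the fusion operator are assumptions made for the sketch:

```python
import torch
import torch.nn as nn

class MultiScaleChannelAttention(nn.Module):
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        # parallel 3x3 dilated convolutions, dilation rates r in {1, 2, 3}
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r)  # SF_r
            for r in (1, 2, 3)
        ])
        # per-branch 1D convolution over the channel descriptor, as in ECA
        self.gates = nn.ModuleList([
            nn.Conv1d(1, 1, kernel_size=k, padding=k // 2) for _ in range(3)
        ])
        self.act = nn.Sigmoid()  # weight normalization (gate form assumed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = 0
        for conv, gate in zip(self.branches, self.gates):
            sf = conv(x)                                    # SF_r: one receptive field
            sg = sf.mean(dim=(2, 3))                        # SG_r: global average pooling
            m = self.act(gate(sg.unsqueeze(1))).squeeze(1)  # (M_r)^n: channel weights
            out = out + sf * m[:, :, None, None]            # weight, then fuse by summation
        return out
```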
The defogging network framework A based on cross-layer attention feature interaction and multi-scale channel attention is constructed as follows:
The foggy image I is input to the encoding layer: reflection padding (ReflectionPad2d) is applied first, then the feature map enters the first convolution layer of the encoding layer, which performs a 7×7 convolution with stride 1, followed by normalization and a ReLU activation, yielding a feature map of size H × W × C;
this feature map passes through a second convolution layer that performs downsampling: a 3×3 convolution with stride 2, then normalization and a ReLU activation, changing the feature-map size from H × W × C to H/2 × W/2 × 2C;
the feature map then passes through a third convolution layer that likewise performs downsampling: a 3×3 convolution with stride 2, then normalization and a ReLU activation, changing the feature-map size from H/2 × W/2 × 2C to H/4 × W/4 × 4C;
the feature map from this convolution layer passes through 6 residual blocks; each residual block first applies a convolution, normalization, and activation function, and finally adds the result to the feature map before the operation, mainly obtaining feature maps rich in semantic information to pass to the decoding layer (see the residual-block sketch below);
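A minimal sketch of one residual block as just described (convolution, normalization, activation, then addition with the input feature map); the kernel size and normalization type are assumed:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)  # add the input feature map back
```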
the output of the residual blocks passes through a multi-scale channel attention module, which fully mines the rich contextual semantic information contained in features of different scales, and then through a pixel attention module;
the third convolution layer feeds the cross-layer attention interaction module in place of the traditional skip connection; the module acquires features of different levels without changing the feature map and fuses these cross-layer features. The aggregated features containing edge texture and semantic information are passed to the corresponding decoding layers with appropriate weights; the features of the cross-layer attention interaction module are then concatenated (Concat) with the feature map from the pixel attention module along the channel dimension and passed through an upsampling layer, whose structure is: first a deconvolution operation that changes the feature-map size from H/4 × W/4 × 4C to H/2 × W/2 × 2C, then a multi-scale channel attention module, then a pixel attention module;
the downsampling convolution layer in the encoding layer likewise feeds the cross-layer attention interaction module in place of the traditional skip connection, keeping the feature-map size H/2 × W/2 × 2C; a Concat splicing operation with the upsampling layer follows along the channel dimension, then another upsampling layer applies a deconvolution that changes the feature-map size from H/2 × W/2 × 2C to H × W × C, followed by the multi-scale channel attention module C and the pixel attention module;
after the H × W × C feature map is obtained, reflection padding restores it to the original size, finally producing the defogged image O. A high-level sketch of how these pieces might compose follows.
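Tying the sketches together, the following assumption-laden skeleton shows one way the pieces above might compose into the U-shaped network, reusing the DownBlock, ResidualBlock, CrossLayerAttention, and MultiScaleChannelAttention sketches. PixelAttention is a simple illustrative stub (the patent does not spell out its internals), and fusing the full-scale aggregated feature by addition before the tail is likewise an assumption:

```python
import torch
import torch.nn as nn

class PixelAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)  # per-pixel gating (assumed form)

class DehazeNet(nn.Module):
    def __init__(self, c: int = 64):
        super().__init__()
        self.head = nn.Sequential(nn.ReflectionPad2d(3), nn.Conv2d(3, c, 7),
                                  nn.InstanceNorm2d(c), nn.ReLU(inplace=True))
        self.down1, self.down2 = DownBlock(c, 2 * c), DownBlock(2 * c, 4 * c)
        self.res = nn.Sequential(*[ResidualBlock(4 * c) for _ in range(6)])
        self.msca3, self.pa3 = MultiScaleChannelAttention(4 * c), PixelAttention(4 * c)
        self.cla = CrossLayerAttention((c, 2 * c, 4 * c))
        self.up2 = nn.ConvTranspose2d(8 * c, 2 * c, 3, stride=2, padding=1, output_padding=1)
        self.msca2, self.pa2 = MultiScaleChannelAttention(2 * c), PixelAttention(2 * c)
        self.up1 = nn.ConvTranspose2d(4 * c, c, 3, stride=2, padding=1, output_padding=1)
        self.msca1, self.pa1 = MultiScaleChannelAttention(c), PixelAttention(c)
        self.tail = nn.Sequential(nn.ReflectionPad2d(3), nn.Conv2d(c, 3, 7))

    def forward(self, hazy: torch.Tensor) -> torch.Tensor:
        x1 = self.head(hazy)                    # H x W x C
        x2 = self.down1(x1)                     # H/2 x W/2 x 2C
        x3 = self.down2(x2)                     # H/4 x W/4 x 4C
        b = self.pa3(self.msca3(self.res(x3)))  # bottleneck features
        a3, a2, a1 = self.cla([x1, x2, x3])     # aggregated features at 4C, 2C, C scales
        y2 = self.pa2(self.msca2(self.up2(torch.cat([b, a3], 1))))   # H/2 x W/2 x 2C
        y1 = self.pa1(self.msca1(self.up1(torch.cat([y2, a2], 1))))  # H x W x C
        return self.tail(y1 + a1)               # fuse full-scale CLA output (assumed: addition)
```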
The constructed cross-layer attention interaction module B framework is as follows:
First, the feature maps of the different encoding layers are each globally average-pooled to obtain the corresponding channel descriptors; these descriptors retain the most significant features of each layer's feature map (for example, shallow features contain more detail texture information, and deep features contain semantic information such as color and brightness);
the channel descriptors are concatenated (Concat) to obtain a 1 × 1 × 448 feature vector, which a 1 × 1 convolution kernel maps to a 1 × 1 × 128 feature vector; after an activation function, a second 1 × 1 convolution kernel produces a 1 × 1 × 3 feature vector;
a Chunk operation splits this 1 × 1 × 3 one-dimensional vector into three weights, which are multiplied with the feature maps of the corresponding encoding layers, yielding channel-weighted feature maps and thereby assigning a weight to each encoding layer;
to fuse the processed feature maps of the different-scale encoding layers into the decoding layers, the channel-weighted feature maps are first brought to the same size with the bilinear interpolation function B and then concatenated (Concat) along the channel dimension to obtain the aggregated feature;
a 3 × 3 convolution with stride 4 and 4C output channels turns the aggregated feature into a feature map of size H/4 × W/4 × 4C;
a 3 × 3 convolution with stride 2 and 2C output channels turns the aggregated feature into a feature map of size H/2 × W/2 × 2C;
a 3 × 3 convolution with stride 1 and C output channels turns the aggregated feature into a feature map of size H × W × C.
the multi-scale channel attention module C framework was constructed as follows:
The input feature map first passes through a convolution kernel of size 3×3 with stride 1 and dilation rate 1; global average pooling then gives the corresponding feature vector, a 1×1 convolution learns the channel weights, a Sigmoid activation function normalizes them, and the obtained weights are multiplied with the corresponding feature map;
the input feature map likewise passes through a convolution kernel of size 3×3 with stride 1 and dilation rate 2; global average pooling gives the corresponding feature vector, a 1×1 convolution learns the channel weights, a Sigmoid activation function normalizes them, and the obtained weights are multiplied with the corresponding feature map;
finally, the input feature map passes through a convolution kernel of size 3×3 with stride 1 and dilation rate 3; global average pooling gives the corresponding feature vector, a 1×1 convolution learns the channel weights, a Sigmoid activation function normalizes them, and the obtained weights are multiplied with the corresponding feature map;
the feature maps of the different receptive fields and different channel weights are then fused to obtain the multi-scale channel feature maps.
Compared with the prior art, the invention has the following technical effects:
1) To overcome feature dilution in the network model and enhance its expressive power, the invention introduces a cross-layer attention feature interaction module (CLA), which acquires features of different levels within the encoder and fuses these cross-layer features. The aggregated features containing edge texture and semantic information are passed to the corresponding decoding layers with appropriate weights, effectively alleviating the feature-dilution problem during the reconstruction of a clear image;
2) The invention designs a multi-scale channel attention mechanism (MSCA), which extracts different receptive-field information from the same feature in the network, then learns channel weights separately and effectively fuses the features containing rich contextual information, forming a more reasonable feature-weight distribution mechanism;
3) The invention develops a novel defogging method based on cross-layer attention feature interaction and multi-scale channel attention, seamlessly integrating the designed CLA module and MSCA mechanism into a defogging network for defogging images of natural and remote-sensing scenes.
Drawings
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
FIG. 1 is a block diagram of the overall network of the present invention;
FIG. 2 is a block diagram of a cross-layer attention feature interaction model;
FIG. 3 is a block diagram of a multi-scale channel attention module.
Detailed Description
As shown in fig. 1 to 3, the defogging method based on cross-layer attention feature interaction and multi-scale channel attention comprises the following steps:
s1, constructing a defogging network frame diagram 1 based on cross-layer attention feature interaction and multi-scale channel attention, wherein the frame comprises a U-shaped defogging network A, a cross-layer attention feature interaction module B and a multi-scale channel attention module C.
S2, constructing a U-shaped defogging network A, wherein a U-shaped defogging network A packet comprises: the device comprises an encoding layer 101 feature extraction module, a residual error module 102 and a decoding layer 103 image recovery module.
S3, constructing a cross-layer attention interaction module B, and fusing the cross-layer characteristics by acquiring the characteristics of different layers in the coding layer 101. The aggregated features containing edge texture and semantic information are passed to the corresponding decoding layer 101 by appropriate weight assignment.
S4, constructing a multi-scale channel attention module C, wherein the multi-scale channel attention module C extracts different receptive field information of the same feature in the network, and then learns channel weights respectively to effectively fuse the features containing rich context information.
The steps S1 and S2 specifically include:
As shown in fig. 1, the defogging network framework based on cross-layer attention feature interaction and multi-scale channel attention comprises a U-shaped defogging network A, a cross-layer attention feature interaction module B and a multi-scale channel attention module C; the output of the network is constrained by a series of loss functions so that the network learns the mapping relation between the foggy image I and the ground-truth clear image gt, improving the defogging capability. Specifically, a foggy image I is input into the defogging network framework of fig. 1, and the feature information of the different layers of the encoder 101 is aggregated, the aggregated features comprising high-level semantic information and low-level edge texture information. The aggregated features containing this richer contextual information are then fused into each decoding layer 103, alleviating the feature-dilution problem during upsampling in the U-shaped defogging network A and generating a clearer, higher-quality defogged image O.
The encoding layer 101 feature extraction module downsamples the input by a factor of four for feature extraction. Each layer of the encoding layer 101 first uses a convolution downsampling operation, then normalization, and then a ReLU activation function. Residual blocks 102 extract features after the downsampling operation, and the decoding layer 103 reconstructs the image from the semantically rich feature maps with deconvolution and convolution operations, finally restoring the original image size.
The step S3 specifically comprises the following steps:
At the encoding layer 101 of the U-shaped defogging network A, different feature layers contain corresponding information; for example, shallow features contain more detailed texture information, and deep features contain semantic information such as color and brightness. The goal is therefore to use the information of the different layers of the encoding layer 101 more effectively, reduce the dilution of feature information, and improve the reconstruction of the defogged image during decoding. As shown in fig. 2, the proposed cross-layer feature interaction module first compresses the features $X_L$ of the different layers of the encoding layer 101 with a global average pooling operation $H_{gap}$, obtaining a channel descriptor $G_L \in \{G_1, G_2, G_3\}$ for each layer's feature map and thereby preserving its most significant features.
The channel descriptor obtained at the $L$-th layer of the encoding layer 101 is expressed as:

$$G_L^n = H_{gap}\big((X_L)^n\big) = \frac{1}{H_L \times W_L} \sum_{i=1}^{H_L} \sum_{j=1}^{W_L} (X_L)^n(i,j)$$

where $(X_L)^n(i,j)$ denotes the value of the $n$-th single-channel feature map at position $(i,j)$ in the $L$-th layer of the encoding layer 101; the feature map $X_L$ is compressed from $H_L \times W_L \times C_L$ to $1 \times 1 \times C_L$, and the number of channels of the resulting descriptor is $C_L \in \{64, 128, 256\}$. The channel descriptors are then concatenated along the channel dimension:
$$D = \mathrm{Concat}(G_1, G_2, G_3)$$

where $D$ is the concatenated channel descriptor. Feature weights are then learned for each encoding layer according to its importance to the aggregated features, which combine high-level semantic information with low-level edge texture information. The weight of each encoding layer is learned in a convolution-activation-convolution manner:

$$[\omega_1, \omega_2, \omega_3] = \mathrm{Chunk}\big(\mathrm{Conv}(\delta(\mathrm{Conv}(D)))\big)$$

where Conv is a 2D convolution with a $1 \times 1$ kernel, $\delta$ is the activation function, and Chunk splits the resulting $1 \times 1 \times 3$ one-dimensional vector into the three weights $\omega_1, \omega_2, \omega_3$. Each layer's weight $\omega_L$ is then multiplied with the features $(X_L)^n$ of the original layer, assigning a weight to each encoding layer:

$$Z_L = \omega_L \cdot (X_L)^n$$

$Z_L$ denotes the feature map of encoding layer 101 after weight assignment. To fuse the encoding layers 101 of different scales into the decoding layer 103, the feature maps are enlarged to the same size with the bilinear interpolation function $B$ and concatenated (Concat) to obtain the aggregated feature $F$:

$$F = \mathrm{Concat}(B(Z_1), B(Z_2), B(Z_3))$$

Downsampling operations with different convolutions then resize the aggregated feature to the size of the corresponding decoding layer 103, so that it is fused into every layer of the decoder, where $\mathrm{Conv}_L(F)$ denotes passing the aggregated feature $F$ through convolution kernels of different strides and $Y_L$ denotes the $L$-th layer of the decoder:

$$Y_L = \mathrm{Concat}(\mathrm{Conv}_L(F), Y_{L+1})$$
The cross-layer attention feature interaction module 9 aggregates the feature information of the different layers of the encoding layer 101 so that it contains richer detail texture and structural semantic information, and downsamples the aggregated features to the corresponding decoding layer 103; fusing the aggregated features into each layer of the decoder guides the network to generate higher-quality defogged images and effectively alleviates the feature-dilution problem during encoding and decoding in the U-shaped defogging network A.
The step S4 is specifically as follows:
As shown in fig. 3, the multi-scale channel attention C mainly comprises two stages, dispersion and aggregation. In the dispersion stage, dilated convolutions of different scales produce multi-scale feature expressions. In the aggregation stage, the obtained weights are multiplied with the corresponding feature maps, and the features of different receptive fields are fused. Through this multi-scale, parallel-learning channel attention network, the rich contextual semantic information contained in features of different scales can be mined more fully, so that the attention distribution meets the demands of the defogging task.
First, the input feature map $X$ is convolved with kernels $\mathrm{Conv}_r$ of different dilation rates $r \in \{1,2,3\}$ to extract features $SF_r \in \{SF_1, SF_2, SF_3\}$ of different receptive fields:

$$SF_r = \mathrm{Conv}_r(X), \quad r \in \{1,2,3\}$$

Compared with ordinary convolution, dilated convolution enlarges the receptive field without increasing the number of parameters. The multi-scale feature information of the feature map $X$ is extracted with parallel dilated convolutions, which capture broader, richer, and more complex contextual information. Global average pooling is then applied separately to the features of each receptive field:

$$(SG_r)^n = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} (SF_r)^n(i,j)$$

where $(SF_r)^n(i,j)$ is the value of the $n$-th single-channel feature map at position $(i,j)$ after convolution with dilation rate $r$, and $SG_r \in \{SG_1, SG_2, SG_3\}$ are the channel descriptors of the different receptive-field features. SE channel attention learns the per-channel weights with fully connected layers, but the channel dimensionality reduction may lose important features; here a 1D convolution learns the channel weights, reducing the parameter count while avoiding the negative effects of dimensionality reduction. An activation function then normalizes each layer's weights, giving $(M_r)^n$, the $n$-th channel weight obtained after convolution with dilation rate $r$ and channel attention:

$$(M_r)^n = \delta\big(\mathrm{Conv1d}((SG_r)^n)\big)$$

where Conv1d denotes a one-dimensional convolution and $\delta$ the activation function. Finally, the obtained weights are multiplied with the corresponding features, and the features of different receptive fields are fused to obtain the multi-scale channel features of the different receptive fields.
The network constructed in the present invention is specified as follows:
the defogging network framework A based on cross-layer attention feature interaction and multi-scale channel attention is constructed as follows:
The foggy image I is input to the encoding layer 101: reflection padding 1 (ReflectionPad2d) is applied first, then the feature map enters the first convolution layer 2 of the encoding layer 101, which performs a 7×7 convolution with stride 1, followed by normalization and a ReLU activation, yielding the feature map of convolution layer 2 with size H × W × C;
the feature map from convolution layer 2 passes through a second convolution layer 3, which performs downsampling: a 3×3 convolution with stride 2, then normalization and a ReLU activation, changing the feature-map size from H × W × C to H/2 × W/2 × 2C;
the feature map from convolution layer 3 passes through a third convolution layer 4, which likewise performs downsampling: a 3×3 convolution with stride 2, then normalization and a ReLU activation, changing the feature-map size from H/2 × W/2 × 2C to H/4 × W/4 × 4C;
the feature map from convolution layer 4 passes through 6 residual blocks 5; each residual block 5 first applies a convolution, normalization, and activation function, and finally adds the result to the feature map before the operation, mainly obtaining feature maps rich in semantic information to pass to the decoding layer 103;
the output of the residual blocks 6 passes through a multi-scale channel attention module 7, which fully mines the rich contextual semantic information contained in features of different scales, and then through a pixel attention module 8;
convolution layer 4 feeds the cross-layer attention interaction module 9 in place of the traditional skip connection; the module acquires features of different levels without changing the feature map and fuses these cross-layer features. The aggregated features containing edge texture and semantic information are passed to the corresponding decoding layer 103 with appropriate weights; the features of the cross-layer attention interaction module 9 are then concatenated (Concat) with the feature map from the pixel attention module 8 along the channel dimension and passed through an upsampling layer 10, whose structure is: first a deconvolution operation that changes the feature-map size from H/4 × W/4 × 4C to H/2 × W/2 × 2C, then a multi-scale channel attention module, then a pixel attention module 8;
the downsampling convolution layer 3 in the encoding layer 101 likewise feeds the cross-layer attention interaction module 9 in place of the traditional skip connection, keeping the feature-map size H/2 × W/2 × 2C; a Concat splicing operation with the upsampling layer 10 follows along the channel dimension, then an upsampling layer 11 applies a deconvolution that changes the feature-map size from H/2 × W/2 × 2C to H × W × C, followed by the multi-scale channel attention module C and the pixel attention module 8;
after the H × W × C feature map is obtained, reflection padding 12 restores it to the original size, finally producing the defogged image O.
The cross-layer attention interaction module B framework is constructed as follows:
As shown in fig. 2, the feature maps 201 of the different layers of the encoding layer 101 are first each globally average-pooled to obtain the corresponding channel descriptors 202; these descriptors retain the most significant features of each layer's feature map (for example, shallow features contain more detail texture information, and deep features contain semantic information such as color and brightness);
the channel descriptors are then concatenated (Concat) along the channel dimension to obtain a 1 × 1 × 448 feature vector 203, which a 1 × 1 convolution kernel maps to a 1 × 1 × 128 feature vector 204; after an activation function, a second 1 × 1 convolution kernel produces a 1 × 1 × 3 feature vector 205;
a Chunk operation splits the feature vector 205, a one-dimensional vector of size 1 × 1 × 3, into three weights, which are multiplied with the feature maps 201 of the corresponding layers of the encoding layer 101, yielding the channel-weighted feature map 206 and thereby assigning a weight to each encoding layer 101;
to fuse the processed feature maps of the different-scale encoding layers 101 into the decoding layer 103, the channel-weighted feature maps 206 are first brought to the same size with the bilinear interpolation function B, giving feature maps 207, which are then concatenated along the channel dimension to obtain the aggregated feature 208;
a 3 × 3 convolution with stride 4 and 4C output channels turns the aggregated feature 208 into a feature map 209 of size H/4 × W/4 × 4C;
a 3 × 3 convolution with stride 2 and 2C output channels turns the aggregated feature 208 into a feature map 210 of size H/2 × W/2 × 2C;
a 3 × 3 convolution with stride 1 and C output channels turns the aggregated feature 208 into a feature map 211 of size H × W × C.
the multi-scale channel attention module framework is constructed as follows:
As shown in fig. 3, the input feature map 300 first passes through a convolution kernel 310 of size 3×3 with stride 1 and dilation rate 1 to obtain feature map 311; global average pooling then gives the corresponding feature vector 312, a 1×1 convolution learns the channel weights to give feature vector 313, a Sigmoid activation function normalizes it to give feature vector 314, and finally the obtained weights 314 are multiplied with the corresponding feature map 311 to obtain feature map 315;
the input feature map 300 likewise passes through a convolution kernel 320 of size 3×3 with stride 1 and dilation rate 2 to obtain feature map 321; global average pooling gives feature vector 322, a 1×1 convolution learns the channel weights to give feature vector 323, a Sigmoid activation function normalizes it to give feature vector 324, and the obtained weights 324 are multiplied with the corresponding feature map 321 to obtain feature map 325;
finally, the input feature map 300 passes through a convolution kernel 330 of size 3×3 with stride 1 and dilation rate 3 to obtain feature map 331; global average pooling gives feature vector 332, a 1×1 convolution learns the channel weights to give feature vector 333, a Sigmoid activation function normalizes it to give feature vector 334, and the obtained weights 334 are multiplied with the corresponding feature map 331 to obtain feature map 335;
finally, feature map 315, feature map 325, and feature map 335 are fused to obtain the channel feature maps 340 of the different receptive fields.
In order to facilitate a better understanding of the present invention by those of ordinary skill in the art, the following is further described:
1) Parameter setting
The experiments use an RTX 3090 GPU with 24 GB of memory; the code is implemented on the PyTorch framework with CUDA 11.0. The Adam optimization strategy tunes the network, with momentum decay parameters β1 = 0.9 and β2 = 0.999. During training, the images input to the network are 256 × 256. To extend the training set and prevent model overfitting, random rotations of 90°, 180°, and 270° as well as horizontal and vertical flipping are introduced. The RESIDE synthetic foggy dataset, the Dense-Haze real dense-fog dataset, and the NH-Haze2021 real non-homogeneous fog dataset are selected to evaluate the model. In the RESIDE dataset, the Outdoor training set (OTS) trains the network, and the outdoor part of SOTS serves as the test set. Dense-Haze contains 45 dense-fog image pairs. NH-Haze2021 contains 25 non-uniform hazy images; since the GT images of its validation and test sets are not published, the first 20 are selected as the training set and the remaining 5 as the test set. Because real foggy datasets are scarce, the pictures in the dataset are randomly cropped to sizes of 1024, 512, and 256, and resized to 256 with bilinear interpolation to enlarge the dataset. To evaluate model performance, the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) serve as quantitative evaluation metrics; both are commonly used for quality evaluation of defogged images.
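As a rough illustration of this setup, the sketch below configures the optimizer and data augmentations described above. The learning rate, the use of torchvision transforms, and the `model` entry point (the DehazeNet skeleton sketched earlier) are assumptions; in practice the same random transform must be applied to a hazy image and its ground truth:

```python
import torch
import torchvision.transforms as T

model = DehazeNet().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999))  # beta1/beta2 from the text; lr assumed

# augmentations named in the text: 256 x 256 inputs, random 90/180/270-degree
# rotations, horizontal and vertical flips
augment = T.Compose([
    T.RandomCrop(256),
    T.RandomChoice([T.RandomRotation((d, d)) for d in (0, 90, 180, 270)]),
    T.RandomHorizontalFlip(),
    T.RandomVerticalFlip(),
])
```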
2) Experimental results
To verify the correctness and effectiveness of the method of the present invention, currently strong defogging algorithms are added for comparison, including DCP, AOD-Net, EPDN, GCA-Net, MSBDN, FFA, AECR, TBN, SGID, and Huang. Table 1 shows the PSNR and SSIM results of each defogging algorithm on the synthetic fog dataset SOTS-Outdoor, the dense-fog dataset Dense-Haze, and the non-uniform fog dataset NH-Haze.
The results on the hazy datasets are analyzed in Table 1: on the real dense-fog dataset Dense-Haze and the real non-uniform fog dataset NH-Haze, the algorithm presented here achieves 17.42 dB / 0.603 and 22.78 dB / 0.845 in PSNR and SSIM respectively, the best results on both datasets. On the RESIDE outdoor synthetic test set SOTS, the algorithm obtains the highest PSNR and the second-highest SSIM among the compared algorithms, improving PSNR by 0.13 dB over FFA and 1.54 dB over MSBDN. Since FFA has no downsampling process, the structural information of the defogged image is preserved, so the algorithm here is slightly below FFA on SSIM. Taking the synthetic foggy dataset, the real dense-fog dataset, and the real non-uniform fog dataset together, the algorithm achieves SOTA overall performance.
Table 1 Comparison with SOTA methods on SOTS-Outdoor and the real-scene datasets
3) Ablation experiments
To evaluate the effectiveness of each module of the model, ablation experiments are designed around the framework and the proposed innovations, comprising five experiments: (1) Base: the U-Net architecture, mainly composed of an encoding layer, a decoding layer, and 6 residual blocks, with the encoding and decoding layers directly connected by skip connections. (2) Base+CLA: the intermediate skip connections of the U-Net base framework are replaced by the cross-layer attention feature interaction module (CLA). (3) Base+CLA+CA: channel attention (CA) is added on the basis of (2). (4) Base+CLA+ECA: ECA channel attention is added on the basis of (2). (5) Base+CLA+MSCA: the MSCA attention mechanism is added on the basis of (2). Adding these different attention mechanisms verifies that the attention model proposed here is superior to the other channel attention models.
Table 2 PSNR and SSIM results on SOTS outdoor dataset
As seen from Table 2, the Base framework reaches 29.55 dB / 0.963 in PSNR and SSIM; adding the cross-layer attention feature interaction module raises PSNR to 31.42 dB and SSIM to 0.964, an improvement of 1.87 dB in PSNR and 0.001 in SSIM over the baseline model. Second, performance improves further once channel attention is added. The attention module here addresses the limited receptive field of existing attention mechanisms and, inspired by ECA's use of 1D convolution as an intermediate step to reduce the model's parameter count, designs the multi-scale channel attention MSCA. Compared with adding CA, the MSCA module increases PSNR and SSIM by 0.18 dB and 0.002 respectively, and it improves on ECA by 0.17 dB and 0.002 in PSNR and SSIM, verifying the effectiveness of the proposed attention module.

Claims (10)

1. A defogging method based on cross-layer attention feature interaction and multi-scale channel attention, characterized by comprising the following steps:
step S1: constructing a defogging network frame based on cross-layer attention feature interaction and multi-scale channel attention;
step S2: constructing a U-shaped defogging network A, wherein the U-shaped defogging network A comprises: a coding layer (101) feature extraction module, a residual error module (102) and a decoding layer (103) image recovery module;
step S3: constructing a cross-layer attention interaction module B, which acquires features of different layers in the encoding layer (101) and fuses these cross-layer features; the aggregated features containing edge texture and semantic information are passed to the corresponding decoding layers (103) with appropriate weight distribution;
step S4: constructing a multi-scale channel attention module C, extracting different receptive field information of the same feature in a network by the multi-scale channel attention module C, respectively learning channel weights, and effectively fusing the features containing rich context information;
through the steps, a defogging network frame based on cross-layer attention characteristic interaction and multi-scale channel attention is constructed, and defogging of images is carried out by adopting the frame.
2. The method according to claim 1, characterized in that step S1 specifically comprises:
the defogging network framework based on cross-layer attention feature interaction and multi-scale channel attention comprises a U-shaped defogging network A, a cross-layer attention feature interaction module B and a multi-scale channel attention module C; the output of the network is constrained by a series of loss functions so that the network learns the mapping relation between the foggy image I and the ground-truth clear image gt, improving the defogging capability; specifically, a foggy image I is input into the defogging network framework based on cross-layer attention feature interaction and multi-scale channel attention, and the feature information of the different layers of the encoder (101) is aggregated, the aggregated features comprising high-level semantic information and low-level edge texture information; the aggregated features containing this richer contextual information are then fused into each decoding layer (103), alleviating the feature-dilution problem during upsampling in the U-shaped defogging network A and generating a clearer, higher-quality defogged image.
3. The method according to claim 1, characterized in that in step S2 the encoding layer (101) feature extraction module downsamples the input by a factor of four for feature extraction; each layer of the encoding layer (101) first uses a convolution downsampling operation, then normalization, and then a ReLU activation function; residual blocks (102) extract features after the downsampling operation, and the decoding layer (103) reconstructs the image from the semantically rich feature maps with deconvolution and convolution operations, finally restoring the original image size.
4. The method according to claim 1, characterized in that in step S3, in the encoding layer (101) of the U-shaped defogging network A, different feature layers contain corresponding information; for example, shallow features contain more detailed texture information, and deep features contain semantic information such as color and brightness; therefore, to use the information of the different layers of the encoding layer (101) effectively, reduce the dilution of feature information, and improve the reconstruction of the defogged image during decoding, the features $X_L$ of the different layers of the encoding layer (101) are first compressed with a global average pooling operation $H_{gap}$ to obtain a channel descriptor $G_L \in \{G_1, G_2, G_3\}$ for each layer's feature map, thereby preserving the most significant features of each layer's feature map.
5. The method according to claim 4, characterized in that the channel descriptor obtained for the L-th level of the coding layer (101) is expressed as:

(G_L)_n = H_gap((X_L)_n) = (1 / (H_L × W_L)) Σ_{i=1}^{H_L} Σ_{j=1}^{W_L} (X_L)_n(i, j)

where (X_L)_n(i, j) is the value of the n-th single-channel feature map at position (i, j) in the L-th level of the coding layer (101); the feature map X_L is compressed from H_L × W_L × C_L to 1 × 1 × C_L, and the channel count of the resulting descriptor is C_L ∈ {64, 128, 256}. The channel descriptors are then spliced (Concat) along the channel dimension:
D_n = Concat(G_1, G_2, G_3)
where D_n is the spliced channel descriptor. Feature weights are then learned according to the importance of each coding level to the aggregated features, which combine the high-level semantic information with the low-level edge texture information; the weight corresponding to each coding level is learned in a convolution-activation-convolution manner, namely:
[W_1, W_2, W_3] = Chunk(Conv(δ(Conv(D_n))))

where Conv denotes a 2D convolution with a 1 × 1 kernel, δ denotes the activation function, and Chunk splits the resulting 1 × 1 × 3 one-dimensional vector into three weights W_L ∈ {W_1, W_2, W_3}. The obtained weight of each level, W_L, is then multiplied by the features of the original level (X_L)_n, realizing the weight distribution over the coding levels, formulated as:

Z_L = W_L ⊗ X_L

where Z_L is the weighted feature of the L-th coding level.
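A minimal sketch of this convolution-activation-convolution weighting follows, under the assumption (consistent with claim 9 below) that the 448-channel spliced descriptor is reduced to 128 channels and then to 3 scalar weights; all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for the three encoder levels and their GAP descriptors from above.
X1, X2, X3 = (torch.randn(1, c, s, s) for c, s in [(64, 256), (128, 128), (256, 64)])
D = torch.cat([F.adaptive_avg_pool2d(x, 1) for x in (X1, X2, X3)], dim=1)  # 1 x 448 x 1 x 1

weight_mlp = nn.Sequential(
    nn.Conv2d(448, 128, kernel_size=1),  # first 1 x 1 Conv
    nn.ReLU(inplace=True),               # activation (delta)
    nn.Conv2d(128, 3, kernel_size=1),    # second 1 x 1 Conv -> one scalar per level
)
W1, W2, W3 = torch.chunk(weight_mlp(D), chunks=3, dim=1)  # Chunk into three weights
Z1, Z2, Z3 = W1 * X1, W2 * X2, W3 * X3                    # Z_L = W_L ⊗ X_L (broadcast)
```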
Z L in order to fuse the encoding layers (101) with different sizes into the decoding layer (102), the feature map is amplified to the same size by a bilinear interpolation function B and subjected to a Concat operation to obtain an aggregate feature F, which is specifically expressed as follows:
F = Concat(B(Z_1), B(Z_2), B(Z_3))
Next, downsampling is performed with different convolutions so that the aggregated feature is reduced to the size of the corresponding decoding layer (103) and fused into each of its levels, where Conv_L(F) denotes passing the aggregated feature F through convolution kernels of different strides, and Y_L denotes the L-th level of the decoding layer (103):
Y_L = Concat(Conv_L(F), Y_{L+1})
The cross-layer attention feature interaction module (9) thus aggregates the feature information of the different levels of the coding layer (101) so that it contains richer detail textures and structural semantic information, and downsamples the aggregated features to the corresponding decoding layer (103); fusing the aggregated features into each level of the decoding layer (103) guides the network to generate a higher-quality defogged image and effectively alleviates the feature-dilution problem in the encoding-decoding process of the U-shaped defogging network A.
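The interpolation-and-redistribution step can be sketched as follows; the strides 4/2/1 mirror the sizes given in claim 9, and the names are again illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for the weighted level features Z_1, Z_2, Z_3 from above.
Z1, Z2, Z3 = (torch.randn(1, c, s, s) for c, s in [(64, 256), (128, 128), (256, 64)])

# B: bilinear interpolation to a common size, then Concat -> aggregated feature F.
F_agg = torch.cat(
    [F.interpolate(z, size=(256, 256), mode='bilinear', align_corners=False)
     for z in (Z1, Z2, Z3)], dim=1)  # 1 x 448 x 256 x 256

# Conv_L: strided 3 x 3 convolutions redistribute F to each decoder level's size.
conv_by_stride = {
    4: nn.Conv2d(448, 256, 3, stride=4, padding=1),  # -> H/4 x W/4 x 4C
    2: nn.Conv2d(448, 128, 3, stride=2, padding=1),  # -> H/2 x W/2 x 2C
    1: nn.Conv2d(448,  64, 3, stride=1, padding=1),  # -> H x W x C
}
per_level = {s: conv(F_agg) for s, conv in conv_by_stride.items()}
```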
6. The method according to claim 1, characterized in that in step S4 the multi-scale channel attention module C mainly comprises two phases, dispersion and aggregation; in the dispersion phase, dilated convolutions of different scales are used to obtain a multi-scale feature representation; in the aggregation phase, the obtained weights are multiplied by the corresponding feature maps and the features of the different receptive fields are fused; by learning channel attention in parallel over multiple scales, the rich context semantic information contained in features of different scales can be mined more fully, so that the attention distribution meets the requirements of the defogging task.
7. The method of claim 6, wherein, for the multi-scale channel attention module C, the input feature map X is convolved with kernels Conv_r of different dilation rates r ∈ {1, 2, 3} to extract features of different receptive fields SF_r ∈ {SF_1, SF_2, SF_3}, expressed mathematically as:
SF_r = Conv_r(X), r ∈ {1, 2, 3}
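This dispersion step can be sketched in PyTorch as follows; the channel width and input size are hypothetical.

```python
import torch
import torch.nn as nn

X = torch.randn(1, 64, 64, 64)  # hypothetical input feature map

# Dispersion: parallel 3 x 3 convolutions with dilation rates r = 1, 2, 3;
# padding = r keeps the spatial size constant for a 3 x 3 kernel.
branches = nn.ModuleList(
    nn.Conv2d(64, 64, kernel_size=3, padding=r, dilation=r) for r in (1, 2, 3)
)
SF = [conv(X) for conv in branches]  # SF_1, SF_2, SF_3: same shape, growing receptive field
```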
The multi-scale feature information of the feature map X is thus extracted with parallel dilated convolutions, an approach that captures broader, richer and more complex context information. A global average pooling operation is then applied separately to the features of the different receptive fields:

(SG_r)_n = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} (SF_r)_n(i, j)

where (SF_r)_n(i, j) is the value at position (i, j) of the n-th single-channel feature map after the dilation-rate-r convolution, and SG_r ∈ {SG_1, SG_2, SG_3} are the channel descriptors corresponding to the different receptive-field features. SE channel attention learns the weight of each channel with a fully connected layer, but its channel dimensionality reduction can lose some important features; channel-weight learning is therefore performed with a 1D convolution, which reduces the parameter count and avoids the negative effect of channel dimensionality reduction. The weights of each layer are then normalized with an activation function to obtain (M_r)_n, the weight of the n-th channel after the dilation-rate-r convolution and channel attention, expressed as:
(M_r)_n = δ(Conv1d((SG_r)_n))
where Conv1d denotes a one-dimensional convolution and δ denotes a ReLU activation function; finally, the obtained weights are multiplied by the corresponding features, and the features of the different receptive fields are fused to obtain the channel features of the different receptive fields.
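A minimal sketch of this 1D-convolution channel-weight learning follows; the kernel size 3 is an assumption, and the sigmoid normalization follows the module construction in claim 10 (the δ above could equally be a ReLU).

```python
import torch
import torch.nn as nn

SF = [torch.randn(1, 64, 64, 64) for _ in range(3)]  # stand-ins for SF_1, SF_2, SF_3

def channel_weights(sf, conv1d):
    """SF_r -> M_r: global average pooling, 1D conv across channels, normalization."""
    n, c, _, _ = sf.shape
    sg = sf.mean(dim=(2, 3))                  # GAP -> SG_r with shape (N, C)
    m = conv1d(sg.unsqueeze(1))               # 1D conv over the channel axis -> (N, 1, C)
    return torch.sigmoid(m).view(n, c, 1, 1)  # per-channel weights M_r

conv1d = nn.Conv1d(1, 1, kernel_size=3, padding=1, bias=False)  # assumed kernel size 3
weighted = [channel_weights(sf, conv1d) * sf for sf in SF]      # M_r ⊗ SF_r per branch
```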
8. The method according to claim 6, wherein the defogging network framework based on cross-layer attention feature interaction and multi-scale channel attention is constructed as follows:
a foggy image I is input into the coding layer (101): first a ReflectionPad2d reflection padding (1) is applied, then the feature map is passed to the first convolution layer (2) of the coding layer (101), which performs a convolution with a 7 × 7 kernel and stride 1, followed by normalization and a ReLU activation function, producing the feature map of convolution layer (2) with size H × W × C;
the feature map from convolution layer (2) passes through a second convolution layer (3), which performs a downsampling operation, specifically a convolution with a 3 × 3 kernel and stride 2, followed by normalization and a ReLU activation function; this downsampling changes the feature map size from H × W × C to H/2 × W/2 × 2C;
the feature map from convolution layer (3) passes through a third convolution layer (4), which performs a downsampling operation, specifically a convolution with a 3 × 3 kernel and stride 2, followed by normalization and a ReLU activation function; this downsampling changes the feature map size from H/2 × W/2 × 2C to H/4 × W/4 × 4C;
the feature map from convolution layer (4) then passes through 6 residual blocks (5); within each residual block (5), a convolution, normalization and activation function are applied first, and the result is finally added to the feature map before the operation; this mainly serves to obtain a feature map rich in semantic information, which is passed on to the decoding layer (103);
the output of residual block (6) passes through a multi-scale channel attention module (7) to fully mine the rich context semantic information contained in features of different scales, and then through a pixel attention module (8);
the convolution layer (4) is skip-connected through the cross-layer attention interaction module (9), which obtains features of different levels without changing the feature map and fuses the cross-layer features; the aggregated features containing edge texture and semantic information are passed to the corresponding decoding layer (103) through an appropriate weight distribution; the features of the cross-layer attention interaction module (9) are then concatenated (Concat) along the channel dimension with the feature map from the pixel attention module (8) and passed through an upsampling layer (10), whose deconvolution operation changes the feature map size from H/4 × W/4 × 4C to H/2 × W/2 × 2C, after which the result passes through the multi-scale channel attention module C and then a pixel attention module (8);
the downsampling convolution layer (3) in the coding layer (101) is skip-connected through the cross-layer attention interaction module (9), keeping the feature map size H/2 × W/2 × 2C; the result is concatenated (Concat) along the channel dimension with the output of upsampling layer (10) and passed through an upsampling layer (11), whose deconvolution operation changes the feature map size from H/2 × W/2 × 2C to H × W × C, after which it passes through the multi-scale channel attention module C and then a pixel attention module (8);
after the H × W × C feature map is obtained, reflection padding (12) restores it to the original image size, finally yielding the defogged image O.
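To make the stage ordering concrete, below is a minimal PyTorch skeleton of this forward pass; the attention modules are Identity placeholders, and the normalization choice and the final 7 × 7 output convolution are assumptions of the sketch, not specified by the claim.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.f = nn.Sequential(nn.Conv2d(ch, ch, 3, 1, 1), nn.InstanceNorm2d(ch), nn.ReLU(True))

    def forward(self, x):
        return x + self.f(x)

class DehazeSkeleton(nn.Module):
    """Stage ordering of this claim with placeholder attention modules (illustrative only)."""
    def __init__(self, C=64):
        super().__init__()
        self.inp = nn.Sequential(nn.ReflectionPad2d(3), nn.Conv2d(3, C, 7),        # (1) + (2)
                                 nn.InstanceNorm2d(C), nn.ReLU(True))
        self.down1 = nn.Sequential(nn.Conv2d(C, 2 * C, 3, 2, 1),                   # (3)
                                   nn.InstanceNorm2d(2 * C), nn.ReLU(True))
        self.down2 = nn.Sequential(nn.Conv2d(2 * C, 4 * C, 3, 2, 1),               # (4)
                                   nn.InstanceNorm2d(4 * C), nn.ReLU(True))
        self.res = nn.Sequential(*[ResBlock(4 * C) for _ in range(6)])             # (5)
        self.mca, self.pa = nn.Identity(), nn.Identity()                           # (7), (8)
        self.up1 = nn.Sequential(nn.ConvTranspose2d(8 * C, 2 * C, 3, 2, 1,         # (10)
                                                    output_padding=1), nn.ReLU(True))
        self.up2 = nn.Sequential(nn.ConvTranspose2d(4 * C, C, 3, 2, 1,             # (11)
                                                    output_padding=1), nn.ReLU(True))
        self.out = nn.Sequential(nn.ReflectionPad2d(3), nn.Conv2d(C, 3, 7))        # (12)

    def forward(self, I, B4, B2):
        # B4, B2: aggregated features from module B (9) at H/4 (4C ch) and H/2 (2C ch).
        e3 = self.down2(self.down1(self.inp(I)))
        d3 = self.pa(self.mca(self.res(e3)))
        d2 = self.pa(self.mca(self.up1(torch.cat([B4, d3], dim=1))))
        d1 = self.pa(self.mca(self.up2(torch.cat([B2, d2], dim=1))))
        return self.out(d1)  # defogged image O

O = DehazeSkeleton()(torch.randn(1, 3, 256, 256),
                     torch.randn(1, 256, 64, 64), torch.randn(1, 128, 128, 128))
print(O.shape)  # torch.Size([1, 3, 256, 256])
```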
9. The method according to one of claims 2 to 8, characterized in that the cross-layer attention interaction module B is constructed as follows:
first, the feature maps (201) of the different levels of the coding layer (101) are each passed through a global average pooling to obtain the corresponding channel descriptors (202); the channel descriptors (202) preserve the most significant features of each level's feature map, e.g. shallow features contain more detail texture information while deep features contain semantic information such as color and brightness;
the channel descriptors are concatenated (Concat) along the channel dimension to obtain a 1 × 1 × 448 feature vector (203); a convolution with a 1 × 1 kernel then produces a 1 × 1 × 128 feature vector (204), and after an activation function another 1 × 1 convolution produces a 1 × 1 × 3 feature vector (205);
the feature vector (205), a one-dimensional vector of size 1 × 1 × 3, is split with Chunk into three weights, which are multiplied with the corresponding feature maps (201) of the different levels of the coding layer (101); the resulting channel-weighted feature maps (206) realize the weight distribution over the coding layers (101);
to fuse the processed feature maps of the differently sized coding layers (101) into the decoding layer (103), the channel-weighted feature maps (206) are first converted to feature maps (207) of the same size with a bilinear interpolation function B and then concatenated (Concat) along the channel dimension to obtain the aggregated feature (208);
the aggregated feature (208) undergoes a convolution with a 3 × 3 kernel, stride 4 and 4C output channels to obtain a feature map (209) of size H/4 × W/4 × 4C;
the aggregated feature (208) undergoes a convolution with a 3 × 3 kernel, stride 2 and 2C output channels to obtain a feature map (210) of size H/2 × W/2 × 2C;
the aggregated feature (208) undergoes a convolution with a 3 × 3 kernel, stride 1 and C output channels to obtain a feature map (211) of size H × W × C.
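Assembling the pieces of this claim, a hedged PyTorch sketch of module B might read as follows (base width C = 64, so the concatenated descriptor has 7C = 448 channels; all names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerAttention(nn.Module):
    """Sketch of module B following this claim; names and base width are assumptions."""
    def __init__(self, C=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Conv2d(7 * C, 2 * C, 1),   # (203) 448 -> (204) 128
                                 nn.ReLU(True),
                                 nn.Conv2d(2 * C, 3, 1))       # -> (205) 1 x 1 x 3
        self.out4 = nn.Conv2d(7 * C, 4 * C, 3, stride=4, padding=1)  # -> (209)
        self.out2 = nn.Conv2d(7 * C, 2 * C, 3, stride=2, padding=1)  # -> (210)
        self.out1 = nn.Conv2d(7 * C, C, 3, stride=1, padding=1)      # -> (211)

    def forward(self, x1, x2, x3):
        # (201) -> (202): per-level channel descriptors via global average pooling
        g = torch.cat([F.adaptive_avg_pool2d(x, 1) for x in (x1, x2, x3)], dim=1)  # (203)
        w1, w2, w3 = torch.chunk(self.mlp(g), 3, dim=1)     # Chunk into three weights
        z1, z2, z3 = w1 * x1, w2 * x2, w3 * x3              # (206): weighted levels
        size = x1.shape[-2:]                                # common target size H x W
        f = torch.cat([F.interpolate(z, size=size, mode='bilinear', align_corners=False)
                       for z in (z1, z2, z3)], dim=1)       # (207) -> (208)
        return self.out4(f), self.out2(f), self.out1(f)     # (209), (210), (211)

m = CrossLayerAttention()
y4, y2, y1 = m(torch.randn(1, 64, 256, 256),
               torch.randn(1, 128, 128, 128), torch.randn(1, 256, 64, 64))
print(y4.shape, y2.shape, y1.shape)  # (1,256,64,64) (1,128,128,128) (1,64,256,256)
```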
10. The method according to one of claims 2 to 8, characterized in that the multi-scale channel attention module C is constructed as follows:
first, the input feature map (300) passes through a convolution kernel (310) with a 3 × 3 kernel, stride 1 and dilation rate 1 to obtain feature map (311); global average pooling then yields the corresponding feature vector (312); channel-weight learning with a 1 × 1 convolution yields feature vector (313); normalization with a Sigmoid activation function yields feature vector (314); finally, the obtained weights (314) are multiplied by the corresponding feature map (311) to obtain feature map (315);
the input feature map (300) passes through a convolution kernel (320) with a 3 × 3 kernel, stride 1 and dilation rate 2 to obtain feature map (321); global average pooling then yields the corresponding feature vector (322); channel-weight learning with a 1 × 1 convolution yields feature vector (323); normalization with a Sigmoid activation function yields feature vector (324); finally, the obtained weights (324) are multiplied by the corresponding feature map (321) to obtain feature map (325);
finally, the input feature map (300) passes through a convolution kernel (330) with a 3 × 3 kernel, stride 1 and dilation rate 3 to obtain feature map (331); global average pooling then yields the corresponding feature vector (332); channel-weight learning with a 1 × 1 convolution yields feature vector (333); normalization with a Sigmoid activation function yields feature vector (334); finally, the obtained weights (334) are multiplied by the corresponding feature map (331) to obtain feature map (335);
feature maps (315), (325) and (335) are fused to obtain the channel feature map (340) of the different receptive fields.
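Similarly, a minimal sketch of module C is given below; fusing the three branches by element-wise summation is an assumption of the sketch, since the claim only states that they are fused.

```python
import torch
import torch.nn as nn

class MultiScaleChannelAttention(nn.Module):
    """Sketch of module C following this claim; sum fusion and names are assumptions."""
    def __init__(self, ch):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=r, dilation=r) for r in (1, 2, 3)  # (310)/(320)/(330)
        )
        self.fc = nn.ModuleList(nn.Conv2d(ch, ch, 1) for _ in range(3))     # 1 x 1 weight learners

    def forward(self, x):
        fused = 0
        for conv, fc in zip(self.branches, self.fc):
            sf = conv(x)                             # (311) / (321) / (331)
            g = sf.mean(dim=(2, 3), keepdim=True)    # GAP -> (312) / (322) / (332)
            m = torch.sigmoid(fc(g))                 # (313) -> (314): normalized channel weights
            fused = fused + m * sf                   # weighted branch, accumulated into (340)
        return fused

y = MultiScaleChannelAttention(64)(torch.randn(1, 64, 64, 64))
print(y.shape)  # torch.Size([1, 64, 64, 64])
```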
CN202311472766.5A 2023-11-07 2023-11-07 Defogging method based on cross-layer attention feature interaction and multi-scale channel attention Pending CN117541505A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311472766.5A CN117541505A (en) 2023-11-07 2023-11-07 Defogging method based on cross-layer attention feature interaction and multi-scale channel attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311472766.5A CN117541505A (en) 2023-11-07 2023-11-07 Defogging method based on cross-layer attention feature interaction and multi-scale channel attention

Publications (1)

Publication Number Publication Date
CN117541505A (en) 2024-02-09

Family

ID=89794982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311472766.5A Pending CN117541505A (en) 2023-11-07 2023-11-07 Defogging method based on cross-layer attention feature interaction and multi-scale channel attention

Country Status (1)

Country Link
CN (1) CN117541505A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117726550A (en) * 2024-02-18 2024-03-19 成都信息工程大学 Multi-scale gating attention remote sensing image defogging method and system
CN117726550B (en) * 2024-02-18 2024-04-30 成都信息工程大学 Multi-scale gating attention remote sensing image defogging method and system
CN118212637A (en) * 2024-05-17 2024-06-18 山东浪潮科学研究院有限公司 Automatic image quality assessment method and system for character recognition

Similar Documents

Publication Publication Date Title
Ye et al. Perceiving and modeling density for image dehazing
CN111915530B (en) End-to-end-based haze concentration self-adaptive neural network image defogging method
CN110933429B (en) Video compression sensing and reconstruction method and device based on deep neural network
CN111709895A (en) Image blind deblurring method and system based on attention mechanism
CN117541505A (en) Defogging method based on cross-layer attention feature interaction and multi-scale channel attention
CN110147794A (en) A kind of unmanned vehicle outdoor scene real time method for segmenting based on deep learning
CN116342596B (en) YOLOv5 improved substation equipment nut defect identification detection method
CN112801901A (en) Image deblurring algorithm based on block multi-scale convolution neural network
CN112581409B (en) Image defogging method based on end-to-end multiple information distillation network
CN111369565A (en) Digital pathological image segmentation and classification method based on graph convolution network
CN114170286B (en) Monocular depth estimation method based on unsupervised deep learning
CN112329778A (en) Semantic segmentation method for introducing feature cross attention mechanism
CN114463218B (en) Video deblurring method based on event data driving
CN116469100A (en) Dual-band image semantic segmentation method based on Transformer
CN116205962B (en) Monocular depth estimation method and system based on complete context information
CN113139551A (en) Improved semantic segmentation method based on deep Labv3+
CN112949636A (en) License plate super-resolution identification method and system and computer readable medium
CN116486074A (en) Medical image segmentation method based on local and global context information coding
CN114821100A (en) Image compressed sensing reconstruction method based on structural group sparse network
CN114119694A (en) Improved U-Net based self-supervision monocular depth estimation algorithm
CN116229106A (en) Video significance prediction method based on double-U structure
CN115359370A (en) Remote sensing image cloud detection method and device, computer device and storage medium
CN117522674A (en) Image reconstruction system and method combining local and global information
CN116993987A (en) Image semantic segmentation method and system based on lightweight neural network model
CN116703750A (en) Image defogging method and system based on edge attention and multi-order differential loss

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination