CN117151990B - Image defogging method based on self-attention coding and decoding - Google Patents
- Publication number: CN117151990B
- Application number: CN202310774453.9A
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G06T3/4046 — Scaling of whole images or parts thereof, e.g. expanding or contracting, using neural networks
- G06N3/0442 — Recurrent networks characterised by memory or gating, e.g. LSTM or GRU
- G06N3/045 — Combinations of networks
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/082 — Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
- G06V10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V10/776 — Validation; performance evaluation
- G06V10/806 — Fusion of extracted features
- G06V10/82 — Image or video recognition or understanding using neural networks
- Y02A90/10 — Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses an image defogging method based on self-attention coding and decoding. The method first downsamples the image, then processes the downsampled feature map by fusing self-convolution with conventional convolution to improve the local feature extraction capacity of the whole convolution module; secondly, a residual dense block is introduced to convolve the feature map again; then, high- and low-level semantic information in features of different scales is fused through gating fusion units; the feature map is then upsampled to restore it to the size of the initial image; finally, the average absolute error loss function and the multi-scale similarity loss function are weighted to form a mixed loss function for model training, improving the subjective quality of the defogging results. The method lets the basic unit concentrate more on extracting high-frequency information, better highlights the important feature-map information in each channel, better extracts the important information in the image, and can recover defogged images that better match human visual perception.
Description
Technical Field
The invention relates to the technical field of computer vision image processing, in particular to an image defogging method based on self-attention coding and decoding.
Background
Image defogging algorithms aim to eliminate the noise interference caused by haze, improve the definition and color saturation of the image, and recover its detail information. Before the advent of deep learning, traditional defogging methods attracted the attention of many researchers and fall mainly into two types. Defogging methods based on image enhancement optimize the visual effect only by enhancing the image, strengthening certain information in the foggy image according to actual requirements without considering the fog formation mechanism or the imaging process of foggy images. Defogging methods based on a physical model take the atmospheric scattering model as the theoretical basis and estimate the unknown parameters of the model through statistical methods so as to restore a fog-free image.
In recent years, deep learning theory has been widely applied in the image defogging field, and some researchers train neural networks with large foggy-image data sets to perform defogging more efficiently. Image defogging methods based on convolutional neural networks can be classified into the following three types according to how image features are processed:
Defogging methods based on traditional convolutional neural networks: these methods rely on conventional convolutional neural networks and generally comprise image preprocessing, feature extraction, and feature recovery. However, traditional CNN defogging methods handle texture and edge information poorly, easily losing or blurring texture information, and lack effective extraction and utilization of global and local key information, so the defogging effect in complex scenes is poor.
Image defogging methods based on an attention mechanism: these methods use an attention mechanism to extract the important features of the image, making the defogging effect more accurate. Specifically, the attention mechanism computes a weight for each pixel in the image, making the defogging algorithm focus on the important areas. Volodymyr et al. first introduced an attention mechanism into an RNN model for image classification: the attention mechanism selects the region of the image to be processed, each current state determines the attention location based on the previous state and the currently input image, and only the pixels within the attended region are processed rather than all pixels of the whole image. The advantage is a reduction in the number of processed pixels and in task difficulty. However, most attention-based defogging methods perform feature extraction at only a single scale and directly apply the recovery operation to the weighted features, lacking the extraction and fusion of global multi-scale features.
Image defogging methods based on multi-scale feature fusion: these methods fuse image features of different scales to improve defogging accuracy. Specifically, such algorithms typically fuse the features extracted by the model at different scales to generate more accurate defogging results. A multi-scale CNN can extract effective features from the foggy image to estimate the transmission map, but because the atmospheric light is estimated together with the transmission map rather than learned, the atmospheric light estimate carries a large error, which affects the quality of the final defogged image.
In summary, compared with traditional convolutional neural network algorithms, attention-based image defogging algorithms can adaptively select the regions of interest and adjust to scene characteristics, focusing attention and improving the defogging effect. A multi-scale fusion mechanism can extract and fuse information at different scales, handling multi-scale information and recovering image details and colors better, thereby improving the defogging effect. However, the above attention-based methods still lack an understanding of the spatial and frequency-domain feature information of the input image itself. It is therefore particularly important to design a suitable attention mechanism for image defogging and to make reasonable use of multi-scale feature fusion.
Disclosure of Invention
Aiming at the problem that attention-based image defogging algorithms lack an understanding of the spatial and frequency-domain feature information of the input image, the invention provides an image defogging method based on self-attention coding and decoding.
The invention provides an image defogging method based on self-attention coding and decoding, which specifically comprises the following steps:
s1: and selecting the public image defogging data set as an image data set to be tested, dividing the data set into a training set and a testing set, and carrying out image preprocessing.
The OTS data set in RESIDE-beta and the Foggy data set in Cityscapes are adopted as experimental data sets, and the images containing traffic scenes in the OTS data set are mixed with the Foggy data set for the training task. OTS comprises 2,061 outdoor clear images, each clear image corresponding to 35 foggy images of different densities; Cityscapes contains 5,002 clear traffic-scene pictures, each clear picture corresponding to 3 foggy images of different densities, for a total of 15,006 traffic fog pictures. M images are selected as the training set and N images as the test set; at the same time, K real fog images containing traffic scenes are selected from RESIDE-beta as a test set for analyzing the defogging effect on real fog images. Of the selected images, 90% are used as the training set (M), 9% as real fog images (K), and 1% as the test set (N).
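For illustration, the 90/9/1 partition described above can be sketched as follows; the helper name and shuffling scheme are not part of the patent, only the split fractions follow the description:

```python
import random

def split_dataset(paths, seed=0):
    """Hypothetical 90/9/1 split into training (M), real-fog (K) and test (N)
    subsets; integer arithmetic keeps the split sizes deterministic."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train = n * 90 // 100          # 90% -> training set M
    n_real = n * 9 // 100            # 9%  -> real fog images K
    train = paths[:n_train]
    real_fog = paths[n_train:n_train + n_real]
    test = paths[n_train + n_real:]  # remaining ~1% -> test set N
    return train, real_fog, test
```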
Image preprocessing refers to a series of processing steps applied to the original image before further processing in a computer vision task, so as to improve the effect of subsequent tasks or reduce errors. These steps may include, but are not limited to, the following:
1. Noise reduction: remove noise from the image so that objects in it can be identified and analyzed more reliably.
2. Resizing: resize the image to fit a particular task or model.
3. Cropping: remove unnecessary parts of the image to improve the effect of subsequent tasks.
4. Rotation and flipping: rotate or flip the image to better match the expected input of the model.
5. Normalization and standardization: normalize or standardize the pixel values to better match the input requirements of the model.
6. Contrast and brightness adjustment: increase or decrease the contrast and brightness of the image so that objects in it can be identified and analyzed more easily.
7. Color space conversion: convert the image from one color space to another to better suit the needs of a particular task.
The image preprocessing can help to improve the accuracy and efficiency of the computer vision task, thereby better meeting the demands of users.
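Two of the steps above (resizing and normalization) can be sketched in a minimal form; the nearest-neighbour resize and the single-channel assumption are illustrative simplifications, not the patent's actual pipeline:

```python
import numpy as np

def preprocess(img, size=(256, 256)):
    """Minimal preprocessing sketch for a single-channel uint8 image:
    nearest-neighbour resize (step 2) followed by normalisation to
    [0, 1] (step 5)."""
    h, w = img.shape
    rows = np.arange(size[0]) * h // size[0]   # source row for each target row
    cols = np.arange(size[1]) * w // size[1]   # source column for each target column
    resized = img[rows][:, cols]
    return resized.astype(np.float32) / 255.0
```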
S2: downsampling the preprocessed images in the training set, and processing the downsampled feature map through self-convolution and conventional convolution fusion; the specific operation method is as follows:
A 3×3 max pooling is adopted as the downsampling layer, and the downsampling layer is applied several times in the encoding stage to acquire multi-scale feature information; self-convolution based on an attention mechanism is introduced after the downsampling layer, and the features extracted by conventional convolution are fused with the features extracted by self-convolution.
The self-convolution based on the attention mechanism operates as follows:
(1) Generating a convolution kernel related to the length and width of the initial image X through a kernel generating function phi (X), and expanding the generated single convolution kernel into a convolution kernel group with the channel dimension of C; the kernel generation function phi (X) is:
φ(X) = W1·ξ(W0(X))

where X represents the input image; W0 and W1 represent two linear transformations; and ξ represents a nonlinear activation function;
(2) The generated convolution kernel group is convolved with the corresponding positions of the initial image X, the self-convolution and ordinary convolution results are aggregated within the K×K neighborhood, and the feature map of the initial image is finally output.
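A toy numpy sketch of the two steps above: each pixel's kernel is generated from its own feature vector via φ = W1·ξ(W0·x) and shared across the C channels. The fusion with the ordinary-convolution branch is omitted here, and all shapes and names are illustrative rather than the patent's implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def self_convolution(X, W0, W1, K=3):
    """Toy position-dependent convolution: for every pixel a K*K kernel is
    generated from that pixel's feature vector, then applied to the K*K
    neighbourhood, shared over all C channels of the input X (C, H, W)."""
    C, H, W = X.shape
    pad = K // 2
    Xp = np.pad(X, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(X, dtype=np.float64)
    for i in range(H):
        for j in range(W):
            feat = relu(W0 @ X[:, i, j])          # W0: (r, C) channel reduction
            kernel = (W1 @ feat).reshape(K, K)    # W1: (K*K, r) kernel head
            patch = Xp[:, i:i + K, j:j + K]       # (C, K, K) neighbourhood
            out[:, i, j] = (patch * kernel).sum(axis=(1, 2))
    return out
```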
S3: and (3) introducing a residual error dense block, performing convolution operation on the feature map again, and increasing the extraction of the feature information to finish the coding of the fog map.
The specific operation of the residual dense block is as follows: based on the original residual structure, the input is skip-connected to each residual block, and the output of each convolution layer within a residual block is skip-connected to the output of that residual block.

The original residual structure includes: two convolution blocks with kernel size 3 and one convolution block with kernel size 5, an identity mapping module, 3 ReLU activation function modules, and an aggregation module. The calculation formula is as follows:

F = Y − X

where F is the residual to be learned; X and Y are the input and output feature maps, respectively.
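The skip-connection pattern can be sketched as follows (toy numpy version; the `stages` callables stand in for the 3×3/5×5 convolution blocks, so only the wiring is shown):

```python
import numpy as np

def residual_dense_block(x, stages):
    """Toy residual-dense wiring: every stage receives the sum of the input
    and all earlier stage outputs (dense skips), and the block finally adds
    the identity, so the stages only need to learn the residual F = Y - X."""
    feats = [x]
    for stage in stages:
        feats.append(stage(sum(feats)))   # dense: all previous features feed in
    return x + sum(feats[1:])             # identity skip around the whole block
```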
S4: and through a plurality of gating fusion units, different scale features are fused, so that high-level and low-level semantic information fusion in the multi-scale features is realized.
A single gated fusion unit consists of one convolution layer and one aggregation layer. The convolution kernel sizes are assigned as 7×7, 5×5, 3×3, and 3×3, respectively, according to the feature map sizes.
The specific operation of the step is as follows:
First, the upper-layer network feature F_i and the lower-layer feature F_{i+1} are extracted and input into the gating fusion unit; its output weights Q_i and Q_{i+1} are matched to the specific numbers of feature channels of the upper and lower layers, one weight corresponding to each feature channel;
Finally, the feature maps of the different layers are linearly combined with the weights output by the gating fusion units, and the combined feature maps are further sent to the corresponding decoders to obtain the target fog-image residual; the mathematical expression is:

Q_1, Q_2, Q_3, …, Q_{i+1} = g_i(F_1, F_2, F_3, …, F_{i+1})

where g_i represents the gating fusion module of the i-th layer; F_i represents the i-th-layer feature map; Q_i represents the weight of the combined output of the i-th and (i−1)-th layers; F_oi represents the feature map of the i-th layer combined with the (i−1)-th layer.
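A toy numpy sketch of one gating fusion unit: per-pixel weights are produced from the concatenated upper/lower-layer features (the matrix `W` stands in for a 1×1 convolution), then the two feature maps are linearly combined. Names, shapes and the softmax normalisation are illustrative assumptions:

```python
import numpy as np

def softmax(z, axis=0):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(f_upper, f_lower, W):
    """f_upper, f_lower: (C, H, W) feature maps; W: (2, 2C) acts as a 1x1
    convolution producing one gating weight map per input branch."""
    stacked = np.concatenate([f_upper, f_lower], axis=0)   # (2C, H, W)
    logits = np.einsum('kc,chw->khw', W, stacked)          # (2, H, W)
    q = softmax(logits, axis=0)                            # weights sum to 1 per pixel
    return q[0] * f_upper + q[1] * f_lower                 # linear combination
```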
S5: upsampling by taking bilinear interpolation as an upsampling layer, gradually restoring the size of the feature map to the size of the initial image by adopting a plurality of upsampling layers in a decoding stage, and finishing decoding of the fog map;
s6: during model training, weighting an average absolute error loss function and a multi-scale similarity loss function to form a mixed loss function, so as to obtain model parameters; wherein, the formula of the mixing loss function is as follows:
L_mix = α·L_MS-SSIM + (1 − α)·G·L_1

where L_1 represents the average absolute error loss function; L_MS-SSIM represents the multi-scale similarity loss function; α represents the weight within the mixed loss function; and G is a weight coefficient obtained by applying Gaussian convolution to the error;
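A minimal sketch of the mixed loss, with two simplifying assumptions: the multi-scale SSIM term is replaced by a single-window global SSIM, and G is taken as a scalar stand-in for the Gaussian-convolution weight, so only the weighting scheme of L_mix is illustrated:

```python
import numpy as np

def l1_loss(pred, target):
    return np.abs(pred - target).mean()

def ssim_global(pred, target, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM: a simplified stand-in for the multi-scale version."""
    mu_p, mu_t = pred.mean(), target.mean()
    cov = ((pred - mu_p) * (target - mu_t)).mean()
    num = (2 * mu_p * mu_t + c1) * (2 * cov + c2)
    den = (mu_p ** 2 + mu_t ** 2 + c1) * (pred.var() + target.var() + c2)
    return num / den

def mixed_loss(pred, target, alpha=0.84, g=1.0):
    """alpha and g are illustrative values; SSIM becomes a loss as 1 - SSIM."""
    return (alpha * (1.0 - ssim_global(pred, target))
            + (1.0 - alpha) * g * l1_loss(pred, target))
```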
s7: the preprocessed image test set is input into the trained self-attention coding-and-decoding defogging neural network model for testing, so as to obtain the defogged images.
Compared with the prior art, the invention has the following advantages:
(1) The self-attention-based coding and decoding method provided by the invention fully considers, from the network structure, the problem of feature loss in deep neural networks: the layer-by-layer connected multi-scale gating fusion units transmit the image feature information lost at each layer to the corresponding layer, so that the network does not lose too much of the feature information relevant to human vision, such as image textures and colors.

(2) The improved self-convolution module attends to more detailed features of the image during feature extraction, making the restored image clearer; the residual dense block optimizes computation, making the model converge quickly and avoiding the reduction in image recovery quality caused by increased network depth; finally, a composite loss function of L_1 and L_MS-SSIM is adopted in training, enhancing the perceptual quality of the restored image. The defogged image better matches human visual perception subjectively, and objective indices also show that its quality is better, so the method can provide sufficient data-preprocessing support for other image processing fields.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention.
Drawings
Fig. 1 is a network configuration diagram of the image defogging method based on self-attention encoding and decoding of the present invention.
Fig. 2 is a block diagram of a modified self-convolution module.
Fig. 3 is a block diagram of a residual dense module.
Fig. 4 is a graph of the effect of each defogging method on the composite fog map in the OTS subset data set. In the figure, (a) a synthetic hazy image; (b) DCP process; (c) a Dehaze method; (d) AOD process; (e) GMAN process; (f) an ERAN method; (g) MFFID method; (h) the process of the invention; (i) a haze-free image.
FIG. 5 is a graph showing the effect of each defogging method on a composite fog map in the Foggy subset of data. (a) a synthetic hazy image; (b) DCP process; (c) a Dehaze method; (d) AOD process; (e) GMAN process; (f) an ERAN method; (g) MFFID method; (h) the process of the invention; (i) a haze-free image.
Fig. 6 is an effect diagram of each defogging method on a real fog map. (a) a true hazy image; (b) DCP process; (c) a Dehaze method; (d) AOD process; (e) GMAN process; (f) an ERAN method; (g) MFFID method; (h) the process of the invention.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.
As shown in figs. 1-3, the image defogging method based on self-attention coding and decoding provided by the invention first adds several improved self-convolution modules to the encoder-decoder; based on an attention mechanism, these modules can adaptively assign weights to different positions of the image, prioritizing spatial-domain information. Secondly, ordinary convolution blocks are replaced with residual dense blocks so as to reduce or eliminate vanishing gradients, enhance information flow and feature reuse, and optimize computation. Because the skip connections of the U-shaped structure pass the shallow features of the encoder directly to the deep features of the corresponding decoder, high- and low-level semantic information is easily ignored; the network therefore adopts several gating units of different scales to fuse the upper and lower feature layers, and these modules can aggregate semantic information over a wider space, further reducing feature loss. Finally, the fused features are connected layer by layer to the corresponding decoders, and the clear image is restored after decoder processing.
Training process and inference:
In the experiments, an Adam optimizer is adopted to optimize training; the initial learning rate is 1×10⁻⁴; the batch training size is 16. Training and validation images are resized to 256×256 for OTS and 512×1024 for Foggy; the size of the test images is not limited.
Experimental evaluation:
the performance of the image defogging method (Our) of the present invention and the advanced defogging method (DCP method, dehaze method, AOD method, GMANmethod, ERAN method, MFFID method) in RESIDE-beta and in Cityscapes respectively were tested and compared. Fig. 4 and 5 are graphs of the effect of each defogging method on the composite fog map in the OTS and Foggy sub-data sets, respectively. Table 1 shows the objective evaluation index comparison of different comparison algorithms on the synthetic fog pattern.
TABLE 1 objective evaluation results of different defogging methods on synthetic haze patterns
It can be seen that the SSIM value of our method reaches 0.9207, which is 0.299 higher than the traditional DCP method and on average 0.0551 higher than the other neural network methods; the MS-SSIM value reaches 0.9762, which is 0.156 higher than DCP and on average 0.0312 higher than the other neural network methods; and the PSNR reaches 26.531 dB, which is 12.11 dB higher than DCP and on average 3.171 dB higher than the other neural network methods. The experimental results show that our method outperforms the others on objective evaluation indices for synthetic fog images: the difference between the defogged and clear images is smaller, fog can be removed effectively, the structural information of the images before and after defogging is more similar, and detail information such as image edges and colors is handled better than by the comparison methods.
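PSNR, one of the indices quoted above, follows the standard definition; a minimal version for images scaled to [0, 1] looks like this:

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```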
Fig. 6 shows the effect of each defogging method on the real fog maps, and Table 2 compares the objective evaluation indices of the different defogging algorithms on the real fog maps. It can be seen that, because DCP over-enhances boundary information and the no-reference indices are extremely sensitive to pixel-edge variation in the image, DCP's values on these two indices are abnormally high and not comparable. The GMAN method likewise over-sharpens edges in the subjective evaluation, and its evaluation indices are higher than those of the ERAN and MFFID methods and the method of the present invention, which have better visual effects. Apart from DCP and GMAN, the two networks with poor subjective visual performance, the method of the present invention scores higher than the other comparison methods on every index. The experimental results show that our method can recover fine contrast details and texture changes in the image; the edges of the recovered image are clearer, and the overall visual effect is better than that of the comparison methods.
TABLE 2 objective evaluation results of different defogging methods on real fog patterns
The present invention is not limited to the above embodiments; any modifications, equivalents, and improvements made without departing from the spirit and scope of the invention are intended to be covered by it.
Claims (6)
1. A self-attention encoding and decoding-based image defogging method, comprising the steps of:
s1: selecting a public image defogging data set as an image data set to be tested, dividing the data set into a training set and a testing set, and carrying out image preprocessing;
s2: downsampling the preprocessed images in the training set, and processing the downsampled feature map through self-convolution and conventional convolution fusion; the specific operation is as follows:
adopting 3×3 max pooling as the downsampling layer, and applying the downsampling layer several times in the encoding stage to acquire multi-scale feature information; introducing attention-mechanism-based self-convolution after the downsampling layer, and fusing the features extracted by conventional convolution with the features extracted by self-convolution;
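For illustration, the 3×3 max-pooling downsampling layer described above can be sketched as follows; the stride of 2 and edge padding are assumptions for this sketch, since the claim only fixes the 3×3 window.

```python
import numpy as np

def max_pool_3x3(x, stride=2):
    """3x3 max pooling over a single-channel map x (H, W).
    Stride and -inf padding are illustrative assumptions."""
    H, W = x.shape
    pad = 1
    xp = np.pad(x, pad, mode="constant", constant_values=-np.inf)
    out_h = (H + 2 * pad - 3) // stride + 1
    out_w = (W + 2 * pad - 3) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # maximum over each 3x3 window of the padded input
            out[i, j] = xp[i * stride:i * stride + 3,
                           j * stride:j * stride + 3].max()
    return out
```

Applying this layer repeatedly halves the spatial resolution at each encoding stage, yielding the multi-scale feature maps that the self-convolution branch then processes.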
s3: introducing residual dense blocks, performing further convolution operations on the feature map to enhance the extraction of feature information, and completing encoding of the fog map;
s4: fusing features of different scales through a plurality of gating fusion units, realizing the fusion of high- and low-level semantic information in the multi-scale features;
s5: upsampling by taking bilinear interpolation as an upsampling layer, gradually restoring the size of the feature map to the size of the initial image by adopting a plurality of upsampling layers in a decoding stage, and finishing decoding of the fog map;
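The bilinear-interpolation upsampling layer of step S5 can be sketched as below; this is a minimal NumPy version assuming the align-corners sampling convention, not the patent's exact implementation.

```python
import numpy as np

def bilinear_upsample(x, scale=2):
    """Bilinear upsampling of a single-channel map x (H, W) by an integer
    scale factor, using the align-corners convention (an assumption)."""
    H, W = x.shape
    H2, W2 = H * scale, W * scale
    # target sample coordinates in source-pixel space
    ys = np.linspace(0, H - 1, H2)
    xs = np.linspace(0, W - 1, W2)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    # interpolate horizontally on the two bracketing rows, then vertically
    top = (1 - wx) * x[np.ix_(y0, x0)] + wx * x[np.ix_(y0, x1)]
    bot = (1 - wx) * x[np.ix_(y1, x0)] + wx * x[np.ix_(y1, x1)]
    return (1 - wy) * top + wy * bot
```

Stacking several such layers in the decoder progressively restores the feature map to the initial image size, mirroring the repeated max-pooling of the encoder.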
s6: during model training, weighting a mean absolute error loss function and a multi-scale similarity loss function to form a mixed loss function, so as to obtain the model parameters; the formula of the mixed loss function is as follows:
L_mix = α·L_MS-SSIM + (1 − α)·G·L_1
wherein L_1 represents the mean absolute error loss function, L_MS-SSIM represents the multi-scale similarity loss function, α represents the weight in the mixed loss function, and G is a weight coefficient obtained by applying Gaussian convolution to the error;
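The mixed loss can be sketched as follows. This is a simplified NumPy stand-in: the MS-SSIM term is approximated by single-scale, global-statistics SSIM, and G is taken as a Gaussian weighting of the per-pixel absolute error; the value α = 0.84 is an illustrative choice, not taken from the patent.

```python
import numpy as np

def gaussian_kernel(size=11, sigma=1.5):
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()  # normalised 2-D Gaussian

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # global-statistics SSIM (simplified: no sliding window, single scale)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def mixed_loss(pred, target, alpha=0.84):
    """L_mix = alpha * L_MS-SSIM + (1 - alpha) * G * L_1 (sketch)."""
    l_ssim = 1.0 - ssim(pred, target)          # stand-in for L_MS-SSIM
    g = gaussian_kernel(pred.shape[0])          # assumes square inputs here
    l1 = (g * np.abs(pred - target)).sum()      # Gaussian-weighted L1 error
    return alpha * l_ssim + (1 - alpha) * l1
```

The SSIM term preserves structure while the Gaussian-weighted L1 term keeps pixel-wise fidelity, which is the rationale for mixing the two losses.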
s7: inputting the preprocessed image test set into the trained self-attention encoding-and-decoding neural network model for testing, thereby obtaining the defogged images.
2. The image defogging method based on self-attention encoding and decoding according to claim 1, wherein in step S2 the attention-mechanism-based self-convolution operates as follows:
(1) generating a convolution kernel related to the height and width of the initial image X through a kernel-generating function φ(X), and expanding the generated single convolution kernel into a convolution kernel group whose channel dimension is C; the kernel-generating function φ(X) is:
φ(X) = W_1 ξ(W_0(X))
wherein X represents the input image, W_0 and W_1 represent two linear transformations, and ξ represents a nonlinear activation function;
(2) performing a convolution operation between the generated convolution kernel group and the corresponding positions of the initial image X, aggregating the self-convolution and ordinary convolution results within a K×K neighborhood, and finally outputting the feature map of the initial image.
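The position-specific kernel generation of this claim resembles involution-style self-convolution, and a minimal single-channel sketch is given below. The shapes of W_0 and W_1 and the kernel-per-pixel simplification are illustrative assumptions; the fusion with ordinary convolution is omitted here.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def self_convolution(X, W0, W1, K=3):
    """Self-convolution on a single-channel map X (H, W): a K*K kernel is
    generated at each position by phi(X) = W1 xi(W0(X)) and aggregated with
    the K*K neighbourhood. W0 (r,) and W1 (K*K, r) are illustrative shapes."""
    H, W = X.shape
    pad = K // 2
    Xp = np.pad(X, pad, mode="edge")
    out = np.zeros_like(X)
    for i in range(H):
        for j in range(W):
            # kernel-generating function: two linear maps with a ReLU between
            kernel = (W1 @ relu(W0 * X[i, j])).reshape(K, K)
            patch = Xp[i:i + K, j:j + K]
            # aggregate the generated kernel with the local neighbourhood
            out[i, j] = (kernel * patch).sum()
    return out
```

Unlike an ordinary convolution, the kernel here varies with spatial position (it is derived from the input itself), which is what gives the operation its attention-like, content-adaptive character.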
3. The image defogging method based on self-attention encoding and decoding according to claim 1, wherein in step S3 the specific operation of the residual dense block is: on the basis of the original residual structure, the input is skip-connected to each residual block, and the output of each convolution layer within a residual block is skip-connected to the output of that residual block.
4. The image defogging method based on self-attention encoding and decoding according to claim 3, wherein the original residual structure comprises: two convolution blocks with kernel size 3 and one convolution block with kernel size 5, an identity mapping module, three ReLU activation function modules and an aggregation module; the calculation formula is as follows:
F = Y − X
where F is the residual to be learned, and X and Y are the input and output feature maps, respectively.
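The residual structure F = Y − X can be sketched as below; the conv layers are stand-in functions rather than real 3×3/5×5 convolutions, so that the identity-plus-residual arithmetic stays checkable.

```python
import numpy as np

def residual_block(x, conv_layers):
    """Sketch of the residual structure: stacked conv blocks (two 3x3 and
    one 5x5 in the claim) learn the residual F, and the identity mapping
    adds the input back. `conv_layers` are stand-in layer functions."""
    h = x
    for conv in conv_layers:
        h = np.maximum(conv(h), 0.0)  # conv block followed by ReLU
    return h + x                      # aggregation: Y = F + X, i.e. F = Y - X
```

Because the block only has to learn the residual F rather than the full mapping Y, gradients flow through the identity path and deeper stacks train more easily, which is why the patent builds its dense blocks from this structure.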
5. The image defogging method based on self-attention encoding and decoding according to claim 1, wherein in step S4 each single gating fusion unit consists of one convolution layer and one aggregation layer; the specific operations of this step are as follows:
(1) the upper-layer feature F_i and the lower-layer feature F_{i+1} of the network are extracted and input into the gating fusion unit, which outputs the weights Q_i and Q_{i+1}; the weights are matched to the specific numbers of feature channels of the upper and lower layers, one weight corresponding to each feature channel;
(2) the feature maps of the different layers are linearly combined with the weights output by the gating fusion unit, and the combined feature map is fed into the corresponding decoder to obtain the residual of the target fog map; the mathematical expression is:
Q_1, Q_2, Q_3, …, Q_{i+1} = g_i(F_1, F_2, F_3, …, F_{i+1})
wherein g_i denotes the gating fusion module of the i-th layer, F_i denotes the i-th-layer feature map, Q_i denotes the weight output for the combination of the i-th and (i−1)-th layers, and F_oi denotes the feature map obtained by combining the i-th and (i−1)-th layers.
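A two-input gating fusion unit can be sketched as follows; the 1×1-convolution gate is replaced by a per-pixel linear map plus softmax, and the scalar gate parameters are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=0):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(F_hi, F_lo, w_hi=1.0, w_lo=1.0):
    """Single gating fusion unit (sketch): a linear map standing in for the
    convolution layer produces per-pixel weights Q for the two feature maps,
    which the aggregation layer then linearly combines."""
    logits = np.stack([w_hi * F_hi, w_lo * F_lo])  # convolution-layer stand-in
    Q = softmax(logits, axis=0)                    # gate weights sum to 1
    return Q[0] * F_hi + Q[1] * F_lo               # linear combination
```

Because the gate weights are computed from the features themselves, the unit can emphasise high-level semantics in some regions and low-level detail in others, which is the point of fusing multi-scale features this way.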
6. The image defogging method based on self-attention encoding and decoding according to claim 1, wherein in step S1 the OTS data set in RESIDE-β and the Foggy data set in Cityscapes are adopted as the experimental data sets; images containing traffic scenes from the OTS data set are mixed with the Foggy data set for training, a portion of the images being selected as the training set and a portion as the test set, while a number of real foggy images containing traffic scenes are also selected from RESIDE-β as a test set for analyzing the defogging effect on real foggy images.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310774453.9A CN117151990B (en) | 2023-06-28 | 2023-06-28 | Image defogging method based on self-attention coding and decoding |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117151990A CN117151990A (en) | 2023-12-01 |
CN117151990B true CN117151990B (en) | 2024-03-22 |
Family
ID=88906888
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117726550B (en) * | 2024-02-18 | 2024-04-30 | 成都信息工程大学 | Multi-scale gating attention remote sensing image defogging method and system |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110738622A (en) * | 2019-10-17 | 2020-01-31 | 温州大学 | Lightweight neural network single image defogging method based on multi-scale convolution |
CN110992270A (en) * | 2019-12-19 | 2020-04-10 | 西南石油大学 | Multi-scale residual attention network image super-resolution reconstruction method based on attention |
CN111445418A (en) * | 2020-03-31 | 2020-07-24 | 联想(北京)有限公司 | Image defogging method and device and computer equipment |
CN111861925A (en) * | 2020-07-24 | 2020-10-30 | 南京信息工程大学滨江学院 | Image rain removing method based on attention mechanism and gate control circulation unit |
CN113628152A (en) * | 2021-09-15 | 2021-11-09 | 南京天巡遥感技术研究院有限公司 | Dim light image enhancement method based on multi-scale feature selective fusion |
CN113962878A (en) * | 2021-07-29 | 2022-01-21 | 北京工商大学 | Defogging model method for low-visibility image |
CN114638960A (en) * | 2022-03-22 | 2022-06-17 | 平安科技(深圳)有限公司 | Model training method, image description generation method and device, equipment and medium |
CN114821765A (en) * | 2022-02-17 | 2022-07-29 | 上海师范大学 | Human behavior recognition method based on fusion attention mechanism |
CN114862713A (en) * | 2022-04-29 | 2022-08-05 | 西安理工大学 | Two-stage image rain removing method based on attention smooth expansion convolution |
CN114881871A (en) * | 2022-04-12 | 2022-08-09 | 华南农业大学 | Attention-fused single image rain removing method |
CN115546046A (en) * | 2022-08-30 | 2022-12-30 | 华南农业大学 | Single image defogging method fusing frequency and content characteristics |
CN115700882A (en) * | 2022-10-21 | 2023-02-07 | 东南大学 | Voice enhancement method based on convolution self-attention coding structure |
CN115705493A (en) * | 2021-08-11 | 2023-02-17 | 暨南大学 | Image defogging modeling method based on multi-feature attention neural network |
CN115880170A (en) * | 2022-12-05 | 2023-03-31 | 华南理工大学 | Single-image rain removing method and system based on image prior and gated attention learning |
CN116013297A (en) * | 2022-12-17 | 2023-04-25 | 西安交通大学 | Audio-visual voice noise reduction method based on multi-mode gating lifting model |
CN116071549A (en) * | 2023-02-16 | 2023-05-05 | 河南稳健科技有限公司 | Multi-mode attention thinning and dividing method for retina capillary vessel |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116152082A (en) * | 2021-11-15 | 2023-05-23 | 三星电子株式会社 | Method and apparatus for image deblurring |
Non-Patent Citations (8)

Title |
---|
Gated Fusion Network for Single Image Dehazing; Wenqi Ren et al.; 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018-12-16; 3253-3261 * |
Involution: Inverting the Inherence of Convolution for Visual Recognition; Duo Li et al.; 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021-11-02; 12316-12325 * |
A dehazing algorithm based on cycle generative adversarial networks; Li Xiaowen et al.; Journal of Southwest China Normal University (Natural Science Edition); 2020-09-20 (No. 09); 132-138 * |
A two-stage image dehazing network based on deep learning; Wu Jiawei et al.; Computer Applications and Software; 2020-04-12 (No. 04); 197-202 * |
An event detection method fusing dependency and semantic information with a gating mechanism; Chen Jiali et al.; Journal of Chinese Information Processing; 2020-08-15 (No. 08); 55-64 * |
An image inpainting method based on multi-loss constraints and attention blocks; Cao Zhen et al.; Journal of Shaanxi University of Science & Technology; 2020-06-16 (No. 03); 164-171 * |
An image dehazing network based on residual dense blocks and attention mechanism; Li Shuoshi et al.; Journal of Hunan University (Natural Sciences); 2021-06-30; Vol. 48, No. 6; 112-118 * |
A multi-stage low-light image enhancement network with attention mechanism; Chen Guihui et al.; Journal of Computer Applications; 2023-02-10; Vol. 43, No. 2; 552-559 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||