CN111915531B - Neural network image defogging method based on multi-level feature fusion and attention guidance - Google Patents
- Publication number
- CN111915531B CN111915531B CN202010781155.9A CN202010781155A CN111915531B CN 111915531 B CN111915531 B CN 111915531B CN 202010781155 A CN202010781155 A CN 202010781155A CN 111915531 B CN111915531 B CN 111915531B
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/73—Deblurring; Sharpening
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a neural network image defogging method with multi-level feature fusion and attention guidance, which comprises the following steps: constructing an image defogging model; acquiring foggy image data and extracting feature maps representing different stages through a feature extraction module; fusing the feature maps obtained at different stages by point-wise element multiplication in the multi-level feature fusion module of the defogging model, using the complementarity of low-level and high-level features to guide the network toward better recovery of a clear image; reconstructing the features produced by the multi-level feature fusion module into a clear fog-free image through a residual mixed attention module; and calculating the mean square error and perceptual loss between the restored image and the corresponding clear image and updating the image defogging model, with the two loss functions, a mean square error loss and a perceptual loss, jointly optimizing the defogging model. With this technical scheme, defogging enhancement is applied to actually captured foggy images, high-quality images are recovered, and the method has good practicability.
Description
Technical Field
The invention relates to the technical field of image processing, in particular to a neural network image defogging method with multi-level feature fusion and attention guidance.
Background
Low visibility in severe weather (heavy fog, heavy rain) is a major problem for most computer vision techniques applied in real scenes. Most automatic monitoring, autonomous driving, and outdoor target recognition systems assume that the incoming video and images have clear visibility. However, this ideal condition is often not satisfied, so enhancing low-quality images and video is an unavoidable task. Among such tasks, image defogging is a representative image quality enhancement problem. The process by which a clear image becomes foggy can be described by the atmospheric scattering model proposed by McCartney et al.:
I = tJ + A(1 − t),
t(x) = e^(−βd(x)),
wherein I is the foggy image, t is the medium transmission, J is the clear image, A is the atmospheric light, d is the scene depth, and β is the atmospheric scattering coefficient. In this model I is the known quantity, and the goal of the image defogging task is to estimate A and t and then generate a sharp image. Image defogging is an ill-posed problem. Over the past 20 years, researchers have developed many image defogging algorithms to process images taken in complex foggy scenes. Early algorithms mainly focused on estimating the depth information of images from multiple images and atmospheric cues to achieve defogging. For example, Narasimhan et al. propose a physics-based method to locate depth discontinuities and compute scene structure from two images of the same scene captured under different weather conditions. In addition, a series of algorithms enhance the visual quality of defogging by means of prior information; a typical one is the dark channel prior (DCP) defogging method proposed in 2009 by He et al., based on the observation and statistics that in most non-sky local regions of a foggy image, some pixels always have at least one color channel with a very low value. As another example, Zhu Qingsong et al. propose the color attenuation prior (CAP). With such a prior, t is estimated and a clear image is recovered through the atmospheric scattering model. These priors improve defogging performance to some extent. However, each prior relies on estimating a particular characteristic of the image and is often not suitable for real scenes.
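The forward model above can be sketched numerically as follows (a minimal illustrative example, not part of the patent: the function name `apply_haze` is invented here, and the atmospheric light A is taken as a scalar for simplicity):

```python
import numpy as np

def apply_haze(J, d, A=0.8, beta=1.0):
    """Synthesize a foggy image I from a clear image J using the
    atmospheric scattering model: I = t*J + A*(1 - t), t = exp(-beta*d).
    J: clear image, H x W x 3, values in [0, 1]
    d: per-pixel scene depth, H x W
    A: global atmospheric light (assumed scalar here)
    beta: atmospheric scattering coefficient
    """
    t = np.exp(-beta * d)[..., None]  # medium transmission, H x W x 1
    return t * J + A * (1.0 - t)
```

At zero depth the transmission is 1 and the image is unchanged; as depth grows, t falls toward 0 and every pixel tends to the atmospheric light A, matching the whitish appearance of distant objects in fog.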
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide a multi-level feature fusion and attention-guided neural network image defogging method that performs defogging enhancement on actually captured foggy images and recovers high-quality images.
In order to achieve the above purpose, the present invention provides the following technical solutions: a neural network image defogging method with multi-level feature fusion and attention guidance comprises the following steps:
s1, constructing an image defogging model; the image defogging model comprises a feature extraction module, a multi-level feature fusion module and a residual mixed convolution attention module;
s2, acquiring foggy image data, first converting the foggy image into 16 feature maps through a convolution layer; the feature maps are then processed through the four stages of the feature extraction module to obtain features of different levels;
s3, the multi-level feature fusion module fuses the feature maps obtained at different stages by point-wise element multiplication, using the complementarity of low-level and high-level features to guide the network toward better recovery of a clear image;
s4, the features produced by the multi-level feature fusion module pass through the residual mixed convolution attention module to obtain a weight map of the same size as the input; the weight map, produced by an attention layer designed on the attention mechanism, guides the network to discard redundant information and focus on feature information effective for restoring a clear image; meanwhile, the depthwise separable convolution adopted in the residual mixed convolution attention module improves its training and running efficiency; after this module, the features are finally reconstructed into a clear haze-free image;
s5, calculating the mean square error and perceptual loss between the restored image and the corresponding clear image, and updating the image defogging model; the mean square error measures the deviation between the restored image and the corresponding clear image, while the perceptual loss helps the model perceive the image at a higher level, making the restored image more realistic; the two loss functions, mean square error and perceptual loss, cooperate to jointly optimize the defogging model.
Preferably, step S5 specifically includes:
calculating a mean square error and a perceptual loss for the restored image and the corresponding clear image, wherein the first loss function is the mean square error loss, with the formula:
L_mse = (1/(3WH)) Σ_{c=1..3} Σ_{i=1..W} Σ_{j=1..H} (I_re(i, j, c) − I_gt(i, j, c))²,
wherein W and H are the width and height of the image, I_re and I_gt are the restored image and the corresponding clear image, i and j index pixel positions in the image, and c is the RGB channel of the image, ranging from 1 to 3;
the second is a perceptual loss function, which uses a VGG16 network pre-trained on the ImageNet dataset (VGG16 has 13 convolutional layers, divided into 5 stages); features are extracted from the last convolutional layer of each of the first three stages of the network and their differences computed, with the formula:
L_per = Σ_{k=1..3} (1/(C_k W_k H_k)) ‖φ_k(I_re) − φ_k(I_gt)‖²,
wherein {φ_k(·), k = 1, 2, 3} denote the feature extractors corresponding to the VGG16 layers Conv1-2, Conv2-2, and Conv3-3, and C_k, W_k, and H_k are the channel, width, and height dimensions of the output of φ_k(·);
The total defogging model loss function is:
L = L_mse + α·L_per,
where α is a parameter that balances the two loss functions.
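A minimal PyTorch sketch of this combined objective (illustrative only: the function name is invented, the φ_k extractors are passed in as callables rather than hard-wiring pretrained VGG16 slices, and the default α shown is an assumption, not the patent's setting):

```python
import torch
import torch.nn.functional as F

def dehazing_loss(I_re, I_gt, feat_extractors, alpha=0.04):
    """Total loss L = L_mse + alpha * L_per, with alpha balancing the terms.
    I_re, I_gt: restored and ground-truth clear images, N x 3 x H x W.
    feat_extractors: callables standing in for phi_k, i.e. the outputs of
    VGG16's Conv1-2, Conv2-2 and Conv3-3 layers in the patent's description.
    """
    l_mse = F.mse_loss(I_re, I_gt)                 # pixel-wise deviation
    l_per = sum(F.mse_loss(phi(I_re), phi(I_gt))   # higher-level perception
                for phi in feat_extractors)
    return l_mse + alpha * l_per
```

In practice the φ_k would be frozen slices of a torchvision VGG16 pretrained on ImageNet, evaluated with gradients disabled for the extractor weights.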
Preferably, step S2 specifically includes:
feature extraction starts with a 3×3 convolution layer that converts the given input foggy image into 16 feature maps;
then the feature maps are processed through the following four stages to obtain features of different levels; each stage comprises four layers: the first layer is a 3×3 convolution with stride 2, which halves the resolution of the feature map and doubles its width (channel count); the second and third layers form a combination of a 3×3 convolution, a ReLU activation function, and another 3×3 convolution; the fourth layer is a 1×1 convolution that reduces the width of the features produced by the third layer to 64 as the output of each stage.
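The four-layer stage described above can be sketched in PyTorch roughly as follows (an illustrative reading of the text: the class name and the exact way consecutive stages are chained are assumptions not spelled out in the patent):

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    """One feature-extraction stage: a stride-2 3x3 convolution that halves
    resolution and doubles width, a conv-ReLU-conv pair, and a 1x1
    projection to 64 channels as the stage output."""
    def __init__(self, in_ch):
        super().__init__()
        self.down = nn.Conv2d(in_ch, in_ch * 2, 3, stride=2, padding=1)
        self.body = nn.Sequential(
            nn.Conv2d(in_ch * 2, in_ch * 2, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch * 2, in_ch * 2, 3, padding=1),
        )
        self.proj = nn.Conv2d(in_ch * 2, 64, 1)  # width -> 64 per stage

    def forward(self, x):
        x = self.down(x)
        return self.proj(self.body(x))
```

For a 16-channel input of size 32×32, the stage output is a 64-channel map of size 16×16, consistent with halving the resolution and projecting the width to 64.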
Preferably, in step S3, the multi-level feature fusion module has three feature fusion blocks arranged from top to bottom. The first fusion block fuses the high-level features (the feature map output by the fourth convolution-activation function-convolution combination) with lower-level features (the feature map output by the third convolution-activation function-convolution combination); the fused features are regarded as the new high-level features, which the second fusion block then fuses with the mid-level features in the feature map output by the second convolution-activation function-convolution combination. Finally, the features obtained by the second fusion block are taken as high-level features and fused by the third fusion block with the low-level features in the feature map output by the first convolution-activation function-convolution combination.
For each fusion block, given high-level and low-level features, element-wise multiplication is used to fuse them. The fused features pass through a convolution layer, batch normalization, and a ReLU activation function, and are then processed by the next fusion block.
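A minimal sketch of one fusion block under the description above (assumptions: both inputs have already been brought to the same spatial size and the 64-channel width of the stage outputs; the class name is illustrative):

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """One feature-fusion block: element-wise multiplication of high- and
    low-level features, followed by conv -> batch norm -> ReLU."""
    def __init__(self, ch=64):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, high, low):
        # point-wise element multiplication fuses the two levels
        return self.refine(high * low)
```

Three such blocks chained top-down, each taking the previous fused result as its high-level input, would realize the module as described.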
Preferably, step S4 specifically includes: the residual hybrid convolution attention module has three consecutive grouped convolution layers followed by an attention layer. The given features are processed by these layers and added to the residual input to obtain the output features. Grouped convolution partitions the input features into groups along the channel dimension (the number of groups is a hyperparameter) and applies a convolution to each group separately. Because of this grouping, the floating-point operation count (FLOPs) of the residual hybrid convolution attention module is greatly reduced, improving the training and defogging efficiency of the network. The group numbers of the three grouped convolution layers are 4, 8, and 16 respectively, i.e., the input feature map is divided into 4, 8, and 16 groups by channel for processing. This configuration was determined experimentally.
After the three grouped convolutions, an attention layer is added; it makes the output features reflect the feature information of the clear image that is important within the input foggy image, so that the network focuses on the clear, fog-free image information to be adopted. The attention mechanism is realized in two steps: the first step applies a depthwise convolution, then a ReLU activation function, then a pointwise convolution, and then a Sigmoid activation function to obtain feature weights; the second step applies the obtained weight map, which has the same size as the input, to the input features by element-wise multiplication to output the final features. The weight map obtained from the attention layer guides the network to discard redundant information (haze feature information) and focus on the feature information of the clear fog-free image, while the depthwise separable convolution (combining the depthwise and pointwise convolutions) improves the training and running efficiency of the residual hybrid convolution attention module.
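The grouped-convolution-plus-attention design above might be sketched as follows (an illustrative reading: the kernel sizes of the grouped and depthwise convolutions are assumptions, the residual is taken from the module input, and the class name is invented; the channel count is assumed divisible by 16):

```python
import torch
import torch.nn as nn

class ResidualMixedConvAttention(nn.Module):
    """Three grouped convolutions (4, 8 and 16 groups), a depthwise-
    separable attention layer producing a Sigmoid weight map, and a
    residual connection from the module input."""
    def __init__(self, ch=64):
        super().__init__()
        self.grouped = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, groups=4),
            nn.Conv2d(ch, ch, 3, padding=1, groups=8),
            nn.Conv2d(ch, ch, 3, padding=1, groups=16),
        )
        # attention: depthwise conv -> ReLU -> pointwise conv -> Sigmoid
        self.attention = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, groups=ch),  # depthwise
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 1),                        # pointwise
            nn.Sigmoid(),
        )

    def forward(self, x):
        f = self.grouped(x)
        w = self.attention(f)   # weight map, same size as the input
        return x + f * w        # element-wise weighting plus residual
```

The Sigmoid keeps every weight in (0, 1), so the attention layer can only suppress or pass features, which matches its described role of discarding redundant haze information.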
Compared with the prior art, the invention has the following beneficial effects:
1. the invention provides a multi-level feature fusion module which can adaptively adopt features of different levels and recover clear images by exploiting the complementarity between those features;
2. the invention develops a residual mixed convolution attention module with an attention layer; the mixed convolution operations improve the efficiency of network operation, and the attention block concentrates the model on more important information;
3. the invention also provides a method of using the mean square error loss and the perceptual loss function to cooperatively guide the defogging model toward better defogging performance; the mean square error measures the deviation between the restored image and the corresponding clear image, while the perceptual loss helps the model perceive the image at a higher level, restoring a more realistic clear image.
The invention is further described below with reference to the drawings and specific examples.
Drawings
FIG. 1 is a defogging flow chart according to an embodiment of the present invention;
FIG. 2 is an application scenario diagram of an embodiment of the present invention;
FIG. 3 is an application scenario diagram of a core component residual hybrid convolution module in the model of FIG. 2;
FIG. 4 is an application scenario diagram of the attention layer of the core component of the model of FIG. 3;
FIG. 5 is an effect diagram of the restored image in the image defogging model of FIG. 2 compared with other methods.
Detailed Description
Referring to fig. 1 to 5, the neural network image defogging method with multi-level feature fusion and attention guidance disclosed by the invention comprises the following steps:
s1, constructing an image defogging model; the image defogging model comprises a feature extraction module, a multi-level feature fusion module and a residual mixed convolution attention module;
the specific process is that an image defogging model is constructed as shown in fig. 2. The image defogging model comprises a feature extraction module (shown in figure 2), a multi-level feature fusion module (shown in figure 2) and a residual mixed convolution attention module (shown in figure 2);
s2, acquiring foggy image data, first converting the foggy image into 16 feature maps through a convolution layer; the feature maps are then processed through the four stages of the feature extraction module to obtain features of different levels;
s3, the multi-level feature fusion module fuses the feature maps obtained at different stages by point-wise element multiplication, using the complementarity of low-level and high-level features to guide the network toward better recovery of a clear image;
s4, the features produced by the multi-level feature fusion module pass through the residual mixed convolution attention module to obtain a weight map of the same size as the input; the weight map, produced by an attention layer designed on the attention mechanism, guides the network to discard redundant information and focus on feature information effective for restoring a clear image; meanwhile, the depthwise separable convolution adopted in the residual mixed convolution attention module improves its training and running efficiency; after this module, the features are finally reconstructed into a clear haze-free image;
s5, calculating the mean square error and perceptual loss between the restored image and the corresponding clear image, and updating the image defogging model; the mean square error measures the deviation between the restored image and the corresponding clear image, while the perceptual loss helps the model perceive the image at a higher level, making the restored image more realistic; the two loss functions, mean square error and perceptual loss, cooperate to jointly optimize the defogging model.
Preferably, step S5 specifically includes:
calculating a mean square error and a perceptual loss for the restored image and the corresponding clear image, wherein the first loss function is the mean square error loss, with the formula:
L_mse = (1/(3WH)) Σ_{c=1..3} Σ_{i=1..W} Σ_{j=1..H} (I_re(i, j, c) − I_gt(i, j, c))²,
wherein W and H are the width and height of the image, I_re and I_gt are the restored image and the corresponding clear image, i and j index pixel positions in the image, and c is the RGB channel of the image, ranging from 1 to 3;
the second is a perceptual loss function, which uses a VGG16 pre-trained on the ImageNet dataset (VGG16 has 13 convolutional layers, divided into 5 stages) to extract features from the last convolutional layer of each of the first three stages of the network and compute their differences, with the formula:
L_per = Σ_{k=1..3} (1/(C_k W_k H_k)) ‖φ_k(I_re) − φ_k(I_gt)‖²,
wherein {φ_k(·), k = 1, 2, 3} denote the feature extractors corresponding to the VGG16 layers Conv1-2, Conv2-2, and Conv3-3, and C_k, W_k, and H_k are the channel, width, and height dimensions of the output of φ_k(·);
The total defogging model loss function is:
L = L_mse + α·L_per,
where α is a parameter that balances the two loss functions.
Preferably, step S2 specifically includes: the method comprises the specific processes that a hazy picture is obtained, and the characteristic extractor is different from the characteristic extractor of other methods in that the characteristic extractor does not need training in advance and is lightweight;
feature extraction starts with a 3 x 3 convolution layer that converts a given input foggy image into 16 feature maps;
then, the feature maps are processed through the following four stages to obtain features of different layers; each stage comprises four layers, the first layer being a 3 x 3 convolution with a step size of 2, which is used to reduce the resolution of the feature map to 1/2 and double the width; the second layer and the third layer respectively comprise 3×3 convolutions, a ReLU activation function and 3×3 convolutions; the fourth layer is a 1 x 1 convolution, which reduces the width of the features produced by the third layer to 64 as an output for each stage.
Preferably, in step S3, the multi-level feature fusion module has three feature fusion blocks arranged from top to bottom; the first fusion block fuses the high-level features (the feature map output by the fourth convolution-activation function-convolution combination) with lower-level features (the feature map output by the third convolution-activation function-convolution combination); the fused features are regarded as the new high-level features, which the second fusion block then fuses with the mid-level features in the feature map output by the second convolution-activation function-convolution combination. Finally, the features obtained by the second fusion block are taken as high-level features and fused by the third fusion block with the low-level features in the feature map output by the first convolution-activation function-convolution combination.
For each fusion block, given high-level and low-level features, element-wise multiplication is used to fuse them. The fused features pass through a convolution layer, batch normalization, and a ReLU activation function, and are then processed by the next fusion block.
Preferably, step S4 specifically includes: the residual hybrid convolution attention module has three consecutive grouped convolution layers followed by an attention layer. The given features are processed by these layers and added to the residual input to obtain the output features. Grouped convolution partitions the input features into groups along the channel dimension (the number of groups is a hyperparameter) and applies a convolution to each group separately. Because of this grouping, the floating-point operation count (FLOPs) of the residual hybrid convolution attention module is greatly reduced, improving the training and defogging efficiency of the network. The group numbers of the three grouped convolution layers are 4, 8, and 16 respectively, i.e., the input feature map is divided into 4, 8, and 16 groups by channel for processing. This configuration was determined experimentally.
After the three grouped convolutions, an attention layer is added; it makes the output features reflect the feature information of the clear image that is important within the input foggy image, so that the network focuses on the clear, fog-free image information to be adopted. The attention mechanism is realized in two steps: the first step applies a depthwise convolution, then a ReLU activation function, then a pointwise convolution, and then a Sigmoid activation function to obtain feature weights; the second step applies the obtained weight map, which has the same size as the input, to the input features by element-wise multiplication to output the final features. The weight map obtained from the attention layer guides the network to discard redundant information (haze feature information) and focus on the feature information of the clear fog-free image, while the depthwise separable convolution (combining the depthwise and pointwise convolutions) improves the training and running efficiency of the residual hybrid convolution attention module.
In actual application, a foggy image is first input into the feature extraction module, and the convolution layer-activation function-convolution layer combinations at the module's four stages effectively extract features of four different levels from the image;
secondly, the four extracted features are input into the multi-level feature fusion module, which multiplies features of different levels element by element, using the complementarity of low-level and high-level features to help the network better recover a clear image;
then, the features produced by the multi-level feature fusion module are processed by the residual mixed convolution attention module to obtain a weight map of the same size as the input. The weight map derived from the attention layer directs the network to discard redundant features and focus attention on the more important ones. The depthwise and pointwise convolution operations employed improve the efficiency of this module. After this module the features are finally reconstructed into a clear fog-free image;
finally, the mean square error and perceptual loss between the restored image and the corresponding clear image are calculated and the image defogging model is updated; the mean square error measures the deviation between the restored image and the corresponding clear image, while the perceptual loss helps the model perceive the image at a higher level and restore a more realistic clear image. The two loss functions cooperate to jointly optimize the defogging model.
The invention has the following beneficial effects:
1. compared with the prior art, the invention provides a multi-level feature fusion module which can adaptively adopt features of different levels and effectively recover clear images from foggy images by exploiting the complementarity between the features;
2. compared with the prior art, the invention develops a residual mixed convolution attention module with an attention layer. The mixed convolution operation improves the efficiency of network operation, and the attention block concentrates the model on more important information;
3. the invention also provides a method for cooperatively guiding the defogging model to achieve defogging performance by using the mean square error loss and the perception loss function. The mean square error measures the deviation between the restored image and the corresponding sharp image, while the perceived loss helps the model to perceive the image from a higher dimension, restoring a more realistic sharp image.
The foregoing embodiments serve only to further illustrate the present invention and are not to be construed as limiting its scope; insubstantial modifications and adaptations made by those skilled in the art in light of the foregoing teachings remain within the scope of the invention.
Claims (3)
1. A neural network image defogging method with multi-level feature fusion and attention guidance is characterized in that: the method comprises the following steps:
s1, constructing an image defogging model; the image defogging model comprises a feature extraction module, a multi-level feature fusion module and a residual mixed convolution attention module;
s2, acquiring foggy image data, first converting the foggy image into 16 feature maps through a convolution layer; the feature maps are then processed through the four stages of the feature extraction module to obtain features of different levels;
s3, the multi-level feature fusion module fuses the feature maps obtained at different stages by point-wise element multiplication, using the complementarity of low-level and high-level features to guide the network toward better recovery of a clear image;
s4, the features produced by the multi-level feature fusion module pass through the residual mixed convolution attention module to obtain a weight map of the same size as the input; the weight map, produced by an attention layer designed on the attention mechanism, guides the network to discard redundant information and focus on feature information effective for restoring a clear image; meanwhile, the depthwise separable convolution adopted in the residual mixed convolution attention module improves its training and running efficiency; after this module, the features are finally reconstructed into a clear haze-free image;
s5, calculating the mean square error and perceptual loss between the restored image and the corresponding clear image, and updating the image defogging model; the mean square error measures the deviation between the restored image and the corresponding clear image, while the perceptual loss helps the model perceive the image at a higher level, making the restored image more realistic; the two loss functions, mean square error and perceptual loss, cooperate to jointly optimize the defogging model;
Step S3: the multi-level feature fusion module is provided with three feature fusion modules arranged from top to bottom.
The first feature fusion module fuses the high-level features with the low-level features, and the fused result is regarded as the new high-level features; the second feature fusion module then fuses this result with the mid-level features from the output feature map of the third convolution-activation-convolution combination; finally, the features obtained by the second feature fusion module are regarded as high-level features and are fused, through the third feature fusion module, with the low-level features from the output feature map of the first convolution-activation-convolution combination.
For each feature fusion module, given the high-level and low-level features, element-by-element multiplication realizes the fusion between the features; the fused features are then passed through a convolution layer, batch normalization and a ReLU activation function before being handed to the next fusion module.
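The per-module fusion step can be sketched in NumPy; the shapes, random weights, and the 1×1-convolution-as-matrix-multiply simplification below are illustrative assumptions, not the patented implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 64 channels, 8x8 spatial resolution.
C, H, W = 64, 8, 8
high = rng.standard_normal((C, H, W))   # high-level features
low = rng.standard_normal((C, H, W))    # low-level features (same size assumed)

# Step 1: fuse by element-by-element multiplication.
fused = high * low

# Step 2: a 1x1 convolution (modelled as a channel-mixing matrix multiply),
# followed by batch normalization and ReLU, as the claim describes.
weight = rng.standard_normal((C, C)) * 0.1
conv = np.einsum('oc,chw->ohw', weight, fused)
mean = conv.mean(axis=(1, 2), keepdims=True)
std = conv.std(axis=(1, 2), keepdims=True) + 1e-5
bn = (conv - mean) / std
out = np.maximum(bn, 0.0)               # ReLU

print(out.shape)         # (64, 8, 8): fusion preserves the feature-map size
print(bool((out >= 0).all()))  # True: ReLU output is non-negative
```

The fused result `out` would then serve as the "high-level" input to the next fusion module in the top-to-bottom cascade.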
Step S4 specifically comprises:
The residual mixed convolution attention module consists of three consecutive grouped convolution layers followed by an attention layer; the given features are processed by these layers and then added to the residual (the module input) to obtain the output features. Grouped convolution divides the input features into groups along the channel dimension and applies a convolution operation to each group separately; because of this grouping, the FLOPs of the residual mixed convolution attention module are greatly reduced, which improves the training and defogging efficiency of the network. The group numbers of the three grouped convolution layers are 4, 8 and 16 respectively, i.e. the input feature maps are divided into 4, 8 and 16 groups along the channel dimension for processing.
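The FLOP reduction from grouping can be checked with simple multiply-accumulate arithmetic; the feature-map sizes below are hypothetical:

```python
# MACs of a KxK convolution mapping c_in -> c_out channels on an HxW map,
# with g groups: each output channel only sees c_in/g input channels.
def conv_flops(c_in, c_out, k, h, w, groups=1):
    return (c_in // groups) * k * k * c_out * h * w

C, H, W, K = 64, 32, 32, 3
base = conv_flops(C, C, K, H, W, groups=1)
for g in (4, 8, 16):  # group numbers of the three layers in the claim
    # grouping by g reduces the cost by exactly a factor of g
    print(g, base // conv_flops(C, C, K, H, W, groups=g))
```

This prints reduction factors of 4, 8 and 16 for the three layers, matching the claim that grouping greatly reduces the module's FLOPs.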
After the three grouped convolutions, an attention layer is applied. The attention layer makes the output features reflect the important feature information of the clear image contained in the input foggy image, so that the network focuses on the clear, fog-free image information. The attention mechanism is realized in two steps. In the first step, a depthwise convolution, a ReLU activation function, a pointwise convolution and a Sigmoid activation function are applied in sequence to obtain the feature weights. In the second step, the original input features are multiplied by the obtained weights to form a weight map of the same size as the input, and this weight map is applied to the input features by element-wise multiplication to output the final features. The weight map obtained from the attention layer guides the network to discard redundant information and to focus on the feature information of the clear, fog-free image, while the depthwise separable convolution operation improves the training and inference efficiency of the residual mixed convolution attention module.
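A minimal NumPy sketch of the two-step attention computation (depthwise conv, ReLU, pointwise conv, Sigmoid, then element-wise weighting); the kernel values and tensor sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
C, H, W = 8, 6, 6
x = rng.standard_normal((C, H, W))  # module input features

def depthwise_conv3x3(x, kernels):
    # One 3x3 kernel per channel, zero padding, stride 1.
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i+3, j:j+3] * kernels[c])
    return out

def pointwise_conv(x, weight):
    # 1x1 convolution: mix channels at every spatial position.
    return np.einsum('oc,chw->ohw', weight, x)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: depthwise conv -> ReLU -> pointwise conv -> Sigmoid gives the weights.
dw = rng.standard_normal((C, 3, 3)) * 0.1
pw = rng.standard_normal((C, C)) * 0.1
weights = sigmoid(pointwise_conv(np.maximum(depthwise_conv3x3(x, dw), 0), pw))

# Step 2: apply the weight map to the input by element-wise multiplication.
out = x * weights

print(weights.shape == x.shape)  # True: weight map has the same size as input
```

Because the Sigmoid bounds every weight to (0, 1), the multiplication can only attenuate features, which is how redundant information is suppressed.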
2. The neural network image defogging method based on multi-level feature fusion and attention guidance according to claim 1, wherein step S5 specifically comprises:
calculating the mean square error and the perceptual loss between the restored image and the corresponding clear image, wherein the first loss function is the mean square error loss function, with the formula:

L_mse = (1 / (3 · W · H)) · Σ_{c=1}^{3} Σ_{i=1}^{W} Σ_{j=1}^{H} ( I_re(i, j, c) − I_gt(i, j, c) )²

wherein W and H represent the width and height of the image, I_re and I_gt are the restored image and the corresponding clear image, i and j index the pixel position in the image, and c indexes the RGB channel of the image, ranging from 1 to 3;
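The mean square error term can be sketched directly in NumPy; the image sizes and values below are arbitrary test data:

```python
import numpy as np

rng = np.random.default_rng(2)
W_, H_ = 4, 4
I_re = rng.random((W_, H_, 3))  # restored image, RGB values in [0, 1]
I_gt = rng.random((W_, H_, 3))  # corresponding clear (ground-truth) image

# Mean square error averaged over all pixels and the three RGB channels.
L_mse = np.mean((I_re - I_gt) ** 2)

print(np.mean((I_gt - I_gt) ** 2) == 0.0)  # True: zero for identical images
print(bool(L_mse > 0))                     # True: positive for differing images
```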
the second is the perceptual loss function, which uses the last convolutional layer of each stage of a VGG16 network pre-trained on the ImageNet dataset to extract features and calculate their differences, with the formula:

L_per = Σ_{k=1}^{3} (1 / (C_k · W_k · H_k)) · ‖ φ_k(I_re) − φ_k(I_gt) ‖²

wherein {φ_k(·), k = 1, 2, 3} represent the feature extractors corresponding to the selected VGG16 convolutional layers, and C_k, W_k and H_k are the channel number, width and height of the feature map produced by φ_k(·);
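A sketch of the perceptual-loss computation follows; since loading a pretrained VGG16 is out of scope here, random linear maps stand in for the feature extractors φ_k (a purely illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical output shapes (C_k, W_k, H_k) for the three extractors.
shapes = [(64, 8, 8), (128, 4, 4), (256, 2, 2)]

def fake_phi(img, shape, seed):
    # Deterministic random projection standing in for a VGG16 stage.
    r = np.random.default_rng(seed)
    proj = r.standard_normal((int(np.prod(shape)), img.size)) * 0.01
    return (proj @ img.ravel()).reshape(shape)

def perceptual_loss(I_re, I_gt):
    loss = 0.0
    for k, shape in enumerate(shapes):
        f_re = fake_phi(I_re, shape, k)
        f_gt = fake_phi(I_gt, shape, k)
        C_k, W_k, H_k = shape
        # Normalized squared difference of the k-th feature maps.
        loss += np.sum((f_re - f_gt) ** 2) / (C_k * W_k * H_k)
    return loss

I_gt = rng.random((16, 16, 3))
I_re = I_gt + 0.1 * rng.standard_normal((16, 16, 3))
print(perceptual_loss(I_gt, I_gt) == 0.0)  # True: identical images, zero loss
print(bool(perceptual_loss(I_re, I_gt) > 0))
```

In the actual method the φ_k would be the pretrained VGG16 stage outputs, so the loss compares images at the feature level rather than pixel by pixel.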
The total defogging model loss function is:
L = L_mse + α · L_per,
where α is a parameter that balances the two loss functions.
3. The neural network image defogging method based on multi-level feature fusion and attention guidance according to claim 2, wherein step S2 specifically comprises:
feature extraction starts with a 3 x 3 convolution layer that converts a given input foggy image into 16 feature maps;
then, the feature maps are processed through the following four stages to obtain features at different levels. Each stage comprises four layers: the first layer is a 3×3 convolution with a stride of 2, which reduces the resolution of the feature maps to 1/2 and doubles their width (channel number); the second and third layers each comprise a 3×3 convolution, a ReLU activation function and another 3×3 convolution; the fourth layer is a 1×1 convolution, which reduces the width of the features produced by the third layer to 64 as the output of each stage.
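A small shape trace, assuming a hypothetical 256×256 input, illustrates how the four stages transform the feature maps:

```python
# Trace the feature-map shapes through the four stages described in step S2:
# the stride-2 conv halves the resolution and doubles the channel width,
# then the 1x1 conv sets the stage output width to 64.
def stage_shapes(c, h, w):
    shapes = []
    for _ in range(4):
        c, h, w = c * 2, h // 2, w // 2  # layer 1: 3x3 conv, stride 2
        # layers 2-3: conv-ReLU-conv combinations keep the shape unchanged
        c = 64                            # layer 4: 1x1 conv, output width 64
        shapes.append((c, h, w))
    return shapes

# A hypothetical 256x256 foggy input, first converted into 16 feature maps.
print(stage_shapes(16, 256, 256))
# [(64, 128, 128), (64, 64, 64), (64, 32, 32), (64, 16, 16)]
```

Each stage thus yields 64-channel features at 1/2, 1/4, 1/8 and 1/16 of the input resolution, which are the different-level features the fusion module later combines.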
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010781155.9A CN111915531B (en) | 2020-08-06 | 2020-08-06 | Neural network image defogging method based on multi-level feature fusion and attention guidance |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111915531A CN111915531A (en) | 2020-11-10 |
CN111915531B true CN111915531B (en) | 2023-09-29 |
Family
ID=73288183
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110097519A (en) * | 2019-04-28 | 2019-08-06 | 暨南大学 | Double supervision image defogging methods, system, medium and equipment based on deep learning |
AU2020100274A4 (en) * | 2020-02-25 | 2020-03-26 | Huang, Shuying DR | A Multi-Scale Feature Fusion Network based on GANs for Haze Removal |
Non-Patent Citations (1)
Title |
---|
A defogging method based on conditional generative adversarial networks; Jia Xuzhong; Wen Zhiqiang; Information & Computer (Theory Edition) (No. 09); full text *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||