CN117994167B

CN117994167B - Diffusion model defogging method integrating parallel multi-convolution attention

Info

Publication number: CN117994167B
Application number: CN202410045689.3A
Authority: CN
Inventors: 邓红霞; 崔欣桐; 王浚瞩; 梁铮; 吴越; 高巍; 杨茂达; 赵培森
Original assignee: Taiyuan University of Technology
Current assignee: Taiyuan University of Technology
Priority date: 2024-01-11
Filing date: 2024-01-11
Publication date: 2024-06-28
Anticipated expiration: 2044-01-11
Also published as: CN117994167A

Abstract

The invention belongs to the field of deep learning, and particularly relates to a HazeDiffusion diffusion model defogging method integrating parallel multi-convolution attention, which comprises the following steps: constructing a data set; constructing a diffusion network model HazeDiffusion; training on the built HazeDiffusion model by utilizing the belonging training set; acquiring a foggy image to be recovered, and carrying out defogging enhancement on the foggy image through a HazeDiffusion model after training; an evaluation index is established for evaluation of the HazeDiffusion model. The invention is based on a diffusion model, introduces a parallel multi-convolution attention residual block (PMCA), and the PMCA module comprises two parts of parallel attention and parallel multi-convolution, and performs multi-scale connection through residual errors; and the size of the input image is adjusted through double three times of downsampling, and then the defogging image is upsampled by using the Laplacian pyramid, so that the model can process high-resolution images, and the efficiency of the diffusion model is indirectly improved.

Description

Diffusion model defogging method integrating parallel multi-convolution attention

Technical Field

The invention belongs to the technical field of deep learning, and particularly relates to a diffusion model defogging method integrating parallel multi-convolution attention.

Background

Haze can absorb and reflect light in the air, and under the condition of poor weather conditions, the acquired image quality is seriously insufficient, and problems of fuzzy details, color distortion, low contrast and the like often exist, so that the information identification degree of the image is reduced, and the performance of a subsequent series of advanced visual tasks such as object detection, scene identification, automatic driving and the like is seriously influenced. Therefore, it is very important to study how to obtain a clear image from a degraded image obtained in a foggy scene. The image defogging aim is to eliminate the influence of haze in the image, recover clear images from blurred images and restore details of the images.

At present, image defogging studies are mainly classified into three categories: feature and a priori based methods and learning based methods. And carrying out corresponding defogging treatment based on the characteristic and priori based method and the estimation of the atmospheric scattering model. Although the method can restore good image details, when the adopted assumption and the prior are not established in some specific scenes, the problems of supersaturation of defocused images, color distortion, difficulty in processing sky areas and the like can be caused. CNN has made great progress in some tasks in recent years, and there is also a great deal of CNN-based related work for defogging algorithms. The methods can be mainly divided into two types, the first type is still based on an atmospheric degradation model, parameters in the model are estimated by using a neural network, and the early methods are mostly based on the idea. The second type is to directly output and obtain defogged images by using the input foggy images, namely end2end in deep learning.

The deep learning image generation method has a certain result in image defogging, but the generation method based on the generation of the countermeasure network is limited by various limits, a plurality of networks need to be trained, the model is difficult to converge, and optimization instability, network collapse and the like are easily caused. The application of the diffusion model (DDPM) to image denoising successfully solves the problem of unstable generation of the countermeasure network training, and gradually takes the dominant role in the field of image generation. However, the diffusion model has the defects of low sampling speed, poor maximum likelihood, weak data generalization capability and the like. Many studies have made many efforts to solve the above-mentioned limitations from the practical point of view today.

Disclosure of Invention

Aiming at the technical problems that the diffusion model has the defects of low sampling speed, poor maximum likelihood, weak data generalization capability and the like, the invention provides a diffusion model defogging method integrating parallel multi-convolution attention, and the working focus is focused on the improvement of a noise estimation network in a reverse process, so that the diffusion model defogging method can be better applied to image defogging. Firstly, providing a parallel multi-convolution attention residual error block PMCA, wherein the PMCA module mainly comprises two parts of parallel attention and parallel multi-convolution, and multi-scale connection is carried out through residual errors; a SKFusion (SELECTIVE KERNEL Fuison) fusion mode for improving the self-selective convolution kernel network is introduced, the size of an input image is adjusted through double three times of downsampling, and then a Laplacian pyramid upsampling defogging image is used, so that the model can process a high-resolution image, and the efficiency of a diffusion model is indirectly improved.

In order to solve the technical problems, the invention adopts the following technical scheme:

a diffusion model defogging method integrating parallel multi-convolution attention comprises the following steps:

S1, based on a conditional diffusion model, improving a reverse process noise estimation network to construct an image defogging model HazeDiffusion;

S2, introducing SKFusion fusion modes, and realizing more specific and rich acquisition of information of each scale through dynamic feature fusion and jump connection;

S3, designing a PMCA module by combining pixels, channels and cross attention, and more accurately acquiring characteristics of condition information; through parallel convolution and residual error learning, the model is enabled to pay more attention to the haze area of the image more flexibly, and pay more attention to the local characteristics of the hazy image better;

S4, extracting high-frequency features by using bicubic downsampling to reduce the image size, recovering a high-resolution image by adopting an upsampling method based on a Laplacian pyramid, and improving the processing efficiency of the model.

The data sample of the image defogging model HazeDiffusion in S1 is RESIDE dataset, the RESIDE dataset is one of widely used datasets with image defogging standard, and the RESIDE dataset is composed of five subsets: indoor training set ITS, outdoor training set OTS, comprehensive target testing set SOTS, real world task driving testing set RTTS and hybrid subjective testing set HSTS; ITS and OTS are synthetic datasets, RTTS is a real world dataset, and HSTS consists of synthetic and real hazy images. Experiments trained a model on ITS dataset containing 100000 image pairs and tested on the SOTS indoor dataset of 500 pairs of images; models were trained on OTS containing 313950 image pairs and tested on the outdoor test set of SOTS for 500 pairs of images.

The main structure of the model in the S1 is a conditional defogging diffusion model fused with a foggy image, the diffusion model is a depth generation model, noise is added to available training data, then the process is reversed to restore the data, and the model gradually learns to eliminate the noise; gradually increasing Gaussian noise to the clear fog-free image in the diffusion process until the clear fog-free image becomes a pure noise image; the reverse process is the reverse process of the forward process, a random Gaussian noise is generated, the Gaussian noise and the hazy image Haze are input into a network model fusing parallel multi-convolution attention together, and a clear image is recovered through the reverse defogging process; the Haze of the hazy image is used as a condition to be added into the diffusion model, so that a defogging condition diffusion model is obtained, the problems that the defogging effect of the real image is poor and the indoor and outdoor data sets are separated and trained are complicated can be successfully solved, and the defogging effect of the image is improved.

The method for constructing the image defogging model HazeDiffusion in the S1 comprises the following steps: the HazeDiffusion model comprises two large modules, namely Diffusion Process and Reverse Process; the Diffusion Process module is a noise adding module, randomly generates a noise and is connected with the image; the Reverse Process module is a noise prediction module, inputs a Gaussian noise image and a foggy image into a convolution layer, and calculates time embedding for a noise level t; the downsampling stage sequentially comprises a convolution layer, a RESWITHATTN layer, a PCMA layer and a downsampling layer; at an intermediate stage of the network are RESWITHATTN layers and PCMA layers; the up-sampling stage comprises RESWITHATTN layers, a PCMA layer, an up-sampling layer and RESWITHATTN layers in sequence; fusing feature maps from different stages using an SK fusion module; the PCMA module comprises parallel attention and parallel multi-convolution, uses GroupNorm layers of data for normalization, and uses residual error connection to enrich characteristic information in the module; parallel multi-convolution extracts features using depth separable convolutions of different convolution kernel sizes, including 7 x 7, 5 x 5, and 3 x 3 convolutions.

The image defogging model HazeDiffusion training method comprises the following steps:

in HazeDiffusion network models constructed by the training set, an L1 loss supervision model is adopted to calculate the average error between the clear image and the defogging image, and the training is carried out through the maximum likelihood estimation output by the network; the loss formula is defined as follows:

Where n is the total number of samples of the training set, f (x) is the noise image generated, y _i is the estimated noise image, and the L1 penalty function optimizes the model by detecting the absolute value of the difference between f (x) and y _i;

In the training process, the diffusion model takes real data and pure noise as input samples, the output model estimates added noise, then calculates loss with the real noise at each moment, and iteratively updates model parameters.

The SKFusion fusion module in the S2 dynamically fuses the feature maps from different stages, SKFusion fuses the improved self-selective convolution kernel network, and by using the channel attention to fuse a plurality of feature branches, two feature maps are respectively set as x ₁ and x ₂, wherein x ₁ is the feature map from jump connection, and x ₂ is the feature map from the output of the network module; first x ₁ is obtained by passing PWConv (PointWise Conv) layerThen, obtaining a fusion weight by using global average pooling, a multi-layer perceptron, a Softmax activation function and Split operation;

{a₁,a₂}＝Split(Soft max(F_mlp(GAP(x₁+x₂)))

By passing through Will beAnd x ₂; wherein GAP represents global average pooling, F _mlp represents multi-layer perceptron, softma represents Softmax activation function, split represents Split operation.

In the step S3, a parallel multi-convolution attention PMCA module is designed for the improved noise estimation network, wherein the parallel multi-convolution attention PMCA module comprises parallel attention and parallel multi-convolution, groupNorm layers of data are used for normalization, so that training is more stable, and residual error connection is used for enriching characteristic information in the module; the parallel connection of a plurality of depth separable convolution layers with different scales effectively aggregates spatial information and conversion characteristics; paralleling multiple attention mechanisms can strengthen the model's focus on global and local features.

In the step S4, the image size is reduced by extracting high-frequency features through bicubic downsampling, the size of an input image is adjusted to 256×256 pixels through bicubic downsampling, and the calculation efficiency of a diffusion model is improved by reducing the input of the model; in order to obtain a defogging image with high quality, a low-resolution image generated by Laplacian pyramid processing is introduced, the image resolution is recovered, and the Laplacian pyramid retains most edges of the image on the aspect of improving the image resolution, so that detail blurring is avoided, artifacts are reduced, the process is simple, and the calculated amount is reduced.

Compared with the prior art, the invention has the beneficial effects that:

The invention provides a defogging method HazeDiffusion of a diffusion model integrating parallel multi-convolution attention, which combines the advantages of a conditional diffusion model and a deep learning model, solves the problems of incomplete defogging, color distortion, detail blurring and the like of the existing defogging algorithm to a certain extent, simplifies the complicated process of indoor and outdoor separate training, and effectively improves the performance of image generation on defogging tasks. The invention obtains the PSNR value of 27.8163 and the SSIM value of 0.9422 on the indoor synthetic foggy data set, and obtains the PSNR value of 29.2764 and the SSIM value of 0.9583 on the outdoor synthetic data set; the information entropy (Entropy), the fog density estimation (FADE) and the image Visual Information Fidelity (VIF) in the real fog data set obtain very high evaluation scores, and a Entropy value of 6.6685, a FADE value of 0.5843 and a VIF value of 0.9245 are obtained; and also excellent in subjective visual quality.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It will be apparent to those skilled in the art from this disclosure that the drawings described below are merely exemplary and that other embodiments may be derived from the drawings provided without undue effort.

The structures, proportions, sizes, etc. shown in the present specification are shown only for the purposes of illustration and description, and are not intended to limit the scope of the invention, which is defined by the claims, so that any structural modifications, changes in proportions, or adjustments of sizes, which do not affect the efficacy or the achievement of the present invention, should fall within the scope of the invention.

FIG. 1 is a diagram of the overall structure of HazeDiffusion of the present invention;

FIG. 2 is a block diagram of a PMCA block in HazeDiffusion model of the present invention;

FIG. 3 is a block diagram of ParaConv of the PMCA blocks of the present invention;

FIG. 4 is a block diagram of ParaAtm of the PMCA blocks of the present invention;

FIG. 5 is a graph comparing experimental results of the HazeDiffusion model and other image defogging methods used in the present invention to a true foggy image.

Detailed Description

For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions in the embodiments of the present application will be clearly and completely described below, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments, and these descriptions are only for further illustrating the features and advantages of the present application, not limiting the claims of the present application; all other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.

The embodiment is realized under pytorch deep learning framework, and provides a HazeDiffusion diffusion model defogging method integrating multiple convolution attentions, which specifically comprises the following steps:

1. Data preparation

The data samples of this embodiment are from RESIDE datasets.

RESIDE is one of the widely used datasets for which image defogging is more standard. It consists of five subsets: indoor Training Set (ITS), outdoor Training Set (OTS), synthetic target test set (SOTS), real world task driven test set (RTTS), and Hybrid Subjective Test Set (HSTS). ITS and OTS are synthetic datasets, RTTS is a real world dataset, and HSTS consists of synthetic and real hazy images. Experiments trained a model on ITS dataset containing 100000 image pairs and tested on the SOTS indoor dataset of 500 pairs of images; models were trained on OTS containing 313950 image pairs and tested on the outdoor test set of SOTS for 500 pairs of images.

During training, the images are randomly cropped to a size of 256×256.

2. Model construction

The main body framework of the HazeDiffusion model is a U-Net structure, the specific network structure is shown in figure 1, and the HazeDiffusion model comprises two large modules, namely Diffusion Process and Reverse Process. The Diffusion Process module is a noise adding module, which randomly generates a noise and connects with the image. The Reverse Process module is a noise prediction module, inputs a Gaussian noise image and a foggy image into a convolution layer, and calculates time embedding for a noise level t; the downsampling stage sequentially comprises a convolution layer, a RESWITHATTN layer, a PCMA layer and a downsampling layer; at an intermediate stage of the network are RESWITHATTN layers and PCMA layers; the up-sampling stage comprises RESWITHATTN layers, a PCMA layer, an up-sampling layer and RESWITHATTN layers in sequence; feature maps from different phases are fused using an SK fusion module. As shown in fig. 2, the PCMA module includes parallel attention and parallel multi-convolution, normalized using GroupNorm layers of data, and enriched feature information using residual connections in the module. As shown in fig. 3, the parallel multi-convolution extracts features using depth separable convolutions of different convolution kernel sizes, including 7 x 7, 5 x 5, and 3 x 3 convolutions. As shown in fig. 4, 5, the parallel attention mechanism connects pixel attention, channel attention, and cross attention in parallel and compensates for global features by jumping connections.

3. Model training

And in HazeDiffusion network models constructed by the training set, an L1 loss supervision model is adopted to calculate the average error between the clear image and the defogging image, and the training is carried out through the maximum likelihood estimation output by the network. The loss formula is defined as follows:

Where n is the total number of samples of the training set, f (x) is the noise image generated, y _i is the estimated noise image, and the L1 penalty function optimizes the model by detecting the absolute value of the difference between f (x) and y _i.

4. Test results

During training, samples were taken uniformly for times T to {0, …, T } and t=2000 was set in all experiments, with β increasing from 1e-6 to 1e-2, and the images were randomly cropped to 256 x 256 sizes. The parameters were updated and iterated using Adam optimizer and back propagation algorithm with the fixed learning rate set to 1e-4.

5. Model evaluation

Peak signal-to-noise ratio (PSNR) and Structural Similarity (SSIM) evaluation indexes are calculated using the reconstruction result and the real image to evaluate the performance of the model.

Table 1 test results table for different algorithms pair-synthesized hazy data set

The evaluation of the effect of the synthesized hazy image is carried out in SOTS data sets, the defogging effect of 500 pairs of indoor images and 500 pairs of outdoor images is respectively evaluated, PSNR and SSIM calculation is carried out by adopting defogging images and clear images, and compared with six different defogging algorithms, namely DCP, AOD-Net, PCFAN, MADN, dehazeFormer and MixDehazeNet, the optimal indexes in table 1 are bold fonts.

Through table 1, it is found that HazeDiffusion under the same experimental environment, compared with a comparison algorithm, the results of indoor synthesis of a foggy dataset and outdoor synthesis of a foggy dataset PSMR and SSIM are excellent, and in the indoor synthesis of the foggy dataset, PSNR and SSIM are respectively improved by 1.1641 and 0.0114 compared with MADN; compared to the DehazeFormer and MixDehazeNet methods based on Vision Transformer structures, dehazeFormer and MixDehazeNet all showed good defogging effect on the indoor and outdoor synthetic datasets. Although DehazeFormer and MixDehazeNet perform well on the synthetic dataset, the test results of the synthetic dataset cannot be used as a reference index for actual scene applications, and applying image defogging to an actual scene requires verifying the effect of the algorithm on the actual image defogging.

Because it is difficult to obtain a corresponding clear image of a true hazy image, some non-reference image quality evaluation indexes, such as information entropy (Entropy), haze density estimation (FADE), image visual information fidelity assessment (VIF), and the like, are used to analyze and compare the defogging effects of the true images of different algorithms. The information entropy is used for evaluating the definition degree of the defogging image, and the higher the information entropy is, the more details of the defogging image are reserved; the fog density estimation can evaluate the fog density in the image, and the lower the fog density is, the less fog remains in the image; the image visual information fidelity level is used for evaluating the distortion level of defogging images, and the higher the VIF value is, the higher the image quality is. The test uses HSTS dataset and the average test index results are shown in table 2.

Table 2 different algorithms test results table for true misty data set

The comparison result in the table 2 shows that HazeDiffusion is superior to other comparison algorithms in information entropy, fog density estimation and image visual information fidelity degree results; figure 5 shows the defogging effect of the different algorithms. The DCP algorithm has poor effect of removing the fog image in the real world, and especially causes serious distortion of the image in a sky area; the AOD-Net algorithm still shows the defect of incomplete defogging; the image removed by PCFAN algorithm can generate artifacts at the edge, and defogging of the local area is not thorough; the MADN algorithm is not complete for the contrast recovery of the image, and can not realize targeted adjustment on the local bright and dark areas of the real image; both DehazeFormer and MixDehazeNet algorithms show severe defogging incompleteness, which is quite different from the result of the synthetic data set, and has defects in true data defogging. HazeDiffusion can restore rich colors and textures of the real image, has clear details and is closer to the subjective cognition real haze-free image. Although DehazeFormer and MxiDehazeNet are remarkable in the synthetic dataset, they do not perform well in the true image defogging effect; the HazeDiffusion model is more suitable for defogging of real images, and has more excellent indexes and more real defogging effect. Through the comprehensive detection of a plurality of evaluation indexes, hazeDiffusion algorithm has good defogging effect on a true foggy image, and compared with a comparison algorithm, the processed image can recover more complete image details, so that the image quality is improved obviously.

The preferred embodiments of the present invention have been described in detail, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention, and the various changes are included in the scope of the present invention.

Claims

1. The defogging method for the diffusion model integrating the attention of parallel multi-convolution is characterized by comprising the following steps of: comprises the following steps:

The method for constructing the image defogging model HazeDiffusion in the S1 comprises the following steps: the HazeDiffusion model comprises two large modules, namely Diffusion Process and Reverse Process; the Diffusion Process module is a noise adding module, randomly generates a noise and is connected with the image; the Reverse Process module is a noise prediction module, inputs a Gaussian noise image and a foggy image into a convolution layer, and calculates time embedding for a noise level t; the downsampling stage sequentially comprises a convolution layer, a RESWITHATTN layer, a PCMA layer and a downsampling layer; at an intermediate stage of the network are RESWITHATTN layers and PCMA layers; the up-sampling stage comprises RESWITHATTN layers, a PCMA layer, an up-sampling layer and RESWITHATTN layers in sequence; fusing feature maps from different stages using an SK fusion module; the PCMA module comprises parallel attention and parallel multi-convolution, uses GroupNorm layers of data for normalization, and uses residual error connection to enrich characteristic information in the module; parallel multi-convolution extracts features using depth separable convolutions of different convolution kernel sizes, including 7 x 7, 5 x 5, and 3 x 3 convolutions;

the SKFusion fusion module in the S2 dynamically fuses the feature maps from different stages, SKFusion fuses the improved self-selective convolution kernel network, and by using the channel attention to fuse a plurality of feature branches, two feature maps are respectively set as x ₁ and x ₂, wherein x ₁ is the feature map from jump connection, and x ₂ is the feature map from the output of the network module; first x ₁ is obtained by passing PWConv (PointWise Conv) layer Then, obtaining a fusion weight by using global average pooling, a multi-layer perceptron, a Softmax activation function and Split operation;

By passing through Will beAnd x ₂; wherein GAP represents global average pooling, F _mlp represents a multi-layer perceptron, softmax represents a Softmax activation function, split represents Split operation;

In the step S3, a parallel multi-convolution attention PMCA module is designed for the improved noise estimation network, wherein the parallel multi-convolution attention PMCA module comprises parallel attention and parallel multi-convolution, groupNorm layers of data are used for normalization, so that training is more stable, and residual error connection is used for enriching characteristic information in the module; the parallel connection of a plurality of depth separable convolution layers with different scales effectively aggregates spatial information and conversion characteristics; paralleling multiple attention mechanisms can strengthen the model's focus on global and local features;

2. The diffusion model defogging method for fusing parallel multi-convolution attention according to claim 1, wherein: the data sample of the image defogging model HazeDiffusion in S1 is RESIDE dataset, the RESIDE dataset is one of widely used datasets with image defogging standard, and the RESIDE dataset is composed of five subsets: indoor training set ITS, outdoor training set OTS, comprehensive target testing set SOTS, real world task driving testing set RTTS and hybrid subjective testing set HSTS; ITS and OTS are synthetic datasets, RTTS is a real world dataset, HSTS consists of synthetic and real foggy images, experiments train models on ITS dataset containing 100000 image pairs, and tests on the indoor dataset of SOTS of 500 image pairs; models were trained on OTS containing 313950 image pairs and tested on the outdoor test set of SOTS for 500 pairs of images.

3. The diffusion model defogging method for fusing parallel multi-convolution attention according to claim 1, wherein: the main structure of the model in the S1 is a conditional defogging diffusion model fused with a foggy image, the diffusion model is a depth generation model, noise is added to available training data, then the process is reversed to restore the data, and the model gradually learns to eliminate the noise; gradually increasing Gaussian noise to the clear fog-free image in the diffusion process until the clear fog-free image becomes a pure noise image; the reverse process is the reverse process of the forward process, a random Gaussian noise is generated, the Gaussian noise and the hazy image Haze are input into a network model fusing parallel multi-convolution attention together, and a clear image is recovered through the reverse defogging process; the Haze of the hazy image is used as a condition to be added into the diffusion model, so that a defogging condition diffusion model is obtained, the problems that the defogging effect of the real image is poor and the indoor and outdoor data sets are separated and trained are complicated can be successfully solved, and the defogging effect of the image is improved.

4. The diffusion model defogging method for fusing parallel multi-convolution attention according to claim 1, wherein: the image defogging model HazeDiffusion training method comprises the following steps:

5. The diffusion model defogging method for fusing parallel multi-convolution attention according to claim 1, wherein: in the step S4, the image size is reduced by extracting high-frequency features through bicubic downsampling, the size of an input image is adjusted to 256×256 pixels through bicubic downsampling, and the calculation efficiency of a diffusion model is improved by reducing the input of the model; in order to obtain a defogging image with high quality, a low-resolution image generated by Laplacian pyramid processing is introduced, the image resolution is recovered, and the Laplacian pyramid retains most edges of the image on the aspect of improving the image resolution, so that detail blurring is avoided, artifacts are reduced, the process is simple, and the calculated amount is reduced.