CN113052776A - Unsupervised image defogging method based on multi-scale depth image prior - Google Patents

Unsupervised image defogging method based on multi-scale depth image prior

Info

Publication number
CN113052776A
Authority
CN
China
Prior art keywords
image
size
network
output
original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110381898.1A
Other languages
Chinese (zh)
Inventor
姜竹青
汪千淞
门爱东
王海婴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202110381898.1A priority Critical patent/CN113052776A/en
Publication of CN113052776A publication Critical patent/CN113052776A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/70 Denoising; Smoothing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4007 Scaling of whole images or parts thereof, e.g. expanding or contracting based on interpolation, e.g. bilinear interpolation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to an unsupervised image defogging method based on multi-scale depth image prior, belonging to the technical field of computer vision. First, the original image is downsampled to generate a small-scale image prior: three noise images of the same size as the downsampled foggy image are fed into three encoder-decoder neural networks, yielding three intermediate results that represent the atmospheric illumination map, the transmission map, and the defogged clear image; the three intermediate results are then combined through the atmospheric scattering model to obtain a reconstructed foggy image. Second, a noise image of the same size as the original image is fed into the same network, which is initialized with the prior acquired from the small-scale image. The method is reasonably designed: it fully accounts for the difficulty of prior extraction in unsupervised image defogging, reduces that difficulty with a multi-scale strategy, and improves the visual quality and stability of the reconstructed image.

Description

Unsupervised image defogging method based on multi-scale depth image prior
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to an unsupervised image defogging method based on multi-scale depth image prior.
Background
Haze is a typical atmospheric phenomenon caused by the accumulation of small water droplets, dust, smoke, or other particulate matter in the air. These particles absorb and scatter light, directly reducing visibility. Images captured in such weather lose contrast and visual detail, which creates difficulties for subsequent applications. Beyond the direct impact on visual quality, good image quality is also the foundation of high-level vision tasks such as object detection and semantic segmentation. Image defogging has therefore been widely studied in recent years as an image preprocessing and visual enhancement technique, and has achieved remarkable results.
Image defogging addresses the low visibility and low contrast of haze-laden images by removing the haze layer that obscures them and restoring the original colors and contrast of the image. Defogged images help computers better observe, analyze, and process pictures, and have substantial application value in many fields, such as video surveillance, remote sensing, and autonomous driving.
Conventional image defogging algorithms employ hand-crafted priors derived from intrinsic image properties such as texture, contrast, and color difference. A classic example is Dark Channel Prior (DCP) defogging, which observes that local patches of outdoor haze-free images contain pixels with very low intensity in at least one color channel (the dark channel), and accordingly uses this prior to estimate the transmission map and atmospheric illumination map with which a haze-free image is reconstructed. The Color Attenuation Prior (CAP) defogging algorithm assumes a positive correlation between scene depth and the difference between brightness and saturation, from which it estimates the transmission map. Although both methods achieve notable results, the quality of defogging depends largely on how well the adopted prior matches the actual properties of the image.
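For concreteness, the following is a minimal sketch of the dark-channel and atmospheric-light computation that DCP relies on; the patch size, the 0.1% pixel fraction, and the function names are common choices used for illustration, not values taken from this document:

    import numpy as np
    from scipy.ndimage import minimum_filter

    def dark_channel(image: np.ndarray, patch: int = 15) -> np.ndarray:
        # Dark channel of an H x W x 3 float image: per-pixel minimum over
        # color channels, then a minimum filter over a local patch.
        min_rgb = image.min(axis=2)                 # channel-wise minimum
        return minimum_filter(min_rgb, size=patch)  # local-patch minimum

    def estimate_atmospheric_light(image: np.ndarray, dark: np.ndarray) -> np.ndarray:
        # Estimate atmospheric light from the brightest 0.1% of dark-channel pixels.
        n = max(1, dark.size // 1000)
        idx = np.argpartition(dark.ravel(), -n)[-n:]   # haziest pixel indices
        return image.reshape(-1, 3)[idx].max(axis=0)   # brightest color among them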
Benefiting from the development of deep learning, more and more researchers apply neural networks to the image defogging problem. Unlike conventional methods based on hand-crafted priors, deep-learning-based methods generate defogged images in a data-driven manner. Examples include DehazeNet, which estimates the transmission map with a trainable convolutional neural network supervised by ground-truth transmission maps; MSCNN, which combines coarse- and fine-scale networks in a multi-scale convolutional neural network to estimate the transmission map; DehazeGAN, which estimates the transmission map and atmospheric light simultaneously with a generative adversarial network; and AOD-Net, which directly generates the defogged image with an end-to-end trainable network, without relying on the atmospheric scattering model. However, like neural networks in other tasks, these deep-learning-based defogging methods rely on large training datasets, which inevitably introduces dataset pairing and image-domain coverage problems.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an unsupervised image defogging method based on multi-scale depth image prior. Addressing the drawback of existing defogging algorithms that different hand-crafted priors or datasets must be designed for different scenes, the method uses multi-scale information to fully extract the intrinsic information of the image.
The technical problem to be solved by the invention is addressed by the following technical scheme:
An unsupervised image defogging method based on multi-scale depth image prior comprises the following steps:
step 1, in the small-scale prior extraction stage, the input foggy image is downsampled to 1/2 the size of the original image;
step 2, three noise images of the same size as the output of step 1 are fed into three encoder-decoder neural networks, respectively, to obtain three intermediate results representing an atmospheric illumination map, a transmission map, and a defogged image;
step 3, the three intermediate results output in step 2 are combined according to the atmospheric scattering model to obtain a synthesized foggy image; a loss function is constructed between this image and the output of step 1, and at the same time a depth-of-field prior constrains the defogged image, so as to optimize the three encoder-decoder networks of step 2;
step 4, in the original-scale image recovery stage, the atmospheric illumination map generated by the small-scale network serves as the prior of the original-scale network. The training process of the original-scale network is the same as in steps 2-3, except that the input image is replaced by the foggy image at its original size, the input noise image is resized to the original image size, and the depth-of-field prior constraint is removed. Finally, the reconstructed defogged image is obtained from the original-size network.
Further, the specific details of step 1 include the following:
(1) let the original foggy image have width W and height H; the small-size image input to the small-scale prior extraction stage then has width W/2 and height H/2;
(2) the image is downsampled with a bicubic interpolation algorithm.
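For illustration, a minimal PyTorch sketch of this downsampling step follows; the (1, 3, H, W) tensor layout and the function name are assumptions of the example:

    import torch
    import torch.nn.functional as F

    def downsample_half(hazy: torch.Tensor) -> torch.Tensor:
        # Bicubic 1/2 downsampling of a (1, 3, H, W) foggy image, as in step 1.
        _, _, h, w = hazy.shape
        return F.interpolate(hazy, size=(h // 2, w // 2),
                             mode="bicubic", align_corners=False)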
Further, the specific details of step 2 include the following:
(1) network J takes a noise image as input, with 8 channels, width W/2, and height H/2. The output of the network is the defogged sharp image J_1(x), with 3 channels, width W/2, and height H/2;
(2) network T takes a noise image as input, with 8 channels, width W/2, and height H/2. The output of the network is the transmission map T_1(x), with 1 channel, width W/2, and height H/2;
(3) network A has the same structure as network T and takes a noise image as input, with 8 channels, width W/2, and height H/2. The output of the network is the atmospheric illumination map A_1(x), with 3 channels, width W/2, and height H/2.
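A short sketch of these three inputs follows; uniform noise and the example image size are assumptions, since the patent specifies only the shapes:

    import torch

    H, W = 480, 640                 # example original size (illustrative assumption)
    h, w = H // 2, W // 2           # small-scale size from step 1

    z_J = torch.rand(1, 8, h, w)    # input to network J -> J_1(x): (1, 3, h, w)
    z_T = torch.rand(1, 8, h, w)    # input to network T -> T_1(x): (1, 1, h, w)
    z_A = torch.rand(1, 8, h, w)    # input to network A -> A_1(x): (1, 3, h, w)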
Further, the specific details of item (1) above (network J) include the following:
① network J is an encoder-decoder structure: the feature maps output by its convolutional layers first shrink gradually and then grow gradually, the final output feature map has the same size as the input, and the structure resembles the letter U;
② the feature-map reduction stage has 6 convolutional layers, divided into 3 groups; each group halves the feature-map size. Each group consists of two convolutional layers: the first has stride 2 and a 3 × 3 kernel, the second has stride 1 and a 3 × 3 kernel. Let the feature map output by the N-th group be U_N, N ∈ [1, 3];
③ the feature-map enlargement stage has 6 convolutional layers, divided into 3 groups; each group doubles the feature-map size. Each group consists of two convolutional layers and a bilinear interpolation layer: the first convolutional layer has stride 1 and a 3 × 3 kernel, the second has stride 1 and a 3 × 3 kernel, followed by a 2× bilinear interpolation layer. The input to each group consists of the output of the previous group (absent for the first group) and the skip-connection information, stacked along the channel dimension; the skip information for the N-th group is the feature map U_(4-N) passed through a 1 × 1 convolution with 16 output channels and stride 1;
④ the final output image of network J is obtained by passing the output of step ③ through a Sigmoid layer.
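The following is a minimal PyTorch sketch of an encoder-decoder with this shape, parameterized so that it also covers the deeper networks T and A described next; the internal feature width, the LeakyReLU activations, and the class name CodecDIP are assumptions of the example, not values taken from this document:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CodecDIP(nn.Module):
        # U-shaped encoder-decoder sketch of the structure described above.
        # depth=3, skip_ch=16, out_ch=3 matches network J; depth=5, skip_ch=4
        # matches networks T and A. The internal width (feat=64) and LeakyReLU
        # activations are assumptions; input height and width are assumed
        # divisible by 2**depth.
        def __init__(self, in_ch=8, out_ch=3, depth=3, skip_ch=16, feat=64):
            super().__init__()
            self.depth = depth
            self.enc = nn.ModuleList()
            self.skips = nn.ModuleList()
            self.dec = nn.ModuleList()
            ch = in_ch
            for _ in range(depth):
                # One reduction group: stride-2 3x3 conv, then stride-1 3x3 conv.
                self.enc.append(nn.Sequential(
                    nn.Conv2d(ch, feat, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
                    nn.Conv2d(feat, feat, 3, stride=1, padding=1), nn.LeakyReLU(0.2)))
                # 1x1 convolution producing the skip features from U_N.
                self.skips.append(nn.Conv2d(feat, skip_ch, 1))
                ch = feat
            for i in range(depth):
                # Decoder group input: previous group's output (absent for the
                # first group) concatenated with the skip features.
                dec_in = skip_ch if i == 0 else feat + skip_ch
                self.dec.append(nn.Sequential(
                    nn.Conv2d(dec_in, feat, 3, stride=1, padding=1), nn.LeakyReLU(0.2),
                    nn.Conv2d(feat, feat, 3, stride=1, padding=1), nn.LeakyReLU(0.2)))
            self.head = nn.Conv2d(feat, out_ch, 1)

        def forward(self, z):
            feats = []
            x = z
            for g in self.enc:                    # reduction stage: U_1 .. U_depth
                x = g(x)
                feats.append(x)
            x = None
            for i, g in enumerate(self.dec):      # enlargement stage
                skip = self.skips[self.depth - 1 - i](feats[self.depth - 1 - i])
                x = skip if x is None else torch.cat([x, skip], dim=1)
                x = g(x)
                x = F.interpolate(x, scale_factor=2, mode="bilinear",
                                  align_corners=False)
            return torch.sigmoid(self.head(x))    # final Sigmoid layer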
Further, the specific details of item (2) above (network T) include the following:
① network T is an encoder-decoder structure: the feature maps output by its convolutional layers first shrink gradually and then grow gradually, the final output feature map has the same size as the input, and the structure resembles the letter U;
② the feature-map reduction stage has 10 convolutional layers, divided into 5 groups; each group halves the feature-map size. Each group consists of two convolutional layers: the first has stride 2 and a 3 × 3 kernel, the second has stride 1 and a 3 × 3 kernel. Let the feature map output by the N-th group be U_N, N ∈ [1, 5];
③ the feature-map enlargement stage has 10 convolutional layers, divided into 5 groups; each group doubles the feature-map size. Each group consists of two convolutional layers and a bilinear interpolation layer: the first convolutional layer has stride 1 and a 3 × 3 kernel, the second has stride 1 and a 3 × 3 kernel, followed by a 2× bilinear interpolation layer. The input to each group consists of the output of the previous group (absent for the first group) and the skip-connection information, stacked along the channel dimension; the skip information for the N-th group is the feature map U_(6-N) passed through a 1 × 1 convolution with 4 output channels and stride 1;
④ the final output image of network T is obtained by passing the output of step ③ through a Sigmoid layer.
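Under the assumptions of the CodecDIP sketch above, the three networks of step 2 can then be instantiated as follows (network A shares network T's depth and skip width but outputs 3 channels):

    net_J = CodecDIP(in_ch=8, out_ch=3, depth=3, skip_ch=16)  # sharp image J_1(x)
    net_T = CodecDIP(in_ch=8, out_ch=1, depth=5, skip_ch=4)   # transmission T_1(x)
    net_A = CodecDIP(in_ch=8, out_ch=3, depth=5, skip_ch=4)   # illumination A_1(x)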
Further, the specific details of step 3 include the following:
(1) the foggy image I(x) is synthesized from the outputs of networks J, A, and T according to the atmospheric scattering model: I(x) = J(x)t(x) + (1 - t(x))A(x);
(2) a loss function between the output I(x) of step (1) and the downsampled original foggy image is optimized. The first 500 iterations use the mean square error loss, the last 200 iterations use the structural similarity loss, for 700 iterations in total. During optimization, a depth-of-field prior loss constrains J(x): the mean square error between J_v(x) and J_S(x) is minimized, where J_v(x) is the brightness (value) image of J(x) and J_S(x) is the saturation image of J(x). This constraint helps network A generate the correct atmospheric illumination map.
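To make this schedule concrete, here is a minimal optimization sketch: the Adam optimizer and its learning rate, the noise inputs z_J, z_T, z_A and networks net_J, net_T, net_A from the sketches above, the downsampled image hazy_small (the output of downsample_half), and the third-party pytorch_msssim package for the SSIM term are all assumptions of the example:

    import torch
    import torch.nn.functional as F
    from pytorch_msssim import ssim   # any differentiable SSIM implementation works

    def reconstruct(J, t, A):
        # Atmospheric scattering model: I(x) = J(x) t(x) + (1 - t(x)) A(x).
        return J * t + (1.0 - t) * A

    def depth_prior_loss(J):
        # Depth-of-field prior: MSE between the brightness (value) and
        # saturation images of J, computed from RGB as in the HSV model.
        v, _ = J.max(dim=1)
        mn, _ = J.min(dim=1)
        s = (v - mn) / (v + 1e-6)
        return F.mse_loss(v, s)

    params = (list(net_J.parameters()) + list(net_T.parameters())
              + list(net_A.parameters()))
    opt = torch.optim.Adam(params, lr=1e-3)   # learning rate is an assumption
    for it in range(700):
        J1, t1, A1 = net_J(z_J), net_T(z_T), net_A(z_A)
        I_hat = reconstruct(J1, t1, A1)
        if it < 500:                          # first 500 iterations: MSE
            loss = F.mse_loss(I_hat, hazy_small)
        else:                                 # last 200 iterations: SSIM
            loss = 1.0 - ssim(I_hat, hazy_small, data_range=1.0)
        loss = loss + depth_prior_loss(J1)
        opt.zero_grad()
        loss.backward()
        opt.step()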
Further, the specific details of step 4 include the following:
(1) the same network structure as in step 2 is used, the depth-of-field prior loss is removed, the input noise image has the same size as the original input foggy image, and the number of channels is 8;
(2) the first 200 iterations use the mean square error loss between the synthesized foggy image and the original foggy image, and the last 1800 iterations use the structural similarity loss. For the first 300 iterations, an additional mean square error loss is applied between A_1(x) upsampled 2× by bicubic interpolation and the atmospheric illumination map A_2(x) output by the original-scale network A; this transfers the atmospheric illumination prior extracted by the small-scale network to the original-scale network;
(3) the image J_2(x) output by the original-scale network J is the defogged image finally obtained by the method.
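A matching sketch of this original-scale stage follows, reusing reconstruct and ssim from above; the fresh network copies net_J2, net_T2, net_A2, the full-size noise inputs z_J_full, z_T_full, z_A_full, and the full-size image hazy_full are illustrative names, not identifiers from this document:

    A1_up = F.interpolate(A1.detach(), scale_factor=2, mode="bicubic",
                          align_corners=False)   # 2x-upsampled small-scale prior
    opt2 = torch.optim.Adam(list(net_J2.parameters()) + list(net_T2.parameters())
                            + list(net_A2.parameters()), lr=1e-3)
    for it in range(2000):
        J2, t2, A2 = net_J2(z_J_full), net_T2(z_T_full), net_A2(z_A_full)
        I_hat = reconstruct(J2, t2, A2)
        if it < 200:                              # first 200 iterations: MSE
            loss = F.mse_loss(I_hat, hazy_full)
        else:                                     # last 1800 iterations: SSIM
            loss = 1.0 - ssim(I_hat, hazy_full, data_range=1.0)
        if it < 300:                              # atmospheric-light prior transfer
            loss = loss + F.mse_loss(A2, A1_up)
        opt2.zero_grad()
        loss.backward()
        opt2.step()
    # After optimization, J2 = net_J2(z_J_full) is the final defogged image.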
The invention has the advantages and positive effects that:
1. The invention divides the whole defogging process into two stages. The first stage acquires a small-scale atmospheric illumination prior: performing prior extraction on the small-scale image effectively reduces the solution space of the neural network parameters, and constraining the output small-scale defogged image with the depth-of-field prior helps the small-scale prior extraction network obtain a more accurate atmospheric illumination prior. The second stage generates the original-size atmospheric illumination map from this prior, which effectively guarantees the recovery quality of the original-size defogged image.
2. The method is reasonably designed: the atmospheric scattering model serves as the theoretical basis of the overall framework, a deep learning algorithm guarantees the expressive power of the network, and the unsupervised training procedure avoids the dataset pairing and image-domain coverage problems, as well as the severe degradation of defogging performance across image domains.
Drawings
FIG. 1 is an overall flow diagram of multi-scale defogging according to the present invention;
FIG. 2 is a flow diagram of a small-scale prior extraction module of the present invention;
FIG. 3 is a flow diagram of a full-scale defogged image generation module according to the present invention;
FIG. 4 is a network configuration diagram of the defogged image generating portion of the present invention;
fig. 5 is a network configuration diagram of an atmospheric map and transmission map generating section of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail with reference to the accompanying drawings.
An unsupervised image defogging method based on multi-scale depth image prior, as shown in FIGS. 1 to 5, comprises the following steps:
step S1, in the small-scale prior extraction stage, the input foggy image is downsampled to 1/2 the size of the original image;
step S2, three noise images of the same size as the output of step S1 are fed into three encoder-decoder neural networks, respectively, to obtain three intermediate results representing an atmospheric illumination map, a transmission map, and a defogged image;
step S3, the three intermediate results output in step S2 are combined according to the atmospheric scattering model to obtain a synthesized foggy image; a loss function is constructed between this image and the output of step S1, and at the same time a depth-of-field prior constrains the defogged image, so as to optimize the three encoder-decoder networks of step S2;
step S4, in the original-scale image recovery stage, the atmospheric illumination map generated by the small-scale network serves as the prior of the original-scale network. The training process of the original-scale network is the same as in steps S2-S3, except that the input image is replaced by the foggy image at its original size, the input noise image is resized to the original image size, and the depth-of-field prior constraint is removed. Finally, the reconstructed defogged image is obtained from the original-size network.
The specific implementation method of step S1 is as follows:
step S1.1, let the original foggy image have width W and height H; the small-size image input to the small-scale prior extraction stage then has width W/2 and height H/2;
step S1.2, the image is downsampled with a bicubic interpolation algorithm.
The specific implementation method of step S2 is as follows:
step S2.1, network J takes a noise image as input, with 8 channels, width W/2, and height H/2. The output of the network is the defogged sharp image J_1(x), with 3 channels, width W/2, and height H/2;
step S2.2, network T takes a noise image as input, with 8 channels, width W/2, and height H/2. The output of the network is the transmission map T_1(x), with 1 channel, width W/2, and height H/2;
step S2.3, network A has the same structure as network T and takes a noise image as input, with 8 channels, width W/2, and height H/2. The output of the network is the atmospheric illumination map A_1(x), with 3 channels, width W/2, and height H/2.
The specific implementation method of step S2.1 is as follows:
step S2.1.1, network J is an encoder-decoder structure: the feature maps output by its convolutional layers first shrink gradually and then grow gradually, the final output feature map has the same size as the input, and the structure resembles the letter U;
step S2.1.2, the feature-map reduction stage has 6 convolutional layers, divided into 3 groups; each group halves the feature-map size. Each group consists of two convolutional layers: the first has stride 2 and a 3 × 3 kernel, the second has stride 1 and a 3 × 3 kernel. Let the feature map output by the N-th group be U_N, N ∈ [1, 3];
step S2.1.3, the feature-map enlargement stage has 6 convolutional layers, divided into 3 groups; each group doubles the feature-map size. Each group consists of two convolutional layers and a bilinear interpolation layer: the first convolutional layer has stride 1 and a 3 × 3 kernel, the second has stride 1 and a 3 × 3 kernel, followed by a 2× bilinear interpolation layer. The input to each group consists of the output of the previous group (absent for the first group) and the skip-connection information, stacked along the channel dimension; the skip information for the N-th group is the feature map U_(4-N) passed through a 1 × 1 convolution with 16 output channels and stride 1;
step S2.1.4, the final output image of network J is obtained by passing the output of step S2.1.3 through a Sigmoid layer.
The specific implementation method of step S2.2 is as follows:
step S2.2.1, network T is an encoder-decoder structure: the feature maps output by its convolutional layers first shrink gradually and then grow gradually, the final output feature map has the same size as the input, and the structure resembles the letter U;
step S2.2.2, the feature-map reduction stage has 10 convolutional layers, divided into 5 groups; each group halves the feature-map size. Each group consists of two convolutional layers: the first has stride 2 and a 3 × 3 kernel, the second has stride 1 and a 3 × 3 kernel. Let the feature map output by the N-th group be U_N, N ∈ [1, 5];
step S2.2.3, the feature-map enlargement stage has 10 convolutional layers, divided into 5 groups; each group doubles the feature-map size. Each group consists of two convolutional layers and a bilinear interpolation layer: the first convolutional layer has stride 1 and a 3 × 3 kernel, the second has stride 1 and a 3 × 3 kernel, followed by a 2× bilinear interpolation layer. The input to each group consists of the output of the previous group (absent for the first group) and the skip-connection information, stacked along the channel dimension; the skip information for the N-th group is the feature map U_(6-N) passed through a 1 × 1 convolution with 4 output channels and stride 1;
step S2.2.4, the final output image of network T is obtained by passing the output of step S2.2.3 through a Sigmoid layer.
The specific implementation method of step S3 is as follows:
step S3.1, the foggy image I(x) is synthesized from the outputs of networks J, A, and T according to the atmospheric scattering model: I(x) = J(x)t(x) + (1 - t(x))A(x);
step S3.2, a loss function between the output I(x) of step S3.1 and the downsampled original foggy image is optimized. The first 500 iterations use the mean square error loss, the last 200 iterations use the structural similarity loss, for 700 iterations in total. During optimization, a depth-of-field prior loss constrains J(x): the mean square error between J_v(x) and J_S(x) is minimized, where J_v(x) is the brightness (value) image of J(x) and J_S(x) is the saturation image of J(x). This constraint helps network A generate the correct atmospheric illumination map.
The specific implementation method of step S4 is as follows:
step S4.1, the same network structure as in step S2 is used, the depth-of-field prior loss is removed, the input noise image has the same size as the original input foggy image, and the number of channels is 8;
step S4.2, the first 200 iterations use the mean square error loss between the synthesized foggy image and the original foggy image, and the last 1800 iterations use the structural similarity loss. For the first 300 iterations, an additional mean square error loss is applied between A_1(x) upsampled 2× by bicubic interpolation and the atmospheric illumination map A_2(x) output by the original-scale network A; this transfers the atmospheric illumination prior extracted by the small-scale network to the original-scale network;
step S4.3, the image J_2(x) output by the original-scale network J is the defogged image finally obtained by the method.
Through the above steps, the defogged clear image is obtained.
Finally, network performance is evaluated with PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index), as follows:
Test environment: Python 3.9; PyTorch framework; Ubuntu 16.04; NVIDIA RTX 2080 Ti GPU.
Test data: the Hybrid Subjective Testing Set (HSTS) from the REalistic Single Image DEhazing (RESIDE) benchmark, containing 10 pairs of foggy and fog-free images.
Test method: all image pairs in HSTS are used to evaluate the network quantitatively and qualitatively.
Test metrics: PSNR and SSIM. The same metrics are computed for currently popular algorithms, and the comparison shows that the method achieves better results in real-image defogging.
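A minimal sketch of how these two metrics can be computed for a dehazed result against its ground truth (the pytorch_msssim package and the omitted data loading are assumptions of the example):

    import torch
    from pytorch_msssim import ssim

    def psnr(x: torch.Tensor, y: torch.Tensor, data_range: float = 1.0) -> float:
        # Peak signal-to-noise ratio between images with values in [0, data_range].
        mse = torch.mean((x - y) ** 2)
        return (10.0 * torch.log10(data_range ** 2 / mse)).item()

    # For each foggy/fog-free pair in HSTS (loading omitted):
    # score_psnr = psnr(dehazed, ground_truth)
    # score_ssim = ssim(dehazed, ground_truth, data_range=1.0).item()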
Content not described in detail in this specification belongs to the prior art known to those skilled in the art.
It should be emphasized that the embodiments described herein are illustrative rather than restrictive, and thus the present invention is not limited to the embodiments described in the detailed description, but also includes other embodiments that can be derived from the technical solutions of the present invention by those skilled in the art.

Claims (7)

1. An unsupervised image defogging method based on multi-scale depth image prior, characterized by comprising the following steps:
step 1, in the small-scale prior extraction stage, the input foggy image is downsampled to 1/2 the size of the original image;
step 2, three noise images of the same size as the output of step 1 are fed into three encoder-decoder neural networks, respectively, to obtain three intermediate results representing an atmospheric illumination map, a transmission map, and a defogged image;
step 3, the three intermediate results output in step 2 are combined according to the atmospheric scattering model to obtain a synthesized foggy image; a loss function is constructed between this image and the output of step 1, and at the same time a depth-of-field prior constrains the defogged image, so as to optimize the three encoder-decoder networks of step 2;
step 4, in the original-scale image recovery stage, the atmospheric illumination map generated by the small-scale network serves as the prior of the original-scale network. The training process of the original-scale network is the same as in steps 2-3, except that the input image is replaced by the foggy image at its original size, the input noise image is resized to the original image size, and the depth-of-field prior constraint is removed. Finally, the reconstructed defogged image is obtained from the original-size network.
2. The unsupervised image defogging method based on multi-scale depth image prior according to claim 1, wherein the specific details of step 1 comprise the following:
(1) let the original foggy image have width W and height H; the small-size image input to the small-scale prior extraction stage then has width W/2 and height H/2;
(2) the image is downsampled with a bicubic interpolation algorithm.
3. The unsupervised image defogging method based on multi-scale depth image prior according to claim 1, wherein the specific details of step 2 comprise the following:
(1) network J takes a noise image as input, with 8 channels, width W/2, and height H/2. The output of the network is the defogged sharp image J_1(x), with 3 channels, width W/2, and height H/2;
(2) network T takes a noise image as input, with 8 channels, width W/2, and height H/2. The output of the network is the transmission map T_1(x), with 1 channel, width W/2, and height H/2;
(3) network A has the same structure as network T and takes a noise image as input, with 8 channels, width W/2, and height H/2. The output of the network is the atmospheric illumination map A_1(x), with 3 channels, width W/2, and height H/2.
4. The unsupervised image defogging method based on multi-scale depth image prior according to claim 3, wherein the specific details of item (1) comprise the following:
① network J is an encoder-decoder structure: the feature maps output by its convolutional layers first shrink gradually and then grow gradually, the final output feature map has the same size as the input, and the structure resembles the letter U;
② the feature-map reduction stage has 6 convolutional layers, divided into 3 groups; each group halves the feature-map size. Each group consists of two convolutional layers: the first has stride 2 and a 3 × 3 kernel, the second has stride 1 and a 3 × 3 kernel. Let the feature map output by the N-th group be U_N, N ∈ [1, 3];
③ the feature-map enlargement stage has 6 convolutional layers, divided into 3 groups; each group doubles the feature-map size. Each group consists of two convolutional layers and a bilinear interpolation layer: the first convolutional layer has stride 1 and a 3 × 3 kernel, the second has stride 1 and a 3 × 3 kernel, followed by a 2× bilinear interpolation layer. The input to each group consists of the output of the previous group (absent for the first group) and the skip-connection information, stacked along the channel dimension; the skip information for the N-th group is the feature map U_(4-N) passed through a 1 × 1 convolution with 16 output channels and stride 1;
④ the final output image of network J is obtained by passing the output of step ③ through a Sigmoid layer.
5. The unsupervised image defogging method based on multi-scale depth image prior according to claim 3, wherein the specific details of item (2) comprise the following:
① network T is an encoder-decoder structure: the feature maps output by its convolutional layers first shrink gradually and then grow gradually, the final output feature map has the same size as the input, and the structure resembles the letter U;
② the feature-map reduction stage has 10 convolutional layers, divided into 5 groups; each group halves the feature-map size. Each group consists of two convolutional layers: the first has stride 2 and a 3 × 3 kernel, the second has stride 1 and a 3 × 3 kernel. Let the feature map output by the N-th group be U_N, N ∈ [1, 5];
③ the feature-map enlargement stage has 10 convolutional layers, divided into 5 groups; each group doubles the feature-map size. Each group consists of two convolutional layers and a bilinear interpolation layer: the first convolutional layer has stride 1 and a 3 × 3 kernel, the second has stride 1 and a 3 × 3 kernel, followed by a 2× bilinear interpolation layer. The input to each group consists of the output of the previous group (absent for the first group) and the skip-connection information, stacked along the channel dimension; the skip information for the N-th group is the feature map U_(6-N) passed through a 1 × 1 convolution with 4 output channels and stride 1;
④ the final output image of network T is obtained by passing the output of step ③ through a Sigmoid layer.
6. The unsupervised image defogging method based on multi-scale depth image prior according to claim 1, wherein the specific details of step 3 comprise the following:
(1) the foggy image I(x) is synthesized from the outputs of networks J, A, and T according to the atmospheric scattering model: I(x) = J(x)t(x) + (1 - t(x))A(x);
(2) a loss function between the output I(x) of step (1) and the downsampled original foggy image is optimized. The first 500 iterations use the mean square error loss, the last 200 iterations use the structural similarity loss, for 700 iterations in total. During optimization, a depth-of-field prior loss constrains J(x): the mean square error between J_v(x) and J_S(x) is minimized, where J_v(x) is the brightness (value) image of J(x) and J_S(x) is the saturation image of J(x). This constraint helps network A generate the correct atmospheric illumination map.
7. The unsupervised image defogging method based on multi-scale depth image prior according to claim 1, wherein the specific details of step 4 comprise the following:
(1) the same network structure as in step 2 is used, the depth-of-field prior loss is removed, the input noise image has the same size as the original input foggy image, and the number of channels is 8;
(2) the first 200 iterations use the mean square error loss between the synthesized foggy image and the original foggy image, and the last 1800 iterations use the structural similarity loss. For the first 300 iterations, an additional mean square error loss is applied between A_1(x) upsampled 2× by bicubic interpolation and the atmospheric illumination map A_2(x) output by the original-scale network A; this transfers the atmospheric illumination prior extracted by the small-scale network to the original-scale network;
(3) the image J_2(x) output by the original-scale network J is the defogged image finally obtained by the method.
CN202110381898.1A 2021-04-09 2021-04-09 Unsupervised image defogging method based on multi-scale depth image prior Pending CN113052776A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110381898.1A CN113052776A (en) 2021-04-09 2021-04-09 Unsupervised image defogging method based on multi-scale depth image prior

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110381898.1A CN113052776A (en) 2021-04-09 2021-04-09 Unsupervised image defogging method based on multi-scale depth image prior

Publications (1)

Publication Number Publication Date
CN113052776A true CN113052776A (en) 2021-06-29

Family

ID=76519261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110381898.1A Pending CN113052776A (en) 2021-04-09 2021-04-09 Unsupervised image defogging method based on multi-scale depth image prior

Country Status (1)

Country Link
CN (1) CN113052776A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115272122A (en) * 2022-07-31 2022-11-01 中国人民解放军火箭军工程大学 Priori-guided single-stage distillation image defogging method
CN115272122B (en) * 2022-07-31 2023-03-21 中国人民解放军火箭军工程大学 Priori-guided single-stage distillation image defogging method
CN117789041A (en) * 2024-02-28 2024-03-29 浙江华是科技股份有限公司 Ship defogging method and system based on atmospheric scattering priori diffusion model
CN117789041B (en) * 2024-02-28 2024-05-10 浙江华是科技股份有限公司 Ship defogging method and system based on atmospheric scattering priori diffusion model

Similar Documents

Publication Publication Date Title
CN111915530B (en) End-to-end-based haze concentration self-adaptive neural network image defogging method
CN110517203B (en) Defogging method based on reference image reconstruction
Huang et al. Deep hyperspectral image fusion network with iterative spatio-spectral regularization
CN110866879B (en) Image rain removing method based on multi-density rain print perception
CN113052210A (en) Fast low-illumination target detection method based on convolutional neural network
CN112241939B (en) Multi-scale and non-local-based light rain removal method
CN113052776A (en) Unsupervised image defogging method based on multi-scale depth image prior
CN112365414A (en) Image defogging method based on double-path residual convolution neural network
CN112508960A (en) Low-precision image semantic segmentation method based on improved attention mechanism
CN112070688A (en) Single image defogging method for generating countermeasure network based on context guidance
CN111553856B (en) Image defogging method based on depth estimation assistance
CN114565539B (en) Image defogging method based on online knowledge distillation
CN112419163B (en) Single image weak supervision defogging method based on priori knowledge and deep learning
CN115526779A (en) Infrared image super-resolution reconstruction method based on dynamic attention mechanism
CN111861939A (en) Single image defogging method based on unsupervised learning
CN115063434A (en) Low-low-light image instance segmentation method and system based on feature denoising
CN113256538B (en) Unsupervised rain removal method based on deep learning
CN112785517B (en) Image defogging method and device based on high-resolution representation
CN113628143A (en) Weighted fusion image defogging method and device based on multi-scale convolution
CN116128768B (en) Unsupervised image low-illumination enhancement method with denoising module
CN117036182A (en) Defogging method and system for single image
CN110675320A (en) Method for sharpening target image under spatial parameter change and complex scene
CN113936022A (en) Image defogging method based on multi-modal characteristics and polarization attention
CN115705493A (en) Image defogging modeling method based on multi-feature attention neural network
CN112435200A (en) Infrared image data enhancement method applied to target detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination