CN115439363A - Video defogging device and method based on contrastive learning - Google Patents

Video defogging device and method based on contrastive learning

Info

Publication number
CN115439363A
Authority
CN
China
Prior art keywords
image
defogging
video
foggy
images
Prior art date
Legal status
Pending
Application number
CN202211078484.2A
Other languages
Chinese (zh)
Inventor
赵佳
杨子龙
王宇
杨颖
余正涛
郭晨靓
Current Assignee
Hefei University of Technology
Fuyang Normal University
Original Assignee
Hefei University of Technology
Fuyang Normal University
Priority date
Filing date
Publication date
Application filed by Hefei University of Technology, Fuyang Normal University filed Critical Hefei University of Technology
Priority to CN202211078484.2A
Publication of CN115439363A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/73Deblurring; Sharpening
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a video defogging device and method based on contrastive learning, which adopt an end-to-end image defogging approach comprising the following steps: acquiring experimental data; designing a defogging network model; analyzing the atmospheric scattering model and jointly learning its transmittance t(x) and atmospheric light A parameters; then computing the fog-free image; and using the foggy image and the clear image as the negative and positive samples respectively, so that the defogged image output by the network is pulled close to the clear image and pushed away from the foggy image in the representation space. At inference time, foggy data are acquired in foggy weather and the captured video is transmitted to a processing terminal through a WIFI module; the foggy video is split frame by frame into single foggy images, each image is passed through the trained defogging network model to obtain a fog-free image, and the fog-free images are finally stitched back into a fog-free video. The invention improves image contrast, enhances image detail, and shows good defogging performance.

Description

Video defogging device and method based on contrastive learning
Technical Field
The application belongs to the field of computer image processing, and particularly relates to a video defogging device and a defogging method based on contrastive learning.
Background
In hazy weather, the scattering of light by suspended particles in the atmosphere degrades the quality of captured images: images become distorted, detail information is lost, the field of view is blurred, visibility drops, and sharpness decreases. People therefore cannot obtain effective information from foggy images, which poses a threat to many computer vision tasks such as object detection, video surveillance, and autonomous driving. Video defogging algorithms build on single-image defogging: each frame of the video is defogged independently, and the processed frames are then re-assembled in order to reconstruct the video. Video defogging is of great importance for autonomous driving and intelligent surveillance, yet research on it is still limited, and both the defogging quality and the real-time performance of existing methods are relatively poor. Improving the visual quality and real-time performance of video defogging has therefore become a key research problem in this field and is of great significance for the development of computer vision tasks.
At present there is no highly effective technique for defogging long video sequences; research on video defogging is still at an early stage, and defogging a single image at a time remains the mainstream approach. Acquiring and analyzing weather and scene information is still the largest source of interference for video processing, and haze blurs video images, which hinders the extraction of image information. Because of the limitations of the current video defogging field, continuous videos cannot be defogged directly; processing is still performed frame by frame, and video defogging is essentially an extension of single-frame image defogging.
Image defogging algorithms are mainly divided into traditional algorithms and deep-learning-based algorithms, and traditional algorithms can be further divided into image-enhancement-based and image-restoration-based methods. Image enhancement mainly improves image contrast, reduces noise, and highlights useful information; it has some defogging effect, but it does not consider how foggy images are actually formed and can cause loss of image detail. Image-restoration-based methods, such as the dark channel prior and the color attenuation prior, use the atmospheric scattering model to explain how foggy images are generated: the foggy image is fed into the inverted atmospheric scattering model to obtain the defogged image. The most classical of these is the dark channel prior: by observing a large number of outdoor fog-free and foggy images, it was found that in the non-sky regions of most fog-free images some pixels have very low intensity values, with the lowest values close to 0. Although the dark-channel-prior-based algorithm is the most classical defogging algorithm, the images it produces tend to be dark, and because the defogging results differ between sky and non-sky regions, it easily causes color distortion in sky regions. Deep-learning-based defogging overcomes these shortcomings of traditional methods: features can be learned from training data, the defogging network is trained on a large number of foggy and fog-free image pairs to learn the mapping between them, and a foggy image fed into the trained model directly yields a fog-free output.
Most existing deep-learning-based defogging methods use only clear images as positive samples to guide the training of the defogging network and neglect the effective use of negative samples. By taking the foggy image and the corresponding clear image as the negative and positive samples respectively, contrastive learning can pull the output fog-free image closer to the positive sample and push it away from the negative sample. Training can therefore be supervised with both positive and negative sample information through contrastive learning, further improving the defogging performance of the network.
Disclosure of Invention
Based on the above analysis of the research status of single-image and video defogging, and in order to overcome the shortcomings of the prior art, the invention provides a video defogging device and a defogging method based on contrastive learning.
The purpose of the invention can be realized by the following technical scheme:
the invention provides a video defogging device based on contrastive learning, comprising a data acquisition module, a WIFI module, and a processing terminal. In foggy weather, the data acquisition module (a camera device) collects foggy data, and the captured video is transmitted to the processing terminal through the WIFI module for processing. The processing terminal comprises a video processing module, a preprocessing module, a grid network, a post-processing module, and a clear-image generation module. The video processing module selects each frame of the video data as an image to be processed. The preprocessing module consists of a convolutional layer and a residual dense block (RDB): the foggy image passes through the convolutional layer to obtain 16 feature maps as input, and the feature fusion inside the RDB allows more adaptively learnable features to be fused. The grid network is a multi-scale feature-fusion grid network combined with an attention mechanism. The post-processing module is structurally symmetric to the preprocessing module and prevents image distortion or artifacts. The foggy image passes through the preprocessing module, the grid network, and the post-processing module to obtain the parameter K(x); the clear-image generation module then outputs the fog-free image, and finally image fusion is performed to output the fog-free video.
The invention also provides a video defogging method based on contrastive learning, which comprises the following steps:
S1, acquiring and processing image data: using the paired foggy and fog-free images of the RESIDE dataset as the original training data, and cropping the original dataset to a preset image size to obtain the training set;
S2, analyzing the atmospheric scattering model: defogging with the atmospheric scattering model requires estimating two parameters, the transmittance t(x) and the atmospheric light A, and estimating them separately causes errors to accumulate or even be amplified; the model is therefore reformulated so that t(x) and A are unified into a single parameter K(x), reducing the reconstruction error between the output image and the real fog-free image;
S3, in the training stage, building a K(x) estimation module to estimate a more accurate intermediate transmission map;
S4, in the training stage, treating the reformulated atmospheric scattering model from step S2 as an image restoration problem and using the transmission map obtained in step S3 as input to obtain the defogged image;
S5, constructing a contrastive loss with a contrastive learning strategy and performing multiple rounds of training, taking the foggy image and the corresponding clear image as the negative and positive samples respectively, so that the image obtained in step S4 is pulled closer to the clear image and pushed away from the foggy image in the representation space;
S6, in the testing stage, capturing a foggy video with the camera and processing it frame by frame to obtain a set of single foggy images;
S7, in the testing stage, inputting each single image obtained in the previous step into the trained defogging network model to obtain a fog-free image;
S8, in the testing stage, fusing the images and outputting the defogged video.
Further, in step S1 a clear image is input into the atmospheric scattering model to generate the corresponding foggy image. The atmospheric scattering model is:
I(x)=J(x)t(x)+A(1-t(x)),
where I(x) is the hazy image, J(x) is the haze-free image, A is the global atmospheric light value, and t(x) is the transmittance, defined as:
t(x) = e^(-βd(x))
where β is the atmospheric scattering coefficient and d(x) is the scene depth.
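For illustration only, the following short Python sketch applies this model to synthesize a foggy image from a clear image and a depth map; the particular values of A and beta are arbitrary example choices, not values prescribed by the invention.

```python
import numpy as np

def synthesize_fog(J, depth, A=0.9, beta=1.0):
    """Apply the atmospheric scattering model I = J*t + A*(1 - t), t = exp(-beta*d).

    J     : clear image, float array in [0, 1], shape (H, W, 3)
    depth : scene depth map, float array, shape (H, W)
    A     : global atmospheric light value (example value, assumed)
    beta  : atmospheric scattering coefficient (example value, assumed)
    """
    t = np.exp(-beta * depth)          # transmittance t(x) = e^(-beta * d(x))
    t = t[..., None]                   # broadcast over the 3 color channels
    I = J * t + A * (1.0 - t)          # hazy image
    return np.clip(I, 0.0, 1.0)
```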
Step S2 specifically includes the following. The conventional image defogging algorithm based on the atmospheric scattering model has three main steps: estimate the transmission map t(x) from the hazy image I(x) with a complex depth model, estimate the atmospheric light A with empirical methods, and finally recover the defogged image with the atmospheric model. Estimating the atmospheric light and the transmittance separately, however, amplifies errors, so the two parameters are replaced here by a single parameter K(x), and formula (1) is transformed into:
J(x) = K(x)I(x) - K(x) + b
K(x) = ((I(x) - A)/t(x) + (A - b)) / (I(x) - 1)
where b is a bias with default value 1. Since t(x) and A are integrated into K(x) and K(x) depends on the input foggy image, the error between the generated image and the original image can be reduced.
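As a minimal illustration of this reformulation (not the invention's own implementation), the sketch below computes K(x) from a known t(x) and A and then recovers J(x) = K(x)I(x) - K(x) + b; in the method itself K(x) is predicted by the estimation module built in step S3.

```python
import numpy as np

def k_from_t_A(I, t, A, b=1.0, eps=1e-6):
    # K(x) = ((I(x) - A)/t(x) + (A - b)) / (I(x) - 1)
    den = I - 1.0
    den = np.where(np.abs(den) < eps, -eps, den)   # guard against division by zero
    return ((I - A) / np.maximum(t, eps) + (A - b)) / den

def restore_from_k(I, K, b=1.0):
    # Reformulated model: J(x) = K(x) * I(x) - K(x) + b
    J = K * I - K + b
    return np.clip(J, 0.0, 1.0)
```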
Further, the K(x) estimation module constructed in step S3 consists of a preprocessing module, a grid network, and a post-processing module.
The preprocessing module of the K(x) estimation module consists of a convolutional layer and a residual dense block (RDB). The foggy image passes through the convolutional layer to obtain 16 feature maps as input, and the feature fusion inside the RDB allows more adaptively learnable features to be fused. Each RDB consists of 5 convolutional layers: the first four increase the number of feature maps, the last fuses them, and the output of the RDB is then combined with its input.
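A minimal PyTorch sketch of such a residual dense block is given below; the growth rate and the use of a 1x1 fusion layer are illustrative assumptions consistent with the description, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class RDB(nn.Module):
    """Residual dense block: 4 densely connected conv layers grow the number of
    feature maps, a final layer fuses them, and the result is added to the input."""
    def __init__(self, channels=16, growth=16):
        super().__init__()
        self.layers = nn.ModuleList()
        in_ch = channels
        for _ in range(4):                      # first four layers increase feature maps
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_ch, growth, 3, padding=1), nn.ReLU(inplace=True)))
            in_ch += growth                     # dense connections concatenate features
        self.fuse = nn.Conv2d(in_ch, channels, 1)   # last layer fuses the feature maps

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return x + self.fuse(torch.cat(feats, dim=1))   # combine output with RDB input
```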
The grid network is a multi-scale feature-fusion grid network combined with an attention mechanism. Each row consists of 5 RDB blocks; the up-sampling and down-sampling structures are the same, and each column obtains feature maps at different scales through up-sampling or down-sampling. After a down-sampling block the number of channels of the feature map increases and its spatial size is halved; up-sampling does the opposite. Each RDB block is feature-fused with the up-sampled or down-sampled result using a channel attention mechanism. A ReLU activation follows each convolutional layer, and the numbers of features at the three scales are set to 16, 32, and 64, respectively.
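The channel-attention fusion between an RDB output and the up-/down-sampled branch might look like the following sketch; the squeeze-and-excitation style design and the reduction ratio are assumptions made for illustration, not a structure spelled out in the patent.

```python
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """Fuse two feature maps of equal shape using per-channel attention weights."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                               # global average pooling
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, rdb_feat, resampled_feat):
        x = rdb_feat + resampled_feat      # combine the RDB branch and the resampled branch
        w = self.attn(x)                   # one trainable weight per channel
        return x * w                       # treat channels (features) unequally
```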
the post-processing module is a post-processing module which is symmetrical to the preprocessing structure because an image directly obtained through a mesh network can be distorted or generate an artifact.
Further, step S4 specifically includes:
and inputting the foggy image into a K (x) estimation module, outputting a more accurate intermediate transmission diagram, inputting the intermediate transmission diagram into an improved atmosphere scattering model formula, and outputting a defogged image.
Further, step S5 specifically includes:
the comparison learning aims at distinguishing data, so that the distance between the training result and the positive sample is shortened, and the distance between the training result and the negative sample is enlarged. The positive sample and the negative sample are respectively composed of a clear image and a synthesized foggy image, a common feature space is selected from the pre-training model VGG-19, and the contrast loss can be expressed as:
Figure BDA0003832713690000071
where J denotes a fog-free image as a positive sample, I denotes a composite fog-free image as phi (I, w) is a fog-free image generated by a defogging model, and G j Representing the extraction of features from different layers of pre-training, D (x, y) is the L1 distance between the two, w j Are the weight coefficients.
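A hedged PyTorch sketch of such a contrastive regularization term is shown below; the specific VGG-19 layer indices and per-layer weights are illustrative assumptions (the patent only states that a common feature space is taken from pre-trained VGG-19 and that D is the L1 distance), and the torchvision weights API assumes a recent torchvision version.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ContrastiveLoss(nn.Module):
    """L_g = sum_j w_j * D(G_j(J), G_j(out)) / D(G_j(I), G_j(out)),
    where J is the clear positive, I the hazy negative, out the network output."""
    def __init__(self, layer_ids=(1, 6, 11, 20, 29),
                 weights=(1/32, 1/16, 1/8, 1/4, 1.0)):        # assumed layers/weights
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg, self.layer_ids, self.weights = vgg, layer_ids, weights
        self.l1 = nn.L1Loss()

    def _features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

    def forward(self, out, clear, hazy):
        f_out, f_pos, f_neg = map(self._features, (out, clear, hazy))
        loss = 0.0
        for w, o, p, n in zip(self.weights, f_out, f_pos, f_neg):
            # numerator pulls toward the positive, denominator pushes from the negative
            loss = loss + w * self.l1(o, p) / (self.l1(o, n) + 1e-7)
        return loss
```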
Compared with the prior art, the method has the following advantages:
1. The invention unifies the transmittance and the global atmospheric light value of the atmospheric scattering model into a single parameter K(x), which depends on the input foggy image and minimizes the error between the output image and the real scene; a channel attention mechanism in the K(x) estimation module generates a different weight for each channel, so that different features and pixel regions can be treated unequally.
2. The invention adopts a contrastive learning strategy to guide the training of the defogging network, using both positive and negative samples as supervision so that the defogged image is pulled closer to the clear image serving as the positive sample and pushed away from the foggy image serving as the negative sample, further improving the defogging effect.
3. The trained defogging network model obtains better objective and subjective evaluation results on public test sets; FIG. 5 shows the experimental comparison between the present method and the comparison methods on the defogging dataset.
Drawings
Fig. 1 is a flow chart of a video defogging method.
FIG. 2 shows a K (x) estimation block used in the method of the present invention.
FIG. 3 is a diagram of an RDB module used in the present invention.
Fig. 4 is a schematic structural diagram of a video defogging device based on comparative learning according to an embodiment of the present invention.
FIG. 5 is a quantitative comparison of the inventive and comparison methods on the defogging dataset, evaluated with the PSNR and SSIM metrics.
FIG. 6 is a qualitative comparison of the inventive and comparison methods on hazy images, showing from left to right: the hazy image, dark channel prior defogging, MSCNN, DehazeNet, CAP, AOD-Net, GCANet, MSBDN, the defogged image of the present method, and the corresponding clear image.
Fig. 7 is a comparison of several frames of a foggy video, the first row being the original video sequence and the second row the defogged video sequence.
Detailed Description
In order to explain the contents of the present invention more clearly, the present invention will be further explained with reference to the accompanying drawings.
The invention provides a video defogging method based on contrastive learning, which specifically comprises the following steps:
S1, acquiring and processing image data: using the paired foggy and fog-free images of the RESIDE dataset as the original training data, and cropping the original dataset to a preset image size to obtain the training set;
S2, analyzing the atmospheric scattering model: defogging with the atmospheric scattering model requires estimating two parameters, the transmittance t(x) and the atmospheric light A, and estimating them separately causes errors to accumulate or even be amplified; the model is therefore reformulated so that t(x) and A are unified into a single parameter K(x), reducing the reconstruction error between the output image and the real fog-free image;
S3, in the training stage, building a K(x) estimation module to estimate a more accurate intermediate transmission map;
S4, in the training stage, treating the reformulated atmospheric scattering model from step S2 as an image restoration problem and using the transmission map obtained in step S3 as input to obtain the defogged image;
S5, constructing a contrastive loss with a contrastive learning strategy, taking the foggy image and the corresponding clear image as the negative and positive samples respectively, so that the image obtained in step S4 is pulled closer to the clear image and pushed away from the foggy image in the representation space;
S6, in the testing stage, capturing a foggy video with the camera and processing it frame by frame to obtain a set of single foggy images;
S7, in the testing stage, inputting each single image obtained in step S6 into the trained defogging network model to obtain a fog-free image;
S8, in the testing stage, fusing the images and outputting the defogged video.
The image defogging method requires paired training with fog-free images and their corresponding foggy images; because real paired data are difficult to collect, the foggy images are synthesized through the atmospheric scattering model.
The mathematical model of foggy-day imaging is:
I(x)=J(x)t(x)+A(1-t(x)),
where I(x) is the hazy image, J(x) is the haze-free image, A is the global atmospheric light value, and t(x) is the transmittance.
In this embodiment the RESIDE dataset is used, including the large-scale Indoor Training Set (ITS) and the Outdoor Training Set (OTS). The indoor training set generates 13,990 foggy images from 1,399 clear images and their corresponding depth maps using the atmospheric scattering model, with atmospheric light value A ∈ [0.7, 1.0] and scattering coefficient β ∈ [0.6, 1.8]. The outdoor training set consists of synthesized outdoor foggy images and their corresponding clear images, with A ∈ [0.8, 1] and β ∈ [0.04, 0.2]. The test set is the Synthetic Objective Testing Set (SOTS), which comprises 500 indoor pairs and 500 outdoor pairs.
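For illustration, a minimal PyTorch dataset that loads hazy/clear pairs and applies an identical random crop to both might look as follows; the directory layout, the ITS-style file naming, and the 240x240 crop size are assumptions, since the patent only states that images are cropped to a preset size.

```python
import os, random
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class ResidePairs(Dataset):
    """Paired hazy/clear images cropped to a preset patch size (layout assumed)."""
    def __init__(self, hazy_dir, clear_dir, crop=240):
        self.hazy_dir, self.clear_dir, self.crop = hazy_dir, clear_dir, crop
        self.names = sorted(os.listdir(hazy_dir))
        self.to_tensor = transforms.ToTensor()

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        hazy = Image.open(os.path.join(self.hazy_dir, name)).convert("RGB")
        # assumed ITS-style naming: hazy "1_1_0.90179.png" pairs with clear "1.png"
        clear_name = name.split("_")[0] + ".png"
        clear = Image.open(os.path.join(self.clear_dir, clear_name)).convert("RGB")
        # identical random crop for both images (images assumed >= crop size)
        i = random.randint(0, hazy.height - self.crop)
        j = random.randint(0, hazy.width - self.crop)
        box = (j, i, j + self.crop, i + self.crop)
        return self.to_tensor(hazy.crop(box)), self.to_tensor(clear.crop(box))
```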
In this embodiment the K(x) estimation module consists of a preprocessing module, a grid network, and a post-processing module. The preprocessing module comprises a convolutional layer and a residual dense block (RDB): the foggy image passes through the convolutional layer to obtain 16 feature maps as input, and the feature fusion inside the RDB allows more adaptively learnable features to be fused; each RDB consists of 5 convolutional layers, the first four increasing the number of feature maps and the last fusing them, after which the output is combined with the input of the RDB. The grid network is a multi-scale feature-fusion grid network combined with an attention mechanism: each row consists of 5 RDB blocks, the up-sampling and down-sampling structures are the same, and each column obtains feature maps at different scales through up-sampling or down-sampling; after a down-sampling block the number of channels of the feature map increases and its spatial size is halved, and up-sampling does the opposite; each RDB block is feature-fused with the up-sampled or down-sampled result using a channel attention mechanism; a ReLU activation follows each convolutional layer, and the numbers of features at the three scales are set to 16, 32, and 64, respectively. Because an image obtained directly from the grid network may be distorted or contain artifacts, a post-processing module that is structurally symmetric to the preprocessing module is used.
In this embodiment, the two parameters of the atmospheric scattering model, the transmittance and the atmospheric light value, are unified into K(x), and the transformed atmospheric scattering model is:
J(x) = K(x)I(x) - K(x) + b,
K(x) = ((I(x) - A)/t(x) + (A - b)) / (I(x) - 1)
where b is a bias with default value 1. Since t(x) and A are integrated into K(x) and K(x) depends on the input foggy image, the error between the generated image and the original image can be reduced.
The foggy image is input into the K(x) estimation module, which outputs a more accurate intermediate transmission map; this map is then fed into the reformulated atmospheric scattering model to output the defogged image.
In this embodiment the network is optimized with the Adam optimizer at an initial learning rate of 0.001. On the indoor dataset, training runs for 100 epochs and the learning rate is halved every 20 epochs; on the outdoor dataset, training runs for 10 epochs and the learning rate is halved every 2 epochs. The defogging network model is built under the PyTorch 1.9.0 framework on an NVIDIA GeForce RTX 2080Ti GPU.
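A minimal sketch of this training configuration with PyTorch's Adam optimizer and a step learning-rate schedule is shown below; model, train_loader, and total_loss are placeholders standing for the defogging network, the data loader, and the combined loss of this embodiment, not code provided by the patent.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)      # initial lr 0.001
# Indoor (ITS) schedule: 100 epochs, halve the learning rate every 20 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
# (For the outdoor set the embodiment uses 10 epochs with step_size=2 instead.)

for epoch in range(100):
    for hazy, clear in train_loader:
        optimizer.zero_grad()
        dehazed = model(hazy)
        loss = total_loss(dehazed, clear, hazy)   # smooth L1 + contrastive + perceptual
        loss.backward()
        optimizer.step()
    scheduler.step()
```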
In this embodiment the defogging effect is evaluated with two metrics, PSNR and SSIM. A larger PSNR indicates less image distortion; SSIM measures image similarity in terms of luminance, contrast, and structure, and a larger SSIM means the output defogged image retains more of the original information.
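As an illustrative sketch (not part of the patent), PSNR and SSIM can be computed with scikit-image as follows; the channel_axis argument assumes a recent scikit-image version, and the images are assumed to be uint8 RGB arrays of identical shape.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(dehazed, ground_truth):
    """dehazed, ground_truth: uint8 RGB arrays of shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(ground_truth, dehazed, data_range=255)
    ssim = structural_similarity(ground_truth, dehazed, channel_axis=2, data_range=255)
    return psnr, ssim
```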
In this embodiment a foggy video is input and processed frame by frame to obtain a set of single foggy images; each image is fed into the trained defogging network model to obtain a fog-free image, and finally the images are fused to output the defogged video.
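A minimal OpenCV sketch of this frame-by-frame pipeline is given below; dehaze_frame is an assumed placeholder standing for the trained defogging network applied to a single BGR frame.

```python
import cv2

def defog_video(src_path, dst_path, dehaze_frame):
    """Split a foggy video into frames, defog each frame, and write the result."""
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    out = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    while True:
        ok, frame = cap.read()
        if not ok:
            break                          # end of the video
        out.write(dehaze_frame(frame))     # defog one frame and append it
    cap.release()
    out.release()
```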
In this embodiment the loss function of the defogging network consists of the smooth L1 loss, the perceptual loss, and the contrastive loss, with the specific formula:
L = L_S + L_g + λL_p,
where L_g denotes the contrastive loss, L_S the smooth L1 loss, and L_p the perceptual loss; λ is a parameter that adjusts the relative weight of the perceptual component and is set to 0.04 in this embodiment.
The contrastive loss is formulated as follows:
L_g = Σ_j w_j · D(G_j(J), G_j(φ(I, w))) / D(G_j(I), G_j(φ(I, w)))
where J denotes the fog-free image used as the positive sample, I denotes the synthetic foggy image used as the negative sample, φ(I, w) is the fog-free image generated by the defogging model, G_j denotes feature extraction from the j-th layer of the pre-trained model, D(x, y) is the L1 distance between x and y, and w_j are the weight coefficients.
The smooth L1 loss measures the difference between the defogged image and the ground-truth clear image; it converges quickly when far from the optimal solution, produces small gradients as the optimum is approached, and effectively prevents gradient explosion. The smooth L1 loss is:
L_S = (1/M) Σ_{i=1..3} Σ_{x=1..M} F_S(Ĵ_i(x) - J_i(x))
where M is the total number of pixels, Ĵ_i(x) and J_i(x) denote the intensity of the i-th color channel of pixel x in the defogged image and the ground-truth clear image, respectively, and F_S(x) is defined as:
F_S(x) = 0.5x², if |x| < 1; |x| - 0.5, otherwise.
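For illustration, the piecewise function above corresponds to PyTorch's built-in smooth L1 loss with beta = 1, as in this short sketch (averaged over all elements, which differs from the per-channel sum above only by a constant factor).

```python
import torch.nn.functional as F

def smooth_l1(dehazed, clear):
    # F_S applied element-wise and averaged over all pixels and channels
    return F.smooth_l1_loss(dehazed, clear, beta=1.0)
```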
the nature of the perceptual loss is that two pictures are matched at the depth characteristic level, and the perceptual loss formula is as follows:
Figure BDA0003832713690000131
in this example, a VGG16 pre-trained on ImageNet is used as the loss network, and features are extracted from the last layer of each of the first three stages, j represents the jth layer of the network,
Figure BDA0003832713690000132
and phi j (J) Feature maps representing the defogged image and the real clear image, respectively, C j H j W j The dimensions of the jth layer profile are shown.
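A hedged sketch of this perceptual loss is given below; the torchvision layer indices chosen for "the last layer of each of the first three stages" (relu1_2, relu2_2, relu3_3) are an assumption, as is the recent torchvision weights API.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class PerceptualLoss(nn.Module):
    """Sum over the first three VGG16 stages of the mean squared feature difference,
    i.e. (1 / (C_j * H_j * W_j)) * ||phi_j(dehazed) - phi_j(clear)||_2^2."""
    def __init__(self, layer_ids=(3, 8, 15)):     # relu1_2, relu2_2, relu3_3 (assumed)
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg, self.layer_ids = vgg, layer_ids

    def _features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

    def forward(self, dehazed, clear):
        loss = 0.0
        for f_d, f_c in zip(self._features(dehazed), self._features(clear)):
            loss = loss + torch.mean((f_d - f_c) ** 2)   # mean = squared L2 / (C*H*W)
        return loss
```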
As shown in FIG. 4, an embodiment of the present invention provides a video defogging device based on contrastive learning, comprising a data acquisition module, a WIFI module, and a processing terminal. In foggy weather a camera device collects foggy data, and the captured video is transmitted to the terminal through the WIFI module for processing. The processing terminal comprises a video processing module, a preprocessing module, a grid network, a post-processing module, and a clear-image generation module. The video processing module selects each frame of the video data as an image to be processed. The preprocessing module consists of a convolutional layer and a residual dense block (RDB): the foggy image passes through the convolutional layer to obtain 16 feature maps as input, and the feature fusion inside the RDB allows more adaptively learnable features to be fused. The grid network is a multi-scale feature-fusion grid network combined with an attention mechanism. The post-processing module is structurally symmetric to the preprocessing module and prevents image distortion or artifacts. The foggy image passes through the preprocessing module, the grid network, and the post-processing module to obtain the parameter K(x); the clear-image generation module then outputs the fog-free image, and finally image fusion is performed to output the fog-free video.
The video defogging device and the video defogging method based on contrastive learning provided by this embodiment belong to the same inventive concept; the specific implementation process is described in the method embodiment, and the beneficial effects are the same as those of the method embodiment.

Claims (7)

1. A video defogging device based on contrastive learning, characterized by comprising a data acquisition module, a WIFI module, and a processing terminal; in foggy weather, the data acquisition module collects foggy data; the processing terminal comprises a video processing module, a preprocessing module, a grid network, a post-processing module, and a clear-image generation module, wherein the video processing module selects each frame of the video data as an image to be processed; the preprocessing module consists of a convolutional layer and a residual dense block, the foggy image passing through the convolutional layer to obtain 16 feature maps as input, and the feature fusion inside the residual dense block fusing more adaptively learnable features; the grid network is a multi-scale feature-fusion grid network combined with an attention mechanism; the post-processing module is structurally symmetric to the preprocessing module and prevents image distortion or artifacts; the foggy image passes through the preprocessing module, the grid network, and the post-processing module to obtain a parameter K(x), the clear-image generation module then outputs the fog-free image, and finally image fusion is performed to output the fog-free video.
2. A video defogging method using the video defogging device based on contrastive learning as claimed in claim 1, characterized by comprising the following steps:
S1, acquiring and processing image data: using the paired foggy and fog-free images of the RESIDE dataset as the original training data, and cropping the original dataset to a preset image size for training;
S2, analyzing the atmospheric scattering model: defogging with the atmospheric scattering model requires estimating two parameters, the transmittance t(x) and the atmospheric light A, and estimating them separately causes errors to accumulate or even be amplified; the model is therefore reformulated so that t(x) and A are unified into a single parameter K(x), reducing the reconstruction error between the output image and the real fog-free image;
S3, in the training stage, building a K(x) estimation module to estimate a more accurate intermediate transmission map;
S4, in the training stage, treating the reformulated atmospheric scattering model from step S2 as an image restoration problem and using the transmission map obtained in step S3 as input to obtain the defogged image;
S5, constructing a contrastive loss with a contrastive learning strategy, taking the foggy image and the corresponding clear image as the negative and positive samples respectively, so that the image obtained in step S4 is pulled closer to the clear image and pushed away from the foggy image in the representation space;
S6, in the testing stage, inputting a foggy video and processing it frame by frame to obtain a set of single foggy images;
S7, in the testing stage, inputting each single image obtained in the previous step into the trained defogging network model to obtain a fog-free image;
S8, in the testing stage, fusing the images and outputting the defogged video.
3. The video defogging method based on contrastive learning as claimed in claim 2, wherein step S1 specifically comprises: acquiring indoor and outdoor fog-free images and generating the corresponding foggy images according to the atmospheric scattering model, the mathematical model of foggy-day imaging being:
I(x)=J(x)t(x)+A(1-t(x))
where I(x) is the hazy image, J(x) is the haze-free image, A is the global atmospheric light value, and t(x) is the transmittance, defined as:
t(x) = e^(-βd(x))
where β is the atmospheric scattering coefficient and d(x) is the scene depth.
4. The video defogging method based on contrastive learning as claimed in claim 2, wherein step S2 specifically comprises: the conventional image defogging algorithm based on the atmospheric scattering model has three main steps, namely estimating the transmission map t(x) from the hazy image I(x) with a complex depth model, estimating the atmospheric light with empirical methods, and finally obtaining the defogged image with the atmospheric model; since estimating the atmospheric light and the transmittance separately amplifies errors, the two parameters are replaced by K(x) and the atmospheric scattering model is transformed to obtain:
J(x) = K(x)I(x) - K(x) + b
wherein:
K(x) = ((I(x) - A)/t(x) + (A - b)) / (I(x) - 1)
b is a bias with default value 1; t(x) and A are integrated into K(x), and since K(x) depends on the input foggy image, the error between the generated image and the original image can be reduced.
5. The video defogging method according to claim 2, wherein step S3 specifically comprises: the K(x) estimation module mainly consists of a preprocessing module, a grid network, and a post-processing module; the preprocessing module of the K(x) estimation module consists of a convolutional layer and a residual dense block (RDB), the foggy image passing through the convolutional layer to obtain 16 feature maps as input, the feature fusion inside the RDB fusing more adaptively learnable features, each RDB consisting of 5 convolutional layers, the first four increasing the number of feature maps, the last fusing them, and the output of the RDB then being combined with its input; the grid network of the K(x) estimation module is a multi-scale feature-fusion grid network combined with an attention mechanism; each row consists of 5 RDB blocks, the up-sampling and down-sampling structures are the same, and each column obtains feature maps at different scales through up-sampling or down-sampling; after a down-sampling block the number of channels of the feature map increases and its spatial size is halved, and up-sampling does the opposite; each RDB block is feature-fused with the up-sampled or down-sampled result using a channel attention mechanism; a ReLU activation follows each convolutional layer; the numbers of features at the three scales are set to 16, 32, and 64, respectively; the post-processing module of the K(x) estimation module, structurally symmetric to the preprocessing module, is introduced because the image obtained directly from the grid network may be distorted or contain artifacts.
6. The video defogging method according to claim 2, wherein, since feature maps at different scales may not be equally important, a channel attention mechanism is integrated into the grid network to generate trainable weights for feature fusion; a different weight value is generated for each channel, and different features and pixel regions are treated unequally based on these weights.
7. The video defogging method based on contrastive learning as claimed in claim 2, wherein contrastive learning aims to discriminate between data, shortening the distance between the training result and the positive sample and enlarging its distance from the negative sample, so that the model generalizes better and the quality of the generated fog-free image is higher; the positive and negative samples are the clear image and the synthesized foggy image, respectively, a common feature space is taken from the pre-trained VGG-19 model, and the contrastive loss can be expressed as:
L_g = Σ_j w_j · D(G_j(J), G_j(φ(I, w))) / D(G_j(I), G_j(φ(I, w)))
where J denotes the fog-free image used as the positive sample, I denotes the synthetic foggy image used as the negative sample, φ(I, w) is the fog-free image generated by the defogging model, G_j denotes feature extraction from the j-th layer of the pre-trained model, D(x, y) is the L1 distance between x and y, and w_j are the weight coefficients.
CN202211078484.2A 2022-09-05 2022-09-05 Video defogging device and method based on comparison learning Pending CN115439363A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211078484.2A CN115439363A (en) 2022-09-05 2022-09-05 Video defogging device and method based on comparison learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211078484.2A CN115439363A (en) 2022-09-05 2022-09-05 Video defogging device and method based on comparison learning

Publications (1)

Publication Number Publication Date
CN115439363A true CN115439363A (en) 2022-12-06

Family

ID=84247388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211078484.2A Pending CN115439363A (en) 2022-09-05 2022-09-05 Video defogging device and method based on comparison learning

Country Status (1)

Country Link
CN (1) CN115439363A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116343144A (en) * 2023-05-24 2023-06-27 武汉纺织大学 Real-time target detection method integrating visual perception and self-adaptive defogging
CN116343144B (en) * 2023-05-24 2023-08-11 武汉纺织大学 Real-time target detection method integrating visual perception and self-adaptive defogging

Similar Documents

Publication Publication Date Title
Santra et al. Learning a patch quality comparator for single image dehazing
CN110363716B (en) High-quality reconstruction method for generating confrontation network composite degraded image based on conditions
CN112733950A (en) Power equipment fault diagnosis method based on combination of image fusion and target detection
CN112184577B (en) Single image defogging method based on multiscale self-attention generation countermeasure network
CN110070517B (en) Blurred image synthesis method based on degradation imaging mechanism and generation countermeasure mechanism
CN109410135B (en) Anti-learning image defogging and fogging method
CN111275638B (en) Face repairing method for generating confrontation network based on multichannel attention selection
CN114782298B (en) Infrared and visible light image fusion method with regional attention
CN116311254B (en) Image target detection method, system and equipment under severe weather condition
CN116757986A (en) Infrared and visible light image fusion method and device
CN117197624A (en) Infrared-visible light image fusion method based on attention mechanism
CN115439363A (en) Video defogging device and method based on comparison learning
CN115861094A (en) Lightweight GAN underwater image enhancement model fused with attention mechanism
CN117274759A (en) Infrared and visible light image fusion system based on distillation-fusion-semantic joint driving
CN116468645A (en) Antagonistic hyperspectral multispectral remote sensing fusion method
CN116757988A (en) Infrared and visible light image fusion method based on semantic enrichment and segmentation tasks
Wang et al. Multiscale supervision-guided context aggregation network for single image dehazing
Singh et al. Visibility enhancement and dehazing: Research contribution challenges and direction
CN113628143A (en) Weighted fusion image defogging method and device based on multi-scale convolution
CN116664448B (en) Medium-high visibility calculation method and system based on image defogging
CN116863320B (en) Underwater image enhancement method and system based on physical model
CN115631428B (en) Unsupervised image fusion method and system based on structural texture decomposition
CN115311186B (en) Cross-scale attention confrontation fusion method and terminal for infrared and visible light images
CN116542865A (en) Multi-scale real-time defogging method and device based on structural re-parameterization
CN116385293A (en) Foggy-day self-adaptive target detection method based on convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination