CN113066025A - Image defogging method based on incremental learning and feature and attention transfer - Google Patents

Image defogging method based on incremental learning and feature and attention transfer

Info

Publication number
CN113066025A
Authority
CN
China
Prior art keywords
network
layer
feature
convolution
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110304663.2A
Other languages
Chinese (zh)
Other versions
CN113066025B (en)
Inventor
王科平
李冰锋
韦金阳
杨艺
李新伟
崔立志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University of Technology
Original Assignee
Henan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University of Technology filed Critical Henan University of Technology
Priority to CN202110304663.2A priority Critical patent/CN113066025B/en
Publication of CN113066025A publication Critical patent/CN113066025A/en
Application granted granted Critical
Publication of CN113066025B publication Critical patent/CN113066025B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/90 - Determination of colour characteristics
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration

Abstract

The invention discloses an image defogging method based on incremental learning and feature and attention transfer, which comprises the following steps: S1, constructing a self-encoder network serving as a teacher network, and extracting a first intermediate layer feature map and a first feature attention map; S2, constructing a defogging network serving as a student network, outputting a second intermediate layer feature map and a second feature attention map to fit the first intermediate layer feature map and the first feature attention map, and enhancing the corresponding features with the third feature attention map obtained after fitting; S3, training the teacher network with multiple groups of paired identical images; S4, optimally training the student network with multiple groups of paired fog images and clear images; S5, training the student network under the combined action of an SSIM loss function and a Smooth L1 loss function; and S6, performing an incremental operation on the data set of the student network, improving the defogging capability of the defogging network on other data.

Description

Image defogging method based on incremental learning and feature and attention transfer
Technical Field
The invention relates to the field of image processing, in particular to an image defogging method based on incremental learning and feature and attention transfer.
Background
In recent years, air quality has deteriorated and hazy weather has gradually increased owing to industrial production, automobile emissions and other causes. The absorption and scattering of light by particles suspended in the air give fog images acquired by imaging equipment low contrast, color distortion, blurring and similar defects. Haze directly degrades the visual effect of an image and limits high-level computer vision tasks that take images as their processing objects, so research on clarifying hazy images is of great significance in the field of computer vision.
Image restoration methods based on the atmospheric scattering model and image defogging methods based on deep learning are currently the mainstream approaches. However, restoration methods based on the atmospheric scattering model suffer from residual fog, image distortion and other problems caused by inaccurate estimation of intermediate parameters, while deep-learning defogging methods generalize poorly because of the limitations of their data sets. A method that improves both the defogging ability and the generalization ability of a defogging network is therefore needed.
Disclosure of Invention
The invention aims to solve the above problems by providing an image defogging method based on incremental learning and feature and attention transfer that is simple to operate and improves the defogging effect.
In order to achieve the purpose, the technical scheme of the invention is as follows:
an image defogging method based on incremental learning and feature and attention transfer comprises the following steps:
S1, constructing a self-encoder network serving as a teacher network, and extracting first intermediate layer feature maps and first feature attention maps of different dimensions from the self-encoder network for subsequent training of the student network;
S2, constructing a defogging network serving as a student network, the defogging network consisting of residual blocks and two convolution layers, using a Smooth L1 loss function to constrain the second intermediate layer feature maps of different dimensions and the second feature attention maps output by the residual blocks to fit the first intermediate layer feature maps and first feature attention maps extracted from the self-encoder network, and using the third feature attention map obtained after fitting as a weight to enhance the corresponding features;
S3, training the teacher network using multiple groups of paired identical images as its input and labels;
S4, optimally training the student network using multiple groups of paired fog images and clear images as its input and labels;
S5, using the Smooth L1 loss function as the loss function between the labels and the defogging results of the teacher network and the student network, using the SSIM loss function as the loss function between the first intermediate layer feature maps and the second intermediate layer feature maps, using the Smooth L1 loss function as the loss function between the first feature attention maps and the second feature attention maps, and training the student network under the combined action of the SSIM and Smooth L1 loss functions;
and S6, performing an incremental operation on the data set of the student network to improve the defogging capability of the defogging network on other data.
Further, the self-encoder network in step S1 is composed of a convolution module and an upsampling module. The convolution module comprises four convolution layers: the first layer uses 64 3x3 convolution kernels with step size 2 and pad 1, i.e. f1 = 3, c1 = 64, denoted Conv1(3,64,3); the second layer uses 128 3x3 convolution kernels with step size 1 and pad 1, i.e. f2 = 3, c2 = 128, denoted Conv2(64,128,3); the third layer uses 256 3x3 convolution kernels with step size 2 and pad 1, i.e. f3 = 3, c3 = 256, denoted Conv3(128,256,3); the fourth layer uses 512 3x3 convolution kernels with step size 1 and pad 1, i.e. f4 = 3, c4 = 512, denoted Conv4(256,512,3).
The upsampling module corresponds to the convolution module and comprises four deconvolution layers: the first layer uses 256 4x4 convolution kernels with step size 2 and pad 1, i.e. f1' = 4, c1' = 256, denoted TranConv1(512,256,4); the second layer uses 128 1x1 convolution kernels with step size 1 and pad 0, i.e. f2' = 1, c2' = 128, denoted TranConv2(256,128,1); the third layer uses 64 4x4 convolution kernels with step size 2 and pad 1, i.e. f3' = 4, c3' = 64, denoted TranConv3(128,64,4); the fourth layer uses 3 1x1 convolution kernels with step size 1 and pad 0, i.e. f4' = 1, c4' = 3, denoted TranConv4(64,3,1).
Further, each residual block in step S2 adopts two 3x3 convolution layers with pad 1 and step size 1, keeping the input and output dimensions unchanged, i.e. each residual block has the Conv-ReLU-Conv-ReLU-Add format; in addition, a convolution layer with a 3x3 kernel, step size 2 and pad 1 is added before the first and the third residual blocks to perform downsampling.
Further, the third feature attention map in step S2 is applied to the features in the feature enhancement module, and the enhanced features are then input into the student network.
Further, in step S5, the Smooth L1 loss is used as the loss function between the output and the label and between the first feature attention map and the second feature attention map. The Smooth L1 loss function is an improvement on the L1 norm loss function, whose mathematical formula is:
$$L_{1} = \frac{1}{N}\sum_{i=1}^{N}\left|J_{i}-\hat{J}_{i}\right|$$

where J is the label, $\hat{J}$ is the network estimation result, and N is the number of samples;
the mathematical formula for the Smooth L1 loss function is:
$$L_{SmoothL1} = \frac{1}{N}\sum_{i=1}^{N}\operatorname{smooth}_{L1}\left(J_{i}-\hat{J}_{i}\right)$$

where

$$\operatorname{smooth}_{L1}(x)=\begin{cases}0.5x^{2}, & \left|x\right|<1\\ \left|x\right|-0.5, & \text{otherwise.}\end{cases}$$
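For illustration, a minimal PyTorch sketch of this loss (equivalent to torch.nn.SmoothL1Loss with beta = 1; the function name is an assumption, not part of the patent) is:

```python
import torch

def smooth_l1(estimate: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    """Smooth L1 loss per the formula above: quadratic where |x| < 1,
    linear elsewhere, averaged over all elements."""
    diff = estimate - label
    abs_diff = diff.abs()
    per_element = torch.where(abs_diff < 1.0, 0.5 * diff ** 2, abs_diff - 0.5)
    return per_element.mean()
```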
further, the mathematical formula of the SSIM loss function in step S5 is as follows:
$$SSIM(x,\hat{x}) = \frac{\left(2\mu_{x}\mu_{\hat{x}}+c_{1}\right)\left(2\sigma_{x\hat{x}}+c_{2}\right)}{\left(\mu_{x}^{2}+\mu_{\hat{x}}^{2}+c_{1}\right)\left(\sigma_{x}^{2}+\sigma_{\hat{x}}^{2}+c_{2}\right)}$$

where x is the intermediate fog-image feature learned by the student network, $\hat{x}$ is the intermediate fog-free-image feature output by the teacher network, $\mu_{x}$ and $\mu_{\hat{x}}$ are the means of the second intermediate layer feature map and the first intermediate layer feature map respectively, $\sigma_{x}^{2}$ and $\sigma_{\hat{x}}^{2}$ are their variances, and $\sigma_{x\hat{x}}$ is their covariance; $c_{1}=(k_{1}L)^{2}$ and $c_{2}=(k_{2}L)^{2}$ are constants, $k_{1}$ is 0.01, $k_{2}$ is 0.03, and L is the dynamic range of the image pixel values.
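A non-windowed PyTorch sketch of the corresponding loss term, following the formula above, might look like this (practical SSIM implementations usually average over local windows, e.g. an 11x11 Gaussian; the global statistics here are a simplification):

```python
import torch

def ssim_loss(x: torch.Tensor, x_hat: torch.Tensor,
              L: float = 255.0, k1: float = 0.01, k2: float = 0.03) -> torch.Tensor:
    """1 - SSIM between a student feature map x and a teacher feature map
    x_hat, computed from global means, variances and covariance."""
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_x, mu_y = x.mean(), x_hat.mean()
    var_x = x.var(unbiased=False)
    var_y = x_hat.var(unbiased=False)
    cov_xy = ((x - mu_x) * (x_hat - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return 1.0 - ssim  # minimizing 1 - SSIM maximizes structural similarity
```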
Further, the incremental operation on the data set in the student network in step S6 includes the following steps:
S61, selecting an indoor fog map data set to train the self-encoder network;
S62, inputting the indoor fog map data set as the training data set into the defogging network, and simultaneously inputting the clear image corresponding to each fog map in the indoor fog map data set into the self-encoder network;
S63, on the basis of the parameters of the defogging network in step S62, retaining part of the indoor fog maps and adding part of the outdoor fog maps as the training data set to retrain the defogging network.
Compared with the prior art, the invention has the advantages and positive effects that:
the invention provides an image defogging method based on incremental learning and feature and attention transfer, which can effectively improve the defogging and generalization capabilities of a defogging network; a double-network model is adopted on a network structure, a self-encoder is used as a teacher network, a middle-layer characteristic diagram and a characteristic attention diagram of a fog-free image are extracted to increase the constraint of a loss function and guide the learning of a defogging network (a student network), an incremental learning method idea is adopted on a training mode, the defogging network is trained by using an indoor fog diagram data set, a small sample data set including an indoor fog diagram and an outdoor fog diagram is used after the training is finished, the network is retrained, the forgetting of the defogging network to the original knowledge is reduced under the combined action of the guidance of the teacher network and the retention of a small number of data sets, and the defogging effect of the image is improved.
The invention has strong defogging capability on indoor image data. When the network's defogging effect on an outdoor image data set needs to be improved, only a small amount of image data is required for incremental learning; the network does not need to be retrained with a large amount of data, which saves considerable time. The method performs well on both data sets and outperforms other advanced defogging methods.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a general block diagram of a network of the present invention;
FIG. 2 is a schematic structural diagram of the feature enhancement module.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived from the embodiments of the present invention by a person skilled in the art without any creative effort, should be included in the protection scope of the present invention.
The invention provides a dual-network defogging method combining attention, incremental learning and related techniques, comprising the following steps:
(1) Construct a self-encoder network serving as the teacher network to reconstruct clear images, and extract the network's intermediate layer feature maps and feature attention maps of different dimensions for training the subsequent student network. The teacher network comprises an upsampling module and a downsampling module.
(2) Construct a defogging network serving as the student network to clarify fog images. The network consists of residual blocks built from skip connections; the feature maps of different dimensions output by the residual blocks are constrained with a Smooth L1 loss function to fit the teacher network's feature maps and attention maps, and the fitted attention maps serve as weights to enhance the corresponding features. The output feature map of each layer of the student network matches the dimensions of the corresponding teacher network layer.
(3) Train the teacher network using multiple sets of paired identical images as the teacher network's inputs and labels.
(4) Train the student network using multiple sets of paired fog images and clear images as its inputs and labels.
(5) Use the Smooth L1 norm as the loss function between the labels and the network defogging results of the teacher and student networks; measure the difference between the two networks by using the SSIM loss function between their feature maps and the Smooth L1 loss function between the first and second attention maps. The two loss functions work together to train the student network.
(6) The student network fits the intermediate features of the teacher network as closely as possible, and the features are enhanced with attention, improving the student network's feature extraction capability and thereby its defogging capability.
(7) Incrementally extend the student network's data set to improve the network's defogging capability on other data and strengthen its generalization ability.
In step (1), the self-encoder consists of a four-layer convolution module and an upsampling module. The first convolution layer uses 64 3x3 kernels with step size 2 and pad 1, i.e. f1 = 3, c1 = 64; this layer can be represented as Conv1(3,64,3). The second layer uses 128 3x3 kernels with step size 1 and pad 1, i.e. f2 = 3, c2 = 128, denoted Conv2(64,128,3). The third layer uses 256 3x3 kernels with step size 2 and pad 1, i.e. f3 = 3, c3 = 256, denoted Conv3(128,256,3). The fourth layer uses 512 3x3 kernels with step size 1 and pad 1, i.e. f4 = 3, c4 = 512, denoted Conv4(256,512,3). To prevent the grid effect that deconvolution produces when restoring an overly small feature map, downsampling is performed only twice while the channels are expanded, which facilitates network training and prevents information loss; the output of each convolution layer is activated with a ReLU to increase the nonlinearity of the network. The upsampling operation corresponds to the convolution module: four deconvolution layers restore the image to its original size while the number of channels is likewise restored. Specifically, the first deconvolution layer uses 256 4x4 kernels with step size 2 and pad 1 for one upsampling step, i.e. f1' = 4, c1' = 256 (512 channels reduced to 256), denoted TranConv1(512,256,4); the second layer uses 128 1x1 kernels with step size 1 and pad 0, leaving the feature map size unchanged, i.e. f2' = 1, c2' = 128, denoted TranConv2(256,128,1); the third layer uses 64 4x4 kernels with step size 2 and pad 1, doubling the feature map size as in the first layer, i.e. f3' = 4, c3' = 64, denoted TranConv3(128,64,4); the fourth layer uses 3 1x1 kernels with step size 1 and pad 0, leaving the feature map size unchanged, i.e. f4' = 1, c4' = 3, denoted TranConv4(64,3,1).
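The self-encoder described above could be sketched in PyTorch as follows; this is a minimal sketch under the stated layer specification, and the class and variable names are assumptions, not part of the patent:

```python
import torch
import torch.nn as nn

class TeacherAutoEncoder(nn.Module):
    """Self-encoder: Conv1(3,64,3)...Conv4(256,512,3) for encoding,
    TranConv1(512,256,4)...TranConv4(64,3,1) for decoding."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, stride=2, padding=1)     # Conv1(3,64,3)
        self.conv2 = nn.Conv2d(64, 128, 3, stride=1, padding=1)   # Conv2(64,128,3)
        self.conv3 = nn.Conv2d(128, 256, 3, stride=2, padding=1)  # Conv3(128,256,3)
        self.conv4 = nn.Conv2d(256, 512, 3, stride=1, padding=1)  # Conv4(256,512,3)
        self.up1 = nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1)  # TranConv1
        self.up2 = nn.ConvTranspose2d(256, 128, 1, stride=1, padding=0)  # TranConv2
        self.up3 = nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1)   # TranConv3
        self.up4 = nn.ConvTranspose2d(64, 3, 1, stride=1, padding=0)     # TranConv4
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # every convolution is followed by a ReLU activation
        f1 = self.relu(self.conv1(x))
        f2 = self.relu(self.conv2(f1))
        f3 = self.relu(self.conv3(f2))
        f4 = self.relu(self.conv4(f3))
        y = self.relu(self.up1(f4))
        y = self.relu(self.up2(y))
        y = self.relu(self.up3(y))
        y = self.up4(y)  # restore 3 channels and the original image size
        # f1..f4 are the intermediate layer feature maps used to supervise
        # the student network
        return y, [f1, f2, f3, f4]
```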
In step (2), the defogging network uses residual blocks to form the network backbone; the "identity mapping" reduces the loss of feature information during feature extraction, retains more information, facilitates network training, and prevents the gradient "explosion" problem. The defogging network uses four residual blocks and performs feature extraction together with two downsampling convolution layers. To ensure that each layer's feature map has the same dimensions as the corresponding teacher network feature map, each residual block adopts two 3x3 convolution layers with pad 1 and step size 1, keeping the input and output dimensions unchanged, i.e. each block has the Conv-ReLU-Conv-ReLU-Add format with no BN layer (experiments show that a BN layer causes color distortion in the image). A convolution with a 3x3 kernel, step size 2 and pad 1 is added before the first and the third residual blocks to perform downsampling. In addition, the features of each layer of the teacher and student networks are input into a feature enhancement (FE) module, and the enhanced features are fed into the next layer of the student network. Finally, an upsampling operation, identical to the teacher network's, is applied to the feature map to recover a clear fog-free image.
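A minimal sketch of the residual block and the downsampling convolution described above (any channel widths used to assemble the full backbone are assumptions):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Conv-ReLU-Conv-ReLU-Add: two 3x3 convolutions (stride 1, pad 1),
    no BN layer, with an identity skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.conv1(x))
        out = self.relu(self.conv2(out))
        return out + x  # identity mapping preserves feature information

def downsample(in_ch: int, out_ch: int) -> nn.Module:
    """3x3, stride-2, pad-1 convolution placed before the first and third
    residual blocks to perform downsampling."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                         nn.ReLU(inplace=True))
```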
The self-encoder adopts an encoding-decoding structure to reconstruct images; it is chosen as the teacher because it learns the mapping from an original image back to itself. The underlying idea is that features extracted from a fog-free clear input are more representative, and more suitable for recovering a fog-free image, than features extracted from a fog image. The invention therefore takes a well-trained self-encoder as the teacher network, extracts the fog-free feature map and attention map from each intermediate layer, computes loss functions against the corresponding feature maps and attention maps of the defogging network (student network), and fits the feature maps extracted by the defogging network to those of the teacher. The attention map is obtained by transforming the feature map with the Sigmoid function: owing to the function's characteristics, important feature pixels are transformed into larger values, i.e. larger weights, so the neural network places more attention on important features. However, the Sigmoid maps feature values into (0, 1), and multiplying them element-wise with the original feature map as attention weights gradually shrinks the feature values, which hinders network training. To address this, the invention adds an identity mapping: the original feature map and the processed feature map are added element-wise, achieving feature enhancement. This is realized by the FE (feature enhancement) module in FIG. 1. The invention selects feature maps extracted by different convolution layers for feature enhancement, so that the features extracted by the defogging network fit the teacher network features more comprehensively; the FE structure is shown in FIG. 2.
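The FE module of FIG. 2 could be sketched as follows (a minimal PyTorch sketch; the class name is an assumption):

```python
import torch
import torch.nn as nn

class FeatureEnhancement(nn.Module):
    """FE module: the Sigmoid maps feature values into (0, 1) to form an
    attention map, the map reweights the original features, and the
    identity addition keeps the enhanced values from shrinking."""
    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        attention = torch.sigmoid(feat)   # feature attention map
        return feat + attention * feat    # element-wise identity addition
```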
In step (5), the difference between the clear image output by the network and the real clear image is measured with a Smooth L1 loss function, and the network is trained by minimizing this loss. The Smooth L1 loss function is an improvement on the L1 norm loss function; the L1 norm loss can be expressed as:
$$L_{1} = \frac{1}{N}\sum_{i=1}^{N}\left|J_{i}-\hat{J}_{i}\right|$$

where J is the label, $\hat{J}$ is the network estimation result, and N is the number of samples. The L1 loss function is robust, but its center point is a non-smooth breakpoint, which can make the solution unstable; to address this, the Smooth L1 loss function was proposed as an improvement on the L1 loss, and its mathematical formula can be expressed as:
$$L_{SmoothL1} = \frac{1}{N}\sum_{i=1}^{N}\operatorname{smooth}_{L1}\left(J_{i}-\hat{J}_{i}\right)$$

where

$$\operatorname{smooth}_{L1}(x)=\begin{cases}0.5x^{2}, & \left|x\right|<1\\ \left|x\right|-0.5, & \text{otherwise.}\end{cases}$$
in step (6), in addition to calculating the Loss1 of the fog map and the fog-free map, the intermediate output of the teacher network is used as a student network soft label, the Loss LOSS _ F between the feature maps of each layer and the Loss LOSS _ A between the attention maps are increased, smooth L1 Loss is used as a Loss function between the fog image and the estimated fog-free image and between the two network intermediate attention maps, and Structural Similarity (SSIM) is used as a Loss function between the intermediate feature maps of the two network outputs. The structural similarity loss is used for measuring the structural similarity between two images, the structural similarity is compared from three aspects of brightness, contrast and structure, the evaluation standard of SSIM is similar to the visual system of human, the sensing of local structural change is sensitive, the detail processing is more perfect, and the network performance is greatly improved due to the constraint of a multi-loss function. The SSIM mathematical expression is:
$$SSIM(x,\hat{x}) = \frac{\left(2\mu_{x}\mu_{\hat{x}}+c_{1}\right)\left(2\sigma_{x\hat{x}}+c_{2}\right)}{\left(\mu_{x}^{2}+\mu_{\hat{x}}^{2}+c_{1}\right)\left(\sigma_{x}^{2}+\sigma_{\hat{x}}^{2}+c_{2}\right)}$$

where x is the intermediate fog-image feature learned by the student network, $\hat{x}$ is the intermediate fog-free-image feature output by the teacher network, $\mu_{x}$ and $\mu_{\hat{x}}$ are the means of the feature maps, $\sigma_{x}^{2}$ and $\sigma_{\hat{x}}^{2}$ are their variances, and $\sigma_{x\hat{x}}$ is their covariance. $c_{1}=(k_{1}L)^{2}$ and $c_{2}=(k_{2}L)^{2}$ are constants; $k_{1}$ and $k_{2}$ default to 0.01 and 0.03 respectively, and L is the dynamic range of the image pixel values, taken as 255 in the invention.
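Combining the three constraints, the total training loss might be sketched as below, reusing the smooth_l1 and ssim_loss sketches given earlier; equal weighting of the three terms is an assumption, as the patent does not state the weights:

```python
def total_loss(dehazed, label, student_feats, teacher_feats,
               student_atts, teacher_atts):
    """Loss1 (output vs. label) + LOSS_F (feature maps, SSIM) +
    LOSS_A (attention maps, Smooth L1), summed over supervised layers."""
    loss1 = smooth_l1(dehazed, label)
    loss_f = sum(ssim_loss(s, t) for s, t in zip(student_feats, teacher_feats))
    loss_a = sum(smooth_l1(s, t) for s, t in zip(student_atts, teacher_atts))
    return loss1 + loss_f + loss_a
```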
In step (7), to enhance the network's generalization ability, an incremental learning scheme is adopted for the learning stage; the structure is shown in FIG. 1 (right). Network training is divided into three steps. First, the self-encoder (teacher network) is trained with an indoor fog image data set, giving the teacher network good indoor image reconstruction and feature extraction capabilities. Second, the defogging network (student network) is trained on the indoor fog image data set: the fog image is input into the defogging network while the corresponding clear image is input into the self-encoder, training the defogging network to remove indoor haze. Third, starting from the defogging network parameters of the second step, a small number of indoor fog maps are retained and a small number of outdoor fog maps are added as the data set, and the network is retrained. A drawback of incremental learning is the forgetting of old knowledge: when the network learns new knowledge, part of the existing knowledge is forgotten. The teacher network provided by the invention not only reduces the student network's forgetting of existing knowledge but also improves its performance on the new knowledge.
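The third-step data mix could be sketched as follows; keep_ratio is an assumed value, since the patent specifies only that "a small amount" of indoor data is retained:

```python
import random

def build_incremental_dataset(indoor_pairs, outdoor_pairs, keep_ratio=0.1):
    """Retain a small fraction of the indoor fog/clear pairs and add the
    outdoor pairs; the defogging network is then retrained on this mix
    starting from its second-step parameters (no re-initialization)."""
    n_keep = max(1, int(keep_ratio * len(indoor_pairs)))
    retained = random.sample(list(indoor_pairs), n_keep)
    return retained + list(outdoor_pairs)
```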
The experimental results were compared on the ITS data set; the objective evaluation index comparison is shown in Table 1.
TABLE 1 Objective evaluation index comparison
The results were also compared on the OTS data set; the objective evaluation index comparison is shown in Table 2.
TABLE 2 Objective evaluation index
As is apparent from Tables 1 and 2, the technical scheme of the invention effectively improves the defogging and generalization abilities of the defogging network. The invention adopts a dual-network model: a self-encoder serves as the teacher network, and the intermediate layer feature maps and feature attention maps it extracts from fog-free images add loss-function constraints that guide the learning of the defogging network (student network). The training mode follows the idea of incremental learning: the defogging network is first trained with an indoor fog map data set and then retrained with a small-sample data set after the initial training finishes. Under the combined action of the teacher network and the retained small number of samples, the defogging network's forgetting of its original knowledge is reduced and the image defogging effect is improved.
The invention has strong defogging capability on indoor image data. When the network's defogging effect on an outdoor image data set needs to be improved, only a small amount of image data is required for incremental learning; the network does not need to be retrained with a large amount of data, which saves considerable time. The method performs well on both data sets and outperforms other advanced defogging methods.

Claims (7)

1. An image defogging method based on incremental learning and feature and attention transfer, characterized in that the method comprises the following steps:
S1, constructing a self-encoder network serving as a teacher network, and extracting first intermediate layer feature maps and first feature attention maps of different dimensions from the self-encoder network for subsequent training of the student network;
S2, constructing a defogging network serving as a student network, the defogging network consisting of residual blocks and two convolution layers, using a Smooth L1 loss function to constrain the second intermediate layer feature maps of different dimensions and the second feature attention maps output by the residual blocks to fit the first intermediate layer feature maps and first feature attention maps extracted from the self-encoder network, and using the third feature attention map obtained after fitting as a weight to enhance the corresponding features;
S3, training the teacher network using multiple groups of paired identical images as its input and labels;
S4, optimally training the student network using multiple groups of paired fog images and clear images as its input and labels;
S5, using the Smooth L1 loss function as the loss function between the labels and the defogging results of the teacher network and the student network, using the SSIM loss function as the loss function between the first intermediate layer feature maps and the second intermediate layer feature maps, using the Smooth L1 loss function as the loss function between the first feature attention maps and the second feature attention maps, and training the student network under the combined action of the SSIM and Smooth L1 loss functions;
and S6, performing an incremental operation on the data set of the student network to improve the defogging capability of the defogging network on other data.
2. The image defogging method based on incremental learning and feature and attention transfer as claimed in claim 1, wherein: the self-encoder network in step S1 is composed of a convolution module and an upsampling module; the convolution module comprises four convolution layers: the first layer uses 64 3x3 convolution kernels with step size 2 and pad 1, i.e. f1 = 3, c1 = 64, denoted Conv1(3,64,3); the second layer uses 128 3x3 convolution kernels with step size 1 and pad 1, i.e. f2 = 3, c2 = 128, denoted Conv2(64,128,3); the third layer uses 256 3x3 convolution kernels with step size 2 and pad 1, i.e. f3 = 3, c3 = 256, denoted Conv3(128,256,3); the fourth layer uses 512 3x3 convolution kernels with step size 1 and pad 1, i.e. f4 = 3, c4 = 512, denoted Conv4(256,512,3);
the upsampling module corresponds to the convolution module and comprises four deconvolution layers: the first layer uses 256 4x4 convolution kernels with step size 2 and pad 1, i.e. f1' = 4, c1' = 256, denoted TranConv1(512,256,4); the second layer uses 128 1x1 convolution kernels with step size 1 and pad 0, i.e. f2' = 1, c2' = 128, denoted TranConv2(256,128,1); the third layer uses 64 4x4 convolution kernels with step size 2 and pad 1, i.e. f3' = 4, c3' = 64, denoted TranConv3(128,64,4); the fourth layer uses 3 1x1 convolution kernels with step size 1 and pad 0, i.e. f4' = 1, c4' = 3, denoted TranConv4(64,3,1).
3. The image defogging method based on incremental learning and feature and attention transfer as claimed in claim 2, wherein: each residual block in step S2 adopts two 3x3 convolution layers with pad 1 and step size 1, keeping the input and output dimensions unchanged, i.e. each residual block has the Conv-ReLU-Conv-ReLU-Add format; in addition, a convolution layer with a 3x3 kernel, step size 2 and pad 1 is added before the first and the third residual blocks to perform downsampling.
4. The image defogging method based on incremental learning and feature and attention transfer as claimed in claim 3, wherein: the third feature attention map in step S2 is applied to the features in the feature enhancement module, and the enhanced features are then input into the student network.
5. The image defogging method based on incremental learning and feature and attention transfer as claimed in claim 4, wherein: in step S5, the Smooth L1 loss is used as the loss function between the output and the label and between the first feature attention map and the second feature attention map; the Smooth L1 loss function is an improvement on the L1 norm loss function, whose mathematical formula is:
$$L_{1} = \frac{1}{N}\sum_{i=1}^{N}\left|J_{i}-\hat{J}_{i}\right|$$

where J is the label, $\hat{J}$ is the network estimation result, and N is the number of samples;
the mathematical formula for the Smooth L1 loss function is:
$$L_{SmoothL1} = \frac{1}{N}\sum_{i=1}^{N}\operatorname{smooth}_{L1}\left(J_{i}-\hat{J}_{i}\right)$$

where

$$\operatorname{smooth}_{L1}(x)=\begin{cases}0.5x^{2}, & \left|x\right|<1\\ \left|x\right|-0.5, & \text{otherwise.}\end{cases}$$
6. The image defogging method based on incremental learning and feature and attention transfer as claimed in claim 5, wherein: the mathematical formula of the SSIM loss function in step S5 is:
$$SSIM(x,\hat{x}) = \frac{\left(2\mu_{x}\mu_{\hat{x}}+c_{1}\right)\left(2\sigma_{x\hat{x}}+c_{2}\right)}{\left(\mu_{x}^{2}+\mu_{\hat{x}}^{2}+c_{1}\right)\left(\sigma_{x}^{2}+\sigma_{\hat{x}}^{2}+c_{2}\right)}$$

where x is the intermediate fog-image feature learned by the student network, $\hat{x}$ is the intermediate fog-free-image feature output by the teacher network, $\mu_{x}$ and $\mu_{\hat{x}}$ are the means of the second intermediate layer feature map and the first intermediate layer feature map respectively, $\sigma_{x}^{2}$ and $\sigma_{\hat{x}}^{2}$ are their variances, and $\sigma_{x\hat{x}}$ is their covariance; $c_{1}=(k_{1}L)^{2}$ and $c_{2}=(k_{2}L)^{2}$ are constants, $k_{1}$ is 0.01, $k_{2}$ is 0.03, and L is the dynamic range of the image pixel values.
7. The image defogging method based on incremental learning and feature and attention transfer as claimed in claim 6, wherein: the incremental operation on the data set in the student network in the step S6 includes the following steps:
S61, selecting an indoor fog map data set to train the self-encoder network;
S62, inputting the indoor fog map data set as the training data set into the defogging network, and simultaneously inputting the clear image corresponding to each fog map in the indoor fog map data set into the self-encoder network;
S63, on the basis of the parameters of the defogging network in step S62, retaining part of the indoor fog maps and adding part of the outdoor fog maps as the training data set to retrain the defogging network.
CN202110304663.2A 2021-03-23 2021-03-23 Image defogging method based on incremental learning and feature and attention transfer Active CN113066025B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110304663.2A CN113066025B (en) 2021-03-23 2021-03-23 Image defogging method based on incremental learning and feature and attention transfer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110304663.2A CN113066025B (en) 2021-03-23 2021-03-23 Image defogging method based on incremental learning and feature and attention transfer

Publications (2)

Publication Number Publication Date
CN113066025A (en) 2021-07-02
CN113066025B CN113066025B (en) 2022-11-18

Family

ID=76562797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110304663.2A Active CN113066025B (en) 2021-03-23 2021-03-23 Image defogging method based on incremental learning and feature and attention transfer

Country Status (1)

Country Link
CN (1) CN113066025B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9177363B1 (en) * 2014-09-02 2015-11-03 National Taipei University Of Technology Method and image processing apparatus for image visibility restoration
CN111598793A (en) * 2020-04-24 2020-08-28 云南电网有限责任公司电力科学研究院 Method and system for defogging image of power transmission line and storage medium
CN111681178A (en) * 2020-05-22 2020-09-18 厦门大学 Knowledge distillation-based image defogging method
CN112184577A (en) * 2020-09-17 2021-01-05 西安理工大学 Single image defogging method based on multi-scale self-attention generative adversarial network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HAIYAN WU ET AL.: "Knowledge Transfer Dehazing Network for NonHomogeneous Dehazing", 《2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW)》 *
KEPING WANG ET AL.: "Uneven Image Dehazing by Heterogeneous Twin Network", 《IEEE ACCESS》 *
MING HONG ET AL.: "Distilling Image Dehazing With Heterogeneous Task Imitation", 《2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》 *
赵银湖: "Research on Single Image Dehazing Algorithms Based on Deep Learning", 《China Excellent Master's Theses Full-text Database, Information Science and Technology Series》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113592830A (en) * 2021-08-04 2021-11-02 航天信息股份有限公司 Image defect detection method and device and storage medium
CN113592830B (en) * 2021-08-04 2024-05-03 航天信息股份有限公司 Image defect detection method, device and storage medium
CN113592742A (en) * 2021-08-09 2021-11-02 天津大学 Method for removing image moire
CN114785890A (en) * 2021-12-31 2022-07-22 北京泰迪熊移动科技有限公司 Crank call identification method and device

Also Published As

Publication number Publication date
CN113066025B (en) 2022-11-18

Similar Documents

Publication Publication Date Title
CN109299274B (en) Natural scene text detection method based on full convolution neural network
CN108492271B (en) Automatic image enhancement system and method fusing multi-scale information
CN108921799B (en) Remote sensing image thin cloud removing method based on multi-scale collaborative learning convolutional neural network
Wang et al. Dehazing for images with large sky region
CN110738697A (en) Monocular depth estimation method based on deep learning
CN111915530B (en) End-to-end-based haze concentration self-adaptive neural network image defogging method
CN110020989B (en) Depth image super-resolution reconstruction method based on deep learning
CN112184577B (en) Single image defogging method based on multi-scale self-attention generative adversarial network
CN109035172B (en) Non-local mean ultrasonic image denoising method based on deep learning
CN109410144B (en) End-to-end image defogging processing method based on deep learning
CN113066025B (en) Image defogging method based on incremental learning and feature and attention transfer
CN114048822A (en) Attention mechanism feature fusion segmentation method for image
CN111931857B (en) MSCFF-based low-illumination target detection method
CN114936605A (en) Knowledge distillation-based neural network training method, device and storage medium
CN111429392A (en) Multi-focus image fusion method based on multi-scale transformation and convolution sparse representation
CN111402138A (en) Image super-resolution reconstruction method of supervised convolutional neural network based on multi-scale feature extraction fusion
WO2023212997A1 (en) Knowledge distillation based neural network training method, device, and storage medium
CN110738660A (en) Spine CT image segmentation method and device based on improved U-net
CN116311254A (en) Image target detection method, system and equipment under severe weather condition
Guo et al. Multifeature extracting CNN with concatenation for image denoising
CN114418987A (en) Retinal vessel segmentation method and system based on multi-stage feature fusion
CN112686830B (en) Super-resolution method of single depth map based on image decomposition
CN116452469B (en) Image defogging processing method and device based on deep learning
CN116128768B (en) Unsupervised image low-illumination enhancement method with denoising module
CN111612803B (en) Vehicle image semantic segmentation method based on image definition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant