CN113947538A - Multi-scale efficient convolution self-attention single image rain removing method


Info

Publication number
CN113947538A
Authority
CN
China
Prior art keywords
rain
image
network model
images
trained
Prior art date
Legal status
Pending
Application number
CN202111113807.2A
Other languages
Chinese (zh)
Inventor
王鑫
覃琴
李民谣
颜靖柯
王逸轩
Current Assignee
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202111113807.2A
Publication of CN113947538A
Legal status: Pending

Classifications

    • G06T5/73
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/25 Fusion techniques
    • G06T2207/30192 Weather; Meteorology

Abstract

The invention discloses a multi-scale efficient convolution self-attention single image rain removing method. Corresponding rain and rain-free images are first obtained through image data preprocessing; the rain images are then fed into a network model fusing an improved Transformer self-attention module with a multi-scale spatial feature fusion module for iterative training. Optimized by a mixed loss function, the model outputs processed images close to the rain-free images, and the trained network model is saved; the trained network model is then used to predict on the image data to be tested and output derained images.

Description

Multi-scale efficient convolution self-attention single image rain removing method
Technical Field
The invention relates to the technical field of image processing, in particular to a multi-scale efficient convolution self-attention single image rain removing method.
Background
Rain is a common natural weather condition that severely degrades the imaging quality of images and video captured by outdoor vision systems, and in turn limits the performance of subsequent high-level computer vision tasks such as object tracking, target detection and image segmentation. Removing rain noise from a rainy image and restoring a clear background is therefore an important image preprocessing problem.
Because a single image provides little usable image feature information, single-image rain removal is challenging. Existing single-image rain removal methods fall into two categories: model-driven and data-driven. A model-driven method first establishes a physical model of the rain streaks based on prior knowledge, such as the physical characteristics of rain, then removes rain noise from the rainy image with a series of hand-designed mathematical models, and finally obtains a clean, rain-free background image. However, model-driven rain removal methods are only suitable for specific rain types and cannot cope with the irregular distribution of real rain images, and the optimization algorithms they adopt usually involve many computational iterations, so their efficiency is low. Data-driven rain removal methods exploit the strong feature extraction capability of deep learning models and learn the features of rain streaks and the effective background information by training on large datasets, thereby recovering a rain-free image from a rainy one.
Disclosure of Invention
The invention aims to provide a multi-scale efficient convolution self-attention single image rain removing method that solves the technical problems of the large computational cost and low efficiency of prior-art single-image rain removal methods.
To achieve this purpose, the invention adopts a multi-scale efficient convolution self-attention single image rain removing method comprising the following steps:
preprocessing data;
constructing a network model;
training the network model;
optimizing a network model;
and predicting and outputting the image after rain removal.
In the data preprocessing step, the image data are preprocessed to obtain rain images and rain-free images, which depict a rainy scene and a rain-free scene of the same environment, respectively.
The rain images serve as the initial training data, and the rain-free images serve as the reference data for comparison after processing.
The network model comprises an encoding structure and a decoding structure. The encoding structure fuses an improved Transformer self-attention module and embeds a multi-scale spatial feature fusion module; the decoding structure comprises conventional efficient convolution blocks and fuses the semantic features of corresponding scales from the encoding structure.
Position coding is added to the improved Transformer self-attention module, so that it not only can model global features but is also sensitive to locally similar features, which helps remove rain noise while preserving background detail textures to the greatest extent. Embedding the multi-scale spatial feature fusion block in the encoding stage alleviates the loss of partial image features during downsampling.
In the network model training step, the optimal parameters of a pre-trained model are first loaded into the network model, where the pre-trained model is the network model trained before the network improvement; the rain images are then fed into the network model for iterative training.
In the network model optimization step, the network parameters of the network model are iteratively updated through back-propagation of a mixed loss function so that the output approaches the rain-free image, and the trained network model is saved.
The rain images are processed iteratively in the network model while the model is being trained. Under optimization by the mixed loss function, the output processed images come ever closer to the rain-free images; the network model at that point is the trained network model and can be used to remove rain from other images.
In the prediction step, the prepared test image data are loaded into the trained network model for forward computation, yielding the derained version of each test image.
In the multi-scale efficient convolution self-attention single image rain removing method of the invention, corresponding rain and rain-free images are first obtained through image data preprocessing; the rain images are then fed into a network model fusing an improved Transformer self-attention module with a multi-scale spatial feature fusion module for iterative training; optimized by the mixed loss function, the model outputs processed images close to the rain-free images and the trained network model is saved; the trained network model is then used to predict on the image data to be tested and output the derained images.
Drawings
To more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flow chart of a multi-scale efficient convolution self-attention single image rain removal method according to the invention.
FIG. 2 is a network model structure diagram of a multi-scale efficient convolution self-attention single image rain removal method according to the invention.
FIG. 3 is a block diagram of the multi-scale spatial feature fusion module of the present invention.
Fig. 4 is a comparison of subjective experimental results on a synthetic data set Rain100H for different algorithms in an embodiment of the invention.
FIG. 5 is a comparison of the average running time and evaluation metrics of different algorithms on Rain100H in an embodiment of the present invention.
Fig. 6 shows subjective experimental results of different algorithms on the near-real dataset SPA in an embodiment of the invention.
FIG. 7 is a structural comparison diagram of two combination schemes of the cross-scale convolution self-attention module of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary, are intended to explain the invention, and are not to be construed as limiting the invention.
Referring to fig. 1, the present invention provides a multi-scale efficient convolution self-attention single image rain removing method, comprising the following steps:
s1: preprocessing data;
s2: constructing a network model;
s3: training the network model;
s4: optimizing a network model;
s5: and predicting and outputting the image after rain removal.
In the data preprocessing step, the image data are preprocessed to obtain rain images and rain-free images, which depict a rainy scene and a rain-free scene of the same environment, respectively.
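To make this pairing concrete, a minimal PyTorch dataset sketch follows; the directory layout, matching file names and [0, 1] normalization are assumptions made for illustration, not details fixed by the patent:

```python
import os

from torch.utils.data import Dataset
from torchvision.io import read_image


class RainPairDataset(Dataset):
    """Loads paired rainy / rain-free images of the same scene.

    Assumes each rainy image in `rainy_dir` has a same-named rain-free
    counterpart in `clean_dir` (an illustrative convention)."""

    def __init__(self, rainy_dir: str, clean_dir: str):
        self.rainy_dir, self.clean_dir = rainy_dir, clean_dir
        self.names = sorted(os.listdir(rainy_dir))

    def __len__(self) -> int:
        return len(self.names)

    def __getitem__(self, idx: int):
        name = self.names[idx]
        # read as CHW uint8 tensors and normalize to [0, 1]
        rainy = read_image(os.path.join(self.rainy_dir, name)).float() / 255.0
        clean = read_image(os.path.join(self.clean_dir, name)).float() / 255.0
        return rainy, clean  # training input and its rain-free reference
```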
The network model comprises an encoding structure and a decoding structure. The encoding structure fuses an improved Transformer self-attention module and embeds a multi-scale spatial feature fusion module. In addition to conventional Efficient Convolution Blocks (ECB), the decoding structure fuses the semantic features of corresponding scales from the encoding structure through Skip Connection operations, which guides the upsampling process in the decoding stage, establishes long-distance feature dependencies, and helps restore image details.
As shown in FIG. 2, the network body is an encoding-decoding structure into which a cross-scale convolution self-attention module, improved on the basis of the original Transformer, and a multi-scale spatial feature fusion module are integrated.
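A bare structural sketch of such an encoding-decoding body with a skip connection is given below in PyTorch. The depth, channel widths and layer choices are illustrative only, and the patent's attention and fusion modules are omitted for brevity:

```python
import torch
import torch.nn as nn


class EncoderDecoderSketch(nn.Module):
    """U-shaped encoder-decoder with a skip connection at the matching scale;
    a structural sketch, not the patent's full network."""

    def __init__(self, ch: int = 32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.down = nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1)      # downsampling
        self.enc2 = nn.Sequential(nn.Conv2d(ch * 2, ch * 2, 3, padding=1), nn.ReLU(inplace=True))
        self.up = nn.ConvTranspose2d(ch * 2, ch, 2, stride=2)          # upsampling
        # skip connection: concatenate encoder features of the same scale
        self.dec1 = nn.Sequential(nn.Conv2d(ch * 2, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.out = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        e1 = self.enc1(x)
        e2 = self.enc2(self.down(e1))
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))  # Skip Connection fusion
        return self.out(d1)
```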
Improved Transformer self-attention module: in the 3rd and 4th stages of the downsampling (encoding) stage, the improved Transformer self-attention module is fused in. Its calculation formula is as follows:

Attention(Q, K, V) = softmax(QKᵀ/√d)·V + P   (1)

where the query vectors Q, key vectors K and value vectors V are obtained by mapping the input X ∈ R^{N×C} through the linear transformations W_Q, W_K and W_V ∈ R^{C×C}; √d is the scaling factor; softmax is the activation function; and softmax(QKᵀ/√d) represents the attention map. For visual tasks, the position-encoding term P is computed using a two-dimensional depthwise convolution:

P = Q ∘ DWConv(V)   (2)

where ∘ denotes element-wise multiplication of the entries at corresponding matrix positions and DWConv denotes the two-dimensional depthwise convolution. With position coding added, the improved Transformer self-attention module not only can model global features but is also sensitive to locally similar features, which helps clear rain noise while retaining background detail textures to the greatest extent.
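For illustration, a minimal PyTorch sketch of a self-attention block of this kind follows. It assumes the position-encoding branch takes the form P = Q ∘ DWConv(V) as in formula (2); the class name, head count and kernel size are illustrative, and this is a sketch rather than the patent's exact implementation:

```python
import torch
import torch.nn as nn


class ConvSelfAttention(nn.Module):
    """Scaled dot-product self-attention plus a depthwise-convolution
    position-encoding branch (sketch under stated assumptions)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5          # 1/sqrt(d)
        self.qkv = nn.Linear(dim, dim * 3)               # W_Q, W_K, W_V fused
        self.proj = nn.Linear(dim, dim)
        # 2-D depthwise convolution used for the position encoding
        self.pos_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) with N = h * w flattened spatial positions
        b, n, c = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, c // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)             # each: (B, heads, N, d)

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)                      # attention map
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)

        # position encoding: P = Q ∘ DWConv(V), element-wise product
        v2d = v.transpose(1, 2).reshape(b, n, c).transpose(1, 2).reshape(b, c, h, w)
        pos = self.pos_conv(v2d).flatten(2).transpose(1, 2)   # (B, N, C)
        q_full = q.transpose(1, 2).reshape(b, n, c)
        out = out + q_full * pos

        return self.proj(out)


# usage: y = ConvSelfAttention(dim=64)(torch.randn(2, 32 * 32, 64), h=32, w=32)
```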
Multi-scale spatial feature fusion module: image rain removal algorithms usually apply several downsampling operations to discard redundant information in the image, but part of the effective information is lost at the same time, so the positions of the rain streaks cannot be located accurately, leading to problems such as incompletely removed rain streaks and an incomplete image background structure. To address the loss of partial image features during downsampling in the encoding stage, the invention designs a multi-scale spatial feature fusion module, embedded in the last stage of the encoding stage, that aggregates context information from multiple scales so as to fully learn rain streak features of different sizes, enabling the network model to cope with the various complex rainfall situations of real environments. The specific structure of the multi-scale spatial feature fusion module is shown in fig. 3. Five parallel convolution operations process the input features: first, a 1×1 convolution reduces the dimension of the input feature map; then three 3×3 convolutions with different dilation factors (2, 4 and 8) extract features at three different receptive fields, improving the model's perception of rain streaks of different sizes; next, an adaptive average pooling operation reduces information redundancy; finally, a 1×1 convolution reduces the number of channels, and the five feature maps of different scales are fused together so that the effective information at different scales in the image can be fully learned.
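As a concrete illustration of this five-branch structure, the following is a minimal PyTorch sketch; the class name, channel widths and the bilinear upsampling of the pooled branch are assumptions made for the example, not details fixed by the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleSpatialFeatureFusion(nn.Module):
    """Five parallel branches aggregating context at several receptive
    fields, following the structure described above (sketch)."""

    def __init__(self, in_ch: int, mid_ch: int = 64):
        super().__init__()
        # branch 1: 1x1 convolution for channel reduction
        self.branch1 = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        # branches 2-4: 3x3 dilated convolutions, dilation factors 2 / 4 / 8
        self.branch2 = nn.Conv2d(in_ch, mid_ch, 3, padding=2, dilation=2)
        self.branch3 = nn.Conv2d(in_ch, mid_ch, 3, padding=4, dilation=4)
        self.branch4 = nn.Conv2d(in_ch, mid_ch, 3, padding=8, dilation=8)
        # branch 5: adaptive average pooling to suppress redundant information
        self.branch5 = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, mid_ch, kernel_size=1),
        )
        # final 1x1 convolution fuses the five maps and restores the channels
        self.fuse = nn.Conv2d(mid_ch * 5, in_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[2:]
        b5 = F.interpolate(self.branch5(x), size=(h, w),
                           mode="bilinear", align_corners=False)
        feats = [self.branch1(x), self.branch2(x),
                 self.branch3(x), self.branch4(x), b5]
        return self.fuse(torch.cat(feats, dim=1))


# usage: y = MultiScaleSpatialFeatureFusion(256)(torch.randn(1, 256, 32, 32))
```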
In the network model training step, the rain images are fed into the network model for iterative training.
In the network model optimization step, the network parameters of the network model are iteratively updated through back-propagation of a mixed loss function so that the output approaches the rain-free image, and the trained network model is saved.
A training loss objective function is given. The loss function mixes the MAE loss, the MS-SSIM loss, the MSE loss and the TV loss, using the strengths of each loss function to compensate for the weaknesses the others show individually and enhancing the stability of the network. First, the MAE loss and the MS-SSIM loss are mixed with a certain weight, with the following formula:

L_MS-SSIM-MAE = α·L_MS-SSIM + (1 − α)·L_MAE   (3)

where L_MS-SSIM is the MS-SSIM loss and L_MAE is the MAE loss, calculated as follows:

L_MS-SSIM(P) = 1 − MS-SSIM(p)   (4)

L_MAE(P) = (1/N)·Σ_{p∈P} |x(p) − y(p)|   (5)

where P represents a block of pixels, p represents a pixel point in the region P, x(p) and y(p) denote the output and reference values at p, N is the number of pixels, and α is empirically set to 0.84.
Moreover, the MS-SSIM loss is not particularly sensitive to consistency deviations and easily causes brightness changes and color shifts in the image. The TV loss constrains the smoothness of the image by calculating differences between adjacent pixels, making the output relatively smooth, which can resolve the artifact problem caused by rain streak residue in the derained image; but it is not suitable for use alone. In a rainy image, the rain streaks and the background detail textures mostly lie in high-frequency regions, and the MAE loss gives a relatively large weight to the high-frequency parts of the image, so rain streak fragments remain while details are retained; the MSE loss and the TV loss are therefore used to remove rain streak artifacts. Finally, the mixed loss function is as follows:
L_Mix = L_MS-SSIM-MAE + μ·L_MSE + λ·L_TV   (6)
where μ and λ are penalty factors, adjusted step by step through experiments to the values 0.3 and 2×10⁻⁸, respectively. The expressions for the MSE loss and the TV loss are as follows:
L_MSE(P) = (1/N)·Σ_{p∈P} (x(p) − y(p))²   (7)

L_TV(p) = Σ_{i,j} ((p_{i,j+1} − p_{i,j})² + (p_{i+1,j} − p_{i,j})²)^{β/2}   (8)
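Formulas (3) through (8) can be assembled into a single loss module. The sketch below is one possible realization: it relies on the third-party pytorch-msssim package for the MS-SSIM term (an assumption; the patent names no implementation) and fixes β = 2 in the TV term (also an assumption), with α = 0.84, μ = 0.3 and λ = 2×10⁻⁸ as stated above:

```python
import torch
import torch.nn as nn
from pytorch_msssim import ms_ssim  # pip install pytorch-msssim


class MixLoss(nn.Module):
    """L_Mix = L_MS-SSIM-MAE + mu * L_MSE + lambda * L_TV, formulas (3)-(8).

    Note: ms_ssim with default settings needs inputs larger than ~160x160."""

    def __init__(self, alpha: float = 0.84, mu: float = 0.3, lam: float = 2e-8):
        super().__init__()
        self.alpha, self.mu, self.lam = alpha, mu, lam
        self.l1 = nn.L1Loss()   # MAE term, formula (5)
        self.l2 = nn.MSELoss()  # MSE term, formula (7)

    @staticmethod
    def tv_loss(x: torch.Tensor) -> torch.Tensor:
        # total variation: squared differences of horizontally / vertically
        # adjacent pixels (formula (8) with beta = 2 assumed)
        dh = (x[:, :, :, 1:] - x[:, :, :, :-1]).pow(2).sum()
        dv = (x[:, :, 1:, :] - x[:, :, :-1, :]).pow(2).sum()
        return dh + dv

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        l_msssim = 1.0 - ms_ssim(pred, target, data_range=1.0)      # formula (4)
        l_mix_34 = self.alpha * l_msssim + (1 - self.alpha) * self.l1(pred, target)
        return (l_mix_34                                             # formula (3)
                + self.mu * self.l2(pred, target)
                + self.lam * self.tv_loss(pred))                     # formula (6)
```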
and iterating the network parameters of the network model through back propagation optimization of the obtained mixed loss function to enable the output result of the network to gradually approach the rain-free image, and storing the trained model.
In the prediction step, the prepared test image data and the trained model parameters are loaded into the network for forward computation, yielding the derained version of each test image.
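This forward-computation step might look as follows in PyTorch; the I/O helpers and file paths are illustrative assumptions:

```python
import torch
from torchvision.io import read_image
from torchvision.utils import save_image


@torch.no_grad()
def predict(model, image_path, weight_path, device="cuda"):
    """Forward pass of the trained model on one test image (sketch)."""
    model.load_state_dict(torch.load(weight_path))  # trained weights
    model.to(device).eval()
    # read the rainy test image, normalize to [0, 1], add batch dimension
    rainy = read_image(image_path).float().div(255.0).unsqueeze(0).to(device)
    derained = model(rainy)                          # forward computation only
    save_image(derained.clamp(0.0, 1.0), "derained.png")
```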
Further, the present invention provides a specific example of experimental comparison using a synthetic data set:
in order to comprehensively verify the performance of the technical scheme, the invention is compared with several advanced rain removing methods based on deep learning at present, and the method specifically comprises the following steps: MPRNET (2021), RCDNet (2020), JORDER-E (2020), DCSFN (2020), SPANet (2019), RESCAN (2018).
Referring to fig. 4, which shows the subjective experimental results of the different algorithms on the synthetic dataset Rain100H, it can be seen that the invention effectively removes rain streaks of different directions and densities and generates a near-real rain-free image while retaining most details. In contrast, the images generated by the other methods are over-smoothed and even destroy the background content: in fig. 4(a) the woman's face shows smear-like blocks, in fig. 4(b) the girl's hair texture has almost disappeared, and in fig. 4(c) the cross building is blurred; only the MPRNet method retains some details. The DCSFN algorithm uses SSIM loss as its loss function, which makes structural boundaries pronounced but also leaves rain streak residue; for example, rain streak artifacts are clearly visible in the sky of the DCSFN output in fig. 4(c). The RCDNet algorithm uses an MSE loss function, which penalizes smooth regions in the image more heavily and blurs the image. The image output by the algorithm of the invention not only removes the rain streaks completely but also renders the woman's face naturally, the girl's hair texture clearly and the cross intact, fully retaining the background details. This demonstrates the effective combination of the local feature modeling capability of convolution with the global feature modeling capability of self-attention, and also verifies the feasibility of the mixed loss function in this scheme.
Besides comparing the subjective effect of each algorithm, to quantify the performance improvement brought by the proposed algorithm, the invention objectively evaluates each algorithm with two image quality metrics: structural similarity (SSIM) and peak signal-to-noise ratio (PSNR). The closer the SSIM value is to 1, the more similar the two images; the larger the PSNR value, the less the image distortion. Table 1 gives the SSIM and PSNR values of each algorithm on the different datasets. As can be seen from Table 1, the algorithm of the invention is quite competitive with the advanced algorithms. Considering the image quality metrics alone: although on the Rain100L dataset the method is slightly inferior to RCDNet in PSNR and slightly below JORDER-E in SSIM, on the Rain100H dataset the method of the invention leads, improving SSIM by 0.0153 and PSNR by 0.95 dB over the latest MPRNet algorithm. A probable reason is that the weak inductive bias of the Transformer requires a larger training set, making it suited to large-scale datasets, so the method gains a greater advantage on the larger Rain100H dataset.
TABLE 1 Comparison of evaluation indices of different algorithms on the synthetic datasets
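For reference, the two metrics can be computed as follows; the PSNR formula is standard, while the SSIM call assumes the pytorch-msssim package rather than any implementation named by the patent:

```python
import torch
from pytorch_msssim import ssim  # SSIM: closer to 1 means more similar


def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; larger means less distortion."""
    mse = torch.mean((pred - target) ** 2)
    return float(10.0 * torch.log10(max_val ** 2 / mse))


def evaluate(pred: torch.Tensor, target: torch.Tensor):
    """Returns (SSIM, PSNR) for a batch of derained / reference images."""
    return float(ssim(pred, target, data_range=1.0)), psnr(pred, target)
```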
In addition, to demonstrate the rain removal efficiency of the method, the invention also compares the average running time for processing one rainy image and the PSNR and SSIM metrics of the different algorithms on the Rain100H dataset. The comparison, shown in fig. 5, indicates that the algorithm of the invention is on par with RCDNet in PSNR but about 50 times faster in processing speed; compared with the latest algorithm MPRNet, it attains higher metric values with a far faster deraining speed. This benefit comes from the use of efficient convolution blocks, which accelerate model inference.
Further, to verify the generalization ability of the proposed algorithm, the invention also compares it with the latest algorithm MPRNet (2021) on a near-real rainy dataset. The comparison, shown in fig. 6, indicates that the two algorithms have comparable deraining effect on near-real rainy images, but the proposed method retains more details: in fig. 6(a), the MPRNet algorithm removes a long white object from the background of the original image, whereas the proposed algorithm keeps it intact in the derained image. Experiments prove that the algorithm has strong generalization ability: it effectively removes rain streaks of different degrees from synthetic rainy images and also clears rain well from near-real rainy scene images.
Further, for the manner of fusing the cross-scale convolution self-attention module in the encoding stage, two combination schemes are designed, as shown in fig. 7:
in order to compare the influence of the number of the cross-scale convolution self-attention module combined with the common convolution on the performance of the proposed network model and the effectiveness of the module in feature extraction in images with different resolutions, two different combination schemes are trained on a data set Rain100H, and evaluation indexes SSIM and PSNR of the two schemes on the two data sets are shown. As shown in table 2, the evaluation index values obtained after training on the Rain100H data set using the two schemes of combination a and combination B are equivalent, combination B is slightly advanced, but combination B contains fewer transform modules, which means fewer parameters and computational consumption. Thus, the scheme herein employing combination B combines volume blocks and transformers.
TABLE 2 Comparison of evaluation indices of combination A and combination B on the Rain100H dataset
Here, the mixed loss function is used as the objective function to be optimized:

L_Mix = L_MS-SSIM-MAE + μ·L_MSE + λ·L_TV   (9)
while the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. A multi-scale efficient convolution self-attention single image rain removing method is characterized by comprising the following steps:
preprocessing data;
constructing a network model;
training the network model;
optimizing a network model;
and predicting and outputting the image after rain removal.
2. The method as claimed in claim 1, wherein in the data preprocessing step, the image data are preprocessed to obtain a rain image and a rain-free image, the rain image and the rain-free image being a rainy scene and a rain-free scene of the same environment, respectively.
3. The method as claimed in claim 2, wherein the network model comprises an encoding structure and a decoding structure, the encoding structure fuses an improved Transformer self-attention module and further embeds a multi-scale spatial feature fusion module, and the decoding structure comprises conventional efficient convolution blocks and fuses semantic features of corresponding scales in the encoding structure.
4. The method as claimed in claim 3, wherein in training the network model, parameters of a pre-trained model are loaded into the network model, the pre-trained model being the network model trained before the network improvement, and the rain images are then fed into the network model for iterative training.
5. The method as claimed in claim 4, wherein in the network model optimization, the network parameters of the network model are iteratively updated through back-propagation optimization of a mixed loss function so that the output result approaches the rain-free image, and the trained network model is saved.
6. The method as claimed in claim 5, wherein in predicting and outputting the derained image, the prepared test image data are loaded into the trained network model for forward computation, and the relatively optimal weight and bias parameters obtained through back-propagation updates are applied to the input to obtain the derained image of the test image.
CN202111113807.2A 2021-09-23 2021-09-23 Multi-scale efficient convolution self-attention single image rain removing method Pending CN113947538A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111113807.2A CN113947538A (en) 2021-09-23 2021-09-23 Multi-scale efficient convolution self-attention single image rain removing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111113807.2A CN113947538A (en) 2021-09-23 2021-09-23 Multi-scale efficient convolution self-attention single image rain removing method

Publications (1)

Publication Number Publication Date
CN113947538A 2022-01-18

Family

ID=79329015

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111113807.2A Pending CN113947538A (en) 2021-09-23 2021-09-23 Multi-scale efficient convolution self-attention single image rain removing method

Country Status (1)

Country Link
CN (1) CN113947538A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115112669A (en) * 2022-07-05 2022-09-27 重庆大学 Pavement nondestructive testing identification method based on small sample
US11908124B2 (en) 2022-07-05 2024-02-20 Chongqing University Pavement nondestructive detection and identification method based on small samples
CN116824372A (en) * 2023-06-21 2023-09-29 中国水利水电科学研究院 Urban rainfall prediction method based on Transformer
CN116824372B (en) * 2023-06-21 2023-12-08 中国水利水电科学研究院 Urban rainfall prediction method based on Transformer

Similar Documents

Publication Publication Date Title
CN108986050B (en) Image and video enhancement method based on multi-branch convolutional neural network
CN109087258B (en) Deep learning-based image rain removing method and device
CN109087273B (en) Image restoration method, storage medium and system based on enhanced neural network
CN110599401A (en) Remote sensing image super-resolution reconstruction method, processing device and readable storage medium
CN110503613B (en) Single image-oriented rain removing method based on cascade cavity convolution neural network
CN113658051A (en) Image defogging method and system based on cyclic generation countermeasure network
CN113947538A (en) Multi-scale efficient convolution self-attention single image rain removing method
CN111161360A (en) Retinex theory-based image defogging method for end-to-end network
CN115345866B (en) Building extraction method in remote sensing image, electronic equipment and storage medium
CN111553851A (en) Video rain removing method based on time domain rain line decomposition and spatial structure guidance
CN116912257B (en) Concrete pavement crack identification method based on deep learning and storage medium
CN113506224A (en) Image restoration method based on multi-scale generation countermeasure network
CN115063318A (en) Adaptive frequency-resolved low-illumination image enhancement method and related equipment
Zheng et al. T-net: Deep stacked scale-iteration network for image dehazing
Lin et al. Single image deraining via detail-guided efficient channel attention network
CN114155171A (en) Image restoration method and system based on intensive multi-scale fusion
CN113256519A (en) Image restoration method, apparatus, storage medium, and program product
Singh et al. Weakly supervised image dehazing using generative adversarial networks
CN115631223A (en) Multi-view stereo reconstruction method based on self-adaptive learning and aggregation
Rani et al. ELM-Based Shape Adaptive DCT Compression technique for underwater image compression
Zhu et al. HDRD-Net: High-resolution detail-recovering image deraining network
CN114549302A (en) Image super-resolution reconstruction method and system
CN113763268A (en) Blind restoration method and system for face image
CN115705493A (en) Image defogging modeling method based on multi-feature attention neural network
Wu et al. Semantic image inpainting based on generative adversarial networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination