Background
As science and technology advance, mobile devices have become widespread and images are ever easier to acquire. Because they rely on relatively low-cost sensors and lenses, images captured by mobile cameras such as mobile-phone cameras are often disturbed by noise; when light is insufficient the noise is especially severe, degrading image quality and hampering subsequent applications. Ensuring image quality is the foundation of high-level visual applications on images, such as object detection and semantic segmentation. How to efficiently denoise a real image and thereby improve its quality is therefore an important research topic in computer vision.
Real-image denoising solves the problem of image noise removal from the software side: a corresponding noise-free image is recovered from a noisy image observed in the real world. Removing noise from real images lets computers observe, analyze and process pictures more accurately, and has great application value in many fields such as high-definition television, medical imaging, satellite imaging and surveillance systems.
Traditional real-image denoising algorithms model real noise as a Gaussian distribution; common methods include non-local block matching (BM3D) and sparse coding (K-SVD). These methods can remove some noise, but their application involves complicated optimization steps, so the time cost is high and rapid deployment is difficult; in addition, they expose too many tunable parameters, so the denoising effect cannot be guaranteed.
Convolutional neural networks are neural networks designed for data with a grid-like structure (an image, for example, can be viewed as a two-dimensional grid of pixels), and they have succeeded in many computer vision tasks such as image classification and object detection. Many solutions for denoising real images based on convolutional neural networks have been developed: TNRD extends the traditional nonlinear reaction-diffusion model with several parameterized linear filters and influence functions; REDNet is a fully convolutional Gaussian-denoising network with an encoder-decoder structure and skip connections; DnCNN integrates residual learning and batch normalization; FFDNet takes a noise-estimation map together with the input to balance noise suppression and detail preservation; CBDNet builds on FFDNet by realizing the noise-level estimation with a sub-network, so that the whole network performs blind denoising; and Path-Restore uses reinforcement learning to build a multi-path CNN with a path finder that dynamically selects a suitable path for each image region. However, these methods do not take into account the diversity and complexity of real noise content, do not consider the differing importance of feature channels, and do not fully exploit multi-scale features, so their effect is limited.
Disclosure of Invention
The invention aims to solve the problems that existing image denoising methods ignore the content diversity and complexity of real noise, do not consider the differing importance of feature channels, and fail to fully exploit multi-scale features, thereby achieving only a limited effect. To this end it provides a real image denoising method based on multi-scale fusion and edge enhancement that is reasonably designed, fully exploits multi-scale information to improve noise removal, and is relatively lightweight.
The purpose of the invention is achieved by the following technical scheme: a real image denoising method based on multi-scale fusion and edge enhancement, comprising the following steps:
step one: in the image input stage, randomly apply the data enhancement technology to transform the sample content;
step two: input the original noisy picture into the network and perform convolutions at three scales on it simultaneously; using the dilated convolution technique, three convolution kernels with a constant parameter count perform a preliminary smoothing and output three smoothed pictures;
step three: concatenate the pictures output in step two with the original input picture and send them to the fusion stage, which uses a skip-connection structure to supplement information in time and outputs a feature map fusing the smoothing effects of the different scales;
step four: extract edges with the Laplacian operator from the image initially input to the network, the smoothed images and the feature map output in step three, and binarize the results with a threshold to obtain a five-channel edge image; concatenate the edge image with the features output in step three and send them to the enhancement module;
step five: map the feature map output by the enhancement module to the output feature dimension by convolution and output the final clear image; the number of output channels of the convolution kernel equals the number of channels of the original input image.
Preferably, the data enhancement technology is implemented as follows:
S11: decide with probability 1/2 whether to apply data enhancement to the input image;
S12: when data enhancement is applied, randomly locate three image blocks in the input image, overlapping allowed; the width and height of each block are chosen at random from the ranges [0, W/4] and [0, H/4], where W and H are the width and height of the input image;
S13: replace each located image block with the content of the noise-free image block at the corresponding position used for supervision, so that the network learns an identity mapping on those pixels.
Preferably, the sizes of the three convolution kernels in step two are 3 × 3, 5 × 5 and 7 × 7 in sequence.
Preferably, the fusion stage in step three consists of five attention modules, a feature-map dynamic expression module, and interleaved down-sampling and up-sampling.
Preferably, the fusion stage proceeds as follows:
S31: the network structure of the fusion stage is shaped like the letter V and comprises three levels; the left side progressively down-samples and acts as an encoder, and the right side correspondingly up-samples and acts as a decoder;
S32: each level performs two 3 × 3 × 32 convolutions to further extract features, and an attention module recalibrates the channel importance of the feature maps carrying different scale information;
S33: at the end of each level of the down-sampling stage, max pooling halves the size of the input features, compressing and fusing spatial features while retaining texture content, enlarging the receptive field of the convolutional network and extracting more semantic information;
S34: at the head of each level in the up-sampling stage, transposed convolution performs 2× up-sampling, and the output is concatenated with the same-resolution output of the corresponding down-sampling level, supplementing the spatial information of the first half in time while combining deep semantic features;
S35: the output of the V-shaped network is passed through the feature-map dynamic expression module, which fuses the information of the different scales and adaptively expresses each feature map, outputting a smooth result exactly the same size as the noisy picture input to the network.
Preferably, the attention module works as follows:
S321: for the input feature map H × W × C, a 3 × 3 × C × 64 convolution further abstracts the features;
S322: after a ReLU activation, a further 3 × 3 × 64 × 64 convolution is applied, and a channel attention mechanism then calibrates the importance of the different channels;
S323: the input of the module and the output of S322 are added pixel by pixel as the final output of the attention module.
Preferably, the feature-map dynamic expression module in S35 works as follows:
S351: express the features with convolution layers of different kernel sizes to obtain U′, U″ and U‴, and add the results pixel by pixel to obtain a mixed feature;
S352: perform global pooling on the mixed feature to extract global semantic information, apply a fully connected layer and a ReLU nonlinear transformation, and split the result into three parts to obtain the three channel calibration coefficient vectors α, β and γ; apply a softmax normalization to them jointly, i.e., weight the three vectors along each channel;
S353: multiply the three vectors α, β and γ with U′, U″ and U‴ respectively and add the products pixel by pixel; at this point every feature channel adaptively selects among convolution kernels of different sizes for its feature expression;
S354: obtain the recovered clean image through a single-layer convolution, wherein the dimension of the convolution layer is 3 × 1.
Preferably, the channel attention mechanism described in S322 is specifically:
a: perform global pooling on the input original features U to extract their global semantic information, then apply a fully connected layer, a ReLU nonlinear transformation, another fully connected layer and a Sigmoid nonlinear transformation to obtain the channel calibration coefficient vector μ;
b: recalibrate the input features U by multiplying them with the channel calibration coefficient vector μ.
Preferably, the enhancement module in step four works as follows:
S41: pass the input feature map H × W × 5 through three cascaded residual modules in sequence; each residual module comprises a 3 × 3 convolution, a ReLU activation and a second 3 × 3 convolution, and finally adds the result to the module input pixel by pixel; the whole enhancement module also adopts a densely connected convolutional structure, i.e., each layer concatenates the inputs of all previous layers and passes its output feature map to all subsequent layers;
S42: concatenate the input with the H × W × 5 output of the first residual module to obtain an H × W × 10 feature map, map it to H × W × 5 by a 1 × 1 convolution and send it to the second residual module;
S43: concatenate the output of the second residual module with the output of the first residual module and the input feature map to obtain an H × W × 15 feature map, output H × W × 5 through a 1 × 1 convolution and send it to the third residual module;
S44: apply a 1 × 1 convolution layer to the output for feature mapping to obtain and output the final denoised image with edge details.
Compared with the prior art, the invention has the beneficial effects that:
1. The whole denoising process is divided into two stages. The first stage obtains a smooth image after multi-scale denoising: the image is denoised at each scale and the feature maps of the scales are adaptively expressed and fused, taking global and local information into account simultaneously, reducing the restoration blur caused by loss of feature information, and exploiting the redundancy of image content to help the overall denoising. The second stage introduces edge information as an aid, recovering edges and detail content and improving the visual effect.
2. The method is reasonably designed: it considers the importance of different feature channels, denoises with multi-scale receptive fields that cover both global and local information, adaptively fuses the denoised multi-scale features, and enhances image details in the later stage, avoiding the over-smoothing common in denoising. The output of the network is a denoised clear image; the network is trained on pairs of input noisy images and noise-free clear images with the mean absolute error loss as the objective, and the denoising effect is evaluated by comparing the output image with the noise-free clear image. Denoising quality is thus ensured while the network stays small, balancing denoising effect and algorithm running speed.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-4, a method for denoising a real image based on multi-scale fusion and edge enhancement includes the following steps:
step S1, in the image input stage, randomly apply the data enhancement technology to transform the sample content;
step S2, input the original noisy picture into the network and perform convolutions at three scales on it simultaneously; following the idea of dilated convolution, three convolution kernels of sizes 3 × 3, 5 × 5 and 7 × 7, with the parameter count kept unchanged, perform a preliminary smoothing in turn and output three smoothed pictures, as in the sketch below;
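A minimal TensorFlow sketch of step S2 (TensorFlow being the framework named in the test environment below); the filter count, set to 3 here so the outputs stay image-like, is an assumption, while the dilation rates realize the 3 × 3, 5 × 5 and 7 × 7 receptive fields with a constant number of weights:

```python
import tensorflow as tf

def multi_scale_smoothing(noisy, filters=3):
    """Three parallel convolutions with 3x3, 5x5 and 7x7 receptive fields;
    dilation keeps every kernel at 3x3 weights, so the parameter count is
    the same at all three scales."""
    conv = lambda rate: tf.keras.layers.Conv2D(
        filters, 3, padding='same', dilation_rate=rate)
    s3 = conv(1)(noisy)  # plain 3x3 kernel
    s5 = conv(2)(noisy)  # dilated 3x3 -> 5x5 receptive field
    s7 = conv(3)(noisy)  # dilated 3x3 -> 7x7 receptive field
    return s3, s5, s7
```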
step S3, concatenate the output of step S2 with the original input picture and send them to the fusion stage; the fusion stage consists of five attention modules, a feature-map dynamic expression module and interleaved down-sampling and up-sampling, and uses a skip-connection structure to supplement information in time, finally outputting a more refined feature map that fuses the smoothing effects of the different scales; the foregoing constitutes the denoising stage of the network;
step S4, what follows is the detail-enhancement stage of the network: extract edges with the Laplacian operator from the image initially input to the network, the smoothed images and the output of step S3, set a threshold and binarize the results to obtain a five-channel edge image; concatenate the edge image with the output of step S3 and send them to the enhancement module, as in the sketch below;
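A sketch of the edge-extraction step; the description fixes only the Laplacian operator and the threshold binarization, so the 4-neighbour kernel, the grayscale reduction and the 0.1 threshold here are assumptions:

```python
import tensorflow as tf

# Fixed 4-neighbour Laplacian kernel (an assumed discretization).
LAPLACIAN = tf.constant([[0., 1., 0.],
                         [1., -4., 1.],
                         [0., 1., 0.]])

def laplacian_edges(image, threshold=0.1):
    """image: [B, H, W, C] in [0, 1] -> binary edge map [B, H, W, 1]."""
    gray = tf.reduce_mean(image, axis=-1, keepdims=True)      # assumed channel collapse
    kernel = tf.reshape(LAPLACIAN, [3, 3, 1, 1])
    response = tf.nn.conv2d(gray, kernel, strides=1, padding='SAME')
    return tf.cast(tf.abs(response) > threshold, tf.float32)  # threshold binarization

def five_channel_edges(original, smoothed, fused):
    """Edges of the input image, the three smoothed images and the step-S3
    output, concatenated into the 5-channel edge image of step S4."""
    maps = [original, *smoothed, fused]
    return tf.concat([laplacian_edges(m) for m in maps], axis=-1)
```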
step S5, map the feature map output by the enhancement module to the output feature dimension by convolution and output the final clear image; the number of output channels of the convolution kernel equals the number of channels of the original input image.
The data enhancement technology of step S1 is implemented as follows:
step S1.1, decide with probability 1/2 whether to apply data enhancement to the input image;
step S1.2, if data enhancement is applied, randomly locate three image blocks in the input image (whose width is W and height is H), overlapping allowed, with the width and height of each block chosen at random from the ranges [0, W/4] and [0, H/4];
step S1.3, replace the content of each located image block with the content of the noise-free image block at the corresponding position used for supervision, so that the network learns an identity mapping on those pixels; this regularizes the network's learning, restricts denoising to the places that actually need it, avoids over-smoothing (excessive denoising), and forces the network to learn not only how to denoise but also where to denoise, as in the sketch below.
Step S3 is implemented as follows:
step S3.1, the network structure of the fusion stage is shaped like the letter V and comprises three levels; the left side progressively down-samples and can be regarded as an encoder, and the right side correspondingly up-samples and can be regarded as a decoder;
step S3.2, each level performs two 3 × 3 × 32 convolutions to further extract features, and an attention module recalibrates the channel importance of the feature maps carrying different scale information;
step S3.3, at the end of each level of the down-sampling stage, max pooling halves the size of the input features, compressing and fusing spatial features while retaining texture content, enlarging the receptive field of the convolutional network and extracting more semantic information;
step S3.4, at the beginning of each level in the up-sampling stage, transposed convolution performs 2× up-sampling, and the output is concatenated with the same-resolution output of the corresponding down-sampling level, supplementing the spatial information of the first half in time while combining deep semantic features;
step S3.5, the output of the V-shaped network is passed through the feature-map dynamic expression module, which fuses the information of the different scales and adaptively expresses each feature map, outputting a finer smooth result exactly the same size as the noisy image input to the network; a sketch of the whole stage follows.
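A sketch of the three-level V-shaped fusion stage; attention_block and dynamic_expression are the sketches given with steps S3.2.2.2 and S3.5.4 below, and the exact placement of the five attention modules (only three appear here) is an assumption:

```python
import tensorflow as tf

def fusion_stage(x, attention_block, dynamic_expression):
    conv = lambda t: tf.keras.layers.Conv2D(32, 3, padding='same',
                                            activation='relu')(t)
    pool = lambda t: tf.keras.layers.MaxPool2D(2)(t)        # S3.3: halve feature size
    up = lambda t: tf.keras.layers.Conv2DTranspose(
        32, 2, strides=2, padding='same')(t)                # S3.4: 2x up-sampling
    # encoder (left side of the V)
    e1 = attention_block(conv(conv(x)))
    e2 = attention_block(conv(conv(pool(e1))))
    bottom = conv(conv(pool(e2)))
    # decoder (right side), skip-connected to same-resolution encoder outputs
    d2 = attention_block(conv(conv(tf.concat([up(bottom), e2], -1))))
    d1 = conv(conv(tf.concat([up(d2), e1], -1)))
    return dynamic_expression(d1)                           # S3.5: adaptive fusion
```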
the specific implementation method of the attention module of step S3.2 is as follows:
step S3.2.1, for the characteristic diagram H × W × C of the input module, a layer of convolution operation of 3 × 3 × C × 64 is used to further abstract the characteristics;
s3.2.2, performing a layer of convolution operation of 3 × 3 × 64 × 64 again through a ReLU activation function, and then calibrating the importance of different channels by using a channel attention mechanism;
step S3.2.3, add the module inputs and the module outputs pixel by pixel as the final output of the attention module.
The specific implementation method of the channel attention mechanism in step S3.2.2 is as follows:
s3.2.2.1, performing global pooling on the input original features U, extracting global semantic information of the original features U, and then performing full connection layer, ReLU nonlinear transformation, full connection layer and Sigmoid nonlinear transformation to obtain a channel calibration coefficient vector mu;
step S3.2.2.2 recalibrates the input signature U by multiplying it with the channel calibration coefficient vector μ.
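A sketch of the channel attention of S3.2.2 together with the attention module of S3.2 built on it; the reduction ratio r and the 1 × 1 projection on the skip path (needed whenever the module input does not already have 64 channels) are assumptions:

```python
import tensorflow as tf

def channel_attention(u, r=4):
    """SE-style recalibration: global pooling -> FC+ReLU -> FC+Sigmoid -> scale."""
    c = int(u.shape[-1])
    mu = tf.keras.layers.GlobalAveragePooling2D()(u)           # global pooling
    mu = tf.keras.layers.Dense(c // r, activation='relu')(mu)  # FC + ReLU
    mu = tf.keras.layers.Dense(c, activation='sigmoid')(mu)    # FC + Sigmoid -> mu
    return u * tf.reshape(mu, [-1, 1, 1, c])                   # recalibrate U by mu

def attention_block(x):
    h = tf.keras.layers.Conv2D(64, 3, padding='same')(x)       # S3.2.1: 3x3xCx64
    h = tf.keras.layers.Conv2D(64, 3, padding='same')(
        tf.nn.relu(h))                                         # S3.2.2: ReLU + 3x3x64x64
    h = channel_attention(h)                                   # S3.2.2: channel attention
    skip = tf.keras.layers.Conv2D(64, 1)(x)                    # assumed width-matching projection
    return skip + h                                            # S3.2.3: pixel-wise addition
```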
The feature-map dynamic expression module of step S3.5 is implemented as follows:
step S3.5.1, express the features with convolution layers of different kernel sizes to obtain U′, U″ and U‴, and add the results pixel by pixel to obtain a mixed feature;
step S3.5.2, perform global pooling on the mixed feature to extract global semantic information, apply a fully connected layer and a ReLU nonlinear transformation, and split the result into three parts to obtain the three channel calibration coefficient vectors α, β and γ; apply a softmax normalization to them jointly, i.e., weight the three vectors along each channel;
step S3.5.3, multiply the three vectors α, β and γ with U′, U″ and U‴ respectively and add the products pixel by pixel; at this point every feature channel adaptively selects among the convolution kernels of different sizes for its feature expression;
step S3.5.4, obtain the recovered clean image through a single-layer convolution, wherein the dimension of the convolution layer is 3 × 1. A sketch follows.
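A sketch of the dynamic expression module in the spirit of selective-kernel fusion; the three kernel sizes (3, 5 and 7, matching the scales used elsewhere), the fully connected layer width and the single-channel final output are assumptions:

```python
import tensorflow as tf

def dynamic_expression(x, out_channels=1):
    c = int(x.shape[-1])
    # S3.5.1: three expressions U', U'', U''' from different kernel sizes
    u1 = tf.keras.layers.Conv2D(c, 3, padding='same')(x)
    u2 = tf.keras.layers.Conv2D(c, 5, padding='same')(x)
    u3 = tf.keras.layers.Conv2D(c, 7, padding='same')(x)
    mixed = u1 + u2 + u3                                  # pixel-wise sum -> mixed feature
    # S3.5.2: global pooling, FC + ReLU, split into alpha/beta/gamma,
    # softmax-normalised across the three branches along each channel
    z = tf.keras.layers.GlobalAveragePooling2D()(mixed)
    z = tf.keras.layers.Dense(c, activation='relu')(z)
    coeffs = tf.reshape(tf.keras.layers.Dense(3 * c)(z), [-1, 3, c])
    alpha, beta, gamma = tf.unstack(tf.nn.softmax(coeffs, axis=1), axis=1)
    reshape = lambda v: tf.reshape(v, [-1, 1, 1, c])
    # S3.5.3: each channel adaptively weights the three kernel sizes
    fused = reshape(alpha) * u1 + reshape(beta) * u2 + reshape(gamma) * u3
    # S3.5.4: single-layer 3x3 convolution to the smooth output
    return tf.keras.layers.Conv2D(out_channels, 3, padding='same')(fused)
```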
The enhancement module of step S4 is implemented as follows:
step S4.1, pass the input feature map H × W × 5 through three cascaded residual modules in sequence; each residual module comprises a 3 × 3 convolution, a ReLU activation and a second 3 × 3 convolution, and finally adds the result to the module input pixel by pixel; the whole enhancement module also adopts a densely connected convolutional structure, i.e., each layer concatenates the inputs of all previous layers and passes its output feature map to all subsequent layers;
step S4.2, concatenate the input with the H × W × 5 output of the first residual module to obtain an H × W × 10 feature map, map it to H × W × 5 by a 1 × 1 convolution and send it to the second residual module;
step S4.3, concatenate the output of the second residual module with the output of the first residual module and the input feature map to obtain an H × W × 15 feature map, output H × W × 5 through a 1 × 1 convolution and send it to the third residual module;
step S4.4, apply a 1 × 1 convolution layer to the output for feature mapping, producing the final denoised image with edge details. The enhancement module reduces the influence of vanishing gradients, strengthens the propagation of detail features, and fuses edge details with the denoised smooth image more effectively, as in the sketch below.
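A sketch of the densely connected enhancement module of S4.1 to S4.4; the 3-channel output of the final 1 × 1 convolution (an RGB result) is an assumption:

```python
import tensorflow as tf

def residual_block(x):
    h = tf.keras.layers.Conv2D(5, 3, padding='same', activation='relu')(x)  # 3x3 conv + ReLU
    h = tf.keras.layers.Conv2D(5, 3, padding='same')(h)                     # second 3x3 conv
    return x + h                                                            # pixel-wise addition

def enhancement_module(x, out_channels=3):
    """x: the H x W x 5 input described in step S4."""
    r1 = residual_block(x)                                          # S4.1
    d1 = tf.keras.layers.Conv2D(5, 1)(tf.concat([x, r1], -1))       # S4.2: 10 -> 5 channels
    r2 = residual_block(d1)
    d2 = tf.keras.layers.Conv2D(5, 1)(tf.concat([x, r1, r2], -1))   # S4.3: 15 -> 5 channels
    r3 = residual_block(d2)
    return tf.keras.layers.Conv2D(out_channels, 1)(r3)              # S4.4: final 1x1 mapping
```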
The denoised clear image is obtained through the above steps.
Finally, we trained the network with the mean absolute error loss function (the L1 loss) as the objective, and evaluated the network performance using PSNR (peak signal-to-noise ratio) and SSIM (structural similarity index); a training and evaluation sketch follows the test description below. The specifics are as follows:
and (3) testing environment: python 3.6; a TensorFlow frame; ubuntu16.04 system; NVIDIA GTX 1080ti GPU
And (3) testing sequence: the selected Dataset is the Darmstadt Noise Dataset (DND) used for true image denoising, containing 50 pairs of ultra-high resolution true Noise-noiseless image pairs.
The test method comprises the following steps: in order to ensure fairness, a target noiseless image of the data set is not disclosed externally, a participant submits an image denoising result to an online, and an online system calculates scores uniformly and quantifies a test effect.
Testing indexes are as follows: the invention uses indexes such as PSNR, SSIM, single and batch image processing time and the like to evaluate. The index data are calculated by different algorithms which are popular at present, and then result comparison is carried out, so that the method can obtain better results in the field of real image denoising.
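A minimal sketch of the L1 training objective and the PSNR/SSIM evaluation named above; model, optimizer and the data pipeline are placeholders:

```python
import tensorflow as tf

def l1_loss(clean, pred):
    return tf.reduce_mean(tf.abs(clean - pred))   # mean absolute error (L1)

def train_step(model, optimizer, noisy, clean):
    with tf.GradientTape() as tape:
        loss = l1_loss(clean, model(noisy, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

def evaluate(model, noisy, clean):
    """Images assumed to be float tensors in [0, 1]."""
    pred = tf.clip_by_value(model(noisy, training=False), 0.0, 1.0)
    psnr = tf.reduce_mean(tf.image.psnr(clean, pred, max_val=1.0))
    ssim = tf.reduce_mean(tf.image.ssim(clean, pred, max_val=1.0))
    return psnr, ssim
```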
Matters not described in detail in this specification belong to the prior art known to those skilled in the art.
It should be emphasized that the embodiments described herein are illustrative rather than restrictive, and thus the present invention is not limited to the embodiments described in the detailed description, but also includes other embodiments that can be derived from the technical solutions of the present invention by those skilled in the art.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.