CN111127331A - Image denoising method based on pixel-level global noise estimation coding and decoding network - Google Patents

Image denoising method based on pixel-level global noise estimation coding and decoding network

Info

Publication number
CN111127331A
CN111127331A (application CN201911005554.XA; granted as CN111127331B)
Authority
CN
China
Prior art keywords
coding
pixel
output
decoding
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911005554.XA
Other languages
Chinese (zh)
Other versions
CN111127331B (en)
Inventor
唐鹏靓
鞠国栋
沈良恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qidi Yuanjing Shenzhen Technology Co ltd
Original Assignee
Guangdong Qidi Tuwei Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Qidi Tuwei Technology Co ltd filed Critical Guangdong Qidi Tuwei Technology Co ltd
Priority to CN201911005554.XA priority Critical patent/CN111127331B/en
Publication of CN111127331A publication Critical patent/CN111127331A/en
Application granted granted Critical
Publication of CN111127331B publication Critical patent/CN111127331B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/70 Denoising; Smoothing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention relates to an image denoising method based on a pixel-level global noise estimation coding and decoding network, which comprises the following steps: an original noisy picture is input into the input module of the coding network, preliminary feature extraction is performed on it by convolution, and an original feature map is output; the original feature map is processed by several cascaded coding modules in the coding network, which output a denoised high-level feature map of smaller spatial size and higher semantic level; the high-level feature map is processed by several skip-connected decoding modules in a decoding network symmetric to the coding network, yielding a denoised output feature map that retains both spatial and semantic information; finally, the output feature map passes through the output module of the decoding network, where convolution maps it to the output dimensions and the final clean image is produced. The method fully considers the characteristics of real image noise, namely its global structure and its correlation with pixel values, while balancing denoising effect and running speed.

Description

Image denoising method based on pixel-level global noise estimation coding and decoding network
Technical Field
The invention relates to the technical field of computer vision images, in particular to an image denoising method based on a pixel-level global noise estimation coding and decoding network.
Background
As important information carriers, digital images play an increasingly important role in daily production and life. Because image information is intuitive, vivid, and rich in content, it is widely applied in aerospace, industrial production, military, medical, communication, and other fields, and is closely tied to people's work and lives. However, during digitization and transmission, digital images are often degraded by interference from imaging equipment and external environmental noise, which reduces image quality and complicates subsequent research and processing. Image denoising is therefore a basic and very important low-level computer vision task with high scientific value and practical significance.
Image denoising is the technique of removing the noise introduced while an image is acquired so as to recover the original clean image; it is the principal software-side means of addressing the image noise problem. As an important low-level computer vision task, it provides key technical support for computers to better observe, analyze, and process pictures, and it has great application value in many fields such as medical imaging, satellite imaging, and surveillance systems.
The best current traditional image denoising algorithms include non-local self-similarity methods, sparse coding, and block-matching models. Their common drawback is that they involve complex optimization steps at inference time, which makes them costly in time; moreover, they expose many tunable parameters, which hinders quick use. With the development of deep learning, more end-to-end deep neural networks have been used for image denoising: the deep denoising convolutional neural network (DnCNN) introduces residual connections and batch normalization to reduce the difficulty of training deep networks and achieves a strong Gaussian-noise removal effect; the very deep residual encoder-decoder network (REDNet) uses a fully convolutional coding and decoding network and introduces a symmetric skip structure to balance spatial information and receptive-field information. The strong learning capability and end-to-end simplicity of neural networks greatly improve the denoising effect and reduce time consumption. However, these deep learning methods target synthetic images with artificially added Gaussian noise, which is independent of pixel values; they cannot account for the fact that real-world image noise, in addition to its Gaussian character, is correlated with pixel values. Consequently, their generalization to real-world noisy images is poor and their results are rather limited.
Disclosure of Invention
In view of the above, it is necessary to provide an image denoising method based on a pixel-level global noise estimation coding and decoding network, one that fully considers the characteristics of real image noise (its global structure and its correlation with pixel values) while balancing denoising effect against running speed.
An image denoising method based on a pixel level global noise estimation coding and decoding network comprises the following steps:
step 1, inputting an original noisy picture into an input module of a coding network, performing primary feature extraction on the original noisy picture by using convolution, and outputting an original feature map;
step 2, processing the original feature map by a plurality of cascaded coding modules in the coding network, wherein each coding module consists of a down-sampling layer, a convolution layer and a pixel level global noise estimation submodule, the down-sampling layer and the convolution layer carry out high-level feature extraction on the original feature map, the pixel level global noise estimation submodule carries out noise estimation and removal on each level of features, and finally, the denoised high-level feature map with smaller space size and higher semantic level is output;
step 3, processing the high-level feature map by a plurality of decoding modules with a skip connection structure in a decoding network symmetrical to the coding network, wherein each decoding module consists of an up-sampling layer and convolution layers; the spatial information of earlier layers is combined with the high-level feature information output by the coding modules, finally yielding a denoised output feature map that retains both spatial and semantic information;
and 4, mapping the output characteristic graph of the decoding network to the output characteristic dimension by using convolution processing through an output module of the decoding network, and outputting a final clear image.
Specifically, in step 1, the original noisy picture is input into the input module of the coding network, feature extraction by four cascaded 3 × 3 × 32 convolutions is performed, and an original feature map is output.
The convolution kernel output channel of the output module of the decoding network is consistent with the number of the channels of the original noisy picture input by the input module.
The coding network comprises four cascaded coding modules, and each coding module consists of a down-sampling layer, four convolution layers and two pixel level global noise estimation sub-modules.
The specific method for processing the input characteristics by the encoding module in the step 2 comprises the following steps:
(1) At the beginning of each coding module, maximum pooling reduces the size of the input features to 1/2, performing spatial feature fusion, enlarging the receptive field of the convolutional network, and allowing more semantic information to be extracted;
(2) two cascaded convolutional layers extract the feature information after the size reduction and double the number of feature channels to extract more features; in the four cascaded coding modules, the convolution kernels used are 3 × 3 × 64, 3 × 3 × 128, 3 × 3 × 256, and 3 × 3 × 512 in sequence, and kernels with the same number of channels are used within a module;
(3) a pixel-level global noise estimation submodule extracts the pixel-level global noise information contained in the features, and the extracted noise information is removed through a residual connection inside the submodule;
(4) a convolution layer integrates and buffers the once-denoised features in preparation for further denoising; in the four cascaded coding modules, the convolution kernels used are 3 × 3 × 64, 3 × 3 × 128, 3 × 3 × 256, and 3 × 3 × 512 in sequence;
(5) a second pixel-level global noise estimation submodule further extracts the pixel-level global noise information contained in the features and removes it through a residual connection inside the submodule;
(6) finally, the module uses a convolution layer to integrate and buffer the twice-denoised features in preparation for the next module; in the four cascaded coding modules, the convolution kernels used are 3 × 3 × 64, 3 × 3 × 128, 3 × 3 × 256, and 3 × 3 × 512 in sequence, so the numbers of channels output by the four cascaded coding modules are 64, 128, 256, and 512 in sequence.
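The size and channel bookkeeping of the six steps above can be sketched as follows (a hypothetical Python helper for illustration; the function and argument names are not from the patent, only the numbers are):

```python
# Trace the feature-map shapes through the four cascaded coding modules.
# Each module halves the spatial size with max pooling and its
# convolutions raise the channel count to 64, 128, 256, 512 in turn.

def encoder_shapes(h, w):
    """Return the (H, W, C) output shape of each of the four coding modules."""
    shapes = []
    for c_out in [64, 128, 256, 512]:
        h, w = h // 2, w // 2         # max pooling: size -> 1/2
        shapes.append((h, w, c_out))  # convolutions set the channel count
    return shapes

print(encoder_shapes(256, 256))
# [(128, 128, 64), (64, 64, 128), (32, 32, 256), (16, 16, 512)]
```

For a 256 × 256 input picture, the deepest feature map is thus 16 × 16 × 512, matching the "smaller spatial size, higher semantic level" description.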
The specific method by which the pixel-level global noise estimation submodule of steps (3) and (5) processes the input features comprises the following steps:
(1) The input of the pixel-level global noise estimation submodule is a feature map of height H, width W, and channel number C; the height and width depend on the size of the noisy image input to the network and on the number of down-sampling operations, and the channel number matches the number of feature channels inside the coding module in which the submodule sits (64, 128, 256, and 512 for the four coding modules, respectively). The input feature first enters the first branch and is output directly as a residual branch, providing the feature map before denoising;
(2) the input feature also enters the second branch, which fuses global channel information: the H × W × C feature map passes through two cascaded 1 × 1 convolution layers that fuse the information of the C channels into a single channel of global channel information while preserving the spatial layout, giving a feature map of dimension H × W × 1, which is then reshaped to HW × 1 and output as the feature fused with global channel information;
(3) the input feature also enters the third branch, which fuses global spatial information: a global mean pooling layer fuses the spatial information of the H × W × C feature map into single-point information while preserving the channel information, giving a feature map of dimension 1 × C; a fully connected layer of dimension C/4 followed by one of dimension C then corrects the channel features, the output size remaining 1 × C, giving the feature output fused with global spatial information;
(4) the HW × 1 global-channel fusion feature and the 1 × C global-spatial fusion feature are matrix-multiplied to give an HW × C feature map, which fuses the global characteristics of channel and space while retaining partial pixel-level spatial and channel information during the computation, completing the pixel-level global noise estimation; the noise estimate is then reshaped to H × W × C so that it matches the input feature size, and finally a 1 × 1 convolution layer performs feature mapping to produce the output noise estimation features;
(5) the output pixel-level global noise estimate and the residual branch are combined pixel by pixel, removing the estimated noise value from the noisy features and giving the final output of the pixel-level global noise estimation submodule.
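The shape flow of the five steps above can be traced with a rough NumPy sketch. This is an illustration only: random matrices stand in for the learned 1 × 1 convolutions and fully connected layers, and a plain subtraction stands in for the residual combination by which the patent removes the estimate; only the branch shapes and the matrix product follow the description.

```python
import numpy as np

def pixel_global_noise_estimate(x, rng):
    """Shape-level sketch of the pixel-level global noise estimation
    submodule for an input feature map x of shape (H, W, C)."""
    h, w, c = x.shape

    # Branch 2: fuse the C channels into one, keeping spatial layout,
    # then reshape H x W x 1 -> (HW, 1).
    w1 = rng.standard_normal((c, 1))              # stand-in for 1x1 convs
    channel_branch = x.reshape(h * w, c) @ w1     # (HW, 1)

    # Branch 3: global mean pooling -> (1, C), then FC layers C/4 -> C.
    pooled = x.mean(axis=(0, 1)).reshape(1, c)    # (1, C)
    w2 = rng.standard_normal((c, c // 4))         # stand-in for FC C/4
    w3 = rng.standard_normal((c // 4, c))         # stand-in for FC C
    spatial_branch = (pooled @ w2) @ w3           # (1, C)

    # Matrix product (HW, 1) x (1, C) -> (HW, C), back to (H, W, C).
    noise = (channel_branch @ spatial_branch).reshape(h, w, c)

    # Residual branch: remove the estimated noise from the input.
    return x - noise

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 64))   # a feature map inside coding module 1
y = pixel_global_noise_estimate(x, rng)
print(y.shape)  # (8, 8, 64): output matches the input feature size
```

Note how the outer product of an HW × 1 column and a 1 × C row yields a full HW × C estimate, which is what lets the module assign a distinct noise value to every pixel and channel while each branch remains a cheap bottleneck.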
The decoding network comprises four decoding modules, and each decoding module consists of a bilinear interpolation upsampling layer and four convolution layers.
The specific method for processing the input characteristics by the decoding module in the step 3 comprises the following steps:
a. at the beginning of each decoding module, for an input feature map, the spatial size of the input feature map is up-sampled to 2 times of the original spatial size by using a bilinear interpolation method, and the number of channels is unchanged so as to gradually recover the size of an input original image;
b. feature outputs from the input module and the coding modules whose spatial size equals that of the bilinear-interpolation output in the current decoding module are taken and concatenated with the current bilinear-interpolation output along the channel dimension; the numbers of channels output by the input module and the first three cascaded coding modules are 32, 64, 128, and 256 in sequence, and the numbers of channels entering the four cascaded decoding modules are 256, 128, 64, and 32 respectively, so the concatenated channel dimensions become 512, 256, 128, and 64, which serve in sequence as the inputs of the subsequent convolution layers of the decoding network, making combined use of the semantic information of the denoised feature map and the spatial information of the shallow feature map;
c. the concatenated feature maps with 512, 256, 128, and 64 channels are fed in sequence into the decoding convolution layers of the four cascaded decoding modules for feature fusion, merging the spatial information of the shallow feature maps with the deep denoised semantic information to obtain a large-spatial-size, noise-free feature map; each decoding module's convolution stage consists of four convolutional layers with identical kernels, and the kernels used in the four cascaded decoding modules are 3 × 3 × 256, 3 × 3 × 128, 3 × 3 × 64, and 3 × 3 × 32 in sequence, so the numbers of channels output by the four cascaded decoding modules are 256, 128, 64, and 32 in sequence.
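The channel arithmetic of the skip connections in steps a to c can be checked with a small sketch (hypothetical helper name; all numbers come from the text above):

```python
# Verify the concatenated channel counts of the four decoding modules:
# each decoder input is joined with the symmetric encoder/input-module
# feature map of the same spatial size.

def decoder_channels():
    skip = [256, 128, 64, 32]  # from coding modules 3, 2, 1 and the input module
    up = [256, 128, 64, 32]    # channels entering each decoding module
    out = [256, 128, 64, 32]   # channels after each module's convolutions
    concat = [s + u for s, u in zip(skip, up)]
    return concat, out

concat, out = decoder_channels()
print(concat)  # [512, 256, 128, 64]
print(out)     # [256, 128, 64, 32]
```

Each concatenation exactly doubles the channel count, and the module's convolutions then halve it again, which keeps the decoder symmetric to the encoder.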
The invention has the advantages and positive effects that:
1. The invention inserts a brand-new pixel-level global noise estimation module into a coding and decoding denoising network. Addressing the Gaussian character of real image noise, the module's two branches fuse the noise information of the input features across the global channel dimension and the global spatial dimension respectively; the fusion also accounts for the correlation of real image noise with individual pixel values, preserving the corresponding spatial and channel information while channel and spatial information are fused. The two branches are finally combined into a pixel-level global noise estimate, and the noise information is removed through a residual branch, yielding a better denoising effect.
2. The invention has a reasonable design: the pixel-level global noise estimation submodule adopts a multi-branch, multi-dimensional fusion scheme, and each branch adopts a bottleneck structure, improving the denoising effect while controlling the number of network parameters. The invention uses a coding and decoding network as the backbone, whose output is the denoised clean image. The network is trained on pairs of actually captured noisy and clean images with the mean absolute loss function as the objective, and the denoising effect is evaluated by comparing the output image with the true clean image; the method improves on the denoising effect of a plain backbone network while keeping the running time of the algorithm in check.
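The mean absolute loss mentioned above is the standard L1 objective between the network output and the captured clean image; a minimal sketch (illustrative only, not the training code):

```python
import numpy as np

def mean_absolute_loss(pred, target):
    """Mean absolute (L1) loss between a network output and the
    ground-truth clean image, the training objective described above."""
    return float(np.mean(np.abs(pred - target)))

pred = np.array([0.2, 0.5, 0.9])    # toy "denoised" pixel values
target = np.array([0.0, 0.5, 1.0])  # toy ground-truth pixel values
print(mean_absolute_loss(pred, target))  # (0.2 + 0.0 + 0.1) / 3 = 0.1
```

Compared with a squared-error objective, the L1 loss penalizes large residuals less aggressively, which is a common choice for real-noise image restoration.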
Drawings
FIG. 1 is a backbone framework diagram of a pixel level global noise estimation codec network of the present invention;
FIG. 2 is a block diagram of the pixel level global noise estimation sub-module of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below. It should be noted that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments, and all other embodiments obtained by those skilled in the art without any inventive work based on the embodiments of the present invention belong to the protection scope of the present invention.
The image denoising method based on the pixel level global noise estimation coding and decoding network provided by the embodiment of the invention, as shown in fig. 1 and fig. 2, comprises the following steps:
and step S1, inputting the original noisy picture into an input module of the coding network, wherein the input module consists of four convolution layers of 3 multiplied by 32, performing primary feature extraction on the noisy picture, and outputting an original feature map.
And step S2, sequentially inputting the original feature map into four cascaded coding modules, wherein each coding module consists of a down-sampling layer, four convolutional layers and two pixel-level global noise estimation sub-modules, the coding module performs high-level feature extraction on the input original feature map through the plurality of convolutional layers and the down-sampling layer, performs noise estimation and removal on each level of features through the pixel-level global noise estimation sub-modules, and finally outputs the feature map with smaller space size and higher semantic level after denoising.
The specific implementation method of step S2 is as follows:
and S2.1, at the beginning of each coding module, reducing the size of the input features to 1/2 of the original size by using maximum pooling, performing spatial feature fusion, improving the receptive field of the convolutional network, and extracting more semantic information.
In step S2.2, two cascaded convolutional layers extract the feature information after the size reduction and double the number of feature channels to extract more features; in the four cascaded coding modules, the convolution kernels used are 3 × 3 × 64, 3 × 3 × 128, 3 × 3 × 256, and 3 × 3 × 512 in sequence, and kernels with the same number of channels are used within a module.
And S2.3, extracting pixel level global noise information contained in the characteristics through a pixel level global noise estimation submodule, and removing the extracted noise information through residual connection in the submodule.
The specific implementation method of step S2.3 is as follows:
In step S2.3.1, a feature map of height H, width W, and channel number C is input into the pixel-level global noise estimation submodule. The height and width depend on the size of the noisy image input to the network and on the number of down-sampling operations; the channel number matches the number of feature channels inside the coding module in which the submodule sits, the internal channel numbers of the four coding modules being 64, 128, 256, and 512 respectively. The first branch of the submodule outputs the input directly as a residual branch, providing the feature map before denoising.
In step S2.3.2, the input feature also enters the second branch, which fuses global channel information: the H × W × C feature map passes through two cascaded 1 × 1 convolution layers that fuse the information of the C channels into a single channel of global channel information while preserving the spatial layout, giving an H × W × 1 feature map that is then reshaped to HW × 1 to obtain the feature output fused with global channel information.
In step S2.3.3, the input feature also enters the third branch, which fuses global spatial information: a global mean pooling layer fuses the spatial information of the H × W × C feature map into single-point information while preserving the channel information, giving a 1 × C feature map; a fully connected layer of dimension C/4 followed by one of dimension C then corrects the channel features, the output remaining 1 × C, to obtain the feature output fused with global spatial information.
In step S2.3.4, the HW × 1 global-channel fusion feature and the 1 × C global-spatial fusion feature are matrix-multiplied to give an HW × C feature map, which fuses the global characteristics of channel and space while retaining partial pixel-level spatial and channel information during the computation, completing the pixel-level global noise estimation; the noise estimate is then reshaped to H × W × C so that it matches the input feature size, and finally a 1 × 1 convolution layer performs feature mapping to produce the output noise estimation features.
In step S2.3.5, the output pixel-level global noise estimate and the residual branch are combined pixel by pixel, removing the estimated noise value and giving the final output of the pixel-level global noise estimation submodule.
And S2.4, integrating and buffering the characteristics subjected to primary denoising by using a convolution layer to prepare for further denoising, wherein in the four cascaded coding modules, the convolution kernels are respectively 3 × 3 × 64, 3 × 3 × 128, 3 × 3 × 256 and 3 × 3 × 512 in sequence.
And S2.5, further extracting pixel level global noise information contained in the features through a pixel level global noise estimation module, and removing the extracted noise information through residual connection in the module.
In step S2.6, the module uses a convolution layer to integrate and buffer the twice-denoised features in preparation for the next module; in the four cascaded coding modules, the convolution kernels used are 3 × 3 × 64, 3 × 3 × 128, 3 × 3 × 256, and 3 × 3 × 512 in sequence, so the numbers of channels output by the four cascaded coding modules are 64, 128, 256, and 512 in sequence.
In step S3, the high-level feature map output in step S2 is input into a decoding network symmetric to the coding network, composed of four decoding modules, each consisting of bilinear-interpolation up-sampling and four convolution layers. Each module uses a skip connection structure to combine the rich spatial information of earlier layers with the high-level feature information output by the coding modules, finally obtaining a denoised feature map that retains both spatial and semantic information.
The specific implementation method of step S3 is as follows:
and S3.1, at the beginning of each decoding module, for the input feature map, using a bilinear interpolation method to up-sample the space size of the input feature map to 2 times of the original space size, wherein the number of channels is unchanged, and the bilinear interpolation method is used for gradually recovering the size of the input original image.
In step S3.2, the feature outputs of the input module and the coding modules whose spatial size equals that of the bilinear-interpolation output in the current decoding module (namely coding module 3 with decoding module 1, coding module 2 with decoding module 2, coding module 1 with decoding module 3, and the input module with decoding module 4) are taken and concatenated with the current bilinear-interpolation output along the channel dimension. The numbers of channels output by the input module and the first three cascaded coding modules are 32, 64, 128, and 256 in sequence; symmetrically, the numbers of channels entering the four cascaded decoding modules are 256, 128, 64, and 32 respectively, so the concatenated channel dimensions become 512, 256, 128, and 64 in sequence, serving as the inputs of the subsequent convolution layers of the decoding network and making combined use of the semantic information of the denoised feature map and the spatial information of the shallow feature map.
In step S3.3, the concatenated feature maps with 512, 256, 128, and 64 channels are fed in sequence into the decoding convolution layers of the four cascaded decoding modules for feature fusion, merging the spatial information of the shallow feature maps with the deep denoised semantic information to obtain a large-spatial-size, noise-free feature map; each decoding convolution stage consists of four convolutional layers with identical kernels, and the kernels used in the four cascaded decoding modules are 3 × 3 × 256, 3 × 3 × 128, 3 × 3 × 64, and 3 × 3 × 32 in sequence, so the numbers of channels output by the four cascaded decoding modules are 256, 128, 64, and 32 in sequence.
And step S4, mapping the output characteristic diagram of the decoding network to the output characteristic dimension through the convolution processing of the output module, and outputting the final clear image, wherein the output channel number of the convolution kernel is consistent with the channel number of the input original image of the input module.
The de-noised clear image can be obtained through the steps.
Finally, the network is trained with the mean absolute loss function as the objective, and its performance is evaluated using PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity Index), and running time, as follows:
and (3) testing environment: a tensoflow frame; ubuntu16.04 system; NVIDIA GTX 1080ti GPU
Test data: the selected dataset is the Darmstadt Noise Dataset (DND), containing 50 very-high-resolution pairs of real noisy and noise-free images.
Test method: to ensure fairness, the noise-free images of the dataset are not publicly released; every test is performed by the tested party submitting its processing results for the noisy images through an online system, which computes all results uniformly. During testing, the raw format is used as the network input; the computed clean image is saved in .mat format and submitted, and the system evaluates the denoising effect both on the raw output and after conversion to the standard color space (sRGB).
Test metrics: the invention is evaluated with PSNR, SSIM, running time, and similar indexes. The same metrics are computed for currently popular algorithms and the results are compared, showing that the method achieves strong results in real-image denoising.
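Of the metrics above, PSNR has a simple closed form; a minimal sketch follows (assuming images normalized to [0, 1]; this is not the DND evaluation system's exact code):

```python
import numpy as np

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE).
    Higher is better; identical images would give infinite PSNR."""
    mse = np.mean((reference - estimate) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.zeros((4, 4))
est = ref + 0.01  # uniform error of 0.01 -> MSE = 1e-4
print(round(psnr(ref, est), 1))  # 40.0
```

SSIM, by contrast, compares local luminance, contrast, and structure statistics, so it is usually computed with a library implementation rather than a one-liner.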
The test results were as follows:
(Comparison table of test results reproduced as an image in the original filing; numerical values are not recoverable here.)
As the comparison data show, the method outperforms all the other methods in denoising effect, and in practical tests its processing speed, measured by running time, also exceeds that of the other methods. Taken together, the method strikes a good balance between image denoising quality and running speed, achieving both a higher denoising level and a faster running speed.
The above embodiments express only several embodiments of the present invention, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the inventive concept, and these fall within the protection scope of the invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (8)

1. An image denoising method based on a pixel-level global noise estimation coding and decoding network is characterized by comprising the following steps:
step 1, inputting an original noisy picture into an input module of a coding network, performing primary feature extraction on the original noisy picture by using convolution, and outputting an original feature map;
step 2, processing the original feature map by a plurality of cascaded coding modules in the coding network, wherein each coding module consists of a down-sampling layer, a convolution layer, and a pixel-level global noise estimation submodule; the down-sampling layer and the convolution layer perform high-level feature extraction on the original feature map, the pixel-level global noise estimation submodule estimates and removes noise at each level of features, and finally a denoised high-level feature map with smaller spatial size and higher semantic level is output;
step 3, processing the high-level feature map by a plurality of decoding modules with a skip-connection structure in a decoding network symmetrical to the coding network, wherein each decoding module consists of an up-sampling layer and a convolution layer; the spatial information of the shallow layers is combined with the high-level feature information output by the coding modules, finally obtaining a denoised output feature map that takes both spatial information and semantic information into account;
and step 4, mapping the output feature map of the decoding network to the output feature dimension by convolution in an output module of the decoding network, and outputting the final clean image.
2. The image denoising method based on the pixel-level global noise estimation coding-decoding network as claimed in claim 1, wherein in step 1, the original noisy picture is input into the input module of the coding network, feature extraction by four 3 × 3 × 32 convolutions is performed, and the original feature map is output.
3. The image denoising method based on the pixel-level global noise estimation coding-decoding network of claim 2, wherein the number of the convolution kernel output channels of the output module of the decoding network is consistent with the number of the channels of the input module for inputting the original noisy picture.
4. The image denoising method based on the pixel-level global noise estimation coding-decoding network according to any one of claims 1 to 3, wherein the coding network comprises four cascaded coding modules, and each coding module is composed of a down-sampling layer, four convolutional layers and two pixel-level global noise estimation sub-modules.
5. The image denoising method based on the pixel-level global noise estimation coding-decoding network according to claim 4, wherein the specific method for processing the input features by the coding module of step 2 comprises the following steps:
step ⑴, at the beginning of each coding module, maximum pooling is used to reduce the spatial size of the input features to 1/2, performing spatial feature fusion, enlarging the receptive field of the convolutional network, and extracting more semantic information;
step ⑵, two cascaded convolution layers are used to extract feature information after the size reduction, and the number of feature channels is doubled so as to extract more features; in the four cascaded coding modules, the convolution kernels used are 3 × 3 × 64, 3 × 3 × 128, 3 × 3 × 256, and 3 × 3 × 512 in sequence, and convolution kernels with the same number of channels are used within a module;
step ⑶, extracting pixel-level global noise information contained in the features through a pixel-level global noise estimation submodule, and removing the extracted noise information through residual connection in the submodule;
step ⑷, a convolution layer is used to integrate and buffer the features after the first denoising, in preparation for further denoising; in the four cascaded coding modules, the convolution kernels used are 3 × 3 × 64, 3 × 3 × 128, 3 × 3 × 256, and 3 × 3 × 512 in sequence;
step ⑸, further extracting pixel level global noise information contained in the features through a pixel level global noise estimation submodule, and removing the extracted noise information through residual connection in the submodule;
step ⑹, finally, a convolution layer is used to integrate and buffer the features after the second denoising, in preparation for entering the next module; in the four cascaded coding modules, the convolution kernels used are 3 × 3 × 64, 3 × 3 × 128, 3 × 3 × 256, and 3 × 3 × 512 in sequence, so the numbers of channels output by the four cascaded coding modules are 64, 128, 256, and 512 in sequence.
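The channel and spatial-size bookkeeping of steps ⑴ to ⑹ can be sketched as follows. This is an illustrative sketch only; the function name and the example input size are assumptions, and only the shapes are modeled, not the convolutions or noise estimation themselves.

```python
def encoder_shapes(h, w, c=32, num_modules=4):
    """Track (H, W, C) through the cascaded coding modules: each module
    halves the spatial size via max pooling and doubles the channel count,
    giving 64, 128, 256, 512 channels for a 32-channel input feature map."""
    shapes = []
    for _ in range(num_modules):
        h, w, c = h // 2, w // 2, c * 2  # 1/2 pooling, channel doubling
        shapes.append((h, w, c))
    return shapes
```

For a 256 × 256 feature map from the 32-channel input module, this yields (128, 128, 64), (64, 64, 128), (32, 32, 256), and (16, 16, 512), matching the channel counts stated in the claim.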
6. The image denoising method based on the pixel-level global noise estimation coding-decoding network of claim 5, wherein the specific method of processing the input features by the pixel-level global noise estimation sub-module of steps ⑶ and ⑸ comprises the following steps:
① the input of the pixel-level global noise estimation submodule is a feature map with height H, width W, and channel number C; the height and width depend on the size of the noisy image input to the network and the number of down-sampling operations, and the channel number is consistent with the feature channel number inside the coding module where the current submodule is located (the internal channel numbers of the four coding modules are 64, 128, 256, and 512 respectively); the input feature first enters the first branch and is output directly as a residual branch, providing the feature map before denoising;
② the input features enter the second branch, the global-channel-information fusion branch: the feature map with dimension H × W × C passes through two cascaded 1 × 1 convolution layers, which fuse the information of the C channels into the global channel information of a single channel while preserving the spatial information of the features, yielding a feature map with dimension H × W × 1; the feature is then reshaped to HW × 1 and output as the feature fused with global channel information;
③ the input features enter the third branch, the global-spatial-information fusion branch: the spatial information of the feature with dimension H × W × C is fused into single-point information through a global average pooling layer while the channel information of the features is preserved, yielding a feature map with dimension 1 × C; a fully connected layer of dimension C/4 and a fully connected layer of dimension C are then applied in sequence to recalibrate the channel features, the output size still being 1 × C, yielding the feature output fused with global spatial information;
④ the HW × 1 feature fused with global channel information and the 1 × C feature fused with global spatial information are matrix-multiplied to obtain an HW × C feature map, which not only fuses the global features of channel and space but also retains part of the pixel-level spatial and channel information during the calculation, completing the pixel-level global noise estimation; the noise estimate is then reshaped to H × W × C, consistent with the input feature size, and finally a 1 × 1 convolution layer is used for feature mapping to obtain and output the noise estimation features;
⑤ the output pixel-level global noise estimate and the residual branch are combined pixel by pixel through the residual connection, so that the estimated noise value is removed from the noisy feature map, yielding the final output of the pixel-level global noise estimation submodule.
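The shape algebra of steps ① to ⑤ can be sketched with a toy example. This is a sketch only, assuming NumPy; it reproduces just the HW × 1 by 1 × C matrix product and reshape of step ④, not the convolutions, pooling, or fully connected layers, and the function names are assumptions.

```python
import numpy as np

def fuse_global_features(channel_feat, spatial_feat):
    """Matrix product of the two branch outputs (claim 6, step 4):
    channel_feat is HW x 1 (global channel info, spatial layout kept),
    spatial_feat is 1 x C (global spatial info, channel layout kept);
    their product is an HW x C pixel-level noise estimate."""
    assert channel_feat.shape[1] == 1 and spatial_feat.shape[0] == 1
    return channel_feat @ spatial_feat  # (HW, 1) @ (1, C) -> (HW, C)

def reshape_noise_estimate(noise_est, h, w):
    """Reshape the HW x C estimate back to H x W x C, matching the input size."""
    hw, c = noise_est.shape
    assert hw == h * w
    return noise_est.reshape(h, w, c)
```

With H = 2, W = 3, C = 4, the product of a 6 × 1 and a 1 × 4 array is a 6 × 4 map that reshapes to 2 × 3 × 4; combining this estimate with the residual branch corresponds to step ⑤.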
7. The image denoising method based on the pixel-level global noise estimation coding-decoding network according to any one of claims 1 to 3, wherein the decoding network comprises four decoding modules, each decoding module being composed of a bilinear-interpolation up-sampling layer and four convolution layers.
8. The image denoising method based on the pixel-level global noise estimation coding-decoding network according to claim 7, wherein the specific method for processing the input features by the decoding module of step 3 comprises the following steps:
a. at the beginning of each decoding module, for an input feature map, the spatial size of the input feature map is up-sampled to 2 times of the original spatial size by using a bilinear interpolation method, and the number of channels is unchanged so as to gradually recover the size of an input original image;
b. the feature outputs from the input module and the coding modules whose spatial size equals that of the current bilinear-interpolation output are taken and concatenated with the current bilinear-interpolation output along the channel dimension; the numbers of channels output by the input module and the first three cascaded coding modules are 32, 64, 128, and 256 in sequence, the numbers of channels input to the four cascaded decoding modules are 256, 128, 64, and 32 respectively, and the channel dimensions after concatenation become 512, 256, 128, and 64, which serve in sequence as inputs to the subsequent convolution layers of the decoding network; in this way the semantic information of the denoised feature map and the spatial information of the shallow feature map are jointly utilized;
c. the concatenated feature maps with 512, 256, 128, and 64 channels are respectively input, in sequence, into the decoding convolution layers of the four cascaded decoding modules for feature fusion, where the spatial information of the shallow feature map and the deep denoised semantic information are fused to obtain a large-spatial-size denoised feature map; each decoding module consists of four convolution layers with the same convolution kernel, and the convolution kernels used in the four cascaded decoding modules are 3 × 3 × 256, 3 × 3 × 128, 3 × 3 × 64, and 3 × 3 × 32 in sequence, so the numbers of channels output by the four cascaded decoding modules are 256, 128, 64, and 32 in sequence.
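The channel bookkeeping of steps a to c can be sketched as follows. This is illustrative only; the function and list names are assumptions, the modules are ordered deepest-first, and each up-sampled feature is paired with the same-resolution skip feature using the channel counts stated in the claim.

```python
def decoder_channel_flow():
    """Per claim 8: bilinear up-sampling keeps the channel count, concatenation
    with the same-resolution skip feature doubles it, and the decoding
    convolutions (3x3x256, 3x3x128, 3x3x64, 3x3x32) halve it again."""
    upsampled = [256, 128, 64, 32]  # channels entering each decoding module
    skips = [256, 128, 64, 32]      # matching encoder/input-module features
    concat = [u + s for u, s in zip(upsampled, skips)]
    conv_out = [c // 2 for c in concat]
    return concat, conv_out
```

This reproduces the concatenated channel counts 512, 256, 128, 64 and the output counts 256, 128, 64, 32 stated in the claim.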
CN201911005554.XA 2019-10-22 2019-10-22 Image denoising method based on pixel-level global noise estimation coding and decoding network Active CN111127331B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911005554.XA CN111127331B (en) 2019-10-22 2019-10-22 Image denoising method based on pixel-level global noise estimation coding and decoding network

Publications (2)

Publication Number Publication Date
CN111127331A true CN111127331A (en) 2020-05-08
CN111127331B CN111127331B (en) 2020-09-08

Family

ID=70495395

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911005554.XA Active CN111127331B (en) 2019-10-22 2019-10-22 Image denoising method based on pixel-level global noise estimation coding and decoding network

Country Status (1)

Country Link
CN (1) CN111127331B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738956A (en) * 2020-06-24 2020-10-02 哈尔滨工业大学 Image denoising system based on characteristic modulation
CN112801910A (en) * 2021-02-08 2021-05-14 南京邮电大学 Channel state information image denoising method and indoor positioning model
CN113052558A (en) * 2021-03-30 2021-06-29 浙江畅尔智能装备股份有限公司 Automatic piece counting system for machining parts of power transmission tower and automatic piece counting method thereof
CN113762333A (en) * 2021-07-20 2021-12-07 广东省科学院智能制造研究所 Unsupervised anomaly detection method and system based on double-flow joint density estimation
CN114567359A (en) * 2022-03-02 2022-05-31 重庆邮电大学 CSI feedback method based on multi-resolution fusion convolution feedback network in large-scale MIMO system
CN114615499A (en) * 2022-05-07 2022-06-10 北京邮电大学 Semantic optical communication system and method for image transmission

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107871306A (en) * 2016-09-26 2018-04-03 北京眼神科技有限公司 Method and device for denoising picture
US20180293710A1 (en) * 2017-04-06 2018-10-11 Pixar De-noising images using machine learning
CN108765334A (en) * 2018-05-24 2018-11-06 北京飞搜科技有限公司 A kind of image de-noising method, device and electronic equipment
CN109559281A (en) * 2017-09-26 2019-04-02 三星电子株式会社 Image denoising neural network framework and its training method
US20190304069A1 (en) * 2018-03-29 2019-10-03 Pixar Denoising monte carlo renderings using neural networks with asymmetric loss
CN110349103A (en) * 2019-07-01 2019-10-18 昆明理工大学 It is a kind of based on deep neural network and jump connection without clean label image denoising method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
RAPHAEL COUTURIER et al.: "Image Denoising Using a Deep Encoder-Decoder Network with Skip Connections", International Conference on Neural Information Processing *
SHI GUO et al.: "Toward Convolutional Blind Denoising of Real Photographs", arXiv *
XIAO-JIAO MAO et al.: "Image Restoration Using Very Deep Convolutional Encoder-Decoder Networks with Symmetric Skip Connections", arXiv *
YUDA SONG et al.: "Dynamic Residual Dense Network for Image Denoising", Sensors 2019 *
LIANG Weipeng et al.: "Image denoising based on GAN generative adversarial networks and an exploration of the denoising principle", Science & Technology Information *

Also Published As

Publication number Publication date
CN111127331B (en) 2020-09-08

Similar Documents

Publication Publication Date Title
CN111127331B (en) Image denoising method based on pixel-level global noise estimation coding and decoding network
CN112233038A (en) True image denoising method based on multi-scale fusion and edge enhancement
CN109410261B (en) Monocular image depth estimation method based on pyramid pooling module
CN111145116B (en) Sea surface rainy day image sample augmentation method based on generation of countermeasure network
CN109800710B (en) Pedestrian re-identification system and method
CN108596841B (en) Method for realizing image super-resolution and deblurring in parallel
CN110189260B (en) Image noise reduction method based on multi-scale parallel gated neural network
CN110751649B (en) Video quality evaluation method and device, electronic equipment and storage medium
CN109584170B (en) Underwater image restoration method based on convolutional neural network
CN112419151B (en) Image degradation processing method and device, storage medium and electronic equipment
CN110009700B (en) Convolutional neural network visual depth estimation method based on RGB (red, green and blue) graph and gradient graph
CN111709900A (en) High dynamic range image reconstruction method based on global feature guidance
CN113487564B (en) Double-flow time sequence self-adaptive selection video quality evaluation method for original video of user
CN115953303B (en) Multi-scale image compressed sensing reconstruction method and system combining channel attention
CN116152591B (en) Model training method, infrared small target detection method and device and electronic equipment
CN115205147A (en) Multi-scale optimization low-illumination image enhancement method based on Transformer
CN112614061A (en) Low-illumination image brightness enhancement and super-resolution method based on double-channel coder-decoder
CN112200732B (en) Video deblurring method with clear feature fusion
CN111882516B (en) Image quality evaluation method based on visual saliency and deep neural network
CN116580192A (en) RGB-D semantic segmentation method and system based on self-adaptive context awareness network
CN115526779A (en) Infrared image super-resolution reconstruction method based on dynamic attention mechanism
CN113538402B (en) Crowd counting method and system based on density estimation
CN111242068A (en) Behavior recognition method and device based on video, electronic equipment and storage medium
CN107729885B (en) Face enhancement method based on multiple residual error learning
CN113379606A (en) Face super-resolution method based on pre-training generation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240205

Address after: Building 1, Shuimu Yifang Building, No. 286 Nanguang Road, Dawangshan Community, Nantou Street, Nanshan District, Shenzhen City, Guangdong Province, 518000, 2207

Patentee after: Qidi Yuanjing (Shenzhen) Technology Co.,Ltd.

Country or region after: China

Address before: Unit 416, Tianan science and technology innovation building, Panyu energy saving science and Technology Park, 555 Panyu Avenue North, Panyu District, Guangzhou, Guangdong 510000

Patentee before: GUANGDONG QIDI TUWEI TECHNOLOGY CO.,LTD.

Country or region before: China

TR01 Transfer of patent right