CN116977208A - Low-illumination image enhancement method for double-branch fusion - Google Patents

Low-illumination image enhancement method for double-branch fusion

Info

Publication number
CN116977208A
Authority
CN
China
Prior art keywords
layer
image
branch
convolution
noise suppression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310825185.9A
Other languages
Chinese (zh)
Inventor
Wang Yan (王燕)
Su Peng (苏鹏)
Pan Xiaoying (潘晓英)
Chen Boyu (陈波宇)
Gao Yuan (高瑗)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Posts and Telecommunications
Original Assignee
Xian University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Posts and Telecommunications
Priority to CN202310825185.9A
Publication of CN116977208A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053 Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)

Abstract

A dual-branch fusion low-illumination image enhancement method acquires shallow features F_s of an input image I, the input image I being a low-illumination image; feeds the shallow features F_s into a three-layer encoder to obtain high-level features F_h, the three-layer encoder being formed by sequentially connecting one DFTB and two NSTBs; feeds the high-level features F_h into a bottleneck layer, the bottleneck layer being an NSTB; feeds the bottleneck layer output into a three-layer decoder to recover a high-resolution representation of the image, obtaining a feature map F_o, the three-layer decoder being formed by sequentially connecting two NSTBs and one DFTB; reduces the number of channels of the feature map F_o to obtain a residual image D with the same dimensions as the input image I; and adds the residual image D to the input image I element by element to obtain an enhanced high-resolution image I′. The invention solves problems such as low brightness and poor visibility, effectively suppresses noise, improves image quality, and provides convenience for application fields such as traffic safety and medical imaging.

Description

Low-illumination image enhancement method for double-branch fusion
Technical Field
The invention belongs to the field of deep-learning-based image processing, and particularly relates to a low-illumination image enhancement method based on dual-branch fusion.
Background
Low-light image enhancement is a critical problem in many application areas, such as computer vision, robotics, medical imaging, and security. Owing to environmental and equipment limitations, images captured under low-illumination conditions not only suffer from low brightness and poor visibility but are also inevitably affected by noise, often exhibiting excessive noise, blur, and insufficient contrast. Noise obscures image details, introduces artifacts, and degrades image quality; it also interferes with the results of downstream computer vision tasks such as object detection, image segmentation, and object tracking, making image analysis and recognition more difficult. Developing an effective low-illumination image enhancement method has therefore become a popular research direction in computer vision and image processing.
Over the past few decades, researchers have proposed many low-light image enhancement methods, including techniques based on histogram equalization, Retinex theory, and deep learning. However, these methods tend to ignore noise, train on datasets with manually added noise, or fail to adaptively suppress noisy regions during denoising, which over-smooths other regions of the image. They therefore have limitations in certain scenarios, such as over-enhancement and blurred details. Different regions of an image differ in brightness, noise, and visibility: where brightness is relatively low, noise severely destroys information, while other locations may still retain reasonable visibility. To better improve image quality, it is therefore both reasonable and important to enhance different regions of the image adaptively.
In recent years, deep learning methods have grown rapidly because they hold great advantages over traditional methods: they can learn richer feature representations from large amounts of data and can adaptively adjust feature weights and combinations according to task requirements, yielding more accurate and natural enhancement results. Deep learning methods usually adopt end-to-end learning, training directly from raw image data without manually designed feature extractors and rules, which reduces human intervention and labor cost and accelerates algorithm iteration and optimization. Convolutional neural networks (CNNs) and Vision Transformers have achieved impressive results in low-light image enhancement. CNNs are adept at capturing local features and modeling short-range pixel dependencies; however, the limited receptive field of convolution makes it impossible to model long-range pixel dependencies.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a dual-branch fusion low-illumination image enhancement method that effectively suppresses noise while solving problems such as low brightness and poor visibility, improves image quality, and provides convenience for application fields such as traffic safety and medical imaging.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
A dual-branch fusion low-illumination image enhancement method comprises the following steps:
step 1, acquiring shallow features F_s of an input image I, the input image I being a low-illumination image;
step 2, feeding the shallow features F_s into a three-layer encoder to obtain high-level features F_h, the three-layer encoder being formed by sequentially connecting one dual-branch fusion transformer block (DFTB) and two noise suppression transformer blocks (NSTB);
step 3, feeding the high-level features F_h into a bottleneck layer, the bottleneck layer being a noise suppression transformer block (NSTB);
step 4, feeding the bottleneck layer output into a three-layer decoder to recover a high-resolution representation of the image, obtaining a feature map F_o, the three-layer decoder being formed by sequentially connecting two noise suppression transformer blocks (NSTB) and one dual-branch fusion transformer block (DFTB);
step 5, reducing the number of channels of the feature map F_o to obtain a residual image D with the same dimensions as the input image I;
step 6, adding the residual image D to the input image I element by element to obtain an enhanced high-resolution image I′.
Compared with the prior art, the method has the advantages that the long-short distance pixel dependency relationship is respectively modeled by using the transducer and the CNN, the global characteristic and the local characteristic are extracted, the noise suppression Transformer block is constructed, the noise region is subjected to self-adaptive suppression by taking care of the attention through the priori guidance of the signal to noise ratio, and the image after self-adaptive noise suppression is cleaner and the detail is clearer. A large number of experiments show that the method provided by the invention can obtain excellent effects on 3 representative data sets.
Drawings
Fig. 1 is an overall framework diagram of the dual-branch fusion low-illumination image enhancement method of the present invention.
Fig. 2 is a structural diagram of the dual-branch fusion transformer block (DFTB) and the noise suppression transformer block (NSTB) in the present invention.
Fig. 3 is a structural diagram of the noise-suppressed attention mechanism (NSA) in the noise suppression transformer block (NSTB) of the present invention.
Fig. 4 is a structural diagram of the deep feed-forward network (DFFN) in the noise suppression transformer block (NSTB) of the present invention.
Fig. 5 is a structural diagram of the convolution residual block (CRB) in the dual-branch fusion transformer block (DFTB) of the present invention.
Fig. 6 shows qualitative results of the present invention on different datasets.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings and examples.
As shown in fig. 1, the architecture of the dual-branch fusion low-illumination image enhancement method of the present invention is mainly based on the dual-branch fusion transformer block (DFTB), the noise suppression transformer block (NSTB), and skip connections between the encoder and decoder. The invention applies the Transformer, which has been successful in natural language processing, to computer vision; the self-attention mechanism in the Transformer allows the model to gather information from any position in the sequence, so it handles long-range dependencies well. Fusing convolution with the attention mechanism can thus handle local context and long-range dependencies simultaneously and more effectively.
Specifically, the steps thereof can be described as:
Step 1, acquiring shallow features F_s of an input image I; in the present invention, the input image I is specifically a low-illumination image.
Specifically, in the embodiment of the present invention, for a given low-illumination image, i.e., the input image I, the shallow features F_s ∈ R^(C×H×W) may be extracted by a 3×3 convolution, where C is the number of channels, H is the spatial height of the input image I, and W is the spatial width of the input image I.
Step 2, shallow layer feature F s Feeding a three-layer encoder to obtain advanced feature F h The three-layer encoder is formed by sequentially connecting a double-branch fusion transformer block (DFTB) and two Noise Suppression Transformer Blocks (NSTB).
In the embodiment of the invention, each layer of the three-layer encoder is cascaded by adopting a downsampling layer, and the downsampling layer is composed of 3×3 convolution for reducing the number of channels and a pixel non-shuffling layer for reducing resolution.
Step 3, advanced feature F h The bottleneck layer is fed in, which is a Noise Suppression Transformer Block (NSTB). By this step, in advanced feature F h Global pixel dependencies are further captured on the basis of (a) to avoid over-emphasis or weak emphasis.
Step 4, feeding the bottleneck layer output into the three-layer decoder to recover the high resolution representation of the image to obtain a feature map F oThree-layer decoderThe two Noise Suppression Transformer Blocks (NSTB) and the two-branch fusion transformer block (DFTB) are sequentially connected.
In the embodiment of the invention, the layers of the three-layer decoder are cascaded by adopting an up-sampling layer, and the up-sampling layer is composed of a 3×3 convolution for increasing the number of channels and a pixel shuffling layer for improving the resolution.
By employing the cascading approach described above, the downsampling layer in the encoder reduces the spatial resolution of the input image while increasing the number of feature channels, which enables the network to capture high-level, global features of the input image, such as overall brightness and contrast. On the other hand, the upsampling layer in the decoder increases the spatial resolution of the feature map while reducing the number of feature channels. This enables the network to recover spatial information lost during downsampling and capture low-level, local features of the input image, such as fine detail and texture. The network is capable of capturing multi-scale features of an input image by concatenating an encoder of a downsampling layer and a decoder of an upsampling layer. Meanwhile, a downsampling layer is used in the encoder, and an upsampling layer is used in the decoder, so that the spatial resolution of the feature map can be reduced, and the calculation cost of subsequent layers in the network is reduced. This results in a more efficient network and faster training.
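For illustration only (not part of the patent disclosure), the following is a minimal PyTorch sketch of such cascading layers. The channel multipliers follow the dimension walkthrough given later in this description (C to 2C when downsampling, C to C/2 when upsampling); the module names and the bias-free convolutions are assumptions.

```python
import torch
import torch.nn as nn

class Downsample(nn.Module):
    """3x3 convolution that halves the channel count, then pixel-unshuffle
    halves the spatial resolution; net effect: (C, H, W) -> (2C, H/2, W/2)."""
    def __init__(self, channels: int):  # channels assumed even
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=3, padding=1, bias=False),
            nn.PixelUnshuffle(2),  # (C/2, H, W) -> (2C, H/2, W/2)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)

class Upsample(nn.Module):
    """3x3 convolution that doubles the channel count, then pixel-shuffle
    doubles the spatial resolution; net effect: (C, H, W) -> (C/2, 2H, 2W)."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels * 2, kernel_size=3, padding=1, bias=False),
            nn.PixelShuffle(2),  # (2C, H, W) -> (C/2, 2H, 2W)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)

# Example: a 48-channel map at 128x128 becomes 96 channels at 64x64.
x = torch.randn(1, 48, 128, 128)
print(Downsample(48)(x).shape)  # torch.Size([1, 96, 64, 64])
```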
In the embodiment of the invention, in order to help the recovery process, the characteristic cascade is carried out between the encoder and the decoder through jump connection, and the channel number is reduced through 1X 1 convolution after the cascade.
Step 5, reducing feature map F o The number of channels of (a) results in a residual image D of the same dimension as the input image I,
in the embodiment of the invention, feature map F o The number of channels is reduced by 1×1 convolution to obtain a residual image D.
Step 6, adding the residual image D and the input image I element by element to obtain an enhanced high-resolution image I The process can be formulated as:
I =I+D
With continued reference to fig. 1, in the three-layer encoder the shallow features F_s ∈ R^(C×H×W) are fed into the DFTB. The first path of the DFTB output is downsampled so that its feature dimensions become 2C×(H/2)×(W/2), and is then input to the first NSTB; the first path of the first NSTB output is downsampled so that its feature dimensions become 4C×(H/4)×(W/4), and is then input to the second NSTB; the first path of the second NSTB output is downsampled to yield the high-level features F_h with dimensions 8C×(H/8)×(W/8), which enter the bottleneck layer.
In the three-layer decoder, the bottleneck output is upsampled so that its feature dimensions become 4C×(H/4)×(W/4); it is concatenated via a skip connection with the second path of the second NSTB output of the three-layer encoder, after which the feature dimensions become 8C×(H/4)×(W/4); a 1×1 convolution reduces the number of channels so that the dimensions become 4C×(H/4)×(W/4), and the result is fed into the first NSTB of the three-layer decoder.
The output of the first NSTB of the three-layer decoder is upsampled so that its feature dimensions become 2C×(H/2)×(W/2); it is concatenated via a skip connection with the second path of the first NSTB output of the three-layer encoder, after which the feature dimensions become 4C×(H/2)×(W/2); a 1×1 convolution reduces the number of channels so that the dimensions become 2C×(H/2)×(W/2), and the result is fed into the second NSTB of the three-layer decoder.
The output of the second NSTB of the three-layer decoder becomes C×H×W after upsampling; after a skip connection with the second path of the DFTB of the three-layer encoder, the feature dimensions become 2C×H×W; a 1×1 convolution reduces the number of channels to C×H×W, and the result is fed into the DFTB of the three-layer decoder, whose output is the feature map F_o.
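For illustration only, the whole encoder, bottleneck, and decoder flow just described can be sketched as follows, assuming the Downsample/Upsample modules sketched earlier and the DFTB/NSTB sketches given later in this description; the base width c = 48 and the explicit SNR-map input are assumptions made for concreteness.

```python
import torch
import torch.nn as nn

class DualBranchEnhancer(nn.Module):
    """Schematic of steps 1-6; DFTB, NSTB, Downsample, Upsample are the
    sketches given elsewhere in this description, not the patented code."""
    def __init__(self, in_ch: int = 3, c: int = 48):
        super().__init__()
        self.shallow = nn.Conv2d(in_ch, c, 3, padding=1)                 # step 1
        self.enc = nn.ModuleList([DFTB(c), NSTB(2 * c), NSTB(4 * c)])    # step 2
        self.down = nn.ModuleList([Downsample(c), Downsample(2 * c), Downsample(4 * c)])
        self.bottleneck = NSTB(8 * c)                                    # step 3
        self.up = nn.ModuleList([Upsample(8 * c), Upsample(4 * c), Upsample(2 * c)])
        self.fuse = nn.ModuleList([nn.Conv2d(8 * c, 4 * c, 1),           # 1x1 convs after skips
                                   nn.Conv2d(4 * c, 2 * c, 1),
                                   nn.Conv2d(2 * c, c, 1)])
        self.dec = nn.ModuleList([NSTB(4 * c), NSTB(2 * c), DFTB(c)])    # step 4
        self.to_residual = nn.Conv2d(c, in_ch, 1)                        # step 5

    def forward(self, img: torch.Tensor, snr: torch.Tensor) -> torch.Tensor:
        x = self.shallow(img)
        skips = []
        for blk, down in zip(self.enc, self.down):
            x = blk(x, snr)
            skips.append(x)      # second path of each encoder block (skip connection)
            x = down(x)          # first path continues down the encoder
        x = self.bottleneck(x, snr)
        for up, fuse, blk, skip in zip(self.up, self.fuse, self.dec, reversed(skips)):
            x = fuse(torch.cat([up(x), skip], dim=1))  # skip concat + 1x1 conv
            x = blk(x, snr)
        return img + self.to_residual(x)               # step 6: I' = I + D
```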
To better improve image quality and keep the enhanced low-illumination image free from noise interference, the invention provides a noise suppression transformer block. As shown in fig. 2, along the input order, the noise suppression transformer block comprises a noise-suppressed attention mechanism module (NSA) and a deep feed-forward network module (DFFN); layer normalization is set at the input of the NSA module and at the input of the DFFN module, and a skip connection spans the NSA module together with its layer normalization and the DFFN module together with its layer normalization, respectively.
The dual-branch fusion transformer block is formed by fusing a global branch and a local branch through a noise suppression fusion module (NSF); the global branch is a noise suppression transformer block, the local branch has the same input as the global branch and is formed by serially connecting several convolution residual blocks (CRB), and the outputs of the global branch and the local branch are fed into the noise suppression fusion module (NSF).
The processing flow of the DFTB can be expressed as:
X′ = NSA(LN(X)) + X
Y = DFFN(LN(X′)) + X′
Z = CRB_n(X)
Output = NSF(Y, Z)
where X ∈ R^(c×h×w) is the feature map input to the dual-branch fusion transformer block; X′, Y, and Z ∈ R^(c×h×w) are intermediate results of the processing; Output ∈ R^(c×h×w) is the final output of the dual-branch fusion transformer block in the decoder; c denotes the number of channels of the feature map input to the encoder, and h and w are its height and width; LN(·) denotes layer normalization; and n is the number of convolution residual blocks. The first two equations also constitute the implementation flow of the NSTB.
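For illustration only, a minimal PyTorch sketch of these two blocks follows, using the NSA, DFFN, CRB, and NSF sketches given below; GroupNorm(1, c) as channel-wise layer normalization and the default of n = 2 convolution residual blocks are assumptions, not values fixed by the patent.

```python
import torch.nn as nn

class NSTB(nn.Module):
    """Noise suppression transformer block: the first two equations above,
    i.e. NSA and DFFN, each behind layer normalization with a skip connection."""
    def __init__(self, c: int):
        super().__init__()
        self.ln1 = nn.GroupNorm(1, c)   # channel-wise LayerNorm for conv features
        self.ln2 = nn.GroupNorm(1, c)
        self.nsa = NSA(c)               # sketched below (fig. 3)
        self.dffn = DFFN(c)             # sketched below (fig. 4)

    def forward(self, x, snr):
        x1 = self.nsa(self.ln1(x), snr) + x     # X' = NSA(LN(X)) + X
        return self.dffn(self.ln2(x1)) + x1     # Y  = DFFN(LN(X')) + X'

class DFTB(nn.Module):
    """Dual-branch fusion transformer block: an NSTB as the global branch,
    n serial convolution residual blocks as the local branch, fused by NSF."""
    def __init__(self, c: int, n: int = 2):
        super().__init__()
        self.global_branch = NSTB(c)
        self.local_branch = nn.Sequential(*[CRB(c) for _ in range(n)])  # Z = CRB_n(X)
        self.nsf = NSF()

    def forward(self, x, snr):
        y = self.global_branch(x, snr)   # Y from the global branch
        z = self.local_branch(x)         # Z from the local branch (same input X)
        return self.nsf(y, z, snr)       # Output = NSF(Y, Z)
```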
As shown in fig. 3, in the noise-suppressed attention mechanism module, the query (Q), key (K), and value (V) are first obtained by concatenating a 1×1 point convolution and a 3×3 depth convolution and are then reshaped; a self-attention map A_c is computed in the channel dimension of Q and K, enabling the model (i.e., the overall framework of the present invention) to learn global context information, which distinguishes this attention map from other attention mechanisms. Then, for the input image I, a signal-to-noise-ratio map S ∈ R^(H×W) is calculated and reshaped as S′; after processing, each point in its pixel matrix serves as a mask in the attention calculation, which effectively restrains the transformer model from attending to and processing image regions with an extremely low signal-to-noise ratio. The mask value of each point in S″ is:
S″(i, j) = 1 if S′(i, j) ≥ t, and S″(i, j) = δ if S′(i, j) < t
where t is the mask threshold, whose value is the mean of all pixel points in S′, determined through repeated controlled-variable experiments; δ is a very small negative scalar (−1e9), which ensures that long-range attention is drawn only from high-signal-to-noise-ratio regions with sufficient information. Finally, the noise-suppressed self-attention map is matrix-multiplied with V, the result is reshaped to C×H×W, and a point convolution operation is performed. A multi-head version is adopted for parallelization.
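For illustration only, a sketch of the masked attention follows. Where the text is ambiguous, this sketch makes assumptions: the SNR map is estimated by a simple local-averaging heuristic (the patent does not spell the computation out), the mask is applied additively to spatial attention logits in the standard {0, δ} form rather than to the channel-wise map A_c, and a single head is used for brevity although the patent adopts a multi-head version.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def snr_map(img: torch.Tensor) -> torch.Tensor:
    # Hypothetical SNR estimate: a locally averaged grayscale image is taken
    # as signal, the residual as noise (assumption, not the patented formula).
    gray = img.mean(dim=1, keepdim=True)
    denoised = F.avg_pool2d(gray, kernel_size=5, stride=1, padding=2)
    return denoised / (gray - denoised).abs().clamp(min=1e-6)

class NSA(nn.Module):
    """SNR-masked self-attention: Q, K, V from a 1x1 point convolution followed
    by a 3x3 depth convolution; positions whose SNR falls below the threshold t
    (here the mean of S') receive delta = -1e9 before the softmax."""
    def __init__(self, c: int):
        super().__init__()
        self.qkv = nn.Conv2d(c, 3 * c, kernel_size=1, bias=False)           # point conv
        self.qkv_dw = nn.Conv2d(3 * c, 3 * c, kernel_size=3, padding=1,
                                groups=3 * c, bias=False)                   # depth conv
        self.project = nn.Conv2d(c, c, kernel_size=1)                       # final point conv

    def forward(self, x: torch.Tensor, snr: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.qkv_dw(self.qkv(x)).chunk(3, dim=1)
        q = q.flatten(2).transpose(1, 2)                 # (b, hw, c)
        k = k.flatten(2).transpose(1, 2)
        v = v.flatten(2).transpose(1, 2)
        logits = q @ k.transpose(-2, -1) / c ** 0.5      # attention logits (b, hw, hw)
        s = F.interpolate(snr, size=(h, w)).flatten(2)   # resized SNR map S' (b, 1, hw)
        t = s.mean(dim=-1, keepdim=True)                 # mask threshold t = mean of S'
        mask = torch.where(s >= t, torch.zeros_like(s), torch.full_like(s, -1e9))
        attn = torch.softmax(logits + mask, dim=-1)      # low-SNR keys contribute ~0
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return self.project(out)                         # reshape + point convolution
```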
As shown in fig. 4, the present invention proposes a modified version of the feed-forward network, namely the deep feed-forward network module (DFFN). The channel dimension is first increased by a 1×1 point convolution; then, to better recover local structure, spatial features between adjacent pixels are obtained by a 3×3 depth convolution; after a ReLU nonlinear activation function, the channel dimension is reduced back to the original input state by a 1×1 point convolution.
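For illustration only, a sketch of the DFFN follows; the channel expansion factor (here 2) is an assumption, as the description only fixes the order of operations.

```python
import torch
import torch.nn as nn

class DFFN(nn.Module):
    """Deep feed-forward network: 1x1 expand, 3x3 depthwise spatial mixing,
    ReLU, 1x1 reduce back to the input width."""
    def __init__(self, c: int, expand: int = 2):
        super().__init__()
        hidden = c * expand
        self.net = nn.Sequential(
            nn.Conv2d(c, hidden, kernel_size=1),                   # raise channel dim
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden),                              # 3x3 depth conv
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, c, kernel_size=1),                   # back to input width
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```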
To give the image richer detail features, capturing local features and establishing short-range dependencies becomes critical. The invention therefore designs a convolution residual block, detailed in fig. 5: the short-range dependencies of pixels are extracted by two 3×3 convolutions with a ReLU nonlinear activation function between them, and finally the context at the original resolution is strengthened by polarized self-attention (PSA).
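For illustration only, a sketch of the CRB follows. The residual connection is assumed from the block's name, and polarized self-attention is stubbed with an identity placeholder, since its internals come from the PSA literature rather than from this description.

```python
import torch
import torch.nn as nn

class CRB(nn.Module):
    """Convolution residual block: two 3x3 convolutions with a ReLU between
    them, followed by polarized self-attention (stubbed here)."""
    def __init__(self, c: int):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(c, c, kernel_size=3, padding=1)
        self.psa = nn.Identity()   # placeholder for polarized self-attention (PSA)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv2(self.relu(self.conv1(x)))
        return self.psa(out + x)   # residual connection assumed from the name
```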
For the noise suppression fusion module in the DFTB, the intermediate results Y of the global branch and Z of the local branch are adaptively combined. The signal-to-noise-ratio map S is resized to h×w to match the feature maps and its values are normalized; the normalized signal-to-noise-ratio map S′ then fuses Y and Z by interpolation weighting, formulated as:
Output = Z × S′ + Y × (1 − S′)
where Output ∈ R^(c×h×w) is passed to the next operation to generate the final output image. Because the values in the signal-to-noise-ratio map reflect the noise levels of different regions of the input image, the noise suppression fusion operation can adaptively combine local and global image information to generate the final feature map.
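For illustration only, a sketch of the fusion follows; min-max scaling is assumed as the normalization, since the text only says the values are normalized.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NSF(nn.Module):
    """Noise suppression fusion: resize the SNR map to the feature resolution,
    normalize it to [0, 1], and interpolate between the local branch Z
    (trusted where SNR is high) and the global branch Y."""
    def forward(self, y: torch.Tensor, z: torch.Tensor, snr: torch.Tensor) -> torch.Tensor:
        s = F.interpolate(snr, size=y.shape[-2:], mode="bilinear", align_corners=False)
        s_min = s.amin(dim=(-2, -1), keepdim=True)
        s_max = s.amax(dim=(-2, -1), keepdim=True)
        s = (s - s_min) / (s_max - s_min + 1e-6)   # normalized SNR map S' in [0, 1]
        return z * s + y * (1 - s)                 # Output = Z x S' + Y x (1 - S')
```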
The invention evaluates the low-illumination image enhancement task on the LOL-v1, LOL-v2-synthetic, and SDSD-indoor public datasets; both v1 and v2 of the LOL dataset exhibit obvious noise problems. The LOL-v1 dataset comprises 500 pairs of low-light/normal-light images at a resolution of 400×600. The LOL-v2-synthetic dataset contains 1000 pairs of low-light/normal-light images at a resolution of 384×384. The SDSD-indoor dataset consists of video frames saved from 70 video pairs, at a resolution of 960×512.
The invention is implemented in PyTorch on an Nvidia RTX 4090 GPU and an Intel Xeon Gold 6326 CPU, and the network parameters are randomly initialized from a Gaussian distribution. In the training phase, 128×128 sample pairs are randomly cropped from the original LOL-v1 and LOL-v2-synthetic images, and 256×256 sample pairs from the original SDSD-indoor images, with standard augmentations such as vertical and horizontal flipping. In the test phase, the datasets are evaluated at their original resolution. A Charbonnier loss is applied between the enhanced image and the ground-truth image and minimized with an Adam optimizer, with hyperparameters β1 = 0.9 and β2 = 0.99 and a learning rate of 4×10^-4.
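For illustration only, the training objective described above can be sketched as follows, reusing the DualBranchEnhancer and snr_map sketches from earlier; the Charbonnier epsilon of 1e-3 is an assumed conventional value, not one stated in the patent.

```python
import torch

def charbonnier_loss(pred: torch.Tensor, target: torch.Tensor,
                     eps: float = 1e-3) -> torch.Tensor:
    # Charbonnier loss: a smooth, differentiable variant of the L1 loss.
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

model = DualBranchEnhancer()  # the pipeline sketch from earlier
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4, betas=(0.9, 0.99))

def train_step(low: torch.Tensor, gt: torch.Tensor) -> float:
    """One optimization step on a batch of low-light / normal-light pairs."""
    optimizer.zero_grad()
    enhanced = model(low, snr_map(low))   # SNR prior computed from the input
    loss = charbonnier_loss(enhanced, gt)
    loss.backward()
    optimizer.step()
    return loss.item()
```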
The quantitative and qualitative results of the experiments of the present invention on LOL-v1, LOL-v 2-synhetic and SDSD-indoor datasets are shown below:
Quantitative results: on the LOL-v1 dataset, PSNR and SSIM reach 23.94 and 0.842, respectively; on the LOL-v2-synthetic dataset, 26.24 and 0.942; on the SDSD-indoor dataset, 29.87 and 0.906.
The qualitative results are shown in fig. 6, and the visual results of the method provided by the invention show better performance in the aspects of contrast, brightness, noise, color saturation and the like. This also shows that the method provided by the invention can effectively inhibit noise while enhancing the brightness of the image and retaining the details of the image.

Claims (10)

1. A dual-branch fusion low-illumination image enhancement method, characterized by comprising the following steps:
step 1, acquiring shallow features F_s of an input image I, the input image I being a low-illumination image;
step 2, feeding the shallow features F_s into a three-layer encoder to obtain high-level features F_h, the three-layer encoder being formed by sequentially connecting one dual-branch fusion transformer block (DFTB) and two noise suppression transformer blocks (NSTB);
step 3, feeding the high-level features F_h into a bottleneck layer, the bottleneck layer being a noise suppression transformer block (NSTB);
step 4, feeding the bottleneck layer output into a three-layer decoder to recover a high-resolution representation of the image, obtaining a feature map F_o, the three-layer decoder being formed by sequentially connecting two noise suppression transformer blocks (NSTB) and one dual-branch fusion transformer block (DFTB);
step 5, reducing the number of channels of the feature map F_o to obtain a residual image D with the same dimensions as the input image I;
step 6, adding the residual image D to the input image I element by element to obtain an enhanced high-resolution image I′.
2. The dual-branch fusion low-illumination image enhancement method according to claim 1, wherein step 1 extracts the shallow features F_s ∈ R^(C×H×W) from the input image I by a 3×3 convolution, where C is the number of channels, H is the spatial height of the input image I, and W is the spatial width of the input image I.
3. The dual-branch fusion low-illumination image enhancement method according to claim 1, wherein the layers of the three-layer encoder are cascaded through downsampling layers, and the layers of the three-layer decoder are cascaded through upsampling layers.
4. The dual-branch fusion low-illumination image enhancement method according to claim 3, wherein the downsampling layer consists of a 3×3 convolution that reduces the number of channels and a pixel-unshuffle layer that reduces the resolution, and the upsampling layer consists of a 3×3 convolution that increases the number of channels and a pixel-shuffle layer that increases the resolution.
5. The dual-branch fusion low-illumination image enhancement method according to claim 1, wherein features are concatenated between the encoder and the decoder through skip connections, and the number of channels is reduced by a 1×1 convolution after concatenation.
6. The dual-branch fusion low-illumination image enhancement method according to any one of claims 1 to 5, wherein the noise suppression transformer block comprises, along the input order, a noise-suppressed attention mechanism module and a deep feed-forward network module; layer normalization is set at the input of the noise-suppressed attention mechanism module and at the input of the deep feed-forward network module, and skip connections span the noise-suppressed attention mechanism module together with its layer normalization and the deep feed-forward network module together with its layer normalization, respectively;
the dual-branch fusion transformer block is formed by fusing a global branch and a local branch through a noise suppression fusion module; the global branch is the noise suppression transformer block, the local branch has the same input as the global branch and is formed by serially connecting a plurality of convolution residual blocks, and the outputs of the global branch and the local branch are fed into the noise suppression fusion module.
7. The dual-branch fusion low-illumination image enhancement method according to claim 6, wherein the deep feed-forward network module first increases the channel dimension by a 1×1 point convolution, then obtains spatial features between adjacent pixels by a 3×3 depth convolution, and, after a ReLU nonlinear activation function, reduces the channel dimension back to the original input state by a 1×1 point convolution.
8. The dual-branch fusion low-illumination image enhancement method according to claim 6, wherein the convolution residual block extracts short-range dependencies of pixels by two 3×3 convolutions with a ReLU nonlinear activation function between them, and finally strengthens the context at the original resolution by polarized self-attention (PSA).
9. The dual-branch fusion low-illumination image enhancement method according to claim 6, wherein, in the noise-suppressed attention mechanism module, the query Q, key K, and value V are first obtained by concatenating a 1×1 point convolution and a 3×3 depth convolution and are then reshaped, and a self-attention map A_c is computed in the channel dimension of Q and K, enabling the model to learn global context information;
then, a signal-to-noise-ratio map S ∈ R^(H×W) of the input image I is calculated and reshaped as S′, and each point in its pixel matrix is processed to serve as a mask in the attention calculation; the mask value of each point in S″ is:
S″(i, j) = 1 if S′(i, j) ≥ t, and S″(i, j) = δ if S′(i, j) < t,
where t is a mask threshold and δ is a negative scalar;
finally, the noise-suppressed self-attention map is matrix-multiplied with V, the result is reshaped to C×H×W, and a point convolution operation is performed.
10. The dual-branch fusion low-illumination image enhancement method according to claim 9, wherein the noise suppression fusion module adaptively combines the intermediate results Y of the global branch and Z of the local branch: the signal-to-noise-ratio map S is resized to h×w to match the feature maps and its values are normalized, and the normalized signal-to-noise-ratio map S′ fuses Y and Z by interpolation weighting, formulated as:
Output = Z × S′ + Y × (1 − S′)
where Output ∈ R^(c×h×w) is passed to the next operation to generate the final output image.
CN202310825185.9A 2023-07-06 2023-07-06 Low-illumination image enhancement method for double-branch fusion Pending CN116977208A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310825185.9A CN116977208A (en) 2023-07-06 2023-07-06 Low-illumination image enhancement method for double-branch fusion

Publications (1)

Publication Number Publication Date
CN116977208A 2023-10-31

Family

ID=88480671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310825185.9A Pending CN116977208A (en) 2023-07-06 2023-07-06 Low-illumination image enhancement method for double-branch fusion

Country Status (1)

Country Link
CN (1) CN116977208A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117437490A (en) * 2023-12-04 2024-01-23 深圳咔咔可洛信息技术有限公司 Clothing information processing method and device, electronic equipment and storage medium
CN117437490B (en) * 2023-12-04 2024-03-22 深圳咔咔可洛信息技术有限公司 Clothing information processing method and device, electronic equipment and storage medium
CN118247583A (en) * 2024-05-28 2024-06-25 杭州像素元科技有限公司 Construction method and construction device of high-speed night image definition enhancement model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination