CN113658057A

CN113658057A - Swin Transformer low-light-level image enhancement method

Info

Publication number: CN113658057A
Application number: CN202110805770.3A
Authority: CN
Inventors: 孙帮勇; 赵兴运
Original assignee: Xian University of Technology
Current assignee: Xian University of Technology
Priority date: 2021-07-16
Filing date: 2021-07-16
Publication date: 2021-11-16

Abstract

The invention discloses a Swin Transformer low-light-level image enhancement method, which comprises the following steps: step 1, constructing a preprocessing module, wherein the input of the preprocessing module is an original low-light-level image and the output of the preprocessing module is a feature map; step 2, constructing a Swin Transformer module, wherein the input data of the Swin Transformer module is the output feature of step 1 and the output of the Swin Transformer module is the extracted image features; step 3, constructing a recovery module, wherein the input data of the recovery module is the output feature of step 2 and the output of the recovery module is an enhanced high-quality noise-free color image. The method solves the problems of low visibility, low contrast, noise pollution and color distortion of low-light-level images in the prior art.

Description

Swin Transformer low-light-level image enhancement method

Technical Field

The invention belongs to the technical field of image processing, in particular to RGB true-color image recovery technology, and relates to a Swin Transformer low-light-level image enhancement method.

Background

The number of photons entering the lens is small in imaging under a low-light condition, the brightness of a formed image is low, and information such as color and a target structure is difficult to distinguish. Although the image brightness can be improved to some extent by extending the exposure time, the optical sensor is liable to generate a large amount of noise. Therefore, directly acquired low-light-level images mostly have the degradation problems of low contrast, high noise, color distortion, detail blurring and the like. The low-light-level image has poor visual perception quality, and can influence subsequent image processing tasks such as image segmentation, target recognition, video monitoring and the like. The low-light-level image enhancement technology can restore the directly acquired low-light-level images to a normal illumination level based on a series of mathematical methods, improves the visual perception quality and greatly improves the precision of subsequent image processing tasks, thereby becoming one of the research hotspots in the field of image processing.

Early low-light image enhancement methods were mainly based on histogram equalization (HE) and Retinex theory. HE is a histogram modification method based on the cumulative distribution function: it adjusts the image histogram toward a uniform distribution to stretch the dynamic range of the image, thereby improving image contrast. The method is simple and efficient, but the generated images are prone to artifacts and look unrealistic. Retinex-based methods, in contrast, attempt to brighten an image by decomposing the input image into a reflection component, which is an inherent property of the scene, and an illumination component, which is affected by ambient illuminance; they typically enhance the illumination component of the low-light image to approximate the corresponding normal-light image. However, the parameters of these models must be set manually, so they cannot adapt to the diversity of images, they perform poorly on high-noise images, and local details may be underexposed or overexposed.

With the rapid development of artificial intelligence theory, low-light-level image enhancement algorithms based on deep learning have been proposed in succession in recent years. Although deep-learning-based methods make up for the shortcomings of traditional methods to some extent and achieve good enhancement on certain image sets, most of them depend heavily on the quality of the data set, and either assume that dark areas contain no noise or ignore how noise is distributed across areas of different illumination. In fact, this prior knowledge deviates from real images, and a complete real-image data set is difficult to acquire; as a result, existing deep learning models cannot effectively suppress real image noise and struggle to produce satisfactory visual quality.

Study of traditional models and deep learning models for low-light-level image enhancement reveals two challenging problems in enhancing real low-light-level images: low illumination that is independent of spatial distribution, and non-uniform noise. Statistics show that the spatial feature distribution of a real low-light image is complex: the number of photons entering the lens differs greatly between spatial positions, so illumination varies strongly across the image. Most existing deep learning methods can effectively improve illumination on artificially generated data sets, but for low illumination that is independent of spatial distribution, their enhancement of overall visibility and of underexposed areas is not ideal. In addition, traditional models cannot handle well the non-uniform noise introduced during image acquisition, and deep learning models cannot achieve an ideal result through simple cascaded noise reduction. For example, denoising before enhancement loses some image details and makes high-noise pixel information difficult to reconstruct, while denoising after enhancement easily blurs the image. Therefore, how to effectively suppress noise while recovering information hidden in the dark is another challenge for current low-light enhancement models.

Disclosure of Invention

The invention aims to provide a Swin Transformer low-light-level image enhancement method, which solves the problems of low visibility, low contrast, noise pollution and color distortion of low-light-level images in the prior art.

The technical scheme adopted by the invention is a Swin Transformer low-light-level image enhancement method, which is specifically implemented according to the following steps:

step 1, constructing a preprocessing module, wherein the input of the preprocessing module is an original low-light-level image with a size of H × W × 3; the output of the preprocessing module is a feature map with a size of H/4 × W/4 × 96;

step 2, constructing a Swin Transformer module, wherein the input data of the Swin Transformer module is the output feature of step 1, with a size of H/4 × W/4 × 96; the output of the Swin Transformer module is the extracted image features, with a size of H/4 × W/4 × 96;

step 3, constructing a recovery module, wherein the input data of the recovery module is the output feature of step 2, with a size of H/4 × W/4 × 96; the output of the recovery module is an enhanced high-quality noise-free color image with a size of H × W × 3.
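The sizes stated in steps 1 to 3 can be checked with a small shape trace. The following is a minimal sketch rather than part of the disclosed method: the function name `trace_shapes` and its parameters are hypothetical, and it assumes that H and W are divisible by 4.

```python
def trace_shapes(height, width, downscale=4, embed_dim=96):
    """Return the (H, W, C) size at the input and after each of the three modules."""
    if height % downscale or width % downscale:
        raise ValueError("height and width must be divisible by the downscale factor")
    h, w = height // downscale, width // downscale
    return {
        "input": (height, width, 3),          # original low-light-level image
        "preprocessing": (h, w, embed_dim),   # step 1 output
        "swin_transformer": (h, w, embed_dim),  # step 2 output (size unchanged)
        "recovery": (height, width, 3),       # step 3 output
    }


shapes = trace_shapes(224, 224)
for stage, size in shapes.items():
    print(stage, size)
```

For a 224 × 224 input, the intermediate feature maps are 56 × 56 × 96 and the output returns to 224 × 224 × 3.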

The method has the advantages that the low-light-level image can be effectively restored to the image acquired under the normal illumination condition, and the texture details, the color information and the like of the image are kept.

Drawings

FIG. 1 is a block flow diagram of the method of the present invention;

FIG. 2 is a flow chart of the structure of a pre-processing module constructed in the method of the present invention;

FIG. 3 is a flow chart of the structure of a Swin Transformer module constructed in the method of the present invention;

FIG. 4 is a flow chart of the structure of the recovery module constructed in the method of the present invention.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

The invention provides a low-light-level image enhancement method based mainly on the Swin Transformer model. The overall idea is as follows: first, the input image is preprocessed to convert the image data into image features that the Swin Transformer model can readily process; then, the obtained image features are input into the Swin Transformer model for feature extraction; finally, a recovery module restores the extracted image features to the size of the original low-light-level image and outputs the enhanced result.

Referring to fig. 1, the method of the present invention is based on a Swin Transformer low-light-level image enhancement network (hereinafter referred to as the network), which comprises a preprocessing module, a Swin Transformer module, and a recovery module. The preprocessing module comprises two steps, Patch Partition and Linear Embedding; it takes the original low-light image directly as input and converts it into preprocessed image features. Patch Partition divides the original low-light image into non-overlapping patches of size 4 × 4, so that the feature dimension of each patch becomes 4 × 4 × 3 = 48; the Linear Embedding layer then projects each patch feature to the embedding dimension, which is set to 96 in the embodiment of the invention. The Swin Transformer module divides the input image feature map into windows using a shifted-window strategy. Within each divided window, a multi-head self-attention mechanism is applied independently to model the global dependency between input and output, thereby extracting global image features. Shifting the position of the windows solves the problem of information interaction between different windows. The recovery module likewise comprises two steps, Patch Expanding and Linear: Patch Expanding restores the feature size so that it matches the original input low-light-level image, and Linear performs dimension mapping so that the output has 3 channels.

The method of the invention is implemented by utilizing the network framework according to the following steps:

Step 1, constructing a preprocessing module, wherein the input of the preprocessing module is an original low-light-level image with a size of H × W × 3; the output of the preprocessing module is a feature map with a size of H/4 × W/4 × 96.

Referring to fig. 2, the preprocessing module mainly preprocesses the data of the original image, and its structure is, in sequence: the original low-light image (Input_image) as the input image → Patch Partition layer (Conv 4 × 4 × 48) → Linear Embedding layer (Linear, H/4 × W/4 × 96) → output feature (Output_feature).

The Patch Partition layer is a convolution operation with a convolution kernel size of 4 × 4, a convolution stride of 4, and 48 feature maps in total; the Linear Embedding layer performs feature mapping by a linear operation, the size of the convolution kernel is H/4 × W/4, and the total number of feature maps is 96.
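The regrouping performed by Patch Partition can be illustrated as follows. This is a simplified sketch under stated assumptions: it flattens each non-overlapping 4 × 4 block of pixels into a 48-dimensional vector without learned weights, whereas the layer described above is a convolution with stride 4; the function name `patch_partition` and the nested-list image layout are hypothetical.

```python
def patch_partition(image, patch=4):
    """Split an H x W x C nested-list image into non-overlapping patch x patch
    blocks and flatten each block into a single feature vector."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image height and width must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        token_row = []
        for left in range(0, width, patch):
            vector = []
            for dy in range(patch):
                for dx in range(patch):
                    # Append the C channel values of one pixel of the block.
                    vector.extend(image[top + dy][left + dx])
            token_row.append(vector)
        tokens.append(token_row)
    return tokens


# Toy 8 x 8 RGB image whose pixel value encodes its own (row, column) position.
image = [[[y, x, 0] for x in range(8)] for y in range(8)]
tokens = patch_partition(image)
print(len(tokens), len(tokens[0]), len(tokens[0][0]))
```

Each 4 × 4 × 3 block yields 48 values, so an H × W × 3 image becomes an H/4 × W/4 grid of 48-dimensional vectors before the Linear Embedding layer maps them to 96 dimensions.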

Step 2, constructing a Swin Transformer module, wherein the input data of the Swin Transformer module is the output feature of step 1, with a size of H/4 × W/4 × 96; the output of the Swin Transformer module is the extracted image features, with a size of H/4 × W/4 × 96.

The low-light-level image enhancement network is designed on the basis of the Swin Transformer: a self-attention mechanism models the dependency relationships among features at different spatial positions and effectively captures global context information, giving better feature extraction capability. Reducing the stacking of convolutional layers preserves accuracy while greatly improving processing speed. The Swin Transformer module performs attention weighting according to the correlation between pairs of spatial position features, so that local features and global information are fused in the network; its structure avoids the CNN approach of stacking convolutional layers to obtain global information, which allows the model to perform well.

The specific internal structure of a single Swin Transformer module is consistent with the paper (Swin Transformer: Hierarchical Vision Transformer using Shifted Windows).

Referring to fig. 3, the structure of the Swin Transformer module is, in sequence: the output feature of step 1 as the input feature → LN regularization layer → W-MSA submodule (i.e. window multi-head self-attention layer) or SW-MSA submodule (i.e. shifted-window multi-head self-attention layer) → residual connection layer → LN regularization layer → feedforward network → residual connection layer → output feature. The Swin Transformer module is repeated 6 times in total, with odd and even layers connected alternately in sequence: the three odd layers use the W-MSA submodule and the three even layers use the SW-MSA submodule.
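The layer ordering and the alternation of the two attention types can be sketched as follows. This is an illustrative outline only: `block_schedule` and `swin_block` are hypothetical names, and the normalization, attention and feedforward operations are passed in as plain callables rather than implemented.

```python
def block_schedule(depth=6):
    """List the attention type used by each layer, counting layers from 1."""
    return ["W-MSA" if layer % 2 == 1 else "SW-MSA" for layer in range(1, depth + 1)]


def swin_block(x, norm, attention, feed_forward):
    """Apply LN -> attention -> residual, then LN -> feedforward -> residual."""
    x = [a + b for a, b in zip(x, attention(norm(x)))]
    x = [a + b for a, b in zip(x, feed_forward(norm(x)))]
    return x


print(block_schedule())

# Toy callables, used only to make the residual structure visible.
result = swin_block(
    [1.0, 2.0],
    norm=lambda v: v,
    attention=lambda v: [2.0 * t for t in v],
    feed_forward=lambda v: [1.0 for _ in v],
)
print(result)
```

Each sublayer adds its output to its own input, which is the residual connection named in the structure above.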

Referring to fig. 3, the LN regularization layer mainly performs LN regularization, normalizing the input data to between 0 and 1 and thus ensuring that the data distribution of the input layer is the same; the residual connection layer mainly performs residual connection, which solves the problems of gradient vanishing and weight matrix degradation; the feedforward network consists of two feedforward layers, wherein the first layer maps the input vector from d_model dimensions to 4 × d_model dimensions with a ReLU activation function, and the second layer maps from 4 × d_model dimensions back to d_model dimensions without an activation function. The feedforward network is expressed as formula (1):

FFN(x)＝max(0,xW₁+b₁)W₂+b₂ (1)
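Formula (1) can be evaluated directly on small hand-written matrices. The sketch below is illustrative only; the helper names `affine` and `ffn` and the toy weights are assumptions, with d_model = 1 and a hidden width of 4 × d_model = 4.

```python
def affine(x, weights, bias):
    """Compute x @ weights + bias for a row vector x.

    weights is given as a list of rows, one row per input dimension."""
    return [
        sum(xi * row[j] for xi, row in zip(x, weights)) + bias[j]
        for j in range(len(bias))
    ]


def ffn(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, as in formula (1)."""
    hidden = [max(0.0, value) for value in affine(x, w1, b1)]  # ReLU
    return affine(hidden, w2, b2)  # second layer has no activation


w1 = [[1.0, -1.0, 0.5, 0.0]]          # 1 -> 4 dimensions
b1 = [0.0, 0.0, 0.0, 1.0]
w2 = [[1.0], [1.0], [1.0], [1.0]]     # 4 -> 1 dimensions
b2 = [0.5]
print(ffn([2.0], w1, b1, w2, b2))
```

The negative pre-activation in the second hidden unit is clipped to zero by the ReLU, so only the non-negative hidden units contribute to the output.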

The W-MSA submodule (window multi-head self-attention layer) first divides the input features into windows; the window size set in the embodiment of the invention is 7 × 7, and multi-head self-attention is computed within each divided window. The W-MSA submodule maps the input features into different subspaces, computes an attention vector in each subspace by dot-product operations, and finally concatenates the attention vectors of all subspaces and maps them back to the original input space to obtain the final attention vector as output. The expression of the W-MSA submodule is given by formula (2):

MultiHead(Q,K,V)＝Concat(head₁,...,head_h)W^O

head_i＝Attention(QW_i^Q, KW_i^K, VW_i^V) (2)

wherein Q, K and V are the inputs of the W-MSA submodule, i.e. the query vector, key vector and value vector respectively; W_i^Q is the mapping matrix of Q in the different subspaces, W_i^K is the mapping matrix of K in the different subspaces, and W_i^V is the mapping matrix of V in the different subspaces. The number h of subspaces is set to 8 in this step. The attention vector on a single subspace is computed as follows: the dot product of the query vector Q and the key vector K is divided by the square root √d_k of the dimension d_k of the key vector K to obtain the score matrix of the query vector Q; the score matrix is then normalized by the softmax function to obtain a weight matrix, and the weight matrix is multiplied by the value vector V to obtain the attention vector of the subspace, as expressed in formula (3):

Attention(Q,K,V)＝softmax(QKᵀ/√d_k)V (3)

By mapping the input features to different subspaces and then computing the attention vectors, the W-MSA submodule (window multi-head self-attention layer) captures the dependency relationships of the features in different subspaces, so that the final attention vector captures the dependencies between features more comprehensively and from multiple perspectives.
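The single-subspace attention computation described above (dot product of Q and K, scaling by the square root of the key dimension, softmax, multiplication by V) can be sketched in plain Python. This is a minimal illustration for one head and one window, with hypothetical function names and without the learned mapping matrices.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


def attention(q, k, v):
    """Scaled dot-product attention for row-vector lists q, k and v."""
    scale = math.sqrt(len(k[0]))  # square root of the key dimension
    output = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
        weights = softmax(scores)
        output.append([
            sum(w * v_row[col] for w, v_row in zip(weights, v))
            for col in range(len(v[0]))
        ])
    return output


# A zero query scores every key equally, so the result is the mean of the values.
print(attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]))
```

In the multi-head case, this computation would be repeated for each of the h subspaces and the results concatenated, as in formula (2).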

A network built from W-MSA submodules alone has very limited modeling ability, because each window is computed as an independent region and the necessary interaction between windows is ignored. For this reason, the Swin Transformer used in the method of the invention also introduces the SW-MSA submodule (shifted-window multi-head self-attention layer). Before the image features are input to the SW-MSA submodule, they are shifted by half the window size, and the W-MSA operation is then performed. As a result, a window at a given position contains different image feature information from the window that the W-MSA submodule divides at the same position, which solves the problem of information interaction between different windows. The specific operation flow is as follows:

The output features of step 1 are cyclically shifted upward and to the left by half the window size, and windows are then divided on the shifted features in the same way as in the W-MSA submodule, giving window contents different from those of W-MSA; the W-MSA submodule operation is then performed, and after it completes, the resulting image features are cyclically shifted downward and to the right by half the window size to restore their original positions.
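The cyclic shift and its reversal can be demonstrated on a small grid. The sketch below is illustrative: `cyclic_shift` is a hypothetical helper, the grid holds scalar labels instead of feature vectors, and a window size of 2 (shift of 1) replaces the 7 × 7 window of the embodiment to keep the example small.

```python
def cyclic_shift(grid, shift_rows, shift_cols):
    """Roll a 2-D grid cyclically.

    Positive shifts move content down and right; negative shifts move it
    up and left. Content leaving one edge re-enters at the opposite edge."""
    height, width = len(grid), len(grid[0])
    return [
        [grid[(r - shift_rows) % height][(c - shift_cols) % width] for c in range(width)]
        for r in range(height)
    ]


window = 2
half = window // 2
grid = [[4 * r + c for c in range(4)] for r in range(4)]

shifted = cyclic_shift(grid, -half, -half)      # up and left by half a window
restored = cyclic_shift(shifted, half, half)    # down and right to undo it

print(shifted[0])
print(restored == grid)
```

After the shift, each window of the regular partition covers pixels that previously belonged to several neighboring windows, and the reverse shift returns every feature to its original position.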

Referring to fig. 4, the recovery module mainly restores the image features extracted by the Swin Transformer module to the size of the original input low-light-level image and outputs an enhanced high-quality noise-free color image. The structure of the recovery module is, in sequence: the output feature of step 2 as the input (Input_feature) → Patch Expanding layer (performing a rearrange operation) → Linear layer (Linear, H × W × 3) → output image (Output_image).

The Patch Expanding layer performs a rearrange operation that expands the resolution of the input features to 4 times the input resolution and reduces the feature dimension to 1/16 of the input dimension; the Linear layer performs feature mapping by a linear operation, the size of the convolution kernel is H × W, and the total number of feature maps is 3.
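One way to realize the rearrangement described above is to spread the channels of each feature vector over a 4 × 4 block of output positions. This is a sketch under assumptions: the function name `patch_expand`, the nested-list layout and the channel ordering are hypothetical, and no learned weights are involved.

```python
def patch_expand(features, scale=4):
    """Rearrange an h x w x C feature map into (h*scale) x (w*scale) x C/scale^2."""
    h, w, channels = len(features), len(features[0]), len(features[0][0])
    if channels % (scale * scale):
        raise ValueError("channel count must be divisible by scale * scale")
    out_channels = channels // (scale * scale)
    output = [[None] * (w * scale) for _ in range(h * scale)]
    for y in range(h):
        for x in range(w):
            vector = features[y][x]
            for dy in range(scale):
                for dx in range(scale):
                    # Each output pixel receives one contiguous slice of channels.
                    start = (dy * scale + dx) * out_channels
                    output[y * scale + dy][x * scale + dx] = vector[start:start + out_channels]
    return output


features = [[list(range(96))]]          # a 1 x 1 x 96 feature map
expanded = patch_expand(features)
print(len(expanded), len(expanded[0]), len(expanded[0][0]))
```

With 96 input channels, each output position holds 96/16 = 6 channels, which the Linear layer then maps to the 3 channels of the output image.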

When training the Swin Transformer-based low-light-level image enhancement network, it is considered that the L₁ loss function performs better on the contrast of target contours and the smoothness of uniform regions; meanwhile, the SSIM loss function introduces structural constraints and can restore the structure and local details of the image well, and the perceptual loss function constrains the difference between the real image and the predicted image and preserves the fidelity of image perception and details. In this step, the L₁ loss function, the SSIM loss function and the perceptual loss function are combined as the total loss function of the Swin Transformer-based low-light-level image enhancement network, expressed as follows:

L_total＝(1-λ_s-λ_p)L₁+λ_sL_ssim+λ_pL_perc

In the formula, L₁ represents the pixel-level L₁ norm loss, L_ssim represents the structural similarity loss, L_perc represents the perceptual loss, and λ_s and λ_p are the corresponding coefficients, with a value range of [0,1], preferably λ_s＝0.2 and λ_p＝0.1.

In the L₁ norm loss formula, I_gt represents the real image, I_h represents the predicted image, and l represents a non-zero constant, taken as 10^-6;

The structural similarity loss formula of SSIM is

μ_x and μ_y respectively represent the pixel means of images x and y; σ_xy represents the covariance of images x and y; σ_x² and σ_y² respectively represent the variances of images x and y; N represents the total number of image samples; C₁ and C₂ are constants;

the perceptual loss function is formulated as

I_gt represents the real image, I_h represents the predicted image, C_j represents the number of channels, H_j and W_j respectively represent the height and width of the jth feature map, and the feature term represents the feature map obtained from the jth convolutional layer of the pre-trained VGG16 model.
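The weighting in the total loss formula can be sketched as a small helper. This is illustrative only: `total_loss` is a hypothetical name, and the three loss terms are assumed to have been computed elsewhere and are passed in as plain numbers.

```python
def total_loss(l1, l_ssim, l_perc, lambda_s=0.2, lambda_p=0.1):
    """Combine the three loss terms:

    L_total = (1 - lambda_s - lambda_p) * L1 + lambda_s * L_ssim + lambda_p * L_perc
    """
    for name, value in (("lambda_s", lambda_s), ("lambda_p", lambda_p)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    return (1.0 - lambda_s - lambda_p) * l1 + lambda_s * l_ssim + lambda_p * l_perc


# With the preferred coefficients, the L1 term carries a weight of 0.7.
print(total_loss(l1=0.05, l_ssim=0.20, l_perc=0.30))
```

Because the three weights sum to 1, the total loss is a weighted average of the individual terms when the preferred coefficients are used.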

Claims

1. A Swin Transformer low-light-level image enhancement method, characterized by being specifically implemented according to the following steps:

2. The Swin Transformer low-light-level image enhancement method of claim 1, wherein: in step 1, the structure of the preprocessing module is, in sequence: the original low-light image as the input image → Patch Partition layer → Linear Embedding layer → output feature (Output_feature),

3. The Swin Transformer low-light-level image enhancement method of claim 1, wherein: in step 2, the structure of the Swin Transformer module is, in sequence: the output feature of step 1 as the input feature → LN regularization layer → W-MSA submodule or SW-MSA submodule → residual connection layer → LN regularization layer → feedforward network → residual connection layer → output feature; the Swin Transformer module is repeated 6 times in total, with odd and even layers connected alternately in sequence, wherein the three odd layers use the W-MSA submodule and the three even layers use the SW-MSA submodule.

4. The Swin transform low-light-level image enhancement method of claim 3, wherein: the LN regularization layer is used for carrying out LN regularization treatment, carrying out normalization treatment on input data and enabling the input data to be between 0 and 1;

the residual connecting layer is used for performing residual connection, so that the problems of gradient disappearance and weight matrix degradation are solved;

the feedforward network consists of two feedforward layers, wherein the first layer maps the input vector from d_model dimensions to 4 × d_model dimensions with a ReLU activation function, and the second layer maps from 4 × d_model dimensions back to d_model dimensions without an activation function; the feedforward network is expressed as formula (1):

FFN(x)＝max(0,xW₁+b₁)W₂+b₂ (1)

the W-MSA submodule first divides the input features into windows and computes multi-head self-attention within each divided window; the W-MSA submodule maps the input features into different subspaces, computes an attention vector in each subspace by dot-product operations, and finally concatenates the attention vectors of all subspaces and maps them back to the original input space to obtain the final attention vector as output; the expression of the W-MSA submodule is given by formula (2):

MultiHead(Q,K,V)＝Concat(head₁,...,head_h)W^O

head_i＝Attention(QW_i^Q, KW_i^K, VW_i^V) (2)

wherein Q, K and V are the inputs of the W-MSA submodule, i.e. the query vector, key vector and value vector respectively; W_i^Q is the mapping matrix of Q in the different subspaces, W_i^K is the mapping matrix of K in the different subspaces, and W_i^V is the mapping matrix of V in the different subspaces; the attention vector on a single subspace is computed as follows: the dot product of the query vector Q and the key vector K is divided by the square root of the dimension of the key vector K

by mapping the input features to different subspaces and then computing the attention vectors, the W-MSA submodule captures the dependency relationships of the features in different subspaces, so that the final attention vector captures the dependencies between features more comprehensively and from multiple perspectives;

before the image features are input to the SW-MSA submodule, they are shifted by half the window size, and the W-MSA submodule operation is then performed, wherein the specific operation flow is as follows: the output features of step 1 are cyclically shifted upward and to the left by half the window size, and windows are then divided on the shifted features in the same way as in the W-MSA submodule, giving window contents different from those of W-MSA; the W-MSA submodule operation is then performed, and after it completes, the resulting image features are cyclically shifted downward and to the right by half the window size to restore their original positions.

5. The method of claim 1, wherein: in step 3, the recovery module restores the image features extracted by the Swin Transformer module to the size of the original input low-light-level image and outputs the enhanced high-quality noise-free color image, and the structure of the recovery module is, in sequence: the output feature of step 2 as the input → Patch Expanding layer → Linear layer → output image;

the Patch Expanding layer performs a rearrange operation that expands the resolution of the input features to 4 times the input resolution and reduces the feature dimension to 1/16 of the input dimension; the Linear layer performs feature mapping by a linear operation, the size of the convolution kernel is H × W, and the total number of feature maps is 3,

when training the Swin Transformer-based low-light-level image enhancement network, it is considered that the L₁ loss function performs better on the contrast of target contours and the smoothness of uniform regions; meanwhile, the SSIM loss function introduces structural constraints and can restore the structure and local details of the image well, and the perceptual loss function constrains the difference between the real image and the predicted image and preserves the fidelity of image perception and details; in this step, the L₁ loss function, the SSIM loss function and the perceptual loss function are combined as the total loss function of the Swin Transformer-based low-light-level image enhancement network, expressed as follows:

L_total＝(1-λ_s-λ_p)L₁+λ_sL_ssim+λ_pL_perc

wherein the L₁ norm loss formula is

The structural similarity loss formula of SSIM is

μ_x and μ_y respectively represent the pixel means of images x and y; σ_xy represents the covariance of images x and y;

the perceptual loss function is formulated as

I_gtRepresenting a real image, I_hRepresenting a predicted image, C_jRepresents a channel, H_jAnd W_jRespectively representing the height and width of the jth feature map,