CN116579940A - Real-time low-illumination image enhancement method based on convolutional neural network - Google Patents

Real-time low-illumination image enhancement method based on convolutional neural network

Info

Publication number
CN116579940A
CN116579940A (application CN202310482075.7A)
Authority
CN
China
Prior art keywords
image
feature
low
enhancement
illumination
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310482075.7A
Other languages
Chinese (zh)
Inventor
刘勇
路红
谢长勇
刘书林
李科华
黄俊健
任豪
陆嘉文
袁履凡
王俐钞
吕传禄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Chinese People's Liberation Army Naval Characteristic Medical Center
Original Assignee
Fudan University
Chinese People's Liberation Army Naval Characteristic Medical Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University and Chinese People's Liberation Army Naval Characteristic Medical Center
Priority to CN202310482075.7A priority Critical patent/CN116579940A/en
Publication of CN116579940A publication Critical patent/CN116579940A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/70 Denoising; Smoothing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20021 Dividing image into blocks, subimages or windows
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20024 Filtering details
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20172 Image enhancement details
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The invention belongs to the technical field of image processing, and particularly relates to a real-time low-illumination image enhancement method based on a convolutional neural network. The method comprises the following steps: (1) preprocessing the low-illumination RAW image, including rearrangement, normalization and pre-amplification; (2) constructing a network, ADU-Net, for low-illumination image enhancement, which uses a feature extraction module based on hole convolution and residual connection and an attention-based adaptive feature fusion module; (3) feeding the low-illumination RAW image into ADU-Net to obtain an enhanced sRGB image. Owing to the efficiency and light weight of ADU-Net, the invention can restore low-illumination images at near real-time speed, effectively improve their brightness, accurately recover color and detail information, and produce a satisfactory visual effect.

Description

Real-time low-illumination image enhancement method based on convolutional neural network
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a low-illumination image enhancement method.
Background
Photography under low-light conditions is a challenging task. Owing to insufficient illumination, captured images often suffer from low brightness and color distortion; such images are generally called low-illumination images. Low-illumination images are not only visually unappealing, but also greatly degrade the performance of a series of downstream tasks such as object detection and face recognition, thereby negatively affecting related applications such as road monitoring and outdoor security. Enhancement of low-illumination images is therefore of great practical importance.
Current low-illumination image enhancement approaches can be divided into two types: physical methods and algorithmic enhancement methods. The core of a physical method is to increase the amount of light entering the camera, which is often difficult to achieve. For example, a larger aperture admits more light but reduces the depth of field, causing blurred imaging, and cannot be implemented on size-constrained devices such as mobile phones and cameras; extending the exposure time is only suitable for static scenes and produces motion blur in dynamic scenes; shooting with a flash results in uneven exposure; and raising the camera ISO amplifies noise. Academia and industry therefore pay more attention to algorithm-based enhancement methods. Representative low-illumination image enhancement algorithms are mainly based on histogram equalization, on Retinex theory, or on deep learning. Histogram-equalization methods adjust the image globally so that its brightness follows a prior distribution; they can effectively improve contrast but ignore local information, so the results are often distorted. Retinex-based methods can effectively preserve the structural information of the image, but they involve solving non-convex optimization problems and are computationally expensive. Moreover, Retinex theory has an inherent drawback in that it ignores the noise component of the image, so enhancement results on noisy images tend to be distorted.
Deep learning-based methods are currently the mainstream in academia and industry. These methods use a neural network (such as a convolutional neural network) to directly learn the mapping from low-illumination images to normal-illumination images, and they show better performance than earlier approaches. Research on deep learning-based enhancement, both domestic and international, has achieved promising results; however, most methods rely on large-scale networks for enhancement and ignore the resulting computational cost and the deployment requirements of real devices.
Disclosure of Invention
The invention aims to provide a real-time low-illumination image enhancement method based on a convolutional neural network, in order to solve the image quality degradation caused by insufficient light reaching the camera and by pronounced imaging noise in poorly lit environments.
The invention provides a real-time low-illumination image enhancement method based on a convolutional neural network, which comprises (I) RAW image preprocessing and (II) enhancement of the preprocessed image with an ADU-Net enhancement network;
(I) The RAW image preprocessing comprises three steps: image rearrangement, normalization and pre-amplification; the preprocessing provides translation invariance for the subsequent convolution operations and reduces the image resolution to accelerate computation;
(II) The ADU-Net enhancement network receives the preprocessed RAW image and performs low-illumination enhancement to obtain an enhanced sRGB image.
In the invention, the RAW image preprocessing method comprises the following specific steps:
(1) Rearranging the RAW image according to its color filter array, so that pixels at the same position within the array are placed on the same channel; taking a Bayer RAW image as an example, the image is rearranged in the R-G-B-G color channel order to obtain a four-channel image whose height and width are half those of the original image; this process can be expressed as:
I′_c(x, y) = I(2x + Δx_c, 2y + Δy_c),
wherein I represents the input image, I′ represents the rearranged image, the subscript c represents the channel number, x and y are spatial coordinates, and (Δx_c, Δy_c) is the offset of channel c within the 2×2 Bayer cell; the rearrangement operation provides translation invariance for the subsequent convolution operations while accelerating computation;
(2) Normalizing the rearranged image, which can be expressed as:
I″_c = (I′_c - BL_c) / (WL_c - BL_c),
wherein BL represents the black level, WL represents the saturation level, and the subscript c is the serial number of the corresponding channel; the black level and the saturation level are read directly from the meta information of the RAW image;
(3) Signal pre-amplification is applied to the normalized image:
I_input = I″ × γ,
where γ is the amplification factor, which can be determined by the user.
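As an illustrative sketch (the function names, the RGGB channel offsets and the array layout are assumptions of this sketch, not specified by the patent), the three preprocessing steps can be written in Python as:

```python
import numpy as np

def pack_bayer(raw: np.ndarray) -> np.ndarray:
    """Step (1): rearrange a Bayer mosaic (H, W) into a 4-channel image (4, H/2, W/2).

    The R-G-B-G order below assumes an RGGB pattern; the actual offsets depend on
    the camera's color filter array.
    """
    h, w = raw.shape
    return np.stack([raw[0:h:2, 0:w:2],   # R
                     raw[0:h:2, 1:w:2],   # G
                     raw[1:h:2, 1:w:2],   # B
                     raw[1:h:2, 0:w:2]],  # G
                    axis=0)

def normalize(packed: np.ndarray, black_level, saturation_level) -> np.ndarray:
    """Step (2): per-channel normalization I'' = (I' - BL) / (WL - BL)."""
    bl = np.asarray(black_level, dtype=np.float32).reshape(-1, 1, 1)
    wl = np.asarray(saturation_level, dtype=np.float32).reshape(-1, 1, 1)
    return np.clip((packed.astype(np.float32) - bl) / (wl - bl), 0.0, None)

def preprocess(raw: np.ndarray, black_level, saturation_level, gamma: float) -> np.ndarray:
    """Steps (1)-(3): rearrangement, normalization and pre-amplification by gamma."""
    return normalize(pack_bayer(raw), black_level, saturation_level) * gamma
```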
In the invention, the ADU-Net enhancement network performs enhancement processing on the preprocessed image; wherein:
The enhancement network model ADU-Net adopts a U-Net structure with an encoder part and a decoder part, to which a feature extraction module based on hole convolution and residual connection and an attention-based adaptive feature fusion module are added. The network takes the preprocessed low-illumination RAW image as input; the encoder part first extracts a multi-scale feature representation, which is then sent to the decoder part for bottom-up feature reconstruction. The decoder part captures the semantic information of the image with the feature extraction module based on hole convolution and residual connection, and fuses it with the shallow texture information from the encoder using the attention-based adaptive feature fusion module. After feature reconstruction in the decoder part, the enhanced sRGB image is finally obtained.
Further, the ADU-Net enhancement network carries out enhancement processing on the preprocessed image, and the specific steps are as follows:
(a) The input image I_input is first projected into the feature space by a 3×3 convolution.
(b) The feature representation obtained in the previous step is sent to the encoder to extract a multi-scale feature representation; each encoder stage consists of a convolution followed by Pixel-Unshuffle downsampling [1], expressed as:
X^(i) = f_conv(X^(i-1)) ↓,
wherein X^(i) represents the output of the i-th convolutional layer, f_conv(·) represents a convolution operation, and ↓ represents the Pixel-Unshuffle downsampling operation [1];
(c) The features obtained in the previous step are sent into the decoder network for feature reconstruction; the features from the decoder are upsampled by bilinear interpolation and then reconstructed by a feature extraction module, while the shallow features from the encoder are passed in through skip connections and combined with them by an adaptive feature fusion process, giving the enhanced feature representation; this process can be expressed as:
Y^(i) = f_AFF^(i)( f_FE^(i)( Y^(i-1) ↑ ), X_t^(i) ),
wherein f_FE^(i)(·) represents the feature extraction module operation, f_AFF^(i)(·) represents the adaptive feature fusion module operation, the superscript i represents the module serial number, X_t^(i) is the skip-connected encoder feature, and ↑ represents the bilinear interpolation operation; specifically:
(c-I) The feature extraction module comprises N residual modules and a feature aggregation operation for efficiently extracting feature representations; the residual modules use hole convolution to enlarge the receptive field and a 1×1 convolution as the feature connection path; the feature aggregation operation is realized by a 1×1 convolution;
In the feature extraction module, different dilation rates are set for residual modules at different positions, where i is the serial number of the residual module, i = 0, 1, …, N-1;
The mechanism of the feature extraction module can be expressed as:
X_{i+1} = f_RB^(i)(X_i), i = 0, 1, …, N-1,
Y = f_conv([X_1, …, X_N]),
wherein X_i is an intermediate feature (X_0 being the module input), Y is the output feature, [·] denotes concatenation in the channel dimension, and f_RB^(i)(·) denotes the residual module [2], with the superscript i indicating the module serial number;
(c-II) The adaptive feature fusion process fuses the shallow texture features X_t from the encoder and the deep semantic features X_s from the decoder. The process first extracts a complementary feature representation Y_comp of the two features using convolution, and injects the deep semantic features from the decoder through a residual connection; the weight of each channel is then adjusted by a residual channel attention mechanism. The mechanism of the adaptive feature fusion process can be expressed as:
Y_comp = f_conv([X_t, X_s]),
A = f_conv(AvgPool([X_t, X_s])),
wherein A is the learned attention weight, Y_fused is the output feature obtained by applying A, through the residual channel attention mechanism, to the residually connected features, AvgPool(·) represents global average pooling, and [·] denotes concatenation in the channel dimension. A code sketch of the modules in (c-I) and (c-II) is given after step (d) below.
(d) The features output by the decoder are upsampled by Pixel-Shuffle [1] to obtain the enhanced sRGB image.
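As a concrete illustration of steps (a)-(d), the following PyTorch sketch shows one possible realization of an encoder stage, the feature extraction module of (c-I) and the adaptive feature fusion module of (c-II). The class names, channel widths, activation functions, number of residual modules and dilation schedule are assumptions of this sketch; the patent text does not fix them.

```python
import torch
import torch.nn as nn

class EncoderStage(nn.Module):
    """One encoder stage: convolution followed by Pixel-Unshuffle downsampling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.down = nn.PixelUnshuffle(2)          # halves H and W, multiplies channels by 4
        self.act = nn.LeakyReLU(0.2, inplace=True)
    def forward(self, x):
        return self.down(self.act(self.conv(x)))

class DilatedResidualBlock(nn.Module):
    """Residual module using hole (dilated) convolution, with a 1x1 shortcut path."""
    def __init__(self, ch, dilation):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=dilation, dilation=dilation),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ch, ch, 3, padding=dilation, dilation=dilation))
        self.skip = nn.Conv2d(ch, ch, 1)          # 1x1 convolution as the feature connection
    def forward(self, x):
        return self.body(x) + self.skip(x)

class FeatureExtractionModule(nn.Module):
    """N residual modules; their outputs are concatenated and aggregated by a 1x1 conv."""
    def __init__(self, ch, dilations=(1, 2, 4)):  # dilation schedule is an assumption
        super().__init__()
        self.blocks = nn.ModuleList(DilatedResidualBlock(ch, d) for d in dilations)
        self.aggregate = nn.Conv2d(ch * len(dilations), ch, 1)
    def forward(self, x):
        feats = []
        for blk in self.blocks:
            x = blk(x)
            feats.append(x)
        return self.aggregate(torch.cat(feats, dim=1))

class AdaptiveFeatureFusion(nn.Module):
    """Fuses shallow encoder features X_t with deep decoder features X_s."""
    def __init__(self, ch):
        super().__init__()
        self.comp = nn.Conv2d(2 * ch, ch, 3, padding=1)   # complementary features Y_comp
        self.attn = nn.Sequential(                        # channel attention weights A
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * ch, ch, 1),
            nn.Sigmoid())
    def forward(self, x_t, x_s):
        cat = torch.cat([x_t, x_s], dim=1)
        y = self.comp(cat) + x_s                  # inject decoder features via residual connection
        a = self.attn(cat)
        return y + y * a                          # residual channel attention re-weighting
```

In this sketch the channel attention is applied as y + y·A after the residual injection; the exact composition used in ADU-Net may differ and is governed by the description above.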
The specific operation flow of the invention is as follows:
(1) Preparing a dataset for training an enhanced network;
(2) Preprocessing the low-illumination image in the data set by using the RAW image preprocessing method;
(3) Training an enhanced network model ADU-Net;
(4) Processing the preprocessed low-illumination image with the trained enhancement network.
Training the enhancement network model comprises setting a training strategy, starting training, and adjusting the training strategy several times to obtain the best-performing model.
The training strategy comprises the image-patch size, batch size, learning-rate schedule, gradient-descent strategy and the like used for training.
By means of an efficient encoder-decoder architecture together with the efficient feature extraction and adaptive feature fusion processes, the invention obtains a lightweight and efficient image enhancement network and solves the problem of the excessive computational cost of traditional low-illumination image enhancement methods. Experiments on a GTX 1080Ti GPU show that, without special optimization, the invention can enhance 4K-resolution extremely dark images (scene brightness below 5 lux) at 25 fps with a peak memory footprint of no more than 1 GB. The invention can therefore meet the real-time image processing requirements of edge devices such as mobile phones and surveillance cameras, and has good application value.
Drawings
Fig. 1 is a diagram of an ADU-Net network architecture for use in the present invention.
Fig. 2 is a diagram showing a structure of a feature extraction module based on hole convolution and residual connection used in the present invention.
Fig. 3 is a diagram of the architecture of an adaptive feature fusion module used in the present invention.
Fig. 4 is a graph comparing experimental effects on a real low-light image.
Detailed Description
The real-time low-illumination image enhancement network used in the invention is shown in Fig. 1: the low-illumination RAW image is preprocessed and fed into the network, and the enhanced sRGB image is obtained after passing through the feature extraction module shown in Fig. 2 and the adaptive feature fusion module shown in Fig. 3.
The embodiment of the invention comprises the following steps:
(1) Preparation of data sets
The SID dataset [3] is used here for training and testing of the enhancement network. The SID dataset contains low-light/normal-light image pairs captured under extremely dark conditions, with ambient brightness between 0.03 lux and 5 lux. The exposure time of the low-light images is between 0.033 s and 0.1 s, and that of the normal-light images between 10 s and 30 s. The dataset contains 5094 image pairs captured with two cameras, a Sony α7S III and a Fujifilm X-T2, with image resolutions of 4240×3842 and 6000×4000 respectively, stored in RAW format. The SID dataset provides a standard training/validation/test split, which the present invention keeps. Without loss of generality, this example uses the Bayer RAW images captured by the Sony α7S III for training and evaluation.
(2) Data preprocessing
The original low-illumination RAW image needs to be rearranged, normalized and pre-amplified, and the specific steps are as follows:
(2a) The Bayer RAW image is rearranged in the R-G-B-G color channel order to obtain a four-channel image whose height and width are half those of the original, i.e. I′_c(x, y) = I(2x + Δx_c, 2y + Δy_c) as in step (1) of the preprocessing method, where I denotes the input image, I′ the rearranged image, and c the channel number. The rearrangement operation provides translation invariance for the subsequent convolution operations while accelerating computation;
(2b) The rearranged image is normalized, specifically I″_c = (I′_c - BL_c) / (WL_c - BL_c), where BL denotes the black level and WL the saturation level; the black level and the saturation level are read directly from the meta information of the RAW image;
(2c) Signal pre-amplification is applied to the normalized image: I_input = I″ × γ, where γ is the amplification factor, set to the ratio of the exposure times of the long- and short-exposure images in the SID dataset.
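A minimal sketch of this loading and preprocessing step is given below, using the rawpy library to read the black level, saturation level and mosaic data; the use of rawpy, the SID file-naming convention assumed for the exposure times, and the alignment of the per-channel black levels with the R-G-B-G packing order are assumptions of this sketch.

```python
import re
import numpy as np
import rawpy

def load_and_preprocess(short_path: str, long_path: str) -> np.ndarray:
    """Read a short-exposure SID RAW file and apply rearrangement,
    normalization and pre-amplification (gamma = long/short exposure ratio)."""
    def exposure_of(path: str) -> float:
        # Assumes SID-style names ending in '_<exposure>s.ARW', e.g. '00001_00_0.1s.ARW'.
        return float(re.search(r'_(\d+(?:\.\d+)?)s\.', path).group(1))

    gamma = exposure_of(long_path) / exposure_of(short_path)
    with rawpy.imread(short_path) as raw:
        mosaic = raw.raw_image_visible.astype(np.float32)
        bl = raw.black_level_per_channel          # black level, one value per CFA position
        wl = float(raw.white_level)               # saturation level
    h, w = mosaic.shape
    packed = np.stack([mosaic[0:h:2, 0:w:2], mosaic[0:h:2, 1:w:2],
                       mosaic[1:h:2, 1:w:2], mosaic[1:h:2, 0:w:2]])  # R-G-B-G (assumed RGGB)
    bl4 = np.asarray(bl, dtype=np.float32).reshape(4, 1, 1)          # channel order assumed to match
    normalized = np.clip((packed - bl4) / (wl - bl4), 0.0, None)
    return normalized * gamma
```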
(3) Model training
The network is trained on a single GTX 1080Ti GPU using 512×512 image patches with a batch size of 1. The network is trained for 4000 epochs in total; the initial learning rate is set to 1×10⁻⁴ and reduced to 1×10⁻⁵ after 2000 epochs. The network is trained with the Adam optimizer.
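A minimal training-loop sketch under these settings is shown below; the dataset interface yielding (low-light crop, ground-truth sRGB) pairs and the L1 loss are assumptions of this sketch (the patent does not name the loss function).

```python
import torch
from torch.utils.data import DataLoader

def train(model: torch.nn.Module, dataset, device: str = 'cuda') -> None:
    """Training schedule from the embodiment: 512x512 patches, batch size 1,
    Adam, lr 1e-4 for the first 2000 epochs, then 1e-5 until epoch 4000."""
    loader = DataLoader(dataset, batch_size=1, shuffle=True, num_workers=4)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = torch.nn.L1Loss()                    # loss choice is an assumption
    model.to(device).train()
    for epoch in range(4000):
        if epoch == 2000:                            # learning-rate drop after 2000 epochs
            for group in optimizer.param_groups:
                group['lr'] = 1e-5
        for low, gt in loader:                       # 512x512 RAW crop and its sRGB target
            low, gt = low.to(device), gt.to(device)
            optimizer.zero_grad()
            loss = criterion(model(low), gt)
            loss.backward()
            optimizer.step()
```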
(4) Model testing
After training is completed, the network is evaluated on the SID test-set images. The images are preprocessed in the same way as in step (2). The first row of Fig. 4 shows results on the SID test set: the original low-light image has very low brightness and, after amplification, exhibits significant noise and color distortion; compared with the method proposed by Lamba et al. [4], the present method restores the image noticeably better, with more accurate colors and finer detail.
The second row of Fig. 4 shows the generalization results of the present invention: the low-light image was captured with a Canon 6D camera and enhanced with a model trained on the SID dataset. Compared with the method of Lamba et al., the present invention achieves more accurate color recovery.
The performance data of the present invention are given in Table 1. The GPU figures were measured on a single GTX 1080Ti card and the CPU figures on a 4-core Intel i7-6700K, with the algorithm running in single-threaded mode. The invention offers high running speed and low computational cost, and can meet the requirements of real-time image processing.
TABLE 1
MACs (G) | Parameters (M) | Inference time (s), CPU / GPU | GPU peak memory (GB)
0.18 | 0.635 | 1.1791 / 0.0367 | 0.79
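Timing and memory figures of this kind can be measured with standard PyTorch utilities; the sketch below assumes a trained model and an example 4-channel packed input shape, and is not the exact benchmarking script used for Table 1.

```python
import time
import torch

@torch.no_grad()
def benchmark(model: torch.nn.Module,
              input_shape=(1, 4, 1416, 2120),   # example packed-input shape (assumed)
              device: str = 'cuda',
              runs: int = 20):
    """Average single-image inference time (s) and peak GPU memory (GB)."""
    model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.synchronize(device)
    start = time.time()
    for _ in range(runs):
        model(x)
    torch.cuda.synchronize(device)
    avg_time = (time.time() - start) / runs
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024 ** 3
    return avg_time, peak_gb
```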
The foregoing description of the invention has been presented for purposes of illustration and description, and is not intended to be limiting. Any partial modification or replacement within the technical scope of the present disclosure by any person skilled in the art should be included in the scope of the present disclosure.
Reference to the literature
[1] Shi W, Caballero J, Huszár F, et al. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 1874-1883.
[2] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.
[3] Chen C, Chen Q, Xu J, et al. Learning to see in the dark[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 3291-3300.
[4] Lamba M, Mitra K. Restoring extremely dark images in real time[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 3487-3497.

Claims (7)

1. A real-time low-illumination image enhancement method based on a convolutional neural network, characterized by comprising the following specific steps:
(I) RAW image preprocessing, including image rearrangement, normalization and pre-amplification; the preprocessing provides translation invariance for the subsequent convolution operations and reduces the image resolution to accelerate computation;
(II) performing low-illumination enhancement on the preprocessed RAW image with an ADU-Net enhancement network to obtain an enhanced sRGB image.
2. The real-time low-illumination image enhancement method according to claim 1, wherein the RAW image preprocessing comprises the following specific steps:
(1) Rearranging the RAW image according to its color filter array, so that pixels at the same position within the array are placed on the same channel; for a Bayer RAW image, the image is rearranged in the R-G-B-G color channel order to obtain a four-channel image whose height and width are half those of the original image, this process being expressed as:
I′_c(x, y) = I(2x + Δx_c, 2y + Δy_c),
wherein I represents the input image, I′ represents the rearranged image, the subscript c represents the channel number, x and y are spatial coordinates, and (Δx_c, Δy_c) is the offset of channel c within the 2×2 Bayer cell;
(2) Normalizing the rearranged image, specifically expressed as:
I″_c = (I′_c - BL_c) / (WL_c - BL_c),
wherein BL represents the black level, WL represents the saturation level, and the subscript c is the serial number of the corresponding channel; the black level and the saturation level are read directly from the meta information of the RAW image;
(3) Signal pre-amplification is applied to the normalized image:
I_input = I″ × γ,
where γ is the amplification factor, which can be determined by the user.
3. The real-time low-illumination image enhancement method according to claim 2, wherein the ADU-Net enhancement network performs enhancement processing on the preprocessed image; wherein:
the enhancement network model ADU-Net adopts a U-Net structure with an encoder and a decoder, to which a feature extraction module based on hole convolution and residual connection and an attention-based adaptive feature fusion module are added; the enhancement network model takes the preprocessed low-illumination RAW image as input, first extracts a multi-scale feature representation through the encoder, and then sends it to the decoder for bottom-up feature reconstruction; the decoder captures the semantic information of the image using the feature extraction module based on hole convolution and residual connection, fuses it with the shallow texture information from the encoder using the attention-based adaptive feature fusion module, and finally obtains the enhanced sRGB image after feature reconstruction in the decoder.
4. The real-time low-illumination image enhancement method according to claim 3, wherein said ADU-Net enhancement network performs enhancement processing on the preprocessed image, specifically comprising the steps of:
(a) projecting the input image I_input into the feature space by a 3×3 convolution;
(b) sending the feature representation obtained in the previous step to the encoder to extract a multi-scale feature representation, each encoder stage comprising a convolution and a Pixel-Unshuffle downsampling;
(c) sending the features obtained in the previous step into the decoder network for feature reconstruction; the features from the decoder are upsampled by bilinear interpolation and reconstructed by a feature extraction module, while the shallow features from the encoder are passed in through skip connections and fused by an adaptive feature fusion module to obtain the enhanced feature representation;
(d) The features output by the decoder are subjected to Pixel-Shuffle upsampling to obtain an enhanced sRGB image.
5. The method according to claim 4, wherein in step (c), the feature extraction module comprises N residual modules and one feature aggregation module for efficiently extracting feature representations; the residual modules use hole convolution to enlarge the receptive field and a 1×1 convolution as the feature connection path; the feature aggregation module is realized by a 1×1 convolution;
in the feature extraction module, different dilation rates are set for residual modules at different positions, where i is the serial number of the residual module, i = 0, 1, …, N-1;
the mechanism of the feature extraction module is expressed as:
X_{i+1} = f_RB^(i)(X_i), i = 0, 1, …, N-1,
Y = f_conv([X_1, …, X_N]),
wherein X_i is an intermediate feature (X_0 being the module input), f_RB^(i)(·) denotes the residual module operation with the superscript i indicating the module serial number, Y is the output feature, and [·] denotes concatenation in the channel dimension.
6. The method according to claim 5, wherein in step (c), the adaptive feature fusion module is configured to fuse the shallow texture features X_t from the encoder and the deep semantic features X_s from the decoder; the process first extracts the complementary feature representation Y_comp of the two features using convolution, then injects the deep semantic features from the decoder through a residual connection, and then adjusts the weight of each channel through a residual channel attention mechanism; the mechanism of the adaptive feature fusion process is expressed as:
Y_comp = f_conv([X_t, X_s]),
A = f_conv(AvgPool([X_t, X_s])),
wherein A is the learned attention weight, Y_fused is the output feature obtained by applying A, through the residual channel attention mechanism, to the residually connected features, AvgPool(·) represents the global average pooling operation, and [·] denotes concatenation in the channel dimension.
7. The real-time low-illumination image enhancement method according to any one of claims 1 to 6, wherein the specific operation flow is as follows:
(1) Preparing a dataset for training an enhanced network;
(2) Preprocessing the low-illumination image in the data set by using the RAW image preprocessing method;
(3) Training an enhanced network model ADU-Net;
(4) Processing the preprocessed low-illumination image by using a trained enhancement network;
training the enhancement network model comprises setting a training strategy, starting training, and adjusting the training strategy several times to obtain the best-performing model;
the training strategy comprises the image-patch size, batch size, learning-rate schedule and gradient-descent strategy used for training.
CN202310482075.7A 2023-04-29 2023-04-29 Real-time low-illumination image enhancement method based on convolutional neural network Pending CN116579940A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310482075.7A CN116579940A (en) 2023-04-29 2023-04-29 Real-time low-illumination image enhancement method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310482075.7A CN116579940A (en) 2023-04-29 2023-04-29 Real-time low-illumination image enhancement method based on convolutional neural network

Publications (1)

Publication Number Publication Date
CN116579940A true CN116579940A (en) 2023-08-11

Family

ID=87542454

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310482075.7A Pending CN116579940A (en) 2023-04-29 2023-04-29 Real-time low-illumination image enhancement method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN116579940A (en)

Similar Documents

Publication Publication Date Title
CN109671023B (en) Face image super-resolution secondary reconstruction method
Jiang et al. Unsupervised decomposition and correction network for low-light image enhancement
CN110675328B (en) Low-illumination image enhancement method and device based on condition generation countermeasure network
CN112348747A (en) Image enhancement method, device and storage medium
CN112465727A (en) Low-illumination image enhancement method without normal illumination reference based on HSV color space and Retinex theory
CN113822830B (en) Multi-exposure image fusion method based on depth perception enhancement
CN113658057A (en) Swin transform low-light-level image enhancement method
CN114219722A (en) Low-illumination image enhancement method by utilizing time-frequency domain hierarchical processing
CN115018708A (en) Airborne remote sensing image super-resolution reconstruction method based on multi-scale feature fusion
CN116797488A (en) Low-illumination image enhancement method based on feature fusion and attention embedding
Lv et al. Low-light image enhancement via deep Retinex decomposition and bilateral learning
CN115393227A (en) Self-adaptive enhancing method and system for low-light-level full-color video image based on deep learning
Li et al. Flexible piecewise curves estimation for photo enhancement
Zhang et al. Multi-branch and progressive network for low-light image enhancement
CN114830168A (en) Image reconstruction method, electronic device, and computer-readable storage medium
CN112070686A (en) Backlight image cooperative enhancement method based on deep learning
CN116563133A (en) Low-illumination color image enhancement method based on simulated exposure and multi-scale fusion
Oh et al. Residual dilated u-net with spatially adaptive normalization for the restoration of under display camera images
CN116579940A (en) Real-time low-illumination image enhancement method based on convolutional neural network
EP3913572A1 (en) Loss function for image reconstruction
CN114511487A (en) Image fusion method and device, computer readable storage medium and terminal
Xiang et al. Image Super-Resolution Method Based on Improved Generative Adversarial Network
Sun et al. Fractal pyramid low-light image enhancement network with illumination information
Fang et al. UGNet: Underexposed Images Enhancement Network based on Global Illumination Estimation
CN114897718B (en) Low-light image enhancement method capable of balancing context information and space detail simultaneously

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination