CN117115616A - Real-time low-illumination image target detection method based on convolutional neural network - Google Patents
- Publication number
- CN117115616A (Application CN202310940678.7A)
- Authority
- CN
- China
- Prior art keywords
- image
- low
- network
- enhancement
- real
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/30—Noise filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/50—Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20212—Image combination
- G06T2207/20221—Image fusion; Image merging
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a real-time low-illumination image target detection method based on a convolutional neural network, applied in the technical field of computer vision, comprising the following steps: constructing a real low-illumination image dataset; downsampling the high-resolution image during image enhancement to reduce computation cost; restoring the contrast of the low-illumination image through depth curve estimation to improve image quality; using a lightweight network in target detection to meet the real-time requirement of the overall detection model; using an efficient coordinate attention mechanism to focus on channel and spatial position information, strengthening the feature learning capability of the network; taking image enhancement as the preprocessing part of target detection to form an "enhancement + detection" model; adding a weight to each feature channel, learning the importance of each channel in the feature map, fusing the original image and the enhanced image through two channels, suppressing the noise amplification caused by image enhancement, and establishing a complementary relation between the original image and the enhanced image.
Description
Technical Field
The invention belongs to the field of deep learning and computer vision, and particularly relates to a real-time low-illumination image target detection method based on a convolutional neural network.
Background
With the development of deep learning in computer vision, target detection algorithms have gradually evolved into two branches: one-stage and two-stage. One-stage algorithms treat the detection task as a regression problem of localization and classification, while two-stage algorithms first select candidate regions and then classify them. Compared with one-stage algorithms, two-stage algorithms usually need to be deployed on platforms with greater computing power and take longer to run, which does not meet the real-time requirement of target detection. Existing detection algorithms such as R-CNN, SSD and YOLO achieve good detection results on general-purpose datasets such as ImageNet, COCO and VOC, and are widely applied in intelligent traffic, face recognition, pathological analysis, industrial inspection and other fields. However, real-world imaging is affected by illumination and equipment, and captured images suffer from insufficient contrast and low signal-to-noise ratio. Such low-quality images not only degrade the visual effect but also make downstream vision tasks harder, seriously reducing the detection accuracy of these algorithms.
Researchers generally take two approaches to detection in low-illumination images. One acquires images with devices such as thermal imaging or infrared sensors, but this places high demands on physical equipment and is costly. The other restores image quality through image enhancement, but traditional histogram equalization and Retinex-based methods focus on restoring contrast and fail to recover the true colors of the image. With the application of deep learning to image processing, convolutional neural networks can extract high-level semantics, learn characteristics such as image contrast and illumination color, and produce more expressive results. Rather than judging image quality by visual appearance alone, this approach is combined with the downstream vision task: image enhancement is used as a preprocessing operation for target detection, and the enhancer and detector are cascaded into an "enhancement + detection" strategy, reducing the impact of low-illumination images on the target detection algorithm.
Disclosure of Invention
To solve the problem of low target detection accuracy in low-illumination environments, the invention provides a real-time low-illumination image target detection method based on a convolutional neural network. Addressing the insufficient contrast and low signal-to-noise ratio of low-illumination images, the method uses a deep network to enhance and restore the image without relying on physical equipment such as supplementary lighting or infrared sensors, and optimizes the dataset used to train the neural network through data augmentation. The invention mainly solves two technical problems: first, image enhancement restores the image but also amplifies noise, leaving the detection model with a weak ability to recognize blurred objects; second, a plain cascade of enhancer and detector has too many parameters and too high a computation cost in practice, and cannot achieve real-time detection on embedded platforms with limited computing power.
For the first problem, the invention designs a feature fusion module based on a channel attention mechanism, forming a feature extraction network that attends to pixel-level information; this fusion module fuses the low-level features of the enhanced image and the original image to strengthen the recognition of blurred objects. For the second problem, the enhancer is lightweighted: the ordinary convolutions in the deep network are replaced with depth separable convolutions, greatly reducing the enhancer's parameter count and improving its processing speed and inference capability. The technical scheme adopted by the invention is as follows:
a real-time low-illumination image target detection method based on a convolutional neural network comprises the following steps:
step one, configuring a deep learning software environment, and configuring an image enhancement algorithm and a target detection algorithm environment based on a convolutional neural network;
step two, constructing a low-illumination image data set, acquiring a real low-illumination image, marking the image, and summarizing the image into a tag data set;
step three, an enhancement recovery module is established, and a weight mechanism is used for inhibiting noise amplification caused by image enhancement;
step four, constructing a deep neural network and establishing a cascaded enhancer-and-detector detection mode: the dataset first undergoes image enhancement, which serves as the preprocessing part of the detection network, followed by feature extraction and detection;
step five, adopting a lightweight network: the image is downsampled during preprocessing, and standard convolution is replaced with depth separable convolution to meet the network's real-time requirement;
step six, optimizing the network: an attention mechanism is added to the feature extraction network to compensate for the accuracy loss caused by lightweighting, so that the overall model balances accuracy and speed;
step seven, training the neural network model and verifying the detection effect in low-light environments.
Specifically:
In step one, the target detection algorithm is a two-step algorithm based on candidate regions or a single-step algorithm based on regression, and the image enhancement algorithm is a histogram image enhancement algorithm, a tone-mapping image enhancement algorithm or a Retinex image enhancement algorithm;
In step two, the low-illumination image dataset is a composite dataset: real low-illumination images are first collected from the Internet, and low-illumination images screened from public datasets are added for expansion; the images are then labeled and summarized into a tag dataset; finally, the tag dataset is randomly divided into a training set, a verification set and a test set;
In step three, the enhancement recovery module takes the enhanced image and the original image as two-channel input for fusion and assigns a weight to each feature channel using a channel attention mechanism; the importance of each channel in the feature map is then learned through the neural network; finally, the two-channel input feature channels are aggregated according to these weights, which raises the module's attention to channels carrying target feature information and suppresses the noise amplified during image enhancement;
In step four, the cascade network of enhancer and detector consists of a low-illumination image enhancement algorithm, a cascade module and a target detection algorithm, and comprises an image preprocessing layer, a feature fusion layer, a feature extraction layer and a prediction layer. The low-illumination image enhancement algorithm serves as the preprocessing layer of the network to enhance image quality and comprises histogram equalization and methods based on Retinex theory or curve mapping; the cascade module is the enhancement recovery module of step three; the target detection algorithm comprises a two-step algorithm based on candidate regions or a single-step algorithm based on classification regression. Feature fusion is carried out through a network model such as CSPDarknet, VGG or MobileNet, feature extraction through a network structure such as the feature pyramid FPN, PANet or BiFPN, and classification regression through 3×3 and 1×1 convolution modules, which calculate the intersection-over-union and predict the probability that a target appears in each prior box; a minimal sketch of this cascade follows this paragraph;
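As an illustration, the cascade can be sketched in PyTorch; the class name and constructor arguments are placeholders, and any enhancer, fusion module and detector matching the description above could be substituted.

```python
import torch.nn as nn

class EnhanceDetect(nn.Module):
    """"Enhancement + detection" cascade: the enhancer acts as the
    preprocessing layer, the enhancement recovery module fuses the
    enhanced and original images, and the detector predicts targets."""
    def __init__(self, enhancer: nn.Module, fusion: nn.Module, detector: nn.Module):
        super().__init__()
        self.enhancer = enhancer    # e.g. a ZeroDCE-style curve-estimation network
        self.fusion = fusion        # enhancement recovery module (step three)
        self.detector = detector    # e.g. a YOLO-style one-stage detector

    def forward(self, x):
        enhanced = self.enhancer(x)         # image preprocessing layer
        fused = self.fusion(enhanced, x)    # feature fusion layer
        return self.detector(fused)         # feature extraction + prediction layers
```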
In step five, for model lightweighting, the preprocessing part takes the downsampled small-scale image as the input of the preprocessing layer, reducing the computation cost of convolutional-layer learning; the enhanced result is then restored to the original resolution through upsampling and carried into subsequent operations, as in the sketch after this paragraph; finally, ordinary convolution is replaced with depth separable convolution, reducing the parameter count to about one tenth of the original;
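A minimal sketch of this downsample-enhance-upsample scheme follows, assuming a ZeroDCE-style curve-estimation network whose output has the same number of channels as the image; the scale factor, iteration count and function name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def enhance_with_downsampling(image: torch.Tensor, curve_net,
                              scale: int = 4, iterations: int = 8) -> torch.Tensor:
    """Estimate enhancement curves on a downsampled copy of the image,
    upsample the curve parameters back to full resolution, then apply the
    iterative quadratic curve LE(x) = x + a*x*(1-x) to the original image."""
    h, w = image.shape[-2:]
    small = F.interpolate(image, scale_factor=1.0 / scale,
                          mode='bilinear', align_corners=False)
    curves = curve_net(small)                     # cheap low-resolution pass
    curves = F.interpolate(curves, size=(h, w),
                           mode='bilinear', align_corners=False)
    x = image
    for _ in range(iterations):                   # iterative enhancement
        x = x + curves * x * (1 - x)
    return x
```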
In step six, an attention mechanism is added to the feature extraction network to compensate for the accuracy loss caused by network lightweighting. Under limited computing capacity, the attention mechanism allocates computing resources to the more important tasks, giving the neural network a feature extraction capability that concentrates on spatial and channel information, so that the overall model achieves a balance between accuracy and speed.
In step seven, the model training part randomly divides the low-illumination image dataset into a training set, a verification set and a test set at a ratio of 8:1:1 and generates the low-illumination image target detection model. The detection effect of the model is then verified: real images are captured under low-illumination conditions and fed into a conventional target detection model and into the low-illumination target detection model of the invention, respectively, to compare detection effects;
the invention has the beneficial effects that:
firstly, the invention uses the image enhancement algorithm as a preprocessing step of target detection, is more suitable for extracting the characteristics of the low-illumination image, and can improve the accuracy of the neural network on the low-illumination image identification. And secondly, the invention designs an enhancement recovery module based on a channel attention mechanism, and weight regulation is carried out on the image noise amplification problem caused by image enhancement, so that an image with higher quality is obtained. And thirdly, the invention pre-processes the downsampled image, reduces the requirement of the model on the calculation force, and can apply the model to a platform with lower calculation force such as a mobile terminal or embedded equipment. Then, the method replaces standard convolution with depth separable convolution in the feature extraction stage, so that the overall model is improved in detection speed, and the requirement of real-time detection is met. Finally, the invention does not need to use hardware equipment such as infrared imaging and the like to process the image, and has lower cost.
Drawings
FIG. 1 is a flow chart of real-time low-light level target detection based on convolutional neural networks;
FIG. 2 is a block diagram of an enhanced recovery module;
FIG. 3 is a block diagram of a depth separable convolution;
FIG. 4 is a block diagram of the coordinate attention mechanism.
Detailed description of the preferred embodiments
To present the technical solution and features of the present invention more clearly, the invention is explained below with reference to the accompanying drawings, but is not limited to these examples.
Example 1:
a method for detecting a real-time low-illuminance image target based on a convolutional neural network, the method comprising:
step one, configuring an environment: and configuring an image enhancement algorithm and a target detection algorithm environment based on deep learning. The required development environment is configured under the window system, wherein the computer graphics card used is RTX3060, and each application environment is python 3.9.7,anconda 4.11.0,cuda11.0. The present example obtains the open source procedure of the object detection algorithm YOLOX and the image enhancement algorithm ZeroDCE on the gitsub.
Step two, collecting data: construct the low-illumination image dataset, acquire real low-illumination images, label them and summarize them into a tag dataset. This example uses the open-source real low-light dataset ExDark, which covers 10 low-light conditions of different degrees and contains 7363 low-light images in 12 categories such as people, bicycles, boats and chairs. Since the PASCAL VOC dataset and the real dim-light detection dataset ExDark share 10 object classes, 2760 low-light images were screened from the VOC2007 dataset for expansion, forming a new dataset A. To facilitate YOLOX training, the labels in dataset A are converted to VOC2007 format and the image resolution is adjusted to fit the network input.
Step three, establishing the enhancement recovery module and using a weighting mechanism to suppress the noise amplification caused by image enhancement. Referring to the network structure of SKNet, this embodiment proposes a new cascade module. The enhancement recovery module, shown in FIG. 2, consists of an input layer, a feature fusion layer and a feature aggregation layer. The input layer takes the enhanced image features and the original image features as the input of the module. Considering that pixel-wise addition (Point-Wise Addition) would let the fused features be affected by the noise amplified in the enhanced image, a vector splicing (Concatenate) method is chosen to fuse the two input channels, giving a fused feature of size 2C×H×W, i.e. U ∈ R^(2C×H×W). The calculation can be expressed as U = Concat(U_1, U_2), where U denotes the fused image features, U_1 the enhanced image features and U_2 the original image features. Next, to characterize the importance of each channel's information, the feature fusion layer encodes the feature channels over the H and W dimensions by global average pooling, reducing each channel of U to a single number; the channel statistic M is computed as

M_c = F_gap(U_c) = (1 / (W×H)) Σ_{i=1..W} Σ_{j=1..H} U_c(i, j),

where W and H are the width and height of the feature map and (i, j) is a spatial location. To learn the correlation between feature channels, the module feeds M through two fully connected branches F_fc that first reduce and then restore the dimension, producing the weight vectors

Z_a = F_fc(M, W) = σ(W_a δ(W M)), Z_b = F_fc(M, W) = σ(W_b δ(W M)),

where Z_a and Z_b are the two output weight vectors; W is the parameter of the first (shared) fully connected layer, of dimension C/γ × C, with γ a scaling factor that reduces the vector dimension and the amount of computation; W_a and W_b are the parameters of the second fully connected layer in the two branches, each of dimension C × C/γ, used to generate the weight vectors corresponding to the input features; δ is the ReLU activation function and σ is the Sigmoid layer. Finally, the feature aggregation layer normalizes the channel weights Z_a, Z_b of the enhanced and original image features with a softmax function and combines the extracted features by weighted addition to obtain the feature map

U⁺ = Z_a · U_1 + Z_b · U_2.

Compared with connecting the enhancer directly to the detector, the enhancement recovery module proposed by the invention first aggregates the enhanced image with the original image, which improves image quality and reduces the influence of the noise amplified by enhancement.
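For illustration, the module can be sketched in PyTorch as follows. This is a minimal sketch, assuming the 2C-channel statistic feeds a shared reduction layer and applying the softmax directly across the two branch outputs (the intermediate Sigmoid is folded into it); the class name and reduction factor are placeholders, not taken from the patent.

```python
import torch
import torch.nn as nn

class EnhancementRecoveryModule(nn.Module):
    """Two-branch channel-attention fusion of enhanced and original features."""
    def __init__(self, channels: int, gamma: int = 16):
        super().__init__()
        reduced = max(channels // gamma, 8)
        self.gap = nn.AdaptiveAvgPool2d(1)                 # M = F_gap(U)
        self.fc_reduce = nn.Linear(2 * channels, reduced)  # W: shared reduction layer
        self.relu = nn.ReLU(inplace=True)                  # delta
        self.fc_a = nn.Linear(reduced, channels)           # W_a branch
        self.fc_b = nn.Linear(reduced, channels)           # W_b branch

    def forward(self, enhanced: torch.Tensor, original: torch.Tensor) -> torch.Tensor:
        u1, u2 = enhanced, original
        u = torch.cat([u1, u2], dim=1)                     # U = Concat(U1, U2): 2C channels
        m = self.gap(u).flatten(1)                         # per-channel statistic M
        z = self.relu(self.fc_reduce(m))                   # reduce, then branch
        logits = torch.stack([self.fc_a(z), self.fc_b(z)], dim=1)
        weights = torch.softmax(logits, dim=1)             # Z_a, Z_b sum to 1 per channel
        z_a = weights[:, 0].unsqueeze(-1).unsqueeze(-1)
        z_b = weights[:, 1].unsqueeze(-1).unsqueeze(-1)
        return z_a * u1 + z_b * u2                         # U+ = Z_a*U1 + Z_b*U2
```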
Step four, model lightweighting, ensuring the model meets the real-time detection requirement. ZeroDCE and YOLOX are chosen as the image enhancement and target detection models, respectively. First, for the image enhancement part, the downsampled small-scale image is taken as the input of the deep curve-estimation network DCE-Net; the estimated curve parameter maps are upsampled back to the original resolution and the subsequent iterative enhancement is then performed. Taking a low-resolution image as input in this way significantly reduces the computation cost. Second, for the target detection part, the ordinary convolutions used by the feature extraction network are replaced with more efficient depth separable convolutions. FIG. 3 shows how a standard convolution (a) is decomposed into a depth-wise convolution (b) and a point-wise convolution (c). A standard convolution layer takes an input feature map of size M×D_F×D_F and applies N kernels of size M×D_K×D_K, producing an output feature map of size N×D_G×D_G, where D_F is the width and height of the input feature map, D_G the width and height of the output feature map, D_K the spatial dimension of the convolution kernel, M the number of input channels and N the number of output channels. The parameter counts and computation costs of the standard and depth separable convolutions are then: standard convolution parameters, D_K×D_K×M×N; standard convolution computation, D_K×D_K×M×N×D_F×D_F; depth separable convolution parameters, D_K×D_K×M + M×N; depth separable convolution computation, D_K×D_K×M×D_F×D_F + M×N×D_F×D_F. Taking the ratio of the parameter counts and of the computation costs shows that the lightweight feature extraction network needs only about one ninth of the original parameters and computation, which greatly shrinks the model and improves its inference speed.
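The parameter comparison above can be checked with a short sketch. The layer composition here (BatchNorm and ReLU after the point-wise convolution) follows the common MobileNet-style pattern and is an assumption; the patent gives only the counting formulas.

```python
import torch.nn as nn

def conv_params(d_k: int, m: int, n: int) -> tuple:
    """Return (standard, depth-separable) parameter counts for one conv layer."""
    standard = d_k * d_k * m * n                # D_K x D_K x M x N
    separable = d_k * d_k * m + m * n           # depth-wise + point-wise
    return standard, separable

class DepthwiseSeparableConv(nn.Module):
    """Depth-wise convolution (one filter per channel) followed by a
    point-wise 1x1 convolution that mixes channels."""
    def __init__(self, in_ch: int, out_ch: int, kernel: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel, padding=kernel // 2,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Ratio = 1/N + 1/D_K^2; for D_K = 3, M = N = 256 it is about 0.115, i.e. roughly one ninth:
std, sep = conv_params(3, 256, 256)
print(std, sep, round(sep / std, 3))            # 589824 67840 0.115
```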
Step five, model optimization: an attention mechanism is added to the feature extraction network to compensate for the accuracy loss caused by lightweighting. This embodiment adds the efficient coordinate attention (CA) mechanism, designed for mobile networks, to the feature extraction layer. It encodes horizontal and vertical position information into the feature channels, allowing a mobile network to attend to position information over a large range and to locate and recognize targets better without adding excessive computation. The CA module aims to strengthen the expressive power of the features learned by a mobile network; its structure is shown in FIG. 4. To capture attention along both the width and the height of the image and encode precise position information, CA first performs global average pooling on the input feature map separately along the width and height directions, obtaining one feature map for each direction. The two directional feature maps, which carry a global receptive field, are concatenated and passed through a shared 1×1 convolution module that reduces their channel dimension to C/r; the batch-normalized feature map F1 is then passed through a Sigmoid activation function, giving a feature map of shape 1×(W+H)×C/r. This feature map is split back according to the original height and width, and each part goes through a 1×1 convolution to obtain feature maps F_h and F_w with the same number of channels as the original feature map; after another Sigmoid activation, these yield the attention weights of the feature map along the height and the width. Finally, multiplying the original feature map by both weights produces the feature map carrying attention weights in the width and height directions.
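A sketch of the CA module as described above, in PyTorch; the reduction ratio r and layer names are assumptions, and the intermediate Sigmoid follows the text above (the published CA design uses an h-swish there instead).

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate attention (CA): directional pooling along height and width,
    a shared 1x1 convolution, then per-direction attention weights."""
    def __init__(self, channels: int, r: int = 32):
        super().__init__()
        mid = max(channels // r, 8)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # pool along width  -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # pool along height -> (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, 1, bias=False)  # shared 1x1, reduce to C/r
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Sigmoid()    # follows the text above; CA reference code uses h-swish
        self.conv_h = nn.Conv2d(mid, channels, 1)       # restore channels for F_h
        self.conv_w = nn.Conv2d(mid, channels, 1)       # restore channels for F_w

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, _, h, w = x.size()
        xh = self.pool_h(x)                             # (B, C, H, 1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)         # (B, C, W, 1)
        y = torch.cat([xh, xw], dim=2)                  # concatenate both directions
        y = self.act(self.bn(self.conv1(y)))            # F1: shape (B, C/r, H+W, 1)
        yh, yw = torch.split(y, [h, w], dim=2)          # split back by height / width
        a_h = torch.sigmoid(self.conv_h(yh))                      # attention over height
        a_w = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2)))  # attention over width
        return x * a_h * a_w                            # weighted original feature map
```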
Step six, model training and testing of the detection effect in low-illumination environments. Dataset A from step two is randomly divided into a training set, a verification set and a test set at a ratio of 8:1:1, and the low-illumination target detection model is finally generated after training.
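The 8:1:1 random split can be sketched as follows; the function name and seed are illustrative, and the sample count matches the 7363 + 2760 images of dataset A described above.

```python
import random

def split_dataset(samples, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Randomly split a list of labeled images into train/val/test (8:1:1)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_dataset(list(range(7363 + 2760)))
print(len(train), len(val), len(test))   # 8098 1012 1013
```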
Experimental results:
and (3) simultaneously sending the low-illumination image into a common target detection model and a model of the embodiment to verify the weak light detection effect. The experimental results are shown in table 1,
table 1 comparison of results
The model of the invention exceeds YOLOX in both detection speed and accuracy, with a smaller model size. In addition, the invention reduces the resolution of the original image by downsampling and restores the enhanced result by upsampling, which lowers the model's demand for computing power without affecting the enhancement effect, makes the model run faster, and meets the real-time detection requirement.
While the invention has been described in detail in connection with specific embodiments, it will be understood that it is not limited to the particular details described above, and that various changes and modifications may be made by one skilled in the art without departing from the spirit of the invention, within the scope of the appended claims.
Claims (8)
1. A real-time low-illumination image target detection method based on a convolutional neural network comprises the following steps:
step one, configuring a deep learning software environment, and configuring an image enhancement algorithm and a target detection algorithm environment based on a convolutional neural network;
step two, constructing a low-illumination image data set, acquiring a real low-illumination image, marking the image, and summarizing the image into a tag data set;
step three, an enhancement recovery module is established, and a weight mechanism is used for inhibiting noise amplification caused by image enhancement;
step four, constructing a deep neural network and establishing an "enhancement+detection" mode, wherein the dataset first undergoes image enhancement, serving as the preprocessing part of the detection network, and feature extraction and detection are then performed;
step five, a lightweight network is adopted, the image is subjected to downsampling treatment in the preprocessing process, and standard convolution is replaced by depth separable convolution, so that the real-time requirement of the network is ensured;
step six, optimizing the network, focusing on the channel and space position information by using a high-efficiency coordinate attention mechanism, and enhancing the characteristic learning capability of the network;
and step seven, training a neural network model, and verifying the detection effect of the low-light environment.
2. The method according to claim 1, wherein the target detection algorithm in the first step is a two-step algorithm based on candidate regions or a single-step algorithm based on regression; the image enhancement algorithm is a histogram image enhancement algorithm, a tone mapped image enhancement algorithm, or a Retinex image enhancement algorithm.
3. The method for detecting real-time low-illuminance images according to claim 1, wherein in step two the low-illuminance image dataset is a composite dataset: real low-illuminance images are first obtained from the Internet, and low-illuminance images screened from public datasets are added for expansion; secondly, the images are labeled and summarized into a tag dataset; finally, the tag dataset is divided into a training set, a verification set and a test set.
4. The method according to claim 1, wherein the enhancement recovery module first fuses the enhanced image with the original image as input in a dual-channel manner, and adds a weight to each feature channel by using a channel attention mechanism; secondly, learning the importance of each channel in the feature map through a neural network; and finally, aggregating the input characteristic channels of the two channels according to the weights, improving the attention of the module to the target characteristic information channel, and inhibiting the influence of noise amplification during image enhancement.
5. The method according to claim 1, wherein the "enhancement+detection" network in step four consists of a low-illumination image enhancement algorithm, a cascade module and a target detection algorithm, and comprises an image preprocessing layer, a feature fusion layer, a feature extraction layer and a prediction layer; the low-illumination image enhancement algorithm serves as the preprocessing layer of the network to enhance image quality and comprises a histogram equalization method and a method based on Retinex theory or depth curve estimation; the cascade module is the enhancement recovery module of claim 4; the target detection algorithm comprises a two-step algorithm based on candidate regions or a single-step algorithm based on classification regression; feature fusion is carried out through a network model such as CSPDarknet, VGG or MobileNet, feature extraction through a network structure such as the feature pyramid FPN, PANet or BiFPN, and classification regression through 3×3 and 1×1 convolution modules, which calculate the intersection-over-union and predict the probability that a target appears in a prior box.
6. The method for detecting a real-time low-illuminance image according to claim 1, wherein for model lightweighting in step five the preprocessing part uses the downsampled small-scale image as the input of the preprocessing layer, reducing the computation cost of convolutional-layer learning; secondly, the enhanced image is restored to the original resolution through upsampling and used in subsequent operations; finally, ordinary convolution is replaced with depth separable convolution, reducing the parameter count to about one tenth of the original.
7. The method for detecting a real-time low-illuminance image according to claim 1, wherein in step six, an attention mechanism is added to the feature extraction network by the optimization model to compensate for the problem of reduced accuracy caused by light weight of the network, so that the overall model is balanced in accuracy and speed.
8. The method according to claim 1, wherein the model training section first randomly divides the low-illuminance image dataset into a training set, a verification set and a test set at a ratio of 8:1:1 to generate a low-illuminance image target detection model; secondly, the detection effect of the model is verified: real images captured under low-illumination conditions are fed separately into a target detection model without image enhancement and into the low-illumination target detection model of claim 1, and the detection effects are compared.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310940678.7A CN117115616A (en) | 2023-07-28 | 2023-07-28 | Real-time low-illumination image target detection method based on convolutional neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310940678.7A CN117115616A (en) | 2023-07-28 | 2023-07-28 | Real-time low-illumination image target detection method based on convolutional neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117115616A true CN117115616A (en) | 2023-11-24 |
Family
ID=88811871
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310940678.7A Pending CN117115616A (en) | 2023-07-28 | 2023-07-28 | Real-time low-illumination image target detection method based on convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117115616A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118071752A (en) * | 2024-04-24 | 2024-05-24 | 中铁电气化局集团有限公司 | Contact net detection method |
CN118071752B (en) * | 2024-04-24 | 2024-07-19 | 中铁电气化局集团有限公司 | Contact net detection method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |