CN116524432A - Application of small target detection algorithm in traffic monitoring - Google Patents


Info

Publication number
CN116524432A
Authority
CN
China
Prior art keywords
branch
module
image
super
resolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310412581.9A
Other languages
Chinese (zh)
Inventor
吴璨
范海连
张凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Communications Huakong Tianjin Construction Group Co ltd
Original Assignee
China Communications Huakong Tianjin Construction Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Communications Huakong Tianjin Construction Group Co ltd filed Critical China Communications Huakong Tianjin Construction Group Co ltd
Priority to CN202310412581.9A priority Critical patent/CN116524432A/en
Publication of CN116524432A publication Critical patent/CN116524432A/en
Pending legal-status Critical Current


Classifications

    • G06V 20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06T 3/4053: Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06T 3/4076: Super-resolution using the original low-resolution images to iteratively correct the high-resolution images
    • G06T 5/73: Deblurring; Sharpening
    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06V 2201/07: Target detection
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention relates to an application of a small target detection algorithm in traffic monitoring, comprising the following steps: performing preliminary processing on the picture to be detected; constructing a lightweight image super-resolution reconstruction network and completing network training; performing edge sharpening on the super-resolved image; inputting the edge-sharpened image into a detection module and obtaining the small traffic-target detection result. The lightweight image super-resolution reconstruction network structure comprises a shallow feature extraction module, a deep feature extraction module, an attention module and a reconstruction module. The invention realizes small target detection based on the super-resolution network, effectively enlarging the resolution of small targets and increasing the amount of feature information; compared with traditional optimization methods based on mainstream target detection algorithms, the accuracy improvement is larger and more effective.

Description

Application of small target detection algorithm in traffic monitoring
Technical Field
The invention relates to the technical field of traffic monitoring, in particular to application of a small target detection algorithm in traffic monitoring.
Background
With China's economic growth shifting from high speed to high-quality development, the transportation industry has reached the stage of basically adapting to economic and social development demands; national road mileage and traffic scale have grown rapidly, and by the end of 2022 the country had 348 million motor vehicles and 435 million licensed motor-vehicle drivers. However, because traffic conditions in China are complex and the safety-guarantee capability of road infrastructure is weak, current traffic management and control capability cannot meet the rapidly growing traffic demand; conflicts among pedestrians, vehicles and roads are increasingly prominent, traffic accidents are frequent, and their proportion of national work-safety accidents is markedly higher than that of other industries.
Detecting and predicting traffic incidents with an intelligent monitoring system can greatly reduce the damage caused by traffic accidents. The core of such a system is the target detection algorithm, and the detection of small targets is its chief difficulty: targets such as pedestrians and small vehicles are highly concentrated below 50 pixels in the image, their colors, edges and other appearance cues are blurred, and they are hard to distinguish in a complex traffic environment containing a large number of negative samples such as electric bicycles. As a result, detection accuracy for small-scale pedestrians and vehicles is low and the miss rate is high. Improving the detection accuracy of such small targets is therefore vital for traffic safety.
Existing small target detection is generally built by optimizing a mainstream target detection algorithm, using methods such as small-target sample enhancement, optimized training, anchor-free mechanisms and feature fusion. Among these, in 2018 Bai et al. proposed an end-to-end Multi-Task Generative Adversarial Network (MTGAN) to address small-target detection accuracy. The method comprises the following steps:
1) Cropping the input image as required;
2) Inputting it into a baseline target detector with Faster R-CNN or Mask R-CNN as the backbone network to preliminarily identify objects and background;
3) Inputting the preliminarily recognized picture into the generator, a super-resolution network that upsamples the small blurred image to a fine image and recovers its detail information for more accurate detection;
4) Inputting the super-resolved image into the discriminator, a multi-task network that describes each super-resolved image patch with a real/fake score, an object category score and bounding-box regression; so that the generator recovers more small-target details for detection, the discriminator back-propagates the classification and regression losses to the generator during training, improving the generator's output.
Disadvantages of the prior art:
1) Methods based on optimizing mainstream target detection algorithms, such as small-target sample enhancement, optimized training, anchor-free mechanisms and feature fusion, do not fundamentally solve the problem of missing detail in small target objects; although small-target detection accuracy improves overall, the improvement is limited, and many such algorithms may produce artifacts in practical applications.
2) Introducing a GAN (generative adversarial network)-based super-resolution algorithm into target detection effectively improves the detection accuracy of small targets, enlarging their resolution and increasing the amount of feature information; however, because the super-resolution algorithm greatly increases the number of network layers, GAN training is relatively difficult, real-time performance is hard to achieve, and application to specific scenes is difficult.
Disclosure of Invention
The present invention addresses the above-mentioned shortcomings and provides an application of a small target detection algorithm in traffic monitoring.
The invention adopts the following technical scheme to achieve this aim:
a light-weight image super-resolution reconstruction network structure consists of a shallow feature extraction module, a deep feature extraction module, an attention module and a reconstruction module;
the shallow feature extraction module maps the input image to a high-dimensional feature space through a convolution layer with a kernel size of 3×3, expressed as x_0 = f_ext(I_LR), where I_LR is the low-resolution input image;
The deep feature extraction module is composed of multiple large-receptive-field information distillation blocks (Vast-receptive-field Information Distillation Block, VIDB) and performs deep feature extraction on x_0; the stacked VIDBs progressively refine the extracted features.
An attention module, which consists of an ESA module (Efficient Channel Attention) and a CCA module (Coordinate Attention);
the reconstruction module completes reconstruction with the PixelShuffle algorithm, rearranging tensors of shape (C×r², H, W) into tensors of shape (C, r×H, r×W).
Specifically, the VIDB block performs a convolution with a kernel size of 1×1 on the input image and then splits into two branches, a first branch and a second branch; the results of the two branches are summed, pixel-normalized, and output.
Specifically, the first branch is a direct (identity) path; the second branch is activated by a gate-function-based activation function, passes through a channel attention module based on information distillation and large-kernel depthwise-separable convolution for feature-weight assignment, then through a convolution layer with a kernel size of 1×1 for feature fusion between feature maps, and is finally added to the direct path.
Specifically, the gate-function-based activation function splits an input feature map of size C×H×W into two feature maps of size C/2×H×W along the channel dimension, multiplies them element-wise, and outputs the result; the channel attention module splits the activated feature map into two branches, a third branch and a fourth branch, sums their results, and applies a convolution with a kernel size of 1×1.
Specifically, the third branch first performs a convolution with a kernel size of 1×1 and then a depthwise convolution with a kernel size of 9×9, a stride of 1 and a padding of 4; the fourth branch first performs a convolution with a kernel size of 1×1 and is then activated by a GELU activation function.
Specifically, the CCA module applies contrast calculation and adaptive global pooling to the input picture, sums the two results, passes the sum sequentially through a convolution with a kernel size of 1×1, a ReLU activation, and another convolution with a kernel size of 1×1, and finally multiplies the result with the input picture to obtain the output, completing feature learning based on position information.
Specifically, the ESA module performs a convolution with a kernel size of 1×1 on the input picture and then splits into two branches, a fourth branch and a fifth branch; the results of the two branches are summed, a convolution with a kernel size of 1×1 is applied to the summed feature map to fuse features and restore the channel count, the map is activated by a sigmoid activation function and multiplied with the input picture to obtain the output, learning cross-channel interaction relations without dimensionality reduction.
In particular, the fourth branch performs a convolution with a kernel size of 1×1; the fifth branch sequentially performs a convolution with a kernel size of 3×3, a stride of 2 and a padding of 1, a max-pooling layer with a kernel size of 7×7 and a stride of 7, a depthwise convolution with a kernel size of 3×3, activation by a GELU function, and bilinear interpolation that restores the original image size.
An application of a small target detection algorithm in traffic monitoring, based on the above lightweight image super-resolution reconstruction network structure, comprises the following steps:
s1, performing preliminary processing on a picture to be detected, wherein the specific steps are as follows:
s11, performing format conversion on a low-resolution image to be processed to obtain a low-resolution YCbCr image;
s12, equally dividing the low-resolution YCbCr image into a plurality of sub-images according to rows and columns, wherein the size of the sub-images after dividing is 480 pixels;
s13, randomly rotating the sub-images by 90 degrees or 180 degrees to enhance data so as to provide more data samples and reduce the storage space required by the feature map in network propagation;
s2, constructing a lightweight image super-resolution reconstruction network, and completing network training, wherein the training loss function adopts L2 loss;
s3, carrying out edge sharpening on the image subjected to super-resolution processing;
s4, inputting the image with the sharpened edges into a detection module for detection and obtaining a small traffic target detection result.
In particular, the detection module employs the YOLOv3 algorithm, which divides the image into a plurality of regions and predicts bounding boxes and class probabilities for each region.
The beneficial effects of the invention are as follows:
1. The invention sets up a lightweight image super-resolution reconstruction network structure comprising a shallow feature extraction module, a deep feature extraction module (VIDB blocks), an attention module (ESA module and CCA module) and a reconstruction module; the super-resolution algorithm effectively enlarges the resolution of small targets and increases the amount of feature information, and the accuracy improvement is larger and more effective than that of traditional optimization methods based on mainstream target detection algorithms.
2. According to the invention, the GAN-based super-resolution network is replaced by the lightweight image super-resolution reconstruction network, so that the detection precision of a small target object is improved, the parameter quantity of a network model is greatly reduced, and the training and the deployment are easier.
3. The invention uses YOLOv3 as the detection module of the system, ensuring a good compromise between detection accuracy and detection speed.
Drawings
FIG. 1 is a block diagram of an application system of the present invention in traffic monitoring;
FIG. 2 is a schematic diagram of a VIDB block structure according to the present invention;
FIG. 3 is a schematic diagram of a lightweight image super-resolution reconstruction network structure according to the present invention;
fig. 4 is a schematic structural view of a CCA module of the present invention;
FIG. 5 is a schematic diagram of an ESA module structure according to the present invention;
FIG. 6 is a schematic diagram of the YOLOv3 algorithm of the present invention;
the embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Detailed Description
The invention is further illustrated by the following examples:
as shown in fig. 3, a super-resolution reconstruction network structure of a lightweight image is composed of a shallow layer feature extraction module, a deep layer feature extraction module, an attention module and a reconstruction module;
the shallow feature extraction module maps the input image to a high-dimensional feature space through a convolution layer with a kernel size of 3×3, expressed as x_0 = f_ext(I_LR), where I_LR is the low-resolution input image;
The deep feature extraction module is composed of multiple large-receptive-field information distillation blocks (Vast-receptive-field Information Distillation Block, VIDB) and performs deep feature extraction on x_0; the stacked VIDBs progressively refine the extracted features.
Specifically, as shown in fig. 2, the VIDB block performs a convolution with a kernel size of 1×1 on the input image and then splits into two branches, a first branch and a second branch; the results of the two branches are summed, pixel-normalized, and output.
The first branch is a direct (identity) path; the second branch is activated by a gate-function-based activation function, passes through a channel attention module based on information distillation and large-kernel depthwise-separable convolution for feature-weight assignment, then through a convolution layer with a kernel size of 1×1 for feature fusion between feature maps, and is finally added to the direct path.
The key innovation of the invention lies in the gate-function-based activation function and the channel attention module based on information distillation and large-kernel depthwise-separable convolution, namely:
The gate-function-based activation function splits an input feature map of size C×H×W (where the channel number C is 64 and H×W is the size of the cropped picture, 480×480 pixels) into two feature maps of size C/2×H×W along the channel dimension, multiplies them element-wise, and outputs the result, achieving an effect similar to a traditional activation function while greatly reducing the parameter count.
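The gate-function activation just described can be sketched in a few lines; the function name and the shapes used here are illustrative, since the patent provides no reference code:

```python
import numpy as np

def gate_activation(x: np.ndarray) -> np.ndarray:
    """Split a (C, H, W) feature map into two (C/2, H, W) halves along
    the channel axis and multiply them element-wise, as described above.
    Illustrative sketch only."""
    c = x.shape[0]
    assert c % 2 == 0, "channel count must be even"
    a, b = x[: c // 2], x[c // 2:]
    return a * b

# the 64-channel, 480x480 feature map mentioned in the description
x = np.random.rand(64, 480, 480).astype(np.float32)
print(gate_activation(x).shape)  # (32, 480, 480)
```

Note that the operation itself has no learnable parameters, consistent with the text's claim that it greatly reduces the parameter count relative to a conventional activation.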
The channel attention module splits the activated feature map into two branches, a third branch and a fourth branch; the results of the two branches are summed and a convolution with a kernel size of 1×1 is applied to produce the output.
The third branch first performs a convolution with a kernel size of 1×1, and then a depthwise convolution with a kernel size of 9×9, a stride of 1 and a padding of 4. This branch carries out a large-kernel depthwise-separable convolution: the 9×9 kernel benefits the extraction of picture feature information, while the depthwise-separable decomposition splits one convolution layer into a pointwise (1×1) convolution and a depthwise convolution (stride 1, padding 4), greatly reducing the number of parameters. The fourth branch first performs a convolution with a kernel size of 1×1 and is then activated by a GELU activation function. The reduced parameter count makes training and deployment easier, gives good real-time performance, and makes the method easy to apply to specific scenes.
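The parameter saving from the large-kernel depthwise-separable decomposition is easy to verify by counting weights (bias terms omitted; the 64-channel width is the one stated elsewhere in the description):

```python
def conv_params(in_ch: int, out_ch: int, k: int) -> int:
    # weights of a standard k x k convolution (no bias)
    return in_ch * out_ch * k * k

def separable_params(in_ch: int, out_ch: int, k: int) -> int:
    # 1x1 pointwise convolution followed by a k x k depthwise convolution
    pointwise = in_ch * out_ch
    depthwise = out_ch * k * k  # one k x k filter per output channel
    return pointwise + depthwise

standard = conv_params(64, 64, 9)        # 331,776 weights
separable = separable_params(64, 64, 9)  # 4,096 + 5,184 = 9,280 weights
print(standard, separable, round(standard / separable, 1))  # roughly 35.8x fewer
```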
An attention module, which consists of an ESA module (Efficient Channel Attention) and a CCA module (Coordinate Attention);
Specifically, the attention module is added after the deep feature extraction module to further improve the representational power of the neural network. The ESA module is a lightweight channel attention module that learns cross-channel interaction relations through a one-dimensional convolution layer without dimensionality reduction; the CCA module embeds location information into the channel attention and can generate attention maps with spatial selectivity.
As shown in fig. 4, the CCA module applies contrast calculation and adaptive global pooling to the input picture, sums the two results, passes the sum sequentially through a convolution with a kernel size of 1×1, a ReLU activation, and another convolution with a kernel size of 1×1, and finally multiplies the result with the input picture to obtain the output, completing feature learning based on position information.
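The CCA data flow just described can be sketched as follows; the contrast statistic (here the per-channel standard deviation), the channel-reduction ratio, and the random stand-in weights are assumptions, since the patent does not specify them:

```python
import numpy as np

def cca_attention(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Contrast statistic plus global average pooling, summed, passed
    through two 1x1 convolutions (plain matrices here) with a ReLU in
    between, then used to rescale the input channel-wise."""
    c = x.shape[0]
    flat = x.reshape(c, -1)
    s = flat.std(axis=1) + flat.mean(axis=1)  # contrast + adaptive pooling, added
    s = np.maximum(w1 @ s, 0.0)               # 1x1 conv + ReLU
    s = w2 @ s                                # second 1x1 conv
    return x * s[:, None, None]               # multiply with the input

rng = np.random.default_rng(0)
c = 8
x = rng.standard_normal((c, 16, 16))
w1 = rng.standard_normal((c // 4, c))  # reduction ratio 4 is an assumption
w2 = rng.standard_normal((c, c // 4))
print(cca_attention(x, w1, w2).shape)  # (8, 16, 16)
```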
As shown in fig. 5, the ESA module performs a convolution with a kernel size of 1×1 on the input picture and then splits into two branches, a fourth branch and a fifth branch; the results of the two branches are summed, a convolution with a kernel size of 1×1 is applied to the summed feature map to fuse features and restore the channel count, the map is activated by a sigmoid activation function and multiplied with the input picture to obtain the output, learning cross-channel interaction relations without dimensionality reduction.
The fourth branch performs a convolution with a kernel size of 1×1; the fifth branch sequentially performs a convolution with a kernel size of 3×3, a stride of 2 and a padding of 1, a max-pooling layer with a kernel size of 7×7 and a stride of 7, a depthwise convolution with a kernel size of 3×3, activation by a GELU function, and bilinear interpolation that restores the original image size.
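The spatial sizes along the fifth branch can be checked with the standard output-size formula; the 480×480 input is the cropped picture size stated earlier in the description:

```python
def out_size(h: int, k: int, s: int, p: int) -> int:
    # output size of a convolution or pooling layer
    return (h + 2 * p - k) // s + 1

h = 480
h1 = out_size(h, k=3, s=2, p=1)   # stride-2 3x3 convolution: 480 -> 240
h2 = out_size(h1, k=7, s=7, p=0)  # 7x7 max pooling with stride 7: 240 -> 34
print(h1, h2)  # bilinear interpolation later restores 34 back to 480
```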
The reconstruction module is an up-sampling module; reconstruction is completed with the PixelShuffle algorithm, which realizes efficient sub-pixel convolution (equivalent to a convolution with a stride of 1/r) and rearranges tensors of shape (C×r², H, W) into tensors of shape (C, r×H, r×W).
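The PixelShuffle rearrangement can be reproduced with reshapes and a transpose (a sketch matching the shape convention above, batch dimension omitted):

```python
import numpy as np

def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange a (C*r^2, H, W) tensor into (C, r*H, r*W),
    following the sub-pixel convolution layout used by PixelShuffle."""
    cr2, h, w = x.shape
    c = cr2 // (r * r)
    x = x.reshape(c, r, r, h, w)    # split out the two upscaling factors
    x = x.transpose(0, 3, 1, 4, 2)  # -> (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

x = np.arange(8 * 3 * 3, dtype=np.float32).reshape(8, 3, 3)
print(pixel_shuffle(x, 2).shape)  # (2, 6, 6): 8 channels -> 2, 3x3 -> 6x6
```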
As shown in fig. 1, an application of a small target detection algorithm in traffic monitoring, based on the lightweight image super-resolution reconstruction network structure, comprises the following steps:
s1, performing preliminary processing on a picture to be detected, wherein the specific steps are as follows:
s11, performing format conversion on a low-resolution image to be processed to obtain a low-resolution YCbCr (Y represents a brightness component, cb represents a blue chrominance component, and Cr represents a red chrominance component) image; compared with RGB image, YCbCr image only occupies little bandwidth in transmission process, so the invention carries out format conversion;
s12, equally dividing the low-resolution YCbCr image into a plurality of sub-images according to rows and columns, wherein the size of the sub-images after dividing is 480 pixels;
s13, randomly rotating the sub-images by 90 degrees or 180 degrees to enhance data so as to provide more data samples, and greatly reducing the storage space required by the feature map in network propagation;
s2, constructing a lightweight image super-resolution reconstruction network, and completing network training, wherein the training loss function adopts L2 loss;
s3, carrying out edge sharpening on the image subjected to super-resolution processing;
s4, inputting the image with the sharpened edges into a detection module for detection and obtaining a small traffic target detection result; the detection module of the invention adopts the YOLOv3 algorithm to divide the image into a plurality of areas and predicts the probability of the boundary box and each area.
Specifically, as shown in fig. 6, the present invention uses YOLOv3 (You Only Look Once) as the detection module; although not the most accurate algorithm, it offers a compromise between accuracy and speed that suits deployment in practical applications. The YOLOv3 algorithm applies a single neural network to the image, dividing it into a plurality of regions and predicting bounding boxes and class probabilities for each region; with FPN-style feature pyramids and multi-level detection, it has good small-target detection capability.
YOLOv3 uses only convolutional layers, with Darknet-53 as the backbone network: it contains 53 convolutional layers, each followed by a batch normalization layer and a Leaky ReLU activation. The whole framework can be divided into 3 parts: the image x is input into the Darknet-53 structure, and a series of convolutions with residual connections produce feature maps at 1/8, 1/16 and 1/32 of the original resolution (feature maps 1, 2 and 3 in the figure); this is the feature extraction process. Feature maps of different sizes are fused during feature extraction to obtain stronger feature expressiveness; because their sizes differ, an up-sampling operation is needed to bring them to the same size before stacking, fusion and the corresponding convolution operations. Finally, convolution operations with kernel sizes of 3×3 and 1×1 produce a 75-channel prediction map containing 3 × (4 + 1 + 20) values, i.e. 3 prediction boxes (bounding boxes) per grid cell encoding target category and position in the original image, each box consisting of 25 parameters: 4 position coordinates, 1 confidence score and 20 class-prediction values.
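The channel arithmetic of the prediction map follows directly from the box encoding; a quick check (the 3 boxes per cell and 20 classes are the figures given in the text, the 480×480 input is the sub-image size from step S12):

```python
def yolo_head_channels(num_classes: int, boxes_per_cell: int = 3) -> int:
    # each predicted box: 4 coordinates + 1 confidence + class scores
    return boxes_per_cell * (4 + 1 + num_classes)

print(yolo_head_channels(20))  # 75, matching the 3 x (4 + 1 + 20) in the text
print(yolo_head_channels(80))  # 255, the corresponding figure for 80 classes

# grid sizes at the three detection scales (strides 8, 16, 32) for 480x480 input
print([480 // s for s in (8, 16, 32)])  # [60, 30, 15]
```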
By setting up the lightweight image super-resolution reconstruction network structure, the invention effectively enlarges the resolution of small targets and increases the amount of feature information; compared with traditional optimization methods based on mainstream target detection algorithms, the accuracy improvement is larger and more effective.
According to the invention, the GAN-based super-resolution network is replaced by the lightweight image super-resolution reconstruction network, so that the detection precision of a small target object is improved, the parameter quantity of a network model is greatly reduced, and the training and the deployment are easier.
The invention uses YOLOv3 as the detection module of the system, ensuring a good compromise between detection accuracy and detection speed.
In the present invention, the terms "first", "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
While the invention has been described above by way of example, it will be apparent that the invention is not limited to the above embodiments; various modifications made using the method concepts and technical solutions of the invention, or direct applications thereof to other fields without modification, all fall within the scope of protection of the invention.

Claims (10)

1. The light-weight image super-resolution reconstruction network structure is characterized by comprising a shallow layer feature extraction module, a deep layer feature extraction module, an attention module and a reconstruction module;
the shallow feature extraction module maps the input image to a high-dimensional feature space and comprises a convolution layer with kernel size 3*3, expressed as x_0 = f_ext(I_LR), where I_LR is the input low-resolution image;
the deep feature extraction module is composed of multiple vast-receptive-field information distillation blocks (Vast-receptive-field Information Distillation Block, VIDB) and performs deep feature extraction on x_0, the extracted features being progressively refined by the stacked VIDBs;
the attention module consists of two parts, an ESA module (Enhanced Spatial Attention) and a CCA module (Contrast-aware Channel Attention);
the reconstruction module completes reconstruction using the PixelShuffle algorithm, rearranging a tensor of shape (*, C x r^2, H, W) into a tensor of shape (*, C, H x r, W x r).
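PixelShuffle is a pure rearrangement of channels into spatial positions; a minimal NumPy sketch of the shape transform described above (in practice the module would typically be `torch.nn.PixelShuffle`):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (N, C*r*r, H, W) tensor into (N, C, H*r, W*r),
    as done by the reconstruction module's PixelShuffle step."""
    n, c, h, w = x.shape
    oc = c // (r * r)
    x = x.reshape(n, oc, r, r, h, w)
    # Interleave the r*r channel groups into the spatial dimensions.
    x = x.transpose(0, 1, 4, 2, 5, 3)
    return x.reshape(n, oc, h * r, w * r)

x = np.random.rand(1, 3 * 2 * 2, 8, 8)   # upscale factor r = 2
y = pixel_shuffle(x, 2)
print(y.shape)  # (1, 3, 16, 16)
```

Each output pixel block of size r x r is filled from r^2 consecutive channel groups, which is why the channel count must be divisible by r^2.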
2. The lightweight image super-resolution reconstruction network structure according to claim 1, wherein the VIDB block first performs a convolution operation with kernel size 1*1 on its input, then splits the result into two branches, a first branch and a second branch; the processing results of the first branch and the second branch are added, a pixel normalization operation is performed, and the result is output.
3. The lightweight image super-resolution reconstruction network structure according to claim 2, wherein the first branch is a direct (identity) path; the second branch is first activated by a gate-function-based activation function, then feature weights are assigned by a channel attention module based on information distillation and large-kernel depthwise separable convolution, then features from different feature maps are fused by a convolution layer with kernel size 1*1, and the fused result is added to the direct path.
4. The lightweight image super-resolution reconstruction network structure according to claim 3, wherein the gate-function-based activation function splits an input feature map of size C x H x W along the channel dimension into two feature maps of size C/2 x H x W, multiplies them element-wise, and outputs the product; the channel attention module splits the activated feature map into two branches, a third branch and a fourth branch, adds the processing results of the two branches, performs a convolution operation with kernel size 1*1, and outputs the result.
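The gate-function activation in claim 4 is a channel-split gate: one half of the channels modulates the other half. A minimal NumPy sketch (names are illustrative, not from the patent):

```python
import numpy as np

def channel_split_gate(x):
    """Split a (C, H, W) feature map into two (C/2, H, W) halves along
    the channel axis and multiply them element-wise, as in claim 4."""
    c = x.shape[0]
    a, b = x[: c // 2], x[c // 2 :]
    return a * b

x = np.random.rand(8, 16, 16)      # C=8, H=W=16
y = channel_split_gate(x)
print(y.shape)  # (4, 16, 16)
```

The gate needs no learned parameters of its own, which fits the lightweight design goal of the network.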
5. The lightweight image super-resolution reconstruction network structure according to claim 4, wherein the third branch first performs a convolution operation with kernel size 1*1, then a depthwise convolution operation with kernel size 9*9, stride 1 and padding 4; the fourth branch first performs a convolution operation with kernel size 1*1 and is then activated by the GELU activation function.
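With kernel 9*9, stride 1 and padding 4, the depthwise convolution in claim 5 preserves the spatial size, since (H + 2*4 - 9)/1 + 1 = H. A naive NumPy sketch of such a per-channel (depthwise) convolution, for illustration only:

```python
import numpy as np

def depthwise_conv(x, w, pad):
    """Naive depthwise convolution: x is (C, H, W), w is (C, k, k);
    each channel is convolved with its own k*k kernel at stride 1."""
    c, h, wd = x.shape
    k = w.shape[1]
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    oh, ow = h + 2 * pad - k + 1, wd + 2 * pad - k + 1
    out = np.zeros((c, oh, ow))
    for ci in range(c):
        for i in range(oh):
            for j in range(ow):
                out[ci, i, j] = np.sum(xp[ci, i:i + k, j:j + k] * w[ci])
    return out

x = np.random.rand(2, 12, 12)
w = np.random.rand(2, 9, 9)       # 9x9 kernel; padding 4 preserves size
y = depthwise_conv(x, w, pad=4)
print(y.shape)  # (2, 12, 12)
```

Because each channel uses its own kernel, the parameter count is C * k^2 rather than C^2 * k^2, which is what makes large 9*9 kernels affordable in a lightweight network.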
6. The lightweight image super-resolution reconstruction network structure according to claim 5, wherein the CCA module performs a contrast calculation and an adaptive global pooling on the input feature map, adds the two results, then sequentially applies a convolution operation with kernel size 1*1, a ReLU activation, and another convolution operation with kernel size 1*1, and finally multiplies the result with the input feature map to obtain the output, thereby completing feature learning based on the contrast information.
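A minimal sketch of the contrast-plus-mean channel pooling that drives such a contrast-aware channel attention; the two 1*1 convolutions and the ReLU of claim 6 are reduced to a single sigmoid gate here, and all names are illustrative:

```python
import numpy as np

def cca_pooling(x):
    """Per-channel contrast (standard deviation) plus adaptive global
    average pooling, the two statistics added as in claim 6. x is (C, H, W)."""
    mean = x.mean(axis=(1, 2))       # adaptive global average pooling
    contrast = x.std(axis=(1, 2))    # per-channel contrast
    return contrast + mean

def cca_attention(x):
    w = 1.0 / (1.0 + np.exp(-cca_pooling(x)))   # sigmoid gate per channel
    return x * w[:, None, None]                  # rescale the input map

x = np.random.rand(4, 8, 8)
print(cca_attention(x).shape)  # (4, 8, 8)
```

Adding the contrast term lets the attention weights react to channel activity (texture, edges) rather than to mean intensity alone.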
7. The lightweight image super-resolution reconstruction network structure according to claim 6, wherein the ESA module first performs a convolution operation with kernel size 1*1 on the input feature map, then splits it into two branches, a fourth branch and a fifth branch; the processing results of the two branches are added, a convolution operation with kernel size 1*1 is performed on the added feature map to fuse features and restore the number of channels, the result is activated by a sigmoid function and multiplied with the input feature map to obtain the output, learning cross-channel interactions without dimensionality reduction.
8. The lightweight image super-resolution reconstruction network structure according to claim 7, wherein the fourth branch performs a convolution operation with kernel size 1*1; the fifth branch sequentially performs a convolution operation with kernel size 3*3, stride 2 and padding 1; a max pooling layer with kernel size 7*7 and stride 7; a depthwise convolution operation with kernel size 3*3; a GELU activation; and a bilinear interpolation that restores the original image size.
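The fifth branch's downsample-then-restore shape arithmetic can be checked directly. A small sketch under the parameters stated in claim 8 (stride-2 3*3 convolution with padding 1, then 7*7 max pooling with stride 7, then bilinear upsampling back; the input height of 48 is an illustrative example):

```python
def conv_out(size, k, s, p):
    """Standard output-size formula for a convolution or pooling layer."""
    return (size + 2 * p - k) // s + 1

h = 48                               # example input height
h1 = conv_out(h, k=3, s=2, p=1)      # strided 3x3 conv -> 24
h2 = conv_out(h1, k=7, s=7, p=0)     # 7x7 max pool, stride 7 -> 3
# Bilinear interpolation then restores the map from h2 back to h, so the
# spatial attention mask matches the input feature map's size.
print(h1, h2)  # 24 3
```

The aggressive downsampling (roughly 16x here) gives the spatial attention a large effective receptive field at very low cost, which is the usual rationale for this ESA design.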
9. Use of a small object detection algorithm in traffic monitoring based on the lightweight image super-resolution reconstruction network structure of claim 8, characterized by the following steps:
s1, performing preliminary processing on a picture to be detected, wherein the specific steps are as follows:
s11, performing format conversion on a low-resolution image to be processed to obtain a low-resolution YCbCr image;
s12, equally dividing the low-resolution YCbCr image into a plurality of sub-images according to rows and columns, wherein the size of the sub-images after dividing is 480 pixels;
s13, randomly rotating the sub-images by 90 degrees or 180 degrees to enhance data so as to provide more data samples and reduce the storage space required by the feature map in network propagation;
s2, constructing a lightweight image super-resolution reconstruction network, and completing network training, wherein the training loss function adopts L2 loss;
s3, carrying out edge sharpening on the image subjected to super-resolution processing;
s4, inputting the image with the sharpened edges into a detection module for detection and obtaining a small traffic target detection result.
10. The use of a small target detection algorithm in traffic monitoring according to claim 9, wherein the detection module employs the YOLOv3 algorithm to divide the image into a plurality of regions and predict bounding boxes and class probabilities for each region.
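The preliminary processing in steps S12 and S13 (tiling into 480-pixel sub-images, then random 90- or 180-degree rotation) can be sketched as follows; the helper names are illustrative, and the sketch assumes image dimensions divisible by the tile size, as the equal division in S12 implies:

```python
import numpy as np

TILE = 480  # sub-image size from step S12

def tile_image(img, tile=TILE):
    """Split an (H, W, C) image into tile x tile sub-images by rows and
    columns, as in step S12 (assumes H and W are multiples of `tile`)."""
    h, w = img.shape[:2]
    return [img[i:i + tile, j:j + tile]
            for i in range(0, h, tile)
            for j in range(0, w, tile)]

def augment(tiles, rng):
    """Randomly rotate each sub-image by 90 or 180 degrees (step S13)."""
    return [np.rot90(t, k=rng.choice([1, 2])) for t in tiles]

img = np.zeros((960, 1920, 3), dtype=np.uint8)
tiles = augment(tile_image(img), np.random.default_rng(0))
print(len(tiles))  # 8 sub-images for a 1920x960 input
```

Working on fixed-size square tiles keeps the intermediate feature maps of the super-resolution network bounded in size regardless of the camera resolution.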
CN202310412581.9A 2023-04-18 2023-04-18 Application of small target detection algorithm in traffic monitoring Pending CN116524432A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310412581.9A CN116524432A (en) 2023-04-18 2023-04-18 Application of small target detection algorithm in traffic monitoring

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310412581.9A CN116524432A (en) 2023-04-18 2023-04-18 Application of small target detection algorithm in traffic monitoring

Publications (1)

Publication Number Publication Date
CN116524432A true CN116524432A (en) 2023-08-01

Family

ID=87407521

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310412581.9A Pending CN116524432A (en) 2023-04-18 2023-04-18 Application of small target detection algorithm in traffic monitoring

Country Status (1)

Country Link
CN (1) CN116524432A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116721302A (en) * 2023-08-10 2023-09-08 成都信息工程大学 Ice and snow crystal particle image classification method based on lightweight network
CN116721302B (en) * 2023-08-10 2024-01-12 成都信息工程大学 Ice and snow crystal particle image classification method based on lightweight network

Similar Documents

Publication Publication Date Title
US11651477B2 (en) Generating an image mask for a digital image by utilizing a multi-branch masking pipeline with neural networks
CN111104903B (en) Depth perception traffic scene multi-target detection method and system
US11393100B2 (en) Automatically generating a trimap segmentation for a digital image by utilizing a trimap generation neural network
CN112232349A (en) Model training method, image segmentation method and device
CN111696110B (en) Scene segmentation method and system
CN110390314B (en) Visual perception method and equipment
CN114549563A (en) Real-time composite insulator segmentation method and system based on deep LabV3+
CN112215074A (en) Real-time target identification and detection tracking system and method based on unmanned aerial vehicle vision
CN110570402B (en) Binocular salient object detection method based on boundary perception neural network
CN114612456B (en) Billet automatic semantic segmentation recognition method based on deep learning
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN116524432A (en) Application of small target detection algorithm in traffic monitoring
CN110503049B (en) Satellite video vehicle number estimation method based on generation countermeasure network
CN112613434A (en) Road target detection method, device and storage medium
CN116189191A (en) Variable-length license plate recognition method based on yolov5
CN112989919B (en) Method and system for extracting target object from image
CN117789077A (en) Method for predicting people and vehicles for video structuring in general scene
CN115761552B (en) Target detection method, device and medium for unmanned aerial vehicle carrying platform
CN112487911A (en) Real-time pedestrian detection method and device based on improved yolov3 in intelligent monitoring environment
CN114663654B (en) Improved YOLOv4 network model and small target detection method
CN116311154A (en) Vehicle detection and identification method based on YOLOv5 model optimization
CN113255646B (en) Real-time scene text detection method
CN111047571A (en) Image salient target detection method with self-adaptive selection training process
CN117291802B (en) Image super-resolution reconstruction method and system based on composite network structure
CN112926588B (en) Large-angle license plate detection method based on convolutional network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination