CN112966747A - Improved vehicle detection method based on anchor-frame-free detection network - Google Patents
Improved vehicle detection method based on anchor-frame-free detection network
- Publication number
- CN112966747A (Application CN202110254258.4A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24137—Distances to cluster centroïds
- G06F18/2414—Smoothing the distance, e.g. radial basis function networks [RBFN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/08—Detecting or categorising vehicles
Abstract
The invention discloses an improved vehicle detection method based on an anchor-frame-free detection network, which comprises the following steps: first, low-dimensional and high-dimensional features are fused by a feature fusion module; the fused features are then fed into the anchor-frame-free detection framework; a CBAM attention mechanism module is added to the network to improve the detection effect; finally, the recognition result is output. The method adds a feature fusion module and an attention mechanism module, uses the CenterNet network as the anchor-frame-free detector, and adopts the skip-connection structure of the ResNet network, so that vehicles can be detected quickly while good detection accuracy is maintained.
Description
Technical Field
The invention relates to the field of deep learning target detection, in particular to an improved vehicle detection method based on an anchor-frame-free detection network.
Background
With the construction and development of smart cities, intelligent traffic systems and unmanned vehicles, vehicle detection has become a key technology. It is widely applied in traffic management, congested-road-section detection and similar tasks, and is of great significance for reducing or even avoiding traffic accidents.
In the traditional approach, the image is first preprocessed and a filter traverses the image to obtain the preliminary position of a target such as a vehicle or a pedestrian; features of the vehicle target are then designed manually for recognition, mainly the Histogram of Oriented Gradients (HOG), the Scale-Invariant Feature Transform (SIFT), Haar-like features and the like. Finally, a classifier such as a Support Vector Machine (SVM) trained on positive and negative samples performs feature classification to complete the detection. Traditional methods are limited by the efficiency of target position estimation, so their robustness is weak; the shortcomings are especially obvious in real-time detection and in detecting occluded targets.
In recent years, deep learning technology has developed continuously and achieved great breakthroughs; target features are extracted automatically by convolutional neural networks. Benefiting from their strong feature extraction capability, the detection accuracy of target detection algorithms has been greatly improved, robustness is stronger, and more complex recognition scenes can be handled.
The proposal of AlexNet in 2012 spurred the development of deep learning, and VGGNet in 2014 made deep neural networks practical, but the gradient vanishing problem appeared as networks deepened. ResNet in 2015 solved this problem with residual connections, reducing model convergence time and allowing networks to go deeper without gradient vanishing.
Today, target detection algorithms fall mainly into two categories: one-stage methods and two-stage methods. A two-stage method generates a series of candidate boxes through an algorithm and then regresses and classifies them; it is characterized by high accuracy but relatively low recognition speed. In 2014, Girshick et al. proposed the R-CNN (Region-CNN) algorithm; R-CNN and Fast R-CNN use a selective search method to generate candidate boxes, extract features of the candidate boxes through a convolutional neural network, and finally feed the features into an SVM classifier for classification and regression to obtain the recognition result. To overcome the slow recognition speed, one-stage methods were proposed: the candidate-box generation step is removed, the convolutional neural network convolves the image data directly, and detection and classification are performed directly on the extracted features. Such methods are fast, but their recognition accuracy is generally lower than that of two-stage methods. In 2016, the YOLO series of algorithms was proposed, solving the real-time problem while maintaining recognition accuracy. The algorithm merges detection and classification into one process, predicts the position and category of a detection box on each feature unit, and finally predicts over the whole image feature, combining the background information in the image data, to obtain the final recognition result. Likewise, among one-stage algorithms, the SSD detection algorithm combines the advantages of R-CNN and YOLO, detecting targets of various sizes by generating candidate boxes of different sizes on multi-scale feature maps.
Both of these families are Anchor-Based algorithms: anchor boxes of possible targets must be found first, and then their categories predicted. In recent years, Anchor-Free methods have been proposed; no anchor box is needed, and targets are detected and located directly through key points, which improves detection speed and adapts better to targets of different sizes.
Based on the above problems, the invention adds feature fusion and attention mechanism modules on top of the anchor-free network CenterNet, improving detection precision while maintaining vehicle detection speed, especially for overlapping and small vehicle targets.
Disclosure of Invention
In order to solve the above problems, the present invention provides an improved vehicle detection method based on an anchor-frame-free detection network, aiming to improve the accuracy and speed of vehicle detection, and the method comprises the following steps:
step one, inputting an image into the network, and performing feature extraction on the input image through a feature extraction module to obtain a feature map;
step two, inputting the feature map to a feature fusion module, and fusing the low-dimensional features with the high-dimensional features to obtain a fused feature map;
step three, inputting the fused feature map into the anchor-frame-free detection network, detecting and identifying the vehicle target, and obtaining and outputting the recognition result, wherein the anchor-frame-free detection network uses the CenterNet detection network;
step four, the backbone network of the CenterNet network uses the structure of ResNet101, uses a jumper connection mode to connect each convolution layer, and uses a convolution attention module (CBAM) in the jumper connection;
Preferably, in step one, the feature extraction module replaces the standard convolution operation with a depthwise separable convolution: a channel-by-channel (depthwise) convolution with a 3 × 3 kernel is first applied to the input image to obtain input image features, and then a point-by-point (pointwise) convolution with a 1 × 1 kernel produces the final image feature map, reducing the amount of computation while preserving feature extraction capability. The depthwise separable convolution is followed by a batch normalization layer to increase the generalization ability of the model, and the batch normalization layer is followed by a ReLU activation function; together these form the feature extraction module;
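The computational saving of the depthwise separable design can be sketched with a multiplication count (a rough illustration with assumed sizes; the patent does not fix these dimensions):

```python
def conv_mults(h, w, c_in, c_out, k):
    """Multiplications for a standard k x k convolution (stride 1, 'same' padding)."""
    return h * w * c_in * c_out * k * k

def separable_mults(h, w, c_in, c_out, k):
    """Depthwise (channel-by-channel) k x k conv plus 1 x 1 pointwise conv."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

# Illustrative sizes only: a 38 x 38 map, 256 -> 256 channels, 3 x 3 kernel.
std = conv_mults(38, 38, 256, 256, 3)
sep = separable_mults(38, 38, 256, 256, 3)
print(std, sep, round(std / sep, 1))  # the separable form is roughly 8-9x cheaper here
```

The ratio is (c_out · k²) / (k² + c_out), so the saving grows with the number of output channels.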
Preferably, a feature fusion module is used in step two; fusing shallow features with deep features increases the accuracy of detecting small and overlapping targets. The feature map of the Conv3-3 layer is downsampled to 38 × 38 with the number of channels unchanged; the feature map of the Conv4-3 layer is reduced in dimension from 512 to 256 channels with its size unchanged; the feature map of the Conv7 layer is upsampled to 38 × 38 and reduced from 1024 to 256 channels. The Conv3-3, Conv4-3 and Conv7 feature maps are then concatenated, giving 768 channels, and the number of default boxes per feature map position is changed from 4 to 6. The Conv3-3 map is downsampled before fusion to reduce its size and enlarge the receptive field while limiting feature loss; the downsampling uses a dilated (hole) convolution with stride 2. The input and output sizes of the dilated convolution satisfy the following relation:
W2 = ⌊(W1 + 2p − d(k − 1) − 1) / s⌋ + 1
where p is the padding size, d is the dilation rate, s is the stride, k is the convolution kernel size, W1 is the input feature map size, and W2 is the output feature map size.
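The dilated-convolution size relation can be checked numerically. In the sketch below, the 75 × 75 Conv3-3 input size and the padding/dilation values are illustrative assumptions, not values stated in the patent:

```python
import math

def dilated_conv_out(w_in, k, d, s, p):
    """Output size W2 = floor((W1 + 2p - d(k-1) - 1) / s) + 1 for a dilated convolution."""
    return math.floor((w_in + 2 * p - d * (k - 1) - 1) / s) + 1

# With a 3 x 3 kernel, dilation d = 2, stride s = 2 and padding p = 2,
# a hypothetical 75 x 75 Conv3-3 map comes out at the 38 x 38 fusion size.
print(dilated_conv_out(75, k=3, d=2, s=2, p=2))  # 38
# After splicing the three 256-channel maps the channel count is 3 * 256 = 768.
```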
Preferably, in step three the anchor-free network CenterNet is adopted to detect the vehicle target. CenterNet uses an anchor-free design, with accuracy better than existing two-stage methods and speed faster than one-stage methods; the anchor-frame-free detection network CenterNet uses a ResNet structure as the backbone network and uses its skip-connection structure.
Preferably, the CBAM attention module is added to the skip-connection structure of the ResNet network. CBAM (Convolutional Block Attention Module) is a lightweight and general attention model that applies a feature attention mechanism over both the spatial and channel dimensions. In the embodiment of the invention, the model with the CBAM module performs better and is more interpretable than the baseline model, focusing more on the target object. The channel attention can be expressed by the following formula:
Mc(F) = σ(W1(W0(F_avg)) + W1(W0(F_max)))
where F denotes the input feature map, F_avg and F_max denote the features obtained by global average pooling and global max pooling respectively, W0 and W1 denote the two layers of weights of the shared multi-layer perceptron, and σ is the sigmoid function.
Compared with the prior art, the method improves on an anchor-free target detection network by adding a feature fusion module and using an attention mechanism module in the network structure; it addresses the inaccurate detection of small and overlapping targets in vehicle detection, and guarantees real-time detection while maintaining detection accuracy.
Drawings
FIG. 1 is the overall flow chart of the improved vehicle detection method based on an anchor-frame-free detection network according to the present invention;
FIG. 2 is a block diagram of the feature extraction module of the method;
FIG. 3 is a block diagram of the feature fusion module of the method;
FIG. 4 is a schematic diagram of the channel attention module of the method;
FIG. 5 is a schematic diagram of the spatial attention module of the method;
FIG. 6 is a schematic diagram of the CBAM module of the method;
FIG. 7 shows detection results of the method.
Detailed Description
The technical solutions in the embodiments of the present invention will be described fully below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides an improved vehicle detection method based on an anchor-frame-free detection network, and the steps of the example formation of the present invention are as follows:
Step one: an image is input into the network, and feature extraction is performed on it by the feature extraction module to obtain a feature map. The feature extraction module, shown in fig. 2, adopts a depthwise separable convolution design with residual connections: a convolution layer with a 3 × 3 kernel first performs channel-by-channel convolution on the input image to obtain input image features, and a convolution with a 1 × 1 kernel then performs point-by-point convolution on the feature map to obtain the final image feature map, reducing the amount of computation while preserving feature extraction capability.
The feature extraction module comprises two branches. One branch uses a 1 × 1 convolution with 2n kernels and stride s = 2. The other branch uses a separable convolution: first a 1 × 1 convolution with n/2 kernels and stride s = 2, then a 3 × 3 convolution for feature extraction with n/2 kernels and stride s = 1, then a 1 × 1 convolution with 2n kernels and stride s = 1 to raise the feature channel dimension to 2n. Finally, the outputs of the two branches are added at corresponding positions, giving an output feature map of size (h/2, w/2, 2n). After the output, a batch normalization layer increases the generalization ability of the model and a ReLU activation function follows, forming the whole feature extraction module.
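The shape bookkeeping of the two branches can be sketched in plain Python. The padding values are assumptions chosen so that only the stride changes the spatial size, consistent with the (h/2, w/2, 2n) output stated above:

```python
def conv_out(size, k, s, p):
    """Spatial output size of a k x k convolution with stride s and padding p."""
    return (size + 2 * p - k) // s + 1

def feature_module_shape(h, w, c, n):
    """Output shape of the two-branch module (a sketch; input channels c do not
    affect the spatial shape, which is determined by kernel/stride/padding)."""
    # Branch 1: 1 x 1 conv, 2n kernels, stride 2.
    b1 = (conv_out(h, 1, 2, 0), conv_out(w, 1, 2, 0), 2 * n)
    # Branch 2: 1 x 1 conv (n/2 kernels, stride 2) -> 3 x 3 conv (n/2, stride 1,
    #           pad 1) -> 1 x 1 conv (2n kernels, stride 1).
    h2, w2 = conv_out(h, 1, 2, 0), conv_out(w, 1, 2, 0)
    h2, w2 = conv_out(h2, 3, 1, 1), conv_out(w2, 3, 1, 1)
    b2 = (h2, w2, 2 * n)
    assert b1 == b2  # element-wise addition of the branches requires equal shapes
    return b1

print(feature_module_shape(76, 76, 64, 64))  # (38, 38, 128), i.e. (h/2, w/2, 2n)
```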
Step two: as shown in fig. 3, the feature map is input to the feature fusion module, and the low-dimensional features are fused with the high-dimensional features to obtain a fused feature map. Fusing shallow and deep features improves small-target detection accuracy, but the shallow feature maps are large, and introducing too many shallow features greatly hurts the real-time performance of the algorithm, so only the Conv3-3 layer is introduced and downsampled to 38 × 38 with the number of channels unchanged. The feature map of the Conv4-3 layer is reduced in dimension from 512 to 256 channels with its size unchanged; the feature map of the Conv7 layer is upsampled to 38 × 38 and reduced from 1024 to 256 channels. The Conv3-3, Conv4-3 and Conv7 feature maps are then concatenated, giving 768 channels, and the number of default boxes per feature map position is changed from 4 to 6.
The Conv3-3 feature map is large and rich in feature information, so it is downsampled before fusion to reduce its size and enlarge the receptive field while limiting feature loss; a dilated (hole) convolution with stride 2 is used for the downsampling. The input and output sizes of the dilated convolution satisfy the following relation:
W2 = ⌊(W1 + 2p − d(k − 1) − 1) / s⌋ + 1
where p is the padding size, d is the dilation rate, s is the stride, k is the convolution kernel size, W1 is the input feature map size, and W2 is the output feature map size.
The Conv4-3 and Conv7 layers have many feature channels, which increases the amount of computation. To reduce the number of training parameters and improve the real-time performance of the algorithm, dimension reduction is first applied to these feature channels; after the Conv7 layer is reduced, its features are upsampled. Two methods are commonly used for upsampling: deconvolution and bilinear interpolation. Deconvolution introduces new training parameters into the network and thus affects real-time performance, while bilinear interpolation runs fast and is simple to implement while still giving a good upsampling result, so bilinear interpolation is used here.
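A minimal pure-Python sketch of bilinear upsampling; the align-corners-style coordinate mapping is an assumption, since the patent does not specify which interpolation variant is used:

```python
def bilinear_resize(img, out_h, out_w):
    """Bilinear upsampling of a 2-D grid (align-corners coordinate mapping)."""
    in_h, in_w = len(img), len(img[0])
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Map output coordinates back into input coordinates.
            y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            y0, x0 = int(y), int(x)
            y1, x1 = min(y0 + 1, in_h - 1), min(x0 + 1, in_w - 1)
            dy, dx = y - y0, x - x0
            # Weighted average of the four surrounding input pixels.
            out[i][j] = (img[y0][x0] * (1 - dy) * (1 - dx)
                         + img[y0][x1] * (1 - dy) * dx
                         + img[y1][x0] * dy * (1 - dx)
                         + img[y1][x1] * dy * dx)
    return out

up = bilinear_resize([[0.0, 1.0], [2.0, 3.0]], 3, 3)
print(up)  # [[0.0, 0.5, 1.0], [1.0, 1.5, 2.0], [2.0, 2.5, 3.0]]
```

Unlike deconvolution, this operation has no learnable weights, which is exactly why it is preferred here for real-time performance.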
Step three: the anchor-frame-free detection network CenterNet uses a ResNet structure as the backbone network and uses its skip-connection structure.
Step four: the backbone of the CenterNet network uses the ResNet101 structure, connects the convolution layers with skip connections, and uses a Convolutional Block Attention Module (CBAM) in the skip connections. The structure of the channel attention module is shown in fig. 4. The feature map is first compressed along the spatial dimension to obtain a one-dimensional vector; during this compression not only average pooling is considered, but max pooling is additionally introduced as a supplement, so the two pooling functions yield two one-dimensional vectors in total. Global average pooling provides gradient feedback for every pixel of the feature map, while during gradient back-propagation global max pooling provides feedback only at the location of the maximum response in the feature map. The channel attention is computed as:
Mc(F) = σ(W1(W0(F_avg)) + W1(W0(F_max)))
where F denotes the input feature map, F_avg and F_max denote the features obtained by global average pooling and global max pooling respectively, W0 and W1 denote the two layers of weights of the shared multi-layer perceptron, and σ is the sigmoid function.
The spatial attention structure is shown in fig. 5; the feature map output by the channel attention module serves as the input feature map of this module. First, channel-wise global max pooling and global average pooling are performed, and the two results are concatenated along the channel dimension. A convolution operation then reduces the result to one channel, and a sigmoid generates the spatial attention feature. Finally, this feature is multiplied with the input feature of the module to obtain the final feature. The calculation formula is as follows:
Ms(F) = σ(f7×7([AvgPool(F); MaxPool(F)]))
where σ denotes the sigmoid activation function and f7×7 denotes a convolution layer with a 7 × 7 kernel.
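The channel attention formula can be exercised with toy numbers. The tiny weights W0 and W1 below are illustrative placeholders, not learned values, and the two 2 × 2 channels are an assumed input:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def relu(v):
    return [max(0.0, x) for x in v]

def channel_attention(fmap, W0, W1):
    """Mc(F) = sigmoid(W1(W0(F_avg)) + W1(W0(F_max))) for a list of 2-D channel
    maps; W0/W1 are the shared two-layer MLP weights, with a ReLU in between."""
    avg = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]
    mx = [max(max(row) for row in ch) for ch in fmap]
    a = matvec(W1, relu(matvec(W0, avg)))
    m = matvec(W1, relu(matvec(W0, mx)))
    return [sigmoid(ai + mi) for ai, mi in zip(a, m)]

# Two 2 x 2 channels; the MLP reduces 2 -> 1 -> 2 channels.
F = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.0, 0.0], [0.0, 8.0]]]
W0 = [[0.5, 0.5]]      # 2 -> 1
W1 = [[1.0], [-1.0]]   # 1 -> 2
weights = channel_attention(F, W0, W1)
print([round(w, 3) for w in weights])  # one attention weight in (0, 1) per channel
```

Each channel of F would then be scaled by its weight before the spatial attention step.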
In the embodiment of the invention, a CBAM attention mechanism module is used in the skip-connection structure of ResNet, which increases the vehicle detection recognition rate; the module uses few parameters and does not affect the overall real-time performance. The overall structure is shown in fig. 6.
In testing, compared with traditional target detection algorithms, the embodiment of the invention achieves a good detection effect on the BIT-Vehicle dataset: the average precision reaches 87.9%, 2.1% higher than the network before improvement, and the average frame rate reaches 43 frames/s, meeting the requirement of real-time detection. The results are shown in fig. 7.
To sum up, the improved vehicle detection method based on an anchor-frame-free detection network in the embodiment of the invention first fuses low-dimensional and high-dimensional features through a feature fusion module; the fused features are then fed into the anchor-frame-free detection framework; a CBAM attention mechanism module added to the network improves the detection effect; and the recognition result is then output. The method adds a feature fusion module and an attention mechanism module, uses the CenterNet network as the anchor-frame-free detector, and adopts the skip-connection structure of the ResNet network, so vehicles can be detected quickly with good detection accuracy.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (9)
1. An improved vehicle detection method based on an anchor-frame-free detection network is characterized by comprising the following forming steps:
inputting an image into a network, and performing feature extraction on the input image through a feature extraction module to obtain a feature map;
inputting the feature map to a feature fusion module, and performing feature fusion on the low-dimensional features and the high-dimensional features to obtain a feature map after feature fusion;
inputting the feature map after feature fusion into an anchor-frame-free detection network, detecting and identifying the vehicle target, obtaining and outputting an identified result, wherein the anchor-frame-free detection network uses a CenterNet detection network;
and step four, the backbone network of the CenterNet network uses the structure of ResNet101, connects each convolution layer by using a jumper connection mode, and uses a convolution attention module CBAM in the jumper connection.
2. The method as claimed in claim 1, wherein in step one the feature extraction module replaces the standard convolution operation with a depthwise separable convolution: a channel-by-channel convolution with a 3 × 3 kernel is first performed on the input image to obtain a feature map, and a point-by-point convolution with a 1 × 1 kernel is then performed on the feature map to obtain the final image features, so as to reduce the amount of computation and improve feature extraction.
3. The improved vehicle detection method based on the anchor-frame-free detection network, characterized in that a feature fusion module is used in step two, and the accuracy of detecting small and overlapping targets is increased by fusing shallow and deep features: the feature map of the Conv3-3 layer is downsampled to 38 × 38 with the number of channels unchanged; the feature map of the Conv4-3 layer is reduced from 512 to 256 channels with its size unchanged; the feature map of the Conv7 layer is upsampled to 38 × 38 and reduced from 1024 to 256 channels; the Conv3-3, Conv4-3 and Conv7 feature maps are concatenated, the concatenated feature map has 768 channels, and the number of default boxes per feature map position is changed from 4 to 6.
4. The improved vehicle detection method based on the anchor-frame-free detection network as claimed in claim 2, wherein the anchor-free network CenterNet is adopted in step three for detecting the vehicle target, and CenterNet uses an anchor-free design.
5. The improved vehicle detection method based on the anchor-frame-free detection network as claimed in claim 2, wherein in step three the anchor-frame-free detection network CenterNet uses a ResNet structure as the backbone network and uses its skip-connection structure.
6. The improved vehicle detection method based on the anchor-frame-free detection network as claimed in claim 3, wherein the CBAM attention module is added to the skip-connection structure of the ResNet network; the CBAM module is a lightweight and general attention model that applies the feature attention mechanism over both the spatial and channel dimensions.
7. The improved vehicle detection method based on the anchor-frame-free detection network as claimed in claim 1, wherein the loss function of the CenterNet network is as follows:
Ldet = Lk + λsize · Lsize + λoff · Loff
wherein λsize = 0.1 and λoff = 1; Lk is the center point loss, Lsize is the size loss, and Loff is the offset loss.
8. The improved vehicle detection method based on the anchor-frame-free detection network as claimed in claim 4, wherein a data enhancement method is used during training to increase the diversity of the data; the data enhancement includes random-angle rotation, brightness change, noise interference and scale transformation of the training pictures.
9. The improved vehicle detection method based on the anchor-frame-free detection network as claimed in claim 4, wherein the vehicle position, center point and class probability are trained and predicted, and the center point position is corrected by the predicted offset to complete the detection.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110254258.4A CN112966747A (en) | 2021-03-04 | 2021-03-04 | Improved vehicle detection method based on anchor-frame-free detection network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112966747A true CN112966747A (en) | 2021-06-15 |
Family
ID=76277138
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110254258.4A Pending CN112966747A (en) | 2021-03-04 | 2021-03-04 | Improved vehicle detection method based on anchor-frame-free detection network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112966747A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111523486A (en) * | 2020-04-24 | 2020-08-11 | 重庆理工大学 | Mechanical arm grabbing detection method based on improved CenterNet |
WO2020181685A1 (en) * | 2019-03-12 | 2020-09-17 | 南京邮电大学 | Vehicle-mounted video target detection method based on deep learning |
CN111695448A (en) * | 2020-05-27 | 2020-09-22 | 东南大学 | Roadside vehicle identification method based on visual sensor |
CN112001428A (en) * | 2020-08-05 | 2020-11-27 | 中国科学院大学 | Anchor frame-free target detection network training method based on feature matching optimization |
CN112069868A (en) * | 2020-06-28 | 2020-12-11 | 南京信息工程大学 | Unmanned aerial vehicle real-time vehicle detection method based on convolutional neural network |
CN112070729A (en) * | 2020-08-26 | 2020-12-11 | 西安交通大学 | Anchor-free remote sensing image target detection method and system based on scene enhancement |
2021-03-04: CN application CN202110254258.4A filed; status Pending
Non-Patent Citations (3)
Title |
---|
Liu Hongzhe et al.: "Research Progress of Deep-Learning-Based Person Re-identification Technology", Proceedings of the 24th Annual Conference on New Network Technologies and Applications, Network Application Branch of the China Computer Users Association, 2020, pages 152-156 *
Chu Jinghui et al.: "An Anchor-Free Traffic Sign Recognition Algorithm Based on an Attention Model", Laser & Optoelectronics Progress, pages 337-345 *
Gong Liang: "Deep-Learning-Based UAV Pedestrian Detection", China Masters' Theses Full-text Database, Engineering Science and Technology II, no. 1, pages 25-35 *
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113627245A (en) * | 2021-07-02 | 2021-11-09 | 武汉纺织大学 | CRTS target detection method |
CN113627245B (en) * | 2021-07-02 | 2024-01-19 | 武汉纺织大学 | CRTS target detection method |
CN113569727A (en) * | 2021-07-27 | 2021-10-29 | 广东电网有限责任公司 | Method, system, terminal and medium for identifying construction site in remote sensing image |
CN113569727B (en) * | 2021-07-27 | 2022-10-21 | 广东电网有限责任公司 | Method, system, terminal and medium for identifying construction site in remote sensing image |
CN113688830A (en) * | 2021-08-13 | 2021-11-23 | 湖北工业大学 | Deep learning target detection method based on central point regression |
CN113688830B (en) * | 2021-08-13 | 2024-04-26 | 湖北工业大学 | Deep learning target detection method based on center point regression |
CN114565860A (en) * | 2022-03-01 | 2022-05-31 | 安徽大学 | Multi-dimensional reinforcement learning synthetic aperture radar image target detection method |
CN114882455A (en) * | 2022-07-04 | 2022-08-09 | 南京信息工程大学 | Zebra crossing courtesy pedestrian detection method based on improved RetinaNet |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7289918B2 (en) | Object recognition method and device | |
CN112966747A (en) | Improved vehicle detection method based on anchor-frame-free detection network | |
US10275719B2 (en) | Hyper-parameter selection for deep convolutional networks | |
CN111612008B (en) | Image segmentation method based on convolution network | |
KR20170140214A (en) | Filter specificity as training criterion for neural networks | |
CN114359851A (en) | Unmanned target detection method, device, equipment and medium | |
CN111696110B (en) | Scene segmentation method and system | |
US20220156528A1 (en) | Distance-based boundary aware semantic segmentation | |
US20230206604A1 (en) | Systems and methods for image feature extraction | |
CN111476133B (en) | Unmanned driving-oriented foreground and background codec network target extraction method | |
US20230154157A1 (en) | Saliency-based input resampling for efficient object detection | |
CN116630932A (en) | Road shielding target detection method based on improved YOLOV5 | |
US20230070439A1 (en) | Managing occlusion in siamese tracking using structured dropouts | |
CN116229406B (en) | Lane line detection method, system, electronic equipment and storage medium | |
CN110852272A (en) | Pedestrian detection method | |
CN115731517A (en) | Crowd detection method based on Crowd-RetinaNet network | |
CN112132746B (en) | Small-scale pedestrian target rapid super-resolution method for intelligent roadside equipment | |
CN115409991A (en) | Target identification method and device, electronic equipment and storage medium | |
CN112446292B (en) | 2D image salient object detection method and system | |
US20240161312A1 (en) | Realistic distraction and pseudo-labeling regularization for optical flow estimation | |
CN116486203B (en) | Single-target tracking method based on twin network and online template updating | |
US20220122594A1 (en) | Sub-spectral normalization for neural audio data processing | |
CN116524420B (en) | Key target detection method and system in traffic scene | |
WO2024102526A1 (en) | Realistic distraction and pseudo-labeling regularization for optical flow estimation | |
CN117496433A (en) | Electric vehicle helmet detection method based on backbone network optimization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20210615 |