CN114419589A - Road target detection method based on attention feature enhancement module - Google Patents

Road target detection method based on attention feature enhancement module

Info

Publication number
CN114419589A
CN114419589A (application CN202210049982.8A)
Authority
CN
China
Prior art keywords
feature map, attention, feature, module, inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210049982.8A
Other languages
Chinese (zh)
Inventor
潘树国 (Pan Shuguo)
孙迎春 (Sun Yingchun)
高旺 (Gao Wang)
彭雅慧 (Peng Yahui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202210049982.8A
Publication of CN114419589A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/217 - Validation; Performance evaluation; Active pattern learning techniques
    • G06F 18/2193 - Validation; Performance evaluation; Active pattern learning techniques based on specific statistical tests
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks

Abstract

The invention discloses a road target detection method based on an attention feature enhancement module, belonging to the technical field of target detection. First, a convolutional neural network is constructed to extract features of the road targets to be detected from the original image, yielding input feature maps of different sizes. Then, an attention feature enhancement module comprising a CBAM attention mechanism and a semantic enhancement branch is constructed and used to enhance the obtained feature maps. Finally, based on the enhanced feature maps containing deep semantic information and shallow texture information, a decoupled head performs classification and regression to complete target detection. Detection results on the BDD100K dataset show that the average precision of the disclosed method is improved by 1.8%; detection results on the PASCAL VOC 2007 dataset show an improvement of 0.6%.

Description

Road target detection method based on attention feature enhancement module
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to a road target detection method based on an attention feature enhancement module.
Background
With the growing number of automobiles, travel safety problems have become increasingly prominent. Autonomous driving technology based on computer vision provides a new solution to these traffic problems and is receiving increasing attention and research in many countries. When traditional target detection algorithms are used for road target detection, they suffer from poor discrimination in feature selection, a high missed-detection rate and a low recall rate, so improving the accuracy of road target detection algorithms in complex traffic scenes is of great significance.
Deep convolutional neural networks are highly robust because they can learn target features autonomously and extract key information. In recent years, target detection models based on convolutional neural networks have mainly followed two ideas, the target candidate box idea and the regression idea; the corresponding algorithms are called two-stage and single-stage algorithms. Two-stage detection algorithms, represented by R-CNN, Fast R-CNN, Faster R-CNN and R-FCN, first extract target candidate boxes and then complete model training with a detection network based on the extracted candidates. Single-stage detection algorithms, represented by SSD, YOLO and YOLOv3, achieve higher detection speed by directly regressing the target category and position information with the network. However, because different feature maps, and even different regions within the same feature map, contribute differently to the target, the features obtained by current detection algorithms are generic and redundant and cannot precisely meet task requirements.
Disclosure of Invention
In order to solve the above problems, the invention discloses a road target detection method based on an attention feature enhancement module. An attention mechanism is added to extract target features discriminatively and strengthen the feature expression of the task region of interest, while a semantic enhancement branch is added to propagate semantically strong features, effectively improving road target detection accuracy compared with other advanced target detection algorithms.
In order to achieve the purpose, the technical scheme of the invention is as follows:
a road target detection method based on an attention feature enhancement module comprises the following steps:
step 1, acquiring image information of a road target to be detected;
step 2, constructing a convolutional neural network, extracting the features representing the road target from the image information, and obtaining input feature maps of different sizes;
step 3, constructing an attention feature enhancement module comprising a CBAM attention mechanism and a semantic enhancement branch, and performing feature enhancement on the input feature maps obtained in step 2 through the attention feature enhancement module, thereby improving the feature expression of the task region of interest and obtaining feature maps containing deep semantic information and shallow texture information;
step 4, performing classification and regression with a decoupled output head based on the enhanced feature maps obtained in step 3, and outputting the detection result.
In the above road target detection method based on the attention feature enhancement module, the specific steps of constructing the attention feature enhancement module in step 3 comprise:
step 1.3.1, inputting the feature map C5, whose size is 1/32 of the original image, into a CBAM module to obtain the feature map A5; performing a convolution operation, batch normalization and activation function processing on A5 and then upsampling it to obtain the semantic enhanced feature map U4 with a size of 1/16 of the original image;
step 1.3.2, adding the input feature map C4, whose size is 1/16 of the original image, and the semantic enhanced feature map U4 to obtain the feature map S4; inputting S4 into a CBAM module to obtain the feature map A4; concatenating the feature maps A4 and U4 along the channel dimension to obtain the semantic enhanced feature map E4 with a size of 1/16 of the original image;
step 1.3.3, inputting the enhanced feature map E4 into a CSPLayer unit, performing a convolution operation, batch normalization and activation function processing on the resulting feature map, and then upsampling it to obtain the semantic enhanced feature map U3 with a size of 1/8 of the original image;
step 1.3.4, adding the input feature map C3, whose size is 1/8 of the original image, and the semantic enhanced feature map U3 to obtain the feature map S3; inputting S3 into a CBAM module to obtain the feature map A3; concatenating the feature maps A3 and U3 along the channel dimension to obtain the semantic enhanced feature map E3 with a size of 1/8 of the original image;
step 1.3.5, inputting the enhanced feature map E3 into a CSPLayer unit, the resulting feature map being used by the decoupling head to detect the target.
Further, the CBAM module described in step 1.3.1, step 1.3.2 and step 1.3.4 comprises two processing stages, generating a channel attention feature map and generating a spatial attention feature map:
Global mean pooling and global maximum pooling are first performed on the input feature map F to aggregate its spatial information, generating two spatial context descriptors $F_{avg}^{c}$ and $F_{max}^{c}$. The two descriptors are then passed through a multi-layer perceptron with one hidden layer to generate the channel attention map. When the input feature map is $F \in \mathbb{R}^{C \times H \times W}$, the channel attention feature map $M_c(F)$ is computed as:

$$M_c(F) = \sigma\left( W_1\left( R\left( W_0\left( g(F) \right) \right) \right) + W_1\left( R\left( W_0\left( \delta(F) \right) \right) \right) \right)$$

wherein $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ are the weights of the multi-layer perceptron; r represents the reduction ratio of the bottleneck structure of the multi-layer perceptron and is set to 16; σ(·) represents the Sigmoid activation function; R(·) represents the ReLU linear rectification function; g(·) is the global mean pooling function; δ(·) is the global maximum pooling function.
The Sigmoid activation function σ(·) is computed as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The ReLU linear rectification function R(·) is computed as:

$$R(x) = \max(0, x)$$

The global mean pooling function g(·) is computed as:

$$g(F) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F(i, j)$$

The global maximum pooling function δ(·) is computed as:

$$\delta(F) = \max_{1 \le i \le H,\ 1 \le j \le W} F(i, j)$$
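For concreteness, the channel attention computation above can be sketched in PyTorch as follows. This is an illustrative sketch, not the patent's reference implementation; the class and variable names are assumptions, and the reduction ratio defaults to the r = 16 given in the text.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Channel attention M_c(F) = sigma(MLP(g(F)) + MLP(delta(F)))."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP with one hidden layer; the bottleneck is reduced by r.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = f.shape
        avg = self.mlp(f.mean(dim=(2, 3)))   # g(F): global mean pooling
        mx = self.mlp(f.amax(dim=(2, 3)))    # delta(F): global maximum pooling
        scale = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return f * scale                     # M_c(F) applied to F element-wise
```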
In generating the spatial attention feature map, the input feature map F′ is first subjected to mean pooling and maximum pooling along the channel axis, generating two two-dimensional feature maps $F_{avg}^{\prime s} \in \mathbb{R}^{1 \times H \times W}$ and $F_{max}^{\prime s} \in \mathbb{R}^{1 \times H \times W}$. The two maps are concatenated along the channel dimension and convolved by a standard convolution layer to generate the spatial attention feature map $M_s(F')$, computed as:

$$M_s(F') = \sigma\left( f^{7 \times 7}\left( \left[ F_{avg}^{\prime s}; F_{max}^{\prime s} \right] \right) \right)$$

wherein $f^{7 \times 7}$ represents a convolution operation with a 7 × 7 convolution kernel.
The attention feature mapping finally output by the CBAM module is computed as:

$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F'$$

wherein ⊗ denotes element-wise multiplication.
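Continuing the sketch above, the spatial attention stage and the serial composition of the two attentions might look as follows (again illustrative, not the patent's code; padding 3 keeps the 7 × 7 convolution size-preserving):

```python
class SpatialAttention(nn.Module):
    """Spatial attention M_s(F') = sigma(f_7x7([F'_avg; F'_max]))."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        avg = f.mean(dim=1, keepdim=True)    # mean pooling along the channel axis
        mx = f.amax(dim=1, keepdim=True)     # maximum pooling along the channel axis
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return f * scale                     # M_s(F') applied to F' element-wise


class CBAM(nn.Module):
    """Channel attention followed by spatial attention, as described in the text."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        return self.sa(self.ca(f))           # F'' = M_s(F') ⊗ F', with F' = M_c(F) ⊗ F
```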
the method for detecting the road target based on the attention feature enhancing module further comprises the following steps: the specific steps of the CSPLAyer unit in step 1.3.3 and step 1.3.5 include:
firstly, the methodInputting a feature map F1Carrying out convolution operation, batch normalization and activation function processing to obtain a characteristic diagram F11
Then inputting a feature map F1Inputting another branch, performing convolution operation, batch normalization and activation function processing to obtain a feature map F21Will F21Successively carrying out three operations in successive residual bottleneck blocks to obtain a characteristic diagram F22
Finally, the feature map F11And characteristic diagram F22Stitching in channel dimension to obtain feature map F31And apply the feature map F31And carrying out convolution operation, batch normalization and activation function processing for subsequent operation.
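A possible PyTorch rendering of the CSPLayer unit follows; the Conv-BN-SiLU composition and the halved branch width are assumptions based on common YOLOX-style implementations rather than details stated in the patent.

```python
class ConvBNAct(nn.Module):
    """Convolution -> batch normalization -> activation function."""

    def __init__(self, c_in: int, c_out: int, k: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.conv(x)))


class Bottleneck(nn.Module):
    """Residual bottleneck block used inside the CSPLayer."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = ConvBNAct(channels, channels, k=1)
        self.conv2 = ConvBNAct(channels, channels, k=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.conv2(self.conv1(x))


class CSPLayer(nn.Module):
    def __init__(self, c_in: int, c_out: int, n_blocks: int = 3):
        super().__init__()
        mid = c_out // 2
        self.branch1 = ConvBNAct(c_in, mid, k=1)    # F1 -> F11
        self.branch2 = ConvBNAct(c_in, mid, k=1)    # F1 -> F21
        # Three consecutive residual bottleneck blocks: F21 -> F22.
        self.blocks = nn.Sequential(*[Bottleneck(mid) for _ in range(n_blocks)])
        self.fuse = ConvBNAct(2 * mid, c_out, k=1)  # conv-BN-act applied to F31

    def forward(self, f1: torch.Tensor) -> torch.Tensor:
        f11 = self.branch1(f1)
        f22 = self.blocks(self.branch2(f1))
        f31 = torch.cat([f11, f22], dim=1)          # channel-wise concatenation
        return self.fuse(f31)
```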
The invention has the following beneficial effects:
The invention provides a road target detection method based on an attention feature enhancement module. Compared with the baseline YOLOX-L algorithm, the proposed method maintains detection speed while improving detection accuracy: the average precision is improved by 1.8% on the BDD100K dataset and by 0.6% on the PASCAL VOC 2007 test set. With the growing number of automobiles, travel safety problems have become increasingly prominent, and autonomous driving technology based on computer vision provides a new solution that is receiving increasing attention and research in many countries. However, traditional target detection algorithms suffer from a high missed-detection rate and a low recall rate when detecting road targets in complex traffic scenes, so improving the accuracy of road target detection algorithms is of great significance.
Drawings
FIG. 1 is a flow chart of the present method;
FIG. 2 is a diagram of a CBAM attention model architecture;
FIG. 3 is a network architecture diagram of the AFE-YOLOX algorithm.
Detailed Description
The present invention will be further illustrated below with reference to the accompanying drawings and specific embodiments. It should be understood that these embodiments are merely illustrative of the invention and do not limit its scope.
The invention uses the BDD100K data set and the VOC data set to carry out experiments on the proposed road target detection method based on the attention feature enhancement module.
First, image information of the road targets to be detected is acquired, and a convolutional neural network is constructed to extract the features of the targets to be detected from the image information, obtaining input feature maps of different sizes. Then, an attention feature enhancement module consisting of a CBAM attention mechanism and a semantic enhancement branch performs feature enhancement on the input feature maps to obtain enhanced feature maps containing deep semantic information and shallow texture information. Finally, the decoupled head performs classification and regression on the enhanced feature maps to obtain the detection result.
Step 1, constructing a convolutional neural network to extract the features of the target to be detected from the image information and obtain input feature maps of different sizes:
Step 1.1, constructing a convolutional neural network for extracting target features.
The constructed network structure is shown in the CSPDarknet part of FIG. 3. The input image first passes through a Focus network module to obtain a feature map whose width and height are 1/2 of the original image and whose number of channels is 4 times that of the original image, and then passes multiple times through convolution (Conv) modules and CSPLayer units to obtain input feature maps with sizes of 1/8, 1/16 and 1/32 of the original image.
The Focus network module is implemented as follows: the input image is sampled at every other pixel, the four resulting independent feature layers are stacked, and the width and height information is thereby concentrated into the channel dimension. The Conv module is implemented as follows: a convolution operation is applied to the input feature layer, the result is batch-normalized, and an activation function is applied. The CSPLayer unit is implemented as follows: first, a convolution operation, batch normalization and activation function processing are performed on the input feature map F1 to obtain the feature map F11; then the input feature map F1 is fed into another branch, where a convolution operation, batch normalization and activation function processing yield the feature map F21, and F21 is passed successively through three consecutive residual bottleneck blocks to obtain the feature map F22; finally, the feature maps F11 and F22 are concatenated along the channel dimension to obtain the feature map F31, and a convolution operation, batch normalization and activation function processing are applied to F31 for subsequent operations.
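The Focus slicing described here can be sketched as below, building on the ConvBNAct module above (the trailing convolution and its width are assumptions matching typical YOLOX implementations):

```python
class Focus(nn.Module):
    """Sample every other pixel into 4 sub-images and stack them on the channel
    axis, halving width/height and quadrupling channels, then apply a Conv module."""

    def __init__(self, c_in: int = 3, c_out: int = 64):
        super().__init__()
        self.conv = ConvBNAct(4 * c_in, c_out, k=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, 4C, H/2, W/2): four interleaved pixel grids.
        patches = torch.cat(
            [x[..., ::2, ::2], x[..., 1::2, ::2],
             x[..., ::2, 1::2], x[..., 1::2, 1::2]],
            dim=1,
        )
        return self.conv(patches)
```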
Step 2, constructing an attention feature enhancement module comprising a CBAM attention mechanism and a semantic enhancement branch, and performing feature enhancement on the input feature maps obtained in step 1 through the attention feature enhancement module to obtain enhanced feature maps:
Step 2.1, inputting the feature map C5, whose size is 1/32 of the original image, into a CBAM module to obtain the feature map A5; performing a convolution operation, batch normalization and activation function processing on A5 and then upsampling it to obtain the semantic enhanced feature map U4 with a size of 1/16 of the original image.
Step 2.2, adding the input feature map C4, whose size is 1/16 of the original image, and the semantic enhanced feature map U4 to obtain the feature map S4; inputting S4 into a CBAM module to obtain the feature map A4; concatenating the feature maps A4 and U4 along the channel dimension to obtain the semantic enhanced feature map E4 with a size of 1/16 of the original image.
Step 2.3, inputting the enhanced feature map E4 into a CSPLayer unit, performing a convolution operation, batch normalization and activation function processing on the resulting feature map, and then upsampling it to obtain the semantic enhanced feature map U3 with a size of 1/8 of the original image.
Step 2.4, adding the input feature map C3, whose size is 1/8 of the original image, and the semantic enhanced feature map U3 to obtain the feature map S3; inputting S3 into a CBAM module to obtain the feature map A3; concatenating the feature maps A3 and U3 along the channel dimension to obtain the semantic enhanced feature map E3 with a size of 1/8 of the original image.
Step 2.5, inputting the enhanced feature map E3 into a CSPLayer unit; the resulting feature map is used by the decoupled head to detect the target.
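Reading steps 2.1 to 2.5 together, one plausible wiring of the attention feature enhancement module is sketched below, reusing the CBAM, ConvBNAct and CSPLayer sketches above. The channel widths and the choice of concatenation partners follow the reconstruction used in the text and should be checked against FIG. 3.

```python
class AFEModule(nn.Module):
    """Attention feature enhancement over backbone maps C3 (1/8), C4 (1/16), C5 (1/32)."""

    def __init__(self, c3: int = 256, c4: int = 512, c5: int = 1024):
        super().__init__()
        self.cbam5 = CBAM(c5)
        self.reduce5 = ConvBNAct(c5, c4, k=1)   # conv-BN-act before upsampling
        self.cbam4 = CBAM(c4)
        self.csp4 = CSPLayer(2 * c4, c4)
        self.reduce4 = ConvBNAct(c4, c3, k=1)
        self.cbam3 = CBAM(c3)
        self.csp3 = CSPLayer(2 * c3, c3)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, c3, c4, c5):
        a5 = self.cbam5(c5)                         # step 2.1: CBAM on the 1/32 map
        u4 = self.up(self.reduce5(a5))              # semantic enhanced map U4 (1/16)
        a4 = self.cbam4(c4 + u4)                    # step 2.2: add, then CBAM
        e4 = torch.cat([a4, u4], dim=1)             # channel-wise concatenation -> E4
        u3 = self.up(self.reduce4(self.csp4(e4)))   # step 2.3: CSPLayer, conv, upsample
        a3 = self.cbam3(c3 + u3)                    # step 2.4
        e3 = torch.cat([a3, u3], dim=1)
        return self.csp3(e3)                        # step 2.5: fed to the decoupled head
```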
The CBAM module used in step 2.1, step 2.2 and step 2.4 is shown in FIG. 2. The CBAM module comprises two processing stages, generating a channel attention feature map and generating a spatial attention feature map:
Global mean pooling and global maximum pooling are first performed on the input feature map F to aggregate its spatial information, generating two spatial context descriptors $F_{avg}^{c}$ and $F_{max}^{c}$. The two descriptors are then passed through a multi-layer perceptron with one hidden layer to generate the channel attention map. When the input feature map is $F \in \mathbb{R}^{C \times H \times W}$, the channel attention feature map $M_c(F)$ is computed as:

$$M_c(F) = \sigma\left( W_1\left( R\left( W_0\left( g(F) \right) \right) \right) + W_1\left( R\left( W_0\left( \delta(F) \right) \right) \right) \right)$$

wherein $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ are the weights of the multi-layer perceptron; r represents the reduction ratio of the bottleneck structure of the multi-layer perceptron and is set to 16; σ(·) represents the Sigmoid activation function; R(·) represents the ReLU linear rectification function; g(·) is the global mean pooling function; δ(·) is the global maximum pooling function.
In generating the spatial attention feature map, the input feature map F′ is first subjected to mean pooling and maximum pooling along the channel axis, generating two two-dimensional feature maps $F_{avg}^{\prime s} \in \mathbb{R}^{1 \times H \times W}$ and $F_{max}^{\prime s} \in \mathbb{R}^{1 \times H \times W}$. The two maps are concatenated along the channel dimension and convolved by a standard convolution layer to generate the spatial attention feature map $M_s(F')$, computed as:

$$M_s(F') = \sigma\left( f^{7 \times 7}\left( \left[ F_{avg}^{\prime s}; F_{max}^{\prime s} \right] \right) \right)$$

wherein $f^{7 \times 7}$ represents a convolution operation with a 7 × 7 convolution kernel.
The attention feature mapping finally output by the CBAM module is computed as:

$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F'$$

wherein ⊗ denotes element-wise multiplication.
Step 3, performing classification and regression with a decoupled output head based on the enhanced feature maps obtained in step 2, and outputting the detection result.
Step 3.1, applying a 1×1 convolution to the enhanced feature map F1 to reduce dimensionality, obtaining the 256-channel feature map F11; inputting F11 into the classification branch and applying a 3×3 convolution, batch normalization and activation function processing to obtain the feature map F12; applying a 1×1 convolution to F12 to obtain the feature map F13, whose number of channels equals the number of target categories.
Step 3.2, applying a 1×1 convolution to the enhanced feature map F1 to reduce dimensionality, obtaining the 256-channel feature map F21; inputting F21 into the regression branch and applying a 3×3 convolution, batch normalization and activation function processing to obtain the feature map F22; applying a 1×1 convolution to F22 to obtain the feature map F23, whose number of channels equals the number of target coordinates.
Step 3.3, applying a 1×1 convolution to the feature map F22 to obtain the feature map F33, whose number of channels equals the number of anchor boxes.
Step 3.4, concatenating the feature maps F13, F23 and F33 to obtain the detection result feature map.
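A simplified sketch of the decoupled head in steps 3.1 to 3.4 follows (illustrative only: strides and loss wiring are omitted, one anchor per location is assumed as in YOLOX, and ConvBNAct is reused from above):

```python
class DecoupledHead(nn.Module):
    def __init__(self, c_in: int, num_classes: int, num_anchors: int = 1):
        super().__init__()
        self.stem_cls = ConvBNAct(c_in, 256, k=1)   # step 3.1: 1x1 reduction -> F11
        self.stem_reg = ConvBNAct(c_in, 256, k=1)   # step 3.2: 1x1 reduction -> F21
        self.cls_branch = ConvBNAct(256, 256, k=3)  # 3x3 conv, BN, activation -> F12
        self.reg_branch = ConvBNAct(256, 256, k=3)  # -> F22
        self.cls_pred = nn.Conv2d(256, num_classes * num_anchors, 1)  # F13: classes
        self.reg_pred = nn.Conv2d(256, 4 * num_anchors, 1)            # F23: box coords
        self.obj_pred = nn.Conv2d(256, num_anchors, 1)                # F33: objectness

    def forward(self, f1: torch.Tensor) -> torch.Tensor:
        f12 = self.cls_branch(self.stem_cls(f1))
        f22 = self.reg_branch(self.stem_reg(f1))
        # Step 3.4: concatenate classification, regression and objectness maps.
        return torch.cat(
            [self.cls_pred(f12), self.reg_pred(f22), self.obj_pred(f22)], dim=1
        )
```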
Table 1 compares, based on the YOLOX-L algorithm, the overall and per-class detection results on the BDD100K dataset before and after adding the attention feature enhancement module. The AFE-YOLOX-L algorithm with the attention feature enhancement module reaches an average precision of 59.0% on 7 classes of road targets. Compared with the YOLOX-L algorithm, the average precision is improved by 1.8%; the person class is improved by 0.7%, rider by 0.9%, car by 0.2%, bus by 0.9%, truck by 1.2%, bike by 4.4% and motor by 4.4%.
TABLE 1 comparison of BDD100K data set test results
(Table 1 is reproduced as an image in the original publication; its values are not recoverable as text.)
Table 2 compares the performance of the AFE-YOLOX-L algorithm with other advanced target detection algorithms on the BDD100K dataset. The comparison shows that the AFE-YOLOX-L algorithm outperforms many advanced target detection algorithms.
TABLE 2 Performance contrast of AFE-YOLOX-L with other advanced object detection algorithms on BDD100K dataset
(Table 2 is reproduced as an image in the original publication; its values are not recoverable as text.)
Table 3 compares, based on the YOLOX-L algorithm, the overall and per-class detection results before and after adding the attention feature enhancement module, using the PASCAL VOC 2007 trainval and PASCAL VOC 2012 trainval sets for training and the PASCAL VOC 2007 test set for evaluation. With a 320 × 320 input image, the AFE-YOLOX-L algorithm with the attention feature enhancement module reaches an average precision of 84.1% on 20 target classes, an improvement of 0.6% over the YOLOX-L algorithm, with 17 classes improved to different degrees.
TABLE 3 comparison of the test results of the PASCAL VOC 2007 test set
(Table 3 is reproduced as an image in the original publication; its values are not recoverable as text.)
Table 4 compares the performance of the AFE-YOLOX-L algorithm with other advanced target detection algorithms on the PASCAL VOC 2007 test set. The comparison shows that the AFE-YOLOX-L algorithm outperforms many advanced target detection algorithms.
TABLE 4 Performance comparison of AFE-YOLOX-L with other advanced target detection algorithms on PASCAL VOC 2007 dataset
(Table 4 is reproduced as an image in the original publication; its values are not recoverable as text.)
It should be noted that the above content merely illustrates the technical idea of the present invention and does not limit its protection scope. It will be obvious to those skilled in the art that various modifications and refinements can be made without departing from the principle of the invention, and such modifications and refinements also fall within the protection scope of the claims of the present invention.

Claims (4)

1. A road target detection method based on an attention feature enhancement module is characterized by comprising the following steps:
step 1, acquiring image information of a road target to be detected;
step 2, constructing a convolutional neural network, extracting the features representing the road target from the image information, and obtaining input feature maps of different sizes;
step 3, constructing an attention feature enhancement module comprising a CBAM attention mechanism and a semantic enhancement branch, and performing feature enhancement on the input feature maps obtained in step 2 through the attention feature enhancement module, thereby improving the feature expression of the task region of interest and obtaining feature maps containing deep semantic information and shallow texture information;
step 4, performing classification and regression with a decoupled output head based on the enhanced feature maps obtained in step 3, and outputting the detection result.
2. The road target detection method based on the attention feature enhancement module according to claim 1, characterized in that: in step 3, the specific steps of constructing the attention feature enhancement module comprise:
step 1.3.1, inputting the feature map C5, whose size is 1/32 of the original image, into a CBAM module to obtain the feature map A5; performing a convolution operation, batch normalization and activation function processing on A5 and then upsampling it to obtain the semantic enhanced feature map U4 with a size of 1/16 of the original image;
step 1.3.2, adding the input feature map C4, whose size is 1/16 of the original image, and the semantic enhanced feature map U4 to obtain the feature map S4; inputting S4 into a CBAM module to obtain the feature map A4; concatenating the feature maps A4 and U4 along the channel dimension to obtain the semantic enhanced feature map E4 with a size of 1/16 of the original image;
step 1.3.3, inputting the enhanced feature map E4 into a CSPLayer unit, performing a convolution operation, batch normalization and activation function processing on the resulting feature map, and then upsampling it to obtain the semantic enhanced feature map U3 with a size of 1/8 of the original image;
step 1.3.4, adding the input feature map C3, whose size is 1/8 of the original image, and the semantic enhanced feature map U3 to obtain the feature map S3; inputting S3 into a CBAM module to obtain the feature map A3; concatenating the feature maps A3 and U3 along the channel dimension to obtain the semantic enhanced feature map E3 with a size of 1/8 of the original image;
step 1.3.5, inputting the enhanced feature map E3 into a CSPLayer unit, the resulting feature map being used by the decoupled head to detect the target.
3. The road target detection method based on the attention feature enhancement module according to claim 2, characterized in that: the CBAM module described in step 1.3.1, step 1.3.2 and step 1.3.4 comprises two processing stages, generating a channel attention feature map and generating a spatial attention feature map:
global mean pooling and global maximum pooling are performed on the input feature map F to aggregate its spatial information, generating two spatial context descriptors $F_{avg}^{c}$ and $F_{max}^{c}$; the two descriptors are passed through a multi-layer perceptron with one hidden layer to generate the channel attention map; when the input feature map is $F \in \mathbb{R}^{C \times H \times W}$, the channel attention feature map $M_c(F)$ is computed as:

$$M_c(F) = \sigma\left( W_1\left( R\left( W_0\left( g(F) \right) \right) \right) + W_1\left( R\left( W_0\left( \delta(F) \right) \right) \right) \right)$$

wherein $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ are the weights of the multi-layer perceptron; r represents the reduction ratio of the bottleneck structure of the multi-layer perceptron and is set to 16; σ(·) represents the Sigmoid activation function; R(·) represents the ReLU linear rectification function; g(·) is the global mean pooling function; δ(·) is the global maximum pooling function;

the Sigmoid activation function σ(·) is computed as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

the ReLU linear rectification function R(·) is computed as:

$$R(x) = \max(0, x)$$

the global mean pooling function g(·) is computed as:

$$g(F) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F(i, j)$$

the global maximum pooling function δ(·) is computed as:

$$\delta(F) = \max_{1 \le i \le H,\ 1 \le j \le W} F(i, j)$$
in generating the spatial attention feature map, the input feature map F′ is first subjected to mean pooling and maximum pooling along the channel axis, generating two two-dimensional feature maps $F_{avg}^{\prime s} \in \mathbb{R}^{1 \times H \times W}$ and $F_{max}^{\prime s} \in \mathbb{R}^{1 \times H \times W}$; the two maps are concatenated along the channel dimension and convolved by a standard convolution layer to generate the spatial attention feature map $M_s(F')$, computed as:

$$M_s(F') = \sigma\left( f^{7 \times 7}\left( \left[ F_{avg}^{\prime s}; F_{max}^{\prime s} \right] \right) \right)$$

wherein $f^{7 \times 7}$ represents a convolution operation with a 7 × 7 convolution kernel;

the attention feature mapping finally output by the CBAM module is computed as:

$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F'$$

wherein ⊗ denotes element-wise multiplication.
4. The road target detection method based on the attention feature enhancement module according to claim 2, characterized in that: the specific steps of the CSPLayer unit in step 1.3.3 and step 1.3.5 comprise:
first, performing a convolution operation, batch normalization and activation function processing on the input feature map F1 to obtain the feature map F11;
then, inputting the feature map F1 into another branch, performing a convolution operation, batch normalization and activation function processing to obtain the feature map F21, and passing F21 successively through three consecutive residual bottleneck blocks to obtain the feature map F22;
finally, concatenating the feature maps F11 and F22 along the channel dimension to obtain the feature map F31, and performing a convolution operation, batch normalization and activation function processing on F31 for subsequent operations.
CN202210049982.8A 2022-01-17 2022-01-17 Road target detection method based on attention feature enhancement module Pending CN114419589A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210049982.8A CN114419589A (en) 2022-01-17 2022-01-17 Road target detection method based on attention feature enhancement module

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210049982.8A CN114419589A (en) 2022-01-17 2022-01-17 Road target detection method based on attention feature enhancement module

Publications (1)

Publication Number Publication Date
CN114419589A (en) 2022-04-29

Family

ID=81273922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210049982.8A Pending CN114419589A (en) 2022-01-17 2022-01-17 Road target detection method based on attention feature enhancement module

Country Status (1)

Country Link
CN (1) CN114419589A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115690704A (en) * 2022-09-27 2023-02-03 淮阴工学院 LG-CenterNet model-based complex road scene target detection method and device
CN115690704B (en) * 2022-09-27 2023-08-22 淮阴工学院 LG-CenterNet model-based complex road scene target detection method and device
CN116824272A (en) * 2023-08-10 2023-09-29 湖北工业大学 Feature enhanced target detection method based on rotation feature
CN116824272B (en) * 2023-08-10 2024-02-13 湖北工业大学 Feature enhanced target detection method based on rotation feature


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination