CN112183203B - Real-time traffic sign detection method based on multi-scale pixel feature fusion - Google Patents

Real-time traffic sign detection method based on multi-scale pixel feature fusion

Info

Publication number
CN112183203B
CN112183203B (application CN202010866848.8A)
Authority
CN
China
Prior art keywords
feature
feature map
channel
fusion
pixel
Prior art date
Legal status
Active
Application number
CN202010866848.8A
Other languages
Chinese (zh)
Other versions
CN112183203A (en)
Inventor
任坤
黄泷
范春奇
陶清扬
冯波
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202010866848.8A
Publication of CN112183203A
Application granted
Publication of CN112183203B

Classifications

    • G06V 20/582 — Recognition of traffic objects: traffic signs
    • G06F 18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 — Classification based on parametric or probabilistic models, e.g. likelihood ratio or false acceptance rate versus false rejection rate
    • G06F 18/253 — Fusion techniques of extracted features
    • G06N 3/045 — Neural network architectures: combinations of networks
    • G06N 3/047 — Neural network architectures: probabilistic or stochastic networks
    • G06N 3/048 — Neural network architectures: activation functions
    • G06N 3/084 — Learning methods: backpropagation, e.g. using gradient descent
    • G06T 3/4038 — Image scaling: image mosaicing, e.g. composing plane images from plane sub-images

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A real-time traffic sign detection method based on multi-scale pixel feature fusion belongs to the field of deep learning and target detection. First, an image containing traffic signs is acquired and preprocessed; second, the preprocessed image is input into the MobileNetV2 network for feature extraction; the extracted multi-scale feature maps are input into a pixel feature fusion module for pixel rearrangement and spliced to generate a fusion feature map carrying both semantic and detail information; the fusion feature map is then downsampled to obtain feature maps at six scales, which are input into an efficient channel attention module that assigns weights to the feature channels according to their importance; the weighted six-scale feature maps are input into an SSD detection layer to predict bounding box positions and object classes; finally, non-maximum suppression is performed to obtain the optimal traffic sign detection result. The method balances real-time performance and accuracy when detecting traffic sign images and is highly robust.

Description

Real-time traffic sign detection method based on multi-scale pixel feature fusion
Technical Field
The invention belongs to the field of deep learning and target detection, and particularly relates to a real-time traffic sign detection method based on multi-scale pixel feature fusion.
Background
Traffic signs are critical to road traffic safety. In real driving scenes, illumination changes caused by sunlight, weather and other natural conditions, together with special conditions such as fading, deformation and occlusion of traffic signs, can cause the human eye to miss or misidentify a traffic sign. Misjudging the road conditions ahead can cause traffic accidents, leading to personal, property and vehicle losses and even threatening life. Real-time, accurate traffic sign detection, as an important component of advanced driver assistance systems, can help drivers ensure driving safety and avoid danger, and has important applications in traffic safety, automatic driving and other fields.
In practical applications, a driver assistance system is required to be extremely sensitive, i.e. the category of a traffic sign should be identified while the vehicle is still far from it, giving the driver or the driving system an early warning. This requires the detection algorithm to offer both high real-time performance and strong small-target detection. Current methods for improving small-target detection bring additional computation and parameters, which reduces the real-time performance of the detection algorithm. Therefore, how to improve an algorithm's small-target detection performance to meet the requirements of a real driver assistance system while ensuring real-time performance, without introducing excessive additional computational cost, is a problem to be solved.
Disclosure of Invention
In order to solve the above technical problems, the invention aims to provide a real-time traffic sign detection method based on multi-scale pixel feature fusion, which overcomes the difficulty that deep-learning-based traffic sign detection methods struggle to achieve both real-time performance and accuracy.
In order to achieve the technical purpose, the technical scheme of the invention is as follows:
a real-time traffic sign detection method based on multi-scale pixel feature fusion comprises the following steps:
(1) Acquiring an image containing traffic signs, and preprocessing the acquired image;
(2) Inputting the image preprocessed in step (1) into the MobileNetV2 network for feature extraction, obtaining depth feature maps at three scales;
(3) Inputting the three-scale depth feature maps obtained in step (2) into a pixel feature fusion module for pixel rearrangement, and splicing them to generate a fusion feature map carrying both semantic and detail information;
(4) Downsampling the fusion feature map obtained in step (3) to obtain feature maps at six scales, inputting them into the efficient channel attention module, and assigning weights to the feature channels according to their importance;
(5) Inputting the weighted six-scale feature maps generated in step (4) into the SSD detection layer for traffic sign classification and localization, and finally performing non-maximum suppression to obtain the optimal traffic sign detection result.
Further, the specific process of step (1) is as follows:
(a) Acquiring images containing traffic signs, and annotating the bounding box and category information of each traffic sign appearing in each image;
(b) When the number of acquired images is small, performing data augmentation on the existing images: more images are created by flipping, translating, rotating or adding noise, so that the trained neural network performs better;
(c) Uniformly converting the image resolution to 300×300 to fit the network input size;
(d) Optimizing the images based on the numbers of positive and negative samples, and dividing them to obtain a training image set and a test image set; a minimal sketch of this preprocessing is given below.
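The following is a minimal, illustrative sketch of steps (a)-(d) in Python; the augmentation choices, noise level and helper names are assumptions made for illustration, not details specified by the method (box annotations would also need the matching geometric transforms, which are omitted here).

```python
import random
import numpy as np
from PIL import Image, ImageOps

def preprocess(image_path, augment=True):
    """Load one training image, optionally augment it, and resize to 300x300."""
    img = Image.open(image_path).convert("RGB")
    if augment:
        if random.random() < 0.5:              # horizontal flip (boxes must be flipped too)
            img = ImageOps.mirror(img)
        if random.random() < 0.5:              # mild additive Gaussian noise
            arr = np.asarray(img).astype(np.float32)
            arr += np.random.normal(0.0, 8.0, arr.shape)
            img = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    return img.resize((300, 300))              # unify resolution for the network input
```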
Further, the specific process of step (2) is as follows:
(A) First, preliminary feature extraction is performed on the 300×300 input image through a 3×3 standard convolution block, obtaining a 150×150×32 feature map, where 32 is the number of channels of the feature map;
(B) The 150×150×32 feature map obtained in step (A) then passes through 6 inverted residual bottleneck blocks in sequence for depth feature extraction, obtaining depth feature maps A, B and C of sizes 38×38×32, 19×19×96 and 10×10×320 respectively.
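A sketch of this three-scale extraction is shown below, assuming torchvision's MobileNetV2 implementation; the tap indices 6, 13 and 17 (which yield 32-, 96- and 320-channel maps of the stated sizes on a 300×300 input) are a property of that implementation, not something the method prescribes.

```python
import torch
from torchvision.models import mobilenet_v2

backbone = mobilenet_v2(weights=None).features  # feature-extraction layers only
TAPS = {6: "A", 13: "B", 17: "C"}               # assumed layer indices (torchvision layout)

def extract_features(x: torch.Tensor) -> dict:
    """x: (N, 3, 300, 300) -> {'A': 38x38x32, 'B': 19x19x96, 'C': 10x10x320}."""
    feats = {}
    for i, layer in enumerate(backbone):
        x = layer(x)
        if i in TAPS:
            feats[TAPS[i]] = x
        if i == max(TAPS):                      # no need to run the remaining layers
            break
    return feats
```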
Further, the specific process of the step (3) is as follows:
Step (I): pixel rearrangement with an upsampling factor of 4 is performed on the 10×10×320 depth feature map obtained in step (2), obtaining a 38×38×20 upsampled feature map D;
(II) pixel rearrangement with an upsampling factor of 2 is performed on the 19×19×96 depth feature map obtained in step (2), obtaining a 38×38×24 upsampled feature map E;
(III) the 38×38×20 upsampled feature map D and the 38×38×24 upsampled feature map E obtained by pixel rearrangement in steps (I) and (II) are spliced with the 38×38×32 depth feature map A obtained in step (2), generating a 38×38×76 fusion feature map F carrying both semantic and detail information.
The pixel feature fusion module in step (3) synthesizes the fusion feature map by pixel rearrangement. Compared with other upsampling methods, pixel rearrangement enhances the information carried by the feature map without adding any extra parameters or computation. Pixel rearrangement expands the width and height by compressing the number of channels: features at the same pixel position in a low-resolution feature map with r²C channels and spatial size H×W are rearranged in a specific order into a high-resolution feature map with C channels and spatial size rH×rW, where r is the upsampling factor. Unlike interpolation and deconvolution, pixel rearrangement introduces no additional parameters or computational expense, while avoiding the artifacts and checkerboard effects of interpolation and deconvolution.
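A sketch of this fusion with PyTorch's nn.PixelShuffle (which implements exactly the r²C×H×W → C×rH×rW rearrangement described above) follows. Note one assumption: pixel-shuffling the 10×10×320 map by a factor of 4 mathematically yields 40×40×20, so a center crop to 38×38 is used here to match the stated size of D.

```python
import torch
import torch.nn as nn

shuffle4 = nn.PixelShuffle(4)   # (N, 320, 10, 10) -> (N, 20, 40, 40)
shuffle2 = nn.PixelShuffle(2)   # (N,  96, 19, 19) -> (N, 24, 38, 38)

def fuse(A, B, C):
    """A: (N,32,38,38), B: (N,96,19,19), C: (N,320,10,10) -> F: (N,76,38,38)."""
    D = shuffle4(C)[:, :, 1:39, 1:39]   # center-crop 40x40 -> 38x38 (assumption)
    E = shuffle2(B)
    return torch.cat([A, D, E], dim=1)  # 32 + 20 + 24 = 76 channels
```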
Further, the specific process of step (4) is as follows:
(i) The 38×38×76 fusion feature map F obtained in step (3) is downsampled by a convolution with stride 2, obtaining a 19×19×256 feature map G; feature map G is downsampled by a convolution with stride 2, obtaining a 10×10×256 feature map H; continuing in the same way, a 5×5×256 feature map I, a 3×3×128 feature map J and a 1×1×128 feature map K are obtained in sequence;
(ii) The 38×38×76 fusion feature map F obtained in step (3) and the feature maps G-K, i.e. feature maps at six scales, are each input into the efficient channel attention module, which assigns weights to the feature channels according to their importance, giving weighted feature maps at six scales;
The efficient channel attention module of step (ii) learns the relationships between channels and assigns channel weights based on channel importance;
First, the feature channel dimension is compressed: the original H×W×C feature channels are converted to 1×1×C by global pooling, obtaining a global feature value in the channel dimension;
Then, each channel and its 5 neighbourhood channels are integrated by a one-dimensional convolution with kernel size 5, obtaining the inter-channel correlation parameter $L_i$:

$$L_i = \sum_{j=1}^{5} \alpha_j y_i^j$$

where $\alpha_j$ denotes the one-dimensional convolution kernel parameters, which are initialized by Xavier initialization and updated during network training, and $y_i^j$ denotes the global feature value of the $j$-th of the 5 neighbourhood channels of feature channel $C_i$;
Then $L_i$ is passed through a Sigmoid activation function to obtain the activation value of each channel as the channel weight $\omega_i$:

$$\omega_i = \sigma(L_i)$$

where $\sigma$ denotes the Sigmoid activation function;
Finally, the weights are multiplied with the original channel feature values, obtaining the weighted output feature channels; by weighting the feature channels, the network can focus on important subject features.
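A compact sketch of this efficient-channel-attention step (global pooling, a kernel-size-5 one-dimensional convolution across neighbouring channels, Sigmoid weighting) follows; it mirrors the ECA design described above, with the module and variable names being illustrative.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention: weight channels by local cross-channel interaction."""
    def __init__(self, k: int = 5):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                 # HxWxC -> 1x1xC global features
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        nn.init.xavier_uniform_(self.conv.weight)           # Xavier initialization
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (N, C, H, W)
        y = self.pool(x).squeeze(-1).transpose(1, 2)        # (N, 1, C)
        w = self.sigmoid(self.conv(y))                      # L_i, then sigma(L_i) per channel
        return x * w.transpose(1, 2).unsqueeze(-1)          # reweighted feature map
```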
Further, the specific process of step (5) is as follows:
Step (I): taking the six weighted-scale feature maps obtained in step (4) as input, several default boxes are generated for each pixel of the input feature maps, and detection is then performed by a localization sub-network and a classification sub-network respectively; the detection values comprise two parts: bounding box position and category confidence; the localization sub-network predicts a bounding box for each default box; the classification sub-network predicts the confidence of every category for each default box;
Step (II): non-maximum suppression is applied over the category confidences and the position offsets of the predicted boxes relative to the default boxes, and the predicted box with the smallest objective loss function is selected as the optimal predicted box, obtaining the target category and box position.
The objective loss function L(x, l, c, g) of the detection network in step (II) consists of a classification loss function $L_{conf}(x, c)$ and a localization loss function $L_{loc}(x, l, g)$:

$$L(x, l, c, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where x indicates the matching of default boxes on the feature maps, l is the predicted box, c is the set of confidence predictions of the default boxes over the categories, g is the ground-truth box, $L_{conf}(x, c)$ is the softmax classification loss of the default boxes over the category score set c, $L_{loc}(x, l, g)$ is the position loss function, N is the number of default boxes matched to ground-truth boxes, and the weight coefficient $\alpha$ is set to 1 by cross-validation. By optimizing this loss function, the detection network achieves more accurate target localization and classification.
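A sketch of this objective with α = 1 is given below; it assumes default boxes have already been matched to ground truth, and omits SSD's hard negative mining, so the names and shapes are illustrative only.

```python
import torch
import torch.nn.functional as F

def multibox_loss(conf_pred, loc_pred, labels, loc_target, pos_mask, alpha=1.0):
    """L(x,l,c,g) = (1/N) * (L_conf(x,c) + alpha * L_loc(x,l,g)).

    conf_pred: (M, num_classes)   loc_pred / loc_target: (M, 4)
    labels:    (M,) class index per default box (0 = background)
    pos_mask:  (M,) True where a default box matched a ground-truth box
    """
    N = pos_mask.sum().clamp(min=1).float()                        # matched default boxes
    l_conf = F.cross_entropy(conf_pred, labels, reduction="sum")   # softmax classification loss
    l_loc = F.smooth_l1_loss(loc_pred[pos_mask], loc_target[pos_mask],
                             reduction="sum")                      # position loss on positives
    return (l_conf + alpha * l_loc) / N
```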
The beneficial effects brought by adopting the technical scheme are that:
The invention provides a multi-scale pixel feature fusion strategy: the depth feature maps extracted by the MobileNetV2 network are pixel-rearranged and synthesized into a fusion feature map, which, compared with other upsampling methods, enhances the small-target information carried by the feature map without adding any extra parameters or computation. An efficient channel attention module is added before the detection network, assigning weights to the feature channels according to their importance, which effectively improves detection performance. The method has a small memory footprint, high detection speed and accurate small-target detection, and can realize high-precision real-time traffic sign detection.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a diagram of a model structure of the present invention;
Detailed Description
To make the technical solutions and advantages of the method of the present invention clearer, the following description is given by way of example with reference to the accompanying drawings, which are not intended to limit the invention:
step 1, obtaining images containing traffic signs, and marking boundary boxes and category information of each traffic sign appearing in each image.
When the number of acquired images is small, data augmentation is performed on the existing images: more images are created by flipping, translating, rotating or adding noise, so that the trained neural network performs better.
The image resolution is uniformly converted to 300×300 to fit the input size.
The images are optimized based on the numbers of positive and negative samples and divided to obtain a training image set and a test image set.
Step 2, preliminary feature extraction is performed on the 300×300 input image through a 3×3 standard convolution block, obtaining a 150×150×32 feature map, where 32 is the number of channels of the feature map.
The 150×150×32 feature map then passes through 6 inverted residual bottleneck blocks in sequence for depth feature extraction, obtaining depth feature maps A, B and C of sizes 38×38×32, 19×19×96 and 10×10×320 respectively.
Step 3, performing pixel rearrangement with an upsampling factor of 4 on the 10×10×320 depth feature map obtained by the feature extraction in step 2 to obtain a 38×38×20 upsampling feature map D;
Performing pixel rearrangement with an image upsampling factor of 2 on the 19×19×96 depth feature image obtained by the feature extraction in the step 2 to obtain a 38×38×24 upsampling feature image E;
The 38×38×20 upsampled feature map D and the 38×38×24 upsampled feature map E obtained by pixel rearrangement are then spliced with the 38×38×32 depth feature map A to generate a 38×38×76 fusion feature map F carrying both semantic and detail information.
Step 4, the 38×38×76 fusion feature map F obtained in step 3 is downsampled by a convolution with stride 2, obtaining a 19×19×256 feature map G; feature map G is downsampled by a convolution with stride 2, obtaining a 10×10×256 feature map H; continuing in the same way, a 5×5×256 feature map I, a 3×3×128 feature map J and a 1×1×128 feature map K are obtained in sequence (a sketch of this chain follows);
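One possible stride-2 convolution chain realizing these sizes is sketched below; the kernel sizes and paddings are assumptions chosen so the spatial sizes match (38 → 19 → 10 → 5 → 3 → 1). The final layer uses an unpadded 3×3 convolution, since stride 2 alone cannot map 3×3 to 1×1; normalization and activations are omitted.

```python
import torch.nn as nn

downs = nn.ModuleList([
    nn.Conv2d(76, 256, 3, stride=2, padding=1),   # F 38x38x76 -> G 19x19x256
    nn.Conv2d(256, 256, 3, stride=2, padding=1),  # G -> H 10x10x256
    nn.Conv2d(256, 256, 3, stride=2, padding=1),  # H -> I 5x5x256
    nn.Conv2d(256, 128, 3, stride=2, padding=1),  # I -> J 3x3x128
    nn.Conv2d(128, 128, 3, stride=1, padding=0),  # J -> K 1x1x128 (valid conv)
])

def pyramid(F):
    """F: (N, 76, 38, 38) -> [F, G, H, I, J, K], the six detection scales."""
    maps = [F]
    for conv in downs:
        maps.append(conv(maps[-1]))
    return maps
```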
The 38×38×76 fusion feature map F and the feature maps G-K, i.e. feature maps at six scales, are each input into the efficient channel attention module; the feature channel dimension is compressed, and the original H×W×C feature channels are converted to 1×1×C by global pooling, obtaining a global feature value in the channel dimension;
Then, each channel and its 5 neighbourhood channels are integrated by a one-dimensional convolution with kernel size 5, obtaining the inter-channel correlation parameter $L_i$:

$$L_i = \sum_{j=1}^{5} \alpha_j y_i^j$$

where $\alpha_j$ denotes the one-dimensional convolution kernel parameters, which are initialized by Xavier initialization and updated during network training, and $y_i^j$ denotes the global feature value of the $j$-th of the 5 neighbourhood channels of feature channel $C_i$;
Then $L_i$ is passed through a Sigmoid activation function to obtain the activation value of each channel as the channel weight $\omega_i$:

$$\omega_i = \sigma(L_i)$$

where $\sigma$ denotes the Sigmoid activation function;
Finally, the weights are multiplied with the original channel feature values, obtaining weighted feature maps at six scales;
Step 5, taking the six weighted-scale feature maps obtained in step 4 as input, several default boxes are generated for each pixel of the input feature maps, and detection is then performed by a localization sub-network and a classification sub-network respectively; the detection values comprise two parts: bounding box position and category confidence; the localization sub-network predicts a bounding box for each default box; the classification sub-network predicts the confidence of every category for each default box.
Non-maximum suppression is then applied over the category confidences and the position offsets of the predicted boxes relative to the default boxes, and the predicted box with the smallest objective loss function is selected as the optimal predicted box, obtaining the target category and box position (a sketch of this step follows).
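A minimal sketch of the suppression step using torchvision's NMS operator follows; the score and IoU thresholds are illustrative assumptions, and per-class handling is omitted.

```python
import torch
from torchvision.ops import nms

def select_boxes(boxes, scores, score_thresh=0.5, iou_thresh=0.45):
    """boxes: (M, 4) as (x1, y1, x2, y2); scores: (M,) confidences for one class."""
    keep = scores > score_thresh               # drop low-confidence predictions
    boxes, scores = boxes[keep], scores[keep]
    kept = nms(boxes, scores, iou_thresh)      # indices of retained, non-overlapping boxes
    return boxes[kept], scores[kept]
```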
The objective loss function L(x, l, c, g) of the network consists of a classification loss function $L_{conf}(x, c)$ and a localization loss function $L_{loc}(x, l, g)$:

$$L(x, l, c, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where x indicates the matching of default boxes on the feature maps, l is the predicted box, c is the set of confidence predictions of the default boxes over the categories, g is the ground-truth box, $L_{conf}(x, c)$ is the softmax classification loss of the default boxes over the category score set c, $L_{loc}(x, l, g)$ is the position loss function, N is the number of default boxes matched to ground-truth boxes, and the weight coefficient $\alpha$ is set to 1 by cross-validation. By optimizing this loss function, the detection network achieves more accurate target localization and classification.
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereto, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the present invention.

Claims (6)

1. The real-time traffic sign detection method based on multi-scale pixel feature fusion is characterized by comprising the following steps of:
(1) Acquiring an image containing traffic signs, and preprocessing the acquired image;
(2) Inputting the image preprocessed in step (1) into the MobileNetV2 network for feature extraction, obtaining depth feature maps at three scales;
(3) Inputting the three-scale depth feature maps obtained in step (2) into a pixel feature fusion module for pixel rearrangement, and splicing them to generate a fusion feature map carrying both semantic and detail information;
(4) Downsampling the fusion feature map obtained in step (3) to obtain feature maps at six scales, inputting them into the efficient channel attention module, and assigning weights to the feature channels according to their importance;
(5) Inputting the weighted six-scale feature maps generated in step (4) into the SSD detection layer for traffic sign classification and localization, and finally performing non-maximum suppression to obtain the optimal traffic sign detection result;
The pixel feature fusion module in step (3) synthesizes the fusion feature map by pixel rearrangement; compared with other upsampling methods, pixel rearrangement enhances the information carried by the feature map without adding any extra parameters or computation; pixel rearrangement expands the width and height by compressing the number of channels: features at the same pixel position in a low-resolution feature map with r²C channels and spatial size H×W are rearranged in a specific order into a high-resolution feature map with C channels and spatial size rH×rW, where r is the upsampling factor;
The specific process of the step (4) is as follows:
(i) Downsampling the 38×38×76 fusion feature map F obtained in step (3) by a convolution with stride 2 to obtain a 19×19×256 feature map G; downsampling feature map G by a convolution with stride 2 to obtain a 10×10×256 feature map H; continuing in the same way, obtaining a 5×5×256 feature map I, a 3×3×128 feature map J and a 1×1×128 feature map K in sequence;
(ii) Inputting the 38×38×76 fusion feature map F obtained in step (3) and the feature maps G-K, i.e. feature maps at six scales, into the efficient channel attention module, which assigns weights to the feature channels according to their importance, obtaining weighted feature maps at six scales;
the efficient channel attention module of step (ii) is capable of learning relationships between channels, assigning channel weights based on channel importance;
First, the feature channel dimension is compressed, and the original H×W×C feature map is converted to 1×1×C by global pooling, obtaining a global feature value in the channel dimension;
Then, each channel and its 5 neighbourhood channels are integrated by a one-dimensional convolution with kernel size 5, obtaining the inter-channel correlation parameter $L_i$:

$$L_i = \sum_{j=1}^{5} \alpha_j y_i^j$$

where $\alpha_j$ denotes the one-dimensional convolution kernel parameters, which are initialized by Xavier initialization and updated during network training, and $y_i^j$ denotes the global feature value of the $j$-th of the 5 neighbourhood channels of feature channel $C_i$;
Then $L_i$ is passed through a Sigmoid activation function to obtain the activation value of each channel as the channel weight $\omega_i$:

$$\omega_i = \sigma(L_i)$$

where $\sigma$ denotes the Sigmoid activation function;
Finally, the weights are multiplied with the original channel feature values, obtaining the weighted output feature channels; by weighting the feature channels, the network can focus on important subject features.
2. The method for detecting real-time traffic sign based on multi-scale pixel feature fusion as claimed in claim 1, wherein the specific process of the step (1) is as follows:
(a) Acquiring images containing traffic signs and performing data augmentation;
(b) Annotating the bounding box and category information of each traffic sign appearing in each image;
(c) Uniformly converting the image resolution to 300×300 to fit the input size;
(d) Optimizing the images based on the numbers of positive and negative samples, and dividing them to obtain a training image set and a test image set.
3. The method for detecting real-time traffic sign based on multi-scale pixel feature fusion as claimed in claim 1, wherein the specific process of the step (2) is as follows:
(A) First, performing preliminary feature extraction on the 300×300 input image through a 3×3 standard convolution block to obtain a 150×150×32 feature map, where 32 is the number of channels of the feature map;
(B) Passing the 150×150×32 feature map obtained in step (A) through 6 inverted residual bottleneck blocks in sequence for depth feature extraction, obtaining depth feature maps A, B and C of sizes 38×38×32, 19×19×96 and 10×10×320 respectively.
4. The method for detecting real-time traffic sign based on multi-scale pixel feature fusion as claimed in claim 1, wherein the specific process of the step (3) is as follows:
Step (I): pixel rearrangement with an upsampling factor of 4 is performed on the 10×10×320 depth feature map obtained in step (2), obtaining a 38×38×20 upsampled feature map D;
(II) pixel rearrangement with an upsampling factor of 2 is performed on the 19×19×96 depth feature map obtained in step (2), obtaining a 38×38×24 upsampled feature map E;
(III) the 38×38×20 upsampled feature map D and the 38×38×24 upsampled feature map E obtained by pixel rearrangement in steps (I) and (II) are spliced with the 38×38×32 depth feature map A obtained in step (2), generating a 38×38×76 fusion feature map F carrying both semantic and detail information.
5. The method for detecting real-time traffic sign based on multi-scale pixel feature fusion as claimed in claim 1, wherein the specific process of the step (5) is as follows:
Step (I): taking the six weighted-scale feature maps obtained in step (4) as input, generating several default boxes for each pixel of the input feature maps, and then performing detection by a localization sub-network and a classification sub-network respectively; the detection values comprise two parts: bounding box position and category confidence; the localization sub-network predicts a bounding box for each default box; the classification sub-network predicts the confidence of every category for each default box;
Step (II): applying non-maximum suppression over the category confidences and the position offsets of the predicted boxes relative to the default boxes, and selecting the predicted box with the smallest objective loss function as the optimal predicted box, obtaining the target category and box position.
6. The method for detecting traffic signs in real time based on multi-scale pixel feature fusion according to claim 5, wherein in step (II) the objective loss function L(x, l, c, g) of the detection network consists of a classification loss function $L_{conf}(x, c)$ and a localization loss function $L_{loc}(x, l, g)$:

$$L(x, l, c, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where x indicates the matching of default boxes on the feature maps, l is the predicted box, c is the set of confidence predictions of the default boxes over the categories, g is the ground-truth box, $L_{conf}(x, c)$ is the softmax classification loss of the default boxes over the category score set c, $L_{loc}(x, l, g)$ is the position loss function, N is the number of default boxes matched to ground-truth boxes, and the weight coefficient $\alpha$ is set to 1 by cross-validation.
CN202010866848.8A 2020-08-26 2020-08-26 Real-time traffic sign detection method based on multi-scale pixel feature fusion Active CN112183203B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010866848.8A CN112183203B (en) 2020-08-26 2020-08-26 Real-time traffic sign detection method based on multi-scale pixel feature fusion


Publications (2)

Publication Number Publication Date
CN112183203A (en) 2021-01-05
CN112183203B (en) 2024-05-28

Family

ID=73925715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010866848.8A Active CN112183203B (en) 2020-08-26 2020-08-26 Real-time traffic sign detection method based on multi-scale pixel feature fusion

Country Status (1)

Country Link
CN (1) CN112183203B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784857B (en) * 2021-01-29 2022-11-04 北京三快在线科技有限公司 Model training and image processing method and device
CN112861987B (en) * 2021-03-03 2024-04-16 德鲁动力科技(成都)有限公司 Target detection method in dim light environment
CN113449770B (en) * 2021-05-18 2024-02-13 科大讯飞股份有限公司 Image detection method, electronic device and storage device
CN113313118A (en) * 2021-06-25 2021-08-27 哈尔滨工程大学 Self-adaptive variable-proportion target detection method based on multi-scale feature fusion
CN113536978B (en) * 2021-06-28 2023-08-18 杭州电子科技大学 Camouflage target detection method based on saliency
CN113537397B (en) * 2021-08-11 2024-04-19 大连海事大学 Target detection and image definition joint learning method based on multi-scale feature fusion
CN113902903B (en) * 2021-09-30 2024-08-02 北京工业大学 Downsampling-based double-attention multi-scale fusion method
CN113723377B (en) * 2021-11-02 2022-01-11 南京信息工程大学 Traffic sign detection method based on LD-SSD network
CN114241274B (en) * 2021-11-30 2023-04-07 电子科技大学 Small target detection method based on super-resolution multi-scale feature fusion
CN114463772B (en) * 2022-01-13 2022-11-25 苏州大学 Deep learning-based traffic sign detection and identification method and system
CN116797890A (en) * 2022-03-11 2023-09-22 北京字跳网络技术有限公司 Image enhancement method, device, equipment and medium
CN114462555B (en) * 2022-04-13 2022-08-16 国网江西省电力有限公司电力科学研究院 Multi-scale feature fusion power distribution network equipment identification method based on raspberry group

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268870A (en) * 2018-01-29 2018-07-10 重庆理工大学 Multi-scale feature fusion ultrasonoscopy semantic segmentation method based on confrontation study
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN110110599A (en) * 2019-04-03 2019-08-09 天津大学 A kind of Remote Sensing Target detection method based on multi-scale feature fusion
CN110287849A (en) * 2019-06-20 2019-09-27 北京工业大学 A kind of lightweight depth network image object detection method suitable for raspberry pie
CN111179217A (en) * 2019-12-04 2020-05-19 天津大学 Attention mechanism-based remote sensing image multi-scale target detection method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609536A (en) * 2017-09-29 2018-01-19 百度在线网络技术(北京)有限公司 Information generating method and device


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Anchor-free traffic sign detection; 范红超; 李万志; 章超权; Journal of Geo-Information Science; 2020-01-25 (Issue 01); full text *
Traffic sign detection and recognition based on a residual single shot multibox detector model; 张淑芳; 朱彤; Journal of Zhejiang University (Engineering Science); 2019-05-09 (Issue 05); full text *
Traffic sign recognition combining multi-scale feature fusion with an extreme learning machine; 马永杰; 程时升; 马芸婷; 陈敏; Chinese Journal of Liquid Crystals and Displays; 2020-06-15 (Issue 06); full text *

Also Published As

Publication number Publication date
CN112183203A (en) 2021-01-05

Similar Documents

Publication Publication Date Title
CN112183203B (en) Real-time traffic sign detection method based on multi-scale pixel feature fusion
CN108647585B (en) Traffic identifier detection method based on multi-scale circulation attention network
CN110188705B (en) Remote traffic sign detection and identification method suitable for vehicle-mounted system
CN112132156B (en) Image saliency target detection method and system based on multi-depth feature fusion
WO2021244621A1 (en) Scenario semantic parsing method based on global guidance selective context network
CN111563508A (en) Semantic segmentation method based on spatial information fusion
CN111461039B (en) Landmark identification method based on multi-scale feature fusion
CN110781744A (en) Small-scale pedestrian detection method based on multi-level feature fusion
CN113888547A (en) Non-supervision domain self-adaptive remote sensing road semantic segmentation method based on GAN network
CN116665176B (en) Multi-task network road target detection method for vehicle automatic driving
CN113688836A (en) Real-time road image semantic segmentation method and system based on deep learning
CN114519819B (en) Remote sensing image target detection method based on global context awareness
CN114202743A (en) Improved fast-RCNN-based small target detection method in automatic driving scene
CN113673562B (en) Feature enhancement method, object segmentation method, device and storage medium
CN117079163A (en) Aerial image small target detection method based on improved YOLOX-S
CN113505640A (en) Small-scale pedestrian detection method based on multi-scale feature fusion
CN116630702A (en) Pavement adhesion coefficient prediction method based on semantic segmentation network
CN112613434A (en) Road target detection method, device and storage medium
CN116597270A (en) Road damage target detection method based on attention mechanism integrated learning network
CN114495050A (en) Multitask integrated detection method for automatic driving forward vision detection
CN113378642A (en) Method for detecting illegal occupation buildings in rural areas
CN117115616A (en) Real-time low-illumination image target detection method based on convolutional neural network
CN117115770A (en) Automatic driving method based on convolutional neural network and attention mechanism
CN116863227A (en) Hazardous chemical vehicle detection method based on improved YOLOv5
CN113537397B (en) Target detection and image definition joint learning method based on multi-scale feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant