CN111753787A - Separated traffic sign detection and identification method


Info

Publication number
CN111753787A
CN111753787A
Authority
CN
China
Prior art keywords
pruning
traffic sign
model
detection
network
Legal status
Pending
Application number
CN202010620101.4A
Other languages
Chinese (zh)
Inventor
金文�
杨熙
岑翼刚
万晴
Current Assignee
Jiangsu Jinhaixing Navigation Technology Co ltd
Original Assignee
Jiangsu Jinhaixing Navigation Technology Co ltd
Application filed by Jiangsu Jinhaixing Navigation Technology Co ltd filed Critical Jiangsu Jinhaixing Navigation Technology Co ltd
Priority to CN202010620101.4A priority Critical patent/CN111753787A/en
Publication of CN111753787A publication Critical patent/CN111753787A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/582 Recognition of traffic objects, e.g. of traffic signs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a separated traffic sign detection and identification method, in which detection and identification of traffic signs are completed by a YOLOv3-SPP-SE network for detecting traffic signs and a MobileNetv3-small network for classifying them. First, the video stream is decoded into pictures for detection and the pictures are scaled to the corresponding size; the pictures are then fed into the pruned YOLOv3-SPP-SE network to detect traffic signs; next, the coordinates of the traffic signs in the pictures are obtained from the detection results, and the traffic sign images are sent to MobileNetv3-small for classification and identification; finally, the detection and identification results are output. The invention can accurately detect and identify traffic signs in natural scenes, and maintains real-time detection and identification in mobile scenarios with limited computing power; the separated detection and identification allows the detection network to be compressed to a greater extent, reducing computational cost, and has good application prospects.

Description

Separated traffic sign detection and identification method
Technical Field
The invention relates to the technical field of traffic assistance, in particular to a separated traffic sign detection and identification method.
Background
The traffic sign detection and identification system is a key subsystem of intelligent transportation systems and is widely applied in fields such as driver assistance, intelligent navigation, automatic driving, intelligent traffic, and traffic sign maintenance. At the same time, for safety, a traffic sign detection and identification system must be both accurate and fast.
Traditional algorithms locate traffic signs in an image using methods such as color-based image segmentation, shape-based image segmentation, and maximally stable extremal regions; extract traffic sign features with hand-crafted operators such as Haar, HOG, SIFT, and LBP; and classify the extracted features with classifiers such as SVM and AdaBoost. These traditional traffic sign detection and identification algorithms have poor robustness and struggle with the varied illumination conditions, complex backgrounds, and very small targets found in natural scenes.
Existing deep-learning-based methods detect and identify traffic signs simultaneously. On a long-tailed data set such as traffic signs, the recall and accuracy of the tail classes are hard to guarantee; moreover, because there are many classes, the model is hard to compress, requires a continuous large amount of computing resources at run time, and struggles to deliver both accuracy and speed in mobile scenarios.
Disclosure of Invention
To address the above shortcomings, the invention discloses a separated traffic sign detection and identification method.
The technical scheme of the invention is as follows: a separated traffic sign detection and identification method comprises the following steps:
(1) cropping preprocessing is performed on the TT-100K (Tsinghua-Tencent 100K) data set, reducing the image size for training the detection network;
(2) a YOLOv3-SPP model is trained on the processed data set, and the best-performing model during training is selected as the base model;
(3) sparse training is applied to the base model, inducing the γ scale factors of the BN (batch normalization) layers to become sparse and yielding a sparse model;
(4) the sparse model is pruned with a channel pruning method that merges pruning masks, yielding the pruned YOLOv3-SPP network structure and its weights;
(5) an SE (Squeeze-and-Excitation) structure is embedded into the pruned model; the SE structure is randomly initialized, and the remainder is initialized with the pruned weights;
(6) YOLOv3-SPP-SE is fine-tuned on the processed data set to obtain the model finally used for detection;
(7) the traffic signs in the TT-100K data set are organized together with the traffic sign recognition data set of the Institute of Automation, Chinese Academy of Sciences, to obtain a traffic sign recognition data set;
(8) a MobileNetv3-small model is trained on the organized traffic sign recognition data set, yielding the traffic sign recognition model;
(9) the pruned detection model and the recognition model are cascaded to obtain the final model;
(10) the algorithm terminates.
The method comprises the following specific steps:
The TT-100K data set preprocessing proceeds as follows (a minimal code sketch follows the list):
Step 1-1: for an image containing targets, take the uppermost, leftmost, lowermost, and rightmost extents over all targets;
Step 1-2: compute the midpoint between the uppermost extent and the upper image boundary, between the leftmost extent and the left boundary, between the lowermost extent and the lower boundary, and between the rightmost extent and the right boundary;
Step 1-3: crop the original image using the 4 values obtained in step 1-2;
Step 1-4: recompute the traffic sign coordinates from the crop coordinates and the original image size;
Step 1-5: randomly split the processed data set into a training set and a validation set at a ratio of 2:1.
The improvements to the loss function of YOLOv3-SPP in step 2 are described in detail as follows:
Focal Loss is used to measure the model's confidence loss and classification loss, balancing the losses of positive and negative samples and of hard and easy samples, with the formula:
FL(p_t) = -α_t (1 - p_t)^γ log(p_t)
where p_t = p if y = 1, and p_t = 1 - p otherwise;
and α_t = α if y = 1, and α_t = 1 - α otherwise.
In the above formulas, α_t adjusts the weight of positive and negative samples and (1 - p_t)^γ adjusts the weight of hard and easy samples; y denotes the ground-truth label, p the predicted probability, and α and γ are hyper-parameters set to 0.25 and 2, respectively;
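As an illustration only, a minimal PyTorch sketch of this focal loss is given below, assuming binary (sigmoid) outputs as used for the YOLOv3 confidence and per-class terms; it is not the patent's exact implementation:

```python
import torch

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t); targets are 0/1."""
    p = torch.sigmoid(logits)
    pt = torch.where(targets == 1, p, 1 - p)                 # p_t
    alpha_t = torch.where(targets == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha))     # alpha_t
    # Clamp avoids log(0) for numerically saturated predictions
    return (-alpha_t * (1 - pt) ** gamma * pt.clamp(min=1e-7).log()).mean()
```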
the frame Loss of the GIoU Loss measurement model can solve the problem of non-uniform frame evaluation in the training process and the evaluation process, and the formula is as follows:
Figure RE-GDA0002619526190000023
loss=1-GloU
a, B in the above formula represents two frames to be measured, and C represents A, B the smallest bounding rectangle of the two frames;
the sparse training process in the step 3 specifically comprises the following steps:
penalty terms for all BN layer gamma scale factors are added to the overall loss. The penalty formula uses L1 norm to induce gamma sparsity; the penalty coefficient lambda is 0.001, so as to ensure that the gradient in the sparse training process is always dominated by the gradient of the detection network, and the specific expression is as follows:
Figure RE-GDA0002619526190000031
g(γ)=||γ||1
in the above formula, the former term represents the original loss of the detection network, the latter term represents the penalty term of the gamma scale factor, and the value of lambda is 0.001;
the pruning in the step 4 comprises the following specific steps:
step 4-1: taking all gamma values and sorting the gamma values from small to large according to the values;
step 4-2: according to the channel pruning proportion, selecting the gamma value under the proportion from the selected and sequenced gamma as a global threshold;
step 4-3: calculating the pruning mask of each convolutional layer according to the global threshold, and simultaneously reserving the minimum channel number;
step 4-4: grouping all the convolution layers into 5 groups by taking each downsampling as a group;
and 4-5: taking a union set from the pruning masks of the convolutional layers in each group to obtain 5 groups of pruning masks;
and 4-6: pruning the convolutional layers within the group using different pruning masks
The embedded SE structure in step 5 is specifically described as follows:
An SE structure is embedded into the residual branch of the first residual block after each down-sampling. First, the number of channels output by the branch is obtained; then global average pooling is applied to the feature map output by the branch; the result is passed through a two-layer fully connected network whose input dimension matches the channel count, whose first layer outputs one quarter of the channel count with ReLU activation, and whose second layer outputs the full channel count with Sigmoid activation; finally, the fully connected output is used to weight the feature map of the residual branch;
the fine tuning training in step 6 is specifically described as follows:
restoring a structure corresponding to the model after pruning by using the weight reserved after pruning, adjusting the SE structure according to the number of reserved channels after pruning, randomly initializing parameters of the SE structure, and finely adjusting the restoring precision on a data set;
the fine tuning training in step 7 is specifically described as follows:
cutting traffic sign targets in the TT-100k data set, fusing the traffic sign targets with a traffic sign identification data set of a Chinese academy of automation according to categories, selecting traffic sign categories with the occurrence frequency of more than 200, and performing up-sampling on the categories with the number of less than 500 in the selected categories by adopting the modes of random cutting, random fuzzy, random Gaussian noise and the like to set the traffic sign targets in the TT-100k data set to 500;
the classification model training in step 8 is specifically described as follows:
the sorted images of the recognition data sets are scaled to 256 × 256 and normalized, and then sent to a MobileNet 3-small network for training;
the cascade operation described in step 9 is specifically described as follows:
packing the detection result output by the detection model into a Batch, and sending the Batch into the classification model for parallel processing.
The invention has the following beneficial effects:
Single-class detection is easier to learn: compared with multi-class detection, it learns only the features common to all traffic signs, which lowers the learning difficulty of the detection network. Single-class detection also reduces the demands on the data set and avoids handling a long-tailed data set. All traffic signs can be detected, and when new traffic sign categories need to be detected and identified, only a new classification model has to be retrained. The computing requirement is flexible, with no need for continuous heavy computation. The algorithm is highly robust and detects and identifies in real time.
Drawings
FIG. 1 is a general flow chart of the algorithm of the present invention;
FIG. 2 is a flow chart of pruning YOLOv3-SPP according to the present invention;
FIG. 3 is a diagram of embedding SE structures into the pruned model according to the present invention;
FIG. 4 is a diagram of the traffic sign detection and identification results of the present invention.
Detailed Description
The invention is explained in detail below through a specific experimental simulation.
Example 1: as shown in FIG. 1,
the invention relates to a separated traffic sign detection and identification method, which comprises the following steps:
(1) decoding an image from an input video stream;
(2) sending the image into a pruned YOLOv3-SPP-SE model to detect whether a traffic sign exists;
(3) if the traffic sign is included, the traffic sign is sent to a classification model in parallel for subdivision and identification; if the traffic sign is not included, returning to the original image;
(4) and the classification model carries out subdivision identification on the output of the detection model in parallel.
The specific implementation is as follows:
the first step is as follows: decoding images
Decoding the image from the video stream frame by frame, scaling the long side of the image to 608, and the short side to an integral multiple of 32, and using gray filling in the process, and keeping the width-height ratio of the image content;
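The OpenCV sketch below shows this letterbox step, assuming BGR images and a gray fill value of 128 (the exact fill value is not stated in the text):

```python
import cv2

def letterbox(img, long_side=608, stride=32, gray=128):
    """Scale the long side to 608 keeping the aspect ratio, then pad the
    short side with gray up to a multiple of 32."""
    h, w = img.shape[:2]
    scale = long_side / max(h, w)
    nh, nw = int(round(h * scale)), int(round(w * scale))
    resized = cv2.resize(img, (nw, nh))          # dsize is (width, height)

    ph = (stride - nh % stride) % stride         # vertical padding needed
    pw = (stride - nw % stride) % stride         # horizontal padding needed
    return cv2.copyMakeBorder(resized, ph // 2, ph - ph // 2,
                              pw // 2, pw - pw // 2,
                              cv2.BORDER_CONSTANT, value=(gray, gray, gray))
```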
the second step is that: pruning Yolov3-SPP-SE model detection of traffic signs in images
The obtaining mode of a YOLOv3-SPP-SE model for detection in the process is shown in FIG. 2;
step 2-1: and carrying out basic training on the detection network to obtain a basic model. Training a YOLOv3-SPP network on the processed TT-100k data set, selecting an optimal model in the training process as a basic model, and balancing the Loss of positive and negative samples and the Loss of difficult and easy samples by using the confidence Loss and the classification Loss of a Focal local measurement model in the training process, wherein the formula is as follows:
FL(pt)=-αt(1-pt)γlog(pt)
Figure RE-GDA0002619526190000051
Figure RE-GDA0002619526190000052
α in the above formulatFor adjusting weights of positive and negative samples, (1-p)t)γThe weight used for adjusting the difficult and easy samples, y represents a real label, p represents a prediction label, α and gamma are hyper-parameters, and the values are respectively 0.25,2。
GIoU Loss measures the model's bounding-box loss, resolving the mismatch between the box metric used during training and the one used during evaluation, with the formula:
GIoU = IoU - |C \ (A ∪ B)| / |C|
loss = 1 - GIoU
where A and B denote the two bounding boxes being compared and C denotes the smallest enclosing rectangle of A and B.
Step 2-2: sparse-train the base model to obtain the sparse model. Sparse training adds a penalty term on the γ scale factors of the BN layers during training, inducing γ to become sparse; that is, a penalty term on all BN-layer γ scale factors is added to the overall loss. The penalty uses the L1 norm to induce sparsity in γ; the penalty coefficient λ is 0.001, which keeps the gradient during sparse training dominated by the detection network's gradient. The expression is:
L = Σ_(x,y) l(f(x, W), y) + λ Σ_(γ∈Γ) g(γ)
g(γ) = ||γ||_1
where the first term is the original loss of the detection network and the second term is the γ scale-factor penalty, with λ = 0.001.
Step 2-3: prune the sparse model to obtain the pruned model. First, all γ are sorted in ascending order and the pruning threshold is computed from the required channel pruning ratio; then the pruning mask of the convolutional layer preceding each BN layer is computed from the threshold; next, the 5 down-samplings of the feature-extraction network Darknet-53 are each taken as a group, and the union of the pruning masks of all convolutional layers within a group is taken according to the down-sampling group labels; finally, the convolutional layers within each group are pruned with the union mask. To keep the whole network connected, at least 0.01 of each layer's channels are retained.
Step 2-4: embed the SE structure into the residual blocks of the pruned model to obtain the final detection model. An SE structure is embedded in the residual block after each down-sampling in the feature-extraction network, as shown in FIG. 3. In the first residual block after each down-sampling, the SE structure is embedded on the residual branch. First, the number of channels output by the branch is obtained; then global average pooling is applied to the feature map output by the branch; the result is passed through a two-layer fully connected network whose input dimension matches the channel count, whose first layer outputs one quarter of the channel count with ReLU activation, and whose second layer outputs the full channel count with Sigmoid activation; finally, the fully connected output is used to weight the feature map of the residual branch.
Step 2-5: fine-tune the final detection model to recover accuracy. The corresponding structure is restored with the weights retained after pruning, the SE structure is randomly initialized, and fine-tuning recovers the accuracy.
The third step: determining traffic sign regions
If the detection result contains traffic signs, each detection output of the detection network is scaled to 256 × 256 and all detections are packed into one batch; if no traffic sign is detected, the original image is returned.
The fourth step: traffic sign recognition
The trained MobileNetv3-small classifies the detection results.
The details of training MobileNetv3-small are as follows: the organized recognition data set images are scaled to 256 × 256 and normalized; data augmentation such as random rotation within 10 degrees, color jitter, brightness jitter, random blurring, and random center cropping then simulates the varied appearance of traffic signs in natural scenes. The loss function in the training phase is cross entropy with softmax; the optimizer is SGD with momentum 0.9 and an initial learning rate of 0.001; the learning rate is decayed to one tenth of its value at 90% of the total epochs; the batch size is 24. A configuration sketch follows.
FIG. 4 shows the traffic sign detection and identification results of the invention, where the ts label marks signs detected by the detector but not recognized by the classification model. As the figure shows, the proposed traffic sign detection and identification system performs well on small, distorted, occluded, and blurred targets in natural scenes.

Claims (6)

1. A separated traffic sign detection and identification method, characterized in that the method comprises the following steps:
(1) cropping preprocessing is performed on the TT-100K (Tsinghua-Tencent 100K) data set, reducing the image size for training the detection network;
(2) a YOLOv3-SPP model is trained on the processed data set, and the best-performing model during training is selected as the base model;
(3) sparse training is applied to the base model, inducing the γ scale factors of the BN (batch normalization) layers to become sparse and yielding a sparse model;
(4) the sparse model is pruned with a channel pruning method that merges pruning masks, yielding the pruned YOLOv3-SPP network structure and its weights;
(5) an SE (Squeeze-and-Excitation) structure is embedded into the pruned model; the SE structure is randomly initialized, and the remainder is initialized with the pruned weights;
(6) YOLOv3-SPP-SE is fine-tuned on the processed data set to obtain the model finally used for detection;
(7) the traffic signs in the TT-100K data set are organized together with the traffic sign recognition data set of the Institute of Automation, Chinese Academy of Sciences, to obtain a traffic sign recognition data set;
(8) a MobileNetv3-small model is trained on the organized traffic sign recognition data set, yielding the traffic sign recognition model;
(9) the pruned detection model and the recognition model are cascaded to obtain the final model;
(10) the algorithm terminates.
2. The separated traffic sign detection and identification method according to claim 1, characterized in that: in the processing of the TT-100K traffic sign data set in step 1, the images are boundary-cropped, which reduces the image size and increases the relative proportion of the traffic sign targets.
3. The separated traffic sign detection and identification method according to claim 2, characterized in that: in the improvement of the YOLOv3-SPP model in step 2, the residual branches of the 23 residual blocks in the feature-extraction network Darknet-53 are embedded with SE structures; Focal Loss measures the detection network's confidence loss and classification loss, balancing its positive/negative and hard/easy samples, with the formula:
FL(p_t) = -α_t (1 - p_t)^γ log(p_t)
where p_t = p if y = 1, and p_t = 1 - p otherwise;
and α_t = α if y = 1, and α_t = 1 - α otherwise;
in the above formulas, α_t adjusts the weight of positive and negative samples and (1 - p_t)^γ adjusts the weight of hard and easy samples; y denotes the ground-truth label, p the predicted probability, and α and γ are hyper-parameters set to 0.25 and 2, respectively;
GIoU Loss measures the detection network's bounding-box loss, with the formula:
GIoU = IoU - |C \ (A ∪ B)| / |C|
loss = 1 - GIoU
where A and B denote the two bounding boxes being compared and C denotes the smallest enclosing rectangle of A and B.
4. The separated traffic sign detection and identification method according to claim 3, characterized in that: in the sparse training of step 3, a penalty term on the γ scale factors of all BN layers is added to the overall loss; the penalty uses the L1 norm to induce sparsity in γ, and the penalty coefficient λ is 0.001, with the expression:
L = Σ_(x,y) l(f(x, W), y) + λ Σ_(γ∈Γ) g(γ)
g(γ) = ||γ||_1
where the first term is the original loss of the detection network and the second term is the γ scale-factor penalty, with λ = 0.001.
5. The separated traffic sign detection and identification method according to claim 4, characterized in that: in the pruning process of step 4, all γ are sorted in ascending order and the pruning threshold is computed from the required channel pruning ratio; the 5 down-samplings of the feature-extraction network Darknet-53 are each taken as a group; the pruning mask of the convolutional layer preceding each BN layer is computed from the threshold; the union of the pruning masks of all convolutional layers within a group is taken according to the down-sampling group labels; the convolutional layers within each group are pruned with the union mask; and at least 0.01 of each layer's channels are retained.
6. The separated traffic sign detection and identification method according to claim 5, characterized in that: in the fine-tuning of step 6, the structure corresponding to the pruned model is restored using the weights retained after pruning, the SE structure is adjusted to the number of channels retained after pruning with its parameters randomly initialized, and fine-tuning on the data set restores the accuracy.
CN202010620101.4A 2020-07-01 2020-07-01 Separated traffic sign detection and identification method Pending CN111753787A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010620101.4A CN111753787A (en) 2020-07-01 2020-07-01 Separated traffic sign detection and identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010620101.4A CN111753787A (en) 2020-07-01 2020-07-01 Separated traffic sign detection and identification method

Publications (1)

Publication Number Publication Date
CN111753787A true CN111753787A (en) 2020-10-09

Family

ID=72678705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010620101.4A Pending CN111753787A (en) 2020-07-01 2020-07-01 Separated traffic sign detection and identification method

Country Status (1)

Country Link
CN (1) CN111753787A (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3584742A1 (en) * 2018-06-19 2019-12-25 KPIT Technologies Ltd. System and method for traffic sign recognition
CN110532859A (en) * 2019-07-18 2019-12-03 西安电子科技大学 Remote Sensing Target detection method based on depth evolution beta pruning convolution net
CN110598731A (en) * 2019-07-31 2019-12-20 浙江大学 Efficient image classification method based on structured pruning
CN111062382A (en) * 2019-10-30 2020-04-24 北京交通大学 Channel pruning method for target detection network
CN110991341A (en) * 2019-12-04 2020-04-10 长春中国光学科学技术馆 Method and device for detecting face image
CN111191608A (en) * 2019-12-30 2020-05-22 浙江工业大学 Improved traffic sign detection and identification method based on YOLOv3
CN111274970A (en) * 2020-01-21 2020-06-12 南京航空航天大学 Traffic sign detection method based on improved YOLO v3 algorithm

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215178A (en) * 2020-10-19 2021-01-12 南京大学 Chemical experiment recording system based on pen type interaction
CN112215178B (en) * 2020-10-19 2024-05-28 南京大学 Chemical experiment recording system based on pen type interaction
CN112329658A (en) * 2020-11-10 2021-02-05 江苏科技大学 Method for improving detection algorithm of YOLOV3 network
CN112329658B (en) * 2020-11-10 2024-04-02 江苏科技大学 Detection algorithm improvement method for YOLOV3 network
CN112396002A (en) * 2020-11-20 2021-02-23 重庆邮电大学 Lightweight remote sensing target detection method based on SE-YOLOv3
CN112836819A (en) * 2021-01-26 2021-05-25 北京奇艺世纪科技有限公司 Neural network model generation method and device
CN112836819B (en) * 2021-01-26 2023-07-25 北京奇艺世纪科技有限公司 Neural network model generation method and device
CN112884090A (en) * 2021-04-14 2021-06-01 安徽理工大学 Fire detection and identification method based on improved YOLOv3
CN113297913A (en) * 2021-04-26 2021-08-24 云南电网有限责任公司信息中心 Method for identifying dressing specification of distribution network field operating personnel
CN113516163A (en) * 2021-04-26 2021-10-19 合肥市正茂科技有限公司 Vehicle classification model compression method and device based on network pruning and storage medium
CN113516163B (en) * 2021-04-26 2024-03-12 合肥市正茂科技有限公司 Vehicle classification model compression method, device and storage medium based on network pruning
CN113657423A (en) * 2021-06-25 2021-11-16 上海师范大学 Target detection method suitable for small-volume parts and stacked parts and application thereof

Similar Documents

Publication Publication Date Title
CN111753787A (en) Separated traffic sign detection and identification method
CN111931684B (en) Weak and small target detection method based on video satellite data identification features
CN110287960B (en) Method for detecting and identifying curve characters in natural scene image
CN110619369B (en) Fine-grained image classification method based on feature pyramid and global average pooling
CN107609525B (en) Remote sensing image target detection method for constructing convolutional neural network based on pruning strategy
CN107563372B (en) License plate positioning method based on deep learning SSD frame
CN110348319B (en) Face anti-counterfeiting method based on face depth information and edge image fusion
CN111612807B (en) Small target image segmentation method based on scale and edge information
CN108009518A (en) A kind of stratification traffic mark recognition methods based on quick two points of convolutional neural networks
CN103049763B (en) Context-constraint-based target identification method
CN111723748A (en) Infrared remote sensing image ship detection method
CN110175613A (en) Street view image semantic segmentation method based on Analysis On Multi-scale Features and codec models
CN111160249A (en) Multi-class target detection method of optical remote sensing image based on cross-scale feature fusion
CN114663346A (en) Strip steel surface defect detection method based on improved YOLOv5 network
Zhang et al. Study on traffic sign recognition by optimized Lenet-5 algorithm
CN110008899B (en) Method for extracting and classifying candidate targets of visible light remote sensing image
CN108734200B (en) Human target visual detection method and device based on BING (building information network) features
CN105809205A (en) Classification method and system for hyperspectral images
CN113159215A (en) Small target detection and identification method based on fast Rcnn
CN110633727A (en) Deep neural network ship target fine-grained identification method based on selective search
CN114139616A (en) Unsupervised domain adaptive target detection method based on uncertainty perception
CN112926652A (en) Fish fine-grained image identification method based on deep learning
CN110458064B (en) Low-altitude target detection and identification method combining data driving type and knowledge driving type
CN108073940A (en) A kind of method of 3D object instance object detections in unstructured moving grids
CN116259032A (en) Road traffic sign detection and identification algorithm based on improved YOLOv5

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination