CN109447034B

CN109447034B - Traffic sign detection method in automatic driving based on YOLOv3 network

Info

Publication number: CN109447034B
Application number: CN201811354012.9A
Authority: CN
Inventors: 王超
Original assignee: Beijing Information Science and Technology University
Current assignee: Beijing Information Science and Technology University
Priority date: 2018-11-14
Filing date: 2018-11-14
Publication date: 2021-04-06
Anticipated expiration: 2038-11-14
Also published as: CN109447034A

Abstract

A traffic sign detection method in automatic driving based on a YOLOv3 network belongs to the field of traffic sign detection. The invention solves the problems that the existing YOLOv3 network target detection algorithm is low in detection precision and the detection speed cannot meet the real-time requirement. The invention provides an improved loss function, thereby reducing the influence of large target errors on the detection effect of small targets and improving the detection accuracy of small-size targets; an improved activation function is provided, a negative value is reserved, and meanwhile, the change and information transmitted to the next layer are reduced, so that the robustness of the algorithm to noise is enhanced; and clustering the real frames in the traffic identification data set through a K-means algorithm, so as to realize the prefetching of the position of the target frame and accelerate the convergence of the network. The detection accuracy mAP of the traffic sign detection model on the test set reaches 92.88%, the detection speed reaches 35FPS, and the real-time requirement is completely met. The invention can be applied to the field of traffic sign detection.

Description

Traffic sign detection method in automatic driving based on YOLOv3 network

Technical Field

The invention belongs to the field of traffic sign detection, and particularly relates to a traffic sign detection method in automatic driving.

Background

Object detection is an important research direction in the field of automatic driving. The main detection targets are divided into two types: a stationary object and a moving object. Stationary objects such as traffic lights, traffic signs, lanes, obstacles, etc.; moving objects such as vehicles, pedestrians, non-motorized vehicles, etc. The traffic sign detection provides abundant and necessary navigation information for the unmanned automobile in the driving process, and is fundamental work with important significance.

The traditional target detection method mainly comprises the following steps: and (4) preprocessing, selecting a candidate region, and extracting target features and feature classification. Commonly used features are SIFT (scale-invariant feature transform), hog (histogram of oriented gradient), Haar. Common classifiers are as follows: SVM (support vector machine), RF (random forest), Adaboost, etc. The method has high design requirements on target features, and if the designed features are not good, the accuracy of the final model is low even if the best classifier is used. Meanwhile, the characteristics have strong pertinence, only a certain kind of targets can be detected, and the generalization capability is poor. And the extracted features are all low-level features (low-level features) of the target and cannot express the real high-level semantic features of the target.

Deep learning has achieved abundant research results in the field of computer vision in recent years, particularly in the field of target detection. The method for extracting the target features by using the convolutional neural network (convolutional neural network) can greatly reduce a plurality of defects of manually extracting the features. R-CNN is a convolutional neural network-based target detection model proposed by Girshick et al in 2014. Firstly, a large number of candidate regions are extracted from the whole picture through a selective search algorithm, then the candidate regions are adjusted to be of a fixed size and input into a convolutional neural network for feature extraction, and finally an SVM classifier is used for classification. The mAP (mean Average precision) of R-CNN reaches 62.4 percent, and the detection time is longer due to higher algorithm complexity. In response to this problem, researchers have proposed a number of improved target candidate region-based algorithms. SPPnet [8] it fixes the feature image to the required size by designing a pyramid pooling layer (pyramid pooling layer) after the last convolutional layer. Fast R-CNN proposes a multi-task loss function (multi-task loss), which adds a loss of target location after the conventional loss function to correct the location information. The Faster R-CNN adds a sliding window (sliding window) to the feature image output from the last convolutional layer, and creates anchor frames of different sizes at the positions crossed by the sliding window and with the center of the window as the center (anchor), and maps the anchor frames to the original picture to become candidate regions. The R-FCN adopts an FCN (full relational networks) network structure, and constructs position-sensitive score maps (position-sensitive score maps) by using a special Convolutional layer. Researchers have also proposed a number of regression (regression) based target detection algorithms such as: yolo (you only look once), ssd (single shot multibox detector), YOLOv2, YOLOv3, and the like. Of these, YOLOv3 is one of the most elegant target detection algorithms at present, which mirrors many of the results of previous researchers. When the input size is 416 × 416, the detection accuracy can reach 55.3%, but the detection time is only 29 ms. Although the existing Yolov3 target detection algorithm achieves certain achievement in the aspect of target detection, the detection precision is still not high, and the detection speed cannot meet the requirement of real-time performance.

Disclosure of Invention

The invention aims to solve the problems that the existing YOLOv3 network target detection algorithm is low in detection precision and the detection speed cannot meet the real-time requirement.

The technical scheme adopted by the invention for solving the technical problems is as follows:

the method for detecting the traffic sign in automatic driving based on the YOLOv3 network comprises the following steps:

firstly, manufacturing training set data and test set data with traffic identification target labels based on a GTSDB data set;

clustering real target frames labeled in the training set data, obtaining initial candidate target frames of the traffic identification type targets predicted in the training set data by adopting an area intersection ratio IOU as a rating index, and taking the initial candidate target frames as initial network parameters of a YOLOv3 network; calling initial network parameters of a YOLOv3 network, inputting training set data into the YOLOv3 network for training until loss function values output by the training set data are less than or equal to a threshold value Q₁Or stopping training when the set maximum iteration number N is reached to obtain a trained YOLOv3 network;

inputting the test set data into the well-trained YOLOv3 network, and if the detection precision corresponding to the test set data is more than or equal to the precision threshold Q₂Then, the trained YOLOv3 network is used as the final YOLOv3 network;

if the detection precision corresponding to the test set data is smaller than the precision threshold Q₂Continuing to train the well-trained YOLOv3 network obtained in the step two until the detection precision corresponding to the test set data is greater than or equal to the precision threshold Q₂The YOLOv3 network at this time is used as a final YOLOv3 network;

and inputting the collected images containing the traffic signs in the automatic driving into a final YOLOv3 network to detect the traffic signs.

The invention has the beneficial effects that: the invention provides a traffic sign detection method in automatic driving based on a YOLOv3 network, and the invention provides an improved loss function, and the size of a real target frame is taken into consideration by weighting the loss part of the width and the height of a detected target, so that the influence of a large target error on the detection effect of a small target is reduced, and the detection accuracy of the small target is improved; by providing an improved activation function, when x is 0 or a negative value, Softplus is adopted to translate log2 units downwards, so that the negative value is reserved, the change and information propagated to the next layer are reduced, and the robustness of the algorithm to noise is enhanced; and finally, clustering the real frame through a K-means algorithm, so as to realize the prefetching of the position of the target frame and accelerate the convergence of the network. The result shows that the detection precision of the traffic sign detection model provided by the invention on the test set is greatly improved, the mAP reaches 92.88%, the detection speed reaches 35FPS, the real-time requirement is completely met, and the convergence speed in the training process is improved by about 66.67%.

Drawings

FIG. 1 is a graph of a currently common activation function ReLU;

x represents input and y represents output;

FIG. 2 is a graph of an activation function Leaky-ReLU (Leaky Rectified Linear Unit);

FIG. 3 is a graph of the activation function Softplus-ReLU applied in the present invention;

FIG. 4 is a schematic diagram of the influence of the K-means clustering initial candidate bounding box on the model performance;

wherein: the gray rectangles represent the values of the original method, and the black rectangles represent the values of the K-means clustering method;

FIG. 5 is a graph comparing the loss (loss function) curves of the method of the present invention (K-means clustering) with that of training without the clustering method;

FIG. 6 is a graph comparing the detection effect of the conventional YOLOv3 with the detection effect of the present invention using the modified loss function;

Detailed Description

The first embodiment is as follows: the traffic sign detection method in automatic driving based on the YOLOv3 network in the embodiment specifically comprises the following steps:

clustering real target frames marked in the training set data, obtaining initial candidate target frames of the predicted traffic identification type targets in the training set data by adopting the area intersection ratio IOU as a rating index, and carrying out classification on the initial candidate target framesThe target frame is used as an initial network parameter of the YOLOv3 network; (this has the benefit of speeding up the convergence speed of the training process); calling initial network parameters of a YOLOv3 network, inputting training set data into the YOLOv3 network for training until loss function values output by the training set data are less than or equal to a threshold value Q₁Or stopping training when the set maximum iteration number N is reached to obtain a trained YOLOv3 network;

inputting the test set data into the well-trained YOLOv3 network, and if the detection accuracy (mAP) corresponding to the test set data is more than or equal to the accuracy threshold Q₂Then, the trained YOLOv3 network is used as the final YOLOv3 network;

The second embodiment is as follows: the first difference between the present embodiment and the specific embodiment is: the specific process of the step one is as follows:

the GTSDB data set comprises M images in total, and after the traffic identification type targets in the M images are labeled, the labeled M images are randomly divided into a training set and a test set.

The third concrete implementation mode: the second embodiment is different from the first embodiment in that: the data volume ratio of the training set to the test set is 8: 1.

the fourth concrete implementation mode: the third difference between the present embodiment and the specific embodiment is that: clustering real target frames marked in the training set data, and obtaining an initial candidate target frame of a traffic identification type target predicted in the training set data by adopting an area intersection ratio IOU as a rating index, wherein the specific process comprises the following steps:

clustering real target frames of the training set data by adopting a K-means algorithm, and taking the area intersection ratio IOU of the predicted candidate target frames and the real target frames as a rating index, namely taking the predicted candidate target frames as initial candidate target frames when the area intersection ratio IOU is not less than 0.5;

the area intersection ratio IOU (intersection over unit) is expressed as follows:

wherein: box_predRepresenting the area of the predicted candidate target bounding box_truthRepresenting the area of the real target frame;

the distance Dis (box, centroid) between all real target bounding boxes and the initial candidate target bounding box is expressed as:

Dis(box,centroid)＝1-IOU(box,centroid)

wherein: dis (box, centroid) represents the distance between all real target frames and the initial candidate target frame in the training set data, and IOU (box, centroid) represents the average intersection ratio between all real target frames and the initial candidate target frame in the training set data.

The fifth concrete implementation mode: the fourth difference between this embodiment and the specific embodiment is that: the initial network parameters of the YOLOv3 network are called, and the training set data are input into the YOLOv3 network for training until the loss function value of the training set data output is less than or equal to the threshold value Q₁Or stopping training when the set maximum iteration number N is reached to obtain a trained YOLOv3 network, wherein the specific process is as follows:

calling initial network parameters of a YOLOv3 network, inputting training set data into a YOLOv3 network for training, continuously training and adjusting weight values and bias values of convolution layers of the YOLOv3 network, and outputting loss function values (object) of the training set data;

wherein: the coordinate adopts an error square sum loss function; the confidence coefficient and the category adopt a binary cross entropy loss function;

λ_coorda penalty factor for coordinate prediction; lambda [ alpha ]_noobjA penalty coefficient for confidence when the traffic does not contain the traffic identification target; k × K indicates the number of meshes into which one input image is divided; m represents the predicted number of target frames of each mesh; x is the number of_i，y_i，w_iAnd h_iRespectively represents the abscissa, ordinate, width and height of the central point of the predicted traffic sign,

and

respectively representing the abscissa, the ordinate, the width and the height of the central point of the real traffic sign;

the ith grid representing the frame of the jth candidate target is responsible for detecting the object;

indicating that the ith grid where the jth candidate target frame is not responsible for detecting the object; c_iAnd

respectively representing the prediction confidence coefficient and the real confidence coefficient of the traffic identification target in the ith grid; p is a radical of_i(c) And

prediction summary respectively representing traffic signs in ith grid belonging to a certain classA value and a true probability value; c represents a certain category, and classes represents the total number of categories;

until the loss function value of the data output of the training set is less than or equal to the threshold value Q₁Or stopping training when the set maximum iteration number N is reached, and taking the network obtained when the training is stopped as a trained YOLOv3 network.

The sixth specific implementation mode: the fifth embodiment is different from the fifth embodiment in that: in the second step, when training set data is input into the YOLOv3 network for training, the learning rate is set to be 0.0001, and the batch _ size is set to be 256.

In the actual training process, the values of the learning rate and the batch _ size may be appropriately adjusted in order to improve the training accuracy.

The seventh embodiment: as shown in fig. 3, the sixth embodiment is different from the first embodiment in that: the activation function adopted by the convolution layer of the YOLOv3 network is defined as:

wherein: x represents information of a previous layer in a YOLOv3 network as input, and y represents nonlinear output; when x takes a positive value, the convolution layer of the YOLOv3 network adopts the same form as the activation function ReLU, when x is 0 or a negative value, Softplus is adopted to shift log2 units downwards, and when the parameter x is continuously reduced, the activation function gradually converges to-log 2.

This means that Softplus-ReLU has a smaller derivative value, which reduces the propagation of changes and information to the next layer. Therefore, Softplus-ReLU has a strong robustness to noise information and its complexity is relatively low.

The activation function of the embodiment is different from a traditional target detection algorithm, and the convergence rate of the currently used activation function relu (rectified Linear unit) is faster than that of the traditional activation function such as sigmoid, tanh and the like. The formula is defined as follows:

as shown in fig. 1, ReLU, but because the values falling in the negative field are all 0, as training progresses, it may happen that the neuron weights cannot be updated, and the gradient flowing through the neuron is always 0 from this point on, i.e., the ReLU neuron dies irreversibly during training.

The activation functions of the three versions of YOLO are all leakage-ReLU (leakage Rectified Linear unit), and as shown in fig. 2, the activation functions are the same as those of ReLU when x takes a positive value, but the output of leakage-ReLU does not take 0 when x takes a negative value or 0, but a Linear function with a small slope is adopted, so that the output when x takes a negative value is reserved. Although Leaky-ReLU has a negative value, noise robustness in the deactivated state (deactivated state) cannot be ensured. The formula is defined as follows:

in view of the above problems, the present embodiment proposes an improved activation function Softplus-ReLU, which is applied to each convolutional layer of the network. Like ReLU when x takes positive values, Softplus is used to translate log2 units downward when x is 0 or negative. As the parameters are continually scaled down, the function gradually converges to-log 2.

The specific implementation mode is eight: the first difference between the present embodiment and the specific embodiment is: the threshold value Q₁The value of (A) is 0.1.

The specific implementation method nine: the first difference between the present embodiment and the specific embodiment is: the precision threshold Q₂The value of (A) is 90%.

Examples

In order to verify the effectiveness of the improved method provided by the invention and evaluate the performance of the traffic sign detection model, four groups of comparative experiments are performed. Respectively as follows: (1) whether a K-means algorithm is used for influencing indexes such as accuracy, recall rate, convergence rate and the like of a model obtained by clustering the initial candidate frames. (2) The effect of using different activation functions on the performance of the target detection model. (3) The difference in the detection effect of the model with the improved loss function and the unmodified model is improved. (4) The improved model of the invention is compared with other mainstream models in detection performance. The evaluation index of the final model detection performance mainly selects an Average accuracy mean value mAP (mean Average precision) and a detection frame Per second FPS (Frames Per second). The aim is to improve the detection speed as much as possible on the premise of ensuring the detection precision.

The experimental environment of the invention is configured as follows: the CPU model is Intel Xeon E5-2620v3 processor, 64G memory, the graphics card model is Nvidia GeForce GTX TITAN X, the CUDA version is 8.0.44, the OpenCV version is 2.4.13, and the operating system is Ubuntu 16.04. Using gtsdb (german Traffic Sign Detection benchmark), 900 images of 1360 × 800 pixels are included, with the target image size being between 16 × 16 pixels and 128 × 128 pixels. The road scene pictures comprise road scene pictures with different conditions such as severe change of illumination conditions, similar background color interference, motion blur, local shielding and the like. The network parameter configuration is as follows: momentum is 0.9; decapay is 0.0005; max batches 50000; a learning rate of 0.001; step lengths are respectively 30000 and 40000; scales were 0.1, 0.1. The invention adopts a loading pre-training (pre-training) model dark net53.conv.74 as an initial parameter of the network during training, thus greatly shortening the training time. And meanwhile, the angle, the exposure, the saturation, the hue and the size of the input picture are adjusted to enhance the robustness of the model.

Clustering selection initial candidate frame performance analysis: in order to compare the influence of whether clustering is used to select the initial candidate frame on the detection performance of the model, the initial candidate frame is firstly trained according to the original candidate frame parameters of YOLOv3, then the parameters with the frame number of 9 obtained by clustering are trained, and the performance of the finally obtained model on the test set is shown in fig. 4. When the threshold value of detection is 0.5, the model of the initial candidate frame using the cluster is obviously higher than the model of the original candidate frame in accuracy (precision), recall rate and FPS. The recall rate and the accuracy rate are respectively improved by 2.88 percentage points and 3.41 percentage points, the average IOU is improved by 3.39 percentage points, and the model can detect 2 more pictures per second.

As shown in FIG. 5, the model using K-means clustering candidate frames converges gradually after about 900 training iterations, while the model without using clustering starts to converge at 1500 times, and the convergence rate is improved by nearly 70% by K-means clustering. The reason is that the initial frame candidate parameters after clustering are closer to the characteristic of the width and height of the traffic identification target, and are easier to approach the real target frame continuously during optimization.

Performance analysis of different activation functions: in order to verify the influence of different activation functions on a traffic sign detection model, four different activation functions are respectively selected for experiments, namely ReLU, Softplus, Leaky-ReLU and the improved activation function Softplus-ReLU provided by the invention. Results as shown in table 1, the model maps using ReLU and using Softplus are comparable; when the model uses Leaky-ReLU, mAP is improved by 1.63 percentage points, which benefits from the reservation of output when Leaky-ReLU takes a negative value; however, the mAP of the activation function Softplus-ReLU proposed by the invention is the highest, and is respectively improved by 4.42 percent and 2.79 percent compared with the ReLU and the Leaky-ReLU, because the Softplus-ReLU simultaneously retains the advantages of the two activation functions: the method has the advantages of high convergence rate and high robustness to noise.

TABLE 1 comparison of Performance of different activation function test models

Improved loss function performance analysis: in order to verify the effectiveness of the model for improving the loss function on the detection of the small target, the invention trains two traffic identification detection models under the condition of ensuring that other parameters are unchanged, and the detection effect is as shown in fig. 6: the first behavior is a picture to be detected and comprises a plurality of small targets; the second behavior is a detection effect graph of the traditional YOLOv 3; the third row is a graph of the detection effect of the loss function improved by the present invention. It can be seen that the improved model detection effect of the invention is obviously better than that of YOLOv3, YOLOv3 has missing detection for all small targets with the size less than 30 × 30 in three pictures, and the loss function of the model provided by the invention balances the loss of the large target and the small target to make the loss weight of the small target larger, so that the learning is better. And finally, detecting all traffic identifications in the picture. Therefore, the improved method for the loss function proposed by the present invention is effective for the detection of such objects as traffic signs.

And (3) performance comparison: the model provided by the invention is compared with other mainstream models in detection performance, and all models are trained by adopting the same data set. The traffic sign detection model has the highest detection accuracy (mAP reaches 92.88%), which benefits from the enhancement of the classification capability of the model after the activation function is improved and the enhancement of the small target recognition capability of the model after the loss function is improved. Since the network structure Darknet53 of the model of the invention is more complicated than Darknet19 of Yolov2 in terms of more layers, the detection speed is lower than that of Yolov2, but higher than that of other detection models. The speed of the final improved model is 35FPS, which is higher than the standard human eye vision persistence of 24 Frames Per Second (FPS) of real-time detection, and the requirements of real-time detection are completely met.

Moreover, the method of the invention is not only applied to the detection of the traffic identification, but also applied to the identification of small targets in the image. Other embodiments of the present invention are also possible, and various changes and modifications may be made by one skilled in the art without departing from the spirit and its scope, and it is intended that all such changes and modifications fall within the scope of the appended claims.

Claims

1. The method for detecting the traffic sign in automatic driving based on the YOLOv3 network is characterized by comprising the following steps of:

clustering real target frames marked in the training set data, obtaining initial candidate target frames of the predicted traffic identification type targets in the training set data by adopting the area intersection ratio IOU as a rating index, and carrying out the following steps ofTaking the initial candidate target frame as an initial network parameter of a YOLOv3 network; calling initial network parameters of a YOLOv3 network, inputting training set data into the YOLOv3 network for training until loss function values output by the training set data are less than or equal to a threshold value Q₁Or stopping training when the set maximum iteration number N is reached to obtain a trained YOLOv3 network;

the initial network parameters of the YOLOv3 network are called, and the training set data are input into the YOLOv3 network for training until the loss function value of the training set data output is less than or equal to the threshold value Q₁Or stopping training when the set maximum iteration number N is reached to obtain a trained YOLOv3 network, wherein the specific process is as follows:

and

respectively representing central points of real traffic signsAbscissa, ordinate, width and height;

the ith grid representing the frame of the jth candidate target is responsible for detecting the traffic identification target;

indicating that the ith grid where the jth candidate target frame is located is not responsible for detecting the traffic identification target; c_iAnd

respectively representing the predicted probability value and the real probability value of the traffic identification in the ith grid belonging to a certain class; c represents a certain category, and classes represents the total number of categories;

until the loss function value of the data output of the training set is less than or equal to the threshold value Q₁Or stopping training when the set maximum iteration number N is reached, and taking the network obtained when the training is stopped as a trained YOLOv3 network;

2. The method for detecting the traffic sign in automatic driving based on the YOLOv3 network as claimed in claim 1, wherein the specific process of the first step is as follows:

3. The method of claim 2, wherein the training set and the test set have a data volume ratio of 8: 1.

4. the method as claimed in claim 3, wherein the method for detecting the traffic sign in the automatic driving based on the YOLOv3 network is characterized in that real target frames labeled in the training set data are clustered, and an area intersection ratio IOU is used as a rating index to obtain an initial candidate target frame of the traffic sign class target predicted in the training set data, and the specific process is as follows:

the area intersection ratio IOU is expressed as follows:

Dis(box,centroid)＝1-IOU(box,centroid)

5. The method according to claim 4, wherein the learning rate is set to 0.0001 and the batch _ size is set to 256 when the training set data is input into the YOLOv3 network for training in the second step.

6. The method for detecting traffic signs in automatic driving based on the YOLOv3 network of claim 5, wherein the activation function adopted by the convolutional layer of the YOLOv3 network is defined as:

7. The method of detecting traffic sign in automatic driving based on YOLOv3 network of claim 1, wherein the threshold Q is set₁The value of (A) is 0.1.

8. The method for detecting traffic signs in automatic driving based on YOLOv3 network as claimed in claim 1, wherein the accuracy threshold Q is₂The value of (A) is 90%.