CN111401148B - Road multi-target detection method based on improved multi-stage YOLOv3 - Google Patents

Road multi-target detection method based on improved multi-stage YOLOv3

Info

Publication number
CN111401148B
CN111401148B (application CN202010124052.5A)
Authority
CN
China
Prior art keywords
convolution
detection
improved
feature
road
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010124052.5A
Other languages
Chinese (zh)
Other versions
CN111401148A (en
Inventor
王海
王宽
蔡英凤
李祎承
刘擎超
刘明亮
张田田
李洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN202010124052.5A priority Critical patent/CN111401148B/en
Publication of CN111401148A publication Critical patent/CN111401148A/en
Application granted granted Critical
Publication of CN111401148B publication Critical patent/CN111401148B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/588Recognition of the road, e.g. of lane markings; Recognition of the vehicle driving pattern in relation to the road
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217Validation; Performance evaluation; Active pattern learning techniques
    • G06F18/2193Validation; Performance evaluation; Active pattern learning techniques based on specific statistical tests
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems


Abstract

The invention discloses a road multi-target detection method based on an improved multi-stage YOLOv3, comprising the following steps. Step1, data set production: a road multi-target data set is built from the public driving data set BDD100K. Step2, the aspect ratios of road target candidate boxes are computed with a K-means clustering algorithm. Step3, an improved YOLOv3 neural network model is designed. Step4, training hyperparameters and network parameters are set, the training set is fed into the network, the improved YOLOv3 network is trained, and the trained weight file is saved. Step5, predicted bounding-box information and class probabilities are output. Step6, redundant detection boxes are filtered with soft non-maximum suppression, the detection picture is visualized, and the final target detection boxes and recognition results are generated. Compared with the original YOLOv3 neural network model, the mAP on the BDD100K validation set reaches 58.09%, an improvement of nearly 9 percentage points, so the detection accuracy is higher. Real-time performance is also good: the average detection time is 0.03 s per image (about 33 FPS), only 1.65% slower than the original YOLOv3, which meets the real-time requirement.

Description

Road multi-target detection method based on improved multi-stage YOLOv3
Technical Field
The invention belongs to the technical field of automotive environment perception and target detection, and particularly relates to a road multi-target detection method based on an improved multi-stage YOLOv3.
Background
Road target detection is an important direction in the field of image recognition. Computer vision algorithms based on deep learning are a relatively recent arrival in the field, and with the continuous growth of data volume and the rapid advance of hardware in recent years they have achieved great success in various computer vision tasks such as object classification, object detection and semantic segmentation. For object detection in particular, many algorithms with outstanding accuracy and good real-time performance are now available. These algorithms are classified into single-stage and two-stage detectors according to whether a region proposal network (RPN) is used to first regress detection boxes for positive samples. Single-stage detectors include YOLOv3, SSD and RetinaNet; two-stage detectors include R-CNN, R-FCN, Faster R-CNN and Cascade R-CNN. Single-stage detectors offer good real-time performance, while two-stage detectors offer high accuracy. Within object detection, road target detection is a very important direction, and research on road target detection algorithms matters greatly for traffic safety. In an autonomous driving scenario, detection and identification of road targets plays a very important role: accurate detection is decisive for subsequent recognition, assisted positioning and navigation. The invention applies a method based on an improved YOLOv3 to road multi-target detection.
Disclosure of Invention
The invention aims to solve the poor accuracy of existing road target detection and provides a road multi-target detection method based on an improved YOLOv3, which can improve safety during driving. First, a data set is built from the public driving data set BDD100K; then an improved YOLOv3 neural network model is designed; the model is trained on the BDD100K data set; the saved model parameters are loaded into the improved YOLOv3 neural network model; and finally road targets in pictures are detected.
Compared with the original YOLOv3 network framework, the improved YOLOv3 neural network model of the invention adds two feature detection maps: the five feature detection maps have resolutions of 13×13, 26×26, 52×52, 104×104 and 208×208, i.e. the improved model outputs the 104×104 and 208×208 maps that the original YOLOv3 lacks. Five candidate boxes are assigned on the feature map of each scale, following the principle that high-resolution feature maps detect small objects and low-resolution feature maps detect large objects. The training-set and validation-set images are used to train the network, yielding the final YOLOv3-based network weight model. When road targets in a picture are detected in real time, each target in the picture has several predicted bounding boxes, and soft non-maximum suppression is used to eliminate the redundant ones. Both the localization accuracy and the detection accuracy of the network are thereby improved.
The beneficial effects of the invention include:
1. Compared with the original YOLOv3 neural network model, the mAP on the BDD100K validation set reaches 58.09%, an improvement of nearly 9 percentage points, so the detection accuracy is high.
2. Real-time performance is good: timing the improved YOLOv3 neural network model on each picture gives an average detection time of 0.03 s per image (about 33 FPS), which meets the real-time requirement.
Drawings
FIG. 1 is a modified YOLOv3 neural network model
FIG. 2 is a diagram showing the detection effect
FIG. 3 is a diagram showing the detection effect
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, a road multi-target detection method based on improved YOLOv3 includes the following steps:
Step1 data set production
A road multi-target data set is built from the public driving data set BDD100K. The data set contains 100,000 pictures, and the GT box labels fall into 10 classes: bus, light (traffic light), sign (traffic sign), person (pedestrian), bike, truck, motor, car, train and rider, for a total of about 1.84 million labeled boxes. The resolution of the pictures is 1280×720. The BDD100K data set contains pictures of different weather, scenes and times of day, both sharp and blurred; it is large in scale, diverse, and drawn from real driving scenes. The invention divides it into training, test and validation sets in the ratio 7:2:1, i.e. 70,000 training pictures, 20,000 test pictures and 10,000 validation pictures, and arranges the BDD100K data into the VOC data set format. The VOC format comprises three folders: JPEGImages, Annotations and ImageSets. JPEGImages stores the training-set and test-set pictures; Annotations stores the xml annotation files; ImageSets stores txt files in which each line is the name of one picture. The improved YOLOv3 network model reads file names from the txt files, finds the corresponding pictures and annotation information in the JPEGImages and Annotations folders, extracts the road target annotations from the found picture labels, and obtains the box parameters of the annotations.
The pictures are then randomly divided into batches. Before they are fed into the improved YOLOv3 network model, data-enhancement operations such as random rotation, cropping, translation, flipping and noise perturbation are applied to expand the scene diversity of the pictures, and the picture size is uniformly adjusted to 416×416.
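The preprocessing described above (augmentation plus a uniform resize to 416×416) can be sketched as follows. This is an illustrative NumPy stand-in, not the patent's implementation; the function names are hypothetical, and only two of the named enhancement operations (flip and resize) are shown.

```python
import numpy as np

def resize_nearest(img, size=416):
    """Nearest-neighbour resize of an HxWxC image to size x size
    (a minimal stand-in for the uniform 416x416 resize)."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

def hflip(img, boxes):
    """Horizontal flip of an image together with its [x1, y1, x2, y2]
    pixel boxes, one of the enhancement operations named in the text."""
    w = img.shape[1]
    flipped = img[:, ::-1]
    boxes = boxes.copy()
    boxes[:, [0, 2]] = w - boxes[:, [2, 0]]
    return flipped, boxes
```

In a training pipeline these would be applied to each batch, with the box coordinates transformed alongside the pixels so the labels stay valid.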
Step2 road target candidate-box aspect-ratio calculation based on the K-means clustering algorithm
The bounding-box labels of the BDD100K data set are clustered with the K-means++ algorithm, giving 15 anchor box sizes: (4,8), (6,16), (10,10), (8,31), (13,20), (22,16), (22,30), (13,51), (36,42), (25,89), (54,66), (83,95), (57,155), (116,156), (155,249).
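A minimal sketch of this clustering follows, using the 1 − IoU distance commonly used for YOLO anchor selection. The patent names K-means++; for brevity this sketch uses plain random seeding, and all function names are illustrative.

```python
import numpy as np

def iou_wh(box, clusters):
    """IoU between one (w, h) box and k cluster (w, h) boxes,
    both treated as anchored at the origin."""
    inter = np.minimum(box[0], clusters[:, 0]) * np.minimum(box[1], clusters[:, 1])
    union = box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter
    return inter / union

def kmeans_anchors(wh, k, iters=100, seed=0):
    """Cluster ground-truth (w, h) pairs with distance d = 1 - IoU;
    returns k anchor sizes sorted by area."""
    rng = np.random.default_rng(seed)
    clusters = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        labels = np.array([np.argmax(iou_wh(b, clusters)) for b in wh])
        new = np.array([wh[labels == j].mean(axis=0) if np.any(labels == j)
                        else clusters[j] for j in range(k)])
        if np.allclose(new, clusters):
            break
        clusters = new
    return clusters[np.argsort(clusters[:, 0] * clusters[:, 1])]
```

Run on the BDD100K box labels with k = 15, a procedure like this would yield the anchor list above.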
Step3 improved YOLOv3 neural network model
The original YOLOv3 is a deep residual convolutional neural network with a fully convolutional architecture. The network alternates 3×3 and 1×1 convolutions to extract target features from the picture, reduce resolution and adjust the number of channels, and uses upsampling layers to fuse 2× enlarged deep features with earlier layers of the network. Layers 75 to 106 of the YOLOv3 network form the feature-interaction output stage, split across three resolutions; within each resolution's feature map, local feature interaction is realized by convolution (3×3 and 1×1 kernels). The final output of the network is produced by applying a 1×1 convolution kernel on the feature map, and detection is performed on feature maps of three different sizes taken from three different layers of the network. The original YOLOv3 therefore predicts with three resolution detection maps.
The improved YOLOv3 neural network model of the invention is shown in figure 1, and the detailed process is as follows:
First, the normalized image is halved in scale by two 3×3 convolutions, then passes in sequence through one residual module, a 3×3 convolution, two residual modules, a 3×3 convolution, eight residual modules, a 3×3 convolution and seven residual modules to obtain a 13×13 feature detection map; the 416×416 input picture is thus reduced to a 13×13×45 output detection map. An upsampling layer with step 2 then raises the feature map to 26×26×256.
Second, the 26×26 feature detection map is obtained by passing through a 3×3 convolution and eight residual modules; the subsequent 52×52, 104×104 and 208×208 feature maps are each obtained through a 3×3 convolution and eight residual modules. A residual module consists of a 1×1 convolution, a 3×3 convolution and a residual connection in sequence. Next, anchor boxes of three different scales are initially generated on the 13×13 feature map, which then passes through a 3×3 convolution, a CONV module, a 3×3 convolution and a 1×1 convolution to yield the tensor data at the 13×13 scale. The 13×13 feature map is then upsampled after a 3×3 convolution, a CONV module and a 1×1 convolution; the upsampled feature map is fused with the 26×26 feature map produced by the backbone of the YOLO network; anchor boxes of three different scales are initially generated on the fused feature map, which then passes through a CONV module, a 3×3 convolution and a 1×1 convolution to yield the tensor data at 26×26. The tensor data at 52×52, 104×104 and 208×208 are obtained in the same way: the upsampled feature map is fused with the preceding-layer feature map from the backbone by vector splicing (channel concatenation), anchor boxes of three different scales are initially generated on the fused map, and a CONV module, a 3×3 convolution and a 1×1 convolution then yield the tensor data. The CONV module denotes the operation sequence of a 1×1 convolution, a 3×3 convolution and a 1×1 convolution. The resolutions of the five modified feature detection maps are 13×13, 26×26, 52×52, 104×104 and 208×208, respectively.
The improved network thus outputs the 104×104 and 208×208 feature detection maps that the original YOLOv3 lacks. Five candidate boxes are assigned on the feature map of each scale; the overall detection pipeline of the improved YOLOv3 neural network is shown in fig. 1.
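The upsample-and-splice fusion used at each scale transition can be illustrated with array shapes. This is a NumPy sketch under assumed channel counts (256 channels on each branch), not the patent's code.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a CxHxW feature map,
    corresponding to the step-2 upsampling layer in the text."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse(deep, shallow):
    """'Vector splicing': channel-wise concatenation of the upsampled
    deep feature map with the shallower backbone feature map."""
    return np.concatenate([upsample2x(deep), shallow], axis=0)

# The five detection scales of the improved model for a 416x416 input:
scales = [416 // s for s in (32, 16, 8, 4, 2)]  # [13, 26, 52, 104, 208]
```

Each fusion doubles the spatial resolution of the deep features before concatenation, so the chain 13→26→52→104→208 covers all five detection scales.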
Step4, setting training super parameters and network parameters, inputting a training set into a network, training an improved YOLOv3 network, and storing a trained weight file;
The hyperparameters during training are set as follows: batch size 4, learning rate = 0.001, maximum iteration count 50000, learning-rate schedule steps = 40000, 45000, 50000. The learning rate is multiplied by 0.1 at 40000 iterations and by 0.1 again at 45000 iterations.
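The step schedule just described (0.1× at 40000 iterations and again at 45000) can be written as a small helper; a sketch with an illustrative function name:

```python
def step_lr(iteration, base_lr=1e-3, milestones=(40000, 45000), gamma=0.1):
    """Piecewise-constant learning-rate schedule: multiply the rate by
    gamma at each milestone passed, matching the training setup above."""
    lr = base_lr
    for m in milestones:
        if iteration >= m:
            lr *= gamma
    return lr
```

So the rate is 1e-3 up to iteration 40000, 1e-4 until 45000, and 1e-5 afterwards.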
main parameters of the experimental platform: a processor: inter (R) core (TM) i5-8600K CPU@3.60GHZ; memory: 64GB; display card: NVIDIA GeForce GTX1080TI.
The improved YOLOv3 model computes the loss with the regression loss function of the predicted bounding boxes: for each predicted box, the class score, confidence score, center coordinates and width-height loss relative to the ground-truth box are evaluated by the loss function, and the gradient obtained by backpropagation updates the weights. To drive the loss steadily down, the model weights are updated on every batch fed into the improved neural network model until the loss value converges. The model parameters are saved every ten thousand iterations and validated on the validation set, and the learning rate is adjusted according to the loss curve and the detection effect on the validation set. The model converges at 90000 iterations, at which point training is stopped; the parameters saved at iteration 90000 constitute the final detection model based on the improved YOLOv3 neural network.
Step5 outputs predicted bounding box information and class probabilities.
The model parameters saved in the previous step are loaded into the improved YOLOv3 model and the test picture is fed in. The x, y, confidence and class probabilities predicted by the network are activated by a logistic function and filtered by a threshold to obtain the coordinates, confidence and class probability of all prediction boxes; the predicted bounding-box information and class probabilities are then output.
b_x = σ(t_x) + C_x
b_y = σ(t_y) + C_y
b_w = P_w · e^(t_w)
b_h = P_h · e^(t_h)
Wherein: c (C) X ,C Y For the offset of the current grid relative to the top left grid of the current feature map, the sigma () function is a logistic function used to apply t x 、t y Normalized to between 0 and 1, P w ,P h Is intersected with the marked boundary frame and is wider and higher than the largest anchor frame, t w 、t h 、t x 、t y Is the vertex coordinates of the prediction box.
Step6 filtering detection boxes with soft non-maximum suppression
At this point each road target in the picture still has several predicted bounding boxes. Traditional non-maximum suppression sorts the detection boxes by score, keeps the box with the highest score, and deletes every other box whose overlap with it exceeds a set ratio, which easily causes targets to be missed. Soft non-maximum suppression instead lowers the confidence of overlapping boxes, a confidence threshold is specified, and only boxes scoring above the threshold are kept. Finally the detected picture is visualized to generate the final target detection boxes and recognition results, as shown in figs. 2 and 3.
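A minimal sketch of Gaussian soft-NMS in this spirit (decaying the scores of overlapping boxes instead of deleting them) follows; the Gaussian decay form, parameter values and names are illustrative assumptions, not the patent's exact procedure.

```python
import numpy as np

def iou(box, boxes):
    """IoU of one [x1, y1, x2, y2] box against an Nx4 array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian soft-NMS: instead of deleting boxes that overlap the
    current best box, decay their scores by exp(-IoU^2 / sigma);
    returns kept box indices in selection order."""
    scores = np.asarray(scores, dtype=float).copy()
    boxes = np.asarray(boxes, dtype=float)
    remaining = list(range(len(scores)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        if scores[best] < score_thresh:
            break
        keep.append(best)
        remaining.remove(best)
        for i in remaining:
            scores[i] *= np.exp(-iou(boxes[best], boxes[i:i + 1])[0] ** 2 / sigma)
    return keep
```

Unlike hard NMS, a heavily overlapped box is only down-weighted, so a second true object hiding behind a stronger detection can still survive the confidence threshold.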
Step7 detection accuracy comparison
The invention evaluates the target detection performance of the improved YOLOv3 network with mAP (mean Average Precision), the mean over classes of the per-class average precision, i.e. the area under each class's precision-recall curve; it is an important index for evaluating target detection networks. mAP is computed on the 10,000 pictures of the BDD100K validation set. The annotations of the sparsely represented classes train, rider, motor and bike are removed, so the mAP is computed over six classes: bus, car, person, traffic light, traffic sign and truck.
The calculation formula of the AP is: AP = ∫ P dR (precision integrated over recall from 0 to 1),
wherein P is the detection precision (precision), R is the Recall ratio Recall, and the calculation formula is as follows:
(1) P = TP / (TP + FP)
(2) R = TP / (TP + FN)
where TP, FP and FN denote the numbers of true positives, false positives and false negatives, respectively.
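The precision and recall definitions above lead to AP as a step-wise integral of precision over recall. The sketch below assumes the evaluation's IoU matching has already flagged each detection as true or false positive; the function name is illustrative.

```python
import numpy as np

def average_precision(scores, is_tp, n_gt):
    """AP = integral of P dR, computed as a step sum over detections
    ranked by confidence score. `is_tp` flags each detection as a true
    positive; `n_gt` is the number of ground-truth boxes of the class."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    recall = tp_cum / n_gt
    precision = tp_cum / (tp_cum + fp_cum)
    deltas = np.diff(np.concatenate([[0.0], recall]))  # R_i - R_{i-1}
    return float(np.sum(deltas * precision))
```

mAP is then the mean of this AP over the six evaluated classes.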
table 1 shows the results of the improved YOLOv3 network versus the original performance:
Figure GDA0004101207060000053
Figure GDA0004101207060000061
TABLE 1
As shown in Table 1, the improved YOLOv3 has higher detection accuracy: compared with the original YOLOv3, mAP increases by about 9 percentage points to 58.09%. Second, by timing the program on each detected picture, the average detection time is 0.03 s per image, i.e. about 33 FPS, which shows that the road multi-target detection method based on the improved YOLOv3 neural network also meets the real-time requirement.
The detailed description above lists only specific practical embodiments of the invention and is not intended to limit its scope; all equivalent implementations or modifications that do not depart from the technical spirit of the invention shall be included within its scope.

Claims (7)

1. The road multi-target detection method based on the improved multi-stage YOLOv3 is characterized by comprising the following steps:
step1, manufacturing a data set: creating a road multi-objective dataset based on the disclosed driving dataset BDD 100K;
step2, calculating aspect ratio of the road target candidate frame based on a K-means clustering algorithm;
step3, designing an improved YOLOv3 neural network model; the specific method comprises the following steps:
firstly, reducing the scale of the normalized image by half through two 3×3 convolutions, then passing in sequence through one residual module, a 3×3 convolution, two residual modules, a 3×3 convolution, eight residual modules, a 3×3 convolution and seven residual modules to obtain a 13×13 feature detection map, the 416×416 input picture being adjusted to a 13×13×45 output detection map, and then connecting an upsampling layer with step 2 to raise the feature map to 26×26×256;
secondly, sequentially passing the 26×26 feature detection map through a 3×3 convolution and eight residual modules; then obtaining the 52×52, 104×104 and 208×208 feature maps after a 3×3 convolution and eight residual modules each; wherein the residual module sequentially carries out a 1×1 convolution, a 3×3 convolution and a residual operation;
next, initially generating anchor boxes of three different scales on the 13×13 feature map, and obtaining the tensor data at the 13×13 scale through a 3×3 convolution, a CONV module, a 3×3 convolution and a 1×1 convolution in sequence; then upsampling the 13×13 feature map after a 3×3 convolution, a CONV module and a 1×1 convolution in sequence, performing feature fusion between the upsampled feature map and the 26×26 feature map obtained by the backbone of the YOLO network, initially generating anchor boxes of three different scales on the fused feature map, and then obtaining the tensor data at 26×26 through a CONV module, a 3×3 convolution and a 1×1 convolution in sequence; then obtaining the tensor data at 52×52, 104×104 and 208×208 in the same way: performing feature fusion between the upsampled feature map and the preceding-layer feature map obtained by the basic neural network part of the YOLO network by a vector splicing method, initially generating anchor boxes of three different scales on the fused feature map, and then obtaining the tensor data through a CONV module, a 3×3 convolution and a 1×1 convolution in sequence; the CONV module sequentially performs the operations of a 1×1 convolution, a 3×3 convolution, a 1×1 convolution, a 3×3 convolution and a 1×1 convolution; the resolutions of the five modified feature detection maps are 13×13, 26×26, 52×52, 104×104 and 208×208, respectively;
finally, 5 candidate frames are distributed on the feature detection graph of each scale;
step4, setting training super parameters and network parameters, inputting a training set into a network, training an improved YOLOv3 network, and storing a trained weight file;
step5, outputting predicted boundary box information and class probability;
and step6, filtering the detection boxes with soft non-maximum suppression, visualizing the detection picture, and generating the final target detection boxes and recognition results.
2. The improved multi-level YOLOv3-based road multi-target detection method according to claim 1, wherein in step1 the data set BDD100K is arranged into the VOC data set format, the VOC data set includes three folders, respectively a JPEGImages folder, an Annotations folder and an ImageSets folder, wherein JPEGImages stores the training-set and test-set pictures, the Annotations folder stores the xml annotation files, and the ImageSets folder stores txt text in which each line corresponds to a picture name; the improved YOLOv3 network model reads the file names from the txt text, then searches the JPEGImages and Annotations folders for the corresponding pictures and annotation information, extracts the road target annotations from the found picture labels, and obtains the box parameters of the annotations.
3. The improved multi-level YOLOv3-based road multi-target detection method according to claim 2, wherein the pictures in the VOC data set are randomly divided into batches, and the pictures undergo random rotation, cropping, translation, flipping and noise-perturbation data enhancement before being sent into the improved YOLOv3 network model, thereby expanding the scene diversity of the pictures, and the picture size is uniformly adjusted to 416×416.
4. The improved multi-level YOLOv3-based road multi-target detection method according to claim 1, wherein the GT box labels in the data set BDD100K are divided into 10 classes: bus, light, sign, person, bike, truck, motor, car, train and rider, with a total of about 1.84 million labeled boxes; the resolution of the data set pictures is 1280×720, and the training, test and validation sets are divided in the ratio 7:2:1, namely 70000 training pictures, 20000 test pictures and 10000 validation pictures.
5. The improved multi-level YOLOv3-based road multi-target detection method according to claim 1, wherein the implementation of step2 is as follows: the bounding-box labels of the BDD100K data set are clustered with the K-means++ algorithm, giving 15 anchor box sizes: (4,8), (6,16), (10,10), (8,31), (13,20), (22,16), (22,30), (13,51), (36,42), (25,89), (54,66), (83,95), (57,155), (116,156), (155,249).
6. The improved multi-level YOLOv3-based road multi-target detection method according to claim 1, wherein in step4 the hyperparameters during training are set as follows: batch size 4, learning rate = 0.001, burn_in = 1000, maximum iteration count 50000, learning-rate schedule steps = 40000, 45000, 50000; the learning rate is multiplied by 0.1 at 40000 iterations and by 0.1 again at 45000 iterations;
in the training process, the loss is computed with the regression loss function of the predicted bounding boxes; the class score, confidence score, center-coordinate and width-height losses of each predicted box relative to the ground-truth box are calculated by the loss function; the gradient obtained by backpropagation updates the weights; the model weights are updated on every batch fed into the improved neural network model until the loss value converges; the model parameters are saved every ten thousand iterations and validated on the validation set, and the learning rate is adjusted according to the loss curve and the detection effect on the validation set.
7. The improved multi-level YOLOv3-based road multi-target detection method according to claim 1, wherein the implementation of step6 is as follows: soft non-maximum suppression is used to reduce the confidence of overlapping boxes, a confidence threshold is specified, detection boxes with scores greater than the threshold are kept, and this step is cycled over the remaining predicted bounding boxes, finally obtaining the predicted bounding box corresponding to each road target.
CN202010124052.5A 2020-02-27 2020-02-27 Road multi-target detection method based on improved multi-stage YOLOv3 Active CN111401148B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010124052.5A CN111401148B (en) 2020-02-27 2020-02-27 Road multi-target detection method based on improved multi-stage YOLOv3

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010124052.5A CN111401148B (en) 2020-02-27 2020-02-27 Road multi-target detection method based on improved multi-stage YOLOv3

Publications (2)

Publication Number Publication Date
CN111401148A CN111401148A (en) 2020-07-10
CN111401148B (en) 2023-06-20

Family

ID=71428505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010124052.5A Active CN111401148B (en) 2020-02-27 2020-02-27 Road multi-target detection method based on improved multi-stage YOLOv3

Country Status (1)

Country Link
CN (1) CN111401148B (en)

Families Citing this family (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110826572B (en) * 2018-08-09 2023-04-21 京东方科技集团股份有限公司 Non-maximum value inhibition method, device and equipment for multi-target detection
CN111986156A (en) * 2020-07-20 2020-11-24 华南理工大学 Axe-shaped sharp tool detection method, system, device and storage medium
CN113971755B (en) * 2020-07-22 2024-05-03 中国科学院沈阳自动化研究所 All-weather sea surface target detection method based on improved YOLOV model
CN112084890B (en) * 2020-08-21 2024-03-22 杭州电子科技大学 Method for identifying traffic signal sign in multiple scales based on GMM and CQFL
CN112070729B (en) * 2020-08-26 2023-07-07 西安交通大学 Anchor-free remote sensing image target detection method and system based on scene enhancement
CN111986436B (en) * 2020-09-02 2022-12-13 成都视道信息技术有限公司 Comprehensive flame detection method based on ultraviolet and deep neural networks
CN112183255A (en) * 2020-09-15 2021-01-05 西北工业大学 Underwater target visual identification and attitude estimation method based on deep learning
CN112633052A (en) * 2020-09-15 2021-04-09 北京华电天仁电力控制技术有限公司 Belt tearing detection method
CN112085728B (en) * 2020-09-17 2022-06-21 哈尔滨工程大学 Submarine pipeline and leakage point detection method
CN112132130B (en) * 2020-09-22 2022-10-04 福州大学 Real-time license plate detection method and system for whole scene
CN112200225B (en) * 2020-09-23 2022-07-26 西南交通大学 Steel rail damage B display image identification method based on deep convolution neural network
CN112132033B (en) * 2020-09-23 2023-10-10 平安国际智慧城市科技股份有限公司 Vehicle type recognition method and device, electronic equipment and storage medium
CN112233175B (en) * 2020-09-24 2023-10-24 西安交通大学 Chip positioning method and integrated positioning platform based on YOLOv3-tiny algorithm
CN112287977B (en) * 2020-10-06 2024-02-09 武汉大学 Target detection method based on bounding box key point distance
CN112329768A (en) * 2020-10-23 2021-02-05 上善智城(苏州)信息科技有限公司 Improved YOLO-based method for identifying fuel-discharging stop sign of gas station
CN112380921A (en) * 2020-10-23 2021-02-19 西安科锐盛创新科技有限公司 Road detection method based on Internet of vehicles
CN112434583B (en) * 2020-11-14 2023-04-07 武汉中海庭数据技术有限公司 Lane transverse deceleration marking line detection method and system, electronic equipment and storage medium
CN112365324A (en) * 2020-12-02 2021-02-12 杭州微洱网络科技有限公司 Commodity picture detection method suitable for E-commerce platform
CN112560918B (en) * 2020-12-07 2024-02-06 杭州电子科技大学 Dish identification method based on improved YOLO v3
CN112507929B (en) * 2020-12-16 2022-05-13 武汉理工大学 Vehicle body spot welding slag accurate detection method based on improved YOLOv3 network
CN112434672B (en) * 2020-12-18 2023-06-27 天津大学 Marine human body target detection method based on improved YOLOv3
CN112884705B (en) * 2021-01-06 2024-05-14 西北工业大学 Two-dimensional material sample position visualization method
CN112906485B (en) * 2021-01-25 2023-01-31 杭州易享优智能科技有限公司 Visual impairment person auxiliary obstacle perception method based on improved YOLO model
CN112906523B (en) * 2021-02-04 2022-12-27 上海航天控制技术研究所 Hardware-accelerated deep learning target machine type identification method
CN112819804B (en) * 2021-02-23 2024-07-12 西北工业大学 Insulator defect detection method based on improved YOLOv convolutional neural network
CN112949633B (en) * 2021-03-05 2022-10-21 中国科学院光电技术研究所 Improved YOLOv 3-based infrared target detection method
CN113139615A (en) * 2021-05-08 2021-07-20 北京联合大学 Unmanned environment target detection method based on embedded equipment
CN113409250A (en) * 2021-05-26 2021-09-17 杭州电子科技大学 Solder joint detection method based on convolutional neural network
CN113255524B (en) * 2021-05-27 2022-08-16 山东省交通规划设计院集团有限公司 Pavement information identification method and system based on YOLO v4
CN113313128B (en) * 2021-06-02 2022-10-28 东南大学 SAR image target detection method based on improved YOLOv3 network
CN113378739A (en) * 2021-06-19 2021-09-10 湖南省气象台 Foundation cloud target detection method based on deep learning
CN113486764B (en) * 2021-06-30 2022-05-03 中南大学 Pothole detection method based on improved YOLOv3
CN113592784A (en) * 2021-07-08 2021-11-02 浙江科技学院 Method and device for detecting pavement diseases based on lightweight convolutional neural network
CN113538389B (en) * 2021-07-23 2023-05-09 仲恺农业工程学院 Pigeon egg quality identification method
CN113537106B (en) * 2021-07-23 2023-06-02 仲恺农业工程学院 Fish ingestion behavior identification method based on YOLOv5
CN113569968B (en) * 2021-07-30 2024-05-17 清华大学苏州汽车研究院(吴江) Model training method, target detection method, device, equipment and storage medium
CN113822148B (en) * 2021-08-05 2024-04-12 同济大学 Intelligent identification method for trace tiny carryover based on convolutional neural network
CN113743233B (en) * 2021-08-10 2023-08-01 暨南大学 Vehicle model identification method based on YOLOv5 and MobileNet V2
CN114120057A (en) * 2021-11-09 2022-03-01 华侨大学 Confusion matrix generation method based on Paddledetection
CN113903009B (en) * 2021-12-10 2022-07-05 华东交通大学 Railway foreign matter detection method and system based on improved YOLOv3 network
CN114998220B (en) * 2022-05-12 2023-06-13 湖南中医药大学 Tongue image detection and positioning method based on improved Tiny-YOLO v4 natural environment
CN115439765B (en) * 2022-09-17 2024-02-02 艾迪恩(山东)科技有限公司 Marine plastic garbage rotation detection method based on machine learning unmanned aerial vehicle visual angle
CN115311458B (en) * 2022-10-10 2023-02-14 南京信息工程大学 Real-time expressway pedestrian intrusion event detection method based on multi-task learning
CN116343175A (en) * 2023-05-24 2023-06-27 岚图汽车科技有限公司 Pedestrian guideboard detection method, device, equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325418A (en) * 2018-08-23 2019-02-12 华南理工大学 Based on pedestrian recognition method under the road traffic environment for improving YOLOv3
CN109815886B (en) * 2019-01-21 2020-12-18 南京邮电大学 Pedestrian and vehicle detection method and system based on improved YOLOv3
CN110378210B (en) * 2019-06-11 2023-04-18 江苏大学 Vehicle and license plate detection and long-and-short-focus fusion distance measurement method based on lightweight YOLOv3
CN110796168B (en) * 2019-09-26 2023-06-13 江苏大学 Vehicle detection method based on improved YOLOv3

Also Published As

Publication number Publication date
CN111401148A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
CN111401148B (en) Road multi-target detection method based on improved multi-stage YOLOv3
CN109447034B (en) Traffic sign detection method in automatic driving based on YOLOv3 network
CN110175613B (en) Streetscape image semantic segmentation method based on multi-scale features and codec model
CN110348384B (en) Small target vehicle attribute identification method based on feature fusion
CN111401410B (en) Traffic sign detection method based on improved cascade neural network
CN111783844B (en) Deep learning-based target detection model training method, device and storage medium
CN111310773A (en) Efficient license plate positioning method of convolutional neural network
CN113688652A (en) Method and device for processing abnormal driving behaviors
CN111553201A (en) Traffic light detection method based on YOLOv3 optimization algorithm
CN112016605A (en) Target detection method based on corner alignment and boundary matching of bounding box
CN111178451A (en) License plate detection method based on YOLOv3 network
CN113076804B (en) Target detection method, device and system based on YOLOv4 improved algorithm
JP7373624B2 (en) Method and apparatus for fine-grained image classification based on scores of image blocks
CN113239753A (en) Improved traffic sign detection and identification method based on YOLOv4
CN112766170B (en) Self-adaptive segmentation detection method and device based on cluster unmanned aerial vehicle image
CN114612883A (en) Forward vehicle distance detection method based on cascade SSD and monocular depth estimation
CN116824543A (en) Automatic driving target detection method based on OD-YOLO
Cai et al. Vehicle Detection Based on Deep Dual‐Vehicle Deformable Part Models
CN117152625A (en) Remote sensing small target identification method, system, equipment and medium based on CoordConv and Yolov5
CN115690549A (en) Target detection method for realizing multi-dimensional feature fusion based on parallel interaction architecture model
CN117975218A (en) Small target detection method based on mixed attention and feature centralized multi-scale fusion
Liang et al. Car detection and classification using cascade model
CN117975418A (en) Traffic sign detection method based on improved RT-DETR
Alam et al. Faster RCNN based robust vehicle detection algorithm for identifying and classifying vehicles
Song et al. Sign-YOLO: a novel lightweight detection model for Chinese traffic sign

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant