CN111738056B - Heavy truck blind area target detection method based on improved YOLO v3 - Google Patents

Heavy truck blind area target detection method based on improved YOLO v3

Info

Publication number
CN111738056B
CN111738056B (application CN202010344037.1A)
Authority
CN
China
Prior art keywords
detection
yolo
sample data
feature
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010344037.1A
Other languages
Chinese (zh)
Other versions
CN111738056A (en
Inventor
朱仲杰
屠仁伟
白永强
王玉儿
杨跃平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Wanli University
Original Assignee
Zhejiang Wanli University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Wanli University filed Critical Zhejiang Wanli University
Priority to CN202010344037.1A priority Critical patent/CN111738056B/en
Publication of CN111738056A publication Critical patent/CN111738056A/en
Application granted granted Critical
Publication of CN111738056B publication Critical patent/CN111738056B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a heavy truck blind area target detection method based on improved YOLO v3, comprising the following steps: collecting mixed pictures of vehicles, persons in a falling state and persons in a normal state under real road conditions, and establishing a sample data set; after preprocessing, carrying out category calibration and position information extraction on the detection targets in the sample data set, and dividing it into a training set and a test set; performing cluster analysis on the training set and selecting anchor values; improving and optimizing the network structure; setting training parameters and training the optimized network with the training set to obtain a target detection model; and inputting the monitored picture into the target detection model to obtain real-time blind area detection results. The optimized and improved YOLO v3 network has the advantages that the detection performance for medium and small objects is enhanced and the missed detections and false detections of the prior art network are reduced, so that the heavy truck driver can grasp in time the vehicles, persons in a falling state and persons in a normal state in the blind area environment around the vehicle, and traffic accidents are avoided.

Description

Heavy truck blind area target detection method based on improved YOLO v3
Technical Field
The invention relates to a target detection method, in particular to a heavy truck blind area target detection method based on improved YOLO v3.
Background
Heavy trucks play an important role in the development of the logistics industry, but because of the length of the truck and the height of the cab, the driver has a large blind area, so that the driver's field of view is limited and accurate judgments cannot be made in time. One current approach to the heavy truck blind area problem is to install a camera, but the driver is then required to identify and judge blind area targets manually; another combines a camera with a traditional algorithm to automatically identify a single type of target, but this is only suitable for simple detection backgrounds with a small number of detection targets; still another combines 360° panoramic detection with radar, but it still requires manual inspection of obstructions and sometimes raises false alarms.
In recent years, target detection algorithms have made great breakthroughs. YOLO v3 uses a CNN network to perform detection, which greatly speeds up target detection and improves accuracy, and the existing YOLO v3 balances its detection performance across large, medium and small targets. However, when a large number of medium and small targets are actually detected, some missed detections and false detections still occur for medium and small targets, and when targets are framed, some detection boxes are positioned inaccurately and cannot frame the target completely.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a heavy truck blind area target detection method based on improved YOLO v3, which detects persons in a falling state and persons in a normal state in real time within the heavy truck blind area range, detects the targets accurately, and positions the detection boxes accurately.
The technical scheme adopted for solving the technical problems is as follows: a heavy truck blind area target detection method based on improved YOLO v3 comprises the following steps:
(1) collecting mixed pictures, mainly of medium and small sizes, of vehicles, persons in a falling state and persons in a normal state under real road conditions, establishing a sample data set, preprocessing the sample data set, performing category calibration and position information extraction on the detection targets in the sample data set, and dividing the sample data set into a training set and a test set;
(2) performing cluster analysis on the training set, and selecting an anchor value;
(3) improving the network structure of the original detection model to obtain an optimized YOLO v3 network;
(4) setting training parameters, and training the optimized YOLO v3 network by using the training set to obtain a target detection model;
(5) inputting the video monitored in real time within the blind area range of the heavy truck into the target detection model for detection;
(6) outputting detection results of vehicles, persons in a falling state and persons in a normal state within the blind area range of the heavy truck.
The specific method for performing category calibration and position information extraction on the detection targets in the sample data set in the step (1) is as follows:
a, selecting different light factors, different shooting angles, different road environments and different resolutions for the sample data set;
b, adjusting the image size of the training set in the sample data set to uniform pixels;
c, performing category calibration on the detection targets in the sample data set, using 0, 1 and 2 to represent a vehicle, a person in a falling state and a person in a normal state respectively;
d, extracting position information of the sample data set, and representing the detection target as a four-dimensional vector { x, y, w, h }; wherein: x represents the coordinate of the detection target in the x-axis direction, y represents the coordinate of the detection target in the y-axis direction, w represents the width of the detection target, and h represents the height of the detection target;
and e, generating a labeling file.
Sample data sets with different illumination, different vehicle shooting angles, different resolutions and different road environments and road conditions are selected, which satisfies the requirement of sample diversity, allows purposeful optimization, is of great significance for improving the detection robustness of the algorithm, and makes it possible to achieve a better detection effect with fewer training samples. The images of the training set are resized to identical dimensions to facilitate the convolution operations in the subsequent model training.
In the step (1), the sample data set is divided into an 80% training set and a 20% test set.
In the step (2), the training set is subjected to cluster analysis using the K-means algorithm, and different anchor values are obtained by setting different numbers of cluster centres k; IoU (Intersection over Union) is used as the clustering index, and by analysis of the Avg IoU (average IoU) the anchor values are set to {(12,26), (18,71), (31,43), (66,73), (35,151), (98,121), (61,260), (110,310), (238,212)}.
In the step (3), Darknet-53 is selected as the basic network for image feature extraction. YOLO v3 upsamples the deep information of a convolution layer and then splices it with shallower information through a concat function to realize feature fusion; 3 groups of feature information with different depths are fused to output 13×13, 26×26 and 52×52 feature maps, giving an FPN (feature pyramid network) structure. On this basis, splicing in shallow-layer information increases the amount of feature information: the 11th layer of Darknet-53 is spliced onto the 52×52 feature map to obtain an improved 52×52 feature map, whose feature information consists of three parts: feature information from the downsampled 11th layer of Darknet-53, feature information from the 36th layer of Darknet-53, and feature information upsampled from the 26×26 feature map; the 36th layer of Darknet-53 is spliced onto the 26×26 feature map to obtain an improved 26×26 feature map, whose feature information consists of three parts: feature information from the downsampled 36th layer of Darknet-53, feature information from the 61st layer of Darknet-53, and feature information upsampled from the 13×13 feature map. Splicing in shallow information increases the amount of feature information, enhances the fusion of global and local features, and improves the detection capability for medium and small targets.
The training parameters in the step (4) are set as follows: Batch is 512, Subdivision is 256, and Max batches is 12000.
Compared with the prior art, the invention has the following advantages: sample data sets with different illumination, different vehicle shooting angles, different resolutions and different road environments and road conditions are selected, which satisfies the requirement of sample diversity, allows purposeful optimization, is of great significance for improving the detection robustness of the algorithm, and makes it possible to achieve a better detection effect with fewer training samples. Whereas in the prior art the 13×13, 26×26 and 52×52 feature maps are used to detect large, medium and small targets respectively with balanced performance, the improved network structure of the invention applies feature enhancement to the 26×26 and 52×52 feature maps, improving the detection performance for medium and small targets, and accurate anchor values for the medium- and small-target data set are obtained through K-means clustering. Compared with the prior art, the invention compensates to a certain extent for the missed detections of the prior art, so that detection targets are detected more accurately and the detection boxes are also positioned accurately.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a line chart of Avg IoU corresponding to different k values in step (2) of the present invention;
FIG. 3 is a schematic diagram of the optimized YOLO v3 network structure in step (3) of the present invention;
FIG. 4 is a schematic diagram comparing the Loss and mAP of unimproved YOLO v3 with those of the YOLO v3 with the improved network structure, the left side being unimproved YOLO v3 and the right side being the improved YOLO v3;
FIG. 5 is a schematic diagram comparing the Loss and mAP of unimproved YOLO v3 with those of the YOLO v3 with the improved anchors, the left side being unimproved YOLO v3 and the right side being the improved YOLO v3;
FIG. 6 is a schematic diagram comparing the Loss and mAP of unimproved YOLO v3 with those of the overall improved YOLO v3, the left side being unimproved YOLO v3 and the right side being the overall improved YOLO v3;
FIG. 7a is a schematic picture of the result of detecting a first scene using the unmodified prior art;
FIG. 7b is a schematic image of the result of detecting a first scene using the method of the present invention;
FIG. 8a is a schematic picture of the result of detecting a second scene using the unmodified prior art;
FIG. 8b is a schematic image of the result of detecting a second scene using the method of the present invention;
FIG. 9a is a schematic picture of the result of detecting a third scene using the unmodified prior art;
FIG. 9b is a schematic image of the result of detecting a third scene using the method of the present invention;
FIG. 10a is a schematic picture of the result of detecting a fourth scene using the unmodified prior art;
FIG. 10b is a schematic image of the result of detecting a fourth scene using the method of the present invention;
FIG. 11a is a schematic picture of the result of detecting a fifth scene using the unmodified prior art;
FIG. 11b is a schematic image of the result of detecting a fifth scene using the method of the present invention;
FIG. 12a is a schematic picture of the result of detecting a sixth scene using the unmodified prior art;
FIG. 12b is a schematic image of the result of detecting a sixth scene using the method of the present invention;
FIG. 13a is a schematic picture of the result of detecting a seventh scene using the unmodified prior art;
FIG. 13b is a schematic image of the result of detecting a seventh scene using the method of the present invention;
FIG. 14a is a schematic picture of the result of detection of an eighth scene using the unmodified prior art;
fig. 14b is a schematic picture of the result of detection of an eighth scene using the method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the embodiments of the drawings.
A heavy truck blind area target detection method based on improved YOLO v3 comprises the following steps:
(1) collecting mixed pictures, mainly of medium and small sizes, of vehicles, persons in a falling state and persons in a normal state under real road conditions, establishing a sample data set, preprocessing the sample data set, performing category calibration and position information extraction on the detection targets in the sample data set, and dividing the sample data set into a training set and a test set:
the specific method for carrying out category calibration and position information extraction on the detection targets in the sample data set in the step comprises the following steps:
a, selecting different light factors, different shooting angles, different road environments and different resolutions for a sample data set;
b, resizing the images of the training set in the sample data set to a uniform 416×416 pixels by means of a program;
c, performing category calibration on the detection targets in the sample data set using the YOLO-Mark software of YOLO, with 0, 1 and 2 representing a vehicle, a person in a falling state and a person in a normal state respectively;
d, extracting position information from the sample data set, and representing the detection target as a four-dimensional vector { x, y, w, h }; wherein: x represents the coordinate of the detection target in the x-axis direction, y represents the coordinate of the detection target in the y-axis direction, w represents the width of the detection target, and h represents the height of the detection target;
and e, generating a labeling file.
The sample data set is divided into an 80% training set and a 20% test set.
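By way of illustration, the following minimal Python sketch covers steps b to e above together with the 80%/20% split; the directory layout, file names and helper names are assumptions made for the example. It writes Darknet/YOLO-style label files (class index followed by the box centre, width and height normalized to the image size), which is the format produced by YOLO-Mark; the class indices 0, 1 and 2 follow the calibration described above.

```python
import random
from pathlib import Path
from PIL import Image

# Assumed class indices, following the calibration above.
CLASSES = {"vehicle": 0, "fallen_person": 1, "normal_person": 2}

def resize_and_label(img_path, targets, out_dir, size=416):
    """Resize one picture to 416x416 and write a YOLO-style label file.

    `targets` is a list of (class_name, x, y, w, h) boxes in pixels of the
    original picture, where (x, y) is the box centre; each label line is
    "class cx cy w h" with all values normalized to [0, 1].
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    img = Image.open(img_path)
    orig_w, orig_h = img.size
    img.resize((size, size)).save(out_dir / Path(img_path).name)
    lines = [f"{CLASSES[c]} {x / orig_w:.6f} {y / orig_h:.6f} "
             f"{w / orig_w:.6f} {h / orig_h:.6f}"
             for c, x, y, w, h in targets]
    (out_dir / (Path(img_path).stem + ".txt")).write_text("\n".join(lines))

def split_dataset(image_paths, train_ratio=0.8, seed=0):
    """Randomly split the sample data set into an 80% training set and a 20% test set."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    cut = int(len(paths) * train_ratio)
    return paths[:cut], paths[cut:]
```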
(2) Cluster analysis is carried out on the training set, and an anchor value is selected:
Cluster analysis is performed on the training set using the K-means algorithm, and different anchor values are obtained by setting different numbers of cluster centres k; IoU is used as the clustering index, and the larger the Avg IoU, the more accurate the anchors, so the anchor values are obtained through analysis of the Avg IoU. The IoU formula is:
IoU = area(DetectionResult ∩ GroundTruth) / area(DetectionResult ∪ GroundTruth)
where DetectionResult represents the predicted bounding box and GroundTruth represents the actual bounding box; the larger the IoU value, the better the performance of the detector. IoU is 1 if the predicted and actual boxes completely overlap.
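For illustration, the IoU of two boxes can be computed as in the following sketch, which assumes the (x, y, w, h) convention with (x, y) being the box centre (an assumption made for the example):

```python
def iou(box_a, box_b):
    """Intersection over Union of two (cx, cy, w, h) boxes."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0   # 1.0 when the boxes coincide
```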
The clustering results of Table 1 were obtained according to the K-means method:
Table 1: K-means clustering results
K    Anchors
1    (47,86)
2    (20,39), (92,166)
3    (17,34), (63,106), (134,258)
4    (16,32), (55,79), (76,215), (199,231)
5    (14,28), (34,67), (81,98), (68,265), (198,228)
6    (14,28), (33,66), (76,95), (66,266), (149,156), (221,287)
7    (13,27), (29,53), (67,74), (39,179), (98,123), (85,297), (227,221)
8    (12,27), (30,43), (22,95), (65,73), (48,213), (97,122), (94,307), (230,216)
9    (12,26), (18,71), (31,43), (66,73), (35,151), (98,121), (61,260), (110,310), (238,212)
As shown in fig. 2, as the value of k increases, Avg IoU increases, rapidly at first and then more slowly, and finally tends to converge.
In summary, Avg IoU reaches its maximum of 66.91% at k = 9, so the invention selects the anchors at k = 9 as its anchor values, i.e. {(12,26), (18,71), (31,43), (66,73), (35,151), (98,121), (61,260), (110,310), (238,212)}.
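A minimal NumPy sketch of K-means anchor clustering in the spirit of the procedure above: ground-truth boxes are represented by width and height only, 1 − IoU is used as the distance, cluster centres are updated with the median, and the final Avg IoU is reported. This is a generic sketch under those assumptions, not the exact implementation of the invention, and the resulting anchors depend on the data and on initialization.

```python
import numpy as np

def wh_iou(wh, anchors):
    """IoU between boxes and anchors compared by width/height only (shared corner).

    wh: (N, 2) array of ground-truth widths/heights; anchors: (k, 2) array.
    Returns an (N, k) matrix of IoU values.
    """
    inter = (np.minimum(wh[:, None, 0], anchors[None, :, 0]) *
             np.minimum(wh[:, None, 1], anchors[None, :, 1]))
    union = (wh[:, None, 0] * wh[:, None, 1] +
             anchors[None, :, 0] * anchors[None, :, 1] - inter)
    return inter / union

def kmeans_anchors(wh, k=9, iters=300, seed=0):
    """Cluster widths/heights into k anchors using 1 - IoU as the distance."""
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), size=k, replace=False)].astype(float)
    for _ in range(iters):
        assign = wh_iou(wh, anchors).argmax(axis=1)          # nearest anchor per box
        new = np.array([np.median(wh[assign == i], axis=0) if np.any(assign == i)
                        else anchors[i] for i in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    avg_iou = wh_iou(wh, anchors).max(axis=1).mean()         # the Avg IoU index
    return anchors[np.argsort(anchors.prod(axis=1))], avg_iou
```

Using the median rather than the mean for the centre update makes the anchors less sensitive to outlier boxes; either choice is compatible with the Avg IoU criterion used above.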
(3) Improving the network structure of the original detection model to obtain an optimized YOLO v3 network:
According to the invention, Darknet-53 is selected as the basic network for image feature extraction. YOLO v3 upsamples the deep information of a convolution layer and splices it with shallower information through a concat function to realize feature fusion; 3 groups of feature information with different depths are fused to output 13×13, 26×26 and 52×52 feature maps, giving the FPN structure. On this basis, splicing in shallow-layer information increases the amount of feature information: the 11th layer of Darknet-53 is spliced onto the 52×52 feature map to obtain an improved 52×52 feature map, whose feature information consists of three parts: feature information from the downsampled 11th layer of Darknet-53, feature information from the 36th layer of Darknet-53, and feature information upsampled from the 26×26 feature map; the 36th layer of Darknet-53 is spliced onto the 26×26 feature map to obtain an improved 26×26 feature map, whose feature information consists of three parts: feature information from the downsampled 36th layer of Darknet-53, feature information from the 61st layer of Darknet-53, and feature information upsampled from the 13×13 feature map. The improved network structure is shown in fig. 3.
The output of the 11th layer of Darknet-53 is 104×104×128; after downsampling with a 3×3 convolution kernel, a sliding stride of 2 and 256 convolution kernels, the output is 52×52×256. The output of the 36th layer of Darknet-53 is 52×52×256, and the upsampled output from the 26×26 feature map is 52×52×128, so splicing the three together outputs 52×52×640.
Similarly, the output of the 36th layer of Darknet-53 is 52×52×256; after downsampling with a 3×3 convolution kernel, a sliding stride of 2 and 512 convolution kernels, the output is 26×26×512. The output of the 61st layer of Darknet-53 is 26×26×512, and the upsampled output from the 13×13 feature map is 26×26×256, so splicing the three together outputs 26×26×1280.
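The channel arithmetic of the two improved feature maps can be checked with the following PyTorch sketch; it is an illustration only (the invention is trained in the Darknet framework), and the random tensors stand in for the Darknet-53 layer outputs named above.

```python
import torch
import torch.nn as nn

# Stand-ins for the intermediate outputs (batch size 1, NCHW layout).
layer11 = torch.randn(1, 128, 104, 104)   # Darknet-53 layer 11: 104x104x128
layer36 = torch.randn(1, 256, 52, 52)     # Darknet-53 layer 36: 52x52x256
layer61 = torch.randn(1, 512, 26, 26)     # Darknet-53 layer 61: 26x26x512
from_26 = torch.randn(1, 128, 26, 26)     # route from the 26x26 branch (128 channels)
from_13 = torch.randn(1, 256, 13, 13)     # route from the 13x13 branch (256 channels)

down_11 = nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1)  # 3x3, stride 2, 256 kernels
down_36 = nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1)  # 3x3, stride 2, 512 kernels
up2 = nn.Upsample(scale_factor=2, mode="nearest")

# Improved 52x52 feature map: 256 + 256 + 128 = 640 channels.
feat_52 = torch.cat([down_11(layer11), layer36, up2(from_26)], dim=1)
# Improved 26x26 feature map: 512 + 512 + 256 = 1280 channels.
feat_26 = torch.cat([down_36(layer36), layer61, up2(from_13)], dim=1)

print(feat_52.shape)   # torch.Size([1, 640, 52, 52])
print(feat_26.shape)   # torch.Size([1, 1280, 26, 26])
```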
(4) Setting training parameters, and training the optimized YOLO v3 network by using a training set to obtain a target detection model:
The experimental development environment selected for the invention is as follows. CPU: Intel i9 9920X, 3.5 GHz; GPU: NVIDIA GeForce RTX 2080 Ti, 11 GB; RAM: 16 GB; deep learning network framework: Darknet-53. The invention sets 3 kinds of detection targets, with 4000 iterations per class, i.e. 12000 iterations in total. A model is saved every 1000 iterations, so 12 models are generated by the end of training. The input resolution is set to 416×416 and multi-scale training is turned on. The learning rate determines the update speed of the weights: if it is set too large the result overshoots the optimum, and if too small the descent is too slow, so a dynamically changing learning rate is set to obtain a better target detection model. When 0 < iteration < 9600, lr (learning rate) = 0.001; when 9600 < iteration < 10800, lr = 0.0001; when 10800 < iteration < 12000, lr = 0.00001; the learning rate thus decays by a factor of 100 over the whole training process. The key training parameter settings of the unimproved and improved YOLO v3 networks are shown in table 2:
table 2 key training parameter settings
In summary, the training parameters of the present invention are set as follows: batch is 512, subdivision is 256, and Max batches is 12000.
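For illustration, the dynamically changing learning rate described above reduces to a simple piecewise-constant schedule (a sketch; in the Darknet framework this is normally configured through the steps/scales entries of the .cfg file rather than in code):

```python
def learning_rate(iteration: int) -> float:
    """Piecewise-constant schedule: 1e-3 up to 9600, 1e-4 up to 10800, then 1e-5."""
    if iteration < 9600:
        return 1e-3
    if iteration < 10800:
        return 1e-4
    return 1e-5   # held until Max batches = 12000; total decay factor of 100
```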
(5) Inputting the video monitored in real time within the blind area range of the heavy truck into the target detection model for detection: the real-time video picture can be obtained by monitoring the blind area range of the heavy truck with electronic equipment such as a camera, and is input into the trained target detection model.
(6) Outputting detection results of vehicles, persons in a falling state and persons in a normal state within the blind area range of the heavy truck: the above steps yield a picture in which each detected target is framed by a detection box, with the name of the detected target displayed at the upper left corner of the box.
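A minimal sketch of steps (5) and (6) using the dnn module of OpenCV to run a trained Darknet model on a camera stream; the configuration/weight file names, camera index and thresholds are assumptions, and the drawing logic is reduced to the essentials.

```python
import cv2

# Assumed names of the trained model files produced by Darknet training.
net = cv2.dnn.readNetFromDarknet("yolov3-improved.cfg", "yolov3-improved.weights")
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255.0, swapRB=True)

class_names = ["vehicle", "fallen person", "normal person"]   # indices 0, 1, 2
cap = cv2.VideoCapture(0)                                      # blind-area camera

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    classes, scores, boxes = model.detect(frame, confThreshold=0.5, nmsThreshold=0.4)
    for cls, score, box in zip(classes, scores, boxes):
        x, y, w, h = box
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        # Name of the detected target at the upper left corner of the box.
        cv2.putText(frame, class_names[int(cls)], (x, y - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    cv2.imshow("blind area detection", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```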
In order to further verify the advantages of the method of the invention, the performance of the target detection model is evaluated on the test set; the experimental results are as follows:
Precision is recorded as the proportion of true targets remaining among all targets framed by the detector after false detections are removed; Recall is the proportion of correctly detected targets among all true targets in the test set:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
where TP is the number of positive samples correctly detected as positive by the model, i.e. the number of samples correctly detected as a vehicle, a person in a falling state or a person in a normal state; FP is the number of negative samples wrongly detected as positive by the model, i.e. the number of samples wrongly detected as a vehicle, a person in a falling state or a person in a normal state; and FN is the number of positive samples wrongly detected as negative by the model, i.e. the number of samples calibrated as a vehicle, a person in a falling state or a person in a normal state that are not detected. The closer Precision and Recall are to 1, the better the target detection model.
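The two indices follow directly from the counts defined above, as in the trivial sketch below (in practice TP, FP and FN are obtained by matching detections to ground-truth boxes with an IoU threshold, which is omitted here):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Example: 95 correct detections, 5 false detections, 5 missed targets.
print(precision_recall(95, 5, 5))   # (0.95, 0.95)
```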
Table 3 lists the key metrics, including Precision, Recall and mAP, from 7000 to 12000 iterations, comparing results before and after the network structure improvement at different iteration counts. The experimental data show that, with Precision and Recall essentially unchanged, the YOLO v3 with the improved network structure has a higher mAP. In the experimental result at 7000 iterations, the YOLO v3 with the improved network structure has a higher Recall value, 21% higher than that of unimproved YOLO v3, and a 7.5% higher mAP. The best experimental result is obtained at 12000 iterations: the YOLO v3 with the improved network structure and the unimproved YOLO v3 obtain the same Recall, but Precision is improved from 93% to 95% and mAP from 85.03% to 87.24%. As can be seen from the mAP curve of fig. 4, the mAP of YOLO v3 after the network structure improvement is significantly larger than that of unimproved YOLO v3. Thus, the feature enhancement of the network structure improves the feature extraction efficiency of the target detection model to a certain extent.
Table 3: Comparison before and after the network structure improvement at different iteration counts
Table 4: Comparison of the various detection targets before and after the network structure improvement
As can be seen from Table 4, the YOLO v3 with the improved network structure has a higher mAP, and the AP values of all three detection targets are larger than those of unimproved YOLO v3, indicating that the feature enhancement improves the performance of the target detection model as a whole rather than only the detection capability for a single detection target.
Table 5 shows the comparison at different iteration counts before and after the anchors were improved. From 10000 iterations onward, the mAP of the improved YOLO v3 is greater than that of unimproved YOLO v3 while Precision and Recall remain essentially the same as those of unimproved YOLO v3. The best experimental result is obtained at 12000 iterations, where mAP = 86.31% for YOLO v3 after the anchor improvement, 1.28% higher than the mAP of unimproved YOLO v3. The anchors of unimproved YOLO v3 were obtained by clustering on the public COCO data. As can be seen from the mAP curve of fig. 5, the mAP of YOLO v3 after the anchor improvement is significantly larger than the mAP before the improvement.
Table 5: Comparison before and after the anchor improvement at different iteration counts
Table 6: Comparison of the various detection targets before and after the anchor improvement
As shown in Table 6, the anchor improvement significantly improves the detection ability for the detection target of a person in a falling state: AP = 89.26% for unimproved YOLO v3 versus AP = 95.80% for YOLO v3 after the anchor improvement, while the AP values of the other two detection targets are essentially the same. Thus, the anchor improvement has an obvious boosting effect on the localization and detection of persons in a falling state in the invention.
Table 7 shows the comparison at different iteration counts before and after the overall improvement, i.e. the joint improvement of the network structure and the anchors, in which the neural network structure is optimized and new anchors are obtained by K-means clustering. From 7000 to 12000 iterations, Precision and Recall before and after the improvement are essentially the same, but the mAP of the overall improved YOLO v3 is clearly higher than that of unimproved YOLO v3. At 12000 iterations, the mAP of the overall improved YOLO v3 reaches 87.82%, 2.79% higher than the 85.03% mAP of the unmodified YOLO v3. As can be seen from the mAP curve of fig. 6, the mAP of the overall improved YOLO v3 is significantly larger than that of unimproved YOLO v3.
Table 7: Comparison before and after the overall improvement at different iteration counts
Table 8: Comparison of the various detection targets before and after the overall improvement
Table 9: Comparison of parameters before and after the overall improvement
Tables 8 and 9 show that the overall improved YOLO v3 has a significantly better detection ability for the detection target of a person in a falling state, and its mAP improvement is also significant, rising from 85.03% to 87.82%, whereas the size of the target detection model increases by only 7 MB and the total BFLOPS by only about 3.5. The overall improved YOLO v3 therefore still has real-time detection performance, with a detection speed of 13.792 ms/frame (about 72 frames per second). The overall improvement greatly improves the detection performance of the whole target detection model.
As shown in fig. 7a to 14b, figs. 7a, 8a, 9a, 10a, 11a, 12a, 13a and 14a are schematic pictures of the results of detection using the prior art, and figs. 7b, 8b, 9b, 10b, 11b, 12b, 13b and 14b are schematic pictures of the results of detection using the method of the present invention. Comparison of the experimental results for the eight scenes shows that the prior art still produces some missed detections and false detections in actual detection. As shown in figs. 7b, 11b and 13b, the persons in a normal state and persons in a falling state that are not detected in figs. 7a, 11a and 13a are detected; as shown in fig. 8b, the vehicle and the person in a normal state missed at the detection position in fig. 8a are detected; in fig. 9a the umbrella in the picture is falsely detected as a person, whereas in fig. 9b it is not; and as shown in figs. 10b and 12b, the detection targets are framed more accurately and completely than by the detection boxes of figs. 10a and 12a.
In conclusion, the method of the invention is clearly superior to the prior art and has better detection capability; in particular, the detection capability for medium and small targets is significantly enhanced, the missed detections of the prior art are compensated for to a certain extent, detection is more accurate, and the detection boxes are positioned more precisely.

Claims (5)

1. The heavy truck blind area target detection method based on the improved YOLO v3 is characterized by comprising the following steps of:
(1) collecting mixed pictures, mainly of medium and small sizes, of vehicles, persons in a falling state and persons in a normal state under real road conditions, establishing a sample data set, preprocessing the sample data set, performing category calibration and position information extraction on the detection targets in the sample data set, and dividing the sample data set into a training set and a test set;
(2) performing cluster analysis on the training set, and selecting an anchor value;
(3) and improving the network structure of the original detection model to obtain an optimized YOLO v3 network:
selecting Darknet-53 as the basic network for image feature extraction, wherein YOLO v3 upsamples the deep information of a convolution layer and splices it with shallower information through a concat function to realize feature fusion, and 3 groups of feature information with different depths are fused to output 13×13, 26×26 and 52×52 feature maps, giving an FPN structure; on this basis, splicing in shallow-layer information increases the amount of feature information: the 11th layer of Darknet-53 is spliced onto the 52×52 feature map to obtain an improved 52×52 feature map, whose feature information consists of three parts: feature information from the downsampled 11th layer of Darknet-53, feature information from the 36th layer of Darknet-53, and feature information upsampled from the 26×26 feature map; the 36th layer of Darknet-53 is spliced onto the 26×26 feature map to obtain an improved 26×26 feature map, whose feature information consists of three parts: feature information from the downsampled 36th layer of Darknet-53, feature information from the 61st layer of Darknet-53, and feature information upsampled from the 13×13 feature map;
(4) setting training parameters, and training the optimized YOLO v3 network by using the training set to obtain a target detection model;
(5) inputting the video monitored in real time within the blind area range of the heavy truck into the target detection model for detection;
(6) outputting detection results of vehicles, persons in a falling state and persons in a normal state within the blind area range of the heavy truck.
2. The heavy truck blind area target detection method based on improved YOLO v3 according to claim 1, wherein the specific method for performing category calibration and position information extraction on the detection target in the sample data set in step (1) is as follows:
a, selecting different light factors, different shooting angles, different road environments and different resolutions for the sample data set;
b, adjusting the image size of the training set in the sample data set to uniform pixels;
c, performing category calibration on the detection targets in the sample data set, using 0, 1 and 2 to represent a vehicle, a person in a falling state and a person in a normal state respectively;
d, extracting position information of the sample data set, and representing the detection target as a four-dimensional vector { x, y, w, h }; wherein: x represents the coordinates of the detection target in the x-axis direction, y represents the coordinates of the detection target in the y-axis direction, w represents the width of the detection target, and h represents the height of the detection target;
and e, generating a labeling file.
3. The heavy truck blind area target detection method based on improved YOLO v3 according to claim 1, wherein the sample data set is divided into 80% training set and 20% test set.
4. The heavy truck blind area target detection method based on improved YOLO v3 according to claim 1, wherein in the step (2), the K-means algorithm is used to perform cluster analysis on the training set, and different anchor values are obtained by setting different numbers of cluster centres k; IoU is used as the clustering index, and by analysis of the Avg IoU the anchor values are set to {(12,26), (18,71), (31,43), (66,73), (35,151), (98,121), (61,260), (110,310), (238,212)}.
5. The heavy truck blind area target detection method based on improved YOLO v3 according to claim 1, wherein the training parameters in the step (4) are set as follows: batch is 512, subdivision is 256, and Max batches is 12000.
CN202010344037.1A 2020-04-27 2020-04-27 Heavy truck blind area target detection method based on improved YOLO v3 Active CN111738056B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010344037.1A CN111738056B (en) 2020-04-27 2020-04-27 Heavy truck blind area target detection method based on improved YOLO v3

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010344037.1A CN111738056B (en) 2020-04-27 2020-04-27 Heavy truck blind area target detection method based on improved YOLO v3

Publications (2)

Publication Number Publication Date
CN111738056A CN111738056A (en) 2020-10-02
CN111738056B true CN111738056B (en) 2023-11-03

Family

ID=72646899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010344037.1A Active CN111738056B (en) 2020-04-27 2020-04-27 Heavy truck blind area target detection method based on improved YOLO v3

Country Status (1)

Country Link
CN (1) CN111738056B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469953B (en) * 2021-06-10 2022-06-14 南昌大学 Transmission line insulator defect detection method based on improved YOLOv4 algorithm
CN113591575A (en) * 2021-06-29 2021-11-02 北京航天自动控制研究所 Target detection method based on improved YOLO v3 network
CN114373121A (en) * 2021-09-08 2022-04-19 武汉众智数字技术有限公司 Method and system for improving small target detection of yolov5 network
CN113989763B (en) * 2021-12-30 2022-04-15 江西省云眼大视界科技有限公司 Video structured analysis method and analysis system
CN114782923B (en) * 2022-05-07 2024-05-03 厦门瑞为信息技术有限公司 Detection system for dead zone of vehicle
CN115775381B (en) * 2022-12-15 2023-10-20 华洋通信科技股份有限公司 Mine electric locomotive road condition identification method under uneven illumination

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105336207A (en) * 2015-12-04 2016-02-17 黄左宁 Vehicle recorder and public security comprehensive monitoring system
CN109584558A (en) * 2018-12-17 2019-04-05 长安大学 A kind of traffic flow statistics method towards Optimization Control for Urban Traffic Signals
CN109684803A (en) * 2018-12-19 2019-04-26 西安电子科技大学 Man-machine verification method based on gesture sliding
CN109829429A (en) * 2019-01-31 2019-05-31 福州大学 Security protection sensitive articles detection method under monitoring scene based on YOLOv3
CN110210452A (en) * 2019-06-14 2019-09-06 东北大学 It is a kind of based on improve tiny-yolov3 mine truck environment under object detection method
CN110232406A (en) * 2019-05-28 2019-09-13 厦门大学 A kind of liquid crystal display panel CF image identification method based on statistical learning
CN110356325A (en) * 2019-09-04 2019-10-22 魔视智能科技(上海)有限公司 A kind of urban transportation passenger stock blind area early warning system
AU2019101133A4 (en) * 2019-09-30 2019-10-31 Bo, Yaxin MISS Fast vehicle detection using augmented dataset based on RetinaNet
CN110766098A (en) * 2019-11-07 2020-02-07 中国石油大学(华东) Traffic scene small target detection method based on improved YOLOv3
CN110807496A (en) * 2019-11-12 2020-02-18 智慧视通(杭州)科技发展有限公司 Dense target detection method
CN110889324A (en) * 2019-10-12 2020-03-17 南京航空航天大学 Thermal infrared image target identification method based on YOLO V3 terminal-oriented guidance

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10740607B2 (en) * 2017-08-18 2020-08-11 Autel Robotics Co., Ltd. Method for determining target through intelligent following of unmanned aerial vehicle, unmanned aerial vehicle and remote control

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105336207A (en) * 2015-12-04 2016-02-17 黄左宁 Vehicle recorder and public security comprehensive monitoring system
CN109584558A (en) * 2018-12-17 2019-04-05 长安大学 A kind of traffic flow statistics method towards Optimization Control for Urban Traffic Signals
CN109684803A (en) * 2018-12-19 2019-04-26 西安电子科技大学 Man-machine verification method based on gesture sliding
CN109829429A (en) * 2019-01-31 2019-05-31 福州大学 Security protection sensitive articles detection method under monitoring scene based on YOLOv3
CN110232406A (en) * 2019-05-28 2019-09-13 厦门大学 A kind of liquid crystal display panel CF image identification method based on statistical learning
CN110210452A (en) * 2019-06-14 2019-09-06 东北大学 It is a kind of based on improve tiny-yolov3 mine truck environment under object detection method
CN110356325A (en) * 2019-09-04 2019-10-22 魔视智能科技(上海)有限公司 A kind of urban transportation passenger stock blind area early warning system
AU2019101133A4 (en) * 2019-09-30 2019-10-31 Bo, Yaxin MISS Fast vehicle detection using augmented dataset based on RetinaNet
CN110889324A (en) * 2019-10-12 2020-03-17 南京航空航天大学 Thermal infrared image target identification method based on YOLO V3 terminal-oriented guidance
CN110766098A (en) * 2019-11-07 2020-02-07 中国石油大学(华东) Traffic scene small target detection method based on improved YOLOv3
CN110807496A (en) * 2019-11-12 2020-02-18 智慧视通(杭州)科技发展有限公司 Dense target detection method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Detection of Infrared Small Targets Using Feature Fusion Convolutional Network; KAIDI WANG et al.; IEEE Access; Vol. 7; pp. 146081-146092 *
Aerial-to-ground small target detection based on deep learning; Liang Hua et al.; Chinese Journal of Liquid Crystals and Displays; Vol. 33, No. 9; pp. 793-800 *
Research on small target detection algorithms for aerial images based on deep neural networks; Zhang Mintong; China Master's Theses Full-text Database, Information Science and Technology; No. 2; pp. I138-1566 *
Research on vehicle detection and tracking algorithms for blind areas; Liu Haiyang; China Master's Theses Full-text Database, Engineering Science and Technology II; No. 7; pp. C035-144 *

Also Published As

Publication number Publication date
CN111738056A (en) 2020-10-02

Similar Documents

Publication Publication Date Title
CN111738056B (en) Heavy truck blind area target detection method based on improved YOLO v3
CN111444809B (en) Power transmission line abnormal target detection method based on improved YOLOv3
CN112199993B (en) Method for identifying transformer substation insulator infrared image detection model in any direction based on artificial intelligence
Akagic et al. Pothole detection: An efficient vision based method using rgb color space image segmentation
CN109345547B (en) Traffic lane line detection method and device based on deep learning multitask network
CN111611861B (en) Image change detection method based on multi-scale feature association
CN114973002A (en) Improved YOLOv 5-based ear detection method
CN112330593A (en) Building surface crack detection method based on deep learning network
CN112634257B (en) Fungus fluorescence detection method
CN115272204A (en) Bearing surface scratch detection method based on machine vision
CN115995056A (en) Automatic bridge disease identification method based on deep learning
CN114596316A (en) Road image detail capturing method based on semantic segmentation
CN113313107A (en) Intelligent detection and identification method for multiple types of diseases on cable surface of cable-stayed bridge
CN110826364B (en) Library position identification method and device
CN114494845A (en) Artificial intelligence hidden danger troubleshooting system and method for construction project site
CN105787955A (en) Sparse segmentation method and device of strip steel defect
CN113762247A (en) Road crack automatic detection method based on significant instance segmentation algorithm
CN112257514B (en) Infrared vision intelligent detection shooting method for equipment fault inspection
CN115797314A (en) Part surface defect detection method, system, equipment and storage medium
CN114677670A (en) Automatic identification and positioning method for identity card tampering
CN110533698B (en) Foundation pit construction pile detection control method based on visual detection
CN117392043A (en) Steel plate surface defect video detection method and system based on deep learning
CN113963161A (en) System and method for segmenting and identifying X-ray image based on ResNet model feature embedding UNet
CN114694090A (en) Campus abnormal behavior detection method based on improved PBAS algorithm and YOLOv5
CN114648736B (en) Robust engineering vehicle identification method and system based on target detection

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant