CN112270252A

CN112270252A - Multi-vehicle target identification method for improving YOLOv2 model

Info

Publication number: CN112270252A
Application number: CN202011158555.0A
Authority: CN
Inventors: 李珣; 时斌斌; 聂婷婷; 张玥; 李林鹏; 贠鑫
Original assignee: Xian Polytechnic University
Current assignee: Xian Polytechnic University
Priority date: 2020-10-26
Filing date: 2020-10-26
Publication date: 2021-01-26

Abstract

The invention discloses a multi-vehicle target identification method for improving a YOLOv2 model, which comprises the steps of firstly collecting sample data in an actual traffic environment, and dividing the sample data into sample images of a training set and a test set according to a ratio of 7: 3; then, performing data enhancement on the sample images of the training set, including random scaling of the sample images and adjustment of exposure and saturation, so that the processed images are used as input of a training model; extracting the target region feature vector of the processed training set through an improved Darknet-19 network; inputting the training set into a Darknet-19 network model for training to obtain a detection and identification model; and finally, inputting the test set into the model for testing to obtain a result of multi-target vehicle identification. The invention solves the problems of low detection rate, poor robustness and unsatisfactory classification effect of the prior art aiming at the road vehicle multi-target detection and vehicle type classification method.

Description

Multi-vehicle target identification method for improving YOLOv2 model

Technical Field

The invention belongs to the technical field of image detection and classification, and particularly relates to a multi-vehicle target identification method for improving a YOLOv2 model.

Background

Image detection and image classification techniques are important components of image processing techniques, and are widely applied in many fields, such as remote sensing image identification, military criminal investigation, modern biomedicine, intelligent transportation and the like. However, the conventional target detection and identification method, such as a Cascade classifier based on Haar features, mainly aims at the detection of specific targets, is limited to multi-classified targets, and the region selection process of the targets is complex and the detection and identification efficiency is low. When an object is selected, the feature extraction has the defects of strong subjectivity, poor robustness, weak generalization capability and the like, and the accurate identification effect is difficult to achieve in practical application. Compared with the traditional method, the deep learning method has obvious advantages. Vehicle detection and identification technologies based on deep learning have become a current research trend.

The current detection algorithm based on deep learning is basically divided into three directions: the first scheme is a scheme of extracting candidate regions and classifying corresponding regions mainly by a deep learning method, such as: RCNN, SPP-net, Fast-RCNN, R-FCN, etc.; a regression method based on deep learning, such as a method of YOLO, SSD, etc.; and thirdly, RRC method combined with RNN algorithm and Deformable CNN method combined with DPM. Vehicle detection and other methods based on CNN, R-CNN and Fast-RCNN models cannot achieve the effect of real-time detection on detection precision and detection speed in practical application. YOLOv2 is a real-time object detection algorithm, follows the design concept of end-to-end training and real-time detection, can directly go from input images to detection output in the detection process, directly takes the confidence scores of the target position and the corresponding position as output, omits the step of generating a candidate frame, and greatly shortens the detection time. The detection speed of YOLO can reach 45Fps/s, but the detection and identification precision is slightly lower than that of fast-RCNN. In order to improve the detection and identification precision, the invention improves the network model on the basis of YOLOv2, and improves the detection and identification precision and the robustness of the algorithm while keeping the original speed.

Disclosure of Invention

The invention aims to provide a multi-vehicle target identification method for improving a YOLOv2 model, and solves the problems of low detection rate, poor robustness and unsatisfactory classification effect of the prior art for multi-target detection of road vehicles and vehicle type classification methods.

The technical scheme adopted by the invention is that the multi-vehicle target identification method for improving the YOLOv2 model is implemented according to the following steps:

step 1, collecting sample data in an actual traffic environment, and dividing the sample data into sample images of a training set and a test set according to a ratio of 7: 3;

step 2, performing data enhancement on the sample images of the training set, including random scaling of the sample images and adjustment of exposure and saturation, so that the processed images are used as input of a training model;

step 3, extracting the target region characteristic vector of the training set processed in the step 2 through an improved Darknet-19 network;

step 4, inputting the training set in the step 2 into a Darknet-19 network model for training to obtain a detection and identification model;

and 5, inputting the test set obtained in the step 2 into the model obtained in the step 4 for testing to obtain a multi-target vehicle identification result.

The present invention is also characterized in that,

the step 1 is as follows:

step 1.1, shooting vehicle information in a real-time road traffic environment, framing and extracting a shot video into an image format, and deleting a picture with poor image quality;

step 1.2, labeling vehicles in the selected pictures by using a LabImage labeling tool, framing out a target area, classifying the vehicles in the target area, and manufacturing labels, wherein the labels are car, bus, van and truck, each picture generates an xml file, and finally, randomly distributing the xml files by using Matlab to generate a training set, a test set and a verification set to form a complete data set;

step 1.3, the training set comprises 3 folders, namely indication, ImageSets and JPEGImages, wherein the XML files are stored in the folder indication, each XML file corresponds to an image, the position and the category information of each marked target are stored in each XML file, the names of the position and the category information are the same as those of the corresponding original image, a text file is stored in a Main folder under the ImageSets folder, the formats of the text file are train.txt and test.txt, the content in the ImageSets folder is the name of the image which needs to be used for training or testing, and the JPEGImages folder stores the original image which is named according to a uniform rule.

The step 3 is as follows:

step 3.1, respectively carrying out parameter adjustment on the convolution layer number, the pooling layer number, the BN layer number and the activation function in the Darknet-19 network to finally obtain an improved YOLOv2-S network, wherein the YOLOv2-S network comprises 20 convolution layers, 5 pooling layers and 20 batch normalization layers, namely the BN layer and the Leaky-Linear activation function;

step 3.2, extracting the feature vectors, which is specifically as follows:

(1) the 1 st, 3 rd, 5 th, 6 th, 7 th, 9 th, 10 th, 11 th, 13 th, 14 th, 15 th, 16 th, 17 th, 19 th, 20 th, 21 th, 22 th, 23 th, 24 th, 31 th layers are convolution layers, the 2 nd, 4 th, 8 th, 12 th, 18 th layers are maximum pooling layers, the 26 th, 29 th layers are route layers, and the 32 th layers are detection layers;

(2) the sizes of convolution kernels of

layers

1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 and 23 in the convolution layers are set to be 3 x 3, the depths of the convolution kernels are set to be 32, 64, 128, 256, 512, 1024 and 1024 respectively, the sizes of convolution kernels of layers 6, 10, 14, 16, 20, 22, 24 and 27 are set to be 1 x 1, and the depths of the convolution kernels are set to be 64, 128, 256, 512, 256, 1024 and 5030 respectively;

(3) the sizes of convolution kernels of layers 2, 4, 8, 12 and 18 in the maximum pooling layer are set to be 2 multiplied by 2, and the step size is set to be 2;

(4) the route layer is used for laminating, namely, the characteristics of a plurality of layers are fused to the next layer and output together, and the 29 th layer of the route layer combines the 28 layers and the 25 layers of the convolution layer together and outputs a characteristic vector.

The step 4 is as follows:

step 4.1, inputting a training set, wherein the process is as follows:

step 4.1.1, dividing the picture obtained in the step 2 into s multiplied by s unit cells, and if the central position of the target to be identified falls into one unit cell, enabling the corresponding unit cell to be responsible for detecting the target; then, directly predicting the position of each unit cell to generate the positions of the required B bounding boxes, wherein each bounding box obtains 5 predicted values: are respectively (t)_x,t_y)、(t_w,t_h) And a Confidence.

The offset of the center of each bounding box from the cell boundary where the bounding box is located is sigma (t)_x),σ(t_y)，(t_w,t_h) The actual width and height of the target relative to the proportional width and height of the whole image are shown, and the edge distance between the bounding box and the upper left corner of the image is (c)_x,c_y) The length and width of the bounding box corresponding to the cell are (p)_w,p_h) Where x represents the length of the cell, y represents the width of the cell, w represents the width of the bounding box, and h represents the height of the bounding box, the real position of the bounding box is as follows:

b_x＝σ(t_x)+c_x

b_y＝σ(t_y)+c_y

b_w＝p_we^tw

b_h＝p_he^th

the method comprises the following steps of representing the precision of the predicted position of a bounding box of the bounding box by the relation between the bounding box of the bounding box and the probability of a target to be detected and the IOU product of the bounding box and a real position, and specifically calculating the Confidence coefficient of a candidate frame as shown in the following formula:

wherein, truth represents the real value of IOU, pred represents the predicted value of IOU, Pr (object) represents the probability value of the object existing in the grid, if the object exists in one grid, the value of Pr (object) is 1; if no target object appears, the value of pr (object) is 0, that is, the value of Confidence is also 0;

step 4.1.2, clustering the real target frames of the targets to be recognized marked in the training set, obtaining the initial candidate frames of the predicted targets in the training set by using the area interaction ratio IOU value as an evaluation index, and inputting the initial candidate frames as initial parameters into a Yolov2-S network model, wherein the specific steps are as follows:

using K-means method, with distance formula d (box)_,centroid)＝1-IOU(box_,centroid) clustering the real target frame of the training data set; wherein, IOU (box) is the area interaction ratio of the predicted target frame and the real target frame, and IOU (box) is used for calculating the area interaction ratio of the predicted target frame and the real target frame_,centroid) as an initial target frame when the threshold value is not less than 0.5;

the area interaction ratio IOU (box, centroid) formula is shown as follows:

wherein, box_predRepresenting the area, box, of the predicted target frame_truthThe area of the real target frame is represented, and the proportion of the intersection and the union of the real target frame and the real target frame is the average interaction ratio of the real target frame and the initial candidate frame of the predicted initial target;

step 4.1.3, when an object exists in the grid, the object class needs to be predicted, a conditional probability Pre (class | object) is used for representing, and a value obtained by class prediction is multiplied by a Confidence of a candidate frame, so as to obtain a Confidence C (M) of a certain class M, as shown in the following formula:

step 4.2, 70000 times of iterative training is carried out on the training set obtained in the step 1 by using a Darknet-19 network, the network input of the model is set to be 416 multiplied by 416, the decade is set to be 0.0005, the momentum is set to be 0.9 and the learning rate is set to be 0.001, the training is stopped until the loss value output by the training data set is smaller than a certain threshold value Q or reaches the preset maximum iteration number N, and the trained YOLOv2-S network model is obtained;

the loss function loss (object) represents:

the loss function comprises a loss function, a first term and a fifth term, wherein the first term of the loss function is the coordinate loss of the anchor of the calculation prediction target, the third term of the loss function is the confidence loss of the anchor of the calculation prediction target, and the fifth term of the loss function is the category loss of the anchor of the calculation prediction target; the second term adds a limit in the hope of returning directly to its own anchor box, the fourth term only calculates anchor boxes that are below the IOU threshold, where,

error coefficients that are predicted coordinates;

as error coefficients not containing confidence in identifying the object, S²Representing the number of meshes into which the input image is divided; b represents the predicted target frame number of each grid;

an abscissa representing the center point of the predicted target,

An ordinate indicating the center point of the predicted target,

Width of center point representing predicted target,

Representing the height of the predicted center point of the target;

the ith grid in which the jth candidate box is positioned is shown to be responsible for detecting the object;

indicating that the ith grid in which the jth candidate box is positioned is not responsible for detecting the object;

an actual abscissa representing the center point of the target frame,

The actual ordinate representing the center of the target frame,

indicating the prediction confidence of the target existing in the ith mesh of the jth candidate box,

indicating the predicted probability value of the object in the ith grid of the jth candidate box belonging to a certain category,

in the ith grid representing the jth candidate frameTrue probability value that a target belongs to a certain category;

step 4.3, the training process specifically includes forward propagation and backward propagation, and the model is saved every 1000 iterations, the momentum adopted is 0.9, the optimization is performed by using random gradient descent, the initial learning rate is 0.001, the attenuation coefficient is set to 0.0005, the learning rate learning _ rate adopted in the previous 10000 iterations is 0.001, the learning rate adopted in 10000-45000 iterations is 0.0001, the subsequent learning rate is adjusted to 0.00001, and finally the network model for detection and identification is obtained.

The step 5 is as follows:

step 5.1, loading the network weight trained in the step 4, and inputting the test set obtained in the step 2 into the network trained in the step 4 to obtain a multi-scale feature map;

step 5.2, restraining and reserving the prior frame with the maximum confidence score according to the non-maximum value to obtain a finally identified detection frame and a classification result of the multi-target vehicle;

step 5.3, testing the existing Yolov2, Yolov2-voc and Yolov3 protogenic model network models by using the prepared data set;

and 5.4, evaluating the performance of the obtained model by utilizing the evaluation index Recall, Precision value Precision and F1 value, wherein the evaluation index Recall, Precision value Precision and F1 value are as follows:

wherein, Total represents the actual number of the bounding box, namely the number of the actual targets to be detected; correct represents the number of correctly detected bounding boxes, namely after a picture is put into a network, the network detects the number of redundant targets of the bounding boxes, each bounding box has a confidence probability, the bounding box with the probability larger than a set threshold value and the actual bounding box, namely the content of txt in labels, calculate the IOU, find out the bounding box with the largest IOU, and if the maximum value is larger than the preset threshold of the IOU, add 1 to the count value; the Proposal represents the number of the detected bounding boxes which are larger than a set threshold value; precision represents a Precision value; recall represents the Recall rate, which is the ratio of the number of the detected targets to the number of all targets in the verification set; f1 represents F1 Score, namely F1-Score, also called balanced F Score, which is defined as the harmonic mean of accuracy and recall ratio and considers the recall ratio and accuracy of the model, the value range is between 0 and 1, and the higher the F1 is, the better the effect is.

The method has the advantages that the method for identifying the multiple vehicle targets of the improved YOLOv2 model realizes the detection and identification of the multiple target vehicles in the end-to-end actual traffic scene, has higher accuracy and robustness compared with the traditional method, and can identify the multiple vehicle instance targets in the image sample at one time; the multi-target vehicle identification method provided by the invention is improved on the basis of a basic Darknet-19 network model, the operation speed is improved, and the identification accuracy rate of multi-target vehicles is improved. The invention provides an effective method for identifying the multi-target vehicles, and a large number of empirical experiments show that the method has strong robustness and better identification performance compared with the existing multi-target vehicle identification method.

Drawings

FIG. 1 is a general flow chart of a multi-vehicle object recognition method of the present invention that improves the YOLOv2 model;

FIG. 2 is a graph of regression target calculation in a multiple vehicle target identification method of the present invention with an improved YOLOv2 model;

FIG. 3 is a model structure diagram of a multi-vehicle object recognition method of the present invention with improved YOLOv2 model;

FIG. 4 is an experimental comparison of the multiple vehicle target recognition method of the present invention with an improved YOLOv2 model, wherein diagram (a) is the YOLOv2 model, diagram (b) is the YOLOv2-voc model, diagram (c) is the YOLOv3 model, and diagram (d) is the YOLOv2-voc _ mul model;

FIG. 5 is a partial experimental result display diagram of a multi-vehicle target identification method of the invention with an improved YOLOv2 model.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

The invention relates to a multi-vehicle target recognition method for improving a YOLOv2 model, which is implemented by the following steps in detail as shown in a flow chart shown in FIG. 1:

the step 1 is as follows:

the step 3 is as follows:

step 3.2, extracting the feature vectors, which is specifically as follows:

(2) the sizes of convolution kernels of

layers

As shown in fig. 2 to fig. 3, step 4, inputting the training set in step 2 into a Darknet-19 network model for training to obtain a model for detection and recognition;

the step 4 is as follows:

step 4.1, inputting a training set, wherein the process is as follows:

b_x＝σ(t_x)+c_x

b_y＝σ(t_y)+c_y

b_w＝p_we^tw

b_h＝p_he^th

the area interaction ratio IOU (box, centroid) formula is shown as follows:

step 4.2, (use Darknet-19 network to carry on 70000 times of iterative training on the training set that step 1 gets, set the network input of the model as 416 x 416, adopt the gradient descent algorithm, set decade as 0.0005, momentum as 0.9, learning rate as 0.001, stop training after the loss value that the training data set outputs is smaller than certain threshold value Q or reaches the maximum iteration number N that is set up in advance, get the good YOLOv2-S network model of training;

the loss function loss (object) represents:

error coefficients that are predicted coordinates;

an abscissa representing the center point of the predicted target,

An ordinate indicating the center point of the predicted target,

Width of center point representing predicted target,

Representing the height of the predicted center point of the target;

an actual abscissa representing the center point of the target frame,

The actual ordinate representing the center of the target frame,

representing the real probability value of the target in the ith grid of the jth candidate box belonging to a certain category;

And 5, inputting the test set in the step 2 into the model obtained in the step 4 for testing to obtain a multi-target vehicle identification result.

The step 5 is as follows:

FIG. 4 is the verification results of 4 training sets and verification sets of models, wherein (a) is the Yolov2 model, (b) is the Yolov2-voc model, (c) is the Yolov3 model, and (d) is the Yolov2-voc _ mul model, and it can be seen that: the recall rates of the 4 models are greatly fluctuated initially, but when the number of detected targets is increased, the recall rate of the YOLOv2 model is gradually stabilized at 96%, YOLOv2-voc tends to 94.5%, the recall rate of the improved YOLOv2-voc _ mul model is stabilized at 95.5%, and the recall rate of the YOLOv3 model is fluctuated between 40% and 60%, which shows that the 3 models of the YOLOv2 can ensure good accuracy in a simple background, and the accuracy of the YOLOv3 is low; the accuracy curve of the YOLOv2 model has large fluctuation, the accuracy curve of the YOLOv2-voc model has jump when the target number increases and is gradually stabilized at 98.6% after the jump, the improved YOLOv2-voc _ mul model is stabilized to about 99.2% after the jump is relatively small, the good accuracy and stability are kept, the accuracy curve of the YOLOv3 model has large jump, and the final accuracy value fluctuates about 60%; meanwhile, the intersection ratio of the YOLOv2 model fluctuates between 0.75 and 0.83 by comparing the intersection ratio curves of the 4 models, namely the stability of the detection number is low, the intersection ratio of the YOLOv2-voc model and the YOLOv2-voc _ mul model is improved by comparing the YOLOv2 and can be kept between 0.8 and 0.83, and the curve change shows that the intersection ratio of the YOLOv2-voc _ mul is similar to that of the YOLOv2-voc model and fluctuates up and down at 0.83 when the target number is increased, and the intersection ratio of the YOLOv3 fluctuates only between 0.4 and 0.7 and is the worst in stability compared with the other 3 models.

FIG. 5 is a graph of the partial detection results of the YOLOv2-voc _ mul model. From the test results, the categories of different vehicles are tested and accurately defined as car, bus, van, truck.

Table 1 shows an evaluation index table of the multi-vehicle target identification method of the present invention, which improves the YOLOv2 model.

TABLE 1 evaluation index Table

Model	Total	Correct	Proposal	Precision(％)	Recall(％)	F₁(％)
							YOLOv2	154	147	152	96.71	95.45	96.07
YOLOv2-voc	154	143	147	97.28	92.86	95.01
							YOLOv3	154	84	151	55.63	54.55	55.08
YOLOv2-S	154	146	148	98.62	94.81	96.67

Claims

1. A multi-vehicle target identification method for improving a YOLOv2 model is characterized by comprising the following steps:

2. The method for identifying multiple vehicle targets based on an improved YOLOv2 model according to claim 1, wherein the step 1 is as follows:

3. The method for identifying multiple vehicle targets based on an improved YOLOv2 model according to claim 2, wherein the step 3 is as follows:

step 3.2, extracting the feature vectors, which is specifically as follows:

(2) the sizes of convolution kernels of layers 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 and 23 in the convolution layers are set to be 3 x 3, the depths of the convolution kernels are set to be 32, 64, 128, 256, 512, 1024 and 1024 respectively, the sizes of convolution kernels of layers 6, 10, 14, 16, 20, 22, 24 and 27 are set to be 1 x 1, and the depths of the convolution kernels are set to be 64, 128, 256, 512, 256, 1024 and 5030 respectively;

4. The method for identifying multiple vehicle targets based on the improved YOLOv2 model of claim 3, wherein the step 4 is as follows:

step 4.1, inputting a training set, wherein the process is as follows:

step 4.1.1, dividing the picture obtained in the step 2 into s multiplied by s unit cells, and if the central position of the target to be identified falls into one unit cell, enabling the corresponding unit cell to be responsible for detecting the target; then, directly predicting the position of each unit cell to generate the positions of the required B bounding boxes, wherein each bounding box obtains 5 predicted values: are respectively (t)_x,t_y)、(t_w,t_h) And a Confidence;

b_x＝σ(t_x)+c_x

b_y＝σ(t_y)+c_y

b_w＝p_we^tw

b_h＝p_he^th

using K-means method, with distance formula d (box)_,centroid)＝1-IOU(box_,centroid) clustering the real target frame of the training data set; wherein, IOU (box) is the area interaction ratio of the predicted target frame and the real target frame, and IOU (box) is used for calculating the area interaction ratio of the predicted target frame and the real target frame_,centroid) as the threshold valueThe candidate frame predicted at 0.5 is taken as an initial target frame;

the area interaction ratio IOU (box, centroid) formula is shown as follows:

the loss function loss (object) represents:

wherein the first term of the loss function is the coordinate loss of the anchor of the calculated prediction target, the third term is the confidence loss of the anchor of the calculated prediction target,the fifth term is to calculate the category loss of the anchor of the predicted target; the second term adds a limit in the hope of returning directly to its own anchor box, the fourth term only calculates anchor boxes that are below the IOU threshold, where,