CN112836639A - Pedestrian multi-target tracking video identification method based on improved YOLOv3 model - Google Patents
- Publication number
- CN112836639A CN112836639A CN202110151278.9A CN202110151278A CN112836639A CN 112836639 A CN112836639 A CN 112836639A CN 202110151278 A CN202110151278 A CN 202110151278A CN 112836639 A CN112836639 A CN 112836639A
- Authority
- CN
- China
- Legal status: Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/103—Static body considered as a whole, e.g. static pedestrian or occupant recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/277—Analysis of motion involving stochastic approaches, e.g. using Kalman filters
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
Abstract
A pedestrian multi-target tracking video identification method based on an improved YOLOv3 model belongs to the field of image processing in computer vision. In the YOLOv3 network, the original standard convolutions in the Darknet-53 feature extraction layer are replaced with depthwise separable convolutions, and an SENet module is introduced into the prediction layer of the YOLOv3 network. The target boxes in the selected data set are clustered with the K-means++ clustering algorithm, the prior-box parameters of the network are optimized according to the clustering result, and the anchor boxes are corrected. The invention adopts a tracking-by-detection framework in which the improved YOLOv3 algorithm performs target detection and the tracking part uses the Deep-SORT algorithm, so that the overall algorithm effectively reduces missed detections and occlusion failures while maintaining a high detection speed and a good tracking effect.
Description
Technical Field
The invention belongs to the field of image processing in computer vision, and specifically relates to a method that improves the YOLOv3 network structure to address the high missed-detection rate and low detection speed of pedestrian targets in multi-target tracking, thereby improving the model's detection precision and speed on pedestrian targets. The detection part detects pedestrian targets with the improved YOLOv3 algorithm, the tracking part predicts target motion trajectories with a Kalman filter, and the data-association part matches targets with the Hungarian algorithm.
Background
With the rapid development of deep learning, convolutional neural networks have gradually shown advantages over traditional hand-crafted features, and deep neural networks have demonstrated excellent performance in machine vision, attracting wide attention from researchers. Pedestrians are a vulnerable group in the road traffic environment and face frequent safety problems, so building a reliable pedestrian detection system has become a research hotspot; applying deep learning to driver-assistance systems is likewise becoming a trend. This work studies deep-learning-based target detection and tracking algorithms with road pedestrians as the research object.
In recent years, detection-based multi-target tracking has gradually become the mainstream approach in the multi-target tracking field, but it places high demands on detection accuracy: a complex background strongly affects target detection and, in turn, the tracking result. Even the advanced YOLOv3 algorithm suffers from limited detection precision and detection speed, and effectively establishing the target model between detector and tracker is equally important. Providing a pedestrian detection and tracking algorithm with higher detection accuracy and higher detection speed is therefore a problem to be solved by those skilled in the art.
Disclosure of Invention
In order to improve the detection precision and speed of pedestrian multi-target tracking, the invention provides a pedestrian multi-target tracking video identification method based on an improved YOLOv3 network model. Building on the YOLOv3 network model and the Deep-SORT algorithm, and aiming at occlusion and missed detection in target detection and tracking, the prior boxes are optimized with the K-means++ clustering method and an SENet module is embedded into the YOLOv3 prediction layer; aiming at the low detection speed of the algorithm, the standard convolutions of the YOLOv3 network are replaced with depthwise separable convolutions for feature extraction. A classical tracking-by-detection framework is selected: the detection part uses the improved YOLOv3 algorithm to detect target information, and the tracking part uses the Deep-SORT algorithm.
The technical scheme adopted by the invention is as follows:
the pedestrian multi-target tracking video identification method based on the improved YOLOv3 model comprises the following steps:
step 1: pedestrian detection part: improve the YOLOv3 target detection network by introducing a depthwise separable convolution module to replace the standard convolution module in the Darknet-53 feature extraction layer, and by introducing an SENet module added to the YOLO prediction layer;
step 2: select a data set containing pedestrian images from public data sets, perform cluster analysis on the data set labels with the K-means++ clustering algorithm instead of the K-means clustering algorithm, and train the pedestrian detection YOLOv3 network model;
step 3: multi-target tracking part: perform target detection with the trained pedestrian detection YOLOv3 network model, and perform multi-target tracking of pedestrians in combination with the Deep-SORT algorithm;
the step1 is further specifically as follows:
step 1.1: a depthwise separable convolution module is introduced into the Darknet-53 feature extraction layer to replace the standard convolution module in the original Darknet-53; the depthwise separable convolution considers channels and spatial regions separately, decomposing the standard convolution into a depthwise convolution and a pointwise convolution: the depthwise convolution first applies a 3 × 3 convolution to each single channel of the feature map to collect per-channel features, and the pointwise convolution then applies a 1 × 1 convolution to the depthwise output to collect the features at each point;
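The depthwise-plus-pointwise decomposition described in step 1.1 can be sketched in a few lines of NumPy (illustrative only; toy shapes, not the Darknet-53 implementation). For a 3-channel input mapped to 16 channels with 3 × 3 kernels, a standard convolution needs 16 × 3 × 3 × 3 = 432 weights, while the pair below needs only 3 × 3 × 3 + 16 × 3 = 75:

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_kernels):
    """Sketch of a depthwise separable convolution (stride 1, 'valid' padding).

    x           -- input feature map, shape (C_in, H, W)
    dw_kernels  -- one 3x3 kernel per input channel, shape (C_in, 3, 3)
    pw_kernels  -- pointwise 1x1 kernels, shape (C_out, C_in)
    """
    c_in, h, w = x.shape
    k = dw_kernels.shape[-1]
    oh, ow = h - k + 1, w - k + 1

    # Depthwise step: each channel is convolved with its own 3x3 kernel.
    dw_out = np.zeros((c_in, oh, ow))
    for c in range(c_in):
        for i in range(oh):
            for j in range(ow):
                dw_out[c, i, j] = np.sum(x[c, i:i+k, j:j+k] * dw_kernels[c])

    # Pointwise step: a 1x1 convolution mixes channels at every spatial location;
    # the einsum over the channel axis is exactly a per-pixel matrix multiply.
    return np.einsum('oc,chw->ohw', pw_kernels, dw_out)

x = np.random.rand(3, 8, 8)     # 3-channel toy feature map
dw = np.random.rand(3, 3, 3)    # one 3x3 kernel per channel
pw = np.random.rand(16, 3)      # 16 output channels
y = depthwise_separable_conv(x, dw, pw)
print(y.shape)                  # (16, 6, 6)
```
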
step 1.2: an SENet module is introduced into the YOLO prediction layer, embedded after the output vectors of the 26th, 43rd and 53rd layers of the network.
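The SENet module of step 1.2 performs a squeeze (global average pooling), an excitation (a two-layer bottleneck ending in a sigmoid) and a channel-wise rescaling. A minimal NumPy sketch with assumed toy dimensions, not the actual network weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(x, w1, w2):
    """Squeeze-and-Excitation on a feature map x of shape (C, H, W).

    w1 -- weights of the reduction FC layer, shape (C // r, C)
    w2 -- weights of the expansion FC layer, shape (C, C // r)
    """
    # Squeeze: global average pooling collapses each channel to one scalar.
    s = x.mean(axis=(1, 2))                       # (C,)
    # Excitation: bottleneck FC -> ReLU -> FC -> sigmoid gives channel weights.
    e = sigmoid(w2 @ np.maximum(w1 @ s, 0.0))     # (C,)
    # Scale: reweight each channel of the original map by its importance.
    return x * e[:, None, None]

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((2, 8))   # reduction ratio r = 4 (assumed)
w2 = rng.standard_normal((8, 2))
y = se_block(x, w1, w2)
print(y.shape)   # (8, 4, 4)
```
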
The step2 specifically comprises the following steps:
step 2.1: extract N pedestrian photos from the public data sets and label the photos with a labeling tool; then divide the pictures into a training set and a test set in proportion;
step 2.2: perform prior-box clustering on the picture training-set samples with the K-means++ clustering algorithm instead of the K-means clustering algorithm to obtain new anchor boxes, and iteratively train the pedestrian detection YOLOv3 network model with the new anchor boxes.
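Anchor clustering as in step 2.2 can be sketched with 1 − IoU between (w, h) pairs as the distance, K-means++ seeding, and Lloyd iterations. The box data, the mean-based center update, and the sorting by area are assumptions for illustration:

```python
import numpy as np

def iou_wh(box, centers):
    """IoU between one (w, h) box and an array of (w, h) centers,
    treating all boxes as if they shared the same top-left corner."""
    inter = np.minimum(box[0], centers[:, 0]) * np.minimum(box[1], centers[:, 1])
    union = box[0] * box[1] + centers[:, 0] * centers[:, 1] - inter
    return inter / union

def kmeans_pp_anchors(boxes, k, iters=30, seed=0):
    """Cluster labelled box sizes with 1 - IoU as the distance measure."""
    rng = np.random.default_rng(seed)
    # K-means++ seeding: first center uniform, each further center drawn with
    # probability proportional to the squared distance to its nearest center.
    centers = [boxes[rng.integers(len(boxes))]]
    while len(centers) < k:
        d2 = np.array([(1.0 - iou_wh(b, np.asarray(centers)).max()) ** 2 for b in boxes])
        centers.append(boxes[rng.choice(len(boxes), p=d2 / d2.sum())])
    centers = np.asarray(centers, dtype=float)
    # Lloyd iterations: assign each box to the center it overlaps most,
    # then move each center to the mean of its cluster.
    for _ in range(iters):
        assign = np.array([np.argmax(iou_wh(b, centers)) for b in boxes])
        for j in range(k):
            if np.any(assign == j):
                centers[j] = boxes[assign == j].mean(axis=0)
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]  # smallest anchors first

rng = np.random.default_rng(42)
boxes = np.column_stack([rng.uniform(10, 60, 200),     # widths
                         rng.uniform(30, 160, 200)])   # heights: tall, pedestrian-like
anchors = kmeans_pp_anchors(boxes, k=9)
print(anchors.shape)   # (9, 2)
```
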
Before multi-target tracking, the trained pedestrian detection YOLOv3 network model must be used to detect the targets, specifically:
Continuous frames of images of any size are input into the trained pedestrian detection YOLOv3 network model. The input images are first adaptively resized; each grid cell predicts B bounding boxes and detects C classes of targets, and the network outputs the bounding boxes of each class together with their confidences. The confidence of a bounding box is defined as the intersection-over-union (IOU) between the bounding box and the object's actual bounding box, multiplied by the probability that an object is present in the bounding box:

Confidence = Pr(Object) × IOU(pred, truth)

where Confidence is the confidence of the bounding box, Pr(Object) is the probability that an object exists within the bounding box, and IOU(pred, truth) is the intersection-over-union between the predicted and actual bounding boxes.

By setting a threshold, bounding boxes whose class confidence falls below the threshold are eliminated, and the remaining boxes are screened with the NMS method, yielding the 5 bounding-box parameters (x, y, w, h, p_c), where (x, y) are the coordinates of the target center relative to the upper-left corner of its grid cell, (w, h) are the width and height of the target relative to the entire image, and p_c is the normalized probability of the target class; the final network output has size S × S × (5 × B + C).
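The screening just described, confidence thresholding followed by non-maximum suppression, can be sketched as follows; the thresholds and box values are assumed examples:

```python
import numpy as np

def iou_xywh(a, b):
    """IoU of two boxes given as (x_center, y_center, w, h)."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def nms(boxes, scores, conf_thresh=0.5, iou_thresh=0.45):
    """Drop low-confidence boxes, then greedily suppress overlapping ones."""
    keep_mask = scores >= conf_thresh
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    kept = []
    for i in np.argsort(-scores):        # highest confidence first
        if all(iou_xywh(boxes[i], boxes[j]) < iou_thresh for j in kept):
            kept.append(i)
    return boxes[kept], scores[kept]

boxes = np.array([[50.0, 50, 20, 40],    # pedestrian A
                  [52.0, 51, 20, 40],    # near-duplicate of A (gets suppressed)
                  [120.0, 60, 18, 38]])  # pedestrian B
scores = np.array([0.9, 0.8, 0.7])
kept_boxes, kept_scores = nms(boxes, scores)
print(len(kept_boxes))   # 2
```
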
The multi-target tracking in the step3 specifically comprises the following steps:
step 1: input to the multi-target tracking algorithm: the target coordinate information (c_x, c_y, r, h, p) obtained from the improved YOLOv3 network detection is expanded into an 8-dimensional vector X = [c_x, c_y, r, h, v_x, v_y, v_r, v_h] used as the input of the multi-target tracking algorithm, where p is the confidence score, (c_x, c_y) is the center coordinate of the bounding box, r the aspect ratio, h the height, and v_x, v_y, v_r, v_h the velocity change values of c_x, c_y, r, h;
Step 2: and (3) state estimation: firstly, predicting the position of a tracker at the next moment by using Kalman filtering, and then updating the predicted position based on a detection result obtained by the Kalman filtering;
step 3: assignment problem: the Hungarian algorithm is used to associate the detection results with the tracking predictions obtained from the Kalman filtering algorithm, considering both the association of motion information and the association of target appearance information;
Association of motion information: the Mahalanobis distance between the Kalman-predicted state and the new measurement expresses the motion information:

d^(1)(i, j) = (d_j − y_i)^T S_i^(−1) (d_j − y_i)

where d^(1)(i, j) represents the degree of motion matching between the jth detection box and the ith track, d_j is the position of the jth detection box, y_i is the state vector of the ith track, and S_i is the covariance matrix between the detected position and the mean track position. The association of the motion state is considered successful if the Mahalanobis distance of an association is smaller than a specified threshold, which is derived from a separate training set;
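The Mahalanobis gating can be illustrated with assumed track and detection values; the 9.4877 gate below is the 95% chi-square quantile for 4 degrees of freedom commonly used with this metric:

```python
import numpy as np

def mahalanobis_sq(d_j, y_i, S_i):
    """Squared Mahalanobis distance d^(1)(i, j) between detection j and track i."""
    diff = d_j - y_i
    return float(diff.T @ np.linalg.inv(S_i) @ diff)

# Hypothetical track and detection in the (cx, cy, r, h) measurement space.
y_i = np.array([100.0, 200.0, 0.5, 60.0])   # predicted track position
S_i = np.diag([4.0, 4.0, 0.01, 9.0])        # predicted covariance (assumed)
d_j = np.array([103.0, 202.0, 0.5, 62.0])   # new detection

d1 = mahalanobis_sq(d_j, y_i, S_i)
gate = 9.4877   # 95% chi-square quantile, 4 degrees of freedom
print(d1, d1 < gate)   # association passes the motion gate
```
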
An association method for target appearance information is introduced, measuring the distance between appearance features with the cosine distance:

d^(2)(i, j) = min{ 1 − r_j^T r_k^(i) | r_k^(i) ∈ R_i }

subject to ||r_k|| = 1, where R_i stores the feature vectors successfully associated with the most recent n frames and r_j, r_k are the feature vectors being compared; the cosine distance thus measures the tracker's appearance features against the appearance features corresponding to the detection result;
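The appearance metric can be sketched as the smallest cosine distance between a detection's feature vector and a track's stored gallery; the features below are random unit vectors used purely for illustration:

```python
import numpy as np

def cosine_distance(track_gallery, r_j):
    """d^(2)(i, j): smallest cosine distance between detection feature r_j and
    the gallery of unit-norm features stored for track i."""
    return float(min(1.0 - r_k @ r_j for r_k in track_gallery))

rng = np.random.default_rng(1)

def unit(v):
    return v / np.linalg.norm(v)

# Gallery of appearance features from the track's recent frames (all ||r|| = 1).
gallery = [unit(rng.standard_normal(128)) for _ in range(5)]
same = unit(gallery[0] + 0.05 * rng.standard_normal(128))   # looks like the track
other = unit(rng.standard_normal(128))                      # unrelated pedestrian

# A matching appearance gives a much smaller distance than an unrelated one.
print(cosine_distance(gallery, same) < cosine_distance(gallery, other))
```
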
The association metric is obtained by weighting the motion model and the appearance model:

c_{i,j} = λ d^(1)(i, j) + (1 − λ) d^(2)(i, j) (7)

where c_{i,j} represents the comprehensive matching degree and λ is a hyper-parameter, 0 by default. Only when c_{i,j} lies within the intersection of the two metric thresholds is the association considered correct; after the assignment is completed, the unmatched detectors and trackers are classified;
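The weighted metric and the subsequent assignment can be sketched as follows; a brute-force optimal assignment stands in for the Hungarian algorithm and is practical only for the toy-sized cost matrices used here:

```python
import numpy as np
from itertools import permutations

def combined_cost(d1, d2, lam=0.0):
    """c_{i,j} = lambda * d1 + (1 - lambda) * d2; with the default lambda = 0,
    only the appearance term drives the cost."""
    return lam * d1 + (1.0 - lam) * d2

def assign(cost):
    """Optimal assignment by exhaustive search over permutations.
    (Deep-SORT uses the Hungarian algorithm, which scales polynomially.)"""
    n = cost.shape[0]
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i, p[i]] for i in range(n)))
    return list(best)

# Hypothetical 3 tracks x 3 detections.
d1 = np.array([[1.0, 5.0, 9.0], [6.0, 2.0, 8.0], [7.0, 6.0, 1.5]])  # motion costs
d2 = np.array([[0.1, 0.9, 0.8], [0.7, 0.2, 0.9], [0.9, 0.8, 0.1]])  # appearance costs
cost = combined_cost(d1, d2, lam=0.0)
print(assign(cost))   # [0, 1, 2]: track i is matched to detection i
```
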
step 4: cascade matching and IOU matching: when a target is occluded for a long time, the accuracy of the Kalman filtering prediction decreases and the observability of the state space drops accordingly, so cascade matching gives priority to targets that appear more frequently. Trackers in the unconfirmed state, unmatched trackers and unmatched detections then undergo IOU matching and are assigned again with the Hungarian algorithm;
step 5: update the parameters of the matched trackers, delete trackers that remain unmatched, and initialize unmatched detections as new targets. Judge whether the video stream has ended; if so, exit the loop, otherwise proceed to the detection of the next frame.
Generally, the technical solution conceived by the present invention can obtain the following beneficial effects:
A depthwise separable convolution module is introduced into the YOLOv3 network model to replace the standard convolution module in YOLOv3, increasing the running speed of the algorithm.
An SENet module is added to the YOLOv3 prediction layer; exploiting SENet's modeling of the correlation and importance of features across different channels strengthens the feature extraction capability of the network and improves detection precision.
In the target detection network, the K-means++ clustering algorithm replaces the K-means clustering algorithm to modify the anchor boxes so that they better fit pedestrian characteristics, improving feature extraction and the detection precision of the algorithm.
The improved YOLOv3 algorithm performs the detection of target information, and the tracking part uses the Deep-SORT algorithm. Experimental results show that the proposed tracking algorithm effectively reduces missed detections and occlusion failures while maintaining a high detection speed and a good tracking effect.
The above description is only an outline of the technical solution of the present invention, and the embodiments of the present invention will be described below in order to make the technical means of the present invention more clearly understood and to make the content, features, and advantages of the present invention more comprehensible.
Drawings
Fig. 1 is a flow chart of a specific algorithm of the present invention.
Fig. 2 is a diagram of an improved YOLOv3 network framework.
FIG. 3 is a diagram of the SEnet module architecture.
FIG. 4 is a diagram of a standard convolution structure and a depth separable convolution structure. Wherein, (a) represents a standard convolution structure, (b) represents a deep convolution structure, and (c) represents a point-by-point convolution structure.
FIG. 5 is a comparison of the test results of the model of the present invention and the original model, where (a) shows the tracking results of YOLOv3-Deep-SORT at different frame numbers and (b) shows the tracking results of the proposed algorithm at different frame numbers.
Detailed Description
The following further describes the embodiments of the present invention with reference to the drawings.
As shown in fig. 1, the present invention provides a pedestrian multi-target tracking method based on an improved YOLOv3 model, which includes:
step 1: build the improved YOLOv3 target detection sub-network, the basis of detection-based tracking, as shown in fig. 2; this is divided into the following steps:
step 1.1: as shown in fig. 4, a depthwise separable convolution module is introduced into the Darknet-53 feature extraction layer to replace the standard convolution in the original Darknet-53;
step 1.2: as shown in fig. 3, an SENet module is introduced into the YOLO prediction layer, embedded after the output vectors of the 26th, 42nd and 53rd layers of the Darknet-53 feature extraction layer of the YOLOv3 network.
Step 2: select a data set containing pedestrian images from the VOC2007 pictures, perform cluster analysis on the data set labels with the K-means++ clustering algorithm, and train the pedestrian detection YOLOv3 network model. The method comprises the following steps:
step 2.1: 10000 pedestrian photos are extracted from the VOC2007 and MOT 2015 public data sets and labeled with a labeling tool; the pictures are then divided into a training set and a test set at a ratio of 2 : 1, from which the training samples are selected.
Step 2.2: and (3) carrying out prior frame clustering on the samples by using a K-means + + algorithm to obtain new anchors (the number of the anchors is selected to be 9), and carrying out iterative training on a Yolov3 pedestrian detection network model by using the new anchors.
And step 3: the improved YOLOv3 network is used as a detector for target detection, and is combined with a Deep-SORT multi-target tracking algorithm to realize multi-target tracking of pedestrians. The method comprises the following steps:
step 3.1: target detection part: continuous frames of images of any size are input into the improved YOLOv3 network model. The input images are first adaptively resized to 416 × 416; each grid cell predicts B bounding boxes (B = 9) and detects C classes of targets (for pedestrian detection the class is set to person), and the network outputs the bounding boxes of each class together with their confidences. The confidence of a bounding box is defined as the intersection-over-union (IOU) between the bounding box and the object's actual bounding box, multiplied by the probability that an object is present in the bounding box:

Confidence = Pr(Object) × IOU(pred, truth)

where Confidence is the confidence of the bounding box, Pr(Object) is the probability that an object exists within the bounding box, and IOU(pred, truth) is the intersection-over-union between the predicted and actual bounding boxes.

By setting a threshold, bounding boxes whose class confidence falls below the threshold are eliminated, and the remaining boxes are screened with the NMS (non-maximum suppression) method, yielding the 5 bounding-box parameters (x, y, w, h, p_c), where (x, y) are the coordinates of the target center relative to the upper-left corner of its grid cell, (w, h) are the ratios of the target's width and height to the entire image, and p_c is the normalized probability of the target class; the final network output has size S × S × (5 × B + C).
Step 3.2: referring to fig. 1, the improved YOLOv3 network is used as a detector for target detection, and the multi-target tracking part specifically includes the following steps:
step 1: target detection: target detection is performed on the input video stream to obtain box and feature information; the detected target coordinate information (c_x, c_y, r, h, p) is then expanded into an 8-dimensional vector X = [c_x, c_y, r, h, v_x, v_y, v_r, v_h] used as the input of the multi-target tracking algorithm, where p is the confidence score, (c_x, c_y) is the center coordinate of the bounding box, r the aspect ratio, h the height, and v_x, v_y, v_r, v_h the respective velocity change values.
Step 2: and (3) state estimation: the position of the tracker at the next time instant is first predicted using kalman filtering, and then the predicted position is updated based on the detection result.
Step 3: assignment problem: the Hungarian algorithm is utilized to solve the problem of association between the detection result and the tracking prediction result, and meanwhile, the association of the motion information and the association of the target appearance information are considered.
Correlation of motion information: and (3) predicting the Mahalanobis distance between the state and the new measurement by adopting a Kalman filter to express the motion information:
in the formula (d)(1)(i, j) represents the degree of motion matching between the j detection frames and the ith track, djIndicates the position of the jth detection frame, yiState vector representing the ith trace, SiRepresenting the covariance matrix between the detected position and the average position. The association of the set motion state is successful if the mahalanobis distance of a certain association is smaller than a specified threshold (which is derived from a separate training set).
Introducing a correlation method of target appearance information, measuring the distance between the apparent features by using cosine distance, wherein the calculation formula is as follows:
in the formula, the limitation is | | | ri||=1,Used to store the feature vectors that were successfully associated with the last 100 frames. The cosine distance is used to measure the apparent characteristics of the tracker and the apparent characteristics corresponding to the detection result.
And the relevance measurement is obtained by weighting the motion model and the appearance model:
ci,j=λd(1)(i,j)+(1-λ)d(2)(i,j) (4)
in the formula, ci,jAnd the comprehensive matching degree is shown, and lambda is a hyper-parameter and is 0 by default. Only c isi,jWhen the assignment is complete, the unmatched detectors and trackers are classified.
Step 4: cascade matching and IOU matching: when the target is shielded for a long time, the correctness of the Kalman filtering prediction result is reduced, and the observability in the state space is correspondingly reduced, so that the priority is given to the more frequently-appearing target through cascade matching. And for the trackers in unconfirmed states, the unmatched trackers and the unmatched detection, performing IOU matching, and assigning by using the Hungarian algorithm again.
step 5: update the parameters of the matched trackers, delete trackers that remain unmatched, and initialize unmatched detections as new targets. Judge whether the video stream has ended; if so, exit the loop, otherwise proceed to the detection of the next frame.
Step 4: simulation experiments:
Qualitative experiment: sequences from the MOT16 multi-target tracking data set are selected for a multi-target tracking experiment; as shown in fig. 5, the improved network model achieves a certain improvement in accuracy, miss rate and other aspects.
Quantitative experiment: as shown in table 1, the MOT15 multi-target tracking data set is selected for testing and 7 relatively advanced multi-target tracking algorithms are selected for comparison; the improved network model shows clear advantages, with corresponding overall improvements in the performance indexes.
TABLE 1 Comparison of multi-target tracking algorithm evaluation indexes
The present invention is not intended to be limited to the particular embodiments shown above, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (6)
1. The pedestrian multi-target tracking video identification method based on the improved YOLOv3 model is characterized by comprising the following steps:
step 1: pedestrian detection part: improving the YOLOv3 target detection network by introducing a depthwise separable convolution module to replace the standard convolution module in the Darknet-53 feature extraction layer, and introducing an SENet module added to the YOLO prediction layer;
step 2: selecting a data set containing pedestrian images from public data sets, performing cluster analysis on the data set labels with the K-means++ clustering algorithm instead of the K-means clustering algorithm, and training the pedestrian detection YOLOv3 network model;
step 3: multi-target tracking part: performing target detection with the trained pedestrian detection YOLOv3 network model, and performing multi-target tracking of pedestrians in combination with the Deep-SORT algorithm.
2. The pedestrian multi-target tracking video identification method based on the improved YOLOv3 model according to claim 1, wherein the step1 is further specifically as follows:
step 1.1: introducing a depthwise separable convolution module into the Darknet-53 feature extraction layer to replace the standard convolution module in the original Darknet-53; the depthwise separable convolution considers channels and spatial regions separately, decomposing the standard convolution into a depthwise convolution and a pointwise convolution: the depthwise convolution first applies a 3 × 3 convolution to each single channel of the feature map to collect per-channel features, and the pointwise convolution then applies a 1 × 1 convolution to the depthwise output to collect the features at each point;
step 1.2: introducing an SENet module into the YOLO prediction layer, embedding an SENet module after the output vectors of the 26th, 43rd and 53rd layers of the network.
3. The pedestrian multi-target tracking video identification method based on the improved YOLOv3 model according to claim 1 or 2, wherein the step2 is specifically as follows:
step 2.1: extracting N pedestrian photos from the public data sets and labeling the photos with a labeling tool; then dividing the pictures into a training set and a test set in proportion;
step 2.2: performing prior-box clustering on the picture training-set samples with the K-means++ clustering algorithm instead of the K-means clustering algorithm to obtain new anchor boxes, and iteratively training the pedestrian detection YOLOv3 network model with the new anchor boxes.
4. The pedestrian multi-target tracking video recognition method based on the improved YOLOv3 model as claimed in claim 1 or 2, wherein a trained pedestrian detection YOLOv3 network model is required to detect the target before multi-target tracking, specifically:
inputting continuous frames of images of any size into the trained pedestrian detection YOLOv3 network model; the input images are first adaptively resized, B bounding boxes are predicted in each grid cell, C classes of targets are detected, and the bounding box of each class of target and its confidence are output; the confidence of a bounding box is defined as the probability that an object exists in the bounding box multiplied by the intersection-over-union (IOU) between the predicted bounding box and the actual bounding box of the object, calculated as:

Confidence = Pr(Object) × IOU(pred, truth)

where Confidence is the confidence of the bounding box, Pr(Object) is the probability that an object exists within the bounding box, and IOU(pred, truth) is the intersection-over-union between the predicted bounding box and the actual bounding box of the object;

a threshold is set and every bounding box whose class confidence falls below it is eliminated; the remaining boxes are then screened with the NMS method to obtain the 5 parameters (x, y, w, h, p_c) of each bounding box, where (x, y) are the coordinates of the target center relative to the upper-left corner of its grid cell, (w, h) are the width and height of the target normalized by the whole image, and p_c is the normalized probability of the target class; the final network output has size S × S × (5 × B + C).
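The screening described above (confidence thresholding followed by non-maximum suppression) can be sketched in NumPy; this is a generic greedy NMS, hedged as an illustration rather than the patent's exact procedure, with boxes in (x1, y1, x2, y2) form and hypothetical threshold defaults:

```python
import numpy as np

def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def filter_and_nms(boxes, scores, conf_thresh=0.5, iou_thresh=0.45):
    """Drop boxes below the confidence threshold, then greedy NMS."""
    keep_mask = scores >= conf_thresh
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = np.argsort(-scores)            # process highest score first
    kept = []
    while len(order):
        i = order[0]
        kept.append(i)
        # suppress all remaining boxes that overlap the kept box too much
        order = np.array([j for j in order[1:]
                          if iou(boxes[i], boxes[j]) <= iou_thresh])
    return boxes[kept], scores[kept]
```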
5. The pedestrian multi-target tracking video recognition method based on the improved YOLOv3 model as claimed in claim 3, wherein a trained pedestrian detection YOLOv3 network model is required to detect the target before multi-target tracking, specifically:
inputting continuous frames of images of any size into the trained pedestrian detection YOLOv3 network model; the input images are first adaptively resized, B bounding boxes are predicted in each grid cell, C classes of targets are detected, and the bounding box of each class of target and its confidence are output; the confidence of a bounding box is defined as the probability that an object exists in the bounding box multiplied by the intersection-over-union (IOU) between the predicted bounding box and the actual bounding box of the object, calculated as:

Confidence = Pr(Object) × IOU(pred, truth)

where Confidence is the confidence of the bounding box, Pr(Object) is the probability that an object exists within the bounding box, and IOU(pred, truth) is the intersection-over-union between the predicted bounding box and the actual bounding box of the object;

a threshold is set and every bounding box whose class confidence falls below it is eliminated; the remaining boxes are then screened with the NMS method to obtain the 5 parameters (x, y, w, h, p_c) of each bounding box, where (x, y) are the coordinates of the target center relative to the upper-left corner of its grid cell, (w, h) are the width and height of the target normalized by the whole image, and p_c is the normalized probability of the target class; the final network output has size S × S × (5 × B + C).
6. The pedestrian multi-target tracking video identification method based on the improved YOLOv3 model according to claim 1, 2 or 5, wherein the multi-target tracking in step 3 is specifically:
step 1: input to the multi-target tracking algorithm: the target coordinate information (c_x, c_y, r, h, p) obtained from the improved YOLOv3 network detection is extended to an 8-dimensional vector X = [c_x, c_y, r, h, v_x, v_y, v_r, v_h] as the input of the multi-target tracking algorithm, where p is the confidence score, (c_x, c_y) are the center coordinates of the bounding box, r is the aspect ratio, h is the height, and v_x, v_y, v_r, v_h are the rates of change of c_x, c_y, r, h;
step 2: state estimation: Kalman filtering first predicts the position of each tracker at the next moment, and the predicted position is then updated with the detection result;
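The predict/update cycle of step 2 can be sketched for a single state component; this is a textbook constant-velocity Kalman filter in NumPy (the full tracker uses the 8-dimensional state above), with the noise values Q and R chosen arbitrarily for illustration:

```python
import numpy as np

def kalman_predict(x, P, F, Q):
    """Project state and covariance one frame ahead."""
    return F @ x, F @ P @ F.T + Q

def kalman_update(x, P, z, H, R):
    """Correct the prediction with a new detection z."""
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    x = x + K @ (z - H @ x)                # blend prediction and measurement
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P

# constant-velocity model for one component (e.g. c_x): pos += vel each frame
F = np.array([[1., 1.], [0., 1.]])
H = np.array([[1., 0.]])                   # only the position is measured
Q = 0.01 * np.eye(2)
R = np.array([[0.1]])
x, P = np.array([0., 1.]), np.eye(2)
x, P = kalman_predict(x, P, F, Q)          # predicted position is 1.0
x, P = kalman_update(x, P, np.array([1.2]), H, R)
```

After the update, the estimate lies between the prediction (1.0) and the detection (1.2), weighted by their uncertainties.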
step 3: assignment problem: the Hungarian algorithm is used to associate the detection results with the tracking predictions obtained by the Kalman filtering algorithm, considering both the association of motion information and the association of target appearance information;
association of motion information: the Mahalanobis distance between the Kalman-predicted state and the new measurement is used to represent the motion information:

d^(1)(i, j) = (d_j − y_i)^T S_i^(−1) (d_j − y_i)

where d^(1)(i, j) represents the degree of motion matching between the jth detection box and the ith track, d_j denotes the position of the jth detection box, y_i is the state vector of the ith track, and S_i is the covariance matrix between the detected position and the mean track position; if the Mahalanobis distance of an association is smaller than a specified threshold, obtained from a separate training set, the motion-state association is considered successful;
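The motion gate above can be sketched as follows. The gating threshold 9.4877 is the 95% chi-square quantile for 4 degrees of freedom, the value conventionally used with a 4-dimensional (c_x, c_y, r, h) measurement; the patent only says the threshold comes from a training set, so treat the constant as an illustrative assumption:

```python
import numpy as np

# 95% chi-square quantile for 4 degrees of freedom (assumed gate value)
GATE = 9.4877

def mahalanobis_sq(d_j, y_i, S_i):
    """Squared Mahalanobis distance between detection d_j and track mean y_i."""
    diff = d_j - y_i
    return float(diff.T @ np.linalg.inv(S_i) @ diff)

def motion_gate_ok(d_j, y_i, S_i, gate=GATE):
    """True if the association passes the motion gate."""
    return mahalanobis_sq(d_j, y_i, S_i) < gate
```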
an association method based on target appearance information is introduced, measuring the distance between appearance features with the cosine distance:

d^(2)(i, j) = min{ 1 − r_j^T r_k^(i) | r_k^(i) ∈ R_i }

subject to ||r_i|| = 1, where R_i stores the feature vectors of the last n successful associations of track i, and r_j, r_k are the two vectors being compared; the cosine distance thus measures the appearance feature of the tracker against the appearance feature corresponding to the detection result;
the association metric is obtained by weighting the motion model and the appearance model:

c_i,j = λ d^(1)(i, j) + (1 − λ) d^(2)(i, j)   (7)

where c_i,j denotes the comprehensive matching degree and λ is a hyper-parameter whose default value is 0; an association is considered correct only when c_i,j lies within the intersection of the two metric thresholds; after assignment is completed, the unmatched detectors and trackers are collected;
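The appearance distance d^(2) and the weighted metric of eq. (7) can be sketched together; a minimal NumPy illustration, assuming unit-normalized feature vectors and hypothetical function names:

```python
import numpy as np

def cosine_distance(r_j, gallery):
    """d2: smallest cosine distance between detection feature r_j and the
    stored feature vectors of a track's last n successful associations."""
    r_j = r_j / np.linalg.norm(r_j)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return float(np.min(1.0 - gallery @ r_j))

def combined_cost(d1, d2, lam=0.0):
    """c_ij = lambda * d1 + (1 - lambda) * d2, eq. (7). With the default
    lambda = 0 only the appearance distance enters the cost; the Mahalanobis
    distance is still used separately for gating."""
    return lam * d1 + (1.0 - lam) * d2
```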
step 4: cascade matching and IOU matching: when a target reappears after a long occlusion, cascade matching gives priority to the more frequently seen targets; trackers in the unconfirmed state, unmatched trackers and unmatched detections then undergo IOU matching, and the Hungarian algorithm is applied again for assignment;
step 5: the parameters of matched trackers are updated, trackers that remain unmatched are deleted, and unmatched detections are initialized as new targets; whether the video stream has ended is then judged: if so, the loop exits; otherwise, the next frame is detected.
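The bookkeeping of step 5 can be sketched as one function; this is an illustrative dictionary-based lifecycle (all names and the `max_age` value are assumptions), not the patent's tracker implementation:

```python
def step_tracks(tracks, matches, unmatched_tracks, unmatched_dets, max_age=30):
    """One bookkeeping step after assignment: update matched trackers,
    age out unmatched ones, spawn new tracks for unmatched detections.
    `tracks` maps track id -> {'age': missed-frame count, ...}."""
    for tid, det in matches:
        tracks[tid]['age'] = 0                 # matched: reset miss counter
        tracks[tid]['last_det'] = det
    for tid in unmatched_tracks:
        tracks[tid]['age'] += 1                # unmatched: one more missed frame
        if tracks[tid]['age'] > max_age:
            del tracks[tid]                    # delete stale tracker
    next_id = max(tracks, default=-1) + 1
    for det in unmatched_dets:
        tracks[next_id] = {'age': 0, 'last_det': det}   # initialize new target
        next_id += 1
    return tracks
```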
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110151278.9A CN112836639A (en) | 2021-02-03 | 2021-02-03 | Pedestrian multi-target tracking video identification method based on improved YOLOv3 model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112836639A true CN112836639A (en) | 2021-05-25 |
Family
ID=75931941
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113221808A (en) * | 2021-05-26 | 2021-08-06 | 新疆爱华盈通信息技术有限公司 | Dinner plate counting statistical method and device based on image recognition |
CN113313008A (en) * | 2021-05-26 | 2021-08-27 | 南京邮电大学 | Target and identification tracking method based on YOLOv3 network and mean shift |
CN113392754A (en) * | 2021-06-11 | 2021-09-14 | 成都掌中全景信息技术有限公司 | Method for reducing false detection rate of pedestrian based on yolov5 pedestrian detection algorithm |
CN113470076A (en) * | 2021-07-13 | 2021-10-01 | 南京农业大学 | Multi-target tracking method for yellow-feather chickens in flat-breeding henhouse |
CN113688797A (en) * | 2021-09-27 | 2021-11-23 | 江南大学 | Abnormal behavior identification method and system based on skeleton extraction |
CN113723361A (en) * | 2021-09-18 | 2021-11-30 | 西安邮电大学 | Video monitoring method and device based on deep learning |
CN113763427A (en) * | 2021-09-05 | 2021-12-07 | 东南大学 | Multi-target tracking method based on coarse-fine shielding processing |
CN113822153A (en) * | 2021-08-11 | 2021-12-21 | 桂林电子科技大学 | Unmanned aerial vehicle tracking method based on improved DeepSORT algorithm |
CN114241397A (en) * | 2022-02-23 | 2022-03-25 | 武汉烽火凯卓科技有限公司 | Frontier defense video intelligent analysis method and system |
CN114879891A (en) * | 2022-05-19 | 2022-08-09 | 中国人民武装警察部队工程大学 | Multi-mode man-machine interaction method under self-supervision multi-target tracking |
CN116188767A (en) * | 2023-01-13 | 2023-05-30 | 湖北普罗格科技股份有限公司 | Neural network-based stacked wood board counting method and system |
CN116416281A (en) * | 2023-04-28 | 2023-07-11 | 云观智慧科技(无锡)有限公司 | Grain depot AI video supervision and analysis method and system |
CN114879891B (en) * | 2022-05-19 | 2024-04-26 | 中国人民武装警察部队工程大学 | Multi-mode man-machine interaction method under self-supervision multi-target tracking |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112215208A (en) * | 2020-11-10 | 2021-01-12 | 中国人民解放军战略支援部队信息工程大学 | Remote sensing image bridge target detection algorithm based on improved YOLOv4 |
CN112308881A (en) * | 2020-11-02 | 2021-02-02 | 西安电子科技大学 | Ship multi-target tracking method based on remote sensing image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||