CN111753732A - Vehicle multi-target tracking method based on target center point - Google Patents

Vehicle multi-target tracking method based on target center point

Info

Publication number
CN111753732A
Authority
CN
China
Prior art keywords
vehicle
target
tracking
data set
detection model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010590410.1A
Other languages
Chinese (zh)
Inventor
杨航
杨海东
黄坤山
彭文瑜
林玉山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute
Foshan Guangdong University CNC Equipment Technology Development Co. Ltd
Original Assignee
Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute
Foshan Guangdong University CNC Equipment Technology Development Co. Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute, Foshan Guangdong University CNC Equipment Technology Development Co. Ltd filed Critical Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute
Priority to CN202010590410.1A
Publication of CN111753732A
Legal status: Pending


Classifications

    • G06V 20/54 — Surveillance or monitoring of activities of traffic, e.g. cars on the road, trains or boats
    • G06F 18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/045 — Combinations of networks (neural network architectures)
    • G06T 7/11 — Region-based segmentation
    • G06V 20/41 — Higher-level, semantic clustering, classification or understanding of video scenes
    • G06V 2201/07 — Target detection
    • G06V 2201/08 — Detecting or categorising vehicles


Abstract

The invention discloses a vehicle multi-target tracking method based on a target center point, comprising the following steps: S1, acquiring a vehicle tracking data set and performing image enhancement on it; S2, building a vehicle detection model, setting hyper-parameters, and pre-training the vehicle detection model on the vehicle tracking data set; S3, copying all weights from the vehicle detection model, adding 4 input channels and 2 output channels to the original vehicle detection model, and retraining to generate a vehicle tracking model; S4, inputting the video stream into the vehicle tracking model to obtain the vehicle multi-target tracking result. By integrating the detection and association modules into one network, the invention greatly reduces computation and running time and simplifies tracking-conditioned detection: the tracking-conditioned detector can directly extract the heat map and perform joint reasoning over targets in multiple frames when associating them.

Description

Vehicle multi-target tracking method based on target center point
Technical Field
The invention relates to the technical field of multi-target tracking, in particular to a vehicle multi-target tracking method based on a target center point.
Background
With the development of computer hardware and computer vision technology, traffic monitoring systems based on computer vision have become practical, and real-time detection and tracking of vehicles in video is a core part of an intelligent traffic monitoring system. In existing detection and tracking technology, the tracking of moving targets in complex, large-scale, multi-target scenes remains unsatisfactory and needs further improvement. Thanks to the development and application of convolutional neural networks (CNNs), many computer vision tasks have advanced greatly, and many CNN-based methods have also been applied to problems such as multi-target tracking.
Most existing mainstream target tracking methods follow the tracking-by-detection paradigm: a target detection algorithm detects the targets of interest in each frame, producing position coordinates, classes, confidences and other indicators, and the detection results are then associated in some way with the detected targets in the previous frame.
Disclosure of Invention
To address these problems, the invention provides a vehicle multi-target tracking method based on a target center point, centered on a novel tracking model that represents objects as points and detects and tracks simultaneously. By performing detection on an image and combining the target detection results of previous frames to estimate target motion in the current frame, the method is simple, online and real-time without requiring large computational resources.
The method simplifies two key steps of traditional tracking schemes. First, tracking-conditioned detection: because each object in past frames is represented by a single point, its historical information is contained in the corresponding heat map, from which the model can directly extract relevant information. Second, cross-time target association: targets in different frames can be connected through simple displacement prediction, similar to sparse optical flow. The displacement prediction is based on previous detection results, which allows objects in the current frame to be detected jointly and associated with previous detections. The method comprises the following steps:
s1, acquiring a vehicle tracking data set, and performing image enhancement on the vehicle tracking data set;
s2, building a vehicle detection model, setting a hyper-parameter, and pre-training the vehicle detection model through the vehicle tracking data set;
s3, copying all weights from the vehicle detection model, adding 4 input channels and 2 output channels on the basis of the original vehicle detection model, and retraining to generate a vehicle tracking model;
and S4, inputting the video stream into the vehicle tracking model to obtain a vehicle multi-target tracking result.
In a further refinement, the vehicle tracking data set includes MOT, KITTI, and nuScenes, and is divided into a training set, a test set, and a validation set in a 6:2:2 ratio.
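The 6:2:2 split described above can be sketched as follows; the function name, shuffling and seed are illustrative assumptions, not part of the patent:

```python
import random

def split_dataset(samples, ratios=(0.6, 0.2, 0.2), seed=42):
    """Split a list of samples into train/test/validation subsets (6:2:2)."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)   # deterministic shuffle
    n = len(shuffled)
    n_train = int(n * ratios[0])
    n_test = int(n * ratios[1])
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:]
    return train, test, val

train, test, val = split_dataset(list(range(100)))
```

In practice the split would be applied per video sequence rather than per frame, so that frames from one sequence do not leak across subsets.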
In a further improvement, the image enhancement comprises at least one of the following methods:
1) rotating the training image I_H in the vehicle tracking data set by different angles to generate four sub-images, denoted I_H^(i), i ∈ {−30°, −15°, +15°, +30°};
2) resizing the training image I_H; the resized sub-image is denoted I_H^(s);
3) applying pixel-wise binary segmentation to the training image I_H; the binarized sub-image is denoted I_H^(b), and its foreground count value is C.
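The three enhancement operations above (rotation, resizing, binary segmentation) might be sketched as follows; all function names and the threshold are illustrative, and a production pipeline would normally use an image library instead:

```python
import numpy as np

def rotate(img, angle_deg):
    """Nearest-neighbour rotation about the image centre (illustrative)."""
    h, w = img.shape
    t = np.deg2rad(angle_deg)
    ys, xs = np.indices((h, w))
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    # inverse-map each output pixel back into the source image
    sx = np.cos(t) * (xs - cx) + np.sin(t) * (ys - cy) + cx
    sy = -np.sin(t) * (xs - cx) + np.cos(t) * (ys - cy) + cy
    sx, sy = np.rint(sx).astype(int), np.rint(sy).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros_like(img)
    out[valid] = img[sy[valid], sx[valid]]
    return out

def resize_nn(img, new_h, new_w):
    """Nearest-neighbour size transform."""
    ys = np.arange(new_h) * img.shape[0] // new_h
    xs = np.arange(new_w) * img.shape[1] // new_w
    return img[np.ix_(ys, xs)]

def binarize(img, thresh=128):
    """Pixel-wise binary segmentation; returns the mask and its count C."""
    mask = (img >= thresh).astype(np.uint8)
    return mask, int(mask.sum())

I_H = np.arange(64, dtype=np.uint8).reshape(8, 8) * 4   # toy image
subgraphs = [rotate(I_H, a) for a in (-30, -15, 15, 30)]
mask, C = binarize(I_H)
```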
In a further improvement, the step S2 specifically includes the following steps:
s21, cropping the training images in the vehicle tracking data set to a resolution of 960×544 (so that the 4x down-sampled output is 240×136) and inputting them into the vehicle detection model, and setting the hyper-parameters: false positive rate λ_fp = 0.1, false negative rate λ_fn = 0.4, confidence threshold θ = 0.4, and heat map rendering threshold τ = 0.5;
s22, selecting deformable convolution as the up-sampling convolution to skip-connect the low layers and the output layer of the vehicle detection model, with an output stride of 4 and a batch size of 12; training for 30 epochs with the Adam optimizer, with the learning rate set to 1e-4 for the first 20 epochs and decayed from 1e-5 to 1e-6 over the last 10 epochs; and adopting as the loss function the penalty-reduced focal loss used for center-point heat maps:

L_k = -(1/N) Σ_xyc { (1 - Ŷ_xyc)^α log(Ŷ_xyc),                if Y_xyc = 1
                    { (1 - Y_xyc)^β (Ŷ_xyc)^α log(1 - Ŷ_xyc),  otherwise

where Ŷ is the predicted heat map, Y the ground-truth heat map rendered in s23, and N the number of center points;
s23, splatting each key point p onto the down-sampled feature map at the low-resolution position p̃ = ⌊p/4⌋ using a Gaussian kernel, with the Gaussian parameters adjusted according to the target size to blur the peak;
s24, comparing every response point on the heat map with its 8 connected neighbours; if a point's response value is greater than or equal to all 8 neighbour values it is kept, the first N peak points satisfying this condition are retained, and a heat map of resolution 240×136 containing the target center points is obtained.
In a further improvement, the step S24 further includes:
in order to reduce the false alarm probability, only the targets with scores higher than a preset threshold value in the detection result are rendered.
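The S24 peak-extraction rule, keep each point whose response is at least as large as all 8 neighbours and then take the top N above a score threshold, can be sketched like this; the `top_n` and `thresh` values are illustrative:

```python
import numpy as np

def extract_peaks(heatmap, top_n=100, thresh=0.4):
    """8-neighbourhood maximum test plus score thresholding (S24 sketch)."""
    h, w = heatmap.shape
    padded = np.pad(heatmap, 1, constant_values=-np.inf)
    # stack the 8 shifted neighbour maps around each pixel
    neighbours = np.stack([
        padded[dy:dy + h, dx:dx + w]
        for dy in (0, 1, 2) for dx in (0, 1, 2)
        if not (dy == 1 and dx == 1)
    ])
    is_peak = (heatmap >= neighbours.max(axis=0)) & (heatmap > thresh)
    ys, xs = np.nonzero(is_peak)
    order = np.argsort(-heatmap[ys, xs])[:top_n]   # keep the N strongest
    return [(int(ys[i]), int(xs[i]), float(heatmap[ys[i], xs[i]]))
            for i in order]
```

On a full-size 240×136 heat map the same comparison is often done in one shot with a 3×3 max-pooling layer, which this vectorised version mirrors.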
In a further improvement, the step S3 specifically includes the following steps:
s31, copying all weights from the vehicle detection model and keeping the basic hyper-parameters unchanged; adding the previous-frame image and the generated heat map to the input part of the original vehicle detection model, and adding to the output part a two-dimensional offset vector from each target center point in the current frame to its center point in the previous frame;
s32, adding 2 additional output channels to predict the two-dimensional offset vectors, which describe the displacement of each object's position in the current frame relative to its position in the previous-frame image, so that detected targets can be linked across time;
s33, drawing target bounding boxes from the center points of the generated heat map and linking objects in the current frame to objects in the previous frame with a greedy matching strategy: if matching succeeds, the object inherits the ID of the previous-frame object; otherwise a new ID is assigned to the target.
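The S33 association step might look like the following sketch: each current centre is shifted by its predicted offset toward its previous-frame position and greedily matched to the nearest unclaimed track. The `max_dist` gating radius is an assumption; the patent does not specify one:

```python
import math

def greedy_match(curr_centers, curr_offsets, prev_tracks, max_dist=16.0):
    """Greedy cross-frame association (illustrative sketch of S33).

    curr_centers: [(x, y)] detected centre points in the current frame.
    curr_offsets: [(dx, dy)] predicted displacements back toward the
                  previous frame.
    prev_tracks:  [(track_id, (x, y))] centre points of the previous frame.
    """
    next_id = max((tid for tid, _ in prev_tracks), default=0) + 1
    used, assignments = set(), []
    for (cx, cy), (ox, oy) in zip(curr_centers, curr_offsets):
        px, py = cx + ox, cy + oy        # estimated previous-frame position
        best, best_d = None, max_dist
        for tid, (tx, ty) in prev_tracks:
            d = math.hypot(px - tx, py - ty)
            if tid not in used and d < best_d:
                best, best_d = tid, d
        if best is None:                 # no match: assign a fresh ID
            best, next_id = next_id, next_id + 1
        used.add(best)
        assignments.append((best, (cx, cy)))
    return assignments
```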
In a further improvement, the step S3 further includes:
on the prediction result of the previous frame, applying a Gaussian perturbation to each detected target point, with a hyper-parameter λ_ft, to simulate target localization errors;
randomly rendering false peaks near the centers of ground-truth targets with a certain probability, setting the hyper-parameter λ_fp = 0.1 to simulate false detections;
randomly removing part of the detection results with a predetermined probability, setting the hyper-parameter λ_fn = 0.02 to simulate missed detections.
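A minimal sketch of the three corruption operations above; the jitter standard deviation is an assumption, since the patent leaves λ_ft's value unspecified, while λ_fp = 0.1 and λ_fn = 0.02 follow the text:

```python
import random

def corrupt_detections(dets, lam_jitter=2.0, lam_fp=0.1, lam_fn=0.02, rng=None):
    """Simulate localisation noise, false peaks and misses (illustrative)."""
    rng = rng or random.Random(0)
    out = []
    for (x, y) in dets:
        if rng.random() < lam_fn:
            continue                     # drop detection: simulated miss
        # Gaussian jitter: simulated localization error
        out.append((x + rng.gauss(0, lam_jitter),
                    y + rng.gauss(0, lam_jitter)))
        if rng.random() < lam_fp:
            # render a false peak near the ground-truth centre
            out.append((x + rng.uniform(-8, 8), y + rng.uniform(-8, 8)))
    return out
```

Applied to the rendered previous-frame heat map during training, this exposes the model to the imperfect detections it will see at inference time.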
Compared with the prior art, the invention has the beneficial effects that:
1. the invention abandons the traditional design that separates the detection part from the association part, and greatly reduces computation and running time by integrating the two modules into one network. Tracking-conditioned detection is also simplified: each target in the video stream is identified by a single point, and multiple targets can be identified by a heat map containing multiple points. The tracking-conditioned detector can directly extract the heat map and perform joint reasoning over targets in multiple frames when associating them.
2. Point-based tracking simplifies target association across time. Simple displacement prediction, similar to sparse optical flow, can connect objects in different frames. The displacement prediction is based on previous detection results, which enables joint detection of objects in the current frame and their association with previous detections.
3. Since the method is completely local, only objects in adjacent frames are associated, and tracks that are lost or temporally distant are not re-initialized. Inputting the current frame together with the previous frame helps the network estimate changes in the scene and, following cues provided by the previous frame, recover objects that may not be observed in the current frame.
Drawings
The drawings are for illustrative purposes only and are not to be construed as limiting the patent; for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
FIG. 1 is a schematic overall flow chart of an embodiment of the present invention;
FIG. 2 is a diagram of a network framework according to an embodiment of the present invention;
fig. 3 is a diagram of a detection network structure according to an embodiment of the present invention.
Detailed Description
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted" and "connected" are to be interpreted broadly, e.g., as fixedly connected, detachably connected, or integrally connected; as mechanically or electrically connected; or as directly connected or indirectly connected through intervening media, i.e., as communication between the two elements. The specific meaning of the above terms in the present invention can be understood by those skilled in the art on a case-by-case basis. The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Referring to figs. 1-3, a method for tracking multiple targets of a vehicle based on a target center point includes the following steps:
s1, acquiring a vehicle tracking data set, and performing image enhancement on the vehicle tracking data set;
s2, building a vehicle detection model, setting a hyper-parameter, and pre-training the vehicle detection model through the vehicle tracking data set;
s3, copying all weights from the vehicle detection model, adding 4 input channels and 2 output channels on the basis of the original vehicle detection model, and retraining to generate a vehicle tracking model;
and S4, inputting the video stream into the vehicle tracking model to obtain a vehicle multi-target tracking result.
Specifically, in step S2 the vehicle detection model uses an encoder-decoder fully convolutional network, as shown in fig. 2. The fully convolutional network directly produces a 4x down-sampled heat map, and no anchor points need to be set in advance, which greatly reduces network parameters and computation. The peaks of the heat map serve as the target center points extracted by the network, and a threshold is then applied to screen them and obtain the final target center points. All up-sampling is performed by deformable convolution, which makes the network's receptive field more accurate rather than being limited to a 3×3 rectangular convolution window. Meanwhile, the 4x down-sampled feature map has much higher resolution than that of a common network, so both large and small targets can be detected well without multi-scale prediction or a feature pyramid. The up-sampling uses transposed convolution, which differs greatly from the bilinear interpolation used in common up-sampling; transposed convolution recovers the semantic and positional information of the image better. To generate a heat map close to the real situation, the localization errors, missed detections and false detections encountered in practice are simulated during training by perturbing detected target points, rendering some false peaks, and randomly removing some detection results with a certain probability.
Specifically, the input to the network in step S3 is a pair of images together with a heat map rendered from the detections of the first image. Each peak position corresponds to a target center point; a Gaussian rendering method is used, the Gaussian parameters are adjusted according to the target size to blur the peak, and, to reduce the false alarm probability, only the targets whose scores exceed a certain threshold are rendered. The model outputs an offset vector from the center of each current object to the center of the corresponding object in the previous frame; this offset vector is learned as an additional attribute of the center point and therefore adds little extra computation. Given the center points and offsets, the objects of the current frame can be linked to the corresponding objects of the previous frame by a simple greedy matching strategy. In addition, during actual training, the previous frame I_{t-1} is not necessarily the immediately preceding frame; it can be another frame in the same video sequence. This form of augmentation reduces the model's sensitivity to the video frame rate.
As a preferred embodiment of the present invention, the vehicle tracking data set includes MOT, KITTI and nuScenes, and is divided into a training set, a test set and a validation set in a ratio of 6:2: 2.
As a preferred embodiment of the present invention, the image enhancement includes at least one of the following methods:
1) rotating the training image I_H in the vehicle tracking data set by different angles to generate four sub-images, denoted I_H^(i), i ∈ {−30°, −15°, +15°, +30°};
2) resizing the training image I_H; the resized sub-image is denoted I_H^(s);
3) applying pixel-wise binary segmentation to the training image I_H; the binarized sub-image is denoted I_H^(b), and its foreground count value is C.
As a preferred embodiment of the present invention, the step S2 specifically includes the following steps:
s21, cropping the images to a resolution of 960×544 and inputting them into the network, and setting the hyper-parameters: false positive rate λ_fp = 0.1, false negative rate λ_fn = 0.4, confidence threshold θ = 0.4, and heat map rendering threshold τ = 0.5;
s22, skip-connecting the low layers and the output layer with deformable convolution, which replaces the traditional up-sampling convolution. The output stride is 4 and the batch size is 12; 30 epochs are trained with the Adam optimizer, with the learning rate set to 1e-4 for the first 20 epochs and decayed from 1e-5 to 1e-6 over the last 10 epochs. Considering the imbalance between positive and negative samples, focal loss is taken as the loss function, in the penalty-reduced form used for center-point heat maps:

L_k = -(1/N) Σ_xyc { (1 - Ŷ_xyc)^α log(Ŷ_xyc),                if Y_xyc = 1
                    { (1 - Y_xyc)^β (Ŷ_xyc)^α log(1 - Ŷ_xyc),  otherwise

s23, splatting each key point p onto the down-sampled feature map at the low-resolution position p̃ = ⌊p/4⌋ using a Gaussian kernel, with the Gaussian parameters adjusted according to the target size to blur the peak;
s24, comparing every response point on the heat map with its 8 connected neighbours; if a point's response value is greater than or equal to all eight neighbour values, it is kept, and the first N peak points satisfying this condition are retained as the target center points. To reduce the false alarm probability, only the targets whose scores exceed a certain threshold are rendered, and finally a heat map of resolution 240×136 containing the target center points is obtained.
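The Gaussian splatting described in S23 might be sketched as follows; the fixed `sigma` is illustrative, whereas the method adjusts it with the target size, and `np.maximum` keeps overlapping targets from overwriting one another:

```python
import numpy as np

def render_center(heatmap, cx, cy, sigma):
    """Splat one target centre onto the heat map with a Gaussian kernel."""
    ys, xs = np.indices(heatmap.shape)
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    np.maximum(heatmap, g, out=heatmap)   # element-wise max, in place
    return heatmap

hm = np.zeros((136, 240))                 # 4x down-sampled 960x544 input
render_center(hm, cx=480 // 4, cy=272 // 4, sigma=3.0)
```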
As a preferred embodiment of the present invention, the step S3 specifically includes the following steps:
s31, copying all model weights from step S2 and keeping the basic hyper-parameters unchanged; adding a previous-frame image (3 channels) and the generated heat map (1 channel) to the input part of the original model, and adding to the output part a two-dimensional offset vector from each target center point in the current frame to its center point in the previous frame. During actual training, the previous frame I_{t-1} is not necessarily the immediately preceding frame and can be another frame in the same video sequence; this form of augmentation reduces the model's sensitivity to the video frame rate;
s32, adding 2 extra output channels to predict the 2-dimensional offset vectors, which describe the displacement of each object's position in the current frame relative to its position in the previous-frame image, so that detected targets can be linked across time;
s33, drawing target bounding boxes from the center points of the generated heat map and linking objects in the current frame to objects in the previous frame with a greedy matching strategy: if matching succeeds, the object inherits the ID of the previous-frame object; otherwise a new ID is assigned to the target.
As a preferred embodiment of the present invention, the step S3 further includes:
during model inference, heat maps are rendered from model predictions, which may contain a variable number of missed, false, and mislocalized detections. These conditions are simulated during training to improve system robustness, specifically:
applying a Gaussian perturbation to each detected target point in the prediction result of the previous frame, with a hyper-parameter λ_ft, to simulate target localization errors;
randomly rendering false peaks near the centers of ground-truth targets with a certain probability, setting the hyper-parameter λ_fp = 0.1 to simulate false detections;
randomly removing part of the detection results with a predetermined probability, setting the hyper-parameter λ_fn = 0.02 to simulate missed detections.
In the drawings, the positional relationship is described for illustrative purposes only and is not to be construed as limiting the present patent; it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (7)

1. A vehicle multi-target tracking method based on a target center point is characterized by comprising the following steps:
s1, acquiring a vehicle tracking data set, and performing image enhancement on the vehicle tracking data set;
s2, building a vehicle detection model, setting a hyper-parameter, and pre-training the vehicle detection model through the vehicle tracking data set;
s3, copying all weights from the vehicle detection model, adding 4 input channels and 2 output channels on the basis of the original vehicle detection model, and retraining to generate a vehicle tracking model;
and S4, inputting the video stream into the vehicle tracking model to obtain a vehicle multi-target tracking result.
2. The vehicle multi-target tracking method based on the target center points as claimed in claim 1, wherein the vehicle tracking data sets comprise MOT, KITTI and nuScenes, and are divided into a training set, a testing set and a verification set according to a ratio of 6:2: 2.
3. The method for multi-target tracking of vehicles based on target center points as claimed in claim 1, wherein the image enhancement comprises at least one of the following methods:
1) rotating the training image I_H in the vehicle tracking data set by different angles to generate four sub-images, denoted I_H^(i), i ∈ {−30°, −15°, +15°, +30°};
2) resizing the training image I_H; the resized sub-image is denoted I_H^(s);
3) applying pixel-wise binary segmentation to the training image I_H; the binarized sub-image is denoted I_H^(b), and its foreground count value is C.
4. The method for multiple target tracking of vehicles based on target center points as claimed in claim 1, wherein said step S2 specifically comprises the following steps:
s21, cropping the training images in the vehicle tracking data set to a resolution of 960×544 and inputting them into the vehicle detection model, and setting the hyper-parameters: false positive rate λ_fp = 0.1, false negative rate λ_fn = 0.4, confidence threshold θ = 0.4, and heat map rendering threshold τ = 0.5;
s22, selecting deformable convolution as the up-sampling convolution to skip-connect the low layers and the output layer of the vehicle detection model, with an output stride of 4 and a batch size of 12; training for 30 epochs with the Adam optimizer, with the learning rate set to 1e-4 for the first 20 epochs and decayed from 1e-5 to 1e-6 over the last 10 epochs; and adopting focal loss as the loss function:

L_k = -(1/N) Σ_xyc { (1 - Ŷ_xyc)^α log(Ŷ_xyc),                if Y_xyc = 1
                    { (1 - Y_xyc)^β (Ŷ_xyc)^α log(1 - Ŷ_xyc),  otherwise

s23, splatting each key point p onto the down-sampled feature map at the low-resolution position p̃ = ⌊p/4⌋ using a Gaussian kernel, with the Gaussian parameters adjusted according to the target size to blur the peak;
s24, comparing every response point on the heat map with its 8 connected neighbours; if a point's response value is greater than or equal to all 8 neighbour values it is kept, the first N peak points satisfying this condition are retained, and a heat map of resolution 240×136 containing the target center points is obtained.
5. The method for multiple target tracking of vehicles according to claim 4, wherein said step S24 further comprises:
in order to reduce the false alarm probability, only the targets with scores higher than a preset threshold value in the detection result are rendered.
6. The vehicle multi-target tracking method based on a target center point according to claim 1, wherein said step S3 specifically comprises the following steps:
S31, copying all weights from the vehicle detection model while keeping the basic hyper-parameters unchanged; adding the previous frame image and its generated heat map to the input of the original vehicle detection model, and adding to its output a two-dimensional offset vector from the target center point in the current frame to the corresponding center point in the previous frame;
S32, adding 2 extra output channels that predict the two-dimensional offset vectors describing the displacement of each object's position in the current frame relative to its position in the previous frame image, so that associations between detected targets can be established over time;
and S33, drawing target bounding boxes from the center points of the generated heat map and linking objects in the current frame to objects in the previous frame with a greedy matching strategy: if the matching succeeds, the object inherits the ID from the previous frame; otherwise, a new ID is assigned to the target.
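The greedy association of step S33 can be sketched as follows. This is a hypothetical illustration, not the patent's code: the offsets are the extra output channels from S32, pointing from each current-frame center back toward its previous-frame position, and the distance threshold is an assumed parameter.

```python
import numpy as np
from itertools import count

new_ids = count(1)  # generator for IDs assigned to unmatched (new) targets

def greedy_match(prev_tracks, detections, offsets, max_dist=30.0):
    """Associate current-frame detections with previous-frame tracks.

    prev_tracks: list of (track_id, (x, y)) from the previous frame.
    detections:  list of (x, y) center points in the current frame.
    offsets:     per-detection 2D vectors toward the previous position.
    Returns (track_id, (x, y)) pairs for the current frame.
    """
    used, result = set(), []
    for det, off in zip(detections, offsets):
        back = np.add(det, off)           # displaced center in the previous frame
        best_id, best_d = None, max_dist
        for tid, pos in prev_tracks:
            if tid in used:
                continue
            d = np.linalg.norm(back - np.asarray(pos))
            if d < best_d:
                best_id, best_d = tid, d
        if best_id is None:
            best_id = next(new_ids)       # no match: assign a fresh ID
        else:
            used.add(best_id)             # match: inherit the previous ID
        result.append((best_id, det))
    return result
```

CenterTrack-style trackers typically process detections in descending confidence order, so stronger detections claim IDs first; that ordering is omitted here for brevity.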
7. The vehicle multi-target tracking method based on a target center point according to claim 6, wherein said step S3 further comprises:
applying a Gaussian perturbation to each detected target point in the prediction result of the previous frame, with a hyper-parameter λft set to simulate target localization errors;
randomly rendering false peaks near the centers of ground-truth targets with a certain probability, with the hyper-parameter λfp set to 0.1 to simulate false detections;
and randomly removing part of the detection results with a predetermined probability, with the hyper-parameter λfn set to 0.02 to simulate missed detections.
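The three corruptions in claim 7 can be sketched as a single training-time augmentation routine. The probabilities λfp = 0.1 and λfn = 0.02 follow the claim, while the jitter scale for λft and the spatial spread of false peaks are assumed values for illustration; the function name is invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_prior_detections(centers, lam_ft=1.0, lam_fp=0.1, lam_fn=0.02):
    """Simulate localization error, false detections and missed detections
    on the previous frame's predicted center points, before rendering them
    into the prior heat map fed to the tracking model."""
    out = []
    for cx, cy in centers:
        if rng.random() < lam_fn:                  # missed detection: drop the point
            continue
        jx, jy = rng.normal(0.0, lam_ft, size=2)   # Gaussian localization jitter
        out.append((cx + jx, cy + jy))
        if rng.random() < lam_fp:                  # false peak near the GT center
            fx, fy = rng.normal(0.0, 8.0, size=2)  # assumed spread of false peaks
            out.append((cx + fx, cy + fy))
    return out
```

Training on such deliberately corrupted prior heat maps teaches the model not to trust the previous frame blindly, which is what makes the tracker robust to its own detection errors at inference time.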
CN202010590410.1A 2020-06-24 2020-06-24 Vehicle multi-target tracking method based on target center point Pending CN111753732A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010590410.1A CN111753732A (en) 2020-06-24 2020-06-24 Vehicle multi-target tracking method based on target center point

Publications (1)

Publication Number Publication Date
CN111753732A true CN111753732A (en) 2020-10-09

Family

ID=72677148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010590410.1A Pending CN111753732A (en) 2020-06-24 2020-06-24 Vehicle multi-target tracking method based on target center point

Country Status (1)

Country Link
CN (1) CN111753732A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110532894A (en) * 2019-08-05 2019-12-03 西安电子科技大学 Remote sensing target detection method based on boundary constraint CenterNet
CN110889449A (en) * 2019-11-27 2020-03-17 中国人民解放军国防科技大学 Edge-enhanced multi-scale remote sensing image building semantic feature extraction method
CN110992317A (en) * 2019-11-19 2020-04-10 佛山市南海区广工大数控装备协同创新研究院 PCB defect detection method based on semantic segmentation
CN111127516A (en) * 2019-12-19 2020-05-08 苏州智加科技有限公司 Target detection and tracking method and system without search box

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
XINGYI ZHOU et al.: "Objects as Points", arXiv, pages 1-12 *
XINGYI ZHOU et al.: "Tracking Objects as Points", arXiv, pages 1-22 *
XIA XUE et al.: "Apple detection model on trees based on a lightweight anchor-free deep convolutional neural network", Smart Agriculture, vol. 2, no. 1, pages 99-110 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257609A (en) * 2020-10-23 2021-01-22 重庆邮电大学 Vehicle detection method and device based on self-adaptive key point heat map
CN112561959A (en) * 2020-12-08 2021-03-26 佛山市南海区广工大数控装备协同创新研究院 Online vehicle multi-target tracking method based on neural network
CN113033573A (en) * 2021-03-16 2021-06-25 佛山市南海区广工大数控装备协同创新研究院 Method for improving detection performance of instance segmentation model based on data enhancement
CN113538523A (en) * 2021-09-17 2021-10-22 魔视智能科技(上海)有限公司 Parking space detection tracking method, electronic equipment and vehicle
CN114283175A (en) * 2021-12-28 2022-04-05 中国人民解放军国防科技大学 Vehicle multi-target tracking method and device based on traffic video monitoring scene
CN114283175B (en) * 2021-12-28 2024-02-02 中国人民解放军国防科技大学 Vehicle multi-target tracking method and device based on traffic video monitoring scene


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination