CN111767847A - Pedestrian multi-target tracking method integrating target detection and association - Google Patents

Pedestrian multi-target tracking method integrating target detection and association

Info

Publication number
CN111767847A
CN111767847A
Authority
CN
China
Prior art keywords
target
sub
model
detection
tracking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010605987.5A
Other languages
Chinese (zh)
Other versions
CN111767847B (en)
Inventor
杨航
杨海东
黄坤山
彭文瑜
林玉山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute
Foshan Guangdong University CNC Equipment Technology Development Co. Ltd
Original Assignee
Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute
Foshan Guangdong University CNC Equipment Technology Development Co. Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute, Foshan Guangdong University CNC Equipment Technology Development Co. Ltd filed Critical Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute
Priority to CN202010605987.5A priority Critical patent/CN111767847B/en
Publication of CN111767847A publication Critical patent/CN111767847A/en
Application granted granted Critical
Publication of CN111767847B publication Critical patent/CN111767847B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)

Abstract

The invention discloses a pedestrian multi-target tracking method that integrates target detection and association, comprising the following steps: training a tracking model network on a training data set to obtain a tracking model; passing the first frame of the video stream to be tracked through the detection sub-model and generating a bounding box for each pedestrian target from the heat map and offset vectors; extracting a feature vector for each pedestrian target with the appearance feature extraction sub-model and assigning an ID and a trajectory; passing the remaining frames of the video stream through the tracking model in sequence, determining the trajectory position of each pedestrian in the current frame from the similarity of the feature vectors in adjacent frames, and connecting the trajectory positions with the same ID across all frames to obtain the tracking result. The invention overcomes the slow speed and the frequent identity switches under object occlusion that afflict traditional tracking methods.

Description

Pedestrian multi-target tracking method integrating target detection and association
Technical Field
The invention relates to the field of pedestrian tracking, in particular to a pedestrian multi-target tracking method integrating target detection and association.
Background
The main task of multi-target tracking is, given an image sequence, to find the moving objects in it, match the moving objects across frames one by one, and output the motion trajectory of each object. These objects can be arbitrary, such as pedestrians, vehicles, athletes, or various animals, and pedestrian tracking is the most studied case. First, the pedestrian is a typical non-rigid target and therefore harder to track than a rigid one; second, detecting and tracking pedestrians has the greater commercial value in practical applications. According to incomplete statistics, at least 75% of multi-target tracking studies concern pedestrian tracking.
Thanks to the development and application of convolutional neural networks, many tasks in computer vision have advanced greatly, and many detection methods based on convolutional neural networks have been applied to image recognition. Extending deep learning to pedestrian multi-target tracking, i.e. designing a deep learning algorithm suited to the multi-target tracking problem, is a challenging subject. Common multi-target pedestrian tracking methods in the prior art split target detection and association into two steps trained on different data sets; this makes the tracking model slow to run, so real-time tracking is difficult to achieve.
Disclosure of Invention
In view of the shortcomings of the prior art, the present invention aims to provide a pedestrian multi-target tracking method that integrates target detection and association.
In order to achieve this purpose, the invention adopts the following technical scheme. A pedestrian multi-target tracking method integrating target detection and association comprises the following steps:
S01: training the tracking model network with a training data set to obtain a tracking model; the tracking model comprises a detection sub-model and an appearance feature extraction sub-model connected behind it; specifically:
S011: training the detection sub-model network with a first training data set to obtain the detection sub-model; the detection sub-model network comprises, in sequence, two convolution-pooling sub-networks and two recursive sub-networks connected in series, the output of the last convolution-pooling sub-network being connected to the input of the first recursive sub-network; each convolution-pooling sub-network comprises a convolutional layer and a pooling layer, with the output of the convolutional layer connected to the input of the pooling layer;
S012: training the appearance feature extraction sub-model network with a second training data set to obtain the appearance feature extraction sub-model; the appearance feature extraction sub-model network comprises, in sequence, a convolutional layer, a pooling layer, three residual blocks and a fully connected layer;
S02: inputting the video stream to be tracked into the tracking model to obtain the tracking result; specifically:
S021: passing the first frame of the video stream to be tracked through the detection sub-model, generating a bounding box for each pedestrian target from the heat map and offset vectors; extracting a feature vector for each pedestrian target with the appearance feature extraction sub-model and assigning an ID and a trajectory;
S022: passing the remaining frames of the video stream to be tracked through the tracking model in sequence: each frame passes through the detection sub-model, which generates bounding boxes for the pedestrian targets from the heat map and offset vectors, and then through the appearance feature extraction sub-model, which generates a feature vector for each pedestrian target; determining the trajectory position of each pedestrian in the current frame from the distance between feature vectors in adjacent frames, and connecting the trajectory positions with the same ID across all frames of the video stream to obtain the tracking result.
Further, step S011 specifically comprises:
S0111: passing the first training data set through the two convolution-pooling sub-networks and the two recursive sub-networks in sequence, outputting a heat map;
S0112: setting the hyper-parameters of the detection sub-model network, inputting the first training data set into the detection sub-model network for training, and determining the parameters of the detection sub-model network; the hyper-parameters of the detection sub-model network comprise the learning rate, the number of iterations, the batch size and the heat-map score threshold;
S0113: on the heat map, performing non-maximum suppression on the heat-map scores to extract peak keypoints, keeping the positions of the keypoints whose score exceeds the threshold, and computing the bounding-box coordinates to obtain a heat map containing bounding boxes.
Further, before each downsampling in the recursive sub-network, an upper half-path is split off to retain the original-scale information; after each upsampling, the data of the previous scale is added back; three residual blocks extract features between two successive downsamplings, and one residual block extracts features between two successive additions.
Further, the residual block comprises a convolution path and a skip path; the convolution path comprises three convolutional layers with different kernels connected in series, and the skip path comprises one convolutional layer with a 1×1 kernel.
Further, step S012 specifically comprises:
S0121: inputting the heat map containing bounding boxes into the appearance feature extraction sub-model network and setting a similarity threshold; two target features are judged to belong to the same target if their similarity is greater than or equal to the similarity threshold, and to different targets if their similarity is below it;
S0122: setting the hyper-parameters of the appearance feature extraction sub-model network, inputting the second training data set into it, and determining the parameters of the appearance feature extraction sub-model network; its hyper-parameters comprise the learning rate, the number of iterations, the batch size and the similarity threshold.
Further, in step S022, the distance between the detection targets in frames M-1 and M of the video stream to be tracked is measured from the similarity of the feature vectors and the IoU; when the distance is below a distance threshold, the match is considered successful and the target inherits the ID of the corresponding target in the previous frame; otherwise the match fails and the target is assigned a new ID.
Further, the similarity is calculated using the Mahalanobis distance or the Euclidean distance.
Further, in step S022, Kalman filtering is used to predict the position of each trajectory in the current frame, and if the distance between the predicted trajectory position and a detection to be linked exceeds the trajectory threshold, that distance is set to infinity.
Further, the first training data set comprises data sets from the ETH, CityPersons, CalTech, MOT17, CUHK-SYSU and PRW databases.
Further, the second training data set comprises data sets from the CalTech, MOT17, CUHK-SYSU and PRW databases.
The beneficial effects of the invention are as follows. To strike a good balance between precision and speed, additional skip connections are placed between low-level and high-level features, and intermediate supervision is adopted: the loss is computed on the output of both recursive sub-networks, so repeated bidirectional inference keeps the bottom-layer parameters properly updated and improves detection accuracy. The invention abandons the traditional separation of the detection part from the association part; integrating the two modules greatly reduces the computation and running time. Reducing the dimensionality of the feature vector in the appearance feature extraction sub-model mitigates overfitting, improves robustness, and further reduces computation and running time. The method overcomes the slow speed and frequent identity switches under object occlusion of traditional tracking methods and can be applied to video surveillance of areas with heavy pedestrian flow, such as intersections.
Drawings
FIG. 1 is a schematic view of a tracking model of the present invention;
FIG. 2 is a schematic diagram of a recursive subnetwork structure of the present invention;
FIG. 3 is a schematic diagram of a residual block structure according to the present invention;
FIG. 4 is a flow chart of the pedestrian tracking of the video stream to be tracked according to the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and the detailed description below:
the invention provides a pedestrian multi-target tracking method integrating target detection and association, which comprises the following steps:
S01: training the tracking model network with a training data set to obtain a tracking model; the tracking model comprises a detection sub-model and an appearance feature extraction sub-model connected behind it. The tracking model network is a preliminary model framework in which the model parameters are still undetermined; the purpose of training is to determine these parameters, and substituting the determined parameters into the model network yields the trained tracking model. The training data set adopted in the invention can be, but is not limited to, data collected from existing pedestrian databases. The ETH and CityPersons data sets provide only bounding-box annotations and can serve as a first training data set for training the detection sub-model network; the CalTech, MOT17, CUHK-SYSU and PRW databases provide bounding boxes and identity annotations, so they can serve as a second training data set for training the appearance feature extraction sub-model network and also as part of the first training data set for training the detection sub-model network.
The specific training process comprises the following steps: training the detection sub-model network, then training the appearance feature extraction sub-model network.
S011: training the detection sub-model network with the first training data set to obtain the detection sub-model; the detection sub-model network comprises, in sequence, two convolution-pooling sub-networks and two recursive sub-networks connected in series, the output of the last convolution-pooling sub-network being connected to the input of the first recursive sub-network; each convolution-pooling sub-network comprises a convolutional layer and a pooling layer, with the output of the convolutional layer connected to the input of the pooling layer. The specific training process of the detection sub-model network comprises the following steps:
S0111: the first training data set passes in sequence through the two convolution-pooling sub-networks and the two recursive sub-networks, and a heat map is output. As shown in fig. 1, the input passes through successive convolutional, pooling, convolutional and pooling layers; the convolutional layers use 3×3 kernels and the pooling layers use max pooling, yielding a feature map at 1/4 of the original resolution. This feature map is then input into the two successive recursive sub-networks, which output a heat map. The recursive sub-network structure is shown in fig. 2: from the left to the middle the depth of the feature maps increases while their resolution decreases, and from the middle to the right the depth decreases while the resolution increases. In the invention, before each downsampling in the recursive sub-network an upper half-path is split off to retain the original-scale information, and after each upsampling the data of the previous scale is added back. For example, the c4b layer is the combination of the c7 layer and the c4a layer. The c7 layer is upsampled to double its resolution (e.g. a 4×4 feature map becomes 8×8 after upsampling). The c4a layer has the same size as the c4 layer and can be regarded as a "copy" of it; its feature map is twice the size of c7's, so it matches the upsampled c7 exactly and the two can be added directly, giving the c4b layer.
The above process includes multiple pooling and upsampling steps: the pooling layers use max pooling to reduce redundancy, and the upsampling layers use nearest-neighbour interpolation. Before each downsampling a path is split off to retain the original-scale information, and after each upsampling it is fused with the features of the previous scale. Specifically, three residual blocks extract features between two successive downsamplings, and one residual block extracts features between two successive feature-map fusions. The advantage of this structure is that the features of an object may appear at different network layers; a conventional convolutional network easily loses such features, whereas skip connections combine multi-scale features effectively.
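As an illustration of this skip-and-add structure, the following is a minimal PyTorch sketch of one recursive (hourglass-style) sub-network; the module names, fixed channel width and recursion depth of 4 are assumptions for the sketch rather than values from the patent, and the stand-in res_blocks helper is a placeholder for the residual blocks described below.

```python
# A hedged sketch of one recursive sub-network; not the patent's exact network.
import torch
import torch.nn as nn

def res_blocks(ch, n):
    # Stand-in for n residual blocks (see the residual-block sketch below).
    return nn.Sequential(*[nn.Sequential(
        nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True)) for _ in range(n)])

class RecursiveSubnet(nn.Module):
    """One recursion level: split off an upper half-path before downsampling,
    recurse at half resolution, upsample, and add the previous scale back."""
    def __init__(self, ch, depth=4):
        super().__init__()
        self.skip = nn.Identity()            # upper half-path: a "copy" (e.g. c4 -> c4a)
        self.pool = nn.MaxPool2d(2)          # downsampling by max pooling
        self.down = res_blocks(ch, 3)        # three residual blocks between downsamplings
        self.inner = RecursiveSubnet(ch, depth - 1) if depth > 1 else res_blocks(ch, 1)
        self.up = nn.Upsample(scale_factor=2, mode='nearest')  # nearest-neighbour upsampling
        self.post = res_blocks(ch, 1)        # one residual block between two additions

    def forward(self, x):
        y = self.down(self.pool(x))          # lower path at half resolution
        y = self.up(self.inner(y))           # e.g. c7 upsampled to match c4a
        return self.post(self.skip(x) + y)   # e.g. c4b = c4a + up(c7)
```

In the full model, two such sub-networks would be cascaded, each followed by small output heads (not shown here) producing the single-channel heat map and the offset regression.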
The residual block in the recursive sub-network is shown in fig. 3 and comprises a convolution path and a skip path: the convolution path consists of three convolutional layers with different kernel sizes connected in series, and the skip path consists of one convolutional layer with a 1×1 kernel. M is the number of input channels (the input depth) and N the number of output channels (the output depth). The residual block can increase or decrease the depth of the feature map without changing its resolution: the convolution path extracts higher-level features while the skip path retains the original-level information, so the block changes only the data depth, never the data size. It can be regarded as a size-preserving, enhanced convolutional layer. This structure also keeps features from being lost easily and effectively mitigates the vanishing-gradient problem. Each recursive sub-network outputs a single-channel heat map at the same resolution as the original image and regresses an offset vector for each corresponding target point on the heat map.
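A minimal sketch of this residual block follows, assuming a 1×1, 3×3, 1×1 bottleneck layout for the three series convolutions; the patent states only that the three kernels differ, so the exact sizes and the halved intermediate width are assumptions.

```python
# A hedged sketch of the Fig. 3 residual block; kernel sizes are assumed.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, m, n):          # m: input channels (M), n: output channels (N)
        super().__init__()
        self.conv_path = nn.Sequential(            # extracts higher-level features
            nn.Conv2d(m, n // 2, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(n // 2, n // 2, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(n // 2, n, kernel_size=1),
        )
        # Skip path: a 1x1 convolution that only adapts depth (M -> N);
        # spatial resolution is unchanged, so the block never resizes the map.
        self.skip_path = nn.Conv2d(m, n, kernel_size=1)

    def forward(self, x):
        return self.conv_path(x) + self.skip_path(x)
```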
The invention cascades two recursive sub-networks, and the heat map generated by each recursive sub-network is compared with the ground truth to compute a loss. The value of each point on the heat map lies between 0 and 1: the closer it is to 1, the higher the predicted probability that the point is a target centre. Considering the imbalance between positive and negative samples, focal loss is used as the loss function:
$$
L_{heat} = -\frac{1}{N}\sum_{xy}
\begin{cases}
\left(1-\hat{Y}_{xy}\right)^{\alpha}\log\hat{Y}_{xy}, & Y_{xy}=1\\
\left(1-Y_{xy}\right)^{\beta}\,\hat{Y}_{xy}^{\,\alpha}\log\left(1-\hat{Y}_{xy}\right), & \text{otherwise}
\end{cases}
$$
where $\hat{Y}_{xy}$ is the predicted heat-map value at point $(x,y)$, $Y_{xy}$ the ground-truth heat map, $N$ the number of target centre points, and $\alpha$, $\beta$ the focusing hyper-parameters.
If gradient descent were applied directly over the whole network, the error from the output layer would be attenuated severely as it back-propagates through the many layers, i.e. the gradient would vanish. We therefore cascade two recursive sub-networks and combine them with intermediate supervision, using repeated bidirectional inference to ensure that the bottom-layer parameters are updated properly. Intermediate supervision means that the heat map output at each supervised stage of the recursive sub-networks is also treated as a prediction; this yields far better accuracy than supervising only the heat map output by the last layer, and a training method that supervises the intermediate layers in this way is called intermediate supervision.
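A sketch of the pixel-wise focal loss above in code; the focusing parameters alpha=2 and beta=4 are common defaults, not values taken from the patent. Under intermediate supervision this loss would be evaluated on the heat map of each of the two recursive sub-networks and the results summed.

```python
# A hedged sketch of the heat-map focal loss described above.
import torch

def heatmap_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """pred, gt: (B, 1, H, W) heat maps with values in (0, 1);
    gt is 1 at target centre points and decays around them."""
    pos = gt.eq(1).float()                  # positive sample mask (centre points)
    neg = 1.0 - pos
    pos_loss = pos * (1 - pred).pow(alpha) * torch.log(pred + eps)
    neg_loss = neg * (1 - gt).pow(beta) * pred.pow(alpha) * torch.log(1 - pred + eps)
    num_pos = pos.sum().clamp(min=1.0)      # normalise by the number of positives
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```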
S0112: setting the hyper-parameters of the detection sub-model network, inputting the first training data set into the detection sub-model network for training, and determining the parameters of the detection sub-model network; the hyper-parameters of the detection sub-model network comprise the learning rate, the number of iterations, the batch size and the heat-map score threshold;
S0113: on the obtained heat map, non-maximum suppression is performed on the heat-map scores to extract the peak keypoints, the positions of the keypoints whose score exceeds the threshold are kept, and the corresponding bounding-box coordinates are then computed from the estimated offset vectors and bounding-box sizes, obtaining a heat map containing bounding boxes.
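A sketch of this peak-extraction and decoding step: a 3×3 max-pool acts as the non-maximum suppression, and boxes are assembled from the regressed offset and size maps. The tensor layout, the threshold value and the size-map head are assumptions for the sketch.

```python
# A hedged sketch of S0113: heat-map NMS and bounding-box decoding.
import torch
import torch.nn.functional as F

def decode_detections(heat, offset, size, score_thresh=0.4):
    """heat: (1, H, W); offset, size: (2, H, W). The heat map is assumed to be
    at the original image resolution, as stated in the description.
    Returns a (k, 5) tensor of [x1, y1, x2, y2, score] boxes."""
    keep = (F.max_pool2d(heat[None], 3, stride=1, padding=1)[0] == heat)
    heat = heat * keep.float()                           # NMS: keep local peaks only
    ys, xs = torch.nonzero(heat[0] > score_thresh, as_tuple=True)
    scores = heat[0, ys, xs]
    cx = xs.float() + offset[0, ys, xs]                  # centre refined by the offset vector
    cy = ys.float() + offset[1, ys, xs]
    w, h = size[0, ys, xs], size[1, ys, xs]              # regressed box width and height
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, scores], dim=1)
```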
S012: training the appearance feature extraction sub-model network with the second training data set to obtain the appearance feature extraction sub-model; with reference to fig. 1, the appearance feature extraction sub-model network comprises, in sequence, a convolutional layer, a pooling layer, three residual blocks and a fully connected layer. The specific training process is as follows:
S0121: the heat map containing bounding boxes is input into the appearance feature extraction sub-model network, passing in sequence through the convolutional layer, the pooling layer, the three residual blocks and the final fully connected layer, which generates a 128-dimensional feature vector for each target in the heat map. The goal of the appearance information extraction module is to generate feature vectors that distinguish different objects: ideally, the distance between different objects should be greater than the distance between detections of the same object. The Mahalanobis distance can be used as the metric; a Mahalanobis distance threshold is set before training, two targets being judged the same when the Mahalanobis distance between their feature vectors is below the threshold and different when it is greater than or equal to the threshold.
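A sketch of the appearance feature extraction head and the same-target test, here with the Euclidean distance on L2-normalised vectors (the Mahalanobis variant additionally requires a covariance estimate); the channel widths, the crop-based input and the threshold value are assumptions.

```python
# A hedged sketch of the appearance feature extraction sub-model.
import torch
import torch.nn as nn

class _Res(nn.Module):
    # Minimal residual block (cf. the Fig. 3 sketch): conv path plus a
    # 1x1 skip path that adapts the depth from m to n channels.
    def __init__(self, m, n):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(m, n, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(n, n, 3, padding=1))
        self.skip = nn.Conv2d(m, n, 1)
    def forward(self, x):
        return self.conv(x) + self.skip(x)

class AppearanceHead(nn.Module):
    """Conv -> pool -> three residual blocks -> fully connected 128-d embedding."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),  # convolutional layer
            nn.MaxPool2d(2),                                        # pooling layer
            _Res(32, 64), _Res(64, 64), _Res(64, 128),              # three residual blocks
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, emb_dim)    # low-dimensional vector mitigates overfitting

    def forward(self, crops):                # crops: (B, 3, H, W) box regions
        f = self.fc(self.backbone(crops).flatten(1))
        return f / f.norm(dim=1, keepdim=True)   # L2-normalised 128-d feature vectors

def same_target(f1, f2, dist_thresh=0.6):    # threshold value is illustrative
    """Two detections are judged the same target iff their distance is below the threshold."""
    return torch.dist(f1, f2).item() < dist_thresh
```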
S0122: setting the hyper-parameters of the appearance feature extraction sub-model network, inputting the second training data set into it, and determining the parameters of the appearance feature extraction sub-model network; its hyper-parameters comprise the learning rate, the number of iterations, the batch size and the similarity threshold.
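The hyper-parameters of the two sub-model networks (S0112 and S0122) might be gathered as a configuration such as the following; every numeric value below is an illustrative assumption, as the patent does not disclose concrete settings.

```python
# A hedged sketch of the hyper-parameter configuration; all values assumed.
detection_hparams = dict(
    learning_rate=1e-4,        # learning rate
    num_iterations=60_000,     # number of iterations
    batch_size=16,             # batch size
    heatmap_score_thresh=0.4,  # heat-map score threshold for peak keypoints
)
appearance_hparams = dict(
    learning_rate=1e-4,
    num_iterations=30_000,
    batch_size=32,
    similarity_thresh=0.6,     # same-target similarity threshold
)
```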
S02: as shown in fig. 4, a video stream to be tracked is input into a tracking model to obtain a tracking result; the method specifically comprises the following steps:
s021: a first frame of image in a video stream to be tracked passes through a detection sub-model, and a boundary frame of a pedestrian target is generated according to the heat map and the offset vector; extracting a characteristic vector for each pedestrian target through an appearance characteristic extraction sub-model and distributing an ID and a track;
s022: other frames of images in the video stream to be tracked sequentially pass through the tracking model, each frame of image passes through the detection sub-model, and a boundary frame of the pedestrian target is generated according to the heat map and the offset vector; then generating a feature vector of the pedestrian target through an appearance feature extraction submodel;
and determining the track position corresponding to each pedestrian in the current frame image according to the distance between the feature vectors in two adjacent frame images in the video stream to be tracked, and connecting the track positions corresponding to the same ID in all the frame images in the video stream to be tracked, namely the track positions corresponding to the frame images are the tracking results.
For example, for frame M, the distance between the detection targets in frame M-1 and frame M is measured from the similarity of the feature vectors between targets and from their IoU (Intersection over Union), the similarity being calculated with the Mahalanobis or Euclidean distance. When the distance is below the set threshold, the match is considered successful, i.e. the two are the same pedestrian target, and the pedestrian target in frame M inherits the ID of the corresponding pedestrian target in frame M-1; if the match fails, the target is assigned a new ID. The invention can also use Kalman filtering to predict the position of each trajectory in the current frame: if the distance between the predicted trajectory position and a detection to be linked exceeds the trajectory threshold, that distance is set to infinity, which effectively prevents linking a detection to a trajectory that has moved implausibly far.
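A sketch of this association step: a cost mixing appearance distance with (1 - IoU), an infinite gate for detections too far from the Kalman-predicted trajectory position, and threshold matching that inherits or assigns IDs. The weighting, the thresholds and the greedy matching strategy are assumptions beyond what the patent specifies.

```python
# A hedged sketch of frame-to-frame association (S022); not the patent's exact rule.
import numpy as np

def iou(a, b):
    """a, b: [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def associate(tracks, dets, w_app=0.7, dist_thresh=0.5, gate=100.0):
    """tracks/dets: lists of dicts with 'feat' (128-d, L2-normalised) and 'box';
    each track also carries a Kalman-predicted centre 'pred_xy'."""
    cost = np.full((len(tracks), len(dets)), np.inf)
    for i, t in enumerate(tracks):
        for j, d in enumerate(dets):
            cx = (d['box'][0] + d['box'][2]) / 2
            cy = (d['box'][1] + d['box'][3]) / 2
            if np.hypot(cx - t['pred_xy'][0], cy - t['pred_xy'][1]) > gate:
                continue                        # Kalman gate: distance stays infinite
            app = np.linalg.norm(t['feat'] - d['feat'])
            cost[i, j] = w_app * app + (1 - w_app) * (1 - iou(t['box'], d['box']))
    matches, new_ids = [], []
    for j in range(len(dets)):                  # greedy: best track per detection
        i = int(cost[:, j].argmin()) if len(tracks) else -1
        if i >= 0 and cost[i, j] < dist_thresh:
            matches.append((i, j))              # detection inherits track i's ID
            cost[i, :] = np.inf                 # a track matches at most once
        else:
            new_ids.append(j)                   # unmatched: assign a new ID
    return matches, new_ids
```

A Hungarian (optimal) assignment could replace the greedy loop without changing the rest of the interface.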
Various other modifications and changes may be made by those skilled in the art based on the above-described technical solutions and concepts, and all such modifications and changes should fall within the scope of the claims of the present invention.

Claims (10)

1. A pedestrian multi-target tracking method integrating target detection and association is characterized by comprising the following steps:
s01: training the tracking model network by adopting a training data set to obtain a tracking model; the tracking model comprises a detection submodel and an appearance characteristic extraction submodel connected behind the detection submodel; the method specifically comprises the following steps:
s011: training the detection submodel network by adopting a first training data set to obtain a detection submodel; the detection sub-model network sequentially comprises two convolution pooling sub-networks and two recursion sub-networks which are connected together, wherein the output end of one convolution pooling sub-network is connected with the input end of one of the recursion sub-networks; the convolution pooling sub-network comprises a convolution layer and a pooling layer, and the output end of the convolution layer is connected with the input end of the pooling layer;
s012: training the appearance characteristic extraction sub-model network by adopting a second training data set to obtain an appearance characteristic extraction sub-model; the appearance characteristic extraction sub-model network sequentially comprises a convolution layer, a pooling layer, three residual blocks and a full connection layer;
s02: inputting a video stream to be tracked into a tracking model to obtain a tracking result; the method specifically comprises the following steps:
s021: a first frame of image in the video stream to be tracked passes through the detection sub-model, and a boundary frame of the pedestrian target is generated according to the heat map and the offset vector; extracting a characteristic vector for each pedestrian target through an appearance characteristic extraction sub-model and distributing an ID and a track;
s022: other frames of images in the video stream to be tracked sequentially pass through the tracking model, each frame of image passes through the detection sub-model, and a boundary frame of the pedestrian target is generated according to the heat map and the offset vector; then generating a feature vector of the pedestrian target through an appearance feature extraction submodel; and determining the track position corresponding to each pedestrian in the current frame image according to the distance between the feature vectors in two adjacent frame images in the video stream to be tracked, and connecting the track positions corresponding to the same ID in all the frame images in the video stream to be tracked, namely the track positions corresponding to the frame images are the tracking results.
2. The pedestrian multi-target tracking method integrating target detection and association according to claim 1, wherein step S011 specifically comprises:
S0111: passing the first training data set through the two convolution-pooling sub-networks and the two recursive sub-networks in sequence, outputting a heat map;
S0112: setting the hyper-parameters of the detection sub-model network, inputting the first training data set into the detection sub-model network for training, and determining the parameters of the detection sub-model network; the hyper-parameters of the detection sub-model network comprise the learning rate, the number of iterations, the batch size and the heat-map score threshold;
S0113: on the heat map, performing non-maximum suppression on the heat-map scores to extract peak keypoints, keeping the positions of the keypoints whose score exceeds the threshold, and computing the bounding-box coordinates to obtain a heat map containing bounding boxes.
3. The pedestrian multi-target tracking method integrating target detection and association according to claim 2, wherein before each downsampling in the recursive sub-network an upper half-path is split off to retain the original-scale information; after each upsampling, the data of the previous scale is added back; three residual blocks extract features between two successive downsamplings, and one residual block extracts features between two successive additions.
4. The pedestrian multi-target tracking method integrating target detection and association according to claim 3, wherein the residual block comprises a convolution path and a skip path, the convolution path comprising three convolutional layers with different kernels connected in series and the skip path comprising one convolutional layer with a 1×1 kernel.
5. The pedestrian multi-target tracking method integrating target detection and association according to claim 2, wherein step S012 specifically comprises:
S0121: inputting the heat map containing bounding boxes into the appearance feature extraction sub-model network and setting a similarity threshold; two target features are judged to belong to the same target if their similarity is greater than or equal to the similarity threshold, and to different targets if their similarity is below it;
S0122: setting the hyper-parameters of the appearance feature extraction sub-model network, inputting the second training data set into it, and determining the parameters of the appearance feature extraction sub-model network; its hyper-parameters comprise the learning rate, the number of iterations, the batch size and the similarity threshold.
6. The pedestrian multi-target tracking method integrating target detection and association according to claim 1, wherein in step S022 the distance between the detection targets in frames M-1 and M of the video stream to be tracked is measured from the similarity of the feature vectors and the IoU; when the distance is below a distance threshold, the match is considered successful and the target inherits the ID of the corresponding target in the previous frame; otherwise the match fails and the target is assigned a new ID.
7. The pedestrian multi-target tracking method integrating target detection and association according to claim 6, wherein the similarity is calculated using the Mahalanobis distance or the Euclidean distance.
8. The pedestrian multi-target tracking method integrating target detection and association according to claim 1, wherein in step S022 Kalman filtering is used to predict the position of each trajectory in the current frame, and if the distance between the predicted trajectory position and a detection to be linked exceeds the trajectory threshold, that distance is set to infinity.
9. The pedestrian multi-target tracking method integrating target detection and association according to claim 1, wherein the first training data set comprises data sets from the ETH, CityPersons, CalTech, MOT17, CUHK-SYSU and PRW databases.
10. The pedestrian multi-target tracking method integrating target detection and association according to claim 1, wherein the second training data set comprises data sets from the CalTech, MOT17, CUHK-SYSU and PRW databases.
CN202010605987.5A 2020-06-29 2020-06-29 Pedestrian multi-target tracking method integrating target detection and association Active CN111767847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010605987.5A CN111767847B (en) 2020-06-29 2020-06-29 Pedestrian multi-target tracking method integrating target detection and association

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010605987.5A CN111767847B (en) 2020-06-29 2020-06-29 Pedestrian multi-target tracking method integrating target detection and association

Publications (2)

Publication Number Publication Date
CN111767847A true CN111767847A (en) 2020-10-13
CN111767847B CN111767847B (en) 2023-06-09

Family

ID=72722916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010605987.5A Active CN111767847B (en) 2020-06-29 2020-06-29 Pedestrian multi-target tracking method integrating target detection and association

Country Status (1)

Country Link
CN (1) CN111767847B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112258559A (en) * 2020-10-26 2021-01-22 上海萱闱医疗科技有限公司 Intelligent running timing scoring system and method based on multi-target tracking
CN112417988A (en) * 2020-10-30 2021-02-26 深圳点猫科技有限公司 Video multi-target tracking method, device and equipment based on deep learning
CN113221787A (en) * 2021-05-18 2021-08-06 西安电子科技大学 Pedestrian multi-target tracking method based on multivariate difference fusion
CN113362372A (en) * 2021-05-25 2021-09-07 同济大学 Single target tracking method and computer readable medium
CN113378704A (en) * 2021-06-09 2021-09-10 武汉理工大学 Multi-target detection method, equipment and storage medium
CN113744220A (en) * 2021-08-25 2021-12-03 中国科学院国家空间科学中心 PYNQ-based preselection-frame-free detection system
CN114998999A (en) * 2022-07-21 2022-09-02 之江实验室 Multi-target tracking method and device based on multi-frame input and track smoothing

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086648A (en) * 2018-05-24 2018-12-25 同济大学 A kind of method for tracking target merging target detection and characteristic matching
US20190005657A1 (en) * 2017-06-30 2019-01-03 Baidu Online Network Technology (Beijing) Co., Ltd. Multiple targets-tracking method and apparatus, device and storage medium
CN109271933A (en) * 2018-09-17 2019-01-25 北京航空航天大学青岛研究院 The method for carrying out 3 D human body Attitude estimation based on video flowing
CN109816690A (en) * 2018-12-25 2019-05-28 北京飞搜科技有限公司 Multi-target tracking method and system based on depth characteristic
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN110188690A (en) * 2019-05-30 2019-08-30 青岛伴星智能科技有限公司 A kind of intelligent vision analysis system based on unmanned plane, intelligent vision analysis system and method
CN110728698A (en) * 2019-09-30 2020-01-24 天津大学 Multi-target tracking model based on composite cyclic neural network system
CN111126152A (en) * 2019-11-25 2020-05-08 国网信通亿力科技有限责任公司 Video-based multi-target pedestrian detection and tracking method

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190005657A1 (en) * 2017-06-30 2019-01-03 Baidu Online Network Technology (Beijing) Co., Ltd. Multiple targets-tracking method and apparatus, device and storage medium
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN109086648A (en) * 2018-05-24 2018-12-25 同济大学 A kind of method for tracking target merging target detection and characteristic matching
CN109271933A (en) * 2018-09-17 2019-01-25 北京航空航天大学青岛研究院 The method for carrying out 3 D human body Attitude estimation based on video flowing
CN109816690A (en) * 2018-12-25 2019-05-28 北京飞搜科技有限公司 Multi-target tracking method and system based on depth characteristic
CN110188690A (en) * 2019-05-30 2019-08-30 青岛伴星智能科技有限公司 A kind of intelligent vision analysis system based on unmanned plane, intelligent vision analysis system and method
CN110728698A (en) * 2019-09-30 2020-01-24 天津大学 Multi-target tracking model based on composite cyclic neural network system
CN111126152A (en) * 2019-11-25 2020-05-08 国网信通亿力科技有限责任公司 Video-based multi-target pedestrian detection and tracking method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘嘉威: "Research on Pedestrian Re-identification Algorithms in Video Surveillance", China Master's and Doctoral Dissertations Full-text Database (Doctoral), Information Science and Technology Series *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112258559A (en) * 2020-10-26 2021-01-22 上海萱闱医疗科技有限公司 Intelligent running timing scoring system and method based on multi-target tracking
CN112258559B (en) * 2020-10-26 2024-05-07 上海萱闱医疗科技有限公司 Intelligent running timing scoring system and method based on multi-target tracking
CN112417988A (en) * 2020-10-30 2021-02-26 深圳点猫科技有限公司 Video multi-target tracking method, device and equipment based on deep learning
CN113221787A (en) * 2021-05-18 2021-08-06 西安电子科技大学 Pedestrian multi-target tracking method based on multivariate difference fusion
CN113221787B (en) * 2021-05-18 2023-09-29 西安电子科技大学 Pedestrian multi-target tracking method based on multi-element difference fusion
CN113362372A (en) * 2021-05-25 2021-09-07 同济大学 Single target tracking method and computer readable medium
CN113362372B (en) * 2021-05-25 2023-05-02 同济大学 Single target tracking method and computer readable medium
CN113378704A (en) * 2021-06-09 2021-09-10 武汉理工大学 Multi-target detection method, equipment and storage medium
CN113744220A (en) * 2021-08-25 2021-12-03 中国科学院国家空间科学中心 PYNQ-based preselection-frame-free detection system
CN113744220B (en) * 2021-08-25 2024-03-26 中国科学院国家空间科学中心 PYNQ-based detection system without preselection frame
CN114998999A (en) * 2022-07-21 2022-09-02 之江实验室 Multi-target tracking method and device based on multi-frame input and track smoothing
CN114998999B (en) * 2022-07-21 2022-12-06 之江实验室 Multi-target tracking method and device based on multi-frame input and track smoothing

Also Published As

Publication number Publication date
CN111767847B (en) 2023-06-09

Similar Documents

Publication Publication Date Title
CN111767847B (en) Pedestrian multi-target tracking method integrating target detection and association
CN110400332B (en) Target detection tracking method and device and computer equipment
CN112184752A (en) Video target tracking method based on pyramid convolution
CN111476817A (en) Multi-target pedestrian detection tracking method based on yolov3
CN112560656A (en) Pedestrian multi-target tracking method combining attention machine system and end-to-end training
CN110991274B (en) Pedestrian tumbling detection method based on Gaussian mixture model and neural network
CN112651995A (en) On-line multi-target tracking method based on multifunctional aggregation and tracking simulation training
CN113221787A (en) Pedestrian multi-target tracking method based on multivariate difference fusion
Munir et al. LDNet: End-to-end lane marking detection approach using a dynamic vision sensor
CN117252904B (en) Target tracking method and system based on long-range space perception and channel enhancement
CN116402850A (en) Multi-target tracking method for intelligent driving
Han et al. A method based on multi-convolution layers joint and generative adversarial networks for vehicle detection
CN112287906A (en) Template matching tracking method and system based on depth feature fusion
CN113129336A (en) End-to-end multi-vehicle tracking method, system and computer readable medium
CN110111358B (en) Target tracking method based on multilayer time sequence filtering
CN115731517B (en) Crowded Crowd detection method based on crown-RetinaNet network
CN111862147A (en) Method for tracking multiple vehicles and multiple human targets in video
CN113012193A (en) Multi-pedestrian tracking method based on deep learning
CN111091583A (en) Long-term target tracking method
CN114972434A (en) End-to-end multi-target tracking system for cascade detection and matching
CN112613472B (en) Pedestrian detection method and system based on deep search matching
Amini et al. New approach to road detection in challenging outdoor environment for autonomous vehicle
Yi et al. Human action recognition based on skeleton features
Ranjbar et al. Scene novelty prediction from unsupervised discriminative feature learning
Li et al. MULS-Net: A Multilevel Supervised Network for Ship Tracking From Low-Resolution Remote-Sensing Image Sequences

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant