CN111767847B - Pedestrian multi-target tracking method integrating target detection and association

Info

Publication number
CN111767847B
CN111767847B
Authority
CN
China
Prior art keywords
model
sub
target
tracking
network
Prior art date
Legal status
Active
Application number
CN202010605987.5A
Other languages
Chinese (zh)
Other versions
CN111767847A (en)
Inventor
杨航
杨海东
黄坤山
彭文瑜
林玉山
Current Assignee
Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute
Foshan Guangdong University CNC Equipment Technology Development Co. Ltd
Original Assignee
Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute
Foshan Guangdong University CNC Equipment Technology Development Co. Ltd
Priority date
Filing date
Publication date
Application filed by Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute and Foshan Guangdong University CNC Equipment Technology Development Co. Ltd
Priority to CN202010605987.5A
Publication of CN111767847A
Application granted
Publication of CN111767847B

Classifications

    • G06F18/22 Pattern recognition; matching criteria, e.g. proximity measures
    • G06F18/214 Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N3/045 Neural networks; combinations of networks
    • G06V10/462 Extraction of image or video features; salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V20/41 Scenes; higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V2201/07 Indexing scheme relating to image or video recognition or understanding; target detection
    • Y02T10/40 Climate change mitigation technologies related to transportation; engine management systems

Abstract

The invention discloses a pedestrian multi-target tracking method integrating target detection and association, which comprises the following steps: training a tracking model network with a training data set to obtain a tracking model; passing the first frame of the video stream to be tracked through the detection sub-model, which generates a bounding box for each pedestrian target from the heat map and offset vectors; extracting a feature vector for each pedestrian target with the appearance feature extraction sub-model and assigning IDs and tracks; then passing the remaining frames of the video stream through the tracking model in sequence, determining the track position of each pedestrian in the current frame from the similarity of the feature vectors in adjacent frames, and connecting the track positions that share the same ID across all frames to obtain the tracking result. The invention overcomes the slow running speed and the frequent identity switching under object occlusion that afflict traditional tracking methods.

Description

Pedestrian multi-target tracking method integrating target detection and association
Technical Field
The invention relates to the field of pedestrian tracking, in particular to a pedestrian multi-target tracking method integrating target detection and association.
Background
The main task of multi-target tracking is, given an image sequence, to find the moving objects in it, to put the moving objects in different frames into one-to-one correspondence, and then to output the motion trajectory of each object. These objects can be arbitrary, such as pedestrians, vehicles, athletes or various animals, but pedestrian tracking is the most studied. This is because, firstly, the pedestrian is a typical non-rigid object and therefore harder to track than a rigid one, and secondly, detection and tracking of pedestrians has greater commercial value in practical applications. According to incomplete statistics, at least 75% of multi-target tracking research is devoted to pedestrian tracking.
Thanks to the development and application of convolutional neural networks, many tasks in computer vision have advanced greatly, and many convolutional-neural-network-based target detection methods have been applied to image recognition problems. Extending deep learning to pedestrian multi-target tracking, and designing deep learning algorithms suited to the multi-target tracking problem, remains a challenging task. Common prior-art methods split target detection and association into two separate steps that are trained on different data sets; as a result the tracking model runs slowly and real-time tracking is difficult to achieve.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide a pedestrian multi-target tracking method integrating target detection and association.
In order to achieve the above purpose, the invention adopts the following technical scheme: a pedestrian multi-target tracking method integrating target detection and association, comprising the following steps:
s01: training a tracking model network with a training data set to obtain a tracking model; the tracking model comprises a detection sub-model and an appearance feature extraction sub-model connected after the detection sub-model; specifically:
s011: training the detection sub-model network with a first training data set to obtain the detection sub-model; the detection sub-model network comprises, in order, two convolution pooling sub-networks and two recursive sub-networks connected together, the output of one convolution pooling sub-network feeding the input of one recursive sub-network; each convolution pooling sub-network comprises a convolution layer whose output feeds a pooling layer;
s012: training the appearance feature extraction sub-model network with a second training data set to obtain the appearance feature extraction sub-model; the appearance feature extraction sub-model network comprises, in order, a convolution layer, a pooling layer, three residual blocks and a fully connected layer;
s02: inputting the video stream to be tracked into the tracking model to obtain the tracking result; specifically:
s021: the first frame of the video stream to be tracked passes through the detection sub-model first, which generates a bounding box for each pedestrian target from the heat map and offset vectors; a feature vector is then extracted for each pedestrian target by the appearance feature extraction sub-model, and IDs and tracks are assigned;
s022: the remaining frames of the video stream pass through the tracking model in sequence: each frame passes through the detection sub-model, which generates bounding boxes for the pedestrian targets from the heat map and offset vectors, and then through the appearance feature extraction sub-model, which generates the feature vector of each pedestrian target; the track position of each pedestrian in the current frame is determined from the distance between the feature vectors in two adjacent frames, and the track positions that share the same ID across all frames are connected to give the tracking result.
Further, step S011 specifically includes:
s0111: the first training data set passes, in order, through the two convolution pooling sub-networks and the two recursive sub-networks, which output a heat map;
s0112: setting the hyperparameters of the detection sub-model network, inputting the first training data set into the detection sub-model network for training, and determining the parameters of the detection sub-model network; the hyperparameters of the detection sub-model network comprise the learning rate, the number of iterations, the batch size and the heat map score threshold;
s0113: in the heat map, non-maximum suppression is performed according to the heat map scores to extract peak key points, the key points whose scores exceed the threshold are retained, and the bounding box coordinates are calculated to obtain a heat map with bounding boxes.
Further, before each downsampling in the recursive sub-network, a branch is split off to preserve the original-scale information; after each upsampling, the result is added to the features of the previous scale; between two downsampling steps, three residual blocks are used to extract features; between two additions, one residual block is used to extract features.
Further, the residual block comprises a convolution path and a skip path, the convolution path being formed by three convolution layers with different kernels connected in series, and the skip path comprising a convolution layer with a 1×1 kernel.
Further, step S012 specifically includes:
s0121: inputting the heat map containing the bounding boxes into the appearance feature extraction sub-model network and setting a similarity threshold; if the similarity between two target features is greater than or equal to the similarity threshold, they are judged to be the same target, and if it is less than the similarity threshold, they are judged to be different targets;
s0122: setting the hyperparameters of the appearance feature extraction sub-model network, inputting the second training data set into the appearance feature extraction sub-model network, and determining the parameters of the appearance feature extraction sub-model network; the hyperparameters of the appearance feature extraction sub-model network comprise the learning rate, the number of iterations, the batch size and the similarity threshold.
Further, in step S022, for the M-th frame of the video stream to be tracked, the distance between the targets detected in the M-1-th and M-th frames is measured from the similarity of the feature vectors and the IOU; when the distance is smaller than the distance threshold the match is considered successful and the target inherits the ID of the corresponding target in the previous frame; otherwise the match is unsuccessful and a new ID is assigned to the target.
Further, the similarity is calculated using the Mahalanobis distance or the Euclidean distance.
Further, in step S022, the position of each track in the current frame is predicted by Kalman filtering, and if the distance between a track's predicted position and a candidate detection exceeds the track threshold, that distance is set to infinity.
Further, the first training data set includes data sets from the ETH, CityPerson, CalTech, MOT17, CUHK-SYSU and PRW databases.
Further, the second training data set includes data sets from the CalTech, MOT17, CUHK-SYSU and PRW databases.
The invention has the beneficial effects that: to strike a good balance between precision and speed, more skip connections are arranged between low-level and high-level features; at the same time, intermediate supervision is adopted to compute the loss on the outputs of both recursive sub-networks, and repeated bidirectional inference ensures that the bottom-layer parameters are updated normally, improving detection accuracy. The invention abandons the traditional way of separating the detection and association parts; by integrating the two modules, the computation and the running time are greatly reduced. Meanwhile, in the appearance feature extraction sub-model, the dimension of the feature vector is reduced, which mitigates overfitting, improves robustness, and further reduces computation and running time. The method overcomes the slow speed and the frequent identity switching under object occlusion of traditional tracking methods, and can be applied to video surveillance in areas with heavy pedestrian traffic, such as intersections.
Drawings
FIG. 1 is a schematic diagram of a tracking model of the present invention;
FIG. 2 is a schematic diagram of a recursive subnetwork structure according to the present invention;
FIG. 3 is a schematic diagram of the residual block structure of the present invention;
FIG. 4 is a flowchart of pedestrian tracking performed on a video stream to be tracked according to the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and detailed description below:
the invention provides a pedestrian multi-target tracking method integrating target detection and association, which comprises the following steps:
s01: training a tracking model network with a training data set to obtain a tracking model; the tracking model includes a detection sub-model and an appearance feature extraction sub-model connected after the detection sub-model. The tracking model network is a preliminary model framework in which none of the model parameters has yet been determined. The purpose of training is to determine these parameters; substituting the determined parameters into the model network yields the trained tracking model. The training data sets employed in the invention may be, but are not limited to, data sets collected from existing pedestrian databases. The ETH and CityPerson data sets provide only bounding-box annotations, so they can be used as the first training data set for training the detection sub-model network; the CalTech, MOT17, CUHK-SYSU and PRW databases provide both bounding boxes and identity annotations, so they can serve both as the second training data set for training the appearance feature extraction sub-model network and as part of the first training data set for training the detection sub-model network.
The specific training process comprises training the detection sub-model network and training the appearance feature extraction sub-model network, as follows:
s011: training the detection sub-model network with the first training data set to obtain the detection sub-model; the detection sub-model network comprises, in order, two convolution pooling sub-networks and two recursive sub-networks connected together, the output of each convolution pooling sub-network feeding the input of a recursive sub-network; each convolution pooling sub-network comprises a convolution layer whose output feeds a pooling layer. The specific training process for the detection sub-model network comprises the following steps:
s0111: the first training data set passes, in order, through the two convolution pooling sub-networks and the two recursive sub-networks, which output a heat map. As shown in FIG. 1, the input first passes through a convolution layer, a pooling layer, a convolution layer and a pooling layer in succession; the convolution layers use 3×3 kernels and the pooling layers use max pooling, yielding a feature map whose resolution is 1/4 of the original image. This feature map is then fed into two successive recursive sub-networks, which output a heat map. The recursive sub-network structure is shown in FIG. 2: from the input on the left to the middle, the channel dimension increases while the feature-map resolution decreases; from the middle to the output, the channel dimension decreases while the resolution increases. Before each downsampling in the recursive sub-network, a branch is split off to preserve the original-scale information; after each upsampling, the result is added to the features of the previous scale. For example, the c4b layer is fused from the c7 layer and the c4a layer: the c7 layer doubles its resolution by upsampling (for example, a 4×4 feature map becomes 8×8 after upsampling). The c4a layer has the same size as the c4 layer and can be regarded as a copy of it; its feature map is twice the size of the c7 layer's, which exactly matches the upsampled c7 layer, so the two can be added elementwise to obtain the c4b layer.
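As a concrete illustration, the following minimal PyTorch sketch shows the two convolution pooling sub-networks described above; the channel widths are illustrative assumptions, since the description fixes only the 3×3 kernels, the max pooling and the resulting 1/4 resolution.

    import torch
    import torch.nn as nn

    class ConvPoolStem(nn.Module):
        """Two convolution pooling sub-networks; each stage halves the resolution."""
        def __init__(self, in_ch=3, mid_ch=64, out_ch=128):
            super().__init__()
            self.stage1 = nn.Sequential(
                nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1),  # 3x3 convolution
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=2, stride=2),               # max pooling: 1/2 resolution
            )
            self.stage2 = nn.Sequential(
                nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=2, stride=2),               # 1/4 resolution overall
            )

        def forward(self, x):
            return self.stage2(self.stage1(x))

    # e.g. a 1x3x512x512 input yields a 1x128x128x128 feature map (1/4 resolution)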
The above process involves several pooling and upsampling operations: the pooling layers use max pooling to reduce redundancy, and the upsampling layers use nearest-neighbour interpolation. Before each downsampling, one path is split off to retain the original-scale information, and after each upsampling the two paths are fused with the features of the previous scale. Specifically, between two downsampling operations, three residual blocks are used to extract features, and between two feature-map fusions, one residual block is used to extract features. The advantage of this architecture is that the features of an object may appear at different network layers; conventional convolutional network models tend to lose such features, whereas this kind of skip connection effectively incorporates multi-scale features.
The residual block in the recursive sub-network is shown in FIG. 3. It comprises a convolution path and a skip path: the convolution path is formed by three convolution layers with different kernel scales connected in series, and the skip path comprises a convolution layer with a 1×1 kernel. M is the number of input channels (the input depth) and N is the number of output channels (the output depth). The characteristic of the residual block is that it can increase or decrease the depth of the feature map without changing its resolution: the convolution path extracts higher-level features while the skip path retains the original-level information, so the block changes only the data depth, not the data size. It can be regarded as a size-preserving, higher-level convolution layer. At the same time, because of this structure, features in the feature map are not easily lost and the vanishing-gradient problem is effectively mitigated. Each recursive sub-network outputs a single-channel heat map and regresses an offset vector for each corresponding target point on the heat map.
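A minimal sketch of such a residual block follows, again in PyTorch. The patent states only that the convolution path stacks three layers with different kernels and that the skip path is a 1×1 convolution, so the particular kernel sizes and the bottleneck width below are assumptions.

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Changes depth from M to N channels; the spatial resolution is unchanged."""
        def __init__(self, m_ch, n_ch):
            super().__init__()
            self.conv_path = nn.Sequential(                 # extracts higher-level features
                nn.Conv2d(m_ch, n_ch // 2, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(n_ch // 2, n_ch // 2, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(n_ch // 2, n_ch, kernel_size=1),
            )
            self.skip_path = nn.Conv2d(m_ch, n_ch, kernel_size=1)  # retains original-level info

        def forward(self, x):
            return self.conv_path(x) + self.skip_path(x)    # elementwise sum of the two paths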
The invention adopts a cascade of two recursive sub-networks; the heat map produced by each recursive sub-network is compared with the ground truth to compute a loss function. The value of each point on the heat map lies between 0 and 1, and the closer the value is to 1, the higher the predicted probability that the point is a target center. Considering the imbalance between positive and negative samples, focal loss is used as the loss function, in its standard per-pixel form for heat maps:

$$L_{heat} = -\frac{1}{N}\sum_{xy}\begin{cases}(1-\hat{Y}_{xy})^{\alpha}\,\log\hat{Y}_{xy}, & Y_{xy}=1\\(1-Y_{xy})^{\beta}\,\hat{Y}_{xy}^{\alpha}\,\log(1-\hat{Y}_{xy}), & \text{otherwise}\end{cases}$$

where $\hat{Y}_{xy}$ is the predicted heat-map value at point $(x,y)$, $Y_{xy}$ is the ground truth, $N$ is the number of target center points, and $\alpha$, $\beta$ are the focusing hyperparameters.

If gradient descent were carried out directly on the whole feature map, the error at the output layer would shrink drastically as it back-propagates through many layers, i.e. the gradient would vanish. Therefore the two cascaded recursive sub-networks are combined with intermediate supervision, and repeated bidirectional inference ensures that the bottom-layer parameters are updated normally. Intermediate supervision here means that the heat map output by each network layer in the recursive sub-network is used as a prediction; this is far more accurate than using only the heat map output by the last network layer, and a supervised training method that takes the intermediate network layers into account is called intermediate supervision.
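In code, the per-pixel focal loss and the intermediate supervision over the two cascaded heat maps can be sketched as follows; alpha=2 and beta=4 are the commonly used values and are an assumption here, since the patent does not state its hyperparameter settings.

    import torch

    def heatmap_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
        """Per-pixel focal loss between a predicted and a ground-truth heat map."""
        pred = pred.clamp(eps, 1 - eps)
        pos = gt.eq(1).float()                      # target center points
        neg = 1.0 - pos
        pos_loss = ((1 - pred) ** alpha) * torch.log(pred) * pos
        neg_loss = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg
        num_pos = pos.sum().clamp(min=1.0)          # normalize by the number of targets
        return -(pos_loss.sum() + neg_loss.sum()) / num_pos

    # Intermediate supervision: the loss is computed on the heat map produced by
    # each of the two cascaded recursive sub-networks and the two terms are summed:
    # total_loss = heatmap_focal_loss(heat1, gt) + heatmap_focal_loss(heat2, gt)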
S0112: setting the hyperparameters of the detection sub-model network, inputting the first training data set into the detection sub-model network for training, and determining the parameters of the detection sub-model network; the hyperparameters of the detection sub-model network comprise the learning rate, the number of iterations, the batch size and the heat map score threshold;
s0113: on the obtained heat map, non-maximum suppression is performed according to the heat map scores to extract peak key points; the key points whose scores exceed the threshold are retained, and the corresponding bounding box coordinates are calculated from the estimated offset vectors and the bounding box sizes to obtain a heat map with bounding boxes.
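A minimal sketch of this decoding step is given below: a 3×3 max-pooling pass serves as the non-maximum suppression, points above the score threshold are kept as centers, and boxes are assembled from the regressed offset and size maps. The 1/4 down-ratio follows the description above; the tensor layout and the channel ordering of the offset and size maps are assumptions.

    import torch
    import torch.nn.functional as F

    def decode_heatmap(heat, offset, size, score_thresh=0.4, down_ratio=4):
        """heat: (H, W); offset, size: (2, H, W), regressed at each heat-map point."""
        pooled = F.max_pool2d(heat[None, None], 3, stride=1, padding=1)[0, 0]
        keep = (pooled == heat) & (heat > score_thresh)   # peak key points above threshold
        ys, xs = torch.nonzero(keep, as_tuple=True)
        boxes = []
        for y, x in zip(ys, xs):
            cx = (x + offset[0, y, x]) * down_ratio       # refine center, map to image scale
            cy = (y + offset[1, y, x]) * down_ratio
            w = size[0, y, x] * down_ratio
            h = size[1, y, x] * down_ratio
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, heat[y, x]))
        return boxes                                      # (x1, y1, x2, y2, score) per target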
S012: training the appearance feature extraction sub-model network with the second training data set to obtain the appearance feature extraction sub-model; with continued reference to FIG. 1, the appearance feature extraction sub-model network comprises, in order, a convolution layer, a pooling layer, three residual blocks and a fully connected layer. The specific training process is as follows:
s0121: inputting the heat map containing the bounding boxes into the appearance feature extraction sub-model network; it passes, in order, through a convolution layer, a pooling layer, three residual blocks and a final fully connected layer, generating a 128-dimensional feature vector for each target in the heat map. The goal of the appearance information extraction module is to generate feature vectors that can distinguish different objects: ideally, the distance between the features of different objects should be greater than the distance between features of the same object in different frames. The method can adopt the Mahalanobis distance as the metric; a Mahalanobis distance threshold is set before training, two targets are considered the same when the Mahalanobis distance between their feature vectors is smaller than the threshold, and different when it is greater than or equal to the threshold.
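The same-target test can be sketched as below, here with the Euclidean distance for simplicity (the description above uses the Mahalanobis distance, which additionally requires a covariance estimate); the threshold value is an illustrative assumption.

    import torch

    def same_target(feat_a, feat_b, dist_thresh=0.6):
        """feat_a, feat_b: 128-dim appearance feature vectors of two detections."""
        dist = torch.dist(feat_a, feat_b, p=2)   # Euclidean distance between embeddings
        return bool(dist < dist_thresh)          # below threshold -> judged the same target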
S0122: setting the hyperparameters of the appearance feature extraction sub-model network, inputting the second training data set into the appearance feature extraction sub-model network, and determining the parameters of the appearance feature extraction sub-model network; the hyperparameters of the appearance feature extraction sub-model network comprise the learning rate, the number of iterations, the batch size and the similarity threshold.
S02: as shown in FIG. 4, inputting the video stream to be tracked into the tracking model to obtain the tracking result; specifically:
s021: the first frame of the video stream to be tracked first passes through the detection sub-model, which generates a bounding box for each pedestrian target from the heat map and offset vectors; a feature vector is then extracted for each pedestrian target by the appearance feature extraction sub-model, and IDs and tracks are assigned;
s022: the remaining frames of the video stream pass through the tracking model in sequence: each frame passes through the detection sub-model, which generates bounding boxes for the pedestrian targets from the heat map and offset vectors, and then through the appearance feature extraction sub-model, which generates the feature vector of each pedestrian target;
the track position of each pedestrian in the current frame is determined from the distance between the feature vectors in two adjacent frames, and the track positions that share the same ID across all frames of the video stream are connected to give the tracking result.
For example, for the M-th frame, the distance between the targets detected in the M-1-th frame and the M-th frame is measured from the similarity of the feature vectors between targets and from the IOU (Intersection over Union), where the similarity is calculated with the Mahalanobis distance or the Euclidean distance. When the distance is smaller than the set threshold, the match is considered successful, i.e. the two detections are the same pedestrian target, and the target in the M-th frame inherits the ID of the corresponding target in the M-1-th frame; if the match is unsuccessful, a new ID is assigned to the target. The invention can also predict each track's position in the current frame with Kalman filtering: if the distance between a track's predicted position and a candidate detection exceeds the track threshold, that distance is set to infinity, which effectively prevents detections from being linked to tracks whose objects would have had to move too far.
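The following sketch illustrates one frame of this association step. The cost mixes the embedding distance with (1 - IOU), the Kalman gating sets infeasible pairs to an effectively infinite cost, and unmatched detections receive new IDs; the Hungarian solver, the weighting and all threshold values are illustrative assumptions, as the patent does not specify the assignment algorithm.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    INF = 1e5  # stands in for "infinity" so the assignment problem stays feasible

    def iou(a, b):
        """Intersection over union of two (x1, y1, x2, y2) boxes."""
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / (union + 1e-9)

    def associate(tracks, detections, dist_thresh=0.7, gate_thresh=50.0, w=0.5):
        """Match frame M-1 tracks to frame M detections.

        Tracks and detections are assumed to carry a 128-dim .feature, a .box
        (x1, y1, x2, y2) and a .center; tracks expose the Kalman-predicted center.
        """
        cost = np.full((len(tracks), len(detections)), INF)
        for i, trk in enumerate(tracks):
            for j, det in enumerate(detections):
                # Kalman gating: a detection too far from the predicted track
                # position keeps the "infinite" cost and can never be linked.
                if np.linalg.norm(trk.predicted_center - det.center) > gate_thresh:
                    continue
                emb_d = np.linalg.norm(trk.feature - det.feature)  # appearance term
                cost[i, j] = w * emb_d + (1 - w) * (1.0 - iou(trk.box, det.box))
        rows, cols = linear_sum_assignment(cost)
        matches, unmatched = [], set(range(len(detections)))
        for r, c in zip(rows, cols):
            if cost[r, c] < dist_thresh:        # successful match: inherit the track ID
                matches.append((r, c))
                unmatched.discard(c)
        return matches, sorted(unmatched)       # unmatched detections get new IDs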
It will be apparent to those skilled in the art from this disclosure that various other changes and modifications can be made which are within the scope of the invention as defined in the appended claims.

Claims (8)

1. A pedestrian multi-target tracking method integrating target detection and association, characterized by comprising the following steps:
s01: training a tracking model network with a training data set to obtain a tracking model; the tracking model comprises a detection sub-model and an appearance feature extraction sub-model connected after the detection sub-model; specifically:
s011: training the detection sub-model network with a first training data set to obtain the detection sub-model; the detection sub-model network comprises, in order, two convolution pooling sub-networks and two recursive sub-networks connected together, the output of one convolution pooling sub-network feeding the input of one recursive sub-network; each convolution pooling sub-network comprises a convolution layer whose output feeds a pooling layer;
s012: training the appearance feature extraction sub-model network with a second training data set to obtain the appearance feature extraction sub-model; the appearance feature extraction sub-model network comprises, in order, a convolution layer, a pooling layer, three residual blocks and a fully connected layer;
s02: inputting the video stream to be tracked into the tracking model to obtain the tracking result; specifically:
s021: the first frame of the video stream to be tracked passes through the detection sub-model first, which generates a bounding box for each pedestrian target from the heat map and offset vectors; a feature vector is then extracted for each pedestrian target by the appearance feature extraction sub-model, and IDs and tracks are assigned;
s022: the remaining frames of the video stream pass through the tracking model in sequence: each frame passes through the detection sub-model, which generates bounding boxes for the pedestrian targets from the heat map and offset vectors, and then through the appearance feature extraction sub-model, which generates the feature vector of each pedestrian target; the track position of each pedestrian in the current frame is determined from the distance between the feature vectors in two adjacent frames, and the track positions that share the same ID across all frames are connected, giving the tracking result;
step S011 specifically includes:
s0111: the first training data set passes, in order, through the two convolution pooling sub-networks and the two recursive sub-networks, which output a heat map;
s0112: setting the hyperparameters of the detection sub-model network, inputting the first training data set into the detection sub-model network for training, and determining the parameters of the detection sub-model network; the hyperparameters of the detection sub-model network comprise the learning rate, the number of iterations, the batch size and the heat map score threshold;
s0113: in the heat map, non-maximum suppression is performed according to the heat map scores to extract peak key points, the key points whose scores exceed the threshold are retained, and the bounding box coordinates are calculated to obtain a heat map with bounding boxes;
step S012 specifically includes:
s0121: inputting the heat map containing the bounding boxes into the appearance feature extraction sub-model network and setting a similarity threshold; if the similarity between two target features is greater than or equal to the similarity threshold, they are judged to be the same target, and if it is less than the similarity threshold, they are judged to be different targets;
s0122: setting the hyperparameters of the appearance feature extraction sub-model network, inputting the second training data set into the appearance feature extraction sub-model network, and determining the parameters of the appearance feature extraction sub-model network; the hyperparameters of the appearance feature extraction sub-model network comprise the learning rate, the number of iterations, the batch size and the similarity threshold.
2. The integrated target detection and association pedestrian multi-target tracking method according to claim 1, wherein before each downsampling in the recursive sub-network, a branch is split off to preserve the original-scale information; after each upsampling, the result is added to the features of the previous scale; between two downsampling steps, three residual blocks are used to extract features; and between two additions, one residual block is used to extract features.
3. The method of claim 2, wherein the residual block comprises a convolution path and a skip path, the convolution path comprising three convolution layers with different kernels connected in series, and the skip path comprising a convolution layer with a 1×1 kernel.
4. The method according to claim 1, wherein in step S022, for the M-th frame of the video stream to be tracked, the distance between the targets detected in the M-1-th and M-th frames is measured from the similarity of the feature vectors and the IOU; when the distance is smaller than the distance threshold the match is considered successful and the target inherits the ID of the corresponding target in the previous frame; otherwise the match is unsuccessful and a new ID is assigned to the target.
5. The method of claim 4, wherein the similarity is calculated using a Mahalanobis distance or a Euclidean distance.
6. The method according to claim 1, wherein step S022 uses Kalman filtering to predict each track's position in the current frame, and if the distance between a track's predicted position and a candidate detection exceeds the track threshold, that distance is set to infinity.
7. The integrated target detection and associated pedestrian multi-target tracking method of claim 1 wherein the first training data set comprises data sets in an ETH database, a CityPerson database, a CalTech database, a MOT17 database, a CUHK-SYSU database, and a PRW database.
8. The integrated target detection and associated pedestrian multi-target tracking method of claim 1 wherein the second training data set comprises data sets in a CalTech database, a MOT17 database, a CUHK-SYSU database, and a PRW database.
CN202010605987.5A 2020-06-29 2020-06-29 Pedestrian multi-target tracking method integrating target detection and association Active CN111767847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010605987.5A CN111767847B (en) 2020-06-29 2020-06-29 Pedestrian multi-target tracking method integrating target detection and association

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010605987.5A CN111767847B (en) 2020-06-29 2020-06-29 Pedestrian multi-target tracking method integrating target detection and association

Publications (2)

Publication Number Publication Date
CN111767847A CN111767847A (en) 2020-10-13
CN111767847B (en) 2023-06-09

Family

ID=72722916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010605987.5A Active CN111767847B (en) 2020-06-29 2020-06-29 Pedestrian multi-target tracking method integrating target detection and association

Country Status (1)

Country Link
CN (1) CN111767847B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112258559A (en) * 2020-10-26 2021-01-22 上海萱闱医疗科技有限公司 Intelligent running timing scoring system and method based on multi-target tracking
CN113221787B (en) * 2021-05-18 2023-09-29 西安电子科技大学 Pedestrian multi-target tracking method based on multi-element difference fusion
CN113362372B (en) * 2021-05-25 2023-05-02 同济大学 Single target tracking method and computer readable medium
CN113378704B (en) * 2021-06-09 2022-11-11 武汉理工大学 Multi-target detection method, equipment and storage medium
CN113744220B (en) * 2021-08-25 2024-03-26 中国科学院国家空间科学中心 PYNQ-based detection system without preselection frame
CN114998999B (en) * 2022-07-21 2022-12-06 之江实验室 Multi-target tracking method and device based on multi-frame input and track smoothing

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109214238B (en) * 2017-06-30 2022-06-28 阿波罗智能技术(北京)有限公司 Multi-target tracking method, device, equipment and storage medium
CN109086648A (en) * 2018-05-24 2018-12-25 同济大学 A kind of method for tracking target merging target detection and characteristic matching
CN109271933B (en) * 2018-09-17 2021-11-16 北京航空航天大学青岛研究院 Method for estimating three-dimensional human body posture based on video stream
CN109816690A (en) * 2018-12-25 2019-05-28 北京飞搜科技有限公司 Multi-target tracking method and system based on depth characteristic
CN110188690B (en) * 2019-05-30 2022-02-08 山东巍然智能科技有限公司 Intelligent visual analysis system based on unmanned aerial vehicle, intelligent visual analysis system and method
CN110728698B (en) * 2019-09-30 2023-05-16 天津大学 Multi-target tracking system based on composite cyclic neural network system
CN111126152B (en) * 2019-11-25 2023-04-11 国网信通亿力科技有限责任公司 Multi-target pedestrian detection and tracking method based on video

Also Published As

Publication number Publication date
CN111767847A (en) 2020-10-13


Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant