CN114897939A - Multi-target tracking method and system based on deep path aggregation network - Google Patents
Multi-target tracking method and system based on deep path aggregation network
- Publication number: CN114897939A
- Application number: CN202210599934.6A
- Authority
- CN
- China
- Prior art keywords
- target
- network
- frame
- path aggregation
- aggregation network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06F30/18—Network design, e.g. design based on topological or interconnect aspects of utility systems, piping, heating ventilation air conditioning [HVAC] or cabling
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
- G06T7/277—Analysis of motion involving stochastic approaches, e.g. using Kalman filters
- G06T2207/10016—Video; Image sequence
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
- G06T2207/20221—Image fusion; Image merging
Abstract
The invention discloses a multi-target tracking method and system based on a deep path aggregation network. With a deep path aggregation network as the framework, the network is first trained on a large amount of external video data to generate a target center heat map, target center offsets, and predicted bounding-box sizes, while simultaneously extracting re-ID features of each target. Prediction boxes are then generated from the target position information, and each detected object is marked with a rectangular box. Next, the cosine distances between the re-ID feature vectors of all detected objects in a given frame and those in the previous frame, together with the IoU of their prediction boxes, are computed, and the objects in the current frame are linked to existing trajectories. Finally, the positions of all targets in the current frame are further refined with a Kalman filter. The invention adopts a bottom-up feature fusion layer to extract spatial feature information of targets, shortening the information path between low-level and high-level features, so that the tracker achieves both high tracking accuracy and real-time tracking speed.
Description
Technical Field
The invention belongs to the fields of computer vision, deep learning and multi-target tracking, and particularly relates to a multi-target tracking method and system based on a deep path aggregation network.
Background
Multi-target tracking is an active research area in computer vision, with important applications in autonomous driving, intelligent transportation, intelligent surveillance, and related fields. With the rapid development of artificial intelligence, more and more deep learning algorithms are applied in everyday life. In single-target tracking, the appearance of the object is known in advance; in multi-target tracking, the tracker must estimate the trajectories of multiple targets in a video and detect targets leaving or entering the scene. Multiple objects in a video may occlude one another or have similar appearances, and external environmental factors such as lighting, weather, and video quality further complicate tracking. Among multi-target trackers, deep-learning-based methods such as JDE and FairMOT perform relatively well, but they usually fail to strike a good balance between tracking accuracy and tracking speed, and they do not further extract the spatial feature information of targets, so they cannot reliably infer accurate target positions.
Disclosure of Invention
The invention aims to provide a multi-target tracking method and system based on a deep path aggregation network, so as to solve the technical problems that existing deep-learning-based trackers cannot balance tracking accuracy against tracking speed well and cannot accurately infer the positions of targets.
In order to solve the technical problems, the specific technical scheme of the invention is as follows:
a multi-target tracking method based on a deep path aggregation network comprises the following steps:
step 1, data preprocessing: aiming at each frame image in each section of video of the training set, performing data enhancement by using rotation, scaling and color dithering to obtain an input data set of the network;
step 2, constructing a deep path aggregation network, comprising the following substeps:
step 2.1, designing a network structure of the deep path aggregation network;
step 2.2, constructing a training sample: selecting images from the input data set and inputting them into the deep path aggregation network as the input of the network;
step 2.3, designing an error function to perform back propagation, and optimizing parameters of the network until convergence;
step 3, performing multi-target tracking on each detected object in the video: extracting the detection region and re-ID features of each target with the trained deep path aggregation network, calculating the cosine distance between the re-ID features of each target in a given frame and those in the previous frame of the video together with the IoU of the detection boxes, and performing tracking prediction with a Kalman filter to obtain the position of each target in the current frame.
Further, the data preprocessing step in step 1 is specifically as follows:
For each frame image in each video of the training set, randomly select a rotation angle between -10 and 10 degrees and rotate the image, then perform an image scaling operation with a ratio of 4, and finally increase the image color depth and brightness by a factor of 0.5, obtaining the input data set of the network.
Further, designing a network structure of the deep path aggregation network in step 2.1 specifically includes the following steps:
step 201, based on the DLA network, adding three top-down feature map layers at the final stage to perform down-sampling and aggregation operations on the output feature map of the DLA network, thereby obtaining three feature maps of different resolutions;
step 202, for the three output feature maps of different resolutions, performing multi-scale aggregation of the two medium- and small-resolution feature maps with the large-resolution feature map to obtain a high-resolution feature map representation. This high-resolution representation is the output feature map of the deep path aggregation network, from which the target center heat map, center offset and re-ID feature vector of each target are produced.
Further, the step 2.2 of constructing the training sample specifically includes the following steps:
Four images are selected from the input data set and input into the deep path aggregation network.
Further, the error function is designed in step 2.3 to perform back propagation, and the parameters of the network are optimized until convergence, specifically:
Calculate the center heat map of each target in the image with a focal loss function, calculate the center offset and predicted box size of each target with an L1 loss function, and calculate the re-ID embedding loss of each target with an ID loss function; then assign corresponding weights to the three loss values to form the total loss. After training, the center position, predicted box size and re-ID features of each target in the image can be obtained. The total loss is:

L_detection = L_heat + L_box

L_total = 1/2 · (e^(-w1) · L_detection + e^(-w2) · L_identity + w1 + w2)

where w1 is a learnable parameter that weights the detection task in the total loss function; w2 is a learnable parameter that weights the re-ID task, balancing target detection against the re-ID task; e is the natural constant; L_identity is the loss function of the re-ID task; and L_detection is the loss function of the detection task, comprising the heat-map loss and the predicted-box size and offset losses of each target in the image.
Further, the step 3 of performing multi-target tracking in the video specifically comprises the following steps:
step 301, in a video, inputting a given frame image into the trained deep path aggregation network to obtain its convolutional feature map, and then generating the target heat-map feature map, predicted-box-size feature map, predicted-box-offset feature map and re-ID embedding feature map respectively;
step 302, in each subsequent frame, performing online tracking of each target according to all target positions and re-ID features inferred from the previous frame, as follows:
step 3021, calculating detection frames and re-ID characteristics of all targets in the current frame by using the trained deep path aggregation network;
step 3022, predicting the position in the current frame by using a Kalman filter according to the motion track of each target in the previous frame;
step 3023, calculating cosine distances of re-ID features of all targets in the previous frame and all targets in the current frame and IoU of a detection frame, if the cosine distances are greater than 0.4 and the IoU score is greater than 0.5, determining that the tracking is successful, and connecting the successfully tracked targets to the existing motion track;
and step 3024, performing state updating by using a Kalman filter for all the targets successfully tracked to obtain the optimal estimation in the current frame.
The invention also provides a multi-target tracking system based on the deep path aggregation network, which comprises a data preprocessing unit, a deep path aggregation network training unit and a video multi-target tracking unit;
the data preprocessing unit is used for inputting an image sequence, and performing data enhancement on the image sequence by using rotation, scaling and color dithering;
the deep path aggregation network training unit is used for training a designed deep path aggregation network and is configured to execute the following steps:
step A, designing a network structure of a deep path aggregation network;
step B, constructing a training sample as the input of the network;
step C, designing an error function to perform back propagation, and optimizing parameters of the network until convergence;
the video multi-target tracking unit is configured to execute the following actions: extracting a detection area and re-ID characteristics of the targets based on the trained depth path aggregation network, and performing tracking prediction by using a Kalman filter according to the cosine distance of the re-ID characteristics of each target and IoU of the detection frame to obtain the positions of the targets in the current frame.
The multi-target tracking method and system based on the deep path aggregation network have the following advantages:
1. the multi-target tracking method based on the depth path aggregation network can be used for tracking all targets of each frame in any video;
2. the method uses the trained deep path aggregation network and cross-resolution multi-scale aggregation to output features of the target at different levels, making it more robust to changes in target appearance.
3. The invention adopts a bottom-up feature fusion layer to extract spatial feature information of targets, shortening the information path between low-level and high-level features, so that the tracker achieves higher tracking accuracy while keeping tracking real-time.
Drawings
Fig. 1 is a schematic diagram of a multi-target tracking method based on a deep path aggregation network according to the present invention.
Detailed Description
In order to better understand the purpose, structure and function of the present invention, a multi-target tracking method and system based on a deep path aggregation network according to the present invention are described in further detail below with reference to the accompanying drawings.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The invention firstly provides a multi-target tracking method based on a deep path aggregation network, which is shown by referring to fig. 1 and comprises the following steps:
step 1, data preprocessing: performing data enhancement on the input image sequence by using rotation, scaling and color dithering to obtain an input data set of a network;
For each frame image in each video of the training set, randomly select a rotation angle between -10 and 10 degrees and rotate the image, then perform an image scaling operation with a ratio of 4, and finally increase the image color depth and brightness by a factor of 0.5, obtaining the input data set of the network.
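As an illustration of this preprocessing step, the sketch below samples a random rotation angle in [-10, 10] degrees and applies a simple brightness/colour gain to a NumPy image. The `scale` and `jitter` constants are illustrative stand-ins for the ratio-4 scaling and 0.5x adjustment described above, not the patent's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_augmentation_params():
    """Sample augmentation parameters in the spirit of step 1."""
    angle = rng.uniform(-10.0, 10.0)  # random rotation angle in degrees
    scale = 0.25                      # e.g. downscale by a ratio of 4 (illustrative)
    jitter = 0.5                      # brightness/colour gain offset (illustrative)
    return angle, scale, jitter

def color_jitter(image, gain):
    """Scale pixel intensities by (1 + gain) and clip to the valid uint8 range."""
    out = image.astype(np.float32) * (1.0 + gain)
    return np.clip(out, 0, 255).astype(np.uint8)

# Toy 8x8 RGB image with constant intensity 100
img = np.full((8, 8, 3), 100, dtype=np.uint8)
angle, scale, jitter = sample_augmentation_params()
bright = color_jitter(img, jitter)  # intensity 100 -> 150 with gain 0.5
```

In a full pipeline the sampled angle and scale would drive an affine warp (e.g. via an image library); only the colour jitter is applied directly here.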
Step 2, constructing a deep path aggregation network, comprising the following substeps:
step 2.1, designing a network structure of the deep path aggregation network, which specifically comprises the following steps:
step 201, based on the DLA network, adding three top-down feature map layers at the final stage to perform down-sampling and aggregation operations on the output feature map of the DLA network, thereby obtaining three feature maps of different resolutions. Deep Layer Aggregation (DLA) networks are neural networks that extract features and perform multi-scale aggregation; the deep path aggregation network proposed here is an improvement built on the DLA network.
Step 202, for the three output feature maps of different resolutions, performing multi-scale aggregation of the two medium- and small-resolution feature maps with the large-resolution feature map to obtain a high-resolution feature map representation. This high-resolution representation is the output feature map of the deep path aggregation network, from which the target center heat map, center offset and re-ID feature vector of each target are produced.
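The top-down feature layers and cross-resolution aggregation of steps 201 and 202 can be sketched with plain NumPy, using strided slicing in place of strided convolution and nearest-neighbour repetition in place of learned up-sampling. All sizes here are illustrative, not the network's actual dimensions:

```python
import numpy as np

def downsample(fmap, factor=2):
    # Stride-based down-sampling: a stand-in for a strided convolution layer
    return fmap[::factor, ::factor]

def upsample(fmap, factor=2):
    # Nearest-neighbour up-sampling: a stand-in for learned up-sampling
    return np.repeat(np.repeat(fmap, factor, axis=0), factor, axis=1)

def aggregate_to_high_res(large, medium, small):
    # Bring the medium/small maps back to the large map's resolution and sum,
    # yielding the high-resolution feature map representation of step 202
    return large + upsample(medium, 2) + upsample(small, 4)

base = np.ones((32, 32))   # output feature map of the DLA backbone (toy size)
p1 = downsample(base, 2)   # 16x16 medium-resolution map
p2 = downsample(p1, 2)     # 8x8 small-resolution map
high_res = aggregate_to_high_res(base, p1, p2)  # back to 32x32
```

Summation is used here for aggregation; the actual network may concatenate or apply further convolutions before producing the heat map, offset, and re-ID heads.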
Step 2.2, constructing a training sample: selecting 4 images from the input data set and inputting them into the deep path aggregation network as the input of the network;
step 2.3, designing an error function to perform back propagation, and optimizing parameters of the network until convergence, wherein the method specifically comprises the following steps:
Calculate the center heat map, center offset, predicted box size and re-ID embedding loss of each target in the image with the focal loss, L1 loss and ID loss respectively; then assign corresponding weights to the three loss values to form the total loss. After training, the center position, predicted box size and re-ID features of each target in the image can be obtained. Focal loss, L1 loss and ID loss are loss functions common in deep learning; here they compute the target center heat-map loss, the center offset and bounding-box size loss, and the re-ID embedding loss, respectively. re-ID is short for re-identification: the deep path aggregation network finally outputs re-ID feature vectors for all targets in each frame of the video, and in the tracking stage the position of each target from the previous frame is re-identified in the next frame by computing the cosine distances between the re-ID feature vectors of two adjacent frames. The total loss is:

L_detection = L_heat + L_box

L_total = 1/2 · (e^(-w1) · L_detection + e^(-w2) · L_identity + w1 + w2)

where w1 is a learnable parameter that weights the detection task in the total loss function; w2 is a learnable parameter that weights the re-ID task, balancing target detection against the re-ID task; e is the natural constant; L_identity is the loss function of the re-ID task; and L_detection is the loss function of the detection task, comprising the heat-map loss and the predicted-box size and offset losses of each target in the image.
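A minimal sketch of the weighted total loss, assuming the uncertainty-style combination implied by the learnable parameters w1, w2 and the natural constant e described above (the exact combination used in the patent may differ):

```python
import math

def total_loss(l_detection, l_identity, w1, w2):
    """Combine detection and re-ID losses with learnable balance parameters.

    Assumed form: L_total = 1/2 * (e^(-w1) * L_detection
                                   + e^(-w2) * L_identity + w1 + w2)
    """
    return 0.5 * (math.exp(-w1) * l_detection
                  + math.exp(-w2) * l_identity
                  + w1 + w2)

# With w1 = w2 = 0 the two tasks are weighted equally:
loss = total_loss(l_detection=2.0, l_identity=1.0, w1=0.0, w2=0.0)  # 1.5
```

During training w1 and w2 would be optimized jointly with the network weights, so the balance between detection and re-ID is learned rather than hand-tuned.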
Step 3, performing multi-target tracking on each detected object in the video: extract the detection region and re-ID features of each target with the trained deep path aggregation network, and calculate the cosine distance between the re-ID features of each target in a given frame and those in the previous frame, together with the IoU of the detection boxes. IoU measures the overlap between two boxes; in object detection it is commonly used to judge the overlap between a predicted box and the ground-truth box, so that during training the predicted box is continuously adjusted to approximate the position and size of the ground-truth box. Then use a Kalman filter for tracking prediction to obtain the position of each target in the current frame. The specific steps are as follows:
step 301, in a video, inputting a given frame image into the trained deep path aggregation network to obtain its convolutional feature map, and then generating the target heat-map feature map, predicted-box-size feature map, predicted-box-offset feature map and re-ID embedding feature map respectively;
step 302, in each subsequent frame, performing online tracking of each target according to all target positions and re-ID features inferred from the previous frame, as follows:
step 3021, calculating detection frames and re-ID characteristics of all targets in the current frame by using the trained deep path aggregation network;
step 3022, predicting the position in the current frame by using a Kalman filter according to the motion track of each target in the previous frame;
step 3023, calculating cosine distances of re-ID features of all targets in the previous frame and all targets in the current frame and IoU of a detection frame, if the cosine distances are greater than 0.4 and the IoU score is greater than 0.5, determining that the tracking is successful, and connecting the successfully tracked targets to the existing motion track;
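The matching test of step 3023 can be sketched as follows. The feature vectors and box coordinates are illustrative, and boxes are assumed to be in (x1, y1, x2, y2) form:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two re-ID feature vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

def is_match(feat_prev, feat_cur, box_prev, box_cur,
             cos_thresh=0.4, iou_thresh=0.5):
    """Step 3023: a target matches when both the re-ID similarity and the
    box overlap exceed their thresholds; it is then linked to the track."""
    return (cosine_similarity(feat_prev, feat_cur) > cos_thresh
            and iou(box_prev, box_cur) > iou_thresh)

# Identical features and heavily overlapping boxes -> successful match
matched = is_match([1.0, 0.0], [1.0, 0.0], (0, 0, 10, 10), (0, 0, 10, 8))
```

A full tracker would compute these scores pairwise for all targets in two adjacent frames and solve an assignment problem (e.g. greedy or Hungarian matching) before linking tracks.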
and step 3024, performing state updating by using a Kalman filter for all the targets successfully tracked to obtain the optimal estimation in the current frame.
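A minimal constant-velocity Kalman filter for a single coordinate, illustrating the predict step of 3022 and the state update of 3024. The state layout and noise values are illustrative; a real tracker runs such a filter over the full box state:

```python
import numpy as np

class Kalman1D:
    """Constant-velocity Kalman filter for one coordinate of a target centre."""

    def __init__(self, x0, v0=0.0):
        self.x = np.array([x0, v0], dtype=float)     # state: [position, velocity]
        self.P = np.eye(2)                           # state covariance
        self.F = np.array([[1.0, 1.0], [0.0, 1.0]])  # constant-velocity motion model
        self.H = np.array([[1.0, 0.0]])              # we observe position only
        self.Q = np.eye(2) * 0.01                    # process noise (illustrative)
        self.R = np.array([[1.0]])                   # measurement noise (illustrative)

    def predict(self):
        """Step 3022: project the state into the current frame."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[0]

    def update(self, z):
        """Step 3024: fuse the matched detection z into an optimal estimate."""
        y = z - self.H @ self.x                      # innovation
        S = self.H @ self.P @ self.H.T + self.R      # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)     # Kalman gain
        self.x = self.x + (K @ y).ravel()
        self.P = (np.eye(2) - K @ self.H) @ self.P
        return self.x[0]

kf = Kalman1D(x0=0.0, v0=1.0)
pred = kf.predict()    # position predicted for the current frame (1.0)
est = kf.update(1.2)   # refined estimate lies between prediction and detection
```

The refined estimate lands between the motion-model prediction and the raw detection, weighted by the filter's confidence in each, which is what "optimal estimation in the current frame" refers to.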
The invention also provides a multi-target tracking system based on the deep path aggregation network, which comprises a data preprocessing unit, a deep path aggregation network training unit and a video multi-target tracking unit;
the data preprocessing unit is used for inputting an image sequence, and performing data enhancement on the image sequence by using rotation, scaling and color dithering;
the deep path aggregation network training unit is used for training a designed deep path aggregation network and is configured to execute the following steps:
step A, designing a network structure of a deep path aggregation network;
step B, constructing a training sample as the input of the network;
step C, designing an error function to perform back propagation, and optimizing parameters of the network until convergence;
the video multi-target tracking unit is configured to execute the following actions: extracting a detection area and re-ID characteristics of the targets based on the trained depth path aggregation network, and performing tracking prediction by using a Kalman filter according to the cosine distance of the re-ID characteristics of each target and IoU of the detection frame to obtain the positions of the targets in the current frame.
It will be understood by those within the art that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the methods specified in the block or blocks of the block diagrams and/or flowchart block or blocks.
It is to be understood that the present invention has been described with reference to certain embodiments, and that various changes in the features and embodiments, or equivalent substitutions may be made therein by those skilled in the art without departing from the spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.
Claims (7)
1. A multi-target tracking method based on a deep path aggregation network is characterized by comprising the following steps:
step 1, data preprocessing: aiming at each frame image in each section of video of the training set, performing data enhancement by using rotation, scaling and color dithering to obtain an input data set of the network;
step 2, constructing a deep path aggregation network, comprising the following substeps:
step 2.1, designing a network structure of the deep path aggregation network;
2.2, constructing a training sample, selecting images from the input data set, and inputting the images into the depth path aggregation network as the input of the network;
step 2.3, designing an error function to perform back propagation, and optimizing parameters of the network until convergence;
step 3, performing multi-target tracking on each detected object in the video: extracting a detection area and re-ID characteristics of the target based on the trained depth path aggregation network, calculating the cosine distance of the re-ID characteristics of each target in a certain frame and the previous frame of the video and IoU of the detection frame, and performing tracking prediction by using a Kalman filter to obtain the position of the target in the current frame.
2. The multi-target tracking method based on the deep path aggregation network as claimed in claim 1, wherein the data preprocessing step in the step 1 is as follows:
for each frame image in each video of the training set, randomly selecting a rotation angle between -10 and 10 degrees and rotating the image, then performing an image scaling operation with a ratio of 4, and finally increasing the image color depth and brightness by a factor of 0.5, obtaining the input data set of the network.
3. The multi-target tracking method based on the deep path aggregation network according to claim 1, wherein the step 2.1 of designing the network structure of the deep path aggregation network specifically comprises the following steps:
step 201, based on the DLA network, adding three feature map layers from top to bottom at the final stage for performing down-sampling and aggregation operations on the output feature map of the DLA network, thereby obtaining three feature maps with different resolutions;
step 202, for the three output feature maps of different resolutions, performing multi-scale aggregation of the two medium- and small-resolution feature maps with the large-resolution feature map to obtain a high-resolution feature map representation.
4. The multi-target tracking method based on the deep path aggregation network as claimed in claim 3, wherein the constructing of the training samples in the step 2.2 specifically includes the following steps:
four images are selected from the input data set and input into the deep path aggregation network.
5. The multi-target tracking method based on the deep path aggregation network as claimed in claim 4, wherein the designing of the error function in the step 2.3 is performed with back propagation to optimize the parameters of the network until convergence, and specifically includes:
calculating the center heat map of each target in the image with a focal loss function, calculating the center offset and predicted box size of each target with an L1 loss function, and calculating the re-ID embedding loss of each target with an ID loss function; then assigning corresponding weights to the three loss values to form the total loss; after training, obtaining the center position, predicted box size and re-ID features of each target in the image; the total loss being:

L_detection = L_heat + L_box

L_total = 1/2 · (e^(-w1) · L_detection + e^(-w2) · L_identity + w1 + w2)

wherein w1 is a learnable parameter that weights the detection task in the total loss function; w2 is a learnable parameter that weights the re-ID task, balancing target detection against the re-ID task; e is the natural constant; L_identity represents the loss function of the re-ID task; and L_detection represents the loss function of the detection task, comprising the heat-map loss and the predicted-box size and center-offset losses of each target in the image.
6. The multi-target tracking method based on the deep path aggregation network as claimed in claim 1, wherein the step 3 of performing multi-target tracking in the video specifically comprises the following steps:
step 301, inputting a certain frame image into a trained depth path aggregation network in a section of video to obtain a convolution feature map of the certain frame image, and then respectively generating a target heat map feature map, a prediction frame size feature map, a prediction frame offset feature map and a re-ID embedded feature map;
step 302, in each subsequent frame, performing online tracking of each target according to all target positions and re-ID features inferred from the previous frame, as follows:
step 3021, calculating the detection frames and re-ID features of all targets in the current frame using the trained deep path aggregation network;
step 3022, predicting each target's position in the current frame with a Kalman filter from its motion track in the previous frame;
step 3023, calculating the cosine distance between the re-ID features, and the IoU between the detection frames, of every target in the previous frame and every target in the current frame; if the cosine distance is greater than 0.4 and the IoU score is greater than 0.5, the tracking is deemed successful, and the successfully tracked target is connected to its existing motion track;
and step 3024, updating the Kalman filter state for all successfully tracked targets to obtain the optimal estimate in the current frame.
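Steps 3023 can be sketched as a greedy association between previous-frame tracks and current-frame detections. This is an illustrative sketch under the thresholds stated in the claim (cosine similarity > 0.4, IoU > 0.5); the dict layout with `"box"` and `"emb"` keys and the greedy matching order are assumptions, not the patent's exact procedure.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def associate(tracks, detections, cos_thresh=0.4, iou_thresh=0.5):
    """Greedily match previous-frame tracks to current-frame detections.
    tracks/detections: lists of dicts with 'box' and 'emb' entries.
    Returns a list of (track_index, detection_index) pairs."""
    matches, used = [], set()
    for ti, t in enumerate(tracks):
        best, best_score = None, -1.0
        for di, d in enumerate(detections):
            if di in used:
                continue
            cos = cosine_similarity(t["emb"], d["emb"])
            ov = iou(t["box"], d["box"])
            # Both thresholds from the claim must hold for a valid match.
            if cos > cos_thresh and ov > iou_thresh and cos + ov > best_score:
                best, best_score = di, cos + ov
        if best is not None:
            used.add(best)
            matches.append((ti, best))
    return matches
```

Matched pairs extend their existing motion tracks; unmatched detections would start new tracks, and unmatched tracks are candidates for termination.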
7. A multi-target tracking system based on the deep path aggregation network, characterized by comprising a data preprocessing unit, a deep path aggregation network training unit, and a video multi-target tracking unit, wherein:
the data preprocessing unit is used for inputting an image sequence and performing data enhancement on the image sequence using rotation, scaling, and color jittering;
the deep path aggregation network training unit is used for training a designed deep path aggregation network and is configured to execute the following steps:
step A, designing a network structure of a deep path aggregation network;
step B, constructing a training sample as the input of the network;
step C, designing an error function to perform back propagation, and optimizing parameters of the network until convergence;
the video multi-target tracking unit is configured to execute the following actions: extracting the detection areas and re-ID features of the targets based on the trained deep path aggregation network, and performing tracking prediction with a Kalman filter according to the cosine distance of each target's re-ID features and the IoU of its detection frame, to obtain the positions of the targets in the current frame.
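The Kalman-filter prediction and state update used by the tracking unit (steps 3022 and 3024) can be sketched with a minimal constant-velocity filter over a target's box center. This is an illustrative model under assumed noise settings; the patent does not specify the state vector or covariances.

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal constant-velocity Kalman filter over a 2-D box center.
    State: [cx, cy, vx, vy]; measurement: [cx, cy]; dt = 1 frame."""

    def __init__(self, cx, cy):
        self.x = np.array([cx, cy, 0.0, 0.0])
        self.P = np.eye(4) * 10.0                      # state covariance
        self.F = np.eye(4)
        self.F[0, 2] = self.F[1, 3] = 1.0              # position += velocity
        self.H = np.zeros((2, 4))
        self.H[0, 0] = self.H[1, 1] = 1.0              # observe position only
        self.Q = np.eye(4) * 1e-2                      # process noise (assumed)
        self.R = np.eye(2) * 1e-1                      # measurement noise (assumed)

    def predict(self):
        """Step 3022: predict the center position in the current frame."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, z):
        """Step 3024: fuse the matched detection to get the optimal estimate."""
        y = np.asarray(z, dtype=float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]
```

In practice one filter is maintained per track; `predict()` supplies the position used in association, and `update()` is called only for targets whose association succeeded.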
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210599934.6A CN114897939A (en) | 2022-05-26 | 2022-05-26 | Multi-target tracking method and system based on deep path aggregation network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114897939A true CN114897939A (en) | 2022-08-12 |
Family
ID=82725901
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210599934.6A Pending CN114897939A (en) | 2022-05-26 | 2022-05-26 | Multi-target tracking method and system based on deep path aggregation network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114897939A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116258608A (en) * | 2023-05-15 | 2023-06-13 | 中铁水利信息科技有限公司 | Water conservancy real-time monitoring information management system integrating GIS and BIM three-dimensional technology |
CN116258608B (en) * | 2023-05-15 | 2023-08-11 | 中铁水利信息科技有限公司 | Water conservancy real-time monitoring information management system integrating GIS and BIM three-dimensional technology |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113506317B (en) | Multi-target tracking method based on Mask R-CNN and apparent feature fusion | |
CN109800689B (en) | Target tracking method based on space-time feature fusion learning | |
CN107452015B (en) | Target tracking system with re-detection mechanism | |
CN111814621A (en) | Multi-scale vehicle and pedestrian detection method and device based on attention mechanism | |
CN113674416B (en) | Three-dimensional map construction method and device, electronic equipment and storage medium | |
CN113409361B (en) | Multi-target tracking method and device, computer and storage medium | |
CN104820997B (en) | A kind of method for tracking target based on piecemeal sparse expression Yu HSV Feature Fusion | |
CN112668483B (en) | Single-target person tracking method integrating pedestrian re-identification and face detection | |
CN104517275A (en) | Object detection method and system | |
CN110827320B (en) | Target tracking method and device based on time sequence prediction | |
CN110009060A (en) | A kind of robustness long-term follow method based on correlation filtering and target detection | |
CN106780567B (en) | Immune particle filter extension target tracking method fusing color histogram and gradient histogram | |
CN111161325A (en) | Three-dimensional multi-target tracking method based on Kalman filtering and LSTM | |
CN111027505A (en) | Hierarchical multi-target tracking method based on significance detection | |
CN105809718A (en) | Object tracking method with minimum trajectory entropy | |
CN113763427A (en) | Multi-target tracking method based on coarse-fine shielding processing | |
CN115063447A (en) | Target animal motion tracking method based on video sequence and related equipment | |
CN115690545B (en) | Method and device for training target tracking model and target tracking | |
Foresti | Object detection and tracking in time-varying and badly illuminated outdoor environments | |
CN114897939A (en) | Multi-target tracking method and system based on deep path aggregation network | |
CN113223064A (en) | Method and device for estimating scale of visual inertial odometer | |
CN102800105B (en) | Target detection method based on motion vector | |
Lee et al. | An edge detection–based eGAN model for connectivity in ambient intelligence environments | |
CN116861262B (en) | Perception model training method and device, electronic equipment and storage medium | |
CN113129336A (en) | End-to-end multi-vehicle tracking method, system and computer readable medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||