CN112116629A

CN112116629A - End-to-end multi-target tracking method using global response graph

Info

Publication number: CN112116629A
Application number: CN202010802373.6A
Authority: CN
Inventors: 王进军; 万星宇; 曹佳恺; 周三平; 邓烨; 辛晓萌
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2020-08-11
Filing date: 2020-08-11
Publication date: 2020-12-22

Abstract

The invention discloses an end-to-end multi-target tracking method using a global response graph, comprising the following steps: 1) expressing the motion characteristics of the tracked targets: selecting the motion attributes of all tracked targets from the data set and normalizing them; 2) generating attribute labels for the training samples corresponding to step 1): generating attribute labels of the actual existence states of the training samples by a logical inference method based on historical state information; 3) target localization: using the training data to train, with an improved salient object detection sub-network, a global response graph for target localization; 4) predicting target position changes; 5) distance measurement: calculating the IOU distance between the observation space and the state space; 6) track connection: constructing a global cost matrix and performing optimal assignment with the Hungarian algorithm by minimizing the cost matrix, thereby obtaining the final target tracks. The invention can efficiently achieve truly end-to-end multi-target tracking.

Description

End-to-end multi-target tracking method using global response graph

Technical Field

The invention belongs to the field of target tracking in computer vision, and particularly relates to an end-to-end multi-target tracking method using a global response graph.

Background

Target tracking is an important research area in computer vision, pattern recognition, and digital image processing. Target tracking means using image measurements and a predictive dynamic model to continuously estimate the state of one or more targets over a sequence of consecutive video frames. The main challenge of multi-target tracking is to model, continuously and effectively, multiple objects with high uncertainty in a complex scene, where the uncertainty arises from occlusion between targets, occlusion of a target by the background, illumination changes, motion blur, and false alarms. A multi-target tracking framework must solve three key problems: 1) modeling the dynamic motion patterns of multiple objects; 2) handling targets that enter or leave the scene; 3) keeping the tracking result robust when a target is occluded or when its appearance or the background changes. Single-target tracking algorithms focus primarily on problems 1) and 3); because of problem 2), simply applying several single-target trackers to the multi-target problem generally gives unsatisfactory results.

The prior art mainly adopts a 'detection before tracking' strategy for multi-target tracking. In this framework, object detection results are expressed as the coordinates of the four points of a rectangular box, extracted from the video sequence, and used as prior information in the tracking phase. The multi-target tracking problem thus becomes a data association problem, whose aim is to find a suitable metric for linking detection results into motion tracks frame by frame. The accuracy of a 'detection before tracking' algorithm depends mainly on two factors: 1) the quality of the detection results: once a detection is missing or wrong in some frame, or the target yields no detection while occluded, the identity information of the target is lost; 2) a robust data association model that assigns moving objects with high uncertainty to the correct motion tracks frame by frame. Existing deep learning techniques cannot fully resolve these two pain points; in addition, such methods have high time complexity, are sensitive to the quality of the appearance features used by the association model, and are unsuitable for scenes requiring real-time processing. Although many methods attempt to use deep neural networks for both feature representation and data association, multi-target tracking methods based on the 'detection before tracking' framework have never achieved a truly end-to-end pipeline.

Disclosure of Invention

In view of the defects of the prior art and the need for improvement, the invention provides an end-to-end multi-target tracking method using a global response graph. It aims to improve the 'detection before tracking' framework and to better integrate a deep-learning-based target detector into the visual tracking task, so as to efficiently achieve truly end-to-end multi-target tracking.

In order to achieve the purpose, the invention is realized by adopting the following technical scheme:

an end-to-end multi-target tracking method using a global response graph comprises the following steps:

1) expressing the motion characteristics of the tracked target: selecting the motion attributes of all tracked targets from the data set, carrying out normalization processing on the motion attributes, and expressing the attribute characteristics of all targets to a global response graph in the form of different channels;

2) generating an attribute label of the training sample corresponding to the step 1): generating attribute labels of actual existing states of the training samples by using a logic inference method based on historical state information;

3) target positioning: training the global response map defined in step 1) for target localization by means of an improved salient target detection subnetwork using the training data generated in step 2);

4) predicting a target position change: predicting the target position change by using a motion offset regression sub-network based on the interframe optical flow field according to the global response map obtained in the step 3);

5) distance measurement: calculating, for each corresponding point, the IOU distance between the observation space and the state space according to the global response map obtained in step 3) and the position change map obtained in step 4):

$$\mathrm{IOU}(a, b) = \frac{\mathrm{Area}(a) \cap \mathrm{Area}(b)}{\mathrm{Area}(a) \cup \mathrm{Area}(b)}$$

where Area(a) and Area(b) refer to the rectangular box areas of the target at the observed position and the predicted position, respectively;

6) track connection: constructing a global cost matrix using the distance measurement obtained in step 5), and then performing optimal assignment with the Hungarian algorithm by minimizing the cost matrix, thereby obtaining the final target tracks.

The further improvement of the invention is that the specific implementation method of the step 1) is as follows:

101) expressing the existence information of a target as a response value with a Gaussian distribution, where each response point takes a value in the range [0, 1];

102) expressing the position information of the targets into a form with Gaussian distribution instead of a form of a detection frame, wherein the central point of the Gaussian distribution of each target is the central point position of the target rectangular frame;

103) the presence or absence of all targets and the location information attributes are modeled simultaneously using a global response graph, each channel of the response graph representing an attribute of a target.
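Sub-steps 101) to 103) can be illustrated with a minimal sketch that renders a single existence channel of the response map. It assumes NumPy, boxes given as (cx, cy, w, h) in pixels, and a hypothetical `sigma_scale` that ties the Gaussian spread to the box size; the text here does not fix that relationship, so the spread is an illustrative choice only.

```python
import numpy as np

def render_response_map(boxes, height, width, sigma_scale=0.25):
    """Render one existence channel: a Gaussian peak of height 1 at each box centre.

    boxes: iterable of (cx, cy, w, h) in pixels.
    sigma_scale is a hypothetical choice (sigma = sigma_scale * box size).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    response = np.zeros((height, width), dtype=np.float64)
    for cx, cy, w, h in boxes:
        sx, sy = sigma_scale * w, sigma_scale * h
        g = np.exp(-((xs - cx) ** 2 / (2 * sx ** 2) + (ys - cy) ** 2 / (2 * sy ** 2)))
        # Where targets overlap, keep the stronger response so values stay in [0, 1].
        response = np.maximum(response, g)
    return response

response = render_response_map([(20, 15, 8, 12), (40, 30, 10, 10)], height=48, width=64)
```

Further attributes (for example x/y/w/h) would occupy additional channels stacked along a third axis; only the existence channel is shown.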

The further improvement of the present invention is that step 2) is to generate a corresponding training sample label for the response graph constructed in step 1), and the specific implementation method is as follows:

201) expressing the actual existence state of each tracked target in the training sample at each moment to be 0/1 response values;

202) deducing the response value at the current moment by observing the target state values at historical moments within a time window of length l = 10.

A further improvement of the invention is that the target localization sub-network used in step 3) is an auto-encoder whose input is a sequence of consecutive image frames within a time window and whose output is the global response map defined in step 1).

A further improvement of the invention is that, between step 3) and step 4), non-maximum suppression is performed on the global response map output by step 4), so as to filter out outliers whose response values are too low or that overlap.
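The suppression step can be sketched as a local-maximum filter on the response map. This is only one possible reading: the 3x3 neighbourhood and the `top_k` cap are hypothetical choices, while the default threshold of 0.05 mirrors the Score value given in the embodiment.

```python
import numpy as np

def extract_peaks(response, score_threshold=0.05, top_k=50):
    """Return (row, col, score) of local maxima above the threshold, strongest first."""
    h, w = response.shape
    padded = np.pad(response, 1, mode="constant", constant_values=-np.inf)
    # Maximum over each 3x3 neighbourhood, built from nine shifted views.
    neighbourhood_max = np.max(
        [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)], axis=0
    )
    keep = (response >= neighbourhood_max) & (response > score_threshold)
    rows, cols = np.nonzero(keep)
    order = np.argsort(-response[rows, cols])[:top_k]
    return [(int(rows[i]), int(cols[i]), float(response[rows[i], cols[i]])) for i in order]

demo = np.zeros((5, 5))
demo[2, 2], demo[4, 0], demo[0, 4] = 0.9, 0.5, 0.03
peaks = extract_peaks(demo)
```

In the demo, the point with response 0.03 falls below the threshold and is discarded, while the two stronger peaks are kept in descending order.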

A further improvement of the present invention is that step 4) uses region-of-interest pooling and multiple fully connected operations to regress, simultaneously and from the pixel-level global motion offset field, the position change information of all tracked targets at the next moment, including the offsets Δcx, Δcy of the target center point and the changes Δw, Δh of the target size.
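How the regressed quantities Δcx, Δcy, Δw, Δh turn a current box into a predicted box can be sketched as follows. The sketch assumes the offsets are additive and expressed in pixels, which the text does not state; a scale-normalized parameterization would need a different update.

```python
def apply_offsets(boxes, offsets):
    """Shift and resize (cx, cy, w, h) boxes by (dcx, dcy, dw, dh) offsets."""
    predicted = []
    for (cx, cy, w, h), (dcx, dcy, dw, dh) in zip(boxes, offsets):
        # Sizes are clamped at zero so a large negative offset cannot yield a negative box.
        predicted.append((cx + dcx, cy + dcy, max(w + dw, 0.0), max(h + dh, 0.0)))
    return predicted

predicted = apply_offsets([(10.0, 20.0, 4.0, 8.0)], [(1.5, -2.0, 0.5, -1.0)])
```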

The further improvement of the present invention is that step 5) is a post-processing step according to the global response map obtained in step 3) and the location prediction map obtained in step 4), and the specific implementation method is as follows:

501) taking the target positions localized in the target response map as values of the observation space, counting all response points whose response value exceeds a minimum threshold, and denoting their total number by M;

502) taking the target position predictions obtained by the regression network as values of the state space: for each tracked target, obtaining the predicted value at its corresponding position from the position prediction map output by the regression network, and denoting the total number of tracked targets by N;

503) for each tracked target, calculating the IOU distance between the rectangular box at its predicted position and the rectangular boxes at all positive response positions in the global response map of the next frame.
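The box overlap used in sub-step 503) can be sketched as a plain intersection-over-union function. Boxes are assumed to be (cx, cy, w, h) tuples; this layout is an illustrative convention, not one fixed by the text.

```python
def iou(box_a, box_b):
    """Intersection over union of two (cx, cy, w, h) boxes; 0.0 when they do not overlap."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

# N predicted boxes against M observed boxes gives an N x M overlap table.
predicted_boxes = [(5, 5, 10, 10)]
observed_boxes = [(5, 5, 10, 10), (10, 5, 10, 10), (50, 50, 4, 4)]
overlaps = [[iou(p, o) for o in observed_boxes] for p in predicted_boxes]
```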

The further improvement of the invention is that step 6) is a process of obtaining the IOU distance between the prediction space and the observation space according to step 5) to perform optimal allocation, and the specific implementation method is as follows:

601) constructing an N x M global cost matrix according to the total number N of the tracked targets and the total number M of the positive response points, wherein the corresponding position of the matrix is the IOU distance calculated by each target according to the step 5);

602) performing optimal allocation by minimizing a global cost matrix by using a Hungarian algorithm, wherein the allocation result is the corresponding relation between a tracked target and the position of a response point observed by the current frame;

603) connecting target tracks according to the assignment results: unassigned response points are used to initialize new tracks, unassigned targets are placed under observation, and a target track is terminated if the target under observation is still not matched to any response point after 10 frames.
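Sub-steps 601) and 602) can be sketched with SciPy's `linear_sum_assignment`, which solves the same minimum-cost assignment problem as the Hungarian algorithm. Two assumptions are made here and are not taken from the text: the cost is taken as 1 minus the overlap so that minimizing cost favors high overlap, and assignments below a gate `iou_min` are rejected (the 0.7 default mirrors the threshold quoted in the embodiment).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(iou_matrix, iou_min=0.7):
    """Match N predicted tracks (rows) to M response points (columns)."""
    iou_matrix = np.asarray(iou_matrix, dtype=float)
    cost = 1.0 - iou_matrix  # assumption: cost = 1 - IoU
    rows, cols = linear_sum_assignment(cost)
    # Keep only assignments whose overlap passes the gate.
    matches = [(int(r), int(c)) for r, c in zip(rows, cols) if iou_matrix[r, c] >= iou_min]
    matched_rows = {r for r, _ in matches}
    matched_cols = {c for _, c in matches}
    unmatched_tracks = [r for r in range(iou_matrix.shape[0]) if r not in matched_rows]
    unmatched_points = [c for c in range(iou_matrix.shape[1]) if c not in matched_cols]
    return matches, unmatched_tracks, unmatched_points

matches, lost_tracks, new_points = associate([[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]])
```

In terms of sub-step 603), the unmatched points would seed new tracks, and the unmatched tracks would enter the observation period before being terminated.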

The invention has at least the following beneficial technical effects:

(1) The multi-target tracking technology provided by the invention can represent the dynamic characteristics of multiple targets simultaneously. The preferred scheme uses a more efficient target localization network to perform target localization implicitly, and globally normalizes the motion characteristics of the tracked foreground targets by representing all target regions of interest at once as Gaussian distributions on a response map. On the one hand, this effectively separates the features of foreground targets from those of the background, so that the neural network can learn the consistency of the tracked foreground targets to the greatest extent; on the other hand, the global response map representation greatly improves the running speed of the tracking process.

(2) In the preferred scheme, the state of a target at the current moment is judged from the historical prior information in a sliding window: whether the target exists now is inferred from its existence states at past moments. This effectively handles target occlusion, since the global response map output by the network maintains a correct positive response while the target is occluded.

(3) In the preferred scheme, region-of-interest pooling and several fully connected operations are used to regress, simultaneously and from a pixel-level global motion offset field, the position changes of all tracked targets at the next moment. The network outputs the regressed positions of all targets in a single forward pass, which solves the problem of an uncertain number of targets during data association.

(4) In the preferred scheme, a global matching strategy directly matches the response points within each predicted region, so that the association of all targets is completed simultaneously in a single forward pass, which effectively reduces the computational complexity of the algorithm.

Compared with traditional multi-target tracking methods, the technology provided by the invention is no longer constrained by a specific target detection technique or a complex data association strategy, and needs no additional deep neural network to explicitly learn the appearance features of targets. It thereby realizes a truly end-to-end online tracking process and can improve both tracking speed and tracking accuracy in complex video surveillance scenes.

Drawings

Fig. 1 is a system network structure diagram of an end-to-end multi-target tracking method according to an embodiment of the present invention;

Fig. 2 is a three-dimensional schematic of a local response map characterizing a target state in an embodiment of the present invention;

Fig. 3 is a network structure diagram for performing position prediction using an inter-frame optical flow field in an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described with reference to the accompanying drawings and examples.

The invention provides an end-to-end multi-target tracking method using a global response graph, which comprises the following steps:

1) Expressing the motion characteristics of the tracked targets: the motion attributes of all tracked targets are selected from the data set and normalized, and the attribute characteristics of all targets are expressed on a global response graph in the form of different channels. In the end-to-end multi-target tracking method using the global response graph, step 1) is implemented as follows:

In conventional methods, motion characteristics are expressed as feature vectors that depend on the number of targets. In this step, by contrast, the motion characteristics of all targets are expressed simultaneously in the form of a global response map that is independent of the number of targets. The global response map contains multiple channels, and the response values of each channel can represent a different attribute of the target motion trajectory in the state space, such as 'existence or non-existence of the target', 'x/y/w/h', or 'Δx/Δy/Δw/Δh'.

2) Generating an attribute label of the training sample corresponding to the step 1): an attribute label of the actual presence state of the training sample is generated using a logical inference method based on historical state information. In the end-to-end multi-target tracking method using the global response graph, step 2) is to generate a corresponding training sample label for the response graph constructed in step 1), and the specific implementation method is as follows:

Conventional methods judge the target state from the current input frame only. In this step, the existence state of a target is extracted from consecutive input frames in a sliding-window manner and logically inferred, so that the existence state at the current moment is determined with the help of the historical prior information of the target.

3) Target localization: the global response map defined in step 1) is trained for target localization with an improved salient object detection sub-network, using the training data generated in step 2). In the end-to-end multi-target tracking method using the global response graph, the target localization sub-network used in step 3) is an auto-encoder whose input is a sequence of consecutive image frames within a time window and whose output is the global response map defined in step 1).

Conventional multi-target tracking methods localize targets with a detector based on a deep convolutional neural network. In this step, the invention instead uses an improved salient object detection network for localization. This network has fewer model parameters than a general-purpose detection network, so inference is faster and the targets of interest are localized more accurately.

4) Predicting target position changes: the target position change is predicted by a motion offset regression sub-network based on the inter-frame optical flow field, according to the global response map obtained in step 3). Between step 3) and step 4), the method further comprises performing non-maximum suppression on the global response map output in step 4), so as to filter out outliers whose response values are too low or that overlap.

Preferably, step 4) of the method uses region-of-interest pooling and multiple fully connected operations to regress, simultaneously and from a pixel-level global motion offset field, the position change information of all tracked targets at the next moment. This position change information includes the offsets Δcx, Δcy of the target center point and the changes Δw, Δh of the target size.

This step improves on the conventional strategy of building a motion model: the target position prediction task is embedded into the tracking network as an inter-frame optical flow field regressor and is trained and tested end to end. Introducing the region-of-interest pooling operation effectively solves the problem of an uncertain number of targets faced by conventional methods.

5) Distance measurement: for each corresponding point, the IOU distance between the observation space and the state space is calculated according to the global response map obtained in step 3) and the position change map obtained in step 4):

$$\mathrm{IOU}(a, b) = \frac{\mathrm{Area}(a) \cap \mathrm{Area}(b)}{\mathrm{Area}(a) \cup \mathrm{Area}(b)}$$

where Area(a) and Area(b) refer to the rectangular box areas of the target at the observed position and the predicted position, respectively. In the end-to-end multi-target tracking method using the global response graph, step 5) is a post-processing step applied to the global response map obtained in step 3) and the position prediction map obtained in step 4), and the specific implementation method is as follows:

6) Track connection: a global cost matrix is constructed using the distance measurement obtained in step 5), and optimal assignment is then performed with the Hungarian algorithm by minimizing the cost matrix, thereby obtaining the final target tracks. In the end-to-end multi-target tracking method using the global response graph, step 6) performs optimal assignment according to the IOU distance between the prediction space and the observation space obtained in step 5), and the specific implementation method is as follows:

The multi-target tracking technology provided by the invention achieves end-to-end multi-target tracking by using a global response map inferred from historical logic together with a position prediction map that incorporates the inter-frame optical flow field, which improves the accuracy of fast localization and tracking of all targets of interest in complex surveillance video scenes.

When performing data association, conventional methods generally extract the features of each target region iteratively and build pairwise distance measurements. This step improves on that iterative approach: by directly matching the response points within each predicted region, the association of all targets is completed simultaneously in a single forward pass, which effectively reduces the computational complexity of the algorithm.

Examples

The network structure of the multi-target tracking method provided by this embodiment is illustrated in Fig. 1. The method mainly comprises three modules: a target localization module, a position prediction module, and a data association module. The specific steps are as follows:

(1) First, in the target localization module, a simpler and more efficient method is used to express the motion characteristics and position information of the targets. All tracked objects at each time point are uniformly regarded as foreground targets and are expressed on a global response map as two-dimensional Gaussian distributions, each centered on the target position and taking values from 0 to 1.

A three-dimensional schematic of a local response map of target state features is shown in Fig. 2. Each Gaussian distribution represents a tracked foreground target; the x and y axes correspond to the spatial position of the target, and the z axis corresponds to the actual state of the target at the current time (0 means the target is absent, 1 means it is present). The radius r and the standard deviation σ of the Gaussian kernel are defined in terms of the following quantities: h and w are the height and width of the target detection box, α is a constant set to 0.7, and a_i and b_i are parameters calculated in each of the 3 different cases.

Through step (1), the global response map can describe the spatial position and the actual existence state of the targets at the same time. To learn such a representation, the next step prepares sufficient training data and corresponding sample labels for model training.

(2) The actual existence state labels of the targets are acquired by a logical inference method based on historical priors. The method is defined as follows: for the target trajectories {T_j}, j = 1, 2, …, m, taken from the training-sample ground truth, the response value of the actual presence state of a target at frame t is inferred from its historical states within a time window of length l.

In this estimate, β is a constant set to 0.6 that indicates the proportion of positive response values within the time window. If the target was present for most of the past period (more than 60% of the time), it is considered present at the current moment even if there is no corresponding detection result. Conversely, if the target was absent for most of the past period (more than 60% of the time), it is considered absent at the current moment even if there is a corresponding detection result. In addition, if the target has a response value of 1 at the previous moment (frame t-1), it is considered present at the current moment.
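The inference rules just described can be sketched as a small function. Several points are assumptions rather than statements of the source: `history` holds the annotated 0/1 states of the last l frames (oldest first), the previous-frame rule takes precedence over the ratio rules, and when neither ratio exceeds β the current detection decides.

```python
def infer_presence(history, detected_now, beta=0.6):
    """Infer the 0/1 presence label at frame t from the last l annotated states."""
    if not history:
        return int(detected_now)
    if history[-1] == 1:
        return 1  # present at frame t-1, so considered present now
    positive_ratio = sum(history) / len(history)
    if positive_ratio > beta:
        return 1  # mostly present in the window: bridge a missing detection
    if 1.0 - positive_ratio > beta:
        return 0  # mostly absent in the window: ignore a spurious detection
    return int(detected_now)  # assumption: otherwise fall back to the detection

label = infer_presence([1] * 7 + [0] * 3, detected_now=False)
```

With a window of length l = 10, a target seen in 7 of the last 10 frames keeps a positive label through a missed detection, which is the behaviour that lets the response map bridge short occlusions.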

Given the target feature representation defined in step (1) and the training samples and labels obtained in step (2), an improved salient object detection network is used as an auto-encoder to perform target localization.

(3) The target localization sub-network uses an HED-based saliency detection network that takes VGG16 as its backbone and adds short connections to the side output of each feature layer. The side outputs of layers 1, 2, 3, and 6 are averaged and then passed through a sigmoid activation function to obtain the output of the target localization network. Given a training set and the corresponding response map labels, the loss function of the target localization network is the standard cross-entropy loss, where P(y_j = 1 | X_l) denotes the probability that position j is activated, and the label Y is derived from the training samples by the logical inference method of step (2).
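A standard per-position binary cross-entropy can be sketched as follows. The averaging over positions and the clipping constant `eps` are assumptions made for the sketch; the exact normalization used for the localization network is not recoverable from the text here.

```python
import numpy as np

def response_map_bce(pred, label, eps=1e-7):
    """Mean binary cross-entropy between a predicted response map and its label map."""
    # Clip predictions away from 0 and 1 so the logarithms stay finite.
    pred = np.clip(np.asarray(pred, dtype=float), eps, 1.0 - eps)
    label = np.asarray(label, dtype=float)
    loss = -(label * np.log(pred) + (1.0 - label) * np.log(1.0 - pred))
    return float(loss.mean())

loss_value = response_map_bce([[0.5, 0.5]], [[1, 0]])
```

Here `pred` plays the role of P(y_j = 1 | X_l) at each position j, and `label` the role of Y.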

While learning the global response maps of all tracked targets through step (3), the predicted positions of all the targets of interest also need to be learned. The invention uses a regression network fused with an interframe optical flow field to predict and learn the target position.

(4) In the position prediction module, the FlowNet2 network is first used to extract the optical flow information between frames. Given two adjacent frames I^(t-1) and I^t, the optical flow field from frame t-1 to frame t can be expressed as a set of per-pixel vectors (u_i, v_i), which denote the optical flow of a pixel in the x direction and the y direction, respectively. After the position offsets of the pixels are obtained from the optical flow field, a regression network is used to learn the global position offsets of the Gaussian-distributed response points defined in step (1) as the predicted values of the target positions.
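To make the link between the dense flow field and a per-target offset concrete, the sketch below averages (u, v) inside a square window around a target center. This plain averaging is a simplified stand-in chosen for illustration, not the learned regression network described in the embodiment; the `(H, W, 2)` array layout and the `half_size` parameter are assumptions.

```python
import numpy as np

def mean_flow_in_roi(flow, cx, cy, half_size):
    """Average (u, v) flow inside a square window centred on (cx, cy), clipped to the image.

    flow: array of shape (H, W, 2) holding per-pixel (u, v).
    """
    h, w, _ = flow.shape
    x1, x2 = max(0, int(cx - half_size)), min(w, int(cx + half_size) + 1)
    y1, y2 = max(0, int(cy - half_size)), min(h, int(cy + half_size) + 1)
    patch = flow[y1:y2, x1:x2]
    return tuple(float(v) for v in patch.reshape(-1, 2).mean(axis=0))

flow = np.zeros((10, 10, 2))
flow[..., 0] = 2.0   # every pixel moves 2 px in x
flow[..., 1] = -1.0  # and -1 px in y
offset = mean_flow_in_roi(flow, cx=5, cy=5, half_size=2)
```

The window must overlap the image; a window lying entirely outside it would produce an empty patch and an undefined mean.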

Fig. 3 shows the network structure for position prediction using the inter-frame optical flow field in an embodiment of the present invention. For frame t, local non-maximum suppression is first performed on the obtained response map Z^t, the top k positive response distributions are selected by applying a response threshold Score = 0.05, and the coordinates of their center points are calculated. The Gaussian distribution with center point (cx, cy) and fixed size r_z is taken as a region of interest (ROI), and the position offsets of all ROIs are extracted from the optical flow field to form a feature vector on which regression is performed. The position offset regression network consists of one ROI pooling layer and several fully connected layers. The output of the regression network is the position offset vector from frame t-1 to frame t, D^t = {d_j = (Δcx, Δcy, Δw, Δh), j = 1, 2, …, k}. Given the ground truth G^t and the network output D^t, the smooth L1 loss is adopted as the loss function of the regression network.
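The smooth L1 loss between the offsets D^t and the ground truth G^t can be sketched in its common form: quadratic for small errors and linear for large ones. The switch point `transition` (set to 1.0 here) and the averaging over all offset components are assumptions, since the text does not give the exact formula.

```python
def smooth_l1(pred, target, transition=1.0):
    """Mean smooth L1 loss over lists of (dcx, dcy, dw, dh) offset tuples."""
    total, count = 0.0, 0
    for p_vec, t_vec in zip(pred, target):
        for p, t in zip(p_vec, t_vec):
            diff = abs(p - t)
            if diff < transition:
                total += 0.5 * diff * diff / transition  # quadratic near zero
            else:
                total += diff - 0.5 * transition  # linear for large errors
            count += 1
    return total / count if count else 0.0

loss_value = smooth_l1([(0.5, 2.0, 0.0, 0.0)], [(0.0, 0.0, 0.0, 0.0)])
```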

(5) After the global response map Z^t and the position prediction D^t are obtained through steps (3) and (4), the data association module performs global matching between the predicted values (D^t) and the observed values (Z^t). First, the IOU distance between each predicted position in frame t and its nearest-neighbor response value is calculated, where the nearest-neighbor response is found by taking the shortest distance between the center points of the two distributions. Then the largest response points in the observation space whose IOU distance exceeds the threshold IOU_min = 0.7 are selected as candidates, and the IOU distance between the response values of frame t-1 and the candidates of frame t is calculated to obtain a cost matrix. Finally, the Hungarian algorithm is used to solve for the minimum-cost optimal assignment.

(6) For frame t, after the global optimal assignment, a cascade matching strategy is used to re-match all response points and target tracks that remain unassigned. First, a constant A_max is set, representing the maximum number of frames for backtracking. For each response point in the current frame that is not matched to a track, the IOU distance between each terminated target track and that response point is calculated frame by frame within the time window t-1, t-2, …, t-A_max. Once the IOU distance in some frame exceeds the set threshold IOU_min = 0.7, the response point is considered to match that past track, and the corresponding target track position is updated with the response value. After the global cascade matching strategy, all unmatched response points are initialized as new targets, and all unmatched tracks are terminated.

The end-to-end multi-target tracking method using the global response graph improves on the traditional 'detection before tracking' framework and on data-association-based methods, and provides a distinctive feature representation and network structure for end-to-end multi-target tracking. The target locator of the invention, which operates on image sequences or video frames, effectively handles short-term occlusion of targets. The target position offset regressor, which uses the inter-frame optical flow field to estimate target motion, solves the matching problem with an uncertain number of targets in a single forward pass. The method breaks away from the traditional 'detection before tracking' framework and is completely end-to-end, requiring no prior detection or appearance feature information. In both speed and accuracy, the method reaches an advanced level in the industry.

Claims

1. An end-to-end multi-target tracking method using a global response graph is characterized by comprising the following steps:

2. The end-to-end multi-target tracking method using the global response graph as claimed in claim 1, wherein the specific implementation method of step 1) is as follows:

3. The end-to-end multi-target tracking method using the global response graph as claimed in claim 1, wherein step 2) is to generate corresponding training sample labels for the response graph constructed in step 1), and the specific implementation method is as follows:

4. The end-to-end multi-target tracking method using the global response graph according to claim 1, wherein the target localization sub-network used in step 3) is an auto-encoder whose input is a sequence of consecutive image frames within a time window and whose output is the global response map defined in step 1).

5. The end-to-end multi-target tracking method using the global response graph according to claim 1, wherein between step 3) and step 4), non-maximum suppression is further performed on the global response map output in step 4), so as to filter out outliers whose response values are too low or that overlap.

6. The end-to-end multi-target tracking method using the global response graph according to claim 1, wherein step 4) uses region-of-interest pooling and a plurality of fully connected operations to regress, simultaneously and from the pixel-level global motion offset field, the position change information of all tracked targets at the next moment, and the position change information comprises the offsets Δcx, Δcy of the target center point and the changes Δw, Δh of the target size.

7. The end-to-end multi-target tracking method using the global response graph according to claim 1, wherein the step 5) is a post-processing step of the global response graph obtained in the step 3) and the position prediction graph obtained in the step 4), and the implementation method is as follows:

8. The method for end-to-end multi-target tracking by using a global response graph as claimed in claim 7, wherein step 6) is a process of obtaining the IOU distance between the prediction space and the observation space according to step 5) for optimal allocation, and the specific implementation method is as follows: