CN111639551B

CN111639551B - Online multi-target tracking method and system based on twin network and long-short term clues

Info

Publication number: CN111639551B
Application number: CN202010404941.7A
Authority: CN
Inventors: 韩守东; 于恩; 刘东海生; 黄飘; 王宏伟
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2020-05-12
Filing date: 2020-05-12
Publication date: 2022-04-01
Anticipated expiration: 2040-05-12
Also published as: CN111639551A

Abstract

The invention discloses an online multi-target tracking method and system based on a twin network and long and short term clues, and belongs to the field of multi-target tracking. The method comprises the following steps: the twin network module is used for performing cross correlation on the tracking target template and the search area to obtain a response graph and acquiring a preliminarily predicted tracking track of each target; the correction module is used for combining the preliminary track and the observation frame and correcting the pedestrian frame through a pedestrian regression network; the data association module is used for calculating the similarity between the tracking track and the observed pedestrian, respectively extracting long and short term clues of the tracking track and the observed pedestrian and fusing the long and short term clues to further calculate the similarity, and distributing a corresponding observed pedestrian frame for each tracking track; and the track post-processing module is used for updating, supplementing and deleting the tracking track to complete the tracking of the current frame. The invention perfects the problems of apparent feature fusion, pedestrian interaction shielding and large-scale change in the multi-target tracking task, improves the accuracy and relieves the problem of feature misalignment.

Description

Online multi-target tracking method and system based on twin network and long-short term clues

Technical Field

The invention belongs to the technical field of multi-target tracking, and particularly relates to an online multi-target tracking method and system based on a twin network and long-short term clues.

Background

In the face of increasingly complex video scenes, effective processing of massive video data needs to be achieved, all meaningful targets in the video need to be detected, positioned, tracked and analyzed, wherein Multi-Object Tracking/Multi-Target Tracking serves as a middle-layer visual task and plays a very critical role, urban communities, highway traffic and accurate Tracking of residential isolation, pedestrians, foreign people and vehicles entering and exiting from crowds are monitored in real time through security cameras, and the method has great practical significance for epidemic situation monitoring. The multi-target tracking is oriented to a complex scene with large area and multiple pedestrians, and the number of the pedestrians in each frame is not fixed, so that the method is very suitable for video monitoring scenes.

In recent years, with the wide application of deep learning in the field of computer vision, the field of target tracking (especially single target tracking) has been rapidly developed. The field of multi-target tracking forms a main framework based on detection and tracking. The common prediction model mostly adopts a motion model based on motion information at present, but the motion model mostly assumes that a tracked object is in a state of uniform motion, and can not better process some sudden motion states (steering, acceleration running, sudden stop and the like), and is extremely easy to lose the track of the condition of pedestrian interaction shielding, and once the track is lost, the track is difficult to reconnect again.

Because a large number of high-density crowds exist in a multi-target tracking scene, the number of pedestrians is not fixed, and interaction shielding phenomenon exists among the pedestrians, the existing multi-target tracking algorithm based on detection still has great defects.

Disclosure of Invention

Aiming at the interactive occlusion problem among the tracked targets in the existing multi-target tracking task and the defect and improvement requirement of large-scale morphological change, the invention provides an online multi-target tracking method and system based on a twin network and long and short-term clues, aiming at improving the apparent feature fusion and pedestrian interactive occlusion in the multi-target tracking task to the maximum extent and solving the problem of large-scale change, greatly improving the data association precision and accuracy and relieving the problem of feature misalignment.

To achieve the above object, according to a first aspect of the present invention, there is provided an online multi-target tracking method based on a twin network and long-short term cues, the method comprising the steps of:

s0., cutting the target detection result of the first frame of the surveillance video as an observation frame to obtain the observation frame of each target of the 1 st frame, taking the observation frame as the first input of the twin network, initializing a target template, taking the observation frame of each target of the 1 st frame as the initial state of the target tracking track, wherein T is 2;

s1, carrying out target detection on a T-th frame, and cutting a target detection result of the T-th frame as an observation frame to obtain the observation frame of each target of the T-th frame; cutting the T-th frame by taking N times of areas of the positions of the target templates of the T-1 th frame as search areas to obtain a search area picture of the T-th frame, wherein N is more than or equal to 1;

s2, taking the search area picture of the T-th frame as the second input of the twin network to obtain a most possible tracking frame as the tracking frame of each target T-th frame;

s3, respectively extracting features of the observation frame and the tracking frame of each target T frame by using the trained Re-ID model, and calculating the similarity of the extracted features to obtain a long-term feature clue of each target T frame; calculating IOU between the tracking frame and the observation frame of each target Tth frame as a short-term characteristic clue of each target Tth frame;

s4, fusing the extracted long-term clues and short-term clues to obtain fused characteristic clues of the T-th frame of each target;

s5, taking the fusion characteristic clue of each T-th frame as a cost matrix of data association, and matching the tracking track with the observation frame;

s6, updating, supplementing and deleting the tracking track according to the data association result to complete the tracking of the T-th frame;

and S7, judging whether the video is finished, if so, finishing, otherwise, inputting the current pedestrian frame of the current tracking track into the twin network as a target template for updating, wherein T is T +1, and returning to the step S1.

Preferably, the twin network comprises the following processes:

(1) extracting a template feature map of each target, and extracting a feature map of a picture of a search area corresponding to the target;

(2) performing cross correlation on the template characteristic diagram and the search area characteristic diagram to obtain a multi-channel response diagram;

(3) classifying the tracking target in the multi-channel response image, and predicting a pedestrian regression frame according to response information in the multi-channel response image;

(4) scoring the pedestrian regression frame by quality assessment;

(5) and taking the product of the quality evaluation score and the classification confidence score as a final score, and taking a regression box with the highest final score as a tracking box.

Preferably, the quality assessment score is calculated as follows:

wherein l^*,r^*,t^*,b^*Respectively representing the distances from the center point of the object to the four edges of the object.

Preferably, the Re-ID model comprises: and the global branch and the local branch respectively extract global features and local features based on a multi-attention joint mechanism.

Preferably, IBN-Net is introduced in the underlying CNN of the Re-ID model by any of the following means:

1) dividing an output channel of a first convolution after picture input into two halves, wherein one half is subjected to IN standardization, the other half is subjected to BN standardization, and the same operation is also performed after the first increment;

2) the IN operation is added after the output of the spatial and channel entries IN the soft entry of the HACNN, and the IN operation is added after the first layer of convolution after the picture input.

Preferably, step S4 includes the steps of:

s41, obtaining a long-term clue as a real distance through the Re-ID model, obtaining a short-term clue as a sot distance through IOU calculation, and using the number of times pause that the target i track is lost_iCalculating the scaling factor

S42, judging pause_iIf it exceeds 2, increasing the proportionality coefficient and new proportionality coefficient

The long-term thread is updated to be a real distance (rate/real threshold), otherwise, the long-term thread is updated to be a real distance/real threshold, TL represents the limiting time of the track loss, and the real threshold represents the Re-ID enhancement coefficient;

s43, calculating a cost matrix cost of the target i_i＝rate×sot distance+(1-rate)×reid distance。

Preferably, before the data correlation, the observation pedestrian frame before correction is transmitted to the twin network for prediction, and the position where the pedestrian frame is possible is obtained to obtain the rough pedestrian frame before screening, so as to determine the observation pedestrian frame sequence.

Preferably, step S6 includes:

directly updating relevant parameters of the successfully associated tracking track;

regarding the observation frames which are not successfully correlated, taking the observation frames as an initial state and adding a tracking sequence again;

regarding the tracking track which is not associated successfully as a lost state;

if the lost state persists beyond the limit time, the active state of the trace is cancelled.

Preferably, the computation formula of the limit time of the track loss is as follows:

where pd represents pedestrian density, TL₀A basic time limit is indicated and,

represents a round-down operation, num_detIndicates the number of detected pedestrians, num₀Representing a pedestrian number threshold.

To achieve the above object, according to a second aspect of the present invention, there is provided an online multi-target tracking system based on a twin network and long-short term cues, comprising:

the twin network module is used for performing cross-correlation on the tracking target template and the search area to obtain a response graph and obtain a preliminarily predicted tracking track of each target;

the correction module is used for combining the acquired preliminary track with the observation frame and correcting the pedestrian frame through a pedestrian regression network;

the data association module is used for calculating the similarity between the tracking track and the observed pedestrian, extracting long and short term characteristic clues of the tracking track and the observed pedestrian respectively and fusing the long and short term characteristic clues to further calculate the similarity, and distributing a corresponding observed pedestrian frame for each tracking track;

and the track post-processing module is used for updating, supplementing and deleting the tracking track to complete the tracking of the current frame.

Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:

(1) the invention constructs a prediction model in a multi-target tracking framework based on the twin network, and because the core idea of the twin network is based on a response diagram, namely the probability of the twin network being a tracking target is judged by comparing the feature similarity of a target template and a candidate region, the influence of a sudden change motion state on tracking is reduced. And introducing a quality branch in the classification branch to score the regression frame of the regression branch, comprehensively considering space and amplitude limitations, performing more accurate scoring on the regression frame, and finally regarding the regression frame with the highest score as a pedestrian prediction frame.

(2) According to the invention, long-term characteristic clue extraction is carried out on the tracked target of each frame by a pedestrian re-identification technology, a multi-attention combination mechanism is adopted in the network, and under the multi-attention combination mechanism, the model can pay more attention to the foreground and the part of the target which is not shielded, so that accurate long-term clues can be extracted more conveniently. And a model structure combining example standardization and batch standardization for enhancing the generalization capability of the Re-ID model is constructed, so that characteristic clues are extracted better.

(3) According to the invention, the long-term characteristic information of the pedestrian is extracted through the Re-ID module, and the part of characteristic information has stronger adaptability to shielding, large-scale change and the like; the extracted long-term clues and short-term clues are subjected to weighted fusion by taking the overlapping degree of the pedestrian frames as short-term characteristic clues, so that the effective utilization of the long-term clues and the short-term clues is realized, and the problem of characteristic misalignment after some special shelters or large-scale changes occur is solved.

Drawings

FIG. 1 is a flow chart of an online multi-target tracking method based on a twin network and long and short term clues according to the present invention;

FIG. 2 is a diagram of a multi-target tracking infrastructure based on a twin network according to the present invention;

FIG. 3(a) is a diagram of a Re-ID model structure provided by the present invention;

FIG. 3(b) is a block diagram of the introduced combined standardization and batch standardization provided by the present invention;

FIG. 4 is a flow chart of a trace post-processing provided by the present invention;

fig. 5 is a diagram of a pedestrian regional regression network structure provided by the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

The on-line tracking refers to predicting the next frame by using the related information of the historical frame and the current frame.

As shown in FIG. 1, the invention provides an online multi-target tracking method based on twin network and long-short term clues, which comprises the following steps:

and step S0., cutting the target detection result of the first frame of the surveillance video as an observation frame to obtain an observation frame of each target of the 1 st frame, taking the observation frame as the first input of the twin network, initializing a target template, and taking the observation frame of each target of the 1 st frame as the initial state of the target tracking track, wherein T is 2.

Track initialization is carried out on each target in the initial frame, and information such as the pedestrian ID of the target and the coordinates of a pedestrian frame are recorded.

S1, carrying out target detection on a T-th frame, and cutting a target detection result of the T-th frame as an observation frame to obtain an observation frame of each target of the T-th frame; and cutting the T-th frame by taking the N times area of the position of each target template of the T-1 th frame as a search area to obtain a search area picture of the T-th frame, wherein N is more than or equal to 1.

In this embodiment, N is 2.

And S2, taking the search area picture of the T-th frame as a second input of the twin network to obtain a most possible tracking frame as a tracking frame of each target T-th frame.

As shown in fig. 2, the multi-target tracking task is decomposed into tracking (SOT) branches of multiple targets and data association is performed by combining with an appearance model, firstly, for each target in a pedestrian sequence, a tracker is initialized for the target, the target is tracked separately, the tracker takes a twin Network as a basic structure, and a quality detection branch is added to an RPN (Region candidate regression) structure behind a response map to score a regression frame.

Preferably, the twin network comprises the following processes:

(1) and extracting the template characteristic graph of each target, and extracting the characteristic graph of the picture of the search area corresponding to the target.

(2) And performing cross correlation on the template characteristic diagram and the search area characteristic diagram to obtain a multi-channel response diagram.

Wherein f (z, x) is a response map,

a convolution function for the shared parameters used to extract features, a cross-correlation operation, and bl an offset value for each point on the response map.

The response map includes the location and semantic information of the target.

(3) And classifying the tracking target in the multi-channel response diagram, and predicting a pedestrian regression frame according to response information in the multi-channel response diagram.

To reduce the conflict between regression and classification, it is common to do it separately, i.e., duplicate the same response map into two parts for the regression and classification operations, respectively.

(4) And scoring the pedestrian regression box through quality evaluation.

The twin network comprises multiple branches, wherein the classification branch is responsible for classifying the tracking targets in the response diagram, and the regression branch is responsible for predicting the pedestrian regression box according to the response information in the response diagram. And introducing a quality branch in the classification branch, and taking charge of scoring the regression frame of the regression branch, and finally taking the regression frame with the highest score as a pedestrian prediction frame.

The invention adopts a mode of combining position space and amplitude limitation to score, and preferably, the quality evaluation score calculation formula is as follows:

S3, respectively extracting features of the observation frame and the tracking frame of each target T frame by using the trained Re-ID model, and calculating the similarity of the extracted features to obtain a long-term feature clue of each target T frame; and calculating the IOU between the tracking frame and the observation frame of each target Tth frame as a short-term characteristic clue of each target Tth frame.

In order to ensure the diversity of pedestrian sequences with the same identity, samples are screened in an Intersection Over Unit (IOU) and visibility comparison mode, after a first picture of each pedestrian sequence is initialized, a next pedestrian frame with the same identity, of which the IOU is smaller than 0.7 or the visibility difference exceeds 0.2, is selected as a next sample, and the like. 295 pedestrian IDs can be obtained finally, and the total number of samples is 33573. Adam is adopted in the training process to optimize network weight, the initial learning rate is set to be 0.003, the batch size is 64, the input resolution is 160 multiplied by 64, and a total training is 150 epochs. The loss function of the multitask convolution neural network is designed as a cross entropy loss function:

where N represents the number of samples in the current training batch (batch), y_iAnd

and respectively representing the network predicted value and the real label of the pedestrian classification category joint probability distribution.

As shown in fig. 3(a), preferably, the Re-ID model includes: and the global branch and the local branch respectively extract global features and local features based on a multi-attention joint mechanism.

The network adopts a multi-attention combination mechanism, and under the multi-attention combination mechanism, the model can pay more attention to the foreground and the part of the target which is not shielded, so that accurate long-term clues can be extracted more conveniently. And combining hard attribute and soft attribute to realize the feature extraction of the severe scale change target and the occluded target.

The invention introduces IBN-Net to the bottom CNN of the Re-ID model. Two IBN-Net construction methods are shown in fig. 3 (b): one is that the output channel of the first convolution after the picture input is divided into two halves, where one half is subjected to IN Normalization and the other half is subjected to BN Normalization, and the same operation is performed after the first convolution, which is called HACNN _ IBN. The other is to add IN after the output of the spatial and channel entries IN the soft entry of HACNN, and add IN operation after the first layer convolution after picture input, called HACNN _ IBN _ B. The Re-ID model introduces the structure combining standardization and batch standardization for training, and solves the problem of cross-domain generalization.

And extracting the apparent characteristics of each object in the tracking sequence and observation through Re-ID, and finally calculating characteristic cosine distance as a long-term clue. And combining the overlapping scale of each observed object and the tracking track, and calculating the overlapping degree of each observed object as a short-term characteristic clue.

And S4, fusing the extracted long-term clues and short-term clues to obtain fused characteristic clues of the T-th frame of each target.

Specifically, step S4 includes the steps of:

The long-term thread is updated to a real distance (rate/real threshold), otherwise, the long-term thread is updated to a real distance/real threshold, TL represents the restriction time of the track loss, and the real threshold represents the Re-ID enhancement coefficient. In this embodiment, TL is 3 frames, and reid thresh is 0.7.

And S5, taking the fusion characteristic clue of each T-th frame as a cost matrix associated with data, and matching the tracking track with the observation box.

In a multi-target tracking scene, the number of targets in each frame is dynamically changed, including disappearance of old targets and appearance of new targets, and multiple targets between frames need to be matched, namely data association.

Data association is completed by using the Hungarian algorithm, and the threshold value of the cost matrix is preferably 0.7. This step may assign a corresponding tracking trajectory, i.e. target identity, to each observation pedestrian frame.

And S6, updating, supplementing and deleting the tracking track according to the data association result to complete the tracking of the T-th frame.

Specifically, as shown in fig. 4, step S6 includes:

if the lost state persists for more than a certain time, the active state of the track is cancelled.

represents a round-down operation, num_detIndicates the number of detected pedestrians, num₀Representing a pedestrian number threshold. num₀Setting is carried out according to the complexity of different scenes.

As shown in fig. 5, in order to obtain a finer view frame, preferably, between step S2 and step S3, the method further comprises:

supplementing an observation frame of the T-th frame by using a tracking frame of each target of the T-th frame;

and correcting the supplemented observation frame by using a regional regression network to obtain the corrected observation frame of each target Tth frame.

The invention integrates the processes into a unified multi-target tracking framework, and takes an MOT17 test set as an example to carry out effect display. Wherein MOTA represents the track proportion of the correct overall tracking, IDF1 represents the identity confidence score of the tracking track, MT represents the track proportion of the tracking track with the effective length of more than 80%, ML represents the track proportion with the effective length of less than 20%, FP represents the number of the backgrounds judged as the tracking objects, FN represents the number of the tracking objects judged as the backgrounds, and ID Sw represents the number of times of identity conversion in the track.

The overall tracking effect on the final MOT17 test set is shown in table 1, wherein the specific results of each video are shown in table 2.

TABLE 1

TABLE 2

Correspondingly, the invention also provides an online multi-target tracking system based on the twin network and the long and short term clues, which comprises the following steps:

the twin network module is used for obtaining a response graph by performing cross-correlation on a tracking target template and a search area, and obtaining a preliminarily predicted tracking track of each target;

Preferably, the data association module comprises a long-term apparent feature difference calculation module based on the Re-ID appearance model and a short-term feature difference calculation module based on the pedestrian frame overlapping degree, and the long-term apparent feature difference calculation module and the short-term feature difference calculation module are respectively used for calculating the difference between the tracking track and the long-term and short-term apparent feature of the observed pedestrian frame, and then carrying out weighted fusion on the difference based on the tracking track related information to obtain the final feature difference.

The Re-ID model of the long-term feature difference calculation module introduces the structure combining standardization and batch standardization for training, and solves the problem of cross-domain generalization.

It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. An online multi-target tracking method based on twin networks and long-short term clues is characterized by comprising the following steps:

s7, judging whether the video is finished, if so, finishing, otherwise, inputting the current pedestrian frame of the current tracking track into the twin network as a target template for updating, wherein T is T +1, and returning to the step S1;

step S4 includes the following steps:

2. The method of claim 1, wherein the twin network comprises the following processes:

(4) scoring the pedestrian regression frame by quality assessment;

3. The method of claim 2, wherein the quality assessment score is calculated as follows:

wherein l^*，r^*，t^*，b^*Respectively representing the distances from the center point of the object to the four edges of the object.

4. The method of any one of claims 1 to 3, wherein the Re-ID model comprises: and the global branch and the local branch respectively extract global features and local features based on a multi-attention joint mechanism.

5. The method of claim 4, wherein IBN-Net is introduced at the underlying CNN of the Re-ID model by any one of:

1) dividing an output channel of a first convolution after picture input into two halves, wherein one half is subjected to example standardization, the other half is subjected to batch standardization, and the same operation is also performed after the first inclusion;

2) the example is added after the output of the spatial and channel entries in the soft entry of HACNN, and the example operation is added after the first layer of convolution after the picture input.

6. A method according to any one of claims 1 to 3, wherein prior to said data correlation, the pre-corrected observed pedestrian frames are transported to a twin network for prediction, and the positions where pedestrian frames are likely to be located are obtained to obtain the unscreened coarse pedestrian frames, thereby determining the sequence of observed pedestrian frames.

7. The method according to any one of claims 1 to 3, wherein step S6 includes:

8. The method of claim 1, wherein the constraint time for track loss is calculated as follows:

9. An online multi-target tracking system based on twin networks and long-short term cues, comprising:

a computer-readable storage medium and a processor;

the computer-readable storage medium is used for storing executable instructions;

the processor is used for reading executable instructions stored in the computer-readable storage medium and executing the twin network and long-short term clue-based online multi-target tracking method of any one of claims 1 to 8.