CN111639551B - Online multi-target tracking method and system based on twin network and long-short term clues - Google Patents
- Publication number
- Publication number: CN111639551B · Application number: CN202010404941.7A
- Authority
- CN
- China
- Prior art keywords
- frame
- target
- tracking
- pedestrian
- track
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G06V40/20 — Recognition of biometric, human-related or animal-related patterns in image or video data; Movements or behaviour, e.g. gesture recognition
- G06F18/24 — Pattern recognition; Classification techniques
- G06F18/253 — Pattern recognition; Fusion techniques of extracted features
- G06N3/045 — Neural networks; Combinations of networks
- G06V20/40 — Scenes; Scene-specific elements in video content
Abstract
The invention discloses an online multi-target tracking method and system based on a twin network and long- and short-term clues, belonging to the field of multi-target tracking. The system comprises: a twin network module, which cross-correlates the tracking-target template with the search region to obtain a response map and acquire a preliminarily predicted tracking trajectory for each target; a correction module, which combines the preliminary trajectory with the observation frames and corrects the pedestrian frames through a pedestrian regression network; a data association module, which calculates the similarity between tracking trajectories and observed pedestrians by extracting and fusing their long- and short-term clues, and assigns a corresponding observed pedestrian frame to each tracking trajectory; and a trajectory post-processing module, which updates, supplements, and deletes tracking trajectories to complete the tracking of the current frame. The invention improves apparent-feature fusion, handles pedestrian mutual occlusion and large scale changes in multi-target tracking tasks, improves accuracy, and alleviates the feature-misalignment problem.
Description
Technical Field
The invention belongs to the technical field of multi-target tracking, and particularly relates to an online multi-target tracking method and system based on a twin network and long-short term clues.
Background
In the face of increasingly complex video scenes, massive video data must be processed effectively: every meaningful target in a video needs to be detected, located, tracked, and analysed. Multi-object tracking (also called multi-target tracking), as a mid-level vision task, plays a key role here. Real-time monitoring through security cameras of urban communities, highway traffic, residential quarantine, and of the pedestrians, visitors, and vehicles entering and leaving crowds is of great practical significance, for example for epidemic monitoring. Multi-target tracking addresses complex scenes with large areas and many pedestrians, where the number of pedestrians in each frame is not fixed, so it is well suited to video surveillance scenarios.
In recent years, with the wide application of deep learning in computer vision, the field of target tracking (especially single-target tracking) has developed rapidly, and multi-target tracking has settled on a mainstream tracking-by-detection framework. Current prediction models mostly rely on motion information, but they usually assume that the tracked object moves at constant velocity; they handle sudden motion states (turning, acceleration, abrupt stops, and the like) poorly, easily lose tracks when pedestrians mutually occlude one another, and once a track is lost it is difficult to reconnect.
Because multi-target tracking scenes contain large, high-density crowds, the number of pedestrians is not fixed, and pedestrians occlude one another, existing detection-based multi-target tracking algorithms still have significant shortcomings.
Disclosure of Invention
Aiming at the mutual-occlusion problem among tracked targets and the shortcomings under large-scale morphological changes in existing multi-target tracking tasks, the invention provides an online multi-target tracking method and system based on a twin network and long- and short-term clues. It aims to improve apparent-feature fusion and handle pedestrian mutual occlusion in multi-target tracking to the greatest extent, cope with large scale changes, greatly improve the precision and accuracy of data association, and alleviate the feature-misalignment problem.
To achieve the above object, according to a first aspect of the present invention, there is provided an online multi-target tracking method based on a twin network and long-short term cues, the method comprising the steps of:
S0. Crop the target detection results of the first frame of the surveillance video as observation frames to obtain the observation frame of each target in frame 1; take the observation frames as the first input of the twin network to initialize the target templates; take the observation frame of each target in frame 1 as the initial state of its tracking trajectory; set T = 2;
S1. Perform target detection on frame T and crop the detection results as observation frames to obtain the observation frame of each target in frame T; crop frame T using an area N times the size of each target template's position in frame T−1 as the search region to obtain the search-region picture of frame T, where N ≥ 1;
S2. Feed the search-region picture of frame T as the second input of the twin network to obtain the most likely tracking frame as each target's tracking frame in frame T;
S3. Extract features from the observation frame and tracking frame of each target in frame T with the trained Re-ID model and compute the similarity of the extracted features to obtain each target's long-term feature clue for frame T; compute the IOU between the tracking frame and observation frame of each target in frame T as its short-term feature clue;
S4. Fuse the extracted long-term and short-term clues to obtain each target's fused feature clue for frame T;
S5. Use the fused feature clues of frame T as the cost matrix for data association, and match tracking trajectories with observation frames;
S6. Update, supplement, and delete tracking trajectories according to the data-association result to complete the tracking of frame T;
S7. Judge whether the video has ended; if so, finish; otherwise, feed the current pedestrian frame of each current tracking trajectory into the twin network as the updated target template, set T = T + 1, and return to step S1.
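The per-frame loop of steps S0–S7 can be sketched as follows. This is a minimal skeleton under stated assumptions: the callables `detect`, `siamese_track`, `associate`, and `update_tracks` are illustrative placeholders standing in for the detector, twin network, data-association, and trajectory post-processing modules described in the text, not the patented implementation.

```python
def track_video(frames, detect, siamese_track, associate, update_tracks):
    """Skeleton of the S0-S7 loop. All four callables are placeholders
    for the modules described in the text (detector, twin network,
    data association, trajectory post-processing)."""
    observations = detect(frames[0])              # S0: observation frames of frame 1
    templates = {i: obs for i, obs in enumerate(observations)}
    tracks = {i: [obs] for i, obs in enumerate(observations)}
    for T in range(1, len(frames)):               # S7 loop (0-based frame index here)
        observations = detect(frames[T])          # S1: detections of frame T
        track_boxes = {i: siamese_track(templates[i], frames[T])
                       for i in tracks}           # S2: most likely tracking frames
        matches = associate(track_boxes, observations)  # S3-S5: fused clues + matching
        tracks, templates = update_tracks(tracks, templates,
                                          matches, observations)  # S6 + template update
    return tracks
```

The twin network's target templates are refreshed from the matched pedestrian frames at the end of each iteration, as step S7 requires.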
Preferably, the twin network comprises the following processes:
(1) extracting a template feature map of each target, and extracting a feature map of a picture of a search area corresponding to the target;
(2) performing cross correlation on the template characteristic diagram and the search area characteristic diagram to obtain a multi-channel response diagram;
(3) classifying the tracking target in the multi-channel response image, and predicting a pedestrian regression frame according to response information in the multi-channel response image;
(4) scoring the pedestrian regression frame by quality assessment;
(5) and taking the product of the quality evaluation score and the classification confidence score as a final score, and taking a regression box with the highest final score as a tracking box.
Preferably, the quality assessment score is calculated from l*, r*, t*, b*, which respectively denote the distances from the target's centre point to its four edges.
Preferably, the Re-ID model comprises a global branch and a local branch, which respectively extract global features and local features based on a multi-attention joint mechanism.
Preferably, IBN-Net is introduced in the underlying CNN of the Re-ID model by any of the following means:
1) splitting the output channels of the first convolution after the picture input into two halves, normalizing one half with IN and the other with BN, and performing the same operation after the first Inception block;
2) adding an IN operation after the outputs of the spatial and channel attention in the soft attention of the HACNN, and after the first convolution layer following the picture input.
Preferably, step S4 includes the steps of:
S41. Obtain the long-term clue reid_distance through the Re-ID model, obtain the short-term clue sot_distance through the IOU calculation, and compute the scaling factor rate from pause_i, the number of frames for which the track of target i has been lost;
S42. Judge pause_i: if it exceeds 2, increase the scaling factor and update the long-term clue to reid_distance × rate / reid_thresh; otherwise, update the long-term clue to reid_distance / reid_thresh, where TL denotes the limit time for track loss and reid_thresh denotes the Re-ID enhancement coefficient;
S43. Compute the cost-matrix entry of target i: cost_i = rate × sot_distance + (1 − rate) × reid_distance.
Preferably, before the data association, the uncorrected observation pedestrian frames are fed into the twin network for prediction to obtain the likely pedestrian-frame positions, yielding rough pedestrian frames before screening, so as to determine the sequence of observation pedestrian frames.
Preferably, step S6 includes:
directly updating relevant parameters of the successfully associated tracking track;
regarding the observation frames which are not successfully correlated, taking the observation frames as an initial state and adding a tracking sequence again;
regarding the tracking track which is not associated successfully as a lost state;
if the lost state persists beyond the limit time, the active state of the trace is cancelled.
Preferably, the limit time for track loss is computed from pd, the pedestrian density; TL0, the basic time limit; a round-down operation ⌊·⌋; num_det, the number of detected pedestrians; and num0, the pedestrian-number threshold.
To achieve the above object, according to a second aspect of the present invention, there is provided an online multi-target tracking system based on a twin network and long-short term cues, comprising:
the twin network module is used for performing cross-correlation on the tracking target template and the search area to obtain a response graph and obtain a preliminarily predicted tracking track of each target;
the correction module is used for combining the acquired preliminary track with the observation frame and correcting the pedestrian frame through a pedestrian regression network;
the data association module is used for calculating the similarity between the tracking track and the observed pedestrian, extracting long and short term characteristic clues of the tracking track and the observed pedestrian respectively and fusing the long and short term characteristic clues to further calculate the similarity, and distributing a corresponding observed pedestrian frame for each tracking track;
and the track post-processing module is used for updating, supplementing and deleting the tracking track to complete the tracking of the current frame.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
(1) The invention constructs the prediction model in the multi-target tracking framework on a twin network. Because the core idea of the twin network is response-map based — the probability that a candidate region is the tracked target is judged by comparing its feature similarity with the target template — the influence of sudden motion-state changes on tracking is reduced. A quality branch is introduced alongside the classification branch to score the regression frames of the regression branch; considering both spatial position and amplitude limits yields a more accurate score, and the regression frame with the highest score is finally taken as the pedestrian prediction frame.
(2) The invention extracts long-term feature clues for the tracked targets of each frame through pedestrian re-identification. The network adopts a multi-attention joint mechanism, under which the model attends more to the foreground and to the unoccluded parts of targets, making it easier to extract accurate long-term clues. A model structure combining instance normalization and batch normalization is constructed to enhance the generalization ability of the Re-ID model and better extract feature clues.
(3) The invention extracts long-term appearance information of pedestrians through the Re-ID module; this feature information adapts well to occlusion, large scale changes, and the like. Taking the overlap of pedestrian frames as the short-term feature clue, the extracted long-term and short-term clues are fused by weighting, realizing effective use of both and alleviating feature misalignment after special occlusions or large scale changes occur.
Drawings
FIG. 1 is a flow chart of an online multi-target tracking method based on a twin network and long and short term clues according to the present invention;
FIG. 2 is a diagram of a multi-target tracking infrastructure based on a twin network according to the present invention;
FIG. 3(a) is a diagram of a Re-ID model structure provided by the present invention;
FIG. 3(b) is a block diagram of the introduced combined standardization and batch standardization provided by the present invention;
FIG. 4 is a flow chart of a trace post-processing provided by the present invention;
fig. 5 is a diagram of a pedestrian regional regression network structure provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Online tracking refers to predicting the next frame using only the information of historical frames and the current frame.
As shown in FIG. 1, the invention provides an online multi-target tracking method based on twin network and long-short term clues, which comprises the following steps:
Step S0. Crop the target detection results of the first frame of the surveillance video as observation frames to obtain the observation frame of each target in frame 1; take the observation frames as the first input of the twin network to initialize the target templates; take the observation frame of each target in frame 1 as the initial state of its tracking trajectory; set T = 2.
Track initialization is carried out on each target in the initial frame, and information such as the pedestrian ID of the target and the coordinates of a pedestrian frame are recorded.
S1, carrying out target detection on a T-th frame, and cutting a target detection result of the T-th frame as an observation frame to obtain an observation frame of each target of the T-th frame; and cutting the T-th frame by taking the N times area of the position of each target template of the T-1 th frame as a search area to obtain a search area picture of the T-th frame, wherein N is more than or equal to 1.
In this embodiment, N is 2.
And S2, taking the search area picture of the T-th frame as a second input of the twin network to obtain a most possible tracking frame as a tracking frame of each target T-th frame.
As shown in fig. 2, the multi-target tracking task is decomposed into single-object tracking (SOT) branches for the individual targets, combined with an appearance model for data association. First, a tracker is initialized for each target in the pedestrian sequence and tracks it separately. The tracker uses a twin network as its basic structure, and a quality-assessment branch is added to the RPN (Region Proposal Network) structure behind the response map to score the regression frames.
Preferably, the twin network comprises the following processes:
(1) and extracting the template characteristic graph of each target, and extracting the characteristic graph of the picture of the search area corresponding to the target.
(2) And performing cross correlation on the template characteristic diagram and the search area characteristic diagram to obtain a multi-channel response diagram.
f(z, x) = φ(z) ⋆ φ(x) + b·𝟙, where f(z, x) is the response map, φ is the feature-extraction convolution with shared parameters, ⋆ denotes the cross-correlation operation, and b·𝟙 is an offset value added at each point of the response map.
The response map includes the location and semantic information of the target.
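As a minimal numeric sketch of the cross-correlation between a template feature map and a search-region feature map (using identity features and a scalar offset b, both simplifications for illustration):

```python
def cross_correlate(template, search, b=0.0):
    """Slide `template` over `search` (both 2-D lists) and return the
    response map of inner products plus a constant offset b at every
    position -- the f(z, x) = phi(z) * phi(x) + b * 1 operation with
    identity features phi."""
    th, tw = len(template), len(template[0])
    sh, sw = len(search), len(search[0])
    response = []
    for y in range(sh - th + 1):
        row = []
        for x in range(sw - tw + 1):
            score = sum(template[i][j] * search[y + i][x + j]
                        for i in range(th) for j in range(tw))
            row.append(score + b)
        response.append(row)
    return response
```

The peak of the response map marks the most likely target location within the search region.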
(3) And classifying the tracking target in the multi-channel response diagram, and predicting a pedestrian regression frame according to response information in the multi-channel response diagram.
To reduce the conflict between regression and classification, the two are usually handled separately: the same response map is duplicated into two copies, used for the regression and classification operations respectively.
(4) And scoring the pedestrian regression box through quality evaluation.
(5) And taking the product of the quality evaluation score and the classification confidence score as a final score, and taking a regression box with the highest final score as a tracking box.
The twin network comprises multiple branches: the classification branch classifies the tracked targets in the response map, and the regression branch predicts pedestrian regression frames from the response information. A quality branch is introduced alongside the classification branch to score the regression frames, and the regression frame with the highest score is finally taken as the pedestrian prediction frame.
The invention scores by combining positional space and amplitude limits. Preferably, the quality evaluation score is computed from l*, r*, t*, b*, which respectively denote the distances from the target's centre point to its four edges.
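The original score formula is not reproduced in the source text; a plausible reconstruction from the stated quantities is a centerness-style score over the distances (l*, r*, t*, b*) — this exact form is an assumption, shown together with the product rule of step (5):

```python
import math

def quality_score(l, r, t, b):
    """Centerness-style quality score from the distances (l, r, t, b)
    between a sampled point and the four edges of its box: 1.0 at the
    exact centre, approaching 0 near an edge. The formula itself is an
    assumed reconstruction, not quoted from the patent."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

def final_score(quality, cls_confidence):
    """Final frame score: product of the quality evaluation score and
    the classification confidence score, per step (5)."""
    return quality * cls_confidence
```

The regression frame with the highest `final_score` is then taken as the tracking frame.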
S3, respectively extracting features of the observation frame and the tracking frame of each target T frame by using the trained Re-ID model, and calculating the similarity of the extracted features to obtain a long-term feature clue of each target T frame; and calculating the IOU between the tracking frame and the observation frame of each target Tth frame as a short-term characteristic clue of each target Tth frame.
In order to ensure the diversity of pedestrian sequences with the same identity, samples are screened by comparing Intersection over Union (IOU) and visibility: after the first picture of each pedestrian sequence is initialized, the next pedestrian frame of the same identity whose IOU with the previous sample is smaller than 0.7 or whose visibility differs by more than 0.2 is selected as the next sample, and so on. This finally yields 295 pedestrian IDs and 33573 samples in total. Adam is used to optimize the network weights during training, with an initial learning rate of 0.003, a batch size of 64, an input resolution of 160 × 64, and 150 epochs in total. The loss function of the multi-task convolutional neural network is the cross-entropy loss:
L = −(1/N) Σᵢ yᵢ log ŷᵢ, where N denotes the number of samples in the current training batch, and yᵢ and ŷᵢ respectively denote the ground-truth label and the network's predicted value of the pedestrian classification-category joint probability distribution.
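The cross-entropy loss above, computed directly over a batch of one-hot labels and predicted probabilities:

```python
import math

def cross_entropy(y_true, y_pred):
    """Cross-entropy loss -1/N * sum_i(y_i * log(yhat_i)) over a batch.
    y_true: one-hot rows (ground truth); y_pred: predicted probability
    rows. Assumes all predicted probabilities are strictly positive."""
    n = len(y_true)
    return -sum(y * math.log(p)
                for yt, yp in zip(y_true, y_pred)
                for y, p in zip(yt, yp)) / n
```

A perfectly confident correct prediction gives loss 0; a uniform two-class prediction gives log 2.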
As shown in fig. 3(a), preferably, the Re-ID model comprises a global branch and a local branch, which respectively extract global features and local features based on a multi-attention joint mechanism.
The network adopts a multi-attention joint mechanism, under which the model attends more to the foreground and to the unoccluded parts of the target, making it easier to extract accurate long-term clues. Hard attention and soft attention are combined to extract features of targets with severe scale changes or occlusion.
The invention introduces IBN-Net into the bottom CNN of the Re-ID model. Two IBN-Net constructions are shown in fig. 3(b). In the first, the output channels of the first convolution after the picture input are split into two halves, one half normalized with IN and the other with BN, and the same operation is performed after the first Inception block; this variant is called HACNN_IBN. In the second, IN is added after the outputs of the spatial and channel attention in the soft attention of HACNN, and an IN operation is added after the first convolution layer following the picture input; this variant is called HACNN_IBN_B. The Re-ID model is trained with this structure combining instance normalization and batch normalization, which addresses the cross-domain generalization problem.
And extracting the apparent characteristics of each object in the tracking sequence and observation through Re-ID, and finally calculating characteristic cosine distance as a long-term clue. And combining the overlapping scale of each observed object and the tracking track, and calculating the overlapping degree of each observed object as a short-term characteristic clue.
And S4, fusing the extracted long-term clues and short-term clues to obtain fused characteristic clues of the T-th frame of each target.
Specifically, step S4 includes the steps of:
S41. Obtain the long-term clue reid_distance through the Re-ID model, obtain the short-term clue sot_distance through the IOU calculation, and compute the scaling factor rate from pause_i, the number of frames for which the track of target i has been lost.
S42. Judge pause_i: if it exceeds 2, increase the scaling factor and update the long-term clue to reid_distance × rate / reid_thresh; otherwise, update the long-term clue to reid_distance / reid_thresh. Here TL denotes the limit time for track loss and reid_thresh denotes the Re-ID enhancement coefficient. In this embodiment, TL is 3 frames and reid_thresh is 0.7.
S43. Compute the cost-matrix entry of target i: cost_i = rate × sot_distance + (1 − rate) × reid_distance.
And S5, taking the fusion characteristic clue of each T-th frame as a cost matrix associated with data, and matching the tracking track with the observation box.
In a multi-target tracking scene, the number of targets in each frame is dynamically changed, including disappearance of old targets and appearance of new targets, and multiple targets between frames need to be matched, namely data association.
Data association is completed by using the Hungarian algorithm, and the threshold value of the cost matrix is preferably 0.7. This step may assign a corresponding tracking trajectory, i.e. target identity, to each observation pedestrian frame.
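Minimum-cost one-to-one matching between trajectories and observations can be illustrated with a brute-force search over permutations; this is only a small-matrix stand-in for the Hungarian algorithm named above (practical implementations use an O(n³) solver such as `scipy.optimize.linear_sum_assignment`).

```python
from itertools import permutations

def assign(cost):
    """Minimum-cost one-to-one assignment of tracks (rows) to
    observations (columns) for a square cost matrix. Brute force over
    permutations -- a stand-in for the Hungarian algorithm, usable
    only for small matrices."""
    n = len(cost)
    best, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best:
            best, best_perm = total, perm
    return list(best_perm)
```

Entries above the cost threshold (0.7 in this embodiment) would then be rejected as non-matches.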
And S6, updating, supplementing and deleting the tracking track according to the data association result to complete the tracking of the T-th frame.
Specifically, as shown in fig. 4, step S6 includes:
directly updating relevant parameters of the successfully associated tracking track;
regarding the observation frames which are not successfully correlated, taking the observation frames as an initial state and adding a tracking sequence again;
regarding the tracking track which is not associated successfully as a lost state;
if the lost state persists for more than a certain time, the active state of the track is cancelled.
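The four post-processing rules above can be sketched as a small state update; the dictionary layout and the fixed `LOST_LIMIT` are illustrative simplifications (the adaptive limit described next is assumed constant here).

```python
LOST_LIMIT = 3  # frames; stand-in for the adaptive limit TL described below

def post_process(tracks, matches, observations):
    """Trajectory post-processing: update matched tracks, mark
    unmatched tracks as lost, deactivate tracks lost beyond the limit,
    and re-initialise unmatched observations as new tracks."""
    for tid, track in tracks.items():
        if tid in matches:                          # matched: update and reset
            track["box"], track["lost"] = observations[matches[tid]], 0
        else:                                       # unmatched: lost state
            track["lost"] += 1
            if track["lost"] > LOST_LIMIT:
                track["active"] = False             # cancel active state
    matched_obs = set(matches.values())
    next_id = max(tracks, default=-1) + 1
    for j, obs in enumerate(observations):          # unmatched observations
        if j not in matched_obs:                    # rejoin as new tracks
            tracks[next_id] = {"box": obs, "lost": 0, "active": True}
            next_id += 1
    return tracks
```

A deactivated track is retained but no longer participates in matching.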
Preferably, the limit time for track loss is computed from pd, the pedestrian density; TL0, the basic time limit; a round-down operation ⌊·⌋; num_det, the number of detected pedestrians; and num0, the pedestrian-number threshold. num0 is set according to the complexity of different scenes.
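The formula itself was lost from the source text; a reconstruction consistent with the named quantities — assuming pd = num_det / num0 and TL = TL0 + ⌊pd⌋, so that denser scenes retain lost tracks longer — is:

```python
import math

def loss_time_limit(num_det, num0, TL0):
    """Adaptive time limit for track loss. Assumed reconstruction:
    pedestrian density pd = num_det / num0, and TL = TL0 + floor(pd);
    the original formula is not reproduced in the source."""
    pd = num_det / num0
    return TL0 + math.floor(pd)
```

With 25 detected pedestrians, a threshold of 10, and a basic limit of 3 frames, a lost track would be kept for 5 frames under this assumption.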
Preferably, before the data association, the uncorrected observation pedestrian frames are fed into the twin network for prediction to obtain the likely pedestrian-frame positions, yielding rough pedestrian frames before screening, so as to determine the sequence of observation pedestrian frames.
Step S7. Judge whether the video has ended; if so, finish; otherwise, feed the current pedestrian frame of each current tracking trajectory into the twin network as the updated target template, set T = T + 1, and return to step S1.
As shown in fig. 5, in order to obtain finer observation frames, preferably, between step S2 and step S3 the method further comprises:
supplementing an observation frame of the T-th frame by using a tracking frame of each target of the T-th frame;
and correcting the supplemented observation frame by using a regional regression network to obtain the corrected observation frame of each target Tth frame.
The invention integrates the above processes into a unified multi-target tracking framework; the MOT17 test set is taken as an example to demonstrate its effect. MOTA denotes the multiple-object tracking accuracy, IDF1 the identity F1 score of the tracking tracks, MT the proportion of tracks whose effectively tracked length exceeds 80%, ML the proportion whose effectively tracked length is below 20%, FP the number of background regions judged to be tracked objects (false positives), FN the number of tracked objects judged to be background (false negatives), and ID Sw the number of identity switches within the tracks.
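Following the standard CLEAR-MOT definition, MOTA combines the false negatives, false positives, and identity switches against the total number of ground-truth objects; a one-line sketch:

```python
def mota(fn, fp, id_sw, num_gt):
    """CLEAR-MOT accuracy: MOTA = 1 - (FN + FP + IDSW) / GT."""
    return 1.0 - (fn + fp + id_sw) / num_gt
```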
The overall tracking effect on the final MOT17 test set is shown in table 1, wherein the specific results of each video are shown in table 2.
TABLE 1
TABLE 2
Correspondingly, the invention also provides an online multi-target tracking system based on a twin network and long- and short-term clues, which comprises:
the twin network module is used for obtaining a response graph by performing cross-correlation on a tracking target template and a search area, and obtaining a preliminarily predicted tracking track of each target;
the correction module is used for combining the acquired preliminary track with the observation frame and correcting the pedestrian frame through a pedestrian regression network;
the data association module is used for calculating the similarity between the tracking track and the observed pedestrian, extracting long and short term characteristic clues of the tracking track and the observed pedestrian respectively and fusing the long and short term characteristic clues to further calculate the similarity, and distributing a corresponding observed pedestrian frame for each tracking track;
and the track post-processing module is used for updating, supplementing and deleting the tracking track to complete the tracking of the current frame.
Preferably, the data association module comprises a long-term appearance-feature difference calculation module based on the Re-ID appearance model and a short-term feature difference calculation module based on pedestrian-frame overlap, which respectively calculate the long-term and short-term feature differences between a tracking track and an observed pedestrian frame; the differences are then fused by weighting based on the track's associated information to obtain the final feature difference.
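A hedged sketch of this weighted fusion, following the cost formula cost_i = rate × sot distance + (1 − rate) × reid distance from claim 1; the specific rate adjustment when a track has been lost, and the default parameter values, are illustrative assumptions:

```python
def fused_cost(sot_distance, reid_distance, pause, base_rate=0.5,
               reid_threshold=1.2):
    """Fuse the short-term (IOU) and long-term (Re-ID) cues into one cost.

    Implements cost_i = rate * sot_distance + (1 - rate) * reid_distance;
    the rate adjustment below is an illustrative assumption.
    """
    reid = reid_distance / reid_threshold  # Re-ID enhancement coefficient
    rate = base_rate
    if pause > 2:
        # Track lost for a while: its predicted position is stale, so
        # down-weight the short-term IOU cue and trust appearance more.
        rate = base_rate * 0.5
    return rate * sot_distance + (1 - rate) * reid
```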
The Re-ID model of the long-term feature difference calculation module introduces a structure combining instance normalization and batch normalization for training, which alleviates the cross-domain generalization problem.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (9)
1. An online multi-target tracking method based on twin networks and long-short term clues is characterized by comprising the following steps:
s0, cutting the target detection result of the first frame of the surveillance video into observation frames to obtain the observation frame of each target in the 1st frame, taking these observation frames as the first input of the twin network to initialize the target templates, taking the observation frame of each target in the 1st frame as the initial state of the target tracking tracks, and setting T = 2;
s1, carrying out target detection on the T-th frame and cutting the detection result into observation frames to obtain the observation frame of each target in the T-th frame; cutting the T-th frame using N times the area of the position of each target template of the (T-1)-th frame as the search area to obtain the search-area picture of the T-th frame, wherein N is greater than or equal to 1;
s2, taking the search-area picture of the T-th frame as the second input of the twin network to obtain the most likely tracking frame as the tracking frame of each target in the T-th frame;
s3, respectively extracting features of the observation frame and the tracking frame of each target T frame by using the trained Re-ID model, and calculating the similarity of the extracted features to obtain a long-term feature clue of each target T frame; calculating IOU between the tracking frame and the observation frame of each target Tth frame as a short-term characteristic clue of each target Tth frame;
s4, fusing the extracted long-term clues and short-term clues to obtain fused characteristic clues of the T-th frame of each target;
s5, taking the fusion characteristic clue of each T-th frame as a cost matrix of data association, and matching the tracking track with the observation frame;
s6, updating, supplementing and deleting the tracking track according to the data association result to complete the tracking of the T-th frame;
s7, judging whether the video has ended: if so, finishing; otherwise, inputting the current pedestrian frame of the current tracking track into the twin network as the updated target template, setting T = T + 1, and returning to step S1;
step S4 includes the following steps:
s41, obtaining the long-term clue as reid distance through the Re-ID model, obtaining the short-term clue as sot distance through the IOU calculation, and calculating a scaling factor rate from pause_i, the number of times the track of target i has been lost;
s42, judging whether pause_i exceeds 2: if so, increasing the scaling factor to a new scaling factor and updating the long-term clue to reid distance × reid threshold; otherwise, updating the long-term clue to reid distance / reid threshold, wherein TL represents the limit time of track loss and reid threshold represents the Re-ID enhancement coefficient;
s43, calculating the cost matrix of target i as cost_i = rate × sot distance + (1 − rate) × reid distance.
2. The method of claim 1, wherein the twin network comprises the following processes:
(1) extracting a template feature map of each target, and extracting a feature map of a picture of a search area corresponding to the target;
(2) performing cross correlation on the template characteristic diagram and the search area characteristic diagram to obtain a multi-channel response diagram;
(3) classifying the tracking target in the multi-channel response image, and predicting a pedestrian regression frame according to response information in the multi-channel response image;
(4) scoring the pedestrian regression frame by quality assessment;
(5) and taking the product of the quality evaluation score and the classification confidence score as a final score, and taking a regression box with the highest final score as a tracking box.
4. The method of any one of claims 1 to 3, wherein the Re-ID model comprises a global branch and a local branch, which respectively extract global features and local features based on a multi-attention joint mechanism.
5. The method of claim 4, wherein IBN-Net is introduced into the underlying CNN of the Re-ID model in either of the following ways:
1) dividing the output channels of the first convolution after the picture input into two halves, one half undergoing instance normalization and the other batch normalization, and performing the same operation after the first Inception block;
2) adding instance normalization after the outputs of the spatial and channel attention in the soft attention of HACNN, and adding an instance normalization operation after the first convolution layer following the picture input.
6. The method of any one of claims 1 to 3, wherein before said data association, the uncorrected observed pedestrian frames are passed to the twin network for prediction, and the possible pedestrian-frame positions are obtained as coarse, unscreened pedestrian frames, thereby determining the sequence of observed pedestrian frames.
7. The method according to any one of claims 1 to 3, wherein step S6 includes:
directly updating relevant parameters of the successfully associated tracking track;
regarding the observation frames which are not successfully correlated, taking the observation frames as an initial state and adding a tracking sequence again;
regarding the tracking track which is not associated successfully as a lost state;
if the lost state persists beyond the limit time, the active state of the trace is cancelled.
8. The method of claim 1, wherein the limit time of track loss is calculated as follows:
9. An online multi-target tracking system based on twin networks and long-short term cues, comprising:
a computer-readable storage medium and a processor;
the computer-readable storage medium is used for storing executable instructions;
the processor is used for reading executable instructions stored in the computer-readable storage medium and executing the twin network and long-short term clue-based online multi-target tracking method of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010404941.7A CN111639551B (en) | 2020-05-12 | 2020-05-12 | Online multi-target tracking method and system based on twin network and long-short term clues |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111639551A CN111639551A (en) | 2020-09-08 |
CN111639551B true CN111639551B (en) | 2022-04-01 |
Family
ID=72330228
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109191491A (en) * | 2018-08-03 | 2019-01-11 | 华中科技大学 | The method for tracking target and system of the twin network of full convolution based on multilayer feature fusion |
CN109636829A (en) * | 2018-11-24 | 2019-04-16 | 华中科技大学 | A kind of multi-object tracking method based on semantic information and scene information |
CN110443827A (en) * | 2019-07-22 | 2019-11-12 | 浙江大学 | A kind of UAV Video single goal long-term follow method based on the twin network of improvement |
CN110473231A (en) * | 2019-08-20 | 2019-11-19 | 南京航空航天大学 | A kind of method for tracking target of the twin full convolutional network with anticipation formula study more new strategy |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11308350B2 (en) * | 2016-11-07 | 2022-04-19 | Qualcomm Incorporated | Deep cross-correlation learning for object tracking |
US10902615B2 (en) * | 2017-11-13 | 2021-01-26 | Qualcomm Incorporated | Hybrid and self-aware long-term object tracking |
US10957053B2 (en) * | 2018-10-18 | 2021-03-23 | Deepnorth Inc. | Multi-object tracking using online metric learning with long short-term memory |
Non-Patent Citations (4)
Title |
---|
Harmonious Attention Network for Person Re-identification; Wei Li et al.; 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018-12-17; full text *
Multi-Object Tracking Hierarchically in Visual Data Taken From Drones; Siyang Pan et al.; 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW); 2020-03-05; full text *
Multi-Object Tracking with Multiple Cues and Switcher-Aware Classification; Weitao Feng et al.; arXiv:1901.06129v1; 2019-01-18; full text *
Research on Video Multi-Object Tracking Algorithms Based on Deep Learning; Chu Qi; China Doctoral Dissertations Full-text Database; 2019-08-15; full text *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20220401 |