CN115619827A - Multi-target tracking method based on Transformer and space-time memory - Google Patents
- Publication number
- CN115619827A (application CN202211304713.8A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/7715—Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30241—Trajectory
Abstract
The invention provides a multi-target tracking method based on a Transformer and space-time memory, comprising the following steps: first, feature information is extracted from four consecutive video frames; the feature information of the four frames is fused to obtain high-quality features rich in space-time information; next, all pedestrian information within a given time window is stored in a dynamic space-time memory module, and appearance-similarity and distance scores are calculated against the targets in the current video frame; finally, the appearance-similarity score and the distance score are fused into a final score, which is used to predict the target trajectories. The invention can accurately locate pedestrians under camera motion, fast pedestrian movement, and similar conditions; the dynamic space-time memory module eliminates the influence of deformation, and predicting the target trajectory from the fused appearance-similarity and distance score solves the problem of long-range occlusion and yields accurate target trajectories.
Description
Technical Field
The invention relates to the technical field of video scene analysis and processing, in particular to a multi-target tracking method based on a Transformer and space-time memory.
Background
Multi-target tracking aims at finding and tracking all objects with the same identity in a video sequence containing multiple objects. It plays an important role in many basic problems of video analysis and computer vision, and is applied in fields such as automatic driving, smart cities, visual monitoring, public safety, video analysis, and human-computer interaction. In complex scenes, detection precision drops markedly under camera motion, target deformation, frequent occlusion, and similar conditions, leading to poor tracking performance.
In recent years, methods following the tracking-by-detection paradigm divide multi-target tracking into two tasks: detection and association. All targets in each video frame are first detected, and trajectories are then matched by an association algorithm, in which data association is the core step. Some algorithms use a spatial-scale metric to associate consecutive video frames. However, most tracking scenes pose challenges such as camera motion, fast object movement, and occlusion; in these situations the detector struggles to output stable detections and cannot effectively support the subsequent data association. Furthermore, objects in adjacent frames may undergo large displacements, so spatial-scale association cannot guarantee long-range tracking. Other algorithms match target features between any two frames to infer object similarity and associate objects in the current frame with those in the previous frame through efficient affinity computations, enabling reliable online tracking. However, feature-based association depends heavily on the quality of the extracted target features, and matching features over only two frames easily causes identity switches.
Disclosure of Invention
Aiming at the technical problem that feature-based association depends too heavily on target-feature extraction and that matching features over only two frames easily causes identity switches, the invention provides a multi-target tracking method based on a Transformer and space-time memory: a space-time enhancement module extracts target features rich in space-time information, improving detection precision; and a Transformer is applied to multi-target tracking, storing all information within a specific time window in a dynamic space-time memory, which reduces identity switches and thereby comprehensively improves tracking performance.
The technical scheme of the invention is realized as follows:
a multi-target tracking method based on Transformer and space-time memory comprises the following steps:
step one: inputting four consecutive frames of images, and preprocessing the images;
step two: extracting feature information of the preprocessed image by using a neural network;
step three: fusing feature information of the four frames of images by using a space-time strengthening module to obtain space-time information features;
step four: acquiring a detection frame of a target, and extracting position information of the detection frame and pedestrian characteristics of the target in the detection frame according to the space-time information characteristics;
step five: storing all pedestrian characteristics and detection frame position information in a certain time window in a dynamic space-time memory module, and calculating the appearance similarity and distance score with a target in a current video image;
step six: fusing the appearance similarity score and the distance score to obtain a final score, and judging as the same target when the final score is greater than a threshold value; if the current target is stored in the dynamic space-time memory module, after the track of the current target is obtained, the stored pedestrian characteristics and the detection frame position information are updated; and if the current target is not stored in the dynamic space-time memory module, storing the pedestrian characteristics and the position information of the detection frame of the current target.
The method for preprocessing the image comprises the following steps: the size of the input original image is changed from 1920 × 1080 to 1280 × 1280.
The method for extracting the feature information of the preprocessed image by utilizing the neural network comprises the following steps: and simultaneously inputting the four transformed frames of images into a CenterNet backbone network to extract characteristic information.
The method for fusing the feature information of the four frames of images by the space-time enhancement module is as follows: first, the feature information F ∈ R^(NT×C×H×W) of the four images is transformed into F1 ∈ R^(N×C×T×H×W); the channels are averaged to obtain F2 ∈ R^(N×1×T×H×W), which is fed into a 3D convolution with a 3 × 3 × 3 kernel to obtain F3; F3 then undergoes another 3D convolution, is matrix-multiplied with F1, and the result is convolved with the global dynamic kernel obtained by passing it through global average pooling and a fully-connected layer, yielding the space-time information feature F4; the expression is as follows:
F1 = reshape(F), F2 = mean(F1), F3 = f(β(F2)), F4 = (f(F3) ⊗ F1) ⊛ FC(AVG(f(F3) ⊗ F1))
where reshape(·) is the size-transform operation, mean(·) is the averaging operation over the channels, f(·) is the 3D convolution operation, β(·) is the batch normalization operation, ⊗ is matrix multiplication, ⊛ is the convolution operation, FC(·) is the fully-connected layer, and AVG(·) is the global average pooling operation.
The dynamic space-time memory module comprises a dynamic encoder and an identity aggregation module;
the identity aggregation module operates as follows:
the pedestrian feature F' is transformed into F1' through a global average pooling operation, a deformable convolution, a ReLU, and a hard-sigmoid activation function:
F1'=ε(σ(f'(AVG(F'))));
where f'(·) is the deformable convolution, AVG(·) is the global average pooling operation, σ(·) is the ReLU activation function, and ε(·) is the hard-sigmoid activation function;
the pedestrian feature F' is multiplied by F1', and the product passes sequentially through a global average pooling layer and two fully-connected layers to obtain the target feature F2':
F2'=FC(FC(AVG(F1'*F')))。
the method for calculating the appearance similarity and the distance score of the target in the current video image comprises the following steps:
in the dynamic space-time memory, all pedestrian features {F_1^i, F_2^i, F_3^i, ..., F_n^i} over the previous T frames undergo a cross-attention operation in the dynamic encoder to obtain a correlation matrix E_1; then all targets {F_1, F_2, F_3, ..., F_n} in the current image and the correlation matrix E_1 undergo a cross-attention operation to obtain a correlation matrix E_2; finally, the correlation matrices E_1 and E_2 are multiplied to obtain the appearance similarity score; for the distance score, all target detection boxes {B_1, B_2, B_3, ..., B_n} in the current image and all target detection boxes in the previous T frames are used to calculate a distance score.
The method for obtaining the final score by fusing the appearance similarity score and the distance score comprises the following steps:
Score=IoU_Score*ReID_Score+IoU_Score+ReID_Score;
where Score is the final score, IoU_Score is the distance score, and ReID_Score is the appearance similarity score.
Compared with the prior art, the invention has the following beneficial effects:
1) Aiming at the problems of missed and false detections, the invention provides a space-time enhancement module that extracts features from four consecutive frames and adds temporal information on top of the extracted spatial features, so that the extracted features are rich in space-time information and the detection precision is improved;
2) Aiming at the problem of target deformation, the invention provides an identity aggregation module, which can adaptively acquire the receptive field of an object by utilizing deformable convolution to obtain dynamic characteristics, and aggregate the original characteristics and the dynamic characteristics, so that the identity aggregation module can dynamically adapt to the target deformation and reduce the influence of the target deformation on the tracking effect;
3) Aiming at the problem of frequent occlusion, the invention provides a dynamic space-time encoder that stores the features and detection-box positions of all targets within a certain time window, thereby solving the long-range tracking and occlusion problems in multi-target tracking;
4) Aiming at the problem of trajectory prediction, the invention provides a balanced fusion strategy: the similarity score and the distance score are fused into a final score, and the target trajectory is determined according to the final score, greatly reducing identity switches.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of the detection of the present invention;
FIG. 2 is a schematic diagram of the spatio-temporal enhancement module according to the present invention;
FIG. 3 is a diagram illustrating the structure of the dynamic spatiotemporal memory module according to the present invention;
FIG. 4 is a schematic diagram of an identity aggregation module according to the present invention;
FIG. 5 is a graph of the tracking results of the MOT17 data set obtained by the calculation of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
As shown in fig. 1, an embodiment of the present invention provides a multi-target tracking method based on a Transformer and space-time memory. Because the extraction of pedestrian features and anchor boxes depends on the quality of the features extracted by the neural network, feature information is extracted from four consecutive video frames; the feature information of the four frames is fused to obtain high-quality features rich in space-time information; all pedestrian information within a certain time window is stored in a dynamic space-time memory module, and appearance-similarity and distance scores are calculated against the targets in the current video frame; the appearance-similarity score and the distance score are fused into a final score, which is used to predict the target trajectories. The method comprises the following specific steps:
Step one: four consecutive frames of images are input and preprocessed; that is, the size of each input original image is changed from 1920 × 1080 to 1280 × 1280.
Step two: extracting the feature information of the preprocessed image by using a neural network; and simultaneously inputting the four transformed frames of images into a CenterNet backbone network to extract characteristic information.
Step three: fusing feature information of the four frames of images by using a space-time strengthening module to obtain space-time information features; the method comprises the steps of firstly transforming the size of characteristic information, extracting characteristics rich in space-time information by using 3D convolution, simultaneously obtaining a global dynamic strengthening kernel by using a full connection layer, and finally performing convolution by using the kernel as a convolution kernel.
As shown in fig. 2, the method for fusing the feature information of the four frames of images by the space-time enhancement module is as follows: first, the feature information F ∈ R^(NT×C×H×W) of the four images is transformed into F1 ∈ R^(N×C×T×H×W); the channels are averaged to obtain F2 ∈ R^(N×1×T×H×W), which is then fed into a 3D convolution with a 3 × 3 × 3 kernel to obtain F3, a batch normalization operation being performed after the 3D convolution to reduce model complexity; F3 undergoes another 3D convolution, is matrix-multiplied with F1, and the result is convolved with the global dynamic kernel obtained by passing it through global average pooling and a fully-connected layer, yielding the space-time information feature F4; using 3D convolution again obtains more sensitive space-time features. To enhance sensitivity to temporal information, the invention applies a fully-connected operation to the features after the global average pooling operation to obtain a global dynamic enhancement kernel; the fully-connected operation makes full use of global context information, making the features more sensitive to space-time information. The calculation is as follows:
F1 = reshape(F), F2 = mean(F1), F3 = f(β(F2)), F4 = (f(F3) ⊗ F1) ⊛ FC(AVG(f(F3) ⊗ F1))
where reshape(·) is the size-transform operation, mean(·) is the averaging operation over the channels, f(·) is the 3D convolution operation, β(·) is the batch normalization operation, ⊗ is matrix multiplication, ⊛ is the convolution operation, FC(·) is the fully-connected layer, and AVG(·) is the global average pooling operation.
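As a shape-level illustration of the data flow just described, the following NumPy sketch reproduces the reshape, channel-averaging, pooling, and fully-connected steps; the two 3D convolutions and the batch normalization are replaced by placeholder operations (marked in comments), and the function name, weight sizes, and random initialization are illustrative assumptions rather than the patented implementation.

```python
import numpy as np

def spatiotemporal_enhance(feats, n, t):
    """Shape-level sketch of the space-time enhancement module.

    `feats` has shape (N*T, C, H, W) as in the patent text; the two 3D
    convolutions and the batch normalization are replaced by identity
    placeholders, so only the reshape / mean / pooling / fully-connected
    data flow is illustrated.
    """
    nt, c, h, w = feats.shape
    assert nt == n * t
    # F -> F1: (N*T, C, H, W) -> (N, C, T, H, W)
    f1 = feats.reshape(n, t, c, h, w).transpose(0, 2, 1, 3, 4)
    # F1 -> F2: average over the channel axis, keeping a singleton channel
    f2 = f1.mean(axis=1, keepdims=True)           # (N, 1, T, H, W)
    f3 = f2                                       # placeholder for 3D conv + BN
    # multiply with F1 (broadcasting stands in for the matrix product)
    m = f3 * f1                                   # (N, C, T, H, W)
    # global average pooling + fully-connected layer -> global dynamic kernel
    pooled = m.mean(axis=(2, 3, 4))               # (N, C)
    w_fc = np.random.default_rng(0).standard_normal((c, c)) * 0.01
    kernel = pooled @ w_fc                        # per-channel dynamic weights
    # apply the dynamic kernel (here: a channel-wise rescale) to get F4
    f4 = m * kernel[:, :, None, None, None]
    return f4

out = spatiotemporal_enhance(np.ones((8, 16, 4, 4)), n=2, t=4)
print(out.shape)  # (2, 16, 4, 4, 4)
```

Running the sketch on a dummy batch of N = 2 sequences of T = 4 frames confirms that the output keeps the (N, C, T, H, W) layout expected by the later modules.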
Step four: acquiring the detection box of each target, extracting the position information of the detection box from the space-time information features, and obtaining the pedestrian features of the target in the detection box using two fully-connected layers;
Step five: storing all pedestrian features and detection-box position information within a certain time window in the dynamic space-time memory module, and calculating appearance-similarity and distance scores against the targets in the current video image; as shown in FIG. 3, the dynamic space-time memory module comprises a dynamic encoder and an identity aggregation module. The identity aggregation module uses a deformable convolution and two fully-connected layers to handle the deformation problem, and the dynamic encoder is a Transformer variant that reduces memory occupancy relative to the original Transformer.
As shown in fig. 4, the identity aggregation module is implemented as follows: the pedestrian feature F' is transformed into F1' through a global average pooling operation, a deformable convolution, a ReLU, and a hard-sigmoid activation function:
F1'=ε(σ(f'(AVG(F'))));
where f'(·) is the deformable convolution, AVG(·) is the global average pooling operation, σ(·) is the ReLU activation function, and ε(·) is the hard-sigmoid activation function; the global average pooling operation reduces the complexity of the pedestrian feature F', and the deformable convolution adapts to object deformation.
The pedestrian feature F' is multiplied by F1', and the product passes sequentially through a global average pooling layer and two fully-connected layers to obtain the target feature F2':
F2'=FC(FC(AVG(F1'*F')))。
The global average pooling again reduces complexity; finally, two fully-connected layers learn and summarize the distinct features of the object to be acquired.
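A minimal NumPy sketch of the identity aggregation data flow is given below; the deformable convolution is approximated by an ordinary channel-mixing matrix (a 1 × 1 convolution), and random weights stand in for the learned layers, so it shows only the gating-and-aggregation structure, not the patented module.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hard_sigmoid(x):
    # a common hard-sigmoid form: clip(x / 6 + 0.5, 0, 1)
    return np.clip(x / 6.0 + 0.5, 0.0, 1.0)

def identity_aggregate(feat, rng=np.random.default_rng(0)):
    """Sketch of the identity aggregation data flow on one (C, H, W)
    pedestrian feature. The deformable convolution is approximated by a
    plain channel-mixing matrix (a 1x1 convolution), which drops the
    adaptive receptive field but keeps the structure identical."""
    c, _, _ = feat.shape
    w_dcn = rng.standard_normal((c, c)) * 0.1     # 1x1-conv stand-in for f'
    w_fc1 = rng.standard_normal((c, c)) * 0.1     # first fully-connected layer
    w_fc2 = rng.standard_normal((c, c)) * 0.1     # second fully-connected layer
    # F1' = hard_sigmoid(ReLU(f'(AVG(F'))))
    pooled = feat.mean(axis=(1, 2))               # global average pool, (C,)
    f1 = hard_sigmoid(relu(w_dcn @ pooled))       # per-channel gate in [0, 1]
    # F2' = FC(FC(AVG(F1' * F')))
    gated = feat * f1[:, None, None]              # broadcast channel gating
    f2 = w_fc2 @ (w_fc1 @ gated.mean(axis=(1, 2)))
    return f2

vec = identity_aggregate(np.ones((32, 8, 8)))
```

The per-channel gate F1' rescales the original feature before the two fully-connected layers, which is what lets the module dampen channels distorted by deformation.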
The method for calculating the appearance similarity and the distance score of the target in the current video image comprises the following steps:
in the dynamic space-time memory, all pedestrian features {F_1^i, F_2^i, F_3^i, ..., F_n^i} over the previous T frames undergo a cross-attention operation in the dynamic encoder to obtain a correlation matrix E_1; then all targets {F_1, F_2, F_3, ..., F_n} in the current image and the correlation matrix E_1 undergo a cross-attention operation to obtain a correlation matrix E_2; finally, the correlation matrices E_1 and E_2 are multiplied to obtain the appearance similarity score. For the distance score, all target detection boxes {B_1, B_2, B_3, ..., B_n} in the current image and all target detection boxes in the previous T frames are used to calculate a distance score; in the corresponding formula, F_i^n denotes the targets within the time window, F_i denotes the targets of the current image, and the operators denote the cross-attention and matrix multiplication operations.
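As a toy illustration of the correlation-matrix computation, the sketch below implements single-head cross attention in NumPy and chains it as described (E_1 from the memorised features, E_2 from the current targets against E_1, and their product as the similarity matrix); the omission of projection matrices, the feature dimensions, and the final product form are simplifying assumptions, not the exact dynamic encoder of the invention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv):
    # single-head attention, softmax(Q K^T / sqrt(d)) V, projections omitted
    d = q.shape[-1]
    return softmax(q @ kv.T / np.sqrt(d)) @ kv

def appearance_similarity(mem_feats, cur_feats):
    """mem_feats: (n_mem, d) pedestrian features from the previous T frames;
    cur_feats: (n_cur, d) features of the current targets. Returns an
    (n_cur, n_mem) appearance-similarity score matrix."""
    e1 = cross_attention(mem_feats, mem_feats)    # correlation matrix E1
    e2 = cross_attention(cur_feats, e1)           # correlation matrix E2
    return e2 @ e1.T                              # multiply E1 and E2

rng = np.random.default_rng(0)
sim = appearance_similarity(rng.standard_normal((5, 16)),
                            rng.standard_normal((3, 16)))
```

Each row of `sim` scores one current target against every memorised pedestrian, which is the matrix consumed by the fusion step that follows.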
Step six: fusing the appearance similarity score and the distance score to obtain a final score, and judging as the same target when the final score is greater than a threshold value; if the current target is stored in the dynamic space-time memory module, after the track of the current target is obtained, the stored pedestrian characteristics and the detection frame position information are updated; and if the current target is not stored in the dynamic space-time memory module, storing the pedestrian characteristics and the position information of the detection frame of the current target.
The method for obtaining the final score by fusing the appearance similarity score and the distance score comprises the following steps:
Score=IoU_Score*ReID_Score+IoU_Score+ReID_Score;
where Score is the final score, IoU_Score is the distance score, and ReID_Score is the appearance similarity score.
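The fusion formula, together with an IoU-style distance score, can be sketched in plain Python as follows; the (x1, y1, x2, y2) box format and the use of raw IoU as IoU_Score are assumptions for illustration.

```python
def iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes, used here as the distance score."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def fuse(iou_score, reid_score):
    # Score = IoU_Score * ReID_Score + IoU_Score + ReID_Score
    return iou_score * reid_score + iou_score + reid_score

# two boxes whose intersection is half of each box: IoU = 1/3
score = fuse(iou((0, 0, 2, 2), (1, 0, 3, 2)), 0.8)
```

Because the product term rewards pairs that score well on both cues while the sum keeps either cue alive on its own, a pair matched only by appearance (e.g. after a long occlusion, IoU near 0) can still clear the threshold.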
As shown in fig. 5, the method of the present invention achieves high detection precision and a good tracking effect when tracking targets under camera motion, target deformation, and frequent occlusion, and has wide applicability in automatic driving, smart cities, visual monitoring, public safety, video analysis, human-computer interaction, and the like.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (7)
1. A multi-target tracking method based on Transformer and space-time memory is characterized by comprising the following steps:
step one: inputting four consecutive frames of images, and preprocessing the images;
step two: extracting the feature information of the preprocessed image by using a neural network;
step three: fusing feature information of the four frames of images by using a space-time strengthening module to obtain space-time information features;
step four: acquiring a detection frame of a target, and extracting position information of the detection frame and pedestrian characteristics of the target in the detection frame according to the space-time information characteristics;
step five: storing all pedestrian characteristics and detection frame position information in a certain time window in a dynamic space-time memory module, and carrying out appearance similarity and distance score calculation with the target in the current video image;
step six: fusing the appearance similarity score and the distance score to obtain a final score, and judging as the same target when the final score is greater than a threshold value; if the current target is stored in the dynamic space-time memory module, after the track of the current target is obtained, the stored pedestrian characteristics and the detection frame position information are updated; and if the current target is not stored in the dynamic space-time memory module, storing the pedestrian characteristics and the position information of the detection frame of the current target.
2. The multi-target tracking method based on Transformer and space-time memory according to claim 1, wherein the method for preprocessing the image is as follows: the size of the input original image is changed from 1920 × 1080 to 1280 × 1280.
3. The Transformer and spatiotemporal memory-based multi-target tracking method according to claim 2, wherein the method for extracting the feature information of the preprocessed image by using the neural network comprises the following steps: and simultaneously inputting the four transformed frames of images into a CenterNet backbone network to extract characteristic information.
4. The multi-target tracking method based on Transformer and spatiotemporal memory according to claim 3, characterized in that the method for fusing feature information of the four frames of images by the space-time enhancement module is as follows: first, the feature information F ∈ R^(NT×C×H×W) of the four images is transformed into F1 ∈ R^(N×C×T×H×W); the channels are averaged to obtain F2 ∈ R^(N×1×T×H×W), which is fed into a 3D convolution with a 3 × 3 × 3 kernel to obtain F3; F3 undergoes another 3D convolution, is matrix-multiplied with F1, and the result, after passing through the global average pooling layer and the fully-connected layer, is used for a convolution operation to obtain the space-time information feature F4; the expression is as follows:
F1 = reshape(F), F2 = mean(F1), F3 = f(β(F2)), F4 = (f(F3) ⊗ F1) ⊛ FC(AVG(f(F3) ⊗ F1))
where reshape(·) is the size-transform operation, mean(·) is the averaging operation over the channels, f(·) is the 3D convolution operation, β(·) is the batch normalization operation, ⊗ is matrix multiplication, ⊛ is the convolution operation, FC(·) is the fully-connected layer, and AVG(·) is the global average pooling operation.
5. The Transformer and spatiotemporal memory based multi-target tracking method according to claim 1, wherein the dynamic spatiotemporal memory module comprises a dynamic encoder and an identity aggregation module;
the identity aggregation module operates as follows:
obtaining F1' from the pedestrian feature F' through a global average pooling operation, a deformable convolution, a ReLU, and a hard-sigmoid activation function:
F1'=ε(σ(f'(AVG(F'))));
where f'(·) is the deformable convolution, AVG(·) is the global average pooling operation, σ(·) is the ReLU activation function, and ε(·) is the hard-sigmoid activation function;
multiplying the pedestrian feature F' by F1', and passing the product sequentially through a global average pooling layer and two fully-connected layers to obtain the target feature F2':
F2'=FC(FC(AVG(F1'*F')))。
6. The Transformer and spatiotemporal memory-based multi-target tracking method according to claim 5, wherein the method for calculating the appearance similarity and the distance score with the target in the current video image comprises: in the dynamic space-time memory, all pedestrian features {F_1^i, F_2^i, ..., F_n^i} in the previous T frames undergo a cross-attention operation in the dynamic encoder to obtain a correlation matrix E_1; then all targets {F_1, F_2, ..., F_n} in the current image and the correlation matrix E_1 undergo a cross-attention operation to obtain a correlation matrix E_2; finally, the correlation matrices E_1 and E_2 are multiplied to obtain the appearance similarity score; for the distance score, all target detection boxes {B_1, B_2, ..., B_n} in the current image and all target detection boxes in the previous T frames are used to calculate a distance score.
7. The Transformer and spatiotemporal memory-based multi-target tracking method according to claim 6, wherein the method for fusing the appearance similarity score and the distance score to obtain a final score comprises the following steps:
Score=IoU_Score*ReID_Score+IoU_Score+ReID_Score;
where Score is the final score, IoU_Score is the distance score, and ReID_Score is the appearance similarity score.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211304713.8A CN115619827A (en) | 2022-10-24 | 2022-10-24 | Multi-target tracking method based on Transformer and space-time memory |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211304713.8A CN115619827A (en) | 2022-10-24 | 2022-10-24 | Multi-target tracking method based on Transformer and space-time memory |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115619827A true CN115619827A (en) | 2023-01-17 |
Family
ID=84864221
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211304713.8A Pending CN115619827A (en) | 2022-10-24 | 2022-10-24 | Multi-target tracking method based on Transformer and space-time memory |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115619827A (en) |
2022-10-24: application CN202211304713.8A filed (CN); patent CN115619827A pending.
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117036407A (en) * | 2023-08-11 | 2023-11-10 | 浙江深象智能科技有限公司 | Multi-target tracking method, device and equipment |
CN117036407B (en) * | 2023-08-11 | 2024-04-02 | 浙江深象智能科技有限公司 | Multi-target tracking method, device and equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111523410B (en) | Video saliency target detection method based on attention mechanism | |
CN110717411A (en) | Pedestrian re-identification method based on deep layer feature fusion | |
CN110120064B (en) | Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning | |
WO2016183766A1 (en) | Method and apparatus for generating predictive models | |
WO2019023921A1 (en) | Gesture recognition method, apparatus, and device | |
US10685263B2 (en) | System and method for object labeling | |
CN111639564B (en) | Video pedestrian re-identification method based on multi-attention heterogeneous network | |
CN112464807A (en) | Video motion recognition method and device, electronic equipment and storage medium | |
CN112257569B (en) | Target detection and identification method based on real-time video stream | |
CN111723822B (en) | RGBD image significance detection method and system based on multi-level fusion | |
CN112906545A (en) | Real-time action recognition method and system for multi-person scene | |
CN110853074A (en) | Video target detection network system for enhancing target by utilizing optical flow | |
CN116309725A (en) | Multi-target tracking method based on multi-scale deformable attention mechanism | |
CN112084952B (en) | Video point location tracking method based on self-supervision training | |
CN112329784A (en) | Correlation filtering tracking method based on space-time perception and multimodal response | |
CN115619827A (en) | Multi-target tracking method based on Transformer and space-time memory | |
CN113674321B (en) | Cloud-based method for multi-target tracking under monitoring video | |
Wang et al. | Dual memory aggregation network for event-based object detection with learnable representation | |
CN112949451B (en) | Cross-modal target tracking method and system through modal perception feature learning | |
US20220366570A1 (en) | Object tracking device and object tracking method | |
CN110555406B (en) | Video moving target identification method based on Haar-like characteristics and CNN matching | |
CN115063717A (en) | Video target detection and tracking method based on key area live-action modeling | |
CN115439771A (en) | Improved DSST infrared laser spot tracking method | |
CN112084922B (en) | Method for detecting crowd with abnormal behaviors based on gestures and facial expressions | |
CN110503061B (en) | Multi-feature-fused multi-factor video occlusion area detection method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||