CN115619827A - Multi-target tracking method based on Transformer and space-time memory - Google Patents

Multi-target tracking method based on Transformer and space-time memory

Info

Publication number
CN115619827A
Authority
CN
China
Prior art keywords: score, target, space, information, pedestrian
Prior art date
Legal status: Pending
Application number
CN202211304713.8A
Other languages
Chinese (zh)
Inventor
肖启阳
谷松波
杨茂林
李森
贾林
胡振涛
Current Assignee: Henan University
Original Assignee
Henan University
Priority date
Filing date
Publication date
Application filed by Henan University
Priority to CN202211304713.8A
Publication of CN115619827A
Legal status: Pending (current)

Classifications

    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06V 10/761: Proximity, similarity or dissimilarity measures
    • G06V 10/7715: Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; mappings, e.g. subspace methods
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature-extraction or classification level
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06T 2207/10016: Video; image sequence
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/30241: Trajectory


Abstract

The invention provides a multi-target tracking method based on a Transformer and space-time memory, which comprises the following steps: first, feature information is extracted from four consecutive video frames, and the features of the four frames are fused to obtain high-quality features rich in spatio-temporal information; next, all pedestrian information within a given time window is stored in a dynamic space-time memory module, and appearance-similarity and distance scores are computed against the targets in the current video frame; finally, the appearance-similarity score and the distance score are fused into a final score, which is used to predict the target trajectories. The invention can locate pedestrians accurately under camera motion, fast pedestrian movement and similar conditions; the dynamic space-time memory module suppresses the influence of deformation; and predicting trajectories from the fused appearance-similarity and distance scores alleviates long-range occlusion, yielding accurate target trajectories.

Description

Multi-target tracking method based on Transformer and space-time memory
Technical Field
The invention relates to the technical field of video scene analysis and processing, in particular to a multi-target tracking method based on a Transformer and space-time memory.
Background
Multi-target tracking aims to find and track all objects with the same identity in a video sequence containing multiple objects. It plays an important role in many fundamental problems of video analysis and computer vision and is increasingly applied in fields such as autonomous driving, smart cities, visual surveillance, public safety, video analysis and human-computer interaction. In complex scenes, detection accuracy drops markedly under camera motion, target deformation, frequent occlusion and similar conditions, which degrades tracking performance.
In recent years, methods following the tracking-by-detection paradigm divide multi-target tracking into two tasks, detection and association: all targets in each video frame are first detected, and trajectories are then matched by an association algorithm, in which data association is the core part. Some algorithms use a spatial-scale metric to associate consecutive video frames. However, most tracking scenes present challenges such as camera motion, fast object movement and occlusion; in these cases the detector struggles to output stable detections and cannot effectively support the subsequent data association. Furthermore, objects in adjacent frames may undergo large displacements, so spatial-scale association cannot guarantee long-range tracking. Other algorithms match target features between any two frames to infer object similarity and use efficient affinity computation to deeply associate objects in the current frame with those in the previous frame, enabling reliable online tracking. However, such feature-association methods depend too heavily on the quality of the extracted target features, and matching features across only two frames easily causes identity switches.
Disclosure of Invention
To address the technical problem that feature-association methods depend too heavily on target-feature extraction and that matching features across only two frames easily causes identity switches, the invention provides a multi-target tracking method based on a Transformer and space-time memory: a spatio-temporal enhancement module extracts target features rich in spatio-temporal information to improve detection accuracy, and the Transformer is applied to multi-target tracking by storing all information within a specific time window in a dynamic space-time memory, reducing identity switches and thereby improving overall tracking performance.
The technical scheme of the invention is realized as follows:
a multi-target tracking method based on Transformer and space-time memory comprises the following steps:
Step one: input four consecutive video frames and preprocess the images;
Step two: extract feature information from the preprocessed images with a neural network;
Step three: fuse the feature information of the four frames with a spatio-temporal enhancement module to obtain spatio-temporal features;
Step four: obtain the target detection boxes and, from the spatio-temporal features, extract the box positions and the pedestrian features of the targets inside the boxes;
Step five: store all pedestrian features and detection-box positions within a given time window in a dynamic space-time memory module, and compute appearance-similarity and distance scores against the targets in the current video frame;
Step six: fuse the appearance-similarity score and the distance score into a final score; when the final score exceeds a threshold, the targets are judged to be the same. If the current target is already stored in the dynamic space-time memory module, its trajectory is obtained and the stored pedestrian features and detection-box positions are updated; if it is not stored, its pedestrian features and detection-box position are added to the module.
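For illustration only, a high-level Python sketch of how these six steps fit together in one tracking iteration is given below; every name passed in (preprocess, backbone, ste, detect, scorer, memory) and the 0.5 threshold are placeholders assumed for the sketch, not part of the claimed method.
```python
def track_step(frames, memory, preprocess, backbone, ste, detect, scorer, threshold=0.5):
    """One tracking iteration over four consecutive frames (hypothetical component names)."""
    x = preprocess(frames)                          # step one: resize the four consecutive frames
    feats = backbone(x)                             # step two: per-frame feature maps
    st_feats = ste(feats)                           # step three: fuse into spatio-temporal features
    boxes, ped_feats = detect(st_feats)             # step four: detection boxes + pedestrian features
    reid, iou = scorer(memory.features, ped_feats,
                       memory.boxes, boxes)         # step five: scores against the space-time memory
    score = iou * reid + iou + reid                 # step six: balanced fusion of the two scores
    matches = score > threshold                     # same target when the final score exceeds the threshold
    memory.update(boxes, ped_feats, matches)        # update matched tracks, store unmatched targets
    return matches
```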
The method for preprocessing the images is as follows: the size of each input original image is changed from 1920 × 1080 to 1280 × 1280.
The method for extracting the feature information of the preprocessed images with the neural network is as follows: the four resized frames are input into a CenterNet backbone network simultaneously to extract feature information.
The method by which the spatio-temporal enhancement module fuses the feature information of the four frames is as follows: the feature information F ∈ R^(NT×C×H×W) of the four frames is first reshaped into F1 ∈ R^(N×C×T×H×W); averaging over the channels gives F2 ∈ R^(N×1×T×H×W), which is passed through a 3D convolution with a 3 × 3 × 3 kernel to obtain F3; F3 is passed through a further 3D convolution and matrix-multiplied with F1, and the result is convolved with the kernel obtained by passing it through global average pooling and a fully connected layer, yielding the spatio-temporal feature F4. The expression is:
F1 = reshape(F); F2 = mean(F1); F3 = β(f(F2)); F4 = (β(f(F3)) × F1) ⊗ FC(AVG(β(f(F3)) × F1))
where reshape(·) is the size-transformation operation, mean(·) is the averaging operation over the channels, f(·) is the 3D convolution operation, β(·) is the batch normalization operation, × denotes matrix multiplication, ⊗ denotes the convolution operation, FC(·) is the fully connected layer, and AVG(·) is the global average pooling operation.
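For illustration only, a minimal PyTorch sketch of this fusion is given below. The class name, the single-channel 3D convolutions and modelling the global dynamic kernel as a per-channel 1 × 1 × 1 weight are assumptions made for the sketch, not the exact disclosed implementation.
```python
import torch
import torch.nn as nn

class SpatioTemporalEnhancement(nn.Module):
    """Sketch of the spatio-temporal enhancement module (assumed shapes and widths)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv3d_a = nn.Conv3d(1, 1, kernel_size=3, padding=1)   # f(.) on the channel-averaged map
        self.bn_a = nn.BatchNorm3d(1)                                # beta(.)
        self.conv3d_b = nn.Conv3d(1, 1, kernel_size=3, padding=1)
        self.bn_b = nn.BatchNorm3d(1)
        self.fc = nn.Linear(channels, channels)                      # FC(.) producing the dynamic kernel

    def forward(self, feats: torch.Tensor, n: int, t: int) -> torch.Tensor:
        # feats: (N*T, C, H, W) backbone features of the four consecutive frames
        _, c, h, w = feats.shape
        f1 = feats.view(n, t, c, h, w).permute(0, 2, 1, 3, 4)        # reshape(F) -> (N, C, T, H, W)
        f2 = f1.mean(dim=1, keepdim=True)                            # mean over channels -> (N, 1, T, H, W)
        f3 = self.bn_a(self.conv3d_a(f2))                            # F3 = beta(f(F2))
        m = self.bn_b(self.conv3d_b(f3)) * f1                        # fuse with F1 (broadcast multiplication)
        k = self.fc(m.mean(dim=(2, 3, 4)))                           # global dynamic kernel = FC(AVG(.))
        return m * k.view(n, c, 1, 1, 1)                             # apply kernel as a 1x1x1 convolution -> F4
```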
The dynamic space-time memory module comprises a dynamic encoder and an identity aggregation module;
the identity aggregation module operates as follows:
The pedestrian feature F' is transformed into F1' by a global average pooling operation, a deformable convolution, a ReLU activation and a hard-sigmoid activation:
F1' = ε(σ(f'(AVG(F'))))
where f'(·) is the deformable convolution, AVG(·) is the global average pooling operation, σ(·) is the ReLU activation function, and ε(·) is the hard-sigmoid activation function;
F1' is multiplied by the pedestrian feature F' and passed through a global average pooling layer and two fully connected layers in turn to obtain the target feature F2':
F2' = FC(FC(AVG(F1' * F'))).
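For illustration only, a minimal PyTorch sketch of the identity aggregation module is given below; the 3 × 3 kernel, the embedding width and the plain convolution used to predict the deformable-convolution offsets are assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class IdentityAggregation(nn.Module):
    """Sketch of the identity aggregation module (assumed kernel size and widths)."""
    def __init__(self, channels: int, embed_dim: int = 128):
        super().__init__()
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, kernel_size=3, padding=1)    # offsets for f'(.)
        self.deform = DeformConv2d(channels, channels, kernel_size=3, padding=1)  # deformable convolution f'(.)
        self.fc1 = nn.Linear(channels, embed_dim)
        self.fc2 = nn.Linear(embed_dim, embed_dim)

    def forward(self, f_prime: torch.Tensor) -> torch.Tensor:
        # f_prime: (B, C, H, W) pedestrian feature F'
        g = F.adaptive_avg_pool2d(f_prime, 1)                          # AVG(F')
        f1p = F.hardsigmoid(F.relu(self.deform(g, self.offset(g))))    # F1' = eps(sigma(f'(AVG(F'))))
        agg = F.adaptive_avg_pool2d(f1p * f_prime, 1).flatten(1)       # AVG(F1' * F')
        return self.fc2(self.fc1(agg))                                 # F2' = FC(FC(.))
```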
the method for calculating the appearance similarity and the distance score of the target in the current video image comprises the following steps:
in dynamic spatiotemporal memory, all pedestrian features { F over previous T frames 1 1 ,F 2 1 ,F 3 1 ,...,F n i Performing cross attention operation in a dynamic encoder to obtain a correlation matrix E 1 Then all objects in the current image { F } 1 ,F 2 ,F 3 ,...,F n And the correlation matrix E 1 Performing cross attention operation to obtain a correlation matrix E 2 Finally, the matrix E will be correlated 1 And E 2 Multiplying to obtain an appearance similarity score; for distance score, use all object detection boxes { B ] within the current image 1 ,B 2 ,B 3 ,...,B n And all target detection frames in the previous T frame
Figure BDA0003905299850000031
A distance score is calculated.
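For illustration only, a minimal PyTorch sketch of this scoring step follows; the feature width, head count, the use of nn.MultiheadAttention for the cross-attention, and IoU (via torchvision.ops.box_iou) as the concrete distance score are assumptions consistent with the IoU_Score used in the fusion below.
```python
import torch
import torch.nn as nn
from torchvision.ops import box_iou

class DynamicEncoderScore(nn.Module):
    """Sketch of the appearance-similarity / distance scoring (assumed widths)."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn_mem = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_cur = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, mem_feats, cur_feats, mem_boxes, cur_boxes):
        # mem_feats: (1, M, D) pedestrian features stored for the previous T frames
        # cur_feats: (1, K, D) features of the targets in the current image
        e1, _ = self.attn_mem(mem_feats, mem_feats, mem_feats)    # correlation matrix E1 (cross attention over memory)
        e2, _ = self.attn_cur(cur_feats, e1, e1)                  # correlation matrix E2 (current targets vs. E1)
        reid_score = torch.matmul(e2, e1.transpose(1, 2))         # multiply E1 and E2 -> (1, K, M) appearance similarity
        iou_score = box_iou(cur_boxes, mem_boxes)                 # (K, M) distance score from box overlap
        return reid_score.squeeze(0), iou_score
```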
The appearance-similarity score and the distance score are fused into the final score as:
Score = IoU_Score * ReID_Score + IoU_Score + ReID_Score
where Score is the final score, IoU_Score is the distance score, and ReID_Score is the appearance-similarity score.
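As a brief sketch, the balanced fusion and the threshold test of step six can be written as follows; the default threshold of 0.5 is an assumption, since the text only requires the final score to exceed a threshold.
```python
import torch

def fuse_scores(iou_score: torch.Tensor, reid_score: torch.Tensor, threshold: float = 0.5):
    """Balanced fusion of the distance (IoU) and appearance (ReID) scores."""
    score = iou_score * reid_score + iou_score + reid_score   # Score = IoU*ReID + IoU + ReID
    return score, score > threshold                           # mask of pairs judged to be the same target
```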
Compared with the prior art, the invention has the following beneficial effects:
1) To address missed and false detections, the invention provides a spatio-temporal enhancement module that extracts features from four consecutive frames and adds temporal information on top of the spatial features, so that the extracted features are rich in spatio-temporal information and detection accuracy is improved;
2) To address target deformation, the invention provides an identity aggregation module that uses deformable convolution to adaptively obtain the receptive field of an object and produce dynamic features, and aggregates the original features with these dynamic features, so that the module adapts to target deformation and reduces its influence on tracking;
3) To address frequent occlusion, the invention provides a dynamic spatio-temporal encoder that stores the features and detection-box positions of all targets within a given time window, alleviating the long-range tracking and occlusion problems of multi-target tracking;
4) To address trajectory prediction, the invention provides a balanced fusion strategy that fuses the similarity score and the distance score into a final score and judges the target trajectory from this score, greatly reducing identity switches.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of the detection of the present invention;
FIG. 2 is a schematic diagram of the spatio-temporal enhancement module according to the present invention;
FIG. 3 is a diagram illustrating the structure of the dynamic spatiotemporal memory module according to the present invention;
FIG. 4 is a schematic diagram of an identity aggregation module according to the present invention;
FIG. 5 shows tracking results of the present invention on the MOT17 data set.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
As shown in FIG. 1, an embodiment of the present invention provides a multi-target tracking method based on a Transformer and space-time memory. Since the extraction of pedestrian features and anchor boxes depends on the quality of the features produced by the neural network, feature information is extracted from four consecutive video frames; the features of the four frames are fused to obtain high-quality features rich in spatio-temporal information; all pedestrian information within a given time window is stored in a dynamic space-time memory module, and appearance-similarity and distance scores are computed against the targets in the current video frame; the appearance-similarity score and the distance score are fused into a final score, which is used to predict the target trajectories. The specific steps are as follows:
Step one: four consecutive video frames are input and preprocessed, i.e., each input original image is resized from 1920 × 1080 to 1280 × 1280.
Step two: the feature information of the preprocessed images is extracted with a neural network: the four resized frames are input into a CenterNet backbone network simultaneously to extract feature information.
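For illustration only, a minimal sketch of this preprocessing and batched backbone input is given below; bilinear interpolation and the centernet_backbone name are assumptions (the text only specifies the image sizes and the CenterNet backbone).
```python
import torch
import torch.nn.functional as F

def preprocess(frames: torch.Tensor) -> torch.Tensor:
    # frames: (4, 3, 1080, 1920) four consecutive RGB frames stacked along the batch dimension
    return F.interpolate(frames, size=(1280, 1280), mode="bilinear", align_corners=False)

# The four resized frames are fed to the backbone in one batch; `centernet_backbone`
# stands in for the CenterNet backbone network (hypothetical callable).
# feats = centernet_backbone(preprocess(frames))   # -> (4, C, H', W') feature maps
```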
Step three: the feature information of the four frames is fused by the spatio-temporal enhancement module to obtain spatio-temporal features. The feature maps are first reshaped; features rich in spatio-temporal information are extracted with 3D convolutions; a global dynamic enhancement kernel is obtained with a fully connected layer; and the features are finally convolved with this kernel.
As shown in FIG. 2, the spatio-temporal enhancement module fuses the feature information of the four frames as follows: the feature information F ∈ R^(NT×C×H×W) of the four frames is first reshaped into F1 ∈ R^(N×C×T×H×W); averaging over the channels gives F2 ∈ R^(N×1×T×H×W), which is passed through a 3D convolution with a 3 × 3 × 3 kernel followed by batch normalization (which reduces model complexity) to obtain F3; F3 is passed through a further 3D convolution and matrix-multiplied with F1, and the result is convolved with the kernel obtained by passing it through global average pooling and a fully connected layer, yielding the spatio-temporal feature F4. Applying 3D convolution again yields more sensitive spatio-temporal features. To enhance sensitivity to temporal information, the invention passes the globally average-pooled features through a fully connected layer to obtain a global dynamic enhancement kernel; the fully connected operation makes full use of global context, making the spatio-temporal information more discriminative. The computation is:
F1 = reshape(F); F2 = mean(F1); F3 = β(f(F2)); F4 = (β(f(F3)) × F1) ⊗ FC(AVG(β(f(F3)) × F1))
where reshape(·) is the size-transformation operation, mean(·) is the averaging operation over the channels, f(·) is the 3D convolution operation, β(·) is the batch normalization operation, × denotes matrix multiplication, ⊗ denotes the convolution operation, FC(·) is the fully connected layer, and AVG(·) is the global average pooling operation.
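Under the same assumptions as the earlier spatio-temporal enhancement sketch, a brief shape walk-through for one clip of T = 4 frames:
```python
import torch

ste = SpatioTemporalEnhancement(channels=64)   # class from the earlier sketch (assumed widths)
feats = torch.randn(4, 64, 320, 320)           # (N*T, C, H, W) backbone features of four frames
f4 = ste(feats, n=1, t=4)
print(f4.shape)                                # torch.Size([1, 64, 4, 320, 320]) spatio-temporal feature F4
```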
Step four: the target detection boxes are obtained, their position information is extracted from the spatio-temporal features, and the pedestrian features of the targets inside the boxes are obtained with two fully connected layers.
Step five: all pedestrian features and detection-box positions within a given time window are stored in the dynamic space-time memory module, and appearance-similarity and distance scores are computed against the targets in the current video frame. As shown in FIG. 3, the dynamic space-time memory module comprises a dynamic encoder and an identity aggregation module: the identity aggregation module uses a deformable convolution and two fully connected layers to handle target deformation, and the dynamic encoder, built on the Transformer, reduces memory usage relative to the original Transformer.
As shown in FIG. 4, the identity aggregation module is implemented as follows: the pedestrian feature F' is transformed into F1' by a global average pooling operation, a deformable convolution, a ReLU activation and a hard-sigmoid activation:
F1' = ε(σ(f'(AVG(F'))))
where f'(·) is the deformable convolution, AVG(·) is the global average pooling operation, σ(·) is the ReLU activation function, and ε(·) is the hard-sigmoid activation function. The global average pooling reduces the complexity of the pedestrian feature F', and the deformable convolution adapts to object deformation.
F1' is then multiplied by the pedestrian feature F' and passed through a global average pooling layer and two fully connected layers in turn to obtain the target feature F2':
F2' = FC(FC(AVG(F1' * F')))
The global average pooling again reduces complexity, and the two fully connected layers learn and summarize the distinct features of the object.
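Under the assumptions of the earlier identity-aggregation sketch, a brief usage example with illustrative sizes:
```python
import torch

agg = IdentityAggregation(channels=64, embed_dim=128)   # class from the earlier sketch
f_prime = torch.randn(8, 64, 32, 16)                    # eight cropped pedestrian feature maps F'
f2_prime = agg(f_prime)                                 # (8, 128) aggregated identity features F2'
```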
The appearance-similarity and distance scores against the targets in the current video frame are computed as follows:
in the dynamic space-time memory, all pedestrian features of the previous T frames, {F_1^1, F_2^1, F_3^1, ..., F_n^T}, are passed through a cross-attention operation in the dynamic encoder to obtain a correlation matrix E1; all targets in the current image, {F_1, F_2, F_3, ..., F_n}, are then cross-attended with E1 to obtain a correlation matrix E2; finally, E1 and E2 are multiplied to obtain the appearance-similarity score. The distance score is computed from all target detection boxes in the current image, {B_1, B_2, B_3, ..., B_n}, and all target detection boxes of the previous T frames, {B_1^1, B_2^1, B_3^1, ..., B_n^T}. The computation is:
E1 = F_i^n ⊕ F_i^n; E2 = F_i ⊕ E1; ReID_Score = E1 ⊗ E2
where F_i^n denotes all targets within the time window, F_i denotes all targets in the current image, ⊕ is the cross-attention operation, and ⊗ is the matrix multiplication operation.
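Under the assumptions of the earlier scoring sketch, an illustrative call with five stored targets and three current-frame targets:
```python
import torch

scorer = DynamicEncoderScore(dim=128, heads=4)          # class from the earlier sketch
mem_feats = torch.randn(1, 5, 128)                      # features stored over the previous T frames
cur_feats = torch.randn(1, 3, 128)                      # features of the current-frame targets
mem_xy = torch.rand(5, 2) * 100
cur_xy = torch.rand(3, 2) * 100
mem_boxes = torch.cat([mem_xy, mem_xy + 20.0], dim=1)   # (x1, y1, x2, y2) boxes, arbitrary sizes
cur_boxes = torch.cat([cur_xy, cur_xy + 20.0], dim=1)
reid_score, iou_score = scorer(mem_feats, cur_feats, mem_boxes, cur_boxes)
# reid_score: (3, 5) appearance similarity; iou_score: (3, 5) distance score
```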
Step six: the appearance-similarity score and the distance score are fused into a final score; when the final score exceeds a threshold, the targets are judged to be the same. If the current target is already stored in the dynamic space-time memory module, its trajectory is obtained and the stored pedestrian features and detection-box positions are updated; if it is not stored, its pedestrian features and detection-box position are stored.
The appearance-similarity score and the distance score are fused into the final score as:
Score = IoU_Score * ReID_Score + IoU_Score + ReID_Score
where Score is the final score, IoU_Score is the distance score, and ReID_Score is the appearance-similarity score.
As shown in FIG. 5, the method of the present invention achieves high detection accuracy and good tracking performance when tracking targets under camera motion, target deformation and frequent occlusion, and has wide applicability in autonomous driving, smart cities, visual surveillance, public safety, video analysis, human-computer interaction and the like.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (7)

1. A multi-target tracking method based on Transformer and space-time memory is characterized by comprising the following steps:
Step one: inputting four consecutive video frames and preprocessing the images;
Step two: extracting feature information from the preprocessed images with a neural network;
Step three: fusing the feature information of the four frames with a spatio-temporal enhancement module to obtain spatio-temporal features;
Step four: obtaining the target detection boxes and extracting, from the spatio-temporal features, the box positions and the pedestrian features of the targets inside the boxes;
Step five: storing all pedestrian features and detection-box positions within a given time window in a dynamic space-time memory module, and computing appearance-similarity and distance scores against the targets in the current video frame;
Step six: fusing the appearance-similarity score and the distance score into a final score, and judging the targets to be the same when the final score exceeds a threshold; if the current target is already stored in the dynamic space-time memory module, obtaining its trajectory and updating the stored pedestrian features and detection-box positions; if the current target is not stored in the dynamic space-time memory module, storing its pedestrian features and detection-box position.
2. The multi-target tracking method based on Transformer and space-time memory according to claim 1, wherein the method for preprocessing the image is as follows: the size of the input original image is changed from 1920 × 1080 to 1280 × 1280.
3. The Transformer and spatiotemporal memory-based multi-target tracking method according to claim 2, wherein the method for extracting the feature information of the preprocessed images with the neural network is as follows: the four resized frames are input into a CenterNet backbone network simultaneously to extract feature information.
4. The multi-target tracking method based on Transformer and space-time memory according to claim 3, wherein the method by which the spatio-temporal enhancement module fuses the feature information of the four frames is as follows: the feature information F ∈ R^(NT×C×H×W) of the four frames is first reshaped into F1 ∈ R^(N×C×T×H×W); averaging over the channels gives F2 ∈ R^(N×1×T×H×W), which is passed through a 3D convolution with a 3 × 3 × 3 kernel to obtain F3; F3 is passed through a further 3D convolution and matrix-multiplied with F1, and the result is convolved with the kernel obtained by passing it through a global average pooling layer and a fully connected layer, yielding the spatio-temporal feature F4; the expression is:
F1 = reshape(F); F2 = mean(F1); F3 = β(f(F2)); F4 = (β(f(F3)) × F1) ⊗ FC(AVG(β(f(F3)) × F1))
where reshape(·) is the size-transformation operation, mean(·) is the averaging operation over the channels, f(·) is the 3D convolution operation, β(·) is the batch normalization operation, × denotes matrix multiplication, ⊗ denotes the convolution operation, FC(·) is the fully connected layer, and AVG(·) is the global average pooling operation.
5. The Transformer and spatiotemporal memory based multi-target tracking method according to claim 1, wherein the dynamic spatiotemporal memory module comprises a dynamic encoder and an identity aggregation module;
the identity aggregation module operates as follows:
the pedestrian feature F' is transformed into F1' by a global average pooling operation, a deformable convolution, a ReLU activation and a hard-sigmoid activation:
F1' = ε(σ(f'(AVG(F'))))
where f'(·) is the deformable convolution, AVG(·) is the global average pooling operation, σ(·) is the ReLU activation function, and ε(·) is the hard-sigmoid activation function;
F1' is multiplied by the pedestrian feature F' and passed through a global average pooling layer and two fully connected layers in turn to obtain the target feature F2':
F2' = FC(FC(AVG(F1' * F'))).
6. The Transformer and spatiotemporal memory-based multi-target tracking method according to claim 5, wherein the method for computing the appearance-similarity and distance scores against the targets in the current video image is as follows:
in the dynamic spatiotemporal memory, all pedestrian features of the previous T frames, {F_1^1, F_2^1, F_3^1, ..., F_n^T}, are passed through a cross-attention operation in the dynamic encoder to obtain a correlation matrix E1; all targets in the current image, {F_1, F_2, F_3, ..., F_n}, are then cross-attended with E1 to obtain a correlation matrix E2; finally, E1 and E2 are multiplied to obtain the appearance-similarity score; the distance score is computed from all target detection boxes in the current image, {B_1, B_2, B_3, ..., B_n}, and all target detection boxes of the previous T frames, {B_1^1, B_2^1, B_3^1, ..., B_n^T}.
7. The Transformer and spatiotemporal memory-based multi-target tracking method according to claim 6, wherein the method for fusing the appearance similarity score and the distance score to obtain a final score comprises the following steps:
Score=IoU_Score*ReID_Score+IoU_Score+ReID_Score;
wherein Score is the final score, IoU_Score is the distance score, and ReID_Score is the appearance-similarity score.
CN202211304713.8A, filed 2022-10-24 (priority date 2022-10-24): Multi-target tracking method based on Transformer and space-time memory, Pending, published as CN115619827A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211304713.8A CN115619827A (en) 2022-10-24 2022-10-24 Multi-target tracking method based on Transformer and space-time memory


Publications (1)

Publication Number Publication Date
CN115619827A (en) 2023-01-17

Family

ID=84864221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211304713.8A Pending CN115619827A (en) 2022-10-24 2022-10-24 Multi-target tracking method based on Transformer and space-time memory

Country Status (1)

Country Link
CN (1) CN115619827A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036407A (en) * 2023-08-11 2023-11-10 浙江深象智能科技有限公司 Multi-target tracking method, device and equipment
CN117036407B (en) * 2023-08-11 2024-04-02 浙江深象智能科技有限公司 Multi-target tracking method, device and equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination