CN115619827A - Multi-target tracking method based on Transformer and space-time memory - Google Patents

Multi-target tracking method based on Transformer and space-time memory

Info

Publication number
CN115619827A
Authority
CN
China
Prior art keywords: score, target, space, information, pedestrian
Prior art date
Legal status: Pending
Application number
CN202211304713.8A
Other languages
Chinese (zh)
Inventor
肖启阳
谷松波
杨茂林
李森
贾林
胡振涛
Current Assignee: Henan University
Original Assignee
Henan University
Priority date
Filing date
Publication date
Application filed by Henan University
Priority to CN202211304713.8A
Publication of CN115619827A
Legal status: Pending (current)

Classifications

    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06V 10/761: Proximity, similarity or dissimilarity measures
    • G06V 10/7715: Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; mappings, e.g. subspace methods
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature-extraction or classification level
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06T 2207/10016: Video; image sequence
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/30241: Trajectory


Abstract

The invention provides a multi-target tracking method based on a Transformer and space-time memory, which comprises the following steps: first, feature information is extracted from four consecutive video frames, and the features of the four frames are fused to obtain high-quality features rich in spatio-temporal information; next, all pedestrian information within a given time window is stored in a dynamic space-time memory module, and appearance-similarity and distance scores are computed against the targets in the current video frame; finally, the appearance-similarity score and the distance score are fused into a final score, which is used to predict the target trajectories. The invention can locate pedestrians accurately under camera motion, fast pedestrian movement and similar conditions; the dynamic space-time memory module suppresses the influence of deformation; and predicting trajectories from the fused appearance-similarity and distance scores alleviates long-range occlusion, yielding accurate target trajectories.

Description

Multi-target tracking method based on Transformer and space-time memory
Technical Field
The invention relates to the technical field of video scene analysis and processing, in particular to a multi-target tracking method based on a Transformer and space-time memory.
Background
Multi-target tracking aims to find and track all objects with the same identity in a video sequence containing multiple objects. It plays an important role in many fundamental problems of video analysis and computer vision and is increasingly applied in fields such as autonomous driving, smart cities, visual surveillance, public safety, video analysis and human-computer interaction. In complex scenes, detection accuracy drops markedly under camera motion, target deformation, frequent occlusion and similar conditions, which degrades tracking performance.
In recent years, methods following the tracking-by-detection paradigm divide multi-target tracking into two tasks, detection and association: all targets in each video frame are first detected, and trajectories are then matched by an association algorithm, in which data association is the core part. Some algorithms use a spatial-scale metric to associate consecutive video frames. However, most tracking scenes present challenges such as camera motion, fast object movement and occlusion; in these cases the detector struggles to output stable detections and cannot effectively support the subsequent data association. Furthermore, objects in adjacent frames may undergo large displacements, so spatial-scale association cannot guarantee long-range tracking. Other algorithms match target features between any two frames to infer object similarity and use efficient affinity computation to deeply associate objects in the current frame with those in the previous frame, enabling reliable online tracking. However, such feature-association methods depend too heavily on the quality of the extracted target features, and matching features across only two frames easily causes identity switches.
Disclosure of Invention
To address the technical problem that feature-association methods depend too heavily on target-feature extraction and that matching features across only two frames easily causes identity switches, the invention provides a multi-target tracking method based on a Transformer and space-time memory: a spatio-temporal enhancement module extracts target features rich in spatio-temporal information to improve detection accuracy, and the Transformer is applied to multi-target tracking by storing all information within a specific time window in a dynamic space-time memory, reducing identity switches and thereby improving overall tracking performance.
The technical scheme of the invention is realized as follows:
a multi-target tracking method based on Transformer and space-time memory comprises the following steps:
Step one: input four consecutive video frames and preprocess the images;
Step two: extract feature information from the preprocessed images with a neural network;
Step three: fuse the feature information of the four frames with a spatio-temporal enhancement module to obtain spatio-temporal features;
Step four: obtain the target detection boxes and, from the spatio-temporal features, extract the box positions and the pedestrian features of the targets inside the boxes;
Step five: store all pedestrian features and detection-box positions within a given time window in a dynamic space-time memory module, and compute appearance-similarity and distance scores against the targets in the current video frame;
Step six: fuse the appearance-similarity score and the distance score into a final score; when the final score exceeds a threshold, the targets are judged to be the same. If the current target is already stored in the dynamic space-time memory module, its trajectory is obtained and the stored pedestrian features and detection-box positions are updated; if it is not stored, its pedestrian features and detection-box position are added to the module.
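For illustration only, a high-level Python sketch of how these six steps fit together in one tracking iteration is given below; every name passed in (preprocess, backbone, ste, detect, scorer, memory) and the 0.5 threshold are placeholders assumed for the sketch, not part of the claimed method.
```python
def track_step(frames, memory, preprocess, backbone, ste, detect, scorer, threshold=0.5):
    """One tracking iteration over four consecutive frames (hypothetical component names)."""
    x = preprocess(frames)                          # step one: resize the four consecutive frames
    feats = backbone(x)                             # step two: per-frame feature maps
    st_feats = ste(feats)                           # step three: fuse into spatio-temporal features
    boxes, ped_feats = detect(st_feats)             # step four: detection boxes + pedestrian features
    reid, iou = scorer(memory.features, ped_feats,
                       memory.boxes, boxes)         # step five: scores against the space-time memory
    score = iou * reid + iou + reid                 # step six: balanced fusion of the two scores
    matches = score > threshold                     # same target when the final score exceeds the threshold
    memory.update(boxes, ped_feats, matches)        # update matched tracks, store unmatched targets
    return matches
```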
The method for preprocessing the images is as follows: the size of each input original image is changed from 1920 × 1080 to 1280 × 1280.
The method for extracting the feature information of the preprocessed images with the neural network is as follows: the four resized frames are input into a CenterNet backbone network simultaneously to extract feature information.
The method by which the spatio-temporal enhancement module fuses the feature information of the four frames is as follows: the feature information F ∈ R^(NT×C×H×W) of the four frames is first reshaped into F1 ∈ R^(N×C×T×H×W); averaging over the channels gives F2 ∈ R^(N×1×T×H×W), which is passed through a 3D convolution with a 3 × 3 × 3 kernel to obtain F3; F3 is passed through a further 3D convolution and matrix-multiplied with F1, and the result is convolved with the kernel obtained by passing it through global average pooling and a fully connected layer, yielding the spatio-temporal feature F4. The expression is:
F1 = reshape(F); F2 = mean(F1); F3 = β(f(F2)); F4 = (β(f(F3)) × F1) ⊗ FC(AVG(β(f(F3)) × F1))
where reshape(·) is the size-transformation operation, mean(·) is the averaging operation over the channels, f(·) is the 3D convolution operation, β(·) is the batch normalization operation, × denotes matrix multiplication, ⊗ denotes the convolution operation, FC(·) is the fully connected layer, and AVG(·) is the global average pooling operation.
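For illustration only, a minimal PyTorch sketch of this fusion is given below. The class name, the single-channel 3D convolutions and modelling the global dynamic kernel as a per-channel 1 × 1 × 1 weight are assumptions made for the sketch, not the exact disclosed implementation.
```python
import torch
import torch.nn as nn

class SpatioTemporalEnhancement(nn.Module):
    """Sketch of the spatio-temporal enhancement module (assumed shapes and widths)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv3d_a = nn.Conv3d(1, 1, kernel_size=3, padding=1)   # f(.) on the channel-averaged map
        self.bn_a = nn.BatchNorm3d(1)                                # beta(.)
        self.conv3d_b = nn.Conv3d(1, 1, kernel_size=3, padding=1)
        self.bn_b = nn.BatchNorm3d(1)
        self.fc = nn.Linear(channels, channels)                      # FC(.) producing the dynamic kernel

    def forward(self, feats: torch.Tensor, n: int, t: int) -> torch.Tensor:
        # feats: (N*T, C, H, W) backbone features of the four consecutive frames
        _, c, h, w = feats.shape
        f1 = feats.view(n, t, c, h, w).permute(0, 2, 1, 3, 4)        # reshape(F) -> (N, C, T, H, W)
        f2 = f1.mean(dim=1, keepdim=True)                            # mean over channels -> (N, 1, T, H, W)
        f3 = self.bn_a(self.conv3d_a(f2))                            # F3 = beta(f(F2))
        m = self.bn_b(self.conv3d_b(f3)) * f1                        # fuse with F1 (broadcast multiplication)
        k = self.fc(m.mean(dim=(2, 3, 4)))                           # global dynamic kernel = FC(AVG(.))
        return m * k.view(n, c, 1, 1, 1)                             # apply kernel as a 1x1x1 convolution -> F4
```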
The dynamic space-time memory module comprises a dynamic encoder and an identity aggregation module;
the identity aggregation module operates as follows:
The pedestrian feature F' is transformed into F1' by a global average pooling operation, a deformable convolution, a ReLU activation and a hard-sigmoid activation:
F1' = ε(σ(f'(AVG(F'))))
where f'(·) is the deformable convolution, AVG(·) is the global average pooling operation, σ(·) is the ReLU activation function, and ε(·) is the hard-sigmoid activation function;
F1' is multiplied by the pedestrian feature F' and passed through a global average pooling layer and two fully connected layers in turn to obtain the target feature F2':
F2' = FC(FC(AVG(F1' * F'))).
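For illustration only, a minimal PyTorch sketch of the identity aggregation module is given below; the 3 × 3 kernel, the embedding width and the plain convolution used to predict the deformable-convolution offsets are assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class IdentityAggregation(nn.Module):
    """Sketch of the identity aggregation module (assumed kernel size and widths)."""
    def __init__(self, channels: int, embed_dim: int = 128):
        super().__init__()
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, kernel_size=3, padding=1)    # offsets for f'(.)
        self.deform = DeformConv2d(channels, channels, kernel_size=3, padding=1)  # deformable convolution f'(.)
        self.fc1 = nn.Linear(channels, embed_dim)
        self.fc2 = nn.Linear(embed_dim, embed_dim)

    def forward(self, f_prime: torch.Tensor) -> torch.Tensor:
        # f_prime: (B, C, H, W) pedestrian feature F'
        g = F.adaptive_avg_pool2d(f_prime, 1)                          # AVG(F')
        f1p = F.hardsigmoid(F.relu(self.deform(g, self.offset(g))))    # F1' = eps(sigma(f'(AVG(F'))))
        agg = F.adaptive_avg_pool2d(f1p * f_prime, 1).flatten(1)       # AVG(F1' * F')
        return self.fc2(self.fc1(agg))                                 # F2' = FC(FC(.))
```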
the method for calculating the appearance similarity and the distance score of the target in the current video image comprises the following steps:
in dynamic spatiotemporal memory, all pedestrian features { F over previous T frames 1 1 ,F 2 1 ,F 3 1 ,...,F n i Performing cross attention operation in a dynamic encoder to obtain a correlation matrix E 1 Then all objects in the current image { F } 1 ,F 2 ,F 3 ,...,F n And the correlation matrix E 1 Performing cross attention operation to obtain a correlation matrix E 2 Finally, the matrix E will be correlated 1 And E 2 Multiplying to obtain an appearance similarity score; for distance score, use all object detection boxes { B ] within the current image 1 ,B 2 ,B 3 ,...,B n And all target detection frames in the previous T frame
Figure BDA0003905299850000031
A distance score is calculated.
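For illustration only, a minimal PyTorch sketch of this scoring step follows; the feature width, head count, the use of nn.MultiheadAttention for the cross-attention, and IoU (via torchvision.ops.box_iou) as the concrete distance score are assumptions consistent with the IoU_Score used in the fusion below.
```python
import torch
import torch.nn as nn
from torchvision.ops import box_iou

class DynamicEncoderScore(nn.Module):
    """Sketch of the appearance-similarity / distance scoring (assumed widths)."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn_mem = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_cur = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, mem_feats, cur_feats, mem_boxes, cur_boxes):
        # mem_feats: (1, M, D) pedestrian features stored for the previous T frames
        # cur_feats: (1, K, D) features of the targets in the current image
        e1, _ = self.attn_mem(mem_feats, mem_feats, mem_feats)    # correlation matrix E1 (cross attention over memory)
        e2, _ = self.attn_cur(cur_feats, e1, e1)                  # correlation matrix E2 (current targets vs. E1)
        reid_score = torch.matmul(e2, e1.transpose(1, 2))         # multiply E1 and E2 -> (1, K, M) appearance similarity
        iou_score = box_iou(cur_boxes, mem_boxes)                 # (K, M) distance score from box overlap
        return reid_score.squeeze(0), iou_score
```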
The appearance-similarity score and the distance score are fused into the final score as:
Score = IoU_Score * ReID_Score + IoU_Score + ReID_Score
where Score is the final score, IoU_Score is the distance score, and ReID_Score is the appearance-similarity score.
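As a brief sketch, the balanced fusion and the threshold test of step six can be written as follows; the default threshold of 0.5 is an assumption, since the text only requires the final score to exceed a threshold.
```python
import torch

def fuse_scores(iou_score: torch.Tensor, reid_score: torch.Tensor, threshold: float = 0.5):
    """Balanced fusion of the distance (IoU) and appearance (ReID) scores."""
    score = iou_score * reid_score + iou_score + reid_score   # Score = IoU*ReID + IoU + ReID
    return score, score > threshold                           # mask of pairs judged to be the same target
```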
Compared with the prior art, the invention has the following beneficial effects:
1) To address missed and false detections, the invention provides a spatio-temporal enhancement module that extracts features from four consecutive frames and adds temporal information on top of the spatial features, so that the extracted features are rich in spatio-temporal information and detection accuracy is improved;
2) To address target deformation, the invention provides an identity aggregation module that uses deformable convolution to adaptively obtain the receptive field of an object and produce dynamic features, and aggregates the original features with these dynamic features, so that the module adapts to target deformation and reduces its influence on tracking;
3) To address frequent occlusion, the invention provides a dynamic spatio-temporal encoder that stores the features and detection-box positions of all targets within a given time window, alleviating the long-range tracking and occlusion problems of multi-target tracking;
4) To address trajectory prediction, the invention provides a balanced fusion strategy that fuses the similarity score and the distance score into a final score and judges the target trajectory from this score, greatly reducing identity switches.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of the detection of the present invention;
FIG. 2 is a schematic diagram of the spatio-temporal enhancement module according to the present invention;
FIG. 3 is a diagram illustrating the structure of the dynamic spatiotemporal memory module according to the present invention;
FIG. 4 is a schematic diagram of an identity aggregation module according to the present invention;
FIG. 5 shows tracking results of the present invention on the MOT17 data set.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
As shown in FIG. 1, an embodiment of the present invention provides a multi-target tracking method based on a Transformer and space-time memory. Since the extraction of pedestrian features and anchor boxes depends on the quality of the features produced by the neural network, feature information is extracted from four consecutive video frames; the features of the four frames are fused to obtain high-quality features rich in spatio-temporal information; all pedestrian information within a given time window is stored in a dynamic space-time memory module, and appearance-similarity and distance scores are computed against the targets in the current video frame; the appearance-similarity score and the distance score are fused into a final score, which is used to predict the target trajectories. The specific steps are as follows:
Step one: four consecutive video frames are input and preprocessed, i.e., each input original image is resized from 1920 × 1080 to 1280 × 1280.
Step two: the feature information of the preprocessed images is extracted with a neural network: the four resized frames are input into a CenterNet backbone network simultaneously to extract feature information.
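For illustration only, a minimal sketch of this preprocessing and batched backbone input is given below; bilinear interpolation and the centernet_backbone name are assumptions (the text only specifies the image sizes and the CenterNet backbone).
```python
import torch
import torch.nn.functional as F

def preprocess(frames: torch.Tensor) -> torch.Tensor:
    # frames: (4, 3, 1080, 1920) four consecutive RGB frames stacked along the batch dimension
    return F.interpolate(frames, size=(1280, 1280), mode="bilinear", align_corners=False)

# The four resized frames are fed to the backbone in one batch; `centernet_backbone`
# stands in for the CenterNet backbone network (hypothetical callable).
# feats = centernet_backbone(preprocess(frames))   # -> (4, C, H', W') feature maps
```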
Step three: the feature information of the four frames is fused by the spatio-temporal enhancement module to obtain spatio-temporal features. The feature maps are first reshaped; features rich in spatio-temporal information are extracted with 3D convolutions; a global dynamic enhancement kernel is obtained with a fully connected layer; and the features are finally convolved with this kernel.
As shown in FIG. 2, the spatio-temporal enhancement module fuses the feature information of the four frames as follows: the feature information F ∈ R^(NT×C×H×W) of the four frames is first reshaped into F1 ∈ R^(N×C×T×H×W); averaging over the channels gives F2 ∈ R^(N×1×T×H×W), which is passed through a 3D convolution with a 3 × 3 × 3 kernel followed by batch normalization (which reduces model complexity) to obtain F3; F3 is passed through a further 3D convolution and matrix-multiplied with F1, and the result is convolved with the kernel obtained by passing it through global average pooling and a fully connected layer, yielding the spatio-temporal feature F4. Applying 3D convolution again yields more sensitive spatio-temporal features. To enhance sensitivity to temporal information, the invention passes the globally average-pooled features through a fully connected layer to obtain a global dynamic enhancement kernel; the fully connected operation makes full use of global context, making the spatio-temporal information more discriminative. The computation is:
F1 = reshape(F); F2 = mean(F1); F3 = β(f(F2)); F4 = (β(f(F3)) × F1) ⊗ FC(AVG(β(f(F3)) × F1))
where reshape(·) is the size-transformation operation, mean(·) is the averaging operation over the channels, f(·) is the 3D convolution operation, β(·) is the batch normalization operation, × denotes matrix multiplication, ⊗ denotes the convolution operation, FC(·) is the fully connected layer, and AVG(·) is the global average pooling operation.
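Under the same assumptions as the earlier spatio-temporal enhancement sketch, a brief shape walk-through for one clip of T = 4 frames:
```python
import torch

ste = SpatioTemporalEnhancement(channels=64)   # class from the earlier sketch (assumed widths)
feats = torch.randn(4, 64, 320, 320)           # (N*T, C, H, W) backbone features of four frames
f4 = ste(feats, n=1, t=4)
print(f4.shape)                                # torch.Size([1, 64, 4, 320, 320]) spatio-temporal feature F4
```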
Step four: the target detection boxes are obtained, their position information is extracted from the spatio-temporal features, and the pedestrian features of the targets inside the boxes are obtained with two fully connected layers.
Step five: all pedestrian features and detection-box positions within a given time window are stored in the dynamic space-time memory module, and appearance-similarity and distance scores are computed against the targets in the current video frame. As shown in FIG. 3, the dynamic space-time memory module comprises a dynamic encoder and an identity aggregation module: the identity aggregation module uses a deformable convolution and two fully connected layers to handle target deformation, and the dynamic encoder, built on the Transformer, reduces memory usage relative to the original Transformer.
As shown in FIG. 4, the identity aggregation module is implemented as follows: the pedestrian feature F' is transformed into F1' by a global average pooling operation, a deformable convolution, a ReLU activation and a hard-sigmoid activation:
F1' = ε(σ(f'(AVG(F'))))
where f'(·) is the deformable convolution, AVG(·) is the global average pooling operation, σ(·) is the ReLU activation function, and ε(·) is the hard-sigmoid activation function. The global average pooling reduces the complexity of the pedestrian feature F', and the deformable convolution adapts to object deformation.
F1' is then multiplied by the pedestrian feature F' and passed through a global average pooling layer and two fully connected layers in turn to obtain the target feature F2':
F2' = FC(FC(AVG(F1' * F')))
The global average pooling again reduces complexity, and the two fully connected layers learn and summarize the distinct features of the object.
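Under the assumptions of the earlier identity-aggregation sketch, a brief usage example with illustrative sizes:
```python
import torch

agg = IdentityAggregation(channels=64, embed_dim=128)   # class from the earlier sketch
f_prime = torch.randn(8, 64, 32, 16)                    # eight cropped pedestrian feature maps F'
f2_prime = agg(f_prime)                                 # (8, 128) aggregated identity features F2'
```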
The appearance-similarity and distance scores against the targets in the current video frame are computed as follows:
in the dynamic space-time memory, all pedestrian features of the previous T frames, {F_1^1, F_2^1, F_3^1, ..., F_n^T}, are passed through a cross-attention operation in the dynamic encoder to obtain a correlation matrix E1; all targets in the current image, {F_1, F_2, F_3, ..., F_n}, are then cross-attended with E1 to obtain a correlation matrix E2; finally, E1 and E2 are multiplied to obtain the appearance-similarity score. The distance score is computed from all target detection boxes in the current image, {B_1, B_2, B_3, ..., B_n}, and all target detection boxes of the previous T frames, {B_1^1, B_2^1, B_3^1, ..., B_n^T}. The computation is:
E1 = F_i^n ⊕ F_i^n; E2 = F_i ⊕ E1; ReID_Score = E1 ⊗ E2
where F_i^n denotes all targets within the time window, F_i denotes all targets in the current image, ⊕ is the cross-attention operation, and ⊗ is the matrix multiplication operation.
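Under the assumptions of the earlier scoring sketch, an illustrative call with five stored targets and three current-frame targets:
```python
import torch

scorer = DynamicEncoderScore(dim=128, heads=4)          # class from the earlier sketch
mem_feats = torch.randn(1, 5, 128)                      # features stored over the previous T frames
cur_feats = torch.randn(1, 3, 128)                      # features of the current-frame targets
mem_xy = torch.rand(5, 2) * 100
cur_xy = torch.rand(3, 2) * 100
mem_boxes = torch.cat([mem_xy, mem_xy + 20.0], dim=1)   # (x1, y1, x2, y2) boxes, arbitrary sizes
cur_boxes = torch.cat([cur_xy, cur_xy + 20.0], dim=1)
reid_score, iou_score = scorer(mem_feats, cur_feats, mem_boxes, cur_boxes)
# reid_score: (3, 5) appearance similarity; iou_score: (3, 5) distance score
```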
Step six: the appearance-similarity score and the distance score are fused into a final score; when the final score exceeds a threshold, the targets are judged to be the same. If the current target is already stored in the dynamic space-time memory module, its trajectory is obtained and the stored pedestrian features and detection-box positions are updated; if it is not stored, its pedestrian features and detection-box position are stored.
The appearance-similarity score and the distance score are fused into the final score as:
Score = IoU_Score * ReID_Score + IoU_Score + ReID_Score
where Score is the final score, IoU_Score is the distance score, and ReID_Score is the appearance-similarity score.
As shown in FIG. 5, the method of the present invention achieves high detection accuracy and good tracking performance when tracking targets under camera motion, target deformation and frequent occlusion, and has wide applicability in autonomous driving, smart cities, visual surveillance, public safety, video analysis, human-computer interaction and the like.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (7)

1. A multi-target tracking method based on Transformer and space-time memory is characterized by comprising the following steps:
Step one: inputting four consecutive video frames and preprocessing the images;
Step two: extracting feature information from the preprocessed images with a neural network;
Step three: fusing the feature information of the four frames with a spatio-temporal enhancement module to obtain spatio-temporal features;
Step four: obtaining the target detection boxes and extracting, from the spatio-temporal features, the box positions and the pedestrian features of the targets inside the boxes;
Step five: storing all pedestrian features and detection-box positions within a given time window in a dynamic space-time memory module, and computing appearance-similarity and distance scores against the targets in the current video frame;
Step six: fusing the appearance-similarity score and the distance score into a final score, and judging the targets to be the same when the final score exceeds a threshold; if the current target is already stored in the dynamic space-time memory module, obtaining its trajectory and updating the stored pedestrian features and detection-box positions; if the current target is not stored in the dynamic space-time memory module, storing its pedestrian features and detection-box position.
2. The multi-target tracking method based on Transformer and space-time memory according to claim 1, wherein the method for preprocessing the image is as follows: the size of the input original image is changed from 1920 × 1080 to 1280 × 1280.
3. The Transformer and spatiotemporal memory-based multi-target tracking method according to claim 2, wherein the method for extracting the feature information of the preprocessed images with the neural network is as follows: the four resized frames are input into a CenterNet backbone network simultaneously to extract feature information.
4. The multi-target tracking method based on Transformer and space-time memory according to claim 3, wherein the method by which the spatio-temporal enhancement module fuses the feature information of the four frames is as follows: the feature information F ∈ R^(NT×C×H×W) of the four frames is first reshaped into F1 ∈ R^(N×C×T×H×W); averaging over the channels gives F2 ∈ R^(N×1×T×H×W), which is passed through a 3D convolution with a 3 × 3 × 3 kernel to obtain F3; F3 is passed through a further 3D convolution and matrix-multiplied with F1, and the result is convolved with the kernel obtained by passing it through a global average pooling layer and a fully connected layer, yielding the spatio-temporal feature F4; the expression is:
F1 = reshape(F); F2 = mean(F1); F3 = β(f(F2)); F4 = (β(f(F3)) × F1) ⊗ FC(AVG(β(f(F3)) × F1))
where reshape(·) is the size-transformation operation, mean(·) is the averaging operation over the channels, f(·) is the 3D convolution operation, β(·) is the batch normalization operation, × denotes matrix multiplication, ⊗ denotes the convolution operation, FC(·) is the fully connected layer, and AVG(·) is the global average pooling operation.
5. The Transformer and spatiotemporal memory based multi-target tracking method according to claim 1, wherein the dynamic spatiotemporal memory module comprises a dynamic encoder and an identity aggregation module;
the identity aggregation module operates as follows:
the pedestrian feature F' is transformed into F1' by a global average pooling operation, a deformable convolution, a ReLU activation and a hard-sigmoid activation:
F1' = ε(σ(f'(AVG(F'))))
where f'(·) is the deformable convolution, AVG(·) is the global average pooling operation, σ(·) is the ReLU activation function, and ε(·) is the hard-sigmoid activation function;
F1' is multiplied by the pedestrian feature F' and passed through a global average pooling layer and two fully connected layers in turn to obtain the target feature F2':
F2' = FC(FC(AVG(F1' * F'))).
6. The Transformer and spatiotemporal memory-based multi-target tracking method according to claim 5, wherein the method for computing the appearance-similarity and distance scores against the targets in the current video image is as follows:
in the dynamic spatiotemporal memory, all pedestrian features of the previous T frames, {F_1^1, F_2^1, F_3^1, ..., F_n^T}, are passed through a cross-attention operation in the dynamic encoder to obtain a correlation matrix E1; all targets in the current image, {F_1, F_2, F_3, ..., F_n}, are then cross-attended with E1 to obtain a correlation matrix E2; finally, E1 and E2 are multiplied to obtain the appearance-similarity score; the distance score is computed from all target detection boxes in the current image, {B_1, B_2, B_3, ..., B_n}, and all target detection boxes of the previous T frames, {B_1^1, B_2^1, B_3^1, ..., B_n^T}.
7. The Transformer and spatiotemporal memory-based multi-target tracking method according to claim 6, wherein the method for fusing the appearance similarity score and the distance score to obtain a final score comprises the following steps:
Score=IoU_Score*ReID_Score+IoU_Score+ReID_Score;
wherein Score is the final score, IoU_Score is the distance score, and ReID_Score is the appearance-similarity score.
CN202211304713.8A, filed 2022-10-24 (priority date 2022-10-24): Multi-target tracking method based on Transformer and space-time memory, Pending, published as CN115619827A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211304713.8A CN115619827A (en) 2022-10-24 2022-10-24 Multi-target tracking method based on Transformer and space-time memory


Publications (1)

Publication Number Publication Date
CN115619827A (en) 2023-01-17

Family

ID=84864221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211304713.8A Pending CN115619827A (en) 2022-10-24 2022-10-24 Multi-target tracking method based on Transformer and space-time memory

Country Status (1)

Country Link
CN (1) CN115619827A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036407A (en) * 2023-08-11 2023-11-10 浙江深象智能科技有限公司 Multi-target tracking method, device and equipment
CN117036407B (en) * 2023-08-11 2024-04-02 浙江深象智能科技有限公司 Multi-target tracking method, device and equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination