CN118172387A - Attention mechanism-based lightweight multi-target tracking method - Google Patents

Attention mechanism-based lightweight multi-target tracking method Download PDF

Info

Publication number
CN118172387A
CN118172387A (application CN202410367793.4A)
Authority
CN
China
Prior art keywords
attention
feature
feature map
input
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410367793.4A
Other languages
Chinese (zh)
Inventor
万琴
葛柱
沈学军
陈建文
杨漾
鲁春平
何勇
段小刚
刘波
刘海桥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Institute of Engineering
Original Assignee
Hunan Institute of Engineering
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Institute of Engineering filed Critical Hunan Institute of Engineering
Priority to CN202410367793.4A priority Critical patent/CN118172387A/en
Publication of CN118172387A publication Critical patent/CN118172387A/en
Pending legal-status Critical Current

Classifications

    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06N 3/0455: Auto-encoder networks; Encoder-decoder networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/048: Activation functions
    • G06N 3/0499: Feedforward networks
    • G06N 3/08: Learning methods
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a lightweight multi-target tracking method based on an attention mechanism. A joint attention module is proposed to enhance the ShuffleNet V2 network; the module comprises a spatio-temporal pyramid module (STPM) and a convolutional block attention module (CBAM). The STPM fuses multi-scale features to capture information at different spatial and temporal scales, and the CBAM aggregates channel and spatial information to enhance the representation capability of the model. A position encoding generator module (PEGM) and a dynamic template update strategy (DTUS) are then proposed to address the occlusion problem: the PEGM applies grouped convolution to the input sequence, with each convolution group responsible for a specific range of relative positional relationships, and the DTUS updates the template only when appropriate so as to keep it reliable. The method achieves a lightweight design, reduces the parameter count and computational requirements of the tracker, and effectively improves the accuracy of target tracking.

Description

Attention mechanism-based lightweight multi-target tracking method
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a lightweight multi-target tracking method based on an attention mechanism.
Background
Object tracking is an important direction in the field of computer vision. It is widely used in practical applications such as intelligent surveillance, behavior recognition and robot navigation. Accurately tracking arbitrary objects in real-world scenes is very challenging because of variations in the scale, appearance and texture of the target. The present invention is primarily concerned with multi-target tracking, which consists of creating consistent trajectories by identifying and locating multiple objects in a continuous video sequence. A multi-target tracking algorithm mainly detects the target objects in each video frame and determines their positions in the image. Each object is then assigned a unique ID that remains unchanged throughout its motion. Existing target tracking methods can be broadly divided into two categories: traditional tracking methods and deep-learning-based tracking methods. Representative traditional tracking algorithms include optical flow, Kalman filters and kernelized correlation filters. The Kalman filter iteratively and dynamically estimates the state at the next time step given the historical states and the observed data; to cope with real-world object tracking, it models the motion of an object so that the object's position in subsequent frames can be estimated. However, learning such simple features alone cannot adequately handle the complexity of real scenes.
In the field of deep-learning-based target tracking, the most popular approach is tracking-by-detection (TBD), which extracts a set of detection results from video frames to guide the tracking process. Siamese-network-based tracking methods offer high tracking accuracy and speed and can minimize tracking errors caused by indistinguishable similar objects. The fully-convolutional Siamese network (SiamFC) algorithm is a notable classical example. SiamFC formulates the tracking task as a similarity-matching problem between the template branch and the search branch and yields commendable tracking performance. However, SiamFC relies primarily on convolutional layers to capture local features of the target, and its ability to model long-term dependencies of the target is limited.
In order to establish a robust global temporal context, a lightweight multi-target tracking method based on an attention mechanism is provided.
Disclosure of Invention
Aiming at the technical problems, the invention provides a lightweight multi-target tracking method based on an attention mechanism.
The technical scheme adopted for solving the technical problems is as follows:
A lightweight multi-target tracking method based on an attention mechanism, the method comprising the following steps:
S100: acquiring images containing multiple targets, and inputting them into a template branch and a search branch respectively for preprocessing to obtain a first image and a second image;
S200: inputting the first image and the second image into corresponding ShuffleNet V2 lightweight networks respectively to extract attention-rich target features and generate feature maps of a specific size, obtaining a first feature map and a second feature map;
S300: inputting the first feature map and the second feature map into corresponding joint attention modules respectively to capture long-distance spatio-temporal relations, obtaining a first aggregated feature and a second aggregated feature;
S400: fusing the first aggregated feature and the second aggregated feature through element-wise multiplication and encoding them to obtain an encoded feature map, which serves as the cross-attention key in a first decoder and a second decoder arranged in parallel; the first decoder decodes the encoded feature map together with a learned target query to obtain object features and, combined with a confidence branch, updates the target tracking model in real time based on a dynamic template update strategy, while the second decoder decodes the encoded feature map together with a template feature query to obtain tracking features;
S500: the object features obtained by the first decoder are processed by a feed-forward neural network to generate detection boxes, the tracking features obtained by the second decoder are processed by a feed-forward neural network to generate tracking boxes, and the detection boxes are associated with the tracking boxes using a box-IoU matching method to obtain the final tracking result.
Preferably, the ShuffleNet V2 lightweight network in S200 includes a convolutional layer, a max-pooling layer, a first extraction stage, a second extraction stage and a third extraction stage, each extraction stage including a first block and a second block; the first extraction stage comprises 1 stacked first block and 3 stacked second blocks, the second extraction stage comprises 1 stacked first block and 7 stacked second blocks, and the third extraction stage comprises 1 stacked first block and 3 stacked second blocks.
In the first block, the input features undergo a channel-split operation that divides the feature channels into two branches, each with half the original number of channels; the right branch passes through a 1×1 convolution, then a 3×3 depthwise convolution, then another 1×1 convolution, while the left branch is kept unchanged through an identity mapping; after concatenation and channel shuffling, the features of the two branches are effectively fused together.
In the second block, the input feature map is fed into two branches: the left branch consists of a 3×3 DWConv with stride 2 and a conventional 1×1 convolution, and the right branch passes through a 1×1 convolution, then a 3×3 depthwise convolution, then another 1×1 convolution; after channel concatenation and channel shuffling, the channel count of the output feature map is doubled. When the input is the first image, the output feature map is the first feature map; when the input is the second image, the output feature map is the second feature map.
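A minimal PyTorch-style sketch of the two block types described above follows. It is for illustration only: the channel widths, BatchNorm/ReLU placement and other hyper-parameters are assumptions and are not taken from the patent.

import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    # Interleave channels from the two branches so information mixes across groups
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class Block1(nn.Module):
    # Stride-1 unit: channel split -> identity / (1x1 -> 3x3 depthwise -> 1x1) -> concat -> shuffle
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False), nn.BatchNorm2d(half),
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True))

    def forward(self, x):
        left, right = x.chunk(2, dim=1)                          # channel split
        return channel_shuffle(torch.cat([left, self.branch(right)], dim=1))

class Block2(nn.Module):
    # Stride-2 unit: both branches are convolved; the output channel count doubles
    def __init__(self, channels):
        super().__init__()
        self.left = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1, groups=channels, bias=False),
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, channels, 1, bias=False), nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.right = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1, groups=channels, bias=False),
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, channels, 1, bias=False), nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, x):
        return channel_shuffle(torch.cat([self.left(x), self.right(x)], dim=1))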
Preferably, the joint attention module in S300 includes a CBAM module and an STPM module. The CBAM module includes a channel attention module and a spatial attention module: the channel attention module performs channel attention enhancement on the input feature to obtain a channel attention feature, which is multiplied with the input feature and used as the input of the spatial attention module; the spatial attention module performs spatial attention enhancement on its input to obtain a spatial attention feature, which is multiplied with the input of the spatial attention module to obtain the output feature map Fc. The STPM module comprises 4 parallel branches, each with an embedded CBAM module; it seamlessly integrates features at four different pyramid scales and generates multi-scale attention features Fs by fusing spatio-temporal attention contexts at different scales. Finally, the CBAM- and STPM-processed features are combined and output to obtain the aggregated features.
Preferably, the channel attention module compresses the global spatial information contained in the input feature map F through global max pooling and global average pooling, generating two different one-dimensional feature maps F1 and F2; these are passed through a shared multi-layer perceptron (MLP) comprising fully connected layers and a ReLU nonlinear activation function. After an element-wise summation across channels, the result is normalized with a sigmoid function; the normalization produces a statistical weight representation Mc for each channel, and the output feature map F′ is obtained by multiplying Mc with the input feature map F.
The channel attention feature Mc is calculated as:
Mc(F)=σ{MLP[AvgPool(F)]+MLP[MaxPool(F)]}
Wherein σ denotes the sigmoid function, AvgPool refers to average pooling, MaxPool refers to maximum pooling, MLP is the multi-layer perceptron computation, F is the input feature map, H1 and W1 represent the height and width of the feature map respectively, and F(i0,j0) denotes the value at position (i0,j0) in the feature map.
Preferably, the feature map F′ output by the channel attention module is used as the input to the spatial attention module. Channel-wise global max pooling and global average pooling yield the feature maps F3∈R^(1×H1×W1) and F4∈R^(1×H1×W1), which are then concatenated to generate an effective feature descriptor F5∈R^(2×H1×W1). A 7×7 convolution operation is performed on F5 to reduce it to one channel, and a sigmoid operation generates the spatial attention feature Ms, which is multiplied with the input feature map F′ of the spatial attention module to obtain the finally generated aggregated feature Fc.
The spatial attention feature Ms is calculated as:
Ms(F′)=σ{f7×7[AvgPool(F′);MaxPool(F′)]}
Where f7×7 denotes a convolution layer with a 7×7 kernel and σ denotes the sigmoid function.
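The channel and spatial attention computations above can be illustrated with the following minimal PyTorch sketch. The reduction ratio of the shared MLP and the use of nn.Linear layers are assumptions for illustration; the patent specifies only fully connected layers with a ReLU activation and a 7×7 spatial convolution.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Shared MLP over globally average- and max-pooled descriptors, fused with a sigmoid (Mc)
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):
        n, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        return torch.sigmoid(avg + mx).view(n, c, 1, 1)

class SpatialAttention(nn.Module):
    # Channel-wise average/max maps concatenated and reduced by a 7x7 convolution (Ms)
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        desc = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(desc))

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, f):
        f_prime = f * self.ca(f)            # channel-refined features F'
        return f_prime * self.sa(f_prime)   # spatially refined output Fc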
Preferably, in the STPM module the input feature map F∈R^(C1×H1×W1) is processed in 4 parallel branches; each branch divides the input feature map into s×s sub-regions, where s∈S, S={1,2,3,4}, denotes the four pyramid scales.
In the branch of scale s, each sub-region is denoted F_s^(i,j) with i,j∈{1,…,s}. One CBAM is added to each branch, and within each pyramid branch the CBAM is applied to all sub-regions to generate an updated residual feature map.
The updated residual feature maps obtained at the four different scales are upsampled to obtain feature maps Zs having the same spatial dimensions as the initial feature map F.
The four feature maps Zs are concatenated and fed into a convolution layer (C, 1×1/1) to generate the final residual feature map Y, which is summed element-wise with the input feature map F to obtain the updated feature map Fs.
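A minimal sketch of the STPM follows, reusing the CBAM class from the sketch above. Resizing the map so it divides evenly into s×s cells and processing every sub-region at full resolution are simplifications assumed here; the per-branch CBAM, the 1×1 fusion convolution and the residual sum follow the description.

import torch
import torch.nn as nn
import torch.nn.functional as F

class STPM(nn.Module):
    def __init__(self, channels, scales=(1, 2, 3, 4)):
        super().__init__()
        self.scales = scales
        self.cbam = nn.ModuleList([CBAM(channels) for _ in scales])   # one CBAM per pyramid branch
        self.fuse = nn.Conv2d(channels * len(scales), channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        outs = []
        for cbam, s in zip(self.cbam, self.scales):
            hs, ws = (h // s) * s, (w // s) * s                       # make the grid divide evenly
            xs = F.interpolate(x, size=(hs, ws), mode='bilinear', align_corners=False)
            rows = []
            for i in range(s):
                cols = []
                for j in range(s):
                    cell = xs[:, :, i * (hs // s):(i + 1) * (hs // s), j * (ws // s):(j + 1) * (ws // s)]
                    cols.append(cbam(cell))                           # attention within each sub-region
                rows.append(torch.cat(cols, dim=3))
            zs = torch.cat(rows, dim=2)
            outs.append(F.interpolate(zs, size=(h, w), mode='bilinear', align_corners=False))  # Zs
        y = self.fuse(torch.cat(outs, dim=1))                         # final residual map Y
        return x + y                                                  # updated feature map Fs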
Preferably, the encoder comprises, connected in sequence, a self-attention layer, a residual and normalization layer, a position encoding generation module, a second self-attention layer, and a second residual and normalization layer; in S400, the first aggregated feature and the second aggregated feature are fused through element-wise multiplication and encoded to obtain the encoded feature map, specifically:
The encoder takes as input Q, K and V obtained by linear transformations of the combined feature map Fcf, obtains a set of attention weights through a Softmax operation, and multiplies the weights with V to obtain attention-weighted vectors. The specific calculation process is:
Attention(Q,K,V)=Softmax(QK^T/√dk)V=A(K→Q)V (6)
wherein Q, K and V represent the query vector, key vector and value vector respectively, dk is the key vector dimension, and A(K→Q) is the similarity matrix between K and Q;
Using a single-head attention mechanism, Q, K and V are fed into the self-attention layer; the relevance of each query to all keys is computed through linear transformations and the attention mechanism, and the values are weighted and summed with these relevances as weights to generate the final output representation H1, specifically:
H1=Ins.Norm(Attention(Q1,K1,V1)+Fcf) (7)
Wherein Fcf is the fused feature obtained by element-wise multiplication of the first and second aggregated features Fjt and Fjs, Q1, K1 and V1 are linear transformations of the fused feature, Attention(·) is the attention computation given in equation (6), and Ins.Norm(·) denotes L2 normalization over all image-patch embeddings;
H1 is input to the position encoding generation module; Q2, K2 and V2 are obtained through a series of linear transformations and fed into the second self-attention layer, followed by residual connection and normalization, to obtain H2:
H2=Ins.Norm(Attention(Q2,K2,V2)+PEGT) (8)
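The encoder computation of equations (6)-(8) can be sketched as follows. LayerNorm stands in for the Ins.Norm (L2 normalization) operation and the position encoding generator is passed in as a generic callable that returns a sequence of the same shape; both simplifications, and the (batch, tokens, dim) layout, are assumptions made for a compact illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

def single_head_attention(q, k, v):
    # Eq. (6): scaled dot-product attention with a single head
    scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v

class EncoderSketch(nn.Module):
    def __init__(self, dim, peg):
        super().__init__()
        self.q1, self.k1, self.v1 = (nn.Linear(dim, dim) for _ in range(3))
        self.q2, self.k2, self.v2 = (nn.Linear(dim, dim) for _ in range(3))
        self.norm1 = nn.LayerNorm(dim)   # stand-in for Ins.Norm over the patch embeddings
        self.norm2 = nn.LayerNorm(dim)
        self.peg = peg                   # position encoding generator module (PEGM)

    def forward(self, f_cf):
        # f_cf: (batch, tokens, dim) fused feature Fcf = Fjt * Fjs flattened into a sequence
        h1 = self.norm1(single_head_attention(self.q1(f_cf), self.k1(f_cf), self.v1(f_cf)) + f_cf)      # Eq. (7)
        peg_t = self.peg(h1)             # inject convolutional position information
        h2 = self.norm2(single_head_attention(self.q2(peg_t), self.k2(peg_t), self.v2(peg_t)) + peg_t)  # Eq. (8)
        return h2                        # encoded feature map H2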
Preferably, the first decoder comprises a self-attention layer, a first residual and normalization layer, a cross-attention layer, a second residual and normalization layer, a feed-forward neural network layer, and a third residual and normalization layer; in S400, the first decoder decodes the encoded feature map together with the learned target query to obtain the object features, specifically:
The first decoder feeds the learned target query into the self-attention layer for self-attention calculation and residual connection and normalization to obtain H 3:
H3=Ins.Norm(Attention(Q3,K3,V3)+qlo) (9)
Wherein Q3, K3 and V3 are obtained by linear transformations of the learned target query qlo;
Q4, K4 and V4 are jointly fed into the cross-attention module, where a temporal cross-attention matrix is calculated and then residual-connected and normalized to obtain Hc1:
Hc1=Ins.Norm(Attention(Q4,K4,V4)+H3) (10)
Wherein Q4 is obtained by a linear transformation of H3, and V4 and K4 are obtained by linear transformations of H2. Finally, Hc1 is fed into the feed-forward network, followed by residual connection and normalization, to obtain Fd, as given in equation (11):
Fd=Ins.Norm(FFN(Hc1)+Hc1) (11)
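A minimal sketch of one decoder block (equations (9)-(11)) follows, reusing the single_head_attention helper from the encoder sketch above. The feed-forward hidden width is an assumed value; the same structure also serves as the second decoder when the template feature query qtt is supplied instead of the learned query qlo.

import torch
import torch.nn as nn

class DecoderSketch(nn.Module):
    def __init__(self, dim, hidden=512):
        super().__init__()
        self.self_qkv = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])
        self.cross_q = nn.Linear(dim, dim)
        self.cross_kv = nn.ModuleList([nn.Linear(dim, dim) for _ in range(2)])
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, dim))

    def forward(self, query, h2):
        # query: (batch, num_queries, dim) learned query qlo (or template query qtt for decoder 2)
        q, k, v = (proj(query) for proj in self.self_qkv)
        h3 = self.norm1(single_head_attention(q, k, v) + query)                 # Eq. (9)
        k4, v4 = (proj(h2) for proj in self.cross_kv)                           # K, V from encoder output H2
        hc1 = self.norm2(single_head_attention(self.cross_q(h3), k4, v4) + h3)  # Eq. (10)
        return self.norm3(self.ffn(hc1) + hc1)                                  # Eq. (11): Fd (or Ft)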
Preferably, the second decoder comprises a self-attention layer, a first residual and normalization layer, a cross-attention layer, a second residual and normalization layer, a feed-forward neural network layer, and a third residual and normalization layer; in S400, the second decoder decodes the encoded feature map together with the template feature query to obtain the tracking features, specifically:
The second decoder takes the template feature query qtt as input; after self-attention computation followed by residual connection and layer normalization, H4 is obtained:
H4=Ins.Norm(Attention(Q5,K5,V5)+qtt) (12)
Wherein Q5, K5 and V5 are obtained by linear transformations of the template feature query qtt;
Q6, obtained by a mapping of H4, participates in a cross-attention computation with K6 and V6, which are mapped from the encoded feature H2; the result is then residual-connected with H4 and normalized to obtain Hc2:
Hc2=Ins.Norm(Attention(Q6,K6,V6)+H4) (13)
In other words, Hc2 is generated by a cross-attention computation between K6 and V6, obtained from the encoder feature H2, and Q6, obtained from the feature H4 in the second decoder;
Finally, the feature Hc2 is fed into the feed-forward network, followed by residual connection and normalization, to obtain Ft, calculated as follows:
Ft=Ins.Norm(FFN(Hc2)+Hc2) (14)
Where FFN(·) denotes the feed-forward neural network computation, involving linear transformations and an activation function.
Preferably, the confidence branch comprises a three-layer perceptron and a sigmoid function; in S400, the first decoder, combined with the confidence branch, updates the target tracking model in real time based on the dynamic template update strategy, specifically:
A confidence score cj (j=1,2,…,n) is obtained through the confidence branch for each target that generates a detection box, and the individual confidences of all targets are fused through a weighted average to obtain a global confidence score.
When the global confidence score is higher than the set threshold δ, no occlusion is indicated and the target tracking model starts to be updated; otherwise, the target tracking model is updated again after τ frames.
The attention-mechanism-based lightweight multi-target tracking method provides a joint attention module that is combined with the ShuffleNet V2 network, realizing a lightweight design while accurately and efficiently capturing the key characteristics of target objects. SFTrans, comprising one encoder and two decoders, is designed on the basis of the original Transformer structure, reducing the parameter count and computational requirements of the tracker; and a dynamic template update strategy is proposed, improving the accuracy of the model.
Drawings
FIG. 1 is a flow chart of a lightweight multi-objective tracking method based on an attention mechanism according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a network structure of a lightweight multi-objective tracking model based on an attention mechanism according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the structure of a ShuffleNet V2 network according to one embodiment of the present invention; wherein (a) is a schematic structural diagram of the first block, and (b) is a schematic structural diagram of the second block;
FIG. 4 is a schematic diagram illustrating a structure of CBAM according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of STPM according to an embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating the structure of SFTrans according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of PEGM according to an embodiment of the present invention;
FIG. 8 is a flow chart of a dynamic template update strategy according to an embodiment of the present invention;
FIG. 9 is a graph showing the training loss results of the model of the present invention;
FIG. 10 is a joint attention heat map visualization in accordance with one embodiment of the present invention; (a) represents the original image input to the network, (b) represents the superposition of the region of interest and the original image after CBAM is added, (c) represents the superposition of the region of interest and the original image after STPM is added, and (d) represents the superposition of the region of interest and the original image after the joint attention module is added;
FIG. 11 is a diagram of the target tracking algorithm and the occlusion experiment according to an embodiment of the present invention; (a) represents the tracking results of the algorithm with the position encoding generator module PEGM and the dynamic template update strategy DTUS removed, and (b) represents the tracking results of the algorithm of the present invention with the position encoding generator module PEGM and the dynamic template update strategy DTUS added;
FIG. 12 is a schematic diagram of the experimental comparison between the proposed target tracking algorithm and the FairMOT algorithm in real scenes according to an embodiment of the present invention; (a) shows the comparison of the tracking effect of the present algorithm and the FairMOT algorithm in a teaching-building corridor scene, and (b) shows the comparison of the tracking effect of the present algorithm and the FairMOT algorithm in a school-gate scene.
Detailed Description
In order to make the technical scheme of the present invention better understood by those skilled in the art, the present invention will be further described in detail with reference to the accompanying drawings.
In one embodiment, as shown in fig. 1, a lightweight multi-target tracking method based on an attention mechanism comprises the following steps:
S100: acquiring images containing multiple targets, and inputting them into a template branch and a search branch respectively for preprocessing to obtain a first image and a second image;
S200: inputting the first image and the second image into corresponding ShuffleNet V2 lightweight networks respectively to extract attention-rich target features and generate feature maps of a specific size, obtaining a first feature map and a second feature map;
S300: inputting the first feature map and the second feature map into corresponding joint attention modules respectively to capture long-distance spatio-temporal relations, obtaining a first aggregated feature and a second aggregated feature;
S400: fusing the first aggregated feature and the second aggregated feature through element-wise multiplication and encoding them to obtain an encoded feature map, which serves as the cross-attention key in a first decoder and a second decoder arranged in parallel; the first decoder decodes the encoded feature map together with a learned target query to obtain object features and, combined with a confidence branch, updates the target tracking model in real time based on a dynamic template update strategy, while the second decoder decodes the encoded feature map together with a template feature query to obtain tracking features;
S500: the object features obtained by the first decoder are processed by a feed-forward neural network to generate detection boxes, the tracking features obtained by the second decoder are processed by a feed-forward neural network to generate tracking boxes, and the detection boxes are associated with the tracking boxes using a box-IoU matching method to obtain the final tracking result.
Specifically, as shown in fig. 2, the overall architecture follows a structure resembling a twin network, dividing the algorithm into two branches: template branches and search branches. In order to simplify the calculation, the invention adopts ShuffleNet V lightweight network as a backbone. An important complement is the joint attention module, which includes a convolution block attention module CBAM and a spatiotemporal pyramid module STPM, which helps capture long-distance spatiotemporal relationships. After the joint attention module, the features of the two branches are fused by element multiplication and serve as cross-attention keys in the parallel decoder. The second decoder (corresponding to decoder 2 in the figure) uses as input a query the target features extracted from the template frames, while the first decoder (corresponding to decoder 1 in the figure) uses as input a learned target query, which is a set of trainable parameters trained with all other parameters in the network. In the encoder, the invention introduces PEGM to enhance the anti-occlusion capability of the algorithm. In addition, the decoder 1 combines confidence branches, which use dynamic template update policies to update network parameters in the architecture in real-time to obtain more accurate information. In order to effectively manage the computational load, a single-head self-attention is employed in the present invention. The decoder 1 may obtain a detection box and the decoder 2 may obtain a tracking box. In the association matching step, ioU matches are used to associate the detection box with the tracking box.
The invention introduces a Transformer architecture and a ShuffleNet V2 lightweight network and combines them with a joint attention module and a free position encoding module. The aim is to enhance the ability of the target tracking algorithm to perceive and utilize spatio-temporal context information. By introducing the joint attention mechanism, the algorithm can selectively focus on key intra-frame and inter-frame information, thereby improving its ability to capture target motion patterns and environmental conditions. The free position encoding module provides rich position information for each object, helping to model the object's position and motion state with high accuracy.
As shown in fig. 2, the template branch first preprocesses the frame preceding the current frame into an image with C channels and height and width H×W, and then inputs it into the ShuffleNet V2 network. The image then undergoes convolution and max-pooling operations to obtain a feature map of size 8C×(H/4)×(W/4). Next, three stages follow, each comprising Block 1 and Block 2 units, stacked 1 and 3 times, 1 and 7 times, and 1 and 3 times, respectively. These stages extract attention-rich target features, producing a feature map of a specific size, (464/C)×(H/32)×(W/32). Similarly, the search branch is preprocessed and input into the ShuffleNet V2 network, where it undergoes the same processing as the template branch to give a feature map of size (464/C)×(H/32)×(W/32). For simplicity of calculation, 464/C is denoted as C1, H/32 as H1, and W/32 as W1 in the present invention.
In one embodiment, the ShuffleNet V2 lightweight network in S200 includes a convolutional layer, a max-pooling layer, a first extraction stage, a second extraction stage and a third extraction stage, each extraction stage including a first block and a second block; the first extraction stage comprises 1 stacked first block and 3 stacked second blocks, the second extraction stage comprises 1 stacked first block and 7 stacked second blocks, and the third extraction stage comprises 1 stacked first block and 3 stacked second blocks.
In the first block, the input features undergo a channel-split operation that divides the feature channels into two branches, each with half the original number of channels; the right branch passes through a 1×1 convolution, then a 3×3 depthwise convolution, then another 1×1 convolution, while the left branch is kept unchanged through an identity mapping; after concatenation and channel shuffling, the features of the two branches are effectively fused together.
In the second block, the feature map output by the previous block is fed into two branches: the left branch consists of a 3×3 DWConv with stride 2 and a conventional 1×1 convolution, and the right branch passes through a 1×1 convolution, then a 3×3 depthwise convolution, then another 1×1 convolution; after channel concatenation and channel shuffling, the channel count of the output feature map is doubled. When the input is the first image, the output feature map is the first feature map; when the input is the second image, the output feature map is the second feature map.
Specifically, ShuffleNet is a high-performance lightweight convolutional neural network. It relies on two core operations, grouped convolution and channel shuffling, to strike the best balance between speed and accuracy. These operations effectively maintain accuracy while significantly reducing computational complexity. Guidelines for designing efficient lightweight networks were then introduced on the basis of the ShuffleNet V1 network, forming ShuffleNet V2. The ShuffleNet V2 network architecture includes a first block (Block1) and a second block (Block2), as shown in fig. 3 (a) and fig. 3 (b).
In one embodiment, the joint attention module in S300 includes a CBAM module and an STPM module. The CBAM module includes a channel attention module and a spatial attention module: the channel attention module performs channel attention enhancement on the input feature to obtain a channel attention feature, which is multiplied with the input feature and used as the input of the spatial attention module; the spatial attention module performs spatial attention enhancement on its input to obtain a spatial attention feature, which is multiplied with the input of the spatial attention module to obtain the output feature map Fc. The STPM module comprises 4 parallel branches, each with an embedded CBAM module; it seamlessly integrates features at four different pyramid scales and generates multi-scale attention features Fs by fusing spatio-temporal attention contexts at different scales. Finally, the CBAM- and STPM-processed features are combined and output to obtain the aggregated features.
Specifically, CBAM is a lightweight convolutional attention module comprising two sub-modules: a channel attention module and a spatial attention module. CBAM is shown in fig. 4. From this architecture, ShuffleNet V2 derives a feature map, denoted F∈R^(C1×H1×W1), where C1 represents the number of channels in the input feature map and H1 and W1 represent its height and width, respectively. CBAM compresses the global spatial information contained in the feature map F through global max pooling and global average pooling, generating two different feature maps, F1 and F2, each of size 1×1×C1. Combining the location information from each channel feature map across the entire dataset helps mitigate network bias, which is typically caused by the limited receptive field of the convolution kernel during feature extraction.
In one embodiment, the channel attention module compresses the global spatial information contained in the input feature map F through global max pooling and global average pooling, generating two different one-dimensional feature maps F1 and F2; these are passed through a shared multi-layer perceptron (MLP) comprising fully connected layers and a ReLU nonlinear activation function. After an element-wise summation across channels, the result is normalized with a sigmoid function; the normalization produces a statistical weight representation Mc for each channel, and the output feature map F′ is obtained by multiplying Mc with the input feature map F.
The channel attention feature Mc is calculated as:
Mc(F)=σ{MLP[AvgPool(F)]+MLP[MaxPool(F)]}
Wherein σ denotes the sigmoid function, AvgPool refers to average pooling, MaxPool refers to maximum pooling, MLP is the multi-layer perceptron computation, F is the input feature map, H1 and W1 represent the height and width of the feature map respectively, and F(i0,j0) denotes the value at position (i0,j0) in the feature map.
Specifically, in order to make full use of the feature information extracted by the compression operation, the correlation between channels is obtained by the evaluation operation.
In one embodiment, the feature map F′ output by the channel attention module is used as the input to the spatial attention module. Channel-wise global max pooling and global average pooling yield the feature maps F3∈R^(1×H1×W1) and F4∈R^(1×H1×W1), which are then concatenated to generate an effective feature descriptor F5∈R^(2×H1×W1). A 7×7 convolution operation is performed on F5 to reduce it to one channel, and a sigmoid operation generates the spatial attention feature Ms, which is multiplied with the input feature map F′ of the spatial attention module to obtain the finally generated aggregated feature Fc.
The spatial attention feature Ms is calculated as:
Ms(F′)=σ{f7×7[AvgPool(F′);MaxPool(F′)]}
Where f7×7 denotes a convolution layer with a 7×7 kernel and σ denotes the sigmoid function.
In one embodiment, as shown in FIG. 5, in the STPM module the input feature map F∈R^(C1×H1×W1) is processed in 4 parallel branches; each branch divides the input feature map into s×s sub-regions, where s∈S, S={1,2,3,4}, denotes the four pyramid scales.
In the branch of scale s, each sub-region is denoted F_s^(i,j) with i,j∈{1,…,s}. One CBAM is added to each branch, and within each pyramid branch the CBAM is applied to all sub-regions to generate an updated residual feature map.
The updated residual feature maps obtained at the four different scales are upsampled to obtain feature maps Zs having the same spatial dimensions as the initial feature map F.
The four feature maps Zs are concatenated and fed into a convolution layer (C, 1×1/1) to generate the final residual feature map Y, which is summed element-wise with the input feature map F to obtain the updated feature map Fs.
In particular, the present invention exploits STPM, which aggregates different regions to obtain global information. STPM improves the ability of a model to identify small objects by aggregating spatiotemporal context information over multiple scales. The module seamlessly integrates the features of four different pyramid scales, and generates multi-scale attention features by fusing the space-time attention contexts on different scales. According to the invention, one CBAM is added for each branch, so that the STPM can detect the high-resolution characteristic diagram to obtain finer details, and meanwhile, a wider background is captured on the low-resolution characteristic diagram to better know the environmental background of the target.
Further, to efficiently utilize global and local information so that the network can quickly and accurately focus on critical areas and targets, the present invention combines the CBAM- and STPM-processed features. The overall joint attention module therefore finally outputs these aggregated features, denoted Fj, and the calculation process is shown in equation (4):
Fj=λ1(α(CBAM(Fc))+β(STPM(Fs))) (4)
Where Fc and Fs represent the outputs of CBAM and STPM respectively; α and β both involve a 3×3 convolution, batch normalization and ReLU activation, while λ1 involves batch normalization and a 1×1 convolution; CBAM(·) represents the computation of the convolutional block attention module, and STPM(·) is the computation of the spatio-temporal pyramid module. These aggregated features are then used in subsequent operations so that the network can dynamically prioritize important areas and target objects. The invention obtains the aggregated feature Fjt in the template branch by joint attention computation, and the aggregated feature Fjs in the search branch by joint attention computation.
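Equation (4) can be sketched as follows, reusing the CBAM and STPM classes from the earlier sketches. The channel widths are illustrative; α, β and λ1 follow the composition stated in the text (3×3 convolution, batch normalization and ReLU for α and β; batch normalization and 1×1 convolution for λ1).

import torch.nn as nn

class JointAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.cbam = CBAM(channels)
        self.stpm = STPM(channels)
        self.alpha = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.beta = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                                  nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.lam = nn.Sequential(nn.BatchNorm2d(channels), nn.Conv2d(channels, channels, 1))

    def forward(self, f):
        # Fj = λ1( α(CBAM(F)) + β(STPM(F)) )
        return self.lam(self.alpha(self.cbam(f)) + self.beta(self.stpm(f)))

# Applied to both branches; the encoder input is their element-wise product:
#   F_cf = joint_attention(F_template) * joint_attention(F_search)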
As shown in fig. 6, multiplying Fjt and Fjs element by element yields a combined feature map Fcf, and the encoder of the present invention takes the combined feature map Fcf of two adjacent frames as input. To prevent duplicate computation, the extracted features of the current frame are temporarily stored and reused in the next frame. The decoder is parallelized into two branches. The template feature query qtt is input into the second decoder of the Transformer to generate tracking boxes, while a learnable feature query qlo is input into the first decoder of the Transformer to generate detection boxes. A learnable feature query is a set of learnable parameters that is trained together with all other parameters in the network.
In one embodiment, the encoder comprises, connected in sequence, a self-attention layer, a residual and normalization layer, a position encoding generation module, a second self-attention layer, and a second residual and normalization layer; in S400, the first aggregated feature and the second aggregated feature are fused through element-wise multiplication and encoded to obtain the encoded feature map, specifically:
The encoder takes as input Q, K and V obtained by linear transformations of the combined feature map Fcf, obtains a set of attention weights through a Softmax operation, and multiplies the weights with V to obtain attention-weighted vectors. The specific calculation process is:
Attention(Q,K,V)=Softmax(QK^T/√dk)V=A(K→Q)V (6)
wherein Q, K and V represent the query vector, key vector and value vector respectively, dk is the key vector dimension, and A(K→Q) is the similarity matrix between K and Q;
Using a single-head attention mechanism, Q, K and V are fed into the self-attention layer; the relevance of each query to all keys is computed through linear transformations and the attention mechanism, and the values are weighted and summed with these relevances as weights to generate the final output representation H1, specifically:
H1=Ins.Norm(Attention(Q1,K1,V1)+Fcf) (7)
Wherein Fcf is the fused feature obtained by element-wise multiplication of the first and second aggregated features Fjt and Fjs, Q1, K1 and V1 are linear transformations of the fused feature, Attention(·) is the attention computation given in equation (6), and Ins.Norm(·) denotes L2 normalization over all image-patch embeddings;
H1 is input to the position encoding generation module; Q2, K2 and V2 are obtained through a series of linear transformations and fed into the second self-attention layer, followed by residual connection and normalization, to obtain H2:
H2=Ins.Norm(Attention(Q2,K2,V2)+PEGT) (8)
As shown in fig. 7, the input sequence S1 is reshaped into three-dimensional image space as S1′ and processed by convolution. Subsequently, the present invention iteratively applies block (grouped) convolution to local segments within S1′ to generate a position code P. P and S1′ are added to form a small residual block, activated by the GeLU function, and recombined into the form of the input sequence. Equation (5) represents the PEGT calculation process, where λ2 represents the conversion of the two-dimensional sequence into a three-dimensional tensor and its inverse represents the conversion of the three-dimensional tensor back into a two-dimensional sequence. PEGT can be effectively implemented by a zero-padded two-dimensional convolution with kernel size k and padding (k-1)/2, with k ≥ 3.
In the face of occlusion, the use of fixed-length position encoding may affect the generalization ability of the model. To solve this problem, the present invention adopts a simple and efficient method that uses zero padding to increase the flexibility of capturing relative position information between tokens. Grouped convolution is introduced to process variable-size input data: the invention applies grouped convolution to the input sequence, ensuring that each convolution group is responsible for handling a specific range of relative positional relationships, thereby handling variable-length inputs and occlusion scenes more efficiently. This not only helps manage the uncertainty introduced by occlusion, but is also more efficient in terms of computation and parameter count than conventional methods. A residual connection is then introduced to preserve key details: it is applied between S1′ and P, allowing information to be transferred more directly within the network, which helps improve gradient flow during training of the deep network. By preserving low-level features while handling occlusion, the residual connection enables the model to better capture the local structure of the target and enhances robustness to occlusion. Furthermore, the nonlinear GeLU activation function helps the model learn more complex patterns, especially where occlusion is involved; it enables the model to capture nonlinear relationships in the data, enhancing its adaptability to complex occlusion scenes.
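A minimal sketch of the position encoding generator described above follows. The kernel size k and the number of convolution groups are illustrative assumptions (the channel dimension must be divisible by the group count); the patent requires only zero padding of (k-1)/2 and grouped convolution with a residual connection and GeLU activation.

import torch
import torch.nn as nn

class PositionEncodingGenerator(nn.Module):
    def __init__(self, dim, k=3, groups=8):
        super().__init__()
        # Zero padding of (k-1)//2 keeps the spatial size; groups restrict each filter
        # to a channel subset so every group handles a specific range of relative positions
        self.conv = nn.Conv2d(dim, dim, kernel_size=k, padding=(k - 1) // 2, groups=groups)
        self.act = nn.GELU()

    def forward(self, tokens, h, w):
        # tokens: (batch, h*w, dim) sequence produced by the encoder self-attention layer
        n, l, d = tokens.shape
        s1 = tokens.transpose(1, 2).reshape(n, d, h, w)    # λ2: 2-D sequence -> 3-D tensor S1'
        out = self.act(s1 + self.conv(s1))                 # residual position code P, then GeLU
        return out.reshape(n, d, l).transpose(1, 2)        # back to a 2-D sequence (PEGT)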
In one embodiment, the first decoder includes a self-attention layer, a first residual and normalization layer, a cross-attention layer, a second residual and normalization layer, a feed-forward neural network layer, and a third residual and normalization layer; in S400, the first decoder decodes the encoded feature map together with the learned target query to obtain the object features, specifically:
The first decoder feeds the learned target query into the self-attention layer for self-attention calculation and residual connection and normalization to obtain H 3:
H3=Ins.Norm(Attention(Q3,K3,V3)+qlo) (9)
Wherein Q3, K3 and V3 are obtained by linear transformations of the learned target query qlo;
Q4, K4 and V4 are jointly fed into the cross-attention module to calculate a temporal cross-attention matrix, which is then residual-connected and normalized to obtain Hc1:
Hc1=Ins.Norm(Attention(Q4,K4,V4)+H3) (10)
Wherein Q4 is obtained by a linear transformation of H3, and V4 and K4 are obtained by linear transformations of H2. Finally, Hc1 is fed into the feed-forward network, followed by residual connection and normalization, to obtain Fd, as given in equation (11):
Fd=Ins.Norm(FFN(Hc1)+Hc1) (11)
In one embodiment, the second decoder includes a self-attention layer, a first residual and normalization layer, a cross-attention layer, a second residual and normalization layer, a feed-forward neural network layer, and a third residual and normalization layer; in S400, the second decoder decodes the encoded feature map together with the template feature query to obtain the tracking features, specifically:
The second decoder takes the template feature query qtt as input; after self-attention computation followed by residual connection and layer normalization, H4 is obtained:
H4=Ins.Norm(Attention(Q5,K5,V5)+qtt) (12)
Wherein Q5, K5 and V5 are obtained by linear transformations of the template feature query qtt;
Q6, obtained by a mapping of H4, participates in a cross-attention computation with K6 and V6, which are mapped from the encoded feature H2; the result is then residual-connected with H4 and normalized to obtain Hc2:
Hc2=Ins.Norm(Attention(Q6,K6,V6)+H4) (13)
In other words, Hc2 is generated by a cross-attention computation between K6 and V6, obtained from the encoder feature H2, and Q6, obtained from the feature H4 in the second decoder;
Finally, the feature Hc2 is fed into the feed-forward network, followed by residual connection and normalization, to obtain Ft, calculated as follows:
Ft=Ins.Norm(FFN(Hc2)+Hc2) (14)
Where FFN(·) denotes the feed-forward neural network computation, involving linear transformations and an activation function.
Specifically, the object feature Fd obtained by decoder 1 is processed by a feed-forward neural network to generate detection boxes. At the same time, the tracking feature Ft obtained by decoder 2 is processed by the feed-forward neural network to generate tracking boxes. The detection boxes are then associated with the tracking boxes using a box-IoU matching method to obtain the final tracking result. The Kuhn-Munkres (KM) algorithm is used to evaluate the IoU similarity between the detection and tracking boxes, facilitating the matching process.
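The box-IoU association step can be sketched as follows, using SciPy's linear_sum_assignment as a standard implementation of the Kuhn-Munkres assignment. The IoU acceptance threshold is an assumed value, not one specified in the patent.

import numpy as np
from scipy.optimize import linear_sum_assignment

def box_iou(a, b):
    # a, b: [x1, y1, x2, y2]
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(det_boxes, trk_boxes, iou_threshold=0.3):
    # Match detection boxes to tracking boxes by maximizing total IoU (KM assignment)
    if len(det_boxes) == 0 or len(trk_boxes) == 0:
        return []
    iou = np.array([[box_iou(d, t) for t in trk_boxes] for d in det_boxes])
    rows, cols = linear_sum_assignment(-iou)               # negate to maximize IoU
    return [(r, c) for r, c in zip(rows, cols) if iou[r, c] >= iou_threshold]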
In one embodiment, the confidence branch includes a three-layer perceptron and a sigmoid function, and the first decoder in S400, combined with the confidence branch, updates the target tracking model in real time based on the dynamic template update strategy, specifically:
A confidence score cj (j=1,2,…,n) is obtained through the confidence branch for each target that generates a detection box, and the individual confidences of all targets are fused through a weighted average to obtain a global confidence score.
When the global confidence score is higher than the set threshold δ, no occlusion is indicated and the target tracking model starts to be updated; otherwise, the target tracking model is updated again after τ frames.
In particular, in actual complex scenes, a scheduled template update may not always be reasonable. For example, performing a template update when occlusion occurs may reduce tracking accuracy. Accordingly, the present invention proposes a dynamic template update strategy (DTUS) for template updates, so that at low confidence scores the algorithm minimizes updates and reduces inaccuracy in the template. DTUS balances real-time performance and stability requirements to ensure efficient tracking in occlusion scenes. The overall flow of this strategy is shown in fig. 8.
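A minimal sketch of the DTUS decision rule follows. Uniform fusion weights are an assumption (the text states only a weighted average); δ=0.5 and τ=40 follow the hyper-parameter study reported in Table 3 below.

def dynamic_template_update(confidences, weights=None, delta=0.5, tau=40):
    # Fuse the per-target confidences c_j into a global score; update the template now
    # if it exceeds delta (no occlusion assumed), otherwise defer the update by tau frames.
    if weights is None:
        weights = [1.0] * len(confidences)      # plain average as the default fusion
    global_conf = sum(w * c for w, c in zip(weights, confidences)) / sum(weights)
    return 0 if global_conf > delta else tau    # number of frames to wait before updating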
The experiments of the invention were carried out on a deep learning platform. The GPU of the platform is an NVIDIA GeForce RTX 3090. The software environment is the Ubuntu 20.04 LTS system, with Python 3.7, CUDA 11.3 and PyTorch 1.10. For feature extraction, the invention uses the lightweight ShuffleNet V2 network, which can extract multi-scale feature maps. As part of preprocessing, the invention resizes the images to 224×224. The model was trained for 100 epochs using an Adam optimizer with a learning rate of 1e-4 and a batch size of 16. Fig. 9 shows the loss convergence curve of the training model.
Table 1 backbone network ablation experiment results
Comparative experiments were performed on the MOT17 dataset to evaluate the effectiveness of the improved ShuffleNet V2 network; the experimental results are shown in table 1. The ShuffleNet V2 network combined with the joint attention module is referred to herein as ShuffleNetX, and the experiments confirm the effectiveness and real-time performance of ShuffleNetX. As shown in table 2, the ShuffleNetX network exhibits superior performance in terms of success rate and lightweight design compared with ResNet-50. This improvement is due to the joint attention module, which provides a more robust feature representation for the target tracking task.
Table 2 comparison of the effects of the modules
A detailed performance analysis was performed for each model, with emphasis on ResNet-50, ShuffleNetX, DTUS, EC-DC (the conventional encoder-decoder architecture) and the SFTrans architecture. The specific ablation results are shown in table 2, where a check mark indicates that the corresponding strategy was used and its absence indicates that it was not. It can be seen that the lightweight design of ShuffleNetX increases the speed of the model, and the template update module enhances the real-time adaptability of the model, particularly in handling spatio-temporal variations. The introduction of the SFTrans structure enhances the model's ability to model spatio-temporal relationships. Model 5 integrates all of the proposed modules, achieves the best performance, and highlights the key role of these modules in multi-target tracking.
The invention performed a series of experiments to understand more intuitively the effectiveness of the joint attention module in improving detection accuracy. In fig. 10 (d), the joint attention mechanism not only focuses accurately but also generates a specific region of interest for each target, represented by an orange region. The warmth of the colors on the heat map indicates the level of attention to the region of interest; the darkening of these colors suggests that the joint attention mechanism effectively focuses on the relevant regions, enhancing its suitability for the target tracking model.
In order to balance speed and accuracy and to investigate how variations of the update interval τ and the global confidence parameter δ affect the robustness of the algorithm, the present invention performed a template update experiment on the MOT17 dataset; the results are shown in table 3. Through testing, the invention found that update intervals τ of 30, 40 and 50 frames all produced satisfactory results, and the parameter setting with τ equal to 40 frames and δ equal to 0.5 gave the greatest improvement in algorithm accuracy. This suggests that this configuration favors a balance between accuracy and real-time performance in the target tracking task. Therefore, the combination of τ equal to 40 frames and δ equal to 0.5 was finally selected as the optimal configuration.
Table 3 results corresponding to different super parameters in template update
The invention performed an occlusion experiment on a sidewalk beside a garden; the result is shown in fig. 11. This verifies the effectiveness of the DTUS and PEGM proposed by the present invention in handling occlusion. The algorithm of the present invention without the dynamic template update strategy and PEGM is denoted LTJPA-, and the algorithm with these strategies is denoted LTJPA. In fig. 11 (a), after occlusion, the pedestrian with ID 43 in frame 366 becomes ID 39, and the pedestrian with ID 22 becomes ID 37. This shows that, without the dynamic template update strategy and PEGM, occlusion can lead to erroneous decisions and changes in target identity. In contrast, as shown in fig. 11 (b), the pedestrians with IDs 43 and 22 maintain their original identities after being occluded in frames 366 and 452, respectively. This shows that the LTJPA algorithm, which contains the dynamic template update strategy and PEGM, exhibits stronger robustness when handling occlusion and more effectively maintains the consistency of target identity.
The invention compares the proposed tracker with state-of-the-art algorithms on the MOT16, MOT17 and MOT20 test datasets and reports the tracking results. Apart from the present invention, FairMOT achieved better HOTA and MOTA results than the other algorithms on the MOT16, MOT17 and MOT20 datasets. The method of the invention not only reaches the highest accuracy, but also shows excellent performance in terms of frames per second, highlighting its real-time efficiency.
In order to comprehensively evaluate the performance of the proposed algorithm, the invention compares the performance with that of FairMOT algorithm in a real scene, and the experimental result is shown in fig. 12. This comparison highlights the robustness and effectiveness of the proposed algorithm in real scenes, especially in scenes with small objects and occlusions.
Compared with the prior art, the invention has the following beneficial effects:
1) A joint attention module is proposed and combined with the ShuffleNet V network, realizing a lightweight design that accurately and efficiently captures the key features of the target object. An STPM is proposed to capture feature information at different time scales and spatial resolutions; its combination with CBAM further enhances the adaptability of the ShuffleNet V network to various visual tasks.
2) SFTrans is designed based on the original Transformer structure. To improve the object recognition capability, a position code generator module is proposed: zero padding and grouped convolution are used to capture the relative position information between tokens, which is particularly helpful when occlusion occurs. This approach reduces the number of parameters and the computational requirements of the tracker (a minimal sketch of such a position code generator is given after this list).
3) A dynamic template update strategy is proposed. During occlusion, the confidence branch calculates a global confidence score. If the global confidence score is above the threshold, the target tracking model is updated immediately; otherwise, the update is postponed for τ frames, which improves the accuracy of the model.
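The following is a minimal, hedged sketch of a position code generator of the kind described in item 2): a grouped (depthwise) convolution with zero padding applied over the 2D token grid, so that the padding pattern injects relative position information. The class name, kernel size and the way the positional signal is added back are illustrative assumptions, not the patent's exact configuration.

```python
import torch
import torch.nn as nn

class PositionCodeGenerator(nn.Module):
    """Hypothetical PEG-style module: a 3x3 grouped convolution with zero
    padding over the 2D token grid generates conditional position encodings."""
    def __init__(self, dim: int):
        super().__init__()
        # groups=dim makes this a depthwise (grouped) convolution; padding=1
        # supplies the zeros from which relative positions can be inferred
        self.proj = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (batch, h*w, dim) -> reshape to a 2D map, convolve, add back
        b, n, c = tokens.shape
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)
        pos = self.proj(feat).flatten(2).transpose(1, 2)
        return tokens + pos  # tokens enriched with relative position information
```

In this sketch the positional signal is added back to the tokens before the next self-attention layer, mirroring the way the position code generation module sits between the two self-attention layers of the encoder described in claim 7 below.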
The attention mechanism-based lightweight multi-target tracking method has been described in detail above. The principles and embodiments of the present invention have been described herein with reference to specific examples, and this description is intended only to facilitate an understanding of the core concepts of the invention. It should be noted that various modifications and adaptations of the invention will be apparent to those skilled in the art, and such modifications and adaptations, made without departing from the principles of the invention, are intended to fall within the scope of the invention as defined by the following claims.

Claims (10)

1. A method for lightweight multi-objective tracking based on an attention mechanism, the method comprising the steps of:
S100: acquiring images containing multiple targets, and respectively inputting the images into a template branch and a search branch for preprocessing to obtain a first image and a second image;
S200: respectively inputting the first image and the second image into a corresponding ShuffleNet V lightweight network to extract the attention-rich target features, and generating a feature map with a specific size to obtain a first feature map and a second feature map;
s300: respectively inputting the first feature map and the second feature map to corresponding joint attention modules to capture long-distance space-time relations, so as to obtain a first aggregation feature and a second aggregation feature;
S400: the first aggregation feature and the second aggregation feature are fused through element multiplication to be encoded, and an encoded feature diagram is obtained and is used as a key of cross attention in a first decoder and a second decoder which are parallel; the first decoder decodes according to the coded feature map and the learned target query to obtain object features, combines confidence branches, updates the target tracking model in real time based on a dynamic template updating strategy, and decodes according to the coded feature map and the template feature query to obtain tracking features;
S500: the object features obtained through the first decoder are processed by a feedforward neural network to generate a detection frame, the tracking features obtained through the second decoder are processed by the feedforward neural network to generate a tracking frame, and the detection frame is associated with the tracking frame by a frame-IoU matching method to obtain the final tracking result.
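A minimal, hedged sketch of how the S100–S500 steps of claim 1 fit together for one frame is given below. Every module here is a placeholder callable (names such as backbone, joint_attention, det_decoder, det_head are illustrative assumptions, not the patent's identifiers); per-module sketches follow the later claims.

```python
def track_frame(template_img, search_img, backbone, joint_attention, encoder,
                det_decoder, trk_decoder, det_head, trk_head,
                target_query, template_query):
    # S200: the shared lightweight backbone extracts features from both branches
    f1, f2 = backbone(template_img), backbone(search_img)
    # S300: joint attention (CBAM + STPM) aggregates long-range spatio-temporal cues
    a1, a2 = joint_attention(f1), joint_attention(f2)
    # S400: element-wise multiplication fuses the branches; the encoder output
    # serves as the memory of two parallel decoders
    memory = encoder(a1 * a2)
    obj_feat = det_decoder(target_query, memory)     # detection path
    trk_feat = trk_decoder(template_query, memory)   # tracking path
    # S500: feed-forward heads produce detection and tracking frames for IoU matching
    return det_head(obj_feat), trk_head(trk_feat)
```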
2. The method of claim 1, wherein the ShuffleNet V lightweight network in S200 comprises a convolutional layer, a max pooling layer, a first extraction stage, a second extraction stage, and a third extraction stage, each extraction stage comprising a first block and a second block, the first extraction stage comprising 1 first block and 3 second blocks stacked, the second extraction stage comprising 1 first block and 7 second blocks stacked, the third extraction stage comprising 1 first block and 3 second blocks stacked;
In the first block, the input feature is subjected to a channel split operation that divides the feature channels into two branches, each with half of the original number of channels; the right branch undergoes a 1×1 convolution, then a 3×3 depth-wise convolution, then another 1×1 convolution, while the left branch is kept unchanged through an identity mapping; after concatenation and channel shuffling, the features of the two branches are effectively fused together;
in the second block, the input feature map is fed into two branches: the left branch comprises a 3×3 DWConv with a step size of 2 and a conventional 1×1 convolution operation, and the right branch undergoes a 1×1 convolution, then a 3×3 depth-wise convolution, then another 1×1 convolution; after channel concatenation and channel shuffling, the channel count of the output feature map is doubled; when the input feature is the first image, the output feature map is the first feature map, and when the input feature is the second image, the output feature map is the second feature map.
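A minimal illustrative sketch of the second (downsampling) block of claim 2 follows, in PyTorch-style Python. The stride of the right-branch depth-wise convolution, the batch normalization and activation placement are assumptions needed to make the two branches spatially compatible, not part of the claimed wording.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    # interleave channels from the two branches so information mixes between them
    b, c, h, w = x.shape
    return x.reshape(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class DownsampleBlock(nn.Module):
    """Sketch of the 'second block': two stride-2 branches, channel concatenation,
    channel shuffle; the output channel count is doubled relative to the input."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.left = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride=2, padding=1, groups=in_ch),  # 3x3 DWConv, step 2
            nn.BatchNorm2d(in_ch),
            nn.Conv2d(in_ch, in_ch, 1), nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        )
        self.right = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 1), nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, in_ch, 3, stride=2, padding=1, groups=in_ch),  # 3x3 DWConv
            nn.BatchNorm2d(in_ch),
            nn.Conv2d(in_ch, in_ch, 1), nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.cat([self.left(x), self.right(x)], dim=1)  # channels: in_ch -> 2*in_ch
        return channel_shuffle(out, groups=2)
```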
3. The method of claim 2, wherein the joint attention module in S300 comprises a CBAM module and an STPM module; the CBAM module comprises a channel attention module and a spatial attention module, the channel attention module is configured to perform channel attention enhancement on the input feature to obtain a channel attention feature, which is multiplied with the input feature and used as the input of the spatial attention module; the spatial attention module is configured to perform spatial attention enhancement on its input feature to obtain a spatial attention feature, which is multiplied with the input feature of the spatial attention module to obtain the output feature map F c; the STPM module comprises 4 parallel branches, a CBAM module is embedded in each branch, features of four different pyramid scales are seamlessly integrated, and the multi-scale attention feature F s is generated by fusing spatio-temporal attention contexts at different scales; finally, the CBAM-processed and STPM-processed features are combined and output to obtain the aggregated feature.
4. A method according to claim 3, characterized in that the channel attention module compresses the global spatial information contained in the input feature map F by global max pooling and global average pooling, generating two different one-dimensional feature maps F 1 and F 2; the two maps are passed through a shared multi-layer perceptron MLP, which comprises fully connected layers and a ReLU nonlinear activation function, and are summed element-wise across channels; the result is then normalized by a sigmoid function, yielding the channel-wise weight statistics Mc; Mc is multiplied with the input feature map F to obtain the output feature map F';
the procedure for calculating the channel attention feature Mc is specifically:
Wherein AvgPool refers to mean pooling; maxPool refers to maximum pooling; MLP is multi-layer perceptron calculation; f is an input feature map; h 1 and W 1 represent the height and width of the feature map, respectively; The value at position (i 0,j0) in the feature map.
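A hedged sketch of this channel attention branch follows; the reduction ratio of the shared MLP is an assumption for illustration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the channel attention branch: a shared MLP over the globally
    average-pooled and max-pooled descriptors, summed and passed through a sigmoid."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = f.shape
        avg = self.mlp(f.mean(dim=(2, 3)))    # MLP(AvgPool(F))
        mx = self.mlp(f.amax(dim=(2, 3)))     # MLP(MaxPool(F))
        mc = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return f * mc                          # F' = Mc ⊗ F
```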
5. The method according to claim 4, characterized in that the feature map F' output by the channel attention module is used as the input of the spatial attention module; channel-wise global max pooling and global average pooling yield the feature maps F 3∈R1×H1×W1 and F 4∈R1×H1×W1, which are subsequently concatenated to generate an effective feature descriptor F 5∈R2×H1×W1; a 7×7 convolution operation reduces F 5 to one channel, and a sigmoid operation generates the spatial attention feature Ms; multiplying the spatial attention feature Ms with the input feature map F' of the spatial attention module yields the final aggregated feature Fc;
the procedure for calculating the spatial attention feature Ms is specifically:
Ms(F′)=σ{f7×7[AvgPool(F′);MaxPool(F′)]}
Where f 7×7 denotes a convolution layer with a convolution kernel of 7 x 7.
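A matching sketch of the spatial attention branch is given below; it is an illustrative assumption-level implementation of equation Ms(F′), not the claimed code.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the spatial attention branch: channel-wise average and max maps
    are concatenated, reduced by a 7x7 convolution, and gated by a sigmoid."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f_prime: torch.Tensor) -> torch.Tensor:
        avg = f_prime.mean(dim=1, keepdim=True)     # F3: 1 x H1 x W1
        mx = f_prime.amax(dim=1, keepdim=True)      # F4: 1 x H1 x W1
        ms = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # Ms(F')
        return f_prime * ms                          # Fc = Ms ⊗ F'
```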
6. The method of claim 5, wherein in the STPM module, the input feature map F ∈ R C1×H1×W1 is processed in 4 parallel branches; each branch divides the input feature map into s×s sub-regions, where s ∈ S and S = {1, 2, 3, 4} denotes the four pyramid scales;
in the branch of scale s, each sub-region of the s×s grid is indexed by 1 ≤ i, j ≤ s; one CBAM is added to each branch, and the CBAM is applied to all sub-regions in the pyramid branch to generate updated residual feature maps;
the updated residual feature maps obtained at the four different scales are upsampled to obtain features Z s (1 ≤ s ≤ 4) having the same spatial dimensions as the initial feature map F;
the four feature maps Z s are concatenated and input into a convolution layer (C, 1×1/1) to generate the final residual feature map Y, which is summed element-wise with the input feature map F to obtain the updated feature map F s.
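A hedged sketch of the STPM idea described in claim 6 follows: the feature map is split into an s×s grid at four pyramid scales, a CBAM-style attention block is applied per sub-region, the per-scale results are brought back to the input resolution, concatenated, fused by a 1×1 convolution and added to the input. A single shared CBAM callable is used here for brevity, and the interpolation mode is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STPM(nn.Module):
    """Sketch of the spatio-temporal pyramid module of claim 6."""
    def __init__(self, channels: int, cbam: nn.Module, scales=(1, 2, 3, 4)):
        super().__init__()
        self.scales = scales
        self.cbam = cbam                      # shared CBAM-style block (assumption)
        self.fuse = nn.Conv2d(channels * len(scales), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, _, h, w = x.shape
        branch_outputs = []
        for s in self.scales:
            rows = []
            for i in range(s):
                cols = []
                for j in range(s):
                    # slice sub-region (i, j) of the s x s grid and apply attention to it
                    sub = x[:, :, i * h // s:(i + 1) * h // s, j * w // s:(j + 1) * w // s]
                    cols.append(self.cbam(sub))
                rows.append(torch.cat(cols, dim=3))
            z_s = torch.cat(rows, dim=2)
            # resize defensively so every branch matches the initial spatial size
            branch_outputs.append(F.interpolate(z_s, size=(h, w), mode="bilinear",
                                                align_corners=False))
        y = self.fuse(torch.cat(branch_outputs, dim=1))   # final residual feature map Y
        return x + y                                       # element-wise sum with the input
```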
7. The method of claim 6, wherein the encoder comprises a self-attention layer, a residual and normalization layer, a position code generation module, a self-attention layer, and a residual and normalization layer connected in sequence; in S400, the first aggregation feature and the second aggregation feature are fused and encoded through element-by-element multiplication, and an encoded feature map is obtained, specifically:
The encoder takes as input Q, K and V obtained by linear transformation of the fused feature F cf, obtains a set of attention weights through a Softmax operation, and multiplies the weights with V to obtain attention-weighted vectors; the specific calculation process is as follows:
Attention(Q, K, V) = A K→Q·V, with A K→Q = Softmax(QK T/√d k)   (6)
wherein Q, K, V represent the query vector, the keyword vector and the value vector, respectively; d k is the key vector dimension; A K→Q is the similarity matrix between Q and K;
Using a single head attention mechanism, Q, K and V are input into the self-attention layer, the relevance of each query to all keys is calculated by a linear transformation and attention mechanism, and the relevance is weighted and summed as a weight value to generate the final output representation H 1, specifically:
H1=Ins.Norm(Attention(Q1,K1,V1)+Fcf) (7)
Wherein F cf is the fusion feature obtained by element-wise multiplication of the first and second aggregated features F jt and F js; Q 1, K 1 and V 1 are obtained by linear transformation of the fusion feature; Attention(·) is the attention calculation given in equation (6); Ins.Norm(·) denotes L2 normalization over all the embeddings of the image block;
H 1 is input to a position code generation module, Q 2、K2、V2 is obtained through a series of linear transformation calculations, and then the residual connection and normalization are performed after the input to a self-attention layer, so as to obtain H 2:
H2=Ins.Norm(Attention(Q2,K2,V2)+PEGT) (8)
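A hedged sketch of the claim-7 encoder path follows: self-attention with residual connection and normalization, the position code generation module, a second self-attention, and another residual connection and normalization (equations (6)–(8)). Interpreting Ins.Norm as token-wise L2 normalization, and the single linear layer producing Q, K, V jointly, are assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # equation (6): A_{K->Q} = Softmax(Q K^T / sqrt(d_k)); output = A_{K->Q} V
    a = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(k.size(-1)), dim=-1)
    return a @ v

class EncoderLayer(nn.Module):
    """Sketch of the encoder: self-attention, residual + normalization,
    a position code generator, a second self-attention, residual + normalization."""
    def __init__(self, dim: int, peg: nn.Module):
        super().__init__()
        self.qkv1 = nn.Linear(dim, 3 * dim)
        self.qkv2 = nn.Linear(dim, 3 * dim)
        self.peg = peg  # position code generation module (see the earlier sketch)

    def forward(self, f_cf: torch.Tensor, h: int, w: int) -> torch.Tensor:
        q1, k1, v1 = self.qkv1(f_cf).chunk(3, dim=-1)
        h1 = F.normalize(attention(q1, k1, v1) + f_cf, dim=-1)   # eq. (7), Ins.Norm as L2 norm
        peg_out = self.peg(h1, h, w)                              # PEG over the token grid
        q2, k2, v2 = self.qkv2(peg_out).chunk(3, dim=-1)
        h2 = F.normalize(attention(q2, k2, v2) + peg_out, dim=-1) # eq. (8)
        return h2
```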
8. The method of claim 7, wherein the first decoder comprises a self-attention layer, a first residual and normalization layer, a cross-attention layer, a second residual and normalization layer, a feed-forward neural network layer, and a third residual and normalization layer; in S400, the first decoder decodes according to the encoded feature map and the learned target query to obtain the object feature, which specifically is:
The first decoder feeds the learned target query into the self-attention layer for self-attention calculation and residual connection and normalization to obtain H 3:
H3=Ins.Norm(Attention(Q3,K3,V3)+qlo) (9)
Wherein Q 3, K 3 and V 3 are obtained by linear transformation of the learned target query q lo;
Q 4, K 4 and V 4 are jointly fed into the cross-attention module, where a temporal cross-attention matrix is calculated and then residual-connected and normalized to obtain H c1:
Hc1=Ins.Norm(Attention(Q4,K4,V4)+H3) (10)
Wherein Q 4 is obtained by linear transformation of H 3, and V 4 and K 4 are obtained by linear transformation of H 2; finally, H c1 is input to the feed-forward network and then residual-connected and normalized to obtain F d, as given by formula (11):
Fd=Ins.Norm(FFN(Hc1)+Hc1) (11)
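A hedged sketch of this decoder path follows: self-attention over the query tokens, cross-attention against the encoder output H 2, and a feed-forward network, each followed by residual connection and normalization (equations (9)–(11)). The second decoder of claim 9 has the same structure with the template feature query as input; the hidden width and L2-style normalization are assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention(q, k, v):
    # scaled dot-product attention, as in equation (6)
    return F.softmax(q @ k.transpose(-2, -1) / math.sqrt(k.size(-1)), dim=-1) @ v

class DecoderBranch(nn.Module):
    """Sketch of the first/second decoder: self-attention, cross-attention with
    the encoder memory H2, then a feed-forward network."""
    def __init__(self, dim: int, hidden: int = 1024):
        super().__init__()
        self.q_self = nn.Linear(dim, 3 * dim)     # Q3, K3, V3 from the query tokens
        self.q_cross = nn.Linear(dim, dim)         # Q4 from H3
        self.kv_cross = nn.Linear(dim, 2 * dim)    # K4, V4 from the encoder output H2
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(inplace=True),
                                 nn.Linear(hidden, dim))

    def forward(self, query: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        q3, k3, v3 = self.q_self(query).chunk(3, dim=-1)
        h3 = F.normalize(attention(q3, k3, v3) + query, dim=-1)    # eq. (9)
        q4 = self.q_cross(h3)
        k4, v4 = self.kv_cross(memory).chunk(2, dim=-1)
        hc1 = F.normalize(attention(q4, k4, v4) + h3, dim=-1)      # eq. (10)
        return F.normalize(self.ffn(hc1) + hc1, dim=-1)            # eq. (11): F_d
```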
9. The method of claim 8, wherein the second decoder comprises a self-attention layer, a first residual and normalization layer, a cross-attention layer, a second residual and normalization layer, a feed-forward neural network layer, and a third residual and normalization layer; in S400, the second decoder decodes the encoded feature map and the template feature query to obtain tracking features, which specifically includes:
the second decoder takes the template feature query q tt as input; after self-attention computation followed by residual connection and layer normalization, H 4 is obtained:
H4=Ins.Norm(Attention(Q5,K5,V5)+qtt) (12)
wherein Q 5, K 5 and V 5 are obtained by linear transformation of the template feature query q tt;
the Q 6 obtained by mapping H 4 performs cross-attention calculation with the K 6 and V 6 mapped from the encoded feature H 2, and the result is residual-connected with H 4 and normalized to obtain H c2:
Hc2=Ins.Norm(Attention(Q6,K6,V6)+H4) (13)
Wherein the process generates H c2 by performing a cross-attention computation from K 6 and V 6 obtained from the mapping of the characteristic H 2 of the encoder and Q 6 obtained from the mapping of the characteristic H 4 in the second decoder;
finally, the feature H c2 is input into the feed-forward network and then residual-connected and normalized to obtain F t; the calculation is as follows:
Ft=Ins.Norm(FFN(Hc2)+Hc2) (14)
FFN(·) refers to the computation of a feedforward neural network, involving a linear transformation and an activation function.
10. The method of claim 8, wherein the confidence branch comprises a three-layer perceptron and a sigmoid function, and wherein the first decoder combines the confidence branch in S400 to update the target tracking model in real time based on the dynamic template update strategy, specifically:
obtaining a confidence score c j of each target that generates a detection frame through the confidence branch, where j = 1, 2, …, n, and fusing the individual confidences of the targets by weighted averaging to obtain the global confidence score:
c global = Σ_(j=1…n) w j·c j, with Σ_(j=1…n) w j = 1
where w j is the fusion weight of target j;
when the global confidence score is higher than the set threshold δ, no occlusion is indicated and the target tracking model is updated immediately; otherwise, the target tracking model is updated after τ frames.
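A hedged sketch of the claim-10 logic follows: a three-layer perceptron with a sigmoid produces per-target confidence scores, these are fused into a global score, and the template is updated immediately if the score exceeds δ or only after τ frames otherwise. The equal fusion weights, the threshold of 0.5 and the interval of 40 frames echo the experimental settings reported above but are otherwise assumptions.

```python
import torch
import torch.nn as nn

class ConfidenceBranch(nn.Module):
    """Three-layer perceptron followed by a sigmoid, giving one score per target."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(inplace=True),
                                 nn.Linear(dim, dim), nn.ReLU(inplace=True),
                                 nn.Linear(dim, 1))

    def forward(self, object_features: torch.Tensor) -> torch.Tensor:
        # object_features: (n_targets, dim) -> per-target confidence scores c_j
        return torch.sigmoid(self.mlp(object_features)).squeeze(-1)

def should_update_template(scores: torch.Tensor, frames_since_update: int,
                           delta: float = 0.5, tau: int = 40) -> bool:
    """Dynamic template update decision: update now if the global (weighted-average)
    confidence exceeds delta, otherwise only after tau frames have elapsed."""
    weights = torch.ones_like(scores) / scores.numel()   # equal fusion weights (assumption)
    c_global = float((weights * scores).sum())
    return c_global > delta or frames_since_update >= tau
```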
CN202410367793.4A 2024-03-28 2024-03-28 Attention mechanism-based lightweight multi-target tracking method Pending CN118172387A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410367793.4A CN118172387A (en) 2024-03-28 2024-03-28 Attention mechanism-based lightweight multi-target tracking method

Publications (1)

Publication Number Publication Date
CN118172387A true CN118172387A (en) 2024-06-11

Family

ID=91352356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410367793.4A Pending CN118172387A (en) 2024-03-28 2024-03-28 Attention mechanism-based lightweight multi-target tracking method

Country Status (1)

Country Link
CN (1) CN118172387A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116309725A (en) * 2023-03-30 2023-06-23 中国矿业大学 Multi-target tracking method based on multi-scale deformable attention mechanism
CN116703980A (en) * 2023-08-04 2023-09-05 南昌工程学院 Target tracking method and system based on pyramid pooling transducer backbone network
CN116862949A (en) * 2023-06-21 2023-10-10 长沙理工大学 Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement
CN117058456A (en) * 2023-08-22 2023-11-14 中国科学院长春光学精密机械与物理研究所 Visual target tracking method based on multiphase attention mechanism
WO2023216572A1 (en) * 2022-05-07 2023-11-16 深圳先进技术研究院 Cross-video target tracking method and system, and electronic device and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination