CN116862949A - Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement - Google Patents

Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement

Info

Publication number
CN116862949A
CN116862949A (application CN202310742715.3A)
Authority
CN
China
Prior art keywords
search area
template
feature
feature vector
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310742715.3A
Other languages
Chinese (zh)
Inventor
张建明
陈文韬
何宇凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha University of Science and Technology
Original Assignee
Changsha University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha University of Science and Technology filed Critical Changsha University of Science and Technology
Priority to CN202310742715.3A priority Critical patent/CN116862949A/en
Publication of CN116862949A publication Critical patent/CN116862949A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Abstract

The invention discloses a Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement. The method comprises the following steps: the template image and the search area image are input into a backbone network as a pair of images, and feature maps of the template and the search area are extracted respectively; the template and search area feature maps are sent to a feature fusion network, which comprises an encoder network, a decoder network and a position information enhancement module connected in sequence, and the enhanced final feature vector of the search area is obtained through fusion and enhancement; the enhanced final feature vector of the search area is input into a prediction head, and a target tracking result is generated through a classification function and a bounding box regression function. The invention addresses the problem that the accuracy and tracking speed of existing target tracking still need to be improved.

Description

Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement
Technical Field
The invention relates to the technical field of visual target tracking, and in particular to a Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement.
Background
Estimating the trajectory of an object in a video sequence, known as visual object tracking, is a fundamental and challenging task in computer vision. Given the position of the object in the initial frame of a video, the tracker must predict the position and state of the object in every subsequent frame; it therefore has to extract the correct features of the object from the first frame in order to locate it in later frames. Target tracking is now widely applied in fields such as unmanned aerial vehicles, autonomous driving and surveillance, and this breadth of applications has drawn increasing attention to the area. The main challenges in tracking are illumination changes, deformation, occlusion, background clutter and interference from similar targets. Many of these problems remain unsolved, and although much research has addressed them in recent years, designing a tracker that is both accurate and runs in real time remains a challenge.
In the prior art, tracking structures based on the twin (Siamese) network have received wide attention for their performance and simplicity and have become very popular in visual target tracking, while also achieving very good results. Under the Siamese tracking framework, visual target tracking is formulated as a template matching problem: in short, tracking searches for the region most similar to the target template, making full use of the cross-correlation operation between the template and the search area to locate the target object accurately in the search area. Most mainstream tracker frameworks use a backbone network to extract features of the template and search area images and then use a correlation structure to compute the similarity between the template and the search area, such as ATOM; Siamese trackers achieve excellent tracking performance, especially with respect to the balance of accuracy and speed.
The Transformer was first proposed in natural language processing for machine translation. By allowing each element to attend to all other elements, it improves the learning of long-range dependencies, quickly replaced LSTM models in neural machine translation, and has become the dominant architecture for language modelling, achieving remarkable success even in large-scale pre-trained models such as the well-known GPT. Based on the attention mechanism, it uses a combination of encoders and decoders to convert one sequence into another, and the attention mechanism allows global information to be captured when processing language sequences. In sequential tasks such as natural language processing the Transformer has essentially replaced recurrent neural networks, and it has also shone in computer vision, for example in image classification, object detection, semantic segmentation and multi-object tracking. Currently there are two main ways in which single-target tracking applies the Transformer. The first uses a convolutional-neural-network backbone to extract the features of the template and the search area, and then a Transformer-based feature fusion structure to deeply fuse the two sets of features so that the bounding box of the target object can be predicted better, as in TransT. The second performs no convolutional backbone feature extraction at all and instead uses a Transformer structure directly for both feature extraction and fusion, making full use of the attention mechanism and yielding a more compact overall structure, as in MixFormer.
However, Siamese-based trackers have certain drawbacks because they rely on the cross-correlation operation: cross-correlation tends to fall into local optima and lacks access to global information, which can affect the prediction of the final target bounding box and lead to inaccurate results. Moreover, the original Transformer is computationally intensive, which limits tracking speed.
Disclosure of Invention
(I) Technical problem to be solved
Based on the above problems, the invention provides a Transformer target tracking method based on symmetrical cross attention and position information enhancement, and a corresponding tracker, which solve the problem that the accuracy and tracking speed of existing target tracking still need to be improved.
(II) technical scheme
To solve the above technical problems, the invention provides a Transformer target tracking method based on symmetrical cross attention and position information enhancement, which comprises the following steps:
S1, inputting a template image and a search area image into a backbone network as a pair of images, and extracting feature maps of the template and the search area respectively;
S2, sending the template and search area feature maps into a feature fusion network, wherein the feature fusion network comprises an encoder network, a decoder network and a position information enhancement module connected in sequence, and obtaining the enhanced final feature vector of the search area through fusion and enhancement;
S3, inputting the enhanced final feature vector of the search area into a prediction head, and generating a target tracking result through a classification function and a bounding box regression function.
Further, the step S2 includes:
S20, preprocessing the template and search area feature maps and converting them into template and search area feature vectors;
S21, inputting the template and search area feature vectors into an encoder network, wherein the encoder network uses several layers of symmetrical cross attention modules to fuse the template and search area features, obtaining the fused template and search area feature vectors respectively;
S22, further fusing the fused template and search area feature vectors through a decoder network to obtain the final feature vector of the search area;
S23, inputting the final feature vector of the search area into a position information enhancement module for enhancement, obtaining the enhanced final feature vector of the search area.
Further, the step S20 includes: adding sine position codes to the template and search area feature maps, reducing the channel dimension by using a 1×1 convolution, and flattening the feature maps along the spatial dimension to obtain the template and search area feature vectors respectively.
Further, in the step S21, the symmetrical cross-attention module includes:
inputting the template and search area feature vectors, together with the position codes corresponding to the template and the search area, into a multi-head cross attention module to obtain a preliminarily fused feature vector X_t:
X_t = MultiHead(X_zq + P_zq, X_xk + P_xk, X_xv),
MultiHead(Q, K, V) = Concat(H_1, ..., H_h) W^O;
inputting the template feature vector and its position code into a multi-head cross attention module to be re-fused with the preliminarily fused feature vector and its position code, and adding the result to the template feature vector to obtain a fused template feature vector X_z; X_z is then passed through a feed-forward network FFN and added to X_z to obtain the fused template feature vector output by the module:
X_z = X_zq + MultiHead(X_zq + P_zq, X_tk + P_tk, X_tv);
meanwhile, inputting the search area feature vector and its position code into a multi-head cross attention module to be re-fused in the same way, and adding the result to the search area feature vector to obtain a fused search area feature vector X_x; X_x is then passed through a feed-forward network FFN and added to X_x to obtain the fused search area feature vector output by the module:
X_x = X_xq + MultiHead(X_xq + P_xq, X_tk + P_tk, X_tv),
where W^O and the other weight matrices are parameter matrices, X_zq is the input of the template branch, P_zq is the position code corresponding to the template branch, X_xk and X_xv are the inputs of the search area branch, P_xk is the position code corresponding to the search area branch, X_tk and X_tv are the inputs of the preliminarily fused feature vector branch, and P_tk is the position code corresponding to the preliminarily fused feature vector.
Further, in the step S22, the decoder network includes: inputting the fused template and search area feature vectors into a multi-head cross attention module for re-fusion, then passing the result through a feed-forward network FFN with an activation function inserted, applying residual connections around the multi-head cross attention module and the FFN, and outputting the final feature vector of the search area after norm normalization.
Further, in the step S23, the position information enhancement module includes:
inputting the final feature vector of the search area and reshaping it to obtain the final feature map X of the search area, with C channels, height H and width W;
applying one-dimensional horizontal global pooling and one-dimensional vertical global pooling to each channel of the feature map to obtain a horizontal intermediate feature map p_h and a vertical intermediate feature map p_w;
applying a permute transformation to the vertical intermediate feature map, concatenating it with the horizontal intermediate feature map along the spatial dimension, applying a 1×1 convolution followed by BatchNorm batch normalization to obtain T, and multiplying T by the feature map obtained by applying the Relu linear rectification function to T, yielding an intermediate feature map f that encodes the spatial information in the horizontal and vertical directions:
T = Conv(Concat(p_h, p_w)),
f_t = φ(T),
f = T × f_t;
splitting the intermediate feature map f along the spatial dimension into an independent height tensor f_h and width tensor f_w, applying a 1×1 convolution and a sigmoid activation function to each to obtain the height weight S_h and the width weight S_w, multiplying them with the reshaped input of the position information enhancement module, outputting the enhanced final feature map Y of the search area, and outputting the enhanced final feature vector of the search area after reshaping:
S_h = τ(Conv(f_h)),
S_w = τ(Conv(f_w)),
Y_c(i, j) = X_c(i, j) × S_h,c(i) × S_w,c(j),
wherein the subscript c denotes the feature map of the c-th channel, i and j are the horizontal and vertical coordinates, τ(·) denotes the sigmoid activation function, and φ(·) denotes the Relu linear rectification function.
Further, in the step S3, the classification network uses a binary cross entropy loss function, and the bounding box regression network uses L1 loss and IOU loss functions.
Further, in the step S1, the template image and the search area image are preprocessed: the image patch of the template image is obtained by expanding the target bounding box of the first frame of the video sequence outwards to twice its side length; the image patch of the search area image is obtained by expanding the target bounding box of the previous frame outwards to four times its side length. The backbone network is a modified ResNet50 in which the last stage and the fully connected layer are removed, the downsampling convolution stride of the fourth stage is changed from 2 to 1, and the 3×3 convolutions of the fourth stage are changed to dilated convolutions with a dilation rate of 2.
Further, the number of layers of the symmetrical cross attention module is 4; norm normalization is performed before the fused template feature vector X_z, the fused search area feature vector X_x and their outputs after the feed-forward network are obtained.
The invention also discloses a Transformer target tracker based on symmetrical cross attention and position information enhancement, which comprises:
at least one processor; and at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method.
(III) beneficial effects
The technical scheme of the invention has the following advantages:
(1) The invention adds a symmetrical cross attention module and a position information enhancement module to the original Transformer tracking structure. The feature information of the template and the search area is first extracted by a backbone network and then sent to a feature fusion network; four symmetrical cross attention modules are stacked in the encoder network to fuse the search area features with the template features more effectively, the decoder network fuses the features further, and the position information enhancement module further enhances the fused feature information; finally, on the basis of the enhanced features, the prediction head performs classification and bounding box regression and generates the tracking result. Through repeated fusion and enhancement, the features of the template image and the search area image are fused better and the recognition accuracy of target tracking is improved, while the cross attention structure adopted reduces the amount of computation and increases the tracking speed;
(2) The invention replaces the cross-correlation operation with an attention mechanism; the symmetrical cross attention module deeply fuses the search area features with the template features and performs global information interaction, avoiding local optima, so that the features of the template image and the search area image are fused better and their feature similarity is judged more reliably;
(3) Through the position information enhancement module, the invention makes full use of spatial position information, which helps the prediction head recognize and locate the target better in the prediction stage, effectively improves the accuracy with which the prediction head locates the target, and gives the tracker superior performance;
(4) The method can effectively cope with challenges such as illumination change, deformation, scale change and interference, provides high-precision and highly robust target tracking, offers good real-time performance and robustness, and is well suited to visual single-target tracking.
Drawings
The features and advantages of the present invention will be more clearly understood by reference to the accompanying drawings, which are illustrative and should not be construed as limiting the invention in any way, in which:
FIG. 1 is an overall block diagram of a method for Transformer target tracking based on symmetric cross-attention and location information enhancement according to an embodiment of the invention;
FIG. 2 is a block diagram of a symmetrical cross-attention module according to an embodiment of the present invention;
FIG. 3 is a block diagram of a location information enhancement module according to an embodiment of the present invention;
FIG. 4 is a graph of a tracker of an embodiment of the present invention versus other trackers on a GOT-10k dataset;
FIG. 5 is a graph of a tracker of an embodiment of the present invention versus other trackers on a LaSOT dataset;
FIG. 6 is a graph of a tracker of an embodiment of the present invention compared to other trackers on a UAV123 dataset;
FIG. 7 is a graph of a tracker of an embodiment of the present invention versus other trackers on an OTB100 dataset.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
The embodiment of the invention relates to a Transformer target tracking method based on symmetrical cross attention and position information enhancement, which, as shown in fig. 1, comprises the following steps:
S1, inputting the template image and the search area image into a backbone network as a pair of images, and extracting feature maps of the template and the search area respectively;
S10, preprocessing the template image and the search area image;
The template image and the search area image are regarded as a pair of images and used as the input of the backbone network. In order to include both the appearance information of the target and the scene around it, the template image patch is obtained by expanding the target bounding box of the first frame of the video sequence outwards to twice its side length; since the target normally does not move very far between frames, the image patch of the search area is obtained by expanding the target bounding box of the previous frame outwards to four times its side length. The template and search area patches are then reshaped so that they can be fed into the backbone network for processing.
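A minimal sketch of this cropping step (assuming a square crop whose side is the longer side of the box scaled by the given factor, with mean-colour padding for out-of-frame pixels; the helper name and output sizes are illustrative):

```python
import cv2

def crop_patch(image, box, scale, out_size):
    """Crop a square patch centred on the target box, expanded by `scale`
    (2 for the template, 4 for the search area), and resize to out_size."""
    x, y, w, h = box                                   # target box (x, y, w, h)
    cx, cy = x + w / 2.0, y + h / 2.0
    size = int(round(max(w, h) * scale))               # expanded side length
    x0, y0 = int(round(cx - size / 2)), int(round(cy - size / 2))
    pad = max(0, -x0, -y0,
              x0 + size - image.shape[1], y0 + size - image.shape[0])
    mean = image.reshape(-1, image.shape[2]).mean(axis=0).tolist()
    padded = cv2.copyMakeBorder(image, pad, pad, pad, pad,
                                cv2.BORDER_CONSTANT, value=mean)
    patch = padded[y0 + pad:y0 + pad + size, x0 + pad:x0 + pad + size]
    return cv2.resize(patch, (out_size, out_size))

# e.g. template = crop_patch(frame0, init_box, scale=2, out_size=128)
#      search   = crop_patch(frame_t, prev_box, scale=4, out_size=256)
```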
S11, inputting the preprocessed template image and search area image into the backbone network as a pair of images, and extracting the feature maps of the template and the search area respectively;
The backbone network is a modified ResNet50: the last stage and the fully connected layer of the original ResNet50 are removed, the downsampling convolution stride of the fourth stage is changed from 2 to 1 to obtain a larger feature resolution, and the 3×3 convolutions of the fourth stage are changed to dilated convolutions with a dilation rate of 2 to enlarge the receptive field. The template image patch and the search area image patch are processed by the backbone to obtain the template feature map and the search area feature map, whose channel number is C = 1024.
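A sketch of one possible realisation of this modified backbone with torchvision (which stage receives the dilation and the exact torchvision options are assumptions; all code sketches in this description use PyTorch):

```python
import torch.nn as nn
from torchvision.models import resnet50

class Backbone(nn.Module):
    """ResNet-50 with the last stage (layer4) and the fully connected head
    removed; the remaining last stage (layer3) uses stride 1 with dilated
    3x3 convolutions, so the output keeps a larger resolution and has
    C = 1024 channels."""
    def __init__(self):
        super().__init__()
        # replace_stride_with_dilation=[False, True, False] turns layer3's
        # stride-2 downsampling into stride 1 and dilates its 3x3 convolutions.
        net = resnet50(weights=None,
                       replace_stride_with_dilation=[False, True, False])
        self.body = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool,
                                  net.layer1, net.layer2, net.layer3)

    def forward(self, x):            # x: (B, 3, H, W)
        return self.body(x)          # (B, 1024, H/8, W/8)
```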
S2, sending the template and search area feature maps into a feature fusion network, wherein the feature fusion network comprises an encoder network, a decoder network and a position information enhancement module connected in sequence, and obtaining the enhanced final feature vector of the search area through fusion and enhancement;
S20, preprocessing the template and search area feature maps and converting them into template and search area feature vectors: sine position codes are added to the template and search area feature maps, the channel dimension is reduced with a 1×1 convolution, and the feature maps are flattened along the spatial dimension to obtain the template and search area feature vectors respectively;
Before the feature maps output by the backbone network are input into the Encoder they must be preprocessed. First the feature maps are converted into feature vectors: because the attention mechanism takes a set of feature vectors as input, the template feature map and the search area feature map output by the backbone are to be flattened along the spatial dimension into a series of feature vectors, and sine position codes are added. A 1×1 convolution is used to reduce the channel dimension from 1024 to 256, giving f_z0 and f_x0, which are then flattened from the spatial dimension to give f_z1 and f_x1. f_z1 and f_x1 can be regarded as sequences of feature vectors of dimension 256 and serve as the inputs of the template branch and the search area branch in the Encoder.
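A sketch of this preprocessing step (the sine positional encoding is assumed to be generated separately and added inside the attention modules; the sequence layout is illustrative):

```python
import torch.nn as nn

class FeaturePreprocess(nn.Module):
    """Reduce the backbone feature map from 1024 to 256 channels with a 1x1
    convolution and flatten it along the spatial dimension into a sequence
    of 256-dimensional feature vectors (f_z1 / f_x1 in the text)."""
    def __init__(self, in_channels=1024, d_model=256):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, d_model, kernel_size=1)

    def forward(self, feat):                      # feat: (B, 1024, H, W)
        f0 = self.reduce(feat)                    # f_z0 / f_x0: (B, 256, H, W)
        return f0.flatten(2).permute(2, 0, 1)     # (H*W, B, 256) token sequence
```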
S21, inputting the template and search area feature vectors into an encoder network, wherein the encoder network uses 4 layers of symmetrical cross attention modules to fuse the template and search area features, obtaining the fused template and search area feature vectors respectively;
The symmetrical cross attention module SCA, shown in fig. 2, comprises:
inputting the template and search area feature vectors, together with the position codes corresponding to the template and the search area, into a multi-head cross attention module, and obtaining a preliminarily fused feature vector X_t after norm normalization;
inputting the template feature vector and its position code into a multi-head cross attention module to be re-fused with the preliminarily fused feature vector and its position code; the result is added to the template feature vector and, after norm normalization, gives the fused template feature vector X_z; X_z then passes through a feed-forward network FFN, is added to X_z, and after norm normalization the fused template feature vector output by the module is obtained;
meanwhile, inputting the search area feature vector and its position code into a multi-head cross attention module to be re-fused in the same way; the result is added to the search area feature vector and, after norm normalization, gives the fused search area feature vector X_x; X_x then passes through a feed-forward network FFN, is added to X_x, and after norm normalization the fused search area feature vector output by the module is obtained.
The Encoder network contains 4 symmetrical cross attention layers applied in sequence. In the symmetrical cross attention module we first perform a preliminary multi-head cross attention operation using the inputs of the template branch and the search area branch to enhance feature fusion. The multi-head attention module is defined as:
MultiHead(Q, K, V) = Concat(H_1, ..., H_h) W^O,
where each head H_i = Attention(Q W_i^Q, K W_i^K, V W_i^V), and W_i^Q, W_i^K, W_i^V and W^O are parameter matrices. Here the number of heads h = 8, the model dimension d_m = 256, and d_k = d_v = d_m / h = 32.
The preliminary multi-head cross attention is computed as:
X_t = MultiHead(X_zq + P_zq, X_xk + P_xk, X_xv),
where X_zq is the input of the template branch and P_zq its corresponding position code, X_xk and X_xv are the inputs of the search area branch and P_xk the corresponding position code, and the feature dimension d = 256.
The inputs of the template branch and of the search area branch then each perform cross attention with the fused feature, so that the template and search area features are fully fused and the ability to identify the target is enhanced:
X_z = X_zq + MultiHead(X_zq + P_zq, X_tk + P_tk, X_tv),
X_x = X_xq + MultiHead(X_xq + P_xq, X_tk + P_tk, X_tv).
We call this combination of cross attentions symmetrical cross attention; the fused template and search area feature vectors obtained in this way are the outputs of the symmetrical cross attention module.
The encoder network thus fully fuses the features of the template image and the search area image, and the position encoding helps the model distinguish tokens coming from different sources and different locations.
S22, further fusing the fused template and search area feature vectors through a decoder network to obtain the final feature vector of the search area;
The decoder network comprises: inputting the fused template and search area feature vectors into a multi-head cross attention module for re-fusion, then passing the result through a feed-forward network FFN with an activation function inserted, applying residual connections around the multi-head cross attention module and the FFN, and outputting the final feature vector of the search area after norm normalization;
After the template and search area features have been fused by the symmetrical cross attention in the Encoder network, two feature maps are obtained. The Decoder network uses multi-head cross attention and takes the two feature maps produced by the Encoder as its input; the two feature maps are finally fused through the multi-head cross attention, and a feed-forward network strengthens the fitting capability of the model. The feed-forward network FFN consists of a two-layer multi-layer perceptron with an activation function inserted between the layers; residual connections are applied around the multi-head cross attention module and the FFN module, each followed by norm normalization. After this fusion the final feature map of the search area is obtained and is input into the position information enhancement module, which effectively increases the ability to recognise the target position without adding extra computation.
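Under the same assumptions, a sketch of the decoder layer: the fused search area tokens act as queries and the fused template tokens as keys and values, followed by a two-layer FFN with residual connections and norm normalization (the FFN width and activation are assumptions):

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=8, ffn_dim=2048):
        super().__init__()
        self.cross = nn.MultiheadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, ffn_dim), nn.ReLU(),
                                 nn.Linear(ffn_dim, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, z):          # x: fused search tokens, z: fused template tokens
        # multi-head cross attention between the two Encoder outputs,
        # with residual connection and norm
        x = self.norm1(x + self.cross(x, z, z)[0])
        # two-layer FFN with an activation in between, residual + norm
        x = self.norm2(x + self.ffn(x))
        return x                      # final search-area feature vectors
```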
S23, inputting the final feature vector of the search area into the position information enhancement module for enhancement, obtaining the enhanced final feature vector of the search area;
The position information enhancement module PIE, shown in fig. 3, comprises: the final feature vector of the search area is input and reshaped to obtain the final feature map X of the search area, with C channels, height H and width W; each channel of the feature map undergoes one-dimensional horizontal global pooling and one-dimensional vertical global pooling, giving a horizontal intermediate feature map p_h and a vertical intermediate feature map p_w; the vertical intermediate feature map is passed through a permute transformation and concatenated with the horizontal intermediate feature map along the spatial dimension, a 1×1 convolution followed by BatchNorm batch normalization gives T, and T is multiplied by the feature map obtained by applying the Relu linear rectification function to T, yielding an intermediate feature map f that encodes the spatial information in the horizontal and vertical directions; the intermediate feature map f is split along the spatial dimension into an independent height tensor f_h and width tensor f_w, each of which passes through a 1×1 convolution and a sigmoid activation function to give the height weight S_h and the width weight S_w; these are multiplied with the reshaped input of the position information enhancement module, the enhanced final feature map Y of the search area is output, and after reshaping the enhanced final feature vector of the search area is output.
Unlike channel attention, the position information enhancement module also considers encoding spatial information: we perform one-dimensional horizontal global pooling and one-dimensional vertical global pooling along the horizontal and vertical directions respectively. Global pooling is typically used in channel attention to strengthen the encoding of spatial information, but it compresses the global spatial information into a single channel descriptor, which makes it very difficult to preserve position information, and position information is critical for capturing the spatial location of the target in a tracking task. To obtain more accurate position information over the spatial region, we therefore pool over two spatial extents, along the horizontal and along the vertical coordinate, so that the output of the c-th channel at height h can be written as
p_h,c(h) = (1/W) Σ_{0≤k<W} X_c(h, k),
and likewise the output of the c-th channel at width w can be written as
p_w,c(w) = (1/H) Σ_{0≤k<H} X_c(k, w),
where k in turn indexes the k-th row or the k-th column.
These two transformations aggregate features along the two spatial directions and produce a pair of direction-aware feature maps. Unlike the earlier global pooling, this lets the attention capture long-range dependencies along one spatial direction while preserving accurate position information along the other; each element of the two feature maps reflects whether the target object of interest is present in the corresponding row or column, which allows the model to locate the target object more accurately.
The aggregated feature maps produced by the one-dimensional global pooling are first concatenated and then transformed with a 1×1 convolution. The formulas are as follows:
T = Conv(Concat(p_h, p_w)),
f_t = φ(T),
f = T × f_t,
where the concatenation is along the spatial dimension. Together with the nonlinear activation function this yields an intermediate feature map f that encodes the spatial information in the horizontal and vertical directions, with r being the reduction ratio that controls the size of the block. We then split the feature map f along the spatial dimension into two independent tensors f_h and f_w, and two 1×1 convolutions convert f_h and f_w back into tensors with the same number of channels as the original input, each followed by a sigmoid activation function. The formulas are as follows:
S_h = τ(Conv(f_h)),
S_w = τ(Conv(f_w)).
We take the outputs S_h and S_w as attention weights, and the output Y of the position information enhancement module can finally be written as
Y_c(i, j) = X_c(i, j) × S_h,c(i) × S_w,c(j),
where the subscript c denotes the feature map of the c-th channel, i and j are the horizontal and vertical coordinates, τ(·) denotes the sigmoid activation function, and φ(·) denotes the Relu linear rectification function.
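A sketch of the PIE module following the formulas above; the global pooling is taken to be average pooling and the reduction ratio r = 16 is an assumed value:

```python
import torch
import torch.nn as nn

class PositionInfoEnhancement(nn.Module):
    def __init__(self, channels=256, r=16):
        super().__init__()
        mid = max(8, channels // r)                 # r controls the block size
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.relu = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):                           # x: (B, C, H, W) reshaped input
        _, _, h, w = x.shape
        p_h = x.mean(dim=3, keepdim=True)           # 1-D horizontal pooling: (B, C, H, 1)
        p_w = x.mean(dim=2, keepdim=True)           # 1-D vertical pooling:   (B, C, 1, W)
        p_w = p_w.permute(0, 1, 3, 2)               # permute so it can be concatenated
        t = self.bn(self.conv1(torch.cat([p_h, p_w], dim=2)))   # T: (B, mid, H+W, 1)
        f = t * self.relu(t)                        # f = T x phi(T)
        f_h, f_w = torch.split(f, [h, w], dim=2)    # split along the spatial dimension
        s_h = torch.sigmoid(self.conv_h(f_h))                      # S_h: (B, C, H, 1)
        s_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))  # S_w: (B, C, 1, W)
        return x * s_h * s_w                        # enhanced feature map Y
```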
S3, inputting the enhanced final feature vector of the search area into the prediction head, and generating the target tracking result through a classification function and a bounding box regression function.
The prediction head receives H_x × W_x feature vectors and outputs H_x × W_x binary classification and regression results. The predictions of the feature vectors that fall inside the ground-truth bounding box are taken as positive samples and the rest as negative samples; the class label of a positive sample is foreground and that of a negative sample is background. All samples are used for the classification loss, whereas only the positive samples are used for the regression loss, so that each feature vector predicts the target at its corresponding location. We use the standard binary cross entropy loss for classification.
For regression, we adopt the L1 loss together with the generalized IOU loss as the regression loss.
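A sketch of the resulting training objective; the loss weights and the (x1, y1, x2, y2) box format are assumptions, and generalized_box_iou is the GIoU implementation from torchvision:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou

def tracking_loss(cls_logits, cls_labels, pred_boxes, gt_boxes, pos_mask,
                  w_cls=1.0, w_l1=5.0, w_iou=2.0):
    """Binary cross entropy over all H_x * W_x feature vectors; L1 + GIoU
    regression over the positive samples only."""
    cls_loss = F.binary_cross_entropy_with_logits(cls_logits, cls_labels.float())
    pos_pred, pos_gt = pred_boxes[pos_mask], gt_boxes[pos_mask]
    l1_loss = F.l1_loss(pos_pred, pos_gt)
    giou = torch.diag(generalized_box_iou(pos_pred, pos_gt))   # element-wise GIoU
    iou_loss = (1.0 - giou).mean()
    return w_cls * cls_loss + w_l1 * l1_loss + w_iou * iou_loss
```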
the above method is verified on a plurality of authoritative data sets such as GOT-10k, laSOT, trackingNet, OTB100, UAV123, VOT2020 and the like, and tables 1-4 and figures 4-7 show that the tracker of the method performs more excellent compared with the indexes performed by other methods.
Table 1. Detailed data of the comparison with other trackers on the GOT-10k dataset
Table 2. Detailed data of the comparison with other trackers on the TrackingNet dataset
Table 3. Detailed data of the comparison with other trackers on the LaSOT dataset
Table 4. Detailed data of the comparison with other trackers on the OTB100 and UAV123 datasets
Finally, it should be noted that the above method may be converted into software program instructions, which may be implemented by a Transformer target tracker based on symmetrical cross attention and position information enhancement comprising a processor and a memory, or by computer instructions stored in a non-transitory computer-readable storage medium. The integrated units implemented in the form of software functional units may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to perform part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
In summary, the Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement have the following beneficial effects:
(1) The invention adds a symmetrical cross attention module and a position information enhancement module to the original Transformer tracking structure. The feature information of the template and the search area is first extracted by a backbone network and then sent to a feature fusion network; four symmetrical cross attention modules are stacked in the encoder network to fuse the search area features with the template features more effectively, the decoder network fuses the features further, and the position information enhancement module further enhances the fused feature information; finally, on the basis of the enhanced features, the prediction head performs classification and bounding box regression and generates the tracking result. Through repeated fusion and enhancement, the features of the template image and the search area image are fused better and the recognition accuracy of target tracking is improved, while the cross attention structure adopted reduces the amount of computation and increases the tracking speed;
(2) The invention replaces the cross-correlation operation with an attention mechanism; the symmetrical cross attention module deeply fuses the search area features with the template features and performs global information interaction, avoiding local optima, so that the features of the template image and the search area image are fused better and their feature similarity is judged more reliably;
(3) Through the position information enhancement module, the invention makes full use of spatial position information, which helps the prediction head recognize and locate the target better in the prediction stage, effectively improves the accuracy with which the prediction head locates the target, and gives the tracker superior performance;
(4) The method can effectively cope with challenges such as illumination change, deformation, scale change and interference, provides high-precision and highly robust target tracking, offers good real-time performance and robustness, and is well suited to visual single-target tracking.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims (10)

1. A Transformer target tracking method based on symmetrical cross attention and position information enhancement, comprising:
S1, inputting a template image and a search area image into a backbone network as a pair of images, and extracting feature maps of the template and the search area respectively;
S2, sending the template and search area feature maps into a feature fusion network, wherein the feature fusion network comprises an encoder network, a decoder network and a position information enhancement module connected in sequence, and obtaining the enhanced final feature vector of the search area through fusion and enhancement;
S3, inputting the enhanced final feature vector of the search area into a prediction head, and generating a target tracking result through a classification function and a bounding box regression function.
2. The Transformer target tracking method based on symmetrical cross attention and position information enhancement according to claim 1, wherein the step S2 comprises:
S20, preprocessing the template and search area feature maps and converting them into template and search area feature vectors;
S21, inputting the template and search area feature vectors into an encoder network, wherein the encoder network uses several layers of symmetrical cross attention modules to fuse the template and search area features, obtaining the fused template and search area feature vectors respectively;
S22, further fusing the fused template and search area feature vectors through a decoder network to obtain the final feature vector of the search area;
S23, inputting the final feature vector of the search area into a position information enhancement module for enhancement, obtaining the enhanced final feature vector of the search area.
3. The Transformer target tracking method based on symmetrical cross attention and position information enhancement according to claim 2, wherein the step S20 comprises: adding sine position codes to the template and search area feature maps, reducing the channel dimension by using a 1×1 convolution, and flattening the feature maps along the spatial dimension to obtain the template and search area feature vectors respectively.
4. The Transformer target tracking method based on symmetrical cross attention and position information enhancement according to claim 2, wherein in the step S21, the symmetrical cross attention module comprises:
inputting the template and search area feature vectors, together with the position codes corresponding to the template and the search area, into a multi-head cross attention module to obtain a preliminarily fused feature vector X_t:
X_t = MultiHead(X_zq + P_zq, X_xk + P_xk, X_xv),
MultiHead(Q, K, V) = Concat(H_1, ..., H_h) W^O;
inputting the template feature vector and its position code into a multi-head cross attention module to be re-fused with the preliminarily fused feature vector and its position code, and adding the result to the template feature vector to obtain a fused template feature vector X_z; passing X_z through a feed-forward network FFN and adding the result to X_z to obtain the fused template feature vector output by the module:
X_z = X_zq + MultiHead(X_zq + P_zq, X_tk + P_tk, X_tv);
meanwhile, inputting the search area feature vector and its position code into a multi-head cross attention module to be re-fused in the same way, and adding the result to the search area feature vector to obtain a fused search area feature vector X_x; passing X_x through a feed-forward network FFN and adding the result to X_x to obtain the fused search area feature vector output by the module:
X_x = X_xq + MultiHead(X_xq + P_xq, X_tk + P_tk, X_tv),
wherein W^O and the other weight matrices are parameter matrices, X_zq is the input of the template branch, P_zq is the position code corresponding to the template branch, X_xk and X_xv are the inputs of the search area branch, P_xk is the position code corresponding to the search area branch, X_tk and X_tv are the inputs of the preliminarily fused feature vector branch, and P_tk is the position code corresponding to the preliminarily fused feature vector.
5. The Transformer target tracking method based on symmetrical cross attention and position information enhancement according to claim 2, wherein in the step S22, the decoder network comprises: inputting the fused template and search area feature vectors into a multi-head cross attention module for re-fusion, then passing the result through a feed-forward network FFN with an activation function inserted, applying residual connections around the multi-head cross attention module and the FFN, and outputting the final feature vector of the search area after norm normalization.
6. The Transformer target tracking method based on symmetrical cross attention and position information enhancement according to claim 2, wherein in the step S23, the position information enhancement module comprises:
inputting the final feature vector of the search area and reshaping it to obtain the final feature map X of the search area, with C channels, height H and width W;
applying one-dimensional horizontal global pooling and one-dimensional vertical global pooling to each channel of the feature map to obtain a horizontal intermediate feature map p_h and a vertical intermediate feature map p_w;
applying a permute transformation to the vertical intermediate feature map, concatenating it with the horizontal intermediate feature map along the spatial dimension, applying a 1×1 convolution followed by BatchNorm batch normalization to obtain T, and multiplying T by the feature map obtained by applying the Relu linear rectification function to T, yielding an intermediate feature map f that encodes the spatial information in the horizontal and vertical directions:
T = Conv(Concat(p_h, p_w)),
f_t = φ(T),
f = T × f_t;
splitting the intermediate feature map f along the spatial dimension into an independent height tensor f_h and width tensor f_w, applying a 1×1 convolution and a sigmoid activation function to each to obtain the height weight S_h and the width weight S_w, multiplying them with the reshaped input of the position information enhancement module, outputting the enhanced final feature map Y of the search area, and outputting the enhanced final feature vector of the search area after reshaping:
S_h = τ(Conv(f_h)),
S_w = τ(Conv(f_w)),
Y_c(i, j) = X_c(i, j) × S_h,c(i) × S_w,c(j),
wherein the subscript c denotes the feature map of the c-th channel, i and j are the horizontal and vertical coordinates, k denotes the k-th row or k-th column of the pooled feature maps, τ(·) denotes the sigmoid activation function, and φ(·) denotes the Relu linear rectification function.
7. The method according to claim 1, wherein in the step S3, the classification network uses a binary cross entropy loss function, and the bounding box regression network uses L1 loss and IOU loss functions.
8. The Transformer target tracking method based on symmetrical cross attention and position information enhancement according to claim 1, wherein in the step S1, the template image and the search area image are preprocessed: the image patch of the template image is obtained by expanding the target bounding box of the first frame of the video sequence outwards to twice its side length; the image patch of the search area image is obtained by expanding the target bounding box of the previous frame outwards to four times its side length; the backbone network is a modified ResNet50 network in which the last stage and the fully connected layer are removed, the downsampling convolution stride of the fourth stage is changed from 2 to 1, and the 3×3 convolutions of the fourth stage are changed to dilated convolutions with a dilation rate of 2.
9. The Transformer target tracking method based on symmetrical cross attention and position information enhancement according to claim 4, wherein the number of layers of the symmetrical cross attention module is 4; norm normalization is performed before the fused template feature vector X_z, the fused search area feature vector X_x and their outputs after the feed-forward network are obtained.
10. A Transformer target tracker based on symmetrical cross attention and position information enhancement, comprising:
at least one processor; and at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1-9.
CN202310742715.3A 2023-06-21 2023-06-21 Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement Pending CN116862949A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310742715.3A CN116862949A (en) 2023-06-21 2023-06-21 Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310742715.3A CN116862949A (en) 2023-06-21 2023-06-21 Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement

Publications (1)

Publication Number Publication Date
CN116862949A true CN116862949A (en) 2023-10-10

Family

ID=88222551

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310742715.3A Pending CN116862949A (en) 2023-06-21 2023-06-21 Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement

Country Status (1)

Country Link
CN (1) CN116862949A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117197249A (en) * 2023-11-08 2023-12-08 北京观微科技有限公司 Target position determining method, device, electronic equipment and storage medium
CN117197249B (en) * 2023-11-08 2024-01-30 北京观微科技有限公司 Target position determining method, device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination