CN116862949A - Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement - Google Patents

Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement

Info

Publication number
CN116862949A
CN116862949A (application CN202310742715.3A)
Authority
CN
China
Prior art keywords
search area
template
feature
feature vector
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310742715.3A
Other languages
Chinese (zh)
Inventor
张建明
陈文韬
何宇凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha University of Science and Technology
Original Assignee
Changsha University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha University of Science and Technology filed Critical Changsha University of Science and Technology
Priority to CN202310742715.3A priority Critical patent/CN116862949A/en
Publication of CN116862949A publication Critical patent/CN116862949A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Abstract

The invention discloses a Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement. The method comprises the following steps: the template image and the search area image are input into a backbone network as a pair of images, and feature maps of the template and the search area are extracted respectively; the template and search area feature maps are sent to a feature fusion network, which comprises an encoder network, a decoder network and a position information enhancement module connected in sequence, and the enhanced final feature vector of the search area is obtained through fusion and enhancement; the enhanced final feature vector of the search area is input into a prediction head, and a target tracking result is generated through a classification function and a bounding box regression function. The invention addresses the problem that the accuracy and tracking speed of existing target tracking still need to be improved.

Description

Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement
Technical Field
The invention relates to the technical field of visual target tracking, and in particular to a Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement.
Background
Estimating the trajectory of an object in a video sequence, known as visual object tracking, is a fundamental and challenging task in computer vision. Given the position of the object in the initial frame of a video, the tracker must predict the position and state of the object in every subsequent frame; it therefore has to extract the correct features of the object from the first frame in order to locate it in later frames. Target tracking is now widely applied in fields such as unmanned aerial vehicles, autonomous driving and surveillance, and this breadth of applications has drawn increasing attention to the area. The main challenges in tracking are illumination changes, deformation, occlusion, background clutter and interference from similar targets. Many of these problems remain unsolved, and although much research has addressed them in recent years, designing a tracker that is both accurate and runs in real time remains a challenge.
In the prior art, tracking structures based on the twin (Siamese) network have received wide attention for their performance and simplicity and have become very popular in visual target tracking, while also achieving very good results. Under the Siamese tracking framework, visual target tracking is formulated as a template matching problem: in short, tracking searches for the region most similar to the target template, making full use of the cross-correlation operation between the template and the search area to locate the target object accurately in the search area. Most mainstream tracker frameworks use a backbone network to extract features of the template and search area images and then use a correlation structure to compute the similarity between the template and the search area, such as ATOM; Siamese trackers achieve excellent tracking performance, especially with respect to the balance of accuracy and speed.
The Transformer was first proposed in natural language processing for machine translation. By allowing each element to attend to all other elements, it improves the learning of long-range dependencies, quickly replaced LSTM models in neural machine translation, and has become the dominant architecture for language modelling, achieving remarkable success even in large-scale pre-trained models such as the well-known GPT. Based on the attention mechanism, it uses a combination of encoders and decoders to convert one sequence into another, and the attention mechanism allows global information to be captured when processing language sequences. In sequential tasks such as natural language processing the Transformer has essentially replaced recurrent neural networks, and it has also shone in computer vision, for example in image classification, object detection, semantic segmentation and multi-object tracking. Currently there are two main ways in which single-target tracking applies the Transformer. The first uses a convolutional-neural-network backbone to extract the features of the template and the search area, and then a Transformer-based feature fusion structure to deeply fuse the two sets of features so that the bounding box of the target object can be predicted better, as in TransT. The second performs no convolutional backbone feature extraction at all and instead uses a Transformer structure directly for both feature extraction and fusion, making full use of the attention mechanism and yielding a more compact overall structure, as in MixFormer.
However, Siamese-based trackers have certain drawbacks because they rely on the cross-correlation operation: cross-correlation tends to fall into local optima and lacks access to global information, which can affect the prediction of the final target bounding box and lead to inaccurate results. Moreover, the original Transformer is computationally intensive, which limits tracking speed.
Disclosure of Invention
(I) Technical problem to be solved
Based on the above problems, the invention provides a Transformer target tracking method based on symmetrical cross attention and position information enhancement, and a corresponding tracker, which solve the problem that the accuracy and tracking speed of existing target tracking still need to be improved.
(II) technical scheme
To solve the above technical problems, the invention provides a Transformer target tracking method based on symmetrical cross attention and position information enhancement, which comprises the following steps:
S1, inputting a template image and a search area image into a backbone network as a pair of images, and extracting feature maps of the template and the search area respectively;
S2, sending the template and search area feature maps into a feature fusion network, wherein the feature fusion network comprises an encoder network, a decoder network and a position information enhancement module connected in sequence, and obtaining the enhanced final feature vector of the search area through fusion and enhancement;
S3, inputting the enhanced final feature vector of the search area into a prediction head, and generating a target tracking result through a classification function and a bounding box regression function.
Further, the step S2 includes:
S20, preprocessing the template and search area feature maps and converting them into template and search area feature vectors;
S21, inputting the template and search area feature vectors into an encoder network, wherein the encoder network uses several layers of symmetrical cross attention modules to fuse the template and search area features, obtaining the fused template and search area feature vectors respectively;
S22, further fusing the fused template and search area feature vectors through a decoder network to obtain the final feature vector of the search area;
S23, inputting the final feature vector of the search area into a position information enhancement module for enhancement, obtaining the enhanced final feature vector of the search area.
Further, the step S20 includes: adding sine position codes to the template and search area feature maps, reducing the channel dimension by using a 1×1 convolution, and flattening the feature maps along the spatial dimension to obtain the template and search area feature vectors respectively.
Further, in the step S21, the symmetrical cross-attention module includes:
inputting the template and search area feature vectors, together with the position codes corresponding to the template and the search area, into a multi-head cross attention module to obtain a preliminarily fused feature vector X_t:
X_t = MultiHead(X_zq + P_zq, X_xk + P_xk, X_xv),
MultiHead(Q, K, V) = Concat(H_1, ..., H_h) W^O;
inputting the template feature vector and its position code into a multi-head cross attention module to be re-fused with the preliminarily fused feature vector and its position code, and adding the result to the template feature vector to obtain a fused template feature vector X_z; X_z is then passed through a feed-forward network FFN and added to X_z to obtain the fused template feature vector output by the module:
X_z = X_zq + MultiHead(X_zq + P_zq, X_tk + P_tk, X_tv);
meanwhile, inputting the search area feature vector and its position code into a multi-head cross attention module to be re-fused in the same way, and adding the result to the search area feature vector to obtain a fused search area feature vector X_x; X_x is then passed through a feed-forward network FFN and added to X_x to obtain the fused search area feature vector output by the module:
X_x = X_xq + MultiHead(X_xq + P_xq, X_tk + P_tk, X_tv),
where W^O and the other weight matrices are parameter matrices, X_zq is the input of the template branch, P_zq is the position code corresponding to the template branch, X_xk and X_xv are the inputs of the search area branch, P_xk is the position code corresponding to the search area branch, X_tk and X_tv are the inputs of the preliminarily fused feature vector branch, and P_tk is the position code corresponding to the preliminarily fused feature vector.
Further, in the step S22, the decoder network includes: inputting the fused template and search area feature vectors into a multi-head cross attention module for re-fusion, then passing the result through a feed-forward network FFN with an activation function inserted, applying residual connections around the multi-head cross attention module and the FFN, and outputting the final feature vector of the search area after norm normalization.
Further, in the step S23, the position information enhancement module includes:
inputting the final feature vector of the search area and reshaping it to obtain the final feature map X of the search area, with C channels, height H and width W;
applying one-dimensional horizontal global pooling and one-dimensional vertical global pooling to each channel of the feature map to obtain a horizontal intermediate feature map p_h and a vertical intermediate feature map p_w;
applying a permute transformation to the vertical intermediate feature map, concatenating it with the horizontal intermediate feature map along the spatial dimension, applying a 1×1 convolution followed by BatchNorm batch normalization to obtain T, and multiplying T by the feature map obtained by applying the Relu linear rectification function to T, yielding an intermediate feature map f that encodes the spatial information in the horizontal and vertical directions:
T = Conv(Concat(p_h, p_w)),
f_t = φ(T),
f = T × f_t;
splitting the intermediate feature map f along the spatial dimension into an independent height tensor f_h and width tensor f_w, applying a 1×1 convolution and a sigmoid activation function to each to obtain the height weight S_h and the width weight S_w, multiplying them with the reshaped input of the position information enhancement module, outputting the enhanced final feature map Y of the search area, and outputting the enhanced final feature vector of the search area after reshaping:
S_h = τ(Conv(f_h)),
S_w = τ(Conv(f_w)),
Y_c(i, j) = X_c(i, j) × S_h,c(i) × S_w,c(j),
wherein the subscript c denotes the feature map of the c-th channel, i and j are the horizontal and vertical coordinates, τ(·) denotes the sigmoid activation function, and φ(·) denotes the Relu linear rectification function.
Further, in the step S3, the classification network uses a binary cross entropy loss function, and the bounding box regression network uses L1 loss and IOU loss functions.
Further, in the step S1, the template image and the search area image are preprocessed: the image patch of the template image is obtained by expanding the target bounding box of the first frame of the video sequence outwards to twice its side length; the image patch of the search area image is obtained by expanding the target bounding box of the previous frame outwards to four times its side length. The backbone network is a modified ResNet50 in which the last stage and the fully connected layer are removed, the downsampling convolution stride of the fourth stage is changed from 2 to 1, and the 3×3 convolutions of the fourth stage are changed to dilated convolutions with a dilation rate of 2.
Further, the number of layers of the symmetrical cross attention module is 4; norm normalization is performed before the fused template feature vector X_z, the fused search area feature vector X_x and their outputs after the feed-forward network are obtained.
The invention also discloses a Transformer target tracker based on symmetrical cross attention and position information enhancement, which comprises:
at least one processor; and at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method.
(III) beneficial effects
The technical scheme of the invention has the following advantages:
(1) The invention adds a symmetrical cross attention module and a position information enhancement module to the original Transformer tracking structure. The feature information of the template and the search area is first extracted by a backbone network and then sent to a feature fusion network; four symmetrical cross attention modules are stacked in the encoder network to fuse the search area features with the template features more effectively, the decoder network fuses the features further, and the position information enhancement module further enhances the fused feature information; finally, on the basis of the enhanced features, the prediction head performs classification and bounding box regression and generates the tracking result. Through repeated fusion and enhancement, the features of the template image and the search area image are fused better and the recognition accuracy of target tracking is improved, while the cross attention structure adopted reduces the amount of computation and increases the tracking speed;
(2) The invention replaces the cross-correlation operation with an attention mechanism; the symmetrical cross attention module deeply fuses the search area features with the template features and performs global information interaction, avoiding local optima, so that the features of the template image and the search area image are fused better and their feature similarity is judged more reliably;
(3) Through the position information enhancement module, the invention makes full use of spatial position information, which helps the prediction head recognize and locate the target better in the prediction stage, effectively improves the accuracy with which the prediction head locates the target, and gives the tracker superior performance;
(4) The method can effectively cope with challenges such as illumination change, deformation, scale change and interference, provides high-precision and highly robust target tracking, offers good real-time performance and robustness, and is well suited to visual single-target tracking.
Drawings
The features and advantages of the present invention will be more clearly understood by reference to the accompanying drawings, which are illustrative and should not be construed as limiting the invention in any way, in which:
FIG. 1 is an overall block diagram of a method for Transformer target tracking based on symmetric cross-attention and location information enhancement according to an embodiment of the invention;
FIG. 2 is a block diagram of a symmetrical cross-attention module according to an embodiment of the present invention;
FIG. 3 is a block diagram of a location information enhancement module according to an embodiment of the present invention;
FIG. 4 is a graph of a tracker of an embodiment of the present invention versus other trackers on a GOT-10k dataset;
FIG. 5 is a graph of a tracker of an embodiment of the present invention versus other trackers on a LaSOT dataset;
FIG. 6 is a graph of a tracker of an embodiment of the present invention compared to other trackers on a UAV123 dataset;
FIG. 7 is a graph of a tracker of an embodiment of the present invention versus other trackers on an OTB100 dataset.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
The embodiment of the invention relates to a Transformer target tracking method based on symmetrical cross attention and position information enhancement, which, as shown in fig. 1, comprises the following steps:
S1, inputting the template image and the search area image into a backbone network as a pair of images, and extracting feature maps of the template and the search area respectively;
S10, preprocessing the template image and the search area image;
The template image and the search area image are regarded as a pair of images and used as the input of the backbone network. In order to include both the appearance information of the target and the scene around it, the template image patch is obtained by expanding the target bounding box of the first frame of the video sequence outwards to twice its side length; since the target normally does not move very far between frames, the image patch of the search area is obtained by expanding the target bounding box of the previous frame outwards to four times its side length. The template and search area patches are then reshaped so that they can be fed into the backbone network for processing.
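A minimal sketch of this cropping step (assuming a square crop whose side is the longer side of the box scaled by the given factor, with mean-colour padding for out-of-frame pixels; the helper name and output sizes are illustrative):

```python
import cv2

def crop_patch(image, box, scale, out_size):
    """Crop a square patch centred on the target box, expanded by `scale`
    (2 for the template, 4 for the search area), and resize to out_size."""
    x, y, w, h = box                                   # target box (x, y, w, h)
    cx, cy = x + w / 2.0, y + h / 2.0
    size = int(round(max(w, h) * scale))               # expanded side length
    x0, y0 = int(round(cx - size / 2)), int(round(cy - size / 2))
    pad = max(0, -x0, -y0,
              x0 + size - image.shape[1], y0 + size - image.shape[0])
    mean = image.reshape(-1, image.shape[2]).mean(axis=0).tolist()
    padded = cv2.copyMakeBorder(image, pad, pad, pad, pad,
                                cv2.BORDER_CONSTANT, value=mean)
    patch = padded[y0 + pad:y0 + pad + size, x0 + pad:x0 + pad + size]
    return cv2.resize(patch, (out_size, out_size))

# e.g. template = crop_patch(frame0, init_box, scale=2, out_size=128)
#      search   = crop_patch(frame_t, prev_box, scale=4, out_size=256)
```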
S11, inputting the preprocessed template image and search area image into the backbone network as a pair of images, and extracting the feature maps of the template and the search area respectively;
The backbone network is a modified ResNet50: the last stage and the fully connected layer of the original ResNet50 are removed, the downsampling convolution stride of the fourth stage is changed from 2 to 1 to obtain a larger feature resolution, and the 3×3 convolutions of the fourth stage are changed to dilated convolutions with a dilation rate of 2 to enlarge the receptive field. The template image patch and the search area image patch are processed by the backbone to obtain the template feature map and the search area feature map, whose channel number is C = 1024.
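A sketch of one possible realisation of this modified backbone with torchvision (which stage receives the dilation and the exact torchvision options are assumptions; all code sketches in this description use PyTorch):

```python
import torch.nn as nn
from torchvision.models import resnet50

class Backbone(nn.Module):
    """ResNet-50 with the last stage (layer4) and the fully connected head
    removed; the remaining last stage (layer3) uses stride 1 with dilated
    3x3 convolutions, so the output keeps a larger resolution and has
    C = 1024 channels."""
    def __init__(self):
        super().__init__()
        # replace_stride_with_dilation=[False, True, False] turns layer3's
        # stride-2 downsampling into stride 1 and dilates its 3x3 convolutions.
        net = resnet50(weights=None,
                       replace_stride_with_dilation=[False, True, False])
        self.body = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool,
                                  net.layer1, net.layer2, net.layer3)

    def forward(self, x):            # x: (B, 3, H, W)
        return self.body(x)          # (B, 1024, H/8, W/8)
```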
S2, sending the template and search area feature maps into a feature fusion network, wherein the feature fusion network comprises an encoder network, a decoder network and a position information enhancement module connected in sequence, and obtaining the enhanced final feature vector of the search area through fusion and enhancement;
S20, preprocessing the template and search area feature maps and converting them into template and search area feature vectors: sine position codes are added to the template and search area feature maps, the channel dimension is reduced with a 1×1 convolution, and the feature maps are flattened along the spatial dimension to obtain the template and search area feature vectors respectively;
Before the feature maps output by the backbone network are input into the Encoder they must be preprocessed. First the feature maps are converted into feature vectors: because the attention mechanism takes a set of feature vectors as input, the template feature map and the search area feature map output by the backbone are to be flattened along the spatial dimension into a series of feature vectors, and sine position codes are added. A 1×1 convolution is used to reduce the channel dimension from 1024 to 256, giving f_z0 and f_x0, which are then flattened from the spatial dimension to give f_z1 and f_x1. f_z1 and f_x1 can be regarded as sequences of feature vectors of dimension 256 and serve as the inputs of the template branch and the search area branch in the Encoder.
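A sketch of this preprocessing step (the sine positional encoding is assumed to be generated separately and added inside the attention modules; the sequence layout is illustrative):

```python
import torch.nn as nn

class FeaturePreprocess(nn.Module):
    """Reduce the backbone feature map from 1024 to 256 channels with a 1x1
    convolution and flatten it along the spatial dimension into a sequence
    of 256-dimensional feature vectors (f_z1 / f_x1 in the text)."""
    def __init__(self, in_channels=1024, d_model=256):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, d_model, kernel_size=1)

    def forward(self, feat):                      # feat: (B, 1024, H, W)
        f0 = self.reduce(feat)                    # f_z0 / f_x0: (B, 256, H, W)
        return f0.flatten(2).permute(2, 0, 1)     # (H*W, B, 256) token sequence
```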
S21, inputting the template and search area feature vectors into an encoder network, wherein the encoder network uses 4 layers of symmetrical cross attention modules to fuse the template and search area features, obtaining the fused template and search area feature vectors respectively;
The symmetrical cross attention module SCA, shown in fig. 2, comprises:
inputting the template and search area feature vectors, together with the position codes corresponding to the template and the search area, into a multi-head cross attention module, and obtaining a preliminarily fused feature vector X_t after norm normalization;
inputting the template feature vector and its position code into a multi-head cross attention module to be re-fused with the preliminarily fused feature vector and its position code; the result is added to the template feature vector and, after norm normalization, gives the fused template feature vector X_z; X_z then passes through a feed-forward network FFN, is added to X_z, and after norm normalization the fused template feature vector output by the module is obtained;
meanwhile, inputting the search area feature vector and its position code into a multi-head cross attention module to be re-fused in the same way; the result is added to the search area feature vector and, after norm normalization, gives the fused search area feature vector X_x; X_x then passes through a feed-forward network FFN, is added to X_x, and after norm normalization the fused search area feature vector output by the module is obtained.
The Encoder network contains 4 symmetrical cross attention layers applied in sequence. In the symmetrical cross attention module we first perform a preliminary multi-head cross attention operation using the inputs of the template branch and the search area branch to enhance feature fusion. The multi-head attention module is defined as:
MultiHead(Q, K, V) = Concat(H_1, ..., H_h) W^O,
where each head H_i = Attention(Q W_i^Q, K W_i^K, V W_i^V), and W_i^Q, W_i^K, W_i^V and W^O are parameter matrices. Here the number of heads h = 8, the model dimension d_m = 256, and d_k = d_v = d_m / h = 32.
The preliminary multi-head cross attention is computed as:
X_t = MultiHead(X_zq + P_zq, X_xk + P_xk, X_xv),
where X_zq is the input of the template branch and P_zq its corresponding position code, X_xk and X_xv are the inputs of the search area branch and P_xk the corresponding position code, and the feature dimension d = 256.
The inputs of the template branch and of the search area branch then each perform cross attention with the fused feature, so that the template and search area features are fully fused and the ability to identify the target is enhanced:
X_z = X_zq + MultiHead(X_zq + P_zq, X_tk + P_tk, X_tv),
X_x = X_xq + MultiHead(X_xq + P_xq, X_tk + P_tk, X_tv).
We call this combination of cross attentions symmetrical cross attention; the fused template and search area feature vectors obtained in this way are the outputs of the symmetrical cross attention module.
The encoder network thus fully fuses the features of the template image and the search area image, and the position encoding helps the model distinguish tokens coming from different sources and different locations.
S22, further fusing the fused template and search area feature vectors through a decoder network to obtain the final feature vector of the search area;
The decoder network comprises: inputting the fused template and search area feature vectors into a multi-head cross attention module for re-fusion, then passing the result through a feed-forward network FFN with an activation function inserted, applying residual connections around the multi-head cross attention module and the FFN, and outputting the final feature vector of the search area after norm normalization;
After the template and search area features have been fused by the symmetrical cross attention in the Encoder network, two feature maps are obtained. The Decoder network uses multi-head cross attention and takes the two feature maps produced by the Encoder as its input; the two feature maps are finally fused through the multi-head cross attention, and a feed-forward network strengthens the fitting capability of the model. The feed-forward network FFN consists of a two-layer multi-layer perceptron with an activation function inserted between the layers; residual connections are applied around the multi-head cross attention module and the FFN module, each followed by norm normalization. After this fusion the final feature map of the search area is obtained and is input into the position information enhancement module, which effectively increases the ability to recognise the target position without adding extra computation.
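Under the same assumptions, a sketch of the decoder layer: the fused search area tokens act as queries and the fused template tokens as keys and values, followed by a two-layer FFN with residual connections and norm normalization (the FFN width and activation are assumptions):

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=8, ffn_dim=2048):
        super().__init__()
        self.cross = nn.MultiheadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, ffn_dim), nn.ReLU(),
                                 nn.Linear(ffn_dim, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, z):          # x: fused search tokens, z: fused template tokens
        # multi-head cross attention between the two Encoder outputs,
        # with residual connection and norm
        x = self.norm1(x + self.cross(x, z, z)[0])
        # two-layer FFN with an activation in between, residual + norm
        x = self.norm2(x + self.ffn(x))
        return x                      # final search-area feature vectors
```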
S23, inputting the final feature vector of the search area into the position information enhancement module for enhancement, obtaining the enhanced final feature vector of the search area;
The position information enhancement module PIE, shown in fig. 3, comprises: the final feature vector of the search area is input and reshaped to obtain the final feature map X of the search area, with C channels, height H and width W; each channel of the feature map undergoes one-dimensional horizontal global pooling and one-dimensional vertical global pooling, giving a horizontal intermediate feature map p_h and a vertical intermediate feature map p_w; the vertical intermediate feature map is passed through a permute transformation and concatenated with the horizontal intermediate feature map along the spatial dimension, a 1×1 convolution followed by BatchNorm batch normalization gives T, and T is multiplied by the feature map obtained by applying the Relu linear rectification function to T, yielding an intermediate feature map f that encodes the spatial information in the horizontal and vertical directions; the intermediate feature map f is split along the spatial dimension into an independent height tensor f_h and width tensor f_w, each of which passes through a 1×1 convolution and a sigmoid activation function to give the height weight S_h and the width weight S_w; these are multiplied with the reshaped input of the position information enhancement module, the enhanced final feature map Y of the search area is output, and after reshaping the enhanced final feature vector of the search area is output.
Unlike channel attention, the position information enhancement module also considers encoding spatial information: we perform one-dimensional horizontal global pooling and one-dimensional vertical global pooling along the horizontal and vertical directions respectively. Global pooling is typically used in channel attention to strengthen the encoding of spatial information, but it compresses the global spatial information into a single channel descriptor, which makes it very difficult to preserve position information, and position information is critical for capturing the spatial location of the target in a tracking task. To obtain more accurate position information over the spatial region, we therefore pool over two spatial extents, along the horizontal and along the vertical coordinate, so that the output of the c-th channel at height h can be written as
p_h,c(h) = (1/W) Σ_{0≤k<W} X_c(h, k),
and likewise the output of the c-th channel at width w can be written as
p_w,c(w) = (1/H) Σ_{0≤k<H} X_c(k, w),
where k in turn indexes the k-th row or the k-th column.
These two transformations aggregate features along the two spatial directions and produce a pair of direction-aware feature maps. Unlike the earlier global pooling, this lets the attention capture long-range dependencies along one spatial direction while preserving accurate position information along the other; each element of the two feature maps reflects whether the target object of interest is present in the corresponding row or column, which allows the model to locate the target object more accurately.
The aggregated feature maps produced by the one-dimensional global pooling are first concatenated and then transformed with a 1×1 convolution. The formulas are as follows:
T = Conv(Concat(p_h, p_w)),
f_t = φ(T),
f = T × f_t,
where the concatenation is along the spatial dimension. Together with the nonlinear activation function this yields an intermediate feature map f that encodes the spatial information in the horizontal and vertical directions, with r being the reduction ratio that controls the size of the block. We then split the feature map f along the spatial dimension into two independent tensors f_h and f_w, and two 1×1 convolutions convert f_h and f_w back into tensors with the same number of channels as the original input, each followed by a sigmoid activation function. The formulas are as follows:
S_h = τ(Conv(f_h)),
S_w = τ(Conv(f_w)).
We take the outputs S_h and S_w as attention weights, and the output Y of the position information enhancement module can finally be written as
Y_c(i, j) = X_c(i, j) × S_h,c(i) × S_w,c(j),
where the subscript c denotes the feature map of the c-th channel, i and j are the horizontal and vertical coordinates, τ(·) denotes the sigmoid activation function, and φ(·) denotes the Relu linear rectification function.
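A sketch of the PIE module following the formulas above; the global pooling is taken to be average pooling and the reduction ratio r = 16 is an assumed value:

```python
import torch
import torch.nn as nn

class PositionInfoEnhancement(nn.Module):
    def __init__(self, channels=256, r=16):
        super().__init__()
        mid = max(8, channels // r)                 # r controls the block size
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.relu = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):                           # x: (B, C, H, W) reshaped input
        _, _, h, w = x.shape
        p_h = x.mean(dim=3, keepdim=True)           # 1-D horizontal pooling: (B, C, H, 1)
        p_w = x.mean(dim=2, keepdim=True)           # 1-D vertical pooling:   (B, C, 1, W)
        p_w = p_w.permute(0, 1, 3, 2)               # permute so it can be concatenated
        t = self.bn(self.conv1(torch.cat([p_h, p_w], dim=2)))   # T: (B, mid, H+W, 1)
        f = t * self.relu(t)                        # f = T x phi(T)
        f_h, f_w = torch.split(f, [h, w], dim=2)    # split along the spatial dimension
        s_h = torch.sigmoid(self.conv_h(f_h))                      # S_h: (B, C, H, 1)
        s_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))  # S_w: (B, C, 1, W)
        return x * s_h * s_w                        # enhanced feature map Y
```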
S3, inputting the enhanced final feature vector of the search area into the prediction head, and generating the target tracking result through a classification function and a bounding box regression function.
The prediction head receives H_x × W_x feature vectors and outputs H_x × W_x binary classification and regression results. The predictions of the feature vectors that fall inside the ground-truth bounding box are taken as positive samples and the rest as negative samples; the class label of a positive sample is foreground and that of a negative sample is background. All samples are used for the classification loss, whereas only the positive samples are used for the regression loss, so that each feature vector predicts the target at its corresponding location. We use the standard binary cross entropy loss for classification.
For regression, we adopt the L1 loss together with the generalized IOU loss as the regression loss.
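A sketch of the resulting training objective; the loss weights and the (x1, y1, x2, y2) box format are assumptions, and generalized_box_iou is the GIoU implementation from torchvision:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou

def tracking_loss(cls_logits, cls_labels, pred_boxes, gt_boxes, pos_mask,
                  w_cls=1.0, w_l1=5.0, w_iou=2.0):
    """Binary cross entropy over all H_x * W_x feature vectors; L1 + GIoU
    regression over the positive samples only."""
    cls_loss = F.binary_cross_entropy_with_logits(cls_logits, cls_labels.float())
    pos_pred, pos_gt = pred_boxes[pos_mask], gt_boxes[pos_mask]
    l1_loss = F.l1_loss(pos_pred, pos_gt)
    giou = torch.diag(generalized_box_iou(pos_pred, pos_gt))   # element-wise GIoU
    iou_loss = (1.0 - giou).mean()
    return w_cls * cls_loss + w_l1 * l1_loss + w_iou * iou_loss
```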
the above method is verified on a plurality of authoritative data sets such as GOT-10k, laSOT, trackingNet, OTB100, UAV123, VOT2020 and the like, and tables 1-4 and figures 4-7 show that the tracker of the method performs more excellent compared with the indexes performed by other methods.
Table 1. Detailed data of the comparison with other trackers on the GOT-10k dataset
Table 2. Detailed data of the comparison with other trackers on the TrackingNet dataset
Table 3. Detailed data of the comparison with other trackers on the LaSOT dataset
Table 4. Detailed data of the comparison with other trackers on the OTB100 and UAV123 datasets
Finally, it should be noted that the above method may be converted into software program instructions, which may be implemented by a Transformer target tracker based on symmetrical cross attention and position information enhancement comprising a processor and a memory, or by computer instructions stored in a non-transitory computer-readable storage medium. The integrated units implemented in the form of software functional units may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to perform part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
In summary, the Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement have the following beneficial effects:
(1) The invention adds a symmetrical cross attention module and a position information enhancement module to the original Transformer tracking structure. The feature information of the template and the search area is first extracted by a backbone network and then sent to a feature fusion network; four symmetrical cross attention modules are stacked in the encoder network to fuse the search area features with the template features more effectively, the decoder network fuses the features further, and the position information enhancement module further enhances the fused feature information; finally, on the basis of the enhanced features, the prediction head performs classification and bounding box regression and generates the tracking result. Through repeated fusion and enhancement, the features of the template image and the search area image are fused better and the recognition accuracy of target tracking is improved, while the cross attention structure adopted reduces the amount of computation and increases the tracking speed;
(2) The invention replaces the cross-correlation operation with an attention mechanism; the symmetrical cross attention module deeply fuses the search area features with the template features and performs global information interaction, avoiding local optima, so that the features of the template image and the search area image are fused better and their feature similarity is judged more reliably;
(3) Through the position information enhancement module, the invention makes full use of spatial position information, which helps the prediction head recognize and locate the target better in the prediction stage, effectively improves the accuracy with which the prediction head locates the target, and gives the tracker superior performance;
(4) The method can effectively cope with challenges such as illumination change, deformation, scale change and interference, provides high-precision and highly robust target tracking, offers good real-time performance and robustness, and is well suited to visual single-target tracking.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims (10)

1. A Transformer target tracking method based on symmetrical cross attention and position information enhancement, comprising:
S1, inputting a template image and a search area image into a backbone network as a pair of images, and extracting feature maps of the template and the search area respectively;
S2, sending the template and search area feature maps into a feature fusion network, wherein the feature fusion network comprises an encoder network, a decoder network and a position information enhancement module connected in sequence, and obtaining the enhanced final feature vector of the search area through fusion and enhancement;
S3, inputting the enhanced final feature vector of the search area into a prediction head, and generating a target tracking result through a classification function and a bounding box regression function.
2. The Transformer target tracking method based on symmetrical cross attention and position information enhancement according to claim 1, wherein the step S2 comprises:
S20, preprocessing the template and search area feature maps and converting them into template and search area feature vectors;
S21, inputting the template and search area feature vectors into an encoder network, wherein the encoder network uses several layers of symmetrical cross attention modules to fuse the template and search area features, obtaining the fused template and search area feature vectors respectively;
S22, further fusing the fused template and search area feature vectors through a decoder network to obtain the final feature vector of the search area;
S23, inputting the final feature vector of the search area into a position information enhancement module for enhancement, obtaining the enhanced final feature vector of the search area.
3. The Transformer target tracking method based on symmetrical cross attention and position information enhancement according to claim 2, wherein the step S20 comprises: adding sine position codes to the template and search area feature maps, reducing the channel dimension by using a 1×1 convolution, and flattening the feature maps along the spatial dimension to obtain the template and search area feature vectors respectively.
4. The Transformer target tracking method based on symmetrical cross attention and position information enhancement according to claim 2, wherein in the step S21, the symmetrical cross attention module comprises:
inputting the template and search area feature vectors, together with the position codes corresponding to the template and the search area, into a multi-head cross attention module to obtain a preliminarily fused feature vector X_t:
X_t = MultiHead(X_zq + P_zq, X_xk + P_xk, X_xv),
MultiHead(Q, K, V) = Concat(H_1, ..., H_h) W^O;
inputting the template feature vector and its position code into a multi-head cross attention module to be re-fused with the preliminarily fused feature vector and its position code, and adding the result to the template feature vector to obtain a fused template feature vector X_z; passing X_z through a feed-forward network FFN and adding the result to X_z to obtain the fused template feature vector output by the module:
X_z = X_zq + MultiHead(X_zq + P_zq, X_tk + P_tk, X_tv);
meanwhile, inputting the search area feature vector and its position code into a multi-head cross attention module to be re-fused in the same way, and adding the result to the search area feature vector to obtain a fused search area feature vector X_x; passing X_x through a feed-forward network FFN and adding the result to X_x to obtain the fused search area feature vector output by the module:
X_x = X_xq + MultiHead(X_xq + P_xq, X_tk + P_tk, X_tv),
wherein W^O and the other weight matrices are parameter matrices, X_zq is the input of the template branch, P_zq is the position code corresponding to the template branch, X_xk and X_xv are the inputs of the search area branch, P_xk is the position code corresponding to the search area branch, X_tk and X_tv are the inputs of the preliminarily fused feature vector branch, and P_tk is the position code corresponding to the preliminarily fused feature vector.
5. The Transformer target tracking method based on symmetrical cross attention and position information enhancement according to claim 2, wherein in the step S22, the decoder network comprises: inputting the fused template and search area feature vectors into a multi-head cross attention module for re-fusion, then passing the result through a feed-forward network FFN with an activation function inserted, applying residual connections around the multi-head cross attention module and the FFN, and outputting the final feature vector of the search area after norm normalization.
6. The Transformer target tracking method based on symmetrical cross attention and position information enhancement according to claim 2, wherein in the step S23, the position information enhancement module comprises:
inputting the final feature vector of the search area and reshaping it to obtain the final feature map X of the search area, with C channels, height H and width W;
applying one-dimensional horizontal global pooling and one-dimensional vertical global pooling to each channel of the feature map to obtain a horizontal intermediate feature map p_h and a vertical intermediate feature map p_w;
applying a permute transformation to the vertical intermediate feature map, concatenating it with the horizontal intermediate feature map along the spatial dimension, applying a 1×1 convolution followed by BatchNorm batch normalization to obtain T, and multiplying T by the feature map obtained by applying the Relu linear rectification function to T, yielding an intermediate feature map f that encodes the spatial information in the horizontal and vertical directions:
T = Conv(Concat(p_h, p_w)),
f_t = φ(T),
f = T × f_t;
splitting the intermediate feature map f along the spatial dimension into an independent height tensor f_h and width tensor f_w, applying a 1×1 convolution and a sigmoid activation function to each to obtain the height weight S_h and the width weight S_w, multiplying them with the reshaped input of the position information enhancement module, outputting the enhanced final feature map Y of the search area, and outputting the enhanced final feature vector of the search area after reshaping:
S_h = τ(Conv(f_h)),
S_w = τ(Conv(f_w)),
Y_c(i, j) = X_c(i, j) × S_h,c(i) × S_w,c(j),
wherein the subscript c denotes the feature map of the c-th channel, i and j are the horizontal and vertical coordinates, k denotes the k-th row or k-th column of the pooled feature maps, τ(·) denotes the sigmoid activation function, and φ(·) denotes the Relu linear rectification function.
7. The method according to claim 1, wherein in the step S3, the classification network uses a binary cross entropy loss function, and the bounding box regression network uses L1 loss and IOU loss functions.
8. The Transformer target tracking method based on symmetrical cross attention and position information enhancement according to claim 1, wherein in the step S1, the template image and the search area image are preprocessed: the image patch of the template image is obtained by expanding the target bounding box of the first frame of the video sequence outwards to twice its side length; the image patch of the search area image is obtained by expanding the target bounding box of the previous frame outwards to four times its side length; the backbone network is a modified ResNet50 network in which the last stage and the fully connected layer are removed, the downsampling convolution stride of the fourth stage is changed from 2 to 1, and the 3×3 convolutions of the fourth stage are changed to dilated convolutions with a dilation rate of 2.
9. The Transformer target tracking method based on symmetrical cross attention and position information enhancement according to claim 4, wherein the number of layers of the symmetrical cross attention module is 4; norm normalization is performed before the fused template feature vector X_z, the fused search area feature vector X_x and their outputs after the feed-forward network are obtained.
10. A Transformer target tracker based on symmetrical cross attention and position information enhancement, comprising:
at least one processor; and at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1-9.
CN202310742715.3A 2023-06-21 2023-06-21 Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement Pending CN116862949A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310742715.3A CN116862949A (en) 2023-06-21 2023-06-21 Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310742715.3A CN116862949A (en) 2023-06-21 2023-06-21 Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement

Publications (1)

Publication Number Publication Date
CN116862949A true CN116862949A (en) 2023-10-10

Family

ID=88222551

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310742715.3A Pending CN116862949A (en) 2023-06-21 2023-06-21 Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement

Country Status (1)

Country Link
CN (1) CN116862949A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117197249A (en) * 2023-11-08 2023-12-08 北京观微科技有限公司 Target position determining method, device, electronic equipment and storage medium
CN117197249B (en) * 2023-11-08 2024-01-30 北京观微科技有限公司 Target position determining method, device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination