CN117315293A - Transformer-based space-time context target tracking method and system - Google Patents
Transformer-based space-time context target tracking method and system
- Publication number
- CN117315293A CN117315293A CN202311254115.9A CN202311254115A CN117315293A CN 117315293 A CN117315293 A CN 117315293A CN 202311254115 A CN202311254115 A CN 202311254115A CN 117315293 A CN117315293 A CN 117315293A
- Authority
- CN
- China
- Prior art keywords
- transformer
- template
- feature
- head
- output
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/62—Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/0985—Hyperparameter optimisation; Meta-learning; Learning-to-learn
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
- G06V10/751—Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/766—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a Transformer-based space-time context target tracking method and system. The method comprises the following steps: S1, acquiring and preprocessing an image; S2, inputting the image into a backbone network and obtaining search region features, initial template features and dynamically updated template features through a Transformer encoder; S3, using these outputs as the input of an interactive feature enhancement module, which adopts a multi-head cross-attention mechanism to obtain mixed features; S4, using the mixed features and a target query as the input of a Transformer decoder in which the masked self-attention part is replaced by a multi-head self-attention layer; after the output of the Transformer decoder is obtained, computing the similarity between this output and the mixed feature embedding, reshaping the features, and computing the predicted bounding box as the expectation of the corner probability distributions; S5, using the output of the Transformer decoder as the input of a score head, which consists of a fully connected layer FFN and a softmax activation function, and setting a threshold to decide whether to update the template.
Description
Technical Field
The invention belongs to the technical field of computer vision, mainly relates to technologies such as target tracking, feature enhancement and feature fusion, and particularly relates to a Transformer-based space-time context target tracking method and system.
Background
Target tracking is an important branch of computer vision and a core technology in fields such as video surveillance, intelligent transportation and autonomous driving. After years of continuous research and development, target tracking technologies can be broadly divided into two categories, generative and discriminative. The generative approach first builds a target model or extracts target features, then searches for similar features in subsequent frames and localizes the target by stepwise iteration; its drawback is that the background information of the image is not fully exploited, and model construction is affected by motion blur, target deformation and occlusion. The discriminative approach introduces background information into the tracking model and obtains the target position in the current frame by contrasting the target model against the background; it can overcome the drawbacks of the generative approach to a certain extent and achieve good target tracking.
The Transformer architecture is a neural network model based on the self-attention mechanism and has been widely applied in computer vision in recent years, especially after the Vision Transformer was proposed. The self-attention mechanism in the Transformer has the advantage of capturing global dependencies in the input sequence rather than being limited to local context, while position encoding can embed position information into the input sequence so that the self-attention mechanism better models global context correlations. On this basis, the invention adopts the Transformer as the feature extractor and feature fusion network, combining spatio-temporal context information to avoid falling into local optima; a tracking model based entirely on the Transformer can cope well with challenges such as deformation and occlusion during long-term tracking, thereby improving the robustness of the target tracking system.
Disclosure of Invention
In view of this situation, the invention discloses a Transformer-based space-time context target tracking method and system, the main content of which is as follows: (1) To better obtain spatial context information and avoid the lack of global perception and interaction caused by the limited receptive field of a ResNet network, a Vision Transformer is used as the backbone network to extract image features, so that local and global information are mixed and feature extraction is strengthened. (2) To avoid losing semantic information and falling into local optima when the template and the search region are matched, the invention uses an interactive feature enhancement module for feature fusion; by modeling with a cross-attention mechanism between different input sequences, the representation and generalization capability of the model is strengthened. (3) To better locate the bounding box of the object, bounding box corner probability prediction is used: a target query is introduced at the input of the Transformer decoder, the similarity between the mixed features and the decoder output is computed, and the predicted box is finally obtained by computing the expectation of the corner probability distributions. (4) To cope with challenges such as occlusion and deformation encountered during long-term target tracking, the invention introduces temporal information to dynamically update the template, so that spatio-temporal context information is better combined; to reduce post-processing hyperparameters, an FFN and a softmax function complete the classification of foreground and background, a threshold is set, the template is updated when the computed score is greater than the threshold, and otherwise it is not updated.
The invention adopts the following technical scheme:
A Transformer-based space-time context target tracking method comprises the following steps:
S1, acquiring and preprocessing an image:
A tracking target image is acquired and preprocessed.
S2, extracting characteristics from a backbone network:
The image preprocessed in step S1 is input into the backbone network Vision Transformer; flattening and linear mapping operations are performed first, then position codes are added correspondingly to obtain the slice embedding layer, and the search region features, initial template features and dynamically updated template features are then obtained respectively through a Transformer encoder.
S3, feature enhancement and fusion:
The output of step S2 is used as the input of an interactive feature enhancement module. This module is adapted from a Transformer decoder in which the original masked self-attention part is removed; only a multi-head cross-attention mechanism is used, so that the search region features query the initial template features and the dynamic template features, finally producing the output mixed features and enhancing the interaction of global features.
S4, boundary box prediction:
The mixed features obtained in step S3 are used as the input of a Transformer decoder in which the masked self-attention part is replaced by a multi-head self-attention layer, so that useful context information is attended to adaptively and the feature representation is enhanced. After the output of the Transformer decoder is obtained, the similarity between this output and the mixed feature embedding is computed, the features are reshaped, and finally the bounding box is predicted as the expectation of the corner probability distributions.
S5, predicting a score head:
The output obtained from the Transformer decoder in step S4 is used as the input of a score head; the essence of the score head is to classify foreground and background in the image, and it consists of an FFN and a softmax activation function. Finally, whether to update the template is decided through a set threshold.
Preferably, in step S1, the GOT-10K dataset is used to train and validate the model, and the input consists of a search region, an initial template and a dynamically updated template. Before the images are input into the backbone network they are preprocessed and divided into slices, which reduces the number of parameters and improves the training speed of the model.
Preferably, in step S1, the search region image is of size ℝ^(C×H_x×W_x), where C is the number of channels, H_x is the height of the picture and W_x is the width of the picture; the preferred search region size of the invention is 256×256×3. The initial template frame and the dynamically updated template frame are of size ℝ^(C×H_z×W_z); the preferred sizes of the invention are both 128×128×3. The search region is obtained by expanding a region several times the size of the target in four directions from the centre coordinate of the previous frame, so that it covers the possible movement range of the target. The search region is preprocessed by decomposing a frame of the picture into H_x·W_x/n² slices, each of size n×n×C, and each slice is input into the backbone network as a token; the initial template region and the dynamically updated template region are decomposed into H_z·W_z/n² slices, each also of size n×n×C. In the invention, n is preferably 16, and the number of search-region slices is 256.
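As an illustration of this slicing step, the following minimal PyTorch sketch splits a 256×256×3 search region and a 128×128×3 template into 16×16 slices; the function name and tensor layout are assumptions for illustration only, not part of the claimed method.

```python
import torch

def split_into_patches(image: torch.Tensor, n: int = 16) -> torch.Tensor:
    """Split a (C, H, W) image into H*W/n^2 flattened slices of size n*n*C."""
    C, H, W = image.shape
    assert H % n == 0 and W % n == 0, "image size must be divisible by the slice size"
    patches = image.unfold(1, n, n).unfold(2, n, n)        # (C, H/n, W/n, n, n)
    patches = patches.permute(1, 2, 0, 3, 4).contiguous()  # (H/n, W/n, C, n, n)
    return patches.view(-1, n * n * C)                     # one token per slice

search = torch.rand(3, 256, 256)        # search region, 256x256x3
template = torch.rand(3, 128, 128)      # initial or dynamic template, 128x128x3
print(split_into_patches(search).shape)    # torch.Size([256, 768])  -> 256 tokens
print(split_into_patches(template).shape)  # torch.Size([64, 768])   -> 64 tokens
```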
Preferably, in step S2, each slice is flattened and linearly mapped into a vector of dimension n×n×C; the generated position codes are then added to the corresponding slice positions to obtain the slice embedding layer, and the obtained slice embedding layer is input into the Transformer encoder.
Preferably, in step S2, the input to the Transformer encoder first passes through a layer normalization and then enters the multi-head self-attention layer, followed by a residual network structure; finally, normalization processing and a multi-layer perceptron are applied once more to obtain the output of the encoder part, namely the feature vectors of the extracted picture.
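For concreteness, a minimal sketch of this slice embedding and encoder block is given below. The embedding dimension, head count and class names are illustrative assumptions, and the code targets a recent PyTorch release (the batch_first flag postdates the PyTorch 1.7.0 environment mentioned in the embodiment); it is a sketch of the described structure, not an authoritative implementation.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Linearly map each flattened n x n x C slice and add a learned position code."""
    def __init__(self, num_patches: int, patch_dim: int = 768, dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        return self.proj(patches) + self.pos               # slice embedding layer

class EncoderBlock(nn.Module):
    """Pre-norm block: norm -> multi-head self-attention -> residual -> norm -> MLP -> residual."""
    def __init__(self, dim: int = 768, num_heads: int = 12, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # self-attention + residual
        x = x + self.mlp(self.norm2(x))                     # MLP + residual
        return x

# Search region: 256 tokens of dimension 16*16*3 = 768.
embed = PatchEmbedding(num_patches=256)
encoder = nn.Sequential(*[EncoderBlock() for _ in range(6)])
search_features = encoder(embed(torch.rand(1, 256, 768)))
```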
Preferably, in step S3, the search region feature obtained from the backbone network is used as the query, and the initial template feature and the dynamically updated template feature are concatenated to obtain the fusion feature X_2, which is input to the interactive feature enhancement module as the key-value pairs.
Preferably, in step S3, the interactive feature enhancement module uses a multi-head cross-attention mechanism in which, given a query Q, a key K and a value V, the attention function uses the scaled dot product, with the following formula:

Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V

where Q = X_1×W^Q, K = X_2×W^K, V = X_2×W^V; W^Q, W^K and W^V are all parameter matrices obtained from the training data and used for feature extraction; X_1 is the search region feature; X_2 is the fusion feature obtained by concatenating the initial template feature and the dynamically updated template feature; K^T is the transpose of the key K; and d_k is the dimension of matrix K. Because the features exhibit more than one kind of correlation, the multi-head attention mechanism is arranged in several heads; the single-head results are concatenated and then multiplied by a matrix W^O, with the following formula:

MultiHead(Q, K, V) = Concat(H_1, ..., H_n)·W^O

where the matrix W^O is a learnable parameter matrix and H_n is the n-th attention head. The vector matrices H_i output by the single heads are concatenated and then projected through W^O to extract features from the multiple vector matrices, where each single-head output H_i is computed as follows:

H_i = Attention(X_1·W_i^Q, X_2·W_i^K, X_2·W_i^V)
where W_i^Q, W_i^K and W_i^V are the parameter matrices of the i-th head obtained from the training data. The spatial position codes are added correspondingly to the query Q and the key K, which are then input into the multi-head cross-attention mechanism; the result is connected with the residual branch through normalization processing, with the following formula:

X' = Norm(X_1 + MultiHead(Q + P_q, K + P_k, V))

where P_q is the spatial position code of the query Q part and P_k is the spatial position code of the key K part.

Normalization processing and a residual connection are carried out once again, after the feed-forward network FFN, to obtain the final output mixed feature, with the following formula:

X_hybrid = Norm(X' + FFN(X'))

Preferably, in step S4, after the output of the Transformer decoder is obtained, the similarity between it and the mixed feature vector embedding is computed, and the obtained similarity score is multiplied element-wise with the mixed feature to enhance important regions and weaken less discriminative regions; the new feature sequence is reshaped into a feature map; two corner probability maps P_tl(x, y) and P_br(x, y) are output through a fully convolutional network, where x and y are coordinate points, P_tl(x, y) is the probability distribution of the top-left corner and P_br(x, y) is the probability distribution of the bottom-right corner; the fully convolutional network structure consists of L stacked convolution layers, batch normalization layers and ReLU functions; finally, the coordinates of the predicted box are computed as the expectation of the corner probability distributions, with the following formula:

(x̂_tl, ŷ_tl) = ( Σ_{y=0}^{H} Σ_{x=0}^{W} x·P_tl(x, y), Σ_{y=0}^{H} Σ_{x=0}^{W} y·P_tl(x, y) )
(x̂_br, ŷ_br) = ( Σ_{y=0}^{H} Σ_{x=0}^{W} x·P_br(x, y), Σ_{y=0}^{H} Σ_{x=0}^{W} y·P_br(x, y) )

This models the uncertainty in the coordinate estimation and generates more accurate and more robust prediction results for target tracking. After the prediction box is obtained, a combination of the L1 loss and the IoU loss is selected to assist the network model in bounding box prediction, with the following formula:

L_loc = L_IoU(b_i, b̂_i) + λ·L_1(b_i, b̂_i)

where b_i and b̂_i are the bounding box label and the predicted bounding box respectively, and λ is a weight hyperparameter between the two loss functions that adjusts their relative importance.

Preferably, in step S5, the threshold is set to 0.7; the confidence score output by the score head is compared with the threshold, and the template is updated when the score is greater than 0.7, otherwise it is not updated. The new template is cropped from the search region image and then input into the backbone network for feature extraction. A binary cross-entropy loss function is used when optimizing the score head, with the following formula:

L_cls = -[ y_i·log(P_i) + (1 - y_i)·log(1 - P_i) ]

where y_i is the binary label (0 or 1) and P_i is the predicted probability of the label y_i; the closer P_i is to 1, the closer the value of the loss function is to 0, whereas the closer P_i is to 0, the larger the loss function becomes.
The invention also discloses a Transformer-based space-time context target tracking system, which comprises the following modules:
An image acquisition and preprocessing module: acquires a tracking target image and preprocesses it;
A backbone network feature extraction module: the image preprocessed by the image acquisition and preprocessing module is input into the backbone network Vision Transformer; flattening and linear mapping operations are performed first, then position codes are added correspondingly to obtain the slice embedding layer, and the search region features, initial template features and dynamically updated template features are then obtained respectively through a Transformer encoder;
A feature enhancement and fusion module: the output of the backbone network feature extraction module is used as the input of an interactive feature enhancement module, which adopts a multi-head cross-attention mechanism so that the search region features query the fusion feature of the initial template features and the dynamic template features, thereby obtaining the mixed features;
A bounding box prediction module: the mixed features obtained by the feature enhancement and fusion module and a target query are used as the input of a Transformer decoder in which the masked self-attention part is replaced by a multi-head self-attention layer, so that useful context information is attended to adaptively; after the output of the Transformer decoder is obtained, the similarity between this output and the mixed feature embedding is computed, the features are reshaped, and finally the bounding box is predicted as the expectation of the corner probability distributions;
A score head prediction module: the output obtained from the Transformer decoder in the bounding box prediction module is used as the input of a score head, which consists of a fully connected layer FFN and a softmax activation function, and whether to update the template is finally decided through a set threshold.
The invention discloses a Transformer-based space-time context target tracking method and system, which have the following beneficial technical effects:
(1) First, the invention improves the feature extraction part of the backbone network by using a Vision Transformer in place of a CNN model, which improves the extraction of global information, reduces the model's sensitivity to post-processing hyperparameters, and improves the training speed and performance of the model.
(2) Second, the invention provides an interactive feature enhancement module in which the search region features query and match the initial template features and the dynamically updated template features to finally obtain the fusion features, strengthening the feature correlation between the fused template and the search region.
(3) Regression and classification are split into two stages, namely a bounding box prediction part and a score head prediction part, and temporal information is introduced, so that the fusion features are spatially enhanced and the context semantic information is fully utilized. In the invention, a threshold is set at the score head to decide whether the template is updated, and the updated template together with the initial template serves as input of the model, improving its robustness.
Drawings
FIG. 1 is a diagram of the overall framework of the network model involved in the Transformer-based spatio-temporal context object tracking method in accordance with a preferred embodiment of the present invention.
Fig. 2 is a diagram of the Vision Transformer model structure.
Fig. 3 is a block diagram of an interactive enhancement module.
Fig. 4 is a structural diagram of the Transformer decoder.
Fig. 5 is a schematic diagram of a boundary prediction framework.
Fig. 6 is a schematic diagram of a scoring head prediction framework.
FIG. 7 is a block diagram of the Transformer-based spatio-temporal context object tracking system in accordance with a preferred embodiment of the invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and specific embodiments, but it should be noted that the invention is not limited to the following examples.
The environment of the preferred embodiment of the invention is as follows: CUDA 10.2, the deep learning framework PyTorch 1.7.0, an Intel Core i7-12700H CPU with 16 GB of memory, and an NVIDIA GeForce RTX 3050 GPU with 4 GB of memory. The entire network model is described in detail below in conjunction with Fig. 1.
The Transformer-based space-time context target tracking method comprises the following specific steps:
S1, acquiring and preprocessing an image:
First, for tracking target data acquisition, this embodiment uses the GOT-10K dataset, which is large in scale and diverse, provides challenging tracking scenes, and offers evaluation metrics that allow a tracking model to be assessed well. Owing to the characteristics of the Vision Transformer model, some preprocessing operations are needed before the data are input into the network model: the search region image is of size ℝ^(C×H_x×W_x), the initial template frame and the dynamically updated template frame are of size ℝ^(C×H_z×W_z), and the search region is four times the size of the target, centred at the target's centre coordinate in the previous frame, so that it generally encompasses the possible range of motion of the target. The search region is preprocessed first by decomposing a frame of the picture into H_x·W_x/n² slices; each slice is input into the backbone network as a token and has size n×n×C. The initial template region and the dynamically updated template region are decomposed into H_z·W_z/n² slices, each also of size n×n×C.
S2, extracting characteristics from the backbone network:
The present invention proposes to remove the final MLP Head and Class token parts of the original Vision Transformer, since they serve the image classification operation for which the Vision Transformer was originally proposed. The preprocessed data are input into the backbone network Vision Transformer: each slice is first flattened and linearly mapped into a vector of dimension n×n×C, and the generated position codes are then added to the corresponding slice positions to obtain the slice embedding layer. Because the task is tracking, no CLS token needs to be generated, and the obtained slice embedding layer is input into the Transformer encoder. As shown for the encoder part of the Vision Transformer structure diagram in Fig. 2, the input first passes through a layer normalization and then enters the multi-head self-attention layer, which is set to 6 layers in this embodiment; a residual network structure follows, and finally normalization processing and a multi-layer perceptron are applied once more to obtain the output of the encoder part, namely the feature vectors of the extracted picture.
S3, interactive enhancement features:
The search region feature X_1 obtained from the backbone network is used as the query, and the initial template feature and the dynamically updated template feature are concatenated to obtain the fusion feature X_2, which is input to the interactive enhancement module as the key-value pairs. As detailed in connection with Fig. 3, the module is essentially a Transformer decoder with the masked self-attention part removed; a multi-head cross-attention mechanism is employed so that the search region feature X_1 queries the fusion feature X_2 of the initial template and the dynamically updated template. This strengthens the feature interaction between the fused template and the search region within the tracking frame and makes fuller use of global feature information. The function of the module is described below in conjunction with its formulas; its core is the multi-head cross-attention mechanism in which, given a query Q, a key K and a value V, the attention function employs the scaled dot product, with the following formula:

Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V

where Q = X_1×W^Q, K = X_2×W^K, V = X_2×W^V; W^Q, W^K and W^V are all parameter matrices obtained from the training data, whose function is to extract features, and d_k is the dimension of matrix K. Because the features exhibit more than one kind of correlation, the multi-head attention mechanism is used; this embodiment sets it to 6 heads, and the results of the 6 single heads are concatenated and then multiplied by the matrix W^O. The formula is as follows:

MultiHead(Q, K, V) = Concat(H_1, ..., H_n)·W^O

The matrix W^O is a learnable parameter matrix; the vector matrices H_i output by the single heads are concatenated and then projected through W^O to extract features from the 6 vector matrices, where each single-head output H_i is computed as follows:

H_i = Attention(X_1·W_i^Q, X_2·W_i^K, X_2·W_i^V)

One difference between the interactive feature enhancement module and a Transformer decoder is the position coding: the invention only adds the spatial position codes correspondingly to the query Q and the key K, which are then input into the multi-head cross-attention mechanism and connected with the residual branch through normalization processing, with the following formula:

X' = Norm(X_1 + MultiHead(Q + P_q, K + P_k, V))

where P_q and P_k are the spatial position codes of the query Q and the key K respectively. In addition, as in the Transformer decoder, a feed-forward network FFN is used to enhance the fitting capability of the model; normalization processing and a residual connection are then performed once again to obtain the final output mixed feature, with the following formula:

X_hybrid = Norm(X' + FFN(X'))
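A minimal PyTorch sketch of this interactive feature enhancement module is given below: the search region tokens act as the query, the concatenated template tokens act as key and value, spatial position codes are added to Q and K only, and the result passes through residual connections, normalization and an FFN. All dimensions, the head count and the class name are illustrative assumptions, and a recent PyTorch release is assumed.

```python
import torch
import torch.nn as nn

class InteractiveFeatureEnhancement(nn.Module):
    """Step S3 sketch: X1 (search region) queries X2 (concatenated templates)."""
    def __init__(self, dim: int = 768, num_heads: int = 6, ffn_dim: int = 2048):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x1, x2, pos_q, pos_k):
        # Spatial position codes are added to the query and the key only.
        attn = self.cross_attn(query=x1 + pos_q, key=x2 + pos_k, value=x2,
                               need_weights=False)[0]
        x = self.norm1(x1 + attn)            # residual connection + normalization
        return self.norm2(x + self.ffn(x))   # FFN, then residual + normalization again

# X1: 256 search-region tokens; X2: initial template tokens concatenated with dynamic template tokens.
x1 = torch.rand(1, 256, 768)
x2 = torch.cat([torch.rand(1, 64, 768), torch.rand(1, 64, 768)], dim=1)
module = InteractiveFeatureEnhancement()
mixed = module(x1, x2, pos_q=torch.zeros_like(x1), pos_k=torch.zeros_like(x2))
```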
S4, bounding box prediction head:
The output mixed feature vector obtained by the interactive enhancement module and a target query are used as the inputs of the Transformer decoder. As described in detail in connection with Fig. 4, this module is essentially the Transformer decoder part in which the masked multi-head attention mechanism is replaced by a multi-head self-attention mechanism, because the mask matrix introduced by masked multi-head attention is mainly used to process variable-length sequences and produces a serial output. In the invention the input length is fixed, and the multi-head self-attention mechanism enhances the model's ability to express multiple points of attention while providing the advantage of parallel computation. The position code is shared with the interactive enhancement module. A difference from the standard Transformer decoder is that the invention uses position coding for the multi-head cross-attention part, where the query Q and the key K are likewise added to the spatial position codes, while the target query itself is correspondingly added and undergoes the self-attention operation. The multi-head self-attention part is used to learn the features of the anchor box, and the multi-head cross-attention part predicts the coordinates and class of the image bounding box on the basis of the global feature information of the image and the feature information of the anchor box. Thus, after the output of the Transformer decoder is obtained, the similarity between this output and the mixed feature vector embedding is computed, as described in detail in connection with Fig. 5. The resulting similarity score is then multiplied element-wise with the mixed feature to enhance important regions and attenuate less discriminative ones. The new feature sequence is reshaped into a feature map. Two corner probability maps, P_tl(x, y) and P_br(x, y), representing the top-left corner and the bottom-right corner respectively, are output through a fully convolutional network composed of L stacked convolution layers, batch normalization layers and ReLU functions. Finally, the predicted box is computed as the expectation of the corner probability distributions, with the following formula:

(x̂_tl, ŷ_tl) = ( Σ_{y=0}^{H} Σ_{x=0}^{W} x·P_tl(x, y), Σ_{y=0}^{H} Σ_{x=0}^{W} y·P_tl(x, y) )
(x̂_br, ŷ_br) = ( Σ_{y=0}^{H} Σ_{x=0}^{W} x·P_br(x, y), Σ_{y=0}^{H} Σ_{x=0}^{W} y·P_br(x, y) )

In this embodiment, uncertainty in the position estimation is thereby modeled, generating more accurate and more robust prediction results for target tracking. After the prediction box is obtained, this embodiment selects a combination of the L1 loss and the IoU loss to assist the network model in bounding box prediction, with the following formula:

L_loc = L_IoU(b_i, b̂_i) + λ·L_1(b_i, b̂_i)

where b_i and b̂_i are the bounding box label and the predicted bounding box respectively, and λ is a weight hyperparameter between the two loss functions that adjusts their relative importance.
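The corner-expectation computation (a soft-argmax) and the combined loss above could look roughly as follows. The soft-argmax follows directly from the expectation formula; the exact weighting of the two loss terms and the value of lam are assumptions, since the description only names a single weight λ.

```python
import torch

def expected_corner(prob_map: torch.Tensor):
    """Expected (x, y) coordinate of a corner probability map of shape (H, W) that sums to 1."""
    H, W = prob_map.shape
    xs = torch.arange(W, dtype=torch.float32).view(1, W).expand(H, W)
    ys = torch.arange(H, dtype=torch.float32).view(H, 1).expand(H, W)
    return (prob_map * xs).sum(), (prob_map * ys).sum()

def box_loss(pred: torch.Tensor, gt: torch.Tensor, lam: float = 5.0) -> torch.Tensor:
    """Combined IoU + weighted L1 loss on (x1, y1, x2, y2) boxes; lam is illustrative."""
    l1 = torch.abs(pred - gt).mean()
    ix1, iy1 = torch.max(pred[0], gt[0]), torch.max(pred[1], gt[1])
    ix2, iy2 = torch.min(pred[2], gt[2]), torch.min(pred[3], gt[3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
             + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    iou = inter / (union + 1e-6)
    return (1.0 - iou) + lam * l1

# Corner probability maps from the fully convolutional head, normalized with softmax.
p_tl = torch.softmax(torch.rand(16 * 16), dim=0).view(16, 16)
p_br = torch.softmax(torch.rand(16 * 16), dim=0).view(16, 16)
x_tl, y_tl = expected_corner(p_tl)
x_br, y_br = expected_corner(p_br)
pred_box = torch.stack([x_tl, y_tl, x_br, y_br])
loss = box_loss(pred_box, torch.tensor([2.0, 3.0, 10.0, 12.0]))
```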
S5, predicting a scoring head:
Because the target is prone to occlusion, deformation and similar challenges during long-term tracking, the invention introduces temporal information into the model structure; using only the initial template frame as reference prevents the tracker from adapting to rapid changes in the target's appearance and leads to poor robustness. A dynamic template design is therefore adopted: specifically, the dynamic template is continuously updated over time and, together with the search region and the initial template, serves as the input of the network model, providing more temporal information for the whole framework. However, the target may be occluded or deformed during tracking, which makes the dynamic template unreliable, so a threshold must be set and whether the dynamic template is updated is decided according to the confidence score. The score head takes the output of the Transformer decoder as input; its structure consists of a fully connected layer and a softmax activation function, and the threshold is set to 0.7. Finally, the confidence score output by the score head is compared with the threshold: the template is updated when the score is greater than 0.7, otherwise it is not updated. The new template is cropped from the search region image and then input into the backbone network for feature extraction. Bounding box prediction essentially localizes the target, while the score head classifies the target against the background, and the two are carried out in two stages. A binary cross-entropy loss function is adopted when optimizing the score head, according to the following formula:

L_cls = -[ y_i·log(P_i) + (1 - y_i)·log(1 - P_i) ]

where y_i is the binary label (0 or 1) and P_i is the predicted probability of the label y_i; the closer P_i is to 1, the closer the value of the loss function is to 0, whereas the closer P_i is to 0, the larger the loss function becomes.
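A minimal sketch of the score head and the threshold-based template update decision is given below; pooling the decoder output by averaging, the hidden size and the class name are illustrative assumptions rather than details fixed by the embodiment.

```python
import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    """Fully connected layers + softmax classifying foreground (target present) vs. background."""
    def __init__(self, dim: int = 768, hidden: int = 256):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, decoder_out: torch.Tensor) -> torch.Tensor:
        logits = self.ffn(decoder_out.mean(dim=1))      # pool decoder tokens (assumed)
        return torch.softmax(logits, dim=-1)[:, 1]      # confidence that the target is present

def should_update_template(confidence: torch.Tensor, threshold: float = 0.7) -> bool:
    """Update the dynamic template only when the confidence exceeds the threshold."""
    return bool(confidence.item() > threshold)

score_head = ScoreHead()
confidence = score_head(torch.rand(1, 1, 768))          # decoder output: (batch, tokens, dim)
if should_update_template(confidence):
    # Crop the new template from the search region image and re-extract its
    # features with the backbone (not shown here).
    pass

# During training the head can be optimized with binary cross-entropy:
label = torch.tensor([1.0])
cls_loss = nn.BCELoss()(confidence, label)
```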
As shown in Fig. 7, the preferred embodiment of the invention further discloses a Transformer-based space-time context target tracking system which, based on the above method embodiment, specifically comprises the following modules:
An image acquisition and preprocessing module: acquires a tracking target image and preprocesses it;
A backbone network feature extraction module: the image preprocessed by the image acquisition and preprocessing module is input into the backbone network Vision Transformer; flattening and linear mapping operations are performed first, then position codes are added correspondingly to obtain the slice embedding layer, and the search region features, initial template features and dynamically updated template features are then obtained respectively through a Transformer encoder;
A feature enhancement and fusion module: the output of the backbone network feature extraction module is used as the input of an interactive feature enhancement module, which adopts a multi-head cross-attention mechanism so that the search region features query the fusion feature of the initial template features and the dynamic template features, thereby obtaining the mixed features;
A bounding box prediction module: the mixed features obtained by the feature enhancement and fusion module and a target query are used as the input of a Transformer decoder in which the masked self-attention part is replaced by a multi-head self-attention layer, so that useful context information is attended to adaptively; after the output of the Transformer decoder is obtained, the similarity between this output and the mixed feature embedding is computed, the features are reshaped, and finally the bounding box is predicted as the expectation of the corner probability distributions;
A score head prediction module: the output obtained from the Transformer decoder in the bounding box prediction module is used as the input of a score head, which consists of a fully connected layer FFN and a softmax activation function, and whether to update the template is finally decided through a set threshold.
For other content in this embodiment, reference may be made to the above-described method embodiments.
Those skilled in the art will recognize that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the scope of the invention should not be limited to the disclosure of the embodiments.
Claims (10)
1. A Transformer-based space-time context target tracking method, characterized by comprising the following steps:
S1, image acquisition and preprocessing: acquiring a tracking target image and preprocessing it;
S2, backbone network feature extraction: inputting the image preprocessed in step S1 into a backbone network Vision Transformer, in which flattening and linear mapping operations are carried out first, then position codes are added correspondingly to obtain a slice embedding layer, and the search region features, initial template features and dynamically updated template features are then obtained respectively through a Transformer encoder;
S3, feature enhancement and fusion: the output of step S2 is used as the input of an interactive feature enhancement module, which adopts a multi-head cross-attention mechanism so that the search region features query the fusion feature of the initial template features and the dynamic template features to obtain the mixed features;
S4, bounding box prediction: the mixed features obtained in step S3 and a target query are used as the input of a Transformer decoder in which the masked self-attention part is replaced by a multi-head self-attention layer, so that useful context information is attended to adaptively; after the output of the Transformer decoder is obtained, the similarity between this output and the mixed feature embedding is computed, the features are reshaped, and finally the bounding box is predicted as the expectation of the corner probability distributions;
S5, score head prediction: the output obtained from the Transformer decoder in step S4 is used as the input of a score head, which consists of a fully connected layer FFN and a softmax activation function, and whether to update the template is finally decided through a set threshold.
2. The Transformer-based space-time context target tracking method according to claim 1, wherein in step S1, a GOT-10K dataset is used, the input consists of a search region, an initial template and a dynamically updated template, and the preprocessing divides the image into slices.
3. The Transformer-based space-time context target tracking method according to claim 2, wherein in step S1, the search region image is of size ℝ^(C×H_x×W_x), where C is the number of channels, H_x is the height of the picture and W_x is the width of the picture, and the initial template frame and the dynamically updated template frame are of size ℝ^(C×H_z×W_z); the search region is obtained by expanding a region several times the size of the target in four directions from the centre coordinate of the previous frame, so that it covers the possible movement range of the target; the search region is preprocessed by decomposing a frame of the picture into H_x·W_x/n² slices, each of size n×n×C, and each slice is input into the backbone network as a token; the initial template region and the dynamically updated template region are decomposed into H_z·W_z/n² slices, each also of size n×n×C.
4. The Transformer-based space-time context target tracking method according to claim 2 or 3, wherein in step S2, each slice is flattened and linearly mapped into a vector of dimension n×n×C, the generated position codes are added to the corresponding slice positions to obtain a slice embedding layer, and the obtained slice embedding layer is input into the Transformer encoder.
5. The Transformer-based space-time context target tracking method according to claim 4, wherein in step S2, the input to the Transformer encoder first passes through a layer normalization and then enters the multi-head self-attention layer, followed by a residual network structure; finally, normalization processing and a multi-layer perceptron are applied once more to obtain the output of the encoder part, namely the feature vectors of the extracted picture.
6. The Transformer-based space-time context target tracking method according to any one of claims 1-3, wherein in step S3, the search region feature X_1 obtained from the backbone network is used as the query, and the initial template feature and the dynamically updated template feature are concatenated to obtain the fusion feature X_2, which is input to the interactive feature enhancement module as the key-value pairs.
7. The Transformer-based space-time context target tracking method according to claim 6, wherein in step S3, the interactive feature enhancement module uses a multi-head cross-attention mechanism in which, given a query Q, a key K and a value V, the attention function uses the scaled dot product, as follows:

Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V

where Q = X_1×W^Q, K = X_2×W^K, V = X_2×W^V; W^Q, W^K and W^V are all parameter matrices obtained from the training data and used for feature extraction; X_1 is the search region feature; X_2 is the fusion feature obtained by concatenating the initial template feature and the dynamically updated template feature; K^T is the transpose of the key K; and d_k is the dimension of matrix K; because the features exhibit more than one kind of correlation, the multi-head attention mechanism is arranged in several heads, and the single-head results are concatenated and then multiplied by a matrix W^O, with the following formula:

MultiHead(Q, K, V) = Concat(H_1, ..., H_n)·W^O

where the matrix W^O is a learnable parameter matrix and H_n is the n-th attention head; the vector matrices H_i output by the single heads are concatenated and then projected through W^O to extract features from the multiple vector matrices, where each single-head output H_i is computed as follows:

H_i = Attention(X_1·W_i^Q, X_2·W_i^K, X_2·W_i^V)

where W_i^Q, W_i^K and W_i^V are the parameter matrices of the i-th head obtained from the training data; the spatial position codes are added correspondingly to the query Q and the key K, which are then input into the multi-head cross-attention mechanism and connected with the residual branch through normalization processing, with the following formula:

X' = Norm(X_1 + MultiHead(Q + P_q, K + P_k, V))

where P_q is the spatial position code of the query Q part and P_k is the spatial position code of the key K part;
and normalization processing and a residual connection are carried out once again, after the feed-forward network FFN, to obtain the final output mixed feature, with the following formula:

X_hybrid = Norm(X' + FFN(X'))
8. The Transformer-based space-time context target tracking method according to any one of claims 1-3, wherein in step S4, after the output of the Transformer decoder is obtained, the similarity between it and the mixed feature vector embedding is computed, and the obtained similarity score is multiplied element-wise with the mixed feature to enhance important regions and weaken less discriminative regions; the new feature sequence is reshaped into a feature map; two corner probability maps P_tl(x, y) and P_br(x, y) are output through a fully convolutional network, where x and y are coordinate points, P_tl(x, y) is the probability distribution of the top-left corner and P_br(x, y) is the probability distribution of the bottom-right corner; the fully convolutional network structure consists of L stacked convolution layers, batch normalization layers and ReLU functions; finally, the coordinates of the predicted box are computed as the expectation of the corner probability distributions, with the following formula:

(x̂_tl, ŷ_tl) = ( Σ_{y=0}^{H} Σ_{x=0}^{W} x·P_tl(x, y), Σ_{y=0}^{H} Σ_{x=0}^{W} y·P_tl(x, y) )
(x̂_br, ŷ_br) = ( Σ_{y=0}^{H} Σ_{x=0}^{W} x·P_br(x, y), Σ_{y=0}^{H} Σ_{x=0}^{W} y·P_br(x, y) );

uncertainty in the coordinate estimation is thereby modeled, generating more accurate and more robust prediction results for target tracking; after the prediction box is obtained, a combination of the L1 loss and the IoU loss is selected to assist the network model in bounding box prediction, with the following formula:

L_loc = L_IoU(b_i, b̂_i) + λ·L_1(b_i, b̂_i)

where b_i and b̂_i are the bounding box label and the predicted bounding box respectively, and λ is a weight hyperparameter between the two loss functions that adjusts their relative importance.
9. The Transformer-based space-time context target tracking method according to any one of claims 1-3, wherein in step S5, the threshold is set to 0.7, the confidence score output by the score head is compared with the threshold, and the template is updated when the score is greater than 0.7, otherwise it is not updated; the new template is cropped from the search region image and then input into the backbone network for feature extraction; a binary cross-entropy loss function is used when optimizing the score head, with the following formula:

L_cls = -[ y_i·log(P_i) + (1 - y_i)·log(1 - P_i) ]

where y_i is the binary label (0 or 1) and P_i is the predicted probability of the label y_i; the closer P_i is to 1, the closer the value of the loss function is to 0, whereas the closer P_i is to 0, the larger the loss function becomes.
10. A Transformer-based space-time context target tracking system based on the method according to any one of claims 1-9, characterized by comprising the following modules:
an image acquisition and preprocessing module: acquiring a tracking target image and preprocessing it;
a backbone network feature extraction module: the image preprocessed by the image acquisition and preprocessing module is input into the backbone network Vision Transformer; flattening and linear mapping operations are performed first, then position codes are added correspondingly to obtain the slice embedding layer, and the search region features, initial template features and dynamically updated template features are then obtained respectively through a Transformer encoder;
a feature enhancement and fusion module: the output of the backbone network feature extraction module is used as the input of an interactive feature enhancement module, which adopts a multi-head cross-attention mechanism so that the search region features query the fusion feature of the initial template features and the dynamic template features, thereby obtaining the mixed features;
a bounding box prediction module: the mixed features obtained by the feature enhancement and fusion module and a target query are used as the input of a Transformer decoder in which the masked self-attention part is replaced by a multi-head self-attention layer, so that useful context information is attended to adaptively; after the output of the Transformer decoder is obtained, the similarity between this output and the mixed feature embedding is computed, the features are reshaped, and finally the bounding box is predicted as the expectation of the corner probability distributions;
a score head prediction module: the output obtained from the Transformer decoder in the bounding box prediction module is used as the input of a score head, which consists of a fully connected layer FFN and a softmax activation function, and whether to update the template is finally decided through a set threshold.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311254115.9A CN117315293A (en) | 2023-09-26 | 2023-09-26 | Transformer-based space-time context target tracking method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311254115.9A CN117315293A (en) | 2023-09-26 | 2023-09-26 | Transformer-based space-time context target tracking method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117315293A true CN117315293A (en) | 2023-12-29 |
Family
ID=89287852
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311254115.9A Pending CN117315293A (en) | 2023-09-26 | 2023-09-26 | Transformer-based space-time context target tracking method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117315293A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117649582A (en) * | 2024-01-25 | 2024-03-05 | 南昌工程学院 | Single-flow single-stage network target tracking method and system based on cascade attention |
CN118297945A (en) * | 2024-06-05 | 2024-07-05 | 江西师范大学 | Defect detection method and system based on position constraint residual error and sliding window aggregation |
-
2023
- 2023-09-26 CN CN202311254115.9A patent/CN117315293A/en active Pending
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117649582A (en) * | 2024-01-25 | 2024-03-05 | 南昌工程学院 | Single-flow single-stage network target tracking method and system based on cascade attention |
CN117649582B (en) * | 2024-01-25 | 2024-04-19 | 南昌工程学院 | Single-flow single-stage network target tracking method and system based on cascade attention |
CN118297945A (en) * | 2024-06-05 | 2024-07-05 | 江西师范大学 | Defect detection method and system based on position constraint residual error and sliding window aggregation |
CN118297945B (en) * | 2024-06-05 | 2024-08-13 | 江西师范大学 | Defect detection method and system based on position constraint residual error and sliding window aggregation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111783705B (en) | Character recognition method and system based on attention mechanism | |
CN109711463B (en) | Attention-based important object detection method | |
CN110516536B (en) | Weak supervision video behavior detection method based on time sequence class activation graph complementation | |
CN117315293A (en) | Transformer-based space-time context target tracking method and system | |
CN113627266B (en) | Video pedestrian re-recognition method based on transform space-time modeling | |
CN117197727B (en) | Global space-time feature learning-based behavior detection method and system | |
CN117058595B (en) | Video semantic feature and extensible granularity perception time sequence action detection method and device | |
CN117392578A (en) | Action detection method and system based on two-stage space-time attention | |
CN116862949A (en) | Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement | |
CN115393949A (en) | Continuous sign language recognition method and device | |
CN115205233A (en) | Photovoltaic surface defect identification method and system based on end-to-end architecture | |
CN116580278A (en) | Lip language identification method, equipment and storage medium based on multi-attention mechanism | |
CN116994264A (en) | Text recognition method, chip and terminal | |
CN117115474A (en) | End-to-end single target tracking method based on multi-stage feature extraction | |
CN115171029B (en) | Unmanned-driving-based method and system for segmenting instances in urban scene | |
CN113869154B (en) | Video actor segmentation method according to language description | |
CN113378598B (en) | Dynamic bar code detection method based on deep learning | |
CN115761229A (en) | Image semantic segmentation method based on multiple classifiers | |
CN115619822A (en) | Tracking method based on object-level transformation neural network | |
CN114782995A (en) | Human interaction behavior detection method based on self-attention mechanism | |
Wang et al. | Scene uyghur recognition with embedded coordinate attention | |
CN115063831A (en) | High-performance pedestrian retrieval and re-identification method and device | |
CN118692010B (en) | Intelligent detection positioning method and system for small target of unmanned aerial vehicle in complex scene | |
CN118608787A (en) | Method for carrying out semantic segmentation based on Transformer model | |
CN118154644A (en) | Online multi-target tracking method, device and storage medium based on ID prediction network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |