CN116563355A - Target tracking method based on space-time interaction attention mechanism - Google Patents
Target tracking method based on space-time interaction attention mechanism
- Publication number
- CN116563355A (application number CN202310523575.0A)
- Authority
- CN
- China
- Prior art keywords
- feature
- space
- query
- frame
- attention
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T7/337—Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods involving reference images or patches
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/048—Activation functions
- G06V10/40—Extraction of image or video features
- G06V10/62—Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
- G06V10/766—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
- G06V10/771—Feature selection, e.g. selecting representative features from a multi-dimensional feature space
- G06V10/7715—Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06T2207/10016—Video; Image sequence
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention belongs to the technical field of target tracking in computer vision, and in particular relates to a target tracking method based on a space-time interaction attention mechanism, which comprises the following steps. Step 1, feature extraction: acquire query image data and template image data of the target to be tracked and extract the corresponding feature information. Step 2, space-time feature enhancement: enhance the memory frame image features and the query frame image features with an attention mechanism. Aiming at the problem that tracking performance degrades when the target undergoes occlusion, deformation, background interference and other complex conditions, the invention introduces a feature enhancement module that uses temporal attention, spatial attention and a self-attention mechanism to weight, respectively, the temporal features and the spatial features of the memory frame images and the query frame image features, thereby strengthening the expressive power of the memory frames and the query frame, making the features richer, and improving the robustness and accuracy of target tracking.
Description
Technical Field
The invention relates to the technical field of target tracking in computer vision, and in particular to a target tracking method based on a space-time interaction attention mechanism.
Background
Target tracking is one of the important research directions in the field of computer vision and is widely applied in intelligent driving, unmanned driving, human-computer interaction and other fields. The target tracking task needs to maintain accurate tracking of an object in a continuous video sequence, so as to obtain the complete motion trajectory of the object and to calculate its position and size in different image frames. The development of target tracking technology plays an irreplaceable role in high-level video processing tasks such as behavior understanding, reasoning and decision making, and is also a basic technology for target recognition, behavior analysis, video compression coding, video understanding and the like. Although research on target tracking has advanced significantly over the past few years and many efficient algorithms have emerged to address challenging problems in various scenarios, many difficulties remain, such as target occlusion, illumination variation, scale change and background interference, so target tracking research is still a demanding task. To solve these problems, a more accurate and robust tracker needs to be proposed.
In 2016, with the advent of SiamFC, tracking frameworks based on the Siamese network became the mainstream framework for single-target tracking algorithms. Subsequently, SiamRPN introduced a region proposal network into the Siamese network and obtained excellent tracking results on several benchmarks. SiamCorners introduced an improved corner pooling layer on top of the Siamese network, converting bounding-box prediction into the prediction of a pair of diagonal corners, with good performance. Most current Siamese-network-based trackers use the initial frame as the template, a strategy that carries some risk: when the tracked object is severely deformed or occluded, the target cannot be tracked well. To improve this, some trackers introduce a template update mechanism or use multiple templates, which can enhance robustness to some extent, but the benefit is limited and inevitably increases the computational cost. In addition, these trackers use only the appearance information of the memory frames and do not fully exploit the rich temporal context information in the historical frame sequence. Meanwhile, Siamese-network-based tracking algorithms do not pay attention to the inter-frame and intra-frame correlations of the video sequence, so the target cannot form the corresponding associations in time and space.
Disclosure of Invention
(I) Technical problems to be solved
Aiming at the shortcomings of the prior art, the invention provides a target tracking method based on a space-time interaction attention mechanism, which solves the problems that it is difficult to establish associations across spatio-temporal context information and that tracking performance degrades when the target undergoes challenges such as deformation and occlusion.
(II) Technical solution
To achieve the above purpose, the invention adopts the following technical solution:
A target tracking method based on a space-time interaction attention mechanism comprises the following steps:
step 1: extracting features; acquiring query image data and template image data of a target to be tracked and extracting corresponding characteristic information;
step 2: enhancing space-time characteristics; enhancing the memory frame image characteristics and the query frame image characteristics by using an attention mechanism;
step 3: constructing a space-time interaction model; performing space-time interaction between the memory frame branch and the query frame branch to obtain the corresponding interaction weights, and using these weights to perform a secondary screening of the enhanced memory frame and query frame feature information, so as to obtain features more beneficial to tracking;
step 4: updating a template; using a space-time memory network to update the template;
step 5: inputting the response map into a classification-regression network, and training the whole network model according to the loss function, so as to track the target in the video.
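The five steps above can be wired together as in the following minimal sketch (PyTorch-style, for illustration only; every function name and argument below is a placeholder introduced for this sketch, not a name used by the invention):

```python
# Illustrative wiring of steps 1-5; the callables passed in are hypothetical
# stand-ins for the modules described in the following paragraphs.
def track_step(backbone, enhance_memory, enhance_query, interact, read_memory, head,
               memory_frames, label_maps, query_frame):
    f_m = backbone(memory_frames, label_maps)   # step 1: memory-frame features F_m
    f_q = backbone(query_frame)                 # step 1: query-frame feature F_q
    f_m_e = enhance_memory(f_m)                 # step 2: temporal + spatial attention -> F_m'
    f_q_e = enhance_query(f_q)                  # step 2: self-attention -> F_q'
    f_m_i, f_q_i = interact(f_m_e, f_q_e)       # step 3: space-time interaction -> F_m'', F_q''
    y = read_memory(f_m_i, f_q_i)               # step 4: memory reader, composite feature map y
    return head(y)                              # step 5: classification / centerness / regression maps
```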
Further, the feature extraction in step 1 is performed as follows. The query frame image, the memory frame images and the label maps are first obtained through data preprocessing. In the memory frame branch, GoogleNet is used as the backbone network to extract the memory frame feature F_m: the first convolution layer of the backbone and an additional convolution layer g process the memory frame and the label map respectively, their outputs are added and fed into the remaining backbone layers, and T memory frame feature maps are generated in this way; finally a nonlinear convolution layer reduces the feature dimension of F_m to 512. In the query branch, the query image is input to obtain the feature map F_q; the overall structure is the same as that of the memory frame branch but the parameters are different, and a nonlinear convolution layer finally yields the feature F_q with feature dimension 512.
Further, the space-time feature enhancement module for the memory frame images in step 2 mainly includes a temporal attention module and a spatial attention module. The temporal attention module is mainly used to enhance the temporal features of the sequence, emphasizing important temporal information by weighting and filtering out irrelevant information; the spatial attention module is mainly used to enhance the spatial features, weighting the target region and the background region of the image so that attention is focused on the region where the target is located.
Further, the temporal attention module in step 2 performs a weighted average of the sequence features with different weights at different moments of the sequence to obtain a temporal feature representation. First, the memory frame image feature F_m extracted by the backbone network is split into T feature vectors {f_1, f_2, ..., f_T}, where T is the sequence length. Then, the feature f_t of each time step undergoes one linear transformation that maps it into a new feature space, the attention coefficients are computed with a SoftMax function, and the features are weighted and fused according to these coefficients to obtain the weighted feature vector F_m1.
Further, the spatial attention module in step 2 first takes the template frame feature F_m obtained from the feature extraction network and compresses the image feature through a global average pooling layer and a max pooling layer in the network model, keeping the first dimension B of the image feature unchanged and obtaining feature vectors of size 1×C, where C is the number of channels. The two pooled results are then concatenated and fed into a 3×3 convolution layer to perform a convolution with kernel size 3, which raises the dimension of the compressed feature vectors. Finally a Sigmoid function is applied so that the output is constrained between 0 and 1, i.e. a spatial channel weight is generated for each spatial channel; the feature output by the spatial attention model is denoted F_m2. Finally, the temporal attention and the spatial attention in the space-time feature enhancement module weight the temporal feature and the spatial feature of the memory frame image respectively, and the memory frame feature map is obtained as:
F_m' = concat(F_m1, F_m2)
Further, in step 2 the query frame image features are enhanced: each element vector of the input sequence of query image features is automatically weighted and summed with the other vectors, the similarity between features is then computed, key features are extracted, and the weight of each feature is recomputed according to these key features to obtain a more enhanced feature representation. The specific operations are as follows:
S21. The feature matrix F_q obtained from the backbone network first undergoes three linear-transformation convolution operations and is projected into three different feature spaces, giving three feature representations: Query, Key and Value.
S22. The Query and the Key are multiplied (with transposition) to obtain a matrix energy that describes their similarity.
S23. The energy matrix is normalized with the SoftMax function so that each feature point is assigned a weight between 0 and 1, giving the attention weight coefficient matrix attention1, which mainly serves to highlight the feature points more important to the model.
S24. Finally, all Value vectors are weighted and summed with the weight coefficients attention1 to obtain the final weight vectors, all of which are reconstructed into a new feature matrix following the shape of the original feature matrix; the feature output by the query image feature enhancement module is denoted F_q'.
Furthermore, the space-time interaction model in the step 3 mainly performs space-time interaction on the feature information on the memory frame branch and the query frame branch processed by the feature enhancement module to obtain a feature representation more favorable for the tracking task, and the specific operation steps are as follows:
S31. The memory frame feature F_m' and the query frame feature F_q' obtained in step 2 are taken as the inputs of the space-time interaction attention module and are processed by a reshape transformation into feature matrices of the same spatial size, B×C×H×W.
S32. Three convolution layers apply the Query and Key transformations to the memory frame branch F_m' and the Value transformation to the query frame branch F_q'. The Query and the Key are multiplied (with transposition) to obtain a matrix describing their correlation. The similarity matrix is then normalized with the SoftMax function so that each feature point obtains a weight between 0 and 1, giving the attention weight attention21. This attention weight is used to weight all Value vectors of the query frame F_q', yielding the weighted memory frame feature F_q,m. Finally, the weighted feature is spliced with the memory frame branch to obtain the space-time interactive memory frame feature F_m'', with the formula:
F_m'' = concat(F_q,m, F_m') = F_q,m + F_m'
S33. Three convolution layers apply the Query and Key transformations to the query frame branch F_q' and the Value transformation to the memory frame branch F_m'. The Query and the Key are multiplied (with transposition) to obtain a matrix describing their correlation. The similarity matrix is then normalized with the SoftMax function so that each feature point obtains a weight between 0 and 1, giving the attention weight attention22. This attention weight is used to weight all Value vectors of the memory frame F_m', yielding the weighted query feature F_m,q. Finally, the weighted feature is spliced with the query frame branch to obtain the space-time interactive query frame feature F_q'', with the formula:
F_q'' = concat(F_m,q, F_q') = F_m,q + F_q'
Further, the template updating mechanism in step 4 uses a space-time memory network (memory reader) and updates the target template mainly with the historical information of the memory frames. The similarity between each pixel of the memory frame feature F_m'' output in step 3 and the query frame feature F_q'' is first computed to obtain a similarity matrix; the similarity matrix serves as a soft weight map and is multiplied with the memory frame F_m'', adaptively retrieving the information stored in the memory frames and selecting from it the information most useful to the query frame, thereby updating the template. Finally, the retrieved information is concatenated with the query frame feature F_q'' along the channel dimension to generate the final composite feature map y.
Further, in step 5 the classification-regression network produces the response maps. A lightweight classification convolution network ω_cls encodes the feature map y obtained in step 4, and a linear convolution layer with a 1×1 kernel reduces the dimension of the ω_cls output to 1, giving the final classification response map R_cls ∈ R^(1×H×W). The classification network also computes a centerness response, denoted R_ctr ∈ R^(1×H×W); in the inference phase, R_ctr is multiplied with R_cls to suppress the classification confidence scores of pixels far from the target center. In the regression task, a lightweight convolution network ω_reg encodes the feature map y obtained in step 4 and reduces the output feature dimension to 4, generating the regression response map R_reg ∈ R^(4×H×W) used for bounding box estimation.
Further, the loss function in step 5 comprises a classification loss L_cls using the cross-entropy loss function, a regression loss L_reg using the IoU loss function, and a centerness loss L_ctr using the Sigmoid cross-entropy loss function. The total loss function is expressed as:
L = L_cls + λ1·L_reg + λ2·L_ctr
where λ1 and λ2 are both hyper-parameters.
(III) Beneficial effects
Compared with the prior art, the invention provides a target tracking method based on a space-time interaction attention mechanism with the following beneficial effects:
Aiming at the problem that tracking performance degrades when the target undergoes occlusion, deformation, background interference and other complex conditions, the invention introduces a feature enhancement module that uses temporal attention, spatial attention and a self-attention mechanism to weight, respectively, the temporal features and the spatial features of the memory frame images and the query frame image features. This strengthens the expressive power of the memory frames and the query frame, makes the features richer, and thereby improves the robustness and accuracy of target tracking.
The invention also introduces a space-time interaction module that performs spatio-temporal information interaction between the two enhanced branch features, realizing a secondary screening of the features. The information of the memory frames can thus be fully used and dynamically combined with the information in the query frame, so that the target is tracked more accurately, effectively alleviating complex problems such as the difficulty the model has in establishing associations across spatio-temporal context information.
Drawings
FIG. 1 is a block diagram of the overall network architecture of the present invention;
FIG. 2 is a diagram of a feature enhancement module architecture of the present invention;
FIG. 3 is a diagram of a space-time interactive module architecture of the present invention;
FIG. 4 is a template update flow diagram;
FIG. 5 shows the evaluation results on the GOT-10k test dataset.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples
As shown in FIGS. 1-5, a target tracking method based on a space-time interaction attention mechanism according to an embodiment of the present invention includes the following steps:
step 1: extracting features; query image data and template image data of a target to be tracked are obtained, and corresponding characteristic information is extracted.
Step 2: enhancing space-time characteristics; and enhancing the memory frame image characteristics and the query frame image characteristics by using an attention mechanism.
Step 3: constructing a space-time interaction model; space-time interaction is performed between the memory frame branch and the query frame branch to obtain the corresponding interaction weights, and these weights are used to perform a secondary screening of the enhanced memory frame and query frame feature information, so as to obtain features more beneficial to tracking.
Step 4: updating the template; template updating is performed using the space-time memory network.
Step 5: the response map is input into the classification-regression network, and the whole network model is trained according to the loss function, so as to track the target in the video.
In the feature extraction of step 1, the query frame image, the memory frame images and the label maps are first obtained through data preprocessing. In the memory frame branch, GoogleNet is used as the backbone network to extract the memory frame feature F_m: the first convolution layer of the backbone and an additional convolution layer g process the memory frame and the label map respectively, their outputs are added and fed into the remaining backbone layers, and T memory frame feature maps are generated in this way; finally a nonlinear convolution layer reduces the feature dimension of F_m to 512. In the query branch, the query image is input to obtain the feature map F_q; the overall structure is the same as that of the memory frame branch but the parameters are different, and a nonlinear convolution layer finally yields the feature F_q with feature dimension 512.
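As an illustration of this two-branch extraction, the following sketch uses a small stand-in CNN in place of the GoogLeNet backbone; the layer sizes, strides and the exact split between the first convolution layer and the remaining layers are assumptions made for the sketch, not details taken from the invention. The query branch would share the same structure without the label-map path and with its own parameters.

```python
# Minimal sketch of the memory-frame branch (step 1); a small CNN stands in
# for the GoogLeNet backbone, so all channel sizes here are illustrative.
import torch.nn as nn

class MemoryBranch(nn.Module):
    def __init__(self, out_dim=512):
        super().__init__()
        self.first_conv = nn.Conv2d(3, 64, 7, stride=2, padding=3)   # processes the memory frame
        self.label_conv = nn.Conv2d(1, 64, 7, stride=2, padding=3)   # extra convolution layer g for the label map
        self.rest = nn.Sequential(                                   # stand-in for the remaining backbone layers
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 1024, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # nonlinear convolution layer that reduces the feature dimension to 512
        self.reduce = nn.Sequential(nn.Conv2d(1024, out_dim, 1), nn.ReLU(inplace=True))

    def forward(self, frames, label_maps):
        # frames: (T, 3, H, W), label_maps: (T, 1, H, W) -> T memory feature maps (T, 512, H', W')
        x = self.first_conv(frames) + self.label_conv(label_maps)
        return self.reduce(self.rest(x))
```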
The space-time feature enhancement module used in step 2 to enhance the memory frame images comprises a temporal attention module and a spatial attention module. The temporal attention module is mainly used to enhance the temporal features of the sequence, emphasizing important temporal information by weighting and filtering out irrelevant information; the spatial attention module is mainly used to enhance the spatial features, weighting the target region and the background region of the image so that attention is focused on the region where the target is located.
In step 2 of the invention, a temporal attention module is adopted to enhance the memory frame features: at different moments of the sequence, the sequence features are weighted-averaged with different weights to obtain a temporal feature representation. First, the memory frame image feature F_m extracted by the backbone network is split into T feature vectors {f_1, f_2, ..., f_T}, where T is the sequence length. Then, the feature f_t of each time step undergoes one linear transformation that maps it into a new feature space, the attention coefficients are computed with a SoftMax function, and the features are weighted and fused according to these coefficients to obtain the weighted feature vector F_m1.
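A minimal sketch of such a temporal attention module is given below; pooling each frame into a global descriptor and the particular scoring head are assumptions made for the sketch.

```python
# Sketch of temporal attention: each of the T memory-frame features is scored
# by a learned linear map, the scores are normalized over time with SoftMax,
# and the features are fused by a weighted sum (the fused result plays the
# role of F_m1). Shapes and the scoring head are assumptions.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        self.proj = nn.Linear(channels, channels)   # linear transform of each time step
        self.score = nn.Linear(channels, 1)         # one attention logit per frame

    def forward(self, f_m):
        # f_m: (B, T, C, H, W)
        b, t, c, h, w = f_m.shape
        desc = f_m.mean(dim=(3, 4))                 # per-frame descriptors (B, T, C)
        desc = torch.tanh(self.proj(desc))          # map into a new feature space
        alpha = torch.softmax(self.score(desc), dim=1)   # (B, T, 1) attention coefficients
        alpha = alpha.view(b, t, 1, 1, 1)
        return (alpha * f_m).sum(dim=1)             # fused feature: (B, C, H, W)
```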
In step 2 of the invention, the spatial attention module used to enhance the memory frame features first takes the template frame feature F_m obtained from the feature extraction network and compresses the image feature through a global average pooling layer and a max pooling layer, keeping the first dimension B of the image feature unchanged and obtaining feature vectors of size 1×C, where C is the number of channels. The two pooled results are then concatenated and fed into a 3×3 convolution layer to perform a convolution with kernel size 3, which raises the dimension of the compressed feature vectors. Finally a Sigmoid function is applied so that the output is constrained between 0 and 1, i.e. a spatial channel weight is generated for each spatial channel; the feature output by the spatial attention model is denoted F_m2. Finally, the temporal attention and the spatial attention in the space-time feature enhancement module weight the temporal feature and the spatial feature of the memory frame image respectively, and the memory frame feature map is obtained as:
F_m' = concat(F_m1, F_m2)
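The spatial attention branch can be sketched as follows; the placement of the 3×3 convolution on the pooled 1×1 map is one possible reading of the description above, and the channel sizes are assumptions. The final concatenation F_m' = concat(F_m1, F_m2) would then combine this output with the temporal attention output.

```python
# Sketch of the spatial attention branch: global average and max pooling
# compress the feature, the pooled results are concatenated, a 3x3 convolution
# restores the channel dimension, and a Sigmoid produces weights in [0, 1]
# that re-weight the memory feature (the result plays the role of F_m2).
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.conv = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, f_m):
        # f_m: (B, C, H, W); the batch dimension B is left unchanged
        pooled = torch.cat([self.avg_pool(f_m), self.max_pool(f_m)], dim=1)  # (B, 2C, 1, 1)
        weights = torch.sigmoid(self.conv(pooled))                           # (B, C, 1, 1), in [0, 1]
        return f_m * weights                                                 # weighted feature
```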
the step 2 of the invention enhances the image characteristics of the query frame, firstly, each element vector in the input sequence of the image characteristics of the query is automatically weighted and summed with other vectors, then the similarity between the characteristics is calculated, key characteristics are extracted, and the weight of each characteristic is recalculated according to the key characteristics to obtain more enhanced characteristic representation, and the specific operation is as follows:
S21. The feature matrix F_q obtained from the backbone network first undergoes three linear-transformation convolution operations and is projected into three different feature spaces, giving three feature representations: Query, Key and Value.
S22. The Query and the Key are multiplied (with transposition) to obtain a matrix energy that describes their similarity.
S23. The energy matrix is normalized with the SoftMax function so that each feature point obtains a weight between 0 and 1, giving the attention weight coefficient matrix attention1, which mainly serves to highlight the feature points more important to the model.
S24. Finally, all Value vectors are weighted and summed with the weight coefficients attention1 to obtain the final weight vectors, all of which are reconstructed into a new feature matrix following the shape of the original feature matrix; the feature output by the query image feature enhancement module is denoted F_q'.
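Steps S21 to S24 correspond to a standard self-attention block; a minimal sketch is shown below, where the reduced Query/Key channel width is an assumption.

```python
# Sketch of the query-frame self-attention (S21-S24): 1x1 convolutions produce
# Query, Key and Value, the Query-Key similarity is normalized with SoftMax
# (attention1), and the Value vectors are summed with those weights, then
# reshaped back to the original feature shape.
import torch
import torch.nn as nn

class QuerySelfAttention(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // 8, 1)   # linear-transform convolutions (S21)
        self.k = nn.Conv2d(channels, channels // 8, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, f_q):
        b, c, h, w = f_q.shape
        q = self.q(f_q).flatten(2).transpose(1, 2)       # (B, HW, C/8)
        k = self.k(f_q).flatten(2)                       # (B, C/8, HW)
        v = self.v(f_q).flatten(2)                       # (B, C, HW)
        energy = torch.bmm(q, k)                         # (B, HW, HW) similarity matrix (S22)
        attention1 = torch.softmax(energy, dim=-1)       # weights in [0, 1] (S23)
        out = torch.bmm(v, attention1.transpose(1, 2))   # weighted sum of Values (S24)
        return out.view(b, c, h, w)                      # enhanced query feature
```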
The space-time interaction model in the step 3 is mainly to perform space-time interaction on the feature information on the memory frame branch and the query frame branch which are processed by the feature enhancement module, so as to obtain the feature representation which is more favorable for the tracking task, and the specific operation steps are as follows:
S31. The memory frame feature F_m' and the query frame feature F_q' obtained in step 2 are taken as the inputs of the space-time interaction attention module and are processed by a reshape transformation into feature matrices of the same spatial size, B×C×H×W.
S32. Three convolution layers apply the Query and Key transformations to the memory frame branch F_m' and the Value transformation to the query frame branch F_q'. The Query and the Key are multiplied (with transposition) to obtain a matrix describing their correlation. The similarity matrix is then normalized with the SoftMax function so that each feature point obtains a weight between 0 and 1, giving the attention weight attention21. This attention weight is used to weight all Value vectors of the query frame F_q', yielding the weighted memory frame feature F_q,m. Finally, the weighted feature is spliced with the memory frame branch to obtain the space-time interactive memory frame feature F_m'', with the formula:
F_m'' = concat(F_q,m, F_m') = F_q,m + F_m'
S33. Three convolution layers apply the Query and Key transformations to the query frame branch F_q' and the Value transformation to the memory frame branch F_m'. The Query and the Key are multiplied (with transposition) to obtain a matrix describing their correlation. The similarity matrix is then normalized with the SoftMax function so that each feature point obtains a weight between 0 and 1, giving the attention weight attention22. This attention weight is used to weight all Value vectors of the memory frame F_m', yielding the weighted query feature F_m,q. Finally, the weighted feature is spliced with the query frame branch to obtain the space-time interactive query frame feature F_q'', with the formula:
F_q'' = concat(F_m,q, F_q') = F_m,q + F_q'
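One direction of this interaction (S31 and S32) can be sketched as follows; the reverse direction (S33) is obtained by swapping the roles of the two branches. The reduced Query/Key channel width is an assumption, and the additive fusion follows the formula F_m'' = F_q,m + F_m' given above.

```python
# Sketch of one interaction direction: Query and Key come from the enhanced
# memory feature F_m', Value comes from the enhanced query feature F_q', and
# the weighted result is fused back into the memory branch.
import torch
import torch.nn as nn

class InteractionBranch(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // 8, 1)
        self.k = nn.Conv2d(channels, channels // 8, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, f_src, f_val):
        # f_src provides Query/Key (e.g. F_m'), f_val provides Value (e.g. F_q');
        # both are (B, C, H, W) after the reshape in S31.
        b, c, h, w = f_src.shape
        q = self.q(f_src).flatten(2).transpose(1, 2)             # (B, HW, C/8)
        k = self.k(f_src).flatten(2)                             # (B, C/8, HW)
        v = self.v(f_val).flatten(2)                             # (B, C, HW)
        attention21 = torch.softmax(torch.bmm(q, k), dim=-1)     # (B, HW, HW)
        weighted = torch.bmm(v, attention21.transpose(1, 2)).view(b, c, h, w)  # F_q,m
        return weighted + f_src                                  # F_m'' = F_q,m + F_m'
```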
the template updating mechanism in the step 4 of the invention uses a space-time memory network (memory reader), updates the target template mainly by using the history information of the memory frame, and firstly calculates the memory frame F output in the step 3 m "each pixel and query frame F q "similarity between the frames, obtain the similarity matrix, the similarity matrix is regarded as the mapping of soft weight and memory frame F m "multiplying, adaptively searching information stored in memory frame, searching the most useful information related to inquiry frame from the information stored in memory frame so as to implement updating of template, finally combining the read information with inquiry frame characteristic F q "splice along channel dimension, generate final composite feature map y.
In step 5 of the invention, the classification-regression network produces the response maps. A lightweight classification convolution network ω_cls encodes the feature map y obtained in step 4, and a linear convolution layer with a 1×1 kernel reduces the dimension of the ω_cls output to 1, giving the final classification response map R_cls ∈ R^(1×H×W). The classification network also computes a centerness response, denoted R_ctr ∈ R^(1×H×W); in the inference phase, R_ctr is multiplied with R_cls to suppress the classification confidence scores of pixels far from the target center. In the regression task, a lightweight convolution network ω_reg encodes the feature map y obtained in step 4 and reduces the output feature dimension to 4, generating the regression response map R_reg ∈ R^(4×H×W) used for bounding box estimation.
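The classification-regression head can be sketched as follows; the depth and width of the convolutional towers are assumptions, and only the 1-channel, 1-channel and 4-channel output convolutions follow the description above.

```python
# Sketch of the classification-regression head (step 5): two lightweight
# convolutional towers stand in for omega_cls and omega_reg, followed by 1x1
# convolutions that produce R_cls (1 ch), R_ctr (1 ch) and R_reg (4 ch).
import torch
import torch.nn as nn

class ClsRegHead(nn.Module):
    def __init__(self, in_channels=1024, mid_channels=256):
        super().__init__()
        def tower():
            return nn.Sequential(
                nn.Conv2d(in_channels, mid_channels, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(mid_channels, mid_channels, 3, padding=1), nn.ReLU(inplace=True),
            )
        self.cls_tower, self.reg_tower = tower(), tower()   # stand-ins for omega_cls / omega_reg
        self.cls_out = nn.Conv2d(mid_channels, 1, 1)        # R_cls in R^(1xHxW)
        self.ctr_out = nn.Conv2d(mid_channels, 1, 1)        # R_ctr in R^(1xHxW)
        self.reg_out = nn.Conv2d(mid_channels, 4, 1)        # R_reg in R^(4xHxW)

    def forward(self, y, inference=False):
        c = self.cls_tower(y)
        r_cls, r_ctr = self.cls_out(c), self.ctr_out(c)
        r_reg = self.reg_out(self.reg_tower(y))
        if inference:   # at inference, centerness suppresses off-center classification scores
            r_cls = torch.sigmoid(r_cls) * torch.sigmoid(r_ctr)
        return r_cls, r_ctr, r_reg
```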
The loss function on which the overall network model is trained comprises a classification loss, a regression loss and a centerness loss: the classification loss L_cls uses the cross-entropy loss function, the regression loss L_reg uses the IoU loss function, and the centerness loss L_ctr uses the Sigmoid cross-entropy loss function. The total loss function is expressed as:
L = L_cls + λ1·L_reg + λ2·L_ctr
where λ1 and λ2 are both hyper-parameters.
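A sketch of this total loss is shown below; the (l, t, r, b) parameterization of the IoU loss and the default values of λ1 and λ2 are assumptions made for the sketch.

```python
# Sketch of L = L_cls + lambda1 * L_reg + lambda2 * L_ctr with cross-entropy
# classification, an IoU regression loss and Sigmoid cross-entropy centerness.
import torch
import torch.nn.functional as F

def iou_loss(pred, target, eps=1e-6):
    # pred/target: (N, 4) distances to the left, top, right, bottom box edges
    inter_w = torch.min(pred[:, 0], target[:, 0]) + torch.min(pred[:, 2], target[:, 2])
    inter_h = torch.min(pred[:, 1], target[:, 1]) + torch.min(pred[:, 3], target[:, 3])
    inter = inter_w.clamp(min=0) * inter_h.clamp(min=0)
    area_p = (pred[:, 0] + pred[:, 2]) * (pred[:, 1] + pred[:, 3])
    area_t = (target[:, 0] + target[:, 2]) * (target[:, 1] + target[:, 3])
    iou = inter / (area_p + area_t - inter + eps)
    return (1.0 - iou).mean()

def total_loss(cls_logits, cls_labels, reg_pred, reg_target, ctr_logits, ctr_target,
               lambda1=1.0, lambda2=1.0):
    # cls_labels / ctr_target are per-pixel targets in [0, 1] matching the logit shapes
    l_cls = F.binary_cross_entropy_with_logits(cls_logits, cls_labels)   # classification loss
    l_reg = iou_loss(reg_pred, reg_target)                               # IoU regression loss
    l_ctr = F.binary_cross_entropy_with_logits(ctr_logits, ctr_target)   # centerness loss
    return l_cls + lambda1 * l_reg + lambda2 * l_ctr
```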
In the field of target tracking, the performance of a tracker degrades when the tracked target encounters complex conditions such as occlusion, deformation and background interference. To improve this situation, the invention uses multiple memory frames in the template branch and a space-time feature enhancement module that combines spatio-temporal context information to enhance the original features, so that the expressive power of the memory frames and the query frame is strengthened, the features become richer, and the robustness and accuracy of target tracking are improved. Meanwhile, Siamese-based tracking algorithms do not pay attention to the inter-frame and intra-frame correlations of the video sequence, so the target cannot form the corresponding associations in time and space, and some trackers use only the appearance information of the memory frames without fully exploiting the rich temporal context information in the historical frame sequence. Aiming at this problem, the invention builds a space-time interaction model with an attention mechanism, so that the feature information of the memory frame branch and the query frame branch interact; the obtained interaction weights are then used to perform a secondary screening of the enhanced feature information, yielding features more beneficial to tracking. The space-time interaction model can exploit the information in the memory frames and dynamically combine it with the information in the query frame to realize spatio-temporal feature interaction, thereby improving the robustness and accuracy of target tracking.
The target tracking method based on the space-time interaction attention mechanism mainly uses the official GOT-10k, OTB-100 and LaSOT datasets to train the network model, and uses the GOT-10k evaluation tool to test the training effect of the method. Comparing the data in Table 1 shows that the target tracking algorithm provided by the invention performs better on the test dataset than the weights trained by other algorithms.
By comparing the evaluation results shown in FIG. 5, it can be clearly observed that, in the target tracking method based on the space-time interaction attention mechanism, the attention-based feature enhancement and the enhanced screening of features by the space-time interaction module effectively improve the tracking performance of the tracker.
Finally, it should be noted that the foregoing is only a preferred embodiment of the present invention and is not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or replace some of their technical features with equivalents. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
Claims (10)
1. A target tracking method based on a space-time interaction attention mechanism, characterized by comprising the following steps:
step 1: extracting features; acquiring query image data and template image data of a target to be tracked and extracting corresponding characteristic information;
step 2: enhancing space-time characteristics; enhancing the memory frame image characteristics and the query frame image characteristics by using an attention mechanism;
step 3: constructing a space-time interaction model; performing space-time interaction between the memory frame branch and the query frame branch to obtain the corresponding interaction weights, and using these weights to perform a secondary screening of the enhanced memory frame and query frame feature information, so as to obtain features more beneficial to tracking;
step 4: updating a template; using a space-time memory network to update the template;
step 5: inputting the response map into a classification-regression network, and training the whole network model according to the loss function, so as to track the target in the video.
2. The target tracking method based on the space-time interaction attention mechanism according to claim 1, wherein the feature extraction in step 1 is performed as follows: the query frame image, the memory frame images and the label maps are first obtained through data preprocessing; in the memory frame branch, GoogleNet is used as the backbone network to extract the memory frame feature F_m, the first convolution layer of the backbone and an additional convolution layer g being used to process the memory frame and the label map respectively, their outputs being added and fed into the remaining backbone layers to generate T memory frame feature maps, after which a nonlinear convolution layer reduces the feature dimension of F_m to 512; in the query branch, the query image is input to obtain the feature map F_q, the overall structure being the same as that of the memory frame branch but with different parameters, and a nonlinear convolution layer finally yields the feature F_q with feature dimension 512.
3. The target tracking method based on the space-time interaction attention mechanism according to claim 1, wherein the space-time feature enhancement module for the memory frame images in step 2 mainly comprises a temporal attention module and a spatial attention module; the temporal attention module is mainly used to enhance the temporal features of the sequence, emphasizing important temporal information by weighting and filtering out irrelevant information, and the spatial attention module is mainly used to enhance the spatial features, weighting the target region and the background region of the image so that attention is focused on the region where the target is located.
4. The target tracking method based on the space-time interaction attention mechanism according to claim 1, wherein the temporal attention module in step 2 performs a weighted average of the sequence features with different weights at different moments of the sequence to obtain a temporal feature representation; first, the memory frame image feature F_m extracted by the backbone network is split into T feature vectors {f_1, f_2, ..., f_T}, where T is the sequence length; then, the feature f_t of each time step undergoes one linear transformation that maps it into a new feature space, the attention coefficients are computed with a SoftMax function, and the features are weighted and fused according to these coefficients to obtain the weighted feature vector F_m1.
5. The target tracking method based on the space-time interaction attention mechanism according to claim 1, wherein the spatial attention module in step 2 first takes the template frame feature F_m obtained from the feature extraction network and compresses the image feature through a global average pooling layer and a max pooling layer in the network model, keeping the first dimension B of the image feature unchanged and obtaining feature vectors of size 1×C, where C is the number of channels; the two pooled results are then concatenated and fed into a 3×3 convolution layer to perform a convolution with kernel size 3, which raises the dimension of the compressed feature vectors; finally a Sigmoid function is applied so that the output is constrained between 0 and 1, i.e. a spatial channel weight is generated for each spatial channel, and the feature output by the spatial attention model is denoted F_m2; finally, the temporal attention and the spatial attention in the space-time feature enhancement module weight the temporal feature and the spatial feature of the memory frame image respectively, and the memory frame feature map is obtained as:
F_m' = concat(F_m1, F_m2).
6. The target tracking method based on the space-time interaction attention mechanism according to claim 1, wherein in step 2 the query frame image features are enhanced: each element vector of the input sequence of query image features is automatically weighted and summed with the other vectors, the similarity between features is then computed, key features are extracted, and the weight of each feature is recomputed according to these key features to obtain a more enhanced feature representation, with the following specific operations:
S21. The feature matrix F_q obtained from the backbone network first undergoes three linear-transformation convolution operations and is projected into three different feature spaces, giving three feature representations: Query, Key and Value;
S22. The Query and the Key are multiplied (with transposition) to obtain a matrix energy that describes their similarity;
S23. The energy matrix is normalized with the SoftMax function so that each feature point is assigned a weight between 0 and 1, giving the attention weight coefficient matrix attention1, which mainly serves to highlight the feature points more important to the model;
S24. Finally, all Value vectors are weighted and summed with the weight coefficients attention1 to obtain the final weight vectors, all of which are reconstructed into a new feature matrix following the shape of the original feature matrix; the feature output by the query image feature enhancement module is denoted F_q'.
7. The target tracking method based on the space-time interaction attention mechanism according to claim 1, wherein the space-time interaction model in step 3 mainly performs space-time interaction on the feature information of the memory frame branch and the query frame branch processed by the feature enhancement module, so as to obtain a feature representation more beneficial to the tracking task, with the following specific operation steps:
S31. The memory frame feature F_m' and the query frame feature F_q' obtained in step 2 are taken as the inputs of the space-time interaction attention module and are processed by a reshape transformation into feature matrices of the same spatial size, B×C×H×W;
S32. Three convolution layers apply the Query and Key transformations to the memory frame branch F_m' and the Value transformation to the query frame branch F_q'; the Query and the Key are multiplied (with transposition) to obtain a matrix describing their correlation; the similarity matrix is then normalized with the SoftMax function so that each feature point obtains a weight between 0 and 1, giving the attention weight attention21; this attention weight is used to weight all Value vectors of the query frame F_q', yielding the weighted memory frame feature F_q,m; finally, the weighted feature is spliced with the memory frame branch to obtain the space-time interactive memory frame feature F_m'', with the formula:
F_m'' = concat(F_q,m, F_m') = F_q,m + F_m'
S33. Three convolution layers apply the Query and Key transformations to the query frame branch F_q' and the Value transformation to the memory frame branch F_m'; the Query and the Key are multiplied (with transposition) to obtain a matrix describing their correlation; the similarity matrix is then normalized with the SoftMax function so that each feature point obtains a weight between 0 and 1, giving the attention weight attention22; this attention weight is used to weight all Value vectors of the memory frame F_m', yielding the weighted query feature F_m,q; finally, the weighted feature is spliced with the query frame branch to obtain the space-time interactive query frame feature F_q'', with the formula:
F_q'' = concat(F_m,q, F_q') = F_m,q + F_q'.
8. The target tracking method based on the space-time interaction attention mechanism according to claim 1, wherein the template updating mechanism in step 4 uses a space-time memory network (memory reader) and updates the target template mainly with the historical information of the memory frames: the similarity between each pixel of the memory frame feature F_m'' output in step 3 and the query frame feature F_q'' is first computed to obtain a similarity matrix; the similarity matrix serves as a soft weight map and is multiplied with the memory frame F_m'', adaptively retrieving the information stored in the memory frames and selecting from it the information most useful to the query frame, thereby updating the template; finally, the retrieved information is concatenated with the query frame feature F_q'' along the channel dimension to generate the final composite feature map y.
9. The target tracking method based on the space-time interaction attention mechanism according to claim 1, wherein in step 5 the classification-regression network produces the response maps as follows: a lightweight classification convolution network ω_cls encodes the feature map y obtained in step 4, and a linear convolution layer with a 1×1 kernel reduces the dimension of the ω_cls output to 1, giving the final classification response map R_cls ∈ R^(1×H×W); the classification network also computes a centerness response, denoted R_ctr ∈ R^(1×H×W), and in the inference phase R_ctr is multiplied with R_cls to suppress the classification confidence scores of pixels far from the target center; in the regression task, a lightweight convolution network ω_reg encodes the feature map y obtained in step 4 and reduces the output feature dimension to 4, generating the regression response map R_reg ∈ R^(4×H×W) used for bounding box estimation.
10. The target tracking method based on the space-time interaction attention mechanism according to claim 1, wherein the loss function in step 5 comprises a classification loss L_cls using the cross-entropy loss function, a regression loss L_reg using the IoU loss function, and a centerness loss L_ctr using the Sigmoid cross-entropy loss function; the total loss function is expressed as:
L = L_cls + λ1·L_reg + λ2·L_ctr
where λ1 and λ2 are both hyper-parameters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310523575.0A CN116563355A (en) | 2023-05-10 | 2023-05-10 | Target tracking method based on space-time interaction attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310523575.0A CN116563355A (en) | 2023-05-10 | 2023-05-10 | Target tracking method based on space-time interaction attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116563355A (en) | 2023-08-08 |
Family
ID=87503039
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310523575.0A Pending CN116563355A (en) | 2023-05-10 | 2023-05-10 | Target tracking method based on space-time interaction attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116563355A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116953653A (en) * | 2023-09-19 | 2023-10-27 | 成都远望科技有限责任公司 | Networking echo extrapolation method based on multiband weather radar |
CN116953653B (en) * | 2023-09-19 | 2023-12-26 | 成都远望科技有限责任公司 | Networking echo extrapolation method based on multiband weather radar |
CN117522925A (en) * | 2024-01-05 | 2024-02-06 | 成都合能创越软件有限公司 | Method and system for judging object motion state in mobile camera under attention mechanism |
CN117522925B (en) * | 2024-01-05 | 2024-04-16 | 成都合能创越软件有限公司 | Method and system for judging object motion state in mobile camera under attention mechanism |
CN118172390A (en) * | 2024-05-15 | 2024-06-11 | 南京新越阳科技有限公司 | Target tracking method based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||