CN116630845A - Target tracking method combining hierarchical encoder and parallel attention mechanism - Google Patents


Info

Publication number
CN116630845A
CN116630845A (application CN202310488020.7A)
Authority
CN
China
Prior art keywords
features
feature
template
search
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310488020.7A
Other languages
Chinese (zh)
Inventor
符强
王阳
纪元法
孙希延
任风华
严素清
付文涛
黄建华
梁维彬
贾茜子
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanning Guidian Electronic Technology Research Institute Co ltd
Guilin University of Electronic Technology
Original Assignee
Nanning Guidian Electronic Technology Research Institute Co ltd
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanning Guidian Electronic Technology Research Institute Co ltd, Guilin University of Electronic Technology filed Critical Nanning Guidian Electronic Technology Research Institute Co ltd
Priority to CN202310488020.7A priority Critical patent/CN116630845A/en
Publication of CN116630845A publication Critical patent/CN116630845A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of computer vision, and in particular to a target tracking method combining a hierarchical encoder and a parallel attention mechanism. The method comprises: preprocessing a video sequence to obtain an initial template image and an initial search image; performing feature extraction on the initial template image and the initial search image with an improved VGG16 feature-extraction backbone to obtain template features and search features; processing the template features and the search features through a parallel attention mechanism to obtain two weighted features; fusing the two features and inputting the result separately into the last two convolution blocks of the improved VGG16 backbone and into a DWConv-Transformer encoder, thereby obtaining the final template features and final search features; and, after a further pass through the parallel attention mechanism, performing cross-correlation convolution between the final template features and the final search features to obtain a similarity score map, the coordinates of whose maximum point are mapped back to the original frame to give the tracking result.

Description

Target tracking method combining hierarchical encoder and parallel attention mechanism
Technical Field
The invention relates to the technical field of computer vision, and in particular to a Siamese-network single-target tracking method combining a Transformer encoder and a parallel attention mechanism.
Background
As a popular research direction in computer vision, target tracking is a challenging task. By modeling the appearance and motion of a target, the technology predicts the target's trajectory and obtains its position at future moments, and it is therefore widely used in intelligent traffic monitoring, intelligent human-computer interaction, military reconnaissance, and other fields.
Prior work applied the Siamese network to the tracking task, converting tracking into a similarity-matching problem and greatly simplifying its solution. Other work introduced the region proposal network into the tracking framework, using it for foreground-background classification and bounding-box regression, which effectively improved the accuracy of the predicted bounding box. Some work improves model training by controlling the distribution of the training data set with an efficient sampling strategy, and some improves tracker performance by improving the feature-extraction backbone. However, most Siamese-network-based tracking methods use a single feature-processing pathway, and in complex tracking environments such as occlusion by foreign objects, illumination change, and fast motion, the tracker is prone to drift.
Disclosure of Invention
The invention aims to provide a target tracking method combining a hierarchical encoder and a parallel attention mechanism, so as to solve the problem that existing trackers drift in complex tracking environments. The method comprises the following steps:
preprocessing a video sequence to obtain an initial template image and an initial search image;
performing feature extraction on the initial template image and the initial search image with an improved VGG16 feature-extraction backbone to obtain template features and search features;
based on the template features and the search features, focusing through a parallel attention mechanism on the weight differences of the target across feature channels and on the dependencies between feature channels, to obtain channel features carrying channel-information weights and spatial features carrying position-information weights;
fusing the channel features carrying channel-information weights with the spatial features carrying position-information weights and inputting the fused features separately into the last two convolution blocks of the improved VGG16 backbone and into a DWConv-Transformer encoder for processing, to obtain final template features and final search features;
and, after processing through the parallel attention mechanism, performing cross-correlation convolution between the final template features and the final search features to obtain a similarity score map, and mapping the coordinates of the maximum point in the map back to the original frame to obtain the tracking result.
Wherein the preprocessing includes cropping and RGB-mean padding.
The feature extraction of the initial template image and the initial search image by using the improved feature extraction backbone network VGG16 to obtain template features and search features includes:
optimizing the original feature extraction backbone network VGG16 to obtain an improved feature extraction backbone network VGG16;
inputting the initial template image and the initial search image into the improved feature extraction backbone network VGG16 to obtain template features and search features.
The optimizing of the original VGG16 backbone to obtain the improved VGG16 feature-extraction backbone includes:
reducing the original five max-pooling layers of the VGG16 backbone to three, introducing a cropping module to crop the outermost features, and introducing dilated (hole) convolution in the fourth and fifth layers of the network, to obtain the improved VGG16 feature-extraction backbone.
The focusing, based on the template features and the search features, on the weight differences of the target across feature channels and on the dependencies between feature channels through a parallel attention mechanism, to obtain channel features carrying channel-information weights and spatial features carrying position-information weights, comprises the following steps:
inputting the template features and the search features into a channel attention mechanism, which adaptively assigns different weights to different feature channels according to how strongly each channel of the feature map responds to different target descriptors, thereby adjusting each channel's importance for different targets and obtaining channel features carrying channel-information weights;
and inputting the template features and the search features into a spatial attention mechanism, which describes the dependencies within the feature channels and obtains information weights for different spatial positions in the feature map, to obtain spatial features carrying position-information weights.
The step of inputting the template features and the search features into the channel attention mechanism, which adaptively assigns different weights to different feature channels according to how strongly each channel of the feature map responds to different target descriptors, thereby adjusting each channel's importance for different targets and obtaining channel features carrying channel-information weights, comprises the following steps:
applying two-dimensional adaptive average pooling to the input template features and search features to obtain a description matrix over the channel dimension;
processing the description matrix by matrix transformation and one-dimensional convolution to obtain a processing matrix;
and applying the channel-information weights, obtained from the processing matrix after a matrix transformation (reshape) and a sigmoid activation function, to the template features and the search features, to obtain the channel features carrying channel-information weights.
The step of inputting the template features and the search features into the spatial attention mechanism, which describes the dependencies within the feature channels and obtains information weights for different spatial positions in the feature map to obtain spatial features carrying position-information weights, comprises the following steps:
performing average pooling and max pooling on the input template features and search features to obtain a first pooled feature and a second pooled feature, respectively;
stacking the first pooled feature and the second pooled feature to obtain a stacked feature;
passing the template features and the search features through a standard convolution and multiplying the result with the stacked feature to obtain a product feature;
and passing the product feature through a convolution layer with a 7×7 kernel and a sigmoid activation function to obtain the spatial features carrying position-information weights.
Wherein the last two convolution blocks of the improved VGG16 backbone comprise three convolution layers with 3×3 kernels and two convolution layers with 1×1 kernels.
The step of fusing the channel features carrying channel-information weights with the spatial features carrying position-information weights and inputting the fused features separately into the last two convolution blocks of the improved VGG16 backbone and into a DWConv-Transformer encoder for processing, to obtain final template features and final search features, comprises the following steps:
fusing the channel features carrying channel-information weights with the spatial features carrying position-information weights, downsampling the fused features with a standard convolution, and inputting them into the DWConv-Transformer encoder, where convolution projection mapping is performed with depthwise-separable convolutions of preset stride and 3×3 kernel, and the query, key, and value are obtained through matrix transformation;
taking the query, key, and value as the input of a multi-head attention module, computing the similarity between the query and the key via the Einstein summation convention, obtaining the attention weight matrix after a Softmax function, and finally dot-multiplying the value with the attention weight matrix to obtain attention-weighted vectors;
processing the attention-weighted vectors with Layer Normalization and a multilayer perceptron (Multilayer Perceptron, MLP), residually adding the result to the vectors, and finally obtaining the processed features through matrix transformation;
fusing the channel features carrying channel-information weights with the spatial features carrying position-information weights and inputting the fused features into the last two convolution blocks of the improved VGG16 backbone to obtain output features;
and adding the processed features and the output features element-wise to obtain the final template features and final search features.
According to the target tracking method combining a hierarchical encoder and a parallel attention mechanism, an initial template image and an initial search image are obtained by preprocessing a video sequence; features are extracted from the two images with an improved VGG16 feature-extraction backbone to obtain template features and search features; through a parallel attention mechanism, the weight differences of the target across feature channels and the dependencies between feature channels are focused on, yielding channel features carrying channel-information weights and spatial features carrying position-information weights; these two features are fused and input separately into the last two convolution blocks of the improved VGG16 backbone and into a DWConv-Transformer encoder, giving the final template features and final search features; and, after a further pass through the parallel attention mechanism, cross-correlation convolution between the final template features and the final search features yields a similarity score map, whose maximum-point coordinates are mapped back to the original frame to give the tracking result. The method can track accurately in complex scenes such as illumination change and interference from similar objects.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a step diagram of a target tracking method combining a hierarchical encoder and a parallel attention mechanism provided by the present invention.
Fig. 2 is a schematic diagram of a target tracking method combining a hierarchical encoder and a parallel attention mechanism provided by the present invention.
Fig. 3 is a schematic block diagram of the DWConv-TransformerBlock encoder of fig. 2.
Fig. 4 is a schematic block diagram of the SAM of fig. 2.
Fig. 5 is a schematic block diagram of the CAM of fig. 2.
Fig. 6 is a flow chart of a method of object tracking combining a hierarchical encoder and a parallel attention mechanism provided by the present invention.
Fig. 7 is a comparison of the performance of the method provided by the present invention versus other methods on an OTB100 dataset.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
Referring to fig. 1 to 7, the present invention provides a target tracking method combining a hierarchical encoder and a parallel attention mechanism, comprising the following steps:
s1, preprocessing a video sequence to obtain an initial template image and an initial search image;
specifically, the preprocessing includes clipping and RGB mean filling.
The input video sequence is preprocessed by cropping and RGB-mean padding to obtain the initial template image and the search image. The specific operation is as follows: the first frame of the input video sequence is taken as the initial target template and cropped around the target center; when the crop window extends beyond the image boundary, the out-of-range region is filled with the constant per-channel RGB mean. The preprocessed template image has a resolution of 127×127 and a channel dimension of 3. The search image is obtained in the same way from the remaining frames of the video sequence, with a final resolution of 255×255 and a channel dimension of 3.
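The cropping-and-padding step above can be sketched as follows. This is an illustrative reconstruction, not the patent's code: the function and variable names are my own, and resizing to 127×127 / 255×255 is left out so the sketch stays self-contained.

```python
import numpy as np

def crop_with_mean_fill(frame, cx, cy, size):
    """Crop a size x size patch centered at (cx, cy) from an (H, W, 3) frame.
    Pixels that fall outside the frame are filled with the per-channel RGB
    mean, mirroring the 'cropping and RGB-mean padding' preprocessing."""
    h, w, c = frame.shape
    mean = frame.reshape(-1, c).mean(axis=0)
    # Start from a patch filled entirely with the RGB mean.
    patch = np.tile(mean, (size, size, 1)).astype(frame.dtype)
    half = size // 2
    x0, y0 = cx - half, cy - half                 # top-left of the crop window
    # Overlap between the crop window and the frame.
    fx0, fy0 = max(x0, 0), max(y0, 0)
    fx1, fy1 = min(x0 + size, w), min(y0 + size, h)
    if fx0 < fx1 and fy0 < fy1:
        patch[fy0 - y0:fy1 - y0, fx0 - x0:fx1 - x0] = frame[fy0:fy1, fx0:fx1]
    return patch
```

Centering the crop on a target near the image border exercises the mean-fill branch: the out-of-range corner of the patch takes the per-channel mean while the rest copies the frame.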
S2, performing feature extraction on the initial template image and the initial search image by utilizing an improved feature extraction backbone network VGG16 to obtain template features and search features;
VGG16 is a deep neural network for feature processing of input images;
the specific method is as follows:
s21, optimizing an original feature extraction backbone network VGG16 to obtain an improved feature extraction backbone network VGG16;
specifically, the original 5 layers of the original feature extraction backbone network VGG16 are reserved as 3 layers in a pooling mode, a clipping module is introduced to clip the features of the outermost layer, and cavity convolution is introduced into the fourth layer and the fifth layer of the network, so that the improved feature extraction backbone network VGG16 is obtained.
Optimization of VGG 16: 1) The original 5-layer maximum pooling is reserved as 3 layers, so that the problem that the performance of a tracker is affected due to too small resolution of output characteristics caused by too long network stride is avoided; 2) The cutting module is introduced to cut out the outermost layer characteristics influenced by the filling operation, so that the influence of potential position deviation caused by the filling operation is reduced; 3) Hole convolution is introduced in the fourth layer and the fifth layer of the network to relieve the problem of insufficient receptive field caused by network stride shortening.
Dilated convolution expands the receptive field exponentially without losing resolution. The dilated layers are stacked as

F_{i+1} = F_i *_{2^i} k_i,  i = 0, 1, …, n−2

where k_i is a 3×3 sliding filter, *_{2^i} denotes convolution with dilation rate 2^i, and F_i is the feature output after the i-th convolution. From this construction, the receptive field of each element in a layer is determined by the convolution result of the previous layer: after convolving F_i with a filter of dilation rate 2^i (i = 0, 1, 2, …), the receptive field of each element in F_{i+1} is (2^{i+2} − 1) × (2^{i+2} − 1).
Therefore, introducing dilated convolution into the backbone network VGG16 effectively alleviates the reduced receptive field caused by the shortened network stride.
S22, inputting the initial template image and the initial search image into the improved feature extraction backbone network VGG16 to obtain template features and search features.
Specifically, the template image and the search image, after the cropping and RGB-mean-padding preprocessing, are input into the VGG16 network. After the first three convolution blocks, the template features have a resolution of 11×11 with channel dimension 256, and the search features have a resolution of 27×27 with channel dimension 256. The first three convolution blocks comprise convolution layers with 3×3 kernels and three max-pooling layers.
S3, based on the template features and the search features, focusing through a parallel attention mechanism on the weight differences of the target across feature channels and on the dependencies between feature channels, to obtain channel features carrying channel-information weights and spatial features carrying position-information weights;
the specific method is as follows:
s31, inputting the template features and the search features into a channel attention mechanism, and adaptively endowing different feature channels with different weights according to the response degrees of the feature images to different target descriptors to adjust the importance degrees of the channels to different targets so as to obtain channel features containing channel information weights;
specifically, the two-dimensional adaptive average pooling operation is utilized to the inputted template features and the searching features to obtain a description matrix of the channel dimension; the description matrix is subjected to matrix transformation and one-dimensional convolution treatment to obtain a treatment matrix; and endowing the channel information weight obtained by the processing matrix after the matrix transformation reshape and sigmoid activation function with the template feature and the search feature to obtain the channel feature containing the channel information weight.
The channel attention mechanism (Channel Attention Mechanism, CAM) adaptively assigns different weights to different feature channels according to how strongly each channel of the feature map responds to different target descriptors, thereby adjusting each channel's importance for different targets. For an input feature F_c ∈ R^{C×h×w}, a description matrix of the channel dimension, a_c ∈ R^{C×1×1}, is first obtained by two-dimensional adaptive average pooling:

a_c = (1 / (h·w)) Σ_{i=1}^{h} Σ_{j=1}^{w} F_c(i, j)

where h and w are the height and width of the feature map and (i, j) indexes the pixel locations on the input feature map F_c. a_c is then processed by matrix transformation and one-dimensional convolution to obtain b_c ∈ R^{C×1×1}; the one-dimensional convolution performs information interaction across channels, and the number of interacting channels N is adaptively adjusted according to the number of input channels:

N = |log₂(C)/α + β/α|_odd

where C is the number of input feature channels, |x|_odd denotes the odd number nearest to x, α = 2, and β = 1.

The channel-information weights obtained from b_c after a matrix transformation (reshape) and a sigmoid activation function σ are applied to the input feature F_c, giving the final channel self-attention output F_c′ = σ(reshape(b_c)) ⊗ F_c.
S32, inputting the template features and the search features into a spatial attention mechanism, which describes the dependencies within the feature channels and obtains information weights for different spatial positions in the feature map, to obtain spatial features carrying position-information weights.
Specifically, average pooling and max pooling are performed on the input template features and search features to obtain a first pooled feature and a second pooled feature, respectively; the first and second pooled features are stacked to obtain a stacked feature; the template features and the search features are passed through a standard convolution and multiplied with the stacked feature to obtain a product feature; and the product feature is passed through a convolution layer with a 7×7 kernel and a sigmoid activation function to obtain the spatial features carrying position-information weights.
The spatial attention mechanism (Spatial Attention Mechanism, SAM) describes the dependencies within the feature channels and obtains information weights for different spatial positions in the feature map. For an input feature F_s ∈ R^{C×H×W}, where C, H, and W are the number of channels, the height, and the width of the input feature, average pooling and max pooling are first applied to F_s to obtain F_A and F_M. F_A and F_M are stacked to obtain a_s ∈ R^{(2C/r)×H×W}, where r is the channel reduction factor. F_s is passed through a standard convolution layer with a 1×1 kernel, and the result is multiplied element-wise with a_s to obtain b_s.

b_s is then passed through a convolution layer with a 7×7 kernel and a sigmoid activation function to obtain the spatial features carrying the position-information weights, where ω_ij denotes the information weight of position (i, j) on the spatial attention feature.

As shown in fig. 4, for an input feature F_in of size C×H×W, the spatial attention mechanism (Spatial Attention Mechanism, SAM) produces F_out of size 1×H×W; the specific operation can be written as

F_out = σ(conv_{7×7}(Cat(F_A, F_M) ⊗ conv_{1×1}(F_in)))

where conv_{7×7} and conv_{1×1} denote the convolution layers with 7×7 and 1×1 kernels, respectively, σ denotes the sigmoid function, and Cat denotes feature stacking.
S4, fusing the channel features carrying channel-information weights with the spatial features carrying position-information weights, and inputting the fused features separately into the last two convolution blocks of the improved VGG16 backbone and into a DWConv-Transformer encoder for processing, to obtain final template features and final search features;
the specific method is as follows:
s41, fusing the channel characteristics containing the channel information weight and the spatial characteristics containing the position information weight, performing downsampling through a standard convolution, inputting the fused channel characteristics and the spatial characteristics containing the position information weight into a DWConv-transform encoder, performing convolution projection mapping in the DWConv-transform encoder by using depth separable convolution with a preset step length and a convolution kernel size of 3 multiplied by 3, and obtaining query, key and value through matrix transformation;
the modified transducer structure was named DWConv-transducer.
Specifically, the features processed by the first parallel attention module as shown in fig. 3 are input to the DWConv-transform encoder through a convolution layer as follows:
using step size S for features of parallel attention processing, respectively q =1,S v =S k =2, a convolution kernel size of 3×3 depth divisibleThe deconvolution projection mapping is carried out, the query Q, the key K and the value V are obtained through matrix transformation, and the calculation process is as follows:
where DWConv represents a depth-separable convolution, conv 3×3 Representing a standard convolution, F representing the input features.
S42, taking the query, the key and the value as input of a multi-head attention module, calculating the similarity between the query and the key in a manner of Einstein summation convention, obtaining an attention weight matrix after Softmax, and finally carrying out dot multiplication on the value and the attention weight matrix to obtain a vector with attention weight;
Specifically, Q, K and V are used as the inputs of the multi-head attention module; the similarity between Q and K is calculated via the Einstein summation convention, the attention weight matrix is obtained after Softmax, and finally the attention-weighted vector is obtained by dot multiplication of the value V with the attention weight matrix. The calculation process is as follows:

Attention(Q, K, V) = Softmax(QK^T / sqrt(d_k)) V

where d_k is the scaling factor of the attention weight matrix.
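A minimal single-head sketch of this computation (the multi-head split into sub-dimensions is omitted for brevity) can be expressed directly with Einstein summation notation; shapes follow the projection step above and are assumptions:

```python
import torch

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention via Einstein summation.
    q: (B, Nq, d), k: (B, Nk, d), v: (B, Nk, d)."""
    d_k = q.shape[-1]
    scores = torch.einsum('bqd,bkd->bqk', q, k) / d_k ** 0.5  # Q-K similarity
    weights = torch.softmax(scores, dim=-1)                   # attention weight matrix
    return torch.einsum('bqk,bkd->bqd', weights, v)           # weighted values

q = torch.randn(1, 441, 256)
k = torch.randn(1, 121, 256)
v = torch.randn(1, 121, 256)
out = attention(q, k, v)   # attention-weighted vector, (1, 441, 256)
```

Note that the output keeps the query token count (441) even though K and V were downsampled, so the spatial resolution of the feature is preserved.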
S43, processing the attention-weighted vector through Layer Normalization and a multi-layer perceptron (Multilayer Perceptron, MLP), performing residual addition with the vector, and finally obtaining processing features through matrix transformation;
specifically, the vector obtained by the multi-head attention module is processed by Layer Normalization and a multi-layer perceptron (MLP) and then subjected to residual addition; finally, matrix transformation yields processing features with the same resolution and channel dimension as the original features input into the DWConv-Transformer encoder.
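The tail of the encoder block described in S43 can be sketched as follows; the 4× hidden expansion of the MLP and the GELU activation are common Transformer defaults and are assumptions here, not taken from the text:

```python
import torch
import torch.nn as nn

class EncoderTail(nn.Module):
    """LayerNorm -> MLP, residual addition with the attention output,
    then matrix transformation back to a (B, C, H, W) feature map."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        tokens = tokens + self.mlp(self.norm(tokens))  # residual addition
        # reshape tokens back to the input resolution and channel dimension
        return tokens.transpose(1, 2).reshape(tokens.shape[0], -1, h, w)

x = torch.randn(1, 441, 256)           # attention-weighted token sequence
feat = EncoderTail(256)(x, 21, 21)     # (1, 256, 21, 21) processing features
```

The output matches the resolution and channel dimension of the feature that entered the encoder, which is what allows the later element-wise addition with the convolution branch in S45.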
S44, fusing the channel features containing the channel information weights and the spatial features containing the position information weights, and inputting the fused features into the latter two-layer convolution block of the improved feature extraction backbone network VGG16 to obtain output features;
specifically, the features subjected to parallel attention processing are input into the latter two-layer convolution block of the backbone network VGG16, obtaining template features with a resolution of 5×5 and a channel dimension of 256 and search features with a resolution of 21×21 and a channel dimension of 256, where the latter two-layer convolution block comprises three convolution layers with a 3×3 convolution kernel and two convolution layers with a 1×1 convolution kernel.
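The latter two-layer convolution block can be sketched as below. The text specifies only the layer counts (three 3×3 and two 1×1 convolutions) and the 256-channel output; the interleaving order, intermediate widths and ReLU activations are assumptions:

```python
import torch
import torch.nn as nn

# Hedged sketch of the "latter two-layer convolution block": three 3x3
# conv layers and two 1x1 conv layers ending at 256 channels. Padding=1
# on the 3x3 layers keeps the spatial resolution unchanged.
tail_block = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, 1),            nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, 1),            nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, 3, padding=1),
)

search = torch.randn(1, 256, 21, 21)   # fused attention features (search branch)
out = tail_block(search)               # (1, 256, 21, 21): resolution preserved
```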
And S45, performing element addition of the processing features and the output features to obtain the final template features and the final search features.
Specifically, the features obtained by the multi-layer DWConv-Transformer encoder and the features output by the latter two-layer convolution block of VGG16 are element-wise added to obtain the final template features and final search features.
S5, processing the final template feature with the parallel attention mechanism and then performing cross-correlation convolution with the final search feature to obtain a similarity score map of the two, and mapping the coordinates of the maximum point in the similarity score map back to the original image to obtain the tracking result.
Specifically, the final template feature obtained in step S4 is processed by the parallel attention mechanism to increase the characterization capability of the template; cross-correlation convolution is then performed with the final search feature to obtain a similarity score map of the two, and the coordinates of the maximum point in the similarity score map are mapped back to the original image to obtain the tracking result.
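The cross-correlation step can be sketched with `conv2d`, using the template feature as the convolution kernel in the style of Siamese trackers such as SiamFC; the feature shapes follow the 5×5 / 21×21 resolutions stated in S44 and the single-sample batch is an assumption:

```python
import torch
import torch.nn.functional as F

def xcorr(template: torch.Tensor, search: torch.Tensor) -> torch.Tensor:
    """Cross-correlation convolution: slide the template feature over the
    search feature to produce a similarity score map."""
    return F.conv2d(search, template)   # template acts as the kernel

z = torch.randn(1, 256, 5, 5)     # final template feature
x = torch.randn(1, 256, 21, 21)   # final search feature
score = xcorr(z, x)               # (1, 1, 17, 17) similarity score map
peak = score.flatten().argmax()   # index of the maximum point, to be
                                  # mapped back to original-image coordinates
```

Each position of the 17×17 score map corresponds to one placement of the 5×5 template within the 21×21 search region, which is why the maximum-point coordinates can be mapped back to a location in the original image.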
The invention designs a target tracking method combining a hierarchical encoder and a parallel attention mechanism, whose key technical means are mainly as follows:
1. VGG16 is improved to serve as the feature extraction backbone network; a clipping block is designed to eliminate the potential position deviation caused by padding operations in the network, and dilated convolution is combined to solve the problem of an insufficient receptive field caused by shortening the network stride.
2. The invention designs a parallel attention mechanism that focuses on the weight differences of targets across different feature channels and the dependency relationships between feature channels, so as to adjust the attention center of the tracker and enhance the expression capability of the features.
3. The invention utilizes the capability of the hierarchical Transformer structure to capture long-range temporal context information and the capability of the convolution network to capture local context information to improve the robustness of the tracker in complex environments; the linear mapping in the Transformer encoder is replaced with convolution mapping, and standard convolution is replaced with depth-separable convolution, thereby reducing the introduction of extra parameters.
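The parameter saving from swapping a standard 3×3 convolution for a depth-separable one can be checked numerically; the 256-channel width below is an illustrative assumption:

```python
import torch.nn as nn

# Standard 3x3 convolution, 256 -> 256 channels.
std = nn.Conv2d(256, 256, 3, padding=1)

# Depth-separable equivalent: depthwise 3x3 (groups=channels) + pointwise 1x1.
dws = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1, groups=256),
    nn.Conv2d(256, 256, 1),
)

n_std = sum(p.numel() for p in std.parameters())  # 256*256*9 + 256 = 590080
n_dws = sum(p.numel() for p in dws.parameters())  # 2304+256 + 65536+256 = 68352
```

At this width, the depth-separable version uses roughly one eighth of the parameters of the standard convolution, consistent with the stated goal of reducing extra parameters.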
4. The proposed target tracking method combining a hierarchical encoder and a parallel attention mechanism is evaluated on the public dataset OTB100; its tracking precision (Precision) and tracking success rate (AUC) are significantly improved and superior to those of mainstream algorithms such as SiamFC, SiamRPN and DaSiamRPN. The results are shown in fig. 7, where Ours denotes the performance obtained by the method provided by the invention.
The above disclosure is merely a preferred embodiment of the target tracking method combining a hierarchical encoder and a parallel attention mechanism of the present invention; it should be understood that the scope of the present invention is not limited thereto, and those skilled in the art will appreciate that equivalent modifications of all or part of the above embodiments still fall within the scope defined by the claims of the present invention.

Claims (9)

1. A method of object tracking combining a hierarchical encoder and a parallel attention mechanism, comprising the steps of:
preprocessing a video sequence to obtain an initial template image and an initial search image;
performing feature extraction on the initial template image and the initial search image by utilizing an improved feature extraction backbone network VGG16 to obtain template features and search features;
focusing the weight difference of the target on different characteristic channels and the dependency relationship between the characteristic channels through a parallel attention mechanism based on the template characteristics and the search characteristics to obtain channel characteristics containing channel information weights and spatial characteristics containing position information weights;
the channel features containing the channel information weights and the spatial features containing the position information weights are fused and then respectively input into the latter two-layer convolution block and the DWConv-Transformer encoder of the improved feature extraction backbone network VGG16 for processing, so that final template features and final search features are obtained;
and processing the final template feature with the parallel attention mechanism and then performing cross-correlation convolution with the final search feature to obtain a similarity score map of the two, and mapping the coordinates of the maximum point in the similarity score map back to the original image to obtain the tracking result.
2. The method of object tracking combining a hierarchical encoder and a parallel attention mechanism of claim 1,
the preprocessing includes clipping and RGB mean filling.
3. The method of object tracking combining a hierarchical encoder and a parallel attention mechanism of claim 2,
the feature extraction of the initial template image and the initial search image by using the improved feature extraction backbone network VGG16 to obtain template features and search features includes:
optimizing the original feature extraction backbone network VGG16 to obtain an improved feature extraction backbone network VGG16;
inputting the initial template image and the initial search image into the improved feature extraction backbone network VGG16 to obtain template features and search features.
4. The method for object tracking combining a hierarchical encoder and a parallel attention mechanism of claim 3,
the optimizing the original feature extraction backbone network VGG16 to obtain an improved feature extraction backbone network VGG16 includes:
and reducing the original 5 max-pooling layers of the original feature extraction backbone network VGG16 to 3 layers, introducing a clipping module to clip the outermost features, and introducing dilated convolution in the fourth and fifth layers of the network, to obtain the improved feature extraction backbone network VGG16.
5. The method for object tracking combining a hierarchical encoder and a parallel attention mechanism of claim 4,
focusing the dependency relationship between the weight difference of the target in different characteristic channels and the characteristic channels through a parallel attention mechanism based on the template characteristics and the search characteristics to obtain channel characteristics containing channel information weights and spatial characteristics containing position information weights, wherein the method comprises the following steps:
inputting the template features and the search features into a channel attention mechanism, and adaptively endowing different feature channels with different weights according to the response degrees of the feature images of the different channels to different target descriptors so as to adjust the importance degrees of the channels to different targets and obtain channel features containing channel information weights;
and inputting the template features and the search features into a spatial attention mechanism, describing the dependency relationship between the internal information of the feature channels, and acquiring the information weights of different spatial positions in the feature map to obtain the spatial features containing the position information weights.
6. The method for object tracking combining a hierarchical encoder and a parallel attention mechanism of claim 5,
the step of inputting the template features and the search features into a channel attention mechanism, adaptively giving different weights to different feature channels according to the response degrees of the different channels of a feature map to different target descriptors to adjust the importance degrees of the channels to different targets, and obtaining channel features containing channel information weights, comprising:
obtaining a description matrix of channel dimensions by utilizing two-dimensional self-adaptive average pooling operation on the inputted template features and the searching features;
the description matrix is subjected to matrix transformation and one-dimensional convolution treatment to obtain a treatment matrix;
and endowing the channel information weight obtained by the processing matrix after the matrix transformation reshape and sigmoid activation function with the template feature and the search feature to obtain the channel feature containing the channel information weight.
7. The method for object tracking combining a hierarchical encoder and a parallel attention mechanism of claim 6,
the step of inputting the template features and the search features into a spatial attention mechanism, describing the dependency relationship between the internal information of the feature channels, and obtaining the information weights of different spatial positions in the feature map to obtain the spatial features containing the position information weights, comprises the following steps:
respectively carrying out average pooling operation and maximum pooling operation on the inputted template features and the searching features to obtain a first pooling feature and a second pooling feature;
stacking the first pooled feature and the second pooled feature to obtain a stacked feature;
multiplying the template feature and the search feature with the stacking feature through a standard convolution to obtain a product feature;
and the product characteristic is subjected to a convolution layer with the convolution kernel size of 7 multiplied by 7 and a sigmoid activation function to obtain the spatial characteristic containing the position information weight.
8. The method for object tracking combining a hierarchical encoder and a parallel attention mechanism of claim 7,
the latter two-layer convolution block of the improved feature extraction backbone network VGG16 comprises 3 layers of convolution layers with a convolution kernel size of 3 x 3 and 2 layers of convolution layers with a convolution kernel size of 1 x 1.
9. The method for object tracking combining a hierarchical encoder and a parallel attention mechanism of claim 8,
the step of fusing the channel features containing the channel information weights and the spatial features containing the position information weights and then respectively inputting them into the latter two-layer convolution block and the DWConv-Transformer encoder of the improved feature extraction backbone network VGG16 for processing, to obtain final template features and final search features, comprises the following steps:
the channel features containing the channel information weights and the spatial features containing the position information weights are fused, downsampled through a standard convolution, and then input into the DWConv-Transformer encoder; convolution projection mapping is performed in the DWConv-Transformer encoder using depth-separable convolution with a preset step size and a 3×3 convolution kernel, and the query, key and value are obtained through matrix transformation;
taking the query, the key and the value as inputs of a multi-head attention module, calculating the similarity between the query and the key in a manner of Einstein summation convention, obtaining an attention weight matrix after Softmax, and finally performing dot multiplication on the value and the attention weight matrix to obtain a vector with attention weight;
processing the attention-weighted vector through Layer Normalization and a multi-layer perceptron, performing residual addition with the vector, and finally obtaining processing features through matrix transformation;
the channel characteristics containing the channel information weight and the spatial characteristics containing the position information weight are fused and then input into a later two-layer convolution block of the improved characteristic extraction backbone network VGG16, so that output characteristics are obtained;
and carrying out element addition on the processing features and the output features to obtain final template features and final search features.