CN117576149A - Single-target tracking method based on attention mechanism - Google Patents

Single-target tracking method based on attention mechanism

Info

Publication number
CN117576149A
CN117576149A
Authority
CN
China
Prior art keywords
network
feature map
attention
training
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311360574.5A
Other languages
Chinese (zh)
Inventor
黄丹丹
张钰晨
陈广秋
刘智
段锦
白昱
许鹤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changchun University of Science and Technology
Original Assignee
Changchun University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changchun University of Science and Technology filed Critical Changchun University of Science and Technology
Priority to CN202311360574.5A priority Critical patent/CN117576149A/en
Publication of CN117576149A publication Critical patent/CN117576149A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/248Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving reference images or patches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0985Hyperparameter optimisation; Meta-learning; Learning-to-learn
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/62Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of computer vision and in particular relates to a single-target tracking method based on an attention mechanism, comprising the following steps: step 1, data preprocessing, which prepares the data for subsequent network model training; step 2, constructing a Siamese network framework based on temporal information; step 3, adding an attention module; step 4, model training, in which the constructed network model is trained to finally obtain the network weights of the attention-based single-target tracking architecture; step 5, adding a bounding-box refinement module; and step 6, model testing, in which the network weights obtained by training are used to test the target-tracking effect on a new video sequence. By combining the attention module with the online feature extraction module, the invention reduces the interference of unnecessary features on the computation and further improves tracking accuracy; by introducing the bounding-box refinement module, the model performance is greatly improved.

Description

Single-target tracking method based on attention mechanism
Technical Field
The invention relates to the technical field of computer vision, in particular to a single-target tracking method based on an attention mechanism.
Background
Single-target tracking is one of the hot spots in computer vision research and is widely applied: camera focus tracking, automatic target tracking by unmanned aerial vehicles and the like all rely on single-target tracking technology. In addition, the tracking of specific objects, such as human body tracking, vehicle tracking in traffic monitoring systems, and face tracking and gesture tracking in intelligent interaction systems, also depends on the target tracking task. With the continuous, in-depth research of scholars, visual target tracking has made breakthrough progress over the past decade or so, so that visual tracking algorithms are no longer limited to traditional machine learning methods; in recent years they have been combined with deep learning, correlation filters and other methods to obtain robust and accurate results. With the great success of deep learning in fields such as speech recognition, image classification and object detection, deep learning frameworks are applied to the target tracking task more and more comprehensively, and target tracking technology is finding wider application in many areas of society. Especially in the current social environment, the demand of all sectors of society for high-tech tracking methods keeps rising, which further highlights the importance of target tracking technology.
In recent years, by virtue of their small size, flexible maneuvering, large detection range and strong autonomy, unmanned aerial vehicles have been applied increasingly widely in many fields such as civilian use, military use and scientific research, and UAV tracking of ground targets has received growing attention. Most existing tracking algorithms are mainly designed for target tracking in natural scenes, and there is as yet no complete solution for the tracking task in UAV aerial scenes. For single-target tracking datasets in general scenes, the tracked targets are mostly common, general objects such as pedestrians, vehicles and animals. They are generally captured by ordinary cameras and are characterized by relatively large targets, obvious contours and clear textures. The corresponding difficulty, however, is that the targets easily undergo large deformations during motion; a pedestrian, for example, may turn, raise a hand or jump, which poses great challenges to the tracking algorithm. With the shooting requirements of special scenes, UAVs are increasingly widely used in daily life, and many scholars are studying how to provide UAVs with rich computer vision functions such as target detection and tracking. One benefit of using UAVs for target tracking is the ability to track group targets, such as crowds, animals and ships; meanwhile, target tracking with aerial devices is being applied in various industries, such as filming extreme activities (high-altitude search and rescue, gliding, etc.), monitoring dense crowds and monitoring wild animals.
The main differences between single-target tracking in a UAV scene and in a general scene are as follows: in the aerial environment the tracking time is generally longer; the higher shooting altitude makes the tracking bounding box inaccurate; computing resources are limited by the restricted capability of the aerial platform, so the interference of redundant features on the computation must be addressed; and the shooting angle changes easily, so problems such as large target-scale variation caused by fast camera motion are more pronounced. In view of these problems, a single-target tracking method based on an attention mechanism is proposed.
Disclosure of Invention
(I) Technical problems to be solved
Aiming at the defects of the prior art, the invention provides a single-target tracking method based on an attention mechanism, which solves the problems in the background art.
(II) Technical scheme
To achieve the above purpose, the invention adopts the following technical scheme:
a single target tracking method based on an attention mechanism, comprising the steps of:
step 1, data preprocessing, namely providing data preparation for subsequent network model training;
step 2, constructing a twin network frame based on time sequence information;
step 3, adding an attention module;
training the model, namely training the constructed network model, and finally obtaining the network weight of the single-target network architecture based on the attention;
step 5, adding a bounding box thinning module;
and 6, testing a model, namely testing the effect of tracking the target by using the network weight obtained through training in a new video sequence.
Further, in step 1, each video frame in each dataset is cropped to a fixed size by the data preprocessing operation and then placed in newly generated folders; these folders contain the cropped template and search-region sample pictures used for training, where the template picture Z has a size of 127×127 and the search-region picture X has a size of 511×511.
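For illustration, the cropping step described above can be sketched as follows in Python with OpenCV; the zero-padding at the image border and the helper names are assumptions made for this sketch rather than details fixed by the method:

```python
import cv2

def center_crop(img, cx, cy, size):
    """Crop a size x size patch centred at (cx, cy); out-of-range pixels are zero-padded."""
    x0 = int(round(cx)) - size // 2
    y0 = int(round(cy)) - size // 2
    # pad generously so the crop window always lies inside the padded image
    img = cv2.copyMakeBorder(img, size, size, size, size, cv2.BORDER_CONSTANT, value=0)
    x0, y0 = x0 + size, y0 + size
    return img[y0:y0 + size, x0:x0 + size]

def preprocess_frame(frame, box):
    """box = (x, y, w, h): annotated target box in this frame."""
    cx, cy = box[0] + box[2] / 2.0, box[1] + box[3] / 2.0
    template = center_crop(frame, cx, cy, 127)  # template picture Z, 127 x 127
    search = center_crop(frame, cx, cy, 511)    # search-region picture X, 511 x 511
    return template, search
```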
Further, step 2 comprises a Siamese network for feature extraction, a Transformer-based similarity-map refinement network, and a classification-regression sub-network for the classification and regression of the target position.
Further, in step 3 an attention module is added. The attention module comprises an adaptive average pooling layer that reduces the input feature to a global average; the reduced feature is then mapped to context weights through a series of convolution layers and activation functions; finally, the context weights are multiplied with the input feature to obtain a weighted feature representation. This attention mechanism has the advantage of effectively modeling the global context while remaining computationally lightweight, which reduces the interference of unnecessary features on the calculation. In the formula of the attention mechanism, α denotes the weight of global attention pooling and δ(·) = W_v2·ReLU(LN(W_v1(·))) denotes a bottleneck transform; this attention module is lightweight and can better capture long-range dependencies without increasing the computational cost.
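A minimal PyTorch sketch of such a lightweight global-context attention block is given below. It follows the description above (global average pooling, a bottleneck of 1×1 convolutions with LayerNorm and ReLU, and re-weighting of the input); the reduction ratio and the final Sigmoid used to map the features to a weight range are illustrative assumptions, not values specified here:

```python
import torch
import torch.nn as nn

class GlobalContextAttention(nn.Module):
    """Pool the input to a global descriptor, transform it with a bottleneck
    delta(.) = W_v2 ReLU(LN(W_v1(.))), and re-weight the input feature map."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # adaptive average pooling layer
        self.bottleneck = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),  # W_v1
            nn.LayerNorm([channels // reduction, 1, 1]),                # LN
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),  # W_v2
            nn.Sigmoid(),  # map to a context-weight range (assumed)
        )

    def forward(self, x):
        context = self.pool(x)              # B x C x 1 x 1 global average
        weights = self.bottleneck(context)  # context weights
        return x * weights                  # weighted feature representation

# usage: y = GlobalContextAttention(256)(torch.randn(1, 256, 25, 25))
```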
Further, the step 4 model training includes the following steps:

S1, sample pictures are input into the Siamese network for training; the training process is performed offline, and five datasets are used for training: COCO, ImageNet-VID 2015, GOT-10K, LaSOT and YouTube-BB;

S2, the Siamese network is used to measure the similarity of the input samples: each sample comprises a target image and a search image, where the target image is the 127×127 picture of the target to be tracked and the search image is the 511×511 picture in which tracking is performed; the Siamese neural network has two input branches, one taking the target template Z as input and the other taking the search region X as input; the two inputs are fed into two weight-sharing neural networks, which map them into a new space to form their representations in that space;

S3, feature-map extraction network: the target template and the search region are fed into the same feature extraction network, namely AlexNet; the first three convolution layers of the AlexNet network are retained, the fourth and fifth convolution layers are replaced by online temporally adaptive convolutions, and the feature extraction network finally outputs two feature maps, namely the target-image feature map and the search-region feature map;

S4, Transformer-based similarity-map refinement network: the two-branch feature-map information output by the feature extraction network is fed into the Transformer network simultaneously; first, temporal prior knowledge of fixed size is designed and encoded in a memory-efficient manner, then the prior knowledge is decoded to accurately adjust the similarity map, and information filtering is performed to obtain the feature map of the current frame; multi-head attention is the most basic component of the Transformer, its formula is shown below, and 6 heads are used here:

MultiHead(Q, K, V) = Concat(head_1, …, head_6)·W^O, head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V), Attention(Q, K, V) = softmax(Q·K^T/√d_k)·V

For clarity, the temporal knowledge of frame t-1 is denoted F̃_{t-1} and the feature of the current frame is denoted F_t; the intermediate result F'_t is obtained from F_t and F̃_{t-1} through the multi-head attention above; the output of the information filter is then obtained from F'_t, where F denotes a convolution layer, and the final temporal knowledge F̃_t of the current frame is obtained from F'_t and the filter output; for t = 1, an independent convolution is used for the initialization operation;

S5, classification-regression network: first, each position (i, j) in the feature map is mapped back to the search region as (x, y), and the response map is convolved to obtain a classification branch and a regression branch; the classification branch yields a classification feature map A^cls and a centerness feature map A^cen; the classification feature map predicts the category of each position, and each point (i, j, :) on A^cls contains a 2D vector representing the foreground and background scores of the corresponding location; alongside the classification feature map there is also the centerness feature map A^cen, which assigns each pixel a centerness score; positions with high scores are close to the target center, and the centerness can be used to remove outliers, because positions far from the center tend to produce low-quality predicted bounding boxes;

the regression branch of the classification-regression network outputs a regression feature map A^reg; each point (i, j, :) on A^reg contains a 4D vector t(i, j) = (l, t, r, b) representing the distances from the corresponding location to the four sides of the bounding box in the input search region; let (x_0, y_0) and (x_1, y_1) denote the top-left and bottom-right corners of the ground-truth bounding box and (x, y) the position corresponding to point (i, j); the regression target (l̃, t̃, r̃, b̃) at that point can be calculated by the following formula:

l̃ = x − x_0,  t̃ = y − y_0,  r̃ = x_1 − x,  b̃ = y_1 − y

where (x_0, y_0) and (x_1, y_1) are the top-left and bottom-right corners of the ground-truth bounding box, and l̃, t̃, r̃, b̃ are the distances from the corresponding point on the regression feature map to the four sides of the bounding box.
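The regression targets above can be computed directly; the short sketch below (PyTorch, with an assumed (N, 2) tensor of mapped-back positions) mirrors the formula:

```python
import torch

def regression_targets(points, gt_box):
    """points: (N, 2) tensor of (x, y) search-region positions that the feature-map
    cells (i, j) map back to; gt_box: (x0, y0, x1, y1) ground-truth corners."""
    x0, y0, x1, y1 = gt_box
    l = points[:, 0] - x0   # distance to the left side
    t = points[:, 1] - y0   # distance to the top side
    r = x1 - points[:, 0]   # distance to the right side
    b = y1 - points[:, 1]   # distance to the bottom side
    targets = torch.stack([l, t, r, b], dim=1)
    inside = targets.min(dim=1).values > 0  # only points inside the box are positives
    return targets, inside
```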
Training the whole network in an end-to-end manner: the loss of the classification part is L_cls, the bounding-box regression loss is L_reg, and the centerness loss is L_cen; the components are weighted together with the corresponding weights to obtain the overall weighted loss function:

L = L_cls + λ_1·L_cen + λ_2·L_reg

In the above formula, cross-entropy loss is used for classification, IOU loss is used for regression, and the centerness loss is also included;

according to the gradient computed from the loss function L, the SGD optimizer is used to update the network parameters so that the loss function of the whole network decreases until convergence; the whole training then ends, and the trained network weights for attention-based single-target tracking are obtained;

S6, pre-training: the network backbone is pre-trained on ImageNet, the same initialization is used for the online temporally adaptive convolution layer, and the similarity-map refinement network is randomly initialized; the parameters of the backbone are frozen for the first 5 epochs, and the learning rate over the remaining epochs of the training process decreases from 0.005 to 0.0005; SGD is adopted as the optimizer with a momentum of 0.9, the mini-batch size is 124, and the input sizes of the template and search region are 127×127 and 511×511, respectively.
Further, in step 5 a bounding-box refinement network is added. The bounding-box refinement module first needs the template information as input for initialization; its search region comes from the local search region cropped around the current bounding box output by the tracker, which is about 2 times the target size. The feature-fusion operation plays a key role in the bounding-box refinement module, where point-wise convolution is adopted for feature fusion. Suppose the template feature map is denoted K ∈ R^(C×H_0×W_0) and the search-region feature map is denoted S ∈ R^(C×H×W); then K is first decomposed into H_0×W_0 small feature maps K_j ∈ R^(C×1×1), each of which is correlated with S in the ordinary way, yielding a fused feature map of size (H_0·W_0)×H×W. This process can be described as

R_j = K_j ⋆ S,  j = 1, …, H_0×W_0

where ⋆ denotes the ordinary correlation operation. This point-wise convolution avoids the spatial-ambiguity problem caused by cross-correlating the whole template feature map with the search-region feature map using a sliding window. The refined bounding box predicted by this module is output as the final result, and the tracker is updated with this result at the same time.
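One straightforward reading of this point-wise (pixel-wise) correlation is sketched below in PyTorch: each 1×1 template vector K_j is treated as a separate correlation kernel, implemented with a grouped convolution; the batching scheme is an implementation choice of this sketch:

```python
import torch
import torch.nn.functional as F

def pointwise_correlation(K, S):
    """K: template feature map (B, C, H0, W0); S: search-region feature map (B, C, H, W).
    Returns a fused map (B, H0*W0, H, W): each output channel is the correlation of one
    1x1 template vector K_j with the search-region features."""
    B, C, H0, W0 = K.shape
    kernels = K.permute(0, 2, 3, 1).reshape(B * H0 * W0, C, 1, 1)  # one 1x1 kernel per K_j
    S = S.reshape(1, B * C, S.shape[-2], S.shape[-1])              # fold batch into channels
    fused = F.conv2d(S, kernels, groups=B)                         # per-sample correlation
    return fused.reshape(B, H0 * W0, fused.shape[-2], fused.shape[-1])

# usage: R = pointwise_correlation(torch.randn(1, 256, 8, 8), torch.randn(1, 256, 16, 16))
```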
Further, the step 6 model test comprises the following step:

S1, the trained weight parameters are used to test the tracking effect on a new, previously unseen video sequence.
(III) Beneficial effects
Compared with the prior art, the invention provides a single-target tracking method based on an attention mechanism, which has the following beneficial effects:
according to the invention, the attention template is added, so that the problem of interference of unnecessary features on calculation is reduced, and the object representation with more discrimination is generated, thereby greatly improving the model performance.
According to the invention, by introducing the bounding-box refinement module, when the tracker finds the target in the local search region, the algorithm outputs a refined bounding box of the target as the current tracking result. Existing algorithms can only roughly estimate the bounding box of the target and in many cases fail to enclose it well, which reduces tracking accuracy; using the output of the bounding-box refinement module as the final result improves algorithm performance.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a schematic diagram of the overall network framework structure of the present invention;
FIG. 3 is a schematic diagram of the attention structure of the present invention;
FIG. 4 is a schematic diagram of a training architecture of the present invention;
FIG. 5 is a schematic diagram of a bounding box refinement module of the present invention;
FIG. 6 is a graph of the evaluation of the present invention on the UAV123 test dataset;
FIG. 7 is a schematic diagram of the tracking results of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples
As shown in fig. 1-7, a single-target tracking method based on an attention mechanism according to an embodiment of the present invention includes the following steps:

Step 1, each video frame in each dataset is cropped to a fixed size by the data preprocessing operation and then placed in newly generated folders; these folders contain the cropped template and search-region sample pictures used for training, where the template picture Z has a size of 127×127 and the search-region picture X has a size of 511×511;

Step 2, constructing a temporal-information Siamese network framework. The core idea of target tracking is to first frame the object to be tracked in the initial frame and use it as the retrieval basis for subsequent frames; next, the target and the search region are fed into the Siamese network simultaneously, which outputs two feature maps; then, the two-branch feature-map information output by the feature extraction network is fed into the Transformer network at the same time, the temporal knowledge is encoded in a memory-efficient manner and then decoded to accurately adjust the similarity map, so that a finer feature description is obtained.
The backbone network for feature extraction is a CNN structure shared by the two branches. The target template and the search region are fed into the same feature extraction network, namely AlexNet; the first three convolution layers of the AlexNet network are retained, and the fourth and fifth convolution layers are replaced by online temporally adaptive convolutions to enhance the spatial features. This process calibrates the convolution weights of the next frame according to the convolution weights of the previous frame, and the feature extraction network finally outputs two feature maps, namely the target-image feature map and the search-region feature map;
to increase the operation speed, we replace the normal convolution with the on-line time sequence adaptive convolution, first define the t frameThe image is X t Is X-after convolution calculation t The formula is as follows:
X^ t =W*X t +b
wherein, operator represents convolution operation, W t 、b t Is the time weight and offset of the convolution, in the online convolution layer, the parameter is a parameter (W t 、b t ) And convolution operator, the calibration factors being different for each frame, i.eAnd->
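The sketch below illustrates one possible form of such an online temporally adaptive convolution in PyTorch: a shared base convolution whose output is calibrated per frame by factors produced from the previous frame's feature. The way the calibration factors are generated here (a pooled 1×1 convolution with a Sigmoid) is an assumption made only for illustration:

```python
import torch
import torch.nn as nn

class OnlineTAdaConv(nn.Module):
    """Base convolution W*X_t + b whose output channels are re-scaled by per-frame
    calibration factors computed from the previous frame's feature X_{t-1}."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.base = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
        self.calib = nn.Sequential(           # assumed calibration head
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x_t, x_prev):
        alpha = self.calib(x_prev)            # (B, out_ch, 1, 1) frame-specific factors
        out = self.base(x_t)                  # W * X_t + b
        return out * alpha                    # equivalent to scaling W_t and b_t per channel

# usage: y = OnlineTAdaConv(256, 256)(torch.randn(1, 256, 28, 28), torch.randn(1, 256, 28, 28))
```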
The two-branch feature-map information output by the feature extraction network is fed into the Transformer network simultaneously; temporal prior knowledge of fixed size is designed, old knowledge is continuously extracted and merged into the new knowledge, and information filtering is performed to obtain the feature map of the current frame. Multi-head attention is the most basic component of the Transformer; its formula is shown below, and 6 heads are used here:

MultiHead(Q, K, V) = Concat(head_1, …, head_6)·W^O, head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V), Attention(Q, K, V) = softmax(Q·K^T/√d_k)·V

For clarity, the temporal knowledge of frame t-1 is denoted F̃_{t-1} and the feature of the current frame is denoted F_t; the intermediate result F'_t is obtained from F_t and F̃_{t-1} through the multi-head attention above; the output of the information filter is then obtained from F'_t, where F denotes a convolution layer, and the final temporal knowledge F̃_t of the current frame is obtained from F'_t and the filter output. For t = 1, an independent convolution is used for the initialization operation.

Step 3, adding an attention module. The attention module comprises an adaptive average pooling layer that reduces the input feature to a global average; the reduced feature is then mapped to context weights through a series of convolution layers and activation functions; finally, the context weights are multiplied with the input feature to obtain a weighted feature representation. This attention mechanism has both the advantage of effectively modeling the global context and a lightweight computation, which reduces the interference of unnecessary features on the calculation. In the formula of the attention mechanism, α denotes the weight of global attention pooling and δ(·) = W_v2·ReLU(LN(W_v1(·))) denotes a bottleneck transform; this attention module is lightweight and can better capture long-range dependencies without increasing the computational cost.

Step 4, sample pictures are input into the network for training; the training process is performed offline. The sample pictures are 127×127 and the search pictures are 511×511, and the network backbone is pre-trained on ImageNet-1k. For the first 5 epochs the parameters of the backbone are frozen, and over the remaining epochs the learning rate decreases from 0.005 to 0.0005, for a total training period of 10 epochs. SGD is used as the optimizer with a momentum of 0.9, the mini-batch size is 124, and the input sizes of the template and search region are 127×127 and 511×511, respectively.

Step 5, adding a bounding-box refinement module. The bounding-box refinement module first needs the template information as input for initialization; its search region comes from the local search region cropped around the current bounding box output by the tracker, which is about 2 times the target size. The feature-fusion operation plays a key role in the bounding-box refinement module, where point-wise convolution is adopted for feature fusion. Suppose the template feature map is denoted K ∈ R^(C×H_0×W_0) and the search-region feature map is denoted S ∈ R^(C×H×W); then K is first decomposed into H_0×W_0 small feature maps K_j ∈ R^(C×1×1), each of which is correlated with S in the ordinary way, yielding a fused feature map of size (H_0·W_0)×H×W. This process can be described as

R_j = K_j ⋆ S,  j = 1, …, H_0×W_0

where ⋆ denotes the ordinary correlation operation. This point-wise convolution avoids the spatial-ambiguity problem caused by cross-correlating the whole template feature map with the search-region feature map using a sliding window, and the refined bounding box predicted by this module is output as the final result.
Step 6, model testing: the method is evaluated on the dataset provided by the UAV123 benchmark, and the training effect is measured according to the UAV123 evaluation metrics. As can be seen from fig. 6, the proposed single-target tracking algorithm performs better than the original baseline algorithm.
As shown in fig. 2, in some embodiments, the basic network framework is a temporal-information Siamese network framework, including a Siamese network for feature extraction, a Transformer-based similarity-map refinement network, and a classification-regression sub-network for classifying and regressing the target position.
Feature-map extraction network: features are extracted from the target template and the search region, which are fed into the same feature extraction network, namely AlexNet; the first three convolution layers of AlexNet are retained, the fourth and fifth convolution layers are replaced by online temporally adaptive convolutions, and the feature extraction network finally outputs two feature maps, namely the target-image feature map and the search-region feature map;
network refinement based on a Transformer similarity graph: the two-branch feature map information output by the feature extraction network is simultaneously input into a transducer network, firstly, timing sequence priori knowledge with fixed size is designed, the priori knowledge is effectively encoded in a high-efficiency memory mode, then the priori knowledge is decoded for accurately adjusting a similar map, and information filtering is carried out to obtain the feature map of the current frame. As the most basic composition of the transducer, the multi-headed attention formula is shown below, where we use 6-headed:
for the sake of clarity, we define the timing knowledge of the t-1 frame asThe current frame is F t Intermediate result->Can be expressed as:
thus the output of the information filterThe method comprises the following steps:
wherein F represents a convolution layer.
Knowledge of the timing of the final current frameIs->Can be expressed as:
for t=1, an independent convolution is used for the initialization operation.
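As a hedged illustration only (the exact encoder-decoder layout of the refinement network is not spelled out in this text), the sketch below refines the current-frame feature map with a 6-head attention layer that attends to the stored temporal knowledge, followed by a convolutional information filter; the channel width must be divisible by the number of heads and is an assumed value:

```python
import torch
import torch.nn as nn

class SimilarityRefiner(nn.Module):
    """Attend from the current-frame feature F_t to the temporal knowledge of frame t-1,
    then apply a convolutional information filter and update the temporal knowledge."""
    def __init__(self, dim=192, heads=6):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.filter = nn.Conv2d(dim, dim, kernel_size=3, padding=1)  # information filter F

    def forward(self, f_t, mem):
        # f_t: (B, C, H, W) current-frame feature; mem: (B, N, C) temporal knowledge tokens
        B, C, H, W = f_t.shape
        q = f_t.flatten(2).transpose(1, 2)        # (B, H*W, C) queries from the current frame
        refined, _ = self.attn(q, mem, mem)       # intermediate result F'_t
        refined = refined.transpose(1, 2).reshape(B, C, H, W)
        filtered = self.filter(refined)           # output of the information filter
        new_mem = (refined + filtered).flatten(2).transpose(1, 2)  # updated temporal knowledge
        return filtered, new_mem
```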
Classification-regression network: first, each position (i, j) in the feature map can be mapped back to the search region as (x, y), and the response map is convolved to obtain a classification branch and a regression branch. The classification branch yields a classification feature map A^cls and a centerness feature map A^cen; the classification feature map predicts the category of each position, and each point (i, j, :) on A^cls contains a 2D vector representing the foreground and background scores of the corresponding location. Alongside the classification feature map there is also the centerness feature map A^cen, which assigns each pixel a centerness score; positions with high scores are close to the target center, and the centerness can be used to remove outliers, because positions far from the center tend to produce low-quality predicted bounding boxes.

The regression branch of the classification-regression network outputs a regression feature map A^reg; each point (i, j, :) on A^reg contains a 4D vector t(i, j) = (l, t, r, b) representing the distances from the corresponding location to the four sides of the bounding box in the input search region. Let (x_0, y_0) and (x_1, y_1) denote the top-left and bottom-right corners of the ground-truth bounding box and (x, y) the position corresponding to point (i, j); the regression target (l̃, t̃, r̃, b̃) at that point can be calculated by the following formula:

l̃ = x − x_0,  t̃ = y − y_0,  r̃ = x_1 − x,  b̃ = y_1 − y

where (x_0, y_0) and (x_1, y_1) are the top-left and bottom-right corners of the ground-truth bounding box, and l̃, t̃, r̃, b̃ are the distances from the corresponding point on the regression feature map to the four sides of the bounding box.
Training the whole network in an end-to-end manner: the loss of the classification part is L_cls, the bounding-box regression loss is L_reg, and the centerness loss is L_cen; the components are weighted together with the corresponding weights to obtain the overall weighted loss function:

L = L_cls + λ_1·L_cen + λ_2·L_reg

In the above formula, cross-entropy loss is used for classification, IOU loss is used for regression, and the centerness loss is also included;

according to the gradient computed from the loss function L, the SGD optimizer is used to update the network parameters so that the loss function of the whole network decreases until convergence; the whole training then ends, and the trained network weights for attention-based single-target tracking are obtained.
As shown in fig. 3, in some embodiments, the feature extraction network introduces an attention mechanism. The attention module comprises an adaptive average pooling layer that reduces the input feature to a global average; the reduced feature is then mapped to context weights through a series of convolution layers and activation functions; finally, the context weights are multiplied with the input feature to obtain a weighted feature representation. This attention mechanism has both the advantage of effectively modeling the global context and a lightweight computation, which reduces the interference of unnecessary features on the calculation. In the formula of the attention mechanism, α denotes the weight of global attention pooling and δ(·) = W_v2·ReLU(LN(W_v1(·))) denotes a bottleneck transform; this attention module is lightweight and can better capture long-range dependencies without increasing the computational cost.
As shown in fig. 4, in some embodiments, the model training includes the following steps:

First, a sample is obtained and input into the network model for training. The sample comprises a target image and a search image: the target image is the 127×127 picture of the target to be tracked, and the search image is the 511×511 picture in which tracking is performed. The backbone network is a CNN structure shared by two branches, one branch taking the target template Z as input and the other taking the search region X as input. The target template and the search region are both fed into the same feature extraction network, namely AlexNet; the first three convolution layers of the AlexNet network are retained, the fourth and fifth convolution layers are replaced by online temporally adaptive convolutions, and the feature extraction network finally outputs two feature maps, namely the target-image feature map and the search-region feature map.

To generate the target feature map, the target is first selected and preprocessed, and the final crop is taken around the crop box: the center of the preprocessed picture is directly taken as the center, and the coordinates of a crop box of size 127 are obtained. The obtained crop box is then subjected to operations such as random scaling, random translation by several pixels and random flipping, and a picture with the target at the center is finally obtained through affine transformation.

The generation of the search-region feature map is similar to that of the target feature map. Classification and regression are then performed on the feature maps, where the classification branch distinguishes whether an anchor box is foreground or background, and the regression branch outputs the offset between the anchor-box position and the actual position, which is used to correct the position of the previous frame.

The network backbone is pre-trained on ImageNet. The same initialization is used for the online temporally adaptive convolution layer, and the similarity-map refinement network is randomly initialized. For the first 5 epochs, the parameters of the backbone are frozen, and the learning rate over the remaining epochs of the training process decreases from 0.005 to 0.0005. SGD is used as the optimizer with a momentum of 0.9, the mini-batch size is 124, and the input sizes of the template and search region are 127×127 and 511×511, respectively.
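A rough PyTorch sketch of this training configuration follows; the model.backbone attribute, the total number of epochs and the exponential shape of the decay from 0.005 to 0.0005 are assumptions of the sketch, since only the endpoint values and the 5-epoch freeze are stated:

```python
import torch

def build_optimizer(model, total_epochs=10, frozen_epochs=5):
    """SGD, momentum 0.9; backbone frozen at first, lr decaying 0.005 -> 0.0005."""
    for p in model.backbone.parameters():      # freeze backbone for the first epochs
        p.requires_grad = False
    optimizer = torch.optim.SGD(
        [p for p in model.parameters() if p.requires_grad],
        lr=0.005, momentum=0.9)
    steps = max(total_epochs - frozen_epochs - 1, 1)
    gamma = (0.0005 / 0.005) ** (1.0 / steps)  # exponential decay assumed
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
    return optimizer, scheduler

def unfreeze_backbone(model, optimizer):
    """Call after the frozen epochs: make the backbone trainable again."""
    for p in model.backbone.parameters():
        p.requires_grad = True
    optimizer.add_param_group({"params": model.backbone.parameters()})
```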
As shown in fig. 5, in some embodiments, the network incorporates a bounding-box refinement module. The bounding-box refinement module first needs the template information as input for initialization; its search region comes from the local search region cropped around the current bounding box output by the tracker, which is approximately 2 times the target size. The feature-fusion operation plays a key role in the bounding-box refinement module, where point-wise convolution is adopted for feature fusion. Suppose the template feature map is denoted K ∈ R^(C×H_0×W_0) and the search-region feature map is denoted S ∈ R^(C×H×W); then K is first decomposed into H_0×W_0 small feature maps K_j ∈ R^(C×1×1), each of which is correlated with S in the ordinary way, yielding a fused feature map of size (H_0·W_0)×H×W. This process can be described as:

R_j = K_j ⋆ S,  j = 1, …, H_0×W_0

where ⋆ denotes the ordinary correlation operation. This point-wise convolution avoids the spatial-ambiguity problem caused by cross-correlating the whole template feature map with the search-region feature map using a sliding window, and the refined bounding box predicted by this module is output as the final result.
As shown in fig. 6, in some embodiments, the model test includes the steps of: testing the tracking effect of the trained model in a new video sequence;
In single-target tracking, as shown in fig. 7, a rectangular box containing the center position and size of the target to be tracked is typically given in the first frame; this box is usually marked manually. The tracking algorithm is then required to follow this box in subsequent frames, computing the position offset and size change of the target in each frame. The test results on the UAV123 dataset are presented on video sequences to give a direct visual impression.
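For illustration only, a minimal inference loop of this kind might look as follows; the tracker interface (initialization from the first-frame box, then one prediction per frame) is an assumed API, not one defined by this text:

```python
def run_sequence(tracker, frames, first_box):
    """frames: list of images; first_box: manually marked (x, y, w, h) in the first frame."""
    results = [first_box]
    tracker.init(frames[0], first_box)   # build the template from the first frame
    for frame in frames[1:]:
        box = tracker.track(frame)       # predicted (x, y, w, h) for this frame
        results.append(box)              # records the position offset and size change
    return results
```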
Finally, it should be noted that the foregoing description is only a preferred embodiment of the present invention and is not intended to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or replace some of their technical features with equivalents. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.

Claims (7)

1. A single-target tracking method based on an attention mechanism, characterized in that it comprises the following steps:
step 1, data preprocessing, namely preparing the data for subsequent network model training;
step 2, constructing a Siamese network framework based on temporal information;
step 3, adding an attention module;
step 4, model training, namely training the constructed network model to finally obtain the network weights of the attention-based single-target tracking network architecture;
step 5, adding a bounding-box refinement module;
and step 6, model testing, namely using the network weights obtained by training to test the target-tracking effect on a new video sequence.
2. The attention-based single-target tracking method of claim 1, wherein: in step 1, each video frame in each dataset is cropped to a fixed size by the data preprocessing operation and then placed in newly generated folders; these folders contain the cropped template and search-region sample pictures used for training, where the template picture Z has a size of 127×127 and the search-region picture X has a size of 511×511.
3. The attention-based single-target tracking method of claim 1, wherein: step 2 comprises a Siamese network for feature extraction, a Transformer-based similarity-map refinement network, and a classification-regression sub-network for classifying and regressing the target position.
4. The attention-based single-target tracking method of claim 1, wherein: an attention module is added in step 3; the attention module comprises an adaptive average pooling layer that reduces the input feature to a global average; the reduced feature is then mapped to context weights through a series of convolution layers and activation functions; finally, the context weights are multiplied with the input feature to obtain a weighted feature representation; this attention mechanism has the advantage of effectively modeling the global context while remaining computationally lightweight, which reduces the interference of unnecessary features on the calculation; in the formula of the attention mechanism, α denotes the weight of global attention pooling and δ(·) = W_v2·ReLU(LN(W_v1(·))) denotes a bottleneck transform; this attention module is lightweight and can better capture long-range dependencies without increasing the computational cost.
5. The method of claim 1, wherein the step 4 model training comprises the following steps:

S1, sample pictures are input into the Siamese network for training; the training process is performed offline, and five datasets are used for training: COCO, ImageNet-VID 2015, GOT-10K, LaSOT and YouTube-BB;

S2, the Siamese network is used to measure the similarity of the input samples: each sample comprises a target image and a search image, where the target image is the 127×127 picture of the target to be tracked and the search image is the 511×511 picture in which tracking is performed; the Siamese neural network has two input branches, one taking the target template Z as input and the other taking the search region X as input; the two inputs are fed into two weight-sharing neural networks, which map them into a new space to form their representations in that space;

S3, feature-map extraction network: the target template and the search region are fed into the same feature extraction network, namely AlexNet; the first three convolution layers of the AlexNet network are retained, the fourth and fifth convolution layers are replaced by online temporally adaptive convolutions, and the feature extraction network finally outputs two feature maps, namely the target-image feature map and the search-region feature map;

S4, Transformer-based similarity-map refinement network: the two-branch feature-map information output by the feature extraction network is fed into the Transformer network simultaneously; first, temporal prior knowledge of fixed size is designed and encoded in a memory-efficient manner, then the prior knowledge is decoded to accurately adjust the similarity map, and information filtering is performed to obtain the feature map of the current frame; multi-head attention is the most basic component of the Transformer, its formula is shown below, and 6 heads are used here:

MultiHead(Q, K, V) = Concat(head_1, …, head_6)·W^O, head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V), Attention(Q, K, V) = softmax(Q·K^T/√d_k)·V

For clarity, the temporal knowledge of frame t-1 is denoted F̃_{t-1} and the feature of the current frame is denoted F_t; the intermediate result F'_t is obtained from F_t and F̃_{t-1} through the multi-head attention above; the output of the information filter is then obtained from F'_t, where F denotes a convolution layer, and the final temporal knowledge F̃_t of the current frame is obtained from F'_t and the filter output; for t = 1, an independent convolution is used for the initialization operation;

S5, classification-regression network: first, each position (i, j) in the feature map can be mapped back to the search region as (x, y), and the response map is convolved to obtain a classification branch and a regression branch; the classification branch yields a classification feature map A^cls and a centerness feature map A^cen; the classification feature map predicts the category of each position, and each point (i, j, :) on A^cls contains a 2D vector representing the foreground and background scores of the corresponding location; alongside the classification feature map there is also the centerness feature map A^cen, which assigns each pixel a centerness score; positions with high scores are close to the target center, and the centerness can be used to remove outliers, because positions far from the center tend to produce low-quality predicted bounding boxes;

the regression branch of the classification-regression network outputs a regression feature map A^reg; each point (i, j, :) on A^reg contains a 4D vector t(i, j) = (l, t, r, b) representing the distances from the corresponding location to the four sides of the bounding box in the input search region; let (x_0, y_0) and (x_1, y_1) denote the top-left and bottom-right corners of the ground-truth bounding box and (x, y) the position corresponding to point (i, j); the regression target (l̃, t̃, r̃, b̃) at that point can be calculated by the following formula:

l̃ = x − x_0,  t̃ = y − y_0,  r̃ = x_1 − x,  b̃ = y_1 − y

where (x_0, y_0) and (x_1, y_1) are the top-left and bottom-right corners of the ground-truth bounding box, and l̃, t̃, r̃, b̃ are the distances from the corresponding point on the regression feature map to the four sides of the bounding box.
Training the whole network in an end-to-end manner: the loss of the classification part is L_cls, the bounding-box regression loss is L_reg, and the centerness loss is L_cen; the components are weighted together with the corresponding weights to obtain the overall weighted loss function:

L = L_cls + λ_1·L_cen + λ_2·L_reg

In the above formula, cross-entropy loss is used for classification, IOU loss is used for regression, and the centerness loss is also included;

according to the gradient computed from the loss function L, the SGD optimizer is used to update the network parameters so that the loss function of the whole network decreases until convergence; the whole training then ends, and the trained network weights for attention-based single-target tracking are obtained;

S6, pre-training: the network backbone is pre-trained on ImageNet, the same initialization is used for the online temporally adaptive convolution layer, and the similarity-map refinement network is randomly initialized; the parameters of the backbone are frozen for the first 5 epochs, and the learning rate over the remaining epochs of the training process decreases from 0.005 to 0.0005; SGD is adopted as the optimizer with a momentum of 0.9, the mini-batch size is 124, and the input sizes of the template and search region are 127×127 and 511×511, respectively.
6. The attention-based single-target tracking method of claim 1, wherein: in step 5, a bounding-box refinement network is added; the bounding-box refinement module first needs the template information as input for initialization, and its search region comes from the local search region cropped around the current bounding box output by the tracker, which is about 2 times the target size; the feature-fusion operation plays a key role in the bounding-box refinement module, where point-wise convolution is adopted for feature fusion; suppose the template feature map is denoted K ∈ R^(C×H_0×W_0) and the search-region feature map is denoted S ∈ R^(C×H×W); then K is first decomposed into H_0×W_0 small feature maps K_j ∈ R^(C×1×1), each of which is correlated with S in the ordinary way, yielding a fused feature map of size (H_0·W_0)×H×W; this process can be described as

R_j = K_j ⋆ S,  j = 1, …, H_0×W_0

where ⋆ denotes the ordinary correlation operation; this point-wise convolution avoids the spatial-ambiguity problem caused by cross-correlating the whole template feature map with the search-region feature map using a sliding window; the refined bounding box predicted by this module is output as the final result, and the tracker is updated with this result at the same time.
7. The single-target tracking method based on the attention mechanism as recited in claim 1, wherein said step 6 model test comprises the following step:

S1, the trained weight parameters are used to test the tracking effect on a new, previously unseen video sequence.
CN202311360574.5A 2023-10-19 2023-10-19 Single-target tracking method based on attention mechanism Pending CN117576149A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311360574.5A CN117576149A (en) 2023-10-19 2023-10-19 Single-target tracking method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311360574.5A CN117576149A (en) 2023-10-19 2023-10-19 Single-target tracking method based on attention mechanism

Publications (1)

Publication Number Publication Date
CN117576149A true CN117576149A (en) 2024-02-20

Family

ID=89885202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311360574.5A Pending CN117576149A (en) 2023-10-19 2023-10-19 Single-target tracking method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN117576149A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117974722A (en) * 2024-04-02 2024-05-03 江西师范大学 Single-target tracking system and method based on attention mechanism and improved transducer
CN117974722B (en) * 2024-04-02 2024-06-11 江西师范大学 Single-target tracking system and method based on attention mechanism and improved transducer

Similar Documents

Publication Publication Date Title
CN113065558B (en) Lightweight small target detection method combined with attention mechanism
Fu et al. Onboard real-time aerial tracking with efficient Siamese anchor proposal network
CN113158862B (en) Multitasking-based lightweight real-time face detection method
CN111507378A (en) Method and apparatus for training image processing model
CN108537824B (en) Feature map enhanced network structure optimization method based on alternating deconvolution and convolution
CN106845430A (en) Pedestrian detection and tracking based on acceleration region convolutional neural networks
CN112163498B (en) Method for establishing pedestrian re-identification model with foreground guiding and texture focusing functions and application of method
Cepni et al. Vehicle detection using different deep learning algorithms from image sequence
CN108830170B (en) End-to-end target tracking method based on layered feature representation
CN107657625A (en) Merge the unsupervised methods of video segmentation that space-time multiple features represent
CN110969648A (en) 3D target tracking method and system based on point cloud sequence data
CN115512251A (en) Unmanned aerial vehicle low-illumination target tracking method based on double-branch progressive feature enhancement
Sun et al. IRDCLNet: Instance segmentation of ship images based on interference reduction and dynamic contour learning in foggy scenes
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN114463492A (en) Adaptive channel attention three-dimensional reconstruction method based on deep learning
CN117576149A (en) Single-target tracking method based on attention mechanism
CN113297961A (en) Target tracking method based on boundary feature fusion twin circulation neural network
Su et al. Monocular depth estimation using information exchange network
CN113297959A (en) Target tracking method and system based on corner attention twin network
Sun et al. Two-stage deep regression enhanced depth estimation from a single RGB image
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN114492634A (en) Fine-grained equipment image classification and identification method and system
CN117115911A (en) Hypergraph learning action recognition system based on attention mechanism
Yin et al. M2F2-RCNN: Multi-functional faster RCNN based on multi-scale feature fusion for region search in remote sensing images
CN114743045B (en) Small sample target detection method based on double-branch area suggestion network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination