CN116229112A - Twin network target tracking method based on multi-attention - Google Patents

Twin network target tracking method based on multi-attention

Info

Publication number
CN116229112A
Authority
CN
China
Prior art keywords
attention
features
image
branch
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211558887.7A
Other languages
Chinese (zh)
Inventor
周丽芳
刘金兰
李伟生
马将凯
卢峻民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202211558887.7A priority Critical patent/CN116229112A/en
Publication of CN116229112A publication Critical patent/CN116229112A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 Computing arrangements based on biological models
            • G06N 3/02 Neural networks
              • G06N 3/08 Learning methods
        • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 10/00 Arrangements for image or video recognition or understanding
            • G06V 10/20 Image preprocessing
              • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
            • G06V 10/40 Extraction of image or video features
              • G06V 10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
            • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
              • G06V 10/764 Using classification, e.g. of video objects
              • G06V 10/766 Using regression, e.g. by projecting features on hyperplanes
              • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
                • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
                  • G06V 10/806 Fusion of extracted features
              • G06V 10/82 Using neural networks
          • G06V 20/00 Scenes; Scene-specific elements
            • G06V 20/40 Scenes; Scene-specific elements in video content
              • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention discloses a twin network target tracking method based on multi-attention (SiamMAN, Siamese Multi-attention Network), belonging to the technical field of computer vision. The method mainly comprises the following steps: first, in order to use target feature information more effectively and cope with complex background interference, a multi-attention module is designed to refine the features, in which a channel attention branch gives higher weight to more discriminative channels and a position attention branch fully utilizes the position information of the target; second, in order to make better use of shallow features, a feature fusion method is designed inside the multi-attention module, which fuses the shallow features with the attention-refined features by residual learning and then fuses the two attention features, further enhancing the feature representation; finally, the Focal-EIoU loss is used as the regression loss function, guiding the tracker to generate a more accurate tracking box.

Description

Twin network target tracking method based on multi-attention
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a visual target tracking method.
Background
Video object tracking is an important research topic in the field of computer vision: given the size and position of an object in the initial frame of a video sequence, the tracker must predict the size and position of that object in subsequent frames. In recent years, with the continuous development of target tracking algorithms, target tracking theory has become increasingly mature, and target tracking technology is widely applied in video surveillance, pedestrian tracking, intelligent transportation, modern military applications and other fields. However, in complex and changeable real scenes, a moving target is often affected by factors such as occlusion, scale change, background clutter and illumination change, so accurately tracking an arbitrary target remains a very challenging task; research on target tracking technology therefore still has great research and practical value.
At present, target tracking methods can be roughly divided into two main types: methods based on correlation filtering and methods based on deep learning. Correlation-filtering-based tracking performs a correlation operation between a designed filter template and target candidate regions and locates the target at the maximum of the response map, which makes it fast. However, because of the cosine window and the limited search region, the filter template learns too little information, which easily leads to tracking drift under large deformation and complex background interference. Deep-learning-based target tracking methods, in contrast, provide more discriminative features and therefore yield more robust trackers. Target tracking algorithms based on the twin network convert the tracking problem into a similarity matching problem and simplify it through end-to-end training, achieving higher tracking accuracy.
However, target tracking methods based on the twin network still have the following problems: 1) in complex and changeable real scenes, complex background interference is a very troublesome problem for the tracking field; most existing trackers that address this challenge extract features with a deep backbone network, which yields richer features but introduces a large number of parameters, increasing the computational cost and slowing down tracking; 2) other trackers addressing the complex background challenge enhance the anti-interference capability of the network by adding training data, which increases the time cost and training time while bringing only limited improvement in tracking performance. To alleviate these shortcomings, the invention provides a twin network target tracking method based on multi-attention, aimed at the complex background challenge in real scenes and at improving tracking performance.
CN111192292A, a target tracking method based on an attention mechanism and a twin network and related devices, inputs a target template and a search area into a preset target tracking model and outputs target tracking information of the target template in the search area through the preset target tracking model; the target tracking model comprises a twin network, and a channel attention module and/or a spatial attention module is added to the twin network. By adding the channel attention module and/or the spatial attention module to the residual network, that target tracking method and related devices significantly improve the average expected overlap and robustness of the twin tracking algorithm, in particular the robustness to motion change, camera motion, occlusion and size change; therefore, a more accurate result can be obtained when the target tracking method provided by that embodiment is used for target tracking prediction.
The invention with publication number CN111192292A takes SiamRPN++ as its basic network, adopts ResNet-50 as the feature extraction network, adds a channel attention and/or spatial attention module in the twin network, and obtains the box regression result and the feature classification result through box regression and classification branches. Although both are target tracking methods based on twin networks and both use an attention mechanism to improve tracking performance, the present invention and the invention of publication number CN111192292A differ in the following points:
(1) Selection of the feature extraction network: the invention of publication number CN111192292A utilizes the deep ResNet-50 as its feature extraction network, while the present invention employs the relatively shallow GoogLeNet. The present invention therefore has fewer parameters and a faster tracking speed.
(2) Placement of the attention module: the invention of publication number CN111192292A inserts channel attention and/or spatial attention modules into the residual network, whereas the present invention designs a multi-attention module consisting of a channel attention branch and a position attention branch; the features extracted by the feature extraction network enter the two attention branches in parallel and are then fused, so that channel information and position information are fully utilized, interference from redundant information is reduced, and tracking performance is improved.
(3) Mode of target prediction: the invention of publication number CN111192292A predicts the target position in an anchor-based manner, while the present invention adopts an anchor-free manner: a classification branch distinguishes the foreground from the background, a centrality branch computes the distance between each predicted point and the target centre point so as to suppress predicted points that are far from the centre, and a regression branch produces the tracking box.
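For illustration of the anchor-free prediction described above, the following Python (PyTorch) sketch gives a common centrality (centerness) definition of the kind used by anchor-free detectors and trackers such as FCOS; the patent does not disclose the exact formula of its centrality branch, so this is an assumed, standard form rather than the invention's own.

    import torch

    def centerness(l, r, t, b):
        # FCOS-style centerness: close to 1 at the target centre and close to 0
        # near the box border.  l, r, t, b are tensors holding the distances from
        # each predicted point to the left/right/top/bottom sides of the target box.
        # Assumed standard definition; the patent only states that the centrality
        # branch suppresses points far from the target centre point.
        lr = torch.min(l, r) / torch.max(l, r).clamp(min=1e-6)
        tb = torch.min(t, b) / torch.max(t, b).clamp(min=1e-6)
        return torch.sqrt(lr * tb)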
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art by providing a twin network target tracking method based on multi-attention. The technical scheme of the invention is as follows:
A twin network target tracking method based on multi-attention, comprising the following steps:
step 1: selecting a first frame image of a video as a template image, selecting other subsequent frame images of the video as search images, and respectively preprocessing the template image and the search images;
step 2: respectively inputting the preprocessed template image and the preprocessed search image into a template branch and a search branch of a twin network, and extracting features through a GoogLeNet feature extraction backbone network to obtain a feature map of the template image and a feature map of the search image;
step 3: the template image features and the search image features are respectively input into a multi-attention module composed of a channel attention branch and a position attention branch in parallel, wherein the channel attention branch gives higher weight to more discriminative feature channels and the position attention branch fully utilizes the position information of the target, so that the features are further refined; target information is then propagated from the template image features to the search image features through a graph attention mechanism to obtain a feature response map;
step 4: the feature response map is input into a classification-regression sub-network, and the regression branch replaces the IoU loss with the Focal-EIoU loss, guiding the tracker to generate a more accurate tracking box; the obtained tracking box is the final tracking result.
Further, the step 1: selecting a first frame image of a video as a template image, selecting other subsequent frame images of the video as search images, and respectively preprocessing the template image and the search images, wherein the method specifically comprises the following steps of:
A1, template image preprocessing: select the first frame image of the video and calibrate the target area with a rectangular box, the centre point of the rectangular box representing the position of the target centre point; expand each of the four sides of the target rectangular box by p pixels; if the expanded box exceeds the image boundary, fill the exceeding part with the image pixel mean; finally, scale the cropped target image to 127 × 127 pixels;
A2, search image preprocessing: select the other subsequent frame images of the video and calibrate the target area with a rectangular box, the centre point of the rectangular box representing the position of the target centre point; expand each of the four sides of the target rectangular box by p pixels; if the expanded box exceeds the image boundary, fill the exceeding part with the image pixel mean; finally, scale the cropped target image to 287 × 287 pixels.
Further, the step 2: the preprocessed template image and the preprocessed search image are respectively input into a template branch and a search branch of a twin network, feature extraction is carried out through a GoogLeNet feature extraction backbone network, and features of the template image and features of the search image are obtained, specifically comprising the following steps:
B1, obtain a template image z of size 127 × 127 according to step A1, and obtain a search image x of size 287 × 287 according to step A2;
B2, input the template image z into the template branch of the twin network and extract the template image features F_z through the GoogLeNet (Inception v3) feature extraction backbone network;
B3, input the search image x into the search branch of the twin network and extract the search image features F_x through the GoogLeNet (Inception v3) feature extraction backbone network.
Further, the step 3: the template image features and the search image features are respectively input into a multi-attention module composed of a channel attention branch and a position attention branch in parallel, wherein the channel attention branch gives higher weight to more discriminative feature channels and the position attention branch fully utilizes the position information of the target, so that the features are further refined; target information is transmitted from the template image features to the search image features through a graph attention mechanism to obtain a feature response map, specifically comprising the following steps:
C1, input the template image features F_z obtained in step B2 into the multi-attention module, which mainly comprises a channel attention branch and a position attention branch, to obtain two different attention features; fuse the two attention features by a pixel-wise multiplication operation, then further enhance the feature representation through two 3 × 3 convolution operations, and finally obtain the refined template image features F'_z;
C2, input the search image features F_x obtained in step B3 into the multi-attention module, which mainly comprises a channel attention branch and a position attention branch, to obtain two different attention features; fuse the two attention features by a pixel-wise multiplication operation, then further enhance the feature representation through two 3 × 3 convolution operations, and finally obtain the refined search image features F'_x;
C3, for the refined template image features F'_z and the refined search image features F'_x, transmit target information from the template image features to the search image features through a graph attention mechanism to obtain the feature response map F_fin.
Further, in the step C3, the refined template image features F'_z and the refined search image features F'_x transmit target information from the template image features to the search image features through a graph attention mechanism, with the following specific steps:
D1, regard each 1 × 1 × C grid of the template image features F'_z as a node, where C is the number of feature channels, and obtain the node set V_z containing all such nodes;
D2, regard each 1 × 1 × C grid of the search image features F'_x as a node, where C is the number of feature channels, and obtain the node set V_x containing all such nodes;
D3, construct a complete bipartite graph G = (V, E), where V = V_z ∪ V_x and E contains an edge between every node of V_x and every node of V_z; the two sub-graphs of G are the graph formed by the template nodes V_z and the graph formed by the search nodes V_x;
D4, because the more similar a location in the search image is to a local region of the template image, the more likely it is to be foreground, more target information should be transferred to it. First apply a linear transformation to each node, then compute the inner product of the transformed nodes to obtain a correlation score that measures the similarity of the two nodes, and finally generate the response map. The formula is:
e_{i,j} = (W_x f_i^x)^T (W_z f_j^z)    (1)
where e_{i,j} denotes the correlation score between node i ∈ V_x and node j ∈ V_z, W_x and W_z are linear transformation matrices, and f_i^x and f_j^z are the feature vectors of nodes i and j, respectively.
Further, the specific steps of the multi-attention module are as follows:
E1, input the feature F before optimization into the channel attention branch, which gives higher weight to more discriminative channels; then, using residual learning, perform a pixel-wise multiplication with the feature before optimization to obtain the feature F_1. The formula is:
F_1 = σ(F_SENet) ⊗ F    (2)
where F_1 denotes the feature optimized by the channel attention branch of the multi-attention module, F denotes the feature before optimization, ⊗ denotes the pixel-wise multiplication operation, σ denotes the Sigmoid activation function, and F_SENet denotes the feature obtained through the channel attention mechanism;
E2, input the feature F before optimization into the position attention branch and, using residual learning, perform a pixel-wise multiplication with the feature before optimization to obtain the feature F_2. The formula is:
F_2 = σ(F_CA) ⊗ F    (3)
where F_2 denotes the feature optimized by the position attention branch of the multi-attention module, F denotes the feature before optimization, ⊗ denotes the pixel-wise multiplication operation, σ denotes the Sigmoid activation function, and F_CA denotes the feature obtained through the position attention mechanism;
E3, fuse the feature F_1 obtained in step E1 with the feature F_2 obtained in step E2: first perform a pixel-wise multiplication of the two features, then further enhance the feature representation through two 3 × 3 convolution operations to obtain the optimized feature F_output. The formula is:
F_output = F_3×3(F_3×3(F_1 ⊗ F_2))    (4)
where F_output denotes the optimized feature, F_3×3 denotes a 3 × 3 convolution operation, and ⊗ denotes the pixel-wise multiplication operation.
Further, in the step 4, the feature response map is input into the classification-regression sub-network, and the regression branch replaces the IoU loss with the Focal-EIoU loss to guide the tracker to generate a more accurate tracking box; the obtained tracking box is the final tracking result. The specific steps are:
F1, convolve the feature response map obtained in step C3 and input it into a classification branch, a centrality branch and a regression branch, respectively;
F2, the classification branch performs the classification task with the conventional cross-entropy loss, giving the classification loss L_cls;
F3, the centrality branch, in parallel with the classification branch, is used to remove abnormal data, giving the centrality loss L_cen;
F4, the regression branch performs the regression task with the Focal-EIoU loss, giving the regression loss L_reg;
F5, with the classification loss L_cls obtained in step F2, the centrality loss L_cen obtained in step F3 and the regression loss L_reg obtained in step F4, the final total loss function is calculated as:
L = L_cls + λ_1 L_cen + λ_2 L_reg    (5)
where L denotes the total loss function, λ_1 denotes the hyper-parameter weighting the centrality loss, and λ_2 denotes the hyper-parameter weighting the regression loss.
Further, the EIoU loss and the loss function of the regression branch are:
L_EIoU = 1 − IoU + ρ²(b, b^gt) / ((w^c)² + (h^c)²) + ρ²(w, w^gt) / (w^c)² + ρ²(h, h^gt) / (h^c)²    (6)
where L_EIoU denotes the EIoU loss, IoU denotes the intersection-over-union of the two boxes, ρ(·,·) denotes the Euclidean distance, b denotes the centre point of the predicted box, b^gt the centre point of the ground-truth box, w the width of the predicted box, w^gt the width of the ground-truth box, h the height of the predicted box, h^gt the height of the ground-truth box, w^c the width of the smallest enclosing box, and h^c the height of the smallest enclosing box;
L_reg = IoU^γ · L_EIoU    (7)
where L_reg denotes the regression loss calculated from the Focal-EIoU loss and γ is a hyper-parameter.
The invention has the advantages and beneficial effects as follows:
1. Aiming at the common problem of complex background interference in the field of target tracking, the invention designs a twin network target tracking method based on multi-attention and improves tracking performance at the feature level by designing a multi-attention module. Compared with current state-of-the-art trackers (SiamFC++, SiamCAR, SiamGAT), the invention shows superior tracking performance on common target tracking datasets;
2. Optimizing the features is an effective way to improve the tracking performance of a tracker, so the invention designs a multi-attention module comprising a channel attention branch and a position attention branch, which enhances the network's ability to select features and reduces the burden of redundant information on the network. The channel attention branch gives higher weight to more discriminative feature channels, the position attention branch fully utilizes the position information of the target, and a feature fusion module then fuses the two different features, further enhancing the feature representation; the resulting more robust representation effectively improves the tracking accuracy of the tracker;
3. The target tracking task comprises a classification branch and a regression branch: positive and negative samples are determined by the classification branch, and the bounding box of the target is determined by the regression branch. Most current target tracking methods use the IoU loss as the regression loss function, but when two boxes do not intersect, the IoU loss cannot reflect the distance between them. The Focal-EIoU loss proposed in the EIoU paper explicitly takes the width and height of the tracking box into account. The invention therefore introduces the Focal-EIoU loss on the regression branch to guide the tracker to generate a more accurate regression box, further improving the tracking performance of the tracker.
Drawings
FIG. 1 is a general framework diagram of the twin network target tracking method based on multi-attention in accordance with a preferred embodiment of the present invention;
FIG. 2 is a schematic diagram of the multi-attention module designed in the present invention;
FIG. 3 shows the tracking results of the present invention on the MotorRolling, Board and Soccer video sequences of the OTB100 dataset.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and specifically described below with reference to the drawings in the embodiments of the present invention. The described embodiments are only a few embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
the embodiment of the invention is based on a SiamGAT target tracking framework as a basic framework, and the details are shown in documents Dongyan Guo, yanyan Shao, yes Cui, zhenhua Wang, ryan Zhang, chunhua Shan. Firstly, a tracking frame is built by using SiamGAT as a basis, then a multi-attention module is designed, characteristics extracted through a backbone network are optimized, focal-EIoU loss is introduced, and a tracker is guided to obtain a more accurate regression frame, so that tracking precision is improved.
As shown in fig. 1, the twin network target tracking method based on multi-attention comprises the following steps:
1. as shown in fig. 1, a first frame image of a video is selected as a template image, other subsequent frame images of the video are selected as search images, and then preprocessing operation is performed on the template image and the search images, specifically including:
1.1 Template image preprocessing: select the first frame image of the video and calibrate the target area with a rectangular box, the centre point of the rectangular box representing the position of the target centre point; expand each of the four sides of the target rectangular box by p pixels; if the expanded box exceeds the image boundary, fill the exceeding part with the image pixel mean; finally, scale the cropped target image to 127 × 127 pixels;
1.2 Search image preprocessing: select the other subsequent frame images of the video and calibrate the target area with a rectangular box, the centre point of the rectangular box representing the position of the target centre point; expand each of the four sides of the target rectangular box by p pixels; if the expanded box exceeds the image boundary, fill the exceeding part with the image pixel mean; finally, scale the cropped target image to 287 × 287 pixels.
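A minimal Python sketch of the cropping in steps 1.1 and 1.2 is given below, assuming OpenCV-style H × W × C colour images and boxes in (x1, y1, x2, y2) form; the function name, the expansion p and the example values are illustrative only and are not prescribed by the patent.

    import cv2

    def crop_and_resize(img, box, p, out_size):
        # Crop the target region enlarged by p pixels on each side, pad any part
        # that falls outside the image with the per-channel pixel mean, then
        # resize to out_size x out_size (127 for the template, 287 for the search image).
        x1, y1, x2, y2 = [int(round(v)) for v in box]
        x1, y1, x2, y2 = x1 - p, y1 - p, x2 + p, y2 + p
        h, w = img.shape[:2]
        pad_l, pad_t = max(0, -x1), max(0, -y1)
        pad_r, pad_b = max(0, x2 - w), max(0, y2 - h)
        mean = img.reshape(-1, img.shape[2]).mean(axis=0)          # image pixel mean
        padded = cv2.copyMakeBorder(img, pad_t, pad_b, pad_l, pad_r,
                                    cv2.BORDER_CONSTANT, value=mean.tolist())
        patch = padded[y1 + pad_t:y2 + pad_t, x1 + pad_l:x2 + pad_l]
        return cv2.resize(patch, (out_size, out_size))

    # illustrative usage (the value of p is chosen arbitrarily here):
    # template = crop_and_resize(first_frame, init_box, p=32, out_size=127)
    # search   = crop_and_resize(current_frame, previous_box, p=32, out_size=287)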
2. As shown in fig. 1, the preprocessed template image and the search image are respectively input into a template branch and a search branch of a twin network, and feature extraction is performed through a google net feature extraction backbone network to obtain features of the template image and features of the search image, which specifically comprises:
2.1 Obtain a template image z of size 127 × 127 according to step 1.1, and obtain a search image x of size 287 × 287 according to step 1.2;
2.2 Input the template image z into the template branch of the twin network and extract the template image features F_z through the GoogLeNet (Inception v3) feature extraction backbone network;
2.3 Input the search image x into the search branch of the twin network and extract the search image features F_x through the GoogLeNet (Inception v3) feature extraction backbone network.
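To make the weight sharing between the template branch and the search branch explicit, the following sketch applies one shared backbone module to both inputs; the GoogLeNet (Inception v3) internals are not reproduced here, so `backbone` is a stand-in for whatever feature extractor is actually used.

    import torch
    import torch.nn as nn

    class SiameseFeatureExtractor(nn.Module):
        # One backbone, two inputs: the template branch and the search branch share
        # the same weights.  `backbone` stands in for the GoogLeNet (Inception v3)
        # feature extraction network of the patent; any nn.Module mapping an image
        # batch to a C x H x W feature map can be plugged in.
        def __init__(self, backbone: nn.Module):
            super().__init__()
            self.backbone = backbone

        def forward(self, template: torch.Tensor, search: torch.Tensor):
            f_z = self.backbone(template)   # features of the 127 x 127 template image z
            f_x = self.backbone(search)     # features of the 287 x 287 search image x
            return f_z, f_x

    # toy stand-in backbone, illustrative only (not Inception v3):
    # net = SiameseFeatureExtractor(nn.Sequential(nn.Conv2d(3, 256, 7, stride=2), nn.ReLU()))
    # f_z, f_x = net(torch.randn(1, 3, 127, 127), torch.randn(1, 3, 287, 287))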
3. As shown in fig. 1, the features of the template image and the features of the search image are respectively input into a multi-attention module composed of a channel attention branch and a position attention branch in parallel, wherein the channel attention branch gives higher weight to more discriminative feature channels and the position attention branch fully utilizes the position information of the target, so that the features are further refined; target information is transmitted from the template image features to the search image features through a graph attention mechanism to obtain a feature response map. The specific steps are as follows:
3.1 Input the template image features F_z obtained in step 2.2 into the multi-attention module, which mainly comprises a channel attention branch and a position attention branch, to obtain two different attention features; fuse the two attention features by a pixel-wise multiplication operation, then further enhance the feature representation through two 3 × 3 convolution operations, and finally obtain the refined template image features F'_z;
3.2 Input the search image features F_x obtained in step 2.3 into the multi-attention module, which mainly comprises a channel attention branch and a position attention branch, to obtain two different attention features; fuse the two attention features by a pixel-wise multiplication operation, then further enhance the feature representation through two 3 × 3 convolution operations, and finally obtain the refined search image features F'_x;
3.3 For the refined template image features F'_z and the refined search image features F'_x, transmit target information from the template image features to the search image features through a graph attention mechanism to obtain the feature response map F_fin.
4. The refined template image features F'_z and search image features F'_x transmit target information from the template image features to the search image features through a graph attention mechanism, with the following specific steps:
4.1 Regard each 1 × 1 × C grid of the template image features F'_z as a node, where C is the number of feature channels, and obtain the node set V_z containing all such nodes;
4.2 Regard each 1 × 1 × C grid of the search image features F'_x as a node, where C is the number of feature channels, and obtain the node set V_x containing all such nodes;
4.3 Construct a complete bipartite graph G = (V, E), where V = V_z ∪ V_x and E contains an edge between every node of V_x and every node of V_z; the two sub-graphs of G are the graph formed by the template nodes V_z and the graph formed by the search nodes V_x;
4.4 Because the more similar a location in the search image is to a local region of the template image, the more likely it is to be foreground, more target information should be transferred to it. First apply a linear transformation to each node, then compute the inner product of the transformed nodes to obtain a correlation score that measures the similarity of the two nodes, and finally generate the response map. The formula is:
e_{i,j} = (W_x f_i^x)^T (W_z f_j^z)    (1)
where e_{i,j} denotes the correlation score between node i ∈ V_x and node j ∈ V_z, W_x and W_z are linear transformation matrices, and f_i^x and f_j^z are the feature vectors of nodes i and j, respectively.
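The following sketch implements formula (1) under the assumption that the linear transformations W_x and W_z are realised as 1 × 1 convolutions; the softmax aggregation that turns the correlation scores into the propagated response features follows the SiamGAT design and is likewise an assumption, since the patent only states that a response map is generated.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GraphAttentionPropagation(nn.Module):
        # Computes e_{i,j} = (W_x f_i^x)^T (W_z f_j^z) between every search node i
        # and every template node j, then (assumption, as in SiamGAT) aggregates the
        # template nodes into each search location with a softmax over j.
        def __init__(self, channels: int):
            super().__init__()
            self.w_x = nn.Conv2d(channels, channels, kernel_size=1)  # W_x
            self.w_z = nn.Conv2d(channels, channels, kernel_size=1)  # W_z

        def forward(self, f_z: torch.Tensor, f_x: torch.Tensor) -> torch.Tensor:
            b, c, hx, wx = f_x.shape
            q = self.w_x(f_x).flatten(2)               # B x C x Nx, transformed search nodes
            k = self.w_z(f_z).flatten(2)               # B x C x Nz, transformed template nodes
            e = torch.einsum('bci,bcj->bij', q, k)     # B x Nx x Nz, correlation scores e_{i,j}
            attn = F.softmax(e, dim=-1)                # normalise over template nodes j (assumption)
            v = f_z.flatten(2)                         # B x C x Nz, template node features
            out = torch.einsum('bij,bcj->bci', attn, v)
            return out.view(b, c, hx, wx)              # propagated features on the search grid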
5. As shown in fig. 2, the multi-attention module includes a channel attention branch and a position attention branch, and specifically comprises:
5.1 Input the feature F before optimization into the channel attention branch, which gives higher weight to more discriminative channels; then, using residual learning, perform a pixel-wise multiplication with the feature before optimization to avoid redundant information and suppress background noise, obtaining the feature F_1. The formula is:
F_1 = σ(F_SENet) ⊗ F    (2)
where F_1 denotes the feature optimized by the channel attention branch of the multi-attention module, F denotes the feature before optimization, ⊗ denotes the pixel-wise multiplication operation, σ denotes the Sigmoid activation function, and F_SENet denotes the feature obtained through the channel attention mechanism.
5.2 Input the feature F before optimization into the position attention branch, which fully utilizes the position information of the target; then, using residual learning, perform a pixel-wise multiplication with the feature before optimization to avoid redundant information and suppress background noise, obtaining the feature F_2. The formula is:
F_2 = σ(F_CA) ⊗ F    (3)
where F_2 denotes the feature optimized by the position attention branch of the multi-attention module, F denotes the feature before optimization, ⊗ denotes the pixel-wise multiplication operation, σ denotes the Sigmoid activation function, and F_CA denotes the feature obtained through the position attention mechanism.
5.3 Fuse the feature F_1 obtained in step 5.1 with the feature F_2 obtained in step 5.2: first perform a pixel-wise multiplication of the two features, then further enhance the feature representation through two 3 × 3 convolution operations to obtain the optimized feature F_output. The formula is:
F_output = F_3×3(F_3×3(F_1 ⊗ F_2))    (4)
where F_output denotes the optimized feature, F_3×3 denotes a 3 × 3 convolution operation, and ⊗ denotes the pixel-wise multiplication operation.
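A minimal PyTorch sketch of the multi-attention module of fig. 2, following formulas (2) to (4), is given below: an SE-style channel attention branch and a position attention branch each re-weight the input feature through a Sigmoid gate, the two branch outputs are fused by pixel-wise multiplication, and two 3 × 3 convolutions refine the result. The internal structure of the two branches (reduction ratio, pooling scheme, kernel size) is not specified in the patent and is assumed here.

    import torch
    import torch.nn as nn

    class MultiAttention(nn.Module):
        # Channel attention branch + position attention branch fused as in
        # formulas (2)-(4).  The branch internals are assumptions: an SE-style
        # squeeze-and-excitation block for F_SENet and a spatial map built from
        # channel-pooled statistics for F_CA.
        def __init__(self, channels: int, reduction: int = 16):
            super().__init__()
            self.channel = nn.Sequential(                        # produces F_SENet
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
            )
            self.position = nn.Conv2d(2, 1, kernel_size=7, padding=3)   # produces F_CA (assumed form)
            self.refine = nn.Sequential(                         # the two 3 x 3 convolutions of formula (4)
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1),
            )

        def forward(self, f: torch.Tensor) -> torch.Tensor:
            f1 = torch.sigmoid(self.channel(f)) * f              # F_1 = sigma(F_SENet) (x) F, formula (2)
            pooled = torch.cat([f.mean(dim=1, keepdim=True),
                                f.max(dim=1, keepdim=True).values], dim=1)
            f2 = torch.sigmoid(self.position(pooled)) * f        # F_2 = sigma(F_CA) (x) F, formula (3)
            return self.refine(f1 * f2)                          # F_output, formula (4)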
6. As shown in fig. 1, the feature response map is input into the classification-regression sub-network, and the regression branch replaces the IoU loss with the Focal-EIoU loss to guide the tracker to generate a more accurate tracking box; the obtained tracking box is the final tracking result. The specific steps are:
6.1 Convolve the feature response map obtained in step 3.3 and input it into a classification branch, a centrality branch and a regression branch, respectively;
6.2 The classification branch performs the classification task with the conventional cross-entropy loss, giving the classification loss L_cls;
6.3 The centrality branch, in parallel with the classification branch, is used to remove abnormal data, giving the centrality loss L_cen;
6.4 The regression branch performs the regression task with the Focal-EIoU loss, giving the regression loss L_reg, calculated as:
L_EIoU = 1 − IoU + ρ²(b, b^gt) / ((w^c)² + (h^c)²) + ρ²(w, w^gt) / (w^c)² + ρ²(h, h^gt) / (h^c)²    (5)
L_reg = IoU^γ · L_EIoU    (6)
where L_EIoU denotes the EIoU loss, IoU denotes the intersection-over-union of the two boxes, ρ(·,·) denotes the Euclidean distance, b denotes the centre point of the predicted box, b^gt the centre point of the ground-truth box, w the width of the predicted box, w^gt the width of the ground-truth box, h the height of the predicted box, h^gt the height of the ground-truth box, w^c the width of the smallest enclosing box, h^c the height of the smallest enclosing box, L_reg the regression loss calculated from the Focal-EIoU loss, and γ a hyper-parameter.
6.5 With the classification loss L_cls obtained in step 6.2, the centrality loss L_cen obtained in step 6.3 and the regression loss L_reg obtained in step 6.4, the final total loss function is calculated as:
L = L_cls + λ_1 L_cen + λ_2 L_reg    (7)
where L denotes the total loss function, λ_1 denotes the hyper-parameter weighting the centrality loss, and λ_2 denotes the hyper-parameter weighting the regression loss.
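A Python sketch of the regression loss of formulas (5) and (6) and the total loss of formula (7) is given below, following the published Focal-EIoU definition; the (x1, y1, x2, y2) box format, the mean reduction and the example value of γ are assumptions.

    import torch

    def eiou_loss(pred, gt, eps=1e-7):
        # EIoU loss of formula (5): 1 - IoU plus the normalised centre-distance,
        # width-difference and height-difference terms.  pred, gt: (N, 4) boxes
        # given as (x1, y1, x2, y2).
        px1, py1, px2, py2 = pred.unbind(-1)
        gx1, gy1, gx2, gy2 = gt.unbind(-1)
        pw, ph = px2 - px1, py2 - py1
        gw, gh = gx2 - gx1, gy2 - gy1
        inter = ((torch.min(px2, gx2) - torch.max(px1, gx1)).clamp(min=0) *
                 (torch.min(py2, gy2) - torch.max(py1, gy1)).clamp(min=0))
        union = pw * ph + gw * gh - inter
        iou = inter / (union + eps)
        cw = torch.max(px2, gx2) - torch.min(px1, gx1)   # width  w^c of the smallest enclosing box
        ch = torch.max(py2, gy2) - torch.min(py1, gy1)   # height h^c of the smallest enclosing box
        rho2 = (((px1 + px2) - (gx1 + gx2)) ** 2 +
                ((py1 + py2) - (gy1 + gy2)) ** 2) / 4    # squared centre distance rho^2(b, b^gt)
        loss = (1 - iou
                + rho2 / (cw ** 2 + ch ** 2 + eps)
                + (pw - gw) ** 2 / (cw ** 2 + eps)
                + (ph - gh) ** 2 / (ch ** 2 + eps))
        return iou, loss

    def focal_eiou_loss(pred, gt, gamma=0.5):
        # L_reg = IoU^gamma * L_EIoU, formula (6); gamma is a hyper-parameter.
        iou, loss = eiou_loss(pred, gt)
        return (iou.detach() ** gamma * loss).mean()

    # total loss of formula (7), with lambda_1 and lambda_2 as hyper-parameters:
    # L = L_cls + lambda_1 * L_cen + lambda_2 * L_reg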
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
The above examples should be understood as illustrative only and not limiting the scope of the invention. Various changes and modifications to the present invention may be made by one skilled in the art after reading the teachings herein, and such equivalent changes and modifications are intended to fall within the scope of the invention as defined in the appended claims.

Claims (8)

1. A twin network target tracking method based on multi-attention, characterized by comprising the following steps:
step 1: selecting a first frame image of a video as a template image, selecting other subsequent frame images of the video as search images, and respectively preprocessing the template image and the search images;
step 2: respectively inputting the preprocessed template image and the preprocessed search image into a template branch and a search branch of a twin network, and extracting features through a GoogLeNet feature extraction backbone network to obtain a feature map of the template image and a feature map of the search image;
step 3: the template image features and the search image features are respectively input into a multi-attention module composed of a channel attention branch and a position attention branch in parallel, wherein the channel attention branch gives higher weight to more discriminative feature channels and the position attention branch fully utilizes the position information of the target, so that the features are further refined; target information is then propagated from the template image features to the search image features through a graph attention mechanism to obtain a feature response map;
step 4: the feature response map is input into a classification-regression sub-network, and the regression branch replaces the IoU loss with the Focal-EIoU loss, guiding the tracker to generate a more accurate tracking box; the obtained tracking box is the final tracking result.
2. The twin network target tracking method based on multi-attention according to claim 1, wherein the step 1: selecting a first frame image of a video as a template image, selecting other subsequent frame images of the video as search images, and respectively preprocessing the template image and the search images, specifically comprises the following steps:
A1, template image preprocessing: select the first frame image of the video and calibrate the target area with a rectangular box, the centre point of the rectangular box representing the position of the target centre point; expand each of the four sides of the target rectangular box by p pixels; if the expanded box exceeds the image boundary, fill the exceeding part with the image pixel mean; finally, scale the cropped target image to 127 × 127 pixels;
A2, search image preprocessing: select the other subsequent frame images of the video and calibrate the target area with a rectangular box, the centre point of the rectangular box representing the position of the target centre point; expand each of the four sides of the target rectangular box by p pixels; if the expanded box exceeds the image boundary, fill the exceeding part with the image pixel mean; finally, scale the cropped target image to 287 × 287 pixels.
3. The twin network target tracking method based on multi-attention according to claim 2, wherein the step 2: inputting the preprocessed template image and the preprocessed search image respectively into a template branch and a search branch of a twin network, and performing feature extraction through a GoogLeNet feature extraction backbone network to obtain the features of the template image and the features of the search image, specifically comprises the following steps:
B1, obtain a template image z of size 127 × 127 according to step A1, and obtain a search image x of size 287 × 287 according to step A2;
B2, input the template image z into the template branch of the twin network and extract the template image features F_z through the GoogLeNet (Inception v3) feature extraction backbone network;
B3, input the search image x into the search branch of the twin network and extract the search image features F_x through the GoogLeNet (Inception v3) feature extraction backbone network.
4. The twin network target tracking method based on multi-attention according to claim 3, wherein the step 3: the features of the template image and the features of the search image are respectively input into a multi-attention module composed of a channel attention branch and a position attention branch in parallel, wherein the channel attention branch gives higher weight to more discriminative feature channels and the position attention branch fully utilizes the position information of the target, so that the features are further refined; target information is transmitted from the template image features to the search image features through a graph attention mechanism to obtain a feature response map, specifically comprising the following steps:
C1, input the template image features F_z obtained in step B2 into the multi-attention module, which mainly comprises a channel attention branch and a position attention branch, to obtain two different attention features; fuse the two attention features by a pixel-wise multiplication operation, then further enhance the feature representation through two 3 × 3 convolution operations, and finally obtain the refined template image features F'_z;
C2, input the search image features F_x obtained in step B3 into the multi-attention module, which mainly comprises a channel attention branch and a position attention branch, to obtain two different attention features; fuse the two attention features by a pixel-wise multiplication operation, then further enhance the feature representation through two 3 × 3 convolution operations, and finally obtain the refined search image features F'_x;
C3, for the refined template image features F'_z and the refined search image features F'_x, transmit target information from the template image features to the search image features through a graph attention mechanism to obtain the feature response map F_fin.
5. The twin network target tracking method based on multi-attention according to claim 4, wherein in the step C3 the refined template image features F'_z and the refined search image features F'_x transmit target information from the template image features to the search image features through a graph attention mechanism, with the following specific steps:
D1, regard each 1 × 1 × C grid of the template image features F'_z as a node, where C is the number of feature channels, and obtain the node set V_z containing all such nodes;
D2, regard each 1 × 1 × C grid of the search image features F'_x as a node, where C is the number of feature channels, and obtain the node set V_x containing all such nodes;
D3, construct a complete bipartite graph G = (V, E), where V = V_z ∪ V_x and E contains an edge between every node of V_x and every node of V_z; the two sub-graphs of G are the graph formed by the template nodes V_z and the graph formed by the search nodes V_x;
D4, because the more similar a location in the search image is to a local region of the template image, the more likely it is to be foreground, more target information should be transferred to it; first apply a linear transformation to each node, then compute the inner product of the transformed nodes to obtain a correlation score that measures the similarity of the two nodes, and finally generate the response map, the formula being:
e_{i,j} = (W_x f_i^x)^T (W_z f_j^z)    (1)
where e_{i,j} denotes the correlation score between node i ∈ V_x and node j ∈ V_z, W_x and W_z are linear transformation matrices, and f_i^x and f_j^z are the feature vectors of nodes i and j, respectively.
6. The twin network target tracking method based on multi-attention according to claim 4, wherein the specific steps of the multi-attention module are as follows:
E1, input the feature F before optimization into the channel attention branch, which gives higher weight to more discriminative channels; then, using residual learning, perform a pixel-wise multiplication with the feature before optimization to obtain the feature F_1. The formula is:
F_1 = σ(F_SENet) ⊗ F    (2)
where F_1 denotes the feature optimized by the channel attention branch of the multi-attention module, F denotes the feature before optimization, ⊗ denotes the pixel-wise multiplication operation, σ denotes the Sigmoid activation function, and F_SENet denotes the feature obtained through the channel attention mechanism;
E2, input the feature F before optimization into the position attention branch and, using residual learning, perform a pixel-wise multiplication with the feature before optimization to obtain the feature F_2. The formula is:
F_2 = σ(F_CA) ⊗ F    (3)
where F_2 denotes the feature optimized by the position attention branch of the multi-attention module, F denotes the feature before optimization, ⊗ denotes the pixel-wise multiplication operation, σ denotes the Sigmoid activation function, and F_CA denotes the feature obtained through the position attention mechanism;
E3, fuse the feature F_1 obtained in step E1 with the feature F_2 obtained in step E2: first perform a pixel-wise multiplication of the two features, then further enhance the feature representation through two 3 × 3 convolution operations to obtain the optimized feature F_output. The formula is:
F_output = F_3×3(F_3×3(F_1 ⊗ F_2))    (4)
where F_output denotes the optimized feature, F_3×3 denotes a 3 × 3 convolution operation, and ⊗ denotes the pixel-wise multiplication operation.
7. The twin network target tracking method based on multi-attention according to claim 6, wherein the step 4 inputs the feature response map into the classification-regression sub-network, and the regression branch replaces the IoU loss with the Focal-EIoU loss to guide the tracker to generate a more accurate tracking box, the obtained tracking box being the final tracking result, specifically comprising the following steps:
F1, convolve the feature response map obtained in step C3 and input it into a classification branch, a centrality branch and a regression branch, respectively;
F2, the classification branch performs the classification task with the conventional cross-entropy loss, giving the classification loss L_cls;
F3, the centrality branch, in parallel with the classification branch, is used to remove abnormal data, giving the centrality loss L_cen;
F4, the regression branch performs the regression task with the Focal-EIoU loss, giving the regression loss L_reg;
F5, with the classification loss L_cls obtained in step F2, the centrality loss L_cen obtained in step F3 and the regression loss L_reg obtained in step F4, the final total loss function is calculated as:
L = L_cls + λ_1 L_cen + λ_2 L_reg    (5)
where L denotes the total loss function, λ_1 denotes the hyper-parameter weighting the centrality loss, and λ_2 denotes the hyper-parameter weighting the regression loss.
8. The twin network target tracking method based on multi-attention according to claim 7, wherein the Focal-EIoU loss and the loss function of the regression branch are:
L_EIoU = 1 − IoU + ρ²(b, b^gt) / ((w^c)² + (h^c)²) + ρ²(w, w^gt) / (w^c)² + ρ²(h, h^gt) / (h^c)²    (6)
where L_EIoU denotes the EIoU loss, IoU denotes the intersection-over-union of the two boxes, ρ(·,·) denotes the Euclidean distance, b denotes the centre point of the predicted box, b^gt the centre point of the ground-truth box, w the width of the predicted box, w^gt the width of the ground-truth box, h the height of the predicted box, h^gt the height of the ground-truth box, w^c the width of the smallest enclosing box, and h^c the height of the smallest enclosing box;
L_reg = IoU^γ · L_EIoU    (7)
where L_reg denotes the regression loss calculated from the Focal-EIoU loss and γ is a hyper-parameter.
CN202211558887.7A 2022-12-06 2022-12-06 Twin network target tracking method based on multiple attentives Pending CN116229112A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211558887.7A CN116229112A (en) 2022-12-06 2022-12-06 Twin network target tracking method based on multiple attentives

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211558887.7A CN116229112A (en) 2022-12-06 2022-12-06 Twin network target tracking method based on multiple attentives

Publications (1)

Publication Number Publication Date
CN116229112A true CN116229112A (en) 2023-06-06

Family

ID=86579290

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211558887.7A Pending CN116229112A (en) 2022-12-06 2022-12-06 Twin network target tracking method based on multiple attentives

Country Status (1)

Country Link
CN (1) CN116229112A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116934796A (en) * 2023-07-20 2023-10-24 河南大学 Visual target tracking method based on twinning residual error attention aggregation network
CN117670938A (en) * 2024-01-30 2024-03-08 江西方兴科技股份有限公司 Multi-target space-time tracking method based on super-treatment robot


Similar Documents

Publication Publication Date Title
CN111027493B (en) Pedestrian detection method based on deep learning multi-network soft fusion
CN112001385B (en) Target cross-domain detection and understanding method, system, equipment and storage medium
CN111460968B (en) Unmanned aerial vehicle identification and tracking method and device based on video
CN116229112A (en) Twin network target tracking method based on multi-attention
CN111160407B (en) Deep learning target detection method and system
CN112560656A (en) Pedestrian multi-target tracking method combining attention machine system and end-to-end training
CN110942471B (en) Long-term target tracking method based on space-time constraint
CN110929848B (en) Training and tracking method based on multi-challenge perception learning model
CN113706581B (en) Target tracking method based on residual channel attention and multi-level classification regression
CN111767847A (en) Pedestrian multi-target tracking method integrating target detection and association
CN110781785A (en) Traffic scene pedestrian detection method improved based on fast RCNN algorithm
CN113129335B (en) Visual tracking algorithm and multi-template updating strategy based on twin network
Liang et al. Comparison detector for cervical cell/clumps detection in the limited data scenario
CN113963032A (en) Twin network structure target tracking method fusing target re-identification
CN111898432A (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
CN111931953A (en) Multi-scale characteristic depth forest identification method for waste mobile phones
CN113052184A (en) Target detection method based on two-stage local feature alignment
CN112884147A (en) Neural network training method, image processing method, device and electronic equipment
CN110349176B (en) Target tracking method and system based on triple convolutional network and perceptual interference learning
Dong et al. Learning regional purity for instance segmentation on 3d point clouds
CN112287906B (en) Template matching tracking method and system based on depth feature fusion
CN113888586A (en) Target tracking method and device based on correlation filtering
CN116664867A (en) Feature extraction method and device for selecting training samples based on multi-evidence fusion
CN111291785A (en) Target detection method, device, equipment and storage medium
Lu et al. Siamese Graph Attention Networks for robust visual object tracking

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination