CN116630850A - Twin target tracking method based on multi-attention task fusion and bounding box coding - Google Patents

Twin target tracking method based on multi-attention task fusion and bounding box coding

Info

Publication number
CN116630850A
Authority
CN
China
Prior art keywords
attention
network
template
features
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310555213.XA
Other languages
Chinese (zh)
Inventor
胡昭华
刘浩男
林潇
王莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202310555213.XA priority Critical patent/CN116630850A/en
Publication of CN116630850A publication Critical patent/CN116630850A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/48 Matching video sequences

Abstract

The invention discloses a twin target tracking method based on multi-attention task fusion and bounding box coding. A twin target tracking network is first constructed; a first multi-attention fusion module performs channel attention and spatial attention enhancement on the template feature extraction branch, and a second multi-attention fusion module performs channel attention and spatial attention enhancement on the search feature extraction branch. The output features then enter a cross-correlation matching network, where they are fused with the bounding box encoding features, and the fused features are input into a classification regression network to obtain a classification score map and a regression prediction map. According to the invention, the feature extraction network is divided into a template feature extraction branch and a search feature extraction branch, and a multi-attention fusion module is added to the feature extraction network to perform channel attention and spatial attention preprocessing, so that problems such as occlusion, the target leaving the field of view, motion blur, background clutter and scale variation can be handled well, and good tracking performance is achieved.

Description

Twin target tracking method based on multi-attention task fusion and bounding box coding
Technical Field
The invention relates to the field of target tracking, in particular to a twin target tracking method based on multi-attention task fusion and bounding box coding.
Background
Target tracking is a fundamental and challenging task in computer vision and has been one of the most active research topics in the field for decades. The task of target tracking is defined as follows: given only the position of the tracked target in the initial frame of a video sequence, the tracker must keep locating the target accurately in every subsequent frame. Target tracking has wide applications in fields such as autonomous driving, video surveillance, marine exploration and medical imaging, and has therefore attracted attention from both academia and industry. Target tracking can be divided into two main branches: target tracking based on correlation filtering, and target tracking based on deep learning.
With the rapid development of deep learning, target tracking models based on deep learning have become an important means of further improving tracking accuracy. Current research shows that among deep-learning-based trackers, those based on twin (Siamese) networks achieve a good balance between tracking accuracy and inference speed.
As a representative twin target tracking algorithm, SiamFC (Bertinetto L, Valmadre J, Henriques J F, et al. Fully-convolutional siamese networks for object tracking[C]//European Conference on Computer Vision. Springer, Cham, 2016: 850-865.) achieves a good balance between tracking speed and accuracy by designing a compact and well-structured tracking network. SiamRPN++ (Li B, Wu W, Wang Q, et al. SiamRPN++: Evolution of siamese visual tracking with very deep networks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 4282-4291.) introduces the deep neural network ResNet50 into the twin tracking network, greatly improving tracking performance. However, most target tracking models, including the algorithms above, usually adopt pre-trained weight parameters for the feature extraction network during training, yet the pre-trained deep neural network does not actually fit the task definition of target tracking. Target tracking differs from other visual tasks such as image classification, target detection and instance segmentation: in classification and detection, the target classes for training and testing are predefined and the tasks focus on class prediction, whereas target tracking must handle targets of arbitrary classes during tracking and focuses on predicting the position of the foreground target. In addition, a pre-trained deep neural network is biased toward inter-class differences, and the features it extracts may not be sensitive to changes within the target class, so adopting a pre-trained feature extraction network limits the discrimination capability of the tracking model to a certain extent.
Mainstream twin tracking algorithms generally use the given bounding box information of the template frame in one of two ways: one is to perform conventional center cropping of the input image, and the other is to use the bounding box coordinates for ROI (Region Of Interest) mapping or to extract an image pixel mask. Both approaches use the existing bounding box coordinates to extract and manipulate image-level information, but they tend to ignore the value of the bounding box coordinates themselves as unstructured data.
The patent with application number 2022113443642 discloses a target tracking method and system based on a cross-correlation matching enhanced twin network, but this method likewise cannot solve the technical problem that the discrimination capability of the tracking model is limited.
Disclosure of Invention
The invention aims to solve the following technical problems: the pre-trained feature extraction network in existing tracking algorithms does not fully meet the definition of the tracking task, and the prior information available to the network cannot be fully utilized. To this end, the invention provides a twin target tracking method based on multi-attention task fusion and bounding box coding. Through the multi-attention task fusion operation and the bounding box coding, the tracking model can deal well with problems such as occlusion, the target leaving the field of view, motion blur, background clutter and scale variation, the discrimination capability of the tracking model is improved, good tracking performance is achieved, and good robustness can be maintained during long-term tracking.
The invention adopts the following technical scheme for solving the technical problems:
the invention provides a twin target tracking method based on multi-attention task fusion and bounding box coding, which comprises the following steps:
s1, constructing a twin target tracking network, wherein the twin target tracking network comprises a feature extraction network, a cross-correlation matching network, a classification regression network, a multi-attention task fusion module and a boundary box coding module;
the feature extraction network comprises a template feature extraction branch, a search feature extraction branch, a first multi-attention fusion module and a second multi-attention fusion module; the first multi-attention fusion module performs channel attention and space attention enhancement operation on the template feature extraction branches to obtain enhanced template branch features; the second multi-attention fusion module performs channel attention and space attention enhancement operation on the search feature extraction branch to obtain enhanced search branch features; the first multi-attention fusion module and the second multi-attention fusion module are identical in structure, and the expressions are as follows:
F_MA = F_CA · F_SA = CA(F_I) · DC(EC(F_CA))
where F_CA ∈ R^(C×H×W) denotes the feature map after channel attention enhancement; F_SA ∈ R^(C×H×W) denotes the feature map after spatial attention enhancement; CA(·) denotes the channel attention enhancement operation; DC(·) denotes the decoding (up-sampling) operation; EC(·) denotes the encoding (down-sampling) operation; F_I ∈ R^(C×H×W) denotes the input feature map of the multi-attention task fusion module;
the boundary frame coding module codes boundary frame information of the template image to obtain boundary frame coding characteristics;
the cross-correlation matching network carries out cross-correlation matching on the template features output by the first multi-attention fusion module and the search features output by the second multi-attention fusion module to obtain cross-correlation features, then fuses the cross-correlation features with the boundary frame coding features, inputs the fused features into the classification regression network, carries out convolution layer calculation, and obtains a classification score graph and a regression prediction graph; the regression prediction graph is used for predicting offset distances between the center position of the target and four edges of the regression frame; the classification score map is a prediction target foreground score, and each score corresponds to the offset distance of four sides in the regression prediction map;
s2, randomly extracting pairs of template images and search images from the training sample set by the feature extraction network in an offline mode, carrying out gradient back-propagation using the stochastic gradient descent (SGD) method with momentum, and optimizing the network parameters until the joint task loss function converges;
s3, performing online tracking test, cutting a first frame image of a test video sequence, inputting the first frame image as a template image into a constructed and trained twin target tracking network, sending the template image into a first multi-attention fusion module through a template feature extraction branch to obtain enhanced template features, and meanwhile, performing boundary frame information coding on the template image through a boundary frame coding module to obtain boundary frame coding features; cutting each subsequent frame of the test video sequence by using an area which is four times that of the template image, taking the center of the cut area as a predicted target center point of the previous frame, and sending the cut area as a search image into a constructed and trained twin target tracking network to obtain search features; and sending the obtained search features and template features to a cross-correlation matching network to be fused with the boundary frame coding features, inputting the fused features to a classification regression network to obtain a classification score graph and a regression prediction graph, and obtaining the final position of the target on the video sequence frame according to the position with the maximum response value in the classification score graph and the offset of the regression prediction graph.
Further, the template feature extraction branch and the search feature extraction branch are both modified from the original ResNet50 deep neural network: the strides of the third and fourth layers of the ResNet50 deep neural network are set to 1, the dilation of the dilated convolution is set to 4, and the fifth convolutional layer of ResNet50 is removed.
Further, the size of the template image is 127×127×3, and the size of the search image is 255×255×3.
Further, the channel attention enhancement operation comprises the sub-steps of:
inputting a template feature map and a search feature map, and carrying out global average pooling operation along a channel to obtain a template feature map and a search feature map with unchanged channel number and compressed size; respectively carrying out two nonlinear convolution transformations of dimension reduction and dimension increase on the obtained template feature diagram and the search feature diagram; the template feature map and the search feature map after dimension increase are subjected to activation function operation to obtain feature maps representing the weights of all channels, and the channel weight feature maps are multiplied with the input feature map to obtain feature maps subjected to channel attention enhancement, wherein the calculation formula is as follows:
F_sd = GAP(F)
F_ac = Conv_up(Conv_down(F_sd))
F_CA = σ(F_ac) · F
where GAP(·) denotes the global average pooling operation; F_sd ∈ R^(C×1×1) denotes the feature map after global average pooling along the channel direction; Conv_down(·) denotes the convolutional dimension-reduction operation; Conv_up(·) denotes the convolutional dimension-increase operation; the reduction/expansion ratio r is set to 16; σ denotes the Sigmoid activation function; F_ac ∈ R^(C×1×1) denotes the feature map after dimension increase; F_I ∈ R^(C×H×W) denotes the input feature map of the multi-attention task fusion module; F_CA denotes the feature map after channel attention enhancement.
Further, the spatial attention enhancing operation comprises the sub-steps of:
the extracted template feature map and the search feature map are used as input features to carry out spatial feature transformation through a multi-layer coding and decoding network structure to obtain a feature map representing spatial position weight, wherein the coding structure is composed of a plurality of downsampling convolution layers, the decoding structure is composed of a plurality of upsampling deconvolution layers, the features are restored to be single-channel feature maps at the last layer of the upsampling deconvolution layers, the single-channel feature maps are used as spatial weight response feature maps, and the calculation formula is as follows:
F_SA = DC(EC(F_CA))
where EC(·) denotes the encoding (down-sampling) operation; DC(·) denotes the decoding (up-sampling) operation; F_SA ∈ R^(1×H×W) denotes the spatial weight response feature map finally obtained by the encoding and decoding operations; F_CA ∈ R^(C×H×W) denotes the feature map after channel attention enhancement.
Further, the process of obtaining the boundary box coding feature is as follows:
converting the bounding box coordinates of the template frame into a one-dimensional feature vector B = (x, y, w, h), where (x, y) are the corner coordinates of the target bounding box, w is the width of the target bounding box and h is the height of the target bounding box; the feature vector B is passed through multiple fully connected layers to obtain the output feature map B_C:
B_C = f_C(B)
where f_C denotes the fully connected layer structure and B_C denotes the output feature of the feature vector B after the fully connected layers;
the feature map output by the fully connected layers and the feature map F_C ∈ R^(C×H×W) output by the cross-correlation network undergo a broadcast addition operation:
F_b = F_C + B_C
where F_b ∈ R^(C×H×W);
the final bounding box encoding output is obtained by a convolutional encoding of dimension 1×C:
F_BM = f_g(F_b)
where f_g denotes a convolutional encoding operation of dimension 1×C; F_BM ∈ R^(C×H×W) denotes the final bounding box encoding output.
Further, the size of the classification score map is 25×25×1, and the size of the regression prediction map is 25×25×4.
Further, the calculation formula of the joint task loss function is as follows:
L = λ_1·L_cls + λ_2·L_reg
where L_cls is the binary cross-entropy loss function, L_reg is the IoU loss function, and λ_1 and λ_2 are the weights of L_cls and L_reg, respectively.
Compared with the prior art, the invention adopts the technical proposal and has the following remarkable technical effects:
1. The twin target tracking method based on multi-attention task fusion and bounding box coding does not use the fifth-layer features of ResNet50, but only the features of the third and fourth layers of ResNet50; this greatly accelerates the training and inference speed of the twin neural network without noticeably affecting accuracy.
2. In the twin target tracking method based on multi-attention task fusion and bounding box coding, the proposed multi-attention task fusion module can enhance the target feature channels and adaptively adjust the importance of the relevant channels, thereby further improving the feature expression capability for the target; the spatial attention can enhance the spatial region where the target is located, and the overall multi-attention task fusion module enhances the tracking features in a more targeted way, which better improves tracking performance.
3. The twin target tracking method based on multi-attention task fusion and bounding box coding provided by the invention encodes the given ground-truth bounding box information of the target into the tracking model in order to make better use of the existing prior information, which further helps to enhance the response of the target foreground region.
Drawings
FIG. 1 is a network structure diagram corresponding to a twin object tracking method based on multi-attention task fusion and bounding box coding.
FIG. 2 is a block diagram of a channel attention module according to the present invention.
Fig. 3 is a block diagram of a spatial attention module according to the present invention.
FIG. 4 is a block diagram of a multi-attention task fusion module of the present invention.
Fig. 5 is an internal structural diagram of the bounding box encoding module of the present invention.
Fig. 6 is a schematic diagram of the performance evaluation results of the present invention on OTB100 and other mainstream tracking algorithms.
Fig. 7 is a schematic diagram of the visual results of the present invention on an OTB100 video sequence.
FIG. 8 is a graph showing the performance evaluation results of the present invention on GOT10K and other mainstream tracking algorithms.
Detailed Description
The invention is described in further detail below with reference to the drawings and the detailed description.
Fig. 1 is a network structure diagram corresponding to a twin target tracking method based on multi-attention task fusion and bounding box coding, and an embodiment of the invention discloses a twin target tracking method based on multi-attention task fusion and bounding box coding, which comprises the following steps:
s1, constructing a twin target tracking network, wherein the twin target tracking network comprises a feature extraction network, a cross-correlation matching network, a classification regression network, a multi-attention task fusion module and a boundary box coding module. The feature extraction network comprises a template feature extraction branch, a search feature extraction branch, a first multi-attention fusion module and a second multi-attention fusion module; the first multi-attention fusion module performs channel attention and space attention enhancement operation on the template feature extraction branches to obtain enhanced template branch features; the second multi-attention fusion module performs channel attention and space attention enhancement operation on the search feature extraction branch to obtain enhanced search branch features; the first multi-attention fusion module and the second multi-attention fusion module are identical in structure, and the expressions are as follows:
F_MA = F_CA · F_SA = CA(F_I) · DC(EC(F_CA)),
where F_CA ∈ R^(C×H×W) denotes the feature map after channel attention enhancement; F_SA ∈ R^(C×H×W) denotes the feature map after spatial attention enhancement; CA(·) denotes the channel attention enhancement operation; DC(·) denotes the decoding (up-sampling) operation; EC(·) denotes the encoding (down-sampling) operation; F_I ∈ R^(C×H×W) denotes the input feature map of the multi-attention task fusion module.
And the boundary frame coding module codes boundary frame information of the template image to obtain boundary frame coding characteristics.
The cross-correlation matching network carries out cross-correlation matching on the template features output by the first multi-attention fusion module and the search features output by the second multi-attention fusion module to obtain cross-correlation features, then fuses the cross-correlation features with the boundary frame coding features, inputs the fused features into the classification regression network, carries out convolution layer calculation, and obtains a classification score graph and a regression prediction graph; the regression prediction graph is used for predicting offset distances between the center position of the target and four edges of the regression frame; the classification score map is a prediction target foreground score, and each score corresponds to the offset distance of four sides in the regression prediction map.
S2, in an offline mode, the feature extraction network randomly extracts pairs of template images and search images from the training sample set, gradient back-propagation is carried out using the stochastic gradient descent (SGD) method with momentum, and the network parameters are optimized until the joint task loss function converges.
S3, performing online tracking test, cutting a first frame image of a test video sequence, inputting the first frame image as a template image into a constructed and trained twin target tracking network, sending the template image into a first multi-attention fusion module through a template feature extraction branch to obtain enhanced template features, and meanwhile, performing boundary frame information coding on the template image through a boundary frame coding module to obtain boundary frame coding features; cutting each subsequent frame of the test video sequence by using an area which is four times that of the template image, taking the center of the cut area as a predicted target center point of the previous frame, and sending the cut area as a search image into a constructed and trained twin target tracking network to obtain search features; and sending the obtained search features and template features to a cross-correlation matching network to be fused with the boundary frame coding features, inputting the fused features to a classification regression network to obtain a classification score graph and a regression prediction graph, and obtaining the final position of the target on the video sequence frame according to the position with the maximum response value in the classification score graph and the offset of the regression prediction graph.
1. Constructing a twin target tracking network
First, a twin target tracking network is constructed, the twin target tracking network comprising a feature extraction network, a cross-correlation matching network, a classification regression network, a multi-attention task fusion module, and a bounding box encoding module.
The feature extraction network comprises a template feature extraction branch, a search feature extraction branch, a first multi-attention fusion module and a second multi-attention fusion module; the first multi-attention fusion module performs channel attention and space attention enhancement operation on the template feature extraction branches to obtain enhanced template branch features; the second multi-attention fusion module performs channel attention and space attention enhancement operation on the search feature extraction branch to obtain enhanced search branch features; the first multi-attention fusion module and the second multi-attention fusion module have the same structure, and the expressions are:
F_MA = F_CA · F_SA = CA(F_I) · DC(EC(F_CA)),
where F_CA ∈ R^(C×H×W) denotes the feature map after channel attention enhancement; F_SA ∈ R^(C×H×W) denotes the feature map after spatial attention enhancement; CA(·) denotes the channel attention enhancement operation; DC(·) denotes the decoding (up-sampling) operation; EC(·) denotes the encoding (down-sampling) operation; F_I ∈ R^(C×H×W) denotes the input feature map of the multi-attention task fusion module.
And the boundary frame coding module codes boundary frame information of the template image to obtain boundary frame coding characteristics.
The cross-correlation matching network carries out cross-correlation matching on the template features output by the first multi-attention fusion module and the search features output by the second multi-attention fusion module to obtain cross-correlation features, then fuses the cross-correlation features with the boundary frame coding features, inputs the fused features into the classification regression network, carries out convolution layer calculation, and obtains a classification score graph and a regression prediction graph; the regression prediction graph is used for predicting offset distances between the center position of the target and four edges of the regression frame; the classification score map is a prediction target foreground score, and each score corresponds to the offset distance of four sides in the regression prediction map.
For example, setting the stride of the last two stages of the original ResNet50 to 1 and the dilation of the dilated convolution to 4 enlarges the receptive field. Meanwhile, when designing the feature extraction network, the experimental analysis of the backbone network in SiamRPN++ is used for reference and the fifth convolutional layer of ResNet50 is removed, so that, compared with AlexNet and GoogLeNet, the ResNet-based feature extraction network still has deeper feature extraction capability. Furthermore, the fifth-layer features of ResNet50 are not used in the present invention; only the third- and fourth-layer features of ResNet50 are used. This greatly accelerates the training and inference speed of the twin neural network without noticeably affecting accuracy.
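As a rough illustration of this backbone modification (a sketch, not the patent's exact implementation), the PyTorch snippet below assumes the modified stages map to torchvision's layer2/layer3, whose strides are replaced by dilation, and that the last stage (layer4) is simply dropped; the class name and the use of torchvision are illustrative choices.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50


class TrackingBackbone(nn.Module):
    """Modified ResNet-50: strides of the third/fourth stages replaced by dilation,
    last stage removed, features taken from the third and fourth stages."""

    def __init__(self):
        super().__init__()
        net = resnet50(weights=None,  # pretrained weights could be loaded here instead
                       replace_stride_with_dilation=[True, True, False])
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2, self.layer3 = net.layer1, net.layer2, net.layer3
        # net.layer4 (the last convolutional stage) is deliberately not kept.

    def forward(self, x):
        x = self.stem(x)
        x = self.layer1(x)
        f3 = self.layer2(x)   # third-stage features
        f4 = self.layer3(f3)  # fourth-stage features
        return f3, f4


if __name__ == "__main__":
    template = torch.randn(1, 3, 127, 127)
    f3, f4 = TrackingBackbone()(template)
    print(f3.shape, f4.shape)  # same spatial resolution, because strides were replaced by dilation
```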
The whole network is divided into two input branches, namely a template image branch and a search image branch. The size of the template image is set to 127×127×3, and the size of the search image is set to 255×255×3. The two branch images respectively enter the first multi-attention fusion module and the second multi-attention fusion module for feature extraction, yielding the template branch feature, a feature map of height H_z, width W_z and C channels, and the search branch feature, a feature map of height H_x, width W_x and C channels.
FIG. 2 is a structural diagram of the channel attention module of the present invention, and FIG. 3 is a structural diagram of the spatial attention module of the present invention. The template image and the search image are first subjected to feature extraction and then enhanced by structurally identical multi-attention task fusion modules. Let the feature map of an input image after feature extraction be F ∈ R^(C×H×W); this feature is channel-enhanced and spatially enhanced by the multi-attention task fusion module. The input feature map F ∈ R^(C×H×W) first undergoes a global average pooling operation along the channels, giving a feature map with an unchanged number of channels and a compressed spatial size. The feature map is then subjected to two nonlinear convolutional transformations, one reducing and one increasing the dimension. Finally, the dimension-increased feature passes through an activation function to obtain a feature map representing the weight of each channel, and this channel weight map is multiplied with the input feature to obtain the final channel-attention-enhanced feature map F_CA ∈ R^(C×H×W). The specific flow of the channel attention part corresponds to the following formulas:
F_sd = GAP(F)
F_ac = Conv_up(Conv_down(F_sd))
F_CA = σ(F_ac) · F
where GAP(·) denotes the global average pooling operation; F_sd ∈ R^(C×1×1) denotes the feature map after global average pooling along the channel direction; Conv_down(·) denotes the convolutional dimension-reduction operation; Conv_up(·) denotes the convolutional dimension-increase operation; the reduction/expansion ratio r is set to 16; σ denotes the Sigmoid activation function; F_ac ∈ R^(C×1×1) denotes the feature map after dimension increase; F_I ∈ R^(C×H×W) denotes the input feature map of the multi-attention task fusion module; F_CA denotes the feature map after channel attention enhancement.
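The following PyTorch sketch, given only for illustration and not taken from the patent, implements the three formulas above directly; the ReLU between the two 1×1 convolutions and the layer names are assumptions.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Channel attention: GAP -> 1x1 conv down -> 1x1 conv up -> Sigmoid reweighting (r = 16)."""

    def __init__(self, channels, r=16):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                      # F_sd = GAP(F)
        self.conv_down = nn.Conv2d(channels, channels // r, 1)  # dimension reduction
        self.relu = nn.ReLU(inplace=True)                       # non-linearity between the two convs (assumed)
        self.conv_up = nn.Conv2d(channels // r, channels, 1)    # dimension increase
        self.sigmoid = nn.Sigmoid()

    def forward(self, f):
        f_sd = self.gap(f)                                      # (N, C, 1, 1)
        f_ac = self.conv_up(self.relu(self.conv_down(f_sd)))    # (N, C, 1, 1)
        return self.sigmoid(f_ac) * f                           # F_CA = sigma(F_ac) * F
```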
The extracted template feature map and search feature map are taken as input features and undergo a spatial feature transformation through a multi-layer encoding and decoding network structure to obtain a feature map representing spatial position weights. The encoding structure consists of several down-sampling convolutional layers, and the decoding structure consists of several up-sampling deconvolution layers. At the last up-sampling layer, the features are restored to a single-channel feature map used to represent the weights of the spatial positions. The finally obtained spatial weight response feature map corresponds to the following formula:
F_SA = DC(EC(F_CA))
where EC(·) denotes the encoding (down-sampling) operation; DC(·) denotes the decoding (up-sampling) operation; F_SA ∈ R^(1×H×W) denotes the spatial weight response feature map finally obtained by the encoding and decoding operations; F_CA ∈ R^(C×H×W) denotes the feature map after channel attention enhancement.
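A minimal PyTorch sketch of such an encoding-decoding spatial attention branch is given below; the number of stages, the channel widths, the Sigmoid on the output and the final resize are assumptions made here for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialAttention(nn.Module):
    """Spatial attention: strided-conv encoder, transposed-conv decoder ending in a single-channel map."""

    def __init__(self, channels):
        super().__init__()
        self.encoder = nn.Sequential(  # EC(.): down-sampling convolutions
            nn.Conv2d(channels, channels // 2, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 2, channels // 4, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(  # DC(.): up-sampling deconvolutions
            nn.ConvTranspose2d(channels // 4, channels // 2, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(channels // 2, 1, 4, stride=2, padding=1),  # single-channel weight map
            nn.Sigmoid(),
        )

    def forward(self, f_ca):
        w = self.decoder(self.encoder(f_ca))
        # Resize back to the input resolution in case of odd feature-map sizes.
        return F.interpolate(w, size=f_ca.shape[-2:], mode="bilinear", align_corners=False)
```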
FIG. 4 is a structural diagram of the multi-attention task fusion module of the present invention. In the embodiment of the invention, the first half of the multi-attention task fusion module is the channel attention and the second half is the spatial attention; the input of the spatial attention module is the channel-attention-enhanced feature, and the output feature map after fusing channel attention and spatial attention is obtained as:
F_MA = F_CA · F_SA = CA(F_I) · DC(EC(F_CA)),
where F_CA ∈ R^(C×H×W) denotes the feature map after channel attention enhancement; F_SA ∈ R^(C×H×W) denotes the feature map after spatial attention enhancement; CA(·) denotes the channel attention enhancement operation; DC(·) denotes the decoding (up-sampling) operation; EC(·) denotes the encoding (down-sampling) operation; F_I ∈ R^(C×H×W) denotes the input feature map of the multi-attention task fusion module.
The invention performs feature enhancement by combining a channel attention mechanism and a spatial attention mechanism: the channel attention mechanism is responsible for enhancing the feature channels of the target, while the spatial attention mechanism is responsible for enhancing the spatial position of the target. To avoid the difficulty that the global pooling used by conventional spatial attention mechanisms has in distinguishing distractors, the invention designs an encoding-decoding network for the spatial feature enhancement operation, so that the feature enhancement better matches the definition of the tracking task and the discrimination capability of the tracking model can be better improved.
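Combining the two sketches above, a minimal (illustrative, not authoritative) version of the whole multi-attention task fusion module could look as follows; `ChannelAttention` and `SpatialAttention` refer to the example classes defined earlier.

```python
import torch.nn as nn


class MultiAttentionFusion(nn.Module):
    """F_MA = CA(F_I) * DC(EC(CA(F_I))): channel attention followed by spatial attention."""

    def __init__(self, channels, r=16):
        super().__init__()
        self.channel_attention = ChannelAttention(channels, r)
        self.spatial_attention = SpatialAttention(channels)

    def forward(self, f_i):
        f_ca = self.channel_attention(f_i)    # channel-enhanced features, (N, C, H, W)
        f_sa = self.spatial_attention(f_ca)   # spatial weight map, (N, 1, H, W)
        return f_ca * f_sa                    # broadcast over channels
```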
Fig. 5 is an internal structural diagram of the bounding box encoding module of the present invention. The bounding box encoding module first converts the bounding box coordinates of the template frame into a one-dimensional feature vector B = (x, y, w, h), where (x, y) represents the corner coordinates of the target bounding box, w represents the width of the target bounding box, and h represents the height of the target bounding box. The feature vector B is passed through multiple fully connected layers to obtain the output feature:
B_C = f_C(B),
where f_C denotes the fully connected layer structure and B_C denotes the output feature of the feature vector B after the fully connected layers.
The feature map output by the fully connected layers and the feature map F_C ∈ R^(C×H×W) output by the cross-correlation network undergo a broadcast addition operation:
F_b = F_C + B_C,
where F_b ∈ R^(C×H×W).
The final bounding box encoding output is obtained by a convolutional encoding of dimension 1×C:
F_BM = f_g(F_b),
where f_g denotes a convolutional encoding operation of dimension 1×C and F_BM ∈ R^(C×H×W) denotes the final bounding box encoding output.
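For illustration, a minimal PyTorch sketch of the bounding box encoding module is given below; the hidden width of the fully connected layers and the exact shape handling are assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn


class BoxEncoding(nn.Module):
    """Encode the template box (x, y, w, h), broadcast-add it to the correlation features,
    and apply a 1x1 convolutional encoding."""

    def __init__(self, channels, hidden=64):
        super().__init__()
        self.fc = nn.Sequential(               # f_C: multi-layer fully connected structure
            nn.Linear(4, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
        )
        self.encode = nn.Conv2d(channels, channels, kernel_size=1)  # f_g: 1x1 convolutional encoding

    def forward(self, box, f_c):
        # box: (N, 4) tensor holding (x, y, w, h); f_c: (N, C, H, W) correlation features.
        b_c = self.fc(box).unsqueeze(-1).unsqueeze(-1)  # B_C reshaped to (N, C, 1, 1)
        f_b = f_c + b_c                                 # broadcast addition: F_b = F_C + B_C
        return self.encode(f_b)                         # F_BM
```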
The invention can better utilize the tracking priori information by designing the boundary box coding module, thereby further improving the performance of the tracking model.
After the output features F_BM ∈ R^(C×H×W) of the cross-correlation matching network are obtained, they are passed through a series of convolutional layers, finally outputting a classification score map of size 25×25×1 and a regression prediction map of size 25×25×4. The classification branch is responsible for predicting the score of the target being foreground. The regression branch is responsible for predicting four distances (l, t, b, r), which represent the offset distances from the target center position to the four edges of the regression box.
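The patent does not spell out the exact form of the cross-correlation, so the sketch below assumes the depth-wise cross-correlation commonly used in twin trackers, followed by a small convolutional head that produces the 25×25×1 classification map and the 25×25×4 regression map; the head depths are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def depthwise_xcorr(search, template):
    """Depth-wise cross-correlation: search (N, C, Hx, Wx) correlated with template (N, C, Hz, Wz)."""
    n, c, h, w = search.shape
    out = F.conv2d(search.reshape(1, n * c, h, w),
                   template.reshape(n * c, 1, *template.shape[-2:]),
                   groups=n * c)
    return out.reshape(n, c, out.shape[-2], out.shape[-1])


class ClsRegHead(nn.Module):
    """Convolutional head producing a 1-channel score map and a 4-channel (l, t, b, r) offset map."""

    def __init__(self, channels):
        super().__init__()
        self.cls = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(channels, 1, 1))   # foreground score map
        self.reg = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(channels, 4, 1))   # offsets (l, t, b, r)

    def forward(self, f_bm):
        return self.cls(f_bm), self.reg(f_bm)
```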
2. Training network
In an offline mode, the feature extraction network randomly extracts pairs of template images and search images from the training sample set, gradient back-propagation is carried out using the stochastic gradient descent (SGD) method with momentum, and the network parameters are optimized until the joint task loss function converges.
The whole network is trained end-to-end offline on several large-scale datasets; the training data comprise the five large-scale datasets ImageNet VID, YouTube-BoundingBoxes, GOT10K, ImageNet DET and COCO. The input images comprise a template image and a search image, which come from different frames of the same video sequence. In the training stage, the preprocessed template image and search image are used as inputs of the network and share the same network and network parameters; after passing through the whole twin target tracking network they are output as a classification score map and a regression prediction map, and the whole network is trained and optimized end-to-end using the joint task loss function.
For example, during the offline training process the template image and the search image are randomly selected from the same video sequence. Specifically, during training the batch size of each iteration is set to 16 on a single GPU, and gradient back-propagation optimization is performed using stochastic gradient descent (SGD) with momentum. The total number of training epochs is 50: the first five epochs use a warm-up learning rate increasing gradually from 0.001 to 0.005, and the subsequent forty-five epochs use a learning rate decaying gradually from 0.005 to 0.00001. The weight decay factor and momentum parameter are set to 0.0001 and 0.9, respectively. In the training stage, the preprocessed template image and search image are used as inputs of the network and share the same network and network parameters; after passing through the whole twin target tracking network, the finally output classification score map and regression prediction map are obtained, and the whole network is trained and optimized end-to-end using the joint task loss function.
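A minimal sketch of this optimizer and learning-rate schedule is shown below (illustrative only; the linear warm-up and geometric decay shapes are assumptions, since only the start and end values are stated above).

```python
import torch


def build_optimizer_and_schedule(model, epochs=50, warmup_epochs=5):
    """SGD with momentum 0.9 and weight decay 1e-4; warm-up 0.001 -> 0.005 over the
    first 5 epochs, then decay 0.005 -> 1e-5 over the remaining epochs."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                                momentum=0.9, weight_decay=0.0001)

    def lr_at(epoch):
        if epoch < warmup_epochs:                                    # linear warm-up (assumed shape)
            return 0.001 + (0.005 - 0.001) * epoch / (warmup_epochs - 1)
        t = (epoch - warmup_epochs) / (epochs - warmup_epochs - 1)   # 0 -> 1 over the remaining epochs
        return 0.005 * (0.00001 / 0.005) ** t                        # geometric decay (assumed shape)

    def set_lr(epoch):
        for group in optimizer.param_groups:
            group["lr"] = lr_at(epoch)

    return optimizer, set_lr
```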
For a classification branch, the classification score graph ultimately predicts the probability that the target is foreground. For the regression branches, the regression prediction graph ultimately predicts four center offset distances. The joint task loss function L can thus be expressed as follows:
L = λ_1·L_cls + λ_2·L_reg
where L_cls is the binary cross-entropy loss function and L_reg is the IoU loss function. In the embodiment of the invention, λ_1 = 1 and λ_2 = 1.
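The following sketch shows one way the joint task loss could be written; the use of logits for the classification map and the restriction of the IoU term to foreground locations are assumptions not stated above.

```python
import torch
import torch.nn.functional as F


def iou_loss(pred, target, eps=1e-6):
    """IoU loss for (l, t, b, r) offsets from the centre to the box edges; shapes (..., 4)."""
    pl, pt, pb, pr = pred.unbind(-1)
    tl, tt, tb, tr = target.unbind(-1)
    inter = (torch.min(pl, tl) + torch.min(pr, tr)) * (torch.min(pt, tt) + torch.min(pb, tb))
    union = (pl + pr) * (pt + pb) + (tl + tr) * (tt + tb) - inter
    return (1 - inter / (union + eps)).mean()


def joint_loss(cls_pred, cls_target, reg_pred, reg_target, lam1=1.0, lam2=1.0):
    # cls_pred: (N, 1, 25, 25) logits; cls_target: same shape, 1 for foreground locations.
    l_cls = F.binary_cross_entropy_with_logits(cls_pred, cls_target)
    pos = cls_target.squeeze(1) > 0.5                       # regress only at foreground locations (assumed)
    if pos.any():
        l_reg = iou_loss(reg_pred.permute(0, 2, 3, 1)[pos],
                         reg_target.permute(0, 2, 3, 1)[pos])
    else:
        l_reg = reg_pred.sum() * 0                          # keep the graph when no positives exist
    return lam1 * l_cls + lam2 * l_reg
```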
3. On-line tracking test
The first frame image of the test video sequence is cropped and input as the template image into the constructed and trained twin target tracking network; the template image is sent through the feature extraction network into the multi-attention fusion module to obtain the fused template features, and at the same time the bounding box information of the template image is encoded by the bounding box encoding module to obtain the bounding box encoding features. Each subsequent frame of the test video sequence is cropped with an area four times that of the template image, the center of the cropped area being the target center point predicted in the previous frame; the cropped area is sent as the search image into the constructed and trained twin target tracking network to obtain the search features. The obtained search features and template features are sent to the cross-correlation matching network and fused with the bounding box encoding features, the fused features are input into the classification regression network to obtain the classification score map and the regression prediction map, and the final position of the target in the video frame is obtained from the position with the maximum response value in the classification score map combined with the offsets of the regression prediction map.
For example, in the embodiment of the invention, in the test stage the first frame image of the test video sequence is cropped and processed as the template image and sent to the feature extraction network; the template features obtained after preprocessing by the multi-attention module are fixed in the network, and the bounding box information given by the template frame is encoded. Each subsequent frame of the test video sequence is cropped with an area four times that of the template image, the center of the cropped area being the target center point predicted in the previous frame; the cropped area is sent as the search image into the twin tracking network, features are extracted by the ResNet50 deep neural network and enhanced by the multi-attention module, and together with the previously obtained template features they pass through the cross-correlation matching network and the classification regression network to obtain the classification score map and the regression prediction map. Finally, through a series of post-processing operations, the four offsets in the regression prediction map corresponding to the position with the maximum classification score are used to obtain the final predicted bounding box of the target in the frame. After the last frame of the video sequence has been tracked, the whole test process ends, or tracking of the next video sequence starts and the corresponding template features are updated.
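As a rough illustration of a single inference step (not the patent's exact post-processing, which involves further operations), the sketch below picks the highest-scoring location in the classification map and converts its (l, t, b, r) offsets into a box; the stride and offset used to map grid cells back to search-crop pixels are assumptions.

```python
import torch


@torch.no_grad()
def track_step(net, search_img, stride=8, grid_offset=31.0):
    """One inference step: net(search_img) is assumed to return a (1, 1, 25, 25)
    classification map and a (1, 4, 25, 25) regression map of (l, t, b, r) offsets."""
    cls_map, reg_map = net(search_img)
    score = cls_map.sigmoid()[0, 0]                # (25, 25) foreground scores
    idx = int(torch.argmax(score))
    row, col = divmod(idx, score.shape[1])
    l, t, b, r = reg_map[0, :, row, col].tolist()  # offsets at the best location
    cx = col * stride + grid_offset                # grid cell -> search-crop pixel (assumed mapping)
    cy = row * stride + grid_offset
    box = (cx - l, cy - t, cx + r, cy + b)         # (x1, y1, x2, y2) in the search crop
    return box, float(score[row, col])
```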
The invention evaluates tracking performance on the OTB100 dataset using two metrics of the one-pass evaluation: the precision and the area under the curve (AUC) of the success plot. The mainstream tracking algorithms compared with the invention include ATOM, DaSiamRPN, GradNet, SiamRPN, CFNet and SiamFC. As shown in fig. 6, on the success-plot AUC the tracking algorithm proposed by the invention is 0.6, 1.5, 3.4, 4.4, 8.6 and 8.7 percentage points higher than ATOM, DaSiamRPN, GradNet, SiamRPN, CFNet and SiamFC, respectively. On the precision metric, the invention is 0.7, 0.8, 2.6, 4.0, 10.9 and 11.5 percentage points higher than ATOM, DaSiamRPN, GradNet, SiamRPN, CFNet and SiamFC, respectively. In addition, three challenging attributes of the OTB100 benchmark that strongly affect the tracking features, namely fast motion, motion blur and deformation, are selected to further test the tracking performance of the invention. As shown in fig. 7, in these three challenging scenarios the method ranks first among the compared tracking algorithms and obtains good performance, which demonstrates the effectiveness of the multi-attention task fusion and bounding box coding design in improving tracking performance.
As shown in fig. 8, the performance of the tracking algorithm proposed by the invention on the large tracking dataset GOT10K is compared with that of various mainstream tracking algorithms. Most of the video sequences in the GOT10K dataset are outdoor scenes, which better matches real tracking scenarios and poses more challenges to tracking. The tracking algorithm proposed by the invention ranks first on the average overlap rate and on the success rate with an overlap threshold of 0.75, and ranks second on the success rate with an overlap threshold of 0.5, so it still obtains competitive tracking results compared with the various mainstream tracking algorithms.
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereto, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the present invention.

Claims (8)

1. The twin target tracking method based on multi-attention task fusion and bounding box coding is characterized by comprising the following steps of:
s1, constructing a twin target tracking network, wherein the twin target tracking network comprises a feature extraction network, a cross-correlation matching network, a classification regression network, a multi-attention task fusion module and a boundary box coding module;
the feature extraction network comprises a template feature extraction branch, a search feature extraction branch, a first multi-attention fusion module and a second multi-attention fusion module; the first multi-attention fusion module performs channel attention and space attention enhancement operation on the template feature extraction branches to obtain enhanced template branch features; the second multi-attention fusion module performs channel attention and space attention enhancement operation on the search feature extraction branch to obtain enhanced search branch features; the first multi-attention fusion module and the second multi-attention fusion module are identical in structure, and the expressions are as follows:
F_MA = F_CA · F_SA = CA(F_I) · DC(EC(F_CA))
where F_CA ∈ R^(C×H×W) denotes the feature map after channel attention enhancement; F_SA ∈ R^(C×H×W) denotes the feature map after spatial attention enhancement; CA(·) denotes the channel attention enhancement operation; DC(·) denotes the decoding (up-sampling) operation; EC(·) denotes the encoding (down-sampling) operation; F_I ∈ R^(C×H×W) denotes the input feature map of the multi-attention task fusion module;
the boundary frame coding module codes boundary frame information of the template image to obtain boundary frame coding characteristics;
the cross-correlation matching network carries out cross-correlation matching on the template features output by the first multi-attention fusion module and the search features output by the second multi-attention fusion module to obtain cross-correlation features, then fuses the cross-correlation features with the boundary frame coding features, inputs the fused features into the classification regression network, carries out convolution layer calculation, and obtains a classification score graph and a regression prediction graph; the regression prediction graph is used for predicting offset distances between the center position of the target and four edges of the regression frame; the classification score map is a prediction target foreground score, and each score corresponds to the offset distance of four sides in the regression prediction map;
s2, randomly extracting pairs of template images and search images from the training sample set by the feature extraction network in an offline mode, carrying out gradient back-propagation using the stochastic gradient descent (SGD) method with momentum, and optimizing the network parameters until the joint task loss function converges;
s3, performing online tracking test, cutting a first frame image of a test video sequence, inputting the first frame image as a template image into a constructed and trained twin target tracking network, sending the template image into a first multi-attention fusion module through a template feature extraction branch to obtain enhanced template features, and meanwhile, performing boundary frame information coding on the template image through a boundary frame coding module to obtain boundary frame coding features; cutting each subsequent frame of the test video sequence by using an area which is four times that of the template image, taking the center of the cut area as a predicted target center point of the previous frame, and sending the cut area as a search image into a constructed and trained twin target tracking network to obtain search features; and sending the obtained search features and template features to a cross-correlation matching network to be fused with the boundary frame coding features, inputting the fused features to a classification regression network to obtain a classification score graph and a regression prediction graph, and obtaining the final position of the target on the video sequence frame according to the position with the maximum response value in the classification score graph and the offset of the regression prediction graph.
2. The twin object tracking method based on multi-attention task fusion and bounding box coding according to claim 1, wherein the template feature extraction branch and the search feature extraction branch are both modified from the original ResNet50 deep neural network: the strides of the third and fourth layers of the ResNet50 deep neural network are set to 1, the dilation of the dilated convolution is set to 4, and the fifth convolutional layer of ResNet50 is removed.
3. The twin object tracking method based on multi-attention task fusion and bounding box encoding of claim 1, wherein the template image has a size of 127×127×3 and the search image has a size of 255×255×3.
4. The multi-attention task fusion and bounding box encoding based twin object tracking method of claim 1, wherein the channel attention enhancement operation comprises the sub-steps of:
inputting a template feature map and a search feature map, and carrying out global average pooling operation along a channel to obtain a template feature map and a search feature map with unchanged channel number and compressed size; respectively carrying out two nonlinear convolution transformations of dimension reduction and dimension increase on the obtained template feature diagram and the search feature diagram; the template feature map and the search feature map after dimension increase are subjected to activation function operation to obtain feature maps representing the weights of all channels, and the channel weight feature maps are multiplied with the input feature map to obtain feature maps subjected to channel attention enhancement, wherein the calculation formula is as follows:
F_sd = GAP(F)
F_ac = Conv_up(Conv_down(F_sd))
F_CA = σ(F_ac) · F
where GAP(·) denotes the global average pooling operation; F_sd ∈ R^(C×1×1) denotes the feature map after global average pooling along the channel direction; Conv_down(·) denotes the convolutional dimension-reduction operation; Conv_up(·) denotes the convolutional dimension-increase operation; the reduction/expansion ratio r is set to 16; σ denotes the Sigmoid activation function; F_ac ∈ R^(C×1×1) denotes the feature map after dimension increase; F_I ∈ R^(C×H×W) denotes the input feature map of the multi-attention task fusion module; F_CA denotes the feature map after channel attention enhancement.
5. The multi-attention task fusion and bounding box encoding based twin object tracking method as claimed in claim 1 wherein the spatial attention enhancement operation comprises the sub-steps of:
the extracted template feature map and the search feature map are used as input features to carry out spatial feature transformation through a multi-layer coding and decoding network structure to obtain a feature map representing spatial position weight, wherein the coding structure is composed of a plurality of downsampling convolution layers, the decoding structure is composed of a plurality of upsampling deconvolution layers, the features are restored to be single-channel feature maps at the last layer of the upsampling deconvolution layers, the single-channel feature maps are used as spatial weight response feature maps, and the calculation formula is as follows:
F_SA = DC(EC(F_CA))
where EC(·) denotes the encoding (down-sampling) operation; DC(·) denotes the decoding (up-sampling) operation; F_SA ∈ R^(1×H×W) denotes the spatial weight response feature map finally obtained by the encoding and decoding operations; F_CA ∈ R^(C×H×W) denotes the feature map after channel attention enhancement.
6. The twin object tracking method based on multi-attention task fusion and bounding box coding according to claim 1, wherein the process of obtaining the bounding box coding features is as follows:
converting the bounding box coordinates of the template frame into a one-dimensional feature vector B = (x, y, w, h), where (x, y) are the corner coordinates of the target bounding box, w is the width of the target bounding box and h is the height of the target bounding box; the feature vector B is passed through multiple fully connected layers to obtain the output feature map B_C:
B_C = f_C(B)
where f_C denotes the fully connected layer structure and B_C denotes the output feature of the feature vector B after the fully connected layers;
the feature map output by the fully connected layers and the feature map F_C ∈ R^(C×H×W) output by the cross-correlation network undergo a broadcast addition operation:
F_b = F_C + B_C
where F_b ∈ R^(C×H×W);
the final bounding box encoding output is obtained by a convolutional encoding of dimension 1×C:
F_BM = f_g(F_b)
where f_g denotes a convolutional encoding operation of dimension 1×C; F_BM ∈ R^(C×H×W) denotes the final bounding box encoding output.
7. The twin object tracking method based on multi-attention task fusion and bounding box encoding of claim 1, wherein the classification score map is 25×25×1 in size and the regression prediction map is 25×25×4 in size.
8. The twin object tracking method based on multi-attention task fusion and bounding box coding according to claim 1, wherein the calculation formula of the joint task loss function is:
L = λ_1·L_cls + λ_2·L_reg
where L_cls is the binary cross-entropy loss function, L_reg is the IoU loss function, and λ_1 and λ_2 are the weights of L_cls and L_reg, respectively.
CN202310555213.XA 2023-05-17 2023-05-17 Twin target tracking method based on multi-attention task fusion and bounding box coding Pending CN116630850A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310555213.XA CN116630850A (en) 2023-05-17 2023-05-17 Twin target tracking method based on multi-attention task fusion and bounding box coding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310555213.XA CN116630850A (en) 2023-05-17 2023-05-17 Twin target tracking method based on multi-attention task fusion and bounding box coding

Publications (1)

Publication Number Publication Date
CN116630850A true CN116630850A (en) 2023-08-22

Family

ID=87641089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310555213.XA Pending CN116630850A (en) 2023-05-17 2023-05-17 Twin target tracking method based on multi-attention task fusion and bounding box coding

Country Status (1)

Country Link
CN (1) CN116630850A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333515A (en) * 2023-12-01 2024-01-02 南昌工程学院 Target tracking method and system based on regional awareness
CN117333515B (en) * 2023-12-01 2024-02-09 南昌工程学院 Target tracking method and system based on regional awareness

Similar Documents

Publication Publication Date Title
Zhang et al. Ga-net: Guided aggregation net for end-to-end stereo matching
Xu et al. Line segment detection using transformers without edges
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
CN109919032B (en) Video abnormal behavior detection method based on motion prediction
CN113628249B (en) RGBT target tracking method based on cross-modal attention mechanism and twin structure
CN111723693B (en) Crowd counting method based on small sample learning
CN111767847B (en) Pedestrian multi-target tracking method integrating target detection and association
CN113744311A (en) Twin neural network moving target tracking method based on full-connection attention module
CN112668522B (en) Human body key point and human body mask joint detection network and method
KR20200027887A (en) Learning method, learning device for optimizing parameters of cnn by using multiple video frames and testing method, testing device using the same
Han et al. A method based on multi-convolution layers joint and generative adversarial networks for vehicle detection
CN116630850A (en) Twin target tracking method based on multi-attention task fusion and bounding box coding
CN110942463B (en) Video target segmentation method based on generation countermeasure network
CN112184731A (en) Multi-view stereo depth estimation method based on antagonism training
Wang et al. Water hazard detection using conditional generative adversarial network with mixture reflection attention units
CN116934796A (en) Visual target tracking method based on twinning residual error attention aggregation network
CN116823878A (en) Visual multi-target tracking method based on fusion paradigm
CN111104855A (en) Workflow identification method based on time sequence behavior detection
CN116188555A (en) Monocular indoor depth estimation algorithm based on depth network and motion information
CN114463187B (en) Image semantic segmentation method and system based on aggregation edge features
KR102527642B1 (en) System and method for detecting small target based deep learning
Wang et al. LRDif: Diffusion Models for Under-Display Camera Emotion Recognition
CN112652059B (en) Mesh R-CNN model-based improved target detection and three-dimensional reconstruction method
CN115546252A (en) Target tracking method and system based on cross-correlation matching enhanced twin network
Xue et al. Multiscale feature extraction network for real-time semantic segmentation of road scenes on the autonomous robot

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination