CN114140495A - Single target tracking method based on multi-scale Transformer

Info

Publication number: CN114140495A
Application number: CN202111340646.0A
Authority: CN (China)
Priority/filing date: 2021-11-12
Publication date: 2022-03-04
Legal status: Pending
Original language: Chinese (zh)
Prior art keywords: target, feature, candidate, features, template
Inventors: 何志伟, 聂佳浩, 伍瀚, 高明煜, 董哲康
Applicant and assignee: Hangzhou Dianzi University

Classifications

    • G06T7/246: Image analysis; analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/084: Neural network learning methods; backpropagation, e.g. using gradient descent
    • G06T2207/20081: Special algorithmic details; training, learning
    • G06T2207/20084: Special algorithmic details; artificial neural networks [ANN]


Abstract

The invention discloses a single-target tracking method based on a multi-scale Transformer. The method crops features of different spatial sizes from the template features, obtains target feature information in multi-scale semantic spaces through convolution kernels of different sizes, and then uses this information as supervision to enhance the template features so that they become aware of the target's appearance. An IoU-Net is then trained offline to evaluate the accuracy of candidate boxes: a feature modulation vector learned from the target feature is applied to the candidate-box features, and the confidence score of each candidate box is obtained from the modulated features through generalization learning. Finally, after several rounds of iterative optimization, the candidate box with the highest confidence is taken as the tracking result. With the multi-scale Transformer module provided by the invention, the accuracy of the ATOM tracking method is improved to a certain extent, and the bounding box of the target can be estimated more accurately in complex scenes.

Description

Single target tracking method based on multi-scale Transformer
Technical Field
The invention belongs to the technical field of single target tracking, and particularly relates to a multi-scale Transformer feature-guided single target tracking method in a complex environment.
Background
Single-target tracking is a basic and challenging task in computer vision. Given an arbitrary object in the first frame as prior knowledge, the tracker aims to locate this target and estimate its bounding box in subsequent frames. In recent years, single-target tracking has been widely applied in fields such as unmanned aerial vehicles and intelligent video surveillance, and great progress has been made; however, continuously accumulated tracking errors can leave the tracker unable to cope with complex scenes such as deformation and occlusion. Therefore, how to accurately estimate the bounding box of the target remains an open problem.
Early single-target trackers performed bounding-box estimation with a conventional multi-scale search: the tracking result of the previous frame was taken as the reference bounding box of the current frame and evaluated at several scales. This conventional approach limits tracking accuracy when the target deforms severely in the video stream. With the development of deep learning, many high-precision tracking methods have emerged. The bounding-box estimation methods adopted by mainstream trackers today can be broadly divided into two categories: template-matching-based methods and candidate-box-evaluation-based methods. A tracker of the first kind crops an image patch containing context information, centered on the target in the first frame, as a template, extracts features of the template and of subsequent frames with a Siamese network, and learns, in a fully convolutional manner, the region most similar to the template as the tracking result. This way of estimating the bounding box greatly improves tracker accuracy and can effectively estimate the state of a deforming object. However, using the context together with the target as the template has a drawback: the position, posture, and other attributes of the target are obscured by the large amount of context information. Candidate-box-based evaluation methods were subsequently proposed to solve this problem. They also use a Siamese network to extract features, but the features of the given target in the template are explicitly modeled as prior knowledge; this prior is then propagated, through an offline-trained IoU-Net, to guide the confidence evaluation of candidate boxes, and the candidate box with the highest confidence is taken as the tracking result. Owing to the specific characterization ability of the target features, candidate-box-evaluation methods can effectively handle scenes with background interference. However, when similar distractor objects appear in the image, the tracker may still drift, because the receptive field of the convolutional neural network is far larger than the target area, so the target features are mixed with redundant information and their characterization ability is insufficient. To further improve tracking accuracy, the present method optimizes the characterization ability of the target during single-target tracking on the basis of candidate-box evaluation.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a single-target tracking method based on a multi-scale Transformer, in which the characterization ability of the target is enhanced by a multi-scale Transformer feature-enhancement technique. The invention uses ATOM as the baseline tracking method and thereby achieves more accurate tracking results.
The single target tracking method based on the multi-scale Transformer specifically comprises the following steps:
Step 1: A multi-scale Transformer module is applied to the template features extracted by the Siamese network, and target features of different scales are used as supervision information to guide template-feature enhancement, yielding the enhanced template features T'.
The specific steps are as follows:
1) Three features of different spatial sizes are cropped from the template feature map around its center; their scales are a × a, 2a × 2a and 3a × 3a respectively;
2) The features of the different spatial sizes are embedded into semantic spaces of different scales through three convolution layers that keep the channel number unchanged, and are finally flattened into 2-dimensional form; the overall flow of the multi-scale Transformer is expressed as:
K_i = V_i = Flatten(Conv_i(Crop_i(T))), i = 1, 2, 3   (1)
3) In the multi-head target-attention module, a linear convolution layer with a 1 × 1 kernel reduces the number of feature channels C of V and K to C/4, so as to accelerate the fitting of the model;
4) Taking the template features as Q, the similarity matrix A between Q and K is calculated:
A = softmax(Q·K^T / √d_k)   (2)
where the output A is the similarity matrix and d_k is the dimension of the feature K;
5) After the similarity matrix is obtained, the output feature O of a single target-attention block is calculated by the following matrix operation:
O = A*V   (3)
where * denotes matrix multiplication and O is the output feature;
6) The target-attention block is extended to multiple heads, and the template features T' enhanced by the multi-scale Transformer are obtained through summation and normalization:
T' = MultiHead(Q, K, V) = Norm(Concat(O_1, O_2, O_3)·W_o + Q)   (4)
where Norm denotes l2 normalization applied to the whole template features, and W_o is a learnable parameter matrix that adjusts the concatenated channel number 3C to the input channel number C;
Step 2: The tracking result of the previous frame is taken as the reference box of the current frame, and candidate boxes of several scales are randomly generated, with the random scaling factors of their width and height drawn from the interval [1 − α, 1 + α]; the features of the candidate boxes and the target feature in the enhanced template features T' are extracted through PrRoI Pooling.
Step 3: An IoU-Net is trained offline on public data sets; the target information is propagated to the candidate boxes, the candidate-box features are adjusted by a vector modulation method, and a confidence score is evaluated for each candidate box.
Step 4: In the online testing stage, the candidate-box positions are repeatedly fine-tuned using the gradients with respect to the candidate-box position information, iterating towards a more accurate bounding box that serves as the tracking result.
Preferably, the backbone of the Siamese network in step 1 is a ResNet-18 pre-trained on the ImageNet dataset, and its parameters are shared by the template branch and the subsequent-image branch; to suit the tracking task, the ResNet-18 with the fully connected layer removed is used as the feature extraction module, with a down-sampling rate of 16.
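For illustration only, a minimal PyTorch sketch of such a shared feature extractor is given below; taking the output of layer3 so that the overall stride is 16 is an assumption made to match the stated down-sampling rate, and the class name is hypothetical.

```python
import torch
import torchvision

class SiameseBackbone(torch.nn.Module):
    """Shared ResNet-18 feature extractor (illustrative sketch).

    Assumption: features are taken from layer3, giving an overall
    stride of 16, which matches the down-sampling rate stated above.
    """
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        # Keep everything up to and including layer3; drop layer4,
        # global average pooling and the fully connected layer.
        self.features = torch.nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3,
        )

    def forward(self, template, search):
        # The same parameters are applied to both branches.
        return self.features(template), self.features(search)

# Example: a 288x288 crop maps to an 18x18 feature map with 256 channels.
backbone = SiameseBackbone()
t_feat, s_feat = backbone(torch.randn(1, 3, 288, 288),
                          torch.randn(1, 3, 288, 288))
print(t_feat.shape)  # torch.Size([1, 256, 18, 18])
```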
Preferably, the PrRoI Pooling feature extraction process in step 2 is as follows: first, the quantization problem of region pooling is solved by interpolation, expressed by the following equations:
f(x, y) = Σ_{i,j} IC(x, y, i, j) × F_{i,j}   (5)
IC(x, y, i, j) = max(0, 1 − |x − i|) × max(0, 1 − |y − j|)   (6)
where (i, j) is a coordinate position on the feature map, F_{i,j} is the feature value at (i, j), and IC(x, y, i, j) is the interpolation coefficient; finally, the interpolated feature region is extracted by double integration:
PrPool(R, F) = (∫_{y1}^{y2} ∫_{x1}^{x2} f(x, y) dx dy) / ((x2 − x1) × (y2 − y1))   (7)
where F is the feature map and the top-left corner (x1, y1) and bottom-right corner (x2, y2) define the region R from which features are extracted.
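As a rough numerical illustration of equations (5)-(7), the following sketch evaluates the interpolation coefficient, the continuous feature map, and the bin integral on a toy feature map; the Riemann-sum grid resolution is an arbitrary choice for demonstration.

```python
import numpy as np

def interp_coeff(x, y, i, j):
    # IC(x, y, i, j) = max(0, 1 - |x - i|) * max(0, 1 - |y - j|)   (Eq. 6)
    return max(0.0, 1.0 - abs(x - i)) * max(0.0, 1.0 - abs(y - j))

def continuous_feature(F, x, y):
    # f(x, y) = sum over (i, j) of IC(x, y, i, j) * F[j, i]        (Eq. 5)
    h, w = F.shape
    return sum(interp_coeff(x, y, i, j) * F[j, i]
               for j in range(h) for i in range(w))

def prpool(F, x1, y1, x2, y2, steps=50):
    # Approximate the normalized double integral in Eq. (7) with a
    # Riemann sum: the mean of f over the bin equals integral / area.
    xs = np.linspace(x1, x2, steps)
    ys = np.linspace(y1, y2, steps)
    return float(np.mean([continuous_feature(F, x, y)
                          for y in ys for x in xs]))

F = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 feature map
print(prpool(F, 0.3, 0.7, 2.6, 3.1))           # pooled value for one bin
```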
Preferably, the public data sets in step 3 comprise TrackingNet, LaSOT and COCO; during training, two frames of the same video sequence are sampled as an input image pair of the model, and each image is a 288 × 288 region cropped around the target.
Preferably, the confidence score of each candidate box in step 3 is evaluated as follows:
1) A fully connected layer is applied to the target feature to obtain a modulation vector x:
x = φ(Flatten(F_target))   (8)
where the input is the target feature F_target ∈ R^(C×5×5), and the fully connected layer φ(·) adjusts the dimension to output x ∈ R^(C×1×1);
2) The feature of each candidate box is modulated with the vector learned from the target feature, and the adjusted feature is passed through another fully connected layer θ(·) to output the final confidence score:
s = θ(F_candidate ⊙ x)   (9)
where ⊙ denotes broadcast multiplication: the 25 feature values of each channel of the candidate-box feature F_candidate ∈ R^(C×5×5) are given the same weight x_i (i = 0, 1, 2, …, C − 1), and s is the confidence score of each candidate box.
Preferably, the iteration in step 4 proceeds as follows:
1) In the testing stage, the coordinate position of each candidate box, with top-left corner (x1, y1) and bottom-right corner (x2, y2), is adjusted online by back-propagating the gradient of the candidate-box position information; thanks to the continuity of the PrRoI Pooling feature extraction operation, the gradient obtained by back-propagation is accurate, and the derivative with respect to a coordinate, e.g. x1, is:
∂PrPool(R, F)/∂x1 = PrPool(R, F)/(x2 − x1) − (∫_{y1}^{y2} f(x1, y) dy) / ((x2 − x1) × (y2 − y1))   (10)
2) Each candidate box obtains its corresponding gradient values and updates the position and size of its bounding box according to:
(x1, y1, x2, y2) ← (x1, y1, x2, y2) + λ · ∂s/∂(x1, y1, x2, y2)   (11)
where λ is the step size;
3) Steps 1) and 2) are repeated several times, and the candidate box with the highest confidence score in the output s is selected as the tracking result.
The invention has the following beneficial effects:
1. A multi-scale Transformer module is bridged behind the Siamese network of the ATOM-based tracking framework, and targets of different scales are used as supervision information to enhance the template features; this effectively suppresses the interference of large amounts of background information, guides more accurate bounding-box evaluation, reduces tracking errors, and enables single-target tracking in complex scenes.
2. The multi-scale Transformer feature-enhancement strategy is highly general and is applicable to most trackers based on candidate-box evaluation.
Drawings
FIG. 1: a flow diagram of a single target tracking method based on candidate box evaluation;
FIG. 2: a structure diagram of the multi-scale Transformer module;
FIG. 3: a diagram of the improved ATOM tracking framework;
FIG. 4: a comparison of tracking examples.
Detailed Description
The invention is further explained below with reference to the drawings.
As shown in FIG. 1 and FIG. 3, the multi-scale Transformer-based single-target tracking method takes the ATOM tracking framework as its baseline and improves upon it; it specifically includes the following steps:
Step 1: First, a Siamese network is used to extract the image features of the template (the first frame) and of subsequent images. The backbone of the Siamese network is a ResNet-18 pre-trained on the ImageNet dataset, and its parameters are shared by the template branch and the subsequent-image branch. To suit the tracking task, the ResNet-18 with the fully connected layer removed is used as the feature extraction module.
After the template branch of the Siamese network, the template features are enhanced by the multi-scale Transformer module provided by the invention, as shown in FIG. 2. The module takes the ResNet-18 template feature T ∈ R^(C×H×W), whose spatial size is H × W and whose number of channels is C. To follow the input convention of the self-attention mechanism, the 3-dimensional template T is reshaped to Q ∈ R^(HW×C) and used as the input Q of the Multi-Head Target-Attention. The template T is also fed into a Pyramid Transformer sub-module to extract multi-scale features of the target, which supervise the enhancement of the template features.
The sub-module crops three features of different spatial sizes (4 × 4, 8 × 8 and 12 × 12) around the center of the template feature map, embeds them into semantic spaces of different scales through three convolution layers that keep the channel number unchanged, and finally flattens them into 2-dimensional form. The overall flow of the Pyramid Transformer can be expressed as:
K_i = V_i = Flatten(Conv_i(Crop_i(T))), i = 1, 2, 3   (1)
in the Multi-Head Target-orientation submodule, first 1 convolution kernel is passed through a linear convolution layer of 1 × 1 size
Figure BDA0003351788790000062
And the number C of all the characteristic channels of V and K is reduced to C/4, so that the fitting effect of the acceleration model is achieved. Then, the similarity matrix a between Q and K is calculated again using the following equation.
Figure BDA0003351788790000063
Wherein the output is
Figure BDA0003351788790000064
dkIs characterized in that
Figure BDA0003351788790000065
Of (c) is calculated. After the similarity matrix is obtained, the output characteristic O of a single target attention block is calculated by the following matrix operation.
O=A*V (3)
Wherein, multiplication of the table matrix and output
Figure BDA0003351788790000066
Finally, the Target-Attention is extended to multiple heads, and summation and normalization are applied to obtain the template feature T' enhanced by the multi-scale Transformer:
T' = MultiHead(Q, K, V) = Norm(Concat(O_1, O_2, O_3)·W_o + Q)   (4)
where Norm denotes l2 normalization applied to the whole template feature, and W_o is a learnable parameter matrix that adjusts the concatenated channel number 3C to the input channel number C.
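For illustration, the following PyTorch sketch implements one plausible reading of equations (1)-(4). The crop handling, the use of one attention head per scale, the extra projection of Q to C/4 channels (needed for dimensional compatibility with the reduced K), and the resulting W_o mapping from 3·C/4 rather than 3C channels are assumptions, not details confirmed by the text or figures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleTransformer(nn.Module):
    """Sketch of the multi-scale Transformer template enhancement."""

    def __init__(self, channels=256, crop_sizes=(4, 8, 12)):
        super().__init__()
        self.crop_sizes = crop_sizes
        reduced = channels // 4
        # Scale-specific embeddings keeping the channel number unchanged (Eq. 1).
        self.embed = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1)
             for _ in crop_sizes])
        # 1x1 convolutions reducing K and V from C to C/4 channels.
        self.k_proj = nn.ModuleList(
            [nn.Conv1d(channels, reduced, kernel_size=1) for _ in crop_sizes])
        self.v_proj = nn.ModuleList(
            [nn.Conv1d(channels, reduced, kernel_size=1) for _ in crop_sizes])
        # Assumption: Q is also projected to C/4 so Q.K^T is well defined.
        self.q_proj = nn.Linear(channels, reduced)
        # W_o maps the concatenated heads (3 * C/4 here) back to C.
        self.w_o = nn.Linear(len(crop_sizes) * reduced, channels)

    @staticmethod
    def center_crop(t, size):
        _, _, h, w = t.shape
        top, left = (h - size) // 2, (w - size) // 2
        return t[:, :, top:top + size, left:left + size]

    def forward(self, template):                     # template: (B, C, H, W)
        b, c, h, w = template.shape
        q = template.flatten(2).transpose(1, 2)      # (B, HW, C), the query
        q_red = self.q_proj(q)                       # (B, HW, C/4)
        d_k = q_red.shape[-1]
        heads = []
        for i, size in enumerate(self.crop_sizes):
            crop = self.center_crop(template, size)          # (B, C, s, s)
            feat = self.embed[i](crop).flatten(2)            # (B, C, s*s)
            k = self.k_proj[i](feat).transpose(1, 2)         # (B, s*s, C/4)
            v = self.v_proj[i](feat).transpose(1, 2)         # (B, s*s, C/4)
            attn = torch.softmax(                            # Eq. (2)
                q_red @ k.transpose(1, 2) / d_k ** 0.5, dim=-1)
            heads.append(attn @ v)                           # O_i, Eq. (3)
        out = self.w_o(torch.cat(heads, dim=-1))             # Concat(...) W_o
        out = F.normalize(out + q, dim=-1)                   # Norm(... + Q), Eq. (4)
        return out.transpose(1, 2).reshape(b, c, h, w)

# Example: enhance an 18x18 template feature map with 256 channels.
enhanced = MultiScaleTransformer()(torch.randn(1, 256, 18, 18))
```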
Step 2: This step relies on the observation that the target changes little between consecutive frames of a video sequence. The tracking result of the previous frame is used as the reference bounding box; keeping its center position unchanged, 10 candidate boxes of different sizes are randomly generated as evaluation objects. Compared with the reference bounding box, the scaling factors of the width and height of these candidate boxes are restricted to the interval [0.75, 1.25]. The PrRoI Pooling operator is then used to extract the target feature F_target from the enhanced template of step 1 and the features F_candidate of the candidate boxes.
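A minimal sketch of this candidate-generation step is given below; the uniform sampling of the scale factors and the fixed, noise-free center are assumptions consistent with the description above, and the function name is hypothetical.

```python
import torch

def generate_candidates(ref_box, num=10, low=0.75, high=1.25):
    """Randomly rescale the reference box around its fixed center (sketch).

    ref_box: (x1, y1, x2, y2) taken from the previous frame's result.
    Returns a (num, 4) tensor of candidate boxes; the uniform
    distribution of the scale factors is an assumption.
    """
    x1, y1, x2, y2 = ref_box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0       # center stays fixed
    w, h = x2 - x1, y2 - y1
    sw = torch.empty(num).uniform_(low, high)        # width scale factors
    sh = torch.empty(num).uniform_(low, high)        # height scale factors
    nw, nh = w * sw, h * sh
    return torch.stack(
        [cx - nw / 2, cy - nh / 2, cx + nw / 2, cy + nh / 2], dim=1)

candidates = generate_candidates((100.0, 80.0, 180.0, 200.0))
```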
Step 3: The features F_target and F_candidate obtained in step 2 are fed into IoU-Net, which is trained offline. The invention uses the same public data sets as ATOM for training, namely TrackingNet, LaSOT and COCO. During training, two frames of the same video sequence are sampled as an input image pair of the model, and each image is a 288 × 288 region cropped around the target. After training is complete, IoU-Net evaluates the confidence scores of the candidate boxes with its vector modulation method. The specific process is as follows: a fully connected layer is applied to the target feature to obtain a modulation vector x:
x = φ(Flatten(F_target))   (5)
where the input is the target feature F_target ∈ R^(C×5×5), and the fully connected layer φ(·) adjusts the dimension to output x ∈ R^(C×1×1).
The vector learned from the target feature is used to modulate the feature of each candidate box, and the adjusted feature is passed through another fully connected layer θ(·) to output the final confidence score:
s = θ(F_candidate ⊙ x)   (6)
where ⊙ denotes broadcast multiplication: the 25 feature values of each channel of the candidate-box feature F_candidate ∈ R^(C×5×5) are given the same weight x_i (i = 0, 1, 2, …, C − 1), and s is the confidence score of each candidate box.
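The following sketch shows one way such a modulation-based scoring head could look; the single-layer forms of φ and θ, the layer sizes, and the class name are illustrative assumptions rather than the patent's exact design.

```python
import torch
import torch.nn as nn

class ModulatedIoUHead(nn.Module):
    """Sketch of the vector-modulation confidence head (Eqs. 5 and 6).

    phi maps the flattened C x 5 x 5 target feature to one weight per
    channel; theta maps the modulated candidate feature to a scalar
    confidence score. The single-layer form and sizes are assumptions.
    """
    def __init__(self, channels=256, pooled=5):
        super().__init__()
        self.channels = channels
        self.phi = nn.Linear(channels * pooled * pooled, channels)
        self.theta = nn.Linear(channels * pooled * pooled, 1)

    def forward(self, f_target, f_candidate):
        # f_target: (1, C, 5, 5); f_candidate: (N, C, 5, 5)
        x = self.phi(f_target.flatten(1))            # modulation vector, Eq. (5)
        x = x.view(1, self.channels, 1, 1)           # one weight per channel
        modulated = f_candidate * x                  # broadcast multiplication
        return self.theta(modulated.flatten(1)).squeeze(-1)   # scores s, Eq. (6)

head = ModulatedIoUHead()
scores = head(torch.randn(1, 256, 5, 5), torch.randn(10, 256, 5, 5))
```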
Step 4: In the testing stage, the coordinate position of each candidate box, with top-left corner (x1, y1) and bottom-right corner (x2, y2), is adjusted online by back-propagating the gradient of the candidate-box position information. Thanks to the continuity of the PrRoI Pooling feature extraction operation, the gradient obtained by back-propagation is accurate, and the derivative with respect to a coordinate, e.g. x1, is:
∂PrPool(R, F)/∂x1 = PrPool(R, F)/(x2 − x1) − (∫_{y1}^{y2} f(x1, y) dy) / ((x2 − x1) × (y2 − y1))   (7)
Each candidate box obtains its corresponding gradient values and updates the position and size of its bounding box. After 5 iterations, the candidate box with the highest confidence score in the output s is selected as the tracking result.
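A minimal refinement loop in this spirit is sketched below; it uses autograd in place of the analytical derivative of equation (7), treats the scoring network as a black box, and picks an illustrative step size and iteration count.

```python
import torch

def refine_boxes(score_fn, boxes, steps=5, lr=1.0):
    """Gradient-ascent refinement of candidate boxes (sketch).

    score_fn maps an (N, 4) tensor of boxes to (N,) confidence scores,
    e.g. PrRoI Pooling of the search features followed by the
    modulation head; it is treated here as a differentiable black box.
    The plain gradient-ascent update and the step size are assumptions.
    """
    boxes = boxes.detach().clone().requires_grad_(True)
    for _ in range(steps):
        score_fn(boxes).sum().backward()        # per-box coordinate gradients
        with torch.no_grad():
            boxes += lr * boxes.grad            # move towards higher confidence
        boxes.grad.zero_()
    with torch.no_grad():
        scores = score_fn(boxes)
    best = scores.argmax()                      # highest-confidence candidate
    return boxes[best].detach(), float(scores[best])
```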
Step 5: The experimental environment of the invention is as follows: CPU: Intel Core i5-7300HQ @ 2.50 GHz; GPU: GTX 1050Ti with 4 GB of video memory; operating system: Ubuntu 18.04.5 LTS (Linux 5.4.0-81-generic); CUDA version 10.2; deep learning framework PyTorch 1.6.0. The test results on the public data set OTB100 are as follows:
table 1: performance comparison before and after ATOM tracking method improvement
Accuracy FPS
ATOM 0.655 24.35
ATOM + Multi-Scale Transgormer (Ours) 0.664 22.10
As can be seen from Table 1, the multi-scale Transformer module provided by the invention effectively improves the accuracy of the tracker at a small cost in speed. To compare the improvement visually, FIG. 4 shows an example comparison on two video sequences (Basketball and Diving) from OTB100.

Claims (6)

1. The single-target tracking method based on the multi-scale Transformer is characterized by comprising the following steps:
step 1, a multi-scale Transformer module is applied to the template features extracted by the Siamese network, and target features of different scales are used as supervision information to guide template-feature enhancement, yielding the enhanced template features T';
the specific steps are as follows:
1) three features of different spatial sizes are cropped from the template feature map around its center, their scales being a × a, 2a × 2a and 3a × 3a respectively;
2) the features of the different spatial sizes are embedded into semantic spaces of different scales through three convolution layers that keep the channel number unchanged, and are finally flattened into 2-dimensional form; the overall flow of the multi-scale Transformer is expressed as:
K_i = V_i = Flatten(Conv_i(Crop_i(T))), i = 1, 2, 3   (1)
3) in the multi-head target-attention module, a linear convolution layer with a 1 × 1 kernel reduces the number of feature channels C of V and K to C/4, so as to accelerate the fitting of the model;
4) taking the template features as Q, the similarity matrix A between Q and K is calculated:
A = softmax(Q·K^T / √d_k)   (2)
where the output A is the similarity matrix and d_k is the dimension of the feature K;
5) after the similarity matrix is obtained, the output feature O of a single target-attention block is calculated by the following matrix operation:
O = A*V   (3)
where * denotes matrix multiplication and O is the output feature;
6) the target-attention block is extended to multiple heads, and the template features T' enhanced by the multi-scale Transformer are obtained through summation and normalization:
T' = MultiHead(Q, K, V) = Norm(Concat(O_1, O_2, O_3)·W_o + Q)   (4)
where Norm denotes l2 normalization applied to the whole template features, and W_o is a learnable parameter matrix that adjusts the concatenated channel number 3C to the input channel number C;
step 2, the tracking result of the previous frame is taken as the reference box of the current frame, and candidate boxes of several scales are randomly generated, with the random scaling factors of their width and height drawn from the interval [1 − α, 1 + α]; the features of the candidate boxes and the target feature in the enhanced template features T' are extracted through PrRoI Pooling;
step 3, an IoU-Net is trained offline on public data sets; the target information is propagated to the candidate boxes, the candidate-box features are adjusted by a vector modulation method, and a confidence score is evaluated for each candidate box;
step 4, in the online testing stage, the candidate-box positions are repeatedly fine-tuned using the gradients with respect to the candidate-box position information, iterating towards a more accurate bounding box that serves as the tracking result.
2. The multi-scale Transformer-based single-target tracking method of claim 1, wherein: the backbone of the Siamese network in step 1 is a ResNet-18 pre-trained on the ImageNet dataset, and its parameters are shared by the template branch and the subsequent-image branch; to suit the tracking task, the ResNet-18 with the fully connected layer removed is used as the feature extraction module, with a down-sampling rate of 16.
3. The multi-scale Transformer-based single-target tracking method of claim 1, wherein the PrRoI Pooling feature extraction process in step 2 is: first, the quantization problem of region pooling is solved by interpolation, expressed by the following equations:
f(x, y) = Σ_{i,j} IC(x, y, i, j) × F_{i,j}   (5)
IC(x, y, i, j) = max(0, 1 − |x − i|) × max(0, 1 − |y − j|)   (6)
where (i, j) is a coordinate position on the feature map, F_{i,j} is the feature value at (i, j), and IC(x, y, i, j) is the interpolation coefficient; finally, the interpolated feature region is extracted by double integration:
PrPool(R, F) = (∫_{y1}^{y2} ∫_{x1}^{x2} f(x, y) dx dy) / ((x2 − x1) × (y2 − y1))   (7)
where F is the feature map and the top-left corner (x1, y1) and bottom-right corner (x2, y2) define the region R from which features are extracted.
4. The multi-scale Transformer-based single-target tracking method of claim 1, wherein the public data sets in step 3 comprise TrackingNet, LaSOT and COCO; during training, two frames of the same video sequence are sampled as an input image pair of the model, and each image is a 288 × 288 region cropped around the target.
5. The multi-scale Transformer-based single-target tracking method of claim 1, wherein the confidence score of each candidate box in step 3 is evaluated as follows:
1) a fully connected layer is applied to the target feature to obtain a modulation vector x:
x = φ(Flatten(F_target))   (8)
where the input is the target feature F_target ∈ R^(C×5×5), and the fully connected layer φ(·) adjusts the dimension to output x ∈ R^(C×1×1);
2) the feature of each candidate box is modulated with the vector learned from the target feature, and the adjusted feature is passed through another fully connected layer θ(·) to output the final confidence score:
s = θ(F_candidate ⊙ x)   (9)
where ⊙ denotes broadcast multiplication: the 25 feature values of each channel of the candidate-box feature F_candidate ∈ R^(C×5×5) are given the same weight x_i (i = 0, 1, 2, …, C − 1), and s is the confidence score of each candidate box.
6. The multi-scale Transformer-based single-target tracking method of claim 1, wherein the iteration in step 4 proceeds as follows:
1) in the testing stage, the coordinate position of each candidate box, with top-left corner (x1, y1) and bottom-right corner (x2, y2), is adjusted online by back-propagating the gradient of the candidate-box position information; thanks to the continuity of the PrRoI Pooling feature extraction operation, the gradient obtained by back-propagation is accurate, and the derivative with respect to a coordinate, e.g. x1, is:
∂PrPool(R, F)/∂x1 = PrPool(R, F)/(x2 − x1) − (∫_{y1}^{y2} f(x1, y) dy) / ((x2 − x1) × (y2 − y1))   (10)
2) each candidate box obtains its corresponding gradient values and updates the position and size of its bounding box according to:
(x1, y1, x2, y2) ← (x1, y1, x2, y2) + λ · ∂s/∂(x1, y1, x2, y2)   (11)
where λ is the step size;
3) steps 1) and 2) are repeated several times, and the candidate box with the highest confidence score in the output s is selected as the tracking result.
Priority and family data

Application CN202111340646.0A was filed on 2021-11-12 by Hangzhou Dianzi University, with priority date 2021-11-12, and was published as CN114140495A (status: pending) on 2022-03-04. Patent family ID: 80393732.


Cited By (2)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN117853664A * | 2024-03-04 | 2024-04-09 | 云南大学 | Three-dimensional face reconstruction method based on double-branch feature fusion
CN117853664B * | 2024-03-04 | 2024-05-14 | 云南大学 | Three-dimensional face reconstruction method based on double-branch feature fusion


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination