CN116820131A

CN116820131A - Unmanned aerial vehicle tracking method based on target perception ViT

Info

Publication number: CN116820131A
Application number: CN202310818688.3A
Authority: CN
Inventors: 李水旺; 杨向阳; 叶恒舟
Original assignee: Guilin University of Technology
Current assignee: Guilin University of Technology
Priority date: 2023-07-05
Filing date: 2023-07-05
Publication date: 2023-09-29

Abstract

The invention discloses an unmanned aerial vehicle (UAV) tracking method based on a target perception ViT, and relates to the technical field of UAV tracking. The adopted tracking framework is a single-stream framework comprising a backbone network and a prediction head. The backbone network uses DeiT-Tiny, a ViT-based model, and achieves target perception by maximizing the mutual information between the template image and its features. The prediction head has three branches, which predict the classification score, the quantization error caused by downsampling (the local offset), and the normalized bounding-box size, respectively; each branch consists of four stacked convolution-batch normalization-ReLU layers. The framework is trained on existing target tracking datasets to obtain a UAV tracking model, and the trained framework is then deployed on a UAV platform for target tracking. By designing and training this target-perception-based UAV tracking model, the invention achieves accurate, efficient, real-time tracking of targets in strong sunlight, targets whose viewing angle changes rapidly, and small distant targets.

Description

Unmanned aerial vehicle tracking method based on target perception ViT

Technical Field

The invention relates to the field of target tracking, in particular to a target tracking method for an unmanned aerial vehicle.

Background

The development of artificial intelligence has affected many industries, and some of them have changed dramatically. In the unmanned aerial vehicle (UAV) industry, many companies are using deep learning to make UAVs more intelligent, and UAV tracking is one such technology. UAV tracking is widely applied in disaster relief, traffic monitoring, environmental monitoring, power-line inspection and similar tasks. Unlike unmanned ground vehicles and unmanned ships, a UAV is constrained by its take-off weight, so the processor and battery it carries must be as light as possible. As a result, both the processing performance and the battery capacity of a UAV are limited.

A UAV tracker should have two basic qualities: 1) it must cope with severe challenges such as extreme viewing angles, motion blur and heavy occlusion; 2) it must be efficient and consume little power, given the limited battery capacity and computing resources.

Currently, the trackers most widely used on UAVs are still based on discriminative correlation filters (DCF); more recently, lightweight trackers based on convolutional neural networks (CNN) with filter pruning have also been used. DCF-based trackers are favored for their high efficiency, but they rarely achieve high tracking accuracy. CNN-based trackers, on the other hand, are known for their high accuracy, but they require substantial computing resources and are therefore poorly suited to applications that demand efficiency. To address this trade-off, researchers have introduced lightweight CNN-based trackers for UAV tracking. These trackers use filter pruning to reduce the number of network parameters, thereby achieving a markedly better balance between accuracy and efficiency.

In general visual tracking, the emerging ViT (Vision Transformer) based trackers have achieved great success: their attention mechanism captures the target location more effectively. No ViT-based tracker has yet been proposed for UAV tracking, however, probably because general-purpose ViT-based trackers have a large number of parameters and run slowly, which has discouraged much potentially beneficial exploration.

Disclosure of Invention

The invention aims to provide an unmanned aerial vehicle tracking method based on target perception ViT for real-time and efficient unmanned aerial vehicle tracking.

The technical scheme of the invention is to design and train the proposed unmanned aerial vehicle tracking model, and deploy the model to an unmanned aerial vehicle platform for target tracking so as to meet the requirements of users.

The ViT-based UAV tracking framework is shown in FIG. 1. The framework consists of a backbone network based on the target perception ViT and a prediction head.

(1) Backbone network

The backbone network performs feature learning and template-search coupling at the same time, allowing the two processes to interact. The input to the framework consists of a target template $Z$ and a search image $X$. Both are first split into patches of the same size (16 × 16) and flattened into a sequence, then tokenized by a trainable linear projection layer, yielding $K$ vectors (tokens), expressed as:

$$t^{0}_{1:K} \in \mathbb{R}^{K \times d} \tag{1}$$

where $d$ is the embedding dimension of each vector, and the sub-sequences $t^{0}_{1:K_z}$ and $t^{0}_{K_z+1:K}$ represent the template and the search image, respectively, with $K = K_z + K_x$. Let $E^{l}$ denote the $l$-th Transformer block; the vectors pass from layer $l-1$ to layer $l$ via $t^{l}_{1:K} = E^{l}(t^{l-1}_{1:K})$. The entire transformation can be expressed as:

$$t^{L}_{1:K} = \left(E^{L} \circ E^{L-1} \circ \cdots \circ E^{1}\right)\left(t^{0}_{1:K}\right) \tag{2}$$

where $\circ$ denotes the composition operation and $L$ is the total number of Transformer blocks.
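The tokenization step can be illustrated with a minimal sketch. It assumes single-channel images stored as nested lists and uses the 128 × 128 template and 256 × 256 search sizes given in the detailed description; the helper names `patchify` and `make_image` are hypothetical, and the trainable linear projection is omitted.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch x patch blocks, row-major."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            block = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            tokens.append(block)
    return tokens


def make_image(size):
    # Hypothetical test image: each pixel value encodes its own position.
    return [[r * size + c for c in range(size)] for r in range(size)]


template_tokens = patchify(make_image(128), 16)   # template Z
search_tokens = patchify(make_image(256), 16)     # search image X
sequence = template_tokens + search_tokens        # concatenated token sequence

print(len(template_tokens), len(search_tokens), len(sequence))  # 64 256 320
```

With 16 × 16 patches, the template contributes 8 × 8 = 64 vectors and the search image 16 × 16 = 256, so the backbone processes 320 vectors in a single stream.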

The core idea of the framework is that the mutual information between the template image and its features is maximized.

Let $U$ and $V$ be two random variables. The mutual information between $U$ and $V$ can be expressed as:

$$I(U; V) = D_{KL}\big(p(U, V) \,\|\, p(U)\,p(V)\big) \tag{3}$$

where $p(U, V)$ is the joint probability distribution, $p(U)$ and $p(V)$ are the marginal probability distributions, and $D_{KL}$ denotes the Kullback-Leibler (KL) divergence. In practice, however, mutual information is very difficult to estimate, because only samples are available, not the underlying distributions. We therefore learn the target perception ViT for UAV tracking using Deep InfoMax (DIM), which is based on the Jensen-Shannon divergence (JSD) instead of the KL divergence. It is expressed as:

$$\hat{I}^{(JSD)}_{\theta}(U; V) = \mathbb{E}_{p(U, V)}\big[-\mathrm{sp}(-T_{\theta}(u, v))\big] - \mathbb{E}_{p(U)p(V)}\big[\mathrm{sp}(T_{\theta}(u, v))\big] \tag{4}$$

where $T_{\theta}$ is a neural network with parameters $\theta$ and $\mathrm{sp}(x) = \log(1 + e^{x})$ is the Softplus activation function. In this framework, we maximize:

$$\hat{I}^{(JSD)}_{\theta}\big(Z; t^{L}_{Z}\big) \tag{5}$$

where $t^{L}_{Z}$ denotes the template features taken from the backbone network output. The mutual information maximization loss function is defined as follows:

$$L_{mi} = -\hat{I}^{(JSD)}_{\theta}\big(Z; t^{L}_{Z}\big) \tag{6}$$
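A minimal sketch of the Jensen-Shannon mutual information estimator used by Deep InfoMax may help here. It assumes the critic network has already scored each pair, so the inputs are plain lists of scores: `joint_scores` for matched (template, template-feature) pairs and `marginal_scores` for mismatched pairs. The function names are hypothetical. The estimate averages −sp(−T) over matched pairs and subtracts the average of sp(T) over mismatched pairs; the loss is its negative.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def jsd_mi_estimate(joint_scores, marginal_scores):
    """JSD-based mutual information estimate from critic scores.

    joint_scores: critic outputs on matched pairs (samples of the joint).
    marginal_scores: critic outputs on mismatched pairs (product of marginals).
    """
    positive = sum(-softplus(-s) for s in joint_scores) / len(joint_scores)
    negative = sum(softplus(s) for s in marginal_scores) / len(marginal_scores)
    return positive - negative


def mi_loss(joint_scores, marginal_scores):
    # Minimizing this loss maximizes the estimate.
    return -jsd_mi_estimate(joint_scores, marginal_scores)


uninformed = mi_loss([0.0, 0.0], [0.0, 0.0])   # critic cannot tell pairs apart
separated = mi_loss([5.0, 5.0], [-5.0, -5.0])  # critic separates the pairs well
print(uninformed > separated)  # True
```

A critic that scores every pair 0 gives a loss of 2·log 2, while a critic that separates matched from mismatched pairs drives the loss toward 0. This is the behaviour the target perception term rewards.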

(2) Prediction head and loss function

The prediction head is a fully convolutional network with three branches, each consisting of 4 stacked convolution-batch normalization-ReLU layers, and is used to estimate the bounding box of the target. The search-image part of the vectors output by the backbone network is extracted and reinterpreted as a 2-dimensional spatial feature map, which is fed into the prediction head. The outputs are a target classification score map $P$, a local offset map $O$, and a normalized bounding-box size map $S$, each defined over the $\frac{H_x}{p} \times \frac{W_x}{p}$ grid of patch positions (where $H_x$ and $W_x$ are the height and width of the search image, and $p$ is the side length of the patches into which the image is cut). The initial estimate of the target position is given by the maximum classification score, expressed as $(x_d, y_d) = \arg\max_{(x, y)} P_{xy}$. The predicted target bounding box is then calculated from this coarse position as:

$$(x, y, w, h) = \big(x_d + O(0, x_d, y_d),\; y_d + O(1, x_d, y_d),\; S(0, x_d, y_d),\; S(1, x_d, y_d)\big) \tag{7}$$
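The decoding step can be sketched as follows. The layout of the three maps (`score[y][x]`, `offset[k][y][x]`, `size[k][y][x]`), the square search image, and the final conversion to pixel units are assumptions made for illustration; `decode_box` is a hypothetical helper.

```python
def decode_box(score, offset, size, patch, search_size):
    """Decode (cx, cy, w, h) in pixels from the three head outputs.

    score:  score[y][x], classification score per grid cell.
    offset: offset[k][y][x], k=0 for x and k=1 for y, in grid-cell units.
    size:   size[k][y][x], k=0 for width and k=1 for height,
            normalized to the (square) search image.
    """
    rows, cols = len(score), len(score[0])
    # Coarse position: the grid cell with the highest classification score.
    y_d, x_d = max(
        ((y, x) for y in range(rows) for x in range(cols)),
        key=lambda cell: score[cell[0]][cell[1]],
    )
    # Refine with the local offset, then convert grid units to pixels (assumption).
    cx = (x_d + offset[0][y_d][x_d]) * patch
    cy = (y_d + offset[1][y_d][x_d]) * patch
    w = size[0][y_d][x_d] * search_size
    h = size[1][y_d][x_d] * search_size
    return cx, cy, w, h


grid = 4
score = [[0.0] * grid for _ in range(grid)]
score[2][1] = 0.9  # peak at x_d = 1, y_d = 2
offset = [[[0.5] * grid for _ in range(grid)], [[0.25] * grid for _ in range(grid)]]
size = [[[0.25] * grid for _ in range(grid)], [[0.5] * grid for _ in range(grid)]]

box = decode_box(score, offset, size, patch=16, search_size=64)
print(box)  # (24.0, 36.0, 16.0, 32.0)
```

The offset branch recovers the sub-cell position lost when the image is reduced to a coarse grid, which is why the classification peak alone gives only a rough location.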

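For the tracking task, we use the weighted focal loss for classification and a combination of IoU loss and $L_1$ loss for bounding-box regression. The final overall loss function is:

$$L_{overall} = L_{cls} + \lambda_{iou} L_{iou} + \lambda_{L_1} L_{1} + \lambda_{mi} L_{mi} \tag{8}$$

where $\lambda_{iou}$, $\lambda_{L_1}$ and $\lambda_{mi}$ are constants, and $L_{cls}$, $L_{iou}$, $L_{1}$ and $L_{mi}$ are the classification loss, the IoU loss, the $L_1$ loss and the mutual information maximization loss, respectively. After loading ViT weights pre-trained for image classification, the framework is trained end-to-end with the overall loss function $L_{overall}$.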
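The way the regression terms and the overall loss fit together can be sketched as below. Several points are assumptions rather than the patent's definitions: boxes are given as corner coordinates `(x1, y1, x2, y2)`, the IoU loss takes the common `1 - IoU` form, the L1 loss is averaged over coordinates, and the weights passed to `overall_loss` are placeholders, not the constants used by the inventors.

```python
def iou_loss(pred, target):
    """1 - IoU for two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], target[0]), max(pred[1], target[1])
    ix2, iy2 = min(pred[2], target[2]), min(pred[3], target[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_target = (target[2] - target[0]) * (target[3] - target[1])
    union = area_pred + area_target - inter
    return 1.0 - inter / union


def l1_loss(pred, target):
    """Mean absolute error over the box coordinates."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def overall_loss(l_cls, l_iou, l_l1, l_mi, w_iou, w_l1, w_mi):
    """Weighted sum of the four terms; the weights are caller-supplied."""
    return l_cls + w_iou * l_iou + w_l1 * l_l1 + w_mi * l_mi


pred_box = (0.0, 0.0, 2.0, 2.0)
true_box = (1.0, 1.0, 3.0, 3.0)
l_iou = iou_loss(pred_box, true_box)  # intersection 1, union 7
l_l1 = l1_loss(pred_box, true_box)    # every coordinate is off by 1

# Placeholder classification / mutual-information losses and unit weights.
total = overall_loss(0.5, l_iou, l_l1, 0.25, w_iou=1.0, w_l1=1.0, w_mi=1.0)
print(round(l_iou, 4), l_l1, round(total, 4))  # 0.8571 1.0 2.6071
```

Keeping the weights as explicit arguments makes it easy to check how strongly the target perception term influences training relative to the tracking terms.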

Drawings

FIG. 1: UAV tracking framework based on the target perception ViT

FIG. 2: Comparison of attention maps with and without target perception

FIG. 3: Visualization of prediction boxes

FIG. 4: UAV tracking test

Detailed Description

The invention relates to an unmanned aerial vehicle tracking method based on target perception ViT, which comprises the following specific steps:

(1) First, the training datasets are prepared: GOT-10k, LaSOT, COCO and TrackingNet, all of which are well known in the field of target tracking.

(2) The UAV tracking framework is created. The backbone network of the framework uses DeiT-Tiny, a ViT-based network model, and each branch of the prediction head consists of 4 stacked convolution-batch normalization-ReLU layers.

(3) The framework takes two inputs, a template and a search image, of 128 × 128 and 256 × 256 pixels respectively; input pictures are first scaled to these sizes. The batch size is set to 32. The model is trained with the AdamW optimizer, using weight decay and a preset initial learning rate. Training runs for 300 epochs of 60000 image pairs each, and the learning rate is reduced by a factor of 10 after 240 epochs. After the target perception training, the model identifies the target more accurately. The visualized attention maps are shown in FIG. 2: on the left is the original image, in the middle is the attention map obtained without target perception during training, and on the right is the attention map obtained with target perception during training. The figure shows that the attention generated by the model is more prominent on the target once target perception is added.
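The training schedule described in this step can be sketched as a step function. The initial learning rate is left as a caller-supplied `base_lr` because its value is not available here; epochs are counted from zero, which is an assumption, and `lr_at_epoch` is a hypothetical helper.

```python
TOTAL_EPOCHS = 300
PAIRS_PER_EPOCH = 60000
BATCH_SIZE = 32
DROP_EPOCH = 240
DROP_FACTOR = 10.0


def lr_at_epoch(epoch, base_lr):
    """Learning rate for a zero-based epoch index: divided by 10 from epoch 240 on."""
    if epoch < DROP_EPOCH:
        return base_lr
    return base_lr / DROP_FACTOR


iters_per_epoch = PAIRS_PER_EPOCH // BATCH_SIZE
total_iters = iters_per_epoch * TOTAL_EPOCHS

# base_lr=1.0 is used only to show the multiplier, not a real learning rate.
print(iters_per_epoch, total_iters)                    # 1875 562500
print(lr_at_epoch(239, 1.0), lr_at_epoch(240, 1.0))    # 1.0 0.1
```

With 60000 image pairs per epoch and a batch size of 32, each epoch is exactly 1875 optimizer steps, so the learning-rate drop occurs after 450000 steps.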

(4) The test datasets are prepared: DTB70, UAVDT, VisDrone2018, UAV123 and UAV123@10fps. These five datasets are existing, challenging UAV benchmarks for evaluating UAV tracking algorithms. They include videos shot under violent UAV motion, with cluttered scenes and objects, and under varying weather conditions, flying heights and camera viewing angles. The first frame of each test video is used as the template and every frame is used as a search image; the search images are fed into the framework in sequence, and the output is a prediction box for each frame. The prediction boxes can be drawn on the images for visualization, as shown in FIG. 3, where the number in the upper left corner is the frame index in the video.
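The test procedure can be sketched as a simple loop. The tracker itself is represented by a caller-supplied function `track_fn`, and `dummy_track_fn` below is a stand-in invented for illustration; neither is the trained model.

```python
def run_tracker(frames, first_box, track_fn):
    """Track through a video: the first frame is the template,
    every frame is sent in order as a search image."""
    template = (frames[0], first_box)
    results = []
    for index, frame in enumerate(frames):
        results.append((index, track_fn(template, frame)))
    return results


def dummy_track_fn(template, frame):
    # Placeholder tracker: shifts the initial box by a per-frame x displacement.
    _, (x, y, w, h) = template
    return (x + frame["dx"], y, w, h)


frames = [{"dx": 0}, {"dx": 2}, {"dx": 5}]  # hypothetical stand-ins for video frames
results = run_tracker(frames, (10, 20, 30, 40), dummy_track_fn)
print(results[-1])  # (2, (15, 20, 30, 40))
```

Each result pairs a frame index with its prediction box, which is the information needed to draw the box and the frame number on the image.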

(5) Verification on a UAV simulation platform. The platform, a typical UAV simulation platform, carries an embedded onboard processor, the Jetson Nano 2G. Our UAV tracking framework is deployed on this platform, and video captured by the actual device is used to test the tracking performance. In the test, GPU and CPU utilization were 52.7% and 18.9%, respectively, and the average speed was 43.6 FPS. We tested targets in strong sunlight, targets whose viewing angle changes rapidly, and small distant targets; the results are shown in FIG. 4.

Claims

1. The unmanned aerial vehicle tracking method based on target perception ViT is characterized by comprising the following steps of:

S1: the input of the framework comprises a target template to be tracked and a search image;

S2: the framework comprises a backbone network and a prediction head;

S3: the backbone network uses the ViT-based network model DeiT-Tiny; its input is the vectors obtained by splitting and flattening the target template and the vectors obtained by splitting and flattening the search image, the number of template vectors being 8 × 8, and its output is a feature map;

S4: the prediction head has three branches, used respectively to predict the classification score, the downsampling offset, and the normalized bounding-box size, each branch consisting of four stacked convolution-batch normalization-ReLU layers;

S5: mutual information maximization is performed between the template image before it is fed into the backbone network and the template features output by the backbone network, so as to realize target perception;

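S6: the loss function $L_{overall}$ adopted in model training is calculated by equation (1);

$$L_{overall} = L_{cls} + \lambda_{iou} L_{iou} + \lambda_{L_1} L_{1} + \lambda_{mi} L_{mi} \tag{1}$$

wherein $\lambda_{iou}$, $\lambda_{L_1}$ and $\lambda_{mi}$ are three constants, and $L_{cls}$, $L_{iou}$, $L_{1}$ and $L_{mi}$ are, respectively, the loss of the classification branch, the IoU loss of the regression branch, the $L_1$ loss of the regression branch, and the mutual information maximization loss of the target perception part;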

S7: $L_{cls}$ is defined by equation (2);

$$L_{cls} = -(1 - p_t)^{\gamma} \log(p_t) \tag{2}$$

wherein $p_t$ is the target predictive value and $\gamma$ is a modulating factor;

S8: $L_{iou}$ is defined by equation (3);

$$L_{iou} = 1 - \frac{A_{\cap}}{A_{\cup}} \tag{3}$$

wherein $A_{\cap}$ represents the area of the intersection of the prediction box and the ground-truth box, and $A_{\cup}$ represents the area of the union of the prediction box and the ground-truth box;

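S9: $L_{1}$ is defined by equation (4);

$$L_{1} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right| \tag{4}$$

wherein $y_i$ is the target value, $\hat{y}_i$ is the estimated value, and $N$ is the number of samples;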

S10: $L_{mi}$ is defined by equation (5);

$$L_{mi} = -\hat{I}^{(JSD)}\big(Z; t^{L}_{Z}\big) \tag{5}$$

wherein JSD denotes the JS divergence, $Z$ is the target template to be tracked, and $t^{L}_{Z}$ is the template feature after passing through the $L$ Transformer blocks.