CN116820131A - Unmanned aerial vehicle tracking method based on target perception ViT - Google Patents
Unmanned aerial vehicle tracking method based on target perception ViT
- Publication number
- CN116820131A
- Authority
- CN
- China
- Prior art keywords
- target
- unmanned aerial
- tracking
- aerial vehicle
- prediction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Image Analysis (AREA)
Abstract
The invention discloses an unmanned aerial vehicle tracking method based on target perception ViT and relates to the technical field of unmanned aerial vehicle tracking. The tracking framework is a single-stream framework comprising a backbone network and a prediction head. The backbone network uses DeiT-Tiny, a ViT-based model, and achieves target perception by maximizing the mutual information between the template image and its features. The prediction head has three branches, which predict the classification score, the downsampling quantization error, and the normalized bounding box size, respectively; each branch consists of four stacked convolution-batch normalization-ReLU layers. The framework is trained on existing target tracking datasets to obtain an unmanned aerial vehicle tracking model, and the trained framework is then deployed to an unmanned aerial vehicle platform for target tracking. By designing and training a target-perception-based unmanned aerial vehicle tracking model, the invention achieves accurate, efficient, real-time tracking of targets in strong sunlight, targets whose viewing angle changes rapidly, and small distant targets.
Description
Technical Field
The invention relates to the field of target tracking, in particular to a target tracking method for an unmanned aerial vehicle.
Background
With the development of artificial intelligence, many industries have been affected, and some have changed dramatically. For unmanned aerial vehicles, many companies are working to make them more intelligent through deep learning, and unmanned aerial vehicle tracking is one such technology. Unmanned aerial vehicle tracking is widely applied in disaster relief, traffic monitoring, environmental monitoring, power inspection and similar tasks. Unlike unmanned ground vehicles or unmanned ships, an unmanned aerial vehicle is limited by its take-off weight, so the processor and battery it carries must be as light as possible; both processor performance and battery capacity are therefore constrained.
An unmanned aerial vehicle tracker should possess two basic qualities: 1) it must cope with extreme challenges such as extreme viewing angles, motion blur and severe occlusion; 2) it must be efficient and consume little power under limited battery capacity and constrained computing resources.
Currently, the most widely used trackers on unmanned aerial vehicles are still based on discriminative correlation filters (DCF); more recently, lightweight trackers based on convolutional neural networks (CNN) that use filter pruning have also been adopted. DCF-based trackers are favored for their high efficiency, but they tend to struggle to reach high tracking accuracy. CNN-based trackers, on the other hand, are known for their high accuracy, but they require very large computational resources and are therefore poorly suited to efficiency-critical settings. To address this trade-off, researchers have introduced lightweight CNN-based trackers for unmanned aerial vehicle tracking. These trackers employ filter pruning to reduce the number of parameters in the network, thereby significantly improving both accuracy and efficiency.
In the field of general visual tracking, emerging ViT (Vision Transformer)-based trackers have achieved great success by using the attention mechanism, which captures target locations more effectively. In the unmanned aerial vehicle tracking field, however, no ViT-based tracker has yet been proposed, probably because general-purpose ViT-based trackers have a large number of model parameters and run slowly, which has held back much potentially beneficial exploration.
Disclosure of Invention
The invention aims to provide an unmanned aerial vehicle tracking method based on target perception ViT for real-time and efficient unmanned aerial vehicle tracking.
The technical scheme of the invention is to design and train the proposed unmanned aerial vehicle tracking model, and deploy the model to an unmanned aerial vehicle platform for target tracking so as to meet the requirements of users.
A ViT-based unmanned aerial vehicle tracking framework is shown in FIG. 1. The framework consists of a backbone network based on target perception ViT and a prediction head.
(1) Backbone network
The backbone network performs feature learning and template-search image coupling simultaneously, allowing the two processes to interact. The input to the framework contains a target template $Z$ and a search image $X$, which are first split into patches of the same size (16 × 16) and flattened into a sequence, then tokenized by a trainable linear projection layer to yield $K$ vectors, expressed as:
$$t^{0}_{1:K} \in \mathbb{R}^{K \times d} \tag{1}$$
where $d$ represents the embedding dimension of each vector, and the vector sequences $t^{0}_{1:K_z}$ and $t^{0}_{K_z+1:K}$ represent the template and the search image, respectively, with $K = K_z + K_x$. Let $E^{l}$ denote the $l$-th transformer block; the vectors are converted from layer $l-1$ to layer $l$ by $t^{l}_{1:K} = E^{l}(t^{l-1}_{1:K})$. The entire conversion process can be expressed as:
$$t^{L}_{1:K} = \mathcal{F}_{\theta}(t^{0}_{1:K}) = E^{L} \circ E^{L-1} \circ \cdots \circ E^{1}\,(t^{0}_{1:K}) \tag{2}$$
where $\circ$ represents the composition operation and $\theta$ denotes the parameters of $\mathcal{F}$, which comprises $L$ transformer blocks in total.
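As a rough illustration of the tokenization step, the sketch below counts how many vectors the template and the search image contribute when both are split into 16 × 16 patches. The input sizes 128 × 128 and 256 × 256 are the ones given in the detailed description; the helper name `num_patches` is hypothetical and not part of the patent.

```python
def num_patches(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    per_side = image_size // patch_size
    return per_side * per_side


# Sizes taken from the detailed description: template 128 x 128, search 256 x 256.
template_tokens = num_patches(128)               # 8 x 8 patches
search_tokens = num_patches(256)                 # 16 x 16 patches
total_tokens = template_tokens + search_tokens   # K = K_z + K_x
```

Under these assumptions the template yields 64 vectors (matching the 8 × 8 stated in the claims) and the search image yields 256, so the backbone processes 320 vectors in total.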
The core idea of the framework is that the mutual information between the template image and its features is maximized.
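Let $U$ and $V$ be two random variables; then the mutual information between $U$ and $V$ can be expressed as:
$$I(U;V) = D_{KL}\big(p(u,v)\,\|\,p(u)\,p(v)\big) \tag{3}$$
where $p(u,v)$ represents the joint probability distribution, $p(u)$ and $p(v)$ represent the marginal probability distributions, and $D_{KL}$ represents the Kullback-Leibler divergence (commonly abbreviated KL divergence). In practice, however, mutual information is very difficult to estimate, since we have access to samples but not to the underlying distributions. We therefore learn the target perception ViT for unmanned aerial vehicle tracking using Deep InfoMax (DIM), which is based on the Jensen-Shannon divergence (JSD) instead of the KL divergence. It is expressed as:
$$\hat{I}^{(JSD)}_{\omega}(U;V) = \mathbb{E}_{p(u,v)}\big[-\mathrm{sp}(-T_{\omega}(u,v))\big] - \mathbb{E}_{p(u)p(v)}\big[\mathrm{sp}(T_{\omega}(u,v))\big] \tag{4}$$
where $T_{\omega}$ is a neural network with parameters $\omega$, and $\mathrm{sp}(z) = \log(1 + e^{z})$ is the Softplus activation function. In this framework, the estimator is applied to the template and its features:
$$\hat{I}^{(JSD)}_{\omega}\big(Z;\, t^{L}_{Z}\big) \tag{5}$$
where $t^{L}_{Z} = t^{L}_{1:K_z}$ represents the template features taken from the backbone network output. The mutual information maximization loss function is defined as follows:
$$\mathcal{L}_{MI} = -\hat{I}^{(JSD)}_{\omega}\big(Z;\, t^{L}_{Z}\big) \tag{6}$$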
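The JSD-based estimator used by Deep InfoMax can be sketched in a few lines. The example below is a minimal illustration, not the patented implementation: it assumes a critic network has already produced scalar scores for matched pairs (a template with its own features) and for mismatched pairs (a template with features of another sample). The function names and the way scores are supplied are hypothetical.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def jsd_mi_estimate(joint_scores, marginal_scores) -> float:
    """JSD-based mutual information estimate from critic scores.

    joint_scores: critic outputs on matched pairs (samples of the joint).
    marginal_scores: critic outputs on mismatched pairs (product of marginals).
    """
    positive = sum(-softplus(-t) for t in joint_scores) / len(joint_scores)
    negative = sum(softplus(t) for t in marginal_scores) / len(marginal_scores)
    return positive - negative


def mi_loss(joint_scores, marginal_scores) -> float:
    """Loss to minimize: the negative of the estimate, so MI is maximized."""
    return -jsd_mi_estimate(joint_scores, marginal_scores)
```

The estimate is never positive: an uninformative critic that scores everything 0 gives -2 log 2, and the value approaches 0 as the critic separates matched from mismatched pairs, which is why minimizing `mi_loss` pushes the template features to stay informative about the template.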
(2) Prediction head and loss function
The prediction head is a fully convolutional network with three branches, each consisting of 4 stacked convolution-batch normalization-ReLU layers, and is used to estimate the bounding box of the target. The search-image part $t^{L}_{K_z+1:K}$ is taken from the vectors output by the backbone network and re-interpreted as a 2-dimensional spatial feature map, which is fed into the prediction head. The outputs are a target classification score $P \in [0,1]^{\frac{H_x}{p} \times \frac{W_x}{p}}$, a local offset $O \in [0,1)^{2 \times \frac{H_x}{p} \times \frac{W_x}{p}}$, and a normalized bounding box size $S \in [0,1]^{2 \times \frac{H_x}{p} \times \frac{W_x}{p}}$ (where $H_x$ and $W_x$ represent the height and width of the search image, respectively, and $p$ represents the side length of the patches into which the image is split). The initial estimate of the location is determined by the maximum classification score, expressed as $(x_d, y_d) = \arg\max_{(x,y)} P(x,y)$. The predicted target bounding box is then calculated from this coarse position as:
$$(x, y, w, h) = \big(x_d + O(0, x_d, y_d),\; y_d + O(1, x_d, y_d),\; S(0, x_d, y_d),\; S(1, x_d, y_d)\big) \tag{7}$$
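The decoding step described above (pick the grid cell with the highest classification score, refine it with the local offset, read the box size at that cell) can be sketched as follows. This is an illustrative sketch only: the nested-list layout of the three maps and the division of the centre by the grid size to obtain normalized coordinates are assumptions, and `decode_box` is a hypothetical name.

```python
def decode_box(score, offset, size):
    """Decode one box from the three head outputs.

    score:  H x W nested list of classification scores.
    offset: 2 x H x W nested list, (x, y) sub-cell offsets.
    size:   2 x H x W nested list, normalized (w, h).
    Returns (x, y, w, h) with the centre normalized to [0, 1) (assumption).
    """
    rows, cols = len(score), len(score[0])
    # Coarse position: the grid cell with the highest classification score.
    y_d, x_d = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: score[rc[0]][rc[1]],
    )
    # Refine the coarse cell index with the predicted local offset.
    x = (x_d + offset[0][y_d][x_d]) / cols
    y = (y_d + offset[1][y_d][x_d]) / rows
    # Width and height are read directly at the coarse position.
    w = size[0][y_d][x_d]
    h = size[1][y_d][x_d]
    return x, y, w, h
```

The offset branch exists because the score map is 16 times coarser than the search image; without it, the predicted centre could only land on patch-grid positions.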
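For the tracking task, we use a weighted focal loss for classification and a combination of IoU loss and $L_1$ loss for bounding box regression. The final total loss function is:
$$\mathcal{L}_{overall} = \mathcal{L}_{cls} + \lambda_{iou}\,\mathcal{L}_{iou} + \lambda_{L_1}\,\mathcal{L}_{1} + \lambda_{MI}\,\mathcal{L}_{MI} \tag{8}$$
where $\lambda_{iou}$, $\lambda_{L_1}$ and $\lambda_{MI}$ are constants. After loading ViT weights pre-trained for image classification, our framework is trained end to end with the overall loss function $\mathcal{L}_{overall}$.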
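To make the regression terms and the weighted sum concrete, here is a minimal sketch. It assumes boxes in corner format `(x1, y1, x2, y2)`, uses the common `1 - IoU` form of the IoU loss and a mean absolute difference for the L1 term, and attaches the three constants to the IoU, L1 and mutual-information terms; these choices and all function names are assumptions, and no weight values are implied.

```python
def iou_loss(pred, truth) -> float:
    """1 - IoU for two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(pred[2], truth[2]) - max(pred[0], truth[0]))
    inter_h = max(0.0, min(pred[3], truth[3]) - max(pred[1], truth[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_truth = (truth[2] - truth[0]) * (truth[3] - truth[1])
    union = area_pred + area_truth - inter
    return 1.0 - inter / union if union > 0 else 1.0


def l1_loss(pred, truth) -> float:
    """Mean absolute difference between predicted and target coordinates."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)


def total_loss(l_cls, l_iou, l_l1, l_mi, w_iou, w_l1, w_mi) -> float:
    """Classification loss plus weighted IoU, L1 and mutual-information terms."""
    return l_cls + w_iou * l_iou + w_l1 * l_l1 + w_mi * l_mi
```

Combining IoU and L1 is a common pairing: the IoU term is scale-invariant but gives no gradient signal about distance when boxes do not overlap, while the L1 term always does.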
Drawings
FIG. 1: Unmanned aerial vehicle tracking framework based on target perception ViT
FIG. 2: Comparison of attention maps with and without target perception
FIG. 3: Prediction box visualization
FIG. 4: Unmanned aerial vehicle tracking test
Detailed Description
The invention relates to an unmanned aerial vehicle tracking method based on target perception ViT, which comprises the following specific steps:
(1) First, the training datasets are prepared: GOT-10k, LaSOT, COCO and TrackingNet, all of which are well known in the field of target tracking.
(2) The unmanned aerial vehicle tracking framework is created. The backbone network in the framework uses DeiT-Tiny, a ViT-based network model, and each branch of the prediction head consists of 4 stacked convolution-batch normalization-ReLU layers.
(3) The framework takes two inputs, a template and a search image; the input pictures are scaled to 128 × 128 and 256 × 256, respectively. The batch size is set to 32. The model is trained with the AdamW optimizer, using weight decay and a preset initial learning rate. Training runs for 300 epochs of 60,000 image pairs each, and the learning rate is reduced by a factor of 10 after 240 epochs. After our target perception training, the model recognizes the target more accurately. The visualized attention maps are shown in FIG. 2: the left image is the original, the middle is the attention map obtained without target perception during training, and the right is the attention map obtained with target perception during training. As the figure shows, the attention produced by the model is more prominent once target perception is added.
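The training schedule in this step can be summarized by a small sketch. The batch size, pairs per epoch, epoch count and the tenfold reduction after 240 epochs come from the text; the base learning rate is not recoverable from the source, so it is left as a caller-supplied parameter, and counting epochs from zero is an assumption. The function name `learning_rate` is hypothetical.

```python
BATCH_SIZE = 32
PAIRS_PER_EPOCH = 60000
EPOCHS = 300
DROP_EPOCH = 240

# Optimizer steps per epoch when every image pair is seen once.
iterations_per_epoch = PAIRS_PER_EPOCH // BATCH_SIZE


def learning_rate(epoch: int, base_lr: float, factor: float = 10.0) -> float:
    """Step schedule: base_lr for epochs 0..239, base_lr / factor afterwards.

    base_lr is a placeholder supplied by the caller; the patent's value is
    not available in this text.
    """
    return base_lr if epoch < DROP_EPOCH else base_lr / factor
```

With these numbers each epoch amounts to 1875 optimizer steps, and the final 60 of the 300 epochs run at the reduced rate.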
(4) The test datasets are prepared: DTB70, UAVDT, VisDrone2018, UAV123 and UAV123@10fps. These five datasets are existing, challenging unmanned aerial vehicle benchmarks for evaluating unmanned aerial vehicle tracking algorithms; they contain videos captured under intense drone motion, in cluttered scenes with varied objects, and across different weather conditions, flying heights and camera viewing angles. For each test video, the first frame is taken as the template and each frame as a search image; the search images are fed into the framework in sequence, and the output is a prediction box for each frame. The prediction boxes can be drawn on the images for visualization, as shown in FIG. 3, where the number in the upper left corner is the frame index within the video.
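The per-video evaluation procedure described in this step reduces to a simple loop. The sketch below is only a schematic: `model` stands for any callable that maps a (template, search image) pair to a box, and both that interface and the name `track_video` are hypothetical. Cropping around the target, which a practical tracker would perform, is omitted because the text does not describe it.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]


def track_video(frames: Sequence, model: Callable[[object, object], Box]) -> List[Box]:
    """Use the first frame as the template and every frame as a search image."""
    if not frames:
        return []
    template = frames[0]
    # One prediction box per frame, in frame order.
    return [model(template, frame) for frame in frames]
```

The returned list has one box per frame, which is the form needed both for drawing the boxes on the frames and for computing benchmark metrics.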
(5) Verification is carried out on an unmanned aerial vehicle simulation platform. The platform is equipped with an embedded onboard processor, Jetson Nano 2G, and is a typical unmanned aerial vehicle simulation platform. Our unmanned aerial vehicle tracking framework is deployed on this platform, and video is then shot with a real unmanned aerial vehicle to test the tracking performance. During the test, GPU and CPU utilization were 52.7% and 18.9%, respectively, and the average speed was 43.6 FPS. We tested targets in strong sunlight, targets with rapidly changing viewing angles, and small distant targets; the test results are shown in FIG. 4.
Claims (1)
1. The unmanned aerial vehicle tracking method based on target perception ViT is characterized by comprising the following steps of:
S1: the input of the framework comprises a target template to be tracked and a search image;
S2: the framework comprises a backbone network and a prediction head;
S3: the backbone network uses the ViT-based network model DeiT-Tiny; its input is the $K_z$ vectors obtained by segmenting and flattening the target template and the $K_x$ vectors obtained from the search image, where the number of template vectors is 8 × 8, and its output is the feature map;
S4: the prediction head has three branches, which predict the classification score, the downsampling offset, and the normalized bounding box size, respectively, and each branch consists of four stacked convolution-batch normalization-ReLU layers;
S5: mutual information maximization is performed between the template image before it is fed into the backbone network and the template features output by the backbone network, so as to achieve target perception;
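S6: the loss function $\mathcal{L}_{overall}$ used in model training is calculated by equation (1);
$$\mathcal{L}_{overall} = \mathcal{L}_{cls} + \lambda_{iou}\,\mathcal{L}_{iou} + \lambda_{L_1}\,\mathcal{L}_{1} + \lambda_{MI}\,\mathcal{L}_{MI} \tag{1}$$
wherein $\lambda_{iou}$, $\lambda_{L_1}$ and $\lambda_{MI}$ are three constants, and $\mathcal{L}_{cls}$, $\mathcal{L}_{iou}$, $\mathcal{L}_{1}$ and $\mathcal{L}_{MI}$ are, respectively, the loss of the classification branch, the IoU loss of the regression branch, the $L_1$ loss of the regression branch, and the mutual information maximization loss of the target perception part;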
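S7: $\mathcal{L}_{cls}$ is defined by equation (2);
$$\mathcal{L}_{cls} = -(1 - p_t)^{\gamma}\,\log(p_t) \tag{2}$$
wherein $p_t$ is the predicted value for the target and $\gamma$ is a modulating factor;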
S8: $\mathcal{L}_{iou}$ is defined by equation (3);
$$\mathcal{L}_{iou} = 1 - \frac{I}{U} \tag{3}$$
wherein $I$ represents the area of the intersection of the prediction box and the ground-truth box, and $U$ represents the area of their union;
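S9: $\mathcal{L}_{1}$ is defined by equation (4);
$$\mathcal{L}_{1} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{4}$$
wherein $y_i$ is the target value, $\hat{y}_i$ is the estimated value, and $n$ is the number of samples;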
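S10: $\mathcal{L}_{MI}$ is defined by equation (5);
$$\mathcal{L}_{MI} = -\hat{I}^{(JSD)}\big(Z;\, t^{L}_{Z}\big) \tag{5}$$
wherein JSD represents the JS divergence, $Z$ is the target template to be tracked, and $t^{L}_{Z}$ is the template feature after passing through the $L$-layer transformer blocks.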
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310818688.3A CN116820131A (en) | 2023-07-05 | 2023-07-05 | Unmanned aerial vehicle tracking method based on target perception ViT |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310818688.3A CN116820131A (en) | 2023-07-05 | 2023-07-05 | Unmanned aerial vehicle tracking method based on target perception ViT |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116820131A (en) | 2023-09-29 |
Family
ID=88125706
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310818688.3A Pending CN116820131A (en) | 2023-07-05 | 2023-07-05 | Unmanned aerial vehicle tracking method based on target perception ViT |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116820131A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117893574A (en) * | 2024-03-14 | 2024-04-16 | Dalian University of Technology (大连理工大学) | Infrared unmanned aerial vehicle target tracking method based on correlation filtering convolutional neural network |
2023
- 2023-07-05 CN CN202310818688.3A patent/CN116820131A/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||