CN114972439A - Novel target tracking algorithm for unmanned aerial vehicle - Google Patents

Novel target tracking algorithm for unmanned aerial vehicle

Info

Publication number
CN114972439A
CN114972439A (application CN202210695272.2A)
Authority
CN
China
Prior art keywords
feature
attention
network
encoder
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210695272.2A
Other languages
Chinese (zh)
Inventor
赵津
孙念怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou University
Original Assignee
Guizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou University
Priority to CN202210695272.2A
Publication of CN114972439A
Legal status: Pending (current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/20 - Analysis of motion
    • G06T 7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/42 - Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 - Road transport of goods or passengers
    • Y02T 10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a novel unmanned aerial vehicle target tracking algorithm that fully exploits global information through a feature correlation network based on a self-attention mechanism. The method effectively fuses the features of the search region and the template, reduces the influence of external interference, and improves the accuracy and robustness of the tracking algorithm. Global spatio-temporal features are obtained through a learned query embedding and a temporal update strategy for prediction, enhancing adaptability to rapid changes in the appearance of the target object. To meet onboard running-speed requirements, the proposed method uses no region proposals or predefined anchors and needs no post-processing step, and the whole method is end-to-end.

Description

Novel target tracking algorithm for unmanned aerial vehicle
Technical Field
The invention relates to the technical field of intelligent design of unmanned aerial vehicles, in particular to a novel target tracking algorithm of an unmanned aerial vehicle.
Background
Unmanned Aerial Vehicles (UAVs) are one of the main types of unmanned aircraft systems, and their range of missions is diverse and expanding. In recent years, UAVs have been widely applied in military and civil domains such as environmental monitoring, search and rescue, and autonomous positioning. Target tracking is one of the important subjects of UAV research, and there are currently two main approaches: methods based on Correlation Filtering (CF) and methods based on deep learning. However, because of the inherent flight altitude and flight jitter, target tracking from the UAV viewpoint suffers from small target size, blurred boundaries and low resolution, which seriously affect the integration and matching between the template and the search region. Most existing trackers use correlation to integrate the information of the template and the search region into a region of interest; this is a local linear matching process that mainly deals with local regions. Because global information is lacking in most cases, feature fusion between the template and the search region is poor, which degrades the accuracy and robustness of target tracking. Therefore, stable and effective tracking of an agile target in a complex and uncertain environment remains a challenge for UAV target tracking.
The CF-based tracker treats tracking as a classification problem and converts circular correlation and convolution in the spatial domain into element-wise multiplication in the frequency domain by means of the discrete Fourier transform. This strategy remarkably improves the running speed of CF-based trackers on a single CPU, reaching several hundred frames per second (fps), which satisfies the real-time requirement of UAVs. Thus, in the past few years, CF-based trackers on UAVs have expanded rapidly and been widely used. Huang Z. proposed using the response map generated during the detection phase to form a learning constraint; this strategy enhances the robustness and accuracy of tracking objects. Fu C. designed a new multi-kernel correlation tracking framework that uses image quality measurements and context information to construct adaptive interference sources, thereby improving robustness. Although many scholars have made outstanding contributions to improving the robustness and accuracy of CF trackers, most of these methods update the model using only information from the latest frame, so their knowledge of historical information is limited. Furthermore, correlation-based operations only evaluate similarity on local features, and the appearance of the target object may change over time, leading to tracking drift when fast motion or occlusion occurs.
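By way of illustration of this frequency-domain trick (and not as part of the disclosed method), the following minimal NumPy sketch shows how circular correlation between a filter and an image patch reduces to element-wise multiplication after the FFT; all variable names are illustrative only.

```python
import numpy as np

def cf_response(patch, filt):
    """Circular correlation of `patch` with `filt` via the FFT.

    Equivalent to sliding `filt` over `patch` with circular boundary
    conditions, but costs O(N log N) instead of O(N^2).
    """
    F_patch = np.fft.fft2(patch)
    F_filt = np.fft.fft2(filt, s=patch.shape)
    # spatial-domain correlation == conj(filter) * patch in the frequency domain
    return np.real(np.fft.ifft2(np.conj(F_filt) * F_patch))

# toy usage: the peak of the response map indicates the relative shift of the target
patch = np.random.rand(64, 64)
filt = np.roll(patch, shift=(5, 3), axis=(0, 1))  # shifted copy of the patch
resp = cf_response(patch, filt)
print(np.unravel_index(resp.argmax(), resp.shape))
```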
With the development of Convolutional Neural Networks (CNNs) and the increase in computing power, learning-based methods have become more and more common in the tracking field. The Siamese network enables end-to-end learning and opened the way for deep learning methods to gradually surpass CF. SiamFC is a pioneering work that combines a naive feature correlation with the Siamese framework and uses an offline end-to-end training strategy to avoid updating parameters online. Guo et al. proposed a dynamic Siamese network (DSiam) without an anchor structure based on SiamFC, in which a fast transformation learning model can effectively handle changes in object appearance. Li et al. proposed the Siamese region proposal network (SiamRPN), which combines the Siamese network with a region proposal network (RPN) and uses depth-wise correlation for feature fusion to obtain more accurate tracking results. In addition, many researchers have made further improvements, such as adding additional branches, building deeper architectures, and developing anchor-free architectures. Although Siamese-based trackers achieve surprisingly good results, the UAV cannot support the high complexity of the correlation operations in these networks or the various post-processing steps used to select the best bounding box as the tracking result. Common post-processing includes cosine windows, scale or aspect-ratio penalties, bounding-box smoothing, and so on. Post-processing yields better results but makes performance sensitive to hyper-parameters. Some trackers attempt to simplify the tracking pipeline, but their performance drops significantly.
Disclosure of Invention
The invention aims to provide a novel unmanned aerial vehicle target tracking algorithm that fully exploits global information through a feature correlation network based on a self-attention mechanism. The method effectively fuses the features of the search region and the template, reduces the influence of external interference, and improves the accuracy and robustness of the tracking algorithm. Global spatio-temporal features are obtained through a learned query embedding and a temporal update strategy for prediction, enhancing adaptability to rapid changes in the appearance of the target object. To meet onboard running-speed requirements, the proposed method uses no region proposals or predefined anchors and needs no post-processing step, and the whole method is end-to-end, overcoming the deficiencies of the prior art.
In order to achieve this purpose, the invention provides the following technical scheme: a new unmanned aerial vehicle target tracking algorithm that utilizes the principle of the self-attention mechanism; the self-attention mechanism scans each element in the sequence and updates it by aggregating information from the whole sequence, thereby focusing on the relationships within the global information and enabling long-range interactions.
As a further scheme of the invention: the algorithm includes an attention-based tracking framework; the tracking framework comprises a feature extraction network, a feature correlation network, a temporal update strategy and a prediction head;
the feature extraction network is used for capturing the change of the target appearance over time and providing additional temporal information;
the feature correlation network learns the relationships between the inputs by means of an attention mechanism, predicts the spatial position of the target object, and effectively captures the features of the template and the region of interest with richer feature-map associations;
the temporal update strategy is used for capturing the change of the target object over time, enhancing robustness;
the prediction head is used for estimating the target object in the current frame.
As a further scheme of the invention: the specific algorithm of the feature extraction network is to take three groups of image patches as input, namely a template image z ∈ R^(3×H_z×W_z), a search-region image x ∈ R^(3×H_x×W_x), and an additional dynamically updated template image z_d ∈ R^(3×H_z×W_z), which is used for capturing the change of the target appearance over time and providing additional temporal information; ResNet is adopted as the backbone for feature extraction, with the last stage and the fully connected layer of ResNet removed; after passing through the backbone, the templates z, z_d and the search-region image x are mapped to three feature maps f_z, f_d and f_x, each with C channels and a spatial resolution reduced by the backbone stride.
as a further scheme of the invention: the specific algorithm of the feature related network is based on an attention mechanism to expand the range of the feature related network, obtain global feature information and avoid falling into local optimum, so that the remote feature capturing capability is enhanced;
the feature correlation network comprises an encoder (encoder), a decoder (decoder) and feature cross-correlation (FCC);
as a further scheme of the invention: the input of the attention mechanism is a query matrix Q, a key matrix K and a value matrix V, and an attention function based on a scale dot product is defined as an equation (1);
Figure BDA0003699293500000041
wherein D is k Representing the built dimension; softmax multiplies by row; dot product of query sum key divided by
Figure BDA0003699293500000042
To alleviate the gradient disappearance problem of the softmax function; when Q ═ K ═ V, the attention mechanism becomes a self-attention mechanism.
As a further scheme of the invention: the attention mechanism employs multiple heads to attend to different aspects of the information, i.e. to obtain information from different representation subspaces at different positions; Q, K and V are projected into multiple linear subspaces for the attention computation of equation (1), and the attention results in each subspace are concatenated to obtain multi-head attention (MA), defined as equation (2):

MA(Q, K, V) = Concat(H_1, …, H_h)·W^O,  H_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)    (2)

where W_i^Q, W_i^K, W_i^V and W^O are the parameter matrices of the projections, and h is the number of heads.
As a further scheme of the invention: the encoder and the decoder are both stacks of N identical layers;
each encoder layer comprises a multi-head self-attention (MSA) module and a feed-forward network (FFN) sub-module, and the output of each module employs a residual connection and layer normalization (LN);
each decoder layer comprises a third sub-module, which takes the enhanced feature sequence from the encoder as input to perform multi-head attention; residual connections and layer normalization are employed around each sub-layer;
as a further scheme of the invention: the input of the encoder adopts a preprocessing mode, namely a bottleneck layer is adopted to reduce the number of channels from C to d, and then three groups of characteristic mappings are compressed and flattened along the spatial dimension to generate the length of the three groups of characteristic mappings
Figure BDA0003699293500000051
And d dimension.
As a further scheme of the invention: the mathematical process of the encoder and the decoder is as follows:

Encoder (layer n):
I_1 = Flatten(f_z, f_d, f_x)
I_n = I_n + PE
I_n′ = LN(I_n + MSA(I_n))
I_(n+1) = LN(I_n′ + FFN(I_n′))
……

Decoder (layer n):
T_n = T_n + PE
T_n′ = LN(T_n + MSA(T_n))
I_N′ = I_N + PE
T_n″ = LN(T_n′ + MA(T_n′, I_N′, I_N))
T_(n+1) = LN(T_n″ + FFN(T_n″))
……
Output: T_N

where n denotes the n-th layer, n ∈ [0, 1, 2, …, N], N denotes the total number of layers, and I_N and T_N denote the outputs of the last layer of the encoder and the decoder, respectively; sinusoidal position embeddings, denoted position encoding (PE), are added to the input sequence, as shown in equation (3):

PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))    (3)

where pos denotes each position in the sequence, pos ∈ [0, max_sequence_length), and i ∈ [0, d/2) indexes the dimensions of the vector.
The above algorithm captures the features of all elements in the sequence and uses global context information to enhance the original features so as to perform global reasoning on the target, enabling the target query to attend to all positions of the template and search-region features for the final bounding-box prediction.
As a further scheme of the invention: the specific algorithm of the feature cross-correlation (FCC) is as follows: the template sequence f_z^enc and the search-region sequence f_x^enc output by the encoder in the feature correlation network are fed as two branches into two feature cross-correlation (FCC) modules; each module receives the two inputs simultaneously, multi-head self-attention first adapts the attention of each branch to its useful information as a feature map, multi-head cross-attention then fuses the two feature maps, and after this process is repeated M times the template cross-map f_z^cc and the search-region cross-map f_x^cc are output; spatial position encoding is added to the inputs, and the mathematical process can be summarized as:

f̃_z = f_z^enc + PE,  f̃_x = f_x^enc + PE
f_z′ = LN(f̃_z + MSA(f̃_z)),  f_x′ = LN(f̃_x + MSA(f̃_x))
f_z″ = LN(f_z′ + MA(f_z′, f_x′, f_x′)),  f_x″ = LN(f_x′ + MA(f_x′, f_z′, f_z′))
…… (repeated M times)
Output: f_z^cc, f_x^cc
the feature correlation network performs fusion and correlation of the template and the search region by means of the cross-attention operation, focuses on the boundary information of the target object of interest, and deepens the feature understanding between the template and the search region.
As a further scheme of the invention: the temporal update strategy adds temporal information to the spatial information to acquire the latest state of the target object for tracking; specifically, a dynamically updated template is added as an input to the encoder in the feature correlation network to capture updates from intermediate frames, and the encoder extracts spatio-temporal features by modeling the global relationships among all elements in the spatial and temporal dimensions, thereby effectively fusing the two types of information and enhancing robustness.
As a further scheme of the invention: the control of the temporal update is realized by a score head; the score head is a three-layer perceptron followed by a sigmoid activation, and a threshold τ is set; when the score is higher than the threshold, the search region is regarded as containing the target, and the dynamic template is updated at a set frame interval.
As a further scheme of the invention: the prediction head is designed using a corner-based method; the output of the feature correlation network and the output of the decoder are multiplied element-wise to enhance important regions and weaken low-resolution regions; the new feature sequence is reshaped into a feature map f ∈ R^(d × H_x/s × W_x/s) and then fed into a fully convolutional network (FCN); the FCN is composed of L Conv-BN-ReLU layers and outputs two probability maps P_tl(x, y) and P_br(x, y) for the top-left and bottom-right corners of the object bounding box, respectively; finally, the expected values of the corner probability distributions are computed to obtain the predicted box coordinates (x̂_tl, ŷ_tl) and (x̂_br, ŷ_br), as shown in equation (4):

(x̂_tl, ŷ_tl) = (Σ_y Σ_x x·P_tl(x, y), Σ_y Σ_x y·P_tl(x, y)),
(x̂_br, ŷ_br) = (Σ_y Σ_x x·P_br(x, y), Σ_y Σ_x y·P_br(x, y))    (4)
as a further scheme of the invention: the whole tracking frame is subjected to defined training in a loss function mode, and the training process comprises two steps;
step one, the whole network except the fraction head in the time updating strategy is trained and positioned end to end, namely the model is allowed to learn the positioning capability. Bonding of
Figure BDA0003699293500000075
Loss and generalized IoU loss the loss function can be written as equation 5;
Figure BDA0003699293500000076
wherein, b i
Figure BDA0003699293500000081
Values representing true bounding box values and predicted bounding boxes; lambda [ alpha ] iou
Figure BDA0003699293500000082
Is a hyper-parameter;
step two, optimizing a fractional head by adopting binary cross entropy loss, as shown in equation 6;
L ce =y i log(P i )+(1-y i )log(1-P i ) (6)
wherein, y i Is a real label of the bounding box, P i Is the confidence score, all other parameters are frozen, avoiding affecting localization ability;
after the final model is trained in two stages, the positioning and classification capabilities can be learned simultaneously.
Compared with the prior art, the invention has the following beneficial effects:
1. The framework of the present invention combines the spatial and temporal dimensions to address the challenging problems in aerial tracking. The framework mainly comprises a feature extraction network, a feature correlation network, a temporal update strategy and a prediction head, giving higher performance and a more straightforward architecture.
2. The invention adds a feature correlation network, which comprises an attention module for capturing global information and a feature enhancement and fusion module based on cross-attention. The network pays more attention to valuable information, such as edges and similar objects, and better understands the relationships between global features rather than mere local correlation. To capture the change of the target over time, an updated template is designed, which adds temporal information to the attention; meanwhile, a score head is learned to control the updating of the dynamic template image.
3. The invention adopts a corner-based prediction head; the whole method is end-to-end and needs no post-processing steps such as cosine windows and bounding-box smoothing. This strategy greatly simplifies the existing tracking pipeline and makes the tracker practical to run on an unmanned aerial vehicle.
4. Test results on multiple benchmarks show that the proposed tracker has significant performance advantages, particularly on the large-scale aerial datasets UAV123 and UAVDT. Furthermore, the tracker runs at approximately 42 FPS on a GPU, achieving real-time speed.
Drawings
FIG. 1 is a schematic flow chart of the operation of the tracking framework of the present invention;
FIG. 2 is a schematic diagram of an encoder-decoder according to the present invention;
FIG. 3 is a schematic diagram of a characteristic cross-correlation (FCC) module configuration in accordance with the present invention;
FIG. 4 is a schematic view of the structure of the fractional head of the present invention;
FIG. 5(a) is a first diagram of the tracking performance tested with TransUAV in a real scene (under occlusion and motion jitter);
FIG. 5(b) is a diagram of the tracking performance tested with TransUAV in a real scene (stable tracking);
FIG. 5(c) is a diagram of the tracking performance tested with TransUAV in a real scene (stable tracking).
Detailed Description
The technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings and table parameters; it should be apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the embodiments given herein without making any creative effort shall fall within the protection scope of the present invention.
The embodiment provides a Transformer-based unmanned aerial vehicle target tracking architecture, called TransUAV; as shown in FIG. 1, the architecture comprises four modules: a feature extraction network, a feature correlation network, a temporal update strategy and a prediction head. FIG. 1 shows the workflow of TransUAV during tracking.
One, feature extraction network
Unlike traditional Siamese-based trackers, we use three groups of image patches as inputs: a template image z ∈ R^(3×H_z×W_z), a search-region image x ∈ R^(3×H_x×W_x), and an additional dynamically updated template image z_d ∈ R^(3×H_z×W_z), which is used to capture the change of the target appearance over time and to provide additional temporal information. ResNet is used as the backbone for feature extraction, with its last stage and fully connected layer removed. After passing through the backbone, the templates z, z_d and the search-region image x are mapped to three feature maps f_z, f_d and f_x, each with C channels and a spatial resolution reduced by the backbone stride.
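By way of illustration only, a minimal PyTorch sketch of such a shared trunk is given below; the ResNet-50 variant, the 128×128 template size and the 320×320 search size are assumptions for the example, not values disclosed above.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class Backbone(nn.Module):
    """Shared ResNet trunk: the last stage (layer4) and the fully connected
    head are removed, so the output stride is 16 and the channel number C
    is 1024 for ResNet-50 (illustrative choice)."""
    def __init__(self):
        super().__init__()
        net = resnet50(weights=None)
        self.body = nn.Sequential(
            net.conv1, net.bn1, net.relu, net.maxpool,
            net.layer1, net.layer2, net.layer3,   # layer4 and fc are dropped
        )

    def forward(self, x):
        return self.body(x)  # (B, 1024, H/16, W/16)

backbone = Backbone()
z  = torch.randn(1, 3, 128, 128)   # template image
zd = torch.randn(1, 3, 128, 128)   # dynamically updated template
x  = torch.randn(1, 3, 320, 320)   # search-region image
f_z, f_d, f_x = backbone(z), backbone(zd), backbone(x)
print(f_z.shape, f_d.shape, f_x.shape)  # (1,1024,8,8) (1,1024,8,8) (1,1024,20,20)
```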
two, feature correlation network
The target tracking under the view angle of the unmanned aerial vehicle has the problems of small target shape, fuzzy boundary, low resolution and the like due to the inherent height and flight jitter, so that the integration and matching degree between the template and the search area are seriously influenced. An attention mechanism is used for expanding the range of feature fusion (namely, a feature correlation network), global feature information is obtained, and the situation that local optimization is involved is avoided, so that the remote feature capture capability is enhanced. This part consists of three main modules, namely encoder-decoder and feature cross-correlation (FCC).
Note that the mechanism is a basic component of feature fusion, and the inputs are the query matrix Q, the key matrix K, and the value matrix V. The attention function based on scaled dot products is defined as equation (1).
Figure BDA0003699293500000103
Wherein D is k Representing the built dimension; softmax is multiplied by row. Dot product of query sum key divided by
Figure BDA0003699293500000104
To alleviate the problem of gradient disappearance of the softmax function. When Q is K, V, the attention mechanism becomes a self-attention mechanism.
As described herein, the model cannot focus on multiple aspects of information in common with one attention. To obtain information in different representation subspaces for different locations, the attention calculations are performed by projecting Q, K, V into multiple linear spaces (1), and the attention results in each linear space are concatenated to obtain a multi-headed attention (MA), defined as equation (2).
Figure BDA0003699293500000105
Wherein the content of the first and second substances,
Figure BDA0003699293500000106
a parameter matrix representing the projection, and h is the number of heads. In this work, we used h 8, d 256,
Figure BDA0003699293500000107
Figure BDA0003699293500000111
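A generic PyTorch sketch of equations (1) and (2) with h = 8 and d = 256 follows; it is a standard multi-head attention implementation used only to illustrate the computation, not the exact module of the invention.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention(q, k, v):
    """Equation (1): softmax(Q K^T / sqrt(d_k)) V, softmax applied row-wise."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return F.softmax(scores, dim=-1) @ v

class MultiHeadAttention(nn.Module):
    """Equation (2): h parallel heads whose outputs are concatenated
    and mixed by an output projection W_O."""
    def __init__(self, d_model=256, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v):
        B, Lq, _ = q.shape
        def split(t):  # (B, L, d_model) -> (B, h, L, d_k)
            return t.view(B, t.size(1), self.h, self.d_k).transpose(1, 2)
        out = attention(split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v)))
        out = out.transpose(1, 2).reshape(B, Lq, self.h * self.d_k)
        return self.w_o(out)

mha = MultiHeadAttention(d_model=256, h=8)
seq = torch.randn(2, 100, 256)
print(mha(seq, seq, seq).shape)  # self-attention when Q = K = V -> (2, 100, 256)
```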
1. Encoder-Decoder
Both the encoder and the decoder are stacks of N identical layers. Each encoder layer includes two sub-modules: a multi-head self-attention (MSA) module and a feed-forward network (FFN). The output of each module employs a residual connection and layer normalization (LN). In addition to these two sub-modules, each decoder layer has a third sub-module that takes the enhanced feature sequence from the encoder as input to perform multi-head attention. As in the encoder, residual connections and layer normalization are employed around each sub-layer. In addition, a target query is input to the decoder to predict the bounding box of the target object; the encoder-decoder structure is shown in FIG. 2.
The input to the encoder needs preprocessing: a bottleneck layer first reduces the number of channels from C to d, and the three sets of feature maps are then compressed and flattened along the spatial dimension to generate a feature sequence whose length equals the total number of spatial positions of the three feature maps and whose dimension is d. This operation merges the multiple branches containing spatio-temporal features, so self-attention can be performed intuitively on the features of every branch to complete the feature-extraction step. Compared with operating on a single branch, this operation saves computation and reduces model parameters through weight sharing.
The overall process of the Encoder-Decoder can be described as:

Encoder (layer n):
I_1 = Flatten(f_z, f_d, f_x)
I_n = I_n + PE
I_n′ = LN(I_n + MSA(I_n))
I_(n+1) = LN(I_n′ + FFN(I_n′))

Decoder (layer n):
T_n = T_n + PE
T_n′ = LN(T_n + MSA(T_n))
I_N′ = I_N + PE
T_n″ = LN(T_n′ + MA(T_n′, I_N′, I_N))
T_(n+1) = LN(T_n″ + FFN(T_n″))
Output: T_N
where n denotes the n-th layer, n ∈ [0, 1, 2, …, N], and N denotes the total number of layers; thus I_N and T_N denote the outputs of the last layer of the encoder and the decoder, respectively. Since the Transformer is permutation-invariant and cannot perceive the input order, sinusoidal position embeddings, denoted position encoding (PE), are added to the input sequence, as shown in equation (3):

PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))    (3)

where pos denotes each position in the sequence, pos ∈ [0, max_sequence_length), and i ∈ [0, d/2) indexes the dimensions of the vector.
The Encoder-Decoder mechanism captures the features of all elements in the sequence and uses global context information to enhance the original features so as to perform global reasoning on the target, enabling the target query to attend to all positions of the template and search-region features for the final bounding-box prediction.
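A condensed PyTorch sketch of this stage is given below; it is built from the standard nn.Transformer layers rather than the exact layers of the invention, and the channel sizes, layer count and single learned target query are illustrative assumptions.

```python
import torch
import torch.nn as nn

def sinusoidal_pe(length, d):
    """Equation (3): fixed sinusoidal position encoding of shape (length, d)."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d, 2, dtype=torch.float32)
    angle = pos / torch.pow(10000.0, i / d)
    pe = torch.zeros(length, d)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

class CorrelationEncoderDecoder(nn.Module):
    """Sketch of the encoder-decoder: a 1x1 bottleneck reduces C -> d, the three
    feature maps are flattened and concatenated into one sequence, the encoder
    models their global relations, and one learned target query is decoded."""
    def __init__(self, c_in=1024, d=256, n_layers=6, h=8):
        super().__init__()
        self.bottleneck = nn.Conv2d(c_in, d, kernel_size=1)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=h, batch_first=True), n_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=d, nhead=h, batch_first=True), n_layers)
        self.query = nn.Parameter(torch.randn(1, 1, d))  # learned target query

    def forward(self, f_z, f_d, f_x):
        def flat(f):  # (B, C, H, W) -> (B, H*W, d)
            f = self.bottleneck(f)
            return f.flatten(2).transpose(1, 2)
        seq = torch.cat([flat(f_z), flat(f_d), flat(f_x)], dim=1)      # I_1 = Flatten(f_z, f_d, f_x)
        seq = seq + sinusoidal_pe(seq.size(1), seq.size(2)).to(seq)    # add PE
        memory = self.encoder(seq)                                      # I_N
        target = self.decoder(self.query.expand(seq.size(0), -1, -1), memory)  # T_N
        return memory, target

model = CorrelationEncoderDecoder()
f_z, f_d = torch.randn(1, 1024, 8, 8), torch.randn(1, 1024, 8, 8)
f_x = torch.randn(1, 1024, 20, 20)
memory, target = model(f_z, f_d, f_x)
print(memory.shape, target.shape)  # (1, 528, 256) (1, 1, 256)
```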
2. Feature Cross-Correlation (FCC)
In the study of target tracking, the degree of correlation and matching between the template and the search region determines the output position of the bounding box. In order to deal with the small size, fuzzy boundaries, low resolution and other problems of the target object under the UAV viewing angle, this work designs a feature correlation network to replace the traditional correlation operation and to process and fuse the features of the template and the search region, as shown in FIG. 1. The encoder output has already completed feature extraction; in order to strengthen the feature correlation dependency between the template and the search region, the template sequence f_z^enc and the search-region sequence f_x^enc output by the encoder are fed as two branches into two Feature Cross-Correlation (FCC) modules. The two FCC modules receive the two inputs simultaneously: multi-head self-attention first adapts the attention of each branch to its useful information as a feature map, and multi-head cross-attention then fuses the two feature maps; after this process is repeated M times, the template cross-map f_z^cc and the search-region cross-map f_x^cc are output. The structure of the feature cross-correlation (FCC) module is shown in FIG. 3. As in the encoder, spatial position encoding is added to the inputs; the FCC is used to enhance the fitting ability of the model, and the process of the whole feature correlation network can be summarized as:

f̃_z = f_z^enc + PE,  f̃_x = f_x^enc + PE
f_z′ = LN(f̃_z + MSA(f̃_z)),  f_x′ = LN(f̃_x + MSA(f̃_x))
f_z″ = LN(f_z′ + MA(f_z′, f_x′, f_x′)),  f_x″ = LN(f_x′ + MA(f_x′, f_z′, f_z′))
…… (repeated M times)
Output: f_z^cc, f_x^cc
the feature correlation network performs fusion and correlation of the template and the search region by means of the cross-attention operation, focuses on the boundary information of the target object of interest, and deepens the feature understanding between the template and the search region.
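Under the same reading of the FCC block (self-attention within each branch followed by cross-attention between the two branches, repeated M times), a simplified PyTorch sketch might look as follows; the placement of residual connections and normalization, and the value M = 4, are assumptions of the example.

```python
import torch
import torch.nn as nn

class FCCLayer(nn.Module):
    """One round of feature cross-correlation: each branch first adapts its own
    features with multi-head self-attention, then the two branches are fused
    with multi-head cross-attention (a simplified reading of the FCC block)."""
    def __init__(self, d=256, h=8):
        super().__init__()
        self.self_z = nn.MultiheadAttention(d, h, batch_first=True)
        self.self_x = nn.MultiheadAttention(d, h, batch_first=True)
        self.cross_z = nn.MultiheadAttention(d, h, batch_first=True)
        self.cross_x = nn.MultiheadAttention(d, h, batch_first=True)
        self.norm = nn.ModuleList([nn.LayerNorm(d) for _ in range(4)])

    def forward(self, fz, fx, pe_z, pe_x):
        qz, qx = fz + pe_z, fx + pe_x                          # add spatial position encoding
        fz = self.norm[0](fz + self.self_z(qz, qz, fz)[0])     # self-attention, template branch
        fx = self.norm[1](fx + self.self_x(qx, qx, fx)[0])     # self-attention, search branch
        fz2 = self.norm[2](fz + self.cross_z(fz, fx, fx)[0])   # fuse search into template
        fx2 = self.norm[3](fx + self.cross_x(fx, fz, fz)[0])   # fuse template into search
        return fz2, fx2

M = 4  # number of repetitions (illustrative)
layers = nn.ModuleList([FCCLayer() for _ in range(M)])
fz, fx = torch.randn(1, 64, 256), torch.randn(1, 400, 256)
pe_z, pe_x = torch.zeros_like(fz), torch.zeros_like(fx)
for layer in layers:
    fz, fx = layer(fz, fx, pe_z, pe_x)
print(fz.shape, fx.shape)  # template cross-map and search-region cross-map
```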
Three, temporal update strategy
Since the appearance of the target object may change significantly over time, temporal information must be added to the spatial information to obtain the latest state of the target object for tracking. Therefore, a temporal-information branch is designed to update the temporal information and obtain the current appearance of the target. As shown in FIG. 2, a dynamically updated template is added to the encoder to capture updates from intermediate frames as input, and the encoder extracts spatio-temporal features by modeling the global relationships among all elements in the spatial and temporal dimensions, thereby effectively fusing the two types of information and enhancing robustness.
1. Score head. During tracking, the appearance of the template does not fluctuate obviously across several consecutive frames, and there are situations in which the dynamic template should not be updated, for example when the target is completely occluded or the tracker drifts, because the cropped template is then unreliable. To judge whether the current state is reliable, we propose a score head to control the update of the dynamically updated template, as shown in FIG. 4. It is a three-layer perceptron followed by a sigmoid activation. To simplify the framework and save computational cost, a threshold τ is set: when the score is higher than the threshold, the search region is regarded as containing the target, and the dynamically updated template is refreshed only after a frame interval of more than 50 frames.
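A small sketch of this update rule follows; the three-layer perceptron with sigmoid output, the threshold τ and the 50-frame interval come from the text above, while the hidden width and the value τ = 0.5 are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    """Three-layer perceptron followed by a sigmoid: predicts the confidence
    that the current search region really contains the target."""
    def __init__(self, d=256, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, target_query):           # (B, 1, d) decoder output
        return torch.sigmoid(self.mlp(target_query)).squeeze(-1)

def maybe_update_template(frame_idx, score, tau=0.5, interval=50):
    """The dynamic template z_d is refreshed only when the score head is
    confident and enough frames have elapsed since the last update."""
    return (frame_idx % interval == 0) and (score > tau)

score_head = ScoreHead()
score = score_head(torch.randn(1, 1, 256)).item()
print(maybe_update_template(frame_idx=150, score=score))
```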
Four, prediction head
Corner-based prediction head: considering the computing power of the UAV and avoiding the computational cost of post-processing the predicted target box, the structure of the prediction head must be concise and robust. To improve the quality of the target-box estimation, we design the prediction head using a corner-based approach. First, the similarity between the output of the feature correlation network and the output of the decoder is calculated. Since the output of the feature correlation network is the result of the fusion of the search region and the template, and the output of the decoder is the extraction of global-information features containing temporal variation, the result computed between the two contains a large amount of target boundary information and spatial information. The output is then multiplied element-wise with the output of the feature correlation network to enhance important regions and attenuate low-resolution regions. The new feature sequence is reshaped into a feature map f ∈ R^(d × H_x/s × W_x/s) and then fed into a simple fully convolutional network (FCN). The FCN is composed of L Conv-BN-ReLU layers and outputs two probability maps P_tl(x, y) and P_br(x, y) for the top-left and bottom-right corners of the object bounding box, respectively. Finally, the expected values of the corner probability distributions are computed to obtain the predicted box coordinates (x̂_tl, ŷ_tl) and (x̂_br, ŷ_br), as shown in equation (4):

(x̂_tl, ŷ_tl) = (Σ_y Σ_x x·P_tl(x, y), Σ_y Σ_x y·P_tl(x, y)),
(x̂_br, ŷ_br) = (Σ_y Σ_x x·P_br(x, y), Σ_y Σ_x y·P_br(x, y))    (4)
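The corner expectation of equation (4) amounts to a soft-argmax over each probability map; the following PyTorch sketch mirrors that computation, with the FCN depth and feature size chosen arbitrarily for the example.

```python
import torch
import torch.nn as nn

def soft_argmax(prob):
    """Expected corner coordinate over a (B, H, W) probability map, equation (4)."""
    B, H, W = prob.shape
    xs = torch.arange(W, dtype=prob.dtype).view(1, 1, W)
    ys = torch.arange(H, dtype=prob.dtype).view(1, H, 1)
    x = (prob * xs).sum(dim=(1, 2))
    y = (prob * ys).sum(dim=(1, 2))
    return x, y

class CornerHead(nn.Module):
    """L Conv-BN-ReLU layers ending in two 1-channel maps for the top-left and
    bottom-right corners; each map is normalized into a probability distribution."""
    def __init__(self, d=256, layers=4):
        super().__init__()
        blocks = []
        for _ in range(layers):
            blocks += [nn.Conv2d(d, d, 3, padding=1), nn.BatchNorm2d(d), nn.ReLU()]
        self.fcn = nn.Sequential(*blocks, nn.Conv2d(d, 2, 1))

    def forward(self, feat):                    # (B, d, H, W) fused feature map
        logits = self.fcn(feat)                 # (B, 2, H, W)
        B, _, H, W = logits.shape
        prob = logits.flatten(2).softmax(-1).view(B, 2, H, W)
        x_tl, y_tl = soft_argmax(prob[:, 0])
        x_br, y_br = soft_argmax(prob[:, 1])
        return torch.stack([x_tl, y_tl, x_br, y_br], dim=-1)

head = CornerHead()
print(head(torch.randn(1, 256, 20, 20)).shape)  # (1, 4) predicted corner coordinates
```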
Five, loss function
We divide the training process into two steps. In the first stage, the entire network except the score head is trained end-to-end; for this first step, localization is the primary task, it is ensured that all search images contain the target object, and the model is allowed to learn the localization capability. Combining the ℓ1 loss and the generalized IoU loss, the loss function can be written as equation (5):

L_loc = λ_iou·L_iou(b_i, b̂_i) + λ_L1·L_1(b_i, b̂_i)    (5)

where b_i and b̂_i denote the ground-truth bounding box and the predicted bounding box, respectively, and λ_iou and λ_L1 are hyper-parameters.
In the second stage, the score head is optimized using a binary cross-entropy loss, as shown in equation (6):

L_ce = y_i·log(P_i) + (1 − y_i)·log(1 − P_i)    (6)

where y_i is the ground-truth label of the bounding box and P_i is the confidence score; all other parameters are frozen to avoid affecting the localization ability. After the final model is trained in these two stages, the localization and classification capabilities can be learned simultaneously.
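A hedged sketch of the two training losses (equations (5) and (6)) is given below; the generalized IoU term is computed with torchvision, and the λ weights are illustrative values, not those used by the invention.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou

def localization_loss(pred_boxes, gt_boxes, lambda_iou=2.0, lambda_l1=5.0):
    """Equation (5): weighted sum of generalized IoU loss and l1 loss between
    predicted and ground-truth boxes in (x1, y1, x2, y2) format."""
    giou = torch.diag(generalized_box_iou(pred_boxes, gt_boxes))  # matched pairs only
    return lambda_iou * (1.0 - giou).mean() + lambda_l1 * F.l1_loss(pred_boxes, gt_boxes)

def score_loss(pred_score, label):
    """Equation (6): binary cross-entropy between the score-head confidence and
    the ground-truth label (1 if the search region contains the target)."""
    return F.binary_cross_entropy(pred_score, label)

# stage 1: train everything except the score head with localization_loss;
# stage 2: freeze the rest and train the score head with score_loss.
pred = torch.tensor([[10.0, 10.0, 50.0, 60.0]], requires_grad=True)
gt   = torch.tensor([[12.0,  8.0, 48.0, 62.0]])
print(localization_loss(pred, gt).item())
print(score_loss(torch.tensor([0.8]), torch.tensor([1.0])).item())
```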
Experimental procedures and effect comparison of the present embodiment
First, the implementation details and results of TransUAV on multiple benchmarks are presented and compared with multiple state-of-the-art methods. TransUAV is then evaluated on three challenging UAV benchmarks for quantitative and qualitative analysis of its tracking performance under UAV motion. Finally, an ablation study is carried out to analyze the influence of the key components of the proposed network and to comprehensively verify the effectiveness and superiority of the proposed tracker.
Second, results and comparison
We compare the TransUAV method with a number of representative trackers on six benchmarks, including the standard aerial tracking benchmarks UAV123, UAV123@10fps and UAVDT, two short-term benchmarks GOT-10K and TrackingNet, and one long-term benchmark LaSOT.
For fairness and objectivity of the evaluation, 32 recent and classical tracking methods are selected for comparison, including correlation-filter-based methods and deep-learning-based methods; the results are obtained by running the official code with the corresponding hyper-parameters. For the UAV123, UAV123@10fps, UAVDT, LaSOT and TrackingNet benchmarks, the experiments are based on one-pass evaluation (OPE) with two indicators, precision (P) and success rate (AUC). The GOT-10K benchmark uses two evaluation indexes, average overlap (AO) and success rate (SR).
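For reference, a small NumPy sketch of how these two OPE indicators are conventionally computed is shown below (center-error precision at a 20-pixel threshold and the area under the IoU success curve); the thresholds are the usual conventions, not values stated in this document.

```python
import numpy as np

def iou(a, b):
    """IoU between two (N, 4) arrays of boxes in (x1, y1, x2, y2) format."""
    lt = np.maximum(a[:, :2], b[:, :2])
    rb = np.minimum(a[:, 2:], b[:, 2:])
    wh = np.clip(rb - lt, 0, None)
    inter = wh[:, 0] * wh[:, 1]
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def ope_metrics(pred, gt):
    """Precision (P): fraction of frames whose center location error <= 20 px.
    Success (AUC): mean success rate over IoU thresholds from 0 to 1."""
    cp = (pred[:, :2] + pred[:, 2:]) / 2
    cg = (gt[:, :2] + gt[:, 2:]) / 2
    cle = np.linalg.norm(cp - cg, axis=1)
    precision = (cle <= 20).mean()
    ious = iou(pred, gt)
    thresholds = np.linspace(0, 1, 21)
    auc = np.mean([(ious > t).mean() for t in thresholds])
    return precision, auc

pred = np.array([[10, 10, 50, 60], [12, 14, 52, 66]], dtype=float)
gt   = np.array([[12,  8, 48, 62], [30, 30, 70, 80]], dtype=float)
print(ope_metrics(pred, gt))
```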
UAV123 is a large aerial tracking benchmark containing 123 challenging aerial video sequences and is one of the most authoritative and comprehensive datasets in the UAV tracking field. The tracking performance of 19 trackers under the most common aerial tracking conditions was evaluated on UAV123, with the results shown in Table one. Our proposed TransUAV ranks first in both success rate and precision. In precision, TransUAV achieves 0.904, a 2.6% improvement over the second-ranked PrDiMP50 and 2.8% over TransT, which is also based on the self-attention mechanism. In success rate, TransUAV achieves 0.692, 2.3% and 1.1% higher than PrDiMP50 and TransT, respectively. This fully demonstrates the effectiveness of the feature correlation network and of the temporal information acquisition. TransUAV is superior to the other trackers on every attribute, with good robustness and interference rejection; in particular, its AUC reaches 0.680 and 0.641 under the similar-object and partial-occlusion attributes, 3.9% and 2.2% higher than the second-best tracker.
Table one: success rate (AUC) and precision (P) of TransUAV and multiple trackers tested on the datasets UAV123, UAV123@10fps and UAVDT.
Table two: attribute-based AUC scores of TransUAV and multiple trackers on the dataset UAV123, with red representing the first-ranked tracker.
Table three: attribute-based AUC scores of TransUAV and multiple trackers on the dataset UAV123@10fps, with red representing the first-ranked tracker.
To strictly evaluate the TransUAV method, we also use the more challenging UAV123@10fps benchmark. This benchmark adopts an image rate of 10 frames per second, so the motion and change of the target between consecutive frames are much more severe, which greatly increases the tracking difficulty. As shown in Table one, it is clear from the comparison with other trackers that TransUAV ranks first in precision (0.904) and success rate (0.694), 2.7% and 2.1% higher than the second-placed PrDiMP50. Compared with UAV123, its tracking performance does not decrease and excellent robustness is maintained. It also excels on most attributes without a substantial decrease in overall performance, as shown in Table three, demonstrating that TransUAV is able to handle large changes in target appearance over time.
UAVDT consists of 50 challenging sequences and mainly focuses on aerial tracking in various unconstrained complex scenes such as high density, small targets and adverse weather conditions. We evaluate 13 trackers on UAVDT; as shown in Table one, the TransUAV tracker obtains the best precision score (83.4) and success rate (60.7), 2.5% and 1.5% higher than the second-ranked PrDiMP50. Furthermore, our evaluation on each attribute performs well, as shown in Table four, especially in the cases of target blur and target motion, demonstrating that TransUAV is suitable for complex aerial tracking scenarios.
Table four: attribute-based AUC scores of TransUAV and multiple trackers on the dataset UAVDT, with red representing the first-ranked tracker.
LaSOT is a large-scale, high-quality target tracking dataset released in 2019, containing 1400 challenging videos, 1120 for training and 280 for testing. We compare 16 trackers on the test set; Table six shows that TransUAV achieves the best performance, with AUC and P scores of 0.649 and 0.693. Although the advantage of TransUAV over TransT is not large, TransUAV performs better than the other trackers on every attribute-based evaluation, as shown in Table five, demonstrating that the approach presented herein is not only applicable to tracking scenarios under an aerial view but also remains effective on a general large-scale tracking benchmark.
Table five: attribute-based AUC scores of TransUAV and multiple trackers on the dataset LaSOT, with red representing the first-ranked tracker.
Table six: success rate (AUC) and precision (P) of TransUAV and multiple trackers tested on the dataset LaSOT; red, green and blue denote the trackers ranked first, second and third, respectively.
TrackingNet is a large short-term tracking dataset whose test set contains 511 sequences covering various object classes and scenes. We submitted the output of TransUAV to the official evaluation server, with the results shown in Table seven: TransUAV obtains 81.2%, 77.7% and 85.1% in AUC, P and P_norm respectively, second only to SiamRCNN. However, the running speed of SiamRCNN on our equipment is less than 5 FPS and cannot meet the real-time requirement, whereas TransUAV runs at 42 FPS.
The GOT-10k dataset contains 10k training sequences and 180 test sequences. We use the test sequences for model testing and submit the output to the official evaluation server; the AO and SR scores of multiple trackers are reported in Table seven. TransUAV achieves the best performance, with AO, SR_0.5 and SR_0.75 scores 3.9%, 5.1% and 3.6% higher than those of SiamRCNN, respectively.
Table seven: overall performance on TrackingNet and GOT-10k.
Third, real-world experiment
We implement the tracker on a UAV to verify its performance in a real-world environment. In the experiment, an NVIDIA Jetson AGX Xavier is used as the onboard computer and a Pixhawk 4 as the flight controller. In the tests we observe average GPU and CPU utilization of 37% and 52%, respectively.
FIG. 5 shows the tracking performance tested with TransUAV in real scenes, using the center location error (CLE) to evaluate the tracking performance and involving occlusion, low resolution, motion blur, contrast variation and other scenarios. As shown in FIG. 5(a), TransUAV can still maintain excellent stability and robustness when encountering partial occlusion and motion jitter. FIGS. 5(b) and (c) show the stable tracking performance of TransUAV, which effectively reduces redundancy and corrects the tracked target, achieving satisfactory tracking even under interference from similar targets. Furthermore, without TensorRT acceleration, we maintain an onboard running speed of over 12 FPS, indicating the utility and feasibility of TransUAV under complex aerial tracking conditions.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that various changes in the embodiments and/or modifications of the invention can be made, and equivalents and modifications of some features of the invention can be made without departing from the spirit and scope of the invention.

Claims (14)

1. A new unmanned aerial vehicle target tracking algorithm, characterized in that: the algorithm utilizes the principle of the self-attention mechanism; the self-attention mechanism scans each element in the sequence and updates it by aggregating information from the whole sequence, thereby focusing on the relationships within the global information and enabling long-range interactions.
2. The new drone target tracking algorithm according to claim 1, characterized in that: the algorithm includes an attention-based tracking framework; the tracking framework comprises a feature extraction network, a feature correlation network, a temporal update strategy and a prediction head;
the feature extraction network is used for acquiring feature information and reducing the data dimensionality; the feature correlation network learns the relationships between the inputs by means of an attention mechanism, predicts the spatial position of the target object, and effectively captures the features of the template and the region of interest with richer feature-map associations;
the temporal update strategy is used for capturing the change of the target object over time, enhancing robustness;
the prediction head is used for estimating the target object in the current frame.
3. The new unmanned aerial vehicle target tracking algorithm according to claim 2, characterized in that the specific algorithm of the feature extraction network is to take three groups of image patches as input, namely a template image z ∈ R^(3×H_z×W_z), a search-region image x ∈ R^(3×H_x×W_x), and an additional dynamically updated template image z_d ∈ R^(3×H_z×W_z), which is used for capturing the change of the target appearance over time and providing additional temporal information; ResNet is adopted as the backbone for feature extraction, with the last stage and the fully connected layer of ResNet removed; after passing through the backbone, the templates z, z_d and the search-region image x are mapped to three feature maps f_z, f_d and f_x, each with C channels and a spatial resolution reduced by the backbone stride.
4. The new unmanned aerial vehicle target tracking algorithm according to claim 2, characterized in that the specific algorithm of the feature correlation network extends the range of feature fusion based on an attention mechanism, obtains global feature information, and avoids falling into local optima, thereby enhancing the ability to capture long-range features;
the feature correlation network comprises an encoder, a decoder, and feature cross-correlation (FCC).
5. The new drone target tracking algorithm according to claim 4, characterized in that: the inputs of the attention mechanism are a query matrix Q, a key matrix K and a value matrix V, and the scaled dot-product attention function is defined as equation (1):

Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V    (1)

where d_k denotes the key dimension; softmax is applied row-wise; the dot product of the query and the key is divided by √d_k to alleviate the vanishing-gradient problem of the softmax function; when Q = K = V, the attention mechanism becomes a self-attention mechanism.
6. The new drone target tracking algorithm according to claim 5, characterized in that: the attention mechanism adopts a plurality of attention heads to attend to different aspects of the information, namely, in order to obtain information from different representation subspaces at different positions, Q, K and V are projected into multiple linear subspaces for the attention computation of equation (1), and the attention results in each subspace are concatenated to obtain multi-head attention (MA), defined as equation (2):

MA(Q, K, V) = Concat(H_1, …, H_h)·W^O,  H_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)    (2)

where W_i^Q, W_i^K, W_i^V and W^O are the parameter matrices of the projections, and h is the number of heads.
7. The new drone target tracking algorithm according to claim 4, characterized in that: the encoder and the decoder are both stacks of N identical layers;
each encoder layer comprises a multi-head self-attention (MSA) module and a feed-forward network (FFN) sub-module, and the output of each module employs a residual connection and layer normalization (LN);
each decoder layer comprises a third sub-module, which takes the enhanced feature sequence from the encoder as input to perform multi-head attention; residual connections and layer normalization are employed around each sub-layer.
8. The new drone target tracking algorithm according to claim 7, characterized in that: the input of the encoder is preprocessed, namely a bottleneck layer is adopted to reduce the number of channels from C to d, and then the three groups of feature maps are compressed and flattened along the spatial dimension to generate a feature sequence whose length equals the total number of spatial positions of the three feature maps and whose dimension is d.
9. The new drone target tracking algorithm according to any one of claims 7 or 8, characterized in that the mathematical process of the encoder and the decoder is:

Encoder (layer n):
I_1 = Flatten(f_z, f_d, f_x)
I_n = I_n + PE
I_n′ = LN(I_n + MSA(I_n))
I_(n+1) = LN(I_n′ + FFN(I_n′))
……

Decoder (layer n):
T_n = T_n + PE
T_n′ = LN(T_n + MSA(T_n))
I_N′ = I_N + PE
T_n″ = LN(T_n′ + MA(T_n′, I_N′, I_N))
T_(n+1) = LN(T_n″ + FFN(T_n″))
……
Output: T_N

where n denotes the n-th layer, n ∈ [0, 1, 2, …, N], N denotes the total number of layers, and I_N and T_N denote the outputs of the last layer of the encoder and the decoder, respectively; sinusoidal position embeddings, denoted position encoding (PE), are added to the input sequence, as shown in equation (3):

PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))    (3)

where pos denotes each position in the sequence, pos ∈ [0, max_sequence_length), and i ∈ [0, d/2) indexes the dimensions of the vector;
the above algorithm captures the features of all elements in the sequence and uses global context information to enhance the original features so as to perform global reasoning on the target, enabling the target query to attend to all positions of the template and search-region features for the final bounding-box prediction.
10. The new drone target tracking algorithm according to claim 4, characterized in that the specific algorithm of the feature cross-correlation (FCC) is: the template sequence f_z^enc and the search-region sequence f_x^enc output by the encoder in the feature correlation network are fed as two branches into two feature cross-correlation (FCC) modules; each module receives the two inputs simultaneously, multi-head self-attention first adapts the attention of each branch to its useful information as a feature map, multi-head cross-attention then fuses the two feature maps, and after this process is repeated M times the template cross-map f_z^cc and the search-region cross-map f_x^cc are output; spatial position encoding is added to the inputs, and the mathematical process can be summarized as:

f̃_z = f_z^enc + PE,  f̃_x = f_x^enc + PE
f_z′ = LN(f̃_z + MSA(f̃_z)),  f_x′ = LN(f̃_x + MSA(f̃_x))
f_z″ = LN(f_z′ + MA(f_z′, f_x′, f_x′)),  f_x″ = LN(f_x′ + MA(f_x′, f_z′, f_z′))
…… (repeated M times)
Output: f_z^cc, f_x^cc

the feature correlation network performs fusion and correlation of the template and the search region by means of the cross-attention operation, focuses on the boundary information of the target object of interest, and deepens the feature understanding between the template and the search region.
11. The new drone target tracking algorithm according to claim 2, characterized in that: the temporal update strategy adds temporal information to the spatial information to acquire the latest state of the target object for tracking; specifically, a dynamically updated template is added as an input to the encoder in the feature correlation network to capture updates from intermediate frames, and the encoder extracts spatio-temporal features by modeling the global relationships among all elements in the spatial and temporal dimensions, thereby effectively fusing the two types of information and enhancing robustness.
12. The new drone target tracking algorithm according to claim 11, characterized in that: the control of the temporal update is realized by a score head; the score head is a three-layer perceptron followed by a sigmoid activation, and a threshold τ is set; when the score is higher than the threshold, the search region is regarded as containing the target, and the dynamic template is updated at a set frame interval.
13. The new drone target tracking algorithm according to claim 2, characterized in that: the prediction head is designed using a corner-based method; the output of the feature correlation network and the output of the decoder are multiplied element-wise to enhance important regions and weaken low-resolution regions; the new feature sequence is reshaped into a feature map f ∈ R^(d × H_x/s × W_x/s) and then fed into a fully convolutional network (FCN); the FCN is composed of L Conv-BN-ReLU layers and outputs two probability maps P_tl(x, y) and P_br(x, y) for the top-left and bottom-right corners of the object bounding box, respectively; finally, the expected values of the corner probability distributions are computed to obtain the predicted box coordinates (x̂_tl, ŷ_tl) and (x̂_br, ŷ_br), as shown in equation (4):

(x̂_tl, ŷ_tl) = (Σ_y Σ_x x·P_tl(x, y), Σ_y Σ_x y·P_tl(x, y)),
(x̂_br, ŷ_br) = (Σ_y Σ_x x·P_br(x, y), Σ_y Σ_x y·P_br(x, y))    (4)
14. The new drone target tracking algorithm according to claim 2, characterized in that: the whole tracking framework is trained with defined loss functions, and the training process comprises two steps;
step one: the whole network except the score head in the temporal update strategy is trained end-to-end for localization, i.e. the model is allowed to learn the localization capability; combining the ℓ1 loss and the generalized IoU loss, the loss function can be written as equation (5):

L_loc = λ_iou·L_iou(b_i, b̂_i) + λ_L1·L_1(b_i, b̂_i)    (5)

where b_i and b̂_i denote the ground-truth bounding box and the predicted bounding box, respectively, and λ_iou and λ_L1 are hyper-parameters;
step two: the score head is optimized with a binary cross-entropy loss, as shown in equation (6):

L_ce = y_i·log(P_i) + (1 − y_i)·log(1 − P_i)    (6)

where y_i is the ground-truth label of the bounding box and P_i is the confidence score; all other parameters are frozen to avoid affecting the localization ability;
after the final model is trained in these two stages, the localization and classification capabilities can be learned simultaneously.
CN202210695272.2A 2022-06-17 2022-06-17 Novel target tracking algorithm for unmanned aerial vehicle Pending CN114972439A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210695272.2A CN114972439A (en) 2022-06-17 2022-06-17 Novel target tracking algorithm for unmanned aerial vehicle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210695272.2A CN114972439A (en) 2022-06-17 2022-06-17 Novel target tracking algorithm for unmanned aerial vehicle

Publications (1)

Publication Number Publication Date
CN114972439A true CN114972439A (en) 2022-08-30

Family

ID=82963886

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210695272.2A Pending CN114972439A (en) 2022-06-17 2022-06-17 Novel target tracking algorithm for unmanned aerial vehicle

Country Status (1)

Country Link
CN (1) CN114972439A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115620321A (en) * 2022-10-20 2023-01-17 北京百度网讯科技有限公司 Table identification method and device, electronic equipment and storage medium
CN116402858A (en) * 2023-04-11 2023-07-07 合肥工业大学 Transformer-based space-time information fusion infrared target tracking method
CN116402858B (en) * 2023-04-11 2023-11-21 合肥工业大学 Transformer-based space-time information fusion infrared target tracking method
CN117011342A (en) * 2023-10-07 2023-11-07 南京信息工程大学 Attention-enhanced space-time transducer vision single-target tracking method

Similar Documents

Publication Publication Date Title
CN113449680B (en) Knowledge distillation-based multimode small target detection method
CN114972439A (en) Novel target tracking algorithm for unmanned aerial vehicle
CN111161315B (en) Multi-target tracking method and system based on graph neural network
CN110728698B (en) Multi-target tracking system based on composite cyclic neural network system
CN110874578A (en) Unmanned aerial vehicle visual angle vehicle identification and tracking method based on reinforcement learning
CN110705412A (en) Video target detection method based on motion history image
CN115205233A (en) Photovoltaic surface defect identification method and system based on end-to-end architecture
CN117058456A (en) Visual target tracking method based on multiphase attention mechanism
CN116563355A (en) Target tracking method based on space-time interaction attention mechanism
CN115984330A (en) Boundary-aware target tracking model and target tracking method
CN115239765A (en) Infrared image target tracking system and method based on multi-scale deformable attention
CN115393690A (en) Light neural network air-to-ground observation multi-target identification method
CN117542045B (en) Food identification method and system based on space-guided self-attention
Mann et al. Predicting future occupancy grids in dynamic environment with spatio-temporal learning
CN110111358B (en) Target tracking method based on multilayer time sequence filtering
CN116820131A (en) Unmanned aerial vehicle tracking method based on target perception ViT
CN116246338B (en) Behavior recognition method based on graph convolution and transducer composite neural network
CN116844004A (en) Point cloud automatic semantic modeling method for digital twin scene
CN115830707A (en) Multi-view human behavior identification method based on hypergraph learning
Shukla et al. UBOL: User-Behavior-aware one-shot learning for safe autonomous driving
CN116797799A (en) Single-target tracking method and tracking system based on channel attention and space-time perception
CN116363610A (en) Improved YOLOv 5-based aerial vehicle rotating target detection method
CN115841596A (en) Multi-label image classification method and training method and device of multi-label image classification model
Dao et al. Attention-based proposals refinement for 3D object detection
CN113269118A (en) Monocular vision forward vehicle distance detection method based on depth estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination