CN114972439A - Novel target tracking algorithm for unmanned aerial vehicle - Google Patents

Novel target tracking algorithm for unmanned aerial vehicle

Info

Publication number
CN114972439A
CN114972439A (application CN202210695272.2A)
Authority
CN
China
Prior art keywords
feature
attention
network
encoder
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210695272.2A
Other languages
Chinese (zh)
Inventor
赵津
孙念怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou University
Original Assignee
Guizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou University
Priority to CN202210695272.2A
Publication of CN114972439A
Legal status: Pending (current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/20 - Analysis of motion
    • G06T 7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/42 - Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 - Road transport of goods or passengers
    • Y02T 10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a novel unmanned aerial vehicle target tracking algorithm that fully exploits global information through a feature correlation network based on a self-attention mechanism. The method effectively fuses the features of the search region and the template, reduces the influence of external interference, and improves the accuracy and robustness of the tracking algorithm. Global spatio-temporal features are obtained through a learned query embedding and a temporal update strategy for prediction, enhancing adaptability to rapid changes in the appearance of the target object. To meet onboard running-speed requirements, the proposed method uses no region proposals or predefined anchors and needs no post-processing step, and the whole method is end-to-end.

Description

Novel target tracking algorithm for unmanned aerial vehicle
Technical Field
The invention relates to the technical field of intelligent design of unmanned aerial vehicles, in particular to a novel target tracking algorithm of an unmanned aerial vehicle.
Background
Unmanned Aerial Vehicles (UAVs) are one of the main types of unmanned aircraft systems, and their range of missions is diverse and expanding. In recent years, UAVs have been widely applied in military and civil domains such as environmental monitoring, search and rescue, and autonomous positioning. Target tracking is one of the important subjects of UAV research, and there are currently two main approaches: methods based on Correlation Filtering (CF) and methods based on deep learning. However, because of the inherent flight altitude and flight jitter, target tracking from the UAV viewpoint suffers from small target size, blurred boundaries and low resolution, which seriously affect the integration and matching between the template and the search region. Most existing trackers use correlation to integrate the information of the template and the search region into a region of interest; this is a local linear matching process that mainly deals with local regions. Because global information is lacking in most cases, feature fusion between the template and the search region is poor, which degrades the accuracy and robustness of target tracking. Therefore, stable and effective tracking of an agile target in a complex and uncertain environment remains a challenge for UAV target tracking.
The CF-based tracker treats tracking as a classification problem and converts circular correlation and convolution in the spatial domain into element-wise multiplication in the frequency domain by means of the discrete Fourier transform. This strategy remarkably improves the running speed of CF-based trackers on a single CPU, reaching several hundred frames per second (fps), which satisfies the real-time requirement of UAVs. Thus, in the past few years, CF-based trackers on UAVs have expanded rapidly and been widely used. Huang Z. proposed using the response map generated during the detection phase to form a learning constraint; this strategy enhances the robustness and accuracy of tracking objects. Fu C. designed a new multi-kernel correlation tracking framework that uses image quality measurements and context information to construct adaptive interference sources, thereby improving robustness. Although many scholars have made outstanding contributions to improving the robustness and accuracy of CF trackers, most of these methods update the model using only information from the latest frame, so their knowledge of historical information is limited. Furthermore, correlation-based operations only evaluate similarity on local features, and the appearance of the target object may change over time, leading to tracking drift when fast motion or occlusion occurs.
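By way of illustration of this frequency-domain trick (and not as part of the disclosed method), the following minimal NumPy sketch shows how circular correlation between a filter and an image patch reduces to element-wise multiplication after the FFT; all variable names are illustrative only.

```python
import numpy as np

def cf_response(patch, filt):
    """Circular correlation of `patch` with `filt` via the FFT.

    Equivalent to sliding `filt` over `patch` with circular boundary
    conditions, but costs O(N log N) instead of O(N^2).
    """
    F_patch = np.fft.fft2(patch)
    F_filt = np.fft.fft2(filt, s=patch.shape)
    # spatial-domain correlation == conj(filter) * patch in the frequency domain
    return np.real(np.fft.ifft2(np.conj(F_filt) * F_patch))

# toy usage: the peak of the response map indicates the relative shift of the target
patch = np.random.rand(64, 64)
filt = np.roll(patch, shift=(5, 3), axis=(0, 1))  # shifted copy of the patch
resp = cf_response(patch, filt)
print(np.unravel_index(resp.argmax(), resp.shape))
```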
With the development of Convolutional Neural Networks (CNNs) and the increase in computing power, learning-based methods have become more and more common in the tracking field. The Siamese network enables end-to-end learning and opened the way for deep learning methods to gradually surpass CF. SiamFC is a pioneering work that combines a naive feature correlation with the Siamese framework and uses an offline end-to-end training strategy to avoid updating parameters online. Guo et al. proposed a dynamic Siamese network (DSiam) without an anchor structure based on SiamFC, in which a fast transformation learning model can effectively handle changes in object appearance. Li et al. proposed the Siamese region proposal network (SiamRPN), which combines the Siamese network with a region proposal network (RPN) and uses depth-wise correlation for feature fusion to obtain more accurate tracking results. In addition, many researchers have made further improvements, such as adding additional branches, building deeper architectures, and developing anchor-free architectures. Although Siamese-based trackers achieve surprisingly good results, the UAV cannot support the high complexity of the correlation operations in these networks or the various post-processing steps used to select the best bounding box as the tracking result. Common post-processing includes cosine windows, scale or aspect-ratio penalties, bounding-box smoothing, and so on. Post-processing yields better results but makes performance sensitive to hyper-parameters. Some trackers attempt to simplify the tracking pipeline, but their performance drops significantly.
Disclosure of Invention
The invention aims to provide a novel unmanned aerial vehicle target tracking algorithm that fully exploits global information through a feature correlation network based on a self-attention mechanism. The method effectively fuses the features of the search region and the template, reduces the influence of external interference, and improves the accuracy and robustness of the tracking algorithm. Global spatio-temporal features are obtained through a learned query embedding and a temporal update strategy for prediction, enhancing adaptability to rapid changes in the appearance of the target object. To meet onboard running-speed requirements, the proposed method uses no region proposals or predefined anchors and needs no post-processing step, and the whole method is end-to-end, overcoming the deficiencies of the prior art.
In order to achieve this purpose, the invention provides the following technical scheme: a new unmanned aerial vehicle target tracking algorithm that utilizes the principle of the self-attention mechanism; the self-attention mechanism scans each element in the sequence and updates it by aggregating information from the whole sequence, thereby focusing on the relationships within the global information and enabling long-range interactions.
As a further scheme of the invention: the algorithm includes an attention-based tracking framework; the tracking framework comprises a feature extraction network, a feature correlation network, a temporal update strategy and a prediction head;
the feature extraction network is used for capturing the change of the target appearance over time and providing additional temporal information;
the feature correlation network learns the relationships between the inputs by means of an attention mechanism, predicts the spatial position of the target object, and effectively captures the features of the template and the region of interest with richer feature-map associations;
the temporal update strategy is used for capturing the change of the target object over time, enhancing robustness;
the prediction head is used for estimating the target object in the current frame.
As a further scheme of the invention: the specific algorithm of the feature extraction network is to take three groups of image patches as input, namely a template image z ∈ R^(3×H_z×W_z), a search-region image x ∈ R^(3×H_x×W_x), and an additional dynamically updated template image z_d ∈ R^(3×H_z×W_z), which is used for capturing the change of the target appearance over time and providing additional temporal information; ResNet is adopted as the backbone for feature extraction, with the last stage and the fully connected layer of ResNet removed; after passing through the backbone, the templates z, z_d and the search-region image x are mapped to three feature maps f_z, f_d and f_x, each with C channels and a spatial resolution reduced by the backbone stride.
as a further scheme of the invention: the specific algorithm of the feature related network is based on an attention mechanism to expand the range of the feature related network, obtain global feature information and avoid falling into local optimum, so that the remote feature capturing capability is enhanced;
the feature correlation network comprises an encoder (encoder), a decoder (decoder) and feature cross-correlation (FCC);
as a further scheme of the invention: the input of the attention mechanism is a query matrix Q, a key matrix K and a value matrix V, and an attention function based on a scale dot product is defined as an equation (1);
Figure BDA0003699293500000041
wherein D is k Representing the built dimension; softmax multiplies by row; dot product of query sum key divided by
Figure BDA0003699293500000042
To alleviate the gradient disappearance problem of the softmax function; when Q ═ K ═ V, the attention mechanism becomes a self-attention mechanism.
As a further scheme of the invention: the attention mechanism employs multiple heads to attend to different aspects of the information, i.e. to obtain information from different representation subspaces at different positions; Q, K and V are projected into multiple linear subspaces for the attention computation of equation (1), and the attention results in each subspace are concatenated to obtain multi-head attention (MA), defined as equation (2):

MA(Q, K, V) = Concat(H_1, …, H_h)·W^O,  H_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)    (2)

where W_i^Q, W_i^K, W_i^V and W^O are the parameter matrices of the projections, and h is the number of heads.
As a further scheme of the invention: the encoder and the decoder are both stacks of N identical layers;
each encoder layer comprises a multi-head self-attention (MSA) module and a feed-forward network (FFN) sub-module, and the output of each module employs a residual connection and layer normalization (LN);
each decoder layer comprises a third sub-module, which takes the enhanced feature sequence from the encoder as input to perform multi-head attention; residual connections and layer normalization are employed around each sub-layer;
as a further scheme of the invention: the input of the encoder adopts a preprocessing mode, namely a bottleneck layer is adopted to reduce the number of channels from C to d, and then three groups of characteristic mappings are compressed and flattened along the spatial dimension to generate the length of the three groups of characteristic mappings
Figure BDA0003699293500000051
And d dimension.
As a further scheme of the invention: the mathematical process of the encoder and the decoder is as follows:

Encoder (layer n):
I_1 = Flatten(f_z, f_d, f_x)
I_n = I_n + PE
I_n′ = LN(I_n + MSA(I_n))
I_(n+1) = LN(I_n′ + FFN(I_n′))
……

Decoder (layer n):
T_n = T_n + PE
T_n′ = LN(T_n + MSA(T_n))
I_N′ = I_N + PE
T_n″ = LN(T_n′ + MA(T_n′, I_N′, I_N))
T_(n+1) = LN(T_n″ + FFN(T_n″))
……
Output: T_N

where n denotes the n-th layer, n ∈ [0, 1, 2, …, N], N denotes the total number of layers, and I_N and T_N denote the outputs of the last layer of the encoder and the decoder, respectively; sinusoidal position embeddings, denoted position encoding (PE), are added to the input sequence, as shown in equation (3):

PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))    (3)

where pos denotes each position in the sequence, pos ∈ [0, max_sequence_length), and i ∈ [0, d/2) indexes the dimensions of the vector.
The above algorithm captures the features of all elements in the sequence and uses global context information to enhance the original features so as to perform global reasoning on the target, enabling the target query to attend to all positions of the template and search-region features for the final bounding-box prediction.
As a further scheme of the invention: the specific algorithm of the feature cross-correlation (FCC) is as follows: the template sequence f_z^enc and the search-region sequence f_x^enc output by the encoder in the feature correlation network are fed as two branches into two feature cross-correlation (FCC) modules; each module receives the two inputs simultaneously, multi-head self-attention first adapts the attention of each branch to its useful information as a feature map, multi-head cross-attention then fuses the two feature maps, and after this process is repeated M times the template cross-map f_z^cc and the search-region cross-map f_x^cc are output; spatial position encoding is added to the inputs, and the mathematical process can be summarized as:

f̃_z = f_z^enc + PE,  f̃_x = f_x^enc + PE
f_z′ = LN(f̃_z + MSA(f̃_z)),  f_x′ = LN(f̃_x + MSA(f̃_x))
f_z″ = LN(f_z′ + MA(f_z′, f_x′, f_x′)),  f_x″ = LN(f_x′ + MA(f_x′, f_z′, f_z′))
…… (repeated M times)
Output: f_z^cc, f_x^cc
the feature correlation network performs fusion and correlation of the template and the search region by means of the cross-attention operation, focuses on the boundary information of the target object of interest, and deepens the feature understanding between the template and the search region.
As a further scheme of the invention: the temporal update strategy adds temporal information to the spatial information to acquire the latest state of the target object for tracking; specifically, a dynamically updated template is added as an input to the encoder in the feature correlation network to capture updates from intermediate frames, and the encoder extracts spatio-temporal features by modeling the global relationships among all elements in the spatial and temporal dimensions, thereby effectively fusing the two types of information and enhancing robustness.
As a further scheme of the invention: the control of the temporal update is realized by a score head; the score head is a three-layer perceptron followed by a sigmoid activation, and a threshold τ is set; when the score is higher than the threshold, the search region is regarded as containing the target, and the dynamic template is updated at a set frame interval.
As a further scheme of the invention: the prediction head is designed using a corner-based method; the output of the feature correlation network and the output of the decoder are multiplied element-wise to enhance important regions and weaken low-resolution regions; the new feature sequence is reshaped into a feature map f ∈ R^(d × H_x/s × W_x/s) and then fed into a fully convolutional network (FCN); the FCN is composed of L Conv-BN-ReLU layers and outputs two probability maps P_tl(x, y) and P_br(x, y) for the top-left and bottom-right corners of the object bounding box, respectively; finally, the expected values of the corner probability distributions are computed to obtain the predicted box coordinates (x̂_tl, ŷ_tl) and (x̂_br, ŷ_br), as shown in equation (4):

(x̂_tl, ŷ_tl) = (Σ_y Σ_x x·P_tl(x, y), Σ_y Σ_x y·P_tl(x, y)),
(x̂_br, ŷ_br) = (Σ_y Σ_x x·P_br(x, y), Σ_y Σ_x y·P_br(x, y))    (4)
as a further scheme of the invention: the whole tracking frame is subjected to defined training in a loss function mode, and the training process comprises two steps;
step one, the whole network except the fraction head in the time updating strategy is trained and positioned end to end, namely the model is allowed to learn the positioning capability. Bonding of
Figure BDA0003699293500000075
Loss and generalized IoU loss the loss function can be written as equation 5;
Figure BDA0003699293500000076
wherein, b i
Figure BDA0003699293500000081
Values representing true bounding box values and predicted bounding boxes; lambda [ alpha ] iou
Figure BDA0003699293500000082
Is a hyper-parameter;
step two, optimizing a fractional head by adopting binary cross entropy loss, as shown in equation 6;
L ce =y i log(P i )+(1-y i )log(1-P i ) (6)
wherein, y i Is a real label of the bounding box, P i Is the confidence score, all other parameters are frozen, avoiding affecting localization ability;
after the final model is trained in two stages, the positioning and classification capabilities can be learned simultaneously.
Compared with the prior art, the invention has the following beneficial effects:
1. The framework of the present invention combines the spatial and temporal dimensions to address the challenging problems in aerial tracking. The framework mainly comprises a feature extraction network, a feature correlation network, a temporal update strategy and a prediction head, giving higher performance and a more straightforward architecture.
2. The invention adds a feature correlation network, which comprises an attention module for capturing global information and a feature enhancement and fusion module based on cross-attention. The network pays more attention to valuable information, such as edges and similar objects, and better understands the relationships between global features rather than mere local correlation. To capture the change of the target over time, an updated template is designed, which adds temporal information to the attention; meanwhile, a score head is learned to control the updating of the dynamic template image.
3. The invention adopts a corner-based prediction head; the whole method is end-to-end and needs no post-processing steps such as cosine windows and bounding-box smoothing. This strategy greatly simplifies the existing tracking pipeline and makes the tracker practical to run on an unmanned aerial vehicle.
4. Test results on multiple benchmarks show that the proposed tracker has significant performance advantages, particularly on the large-scale aerial datasets UAV123 and UAVDT. Furthermore, the tracker runs at approximately 42 FPS on a GPU, achieving real-time speed.
Drawings
FIG. 1 is a schematic flow chart of the operation of the tracking framework of the present invention;
FIG. 2 is a schematic diagram of an encoder-decoder according to the present invention;
FIG. 3 is a schematic diagram of a characteristic cross-correlation (FCC) module configuration in accordance with the present invention;
FIG. 4 is a schematic view of the structure of the fractional head of the present invention;
FIG. 5(a) is a first diagram of the tracking performance tested with TransUAV in a real scene (under occlusion and motion jitter);
FIG. 5(b) is a diagram of the tracking performance tested with TransUAV in a real scene (stable tracking);
FIG. 5(c) is a diagram of the tracking performance tested with TransUAV in a real scene (stable tracking).
Detailed Description
The technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings and table parameters; it should be apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the embodiments given herein without making any creative effort shall fall within the protection scope of the present invention.
The embodiment provides a Transformer-based unmanned aerial vehicle target tracking architecture, called TransUAV; as shown in FIG. 1, the architecture comprises four modules: a feature extraction network, a feature correlation network, a temporal update strategy and a prediction head. FIG. 1 shows the workflow of TransUAV during tracking.
One, feature extraction network
Unlike traditional Siamese-based trackers, we use three groups of image patches as inputs: a template image z ∈ R^(3×H_z×W_z), a search-region image x ∈ R^(3×H_x×W_x), and an additional dynamically updated template image z_d ∈ R^(3×H_z×W_z), which is used to capture the change of the target appearance over time and to provide additional temporal information. ResNet is used as the backbone for feature extraction, with its last stage and fully connected layer removed. After passing through the backbone, the templates z, z_d and the search-region image x are mapped to three feature maps f_z, f_d and f_x, each with C channels and a spatial resolution reduced by the backbone stride.
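By way of illustration only, a minimal PyTorch sketch of such a shared trunk is given below; the ResNet-50 variant, the 128×128 template size and the 320×320 search size are assumptions for the example, not values disclosed above.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class Backbone(nn.Module):
    """Shared ResNet trunk: the last stage (layer4) and the fully connected
    head are removed, so the output stride is 16 and the channel number C
    is 1024 for ResNet-50 (illustrative choice)."""
    def __init__(self):
        super().__init__()
        net = resnet50(weights=None)
        self.body = nn.Sequential(
            net.conv1, net.bn1, net.relu, net.maxpool,
            net.layer1, net.layer2, net.layer3,   # layer4 and fc are dropped
        )

    def forward(self, x):
        return self.body(x)  # (B, 1024, H/16, W/16)

backbone = Backbone()
z  = torch.randn(1, 3, 128, 128)   # template image
zd = torch.randn(1, 3, 128, 128)   # dynamically updated template
x  = torch.randn(1, 3, 320, 320)   # search-region image
f_z, f_d, f_x = backbone(z), backbone(zd), backbone(x)
print(f_z.shape, f_d.shape, f_x.shape)  # (1,1024,8,8) (1,1024,8,8) (1,1024,20,20)
```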
two, feature correlation network
The target tracking under the view angle of the unmanned aerial vehicle has the problems of small target shape, fuzzy boundary, low resolution and the like due to the inherent height and flight jitter, so that the integration and matching degree between the template and the search area are seriously influenced. An attention mechanism is used for expanding the range of feature fusion (namely, a feature correlation network), global feature information is obtained, and the situation that local optimization is involved is avoided, so that the remote feature capture capability is enhanced. This part consists of three main modules, namely encoder-decoder and feature cross-correlation (FCC).
Note that the mechanism is a basic component of feature fusion, and the inputs are the query matrix Q, the key matrix K, and the value matrix V. The attention function based on scaled dot products is defined as equation (1).
Figure BDA0003699293500000103
Wherein D is k Representing the built dimension; softmax is multiplied by row. Dot product of query sum key divided by
Figure BDA0003699293500000104
To alleviate the problem of gradient disappearance of the softmax function. When Q is K, V, the attention mechanism becomes a self-attention mechanism.
As described herein, the model cannot focus on multiple aspects of information in common with one attention. To obtain information in different representation subspaces for different locations, the attention calculations are performed by projecting Q, K, V into multiple linear spaces (1), and the attention results in each linear space are concatenated to obtain a multi-headed attention (MA), defined as equation (2).
Figure BDA0003699293500000105
Wherein the content of the first and second substances,
Figure BDA0003699293500000106
a parameter matrix representing the projection, and h is the number of heads. In this work, we used h 8, d 256,
Figure BDA0003699293500000107
Figure BDA0003699293500000111
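A generic PyTorch sketch of equations (1) and (2) with h = 8 and d = 256 follows; it is a standard multi-head attention implementation used only to illustrate the computation, not the exact module of the invention.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention(q, k, v):
    """Equation (1): softmax(Q K^T / sqrt(d_k)) V, softmax applied row-wise."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return F.softmax(scores, dim=-1) @ v

class MultiHeadAttention(nn.Module):
    """Equation (2): h parallel heads whose outputs are concatenated
    and mixed by an output projection W_O."""
    def __init__(self, d_model=256, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v):
        B, Lq, _ = q.shape
        def split(t):  # (B, L, d_model) -> (B, h, L, d_k)
            return t.view(B, t.size(1), self.h, self.d_k).transpose(1, 2)
        out = attention(split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v)))
        out = out.transpose(1, 2).reshape(B, Lq, self.h * self.d_k)
        return self.w_o(out)

mha = MultiHeadAttention(d_model=256, h=8)
seq = torch.randn(2, 100, 256)
print(mha(seq, seq, seq).shape)  # self-attention when Q = K = V -> (2, 100, 256)
```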
1. Encoder-Decoder
Both the encoder and the decoder are stacks of N identical layers. Each encoder layer includes two sub-modules: a multi-head self-attention (MSA) module and a feed-forward network (FFN). The output of each module employs a residual connection and layer normalization (LN). In addition to these two sub-modules, each decoder layer has a third sub-module that takes the enhanced feature sequence from the encoder as input to perform multi-head attention. As in the encoder, residual connections and layer normalization are employed around each sub-layer. In addition, a target query is input to the decoder to predict the bounding box of the target object; the encoder-decoder structure is shown in FIG. 2.
The input to the encoder needs preprocessing: a bottleneck layer first reduces the number of channels from C to d, and the three sets of feature maps are then compressed and flattened along the spatial dimension to generate a feature sequence whose length equals the total number of spatial positions of the three feature maps and whose dimension is d. This operation merges the multiple branches containing spatio-temporal features, so self-attention can be performed intuitively on the features of every branch to complete the feature-extraction step. Compared with operating on a single branch, this operation saves computation and reduces model parameters through weight sharing.
The overall process of the Encoder-Decoder can be described as:

Encoder (layer n):
I_1 = Flatten(f_z, f_d, f_x)
I_n = I_n + PE
I_n′ = LN(I_n + MSA(I_n))
I_(n+1) = LN(I_n′ + FFN(I_n′))

Decoder (layer n):
T_n = T_n + PE
T_n′ = LN(T_n + MSA(T_n))
I_N′ = I_N + PE
T_n″ = LN(T_n′ + MA(T_n′, I_N′, I_N))
T_(n+1) = LN(T_n″ + FFN(T_n″))
Output: T_N
where n denotes the n-th layer, n ∈ [0, 1, 2, …, N], and N denotes the total number of layers; thus I_N and T_N denote the outputs of the last layer of the encoder and the decoder, respectively. Since the Transformer is permutation-invariant and cannot perceive the input order, sinusoidal position embeddings, denoted position encoding (PE), are added to the input sequence, as shown in equation (3):

PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))    (3)

where pos denotes each position in the sequence, pos ∈ [0, max_sequence_length), and i ∈ [0, d/2) indexes the dimensions of the vector.
The Encoder-Decoder mechanism captures the features of all elements in the sequence and uses global context information to enhance the original features so as to perform global reasoning on the target, enabling the target query to attend to all positions of the template and search-region features for the final bounding-box prediction.
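A condensed PyTorch sketch of this stage is given below; it is built from the standard nn.Transformer layers rather than the exact layers of the invention, and the channel sizes, layer count and single learned target query are illustrative assumptions.

```python
import torch
import torch.nn as nn

def sinusoidal_pe(length, d):
    """Equation (3): fixed sinusoidal position encoding of shape (length, d)."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d, 2, dtype=torch.float32)
    angle = pos / torch.pow(10000.0, i / d)
    pe = torch.zeros(length, d)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

class CorrelationEncoderDecoder(nn.Module):
    """Sketch of the encoder-decoder: a 1x1 bottleneck reduces C -> d, the three
    feature maps are flattened and concatenated into one sequence, the encoder
    models their global relations, and one learned target query is decoded."""
    def __init__(self, c_in=1024, d=256, n_layers=6, h=8):
        super().__init__()
        self.bottleneck = nn.Conv2d(c_in, d, kernel_size=1)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=h, batch_first=True), n_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=d, nhead=h, batch_first=True), n_layers)
        self.query = nn.Parameter(torch.randn(1, 1, d))  # learned target query

    def forward(self, f_z, f_d, f_x):
        def flat(f):  # (B, C, H, W) -> (B, H*W, d)
            f = self.bottleneck(f)
            return f.flatten(2).transpose(1, 2)
        seq = torch.cat([flat(f_z), flat(f_d), flat(f_x)], dim=1)      # I_1 = Flatten(f_z, f_d, f_x)
        seq = seq + sinusoidal_pe(seq.size(1), seq.size(2)).to(seq)    # add PE
        memory = self.encoder(seq)                                      # I_N
        target = self.decoder(self.query.expand(seq.size(0), -1, -1), memory)  # T_N
        return memory, target

model = CorrelationEncoderDecoder()
f_z, f_d = torch.randn(1, 1024, 8, 8), torch.randn(1, 1024, 8, 8)
f_x = torch.randn(1, 1024, 20, 20)
memory, target = model(f_z, f_d, f_x)
print(memory.shape, target.shape)  # (1, 528, 256) (1, 1, 256)
```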
2. Feature Cross-Correlation (FCC)
In the study of target tracking, the degree of correlation and matching between the template and the search region determines the output position of the bounding box. In order to deal with the small size, fuzzy boundaries, low resolution and other problems of the target object under the UAV viewing angle, this work designs a feature correlation network to replace the traditional correlation operation and to process and fuse the features of the template and the search region, as shown in FIG. 1. The encoder output has already completed feature extraction; in order to strengthen the feature correlation dependency between the template and the search region, the template sequence f_z^enc and the search-region sequence f_x^enc output by the encoder are fed as two branches into two Feature Cross-Correlation (FCC) modules. The two FCC modules receive the two inputs simultaneously: multi-head self-attention first adapts the attention of each branch to its useful information as a feature map, and multi-head cross-attention then fuses the two feature maps; after this process is repeated M times, the template cross-map f_z^cc and the search-region cross-map f_x^cc are output. The structure of the feature cross-correlation (FCC) module is shown in FIG. 3. As in the encoder, spatial position encoding is added to the inputs; the FCC is used to enhance the fitting ability of the model, and the process of the whole feature correlation network can be summarized as:

f̃_z = f_z^enc + PE,  f̃_x = f_x^enc + PE
f_z′ = LN(f̃_z + MSA(f̃_z)),  f_x′ = LN(f̃_x + MSA(f̃_x))
f_z″ = LN(f_z′ + MA(f_z′, f_x′, f_x′)),  f_x″ = LN(f_x′ + MA(f_x′, f_z′, f_z′))
…… (repeated M times)
Output: f_z^cc, f_x^cc
the feature correlation network performs fusion and correlation of the template and the search region by means of the cross-attention operation, focuses on the boundary information of the target object of interest, and deepens the feature understanding between the template and the search region.
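Under the same reading of the FCC block (self-attention within each branch followed by cross-attention between the two branches, repeated M times), a simplified PyTorch sketch might look as follows; the placement of residual connections and normalization, and the value M = 4, are assumptions of the example.

```python
import torch
import torch.nn as nn

class FCCLayer(nn.Module):
    """One round of feature cross-correlation: each branch first adapts its own
    features with multi-head self-attention, then the two branches are fused
    with multi-head cross-attention (a simplified reading of the FCC block)."""
    def __init__(self, d=256, h=8):
        super().__init__()
        self.self_z = nn.MultiheadAttention(d, h, batch_first=True)
        self.self_x = nn.MultiheadAttention(d, h, batch_first=True)
        self.cross_z = nn.MultiheadAttention(d, h, batch_first=True)
        self.cross_x = nn.MultiheadAttention(d, h, batch_first=True)
        self.norm = nn.ModuleList([nn.LayerNorm(d) for _ in range(4)])

    def forward(self, fz, fx, pe_z, pe_x):
        qz, qx = fz + pe_z, fx + pe_x                          # add spatial position encoding
        fz = self.norm[0](fz + self.self_z(qz, qz, fz)[0])     # self-attention, template branch
        fx = self.norm[1](fx + self.self_x(qx, qx, fx)[0])     # self-attention, search branch
        fz2 = self.norm[2](fz + self.cross_z(fz, fx, fx)[0])   # fuse search into template
        fx2 = self.norm[3](fx + self.cross_x(fx, fz, fz)[0])   # fuse template into search
        return fz2, fx2

M = 4  # number of repetitions (illustrative)
layers = nn.ModuleList([FCCLayer() for _ in range(M)])
fz, fx = torch.randn(1, 64, 256), torch.randn(1, 400, 256)
pe_z, pe_x = torch.zeros_like(fz), torch.zeros_like(fx)
for layer in layers:
    fz, fx = layer(fz, fx, pe_z, pe_x)
print(fz.shape, fx.shape)  # template cross-map and search-region cross-map
```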
Three, temporal update strategy
Since the appearance of the target object may change significantly over time, temporal information must be added to the spatial information to obtain the latest state of the target object for tracking. Therefore, a temporal-information branch is designed to update the temporal information and obtain the current appearance of the target. As shown in FIG. 2, a dynamically updated template is added to the encoder to capture updates from intermediate frames as input, and the encoder extracts spatio-temporal features by modeling the global relationships among all elements in the spatial and temporal dimensions, thereby effectively fusing the two types of information and enhancing robustness.
1. Score head. During tracking, the appearance of the template does not fluctuate obviously across several consecutive frames, and there are situations in which the dynamic template should not be updated, for example when the target is completely occluded or the tracker drifts, because the cropped template is then unreliable. To judge whether the current state is reliable, we propose a score head to control the update of the dynamically updated template, as shown in FIG. 4. It is a three-layer perceptron followed by a sigmoid activation. To simplify the framework and save computational cost, a threshold τ is set: when the score is higher than the threshold, the search region is regarded as containing the target, and the dynamically updated template is refreshed only after a frame interval of more than 50 frames.
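A small sketch of this update rule follows; the three-layer perceptron with sigmoid output, the threshold τ and the 50-frame interval come from the text above, while the hidden width and the value τ = 0.5 are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    """Three-layer perceptron followed by a sigmoid: predicts the confidence
    that the current search region really contains the target."""
    def __init__(self, d=256, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, target_query):           # (B, 1, d) decoder output
        return torch.sigmoid(self.mlp(target_query)).squeeze(-1)

def maybe_update_template(frame_idx, score, tau=0.5, interval=50):
    """The dynamic template z_d is refreshed only when the score head is
    confident and enough frames have elapsed since the last update."""
    return (frame_idx % interval == 0) and (score > tau)

score_head = ScoreHead()
score = score_head(torch.randn(1, 1, 256)).item()
print(maybe_update_template(frame_idx=150, score=score))
```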
Four, prediction head
Corner-based prediction head: considering the computing power of the UAV and avoiding the computational cost of post-processing the predicted target box, the structure of the prediction head must be concise and robust. To improve the quality of the target-box estimation, we design the prediction head using a corner-based approach. First, the similarity between the output of the feature correlation network and the output of the decoder is calculated. Since the output of the feature correlation network is the result of the fusion of the search region and the template, and the output of the decoder is the extraction of global-information features containing temporal variation, the result computed between the two contains a large amount of target boundary information and spatial information. The output is then multiplied element-wise with the output of the feature correlation network to enhance important regions and attenuate low-resolution regions. The new feature sequence is reshaped into a feature map f ∈ R^(d × H_x/s × W_x/s) and then fed into a simple fully convolutional network (FCN). The FCN is composed of L Conv-BN-ReLU layers and outputs two probability maps P_tl(x, y) and P_br(x, y) for the top-left and bottom-right corners of the object bounding box, respectively. Finally, the expected values of the corner probability distributions are computed to obtain the predicted box coordinates (x̂_tl, ŷ_tl) and (x̂_br, ŷ_br), as shown in equation (4):

(x̂_tl, ŷ_tl) = (Σ_y Σ_x x·P_tl(x, y), Σ_y Σ_x y·P_tl(x, y)),
(x̂_br, ŷ_br) = (Σ_y Σ_x x·P_br(x, y), Σ_y Σ_x y·P_br(x, y))    (4)
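The corner expectation of equation (4) amounts to a soft-argmax over each probability map; the following PyTorch sketch mirrors that computation, with the FCN depth and feature size chosen arbitrarily for the example.

```python
import torch
import torch.nn as nn

def soft_argmax(prob):
    """Expected corner coordinate over a (B, H, W) probability map, equation (4)."""
    B, H, W = prob.shape
    xs = torch.arange(W, dtype=prob.dtype).view(1, 1, W)
    ys = torch.arange(H, dtype=prob.dtype).view(1, H, 1)
    x = (prob * xs).sum(dim=(1, 2))
    y = (prob * ys).sum(dim=(1, 2))
    return x, y

class CornerHead(nn.Module):
    """L Conv-BN-ReLU layers ending in two 1-channel maps for the top-left and
    bottom-right corners; each map is normalized into a probability distribution."""
    def __init__(self, d=256, layers=4):
        super().__init__()
        blocks = []
        for _ in range(layers):
            blocks += [nn.Conv2d(d, d, 3, padding=1), nn.BatchNorm2d(d), nn.ReLU()]
        self.fcn = nn.Sequential(*blocks, nn.Conv2d(d, 2, 1))

    def forward(self, feat):                    # (B, d, H, W) fused feature map
        logits = self.fcn(feat)                 # (B, 2, H, W)
        B, _, H, W = logits.shape
        prob = logits.flatten(2).softmax(-1).view(B, 2, H, W)
        x_tl, y_tl = soft_argmax(prob[:, 0])
        x_br, y_br = soft_argmax(prob[:, 1])
        return torch.stack([x_tl, y_tl, x_br, y_br], dim=-1)

head = CornerHead()
print(head(torch.randn(1, 256, 20, 20)).shape)  # (1, 4) predicted corner coordinates
```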
Five, loss function
We divide the training process into two steps. In the first stage, the entire network except the score head is trained end-to-end; for this first step, localization is the primary task, it is ensured that all search images contain the target object, and the model is allowed to learn the localization capability. Combining the ℓ1 loss and the generalized IoU loss, the loss function can be written as equation (5):

L_loc = λ_iou·L_iou(b_i, b̂_i) + λ_L1·L_1(b_i, b̂_i)    (5)

where b_i and b̂_i denote the ground-truth bounding box and the predicted bounding box, respectively, and λ_iou and λ_L1 are hyper-parameters.
In the second stage, the score head is optimized using a binary cross-entropy loss, as shown in equation (6):

L_ce = y_i·log(P_i) + (1 − y_i)·log(1 − P_i)    (6)

where y_i is the ground-truth label of the bounding box and P_i is the confidence score; all other parameters are frozen to avoid affecting the localization ability. After the final model is trained in these two stages, the localization and classification capabilities can be learned simultaneously.
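A hedged sketch of the two training losses (equations (5) and (6)) is given below; the generalized IoU term is computed with torchvision, and the λ weights are illustrative values, not those used by the invention.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou

def localization_loss(pred_boxes, gt_boxes, lambda_iou=2.0, lambda_l1=5.0):
    """Equation (5): weighted sum of generalized IoU loss and l1 loss between
    predicted and ground-truth boxes in (x1, y1, x2, y2) format."""
    giou = torch.diag(generalized_box_iou(pred_boxes, gt_boxes))  # matched pairs only
    return lambda_iou * (1.0 - giou).mean() + lambda_l1 * F.l1_loss(pred_boxes, gt_boxes)

def score_loss(pred_score, label):
    """Equation (6): binary cross-entropy between the score-head confidence and
    the ground-truth label (1 if the search region contains the target)."""
    return F.binary_cross_entropy(pred_score, label)

# stage 1: train everything except the score head with localization_loss;
# stage 2: freeze the rest and train the score head with score_loss.
pred = torch.tensor([[10.0, 10.0, 50.0, 60.0]], requires_grad=True)
gt   = torch.tensor([[12.0,  8.0, 48.0, 62.0]])
print(localization_loss(pred, gt).item())
print(score_loss(torch.tensor([0.8]), torch.tensor([1.0])).item())
```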
Experimental procedures and effect comparison of the present embodiment
First, the implementation details and results of TransUAV on multiple benchmarks are presented and compared with multiple state-of-the-art methods. TransUAV is then evaluated on three challenging UAV benchmarks for quantitative and qualitative analysis of its tracking performance under UAV motion. Finally, an ablation study is carried out to analyze the influence of the key components of the proposed network and to comprehensively verify the effectiveness and superiority of the proposed tracker.
Second, results and comparison
We compare the TransUAV method with a number of representative trackers on six benchmarks, including the standard aerial tracking benchmarks UAV123, UAV123@10fps and UAVDT, two short-term benchmarks GOT-10K and TrackingNet, and one long-term benchmark LaSOT.
For fairness and objectivity of the evaluation, 32 recent and classical tracking methods are selected for comparison, including correlation-filter-based methods and deep-learning-based methods; the results are obtained by running the official code with the corresponding hyper-parameters. For the UAV123, UAV123@10fps, UAVDT, LaSOT and TrackingNet benchmarks, the experiments are based on one-pass evaluation (OPE) with two indicators, precision (P) and success rate (AUC). The GOT-10K benchmark uses two evaluation indexes, average overlap (AO) and success rate (SR).
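For reference, a small NumPy sketch of how these two OPE indicators are conventionally computed is shown below (center-error precision at a 20-pixel threshold and the area under the IoU success curve); the thresholds are the usual conventions, not values stated in this document.

```python
import numpy as np

def iou(a, b):
    """IoU between two (N, 4) arrays of boxes in (x1, y1, x2, y2) format."""
    lt = np.maximum(a[:, :2], b[:, :2])
    rb = np.minimum(a[:, 2:], b[:, 2:])
    wh = np.clip(rb - lt, 0, None)
    inter = wh[:, 0] * wh[:, 1]
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def ope_metrics(pred, gt):
    """Precision (P): fraction of frames whose center location error <= 20 px.
    Success (AUC): mean success rate over IoU thresholds from 0 to 1."""
    cp = (pred[:, :2] + pred[:, 2:]) / 2
    cg = (gt[:, :2] + gt[:, 2:]) / 2
    cle = np.linalg.norm(cp - cg, axis=1)
    precision = (cle <= 20).mean()
    ious = iou(pred, gt)
    thresholds = np.linspace(0, 1, 21)
    auc = np.mean([(ious > t).mean() for t in thresholds])
    return precision, auc

pred = np.array([[10, 10, 50, 60], [12, 14, 52, 66]], dtype=float)
gt   = np.array([[12,  8, 48, 62], [30, 30, 70, 80]], dtype=float)
print(ope_metrics(pred, gt))
```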
UAV123 is a large aerial tracking benchmark containing 123 challenging aerial video sequences and is one of the most authoritative and comprehensive datasets in the UAV tracking field. The tracking performance of 19 trackers under the most common aerial tracking conditions was evaluated on UAV123, with the results shown in Table one. Our proposed TransUAV ranks first in both success rate and precision. In precision, TransUAV achieves 0.904, a 2.6% improvement over the second-ranked PrDiMP50 and 2.8% over TransT, which is also based on the self-attention mechanism. In success rate, TransUAV achieves 0.692, 2.3% and 1.1% higher than PrDiMP50 and TransT, respectively. This fully demonstrates the effectiveness of the feature correlation network and of the temporal information acquisition. TransUAV is superior to the other trackers on every attribute, with good robustness and interference rejection; in particular, its AUC reaches 0.680 and 0.641 under the similar-object and partial-occlusion attributes, 3.9% and 2.2% higher than the second-best tracker.
Table one: success rate (AUC) and precision (P) of TransUAV and multiple trackers tested on the datasets UAV123, UAV123@10fps and UAVDT.
Table two: attribute-based AUC scores of TransUAV and multiple trackers on the dataset UAV123, with red representing the first-ranked tracker.
Table three: attribute-based AUC scores of TransUAV and multiple trackers on the dataset UAV123@10fps, with red representing the first-ranked tracker.
To strictly evaluate the TransUAV method, we also use the more challenging UAV123@10fps benchmark. This benchmark adopts an image rate of 10 frames per second, so the motion and change of the target between consecutive frames are much more severe, which greatly increases the tracking difficulty. As shown in Table one, it is clear from the comparison with other trackers that TransUAV ranks first in precision (0.904) and success rate (0.694), 2.7% and 2.1% higher than the second-placed PrDiMP50. Compared with UAV123, its tracking performance does not decrease and excellent robustness is maintained. It also excels on most attributes without a substantial decrease in overall performance, as shown in Table three, demonstrating that TransUAV is able to handle large changes in target appearance over time.
UAVDT consists of 50 challenging sequences and mainly focuses on aerial tracking in various unconstrained complex scenes such as high density, small targets and adverse weather conditions. We evaluate 13 trackers on UAVDT; as shown in Table one, the TransUAV tracker obtains the best precision score (83.4) and success rate (60.7), 2.5% and 1.5% higher than the second-ranked PrDiMP50. Furthermore, our evaluation on each attribute performs well, as shown in Table four, especially in the cases of target blur and target motion, demonstrating that TransUAV is suitable for complex aerial tracking scenarios.
Table four: attribute-based AUC scores of TransUAV and multiple trackers on the dataset UAVDT, with red representing the first-ranked tracker.
LaSOT is a large-scale, high-quality target tracking dataset released in 2019, containing 1400 challenging videos, 1120 for training and 280 for testing. We compare 16 trackers on the test set; Table six shows that TransUAV achieves the best performance, with AUC and P scores of 0.649 and 0.693. Although the advantage of TransUAV over TransT is not large, TransUAV performs better than the other trackers on every attribute-based evaluation, as shown in Table five, demonstrating that the approach presented herein is not only applicable to tracking scenarios under an aerial view but also remains effective on a general large-scale tracking benchmark.
Table five: attribute-based AUC scores of TransUAV and multiple trackers on the dataset LaSOT, with red representing the first-ranked tracker.
Table six: success rate (AUC) and precision (P) of TransUAV and multiple trackers tested on the dataset LaSOT; red, green and blue denote the trackers ranked first, second and third, respectively.
TrackingNet is a large short-term tracking dataset whose test set contains 511 sequences covering various object classes and scenes. We submitted the output of TransUAV to the official evaluation server, with the results shown in Table seven: TransUAV obtains 81.2%, 77.7% and 85.1% in AUC, P and P_norm respectively, second only to SiamRCNN. However, the running speed of SiamRCNN on our equipment is less than 5 FPS and cannot meet the real-time requirement, whereas TransUAV runs at 42 FPS.
The GOT-10k dataset contains 10k training sequences and 180 test sequences. We use the test sequences for model testing and submit the output to the official evaluation server; the AO and SR scores of multiple trackers are reported in Table seven. TransUAV achieves the best performance, with AO, SR_0.5 and SR_0.75 scores 3.9%, 5.1% and 3.6% higher than those of SiamRCNN, respectively.
Table seven: overall performance on TrackingNet and GOT-10k.
Third, real-world experiment
We implement the tracker on a UAV to verify its performance in a real-world environment. In the experiment, an NVIDIA Jetson AGX Xavier is used as the onboard computer and a Pixhawk 4 as the flight controller. In the tests we observe average GPU and CPU utilization of 37% and 52%, respectively.
FIG. 5 shows the tracking performance tested with TransUAV in real scenes, using the center location error (CLE) to evaluate the tracking performance and involving occlusion, low resolution, motion blur, contrast variation and other scenarios. As shown in FIG. 5(a), TransUAV can still maintain excellent stability and robustness when encountering partial occlusion and motion jitter. FIGS. 5(b) and (c) show the stable tracking performance of TransUAV, which effectively reduces redundancy and corrects the tracked target, achieving satisfactory tracking even under interference from similar targets. Furthermore, without TensorRT acceleration, we maintain an onboard running speed of over 12 FPS, indicating the utility and feasibility of TransUAV under complex aerial tracking conditions.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that various changes in the embodiments and/or modifications of the invention can be made, and equivalents and modifications of some features of the invention can be made without departing from the spirit and scope of the invention.

Claims (14)

1. A new unmanned aerial vehicle target tracking algorithm, characterized in that: the algorithm utilizes the principle of the self-attention mechanism; the self-attention mechanism scans each element in the sequence and updates it by aggregating information from the whole sequence, thereby focusing on the relationships within the global information and enabling long-range interactions.
2. The new drone target tracking algorithm according to claim 1, characterized in that: the algorithm includes an attention-based tracking framework; the tracking framework comprises a feature extraction network, a feature correlation network, a temporal update strategy and a prediction head;
the feature extraction network is used for acquiring feature information and reducing the data dimensionality; the feature correlation network learns the relationships between the inputs by means of an attention mechanism, predicts the spatial position of the target object, and effectively captures the features of the template and the region of interest with richer feature-map associations;
the temporal update strategy is used for capturing the change of the target object over time, enhancing robustness;
the prediction head is used for estimating the target object in the current frame.
3. The new unmanned aerial vehicle target tracking algorithm according to claim 2, characterized in that the specific algorithm of the feature extraction network is to take three groups of image patches as input, namely a template image z ∈ R^(3×H_z×W_z), a search-region image x ∈ R^(3×H_x×W_x), and an additional dynamically updated template image z_d ∈ R^(3×H_z×W_z), which is used for capturing the change of the target appearance over time and providing additional temporal information; ResNet is adopted as the backbone for feature extraction, with the last stage and the fully connected layer of ResNet removed; after passing through the backbone, the templates z, z_d and the search-region image x are mapped to three feature maps f_z, f_d and f_x, each with C channels and a spatial resolution reduced by the backbone stride.
4. The new unmanned aerial vehicle target tracking algorithm according to claim 2, characterized in that the specific algorithm of the feature correlation network extends the range of feature fusion based on an attention mechanism, obtains global feature information, and avoids falling into local optima, thereby enhancing the ability to capture long-range features;
the feature correlation network comprises an encoder, a decoder, and feature cross-correlation (FCC).
5. The new drone target tracking algorithm according to claim 4, characterized in that: the inputs of the attention mechanism are a query matrix Q, a key matrix K and a value matrix V, and the scaled dot-product attention function is defined as equation (1):

Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V    (1)

where d_k denotes the key dimension; softmax is applied row-wise; the dot product of the query and the key is divided by √d_k to alleviate the vanishing-gradient problem of the softmax function; when Q = K = V, the attention mechanism becomes a self-attention mechanism.
6. The new drone target tracking algorithm according to claim 5, characterized in that: the attention mechanism adopts a plurality of attention heads to attend to different aspects of the information, namely, in order to obtain information from different representation subspaces at different positions, Q, K and V are projected into multiple linear subspaces for the attention computation of equation (1), and the attention results in each subspace are concatenated to obtain multi-head attention (MA), defined as equation (2):

MA(Q, K, V) = Concat(H_1, …, H_h)·W^O,  H_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)    (2)

where W_i^Q, W_i^K, W_i^V and W^O are the parameter matrices of the projections, and h is the number of heads.
7. The new drone target tracking algorithm according to claim 4, characterized in that: the encoder and the decoder are both stacks of N identical layers;
each encoder layer comprises a multi-head self-attention (MSA) module and a feed-forward network (FFN) sub-module, and the output of each module employs a residual connection and layer normalization (LN);
each decoder layer comprises a third sub-module, which takes the enhanced feature sequence from the encoder as input to perform multi-head attention; residual connections and layer normalization are employed around each sub-layer.
8. The new drone target tracking algorithm according to claim 7, characterized in that: the input of the encoder is preprocessed, namely a bottleneck layer is adopted to reduce the number of channels from C to d, and then the three groups of feature maps are compressed and flattened along the spatial dimension to generate a feature sequence whose length equals the total number of spatial positions of the three feature maps and whose dimension is d.
9. The new drone target tracking algorithm according to any one of claims 7 or 8, characterized in that the mathematical process of the encoder and the decoder is:

Encoder (layer n):
I_1 = Flatten(f_z, f_d, f_x)
I_n = I_n + PE
I_n′ = LN(I_n + MSA(I_n))
I_(n+1) = LN(I_n′ + FFN(I_n′))
……

Decoder (layer n):
T_n = T_n + PE
T_n′ = LN(T_n + MSA(T_n))
I_N′ = I_N + PE
T_n″ = LN(T_n′ + MA(T_n′, I_N′, I_N))
T_(n+1) = LN(T_n″ + FFN(T_n″))
……
Output: T_N

where n denotes the n-th layer, n ∈ [0, 1, 2, …, N], N denotes the total number of layers, and I_N and T_N denote the outputs of the last layer of the encoder and the decoder, respectively; sinusoidal position embeddings, denoted position encoding (PE), are added to the input sequence, as shown in equation (3):

PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))    (3)

where pos denotes each position in the sequence, pos ∈ [0, max_sequence_length), and i ∈ [0, d/2) indexes the dimensions of the vector;
the above algorithm captures the features of all elements in the sequence and uses global context information to enhance the original features so as to perform global reasoning on the target, enabling the target query to attend to all positions of the template and search-region features for the final bounding-box prediction.
10. The new drone target tracking algorithm according to claim 4, characterized in that the specific algorithm of the feature cross-correlation (FCC) is: the template sequence f_z^enc and the search-region sequence f_x^enc output by the encoder in the feature correlation network are fed as two branches into two feature cross-correlation (FCC) modules; each module receives the two inputs simultaneously, multi-head self-attention first adapts the attention of each branch to its useful information as a feature map, multi-head cross-attention then fuses the two feature maps, and after this process is repeated M times the template cross-map f_z^cc and the search-region cross-map f_x^cc are output; spatial position encoding is added to the inputs, and the mathematical process can be summarized as:

f̃_z = f_z^enc + PE,  f̃_x = f_x^enc + PE
f_z′ = LN(f̃_z + MSA(f̃_z)),  f_x′ = LN(f̃_x + MSA(f̃_x))
f_z″ = LN(f_z′ + MA(f_z′, f_x′, f_x′)),  f_x″ = LN(f_x′ + MA(f_x′, f_z′, f_z′))
…… (repeated M times)
Output: f_z^cc, f_x^cc

the feature correlation network performs fusion and correlation of the template and the search region by means of the cross-attention operation, focuses on the boundary information of the target object of interest, and deepens the feature understanding between the template and the search region.
11. The new drone target tracking algorithm according to claim 2, characterized in that: the temporal update strategy adds temporal information to the spatial information to acquire the latest state of the target object for tracking; specifically, a dynamically updated template is added as an input to the encoder in the feature correlation network to capture updates from intermediate frames, and the encoder extracts spatio-temporal features by modeling the global relationships among all elements in the spatial and temporal dimensions, thereby effectively fusing the two types of information and enhancing robustness.
12. The new drone target tracking algorithm according to claim 11, characterized in that: the control of the temporal update is realized by a score head; the score head is a three-layer perceptron followed by a sigmoid activation, and a threshold τ is set; when the score is higher than the threshold, the search region is regarded as containing the target, and the dynamic template is updated at a set frame interval.
13. The new drone target tracking algorithm according to claim 2, characterized in that: the prediction head is designed using a corner-based method; the output of the feature correlation network and the output of the decoder are multiplied element-wise to enhance important regions and weaken low-resolution regions; the new feature sequence is reshaped into a feature map f ∈ R^(d × H_x/s × W_x/s) and then fed into a fully convolutional network (FCN); the FCN is composed of L Conv-BN-ReLU layers and outputs two probability maps P_tl(x, y) and P_br(x, y) for the top-left and bottom-right corners of the object bounding box, respectively; finally, the expected values of the corner probability distributions are computed to obtain the predicted box coordinates (x̂_tl, ŷ_tl) and (x̂_br, ŷ_br), as shown in equation (4):

(x̂_tl, ŷ_tl) = (Σ_y Σ_x x·P_tl(x, y), Σ_y Σ_x y·P_tl(x, y)),
(x̂_br, ŷ_br) = (Σ_y Σ_x x·P_br(x, y), Σ_y Σ_x y·P_br(x, y))    (4)
14. The new drone target tracking algorithm according to claim 2, characterized in that: the whole tracking framework is trained with defined loss functions, and the training process comprises two steps;
step one: the whole network except the score head in the temporal update strategy is trained end-to-end for localization, i.e. the model is allowed to learn the localization capability; combining the ℓ1 loss and the generalized IoU loss, the loss function can be written as equation (5):

L_loc = λ_iou·L_iou(b_i, b̂_i) + λ_L1·L_1(b_i, b̂_i)    (5)

where b_i and b̂_i denote the ground-truth bounding box and the predicted bounding box, respectively, and λ_iou and λ_L1 are hyper-parameters;
step two: the score head is optimized with a binary cross-entropy loss, as shown in equation (6):

L_ce = y_i·log(P_i) + (1 − y_i)·log(1 − P_i)    (6)

where y_i is the ground-truth label of the bounding box and P_i is the confidence score; all other parameters are frozen to avoid affecting the localization ability;
after the final model is trained in these two stages, the localization and classification capabilities can be learned simultaneously.
CN202210695272.2A 2022-06-17 2022-06-17 Novel target tracking algorithm for unmanned aerial vehicle Pending CN114972439A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210695272.2A CN114972439A (en) 2022-06-17 2022-06-17 Novel target tracking algorithm for unmanned aerial vehicle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210695272.2A CN114972439A (en) 2022-06-17 2022-06-17 Novel target tracking algorithm for unmanned aerial vehicle

Publications (1)

Publication Number Publication Date
CN114972439A true CN114972439A (en) 2022-08-30

Family

ID=82963886

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210695272.2A Pending CN114972439A (en) 2022-06-17 2022-06-17 Novel target tracking algorithm for unmanned aerial vehicle

Country Status (1)

Country Link
CN (1) CN114972439A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115620321A (en) * 2022-10-20 2023-01-17 北京百度网讯科技有限公司 Table identification method and device, electronic equipment and storage medium
CN116402858A (en) * 2023-04-11 2023-07-07 合肥工业大学 Transformer-based space-time information fusion infrared target tracking method
CN116402858B (en) * 2023-04-11 2023-11-21 合肥工业大学 Transformer-based space-time information fusion infrared target tracking method
CN117011342A (en) * 2023-10-07 2023-11-07 南京信息工程大学 Attention-enhanced space-time transducer vision single-target tracking method

Similar Documents

Publication Publication Date Title
CN113449680B (en) Knowledge distillation-based multimode small target detection method
CN114972439A (en) Novel target tracking algorithm for unmanned aerial vehicle
CN111161315B (en) Multi-target tracking method and system based on graph neural network
CN110728698B (en) Multi-target tracking system based on composite cyclic neural network system
CN110874578A (en) Unmanned aerial vehicle visual angle vehicle identification and tracking method based on reinforcement learning
CN110705412A (en) Video target detection method based on motion history image
CN115205233A (en) Photovoltaic surface defect identification method and system based on end-to-end architecture
CN117058456A (en) Visual target tracking method based on multiphase attention mechanism
CN116563355A (en) Target tracking method based on space-time interaction attention mechanism
CN115984330A (en) Boundary-aware target tracking model and target tracking method
CN115239765A (en) Infrared image target tracking system and method based on multi-scale deformable attention
CN115393690A (en) Light neural network air-to-ground observation multi-target identification method
CN117542045B (en) Food identification method and system based on space-guided self-attention
Mann et al. Predicting future occupancy grids in dynamic environment with spatio-temporal learning
CN110111358B (en) Target tracking method based on multilayer time sequence filtering
CN116820131A (en) Unmanned aerial vehicle tracking method based on target perception ViT
CN116246338B (en) Behavior recognition method based on graph convolution and transducer composite neural network
CN116844004A (en) Point cloud automatic semantic modeling method for digital twin scene
CN115830707A (en) Multi-view human behavior identification method based on hypergraph learning
Shukla et al. UBOL: User-Behavior-aware one-shot learning for safe autonomous driving
CN116797799A (en) Single-target tracking method and tracking system based on channel attention and space-time perception
CN116363610A (en) Improved YOLOv 5-based aerial vehicle rotating target detection method
CN115841596A (en) Multi-label image classification method and training method and device of multi-label image classification model
Dao et al. Attention-based proposals refinement for 3D object detection
CN113269118A (en) Monocular vision forward vehicle distance detection method based on depth estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination