CN116402858A - Transformer-based space-time information fusion infrared target tracking method - Google Patents


Info

Publication number
CN116402858A
CN116402858A
Authority
CN
China
Prior art keywords
network
formula
infrared
sub
prediction
Prior art date
Legal status
Granted
Application number
CN202310406030.1A
Other languages
Chinese (zh)
Other versions
CN116402858B (en)
Inventor
齐美彬
汪沁昕
庄硕
张可
李坤袁
刘一敏
杨艳芳
Current Assignee
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date
Filing date
Publication date
Application filed by Hefei University of Technology
Priority to CN202310406030.1A
Publication of CN116402858A
Application granted
Publication of CN116402858B
Legal status: Active

Classifications

    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/08 Learning methods
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/10048 Infrared image
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20221 Image fusion; Image merging
    • G06T 2207/30241 Trajectory
    • Y02A 90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
    • Y04S 10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a Transformer-based space-time information fusion infrared target tracking method, which comprises the following steps: first, preprocessing an infrared image; second, constructing an infrared target tracking network comprising an infrared image feature extraction sub-network, an infrared image feature fusion sub-network, a corner prediction head sub-network, a salient point focusing sub-network and an IOU-Aware target state evaluation head sub-network; third, constructing a loss function for the infrared target tracking network; and fourth, optimizing the infrared target tracking network with a two-stage training method. By designing these components, the invention fuses spatial and temporal information during infrared target tracking, with the aim of improving the accuracy and robustness of infrared target tracking across different tracking scenes.

Description

Transformer-based space-time information fusion infrared target tracking method
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a Transformer-based space-time information fusion infrared target tracking method.
Background
Thermal infrared target tracking is a promising research direction in visual target tracking: given the initial state of the target to be tracked in an infrared video sequence, the task is to continuously predict the position of the target in subsequent video frames. Because infrared imaging does not depend on illumination intensity and is related only to the temperature radiated by objects, infrared target tracking can follow a target under low visibility and even complete darkness, giving it all-weather capability in complex environments. It is therefore widely applied in fields such as maritime rescue, video surveillance and night driving assistance.
Despite these unique advantages, infrared target tracking also faces a number of challenges. Infrared targets carry no color information, lack rich texture features and have blurred contours. These shortcomings deprive infrared targets of local detail features, preventing feature extraction models designed for visible-light images from obtaining a strongly discriminative representation of the infrared target. In addition, thermal infrared tracking must cope with thermal crossover, occlusion, scale changes and similar difficulties. To address these problems, infrared target tracking models based on hand-crafted features have been proposed; although these methods have made some progress, the limited representation capability of hand-crafted features still caps tracker performance.
Given the powerful feature representation capability of convolutional neural networks, researchers have begun to introduce CNN features into the infrared target tracking task. For example, MCFTS uses a pre-trained convolutional neural network to extract features from multiple convolutional layers of a thermal infrared target and, combined with correlation filters, builds an ensemble infrared tracker. In recent years, Siamese networks have been widely applied to visible-light tracking, treating tracking as a matching problem: the matching network is trained offline and then used for online tracking. Inspired by this, many infrared trackers based on the Siamese framework have emerged. MMNet integrates TIR-specific discriminative features and fine-grained features with a multi-task matching framework, and SiamMSS proposes a multi-group spatial shift model to enhance the details of the feature map. However, existing Siamese infrared trackers focus only on spatial information: either the first frame is used as a fixed template to match the target in subsequent frames, or a correlation filter is combined with the Siamese network to update the template from historical predictions. Although these algorithms achieve good performance and real-time speed in many conventional tracking scenarios, they can drift severely, and cannot recover from tracking failure, when the target undergoes drastic appearance change, non-rigid deformation or partial occlusion.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art and provides a Transformer-based space-time information fusion infrared target tracking method, which captures global dependencies among infrared image features through the Transformer attention mechanism and introduces valuable space-time information into the model by exploiting salient point information together with an IOU-Aware evaluation criterion, so as to further improve the accuracy and robustness of infrared target tracking.
The invention adopts the following scheme to solve the problems:
the invention relates to a Transformer-based space-time information fusion infrared target tracking method, which is characterized by comprising the following steps:
step one, preprocessing an infrared image;
step 1.1: arbitrarily selecting a video sequence V containing an infrared target Obj from an infrared target tracking data set, and cropping and scaling the i-th frame image V_i, the j-th frame image V_j and the k-th frame image V_k of the video sequence V to obtain, respectively, a preprocessed static template image V_i′ ∈ ℝ^(H_T×W_T×C′), a preprocessed dynamic template image V_j′ ∈ ℝ^(H_D×W_D×C′) and a preprocessed search image V_k′ ∈ ℝ^(H_S×W_S×C′); V_i′, V_j′, V_k′ are taken as the input of the infrared target tracking network, where H_T, W_T are the height and width of V_i′, H_D, W_D are the height and width of V_j′, H_S, W_S are the height and width of V_k′, and C′ is the number of channels of each image;
step two, constructing an infrared target tracking network, which comprises the following steps: an infrared image feature extraction sub-network, an infrared image feature fusion sub-network, a corner prediction head sub-network, a salient point focusing sub-network and an IOU-Aware target state evaluation head sub-network;
step 2.1: the feature extraction sub-network is a ResNet50 network that extracts features from the preprocessed static template image V_i′, dynamic template image V_j′ and search image V_k′, yielding a static template feature map F_T ∈ ℝ^((H_T/d)×(W_T/d)×C), a dynamic template feature map F_D ∈ ℝ^((H_D/d)×(W_D/d)×C) and a search image feature map F_S ∈ ℝ^((H_S/d)×(W_S/d)×C), where d is the downsampling factor of the feature extraction network and C is the number of channels of the downsampled feature maps;
step 2.2: flattening F_T, F_D and F_S along the spatial dimension to obtain the corresponding static template feature sequence f_T ∈ ℝ^((H_T·W_T/d²)×C), dynamic template feature sequence f_D ∈ ℝ^((H_D·W_D/d²)×C) and search image feature sequence f_S ∈ ℝ^((H_S·W_S/d²)×C), and concatenating them to obtain the mixed feature sequence f_m ∈ ℝ^(M×C), where M = (H_T·W_T + H_D·W_D + H_S·W_S)/d²;
step 2.3: adding a sinusoidal position code P_pos ∈ ℝ^(M×C) to the mixed feature sequence f_m to obtain the mixed feature sequence containing position codes f_M ∈ ℝ^(M×C);
Step 2.4: constructing the infrared image feature fusion sub-network for the mixed feature sequence f M Processing to obtain a search feature map
Figure BDA00041814106200000214
Step 2.5: the corner prediction head sub-network consists of two full networksConvolutional network composition, each full convolutional network comprising A stacked Conv-BN-ReLU layers and one Conv layer for F S The prediction boundary frame of the' contained infrared target Obj carries out angular point probability prediction, so that the two full convolution networks respectively output angular point probability distribution diagrams of the upper left corner of the prediction boundary frame
Figure BDA00041814106200000215
And the corner probability distribution map of the lower right corner +.>
Figure BDA0004181410620000031
Step 2.6: calculating the upper left corner coordinates (x 'of the prediction boundary box using equation (1)' tl ,y′ tl ) And lower right angular position (x' br ,y′ br ) Thereby obtaining the search image V 'of the infrared target Obj' k In (c) a prediction bounding box B '= (x' tl ,y′ tl ,x′ br ,y′ br ) Wherein (x, y) represents the corner probability distribution map P tl ,P br Upper coordinates, and
Figure BDA0004181410620000032
Figure BDA0004181410620000033
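As an illustration only (not part of the claimed embodiment), a minimal PyTorch-style sketch of the expectation computation of formula (1) is given below; the map size and the peaked toy inputs are assumptions.

```python
import torch

def corners_from_probmaps(p_tl: torch.Tensor, p_br: torch.Tensor):
    """Expected corner coordinates from corner probability maps (formula (1)).

    p_tl, p_br: (H, W) maps that each sum to 1 (softmax over all positions).
    Returns (x_tl, y_tl, x_br, y_br) on the probability-map grid.
    """
    h, w = p_tl.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=p_tl.dtype),
                            torch.arange(w, dtype=p_tl.dtype), indexing="ij")
    x_tl = (xs * p_tl).sum()
    y_tl = (ys * p_tl).sum()
    x_br = (xs * p_br).sum()
    y_br = (ys * p_br).sum()
    return x_tl, y_tl, x_br, y_br

# toy usage: sharply peaked probability maps recover the peak location in expectation
logits_tl = torch.zeros(20, 20); logits_tl[3, 4] = 10.0
logits_br = torch.zeros(20, 20); logits_br[15, 17] = 10.0
p_tl = logits_tl.flatten().softmax(0).view(20, 20)
p_br = logits_br.flatten().softmax(0).view(20, 20)
print(corners_from_probmaps(p_tl, p_br))
```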
step 2.7: the salient point focusing sub-network is used for extracting salient point features F_sp ∈ ℝ^(L×C);
step 2.8: the IOU-Aware target state evaluation head sub-network consists of a multi-layer perceptron; all salient point features F_sp lying inside B_F in F_S′ are input into the IOU-Aware target state evaluation head sub-network, which outputs the IOU Score of the predicted bounding box B′;
step three, constructing a loss function of the infrared target tracking network;
step 3.1: constructing the loss function L_bp of the corner prediction head sub-network with formula (2):
L_bp = λ_L1·L1_loss + λ_GIOU·GIOU_loss   (2)
In formula (2), λ_L1 and λ_GIOU are real-valued hyperparameters, and B = (x_tl, y_tl, x_br, y_br) denotes the four corner coordinates of the ground-truth box of the infrared target Obj; L1_loss denotes the loss over the four corner distances between the predicted bounding box and the ground-truth box, given by formula (3); GIOU_loss denotes the generalized intersection-over-union loss between the predicted bounding box and the ground-truth box, given by formula (4);
L1_loss = Σ_t |B′_t − B_t|,  t = 1, …, 4   (3)
In formula (3), B′_t denotes the t-th corner coordinate of the predicted bounding box B′ and B_t denotes the t-th corner coordinate of the ground-truth box B;
GIOU_loss = 1 − GIOU   (4)
In formula (4), GIOU denotes the generalized intersection-over-union of B′ and B, given by formula (5);
GIOU = IOU − (rec − union)/rec   (5)
In formula (5), rec denotes the area of the smallest rectangle enclosing B′ and B, given by formula (6); IOU denotes the intersection-over-union of B′ and B, given by formula (8);
rec = (x_4 − x_1)(y_4 − y_1)   (6)
In formula (6), x_4, y_4 denote the maxima of the lower-right corner coordinates of B′ and B, and x_1, y_1 denote the minima of the upper-left corner coordinates of B′ and B, given by formula (7);
x_1 = min(x′_tl, x_tl),  y_1 = min(y′_tl, y_tl),  x_4 = max(x′_br, x_br),  y_4 = max(y′_br, y_br)   (7)
IOU = inter/union   (8)
In formula (8), union denotes the union area of B′ and B, given by formula (9);
union = S′ + S − inter   (9)
In formula (9), inter denotes the intersection area of B′ and B, given by formula (10); S′ denotes the area of B′ and S denotes the area of B, given by formula (11);
inter = (x_3 − x_2)(y_3 − y_2)   (10)
In formula (10), x_2, y_2 denote the maxima of the upper-left corner coordinates of B′ and B, and x_3, y_3 denote the minima of the lower-right corner coordinates of B′ and B, given by formula (12);
S′ = B′_w·B′_h,  S = B_w·B_h   (11)
x_2 = max(x′_tl, x_tl),  y_2 = max(y′_tl, y_tl),  x_3 = min(x′_br, x_br),  y_3 = min(y′_br, y_br)   (12)
In formula (11), B′_w, B′_h denote the width and height of B′ and B_w, B_h denote the width and height of B, given by formula (13):
B′_w = x′_br − x′_tl,  B′_h = y′_br − y′_tl,  B_w = x_br − x_tl,  B_h = y_br − y_tl   (13)
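For illustration, a minimal PyTorch-style sketch of the corner-prediction loss of formulas (2)-(13) follows; the λ values and the clamping used for non-overlapping boxes are assumptions not taken from the patent.

```python
import torch

def corner_giou_loss(b_pred, b_gt, lambda_l1=5.0, lambda_giou=2.0):
    """Sketch of formula (2): lambda_L1 * L1_loss + lambda_GIOU * GIOU_loss.
    b_pred, b_gt: tensors (x_tl, y_tl, x_br, y_br); the lambda defaults are placeholders."""
    l1 = (b_pred - b_gt).abs().sum()                                   # formula (3)

    x_tl_p, y_tl_p, x_br_p, y_br_p = b_pred
    x_tl_g, y_tl_g, x_br_g, y_br_g = b_gt
    s_pred = (x_br_p - x_tl_p) * (y_br_p - y_tl_p)                     # formulas (11)(13)
    s_gt = (x_br_g - x_tl_g) * (y_br_g - y_tl_g)

    inter_w = torch.clamp(torch.min(x_br_p, x_br_g) - torch.max(x_tl_p, x_tl_g), min=0)
    inter_h = torch.clamp(torch.min(y_br_p, y_br_g) - torch.max(y_tl_p, y_tl_g), min=0)
    inter = inter_w * inter_h                                          # formulas (10)(12), clamped
    union = s_pred + s_gt - inter                                      # formula (9)
    iou = inter / union.clamp(min=1e-6)                                # formula (8)

    rec = ((torch.max(x_br_p, x_br_g) - torch.min(x_tl_p, x_tl_g)) *
           (torch.max(y_br_p, y_br_g) - torch.min(y_tl_p, y_tl_g)))    # formulas (6)(7)
    giou = iou - (rec - union) / rec.clamp(min=1e-6)                   # formula (5)

    return lambda_l1 * l1 + lambda_giou * (1 - giou)                   # formulas (2)(4)

print(corner_giou_loss(torch.tensor([10., 10., 50., 60.]),
                       torch.tensor([12., 8., 48., 58.])))
```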
Step 3.2: constructing a loss function L of the IOU-Aware target state evaluation head subnetwork by using a formula (14) IATSE
L IATSE =-|IOU-Score| β ((1-IOU)log(1-Score)+IOU log(Score)) (14)
In the formula (14), beta is a real-number category super parameter;
step four, optimizing the infrared target tracking network with a two-stage training method;
step 4.1: during the first-stage training, freezing the IOU-Aware target state evaluation head sub-network, training the other parts of the infrared target tracking network with a gradient descent algorithm, and updating the network parameters by minimizing the loss function shown in formula (2); training stops when the number of training iterations reaches the set value, yielding the preliminarily trained infrared target tracking network;
step 4.2: during the second-stage training, freezing the preliminarily trained infrared image feature extraction sub-network, infrared image feature fusion sub-network and salient point focusing sub-network, training the preliminarily trained corner prediction head sub-network and IOU-Aware target state evaluation head sub-network with a gradient descent algorithm, and updating the network parameters by minimizing the loss function shown in formula (15); training stops when the number of training iterations reaches the set value, yielding a trained infrared target tracking model for continuous and accurate localization of the infrared target;
L_total = L_bp + λ_IATSE·L_IATSE   (15)
In formula (15), λ_IATSE is a real-valued hyperparameter.
The Transformer-based space-time information fusion infrared target tracking method is further characterized in that the infrared image feature fusion sub-network of step 2.4 comprises a Transformer-based encoder module, a Transformer-based decoder module and a codec post-processing module, and the search feature map F_S′ is obtained as follows:
step 2.4.1: the Transformer-based encoder module consists of R multi-head self-attention blocks; the mixed feature sequence containing position codes f_M is input into the encoder module, which models global relationships in the spatial and temporal dimensions to obtain a discriminative spatio-temporal feature sequence f_M′, R being the number of multi-head self-attention blocks in the encoder module;
step 2.4.2: the Transformer-based decoder module consists of N multi-head self-attention blocks; the spatio-temporal feature sequence f_M′ and a single target query q ∈ ℝ^(1×C) are input into the decoder module for cross-attention processing, which outputs the enhanced target query oq ∈ ℝ^(1×C), N being the number of multi-head self-attention blocks in the decoder module;
step 2.4.3: the codec post-processing module decouples the search region feature sequence f_S′ ∈ ℝ^((H_S·W_S/d²)×C) from the spatio-temporal feature sequence f_M′, computes the similarity score att between f_S′ and oq, multiplies att and f_S′ element-wise to obtain the enhanced search region feature sequence f_S″ ∈ ℝ^((H_S·W_S/d²)×C), and finally restores f_S″ to the enhanced search feature map F_S′ ∈ ℝ^((H_S/d)×(W_S/d)×C).
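The sketch below illustrates this encoder-decoder fusion and post-processing in PyTorch form. It is only an approximation under stated assumptions: the layer class, head count, softmax over the similarity scores and the token ordering (search tokens last) are not specified by the patent.

```python
import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    """Rough sketch of the feature-fusion sub-network of step 2.4 (encoder, decoder,
    post-processing). The patent specifies R = N = 6 blocks; other details are assumed."""
    def __init__(self, c=256, n_heads=8, r=6, n=6):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=c, nhead=n_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model=c, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=r)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=n)
        self.query = nn.Parameter(torch.randn(1, 1, c))    # single target query q

    def forward(self, f_m, n_search):
        # f_m: (B, M, C) mixed sequence with position encoding already added
        f_m_enc = self.encoder(f_m)                                        # step 2.4.1
        oq = self.decoder(self.query.expand(f_m.size(0), -1, -1), f_m_enc) # step 2.4.2
        f_s = f_m_enc[:, -n_search:, :]                    # step 2.4.3: decouple search tokens
        att = torch.softmax((f_s @ oq.transpose(1, 2)).squeeze(-1), dim=-1)  # similarity att
        return f_s * att.unsqueeze(-1)                     # enhanced search sequence

fusion = FusionSketch()
tokens = torch.randn(2, 64 + 64 + 400, 256)               # two template maps + one search map
print(fusion(tokens, n_search=400).shape)                  # (2, 400, 256)
```

Reshaping the enhanced sequence back to the (H_S/d)×(W_S/d)×C map of step 2.4.3 is a simple view/permute and is omitted here.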
The salient point focusing sub-network of step 2.7 comprises a salient point coordinate prediction module and a salient point feature extraction module, and obtains the salient point features contained in the search image V_k′ as follows:
step 2.7.1: the salient point coordinate prediction module maps B′ onto F_S′ to obtain the mapped coordinates B_F, and then extracts the region-level features F_P ∈ ℝ^(K×K×C) corresponding to B_F from F_S′ by the ROIAlign operation, where K denotes the width and height of F_P;
the salient point coordinate prediction module applies a convolutional layer to F_P for dimension reduction to obtain the reduced region-level features F_P′ ∈ ℝ^(K×K×C′), flattens F_P′ into a one-dimensional tensor f_P ∈ ℝ^(1×(K·K·C′)), and feeds it into a multi-layer perceptron to predict the coordinates Loc_sp ∈ ℝ^(1×2L) of the L salient points of F_P, where C′ denotes the number of channels of F_P′ and L denotes the number of salient points;
step 2.7.2: Loc_sp is reshaped into a two-dimensional tensor Loc_sp′ ∈ ℝ^(L×2), after which the salient point feature extraction module samples the salient point features F_sp ∈ ℝ^(L×C) corresponding to Loc_sp′ from F_P by bilinear interpolation.
The electronic device of the invention comprises a memory and a processor, wherein the memory is used for storing a program for supporting the processor to execute the infrared target tracking method, and the processor is configured to execute the program stored in the memory.
The invention relates to a computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, performs the steps of the infrared target tracking method.
Compared with the prior art, the invention has the beneficial effects that:
1. most of the existing infrared target tracking technologies ignore the utilization of time information, so that the model is difficult to recover when tracking fails. Therefore, the invention is based on the traditional two-branch tracking framework (static template diagram-search diagram) based on Siam, and additionally adds a dynamic template selection branch which introduces a dynamic template changing along with time for the model, and the dynamic template is used as the input of the model together with the static template and the search diagram. In addition, the invention further captures the global dependency relationship of the space-time information by utilizing the encoder-decoder structure of the transducer in the characteristic fusion stage, thereby overcoming the problem that the general infrared tracking technology can only locally model the target characteristic information.
2. In order to further capture the state change of the target object with time, the invention introduces the salient point information into a dynamic template selection branch, and realizes the evaluation of the quality of the target image by explicitly searching a plurality of salient points on the target image and focusing the information of all the salient points, thereby selecting a proper candidate object for updating the template image, and improving the tracking performance of the infrared target tracking method under the conditions of appearance change, non-rigid deformation and the like of the target.
3. The existing target tracker using the dynamic template selection module to introduce time information fails to provide an explicit standard for quality assessment of target images during the training phase. They randomly assign label of the target image during the training phase (i.e., positive sample is 1 and negative sample is 0), indicating that the image is selected as a dynamic template when label is 1. The fuzzy estimation on the quality of the target image can cause that the model cannot make the most accurate quality estimation on the current state of the target image during testing, so that redundant time information which does not have reference value is introduced into the model, and the effect of the template updating module is weakened. Aiming at the problem, the invention selects the IOU-aware score between the prediction boundary box and the real frame as the training target of the dynamic template selection module, the score defines the measurement standard of whether the target image can be used as the dynamic template of the tracker as the positioning accuracy degree of the angular point prediction head, and the training target has a clear evaluation standard at the moment, so that the model can achieve better tracking effect during testing.
Drawings
FIG. 1 is a flow chart of a network of the present invention;
FIG. 2 is a block diagram of a network of the present invention;
FIG. 3 is a block diagram of an IOU-Aware target state evaluation head according to the present invention.
Detailed Description
In this embodiment, a Transformer-based space-time information fusion infrared target tracking method, as shown in fig. 1, includes the following steps:
step one, preprocessing an infrared image;
step 1.1: arbitrarily selecting a video sequence V = {V_1, V_2, …, V_n, …, V_I} containing a specific infrared target Obj from the infrared target tracking data set, where I denotes the total number of frames of the selected infrared video sequence, V_n denotes the n-th frame image of the video sequence and n ∈ [1, I]; cropping and scaling the i-th frame image V_i, the j-th frame image V_j and the k-th frame image V_k of the video sequence V to obtain, respectively, the preprocessed static template image V_i′ ∈ ℝ^(H_T×W_T×C′), the preprocessed dynamic template image V_j′ ∈ ℝ^(H_D×W_D×C′) and the preprocessed search image V_k′ ∈ ℝ^(H_S×W_S×C′); V_i′, V_j′, V_k′ are taken as the input of the infrared target tracking network, where H_T, W_T are the height and width of V_i′, H_D, W_D are the height and width of V_j′, H_S, W_S are the height and width of V_k′, C′ is the initial number of channels of each image, and i, j, k ∈ [1, I]. In the present embodiment, the height and width of V_i′ are H_T = W_T = 128, the height and width of V_j′ are H_D = W_D = 128, the height and width of V_k′ are H_S = W_S = 320, and the initial number of channels of each image is C′ = 3;
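For illustration, a small OpenCV sketch of this crop-and-scale preprocessing is given below. It is an interpretation, not the patented code: the crop side is taken as factor·sqrt(w·h) (2× for the 128×128 templates, 5× for the 320×320 search image, consistent with the output sizes above), and zero padding at the frame border is an assumption.

```python
import cv2
import numpy as np

def crop_and_scale(frame, box, size_factor=2.0, out_size=128):
    """Crop a square region centred on the target box and resize it (sketch of step 1.1)."""
    x, y, w, h = box                                   # target box: top-left x, y, width, height
    cx, cy = x + w / 2.0, y + h / 2.0
    side = int(round(size_factor * np.sqrt(w * h)))    # square crop side
    x0, y0 = int(round(cx - side / 2.0)), int(round(cy - side / 2.0))
    # pad the frame with zeros so the crop never leaves the image
    pad = max(0, -x0, -y0, x0 + side - frame.shape[1], y0 + side - frame.shape[0])
    padded = cv2.copyMakeBorder(frame, pad, pad, pad, pad, cv2.BORDER_CONSTANT, value=0)
    patch = padded[y0 + pad:y0 + pad + side, x0 + pad:x0 + pad + side]
    return cv2.resize(patch, (out_size, out_size))

frame = np.zeros((480, 640, 3), dtype=np.uint8)
template = crop_and_scale(frame, box=(300, 200, 40, 60), size_factor=2.0, out_size=128)
search = crop_and_scale(frame, box=(300, 200, 40, 60), size_factor=5.0, out_size=320)
print(template.shape, search.shape)                    # (128, 128, 3) (320, 320, 3)
```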
step two, constructing an infrared target tracking network, which comprises the following steps: an infrared image feature extraction sub-network, an infrared image feature fusion sub-network, a corner prediction head sub-network, a salient point focusing sub-network and an IOU-Aware target state evaluation head sub-network;
step 2.1: the feature extraction sub-network is a ResNet50 network that extracts features from the preprocessed static template image V_i′, dynamic template image V_j′ and search image V_k′, yielding the static template feature map F_T ∈ ℝ^((H_T/d)×(W_T/d)×C), the dynamic template feature map F_D ∈ ℝ^((H_D/d)×(W_D/d)×C) and the search image feature map F_S ∈ ℝ^((H_S/d)×(W_S/d)×C). In this example, the downsampling factor of the feature extraction network is d = 16 and the number of channels of each downsampled feature map is C = 256;
step 2.2: flattening F_T, F_D and F_S along the spatial dimension to obtain the corresponding static template feature sequence f_T ∈ ℝ^((H_T·W_T/d²)×C), dynamic template feature sequence f_D ∈ ℝ^((H_D·W_D/d²)×C) and search image feature sequence f_S ∈ ℝ^((H_S·W_S/d²)×C), and concatenating them to obtain the mixed feature sequence f_m ∈ ℝ^(M×C), where M = (H_T·W_T + H_D·W_D + H_S·W_S)/d²;
step 2.3: adding the sinusoidal position code P_pos ∈ ℝ^(M×C) to the mixed feature sequence f_m to obtain the mixed feature sequence containing position codes f_M ∈ ℝ^(M×C);
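A minimal PyTorch-style sketch of steps 2.2-2.3 follows; the standard 1-D sinusoidal formulation used for P_pos is an assumption (the patent only states that the position code is sinusoidal).

```python
import math
import torch

def sinusoidal_pe(n_tokens: int, c: int) -> torch.Tensor:
    """Standard 1-D sinusoidal position encoding (assumed form of P_pos in step 2.3)."""
    pos = torch.arange(n_tokens).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, c, 2).float() * (-math.log(10000.0) / c))
    pe = torch.zeros(n_tokens, c)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

def build_mixed_sequence(f_t, f_d, f_s):
    """Steps 2.2-2.3: flatten the three feature maps spatially, concatenate, add position codes.
    f_t, f_d, f_s: (B, C, H, W) feature maps from the backbone."""
    seqs = [f.flatten(2).transpose(1, 2) for f in (f_t, f_d, f_s)]      # each (B, H*W, C)
    f_m = torch.cat(seqs, dim=1)                                        # mixed sequence f_m
    return f_m + sinusoidal_pe(f_m.size(1), f_m.size(2)).unsqueeze(0)   # f_M with position code

f_t = torch.randn(2, 256, 8, 8); f_d = torch.randn(2, 256, 8, 8); f_s = torch.randn(2, 256, 20, 20)
print(build_mixed_sequence(f_t, f_d, f_s).shape)        # (2, 64 + 64 + 400, 256)
```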
Step 2.4, constructing an infrared image feature fusion sub-network, which comprises the following steps: a transducer-based encoder module, a transducer-based decoder module, and a codec post-processing module:
step 2.4.1: the transducer-based encoder module consists of R multi-headed self-attention blocks and will contain a position-coded hybrid feature sequence f M Modeling global relationships in spatial and temporal dimensions in an input encoder module to obtain a discriminative spatio-temporal feature sequence f' M R is the number of multi-headed self-attention blocks in the encoder block. In this example, r=6;
step 2.4.2: the transducer-based decoder module consists of N multi-headed self-attention blocks and combines a spatio-temporal feature sequence f' M And a single target query
Figure BDA0004181410620000081
The input decoder module carries out cross attention processing and outputs enhanced target inquiry +.>
Figure BDA0004181410620000082
N is the number of multi-headed self-attention blocks in the decoder block. In this example, n=6;
step 2.4.3: the codec post-processing module extracts from the spatio-temporal feature sequence f' M Is decoupled from the corresponding search region feature sequence
Figure BDA0004181410620000083
And calculate f' S Similarity score with oq->
Figure BDA0004181410620000084
And then similarity scores att and f' S After element-wise multiplication, an enhanced search region feature sequence is obtained>
Figure BDA0004181410620000085
Finally f S Restoring to enhanced search feature map +.>
Figure BDA0004181410620000086
Step 2.5 the corner prediction head sub-network consists of two full convolutional networks, each comprising A stacked Conv-BN-ReLU layers and one Conv layer for the F '' S The prediction boundary frame of the infrared target Obj is used for carrying out angular point probability prediction, so that the angular point probability distribution map of the left upper corner of the prediction boundary frame is respectively output by two full convolution networks
Figure BDA0004181410620000087
And the corner probability distribution map of the lower right corner +.>
Figure BDA0004181410620000088
In this example, a=4, convolution of Conv layer in each Conv-BN-ReLU layerThe kernel size is 3×3, the stride is 1, the padding is 1, the parameter momentum=0.1 for bn layer, the size of the convolution kernel for the last individual Conv is 1×1, the stride is 1.
Step 2.6: calculating the upper left corner coordinates (x 'of the prediction boundary box using equation (1)' tl ,y′ tl ) And lower right angular position (x' br ,y′ br ) Thereby obtaining the search image V 'of the infrared target Obj' k In (c) a prediction bounding box B '= (x' tl ,y′ tl ,x′ br ,y′ br ) Wherein (x, y) represents the corner probability distribution map P tl ,P br Upper coordinates, and
Figure BDA0004181410620000089
Figure BDA00041814106200000810
step 2.7: the salient point focusing sub-network comprises a salient point coordinate prediction module and a salient point feature extraction module, and obtains the salient point features contained in the search image V_k′;
step 2.7.1: the salient point coordinate prediction module maps B′ onto F_S′ to obtain the mapped coordinates B_F, and then extracts the region-level features F_P ∈ ℝ^(K×K×C) corresponding to B_F from F_S′ by the ROIAlign operation, where K denotes the width and height of F_P;
the salient point coordinate prediction module applies a convolutional layer to F_P for dimension reduction to obtain the reduced region-level features F_P′ ∈ ℝ^(K×K×C′), flattens F_P′ into a one-dimensional tensor f_P ∈ ℝ^(1×(K·K·C′)), and feeds it into a multi-layer perceptron to predict the coordinates Loc_sp ∈ ℝ^(1×2L) of the L salient points of F_P, where C′ denotes the number of channels of F_P′ and L denotes the number of salient points. In this example, K = 7, L = 8, and the multi-layer perceptron is formed by connecting 4 linear layers, where the output channels of the first, second, third and fourth linear layers are 256, 512, 512 and 16, respectively;
step 2.7.2: Loc_sp is reshaped into a two-dimensional tensor Loc_sp′ ∈ ℝ^(L×2), after which the salient point feature extraction module samples the salient point features F_sp ∈ ℝ^(L×C) corresponding to Loc_sp′ from F_P by bilinear interpolation.
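A rough PyTorch sketch of this salient point focusing sub-network follows. The MLP widths match the embodiment (256 / 512 / 512 / 16); the reduced channel count, the sigmoid parameterisation of the point coordinates and the use of grid_sample for bilinear interpolation are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align

class SalientPointFocusSketch(nn.Module):
    """Sketch of step 2.7: RoIAlign over the predicted box, channel reduction, an MLP that
    regresses L = 8 point locations inside the RoI, and bilinear sampling of their features."""
    def __init__(self, c=256, c_red=64, k=7, n_points=8):
        super().__init__()
        self.k, self.n_points = k, n_points
        self.reduce = nn.Conv2d(c, c_red, kernel_size=1)
        self.mlp = nn.Sequential(nn.Linear(k * k * c_red, 256), nn.ReLU(),
                                 nn.Linear(256, 512), nn.ReLU(),
                                 nn.Linear(512, 512), nn.ReLU(),
                                 nn.Linear(512, 2 * n_points))

    def forward(self, feat_map, boxes):
        # feat_map: (B, C, H, W) enhanced search feature map
        # boxes: (B, 4) predicted boxes (x1, y1, x2, y2) already in feature-map coordinates
        batch_idx = torch.arange(boxes.size(0), dtype=boxes.dtype).unsqueeze(1)
        rois = torch.cat([batch_idx, boxes], dim=1)
        f_p = roi_align(feat_map, rois, output_size=self.k)           # region-level features F_P
        flat = self.reduce(f_p).flatten(1)                            # reduced + flattened
        loc = self.mlp(flat).view(-1, self.n_points, 2).sigmoid()     # L point coords in [0, 1]
        grid = loc.unsqueeze(1) * 2 - 1                               # (B, 1, L, 2) for grid_sample
        feats = F.grid_sample(f_p, grid, align_corners=False)         # bilinear sampling of F_P
        return feats.squeeze(2).transpose(1, 2)                       # (B, L, C) salient point features

model = SalientPointFocusSketch()
boxes = torch.tensor([[2., 3., 10., 12.], [4., 4., 15., 18.]])
print(model(torch.randn(2, 256, 20, 20), boxes).shape)                # (2, 8, 256)
```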
Step 2.8: the IOU-Aware target state evaluation head subnetwork is composed of multiple layers of perceptron and F' S Comprising B F All salient point features inside
Figure BDA0004181410620000095
Input into the IOU-Aware target state evaluation head subnetwork, and output the predicted IOU Score for B'.
The training goal of the dynamic selection module of a general space-time tracking model is a classification score (i.e., foreground is "1" and background is "0"). The invention provides an IOU-Aware target state evaluation head which is composed of a plurality of layers of perceptrons, and the structure diagram is shown in figure 3. In this example, the IOU-Aware target state evaluation head is formed by connecting 4 linear layers, wherein the output channel of the first linear layer is 1024, the output channel of the second linear layer is 512, the output channel of the third linear layer is 256, and the output channel of the fourth linear layer is 1. The input is the characteristic of all salient points in the target image prediction boundary box, the output applies the design of IOU-Aware, and the training target is replaced by the IOU score between the prediction boundary box and the real box from the general classification score (namely the foreground is '1', and the background is '0') so as to strengthen the connection of classification and regression branches. Based on the reselection of the training objectives, the Score output by the IOU-Aware target state assessment head at this time aggregates the information of all salient points in the prediction bounding box, representing the IoU Score of the prediction bounding box, and is therefore referred to as IoU-Aware target state assessment Score. The score provides a criterion for the evaluation of the current state of the target image, IOU-Aware. By integrating the salient point information of the target object into the evaluation of the IOU-Aware, a joint representation of the regression frame itself and the most discernable features contained therein can be obtained, which defines a measure of whether the target image can be used as a dynamic template of the tracker as the degree of accuracy of the positioning of the corner pre-head, since the more accurate the corner pre-head predicts the target bounding box, the more useful information contained therein that can be used to evaluate the quality of the target image, and the more accurate the evaluation result.
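For illustration, a minimal PyTorch sketch of such a head is given below, using the layer widths of the embodiment (1024 / 512 / 256 / 1); the ReLU activations and the sigmoid on the output are assumptions.

```python
import torch
import torch.nn as nn

class IoUAwareHeadSketch(nn.Module):
    """Sketch of the IOU-Aware target state evaluation head of step 2.8: a 4-layer MLP that
    maps the concatenated salient-point features of a predicted box to one score in (0, 1)."""
    def __init__(self, n_points=8, c=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(n_points * c, 1024), nn.ReLU(),
                                 nn.Linear(1024, 512), nn.ReLU(),
                                 nn.Linear(512, 256), nn.ReLU(),
                                 nn.Linear(256, 1))

    def forward(self, point_feats):                  # point_feats: (B, L, C) salient point features
        score = self.mlp(point_feats.flatten(1))     # aggregate all salient points of the box
        return score.sigmoid().squeeze(-1)           # predicted IoU-aware score per image

head = IoUAwareHeadSketch()
print(head(torch.randn(2, 8, 256)))                  # two scores in (0, 1)
```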
Step three, constructing a loss function of the infrared target tracking network;
step 3.1, constructing the loss function L_bp of the corner prediction head sub-network with formula (2):
L_bp = λ_L1·L1_loss + λ_GIOU·GIOU_loss   (2)
In formula (2), B = (x_tl, y_tl, x_br, y_br) denotes the four corner coordinates of the ground-truth box of the infrared target Obj; L1_loss denotes the loss over the four corner distances between the predicted bounding box and the ground-truth box, given by formula (3); GIOU_loss denotes the generalized intersection-over-union loss between the predicted bounding box and the ground-truth box, given by formula (4);
L1_loss = Σ_t |B′_t − B_t|,  t = 1, …, 4   (3)
In formula (3), B′_t denotes the t-th corner coordinate of the predicted bounding box B′ and B_t denotes the t-th corner coordinate of the ground-truth box B.
GIOU_loss = 1 − GIOU   (4)
In formula (4), GIOU denotes the generalized intersection-over-union of B′ and B, given by formula (5);
GIOU = IOU − (rec − union)/rec   (5)
In formula (5), rec denotes the area of the smallest rectangle enclosing B′ and B, given by formula (6); IOU denotes the intersection-over-union of B′ and B, given by formula (8);
rec = (x_4 − x_1)(y_4 − y_1)   (6)
In formula (6), x_4, y_4 denote the maxima of the lower-right corner coordinates of B′ and B, and x_1, y_1 denote the minima of the upper-left corner coordinates of B′ and B, given by formula (7);
x_1 = min(x′_tl, x_tl),  y_1 = min(y′_tl, y_tl),  x_4 = max(x′_br, x_br),  y_4 = max(y′_br, y_br)   (7)
IOU = inter/union   (8)
In formula (8), union denotes the union area of B′ and B, given by formula (9);
union = S′ + S − inter   (9)
In formula (9), inter denotes the intersection area of B′ and B, given by formula (10); S′ denotes the area of B′ and S denotes the area of B, given by formula (11);
inter = (x_3 − x_2)(y_3 − y_2)   (10)
In formula (10), x_2, y_2 denote the maxima of the upper-left corner coordinates of B′ and B, and x_3, y_3 denote the minima of the lower-right corner coordinates of B′ and B, given by formula (12);
S′ = B′_w·B′_h,  S = B_w·B_h   (11)
x_2 = max(x′_tl, x_tl),  y_2 = max(y′_tl, y_tl),  x_3 = min(x′_br, x_br),  y_3 = min(y′_br, y_br)   (12)
In formula (11), B′_w, B′_h denote the width and height of B′ and B_w, B_h denote the width and height of B, given by formula (13):
B′_w = x′_br − x′_tl,  B′_h = y′_br − y′_tl,  B_w = x_br − x_tl,  B_h = y_br − y_tl   (13)
Step 3.2, constructing the loss function L of the IOU-Aware target state evaluation head sub-network by utilizing the step (14) IATSE
L IATSE =-|IOU-Score| β ((1-IOU)log(1-Score)+IOU log(Score)) (14)
In equation (14), β is a real-domain super-parameter. In this example, β=2;
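A short sketch of the loss of formula (14) in PyTorch follows; the batch averaging and the small numerical guard are assumptions, not part of the formula.

```python
import torch

def iou_aware_loss(score: torch.Tensor, iou: torch.Tensor, beta: float = 2.0) -> torch.Tensor:
    """Formula (14): L_IATSE = -|IoU - Score|^beta * ((1 - IoU)*log(1 - Score) + IoU*log(Score)).
    `score` is the head output in (0, 1); `iou` is the IoU of the predicted and ground-truth boxes."""
    eps = 1e-6                                  # numerical guard, not part of the formula
    score = score.clamp(eps, 1 - eps)
    weight = (iou - score).abs().pow(beta)
    bce = (1 - iou) * torch.log(1 - score) + iou * torch.log(score)
    return -(weight * bce).mean()

print(iou_aware_loss(torch.tensor([0.3, 0.8]), torch.tensor([0.7, 0.75])))
```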
step four, optimizing the infrared target tracking network with a two-stage training method;
step 4.1: during the first-stage training, freezing the IOU-Aware target state evaluation head sub-network, training the other parts of the infrared target tracking network with a gradient descent algorithm, and updating the network parameters by minimizing the loss function shown in formula (2); training stops when the number of training iterations reaches the set value, yielding the preliminarily trained infrared target tracking network;
step 4.2: during the second-stage training, freezing the preliminarily trained infrared image feature extraction sub-network, infrared image feature fusion sub-network and salient point focusing sub-network, training the preliminarily trained corner prediction head sub-network and IOU-Aware target state evaluation head sub-network with a gradient descent algorithm, and updating the network parameters by minimizing the loss function shown in formula (15); training stops when the number of training iterations reaches the set value, yielding a trained infrared target tracking model for continuous and accurate localization of the specific infrared target;
L_total = L_bp + λ_IATSE·L_IATSE   (15)
In formula (15), λ_IATSE is a real-valued hyperparameter whose value is fixed in this example.
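The two-stage schedule amounts to toggling which sub-networks receive gradients. The sketch below illustrates this with hypothetical module names and placeholder layers; the optimizer choice and learning rate are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the five sub-networks; the patented implementation is not public.
tracker = nn.ModuleDict({
    "backbone": nn.Conv2d(3, 8, 3),
    "fusion": nn.Linear(8, 8),
    "salient_point_focus": nn.Linear(8, 8),
    "corner_head": nn.Linear(8, 4),
    "iou_aware_head": nn.Linear(8, 1),
})

def configure_stage(model: nn.ModuleDict, stage: int):
    """Stage 1 freezes the IoU-Aware head (trained with formula (2)); stage 2 freezes the
    backbone, fusion and salient-point sub-networks and trains the two heads (formula (15))."""
    frozen = {1: {"iou_aware_head"}, 2: {"backbone", "fusion", "salient_point_focus"}}[stage]
    for name, sub in model.items():
        for p in sub.parameters():
            p.requires_grad = name not in frozen
    return [p for p in model.parameters() if p.requires_grad]   # parameters passed to the optimizer

opt_stage1 = torch.optim.AdamW(configure_stage(tracker, 1), lr=1e-4)
opt_stage2 = torch.optim.AdamW(configure_stage(tracker, 2), lr=1e-4)
print(len(opt_stage1.param_groups[0]["params"]), len(opt_stage2.param_groups[0]["params"]))
```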
In this embodiment, an electronic device includes a memory for storing a program supporting the processor to execute the above method, and a processor configured to execute the program stored in the memory.
In this embodiment, a computer-readable storage medium stores a computer program that, when executed by a processor, performs the steps of the method described above.
The invention establishes two criteria for the dynamic template updating mechanism: (1) an update threshold and (2) an update interval. Only when the update interval has been reached and the Score output by the IOU-Aware target state evaluation head reaches the update threshold is the current search image selected as the dynamic template for the subsequent tracking process; the dynamic template is thus updated continuously during tracking. The overall tracking flow of the invention is shown in fig. 1. Specifically, the first frame of the video sequence is selected as the fixed static template image: centred on the target box of this frame, a region whose size is 2 times the target box is cropped and scaled to obtain a preprocessed static template image of size 128×128. Every frame other than the first is a search frame; to preprocess a search image, the current frame is cropped around the centre of the target box predicted in the previous frame with a region whose size is 5 times the target box and scaled to obtain a preprocessed search image of size 320×320. The dynamic template is determined by the dynamic template selection module: if the current search image used for predicting the target box position satisfies the update conditions of the dynamic template selection module, that search image is cropped around the centre of its predicted target box with a region whose size is 2 times the target box and scaled to obtain a preprocessed dynamic template image of size 128×128. When the tracker predicts the target position in the next search frame, the preprocessed static template image, dynamic template image and search image are fed into the tracking network together as input. During testing, the two templates and the current search frame are sent to the network for feature extraction and fusion; the corner prediction head then outputs the predicted target bounding box of the current search frame, after which salient points are searched within the region enclosed by that bounding box. Finally, all salient point features extracted by bilinear interpolation are sent to the IOU-Aware target state evaluation head to obtain the state evaluation score of the current search frame. When this score meets the update threshold and the update interval has been reached, the current search frame is taken as the dynamic template for the subsequent tracking process.
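A few lines of Python illustrate this update rule; the interval and threshold values are placeholders, not taken from the patent.

```python
# Sketch of the dynamic-template update rule: a search frame becomes the new dynamic template
# only when the update interval has elapsed AND the IoU-Aware score exceeds the threshold.
class DynamicTemplateSelector:
    def __init__(self, update_interval: int = 200, update_threshold: float = 0.5):
        self.update_interval = update_interval
        self.update_threshold = update_threshold

    def should_update(self, frame_idx: int, iou_aware_score: float) -> bool:
        return (frame_idx % self.update_interval == 0) and (iou_aware_score >= self.update_threshold)

selector = DynamicTemplateSelector()
for frame_idx, score in [(199, 0.9), (200, 0.3), (400, 0.8)]:
    print(frame_idx, selector.should_update(frame_idx, score))   # only frame 400 triggers an update
```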
Table 1.1 comparison of ablation experimental results
Table 1.2 comparison of ablation experimental results
Table 2 comparison of results of different ir tracking algorithms on PTB-TIR datasets
Table 3 comparison of results of different ir tracking algorithms on LSOTB-TIR dataset
The Transformer-based space-time information fusion infrared tracking network structure is shown in fig. 2: a Transformer-based encoder-decoder structure captures the global dependencies between the elements of the dual-template feature sequences and the search feature sequence, and salient point information makes the dynamic template selection module focus on the most discriminative features. The algorithm also introduces an IOU-Aware evaluation component that integrates the quality evaluation of the dynamic template into the IOU prediction, providing a more reliable criterion for assessing dynamic template quality. Table 1.1 compares ablation results for the salient point focusing (SPF) component and the IOU-Aware (I-A) component of the invention. The experiments take the Stark-s algorithm from the RGB tracking field as the baseline model; adding the SPF component and the I-A component to the baseline separately shows the clear advantage of the invention in exploiting space-time information. Accuracy (Acc) is the accuracy metric, Robustness (Rob) is the robustness metric, and EAO is the expected average overlap. A larger Acc indicates a smaller centre-distance difference between the ground-truth and predicted boxes, a larger Rob indicates fewer tracking losses, and a larger EAO indicates better average tracker performance. The results of Table 1.1 show that using the salient point information and the IOU-Aware evaluation component effectively improves the tracking performance of the network. The baseline model introduces the information of the whole target image into the dynamic template evaluation, whereas the invention restricts the search range of the salient points to the predicted bounding box and thus introduces only part of the information into the evaluation; Table 1.2 compares the results for these search ranges of dynamic template information. The table shows that the invention outperforms the baseline model on all evaluation metrics, suggesting that quality estimation of the target image depends more on identifying key features than on assigning equal importance to all features.
Tables 2 and 3 compare the evaluation results of the invention with other infrared target tracking algorithms on the PTB-TIR and LSOTB-TIR infrared data sets. STFT (Ours) denotes the invention; ECO-deep, ECO-TIR and MCFTS are correlation-filter trackers based on deep features; MDNet and VITAL are other deep trackers; SiamFC, SiamRPN++, SiamMask, SiamMSS, HSSNet, MLSSNet, MMNet, STMTrack, Stark-s and Stark-st are Siamese-network-based trackers. Success is the success-rate metric, Precision is the precision metric and Norm Precision is the normalized precision metric: a larger Success indicates greater overlap between the predicted and ground-truth boxes, and larger Precision and Norm Precision indicate a smaller centre-distance difference between them. The results in Tables 2 and 3 show that, under these evaluation metrics, the overall performance of the invention is superior to the infrared tracking methods listed above.

Claims (5)

1. A Transformer-based space-time information fusion infrared target tracking method, characterized by comprising the following steps:
step one, preprocessing an infrared image;
step 1.1: arbitrarily selecting a video sequence V containing an infrared target Obj from an infrared target tracking data set, and cropping and scaling the i-th frame image V_i, the j-th frame image V_j and the k-th frame image V_k of the video sequence V to obtain, respectively, a preprocessed static template image V_i′ ∈ ℝ^(H_T×W_T×C′), a preprocessed dynamic template image V_j′ ∈ ℝ^(H_D×W_D×C′) and a preprocessed search image V_k′ ∈ ℝ^(H_S×W_S×C′); V_i′, V_j′, V_k′ are taken as the input of an infrared target tracking network, where H_T, W_T are the height and width of V_i′, H_D, W_D are the height and width of V_j′, H_S, W_S are the height and width of V_k′, and C′ is the number of channels of each image;
step two, constructing an infrared target tracking network, which comprises the following steps: an infrared image feature extraction sub-network, an infrared image feature fusion sub-network, a corner prediction head sub-network, a salient point focusing sub-network and an IOU-Aware target state evaluation head sub-network;
step 2.1: the feature extraction sub-network is a ResNet50 network that extracts features from the preprocessed static template image V_i′, dynamic template image V_j′ and search image V_k′, yielding a static template feature map F_T ∈ ℝ^((H_T/d)×(W_T/d)×C), a dynamic template feature map F_D ∈ ℝ^((H_D/d)×(W_D/d)×C) and a search image feature map F_S ∈ ℝ^((H_S/d)×(W_S/d)×C), where d is the downsampling factor of the feature extraction network and C is the number of channels of the downsampled feature maps;
step 2.2: flattening F_T, F_D and F_S along the spatial dimension to obtain the corresponding static template feature sequence f_T ∈ ℝ^((H_T·W_T/d²)×C), dynamic template feature sequence f_D ∈ ℝ^((H_D·W_D/d²)×C) and search image feature sequence f_S ∈ ℝ^((H_S·W_S/d²)×C), and concatenating them to obtain the mixed feature sequence f_m ∈ ℝ^(M×C), where M = (H_T·W_T + H_D·W_D + H_S·W_S)/d²;
step 2.3: adding a sinusoidal position code P_pos ∈ ℝ^(M×C) to the mixed feature sequence f_m to obtain the mixed feature sequence containing position codes f_M ∈ ℝ^(M×C);
step 2.4: constructing the infrared image feature fusion sub-network, which processes the mixed feature sequence f_M to obtain a search feature map F_S′ ∈ ℝ^((H_S/d)×(W_S/d)×C);
step 2.5: the corner prediction head sub-network consists of two fully convolutional networks, each comprising A stacked Conv-BN-ReLU layers followed by one Conv layer, and predicts corner probabilities for the bounding box of the infrared target Obj contained in F_S′, so that the two fully convolutional networks output, respectively, the corner probability distribution map P_tl of the upper-left corner of the predicted bounding box and the corner probability distribution map P_br of the lower-right corner;
step 2.6: calculating the upper-left corner coordinates (x′_tl, y′_tl) and lower-right corner coordinates (x′_br, y′_br) of the predicted bounding box with formula (1), thereby obtaining the predicted bounding box B′ = (x′_tl, y′_tl, x′_br, y′_br) of the infrared target Obj in the search image V_k′, where (x, y) denotes a coordinate on the corner probability distribution maps P_tl, P_br:
x′_tl = Σ_(x,y) x·P_tl(x, y),  y′_tl = Σ_(x,y) y·P_tl(x, y)
x′_br = Σ_(x,y) x·P_br(x, y),  y′_br = Σ_(x,y) y·P_br(x, y)   (1)
step 2.7: the salient point focusing sub-network is used for extracting salient point features F_sp ∈ ℝ^(L×C);
step 2.8: the IOU-Aware target state evaluation head sub-network consists of a multi-layer perceptron; all salient point features F_sp lying inside B_F in F_S′ are input into the IOU-Aware target state evaluation head sub-network, which outputs the IOU Score of the predicted bounding box B′;
step three, constructing a loss function of the infrared target tracking network;
step 3.1: constructing the loss function L_bp of the corner prediction head sub-network with formula (2):
L_bp = λ_L1·L1_loss + λ_GIOU·GIOU_loss   (2)
In formula (2), λ_L1 and λ_GIOU are real-valued hyperparameters, and B = (x_tl, y_tl, x_br, y_br) denotes the four corner coordinates of the ground-truth box of the infrared target Obj; L1_loss denotes the loss over the four corner distances between the predicted bounding box and the ground-truth box, given by formula (3); GIOU_loss denotes the generalized intersection-over-union loss between the predicted bounding box and the ground-truth box, given by formula (4);
L1_loss = Σ_t |B′_t − B_t|,  t = 1, …, 4   (3)
In formula (3), B′_t denotes the t-th corner coordinate of the predicted bounding box B′ and B_t denotes the t-th corner coordinate of the ground-truth box B;
GIOU_loss = 1 − GIOU   (4)
In formula (4), GIOU denotes the generalized intersection-over-union of B′ and B, given by formula (5);
GIOU = IOU − (rec − union)/rec   (5)
In formula (5), rec denotes the area of the smallest rectangle enclosing B′ and B, given by formula (6); IOU denotes the intersection-over-union of B′ and B, given by formula (8);
rec = (x_4 − x_1)(y_4 − y_1)   (6)
In formula (6), x_4, y_4 denote the maxima of the lower-right corner coordinates of B′ and B, and x_1, y_1 denote the minima of the upper-left corner coordinates of B′ and B, given by formula (7);
x_1 = min(x′_tl, x_tl),  y_1 = min(y′_tl, y_tl),  x_4 = max(x′_br, x_br),  y_4 = max(y′_br, y_br)   (7)
IOU = inter/union   (8)
In formula (8), union denotes the union area of B′ and B, given by formula (9);
union = S′ + S − inter   (9)
In formula (9), inter denotes the intersection area of B′ and B, given by formula (10); S′ denotes the area of B′ and S denotes the area of B, given by formula (11);
inter = (x_3 − x_2)(y_3 − y_2)   (10)
In formula (10), x_2, y_2 denote the maxima of the upper-left corner coordinates of B′ and B, and x_3, y_3 denote the minima of the lower-right corner coordinates of B′ and B, given by formula (12);
S′ = B′_w·B′_h,  S = B_w·B_h   (11)
x_2 = max(x′_tl, x_tl),  y_2 = max(y′_tl, y_tl),  x_3 = min(x′_br, x_br),  y_3 = min(y′_br, y_br)   (12)
In formula (11), B′_w, B′_h denote the width and height of B′ and B_w, B_h denote the width and height of B, given by formula (13):
B′_w = x′_br − x′_tl,  B′_h = y′_br − y′_tl,  B_w = x_br − x_tl,  B_h = y_br − y_tl   (13)
Step 3.2: constructing a loss function L of the IOU-Aware target state evaluation head subnetwork by using a formula (14) IATSE
L_IATSE = -|IOU - Score|^β·((1 - IOU)·log(1 - Score) + IOU·log(Score))    (14)

In formula (14), β is a real-valued hyper-parameter;
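Formula (14) has the shape of an IoU-aware, quality-weighted cross-entropy. The sketch below is one hedged PyTorch rendering of it; the default β = 2.0 and the numerical epsilon are assumptions of the example, since the claim only states that β is a real-valued hyper-parameter.

import torch

def iatse_loss(score: torch.Tensor, iou: torch.Tensor, beta: float = 2.0) -> torch.Tensor:
    """IoU-aware target state evaluation loss following formula (14).

    score: predicted IoU score in (0, 1); iou: IoU of the predicted box with the real box.
    """
    eps = 1e-6                                    # numerical safeguard (added, not in the formula)
    score = score.clamp(eps, 1.0 - eps)
    weight = (iou - score).abs().pow(beta)        # |IOU - Score|^beta
    bce = (1.0 - iou) * torch.log(1.0 - score) + iou * torch.log(score)
    return -(weight * bce).mean()

The |IOU - Score|^β factor down-weights predictions whose score already agrees with the true IoU, so the head concentrates on boxes whose quality it misjudges.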
Step four: optimizing the infrared target tracking network by adopting a two-stage training method;
Step 4.1: during the first-stage training, freezing the IOU-Aware target state evaluation head sub-network, training the other networks in the infrared target tracking network except the IOU-Aware target state evaluation head sub-network by using a gradient descent algorithm, updating the network parameters by minimizing the loss function shown in formula (2), and stopping training when the number of training iterations reaches the set number, thereby obtaining the preliminarily trained infrared target tracking network;
Step 4.2: during the second-stage training, freezing the preliminarily trained infrared image feature extraction sub-network, the preliminarily trained infrared image feature fusion sub-network and the preliminarily trained salient point focusing sub-network, training the preliminarily trained corner prediction head sub-network and the preliminarily trained IOU-Aware target state evaluation head sub-network by using a gradient descent algorithm, updating the network parameters by minimizing the loss function shown in formula (15), and stopping training when the number of training iterations reaches the set number, thereby obtaining a trained infrared target tracking model for continuous and accurate positioning of the infrared target;
L = L_bp + λ·L_IATSE    (15)

In formula (15), λ is a real-valued hyper-parameter.
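The two-stage schedule of steps 4.1 and 4.2 boils down to toggling requires_grad on the relevant sub-networks between the stages and rebuilding the optimizer over the trainable parameters. The sketch below uses placeholder modules and names (backbone, fusion, salient, corner_head, iou_head) purely to illustrate that freezing pattern under assumed PyTorch tooling; it is not the patented network itself.

import torch
import torch.nn as nn

def freeze(module: nn.Module, frozen: bool = True) -> None:
    """Freeze (or unfreeze) every parameter of a sub-network."""
    for p in module.parameters():
        p.requires_grad = not frozen

def stage_optimizer(model: nn.Module, lr: float = 1e-4) -> torch.optim.Optimizer:
    """Build an optimizer over the currently trainable parameters only."""
    return torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=lr)

# Hypothetical stand-ins for the sub-networks named in the claims.
tracker = nn.ModuleDict({
    "backbone": nn.Conv2d(1, 8, 3, padding=1),   # stands in for the feature extraction sub-network
    "fusion": nn.Linear(8, 8),                   # stands in for the feature fusion sub-network
    "salient": nn.Linear(8, 8),                  # stands in for the salient point focusing sub-network
    "corner_head": nn.Linear(8, 4),              # stands in for the corner prediction head
    "iou_head": nn.Linear(8, 1),                 # stands in for the IOU-Aware evaluation head
})

# Stage 1: freeze the IoU-aware head, train everything else with the loss of formula (2).
freeze(tracker["iou_head"])
opt_stage1 = stage_optimizer(tracker)

# Stage 2: freeze backbone, fusion and salient-point sub-networks; train the corner
# prediction head and the IoU-aware head with the combined loss of formula (15).
for name in ("backbone", "fusion", "salient"):
    freeze(tracker[name])
freeze(tracker["iou_head"], frozen=False)
freeze(tracker["corner_head"], frozen=False)
opt_stage2 = stage_optimizer(tracker)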
2. The Transformer-based space-time information fusion infrared target tracking method according to claim 1, wherein the infrared image feature fusion sub-network in step 2.4 comprises a Transformer-based encoder module, a Transformer-based decoder module and a codec post-processing module, and the enhanced search feature map is obtained according to the following steps:
Step 2.4.1: the Transformer-based encoder module consists of R multi-head self-attention blocks; the mixed feature sequence f_M containing position encoding is input into the encoder module to model the global relationships in the spatial and temporal dimensions, so as to obtain a discriminative spatio-temporal feature sequence f'_M, where R is the number of multi-head self-attention blocks in the encoder module;
Step 2.4.2: the Transformer-based decoder module consists of N multi-head self-attention blocks; the spatio-temporal feature sequence f'_M and a single target query are input into the decoder module for cross-attention processing, and an enhanced target query oq is output, where N is the number of multi-head self-attention blocks in the decoder module;
Step 2.4.3: the codec post-processing module decouples the corresponding search region feature sequence f'_S from the spatio-temporal feature sequence f'_M, calculates a similarity score att between f'_S and oq, multiplies att and f'_S element-wise to obtain an enhanced search region feature sequence f''_S, and finally restores f''_S to the enhanced search feature map.
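For orientation only, the sketch below approximates the fusion pipeline of steps 2.4.1 to 2.4.3 with stock PyTorch Transformer layers: encode the mixed template/search sequence, decode a single target query, re-weight the search-region tokens by their similarity to the query, and reshape the result back to a feature map. The layer counts, dimensions and the exact form of the similarity score att are assumptions of the example, not specified by the claim.

import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    """Rough stand-in for the encoder, decoder and codec post-processing modules."""

    def __init__(self, dim: int = 256, heads: int = 8, enc_layers: int = 6, dec_layers: int = 6):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, enc_layers)      # R self-attention blocks
        self.decoder = nn.TransformerDecoder(dec_layer, dec_layers)      # N blocks
        self.query = nn.Parameter(torch.randn(1, 1, dim))                # single target query

    def forward(self, f_m: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # f_m: (B, L, dim) position-encoded mixed feature sequence whose last h*w tokens
        # are assumed to be the search-region tokens (the decoupling of step 2.4.3).
        f_m2 = self.encoder(f_m)                                          # f'_M
        oq = self.decoder(self.query.expand(f_m.size(0), -1, -1), f_m2)   # enhanced target query
        f_s = f_m2[:, -(h * w):, :]                                       # f'_S
        att = torch.softmax((f_s * oq).sum(-1, keepdim=True), dim=1)      # similarity score (assumed form)
        f_s2 = f_s * att                                                  # element-wise re-weighting -> f''_S
        b, _, c = f_s2.shape
        return f_s2.transpose(1, 2).reshape(b, c, h, w)                   # enhanced search feature map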
3. The Transformer-based space-time information fusion infrared target tracking method according to claim 2, wherein the salient point focusing sub-network in step 2.7 comprises a salient point coordinate prediction module and a salient point feature extraction module, and the salient point features contained in the search image V'_k are obtained according to the following steps:
Step 2.7.1: the salient point coordinate prediction module maps B' onto F'_S to obtain the mapped coordinates B_F, and then extracts the region-level features F_P corresponding to B_F from F'_S by the ROIAlign operation, where K represents the width and height of F_P;
the salient point coordinate prediction module performs a dimension-reduction operation on F_P through a convolution layer to obtain the reduced region-level features F'_P, flattens F'_P into a one-dimensional tensor, and inputs it into a multi-layer perceptron to predict the coordinates of the L salient points of F_P, where C' represents the number of channels of F'_P and L represents the number of salient points;
Step 2.7.2: after the predicted salient point coordinates are restored to a two-dimensional tensor Loc'_sp, the salient point feature extraction module samples the salient point features corresponding to Loc'_sp from F_P by bilinear interpolation.
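A rough sketch of the salient point focusing sub-network of steps 2.7.1 and 2.7.2, using torchvision's roi_align for the region-level features and grid_sample for the bilinear sampling. K, L, the channel widths and the normalisation of the predicted coordinates to the [-1, 1] sampling grid are illustrative assumptions, not values taken from the claim.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align

class SalientPointFocus(nn.Module):
    """Stand-in for the salient point coordinate prediction and feature extraction modules."""

    def __init__(self, c: int = 256, c_red: int = 64, k: int = 7, num_points: int = 16):
        super().__init__()
        self.k = k
        self.num_points = num_points
        self.reduce = nn.Conv2d(c, c_red, kernel_size=1)                 # channel reduction -> F'_P
        self.coord_mlp = nn.Sequential(                                  # predicts L (x, y) pairs in [0, 1]
            nn.Linear(c_red * k * k, 256), nn.ReLU(inplace=True),
            nn.Linear(256, num_points * 2), nn.Sigmoid())

    def forward(self, f_s: torch.Tensor, box: torch.Tensor) -> torch.Tensor:
        # f_s: (1, C, H, W) enhanced search feature map; box: (1, 4) = (x_tl, y_tl, x_br, y_br)
        # already mapped onto the feature-map scale (the coordinates B_F).
        rois = torch.cat([torch.zeros(1, 1, device=box.device), box], dim=1)
        f_p = roi_align(f_s, rois, output_size=self.k)                   # (1, C, K, K) region-level features F_P
        flat = self.reduce(f_p).flatten(1)                               # reduce channels, then flatten
        loc = self.coord_mlp(flat).view(1, self.num_points, 1, 2)        # normalised salient point coordinates
        grid = loc * 2.0 - 1.0                                           # grid_sample expects [-1, 1]
        feats = F.grid_sample(f_p, grid, mode="bilinear", align_corners=False)
        return feats.squeeze(-1).transpose(1, 2)                         # (1, L, C) salient point features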
4. An electronic device, comprising a memory and a processor, wherein the memory is configured to store a program that supports the processor in performing the infrared target tracking method according to any one of claims 1 to 3, and the processor is configured to execute the program stored in the memory.
5. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when run by a processor, performs the steps of the infrared target tracking method according to any one of claims 1 to 3.
CN202310406030.1A 2023-04-11 2023-04-11 Transformer-based space-time information fusion infrared target tracking method Active CN116402858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310406030.1A CN116402858B (en) 2023-04-11 2023-04-11 Transformer-based space-time information fusion infrared target tracking method

Publications (2)

Publication Number Publication Date
CN116402858A true CN116402858A (en) 2023-07-07
CN116402858B CN116402858B (en) 2023-11-21

Family

ID=87017716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310406030.1A Active CN116402858B (en) 2023-04-11 2023-04-11 Transformer-based space-time information fusion infrared target tracking method

Country Status (1)

Country Link
CN (1) CN116402858B (en)

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019137912A1 (en) * 2018-01-12 2019-07-18 Connaught Electronics Ltd. Computer vision pre-fusion and spatio-temporal tracking
US20220332415A1 (en) * 2021-04-20 2022-10-20 Guangdong University Of Technology Landing tracking control method and system based on lightweight twin network and unmanned aerial vehicle
US20230033548A1 (en) * 2021-07-26 2023-02-02 Manpreet Singh TAKKAR Systems and methods for performing computer vision task using a sequence of frames
CN114550040A (en) * 2022-02-18 2022-05-27 南京大学 End-to-end single target tracking method and device based on mixed attention mechanism
CN114638862A (en) * 2022-03-24 2022-06-17 清华大学深圳国际研究生院 Visual tracking method and tracking device
CN114862844A (en) * 2022-06-13 2022-08-05 合肥工业大学 Infrared small target detection method based on feature fusion
CN114972439A (en) * 2022-06-17 2022-08-30 贵州大学 Novel target tracking algorithm for unmanned aerial vehicle
CN115205337A (en) * 2022-07-28 2022-10-18 西安热工研究院有限公司 RGBT target tracking method based on modal difference compensation
CN115147459A (en) * 2022-07-31 2022-10-04 哈尔滨理工大学 Unmanned aerial vehicle target tracking method based on Swin transducer
CN115239765A (en) * 2022-08-02 2022-10-25 合肥工业大学 Infrared image target tracking system and method based on multi-scale deformable attention
CN115330837A (en) * 2022-08-18 2022-11-11 厦门理工学院 Robust target tracking method and system based on graph attention Transformer network
CN115482375A (en) * 2022-08-25 2022-12-16 南京信息技术研究院 Cross-mirror target tracking method based on time-space communication data driving
CN115690152A (en) * 2022-10-18 2023-02-03 南京航空航天大学 Target tracking method based on attention mechanism
CN115620206A (en) * 2022-11-04 2023-01-17 雷汝霖 Training method of multi-template visual target tracking network and target tracking method
CN115909110A (en) * 2022-12-16 2023-04-04 四川中科朗星光电科技有限公司 Lightweight infrared unmanned aerial vehicle target tracking method based on Simese network
CN115908500A (en) * 2022-12-30 2023-04-04 长沙理工大学 High-performance video tracking method and system based on 3D twin convolutional network

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
BINGBING WEI et al.: "An IoU-aware Siamese network for real-time visual tracking", NEUROCOMPUTING, vol. 527, pages 13-26 *
LIN L et al.: "SwinTrack: A simple and strong baseline for Transformer tracking", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, vol. 35, pages 16743-16754 *
XIN CHEN et al.: "Transformer Tracking", CVPR 2021, pages 8126-8135 *
WANG Ling et al.: "Video target tracking algorithm based on FasterMDNet", Computer Engineering and Applications, no. 14, pages 123-130 *
CHENG Wen et al.: "Multi-feature fusion particle filter for infrared single-target tracking", Computer Knowledge and Technology, no. 14, pages 178-180 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036417A (en) * 2023-09-12 2023-11-10 南京信息工程大学 Multi-scale transducer target tracking method based on space-time template updating
CN116912649A (en) * 2023-09-14 2023-10-20 武汉大学 Infrared and visible light image fusion method and system based on relevant attention guidance
CN116912649B (en) * 2023-09-14 2023-11-28 武汉大学 Infrared and visible light image fusion method and system based on relevant attention guidance

Also Published As

Publication number Publication date
CN116402858B (en) 2023-11-21

Similar Documents

Publication Publication Date Title
CN116402858B (en) Transformer-based space-time information fusion infrared target tracking method
Han et al. Active object detection with multistep action prediction using deep q-network
Wang et al. Adaptive fusion CNN features for RGBT object tracking
CN117011342B (en) Attention-enhanced space-time transducer vision single-target tracking method
Fan et al. Complementary tracking via dual color clustering and spatio-temporal regularized correlation learning
CN116563337A (en) Target tracking method based on double-attention mechanism
CN114677633A (en) Multi-component feature fusion-based pedestrian detection multi-target tracking system and method
CN114724185A (en) Light-weight multi-person posture tracking method
CN115239765A (en) Infrared image target tracking system and method based on multi-scale deformable attention
CN115205336A (en) Feature fusion target perception tracking method based on multilayer perceptron
Cheng et al. Tiny object detection via regional cross self-attention network
Wang et al. EMAT: Efficient feature fusion network for visual tracking via optimized multi-head attention
Zhou et al. Retrieval and localization with observation constraints
CN112883928A (en) Multi-target tracking algorithm based on deep neural network
CN113011359A (en) Method for simultaneously detecting plane structure and generating plane description based on image and application
CN116797799A (en) Single-target tracking method and tracking system based on channel attention and space-time perception
Huang et al. A spatial–temporal contexts network for object tracking
Song et al. TransBoNet: Learning camera localization with Transformer Bottleneck and Attention
CN114119999B (en) Iterative 6D pose estimation method and device based on deep learning
Wu et al. Improving feature discrimination for object tracking by structural-similarity-based metric learning
Zhang et al. Boosting the speed of real-time multi-object trackers
CN115830707A (en) Multi-view human behavior identification method based on hypergraph learning
Zhang et al. Promptvt: Prompting for efficient and accurate visual tracking
An et al. MTAtrack: Multilevel transformer attention for visual tracking
Nie et al. A training-free, lightweight global image descriptor for long-term visual place recognition toward autonomous vehicles

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant