CN115239765B: Infrared image target tracking system and method based on multi-scale deformable attention

Publication number: CN115239765B (granted; published application CN115239765A)
Application number: CN202210921013.7A
Authority: CN (China)
Prior art keywords: scale, attention, feature, module, search
Legal status: Active
Inventors: 李小红, 周喜, 齐美彬, 庄硕, 郝世杰, 刘学亮
Applicant and assignee: Hefei University of Technology
Other languages: Chinese (zh)

Classifications

    • G06T 7/246: Image analysis; analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06N 3/02, G06N 3/08: Computing arrangements based on biological models; neural networks and their learning methods
    • G06V 10/40: Extraction of image or video features
    • G06V 10/764: Image or video recognition using classification, e.g. of video objects
    • G06V 10/806: Fusion of extracted features
    • G06V 10/82: Image or video recognition using neural networks
    • G06T 2207/10016: Video; image sequence
    • G06T 2207/10048: Infrared image

Abstract

The invention discloses an infrared image target tracking system and method based on multi-scale deformable attention. The tracking system comprises a search-map branch, a template-map branch, a feature fusion module and a prediction module. The search-map branch extracts the search-map multi-scale feature F_s, obtained by stitching the features of the search map at a first scale and a second scale; the template branch extracts the template-map multi-scale feature F_t, obtained by stitching the features of the template map at a third scale and a fourth scale; the feature fusion module computes the fusion feature G_st from the search-map multi-scale feature F_s and the template-map multi-scale feature F_t; and the prediction module predicts the target box in the search map from the fusion feature G_st. By combining low-level and high-level features, the system improves target tracking in infrared images.

Description

Infrared image target tracking system and method based on multi-scale deformable attention
Technical Field
The invention belongs to the field of computer vision, and particularly relates to an infrared image target tracking system and method based on multi-scale deformable attention.
Background
The visual target tracking task is, given the target to be tracked in the initial frame of a video, to predict the position and size of that target in subsequent video frames. Thermal infrared target tracking performs this task under thermal infrared camera imaging. Infrared target tracking can follow a target under low visibility and even complete darkness, is unaffected by illumination changes, and can therefore work around the clock and in complex environments; it has high application value and is widely used in scenarios such as night monitoring by surveillance robots, night patrol by security robots, and night-time urban traffic monitoring. The difficulty of infrared target tracking stems from missing texture in infrared images, low signal-to-noise ratio, blurred visual appearance, and the tendency of tracked objects to deform and change scale.
In order to address these problems, some methods capture global features through a Transformer attention mechanism and use the contextual relationships among features to establish associations and long-range dependencies between distant features. Although such methods achieve good results, the Transformer attention module has limitations when processing image feature maps: at initialization, the attention of the self-attention module is spread almost uniformly over the feature map, whereas in the final stage of training the attention maps become sparse and focus only on parts of the target, such as a person's limbs. Learning this large change in the attention maps requires a long training process, so convergence is slow, and the heavy computational cost severely limits the spatial resolution of the features.
Disclosure of Invention
Purpose of the invention: in view of the problems in the prior art, the invention provides an infrared image target tracking system based on multi-scale deformable attention, which fuses low-level and high-level features and facilitates target tracking in infrared images.
Technical scheme: in one aspect, the invention provides an infrared image target tracking system based on multi-scale deformable attention, comprising: a search-map branch 1, a template-map branch 2, a feature fusion module 3 and a prediction module 4. The search-map branch 1 comprises a first feature extraction module 101 and a first conversion-splicing module 102; the template-map branch 2 comprises a second feature extraction module 201 and a second conversion-splicing module 202.
The first feature extraction module 101 extracts the initial feature maps f^0_s1 and f^0_s2 of the search map at the first and second scales; the first conversion-splicing module 102 unifies the channels of f^0_s1 and f^0_s2 and adjusts their dimensions to obtain the search-map features f_s1 and f_s2 at the first and second scales, which are spliced into the search-map multi-scale feature F_s = [f_s1, f_s2]. The second feature extraction module 201 extracts the initial feature maps f^0_t1 and f^0_t2 of the template map at the third and fourth scales; the second conversion-splicing module 202 unifies the channels of f^0_t1 and f^0_t2 and adjusts their dimensions to obtain the template-map features f_t1 and f_t2 at the third and fourth scales, which are spliced into the template-map multi-scale feature F_t = [f_t1, f_t2]. The feature fusion module 3 computes the fusion feature G_st from F_s and F_t, and the prediction module 4 predicts the target box in the search map from G_st.
The search map is the input of the search-map branch 1, and the template map is the input of the template-map branch 2.
The first feature extraction module 101 and the second feature extraction module 201 have the same structure: a first convolution module, a first pooling module, a second convolution module, a third convolution module, a fourth convolution module and a fifth convolution module in sequential cascade.
The fourth convolution module in the first feature extraction module 101 outputs the initial feature map f^0_s1 of the search map at the first scale, and the fifth convolution module outputs the initial feature map f^0_s2 of the search map at the second scale. The fourth convolution module in the second feature extraction module 201 outputs the initial feature map f^0_t1 of the template map at the third scale, and the fifth convolution module outputs the initial feature map f^0_t2 of the template map at the fourth scale.
The feature fusion module 3 comprises N cascaded feature fusion submodules. The input of the first-level submodule is the search-map multi-scale feature F_s and the template-map multi-scale feature F_t; it outputs the first-level search-to-template attention feature G^1_st and the first-level template-to-search attention feature G^1_ts. The input of the nth-level submodule is the output G^{n-1}_st and G^{n-1}_ts of level n-1, and the output G^N_st of the Nth-level submodule is the fusion feature G_st produced by the feature fusion module.
The nth-level feature fusion submodule comprises a first deformable self-attention module 301, a second deformable self-attention module 302 and a cross-attention module 303, n = 1, 2, …, N. The first deformable self-attention module 301 and the second deformable self-attention module 302 compute the context features T_s and T_t of the two input features I_s and I_t, respectively; the cross-attention module 303 computes the mutual attention features G^n_st and G^n_ts of the context features T_s and T_t.
The first deformable self-attention module 301 extracts the context feature T_s of the input feature I_s by the following steps:
a1. Sum the input feature I_s with the search-map multi-scale position code SLP_s to generate the first query vector Q_s = [Q_s1, Q_s2], where Q_s1 is the query vector at the first scale and Q_s2 the query vector at the second scale;
a2. Input the first query vector Q_s, the input feature I_s and the search-map initial reference points R_s into a first multi-head attention network to obtain the first multi-head deformable attention I'_s of the search map; the first multi-head attention network contains M parallel attention units.
The search-map initial reference points R_s are computed as follows: compute, for each vector of the first-scale feature f_s1, its coordinates on the initial feature map f^0_s1, giving the first initial reference points r_s1; compute, for each vector of the second-scale feature f_s2, its coordinates on the initial feature map f^0_s2, giving the second initial reference points r_s2.
Normalize the coordinates of the first initial reference points r_s1 and map them onto the initial feature map f^0_s2 to obtain the first coordinate mapping points r_s12; normalize the coordinates of the second initial reference points r_s2 and map them onto the initial feature map f^0_s1 to obtain the second coordinate mapping points r_s21.
The search-map initial reference points are constructed as R_s = [(r_s1, r_s12), (r_s2, r_s21)].
The first multi-head deformable attention of the search map is I'_s = [I'_s1, I'_s2], where I'_s1 is the deformable attention at the first scale and I'_s2 the deformable attention at the second scale.
The ith element I'_s1i of I'_s1 is computed as follows:
The ith vector Q_s1i of Q_s1 passes through the fully connected layer Linear2 to obtain, for the ith elements r_s1i and r_s12i of the first initial reference points r_s1 and the first coordinate mapping points r_s12, the sampling offsets Δr^{mk}_{s1i} and Δr^{mk}_{s12i} of each sampling point in each attention unit, where m = 1, 2, …, M indexes the attention units of the first multi-head attention network, k = 1, 2, …, K indexes the sampling points, and K is the total number of sampling points per attention unit.
Add r_s1i and Δr^{mk}_{s1i} to obtain the coordinates of the kth sampling point of the mth attention unit at the first scale, p^{mk}_{s1i} = r_s1i + Δr^{mk}_{s1i}.
Add r_s12i and Δr^{mk}_{s12i} to obtain the coordinates of the kth sampling point of the mth attention unit at the second scale, p^{mk}_{s12i} = r_s12i + Δr^{mk}_{s12i}.
Interpolate p^{mk}_{s1i} on the first-scale feature map f^0_s1 after the fully connected layer Linear1 to obtain the value of the kth sampling point of the mth attention unit for Q_s1i at the first scale, denoted v^{mk}_{s1i} = Inter(Linear1(f^0_s1), p^{mk}_{s1i}), where Inter is an interpolation function.
Interpolate p^{mk}_{s12i} on the second-scale feature map f^0_s2 after the fully connected layer Linear1 to obtain the value of the kth sampling point of the mth attention unit for Q_s1i at the second scale, denoted v^{mk}_{s12i} = Inter(Linear1(f^0_s2), p^{mk}_{s12i}).
The ith vector Q_s1i of Q_s1 passes through the fully connected layer Linear3 to obtain the attention weight A^{mk}_{s1i} corresponding to v^{mk}_{s1i} and the attention weight A^{mk}_{s12i} corresponding to v^{mk}_{s12i}.
This yields I'_s1i = Σ_{m=1..M} Σ_{k=1..K} ( A^{mk}_{s1i} · v^{mk}_{s1i} + A^{mk}_{s12i} · v^{mk}_{s12i} ).
The ith vector Q_s2i of Q_s2 passes through the fully connected layer Linear2 to obtain, for the ith element r_s2i of the second initial reference points r_s2 and the ith element r_s21i of the second coordinate mapping points r_s21, the sampling offsets Δr^{mk}_{s2i} and Δr^{mk}_{s21i} of each sampling point in each attention unit.
Add r_s2i and Δr^{mk}_{s2i} to obtain the coordinates of the kth sampling point of the mth attention unit at the second scale, p^{mk}_{s2i} = r_s2i + Δr^{mk}_{s2i}.
Add r_s21i and Δr^{mk}_{s21i} to obtain the coordinates of the kth sampling point of the mth attention unit at the first scale, p^{mk}_{s21i} = r_s21i + Δr^{mk}_{s21i}.
Interpolate p^{mk}_{s21i} on the first-scale feature map f^0_s1 after Linear1 to obtain the value of the kth sampling point of the mth attention unit for Q_s2i at the first scale, denoted v^{mk}_{s21i} = Inter(Linear1(f^0_s1), p^{mk}_{s21i}).
Interpolate p^{mk}_{s2i} on the second-scale feature map f^0_s2 after Linear1 to obtain the value of the kth sampling point of the mth attention unit for Q_s2i at the second scale, denoted v^{mk}_{s2i} = Inter(Linear1(f^0_s2), p^{mk}_{s2i}).
The ith vector Q_s2i of Q_s2 passes through the fully connected layer Linear3 to obtain the attention weight A^{mk}_{s21i} corresponding to v^{mk}_{s21i} and the attention weight A^{mk}_{s2i} corresponding to v^{mk}_{s2i}.
This yields I'_s2i = Σ_{m=1..M} Σ_{k=1..K} ( A^{mk}_{s21i} · v^{mk}_{s21i} + A^{mk}_{s2i} · v^{mk}_{s2i} ).
a3. Sum and normalize I_s and I'_s, then pass the result through an FFN to obtain the context feature T_s of the input feature I_s.
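For concreteness, the following is a minimal PyTorch sketch of the sampling scheme of steps a1-a2, an illustration under stated assumptions rather than the patent's implementation: each query predicts K offsets and weights per attention unit for each of the two scales, values are gathered by bilinear interpolation, and a single normalized reference point per query is shared across scales, which collapses the per-scale reference/mapping-point pair (r_s1, r_s12) of the text into one point. The head count and layout are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn

class DeformableSelfAttention(nn.Module):
    """Each query predicts K sampling offsets and attention weights per head
    for each scale, gathers bilinearly interpolated values at the offset
    reference points, and mixes them with the softmax-normalized weights."""

    def __init__(self, dim: int = 256, heads: int = 8, points: int = 16,
                 scales: int = 2):
        super().__init__()
        self.h, self.k, self.s = heads, points, scales
        self.value_proj = nn.Linear(dim, dim)                       # Linear1
        self.offsets = nn.Linear(dim, heads * scales * points * 2)  # Linear2
        self.weights = nn.Linear(dim, heads * scales * points)      # Linear3
        self.out = nn.Linear(dim, dim)

    def forward(self, query, feats, ref):
        # query: (B, L, C) = I + SLP; feats: per-scale maps [(B, C, H, W), ...]
        # ref: (B, L, 2) reference points, normalized to [0, 1]
        B, L, C = query.shape
        off = self.offsets(query).view(B, L, self.h, self.s, self.k, 2)
        w = self.weights(query).view(B, L, self.h, self.s * self.k)
        w = w.softmax(-1).view(B, L, self.h, self.s, self.k)  # weights sum to 1
        out = query.new_zeros(B, L, self.h, C // self.h)
        for lvl, fmap in enumerate(feats):
            _, _, H, W = fmap.shape
            v = self.value_proj(fmap.flatten(2).transpose(1, 2))     # (B,HW,C)
            v = v.transpose(1, 2).reshape(B * self.h, C // self.h, H, W)
            wh = torch.tensor([W, H], dtype=query.dtype, device=query.device)
            # sampling points = reference point + offsets (offsets in pixels),
            # mapped into grid_sample's [-1, 1] coordinate range
            loc = ref[:, :, None, None, :] + off[:, :, :, lvl] / wh  # (B,L,h,K,2)
            grid = (2.0 * loc - 1.0).permute(0, 2, 1, 3, 4)          # (B,h,L,K,2)
            samp = F.grid_sample(v, grid.reshape(B * self.h, L, self.k, 2),
                                 align_corners=False)    # (B*h, C/h, L, K)
            samp = samp.view(B, self.h, C // self.h, L, self.k)
            wl = w[:, :, :, lvl].permute(0, 2, 1, 3)                 # (B,h,L,K)
            out += (samp * wl[:, :, None]).sum(-1).permute(0, 3, 1, 2)
        return self.out(out.reshape(B, L, C))
```

Step a3 then adds this output to I_s, normalizes, and applies the FFN to produce T_s. In the embodiment below, K = 16.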
The search-map multi-scale position code SLP_s is constructed as follows:
a11. Randomly generate two scale-level codes for the search map: the first-level code SL_s1, whose dimension is the same as that of the first-scale feature f_s1, and the second-level code SL_s2, whose dimension is the same as that of the second-scale feature f_s2;
a12. From the search-map features f_s1 and f_s2 at the first and second scales, compute the first intra-level position code P_s1 and the second intra-level position code P_s2 of the search map using trigonometric functions;
a13. Add SL_s1 to P_s1 and SL_s2 to P_s2, then splice the results to obtain the search-map multi-scale position code
SLP_s = [SL_s1 + P_s1, SL_s2 + P_s2].
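A small PyTorch sketch of a11-a13 under stated assumptions: the level code SL is modeled here as one learnable, randomly initialized vector per scale broadcast over positions (the patent only says it is randomly generated with matching dimension), P is a standard 2D sinusoidal code, and the spatial shapes 32×32 and 16×16 are assumptions matching the embodiment's sequence lengths 1024 and 256.

```python
import torch
from torch import nn

def sine_position_encoding(h: int, w: int, dim: int = 256,
                           temperature: float = 10000.0) -> torch.Tensor:
    """Standard 2D sinusoidal encoding (the trigonometric intra-level code P)."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dim_t = temperature ** (2 * (torch.arange(dim // 2) // 2) / (dim // 2))
    pos_x = xs.flatten()[:, None] / dim_t
    pos_y = ys.flatten()[:, None] / dim_t
    pos_x = torch.stack((pos_x[:, 0::2].sin(), pos_x[:, 1::2].cos()), -1).flatten(1)
    pos_y = torch.stack((pos_y[:, 0::2].sin(), pos_y[:, 1::2].cos()), -1).flatten(1)
    return torch.cat((pos_y, pos_x), dim=1)              # (h*w, dim)

class MultiScalePositionCoding(nn.Module):
    """SLP = [SL_1 + P_1, SL_2 + P_2]: learnable per-scale level codes SL added
    to sinusoidal intra-level codes P, then concatenated along the length."""
    def __init__(self, shapes=((32, 32), (16, 16)), dim: int = 256):
        super().__init__()
        self.shapes = shapes
        self.level = nn.Parameter(torch.randn(len(shapes), dim))  # SL, random
        self.register_buffer("pos", torch.cat(
            [sine_position_encoding(h, w, dim) for h, w in shapes]))

    def forward(self) -> torch.Tensor:
        levels = torch.cat([self.level[i].expand(h * w, -1)
                            for i, (h, w) in enumerate(self.shapes)])
        return self.pos + levels                          # (sum h*w, dim)
```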
The cross-attention module 303 computes the mutual attention features G^n_st and G^n_ts of the context features T_s and T_t of the two inputs by the following steps:
b1. Add T_s to the search-map multi-scale position code SLP_s and map the sum through two fully connected layers W_sq and W_sk to obtain the vectors Q'_s and K_s; map T_s through the fully connected layer W_sv to obtain the vector V_s;
b2. Add T_t to the template-map multi-scale position code SLP_t and map the sum through two fully connected layers W_tq and W_tk to obtain the vectors Q'_t and K_t; map T_t through the fully connected layer W_tv to obtain the vector V_t;
b3. Compute the attention feature G^n_st of T_s toward T_t; its ith element is
G^n_st,i = Σ_j softmax_j( dot(Q'_s,i, K_t,j) / √d_kt ) · V_t,j.
Compute the attention feature G^n_ts of T_t toward T_s; its jth element is
G^n_ts,j = Σ_i softmax_i( dot(Q'_t,j, K_s,i) / √d_ks ) · V_s,i,
where dot denotes the vector dot product operation, d_kt is the dimension of K_t and d_ks is the dimension of K_s.
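A single-head PyTorch sketch of steps b1-b3 follows; the head count and tensor layout are assumptions, and the patent's separate per-branch projection weights are mirrored by instantiating the module twice.

```python
import torch
from torch import nn

class CrossAttention(nn.Module):
    """b1-b3 sketch: queries and keys are built from the position-aware sums
    (T + SLP), values from the raw context features."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.wq = nn.Linear(dim, dim)     # W_q
        self.wk = nn.Linear(dim, dim)     # W_k
        self.wv = nn.Linear(dim, dim)     # W_v
        self.scale = dim ** -0.5          # 1 / sqrt(d_k)

    def forward(self, q_feat, q_pos, kv_feat, kv_pos):
        q = self.wq(q_feat + q_pos)                 # Q'
        k = self.wk(kv_feat + kv_pos)               # K
        v = self.wv(kv_feat)                        # V
        attn = (q @ k.transpose(-2, -1)) * self.scale
        return attn.softmax(dim=-1) @ v             # softmax over the kv axis

# separate weight sets for the two directions, as in b1/b2
attn_st, attn_ts = CrossAttention(), CrossAttention()
T_s, SLP_s = torch.randn(1, 1280, 256), torch.randn(1, 1280, 256)
T_t, SLP_t = torch.randn(1, 320, 256), torch.randn(1, 320, 256)
G_st = attn_st(T_s, SLP_s, T_t, SLP_t)   # search-to-template attention feature
G_ts = attn_ts(T_t, SLP_t, T_s, SLP_s)   # template-to-search attention feature
```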
The prediction module 4 comprises a classification prediction network 401, a box prediction network 402 and a target box calculation module 403. The classification prediction network 401 is configured to obtain the classification result C = [C_1, C_2, …, C_len] of the target in the search map from the fusion feature G_st; the box prediction network 402 is configured to obtain the predicted boxes B = [B_1, B_2, …, B_len] of the target in the search map from G_st. Here len is the length of the search-map multi-scale feature, l = 1, 2, …, len; C_l = [C_l0, C_l1] is the normalized classification obtained from the lth element of G_st, and B_l = [B_lx, B_ly, B_lw, B_lh] is the target rectangular box predicted from the lth element of G_st, where B_lx, B_ly are the coordinates of the center of the rectangular box and B_lw, B_lh are its width and height.
The target box calculation module 403 is configured to compute the target box in the search map from the classification result C = [C_1, C_2, …, C_len] and the predicted boxes B = [B_1, B_2, …, B_len] of the target in the search map.
The target box calculation module 403 computes the target box in the search map as follows:
Find the element index l* corresponding to the maximum of C_l0 over C = [C_1, C_2, …, C_len];
the predicted target rectangular box B_l* corresponding to the l*th element is the target box in the search map.
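In code form the selection rule is a single argmax over the target probabilities; the tensor layout below is an assumption.

```python
import torch

def select_target_box(C: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """C: (len, 2) normalized scores [C_l0 target, C_l1 background];
    B: (len, 4) predicted boxes (cx, cy, w, h). Returns B_{l*}, the box of
    the element whose target probability C_l0 is largest."""
    l_star = torch.argmax(C[:, 0])
    return B[l_star]
```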
The training steps of the infrared image target tracking system are:
c1. Randomly select two pictures from a training video, take one as the template map and the other as the search map, input them into the infrared image target tracking system to be trained, and let the prediction module 4 output the classification result C = [C_1, C_2, …, C_len] and the predicted boxes B = [B_1, B_2, …, B_len];
c2. Optimize the parameters of the infrared image target tracking system by minimizing a loss function to obtain the trained system.
The loss function is L = L_class + L_loss + L_giou.
L_class is the classification loss: L_class = -(1/len) Σ_{l=1..len} W[U_l] · log C_{l,U_l}, where the label U_l is determined from the position of the lth predicted box B_l relative to the ground-truth target box B^T in the search map: U_l = 0 when B_l is taken as target and U_l = 1 when it is taken as background; W[1] is the negative-sample weight and W[0] the positive-sample weight.
L_loss is the regression loss: L_loss = (1/count) Σ_{h: U_h=0} Pr_h · L1(B_h, B^T), where count is the number of labels U_l with value 0, i.e. count = |{l : U_l = 0}|, and Pr_h is the classification accuracy of the hth element, obtained from its normalized classification result.
L_giou is the GIOU loss: L_giou = (1/count) Σ_{h: U_h=0} Pr_h · L_giou(h), where L_giou(h) is the GIOU loss of the element B_h among the predicted boxes whose corresponding U_h is 0: L_giou(h) = 1 - GIOU_h, with GIOU_h the GIOU value between B_h and the ground-truth target box B^T.
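The following PyTorch sketch assembles L under stated assumptions; it is not the patent's code. Assumed details: the label U_l is taken as a center-inside-box test, Pr_h is taken as the predicted target probability C_h0, w_pos/w_neg stand in for W[0]/W[1], and torchvision's generalized_box_iou supplies the GIOU term.

```python
import torch
from torchvision.ops import generalized_box_iou

def cxcywh_to_xyxy(b: torch.Tensor) -> torch.Tensor:
    cx, cy, w, h = b.unbind(-1)
    return torch.stack((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2), -1)

def tracking_loss(C, B, gt, w_pos=1.0, w_neg=0.1):
    # C: (len, 2) normalized [target, background] probabilities
    # B: (len, 4) predicted boxes (cx, cy, w, h); gt: (4,) ground-truth box B^T
    x1, y1, x2, y2 = cxcywh_to_xyxy(gt)
    # assumed label rule: U_l = 0 (positive) when the box center falls in B^T
    pos = (B[:, 0] > x1) & (B[:, 0] < x2) & (B[:, 1] > y1) & (B[:, 1] < y2)
    l_class = -(w_pos * torch.log(C[pos, 0] + 1e-8).sum() +
                w_neg * torch.log(C[~pos, 1] + 1e-8).sum()) / C.shape[0]
    if not pos.any():
        return l_class
    pr = C[pos, 0].detach()                 # classification accuracy Pr_h
    l1 = (B[pos] - gt).abs().sum(-1)        # L1 regression term
    giou = generalized_box_iou(cxcywh_to_xyxy(B[pos]),
                               cxcywh_to_xyxy(gt)[None]).squeeze(-1)
    return l_class + (pr * l1).mean() + (pr * (1.0 - giou)).mean()
```

The Pr_h factor is what dynamically couples the regression and GIOU terms to the classification quality.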
In another aspect, the invention also discloses a method for tracking with the above infrared image target tracking system based on multi-scale deformable attention, comprising the following steps:
Take the first frame of the video to be tracked as the template map and mark the rectangular box of the target to be tracked in the template map; take subsequent frames of the video as search maps. Input the template map and the search map into the template-map branch and the search-map branch of the infrared image target tracking system respectively, and obtain the rectangular box of the target in the search map from the prediction module.
Beneficial effects: the infrared image target tracking system and tracking method based on multi-scale deformable attention have the following advantages:
1. The invention splices the features at two scales into a multi-scale feature for subsequent processing, which enriches the semantics of the low-level features and adds spatial information to the high-level features, benefiting the tracking of small targets.
2. The feature fusion module uses a multi-level cascade to enhance the features stage by stage. The first and second deformable self-attention modules obtain the context of the image feature sequences of the search map and the template map respectively and automatically seek expressive features on the feature maps; thanks to the deformable self-attention model, the model converges quickly, about 4 times faster than common methods.
The cross-attention module learns the relation between the search-map and template-map feature sequences, so the target position in the search map can be located accurately.
3. During training, the classification accuracy dynamically constrains the box regression loss and the GIOU loss, aligning the classification task with the box regression task for a more stable tracking effect.
Drawings
FIG. 1 is a schematic diagram of the composition of an infrared image target tracking system based on multi-scale deformable attention in accordance with the present disclosure;
FIG. 2 is a schematic diagram of the composition of a feature fusion module;
FIG. 3 is a schematic diagram of the composition of a feature fusion sub-module;
FIG. 4 is a flow diagram of the computation of the ith element of the first-scale deformable attention I'_s1 within the first multi-head deformable attention I'_s;
fig. 5 is a schematic diagram of the composition of the prediction module.
Detailed Description
The invention is further elucidated below in connection with the drawings and the detailed description.
The invention discloses an infrared image target tracking system based on multi-scale deformable attention which, as shown in FIG. 1, comprises: a search-map branch 1, a template-map branch 2, a feature fusion module 3 and a prediction module 4. The search-map branch 1 comprises a first feature extraction module 101 and a first conversion-splicing module 102; the template-map branch 2 comprises a second feature extraction module 201 and a second conversion-splicing module 202.
The first feature extraction module 101 extracts the initial feature maps f^0_s1 and f^0_s2 of the search map at the first and second scales; the first conversion-splicing module 102 unifies their channels and adjusts their dimensions to obtain the search-map features f_s1 and f_s2 at the first and second scales, spliced into the search-map multi-scale feature F_s = [f_s1, f_s2]. The second feature extraction module 201 extracts the initial feature maps f^0_t1 and f^0_t2 of the template map at the third and fourth scales; the second conversion-splicing module 202 unifies their channels and adjusts their dimensions to obtain the template-map features f_t1 and f_t2 at the third and fourth scales, spliced into the template-map multi-scale feature F_t = [f_t1, f_t2]. The feature fusion module 3 computes the fusion feature G_st from F_s and F_t, and the prediction module 4 predicts the target box in the search map from G_st.
The search map is the input of the search-map branch 1, and the template map is the input of the template-map branch 2.
In this embodiment, the first feature extraction module 101 and the second feature extraction module 201 have the same structure: both adopt the residual structure of ResNet-50 as the feature extraction network, with network parameters differing from the common ResNet and with MaxPool_2 and the FC layer removed. The specific structure is shown in Table 1.
Table 1: Structure of the first feature extraction module 101 and the second feature extraction module 201
The structure is a sequential cascade of the first convolution module Conv_1, the first pooling module MaxPool_1, the second convolution module Conv_2x, the third convolution module Conv_3x, the fourth convolution module Conv_4x and the fifth convolution module Conv_5x.
The fourth convolution module Conv_4x in the first feature extraction module 101 outputs the initial feature map f^0_s1 of the search map at the first scale, and the fifth convolution module Conv_5x outputs the initial feature map f^0_s2 of the search map at the second scale. The fourth convolution module Conv_4x in the second feature extraction module 201 outputs the initial feature map f^0_t1 of the template map at the third scale, and the fifth convolution module Conv_5x outputs the initial feature map f^0_t2 of the template map at the fourth scale.
Following the parameters in Table 1, the first conversion-splicing module 102 first unifies the channels of f^0_s1 and f^0_s2 with a convolution layer of 1×1 kernel, 256 channels and stride 1, and then adjusts the dimensions, i.e. converts the two feature maps into two-dimensional feature sequences, obtaining the search-map features f_s1 ∈ R^(1024×256) and f_s2 ∈ R^(256×256) at the first and second scales; these are spliced into the search-map multi-scale feature F_s = [f_s1, f_s2], F_s ∈ R^(1280×256). Likewise, the second conversion-splicing module 202 applies the same operations to f^0_t1 and f^0_t2, obtaining the template-map features f_t1 ∈ R^(256×256) and f_t2 ∈ R^(64×256) at the third and fourth scales, spliced into the template-map multi-scale feature F_t = [f_t1, f_t2], F_t ∈ R^(320×256).
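As an illustration of the conversion-splicing step, the sketch below assumes ResNet-50-like channel widths (1024 and 2048) for Conv_4x and Conv_5x; the patent's Table 1 uses its own parameters, so the input channel counts are assumptions. The spatial sizes 32×32 and 16×16 reproduce the sequence lengths 1024 and 256 of f_s1 and f_s2 for a 256×256 search map.

```python
import torch
from torch import nn

class ConvertAndConcat(nn.Module):
    """Conversion-splicing sketch: a 1x1, 256-channel, stride-1 convolution
    unifies channels, each map is flattened into a 2-D feature sequence, and
    the two scales are concatenated along the length axis."""
    def __init__(self, in_ch=(1024, 2048), out_ch: int = 256):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, out_ch, kernel_size=1)
                                  for c in in_ch)

    def forward(self, feats):
        # feats: e.g. [(B, 1024, 32, 32), (B, 2048, 16, 16)] for the search map
        seqs = [p(f).flatten(2).transpose(1, 2) for p, f in zip(self.proj, feats)]
        return torch.cat(seqs, dim=1)       # (B, 1024 + 256, 256) = (B, 1280, 256)

F_s = ConvertAndConcat()([torch.randn(1, 1024, 32, 32),
                          torch.randn(1, 2048, 16, 16)])
print(F_s.shape)                            # torch.Size([1, 1280, 256])
```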
The feature fusion module 3 comprises N cascaded feature fusion submodules, as shown in FIG. 2. The input of the first-level submodule is the search-map multi-scale feature F_s and the template-map multi-scale feature F_t; it outputs the first-level search-to-template attention feature G^1_st and the first-level template-to-search attention feature G^1_ts. The input of the nth-level submodule is the output G^{n-1}_st and G^{n-1}_ts of level n-1, and the output G^N_st of the Nth-level submodule is the fusion feature G_st produced by the feature fusion module. In this embodiment N = 4.
As shown in FIG. 3, the nth-level feature fusion submodule comprises a first deformable self-attention module 301, a second deformable self-attention module 302 and a cross-attention module 303, n = 1, 2, …, N. The first and second deformable self-attention modules compute the context features T_s and T_t of the two input features I_s and I_t, respectively; the cross-attention module 303 computes the mutual attention features G^n_st and G^n_ts of the context features T_s and T_t. If n = 1, I_s is F_s and I_t is F_t; otherwise I_s and I_t are G^{n-1}_st and G^{n-1}_ts respectively. The cascade's data flow is sketched below.
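A sketch of the cascade's routing, with placeholder callables standing in for the modules described in this embodiment; using distinct cross-attention weights per direction is an assumption.

```python
from typing import Callable, List, Tuple
from torch import Tensor

def fusion_cascade(F_s: Tensor, F_t: Tensor, SLP_s: Tensor, SLP_t: Tensor,
                   levels: List[Tuple[Callable, Callable, Callable, Callable]]
                   ) -> Tensor:
    """levels holds N 4-tuples (self_attn_s, self_attn_t, cross_st, cross_ts),
    each with the interfaces of the modules sketched earlier."""
    I_s, I_t = F_s, F_t                        # level 1 consumes F_s and F_t
    for self_s, self_t, cross_st, cross_ts in levels:   # N = 4 here
        T_s = self_s(I_s)                      # deformable self-attn, search
        T_t = self_t(I_t)                      # deformable self-attn, template
        G_st = cross_st(T_s, SLP_s, T_t, SLP_t)  # search-to-template feature
        G_ts = cross_ts(T_t, SLP_t, T_s, SLP_s)  # template-to-search feature
        I_s, I_t = G_st, G_ts                  # feed level n output to level n+1
    return I_s                                 # G_st of the last level
```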
Specifically, the first deformable self-attention module 301 extracts the context feature T_s of the input feature I_s by the following steps:
a1. Sum the input feature I_s with the search-map multi-scale position code SLP_s to generate the first query vector Q_s = [Q_s1, Q_s2], where Q_s1 is the query vector at the first scale and Q_s2 the query vector at the second scale;
a2. Input the first query vector Q_s, the input feature I_s and the search-map initial reference points R_s into a first multi-head attention network to obtain the first multi-head deformable attention I'_s of the search map; the first multi-head attention network contains M parallel attention units.
The search-map initial reference points R_s are computed as follows: compute, for each vector of the first-scale feature f_s1, its coordinates on the initial feature map f^0_s1, giving the first initial reference points r_s1; compute, for each vector of the second-scale feature f_s2, its coordinates on the initial feature map f^0_s2, giving the second initial reference points r_s2.
Normalize the coordinates of the first initial reference points r_s1 and map them onto the initial feature map f^0_s2 to obtain the first coordinate mapping points r_s12; normalize the coordinates of the second initial reference points r_s2 and map them onto f^0_s1 to obtain the second coordinate mapping points r_s21.
The search-map initial reference points are constructed as R_s = [(r_s1, r_s12), (r_s2, r_s21)].
The first multi-head deformable attention of the search map is I'_s = [I'_s1, I'_s2], where I'_s1 is the deformable attention at the first scale and I'_s2 the deformable attention at the second scale; in this embodiment, I'_s1 = [I'_s1,1, I'_s1,2, …, I'_s1,1024] and I'_s2 = [I'_s2,1025, I'_s2,1026, …, I'_s2,1280].
As shown in FIG. 4, the ith element I'_s1i of I'_s1 is computed as follows:
The ith vector Q_s1i of Q_s1 passes through the fully connected layer Linear2 to obtain, for the ith elements r_s1i and r_s12i of the first initial reference points r_s1 and the first coordinate mapping points r_s12, the sampling offsets Δr^{mk}_{s1i} and Δr^{mk}_{s12i} of each sampling point in each attention unit, where m = 1, 2, …, M indexes the attention units of the first multi-head attention network, k = 1, 2, …, K indexes the sampling points, and K is the total number of sampling points per attention unit; K = 16 in this embodiment.
Add r_s1i and Δr^{mk}_{s1i} to obtain the coordinates of the kth sampling point of the mth attention unit at the first scale, p^{mk}_{s1i} = r_s1i + Δr^{mk}_{s1i}.
Add r_s12i and Δr^{mk}_{s12i} to obtain the coordinates of the kth sampling point of the mth attention unit at the second scale, p^{mk}_{s12i} = r_s12i + Δr^{mk}_{s12i}.
Interpolate p^{mk}_{s1i} on the first-scale feature map f^0_s1 after the fully connected layer Linear1 to obtain the value of the kth sampling point of the mth attention unit for Q_s1i at the first scale, denoted v^{mk}_{s1i} = Inter(Linear1(f^0_s1), p^{mk}_{s1i}), where Inter is an interpolation function.
Interpolate p^{mk}_{s12i} on the second-scale feature map f^0_s2 after the fully connected layer Linear1 to obtain the value of the kth sampling point of the mth attention unit for Q_s1i at the second scale, denoted v^{mk}_{s12i} = Inter(Linear1(f^0_s2), p^{mk}_{s12i}).
The ith vector Q_s1i of Q_s1 passes through the fully connected layer Linear3 to obtain the attention weight A^{mk}_{s1i} corresponding to v^{mk}_{s1i} and the attention weight A^{mk}_{s12i} corresponding to v^{mk}_{s12i}.
This yields I'_s1i = Σ_{m=1..M} Σ_{k=1..K} ( A^{mk}_{s1i} · v^{mk}_{s1i} + A^{mk}_{s12i} · v^{mk}_{s12i} ).
Similarly to the above procedure, I'_s2i is computed as follows:
The ith vector Q_s2i of Q_s2 passes through the fully connected layer Linear2 to obtain, for the ith element r_s2i of the second initial reference points r_s2 and the ith element r_s21i of the second coordinate mapping points r_s21, the sampling offsets Δr^{mk}_{s2i} and Δr^{mk}_{s21i} of each sampling point in each attention unit.
Add r_s2i and Δr^{mk}_{s2i} to obtain the coordinates of the kth sampling point of the mth attention unit at the second scale, p^{mk}_{s2i} = r_s2i + Δr^{mk}_{s2i}.
Add r_s21i and Δr^{mk}_{s21i} to obtain the coordinates of the kth sampling point of the mth attention unit at the first scale, p^{mk}_{s21i} = r_s21i + Δr^{mk}_{s21i}.
Interpolate p^{mk}_{s21i} on the first-scale feature map f^0_s1 after Linear1 to obtain the value of the kth sampling point of the mth attention unit for Q_s2i at the first scale, denoted v^{mk}_{s21i} = Inter(Linear1(f^0_s1), p^{mk}_{s21i}).
Interpolate p^{mk}_{s2i} on the second-scale feature map f^0_s2 after Linear1 to obtain the value of the kth sampling point of the mth attention unit for Q_s2i at the second scale, denoted v^{mk}_{s2i} = Inter(Linear1(f^0_s2), p^{mk}_{s2i}).
The ith vector Q_s2i of Q_s2 passes through the fully connected layer Linear3 to obtain the attention weight A^{mk}_{s21i} corresponding to v^{mk}_{s21i} and the attention weight A^{mk}_{s2i} corresponding to v^{mk}_{s2i}.
This yields I'_s2i = Σ_{m=1..M} Σ_{k=1..K} ( A^{mk}_{s21i} · v^{mk}_{s21i} + A^{mk}_{s2i} · v^{mk}_{s2i} ).
a3. Sum and normalize I_s and I'_s, then pass the result through an FFN to obtain the context feature T_s of the input feature I_s.
The second deformable self-attention module 302 extracts the context feature T_t of the input feature I_t through steps similar to a1-a3.
The search-map multi-scale position code SLP_s is constructed as follows:
a11. Randomly generate two scale-level codes for the search map: the first-level code SL_s1, whose dimension is the same as that of the first-scale feature f_s1, and the second-level code SL_s2, whose dimension is the same as that of the second-scale feature f_s2;
a12. From the search-map features f_s1 and f_s2 at the first and second scales, compute the first intra-level position code P_s1 and the second intra-level position code P_s2 of the search map using trigonometric functions;
a13. Add SL_s1 to P_s1 and SL_s2 to P_s2, then splice the results to obtain the search-map multi-scale position code
SLP_s = [SL_s1 + P_s1, SL_s2 + P_s2].
Correspondingly, the template-map multi-scale position code SLP_t is computed from the template-map features f_t1 and f_t2 at the third and fourth scales in a manner similar to a11-a13, and the template-map initial reference points R_t are computed from f_t1 and f_t2 in the same way as R_s.
The cross-attention module 303 computes the mutual attention features G^n_st and G^n_ts of the context features T_s and T_t of the two inputs by the following steps:
b1. Add T_s to the search-map multi-scale position code SLP_s and map the sum through two fully connected layers W_sq and W_sk to obtain the vectors Q'_s and K_s; map T_s through the fully connected layer W_sv to obtain the vector V_s;
b2. Add T_t to the template-map multi-scale position code SLP_t and map the sum through two fully connected layers W_tq and W_tk to obtain the vectors Q'_t and K_t; map T_t through the fully connected layer W_tv to obtain the vector V_t;
b3. Compute the attention feature G^n_st of T_s toward T_t; its ith element is
G^n_st,i = Σ_j softmax_j( dot(Q'_s,i, K_t,j) / √d_kt ) · V_t,j.
Compute the attention feature G^n_ts of T_t toward T_s; its jth element is
G^n_ts,j = Σ_i softmax_i( dot(Q'_t,j, K_s,i) / √d_ks ) · V_s,i,
where dot denotes the vector dot product operation, d_kt is the dimension of K_t and d_ks is the dimension of K_s.
The output G^N_st of the last feature fusion submodule is the final fusion feature G_st, and the prediction module 4 predicts the target box in the search map from G_st. As shown in FIG. 5, the prediction module 4 comprises a classification prediction network 401, a box prediction network 402 and a target box calculation module 403. The classification prediction network 401 obtains the classification result C = [C_1, C_2, …, C_len] of the target in the search map from the fusion feature G_st; the box prediction network 402 obtains the predicted boxes B = [B_1, B_2, …, B_len] of the target in the search map from G_st. Here len is the length of the search-map multi-scale feature, l = 1, 2, …, len; C_l = [C_l0, C_l1] is the normalized classification obtained from the lth element of G_st, where C_l0 is the predicted target probability obtained from the lth element of G_st and C_l1 the predicted background probability; B_l = [B_lx, B_ly, B_lw, B_lh] is the target rectangular box obtained from the lth element of G_st, with B_lx, B_ly the coordinates of the center of the rectangular box and B_lw, B_lh its width and height. The target box calculation module 403 computes the target box in the search map from the classification result C = [C_1, C_2, …, C_len] and the predicted boxes B = [B_1, B_2, …, B_len] as follows:
Find the element index l* corresponding to the maximum of C_l0 over C = [C_1, C_2, …, C_len];
the predicted target rectangular box B_l* corresponding to the l*th element is the target box in the search map.
The classification prediction network 401 and the box prediction network 402 each consist of three fully connected layers, with structures and parameters shown in Tables 2 and 3:
Table 2: Classification prediction network structure and parameters

Network layer    Output size    Network parameters (input channels, output channels)
FC_1             1280×256       256, 256
FC_2             1280×256       256, 256
FC_3             1280×2         256, 2

Table 3: Box prediction network structure and parameters

Network layer    Output size    Network parameters (input channels, output channels)
FC_1             1280×256       256, 256
FC_2             1280×256       256, 256
FC_3             1280×4         256, 4
The final fully connected layer FC_3 of the classification prediction network outputs the initial classification result C̃_l = [C̃_l0, C̃_l1], where C̃_l0 is the score that the lth element of G_st predicts category 0, i.e. the target, and C̃_l1 the score that it predicts category 1, i.e. the background. Because a probability must lie in [0, 1], C̃_l is normalized to obtain the normalized classification C_l = [C_l0, C_l1], with C_lE = exp(C̃_lE) / (exp(C̃_l0) + exp(C̃_l1)), E ∈ {0, 1}.
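A sketch of the two prediction heads per Tables 2 and 3; the ReLU activations between FC layers and the sigmoid on the box outputs are assumptions, since the tables only specify layer sizes.

```python
import torch
from torch import nn

class PredictionHeads(nn.Module):
    """Three-FC classification and box heads over the fused sequence."""
    def __init__(self, dim: int = 256):
        super().__init__()
        def mlp(out_dim: int) -> nn.Sequential:     # FC_1 -> FC_2 -> FC_3
            return nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, out_dim))
        self.cls_head = mlp(2)                      # Table 2: ... -> 2
        self.box_head = mlp(4)                      # Table 3: ... -> 4

    def forward(self, G_st: torch.Tensor):          # (B, 1280, 256)
        C_tilde = self.cls_head(G_st)               # raw scores
        C = C_tilde.softmax(dim=-1)                 # C_lE = exp / (sum of exps)
        B = self.box_head(G_st).sigmoid()           # (cx, cy, w, h); assumed
        return C, B
```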
The training steps of the infrared image target tracking system are:
c1. Randomly select two pictures from a training video, take one as the template map and the other as the search map, input them into the infrared image target tracking system to be trained, and let the prediction module 4 output the classification result C = [C_1, C_2, …, C_len] and the predicted boxes B = [B_1, B_2, …, B_len];
c2. Optimize the parameters of the infrared image target tracking system by minimizing a loss function to obtain the trained system.
The loss function is L = L_class + L_loss + L_giou.
L_class is the classification loss: L_class = -(1/len) Σ_{l=1..len} W[U_l] · log C_{l,U_l}, where the label U_l is determined from the position of the lth predicted box B_l relative to the ground-truth target box B^T in the search map: when B_l lies at B^T the prediction of the lth element is taken as target (U_l = 0), otherwise as background (U_l = 1); W[1] is the negative-sample weight and W[0] the positive-sample weight.
L_loss is the regression loss: L_loss = (1/count) Σ_{h: U_h=0} Pr_h · L1(B_h, B^T), where count is the number of labels U_l with value 0, i.e. count = |{l : U_l = 0}|, and Pr_h is the classification accuracy of the hth element, obtained from its normalized classification result.
L_giou is the GIOU loss: L_giou = (1/count) Σ_{h: U_h=0} Pr_h · L_giou(h), where L_giou(h) is the GIOU loss of the element B_h among the predicted boxes whose corresponding U_h is 0: L_giou(h) = 1 - GIOU_h, with GIOU_h the GIOU value between B_h and the ground-truth target box B^T.
In this embodiment, the classification accuracy Pr_h dynamically weights the regression loss and the GIOU loss, unifying the classification and regression tasks so that they are coupled to each other; low-quality bounding boxes are further suppressed by the localization score, which improves the overall tracking precision.
The method for tracking with the above infrared image target tracking system based on multi-scale deformable attention comprises the following steps:
Take the first frame of the video to be tracked as the template map and mark the rectangular box of the target to be tracked in the template map; take subsequent frames of the video as search maps. Input the template map and the search map into the template-map branch and the search-map branch of the infrared image target tracking system respectively, and obtain the rectangular box of the target in the search map from the prediction module.
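An inference-loop sketch of this procedure follows. crop_and_resize, scale_region and map_box_back are hypothetical helper functions for the cropping geometry and are not part of the patent text; the 128×128 template and 4x-area 256×256 search crops follow the embodiment below.

```python
import torch

@torch.no_grad()
def track_video(frames, init_box, model):
    """frames: list of images; init_box: (cx, cy, w, h) marked in frame 0.
    Helpers assumed: crop_and_resize (crop + resize, optionally returning the
    crop geometry), scale_region (enlarge a box's area), map_box_back (map a
    crop-relative box back to full-frame coordinates)."""
    template = crop_and_resize(frames[0], init_box, size=128)   # 128x128 crop
    box, results = init_box, [init_box]
    for frame in frames[1:]:
        # search region: 4x the target area, centered on the previous position
        region = scale_region(box, factor=4.0)
        search, meta = crop_and_resize(frame, region, size=256, return_meta=True)
        C, B = model(template, search)          # classification + predicted boxes
        l_star = C[0, :, 0].argmax()            # highest target probability
        box = map_box_back(B[0, l_star], meta)  # back to full-frame coordinates
        results.append(box)
    return results
```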
In this embodiment, the effectiveness of the above infrared image target tracking system was tested on the infrared datasets VOT2017-TIR and LSOTB-TIR and compared with existing methods. During testing, the first frame of a video sequence is selected as the template map, with the target to be tracked enclosed by a rectangular box; the crop is centered on the target box and scaled to 128×128. The other frames are search maps: the search region is obtained by centering on the target position in the previous frame, cropping a region 4 times the target area, and scaling it to 256×256. The template map and search map are input into the trained infrared image target tracking system, the prediction module outputs the classification result and the predicted boxes, and the predicted rectangular box corresponding to the feature element with the highest predicted target probability in the classification result is taken as the final tracking result. Comparative results on the LSOTB-TIR dataset are shown in Table 4:
table 4: test results for data set LSOTB-TIR
Methods Success Precision Norm Precision
ECO-TIR[1] 0.631 0.768 0.695
ECO-stir[2] 0.616 0.750 0.672
ECO[3] 0.609 0.739 0.670
SiamRPN++[4] 0.604 0.711 0.651
MDNet[5] 0.601 0.750 0.686
VITAL[6] 0.597 0.749 0.682
ATOM[7] 0.595 0.729 0.647
Ours(detranst) 0.669 0.782 0.787
Comparative results on the VOT2017-TIR dataset are shown in Table 5:
table 5: test results for data set VOT2017-TIR
Methods EAO Acc Rob
CFNet[8] 0.254 0.52 3.45
HSSNet[9] 0.262 0.58 3.33
TADT[10] 0.262 0.60 3.18
VITAL[6] 0.272 0.64 2.68
MLSSNet[11] 0.278 0.56 2.95
TCNN[12] 0.287 0.62 2.79
MMNet[13] 0.320 0.58 2.91
Ours(detranst) 0.335 0.71 2.18
The methods compared in Tables 4 and 5 are described in the following references:
[1] ECO-TIR: Liu Q., Li X., He Z., et al. LSOTB-TIR: A Large-Scale High-Diversity Thermal Infrared Object Tracking Benchmark. In Proceedings of the 28th ACM International Conference on Multimedia (MM '20), ACM, 2020.
[2] ECO-stir: Zhang L., Gonzalez-Garcia A., van de Weijer J., Danelljan M., Khan F. S. Synthetic data generation for end-to-end thermal infrared tracking. IEEE Transactions on Image Processing 28(4), 2019, 1837-1850.
[3] ECO: Danelljan M., Bhat G., Khan F. S., Felsberg M. ECO: Efficient convolution operators for tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[4] SiamRPN++: Li B., Wu W., Wang Q., Zhang F., Xing J., Yan J. SiamRPN++: Evolution of siamese visual tracking with very deep networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[5] MDNet: Nam H., Han B. Learning multi-domain convolutional neural networks for visual tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[6] VITAL: Song Y., Ma C., Wu X., Gong L., et al. VITAL: Visual tracking via adversarial learning. In CVPR, 2018, 8990-8999.
[7] ATOM: Danelljan M., Bhat G., Khan F. S., Felsberg M. ATOM: Accurate tracking by overlap maximization. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[8] CFNet: Valmadre J., Bertinetto L., Henriques J., Vedaldi A., Torr P. H. End-to-end representation learning for correlation filter based tracking. In CVPR, 2017, 5000-5008.
[9] HSSNet: Li X., Liu Q., Fan N., et al. Hierarchical spatial-aware siamese network for thermal infrared object tracking. Knowledge-Based Systems 166, 2019, 71-81.
[10] TADT: Li X., Ma C., Wu B., He Z., Yang M.-H. Target-aware deep tracking. In CVPR, 2019.
[11] MLSSNet: Liu Q., Li X., He Z., Fan N., Yuan D., Wang H. Learning deep multi-level similarity for thermal infrared object tracking. arXiv preprint arXiv:1906.03568, 2019.
[12] TCNN: Nam H., Baek M., Han B., et al. Modeling and propagating CNNs in a tree structure for visual tracking. arXiv preprint arXiv:1608.07242, 2016.
[13] MMNet: Liu Q., Li X., He Z., et al. Multi-Task Driven Feature Models for Thermal Infrared Tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34: 11604-11611.
As Tables 4 and 5 show, the infrared image target tracking system and method based on multi-scale deformable attention provided by the invention outperform the existing methods on both datasets.

Claims (8)

1. An infrared image target tracking system based on multi-scale deformable attention, comprising: a search graph branch (1), a template graph branch (2), a feature fusion module (3) and a prediction module (4); the search graph branch (1) comprises a first feature extraction module (101) and a first conversion splicing module (102); the template diagram branch (2) comprises a second feature extraction module (201) and a second conversion splicing module (202);
the first feature extraction module (101) is used for extracting an initial feature map of the search map under a first scale and a second scaleAnd->First conversion spliceModule (102) pair->And->Channel unification and dimension adjustment are carried out to obtain the characteristic f of the search graph under the first scale and the second scale s1 And f s2 And spliced into a search graph multi-scale feature F s =[f s1 ,f s2 ]The method comprises the steps of carrying out a first treatment on the surface of the The second feature extraction module (201) is used for extracting an initial feature map +_of the template map at a third scale and a fourth scale>And->A second conversion splicing module (202) pair->And->Channel unification and dimension adjustment are carried out to obtain the characteristic f of the template diagram under the third scale and the fourth scale t1 And f t2 And spliced into template-diagram multi-scale features F t =[f t1 ,f t2 ]The method comprises the steps of carrying out a first treatment on the surface of the The characteristic fusion module (3) is used for multiscale characteristic F according to the search graph s And template diagram multiscale feature F t Computing fusion features G st
The feature fusion module (3) comprises N cascaded feature fusion sub-modules, wherein the input of the first-stage feature fusion sub-module is a search graph multi-scale feature F s And template diagram multiscale feature F t Outputting the attention characteristic of the first level search graph to the template graphAnd the attention feature of the first level template map to the search map +.>The input of the N-level feature fusion submodule is N-1-level output +.>And->Output of Nth level feature fusion submodule>Fusion feature G obtained for feature fusion module st
The nth level feature fusion submodule comprises a first deformable self-attention module (301), a second deformable self-attention module (302) and a cross-attention module (303), wherein n=1, 2, …, N; the first deformable self-attention module (301) and the second deformable self-attention module (302) are respectively used for calculating two paths of input characteristics I s And I t Context relation feature T of (2) s And T t The method comprises the steps of carrying out a first treatment on the surface of the The cross attention module (303) is used for calculating a context relation characteristic T of two paths of input vectors s And T t Attention features to each otherAnd->
The first deformable self-attention module (301) extracts input features I s Context relation feature T of (2) s The method comprises the following steps:
a1, input feature I s Multi-scale position coding SLP with search map s Summing to generate a first query vector Q s ,Q s =[Q s1 ,Q s2 ],Q s1 For query vectors at a first scale, Q s2 Is a query vector at a second scale;
a2, the first query vector Q s Input features I s Initial reference point R of search graph s Inputting the first multi-head deformable attention I 'into a first multi-head attention network to obtain a search graph' s The method comprises the steps of carrying out a first treatment on the surface of the The first multi-head attention network is provided with M attention units connected in parallel;
the initial reference point R of the search graph s The calculation steps of (a) are as follows: calculating the feature f of the search graph at the first scale s1 Each vector in the initial feature mapCoordinates on the first initial reference point r s1 The method comprises the steps of carrying out a first treatment on the surface of the Calculating the feature f at the second scale s2 Is in the initial feature map +.>The coordinates of the first and second initial reference points r s2
For the first initial reference point r s1 Normalized and mapped to the initial feature mapObtaining a first coordinate mapping point r s12 The method comprises the steps of carrying out a first treatment on the surface of the For the second initial reference point r s2 Is normalized and mapped to the initial feature map +.>Obtaining a second coordinate mapping point r s21
Constructing initial reference points of search graphs
The first multi-head feasible variable attention I 'of the search graph' s =[I′ s1 ,I′ s2 ],I′ s1 For deformable attention at first scale, I' s2 Deformable attention at a second scale;
I′ s1 i 'element of (2)' s1i The calculation steps of (a) are as follows:
Q s1 the ith vector Q of the vectors s1i Obtaining a first initial reference point r through the full connection layer Linear2 s1 And a first coordinate mapping point r s12 Is the ith element r of (2) s1i 、r s12i Sampling offset for each sampling point in each attention unitAnd->Where M represents the number of attention units in the first multi-head attention network, m=1, 2, …, M; k represents a sampling point sequence number, k=1, 2, …, K; k is the total number of sampling points in each attention unit;
will r s1i And (3) withAdding to obtain the m-th attention unit and the k-th sampling point coordinate under the first scale
Will r s12i And (3) withAdding to obtain the m-th attention unit and the k-th sampling point coordinate under the second scale
Will beFeature map at first scale/>Interpolation is carried out after full connection layer Linear1 to obtain Q s1i At a first scale, the value of the kth sample point of the mth attention unit is denoted +.>Inter is an interpolation function;
will beFeature map at the second scale +.>Interpolation is carried out after full connection layer Linear1 to obtain Q s1i At the second scale, the value of the kth sample point of the mth attention unit is denoted +.>Inter is an interpolation function;
Q s1 the ith vector Q of the vectors s1i Obtained by full connection layer Linear3Attention weight corresponding to +.>Andattention weight corresponding to +.>
Thus obtaining
The $i$-th element $I'_{s2i}$ of $I'_{s2}$ is computed symmetrically: the $i$-th vector $Q_{s2i}$ of $Q_{s2}$ passes through Linear2 to obtain, for the $i$-th element $r_{s2i}$ of the second initial reference points $r_{s2}$ and the $i$-th element $r_{s21i}$ of the second coordinate mapping points $r_{s21}$, the sampling offsets $\Delta r^{mk}_{s2i}$ and $\Delta r^{mk}_{s21i}$ of each sampling point in each attention unit;
add $r_{s2i}$ and $\Delta r^{mk}_{s2i}$ to obtain the coordinates of the $k$-th sampling point of the $m$-th attention unit at the second scale, $p^{mk}_{s2i} = r_{s2i} + \Delta r^{mk}_{s2i}$;
add $r_{s21i}$ and $\Delta r^{mk}_{s21i}$ to obtain the coordinates of the $k$-th sampling point of the $m$-th attention unit at the first scale, $p^{mk}_{s21i} = r_{s21i} + \Delta r^{mk}_{s21i}$;
interpolate the feature map $f^0_{s1}$ at the first scale, after Linear1, at $p^{mk}_{s21i}$ to obtain $v^{mk}_{s21i} = \mathrm{Inter}(\mathrm{Linear1}(f^0_{s1}),\, p^{mk}_{s21i})$;
interpolate the feature map $f^0_{s2}$ at the second scale, after Linear1, at $p^{mk}_{s2i}$ to obtain $v^{mk}_{s2i} = \mathrm{Inter}(\mathrm{Linear1}(f^0_{s2}),\, p^{mk}_{s2i})$;
$Q_{s2i}$ passes through Linear3 to obtain the attention weights $A^{mk}_{s2i}$ and $A^{mk}_{s21i}$ corresponding to $v^{mk}_{s2i}$ and $v^{mk}_{s21i}$;
thus:
$$I'_{s2i} = \sum_{m=1}^{M} \sum_{k=1}^{K} \left( A^{mk}_{s2i}\, v^{mk}_{s2i} + A^{mk}_{s21i}\, v^{mk}_{s21i} \right)$$
a3, sum $I_s$ and $I'_s$, normalize, and pass the result through an FFN to obtain the context relation feature $T_s$ of the input feature $I_s$: $T_s = \mathrm{FFN}(\mathrm{Norm}(I_s + I'_s))$.
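Step a3 is a standard Add & Norm followed by a feed-forward network; a minimal PyTorch sketch, with an arbitrary hidden width:

```python
# A minimal sketch of step a3, assuming PyTorch; the hidden width is arbitrary.
import torch.nn as nn

class ContextFFN(nn.Module):
    def __init__(self, dim=256, hidden=1024):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, i_s, i_s_prime):
        # T_s = FFN(Norm(I_s + I'_s)): summation, normalization, then FFN
        return self.ffn(self.norm(i_s + i_s_prime))
```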
The second deformable self-attention module (302) follows steps analogous to a1-a3 to extract the context relation feature $T_t$ of the input feature $I_t$.
The prediction module (4) predicts the target frame in the search map from the fusion feature $G_{st}$;
the search map is the input of the search branch (1), and the template map is the input of the template branch (2).
2. The infrared image target tracking system based on multi-scale deformable attention according to claim 1, wherein the first feature extraction module (101) and the second feature extraction module (201) have the same structure: a first convolution module, a first pooling module, a second convolution module, a third convolution module, a fourth convolution module, and a fifth convolution module cascaded in sequence;
the fourth convolution module in the first feature extraction module (101) outputs the initial feature map $f^0_{s1}$ of the search map at the first scale, and its fifth convolution module outputs the initial feature map $f^0_{s2}$ of the search map at the second scale; the fourth convolution module in the second feature extraction module (201) outputs the initial feature map $f^0_{t1}$ of the template map at the third scale, and its fifth convolution module outputs the initial feature map $f^0_{t2}$ of the template map at the fourth scale.
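As a rough illustration of the cascade in claim 2, the sketch below builds the shared extractor in PyTorch; the channel counts, kernel sizes, and strides are assumptions, since the claim fixes only the module order:

```python
# A minimal sketch of the shared feature extractor in claim 2, assuming
# PyTorch; channel counts, kernels, and strides are illustrative only.
import torch.nn as nn

def conv_block(c_in, c_out, stride=2):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride, 1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class FeatureExtractor(nn.Module):
    def __init__(self, in_ch=3):
        super().__init__()
        self.conv1 = conv_block(in_ch, 64)   # first convolution module
        self.pool1 = nn.MaxPool2d(2)         # first pooling module
        self.conv2 = conv_block(64, 128)
        self.conv3 = conv_block(128, 256)
        self.conv4 = conv_block(256, 256)    # outputs first-scale map f0_s1
        self.conv5 = conv_block(256, 256)    # outputs second-scale map f0_s2

    def forward(self, x):
        x = self.conv3(self.conv2(self.pool1(self.conv1(x))))
        f0_1 = self.conv4(x)
        f0_2 = self.conv5(f0_1)
        return f0_1, f0_2
```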
3. The multi-scale deformable attention-based infrared image target tracking system of claim 1, wherein the multi-scale position encoding $SLP_s$ of the search map is constructed as follows:
a11, randomly generate two layer-level encodings for the search map: the first layer-level encoding $SL_{s1}$ has the same dimension as the search-map feature $f_{s1}$ at the first scale, and the second layer-level encoding $SL_{s2}$ has the same dimension as the feature $f_{s2}$ at the second scale;
a12, from the search-map features $f_{s1}$ and $f_{s2}$ at the first and second scales, compute the first intra-layer position encoding $P_{s1}$ and the second intra-layer position encoding $P_{s2}$ of the search map using trigonometric functions;
a13, add $SL_{s1}$ to $P_{s1}$ and $SL_{s2}$ to $P_{s2}$, then concatenate to obtain the multi-scale position encoding of the search map:
$$SLP_s = [SL_{s1} + P_{s1},\ SL_{s2} + P_{s2}]$$
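A minimal PyTorch sketch of this two-scale position encoding, assuming the standard sinusoidal recipe for the intra-layer codes and learnable, randomly initialized layer-level codes; all sizes are illustrative and the embedding dimension is assumed even:

```python
# A minimal sketch of the multi-scale position encoding in claim 3, assuming
# PyTorch; the sinusoidal formula is the standard transformer recipe and all
# sizes are illustrative (dim is assumed even).
import math
import torch
import torch.nn as nn

def sine_encoding(n: int, dim: int) -> torch.Tensor:
    """Trigonometric intra-layer position code P, shape (n, dim)."""
    pos = torch.arange(n, dtype=torch.float32)[:, None]
    i = torch.arange(0, dim, 2, dtype=torch.float32)
    div = torch.exp(-math.log(10000.0) * i / dim)
    p = torch.zeros(n, dim)
    p[:, 0::2] = torch.sin(pos * div)
    p[:, 1::2] = torch.cos(pos * div)
    return p

class MultiScalePositionCode(nn.Module):
    def __init__(self, n1: int, n2: int, dim: int):
        super().__init__()
        # Randomly generated layer-level codes SL_s1 and SL_s2 (learnable).
        self.sl1 = nn.Parameter(torch.randn(n1, dim))
        self.sl2 = nn.Parameter(torch.randn(n2, dim))
        self.register_buffer("p1", sine_encoding(n1, dim))
        self.register_buffer("p2", sine_encoding(n2, dim))

    def forward(self) -> torch.Tensor:
        # SLP_s = [SL_s1 + P_s1, SL_s2 + P_s2], concatenated along length
        return torch.cat([self.sl1 + self.p1, self.sl2 + self.p2], dim=0)
```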
4. The multi-scale deformable attention-based infrared image target tracking system of claim 1, wherein the cross attention module (303) computes the mutual attention features $\tilde{T}_s$ and $\tilde{T}_t$ of the two input context features $T_s$ and $T_t$ as follows:
b1, add $T_s$ to the search-map multi-scale position encoding $SLP_s$ and map the sum through two fully connected layers $W_{sq}$ and $W_{sk}$ to obtain the vectors $Q'_s$ and $K_s$; map $T_s$ through the fully connected layer $W_{sv}$ to obtain the vector $V_s$;
b2, add $T_t$ to the template-map multi-scale position encoding $SLP_t$ and map the sum through two fully connected layers $W_{tq}$ and $W_{tk}$ to obtain the vectors $Q'_t$ and $K_t$; map $T_t$ through the fully connected layer $W_{tv}$ to obtain the vector $V_t$;
b3, compute the $i$-th element of the attention feature $\tilde{T}_s$ of $T_s$ on $T_t$:
$$\tilde{T}_{si} = \sum_{j} \frac{\exp\!\big(\mathrm{dot}(Q'_{si}, K_{tj}) / \sqrt{d_{kt}}\big)}{\sum_{j'} \exp\!\big(\mathrm{dot}(Q'_{si}, K_{tj'}) / \sqrt{d_{kt}}\big)}\, V_{tj}$$
and the $j$-th element of the attention feature $\tilde{T}_t$ of $T_t$ on $T_s$:
$$\tilde{T}_{tj} = \sum_{i} \frac{\exp\!\big(\mathrm{dot}(Q'_{tj}, K_{si}) / \sqrt{d_{ks}}\big)}{\sum_{i'} \exp\!\big(\mathrm{dot}(Q'_{tj}, K_{si'}) / \sqrt{d_{ks}}\big)}\, V_{si}$$
where $\mathrm{dot}$ denotes the vector dot product, $d_{kt}$ is the dimension of $K_t$, and $d_{ks}$ is the dimension of $K_s$.
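The two softmax sums in b3 are an ordinary pair of scaled dot-product attentions run in both directions; a minimal PyTorch sketch, with the layer names taken from the claim and all sizes illustrative:

```python
# A minimal sketch of the bidirectional cross attention in steps b1-b3,
# assuming PyTorch; the W_* layers follow the claim, sizes are illustrative.
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.w_sq, self.w_sk, self.w_sv = (nn.Linear(dim, dim) for _ in range(3))
        self.w_tq, self.w_tk, self.w_tv = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, t_s, slp_s, t_t, slp_t):
        # b1: queries/keys from T_s + SLP_s, values from T_s alone
        q_s, k_s = self.w_sq(t_s + slp_s), self.w_sk(t_s + slp_s)
        v_s = self.w_sv(t_s)
        # b2: the same construction on the template side
        q_t, k_t = self.w_tq(t_t + slp_t), self.w_tk(t_t + slp_t)
        v_t = self.w_tv(t_t)
        d_kt, d_ks = k_t.shape[-1], k_s.shape[-1]
        # b3: T_s attends to T_t (a_st), and T_t attends to T_s (a_ts)
        a_st = torch.softmax(q_s @ k_t.transpose(-2, -1) / d_kt ** 0.5, -1) @ v_t
        a_ts = torch.softmax(q_t @ k_s.transpose(-2, -1) / d_ks ** 0.5, -1) @ v_s
        return a_st, a_ts
```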
5. The multi-scale deformable attention-based infrared image target tracking system of claim 1, wherein the prediction module (4) comprises a classification prediction network (401), a frame prediction network (402), and a target frame calculation module (403); the classification prediction network (401) obtains from the fusion feature $G_{st}$ the classification results $C = [C_1, C_2, \ldots, C_{len}]$ of the target in the search map; the frame prediction network (402) obtains from the fusion feature $G_{st}$ the predicted frames $B = [B_1, B_2, \ldots, B_{len}]$ of the target in the search map; here $len$ is the length of the multi-scale feature of the search map, $l = 1, 2, \ldots, len$, $C_l = [C_{l0}, C_{l1}]$ is the normalized class obtained from the $l$-th element of the fusion feature $G_{st}$, and $B_l = [B_{lx}, B_{ly}, B_{lw}, B_{lh}]$ is the target rectangular frame predicted from the $l$-th element, where $B_{lx}, B_{ly}$ are the center-point coordinates of the rectangular frame and $B_{lw}, B_{lh}$ are its width and height;
the target frame calculation module (403) calculates the target frame in the search map from the classification results $C = [C_1, C_2, \ldots, C_{len}]$ and the predicted frames $B = [B_1, B_2, \ldots, B_{len}]$.
6. The infrared image target tracking system based on multi-scale deformable attention of claim 5, wherein the target frame calculation module (403) calculates the target frame in the search map as follows:
find the element index $l^*$ corresponding to the maximum of $C_{l0}$ over $C = [C_1, C_2, \ldots, C_{len}]$;
the predicted rectangular frame $B_{l^*}$ corresponding to the $l^*$-th element is the target frame in the search map.
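A minimal PyTorch sketch of the prediction module in claims 5 and 6; the two-layer head depths, the sigmoid box range, and the softmax layout are assumptions:

```python
# A minimal sketch of the prediction module in claims 5-6, assuming PyTorch;
# head depths and output activations are illustrative assumptions.
import torch
import torch.nn as nn

class PredictionModule(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.cls_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                      nn.Linear(dim, 2))  # C_l = [C_l0, C_l1]
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                      nn.Linear(dim, 4))  # B_l = [x, y, w, h]

    def forward(self, g_st):                  # g_st: (len, dim) fusion feature
        c = self.cls_head(g_st).softmax(-1)   # normalized classes
        b = self.box_head(g_st).sigmoid()     # boxes in [0, 1]
        l_star = c[:, 0].argmax()             # element with the largest C_l0
        return c, b, b[l_star]                # B_{l*} is the target frame
```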
7. The multi-scale deformable attention-based infrared image target tracking system of claim 5, wherein the training step of the system comprises:
c1, randomly select two pictures from a training video, one as the template map and the other as the search map; the prediction module (4) of the infrared image target tracking system to be trained outputs the classification results $C = [C_1, C_2, \ldots, C_{len}]$ and the predicted frames $B = [B_1, B_2, \ldots, B_{len}]$;
c2, optimize the parameters of the infrared image target tracking system by minimizing a loss function to obtain the trained infrared image target tracking system;
the loss function is $L = L_{class} + L_{loss} + L_{giou}$,
where $L_{class}$ is the classification loss:
$$L_{class} = -\frac{1}{len} \sum_{l=1}^{len} W[U_l] \big( (1 - U_l) \log C_{l0} + U_l \log C_{l1} \big)$$
$U_l$ is determined by the position of the $l$-th predicted frame $B_l$ relative to the target ground-truth frame $B_T$ in the search map: $U_l = 0$ if $B_l$ falls inside $B_T$, and $U_l = 1$ otherwise; $W[1]$ is the negative-sample weight and $W[0]$ is the positive-sample weight;
$L_{loss}$ is the regression loss:
$$L_{loss} = \frac{1}{count} \sum_{h:\, U_h = 0} Pr_h \, \lVert B_h - B_T \rVert_1$$
where $count$ is the number of $U_l$ with value 0, i.e. $count = \sum_{l=1}^{len} (1 - U_l)$, and $Pr_h$ is the classification accuracy, $Pr_h = C_{h0}$;
$L_{giou}$ is the GIoU loss:
$$L_{giou} = \frac{1}{count} \sum_{h:\, U_h = 0} L_{giou}(h)$$
where $L_{giou}(h) = 1 - GIOU_h$ is the GIoU loss of the predicted frame $B_h$ whose corresponding $U_h$ is 0, and $GIOU_h$ is the GIoU value between $B_h$ and the target ground-truth frame $B_T$.
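Below is a minimal PyTorch sketch of the three-term loss under the reconstruction above; note that $U_l = 0$ marks positives, $Pr_h$ is taken as $C_{h0}$, and the regression term is taken as an L1 distance (the exact regression form is not recoverable from the source, so these are assumptions); the GIoU term uses torchvision's generalized_box_iou_loss:

```python
# A minimal sketch of the claim-7 loss, assuming PyTorch/torchvision and the
# reconstruction above; the L1 regression form and Pr_h = C_h0 are assumptions.
import torch
from torchvision.ops import generalized_box_iou_loss

def tracking_loss(c, b, b_t, u, w_pos=1.0, w_neg=0.1):
    # c: (len, 2) normalized classes, b: (len, 4) boxes as (cx, cy, w, h),
    # b_t: (4,) ground-truth box, u: (len,) labels with 0 = positive sample
    uf = u.float()
    weights = w_pos * (1 - uf) + w_neg * uf                  # W[0] / W[1]
    l_class = -(weights * ((1 - uf) * c[:, 0].log()
                           + uf * c[:, 1].log())).mean()
    pos = u == 0
    pr = c[pos, 0]                                           # Pr_h, taken as C_h0
    l_loss = (pr * (b[pos] - b_t).abs().sum(-1)).mean()      # confidence-weighted L1
    xyxy = lambda z: torch.cat([z[..., :2] - z[..., 2:] / 2,
                                z[..., :2] + z[..., 2:] / 2], dim=-1)
    l_giou = generalized_box_iou_loss(xyxy(b[pos]),
                                      xyxy(b_t).expand_as(xyxy(b[pos]))).mean()
    return l_class + l_loss + l_giou
```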
8. An infrared image target tracking method based on multi-scale deformable attention is characterized by comprising the following steps:
take the first frame of the video to be tracked as the template map and mark the rectangular frame of the target to be tracked in the template map; take the subsequent frames of the video as search maps; input them respectively into the template branch and the search branch of the infrared image target tracking system according to any one of claims 1-7, and obtain the rectangular frame of the target in the search map from the prediction module.
CN202210921013.7A 2022-08-02 2022-08-02 Infrared image target tracking system and method based on multi-scale deformable attention Active CN115239765B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210921013.7A CN115239765B (en) 2022-08-02 2022-08-02 Infrared image target tracking system and method based on multi-scale deformable attention


Publications (2)

Publication Number Publication Date
CN115239765A CN115239765A (en) 2022-10-25
CN115239765B true CN115239765B (en) 2024-03-29

Family

ID=83678018


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116402858B * 2023-04-11 2023-11-21 Hefei University of Technology Transformer-based space-time information fusion infrared target tracking method

Citations (6)

Publication number Priority date Publication date Assignee Title
DE102019123756A1 (en) * 2019-09-05 2021-03-11 Connaught Electronics Ltd. Neural network for performing semantic segmentation of an input image
CN113628245A * 2021-07-12 2021-11-09 Institute of Automation, Chinese Academy of Sciences Multi-target tracking method, device, electronic equipment and storage medium
CN113744311A * 2021-09-02 2021-12-03 Beijing Institute of Technology Twin neural network moving target tracking method based on full-connection attention module
CN113963009A * 2021-12-22 2022-01-21 Zhongke Shiyu (Beijing) Technology Co., Ltd. Local self-attention image processing method and model based on deformable blocks
CN114359310A * 2022-01-13 2022-04-15 Zhejiang University 3D ventricle nuclear magnetic resonance video segmentation optimization system based on deep learning
CN114694024A * 2022-03-21 2022-07-01 Binzhou University Unmanned aerial vehicle ground target tracking method based on multilayer feature self-attention transformation network

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20210319420A1 (en) * 2020-04-12 2021-10-14 Shenzhen Malong Technologies Co., Ltd. Retail system and methods with visual object tracking


Non-Patent Citations (4)

Title
Ding Cheng et al. Exploring Cross-Modality Commonalities via Dual-Stream Multi-Branch Network for Infrared-Visible Person Re-Identification. IEEE Access, 2020, full text. *
Xiaokang Zhang et al. Multilevel Deformable Attention-Aggregated Networks for Change Detection in Bitemporal Remote Sensing Imagery. IEEE Transactions on Geoscience and Remote Sensing, full text. *
Jiang Linfeng. Research on Pedestrian Tracking Algorithms Based on Deep Learning. China Master's Theses Full-text Database, Information Science and Technology, full text. *
Dong Jifu, Liu Chang, Cao Fangwei, Ling Yuan, Gao Xiang. Online adaptive Siamese network tracking algorithm based on the attention mechanism. Laser & Optoelectronics Progress, 2020, No. 02, full text. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant