CN115239765B: Infrared image target tracking system and method based on multi-scale deformable attention

Publication number: CN115239765B (granted; published application CN115239765A)
Application number: CN202210921013.7A
Authority: CN (China)
Prior art keywords: scale, attention, feature, module, search
Legal status: Active
Inventors: 李小红, 周喜, 齐美彬, 庄硕, 郝世杰, 刘学亮
Applicant and assignee: Hefei University of Technology
Other languages: Chinese (zh)

Classifications

    • G06T 7/246: Image analysis; analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06N 3/02, G06N 3/08: Computing arrangements based on biological models; neural networks and their learning methods
    • G06V 10/40: Extraction of image or video features
    • G06V 10/764: Image or video recognition using classification, e.g. of video objects
    • G06V 10/806: Fusion of extracted features
    • G06V 10/82: Image or video recognition using neural networks
    • G06T 2207/10016: Video; image sequence
    • G06T 2207/10048: Infrared image

Abstract

The invention discloses an infrared image target tracking system and method based on multi-scale deformable attention. The tracking system comprises a search-map branch, a template-map branch, a feature fusion module and a prediction module. The search-map branch extracts the search-map multi-scale feature F_s, obtained by stitching the features of the search map at a first scale and a second scale; the template branch extracts the template-map multi-scale feature F_t, obtained by stitching the features of the template map at a third scale and a fourth scale; the feature fusion module computes the fusion feature G_st from the search-map multi-scale feature F_s and the template-map multi-scale feature F_t; and the prediction module predicts the target box in the search map from the fusion feature G_st. By combining low-level and high-level features, the system improves target tracking in infrared images.

Description

Infrared image target tracking system and method based on multi-scale deformable attention
Technical Field
The invention belongs to the field of computer vision, and particularly relates to an infrared image target tracking system and method based on multi-scale deformable attention.
Background
The visual target tracking task is, given the target to be tracked in the initial frame of a video, to predict the position and size of that target in subsequent video frames. Thermal infrared target tracking performs this task under thermal infrared camera imaging. Infrared target tracking can follow a target under low visibility and even complete darkness, is unaffected by illumination changes, and can therefore work around the clock and in complex environments; it has high application value and is widely used in scenarios such as night monitoring by surveillance robots, night patrol by security robots, and night-time urban traffic monitoring. The difficulty of infrared target tracking stems from missing texture in infrared images, low signal-to-noise ratio, blurred visual appearance, and the tendency of tracked objects to deform and change scale.
In order to address these problems, some methods capture global features through a Transformer attention mechanism and use the contextual relationships among features to establish associations and long-range dependencies between distant features. Although such methods achieve good results, the Transformer attention module has limitations when processing image feature maps: at initialization, the attention of the self-attention module is spread almost uniformly over the feature map, whereas in the final stage of training the attention maps become sparse and focus only on parts of the target, such as a person's limbs. Learning this large change in the attention maps requires a long training process, so convergence is slow, and the heavy computational cost severely limits the spatial resolution of the features.
Disclosure of Invention
Purpose of the invention: in view of the problems in the prior art, the invention provides an infrared image target tracking system based on multi-scale deformable attention, which fuses low-level and high-level features and facilitates target tracking in infrared images.
Technical scheme: in one aspect, the invention provides an infrared image target tracking system based on multi-scale deformable attention, comprising: a search-map branch 1, a template-map branch 2, a feature fusion module 3 and a prediction module 4. The search-map branch 1 comprises a first feature extraction module 101 and a first conversion-splicing module 102; the template-map branch 2 comprises a second feature extraction module 201 and a second conversion-splicing module 202.
The first feature extraction module 101 extracts the initial feature maps f^0_s1 and f^0_s2 of the search map at the first and second scales; the first conversion-splicing module 102 unifies the channels of f^0_s1 and f^0_s2 and adjusts their dimensions to obtain the search-map features f_s1 and f_s2 at the first and second scales, which are spliced into the search-map multi-scale feature F_s = [f_s1, f_s2]. The second feature extraction module 201 extracts the initial feature maps f^0_t1 and f^0_t2 of the template map at the third and fourth scales; the second conversion-splicing module 202 unifies the channels of f^0_t1 and f^0_t2 and adjusts their dimensions to obtain the template-map features f_t1 and f_t2 at the third and fourth scales, which are spliced into the template-map multi-scale feature F_t = [f_t1, f_t2]. The feature fusion module 3 computes the fusion feature G_st from F_s and F_t, and the prediction module 4 predicts the target box in the search map from G_st.
The search map is the input of the search-map branch 1, and the template map is the input of the template-map branch 2.
The first feature extraction module 101 and the second feature extraction module 201 have the same structure: a first convolution module, a first pooling module, a second convolution module, a third convolution module, a fourth convolution module and a fifth convolution module in sequential cascade.
The fourth convolution module in the first feature extraction module 101 outputs the initial feature map f^0_s1 of the search map at the first scale, and the fifth convolution module outputs the initial feature map f^0_s2 of the search map at the second scale. The fourth convolution module in the second feature extraction module 201 outputs the initial feature map f^0_t1 of the template map at the third scale, and the fifth convolution module outputs the initial feature map f^0_t2 of the template map at the fourth scale.
The feature fusion module 3 comprises N cascaded feature fusion submodules. The input of the first-level submodule is the search-map multi-scale feature F_s and the template-map multi-scale feature F_t; it outputs the first-level search-to-template attention feature G^1_st and the first-level template-to-search attention feature G^1_ts. The input of the nth-level submodule is the output G^{n-1}_st and G^{n-1}_ts of level n-1, and the output G^N_st of the Nth-level submodule is the fusion feature G_st produced by the feature fusion module.
The nth-level feature fusion submodule comprises a first deformable self-attention module 301, a second deformable self-attention module 302 and a cross-attention module 303, n = 1, 2, …, N. The first deformable self-attention module 301 and the second deformable self-attention module 302 compute the context features T_s and T_t of the two input features I_s and I_t, respectively; the cross-attention module 303 computes the mutual attention features G^n_st and G^n_ts of the context features T_s and T_t.
The first deformable self-attention module 301 extracts the context feature T_s of the input feature I_s by the following steps:
a1. Sum the input feature I_s with the search-map multi-scale position code SLP_s to generate the first query vector Q_s = [Q_s1, Q_s2], where Q_s1 is the query vector at the first scale and Q_s2 the query vector at the second scale;
a2. Input the first query vector Q_s, the input feature I_s and the search-map initial reference points R_s into a first multi-head attention network to obtain the first multi-head deformable attention I'_s of the search map; the first multi-head attention network contains M parallel attention units.
The search-map initial reference points R_s are computed as follows: compute, for each vector of the first-scale feature f_s1, its coordinates on the initial feature map f^0_s1, giving the first initial reference points r_s1; compute, for each vector of the second-scale feature f_s2, its coordinates on the initial feature map f^0_s2, giving the second initial reference points r_s2.
Normalize the coordinates of the first initial reference points r_s1 and map them onto the initial feature map f^0_s2 to obtain the first coordinate mapping points r_s12; normalize the coordinates of the second initial reference points r_s2 and map them onto the initial feature map f^0_s1 to obtain the second coordinate mapping points r_s21.
The search-map initial reference points are constructed as R_s = [(r_s1, r_s12), (r_s2, r_s21)].
The first multi-head deformable attention of the search map is I'_s = [I'_s1, I'_s2], where I'_s1 is the deformable attention at the first scale and I'_s2 the deformable attention at the second scale.
The ith element I'_s1i of I'_s1 is computed as follows:
The ith vector Q_s1i of Q_s1 passes through the fully connected layer Linear2 to obtain, for the ith elements r_s1i and r_s12i of the first initial reference points r_s1 and the first coordinate mapping points r_s12, the sampling offsets Δr^{mk}_{s1i} and Δr^{mk}_{s12i} of each sampling point in each attention unit, where m = 1, 2, …, M indexes the attention units of the first multi-head attention network, k = 1, 2, …, K indexes the sampling points, and K is the total number of sampling points per attention unit.
Add r_s1i and Δr^{mk}_{s1i} to obtain the coordinates of the kth sampling point of the mth attention unit at the first scale, p^{mk}_{s1i} = r_s1i + Δr^{mk}_{s1i}.
Add r_s12i and Δr^{mk}_{s12i} to obtain the coordinates of the kth sampling point of the mth attention unit at the second scale, p^{mk}_{s12i} = r_s12i + Δr^{mk}_{s12i}.
Interpolate p^{mk}_{s1i} on the first-scale feature map f^0_s1 after the fully connected layer Linear1 to obtain the value of the kth sampling point of the mth attention unit for Q_s1i at the first scale, denoted v^{mk}_{s1i} = Inter(Linear1(f^0_s1), p^{mk}_{s1i}), where Inter is an interpolation function.
Interpolate p^{mk}_{s12i} on the second-scale feature map f^0_s2 after the fully connected layer Linear1 to obtain the value of the kth sampling point of the mth attention unit for Q_s1i at the second scale, denoted v^{mk}_{s12i} = Inter(Linear1(f^0_s2), p^{mk}_{s12i}).
The ith vector Q_s1i of Q_s1 passes through the fully connected layer Linear3 to obtain the attention weight A^{mk}_{s1i} corresponding to v^{mk}_{s1i} and the attention weight A^{mk}_{s12i} corresponding to v^{mk}_{s12i}.
This yields I'_s1i = Σ_{m=1..M} Σ_{k=1..K} ( A^{mk}_{s1i} · v^{mk}_{s1i} + A^{mk}_{s12i} · v^{mk}_{s12i} ).
The ith vector Q_s2i of Q_s2 passes through the fully connected layer Linear2 to obtain, for the ith element r_s2i of the second initial reference points r_s2 and the ith element r_s21i of the second coordinate mapping points r_s21, the sampling offsets Δr^{mk}_{s2i} and Δr^{mk}_{s21i} of each sampling point in each attention unit.
Add r_s2i and Δr^{mk}_{s2i} to obtain the coordinates of the kth sampling point of the mth attention unit at the second scale, p^{mk}_{s2i} = r_s2i + Δr^{mk}_{s2i}.
Add r_s21i and Δr^{mk}_{s21i} to obtain the coordinates of the kth sampling point of the mth attention unit at the first scale, p^{mk}_{s21i} = r_s21i + Δr^{mk}_{s21i}.
Interpolate p^{mk}_{s21i} on the first-scale feature map f^0_s1 after Linear1 to obtain the value of the kth sampling point of the mth attention unit for Q_s2i at the first scale, denoted v^{mk}_{s21i} = Inter(Linear1(f^0_s1), p^{mk}_{s21i}).
Interpolate p^{mk}_{s2i} on the second-scale feature map f^0_s2 after Linear1 to obtain the value of the kth sampling point of the mth attention unit for Q_s2i at the second scale, denoted v^{mk}_{s2i} = Inter(Linear1(f^0_s2), p^{mk}_{s2i}).
The ith vector Q_s2i of Q_s2 passes through the fully connected layer Linear3 to obtain the attention weight A^{mk}_{s21i} corresponding to v^{mk}_{s21i} and the attention weight A^{mk}_{s2i} corresponding to v^{mk}_{s2i}.
This yields I'_s2i = Σ_{m=1..M} Σ_{k=1..K} ( A^{mk}_{s21i} · v^{mk}_{s21i} + A^{mk}_{s2i} · v^{mk}_{s2i} ).
a3. Sum and normalize I_s and I'_s, then pass the result through an FFN to obtain the context feature T_s of the input feature I_s.
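For concreteness, the following is a minimal PyTorch sketch of the sampling scheme of steps a1-a2, an illustration under stated assumptions rather than the patent's implementation: each query predicts K offsets and weights per attention unit for each of the two scales, values are gathered by bilinear interpolation, and a single normalized reference point per query is shared across scales, which collapses the per-scale reference/mapping-point pair (r_s1, r_s12) of the text into one point. The head count and layout are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn

class DeformableSelfAttention(nn.Module):
    """Each query predicts K sampling offsets and attention weights per head
    for each scale, gathers bilinearly interpolated values at the offset
    reference points, and mixes them with the softmax-normalized weights."""

    def __init__(self, dim: int = 256, heads: int = 8, points: int = 16,
                 scales: int = 2):
        super().__init__()
        self.h, self.k, self.s = heads, points, scales
        self.value_proj = nn.Linear(dim, dim)                       # Linear1
        self.offsets = nn.Linear(dim, heads * scales * points * 2)  # Linear2
        self.weights = nn.Linear(dim, heads * scales * points)      # Linear3
        self.out = nn.Linear(dim, dim)

    def forward(self, query, feats, ref):
        # query: (B, L, C) = I + SLP; feats: per-scale maps [(B, C, H, W), ...]
        # ref: (B, L, 2) reference points, normalized to [0, 1]
        B, L, C = query.shape
        off = self.offsets(query).view(B, L, self.h, self.s, self.k, 2)
        w = self.weights(query).view(B, L, self.h, self.s * self.k)
        w = w.softmax(-1).view(B, L, self.h, self.s, self.k)  # weights sum to 1
        out = query.new_zeros(B, L, self.h, C // self.h)
        for lvl, fmap in enumerate(feats):
            _, _, H, W = fmap.shape
            v = self.value_proj(fmap.flatten(2).transpose(1, 2))     # (B,HW,C)
            v = v.transpose(1, 2).reshape(B * self.h, C // self.h, H, W)
            wh = torch.tensor([W, H], dtype=query.dtype, device=query.device)
            # sampling points = reference point + offsets (offsets in pixels),
            # mapped into grid_sample's [-1, 1] coordinate range
            loc = ref[:, :, None, None, :] + off[:, :, :, lvl] / wh  # (B,L,h,K,2)
            grid = (2.0 * loc - 1.0).permute(0, 2, 1, 3, 4)          # (B,h,L,K,2)
            samp = F.grid_sample(v, grid.reshape(B * self.h, L, self.k, 2),
                                 align_corners=False)    # (B*h, C/h, L, K)
            samp = samp.view(B, self.h, C // self.h, L, self.k)
            wl = w[:, :, :, lvl].permute(0, 2, 1, 3)                 # (B,h,L,K)
            out += (samp * wl[:, :, None]).sum(-1).permute(0, 3, 1, 2)
        return self.out(out.reshape(B, L, C))
```

Step a3 then adds this output to I_s, normalizes, and applies the FFN to produce T_s. In the embodiment below, K = 16.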
The search-map multi-scale position code SLP_s is constructed as follows:
a11. Randomly generate two scale-level codes for the search map: the first-level code SL_s1, whose dimension is the same as that of the first-scale feature f_s1, and the second-level code SL_s2, whose dimension is the same as that of the second-scale feature f_s2;
a12. From the search-map features f_s1 and f_s2 at the first and second scales, compute the first intra-level position code P_s1 and the second intra-level position code P_s2 of the search map using trigonometric functions;
a13. Add SL_s1 to P_s1 and SL_s2 to P_s2, then splice the results to obtain the search-map multi-scale position code
SLP_s = [SL_s1 + P_s1, SL_s2 + P_s2].
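A small PyTorch sketch of a11-a13 under stated assumptions: the level code SL is modeled here as one learnable, randomly initialized vector per scale broadcast over positions (the patent only says it is randomly generated with matching dimension), P is a standard 2D sinusoidal code, and the spatial shapes 32×32 and 16×16 are assumptions matching the embodiment's sequence lengths 1024 and 256.

```python
import torch
from torch import nn

def sine_position_encoding(h: int, w: int, dim: int = 256,
                           temperature: float = 10000.0) -> torch.Tensor:
    """Standard 2D sinusoidal encoding (the trigonometric intra-level code P)."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dim_t = temperature ** (2 * (torch.arange(dim // 2) // 2) / (dim // 2))
    pos_x = xs.flatten()[:, None] / dim_t
    pos_y = ys.flatten()[:, None] / dim_t
    pos_x = torch.stack((pos_x[:, 0::2].sin(), pos_x[:, 1::2].cos()), -1).flatten(1)
    pos_y = torch.stack((pos_y[:, 0::2].sin(), pos_y[:, 1::2].cos()), -1).flatten(1)
    return torch.cat((pos_y, pos_x), dim=1)              # (h*w, dim)

class MultiScalePositionCoding(nn.Module):
    """SLP = [SL_1 + P_1, SL_2 + P_2]: learnable per-scale level codes SL added
    to sinusoidal intra-level codes P, then concatenated along the length."""
    def __init__(self, shapes=((32, 32), (16, 16)), dim: int = 256):
        super().__init__()
        self.shapes = shapes
        self.level = nn.Parameter(torch.randn(len(shapes), dim))  # SL, random
        self.register_buffer("pos", torch.cat(
            [sine_position_encoding(h, w, dim) for h, w in shapes]))

    def forward(self) -> torch.Tensor:
        levels = torch.cat([self.level[i].expand(h * w, -1)
                            for i, (h, w) in enumerate(self.shapes)])
        return self.pos + levels                          # (sum h*w, dim)
```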
The cross-attention module 303 computes the mutual attention features G^n_st and G^n_ts of the context features T_s and T_t of the two inputs by the following steps:
b1. Add T_s to the search-map multi-scale position code SLP_s and map the sum through two fully connected layers W_sq and W_sk to obtain the vectors Q'_s and K_s; map T_s through the fully connected layer W_sv to obtain the vector V_s;
b2. Add T_t to the template-map multi-scale position code SLP_t and map the sum through two fully connected layers W_tq and W_tk to obtain the vectors Q'_t and K_t; map T_t through the fully connected layer W_tv to obtain the vector V_t;
b3. Compute the attention feature G^n_st of T_s toward T_t; its ith element is
G^n_st,i = Σ_j softmax_j( dot(Q'_s,i, K_t,j) / √d_kt ) · V_t,j.
Compute the attention feature G^n_ts of T_t toward T_s; its jth element is
G^n_ts,j = Σ_i softmax_i( dot(Q'_t,j, K_s,i) / √d_ks ) · V_s,i,
where dot denotes the vector dot product operation, d_kt is the dimension of K_t and d_ks is the dimension of K_s.
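A single-head PyTorch sketch of steps b1-b3 follows; the head count and tensor layout are assumptions, and the patent's separate per-branch projection weights are mirrored by instantiating the module twice.

```python
import torch
from torch import nn

class CrossAttention(nn.Module):
    """b1-b3 sketch: queries and keys are built from the position-aware sums
    (T + SLP), values from the raw context features."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.wq = nn.Linear(dim, dim)     # W_q
        self.wk = nn.Linear(dim, dim)     # W_k
        self.wv = nn.Linear(dim, dim)     # W_v
        self.scale = dim ** -0.5          # 1 / sqrt(d_k)

    def forward(self, q_feat, q_pos, kv_feat, kv_pos):
        q = self.wq(q_feat + q_pos)                 # Q'
        k = self.wk(kv_feat + kv_pos)               # K
        v = self.wv(kv_feat)                        # V
        attn = (q @ k.transpose(-2, -1)) * self.scale
        return attn.softmax(dim=-1) @ v             # softmax over the kv axis

# separate weight sets for the two directions, as in b1/b2
attn_st, attn_ts = CrossAttention(), CrossAttention()
T_s, SLP_s = torch.randn(1, 1280, 256), torch.randn(1, 1280, 256)
T_t, SLP_t = torch.randn(1, 320, 256), torch.randn(1, 320, 256)
G_st = attn_st(T_s, SLP_s, T_t, SLP_t)   # search-to-template attention feature
G_ts = attn_ts(T_t, SLP_t, T_s, SLP_s)   # template-to-search attention feature
```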
The prediction module 4 comprises a classification prediction network 401, a box prediction network 402 and a target box calculation module 403. The classification prediction network 401 is configured to obtain the classification result C = [C_1, C_2, …, C_len] of the target in the search map from the fusion feature G_st; the box prediction network 402 is configured to obtain the predicted boxes B = [B_1, B_2, …, B_len] of the target in the search map from G_st. Here len is the length of the search-map multi-scale feature, l = 1, 2, …, len; C_l = [C_l0, C_l1] is the normalized classification obtained from the lth element of G_st, and B_l = [B_lx, B_ly, B_lw, B_lh] is the target rectangular box predicted from the lth element of G_st, where B_lx, B_ly are the coordinates of the center of the rectangular box and B_lw, B_lh are its width and height.
The target box calculation module 403 is configured to compute the target box in the search map from the classification result C = [C_1, C_2, …, C_len] and the predicted boxes B = [B_1, B_2, …, B_len] of the target in the search map.
The target box calculation module 403 computes the target box in the search map as follows:
Find the element index l* corresponding to the maximum of C_l0 over C = [C_1, C_2, …, C_len];
the predicted target rectangular box B_l* corresponding to the l*th element is the target box in the search map.
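In code form the selection rule is a single argmax over the target probabilities; the tensor layout below is an assumption.

```python
import torch

def select_target_box(C: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """C: (len, 2) normalized scores [C_l0 target, C_l1 background];
    B: (len, 4) predicted boxes (cx, cy, w, h). Returns B_{l*}, the box of
    the element whose target probability C_l0 is largest."""
    l_star = torch.argmax(C[:, 0])
    return B[l_star]
```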
The training steps of the infrared image target tracking system are:
c1. Randomly select two pictures from a training video, take one as the template map and the other as the search map, input them into the infrared image target tracking system to be trained, and let the prediction module 4 output the classification result C = [C_1, C_2, …, C_len] and the predicted boxes B = [B_1, B_2, …, B_len];
c2. Optimize the parameters of the infrared image target tracking system by minimizing a loss function to obtain the trained system.
The loss function is L = L_class + L_loss + L_giou.
L_class is the classification loss: L_class = -(1/len) Σ_{l=1..len} W[U_l] · log C_{l,U_l}, where the label U_l is determined from the position of the lth predicted box B_l relative to the ground-truth target box B^T in the search map: U_l = 0 when B_l is taken as target and U_l = 1 when it is taken as background; W[1] is the negative-sample weight and W[0] the positive-sample weight.
L_loss is the regression loss: L_loss = (1/count) Σ_{h: U_h=0} Pr_h · L1(B_h, B^T), where count is the number of labels U_l with value 0, i.e. count = |{l : U_l = 0}|, and Pr_h is the classification accuracy of the hth element, obtained from its normalized classification result.
L_giou is the GIOU loss: L_giou = (1/count) Σ_{h: U_h=0} Pr_h · L_giou(h), where L_giou(h) is the GIOU loss of the element B_h among the predicted boxes whose corresponding U_h is 0: L_giou(h) = 1 - GIOU_h, with GIOU_h the GIOU value between B_h and the ground-truth target box B^T.
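The following PyTorch sketch assembles L under stated assumptions; it is not the patent's code. Assumed details: the label U_l is taken as a center-inside-box test, Pr_h is taken as the predicted target probability C_h0, w_pos/w_neg stand in for W[0]/W[1], and torchvision's generalized_box_iou supplies the GIOU term.

```python
import torch
from torchvision.ops import generalized_box_iou

def cxcywh_to_xyxy(b: torch.Tensor) -> torch.Tensor:
    cx, cy, w, h = b.unbind(-1)
    return torch.stack((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2), -1)

def tracking_loss(C, B, gt, w_pos=1.0, w_neg=0.1):
    # C: (len, 2) normalized [target, background] probabilities
    # B: (len, 4) predicted boxes (cx, cy, w, h); gt: (4,) ground-truth box B^T
    x1, y1, x2, y2 = cxcywh_to_xyxy(gt)
    # assumed label rule: U_l = 0 (positive) when the box center falls in B^T
    pos = (B[:, 0] > x1) & (B[:, 0] < x2) & (B[:, 1] > y1) & (B[:, 1] < y2)
    l_class = -(w_pos * torch.log(C[pos, 0] + 1e-8).sum() +
                w_neg * torch.log(C[~pos, 1] + 1e-8).sum()) / C.shape[0]
    if not pos.any():
        return l_class
    pr = C[pos, 0].detach()                 # classification accuracy Pr_h
    l1 = (B[pos] - gt).abs().sum(-1)        # L1 regression term
    giou = generalized_box_iou(cxcywh_to_xyxy(B[pos]),
                               cxcywh_to_xyxy(gt)[None]).squeeze(-1)
    return l_class + (pr * l1).mean() + (pr * (1.0 - giou)).mean()
```

The Pr_h factor is what dynamically couples the regression and GIOU terms to the classification quality.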
In another aspect, the invention also discloses a method for tracking with the above infrared image target tracking system based on multi-scale deformable attention, comprising the following steps:
Take the first frame of the video to be tracked as the template map and mark the rectangular box of the target to be tracked in the template map; take subsequent frames of the video as search maps. Input the template map and the search map into the template-map branch and the search-map branch of the infrared image target tracking system respectively, and obtain the rectangular box of the target in the search map from the prediction module.
Beneficial effects: the infrared image target tracking system and tracking method based on multi-scale deformable attention have the following advantages:
1. The invention splices the features at two scales into a multi-scale feature for subsequent processing, which enriches the semantics of the low-level features and adds spatial information to the high-level features, benefiting the tracking of small targets.
2. The feature fusion module uses a multi-level cascade to enhance the features stage by stage. The first and second deformable self-attention modules obtain the context of the image feature sequences of the search map and the template map respectively and automatically seek expressive features on the feature maps; thanks to the deformable self-attention model, the model converges quickly, about 4 times faster than common methods.
The cross-attention module learns the relation between the search-map and template-map feature sequences, so the target position in the search map can be located accurately.
3. During training, the classification accuracy dynamically constrains the box regression loss and the GIOU loss, aligning the classification task with the box regression task for a more stable tracking effect.
Drawings
FIG. 1 is a schematic diagram of the composition of an infrared image target tracking system based on multi-scale deformable attention in accordance with the present disclosure;
FIG. 2 is a schematic diagram of the composition of a feature fusion module;
FIG. 3 is a schematic diagram of the composition of a feature fusion sub-module;
FIG. 4 is a flow diagram of the computation of the ith element of the first-scale deformable attention I'_s1 within the first multi-head deformable attention I'_s;
fig. 5 is a schematic diagram of the composition of the prediction module.
Detailed Description
The invention is further elucidated below in connection with the drawings and the detailed description.
The invention discloses an infrared image target tracking system based on multi-scale deformable attention which, as shown in FIG. 1, comprises: a search-map branch 1, a template-map branch 2, a feature fusion module 3 and a prediction module 4. The search-map branch 1 comprises a first feature extraction module 101 and a first conversion-splicing module 102; the template-map branch 2 comprises a second feature extraction module 201 and a second conversion-splicing module 202.
The first feature extraction module 101 extracts the initial feature maps f^0_s1 and f^0_s2 of the search map at the first and second scales; the first conversion-splicing module 102 unifies their channels and adjusts their dimensions to obtain the search-map features f_s1 and f_s2 at the first and second scales, spliced into the search-map multi-scale feature F_s = [f_s1, f_s2]. The second feature extraction module 201 extracts the initial feature maps f^0_t1 and f^0_t2 of the template map at the third and fourth scales; the second conversion-splicing module 202 unifies their channels and adjusts their dimensions to obtain the template-map features f_t1 and f_t2 at the third and fourth scales, spliced into the template-map multi-scale feature F_t = [f_t1, f_t2]. The feature fusion module 3 computes the fusion feature G_st from F_s and F_t, and the prediction module 4 predicts the target box in the search map from G_st.
The search map is the input of the search-map branch 1, and the template map is the input of the template-map branch 2.
In this embodiment, the first feature extraction module 101 and the second feature extraction module 201 have the same structure: both adopt the residual structure of ResNet-50 as the feature extraction network, with network parameters differing from the common ResNet and with MaxPool_2 and the FC layer removed. The specific structure is shown in Table 1.
Table 1: Structure of the first feature extraction module 101 and the second feature extraction module 201
The structure is a sequential cascade of the first convolution module Conv_1, the first pooling module MaxPool_1, the second convolution module Conv_2x, the third convolution module Conv_3x, the fourth convolution module Conv_4x and the fifth convolution module Conv_5x.
The fourth convolution module Conv_4x in the first feature extraction module 101 outputs the initial feature map f^0_s1 of the search map at the first scale, and the fifth convolution module Conv_5x outputs the initial feature map f^0_s2 of the search map at the second scale. The fourth convolution module Conv_4x in the second feature extraction module 201 outputs the initial feature map f^0_t1 of the template map at the third scale, and the fifth convolution module Conv_5x outputs the initial feature map f^0_t2 of the template map at the fourth scale.
Following the parameters in Table 1, the first conversion-splicing module 102 first unifies the channels of f^0_s1 and f^0_s2 with a convolution layer of 1×1 kernel, 256 channels and stride 1, and then adjusts the dimensions, i.e. converts the two feature maps into two-dimensional feature sequences, obtaining the search-map features f_s1 ∈ R^(1024×256) and f_s2 ∈ R^(256×256) at the first and second scales; these are spliced into the search-map multi-scale feature F_s = [f_s1, f_s2], F_s ∈ R^(1280×256). Likewise, the second conversion-splicing module 202 applies the same operations to f^0_t1 and f^0_t2, obtaining the template-map features f_t1 ∈ R^(256×256) and f_t2 ∈ R^(64×256) at the third and fourth scales, spliced into the template-map multi-scale feature F_t = [f_t1, f_t2], F_t ∈ R^(320×256).
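As an illustration of the conversion-splicing step, the sketch below assumes ResNet-50-like channel widths (1024 and 2048) for Conv_4x and Conv_5x; the patent's Table 1 uses its own parameters, so the input channel counts are assumptions. The spatial sizes 32×32 and 16×16 reproduce the sequence lengths 1024 and 256 of f_s1 and f_s2 for a 256×256 search map.

```python
import torch
from torch import nn

class ConvertAndConcat(nn.Module):
    """Conversion-splicing sketch: a 1x1, 256-channel, stride-1 convolution
    unifies channels, each map is flattened into a 2-D feature sequence, and
    the two scales are concatenated along the length axis."""
    def __init__(self, in_ch=(1024, 2048), out_ch: int = 256):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, out_ch, kernel_size=1)
                                  for c in in_ch)

    def forward(self, feats):
        # feats: e.g. [(B, 1024, 32, 32), (B, 2048, 16, 16)] for the search map
        seqs = [p(f).flatten(2).transpose(1, 2) for p, f in zip(self.proj, feats)]
        return torch.cat(seqs, dim=1)       # (B, 1024 + 256, 256) = (B, 1280, 256)

F_s = ConvertAndConcat()([torch.randn(1, 1024, 32, 32),
                          torch.randn(1, 2048, 16, 16)])
print(F_s.shape)                            # torch.Size([1, 1280, 256])
```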
The feature fusion module 3 comprises N cascaded feature fusion submodules, as shown in FIG. 2. The input of the first-level submodule is the search-map multi-scale feature F_s and the template-map multi-scale feature F_t; it outputs the first-level search-to-template attention feature G^1_st and the first-level template-to-search attention feature G^1_ts. The input of the nth-level submodule is the output G^{n-1}_st and G^{n-1}_ts of level n-1, and the output G^N_st of the Nth-level submodule is the fusion feature G_st produced by the feature fusion module. In this embodiment N = 4.
As shown in FIG. 3, the nth-level feature fusion submodule comprises a first deformable self-attention module 301, a second deformable self-attention module 302 and a cross-attention module 303, n = 1, 2, …, N. The first and second deformable self-attention modules compute the context features T_s and T_t of the two input features I_s and I_t, respectively; the cross-attention module 303 computes the mutual attention features G^n_st and G^n_ts of the context features T_s and T_t. If n = 1, I_s is F_s and I_t is F_t; otherwise I_s and I_t are G^{n-1}_st and G^{n-1}_ts respectively. The cascade's data flow is sketched below.
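A sketch of the cascade's routing, with placeholder callables standing in for the modules described in this embodiment; using distinct cross-attention weights per direction is an assumption.

```python
from typing import Callable, List, Tuple
from torch import Tensor

def fusion_cascade(F_s: Tensor, F_t: Tensor, SLP_s: Tensor, SLP_t: Tensor,
                   levels: List[Tuple[Callable, Callable, Callable, Callable]]
                   ) -> Tensor:
    """levels holds N 4-tuples (self_attn_s, self_attn_t, cross_st, cross_ts),
    each with the interfaces of the modules sketched earlier."""
    I_s, I_t = F_s, F_t                        # level 1 consumes F_s and F_t
    for self_s, self_t, cross_st, cross_ts in levels:   # N = 4 here
        T_s = self_s(I_s)                      # deformable self-attn, search
        T_t = self_t(I_t)                      # deformable self-attn, template
        G_st = cross_st(T_s, SLP_s, T_t, SLP_t)  # search-to-template feature
        G_ts = cross_ts(T_t, SLP_t, T_s, SLP_s)  # template-to-search feature
        I_s, I_t = G_st, G_ts                  # feed level n output to level n+1
    return I_s                                 # G_st of the last level
```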
Specifically, the first deformable self-attention module 301 extracts the context feature T_s of the input feature I_s by the following steps:
a1. Sum the input feature I_s with the search-map multi-scale position code SLP_s to generate the first query vector Q_s = [Q_s1, Q_s2], where Q_s1 is the query vector at the first scale and Q_s2 the query vector at the second scale;
a2. Input the first query vector Q_s, the input feature I_s and the search-map initial reference points R_s into a first multi-head attention network to obtain the first multi-head deformable attention I'_s of the search map; the first multi-head attention network contains M parallel attention units.
The search-map initial reference points R_s are computed as follows: compute, for each vector of the first-scale feature f_s1, its coordinates on the initial feature map f^0_s1, giving the first initial reference points r_s1; compute, for each vector of the second-scale feature f_s2, its coordinates on the initial feature map f^0_s2, giving the second initial reference points r_s2.
Normalize the coordinates of the first initial reference points r_s1 and map them onto the initial feature map f^0_s2 to obtain the first coordinate mapping points r_s12; normalize the coordinates of the second initial reference points r_s2 and map them onto f^0_s1 to obtain the second coordinate mapping points r_s21.
The search-map initial reference points are constructed as R_s = [(r_s1, r_s12), (r_s2, r_s21)].
The first multi-head deformable attention of the search map is I'_s = [I'_s1, I'_s2], where I'_s1 is the deformable attention at the first scale and I'_s2 the deformable attention at the second scale; in this embodiment, I'_s1 = [I'_s1,1, I'_s1,2, …, I'_s1,1024] and I'_s2 = [I'_s2,1025, I'_s2,1026, …, I'_s2,1280].
As shown in FIG. 4, the ith element I'_s1i of I'_s1 is computed as follows:
The ith vector Q_s1i of Q_s1 passes through the fully connected layer Linear2 to obtain, for the ith elements r_s1i and r_s12i of the first initial reference points r_s1 and the first coordinate mapping points r_s12, the sampling offsets Δr^{mk}_{s1i} and Δr^{mk}_{s12i} of each sampling point in each attention unit, where m = 1, 2, …, M indexes the attention units of the first multi-head attention network, k = 1, 2, …, K indexes the sampling points, and K is the total number of sampling points per attention unit; K = 16 in this embodiment.
Add r_s1i and Δr^{mk}_{s1i} to obtain the coordinates of the kth sampling point of the mth attention unit at the first scale, p^{mk}_{s1i} = r_s1i + Δr^{mk}_{s1i}.
Add r_s12i and Δr^{mk}_{s12i} to obtain the coordinates of the kth sampling point of the mth attention unit at the second scale, p^{mk}_{s12i} = r_s12i + Δr^{mk}_{s12i}.
Interpolate p^{mk}_{s1i} on the first-scale feature map f^0_s1 after the fully connected layer Linear1 to obtain the value of the kth sampling point of the mth attention unit for Q_s1i at the first scale, denoted v^{mk}_{s1i} = Inter(Linear1(f^0_s1), p^{mk}_{s1i}), where Inter is an interpolation function.
Interpolate p^{mk}_{s12i} on the second-scale feature map f^0_s2 after the fully connected layer Linear1 to obtain the value of the kth sampling point of the mth attention unit for Q_s1i at the second scale, denoted v^{mk}_{s12i} = Inter(Linear1(f^0_s2), p^{mk}_{s12i}).
The ith vector Q_s1i of Q_s1 passes through the fully connected layer Linear3 to obtain the attention weight A^{mk}_{s1i} corresponding to v^{mk}_{s1i} and the attention weight A^{mk}_{s12i} corresponding to v^{mk}_{s12i}.
This yields I'_s1i = Σ_{m=1..M} Σ_{k=1..K} ( A^{mk}_{s1i} · v^{mk}_{s1i} + A^{mk}_{s12i} · v^{mk}_{s12i} ).
Similarly to the above procedure, I'_s2i is computed as follows:
The ith vector Q_s2i of Q_s2 passes through the fully connected layer Linear2 to obtain, for the ith element r_s2i of the second initial reference points r_s2 and the ith element r_s21i of the second coordinate mapping points r_s21, the sampling offsets Δr^{mk}_{s2i} and Δr^{mk}_{s21i} of each sampling point in each attention unit.
Add r_s2i and Δr^{mk}_{s2i} to obtain the coordinates of the kth sampling point of the mth attention unit at the second scale, p^{mk}_{s2i} = r_s2i + Δr^{mk}_{s2i}.
Add r_s21i and Δr^{mk}_{s21i} to obtain the coordinates of the kth sampling point of the mth attention unit at the first scale, p^{mk}_{s21i} = r_s21i + Δr^{mk}_{s21i}.
Interpolate p^{mk}_{s21i} on the first-scale feature map f^0_s1 after Linear1 to obtain the value of the kth sampling point of the mth attention unit for Q_s2i at the first scale, denoted v^{mk}_{s21i} = Inter(Linear1(f^0_s1), p^{mk}_{s21i}).
Interpolate p^{mk}_{s2i} on the second-scale feature map f^0_s2 after Linear1 to obtain the value of the kth sampling point of the mth attention unit for Q_s2i at the second scale, denoted v^{mk}_{s2i} = Inter(Linear1(f^0_s2), p^{mk}_{s2i}).
The ith vector Q_s2i of Q_s2 passes through the fully connected layer Linear3 to obtain the attention weight A^{mk}_{s21i} corresponding to v^{mk}_{s21i} and the attention weight A^{mk}_{s2i} corresponding to v^{mk}_{s2i}.
This yields I'_s2i = Σ_{m=1..M} Σ_{k=1..K} ( A^{mk}_{s21i} · v^{mk}_{s21i} + A^{mk}_{s2i} · v^{mk}_{s2i} ).
a3. Sum and normalize I_s and I'_s, then pass the result through an FFN to obtain the context feature T_s of the input feature I_s.
The second deformable self-attention module 302 extracts the context feature T_t of the input feature I_t through steps similar to a1-a3.
The search-map multi-scale position code SLP_s is constructed as follows:
a11. Randomly generate two scale-level codes for the search map: the first-level code SL_s1, whose dimension is the same as that of the first-scale feature f_s1, and the second-level code SL_s2, whose dimension is the same as that of the second-scale feature f_s2;
a12. From the search-map features f_s1 and f_s2 at the first and second scales, compute the first intra-level position code P_s1 and the second intra-level position code P_s2 of the search map using trigonometric functions;
a13. Add SL_s1 to P_s1 and SL_s2 to P_s2, then splice the results to obtain the search-map multi-scale position code
SLP_s = [SL_s1 + P_s1, SL_s2 + P_s2].
Correspondingly, the template-map multi-scale position code SLP_t is computed from the template-map features f_t1 and f_t2 at the third and fourth scales in a manner similar to a11-a13, and the template-map initial reference points R_t are computed from f_t1 and f_t2 in the same way as R_s.
The cross-attention module 303 computes the mutual attention features G^n_st and G^n_ts of the context features T_s and T_t of the two inputs by the following steps:
b1. Add T_s to the search-map multi-scale position code SLP_s and map the sum through two fully connected layers W_sq and W_sk to obtain the vectors Q'_s and K_s; map T_s through the fully connected layer W_sv to obtain the vector V_s;
b2. Add T_t to the template-map multi-scale position code SLP_t and map the sum through two fully connected layers W_tq and W_tk to obtain the vectors Q'_t and K_t; map T_t through the fully connected layer W_tv to obtain the vector V_t;
b3. Compute the attention feature G^n_st of T_s toward T_t; its ith element is
G^n_st,i = Σ_j softmax_j( dot(Q'_s,i, K_t,j) / √d_kt ) · V_t,j.
Compute the attention feature G^n_ts of T_t toward T_s; its jth element is
G^n_ts,j = Σ_i softmax_i( dot(Q'_t,j, K_s,i) / √d_ks ) · V_s,i,
where dot denotes the vector dot product operation, d_kt is the dimension of K_t and d_ks is the dimension of K_s.
The output G^N_st of the last feature fusion submodule is the final fusion feature G_st, and the prediction module 4 predicts the target box in the search map from G_st. As shown in FIG. 5, the prediction module 4 comprises a classification prediction network 401, a box prediction network 402 and a target box calculation module 403. The classification prediction network 401 obtains the classification result C = [C_1, C_2, …, C_len] of the target in the search map from the fusion feature G_st; the box prediction network 402 obtains the predicted boxes B = [B_1, B_2, …, B_len] of the target in the search map from G_st. Here len is the length of the search-map multi-scale feature, l = 1, 2, …, len; C_l = [C_l0, C_l1] is the normalized classification obtained from the lth element of G_st, where C_l0 is the predicted target probability obtained from the lth element of G_st and C_l1 the predicted background probability; B_l = [B_lx, B_ly, B_lw, B_lh] is the target rectangular box obtained from the lth element of G_st, with B_lx, B_ly the coordinates of the center of the rectangular box and B_lw, B_lh its width and height. The target box calculation module 403 computes the target box in the search map from the classification result C = [C_1, C_2, …, C_len] and the predicted boxes B = [B_1, B_2, …, B_len] as follows:
Find the element index l* corresponding to the maximum of C_l0 over C = [C_1, C_2, …, C_len];
the predicted target rectangular box B_l* corresponding to the l*th element is the target box in the search map.
The classification prediction network 401 and the box prediction network 402 each consist of three fully connected layers, with structures and parameters shown in Tables 2 and 3:
Table 2: Classification prediction network structure and parameters

Network layer    Output size    Network parameters (input channels, output channels)
FC_1             1280×256       256, 256
FC_2             1280×256       256, 256
FC_3             1280×2         256, 2

Table 3: Box prediction network structure and parameters

Network layer    Output size    Network parameters (input channels, output channels)
FC_1             1280×256       256, 256
FC_2             1280×256       256, 256
FC_3             1280×4         256, 4
The final fully connected layer FC_3 of the classification prediction network outputs the initial classification result C̃_l = [C̃_l0, C̃_l1], where C̃_l0 is the score that the lth element of G_st predicts category 0, i.e. the target, and C̃_l1 the score that it predicts category 1, i.e. the background. Because a probability must lie in [0, 1], C̃_l is normalized to obtain the normalized classification C_l = [C_l0, C_l1], with C_lE = exp(C̃_lE) / (exp(C̃_l0) + exp(C̃_l1)), E ∈ {0, 1}.
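A sketch of the two prediction heads per Tables 2 and 3; the ReLU activations between FC layers and the sigmoid on the box outputs are assumptions, since the tables only specify layer sizes.

```python
import torch
from torch import nn

class PredictionHeads(nn.Module):
    """Three-FC classification and box heads over the fused sequence."""
    def __init__(self, dim: int = 256):
        super().__init__()
        def mlp(out_dim: int) -> nn.Sequential:     # FC_1 -> FC_2 -> FC_3
            return nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, out_dim))
        self.cls_head = mlp(2)                      # Table 2: ... -> 2
        self.box_head = mlp(4)                      # Table 3: ... -> 4

    def forward(self, G_st: torch.Tensor):          # (B, 1280, 256)
        C_tilde = self.cls_head(G_st)               # raw scores
        C = C_tilde.softmax(dim=-1)                 # C_lE = exp / (sum of exps)
        B = self.box_head(G_st).sigmoid()           # (cx, cy, w, h); assumed
        return C, B
```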
The training steps of the infrared image target tracking system are:
c1. Randomly select two pictures from a training video, take one as the template map and the other as the search map, input them into the infrared image target tracking system to be trained, and let the prediction module 4 output the classification result C = [C_1, C_2, …, C_len] and the predicted boxes B = [B_1, B_2, …, B_len];
c2. Optimize the parameters of the infrared image target tracking system by minimizing a loss function to obtain the trained system.
The loss function is L = L_class + L_loss + L_giou.
L_class is the classification loss: L_class = -(1/len) Σ_{l=1..len} W[U_l] · log C_{l,U_l}, where the label U_l is determined from the position of the lth predicted box B_l relative to the ground-truth target box B^T in the search map: when B_l lies at B^T the prediction of the lth element is taken as target (U_l = 0), otherwise as background (U_l = 1); W[1] is the negative-sample weight and W[0] the positive-sample weight.
L_loss is the regression loss: L_loss = (1/count) Σ_{h: U_h=0} Pr_h · L1(B_h, B^T), where count is the number of labels U_l with value 0, i.e. count = |{l : U_l = 0}|, and Pr_h is the classification accuracy of the hth element, obtained from its normalized classification result.
L_giou is the GIOU loss: L_giou = (1/count) Σ_{h: U_h=0} Pr_h · L_giou(h), where L_giou(h) is the GIOU loss of the element B_h among the predicted boxes whose corresponding U_h is 0: L_giou(h) = 1 - GIOU_h, with GIOU_h the GIOU value between B_h and the ground-truth target box B^T.
In this embodiment, the classification accuracy Pr_h dynamically weights the regression loss and the GIOU loss, unifying the classification and regression tasks so that they are coupled to each other; low-quality bounding boxes are further suppressed by the localization score, which improves the overall tracking precision.
The method for tracking with the above infrared image target tracking system based on multi-scale deformable attention comprises the following steps:
Take the first frame of the video to be tracked as the template map and mark the rectangular box of the target to be tracked in the template map; take subsequent frames of the video as search maps. Input the template map and the search map into the template-map branch and the search-map branch of the infrared image target tracking system respectively, and obtain the rectangular box of the target in the search map from the prediction module.
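An inference-loop sketch of this procedure follows. crop_and_resize, scale_region and map_box_back are hypothetical helper functions for the cropping geometry and are not part of the patent text; the 128×128 template and 4x-area 256×256 search crops follow the embodiment below.

```python
import torch

@torch.no_grad()
def track_video(frames, init_box, model):
    """frames: list of images; init_box: (cx, cy, w, h) marked in frame 0.
    Helpers assumed: crop_and_resize (crop + resize, optionally returning the
    crop geometry), scale_region (enlarge a box's area), map_box_back (map a
    crop-relative box back to full-frame coordinates)."""
    template = crop_and_resize(frames[0], init_box, size=128)   # 128x128 crop
    box, results = init_box, [init_box]
    for frame in frames[1:]:
        # search region: 4x the target area, centered on the previous position
        region = scale_region(box, factor=4.0)
        search, meta = crop_and_resize(frame, region, size=256, return_meta=True)
        C, B = model(template, search)          # classification + predicted boxes
        l_star = C[0, :, 0].argmax()            # highest target probability
        box = map_box_back(B[0, l_star], meta)  # back to full-frame coordinates
        results.append(box)
    return results
```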
In this embodiment, the effectiveness of the above infrared image target tracking system was tested on the infrared datasets VOT2017-TIR and LSOTB-TIR and compared with existing methods. During testing, the first frame of a video sequence is selected as the template map, with the target to be tracked enclosed by a rectangular box; the crop is centered on the target box and scaled to 128×128. The other frames are search maps: the search region is obtained by centering on the target position in the previous frame, cropping a region 4 times the target area, and scaling it to 256×256. The template map and search map are input into the trained infrared image target tracking system, the prediction module outputs the classification result and the predicted boxes, and the predicted rectangular box corresponding to the feature element with the highest predicted target probability in the classification result is taken as the final tracking result. Comparative results on the LSOTB-TIR dataset are shown in Table 4:
table 4: test results for data set LSOTB-TIR
Methods Success Precision Norm Precision
ECO-TIR[1] 0.631 0.768 0.695
ECO-stir[2] 0.616 0.750 0.672
ECO[3] 0.609 0.739 0.670
SiamRPN++[4] 0.604 0.711 0.651
MDNet[5] 0.601 0.750 0.686
VITAL[6] 0.597 0.749 0.682
ATOM[7] 0.595 0.729 0.647
Ours(detranst) 0.669 0.782 0.787
Comparative results on the VOT2017-TIR dataset are shown in Table 5:
table 5: test results for data set VOT2017-TIR
Methods EAO Acc Rob
CFNet[8] 0.254 0.52 3.45
HSSNet[9] 0.262 0.58 3.33
TADT[10] 0.262 0.60 3.18
VITAL[6] 0.272 0.64 2.68
MLSSNet[11] 0.278 0.56 2.95
TCNN[12] 0.287 0.62 2.79
MMNet[13] 0.320 0.58 2.91
Ours(detranst) 0.335 0.71 2.18
The methods compared in Tables 4 and 5 are described in the following references:
[1] ECO-TIR: Liu Q., Li X., He Z., et al. LSOTB-TIR: A Large-Scale High-Diversity Thermal Infrared Object Tracking Benchmark. In Proceedings of the 28th ACM International Conference on Multimedia (MM '20), ACM, 2020.
[2] ECO-stir: Zhang L., Gonzalez-Garcia A., van de Weijer J., Danelljan M., Khan F. S. Synthetic data generation for end-to-end thermal infrared tracking. IEEE Transactions on Image Processing 28(4), 2019, 1837-1850.
[3] ECO: Danelljan M., Bhat G., Khan F. S., Felsberg M. ECO: Efficient convolution operators for tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[4] SiamRPN++: Li B., Wu W., Wang Q., Zhang F., Xing J., Yan J. SiamRPN++: Evolution of siamese visual tracking with very deep networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[5] MDNet: Nam H., Han B. Learning multi-domain convolutional neural networks for visual tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[6] VITAL: Song Y., Ma C., Wu X., Gong L., et al. VITAL: Visual tracking via adversarial learning. In CVPR, 2018, 8990-8999.
[7] ATOM: Danelljan M., Bhat G., Khan F. S., Felsberg M. ATOM: Accurate tracking by overlap maximization. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[8] CFNet: Valmadre J., Bertinetto L., Henriques J., Vedaldi A., Torr P. H. End-to-end representation learning for correlation filter based tracking. In CVPR, 2017, 5000-5008.
[9] HSSNet: Li X., Liu Q., Fan N., et al. Hierarchical spatial-aware siamese network for thermal infrared object tracking. Knowledge-Based Systems 166, 2019, 71-81.
[10] TADT: Li X., Ma C., Wu B., He Z., Yang M.-H. Target-aware deep tracking. In CVPR, 2019.
[11] MLSSNet: Liu Q., Li X., He Z., Fan N., Yuan D., Wang H. Learning deep multi-level similarity for thermal infrared object tracking. arXiv preprint arXiv:1906.03568, 2019.
[12] TCNN: Nam H., Baek M., Han B., et al. Modeling and propagating CNNs in a tree structure for visual tracking. arXiv preprint arXiv:1608.07242, 2016.
[13] MMNet: Liu Q., Li X., He Z., et al. Multi-Task Driven Feature Models for Thermal Infrared Tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34: 11604-11611.
As Tables 4 and 5 show, the infrared image target tracking system and method based on multi-scale deformable attention provided by the invention outperform the existing methods on both datasets.

Claims (8)

1. An infrared image target tracking system based on multi-scale deformable attention, comprising: a search graph branch (1), a template graph branch (2), a feature fusion module (3) and a prediction module (4); the search graph branch (1) comprises a first feature extraction module (101) and a first conversion splicing module (102); the template diagram branch (2) comprises a second feature extraction module (201) and a second conversion splicing module (202);
the first feature extraction module (101) is used for extracting an initial feature map of the search map under a first scale and a second scaleAnd->First conversion spliceModule (102) pair->And->Channel unification and dimension adjustment are carried out to obtain the characteristic f of the search graph under the first scale and the second scale s1 And f s2 And spliced into a search graph multi-scale feature F s =[f s1 ,f s2 ]The method comprises the steps of carrying out a first treatment on the surface of the The second feature extraction module (201) is used for extracting an initial feature map +_of the template map at a third scale and a fourth scale>And->A second conversion splicing module (202) pair->And->Channel unification and dimension adjustment are carried out to obtain the characteristic f of the template diagram under the third scale and the fourth scale t1 And f t2 And spliced into template-diagram multi-scale features F t =[f t1 ,f t2 ]The method comprises the steps of carrying out a first treatment on the surface of the The characteristic fusion module (3) is used for multiscale characteristic F according to the search graph s And template diagram multiscale feature F t Computing fusion features G st
The feature fusion module (3) comprises N cascaded feature fusion sub-modules, wherein the input of the first-stage feature fusion sub-module is a search graph multi-scale feature F s And template diagram multiscale feature F t Outputting the attention characteristic of the first level search graph to the template graphAnd the attention feature of the first level template map to the search map +.>The input of the N-level feature fusion submodule is N-1-level output +.>And->Output of Nth level feature fusion submodule>Fusion feature G obtained for feature fusion module st
The nth level feature fusion submodule comprises a first deformable self-attention module (301), a second deformable self-attention module (302) and a cross-attention module (303), wherein n=1, 2, …, N; the first deformable self-attention module (301) and the second deformable self-attention module (302) are respectively used for calculating two paths of input characteristics I s And I t Context relation feature T of (2) s And T t The method comprises the steps of carrying out a first treatment on the surface of the The cross attention module (303) is used for calculating a context relation characteristic T of two paths of input vectors s And T t Attention features to each otherAnd->
The first deformable self-attention module (301) extracts input features I s Context relation feature T of (2) s The method comprises the following steps:
a1, input feature I s Multi-scale position coding SLP with search map s Summing to generate a first query vector Q s ,Q s =[Q s1 ,Q s2 ],Q s1 For query vectors at a first scale, Q s2 Is a query vector at a second scale;
a2, the first query vector Q s Input features I s Initial reference point R of search graph s Inputting the first multi-head deformable attention I 'into a first multi-head attention network to obtain a search graph' s The method comprises the steps of carrying out a first treatment on the surface of the The first multi-head attention network is provided with M attention units connected in parallel;
the initial reference point R of the search graph s The calculation steps of (a) are as follows: calculating the feature f of the search graph at the first scale s1 Each vector in the initial feature mapCoordinates on the first initial reference point r s1 The method comprises the steps of carrying out a first treatment on the surface of the Calculating the feature f at the second scale s2 Is in the initial feature map +.>The coordinates of the first and second initial reference points r s2
For the first initial reference point r s1 Normalized and mapped to the initial feature mapObtaining a first coordinate mapping point r s12 The method comprises the steps of carrying out a first treatment on the surface of the For the second initial reference point r s2 Is normalized and mapped to the initial feature map +.>Obtaining a second coordinate mapping point r s21
Constructing initial reference points of search graphs
The first multi-head feasible variable attention I 'of the search graph' s =[I′ s1 ,I′ s2 ],I′ s1 For deformable attention at first scale, I' s2 Deformable attention at a second scale;
I′ s1 i 'element of (2)' s1i The calculation steps of (a) are as follows:
Q s1 the ith vector Q of the vectors s1i Obtaining a first initial reference point r through the full connection layer Linear2 s1 And a first coordinate mapping point r s12 Is the ith element r of (2) s1i 、r s12i Sampling offset for each sampling point in each attention unitAnd->Where M represents the number of attention units in the first multi-head attention network, m=1, 2, …, M; k represents a sampling point sequence number, k=1, 2, …, K; k is the total number of sampling points in each attention unit;
will r s1i And (3) withAdding to obtain the m-th attention unit and the k-th sampling point coordinate under the first scale
Will r s12i And (3) withAdding to obtain the m-th attention unit and the k-th sampling point coordinate under the second scale
Will beFeature map at first scale/>Interpolation is carried out after full connection layer Linear1 to obtain Q s1i At a first scale, the value of the kth sample point of the mth attention unit is denoted +.>Inter is an interpolation function;
will beFeature map at the second scale +.>Interpolation is carried out after full connection layer Linear1 to obtain Q s1i At the second scale, the value of the kth sample point of the mth attention unit is denoted +.>Inter is an interpolation function;
Q s1 the ith vector Q of the vectors s1i Obtained by full connection layer Linear3Attention weight corresponding to +.>Andattention weight corresponding to +.>
Thus obtaining
The $i$-th element $I'_{s2i}$ of $I'_{s2}$ is computed symmetrically: the $i$-th vector $Q_{s2i}$ of $Q_{s2}$ passes through Linear2 to obtain, for the $i$-th element $r_{s2i}$ of the second initial reference points $r_{s2}$ and the $i$-th element $r_{s21i}$ of the second coordinate mapping points $r_{s21}$, the sampling offsets $\Delta r^{mk}_{s2i}$ and $\Delta r^{mk}_{s21i}$ of each sampling point in each attention unit;
add $r_{s2i}$ and $\Delta r^{mk}_{s2i}$ to obtain the coordinates of the $k$-th sampling point of the $m$-th attention unit at the second scale, $p^{mk}_{s2i} = r_{s2i} + \Delta r^{mk}_{s2i}$;
add $r_{s21i}$ and $\Delta r^{mk}_{s21i}$ to obtain the coordinates of the $k$-th sampling point of the $m$-th attention unit at the first scale, $p^{mk}_{s21i} = r_{s21i} + \Delta r^{mk}_{s21i}$;
interpolate the feature map $f^0_{s1}$ at the first scale, after Linear1, at $p^{mk}_{s21i}$ to obtain $v^{mk}_{s21i} = \mathrm{Inter}(\mathrm{Linear1}(f^0_{s1}),\, p^{mk}_{s21i})$;
interpolate the feature map $f^0_{s2}$ at the second scale, after Linear1, at $p^{mk}_{s2i}$ to obtain $v^{mk}_{s2i} = \mathrm{Inter}(\mathrm{Linear1}(f^0_{s2}),\, p^{mk}_{s2i})$;
$Q_{s2i}$ passes through Linear3 to obtain the attention weights $A^{mk}_{s2i}$ and $A^{mk}_{s21i}$ corresponding to $v^{mk}_{s2i}$ and $v^{mk}_{s21i}$;
thus:
$$I'_{s2i} = \sum_{m=1}^{M} \sum_{k=1}^{K} \left( A^{mk}_{s2i}\, v^{mk}_{s2i} + A^{mk}_{s21i}\, v^{mk}_{s21i} \right)$$
a3, sum $I_s$ and $I'_s$, normalize, and pass the result through an FFN to obtain the context relation feature $T_s$ of the input feature $I_s$: $T_s = \mathrm{FFN}(\mathrm{Norm}(I_s + I'_s))$.
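Step a3 is a standard Add & Norm followed by a feed-forward network; a minimal PyTorch sketch, with an arbitrary hidden width:

```python
# A minimal sketch of step a3, assuming PyTorch; the hidden width is arbitrary.
import torch.nn as nn

class ContextFFN(nn.Module):
    def __init__(self, dim=256, hidden=1024):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, i_s, i_s_prime):
        # T_s = FFN(Norm(I_s + I'_s)): summation, normalization, then FFN
        return self.ffn(self.norm(i_s + i_s_prime))
```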
The second deformable self-attention module (302) follows steps analogous to a1-a3 to extract the context relation feature $T_t$ of the input feature $I_t$.
The prediction module (4) predicts the target frame in the search map from the fusion feature $G_{st}$;
the search map is the input of the search branch (1), and the template map is the input of the template branch (2).
2. The infrared image target tracking system based on multi-scale deformable attention according to claim 1, wherein the first feature extraction module (101) and the second feature extraction module (201) have the same structure: a first convolution module, a first pooling module, a second convolution module, a third convolution module, a fourth convolution module, and a fifth convolution module cascaded in sequence;
the fourth convolution module in the first feature extraction module (101) outputs the initial feature map $f^0_{s1}$ of the search map at the first scale, and its fifth convolution module outputs the initial feature map $f^0_{s2}$ of the search map at the second scale; the fourth convolution module in the second feature extraction module (201) outputs the initial feature map $f^0_{t1}$ of the template map at the third scale, and its fifth convolution module outputs the initial feature map $f^0_{t2}$ of the template map at the fourth scale.
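As a rough illustration of the cascade in claim 2, the sketch below builds the shared extractor in PyTorch; the channel counts, kernel sizes, and strides are assumptions, since the claim fixes only the module order:

```python
# A minimal sketch of the shared feature extractor in claim 2, assuming
# PyTorch; channel counts, kernels, and strides are illustrative only.
import torch.nn as nn

def conv_block(c_in, c_out, stride=2):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride, 1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class FeatureExtractor(nn.Module):
    def __init__(self, in_ch=3):
        super().__init__()
        self.conv1 = conv_block(in_ch, 64)   # first convolution module
        self.pool1 = nn.MaxPool2d(2)         # first pooling module
        self.conv2 = conv_block(64, 128)
        self.conv3 = conv_block(128, 256)
        self.conv4 = conv_block(256, 256)    # outputs first-scale map f0_s1
        self.conv5 = conv_block(256, 256)    # outputs second-scale map f0_s2

    def forward(self, x):
        x = self.conv3(self.conv2(self.pool1(self.conv1(x))))
        f0_1 = self.conv4(x)
        f0_2 = self.conv5(f0_1)
        return f0_1, f0_2
```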
3. The multi-scale deformable attention-based infrared image target tracking system of claim 1, wherein the multi-scale position encoding $SLP_s$ of the search map is constructed as follows:
a11, randomly generate two layer-level encodings for the search map: the first layer-level encoding $SL_{s1}$ has the same dimension as the search-map feature $f_{s1}$ at the first scale, and the second layer-level encoding $SL_{s2}$ has the same dimension as the feature $f_{s2}$ at the second scale;
a12, from the search-map features $f_{s1}$ and $f_{s2}$ at the first and second scales, compute the first intra-layer position encoding $P_{s1}$ and the second intra-layer position encoding $P_{s2}$ of the search map using trigonometric functions;
a13, add $SL_{s1}$ to $P_{s1}$ and $SL_{s2}$ to $P_{s2}$, then concatenate to obtain the multi-scale position encoding of the search map:
$$SLP_s = [SL_{s1} + P_{s1},\ SL_{s2} + P_{s2}]$$
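A minimal PyTorch sketch of this two-scale position encoding, assuming the standard sinusoidal recipe for the intra-layer codes and learnable, randomly initialized layer-level codes; all sizes are illustrative and the embedding dimension is assumed even:

```python
# A minimal sketch of the multi-scale position encoding in claim 3, assuming
# PyTorch; the sinusoidal formula is the standard transformer recipe and all
# sizes are illustrative (dim is assumed even).
import math
import torch
import torch.nn as nn

def sine_encoding(n: int, dim: int) -> torch.Tensor:
    """Trigonometric intra-layer position code P, shape (n, dim)."""
    pos = torch.arange(n, dtype=torch.float32)[:, None]
    i = torch.arange(0, dim, 2, dtype=torch.float32)
    div = torch.exp(-math.log(10000.0) * i / dim)
    p = torch.zeros(n, dim)
    p[:, 0::2] = torch.sin(pos * div)
    p[:, 1::2] = torch.cos(pos * div)
    return p

class MultiScalePositionCode(nn.Module):
    def __init__(self, n1: int, n2: int, dim: int):
        super().__init__()
        # Randomly generated layer-level codes SL_s1 and SL_s2 (learnable).
        self.sl1 = nn.Parameter(torch.randn(n1, dim))
        self.sl2 = nn.Parameter(torch.randn(n2, dim))
        self.register_buffer("p1", sine_encoding(n1, dim))
        self.register_buffer("p2", sine_encoding(n2, dim))

    def forward(self) -> torch.Tensor:
        # SLP_s = [SL_s1 + P_s1, SL_s2 + P_s2], concatenated along length
        return torch.cat([self.sl1 + self.p1, self.sl2 + self.p2], dim=0)
```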
4. The multi-scale deformable attention-based infrared image target tracking system of claim 1, wherein the cross attention module (303) computes the mutual attention features $\tilde{T}_s$ and $\tilde{T}_t$ of the two input context features $T_s$ and $T_t$ as follows:
b1, add $T_s$ to the search-map multi-scale position encoding $SLP_s$ and map the sum through two fully connected layers $W_{sq}$ and $W_{sk}$ to obtain the vectors $Q'_s$ and $K_s$; map $T_s$ through the fully connected layer $W_{sv}$ to obtain the vector $V_s$;
b2, add $T_t$ to the template-map multi-scale position encoding $SLP_t$ and map the sum through two fully connected layers $W_{tq}$ and $W_{tk}$ to obtain the vectors $Q'_t$ and $K_t$; map $T_t$ through the fully connected layer $W_{tv}$ to obtain the vector $V_t$;
b3, compute the $i$-th element of the attention feature $\tilde{T}_s$ of $T_s$ on $T_t$:
$$\tilde{T}_{si} = \sum_{j} \frac{\exp\!\big(\mathrm{dot}(Q'_{si}, K_{tj}) / \sqrt{d_{kt}}\big)}{\sum_{j'} \exp\!\big(\mathrm{dot}(Q'_{si}, K_{tj'}) / \sqrt{d_{kt}}\big)}\, V_{tj}$$
and the $j$-th element of the attention feature $\tilde{T}_t$ of $T_t$ on $T_s$:
$$\tilde{T}_{tj} = \sum_{i} \frac{\exp\!\big(\mathrm{dot}(Q'_{tj}, K_{si}) / \sqrt{d_{ks}}\big)}{\sum_{i'} \exp\!\big(\mathrm{dot}(Q'_{tj}, K_{si'}) / \sqrt{d_{ks}}\big)}\, V_{si}$$
where $\mathrm{dot}$ denotes the vector dot product, $d_{kt}$ is the dimension of $K_t$, and $d_{ks}$ is the dimension of $K_s$.
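The two softmax sums in b3 are an ordinary pair of scaled dot-product attentions run in both directions; a minimal PyTorch sketch, with the layer names taken from the claim and all sizes illustrative:

```python
# A minimal sketch of the bidirectional cross attention in steps b1-b3,
# assuming PyTorch; the W_* layers follow the claim, sizes are illustrative.
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.w_sq, self.w_sk, self.w_sv = (nn.Linear(dim, dim) for _ in range(3))
        self.w_tq, self.w_tk, self.w_tv = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, t_s, slp_s, t_t, slp_t):
        # b1: queries/keys from T_s + SLP_s, values from T_s alone
        q_s, k_s = self.w_sq(t_s + slp_s), self.w_sk(t_s + slp_s)
        v_s = self.w_sv(t_s)
        # b2: the same construction on the template side
        q_t, k_t = self.w_tq(t_t + slp_t), self.w_tk(t_t + slp_t)
        v_t = self.w_tv(t_t)
        d_kt, d_ks = k_t.shape[-1], k_s.shape[-1]
        # b3: T_s attends to T_t (a_st), and T_t attends to T_s (a_ts)
        a_st = torch.softmax(q_s @ k_t.transpose(-2, -1) / d_kt ** 0.5, -1) @ v_t
        a_ts = torch.softmax(q_t @ k_s.transpose(-2, -1) / d_ks ** 0.5, -1) @ v_s
        return a_st, a_ts
```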
5. The multi-scale deformable attention-based infrared image target tracking system of claim 1, wherein the prediction module (4) comprises a classification prediction network (401), a frame prediction network (402), and a target frame calculation module (403); the classification prediction network (401) obtains from the fusion feature $G_{st}$ the classification results $C = [C_1, C_2, \ldots, C_{len}]$ of the target in the search map; the frame prediction network (402) obtains from the fusion feature $G_{st}$ the predicted frames $B = [B_1, B_2, \ldots, B_{len}]$ of the target in the search map; here $len$ is the length of the multi-scale feature of the search map, $l = 1, 2, \ldots, len$, $C_l = [C_{l0}, C_{l1}]$ is the normalized class obtained from the $l$-th element of the fusion feature $G_{st}$, and $B_l = [B_{lx}, B_{ly}, B_{lw}, B_{lh}]$ is the target rectangular frame predicted from the $l$-th element, where $B_{lx}, B_{ly}$ are the center-point coordinates of the rectangular frame and $B_{lw}, B_{lh}$ are its width and height;
the target frame calculation module (403) calculates the target frame in the search map from the classification results $C = [C_1, C_2, \ldots, C_{len}]$ and the predicted frames $B = [B_1, B_2, \ldots, B_{len}]$.
6. The infrared image target tracking system based on multi-scale deformable attention of claim 5, wherein the target frame calculation module (403) calculates the target frame in the search map as follows:
find the element index $l^*$ corresponding to the maximum of $C_{l0}$ over $C = [C_1, C_2, \ldots, C_{len}]$;
the predicted rectangular frame $B_{l^*}$ corresponding to the $l^*$-th element is the target frame in the search map.
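A minimal PyTorch sketch of the prediction module in claims 5 and 6; the two-layer head depths, the sigmoid box range, and the softmax layout are assumptions:

```python
# A minimal sketch of the prediction module in claims 5-6, assuming PyTorch;
# head depths and output activations are illustrative assumptions.
import torch
import torch.nn as nn

class PredictionModule(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.cls_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                      nn.Linear(dim, 2))  # C_l = [C_l0, C_l1]
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                      nn.Linear(dim, 4))  # B_l = [x, y, w, h]

    def forward(self, g_st):                  # g_st: (len, dim) fusion feature
        c = self.cls_head(g_st).softmax(-1)   # normalized classes
        b = self.box_head(g_st).sigmoid()     # boxes in [0, 1]
        l_star = c[:, 0].argmax()             # element with the largest C_l0
        return c, b, b[l_star]                # B_{l*} is the target frame
```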
7. The multi-scale deformable attention-based infrared image target tracking system of claim 5, wherein the training step of the system comprises:
c1, randomly select two pictures from a training video, one as the template map and the other as the search map; the prediction module (4) of the infrared image target tracking system to be trained outputs the classification results $C = [C_1, C_2, \ldots, C_{len}]$ and the predicted frames $B = [B_1, B_2, \ldots, B_{len}]$;
c2, optimize the parameters of the infrared image target tracking system by minimizing a loss function to obtain the trained infrared image target tracking system;
the loss function is $L = L_{class} + L_{loss} + L_{giou}$,
where $L_{class}$ is the classification loss:
$$L_{class} = -\frac{1}{len} \sum_{l=1}^{len} W[U_l] \big( (1 - U_l) \log C_{l0} + U_l \log C_{l1} \big)$$
$U_l$ is determined by the position of the $l$-th predicted frame $B_l$ relative to the target ground-truth frame $B_T$ in the search map: $U_l = 0$ if $B_l$ falls inside $B_T$, and $U_l = 1$ otherwise; $W[1]$ is the negative-sample weight and $W[0]$ is the positive-sample weight;
$L_{loss}$ is the regression loss:
$$L_{loss} = \frac{1}{count} \sum_{h:\, U_h = 0} Pr_h \, \lVert B_h - B_T \rVert_1$$
where $count$ is the number of $U_l$ with value 0, i.e. $count = \sum_{l=1}^{len} (1 - U_l)$, and $Pr_h$ is the classification accuracy, $Pr_h = C_{h0}$;
$L_{giou}$ is the GIoU loss:
$$L_{giou} = \frac{1}{count} \sum_{h:\, U_h = 0} L_{giou}(h)$$
where $L_{giou}(h) = 1 - GIOU_h$ is the GIoU loss of the predicted frame $B_h$ whose corresponding $U_h$ is 0, and $GIOU_h$ is the GIoU value between $B_h$ and the target ground-truth frame $B_T$.
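Below is a minimal PyTorch sketch of the three-term loss under the reconstruction above; note that $U_l = 0$ marks positives, $Pr_h$ is taken as $C_{h0}$, and the regression term is taken as an L1 distance (the exact regression form is not recoverable from the source, so these are assumptions); the GIoU term uses torchvision's generalized_box_iou_loss:

```python
# A minimal sketch of the claim-7 loss, assuming PyTorch/torchvision and the
# reconstruction above; the L1 regression form and Pr_h = C_h0 are assumptions.
import torch
from torchvision.ops import generalized_box_iou_loss

def tracking_loss(c, b, b_t, u, w_pos=1.0, w_neg=0.1):
    # c: (len, 2) normalized classes, b: (len, 4) boxes as (cx, cy, w, h),
    # b_t: (4,) ground-truth box, u: (len,) labels with 0 = positive sample
    uf = u.float()
    weights = w_pos * (1 - uf) + w_neg * uf                  # W[0] / W[1]
    l_class = -(weights * ((1 - uf) * c[:, 0].log()
                           + uf * c[:, 1].log())).mean()
    pos = u == 0
    pr = c[pos, 0]                                           # Pr_h, taken as C_h0
    l_loss = (pr * (b[pos] - b_t).abs().sum(-1)).mean()      # confidence-weighted L1
    xyxy = lambda z: torch.cat([z[..., :2] - z[..., 2:] / 2,
                                z[..., :2] + z[..., 2:] / 2], dim=-1)
    l_giou = generalized_box_iou_loss(xyxy(b[pos]),
                                      xyxy(b_t).expand_as(xyxy(b[pos]))).mean()
    return l_class + l_loss + l_giou
```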
8. An infrared image target tracking method based on multi-scale deformable attention is characterized by comprising the following steps:
take the first frame of the video to be tracked as the template map and mark the rectangular frame of the target to be tracked in the template map; take the subsequent frames of the video as search maps; input them respectively into the template branch and the search branch of the infrared image target tracking system according to any one of claims 1-7, and obtain the rectangular frame of the target in the search map from the prediction module.
CN202210921013.7A 2022-08-02 2022-08-02 Infrared image target tracking system and method based on multi-scale deformable attention Active CN115239765B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210921013.7A CN115239765B (en) 2022-08-02 2022-08-02 Infrared image target tracking system and method based on multi-scale deformable attention


Publications (2)

Publication Number Publication Date
CN115239765A CN115239765A (en) 2022-10-25
CN115239765B true CN115239765B (en) 2024-03-29

Family

ID=83678018


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116402858B * 2023-04-11 2023-11-21 Hefei University of Technology Transformer-based space-time information fusion infrared target tracking method

Citations (6)

Publication number Priority date Publication date Assignee Title
DE102019123756A1 (en) * 2019-09-05 2021-03-11 Connaught Electronics Ltd. Neural network for performing semantic segmentation of an input image
CN113628245A * 2021-07-12 2021-11-09 Institute of Automation, Chinese Academy of Sciences Multi-target tracking method, device, electronic equipment and storage medium
CN113744311A * 2021-09-02 2021-12-03 Beijing Institute of Technology Twin neural network moving target tracking method based on full-connection attention module
CN113963009A * 2021-12-22 2022-01-21 Zhongke Shiyu (Beijing) Technology Co., Ltd. Local self-attention image processing method and model based on deformable blocks
CN114359310A * 2022-01-13 2022-04-15 Zhejiang University 3D ventricle nuclear magnetic resonance video segmentation optimization system based on deep learning
CN114694024A * 2022-03-21 2022-07-01 Binzhou University Unmanned aerial vehicle ground target tracking method based on multilayer feature self-attention transformation network

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20210319420A1 (en) * 2020-04-12 2021-10-14 Shenzhen Malong Technologies Co., Ltd. Retail system and methods with visual object tracking


Non-Patent Citations (4)

Title
Ding Cheng et al. Exploring Cross-Modality Commonalities via Dual-Stream Multi-Branch Network for Infrared-Visible Person Re-Identification. IEEE Access, 2020, full text. *
Xiaokang Zhang et al. Multilevel Deformable Attention-Aggregated Networks for Change Detection in Bitemporal Remote Sensing Imagery. IEEE Transactions on Geoscience and Remote Sensing, full text. *
Jiang Linfeng. Research on Pedestrian Tracking Algorithms Based on Deep Learning. China Master's Theses Full-text Database, Information Science and Technology, full text. *
Dong Jifu, Liu Chang, Cao Fangwei, Ling Yuan, Gao Xiang. Online adaptive Siamese network tracking algorithm based on the attention mechanism. Laser & Optoelectronics Progress, 2020, No. 02, full text. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant