CN116912660A - Hierarchical gating cross-Transformer infrared weak and small target detection method - Google Patents

Hierarchical gating cross-Transformer infrared weak and small target detection method

Info

Publication number
CN116912660A
CN116912660A (application CN202310888676.8A)
Authority
CN
China
Prior art keywords
cross
attention
decoder
layer
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310888676.8A
Other languages
Chinese (zh)
Inventor
Mu Tingkui (穆廷魁)
Yang Huoren (杨获任)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202310888676.8A priority Critical patent/CN116912660A/en
Publication of CN116912660A publication Critical patent/CN116912660A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442: Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/0455: Auto-encoder networks; Encoder-decoder networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/42: Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

A detection model adopts a hierarchical gated Transformer network based on a cross-attention mechanism with a single-encoder, multi-decoder architecture. The single encoder is a task-driven gated cross-attention Transformer encoder stage comprising a plurality of encoder layers connected by downsampling convolution layers; each encoder layer comprises a plurality of encoders connected in series. The multi-decoder consists of a plurality of feature-gated cross-attention Transformer decoder stages, each comprising a plurality of decoder layers connected through a cross-attention mechanism; each decoder layer comprises a plurality of decoders connected in series. Each encoder contains a window-based task-driven cross-attention encoding module, and each decoder contains a window-based feature cross-attention decoding module. The invention addresses the problems of a high false-alarm rate and a low recognition rate.

Description

Hierarchical gating cross-Transformer infrared weak and small target detection method
Technical Field
The invention belongs to the technical field of infrared remote sensing images, and particularly relates to a hierarchical gating cross-Transformer infrared weak and small target detection method.
Background
Infrared small target detection plays an important role in various practical fields, such as maritime surveillance, infrared early warning and precision guidance. Conventional model-based detection methods suffer from insufficient performance because their assumptions are not necessarily satisfied in the real world. Convolutional neural networks (CNNs), though possessing very powerful feature-extraction capabilities, cannot accommodate infrared small-target images with numerous negative samples because of their locality and parameter-sharing characteristics. Transformer-based recognition methods have good global feature-extraction capability, but the network is difficult to converge and is insensitive to high-frequency information. In addition, most existing deep-learning detection methods are based on the U-Net encoder-decoder structure, which easily causes the target response to be lost, so that encoder errors accumulate in the decoder, producing false alarms and missed detections.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a hierarchical gating cross-Transformer infrared weak and small target detection method, so as to solve the problems of a high false-alarm rate and a low recognition rate in existing infrared dim small target detection methods, caused by insufficient background-modeling capability, the inability to balance global and local information, and inefficient fusion of information across networks of different depths.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
A hierarchical gating cross-Transformer infrared weak and small target detection method comprises the following steps:
step 1, acquiring a marked infrared small target image data set;
step 2, constructing a detection model, wherein the detection model adopts a hierarchical gated Transformer network based on a cross-attention mechanism; the hierarchical gated Transformer network has a single-encoder, multi-decoder architecture, the single encoder being a task-driven gated cross-attention Transformer encoder stage and the multi-decoder being a plurality of feature-gated cross-attention Transformer decoder stages; the detection result is obtained after the output of the last decoder stage passes through a pointwise convolution layer; adaptive embedding of task information is realized in the encoding stage, and multi-scale fusion of encoding and preceding-stage decoding information is realized in the decoding stage;
the task-driven gated cross-attention Transformer encoder stage comprises a plurality of encoder layers connected by downsampling convolution layers; each encoder layer comprises a plurality of encoders connected in series, and each encoder comprises a depth-wise convolution module, a window-based task-driven cross-attention encoding module, an information sharing module and a channel attention module; in the window-based task-driven cross-attention encoding module, the query matrix is Fourier-transformed, multiplied by trainable task parameters, and then inverse Fourier-transformed;
the feature-gated cross-attention Transformer decoder stage comprises a plurality of decoder layers connected through a cross-attention mechanism, i.e. each decoder layer has two inputs: one is the concatenation of all same-depth feature maps of the preceding stages with the feature maps of the two adjacent scales, and the other is the output of the deeper decoder layer of the same stage; each decoder layer comprises a plurality of decoders connected in series, and each decoder comprises a depth-wise convolution module, a window-based feature cross-attention decoding module, an information sharing module and a channel attention module; in the window-based feature cross-attention decoding module, the feature map of the layer deeper than the current layer, after passing through a linear layer, is used as the query matrix, realizing global fusion of features at different depths;
Step 3, training the detection model of the step 2 by using the data set of the step 1;
and step 4, detecting infrared dim small targets using the trained detection model.
Compared with the prior art, the network model establishes a single-encoder, multi-decoder framework integrating a window information sharing mechanism and a window-based gated cross-attention mechanism. Multi-scale local-global feature maps are generated in the encoding stage in combination with adaptive task information and are fully exploited in the decoding stage. The network model overcomes the tendency of errors in the U-Net structure to accumulate during forward propagation, and realizes effective fusion of local and global features. The invention performs fuller global modeling of the background while detecting infrared dim small targets, thereby raising the detection rate, lowering the false-alarm rate and delineating the target contour better.
Drawings
FIG. 1 is a schematic diagram of the hierarchical gated Transformer model based on a cross-attention mechanism according to the present invention.
FIG. 2 is a schematic diagram of the gated cross-attention Transformer encoder/decoder model and the attention modules used therein according to the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings and examples.
Aiming at the difficulty of balancing global and local information and the error accumulation caused by the U-Net structure in infrared dim small target detection, the invention provides a hierarchical gating cross-Transformer infrared weak and small target detection method. The overall idea is as follows: build a single-encoder, multi-decoder architecture and design task-driven gated cross-attention Transformer modules. First, a series of multi-scale feature information is generated through a plurality of encoder layers; features are then continuously extracted and refined through a plurality of decoder stages of progressively decreasing depth. Each decoder layer has two inputs: one is the concatenation of all same-depth feature maps of the preceding stages with the feature maps of the two adjacent scales, and the other is the output of the deeper decoder layer of the same stage; the two are combined into high-level semantic information through a cross-attention mechanism, and the last decoder stage yields the final output through a pointwise convolution layer. The window-based feature cross-attention decoding module realizes the balance of global and local information and the fusion of features at different scales, thereby improving the detection effect.
As shown in fig. 1, the specific steps of the present invention are as follows:
Step 1, acquiring a labeled infrared small target image data set.
The data set can be the NUDT-SIRST public data set proposed by the National University of Defense Technology, the NUAA-SIRST public data set proposed by Nanjing University of Aeronautics and Astronautics, or the IRSTD-1k data set proposed by Xidian University.
Step 2, constructing a detection model.
The detection model adopts a hierarchical gated Transformer network based on a cross-attention mechanism. The network has a single-encoder, multi-decoder architecture: adaptive embedding of task information is realized in the encoding stage, and multi-scale fusion of encoding and preceding-stage decoding information is realized in the decoding stage. The single encoder is a task-driven gated cross-attention Transformer encoder stage; the multi-decoder is composed of a plurality of feature-gated cross-attention Transformer decoder stages, and the output of the last decoder stage yields the detection result after passing through a pointwise convolution layer.
The gated cross-Transformer encoder and decoder stages used in the present invention are described below.
For the infrared target detection problem, the invention designs a gating structure for the Transformer. The layer structure of the gated cross-attention Transformer encoder/decoder stage is shown in fig. 2. To realize task adaptation, on the basis of the original self-attention mechanism a task matrix is multiplied into the query matrix using the Fourier transform and inverse Fourier transform, thereby embedding task information. To enhance the exchange of information between windows, an information sharing module consisting of convolution and dilated convolution is added before the window-based cross-attention encoding module.
In particular, the task-driven gated cross-attention Transformer encoder stage comprises a plurality of encoder layers (preferably 5 in this embodiment) connected by downsampling convolution layers: the 5 encoder layers alternate with 4 downsampling convolution layers. Each encoder layer comprises a convolution layer and a plurality of encoders connected in series; each encoder comprises a depth-wise convolution module, a window-based task-driven cross-attention encoding module, an information sharing module and a channel attention module. In the window-based task-driven cross-attention encoding module, the query matrix is Fourier-transformed, multiplied by the trainable task parameters, and then inverse Fourier-transformed. The output of each encoder layer except the last is downsampled by a 2×2 convolution layer with stride 2, which reduces the spatial dimensions of the feature map and expands the channel dimension, facilitating subsequent feature extraction at different scales. The feature maps of different scales produced by the 5 encoder layers, denoted E_i^1 (i = 1, …, 5), are retained for use in the subsequent decoding stages.
The feature-gated cross-attention Transformer decoder stages number 4 in this embodiment, with progressively decreasing depth, e.g. depth decreasing from 4 to 1 (i.e. 4, 3, 2, 1 in turn). Each decoder stage comprises a plurality of decoder layers; each decoder layer comprises a convolution layer for integrating the input information along the channel dimension and a plurality of decoders in series. Each decoder layer mirrors the encoder layer except that the attention module is replaced by a window-based feature cross-attention decoding module; that is, each decoder contains a depth-wise convolution module, a window-based feature cross-attention decoding module, an information sharing module and a channel attention module. The decoder layers are connected through a cross-attention mechanism, i.e. each decoder layer has two inputs. The first is the concatenation of all same-depth feature maps of the preceding stages with the feature maps of the two adjacent scales, normalized in channel number by a convolution layer; denoting by E_i^k the output of the i-th layer of the k-th stage (encoder or decoder), this input can be expressed as

    I_i^j = Conv([E_i^1, …, E_i^{j-1}, Down(E_{i-1}^{j-1}), Up(E_{i+1}^{j-1})]),

where Up and Down are bilinear interpolation operations with scale factors of 2 and 0.5 respectively, [·] denotes the channel-dimension concatenation operation, and Conv denotes a 3×3 convolution. The second input is the output of the deeper decoder layer of the same stage. In the window-based feature cross-attention decoding module, the feature map of the layer deeper than the current layer, after passing through a linear layer, is used as the query matrix, realizing global fusion of features at different depths.
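The multi-scale decoder input above can be sketched in NumPy. This is an illustrative toy, not the patented implementation: nearest-neighbor resampling stands in for bilinear Up/Down, a matrix product stands in for the 3×3 convolution, and all names and shapes are assumptions.

```python
import numpy as np

def up2(x):
    # nearest-neighbor stand-in for bilinear upsampling, scale factor 2
    return x.repeat(2, axis=0).repeat(2, axis=1)

def down2(x):
    # stride-2 subsampling stand-in for bilinear downsampling, scale factor 0.5
    return x[::2, ::2]

def decoder_input(same_depth_feats, finer, coarser, proj):
    # same_depth_feats: list of (H, W, C) maps E_i^1 .. E_i^{j-1}
    # finer: E_{i-1}^{j-1} at (2H, 2W, C); coarser: E_{i+1}^{j-1} at (H/2, W/2, C)
    cat = np.concatenate(same_depth_feats + [down2(finer), up2(coarser)], axis=-1)
    return cat @ proj  # 1x1 projection normalising the channel number

rng = np.random.default_rng(0)
H = W = C = 8
feats = [rng.standard_normal((H, W, C)) for _ in range(2)]
finer = rng.standard_normal((2 * H, 2 * W, C))
coarser = rng.standard_normal((H // 2, W // 2, C))
proj = rng.standard_normal((4 * C, C))  # 4 concatenated maps -> C channels
out = decoder_input(feats, finer, coarser, proj)
print(out.shape)
```

All four resampled maps end up at the same (H, W) resolution, so the channel concatenation is well defined regardless of how many preceding stages contribute.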
The invention generates a series of multi-scale feature information through the 5 encoder layers, and then continuously extracts and refines features through the 4 decoder stages of progressively decreasing depth. The output of the last decoder of the last stage is passed through a pointwise convolution layer, which reduces the channel dimension of the feature to 1, as the final output.
Before the image is divided into windows, the information sharing module performs information interaction among the features of different windows; it can be expressed as:

    X1 = Conv(X),
    X2 = DiConv(X),
    X_out = Conv([X1, X2]),

where Conv denotes a 3×3 convolution layer and DiConv denotes a 3×3 dilated convolution layer with dilation rate 2. That is, the input of the information sharing module is passed in parallel through a 3×3 convolution and a 3×3 dilated convolution with dilation rate 2, and the two results are concatenated and passed through a further 3×3 convolution. Convolution layers with different receptive fields strengthen the relations between windows and realize information sharing across windows.
The window-based task-driven cross-attention encoding module extracts global and local features of the input image and embeds the encoding of task information into them; the 5 encoder layers generate implicit feature maps of the original image at 5 different scales. Task information is adaptively embedded into the query matrix of the attention mechanism, which is:

    Q1 = X1 W1^Q,  K1 = X1 W1^K,  V1 = X1 W1^V,
    Q̂1 = IFFT(FFT(Q1) ⊙ W_task),
    Attention(Q1, K1, V1) = Softmax(Q̂1 K1ᵀ / √d1 + B1) V1,

where Q1, K1, V1 represent the query, key and value matrices, d1 is the input dimension, X1 is the input feature map, FFT(·) and IFFT(·) represent the Fourier transform and inverse Fourier transform respectively, W_task are the adaptively learned task parameters, B1 represents the position encoding, and W1^Q, W1^K, W1^V represent the transformation matrices of the Q1, K1, V1 matrices respectively. In this way task information is integrated into the query matrix through the Fourier transform, and covariance is computed between elements of this matrix and every element of the key matrix, spreading the information globally.
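The task-embedded attention above can be sketched with plain matrices (an illustrative sketch, not the patented implementation: the windowed multi-head machinery is omitted, the shape of W_task is an assumption, and only the real part is kept after the inverse FFT):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def task_attention(X, Wq, Wk, Wv, W_task, B):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # embed task information: FFT the query, gate it with the trainable
    # task parameters, then inverse-FFT (real part kept)
    Q_hat = np.fft.ifft(np.fft.fft(Q, axis=-1) * W_task, axis=-1).real
    d = X.shape[-1]
    return softmax(Q_hat @ K.T / np.sqrt(d) + B) @ V

rng = np.random.default_rng(2)
n, d = 6, 4                        # 6 window tokens of dimension 4
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
W_task = rng.standard_normal(d)    # adaptively learned task parameters
B = np.zeros((n, n))               # position-encoding placeholder
out = task_attention(X, Wq, Wk, Wv, W_task, B)
print(out.shape)
```

Multiplying in the frequency domain makes the task gate act on every spectral component of the query at once, which is the stated mechanism for spreading task information globally.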
In the decoding stage, the attention mechanism used is similar to that of the encoding stage. The window-based feature cross-attention decoding module extracts global and local features and fuses deep information into shallow layers, taking the high-level information as the query matrix of the attention mechanism, which is:

    Q2 = X_h W2^Q,  K2 = X2 W2^K,  V2 = X2 W2^V,
    Attention(Q2, K2, V2) = Softmax(Q2 K2ᵀ / √d2 + B2) V2,

where Q2, K2, V2 represent the query, key and value matrices, d2 is the input dimension, X2 is the input feature map, X_h represents the deep features, B2 represents the position encoding, and W2^Q, W2^K, W2^V represent the transformation matrices of the Q2, K2, V2 matrices respectively. That is, the attention mechanism of the decoding stage uses the deep feature X_h as the query matrix, realizing global fusion of information at different levels.
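The decoder-side variant differs from the encoder attention only in where the query comes from; a minimal sketch (assuming the deep feature map has already been resized to the same token count, and with hypothetical names):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def feature_cross_attention(X_shallow, X_deep, Wq, Wk, Wv, B):
    # the deep feature map, after a linear layer, supplies the query;
    # key and value come from the current (shallower) feature map
    Q = X_deep @ Wq
    K, V = X_shallow @ Wk, X_shallow @ Wv
    d = X_shallow.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d) + B) @ V

rng = np.random.default_rng(3)
n, d = 6, 4
X_shallow = rng.standard_normal((n, d))
X_deep = rng.standard_normal((n, d))   # assumed already resized to n tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = feature_cross_attention(X_shallow, X_deep, Wq, Wk, Wv, np.zeros((n, n)))
print(out.shape)
```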
In the invention, the window-based task-driven cross-attention encoding module and the window-based feature cross-attention decoding module each comprise, besides the attention mechanism, a feed-forward network layer with depth-wise convolution, a residual linking structure, and a gating structure with depth-wise convolution.
In the present invention, the input of the n-th encoder or decoder of the i-th layer of the j-th stage is defined as X_i^{j,n}, which yields the feature X̂_i^{j,n} after passing through the information sharing module. In the encoding stage, X̂_i^{j,n} passes through the window-based task-driven cross-attention encoding module, realizing adaptive task-information embedding and global feature extraction. In the decoding stage, X̂_i^{j,n} passes through the window-based feature cross-attention decoding module, realizing global fusion with deep information and extraction of global features. Denoting the feature generated by the n-th encoding/decoding module of the i-th layer of the j-th stage as A_i^{j,n}, the output E_i^{j,n} of the n-th encoder or decoder is expressed as:

    E_i^{j,n} = W_p( φ(W_d(A_i^{j,n})) ⊙ σ(W_d(A_i^{j,n})) ),

where W_p and W_d are the 1×1 pointwise convolution and 3×3 depth-wise convolution respectively, ⊙ denotes the element-wise product, φ(·) denotes the ReLU activation function, and σ(·) denotes the Sigmoid activation function.
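A toy sketch of such a gated output: a ReLU branch modulated element-wise by a Sigmoid gate, then a pointwise projection. Plain weight matrices stand in for the depth-wise and pointwise convolutions, and the exact wiring is an assumption, not the patented formula:

```python
import numpy as np

def gated_unit(A, Wd1, Wd2, Wp):
    relu_branch = np.maximum(A @ Wd1, 0.0)           # phi(W_d(A))
    gate = 1.0 / (1.0 + np.exp(-(A @ Wd2)))          # sigma(W_d(A))
    return (relu_branch * gate) @ Wp                 # W_p(phi(.) ⊙ sigma(.))

rng = np.random.default_rng(4)
A = rng.standard_normal((10, 4))                     # 10 tokens, 4 channels
Wd1, Wd2 = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
Wp = rng.standard_normal((4, 4))
E = gated_unit(A, Wd1, Wd2, Wp)
print(E.shape)
```

The Sigmoid gate lies in (0, 1), so it acts as a soft per-element mask over the ReLU features before the pointwise mixing.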
Step 3, training the detection model of step 2 using the data set of step 1.
The infrared dim small target data set is divided into a training set and a test set, and the hierarchical gated Transformer network model based on the cross-attention mechanism is trained: a loss function value is computed from the network's predicted labels and the ground-truth labels, and the network parameters are updated with an Adam optimizer. The whole model is trained for 2500 iterations, each traversing the whole data set once. The test procedure is as follows: after training, the parameters are fixed, test samples are input into the network to obtain detection results, and these are compared with the labeled results to evaluate the intersection-over-union (IoU); when the IoU meets the requirement, the trained hierarchical gated Transformer network model based on the cross-attention mechanism is obtained.
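The test-time comparison above is an intersection-over-union between binary masks; a minimal pixel-level sketch:

```python
import numpy as np

def iou(pred, label):
    # intersection-over-union between binary prediction and ground-truth masks
    inter = np.logical_and(pred, label).sum()
    union = np.logical_or(pred, label).sum()
    return inter / union if union else 1.0

pred = np.zeros((8, 8), dtype=bool)
label = np.zeros((8, 8), dtype=bool)
pred[2:4, 2:4] = True        # 4 predicted pixels
label[2:4, 2:5] = True       # 6 labelled pixels, 4 overlapping
print(round(iou(pred, label), 4))  # 4 / 6
```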
Step 4, detecting infrared dim small targets using the trained detection model.
After the trained hierarchical gated Transformer network model based on the cross-attention mechanism is obtained, an infrared image is input into it and the recognition result for dim small targets is obtained.
It should be noted that steps 2 and 3 need not be repeated each time an infrared small target is detected: once the trained model is obtained, only the infrared image to be detected needs to be input into the network model.
The effectiveness of the present invention is illustrated by the following simulation experiments, performed on an Intel Xeon Gold 5218 CPU at 2.30 GHz with an NVIDIA GeForce RTX 3090 graphics processor and 24 GB of video memory.
The NUDT-SIRST infrared dim small target data set is divided into a training set and a test set in a 1:1 ratio. The IoU, detection rate Pd and false-alarm rate Fa of the different methods are shown in Table 1, where Tophat denotes the detection result obtained with the top-hat transformation, IPI with the infrared patch-image model, RIPT with the reweighted infrared patch-tensor model, PSTNN with the partial sum of tensor nuclear norms model, ACM with the asymmetric context modulation model, ALC with the attention local contrast network model, and DNA with the densely nested attention network model. The table shows that the invention obtains a higher detection rate, a lower false-alarm rate and a better delineation of the target contour.
TABLE 1

Metric       | Tophat | IPI   | RIPT  | PSTNN | ACM   | ALC   | DNA   | Proposed
mIoU (×10²)  | 20.72  | 17.76 | 29.44 | 14.85 | 67.08 | 81.40 | 87.09 | 87.71
Pd (×10²)    | 78.41  | 74.49 | 91.85 | 66.13 | 95.97 | 96.51 | 98.73 | 98.73
Fa (×10⁶)    | 166.7  | 41.23 | 344.3 | 44.17 | 10.18 | 9.261 | 4.223 | 1.123
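The Pd and Fa entries in Table 1 can be computed along the following lines; this sketch assumes the common SIRST convention (a target counts as detected if a predicted centroid lies within a pixel threshold of its ground-truth centroid; Fa is false pixels over all pixels), and all names are hypothetical:

```python
import numpy as np

def pd_fa(pred_centroids, gt_centroids, false_pixels, total_pixels,
          dist_thresh=3.0):
    # Pd: fraction of ground-truth targets matched by a predicted centroid
    hits = 0
    for g in gt_centroids:
        if any(np.hypot(p[0] - g[0], p[1] - g[1]) <= dist_thresh
               for p in pred_centroids):
            hits += 1
    pd = hits / len(gt_centroids) if gt_centroids else 1.0
    # Fa: falsely detected pixels over all image pixels
    fa = false_pixels / total_pixels
    return pd, fa

pd, fa = pd_fa(pred_centroids=[(10, 10), (40, 41)],
               gt_centroids=[(10, 11), (80, 80)],
               false_pixels=5, total_pixels=256 * 256)
print(pd, fa)   # one of two targets hit
```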
In summary, when detecting infrared dim small targets, the hierarchical structure used in the invention avoids excessive degradation of the image during forward propagation through the neural network, fully preserves background information, and performs global-local modeling of the background, which helps avoid the loss of small targets and the false detection of local extrema. The gated cross-attention Transformer encoder/decoder extracts global-local information and improves the stability of the network, so that the detection result is more accurate.

Claims (10)

1. A hierarchical gating cross-Transformer infrared weak and small target detection method, characterized by comprising the following steps:
step 1, acquiring a marked infrared small target image data set;
step 2, constructing a detection model, wherein the detection model adopts a hierarchical gated Transformer network based on a cross-attention mechanism; the hierarchical gated Transformer network has a single-encoder, multi-decoder architecture, the single encoder being a task-driven gated cross-attention Transformer encoder stage and the multi-decoder being a plurality of feature-gated cross-attention Transformer decoder stages; the detection result is obtained after the output of the last decoder stage passes through a pointwise convolution layer; adaptive embedding of task information is realized in the encoding stage, and multi-scale fusion of encoding and preceding-stage decoding information is realized in the decoding stage;
the task-driven gated cross-attention Transformer encoder stage comprises a plurality of encoder layers connected by downsampling convolution layers; each encoder layer comprises a plurality of encoders connected in series, and each encoder comprises a depth-wise convolution module, a window-based task-driven cross-attention encoding module, an information sharing module and a channel attention module; in the window-based task-driven cross-attention encoding module, the query matrix is Fourier-transformed, multiplied by trainable task parameters, and then inverse Fourier-transformed;
the feature-gated cross-attention Transformer decoder stage comprises a plurality of decoder layers connected through a cross-attention mechanism, i.e. each decoder layer has two inputs: one is the concatenation of all same-depth feature maps of the preceding stages with the feature maps of the two adjacent scales, and the other is the output of the deeper decoder layer of the same stage; each decoder layer comprises a plurality of decoders connected in series, and each decoder comprises a depth-wise convolution module, a window-based feature cross-attention decoding module, an information sharing module and a channel attention module; in the window-based feature cross-attention decoding module, the feature map of the layer deeper than the current layer, after passing through a linear layer, is used as the query matrix, realizing global fusion of features at different depths;
Step 3, training the detection model of the step 2 by using the data set of the step 1;
and step 4, detecting infrared dim small targets using the trained detection model.
2. The method for detecting infrared weak and small targets according to claim 1, wherein the output of each encoder layer except the last encoder layer is downsampled by a 2×2 convolution layer with stride 2.
3. The hierarchical gating cross-Transformer infrared weak and small target detection method according to claim 1, wherein the window-based task-driven cross-attention encoding module adaptively embeds task information into the query matrix of its attention mechanism, the attention mechanism being:
Attention(Q1, K1, V1) = Softmax(Q1·K1^T / sqrt(d1) + B1)·V1
Q1 = IFFT(FFT(X1·W^Q1) ⊙ W_task), K1 = X1·W^K1, V1 = X1·W^V1

wherein Q1, K1, V1 denote the query matrix, the key matrix and the value matrix, d1 is the dimension of the input, X1 is the input feature map, FFT(·) and IFFT(·) denote the Fourier transform and the inverse Fourier transform respectively, W_task denotes the adaptively learned task parameters, B1 denotes the position code, and W^Q1, W^K1, W^V1 denote the transformation matrices of Q1, K1 and V1 respectively.
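The task-driven attention of claim 3 can be sketched in NumPy as scaled dot-product attention whose query is gated in the frequency domain. This is an illustrative reconstruction, not the patent's implementation: the names, the window size, and the real-valued spectral gate `W_task` are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def task_driven_cross_attention(X1, Wq, Wk, Wv, W_task, B1):
    """Queries are gated by trainable task parameters in the Fourier
    domain, then used in standard scaled dot-product attention.
    X1: (N, d1) tokens of one window."""
    d1 = X1.shape[-1]
    # FFT along the channel dimension, element-wise task gating, inverse FFT.
    Q1 = np.fft.ifft(np.fft.fft(X1 @ Wq, axis=-1) * W_task, axis=-1).real
    K1 = X1 @ Wk
    V1 = X1 @ Wv
    A = softmax(Q1 @ K1.T / np.sqrt(d1) + B1)  # (N, N), B1 is the position code
    return A @ V1

rng = np.random.default_rng(0)
N, d = 16, 32                                   # 16 window tokens, 32 channels
X = rng.standard_normal((N, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
W_task = 1.0 + 0.1 * rng.standard_normal(d)     # assumed real-valued spectral gate
B1 = np.zeros((N, N))                           # position code, zero for the sketch
out = task_driven_cross_attention(X, Wq, Wk, Wv, W_task, B1)
print(out.shape)  # (16, 32)
```

The output has the same token/channel shape as the input window, so the module can be stacked like an ordinary attention block.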
4. The hierarchical gating cross-Transformer infrared weak and small target detection method according to claim 1 or 3, wherein the window-based feature cross-attention decoding module uses higher-level feature information as the query matrix in its attention mechanism, the attention mechanism being:
Attention(Q2, K2, V2) = Softmax(Q2·K2^T / sqrt(d2) + B2)·V2
Q2 = Xh·W^Q2, K2 = X2·W^K2, V2 = X2·W^V2

wherein Q2, K2, V2 denote the query matrix, the key matrix and the value matrix, d2 is the dimension of the input, X2 is the input feature map, Xh denotes the higher-level features, B2 denotes the position code, and W^Q2, W^K2, W^V2 denote the transformation matrices of Q2, K2 and V2 respectively.
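Claim 4's decoding attention differs from the encoder only in where the query comes from: the deeper-layer feature map. A minimal NumPy sketch, with illustrative shapes and names:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def feature_cross_attention(X2, Xh, Wq, Wk, Wv, B2):
    """Queries come from the deeper-layer feature map Xh (after a linear
    layer), while keys and values come from the current feature map X2,
    so deeper semantics steer which current-layer positions are attended."""
    d2 = X2.shape[-1]
    Q2 = Xh @ Wq          # query taken from the higher-level features
    K2 = X2 @ Wk
    V2 = X2 @ Wv
    A = softmax(Q2 @ K2.T / np.sqrt(d2) + B2)
    return A @ V2

rng = np.random.default_rng(1)
N, d = 16, 32
X2 = rng.standard_normal((N, d))   # current-layer window tokens
Xh = rng.standard_normal((N, d))   # deeper-layer tokens, same window size assumed
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = feature_cross_attention(X2, Xh, Wq, Wk, Wv, np.zeros((N, N)))
print(out.shape)  # (16, 32)
```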
5. The hierarchical gating cross-Transformer infrared weak and small target detection method according to claim 1, wherein the window-based task-driven cross-attention encoding module and the window-based feature cross-attention decoding module each comprise a feed-forward network layer with depthwise convolution and a residual connection structure.
6. The hierarchical gating cross-Transformer infrared weak and small target detection method according to claim 1, wherein the window-based task-driven cross-attention encoding module and the window-based feature cross-attention decoding module each comprise a gating structure with depthwise convolution;
the input X_n^{i,j} of the nth encoder or decoder of the ith layer of the jth stage passes through the information sharing module to obtain the feature X̂_n^{i,j}; in the encoding stage, X̂_n^{i,j} passes through the window-based task-driven cross-attention encoding module; in the decoding stage, X̂_n^{i,j} passes through the window-based feature cross-attention decoding module; the feature generated by the nth encoding/decoding module of the ith layer of the jth stage is denoted X̃_n^{i,j}; the output Y_n^{i,j} of the nth encoder or decoder of the ith layer of the jth stage is expressed as:

Y_n^{i,j} = W_p(φ(W_d(W_p(X̃_n^{i,j}))) ⊙ σ(W_d(W_p(X̃_n^{i,j}))))

wherein W_p and W_d denote the 1×1 pointwise convolution and the 3×3 depthwise convolution respectively, ⊙ denotes the element-wise product, φ(·) denotes the ReLU activation function, and σ(·) denotes the Sigmoid activation function.
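One plausible reading of the gating structure of claim 6 is a ReLU branch element-wise gated by a Sigmoid branch, both built from pointwise and depthwise convolutions. The exact composition is an assumption (the patent's equation image is not recoverable); this NumPy sketch only fixes shapes and the named operators:

```python
import numpy as np

def pointwise(x, W):
    """1x1 pointwise convolution = per-pixel channel mixing. x: (C,H,W)."""
    return np.einsum('chw,cd->dhw', x, W)

def depthwise3x3(x, k):
    """3x3 depthwise convolution with 'same' padding. k: (C,3,3)."""
    C, H, W = x.shape
    p = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += k[:, i, j][:, None, None] * p[:, i:i+H, j:j+W]
    return out

def gated_unit(x, Wp1, kd, Wp2):
    """Assumed gating: ReLU(h) element-wise multiplied by Sigmoid(h),
    where h is a pointwise-then-depthwise projection, then a final 1x1."""
    h = depthwise3x3(pointwise(x, Wp1), kd)
    relu = np.maximum(h, 0.0)            # phi(.)
    gate = 1.0 / (1.0 + np.exp(-h))      # sigma(.)
    return pointwise(relu * gate, Wp2)   # element-wise product, then W_p

rng = np.random.default_rng(2)
C, H, W = 8, 10, 10
x = rng.standard_normal((C, H, W))
y = gated_unit(x,
               rng.standard_normal((C, C)) * 0.1,
               rng.standard_normal((C, 3, 3)) * 0.1,
               rng.standard_normal((C, C)) * 0.1)
print(y.shape)  # (8, 10, 10)
```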
7. The hierarchical gating cross-Transformer infrared weak and small target detection method according to claim 1, wherein the input of the information sharing module is passed in parallel through a 3×3 convolution and a 3×3 dilated convolution with dilation rate 2, and the two results are concatenated and then passed through a 3×3 convolution.
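The two-branch structure of claim 7 can be sketched directly in NumPy; the channel counts and kernel values are illustrative assumptions. The dilated branch widens the receptive field (effective 5×5 support) without extra parameters:

```python
import numpy as np

def conv3x3(x, k, dilation=1):
    """'Same'-padded 3x3 convolution. x: (Cin,H,W), k: (Cout,Cin,3,3)."""
    Cin, H, W = x.shape
    d = dilation
    p = np.pad(x, ((0, 0), (d, d), (d, d)))
    out = np.zeros((k.shape[0], H, W))
    for i in range(3):
        for j in range(3):
            patch = p[:, i*d:i*d+H, j*d:j*d+W]            # shifted (Cin,H,W) view
            out += np.einsum('oc,chw->ohw', k[:, :, i, j], patch)
    return out

def info_sharing(x, k_std, k_dil, k_fuse):
    """Parallel plain 3x3 conv and 3x3 dilated conv (rate 2),
    concatenated along channels and fused by a final 3x3 conv."""
    branches = np.concatenate([conv3x3(x, k_std),
                               conv3x3(x, k_dil, dilation=2)], axis=0)
    return conv3x3(branches, k_fuse)

rng = np.random.default_rng(3)
C, H, W = 4, 12, 12
x = rng.standard_normal((C, H, W))
k_std = rng.standard_normal((C, C, 3, 3)) * 0.1
k_dil = rng.standard_normal((C, C, 3, 3)) * 0.1
k_fuse = rng.standard_normal((C, 2 * C, 3, 3)) * 0.1      # 2C -> C channels
y = info_sharing(x, k_std, k_dil, k_fuse)
print(y.shape)  # (4, 12, 12)
```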
8. The hierarchical gating cross-Transformer infrared weak and small target detection method according to claim 1, wherein the inputs of the decoder layer are concatenated and then normalized in channel number, the inputs being expressed as:
Input_j^i = Conv([X_1^i, …, X_{j-1}^i, Down(X_{j-1}^{i-1}), Up(X_{j-1}^{i+1})])

wherein Up and Down are bilinear interpolation operations with scale factors of 2 and 0.5 respectively, X_k^i denotes the output of the encoder layer or decoder layer of the ith layer of the kth stage, X_{j-1}^{i-1} denotes the output of the encoder layer or decoder layer of the (i-1)th layer of the (j-1)th stage, X_{j-1}^{i+1} denotes the output of the encoder layer or decoder layer of the (i+1)th layer of the (j-1)th stage, [·] denotes the channel-dimension concatenation operation, and Conv denotes a 3×3 convolution.
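The shape bookkeeping of claim 8 can be sketched as follows. For brevity the sketch substitutes nearest-neighbour repeat and 2×2 average pooling for bilinear interpolation, and a 1×1 channel projection for the 3×3 fusion convolution; these substitutions, and all names and channel counts, are assumptions:

```python
import numpy as np

def up2(x):
    """Nearest-neighbour stand-in for bilinear x2 upsampling. x: (C,H,W)."""
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

def down2(x):
    """2x2 average pooling stand-in for bilinear x0.5 downsampling."""
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).mean(axis=(2, 4))

def decoder_input(same_depth, shallower, deeper, W_norm):
    """Assemble one decoder layer's first input: all same-depth maps from
    earlier stages, plus the previous stage's two adjacent scales
    (resampled to match), concatenated on channels and projected back
    to a fixed channel count (channel normalization)."""
    feats = same_depth + [down2(shallower), up2(deeper)]
    cat = np.concatenate(feats, axis=0)              # channel concat [.]
    return np.einsum('oc,chw->ohw', W_norm, cat)     # 1x1 projection stand-in

rng = np.random.default_rng(4)
C, H, W = 4, 8, 8
same_depth = [rng.standard_normal((C, H, W)) for _ in range(2)]
shallower = rng.standard_normal((C, 2 * H, 2 * W))   # scale above (larger map)
deeper = rng.standard_normal((C, H // 2, W // 2))    # scale below (smaller map)
W_norm = rng.standard_normal((C, 4 * C)) * 0.1       # 4C concat channels -> C
out = decoder_input(same_depth, shallower, deeper, W_norm)
print(out.shape)  # (4, 8, 8)
```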
9. The hierarchical gating cross-Transformer infrared weak and small target detection method according to claim 1, wherein the number of encoder layers is 5, alternately connected by 4 downsampling convolution layers, and the number of feature gating cross-attention Transformer decoder stages is 4, with gradually decreasing depth.
10. The hierarchical gating cross-Transformer infrared weak and small target detection method according to claim 9, wherein the depths of the 4 feature gating cross-attention Transformer decoder stages are 4, 3, 2 and 1 in order.
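The counts in claims 9 and 10 imply a simple resolution schedule, tabulated below. The 256-pixel base resolution, and the reading that each decoder stage spans its depth's shallowest (highest-resolution) scales, are assumptions for illustration:

```python
# 5 encoder layers joined by 4 stride-2 downsampling convolutions,
# then 4 decoder stages whose depths shrink 4 -> 3 -> 2 -> 1.
H = 256
encoder_scales = [H // (2 ** i) for i in range(5)]
print(encoder_scales)            # [256, 128, 64, 32, 16]

decoder_depths = [4, 3, 2, 1]
for stage, depth in enumerate(decoder_depths, start=1):
    # assumed: each stage covers the 'depth' shallowest scales
    print(stage, encoder_scales[:depth])
```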
CN202310888676.8A 2023-07-19 2023-07-19 Hierarchical gating cross-Transformer infrared weak and small target detection method Pending CN116912660A (en)


Publications (1)

Publication Number Publication Date
CN116912660A true CN116912660A (en) 2023-10-20


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117315056A (en) * 2023-11-27 2023-12-29 支付宝(杭州)信息技术有限公司 Video editing method and device
CN117315056B (en) * 2023-11-27 2024-03-19 支付宝(杭州)信息技术有限公司 Video editing method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination