CN116912660A - Hierarchical gating cross-Transformer infrared weak and small target detection method - Google Patents

Hierarchical gating cross-Transformer infrared weak and small target detection method

Info

Publication number
CN116912660A
CN116912660A (application CN202310888676.8A)
Authority
CN
China
Prior art keywords
cross
attention
decoder
layer
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310888676.8A
Other languages
Chinese (zh)
Inventor
Mu Tingkui (穆廷魁)
Yang Huoren (杨获任)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202310888676.8A priority Critical patent/CN116912660A/en
Publication of CN116912660A publication Critical patent/CN116912660A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442: Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/0455: Auto-encoder networks; Encoder-decoder networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/42: Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

A detection model adopts a hierarchical gated Transformer network based on a cross-attention mechanism with a single-encoder, multi-decoder architecture. The single encoder is a task-driven gated cross-attention Transformer encoder stage comprising a plurality of encoder layers connected by downsampling convolution layers; each encoder layer comprises a plurality of encoders connected in series. The multi-decoder consists of a plurality of feature-gated cross-attention Transformer decoder stages, each comprising a plurality of decoder layers connected through a cross-attention mechanism; each decoder layer comprises a plurality of decoders connected in series. Each encoder contains a window-based task-driven cross-attention encoding module, and each decoder contains a window-based feature cross-attention decoding module. The invention addresses the problems of a high false-alarm rate and a low recognition rate.

Description

Hierarchical gating cross-Transformer infrared weak and small target detection method
Technical Field
The invention belongs to the technical field of infrared remote sensing images, and particularly relates to a hierarchical gating cross-Transformer infrared weak and small target detection method.
Background
Infrared small target detection plays an important role in various practical fields, such as maritime surveillance, infrared early warning and precision guidance. Conventional model-based detection methods suffer from insufficient performance because their assumptions are not necessarily satisfied in the real world. Convolutional neural networks (CNNs), though possessing very powerful feature-extraction capabilities, cannot accommodate infrared small-target images with numerous negative samples because of their locality and parameter-sharing characteristics. Transformer-based recognition methods have good global feature-extraction capability, but the network is difficult to converge and is insensitive to high-frequency information. In addition, most existing deep-learning detection methods are based on the U-Net encoder-decoder structure, which easily causes the target response to be lost, so that encoder errors accumulate in the decoder, producing false alarms and missed detections.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a hierarchical gating cross-Transformer infrared weak and small target detection method, so as to solve the problems of a high false-alarm rate and a low recognition rate in existing infrared dim small target detection methods, caused by insufficient background-modeling capability, the inability to balance global and local information, and inefficient fusion of information across networks of different depths.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
A hierarchical gating cross-Transformer infrared weak and small target detection method comprises the following steps:
step 1, acquiring a marked infrared small target image data set;
step 2, constructing a detection model, wherein the detection model adopts a hierarchical gated Transformer network based on a cross-attention mechanism; the hierarchical gated Transformer network has a single-encoder, multi-decoder architecture, the single encoder being a task-driven gated cross-attention Transformer encoder stage and the multi-decoder being a plurality of feature-gated cross-attention Transformer decoder stages; the detection result is obtained after the output of the last decoder stage passes through a pointwise convolution layer; adaptive embedding of task information is realized in the encoding stage, and multi-scale fusion of encoding and preceding-stage decoding information is realized in the decoding stage;
the task-driven gated cross-attention Transformer encoder stage comprises a plurality of encoder layers connected by downsampling convolution layers; each encoder layer comprises a plurality of encoders connected in series, and each encoder comprises a depth-wise convolution module, a window-based task-driven cross-attention encoding module, an information sharing module and a channel attention module; in the window-based task-driven cross-attention encoding module, the query matrix is Fourier-transformed, multiplied by trainable task parameters, and then inverse Fourier-transformed;
the feature-gated cross-attention Transformer decoder stage comprises a plurality of decoder layers connected through a cross-attention mechanism, i.e. each decoder layer has two inputs: one is the concatenation of all same-depth feature maps of the preceding stages with the feature maps of the two adjacent scales, and the other is the output of the deeper decoder layer of the same stage; each decoder layer comprises a plurality of decoders connected in series, and each decoder comprises a depth-wise convolution module, a window-based feature cross-attention decoding module, an information sharing module and a channel attention module; in the window-based feature cross-attention decoding module, the feature map of the layer deeper than the current layer, after passing through a linear layer, is used as the query matrix, realizing global fusion of features at different depths;
Step 3, training the detection model of the step 2 by using the data set of the step 1;
and step 4, detecting infrared dim small targets using the trained detection model.
Compared with the prior art, the network model establishes a single-encoder, multi-decoder framework integrating a window information sharing mechanism and a window-based gated cross-attention mechanism. Multi-scale local-global feature maps are generated in the encoding stage in combination with adaptive task information and are fully exploited in the decoding stage. The network model overcomes the tendency of errors in the U-Net structure to accumulate during forward propagation, and realizes effective fusion of local and global features. The invention performs fuller global modeling of the background while detecting infrared dim small targets, thereby raising the detection rate, lowering the false-alarm rate and delineating the target contour better.
Drawings
FIG. 1 is a schematic diagram of the hierarchical gated Transformer model based on a cross-attention mechanism according to the present invention.
FIG. 2 is a schematic diagram of the gated cross-attention Transformer encoder/decoder model and the attention modules used therein according to the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings and examples.
Aiming at the difficulty of balancing global and local information and the error accumulation caused by the U-Net structure in infrared dim small target detection, the invention provides a hierarchical gating cross-Transformer infrared weak and small target detection method. The overall idea is as follows: build a single-encoder, multi-decoder architecture and design task-driven gated cross-attention Transformer modules. First, a series of multi-scale feature information is generated through a plurality of encoder layers; features are then continuously extracted and refined through a plurality of decoder stages of progressively decreasing depth. Each decoder layer has two inputs: one is the concatenation of all same-depth feature maps of the preceding stages with the feature maps of the two adjacent scales, and the other is the output of the deeper decoder layer of the same stage; the two are combined into high-level semantic information through a cross-attention mechanism, and the last decoder stage yields the final output through a pointwise convolution layer. The window-based feature cross-attention decoding module realizes the balance of global and local information and the fusion of features at different scales, thereby improving the detection effect.
As shown in fig. 1, the specific steps of the present invention are as follows:
Step 1, acquiring a labeled infrared small target image data set.
The data set can be the NUDT-SIRST public data set proposed by the National University of Defense Technology, the NUAA-SIRST public data set proposed by Nanjing University of Aeronautics and Astronautics, or the IRSTD-1k data set proposed by Xidian University.
Step 2, constructing a detection model.
The detection model adopts a hierarchical gated Transformer network based on a cross-attention mechanism. The network has a single-encoder, multi-decoder architecture: adaptive embedding of task information is realized in the encoding stage, and multi-scale fusion of encoding and preceding-stage decoding information is realized in the decoding stage. The single encoder is a task-driven gated cross-attention Transformer encoder stage; the multi-decoder is composed of a plurality of feature-gated cross-attention Transformer decoder stages, and the output of the last decoder stage yields the detection result after passing through a pointwise convolution layer.
The gated cross-Transformer encoder and decoder stages used in the present invention are described below.
For the infrared target detection problem, the invention designs a gating structure for the Transformer. The layer structure of the gated cross-attention Transformer encoder/decoder stage is shown in fig. 2. To realize task adaptation, on the basis of the original self-attention mechanism a task matrix is multiplied into the query matrix using the Fourier transform and inverse Fourier transform, thereby embedding task information. To enhance the exchange of information between windows, an information sharing module consisting of convolution and dilated convolution is added before the window-based cross-attention encoding module.
In particular, the task-driven gated cross-attention Transformer encoder stage comprises a plurality of encoder layers (preferably 5 in this embodiment) connected by downsampling convolution layers: the 5 encoder layers alternate with 4 downsampling convolution layers. Each encoder layer comprises a convolution layer and a plurality of encoders connected in series; each encoder comprises a depth-wise convolution module, a window-based task-driven cross-attention encoding module, an information sharing module and a channel attention module. In the window-based task-driven cross-attention encoding module, the query matrix is Fourier-transformed, multiplied by the trainable task parameters, and then inverse Fourier-transformed. The output of each encoder layer except the last is downsampled by a 2×2 convolution layer with stride 2, which reduces the spatial dimensions of the feature map and expands the channel dimension, facilitating subsequent feature extraction at different scales. The feature maps of different scales produced by the 5 encoder layers, denoted E_i^1 (i = 1, …, 5), are retained for use in the subsequent decoding stages.
The feature-gated cross-attention Transformer decoder stages number 4 in this embodiment, with progressively decreasing depth, e.g. depth decreasing from 4 to 1 (i.e. 4, 3, 2, 1 in turn). Each decoder stage comprises a plurality of decoder layers; each decoder layer comprises a convolution layer for integrating the input information along the channel dimension and a plurality of decoders in series. Each decoder layer mirrors the encoder layer except that the attention module is replaced by a window-based feature cross-attention decoding module; that is, each decoder contains a depth-wise convolution module, a window-based feature cross-attention decoding module, an information sharing module and a channel attention module. The decoder layers are connected through a cross-attention mechanism, i.e. each decoder layer has two inputs. The first is the concatenation of all same-depth feature maps of the preceding stages with the feature maps of the two adjacent scales, normalized in channel number by a convolution layer; denoting by E_i^k the output of the i-th layer of the k-th stage (encoder or decoder), this input can be expressed as

    I_i^j = Conv([E_i^1, …, E_i^{j-1}, Down(E_{i-1}^{j-1}), Up(E_{i+1}^{j-1})]),

where Up and Down are bilinear interpolation operations with scale factors of 2 and 0.5 respectively, [·] denotes the channel-dimension concatenation operation, and Conv denotes a 3×3 convolution. The second input is the output of the deeper decoder layer of the same stage. In the window-based feature cross-attention decoding module, the feature map of the layer deeper than the current layer, after passing through a linear layer, is used as the query matrix, realizing global fusion of features at different depths.
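The multi-scale decoder input above can be sketched in NumPy. This is an illustrative toy, not the patented implementation: nearest-neighbor resampling stands in for bilinear Up/Down, a matrix product stands in for the 3×3 convolution, and all names and shapes are assumptions.

```python
import numpy as np

def up2(x):
    # nearest-neighbor stand-in for bilinear upsampling, scale factor 2
    return x.repeat(2, axis=0).repeat(2, axis=1)

def down2(x):
    # stride-2 subsampling stand-in for bilinear downsampling, scale factor 0.5
    return x[::2, ::2]

def decoder_input(same_depth_feats, finer, coarser, proj):
    # same_depth_feats: list of (H, W, C) maps E_i^1 .. E_i^{j-1}
    # finer: E_{i-1}^{j-1} at (2H, 2W, C); coarser: E_{i+1}^{j-1} at (H/2, W/2, C)
    cat = np.concatenate(same_depth_feats + [down2(finer), up2(coarser)], axis=-1)
    return cat @ proj  # 1x1 projection normalising the channel number

rng = np.random.default_rng(0)
H = W = C = 8
feats = [rng.standard_normal((H, W, C)) for _ in range(2)]
finer = rng.standard_normal((2 * H, 2 * W, C))
coarser = rng.standard_normal((H // 2, W // 2, C))
proj = rng.standard_normal((4 * C, C))  # 4 concatenated maps -> C channels
out = decoder_input(feats, finer, coarser, proj)
print(out.shape)
```

All four resampled maps end up at the same (H, W) resolution, so the channel concatenation is well defined regardless of how many preceding stages contribute.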
The invention generates a series of multi-scale feature information through the 5 encoder layers, and then continuously extracts and refines features through the 4 decoder stages of progressively decreasing depth. The output of the last decoder of the last stage is passed through a pointwise convolution layer, which reduces the channel dimension of the feature to 1, as the final output.
Before the image is divided into windows, the information sharing module performs information interaction among the features of different windows; it can be expressed as:

    X1 = Conv(X),
    X2 = DiConv(X),
    X_out = Conv([X1, X2]),

where Conv denotes a 3×3 convolution layer and DiConv denotes a 3×3 dilated convolution layer with dilation rate 2. That is, the input of the information sharing module is passed in parallel through a 3×3 convolution and a 3×3 dilated convolution with dilation rate 2, and the two results are concatenated and passed through a further 3×3 convolution. Convolution layers with different receptive fields strengthen the relations between windows and realize information sharing across windows.
The window-based task-driven cross-attention encoding module extracts global and local features of the input image and embeds the encoding of task information into them; the 5 encoder layers generate implicit feature maps of the original image at 5 different scales. Task information is adaptively embedded into the query matrix of the attention mechanism, which is:

    Q1 = X1 W1^Q,  K1 = X1 W1^K,  V1 = X1 W1^V,
    Q̂1 = IFFT(FFT(Q1) ⊙ W_task),
    Attention(Q1, K1, V1) = Softmax(Q̂1 K1ᵀ / √d1 + B1) V1,

where Q1, K1, V1 represent the query, key and value matrices, d1 is the input dimension, X1 is the input feature map, FFT(·) and IFFT(·) represent the Fourier transform and inverse Fourier transform respectively, W_task are the adaptively learned task parameters, B1 represents the position encoding, and W1^Q, W1^K, W1^V represent the transformation matrices of the Q1, K1, V1 matrices respectively. In this way task information is integrated into the query matrix through the Fourier transform, and covariance is computed between elements of this matrix and every element of the key matrix, spreading the information globally.
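The task-embedded attention above can be sketched with plain matrices (an illustrative sketch, not the patented implementation: the windowed multi-head machinery is omitted, the shape of W_task is an assumption, and only the real part is kept after the inverse FFT):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def task_attention(X, Wq, Wk, Wv, W_task, B):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # embed task information: FFT the query, gate it with the trainable
    # task parameters, then inverse-FFT (real part kept)
    Q_hat = np.fft.ifft(np.fft.fft(Q, axis=-1) * W_task, axis=-1).real
    d = X.shape[-1]
    return softmax(Q_hat @ K.T / np.sqrt(d) + B) @ V

rng = np.random.default_rng(2)
n, d = 6, 4                        # 6 window tokens of dimension 4
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
W_task = rng.standard_normal(d)    # adaptively learned task parameters
B = np.zeros((n, n))               # position-encoding placeholder
out = task_attention(X, Wq, Wk, Wv, W_task, B)
print(out.shape)
```

Multiplying in the frequency domain makes the task gate act on every spectral component of the query at once, which is the stated mechanism for spreading task information globally.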
In the decoding stage, the attention mechanism used is similar to that of the encoding stage. The window-based feature cross-attention decoding module extracts global and local features and fuses deep information into shallow layers, taking the high-level information as the query matrix of the attention mechanism, which is:

    Q2 = X_h W2^Q,  K2 = X2 W2^K,  V2 = X2 W2^V,
    Attention(Q2, K2, V2) = Softmax(Q2 K2ᵀ / √d2 + B2) V2,

where Q2, K2, V2 represent the query, key and value matrices, d2 is the input dimension, X2 is the input feature map, X_h represents the deep features, B2 represents the position encoding, and W2^Q, W2^K, W2^V represent the transformation matrices of the Q2, K2, V2 matrices respectively. That is, the attention mechanism of the decoding stage uses the deep feature X_h as the query matrix, realizing global fusion of information at different levels.
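The decoder-side variant differs from the encoder attention only in where the query comes from; a minimal sketch (assuming the deep feature map has already been resized to the same token count, and with hypothetical names):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def feature_cross_attention(X_shallow, X_deep, Wq, Wk, Wv, B):
    # the deep feature map, after a linear layer, supplies the query;
    # key and value come from the current (shallower) feature map
    Q = X_deep @ Wq
    K, V = X_shallow @ Wk, X_shallow @ Wv
    d = X_shallow.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d) + B) @ V

rng = np.random.default_rng(3)
n, d = 6, 4
X_shallow = rng.standard_normal((n, d))
X_deep = rng.standard_normal((n, d))   # assumed already resized to n tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = feature_cross_attention(X_shallow, X_deep, Wq, Wk, Wv, np.zeros((n, n)))
print(out.shape)
```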
In the invention, the window-based task-driven cross-attention encoding module and the window-based feature cross-attention decoding module each comprise, besides the attention mechanism, a feed-forward network layer with depth-wise convolution, a residual linking structure, and a gating structure with depth-wise convolution.
In the present invention, the input of the n-th encoder or decoder of the i-th layer of the j-th stage is defined as X_i^{j,n}, which yields the feature X̂_i^{j,n} after passing through the information sharing module. In the encoding stage, X̂_i^{j,n} passes through the window-based task-driven cross-attention encoding module, realizing adaptive task-information embedding and global feature extraction. In the decoding stage, X̂_i^{j,n} passes through the window-based feature cross-attention decoding module, realizing global fusion with deep information and extraction of global features. Denoting the feature generated by the n-th encoding/decoding module of the i-th layer of the j-th stage as A_i^{j,n}, the output E_i^{j,n} of the n-th encoder or decoder is expressed as:

    E_i^{j,n} = W_p( φ(W_d(A_i^{j,n})) ⊙ σ(W_d(A_i^{j,n})) ),

where W_p and W_d are the 1×1 pointwise convolution and 3×3 depth-wise convolution respectively, ⊙ denotes the element-wise product, φ(·) denotes the ReLU activation function, and σ(·) denotes the Sigmoid activation function.
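A toy sketch of such a gated output: a ReLU branch modulated element-wise by a Sigmoid gate, then a pointwise projection. Plain weight matrices stand in for the depth-wise and pointwise convolutions, and the exact wiring is an assumption, not the patented formula:

```python
import numpy as np

def gated_unit(A, Wd1, Wd2, Wp):
    relu_branch = np.maximum(A @ Wd1, 0.0)           # phi(W_d(A))
    gate = 1.0 / (1.0 + np.exp(-(A @ Wd2)))          # sigma(W_d(A))
    return (relu_branch * gate) @ Wp                 # W_p(phi(.) ⊙ sigma(.))

rng = np.random.default_rng(4)
A = rng.standard_normal((10, 4))                     # 10 tokens, 4 channels
Wd1, Wd2 = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
Wp = rng.standard_normal((4, 4))
E = gated_unit(A, Wd1, Wd2, Wp)
print(E.shape)
```

The Sigmoid gate lies in (0, 1), so it acts as a soft per-element mask over the ReLU features before the pointwise mixing.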
Step 3, training the detection model of step 2 using the data set of step 1.
The infrared dim small target data set is divided into a training set and a test set, and the hierarchical gated Transformer network model based on the cross-attention mechanism is trained: a loss function value is computed from the network's predicted labels and the ground-truth labels, and the network parameters are updated with an Adam optimizer. The whole model is trained for 2500 iterations, each traversing the whole data set once. The test procedure is as follows: after training, the parameters are fixed, test samples are input into the network to obtain detection results, and these are compared with the labeled results to evaluate the intersection-over-union (IoU); when the IoU meets the requirement, the trained hierarchical gated Transformer network model based on the cross-attention mechanism is obtained.
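The test-time comparison above is an intersection-over-union between binary masks; a minimal pixel-level sketch:

```python
import numpy as np

def iou(pred, label):
    # intersection-over-union between binary prediction and ground-truth masks
    inter = np.logical_and(pred, label).sum()
    union = np.logical_or(pred, label).sum()
    return inter / union if union else 1.0

pred = np.zeros((8, 8), dtype=bool)
label = np.zeros((8, 8), dtype=bool)
pred[2:4, 2:4] = True        # 4 predicted pixels
label[2:4, 2:5] = True       # 6 labelled pixels, 4 overlapping
print(round(iou(pred, label), 4))  # 4 / 6
```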
Step 4, detecting infrared dim small targets using the trained detection model.
After the trained hierarchical gated Transformer network model based on the cross-attention mechanism is obtained, an infrared image is input into it and the recognition result for dim small targets is obtained.
It should be noted that steps 2 and 3 need not be repeated each time an infrared small target is detected: once the trained model is obtained, only the infrared image to be detected needs to be input into the network model.
The effectiveness of the present invention is illustrated by the following simulation experiments, performed on an Intel Xeon Gold 5218 CPU at 2.30 GHz with an NVIDIA GeForce RTX 3090 graphics processor and 24 GB of video memory.
The NUDT-SIRST infrared dim small target data set is divided into a training set and a test set in a 1:1 ratio. The IoU, detection rate Pd and false-alarm rate Fa of the different methods are shown in Table 1, where Tophat denotes the detection result obtained with the top-hat transformation, IPI with the infrared patch-image model, RIPT with the reweighted infrared patch-tensor model, PSTNN with the partial sum of tensor nuclear norms model, ACM with the asymmetric context modulation model, ALC with the attention local contrast network model, and DNA with the densely nested attention network model. The table shows that the invention obtains a higher detection rate, a lower false-alarm rate and a better delineation of the target contour.
TABLE 1

Metric       | Tophat | IPI   | RIPT  | PSTNN | ACM   | ALC   | DNA   | Proposed
mIoU (×10²)  | 20.72  | 17.76 | 29.44 | 14.85 | 67.08 | 81.40 | 87.09 | 87.71
Pd (×10²)    | 78.41  | 74.49 | 91.85 | 66.13 | 95.97 | 96.51 | 98.73 | 98.73
Fa (×10⁶)    | 166.7  | 41.23 | 344.3 | 44.17 | 10.18 | 9.261 | 4.223 | 1.123
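The Pd and Fa entries in Table 1 can be computed along the following lines; this sketch assumes the common SIRST convention (a target counts as detected if a predicted centroid lies within a pixel threshold of its ground-truth centroid; Fa is false pixels over all pixels), and all names are hypothetical:

```python
import numpy as np

def pd_fa(pred_centroids, gt_centroids, false_pixels, total_pixels,
          dist_thresh=3.0):
    # Pd: fraction of ground-truth targets matched by a predicted centroid
    hits = 0
    for g in gt_centroids:
        if any(np.hypot(p[0] - g[0], p[1] - g[1]) <= dist_thresh
               for p in pred_centroids):
            hits += 1
    pd = hits / len(gt_centroids) if gt_centroids else 1.0
    # Fa: falsely detected pixels over all image pixels
    fa = false_pixels / total_pixels
    return pd, fa

pd, fa = pd_fa(pred_centroids=[(10, 10), (40, 41)],
               gt_centroids=[(10, 11), (80, 80)],
               false_pixels=5, total_pixels=256 * 256)
print(pd, fa)   # one of two targets hit
```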
In summary, when detecting infrared dim small targets, the hierarchical structure used in the invention avoids excessive degradation of the image during forward propagation through the neural network, fully preserves background information, and performs global-local modeling of the background, which helps avoid the loss of small targets and the false detection of local extrema. The gated cross-attention Transformer encoder/decoder extracts global-local information and improves the stability of the network, so that the detection result is more accurate.

Claims (10)

1. A hierarchical gating cross-Transformer infrared weak and small target detection method, characterized by comprising the following steps:
step 1, acquiring a marked infrared small target image data set;
step 2, constructing a detection model, wherein the detection model adopts a hierarchical gated Transformer network based on a cross-attention mechanism; the hierarchical gated Transformer network has a single-encoder, multi-decoder architecture, the single encoder being a task-driven gated cross-attention Transformer encoder stage and the multi-decoder being a plurality of feature-gated cross-attention Transformer decoder stages; the detection result is obtained after the output of the last decoder stage passes through a pointwise convolution layer; adaptive embedding of task information is realized in the encoding stage, and multi-scale fusion of encoding and preceding-stage decoding information is realized in the decoding stage;
the task-driven gated cross-attention Transformer encoder stage comprises a plurality of encoder layers connected by downsampling convolution layers; each encoder layer comprises a plurality of encoders connected in series, and each encoder comprises a depth-wise convolution module, a window-based task-driven cross-attention encoding module, an information sharing module and a channel attention module; in the window-based task-driven cross-attention encoding module, the query matrix is Fourier-transformed, multiplied by trainable task parameters, and then inverse Fourier-transformed;
the feature-gated cross-attention Transformer decoder stage comprises a plurality of decoder layers connected through a cross-attention mechanism, i.e. each decoder layer has two inputs: one is the concatenation of all same-depth feature maps of the preceding stages with the feature maps of the two adjacent scales, and the other is the output of the deeper decoder layer of the same stage; each decoder layer comprises a plurality of decoders connected in series, and each decoder comprises a depth-wise convolution module, a window-based feature cross-attention decoding module, an information sharing module and a channel attention module; in the window-based feature cross-attention decoding module, the feature map of the layer deeper than the current layer, after passing through a linear layer, is used as the query matrix, realizing global fusion of features at different depths;
Step 3, training the detection model of the step 2 by using the data set of the step 1;
and step 4, detecting infrared dim small targets using the trained detection model.
2. The method for detecting infrared weak and small targets according to claim 1, wherein the output of each encoder layer except the last encoder layer is downsampled by a 2×2 convolution layer with stride 2.
3. The hierarchical gating cross-Transformer infrared weak and small target detection method according to claim 1, wherein the window-based task-driven cross-attention encoding module adaptively embeds task information into the query matrix of its attention mechanism, the attention mechanism being:
Attention(Q1, K1, V1) = Softmax(Q1·K1^T / sqrt(d1) + B1)·V1
Q1 = IFFT(FFT(X1·W^Q1) ⊙ W_task), K1 = X1·W^K1, V1 = X1·W^V1

wherein Q1, K1, V1 denote the query matrix, the key matrix and the value matrix, d1 is the dimension of the input, X1 is the input feature map, FFT(·) and IFFT(·) denote the Fourier transform and the inverse Fourier transform respectively, W_task denotes the adaptively learned task parameters, B1 denotes the position code, and W^Q1, W^K1, W^V1 denote the transformation matrices of Q1, K1 and V1 respectively.
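The task-driven attention of claim 3 can be sketched in NumPy as scaled dot-product attention whose query is gated in the frequency domain. This is an illustrative reconstruction, not the patent's implementation: the names, the window size, and the real-valued spectral gate `W_task` are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def task_driven_cross_attention(X1, Wq, Wk, Wv, W_task, B1):
    """Queries are gated by trainable task parameters in the Fourier
    domain, then used in standard scaled dot-product attention.
    X1: (N, d1) tokens of one window."""
    d1 = X1.shape[-1]
    # FFT along the channel dimension, element-wise task gating, inverse FFT.
    Q1 = np.fft.ifft(np.fft.fft(X1 @ Wq, axis=-1) * W_task, axis=-1).real
    K1 = X1 @ Wk
    V1 = X1 @ Wv
    A = softmax(Q1 @ K1.T / np.sqrt(d1) + B1)  # (N, N), B1 is the position code
    return A @ V1

rng = np.random.default_rng(0)
N, d = 16, 32                                   # 16 window tokens, 32 channels
X = rng.standard_normal((N, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
W_task = 1.0 + 0.1 * rng.standard_normal(d)     # assumed real-valued spectral gate
B1 = np.zeros((N, N))                           # position code, zero for the sketch
out = task_driven_cross_attention(X, Wq, Wk, Wv, W_task, B1)
print(out.shape)  # (16, 32)
```

The output has the same token/channel shape as the input window, so the module can be stacked like an ordinary attention block.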
4. The hierarchical gating cross-Transformer infrared weak and small target detection method according to claim 1 or 3, wherein the window-based feature cross-attention decoding module uses higher-level feature information as the query matrix in its attention mechanism, the attention mechanism being:
Attention(Q2, K2, V2) = Softmax(Q2·K2^T / sqrt(d2) + B2)·V2
Q2 = Xh·W^Q2, K2 = X2·W^K2, V2 = X2·W^V2

wherein Q2, K2, V2 denote the query matrix, the key matrix and the value matrix, d2 is the dimension of the input, X2 is the input feature map, Xh denotes the higher-level features, B2 denotes the position code, and W^Q2, W^K2, W^V2 denote the transformation matrices of Q2, K2 and V2 respectively.
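Claim 4's decoding attention differs from the encoder only in where the query comes from: the deeper-layer feature map. A minimal NumPy sketch, with illustrative shapes and names:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def feature_cross_attention(X2, Xh, Wq, Wk, Wv, B2):
    """Queries come from the deeper-layer feature map Xh (after a linear
    layer), while keys and values come from the current feature map X2,
    so deeper semantics steer which current-layer positions are attended."""
    d2 = X2.shape[-1]
    Q2 = Xh @ Wq          # query taken from the higher-level features
    K2 = X2 @ Wk
    V2 = X2 @ Wv
    A = softmax(Q2 @ K2.T / np.sqrt(d2) + B2)
    return A @ V2

rng = np.random.default_rng(1)
N, d = 16, 32
X2 = rng.standard_normal((N, d))   # current-layer window tokens
Xh = rng.standard_normal((N, d))   # deeper-layer tokens, same window size assumed
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = feature_cross_attention(X2, Xh, Wq, Wk, Wv, np.zeros((N, N)))
print(out.shape)  # (16, 32)
```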
5. The hierarchical gating cross-Transformer infrared weak and small target detection method according to claim 1, wherein the window-based task-driven cross-attention encoding module and the window-based feature cross-attention decoding module each comprise a feed-forward network layer with depthwise convolution and a residual connection structure.
6. The hierarchical gating cross-Transformer infrared weak and small target detection method according to claim 1, wherein the window-based task-driven cross-attention encoding module and the window-based feature cross-attention decoding module each comprise a gating structure with depthwise convolution;
the input X_n^{i,j} of the nth encoder or decoder of the ith layer of the jth stage passes through the information sharing module to obtain the feature X̂_n^{i,j}; in the encoding stage, X̂_n^{i,j} passes through the window-based task-driven cross-attention encoding module; in the decoding stage, X̂_n^{i,j} passes through the window-based feature cross-attention decoding module; the feature generated by the nth encoding/decoding module of the ith layer of the jth stage is denoted X̃_n^{i,j}; the output Y_n^{i,j} of the nth encoder or decoder of the ith layer of the jth stage is expressed as:

Y_n^{i,j} = W_p(φ(W_d(W_p(X̃_n^{i,j}))) ⊙ σ(W_d(W_p(X̃_n^{i,j}))))

wherein W_p and W_d denote the 1×1 pointwise convolution and the 3×3 depthwise convolution respectively, ⊙ denotes the element-wise product, φ(·) denotes the ReLU activation function, and σ(·) denotes the Sigmoid activation function.
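One plausible reading of the gating structure of claim 6 is a ReLU branch element-wise gated by a Sigmoid branch, both built from pointwise and depthwise convolutions. The exact composition is an assumption (the patent's equation image is not recoverable); this NumPy sketch only fixes shapes and the named operators:

```python
import numpy as np

def pointwise(x, W):
    """1x1 pointwise convolution = per-pixel channel mixing. x: (C,H,W)."""
    return np.einsum('chw,cd->dhw', x, W)

def depthwise3x3(x, k):
    """3x3 depthwise convolution with 'same' padding. k: (C,3,3)."""
    C, H, W = x.shape
    p = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += k[:, i, j][:, None, None] * p[:, i:i+H, j:j+W]
    return out

def gated_unit(x, Wp1, kd, Wp2):
    """Assumed gating: ReLU(h) element-wise multiplied by Sigmoid(h),
    where h is a pointwise-then-depthwise projection, then a final 1x1."""
    h = depthwise3x3(pointwise(x, Wp1), kd)
    relu = np.maximum(h, 0.0)            # phi(.)
    gate = 1.0 / (1.0 + np.exp(-h))      # sigma(.)
    return pointwise(relu * gate, Wp2)   # element-wise product, then W_p

rng = np.random.default_rng(2)
C, H, W = 8, 10, 10
x = rng.standard_normal((C, H, W))
y = gated_unit(x,
               rng.standard_normal((C, C)) * 0.1,
               rng.standard_normal((C, 3, 3)) * 0.1,
               rng.standard_normal((C, C)) * 0.1)
print(y.shape)  # (8, 10, 10)
```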
7. The hierarchical gating cross-Transformer infrared weak and small target detection method according to claim 1, wherein the input of the information sharing module is passed in parallel through a 3×3 convolution and a 3×3 dilated convolution with dilation rate 2, and the two results are concatenated and then passed through a 3×3 convolution.
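The two-branch structure of claim 7 can be sketched directly in NumPy; the channel counts and kernel values are illustrative assumptions. The dilated branch widens the receptive field (effective 5×5 support) without extra parameters:

```python
import numpy as np

def conv3x3(x, k, dilation=1):
    """'Same'-padded 3x3 convolution. x: (Cin,H,W), k: (Cout,Cin,3,3)."""
    Cin, H, W = x.shape
    d = dilation
    p = np.pad(x, ((0, 0), (d, d), (d, d)))
    out = np.zeros((k.shape[0], H, W))
    for i in range(3):
        for j in range(3):
            patch = p[:, i*d:i*d+H, j*d:j*d+W]            # shifted (Cin,H,W) view
            out += np.einsum('oc,chw->ohw', k[:, :, i, j], patch)
    return out

def info_sharing(x, k_std, k_dil, k_fuse):
    """Parallel plain 3x3 conv and 3x3 dilated conv (rate 2),
    concatenated along channels and fused by a final 3x3 conv."""
    branches = np.concatenate([conv3x3(x, k_std),
                               conv3x3(x, k_dil, dilation=2)], axis=0)
    return conv3x3(branches, k_fuse)

rng = np.random.default_rng(3)
C, H, W = 4, 12, 12
x = rng.standard_normal((C, H, W))
k_std = rng.standard_normal((C, C, 3, 3)) * 0.1
k_dil = rng.standard_normal((C, C, 3, 3)) * 0.1
k_fuse = rng.standard_normal((C, 2 * C, 3, 3)) * 0.1      # 2C -> C channels
y = info_sharing(x, k_std, k_dil, k_fuse)
print(y.shape)  # (4, 12, 12)
```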
8. The hierarchical gating cross-Transformer infrared weak and small target detection method according to claim 1, wherein the inputs of the decoder layer are concatenated and then normalized in channel number, the inputs being expressed as:
Input_j^i = Conv([X_1^i, …, X_{j-1}^i, Down(X_{j-1}^{i-1}), Up(X_{j-1}^{i+1})])

wherein Up and Down are bilinear interpolation operations with scale factors of 2 and 0.5 respectively, X_k^i denotes the output of the encoder layer or decoder layer of the ith layer of the kth stage, X_{j-1}^{i-1} denotes the output of the encoder layer or decoder layer of the (i-1)th layer of the (j-1)th stage, X_{j-1}^{i+1} denotes the output of the encoder layer or decoder layer of the (i+1)th layer of the (j-1)th stage, [·] denotes the channel-dimension concatenation operation, and Conv denotes a 3×3 convolution.
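The shape bookkeeping of claim 8 can be sketched as follows. For brevity the sketch substitutes nearest-neighbour repeat and 2×2 average pooling for bilinear interpolation, and a 1×1 channel projection for the 3×3 fusion convolution; these substitutions, and all names and channel counts, are assumptions:

```python
import numpy as np

def up2(x):
    """Nearest-neighbour stand-in for bilinear x2 upsampling. x: (C,H,W)."""
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

def down2(x):
    """2x2 average pooling stand-in for bilinear x0.5 downsampling."""
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).mean(axis=(2, 4))

def decoder_input(same_depth, shallower, deeper, W_norm):
    """Assemble one decoder layer's first input: all same-depth maps from
    earlier stages, plus the previous stage's two adjacent scales
    (resampled to match), concatenated on channels and projected back
    to a fixed channel count (channel normalization)."""
    feats = same_depth + [down2(shallower), up2(deeper)]
    cat = np.concatenate(feats, axis=0)              # channel concat [.]
    return np.einsum('oc,chw->ohw', W_norm, cat)     # 1x1 projection stand-in

rng = np.random.default_rng(4)
C, H, W = 4, 8, 8
same_depth = [rng.standard_normal((C, H, W)) for _ in range(2)]
shallower = rng.standard_normal((C, 2 * H, 2 * W))   # scale above (larger map)
deeper = rng.standard_normal((C, H // 2, W // 2))    # scale below (smaller map)
W_norm = rng.standard_normal((C, 4 * C)) * 0.1       # 4C concat channels -> C
out = decoder_input(same_depth, shallower, deeper, W_norm)
print(out.shape)  # (4, 8, 8)
```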
9. The hierarchical gating cross-Transformer infrared weak and small target detection method according to claim 1, wherein the number of encoder layers is 5, alternately connected by 4 downsampling convolution layers, and the number of feature gating cross-attention Transformer decoder stages is 4, with gradually decreasing depth.
10. The hierarchical gating cross-Transformer infrared weak and small target detection method according to claim 9, wherein the depths of the 4 feature gating cross-attention Transformer decoder stages are 4, 3, 2 and 1 in order.
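The counts in claims 9 and 10 imply a simple resolution schedule, tabulated below. The 256-pixel base resolution, and the reading that each decoder stage spans its depth's shallowest (highest-resolution) scales, are assumptions for illustration:

```python
# 5 encoder layers joined by 4 stride-2 downsampling convolutions,
# then 4 decoder stages whose depths shrink 4 -> 3 -> 2 -> 1.
H = 256
encoder_scales = [H // (2 ** i) for i in range(5)]
print(encoder_scales)            # [256, 128, 64, 32, 16]

decoder_depths = [4, 3, 2, 1]
for stage, depth in enumerate(decoder_depths, start=1):
    # assumed: each stage covers the 'depth' shallowest scales
    print(stage, encoder_scales[:depth])
```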
CN202310888676.8A 2023-07-19 2023-07-19 Hierarchical gating cross-Transformer infrared weak and small target detection method Pending CN116912660A (en)


Publications (1)

Publication Number Publication Date
CN116912660A true CN116912660A (en) 2023-10-20


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117315056A (en) * 2023-11-27 2023-12-29 支付宝(杭州)信息技术有限公司 Video editing method and device
CN117315056B (en) * 2023-11-27 2024-03-19 支付宝(杭州)信息技术有限公司 Video editing method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination