CN112560907A - Limited pixel infrared unmanned aerial vehicle target detection method based on mixed domain attention - Google Patents


Info

Publication number
CN112560907A
Authority
CN
China
Legal status
Pending
Application number
CN202011392343.9A
Other languages
Chinese (zh)
Inventor
秦翰林
蔡彬彬
梁毅
马琳
延翔
欧洪璇
岳恒
张昱庚
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University


Classifications

    • G (Physics) → G06 (Computing; calculating or counting) → G06F (Electric digital data processing) → G06F18/253 Fusion techniques of extracted features
    • G (Physics) → G06 (Computing; calculating or counting) → G06N (Computing arrangements based on specific computational models) → G06N3/045 Combinations of networks
    • G (Physics) → G06 (Computing; calculating or counting) → G06V (Image or video recognition or understanding) → G06V2201/07 Target detection

Abstract

The invention discloses a limited-pixel infrared unmanned aerial vehicle target detection method based on mixed domain attention. Feature extraction is performed on an infrared image through a single-step cascade neural network based on an attention mechanism to obtain feature maps; the feature maps are upsampled to obtain multi-scale feature maps, on which feature fusion and prediction are performed respectively to obtain the final target detection result. By adopting an attention mechanism together with multi-scale feature fusion and prediction, the feature-extraction capability and the detection accuracy for limited-pixel targets are improved.

Description

Limited pixel infrared unmanned aerial vehicle target detection method based on mixed domain attention
Technical Field
The invention belongs to the field of infrared small target detection, and particularly relates to a limited pixel infrared unmanned aerial vehicle target detection method based on mixed domain attention.
Background
With the continuous development of unmanned-platform technology, the development and application of unmanned aerial vehicles (UAVs) are active in many countries, and UAVs are widely used in both military and civil fields by virtue of their high efficiency, low cost, high output and strong maneuverability. At the same time, they pose a serious challenge to airspace security and can even cause great social harm. Infrared imaging detection technology can be used to detect and track UAVs and thus monitor them effectively. However, in actual scenes, owing to the long imaging distance and the interference of long-range atmospheric radiation, the target has a low signal-to-noise ratio, occupies few pixels and lacks shape and texture information, and is further disturbed by complex background clutter and random noise, so that conventional target detection and recognition algorithms cannot achieve both detection precision and detection efficiency.
To solve the problem of limited-pixel target detection, two main classes of methods currently exist: single-frame detection and multi-frame detection. Multi-frame detection algorithms generally take more time than single-frame algorithms and usually assume a static background, which makes them unsuitable for UAV applications. The present invention therefore focuses on single-frame detection algorithms.
Single-frame detection methods are mainly divided into filtering-based methods, methods based on the human visual system (HVS), methods based on image data structure, and neural-network-based methods. Filtering-based methods detect the infrared limited-pixel target mainly by suppressing the background, but are easily affected by background clutter and noise, which degrades detection robustness. HVS-based methods construct a saliency map that highlights the target through the local difference between the target and the background, but place high demands on background stability. Methods based on image data structure separate the target from a single-frame image by exploiting the characteristic difference between the target matrix and the background matrix; they focus mainly on building a target model and lack a description of complex backgrounds.
Owing to the great success of deep neural networks in natural image processing, convolutional neural networks have been introduced into the field of infrared limited-pixel target detection. Fan et al. trained multi-scale CNNs on the MNIST dataset and extracted convolution kernels to enhance small-target images. Wang et al. transferred a CNN pre-trained on ILSVRC 2013 to a small-target dataset to learn small-target features; however, real-world objects usually contain a large amount of shape, color and structure information, so the transfer effect on limited-pixel targets is limited. Lin et al. predicted on different feature maps by fusing high-level semantic information with low-level position information, which markedly improves small-target detection. Cai et al. trained a high-quality detector using a cascade of detectors with progressively larger IoU thresholds, each trained with positive and negative samples bounded by a different IoU threshold, the input of each detector coming from the output of the previous one. Li et al. used a perceptual generative adversarial network to improve the representation of small targets and raise the target detection rate.
Disclosure of Invention
In view of this, the main objective of the present invention is to provide a limited-pixel infrared unmanned aerial vehicle target detection method based on mixed domain attention.
In order to achieve the above purpose, the technical solution of the embodiment of the present invention is realized as follows:
the embodiment of the invention provides a limited-pixel infrared unmanned aerial vehicle target detection method based on mixed domain attention, which comprises the following steps:
performing feature extraction on the infrared image through a single-step cascade neural network based on an attention mechanism to obtain feature maps;
and upsampling the feature maps to obtain multi-scale feature maps, and performing feature fusion and prediction on the multi-scale feature maps respectively to obtain a final target detection result.
In the above scheme, the feature extraction performed on the infrared image by the attention-based single-step cascade neural network to obtain feature maps is specifically: the single-step cascade neural network is used as the feature extraction model to obtain multiple layers of feature maps of different depths, and an attention module based on a mixed domain mechanism is added to raise the attention score of the target region.
In the above scheme, the size of the feature map input to the residual module during feature extraction is 208 × 208 × 32, where 208 × 208 is the spatial size of the feature map and 32 is the number of channels; a channel attention module and a spatial attention module are introduced into the feature extraction module and are used to extract effective features across the multiple channels and within the single-channel spatial features, respectively.
In the above scheme, first, global average pooling and global maximum pooling are performed on the input feature map to obtain two per-channel descriptors; the descriptors are passed through a shared network containing one hidden layer, and the outputs are added to obtain the final channel attention weight vector, as shown in formula (1):

M_C(F) = \sigma\left( H(\mathrm{AvgPool}(F)) + H(\mathrm{MaxPool}(F)) \right)    (1)

wherein F \in \mathbb{R}^{C \times H \times W} represents the input feature map; \sigma represents the sigmoid activation function, used to learn the nonlinear relations among the features; H denotes a network with only one hidden layer, whose learnable parameters are W_0 \in \mathbb{R}^{C/r \times C} and W_1 \in \mathbb{R}^{C \times C/r}; r represents the reduction ratio; AvgPool and MaxPool represent average pooling and maximum pooling respectively, where average pooling averages all the feature values of each channel and maximum pooling selects the maximum feature value in each channel; "+" denotes element-wise addition; multiplying the weight vector M_C(F) with F channel by channel yields the channel-refined feature map F';
secondly, average pooling and maximum pooling along the channel dimension are performed on the feature map F', the results are concatenated into a feature map with 2 channels, and this map is passed through a 7 × 7 convolution layer with a sigmoid activation function to obtain the weight coefficient M_S, as shown in formula (2):

M_S = \sigma\left( \mathrm{Conv}_{7 \times 7}([\mathrm{AvgPool}(F'); \mathrm{MaxPool}(F')]) \right)    (2)

wherein \mathrm{Conv}_{7 \times 7} represents a convolution operation with a kernel size of 7 × 7;
finally, the weight coefficient M_S is multiplied element-wise with the feature map F' to obtain the rescaled new feature map.
In the above scheme, the upsampling of the feature maps to obtain multi-scale feature maps, followed by feature fusion and prediction on the multi-scale feature maps to obtain result maps, is specifically: according to the number of pixels occupied by the unmanned aerial vehicle target, four feature maps of different levels, with sizes 13 × 13, 26 × 26, 52 × 52 and 104 × 104 respectively, are selected, and multi-level feature fusion and prediction are performed on them.
In the above scheme, the feature map of size 13 × 13 at the 82nd layer of the single-step cascade neural network is detected to obtain a first result map of size 13 × 13; secondly, the 79th-layer feature map of size 13 × 13 is upsampled by a factor of 2 to obtain a feature map of size 26 × 26, and this upsampled feature map is fused with the 61st-layer feature map by concatenation, the depths being superposed during concatenation to obtain a 26 × 26 × 768 feature map, which is then detected at the 94th layer, the second detection layer, to obtain a second result map of size 26 × 26; next, the 91st-layer feature map is processed in the same way and concatenated with the 36th-layer feature map, and detection at the 106th layer yields a third result map of size 52 × 52; finally, the same operation is performed on the 103rd layer, which is concatenated with the 11th feature layer to obtain a fourth result map of size 104 × 104.
Compared with the prior art, the invention has the beneficial effects that:
(1) the invention uses an attention mechanism to raise the attention score of the target region;
(2) the invention adopts a multi-scale feature fusion and prediction method to improve the feature-extraction capability and detection accuracy for limited-pixel targets.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The embodiment of the invention provides a finite pixel infrared unmanned aerial vehicle target detection method based on mixed domain attention, which is specifically realized by the following steps as shown in figure 1:
step 101: and (4) performing feature extraction by adopting a single-step cascade neural network based on an attention mechanism to obtain a feature map.
Specifically, a single-step cascade neural network is used as a feature extraction model to obtain a plurality of layers of feature maps with different depths, and an attention module based on a mixed domain mechanism is added to improve the attention score of a target region.
The feature extraction module of the single-step cascade neural network consists of 5 residual modules; each residual module is composed of one 1 × 1 convolution layer, one 3 × 3 convolution layer and a skip-connection structure.
To improve the ability of the convolutional layers to extract target features, channel attention and spatial attention modules are added between the residual modules.
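As a rough illustration of the backbone building block described above, the following NumPy sketch implements one residual module: a 1 × 1 convolution, a 3 × 3 convolution, and a skip connection. The channel counts, the ReLU activation, and the random weights are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(1)

def conv1x1(x, w):
    # A 1x1 convolution is a per-pixel linear map over channels.
    # x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)
    return np.einsum('oc,chw->ohw', w, x)

def conv3x3(x, w):
    # 'Same'-padded 3x3 convolution. x: (C_in, H, W), w: (C_out, C_in, 3, 3)
    C_in, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], H, W))
    for i in range(H):
        for j in range(W):
            out[:, i, j] = np.einsum('ocij,cij->o', w, xp[:, i:i+3, j:j+3])
    return out

def residual_block(x, w1, w3):
    # Bottleneck: 1x1 conv reduces the channels, 3x3 conv restores them,
    # then the input is added back through the skip connection.
    y = np.maximum(conv1x1(x, w1), 0.0)   # 1x1 conv + ReLU (assumed activation)
    y = conv3x3(y, w3)
    return x + y                          # skip connection preserves the shape

C, H, W = 32, 8, 8
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // 2, C)) * 0.1
w3 = rng.standard_normal((C, C // 2, 3, 3)) * 0.1
y = residual_block(x, w1, w3)             # shape preserved: (32, 8, 8)
```

Because the skip connection forces the output shape to match the input, several such modules can be stacked, as the backbone above does.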
The size of the feature map input to the residual module during feature extraction is 208 × 208 × 32, where 208 × 208 is the spatial size of the feature map and 32 is the number of channels; a channel attention module and a spatial attention module are introduced into the feature extraction module and are used to extract effective features across the multiple channels and within the single-channel spatial features, respectively, so as to optimize the feature quality.
Firstly, global average pooling and global maximum pooling are performed on the input feature map to obtain two per-channel descriptors; the descriptors are passed through a shared network containing one hidden layer, and the outputs are added to obtain the final channel attention weight vector, as shown in formula (1):

M_C(F) = \sigma\left( H(\mathrm{AvgPool}(F)) + H(\mathrm{MaxPool}(F)) \right)    (1)

wherein F \in \mathbb{R}^{C \times H \times W} represents the input feature map; \sigma represents the sigmoid activation function, used to learn the nonlinear relations among the features; H denotes a network with only one hidden layer, whose learnable parameters are W_0 \in \mathbb{R}^{C/r \times C} and W_1 \in \mathbb{R}^{C \times C/r}; r represents the reduction ratio; AvgPool and MaxPool represent average pooling and maximum pooling respectively, where average pooling averages all the feature values of each channel and maximum pooling selects the maximum feature value in each channel; "+" denotes element-wise addition; multiplying the weight vector M_C(F) with F channel by channel yields the channel-refined feature map F'.
Secondly, average pooling and maximum pooling along the channel dimension are performed on the feature map F', the results are concatenated into a feature map with 2 channels, and this map is passed through a 7 × 7 convolution layer with a sigmoid activation function to obtain the weight coefficient M_S, as shown in formula (2):

M_S = \sigma\left( \mathrm{Conv}_{7 \times 7}([\mathrm{AvgPool}(F'); \mathrm{MaxPool}(F')]) \right)    (2)

wherein \mathrm{Conv}_{7 \times 7} represents a convolution operation with a kernel size of 7 × 7.
Finally, the weight coefficient M_S is multiplied element-wise with the feature map F' to obtain the rescaled new feature map.
Step 102: multi-scale feature maps are acquired through an upsampling operation, and feature fusion and prediction are performed on them respectively.
Specifically, according to the number of pixels occupied by the unmanned aerial vehicle target, four feature maps of different levels, with sizes 13 × 13, 26 × 26, 52 × 52 and 104 × 104 respectively, are selected, and multi-level feature fusion and prediction are performed on them.
The feature map of size 13 × 13 at the 82nd layer of the single-step cascade neural network is detected to obtain a first result map of size 13 × 13; secondly, the 79th-layer feature map of size 13 × 13 is upsampled by a factor of 2 to obtain a feature map of size 26 × 26, and this upsampled feature map is fused with the 61st-layer feature map by concatenation, the depths being superposed during concatenation to obtain a 26 × 26 × 768 feature map, which is then detected at the 94th layer, the second detection layer, to obtain a second result map of size 26 × 26; next, the 91st-layer feature map is processed in the same way and concatenated with the 36th-layer feature map, and detection at the 106th layer yields a third result map of size 52 × 52; finally, the same operation is performed on the 103rd layer, which is concatenated with the 11th feature layer to obtain a fourth result map of size 104 × 104.
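The upsample-and-concatenate fusion described above can be sketched as follows. The specific channel counts (256 and 512, summing to the 768-channel fused map) are assumptions for illustration, in the style of YOLO-like detectors; the patent text itself only fixes the 26 × 26 × 768 result.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (C, H, W) feature map:
    # each pixel is repeated along both spatial axes.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse(deep, shallow):
    # Upsample the deeper, coarser map to the shallow map's resolution and
    # concatenate along the channel axis (the depth-wise "splicing" above).
    up = upsample2x(deep)
    assert up.shape[1:] == shallow.shape[1:], "spatial sizes must match"
    return np.concatenate([up, shallow], axis=0)

# Assumed channel counts for illustration only.
deep = np.zeros((256, 13, 13))     # coarse 13x13 map from the deep branch
shallow = np.zeros((512, 26, 26))  # 26x26 map from an earlier backbone layer
fused = fuse(deep, shallow)        # -> (768, 26, 26)
```

The same pattern repeats at each scale: the fused map is processed further, detected, and its intermediate map is in turn upsampled and fused with an even shallower layer.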
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (6)

1. A limited pixel infrared unmanned aerial vehicle target detection method based on mixed domain attention is characterized by comprising the following steps:
performing feature extraction on the infrared image through a single-step cascade neural network based on an attention mechanism to obtain feature maps;
and upsampling the feature maps to obtain multi-scale feature maps, and performing feature fusion and prediction on the multi-scale feature maps respectively to obtain a final target detection result.
2. The method for detecting the target of the limited-pixel infrared unmanned aerial vehicle based on mixed domain attention according to claim 1, wherein the feature extraction performed on the infrared image by the attention-based single-step cascade neural network to obtain feature maps is specifically: the single-step cascade neural network is used as the feature extraction model to obtain multiple layers of feature maps of different depths, and an attention module based on a mixed domain mechanism is added to raise the attention score of the target region.
3. The method for detecting the target of the limited-pixel infrared unmanned aerial vehicle based on mixed domain attention according to claim 2, wherein the size of the feature map input to the residual module during feature extraction is 208 × 208 × 32, where 208 × 208 is the spatial size of the feature map and 32 is the number of channels; a channel attention module and a spatial attention module are introduced into the feature extraction module and are used to extract effective features across the multiple channels and within the single-channel spatial features, respectively.
4. The method for detecting the target of the limited-pixel infrared unmanned aerial vehicle based on mixed domain attention according to claim 3, wherein first, global average pooling and global maximum pooling are performed on the input feature map to obtain two per-channel descriptors; the descriptors are passed through a shared network containing one hidden layer, and the outputs are added to obtain the final channel attention weight vector, as shown in formula (1):

M_C(F) = \sigma\left( H(\mathrm{AvgPool}(F)) + H(\mathrm{MaxPool}(F)) \right)    (1)

wherein F \in \mathbb{R}^{C \times H \times W} represents the input feature map; \sigma represents the sigmoid activation function, used to learn the nonlinear relations among the features; H denotes a network with only one hidden layer, whose learnable parameters are W_0 \in \mathbb{R}^{C/r \times C} and W_1 \in \mathbb{R}^{C \times C/r}; r represents the reduction ratio; AvgPool and MaxPool represent average pooling and maximum pooling respectively, where average pooling averages all the feature values of each channel and maximum pooling selects the maximum feature value in each channel; "+" denotes element-wise addition; multiplying the weight vector M_C(F) with F channel by channel yields the channel-refined feature map F';
secondly, average pooling and maximum pooling along the channel dimension are performed on the feature map F', the results are concatenated into a feature map with 2 channels, and this map is passed through a 7 × 7 convolution layer with a sigmoid activation function to obtain the weight coefficient M_S, as shown in formula (2):

M_S = \sigma\left( \mathrm{Conv}_{7 \times 7}([\mathrm{AvgPool}(F'); \mathrm{MaxPool}(F')]) \right)    (2)

wherein \mathrm{Conv}_{7 \times 7} represents a convolution operation with a kernel size of 7 × 7;
finally, the weight coefficient M_S is multiplied element-wise with the feature map F' to obtain the rescaled new feature map.
5. The method for detecting the target of the limited-pixel infrared unmanned aerial vehicle based on mixed domain attention according to any one of claims 1 to 4, wherein the upsampling of the feature maps to obtain multi-scale feature maps, followed by feature fusion and prediction on the multi-scale feature maps to obtain a final target detection result, is specifically: according to the number of pixels occupied by the unmanned aerial vehicle target, four feature maps of different levels, with sizes 13 × 13, 26 × 26, 52 × 52 and 104 × 104 respectively, are selected, and multi-level feature fusion and prediction are performed on them.
6. The method for detecting the target of the limited-pixel infrared unmanned aerial vehicle based on mixed domain attention according to claim 5, wherein the feature map of size 13 × 13 at the 82nd layer of the single-step cascade neural network is detected to obtain a first result map of size 13 × 13; secondly, the 79th-layer feature map of size 13 × 13 is upsampled by a factor of 2 to obtain a feature map of size 26 × 26, and this upsampled feature map is fused with the 61st-layer feature map by concatenation, the depths being superposed during concatenation to obtain a 26 × 26 × 768 feature map, which is then detected at the 94th layer, the second detection layer, to obtain a second result map of size 26 × 26; next, the 91st-layer feature map is processed in the same way and concatenated with the 36th-layer feature map, and detection at the 106th layer yields a third result map of size 52 × 52; finally, the same operation is performed on the 103rd layer, which is concatenated with the 11th feature layer to obtain a fourth result map of size 104 × 104.
CN202011392343.9A 2020-12-02 2020-12-02 Limited pixel infrared unmanned aerial vehicle target detection method based on mixed domain attention Pending CN112560907A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011392343.9A CN112560907A (en) 2020-12-02 2020-12-02 Limited pixel infrared unmanned aerial vehicle target detection method based on mixed domain attention


Publications (1)

Publication Number Publication Date
CN112560907A true CN112560907A (en) 2021-03-26

Family

ID=75047298


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255693A (en) * 2021-05-19 2021-08-13 西华大学 Unmanned aerial vehicle multi-scale detection and identification method based on imaging metadata assistance

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084210A (en) * 2019-04-30 2019-08-02 电子科技大学 The multiple dimensioned Ship Detection of SAR image based on attention pyramid network
WO2019153908A1 (en) * 2018-02-11 2019-08-15 北京达佳互联信息技术有限公司 Image recognition method and system based on attention model
AU2020100274A4 (en) * 2020-02-25 2020-03-26 Huang, Shuying DR A Multi-Scale Feature Fusion Network based on GANs for Haze Removal
CN111179217A (en) * 2019-12-04 2020-05-19 天津大学 Attention mechanism-based remote sensing image multi-scale target detection method
CN111738110A (en) * 2020-06-10 2020-10-02 杭州电子科技大学 Remote sensing image vehicle target detection method based on multi-scale attention mechanism
CN111738344A (en) * 2020-06-24 2020-10-02 上海应用技术大学 Rapid target detection method based on multi-scale fusion
CN111767882A (en) * 2020-07-06 2020-10-13 江南大学 Multi-mode pedestrian detection method based on improved YOLO model


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHUANGZHUANG HU et al.: "Infrared polymorphic target recognition based on single step cascade neural network", PROCEEDINGS OF SPIE
SHEN WENXIANG; QIN PINLE; ZENG JIANCHAO: "Indoor crowd detection network based on multi-level features and hybrid attention mechanism", JOURNAL OF COMPUTER APPLICATIONS, no. 12
MA SENQUAN; ZHOU KE: "Improved small target detection algorithm based on attention mechanism and feature fusion", COMPUTER APPLICATIONS AND SOFTWARE, no. 05



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination