CN115147731A - SAR image target detection method based on full-space coding attention module - Google Patents

SAR image target detection method based on full-space coding attention module Download PDF

Info

Publication number
CN115147731A
CN115147731A (application number CN202210901110.XA)
Authority
CN
China
Prior art keywords
loss
detection
full
attention module
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210901110.XA
Other languages
Chinese (zh)
Inventor
张弘
刘源
杨一帆
袁丁
李旭亮
宋剑波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202210901110.XA priority Critical patent/CN115147731A/en
Publication of CN115147731A publication Critical patent/CN115147731A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G06V20/13 Satellite images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an SAR image target detection method based on a full-space coding attention module, which improves the detection performance of a target detection network on SAR images and comprises the following points: (1) exploiting the fact that SAR images contain heavy noise, a full-space coding attention module is designed to alleviate noise interference; attention is extracted over the full spatial dimensions, which effectively reduces information loss and improves the feature extraction capability of the model; (2) deformable convolution with learnable offsets is introduced, which reduces the sensitivity of the model to rotation and scale and improves its detection performance; (3) the detection head uses a two-branch detection network so that the regression task and the classification task are decoupled and predicted separately, further improving the detection performance of the network. The method effectively improves the detection performance on offshore, open-sea, and onshore targets in SAR images.

Description

SAR image target detection method based on full-space coding attention module
Technical Field
The invention relates to an SAR image target detection method based on a full-space coding attention module, and belongs to the intersection of aerospace and computer-vision information processing.
Background
Synthetic Aperture Radar (SAR) is an active microwave sensing technique whose imaging is all-time, all-weather, and high-resolution. Because SAR imaging is not limited by illumination or meteorological conditions, it is widely applied in military reconnaissance, ocean resource monitoring, geographic environment surveying, and similar fields, and is currently an important means of Earth observation; how to automatically recognize and detect targets in SAR images, however, remains a key problem. Compared with ordinary optical imaging, speckle noise makes SAR imagery look cluttered and noisy, so the characteristics of target objects are not obvious and target types are difficult to distinguish by eye. With the maturation of satellite technology, more and more maritime and land SAR target detection data sets are available, providing data support for deep learning.
Traditional SAR image target detection algorithms usually rely on prior knowledge, so their application scenarios are narrow. A traditional algorithm first extracts image features manually and expresses them with a mathematical model. Commonly used traditional SAR detection methods include constant false alarm rate (CFAR) detectors and manual feature extraction: the former requires building a complex statistical model and is computationally expensive, while the latter can only be applied to specific scenes and generalizes poorly, for example detecting oil tanks by exploiting their circular shape with Hough-transform circle detection.
With the continuous breakthroughs of deep learning and the iterative progress of computing hardware, deep learning algorithms are gradually being applied to target detection in SAR images. A deep learning algorithm needs no scene-specific tuning for SAR target detection, adapts to different scenes, and achieves high precision with low latency. CNNs have great advantages in feature extraction: they no longer depend on hand-crafted features, can learn target features autonomously, and generalize very well. The two-stage Faster R-CNN detection network reaches 79% precision on the SSDD ship data set; single-stage YOLO-series detectors are faster for SAR image target detection; and the anchor-free FCOS network further improves localization accuracy.
Traditional algorithms can only detect objects with simple shapes from manually extracted features and have low precision. For a generic deep learning algorithm, SAR imaging is affected by speckle noise, so the obtained images are noisy, the targets are strongly corrupted, and the detection results are mediocre. Attention mechanisms, which have developed rapidly in recent years, let the neural network focus when extracting features, reduce the influence of speckle noise, put the network's attention on the target object, and extract features more specifically, thereby improving network performance. The invention therefore combines the characteristics of neural networks with the characteristics of SAR images, modifies the network structure according to the characteristics of SAR images, and provides an attention-based single-stage anchor-free convolutional neural network target detection method for SAR images.
Disclosure of Invention
The invention solves the following problem: it provides an SAR image target detection method based on a full-space coding attention module that achieves high detection accuracy and strong anti-interference capability for targets in maritime and land SAR images.
A generic non-local attention module not only requires heavy computation but also filters noise poorly, making it difficult to use for detection in noisy SAR images. Local attention modules suffer from information loss; for example, the SE attention mechanism compresses the two-dimensional feature map into a one-dimensional vector and loses most of the position-related information in the feature map.
The full-space encoding attention block (ESEAB) of the invention introduces deformable convolution into the SAR target detection network and embeds full-spatial position information into the channel attention. Unlike common channel attention, which converts features into a single feature vector through simple global pooling, the attention part of ESEAB decomposes multidimensional spatial feature information into multiple one-dimensional features for encoding; when processing two-dimensional image information, the module aggregates features along the two spatial directions separately. In this way, long-range dependencies can be captured along one spatial direction while accurate position information is retained along the other, enabling full-spatial feature encoding. The generated feature maps are then encoded into a pair of direction-aware and position-sensitive attention maps, which are applied complementarily to the input feature map to enhance the representation of the objects of interest.
In order to solve the technical problems, the invention is realized by the following technical scheme: a SAR image target detection method based on a full-space coding attention module comprises the following steps:
step 1: preprocessing a target detection data set of the SAR image to obtain labeling information of the target detection data set, and dividing the target detection data set into a training set and a test set;
step 2: embedding the full-space coding attention module as a whole into the backbone network of a deep target detection neural network for feature extraction, obtaining the deep target detection neural network with the full-space coding attention module added; weight-encoding the feature channels along different spatial directions to enhance the feature extraction capability of the backbone network on the picture to be detected; and performing multi-scale feature fusion of the feature maps extracted by this network with a pyramid structure, obtaining fused multi-scale feature maps and improving the detection precision of the detection network;
step 3: in the training stage, sending the fused multi-scale feature maps to two detection branches for regression and classification prediction respectively; the two task-specific detection branches let the classification and regression tasks be predicted directly and decoupled from each other, which improves the detection performance and training efficiency of the deep target detection neural network; the first detection branch predicts the object category at each pixel position of the multi-scale feature map; the second detection branch predicts the confidence and candidate-box position parameters at each pixel position of the multi-scale feature map; finally, the category and position of the targets contained in the training pictures input to the network are predicted;
step 4: inputting the predicted category and position of the targets in the training set, together with the labeling information of the training set, into the loss function; calculating the current value of each loss term, dynamically adjusting the weight of each loss term based on the variance and mean of its historical values, and obtaining the final loss value by weighted combination;
step 5: back-propagating the obtained final loss value, updating the network parameters of the deep target detection neural network with the full-space coding attention module, and repeating the training according to the set maximum number of iterations, learning rate, and back-propagation algorithm until the parameters of the deep target detection neural network with the full-space coding attention module converge, obtaining the finally trained detection model;
step 6: testing the trained detection model on the SAR image test data set, outputting and saving visualized test results, and reporting the mean average precision (mAP) detection index before and after adding the attention module.
Further, in step 2, the full-space coding attention module is formed by stacking a deformable convolution layer, two convolution layers, a pooling layer, two activation layers, a feature splicing layer, and a feature separation layer. The kernel size of the deformable convolution layer is 3×3; the kernel sizes of the convolution layers are 3×3 and 1×1; the pooling layer performs average pooling separately along the height and the width of the features; the activation functions of the activation layers are ReLU and Sigmoid. The features produced by the pooling layer are concatenated by the splicing layer and sent to the convolution layer; the features produced by that convolution layer are split by the separation layer and sent to the second convolution layer; the output features, whose dimensions equal those of the input features, are used as weights for an element-wise multiplication with the input features, increasing the weight of useful information and yielding the final output features. The full-space coding attention module encodes the feature map along different directions, optimizes the channel information, and makes maximal use of the relevant information.
Further, the backbone network is formed by stacking several convolution layers, several normalization layers, and several activation layers. The kernel sizes of the convolution layers are 3×3, 5×5, and 7×7; the batch normalization layer is BN or GN; the activation function of the activation layers is ReLU or SiLU. Part of the convolution layers are replaced with full-space coding attention modules. The deep convolutional network comprises N stages, with N not less than 4; the input is a picture and the output is N feature maps corresponding to the picture.
Further, in step 3, the first branch is a classification branch containing three convolution layers, used for predicting the confidence of the detection box; the dimension of its output tensor is the number of categories of the detection target. The second branch is a regression branch containing three convolution layers, used for predicting the parameters of the detection box.
Further, in step 4, the total loss function $loss_{total}$ is:

$$loss_{total} = \alpha_1 \cdot loss_{cls} + \alpha_2 \cdot loss_{obj} + \alpha_3 \cdot loss_{IoU}$$

where $loss_{cls}$ is the classification loss, $loss_{obj}$ is the confidence loss, and $loss_{IoU}$ is the prediction-box position loss; $\alpha_1$, $\alpha_2$, $\alpha_3$ are weight factors satisfying $\alpha_1 + \alpha_2 + \alpha_3 = 1$ and

$$\alpha_i = \frac{k_i}{\sum_{j=1}^{3} k_j}, \qquad i = 1, 2, 3,$$

with the relative standard deviation

$$k_i = \frac{\sigma_i^2}{\bar{c}_i},$$

where $c_i$ is the loss value of each term of $loss_{total}$, $\sigma_i^2$ is the variance of the historical loss values of that term, and $\bar{c}_i$ is the mean of those historical loss values.
Compared with the prior art, the invention has the following advantages. The anchor-free single-stage target detection network based on the full-space coding attention module has a scientific and reasonable structural design and introduces state-of-the-art techniques such as detection attention mechanisms, deformable convolution, and multi-task learning. Aiming at the problems of cluttered background, heavy noise, and inconspicuous targets in SAR images, more useful information can be learned from the images. The detection of target objects in SAR images is effective, with the following specific advantages:
(1) The method extracts the features of the target object more effectively through the attention mechanism, suppresses the background intensity in the feature map, and alleviates the noise interference inherent to SAR imaging. The attention mechanism added to the network structure, i.e., the full-space coding attention module, encodes the feature map along different directions, optimizes the channel information, and increases the weight of channels containing rich information to reduce the interference of useless information.
(2) The method uses deformable convolution in the ESEAB module in place of an ordinary convolution layer and applies it to SAR target detection. Compared with ordinary convolution, deformable convolution introduces learnable offsets into the receptive field so that the convolved region can always cover the outline of an object, reducing the influence of target translation, scaling, and rotation on detection and further improving the detection results.
(3) When predicting the position and category of a target, the invention adopts direct prediction with the two tasks decoupled. Compared with a two-stage detection network, this significantly increases detection speed; compared with a detection network that predicts position and category with the same coupled head, it converges faster and detects more accurately.
(4) When the multiple loss values are combined and weighted into the final total loss, a multi-task learning strategy with observation-based dynamic weight adjustment replaces the traditional manually set weights. The training stability of each loss term is obtained from the variance and mean of its historical values, and its proportion in the total loss is adjusted according to that stability, achieving a better training result.
Drawings
FIG. 1 is a block diagram of an ESEAB attention module of the present invention;
FIG. 2 is a block diagram of a feature extraction network in the overall network framework diagram of the present invention;
FIG. 3 is a schematic diagram of a network training and testing process according to the present invention;
FIG. 4 is a block diagram of a feature extraction network used in the present invention;
FIG. 5 shows detection results on open-sea and offshore SAR images from the SSDD data set by the deep target detection neural network before and after adding ESEAB, where a, c, and e are the detection results of the network without the full-space coding attention module and b, d, and f are the detection results of the network with the module added;
FIG. 6 shows target detection results on a land SAR image by the deep target detection neural network before and after adding ESEAB, where a is without ESEAB and b is with ESEAB.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and examples.
Channel attention generally uses global pooling to encode spatial information globally, but this compresses the global spatial information into a single channel descriptor and makes it difficult to retain position information, which is the key to capturing spatial structure in visual tasks. To let the attention module capture long-range spatial interactions with accurate position information, the attention part of the full-space coding attention module (ESEAB module) proposed by the invention decomposes global pooling: the ESEAB module captures horizontal and vertical attention features using two one-dimensional average pooling operations. The ESEAB module has the following advantages over previous attention methods for lightweight networks. First, it captures not only cross-channel information but also direction-aware and position-aware information, which helps the model locate and identify the targets of interest more accurately; the ESEAB attention module increases the anti-interference capability of the network and reduces the interference of noise on the target, while the deformable convolution increases the detection accuracy. Second, the ESEAB module is flexible and lightweight and can easily be plugged into a network.
In attention mechanisms, spatial information is often encoded globally by global pooling, which compresses the global information into a single channel value; position information, which is important for detection tasks, is then hard to retain. To avoid the loss of two-dimensional position information caused by global pooling, global pooling is decomposed into one-dimensional encoding operations along the horizontal and vertical directions.
Given an input $X = [x_1, x_2, \ldots, x_C] \in \mathbb{R}^{C \times H \times W}$, the full-space coding attention module first applies a deformable convolution with kernel size 3×3 that does not change the dimensions of the input features, and then sends the result to the attention part. The attention part first encodes each channel along the horizontal and vertical directions using pooling kernels of the two spatial extents (H, 1) and (1, W), respectively. For the c-th channel, the encodings in the (H, 1) and (1, W) directions are:

$$z_c^{h}(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$$

$$z_c^{w}(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$$

where $z_c^{h}$ encodes the c-th channel along the horizontal direction and $z_c^{w}$ encodes the c-th channel along the vertical direction. The two transformations extract features along the two spatial directions and generate a pair of direction-aware feature maps. The generated features in the two directions are concatenated and passed through an $F_1$ (1×1 convolution) operation to obtain $f \in \mathbb{R}^{C/r \times (H+W)}$, i.e.

$$f = \delta(F_1([z^{h}, z^{w}]))$$

where $[\cdot,\cdot]$ denotes the feature concatenation operation, $\delta(\cdot)$ is a nonlinear activation function, and $r$ is the channel reduction ratio, which reduces the overall computation. Then $f$ is split into $f^{h} \in \mathbb{R}^{C/r \times H}$ and $f^{w} \in \mathbb{R}^{C/r \times W}$, and two 1×1 convolutions $F_h$ and $F_w$ transform $f^{h}$ and $f^{w}$ into tensors with the same number of channels as the input $X$:

$$g^{h} = \mathrm{Sigmoid}(F_h(f^{h}))$$

$$g^{w} = \mathrm{Sigmoid}(F_w(f^{w}))$$

Let the output tensor be $Y = [y_1, y_2, \ldots, y_C] \in \mathbb{R}^{C \times H \times W}$. Expanding the final output in the form of attention weights gives

$$y_c(i, j) = x_c(i, j) \times g_c^{h}(i) \times g_c^{w}(j)$$

where $y_c(i, j)$ is the value of the c-th output channel at position $(i, j)$, $x_c(i, j)$ is the value of the c-th input channel at position $(i, j)$, and $g_c^{h}$, $g_c^{w}$ are the encoded attentions. FIG. 1 is a schematic diagram of the ESEAB attention module: the input first passes through a deformable convolution; the resulting feature map is pooled along the horizontal and vertical directions and the pooled features are concatenated; a channel convolution then yields the attention weights in the two directions; and these attention weights are multiplied element-wise with the input feature map to obtain the output feature map.
In order to achieve the above purpose, the technical scheme adopted by the invention is shown in fig. 3, and the steps are as follows:
step 1: preprocess the SAR image target detection data sets. To test the effect of the SAR deep target detection neural network on the detection of marine and land targets, the data sets comprise the SSDD ship detection data set and the HRS0.5 land oil-tank detection data set, both imaged with the SAR principle. Both are processed: the training part is image-enhanced in fixed-size batches, the labeling information of the data sets is obtained, and the training and test sets are divided.
step 2: embed the full-space coding attention module as a whole into the backbone network of the deep target detection neural network for feature extraction, obtaining the deep neural network with the full-space coding attention module added; weight-encode the feature channels along different spatial directions to enhance the feature extraction capability of the backbone on the picture to be detected; up-sample the small deep feature map and splice it with the large shallow feature map, and repeat this up-sampling and splicing once more, i.e., fuse three feature maps of different scales so that feature maps at different scales are fully used, obtaining the fused multi-scale feature maps and improving the detection accuracy of the detection network.
the depth target detection neural network structure of the full-space coding attention module is shown in fig. 1:
the dimension of the feature vector of the input ESEAB structure is H multiplied by W multiplied by C, firstly, the feature vector is subjected to deformable convolution, the convolution kernel size is 3 multiplied by 3, the step length is 1, and a feature map after the deformable convolution is obtained, wherein the feature map size is H multiplied by W multiplied by C. Then, average pooling is carried out on the feature maps obtained after the deformable convolution in the horizontal direction and the vertical direction respectively to obtain feature vectors with dimension H multiplied by 1 multiplied by C after the average pooling in the horizontal direction and feature vectors with dimension 1 multiplied by W multiplied by C after the average pooling in the vertical direction, the feature vectors and the feature vectors are spliced to obtain vectors with dimension of 1 multiplied by (W + H) multiplied by C, then the vectors are subjected to point convolution with the number of channels unchanged to obtain weight vectors with dimension of 1 multiplied by (W + H) multiplied by C after the convolution transformation, the weight vectors are split to obtain weight vectors with dimension of H multiplied by 1 multiplied by C and dimension of 1 multiplied by W multiplied by C respectively, and then the point multiplication of the corresponding positions is carried out on the weight vectors and the feature maps obtained after the deformable convolution to finally obtain the output features with dimension of H multiplied by W multiplied by C.
The backbone network structure of the deep target detection neural network with the ESEAB structure is as follows:
the parameters of the convolution bottleneck network structure with the ESEAB structure comprise an input channel number M, an output channel number O, a hidden layer amplification multiplying power R and an activation function F. The front part structure of the convolution bottleneck network consists of a main part and a side branch part: the trunk portion contains 1x1 convolution, 2 3x3 convolutions, 2 active layers. When the step length is greater than 1, a 3x3 packet convolution is additionally included, the convolution is an intermediate layer convolution, and the number of intermediate layer channels is L =2M. The number of the first 1 × 1 convolution channels is C1= M × R, and the step length is 1; the number of channels of the 1 st 3 × 3 convolution is C2= C1= M × R, and the step size is 1; the number of channels of the 2 nd 3x3 convolution is C3= O, and the step length is 1; the side branch part comprises a convolution layer with the convolution kernel size of 1x1, the convolution step length of S, the number of output characteristic diagram channels of O and the activation function of ReLU. The tensors output by the main part and the side branch parts of the front part are added according to corresponding positions and then input into the rear half part of the bottleneck structure, the rear half part of the bottleneck structure is composed of an ESEAB attention module, the input characteristic and the input characteristic of the ESEAB attention module are the same in size, and the number of output channels is R.
The backbone network of the deep target detection neural network for extracting features mainly comprises 5 stages, as shown in fig. 4:
the stage 1 mainly comprises two structures, firstly, under the condition that information is not lost, the first structure adjusts the position information of picture pixels from the input pictures with the size of B multiplied by C multiplied by W multiplied by H, so that the lossless downsampling process is realized, and the tensors with the size of B multiplied by 4C multiplied by W/2 multiplied by H/2 are obtained; the second structure is mainly as follows: sending the signal into a convolution layer with convolution kernel size of 5 multiplied by 5, step length of 2 and output channel number of 64, then carrying out batch normalization, and finally outputting through a SiLU activation function, wherein the final output channel number is 64;
Stage 2 contains one convolution bottleneck structure with ESEAB; its number of input channels is 64, the convolution stride is 2, the hidden-layer expansion ratio is 3, the activation function is SiLU, and the number of output channels is 128.
Stage 3 contains three convolution bottleneck structures with ESEAB; their numbers of input channels are 128, 128, and 256, the convolution strides are 2, 1, and 1, the hidden-layer expansion ratio is 3, the activation function is SiLU, and the final number of output channels is 256.
Stage 4 is similar to stage 3 and also contains three convolution bottleneck structures with ESEAB; their numbers of input channels are 256, 256, and 512, the convolution strides are 2, 1, and 1, the hidden-layer expansion ratio is 3, the activation function is SiLU, and the final number of output channels is 512.
Stage 5 includes one convolution bottleneck structure with ESEAB. Before the feature vectors are sent to the bottleneck, max-pooling operations with kernel sizes 1×1, 5×5, 9×9, and 13×13 are applied to them, and the four results are spliced into a tensor with 2048 channels. A 1×1 convolution with stride 1 then reduces the number of channels to 512. The resulting tensor is sent to the convolution bottleneck structure with ESEAB, whose number of input channels is 512, convolution stride 2, hidden-layer expansion ratio 3, activation function SiLU, and number of output channels 1024, yielding the final feature vector.
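The pre-bottleneck pooling in stage 5 is a spatial-pyramid-pooling step; a small sketch follows, under the assumption that the incoming feature map has 512 channels so that the four concatenated copies give 2048.

```python
import torch
import torch.nn as nn


class SPP(nn.Module):
    """Spatial pyramid pooling: parallel max-pooling at several kernel sizes, then concat + 1x1 conv."""

    def __init__(self, channels: int = 512, sizes=(1, 5, 9, 13)):
        super().__init__()
        # stride-1, same-padding max pooling keeps the spatial size for every kernel
        self.pools = nn.ModuleList(nn.MaxPool2d(k, stride=1, padding=k // 2) for k in sizes)
        self.reduce = nn.Conv2d(channels * len(sizes), channels, kernel_size=1)

    def forward(self, x):
        return self.reduce(torch.cat([p(x) for p in self.pools], dim=1))


y = SPP()(torch.randn(1, 512, 13, 13))
print(y.shape)  # torch.Size([1, 512, 13, 13])
```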
step 3: after the fused image feature map is obtained, it is decoupled by the two detection branches to obtain, respectively, the category of the predicted target and the position information and confidence of the predicted target box. The position information and confidence of the target box are predicted by the regression branch, and the category of the predicted target is predicted by the classification branch. The classification task is thus separated from the regression task: as shown in fig. 2, the detection branch on the right side of fig. 2 is divided into an upper and a lower branch, where the upper branch performs the classification task and predicts the target category, and the lower branch performs the regression task and regresses the position information and confidence of the target box.
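A minimal sketch of such a decoupled head is given below, assuming three convolution layers per branch as stated in the description; the 3×3 kernels, channel counts, and activations beyond that are illustrative assumptions.

```python
import torch
import torch.nn as nn


class DecoupledHead(nn.Module):
    """Two parallel branches: classification, and regression (box parameters + objectness)."""

    def __init__(self, in_channels: int = 256, num_classes: int = 1):
        super().__init__()
        def branch(out_channels):
            return nn.Sequential(
                nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.SiLU(inplace=True),
                nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.SiLU(inplace=True),
                nn.Conv2d(in_channels, out_channels, 3, padding=1),
            )
        self.cls_branch = branch(num_classes)   # per-pixel class scores
        self.reg_branch = branch(4 + 1)         # per-pixel box parameters + objectness

    def forward(self, feat):
        cls_pred = self.cls_branch(feat)
        reg_out = self.reg_branch(feat)
        box_pred, obj_pred = reg_out[:, :4], reg_out[:, 4:]
        return cls_pred, box_pred, obj_pred


cls_p, box_p, obj_p = DecoupledHead()(torch.randn(1, 256, 52, 52))
print(cls_p.shape, box_p.shape, obj_p.shape)
```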
step 4, substituting the obtained type of the prediction target, the position and the confidence coefficient of the prediction frame and the true value in the annotation file into a loss function for calculation to obtain a loss value under the current network parameter;
during training, a multi-task loss function is adopted, a classification task and a regression task are separated at a decoupling head, prediction and loss calculation are respectively carried out, then the loss of the classification task and the regression task is summed to obtain total loss, then gradient back transmission is carried out according to the total loss, and network parameters are updated towards the direction of decreasing the loss.
Because SAR images contain heavy noise, a focal-loss variant is added to the BCE loss function during training to avoid the imbalance between positive and negative samples, so that the network can learn more useful information. The BCE loss is rewritten as

$$loss_{BCE} = -\frac{1}{N} \sum_{i} \left[ \hat{p}_i \,(1 - p_i)^{\alpha} \log(p_i) + (1 - \hat{p}_i)\, p_i^{\beta} \log(1 - p_i) \right]$$

where $p_i$ is the predicted value and $\hat{p}_i$ the ground-truth value, $i$ indexes the pixel positions of the final output prediction, $N$ is the number of targets, and $\alpha = 2$ and $\beta = 3$ are selected as hyper-parameters.
The main task of the classification branch is to classify the target at the current position, i.e., to predict which known category the target at that position belongs to; it is measured with the BCEWithLogits loss function. BCEWithLogits is derived from the BCE loss: compared with BCE, it first applies a Sigmoid function to map the prediction into (0, 1). BCEWithLogits can therefore be expressed through the BCE loss as

$$loss_{BCEWithLogits} = loss_{BCE}(\mathrm{Sigmoid}(pred),\, target)$$

where Sigmoid is the sigmoid activation function, pred is the predicted probability of each category output by the network, and target is the category in the labeling information, i.e., the ground truth.
After the target classification cls_pred predicted by the classification branch is obtained, the classification loss is computed against the ground truth cls_target:

$$loss_{cls} = loss_{BCEWithLogits}(cls\_pred,\, cls\_target)$$

where cls_pred is the probability of each category output by the network and cls_target is the labeled category of the target.
The regression branch outputs whether an object is contained at a position and the coordinate range of the target. The confidence loss $loss_{obj}$, computed from the prediction of whether a target is contained, is calculated in the same way as $loss_{cls}$ using the BCEWithLogits loss between the predicted value obj_pred and the ground truth obj_target:

$$loss_{obj} = loss_{BCEWithLogits}(obj\_pred,\, obj\_target)$$
The position loss $loss_{IoU}$ is computed from the target coordinate range produced by the regression branch; the loss function used is GIoU. The intersection-over-union (IoU) loss is widely used for the regression-box loss in target detection; the difference is as follows. For two rectangular regions A and B, where A is the predicted target position and B is the ground truth, the IoU between A and B is

$$IoU(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

GIoU was proposed because IoU cannot reflect how A and B are aligned. Computing GIoU requires, on top of IoU, the smallest rectangle C enclosing A and B; the ratio of the area of C not covered by A and B to the area of C is then subtracted from IoU(A, B):

$$GIoU(A, B) = IoU(A, B) - \frac{|C \setminus (A \cup B)|}{|C|}$$
$$loss_{IoU} = 1 - GIoU(A, B)$$
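A small sketch of the GIoU-based box loss described above, for axis-aligned boxes in (x1, y1, x2, y2) form; the function and argument names are illustrative.

```python
import torch


def giou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """GIoU loss for boxes given as (x1, y1, x2, y2); returns 1 - GIoU per box pair."""
    # intersection
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    # union
    area_p = (pred[:, 2:] - pred[:, :2]).clamp(min=0).prod(dim=1)
    area_t = (target[:, 2:] - target[:, :2]).clamp(min=0).prod(dim=1)
    union = area_p + area_t - inter
    iou = inter / (union + eps)
    # smallest enclosing box C
    c_lt = torch.min(pred[:, :2], target[:, :2])
    c_rb = torch.max(pred[:, 2:], target[:, 2:])
    c_area = (c_rb - c_lt).clamp(min=0).prod(dim=1)
    giou = iou - (c_area - union) / (c_area + eps)
    return 1.0 - giou


print(giou_loss(torch.tensor([[0., 0., 10., 10.]]), torch.tensor([[2., 2., 12., 12.]])))
```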
The total loss $loss_{total}$ can be expressed as

$$loss_{total} = \alpha_1 \cdot loss_{cls} + \alpha_2 \cdot loss_{obj} + \alpha_3 \cdot loss_{IoU}$$

where $loss_{cls}$ is the classification loss, $loss_{obj}$ is the confidence loss, and $loss_{IoU}$ is the prediction-box position loss. $\alpha_1$, $\alpha_2$, $\alpha_3$ are weight factors satisfying $\alpha_1 + \alpha_2 + \alpha_3 = 1$, and each weight is determined from the current and previous loss values of its term. The relative standard deviation

$$k_i = \frac{\sigma_i^2}{\bar{c}_i}$$

is the ratio of the variance of the historical loss values of the current term to their historical mean, where $c_i$ denotes the loss value of the term; it decouples the loss magnitude from the loss weight. When the variance of a loss term is large, the term is unstable and its weight is increased; when its variance is small, the term is essentially stable and a smaller weight is assigned. Normalizing the relative deviations,

$$\alpha_i = \frac{k_i}{\sum_{j=1}^{3} k_j},$$

yields the loss weight of each term.
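A sketch of this observation-based dynamic weighting follows: it keeps a short history of each loss term and recomputes the weights from the variance-to-mean ratios. The history window length and the symbol k_i are assumptions introduced here for illustration.

```python
from collections import deque
import statistics


class DynamicLossWeights:
    """Weights each loss term by the variance/mean ratio of its recent history, normalized to sum to 1."""

    def __init__(self, num_terms: int = 3, window: int = 50):
        self.history = [deque(maxlen=window) for _ in range(num_terms)]

    def update(self, losses):
        """losses: current scalar values of (loss_cls, loss_obj, loss_IoU); returns (alpha_1, ..., alpha_n)."""
        for h, v in zip(self.history, losses):
            h.append(float(v))
        ks = []
        for h in self.history:
            if len(h) < 2:
                ks.append(1.0)                      # not enough history yet: equal weighting
            else:
                mean = statistics.fmean(h)
                var = statistics.pvariance(h)
                ks.append(var / (mean + 1e-12))     # relative deviation k_i = variance / mean
        total = sum(ks)
        return [k / total for k in ks]


weights = DynamicLossWeights()
print(weights.update([1.2, 0.8, 0.5]))  # first step: equal weights
```

The total loss for back-propagation is then the weighted sum of the three terms using the returned weights.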
Step 5, reversely propagating the obtained loss value, updating the network parameters of the deep detection neural network added into the full-space coding attention module, and repeatedly training the network parameters according to the set maximum iteration times, the learning rate and the reverse propagation algorithm until the target detection network parameters with the full-space coding attention module are converged to obtain a final detection model;
step 6: input the prepared SAR image test set into the network, test the network to obtain its detection results and the various detection indices, and overlay the detection results on the original images for visualization.
The embodiment of the invention uses the SSDD data set, a ship data set for SAR image detection published in 2017. The data set contains 1160 pictures in total, of size roughly 500×400, with different resolutions and imaging modes. Its annotation files are in VOC format, all ships are labeled with the single class "ship", and the position information is included. Each picture contains a different number of ships of different sizes imaged in different environments, including offshore and open sea, and the data set is widely used in SAR target detection. To test the detection performance on oil-tank targets against a complex onshore background, the oil-tank targets in the HRSD0.5 high-resolution SAR image data set were labeled manually and processed in the same way.
Before the training set is sent to the network, the embodiment of the invention applies data enhancement: color change, random flipping, rotation, Mosaic data enhancement, and similar operations are applied to the pictures. Mosaic data enhancement works as follows: 4 pictures are randomly selected from the data set, randomly resized, and composed into a new picture with one picture placed in each of its four corners; the original annotation information is transformed correspondingly and attached to the new picture.
The pictures sent to the network are resized to 416×416 pixels, the batch size is set to 32, and training uses two 2080Ti GPUs for 250 epochs. The learning strategy is stochastic gradient descent (SGD) with a weight-decay coefficient of 0.0005 and a momentum coefficient of 0.9. The learning-rate schedule is cosine annealing with warm-up: in the first 5 epochs the learning rate increases gradually from 0 to the maximum learning rate of 0.001 and then decays following a cosine function; data enhancement is disabled in the last 20 epochs of training. The number of detected classes is set to 1 and the non-maximum-suppression threshold is 0.6.
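A sketch of the warm-up plus cosine-annealing schedule described above (5 warm-up epochs up to a peak of 0.001, cosine decay over 250 epochs); this is one common way to realize such a schedule and is not code from the patent.

```python
import math


def learning_rate(epoch: int, max_lr: float = 1e-3, warmup_epochs: int = 5, total_epochs: int = 250) -> float:
    """Linear warm-up from 0 to max_lr, then cosine decay to 0 over the remaining epochs."""
    if epoch < warmup_epochs:
        return max_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(total_epochs - warmup_epochs, 1)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))


for e in (0, 4, 5, 125, 249):
    print(e, round(learning_rate(e), 6))
```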
Experiments show that the mAP of the deep target detection neural network with ESEAB is improved by 3.2 percent compared with the network without ESEAB and reaches 60.3, indicating that a network with the attention mechanism clearly improves performance. Fig. 5 shows the detection results on open-sea and offshore ship targets before and after adding the ESEAB module. In fig. 5, a and c are the detection results without the ESEAB module in the offshore environment and b and d are the results with the ESEAB module added; the network without the attention mechanism produces false detections and missed detections offshore. In fig. 5, e is the open-sea detection result without the ESEAB module and f is the open-sea result with it; where land clutter near the coast is strong, the network without the ESEAB module is prone to false detections.
Because the background of land-target SAR images is more complex than that of ocean-target SAR images, an SAR oil-tank detection data set is used for training and the detection results before and after adding the ESEAB module are compared. As can be seen from fig. 6, the left image a shows obvious false detections, with background mistakenly identified as target objects. In image b, the false-detection rate under the complex land background is much lower for the network with the ESEAB module, which shows that adding the ESEAB module lets the network extract more useful information and reduces the interference of surrounding noise.
In summary, the invention provides a deep target detection neural network with ESEAB that greatly improves the detection of offshore, open-sea, and onshore targets and can easily be inserted into other network structures.
It is to be emphasized that: the above are only preferred embodiments of the present invention, and the present invention is not limited thereto in any way, and any simple modifications, equivalent changes and modifications made to the above embodiments according to the technical essence of the present invention are within the scope of the technical solution of the present invention.

Claims (5)

1. A SAR image target detection method based on a full-space coding attention module is characterized in that: the method comprises the following steps:
step 1: preprocessing a target detection data set of the SAR image to obtain labeling information of the target detection data set, and dividing the target detection data set into a training set and a test set;
step 2: embedding the full-space coding attention module as a whole into a backbone network of a deep target detection neural network for feature extraction, obtaining the deep target detection neural network with the full-space coding attention module added, performing weight coding on the feature channels along different spatial directions to enhance the feature extraction capability of the backbone network on the picture to be detected, and performing multi-scale feature fusion of the feature maps extracted by this network with a pyramid structure, obtaining fused multi-scale feature maps and improving the detection precision of the detection network;
step 3: in the training stage, sending the fused multi-scale feature maps to two detection branches for regression and classification prediction respectively, the two task-specific detection branches letting the classification and regression tasks be predicted directly and decoupled from each other, which improves the detection performance and training efficiency of the deep target detection neural network; the first detection branch predicting the object category at each pixel position of the multi-scale feature map; the second detection branch predicting the confidence and candidate-box position parameters at each pixel position of the multi-scale feature map; and finally predicting the category and position of the targets contained in the training pictures input to the network;
step 4: inputting the predicted category and position of the targets in the training set and the labeling information of the training set into the loss function, calculating the current value of each loss term, dynamically adjusting the weight of each loss term based on the variance and mean of its historical values, and obtaining the final loss value by weighted combination;
step 5: back-propagating the obtained final loss value, updating the network parameters of the deep target detection neural network with the full-space coding attention module, and repeating the training according to the set maximum number of iterations, learning rate, and back-propagation algorithm until the parameters of the deep target detection neural network with the full-space coding attention module converge, obtaining the finally trained detection model;
step 6: testing the trained detection model on the SAR image test data set, outputting and saving the visualized test results, and reporting the mean average precision (mAP) detection index before and after adding the attention module.
2. The SAR image target detection method based on the full-space coding attention module according to claim 1, characterized in that: in step 2, the full-space coding attention module is formed by stacking a deformable convolution layer, two convolution layers, a pooling layer, two activation layers, a feature splicing layer, and a feature separation layer; the kernel size of the deformable convolution layer is 3×3; the kernel sizes of the convolution layers are 3×3 and 1×1; the pooling layer performs average pooling separately along the height and the width of the features; the activation functions of the activation layers are ReLU and Sigmoid; the features produced by the pooling layer are concatenated by the splicing layer and sent to the convolution layer; the features produced by that convolution layer are split by the separation layer and sent to the second convolution layer; the output features, whose dimensions equal those of the input features, are used as weights for an element-wise multiplication with the input features, increasing the weight of useful information and yielding the final output features; the full-space coding attention module encodes the feature map along different directions, optimizes the channel information, and makes maximal use of the relevant information.
3. The SAR image target detection method based on the full-space coding attention module according to claim 1, characterized in that: the backbone network is formed by stacking several convolution layers, several normalization layers, and several activation layers; the kernel sizes of the convolution layers are 3×3, 5×5, and 7×7; the batch normalization layer is BN or GN; the activation function of the activation layers is ReLU or SiLU; part of the convolution layers are replaced with full-space coding attention modules; the deep convolutional network comprises N stages, with N not less than 4; the input is a picture and the output is a feature map corresponding to the picture.
4. The SAR image target detection method based on the full-space coding attention module according to claim 1, characterized in that: in step 3, the first branch is a classification branch, which includes three convolutional layers, and is used for predicting the confidence of the detection frame, and the tensor dimension output by the classification branch is the number of classes of the detection target; the second branch is a regression branch, which contains three convolution layers for predicting the relevant parameters of the detection frame.
5. The SAR image target detection method based on the full-space coding attention module according to claim 1, characterized in that: in step 4, the total loss function $loss_{total}$ is:

$$loss_{total} = \alpha_1 \cdot loss_{cls} + \alpha_2 \cdot loss_{obj} + \alpha_3 \cdot loss_{IoU}$$

where $loss_{cls}$ is the classification loss, $loss_{obj}$ is the confidence loss, and $loss_{IoU}$ is the prediction-box position loss; $\alpha_1$, $\alpha_2$, $\alpha_3$ are weight factors satisfying $\alpha_1 + \alpha_2 + \alpha_3 = 1$ and

$$\alpha_i = \frac{k_i}{\sum_{j=1}^{3} k_j}, \qquad i = 1, 2, 3,$$

with the relative standard deviation

$$k_i = \frac{\sigma_i^2}{\bar{c}_i},$$

where $c_i$ is the loss value of each term of $loss_{total}$, $\sigma_i^2$ is the variance of the historical loss values of that term, and $\bar{c}_i$ is the mean of those historical loss values.
CN202210901110.XA 2022-07-28 2022-07-28 SAR image target detection method based on full-space coding attention module Pending CN115147731A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210901110.XA CN115147731A (en) 2022-07-28 2022-07-28 SAR image target detection method based on full-space coding attention module

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210901110.XA CN115147731A (en) 2022-07-28 2022-07-28 SAR image target detection method based on full-space coding attention module

Publications (1)

Publication Number Publication Date
CN115147731A true CN115147731A (en) 2022-10-04

Family

ID=83413749

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210901110.XA Pending CN115147731A (en) 2022-07-28 2022-07-28 SAR image target detection method based on full-space coding attention module

Country Status (1)

Country Link
CN (1) CN115147731A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115542433A (en) * 2022-12-05 2022-12-30 香港中文大学(深圳) Method for coding photonic crystal by deep neural network based on self-attention
CN115542433B (en) * 2022-12-05 2023-03-24 香港中文大学(深圳) Self-attention-based deep neural network coding photonic crystal method
CN116310810A (en) * 2022-12-06 2023-06-23 青岛柯锐思德电子科技有限公司 Cross-domain hyperspectral image classification method based on spatial attention-guided variable convolution
CN116310810B (en) * 2022-12-06 2023-09-15 青岛柯锐思德电子科技有限公司 Cross-domain hyperspectral image classification method based on spatial attention-guided variable convolution
CN115761383A (en) * 2023-01-06 2023-03-07 北京匠数科技有限公司 Image classification method and device, electronic equipment and medium
CN116071658A (en) * 2023-03-07 2023-05-05 四川大学 SAR image small target detection and recognition method and device based on deep learning
CN116012719A (en) * 2023-03-27 2023-04-25 中国电子科技集团公司第五十四研究所 Weak supervision rotating target detection method based on multi-instance learning
CN116311213A (en) * 2023-05-18 2023-06-23 珠海亿智电子科技有限公司 License plate recognition method, device, equipment and medium based on global information integration
CN116311213B (en) * 2023-05-18 2023-08-22 珠海亿智电子科技有限公司 License plate recognition method, device, equipment and medium based on global information integration
CN117636078A (en) * 2024-01-25 2024-03-01 华南理工大学 Target detection method, target detection system, computer equipment and storage medium
CN117636078B (en) * 2024-01-25 2024-04-19 华南理工大学 Target detection method, target detection system, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN115147731A (en) SAR image target detection method based on full-space coding attention module
CN110472627B (en) End-to-end SAR image recognition method, device and storage medium
CN109871902B (en) SAR small sample identification method based on super-resolution countermeasure generation cascade network
CN115497005A (en) YOLOV4 remote sensing target detection method integrating feature transfer and attention mechanism
CN112149591B (en) SSD-AEFF automatic bridge detection method and system for SAR image
CN114612769B (en) Integrated sensing infrared imaging ship detection method integrated with local structure information
CN106845343B (en) Automatic detection method for optical remote sensing image offshore platform
CN113705375A (en) Visual perception device and method for ship navigation environment
CN115393690A (en) Light neural network air-to-ground observation multi-target identification method
CN117237740B (en) SAR image classification method based on CNN and Transformer
CN113486819A (en) Ship target detection method based on YOLOv4 algorithm
CN113536963A (en) SAR image airplane target detection method based on lightweight YOLO network
CN116524189A (en) High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization
CN115147727A (en) Method and system for extracting impervious surface of remote sensing image
CN112800932B (en) Method for detecting remarkable ship target in offshore background and electronic equipment
Patil et al. Semantic segmentation of satellite images using modified U-Net
CN112906564B (en) Intelligent decision support system design and implementation method for automatic target recognition of unmanned airborne SAR (synthetic aperture radar) image
CN116912675B (en) Underwater target detection method and system based on feature migration
CN117788296A (en) Infrared remote sensing image super-resolution reconstruction method based on heterogeneous combined depth network
CN117496154A (en) High-resolution remote sensing image semantic segmentation method based on probability map representation edge
CN114998749B (en) SAR data amplification method for target detection
CN116863293A (en) Marine target detection method under visible light based on improved YOLOv7 algorithm
CN116630637A (en) optical-SAR image joint interpretation method based on multi-modal contrast learning
CN115410089A (en) Self-adaptive local context embedded optical remote sensing small-scale target detection method
CN114283336A (en) Anchor-frame-free remote sensing image small target detection method based on mixed attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination