CN115147731A - SAR image target detection method based on full-space coding attention module - Google Patents

SAR image target detection method based on full-space coding attention module Download PDF

Info

Publication number
CN115147731A
CN115147731A (application number CN202210901110.XA)
Authority
CN
China
Prior art keywords
loss
detection
full
attention module
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210901110.XA
Other languages
Chinese (zh)
Inventor
张弘
刘源
杨一帆
袁丁
李旭亮
宋剑波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202210901110.XA priority Critical patent/CN115147731A/en
Publication of CN115147731A publication Critical patent/CN115147731A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G06V20/13 Satellite images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an SAR image target detection method based on a full-space coding attention module, which improves the detection performance of a target detection network on SAR images and comprises the following points: (1) exploiting the fact that SAR images contain heavy noise, a full-space coding attention module is designed to alleviate noise interference; attention is extracted over the full spatial dimensions, which effectively reduces information loss and improves the feature extraction capability of the model; (2) deformable convolution with learnable offsets is introduced, which reduces the sensitivity of the model to rotation and scale and improves its detection performance; (3) the detection head uses a two-branch detection network so that the regression task and the classification task are decoupled and predicted separately, further improving the detection performance of the network. The method effectively improves the detection performance on offshore, open-sea, and onshore targets in SAR images.

Description

SAR image target detection method based on full-space coding attention module
Technical Field
The invention relates to an SAR image target detection method based on a full-space coding attention module, and belongs to the intersection of aerospace and computer-vision information processing.
Background
Synthetic Aperture Radar (SAR) is an active microwave sensing technique whose imaging is all-time, all-weather, and high-resolution. Because SAR imaging is not limited by illumination or meteorological conditions, it is widely applied in military reconnaissance, ocean resource monitoring, geographic environment surveying, and similar fields, and is currently an important means of Earth observation; how to automatically recognize and detect targets in SAR images, however, remains a key problem. Compared with ordinary optical imaging, speckle noise makes SAR imagery look cluttered and noisy, so the characteristics of target objects are not obvious and target types are difficult to distinguish by eye. With the maturation of satellite technology, more and more maritime and land SAR target detection data sets are available, providing data support for deep learning.
Traditional SAR image target detection algorithms usually rely on prior knowledge, so their application scenarios are narrow. A traditional algorithm first extracts image features manually and expresses them with a mathematical model. Commonly used traditional SAR detection methods include constant false alarm rate (CFAR) detectors and manual feature extraction: the former requires building a complex statistical model and is computationally expensive, while the latter can only be applied to specific scenes and generalizes poorly, for example detecting oil tanks by exploiting their circular shape with Hough-transform circle detection.
With the continuous breakthroughs of deep learning and the iterative progress of computing hardware, deep learning algorithms are gradually being applied to target detection in SAR images. A deep learning algorithm needs no scene-specific tuning for SAR target detection, adapts to different scenes, and achieves high precision with low latency. CNNs have great advantages in feature extraction: they no longer depend on hand-crafted features, can learn target features autonomously, and generalize very well. The two-stage Faster R-CNN detection network reaches 79% precision on the SSDD ship data set; single-stage YOLO-series detectors are faster for SAR image target detection; and the anchor-free FCOS network further improves localization accuracy.
Traditional algorithms can only detect objects with simple shapes from manually extracted features and have low precision. For a generic deep learning algorithm, SAR imaging is affected by speckle noise, so the obtained images are noisy, the targets are strongly corrupted, and the detection results are mediocre. Attention mechanisms, which have developed rapidly in recent years, let the neural network focus when extracting features, reduce the influence of speckle noise, put the network's attention on the target object, and extract features more specifically, thereby improving network performance. The invention therefore combines the characteristics of neural networks with the characteristics of SAR images, modifies the network structure according to the characteristics of SAR images, and provides an attention-based single-stage anchor-free convolutional neural network target detection method for SAR images.
Disclosure of Invention
The invention solves the following problem: it provides an SAR image target detection method based on a full-space coding attention module that achieves high detection accuracy and strong anti-interference capability for targets in maritime and land SAR images.
A generic non-local attention module not only requires heavy computation but also filters noise poorly, making it difficult to use for detection in noisy SAR images. Local attention modules suffer from information loss; for example, the SE attention mechanism compresses the two-dimensional feature map into a one-dimensional vector and loses most of the position-related information in the feature map.
The full-space encoding attention block (ESEAB) of the invention introduces deformable convolution into the SAR target detection network and embeds full-spatial position information into the channel attention. Unlike common channel attention, which converts features into a single feature vector through simple global pooling, the attention part of ESEAB decomposes multidimensional spatial feature information into multiple one-dimensional features for encoding; when processing two-dimensional image information, the module aggregates features along the two spatial directions separately. In this way, long-range dependencies can be captured along one spatial direction while accurate position information is retained along the other, enabling full-spatial feature encoding. The generated feature maps are then encoded into a pair of direction-aware and position-sensitive attention maps, which are applied complementarily to the input feature map to enhance the representation of the objects of interest.
In order to solve the technical problems, the invention is realized by the following technical scheme: a SAR image target detection method based on a full-space coding attention module comprises the following steps:
step 1: preprocessing a target detection data set of the SAR image to obtain labeling information of the target detection data set, and dividing the target detection data set into a training set and a test set;
step 2: embedding the full-space coding attention module as a whole into the backbone network of a deep target detection neural network for feature extraction, obtaining the deep target detection neural network with the full-space coding attention module added; weight-encoding the feature channels along different spatial directions to enhance the feature extraction capability of the backbone network on the picture to be detected; and performing multi-scale feature fusion of the feature maps extracted by this network with a pyramid structure, obtaining fused multi-scale feature maps and improving the detection precision of the detection network;
step 3: in the training stage, sending the fused multi-scale feature maps to two detection branches for regression and classification prediction respectively; the two task-specific detection branches let the classification and regression tasks be predicted directly and decoupled from each other, which improves the detection performance and training efficiency of the deep target detection neural network; the first detection branch predicts the object category at each pixel position of the multi-scale feature map; the second detection branch predicts the confidence and candidate-box position parameters at each pixel position of the multi-scale feature map; finally, the category and position of the targets contained in the training pictures input to the network are predicted;
step 4: inputting the predicted category and position of the targets in the training set, together with the labeling information of the training set, into the loss function; calculating the current value of each loss term, dynamically adjusting the weight of each loss term based on the variance and mean of its historical values, and obtaining the final loss value by weighted combination;
step 5: back-propagating the obtained final loss value, updating the network parameters of the deep target detection neural network with the full-space coding attention module, and repeating the training according to the set maximum number of iterations, learning rate, and back-propagation algorithm until the parameters of the deep target detection neural network with the full-space coding attention module converge, obtaining the finally trained detection model;
step 6: testing the trained detection model on the SAR image test data set, outputting and saving visualized test results, and reporting the mean average precision (mAP) detection index before and after adding the attention module.
Further, in step 2, the full-space coding attention module is formed by stacking a deformable convolution layer, two convolution layers, a pooling layer, two activation layers, a feature splicing layer, and a feature separation layer. The kernel size of the deformable convolution layer is 3×3; the kernel sizes of the convolution layers are 3×3 and 1×1; the pooling layer performs average pooling separately along the height and the width of the features; the activation functions of the activation layers are ReLU and Sigmoid. The features produced by the pooling layer are concatenated by the splicing layer and sent to the convolution layer; the features produced by that convolution layer are split by the separation layer and sent to the second convolution layer; the output features, whose dimensions equal those of the input features, are used as weights for an element-wise multiplication with the input features, increasing the weight of useful information and yielding the final output features. The full-space coding attention module encodes the feature map along different directions, optimizes the channel information, and makes maximal use of the relevant information.
Further, the backbone network is formed by stacking several convolution layers, several normalization layers, and several activation layers. The kernel sizes of the convolution layers are 3×3, 5×5, and 7×7; the batch normalization layer is BN or GN; the activation function of the activation layers is ReLU or SiLU. Part of the convolution layers are replaced with full-space coding attention modules. The deep convolutional network comprises N stages, with N not less than 4; the input is a picture and the output is N feature maps corresponding to the picture.
Further, in step 3, the first branch is a classification branch containing three convolution layers, used for predicting the confidence of the detection box; the dimension of its output tensor is the number of categories of the detection target. The second branch is a regression branch containing three convolution layers, used for predicting the parameters of the detection box.
Further, in step 4, the total loss function $loss_{total}$ is:

$$loss_{total} = \alpha_1 \cdot loss_{cls} + \alpha_2 \cdot loss_{obj} + \alpha_3 \cdot loss_{IoU}$$

where $loss_{cls}$ is the classification loss, $loss_{obj}$ is the confidence loss, and $loss_{IoU}$ is the prediction-box position loss; $\alpha_1$, $\alpha_2$, $\alpha_3$ are weight factors satisfying $\alpha_1 + \alpha_2 + \alpha_3 = 1$ and

$$\alpha_i = \frac{k_i}{\sum_{j=1}^{3} k_j}, \qquad i = 1, 2, 3,$$

with the relative standard deviation

$$k_i = \frac{\sigma_i^2}{\bar{c}_i},$$

where $c_i$ is the loss value of each term of $loss_{total}$, $\sigma_i^2$ is the variance of the historical loss values of that term, and $\bar{c}_i$ is the mean of those historical loss values.
Compared with the prior art, the invention has the following advantages. The anchor-free single-stage target detection network based on the full-space coding attention module has a scientific and reasonable structural design and introduces state-of-the-art techniques such as detection attention mechanisms, deformable convolution, and multi-task learning. Aiming at the problems of cluttered background, heavy noise, and inconspicuous targets in SAR images, more useful information can be learned from the images. The detection of target objects in SAR images is effective, with the following specific advantages:
(1) The method extracts the features of the target object more effectively through the attention mechanism, suppresses the background intensity in the feature map, and alleviates the noise interference inherent to SAR imaging. The attention mechanism added to the network structure, i.e., the full-space coding attention module, encodes the feature map along different directions, optimizes the channel information, and increases the weight of channels containing rich information to reduce the interference of useless information.
(2) The method uses deformable convolution in the ESEAB module in place of an ordinary convolution layer and applies it to SAR target detection. Compared with ordinary convolution, deformable convolution introduces learnable offsets into the receptive field so that the convolved region can always cover the outline of an object, reducing the influence of target translation, scaling, and rotation on detection and further improving the detection results.
(3) When predicting the position and category of a target, the invention adopts direct prediction with the two tasks decoupled. Compared with a two-stage detection network, this significantly increases detection speed; compared with a detection network that predicts position and category with the same coupled head, it converges faster and detects more accurately.
(4) When the multiple loss values are combined and weighted into the final total loss, a multi-task learning strategy with observation-based dynamic weight adjustment replaces the traditional manually set weights. The training stability of each loss term is obtained from the variance and mean of its historical values, and its proportion in the total loss is adjusted according to that stability, achieving a better training result.
Drawings
FIG. 1 is a block diagram of an ESEAB attention module of the present invention;
FIG. 2 is a block diagram of a feature extraction network in the overall network framework diagram of the present invention;
FIG. 3 is a schematic diagram of a network training and testing process according to the present invention;
FIG. 4 is a block diagram of a feature extraction network used in the present invention;
FIG. 5 shows detection results on open-sea and offshore SAR images from the SSDD data set by the deep target detection neural network before and after adding ESEAB, where a, c, and e are the detection results of the network without the full-space coding attention module and b, d, and f are the detection results of the network with the module added;
FIG. 6 shows target detection results on a land SAR image by the deep target detection neural network before and after adding ESEAB, where a is without ESEAB and b is with ESEAB.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and examples.
Channel attention generally uses global pooling to encode spatial information globally, but this compresses the global spatial information into a single channel descriptor and makes it difficult to retain position information, which is the key to capturing spatial structure in visual tasks. To let the attention module capture long-range spatial interactions with accurate position information, the attention part of the full-space coding attention module (ESEAB module) proposed by the invention decomposes global pooling: the ESEAB module captures horizontal and vertical attention features using two one-dimensional average pooling operations. The ESEAB module has the following advantages over previous attention methods for lightweight networks. First, it captures not only cross-channel information but also direction-aware and position-aware information, which helps the model locate and identify the targets of interest more accurately; the ESEAB attention module increases the anti-interference capability of the network and reduces the interference of noise on the target, while the deformable convolution increases the detection accuracy. Second, the ESEAB module is flexible and lightweight and can easily be plugged into a network.
In attention mechanisms, spatial information is often encoded globally by global pooling, which compresses the global information into a single channel value; position information, which is important for detection tasks, is then hard to retain. To avoid the loss of two-dimensional position information caused by global pooling, global pooling is decomposed into one-dimensional encoding operations along the horizontal and vertical directions.
Given an input $X = [x_1, x_2, \ldots, x_C] \in \mathbb{R}^{C \times H \times W}$, the full-space coding attention module first applies a deformable convolution with kernel size 3×3 that does not change the dimensions of the input features, and then sends the result to the attention part. The attention part first encodes each channel along the horizontal and vertical directions using pooling kernels of the two spatial extents (H, 1) and (1, W), respectively. For the c-th channel, the encodings in the (H, 1) and (1, W) directions are:

$$z_c^{h}(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$$

$$z_c^{w}(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$$

where $z_c^{h}$ encodes the c-th channel along the horizontal direction and $z_c^{w}$ encodes the c-th channel along the vertical direction. The two transformations extract features along the two spatial directions and generate a pair of direction-aware feature maps. The generated features in the two directions are concatenated and passed through an $F_1$ (1×1 convolution) operation to obtain $f \in \mathbb{R}^{C/r \times (H+W)}$, i.e.

$$f = \delta(F_1([z^{h}, z^{w}]))$$

where $[\cdot,\cdot]$ denotes the feature concatenation operation, $\delta(\cdot)$ is a nonlinear activation function, and $r$ is the channel reduction ratio, which reduces the overall computation. Then $f$ is split into $f^{h} \in \mathbb{R}^{C/r \times H}$ and $f^{w} \in \mathbb{R}^{C/r \times W}$, and two 1×1 convolutions $F_h$ and $F_w$ transform $f^{h}$ and $f^{w}$ into tensors with the same number of channels as the input $X$:

$$g^{h} = \mathrm{Sigmoid}(F_h(f^{h}))$$

$$g^{w} = \mathrm{Sigmoid}(F_w(f^{w}))$$

Let the output tensor be $Y = [y_1, y_2, \ldots, y_C] \in \mathbb{R}^{C \times H \times W}$. Expanding the final output in the form of attention weights gives

$$y_c(i, j) = x_c(i, j) \times g_c^{h}(i) \times g_c^{w}(j)$$

where $y_c(i, j)$ is the value of the c-th output channel at position $(i, j)$, $x_c(i, j)$ is the value of the c-th input channel at position $(i, j)$, and $g_c^{h}$, $g_c^{w}$ are the encoded attentions. FIG. 1 is a schematic diagram of the ESEAB attention module: the input first passes through a deformable convolution; the resulting feature map is pooled along the horizontal and vertical directions and the pooled features are concatenated; a channel convolution then yields the attention weights in the two directions; and these attention weights are multiplied element-wise with the input feature map to obtain the output feature map.
In order to achieve the above purpose, the technical scheme adopted by the invention is shown in fig. 3, and the steps are as follows:
step 1: preprocess the SAR image target detection data sets. To test the effect of the SAR deep target detection neural network on the detection of marine and land targets, the data sets comprise the SSDD ship detection data set and the HRS0.5 land oil-tank detection data set, both imaged with the SAR principle. Both are processed: the training part is image-enhanced in fixed-size batches, the labeling information of the data sets is obtained, and the training and test sets are divided.
step 2: embed the full-space coding attention module as a whole into the backbone network of the deep target detection neural network for feature extraction, obtaining the deep neural network with the full-space coding attention module added; weight-encode the feature channels along different spatial directions to enhance the feature extraction capability of the backbone on the picture to be detected; up-sample the small deep feature map and splice it with the large shallow feature map, and repeat this up-sampling and splicing once more, i.e., fuse three feature maps of different scales so that feature maps at different scales are fully used, obtaining the fused multi-scale feature maps and improving the detection accuracy of the detection network.
the depth target detection neural network structure of the full-space coding attention module is shown in fig. 1:
the dimension of the feature vector of the input ESEAB structure is H multiplied by W multiplied by C, firstly, the feature vector is subjected to deformable convolution, the convolution kernel size is 3 multiplied by 3, the step length is 1, and a feature map after the deformable convolution is obtained, wherein the feature map size is H multiplied by W multiplied by C. Then, average pooling is carried out on the feature maps obtained after the deformable convolution in the horizontal direction and the vertical direction respectively to obtain feature vectors with dimension H multiplied by 1 multiplied by C after the average pooling in the horizontal direction and feature vectors with dimension 1 multiplied by W multiplied by C after the average pooling in the vertical direction, the feature vectors and the feature vectors are spliced to obtain vectors with dimension of 1 multiplied by (W + H) multiplied by C, then the vectors are subjected to point convolution with the number of channels unchanged to obtain weight vectors with dimension of 1 multiplied by (W + H) multiplied by C after the convolution transformation, the weight vectors are split to obtain weight vectors with dimension of H multiplied by 1 multiplied by C and dimension of 1 multiplied by W multiplied by C respectively, and then the point multiplication of the corresponding positions is carried out on the weight vectors and the feature maps obtained after the deformable convolution to finally obtain the output features with dimension of H multiplied by W multiplied by C.
The backbone network structure of the deep target detection neural network with the ESEAB structure is as follows:
the parameters of the convolution bottleneck network structure with the ESEAB structure comprise an input channel number M, an output channel number O, a hidden layer amplification multiplying power R and an activation function F. The front part structure of the convolution bottleneck network consists of a main part and a side branch part: the trunk portion contains 1x1 convolution, 2 3x3 convolutions, 2 active layers. When the step length is greater than 1, a 3x3 packet convolution is additionally included, the convolution is an intermediate layer convolution, and the number of intermediate layer channels is L =2M. The number of the first 1 × 1 convolution channels is C1= M × R, and the step length is 1; the number of channels of the 1 st 3 × 3 convolution is C2= C1= M × R, and the step size is 1; the number of channels of the 2 nd 3x3 convolution is C3= O, and the step length is 1; the side branch part comprises a convolution layer with the convolution kernel size of 1x1, the convolution step length of S, the number of output characteristic diagram channels of O and the activation function of ReLU. The tensors output by the main part and the side branch parts of the front part are added according to corresponding positions and then input into the rear half part of the bottleneck structure, the rear half part of the bottleneck structure is composed of an ESEAB attention module, the input characteristic and the input characteristic of the ESEAB attention module are the same in size, and the number of output channels is R.
The backbone network of the deep target detection neural network for extracting features mainly comprises 5 stages, as shown in fig. 4:
the stage 1 mainly comprises two structures, firstly, under the condition that information is not lost, the first structure adjusts the position information of picture pixels from the input pictures with the size of B multiplied by C multiplied by W multiplied by H, so that the lossless downsampling process is realized, and the tensors with the size of B multiplied by 4C multiplied by W/2 multiplied by H/2 are obtained; the second structure is mainly as follows: sending the signal into a convolution layer with convolution kernel size of 5 multiplied by 5, step length of 2 and output channel number of 64, then carrying out batch normalization, and finally outputting through a SiLU activation function, wherein the final output channel number is 64;
Stage 2 contains one convolution bottleneck structure with ESEAB; its number of input channels is 64, the convolution stride is 2, the hidden-layer expansion ratio is 3, the activation function is SiLU, and the number of output channels is 128.
Stage 3 contains three convolution bottleneck structures with ESEAB; their numbers of input channels are 128, 128, and 256, the convolution strides are 2, 1, and 1, the hidden-layer expansion ratio is 3, the activation function is SiLU, and the final number of output channels is 256.
Stage 4 is similar to stage 3 and also contains three convolution bottleneck structures with ESEAB; their numbers of input channels are 256, 256, and 512, the convolution strides are 2, 1, and 1, the hidden-layer expansion ratio is 3, the activation function is SiLU, and the final number of output channels is 512.
Stage 5 includes one convolution bottleneck structure with ESEAB. Before the feature vectors are sent to the bottleneck, max-pooling operations with kernel sizes 1×1, 5×5, 9×9, and 13×13 are applied to them, and the four results are spliced into a tensor with 2048 channels. A 1×1 convolution with stride 1 then reduces the number of channels to 512. The resulting tensor is sent to the convolution bottleneck structure with ESEAB, whose number of input channels is 512, convolution stride 2, hidden-layer expansion ratio 3, activation function SiLU, and number of output channels 1024, yielding the final feature vector.
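The pre-bottleneck pooling in stage 5 is a spatial-pyramid-pooling step; a small sketch follows, under the assumption that the incoming feature map has 512 channels so that the four concatenated copies give 2048.

```python
import torch
import torch.nn as nn


class SPP(nn.Module):
    """Spatial pyramid pooling: parallel max-pooling at several kernel sizes, then concat + 1x1 conv."""

    def __init__(self, channels: int = 512, sizes=(1, 5, 9, 13)):
        super().__init__()
        # stride-1, same-padding max pooling keeps the spatial size for every kernel
        self.pools = nn.ModuleList(nn.MaxPool2d(k, stride=1, padding=k // 2) for k in sizes)
        self.reduce = nn.Conv2d(channels * len(sizes), channels, kernel_size=1)

    def forward(self, x):
        return self.reduce(torch.cat([p(x) for p in self.pools], dim=1))


y = SPP()(torch.randn(1, 512, 13, 13))
print(y.shape)  # torch.Size([1, 512, 13, 13])
```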
step 3: after the fused image feature map is obtained, it is decoupled by the two detection branches to obtain, respectively, the category of the predicted target and the position information and confidence of the predicted target box. The position information and confidence of the target box are predicted by the regression branch, and the category of the predicted target is predicted by the classification branch. The classification task is thus separated from the regression task: as shown in fig. 2, the detection branch on the right side of fig. 2 is divided into an upper and a lower branch, where the upper branch performs the classification task and predicts the target category, and the lower branch performs the regression task and regresses the position information and confidence of the target box.
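A minimal sketch of such a decoupled head is given below, assuming three convolution layers per branch as stated in the description; the 3×3 kernels, channel counts, and activations beyond that are illustrative assumptions.

```python
import torch
import torch.nn as nn


class DecoupledHead(nn.Module):
    """Two parallel branches: classification, and regression (box parameters + objectness)."""

    def __init__(self, in_channels: int = 256, num_classes: int = 1):
        super().__init__()
        def branch(out_channels):
            return nn.Sequential(
                nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.SiLU(inplace=True),
                nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.SiLU(inplace=True),
                nn.Conv2d(in_channels, out_channels, 3, padding=1),
            )
        self.cls_branch = branch(num_classes)   # per-pixel class scores
        self.reg_branch = branch(4 + 1)         # per-pixel box parameters + objectness

    def forward(self, feat):
        cls_pred = self.cls_branch(feat)
        reg_out = self.reg_branch(feat)
        box_pred, obj_pred = reg_out[:, :4], reg_out[:, 4:]
        return cls_pred, box_pred, obj_pred


cls_p, box_p, obj_p = DecoupledHead()(torch.randn(1, 256, 52, 52))
print(cls_p.shape, box_p.shape, obj_p.shape)
```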
step 4, substituting the obtained type of the prediction target, the position and the confidence coefficient of the prediction frame and the true value in the annotation file into a loss function for calculation to obtain a loss value under the current network parameter;
during training, a multi-task loss function is adopted, a classification task and a regression task are separated at a decoupling head, prediction and loss calculation are respectively carried out, then the loss of the classification task and the regression task is summed to obtain total loss, then gradient back transmission is carried out according to the total loss, and network parameters are updated towards the direction of decreasing the loss.
Because SAR images contain heavy noise, a focal-loss variant is added to the BCE loss function during training to avoid the imbalance between positive and negative samples, so that the network can learn more useful information. The BCE loss is rewritten as

$$loss_{BCE} = -\frac{1}{N} \sum_{i} \left[ \hat{p}_i \,(1 - p_i)^{\alpha} \log(p_i) + (1 - \hat{p}_i)\, p_i^{\beta} \log(1 - p_i) \right]$$

where $p_i$ is the predicted value and $\hat{p}_i$ the ground-truth value, $i$ indexes the pixel positions of the final output prediction, $N$ is the number of targets, and $\alpha = 2$ and $\beta = 3$ are selected as hyper-parameters.
The main task of the classification branch is to classify the target at the current position, i.e., to predict which known category the target at that position belongs to; it is measured with the BCEWithLogits loss function. BCEWithLogits is derived from the BCE loss: compared with BCE, it first applies a Sigmoid function to map the prediction into (0, 1). BCEWithLogits can therefore be expressed through the BCE loss as

$$loss_{BCEWithLogits} = loss_{BCE}(\mathrm{Sigmoid}(pred),\, target)$$

where Sigmoid is the sigmoid activation function, pred is the predicted probability of each category output by the network, and target is the category in the labeling information, i.e., the ground truth.
After the target classification cls_pred predicted by the classification branch is obtained, the classification loss is computed against the ground truth cls_target:

$$loss_{cls} = loss_{BCEWithLogits}(cls\_pred,\, cls\_target)$$

where cls_pred is the probability of each category output by the network and cls_target is the labeled category of the target.
The regression branch outputs whether an object is contained at a position and the coordinate range of the target. The confidence loss $loss_{obj}$, computed from the prediction of whether a target is contained, is calculated in the same way as $loss_{cls}$ using the BCEWithLogits loss between the predicted value obj_pred and the ground truth obj_target:

$$loss_{obj} = loss_{BCEWithLogits}(obj\_pred,\, obj\_target)$$
The position loss $loss_{IoU}$ is computed from the target coordinate range produced by the regression branch; the loss function used is GIoU. The intersection-over-union (IoU) loss is widely used for the regression-box loss in target detection; the difference is as follows. For two rectangular regions A and B, where A is the predicted target position and B is the ground truth, the IoU between A and B is

$$IoU(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

GIoU was proposed because IoU cannot reflect how A and B are aligned. Computing GIoU requires, on top of IoU, the smallest rectangle C enclosing A and B; the ratio of the area of C not covered by A and B to the area of C is then subtracted from IoU(A, B):

$$GIoU(A, B) = IoU(A, B) - \frac{|C \setminus (A \cup B)|}{|C|}$$
$$loss_{IoU} = 1 - GIoU(A, B)$$
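A small sketch of the GIoU-based box loss described above, for axis-aligned boxes in (x1, y1, x2, y2) form; the function and argument names are illustrative.

```python
import torch


def giou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """GIoU loss for boxes given as (x1, y1, x2, y2); returns 1 - GIoU per box pair."""
    # intersection
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    # union
    area_p = (pred[:, 2:] - pred[:, :2]).clamp(min=0).prod(dim=1)
    area_t = (target[:, 2:] - target[:, :2]).clamp(min=0).prod(dim=1)
    union = area_p + area_t - inter
    iou = inter / (union + eps)
    # smallest enclosing box C
    c_lt = torch.min(pred[:, :2], target[:, :2])
    c_rb = torch.max(pred[:, 2:], target[:, 2:])
    c_area = (c_rb - c_lt).clamp(min=0).prod(dim=1)
    giou = iou - (c_area - union) / (c_area + eps)
    return 1.0 - giou


print(giou_loss(torch.tensor([[0., 0., 10., 10.]]), torch.tensor([[2., 2., 12., 12.]])))
```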
The total loss $loss_{total}$ can be expressed as

$$loss_{total} = \alpha_1 \cdot loss_{cls} + \alpha_2 \cdot loss_{obj} + \alpha_3 \cdot loss_{IoU}$$

where $loss_{cls}$ is the classification loss, $loss_{obj}$ is the confidence loss, and $loss_{IoU}$ is the prediction-box position loss. $\alpha_1$, $\alpha_2$, $\alpha_3$ are weight factors satisfying $\alpha_1 + \alpha_2 + \alpha_3 = 1$, and each weight is determined from the current and previous loss values of its term. The relative standard deviation

$$k_i = \frac{\sigma_i^2}{\bar{c}_i}$$

is the ratio of the variance of the historical loss values of the current term to their historical mean, where $c_i$ denotes the loss value of the term; it decouples the loss magnitude from the loss weight. When the variance of a loss term is large, the term is unstable and its weight is increased; when its variance is small, the term is essentially stable and a smaller weight is assigned. Normalizing the relative deviations,

$$\alpha_i = \frac{k_i}{\sum_{j=1}^{3} k_j},$$

yields the loss weight of each term.
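A sketch of this observation-based dynamic weighting follows: it keeps a short history of each loss term and recomputes the weights from the variance-to-mean ratios. The history window length and the symbol k_i are assumptions introduced here for illustration.

```python
from collections import deque
import statistics


class DynamicLossWeights:
    """Weights each loss term by the variance/mean ratio of its recent history, normalized to sum to 1."""

    def __init__(self, num_terms: int = 3, window: int = 50):
        self.history = [deque(maxlen=window) for _ in range(num_terms)]

    def update(self, losses):
        """losses: current scalar values of (loss_cls, loss_obj, loss_IoU); returns (alpha_1, ..., alpha_n)."""
        for h, v in zip(self.history, losses):
            h.append(float(v))
        ks = []
        for h in self.history:
            if len(h) < 2:
                ks.append(1.0)                      # not enough history yet: equal weighting
            else:
                mean = statistics.fmean(h)
                var = statistics.pvariance(h)
                ks.append(var / (mean + 1e-12))     # relative deviation k_i = variance / mean
        total = sum(ks)
        return [k / total for k in ks]


weights = DynamicLossWeights()
print(weights.update([1.2, 0.8, 0.5]))  # first step: equal weights
```

The total loss for back-propagation is then the weighted sum of the three terms using the returned weights.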
Step 5, reversely propagating the obtained loss value, updating the network parameters of the deep detection neural network added into the full-space coding attention module, and repeatedly training the network parameters according to the set maximum iteration times, the learning rate and the reverse propagation algorithm until the target detection network parameters with the full-space coding attention module are converged to obtain a final detection model;
step 6: input the prepared SAR image test set into the network, test the network to obtain its detection results and the various detection indices, and overlay the detection results on the original images for visualization.
The embodiment of the invention uses the SSDD data set, a ship data set for SAR image detection published in 2017. The data set contains 1160 pictures in total, of size roughly 500×400, with different resolutions and imaging modes. Its annotation files are in VOC format, all ships are labeled with the single class "ship", and the position information is included. Each picture contains a different number of ships of different sizes imaged in different environments, including offshore and open sea, and the data set is widely used in SAR target detection. To test the detection performance on oil-tank targets against a complex onshore background, the oil-tank targets in the HRSD0.5 high-resolution SAR image data set were labeled manually and processed in the same way.
Before the training set is sent to the network, the embodiment of the invention applies data enhancement: color change, random flipping, rotation, Mosaic data enhancement, and similar operations are applied to the pictures. Mosaic data enhancement works as follows: 4 pictures are randomly selected from the data set, randomly resized, and composed into a new picture with one picture placed in each of its four corners; the original annotation information is transformed correspondingly and attached to the new picture.
The pictures sent to the network are resized to 416×416 pixels, the batch size is set to 32, and training uses two 2080Ti GPUs for 250 epochs. The learning strategy is stochastic gradient descent (SGD) with a weight-decay coefficient of 0.0005 and a momentum coefficient of 0.9. The learning-rate schedule is cosine annealing with warm-up: in the first 5 epochs the learning rate increases gradually from 0 to the maximum learning rate of 0.001 and then decays following a cosine function; data enhancement is disabled in the last 20 epochs of training. The number of detected classes is set to 1 and the non-maximum-suppression threshold is 0.6.
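A sketch of the warm-up plus cosine-annealing schedule described above (5 warm-up epochs up to a peak of 0.001, cosine decay over 250 epochs); this is one common way to realize such a schedule and is not code from the patent.

```python
import math


def learning_rate(epoch: int, max_lr: float = 1e-3, warmup_epochs: int = 5, total_epochs: int = 250) -> float:
    """Linear warm-up from 0 to max_lr, then cosine decay to 0 over the remaining epochs."""
    if epoch < warmup_epochs:
        return max_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(total_epochs - warmup_epochs, 1)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))


for e in (0, 4, 5, 125, 249):
    print(e, round(learning_rate(e), 6))
```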
Experiments show that the mAP of the deep target detection neural network with ESEAB is improved by 3.2 percent compared with the network without ESEAB and reaches 60.3, indicating that a network with the attention mechanism clearly improves performance. Fig. 5 shows the detection results on open-sea and offshore ship targets before and after adding the ESEAB module. In fig. 5, a and c are the detection results without the ESEAB module in the offshore environment and b and d are the results with the ESEAB module added; the network without the attention mechanism produces false detections and missed detections offshore. In fig. 5, e is the open-sea detection result without the ESEAB module and f is the open-sea result with it; where land clutter near the coast is strong, the network without the ESEAB module is prone to false detections.
Because the background of land-target SAR images is more complex than that of ocean-target SAR images, an SAR oil-tank detection data set is used for training and the detection results before and after adding the ESEAB module are compared. As can be seen from fig. 6, the left image a shows obvious false detections, with background mistakenly identified as target objects. In image b, the false-detection rate under the complex land background is much lower for the network with the ESEAB module, which shows that adding the ESEAB module lets the network extract more useful information and reduces the interference of surrounding noise.
In summary, the invention provides a deep target detection neural network with ESEAB that greatly improves the detection of offshore, open-sea, and onshore targets and can easily be inserted into other network structures.
It is to be emphasized that: the above are only preferred embodiments of the present invention, and the present invention is not limited thereto in any way, and any simple modifications, equivalent changes and modifications made to the above embodiments according to the technical essence of the present invention are within the scope of the technical solution of the present invention.

Claims (5)

1. A SAR image target detection method based on a full-space coding attention module is characterized in that: the method comprises the following steps:
step 1: preprocessing a target detection data set of the SAR image to obtain labeling information of the target detection data set, and dividing the target detection data set into a training set and a test set;
step 2: embedding the full-space coding attention module as a whole into a backbone network of a deep target detection neural network for feature extraction, obtaining the deep target detection neural network with the full-space coding attention module added, performing weight coding on the feature channels along different spatial directions to enhance the feature extraction capability of the backbone network on the picture to be detected, and performing multi-scale feature fusion of the feature maps extracted by this network with a pyramid structure, obtaining fused multi-scale feature maps and improving the detection precision of the detection network;
step 3: in the training stage, sending the fused multi-scale feature maps to two detection branches for regression and classification prediction respectively, the two task-specific detection branches letting the classification and regression tasks be predicted directly and decoupled from each other, which improves the detection performance and training efficiency of the deep target detection neural network; the first detection branch predicting the object category at each pixel position of the multi-scale feature map; the second detection branch predicting the confidence and candidate-box position parameters at each pixel position of the multi-scale feature map; and finally predicting the category and position of the targets contained in the training pictures input to the network;
step 4: inputting the predicted category and position of the targets in the training set and the labeling information of the training set into the loss function, calculating the current value of each loss term, dynamically adjusting the weight of each loss term based on the variance and mean of its historical values, and obtaining the final loss value by weighted combination;
step 5: back-propagating the obtained final loss value, updating the network parameters of the deep target detection neural network with the full-space coding attention module, and repeating the training according to the set maximum number of iterations, learning rate, and back-propagation algorithm until the parameters of the deep target detection neural network with the full-space coding attention module converge, obtaining the finally trained detection model;
step 6: testing the trained detection model on the SAR image test data set, outputting and saving the visualized test results, and reporting the mean average precision (mAP) detection index before and after adding the attention module.
2. The SAR image target detection method based on the full-space coding attention module according to claim 1, characterized in that: in step 2, the full-space coding attention module is formed by stacking a deformable convolution layer, two convolution layers, a pooling layer, two activation layers, a feature splicing layer, and a feature separation layer; the kernel size of the deformable convolution layer is 3×3; the kernel sizes of the convolution layers are 3×3 and 1×1; the pooling layer performs average pooling separately along the height and the width of the features; the activation functions of the activation layers are ReLU and Sigmoid; the features produced by the pooling layer are concatenated by the splicing layer and sent to the convolution layer; the features produced by that convolution layer are split by the separation layer and sent to the second convolution layer; the output features, whose dimensions equal those of the input features, are used as weights for an element-wise multiplication with the input features, increasing the weight of useful information and yielding the final output features; the full-space coding attention module encodes the feature map along different directions, optimizes the channel information, and makes maximal use of the relevant information.
3. The SAR image target detection method based on the full-space coding attention module according to claim 1, characterized in that: the backbone network is formed by stacking several convolution layers, several normalization layers, and several activation layers; the kernel sizes of the convolution layers are 3×3, 5×5, and 7×7; the batch normalization layer is BN or GN; the activation function of the activation layers is ReLU or SiLU; part of the convolution layers are replaced with full-space coding attention modules; the deep convolutional network comprises N stages, with N not less than 4; the input is a picture and the output is a feature map corresponding to the picture.
4. The SAR image target detection method based on the full-space coding attention module according to claim 1, characterized in that: in step 3, the first branch is a classification branch, which includes three convolutional layers, and is used for predicting the confidence of the detection frame, and the tensor dimension output by the classification branch is the number of classes of the detection target; the second branch is a regression branch, which contains three convolution layers for predicting the relevant parameters of the detection frame.
5. The SAR image target detection method based on the full-space coding attention module according to claim 1, characterized in that: in step 4, the total loss function $loss_{total}$ is:

$$loss_{total} = \alpha_1 \cdot loss_{cls} + \alpha_2 \cdot loss_{obj} + \alpha_3 \cdot loss_{IoU}$$

where $loss_{cls}$ is the classification loss, $loss_{obj}$ is the confidence loss, and $loss_{IoU}$ is the prediction-box position loss; $\alpha_1$, $\alpha_2$, $\alpha_3$ are weight factors satisfying $\alpha_1 + \alpha_2 + \alpha_3 = 1$ and

$$\alpha_i = \frac{k_i}{\sum_{j=1}^{3} k_j}, \qquad i = 1, 2, 3,$$

with the relative standard deviation

$$k_i = \frac{\sigma_i^2}{\bar{c}_i},$$

where $c_i$ is the loss value of each term of $loss_{total}$, $\sigma_i^2$ is the variance of the historical loss values of that term, and $\bar{c}_i$ is the mean of those historical loss values.
CN202210901110.XA 2022-07-28 2022-07-28 SAR image target detection method based on full-space coding attention module Pending CN115147731A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210901110.XA CN115147731A (en) 2022-07-28 2022-07-28 SAR image target detection method based on full-space coding attention module

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210901110.XA CN115147731A (en) 2022-07-28 2022-07-28 SAR image target detection method based on full-space coding attention module

Publications (1)

Publication Number Publication Date
CN115147731A true CN115147731A (en) 2022-10-04

Family

ID=83413749

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210901110.XA Pending CN115147731A (en) 2022-07-28 2022-07-28 SAR image target detection method based on full-space coding attention module

Country Status (1)

Country Link
CN (1) CN115147731A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115542433A (en) * 2022-12-05 2022-12-30 香港中文大学(深圳) Method for coding photonic crystal by deep neural network based on self-attention
CN115542433B (en) * 2022-12-05 2023-03-24 香港中文大学(深圳) Self-attention-based deep neural network coding photonic crystal method
CN116310810A (en) * 2022-12-06 2023-06-23 青岛柯锐思德电子科技有限公司 Cross-domain hyperspectral image classification method based on spatial attention-guided variable convolution
CN116310810B (en) * 2022-12-06 2023-09-15 青岛柯锐思德电子科技有限公司 Cross-domain hyperspectral image classification method based on spatial attention-guided variable convolution
CN115761383A (en) * 2023-01-06 2023-03-07 北京匠数科技有限公司 Image classification method and device, electronic equipment and medium
CN116071658A (en) * 2023-03-07 2023-05-05 四川大学 SAR image small target detection and recognition method and device based on deep learning
CN116012719A (en) * 2023-03-27 2023-04-25 中国电子科技集团公司第五十四研究所 Weak supervision rotating target detection method based on multi-instance learning
CN116311213A (en) * 2023-05-18 2023-06-23 珠海亿智电子科技有限公司 License plate recognition method, device, equipment and medium based on global information integration
CN116311213B (en) * 2023-05-18 2023-08-22 珠海亿智电子科技有限公司 License plate recognition method, device, equipment and medium based on global information integration
CN117636078A (en) * 2024-01-25 2024-03-01 华南理工大学 Target detection method, target detection system, computer equipment and storage medium
CN117636078B (en) * 2024-01-25 2024-04-19 华南理工大学 Target detection method, target detection system, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN115147731A (en) SAR image target detection method based on full-space coding attention module
CN110472627B (en) End-to-end SAR image recognition method, device and storage medium
CN109871902B (en) SAR small sample identification method based on super-resolution countermeasure generation cascade network
CN115497005A (en) YOLOV4 remote sensing target detection method integrating feature transfer and attention mechanism
CN112149591B (en) SSD-AEFF automatic bridge detection method and system for SAR image
CN114612769B (en) Integrated sensing infrared imaging ship detection method integrated with local structure information
CN106845343B (en) Automatic detection method for optical remote sensing image offshore platform
CN113705375A (en) Visual perception device and method for ship navigation environment
CN115393690A (en) Light neural network air-to-ground observation multi-target identification method
CN117237740B (en) SAR image classification method based on CNN and Transformer
CN113486819A (en) Ship target detection method based on YOLOv4 algorithm
CN113536963A (en) SAR image airplane target detection method based on lightweight YOLO network
CN116524189A (en) High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization
CN115147727A (en) Method and system for extracting impervious surface of remote sensing image
CN112800932B (en) Method for detecting remarkable ship target in offshore background and electronic equipment
Patil et al. Semantic segmentation of satellite images using modified U-Net
CN112906564B (en) Intelligent decision support system design and implementation method for automatic target recognition of unmanned airborne SAR (synthetic aperture radar) image
CN116912675B (en) Underwater target detection method and system based on feature migration
CN117788296A (en) Infrared remote sensing image super-resolution reconstruction method based on heterogeneous combined depth network
CN117496154A (en) High-resolution remote sensing image semantic segmentation method based on probability map representation edge
CN114998749B (en) SAR data amplification method for target detection
CN116863293A (en) Marine target detection method under visible light based on improved YOLOv7 algorithm
CN116630637A (en) optical-SAR image joint interpretation method based on multi-modal contrast learning
CN115410089A (en) Self-adaptive local context embedded optical remote sensing small-scale target detection method
CN114283336A (en) Anchor-frame-free remote sensing image small target detection method based on mixed attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination