CN115995042A - Video SAR moving target detection method and device

Info

Publication number
CN115995042A
CN115995042A
Authority
CN
China
Prior art keywords
feature
features
video sar
moving target
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310099920.2A
Other languages
Chinese (zh)
Inventor
李银伟
张慧萍
朱亦鸣
毛倩倩
李晓鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202310099920.2A
Publication of CN115995042A
Legal status: Pending

Abstract

The invention provides a video SAR moving target detection method and device. The method comprises the following steps: framing the video SAR to be trained, labeling each frame, and expanding the data set by data enhancement; performing preliminary feature extraction, and inputting the extracted features into a BiFPN for further feature fusion and extraction; inputting the shallow features output by the BiFPN into a coordinate attention (CA) module, and outputting features that attend more strongly to spatial coordinates; fusing the high-level features output by the BiFPN with the features output by the CA module, inputting the result into an adaptive feature fusion module that adaptively fuses the input features, and classifying and regressing with the detection heads; performing iterative training on the deep neural network to obtain optimal weights; and inputting the video SAR to be detected into the trained deep neural network, and outputting the detected moving targets. The invention improves both the efficiency and the accuracy of video SAR moving target detection.

Description

Video SAR moving target detection method and device
Technical Field
The invention relates to the technical field of radar image processing, in particular to a method and a device for detecting a video SAR moving target.
Background
Synthetic aperture radar (SAR) is an active Earth observation system that can image a variety of targets at high resolution around the clock and in all weather conditions. Video SAR can continuously observe and image a region of interest, enabling continuous tracking and monitoring of a target.
For moving targets, Doppler modulation causes them to shift and defocus during imaging, so they appear as irregularly shaped shadows in the image; moving target detection can therefore be achieved by detecting these shadows in the video SAR image. Conventional SAR image processing algorithms generally require preprocessing of the image, such as registration, segmentation and extraction, whereas applying a deep neural network to shadow detection of moving targets achieves end-to-end detection without a complex preprocessing pipeline.
Target detection algorithms based on deep learning fall into two main categories: one-stage and two-stage methods. A one-stage method divides the image into S×S grid cells and computes, for each cell, the probability that a target center falls within it. A two-stage method divides the detection process into two stages: candidate boxes are first extracted according to the position of the target in the image, and then classified and regressed. One-stage methods detect targets much faster than two-stage methods and include some of the most widely used algorithms in this field. However, in existing neural network models, moving target detection accuracy is low because of factors such as the low contrast of video SAR images and speckle noise.
Disclosure of Invention
In view of the defects in the prior art, the invention aims to provide a video SAR moving target detection method and device that improve the detection efficiency of video SAR moving targets and achieve higher moving target detection accuracy.
To solve the above problems, the technical solution of the invention is as follows:
A video SAR moving target detection method comprises the following steps:
framing the video SAR to be trained, labeling each frame, and expanding the data set by data enhancement;
performing preliminary feature extraction, and inputting the extracted features into a BiFPN for further feature fusion and extraction;
inputting the shallow features output by the BiFPN into a coordinate attention (CA) module, and outputting features that attend more strongly to spatial coordinates;
fusing the high-level features output by the BiFPN with the features output by the CA module, inputting the result into an adaptive feature fusion module that adaptively fuses the input features, and classifying and regressing with the detection heads;
performing iterative training on the deep neural network to obtain optimal weights;
and inputting the video SAR to be detected into the trained deep neural network, and outputting the detected moving targets.
Preferably, the step of framing the video SAR to be trained, labeling each frame, and expanding the data set by data enhancement specifically comprises: reading the video SAR to be trained and obtaining the frame rate, width and height of the video; framing the video SAR and labeling each frame image; enhancing the labeled data set with augmentation functions such as cropping, mirroring and rotation; renaming and storing the augmented images and labels in one-to-one correspondence, and dividing the enhanced data set into a training set and a test set in a certain proportion.
Preferably, in the step of performing preliminary feature extraction and inputting the extracted features into the BiFPN for further feature fusion and extraction, the preliminary feature extraction is performed with CSPDarknet.
Preferably, the step of inputting the shallow features output by the BiFPN into the CA module and outputting features that attend more strongly to spatial coordinates specifically comprises: the CA attention separates the height and width of the input image and encodes them independently, performing global average pooling along the width and along the height of the input feature map to obtain 1D feature maps in the two directions:
$$z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i)$$
$$z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w)$$
The input x is encoded channel by channel with pooling kernels of size H×1 and 1×W along the horizontal and vertical coordinate directions, respectively. The feature maps in the two directions are concatenated and passed through convolution, batch normalization and a nonlinearity; each direction is then convolved separately, activated, and multiplied with the input x to obtain the attention weight map. The smaller-scale feature map is up-sampled, the two features are stacked along the channel dimension once their scales match, and the feature that attends more strongly to spatial coordinates is output.
Preferably, the step of fusing the high-level features output by the BiFPN with the features output by the CA module, inputting the result into the adaptive feature fusion module, adaptively fusing the input features in the adaptive feature fusion module, and classifying and regressing with the detection heads specifically comprises: three decoupled head structures receiving feature layers of different scales are used as the detection heads of the network; within each decoupled head, a 1×1 convolution kernel reduces the number of channels, followed by convolution, batch normalization and activation blocks; the resulting values are concatenated, the coordinates of the grid cells on the corresponding feature maps are computed, grid coordinate points of the feature maps are created, and the prediction boxes obtained by forward inference of the neural network are projected onto the original image to obtain the final prediction boxes.
Preferably, in the step of performing iterative training on the deep neural network to obtain the optimal weights, a loss function is defined before training of the neural network begins:
$$\mathrm{loss} = \frac{1}{M}\sum_{i=1}^{M}\left(y_i - \hat{y}_i\right)^2$$
where i is the index over the training data, $y_i$ is the label data, $\hat{y}_i$ is the predicted data, and there are M training samples.
Preferably, in the step of performing iterative training on the deep neural network to obtain the optimal weights, the training loss optimizer is a function optimization algorithm based on stochastic gradient descent: it computes the gradient of the loss function with respect to the weights and moves the weights in the opposite direction until the loss function converges to a local minimum. The weight values are updated in each training iteration, with the weight update formula:
$$w_{j+1} = w_j - lr \cdot \frac{\partial\,\mathrm{loss}}{\partial w_j}$$
where $w_j$ is the weight at the j-th iteration, $w_{j+1}$ is the weight at the (j+1)-th iteration, $lr$ is the learning rate and $\mathrm{loss}$ is the loss function; the weight obtained at each training iteration is computed from the weight of the previous iteration.
Preferably, in the step of performing iterative training on the deep neural network to obtain the optimal weights, the loss type is the intersection over union (IoU); IoU is the overlap ratio of the generated prediction box and the ground-truth box, and the IoU formula is:
$$\mathrm{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}$$
The weights are saved once per iteration, and the optimal weights of the deep neural network are obtained after multiple training iterations.
Further, the invention also provides a video SAR moving target detection apparatus, comprising a processor and a memory for storing executable instructions of the processor, the processor being configured to perform the video SAR moving target detection method described above by executing the executable instructions.
Compared with the prior art, the method takes the YOLOX backbone network CSPDarknet as the baseline of the network, uses a BiFPN to further fuse and extract features, applies the coordinate attention mechanism CA to strengthen the attention of part of the output feature layers, fuses the result with the BiFPN output and feeds it into the adaptive feature fusion module ASFF, and finally uses the three feature layers output by the adaptive feature fusion module for classification and regression. The designed deep neural network is applied to video SAR moving target detection and achieves good results on blurred video SAR moving targets. Compared with conventional image processing methods, the method does not need to preprocess the video SAR image, which improves detection efficiency; compared with conventional deep neural networks such as YOLOX and Fast-RCNN, it achieves higher moving target detection accuracy.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:
fig. 1 is a flow chart of a method for detecting a moving target of a video SAR according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a deep neural network in a method for detecting a moving target of a video SAR according to an embodiment of the present invention;
fig. 3 is a schematic diagram of Fusion structure in a deep neural network structure according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a CBS structure in a deep neural network structure according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present invention.
Specifically, the invention provides a method for detecting a moving target of a video SAR, as shown in figure 1, comprising the following steps:
s1: framing the video SAR to be trained, then respectively labeling, and expanding a data set in a data enhancement mode;
specifically, in step S1, a video SAR image to be trained is read, a frame rate, a width and a height of the video are obtained, and each frame of image after framing the video SAR is labeled; enhancing the labeled data set by using the enhancing functions such as clipping, mirroring, rotation and the like; renaming and storing the amplified images and labels in a one-to-one correspondence in sequence, and distributing the enhanced data set into a training set and a testing set according to a certain proportion.
S2: performing preliminary feature extraction, and inputting the extracted features into a BiFPN for further feature fusion and extraction;
Specifically, as shown in Figs. 2, 3 and 4, the training or test images are input into the CSPDarknet of the deep neural network for preliminary feature extraction, and the three resulting feature layers are named "dark3", "dark4" and "dark5".
Further, in order to improve the detection accuracy of the network, the feature layers after preliminary feature extraction are input into a weighted bidirectional feature pyramid network (BiFPN) for multi-scale feature fusion and extraction.
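As an illustration of the weighted fusion performed inside a BiFPN node, the following PyTorch sketch implements the fast normalized fusion of several same-resolution feature maps with learnable non-negative weights; the channel width and the 3×3 convolution after fusion are assumptions made for the example, not details taken from the invention.

```python
# Hedged sketch of a BiFPN-style weighted fusion node: learnable, non-negative
# weights combine same-shaped feature maps, followed by a conv-BN-activation block.
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    def __init__(self, num_inputs, channels, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, features):
        w = torch.relu(self.weights)
        w = w / (w.sum() + self.eps)                 # normalize so the weights sum to ~1
        fused = sum(wi * fi for wi, fi in zip(w, features))
        return self.conv(fused)
```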
S3: inputting the shallow features output by the BiFPN into the CA attention mechanism, and outputting features that attend more strongly to spatial coordinates;
specifically, in step S3, the shallow feature layers p3_out and p4_out obtained from the BiFPN are input into the CA attention mechanism.
The CA attention converts the 2D encoding of the input into two 1D encodings: the height and width of the image are separated and encoded independently, and global average pooling is applied along the width and along the height of the input feature map to obtain 1D feature maps in the two directions:
$$z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i) \qquad (1)$$
$$z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w) \qquad (2)$$
the pooling kernels of H1 and 1*W sizes are used for the input x to perform channel-by-channel encoding along the horizontal and vertical coordinate directions, respectively, where equation (1) is the output representation of the c-th channel of height H and equation (2) is the c-th channel representation of width w. The feature graphs in two directions are overlapped, convolved, normalized in batches and nonlinear, respectively convolved and activated, and multiplied by the input x.
After the two feature maps of different sizes have passed through the CA attention mechanism and yielded attention weight maps, the smaller-scale feature map is up-sampled; once the two features have the same scale they are stacked along the channel dimension, and features that attend more strongly to spatial coordinates are output.
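A minimal PyTorch sketch of a coordinate attention block of this kind is given below; the channel reduction ratio and the Hardswish activation are common choices assumed here for illustration, not values stated in the invention.

```python
# Hedged sketch of coordinate attention (CA): pool along height and width
# separately, encode the two 1D maps jointly, then produce per-direction
# attention weights that multiply the input feature map.
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (B, C, H, 1): average over width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (B, C, 1, W): average over height
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = self.pool_h(x)                            # (B, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)        # (B, C, W, 1)
        y = torch.cat([x_h, x_w], dim=2)                # stack the two 1D encodings
        y = self.act(self.bn1(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * a_h * a_w                            # attention-weighted features
```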
S4: fusing the high-level features output by the BiFPN with the features output by the CA module, inputting the result into the adaptive feature fusion module, adaptively fusing the input features in the adaptive feature fusion module, and classifying and regressing through the detection heads;
specifically, in step S4, the high-level feature layer output in step S2 and the feature layer output in step S3 are subjected to feature fusion and then input to an adaptive feature fusion module (ASFF).
Specifically, three decoupled head structures receiving feature layers of different scales are used as the detection heads of the network; within each decoupled head, a 1×1 convolution kernel reduces the number of channels, followed by convolution, batch normalization and activation blocks; the resulting values are concatenated, the coordinates of the grid cells on the corresponding feature maps are computed, grid coordinate points of the feature maps are created, and the prediction boxes obtained by forward inference of the neural network are projected onto the original image to obtain the final prediction boxes.
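For illustration, the following PyTorch sketch shows one decoupled detection head of this kind: a 1×1 convolution reduces the channel count, and separate convolution / batch-normalization / activation branches produce the classification and regression outputs. The channel width and the objectness branch are assumptions borrowed from common YOLOX-style heads rather than details fixed by the invention.

```python
# Hedged sketch of a decoupled detection head: shared 1x1 stem, then separate
# classification and regression branches built from conv-BN-SiLU (CBS) blocks.
import torch.nn as nn

def cbs(c_in, c_out, k=3, s=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class DecoupledHead(nn.Module):
    def __init__(self, in_channels, num_classes, width=256):
        super().__init__()
        self.stem = cbs(in_channels, width, k=1)          # 1x1 conv reduces channels
        self.cls_branch = nn.Sequential(cbs(width, width), cbs(width, width))
        self.reg_branch = nn.Sequential(cbs(width, width), cbs(width, width))
        self.cls_pred = nn.Conv2d(width, num_classes, 1)  # class scores
        self.reg_pred = nn.Conv2d(width, 4, 1)            # box offsets
        self.obj_pred = nn.Conv2d(width, 1, 1)            # objectness

    def forward(self, x):
        x = self.stem(x)
        c = self.cls_branch(x)
        r = self.reg_branch(x)
        return self.cls_pred(c), self.reg_pred(r), self.obj_pred(r)
```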
Screening of the prediction boxes comprises two steps (a sketch of the first step follows this list):
In the first step, positive-sample prediction boxes are screened preliminarily: all prediction boxes whose center points fall inside a ground-truth box are selected, together with the prediction boxes whose centers fall inside a square obtained by expanding the ground-truth box by 2.5 strides.
In the second step, a simplified Optimal Transport Assignment (OTA) algorithm is used to further screen the prediction boxes.
S5: performing iterative training on the deep neural network to obtain optimal weights;
specifically, in step S5, a loss function is defined before training of the neural network is started:
$$\mathrm{loss} = \frac{1}{M}\sum_{i=1}^{M}\left(y_i - \hat{y}_i\right)^2$$
where i is the index over the training data, $y_i$ is the label data, $\hat{y}_i$ is the predicted data, and there are M training samples. The optimizer of the training loss is stochastic gradient descent (SGD) and the loss type is the intersection over union (IoU); the weights are saved once per iteration, and the optimal weights of the deep neural network are obtained after multiple training iterations.
SGD is a function optimization algorithm that computes the gradient of the loss function with respect to the weights and moves the weights in the opposite direction until the loss function converges to a local minimum. The weight value is updated in each training iteration according to the formula:
$$w_{j+1} = w_j - lr \cdot \frac{\partial\,\mathrm{loss}}{\partial w_j}$$
where $w_j$ is the weight at the j-th iteration, $w_{j+1}$ is the weight at the (j+1)-th iteration, $lr$ is the learning rate and $\mathrm{loss}$ is the loss function; the weight obtained at each training iteration is computed from the weight of the previous iteration.
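The update rule can be illustrated with a small, self-contained NumPy sketch; the least-squares toy problem below is purely illustrative and is not part of the invention.

```python
# Hedged sketch of the SGD-style update w_{j+1} = w_j - lr * d(loss)/d(w_j),
# demonstrated on a one-parameter least-squares toy problem.
import numpy as np

def sgd_step(w, grad, lr):
    return w - lr * grad                  # move against the gradient

# Toy example: fit y = 2x with a single weight w.
x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x
w, lr = 0.0, 0.05
for _ in range(100):
    grad = np.mean(2 * (w * x - y) * x)   # gradient of the mean squared error
    w = sgd_step(w, grad, lr)
print(w)                                  # converges toward 2.0
```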
IoU is a criterion used to measure the accuracy of target detection on a data set when computing the loss. The IoU formula is:
$$\mathrm{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}$$
IoU is the overlap ratio of the generated prediction box and the ground-truth box, i.e. the ratio of their intersection (area of overlap) to their union (area of union).
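A minimal sketch of this IoU computation for axis-aligned boxes in (x1, y1, x2, y2) format is given below; the box format is an assumption made for illustration.

```python
# Hedged sketch of IoU for two axis-aligned boxes (x1, y1, x2, y2).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero area if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))    # 1 / 7, approximately 0.143
```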
S6: inputting the video SAR to be detected into the trained deep neural network, and outputting the detected moving targets.
Specifically, in step S6, each frame image of the video SAR to be detected is input into the trained deep neural network to obtain the detected moving targets.
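A minimal sketch of this frame-by-frame inference loop is shown below; model, postprocess and the normalization used are placeholders assumed for illustration and stand in for the trained network and its decoding step.

```python
# Hedged sketch of step S6: run the trained network over every frame of the
# video SAR to be detected and collect the per-frame detections.
import cv2
import torch

def detect_video(video_path, model, postprocess, device="cuda"):
    model.eval()
    cap = cv2.VideoCapture(video_path)
    results = []
    ok, frame = cap.read()
    while ok:
        tensor = torch.from_numpy(frame).permute(2, 0, 1).float().unsqueeze(0) / 255.0
        with torch.no_grad():
            pred = model(tensor.to(device))
        results.append(postprocess(pred))   # boxes, scores and classes per frame
        ok, frame = cap.read()
    cap.release()
    return results
```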
Compared with the prior art, the method takes the YOLOX backbone network CSPDarknet as the baseline of the network, uses a BiFPN to further fuse and extract features, applies the coordinate attention mechanism CA to strengthen the attention of part of the output feature layers, fuses the result with the BiFPN output and feeds it into the adaptive feature fusion module ASFF, and finally uses the three feature layers output by the adaptive feature fusion module for classification and regression. The designed deep neural network is applied to video SAR moving target detection and achieves good results on blurred video SAR moving targets. Compared with conventional image processing methods, the method does not need to preprocess the video SAR image, which improves detection efficiency; compared with conventional deep neural networks such as YOLOX and Fast-RCNN, it achieves higher moving target detection accuracy.
The foregoing describes specific embodiments of the present invention. It is to be understood that the invention is not limited to the particular embodiments described above, and that various changes or modifications may be made by those skilled in the art within the scope of the appended claims without affecting the spirit of the invention. The embodiments of the present application and features in the embodiments may be combined with each other arbitrarily without conflict.

Claims (9)

1. A method for detecting a moving target of a video SAR, comprising the following steps:
framing the video SAR to be trained, labeling each frame, and expanding the data set by data enhancement;
performing preliminary feature extraction, and inputting the extracted features into a BiFPN for further feature fusion and extraction;
inputting the shallow features output by the BiFPN into a coordinate attention (CA) module, and outputting features that attend more strongly to spatial coordinates;
fusing the high-level features output by the BiFPN with the features output by the CA module, inputting the result into an adaptive feature fusion module that adaptively fuses the input features, and classifying and regressing with the detection heads;
performing iterative training on the deep neural network to obtain optimal weights;
and inputting the video SAR to be detected into the trained deep neural network, and outputting the detected moving targets.
2. The method for detecting a moving target of a video SAR according to claim 1, wherein the step of framing the video SAR to be trained, labeling each frame, and expanding the data set by data enhancement specifically comprises: reading the video SAR to be trained and obtaining the frame rate, width and height of the video; framing the video SAR and labeling each frame image; enhancing the labeled data set with augmentation functions such as cropping, mirroring and rotation; renaming and storing the augmented images and labels in one-to-one correspondence, and dividing the enhanced data set into a training set and a test set in a certain proportion.
3. The method for detecting a moving target of a video SAR according to claim 1, wherein in the step of performing preliminary feature extraction and inputting the extracted features into the BiFPN for further feature fusion and extraction, the preliminary feature extraction is performed with CSPDarknet.
4. The method for detecting a moving target of a video SAR according to claim 1, wherein the step of inputting the shallow features output by the BiFPN into the CA module and outputting features that attend more strongly to spatial coordinates specifically comprises: the CA attention separates the height and width of the input image and encodes them independently, performing global average pooling along the width and along the height of the input feature map to obtain 1D feature maps in the two directions:
$$z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i)$$
$$z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w)$$
The input x is encoded channel by channel with pooling kernels of size H×1 and 1×W along the width direction and the height direction, respectively. The feature maps in the two directions are concatenated and passed through convolution, batch normalization and a nonlinearity; each direction is then convolved and activated separately and multiplied with the input x. After the attention weight map is obtained, the smaller-scale feature map is up-sampled, the two feature maps are stacked along the channel dimension once their scales match, and the feature that attends more strongly to spatial coordinates is output.
5. The method for detecting a moving target of a video SAR according to claim 1, wherein the step of fusing the high-level features output by the BiFPN with the features output by the CA module, inputting the result into the adaptive feature fusion module, adaptively fusing the input features in the adaptive feature fusion module, and classifying and regressing with the detection heads specifically comprises: three decoupled head structures receiving feature layers of different scales are used as the detection heads of the network; within each decoupled head, a 1×1 convolution kernel reduces the number of channels, followed by convolution, batch normalization and activation blocks; the resulting values are concatenated, the coordinates of the grid cells on the corresponding feature maps are computed, grid coordinate points of the feature maps are created, and the prediction boxes obtained by forward inference of the neural network are projected onto the original image to obtain the final prediction boxes.
6. The method for detecting a moving target of a video SAR according to claim 1, wherein in the step of performing iterative training on the deep neural network to obtain the optimal weights, a loss function is defined before training of the neural network begins:
$$\mathrm{loss} = \frac{1}{M}\sum_{i=1}^{M}\left(y_i - \hat{y}_i\right)^2$$
where i is the index over the training data, $y_i$ is the label data, $\hat{y}_i$ is the predicted data, and there are M training samples.
7. The method for detecting a moving target of a video SAR according to claim 6, wherein in the step of performing iterative training on the deep neural network to obtain the optimal weights, the training loss optimizer is a function optimization algorithm based on stochastic gradient descent: it computes the gradient of the loss function with respect to the weights and moves the weights in the opposite direction until the loss function converges to a local minimum; the weights are updated in each training iteration, with the weight update formula:
$$w_{j+1} = w_j - lr \cdot \frac{\partial\,\mathrm{loss}}{\partial w_j}$$
where $w_j$ is the weight at the j-th iteration, $w_{j+1}$ is the weight at the (j+1)-th iteration, $lr$ is the learning rate and $\mathrm{loss}$ is the loss function; the weight obtained at each training iteration is computed from the weight of the previous iteration.
8. The method for detecting a video SAR moving target according to claim 7, wherein in the step of performing iterative training on the deep neural network to obtain the optimal weights, the loss type is the intersection over union IoU, which is the overlap ratio of the generated prediction box and the ground-truth box, and the IoU formula is:
$$\mathrm{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}$$
The weights are saved once per iteration, and the optimal weights of the deep neural network are obtained after multiple training iterations.
9. A video SAR moving target detection apparatus, comprising a processor and a memory for storing executable instructions of the processor, wherein the processor is configured to perform the video SAR moving target detection method of any one of claims 1 to 8 by executing the executable instructions.
CN202310099920.2A 2023-02-09 2023-02-09 Video SAR moving target detection method and device Pending CN115995042A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310099920.2A CN115995042A (en) 2023-02-09 2023-02-09 Video SAR moving target detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310099920.2A CN115995042A (en) 2023-02-09 2023-02-09 Video SAR moving target detection method and device

Publications (1)

Publication Number Publication Date
CN115995042A true CN115995042A (en) 2023-04-21

Family

ID=85993406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310099920.2A Pending CN115995042A (en) 2023-02-09 2023-02-09 Video SAR moving target detection method and device

Country Status (1)

Country Link
CN (1) CN115995042A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912290A (en) * 2023-09-11 2023-10-20 四川都睿感控科技有限公司 Memory-enhanced method for detecting small moving targets of difficult and easy videos
CN116912290B (en) * 2023-09-11 2023-12-15 四川都睿感控科技有限公司 Memory-enhanced method for detecting small moving targets of difficult and easy videos
CN117372935A (en) * 2023-12-07 2024-01-09 神思电子技术股份有限公司 Video target detection method, device and medium
CN117372935B (en) * 2023-12-07 2024-02-20 神思电子技术股份有限公司 Video target detection method, device and medium

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN110135267B (en) Large-scene SAR image fine target detection method
CN108509978B (en) Multi-class target detection method and model based on CNN (CNN) multi-level feature fusion
CN108399362B (en) Rapid pedestrian detection method and device
Sameen et al. Classification of very high resolution aerial photos using spectral-spatial convolutional neural networks
CN110047069B (en) Image detection device
CN111027493B (en) Pedestrian detection method based on deep learning multi-network soft fusion
CN111310861A (en) License plate recognition and positioning method based on deep neural network
CN112800964B (en) Remote sensing image target detection method and system based on multi-module fusion
CN115995042A (en) Video SAR moving target detection method and device
CN110659664B (en) SSD-based high-precision small object identification method
CN114022830A (en) Target determination method and target determination device
CN111368769A (en) Ship multi-target detection method based on improved anchor point frame generation model
CN116645592B (en) Crack detection method based on image processing and storage medium
CN115631344B (en) Target detection method based on feature self-adaptive aggregation
CN111898432A (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
CN112016569A (en) Target detection method, network, device and storage medium based on attention mechanism
Abdollahi et al. Road extraction from high-resolution orthophoto images using convolutional neural network
CN113850129A (en) Target detection method for rotary equal-variation space local attention remote sensing image
CN111179270A (en) Image co-segmentation method and device based on attention mechanism
Rafique et al. Smart traffic monitoring through pyramid pooling vehicle detection and filter-based tracking on aerial images
CN114764856A (en) Image semantic segmentation method and image semantic segmentation device
CN115937659A (en) Mask-RCNN-based multi-target detection method in indoor complex environment
Wang Remote sensing image semantic segmentation algorithm based on improved ENet network
CN110852255B (en) Traffic target detection method based on U-shaped characteristic pyramid

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination