CN111144234A - Video SAR target detection method based on deep learning - Google Patents
- Publication number
- CN111144234A CN111144234A CN201911257311.5A CN201911257311A CN111144234A CN 111144234 A CN111144234 A CN 111144234A CN 201911257311 A CN201911257311 A CN 201911257311A CN 111144234 A CN111144234 A CN 111144234A
- Authority
- CN
- China
- Prior art keywords
- network
- rcnn
- frame
- layer
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
- G06V20/13—Satellite images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
- G06V10/464—Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations
Abstract
The invention discloses a video SAR target detection method based on deep learning, which comprises the following steps: preprocessing and dividing a video data set to obtain a training set and a test set; constructing a ResNet101 residual network as a feature extractor for extracting high-dimensional features of the SAR image; constructing an RPN (Region Proposal Network), inputting the image features output by the ResNet101 residual network into the RPN, and outputting candidate regions; and constructing a Faster-RCNN network and inputting the RPN output into it to obtain the video SAR target detection result. The invention is simple to implement, achieves high detection accuracy, and is applicable to a wide range of scenes.
Description
Technical Field
The invention belongs to the technical field of radars, and particularly relates to a video SAR target detection method.
Background
Synthetic Aperture Radar (SAR), an active earth-observation system, can be mounted on flight platforms such as airplanes, satellites and spacecraft, can observe the earth all day and in all weather conditions, and has a certain surface-penetration capability. The SAR system therefore has unique advantages in disaster monitoring, environmental monitoring, marine monitoring, resource exploration, crop yield estimation, mapping, military applications and other fields, can play a role that other remote-sensing means can hardly play, and has received increasing attention from countries around the world.
Currently, mainstream SAR target detection methods fall into three categories: target detection based on the statistical distribution of background clutter, target detection based on polarization decomposition, and target detection based on polarization characteristics. These methods approach detection from the perspective of the imaging mechanism and require manual modeling to extract SAR image features; the detection procedures are therefore complex and the detection accuracy is low.
Disclosure of Invention
To solve the technical problems mentioned in the Background section, the invention provides a video SAR target detection method based on deep learning.
In order to achieve the technical purpose, the technical scheme of the invention is as follows:
A video SAR target detection method based on deep learning comprises the following steps:
(1) preprocessing and dividing a video data set to obtain a training set and a test set;
(2) constructing a ResNet101 residual network as a feature extractor for extracting high-dimensional features of the SAR image; in the process of constructing the ResNet101 residual network, introducing an FPN architecture that combines feature maps of different scales before and after the pooling layers to provide multi-scale combined image features for the subsequent steps;
(3) constructing an RPN (Region Proposal Network), inputting the image features output by the ResNet101 residual network into the RPN, and outputting candidate regions;
(4) constructing a Faster-RCNN network, and inputting the RPN output into the Faster-RCNN network to obtain the video SAR target detection result.
Further, in step (1), a data set is constructed from the video: each frame of the video is read and stored in sequence, and the data-set images are first annotated, the position coordinates of a target frame being (x_k, y_k, w_k, h_k), where x_k, y_k are the horizontal and vertical coordinates of the upper-left corner of the target frame and w_k, h_k are its width and height;
then data enhancement is carried out: the vertical pixels of each data-set image are kept unchanged while the horizontal positions are flipped, giving new position coordinates (x'_k, y_k, w_k, h_k), where x'_k = W - x_k - w_k and W is the image width;
and finally the data set is divided into a training set and a test set at a ratio of m:n, where m > n.
Further, in step (2), an initial neural network model is constructed using the VGG network construction method, and residual and skip structures are introduced so that the number of network layers can be deepened; an FPN structure is introduced, the feature maps of different scales before and after the pooling layers are down-sampled, the corresponding elements are summed, and the result is passed through a convolution layer and output.
Further, in step (2), the constructed residual network is trained using an SAR image classification data set.
Further, in step (3), the K-Means clustering algorithm is used to compute the height-to-width ratios of the targets in the data set, and N cluster centres are obtained as the aspect ratios of the subsequent prior anchor boxes. Anchor boxes of different sizes with the N clustered ratios are placed over the data-set images, the selected regions are mapped onto the feature map according to the correspondence of the SPP-net algorithm, and the resulting feature maps are fed into a classification layer and a regression layer respectively: the classification layer distinguishes whether the current anchor box contains a target and outputs a confidence S, and the regression layer outputs the position coordinates of the candidate prediction boxes.
Further, in step (4), an ROI Align layer in the Faster-RCNN network is constructed. The ROI Align layer traverses each candidate region, keeps the floating-point boundaries unquantized, and divides each candidate region into k × k cells (k a positive integer) whose boundaries are likewise not quantized; four fixed sampling positions are computed in each cell, their values are obtained by bilinear interpolation, and a max-pooling operation is then applied so that candidate regions of different sizes are mapped to a fixed size.
Further, in step (4), an RCNN layer in the Faster-RCNN network is constructed, and the fixed-size candidate regions output by the ROI Align layer are fed into two convolutional neural networks: a classification neural network for predicting the object class against the background, and a regression neural network for outputting the position coordinates of the target frame.
Further, in step (4), a staged learning-rate method is used to train the loss function of the Faster-RCNN.
Further, the loss function L of the Faster-RCNN is as follows:

L = (1/N_cls1) Σ_i L_cls1(P_1i, P*_1i) + (1/N_cls2) Σ_i L_cls2(P_2i, P*_2i) + (λ/N_reg) Σ_i P*_i · L_reg(t_i, t*_i)

In the above equation, the first term of L is the RPN classification loss: N_cls1 is the number of prediction boxes output by the RPN, P_1i is the predicted probability that the i-th prediction box is foreground, P*_1i = 1 if the i-th prediction box is foreground and P*_1i = 0 otherwise, and L_cls1 is the binary cross-entropy loss function. The second term of L is the RCNN classification loss: N_cls2 is the number of prediction boxes output by the RCNN layer, P_2i is the probability predicted for each class for the i-th prediction box, P*_2i takes 1 for the actual class and 0 for the other classes, and L_cls2 is the multi-class cross-entropy loss function. The third term of L is the regression loss of the RPN and the RCNN: λ is a preset weight coefficient, N_reg is the area of the output feature map, P*_i is the probability that an object is in the i-th prediction box, t_i is the coordinate vector (x, y, w, h) of the i-th prediction box, and t*_i is the coordinate vector of the actual frame, where x and y are the horizontal and vertical coordinates of the upper-left corner of the frame and w and h are its width and height. The loss function L_reg is defined as the smooth-L1 loss:

L_reg(t_i, t*_i) = Σ_j smooth_L1(t_ij - t*_ij), where smooth_L1(d) = 0.5 d² if |d| < 1, and |d| - 0.5 otherwise.
Further, during testing, the Soft-NMS method is applied to the output of the Faster-RCNN network to eliminate overlapping boxes.
The above technical scheme brings the following beneficial effects:
The method uses a residual network to extract high-dimensional features from the original SAR image and introduces an FPN architecture that combines feature maps of different scales before and after the pooling layers to provide multi-scale combined image features for the subsequent algorithm; the image features are fed into the subsequent RPN, and results are finally output through the RCNN network. Meanwhile, the invention adjusts the initial parameters of the residual network and trains them on an SAR-class data set, thereby achieving high-precision end-to-end detection of multi-scale targets in video SAR. The invention is simple to implement, achieves high detection accuracy, and is applicable to a wide range of scenes.
Drawings
FIG. 1 is a basic flow diagram of the present invention.
Detailed Description
The technical scheme of the invention is explained in detail below with reference to the accompanying drawings.
The invention designs a video SAR target detection method based on deep learning; as shown in FIG. 1, the steps are as follows:
Step 1: preprocessing and dividing a video data set to obtain a training set and a test set;
Step 2: constructing a ResNet101 residual network as a feature extractor for extracting high-dimensional features of the SAR image; in the process of constructing the ResNet101 residual network, introducing an FPN architecture that combines feature maps of different scales before and after the pooling layers to provide multi-scale combined image features for the subsequent steps;
Step 3: constructing an RPN (Region Proposal Network), inputting the image features output by the ResNet101 residual network into the RPN, and outputting candidate regions;
Step 4: constructing a Faster-RCNN network, and inputting the RPN output into the Faster-RCNN network to obtain the video SAR target detection result.
In this embodiment, step 1 is implemented by the following preferred scheme:
constructing a data set through a video, sequentially reading and storing each frame of the video, firstly calibrating the data set for a data set image, wherein the position coordinate of a target frame is (x)k,yk,wk,hk) Wherein x isk、ykIs the horizontal and vertical coordinates, w, of the upper left corner of the target framek、hkThe width and height of the target frame. And then data enhancement is carried out, so that generalization capability and robustness are improved. The vertical pixel of the data set image is unchanged, the horizontal position is overturned, and a new position coordinate is obtainedWherein the content of the first and second substances,the data set is enlarged by a factor of two. Finally, dividing the data set into a training set and a testing set according to the ratio of m to n, wherein m is>n。
In this embodiment, step 2 is implemented by the following preferred scheme:
firstly, a Resnet101 residual error network is used as a feature extractor, and the Resnet101 structure is constructed to be similar to a common deep convolution neural network. An initial neural network model is constructed by adopting a traditional VGG network construction method, and a residual error structure and a jump structure are introduced, so that the problem of accuracy reduction under the condition of increasing the number of layers is solved, the number of network layers is deepened, and the extraction precision of high-dimensional features is improved. And secondly, introducing an FPN structure, performing down-sampling on the feature maps with different scales before and after the pooling layer, performing corresponding element summation operation, inputting the 3 x 3 convolution layer, and outputting. And finally, aiming at the defects that the traditional residual error network extracts features and SAR generates offset, training the constructed residual error network by utilizing an SAR image classification data set to obtain a better parameter design.
In this embodiment, step 3 is implemented by the following preferred scheme:
calculating the length-width ratio of a target in a data set by using a K-Means clustering algorithm to obtain N clustering centers as the height-width ratio of a subsequent prior anchor point frame, performing frame selection on anchor point frames with different sizes and the ratio of N clustering centers on a data set image, corresponding frame selection areas to feature maps according to the corresponding relation of an SPP-net algorithm, respectively inputting the corresponding obtained feature maps into a classification layer and a regression layer, wherein the classification layer is used for distinguishing whether the anchor point frame at present contains the target or not, the output result is a confidence coefficient S, and the regression layer is used for outputting the position coordinates of a candidate prediction frame. Defining the prediction result of IOU >0.7 as positive sample, IOU <0.3 as negative sample.
In this embodiment, step 4 is implemented by the following preferred scheme:
the ROI Align layer in the fast-RCNN network was constructed. The ROI Pooling layer in the original fast-RCNN is that feature graphs with different sizes are changed into fixed-scale feature graphs through cutting for pre-selected frames, the feature graphs are input into a subsequent classification network, if the size of an original target area is decimal after four times of down-sampling, the ROI Pooling rounds the original target area, and two round-up processes are contained in the whole network frame, so that deviation is generated between a result frame after down-sampling and an original image. The optimization utilizes an ROI Align layer to traverse each candidate region, floating point number boundaries are kept not to be quantized, the candidate regions are divided into k multiplied by k units, k is a positive integer, the boundaries of each unit are not quantized, four coordinate positions are calculated and fixed in each unit, values of the four positions are calculated by a bilinear interpolation method, then the maximum pooling operation is carried out, and the candidate regions with different sizes are mapped to fixed sizes.
An RCNN layer in the Faster-RCNN network is constructed, and the fixed-size candidate regions output by the ROI Align layer are fed into two convolutional neural networks: a classification neural network for predicting the object class against the background, and a regression neural network for outputting the position coordinates of the target frame.
The loss function of the Faster-RCNN is trained with an initial learning rate LS1. Because SAR images easily drive training into local minima, a staged learning-rate method is adopted to reduce this probability: when the number of training iterations reaches N1 and then N2, the learning rate is reduced to 0.1 × LS1 and 0.01 × LS1 respectively.
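The staged schedule can be sketched as a piecewise-constant function of the iteration count; LS1, N1 and N2 are the patent's own placeholders, not concrete values.

```python
def staged_lr(step, base_lr, n1, n2):
    """Staged learning-rate schedule from the embodiment: base LR until
    n1 iterations, then 0.1x until n2, then 0.01x afterwards."""
    if step < n1:
        return base_lr
    if step < n2:
        return 0.1 * base_lr
    return 0.01 * base_lr
```

With base_lr = 0.01, n1 = 100 and n2 = 200 (illustrative numbers only), the rate drops from 0.01 to 0.001 and finally to 0.0001.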
The loss function of the Faster-RCNN is as follows:

L = (1/N_cls1) Σ_i L_cls1(P_1i, P*_1i) + (1/N_cls2) Σ_i L_cls2(P_2i, P*_2i) + (λ/N_reg) Σ_i P*_i · L_reg(t_i, t*_i)

In the above equation, the first term of L is the RPN classification loss: N_cls1 is the number of prediction boxes output by the RPN, P_1i is the predicted probability that the i-th prediction box is foreground, P*_1i = 1 if the i-th prediction box is foreground and P*_1i = 0 otherwise, and L_cls1 is the binary cross-entropy loss function. The second term of L is the RCNN classification loss: N_cls2 is the number of prediction boxes output by the RCNN layer, P_2i is the probability predicted for each class for the i-th prediction box, P*_2i takes 1 for the actual class and 0 for the other classes, and L_cls2 is the multi-class cross-entropy loss function. The third term of L is the regression loss of the RPN and the RCNN: λ is a preset weight coefficient, N_reg is the area of the output feature map, P*_i is the probability that an object is in the i-th prediction box, t_i is the coordinate vector (x, y, w, h) of the i-th prediction box, and t*_i is the coordinate vector of the actual frame, where x and y are the horizontal and vertical coordinates of the upper-left corner of the frame and w and h are its width and height. The loss function L_reg is defined as the smooth-L1 loss:

L_reg(t_i, t*_i) = Σ_j smooth_L1(t_ij - t*_ij), where smooth_L1(d) = 0.5 d² if |d| < 1, and |d| - 0.5 otherwise.
in the embodiment, in the test process, for the result output by the fast-RCNN network, a Soft-NMS method is adopted to eliminate the overlapped frames.
The embodiments merely illustrate the technical idea of the present invention, which is not limited thereto; any modification made on the basis of the technical scheme according to the technical idea of the invention falls within the scope of the invention.
Claims (10)
1. A video SAR target detection method based on deep learning is characterized by comprising the following steps:
(1) preprocessing and dividing a video data set to obtain a training set and a test set;
(2) constructing a ResNet101 residual network as a feature extractor for extracting high-dimensional features of the SAR image; in the process of constructing the ResNet101 residual network, introducing an FPN architecture that combines feature maps of different scales before and after the pooling layers to provide multi-scale combined image features for the subsequent steps;
(3) constructing an RPN (Region Proposal Network), inputting the image features output by the ResNet101 residual network into the RPN, and outputting candidate regions;
(4) constructing a Faster-RCNN network, and inputting the RPN output into the Faster-RCNN network to obtain the video SAR target detection result.
2. The deep learning-based video SAR target detection method according to claim 1, wherein in step (1) a data set is constructed from the video: each frame of the video is read and stored in sequence, the data-set images are first annotated, the position coordinates of a target frame being (x_k, y_k, w_k, h_k), where x_k, y_k are the horizontal and vertical coordinates of the upper-left corner of the target frame and w_k, h_k are its width and height;
data enhancement is then carried out: the vertical pixels of each data-set image are kept unchanged while the horizontal positions are flipped, giving new position coordinates (x'_k, y_k, w_k, h_k), where x'_k = W - x_k - w_k and W is the image width;
and finally the data set is divided into a training set and a test set at a ratio of m:n, where m > n.
3. The deep learning-based video SAR target detection method according to claim 1, wherein in step (2) an initial neural network model is constructed using the VGG network construction method, and residual and skip structures are introduced so that the number of network layers can be deepened; an FPN structure is introduced, the feature maps of different scales before and after the pooling layers are down-sampled, the corresponding elements are summed, and the result is passed through a convolution layer and output.
4. The deep learning-based video SAR target detection method according to claim 1, wherein in step (2) the constructed residual network is trained using an SAR image classification data set.
5. The deep learning-based video SAR target detection method according to claim 1, wherein in step (3) the K-Means clustering algorithm is used to compute the height-to-width ratios of the targets in the data set, and N cluster centres are obtained as the aspect ratios of the subsequent prior anchor boxes; anchor boxes of different sizes with the N clustered ratios are placed over the data-set images, the selected regions are mapped onto the feature map according to the correspondence of the SPP-net algorithm, and the resulting feature maps are fed into a classification layer and a regression layer respectively, the classification layer distinguishing whether the current anchor box contains a target and outputting a confidence S, and the regression layer outputting the position coordinates of the candidate prediction boxes.
6. The deep learning-based video SAR target detection method according to claim 1, wherein in step (4) an ROI Align layer in the Faster-RCNN network is constructed; the ROI Align layer traverses each candidate region, keeps the floating-point boundaries unquantized, and divides each candidate region into k × k cells, k being a positive integer, whose boundaries are likewise not quantized; four fixed sampling positions are computed in each cell, their values are obtained by bilinear interpolation, and a max-pooling operation is then applied so that candidate regions of different sizes are mapped to a fixed size.
7. The deep learning-based video SAR target detection method according to claim 6, wherein in step (4) an RCNN layer in the Faster-RCNN network is constructed, and the fixed-size candidate regions output by the ROI Align layer are fed into two convolutional neural networks, one being a classification neural network for predicting the object class against the background, and the other being a regression neural network for outputting the position coordinates of the target frame.
8. The deep learning-based video SAR target detection method according to claim 1, wherein in step (4) a staged learning-rate method is adopted to train the loss function of the Faster-RCNN.
9. The deep learning-based video SAR target detection method according to claim 8, wherein the loss function L of the Faster-RCNN is as follows:

L = (1/N_cls1) Σ_i L_cls1(P_1i, P*_1i) + (1/N_cls2) Σ_i L_cls2(P_2i, P*_2i) + (λ/N_reg) Σ_i P*_i · L_reg(t_i, t*_i)

in which the first term of L is the RPN classification loss: N_cls1 is the number of prediction boxes output by the RPN, P_1i is the predicted probability that the i-th prediction box is foreground, P*_1i = 1 if the i-th prediction box is foreground and P*_1i = 0 otherwise, and L_cls1 is the binary cross-entropy loss function; the second term of L is the RCNN classification loss: N_cls2 is the number of prediction boxes output by the RCNN layer, P_2i is the probability predicted for each class for the i-th prediction box, P*_2i takes 1 for the actual class and 0 for the other classes, and L_cls2 is the multi-class cross-entropy loss function; the third term of L is the regression loss of the RPN and the RCNN: λ is a preset weight coefficient, N_reg is the area of the output feature map, P*_i is the probability that an object is in the i-th prediction box, t_i is the coordinate vector (x, y, w, h) of the i-th prediction box, and t*_i is the coordinate vector of the actual frame, where x and y are the horizontal and vertical coordinates of the upper-left corner of the frame and w and h are its width and height; and the loss function L_reg is defined as:

L_reg(t_i, t*_i) = Σ_j smooth_L1(t_ij - t*_ij), where smooth_L1(d) = 0.5 d² if |d| < 1, and |d| - 0.5 otherwise.
10. The deep learning-based video SAR target detection method according to claim 1, wherein during testing the Soft-NMS method is applied to the result output by the Faster-RCNN network to eliminate overlapping boxes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911257311.5A CN111144234A (en) | 2019-12-10 | 2019-12-10 | Video SAR target detection method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911257311.5A CN111144234A (en) | 2019-12-10 | 2019-12-10 | Video SAR target detection method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111144234A true CN111144234A (en) | 2020-05-12 |
Family
ID=70517869
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911257311.5A Pending CN111144234A (en) | 2019-12-10 | 2019-12-10 | Video SAR target detection method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111144234A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111680655A (en) * | 2020-06-15 | 2020-09-18 | 深延科技(北京)有限公司 | Video target detection method for aerial images of unmanned aerial vehicle |
CN112016594A (en) * | 2020-08-05 | 2020-12-01 | 中山大学 | Collaborative training method based on domain self-adaptation |
CN112200115A (en) * | 2020-10-21 | 2021-01-08 | 平安国际智慧城市科技股份有限公司 | Face recognition training method, recognition method, device, equipment and storage medium |
CN112686340A (en) * | 2021-03-12 | 2021-04-20 | 成都点泽智能科技有限公司 | Dense small target detection method based on deep neural network |
CN113673534A (en) * | 2021-04-22 | 2021-11-19 | 江苏大学 | RGB-D image fruit detection method based on fast RCNN |
CN113836985A (en) * | 2020-06-24 | 2021-12-24 | 富士通株式会社 | Image processing apparatus, image processing method, and computer-readable storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109584227A (en) * | 2018-11-27 | 2019-04-05 | Shandong University | Welding-spot quality detection method and implementation system based on a deep-learning target detection algorithm |
CN110110783A (en) * | 2019-04-30 | 2019-08-09 | Tianjin University | Deep-learning object detection method based on multi-layer feature map connections |
CN110321815A (en) * | 2019-06-18 | 2019-10-11 | China Jiliang University | Road crack recognition method based on deep learning |
- 2019-12-10 CN CN201911257311.5A patent/CN111144234A/en active Pending
Non-Patent Citations (1)
Title |
---|
WANG Huiling, QI Xiaolong, WU Gangshan: "Research progress of object detection techniques based on deep convolutional neural networks" * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111680655A (en) * | 2020-06-15 | 2020-09-18 | 深延科技(北京)有限公司 | Video target detection method for aerial images of unmanned aerial vehicle |
CN113836985A (en) * | 2020-06-24 | 2021-12-24 | 富士通株式会社 | Image processing apparatus, image processing method, and computer-readable storage medium |
CN112016594A (en) * | 2020-08-05 | 2020-12-01 | 中山大学 | Collaborative training method based on domain self-adaptation |
CN112016594B (en) * | 2020-08-05 | 2023-06-09 | 中山大学 | Collaborative training method based on field self-adaption |
CN112200115A (en) * | 2020-10-21 | 2021-01-08 | 平安国际智慧城市科技股份有限公司 | Face recognition training method, recognition method, device, equipment and storage medium |
CN112200115B (en) * | 2020-10-21 | 2024-04-19 | 平安国际智慧城市科技股份有限公司 | Face recognition training method, recognition method, device, equipment and storage medium |
CN112686340A (en) * | 2021-03-12 | 2021-04-20 | 成都点泽智能科技有限公司 | Dense small target detection method based on deep neural network |
CN113673534A (en) * | 2021-04-22 | 2021-11-19 | 江苏大学 | RGB-D image fruit detection method based on fast RCNN |
CN113673534B (en) * | 2021-04-22 | 2024-06-11 | 江苏大学 | RGB-D image fruit detection method based on FASTER RCNN |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111144234A (en) | Video SAR target detection method based on deep learning | |
CN110135267B (en) | Large-scene SAR image fine target detection method | |
CN109934282B (en) | SAGAN sample expansion and auxiliary information-based SAR target classification method | |
CN108596101B (en) | Remote sensing image multi-target detection method based on convolutional neural network | |
CN108647655B (en) | Low-altitude aerial image power line foreign matter detection method based on light convolutional neural network | |
CN111191566B (en) | Optical remote sensing image multi-target detection method based on pixel classification | |
CN110929607B (en) | Remote sensing identification method and system for urban building construction progress | |
CN112132093B (en) | High-resolution remote sensing image target detection method and device and computer equipment | |
CN107358260B (en) | Multispectral image classification method based on surface wave CNN | |
CN108428220B (en) | Automatic geometric correction method for ocean island reef area of remote sensing image of geostationary orbit satellite sequence | |
CN107909015A (en) | Hyperspectral image classification method based on convolutional neural networks and empty spectrum information fusion | |
CN107067405B (en) | Remote sensing image segmentation method based on scale optimization | |
CN110598600A (en) | Remote sensing image cloud detection method based on UNET neural network | |
CN110766058B (en) | Battlefield target detection method based on an optimized RPN (Region Proposal Network) | |
CN110163213B (en) | Remote sensing image segmentation method based on disparity map and multi-scale depth network model | |
CN111914924B (en) | Rapid ship target detection method, storage medium and computing equipment | |
CN110163207B (en) | Ship target positioning method based on Mask-RCNN and storage device | |
CN110728197B (en) | Single-tree-level tree species identification method based on deep learning | |
CN106295613A (en) | A kind of unmanned plane target localization method and system | |
CN111368935B (en) | SAR time-sensitive target sample amplification method based on generation countermeasure network | |
CN113850129A (en) | Target detection method for rotary equal-variation space local attention remote sensing image | |
CN113435253A (en) | Multi-source image combined urban area ground surface coverage classification method | |
CN113486819A (en) | Ship target detection method based on YOLOv4 algorithm | |
CN112464745A (en) | Ground feature identification and classification method and device based on semantic segmentation | |
CN109558803B (en) | SAR target identification method based on convolutional neural network and NP criterion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||