CN113177456A - Remote sensing target detection method based on single-stage full convolution network and multi-feature fusion - Google Patents
- Publication number
- CN113177456A (application CN202110442872.3A)
- Authority
- CN
- China
- Prior art keywords
- image
- target
- feature
- data set
- remote sensing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
- G06V20/13—Satellite images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/20—Image enhancement or restoration using local operators
- G06T5/30—Erosion or dilatation, e.g. thinning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/13—Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10032—Satellite or aerial image; Remote sensing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20048—Transform domain processing
- G06T2207/20064—Wavelet transform [DWT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The invention discloses an optical remote sensing image target detection method based on multi-feature fusion, which mainly solves the prior-art problem of insufficient extraction of target features from optical remote sensing images. The implementation scheme is as follows: 1) extract the mathematical morphology features, linear scale-space features and nonlinear scale-space features of the original data, and fuse the three to obtain a fused feature map; 2) divide the fused feature maps into a training data set and a test data set, and perform small-target expansion on the training data set; 3) construct a target detection network and train it on the expanded training data set with a gradient descent algorithm; 4) test the test data set with the trained network to obtain detection results. The method enhances the contour and edge features of the target, helps improve the accuracy of target detection, and can be used for resource exploration, natural disaster assessment and target identification.
Description
Technical Field
The invention belongs to the technical field of optical remote sensing images, and particularly relates to a target detection method with multi-feature fusion, which can be used for resource exploration, natural disaster assessment and target identification.
Background
Remote sensing images have complex backgrounds, unbalanced target categories, large variations in target scale, and unusual shooting angles, which make target detection in remote sensing images difficult and challenging.
Traditional target detection methods rely mainly on hand-designed feature extraction operators, such as V-J detection, HOG detection and the DPM algorithm. Because such a detector can only fit a single type of image feature with a fixed extraction algorithm, these methods work only when the target has obvious features against a simple background, and cannot meet the requirements of remote sensing image target detection.
Deep-learning-based target detection methods extract image features with convolutional networks and can capture many rich features of the same target, so their detection accuracy far exceeds that of traditional hand-designed methods; they have become the industry mainstream and are widely used in remote sensing image target detection tasks.
Patent [CN112580439A] proposes a ship target detection method for remote sensing images that combines the YOLO v5 network structure with the attention mechanism of SENet.
Patent [CN110378297A] designs a multi-scale feature extraction network that extracts multi-scale image features and predicts candidate regions on the feature map of each image scale, effectively improving the accuracy of remote sensing image target detection.
Patent [CN112070729A] adopts an anchor-free target detection network: the remote sensing image data set is first linearly enhanced with a balance-coefficient mix-up scheme, and features are then extracted and fused by the deep residual network ResNet-50 and the feature pyramid network FPN. By fully exploiting contextual multi-feature fusion, it strengthens the network's feature extraction and category prediction capabilities and improves detection precision.
However, the above deep-convolutional-network methods either apply convolution directly to the original input image or preprocess the data with a simple linear enhancement. Neither eases the difficulty of feature extraction for the deep convolutional network; in particular, for target detection against the complex backgrounds of remote sensing images, such methods cannot accurately extract the feature information of the target region, which limits detection performance.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a remote sensing target detection method based on single-stage full convolution network and multi-feature fusion so as to accurately extract the feature information of a target part and improve the detection performance.
In order to achieve the purpose, the technical scheme of the invention comprises the following steps:
(1) respectively extracting morphological characteristics, linear scale space characteristics and nonlinear scale space characteristics of the optical remote sensing image:
1a) perform opening and closing operations on the original image to obtain 2n initial feature maps, then add all initial feature maps pixel by pixel and take the average to obtain the morphological feature map of the original image, where n is the number of opening (or closing) operations performed;
1b) filtering the original image by using a Gaussian filter and a Sobel edge extraction operator respectively to obtain a three-channel Gaussian fuzzy feature map and four single-channel local edge feature maps; summing the four local edge feature graphs pixel by pixel and averaging to obtain an integral edge feature graph; performing pixel-by-pixel fusion on each channel component of the three-channel Gaussian feature map and the integral edge feature map to obtain a linear multi-scale spatial feature map;
1c) converting an original optical remote sensing image into a single-channel gray-scale image, and performing wavelet decomposition on the single-channel gray-scale image by using a two-dimensional single-level wavelet transformation function to obtain four single-channel subgraphs, namely a low-frequency component diagram, a horizontal high-frequency component diagram, a vertical high-frequency component diagram and a diagonal high-frequency component diagram; discarding the low-frequency component subgraph, and performing channel splicing on the other three high-frequency component subgraphs to obtain a nonlinear multi-scale spatial feature graph;
(2) constructing a fusion feature map:
2a) fuse the morphological feature map and the linear multi-scale spatial feature map pixel by pixel with proportions α and β to obtain an initial fused image, where α and β satisfy α + β = 0.5;
2b) multiply the original image by a scale coefficient of 0.5, sum it pixel by pixel with the initial fused image, and then add the nonlinear multi-scale spatial feature map pixel by pixel to obtain the final feature-fused image;
(3) data set partitioning and small target expansion:
3a) for all optical remote sensing images, compute the maximum and minimum areas over all targets to be detected according to the labeling information, denoted S_max and S_min, and set a threshold S accordingly;
3b) randomly divide all optical remote sensing images into a training data set and a test data set at a ratio of 8:2;
3c) for each original image in the training set, traverse all targets to be detected in the image; if the target area S_i is less than the threshold S, select a target-free position in the original image and copy the minimal square region containing the target to the selected position, obtaining a new training image; otherwise leave the original image unchanged. After the traversal, a new training data set is obtained;
(4) training and detecting by using a deep learning-based target detection network:
4a) respectively extracting and fusing the characteristics of the test data set and the new training data set according to the operations (1) and (2) to obtain a training data set and a test data set after the characteristics are fused;
4b) training the existing single-stage full convolution target detection network by using a training data set after feature fusion through a gradient descent algorithm until the overall loss of the network is not changed any more, and obtaining a trained target detection network;
4c) input the test data set into the trained target detection network to obtain the target detection results of the optical remote sensing images.
Compared with the prior art, the invention has the following advantages:
First, before any neural network convolution, multi-feature extraction and fusion are applied to the original image, which enhances the contour and edge features of the target. The deep convolutional network is therefore more sensitive to the target region when extracting features, the extracted features are more accurate, and target detection accuracy improves.
Second, fusing the image morphological features with the linear and nonlinear multi-scale features enhances target saliency compared with existing simple linear data enhancement. Especially for small targets and complex background regions, the target features are effectively strengthened and the background is suppressed, which improves the accuracy with which the deep convolutional neural network extracts target features and thus the detection performance.
Drawings
FIG. 1 is a general flow chart of an implementation of the present invention;
FIG. 2 is a sub-flow diagram of the present invention for constructing morphological features of an image;
FIG. 3 is a sobel operator directional template used by the present invention;
FIG. 4 is a sub-flow diagram of the construction of a linear scale spatial feature map according to the present invention;
FIG. 5 is a sub-flow diagram of the construction of a non-linear scale space feature map according to the present invention.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, the implementation steps of this example are as follows:
the remote sensing image contains abundant spatial information and scale effect, multi-scale is a characteristic naturally existing in the remote sensing image ground object observation, different levels of ground object features and spatial relation rules can be obtained by analyzing from different scales, the multi-scale spatial information is very key for the accurate identification of the ground object, and a deep learning method generally extracts and classifies the features of the remote sensing image from a set certain scale level and lacks the comprehensive consideration of the multi-scale spatial information. Therefore, more and more scholars are beginning to research how to combine the multi-scale spatial features of the remote sensing images to improve the spatial comprehensive feature recognition capability. In addition, the unique role of mathematical morphology in quantitative description and analysis of image geometric features makes the remote sensing image processing research quite intensive, and the mathematical morphology is a classical nonlinear spatial information processing technology and can extract meaningful shape components from complex information of an optical remote sensing image and retain spatial geometric structural characteristics in the image. Therefore, the characteristics of remote sensing ground object classification can be better met. A large number of researches show that the mathematical morphology can accurately describe the contour and the spatial relationship of the ground feature, and the abundant spatial information can be effectively extracted from the remote sensing image based on the calculation and processing of the mathematical morphology method. Therefore, the invention designs a multi-feature extraction fusion technology based on the mathematical morphological features and the multi-scale features of the remote sensing images, and the specific implementation steps are as follows:
Step 1, extract the mathematical morphological features of the image.
Mathematical morphology comprises four basic operations: dilation, erosion, opening and closing. Dilation convolves an operation kernel point by point with the original image and takes the maximum pixel value under the kernel as the new pixel value at that position; erosion likewise takes the minimum pixel value under the kernel. Opening is erosion followed by dilation; closing is dilation followed by erosion. To effectively suppress image noise and smooth image edges, the mathematical morphological features are extracted with opening and closing operations.
Referring to fig. 2, the specific implementation of this step is as follows:
1.1) performing opening operation and closing operation on an original image by using operation cores with the sizes of 3 × 3 and 5 × 5 respectively to obtain two opening operation characteristic diagrams open _3 and open _5 and two closing operation characteristic diagrams close _3 and close _5 respectively;
1.2) sum the obtained feature maps open_3, open_5, close_3 and close_5 pixel by pixel and average them to obtain a three-channel morphological feature map I_M, which has the same resolution and dimensions as the original image.
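Steps 1.1)-1.2) can be sketched in plain NumPy as follows. This is a simplified single-channel sketch under assumptions the text does not fix (square kernels, edge padding); the patent applies the same averaging to three-channel images.

```python
import numpy as np

def _window_op(img, k, op):
    # Per-pixel min/max over a k x k neighbourhood with edge padding:
    # np.min gives grayscale erosion, np.max gives grayscale dilation.
    p = k // 2
    padded = np.pad(img, p, mode="edge")
    shifts = [padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
              for dy in range(k) for dx in range(k)]
    return op(np.stack(shifts), axis=0)

def opening(img, k):   # erosion followed by dilation
    return _window_op(_window_op(img, k, np.min), k, np.max)

def closing(img, k):   # dilation followed by erosion
    return _window_op(_window_op(img, k, np.max), k, np.min)

def morphology_feature(img):
    # Average of open_3, open_5, close_3, close_5, as in step 1.2).
    maps = [opening(img, 3), opening(img, 5), closing(img, 3), closing(img, 5)]
    return np.mean(maps, axis=0)
```

The output keeps the resolution of the input, matching the requirement that I_M has the same size as the original image.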
Step 2, extract the linear multi-scale spatial features of the image.
In computer vision, multi-scale features can effectively improve results on tasks such as image classification and target detection; how to construct multi-scale image features appropriately, and how to fuse and exploit them effectively, have long been key concerns of researchers.
The Gaussian kernel is the only kernel that can generate a multi-scale space, so filtering an image with a Gaussian filter effectively builds such a space. For a two-dimensional image I(x, y), the Gaussian-filtered image is:
L(x,y,δ)=G(x,y,δ)*I(x,y)
where G(x, y, δ) is the Gaussian function

G(x, y, δ) = (1 / (2πδ²)) · exp(−((x − x₀)² + (y − y₀)²) / (2δ²))

in which (x₀, y₀) is the coordinate of the center point and δ is the scale parameter that determines the smoothness of the transformed image: the larger δ, the stronger the smoothing.
The Sobel operator is a discrete differential operator, and is commonly used for edge detection in image processing. In the implementation process of the Sobel operator, a 3 x 3 template is used as a convolution kernel to perform convolution operation with each pixel point in the image, and different direction templates are used to obtain edge detection characteristic maps in different directions.
The method for constructing linear multi-scale spatial features of images used in this example is mainly based on gaussian filter and Sobel edge extraction operator to extract edge features respectively, where the Sobel operator uses four directional templates (0 °,45 °,90 °,135 °), as shown in fig. 3.
Referring to fig. 4, the specific implementation of this step is as follows:
2.1) filter the image with a Gaussian filter to obtain the Gaussian blur feature map I_G;
2.2) convert the original image to a gray-scale image and use the Sobel operator with the four directional templates to extract four edge feature maps, representing the horizontal, vertical and two diagonal edge feature maps of the image;
2.3) fuse the four extracted edge feature maps pixel by pixel to obtain the overall edge feature map I_S.
Fusing the edge feature maps of the four directions enhances the edge parts of the original picture, whose pixel values become much greater than 0, while suppressing the non-edge parts, whose pixel values approach 0;
2.4) fuse the overall edge feature map I_S with each channel of the Gaussian blur feature map I_G to obtain the final linear multi-scale spatial feature map I_L:

I_L^(i) = I_G^(i) + r × I_S, i = 1, 2, 3

where r = 0.3 and I_G^(i) denotes the ith channel component of I_G.
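Steps 2.1)-2.4) can be sketched as a minimal NumPy illustration. The 3×3 Gaussian kernel, the grayscale conversion by channel averaging, and the additive per-channel fusion with weight r = 0.3 are assumptions, since the text does not fully specify them.

```python
import numpy as np

SOBEL = {  # directional 3x3 templates: 0 deg, 45 deg, 90 deg, 135 deg
    0:   np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float),
    45:  np.array([[-2, -1, 0], [-1, 0, 1], [0, 1, 2]], float),
    90:  np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], float),
    135: np.array([[0, -1, -2], [1, 0, -1], [2, 1, 0]], float),
}

def conv2d(img, kernel):
    # Same-size 2-D correlation with edge padding.
    p = kernel.shape[0] // 2
    padded = np.pad(img, p, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(kernel.shape[0]):
        for dx in range(kernel.shape[1]):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def linear_scale_feature(rgb, r=0.3):
    gray = rgb.mean(axis=2)                       # simple grayscale conversion
    edges = [np.abs(conv2d(gray, k)) for k in SOBEL.values()]
    edge_map = np.mean(edges, axis=0)             # overall edge feature map I_S
    gk = np.outer([1.0, 2.0, 1.0], [1.0, 2.0, 1.0]) / 16.0
    blur = np.stack([conv2d(rgb[..., c], gk) for c in range(3)], axis=-1)  # I_G
    return blur + r * edge_map[..., None]         # fuse I_S into each channel
```

Because every Sobel template sums to zero, flat regions yield an edge response near 0 while true edges are amplified, which is the suppression/enhancement behaviour described above.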
Step 3, extract the nonlinear multi-scale spatial features of the image.
This example uses the wavelet transform as the main method of constructing the nonlinear multi-scale spatial features of an image; the decomposition is performed with the two-dimensional single-level wavelet transform function dwt2().
Referring to fig. 5, the specific implementation of this step is as follows:
3.1) converting the common three-channel optical image into a single-channel gray-scale image;
3.2) carrying out wavelet decomposition on the gray level image in 3.1) by using a two-dimensional single-level wavelet transformation function to respectively obtain a low-frequency component subgraph, a horizontal high-frequency component subgraph, a vertical high-frequency component subgraph and a diagonal high-frequency component subgraph of the gray level image, wherein the resolution of each subgraph is only one fourth of that of the original image;
3.3) discard the low-frequency component subgraph from 3.2), keep only the three high-frequency component subgraphs, and expand them to the resolution of the original image by bilinear interpolation;
3.4) splice the three resolution-expanded high-frequency component maps along the channel dimension to obtain the nonlinear multi-scale spatial feature map I_NL.
Step 4, construct the fused feature map.
Weight and sum the original image I_p, the morphological feature map I_M, the linear multi-scale spatial feature map I_L and the nonlinear multi-scale spatial feature map I_NL to obtain the final fused image I:

I = 0.5 × I_p + α × I_M + β × I_L + I_NL

where α and β are two hyperparameters with different values satisfying α + β = 0.5; because the pixel values in I_NL are very small, that term is added pixel by pixel without a weight coefficient.
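The weighted sum of step 4 is direct to express in code. The equal split α = β = 0.25 below is only an illustrative choice, since the text fixes only α + β = 0.5:

```python
import numpy as np

def fuse(i_p, i_m, i_l, i_nl, alpha=0.25, beta=0.25):
    # I = 0.5*I_p + alpha*I_M + beta*I_L + I_NL with alpha + beta = 0.5;
    # I_NL is added without a weight because its pixel values are small.
    assert abs(alpha + beta - 0.5) < 1e-9
    return 0.5 * i_p + alpha * i_m + beta * i_l + i_nl
```

All four inputs must share the original image's resolution, which the earlier steps guarantee.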
Step 5, expand small targets.
Small-target detection has long been difficult in computer vision, and detection precision is generally low across existing algorithm frameworks. This method therefore expands the small-target samples in the data preprocessing stage, implemented as follows:
5.1) for all samples in the data set, compute the area of each ground-truth target box from the labeling information and find the maximum S_max and minimum S_min;
5.2) set a threshold S based on S_max and S_min, and divide all data into a training data set and a test data set at a ratio of 8:2;
5.3) for each picture in the training set, compute the label-box areas S_i ∈ (S_1, S_2, ..., S_n) of all targets, traverse each S_i, and compare it with the set threshold:
if S_i < S holds, copy the rectangular region containing the target, randomly select a new position in the image for pasting, and execute 5.4);
if S_i < S does not hold, do nothing and move on to the next S_i;
5.4) selecting a new position:
5.4.1) randomly select a point (x, y) in the image and compute the new label box [x, y, x + w_i, y + h_i], where w_i and h_i are the width and height of the target's original label box;
5.4.2) judge whether the new position overlaps any existing label box in the image:
if it does not overlap, paste at the new position;
if it overlaps, return to 5.4.1) and count the number of returns; after 100 returns, abandon the pasting operation;
5.5) repeat 5.3) a total of 5 times so that small targets are sufficiently expanded.
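The copy-paste expansion of steps 5.3)-5.4) can be sketched as below. The (x1, y1, x2, y2) box convention and the helper names are illustrative; the retry limit of 100 and the abandon-on-failure behaviour follow the text.

```python
import random
import numpy as np

def boxes_overlap(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return not (ax2 <= bx1 or bx2 <= ax1 or ay2 <= by1 or by2 <= ay1)

def expand_small_target(img, boxes, box, threshold, max_tries=100, rng=random):
    # Copy a small target (area < threshold) to a random non-overlapping
    # position, retrying up to max_tries times, as in steps 5.3)-5.4).
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    if w * h >= threshold:
        return boxes                          # not a small target: no operation
    for _ in range(max_tries):
        nx = rng.randrange(0, img.shape[1] - w)
        ny = rng.randrange(0, img.shape[0] - h)
        new = (nx, ny, nx + w, ny + h)
        if not any(boxes_overlap(new, b) for b in boxes):
            img[ny:ny + h, nx:nx + w] = img[y1:y2, x1:x2]   # paste the patch
            return boxes + [new]
    return boxes                              # abandon pasting after max_tries
```

In practice this runs over every target in every training image, and the whole pass is repeated 5 times per step 5.5).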
Step 6, construct the deep convolutional network for training and detection.
6.1) data preprocessing
Perform the multi-feature fusion of steps 1-4 on all optical remote sensing images to obtain feature-fused images, divide them into a training data set and a test data set at a ratio of 8:2, and apply the small-target expansion of step 5 to all images in the training data set;
6.2) constructing a target detection network
This example adopts an existing single-stage fully convolutional target detection network as the detection framework. It comprises a backbone network ResNet-50, a feature pyramid network FPN, a classification head Class_Head and a detection head Detection_Head, where the FPN contains five feature layers P3, P4, P5, P6 and P7, and target boxes are predicted on each of the five layers;
in this example the top layer P7 of the FPN is removed, and target boxes are predicted only on the four feature layers P3, P4, P5 and P6, whose target-box regression ranges are (0, 64], (64, 128], (128, 256] and (256, ∞) respectively.
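If the four prediction ranges are read as (0, 64], (64, 128], (128, 256] and (256, ∞), an interpretation of the figures in the text, the per-layer assignment rule (FCOS-style; the function name and range encoding are assumptions) can be sketched:

```python
def assign_fpn_level(scale,
                     ranges=((0, 64), (64, 128), (128, 256), (256, float("inf")))):
    # Map a target's regression scale to one of P3..P6 by the per-layer ranges.
    for level, (lo, hi) in enumerate(ranges, start=3):
        if lo < scale <= hi:
            return f"P{level}"
    raise ValueError("scale outside all ranges")
```

With P7 removed, any target larger than 256 pixels falls to P6 instead of a dedicated top level.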
6.3) network training
Feed the training data set with expanded small targets from 6.1) into the single-stage fully convolutional target detection network constructed in 6.2), and train with a gradient descent algorithm until the network converges, obtaining the trained single-stage fully convolutional target detection network.
6.4) network testing and result evaluation
Feed the test data set from 6.1) into the network trained in 6.3) to obtain the detection results for all targets in the test data set.
The effect of the present invention is further explained by combining the simulation experiment as follows:
firstly, simulation experiment conditions:
the hardware platform of the simulation experiment of the invention is as follows: the CPU model is Intel Xeon E5-2630 v4, 20 cores, the main frequency is 2.4GHz, and the memory size is 64 GB; the GPU is NVIDIA GeForce GTX 1080Ti/PCIe/SSE2, and the video memory size is 20 GB.
The software platform of the simulation experiment of the invention is as follows: the operating system is Ubuntu20.04 LTS, the cuda version is 10.1, and the version of Pytrch is 1.5.0. The opencv version is 4.4.0.
The data set used for the experiment was the public remote sensing image data set LEVIR.
Second, simulation experiment and results
Experiment 1: train and test the existing single-stage fully convolutional target detection network on the original data, and compute the mean average precision (mAP) and average recall from the test results.
Experiment 2: preprocess the original data with multi-feature fusion, train and test the same network on the preprocessed data, and compute mAP and average recall.
Experiment 3: preprocess the original data with small-target enhancement, train and test the same network on the preprocessed data, and compute mAP and average recall.
Experiment 4: preprocess the original data with both multi-feature fusion and small-target enhancement, train and test the same network on the preprocessed data, and compute mAP and average recall.
The results of the above experiments are shown in table 1.
TABLE 1 comparison of simulation test results
Experimental setup | mAP | Recall |
---|---|---|
Experiment one | 90.3% | 72.5% |
Experiment two | 90.6% | 72.9% |
Experiment three | 91.1% | 75.8% |
Experiment four | 91.4% | 76.1% |
Comparing experiment two and experiment three each against experiment one shows that preprocessing the data with either the multi-feature fusion mode or the small-target enhancement mode effectively improves the detection performance of the existing single-stage full-convolution target detection network.
Comparing experiment four against experiments two and three shows that the improvement is most pronounced when the small-target enhancement mode and the multi-feature fusion mode are used together for data preprocessing.
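For reference, the average recall reported in Table 1 is conventionally computed by matching each ground-truth box against the detections at an intersection-over-union (IoU) threshold; the 0.5 threshold and the helper names below are assumptions, as the text does not state them. A minimal sketch:

```python
def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def recall(gt_boxes, det_boxes, thr=0.5):
    """Fraction of ground-truth boxes matched by at least one detection."""
    matched = sum(any(iou(g, d) >= thr for d in det_boxes) for g in gt_boxes)
    return matched / float(len(gt_boxes)) if gt_boxes else 0.0
```

mAP additionally averages precision over recall levels and over classes; the recall sketch above captures only the matching rule shared by both metrics.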
Claims (5)
1. A target detection method of an optical remote sensing image based on multi-feature fusion is characterized by comprising the following steps:
(1) respectively extracting mathematical morphology characteristics, linear scale space characteristics and nonlinear scale space characteristics of the optical remote sensing image:
1a) performing opening and closing operations on the original image respectively to obtain 2n initial feature maps, then adding all the initial feature maps pixel by pixel and taking the average to obtain the mathematical morphology feature map of the original image, wherein n represents the number of opening (or closing) operations;
1b) filtering the original image with a Gaussian filter and with a Sobel edge extraction operator respectively to obtain a three-channel Gaussian blur feature map and four single-channel local edge feature maps; summing the four local edge feature maps pixel by pixel and averaging to obtain an overall edge feature map; fusing each channel component of the three-channel Gaussian feature map with the overall edge feature map pixel by pixel to obtain a linear multi-scale spatial feature map;
1c) converting the original optical remote sensing image into a single-channel gray-scale image and performing wavelet decomposition on it with a two-dimensional single-level wavelet transform to obtain four single-channel sub-maps, namely a low-frequency component map, a horizontal high-frequency component map, a vertical high-frequency component map and a diagonal high-frequency component map; discarding the low-frequency component sub-map and concatenating the remaining three high-frequency component sub-maps along the channel dimension to obtain a nonlinear multi-scale spatial feature map;
(2) constructing a fusion feature map:
2a) fusing the mathematical morphology feature map and the linear multi-scale spatial feature map pixel by pixel in proportions alpha and beta to obtain an initial fusion image, wherein alpha and beta satisfy alpha + beta = 0.5;
2b) multiplying the original image by a scale coefficient of 0.5, summing it pixel by pixel with the initial fusion image, and then adding the nonlinear multi-scale spatial feature map pixel by pixel to obtain the final feature-fusion image;
(3) data set partitioning and small target expansion:
3a) for all optical remote sensing images, calculating from the labelling information the maximum and minimum areas over all targets to be detected, denoted Smax and Smin, and setting a threshold value S;
3b) randomly dividing all the optical remote sensing images into a training data set and a test data set in a ratio of 8:2;
3c) for each original image in the training set, traversing all the targets to be detected in the image; if the target area Si is less than the threshold S, selecting a target-free position in the original image and copying the minimum square region containing the target to the selected position to obtain a new training image; otherwise, leaving the original image unchanged; after traversal is completed, a new training data set is obtained;
(4) training and detecting by using a deep learning-based target detection network:
4a) respectively extracting and fusing the characteristics of the test data set and the new training data set according to the operations (1) and (2) to obtain a training data set and a test data set after the characteristics are fused;
4b) training the existing single-stage full convolution target detection network by using a training data set after feature fusion through a gradient descent algorithm until the overall loss of the network is not changed any more, and obtaining a trained target detection network;
4c) and inputting the test data set into a trained target detection network to obtain a target detection result of the optical remote sensing image.
2. The method according to claim 1, wherein the opening and closing operations in 1a) are performed on the original optical remote sensing image with convolution kernels of sizes 3 × 3 and 5 × 5 respectively, the opening operation being erosion followed by dilation and the closing operation being dilation followed by erosion, so as to obtain two opening-operation feature maps and two closing-operation feature maps.
3. The method according to claim 1, wherein the filtering of the original image with the Sobel edge extraction operator in 1b) is performed by convolving the original image with four convolution kernels of size 3 × 3, oriented at 0°, 45°, 90° and 135° respectively, to obtain four local edge feature maps.
4. The method of claim 1, wherein said selecting a target-free position in the original image in 3c) is performed as follows:
3c1) randomly selecting a position (x, y) in the original image and forming the label-frame information [x, y, x+wi, y+hi] of the new position, wherein wi and hi respectively represent the width and height of the new target frame;
3c2) judging whether the new position overlaps an existing label frame in the current image; if not, selecting the position for the subsequent operations, otherwise returning to 3c1);
3c3) ending the position selection when a position is successfully selected or when the random selection in 3c1) has been repeated 100 times.
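The position-selection procedure of claim 4 amounts to rejection sampling with a retry cap: draw a random anchor, form the candidate box [x, y, x+w, y+h], accept it only if it overlaps no existing labelled box, and give up after 100 draws. A sketch, where `select_paste_position` and the [x1, y1, x2, y2] box layout are illustrative assumptions:

```python
import random

def boxes_overlap(a, b):
    """Axis-aligned overlap test for boxes given as [x1, y1, x2, y2]."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def select_paste_position(img_w, img_h, w, h, existing_boxes, max_tries=100):
    """Return a non-overlapping box for a w x h target, or None on failure."""
    for _ in range(max_tries):
        x = random.randint(0, img_w - w)
        y = random.randint(0, img_h - h)
        cand = [x, y, x + w, y + h]
        if not any(boxes_overlap(cand, b) for b in existing_boxes):
            return cand
    return None  # per 3c3): stop after 100 random draws
```

On success, the minimum square region containing the small target is copied to the returned box and the box is appended to the image's label list.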
5. The method of claim 1, wherein the training of the existing single-stage full-convolution target detection network by the gradient descent algorithm in 4b) is implemented as follows:
4b1) deleting the topmost feature layer P7 of the FPN in the single-stage full-convolution target detection network, and reserving the feature layers P3, P4, P5 and P6;
4b2) sending the training data into the network for forward propagation and performing target-frame regression on the P3, P4, P5 and P6 feature layers retained in 4b1) to obtain the target prediction results, wherein the target size ranges predicted by the four feature layers are [0, 64), [64, 128), [128, 256) and [256, ∞) respectively;
4b3) calculating the overall loss between the prediction results obtained in 4b2) and the real labels, then performing back propagation and updating the network parameters;
4b4) repeat 4b2) -4b3) until the network converges.
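The level assignment implied by 4b2) can be expressed as a small lookup over the breakpoints 0, 64, 128, 256, ∞. Measuring target size by the longer side of the box is an assumption here, since the claim does not fix the size measure:

```python
def assign_fpn_level(box_w, box_h):
    """Map a target box to the FPN level (P3, P4, P5 or P6 retained in
    4b1)) responsible for regressing it, by the breakpoints of 4b2)."""
    size = max(box_w, box_h)  # assumed size measure: longer box side
    bounds = [(0, 64, 'P3'), (64, 128, 'P4'),
              (128, 256, 'P5'), (256, float('inf'), 'P6')]
    for lo, hi, level in bounds:
        if lo <= size < hi:
            return level
```

Deleting P7 and narrowing the per-level ranges in this way pushes more capacity toward the smaller targets that dominate remote sensing imagery.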
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110442872.3A CN113177456B (en) | 2021-04-23 | 2021-04-23 | Remote sensing target detection method based on single-stage full convolution network and multi-feature fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113177456A true CN113177456A (en) | 2021-07-27 |
CN113177456B CN113177456B (en) | 2023-04-07 |
Family
ID=76924464
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110442872.3A Active CN113177456B (en) | 2021-04-23 | 2021-04-23 | Remote sensing target detection method based on single-stage full convolution network and multi-feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113177456B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113610838A (en) * | 2021-08-25 | 2021-11-05 | 华北电力大学(保定) | Bolt defect data set expansion method |
CN114155208A (en) * | 2021-11-15 | 2022-03-08 | 中国科学院深圳先进技术研究院 | Atrial fibrillation assessment method and device based on deep learning |
CN116168302A (en) * | 2023-04-25 | 2023-05-26 | 耕宇牧星(北京)空间科技有限公司 | Remote sensing image rock vein extraction method based on multi-scale residual error fusion network |
CN116229319A (en) * | 2023-03-01 | 2023-06-06 | 广东宜教通教育有限公司 | Multi-scale feature fusion class behavior detection method and system |
CN116453078A (en) * | 2023-03-14 | 2023-07-18 | 电子科技大学长三角研究院(湖州) | Traffic fixation target detection method based on significance priori |
CN116823838A (en) * | 2023-08-31 | 2023-09-29 | 武汉理工大学三亚科教创新园 | Ocean ship detection method and system with Gaussian prior label distribution and characteristic decoupling |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102629378A (en) * | 2012-03-01 | 2012-08-08 | 西安电子科技大学 | Remote sensing image change detection method based on multi-feature fusion |
CN107092871A (en) * | 2017-04-06 | 2017-08-25 | 重庆市地理信息中心 | Remote sensing image building detection method based on multiple dimensioned multiple features fusion |
CN107292339A (en) * | 2017-06-16 | 2017-10-24 | 重庆大学 | The unmanned plane low altitude remote sensing image high score Geomorphological Classification method of feature based fusion |
CN108154192A (en) * | 2018-01-12 | 2018-06-12 | 西安电子科技大学 | High Resolution SAR terrain classification method based on multiple dimensioned convolution and Fusion Features |
CN108537238A (en) * | 2018-04-13 | 2018-09-14 | 崔植源 | A kind of classification of remote-sensing images and search method |
CN109214439A (en) * | 2018-08-22 | 2019-01-15 | 电子科技大学 | A kind of infrared image icing River detection method based on multi-feature fusion |
CN109271928A (en) * | 2018-09-14 | 2019-01-25 | 武汉大学 | A kind of road network automatic update method based on the fusion of vector road network with the verifying of high score remote sensing image |
CN109325395A (en) * | 2018-04-28 | 2019-02-12 | 二十世纪空间技术应用股份有限公司 | The recognition methods of image, convolutional neural networks model training method and device |
CN112132006A (en) * | 2020-09-21 | 2020-12-25 | 西南交通大学 | Intelligent forest land and building extraction method for cultivated land protection |
CN112329677A (en) * | 2020-11-12 | 2021-02-05 | 北京环境特性研究所 | Remote sensing image river target detection method and device based on feature fusion |
CN112395958A (en) * | 2020-10-29 | 2021-02-23 | 中国地质大学(武汉) | Remote sensing image small target detection method based on four-scale depth and shallow layer feature fusion |
CN112465880A (en) * | 2020-11-26 | 2021-03-09 | 西安电子科技大学 | Target detection method based on multi-source heterogeneous data cognitive fusion |
CN112580439A (en) * | 2020-12-01 | 2021-03-30 | 中国船舶重工集团公司第七0九研究所 | Method and system for detecting large-format remote sensing image ship target under small sample condition |
Non-Patent Citations (5)
Title |
---|
GUANGHUI WANG 等: "CHANGE DETECTION OF HIGH-RESOLUTION REMOTE SENSING IMAGES BASED ON ADAPTIVE FUSION OF MULTIPLE FEATURES", 《THE INTERNATIONAL ARCHIVES OF THE PHOTOGRAMMETRY, REMOTE SENSING AND SPATIAL INFORMATION SCIENCES》 * |
JIAHUAN ZHANG 等: "Multi-Feature Fusion for Weak Target Detection on Sea-Surface Based on FAR Controllable Deep Forest Model", 《REMOTE SENSING》 * |
VINCENT HAVYARIMANA 等: "A Fusion Framework Based on Sparse Gaussian–Wigner Prediction for Vehicle Localization Using GDOP of GPS Satellites", 《IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS》 * |
姚群力 等: "基于多尺度融合特征卷积神经网络的遥感图像飞机目标检测", 《测绘学报》 * |
张庆春 等: "基于多特征融合和软投票的遥感图像河流检测", 《光学学报》 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113610838A (en) * | 2021-08-25 | 2021-11-05 | 华北电力大学(保定) | Bolt defect data set expansion method |
CN114155208A (en) * | 2021-11-15 | 2022-03-08 | 中国科学院深圳先进技术研究院 | Atrial fibrillation assessment method and device based on deep learning |
CN116229319A (en) * | 2023-03-01 | 2023-06-06 | 广东宜教通教育有限公司 | Multi-scale feature fusion class behavior detection method and system |
CN116453078A (en) * | 2023-03-14 | 2023-07-18 | 电子科技大学长三角研究院(湖州) | Traffic fixation target detection method based on significance priori |
CN116168302A (en) * | 2023-04-25 | 2023-05-26 | 耕宇牧星(北京)空间科技有限公司 | Remote sensing image rock vein extraction method based on multi-scale residual error fusion network |
CN116168302B (en) * | 2023-04-25 | 2023-07-14 | 耕宇牧星(北京)空间科技有限公司 | Remote sensing image rock vein extraction method based on multi-scale residual error fusion network |
CN116823838A (en) * | 2023-08-31 | 2023-09-29 | 武汉理工大学三亚科教创新园 | Ocean ship detection method and system with Gaussian prior label distribution and characteristic decoupling |
CN116823838B (en) * | 2023-08-31 | 2023-11-14 | 武汉理工大学三亚科教创新园 | Ocean ship detection method and system with Gaussian prior label distribution and characteristic decoupling |
Also Published As
Publication number | Publication date |
---|---|
CN113177456B (en) | 2023-04-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113177456B (en) | Remote sensing target detection method based on single-stage full convolution network and multi-feature fusion | |
CN108961235B (en) | Defective insulator identification method based on YOLOv3 network and particle filter algorithm | |
CN108596055B (en) | Airport target detection method of high-resolution remote sensing image under complex background | |
CN111753828B (en) | Natural scene horizontal character detection method based on deep convolutional neural network | |
WO2016155371A1 (en) | Method and device for recognizing traffic signs | |
CN109685152A (en) | A kind of image object detection method based on DC-SPP-YOLO | |
CN110675370A (en) | Welding simulator virtual weld defect detection method based on deep learning | |
CN108564085B (en) | Method for automatically reading of pointer type instrument | |
CN107506761A (en) | Brain image dividing method and system based on notable inquiry learning convolutional neural networks | |
CN111950488B (en) | Improved Faster-RCNN remote sensing image target detection method | |
CN110659601B (en) | Depth full convolution network remote sensing image dense vehicle detection method based on central point | |
CN112396619A (en) | Small particle segmentation method based on semantic segmentation and internally complex composition | |
CN109345559B (en) | Moving target tracking method based on sample expansion and depth classification network | |
CN116758421A (en) | Remote sensing image directed target detection method based on weak supervised learning | |
CN111612747A (en) | Method and system for rapidly detecting surface cracks of product | |
CN116740528A (en) | Shadow feature-based side-scan sonar image target detection method and system | |
CN113313678A (en) | Automatic sperm morphology analysis method based on multi-scale feature fusion | |
CN113516771A (en) | Building change feature extraction method based on live-action three-dimensional model | |
CN114445356A (en) | Multi-resolution-based full-field pathological section image tumor rapid positioning method | |
CN116012310A (en) | Cross-sea bridge pier surface crack detection method based on linear residual error attention | |
CN107292268A (en) | The SAR image semantic segmentation method of quick ridge ripple deconvolution Structure learning model | |
CN112465821A (en) | Multi-scale pest image detection method based on boundary key point perception | |
Gooda et al. | Automatic detection of road cracks using EfficientNet with residual U-net-based segmentation and YOLOv5-based detection | |
CN112329677A (en) | Remote sensing image river target detection method and device based on feature fusion | |
CN116597275A (en) | High-speed moving target recognition method based on data enhancement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||