CN113177456B - Remote sensing target detection method based on single-stage full convolution network and multi-feature fusion - Google Patents

Remote sensing target detection method based on single-stage full convolution network and multi-feature fusion

Info

Publication number
CN113177456B
Authority
CN
China
Prior art keywords
image
target
data set
feature
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110442872.3A
Other languages
Chinese (zh)
Other versions
CN113177456A (en)
Inventor
白静
温征
唐晓川
董泽委
郭亚泽
裴晓龙
闫逊
孙放
张秀华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202110442872.3A
Publication of CN113177456A
Application granted
Publication of CN113177456B
Legal status: Active

Classifications

    • G06V 20/13: Satellite images (Scenes; Terrestrial scenes)
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/253: Fusion techniques of extracted features
    • G06N 3/08: Neural networks; Learning methods
    • G06T 5/30: Erosion or dilatation, e.g. thinning (image enhancement or restoration by local operators)
    • G06T 7/13: Edge detection (segmentation)
    • G06T 2207/10032: Satellite or aerial image; Remote sensing
    • G06T 2207/20064: Wavelet transform [DWT]
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06V 2201/07: Target detection

Abstract

The invention discloses a remote sensing target detection method based on a single-stage full convolution network and multi-feature fusion, which mainly addresses the insufficient extraction of target features from optical remote sensing images in the prior art. The scheme is implemented as follows: 1) extract the mathematical morphology features, linear scale-space features and nonlinear scale-space features of the original data respectively, and fuse the three to obtain a fused feature map; 2) divide the fused feature maps into a training data set and a test data set, and perform small-target expansion on the training data set; 3) construct a target detection network and train it on the expanded training data set with a gradient descent algorithm; 4) test the test data set with the trained network to obtain detection results. The method enhances the contour and edge features of targets, helps improve target detection accuracy, and can be used for resource exploration, natural disaster assessment and target recognition.

Description

Remote sensing target detection method based on single-stage full convolution network and multi-feature fusion
Technical Field
The invention belongs to the technical field of optical remote sensing images, and particularly relates to a remote sensing target detection method based on a single-stage full convolution network and multi-feature fusion, which can be used for resource exploration, natural disaster assessment and target identification.
Background
Remote sensing images are characterized by complex backgrounds, unbalanced target categories, large variations in target scale and unusual shooting angles, which makes target detection in remote sensing images highly difficult and challenging.
Traditional target detection methods rely mainly on hand-designed feature extraction operators, including V-J detection, HOG detection and the DPM algorithm. Because such a detector can only fit a single type of image feature with a fixed extraction algorithm, these methods suit only scenes with obvious features and simple backgrounds, and cannot meet the requirements of remote sensing image target detection.
Deep-learning-based target detection methods use convolutional networks to extract image features and can extract a variety of rich features from the same target, so their detection accuracy is far higher than that of traditional hand-designed methods. They have become the mainstream approach in industry and are widely used in remote sensing image target detection tasks.
Patent CN112580439A proposes a remote sensing image ship target detection method using a YOLOv5 network structure and the attention mechanism of SENet. The method first trains the network model effectively with a small batch of image target samples, then obtains a test model through transfer learning, improving the network's detection speed on large-format images while maintaining the accuracy and robustness of ship target detection.
Patent CN110378297A designs a multi-scale feature extraction network that extracts multi-scale image features and predicts candidate regions on the feature map of each image scale, effectively improving the accuracy of remote sensing image target detection.
Patent CN112070729A adopts an anchor-free target detection network: it first applies linear enhancement to the acquired remote sensing image data set with a balance-coefficient mixed enhancement method, then performs feature extraction and fusion with a deep residual network ResNet-50 and a feature pyramid network FPN. The invention makes full use of a contextual multi-feature fusion method, strengthens the network's feature extraction and category prediction capabilities, and improves detection precision.
However, the above methods based on deep convolutional neural networks either apply convolution directly to the original input image or preprocess the data with simple linear enhancement. Neither approach eases the difficulty of feature extraction for the deep convolutional network; in particular, for target detection against the complex backgrounds of remote sensing images, such methods cannot accurately extract the feature information of the target, which limits detection performance.
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide a remote sensing target detection method based on a single-stage full convolution network and multi-feature fusion that accurately extracts the feature information of targets and improves detection performance.
In order to achieve the purpose, the technical scheme of the invention comprises the following steps:
(1) Respectively extract the morphological features, linear scale-space features and nonlinear scale-space features of the optical remote sensing image:
1a) Perform opening and closing operations on the original image to obtain 2n initial feature maps, then add all initial feature maps pixel by pixel and take the average to obtain the morphological feature map of the original image, where n is the number of opening (or closing) operations;
1b) Filter the original image with a Gaussian filter and with the Sobel edge extraction operator, obtaining a three-channel Gaussian blur feature map and four single-channel local edge feature maps; sum the four local edge feature maps pixel by pixel and average them to obtain an overall edge feature map; fuse each channel component of the three-channel Gaussian blur feature map with the overall edge feature map pixel by pixel to obtain a linear multi-scale spatial feature map;
1c) Convert the original optical remote sensing image into a single-channel grayscale image and apply a two-dimensional single-level wavelet transform to it, obtaining four single-channel subgraphs: a low-frequency component map and horizontal, vertical and diagonal high-frequency component maps; discard the low-frequency component subgraph and concatenate the remaining three high-frequency component subgraphs along the channel dimension to obtain a nonlinear multi-scale spatial feature map;
(2) Constructing a fusion feature map:
2a) Fuse the morphological feature map and the linear multi-scale spatial feature map pixel by pixel with weights α and β, where α + β = 0.5, to obtain an initial fused image;
2b) Multiply the original image by a coefficient of 0.5, sum it pixel by pixel with the initial fused image, then add the nonlinear multi-scale spatial feature map pixel by pixel to obtain the final feature-fused image;
(3) Data set partitioning and small target expansion:
3a) For all optical remote sensing images, compute from the annotation information the maximum and minimum target areas over all targets to be detected, denoted S_max and S_min, and set a threshold S derived from S_max and S_min (the threshold formula is given as an image in the original document);
3b) Randomly split all optical remote sensing images into a training data set and a test data set at a ratio of 8:2;
3c) For each original image in the training set, traverse all targets to be detected in the image: if a target's area S_i is less than the threshold S, select a target-free position in the original image and copy the minimal square region containing the target to the selected position, obtaining a new training image; otherwise leave the original image unchanged; after traversal a new training data set is obtained;
(4) Training and detecting by using a target detection network based on deep learning:
4a) Respectively extracting and fusing the characteristics of the test data set and the new training data set according to the operations (1) and (2) to obtain a training data set and a test data set after the characteristics are fused;
4b) Training the existing single-stage full convolution target detection network by using a training data set after feature fusion through a gradient descent algorithm until the overall loss of the network is not changed any more, and obtaining a trained target detection network;
4c) And inputting the test data set into a trained target detection network to obtain a target detection result of the optical remote sensing image.
Compared with the prior art, the invention has the following advantages:
First, multi-feature extraction and fusion are applied to the original image before any convolution by the neural network, enhancing the contour and edge features of the target. As a result, the deep convolutional network is more sensitive to the target when extracting features, the extracted features are more accurate, and the accuracy of target detection improves.
Second, fusing the image morphological features with the linear and nonlinear multi-scale features enhances target saliency compared with existing simple linear data enhancement. Especially for small targets and complex background regions, target features are effectively strengthened and the background is suppressed, which improves the accuracy with which the deep convolutional neural network extracts target features and thus improves detection performance.
Drawings
FIG. 1 is a general flow chart of an implementation of the present invention;
FIG. 2 is a sub-flow diagram of the present invention for constructing morphological features of an image;
FIG. 3 shows the directional templates of the Sobel operator used by the present invention;
FIG. 4 is a sub-flow diagram of the construction of a linear scale spatial feature map according to the present invention;
FIG. 5 is a sub-flow diagram of the construction of a non-linear scale space feature map according to the present invention.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, the implementation steps of this example are as follows:
Remote sensing images contain rich spatial information and exhibit scale effects; multi-scale structure is a natural characteristic of ground-object observation in remote sensing. Analyzing at different scales yields ground-object features and spatial-relation rules at different levels, so multi-scale spatial information is crucial for accurate ground-object recognition. Deep learning methods, however, generally extract and classify remote sensing image features at a single preset scale and lack comprehensive consideration of multi-scale spatial information. More and more researchers are therefore studying how to combine the multi-scale spatial features of remote sensing images to improve comprehensive spatial feature recognition.

In addition, mathematical morphology is a classical nonlinear spatial information processing technique: it can extract meaningful shape components from the complex content of optical remote sensing images while preserving the spatial geometric structure in the images, which suits the characteristics of remote sensing ground-object classification. A large body of research shows that mathematical morphology can accurately describe ground-object contours and spatial relationships, and that computation and processing based on morphological methods can effectively extract rich spatial information from remote sensing images.

The invention therefore designs a multi-feature extraction and fusion technique based on the mathematical morphology features and multi-scale features of remote sensing images. The specific implementation steps are as follows:
step 1, extracting mathematical morphology characteristics of an image.
Mathematical morphology includes four basic operations: dilation, erosion, opening and closing. Dilation convolves an operation kernel point by point with the original image and takes the maximum pixel value under the kernel as the new pixel value at that position; erosion likewise takes the minimum. Opening is erosion followed by dilation; closing is dilation followed by erosion. To effectively remove image noise and smooth image edges, the mathematical morphology features are extracted with the opening and closing operations.
Referring to fig. 2, the specific implementation of this step is as follows:
1.1) Perform opening and closing on the original image with operation kernels of sizes 3 × 3 and 5 × 5, obtaining two opening feature maps open_3 and open_5 and two closing feature maps close_3 and close_5;
1.2) Sum the feature maps open_3, open_5, close_3 and close_5 pixel by pixel and take the average to obtain a three-channel morphological feature map I_M, which has the same resolution and dimensions as the original image.
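As a concrete illustration, the following Python sketch reproduces steps 1.1-1.2 with OpenCV; the kernel shape (square, all-ones) is an assumption, since the text only specifies the kernel sizes 3 × 3 and 5 × 5:

import cv2
import numpy as np

def morphological_feature_map(img):
    # Steps 1.1-1.2: opening and closing with 3x3 and 5x5 kernels,
    # then a pixel-wise average of the four resulting maps (I_M).
    maps = []
    for k in (3, 5):
        kernel = np.ones((k, k), np.uint8)
        maps.append(cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel))   # open_k
        maps.append(cv2.morphologyEx(img, cv2.MORPH_CLOSE, kernel))  # close_k
    # Same resolution and channel count as the input image.
    return np.mean(np.stack(maps).astype(np.float32), axis=0)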
And 2, extracting the linear multi-scale spatial features of the image.
In computer vision, multi-scale features can effectively improve the results of tasks such as image classification and target detection. How to construct multi-scale image features in a suitable way, and how to fuse and exploit them effectively, have long been key questions for researchers.
The Gaussian kernel is the only kernel that can generate a multi-scale space, and filtering an image with a Gaussian filter effectively establishes such a space. For a two-dimensional image I(x, y), the Gaussian-filtered image is:
L(x, y, δ) = G(x, y, δ) * I(x, y)
where G(x, y, δ) is the Gaussian function
G(x, y, δ) = (1 / (2πδ²)) · exp(−((x − x₀)² + (y − y₀)²) / (2δ²))
in which (x₀, y₀) is the center-point coordinate and δ is the scale parameter that determines the degree of smoothing of the transformed image: the larger δ, the smoother the filtered result.
The Sobel operator is a discrete differential operator commonly used for edge detection in image processing. In its implementation, a 3 × 3 template is used as a convolution kernel and convolved with each pixel of the image; templates of different orientations yield edge detection feature maps in different directions.
The linear multi-scale spatial feature construction used in this example extracts edge features with a Gaussian filter and the Sobel edge extraction operator respectively, where the Sobel operator uses four directional templates (0°, 45°, 90°, 135°), as shown in FIG. 3.
Referring to fig. 4, the specific implementation of this step is as follows:
2.1) Filter the image with a Gaussian filter to obtain the Gaussian blur feature map I_G;
2.2) Convert the original image to a grayscale image and extract its four directional edge features ∇_h, ∇_v, ∇_r and ∇_l with the Sobel operator, denoting the horizontal, vertical and two diagonal edge feature maps respectively;
2.3) Fuse the four extracted edge feature maps pixel by pixel to obtain the overall edge feature map I_S:
I_S = (∇_h + ∇_v + ∇_r + ∇_l) / 4
Fusing the edge feature maps of the four directions enhances the edge regions of the original picture, whose pixel values become much greater than 0, while suppressing non-edge regions, whose pixel values stay close to 0;
2.4) Fuse the overall edge feature map I_S with the Gaussian blur feature map I_G to obtain the final linear multi-scale spatial feature map I_L:
I_Li = (1 − r) · I_Gi + r · I_S,  i = 1, 2, 3
where r = 0.3 and I_Gi is the i-th channel component of I_G (the exact fusion formula is given as an image in the original; the weighted sum above is the form consistent with the surrounding description).
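The following sketch assembles steps 2.1-2.4 in Python with OpenCV. FIG. 3 is not reproduced in this text, so the four directional templates below are the standard 0°/45°/90°/135° Sobel kernels; the Gaussian kernel size, sigma and the (1 − r)/r weighting are likewise assumptions consistent with the description:

import cv2
import numpy as np

SOBEL_KERNELS = [  # assumed standard directional templates (cf. FIG. 3)
    np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], np.float32),   # 0 deg
    np.array([[-2, -1, 0], [-1, 0, 1], [0, 1, 2]], np.float32),   # 45 deg
    np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], np.float32),   # 90 deg
    np.array([[0, 1, 2], [-1, 0, 1], [-2, -1, 0]], np.float32),   # 135 deg
]

def linear_multiscale_feature_map(img, r=0.3, sigma=1.5):
    i_g = cv2.GaussianBlur(img, (5, 5), sigma).astype(np.float32)     # I_G
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float32)
    edges = [cv2.filter2D(gray, -1, k) for k in SOBEL_KERNELS]
    i_s = np.mean(edges, axis=0)                                      # I_S
    # Fuse I_S into each channel of I_G; the convex (1-r)/r split is an
    # assumption matching r = 0.3 in step 2.4.
    return (1.0 - r) * i_g + r * i_s[..., None]                       # I_L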
And 3, extracting the nonlinear multi-scale spatial features of the image.
This example employs the wavelet transform as the primary method for constructing the nonlinear multi-scale spatial features of the image, using the two-dimensional single-level wavelet transform function dwt2().
Referring to fig. 5, the specific implementation of this step is as follows:
3.1) Convert the ordinary three-channel optical image to a single-channel grayscale image;
3.2) Apply the two-dimensional single-level wavelet transform to the grayscale image of 3.1), obtaining its low-frequency component subgraph and its horizontal, vertical and diagonal high-frequency component subgraphs, each with only one quarter the resolution of the original image;
3.3) Discard the low-frequency component subgraph of 3.2), keep only the three high-frequency component subgraphs, and expand them to the resolution of the original image by bilinear interpolation;
3.4) Concatenate the three resolution-expanded high-frequency component maps along the channel dimension to obtain the nonlinear multi-scale spatial feature map I_NL.
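A minimal sketch of steps 3.1-3.4 with PyWavelets and OpenCV; the wavelet basis ('haar') is an assumption, since the text names dwt2() but not a specific basis:

import cv2
import numpy as np
import pywt

def nonlinear_multiscale_feature_map(img):
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float32)
    # Single-level 2-D DWT: discard the low-frequency sub-band, keep the
    # horizontal, vertical and diagonal high-frequency sub-bands (step 3.3).
    _ll, (lh, hl, hh) = pywt.dwt2(gray, 'haar')
    h, w = gray.shape
    ups = [cv2.resize(c, (w, h), interpolation=cv2.INTER_LINEAR)  # bilinear
           for c in (lh, hl, hh)]
    return np.stack(ups, axis=-1)  # 3-channel I_NL (step 3.4)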
And 4, constructing a fusion characteristic graph.
Weight and sum the original image I_p, the morphological feature map I_M, the linear multi-scale spatial feature map I_L and the nonlinear multi-scale spatial feature map I_NL to obtain the final fused image I:
I = 0.5 × I_p + α × I_M + β × I_L + I_NL
where α and β are two hyperparameters with different values satisfying α + β = 0.5. Because the pixel values in I_NL are very small, that term carries no weight coefficient and is added pixel by pixel directly.
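Combining the three feature extractors above, the fusion of step 4 reduces to a single weighted sum; the even split α = β = 0.25 below is an assumption, since the text only constrains α + β = 0.5 with α ≠ β left unspecified here for illustration:

def fuse_features(i_p, i_m, i_l, i_nl, alpha=0.25, beta=0.25):
    # I = 0.5*I_p + alpha*I_M + beta*I_L + I_NL, with alpha + beta = 0.5;
    # I_NL is added unweighted because its pixel values are very small.
    assert abs(alpha + beta - 0.5) < 1e-6
    return 0.5 * i_p.astype('float32') + alpha * i_m + beta * i_l + i_nl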
And 5, expanding the small target.
Small target detection has always been a difficulty in computer vision, and detection precision is generally low across existing algorithm frameworks. The method therefore expands small-target samples in the data preprocessing stage, implemented as follows:
5.1) For all samples in the data set, compute the areas of the ground-truth target boxes from the annotation information and find the maximum S_max and minimum S_min;
5.2) Set a threshold S derived from S_max and S_min (the formula is given as an image in the original) and divide all data into a training data set and a test data set at a ratio of 8:2;
5.3) For each picture in the training set, compute the annotation-box areas S_i ∈ {S_1, S_2, …, S_n} of all targets in the picture, traverse each S_i and compare it with the set threshold:
if S_i < S holds, copy the rectangular region containing the target, randomly select a new position in the image for pasting, and execute 5.4);
if S_i < S does not hold, perform no operation and move on to the next S_i;
5.4) Select a new location:
5.4.1) Randomly select a point (x, y) in the image and form the candidate annotation box [x, y, x + w_i, y + h_i] for the new location, where w_i and h_i are the width and height of the target's box;
5.4.2) Check whether the new position overlaps any existing annotation box in the image:
if it does not overlap, paste at the new position;
if it overlaps, return to 5.4.1) and record the number of retries; if 100 retries are reached, abandon the pasting operation;
5.5) Repeat 5.3) a total of 5 times so that small targets are sufficiently expanded.
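The copy-paste loop of steps 5.3-5.4 can be sketched as follows; the box representation [x1, y1, x2, y2] (integer pixel coordinates) and the overlap helper are illustrative choices, not taken from the patent:

import random

def expand_small_targets(img, boxes, s_threshold, max_tries=100):
    # boxes: list of [x1, y1, x2, y2] ground-truth annotation boxes.
    h, w = img.shape[:2]
    new_boxes = [list(b) for b in boxes]
    for x1, y1, x2, y2 in boxes:
        bw, bh = x2 - x1, y2 - y1
        if bw * bh >= s_threshold:          # only targets with S_i < S
            continue
        patch = img[y1:y2, x1:x2].copy()
        for _ in range(max_tries):          # abandon after 100 retries (5.4.2)
            nx = random.randint(0, w - bw)
            ny = random.randint(0, h - bh)
            cand = [nx, ny, nx + bw, ny + bh]
            if not any(overlaps(cand, b) for b in new_boxes):
                img[ny:ny + bh, nx:nx + bw] = patch   # paste the target
                new_boxes.append(cand)
                break
    return img, new_boxes

def overlaps(a, b):
    # Axis-aligned rectangle intersection test.
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])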
And 6, constructing a deep convolutional network for training and detection.
6.1 Data preprocessing
Apply the multi-feature fusion of steps 1-4 to all optical remote sensing images to obtain feature-fused images, divide all feature-fused images into a training data set and a test data set at a ratio of 8:2, and apply the small-target expansion of step 5 to all images in the training data set;
6.2 Construct an object detection network
The method adopts an existing single-stage full convolution target detection network as the detection framework. The network comprises a backbone network ResNet-50, a feature pyramid network FPN, a classification head Class_Head and a detection head Detection_Head; the FPN contains five feature layers P3, P4, P5, P6 and P7, and target boxes are predicted on each of the five layers.
In this example, the top layer P7 of the FPN is deleted and target-box prediction is performed only on the four feature layers P3, P4, P5 and P6, whose target-box prediction ranges are (0, 64], (64, 128], (128, 256] and (256, ∞) respectively.
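Under this modification, assigning a target to a pyramid level amounts to a range lookup; a minimal sketch (the range boundaries follow the text, the function itself is illustrative):

REGRESS_RANGES = [            # (level, lower, upper) after deleting P7
    ('P3', 0, 64),
    ('P4', 64, 128),
    ('P5', 128, 256),
    ('P6', 256, float('inf')),
]

def assign_level(target_scale):
    # Return the feature layer whose prediction range contains the target.
    for level, lo, hi in REGRESS_RANGES:
        if lo < target_scale <= hi:
            return level
    return 'P6'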
6.3 Network training
Feed the training data set after the small-target expansion of 6.1) into the single-stage full convolution target detection network constructed in 6.2) and train with a gradient descent algorithm until the network converges, obtaining the trained single-stage full convolution target detection network.
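A minimal PyTorch training loop matching 6.3; the optimizer hyperparameters (momentum, weight decay, epoch count) are assumptions, since the text only specifies gradient descent until convergence:

import torch

def train(model, loader, epochs=24, lr=0.01):
    opt = torch.optim.SGD(model.parameters(), lr=lr,
                          momentum=0.9, weight_decay=1e-4)
    for _ in range(epochs):
        for images, targets in loader:
            loss = model(images, targets)   # assumed: network returns total loss
            opt.zero_grad()
            loss.backward()                 # back-propagate the overall loss
            opt.step()                      # gradient descent update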
6.4 Network test and result evaluation
Feed the test data set of 6.1) into the network trained in 6.3) to obtain the network's detection results on all targets in the test data set.
The effect of the present invention is further explained below in combination with simulation experiments.
1. Simulation experiment conditions:
The hardware platform of the simulation experiment is: CPU Intel Xeon E5-2630 v4, 20 cores, 2.4 GHz main frequency, 64 GB memory; GPU NVIDIA GeForce GTX 1080Ti/PCIe/SSE2 with 20 GB of video memory.
The software platform of the simulation experiment is: Ubuntu 20.04 LTS operating system, CUDA 10.1, PyTorch 1.5.0 and OpenCV 4.4.0.
The data set used for the experiment was the public remote sensing image data set LEVIR.
2. Simulation experiment and results
Experiment 1: train and test the existing single-stage full convolution target detection network with the original data, and compute the mean average precision (mAP) and average recall from the test results.
Experiment 2: preprocess the original data with multi-feature fusion, train and test the existing single-stage full convolution target detection network with the preprocessed data, and compute mAP and average recall.
Experiment 3: preprocess the original data with small-target enhancement, train and test the same network with the preprocessed data, and compute mAP and average recall.
Experiment 4: preprocess the original data with both multi-feature fusion and small-target enhancement, train and test the same network with the preprocessed data, and compute mAP and average recall.
The results of the above experiments are shown in table 1.
TABLE 1. Comparison of simulation experiment results

Experimental setup    mAP     Recall
Experiment 1          90.3%   72.5%
Experiment 2          90.6%   72.9%
Experiment 3          91.1%   75.8%
Experiment 4          91.4%   76.1%
Comparing the results of Experiments 2 and 3 with Experiment 1 shows that preprocessing the data with multi-feature fusion or with small-target enhancement each effectively improves the detection performance of the existing single-stage full convolution target detection network.
Comparing the results of Experiment 4 with Experiments 2 and 3 shows that applying small-target enhancement and multi-feature fusion together yields the most pronounced improvement in the performance of the single-stage full convolution target detection network.

Claims (4)

1. A remote sensing target detection method based on single-stage full convolution network and multi-feature fusion is characterized by comprising the following steps:
(1) Respectively extract the mathematical morphology features, linear scale-space features and nonlinear scale-space features of the optical remote sensing image:
1a) Perform opening and closing operations on the original image to obtain 2n initial feature maps, then add all initial feature maps pixel by pixel and take the average to obtain the mathematical morphology feature map of the original image, where n is the number of opening (or closing) operations;
1b) Filter the original image with a Gaussian filter and with the Sobel edge extraction operator, obtaining a three-channel Gaussian blur feature map and four single-channel local edge feature maps; sum the four single-channel local edge feature maps pixel by pixel and average them to obtain an overall edge feature map; fuse each channel component of the three-channel Gaussian blur feature map with the overall edge feature map pixel by pixel to obtain a linear multi-scale spatial feature map;
1c) Convert the original optical remote sensing image into a single-channel grayscale image and apply a two-dimensional single-level wavelet transform to it, obtaining four single-channel subgraphs: a low-frequency component map and horizontal, vertical and diagonal high-frequency component maps; discard the low-frequency component subgraph and concatenate the remaining three high-frequency component subgraphs along the channel dimension to obtain a nonlinear multi-scale spatial feature map;
(2) Constructing a fusion feature map:
2a) Fuse the mathematical morphology feature map and the linear multi-scale spatial feature map pixel by pixel with weights α and β, where α + β = 0.5, to obtain an initial fused image;
2b) Multiply the original image by a coefficient of 0.5, sum it pixel by pixel with the initial fused image, then add the nonlinear multi-scale spatial feature map pixel by pixel to obtain the final feature-fused image;
(3) Data set partitioning and small target expansion:
3a) For all optical remote sensing images, compute from the annotation information the maximum and minimum target areas over all targets to be detected, denoted S_max and S_min, and set a threshold S derived from S_max and S_min (the threshold formula is given as an image in the original document);
3b) Randomly split all optical remote sensing images into a training data set and a test data set at a ratio of 8:2;
3c) For each original image in the training set, traverse all targets to be detected in the original image: if a target's area S_i is less than the threshold S, select a target-free position in the original image and copy the minimal square region containing the target to the selected position, obtaining a new training image; otherwise leave the original image unchanged; after traversal a new training data set is obtained;
the method for selecting a non-target position in an original image is realized as follows:
3c1) Randomly select a position (x, y) in the original image and form the candidate annotation box [x, y, x + w_i, y + h_i] for the new position, where w_i and h_i are the width and height of the new target box;
3c2) Check whether the new position overlaps any existing annotation box in the current image; if it does not overlap, select this position for the subsequent operation, otherwise return to 3c1);
3c3) End the position selection when a position is successfully selected or the random selection of 3c1) has been repeated 100 times;
(4) Training and detecting by using a deep learning-based target detection network:
4a) Respectively extracting and fusing the characteristics of the test data set and the new training data set according to the operations (1) and (2) to obtain a training data set and a test data set after the characteristics are fused;
4b) Training the existing single-stage full convolution target detection network by using a training data set after feature fusion through a gradient descent algorithm until the overall loss of the network is not changed any more, and obtaining a trained target detection network;
4c) And inputting the test data set into a trained target detection network to obtain a target detection result of the optical remote sensing image.
2. The method according to claim 1, wherein the opening and closing operations in 1a) are performed on the original optical remote sensing image with convolution kernels of sizes 3 × 3 and 5 × 5: erosion followed by dilation for the opening operation and dilation followed by erosion for the closing operation, obtaining two opening feature maps and two closing feature maps.
3. The method of claim 1, wherein the filtering of the original image with the Sobel edge extraction operator in 1b) convolves four convolution kernels of size 3 × 3 with the original image respectively to obtain four local edge feature maps, the directions of the four convolution kernels being 0°, 45°, 90° and 135° respectively.
4. The method of claim 1, wherein the training of the existing single-stage full convolution target detection network by the gradient descent algorithm in 4 b) is implemented as follows:
4b1) Delete the topmost feature layer P7 of the FPN in the single-stage full convolution target detection network, retaining the feature layers P3, P4, P5 and P6;
4b2) Feed the training data into the network for forward propagation and perform target-box regression on the P3, P4, P5 and P6 feature layers retained in 4b1) to obtain target prediction results, the predicted target size ranges on the four feature layers being (0, 64], (64, 128], (128, 256] and (256, ∞) respectively;
4b3) Compute the overall loss between the prediction results of 4b2) and the ground-truth labels, then back-propagate and update the network parameters;
4b4) Repeat 4b2)-4b3) until the network converges.
CN202110442872.3A 2021-04-23 2021-04-23 Remote sensing target detection method based on single-stage full convolution network and multi-feature fusion Active CN113177456B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110442872.3A CN113177456B (en) 2021-04-23 2021-04-23 Remote sensing target detection method based on single-stage full convolution network and multi-feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110442872.3A CN113177456B (en) 2021-04-23 2021-04-23 Remote sensing target detection method based on single-stage full convolution network and multi-feature fusion

Publications (2)

Publication Number Publication Date
CN113177456A (en) 2021-07-27
CN113177456B (en) 2023-04-07

Family

ID=76924464

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110442872.3A Active CN113177456B (en) 2021-04-23 2021-04-23 Remote sensing target detection method based on single-stage full convolution network and multi-feature fusion

Country Status (1)

Country Link
CN (1) CN113177456B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113610838A (en) * 2021-08-25 2021-11-05 华北电力大学(保定) Bolt defect data set expansion method
CN114155208B (en) * 2021-11-15 2022-07-08 中国科学院深圳先进技术研究院 Atrial fibrillation assessment method and device based on deep learning
CN116168302B (en) * 2023-04-25 2023-07-14 耕宇牧星(北京)空间科技有限公司 Remote sensing image rock vein extraction method based on multi-scale residual error fusion network
CN116823838B (en) * 2023-08-31 2023-11-14 武汉理工大学三亚科教创新园 Ocean ship detection method and system with Gaussian prior label distribution and characteristic decoupling


Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102629378A (en) * 2012-03-01 2012-08-08 西安电子科技大学 Remote sensing image change detection method based on multi-feature fusion
CN107092871A (en) * 2017-04-06 2017-08-25 重庆市地理信息中心 Remote sensing image building detection method based on multiple dimensioned multiple features fusion
CN107292339A (en) * 2017-06-16 2017-10-24 重庆大学 The unmanned plane low altitude remote sensing image high score Geomorphological Classification method of feature based fusion
CN108154192A (en) * 2018-01-12 2018-06-12 西安电子科技大学 High Resolution SAR terrain classification method based on multiple dimensioned convolution and Fusion Features
CN108537238A (en) * 2018-04-13 2018-09-14 崔植源 A kind of classification of remote-sensing images and search method
CN109325395A (en) * 2018-04-28 2019-02-12 二十世纪空间技术应用股份有限公司 The recognition methods of image, convolutional neural networks model training method and device
CN109214439A (en) * 2018-08-22 2019-01-15 电子科技大学 A kind of infrared image icing River detection method based on multi-feature fusion
CN109271928A (en) * 2018-09-14 2019-01-25 武汉大学 A kind of road network automatic update method based on the fusion of vector road network with the verifying of high score remote sensing image
CN112132006A (en) * 2020-09-21 2020-12-25 西南交通大学 Intelligent forest land and building extraction method for cultivated land protection
CN112395958A (en) * 2020-10-29 2021-02-23 中国地质大学(武汉) Remote sensing image small target detection method based on four-scale depth and shallow layer feature fusion
CN112329677A (en) * 2020-11-12 2021-02-05 北京环境特性研究所 Remote sensing image river target detection method and device based on feature fusion
CN112465880A (en) * 2020-11-26 2021-03-09 西安电子科技大学 Target detection method based on multi-source heterogeneous data cognitive fusion
CN112580439A (en) * 2020-12-01 2021-03-30 中国船舶重工集团公司第七0九研究所 Method and system for detecting large-format remote sensing image ship target under small sample condition

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A Fusion Framework Based on Sparse Gaussian–Wigner Prediction for Vehicle Localization Using GDOP of GPS Satellites; Vincent Havyarimana et al.; IEEE Transactions on Intelligent Transportation Systems; Feb. 2020; vol. 21, no. 2; pp. 680-689. *
Change Detection of High-Resolution Remote Sensing Images Based on Adaptive Fusion of Multiple Features; GuangHui Wang et al.; The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences; 2018; pp. 1689-1694. *
Multi-Feature Fusion for Weak Target Detection on Sea-Surface Based on FAR Controllable Deep Forest Model; Jiahuan Zhang et al.; Remote Sensing; Feb. 23, 2021; vol. 13, no. 812; pp. 1-33. *
Aircraft target detection in remote sensing images based on a multi-scale fusion feature convolutional neural network; Yao Qunli et al.; Acta Geodaetica et Cartographica Sinica; Oct. 2019; vol. 48, no. 10; pp. 1266-1274. *
River detection in remote sensing images based on multi-feature fusion and soft voting; Zhang Qingchun et al.; Acta Optica Sinica; Jun. 2018; vol. 38, no. 6; pp. 1-7. *

Also Published As

Publication number Publication date
CN113177456A (en) 2021-07-27

Similar Documents

Publication Publication Date Title
CN113177456B (en) Remote sensing target detection method based on single-stage full convolution network and multi-feature fusion
CN108961235B (en) Defective insulator identification method based on YOLOv3 network and particle filter algorithm
CN113160192B (en) Visual sense-based snow pressing vehicle appearance defect detection method and device under complex background
CN108596055B (en) Airport target detection method of high-resolution remote sensing image under complex background
CN111753828B (en) Natural scene horizontal character detection method based on deep convolutional neural network
CN103049763B (en) Context-constraint-based target identification method
CN110599537A (en) Mask R-CNN-based unmanned aerial vehicle image building area calculation method and system
CN112464911A (en) Improved YOLOv 3-tiny-based traffic sign detection and identification method
CN108564085B (en) Method for automatically reading of pointer type instrument
CN107808138B (en) Communication signal identification method based on FasterR-CNN
CN112232371B (en) American license plate recognition method based on YOLOv3 and text recognition
CN104866868A (en) Metal coin identification method based on deep neural network and apparatus thereof
CN110659601B (en) Depth full convolution network remote sensing image dense vehicle detection method based on central point
CN109635726B (en) Landslide identification method based on combination of symmetric deep network and multi-scale pooling
CN109345559B (en) Moving target tracking method based on sample expansion and depth classification network
CN112396619A (en) Small particle segmentation method based on semantic segmentation and internally complex composition
Gooda et al. Automatic detection of road cracks using EfficientNet with residual U-net-based segmentation and YOLOv5-based detection
CN117058069A (en) Automatic detection method for apparent diseases of pavement in panoramic image
CN111652287A (en) Hand-drawing cross pentagon classification method for AD (analog-to-digital) scale based on convolution depth neural network
CN114022787B (en) Machine library identification method based on large-scale remote sensing image
CN111046861B (en) Method for identifying infrared image, method for constructing identification model and application
CN113947723A (en) High-resolution remote sensing scene target detection method based on size balance FCOS
CN112465821A (en) Multi-scale pest image detection method based on boundary key point perception
CN113077484A (en) Image instance segmentation method
CN111914751A (en) Image crowd density identification and detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant