CN112016512A - Remote sensing image small target detection method based on feedback type multi-scale training - Google Patents
Remote sensing image small target detection method based on feedback type multi-scale training
- Publication number
- CN112016512A (application CN202010934966.8A)
- Authority
- CN
- China
- Prior art keywords
- module
- small target
- remote sensing
- scale
- type multi
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V20/20—Scenes; Scene-specific elements in augmented reality scenes
- G06F18/24—Classification techniques
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- G06T3/4038—Image mosaicing, e.g. composing plane images from plane sub-images
- G06V2201/07—Target detection
Abstract
The invention discloses a remote sensing image small target detection method based on feedback type multi-scale training, which comprises the following steps: constructing a feedback type multi-scale convolutional neural network formed by a detection module and a feedback multi-scale training module, inputting original image data and training it in an end-to-end mode; the feedback multi-scale training module calculates a proportion value of the small target according to the loss of the current iteration output by the detection module; the calculated proportion value of the small target is compared with a preset threshold, and when the proportion value is smaller than the preset threshold, the spliced image data is used as the input of the next iteration, otherwise the original image data is used as the input; finally, the trained feedback type multi-scale convolutional neural network is obtained, the remote sensing image to be detected is input, and the recognition result is output. The method enhances the detection capability for small targets in remote sensing images, suppresses overfitting and class imbalance, and achieves better accuracy and robustness for small target detection in remote sensing images.
Description
Technical Field
The invention relates to the technical field of remote sensing image target detection, in particular to remote sensing image target detection using a neural network model, and specifically to a remote sensing image small target detection method based on feedback type multi-scale training.
Background
The remote sensing image target detection problem has long been an important research direction in the fields of computer vision and pattern recognition. Although many deep-learning-based target detection algorithms have appeared in recent years and have greatly improved target detection, the detection of small targets in remote sensing images still leaves much room for improvement.
Complex backgrounds in remote sensing images may introduce more false positives and degrade model performance. At the same time, the detection performance for small objects (less than 32 × 32 pixels) in remote sensing images remains the least satisfactory.
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide a remote sensing image small target detection method based on feedback type multi-scale training.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
a remote sensing image small target detection method based on feedback type multi-scale training is characterized by comprising the following steps:
step 1, constructing a feedback type multi-scale convolution neural network formed by a detection module and a feedback multi-scale training module, initializing the network weights by using a pre-training model, inputting original image data and training in an end-to-end mode;
step 2, the feedback multi-scale training module calculates a proportion value of the small target according to the loss of the current iteration process output by the detection module;
step 3, the feedback multi-scale training module compares the calculated proportion value of the small target with a preset threshold; when the proportion value is smaller than the preset threshold, the spliced image data is used as the input of the next iteration, otherwise the original image data is used as the input;
and step 4, obtaining the trained feedback type multi-scale convolution neural network, inputting the remote sensing image to be detected, and outputting the recognition result.
Further, the detection module is built by the following steps:
step 11, constructing a feature extraction network, wherein the feature extraction network is used for extracting high-level semantic features and low-level semantic features of an input image, normalizing the size through upsampling, and obtaining an added feature map through average pooling;
step 12, constructing a prior frame acquisition module, wherein the prior frame acquisition module is used for extracting a feature map output by a network according to features and generating a prior frame by adopting a prior frame strategy;
step 13, constructing a classification task module, wherein the classification task module is used for classifying the target classes of the prior frames;
step 14, constructing a regression task module, wherein the regression task module is used for positioning the target frame;
and step 15, constructing a network optimization module, wherein the network optimization module is used for optimizing the network by utilizing the multitask loss function.
Further, the feature extraction network is built based on a residual error structure and comprises a first residual error module, a second residual error module and a third residual error module, the first residual error module, the second residual error module and the third residual error module successively perform down-sampling feature extraction on an input image, and feature extraction results output by the first residual error module, the second residual error module and the third residual error module are subjected to feature fusion to obtain the feature map.
Further, the first residual module, the second residual module, and the third residual module each include three convolution groups, a maximum pooling layer, and a convolutional layer, where a sliding window of the maximum pooling layer is 3 × 3, a step size is 2, a convolutional kernel size of the convolutional layer is 1 × 1, and a channel number is 128.
Further, the specific process of the prior frame acquisition module in step 12 using the prior frame strategy to generate the prior frame is as follows:
a1, acquiring central points of a feature map output by a feature extraction network;
step A2, setting the size and the length-width ratio of each prior frame based on the small target size;
and A3, obtaining a prior frame of the feature map according to a prior frame strategy.
Further, the functional formula of the multitask loss function is:
L(x, c, p, g) = (1/N)[L_conf(x, c) + α·L_loc(x, p, g)],
wherein L(x, c, p, g) is the multi-task loss; N is the number of positive samples of the prior frame; L_conf(x, c) is the confidence error loss, x ∈ {1, 0} is an indication parameter, and c is the predicted class confidence; L_loc(x, p, g) is the position error loss, p is the predicted position of the bounding box corresponding to the prior frame, and g is the position parameter of the real label; α is a weight coefficient.
Further, the calculation formula of the proportion value of the small target is:
r_s = L_t^s / L_t,
wherein r_s is the proportion value of the small target, L_t is the position error L_loc(x, p, g), and L_t^s is the position error of small targets whose area is not greater than A_s.
Further, the step of acquiring the stitched image data is as follows:
step 31, storing the original image data in a cache form;
step 32, obtaining four images in the current iteration image, scaling the size of each image to 1/2 of the original size through size conversion, and splicing the four scaled images into one image;
step 33: acquiring a data label of a current iteration image, performing same scale transformation and scaling on the data labels corresponding to the four scaled images, and storing the data labels into a new label;
and step 34, taking the spliced image after the scale transformation and the new label corresponding to the spliced image as spliced image data.
The invention has the following remarkable effects:
1. the invention designs a feedback multi-scale training module which carries out feedback correction by using the loss of network output and determines the image sequence to be input in the next iteration according to the small target proportion value of the current iteration, thereby not only enhancing the detection capability of small targets in a remote sensing image, but also inhibiting the overfitting phenomenon and the class imbalance phenomenon, greatly improving the accuracy of a target detection result and having better effect and robustness on the detection of the small targets in the remote sensing image;
2. the traditional multi-scale training method is generally time-consuming or memory-intensive and is typically used only on common images, whereas the method of the invention can be applied to remote sensing images with almost no extra memory or time consumption, making it a fast multi-scale training mode;
3. the method uses a feature extraction network to extract features of an input image, so that an obtained feature map contains feature information of high-level semantics and low-level semantics; and then, a great number of prior frames are generated by utilizing a prior frame strategy to find out a sensitive target area, so that the target recall rate is greatly increased.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a schematic structural diagram of the feedback-type multi-scale convolutional neural network;
FIG. 3 is a block diagram of the feature extraction network;
fig. 4 is a schematic diagram of the conversion of the stitched image and the original image.
Detailed Description
The following provides a more detailed description of the embodiments and the operation of the present invention with reference to the accompanying drawings.
As shown in fig. 1, the remote sensing image small target detection method based on feedback type multi-scale training specifically comprises the following steps:
Step 1, a feedback type multi-scale convolution neural network formed by a detection module and a feedback multi-scale training module is constructed, the network weights are initialized with a pre-training model, and original image data are input and trained in an end-to-end mode.
The detection module is built by the following steps:
step 11, constructing a feature extraction network shown in fig. 3, wherein the feature extraction network is used for extracting high-level semantic features and low-level semantic features of an input image, normalizing the size through upsampling, and obtaining an added feature map through average pooling;
the feature extraction network is built based on a residual error structure and comprises a first residual error module, a second residual error module and a third residual error module, the first residual error module, the second residual error module and the third residual error module successively carry out down-sampling feature extraction on an input image, and feature extraction results output by the first residual error module, the second residual error module and the third residual error module are subjected to feature fusion to obtain the feature map.
As shown in fig. 3, the first Residual block (Residual-1) comprises three convolution groups, each forming a residual structure through a shortcut connection (linear addition). Each convolution group consists of two [3 × 3, 32] convolution layers, where 3 × 3 is the convolution kernel size and 32 is the number of channels. The third convolution group is followed by a max pooling layer (3 × 3 sliding window, stride 2) and a [1 × 1, 64] convolution layer, which change the feature map size and the number of channels and increase the nonlinearity of the network; the feature map output at this layer is used for feature fusion.
The second Residual block (Residual-2) contains three convolution groups, each consisting of two [3 × 3, 64] convolution layers. The third convolution group is followed by a max pooling layer (3 × 3 sliding window, stride 2) and a [1 × 1, 128] convolution layer, and the feature map output at this layer is used for feature fusion.
The third Residual block (Residual-3) contains three convolution groups, each consisting of two [3 × 3, 128] convolution layers. The third convolution group is followed by a max pooling layer (3 × 3 sliding window, stride 2) and a [1 × 1, 128] convolution layer (the number of channels is unchanged at this layer), and the feature map output at this layer is used for feature fusion.
the feature maps output by each residual module are subjected to upsampling normalization size, and then are added by using average pooling (a sliding window is 3 multiplied by 3, and the step length is 2), so that the obtained feature maps contain high-level semantic information (containing stronger object feature information but weaker object position information) and low-level semantic information (containing stronger object position information but unobvious object feature information), namely, the fused feature maps contain stronger object position information and object feature information.
Step 12, constructing a prior frame acquisition module, wherein the prior frame acquisition module is used for extracting a feature map output by a network according to features and generating a prior frame by adopting a prior frame strategy;
a1, acquiring central points of a feature map output by a feature extraction network;
that is, an n × n feature map is obtained through the feature extraction network, corresponding to n × n central points, and prior frames are generated at each central point;
step A2, setting the size and the length-width ratio of each prior frame based on the small target size;
that is, the size of each prior frame is set to (8, 16, 32) and the aspect ratio to (0.5, 1, 2). The sizes and aspect ratios can be chosen according to the target sizes of a specific data set: since small targets are generally smaller than 32 × 32 pixels, sizes of (8, 16, 32) with aspect ratios of (0.5, 1, 2) ensure that the real label of each small target corresponds to one of the prior frames. The number of sizes or aspect ratios can also be set to 4 or 5, but too many parameters seriously slow down training and inference, so the default of 3 is used.
And A3, obtaining a prior frame of the feature map according to a prior frame strategy.
Therefore, the center point of each feature-map cell generates 9 prior frames (3 sizes × 3 aspect ratios); an n × n feature map thus yields 9 × n² prior frames, which are then used for the training tasks of the classification branch and the regression branch.
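The prior-frame strategy of steps A1-A3 can be sketched as follows. The mapping of feature-map cells back to image pixels via a `stride` parameter is an assumption, since the patent only fixes the sizes (8, 16, 32) and aspect ratios (0.5, 1, 2):

```python
import itertools

def prior_boxes(n, sizes=(8, 16, 32), ratios=(0.5, 1.0, 2.0), stride=8):
    """Generate 9 prior frames (3 sizes x 3 aspect ratios) per centre point
    of an n x n feature map. Boxes are (cx, cy, w, h) in pixel units;
    `stride` (an assumed value) maps feature cells back to image coordinates."""
    boxes = []
    for i, j in itertools.product(range(n), repeat=2):
        cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
        for s, r in itertools.product(sizes, ratios):
            w = s * r ** 0.5   # stretch by ratio r while preserving area s*s
            h = s / r ** 0.5
            boxes.append((cx, cy, w, h))
    return boxes

boxes = prior_boxes(4)  # 9 * 4^2 = 144 prior frames
```

Each centre contributes exactly 9 boxes, so an n × n map produces 9n² prior frames, matching the count stated above.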
Step 13, constructing a classification task module, wherein the classification task module is used for classifying the target classes of the prior frames;
step 14, constructing a regression task module, wherein the regression task module is used for positioning the target frame;
and step 15, constructing a network optimization module, wherein the network optimization module is used for optimizing the network by utilizing the multitask loss function.
The functional formula of the multitask loss function is:
L(x, c, p, g) = (1/N)[L_conf(x, c) + α·L_loc(x, p, g)],
wherein L(x, c, p, g) is the multi-task loss; N is the number of positive samples of the prior frame; L_conf(x, c) is the confidence error loss, x ∈ {1, 0} is an indication parameter, and c is the predicted class confidence; L_loc(x, p, g) is the position error loss, p is the predicted position of the bounding box corresponding to the prior frame, and g is the position parameter of the real label; α is a weight coefficient.
The classification task produces a confidence error loss L_conf(x, c), calculated as:
L_conf(x, c) = −log[x·c + (1 − x)(1 − c)],
the regression task will generate a position error Lloc(x, p, g), the calculation formula is as follows:
Lloc(x,p,g)=x*R(p-g),
where R is the Smooth L1 function:
R(z) = 0.5z² if |z| < 1, and R(z) = |z| − 0.5 otherwise.
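The confidence loss, the Smooth L1 penalty, and the combined multi-task loss can be sketched per sample as follows (scalar form; the standard Smooth L1 break point at |z| = 1 is assumed, as the patent does not state it):

```python
import math

def smooth_l1(z):
    """Smooth L1 penalty R(z): 0.5*z^2 for |z| < 1, |z| - 0.5 otherwise."""
    return 0.5 * z * z if abs(z) < 1 else abs(z) - 0.5

def conf_loss(x, c):
    """Confidence error loss -log[x*c + (1-x)(1-c)] from the description."""
    return -math.log(x * c + (1 - x) * (1 - c))

def loc_loss(x, p, g):
    """Position error x * R(p - g), summed over box coordinates."""
    return x * sum(smooth_l1(pi - gi) for pi, gi in zip(p, g))

def multitask_loss(samples, n_pos, alpha=1.0):
    """L = (1/N) * (L_conf + alpha * L_loc) over (x, c, p, g) tuples."""
    lc = sum(conf_loss(x, c) for x, c, _, _ in samples)
    ll = sum(loc_loss(x, p, g) for x, _, p, g in samples)
    return (lc + alpha * ll) / n_pos

# One positive sample: predicted confidence 0.5, box off by 0.5 in one coordinate.
example = multitask_loss([(1, 0.5, (1.0,), (0.5,))], n_pos=1)
```

With α = 1 the example evaluates to −log(0.5) + 0.5·(0.5)², showing how the two branches are summed and normalized by N.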
in order to enhance the detection capability of the small target, the invention determines the input of the next iteration according to the feedback value of the current traversal, and the feedback of the current traversal is composed of the loss proportion of the small target.
the calculation process of the proportional value of the small target is as follows:
a) Define the object area. Strictly speaking, the area of an object is determined by its mask region, but masks are only available in segmentation tasks. For object detection, the object area is therefore approximated by the box area, as shown in equation (5):
a₀ ≈ h₀ × w₀, (5)
wherein h₀ and w₀ respectively represent the height and width of the object box.
b) Define the small-target loss. The small-target loss L_t^s refers to the position loss produced by targets whose area is not greater than A_s (1024 in the COCO protocol). The proportion formula of the small target is:
r_s = L_t^s / L_t,
wherein L_t is the position error L_loc(x, p, g), L_t^s is the position error of targets whose area is not greater than A_s, r_s is the proportion value of the small target, and 0 ≤ r_s ≤ 1.
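The small-target proportion and the feedback decision it drives (step 3) can be sketched as below; the concrete threshold value is an assumption, since the patent leaves it as a preset parameter:

```python
def small_target_ratio(losses, areas, a_s=1024):
    """Ratio r_s = (position loss of targets with area <= A_s) / (total position loss).
    `losses` are per-target position errors L_loc; `areas` are the h0*w0 box areas."""
    total = sum(losses)
    small = sum(l for l, a in zip(losses, areas) if a <= a_s)
    return small / total if total > 0 else 0.0

def next_iteration_input(ratio, threshold=0.5):
    """Feedback rule: use spliced (mosaic) data when small targets contribute too
    little of the loss, otherwise the original images. Threshold is assumed."""
    return "stitched" if ratio < threshold else "original"
```

For example, if a 30 × 30 target contributes 1.0 of a total position loss of 4.0, the ratio is 0.25 and the next iteration would train on spliced data.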
The image data splicing operation uses every four images loaded in the current memory and changes the sizes of the images and the corresponding labels: if each original image has size (h, w), each constituent image is scaled to (h/2, w/2) and the spliced image again has size (h, w). The method comprises the following steps:
step 31, storing the original image data in a cached form, such as a numpy .npy file or a binary file, to improve the loading speed;
step 32, firstly unifying the scales of the four images to (h, w), then scaling each image to 1/2 of its original size so that each scaled image has size (h/2, w/2), and splicing the four scaled images into one image, the spliced image having size (h, w);
step 33: acquiring the data labels (target class and real target frame coordinates) of the current batch, firstly carrying out the same scale transformation and scaling on the labels corresponding to the four loaded images (the real target frame coordinates are transformed while the target class is unchanged), and then storing the labels of the four images into a new label, so that the generated spliced image contains the labels of the original four images;
and step 34, taking the spliced image after the scale transformation and the new label corresponding to the spliced image as spliced image data.
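Steps 31-34 can be sketched as follows. Naive 2x subsampling stands in for the unspecified resizing method, and labels are assumed to be (class, x1, y1, x2, y2) boxes in pixel coordinates:

```python
import numpy as np

def stitch_four(images, labels):
    """Mosaic four equally sized (h, w) images into one (h, w) spliced image.
    Each image is downscaled to (h/2, w/2) (naive 2x subsampling here) and the
    boxes get the same 1/2 scaling plus the tile offset, merged into one label."""
    h, w = images[0].shape[:2]
    small = [img[::2, ::2] for img in images]          # each becomes (h/2, w/2)
    offsets = [(0, 0), (0, w // 2), (h // 2, 0), (h // 2, w // 2)]
    mosaic = np.zeros_like(images[0])
    new_labels = []
    for img, (oy, ox), boxes in zip(small, offsets, labels):
        mosaic[oy:oy + h // 2, ox:ox + w // 2] = img
        for cls, x1, y1, x2, y2 in boxes:              # class unchanged, coords transformed
            new_labels.append((cls, x1 / 2 + ox, y1 / 2 + oy,
                               x2 / 2 + ox, y2 / 2 + oy))
    return mosaic, new_labels

# Four 4x4 toy images filled with 0, 1, 2, 3; one box on the first image.
imgs = [np.full((4, 4), i) for i in range(4)]
mosaic, new_labels = stitch_four(imgs, [[(0, 0, 0, 4, 4)], [], [], []])
```

The spliced image keeps the original (h, w) size while every target shrinks by half, which is what lets the network see more small-scale targets per iteration.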
And 4, obtaining the trained feedback type multi-scale convolution neural network, inputting the remote sensing image to be detected, and outputting a recognition result.
The embodiment provides a method specially aiming at small target detection of remote sensing images, which is implemented by building a feedback type multi-scale convolution neural network (FBMS-Net) composed of a detection module and a feedback multi-scale training module, wherein the detection module comprises a feature extraction network, a priori frame strategy, a classification branch and a target frame regression branch. Firstly, extracting the characteristics of high-level semantics and low-level semantics by adopting a characteristic extraction network; using a priori frame strategy to carry out region proposal and generate a priori frame; performing a classification branch task and a target frame regression branch task in parallel; initializing a feedback multi-scale module to enable the feedback multi-scale module to be in a feedback receiving state (namely, training is carried out by adopting original picture data during initialization training); and finally, outputting the classification loss and the target frame loss by the detection module, wherein the loss is used as the input of the feedback multi-scale training module, and judging whether to activate the feedback multi-scale training module. If the module is in an activated state, the spliced image data is adopted for training; otherwise, the original image data is used for training. By the method, the small target detection in the remote sensing image has better effect and robustness.
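The overall feedback control flow of this embodiment can be sketched as a training-schedule loop; `losses_fn` is a hypothetical stand-in for the detection module, returning a (small-target loss, total loss) pair per batch:

```python
def feedback_training(original_batches, stitched_batches, losses_fn, threshold=0.5):
    """Control flow of the feedback multi-scale loop: each iteration trains on
    either original or spliced data, decided by the previous small-target ratio.
    Starts in the feedback-receiving state (original data), as in the patent."""
    use_stitched = False
    schedule = []
    for orig, stitched in zip(original_batches, stitched_batches):
        batch = stitched if use_stitched else orig
        schedule.append("stitched" if use_stitched else "original")
        small, total = losses_fn(batch)
        ratio = small / total if total else 0.0
        use_stitched = ratio < threshold   # activate module when small targets are underrepresented
    return schedule

# If small targets always contribute 10% of the loss, every iteration after the
# first switches to spliced data.
schedule = feedback_training([1, 2, 3], [4, 5, 6], lambda b: (0.1, 1.0))
```

The design choice here mirrors the description: the feedback module only inspects the loss already computed by the detection module, so the switch costs essentially no extra time or memory.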
The technical solution provided by the present invention is described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.
Claims (8)
1. A remote sensing image small target detection method based on feedback type multi-scale training is characterized by comprising the following steps:
step 1, constructing a feedback type multi-scale convolution neural network formed by a detection module and a feedback multi-scale training module, initializing a network weight by using a pre-training model, inputting original image data and training the original image data in an end-to-end mode;
step 2, the feedback multi-scale training module calculates a proportion value of a small target according to the loss of the current iteration process output by the detection module;
step 3, the feedback multi-scale training module compares the calculated ratio value of the small target with a preset threshold, and when the ratio value is smaller than the preset threshold, the spliced image data is used as the input of the next iteration, otherwise, the original image data is used as the input;
and 4, obtaining the trained feedback type multi-scale convolution neural network, inputting the remote sensing image to be detected, and outputting a recognition result.
2. The remote sensing image small target detection method based on feedback type multi-scale training of claim 1, characterized in that: the detection module is built by the following steps:
step 11, constructing a feature extraction network, wherein the feature extraction network is used for extracting high-level semantic features and low-level semantic features of an input image, normalizing the size through upsampling, and obtaining an added feature map through average pooling;
step 12, constructing a prior frame acquisition module, wherein the prior frame acquisition module is used for extracting a feature map output by a network according to features and generating a prior frame by adopting a prior frame strategy;
step 13, constructing a classification task module, wherein the classification task module is used for classifying the target classes of the prior frames;
step 14, constructing a regression task module, wherein the regression task module is used for positioning the target frame;
and step 15, constructing a network optimization module, wherein the network optimization module is used for optimizing the network by utilizing the multitask loss function.
3. The remote sensing image small target detection method based on feedback type multi-scale training of claim 2, characterized in that: the feature extraction network is built based on a residual error structure and comprises a first residual error module, a second residual error module and a third residual error module, the first residual error module, the second residual error module and the third residual error module successively carry out down-sampling feature extraction on an input image, and feature extraction results output by the first residual error module, the second residual error module and the third residual error module are subjected to feature fusion to obtain the feature map.
4. The remote sensing image small target detection method based on feedback type multi-scale training of claim 3, characterized in that: the first residual module, the second residual module and the third residual module respectively comprise three convolution groups, a maximum pooling layer and a convolution layer, wherein the sliding window of the maximum pooling layer is 3 multiplied by 3, the step length is 2, the convolution kernel size of the convolution layer is 1 multiplied by 1, and the number of channels is 128.
5. The remote sensing image small target detection method based on feedback type multi-scale training of claim 2, characterized in that: in step 12, the specific process of the prior frame obtaining module generating the prior frame by using the prior frame strategy is as follows:
a1, acquiring central points of a feature map output by a feature extraction network;
step A2, setting the size and the length-width ratio of each prior frame based on the small target size;
and A3, obtaining a prior frame of the feature map according to a prior frame strategy.
6. The remote sensing image small target detection method based on feedback type multi-scale training of claim 2, characterized in that: the functional formula of the multitask loss function is:
L(x, c, p, g) = (1/N)[L_conf(x, c) + α·L_loc(x, p, g)],
wherein L(x, c, p, g) is the multi-task loss; N is the number of positive samples of the prior frame; L_conf(x, c) is the confidence error loss, x ∈ {1, 0} is an indication parameter, and c is the predicted class confidence; L_loc(x, p, g) is the position error loss, p is the predicted position of the bounding box corresponding to the prior frame, and g is the position parameter of the real label; and α is a weight coefficient.
7. The remote sensing image small target detection method based on feedback type multi-scale training of claim 1, characterized in that: the calculation formula of the proportion value of the small target is:
r_s = L_t^s / L_t,
wherein L_t is the position error L_loc(x, p, g), and L_t^s is the position error of small targets whose area is not greater than A_s.
8. The remote sensing image small target detection method based on feedback type multi-scale training of claim 1, characterized in that: the step of acquiring the spliced image data is as follows:
step 31, storing the original image data in a cache form;
step 32, obtaining four images in the current iteration image, scaling the size of each image to 1/2 of the original size through size conversion, and splicing the four scaled images into one image;
step 33: acquiring a data label of a current iteration image, performing same scale transformation and scaling on the data labels corresponding to the four scaled images, and storing the data labels into a new label;
and step 34, taking the spliced image after the scale transformation and the new label corresponding to the spliced image as spliced image data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010934966.8A CN112016512A (en) | 2020-09-08 | 2020-09-08 | Remote sensing image small target detection method based on feedback type multi-scale training |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010934966.8A CN112016512A (en) | 2020-09-08 | 2020-09-08 | Remote sensing image small target detection method based on feedback type multi-scale training |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112016512A true CN112016512A (en) | 2020-12-01 |
Family
ID=73516343
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010934966.8A Pending CN112016512A (en) | 2020-09-08 | 2020-09-08 | Remote sensing image small target detection method based on feedback type multi-scale training |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112016512A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109344821A (en) * | 2018-08-30 | 2019-02-15 | 西安电子科技大学 | Small target detecting method based on Fusion Features and deep learning |
CN109815799A (en) * | 2018-12-18 | 2019-05-28 | 南京理工大学 | A kind of vehicle detecting algorithm of quickly taking photo by plane based on SSD |
CN109977997A (en) * | 2019-02-13 | 2019-07-05 | 中国科学院自动化研究所 | Image object detection and dividing method based on convolutional neural networks fast robust |
CN111079604A (en) * | 2019-12-06 | 2020-04-28 | 重庆市地理信息和遥感应用中心(重庆市测绘产品质量检验测试中心) | Method for quickly detecting tiny target facing large-scale remote sensing image |
- 2020
  - 2020-09-08 CN CN202010934966.8A patent/CN112016512A/en active Pending
Non-Patent Citations (3)
Title |
---|
JUNSUO QU et al.: "Dilated Convolution and Feature Fusion SSD Network for Small Object Detection in Remote Sensing Images", IEEE Access *
YUKANG CHEN et al.: "Stitcher: Feedback-driven Data Provider for Object Detection", arXiv *
WANG Dongli et al.: "Small Object Detection Based on Feature Fusion SSD" (基于特征融合的SSD视觉小目标检测), Computer Engineering and Applications (计算机工程与应用) *
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112766184A (en) * | 2021-01-22 | 2021-05-07 | 东南大学 | Remote sensing target detection method based on multi-level feature selection convolutional neural network |
CN112766184B (en) * | 2021-01-22 | 2024-04-16 | 东南大学 | Remote sensing target detection method based on multi-level feature selection convolutional neural network |
CN113076962A (en) * | 2021-05-14 | 2021-07-06 | 电子科技大学 | Multi-scale target detection method based on micro neural network search technology |
CN113159300A (en) * | 2021-05-15 | 2021-07-23 | 南京逸智网络空间技术创新研究院有限公司 | Image detection neural network model, training method thereof and image detection method |
CN113159300B (en) * | 2021-05-15 | 2024-02-27 | 南京逸智网络空间技术创新研究院有限公司 | Image detection neural network model, training method thereof and image detection method |
CN113221855A (en) * | 2021-06-11 | 2021-08-06 | 中国人民解放军陆军炮兵防空兵学院 | Small target detection method and system based on scale sensitive loss and feature fusion |
CN115272814A (en) * | 2022-09-28 | 2022-11-01 | 南昌工学院 | Long-distance space self-adaptive multi-scale small target detection method |
CN115272814B (en) * | 2022-09-28 | 2022-12-27 | 南昌工学院 | Long-distance space self-adaptive multi-scale small target detection method |
CN115761383A (en) * | 2023-01-06 | 2023-03-07 | 北京匠数科技有限公司 | Image classification method and device, electronic equipment and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109299274B (en) | Natural scene text detection method based on full convolution neural network | |
CN112016512A (en) | Remote sensing image small target detection method based on feedback type multi-scale training | |
CN109977918B (en) | Target detection positioning optimization method based on unsupervised domain adaptation | |
CN111126472B (en) | SSD (solid State disk) -based improved target detection method | |
CN111027493B (en) | Pedestrian detection method based on deep learning multi-network soft fusion | |
CN111126359B (en) | High-definition image small target detection method based on self-encoder and YOLO algorithm | |
CN111898432B (en) | Pedestrian detection system and method based on improved YOLOv3 algorithm | |
CN112150493B (en) | Semantic guidance-based screen area detection method in natural scene | |
CN110647829A (en) | Bill text recognition method and system | |
CN111696110B (en) | Scene segmentation method and system | |
CN111914698B (en) | Human body segmentation method, segmentation system, electronic equipment and storage medium in image | |
CN112287941B (en) | License plate recognition method based on automatic character region perception | |
CN113673338A (en) | Natural scene text image character pixel weak supervision automatic labeling method, system and medium | |
CN111160407A (en) | Deep learning target detection method and system | |
CN112364931A (en) | Low-sample target detection method based on meta-feature and weight adjustment and network model | |
CN113487610B (en) | Herpes image recognition method and device, computer equipment and storage medium | |
CN116645592B (en) | Crack detection method based on image processing and storage medium | |
CN113159215A (en) | Small target detection and identification method based on fast Rcnn | |
CN113537085A (en) | Ship target detection method based on two-time transfer learning and data augmentation | |
CN111199255A (en) | Small target detection network model and detection method based on dark net53 network | |
CN111476226B (en) | Text positioning method and device and model training method | |
Yamashita et al. | Cost-alleviative learning for deep convolutional neural network-based facial part labeling | |
Rao et al. | Roads detection of aerial image with FCN-CRF model | |
CN115953743A (en) | Parking space state identification method based on improved YOLO model | |
CN116052149A (en) | CS-ABCNet-based electric power tower plate detection and identification method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||