CN114220019A

CN114220019A - Lightweight hourglass type remote sensing image target detection method and system

Info

Publication number: CN114220019A
Application number: CN202111323948.7A
Authority: CN
Inventors: 贺霖; 李颖琪; 李军
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2021-11-10
Filing date: 2021-11-10
Publication date: 2022-03-22
Anticipated expiration: 2041-11-10
Also published as: CN114220019B

Abstract

The invention discloses a lightweight hourglass-type remote sensing image target detection method and system. The method comprises: preprocessing an acquired remote sensing image data set and dividing it into a training data set, a validation data set and a test data set; constructing a target detection network model comprising a lightweight hourglass-type network and a feature pyramid network; inputting the remote sensing images of the training data set, extracting features with the lightweight hourglass-type network, and inputting the extracted features into the feature pyramid network to complete feature encoding; obtaining a plurality of candidate boxes from the encoding result and selecting the optimal candidate box as output to obtain a predicted value; back-propagating the error between the predicted value and the true value to complete the training of the target detection network model; and completing prediction and classification on the test data set using the trained target detection network model. The invention achieves high detection accuracy for small targets and a faster training convergence speed.

Description

Lightweight hourglass type remote sensing image target detection method and system

Technical Field

The invention relates to the field of remote sensing image processing, and in particular to a lightweight hourglass-type remote sensing image target detection method and system.

Background

Target detection in remote sensing images is a technique for detecting regions or objects of interest in ground-object images captured by high-resolution satellites; it marks each object of interest with a rectangular box and gives the confidence of the category to which the object belongs. The objects to be detected often include small targets such as airplanes, ships, courts and vehicles, as well as large targets such as bridges, train stations and ports. As the resolution of remote sensing images improves, the information they contain becomes increasingly rich, and target detection has become a very important link in remote sensing image analysis, with great significance for natural disaster assessment, resource surveys, military research and the like.

At present, target detection algorithms based on deep learning can be divided into two major categories: two-stage and single-stage target detection algorithms. Two-stage algorithms mainly include R-CNN, Fast R-CNN, R-FCN and the like; their core principle is that a series of candidate boxes is first generated, and the objects in the candidate boxes are then classified by a convolutional neural network. Single-stage algorithms mainly include YOLO, YOLO9000, YOLOv3, SSD, RetinaNet and the like; their core principle is that the localization and classification of target boxes are converted directly into a regression problem, and the location and class of an object are obtained by solving it. Generally speaking, two-stage algorithms achieve higher accuracy but have low computational efficiency, whereas single-stage algorithms maintain high computational efficiency at the cost of slightly lower detection accuracy.

In deep-learning-based target detection algorithms, a large number of convolution operations accompanied by downsampling are often used to obtain feature maps with higher-level semantic features. As a result, the resolution of the feature maps is inevitably reduced while rich semantic features are acquired, which tends to degrade the detection performance for small targets. A common remedy is to concatenate or fuse feature maps of different levels, so that the feature maps carry both good semantic information and fine-grained information. For example, YOLOv3 constructs a feature pyramid after the feature extraction backbone network, concatenates high-level features with low-level features through upsampling, and outputs network predictions at multiple levels, so that the network is capable of multi-scale target detection.

In target detection tasks on remote sensing images, the images to be processed often have high resolution and contain a large number of small targets, every pixel may carry important information, and the improvement brought by the common methods of concatenating or fusing feature maps of different levels is extremely limited. How to improve detection accuracy, especially for small targets, while maintaining computational efficiency, and how to make full use of context information to assist the target detection task, have become problems worth considering.

Disclosure of Invention

Aiming at the defect that existing target detection algorithms cannot achieve good detection performance on small targets in remote sensing images, the invention provides a lightweight hourglass-type remote sensing image target detection method and system.

The technical scheme adopted by the invention is as follows:

A lightweight hourglass-type remote sensing image target detection method comprises:

preprocessing an acquired remote sensing image data set, and dividing it into a training data set, a validation data set and a test data set;

constructing a target detection network model, wherein the target detection network model comprises a lightweight hourglass-type network and a feature pyramid network;

inputting the remote sensing images of the training data set, performing feature extraction with the lightweight hourglass-type network, and inputting the extracted features into the feature pyramid network to complete feature encoding;

obtaining a plurality of candidate boxes according to the encoding result, and selecting the optimal candidate box as output to obtain a predicted value;

back-propagating the error between the predicted value and the true value to complete the training of the target detection network model;

and completing prediction and classification on the test data set using the trained target detection network model.

Further, the preprocessing comprises converting the detection boxes in the remote sensing image data set from the 'center point + width and height' form into the 'upper-left corner point + lower-right corner point' form.
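The label conversion described above can be sketched as follows. This is a minimal illustration only: the function names and the (cx, cy, w, h) argument order are assumptions, since the source does not specify the annotation layout.

```python
def center_to_corners(cx, cy, w, h):
    """Convert a 'center point + width and height' box to
    'upper-left corner + lower-right corner' form (x1, y1, x2, y2)."""
    half_w, half_h = w / 2.0, h / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def corners_to_center(x1, y1, x2, y2):
    """Inverse conversion, useful for checking a round trip."""
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1, y2 - y1)


# Hypothetical box: center (100, 80), width 40, height 20.
box = center_to_corners(100.0, 80.0, 40.0, 20.0)
print(box)
print(corners_to_center(*box))
```

The inverse function is not part of the described method; it is included only so that the conversion can be checked.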

Further, the lightweight hourglass-type network is composed of a plurality of hourglass-type basic modules, which are stacked to form each level.

Further, the hourglass-type basic module is constructed as follows:

for an input I ∈ R^(h₁×w₁×c₁), a convolution with 3 × 3 kernels and stride 1 and a convolution with 3 × 3 kernels and stride 2, each followed by batch normalization and an activation function, form two branches whose outputs are S₁ ∈ R^(h₁×w₁×c₁) and H₁ ∈ R^(h₂×w₂×c₂), respectively,

where h₁ = 2h₂, w₁ = 2w₂, c₁ = 2c₂;

H₁ is upsampled by a factor of two using nearest-neighbor interpolation to obtain the upsampling result H₂ ∈ R^(h₁×w₁×c₂);

the width and height of H₂ are consistent with those of S₁; from I to H₂ the width and height of the features are first compressed and then expanded, and this change imitates the shape of an hourglass, hence the name hourglass structure;

S₁ and H₂ are concatenated along the channel dimension, giving S₂ ∈ R^(h₁×w₁×c₃),

where c₃ = c₁ + c₂, so that S₂ contains both the fine-grained information and the semantic information from before and after the convolution;

the number of channels of the concatenation result S₂ is compressed to c₁ using 1 × 1 convolution kernels with stride 1 in order to reduce the number of parameters and the amount of computation, and the output is S₃ ∈ R^(h₁×w₁×c₁);

S₃ has the same size as I, which allows the subsequent residual connection;

a channel attention mechanism is used to adjust the channel weights of S₃, so that the model concentrates on high-value feature information, and the re-weighted result is S₄ ∈ R^(h₁×w₁×c₁);

S₄ and I are joined by a residual connection in the form of pixel-wise addition to prevent the vanishing-gradient problem caused by the depth of the network, and the output of the module is O ∈ R^(h₁×w₁×c₁).

The hourglass-type modules are stacked according to the number of basic modules set for each level of the lightweight hourglass-type network to form that level of the backbone network, and the input and output sizes of the modules within the same level are kept consistent.
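The shape bookkeeping of the basic module described above can be traced with a short sketch. It only tracks (height, width, channels) tuples and performs no convolution; the function name is hypothetical, and it assumes even input sizes and a padding of 1 for the 3 × 3 convolutions so that stride 2 halves the width and height exactly.

```python
def hourglass_block_shapes(h1, w1, c1):
    """Trace (height, width, channels) through one hourglass basic module."""
    if h1 % 2 or w1 % 2 or c1 % 2:
        raise ValueError("h1, w1 and c1 are assumed to be even")
    h2, w2, c2 = h1 // 2, w1 // 2, c1 // 2   # h1 = 2*h2, w1 = 2*w2, c1 = 2*c2
    shapes = {}
    shapes["I"] = (h1, w1, c1)          # module input
    shapes["S1"] = (h1, w1, c1)         # 3x3 convolution, stride 1
    shapes["H1"] = (h2, w2, c2)         # 3x3 convolution, stride 2
    shapes["H2"] = (h1, w1, c2)         # 2x nearest-neighbor upsampling of H1
    shapes["S2"] = (h1, w1, c1 + c2)    # channel concatenation, c3 = c1 + c2
    shapes["S3"] = (h1, w1, c1)         # channel compression back to c1
    shapes["S4"] = shapes["S3"]         # channel attention only re-weights
    shapes["O"] = shapes["I"]           # residual addition keeps the shape
    return shapes


for name, shape in hourglass_block_shapes(208, 208, 64).items():
    print(name, shape)
```

Because the output shape equals the input shape, modules of this kind can be stacked within a level without any adapter layers, which matches the statement that input and output sizes are kept consistent within a level.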

Further, the lightweight hourglass-type network is constructed as follows:

the hourglass-type basic modules are stacked to form a level of the network;

and a convolution with 3 × 3 kernels and stride 2, followed by batch normalization and an activation function, is added between different levels to adjust the number of channels and acquire higher-level semantic information, and the levels are connected in series to form the lightweight hourglass-type target detection backbone network.

Further, the lightweight hourglass-type network comprises five levels, wherein the five levels respectively comprise 1, 2, 4 and 2 stacked basic modules, the feature width and height of each subsequent level are 1/2 of those of the previous level, and the number of channels is 2 times that of the previous level.

Further, the output features of the last three levels of the lightweight hourglass-type network are input into the feature pyramid network, where the lower-resolution features are upsampled and then concatenated with the higher-resolution features.
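The upsample-then-concatenate operation mentioned above can be illustrated on tiny feature maps. This is a sketch under assumptions: feature maps are represented as [channels][height][width] nested lists, the upsampling is twofold nearest-neighbor interpolation as used elsewhere in this description, and the function names are hypothetical. Any additional convolutions inside the feature pyramid network are not shown.

```python
def upsample_nearest_2x(feature):
    """Double the height and width of a [channels][height][width] nested list
    by repeating every value in a 2 x 2 block (nearest-neighbor interpolation)."""
    out = []
    for channel in feature:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat each column
            rows.append(wide)
            rows.append(list(wide))                    # repeat each row
        out.append(rows)
    return out


def concat_channels(a, b):
    """Concatenate two feature maps along the channel axis."""
    if len(a[0]) != len(b[0]) or len(a[0][0]) != len(b[0][0]):
        raise ValueError("spatial sizes must match before concatenation")
    return a + b


deep = [[[1, 2], [3, 4]]]                  # 1 channel, 2 x 2 (lower resolution)
shallow = [[[0] * 4 for _ in range(4)]]    # 1 channel, 4 x 4 (higher resolution)
merged = concat_channels(shallow, upsample_nearest_2x(deep))
print(len(merged), len(merged[0]), len(merged[0][0]))
```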

Further, a Euclidean distance loss function is used to calculate the loss of the width and height of the predicted box, and a cross-entropy loss function is used to calculate the losses of the center point, confidence and category of the predicted box.

Further, a non-maximum suppression algorithm is adopted to select the optimal candidate box for output.
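A minimal sketch of greedy non-maximum suppression is given below for illustration. The IoU threshold of 0.5 and the (x1, y1, x2, y2) box layout are assumptions; the source does not state the threshold or any per-class handling.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical candidates: the first two overlap heavily, the third is separate.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

In this example the second box is suppressed because its overlap with the higher-scoring first box exceeds the threshold.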

A system for implementing the lightweight hourglass-type remote sensing image target detection method comprises:

a data acquisition unit, used for preprocessing an acquired remote sensing image data set and dividing it into a training data set, a validation data set and a test data set;

a target detection network model construction unit, used for constructing a target detection network model comprising a lightweight hourglass-type network and a feature pyramid network;

a target detection network model training unit, used for initializing the convolution kernel weights and biases of each layer of the network using a Gaussian distribution with a mean of zero, and iteratively optimizing the network model by back propagation with the Adam optimizer;

and a target detection unit, used for completing the target detection task on the remote sensing images in the test set.

The invention has the beneficial effects that:

(1) the invention uses the lightweight hourglass-type module in place of the traditional residual module, and by combining different convolutional feature maps the output of the basic module carries both semantic information and fine-grained information;

(2) taking the prior-art YOLOv3 target detection algorithm as an example, its backbone has five levels comprising 1, 2, 8, 8 and 4 basic modules respectively; the levels of the target detection algorithm provided by the invention respectively comprise 1, 2, 4 and 2 basic modules, so that high computational efficiency can be effectively maintained;

(3) the output of each basic module of the invention carries both good semantic information and fine-grained information, and compared with YOLOv3 the invention achieves higher detection accuracy on small targets and a faster training convergence speed.

Drawings

Fig. 1 is a flow chart of an implementation of a light hourglass type remote sensing image target detection method.

Fig. 2 is a first level of a lightweight hourglass-like object detection model, comprising 1 basic module.

Fig. 3 is a backbone network structure of the object detection model, each level comprising a different number of basic modules.

Fig. 4(a) to fig. 4(b) compare the average precision of the YOLOv3 method and the method of the present invention on the target detection task on the DIOR data set, where fig. 4(a) shows the average precision of YOLOv3 detection, and fig. 4(b) shows the average precision of detection by the method of the present invention.

Fig. 5(a) to 5(b) are comparisons of effects of the YOLOv3 method and the method proposed by the present invention on the same remote sensing image, where fig. 5(a) is a detection result of YOLOv3, and fig. 5(b) is a detection result of the method proposed by the present invention.

Detailed Description

The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited to these examples.

Examples

As shown in fig. 1, the present embodiment takes a data set of 23463 remote sensing images with a resolution of 800 × 800 × 3 as an example, and includes the following steps:

S1: the acquired remote sensing image data set is preprocessed and divided into a training data set, a validation data set and a test data set;

The preprocessing in this embodiment refers to converting the data labeling from the "center point + width and height" form to the "upper-left corner point + lower-right corner point" form, dividing the data set into a training data set, a validation data set and a test data set in a ratio of 4:1:5, and resizing the input images to a fixed size.

The input image is resized to 416 × 416 × 3.
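The 4:1:5 division described above can be sketched as follows. The shuffling, the fixed seed, the rounding rule (remainder assigned to the test set) and the function name are assumptions for illustration; the source only states the ratio.

```python
import random


def split_dataset(items, ratios=(4, 1, 5), seed=0):
    """Shuffle items and split them into train / validation / test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)   # assumed: random split with a fixed seed
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]       # remainder goes to the test set
    return train, val, test


# Image indices stand in for the 23463 images of the embodiment.
train, val, test = split_dataset(range(23463))
print(len(train), len(val), len(test))
```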

S2: a target detection network model comprising a lightweight hourglass-type network and a feature pyramid network is constructed, the weights and biases of the convolution kernels are randomly initialized using a Gaussian distribution with a mean of zero, and the divided training data set is input into the lightweight hourglass-type network for learning;

The lightweight hourglass-type network serves as the backbone network, and the feature pyramid network serves as the neck network.

The proposed backbone network, with the lightweight hourglass-type module as its basic module, is applied to a conventional YOLOv3 target detection model, with batch normalization and linear rectification (ReLU) activation after each convolution.

The lightweight hourglass-type network is composed of a plurality of hourglass-type basic modules, which are stacked to form the levels.

The hourglass-type basic module is constructed as follows:

Further, the specific process of step S2 is as follows:

Step 2.1: the input picture I ∈ R^(416×416×3) is first convolved with 32 convolution kernels of size 3 × 3 to expand the channel dimension, and the output feature I₁ ∈ R^(416×416×32) is obtained after batch normalization and a linear rectification (ReLU) activation function;

Step 2.2: I₁ is input to the first level of the lightweight hourglass-type target detection model, the structure of which is shown in fig. 2:

First layer: convolutional layer Conv1. The input I₁ is convolved with 64 convolution kernels of size 3 × 3 and stride 2, and the output feature I₂ ∈ R^(208×208×64) is obtained through batch normalization and ReLU activation;

Second layer: two convolution branches Conv2 and Conv3. The input is the output of the previous layer, which is convolved with 64 convolution kernels of size 3 × 3 and stride 1 and with 32 convolution kernels of size 3 × 3 and stride 2, respectively; the output features S₁ ∈ R^(208×208×64) and H₁ ∈ R^(104×104×32) are obtained through batch normalization and ReLU activation;

Third layer: upsampling layer UpSample1. The input is the output H₁ of the previous layer; twofold nearest-neighbor interpolation gives the upsampling result H₂ ∈ R^(208×208×32), whose width and height are consistent with those of S₁ to facilitate the subsequent concatenation;

Fourth layer: concatenation layer Cat1. The inputs are the output S₁ of the second layer and the output H₂ of the third layer, which are concatenated along the channel dimension to obtain the output feature S₂ ∈ R^(208×208×96);

Fifth layer: convolutional layer Conv4. The input is the output of the previous layer, which is convolved with 64 convolution kernels of size 3 × 3 and stride 1; the output feature S₃ ∈ R^(208×208×64) is obtained through batch normalization and ReLU activation, and S₃ has the same size as the module input I₂ so that the subsequent residual connection can be carried out;

Sixth layer: attention layer Attention1. The input is the output of the previous layer; the weight matrix W₁ of S₃ in the channel dimension is obtained through two-dimensional adaptive average pooling and one-dimensional convolution, and W₁ is multiplied with S₃ to obtain the output S₄ ∈ R^(208×208×64) with adjusted channel weights;

Seventh layer: residual connection layer Res1. The input is the output of the previous layer, which is joined with I₂ by a residual connection in the form of element-wise addition to obtain the output O ∈ R^(208×208×64) of the current module;
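The attention layer described above (adaptive average pooling followed by a one-dimensional convolution) can be sketched as follows. This is an illustration under assumptions: the feature map is a [channels][height][width] nested list, the one-dimensional convolution runs across the pooled channel descriptor with zero padding, and a sigmoid is applied to obtain the weights. The sigmoid gate, the kernel values and the function name are not stated in the source.

```python
import math


def channel_attention(feature, kernel):
    """Re-weight a [channels][height][width] feature map.

    kernel is an odd-length list of 1D convolution weights applied across
    the per-channel averages with zero padding.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length is assumed to be odd")
    # Global average pooling: one scalar per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]
    pad = len(kernel) // 2
    padded = [0.0] * pad + pooled + [0.0] * pad
    # 1D convolution along the channel axis, then a sigmoid gate (assumed).
    weights = []
    for c in range(len(pooled)):
        z = sum(k * padded[c + i] for i, k in enumerate(kernel))
        weights.append(1.0 / (1.0 + math.exp(-z)))
    # Multiply every value of a channel by that channel's weight.
    scaled = [[[v * wt for v in row] for row in ch]
              for ch, wt in zip(feature, weights)]
    return scaled, weights


# Hypothetical 2-channel, 2 x 2 feature map and an identity-like kernel.
feature = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
scaled, weights = channel_attention(feature, kernel=[0.0, 1.0, 0.0])
print(weights)
```

The output has the same shape as the input, so it can be added to the module input in the residual connection of the seventh layer.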

Step 2.3: the lightweight hourglass-type target detection model comprises 5 levels, which respectively comprise 1, 2, 4 and 2 basic modules, and the constructed basic modules are repeatedly stacked according to the number of modules of the corresponding level. Taking level 1 as an example, the level has 1 basic module, as shown in fig. 2;

Step 2.4: the basic modules within the same level of the model in step 2.3 are connected directly, and a convolution with 3 × 3 kernels and stride 2 is performed between different levels; the feature width and height of each level are 1/2 of those of the previous level, and the number of channels is 2 times that of the previous level. In this example, the first-level output is L₁ ∈ R^(208×208×64), the second-level output is L₂ ∈ R^(104×104×128), the third-level output is L₃ ∈ R^(52×52×256), the fourth-level output is L₄ ∈ R^(26×26×512), and the fifth-level output is L₅ ∈ R^(13×13×1024); the specific structure is shown in fig. 3;
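The level output sizes listed in step 2.4 follow directly from the halving and doubling rule, as the short sketch below shows. The function name and its parameters are hypothetical; the starting values (416 × 416 input, 32 channels after the first convolution, 5 levels) are taken from this embodiment.

```python
def backbone_stage_shapes(input_size=416, stem_channels=32, num_levels=5):
    """Return the (height, width, channels) of each level output.

    Each level halves the width and height and doubles the channel count.
    """
    shapes = []
    size, channels = input_size, stem_channels
    for _ in range(num_levels):
        size //= 2
        channels *= 2
        shapes.append((size, size, channels))
    return shapes


for level, shape in enumerate(backbone_stage_shapes(), start=1):
    print(f"L{level}: {shape}")
```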

Step 2.5: the output of the backbone network is taken as the input of the feature pyramid network, that is, it is passed into the neck network, where the encoding of the features is completed; the predicted output of the network is then obtained through subsequent decoding.

S3: a Euclidean distance loss function is used to calculate the loss of the width and height of the predicted box, a cross-entropy loss function is used to calculate the losses of the center point, confidence and category of the predicted box, and the sum of the four losses is taken as the prediction loss of the whole network model;
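A minimal sketch of the four-term loss for a single predicted box is given below. Several details are assumptions not stated in the source: the Euclidean distance term is taken as the sum of squared differences, the cross-entropy is taken in its binary form with predictions and targets in [0, 1], all terms are weighted equally, and the dictionary keys are hypothetical.

```python
import math


def binary_cross_entropy(p, t, eps=1e-7):
    """Binary cross-entropy between a prediction p and a target t in [0, 1]."""
    p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))


def box_loss(pred, target):
    """Sum of the four losses; pred/target are dicts with keys
    'xy' (center), 'wh' (width, height), 'conf' and 'cls' (per-class scores)."""
    wh_loss = sum((p - t) ** 2 for p, t in zip(pred["wh"], target["wh"]))
    xy_loss = sum(binary_cross_entropy(p, t)
                  for p, t in zip(pred["xy"], target["xy"]))
    conf_loss = binary_cross_entropy(pred["conf"], target["conf"])
    cls_loss = sum(binary_cross_entropy(p, t)
                   for p, t in zip(pred["cls"], target["cls"]))
    return wh_loss + xy_loss + conf_loss + cls_loss


# Hypothetical prediction and ground truth for one box with two classes.
pred = {"xy": (0.4, 0.6), "wh": (1.2, 0.8), "conf": 0.7, "cls": (0.9, 0.2)}
target = {"xy": (0.5, 0.5), "wh": (1.0, 1.0), "conf": 1.0, "cls": (1.0, 0.0)}
print(box_loss(pred, target))
```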

S4: the error is back-propagated using the Adam optimizer, the weights and biases of the convolution kernels are iteratively updated, and the current optimal deep network model is considered to be obtained when the loss function reaches its minimum or the set maximum number of iteration steps is reached;
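The zero-mean Gaussian initialization and the Adam update used in training can be sketched on a toy problem as follows. The standard deviation, learning rate, beta values and the quadratic toy loss are assumptions chosen for illustration; the source only names the Gaussian initialization with zero mean and the Adam optimizer.

```python
import math
import random


def gaussian_init(n, std=0.01, seed=0):
    """Draw n parameters from a zero-mean Gaussian (std is an assumed value)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(n)]


def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count. Returns (params, m, v)."""
    new_p, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(params, grads, m, v):
        mi = beta1 * mi + (1.0 - beta1) * g          # first-moment estimate
        vi = beta2 * vi + (1.0 - beta2) * g * g      # second-moment estimate
        m_hat = mi / (1.0 - beta1 ** t)              # bias correction
        v_hat = vi / (1.0 - beta2 ** t)
        new_p.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_p, new_m, new_v


# Toy loss: sum of (w - 1)^2, minimized for a fixed maximum number of steps.
w = gaussian_init(3)
m, v = [0.0] * 3, [0.0] * 3
for step in range(1, 2001):
    grads = [2.0 * (wi - 1.0) for wi in w]
    w, m, v = adam_step(w, grads, m, v, step, lr=0.01)
print(w)
```

The loop stops after a fixed number of steps, which corresponds to the "set maximum number of iteration steps" stopping condition; a check on the loss value could be added for the other condition.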

S5: the images of the test data set are resized to 416 × 416 × 3 and input into the network model obtained in step S4 to obtain the target detection prediction result for the test set.

In this embodiment, the DIOR data set is used to compare YOLOv3 with the lightweight hourglass-type remote sensing image target detection method provided by the present invention. After 50 epochs of training, the average precision is used as the evaluation index, and the comparison result is shown in table 1.

The comparison result is visualized in fig. 4(a) and fig. 4(b): fig. 4(a) shows the detection performance of YOLOv3 on each class, with an average detection precision of 63.91%, and fig. 4(b) shows the detection performance of the method provided by the present invention on each class, with an average detection precision of 66.60%. Compared with YOLOv3, the method provided by the invention obtains a better detection result for most target classes of the data set used, and further improves the detection performance for small targets.

TABLE 1 Experimental results for target detection on DIOR dataset

Fig. 5(a) to 5(b) compare the target detection results of YOLOv3 and the method proposed by the present invention on the same remote sensing image, where fig. 5(a) is the detection result of YOLOv3, and fig. 5(b) is the detection result of the method proposed by the present invention. Compared with YOLOv3, the detection boxes predicted by the proposed method have higher confidence, and their boundaries fit the actual targets more closely.

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. A light hourglass type remote sensing image target detection method is characterized by comprising

2. The method for detecting the target of the light hourglass-type remote sensing image according to claim 1, wherein the preprocessing comprises converting a detection frame in the remote sensing image data set from a form of central point + width and height into a form of upper left corner point + lower right corner point.

3. The method for detecting the target of the light hourglass-type remote sensing image according to claim 1, wherein the light hourglass-type network comprises a plurality of hourglass-type basic modules which are stacked to form each layer.

4. The method for detecting the target of the light hourglass type remote sensing image according to claim 3, wherein the hourglass type basic module is constructed as follows:

for input

wherein h₁ = 2h₂, w₁ = 2w₂, c₁ = 2c₂;

wherein c₃ = c₁ + c₂, and S₂ contains both the fine-grained information and the semantic information from before and after the convolution;

5. The method for detecting the target of the light hourglass type remote sensing image according to claim 4, wherein the construction method of the light hourglass type network is as follows:

stacking the hourglass-shaped base modules to form a level of the network;

6. The method for detecting the target of the light hourglass remote sensing image according to claim 5, wherein the light hourglass network comprises five levels which respectively comprise 1, 2, 4 and 2 stacked basic modules, the characteristic width and height of the next level are 1/2 of the previous level, and the number of channels is 2 times that of the previous level.

7. The method for detecting the target of the light hourglass type remote sensing image according to any one of claims 5 or 6, wherein the output features of the last three levels of the light hourglass type network are input into a feature pyramid network, and the lower-resolution features are upsampled and then concatenated with the higher-resolution features.

8. The method for detecting the target of the light hourglass type remote sensing image according to claim 1, wherein an Euclidean distance loss function is adopted to calculate the loss of the width and the height of a prediction frame, and a cross entropy function is adopted to calculate the loss of the central point, the confidence coefficient and the category of the prediction frame.

9. The method for detecting the target of the light hourglass-type remote sensing image according to claim 1, wherein a non-maximum suppression algorithm is adopted to select an optimal candidate box for output.

10. A system for implementing the method for detecting the target in the lightweight hourglass-type remote sensing image according to any one of claims 1 to 9, comprising: