CN113971764A

CN113971764A - Remote sensing image small target detection method based on improved YOLOv3

Info

Publication number: CN113971764A
Application number: CN202111269827.9A
Authority: CN
Inventors: 李国强; 常轩
Original assignee: Yanshan University
Current assignee: Yanshan University
Priority date: 2021-10-29
Filing date: 2021-10-29
Publication date: 2022-01-25
Anticipated expiration: 2041-10-29
Also published as: CN113971764B

Abstract

The invention discloses a method for detecting small targets in remote sensing images based on an improved YOLOv3, belonging to the technical field of deep learning and target detection. The method comprises: preprocessing a data set; optimizing the YOLOv3 network by adding a dilated (atrous) convolution group module, a feature enhancement module and a channel attention module to the Neck; online data augmentation; forward inference; improving the loss function; and selecting the YOLOv3 network model with the highest detection precision and recall on the validation set and loading it into the network. By adding the dilated convolution group module, the feature enhancement module and the channel attention module to the original YOLOv3 network and by improving the loss function, the invention markedly improves the performance of the YOLOv3 detection network: targets in remote sensing images are detected more completely and with higher precision, and both training speed and overall detection accuracy are improved.

Description

Remote sensing image small target detection method based on improved YOLOv3

Technical Field

The invention relates to the field of deep learning and target detection, and in particular to a method for detecting small targets in remote sensing images based on an improved YOLOv3.

Background

With the development of deep learning and neural networks, computer vision has advanced rapidly. Within this field, target detection and recognition techniques have been widely studied and put into practice, bringing great convenience to daily life. For example, when deployed on unmanned aerial vehicles, such techniques can automatically identify specific targets in remote sensing images and efficiently take over repetitive manual work. However, many target detection tasks face the following problems:

1. most targets are small, occupying only a few dozen pixels, which makes them hard to find and identify;

2. the background is complex and there are many interfering factors, such as shooting angle, illumination changes, similar-looking targets and occlusion, which easily lead to false detections.

After comparing several commonly used target detection networks, the YOLOv3 (You Only Look Once v3) detection algorithm was selected. It is fast and accurate, and better detection performance can be obtained by improving on it. The ideas behind the network improvements are as follows:

1. In the backbone network, as depth increases, each down-sampling enlarges the receptive field but also shrinks the feature map and lowers its resolution, so some target features are lost even if up-sampling is applied afterwards. Dilated convolution addresses this: it enlarges the receptive field, so the model observes a larger region of the picture and detects targets more completely, while the larger feature maps retain more target information, which benefits localization and classification.

2. When people look at a scene, they first take in the whole area and then focus on the parts of interest, extracting the useful information from a complex background; this is the idea behind attention mechanisms. Computer vision has much in common with human vision, and the basic idea is to let the machine learn to suppress interfering factors and capture the key information. Using such a mechanism in detection improves accuracy to a certain extent.

3. In the YOLOv3 loss between the ground-truth box and the predicted box, the coordinate loss uses the mean squared error, the confidence loss and the class loss use cross entropy, and these terms are added to obtain the overall error. A loss computed this way cannot fully reflect the positional relationship and degree of overlap of the two boxes, nor the direction in which the predicted box should move, so the loss function needs to be improved.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a method for detecting small targets in remote sensing images based on an improved YOLOv3, in which the improved YOLOv3 detection network performs markedly better, detects targets in remote sensing images more completely and with higher precision, and trains faster.

In order to solve the technical problems, the technical scheme adopted by the invention is as follows:

a method for detecting small targets in remote sensing images based on an improved YOLOv3, comprising the following steps:

step 1, data set preprocessing: acquiring training remote sensing images to form a data set, converting the format of the data set, randomly dividing the data set into a training-validation set and a test set, and evaluating the YOLOv3 network model by cross-validation during training;

step 2, optimizing the YOLOv3 network: adding a dilated (atrous) convolution group module, a feature enhancement module and a channel attention module to the Neck;

step 3, online data augmentation: each time, randomly selecting the same number of pictures from the training set, applying online image data augmentation to them, and inputting them into the optimized YOLOv3 network;

step 4, forward inference: the Head of the optimized YOLOv3 network infers object coordinates and classes from the fused features, yielding the coordinates of the bounding boxes enclosing the target objects, the object classes and the confidences;

step 5, improving the loss function: iteratively training and updating the parameters according to the loss value, and evaluating on the validation set after each iteration;

step 6, finishing training: selecting the optimized YOLOv3 network model with the highest detection precision and recall on the validation set and loading it into the network.

The technical scheme of the invention is further improved as follows: in step 1, the data set preprocessing specifically comprises: converting the annotation information of the data set into VOC format, and randomly dividing the data set into a training-validation set and a test set at a ratio of …:1; the sets do not overlap and share no pictures, which prevents data contamination;

the YOLOv3 network model is evaluated by cross-validation during training, that is, the training-validation set is first randomly divided into a training set and a validation set at a ratio of 8:1; the training set is used for model training and weight updating, and the validation set is used to evaluate the YOLOv3 network model obtained after each round of training.

The technical scheme of the invention is further improved as follows: in step 2,

the dilated convolution group module adapts to multi-scale image input and enlarges the receptive field of the network;

the feature enhancement module fuses shallow features, which are rich in object position information but poor in semantic information, with deep features, which are rich in object semantic information but poor in position information, thereby fusing features of different resolutions;

the channel attention module suppresses interference, extracts from a complex background the object feature information most critical to detection, and assigns a weight to each channel of the features to strengthen the global features.

The technical scheme of the invention is further improved as follows: the channel attention mechanism is calculated as follows:

global average pooling:

$$y = \frac{1}{WH}\sum_{i=1}^{W}\sum_{j=1}^{H} x_{i,j}$$

where W and H are the width and height of the feature map, and $x_{i,j}$ is the value at row i, column j of each channel of the feature map;

channel convolution:

$$\omega = \sigma(\mathrm{C1D}_k(y))$$

that is, with a one-dimensional convolution of kernel size k,

$$\omega_i = \sigma\Big(\sum_{j=1}^{k} \alpha^{j} y^{j}\Big), \quad y^{j} \in \Omega^{k}$$

where, because global average pooling has been applied, i is 1; $y^{j}$ denotes the j-th channel; $\Omega^{k}$ denotes the set of k channels adjacent to $y^{j}$; $\alpha^{j}$ denotes the j-th channel weight; σ denotes the sigmoid function; and $\omega_i$ denotes the i-th weight;

solving for k:

$$k = \left|\frac{\log_2 c}{\gamma} + \frac{b}{\gamma}\right|_{odd}$$

where c is the given number of channels, γ and b are equal to 2 and 1 respectively, and $|\cdot|_{odd}$ denotes the nearest odd number.
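As a worked illustration of the kernel-size rule, assume it has the form $k = |\log_2 c / \gamma + b / \gamma|_{odd}$ with γ = 2 and b = 1 as stated. The channel counts 256 and 512 below are hypothetical values chosen only because they make the rounding unambiguous; they are not taken from the patent.

```latex
c = 256:\quad k = \left|\frac{\log_2 256}{2} + \frac{1}{2}\right|_{odd}
             = \left|4 + 0.5\right|_{odd} = |4.5|_{odd} = 5

c = 512:\quad k = \left|\frac{\log_2 512}{2} + \frac{1}{2}\right|_{odd}
             = \left|4.5 + 0.5\right|_{odd} = |5|_{odd} = 5
```

Under this rule the kernel grows only logarithmically with the channel count, so each channel weight is still computed from a handful of neighbouring channels even for wide feature maps.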

The technical scheme of the invention is further improved as follows: in step 3, the online data augmentation techniques comprise photometric distortion, geometric distortion, simulated occlusion and multi-image fusion;

photometric distortion mainly changes the pixel values of the picture, for example: random brightness change, random contrast change, random saturation change, random chromaticity change and added random noise;

geometric distortion mainly changes the shape of the picture, for example: random cropping, random rotation and random angle;

simulated occlusion means randomly erasing small blocks of the picture, that is, setting the pixels of those blocks to pure black;

multi-image fusion means randomly cutting a region out of one image and using it to replace the region at the same position in another image, or overlaying two images and superimposing their pixels.
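Two of the augmentations listed above, a random brightness change (photometric distortion) and random erasing (simulated occlusion), can be sketched with NumPy as follows. This is a minimal illustration rather than the patent's implementation: the function names, the `max_delta` and `max_frac` parameters and their default values are all assumptions.

```python
import numpy as np

def random_brightness(img, rng, max_delta=32):
    """Add one random offset to every pixel (photometric distortion)."""
    delta = rng.uniform(-max_delta, max_delta)   # hypothetical range
    shifted = img.astype(np.float32) + delta
    return np.clip(shifted, 0, 255).astype(np.uint8)

def random_erase(img, rng, max_frac=0.2):
    """Set one random rectangle to pure black (simulated occlusion)."""
    h, w = img.shape[:2]
    eh = rng.integers(1, max(2, int(h * max_frac)))   # block height
    ew = rng.integers(1, max(2, int(w * max_frac)))   # block width
    top = rng.integers(0, h - eh + 1)
    left = rng.integers(0, w - ew + 1)
    out = img.copy()                                  # leave the input untouched
    out[top:top + eh, left:left + ew] = 0
    return out

rng = np.random.default_rng(0)
img = np.full((64, 64, 3), 200, dtype=np.uint8)   # dummy grey picture
brighter = random_brightness(img, rng)
erased = random_erase(img, rng)
```

Because the augmentation is applied online, a fresh random draw is made every time a picture is loaded, so the network rarely sees exactly the same input twice.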

The technical scheme of the invention is further improved as follows: in step 5, the position loss in the loss function is calculated with DIOU, which takes into account three factors, namely the degree of overlap, the direction of overlap and the positional relationship between the detection box and the ground-truth box, and directly minimizes the distance between the two boxes, so convergence is faster;

the improved loss function is:

$$Loss = \frac{1}{N}\left(Loss_{DIOU} + Loss_{confi} + Loss_{cls}\right)$$

where $Loss_{DIOU}$ is the total DIOU loss on one picture, $Loss_{confi}$ is the total confidence loss on one picture, $Loss_{cls}$ is the total class loss on one picture, and N is the number of targets on one picture.

The technical scheme of the invention is further improved as follows: in step 6, the model with the highest precision and recall on the validation set is selected and loaded into the network, and its detection performance on the test set is obtained.

By adopting the above technical scheme, the invention achieves the following technical progress:

1. By using the feature enhancement module, the invention fuses shallow features, which carry more position information, with deep features, which carry more semantic information, thereby increasing the amount of information available to the Head inference layer.

2. The invention uses the dilated convolution group module to enlarge the receptive field without changing the resolution; the channel attention mechanism enables the network to extract more detection-relevant information from a complex background, which helps suppress interference.

3. With the improved loss function, the predicted boxes fit the targets more tightly.

4. The improved YOLOv3 detection network performs markedly better, detects targets in remote sensing images more completely and with higher precision, and trains faster.

Drawings

FIG. 1 is a structural diagram of the original YOLOv3 in the present invention;

FIG. 2 is the overall scheme of the improvements to YOLOv3 in the present invention;

FIG. 3 is a diagram of the SPP module in the present invention;

FIG. 4 is a diagram of the RFB module in the present invention;

FIG. 5 is a diagram of the SFM dilated convolution group module used in the present invention;

FIG. 6 is a diagram of the CSP module in the present invention;

FIG. 7 is a diagram of the FEM feature enhancement module used in the present invention;

FIG. 8 is a diagram of the ECA channel attention module used in the present invention;

FIG. 9 is a diagram of IOU and DIOU in the present invention.

Detailed Description

The invention is described in further detail below with reference to the following figures and examples:

The invention provides a method for detecting small targets in remote sensing images based on an improved YOLOv3, which improves training speed and overall accuracy by improving the loss function and adding a feature enhancement module and a channel attention module to the original YOLOv3 network (shown in FIG. 1); the improved scheme is shown in FIG. 2.

In this patent application:

SPP stands for Spatial Pyramid Pooling;

RFB stands for Receptive Field Block;

SFM stands for Spatial and Field Model;

CSP stands for Cross Stage Partial connections;

FEM stands for Feature Enhanced Model;

ECA stands for Efficient Channel Attention;

IOU stands for Intersection Over Union;

DIOU stands for Distance IOU.

As shown in FIGS. 2-9, in the method for detecting small targets in remote sensing images based on an improved YOLOv3, the data set is randomly divided into a training-validation set and a test set, and the model is evaluated by cross-validation during training. Online data augmentation, comprising geometric distortion, photometric distortion, simulated occlusion and multi-image fusion, is applied to the pictures used for training. The backbone network Darknet-53 extracts object features, from shallow to deep, from the augmented pictures. In the Neck, the feature enhancement module FEM is added to fuse features of different resolutions; the dilated convolution group SFM is added to adapt to multi-scale input and enlarge the receptive field; and the ECA channel attention module is embedded to assign weights to the channels of the features and strengthen the global features. The Head infers object coordinates and classes from the fused features. During training, the result inferred by the Head is compared with the target label and the error is computed with the loss function, in which the original MSE (mean squared error) position loss is replaced by the DIOU loss; the trainable parameters are then updated by error back-propagation and stochastic gradient descent. During testing, NMS is applied to the output of the Head so that each predicted box corresponds to exactly one target, and the accuracy is calculated by comparing all predicted boxes with the ground-truth boxes. Detection proceeds like testing, except that accuracy is no longer calculated and the predicted boxes are drawn directly on the picture.

Examples

A remote sensing image small target detection method based on improved YOLOv3 specifically comprises the following steps:

Step 1: acquire training remote sensing images to form a data set, convert the annotation information of the data set into VOC format, and randomly divide the data set into a training-validation set and a test set at a ratio of …:1. The sets do not overlap and share no pictures, which prevents data contamination. Cross-validation is used during training: the training-validation set is first randomly divided into a training set and a validation set at a ratio of 8:1; the training set is used for model training and weight updating, and the validation set is used to evaluate the model obtained at the end of each round of training.
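The two-stage random split of step 1 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patent's code: the function name `split_dataset`, the fixed seed and the `trainval_to_test` argument are hypothetical, and the value 9 passed in the usage line is a placeholder, since the first ratio is not recoverable from the text. Only the 8:1 train/validation ratio comes from the description.

```python
import random

def split_dataset(image_ids, trainval_to_test, train_to_val=8, seed=0):
    """Split ids into disjoint train / validation / test lists.

    trainval_to_test: r such that (train + val) : test = r : 1 (caller-supplied)
    train_to_val:     r such that train : val = r : 1 (8:1 in the description)
    """
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)              # reproducible random order

    n_test = round(len(ids) / (trainval_to_test + 1))
    test, trainval = ids[:n_test], ids[n_test:]   # slices never overlap

    n_val = round(len(trainval) / (train_to_val + 1))
    val, train = trainval[:n_val], trainval[n_val:]
    return train, val, test

# Placeholder outer ratio of 9:1, chosen only for the demonstration.
train, val, test = split_dataset(range(900), trainval_to_test=9)
```

Because every subset is a slice of one shuffled list, no picture can appear in two subsets, which is the property the description relies on to avoid data contamination.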

Step 2: improve the Neck. As the Darknet-53 network deepens, the extracted target features become fewer and shift from concrete to abstract.

To cope with targets whose positions and sizes vary across resolutions while also enlarging the receptive field, the SFM is proposed, drawing on the ideas of the SPP and RFB networks. As shown in FIG. 3, the input features pass through three dilated convolution branches with different dilation rates, the three results are concatenated with the original features along the channel dimension, and the fused features are passed to the next module.
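The SFM structure described above (three dilated convolution branches concatenated with the input along the channel axis) can be sketched in NumPy as follows. This is a sketch under stated assumptions: the description does not give the dilation rates, kernel sizes or channel counts, so the 3 × 3 kernels, the rates `(1, 2, 3)` and all shapes are hypothetical, and normalization and activation layers are omitted.

```python
import numpy as np

def dilated_conv3x3(x, w, rate):
    """3x3 dilated convolution, stride 1, 'same' zero padding.

    x: (C_in, H, W) feature map, w: (C_out, C_in, 3, 3) kernel.
    """
    _, h, wid = x.shape
    xp = np.pad(x, ((0, 0), (rate, rate), (rate, rate)))
    out = np.zeros((w.shape[0], h, wid))
    for ky in range(3):
        for kx in range(3):
            # window of the input shifted by the dilated kernel offset
            patch = xp[:, ky * rate: ky * rate + h, kx * rate: kx * rate + wid]
            out += np.einsum("oc,chw->ohw", w[:, :, ky, kx], patch)
    return out

def sfm(x, weights, rates=(1, 2, 3)):
    """Concatenate the input with one dilated branch per rate (channel axis)."""
    branches = [dilated_conv3x3(x, w, r) for w, r in zip(weights, rates)]
    return np.concatenate([x] + branches, axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16, 16))
weights = [rng.standard_normal((4, 4, 3, 3)) for _ in range(3)]
fused = sfm(x, weights)          # 4 input channels + 3 branches of 4 channels
```

A 3 × 3 kernel with dilation rate r covers a (2r + 1) × (2r + 1) window, so the three branches see progressively larger neighbourhoods while the spatial resolution stays at 16 × 16.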

Shallow features are rich in object position information but poor in semantic information, whereas deep features are rich in object semantic information but poor in position information. To let the shallow and deep features complement each other, the feature enhancement module FEM is proposed, drawing on the idea of the CSP module, so that the target information is more complete. The FEM passes the 4x down-sampled feature through a 3 × 3 convolution and concatenates it with the 8x down-sampled feature along the channel dimension; it likewise passes the 8x down-sampled feature through a 3 × 3 convolution and concatenates it with the 16x down-sampled feature along the channel dimension; the two concatenated features are then passed through two 1 × 1 convolutions, respectively, and fed into the next module.
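The data flow of the FEM can be followed at the level of tensor shapes with the sketch below. Several points are assumptions rather than statements from the description: the 3 × 3 convolution is given stride 2 so that its output matches the spatial size of the next, coarser feature; each concatenated feature is passed through a single 1 × 1 convolution; and all channel counts, parameter names and the `fem` function itself are hypothetical.

```python
import numpy as np

def conv3x3_s2(x, w):
    """3x3 convolution, stride 2, padding 1 (assumes even H and W).

    x: (C_in, H, W), w: (C_out, C_in, 3, 3) -> (C_out, H/2, W/2)
    """
    _, h, wid = x.shape
    oh, ow = h // 2, wid // 2
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], oh, ow))
    for ky in range(3):
        for kx in range(3):
            patch = xp[:, ky: ky + 2 * oh: 2, kx: kx + 2 * ow: 2]
            out += np.einsum("oc,chw->ohw", w[:, :, ky, kx], patch)
    return out

def conv1x1(x, w):
    """1x1 convolution: mixes channels only. w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def fem(f4, f8, f16, p):
    """Fuse 4x/8x and 8x/16x down-sampled features along the channel axis."""
    a = np.concatenate([conv3x3_s2(f4, p["down4"]), f8], axis=0)
    b = np.concatenate([conv3x3_s2(f8, p["down8"]), f16], axis=0)
    return conv1x1(a, p["mix8"]), conv1x1(b, p["mix16"])

rng = np.random.default_rng(0)
f4 = rng.standard_normal((8, 32, 32))     # shallow: more position detail
f8 = rng.standard_normal((16, 16, 16))
f16 = rng.standard_normal((32, 8, 8))     # deep: more semantic content
params = {
    "down4": rng.standard_normal((8, 8, 3, 3)),
    "down8": rng.standard_normal((16, 16, 3, 3)),
    "mix8": rng.standard_normal((16, 8 + 16)),
    "mix16": rng.standard_normal((32, 16 + 32)),
}
out8, out16 = fem(f4, f8, f16, params)
```

The 1 × 1 convolutions bring the concatenated channel counts (24 and 48 here) back to the widths expected by the following module without touching the spatial layout.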

The ECA channel attention mechanism is applied after each of the two concatenated features to widen the range each feature pixel can observe; the ECA module is shown in FIG. 5. Each extracted feature map has many channels, and superimposing the features of all channels yields the complete object features. To suppress interference and extract the key information, the ECA channel attention mechanism is embedded: channels carrying important feature components are given large weights, channels carrying unimportant components are given small weights, and the weighted result is then formed. The Neck is the second part of the YOLOv3 detection network, labeled Neck in FIG. 1. The modified Neck is the part labeled Neck in FIG. 2.

The ECA module implements a local cross-channel interaction strategy without dimensionality reduction; it adaptively selects the size of the one-dimensional convolution kernel and fuses the feature information of every k adjacent channels. First, global average pooling is applied to the input; next, a one-dimensional convolution with kernel size k performs the channel convolution; then the result of the channel convolution is passed through a Sigmoid function to obtain the weight of each channel; finally, each channel of the input features is multiplied by its corresponding weight. As shown in FIG. 2, one ECA module is embedded after each of the two FEM modules.

The formulas for ECA are:

global average pooling:

$$y = \frac{1}{WH}\sum_{i=1}^{W}\sum_{j=1}^{H} x_{i,j}$$

where W and H are the width and height of the feature map, and $x_{i,j}$ is the value at row i, column j of each channel of the feature map.

Channel convolution:

$$\omega = \sigma(\mathrm{C1D}_k(y))$$

that is, with a one-dimensional convolution of kernel size k,

$$\omega_i = \sigma\Big(\sum_{j=1}^{k} \alpha^{j} y^{j}\Big), \quad y^{j} \in \Omega^{k}$$

where, because global average pooling has been applied, i is 1; $y^{j}$ denotes the j-th channel; $\Omega^{k}$ denotes the set of k channels adjacent to $y^{j}$; $\alpha^{j}$ denotes the j-th channel weight; σ denotes the sigmoid function; and $\omega_i$ denotes the i-th weight.

Solving for k:

$$k = \left|\frac{\log_2 c}{\gamma} + \frac{b}{\gamma}\right|_{odd}$$

where c is the given number of channels; γ and b are equal to 2 and 1 respectively; and $|\cdot|_{odd}$ denotes the nearest odd number.
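The ECA computation (global average pooling, a one-dimensional convolution across k neighbouring channels, a sigmoid, then channel-wise re-weighting) can be sketched in NumPy as follows. The function names are hypothetical, the kernel `alpha` is random here whereas in a real network it is learned, and the way "nearest odd" is resolved (truncate, then add 1 if the result is even) is an assumption borrowed from common ECA implementations rather than something the description states.

```python
import math
import numpy as np

def eca_kernel_size(channels, gamma=2, b=1):
    """Kernel size k = |log2(c)/gamma + b/gamma|, moved to an odd integer."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 == 1 else t + 1      # assumed rounding rule

def eca_forward(x, alpha):
    """x: (C, H, W) feature map; alpha: 1-D kernel of odd length k."""
    c = x.shape[0]
    k = alpha.shape[0]
    y = x.mean(axis=(1, 2))                # global average pooling -> (C,)
    y_pad = np.pad(y, k // 2)              # zero-pad so the output keeps C entries
    # weighted sum over the k channels adjacent to each channel
    z = np.array([np.dot(alpha, y_pad[i:i + k]) for i in range(c)])
    omega = 1.0 / (1.0 + np.exp(-z))       # sigmoid: one weight per channel
    return x * omega[:, None, None], omega # scale every channel by its weight

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 8, 8))
k = eca_kernel_size(256)
out, omega = eca_forward(x, rng.standard_normal(k))
```

The module adds only k parameters per insertion point, which is why it can be embedded after both FEM outputs at negligible cost.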

Step 3: the online data augmentation techniques mainly comprise photometric distortion, geometric distortion, simulated occlusion and multi-image fusion. Photometric distortion mainly changes the pixel values of the picture, for example: random brightness change, random contrast change, random saturation change, random chromaticity change and added random noise. Geometric distortion mainly changes the shape of the picture, for example: random cropping, random rotation and random angle. Simulated occlusion: randomly erase small blocks of the picture (that is, set the pixels of those blocks to pure black). Multi-image fusion: randomly cut a region out of one image and use it to replace the region at the same position in another image, or overlay two images and superimpose their pixels; techniques include Mix_Up, Cut_Mix and style_transfer_GAN. The purposes of data augmentation are: 1. to increase the amount of training data and improve the generalization ability of the model; 2. to increase the diversity of the pictures and improve the robustness of the model.
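The two multi-image fusion operations named in step 3 can be sketched as follows. This is a minimal illustration with assumed function names and arguments; in particular it handles pixels only, and the merging of the bounding-box labels of the two pictures, which a detector also needs, is left out.

```python
import numpy as np

def mix_up(img_a, img_b, lam):
    """Overlay two pictures: lam * A + (1 - lam) * B, pixel by pixel."""
    mixed = lam * img_a.astype(np.float32) + (1.0 - lam) * img_b.astype(np.float32)
    return mixed.astype(np.uint8)

def cut_mix(img_a, img_b, top, left, height, width):
    """Replace a rectangle of A with the same-position rectangle of B."""
    out = img_a.copy()
    out[top:top + height, left:left + width] = \
        img_b[top:top + height, left:left + width]
    return out

a = np.zeros((32, 32, 3), dtype=np.uint8)        # dummy black picture
b = np.full((32, 32, 3), 200, dtype=np.uint8)    # dummy grey picture
blended = mix_up(a, b, lam=0.5)                  # every pixel becomes 100
pasted = cut_mix(a, b, top=8, left=8, height=16, width=16)
```

In practice `lam` and the rectangle would be drawn at random for every training sample.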

Step 4: perform forward inference and output the result.

The Head is the third part of the YOLOv3 detection network, as labeled in FIG. 1. From the features fused by the Neck, it outputs a tensor of size S × S × (3 × (4 + 1 + 20)): each picture is mapped onto S × S cells, each cell holds three detection results, and each detection result comprises 4 bounding-box coordinates, 1 confidence and 20 class prediction scores for one object.
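To make the layout of the Head output concrete, the sketch below splits one output tensor into box coordinates, confidence and class scores. The grid size S = 13 and the channel order (box, then confidence, then classes) are assumptions made for illustration; the description fixes only the counts 3, 4, 1 and 20.

```python
import numpy as np

def split_head_output(pred, num_boxes=3, num_classes=20):
    """Split an (S, S, num_boxes * (4 + 1 + num_classes)) tensor into parts."""
    s = pred.shape[0]
    pred = pred.reshape(s, s, num_boxes, 4 + 1 + num_classes)
    boxes = pred[..., 0:4]      # 4 bounding-box coordinates per detection
    conf = pred[..., 4]         # 1 confidence per detection
    scores = pred[..., 5:]      # 20 class scores per detection
    return boxes, conf, scores

S = 13                                             # hypothetical grid size
pred = np.zeros((S, S, 3 * (4 + 1 + 20)))          # 75 channels per cell
boxes, conf, scores = split_head_output(pred)
```

With these numbers a single scale yields 13 × 13 × 3 = 507 candidate boxes, which is why a filtering stage is needed before the result is displayed.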

Step 5: calculate the error and update the parameters. Steps 1 to 4 form the forward pass of the YOLOv3 network. The error between the inferred result and the corresponding label is computed and propagated backwards along the forward path, and all weight parameters are updated along the gradient direction; one forward pass followed by one backward pass is one iteration. The error is divided into position error, confidence error and class error. The position error uses the DIOU loss; the other parts are calculated as before. DIOU directly minimizes the distance between the two boxes and therefore converges faster.

The IOU is calculated as:

$$IOU = \frac{|A \cap B|}{|A \cup B|}$$

where A and B are the ground-truth box and the predicted box, and IOU is the ratio of the area of their intersection to the area of their union.

The DIOU is calculated as:

$$DIOU = IOU - \frac{\rho^{2}(b^{A}, b^{B})}{c^{2}}$$

where $\rho^{2}(b^{A}, b^{B})$ is the squared Euclidean distance between the center points of the two boxes, and $c^{2}$ is the squared diagonal length of the smallest rectangle enclosing both boxes, as shown in FIG. 4.

The DIOU loss (position loss) of a single predicted box is:

$$loss_{DIOU} = 1 - DIOU$$
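The IOU, DIOU and single-box DIOU loss can be computed directly from corner coordinates, as in the sketch below. The `(x1, y1, x2, y2)` box format and the function names are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))   # overlap width
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))   # overlap height
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def diou(a, b):
    """IOU minus (squared center distance / squared enclosing diagonal)."""
    rho2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
         + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
       + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return iou(a, b) - rho2 / c2

def diou_loss(a, b):
    return 1.0 - diou(a, b)

overlapping = diou_loss((0, 0, 2, 2), (1, 1, 3, 3))   # 1 - (1/7 - 2/18)
disjoint = diou_loss((0, 0, 1, 1), (2, 2, 3, 3))      # 1 - (0 - 8/18)
```

For the disjoint pair the IOU is 0 no matter how far apart the boxes are, but the distance term still changes with the center distance, so the loss keeps telling the optimizer which way to move the predicted box.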

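The total DIOU loss on one picture is:

$$Loss_{DIOU} = \lambda_{coord}\sum_{i}\sum_{j} I_{ij}^{obj}\left(1 - DIOU\right)$$

where $\lambda_{coord}$ is the weight of the DIOU loss.

The total confidence loss on one picture is:

$$
\begin{aligned}
Loss_{confi} = &-\sum_{i}\sum_{j} I_{ij}^{obj}\left[\hat{C}_{i}^{j}\log C_{i}^{j} + \left(1-\hat{C}_{i}^{j}\right)\log\left(1-C_{i}^{j}\right)\right] \\
&-\lambda_{noobj}\sum_{i}\sum_{j} I_{ij}^{noobj}\left[\hat{C}_{i}^{j}\log C_{i}^{j} + \left(1-\hat{C}_{i}^{j}\right)\log\left(1-C_{i}^{j}\right)\right]
\end{aligned}
$$

where the loss is divided into the confidence loss with a target (first row) and the confidence loss without a target (second row); $C_{i}^{j}$ is the confidence of the j-th predicted box of the i-th cell; $\hat{C}_{i}^{j}$ is the confidence of the j-th ground-truth box of the i-th cell; $I_{ij}^{obj}$ is 1 when the j-th predicted box of the i-th cell has a target and 0 otherwise; $I_{ij}^{noobj}$ is 1 when the j-th predicted box of the i-th cell has no target and 0 otherwise; and $\lambda_{noobj}$ is the weight of the confidence error without a target.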
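A cross-entropy confidence loss with separate target and no-target terms can be sketched as follows. It assumes the binary cross-entropy form implied by the background section, flattens the cell and box indices into one axis for brevity, and uses a hypothetical default of 0.5 for the no-target weight, a value the description does not give.

```python
import numpy as np

def bce(p, t, eps=1e-7):
    """Element-wise binary cross entropy between prediction p and target t."""
    p = np.clip(p, eps, 1.0 - eps)          # avoid log(0)
    return -(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))

def confidence_loss(conf_pred, conf_true, obj_mask, lambda_noobj=0.5):
    """Sum of the target term and the down-weighted no-target term."""
    per_box = bce(conf_pred, conf_true)
    with_target = (obj_mask * per_box).sum()
    without_target = ((1.0 - obj_mask) * per_box).sum()
    return with_target + lambda_noobj * without_target

conf_pred = np.array([0.9, 0.1])   # predicted confidences of two boxes
conf_true = np.array([1.0, 0.0])   # first box holds a target, second does not
obj_mask = np.array([1.0, 0.0])    # 1 where the box has a target
loss = confidence_loss(conf_pred, conf_true, obj_mask)
```

Down-weighting the no-target term matters because the vast majority of predicted boxes in a remote sensing picture contain no target and would otherwise dominate the sum.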

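The total class loss on one picture is:

$$Loss_{cls} = -\sum_{i}\sum_{j} I_{ij}^{obj}\sum_{c \in classes}\left[\hat{P}_{i}^{j}(c)\log P_{i}^{j}(c) + \left(1-\hat{P}_{i}^{j}(c)\right)\log\left(1-P_{i}^{j}(c)\right)\right]$$

where c ∈ classes ranges over the 20 object classes; $P_{i}^{j}(c)$ is the probability of each class in the j-th predicted box of the i-th cell; and $\hat{P}_{i}^{j}(c)$ is the probability of each class in the j-th ground-truth box of the i-th cell, which is 1 at the position of the true class and 0 at the positions of the other classes.

The total loss on one picture is:

$$Loss = \frac{1}{N}\left(Loss_{DIOU} + Loss_{confi} + Loss_{cls}\right)$$

where N is the number of targets on one picture.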

The initial learning rate is set to 0.0001, the learning rate decay factor to 0.995, Batch_size to 10 and the number of save iterations to 2; the number of partial-weight update rounds is set to 35 and the number of full-weight fine-tuning rounds to 15; the test score threshold is set to 0.3 and the test IOU threshold to 0.5; every few rounds the input picture size is randomly chosen as 32 multiplied by an integer from 10 to 19.
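The training settings above can be collected into a small configuration sketch. The constant and function names are hypothetical, and applying the decay factor once per round as a plain exponential is an assumption, since the description does not say how often or in what form the decay is applied.

```python
import random

BASE_LR = 1e-4            # initial learning rate
LR_DECAY = 0.995          # learning rate decay factor
BATCH_SIZE = 10
PARTIAL_ROUNDS = 35       # rounds that update only part of the weights
FINE_TUNE_ROUNDS = 15     # rounds that fine-tune all weights
SCORE_THRESHOLD = 0.3     # test score threshold
IOU_THRESHOLD = 0.5       # test IOU threshold

# Multi-scale training: side lengths 10*32 ... 19*32, i.e. 320 ... 608.
INPUT_SIZES = [m * 32 for m in range(10, 20)]

def learning_rate(round_index):
    """Exponentially decayed learning rate (assumed schedule)."""
    return BASE_LR * LR_DECAY ** round_index

def pick_input_size(rng):
    """Randomly choose the input side length for the next few rounds."""
    return rng.choice(INPUT_SIZES)

rng = random.Random(0)
size = pick_input_size(rng)
lr_after_one_round = learning_rate(1)
```

All ten candidate sizes are multiples of 32, which matches the overall down-sampling factor of the Darknet-53 backbone, so every choice produces an integer grid size S.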

Training uses stochastic gradient descent, error back-propagation and learning rate decay.

The precision and recall of the model obtained in each round of training are evaluated on the validation set, and the best-scoring model is tested on the test set to measure its generalization ability.
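Precision and recall follow from the counts of true positives, false positives and false negatives, as sketched below. The description does not say how the two numbers are combined when no single checkpoint is best on both, so ranking by their harmonic mean (F1) is an assumption introduced here, and the checkpoint names and scores are invented for the example.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def f1(precision, recall):
    """Harmonic mean, used here only as an assumed ranking rule."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def best_checkpoint(results):
    """results: list of (name, precision, recall); returns the top-ranked name."""
    return max(results, key=lambda r: f1(r[1], r[2]))[0]

p, r = precision_recall(tp=8, fp=2, fn=2)
chosen = best_checkpoint([
    ("round_01", 0.70, 0.60),     # made-up validation scores
    ("round_02", 0.80, 0.75),
    ("round_03", 0.78, 0.70),
])
```

Whatever ranking rule is used, it must be computed on the validation set only, so that the test set stays untouched until the final measurement.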

Step 6: load the best-evaluated weights and input a picture. The network now outputs many boxes; these are filtered with NMS (non-maximum suppression) to keep the boxes with the highest confidence, and the bounding boxes, target classes and confidences are drawn on the picture.
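The NMS filtering used in step 6 can be sketched as a greedy loop. The default thresholds reuse the test score threshold (0.3) and test IOU threshold (0.5) from the training settings; the function names and the `(x1, y1, x2, y2)` box format are assumptions, and the sketch ignores classes, whereas a detector would usually run it once per class.

```python
import numpy as np

def box_iou(box, boxes):
    """IOU between one box and an (N, 4) array of boxes (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, score_thresh=0.3, iou_thresh=0.5):
    """Return indices of kept boxes, highest score first."""
    idx = np.where(scores >= score_thresh)[0]      # drop low-score boxes
    idx = idx[np.argsort(-scores[idx])]            # sort by descending score
    keep = []
    while idx.size > 0:
        best, rest = idx[0], idx[1:]
        keep.append(int(best))
        # discard boxes that overlap the kept box too much
        idx = rest[box_iou(boxes[best], boxes[rest]) < iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11],
                  [20, 20, 30, 30], [0, 0, 5, 5]], dtype=float)
scores = np.array([0.9, 0.8, 0.7, 0.2])
kept = nms(boxes, scores)
```

In this example the second box overlaps the first with an IOU of about 0.68 and is suppressed, the fourth falls below the score threshold, and the first and third boxes survive.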

Claims

1. A remote sensing image small target detection method based on improved YOLOv3 is characterized in that: the method comprises the following steps:

2. The method for detecting small targets in remote sensing images based on an improved YOLOv3 as claimed in claim 1, wherein: in step 1, the data set preprocessing specifically comprises: converting the annotation information of the data set into VOC format, and randomly dividing the data set into a training-validation set and a test set at a ratio of …:1; the sets do not overlap and share no pictures, which prevents data contamination;

the YOLOv3 network model is evaluated by cross-validation during training, that is, the training-validation set is first randomly divided into a training set and a validation set at a ratio of 8:1; the training set is used for model training and weight updating, and the validation set is used to evaluate the model obtained after each round of training.

3. The method for detecting small targets in remote sensing images based on an improved YOLOv3 as claimed in claim 1, wherein: in step 2,

4. The method for detecting small targets in remote sensing images based on an improved YOLOv3 as claimed in claim 3, wherein: the channel attention module is calculated as follows:

global average pooling:

$$y = \frac{1}{WH}\sum_{i=1}^{W}\sum_{j=1}^{H} x_{i,j}$$

channel convolution:

$$\omega = \sigma(\mathrm{C1D}_k(y))$$

solving for k:

$$k = \left|\frac{\log_2 c}{\gamma} + \frac{b}{\gamma}\right|_{odd}$$

5. The method for detecting small targets in remote sensing images based on an improved YOLOv3 as claimed in claim 1, wherein: in step 3, the online data augmentation techniques comprise photometric distortion, geometric distortion, simulated occlusion and multi-image fusion;

6. The method for detecting small targets in remote sensing images based on an improved YOLOv3 as claimed in claim 1, wherein: in step 5, the position loss in the loss function is calculated with DIOU, which takes into account three factors, namely the degree of overlap, the direction of overlap and the positional relationship between the detection box and the ground-truth box, and directly minimizes the distance between the two boxes, so convergence is faster;

the modified loss function formula is:

7. The method for detecting small targets in remote sensing images based on an improved YOLOv3 as claimed in claim 1, wherein: in step 6, the model with the highest precision and recall on the validation set is selected and loaded into the network, and its detection performance on the test set is obtained.