CN114220015A - Improved YOLOv5-based satellite image small target detection method - Google Patents
- Publication number
- CN114220015A (application CN202111567696.2A)
- Authority
- CN
- China
- Prior art keywords
- network
- layer
- detection
- feature
- convolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a satellite image small target detection method based on an improved YOLOv5. The method has a degree of generality for small target detection; this patent takes small target detection in remote sensing images as its illustrative case. To address the false detections, missed detections, and insufficient feature extraction capability for small targets in remote sensing image target detection, a small target detection algorithm based on an improved YOLOv5 is proposed. The algorithm enhances the data with a Mosaic-6 method, replaces the backbone network with a Swin Transformer structure that has stronger feature extraction capability, and adjusts the loss function, so that the network can capture global information and rich contextual information. The network neck is modified by introducing the CBAM attention module into the feature pyramid and the path aggregation network, which helps the network adaptively refine intermediate feature maps and further improves the model's detection of small targets. The improved algorithm is applied to remote sensing images with dense small targets; experimental results show that, compared with the original YOLOv5 algorithm, it has stronger feature extraction capability and higher detection accuracy for small targets.
Description
Technical Field
The invention relates to the field of target detection in deep learning, and in particular to remote sensing image target detection technology aimed at small target detection.
Background
In the field of remote sensing, satellite images generally have very high resolution and contain a large number of small target objects. Because these targets are small in size and occupy few pixels relative to the whole image, it is difficult to detect them accurately when targets must be detected and identified quickly. As deep learning technology matures, more and more target detection methods are being applied to remote sensing images. Because many target objects coexist in an image, detecting and locating small targets is very challenging, and in practice large numbers of false detections and missed detections occur, degrading the overall detection performance. Research on small target detection in remote sensing images is therefore one of the hot spots in the development of artificial intelligence.
To accurately detect weak, small targets in remote sensing images, common detection methods include: the Haar classifier; the histogram of oriented gradients with a support vector machine (HOG + SVM); the deformable part model (DPM); and deep neural network based methods. The Haar classifier cascades strong classifiers trained by the AdaBoost algorithm and uses efficient rectangular features with the integral image for low-level feature extraction, but because the raw features contain little contextual information, it cannot extract higher-level features or reliably identify the targets to be detected. The histogram of oriented gradients (HOG) is a dense descriptor over local overlapping regions of the image; it builds features by computing histograms of gradient directions over local regions and is combined with an SVM classifier for detection, but descriptor generation is slow, dense targets are hard to handle, and the method is quite sensitive to noisy data. The deformable part model (DPM) can be regarded as an upgraded combination of gradient histograms and an SVM classifier, but its structure is relatively complex, detection is relatively slow, and it does not perform well on targets in complex scenes.
With the continuous progress and rapid development of deep learning, its application to remote sensing imagery has become increasingly widespread, particularly in target detection, where excellent frameworks such as YOLO, R-CNN and SSD have appeared; small target detection, however, has remained a difficult problem in the field. The invention aims to solve the problems caused by the large number of small targets in remote sensing images. The method has a degree of generality for small target detection, and the data enhancement module is improved for images containing small targets so that the network can learn and extract finer detail.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a small target detection technique based on an improved YOLOv5. The technique applies YOLOv5, a high-performance general-purpose target detection model in deep learning, and further improves the YOLOv5 algorithm for the problems found in remote sensing images, such as small target size and large target counts (as shown in FIG. 1).
The technical scheme adopted by the invention is as follows:
Step 1: the input image enters the network; adaptive anchor-box calculation is performed first, adaptive image scaling second, and Mosaic-6 data enhancement third;
Step 2: the feature extraction backbone adopts a Swin Transformer structure and comprises a first Focus slicing operation, a first downsampling layer, a second convolution-normalization layer, a second downsampling layer, a third convolution-normalization layer, a third downsampling layer, a fourth convolution-normalization layer, a fourth downsampling layer, a fifth convolution-normalization layer and a fifth downsampling layer;
Step 3: 1×1 convolutions are applied to the feature maps produced by the third to fifth downsampling layers of step 2, and the resulting feature maps are denoted M3, M4 and M5 respectively;
Step 4: this step is a conventional FPN structure, which adopts a top-down path for multi-scale target detection so that high-level semantic features are fused into lower layers containing rich positional information; M5 is passed through a 3×3 convolution to eliminate the aliasing effect brought by fusion and is denoted P5; M5 is upsampled by a factor of 2, added pixel-by-pixel to M4, and passed through a 3×3 convolution to eliminate the aliasing effect brought by fusion, generating a feature map denoted P4; M4 is upsampled by a factor of 2, added pixel-by-pixel to M3, and passed through a 3×3 convolution to eliminate the aliasing effect brought by fusion, generating a feature map P3;
Step 5: on the basis of the FPN of step 4, a bottom-up path, called a PAN (Path Aggregation Network), is added so that low-level features are fused with high-level features containing rich semantic information; P3 is taken as the bottom-level feature A3 and downsampled by a factor of 2, then added pixel-by-pixel to P4 to obtain feature map A4; A4 is downsampled by a factor of 2 and added pixel-by-pixel to P5 to generate feature map A5; as in step 4, A3-A5 are passed through 3×3 convolutions to eliminate aliasing effects, generating the final feature maps Q3-Q5;
Step 6: this step is the core of the patent; a lightweight attention module (CBAM) is inserted after each upsampling of step 4, inferring attention maps sequentially along the two independent dimensions of channel and space and multiplying them with the input feature map for adaptive feature refinement; similarly, a CBAM module is added after each downsampling of step 5 to learn a weight distribution from the features and apply it to the original features, changing their distribution so as to enhance effective features and suppress ineffective ones;
Step 7: the feature maps Q3-Q5 are fed into the YOLO detection-head network, whose anchor settings are obtained in advance by clustering the dataset; the candidate boxes output by the prediction network are then filtered by non-maximum suppression and mapped back to the original image size, the boxes select the target objects in the image, and the final detection result is obtained.
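For illustration, the top-down and bottom-up fusion of steps 4 and 5 can be sketched in PyTorch as follows. This is a reconstruction for explanation only, not the patented implementation: the channel width (256), nearest-neighbor upsampling, and max-pooling for the 2× downsampling are all assumptions.

```python
import torch
import torch.nn as nn

class FPNPANNeck(nn.Module):
    """Sketch of steps 4-5: a top-down FPN pass followed by a bottom-up
    PAN pass, each fusion smoothed by a 3x3 anti-aliasing convolution."""
    def __init__(self, ch=256):
        super().__init__()
        # six 3x3 convolutions that suppress the aliasing introduced by fusion
        self.smooth = nn.ModuleList([nn.Conv2d(ch, ch, 3, padding=1) for _ in range(6)])
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.down = nn.MaxPool2d(2)  # assumed form of the 2x down-sampling

    def forward(self, m3, m4, m5):
        # Step 4: top-down path (FPN), following the fusion order in the text
        p5 = self.smooth[0](m5)
        p4 = self.smooth[1](self.up(m5) + m4)
        p3 = self.smooth[2](self.up(m4) + m3)
        # Step 5: bottom-up path (PAN)
        a3 = p3
        a4 = self.down(a3) + p4
        a5 = self.down(a4) + p5
        return self.smooth[3](a3), self.smooth[4](a4), self.smooth[5](a5)
```

With M3/M4/M5 at strides 8/16/32 of a 640×640 input, the outputs Q3-Q5 keep the same 80×80, 40×40 and 20×20 resolutions, as the element-wise additions require.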
Compared with the prior art, the invention has the beneficial effects that:
(1) on the detection of a satellite image small target, higher identification precision can be achieved;
(2) for intensive target detection, a better detection effect can be shown.
Drawings
FIG. 1: typical small targets in a remote sensing image.
FIG. 2: flow chart of Mosaic data enhancement.
FIG. 3: detail schematic of Mosaic-6 data enhancement.
FIG. 4: schematic of the original YOLOv5 feature extraction model.
FIG. 5: receptive fields of each layer of the feature extraction network.
FIG. 6: sampling schematic after adding the Swin Transformer network.
FIG. 7: schematic of the improved feature fusion network.
FIG. 8: anchor sizes in the original YOLOv5.
FIG. 9: detection effect of the improved algorithm model on small image targets.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
First, the process by which the YOLOv5 network model's CSPNet extracts features from a remote sensing image is shown in FIG. 4. The input image has size 640×640×3. It first passes through the Focus structure, where a slicing operation reduces the spatial size of the image while increasing the number of channels, giving a feature map of 320×320×32; after the second convolution operation the feature map becomes 160×160×64; after the third, 80×80×128; after the fourth, 40×40×256; and after the fifth, 20×20×512.
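The size progression above can be reproduced with a short helper. This is a sketch of the halve-resolution/double-channels pattern only; the stage count and initial channel width of 32 are taken from the paragraph above, not from any published configuration file.

```python
def backbone_shapes(size=640, channels=3, stages=5, c0=32):
    """Trace the (H, W, C) shape through `stages` stride-2 stages, where each
    stage halves the spatial resolution and doubles the channel count."""
    shapes = [(size, size, channels)]
    c, s = c0, size
    for _ in range(stages):
        s //= 2                  # stride-2 slicing/convolution halves H and W
        shapes.append((s, s, c))
        c *= 2                   # channel count doubles at the next stage
    return shapes
```

Calling `backbone_shapes()` yields the sequence listed above, from (640, 640, 3) down to (20, 20, 512).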
In the convolutional network that generates these feature maps, the neurons producing low-level feature maps have passed through few stacked convolutions, so their receptive fields on the original image are small and they emphasize and preserve detail such as edges and texture; the neurons producing high-level feature maps have passed through many stacked convolutions, so their receptive fields on the original image are large and they emphasize and preserve semantic information. High-level features have been downsampled many times, and most fine detail has been discarded. FIG. 5 shows the receptive fields rooted in the output feature maps of each layer of the CSPNet.
YOLOv5 performs its subsequent classification and regression tasks on the feature maps output after 8×, 16× and 32× downsampling, i.e., feature maps at large, medium and small scales with strides of 8, 16 and 32. A small target in a remote sensing image, however, often occupies only a few pixels, and the semantic information the network can extract from those few pixels is very limited. In the extreme case, a small target may correspond to only a single point on the highest-level feature map, so small target detection must rely more on feature maps extracted by neurons with smaller receptive fields.
The invention then improves the YOLOv5 detection model by introducing a Swin Transformer feature extraction backbone. As shown in FIG. 6, features of the image to be detected are extracted through a deep network using the concept of Window Multi-Head Self-Attention (W-MSA): for example, at the 4× and 8× downsampling stages in the figure, the feature map is divided into several disjoint regions (windows), and multi-head self-attention is computed only within each window. This reduces the computational cost while information is still exchanged between adjacent windows, allowing the network to extract finer detail of the target.
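The window partitioning underlying W-MSA can be sketched as follows, after the publicly described Swin Transformer design. The (B, H, W, C) tensor layout and the assumption that H and W divide evenly by the window size are illustrative choices, not details taken from the patent.

```python
import torch

def window_partition(x, ws):
    """Split a feature map of shape (B, H, W, C) into non-overlapping
    ws x ws windows, so self-attention can be computed per window (W-MSA).
    Returns a tensor of shape (num_windows * B, ws, ws, C)."""
    B, H, W, C = x.shape
    # group rows and columns into window-sized blocks
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    # bring the two block indices together, then flatten them into the batch
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, ws, ws, C)
```

Attention computed on the partitioned tensor scales with the window area rather than with the full H×W map, which is the computational saving the text describes.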
Detailed description of the invention
(1) The input end preprocesses the image. Inspired by the Mosaic idea, an enhanced version of the Mosaic method, Mosaic-6, is adopted: six pictures are cropped, randomly arranged and randomly scaled, then combined into one picture. This increases the amount of sample data and reasonably introduces random noise, strengthening the network model's discrimination of small target samples in the image and improving the model's generalization ability.
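A hypothetical sketch of the Mosaic-6 idea follows: randomly crop six images and tile them into one training canvas. The 2×3 grid layout and the crop policy are assumptions; the patent states only that six pictures are cut, randomly arranged and randomly scaled, then combined into one picture (label remapping is omitted here).

```python
import numpy as np

def mosaic6(images, out_hw=(640, 640), rng=np.random.default_rng(0)):
    """Combine six images into one canvas on an assumed 2x3 grid.
    Each cell receives a random cell-sized crop of one source image,
    which acts as the random crop/scale jitter described in the text."""
    H, W = out_hw
    ch, cw = H // 2, W // 3              # cell size of the 2x3 grid
    order = rng.permutation(6)           # random arrangement of the six images
    canvas = np.zeros((H, W, 3), dtype=np.uint8)
    for k, idx in enumerate(order):
        r, c = divmod(k, 3)
        img = images[idx]
        y = rng.integers(0, max(1, img.shape[0] - ch))
        x = rng.integers(0, max(1, img.shape[1] - cw))
        canvas[r*ch:(r+1)*ch, c*cw:(c+1)*cw] = img[y:y+ch, x:x+cw]
    return canvas
```

In a real pipeline the bounding-box labels of each source image would be cropped, shifted and rescaled along with the pixels; only the image side is sketched here.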
(2) M5 is passed through a 3×3 convolution to eliminate the aliasing effect brought by fusion and is denoted P5; M5 is upsampled by a factor of 2, added pixel-by-pixel to M4, and passed through a 3×3 convolution to eliminate the aliasing effect brought by fusion, generating a feature map denoted P4; M4 is upsampled by a factor of 2, added pixel-by-pixel to M3, and passed through a 3×3 convolution to eliminate the aliasing effect brought by fusion, generating a feature map P3.
(3) On this basis, a bottom-up path, called a PAN (Path Aggregation Network), is added so that low-level features are fused with high-level features containing rich semantic information; P3 is taken as the bottom-level feature A3 and downsampled by a factor of 2, then added pixel-by-pixel to P4 to obtain feature map A4; A4 is downsampled by a factor of 2 and added pixel-by-pixel to P5 to generate feature map A5; as in step 4, A3-A5 are passed through 3×3 convolutions to eliminate aliasing effects, generating the final feature maps Q3-Q5.
the improvement has two advantages, on one hand, the model fully utilizes low-level features containing abundant detail information to detect small targets; on the other hand, the deep semantic features are transmitted from top to bottom by the feature pyramid network, the position information of the target is transmitted from bottom to top by the path aggregation network, the feature is better learned by the model through the fusion of the feature information from top to bottom and from bottom to top, and the sensitivity of the model to small targets and shielding targets is enhanced.
In the original YOLOv5, proceeding directly after a downsampling step loses part of the useful feature information; similarly, operating directly on the output of an upsampling step loses some features, leaving the recovery incomplete. Each sampling operation is therefore followed by an attention mechanism module (CBAM): given a feature map, CBAM infers attention maps in turn along the two independent dimensions of channel and space, then multiplies them with the input feature map for adaptive feature refinement, preserving more features for the next convolution operation.
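A minimal CBAM sketch following the published module design (channel attention from average- and max-pooled descriptors through a shared MLP, then spatial attention from channel-wise average and max maps). The reduction ratio of 16 and the 7×7 spatial kernel are the usual defaults, not values stated in the patent.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, each multiplied
    onto the feature map, as described in the text above."""
    def __init__(self, ch, reduction=16, kernel=7):
        super().__init__()
        # shared MLP for the channel-attention branch
        self.mlp = nn.Sequential(
            nn.Conv2d(ch, ch // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1, bias=False),
        )
        # conv over the 2-channel [avg, max] map for spatial attention
        self.spatial = nn.Conv2d(2, 1, kernel, padding=kernel // 2, bias=False)

    def forward(self, x):
        # channel attention: shared MLP over avg- and max-pooled descriptors
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # spatial attention: conv over channel-wise average and max maps
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```

The module is shape-preserving, so it can be dropped in after any sampling operation in the neck without changing downstream layer sizes.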
The anchor settings strongly affect the model's detection precision and convergence speed. The default anchor aspect ratios and sizes in YOLOv5 were validated on the COCO dataset, whereas anchors should be designed around the actual sizes of the targets to be detected. The objects here are small targets in remote sensing images, with many small targets and targets of roughly 1:1 aspect ratio, so the anchor sizes and aspect ratio parameters are set automatically according to the actual distribution of targets in the dataset.
The invention clusters anchor boxes on the remote sensing image dataset loaded into the network using the K-means clustering algorithm, automatically generating corresponding anchor sizes, and, combined with the multi-scale detection scheme described above, assigns anchors of different sizes to feature maps of different sizes. This amounts to adding good prior information and reduces the difficulty of bounding box regression to a certain extent.
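The anchor clustering can be sketched as plain k-means over (width, height) pairs. Note this is a simplification: YOLO implementations typically cluster with an IoU-based distance, while Euclidean distance is used here for brevity; the sort-by-area step reflects the idea of assigning small anchors to the high-resolution feature map.

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=50, seed=0):
    """Cluster an (N, 2) array of box (width, height) pairs into k anchors
    with plain k-means, then sort the anchors by area (small to large)."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign every box to its nearest center (Euclidean distance)
        d = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned boxes
        for j in range(k):
            if np.any(labels == j):
                centers[j] = wh[labels == j].mean(axis=0)
    # small anchors first, so they pair with the high-resolution head
    return centers[np.argsort(centers.prod(axis=1))]
```

In use, `wh` would be gathered from the ground-truth boxes of the remote sensing dataset, and the resulting anchors split into groups of three per detection scale.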
FIG. 9 shows the detection results of the improved YOLOv5 algorithm on the targets to be detected; the algorithm model accurately detects the small targets in the image, and problems such as false detections and missed detections are largely resolved.
Building on the original YOLOv5 algorithm, the method improves and optimizes three aspects, namely Mosaic data enhancement, the feature extraction backbone network, and the attention mechanism, effectively increasing the detection precision of the YOLOv5 network model for small target objects.
While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may, unless expressly stated otherwise, be replaced by an alternative feature serving the same, equivalent or similar purpose; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except combinations in which features or steps are mutually exclusive.
Claims (4)
1. A satellite image small target detection method based on an improved YOLOv5, characterized by comprising the following steps:
step 1: the input image enters the network; adaptive image scaling is performed first, followed by Mosaic-6 data enhancement;
step 2: the feature extraction backbone adopts a Swin Transformer structure and comprises a first Focus slicing operation, a first downsampling layer, a second convolution-normalization layer, a second downsampling layer, a third convolution-normalization layer, a third downsampling layer, a fourth convolution-normalization layer, a fourth downsampling layer, a fifth convolution-normalization layer and a fifth downsampling layer;
step 3: 1×1 convolutions are applied to the feature maps produced by the third to fifth downsampling layers of step 2, and the resulting feature maps are denoted M3, M4 and M5 respectively;
step 4: this step is a conventional FPN structure, which adopts a top-down path for multi-scale target detection so that high-level semantic features are fused into lower layers containing rich positional information; M5 is passed through a 3×3 convolution to eliminate the aliasing effect brought by fusion and is denoted P5; M5 is upsampled by a factor of 2, added pixel-by-pixel to M4, and passed through a 3×3 convolution to eliminate the aliasing effect brought by fusion, generating a feature map denoted P4; M4 is upsampled by a factor of 2, added pixel-by-pixel to M3, and passed through a 3×3 convolution to eliminate the aliasing effect brought by fusion, generating a feature map P3;
step 5: on the basis of the FPN of step 4, a bottom-up path, called a PAN (Path Aggregation Network), is added so that low-level features are fused with high-level features containing rich semantic information; P3 is taken as the bottom-level feature A3 and downsampled by a factor of 2, then added pixel-by-pixel to P4 to obtain feature map A4; A4 is downsampled by a factor of 2 and added pixel-by-pixel to P5 to generate feature map A5; as in step 4, A3-A5 are passed through 3×3 convolutions to eliminate aliasing effects, generating the final feature maps Q3-Q5;
step 6: this step is the core of the patent; a lightweight attention module (CBAM) is inserted after each upsampling of step 4, inferring attention maps sequentially along the two independent dimensions of channel and space and multiplying them with the input feature map for adaptive feature refinement; similarly, a CBAM module is added after each downsampling of step 5 to learn a weight distribution from the features and apply it to the original features, changing their distribution so as to enhance effective features and suppress ineffective ones;
step 7: the feature maps Q3-Q5 are fed into the YOLO detection-head network, whose anchor settings are obtained in advance by clustering the dataset; the candidate boxes output by the prediction network are then filtered by non-maximum suppression and mapped back to the original image size, the boxes select the target objects in the image, and the final detection result is obtained.
2. The method of claim 1, wherein the Mosaic-6 data enhancement module in step 1 combines 6 pictures into one picture after random cropping, random arrangement and random scaling.
3. The method of claim 1, wherein the Swin Transformer backbone network in step 2 has stronger feature extraction capability.
4. The method of claim 1, wherein after the lightweight attention mechanism module (CBAM) of step 6 is introduced after the convolutional layers, the features can cover more parts of the object to be identified, so that the probability of identifying the object becomes higher, which helps the network focus on key information and find the region of interest.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111567696.2A CN114220015A (en) | 2021-12-21 | 2021-12-21 | Improved YOLOv5-based satellite image small target detection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111567696.2A CN114220015A (en) | 2021-12-21 | 2021-12-21 | Improved YOLOv5-based satellite image small target detection method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114220015A true CN114220015A (en) | 2022-03-22 |
Family
ID=80704553
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111567696.2A Pending CN114220015A (en) | 2021-12-21 | 2021-12-21 | Improved YOLOv5-based satellite image small target detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114220015A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111179217A (en) * | 2019-12-04 | 2020-05-19 | 天津大学 | Attention mechanism-based remote sensing image multi-scale target detection method |
CN111914917A (en) * | 2020-07-22 | 2020-11-10 | 西安建筑科技大学 | Target detection improved algorithm based on feature pyramid network and attention mechanism |
CN112465752A (en) * | 2020-11-16 | 2021-03-09 | 电子科技大学 | Improved Faster R-CNN-based small target detection method |
WO2021139069A1 (en) * | 2020-01-09 | 2021-07-15 | 南京信息工程大学 | General target detection method for adaptive attention guidance mechanism |
CN113361428A (en) * | 2021-06-11 | 2021-09-07 | 浙江澄视科技有限公司 | Image-based traffic sign detection method |
WO2021244079A1 (en) * | 2020-06-02 | 2021-12-09 | 苏州科技大学 | Method for detecting image target in smart home environment |
Non-Patent Citations (2)
Title |
---|
Zhou Xing; Chen Lifu: "Remote sensing image target detection based on a dual attention mechanism", Computer and Modernization (计算机与现代化), no. 08, 15 August 2020 (2020-08-15) *
Ma Senquan; Zhou Ke: "Improved small target detection algorithm based on attention mechanism and feature fusion", Computer Applications and Software (计算机应用与软件), no. 05, 12 May 2020 (2020-05-12) *
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114677362A (en) * | 2022-04-08 | 2022-06-28 | Sichuan University | Surface defect detection method based on improved YOLOv5 |
CN114677362B (en) * | 2022-04-08 | 2023-09-12 | Sichuan University | Surface defect detection method based on improved YOLOv5 |
CN114913428A (en) * | 2022-04-26 | 2022-08-16 | Harbin University of Science and Technology | Remote sensing image target detection system based on deep learning |
CN115273017A (en) * | 2022-04-29 | 2022-11-01 | Guilin University of Electronic Technology | Traffic sign detection recognition model training method and system based on Yolov5 |
CN114998759A (en) * | 2022-05-27 | 2022-09-02 | University of Electronic Science and Technology of China | High-precision SAR ship detection method based on Vision Transformer |
CN114677504B (en) * | 2022-05-30 | 2022-11-15 | Shenzhen Aishen Yingtong Information Technology Co., Ltd. | Target detection method, device, equipment terminal and readable storage medium |
CN114677504A (en) * | 2022-05-30 | 2022-06-28 | Shenzhen Aishen Yingtong Information Technology Co., Ltd. | Target detection method, device, equipment terminal and readable storage medium |
CN115272987B (en) * | 2022-07-07 | 2023-08-22 | Huaiyin Institute of Technology | MSA-Yolov5-based vehicle detection method and device in severe weather |
CN115272987A (en) * | 2022-07-07 | 2022-11-01 | Huaiyin Institute of Technology | MSA-Yolov5-based vehicle detection method and device in severe weather |
CN115294483A (en) * | 2022-09-28 | 2022-11-04 | Shandong University | Small target identification method and system for complex scene of power transmission line |
CN116152591B (en) * | 2022-11-25 | 2023-11-07 | Sun Yat-sen University | Model training method, infrared small target detection method and device and electronic equipment |
CN116152591A (en) * | 2022-11-25 | 2023-05-23 | Sun Yat-sen University | Model training method, infrared small target detection method and device and electronic equipment |
CN116109966B (en) * | 2022-12-19 | 2023-06-27 | Aerospace Information Research Institute, Chinese Academy of Sciences | Remote sensing scene-oriented video large model construction method |
CN116109966A (en) * | 2022-12-19 | 2023-05-12 | Aerospace Information Research Institute, Chinese Academy of Sciences | Remote sensing scene-oriented video large model construction method |
CN116385903A (en) * | 2023-05-29 | 2023-07-04 | Harbin Institute of Technology (Shenzhen) (Harbin Institute of Technology Shenzhen Science and Technology Innovation Research Institute) | Anti-distortion on-orbit target detection method and model for level-1 remote sensing data |
CN116385903B (en) * | 2023-05-29 | 2023-09-19 | Harbin Institute of Technology (Shenzhen) (Harbin Institute of Technology Shenzhen Science and Technology Innovation Research Institute) | Anti-distortion on-orbit target detection method and model for level-1 remote sensing data |
CN117274957A (en) * | 2023-11-23 | 2023-12-22 | Southwest Jiaotong University | Road traffic sign detection method and system based on deep learning |
CN117274957B (en) * | 2023-11-23 | 2024-03-01 | Southwest Jiaotong University | Road traffic sign detection method and system based on deep learning |
CN117671509A (en) * | 2024-02-02 | 2024-03-08 | Wuhan Zhuomu Technology Co., Ltd. | Remote sensing target detection method and device, electronic equipment and storage medium |
CN117671509B (en) * | 2024-02-02 | 2024-05-24 | Wuhan Zhuomu Technology Co., Ltd. | Remote sensing target detection method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114220015A (en) | Improved YOLOv5-based satellite image small target detection method | |
CN107844779B (en) | Video key frame extraction method | |
CN109684925B (en) | Depth image-based human face living body detection method and device | |
Zhou et al. | Robust vehicle detection in aerial images using bag-of-words and orientation aware scanning | |
CN108062525B (en) | Deep learning hand detection method based on hand region prediction | |
CN110929593B (en) | Real-time saliency pedestrian detection method based on detail discrimination | |
CN110263712B (en) | Coarse and fine pedestrian detection method based on region candidates | |
TW201926140A (en) | Method, electronic device and non-transitory computer readable storage medium for image annotation | |
Lee et al. | SNIDER: Single noisy image denoising and rectification for improving license plate recognition | |
CN111353544B (en) | Improved Mixed Pooling-YOLOv3-based target detection method | |
EP2864933A1 (en) | Method, apparatus and computer program product for human-face features extraction | |
CN113591968A (en) | Infrared weak and small target detection method based on asymmetric attention feature fusion | |
Danelljan et al. | Deep motion and appearance cues for visual tracking | |
CN109035300B (en) | Target tracking method based on depth feature and average peak correlation energy | |
CN112950477A (en) | High-resolution saliency target detection method based on dual-path processing | |
Lu et al. | Learning attention map from images | |
CN112785480B (en) | Image splicing tampering detection method based on frequency domain transformation and residual error feedback module | |
Xu et al. | Dktnet: dual-key transformer network for small object detection | |
CN112580480A (en) | Hyperspectral remote sensing image classification method and device | |
Xu et al. | LMO-YOLO: A ship detection model for low-resolution optical satellite imagery | |
Fan et al. | A novel sonar target detection and classification algorithm | |
Luo et al. | Weakly supervised learning for raindrop removal on a single image | |
Xu et al. | COCO-Net: A dual-supervised network with unified ROI-loss for low-resolution ship detection from optical satellite image sequences | |
Li et al. | SKRWM based descriptor for pedestrian detection in thermal images | |
CN116168328A (en) | Thyroid nodule ultrasonic inspection system and method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||