CN115527096A - Small target detection method based on improved YOLOv5

Small target detection method based on improved YOLOv5

Info

Publication number
CN115527096A
Authority
CN
China
Prior art keywords
convolution
network
feature
conditional
target detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211365030.3A
Other languages
Chinese (zh)
Inventor
任向楠
倪海峰
张峰
范文涛
王琪瑶
赵莹
贺超
董兴东
张帆
谢继顺
陈大明
徐仰惠
牛慧卓
赵万存
李同磊
单洪朋
孟祥振
李吉鑫
魏光旭
徐蒙蒙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Xinguang Photoelectric Technology Co ltd
Shandong Sheenrun Optics Electronics Co Ltd
Original Assignee
Shandong Xinguang Photoelectric Technology Co ltd
Shandong Sheenrun Optics Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Xinguang Photoelectric Technology Co ltd, Shandong Sheenrun Optics Electronics Co Ltd filed Critical Shandong Xinguang Photoelectric Technology Co ltd
Priority to CN202211365030.3A priority Critical patent/CN115527096A/en
Publication of CN115527096A publication Critical patent/CN115527096A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a small target detection method based on improved YOLOv5. The method uses Mosaic data enhancement and, because small targets occupy few pixels, prunes the Focus structure from the YOLOv5 backbone network, reducing the information lost during slicing. Conditionally parameterized convolution is combined with a residual network to form a conditional residual unit; the residual structure preserves the features from before the convolution operation, so features from different stages can be fused effectively. The SPP structure with conditionally parameterized convolution and weighted pooling learns a specific set of parameters for each sample, which improves the model's effective use of features while maintaining efficient inference speed. The improved method is applied to the detection of various small targets, and experimental results show that, in both simple and complex scenes, it achieves higher detection accuracy for small targets than the original YOLOv5 algorithm.

Description

Small target detection method based on improved YOLOv5
Technical Field
The invention relates to the field of target detection in computer vision, and in particular to a small target detection method based on improved YOLOv5.
Background
Small target detection is an important research direction in the field of image analysis and processing: a computer effectively analyzes and processes remotely captured image data, identifying different types of targets and marking their positions. The technique is widely applied in scenes such as urban intelligent traffic, disaster prevention and relief, and border-defense security, and saves a great deal of manpower and time, so small target detection has very important research significance and practical value. Small targets are targets of small size and are generally defined in two ways: (1) absolute size: in the COCO dataset, objects smaller than 32 × 32 pixels are considered small targets; (2) relative size: as defined by the international society for optical engineering (SPIE), a small target occupies an image area of fewer than 80 pixels in a 256 × 256 image, i.e., a target can be considered small if it covers less than 0.12% of the original image. The difficulty of small target detection lies mainly in the following points: (1) the target's pixel area is small and contains too little feature information, and feature information in infrared images in particular is severely lost; (2) dataset distributions are unbalanced: small targets form a small proportion of existing standard datasets, and there is serious image-level imbalance; (3) targets in the datasets are occluded, blurred, or incomplete, so small-target information is severely lost.
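The two size criteria above can be expressed directly; the following minimal Python sketch (function and variable names are illustrative, not part of the disclosure) applies both thresholds to a bounding box:

```python
def is_small_target(box_w, box_h, img_w, img_h):
    """Apply the two small-target definitions given in the text.

    Absolute (COCO): area below 32 x 32 pixels.
    Relative (SPIE): area below 0.12% of the image, i.e. fewer than
    80 pixels in a 256 x 256 image (80 / 65536 is roughly 0.122%).
    """
    area = box_w * box_h
    absolute_small = area < 32 * 32
    relative_small = area / float(img_w * img_h) < 0.0012
    return absolute_small, relative_small

# Example: a 20 x 14 box in a 640 x 640 image is small by both criteria.
print(is_small_target(20, 14, 640, 640))  # (True, True)
```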
In recent years, with the continuous development of deep learning theory and growing social demand, small target detection has drawn increasingly wide attention, but research aimed specifically at small targets remains scarce. Existing small target detection algorithms generally build on existing target detection methods and enhance the model's robustness to multi-scale targets by deepening the network to extract richer features or by complicating the feature fusion process, thereby improving small target detection performance. Classic target detection algorithms such as Faster R-CNN, SSD, and YOLO perform well in both accuracy and speed, so many researchers improve small target detection on their basis. Two-stage-based methods mainly add feature extraction for the region of interest and pay more attention to the importance of spatial features, so as to enhance small target detection performance; single-stage-based improvements mainly exploit the high-resolution low-level features rich in detail information through multi-scale feature fusion. In addition, generative adversarial networks, data augmentation, and other techniques are also widely applied to the small target detection problem. However, whether multi-scale feature fusion is added or a larger backbone network is applied, the time complexity of the network increases greatly and the real-time performance of the target detection model drops substantially.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a small target detection method based on improved YOLOv5, which improves and simplifies the structure of the target detection network, improves feature utilization for small targets, and preserves the real-time performance of the target detection model.
In order to solve the above technical problem, the technical scheme adopted by the invention is as follows: a small target detection method based on improved YOLOv5, comprising the following steps:
S01) collecting data for small target detection and making an image dataset in the YOLO label format;
S02) inputting the image dataset into the network for data enhancement;
S03) after data enhancement, inputting the image dataset into the feature extraction network, wherein the feature extraction backbone adopts an improved CSPDarkNet in which the Focus structure of the original YOLOv5 backbone is deleted; the five-layer network structure comprises downsampling conditional convolution layers, an SPP module, and conditional residual units (Res units), and three feature maps of different scales are obtained from the third, fourth, and fifth layers, denoted F3, F4, and F5 respectively;
the convolution layer under the down-sampling conditionThe standard convolution in the original YOLOv5 is replaced by a conditional parameterized convolution, parameterizing the convolution kernels in the conditional convolution layer to a linear combination of n experts (alpha) 1 W 1 +…+α n W n ) X, wherein α i =r i (x) Example-dependent scalar weights, W, computed using a routing function with learning parameters i Is a convolution kernel, x is a feature input to the convolution layer;
the SPP module performs multi-scale fusion by adopting SoftPool weighted pooling of k = {1 × 1,5 × 5,9 × 9,13 × 13}, and performs splicing operation on feature maps with different scales; weighted pooling using softmax, computing eigenvalue weights for regions from nonlinear eigenvalues
Figure BDA0003921706700000021
Wherein w i Taking the feature weight as a, wherein a is an activity value, i and j are indexes of the obtained activity value in the feature matrix, and R is a local calculation area; after the weight of the characteristic value is obtained, the output result is obtained through the characteristic value of the weighting area
Figure BDA0003921706700000022
the conditional residual unit is composed of conditional convolution layers, and a conditional convolution is added in its shortcut connection to expand the feature channels;
S04) the feature maps obtained in step S03) are passed to the neck of the target detection network; the neck adopts an FPN + PAN feature fusion network based on the CCSP2 network structure, performing feature fusion top-down and then bottom-up to finally obtain three enhanced feature maps of different scales, denoted A3, A4, and A5 respectively;
S05) the enhanced feature maps A3, A4, and A5 are input to the head of the target detection network, where each undergoes a conditional convolution that further screens and enhances features related to specific classes, finally yielding three predicted feature maps of different scales, denoted P3, P4, and P5 respectively; the prediction prior boxes are obtained dynamically by clustering the dataset, and the prediction network outputs the final candidate boxes through non-maximum suppression and maps them back to the original image size, finally obtaining the detection result for the target object.
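The box selection in step S05) can be illustrated as follows; this is a minimal sketch using torchvision's non-maximum suppression, with the thresholds and the letterbox-undo helper chosen for illustration rather than taken from the disclosure:

```python
import torch
from torchvision.ops import nms

def select_boxes(boxes, scores, score_thr=0.25, iou_thr=0.45):
    """Confidence filtering followed by NMS, as in step S05).

    boxes:  (N, 4) tensor, (x1, y1, x2, y2) at network-input scale.
    scores: (N,) confidence scores.
    """
    keep = scores > score_thr
    boxes, scores = boxes[keep], scores[keep]
    kept = nms(boxes, scores, iou_thr)          # suppress overlapping boxes
    return boxes[kept], scores[kept]

def map_to_original(boxes, scale, pad_x, pad_y):
    """Undo a letterbox resize so boxes refer to original image coordinates."""
    out = boxes.clone()
    out[:, [0, 2]] = (out[:, [0, 2]] - pad_x) / scale
    out[:, [1, 3]] = (out[:, [1, 3]] - pad_y) / scale
    return out
```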
Further, the data enhancement in step S02) splices 4 images by random scaling, random cropping, and random arrangement, which enriches the dataset's scenes and increases the number of small targets.
Further, step S03) adopts a 2× downsampling conditional convolution layer as the first layer of the backbone network, which both enhances the features of the region of interest and increases the number of channels.
Further, the routing function adopted in step S03) is $r_i(x) = \mathrm{Sigmoid}(\mathrm{GlobalAveragePool}(x)\,R)$, where $R$ is the weight matrix, Sigmoid(·) is the sigmoid function, and GlobalAveragePool(·) is the global average pooling function.
Further, in step S04) the CCSP2 network structure is introduced into the FPN and the PAN; the CCSP2 network is formed by splicing several conditional convolution layers and convolution kernels, and feature merging through its cross-stage hierarchical structure strengthens feature screening and feature fusion; the FPN fuses features of different scales top-down, fusing high-level features with low-level features after upsampling; the PAN fuses features of different scales bottom-up, fusing low-level features with high-level features after a 2× downsampling conditional convolution.
The beneficial effects of the invention are as follows:
(1) The feature extraction backbone is improved: in the improved YOLOv5 model the original standard convolution layers are replaced with conditional convolution layers, which corrects the feature extraction so that the network pays more attention to features related to specific classes and becomes more sensitive to targets; this increases model size and capacity while maintaining efficient inference speed.
(2) The feature fusion network is improved: feature fusion through the top-down and bottom-up FPN + PAN structure makes full use of the detail information in low-level features and the semantic information in high-level features, helping the model learn better features; the downsampling conditional convolution layers strengthen the target features during fusion, effectively improving detection accuracy and enhancing the robustness of the algorithm.
(3) SoftPool in the SPP structure performs weighted pooling with softmax, which increases the discrimination between similar feature information and clearly separates important contextual features while retaining the feature information of the whole receptive field; the function of the pooling layer is preserved while the information lost during pooling is minimized, effectively improving the model's detection accuracy.
(4) Compared with the original YOLOv5, conditional residual units (Res units) replace the CSP structure in the improved network, applying the residual idea directly within the unit. The conditional convolution in the conditional residual unit, which acts like an attention mechanism, pays more attention to target-related features and can expand the feature channels while sampling the features. The residual structure avoids vanishing gradients during sampling, better retains low-level features, and enhances feature diversity during feature fusion.
Drawings
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a diagram of the overall network architecture;
FIG. 3 is a schematic diagram of data enhancement;
FIG. 4 shows pictures from part of the dataset.
Detailed Description
The present invention will be further described with reference to the following examples.
Example 1
The technical solutions in the embodiments of the present invention are described below clearly and completely. The embodiments described herein are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
As shown in FIG. 1 and FIG. 2, this embodiment discloses a small target detection method based on improved YOLOv5, which comprises the following specific steps:
S01) collecting image data of typical small targets, using both visible-light and infrared acquisition; the scenes include simple backgrounds such as water and sky as well as complex scenes such as urban land backgrounds; the target types mainly include common high-value targets such as people, vehicles (with labels distinguishing cars, trucks, buses, bicycles, and the like), ships, and unmanned aerial vehicles. To increase the robustness of the dataset, it contains many occluded and overlapping targets.
Object classification labeling is carried out using the labelImg labeling tool, with labels in the YOLO format.
S02) as shown in FIG. 3, the labeled dataset is enhanced with Mosaic data enhancement: a batch of data is taken from the small-target dataset to be detected, and four images randomly selected from it are randomly cropped, arranged, scaled, and spliced into one image that contains more target types and instances and has a more complex scene. Repeating this step strengthens the network model's discrimination of small-target samples and improves the model's generalization ability.
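A minimal sketch of this Mosaic stitching is given below (OpenCV and NumPy; the scale range, split-point range, and gray fill value are assumptions, and the remapping of box labels into the stitched frame is omitted):

```python
import cv2
import numpy as np

def mosaic4(images, out_size=640):
    """Stitch 4 images into one Mosaic training sample (step S02)).

    Each image is randomly rescaled, padded if needed, randomly cropped,
    and placed into one quadrant around a random split point (cx, cy).
    """
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)
    cx = np.random.randint(out_size // 4, 3 * out_size // 4)
    cy = np.random.randint(out_size // 4, 3 * out_size // 4)
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        w, h = x2 - x1, y2 - y1
        s = np.random.uniform(0.5, 1.5)                      # random zoom
        img = cv2.resize(img, (max(1, int(img.shape[1] * s)),
                               max(1, int(img.shape[0] * s))))
        ih, iw = img.shape[:2]
        img = cv2.copyMakeBorder(img, 0, max(0, h - ih), 0, max(0, w - iw),
                                 cv2.BORDER_CONSTANT, value=(114, 114, 114))
        ih, iw = img.shape[:2]
        top = np.random.randint(0, ih - h + 1)               # random crop
        left = np.random.randint(0, iw - w + 1)
        canvas[y1:y2, x1:x2] = img[top:top + h, left:left + w]
    return canvas
```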
S03) the enhanced data are input into the network for iterative training. The improved CSPDarkNet is adopted as the backbone for feature extraction. The original YOLOv5 backbone uses a Focus slicing module that slices the image at every other pixel; although this increases the number of channels, a small target occupies few pixels, and the slicing causes severe information loss. The module is therefore replaced with a 2× downsampling conditional convolution layer, which both enhances the features of the region of interest and increases the number of channels. Because a target in the small target detection task contains few pixel values (sometimes only a handful in the original detection image), standard convolution weakens the important features of a small target, and deep convolution loses a large number of important features; the invention therefore uniformly adopts conditional convolution, giving greater weight to small-target features and preserving the importance of small targets throughout the convolution process.
The downsampling conditional convolution layer replaces the original standard convolution with a conditionally parameterized convolution, parameterizing the convolution kernel of the conditional convolution layer as a linear combination of n experts, $(\alpha_1 W_1 + \cdots + \alpha_n W_n) * x$, where $\alpha_i = r_i(x)$ is an example-dependent scalar weight computed by a routing function with learned parameters and $W_i$ is a convolution kernel; the routing function is $r_i(x) = \mathrm{Sigmoid}(\mathrm{GlobalAveragePool}(x)\,R)$, where $R$ is the weight matrix.
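A conditionally parameterized convolution of this form can be sketched in PyTorch as follows (a minimal sketch; the number of experts, the initialization, and the grouped-convolution trick for applying per-example kernels are implementation assumptions, not specified by the disclosure):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondConv2d(nn.Module):
    """Conditional convolution: the effective kernel is the linear
    combination (a_1*W_1 + ... + a_n*W_n) with routing weights
    a = Sigmoid(GlobalAveragePool(x) @ R), as in the text."""

    def __init__(self, c_in, c_out, k=3, stride=1, n_experts=4):
        super().__init__()
        self.experts = nn.Parameter(torch.randn(n_experts, c_out, c_in, k, k) * 0.02)
        self.routing = nn.Linear(c_in, n_experts)   # plays the role of R
        self.stride, self.pad = stride, k // 2

    def forward(self, x):
        b, c, h, w = x.shape
        alpha = torch.sigmoid(self.routing(x.mean(dim=(2, 3))))       # (b, n)
        # per-example kernels: (b, c_out, c_in, k, k)
        kernels = torch.einsum('bn,noihw->boihw', alpha, self.experts)
        kernels = kernels.reshape(-1, c, *self.experts.shape[-2:])
        # grouped convolution applies each example's own kernel to that example
        out = F.conv2d(x.reshape(1, b * c, h, w), kernels,
                       stride=self.stride, padding=self.pad, groups=b)
        return out.reshape(b, -1, out.shape[-2], out.shape[-1])

# Example: a 2x downsampling conditional convolution, 640 -> 320.
y = CondConv2d(3, 32, k=3, stride=2)(torch.randn(2, 3, 640, 640))
print(y.shape)  # torch.Size([2, 32, 320, 320])
```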
The SPP module adopts SoftPool weighted pooling with k = {1 × 1, 5 × 5, 9 × 9, 13 × 13} for multi-scale fusion and splices the feature maps of different scales. Weighted pooling uses softmax, computing the feature-value weights of a region from the nonlinear feature values as

$$w_i = \frac{e^{a_i}}{\sum_{j \in R} e^{a_j}}$$

where $w_i$ is the feature weight, $a$ is an activation value, $i$ and $j$ index the activation values within the feature matrix, and $R$ is the local pooling region; after the feature-value weights are obtained, the output is produced by weighting the region's feature values:

$$\tilde{a} = \sum_{i \in R} w_i a_i$$
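These two formulas can be implemented compactly: because the weights form a softmax over the pooling region, the output equals sum(e^a · a) / sum(e^a), and both sums can be obtained with average pooling (the kernel-area normalizers cancel). A minimal sketch follows, with the SPP-style concatenation shown as usage; a numerically stable version would subtract a per-window maximum before exponentiating:

```python
import torch
import torch.nn.functional as F

def soft_pool2d(x, k, stride=1):
    """SoftPool: w_i = exp(a_i)/sum_j exp(a_j); out = sum_i w_i * a_i."""
    e = torch.exp(x)
    num = F.avg_pool2d(e * x, k, stride, padding=k // 2)   # ~ sum(e^a * a)
    den = F.avg_pool2d(e, k, stride, padding=k // 2)       # ~ sum(e^a)
    return num / den.clamp_min(1e-12)

# SPP-style multi-scale fusion: pool at k = 1, 5, 9, 13, then splice.
feats = torch.randn(1, 512, 20, 20)
spp = torch.cat([soft_pool2d(feats, k) for k in (1, 5, 9, 13)], dim=1)
print(spp.shape)  # torch.Size([1, 2048, 20, 20])
```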
The conditional residual unit draws on the residual structure of the ResNet network: it is composed mainly of conditional convolution layers, and a conditional convolution is added in the shortcut connection to expand the feature channels. The basic principle of these conditional convolution layers is the same as that of the downsampling conditional convolution layer; the number of conditional convolution layers is determined by the depth of the network, and the deeper the network, the more conditional convolutions it contains.
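Reusing the CondConv2d sketch above, one plausible arrangement of such a unit is the following (the kernel sizes and activation are assumptions; the disclosure specifies only conditional convolutions on the main path and a channel-expanding conditional convolution on the shortcut):

```python
import torch.nn as nn

class CondResUnit(nn.Module):
    """Conditional residual unit: conditional convolutions on the main path,
    plus a 1x1 conditional convolution on the shortcut connection so the
    residual addition matches the expanded channel count."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv1 = CondConv2d(c_in, c_out, k=1)
        self.conv2 = CondConv2d(c_out, c_out, k=3)
        self.shortcut = CondConv2d(c_in, c_out, k=1)  # expands channels
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.conv2(self.conv1(x)) + self.shortcut(x))
```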
After the first downsampling conditional convolution operation, the 640 × 640 × 3 input image becomes a feature map of size 320 × 320 × 32; after the second, the feature map becomes 160 × 160 × 64; after the third, 80 × 80 × 128; after the fourth, 40 × 40 × 256; and after the fifth, 20 × 20 × 512. The feature maps obtained from the third, fourth, and fifth layers are denoted F3, F4, and F5.
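The stage sizes listed above follow from five successive 2× downsampling convolutions; the check below (reusing CondConv2d, and omitting the Res units and SPP module for brevity) reproduces them:

```python
import torch

x = torch.randn(1, 3, 640, 640)
feats = []
for c_in, c_out in [(3, 32), (32, 64), (64, 128), (128, 256), (256, 512)]:
    x = CondConv2d(c_in, c_out, k=3, stride=2)(x)
    feats.append(x)
F3, F4, F5 = feats[2:]                     # third, fourth, fifth stages
print(F3.shape, F4.shape, F5.shape)
# torch.Size([1, 128, 80, 80]) torch.Size([1, 256, 40, 40]) torch.Size([1, 512, 20, 20])
```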
S04) the obtained feature maps are passed to the neck of the target detection network, which adopts an FPN + PAN feature fusion network based on the CCSP2 network structure. The FPN performs multi-scale feature fusion top-down; before fusion, the high-level features are first strengthened by a CCSP2 structure and then passed through a 3 × 3 conditional convolution to eliminate the aliasing effect brought by fusion. The feature map obtained from F5 by these operations is denoted M5; M5 is upsampled by a factor of 2 and fused with F4, and the fused features, after CCSP2 strengthening and conditional convolution, give the feature map M4; M4 is upsampled by a factor of 2 and fused with F3, the fused features are strengthened by CCSP2 to give the feature map M3, and after a conditional convolution operation M3 becomes P3, the model's final bottom-level feature map.
The PAN performs feature fusion bottom-up; similarly, before multi-scale fusion the low-level features are strengthened by a CCSP2 structure, and a 3 × 3 downsampling conditional convolution then eliminates the aliasing effect brought by fusion. M3 obtained from the FPN serves as the low-level feature A3; after 2× downsampling it is fused with M4, and after CCSP2 strengthening the result is denoted A4, which after a conditional convolution operation becomes P4, the model's final middle-level feature map; A4, after 2× downsampling, is fused with M5, and after CCSP2 strengthening the result is denoted A5, which after a conditional convolution operation becomes P5, the model's final top-level feature map.
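The data flow of this neck can be sketched as follows (again reusing CondConv2d; here a single conditional convolution stands in for each CCSP2 block, and the channel widths are chosen to match the F3/F4/F5 shapes above, both of which are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Neck(nn.Module):
    """FPN (top-down) + PAN (bottom-up) fusion over F3/F4/F5."""

    def __init__(self, c3=128, c4=256, c5=512):
        super().__init__()
        self.reduce5 = CondConv2d(c5, c4, k=1)            # F5 -> M5
        self.fuse_m4 = CondConv2d(c4 + c4, c3, k=3)       # up(M5)+F4 -> M4
        self.fuse_m3 = CondConv2d(c3 + c3, c3, k=3)       # up(M4)+F3 -> M3
        self.down3   = CondConv2d(c3, c3, k=3, stride=2)  # A3 downsample
        self.fuse_a4 = CondConv2d(c3 + c3, c4, k=3)       # down(A3)+M4 -> A4
        self.down4   = CondConv2d(c4, c4, k=3, stride=2)  # A4 downsample
        self.fuse_a5 = CondConv2d(c4 + c4, c5, k=3)       # down(A4)+M5 -> A5

    def forward(self, f3, f4, f5):
        m5 = self.reduce5(f5)
        m4 = self.fuse_m4(torch.cat([F.interpolate(m5, scale_factor=2), f4], 1))
        m3 = self.fuse_m3(torch.cat([F.interpolate(m4, scale_factor=2), f3], 1))
        a3 = m3
        a4 = self.fuse_a4(torch.cat([self.down3(a3), m4], 1))
        a5 = self.fuse_a5(torch.cat([self.down4(a4), m5], 1))
        return a3, a4, a5   # one more conditional conv each yields P3, P4, P5
```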
In the YOLOv5 algorithm, different datasets have different initially set prior boxes (anchors), and the initial anchors are an important part of network training. The network outputs prediction boxes on the basis of the initial anchors, compares them with the ground-truth boxes, computes the loss between the two, and then updates the network parameters by backpropagation. In the invention, the k-means clustering algorithm clusters anchors on the small target detection training set and automatically generates anchor sizes suited to the dataset. Since the detection uses a multi-scale fusion strategy, anchor sizes are set for the feature maps of different scales. The initial anchors obtained by clustering are added to the network model as prior information, which greatly reduces the difficulty of box regression.
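A minimal anchor-clustering sketch is given below; it uses plain k-means on (width, height) pairs with a Euclidean distance, whereas YOLOv5's own autoanchor uses an IoU-style metric with genetic refinement, so treat this as an illustration of the clustering step only:

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=50):
    """Cluster ground-truth (w, h) pairs into k anchors, then sort by area
    and split across the three prediction scales (assumes k = 9)."""
    wh = np.asarray(wh, dtype=np.float32)
    centers = wh[np.random.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        d = ((wh[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)                      # nearest-center assignment
        for j in range(k):
            if (labels == j).any():
                centers[j] = wh[labels == j].mean(0)
    return centers[np.argsort(centers.prod(1))].reshape(3, 3, 2)

# Example: cluster 500 random small-ish boxes into anchors for P3/P4/P5.
anchors = kmeans_anchors(np.random.uniform(4, 96, size=(500, 2)))
print(anchors.round(1))
```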
On the basis of the original YOLOv5 algorithm, the method optimizes and improves the feature extraction backbone, the neck, and the head of the network, effectively enhancing the network model's detection accuracy for small targets.
The foregoing description covers only the general principles and preferred embodiments of the present invention; modifications and substitutions made by those skilled in the art in light of the present invention fall within the protection scope of the present invention.

Claims (5)

1. A small target detection method based on improved YOLOv5, characterized in that the method comprises the following steps:
S01) collecting data for small target detection and making an image dataset in the YOLO label format;
S02) inputting the image dataset into the network for data enhancement;
S03) after data enhancement, inputting the image dataset into the feature extraction network, wherein the feature extraction backbone adopts an improved CSPDarkNet in which the Focus structure of the original YOLOv5 backbone is deleted; the five-layer network structure comprises downsampling conditional convolution layers, an SPP module, and conditional residual units (Res units), and three feature maps of different scales are obtained from the third, fourth, and fifth layers, denoted F3, F4, and F5 respectively;
the downsampling conditional convolution layer replaces the standard convolution in the original YOLOv5 with conditional parameterized convolution, and the convolution kernel in the conditional convolution layer is parameterized into a linear combination (alpha) of n experts 1 W 1 +…+α n W n ) X, wherein α i =r i (x) Example-dependent scalar weights, W, computed using a routing function with learning parameters i Is a convolution kernel, x is the characteristic of the input convolution layer;
the SPP module performs multi-scale fusion by adopting SoftPool weighted pooling with k = {1 x 1,5 x 5,9 x 9,13 x 13}, and performs splicing operation on feature maps with different scales; weighted pooling using softmax, computing eigenvalue weights for regions from nonlinear eigenvalues
Figure FDA0003921706690000011
Wherein w i Taking the feature weight as a, wherein a is an activity value, i and j are indexes of the obtained activity value in the feature matrix, and R is a local calculation area; after the weight of the characteristic value is obtained, the output result is obtained through the characteristic value of the weighting area
Figure FDA0003921706690000012
the conditional residual unit is composed of conditional convolution layers, and a conditional convolution is added in its shortcut connection to expand the feature channels;
S04) the feature maps obtained in step S03) are passed to the neck of the target detection network; the neck adopts an FPN + PAN feature fusion network based on the CCSP2 network structure, performing feature fusion top-down and then bottom-up to finally obtain three enhanced feature maps of different scales, denoted A3, A4, and A5 respectively;
S05) the enhanced feature maps A3, A4, and A5 are input to the head of the target detection network, where each undergoes a conditional convolution that further screens and enhances features related to specific classes, finally yielding three predicted feature maps of different scales, denoted P3, P4, and P5 respectively; the prediction prior boxes are obtained dynamically by clustering the dataset, and the prediction network outputs the final candidate boxes through non-maximum suppression and maps them back to the original image size, finally obtaining the detection result for the target object.
2. The small target detection method based on improved YOLOv5 according to claim 1, characterized in that: the data enhancement in step S02) splices 4 images by random scaling, random cropping, and random arrangement.
3. The small target detection method based on improved YOLOv5 according to claim 1, characterized in that: step S03) adopts a 2× downsampling conditional convolution layer as the first layer of the backbone network.
4. The small target detection method based on improved YOLOv5 according to claim 1, characterized in that: the routing function adopted in step S03) is $r_i(x) = \mathrm{Sigmoid}(\mathrm{GlobalAveragePool}(x)\,R)$, where $R$ is the weight matrix, Sigmoid(·) is the sigmoid function, and GlobalAveragePool(·) is the global average pooling function.
5. The small target detection method based on improved YOLOv5 according to claim 1, characterized in that: in step S04) the CCSP2 network structure is introduced into the FPN and the PAN, the CCSP2 network being formed by splicing several conditional convolution layers and convolution kernels, with feature merging through a cross-stage hierarchical structure strengthening feature screening and feature fusion; the FPN fuses features of different scales top-down, fusing high-level features with low-level features after upsampling; the PAN fuses features of different scales bottom-up, fusing low-level features with high-level features after a 2× downsampling conditional convolution.
CN202211365030.3A 2022-11-02 2022-11-02 Small target detection method based on improved YOLOv5 Pending CN115527096A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211365030.3A CN115527096A (en) 2022-11-02 2022-11-02 Small target detection method based on improved YOLOv5

Publications (1)

Publication Number Publication Date
CN115527096A true CN115527096A (en) 2022-12-27

Family

ID=84702708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211365030.3A Pending CN115527096A (en) 2022-11-02 2022-11-02 Small target detection method based on improved YOLOv5

Country Status (1)

Country Link
CN (1) CN115527096A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116894999A (en) * 2023-07-18 2023-10-17 中国石油大学(华东) Method and device for detecting oil spill polarization SAR based on condition parameterized convolution
CN116894999B (en) * 2023-07-18 2024-05-03 中国石油大学(华东) Method and device for detecting oil spill polarization SAR based on condition parameterized convolution
CN117314898A (en) * 2023-11-28 2023-12-29 中南大学 Multistage train rail edge part detection method
CN117314898B (en) * 2023-11-28 2024-03-01 中南大学 Multistage train rail edge part detection method

Similar Documents

Publication Publication Date Title
CN111563909B (en) Semantic segmentation method for complex street view image
CN113780296B (en) Remote sensing image semantic segmentation method and system based on multi-scale information fusion
CN111612008B (en) Image segmentation method based on convolution network
CN110796009A (en) Method and system for detecting marine vessel based on multi-scale convolution neural network model
CN115527096A (en) Small target detection method based on improved YOLOv5
CN114445430B (en) Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion
CN113095152B (en) Regression-based lane line detection method and system
CN111652081B (en) Video semantic segmentation method based on optical flow feature fusion
CN112581409B (en) Image defogging method based on end-to-end multiple information distillation network
CN115035361A (en) Target detection method and system based on attention mechanism and feature cross fusion
CN112434723B (en) Day/night image classification and object detection method based on attention network
CN116311254B (en) Image target detection method, system and equipment under severe weather condition
CN113723377A (en) Traffic sign detection method based on LD-SSD network
CN112990065A (en) Optimized YOLOv5 model-based vehicle classification detection method
CN114037640A (en) Image generation method and device
CN113269133A (en) Unmanned aerial vehicle visual angle video semantic segmentation method based on deep learning
CN114913498A (en) Parallel multi-scale feature aggregation lane line detection method based on key point estimation
CN113066089A (en) Real-time image semantic segmentation network based on attention guide mechanism
CN115565044A (en) Target detection method and system
CN116503709A (en) Vehicle detection method based on improved YOLOv5 in haze weather
CN114445442B (en) Multispectral image semantic segmentation method based on asymmetric cross fusion
CN113269119B (en) Night vehicle detection method and device
CN112861911A (en) RGB-D semantic segmentation method based on depth feature selection fusion
CN116503602A (en) Unstructured environment three-dimensional point cloud semantic segmentation method based on multi-level edge enhancement
CN116486352A (en) Lane line robust detection and extraction method based on road constraint

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination