CN117710965A - Small target detection method based on improved YOLOv5 - Google Patents


Info

Publication number
CN117710965A
CN117710965A
Authority
CN
China
Prior art keywords: target detection, feature, improved, detection method, small target
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410028101.3A
Other languages
Chinese (zh)
Inventor
叶大鹏
景均
谢立敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Agriculture and Forestry University
Original Assignee
Fujian Agriculture and Forestry University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Agriculture and Forestry University filed Critical Fujian Agriculture and Forestry University
Priority to CN202410028101.3A
Publication of CN117710965A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a small target detection method based on improved YOLOv5, comprising the following steps. Step S1, input stage: clustering to obtain suitable prediction-frame sizes. Step S2, feature extraction: establishing an improved YOLOv5s backbone network to optimize the model's local and global feature extraction capability and improve precision and operation speed. Step S3, feature aggregation stage: adding the attention mechanism NAM to remove redundant feature information and enhance the utilization of key features. Step S4, loss function: replacing the original CIoU loss function with the WIoU loss function, focusing on feature learning of the bounding box and improving the convergence speed and robustness of the model. Step S5, detection: enlarging the receptive field to improve detection precision; eliminating redundant prediction frames while retaining as many correct prediction frames as possible; and obtaining a plurality of sub-models through a multi-scale training strategy to improve model detection precision. The invention improves and optimizes small target detection in detection tasks and can be applied to fields such as automatic identification and picking of edible fungi in various environments.

Description

Small target detection method based on improved YOLOv5
Technical Field
The invention relates to the technical field of small target detection, in particular to a small target detection method based on improved YOLOv5.
Background
Among target detection algorithms, YOLO (You Only Look Once) is an outstanding method. First, the YOLO series can rapidly detect and locate objects in images while maintaining accuracy. Second, YOLO performs target detection in an end-to-end manner and can extract features and predict targets efficiently without a complex candidate-region generation algorithm, giving it clear advantages in the field of target detection.
From YOLOv1 to YOLOv5, multiple iterations have brought great progress in accuracy and real-time performance and provided better solutions for practical applications. YOLOv5 has attracted wide attention for achieving faster inference speed and lower model complexity by simplifying the network structure and using smaller model sizes.
In recent years, small target detection has drawn fresh attention in the field of computer vision owing to high practical demand and notable research significance. However, the technique is susceptible to interference from environmental factors such as illumination variation, occlusion, camera shake, and motion blur, which seriously degrade detection quality. In addition, randomly distributed objects with large scale differences in an image make network optimization difficult. The growth and distribution of edible fungi exhibit exactly these characteristics, including but not limited to random growth positions, uneven growth vigor, and overlapping growth; these phenomena can cause edible fungi to be miscounted or missed during picking, posing great challenges for their automatic identification and picking.
Disclosure of Invention
The invention provides a small target detection method based on improved YOLOv5, which is used for improving and optimizing small target detection in detection tasks and can be applied to the fields of automatic identification and picking of edible fungi in various environments.
The invention adopts the following technical scheme.
A small target detection method based on improved YOLOv5, comprising the following steps:
step S1: input stage: clustering with the K-means++ clustering algorithm to obtain prediction-frame sizes better suited to target detection in the data used by the small target detection method;
step S2: feature extraction: establishing an improved YOLOv5s backbone network that integrates a Swin Transformer encoder and an SPPF module on the original basis, optimizing the model's local and global feature extraction capability and improving precision and operation speed;
step S3: feature aggregation stage: adding the attention mechanism NAM and replacing the original C3 structure with C3Swtran, eliminating redundant feature information and enhancing the utilization of key features;
step S4: loss function: replacing the original CIoU loss function with the WIoU loss function, focusing on feature learning of the bounding box and improving the convergence speed and robustness of the model;
step S5: detection stage: newly adding a detection head and replacing the heads with Swin Transformer detection heads, enlarging the receptive field and improving detection precision; using Soft-NMS in place of the original NMS to eliminate redundant prediction frames while retaining as many correct prediction frames as possible; and obtaining a plurality of sub-models through a multi-scale training strategy and fusing them with WBF to improve model detection precision.
The small target detection method is used for automatic identification of edible fungi and for identifying edible fungi during picking. In step S1, the data used by the small target detection method are derived from the Fungi edible-fungus dataset, which contains many images of different sizes distributed across different resolutions. In step S1, part of the images are selected from the Fungi dataset and data augmentation is applied until the required number of images is reached; the edible fungi in the images are then labeled, and after labeling the images are randomly split into training set : validation set : test set = 7 : 2 : 1. The data augmentation includes Gaussian blur, random flipping, random stitching, and brightness variation.
In step S1, the prediction frames are anchor frames. The original anchor frame sizes are [10,13,16,30,33,23], i.e., small-size anchor frames; [30,61,62,45,59,119], i.e., medium-size anchor frames; and [116,90,156,198,373,326], i.e., large-size anchor frames. To optimize the compatibility of the anchor frames with the image data, the K-means++ clustering algorithm is used to obtain more suitable anchor frames: [10.5,9.45,20.75,31.366,40.92,43.243], i.e., extremely-small-size anchor frames; [30.25,42.21,39.35,34.063,46.62,47.925], i.e., small-size anchor frames; [55.22,86.71,86.09,71.245,78.03,90.485], i.e., medium-size anchor frames; and [100.42,89.41,89.38,121.22,128.26,234.42], i.e., large-size anchor frames. The anchor frame sizes are thus clustered adaptively according to the characteristics of the image data, so that more accurate prediction frames are drawn.
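As a sketch of the step S1 clustering (not the patent's code), K-means++ over labeled box (width, height) pairs might look like the following; the 1 − IoU distance, the parameter values, and the helper names are illustrative assumptions:

```python
# Hedged sketch: anchor clustering with K-means++ over (width, height)
# pairs, as described for step S1. The 1 - IoU distance between boxes
# sharing a centre is a common choice for anchor clustering, assumed here.
import random

def kmeans_pp_anchors(boxes, k=12, iters=50, seed=0):
    """Cluster (w, h) pairs into k anchors, sorted by area."""
    rng = random.Random(seed)

    def d(b, c):  # 1 - IoU of two boxes sharing a centre
        inter = min(b[0], c[0]) * min(b[1], c[1])
        return 1.0 - inter / (b[0] * b[1] + c[0] * c[1] - inter)

    centres = [rng.choice(boxes)]
    while len(centres) < k:                      # K-means++ seeding
        dists = [min(d(b, c) for c in centres) ** 2 for b in boxes]
        r, acc = rng.uniform(0, sum(dists)), 0.0
        for b, w in zip(boxes, dists):
            acc += w
            if acc >= r:
                centres.append(b)
                break
    for _ in range(iters):                       # Lloyd iterations
        groups = [[] for _ in centres]
        for b in boxes:
            groups[min(range(k), key=lambda i: d(b, centres[i]))].append(b)
        centres = [
            (sum(w for w, _ in g) / len(g), sum(h for _, h in g) / len(g))
            if g else centres[i]
            for i, g in enumerate(groups)
        ]
    return sorted(centres, key=lambda c: c[0] * c[1])
```

The sorted output maps naturally onto the extremely-small / small / medium / large anchor groups above.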
In step S2, the advantages of CNNs in local feature extraction and of the Transformer in global feature extraction are fused, and the self-attention mechanism is used to reduce the number of operation parameters. On the basis of the Transformer, the Swin Transformer is used to construct a hierarchical feature-mapping structure, reducing the computational complexity from quadratic to linear, and features of different scales are finally fused to obtain a global feature representation. The complexity expressions are as follows:
Ω(MSA) = 4hwC² + 2(hw)²C, formula one;
Ω(W-MSA) = 4hwC² + 2M²hwC, formula two;
where Ω denotes computational complexity, MSA denotes global multi-head self-attention, W-MSA denotes window multi-head self-attention, h denotes the height of the feature map, w denotes the width of the feature map, C denotes the depth of the feature map, and M denotes the size of each window.
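Formulas one and two can be compared numerically; the following sketch, with illustrative feature-map values (the example sizes are assumptions, not from the patent), shows why the window term makes W-MSA the cheaper operation:

```python
# Hedged sketch comparing the two complexity formulas: global MSA is
# quadratic in the number of tokens h*w, while window MSA (W-MSA) is
# linear in h*w for a fixed window size M.
def omega_msa(h, w, C):
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def omega_wmsa(h, w, C, M):
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

# Example: a 56x56 feature map with C=96 channels and 7x7 windows
# (typical Swin Transformer stage-1 values, assumed for illustration).
print(omega_msa(56, 56, 96), omega_wmsa(56, 56, 96, 7))
```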
In step S3, to improve the utilization efficiency of the backbone network's local and global features, the C3 structure in the feature aggregation network is replaced by a C3Swtran structure, a self-attention relationship is established between feature maps of different scales, and the fusion capability across scales is enhanced; finally an attention mechanism is introduced to remove redundant features and improve operation efficiency and speed. Establishing the self-attention relationship comprises the following steps:
step A1: initialize the parameter vectors R_h and R_w, the position codes for height and width respectively;
step A2: pass the input feature x through three weight matrices to extract feature information, obtaining the q, k and v matrices; meanwhile, perform an element-wise sum of R_h and R_w to obtain the r matrix;
step A3: matrix-multiply q with k^T to obtain the inter-feature self-attention score qk^T, and matrix-multiply q with r^T to obtain the feature-position self-attention score qr^T; finally, element-wise sum qk^T and qr^T and apply Softmax to obtain the self-attention score y;
step A4: matrix-multiply y with v to obtain the output vector z_1 of this head;
step A5: combine the output vectors z_1 to z_n obtained by the heads into the vector z_0, then integrate it with the weight matrix W_0 to obtain the final output.
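Steps A1 to A5 can be sketched for a single head as follows; the shapes, the broadcast combination of R_h and R_w, and all names are illustrative assumptions, not the patent's implementation:

```python
# Hedged NumPy sketch of steps A1-A5 for one attention head with
# height/width position codes R_h, R_w. The broadcast-sum used to build
# the r matrix is an assumption about how the element-wise sum is meant.
import numpy as np

def attention_head(x, Wq, Wk, Wv, Rh, Rw):
    q, k, v = x @ Wq, x @ Wk, x @ Wv                     # A2: q, k, v
    r = (Rh[:, None, :] + Rw[None, :, :]).reshape(-1, Rh.shape[1])  # A2: r
    scores = q @ k.T + q @ r.T                           # A3: qk^T + qr^T
    scores -= scores.max(axis=-1, keepdims=True)         # stable Softmax
    y = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return y @ v                                         # A4: z = y v

rng = np.random.default_rng(0)
h = w = 4; d = 8                      # A1: 4x4 feature map, depth 8
x = rng.normal(size=(h * w, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Rh, Rw = rng.normal(size=(h, d)), rng.normal(size=(w, d))
z = attention_head(x, Wq, Wk, Wv, Rh, Rw)   # A5 would combine the heads
print(z.shape)  # (16, 8)
```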
In step S4, CIoU is used as an index for calculating the distance between the target detection frame and the ground-truth frame. It improves on the conventional IoU index by considering factors such as aspect ratio, center-point offset, and frame area, and can compensate for IoU's shortcomings to a certain extent: for edible-fungus targets with different aspect ratios, for example, the distance calculation is more accurate, and positional offsets of the frame are better tolerated. However, CIoU's calculation is relatively complex, requiring square-root and division operations, so the computation is relatively heavy and increases the model's training time and computational cost. In addition, CIoU is not suited to all target detection tasks; in small target detection tasks its accuracy may drop.
When the small target detection method is used for identifying the small edible fungi, WIoU is introduced in the step S4 as an evaluation index of the prediction frame.
WIoU also considers the degree of overlap between the target detection frame and the ground-truth frame, but it adjusts the calculation of IoU into a form specific to the target detection task, making it more robust and interpretable. Compared with CIoU, WIoU needs no square-root or division operations, computes faster, and performs better in small target detection tasks.
the calculation formula is as follows:
WIoU = (w × IoU) / (1 − w × (1 − IoU)), formula three;
where w is a weight parameter adjusting the balance between IoU and 1 − IoU; the smaller w is, the more sensitive WIoU is to changes in IoU, and the larger w is, the more sensitive WIoU is to changes in 1 − IoU.
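Formula three can be written directly as a function; note that this transcribes the patent's formula as given, which differs from the WIoU-v1/v2/v3 definitions in the WIoU literature:

```python
# Hedged sketch of the patent's formula three. The formula and the role
# of w are taken verbatim from the text; this is not the distance-weighted
# WIoU definition found in the WIoU paper.
def wiou(iou, w):
    """WIoU = (w * IoU) / (1 - w * (1 - IoU)) per formula three."""
    return (w * iou) / (1.0 - w * (1.0 - iou))

# Same IoU under a large and a small weight parameter:
print(wiou(0.5, 0.9), wiou(0.5, 0.1))
```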
In step S5, by newly adding an extremely-small-size target detection head, the effective coverage is enlarged; at the same time a self-attention mechanism is combined with the YOLO detection head to extract richer feature representations, improving the network model's sensitivity to small and medium edible fungi in an image. Soft-NMS flexibly adjusts bounding-box scores to improve target detection accuracy; its implementation attenuates a bounding box's score rather than directly deleting every box other than the one of maximum confidence. WBF enhances detection performance by fusing the bounding-box predictions of multiple models: the boxes are fused to generate a final integrated bounding-box set. Finally, detection is performed on the test set.
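The patent's pseudocode is not reproduced in this text; a minimal Soft-NMS sketch consistent with the description (Gaussian score decay, with sigma and the score threshold as assumptions) would be:

```python
# Hedged Soft-NMS sketch with Gaussian decay. The decay form, sigma and
# the pruning threshold are assumptions; the patent only states that box
# scores are attenuated rather than deleted outright.
import math

def soft_nms(boxes, scores, sigma=0.5, score_thr=0.001):
    """boxes: list of (x1, y1, x2, y2); returns kept (box, score) pairs."""
    def iou(a, b):
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    pool = list(zip(boxes, [float(s) for s in scores]))
    kept = []
    while pool:
        pool.sort(key=lambda p: p[1], reverse=True)
        best, pool = pool[0], pool[1:]
        kept.append(best)
        # Decay, don't delete, boxes that overlap the current best box.
        pool = [(b, s * math.exp(-iou(best[0], b) ** 2 / sigma))
                for b, s in pool]
        pool = [(b, s) for b, s in pool if s > score_thr]
    return kept
```

Heavily overlapping boxes survive with reduced scores instead of being suppressed outright, which is what preserves correct predictions for densely growing fungi.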
the invention has good effect on the edible fungus image dataset, improves and optimizes the detection of small targets in the detection task, and can be applied to the fields of automatic identification and picking of edible fungi in various environments and the like.
Drawings
The invention is described in further detail below with reference to the attached drawings and detailed description:
FIG. 1 is a schematic flow chart of the method;
FIG. 2 is a schematic diagram of the distribution of small, medium and large targets;
FIG. 3 is a schematic diagram of an improved object detection model structure;
FIG. 4 is a diagram showing comparison of detection accuracy between different loss functions in the training process;
FIG. 5 is a comparative schematic diagram of the visual results of the detection.
Detailed Description
As shown in the figures, the small target detection method based on improved YOLOv5 comprises the following steps:
step S1: input stage: clustering with the K-means++ clustering algorithm to obtain prediction-frame sizes better suited to target detection in the data used by the small target detection method;
step S2: feature extraction: establishing an improved YOLOv5s backbone network that integrates a Swin Transformer encoder and an SPPF module on the original basis, optimizing the model's local and global feature extraction capability and improving precision and operation speed;
step S3: feature aggregation stage: adding the attention mechanism NAM and replacing the original C3 structure with C3Swtran, eliminating redundant feature information and enhancing the utilization of key features;
step S4: loss function: replacing the original CIoU loss function with the WIoU loss function, focusing on feature learning of the bounding box and improving the convergence speed and robustness of the model;
step S5: detection stage: newly adding a detection head and replacing the heads with Swin Transformer detection heads, enlarging the receptive field and improving detection precision; using Soft-NMS in place of the original NMS to eliminate redundant prediction frames while retaining as many correct prediction frames as possible; and obtaining a plurality of sub-models through a multi-scale training strategy and fusing them with WBF to improve model detection precision.
The small target detection method is used for automatic identification of edible fungi and for identifying edible fungi during picking. In step S1, the data used by the small target detection method are derived from the Fungi edible-fungus dataset, which contains many images of different sizes distributed across different resolutions. In step S1, part of the images are selected from the Fungi dataset and data augmentation is applied until the required number of images is reached; the edible fungi in the images are then labeled, and after labeling the images are randomly split into training set : validation set : test set = 7 : 2 : 1. The data augmentation includes Gaussian blur, random flipping, random stitching, and brightness variation.
In step S1, the prediction frames are anchor frames. The original anchor frame sizes are [10,13,16,30,33,23], i.e., small-size anchor frames; [30,61,62,45,59,119], i.e., medium-size anchor frames; and [116,90,156,198,373,326], i.e., large-size anchor frames. To optimize the compatibility of the anchor frames with the image data, the K-means++ clustering algorithm is used to obtain more suitable anchor frames: [10.5,9.45,20.75,31.366,40.92,43.243], i.e., extremely-small-size anchor frames; [30.25,42.21,39.35,34.063,46.62,47.925], i.e., small-size anchor frames; [55.22,86.71,86.09,71.245,78.03,90.485], i.e., medium-size anchor frames; and [100.42,89.41,89.38,121.22,128.26,234.42], i.e., large-size anchor frames. The anchor frame sizes are thus clustered adaptively according to the characteristics of the image data, so that more accurate prediction frames are drawn.
In step S2, the advantages of CNNs in local feature extraction and of the Transformer in global feature extraction are fused, and the self-attention mechanism is used to reduce the number of operation parameters. On the basis of the Transformer, the Swin Transformer is used to construct a hierarchical feature-mapping structure, reducing the computational complexity from quadratic to linear, and features of different scales are finally fused to obtain a global feature representation. The complexity expressions are as follows:
Ω(MSA) = 4hwC² + 2(hw)²C, formula one;
Ω(W-MSA) = 4hwC² + 2M²hwC, formula two;
where Ω denotes computational complexity, MSA denotes global multi-head self-attention, W-MSA denotes window multi-head self-attention, h denotes the height of the feature map, w denotes the width of the feature map, C denotes the depth of the feature map, and M denotes the size of each window.
In step S3, to improve the utilization efficiency of the backbone network's local and global features, the C3 structure in the feature aggregation network is replaced by a C3Swtran structure, a self-attention relationship is established between feature maps of different scales, and the fusion capability across scales is enhanced; finally an attention mechanism is introduced to remove redundant features and improve operation efficiency and speed. Establishing the self-attention relationship comprises the following steps:
step A1: initialize the parameter vectors R_h and R_w, the position codes for height and width respectively;
step A2: pass the input feature x through three weight matrices to extract feature information, obtaining the q, k and v matrices; meanwhile, perform an element-wise sum of R_h and R_w to obtain the r matrix;
step A3: matrix-multiply q with k^T to obtain the inter-feature self-attention score qk^T, and matrix-multiply q with r^T to obtain the feature-position self-attention score qr^T; finally, element-wise sum qk^T and qr^T and apply Softmax to obtain the self-attention score y;
step A4: matrix-multiply y with v to obtain the output vector z_1 of this head;
step A5: combine the output vectors z_1 to z_n obtained by the heads into the vector z_0, then integrate it with the weight matrix W_0 to obtain the final output.
In step S4, CIoU is used as an index for calculating the distance between the target detection frame and the ground-truth frame. It improves on the conventional IoU index by considering factors such as aspect ratio, center-point offset, and frame area, and can compensate for IoU's shortcomings to a certain extent: for edible-fungus targets with different aspect ratios, for example, the distance calculation is more accurate, and positional offsets of the frame are better tolerated. However, CIoU's calculation is relatively complex, requiring square-root and division operations, so the computation is relatively heavy and increases the model's training time and computational cost. In addition, CIoU is not suited to all target detection tasks; in small target detection tasks its accuracy may drop.
When the small target detection method is used for identifying the small edible fungi, WIoU is introduced in the step S4 as an evaluation index of the prediction frame.
WIoU also considers the degree of overlap between the target detection frame and the ground-truth frame, but it adjusts the calculation of IoU into a form specific to the target detection task, making it more robust and interpretable. Compared with CIoU, WIoU needs no square-root or division operations, computes faster, and performs better in small target detection tasks.
the calculation formula is as follows:
WIoU = (w × IoU) / (1 − w × (1 − IoU)), formula three;
where w is a weight parameter adjusting the balance between IoU and 1 − IoU; the smaller w is, the more sensitive WIoU is to changes in IoU, and the larger w is, the more sensitive WIoU is to changes in 1 − IoU.
In step S5, by newly adding an extremely-small-size target detection head, the effective coverage is enlarged; at the same time a self-attention mechanism is combined with the YOLO detection head to extract richer feature representations, improving the network model's sensitivity to small and medium edible fungi in an image. Soft-NMS flexibly adjusts bounding-box scores to improve target detection accuracy; its implementation attenuates a bounding box's score rather than directly deleting every box other than the one of maximum confidence. WBF enhances detection performance by fusing the bounding-box predictions of multiple models: the boxes are fused to generate a final integrated bounding-box set. Finally, detection is performed on the test set.
example 1:
based on the above, in this example, a small target detection method based on improved YOLOv5 is proposed. Firstly, re-clustering the labeling frames of the edible fungus images by using a K-means++ clustering algorithm. After the appropriate anchor frame size is obtained, an encoder in the Swin transducer is introduced to reconstruct the backbone network of YOLOv5, optimizing feature propagation and enhancing feature extraction capabilities. The C3 convolution block of the neck network is replaced with a C3Swtran that incorporates self-attention. In addition, a normalization-based attention module (NAM) was introduced to further increase detection speed and accuracy. The original loss function CIoU is replaced with WIoU, enhancing convergence speed and training effect. Next, a new test head is added on the basis of YOLOv5s, and the original test head is replaced with a Swin transducer pre-head (SHs) to accommodate the size of the small-scale target. Finally, a Soft-NMS and WBF model fusion method is applied in the post-processing stage, so that the detection performance of the model is further improved. The overall flow is as in fig. 1.
In this example, analysis of the edible-fungus image data shows that the scale of the pixels occupied by an edible-fungus object fluctuates widely, but most objects still fall into the small and medium target categories, as shown in fig. 2. The model parameters in the original YOLOv5 algorithm are set on the basis of datasets such as MS COCO, which covers more than 80 everyday object categories, so the preset anchor frame sizes are not fully suited to edible-fungus objects. The original anchor frame sizes are [10,13,16,30,33,23] (small); [30,61,62,45,59,119] (medium); [116,90,156,198,373,326] (large). To optimize the compatibility of the anchor frames with the image data, the K-means++ clustering algorithm is used to obtain more suitable anchor frames: [10.5,9.45,20.75,31.366,40.92,43.243] (extremely small); [30.25,42.21,39.35,34.063,46.62,47.925] (small); [55.22,86.71,86.09,71.245,78.03,90.485] (medium); [100.42,89.41,89.38,121.22,128.26,234.42] (large).
In this example, the backbone network is mainly used to extract features. Popular backbone networks include VGG, ResNet, DenseNet, MobileNet, CSPDarknet53, and so on, implemented by stacking different numbers of convolutional blocks. Convolutional neural networks have the characteristics of local perception and weight sharing, giving them good sensitivity to local information; the Swin Transformer, based on the self-attention mechanism and its allowance for cross-window connections, has better sensitivity to global information. The Swin Transformer constructs hierarchical feature mapping while retaining the self-attention calculation of the Transformer structure, reducing computational complexity to linear. The Swin Transformer network structure is mainly divided into 4 stages; in each stage the feature maps are downsampled into groups of different scales, self-attention and multi-head attention are computed within each group, and the feature maps of different scales are finally fused to obtain a global feature representation. The method fuses the features of small-target edible fungi, avoids losing information after those features are downsampled, and improves the feature-extraction effect.
In this example, the neck network mainly fuses features, so that feature maps of different scales can be fully utilized and information loss is compensated as features propagate in depth. In the original YOLOv5, the C3 module uses a 1x1 convolution kernel for dimension compression, reducing computational complexity and accelerating model training and inference. In addition, skip connections link feature maps of different levels, fusing fine-grained features with semantic information and enhancing the perception of small targets and details. Finally, multi-scale feature extraction and fusion are realized by varying the convolution kernel size. The method replaces the original C3 module with the C3Swtran module, which incorporates a self-attention mechanism, improving modeling capability. Its parallel computation and feature interaction also accelerate model training and inference and improve feature expression. Finally, introducing a NAM attention module after each C3Swtran module reduces redundant information and sharpens the model's attention to important features.
In this example, the head network receives the fusion features from the neck and predicts the location and type of the target. Original YOLOv5 used 3 Detect detectors.
The method in this example has 1 additional detector that receives high-resolution feature maps from the shallow layers of the network, effectively reducing the risk of deep features vanishing. Meanwhile, a multi-head self-attention mechanism is introduced to form new detectors, Swin Transformer Heads (SHs), which strengthen the model's ability to model relationships between targets and improve the accuracy and robustness of target detection. The improved network architecture is shown in fig. 3.
In this example, the WIoU loss function considers the degree of overlap between the target detection frame and the ground-truth frame, but adjusts the IoU calculation into a form specific to the target detection task, giving better robustness and interpretability. Compared with CIoU, WIoU needs no square-root or division operations, so it computes faster and performs better in the small target detection task. Testing with the method yields the result shown in fig. 4, demonstrating the superiority of WIoU.
In this example, soft-NMS can more flexibly adjust the bounding box score, thereby improving the accuracy of target detection, and the core idea is to attenuate the score of the bounding box instead of directly deleting the bounding box with the maximum confidence; WBF can then improve detection performance by fusing bounding box predictions for multiple models, with the core being fusing these bounding boxes and generating the final integrated bounding box set.
Example 2:
the method comprises the steps of firstly, re-clustering labeling information by using a K-means++ clustering algorithm, and initializing model parameters by using pre-training weights.
The experiment uses the Fungi edible-fungus dataset, comprising 6751 images of edible fungi of different sizes distributed across different resolutions. 1200 of these were selected and expanded to 5000 by data augmentation (including Gaussian blur, random flipping, random stitching, brightness variation, etc.). After labeling, the images were split into training set : validation set : test set = 7 : 2 : 1.
Experimental environment configuration: PyTorch 1.13.1 was used as the deep-learning framework, with CUDA version 11.6. Experiments were performed on an NVIDIA GeForce RTX3060 GPU under x64 Windows 10. The training batch size was set to 4 and the number of epochs to 275, with the first 3 epochs used as training warm-up. Adam was selected as the optimizer, the initial learning rate lr0 was set to 2E-4, and a cosine-annealing algorithm served as the learning-rate scheduler during training. The hyper-parameter cyclic learning rate lrf was set to 9E-2.
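The learning-rate schedule described (lr0 = 2E-4, lrf = 9E-2, 275 epochs, 3 warm-up epochs, cosine annealing) can be sketched as follows; the linear warm-up shape and the exact decay endpoint are assumptions, since the text only names the components:

```python
# Hedged sketch of the training schedule: linear warm-up for the first
# 3 epochs, then cosine annealing from lr0 toward lr0 * lrf. YOLOv5's
# actual one-cycle details may differ; this is illustrative only.
import math

def cosine_lr(epoch, epochs=275, lr0=2e-4, lrf=9e-2, warmup=3):
    if epoch < warmup:                       # linear warm-up
        return lr0 * (epoch + 1) / warmup
    t = (epoch - warmup) / max(1, epochs - warmup)
    return lr0 * (lrf + (1 - lrf) * 0.5 * (1 + math.cos(math.pi * t)))

print(cosine_lr(0), cosine_lr(3), cosine_lr(274))
```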
To objectively evaluate the performance of the model in the method, the average precision AP50 (average precision at an IoU threshold of 0.5) and the small-target average precision APs50 at the same threshold are used as evaluation indices; the calculation formulas are as follows:
where TP (True Positive) denotes true positives, the number of samples the model predicts as positive that are actually positive, and FP (False Positive) denotes false positives, the number of samples the model predicts as positive that are actually negative.
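The formulas themselves did not survive extraction; the standard definitions implied by the TP/FP description are given below, with recall R and false negatives FN as additional standard terms the text does not define:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
AP_{50} = \int_{0}^{1} P(R)\,\mathrm{d}R \quad \text{(at IoU} \ge 0.5\text{)}
```

APs50 applies the same average-precision integral restricted to the small-target subset.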
Analysis of experimental results
TABLE 1
"Basic" in Table 1 denotes the initial YOLOv5s model; "Meth A" adds an extremely small-size target detection head; "Meth B" further adds a Swin Transformer encoder module; "Meth C" further adds the NAM attention mechanism; "Meth D" further changes the anchor box sizes and detection heads; "Meth E" further replaces the loss function with WIoU. Comparing the final model against the initial model, AP50 improves by 0.83% and APs50 by 6.89%, showing that the algorithm performs well on this dataset; the visual results in Fig. 5 confirm that detection accuracy, including for small-target edible fungi, is effectively improved.
The above examples represent only a few embodiments of the present method, described in some detail, and are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make several modifications and optimizations without departing from the spirit of the invention, all of which fall within its scope. Accordingly, the scope of protection of the present invention is determined by the appended claims.

Claims (7)

1. A small target detection method based on improved YOLOv5, characterized by comprising the following steps:
step S1: an input stage: clustering by using a K-means++ clustering algorithm to obtain a predicted frame size which is more suitable for target detection in data used by a small target detection method;
step S2: feature extraction: an improved YOLOv5s backbone network is established, integrating a Swin Transformer encoder with the SPPF module on the basis of the original network, optimizing the model's local and global feature extraction capability and improving accuracy and running speed;
step S3: feature aggregation stage: the NAM attention mechanism is added and the original C3 structure is replaced with C3Swtran, eliminating redundant feature information and enhancing the utilization of key features;
step S4: loss function: the original CIoU loss function is replaced with the WIoU loss function, focusing feature learning on the bounding box and improving the model's convergence speed and robustness;
step S5: detection stage: a detection head is newly added, and the detection heads are replaced with Swin Transformer detection heads, enlarging the receptive field and improving detection accuracy; Soft-NMS replaces the original NMS, eliminating redundant prediction boxes while retaining correct ones as far as possible; and multiple sub-models are obtained with a multi-scale training strategy and fused using WBF, improving the model's detection accuracy.
2. The improved YOLOv5-based small target detection method of claim 1, wherein: the method is used for automatic recognition of edible fungi and for their identification during picking; in step S1, the data used are derived from the Fungi edible-fungus dataset and comprise a plurality of images of different sizes at various resolutions; in step S1, part of the images are selected from the dataset and expanded by data augmentation until the required number is reached, the edible fungi in the images are then labelled, and after labelling the images are randomly split into training, validation, and test sets in a 7:2:1 ratio; the data augmentation includes Gaussian blur, random flipping, random stitching, and brightness variation.
3. The improved YOLOv 5-based small target detection method of claim 1, wherein: in the step S1, the prediction frame is an anchor frame, and the original anchor frame size is [10,13,16,30,33,23], namely a small-size anchor frame; [30,61,62,45,59,119], i.e., a medium-sized anchor frame; [116,90,156,198,373,326], i.e., large-sized anchor frame; to optimize the compatibility of the anchor frame with the image data, a K-means++ clustering algorithm is used to obtain a more suitable anchor frame [10.5,9.45,20.75,31.366,40.92,43.243], namely an anchor frame with a very small size; [30.25,42.21,39.35,34.063,46.62,47.925], i.e., a small-sized anchor frame; [55.22,86.71,86.09,71.245,78.03,90.485], i.e., a medium-sized anchor frame; [100.42,89.41,89.38,121.22,128.26,234.42], i.e., large-sized anchor frame; therefore, the anchor frame size is adaptively clustered according to the image data characteristics, so that a more accurate prediction frame is drawn.
4. The improved YOLOv5-based small target detection method of claim 1, wherein: in step S2, the respective advantages of CNNs and Transformers in local and global feature extraction are combined, and the self-attention mechanism is used to reduce the number of operation parameters; on the basis of the Transformer, the Swin Transformer builds a hierarchical feature-mapping structure, reducing the computational complexity from quadratic to linear, and features of different scales are finally fused to obtain a global feature representation; the complexity is calculated as follows:
Ω(MSA) = 4hwC^2 + 2(hw)^2·C, Formula I;
Ω(W-MSA) = 4hwC^2 + 2M^2·hw·C, Formula II;
where Ω denotes computational complexity, MSA denotes global multi-head self-attention, W-MSA denotes window multi-head self-attention, h and w denote the height and width of the feature map, C denotes the depth of the feature map, and M denotes the size of each window.
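Plugging representative numbers into Formulas I and II illustrates the saving (the 80×80×256 feature map and M = 7 are illustrative values, not taken from the patent):

```python
def msa_flops(h, w, c):
    # Formula I: global multi-head self-attention, quadratic in the
    # number of tokens h*w.
    return 4 * h * w * c ** 2 + 2 * (h * w) ** 2 * c

def wmsa_flops(h, w, c, m):
    # Formula II: window multi-head self-attention, linear in h*w for a
    # fixed window size M.
    return 4 * h * w * c ** 2 + 2 * m ** 2 * h * w * c
```

For an 80×80 map with 256 channels and M = 7, W-MSA costs roughly an order of magnitude less than global MSA, which is the linear-versus-quadratic difference the claim refers to.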
5. The improved YOLOv5-based small target detection method of claim 1, wherein: in step S3, to improve the utilization of the backbone network's local and global features, the C3 structure in the feature aggregation network is replaced with a C3Swtran structure, establishing self-attention relationships between feature maps of different scales and enhancing the fusion of features of different scales; finally an attention mechanism is introduced to remove redundant features and improve operating efficiency and speed; establishing the self-attention relationship comprises the following steps:
step A1: initialize parameter vectors R_h and R_w, the position encodings for height and width respectively;
step A2: extract feature information from the input feature x through three weight matrices to obtain the q, k, and v matrices, and element-wise sum R_h and R_w to obtain the r matrix;
step A3: matrix-multiply q by k^T to obtain the feature-to-feature self-attention score qk^T, matrix-multiply q by r^T to obtain the feature-to-position self-attention score qr^T, element-wise sum qk^T and qr^T, and apply Softmax to obtain the self-attention score y;
step A4: matrix-multiply y by v to obtain the head's output vector z_1;
step A5: sum the output vectors z_1 to z_n obtained by the heads to obtain the vector z_0, then integrate it with the weight matrix W_0 to obtain the final output.
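Steps A2 to A4 can be sketched for a single head in plain Python (a toy, unbatched sketch; real Swin attention is windowed and multi-headed, and the position matrix r is assumed here to have the same shape as k):

```python
import math

def matmul(a, b):
    # Naive matrix product over lists of lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def attention_head(x, wq, wk, wv, r):
    # A2: project x into q, k, v through the three weight matrices.
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    # A3: content score q·k^T plus position score q·r^T, then Softmax.
    qkT = matmul(q, transpose(k))
    qrT = matmul(q, transpose(r))
    y = [softmax([a + b for a, b in zip(r1, r2)]) for r1, r2 in zip(qkT, qrT)]
    # A4: weight v by the attention scores to get the head output.
    return matmul(y, v)
```

With identity projections and a zero position matrix, the output reduces to a softmax-weighted mix of the input rows, which makes the role of each step easy to check by hand.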
6. The improved YOLOv5-based small target detection method of claim 2, wherein: when the method is used to identify small edible fungi, WIoU is introduced in step S4 as the evaluation index for the prediction frame.
WIoU also considers the degree of overlap between the predicted detection box and the ground-truth box, but it adjusts the IoU calculation into a form specific to the target detection task, making it more robust and interpretable. Compared with CIoU, WIoU needs no operations such as square roots, computes faster, and performs better in small target detection tasks.
the calculation formula is as follows:
WIoU = (w·IoU) / (1 − w·(1 − IoU)), Formula III;
where w is a weight parameter adjusting the ratio between IoU and 1 − IoU; the smaller w is, the more sensitive WIoU is to changes in IoU; the larger w is, the more sensitive WIoU is to changes in 1 − IoU.
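Formula III can be evaluated directly (note this is the patent's stated form, which differs from published WIoU variants; w = 0.5 is an illustrative default, not a value from the patent):

```python
def wiou(iou_val, w=0.5):
    # WIoU as given in Formula III: (w * IoU) / (1 - w * (1 - IoU)).
    return (w * iou_val) / (1 - w * (1 - iou_val))
```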
7. The improved YOLOv5-based small target detection method of claim 1, wherein: in step S5, a target detection head of extremely small size is newly added, enlarging the receptive field, and the self-attention mechanism is combined with the YOLO detection head to extract richer feature representations, improving the network model's sensitivity to small and medium-sized edible fungi in the image; Soft-NMS flexibly adjusts bounding box scores to improve detection accuracy, implemented by attenuating box scores rather than directly deleting boxes that overlap the highest-confidence box; WBF enhances detection performance by fusing the bounding box predictions of multiple models into a final, integrated bounding box set; finally, detection is performed on the test set; in pseudo-code:
Soft-NMS algorithm
Input: B – bounding box set, S – score set, θ – score threshold
Output: B′ – updated bounding box set, S′ – corresponding score set
O ← indices of the bounding boxes sorted by score from high to low
keep ← [True] × N
for i ← 1 to N
    for j ← i + 1 to N
        iou ← intersection-over-union of boxes B[O[j]] and B[O[i]]
        S[O[j]] ← S[O[j]] × attenuation function(iou, θ)
    if S[O[i]] < threshold θ
        keep[O[i]] ← False
B′ ← [B[O[i]] | keep[O[i]] = True, i = 1, 2, ..., N]
S′ ← [S[O[i]] | keep[O[i]] = True, i = 1, 2, ..., N]
return B′, S′
WBF algorithm
Input: detections – boxes from K models for N matched slots, weights – per-model weights
K, N ← first two dimensions of detections
fused_boxes ← all-zero matrix of size (N, 4)
fused_confidences ← all-zero matrix of size (N, 1)
for i ← 1 to N
    weighted_sum ← weighted sum of coordinates(detections[:, i, :], weights)
    total_weight ← sum of weights(weights)
    fused_boxes[i] ← weighted_sum / total_weight
return fused_boxes, fused_confidences
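The fusion loop can be sketched as follows (a simplified sketch that assumes the boxes from the K models are already matched slot-by-slot; full WBF also clusters boxes by IoU before averaging):

```python
def fuse_boxes(detections, weights):
    # Weighted average of matched boxes, one fused box per slot.
    # detections has shape (K models, N boxes, 4 coordinates).
    k, n = len(detections), len(detections[0])
    total = sum(weights)
    fused = []
    for i in range(n):
        fused.append([
            sum(weights[m] * detections[m][i][c] for m in range(k)) / total
            for c in range(4)
        ])
    return fused
```

With equal weights, two models voting for slightly offset boxes produce a single box midway between them, which is the "integrated bounding box set" the claim describes.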
CN202410028101.3A 2024-01-09 2024-01-09 Small target detection method based on improved YOLOv5 Pending CN117710965A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410028101.3A CN117710965A (en) 2024-01-09 2024-01-09 Small target detection method based on improved YOLOv5

Publications (1)

Publication Number Publication Date
CN117710965A true CN117710965A (en) 2024-03-15


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118155059A (en) * 2024-05-08 2024-06-07 南方海洋科学与工程广东省实验室(珠海) Underwater target detection method and system based on deep learning embedding


Similar Documents

Publication Publication Date Title
CN114120019B (en) Light target detection method
CN110738207B (en) Character detection method for fusing character area edge information in character image
CN110110624B (en) Human body behavior recognition method based on DenseNet and frame difference method characteristic input
CN112597941A (en) Face recognition method and device and electronic equipment
CN111898432B (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
CN110991444B (en) License plate recognition method and device for complex scene
CN110222718B (en) Image processing method and device
CN113449573A (en) Dynamic gesture recognition method and device
CN113706581B (en) Target tracking method based on residual channel attention and multi-level classification regression
CN112464912B (en) Robot end face detection method based on YOLO-RGGNet
CN117710965A (en) Small target detection method based on improved YOLOv5
CN114202743A (en) Improved fast-RCNN-based small target detection method in automatic driving scene
CN112329771B (en) Deep learning-based building material sample identification method
CN111368634B (en) Human head detection method, system and storage medium based on neural network
CN115294468A (en) SAR image ship identification method for improving fast RCNN
CN113963333B (en) Traffic sign board detection method based on improved YOLOF model
CN115171074A (en) Vehicle target identification method based on multi-scale yolo algorithm
Wei et al. Lightweight multimodal feature graph convolutional network for dangerous driving behavior detection
CN111428567B (en) Pedestrian tracking system and method based on affine multitask regression
CN117456167A (en) Target detection algorithm based on improved YOLOv8s
CN106033546A (en) Behavior classification method based on top-down learning
Du et al. Object Detection of Remote Sensing Image Based on Multi-Scale Feature Fusion and Attention Mechanism
CN117079095A (en) Deep learning-based high-altitude parabolic detection method, system, medium and equipment
CN111242114A (en) Character recognition method and device
CN115424012A (en) Lightweight image semantic segmentation method based on context information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination