CN114241548A - Small target detection algorithm based on improved YOLOv5 - Google Patents


Info

Publication number
CN114241548A
Authority
CN
China
Prior art keywords
algorithm
target
improved
detection
network
Prior art date
Legal status
Pending
Application number
CN202111382559.1A
Other languages
Chinese (zh)
Inventor
郭磊
薛伟
王邱龙
马海钰
肖怒
马志伟
郭济
蒋煜祺
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202111382559.1A
Publication of CN114241548A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a small target detection algorithm based on improved YOLOv5, which comprises the following steps. Firstly, the collected face mask data set to be detected is enhanced with the Mosaic-8 data enhancement method: 8 pictures are randomly cropped, randomly arranged and randomly scaled, then combined into one picture, while random noise is reasonably introduced. Secondly, a feature extraction layer of a new scale is added to the YOLOv5 feature fusion network, and the target box regression formula of the YOLOv5 network is adjusted to improve the loss function. Thirdly, the enhanced data are sent into the network for iterative training, and the learning rate is adjusted with a cosine annealing algorithm. Fourthly, after training, the picture to be detected is sent into the optimal model obtained from training, the category and position of the target are detected, and the recognition result is finally obtained. The improved algorithm is applied to protective mask wearing detection in dense crowds, and experimental results show that, compared with the original YOLOv5 algorithm, it has stronger feature extraction capability and higher detection precision for small targets.

Description

Small target detection algorithm based on improved YOLOv5
Technical Field
The invention relates to the technical field of small target detection, in particular to a small target detection algorithm based on improved YOLOv5.
Background
Target detection is one of the core problems in the computer vision field. It uses technologies such as image processing and deep learning to locate objects of interest in an image or video: target classification judges whether the input image contains a target, and target localization finds the position of the target object and frames it. The task of target detection is therefore to lock onto targets in the image, determine their positions and determine their categories, and it is widely applied in computer vision fields such as face recognition, automatic driving, pedestrian detection and intelligent monitoring. The traditional target detection algorithm consists of 3 parts: region selection, feature extraction and a classifier. However, its detection effect is not ideal, because manually designed features have poor robustness and the region selection strategy lacks pertinence.
Early target detection methods extracted features manually and then built models on that basis. Models designed in this way are not only complex in structure but also difficult to improve in precision. With the development of deep learning, the convolutional neural network was found to have an excellent capability for learning features, and feature extraction based on deep convolutional neural networks became widely applied in computer vision tasks. Target detection thus completed the transition from methods based on traditional hand-designed features to deep learning methods based on convolutional neural networks, which quickly became the mainstream of research in image processing. Compared with traditional hand-designed extraction operators, convolutional neural networks extract richer features and the resulting models generalize better.
Although these excellent target detection algorithms have achieved very good performance on large and general data sets, small target detection has long been one of the major difficulties in target detection. There are two common definitions of a small target: one is relative size, and the other defines small targets according to the specific data set. Compared with a conventional target, a small target occupies fewer pixels in the image and has low resolution, little information and weak feature expression capability, and existing detection models still show a gap in both precision and detection speed on such targets.
Disclosure of Invention
The invention aims to provide a small target detection algorithm based on improved YOLOv5, which is characterized in that data enhancement is carried out with the Mosaic-8 method, a shallow feature map is added, the loss function is adjusted to enhance the network's perception of small targets, and the target box regression formula is modified to alleviate problems such as gradient disappearance during training, thereby improving small target detection precision.
In order to achieve the above task and overcome the defects existing in the prior art, the present invention provides a small target detection method based on improved YOLOv5, which includes:
firstly, enhancing the collected face mask data set to be detected with the Mosaic-8 data enhancement method: 8 pictures are randomly cropped, randomly arranged and randomly scaled, then combined into one picture, while random noise is reasonably introduced.
And secondly, adding a new-scale feature extraction layer in the YOLOv5 feature fusion network, and adjusting a target box regression formula of the YOLOv5 network to improve a loss function.
And thirdly, sending the enhanced data into a network for iterative training, and adjusting the learning rate by using a cosine annealing algorithm.
And fourthly, after the training is finished, sending the picture to be detected into the optimal model obtained after the training, detecting the class and the position of the target, and finally obtaining a recognition result.
Further, the building of the small target detection model comprises the following steps:
on the basis of the original YOLOv5, a feature map whose size is one quarter of the input image is newly added in the Backbone network and the Head network, promoting the mining of small target data; multi-scale feedback is adopted to introduce global context information.
On the basis of the YOLOv5 backbone network, a 4-fold down-sampling process is added for the original input picture; after the 4-fold down-sampling, the result is sent to the feature fusion network to obtain a feature map of a new size.
And performing information fusion on the low-level feature map and the high-level feature map, combining a feature pyramid Network with a Path Aggregation Network (PAN), wherein the feature pyramid Network transmits deep semantic features from top to bottom, and the Path Aggregation Network transmits position information of the target from bottom to top.
And training the small target detection network by using the preprocessed data set, and obtaining an optimal detection model after model iteration is completed, thereby establishing the small target detection network.
Further, before inputting the data set to the algorithm model, the method further comprises:
labeling a data set by using labeling software LabelImg in a YOLO format, wherein picture labels in the data set are divided into two types, namely bad (not wearing a mask) and good (wearing a mask).
After labeling, each picture corresponds to a txt file with the same name as the picture. Each line in the txt file represents one label instance and has 5 columns, which from left to right represent: the label category, the ratio of the horizontal coordinate of the label box center to the picture width, the ratio of the vertical coordinate of the label box center to the picture height, the ratio of the label box width to the picture width, and the ratio of the label box height to the picture height.
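The normalized txt format described above can be decoded with a few lines of Python; the helper below is an illustrative sketch (its name and return convention are not part of the patent):

```python
def yolo_label_to_pixels(line, img_w, img_h):
    """Parse one line of a YOLO-format label file.

    A line has 5 columns: class id, box-center x / image width,
    box-center y / image height, box width / image width,
    box height / image height (all normalized to [0, 1]).
    Returns (class_id, x_min, y_min, x_max, y_max) in pixels.
    """
    cls, cx, cy, w, h = line.split()
    cx, cy = float(cx) * img_w, float(cy) * img_h
    w, h = float(w) * img_w, float(h) * img_h
    return int(cls), cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2
```

For example, the line "0 0.5 0.5 0.25 0.5" in a 640x480 picture denotes a class-0 (bad) box of 160x240 pixels centered in the image.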
Face pictures of people wearing and not wearing masks in multi-person scenes are manually screened from the WIDER FACE and MAFA (MAsked FAces) public data sets and from the Internet, finally yielding a training set of 4000 pictures and a test set of 1320 pictures, 5320 in total.
Further, CIoU was selected instead of GIoU as a loss function of the target box regression.
Further, the total iteration number of the experiment is set to be 140, the iteration batch size is set to be 32, and an SGD optimizer is selected.
Furthermore, Warmup training is used to warm up the learning rate during model training. In the Warmup stage, the learning rate of the bias layer decays from 0.1 to 0.01 while the learning rate of the other parameters increases from 0 to 0.01; after Warmup finishes, the learning rate is updated with a cosine annealing algorithm.
Further, the target box formula is modified. The relative coordinates of the target box with respect to the upper-left corner of its grid cell are predicted by a relative-position prediction method, finally giving the center coordinates b_x, b_y and the width and height b_w, b_h of the predicted target box.
The target box formula is as follows:
b_x = 2σ(t_x) - 0.5 + c_x
b_y = 2σ(t_y) - 0.5 + c_y
b_w = p_w (2σ(t_w))²
b_h = p_h (2σ(t_h))²
Pr(object) · IoU(b, object) = σ(t_o)
where σ(t_o) is the confidence of the prediction box, obtained by multiplying the probability that the box contains an object by the IoU value between the prediction box and the real box. A threshold is set on σ(t_o) to filter out prediction boxes with low confidence, and the final prediction boxes are then obtained from the remaining ones with the Non-Maximum Suppression (NMS) algorithm.
Compared with the prior art, the invention has the following beneficial effects:
1. Mosaic-8 data enhancement enriches the data set and adds small-sample targets, which speeds up network training; since 8 pictures are processed at a time during the normalization operation, the memory requirement of the model is effectively reduced. Reasonably introduced random noise strengthens the network model's discrimination of small target samples in the image and improves the generalization of the model.
2. The feature extraction model is improved: on the basis of the YOLOv5 backbone network, a 4-fold down-sampling branch is added for the original input picture, and the resulting feature map of a new size is sent into the feature fusion network. This feature map has a small receptive field and relatively rich position information, which improves the detection of small-size mask-wearing targets. The feature fusion network is also improved: fusing feature information both top-down and bottom-up helps the model learn features better and enhances its sensitivity to small targets and occluded targets.
3. The target box formula is improved to comprehensively consider the overlap rate, center-point distance and aspect ratio between the real box and the prediction box, making target box regression more stable and convergence more precise. The algorithm performs remarkably in mask-wearing detection for dense crowds: detection precision is clearly improved, misjudgments and missed detections under dense crowds are clearly reduced, and robustness is clearly improved when mask targets appear at abnormal angles or the face region is occluded.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
FIG. 1 is a flow diagram of small target detection;
FIG. 2 is an exemplary diagram of a conventional target and a small target;
FIG. 3 is a diagram of the overall network architecture;
FIG. 4 is a flow chart of the Mosaic data enhancement;
FIG. 5 is a detail view of the Mosaic-8 data enhancement;
FIG. 6 is a diagram of an improved feature extraction model;
FIG. 7 is a diagram of an improved feature fusion network;
FIG. 8 is a diagram of a target box regression diagram;
fig. 9 is a partial picture of a data set.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings. It should be understood by those skilled in the art that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative work shall fall within the protection scope of the present invention.
It should be understood that the step numbers used herein are for convenience of description only and are not intended as limitations on the order in which the steps are performed.
Referring to fig. 1, the invention discloses a small target detection algorithm based on improved YOLOv5, comprising:
firstly, enhancing the collected face mask data set to be detected with the Mosaic-8 data enhancement method: 8 pictures are randomly cropped, randomly arranged and randomly scaled, then combined into one picture, while random noise is reasonably introduced.
And secondly, adding a new-scale feature extraction layer in the YOLOv5 feature fusion network, and adjusting a target box regression formula of the YOLOv5 network to improve a loss function.
And thirdly, sending the enhanced data into a network for iterative training, and adjusting the learning rate by using a cosine annealing algorithm.
And fourthly, after the training is finished, sending the picture to be detected into the optimal model obtained after the training, detecting the class and the position of the target, and finally obtaining a recognition result.
Specifically, the experiment adopts Mosaic-8 data enhancement: a batch is taken from the face mask data set to be detected, 8 pictures are randomly extracted from that batch, randomly cropped, randomly arranged and randomly scaled, and then combined into one picture; this step is repeated, and random noise is reasonably added to the combined picture, which strengthens the network model's discrimination of small target samples in the image and improves the generalization capability of the model.
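As an illustration only, a minimal Mosaic-8 sketch in NumPy (the 4x2 layout, tile size and noise level are assumptions the patent does not fix, and the remapping of box labels to the combined picture is omitted):

```python
import numpy as np

def mosaic8(images, tile=160, noise_std=5.0, rng=None):
    """Randomly crop and scale 8 images to tile x tile, lay them out on an
    assumed 4x2 grid in a random order, and add mild Gaussian noise."""
    rng = rng if rng is not None else np.random.default_rng()
    assert len(images) == 8
    canvas = np.zeros((2 * tile, 4 * tile, 3), dtype=np.float32)
    order = rng.permutation(8)  # random arrangement of the 8 pictures
    for k, idx in enumerate(order):
        img = images[idx].astype(np.float32)
        h, w = img.shape[:2]
        # random crop of at least half the image in each dimension
        ch, cw = rng.integers(h // 2, h + 1), rng.integers(w // 2, w + 1)
        y0, x0 = rng.integers(0, h - ch + 1), rng.integers(0, w - cw + 1)
        crop = img[y0:y0 + ch, x0:x0 + cw]
        # nearest-neighbour scale to the tile size
        ys = np.linspace(0, ch - 1, tile).astype(int)
        xs = np.linspace(0, cw - 1, tile).astype(int)
        r, c = divmod(k, 4)
        canvas[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile] = crop[ys][:, xs]
    canvas += rng.normal(0.0, noise_std, canvas.shape)  # random noise
    return np.clip(canvas, 0, 255).astype(np.uint8)
```

A production implementation would also transform each picture's label boxes into the coordinate frame of the combined picture.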
Specifically, the feature extraction model is improved in the feature extraction process: on the basis of the YOLOv5 backbone network, a 4-fold down-sampling branch is added for the original input picture, and after the 4-fold down-sampling the result is sent into the feature fusion network to obtain a feature map of a new size. This feature map has a small receptive field and relatively rich position information, which improves the detection of small-size mask-wearing targets.
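The size arithmetic of the resulting 4-scale structure is easy to check: for a 640x640 input, the new stride-4 map is 160x160, one quarter of the input side length, alongside the original stride-8/16/32 maps (the function name below is illustrative):

```python
def detection_grid_sizes(input_size=640, strides=(4, 8, 16, 32)):
    """Grid (feature-map) side length at each detection scale.

    The stride-4 entry is the newly added shallow scale: its side is one
    quarter of the input image, giving a small receptive field that
    favours small targets."""
    return {s: input_size // s for s in strides}
```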
Specifically, in the improved feature extraction network, a feature pyramid network is combined with a path aggregation network, the feature pyramid network transmits deep semantic features from top to bottom, the path aggregation network transmits position information of a target from bottom to top, the feature pyramid network is favorable for a model to learn features better through feature information fusion from top to bottom and from bottom to top, and the sensitivity of the model to small targets and shielding targets is enhanced.
In the network model training process in a specific implementation case, the loss function of the invention is composed of three parts, namely positioning loss, confidence coefficient loss and category loss, wherein CIoU is used for replacing GIoU to be used as a loss function of target frame regression to calculate the positioning loss, and the confidence coefficient loss and the category loss are calculated by adopting a binary cross entropy loss function.
The calculation formula of CIoU and the definition formulas of α and v are shown below.
CIoU = IoU - ρ²(b, b_gt)/c² - αv
α = v / ((1 - IoU) + v)
v = (4/π²) · (arctan(w_gt/h_gt) - arctan(w/h))²
where ρ(b, b_gt) is the Euclidean distance between the center points of the prediction box b and the real box b_gt, and c is the diagonal length of the smallest box enclosing both.
It should be noted that α is a balance parameter, which does not participate in the gradient calculation, and v is a parameter for measuring the uniformity of the aspect ratio.
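For reference, the standard CIoU quantity can be computed directly from these definitions; the sketch below is illustrative and omits the loss wrapping (the localization loss would be 1 - CIoU):

```python
import math

def ciou(box_p, box_g):
    """CIoU between a predicted and a ground-truth box, both given as
    (x_min, y_min, x_max, y_max): IoU - rho^2/c^2 - alpha*v."""
    # plain IoU
    ix = max(0.0, min(box_p[2], box_g[2]) - max(box_p[0], box_g[0]))
    iy = max(0.0, min(box_p[3], box_g[3]) - max(box_p[1], box_g[1]))
    inter = ix * iy
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area(box_p) + area(box_g) - inter)
    # squared distance between center points (rho^2)
    rho2 = ((box_p[0] + box_p[2]) / 2 - (box_g[0] + box_g[2]) / 2) ** 2 + \
           ((box_p[1] + box_p[3]) / 2 - (box_g[1] + box_g[3]) / 2) ** 2
    # squared diagonal of the smallest enclosing box (c^2)
    c2 = (max(box_p[2], box_g[2]) - min(box_p[0], box_g[0])) ** 2 + \
         (max(box_p[3], box_g[3]) - min(box_p[1], box_g[1])) ** 2
    # aspect-ratio consistency term v and balance parameter alpha
    v = (4 / math.pi ** 2) * (math.atan((box_g[2] - box_g[0]) / (box_g[3] - box_g[1]))
                              - math.atan((box_p[2] - box_p[0]) / (box_p[3] - box_p[1]))) ** 2
    alpha = v / ((1 - iou) + v) if (1 - iou) + v > 0 else 0.0
    return iou - rho2 / c2 - alpha * v
```

Identical boxes give CIoU = 1; disjoint boxes are penalized below plain IoU by the center-distance term, which is what makes the regression gradient informative even without overlap.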
In a specific implementation case, a regression formula of the target frame is improved, and the relative coordinates of the target frame relative to the upper left corner are predicted by a method for predicting the relative position. The prediction frame is obtained by translation and scaling of a prior frame, the original picture is divided into S multiplied by S grid units according to the size of the characteristic graph, each grid unit can predict 3 prediction frames, and each prediction frame comprises 4 coordinate information and 1 confidence coefficient information. When the center coordinate of a certain target in the real frame is in a certain grid, the target is predicted by the grid.
The coordinate prediction calculation formula of the target frame is as follows:
b_x = 2σ(t_x) - 0.5 + c_x
b_y = 2σ(t_y) - 0.5 + c_y
b_w = p_w (2σ(t_w))²
b_h = p_h (2σ(t_h))²
Pr(object) · IoU(b, object) = σ(t_o)
note that t isx、ty、tw、thObtaining 4 offsets for network model prediction, wherein sigma represents Sigmoid activation function and is used for predicting network prediction value tx、ty、tw、thMapping to [0,1]C isx、cyIs the offset, p, in the cell grid relative to the upper left corner of the picturew、phIs the a priori frame width height. Finally obtaining the central coordinate b of the predicted target frame through the formulax、byAnd width and height bw、bh。σ(to) Is the confidence of the prediction box, which is obtained by multiplying the probability of the prediction box by the IoU value of the prediction box and the real box. For sigma (t)o) And setting a threshold, filtering the prediction frame with lower confidence coefficient, and then obtaining the final prediction frame by using a non-maximum suppression algorithm for the rest prediction frames.
In a specific embodiment, it should be noted that the test platform and experimental environment of the invention are: the Ubuntu 20.04 operating system; a GeForce GTX 1080Ti GPU with 11 GB of video memory; an Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz; CUDA version 11.4.0; PyTorch version 1.9.0; and Python 3.7.4.
In one embodiment, it is noted that the data to be tested needs to be preprocessed before being input into the network of the improved YOLOv 5-based small target detection algorithm.
Specifically, the data set is labeled in YOLO format with the labeling software LabelImg, and the picture labels in the data set are divided into two categories: bad (not wearing a mask) and good (wearing a mask). After labeling, each picture corresponds to a txt file with the same name. Each line in the txt file represents one label instance and has 5 columns, which from left to right represent: the label category, the ratio of the horizontal coordinate of the label box center to the picture width, the ratio of the vertical coordinate of the label box center to the picture height, the ratio of the label box width to the picture width, and the ratio of the label box height to the picture height.
In a specific embodiment, the data set is derived from the WIDER FACE and MAFA (MAsked FAces) public data sets and from the Internet. Face pictures of people wearing and not wearing masks in multi-person scenes are manually screened from these sources, finally yielding a training set of 4000 pictures and a test set of 1320 pictures, 5320 in total.
Specifically, the total number of iterations in the experiment is 140, the iteration batch size is set to 32, and the SGD optimizer is selected. Warmup training is used to warm up the learning rate during model training, which mitigates early overfitting to small batches of data and avoids model oscillation, keeping training stable. In the Warmup stage, the learning rate of the bias layer decays from 0.1 to 0.01 while the learning rate of the other parameters increases from 0 to 0.01; after Warmup finishes, the learning rate is updated with a cosine annealing algorithm.
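The overall schedule can be sketched as below; the linear warmup shape and the annealing floor are assumptions, since the text only fixes the start and end learning rates:

```python
import math

def lr_at(step, warmup_steps, total_steps, lr_max=0.01, lr_min=0.0):
    """Learning rate at a given step: assumed linear warmup from 0 to
    lr_max, then cosine annealing from lr_max down to lr_min."""
    if step < warmup_steps:
        return lr_max * step / warmup_steps          # warmup: 0 -> lr_max
    t = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```

Cosine annealing decays quickly in the middle of training and flattens near the end, which tends to help the final convergence precision.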
It should be noted that three evaluation indexes common in target detection are used to evaluate the performance of the algorithm: Average Precision (AP), mean Average Precision (mAP), and detection speed in Frames Per Second (FPS). Average precision is related to Precision and Recall: precision is the number of correctly predicted positive samples divided by the number of samples the model predicts as positive, and recall is the number of correctly predicted positive samples divided by the number of actual positive samples.
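A toy AP computation consistent with these definitions is sketched below; the matching of detections to ground truth is assumed to be done already, and the all-point interpolation scheme is an assumption, since the text does not specify one:

```python
def average_precision(scores, is_tp, num_gt):
    """Sweep detections by descending confidence, accumulate precision and
    recall, and integrate precision over recall with a monotone envelope.

    scores: confidence of each detection; is_tp: whether each detection
    matched a ground-truth box; num_gt: number of ground-truth positives."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    prec, rec = [], []
    for i in order:
        tp, fp = tp + is_tp[i], fp + (not is_tp[i])
        prec.append(tp / (tp + fp))   # precision = TP / (TP + FP)
        rec.append(tp / num_gt)       # recall = TP / (TP + FN)
    # make precision non-increasing from right to left (envelope)
    for k in range(len(prec) - 2, -1, -1):
        prec[k] = max(prec[k], prec[k + 1])
    # integrate precision over recall increments
    ap, prev = 0.0, 0.0
    for p, r in zip(prec, rec):
        ap += (r - prev) * p
        prev = r
    return ap
```

mAP is then the mean of the per-class AP values, here over the bad and good categories.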
In one embodiment, to verify the effectiveness of the algorithm, it was tested on the same test set as the AIZOO method and the original YOLOv5 algorithm, and the performance index comparisons are shown in the table below.
[Table: comparison of AP (bad), AP (good), mAP and FPS for the AIZOO method, the original YOLOv5 and the improved algorithm on the same test set]
As the table shows, compared with the AIZOO method and the original YOLOv5 algorithm, the proposed algorithm detects small mask targets better in dense-crowd scenes: the mAP reaches 94.88%, and relative to the original YOLOv5, the AP values of the bad and good categories improve by 3.72% and 5.38% respectively, while the mAP improves by 4.55%. The detection speed of the algorithm is lower than that of the other algorithms, with an FPS of 30.3, which is 11.3 lower than the original YOLOv5, increasing the detection time for a single picture by 9 ms.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (19)

1. A small object detection algorithm based on improved YOLOv5, comprising:
step one, carrying out data annotation on the collected face mask data set, and carrying out data enhancement on the data set by using a Mosaic-8 data enhancement method.
And step two, adding a new-scale feature extraction layer in the YOLOv5 feature fusion network, adjusting a target frame regression formula of the YOLOv5 network, and improving a loss function. And carrying out feature extraction and target positioning classification on the pictures in the obtained data set.
And step three, sending the enhanced data into a network for iterative training, and adjusting the learning rate by using a cosine annealing algorithm.
And step four, after the training is finished, sending the picture to be detected into the optimal model obtained after the training, detecting the class and the position of the target, and finally obtaining a recognition result.
2. The improved YOLOv5-based small target detection algorithm as claimed in claim 1, wherein the original YOLOv5 algorithm is improved in four aspects: Mosaic-8 data enhancement, the feature extractor, the loss function and target box regression. This effectively enhances the detection accuracy of the YOLOv5 network model on small target objects; the detection speed is reduced compared with the original YOLOv5 algorithm but still meets real-time requirements, so the algorithm can be applied directly to real-life scenarios such as automatic driving, medical imaging, remote sensing image analysis and small target detection in infrared images. The mask-wearing detection effect on dense crowds is remarkable: detection precision is clearly improved, misjudgments and missed detections under dense crowds are clearly reduced, and robustness to abnormal angles of small targets and occluded face regions is clearly improved.
3. The improved YOLOv5-based small target detection algorithm according to claim 1, wherein the data set is derived from the WIDER FACE and MAFA (MAsked FAces) public data sets and from the Internet; face pictures with and without masks in multi-person scenes are manually screened from these sources, and data labeling is performed on the collected face mask data set, finally obtaining a training set of 4000 pictures and a test set of 1320 pictures, 5320 in total.
4. The data annotation method for the collected facial mask data set according to claim 3, wherein labeling software LabelImg is used for labeling the data set in a YOLO format, and has two labeling categories, namely bad (not wearing the mask) and good (wearing the mask).
5. The method of claim 4, wherein each tagged picture corresponds to a txt file with the same name as the picture, each row in the txt file represents a tag instance, and the tag instance has 5 columns, which respectively represent, from left to right, the tag type, the ratio of the horizontal coordinate of the center of the tag box to the picture width, the ratio of the vertical coordinate of the center of the tag box to the picture height, the ratio of the width of the tag box to the picture width, and the ratio of the height of the tag box to the picture height.
6. The improved YOLOv 5-based small target detection algorithm according to claim 1, wherein a Mosaic-8 data enhancement method is used to perform data enhancement on the data set, that is, 8 pictures are randomly cropped, randomly arranged and randomly scaled, and then combined into one picture, so as to increase the data amount of the sample, and simultaneously, some random noise is reasonably introduced, so that the discrimination of the network model on the small target samples in the image is enhanced, and the generalization of the model is improved.
7. The improved YOLOv 5-based small target detection algorithm as claimed in claim 1, wherein the data set is data-enhanced by using a Mosaic-8 data enhancement method, so that the small sample target is increased while the data set is enriched, and the training speed of the network is increased. While the normalization operation is performed, eight pictures are calculated at a time, so that the memory requirement of the model is reduced.
8. The improved YOLOv 5-based small target detection algorithm as claimed in claim 1, wherein the invention is improved based on an original YOLOv5 feature extraction model, that is, a 4-fold down-sampling process is added to an original input picture based on a YOLOv5 backbone network, the original picture is sent to a feature fusion network after being subjected to 4-fold down-sampling to obtain a feature map of a new size, the feature map has a small receptive field and relatively rich position information, and therefore, the detection effect of detecting a small-size mask wearing target can be improved.
9. The improved YOLOv 5-based small target detection algorithm according to claim 1, wherein the improved feature fusion Network combines a feature pyramid Network with a Path Aggregation Network (PAN), the feature pyramid Network delivers deep semantic features from top to bottom, the Path Aggregation Network delivers the position information of the target from bottom to top, and the feature information from top to bottom and bottom to top are fused, so that the model can learn features better, and the sensitivity of the model to small targets and occluded targets is enhanced.
10. The improved YOLOv 5-based small-target detection algorithm according to claim 1, wherein in the process of training the network model, the loss function of the invention is composed of three parts, namely, a localization loss, a confidence loss and a category loss, wherein the localization loss is calculated by using CIoU instead of GIoU as the loss function of the target-box regression, and the confidence loss and the category loss are calculated by using a binary cross-entropy loss function.
11. The improved YOLOv5-based small target detection algorithm according to claim 1, wherein CIoU is selected as the loss function for target-box regression. CIoU jointly considers the overlap ratio, center-point distance and aspect ratio between the real box and the predicted box, making target-box regression more stable and its convergence more accurate.
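A minimal sketch of the CIoU metric named in claim 11, following the standard published formulation (IoU minus a normalized center-distance penalty minus a weighted aspect-ratio consistency term); the `(x1, y1, x2, y2)` box format and the `ciou`/`ciou_loss` helper names are assumptions for illustration:

```python
import math

def ciou(box1, box2, eps=1e-9):
    """Complete IoU between two (x1, y1, x2, y2) boxes."""
    x1, y1, x2, y2 = box1
    X1, Y1, X2, Y2 = box2
    w1, h1 = x2 - x1, y2 - y1
    w2, h2 = X2 - X1, Y2 - Y1
    # overlap ratio: intersection over union
    iw = max(0.0, min(x2, X2) - max(x1, X1))
    ih = max(0.0, min(y2, Y2) - max(y1, Y1))
    inter = iw * ih
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union
    # squared center distance over squared enclosing-box diagonal
    cw = max(x2, X2) - min(x1, X1)
    ch = max(y2, Y2) - min(y1, Y1)
    c2 = cw * cw + ch * ch + eps
    rho2 = ((x1 + x2 - X1 - X2) ** 2 + (y1 + y2 - Y1 - Y2) ** 2) / 4.0
    # aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (
        math.atan(w2 / (h2 + eps)) - math.atan(w1 / (h1 + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v

def ciou_loss(box1, box2):
    """Regression loss: 1 - CIoU, zero for a perfect prediction."""
    return 1.0 - ciou(box1, box2)
```

Unlike plain IoU, this loss still provides a useful gradient when the predicted and real boxes do not overlap, since the center-distance term remains nonzero.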
12. The improved YOLOv5-based small object detection algorithm according to claim 1, wherein, when performing feature extraction and target location classification on the pictures in the obtained data set, the target-box formula is modified so that the mapped candidate target box (region proposal) approaches the real target box (ground truth) as closely as possible. The coordinates of the target box are predicted relative to the upper-left corner of its grid cell, a 4-scale detection structure is adopted, and the target-box formulas are as follows:
bx = 2σ(tx) − 0.5 + cx
by = 2σ(ty) − 0.5 + cy
bw = pw(2σ(tw))²
bh = ph(2σ(th))²
Pr(object) · IoU(b, object) = σ(to)
13. The modification of the target-box formula according to claim 12, wherein the center coordinates bx, by and the width and height bw, bh of the predicted target box are finally obtained from the above formulas. σ(to) is the confidence of the prediction box, obtained by multiplying the probability of the prediction box by the IoU value between the prediction box and the real box. A threshold is set for σ(to), prediction boxes with lower confidence are filtered out, and the final prediction box is then obtained from the remaining boxes by a Non-Maximum Suppression (NMS) algorithm.
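The decoding formulas of claim 12 together with the confidence filtering and NMS of claim 13 can be sketched as follows; the grid offsets, anchor sizes and thresholds in the example are illustrative assumptions:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Map raw network outputs to a box via the claim-12 formulas:
    bx = 2*sigma(tx) - 0.5 + cx, bw = pw*(2*sigma(tw))**2, etc."""
    bx = 2 * sigmoid(tx) - 0.5 + cx
    by = 2 * sigmoid(ty) - 0.5 + cy
    bw = pw * (2 * sigmoid(tw)) ** 2
    bh = ph * (2 * sigmoid(th)) ** 2
    return bx, by, bw, bh

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, conf_thresh=0.25, iou_thresh=0.45):
    """Drop low-confidence boxes, then greedily suppress overlaps,
    keeping the highest-scoring box in each overlapping cluster."""
    order = [i for i in sorted(range(len(boxes)),
                               key=lambda i: scores[i], reverse=True)
             if scores[i] >= conf_thresh]
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep
```

With all raw outputs at 0, `decode_box` centers the box half a cell past the grid offset and returns the anchor dimensions unchanged, since σ(0) = 0.5.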
14. The improved YOLOv5-based small-object detection algorithm according to claim 1, wherein, in training the optimal weight model, the training process of the network is repeated and the parameters of the mask-wearing detection network are continuously corrected until the network learns the face positions in the image and can correctly determine whether each detected face wears a mask. The parameters obtained from training are then saved, that is, the optimal weight model is stored after training is finished and evaluated on the test set.
15. The improved YOLOv5-based small-object detection algorithm according to claim 1, wherein the improved algorithm is applied to mask-wearing detection in dense-crowd scenes and compared with the AIZOO algorithm and the original YOLOv5 algorithm. Three evaluation indexes commonly used in target detection are adopted to evaluate the performance of the algorithm: Average Precision (AP), mean Average Precision (mAP) and detection speed in Frames Per Second (FPS). The calculation formulas of Average Precision, mean Average Precision, Precision and Recall are as follows:
AP = ∫ P(R) dR (integrated over R from 0 to 1)
mAP = (1/N) · Σ APi (averaged over all N categories)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
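The four metrics of claim 15 can be computed with a short sketch; the all-point interpolation used for AP below is a common evaluation convention and is an assumption about the invention's exact protocol:

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of real positives that were found: TP / (TP + FN)."""
    return tp / (tp + fn)

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (all-point interpolation)."""
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # make precision monotonically non-increasing from right to left
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # sum rectangle areas under the interpolated curve
    return sum((mrec[i] - mrec[i - 1]) * mpre[i]
               for i in range(1, len(mrec)))

def mean_average_precision(aps):
    """mAP: the per-category AP values averaged over all categories."""
    return sum(aps) / len(aps)
```

FPS, the third index, is simply the number of pictures processed divided by the elapsed time, so it needs no formula of its own.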
16. A comparative experiment of the improved algorithm against the AIZOO algorithm and the original YOLOv5 algorithm as claimed in claim 15, wherein, compared with the AIZOO method and the original YOLOv5 algorithm, the algorithm of the present invention detects the small mask target in dense-crowd scenes better, with the mAP value reaching 94.88%; relative to the original YOLOv5, the AP values of the "bad" and "good" categories are increased by 3.72% and 5.38%, respectively, and the mAP value is increased by 4.55%.
17. The experiment comparing the improved algorithm with the AIZOO algorithm and the original YOLOv5 algorithm, wherein the algorithm performs remarkably well on mask-wearing detection in dense crowds: the detection precision is clearly improved, misjudgments and missed detections under crowded conditions are significantly reduced, and the robustness of the algorithm is markedly better when the mask target is at an unusual angle or the face region is occluded. Its small-target detection effect in dense-crowd scenes is a clear advantage over the other algorithms.
18. The improved YOLOv5-based small target detection algorithm as claimed in claim 1, wherein the number of iteration cycles in the training process is 140, and in the several iteration cycles after the warm-up stage ends, the model gradually converges as the cosine annealing algorithm adjusts the learning rate.
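The schedule of claim 18 can be sketched as a linear warm-up followed by cosine annealing; the warm-up length and the lr_max/lr_min values below are illustrative assumptions, only the 140-epoch budget comes from the claim:

```python
import math

def lr_schedule(epoch, total_epochs=140, warmup_epochs=3,
                lr_max=0.01, lr_min=1e-4):
    """Linear warm-up followed by cosine annealing of the learning rate.

    During warm-up the rate ramps linearly up to lr_max; afterwards it
    follows half a cosine period down to lr_min by the final epoch.
    """
    if epoch < warmup_epochs:
        return lr_max * (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / (total_epochs - 1 - warmup_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```

The cosine tail decays slowly at first and flattens out near the end, which is why the model settles into convergence in the final epochs rather than oscillating.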
19. The improved YOLOv5-based small target detection algorithm according to claim 1, wherein the experimental environment uses the Ubuntu 20.04 operating system with a GeForce GTX 1080Ti graphics card (11 GB of video memory), an Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40 GHz, CUDA version 11.4.0, PyTorch version 1.9.0 and a Python 3.7.4 language environment.
CN202111382559.1A 2021-11-22 2021-11-22 Small target detection algorithm based on improved YOLOv5 Pending CN114241548A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111382559.1A CN114241548A (en) 2021-11-22 2021-11-22 Small target detection algorithm based on improved YOLOv5


Publications (1)

Publication Number Publication Date
CN114241548A (en) 2022-03-25

Family

ID=80750231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111382559.1A Pending CN114241548A (en) 2021-11-22 2021-11-22 Small target detection algorithm based on improved YOLOv5

Country Status (1)

Country Link
CN (1) CN114241548A (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860160A (en) * 2020-06-16 2020-10-30 北京华电天仁电力控制技术有限公司 Method for detecting wearing of mask indoors
CN111931623A (en) * 2020-07-31 2020-11-13 南京工程学院 Face mask wearing detection method based on deep learning
CN111985621A (en) * 2020-08-24 2020-11-24 西安建筑科技大学 Method for building neural network model for real-time detection of mask wearing and implementation system
CN112560816A (en) * 2021-02-20 2021-03-26 北京蒙帕信创科技有限公司 Equipment indicator lamp identification method and system based on YOLOv4
CN113344849A (en) * 2021-04-25 2021-09-03 山东师范大学 Microemulsion head detection system based on YOLOv5
CN113553936A (en) * 2021-07-19 2021-10-26 河北工程大学 Mask wearing detection method based on improved YOLOv3
CN113610050A (en) * 2021-08-26 2021-11-05 齐鲁工业大学 Mask wearing real-time detection method based on YOLOv5


Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114882276A (en) * 2022-04-28 2022-08-09 哈尔滨理工大学 Improved target detection system based on YOLO v5
CN115100111A (en) * 2022-05-17 2022-09-23 大连理工大学 Intelligent robot curling detection method based on YOLOv5
CN115100495A (en) * 2022-07-08 2022-09-23 福州大学 Lightweight safety helmet detection method based on sub-feature fusion
CN115620076B (en) * 2022-09-08 2023-12-15 东南大学 Intelligent substation secondary device panel identification method, equipment and storage medium
CN115620076A (en) * 2022-09-08 2023-01-17 东南大学 Intelligent substation secondary device panel identification method, equipment and storage medium
CN115223206A (en) * 2022-09-19 2022-10-21 季华实验室 Working clothes wearing condition detection method and device, electronic equipment and storage medium
CN115331127A (en) * 2022-09-27 2022-11-11 南京瀚元科技有限公司 Unmanned aerial vehicle moving target detection method based on attention mechanism
CN115527118A (en) * 2022-10-08 2022-12-27 广东工业大学 Remote sensing image target detection method fused with attention mechanism
CN115578602A (en) * 2022-11-09 2023-01-06 北京信息科技大学 Natural tree species identification method based on improved YOLOv7
CN115880676A (en) * 2022-12-21 2023-03-31 南通大学 Self-service vending machine commodity identification method based on deep learning
CN115880676B (en) * 2022-12-21 2024-04-09 南通大学 Self-service vending machine commodity identification method based on deep learning
CN116168017A (en) * 2023-04-18 2023-05-26 南京信息工程大学 Deep learning-based PCB element detection method, system and storage medium
CN116342596A (en) * 2023-05-29 2023-06-27 云南电网有限责任公司 YOLOv5 improved substation equipment nut defect identification detection method
CN116342596B (en) * 2023-05-29 2023-11-28 云南电网有限责任公司 YOLOv5 improved substation equipment nut defect identification detection method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20220325