CN116453186A - Improved mask wearing detection method based on YOLOv5 - Google Patents

Improved mask wearing detection method based on YOLOv5 Download PDF

Info

Publication number
CN116453186A
Authority
CN
China
Prior art keywords
yolov5
mask
model
improved
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310400564.3A
Other languages
Chinese (zh)
Inventor
王媛媛
陈秀川
张兴潮
沈俞
王超
江飞龙
张海艳
任珂
严少峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaiyin Institute of Technology
Original Assignee
Huaiyin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaiyin Institute of Technology filed Critical Huaiyin Institute of Technology
Priority to CN202310400564.3A priority Critical patent/CN116453186A/en
Publication of CN116453186A publication Critical patent/CN116453186A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an improved mask wearing detection method based on YOLOv5, and belongs to the field of target detection. The method comprises the following steps: collecting target image sample data and constructing a sample data set for mask wearing detection; analyzing the problems existing in mask wearing detection and improving the YOLOv5 target detection model accordingly. The K-means++ algorithm is adopted in place of the K-means algorithm to acquire anchor parameters and optimize the target anchor boxes; an ACmix attention module is added and the original YOLOv5 neck structure is replaced with a weighted bidirectional feature pyramid network (BiFPN), which improves the detection of small targets and raises target detection accuracy; the traditional convolution module is replaced with a ghost shuffle convolution (GSConv) module, which combines standard convolution and depthwise separable convolution with a channel shuffle operation, improving network speed; SIoU_Loss is introduced as the bounding box regression loss of the improved YOLOv5 algorithm, so that the loss function converges stably, the prediction error is reduced and the regression accuracy is improved. The improved YOLOv5 detection algorithm adopted by the method not only improves mask wearing detection accuracy, but also improves the detection of small targets such as masks worn by distant and densely grouped people, and reduces false detections and missed detections.

Description

Improved mask wearing detection method based on YOLOv5
Technical Field
The invention belongs to the field of target detection, and particularly relates to a mask wearing detection method based on an improved YOLOv5.
Background
In specific working areas such as coal mines and other mines, proper and standardized wearing of dustproof masks is a requirement for safe work. In enclosed coal mine and mine environments, failure to wear a dustproof mask correctly poses serious potential safety hazards to workers' health, and missed detections and false detections easily occur for small-target masks and in crowded locations. Therefore, to ensure that workers operate safely and in an orderly manner, the problem of mask wearing detection is of primary importance.
Target detection technology based on deep learning is developing rapidly and is mainly applied to fields such as industrial inspection and traffic safety. Deep-learning-based target detection is divided into two categories: two-stage and single-stage detection. Two-stage target detection is mainly represented by Mask R-CNN; it achieves high detection accuracy but poor real-time performance. Single-stage target detection is mainly represented by YOLO and SSD; its detection accuracy is slightly lower than that of the two-stage methods, but it can guarantee real-time detection. Therefore, deep-learning-based target detection technology provides a brand-new solution to the mask wearing problem.
Disclosure of Invention
The invention aims to solve the problem that the current YOLOv5 target detection model performs insufficiently when detecting small mask targets, and provides an improved mask wearing detection method based on YOLOv5 which not only improves mask wearing detection accuracy, but also improves the detection of small targets such as masks worn by distant and densely grouped people, and reduces false detections and missed detections.
The technical solution adopted by the invention is as follows:
An improved mask wearing detection method based on YOLOv5, specifically carried out according to the following steps:
step 1: collecting image data of worn and unworn masks, marking by using a LabelImg tool, and manufacturing a data set;
step 2: building an improved YOLOv5 network model;
step 3: inputting the training set image data of the mask data set into the improved YOLOv5 network model of step 2 for training, saving the model parameters with the highest accuracy of the improved YOLOv5 model on the validation set during training, and naming the file best.pt;
Step 4: inputting an image to be detected, containing targets wearing masks and targets not wearing masks, into the improved YOLOv5 model, and loading the optimal weight file best.pt into the detection model for inference to obtain whether each target in the image to be detected is wearing a mask.
Preferably, step 1 specifically includes:
step 1.1: collecting 7952 images of worn and unworn masks online from Baidu image search and the PaddlePaddle open-source data; the image data mainly comprises images of correctly worn masks and images of unworn masks;
step 1.2: labeling the pictures with LabelImg to generate the corresponding xml annotation files;
step 1.3: converting the generated xml files into txt files that can be used to train the YOLOv5 model, and dividing the data set into a training set and a validation set at a ratio of 8:2 for model training.
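By way of illustration, step 1.3 can be sketched as follows. This is a minimal sketch only: the class names (mask, no_mask) and the directory layout are assumptions not fixed by the invention, and a Pascal-VOC-style xml layout as produced by LabelImg is assumed.

```python
# Hedged sketch: convert Pascal-VOC xml labels to YOLO txt format and split 8:2.
# Class names and paths are assumptions; adapt to the actual data set.
import os, glob, random
import xml.etree.ElementTree as ET

CLASSES = ["mask", "no_mask"]          # assumed label names

def voc_to_yolo(xml_path, txt_dir):
    root = ET.parse(xml_path).getroot()
    w = float(root.find("size/width").text)
    h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        cls = obj.find("name").text
        if cls not in CLASSES:
            continue
        b = obj.find("bndbox")
        x1, y1 = float(b.find("xmin").text), float(b.find("ymin").text)
        x2, y2 = float(b.find("xmax").text), float(b.find("ymax").text)
        # YOLO format: class x_center y_center width height, all normalized
        cx, cy = (x1 + x2) / 2 / w, (y1 + y2) / 2 / h
        bw, bh = (x2 - x1) / w, (y2 - y1) / h
        lines.append(f"{CLASSES.index(cls)} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}")
    name = os.path.splitext(os.path.basename(xml_path))[0]
    with open(os.path.join(txt_dir, name + ".txt"), "w") as f:
        f.write("\n".join(lines))

xml_files = glob.glob("annotations/*.xml")
os.makedirs("labels", exist_ok=True)
for p in xml_files:
    voc_to_yolo(p, "labels")

random.seed(0)
random.shuffle(xml_files)
split = int(0.8 * len(xml_files))      # 8:2 train/validation split
train, val = xml_files[:split], xml_files[split:]
```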
Preferably, step 2 specifically includes:
step 2.1: in the backbone network, an ACmix attention module integrating self-attention and convolution is added at the 9th layer; it takes both global and local features into account, obtains a larger receptive field, captures more feature information of small target objects and enhances the detection effect of the improved network on small targets. At the same time, compared with pure convolution or pure self-attention computation, the ACmix module has minimal computational overhead;
the ACmix module works as follows: the input H×W×C feature is projected by three 1×1×C convolutions, yielding 3×N intermediate features (each of size H×W×C/N) that enter the next stage. The next stage is split into a convolution branch and a self-attention branch. The convolution branch is a convolution path with kernel size k: the sub-features first pass through a fully connected layer to generate k² feature maps, traditional convolution operations such as shifting and aggregation are applied to the generated features, and local receptive-field information is collected to form the H×W×C output of this branch. For the self-attention branch, the 3×N input features are divided into three groups of N features corresponding respectively to value, key and query; information is gathered in the traditional self-attention manner to obtain the H×W×C output of this stage. Finally, the outputs of the convolution branch and the self-attention branch are combined in parallel, with the strength of each branch controlled by the scalars α and β respectively, according to the formula:
F_out = αF_att + βF_conv (1)
In the above formula (1), F_att and F_conv are the output of the self-attention branch and the output of the convolution branch respectively, and F_out is the final combined output;
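As an illustration only, a simplified PyTorch-style sketch of the parallel fusion in formula (1) is given below; it reproduces the learnable α/β weighting of a convolution branch and a self-attention branch computed from shared 1×1 projections, while the shift and aggregation details of the full ACmix module are omitted and all layer sizes are assumptions.

```python
# Hedged sketch of ACmix-style fusion: a convolution branch and a
# self-attention branch computed from shared 1x1 projections, combined
# as F_out = alpha * F_att + beta * F_conv (formula (1)).
import torch
import torch.nn as nn

class ACmixLite(nn.Module):
    def __init__(self, c, heads=4, k=3):
        super().__init__()
        self.proj = nn.Conv2d(c, 3 * c, 1)            # three 1x1xC projections
        self.conv_branch = nn.Conv2d(3 * c, c, k, padding=k // 2)
        self.attn = nn.MultiheadAttention(c, heads, batch_first=True)
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x):
        b, c, h, w = x.shape
        qkv = self.proj(x)                            # B x 3C x H x W
        # convolution branch: local receptive-field aggregation
        f_conv = self.conv_branch(qkv)
        # self-attention branch: split projections into query / key / value
        q, k_, v = qkv.chunk(3, dim=1)
        flat = lambda t: t.flatten(2).transpose(1, 2) # B x HW x C
        f_att, _ = self.attn(flat(q), flat(k_), flat(v))
        f_att = f_att.transpose(1, 2).reshape(b, c, h, w)
        return self.alpha * f_att + self.beta * f_conv

x = torch.randn(1, 64, 20, 20)
print(ACmixLite(64)(x).shape)   # torch.Size([1, 64, 20, 20])
```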
step 2.2: in the YOLOv5 neck structure, a weighted bidirectional feature pyramid network (BiFPN) replaces the original feature fusion layer. The feature fusion layer of the original target detection model adopts a top-down FPN, which is limited by unidirectional information flow and has lower accuracy. BiFPN fuses features of different scales with a bidirectional top-down and bottom-up method, unifies the different resolutions by up-sampling and down-sampling, and adds bidirectional connections at the same scale to achieve higher-level fusion across scales; fusing low-dimensional and high-dimensional features through the BiFPN structure alleviates, to a certain extent, the loss of feature information of small target objects. Unlike traditional feature fusion that simply stacks or adds features, BiFPN balances feature information of different scales and weights each input feature according to its learned importance, providing a simple and efficient weighted feature fusion mechanism; for this purpose BiFPN adopts fast normalized fusion, which is faster than Softmax-based fusion. The fast normalization method is defined as:
O = Σ_i ( w_i / (ε + Σ_j w_j) ) · I_i (2)
In the above formula (2), O represents the final result of the weighted feature fusion, w_i and w_j represent learnable weights with w_i, w_j ≥ 0, ε is a small value used to ensure numerical stability, and I_i represents the input features;
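The following is a minimal sketch of fast normalized fusion for two input feature maps, written directly from formula (2); the number of inputs and the place where it is applied in the neck are assumptions made for illustration.

```python
# Hedged sketch of BiFPN fast normalized fusion (formula (2)):
# O = sum_i ( w_i / (eps + sum_j w_j) ) * I_i, with w_i >= 0 enforced by ReLU.
import torch
import torch.nn as nn

class FastNormalizedFusion(nn.Module):
    def __init__(self, n_inputs=2, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = torch.relu(self.w)                 # keep weights non-negative
        norm = w / (self.eps + w.sum())        # fast normalization, no Softmax
        return sum(ni * xi for ni, xi in zip(norm, inputs))

p4 = torch.randn(1, 256, 40, 40)
p4_td = torch.randn(1, 256, 40, 40)
fused = FastNormalizedFusion(2)([p4, p4_td])
print(fused.shape)  # torch.Size([1, 256, 40, 40])
```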
step 2.3: to reduce model complexity while maintaining accuracy, the improved network model replaces the traditional convolution module in the neck structure with a ghost shuffle convolution (GSConv) module. The GSConv module combines standard convolution with depthwise separable convolution and a channel shuffle operation, maintaining accuracy to a certain extent while reducing model complexity. Compared with depthwise separable convolution, the GSConv module alleviates the heavy loss of channel information; compared with standard convolution, it reduces the consumption of computational resources.
Because the input picture data must pass through the transformation process of the backbone network, in which spatial information is gradually transferred into the channels, each spatial compression and channel expansion of the feature map risks losing semantic information: dense convolution preserves the connections between channels to the greatest extent, whereas sparse convolution cuts them off. Introducing the GSConv module into the YOLOv5 neck structure preserves the connections between channels to a certain extent, reduces model complexity and maintains target detection accuracy. The improved network model uses the GSConv module only in the neck structure; if it were used throughout the whole network, the number of layers would increase and so would the inference time. When the feature map reaches the neck structure, its channel dimension is largest and its height and width are smallest, so using the GSConv module there keeps the model accuracy near optimal while improving inference speed.
The principle of the ghost shuffle convolution (GSConv) module is as follows: let the number of input channels be C1 and the number of output channels be C2. The input feature map first passes through a standard convolution to obtain a feature tensor with C2/2 output channels; this tensor then passes through a depthwise separable convolution to obtain another feature tensor, and the two tensors are concatenated. After a channel shuffle, the information generated by the standard convolution is mixed into every part of the features generated by the depthwise separable convolution;
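A compact PyTorch-style sketch of this description is shown below; the kernel sizes and the batch-norm/activation stack inside each half are assumptions, since the original only fixes the C2/2 split, the concatenation and the channel shuffle.

```python
# Hedged GSConv sketch: standard conv to C2/2 channels, depthwise conv on that
# half, concatenation, then a channel shuffle that mixes the two halves.
import torch
import torch.nn as nn

class GSConv(nn.Module):
    def __init__(self, c1, c2, k=3, s=1):
        super().__init__()
        c_ = c2 // 2
        self.dense = nn.Sequential(                       # standard convolution half
            nn.Conv2d(c1, c_, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_), nn.SiLU())
        self.cheap = nn.Sequential(                       # depthwise convolution half
            nn.Conv2d(c_, c_, 5, 1, 2, groups=c_, bias=False),
            nn.BatchNorm2d(c_), nn.SiLU())

    def forward(self, x):
        y1 = self.dense(x)
        y2 = self.cheap(y1)
        y = torch.cat((y1, y2), dim=1)                    # B x C2 x H x W
        # channel shuffle: interleave the two halves
        b, c, h, w = y.shape
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)

print(GSConv(64, 128)(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 128, 40, 40])
```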
step 2.4: to obtain optimal anchor parameters for the improved YOLOv5 network model, the K-means++ algorithm is applied to re-cluster the widths and heights of the target boxes in the mask data set, acquiring prior box sizes matched to the masks and reducing the false detection rate of the model. First, the clustering method randomly selects one sample of the mask data set as the initial cluster center. Next, the distance D(x) between each sample and the selected cluster centers is calculated. Finally, the probability P that a sample point is selected as the next cluster center is calculated according to formula (3):
P(x) = D(x)² / Σ_{x∈X} D(x)² (3)
The above steps are repeated until K cluster centers have been selected, thereby determining the optimal anchor parameters of the improved model;
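A small sketch of K-means++ center initialization over the labeled box widths and heights, following formula (3), is given below; the use of plain Euclidean distance (rather than an IoU-based distance) and the value K = 9 are assumptions made for illustration.

```python
# Hedged sketch of K-means++ initialization over (width, height) pairs,
# selecting each new center with probability P(x) = D(x)^2 / sum D(x)^2.
import numpy as np

def kmeanspp_init(boxes, k=9, seed=0):
    rng = np.random.default_rng(seed)
    centers = [boxes[rng.integers(len(boxes))]]       # first center: random sample
    for _ in range(k - 1):
        d2 = np.min(((boxes[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        prob = d2 / d2.sum()                          # formula (3)
        centers.append(boxes[rng.choice(len(boxes), p=prob)])
    return np.array(centers)

wh = np.abs(np.random.randn(500, 2)) * 50 + 20        # stand-in for labeled box sizes
print(kmeanspp_init(wh, k=9))
```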
step 2.5: an improved network loss function is provided: SIoU_Loss replaces the bounding box loss function of the YOLOv5 network. SIoU_Loss introduces the vector angle between the ground-truth box and the predicted box and redefines the loss function, which improves the training speed and inference capability of the model. SIoU_Loss mainly contains three parts: angle loss, distance loss and shape loss.
The angle loss is defined as
Λ = 1 - 2·sin²( arcsin(c_h/σ) - π/4 ),
where c_h is the height difference between the center points of the predicted box and the ground-truth box, σ is the distance between the two center points, α is the angle between the line connecting the two center points and the horizontal direction, (b_cx^gt, b_cy^gt) is the center coordinate point of the ground-truth box and (b_cx, b_cy) is the center coordinate point of the predicted box. If the angle α equals π/2 or 0, the angle loss is 0; if α is smaller than π/4, the minimization is carried out over α, otherwise over β = π/2 - α.
The distance loss is defined as
Δ = Σ_{t=x,y} ( 1 - e^(-γ·ρ_t) ),  ρ_x = ( (b_cx^gt - b_cx) / c_w )²,  ρ_y = ( (b_cy^gt - b_cy) / c_h )²,
where (c_w, c_h) are the width and height of the minimum enclosing rectangle of the predicted box and the ground-truth box. As the angle approaches 0, the contribution of the distance loss decreases; conversely, as the angle approaches π/4, its contribution increases, so the distance term is given priority through γ, which grows with the angle:
γ = 2 - Λ (12)
The shape loss is defined as
Ω = Σ_{t=w,h} ( 1 - e^(-ω_t) )^θ,  ω_w = |w - w^gt| / max(w, w^gt),  ω_h = |h - h^gt| / max(h, h^gt),
where θ controls the degree of attention paid to the shape loss and is defined between 2 and 6, and (w, h) and (w^gt, h^gt) are the width and height of the predicted box and the ground-truth box respectively.
The SIoU loss function is then defined as
Loss_SIoU = 1 - IoU + (Δ + Ω) / 2.
With the added angle loss, the expression of the loss function is more complete, model training converges more stably, the regression accuracy of training is improved, the training speed is increased and the model prediction error is reduced.
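For illustration, a hedged PyTorch-style sketch of SIoU_Loss assembled from the three components above is shown next; it assumes boxes in (x1, y1, x2, y2) format and θ = 4, and is a simplified reading of the formulas rather than the exact implementation of the invention.

```python
# Hedged sketch of SIoU bounding-box loss: angle, distance and shape terms
# combined as loss = 1 - IoU + (Delta + Omega) / 2. Boxes are (x1, y1, x2, y2).
import torch

def siou_loss(pred, gt, theta=4, eps=1e-7):
    # widths, heights and centers
    pw, ph = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    gw, gh = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]
    pcx, pcy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    gcx, gcy = (gt[:, 0] + gt[:, 2]) / 2, (gt[:, 1] + gt[:, 3]) / 2

    # IoU
    iw = (torch.min(pred[:, 2], gt[:, 2]) - torch.max(pred[:, 0], gt[:, 0])).clamp(0)
    ih = (torch.min(pred[:, 3], gt[:, 3]) - torch.max(pred[:, 1], gt[:, 1])).clamp(0)
    inter = iw * ih
    union = pw * ph + gw * gh - inter + eps
    iou = inter / union

    # minimum enclosing rectangle of the two boxes
    cw = torch.max(pred[:, 2], gt[:, 2]) - torch.min(pred[:, 0], gt[:, 0])
    ch = torch.max(pred[:, 3], gt[:, 3]) - torch.min(pred[:, 1], gt[:, 1])

    # angle loss: Lambda = 1 - 2 sin^2(arcsin(c_h / sigma) - pi/4)
    sigma = torch.sqrt((gcx - pcx) ** 2 + (gcy - pcy) ** 2) + eps
    sin_alpha = (gcy - pcy).abs() / sigma
    angle = 1 - 2 * torch.sin(torch.arcsin(sin_alpha.clamp(-1, 1)) - torch.pi / 4) ** 2

    # distance loss with gamma = 2 - Lambda
    gamma = 2 - angle
    rho_x = ((gcx - pcx) / (cw + eps)) ** 2
    rho_y = ((gcy - pcy) / (ch + eps)) ** 2
    dist = (1 - torch.exp(-gamma * rho_x)) + (1 - torch.exp(-gamma * rho_y))

    # shape loss with theta in [2, 6]
    omega_w = (pw - gw).abs() / torch.max(pw, gw).clamp(min=eps)
    omega_h = (ph - gh).abs() / torch.max(ph, gh).clamp(min=eps)
    shape = (1 - torch.exp(-omega_w)) ** theta + (1 - torch.exp(-omega_h)) ** theta

    return 1 - iou + (dist + shape) / 2

pred = torch.tensor([[10., 10., 50., 60.]])
gt = torch.tensor([[12., 8., 55., 58.]])
print(siou_loss(pred, gt))
```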
preferably, step 3 specifically includes:
step 3.1: inputting the training set picture data of the mask data set into the improved network model and training on a GPU. The parameters of the improved network model are set with an SGD optimizer: 150 training epochs, momentum 0.937, initial learning rate 0.01, minimum learning rate 0.0001, weight decay coefficient 0.00005 and batch size 16. During model training, the model parameters with the best accuracy are saved and named best.pt.
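These hyperparameters map directly onto a standard PyTorch SGD setup, sketched below; the model and the validation step are placeholders, and the cosine schedule used to decay the learning rate from 0.01 to 0.0001 is an assumption about how the two learning-rate values are combined.

```python
# Hedged sketch of the training configuration from step 3.1 expressed with
# torch.optim: SGD, momentum 0.937, lr 0.01 -> 0.0001, weight decay 5e-5,
# 150 epochs, batch size 16. Model and dataloader are placeholders.
import torch

EPOCHS, BATCH_SIZE = 150, 16
LR0, LR_MIN, MOMENTUM, WEIGHT_DECAY = 0.01, 0.0001, 0.937, 0.00005

model = torch.nn.Conv2d(3, 16, 3)          # stands in for the improved YOLOv5 model
optimizer = torch.optim.SGD(model.parameters(), lr=LR0,
                            momentum=MOMENTUM, weight_decay=WEIGHT_DECAY)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=EPOCHS, eta_min=LR_MIN)

best_map = 0.0
for epoch in range(EPOCHS):
    # ... train one epoch over the mask training set, then validate ...
    scheduler.step()
    val_map = 0.0                          # placeholder for the measured validation mAP
    if val_map > best_map:                 # keep the best checkpoint as best.pt
        best_map = val_map
        torch.save(model.state_dict(), "best.pt")
```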
Preferably, step 4 specifically includes:
step 4.1: to evaluate the performance of the improved model, Precision and mean Average Precision (mAP) are adopted as the evaluation indexes of the model algorithm. Precision measures the accuracy of the targets detected by the model and is defined as the proportion of correct predictions among all predictions of positive samples:
Precision = TP / (TP + FP),  Recall = TP / (TP + FN),  mAP = (1/n) · Σ_{i=1}^{n} AP_i
In the above formulas, Recall is the recall rate, TP represents the number of samples correctly predicted as positive, FP represents the number of negative samples incorrectly predicted as positive, FN represents the number of positive samples incorrectly predicted as negative, n represents the number of categories, and AP_i represents the Average Precision (AP) of the i-th class.
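A small sketch of these metrics at fixed counts is shown below; a full mAP evaluation additionally sweeps confidence thresholds to build a precision-recall curve per class, which is omitted here, and the example counts and AP values are placeholders.

```python
# Hedged sketch: Precision = TP/(TP+FP), Recall = TP/(TP+FN), and mAP as the
# mean of per-class average precisions (AP values assumed to be precomputed).
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def mean_average_precision(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)

p, r = precision_recall(tp=180, fp=7, fn=12)            # illustrative counts
print(round(p, 3), round(r, 3))
print(mean_average_precision([0.95, 0.93]))             # e.g. mask / no_mask classes
```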
Drawings
FIG. 1 is a flow chart of a mask wearing detection method based on the Yolov5 improvement of the invention;
FIG. 2 is a diagram of an overall network model architecture for a mask wear detection method based on the improved YOLOv5 of the present invention;
FIG. 3 is a schematic block diagram of an ACmix attention module in a mask wearing detection method based on YOLOv5 improvement of the invention;
FIG. 4 is a schematic structural diagram of the weighted bidirectional feature pyramid network (BiFPN) used in the mask wearing detection method based on the YOLOv5 improvement of the invention;
FIG. 5 is a schematic structural diagram of the ghost shuffle convolution (GSConv) used in the mask wearing detection method based on the YOLOv5 improvement of the invention;
FIG. 6 is a graph showing the training process of the improved YOLOv5 network model in the improved mask wearing detection method based on YOLOv5 of the present invention;
FIG. 7 is a diagram showing the mask wearing detection effect of the improved YOLOv5 network model;
FIG. 8 is a diagram showing the mask wearing detection effect of the original YOLOv5 network model;
FIG. 9 is a diagram showing the mask wearing detection effect of the improved YOLOv5 network model;
FIG. 10 is a diagram showing the mask wearing detection effect of the original YOLOv5 network model;
FIG. 11 is a diagram showing the mask wearing detection effect of the improved YOLOv5 network model;
FIG. 12 is a diagram showing the mask wearing detection effect of the original YOLOv5 network model.
Detailed Description
In order to make the objects, features and advantages of the present invention clearer, the mask wearing detection method based on YOLOv5 of the present invention is described in detail below with reference to the accompanying drawings and specific embodiments. The specific embodiments described herein are intended only to illustrate the invention and not to limit it.
The invention aims to overcome the defects of the prior art and provides an improved mask wearing detection method based on YOLOv5, which solves the problems of missed detection and false detection caused by small mask targets and densely worn masks among workers in coal mines and mines. As shown in fig. 1 to 5, the improved mask wearing detection method based on YOLOv5 includes the following steps:
step 1: collecting image data of worn and unworn masks, marking by using a LabelImg tool, and manufacturing a data set;
7952 images of worn and unworn masks are collected online from Baidu image search and the PaddlePaddle open-source data; the image data mainly comprises images of correctly worn masks and images of unworn masks. The pictures are labeled with the LabelImg labeling software, generating an xml annotation file for each picture. The generated xml files are converted into txt files that can be used to train the YOLOv5 model, and the data set is divided into a training set and a validation set at a ratio of 8:2 for model training.
Step 2: building an improved YOLOv5 network model;
YOLOv5 is a single-stage object detection algorithm with a fast inference speed and a compact network structure. As can be seen from fig. 2, the YOLOv5 network model mainly comprises four parts: the Input end, the Backbone network, the Neck structure and the Head output layer. The improved target detection model is obtained by improving the original YOLOv5 network model.
Step 2.1: adding an attention module ACmix into the YOLOv5 backbone network;
In the backbone network, an ACmix attention module integrating self-attention and convolution is added at the 9th layer; it takes both global and local features into account, obtains a larger receptive field, captures more feature information of small target objects and enhances the detection effect of the improved network on small targets. At the same time, compared with pure convolution or pure self-attention computation, the ACmix module has minimal computational overhead.
Step 2.2: adopting a weighted bidirectional feature pyramid network (BiFPN) in place of the original feature fusion layer as the feature fusion network of the improved YOLOv5;
BiFPN fuses features of different scales with a bidirectional top-down and bottom-up method, unifies the different resolutions by up-sampling and down-sampling, and adds bidirectional connections at the same scale to achieve higher-level fusion across scales; fusing low-dimensional and high-dimensional features through the BiFPN structure alleviates, to a certain extent, the loss of feature information of small target objects.
Step 2.3: in the neck structure of the improved network model, the traditional convolution module is replaced by a ghost shuffle convolution (GSConv) module, so that certain model complexity is reduced and certain precision is maintained;
the ghost shuffle convolution (GSConv) structure combines standard convolution with depth separable convolution and channel shuffling operations, maintaining accuracy to some extent and reducing model complexity. Compared with the depth separable convolution, the ghost shuffle convolution (GSConv) module solves the problem of massive loss of channel information. The ghost shuffle convolution (GSConv) module reduces the computational resource occupation problem compared to standard convolution. And a ghost shuffle convolution (GSConv) module is introduced into a YOLOv5 model neck structure, so that the connection between channels is reserved to a certain extent, the complexity of the model is reduced, and the accuracy of target detection is ensured. The improved network model only uses a ghost shuffle convolution (GSConv) module in the neck structure, and if the model uses the ghost shuffle convolution (GSConv) module in the whole stage, the number of layers of the network model is increased, and the model reasoning time is increased. When the feature map enters the neck structure, the channel dimension is the largest, the height and width dimension is the smallest, and the model accuracy is optimal and the model reasoning speed is improved by using a ghost shuffle convolution (GSConv) module.
Step 2.4: adopting the K-means++ algorithm in place of the K-means algorithm to acquire anchor parameters and optimize the target anchor boxes;
In the YOLOv5 algorithm, the initial anchor box width and height are set for each data set. During network training, the network outputs prediction boxes based on the initial anchor boxes and compares them with the ground-truth boxes to calculate the difference between the two. Therefore, setting the parameter values of the initial candidate boxes is critical to the training of the target detection model. For the improved YOLOv5 network model, the K-means++ algorithm is applied to re-cluster the widths and heights of the target boxes in the mask data set, acquiring prior box sizes matched to the masks and reducing the false detection rate of the model.
Step 2.5: an improved network loss function is provided: SIoU_Loss replaces the bounding box loss function of the YOLOv5 network.
The original YOLOv5 adopts GIoU_Loss as the bounding box loss function, but it converges slowly, and GIoU_Loss degenerates to a constant when the ground-truth box and the predicted box completely cover each other. SIoU_Loss, by contrast, introduces the vector angle between the ground-truth box and the predicted box and redefines the loss function, improving the training speed and inference capability of the model; SIoU_Loss mainly contains three parts: angle loss, distance loss and shape loss.
The angle loss is defined as
Λ = 1 - 2·sin²( arcsin(c_h/σ) - π/4 ),
where c_h is the height difference between the center points of the predicted box and the ground-truth box, σ is the distance between the two center points, α is the angle between the line connecting the two center points and the horizontal direction, (b_cx^gt, b_cy^gt) is the center coordinate point of the ground-truth box and (b_cx, b_cy) is the center coordinate point of the predicted box. If the angle α equals π/2 or 0, the angle loss is 0; if α is smaller than π/4, the minimization is carried out over α, otherwise over β = π/2 - α.
The distance loss is defined as
Δ = Σ_{t=x,y} ( 1 - e^(-γ·ρ_t) ),  ρ_x = ( (b_cx^gt - b_cx) / c_w )²,  ρ_y = ( (b_cy^gt - b_cy) / c_h )²,
where (c_w, c_h) are the width and height of the minimum enclosing rectangle of the predicted box and the ground-truth box. As the angle approaches 0, the contribution of the distance loss decreases; conversely, as the angle approaches π/4, its contribution increases, so the distance term is given priority through γ, which grows with the angle:
γ = 2 - Λ (8)
The shape loss is defined as
Ω = Σ_{t=w,h} ( 1 - e^(-ω_t) )^θ,  ω_w = |w - w^gt| / max(w, w^gt),  ω_h = |h - h^gt| / max(h, h^gt),
where θ controls the degree of attention paid to the shape loss and is defined between 2 and 6, and (w, h) and (w^gt, h^gt) are the width and height of the predicted box and the ground-truth box respectively.
The SIoU loss function is then defined as
Loss_SIoU = 1 - IoU + (Δ + Ω) / 2.
With the added angle loss, the expression of the loss function is more complete, model training converges more stably, the regression accuracy of training is improved, the training speed is increased and the model prediction error is reduced.
step 3: inputting training set image data in the mask data set into the improved YOLOv5 network model in the step 2 for training, storing the model parameter with highest accuracy of the improved YOLOv5 model on the verification set in the training process, and naming the file as best.
According to the mask wearing detection method based on the YOLOv5 improvement, the experimental environment is configured to be Ubuntu20.04 operating system, NVIDIA GeForce RTX 3090 display cards are used, and the deep learning framework is PyTorch. The specific configuration is shown in table 1 below:
table 1 experimental environment configuration table
The training set picture data of the mask data set are input into the improved network model and trained on a GPU. The parameters of the improved network model are set with an SGD optimizer: 150 training epochs, momentum 0.937, initial learning rate 0.01, minimum learning rate 0.0001, weight decay coefficient 0.00005 and batch size 16. During model training, the model parameters with the best accuracy are saved and named best.pt.
Step 4: inputting an image to be detected, containing targets wearing masks and targets not wearing masks, into the improved YOLOv5 model, and loading the optimal weight file best.pt into the detection model for inference to obtain whether each target in the image to be detected is wearing a mask.
To evaluate the performance of the improved model, Precision and mean Average Precision (mAP) are adopted as the evaluation indexes of the model algorithm. Precision measures the accuracy of the targets detected by the model and is defined as the proportion of correct predictions among all predictions of positive samples:
Precision = TP / (TP + FP),  Recall = TP / (TP + FN),  mAP = (1/n) · Σ_{i=1}^{n} AP_i
In the above formulas, Recall is the recall rate, TP represents the number of samples correctly predicted as positive, FP represents the number of negative samples incorrectly predicted as positive, FN represents the number of positive samples incorrectly predicted as negative, n represents the number of categories, and AP_i represents the Average Precision (AP) of the i-th class.
Fig. 6 shows the curves of the various indexes during the 150-epoch model training process. After training, the accuracy on the validation set reaches 96.5% and the mAP reaches 94.1%. Compared with the original YOLOv5, the algorithm herein improves Precision by 2.1% and mAP by 0.3%. Precision represents the accuracy of the model: the larger this index, the better the recognition effect of the model; mAP represents the overall quality of the model across all categories: the larger this index, the better the network performance of the model. The algorithm index comparison is shown in Table 2, which demonstrates the feasibility of the algorithm of the invention.
Table 2 algorithm performance comparison table
The improved algorithm is applied to coal mine and mine environments: on-site images of coal mine workers are acquired and input into the improved YOLOv5 model, the optimal weight file best.pt is loaded into the detection model for inference, and whether each target in the image to be detected is wearing a mask is obtained.
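As a hedged illustration of this deployment step (not the exact code of the invention), loading best.pt through the public ultralytics/yolov5 torch.hub interface might look like the following; the image path and confidence threshold are placeholders, and fetching the repository requires network access.

```python
# Hedged inference sketch: load the saved best.pt weights and run detection
# on a site image, printing each detected box with its class and confidence.
import torch

# the 'custom' entry point of the public ultralytics/yolov5 hub loads user weights
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")
model.conf = 0.5                         # confidence threshold (placeholder)

results = model("site_image.jpg")        # path to an on-site worker image
for *xyxy, conf, cls in results.xyxy[0].tolist():
    print(f"class={model.names[int(cls)]} conf={conf:.2f} box={[round(v) for v in xyxy]}")
```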
The algorithm provided herein addresses the occlusion of small mask targets and masks in crowded places in coal mines and mines, reduces false detections and missed detections of small mask targets, and improves their detection accuracy. Fig. 7 and fig. 9 show detection results obtained by the algorithm proposed herein, while fig. 8 and fig. 10 show the corresponding results of the original YOLOv5; compared with the original YOLOv5, the improved algorithm handles small mask targets and occlusion in crowded places with higher detection accuracy and correct recognition. Fig. 12 shows the detection result of the original YOLOv5, in which missed detections and false detections occur, while fig. 11 shows the result of the algorithm herein, which resolves the missed detections and false detections present in the original YOLOv5 network.
The improved mask wearing detection method based on YOLOv5 of the invention not only improves mask wearing detection accuracy, but also improves the detection of small mask targets and occluded masks in crowded places, and reduces false detections and missed detections.
The above-described embodiment is one embodiment of the present invention, but the embodiments of the present invention are not limited thereto; any modifications, substitutions and improvements made by those skilled in the art without departing from the principle and spirit of the present invention are included in the scope of the present invention.

Claims (5)

1. A mask wearing detection method based on YOLOv5 is characterized by comprising the following steps:
step 1: collecting image data of worn and unworn masks, marking by using a LabelImg tool, and manufacturing a data set;
step 2: building an improved YOLOv5 network model;
step 3: inputting the training set image data of the mask data set into the improved YOLOv5 network model of step 2 for training, saving the model parameters with the highest accuracy of the improved YOLOv5 model on the validation set during training, and naming the file best.pt;
Step 4: inputting an image to be detected, containing targets wearing masks and targets not wearing masks, into the improved YOLOv5 model, and loading the optimal weight file best.pt into the detection model for inference to obtain whether each target in the image to be detected is wearing a mask.
2. The method for detecting the wearing of the mask based on YOLOv5 according to claim 1, wherein image data of the worn mask and the unworn mask are collected in the step 1, marked by a LabelImg tool, and a data set is produced, and the method specifically comprises the following steps:
step 1.1: collecting 7952 images of worn and unworn masks online from Baidu image search and the PaddlePaddle open-source data, wherein the image data comprises images of correctly worn masks and images of unworn masks;
step 1.2: performing picture marking by using LabelImg to generate a corresponding xml marking file;
step 1.3: converting the generated xml files into txt files that can be used to train the YOLOv5 model, and dividing the data set into a training set and a validation set at a ratio of 8:2 for model training.
3. The mask wearing detection method based on YOLOv5 of claim 1, wherein an improved YOLOv5 network frame is built in step 2, and specifically comprises the following steps:
step 2.1: adding an ACmix attention module integrating self-attention and convolution to the 9th layer of the YOLOv5 backbone network, wherein the ACmix attention module combines in parallel the outputs obtained by the convolution branch and the self-attention branch, with the strength of each branch controlled by the scalars α and β respectively, according to the formula:
F_out = αF_att + βF_conv (1)
In the above formula (1), F_att and F_conv are the output of the self-attention branch and the output of the convolution branch respectively, and F_out is the final combined output;
step 2.2: in the YOLOv5 neck structure, using a weighted bidirectional feature pyramid network (BiFPN) in place of the original feature fusion layer, wherein the BiFPN adopts fast normalized fusion, which is faster than the Softmax-based fusion method and is defined as:
O = Σ_i ( w_i / (ε + Σ_j w_j) ) · I_i (2)
In the above formula (2), O represents the final result of the weighted feature fusion, w_i and w_j represent learnable weights with w_i, w_j ≥ 0, ε is a small value used to ensure numerical stability, and I_i represents the input features;
step 2.3: improving the traditional convolution module of the YOLOv5 neck structure by replacing it with a ghost shuffle convolution (GSConv) module that combines standard convolution with depthwise separable convolution and a channel shuffle operation;
step 2.4: to obtain optimal anchor parameters for the improved YOLOv5 network model, applying the K-means++ algorithm to re-cluster the widths and heights of the target boxes in the mask data set, acquiring prior box sizes matched to the masks and reducing the false detection rate of the model; first, the clustering method randomly selects one sample of the mask data set as the initial cluster center; next, the distance D(x) between each sample and the selected cluster centers is calculated; finally, the probability P that a sample point is selected as the next cluster center is calculated according to formula (3):
P(x) = D(x)² / Σ_{x∈X} D(x)² (3)
the above steps are repeated until K cluster centers have been selected, thereby determining the optimal anchor parameters of the improved model;
step 2.5: providing an improved network loss function in which SIoU_Loss replaces the bounding box loss function of the YOLOv5 network, wherein SIoU_Loss introduces the vector angle between the ground-truth box and the predicted box and redefines the loss function, improving the training speed and inference capability of the model; SIoU_Loss mainly contains three parts: angle loss, distance loss and shape loss.
The angle loss is defined as
Λ = 1 - 2·sin²( arcsin(c_h/σ) - π/4 ),
where c_h is the height difference between the center points of the predicted box and the ground-truth box, σ is the distance between the two center points, α is the angle between the line connecting the two center points and the horizontal direction, (b_cx^gt, b_cy^gt) is the center coordinate point of the ground-truth box and (b_cx, b_cy) is the center coordinate point of the predicted box; if the angle α equals π/2 or 0, the angle loss is 0; if α is smaller than π/4, the minimization is carried out over α, otherwise over β = π/2 - α;
the distance loss is defined as
Δ = Σ_{t=x,y} ( 1 - e^(-γ·ρ_t) ),  ρ_x = ( (b_cx^gt - b_cx) / c_w )²,  ρ_y = ( (b_cy^gt - b_cy) / c_h )²,
where (c_w, c_h) are the width and height of the minimum enclosing rectangle of the predicted box and the ground-truth box; as the angle approaches 0, the contribution of the distance loss decreases, and conversely, as the angle approaches π/4, its contribution increases, so the distance term is given priority through γ, which grows with the angle:
γ = 2 - Λ (12)
the shape loss is defined as
Ω = Σ_{t=w,h} ( 1 - e^(-ω_t) )^θ,  ω_w = |w - w^gt| / max(w, w^gt),  ω_h = |h - h^gt| / max(h, h^gt),
where θ controls the degree of attention paid to the shape loss and is defined between 2 and 6, and (w, h) and (w^gt, h^gt) are the width and height of the predicted box and the ground-truth box respectively;
the SIoU loss function is then defined as
Loss_SIoU = 1 - IoU + (Δ + Ω) / 2,
in which the added angle loss makes the expression of the loss function more complete, so that model training converges more stably, the regression accuracy of training is improved, the training speed is increased and the model prediction error is reduced.
4. The mask wearing detection method based on YOLOv5 of claim 1, wherein in step 3, the training set image data of the mask data set are input into the YOLOv5 network model improved in step 2, the model parameters with the highest accuracy of the improved YOLOv5 model on the validation set during training are saved, and the file is named best.pt, specifically comprising:
step 3.1: inputting the training set picture data of the mask data set into the improved network model and training on a GPU, with the parameters of the improved network model set with an SGD optimizer: 150 training epochs, momentum 0.937, initial learning rate 0.01, minimum learning rate 0.0001, weight decay coefficient 0.00005 and batch size 16; during model training, the model parameters with the best accuracy are saved and named best.pt.
5. The mask wearing detection method based on YOLOv5 of claim 1, wherein in step 4, an image to be detected containing targets wearing masks and targets not wearing masks is input into the improved YOLOv5 model, and the optimal weight file best.pt of step 3 is loaded into the detection model for inference to obtain whether each target in the image to be detected is wearing a mask, specifically comprising:
step 4.1: to evaluate the performance of the improved model, Precision and mean Average Precision (mAP) are adopted as the evaluation indexes of the model algorithm; Precision measures the accuracy of the targets detected by the model and is defined as the proportion of correct predictions among all predictions of positive samples:
Precision = TP / (TP + FP),  Recall = TP / (TP + FN),  mAP = (1/n) · Σ_{i=1}^{n} AP_i
In the above formulas, Recall is the recall rate, TP represents the number of samples correctly predicted as positive, FP represents the number of negative samples incorrectly predicted as positive, FN represents the number of positive samples incorrectly predicted as negative, n represents the number of categories, and AP_i represents the Average Precision (AP) of the i-th class.
CN202310400564.3A 2023-04-14 2023-04-14 Improved mask wearing detection method based on YOLOv5 Pending CN116453186A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310400564.3A CN116453186A (en) 2023-04-14 2023-04-14 Improved mask wearing detection method based on YOLOv5

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310400564.3A CN116453186A (en) 2023-04-14 2023-04-14 Improved mask wearing detection method based on YOLOv5

Publications (1)

Publication Number Publication Date
CN116453186A true CN116453186A (en) 2023-07-18

Family

ID=87121524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310400564.3A Pending CN116453186A (en) 2023-04-14 2023-04-14 Improved mask wearing detection method based on YOLOv5

Country Status (1)

Country Link
CN (1) CN116453186A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333733A (en) * 2023-09-06 2024-01-02 中移数智科技有限公司 Crack detection model training method and device, electronic equipment and storage medium
CN117333733B (en) * 2023-09-06 2024-09-17 中移数智科技有限公司 Crack detection model training method and device, electronic equipment and storage medium
CN117372996A (en) * 2023-10-31 2024-01-09 淮阴工学院 Traffic signal lamp image detection method based on improved yolov5 model
CN117423062A (en) * 2023-11-13 2024-01-19 南通大学 Building site safety helmet detection method based on improved YOLOv5
CN117611998A (en) * 2023-11-22 2024-02-27 盐城工学院 Optical remote sensing image target detection method based on improved YOLOv7

Similar Documents

Publication Publication Date Title
CN116453186A (en) Improved mask wearing detection method based on YOLOv5
CN110502965B (en) Construction safety helmet wearing monitoring method based on computer vision human body posture estimation
CN111723786B (en) Method and device for detecting wearing of safety helmet based on single model prediction
CN113033622B (en) Training method, device, equipment and storage medium for cross-modal retrieval model
CN111275688A (en) Small target detection method based on context feature fusion screening of attention mechanism
CN114283469B (en) Improved YOLOv4-tiny target detection method and system
CN111985325B (en) Aerial small target rapid identification method in extra-high voltage environment evaluation
CN112232199A (en) Wearing mask detection method based on deep learning
CN112991269A (en) Identification and classification method for lung CT image
CN115310677B (en) Binary coding representation and multi-classification-based track prediction method and device
CN112464701A (en) Method for detecting whether people wear masks or not based on light weight characteristic fusion SSD
CN112084336A (en) Entity extraction and event classification method and device for expressway emergency
WO2024022059A1 (en) Environment detection and alarming method and apparatus, computer device, and storage medium
CN113158835A (en) Traffic accident intelligent detection method based on deep learning
CN115909121A (en) Mine underground drill rod counting method, system and equipment based on improved YOLOV5 and Deepsort
CN115423995A (en) Lightweight curtain wall crack target detection method and system and safety early warning system
CN110659572A (en) Video motion detection method based on bidirectional feature pyramid
CN114155477A (en) Semi-supervised video paragraph positioning method based on average teacher model
CN111832475B (en) Face false detection screening method based on semantic features
CN117274868A (en) Traffic event identification method and system for video based on large model
CN116959099A (en) Abnormal behavior identification method based on space-time diagram convolutional neural network
CN116189299A (en) Underground coal mine human body action recognition method suitable for edge terminal
CN116503379A (en) Lightweight improved YOLOv 5-based part identification method
CN115937788A (en) Yolov5 industrial area-based safety helmet wearing detection method
CN115662565A (en) Medical image report generation method and equipment integrating label information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination