CN115761645A - YOLOv5-based lightweight safety helmet wearing detection method - Google Patents

Publication number
CN115761645A
Authority
CN
China
Prior art keywords: convolution, module, wearing detection, network, helmet wearing
Legal status: Pending
Application number
CN202211501358.3A
Other languages
Chinese (zh)
Inventor
吕云凯
杨小兵
管爱
王飞龙
解明
Current Assignee
China Jiliang University
Original Assignee
China Jiliang University
Priority date
Filing date
Publication date
Application filed by China Jiliang University
Priority to CN202211501358.3A
Publication of CN115761645A

Landscapes: Image Processing (AREA)

Abstract

The invention discloses a YOLOv5-based lightweight safety helmet wearing detection method, comprising the following steps. Step S1: collect construction site images as training samples. Step S2: label the sample images and apply data enhancement to construct a safety helmet wearing detection dataset. Step S3: improve the YOLOv5 object detection algorithm to construct a lightweight helmet wearing detection network. Step S4: train the lightweight helmet wearing detection network from step S3 on the dataset from step S2 and obtain a weight file. Step S5: use the trained detection model to detect the camera video stream of the site under inspection. Step S6: if a person not wearing a safety helmet is detected, issue a corresponding audio alarm. The invention effectively improves helmet wearing detection accuracy, accurately identifies personnel not wearing safety helmets on roads or construction sites, and issues alarm information, which helps protect the life safety of the personnel involved.

Description

YOLOv5-based lightweight safety helmet wearing detection method
Technical Field
The invention relates to the technical field of building construction safety, in particular to a YOLOv5-based lightweight helmet wearing detection method.
Background
In the safety protection of construction sites, the safety helmet is an indispensable piece of safety equipment: it protects the heads of construction workers, minimizes the occurrence of fatal injuries, and effectively reduces the probability of accidents.
However, most construction sites still rely on manual inspection of helmet wearing, which can hardly provide full-time monitoring during construction and is time-consuming and labor-intensive. With the development of computer vision, object detection technology has made considerable progress. Although traditional helmet wearing detection algorithms have achieved certain results, their feature extraction must be designed by hand, and they suffer from low detection accuracy, slow detection speed, and poor robustness in complex scenes.
With the development of deep learning, helmet wearing detection methods have made further progress. Two-stage object detection algorithms such as Faster R-CNN achieve high detection accuracy, but their parameter counts are large and their detection efficiency is very low. One-stage object detection algorithms such as SSD and the YOLO series perform classification and regression while generating target candidate boxes, so their detection efficiency is greatly improved over two-stage algorithms and their detection speed is high, although their detection accuracy is usually slightly inferior.
In practical helmet wearing detection and monitoring on construction sites, detection accuracy must be ensured while model size and deployment are also taken into account. The YOLOv5 algorithm offers excellent detection performance and high detection speed, but its model size still has room for optimization. In practical applications, embedded devices impose strict requirements on model size, and the smaller the model, the easier it is to deploy. Therefore, the one-stage model YOLOv5 is selected and made lightweight on this basis, reducing the model size as much as possible without affecting accuracy and facilitating practical deployment, thereby yielding a YOLOv5-based lightweight safety helmet wearing detection method.
Disclosure of Invention
The invention solves the following problems: it overcomes the defects of the prior art by providing a YOLOv5-based lightweight safety helmet wearing detection method that effectively and automatically identifies personnel not wearing safety helmets on a construction site, is well suited to lightweight model deployment and application, and has broad application prospects.
The invention adopts the following technical scheme: a YOLOv5-based lightweight helmet wearing detection method comprising the following steps:
Step S1: collecting construction site images as training samples;
Step S2: labeling the obtained sample images and applying data enhancement to construct a safety helmet wearing detection dataset;
Step S3: improving the YOLOv5 object detection algorithm to construct a lightweight helmet wearing detection network;
Step S4: training the lightweight helmet wearing detection network obtained in step S3 on the helmet wearing detection dataset from step S2 and obtaining a weight file;
Step S5: detecting the camera video stream of the site to be detected using the trained detection model;
Step S6: if a person not wearing a safety helmet is detected, issuing a corresponding audio alarm.
Preferably, in step S1 the training samples are based on the public Helmet Detection dataset, supplemented with images collected, screened, and organized from construction site surveillance video.
Preferably, the implementation process of constructing the helmet wearing detection data set in step S2 is as follows:
Step S21: performing data enhancement on the images from step S1 using the Python augmentation library imgaug, including random horizontal or vertical flipping, translation, cropping, and addition of Gaussian noise;
Step S22: labeling the images processed in step S21 with labeling software, where persons wearing a safety helmet are labeled hat and persons not wearing one are labeled person; the resulting label files are in XML format;
Step S23: converting the label files from XML format to yolo_txt format with a conversion tool, i.e., each image corresponds to one txt file, and each line of the file describes one target in the format class label (class), center-point abscissa (x_center), center-point ordinate (y_center), width (width), and height (height); the class label for wearing a safety helmet is 0 and for not wearing one is 1;
Step S24: dividing the dataset into a training set, a validation set, and a test set at a ratio of 8:1:1, completing the construction of the safety helmet wearing detection dataset.
Preferably, the modified YOLOv5 algorithm in step S3 includes the following:
Step S31: a Stem module is introduced into the feature extraction part of the YOLOv5 algorithm for the convolutional down-sampling of the backbone network, ensuring strong feature expression capability while greatly reducing the number of parameters; a Stem module, a C3 module, a Conv module, a C3 module, and an SPPF module connected in sequence constitute the improved YOLOv5 backbone;
Step S32: in the feature fusion part of the YOLOv5 algorithm, all conventional convolution modules are replaced with the lightweight Ghost Conv convolution module to construct the lightweight neck network Ghost Neck;
Step S33: the loss function of the YOLOv5 network is improved by introducing the SIoU metric in place of the original CIoU metric.
Preferably, the network training in step S4 includes the following:
Step S41: training the improved YOLOv5 network from step S3 on the safety helmet wearing detection dataset from step S2, with the batch size (batch_size) set to 20, the Adam optimizer, and 300 training epochs (epoch);
Step S42: testing on the test set, analyzing the training results, and comparing the differences in the test results;
Step S43: fine-tuning the hyper-parameters of the network on the validation set.
Preferably, the camera of the site to be detected in step S5 is connected to the local host, and the model is loaded on the local host to detect the helmet wearing condition of on-site personnel in the actual field.
Preferably, the audio alarm in step S6 is implemented with the Python playsound module; the playsound module is installed via the "pip install playsound" command.
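The alarm decision of steps S5–S6 reduces to a small post-processing step over the per-frame detections. The sketch below is hypothetical: the (class_id, confidence) detection tuples and the injected `play_alarm` callback are illustrative assumptions, not the actual inference code of the invention.

```python
# Class labels follow the dataset convention: 0 = wearing helmet, 1 = not wearing.
NO_HELMET = 1

def frame_needs_alarm(detections, conf_threshold=0.5):
    """Return True when any sufficiently confident detection in the frame
    is a person without a safety helmet.

    `detections` is a list of (class_id, confidence) pairs, one per
    predicted box in the current video frame."""
    return any(cls == NO_HELMET and conf >= conf_threshold
               for cls, conf in detections)

def process_frame(detections, play_alarm):
    """Fire the audio alarm (via the injected callback) when a bare-headed
    person is detected; return whether the alarm fired."""
    if frame_needs_alarm(detections):
        play_alarm()
        return True
    return False
```

In deployment, `play_alarm` would wrap the sound-playback call and `detections` would come from the trained model applied to each camera frame.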
Preferably, the implementation mechanism of the Stem module added to the backbone network of the YOLOv5 algorithm is as follows: the input feature map first undergoes a convolution with a 3×3 kernel, whose main purpose is to change the number of channels of the feature map. The network then splits into two branches. One branch introduces a bottleneck layer, which is the Stem module's main parameter-reducing operation: the number of channels is reduced first, and down-sampling is performed afterwards. The other branch applies max pooling to the upper-layer input before concatenation, so that part of the input information is passed through directly; this ensures that the final result still carries sufficient semantic information while the parameter count is reduced, avoiding excessive information loss.
In this module, the original input feature map F_in has height H, width W, and C1 channels. F_in first passes through a convolution module with a 3×3 kernel and stride 2 to obtain F_1; every 3×3 convolution with stride 2 is equivalent to one down-sampling of the feature map, so the height, width, and channel count become H/2, W/2, and C2. In the first branch, F_1 undergoes a 1×1 convolution that halves the number of channels, giving F_2; a 1×1 convolution does not change the spatial size, so its dimensions are H/2, W/2, and C2/2. A 3×3 convolution with stride 2 then performs down-sampling to obtain F_3, whose height, width, and channel count are H/4, W/4, and C2. The other branch applies max pooling (Maxpool) to F_1 to obtain F_4, whose dimensions match those of F_3: H/4, W/4, and C2. The calculation process is:

F_1 = Conv_3×3(F_in)
F_2 = Conv_1×1(F_1)
F_3 = Conv_3×3(F_2)
F_4 = Maxpool(F_1)

F_3 and F_4 are then concatenated along the channel dimension to obtain F_5, and finally a 1×1 convolution reduces the number of channels of F_5 to produce the final feature map F_out:

F_5 = Concat(F_3, F_4)
F_out = Conv_1×1(F_5)
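The shape bookkeeping of the Stem module can be checked with a small pure-Python trace that mirrors the operations above (stride-2 3×3 convolution, 1×1 bottleneck, stride-2 down-sampling, channel concatenation, 1×1 fusion). It tracks only (H, W, C) shapes, not weights; the stride-2 max pooling is an assumption consistent with F_4 matching F_3.

```python
def stem_shapes(H, W, C1, C2):
    """Trace the (height, width, channels) of each feature map in the
    Stem module; stride-2 operations halve the spatial dimensions."""
    f1 = (H // 2, W // 2, C2)           # F1 = 3x3 conv, stride 2
    f2 = (f1[0], f1[1], C2 // 2)        # F2 = 1x1 conv halves channels
    f3 = (f2[0] // 2, f2[1] // 2, C2)   # F3 = 3x3 conv, stride 2
    f4 = (f1[0] // 2, f1[1] // 2, C2)   # F4 = stride-2 max pool of F1
    f5 = (f3[0], f3[1], f3[2] + f4[2])  # F5 = channel concat of F3, F4
    f_out = (f5[0], f5[1], C2)          # Fout = 1x1 conv reduces channels
    return {"F1": f1, "F2": f2, "F3": f3, "F4": f4, "F5": f5, "Fout": f_out}
```

For a 640×640 RGB input with C2 = 64, the trace reproduces the H/2 → H/4 progression stated above.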
preferably, our constructed lightweight Neck network, ghost Neck, consists of Ghost Conv modules. The Ghost Conv module firstly uses a small amount of Convolution kernels to perform feature extraction on an input feature map, then uses packet Convolution (Depth-wise Convolution) to further perform cheaper linear variation operation on the part of the feature map, and finally generates a final feature map through splicing operation (collocation). The Ghost Conv module significantly reduces the number of parameters and computations and does not affect the performance of the model by combining a small number of convolution kernels with a cheaper linear-variant operation instead of a conventional convolution. Replacing the conventional convolution of a C3 module in a YOLOv5 network by a Ghost Conv convolution module to obtain a Ghost C3 module, and then applying the Ghost Conv module and the Ghost C3 in the Neck network of the YOLOv5 to construct a lightweight Neck network Ghost Neck.
Preferably, the loss function of the YOLOv5 network is improved to address the fact that the original loss function (Loss) does not consider the direction between the ground-truth box and the predicted box during regression, which slows convergence in model training. SIoU is introduced to replace the original CIoU metric, and the penalty term is redefined. Once the angle term is added, the predicted box is first brought onto the same vertical or horizontal line as the ground-truth box, solving the problem of the predicted box "wandering around" the target. Whether the environment is clean or noisy, no additional parameters are introduced and training time does not increase. SIoU is calculated as follows:
Λ = 1 − 2·sin²(arcsin(c_h/σ) − π/4)
Δ = ∑_{t=x,y} (1 − e^(−γρ_t)), with γ = 2 − Λ
Ω = ∑_{t=w,h} (1 − e^(−ω_t))^θ
where x, y, w, and h respectively denote the center-point coordinates and the width and height of the predicted box, and Δ and Ω represent the distance cost (Distance cost) and shape cost (Shape cost), respectively. The value of θ defines the degree of attention paid to the shape cost and is chosen per dataset; here θ is set to 4.
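A NumPy sketch of the SIoU box loss, assuming the conventions of Gevorgyan's SIoU formulation (γ = 2 − Λ and final loss 1 − IoU + (Δ + Ω)/2, which the text above does not spell out); boxes are in (x_center, y_center, w, h) format. This is an illustrative reimplementation, not the exact code of the invention.

```python
import numpy as np

def siou_loss(pred, gt, theta=4.0, eps=1e-7):
    """SIoU box loss for (x_center, y_center, w, h) boxes:
    1 - IoU + (distance cost + shape cost) / 2, where the distance
    cost is modulated by the angle cost Lambda."""
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt
    # IoU of the two boxes
    ix = max(0.0, min(px + pw/2, gx + gw/2) - max(px - pw/2, gx - gw/2))
    iy = max(0.0, min(py + ph/2, gy + gh/2) - max(py - ph/2, gy - gh/2))
    inter = ix * iy
    iou = inter / (pw * ph + gw * gh - inter + eps)
    # Angle cost Lambda: zero when the centers are horizontally aligned
    ch = abs(gy - py)
    sigma = np.hypot(gx - px, gy - py) + eps
    lam = 1 - 2 * np.sin(np.arcsin(ch / sigma) - np.pi / 4) ** 2
    # Distance cost over the smallest enclosing box, gamma = 2 - Lambda
    cw_enc = max(px + pw/2, gx + gw/2) - min(px - pw/2, gx - gw/2)
    ch_enc = max(py + ph/2, gy + gh/2) - min(py - ph/2, gy - gh/2)
    gamma = 2 - lam
    delta = ((1 - np.exp(-gamma * ((gx - px) / cw_enc) ** 2)) +
             (1 - np.exp(-gamma * ((gy - py) / ch_enc) ** 2)))
    # Shape cost with attention parameter theta
    omega = ((1 - np.exp(-abs(pw - gw) / max(pw, gw))) ** theta +
             (1 - np.exp(-abs(ph - gh) / max(ph, gh))) ** theta)
    return 1 - iou + (delta + omega) / 2
```

A perfectly matched prediction drives every term to (numerically) zero, while any center offset or shape mismatch raises the loss.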
Compared with the prior art, the invention has the following advantages: compared with existing methods, the proposed lightweight network model can detect whether construction workers wear safety helmets in complex construction environments, assists in realizing intelligent construction sites, effectively protects the life safety of construction workers, improves construction safety, and has very broad application prospects.
Drawings
FIG. 1 is an overall algorithm flow diagram of the present invention;
FIG. 2 is a structural diagram of the helmet wearing detection model of the present invention;
FIG. 3 is a diagram of a Stem module architecture in the backbone network;
FIG. 4 is a block diagram of the Ghost Conv module and Ghost C3 module in the neck network;
FIG. 5 is a diagram illustrating an actual detection effect according to an embodiment of the present invention.
Detailed Description
For the purpose of more clearly illustrating the objects, technical solutions and advantages of the present invention, the following detailed description is given with reference to the accompanying drawings and specific embodiments. It is to be understood that the embodiments of the invention are not limited to the example descriptions herein.
Referring to fig. 1, a method for detecting wearing of a lightweight helmet based on YOLOv5 according to an embodiment of the present invention is described with reference to fig. 1 to 5, which specifically includes the following steps:
Step S1: collecting construction site images as training samples;
Step S2: labeling the obtained sample images and applying data enhancement to construct a safety helmet wearing detection dataset;
Step S3: improving the YOLOv5 object detection algorithm to construct a lightweight helmet wearing detection network;
Step S4: training the lightweight helmet wearing detection network obtained in step S3 on the helmet wearing detection dataset from step S2 and obtaining a weight file;
Step S5: detecting the camera video stream of the site to be detected using the trained detection model;
Step S6: if a person not wearing a safety helmet is detected, issuing a corresponding audio alarm.
Specifically, in step S1 the training samples are based on the public Helmet Detection dataset, supplemented with images collected, screened, and organized from construction site surveillance video.
Specifically, the implementation process of constructing the helmet wearing detection data set in step S2 is as follows:
Step S21: performing data enhancement on the images from step S1 using the Python augmentation library imgaug, including random horizontal or vertical flipping, translation, cropping, and addition of Gaussian noise;
Step S22: labeling the images processed in step S21 with labeling software, where persons wearing a safety helmet are labeled hat and persons not wearing one are labeled person; the resulting label files are in XML format;
Step S23: converting the label files from XML format to yolo_txt format with a conversion tool, i.e., each image corresponds to one txt file, and each line of the file describes one target in the format class label (class), center-point abscissa (x_center), center-point ordinate (y_center), width (width), and height (height); the class label for wearing a safety helmet is 0 and for not wearing one is 1;
Step S24: dividing the dataset into a training set, a validation set, and a test set at a ratio of 8:1:1, completing the construction of the safety helmet wearing detection dataset.
Specifically, as shown in fig. 3, the modified YOLOv5 algorithm in step S3 includes the following:
Step S31: a Stem module is introduced into the feature extraction part of the YOLOv5 algorithm for the convolutional down-sampling of the backbone network, ensuring strong feature expression capability while greatly reducing the number of parameters; a Stem module, a C3 module, a Conv module, a C3 module, and an SPPF module connected in sequence constitute the improved YOLOv5 backbone;
Step S32: in the feature fusion part of the YOLOv5 algorithm, all conventional convolution modules are replaced with the lightweight Ghost Conv convolution module to construct the lightweight neck network Ghost Neck;
Step S33: the loss function of the YOLOv5 network is improved by introducing the SIoU metric in place of the original CIoU metric.
Specifically, the network training in step S4 includes the following:
Step S41: training the improved YOLOv5 network from step S3 on the helmet wearing detection dataset from step S2, with the batch size (batch_size) set to 20, the Adam optimizer, and 300 training epochs (epoch);
Step S42: testing on the test set, analyzing the training results, and comparing the differences in the test results;
Step S43: fine-tuning the hyper-parameters of the network on the validation set.
Specifically, the camera of the site to be detected in step S5 is connected to the local host, and the model is loaded on the local host to detect the helmet wearing condition of on-site personnel in the actual field, with the practical application effect shown in fig. 5.
Specifically, the audio alarm in step S6 is implemented with the Python playsound module; the playsound module is installed via the "pip install playsound" command.
The improvement points of the invention are described as follows:
(1) As shown in the backbone network part of FIG. 2, a Stem module is introduced into the feature extraction part of the YOLOv5 algorithm for the convolutional down-sampling of the backbone network. The implementation mechanism of the Stem module is shown in fig. 3: the input feature map first undergoes a convolution with a 3×3 kernel, whose main purpose is to change the number of channels of the feature map. The network then splits into two branches. One branch introduces a bottleneck layer, which is the Stem module's main parameter-reducing operation: the number of channels is reduced first, and down-sampling is performed afterwards. The other branch applies max pooling to the upper-layer input before concatenation, so that part of the input information is passed through directly; this ensures that the final result still carries sufficient semantic information while the parameter count is reduced, avoiding excessive information loss.
In this module, the original input feature map F_in has height H, width W, and C1 channels. F_in first passes through a convolution module with a 3×3 kernel and stride 2 to obtain F_1; every 3×3 convolution with stride 2 is equivalent to one down-sampling of the feature map, so the height, width, and channel count become H/2, W/2, and C2. In the first branch, F_1 undergoes a 1×1 convolution that halves the number of channels, giving F_2; a 1×1 convolution does not change the spatial size, so its dimensions are H/2, W/2, and C2/2. A 3×3 convolution with stride 2 then performs down-sampling to obtain F_3, whose height, width, and channel count are H/4, W/4, and C2. The other branch applies max pooling (Maxpool) to F_1 to obtain F_4, whose dimensions match those of F_3: H/4, W/4, and C2. The calculation process is:

F_1 = Conv_3×3(F_in)
F_2 = Conv_1×1(F_1)
F_3 = Conv_3×3(F_2)
F_4 = Maxpool(F_1)

F_3 and F_4 are then concatenated along the channel dimension to obtain F_5, and finally a 1×1 convolution reduces the number of channels of F_5 to produce the final feature map F_out:

F_5 = Concat(F_3, F_4)
F_out = Conv_1×1(F_5)
(2) The neck network part in fig. 2 is the lightweight neck network Ghost Neck constructed in the feature fusion part of the YOLOv5 algorithm. The Ghost Neck consists of Ghost Conv modules. As shown in fig. 4, the Ghost Conv module first extracts features from the input feature map using a small number of convolution kernels, then applies a cheaper linear transformation to these features using depth-wise convolution (Depth-wise Convolution), and finally generates the final feature map through a concatenation operation (Concatenation). By replacing conventional convolution with a small number of convolution kernels combined with cheaper linear transformations, the Ghost Conv module significantly reduces the number of parameters and computations without affecting model performance. Replacing the conventional convolutions of the C3 module in the YOLOv5 network with the Ghost Conv module yields the Ghost C3 module; the Ghost Conv and Ghost C3 modules are then used in the YOLOv5 neck to construct the lightweight neck network Ghost Neck.
(3) The original loss function (Loss) does not consider the direction between the ground-truth box and the predicted box during regression, which slows convergence in model training. We introduce SIoU to replace the original CIoU metric and redefine the penalty term. Once the angle term is added, the predicted box is first brought onto the same vertical or horizontal line as the ground-truth box, solving the problem of the predicted box "wandering around" the target. Whether the environment is clean or noisy, no additional parameters are introduced and training time does not increase. SIoU is calculated by the following equations:
Λ = 1 − 2·sin²(arcsin(c_h/σ) − π/4)
Δ = ∑_{t=x,y} (1 − e^(−γρ_t)), with γ = 2 − Λ
Ω = ∑_{t=w,h} (1 − e^(−ω_t))^θ
where x, y, w, and h respectively denote the center-point coordinates and the width and height of the predicted box, and Δ and Ω represent the distance cost (Distance cost) and shape cost (Shape cost), respectively. The value of θ defines the degree of attention paid to the shape cost and is chosen per dataset; here θ is set to 4.
Finally, the improved lightweight network is comparatively analyzed. Compared with other existing methods, the YOLO network model equipped with the brand-new down-sampling module Stem and the lightweight feature fusion network Ghost Neck greatly reduces the parameter count of the model without excessive loss of detection performance.
As shown in Table 1, the evaluation metrics mAP, Model Size, and Detection time represent the detection accuracy, the size of the model, and the time required to detect one image, respectively. The method of the invention achieves higher detection accuracy for the helmet wearing behavior of construction site personnel, the smallest model, and high speed; the final model size is 16% smaller than that of the YOLOv5s network. The actual detection effect is shown in fig. 5: the method realizes a lightweight construction site helmet detection model, which facilitates further model deployment and practical application.
TABLE 1. Comparison of experimental results between the method of the invention and three existing methods

Model                    mAP     Model Size  Detection time
Faster RCNN              66.7%   182 MB      260 ms
YOLOv4                   89.7%   23.6 MB     19 ms
YOLOv5s                  92.7%   14.6 MB     22 ms
Method of the invention  92.5%   12.5 MB     20 ms
The foregoing shows and describes the general principles and features of the present invention, together with its advantages. It will be understood by those skilled in the art that the invention is not limited to the embodiments described above, which are given in the specification and drawings only to illustrate the principles of the invention; various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims. The scope of the invention is defined by the appended claims and their equivalents.

Claims (10)

1. A YOLOv5-based lightweight safety helmet wearing detection method, characterized by comprising the following steps:
Step S1: collecting construction site images as training samples;
Step S2: labeling the obtained sample images and applying data enhancement to construct a safety helmet wearing detection dataset;
Step S3: improving the YOLOv5 object detection algorithm to construct a lightweight helmet wearing detection network;
Step S4: training the lightweight helmet wearing detection network obtained in step S3 on the helmet wearing detection dataset from step S2 and obtaining a weight file;
Step S5: detecting the camera video stream of the site to be detected using the trained detection model;
Step S6: if a person not wearing a safety helmet is detected, issuing a corresponding audio alarm.
2. The YOLOv5-based lightweight helmet wearing detection method according to claim 1, wherein in step S1 the training samples are based on the public Helmet Detection dataset, supplemented with images collected, screened, and organized from construction site surveillance video.
3. The YOLOv5-based lightweight helmet wearing detection method according to claim 1, wherein the construction of the helmet wearing detection dataset in step S2 is implemented by the following steps:
Step S21: performing data enhancement on the images from step S1 using the Python augmentation library imgaug, including random horizontal or vertical flipping, translation, cropping, and addition of Gaussian noise;
Step S22: labeling the images processed in step S21 with labeling software, where persons wearing a safety helmet are labeled hat and persons not wearing one are labeled person; the resulting label files are in XML format;
Step S23: converting the label files from XML format to yolo_txt format with a conversion tool, i.e., each image corresponds to one txt file, and each line of the file describes one target in the format class label (class), center-point abscissa (x_center), center-point ordinate (y_center), width (width), and height (height); the class label for wearing a safety helmet is 0 and for not wearing one is 1;
Step S24: dividing the dataset into a training set, a validation set, and a test set at a ratio of 8:1:1, completing the construction of the safety helmet wearing detection dataset.
4. The YOLOv5-based lightweight helmet wearing detection method according to claim 1, wherein the improved YOLOv5 algorithm in step S3 comprises the following:
Step S31: a Stem module is introduced into the feature extraction part of the YOLOv5 algorithm for the convolutional down-sampling of the backbone network, ensuring strong feature expression capability while greatly reducing the number of parameters; a Stem module, a C3 module, a Conv module, a C3 module, and an SPPF module connected in sequence constitute the improved YOLOv5 backbone;
Step S32: in the feature fusion part of the YOLOv5 algorithm, all conventional convolution modules are replaced with the lightweight Ghost Conv convolution module to construct the lightweight neck network Ghost Neck;
Step S33: the loss function of the YOLOv5 network is improved by introducing the SIoU metric in place of the original CIoU metric.
5. The YOLOv5-based lightweight helmet wearing detection method according to claim 1, wherein the network training in step S4 comprises the following:
Step S41: training the improved YOLOv5 network from step S3 on the helmet wearing detection dataset from step S2, with the batch size (batch_size) set to 20, the Adam optimizer, and 300 training epochs (epoch);
Step S42: testing on the test set, analyzing the training results, and comparing the differences in the test results;
Step S43: fine-tuning the hyper-parameters of the network on the validation set.
6. The YOLOv5-based lightweight helmet wearing detection method according to claim 1, wherein the camera of the site to be detected in step S5 is connected to the local host, and the model is loaded on the local host to detect the helmet wearing condition of on-site personnel in the actual field.
7. The YOLOv 5-based light-weight helmet wearing detection method according to claim 1, characterized in that the sound alarm information in step S5 is realized by a self-contained playground module in Python; the playground module is installed by the "pip install play" command.
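A minimal sketch of the alarm step, assuming the third-party playsound package (installed with `pip install playsound`). The cooldown helper, the `maybe_alarm`/`sound_alarm` names and the `alarm.wav` path are introduced here for illustration; the cooldown simply avoids replaying the sound on every frame.

```python
import time

_last_alarm = 0.0

def maybe_alarm(no_helmet_detected, cooldown_s=5.0, now=None):
    """Return True when an alarm should sound now, at most once per cooldown window."""
    global _last_alarm
    t = time.monotonic() if now is None else now
    if no_helmet_detected and t - _last_alarm >= cooldown_s:
        _last_alarm = t
        return True
    return False

def sound_alarm(wav_path="alarm.wav"):
    # playsound blocks until playback finishes; the file path is hypothetical.
    from playsound import playsound
    playsound(wav_path)
```

In the detection loop one would call `if maybe_alarm(detections_contain_no_helmet): sound_alarm()` once per processed frame.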
8. The YOLOv5-based lightweight helmet wearing detection method according to claim 4, wherein the Stem module in the step S31 works as follows: the input feature map first undergoes a convolution operation with a 3 × 3 kernel, whose main purpose is to change the number of channels of the feature map. The network then splits into two branches. One branch introduces a bottleneck layer, which is where the Stem module chiefly reduces the parameter count: the number of channels is reduced first, and down-sampling is performed afterwards. The other branch applies maximum pooling to the upper-layer input and then concatenates the result, so that part of the input information is passed through directly; this ensures that the final result retains sufficient semantic information while the parameter count is reduced, and avoids excessive loss of information.
In this module, the original input feature map F_in has length, width and channel number H, W and C1, respectively. F_in first passes through a convolution module with a 3 × 3 kernel and a stride of 2 to obtain F1; every 3 × 3 convolution with stride 2 is equivalent to a down-sampling operation on the feature map, so the length, width and channel number become H/2, W/2 and C2, respectively. In the first branch, F1 then passes through a 1 × 1 convolution that halves the channel number to obtain F2; a 1 × 1 convolution does not change the spatial size of the feature map, so the length, width and channel number are H/2, W/2 and C2/2, respectively. A 3 × 3 convolution with stride 2 then performs down-sampling to obtain F3, whose length, width and channel number are H/4, W/4 and C2, respectively. The other branch applies maximum pooling (Maxpool) to F1 to obtain F4, whose length, width and channel number equal those of F3, i.e. also H/4, W/4 and C2. The calculation process is as follows:
F1 = Conv3×3(F_in)
F2 = Conv1×1(F1)
F3 = Conv3×3(F2)
F4 = Maxpool(F1)
Then F3 and F4 are concatenated along the channel dimension to obtain F5, and finally F5 passes through a 1 × 1 convolution that reduces the channel number to give the final feature map F_out. The calculation process is as follows:
F5 = Concat(F3, F4)
F_out = Conv1×1(F5).
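The two-branch computation above can be sketched in PyTorch. This is an illustrative reading of the description, not the patented implementation: `conv_bn_act` and `Stem` are names introduced here, and the BatchNorm + SiLU composition is assumed, as in standard YOLOv5 convolution blocks.

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k, s):
    """Conv + BatchNorm + SiLU, the standard YOLOv5-style convolution block (assumed)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class Stem(nn.Module):
    """Two-branch stem: F1 = Conv3x3/s2(F_in); branch A = Conv3x3/s2(Conv1x1(F1)),
    branch B = Maxpool(F1); concat along channels, then a 1x1 conv restores C2.
    Overall the module down-samples by 4x."""
    def __init__(self, c1, c2):
        super().__init__()
        self.conv1 = conv_bn_act(c1, c2, 3, 2)        # F1: H/2 x W/2 x C2
        self.reduce = conv_bn_act(c2, c2 // 2, 1, 1)  # F2: halve the channels
        self.conv2 = conv_bn_act(c2 // 2, c2, 3, 2)   # F3: H/4 x W/4 x C2
        self.pool = nn.MaxPool2d(2, 2)                # F4: H/4 x W/4 x C2
        self.fuse = conv_bn_act(2 * c2, c2, 1, 1)     # F_out: reduce channels after concat

    def forward(self, x):
        f1 = self.conv1(x)
        f3 = self.conv2(self.reduce(f1))
        f4 = self.pool(f1)
        return self.fuse(torch.cat([f3, f4], dim=1))
```

For a 3-channel 64 × 64 input and C2 = 32, the output is a 32-channel 16 × 16 feature map, matching the H/4 × W/4 × C2 shape derived above.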
9. The YOLOv5-based lightweight helmet wearing detection method according to claim 4, wherein the lightweight Neck network Ghost Neck in the step S32 is composed of Ghost Conv modules. The Ghost Conv module first performs feature extraction on the input feature map with a small number of convolution kernels, then applies a cheaper linear transformation to this partial feature map with depth-wise convolution, and finally generates the final feature map through a splicing operation (Concatenation). By combining a small number of convolution kernels with a cheaper linear transformation in place of conventional convolution, the Ghost Conv module significantly reduces the number of parameters and computations without affecting model performance. The conventional convolution in the C3 module of the YOLOv5 network is replaced by the Ghost Conv module to obtain a Ghost C3 module, and the Ghost Conv and Ghost C3 modules are used in the Neck network of YOLOv5 to construct the lightweight Neck network Ghost Neck.
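A minimal PyTorch reading of the Ghost Conv module described above: half the output channels come from an ordinary convolution, the other ("ghost") half from a cheap depth-wise convolution over those features. The 5 × 5 depth-wise kernel follows the public GhostNet reference and is an assumption here, as is the BatchNorm + SiLU composition.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Ghost convolution: a few ordinary kernels produce half the output channels;
    a cheap depth-wise convolution generates the other half from them."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        c_half = c_out // 2
        self.primary = nn.Sequential(   # ordinary convolution, half the channels
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU(),
        )
        self.cheap = nn.Sequential(     # depth-wise 5x5: the cheap linear transformation
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU(),
        )

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)
```

Because the depth-wise branch touches each channel independently, its cost is a small fraction of a full convolution producing the same number of channels, which is where the parameter and computation savings come from.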
10. The YOLOv5-based lightweight helmet wearing detection method according to claim 4, wherein the improvement of the loss function in the step S33 mainly solves the problem that the original loss function (Loss) does not consider the direction between the real box and the prediction box during regression, which slows convergence during model training. SIoU is introduced to replace the original CIoU metric, and the penalty term is redefined. After a matching angle is added, the prediction box is first driven onto the same vertical or horizontal line as the real box, which solves the problem of the prediction box "wandering around" during training. Whether the environment is clean or noisy, no additional parameters are introduced and the training time is not increased. The calculation formulas of SIoU are as follows:
Loss_SIoU = 1 - IoU + (Δ + Ω)/2
Δ = ∑_{t=x,y} (1 - e^(-γρ_t)), with γ = 2 - Λ, where Λ is the angle cost of the added matching angle
Ω = ∑_{t=w,h} (1 - e^(-ω_t))^θ
wherein x, y, w and h represent the center coordinates and the width and height of the prediction box, respectively; Δ and Ω represent the distance cost (Distance cost) and the shape cost (Shape cost), respectively; ρ_t and ω_t are the normalized center-offset and width/height-mismatch terms. The value of θ defines how much attention is paid to the shape cost and is specific to each data set; here θ is set to 4.
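As an informal check of the formulas above, the distance and shape costs can be sketched in plain Python. The angle-cost expression follows the published SIoU definition and is an assumption here, since the patent text shows the formulas only as images; `siou_costs` is a name introduced for illustration.

```python
import math

def siou_costs(pred, gt, theta=4.0):
    """Distance and shape costs of SIoU for boxes given as (cx, cy, w, h)."""
    (px, py, pw, ph), (gx, gy, gw, gh) = pred, gt
    # Angle cost Λ: 0 when the centers are horizontally/vertically aligned
    # (assumed form, from the published SIoU definition).
    sigma = math.hypot(gx - px, gy - py) or 1e-9       # center-to-center distance
    sin_alpha = min(abs(gy - py) / sigma, 1.0)
    angle = 1 - 2 * math.sin(math.asin(sin_alpha) - math.pi / 4) ** 2
    # Distance cost Δ: center offsets relative to the smallest enclosing box.
    enc_w = max(px + pw / 2, gx + gw / 2) - min(px - pw / 2, gx - gw / 2)
    enc_h = max(py + ph / 2, gy + gh / 2) - min(py - ph / 2, gy - gh / 2)
    gamma = 2 - angle
    rho_x = ((gx - px) / enc_w) ** 2
    rho_y = ((gy - py) / enc_h) ** 2
    dist = sum(1 - math.exp(-gamma * r) for r in (rho_x, rho_y))
    # Shape cost Ω: relative width/height mismatch, raised to the power θ.
    omega_w = abs(pw - gw) / max(pw, gw)
    omega_h = abs(ph - gh) / max(ph, gh)
    shape = sum((1 - math.exp(-w)) ** theta for w in (omega_w, omega_h))
    return dist, shape
```

For identical boxes both costs vanish, and shifting the predicted center while keeping the size fixed raises only the distance cost, consistent with the roles of Δ and Ω above.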
CN202211501358.3A 2022-11-28 2022-11-28 YOLOv 5-based light-weight safety helmet wearing detection method Pending CN115761645A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211501358.3A CN115761645A (en) 2022-11-28 2022-11-28 YOLOv 5-based light-weight safety helmet wearing detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211501358.3A CN115761645A (en) 2022-11-28 2022-11-28 YOLOv 5-based light-weight safety helmet wearing detection method

Publications (1)

Publication Number Publication Date
CN115761645A true CN115761645A (en) 2023-03-07

Family

ID=85339342

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211501358.3A Pending CN115761645A (en) 2022-11-28 2022-11-28 YOLOv 5-based light-weight safety helmet wearing detection method

Country Status (1)

Country Link
CN (1) CN115761645A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117237752A * 2023-08-09 2023-12-15 北京城建智控科技股份有限公司 Safety helmet wearing detection method based on improved PP-YOLOE model
CN116958883A * 2023-09-15 2023-10-27 四川泓宝润业工程技术有限公司 Safety helmet detection method, system, storage medium and electronic equipment
CN116958883B * 2023-09-15 2023-12-29 四川泓宝润业工程技术有限公司 Safety helmet detection method, system, storage medium and electronic equipment
CN117132943A * 2023-10-20 2023-11-28 南京信息工程大学 Method, device and system for detecting wearing of safety helmet and storage medium

Similar Documents

Publication Publication Date Title
CN115761645A (en) YOLOv 5-based light-weight safety helmet wearing detection method
JP6398979B2 (en) Video processing apparatus, video processing method, and video processing program
CN104463235B (en) Fault recognition method and device based on EMU operation image
CN110969166A (en) Small target identification method and system in inspection scene
CN111626243B (en) Mask face shielding identity recognition method and device and storage medium
CN113361425A (en) Method for detecting whether worker wears safety helmet or not based on deep learning
Das et al. Distributive and Governing System for Descriptive Error Identification of High Speed Railway Illustrations and Images using Convolutional Neural Networks
CN112541393A (en) Transformer substation personnel detection method and device based on deep learning
CN115439458A (en) Industrial image defect target detection algorithm based on depth map attention
CN108363962B (en) Face detection method and system based on multi-level feature deep learning
CN115830533A (en) Helmet wearing detection method based on K-means clustering improved YOLOv5 algorithm
CN116012653A (en) Method and system for classifying hyperspectral images of attention residual unit neural network
CN111104855B (en) Workflow identification method based on time sequence behavior detection
CN105893941A (en) Facial expression identifying method based on regional images
CN110428398A (en) A kind of high iron catenary bracing wire defect inspection method based on deep learning
CN105825215A (en) Instrument positioning method based on local neighbor embedded kernel function and carrier of method
CN117423157A (en) Mine abnormal video action understanding method combining migration learning and regional invasion
CN116630668A (en) Method for identifying wearing abnormality of safety helmet in quick lightweight manner
Wang et al. Safety helmet wearing recognition based on improved YOLOv4 algorithm
CN116503379A (en) Lightweight improved YOLOv 5-based part identification method
CN116189286A (en) Video image violence behavior detection model and detection method
CN114119562B (en) Brake disc outer surface defect detection method and system based on deep learning
CN114140879A (en) Behavior identification method and device based on multi-head cascade attention network and time convolution network
CN112541469B (en) Crowd counting method and system based on self-adaptive classification
CN110427920B (en) Real-time pedestrian analysis method oriented to monitoring environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20230307