CN115546614A - Safety helmet wearing detection method based on improved YOLOV5 model - Google Patents

Safety helmet wearing detection method based on improved YOLOV5 model

Info

Publication number
CN115546614A
CN115546614A (application number CN202211534970.0A)
Authority
CN
China
Prior art keywords
feature
feature map
module
convolution
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211534970.0A
Other languages
Chinese (zh)
Other versions
CN115546614B (en)
Inventor
张艳
梁化民
刘业辉
孙晶雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Chengjian University
Original Assignee
Tianjin Chengjian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Chengjian University filed Critical Tianjin Chengjian University
Priority to CN202211534970.0A priority Critical patent/CN115546614B/en
Publication of CN115546614A publication Critical patent/CN115546614A/en
Application granted granted Critical
Publication of CN115546614B publication Critical patent/CN115546614B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038 Image mosaicing, e.g. composing plane images from plane sub-images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a safety helmet wearing detection method based on an improved YOLOV5 model, which comprises the following steps: randomly selecting images from a helmet wearing image data set and performing image data enhancement to obtain data-enhanced images; inputting the data-enhanced images into an improved YOLOV5 model for training to obtain a trained improved YOLOV5 model, wherein the improved YOLOV5 model embeds an inverted residual module and an inverted residual attention module in the feature extraction part to extract image features, designs a multi-scale feature fusion module in the feature fusion part to perform feature fusion and generate four detection heads with different receptive fields, and optimizes the prediction box regression loss function; and finally, inputting an image to be detected into the trained improved YOLOV5 model to obtain a detection result indicating whether the relevant person wears a safety helmet. The invention effectively solves the problems of missed detection and false detection of small targets in construction site video surveillance images and improves the accuracy of safety helmet wearing detection.

Description

Helmet wearing detection method based on improved YOLOV5 model
Technical Field
The invention relates to the technical field of image processing, in particular to a helmet wearing detection method based on an improved YOLOV5 model.
Background
At present, the construction industry in China is still developing continuously and the number of construction workers grows every year. In construction site safety management, the safety helmet is a protective article that effectively prevents head injury accidents: it absorbs the impact of falling objects on a worker's head and avoids or reduces the resulting injuries, and it is a piece of personal protective equipment that the work safety law requires to be worn in production and construction activities. Ensuring that safety helmets are worn correctly on construction sites can effectively reduce casualties in production accidents and is of great significance for safeguarding safe production.
At present, whether workers wear safety helmets on construction sites is mostly judged by manual supervision, which easily wastes manpower and material resources; moreover, because the work area is large and manual inspection is prone to fatigue, the supervision effect is poor.
In recent years, with the continuous development of target detection technology, certain results have been achieved in safety helmet detection research. Compared with traditional manual inspection, which is time-consuming and labor-intensive, machine-vision-based methods offer a high degree of automation and are easy to extend, and have therefore become an urgent current need.
However, existing detection methods based on traditional machine learning mainly identify the shape and color features of the safety helmet, for example locating the human face with skin color detection and then detecting the helmet with a support vector machine. Although such traditional machine learning helmet detection algorithms are fast, features and classifiers must be designed and trained for each specific detection object; because of their poor generalization ability and reliance on single features, they cannot detect targets effectively in complex construction environments, missed detection and false detection of small targets occur easily, and the accuracy of helmet wearing detection in complex environments is low.
Therefore, how to avoid missed detection and false detection of small targets in safety helmet wearing detection in complex environments and improve the accuracy of helmet wearing detection is a problem that urgently needs to be solved by those skilled in the art.
Disclosure of Invention
In view of this, the invention provides a safety helmet wearing detection method based on an improved YOLOV5 model, which at least solves some of the above technical problems. The method embeds an inverted residual module and an inverted residual attention module in the feature extraction part of the YOLOV5 model, so as to obtain rich small-target spatial information and deep semantic information and improve the detection accuracy of small targets; a multi-scale feature fusion module is designed in the feature fusion part, which improves the model's ability to recognize small-size targets and reduces missed detection of small targets. The method can effectively detect safety helmet wearing in complex environments, avoids missed detection and false detection of small targets, and improves the accuracy of helmet wearing detection in complex environments.
In order to achieve the purpose, the invention adopts the technical scheme that:
the embodiment of the invention provides a safety helmet wearing detection method based on an improved YOLOV5 model, which comprises the following steps:
s1, acquiring a helmet wearing image data set, and randomly selecting N images from the helmet wearing image data set to perform image data enhancement to obtain data-enhanced images;
s2, inputting the image subjected to data enhancement into an improved YOLOV5 model for training to obtain a trained improved YOLOV5 model; the improved YOLOV5 model comprises: embedding an inverted residual error module and an inverted residual error attention module in the feature extraction part to extract image features; designing a multi-scale feature fusion module in the feature fusion part for feature fusion, and generating four detection heads with different receptive fields; optimizing a regression loss function of the prediction frame;
and S3, inputting the image to be detected into the trained improved YOLOV5 model to obtain a detection result of whether the related person wears the safety helmet or not.
Further, in step S1, N images are randomly selected from the helmet wearing image dataset to perform image data enhancement, where the image data enhancement includes:
turning, scaling and color gamut transformation are carried out on the image;
and randomly cutting the image after the turning, the zooming and the color gamut conversion according to a preset template and then splicing.
Further, the scaling the image specifically includes: randomly selecting N images from the helmet wearing image data set, and performing image processing by using the width and height of the images as boundary valuest x Andt y zooming of zooming magnification;
t x =f r (t w ,t w t w )
t y =f r (t h ,t h t h )
wherein the content of the first and second substances,t w andt h minimum values of wide and high magnification, respectively, deltat w And Δt h Respectively the lengths of the random intervals of the wide and the high magnification,f r representing a random value function.
Further, splicing the zoomed images after randomly cutting the zoomed images according to a preset template specifically comprises:
is determined to be highhIs as wide aswThe image template is used as the size of an output image, four dividing lines are randomly generated in the width and height directions for cutting, then nine cut images are spliced, and an overflowing frame part is cut off; performing secondary cutting on the internal overlapped part, and obtaining a spliced image after cutting; this image was used as input layer data for the YOLOV5 convolutional neural network.
Further, in the step S2, the image feature extraction is performed by the inverted residual error module and the inverted residual error attention module embedded in the feature extraction part; the method specifically comprises the following steps:
a. inputting the data-enhanced image into the feature extraction module and convolving it through the first Focus layer: specifically, a value is taken from every other pixel of the picture, similarly to downsampling, so that the image is divided into four pictures that are similar to one another but lose no information; this operation concentrates the information into the channel space and expands the input channels by a factor of 4, i.e. the spliced picture has 12 channels; the spliced picture is then convolved, and after the Focus convolution and a 3×3 convolution layer the feature map Feature_C0 is obtained (a minimal sketch of the Focus slicing operation is given after step c below);
b. inputting the feature map Feature_C0 into the first inverted residual module, which amplifies shallow features by channel expansion: the input features are channel-expanded, a high-dimensional to low-dimensional feature mapping is realized by a linear transformation to obtain rich shallow information, features are extracted by convolution and learned repeatedly through residual connections, and the feature map Feature_C1 is output;
c. the feature map Feature_C1 passes through a convolution layer and the second inverted residual module to obtain the feature map Feature_C2, and then through another convolution layer into the first inverted residual attention module to obtain the feature map Feature_C3; after a convolution with a 3×3 kernel and spatial pyramid pooling, the feature map Feature_C3 enters the second inverted residual attention module to obtain the feature map Feature_C4, which serves as the input of the multi-scale feature fusion module.
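As a non-authoritative illustration of the Focus slicing described in step a, the following PyTorch sketch (class and variable names are assumptions, not taken from the patent) shows how sampling every other pixel turns a 3-channel input into a 12-channel tensor before the subsequent convolution:

import torch
import torch.nn as nn

class FocusSketch(nn.Module):
    # Minimal sketch of a Focus-style slice convolution (assumed layout).
    def __init__(self, in_channels=3, out_channels=32, kernel_size=3):
        super().__init__()
        # After slicing, the channel count is 4x the input (3 -> 12).
        self.conv = nn.Conv2d(4 * in_channels, out_channels, kernel_size,
                              stride=1, padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.SiLU()

    def forward(self, x):
        # Take every other pixel in both directions: four sub-images,
        # each with half the height and width, with no information lost.
        sliced = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                            x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.act(self.bn(self.conv(sliced)))

# Usage: a 640x640 RGB image becomes a 320x320, 32-channel feature map.
feat = FocusSketch()(torch.randn(1, 3, 640, 640))
print(feat.shape)  # torch.Size([1, 32, 320, 320])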
Further, in step S2, the designing a multi-scale feature fusion module in the feature fusion part to perform feature fusion and generate four detection heads with different receptive fields specifically includes the following steps:
1) Convolving the feature map Feature_C4 with a 3×3 kernel and 512 channels and then performing an up-sampling operation to obtain the feature map Feature_Up1;
2) Fusing the feature map Feature_Up1 with the feature map Feature_C3 of the feature extraction module and passing the result through a C3 module to obtain the fused feature map Feature_Fuse1; convolving Feature_Fuse1 with a 3×3 kernel and 256 channels and up-sampling to obtain the feature map Feature_Up2;
3) Fusing the feature map Feature_Up2 with the feature map Feature_C2 to obtain the feature map Feature_Fuse2, then repeating the above convolution and up-sampling operations to obtain the feature map Feature_Up3;
4) Concatenating the feature map Feature_Up3 with the feature map Feature_C1 to obtain the feature map Feature_Fuse3, and then performing feature extraction through a C3 module and a convolution layer with a 1×1 kernel to obtain the feature map F4, whose size is 1/4 of the original image and which is used for detecting the smallest targets;
5) Passing the feature map Feature_Fuse3 through a C3 module and a convolution with a 3×3 kernel to obtain the feature map Feature_Fuse4, concatenating Feature_Fuse4 with the feature map Feature_Up2, and further concatenating with the feature map Feature_C2 to obtain the feature map F3, whose size is 1/8 of the original image and which is used for detecting small targets;
6) Concatenating the feature map Feature_Fuse4, after a C3 module and a convolution with a 3×3 kernel, with the feature map Feature_Fuse1, also after a C3 module and a convolution with a 3×3 kernel, to obtain the feature map Feature_Fuse5, and then performing feature extraction through a C3 module and a convolution layer with a 1×1 kernel to obtain the feature map F2, whose size is 1/16 of the original image and which is used for detecting medium targets;
7) Concatenating the feature map Feature_Fuse5, after a C3 module and a convolution with a 3×3 kernel, with the feature map Feature_C4, after a convolution with a 3×3 kernel, to obtain the feature map Feature_Fuse6, and then performing feature extraction through a C3 module and a convolution layer with a 1×1 kernel to obtain the feature map F1, whose size is 1/32 of the original image and which is used for detecting large targets.
Further, in the optimization of the prediction box regression loss function, the CIoU loss is used as the prediction box regression loss function L_CIoU of the improved YOLOV5 model algorithm, which is defined as:
CIoU = IoU - ρ²(b, b^gt) / c² - αv
L_CIoU = 1 - CIoU
wherein IoU represents the intersection-over-union of the prediction box and the real box, b represents the center point of the prediction box, b^gt represents the center point of the real box, ρ represents the Euclidean distance, ρ²(b, b^gt) represents the square of the Euclidean distance between the center point of the prediction box and the center point of the real box, c represents the diagonal distance of the smallest bounding rectangle that can contain both the prediction box and the real box, α represents a trade-off parameter, and v represents a parameter measuring the consistency of the aspect ratios;
wherein the parameters α and v are expressed as follows:
α = v / ((1 - IoU) + v)
v = (4 / π²) · (arctan(w^gt / h^gt) - arctan(w / h))²
wherein w and h are respectively the width and height of the prediction box, and w^gt and h^gt are respectively the width and height of the real box.
Compared with the prior art, the invention has the following beneficial effects:
1. In the safety helmet wearing detection method based on the improved YOLOV5 model provided by the embodiment of the invention, the inverted residual module and the inverted residual attention module are embedded in the feature extraction part, which facilitates obtaining rich small-target spatial information and deep semantic information and improves the detection accuracy of small and medium targets in helmet wearing detection;
2. In the safety helmet wearing detection method based on the improved YOLOV5 model provided by the embodiment of the invention, a multi-scale feature fusion module is designed in the feature fusion part for feature fusion and four detection heads with different receptive fields are generated, which improves the model's ability to recognize small-size targets and reduces missed detection of small targets in helmet wearing detection;
3. The safety helmet wearing detection method based on the improved YOLOV5 model provided by the embodiment of the invention designs a mosaic mixed data enhancement method, which establishes a linear relationship between data, increases the background complexity of the images and improves the robustness of the algorithm, so that helmet wearing can be effectively detected in complex environments.
Drawings
Fig. 1 is a flowchart of a method for detecting wearing of a safety helmet based on an improved YOLOV5 model according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of an improved YOLOV5 convolutional neural network provided in an embodiment of the present invention.
Fig. 3 is a schematic model diagram of an inverted residual error module according to an embodiment of the present invention.
Fig. 4 is a model diagram of an inverted residual attention module according to an embodiment of the present invention.
Fig. 5 is an average precision mean index graph of the improved YOLOV5 model provided by the embodiment of the present invention after 100 times of training.
Fig. 6 is a comparison diagram of the results of the detection of wearing the helmet before and after improvement of the YOLOV5 model provided in the embodiment of the present invention.
Fig. 7 is another comparison graph of the results of the detection of the wearing of the helmet before and after the improvement of the YOLOV5 model provided by the embodiment of the present invention.
Detailed Description
In order to make the technical means, the creation characteristics, the achievement purposes and the effects of the invention easy to understand, the invention is further described with the specific embodiments.
In the description of the present invention, it should be noted that the terms "upper", "lower", "left", "right", "front", "rear", "both ends", "one end", "the other end", and the like indicate orientations or positional relationships based on orientations or positional relationships shown in the drawings, only for convenience of description and simplification of description, but do not indicate or imply that the device or element referred to must have a specific orientation, be constructed in a specific orientation, and operate, and thus, should not be construed as limiting the present invention. Furthermore, the ordinal numbers "(1)", "(2)", etc. are for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiment of the invention applies the target detection algorithm based on the convolutional neural network to the wearing detection of the safety helmet, and performs optimization and improvement by combining the task characteristics of the wearing detection of the safety helmet on the basis of the Yolov5 model, thereby realizing a more accurate and intelligent detection scheme.
Referring to fig. 1, the invention provides a method for detecting wearing of a safety helmet based on an improved YOLOV5 model, which comprises the following steps:
s1, acquiring a helmet wearing image data set, and randomly selecting N images from the helmet wearing image data set to perform image data enhancement to obtain data-enhanced images;
s2, inputting the image subjected to data enhancement into an improved YOLOV5 model for training to obtain a trained improved YOLOV5 model; the improved YOLOV5 model comprises: embedding an inverted residual error module and an inverted residual error attention module in the feature extraction part to extract image features; designing a multi-scale feature fusion module in the feature fusion part for feature fusion, and generating four detection heads with different receptive fields; optimizing a prediction box regression loss function;
and S3, inputting the image to be detected into the trained improved YOLOV5 model to obtain a detection result of whether the related person wears the safety helmet or not.
The above steps are described in detail below:
In the step S1, a helmet wearing image data set is first obtained and data enhancement is performed on the data set, where the image data enhancement includes: flipping, scaling and color gamut transformation of the images; and randomly cutting the flipped, scaled and color-gamut-transformed images according to a preset template and then splicing them. In this embodiment, an improved mosaic data enhancement method is adopted, which specifically comprises: randomly selecting 9 images from the helmet wearing image data set, and scaling each image with a width magnification t_x and a height magnification t_y taken from bounded random intervals;
t_x = f_r(t_w, t_w + Δt_w)
t_y = f_r(t_h, t_h + Δt_h)
wherein t_w and t_h are respectively the minimum values of the width and height magnifications, Δt_w and Δt_h are respectively the lengths of the random intervals of the width and height magnifications, and f_r denotes a random value function.
Further, an image template of height h and width w is determined as the size of the output image, four dividing lines are randomly generated in the width and height directions, the nine cut pictures are then spliced, and the part overflowing the frame is cut off; the internal overlapping parts are cut a second time, the spliced image is obtained after cutting, and this image is used as the input layer data of the YOLOV5 model convolutional neural network.
In this embodiment, the improved mosaic data enhancement method extends the original random splicing of four images to 9 images: each image keeps its corresponding frame, and the nine images are combined together after random cutting and random splicing, so that a balance among targets of different scales is achieved. A minimal sketch of this 9-image mosaic is given below.
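The following Python sketch illustrates, under stated assumptions, one way the 9-image mosaic described above could be realised: the scaling magnifications are drawn with the random value function f_r, four dividing lines (two per direction) split an h×w template into nine cells, and the scaled images are cropped into those cells. Function names and the choice of OpenCV/NumPy are illustrative assumptions, not the patent's implementation, and bounding-box remapping is only indicated by a comment.

import random
import numpy as np
import cv2

def f_r(lo, hi):
    # Random value function: uniform sample in [lo, hi].
    return random.uniform(lo, hi)

def mosaic9(images, h=640, w=640, t_w=0.4, t_h=0.4, d_w=0.6, d_h=0.6):
    # Assemble 9 randomly scaled images into one h x w mosaic template.
    assert len(images) == 9
    # Two random dividing lines per direction -> a 3x3 grid of cells.
    xs = [0] + sorted(random.randint(w // 6, 5 * w // 6) for _ in range(2)) + [w]
    ys = [0] + sorted(random.randint(h // 6, 5 * h // 6) for _ in range(2)) + [h]
    canvas = np.zeros((h, w, 3), dtype=np.uint8)
    for idx, img in enumerate(images):
        # Scale with magnifications t_x, t_y drawn from bounded intervals.
        t_x, t_y = f_r(t_w, t_w + d_w), f_r(t_h, t_h + d_h)
        scaled = cv2.resize(img, None, fx=t_x, fy=t_y)
        r, c = divmod(idx, 3)
        x0, x1, y0, y1 = xs[c], xs[c + 1], ys[r], ys[r + 1]
        ch, cw = y1 - y0, x1 - x0
        # Crop the scaled image to its cell; anything overflowing is cut off.
        patch = np.zeros((ch, cw, 3), dtype=np.uint8)
        sh, sw = min(ch, scaled.shape[0]), min(cw, scaled.shape[1])
        patch[:sh, :sw] = scaled[:sh, :sw]
        canvas[y0:y1, x0:x1] = patch
        # (Bounding boxes would be remapped with the same scale and crop here.)
    return canvas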
In the step S2, the data-enhanced images are input into the improved YOLOV5 model for training to obtain a trained improved YOLOV5 model. In the embodiment of the present invention, the improved YOLOV5 model comprises: 1. embedding an inverted residual module (IRC3) and an inverted residual attention module (IRAC3) in the feature extraction part to extract image features; 2. designing a multi-scale feature fusion module in the feature fusion part for feature fusion and generating four detection heads with different receptive fields; 3. optimizing the prediction box regression loss function.
(1) A feature extraction part:
the main purpose of feature extraction is to learn the mapping relationship between high-resolution images and low-resolution images using a convolutional neural network. As shown in fig. 2, in the embodiment of the present invention, the feature extraction module mainly includes a slice convolution layer (Focus Conv), a convolution layer (Conv), an Inverted residual C3 (IRC 3) module, an Inverted residual Attention C3 (IRAC 3) module, and a feature pyramid module (SPP); wherein, the structures of IRC3 and IRAC3 are respectively shown in FIG. 3 and FIG. 4; wherein, conv represents Convolution, H represents height of the feature map, W represents width of the feature map, C represents number of channels of the feature map, 2C represents number of channels obtained after two-fold expansion, while represents fusion operation, while represents splicing operation, DWConv (Depthwise Convolution) represents Depth Convolution, PWConv (Pointwise Convolution) represents point-by-point Convolution, ECA-Net (effective Channel Attention) represents effective Channel Attention, and SD (Stochastic Depth, SD) represents random Depth.
Referring to fig. 2, the specific process of feature extraction is as follows:
s1, an input image is subjected to a first layer of slice convolution (Focus Conv), specifically, every other pixel in each picture takes one value, the method is similar to down-sampling, the four pictures are divided into four pictures in the mode, the four pictures are similar but have no information loss, the information is concentrated into a channel space through the operation, an input channel is expanded by 4 times, namely, a channel for splicing the pictures is 12 channels, and then the pictures are subjected to convolution operation to obtain a Feature map Feature _ C0 so as to reduce the number of parameters and improve the training speed;
s2, after passing through a convolutional layer (Conv), entering an inverted residual C3 module (IRC 3), wherein in the module, firstly, expansion of the number of channels is realized on an input Feature map by utilizing an expansion factor, then, high-dimensional channels are mapped to low-dimensional channels by utilizing linear change to obtain rich shallow features, and identity mapping is combined with the input features through residual operation to obtain a Feature map Feature _ C1;
s3, the Feature graph Feature _ C1 passes through a convolution layer and an inverted residual error C3 module to obtain a Feature graph Feature _ C2, and then passes through a convolution layer to enter an inverted residual error attention C3 (IRAC 3) module;
in the embodiment of the invention, the inverse residual attention C3 module comprises a depth separable convolution and effective channel attention module, and firstly passes through a depth convolution module which is designed to replace a standard convolution of 3 multiplied by 3 by using a depth convolution with less parameters and low calculation complexity. As shown in the following formula:
G_{k,l,c} = Σ_{i,j} K_{i,j,c} · F_{k+i-1, l+j-1, c}
wherein K_{i,j,c} denotes the depthwise convolution kernel, i, j index the convolution kernel, k, l index the feature map, and the c-th convolution kernel is applied only to the c-th channel F_c of the feature map with which it is multiplied; the outputs of the depthwise convolution are then linearly combined by a 1×1 convolution, and G denotes the feature obtained after the 3×3 depthwise convolution of the feature map Feature_C2.
Further, after the depthwise separable convolution, the feature passes through the efficient channel attention module, in which each channel and its k adjacent channels capture local cross-channel interaction information; finally, after a point convolution with a 1×1 kernel, the number of channels is restored to the original number and the feature map Feature_C3 is obtained;
Further, the obtained feature map Feature_C3 passes through a convolution with a 3×3 kernel and spatial pyramid pooling (SPP), and then enters the second IRAC3 module of the feature extraction module to obtain the feature map Feature_C4. A simplified sketch of such an inverted residual attention block is given below.
(2) A feature fusion part:
referring to fig. 2, a specific process of feature fusion is as follows:
s1, a final Feature map Feature _ C4 of a Feature extraction part enters a multi-scale Feature fusion module, and a Feature map Feature _ Up1 is obtained by performing convolution with a convolution kernel size of 3x3 and a channel number of 512 (convolution is represented by Conv in the figure and the same principle below) and performing upsampling operation (the upsampling operation is represented by UpSample in figure 2 and the same principle below);
s2, connecting and merging the Feature map Feature _ Up1 and the Feature _ C3 of the Feature extraction module, and obtaining a fused Feature map Feature _ Fuse1 through the C3 module; convolving the fused Feature map Feature _ Fuse1 with a convolution kernel of 3 multiplied by 3 and a channel number of 256 and performing Up-sampling operation to obtain a Feature map Feature _ Up2;
s3, repeating the operation of the S2, fusing the Feature map Feature _ Up2 with the Feature _ C2 obtained by the Feature extraction module to obtain a Feature map Feature _ Fuse2, and performing convolution kernel Up-sampling operation to obtain a Feature map Feature _ Up3;
s4, performing cascade operation on the obtained Feature map Feature _ Up3 and Feature _ C1 in a backbone network (Feature extraction module) to obtain a Feature map Feature _ Fuse3, and then performing Feature extraction on a convolution layer with the convolution kernel size of 1 × 1 through a C3 module to obtain a Feature map F4, wherein the Feature size of the Feature map F4 is 1/4 of that of an original image and the Feature map is used for detecting a minimum target;
s5, convolving the Feature map Feature _ Fuse3 by a C3 module and a convolution kernel with the size of 3 multiplied by 3 to obtain a Feature map Feature _ Fuse4, cascading the Feature map Feature _ Fuse4 with a Feature map Feature _ Up2 of a Feature fusion module, and further cascading the Feature map Feature _ C2 in a Feature extraction module to obtain a Feature map F3, wherein the size of the Feature map is 1/8 of that of an original image and the Feature map is used for detecting small targets;
s6, cascading the Feature map Feature _ Fuse4 after convolution with the convolution kernel size of 3x3 through the C3 module and the Feature map Feature _ Fuse1 after convolution with the convolution kernel size of 3x3 through the C3 module to obtain a Feature map Feature _ Fuse5, and then performing Feature extraction through the convolution layer with the convolution kernel size of 1 x 1 through the C3 module to obtain a Feature map F2, wherein the Feature size is 1/16 of the original image and is used for detecting a medium target;
s7, the Feature map Feature _ Fuse5 is subjected to convolution with a convolution kernel size of 3x3 through a C3 module and is cascaded with the Feature map Feature _ C4 after convolution with a convolution kernel size of 3x3 to obtain a Feature map Feature _ Fuse6, and then Feature extraction is carried out through the convolution layer with a convolution kernel size of 1 x 1 through the C3 module to obtain a Feature map F1, wherein the Feature size is 1/32 of that of the original image, and the Feature map F1 is used for detecting a large target.
(3) Optimizing a prediction box regression loss function:
In the embodiment of the invention, the CIoU loss is used as the prediction box regression loss function L_CIoU of the improved YOLOV5 model algorithm, which is defined as:
CIoU = IoU - ρ²(b, b^gt) / c² - αv
L_CIoU = 1 - CIoU
wherein IoU represents the intersection-over-union of the prediction box and the real box, b represents the center point of the prediction box, b^gt represents the center point of the real box, ρ represents the Euclidean distance, ρ²(b, b^gt) represents the square of the Euclidean distance between the center point of the prediction box and the center point of the real box, c represents the diagonal distance of the smallest bounding rectangle that can contain both the prediction box and the real box, α represents a trade-off parameter, and v represents a parameter measuring the consistency of the aspect ratios;
wherein the parameters α and v are expressed as follows:
α = v / ((1 - IoU) + v)
v = (4 / π²) · (arctan(w^gt / h^gt) - arctan(w / h))²
wherein w and h are respectively the width and height of the prediction box, and w^gt and h^gt are respectively the width and height of the real box. A minimal code sketch of this loss is given below.
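The following Python sketch implements the CIoU regression loss as defined above for axis-aligned boxes given in (x1, y1, x2, y2) format; the function name and the box format are assumptions for illustration.

import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    # CIoU loss for boxes in (x1, y1, x2, y2) format; returns L_CIoU = 1 - CIoU.
    # Intersection and union -> IoU.
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = inter_w * inter_h
    w, h = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wg, hg = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    union = w * h + wg * hg - inter + eps
    iou = inter / union

    # rho^2: squared distance between the two box centers.
    rho2 = ((pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) ** 2 +
            (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) ** 2) / 4

    # c^2: squared diagonal of the smallest box enclosing both boxes.
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term v and trade-off parameter alpha.
    v = (4 / math.pi ** 2) * (torch.atan(wg / (hg + eps)) - torch.atan(w / (h + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    ciou = iou - rho2 / c2 - alpha * v
    return 1 - ciou  # L_CIoU

# Usage with one predicted box and one ground-truth box.
loss = ciou_loss(torch.tensor([[10., 10., 50., 60.]]), torch.tensor([[12., 8., 48., 66.]]))
print(loss)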
Further, the training parameters are set as follows: the batch size is 16, the number of training iterations is 100, the initial learning rate is 0.01, the termination learning rate is 0.2, the momentum is 0.937, the weight decay is 0.0005, and a stochastic gradient descent strategy with learning-rate decay is adopted (a minimal configuration sketch is given below);
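A minimal PyTorch training-setup sketch reflecting the hyperparameters listed above; the stand-in model and the scheduler are placeholders (assumptions), and reading the "termination learning rate 0.2" as a linear decay to 0.2 times the initial rate follows the common YOLOv5 convention, which is an assumption rather than a statement of the patent.

import torch

# Placeholder model; in practice this would be the improved YOLOV5 network
# trained on the mosaic-augmented helmet data set described above.
model = torch.nn.Conv2d(3, 3, 3)

epochs, batch_size = 100, 16
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.01,            # initial learning rate 0.01
                            momentum=0.937,     # momentum 0.937
                            weight_decay=0.0005)  # weight decay 0.0005

# Assumed interpretation: the learning rate decays linearly to 0.2x its
# initial value over the 100 training epochs.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda e: 1 - (1 - 0.2) * e / epochs)

for epoch in range(epochs):
    # ... forward pass, CIoU + classification + objectness losses, backward ...
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()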
In the embodiment of the invention, rotation and horizontal mirroring are adopted to add images of the safety helmet at different angles, combined with the improved mosaic method, to improve the recognition capability for the object; training is performed with the improved convolutional neural network combined with the optimized loss function, and the final improved YOLOV5 convolutional neural network is obtained after training is completed.
Further, in the embodiment of the present invention, the mean average precision (mAP@0.5) is used as the index for measuring model performance; fig. 5 shows the mean average precision achieved by the improved YOLOv5s model after 100 training iterations.
Further, two images of safety helmet wearing in a complex environment are input into the trained improved YOLOV5 model to obtain detection results indicating whether the relevant personnel wear safety helmets, and the results are compared with those of the unimproved YOLOV5 model, as shown in fig. 6 and fig. 7: fig. 6a and 7a show the helmet wearing detection results of the unimproved YOLOV5 model, and fig. 6c and 7c are partial enlarged views of those results; fig. 6b and 7b show the helmet wearing detection results of the improved YOLOV5 model of the present invention, and fig. 6d and 7d are partial enlarged views of those results. From the left-right comparison in fig. 6 it can be seen that the unimproved YOLOV5 model misses helmet-wearing small targets (three wearing targets are detected in fig. 6c, while four are detected in fig. 6d); fig. 7 likewise shows that the unimproved YOLOV5 model produces erroneous detections of helmet wearing on small targets (fig. 7c detects two helmet-wearing targets, while fig. 7d detects three). As can be seen from fig. 6 and 7, the method of the present invention effectively solves the problem of missed detection of helmet wearing among dense targets and can detect helmet wearing effectively even in complex scenes, avoiding missed detection of small targets.
Through the description of the above embodiment, those skilled in the art can see that the invention provides a safety helmet wearing detection method based on an improved YOLOV5 model. First, an improved Mosaic data enhancement method is designed, which enriches the diversity of image samples, establishes a linear relationship between data and improves the robustness of the algorithm. Second, to address the low detection accuracy for small targets, the backbone network of the model is optimized: an inverted residual module and an inverted residual attention module are embedded in the backbone, and low-dimensional to high-dimensional feature information mapping is used to obtain rich small-target spatial information and deep semantic information and improve small-target detection accuracy. Finally, a multi-scale feature fusion module is designed in the feature fusion part to fuse shallow spatial information with deep semantic information and generate four detection heads with different receptive fields, which improves the model's ability to recognize small-size targets and reduces missed detection of small targets. The method can thus effectively solve the problems of missed detection and false detection of small targets in construction site video surveillance images and improve the accuracy of safety helmet wearing detection.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A safety helmet wearing detection method based on an improved YOLOV5 model is characterized by comprising the following steps:
s1, acquiring a helmet wearing image data set, and randomly selecting N images from the helmet wearing image data set to perform image data enhancement to obtain data-enhanced images;
s2, inputting the data-enhanced image into an improved YOLOV5 model for training to obtain a trained improved YOLOV5 model; the improved YOLOV5 model comprises: embedding an inverted residual error module and an inverted residual error attention module in the feature extraction part to extract image features; designing a multi-scale feature fusion module in the feature fusion part for feature fusion, and generating four detection heads with different receptive fields; optimizing a prediction box regression loss function;
and S3, inputting the image to be detected into the trained improved YOLOV5 model to obtain a detection result of whether the related person wears the safety helmet or not.
2. The method for detecting wearing of a helmet based on the improved YOLOV5 model as claimed in claim 1, wherein in step S1, N images are randomly selected from the image data set for image data enhancement, and the image data enhancement includes:
flipping, scaling and color gamut transformation of the images;
and randomly cutting the flipped, scaled and color-gamut-transformed images according to a preset template and then splicing them.
3. The safety helmet wearing detection method based on the improved YOLOV5 model according to claim 2, wherein the scaling of the images specifically comprises: randomly selecting N images from the helmet wearing image data set, and scaling each image with a width magnification t_x and a height magnification t_y taken from bounded random intervals;
t_x = f_r(t_w, t_w + Δt_w)
t_y = f_r(t_h, t_h + Δt_h)
wherein t_w and t_h are respectively the minimum values of the width and height magnifications, Δt_w and Δt_h are respectively the lengths of the random intervals of the width and height magnifications, and f_r denotes a random value function.
4. The method for detecting wearing of safety helmets based on the improved YOLOV5 model according to claim 1, wherein in the step S2, the image feature extraction is performed by embedding an inverted residual module and an inverted residual attention module in the feature extraction part; the method specifically comprises the following steps:
a. the data-enhanced image is passed through the Focus convolution and a 3×3 convolution layer to obtain the feature map Feature_C0;
b. the feature map Feature_C0 is input into the first inverted residual module, shallow features are amplified by channel expansion, features are extracted by convolution and learned repeatedly through residual connections, and the feature map Feature_C1 is output;
c. the feature map Feature_C1 passes through a convolution layer and the second inverted residual module to obtain the feature map Feature_C2; the feature map Feature_C2 is input through a convolution layer into the first inverted residual attention module to obtain the feature map Feature_C3; after a convolution with a 3×3 kernel and spatial pyramid pooling, the feature map Feature_C3 enters the second inverted residual attention module to obtain the feature map Feature_C4, which serves as the input of the multi-scale feature fusion module.
5. The method for detecting wearing of a helmet based on the improved YOLOV5 model as claimed in claim 4, wherein in the step S2, the multi-scale feature fusion module is designed in the feature fusion part for feature fusion, and four detection heads with different receptive fields are generated, specifically comprising the following steps:
1) The feature map Feature_C4 is convolved and up-sampled to obtain the feature map Feature_Up1;
2) The feature map Feature_Up1 is fused with the feature map Feature_C3, the fused feature map Feature_Fuse1 is obtained through a C3 module, and the feature map Feature_Fuse1 is convolved and up-sampled to obtain the feature map Feature_Up2;
3) The feature map Feature_Up2 is fused with the feature map Feature_C2 to obtain the feature map Feature_Fuse2, and convolution and up-sampling yield the feature map Feature_Up3;
4) The feature map Feature_Up3 is concatenated with the feature map Feature_C1 to obtain the feature map Feature_Fuse3, and feature extraction through a C3 module and a convolution layer with a 1×1 kernel then yields the feature map F4, whose size is 1/4 of the original image;
5) The feature map Feature_Fuse3 is convolved through a C3 module and a 3×3 kernel to obtain the feature map Feature_Fuse4, which is concatenated with the feature map Feature_Up2 and further concatenated with the feature map Feature_C2 to obtain the feature map F3, whose size is 1/8 of the original image;
6) The feature map Feature_Fuse4, after a C3 module and a convolution with a 3×3 kernel, is concatenated with the feature map Feature_Fuse1, also after a C3 module and a convolution with a 3×3 kernel, to obtain the feature map Feature_Fuse5, and feature extraction through a C3 module and a convolution layer with a 1×1 kernel then yields the feature map F2, whose size is 1/16 of the original image;
7) The feature map Feature_Fuse5, after a C3 module and a convolution with a 3×3 kernel, is concatenated with the feature map Feature_C4, after a convolution with a 3×3 kernel, to obtain the feature map Feature_Fuse6, and feature extraction through a C3 module and a convolution layer with a 1×1 kernel then yields the feature map F1, whose size is 1/32 of the original image.
6. The safety helmet wearing detection method based on the improved YOLOV5 model according to claim 1, wherein in the optimized prediction box regression loss function, the CIoU loss is used as the prediction box regression loss function L_CIoU of the improved YOLOV5 model algorithm, which is defined as:
CIoU = IoU - ρ²(b, b^gt) / c² - αv
L_CIoU = 1 - CIoU
wherein IoU represents the intersection-over-union of the prediction box and the real box, b represents the center point of the prediction box, b^gt represents the center point of the real box, ρ represents the Euclidean distance, ρ²(b, b^gt) represents the square of the Euclidean distance between the center point of the prediction box and the center point of the real box, c represents the diagonal distance of the smallest bounding rectangle that can contain both the prediction box and the real box, α represents a trade-off parameter, and v represents a parameter measuring the consistency of the aspect ratios;
wherein the parameters α and v are expressed as follows:
α = v / ((1 - IoU) + v)
v = (4 / π²) · (arctan(w^gt / h^gt) - arctan(w / h))²
wherein w and h are respectively the width and height of the prediction box, and w^gt and h^gt are respectively the width and height of the real box.
CN202211534970.0A 2022-12-02 2022-12-02 Safety helmet wearing detection method based on improved YOLOV5 model Active CN115546614B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211534970.0A CN115546614B (en) 2022-12-02 2022-12-02 Safety helmet wearing detection method based on improved YOLOV5 model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211534970.0A CN115546614B (en) 2022-12-02 2022-12-02 Safety helmet wearing detection method based on improved YOLOV5 model

Publications (2)

Publication Number Publication Date
CN115546614A 2022-12-30
CN115546614B (en) 2023-04-18

Family

ID=84721761

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211534970.0A Active CN115546614B (en) 2022-12-02 2022-12-02 Safety helmet wearing detection method based on improved YOLOV5 model

Country Status (1)

Country Link
CN (1) CN115546614B (en)


Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102547301A (en) * 2010-09-30 2012-07-04 苹果公司 System and method for processing image data using an image signal processor
CN107941775A (en) * 2017-12-28 2018-04-20 清华大学 Muti-spectrum imaging system
CN108537729A (en) * 2018-03-27 2018-09-14 珠海全志科技股份有限公司 Picture scaling method, computer installation and computer readable storage medium
CN110163234A (en) * 2018-10-10 2019-08-23 腾讯科技(深圳)有限公司 A kind of model training method, device and storage medium
CN110796249A (en) * 2019-09-29 2020-02-14 中山大学孙逸仙纪念医院 In-ear endoscope image neural network model construction method based on deep learning and intelligent classification processing method
CN112990237A (en) * 2019-12-02 2021-06-18 上海交通大学 Subway tunnel image leakage detection method based on deep learning
CN111639527A (en) * 2020-04-23 2020-09-08 平安国际智慧城市科技股份有限公司 English handwritten text recognition method and device, electronic equipment and storage medium
CN111738922A (en) * 2020-06-19 2020-10-02 新希望六和股份有限公司 Method and device for training density network model, computer equipment and storage medium
CN112215753A (en) * 2020-10-23 2021-01-12 成都理工大学 Image demosaicing enhancement method based on double-branch edge fidelity network
CN112487862A (en) * 2020-10-28 2021-03-12 南京云牛智能科技有限公司 Garage pedestrian detection method based on improved EfficientDet model
CN112597902A (en) * 2020-12-24 2021-04-02 上海核工程研究设计院有限公司 Small target intelligent identification method based on nuclear power safety
CN114022785A (en) * 2021-11-15 2022-02-08 中国华能集团清洁能源技术研究院有限公司 Remote sensing image semantic segmentation method, system, equipment and storage medium
CN114419659A (en) * 2021-12-13 2022-04-29 中南大学 Method for detecting wearing of safety helmet in complex scene
CN114549997A (en) * 2022-04-27 2022-05-27 清华大学 X-ray image defect detection method and device based on regional feature extraction
CN114581860A (en) * 2022-05-09 2022-06-03 武汉纺织大学 Helmet detection algorithm based on improved YOLOv5 model
CN115205604A (en) * 2022-08-11 2022-10-18 淮阴工学院 Improved YOLOv 5-based method for detecting wearing of safety protection product in chemical production process

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Shudong et al., "Small target detection in UAV aerial images based on inverted residual attention", Journal of Beijing University of Aeronautics and Astronautics *
Wang Xinyu; Wang Qian; Cheng Duncheng; Wu Fuqing: "Cotter pin defect detection for catenary positioning tubes based on a three-level cascade architecture" *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116612087A (en) * 2023-05-22 2023-08-18 山东省人工智能研究院 Coronary artery CTA stenosis detection method based on YOLOv5-LA
CN116612087B (en) * 2023-05-22 2024-02-23 山东省人工智能研究院 Coronary artery CTA stenosis detection method based on YOLOv5-LA
CN116343383A (en) * 2023-05-30 2023-06-27 四川三思德科技有限公司 Campus access management method and system based on Internet of things
CN116580027A (en) * 2023-07-12 2023-08-11 中国科学技术大学 Real-time polyp detection system and method for colorectal endoscope video
CN116580027B (en) * 2023-07-12 2023-11-28 中国科学技术大学 Real-time polyp detection system and method for colorectal endoscope video
CN116824551A (en) * 2023-08-30 2023-09-29 山东易图信息技术有限公司 Light parking space state detection method based on visual attention
CN116958883A (en) * 2023-09-15 2023-10-27 四川泓宝润业工程技术有限公司 Safety helmet detection method, system, storage medium and electronic equipment
CN116958883B (en) * 2023-09-15 2023-12-29 四川泓宝润业工程技术有限公司 Safety helmet detection method, system, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN115546614B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN115546614B (en) Safety helmet wearing detection method based on improved YOLOV5 model
CN112084866A (en) Target detection method based on improved YOLO v4 algorithm
CN105740775A (en) Three-dimensional face living body recognition method and device
CN102402784B (en) Human face image super-resolution method based on nearest feature line manifold learning
CN112800937A (en) Intelligent face recognition method
CN106155299B (en) A kind of pair of smart machine carries out the method and device of gesture control
CN111476188B (en) Crowd counting method, system, medium and electronic equipment based on feature pyramid
CN111488827A (en) Crowd counting method and system based on multi-scale feature information
CN111209811A (en) Method and system for detecting eyeball attention position in real time
CN106599806A (en) Local curved-surface geometric feature-based human body action recognition method
CN107464245A (en) A kind of localization method and device at picture structure edge
CN114708566A (en) Improved YOLOv 4-based automatic driving target detection method
CN109670516A (en) A kind of image characteristic extracting method, device, equipment and readable storage medium storing program for executing
CN111612895A (en) Leaf-shielding-resistant CIM real-time imaging method for detecting abnormal parking of shared bicycle
CN115719445A (en) Seafood identification method based on deep learning and raspberry type 4B module
Tao et al. F-PVNet: Frustum-level 3-D object detection on point–voxel feature representation for autonomous driving
CN111027440A (en) Crowd abnormal behavior detection device and method based on neural network
CN111767919B (en) Multilayer bidirectional feature extraction and fusion target detection method
CN111274936B (en) Multispectral image ground object classification method, system, medium and terminal
CN117079125A (en) Kiwi fruit pollination flower identification method based on improved YOLOv5
CN117351409A (en) Intelligent concrete dam face operation risk identification method
CN102682291A (en) Scene person counting method, device and system
CN116385425A (en) YOLOv5 fabric defect detection method for improving CA attention mechanism
Yao et al. Substation object detection based on enhance RCNN model
CN115471901A (en) Multi-pose face frontization method and system based on generation of confrontation network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant