CN115546614B - Safety helmet wearing detection method based on improved YOLOV5 model - Google Patents
- Publication number
- CN115546614B CN115546614B CN202211534970.0A CN202211534970A CN115546614B CN 115546614 B CN115546614 B CN 115546614B CN 202211534970 A CN202211534970 A CN 202211534970A CN 115546614 B CN115546614 B CN 115546614B
- Authority
- CN
- China
- Prior art keywords
- feature
- feature map
- module
- image
- convolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4038—Image mosaicing, e.g. composing plane images from plane sub-images
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a safety helmet wearing detection method based on an improved YOLOV5 model, comprising the following steps: randomly selecting images from a helmet wearing image dataset and applying image data enhancement to obtain data-enhanced images; and inputting the data-enhanced images into the improved YOLOV5 model for training to obtain a trained improved YOLOV5 model. The improved YOLOV5 model: embeds an inverted residual module and an inverted residual attention module in the feature extraction part to extract image features; designs a multi-scale feature fusion module in the feature fusion part for feature fusion, generating four detection heads with different receptive fields; and optimizes the prediction-box regression loss function. Finally, the image to be detected is input into the trained improved YOLOV5 model to obtain a detection result of whether the persons concerned wear safety helmets. The invention effectively resolves missed and false detections of small targets in construction-site video surveillance images and improves helmet wearing detection accuracy.
Description
Technical Field
The invention relates to the technical field of image processing, and in particular to a safety helmet wearing detection method based on an improved YOLOV5 model.
Background
At present, the construction industry in China is still developing continuously, and the number of construction workers grows every year. In construction-site safety management, the safety helmet is protective equipment that effectively prevents head-injury accidents: it absorbs the impact of falling objects on a worker's head, avoiding or reducing injury, and is personal protective equipment that must be worn in production and construction activities as required by work-safety law. Wearing safety helmets correctly on construction sites effectively reduces the casualty rate in production accidents and is of great significance for ensuring safe production.
At present, most construction sites rely on manual supervision to determine whether workers are wearing safety helmets. This approach wastes manpower and material resources, and its supervision effect is poor: the working range is large, and manual monitoring is prone to fatigue.
In recent years, with the continuous development of object detection technology, helmet detection research has achieved certain results. Compared with traditional manual inspection, which is time-consuming and labor-intensive, machine-vision-based methods offer a high degree of automation and are easy to extend, making them an urgent practical need.
However, existing detection methods based on traditional machine learning mainly recognize the shape and color features of the safety helmet; for example, skin-color detection is used to locate the face, and a support vector machine is then used to detect the helmet. Although such traditional machine-learning helmet detection algorithms are fast, they require hand-designed features and trained classifiers for each specific detection object; owing to their poor generalization ability and reliance on a single type of feature, they cannot detect targets effectively in complex construction environments, small targets are easily missed or falsely detected, and helmet wearing detection accuracy in complex environments is low.
Therefore, how to avoid missed and false detections of small targets in helmet wearing detection in complex environments and improve helmet wearing detection accuracy is a problem that urgently needs to be solved by those skilled in the art.
Disclosure of Invention
In view of this, the invention provides a safety helmet wearing detection method based on an improved YOLOV5 model, which solves at least some of the above technical problems. The method embeds an inverted residual module and an inverted residual attention module in the feature extraction part of the YOLOV5 model to obtain rich small-target spatial information and deep semantic information, improving small-target detection accuracy; a multi-scale feature fusion module is designed in the feature fusion part, improving the model's ability to recognize small-size targets and reducing missed detections of small targets. The method can effectively detect helmet wearing in complex environments, avoids missed and false detections of small targets, and improves helmet wearing detection accuracy in complex environments.
To achieve the above purpose, the invention adopts the following technical scheme:
the embodiment of the invention provides a safety helmet wearing detection method based on an improved YOLOV5 model, which comprises the following steps:
S1, acquiring a helmet wearing image dataset, and randomly selecting N images from it to perform image data enhancement, obtaining data-enhanced images;
S2, inputting the data-enhanced images into an improved YOLOV5 model for training to obtain a trained improved YOLOV5 model; the improved YOLOV5 model comprises: embedding an inverted residual module and an inverted residual attention module in the feature extraction part to extract image features; designing a multi-scale feature fusion module in the feature fusion part for feature fusion, and generating four detection heads with different receptive fields; and optimizing the prediction-box regression loss function;
S3, inputting the image to be detected into the trained improved YOLOV5 model to obtain a detection result of whether the persons concerned wear safety helmets.
Further, in step S1, N images are randomly selected from the helmet wearing image dataset to perform image data enhancement, where the image data enhancement includes:
turning, scaling and color gamut transformation are carried out on the image;
and randomly cutting the image after the turning, the scaling and the color gamut conversion according to a preset template and splicing.
Further, scaling the images specifically includes: randomly selecting N images from the helmet wearing image dataset and, using the image width and height as boundary values, scaling each image by factors $t_x$ and $t_y$:

$$t_x = f_r(t_w,\ t_w + \Delta t_w)$$

$$t_y = f_r(t_h,\ t_h + \Delta t_h)$$

where $t_w$ and $t_h$ are the minimum width and height scaling factors, $\Delta t_w$ and $\Delta t_h$ are the lengths of the random intervals for the width and height scaling factors, respectively, and $f_r$ denotes a random-value function.
Further, splicing the scaled images after randomly cutting them according to a preset template specifically comprises:
determining an image template of height h and width w as the output image size; randomly generating four dividing lines in the width and height directions for cutting; splicing the nine cut images and cropping away the portions that overflow the frame; cutting the internally overlapping parts a second time to obtain the spliced image; and using this image as the input-layer data of the YOLOV5 convolutional neural network.
Further, in step S2, the image features are extracted by the inverted residual module and the inverted residual attention module embedded in the feature extraction part, specifically comprising the following steps:
a. inputting the data-enhanced image into the feature extraction module and convolving it through a first Focus layer. Specifically, a value is taken at every other pixel of each picture, similar to downsampling, dividing the image into four pictures that are similar but lose no information; this operation concentrates the information into the channel space and expands the input channels by a factor of 4, so the spliced picture has 12 channels. The picture is then convolved to obtain a feature map; after the Focus convolution and a 3×3 convolution layer, feature map Feature_C0 is obtained;
b. inputting feature map Feature_C0 into the first inverted residual module, which amplifies shallow features by channel expansion: the input features are channel-expanded, a linear transformation maps the high-dimensional features to low dimensions to capture rich shallow information, convolution extracts features, and residual connections enable repeated feature learning, outputting feature map Feature_C1;
c. passing feature map Feature_C1 through one convolution layer and a second inverted residual module to obtain feature map Feature_C2, then through another convolution layer into the first inverted residual attention module to obtain feature map Feature_C3; after a convolution with 3×3 kernel and spatial pyramid pooling, Feature_C3 enters the second inverted residual attention module, yielding feature map Feature_C4, which serves as input to the multi-scale feature fusion module.
Further, in step S2, designing a multi-scale feature fusion module in the feature fusion part to perform feature fusion and generate four detection heads with different receptive fields specifically comprises the following steps:
1) Convolving feature map Feature_C4 with a 3×3 kernel and 512 channels to obtain feature map Feature_d1, then upsampling to obtain feature map Feature_Up1;
2) Concatenating feature map Feature_Up1 with feature map Feature_C3 from the feature extraction module to obtain feature map Feature_Fuse1; passing it through a C3 module and a 3×3 convolution with 256 channels to obtain feature map Feature_d2, then upsampling to obtain feature map Feature_Up2;
3) Concatenating feature map Feature_Up2 with feature map Feature_C2 to obtain feature map Feature_Fuse2, then obtaining feature map Feature_d3 through a C3 module and convolution, and upsampling to obtain feature map Feature_Up3;
4) Concatenating feature map Feature_Up3 with feature map Feature_C1 to obtain feature map Feature_Fuse3, then extracting features through a C3 module and a 1×1 convolution to obtain feature map F4, whose feature size is 1/4 of the original image and which is used for detecting the smallest targets;
5) Convolving feature map Feature_Fuse3 through a C3 module and a 3×3 kernel and concatenating with feature map Feature_d3 to obtain feature map Feature_Fuse4, then extracting features through a C3 module and a 1×1 convolution to obtain feature map F3, whose feature size is 1/8 of the original image and which is used for detecting small targets;
6) Convolving feature map Feature_Fuse4 through a C3 module and a 3×3 kernel and concatenating with feature map Feature_d2 to obtain feature map Feature_Fuse5, then extracting features through a C3 module and a 1×1 convolution layer to obtain feature map F2, whose feature size is 1/16 of the original image and which is used for detecting medium targets;
7) Convolving feature map Feature_Fuse5 through a C3 module and a 3×3 kernel and concatenating with feature map Feature_d1 to obtain feature map Feature_Fuse6, then extracting features through a C3 module and a 1×1 convolution layer to obtain feature map F1, whose feature size is 1/32 of the original image and which is used for detecting large targets.
Further, in the optimized prediction-box regression loss function, the CIoU loss is adopted as the prediction-box regression loss function $L_{CIoU}$ of the improved YOLOV5 model algorithm, defined as:

$$L_{CIoU} = 1 - CIoU = 1 - IoU + \frac{\rho^2(b,\ b^{gt})}{c^2} + \alpha v$$

where $IoU$ denotes the intersection-over-union of the prediction box and the ground-truth box; $b$ and $b^{gt}$ denote the center points of the prediction box and the ground-truth box; $\rho^2(b, b^{gt})$ denotes their squared Euclidean distance; $c$ denotes the diagonal length of the smallest enclosing rectangle that contains both boxes; $\alpha$ is a balancing parameter; and $v$ measures the consistency of the aspect ratios:

$$\alpha = \frac{v}{(1 - IoU) + v}, \qquad v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$$

where $w$ and $h$ are the width and height of the prediction box, and $w^{gt}$ and $h^{gt}$ are the width and height of the ground-truth box.
Compared with the prior art, the invention has the following beneficial effects:
1. In the helmet wearing detection method based on the improved YOLOV5 model, the inverted residual module and the inverted residual attention module embedded in the feature extraction part make it easy to obtain rich small-target spatial information and deep semantic information, improving detection accuracy for small and medium targets in helmet wearing detection;
2. In the safety helmet wearing detection method based on the improved YOLOV5 model, the multi-scale feature fusion module designed in the feature fusion part performs feature fusion and generates four detection heads with different receptive fields, improving the model's ability to recognize small-size targets and reducing missed detections of small targets in helmet wearing detection;
3. The helmet wearing detection method based on the improved YOLOV5 model provided by the embodiment of the invention designs a mosaic mixed data enhancement method, which establishes linear relations between data, increases the background complexity of images, and improves the robustness of the algorithm, enabling effective helmet wearing detection in complex environments.
Drawings
Fig. 1 is a flowchart of a method for detecting wearing of a safety helmet based on an improved YOLOV5 model according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of an improved YOLOV5 convolutional neural network provided in an embodiment of the present invention.
Fig. 3 is a schematic model diagram of an inverted residual error module according to an embodiment of the present invention.
Fig. 4 is a model diagram of an inverted residual attention module according to an embodiment of the present invention.
Fig. 5 is a graph of the mean average precision (mAP@0.5) of the improved YOLOV5 model provided by the embodiment of the present invention after 100 training epochs.
Fig. 6 is a comparison graph of the results of the detection of the wearing of the helmet before and after the improvement of the YOLOV5 model provided by the embodiment of the present invention.
Fig. 7 is another comparison graph of the results of the detection of the wearing of the helmet before and after the improvement of the YOLOV5 model provided by the embodiment of the present invention.
Detailed Description
To make the technical means, creative features, objectives and effects of the invention easy to understand, the invention is further described below in conjunction with specific embodiments.
In the description of the present invention, it should be noted that the terms "upper", "lower", "left", "right", "front", "rear", "both ends", "one end", "the other end", and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of describing the present invention and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be configured in a specific orientation, and operate, and thus, should not be construed as limiting the present invention. Furthermore, the ordinal numbers "(1)", "(2)", etc. are for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiment of the invention applies a convolutional-neural-network-based object detection algorithm to helmet wearing detection, optimizing and improving the YOLOV5 model in combination with the task characteristics of helmet wearing detection, thereby realizing a more accurate and intelligent detection scheme.
Referring to fig. 1, the invention provides a method for detecting wearing of a safety helmet based on an improved YOLOV5 model, which comprises the following steps:
S1, acquiring a helmet wearing image dataset, and randomly selecting N images from it to perform image data enhancement, obtaining data-enhanced images;
S2, inputting the data-enhanced images into an improved YOLOV5 model for training to obtain a trained improved YOLOV5 model; the improved YOLOV5 model comprises: embedding an inverted residual module and an inverted residual attention module in the feature extraction part to extract image features; designing a multi-scale feature fusion module in the feature fusion part for feature fusion, and generating four detection heads with different receptive fields; and optimizing the prediction-box regression loss function;
S3, inputting the image to be detected into the trained improved YOLOV5 model to obtain a detection result of whether the persons concerned wear safety helmets.
The above steps are described in detail below:
In step S1, a helmet wearing image dataset is first obtained and data enhancement is applied to it. The image data enhancement includes: flipping, scaling and color-gamut transformation of the images, followed by random cutting according to a preset template and splicing. This embodiment adopts an improved mosaic data enhancement method, specifically: 9 images are randomly selected from the helmet wearing image dataset and, using the image width and height as boundary values, each image is scaled by factors $t_x$ and $t_y$:

$$t_x = f_r(t_w,\ t_w + \Delta t_w)$$

$$t_y = f_r(t_h,\ t_h + \Delta t_h)$$

where $t_w$ and $t_h$ are the minimum width and height scaling factors, $\Delta t_w$ and $\Delta t_h$ are the lengths of the random intervals for the width and height scaling factors, respectively, and $f_r$ denotes a random-value function.

Further, an image template of height h and width w is determined as the output image size; four dividing lines are randomly generated in the width and height directions; the nine cut images are spliced together and the portions overflowing the frame are cropped away; the internally overlapping parts are cut a second time, and the spliced image obtained after cutting is used as the input-layer data of the YOLOV5 model's convolutional neural network.
In this embodiment, the improved mosaic data enhancement method uses 9 images instead of the four used in standard mosaic random splicing: each image occupies a corresponding frame, and the nine images are combined after random cutting and random splicing, achieving balance among targets of different scales.
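To make the splicing step concrete, the following is a minimal Python sketch of the improved 9-image mosaic described above (bounding-box labels omitted for brevity); the helper names `random_scale` and `mosaic_9` and the interval parameters are illustrative assumptions, not the patent's reference implementation:

```python
import random

import cv2
import numpy as np

def random_scale(img, t_w=0.4, t_h=0.4, dt_w=0.6, dt_h=0.6):
    """Scale by t_x = f_r(t_w, t_w + dt_w) and t_y = f_r(t_h, t_h + dt_h)."""
    t_x = random.uniform(t_w, t_w + dt_w)  # f_r: random value in the interval
    t_y = random.uniform(t_h, t_h + dt_h)
    h, w = img.shape[:2]
    return cv2.resize(img, (max(1, int(w * t_x)), max(1, int(h * t_y))))

def mosaic_9(images, out_h=640, out_w=640):
    """Splice nine randomly scaled images onto an out_h x out_w template.

    Four dividing lines (two per axis) give a 3x3 grid of frames; anything
    overflowing a frame is cropped, mimicking the secondary overlap cutting.
    """
    assert len(images) == 9
    xs = sorted(random.sample(range(out_w // 6, 5 * out_w // 6), 2))
    ys = sorted(random.sample(range(out_h // 6, 5 * out_h // 6), 2))
    x_edges, y_edges = [0] + xs + [out_w], [0] + ys + [out_h]
    canvas = np.full((out_h, out_w, 3), 114, dtype=np.uint8)  # gray fill
    k = 0
    for r in range(3):
        for c in range(3):
            cell_h = y_edges[r + 1] - y_edges[r]
            cell_w = x_edges[c + 1] - x_edges[c]
            patch = random_scale(images[k])[:cell_h, :cell_w]  # crop overflow
            canvas[y_edges[r]:y_edges[r] + patch.shape[0],
                   x_edges[c]:x_edges[c] + patch.shape[1]] = patch
            k += 1
    return canvas
```

In a full pipeline, the box labels of each image would be scaled, shifted and clipped by the same cell geometry.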
In step S2, the data-enhanced images are input into the improved YOLOV5 model for training to obtain the trained improved YOLOV5 model. In the embodiment of the invention, the improved YOLOV5 model comprises: 1. an inverted residual module (IRC3) and an inverted residual attention module (IRAC3) embedded in the feature extraction part to extract image features; 2. a multi-scale feature fusion module designed in the feature fusion part for feature fusion, generating four detection heads with different receptive fields; 3. an optimized prediction-box regression loss function.
(1) A feature extraction section:
the main purpose of feature extraction is to learn the mapping relationship between high-resolution images and low-resolution images using a convolutional neural network. As shown in fig. 2, in the embodiment of the present invention, the feature extraction module mainly includes a slice convolution layer (Focus Conv), a convolution layer (Conv), an Inverted residual C3 (IRC 3) module, an Inverted residual Attention C3 (IRAC 3) module, and a feature pyramid module (SPP); wherein, the structures of IRC3 and IRAC3 are respectively shown in FIG. 3 and FIG. 4; wherein Conv represents convolution, H represents the height of the feature map, W represents the width of the feature map, C represents the number of channels of the feature map, 2C represents the number of channels obtained after twice expansion,. Alpha.represents fusion operation,represents the stitching operation, DWConv (Depthwise Convolution) represents Depth Convolution, PWConv (Pointwise Convolution) represents point-by-point Convolution, ECA-Net (Efficient Channel Attention) represents valid Channel Attention, and SD (Stochartic Depth, SD) represents random Depth.
Referring to fig. 2, the specific process of feature extraction is as follows:
s1, an input image is subjected to a first layer of slice convolution (Focus Conv), specifically, every other pixel in each picture takes one value, the method is similar to down-sampling, the four pictures are divided into four pictures in the mode, the four pictures are similar but have no information loss, information is concentrated to a channel space through the operation, an input channel is expanded by 4 times, namely, a channel for splicing the pictures is 12 channels, and then the pictures are subjected to convolution operation to obtain a Feature map Feature _ C0, so that the parameter number is reduced and the training speed is improved;
s2, after passing through a convolutional layer (Conv), entering an inverted residual C3 module (IRC 3), wherein in the module, firstly, expansion of the number of channels is realized on an input Feature map by utilizing an expansion factor, then, high-dimensional channels are mapped to low-dimensional channels by utilizing linear change to obtain rich shallow features, and identity mapping is combined with the input features through residual operation to obtain a Feature map Feature _ C1;
s3, the Feature map Feature _ C1 passes through a convolution layer and an inverted residual error C3 module to obtain a Feature map Feature _ C2, and then passes through the convolution layer and enters an inverted residual error attention C3 (IRAC 3) module;
in the embodiment of the invention, the inverse residual attention C3 module comprises a depth separable convolution and effective channel attention module, and firstly passes through a depth convolution module which is designed to replace a standard convolution of 3 multiplied by 3 by using a depth convolution with less parameters and low calculation complexity. As shown in the following formula:
whereinRepresents a deep convolution kernel, i, j represents the convolution kernel size, k, l represents the feature map size, and->The c-th convolution kernel in (a) is applied to the c-th channel, in the feature multiplied therewith, in->For the c-th channel of the feature map, the features of the deep convolution outputs are calculated by a 1 × 1 convolution and combined linearly, on the basis of the results of the linear combination>Represents the Feature after 3x3 convolution of the Feature map Feature _ C2.
Further, after the depthwise separable convolution, the features pass through the efficient channel attention module, which captures local cross-channel interaction information using each channel and its k neighboring channels; finally, after a pointwise convolution with 1×1 kernel restores the number of channels to the original, feature map Feature_C3 is obtained;
further, the obtained Feature _ C3 enters a second IRAC3 module in the Feature extraction module after being convolved by a convolution kernel with a size of 3 × 3 and further being subjected to Spatial Pyramid Pooling (SPP), so as to obtain a Feature _ C4.
(2) A feature fusion part:
referring to fig. 2, a specific process of feature fusion is as follows:
s1, a final Feature map Feature _ C4 of a Feature extraction part enters a multi-scale Feature fusion module, a Feature map Feature _ d1 is obtained through convolution with a convolution kernel size of 3 multiplied by 3 and a channel number of 512 (convolution is represented by Conv in the figure and the same principle is applied below), and a Feature map Feature _ Up1 is obtained after upsampling operation (represented by 'Upesple' in figure 2 and the same principle is applied below);
s2, carrying out cascade operation on the Feature map Feature _ Up1 and the Feature _ C3 of the Feature extraction module to obtain a Feature map Feature _ Fuse1; obtaining a Feature map Feature _ d2 through convolution with a convolution kernel of 3 multiplied by 3 and a channel number of 256 by a C3 module, and obtaining a Feature map Feature _ Up2 through an Up-sampling operation;
s3, performing cascade operation on the Feature map Feature _ Up2 and the Feature map Feature _ C2 to obtain a Feature map Feature _ Fuse2, then obtaining a Feature map Feature _ d3 through a C3 module and convolution, and obtaining a Feature map Feature _ Up3 through an Up-sampling operation;
s4, performing cascade operation on the obtained Feature map Feature _ Up3 and Feature _ C1 in a backbone network (a Feature extraction module) to obtain a Feature map Feature _ Fuse3, and then performing Feature extraction through a C3 module and a convolution layer with the convolution kernel size of 1 multiplied by 1 to obtain a Feature map F4, wherein the Feature size is 1/4 of the original image and is used for detecting a minimum target;
s5, convolving the Feature map Feature _ Fuse3 with a convolution kernel of 3 multiplied by 3 through a C3 module and cascading with the Feature map Feature _ d3 to obtain a Feature map Feature _ Fuse4, and then performing Feature extraction through the convolution of the convolution kernel of 1 multiplied by 1 through the C3 module to obtain a Feature map F3, wherein the Feature size is 1/8 of that of an original image and the Feature map F3 is used for detecting small targets;
s6, convolving the Feature map Feature _ Fuse4 with a convolution kernel of 3 multiplied by 3 size through a C3 module and cascading with the Feature map Feature _ d2 to obtain a Feature map Feature _ Fuse5, and then performing Feature extraction through the convolution layer of 1 multiplied by 1 size through the C3 module and the convolution kernel to obtain a Feature map F2, wherein the Feature size is 1/16 of the original image and the Feature map Feature _ Fuse is used for detecting a medium target;
and S7, cascading the Feature map Feature _ Fuse5 with the Feature map Feature _ d1 through convolution of a C3 module and a convolution kernel with the size of 3 multiplied by 3 to obtain a Feature map Feature _ Fuse6, and then performing Feature extraction through the C3 module and a convolution layer with the convolution kernel size of 1 multiplied by 1 to obtain a Feature map F1, wherein the Feature size is 1/32 of the original image and the Feature map F1 is used for detecting a large target.
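For a concrete sense of the four receptive fields, the short sketch below prints the prediction-grid sizes of heads F4–F1 under the assumption of a 640×640 input:

```python
# grid sizes of the four detection heads, assuming a 640x640 input image
input_size = 640
heads = {
    "F4 (smallest targets)": 4,   # feature size 1/4 of the original image
    "F3 (small targets)": 8,      # 1/8
    "F2 (medium targets)": 16,    # 1/16
    "F1 (large targets)": 32,     # 1/32
}
for name, stride in heads.items():
    g = input_size // stride
    print(f"{name}: stride {stride}, grid {g}x{g} ({g * g} cells)")
# the extra 160x160 grid of F4 is what gives the fourth head its
# fine-grained coverage of very small helmet targets
```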
(3) Optimizing a prediction box regression loss function:
In the embodiment of the invention, the CIoU loss function is adopted as the prediction-box regression loss function $L_{CIoU}$ of the improved YOLOV5 model algorithm, defined as:

$$L_{CIoU} = 1 - CIoU = 1 - IoU + \frac{\rho^2(b,\ b^{gt})}{c^2} + \alpha v$$

where $IoU$ denotes the intersection-over-union of the prediction box and the ground-truth box; $b$ and $b^{gt}$ denote the center points of the prediction box and the ground-truth box; $\rho^2(b, b^{gt})$ denotes their squared Euclidean distance; $c$ denotes the diagonal length of the smallest enclosing rectangle that contains both boxes; $\alpha$ is a balancing parameter; and $v$ measures the consistency of the aspect ratios:

$$\alpha = \frac{v}{(1 - IoU) + v}, \qquad v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$$

where $w$ and $h$ are the width and height of the prediction box, and $w^{gt}$ and $h^{gt}$ are the width and height of the ground-truth box.
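A minimal PyTorch sketch of the CIoU loss as defined above, for axis-aligned boxes in (x1, y1, x2, y2) format; this follows the standard CIoU formulation rather than any YOLOv5-internal API:

```python
import math

import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss for boxes in (x1, y1, x2, y2) format; returns 1 - CIoU."""
    # intersection and union
    ix1 = torch.max(pred[..., 0], target[..., 0])
    iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2])
    iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    w1, h1 = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w2, h2 = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union
    # squared center distance rho^2 and enclosing-box diagonal c^2
    rho2 = ((pred[..., 0] + pred[..., 2] - target[..., 0] - target[..., 2]) ** 2 +
            (pred[..., 1] + pred[..., 3] - target[..., 1] - target[..., 3]) ** 2) / 4
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # aspect-ratio consistency v and balance parameter alpha
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - (iou - rho2 / c2 - alpha * v)  # L_CIoU = 1 - CIoU
```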
Further, the training parameters are set as follows: batch size 16, 100 iterations, initial learning rate 0.01, termination learning rate 0.2, momentum 0.937, and weight decay 0.0005, with a stochastic gradient descent strategy used for stochastic decay;
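As a sketch, these hyperparameters map onto a standard PyTorch SGD setup roughly as follows; the stand-in model is a placeholder, and treating the termination learning rate 0.2 as a final learning-rate factor (YOLOv5's lr0/lrf convention) is an assumption:

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)  # placeholder for the improved YOLOV5 network

epochs, batch_size = 100, 16
lr0, lrf = 0.01, 0.2               # initial lr and assumed final lr factor
optimizer = torch.optim.SGD(model.parameters(), lr=lr0,
                            momentum=0.937, weight_decay=0.0005)
# linear decay from lr0 toward lr0 * lrf over the 100 epochs
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda e: (1 - e / epochs) * (1 - lrf) + lrf)
```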
In the embodiment of the invention, rotation and horizontal mirroring are used to add helmet images at different angles and are combined with the improved mosaic method to improve object recognition; training is performed with the improved convolutional neural network and the optimized loss function, and on completion the final improved YOLOV5 convolutional neural network is obtained.
Further, in the embodiment of the present invention, the mean average precision (mAP@0.5) is used as the index for measuring model performance; fig. 5 shows the mean average precision achieved after 100 training epochs with the improved YOLOv5s model.
Further, two images of helmet wearing in a complex environment were input into the trained improved YOLOV5 model to obtain detection results of whether the persons concerned wear safety helmets, and the results were compared with those of the unimproved YOLOV5 model, as shown in fig. 6 and fig. 7. Figs. 6a and 7a show the helmet wearing detection results of the unimproved YOLOV5 model, and figs. 6c and 7c their partial enlargements; figs. 6b and 7b show the helmet wearing detection results of the improved YOLOV5 model, and figs. 6d and 7d their partial enlargements. The left-right comparison in fig. 6 shows that the unimproved YOLOV5 model misses helmet wearing among small targets (three wearing targets are detected in fig. 6c versus four in fig. 6d); fig. 7 likewise shows missed small targets (two helmet wearing targets in fig. 7c versus three in fig. 7d). Figs. 6 and 7 show that the method of the invention effectively resolves missed detection of helmet wearing among dense targets, detects helmet wearing on targets effectively even in complex scenes, and avoids missing small targets.
From the description of the embodiment, those skilled in the art can see that the invention provides a helmet wearing detection method based on an improved YOLOV5 model. First, an improved mosaic data enhancement method is designed, enriching the diversity of image samples, establishing linear relations between data, and improving the robustness of the algorithm. Second, addressing the low detection accuracy for small targets, the model backbone is optimized: an inverted residual module and an inverted residual attention module are embedded in the backbone, and low-to-high-dimensional feature-information mapping is used to obtain rich small-target spatial information and deep semantic information, improving small-target detection accuracy. Finally, a multi-scale feature fusion module designed in the feature fusion part fuses shallow spatial information with deep semantic information and generates four detection heads with different receptive fields, improving the model's ability to recognize small-size targets and reducing missed detections of small targets; this effectively resolves missed and false detections of small targets in construction-site video surveillance images and improves helmet wearing detection accuracy.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (4)
1. A safety helmet wearing detection method based on an improved YOLOV5 model is characterized by comprising the following steps:
S1, acquiring a helmet wearing image dataset, and randomly selecting N images from it to perform image data enhancement, obtaining data-enhanced images;
S2, inputting the data-enhanced images into an improved YOLOV5 model for training to obtain a trained improved YOLOV5 model; the improved YOLOV5 model comprises: embedding an inverted residual module and an inverted residual attention module in the feature extraction part to extract image features; designing a multi-scale feature fusion module in the feature fusion part for feature fusion, and generating four detection heads with different receptive fields; and optimizing the prediction-box regression loss function;
S3, inputting the image to be detected into the trained improved YOLOV5 model to obtain a detection result of whether the persons concerned wear safety helmets;
in step S1, N images are randomly selected from the helmet wearing image dataset for image data enhancement, where the image data enhancement comprises:
flipping, scaling and color-gamut transformation of the images;
random cutting of the flipped, scaled and color-gamut-transformed images according to a preset template, followed by splicing;
the scaling of the images specifically comprises: randomly selecting N images from the helmet wearing image dataset and, using the image width and height as boundary values, scaling each image by factors $t_x$ and $t_y$:

$$t_x = f_r(t_w,\ t_w + \Delta t_w)$$

$$t_y = f_r(t_h,\ t_h + \Delta t_h)$$

where $t_w$ and $t_h$ are the minimum width and height scaling factors, $\Delta t_w$ and $\Delta t_h$ are the lengths of the random intervals for the width and height scaling factors, respectively, and $f_r$ denotes a random-value function;
the splicing of the scaled images after random cutting according to a preset template specifically comprises: determining an image template of height h and width w as the output image size; randomly generating four dividing lines in the width and height directions for cutting; splicing the nine cut images and cropping away the portions that overflow the frame; and cutting the internally overlapping parts a second time to obtain the spliced image.
2. The method for detecting wearing of a safety helmet based on the improved YOLOV5 model as claimed in claim 1, wherein in step S2 the image features are extracted by the inverted residual module and the inverted residual attention module embedded in the feature extraction part, specifically comprising the following steps:
a. obtaining feature map Feature_C0 from the data-enhanced image by means of a Focus convolution and a 3×3 convolution;
b. inputting feature map Feature_C0 into the first inverted residual module, amplifying shallow features by channel expansion, extracting features by convolution, and repeatedly learning features through residual connections, outputting feature map Feature_C1;
c. passing feature map Feature_C1 through one convolution layer and a second inverted residual module to obtain feature map Feature_C2; inputting Feature_C2 through another convolution layer into the first inverted residual attention module to obtain feature map Feature_C3; and, after a convolution with 3×3 kernel and spatial pyramid pooling, passing Feature_C3 into the second inverted residual attention module to obtain feature map Feature_C4, which serves as the input of the multi-scale feature fusion module.
3. The method for detecting wearing of a safety helmet based on the improved YOLOV5 model as claimed in claim 2, wherein in step S2 the multi-scale feature fusion module is designed in the feature fusion part for feature fusion and four detection heads with different receptive fields are generated, specifically comprising the following steps:
1) convolving feature map Feature_C4 to obtain feature map Feature_d1, then upsampling to obtain feature map Feature_Up1;
2) concatenating feature map Feature_Up1 with feature map Feature_C3 to obtain feature map Feature_Fuse1; convolving through a C3 module to obtain feature map Feature_d2, then upsampling to obtain feature map Feature_Up2;
3) concatenating feature map Feature_Up2 with feature map Feature_C2 to obtain feature map Feature_Fuse2; obtaining feature map Feature_d3 through a C3 module and convolution, then upsampling to obtain feature map Feature_Up3;
4) concatenating feature map Feature_Up3 with feature map Feature_C1 to obtain feature map Feature_Fuse3, then extracting features through a C3 module and a 1×1 convolution to obtain feature map F4, whose feature size is 1/4 of the original image;
5) convolving feature map Feature_Fuse3 through a C3 module and a 3×3 kernel and concatenating with feature map Feature_d3 to obtain feature map Feature_Fuse4, then extracting features through a C3 module and a 1×1 convolution to obtain feature map F3, whose feature size is 1/8 of the original image;
6) convolving feature map Feature_Fuse4 through a C3 module and a 3×3 kernel and concatenating with feature map Feature_d2 to obtain feature map Feature_Fuse5, then extracting features through a C3 module and a 1×1 convolution layer to obtain feature map F2, whose feature size is 1/16 of the original image;
7) convolving feature map Feature_Fuse5 through a C3 module and a 3×3 kernel and concatenating with feature map Feature_d1 to obtain feature map Feature_Fuse6, then extracting features through a C3 module and a 1×1 convolution layer to obtain feature map F1, whose feature size is 1/32 of the original image.
4. The improved YOLOV5 model-based helmet wearing detection method according to claim 1, wherein in the optimized prediction-box regression loss function the CIoU loss is adopted as the prediction-box regression loss function $L_{CIoU}$ of the improved YOLOV5 model algorithm, defined as:

$$L_{CIoU} = 1 - CIoU = 1 - IoU + \frac{\rho^2(b,\ b^{gt})}{c^2} + \alpha v$$

where $IoU$ denotes the intersection-over-union of the prediction box and the ground-truth box; $b$ and $b^{gt}$ denote the center points of the prediction box and the ground-truth box; $\rho^2(b, b^{gt})$ denotes their squared Euclidean distance; $c$ denotes the diagonal length of the smallest enclosing rectangle containing both boxes; $\alpha$ is a balancing parameter; and $v$ measures the consistency of the aspect ratios:

$$\alpha = \frac{v}{(1 - IoU) + v}, \qquad v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$$

where $w$ and $h$ are the width and height of the prediction box, and $w^{gt}$ and $h^{gt}$ are the width and height of the ground-truth box.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211534970.0A CN115546614B (en) | 2022-12-02 | 2022-12-02 | Safety helmet wearing detection method based on improved YOLOV5 model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211534970.0A CN115546614B (en) | 2022-12-02 | 2022-12-02 | Safety helmet wearing detection method based on improved YOLOV5 model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115546614A CN115546614A (en) | 2022-12-30 |
CN115546614B true CN115546614B (en) | 2023-04-18 |
Family
ID=84721761
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211534970.0A Active CN115546614B (en) | 2022-12-02 | 2022-12-02 | Safety helmet wearing detection method based on improved YOLOV5 model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115546614B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116612087B (en) * | 2023-05-22 | 2024-02-23 | 山东省人工智能研究院 | Coronary artery CTA stenosis detection method based on YOLOv5-LA |
CN116343383A (en) * | 2023-05-30 | 2023-06-27 | 四川三思德科技有限公司 | Campus access management method and system based on Internet of things |
CN116580027B (en) * | 2023-07-12 | 2023-11-28 | 中国科学技术大学 | Real-time polyp detection system and method for colorectal endoscope video |
CN116824551A (en) * | 2023-08-30 | 2023-09-29 | 山东易图信息技术有限公司 | Light parking space state detection method based on visual attention |
CN116958883B (en) * | 2023-09-15 | 2023-12-29 | 四川泓宝润业工程技术有限公司 | Safety helmet detection method, system, storage medium and electronic equipment |
CN117710285B (en) * | 2023-10-20 | 2024-07-16 | 重庆理工大学 | Cervical lesion cell mass detection method and system based on self-adaptive feature extraction |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112597902A (en) * | 2020-12-24 | 2021-04-02 | 上海核工程研究设计院有限公司 | Small target intelligent identification method based on nuclear power safety |
CN114419659A (en) * | 2021-12-13 | 2022-04-29 | 中南大学 | Method for detecting wearing of safety helmet in complex scene |
CN114549997B (en) * | 2022-04-27 | 2022-07-29 | 清华大学 | X-ray image defect detection method and device based on regional feature extraction |
CN114581860A (en) * | 2022-05-09 | 2022-06-03 | 武汉纺织大学 | Helmet detection algorithm based on improved YOLOv5 model |
CN115205604A (en) * | 2022-08-11 | 2022-10-18 | 淮阴工学院 | Improved YOLOv 5-based method for detecting wearing of safety protection product in chemical production process |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102547301A (en) * | 2010-09-30 | 2012-07-04 | 苹果公司 | System and method for processing image data using an image signal processor |
CN107941775A (en) * | 2017-12-28 | 2018-04-20 | 清华大学 | Muti-spectrum imaging system |
CN108537729A (en) * | 2018-03-27 | 2018-09-14 | 珠海全志科技股份有限公司 | Picture scaling method, computer installation and computer readable storage medium |
CN110163234A (en) * | 2018-10-10 | 2019-08-23 | 腾讯科技(深圳)有限公司 | A kind of model training method, device and storage medium |
CN110796249A (en) * | 2019-09-29 | 2020-02-14 | 中山大学孙逸仙纪念医院 | In-ear endoscope image neural network model construction method based on deep learning and intelligent classification processing method |
CN112990237A (en) * | 2019-12-02 | 2021-06-18 | 上海交通大学 | Subway tunnel image leakage detection method based on deep learning |
CN111639527A (en) * | 2020-04-23 | 2020-09-08 | 平安国际智慧城市科技股份有限公司 | English handwritten text recognition method and device, electronic equipment and storage medium |
CN111738922A (en) * | 2020-06-19 | 2020-10-02 | 新希望六和股份有限公司 | Method and device for training density network model, computer equipment and storage medium |
CN112215753A (en) * | 2020-10-23 | 2021-01-12 | 成都理工大学 | Image demosaicing enhancement method based on double-branch edge fidelity network |
CN112487862A (en) * | 2020-10-28 | 2021-03-12 | 南京云牛智能科技有限公司 | Garage pedestrian detection method based on improved EfficientDet model |
CN114022785A (en) * | 2021-11-15 | 2022-02-08 | 中国华能集团清洁能源技术研究院有限公司 | Remote sensing image semantic segmentation method, system, equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
Wang Xinyu; Wang Qian; Cheng Duncheng; Wu Fuqing. Defect detection of cotter pins in catenary positioning tubes based on a three-level cascade architecture. Chinese Journal of Scientific Instrument, 2019, (10), full text. *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||