CN114973122A - Helmet wearing detection method based on improved YOLOv5 - Google Patents

Helmet wearing detection method based on improved YOLOv5

Info

Publication number: CN114973122A
Application number: CN202210467457.8A
Authority: CN (China)
Prior art keywords: layer, feature map, inputting, characteristic diagram, swin
Other languages: Chinese (zh)
Inventors: 郑楚伟, 林辉, 韩竺秦
Current assignee: Shaoguan University (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Shaoguan University
Application filed by Shaoguan University; priority to CN202210467457.8A
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)

Classifications

    • G06V 20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06N 3/045: Combinations of networks (neural network architectures)
    • G06T 3/4038: Image mosaicing, e.g. composing plane images from plane sub-images
    • G06V 10/762: Image or video recognition using pattern recognition or machine learning, using clustering, e.g. of similar faces in social networks
    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/806: Fusion of extracted features, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82: Image or video recognition using pattern recognition or machine learning, using neural networks
    • G06T 2200/32: Indexing scheme for image data processing or generation, involving image mosaicing
    • G06V 2201/07: Target detection


Abstract

The invention relates to a helmet wearing detection method based on improved YOLOv5, which comprises the following steps: acquiring an image to be detected that contains a detection target; and inputting the image to be detected into an improved YOLOv5 model for target detection to obtain the position, size and category of the detection target. Compared with the prior art, the method extracts features through multiple layers of Swin Transformer Block networks, which strengthens the model's ability to extract features from the image to be detected. Because the features are extracted with a self-attention mechanism, image features that contribute strongly to recognising the detection target can be obtained, and the multi-level feature extraction yields richer image features, so that occluded detection targets and detection targets in low-brightness regions can be recognised, objects whose shape is similar to the detection target can be told apart, the false-detection and missed-detection rates are low, and the detection accuracy of the model is high.

Description

Helmet wearing detection method based on improved YOLOv5
Technical Field
The invention relates to the technical field of safety helmet wearing detection, and in particular to a safety helmet wearing detection method based on improved YOLOv5.
Background
China is the world's largest manufacturing country, and the safety of construction sites is a major concern in the industry; wearing a safety helmet correctly can effectively prevent or reduce head injuries caused by accidents while workers are on the job. In actual production, however, even though regulations explicitly require that constructors enter the site only when wearing safety helmets, it is still difficult to avoid some workers wearing helmets improperly during work out of complacency or for other reasons. At present, supervision of helmet wearing relies mainly on manual inspection, which is inefficient, labour-intensive and hard to sustain over long periods, so oversights are inevitable. In the prior art, images of the construction site are fed into a YOLOv5 model to automatically detect whether workers are wearing safety helmets. However, construction sites are crowded, so detection targets are easily occluded, and some sites are poorly lit; such detection methods struggle to recognise occluded targets and targets in dark areas, and the false-detection and missed-detection rates are high.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a helmet wearing detection method based on improved YOLOv5 that can recognise occluded and small detection targets, with low false-detection and missed-detection rates and high recognition accuracy.
The invention is realized by the following technical scheme: a helmet wearing detection method based on improved YOLOv5, comprising the following steps: acquiring an image to be detected that contains a detection target; and inputting the image to be detected into an improved YOLOv5 model for target detection to obtain the position, size and category of the detection target. The improved YOLOv5 model comprises a feature extraction module, a feature fusion module and a result prediction module. The feature extraction module comprises a patch partition submodule, a linear embedding submodule, a first Swin-T submodule, a first patch merging submodule, a second Swin-T submodule, a second patch merging submodule, a third Swin-T submodule, a third patch merging submodule, a fourth Swin-T submodule, a fourth patch merging submodule and a fifth Swin-T submodule. When the feature extraction module extracts features from the image to be detected, the method comprises the following steps: inputting the image to be detected into the patch partition submodule for patch partition; inputting the partitioned image into the linear embedding submodule for linear transformation; inputting the linearly transformed image into the first Swin-T submodule for feature extraction to obtain a first Swin-T feature map; inputting the first Swin-T feature map into the first patch merging submodule for downsampling to obtain a first-level feature map; inputting the first-level feature map into the second Swin-T submodule for feature extraction to obtain a second Swin-T feature map; inputting the second Swin-T feature map into the second patch merging submodule for downsampling to obtain a second-level feature map; inputting the second-level feature map into the third Swin-T submodule for feature extraction to obtain a third Swin-T feature map; inputting the third Swin-T feature map into the third patch merging submodule for downsampling to obtain a third-level feature map; inputting the third-level feature map into the fourth Swin-T submodule for feature extraction to obtain a fourth Swin-T feature map; inputting the fourth Swin-T feature map into the fourth patch merging submodule for downsampling to obtain a fourth-level feature map; and inputting the fourth-level feature map into the fifth Swin-T submodule for feature extraction to obtain a fifth Swin-T feature map. The first, second, fourth and fifth Swin-T submodules each comprise two Swin Transformer Block networks, the third Swin-T submodule comprises six Swin Transformer Block networks, and the Swin Transformer Block networks are used to extract image features from an input feature map. The feature fusion module fuses the fifth, fourth, third and second Swin-T feature maps to obtain a plurality of output feature maps with different grid sizes; and the result prediction module predicts the position, size and category of the detection target from the output feature maps with different grid sizes.
Compared with the prior art, the helmet wearing detection method based on improved YOLOv5 provided by the invention extracts features through multiple layers of Swin Transformer Block networks, which strengthens the model's ability to extract features from the image to be detected. Because features are extracted with a self-attention mechanism, image features that contribute strongly to recognising the detection target can be obtained, and the multi-level feature extraction yields richer image features, so that occluded detection targets and targets in low-brightness regions can be recognised; in addition, objects whose shape is similar to the detection target can be told apart, the false-detection and missed-detection rates are low, and the detection accuracy of the model is high.
Further, the Swin Transformer Block network comprises four LayerNorm layers, a window multi-head self-attention layer, two MLP layers, a shifted-window multi-head self-attention layer, four DropPath layers and four residual connection layers. When the Swin Transformer Block network extracts image features from an input feature map, it comprises the steps of: inputting the input feature map into a LayerNorm layer for normalization; inputting the normalized input feature map into the window multi-head self-attention layer for multi-head self-attention feature extraction to obtain a multi-head self-attention feature map; inputting the multi-head self-attention feature map into a DropPath layer for random deactivation; inputting the multi-head self-attention feature map output by the DropPath layer together with the input feature map into a residual connection layer for residual connection to obtain a first intermediate feature map; inputting the first intermediate feature map into a LayerNorm layer for normalization; inputting the normalized first intermediate feature map into an MLP layer for linear transformation to obtain a first transformed feature map; inputting the first transformed feature map into a DropPath layer for random deactivation; inputting the first transformed feature map output by the DropPath layer together with the first intermediate feature map into a residual connection layer for residual connection to obtain a second intermediate feature map; inputting the second intermediate feature map into a LayerNorm layer for normalization; inputting the normalized second intermediate feature map into the shifted-window multi-head self-attention layer for pixel-shifted multi-head self-attention feature extraction to obtain a shifted multi-head self-attention feature map; inputting the shifted multi-head self-attention feature map into a DropPath layer for random deactivation; inputting the shifted multi-head self-attention feature map output by the DropPath layer together with the second intermediate feature map into a residual connection layer for residual connection to obtain a third intermediate feature map; inputting the third intermediate feature map into a LayerNorm layer for normalization; inputting the normalized third intermediate feature map into an MLP layer for linear transformation to obtain a second transformed feature map; inputting the second transformed feature map into a DropPath layer for random deactivation; and inputting the second transformed feature map output by the DropPath layer together with the third intermediate feature map into a residual connection layer for residual connection to obtain the feature map output by the Swin Transformer Block network.
Further, the first, second, third and fourth patch merging submodules each comprise a patch splitting layer, a concat layer, a LayerNorm layer and a fully connected layer. The patch splitting layer divides the pixels of an input feature map of dimension [H, W, C] that are spaced 2 apart into a plurality of patches; the concat layer concatenates the split patches to obtain a feature map whose dimension becomes [H/2, W/2, 4C]; the LayerNorm layer normalizes the feature map output by the concat layer; and the fully connected layer linearly transforms the number of channels of the feature map output by the LayerNorm layer to obtain a feature map of dimension [H/2, W/2, 2C].
Further, the feature fusion module comprises a first CONV layer, a first UP layer, a first Concat layer, a first C3-Ghost layer, a second CONV layer, a second UP layer, a second Concat layer, a second C3-Ghost layer, a third CONV layer, a third UP layer, a third Concat layer, a third C3-Ghost layer, a fourth CONV layer, a fourth Concat layer, a fourth C3-Ghost layer, a fifth CONV layer, a fifth Concat layer, a fifth C3-Ghost layer, a sixth CONV layer, a sixth Concat layer and a sixth C3-Ghost layer. When the feature fusion module fuses the fifth, fourth, third and second Swin-T feature maps to obtain a plurality of output feature maps with different grid sizes, it comprises the steps of: acquiring the fifth Swin-T feature map and inputting it into the first CONV layer for convolution to obtain a first convolution feature map; inputting the first convolution feature map into the first UP layer for upsampling; acquiring the fourth Swin-T feature map and inputting it, together with the feature map output by the first UP layer, into the first Concat layer for Concat splicing; inputting the feature map output by the first Concat layer into the first C3-Ghost layer for convolution to obtain a first output feature map; inputting the first output feature map into the second CONV layer for convolution to obtain a second convolution feature map; inputting the second convolution feature map into the second UP layer for upsampling; acquiring the third Swin-T feature map and inputting it, together with the feature map output by the second UP layer, into the second Concat layer for Concat splicing; inputting the feature map output by the second Concat layer into the second C3-Ghost layer for convolution to obtain a second output feature map; inputting the second output feature map into the third CONV layer for convolution to obtain a third convolution feature map; inputting the third convolution feature map into the third UP layer for upsampling; acquiring the second Swin-T feature map and inputting it, together with the feature map output by the third UP layer, into the third Concat layer for Concat splicing; inputting the feature map output by the third Concat layer into the third C3-Ghost layer for convolution to obtain a third output feature map; inputting the third output feature map into the fourth CONV layer for convolution to obtain a fourth convolution feature map; inputting the fourth convolution feature map and the third convolution feature map together into the fourth Concat layer for Concat splicing; inputting the feature map output by the fourth Concat layer into the fourth C3-Ghost layer for convolution to obtain a fourth output feature map; inputting the fourth output feature map into the fifth CONV layer for convolution to obtain a fifth convolution feature map; inputting the fifth convolution feature map and the second convolution feature map together into the fifth Concat layer for Concat splicing; inputting the feature map output by the fifth Concat layer into the fifth C3-Ghost layer for convolution to obtain a fifth output feature map; inputting the fifth output feature map into the sixth CONV layer for convolution to obtain a sixth convolution feature map; inputting the sixth convolution feature map and the first convolution feature map together into the sixth Concat layer for Concat splicing; and inputting the feature map output by the sixth Concat layer into the sixth C3-Ghost layer for convolution to obtain a sixth output feature map.
Further, in another embodiment the feature fusion module comprises the same layers as above and fuses the fifth, fourth, third and second Swin-T feature maps into a plurality of output feature maps with different grid sizes by the same steps, except that: the fourth convolution feature map and the third convolution feature map are input into the fourth Concat layer together with the third Swin-T feature map for Concat splicing; and the fifth convolution feature map and the second convolution feature map are input into the fifth Concat layer together with the fourth Swin-T feature map for Concat splicing.
Further, when the first to sixth C3-Ghost layers perform convolution processing on an input feature map, the method comprises the steps of: performing a standard convolution on the input feature map to compress the number of channels, and then extracting features through N serially connected Ghost Bottleneck modules to obtain a first C3-Ghost feature map; performing another standard convolution on the input feature map to obtain a second C3-Ghost feature map; and Concat-stacking the first and second C3-Ghost feature maps along the channel dimension and fusing the features by convolution to obtain an output feature map. When a Ghost Bottleneck module extracts features from an input feature map, the steps comprise: inputting the feature map into the first Ghost module for convolution, followed by a BN layer and a ReLU activation function with sparsity; and inputting the resulting feature map into the second Ghost module for convolution, followed by another BN layer. When a Ghost module performs its convolution operation, the steps comprise: performing a point-wise convolution on the input feature map with a 1x1 kernel, compressing the number of channels by a scaling factor, normalizing with a BatchNorm2d layer and applying a SiLU activation function to obtain a concentrated feature map; performing a layer-by-layer (depthwise) convolution on the concentrated feature map, normalizing with a BatchNorm2d layer and applying a SiLU activation function to obtain a redundant feature map; and Concat-stacking the concentrated feature map and the redundant feature map along the channel dimension and outputting the stacked result.
Further, the scaling factor is 2.
Further, when the result prediction module predicts the position, size and category of the detection target from the output feature maps with different grid sizes, the steps comprise: at each spatial point of the third, fourth, fifth and sixth output feature maps, predicting with four prior anchor boxes of corresponding sizes to obtain the coordinate offsets (t_x, t_y), width t_w and height t_h of the predicted target box, together with a confidence and the probability values of the prediction categories; and obtaining the position coordinates, width and height of the detection target from (t_x, t_y), t_w and t_h. The position coordinates (b_x, b_y) of the detection target are given by:

b_x = σ(t_x) + C_x,  b_y = σ(t_y) + C_y

where C_x and C_y are the coordinates of the upper-left corner of the grid cell in which the detection target is located.

The width b_w and height b_h of the detection target are given by:

[formula image in the original: b_w and b_h computed from t_w, t_h and the prior anchor dimensions]

where p_w and p_h are the width and height of the prior anchor box, respectively.

Non-maximum suppression is then applied to obtain the final position coordinates, width, height and category confidences of the detection target, and the category with the highest confidence is taken as the category of the corresponding detection target.
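By way of illustration only, the decoding step above can be sketched in Python (PyTorch). This is not the patented implementation: the width/height expression is shown only as an image in the original publication, so the common exponential mapping b_w = p_w·exp(t_w), b_h = p_h·exp(t_h) is assumed here, and the function and argument names are hypothetical.

```python
import torch

def decode_predictions(t, anchors, grid_xy):
    """Hypothetical sketch of the box decoding described above.

    t       -- raw offsets (..., 4) = (t_x, t_y, t_w, t_h) per grid cell and anchor
    anchors -- prior anchor sizes (..., 2) = (p_w, p_h)
    grid_xy -- top-left corner of the grid cell (..., 2) = (C_x, C_y)
    """
    tx, ty, tw, th = t.unbind(-1)
    bx = torch.sigmoid(tx) + grid_xy[..., 0]   # b_x = sigma(t_x) + C_x
    by = torch.sigmoid(ty) + grid_xy[..., 1]   # b_y = sigma(t_y) + C_y
    bw = anchors[..., 0] * torch.exp(tw)       # assumed form: b_w = p_w * exp(t_w)
    bh = anchors[..., 1] * torch.exp(th)       # assumed form: b_h = p_h * exp(t_h)
    return torch.stack([bx, by, bw, bh], dim=-1)
```

The decoded boxes would then go through non-maximum suppression (for example torchvision.ops.nms) to produce the final detections, as described above.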
Further, the sizes of the prior anchor boxes are updated from the data set used to train the improved YOLOv5 model, comprising the steps of: first, selecting actual boxes from the data set by probability with a roulette-wheel algorithm to serve as the initial cluster centres; second, calculating the distance Loss between each actual box and the current cluster centres, where Loss is defined by:

[formula image in the original: Loss computed from Box_i and Center_j]

where Box_i is the area of the i-th of the n actual boxes and Center_j is the area of the j-th of the k cluster centres;

third, assigning each actual box to the cluster whose centre is closest; fourth, calculating the median of the actual box coordinates within each cluster, updating the corresponding cluster centre with that median, and repeating the second to fourth steps until k cluster centres with stable positions are obtained, which are taken as prior anchor boxes; and fifth, calculating the size-error degree between each actual box and each prior anchor box, averaging the minimum size-error value of each actual box, taking this average as the fitness of the prior anchors, and taking the prior anchors with the highest fitness as the updated prior anchors.
Further, updating the sizes of the prior anchor boxes from the data set used to train the improved YOLOv5 model further comprises the following step: performing a linear transformation on the updated prior anchor boxes, scaling the minimum width to 0.8 times and the maximum width to 1.5 times while keeping the aspect ratio unchanged.
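The anchor-updating procedure above can be illustrated with the following NumPy sketch. It is an assumption-laden illustration rather than the patent's exact algorithm: the distance Loss is given only as an image in the original, so 1 - IoU of (width, height) pairs is used as a stand-in distance; the default k = 16 assumes four prior anchors at each of the four output scales; and the fitness evaluation and the final linear scaling are omitted.

```python
import numpy as np

def update_anchors(boxes, k=16, iters=100, seed=0):
    """Roulette-wheel seeding + median-updated clustering of (w, h) box sizes."""
    boxes = np.asarray(boxes, dtype=float)            # [n, 2] array of (w, h)
    rng = np.random.default_rng(seed)

    def wh_iou(wh, centers):                          # IoU of boxes sharing a corner
        inter = np.minimum(wh[:, None, 0], centers[None, :, 0]) * \
                np.minimum(wh[:, None, 1], centers[None, :, 1])
        union = wh[:, None].prod(2) + centers[None, :].prod(2) - inter
        return inter / union

    # roulette-wheel (k-means++-style) seeding: boxes far from the existing centres
    # are more likely to be picked as the next centre
    centers = [boxes[rng.integers(len(boxes))]]
    for _ in range(k - 1):
        d = 1.0 - wh_iou(boxes, np.array(centers)).max(1)
        centers.append(boxes[rng.choice(len(boxes), p=d / d.sum())])
    centers = np.array(centers)

    # assign each box to its nearest centre, then update each centre by the median
    for _ in range(iters):
        assign = (1.0 - wh_iou(boxes, centers)).argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = np.median(boxes[assign == j], axis=0)
    return centers[np.argsort(centers.prod(1))]       # anchors sorted by area
```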
For a better understanding and practice, the invention is described in detail below with reference to the accompanying drawings.
Drawings
FIG. 1 is a schematic flow chart of a method for detecting wearing of a safety helmet based on improved YOLOv5 in an embodiment;
FIG. 2 is a schematic structural diagram of the improved YOLOv5 model in step S2 shown in FIG. 1;
FIG. 3 is a schematic structural diagram of a Swin Transformer Block network in the embodiment;
FIG. 4 is a schematic diagram of a multi-head self-attention layer according to an embodiment;
FIG. 5 is a flow chart of a first multi-headed self-attention module in an embodiment;
FIG. 6 is a schematic structural diagram of an MLP layer in an embodiment;
FIG. 7 is a schematic structural diagram of the shifted-window multi-head self-attention layer in the embodiment;
FIG. 8 is a schematic flow chart of a second multi-headed self-attention module in an embodiment;
FIG. 9 is a schematic diagram illustrating an algorithm flow of a first C3-Ghost layer, a second C3-Ghost layer, a third C3-Ghost layer, a fourth C3-Ghost layer, a fifth C3-Ghost layer, and a sixth C3-Ghost layer in the embodiment;
fig. 10 is an image of the detection result output by the conventional YOLOv5 model in experiment (1);
FIG. 11 is an image of the test results output from the improved YOLOv5 model in experiment (1);
fig. 12 is an image of the detection result output by the conventional YOLOv5 model in experiment (2);
fig. 13 is a detection result image output from the improved YOLOv5 model in experiment (2);
fig. 14 is a detection result image output by the conventional YOLOv5 model in experiment (3);
fig. 15 is an image of the detection result output from the improved YOLOv5 model in experiment (3).
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims.
In the description of the present application, it is to be understood that the terms "first," "second," "third," and the like are used solely for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order, nor is it to be construed as indicating or implying relative importance. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as appropriate. Further, in the description of the present application, "a plurality" means two or more unless otherwise specified. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
Please refer to fig. 1, which is a schematic flow chart of a method for detecting wearing of a safety helmet based on improved YOLOv5 in this embodiment. The method comprises the following steps:
s1: acquiring an image to be detected containing a detection target;
s2: and inputting the image to be detected into an improved YOLOv5 model for target detection to obtain the position size information and the category of the detected target.
In step S1, the image to be detected may be captured by a camera at a construction site, and the image captured by the camera is transmitted to a processor capable of executing a computer program through a data line or a network, where the image captured by the camera may be picture data or video stream data, and when the captured image is the video stream data, a frame image thereof is extracted as the image to be detected.
The image to be detected can also come from a data set used to train and test the improved YOLOv5 model. The images in the data set contain people wearing safety helmets on construction sites, and the data set can be an existing open-source safety helmet data set. In addition, to improve the recognition performance of the model, the existing open-source data set can be expanded with a web crawler or by collecting helmet-wearing images on construction sites under different lighting conditions. The LabelImg annotation tool is used to label the helmet-wearing images collected by the web crawler or on site, the labels including the actual box of the region in which the safety helmet is located. In this embodiment the data set is randomly divided into a training set and a test set at a ratio of 9:1.
The detection target in the image to be detected can be an image of a person wearing a safety helmet, an image of a person without wearing a safety helmet and other image information needing to be identified.
Please refer to fig. 2, which is a schematic structural diagram of the improved YOLOv5 model in step S2, the model includes a feature extraction module, a feature fusion module, and a result prediction module, wherein the feature extraction module is configured to extract image features in an image to be detected; the characteristic fusion module is used for fusing the image characteristics extracted by the characteristic extraction module and outputting a plurality of output characteristic graphs with different grid sizes; and the result prediction module is used for predicting and obtaining the position size information and the category of the detection target according to the output feature maps with different grid sizes.
Specifically, the feature extraction module comprises a patch partition (Patch Partition) submodule, a linear embedding (Linear Embedding) submodule, a first Swin-T (Swin Transformer) submodule, a first patch merging (Patch Merging) submodule, a second Swin-T submodule, a second patch merging submodule, a third Swin-T submodule, a third patch merging submodule, a fourth Swin-T submodule, a fourth patch merging submodule and a fifth Swin-T submodule. When the feature extraction module extracts features from the image to be detected, it comprises the steps of:
inputting the image to be detected into the patch partition submodule for patch partition;
inputting the partitioned image to be detected into the linear embedding submodule for linear transformation;
inputting the linearly transformed image to be detected into the first Swin-T submodule for feature extraction to obtain a first Swin-T feature map;
inputting the first Swin-T feature map into the first patch merging submodule for downsampling to obtain a first-level feature map;
inputting the first-level feature map into the second Swin-T submodule for feature extraction to obtain a second Swin-T feature map;
inputting the second Swin-T feature map into the second patch merging submodule for downsampling to obtain a second-level feature map;
inputting the second-level feature map into the third Swin-T submodule for feature extraction to obtain a third Swin-T feature map;
inputting the third Swin-T feature map into the third patch merging submodule for downsampling to obtain a third-level feature map;
inputting the third-level feature map into the fourth Swin-T submodule for feature extraction to obtain a fourth Swin-T feature map;
inputting the fourth Swin-T feature map into the fourth patch merging submodule for downsampling to obtain a fourth-level feature map;
and inputting the fourth-level feature map into the fifth Swin-T submodule for feature extraction to obtain a fifth Swin-T feature map.
When the patch partition submodule partitions the image to be detected, the input is an image of dimension [H, W, CH]; every a x a adjacent pixels of the image are divided into one patch and unfolded along the channel direction, giving a tensor of dimension [H/a, W/a, a²·CH], where H is the height of the image to be detected, W its width and CH its number of channels. In one embodiment, a has a value of 4.

When the linear embedding submodule linearly transforms the partitioned image, the channel data of each pixel are linearly transformed to obtain a tensor of dimension [H/a, W/a, C], where C is a hyperparameter that adjusts the number of channels to match the feature fusion module. In one implementation, the hyperparameter C is set to 64.
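A minimal PyTorch sketch of the patch partition and linear embedding described above, assuming a = 4, C = 64 and a 3-channel input as in the text; the class and argument names are illustrative rather than the patent's.

```python
import torch
import torch.nn as nn

class PatchPartitionEmbed(nn.Module):
    """Split the image into a x a patches and linearly embed each patch."""
    def __init__(self, a=4, in_ch=3, embed_dim=64):
        super().__init__()
        self.a = a
        self.proj = nn.Linear(a * a * in_ch, embed_dim)   # linear embedding submodule

    def forward(self, x):                    # x: [B, CH, H, W]
        B, CH, H, W = x.shape
        a = self.a
        # group every a x a neighbourhood and unfold it along the channel direction
        x = x.reshape(B, CH, H // a, a, W // a, a)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(B, H // a, W // a, a * a * CH)
        return self.proj(x)                  # [B, H/a, W/a, C]

# e.g. a 640x640 RGB image yields a [1, 160, 160, 64] tensor of patch embeddings
embeddings = PatchPartitionEmbed()(torch.randn(1, 3, 640, 640))
```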
The first, second, fourth and fifth Swin-T submodules each comprise two Swin Transformer Block networks used to extract image features from an input feature map, and the third Swin-T submodule comprises six Swin Transformer Block networks. Please refer to fig. 3, which is a schematic structural diagram of a Swin Transformer Block network; it comprises four LayerNorm layers, a window multi-head self-attention (W-MSA) layer, two MLP layers, a shifted-window multi-head self-attention (SW-MSA) layer, four DropPath layers and four residual connection layers. When the Swin Transformer Block network extracts image features, the method comprises the following steps:

The input feature map Z^{l-1} is input into a LayerNorm layer for normalization; the normalized Z^{l-1} is input into the window multi-head self-attention layer for multi-head self-attention feature extraction, giving a multi-head self-attention feature map; the multi-head self-attention feature map is input into a DropPath layer so that the branch paths inside the Swin Transformer Block are randomly deactivated (the DropPath layer is a regularization strategy that improves the generalization ability of the model and prevents overfitting); and the multi-head self-attention feature map output by the DropPath layer is input, together with Z^{l-1}, into a residual connection layer for residual connection, giving a first intermediate feature map Ẑ^l.

Ẑ^l is input into a LayerNorm layer for normalization; the normalized Ẑ^l is input into an MLP layer for linear transformation, giving a first transformed feature map; the first transformed feature map is input into a DropPath layer for random deactivation; and the first transformed feature map output by the DropPath layer is input, together with Ẑ^l, into a residual connection layer for residual connection, giving a second intermediate feature map Z^l.

Z^l is input into a LayerNorm layer for normalization; the normalized Z^l is input into the shifted-window multi-head self-attention layer for pixel-shifted multi-head self-attention feature extraction, giving a shifted multi-head self-attention feature map; the shifted multi-head self-attention feature map is input into a DropPath layer for random deactivation; and the shifted multi-head self-attention feature map output by the DropPath layer is input, together with Z^l, into a residual connection layer for residual connection, giving a third intermediate feature map Ẑ^{l+1}.

Ẑ^{l+1} is input into a LayerNorm layer for normalization; the normalized Ẑ^{l+1} is input into an MLP layer for linear transformation, giving a second transformed feature map; the second transformed feature map is input into a DropPath layer for random deactivation; and the second transformed feature map output by the DropPath layer is input, together with Ẑ^{l+1}, into a residual connection layer for residual connection, giving the output feature map Z^{l+1}.
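The LayerNorm, attention, DropPath and residual pattern just described can be sketched as follows. This is a simplified stand-in rather than the patented block: ordinary nn.MultiheadAttention over the flattened tokens replaces the real W-MSA and SW-MSA layers (which are sketched further below), and nn.Dropout stands in for DropPath.

```python
import torch
import torch.nn as nn

class SwinBlockSketch(nn.Module):
    def __init__(self, dim=96, heads=3, mlp_ratio=4.0, drop_path=0.1):
        super().__init__()
        self.norm = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])
        self.w_msa = nn.MultiheadAttention(dim, heads, batch_first=True)   # stand-in for W-MSA
        self.sw_msa = nn.MultiheadAttention(dim, heads, batch_first=True)  # stand-in for SW-MSA
        self.mlp = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(),
                          nn.Linear(int(dim * mlp_ratio), dim))
            for _ in range(2)])
        self.drop_path = nn.Dropout(drop_path)  # stand-in for stochastic-depth DropPath

    def forward(self, z):                       # z: [B, N, C] tokens, i.e. Z^{l-1}
        y = self.norm[0](z)
        z_hat = self.drop_path(self.w_msa(y, y, y)[0]) + z                 # first intermediate map
        z_l = self.drop_path(self.mlp[0](self.norm[1](z_hat))) + z_hat     # second intermediate map
        y = self.norm[2](z_l)
        z_hat2 = self.drop_path(self.sw_msa(y, y, y)[0]) + z_l             # third intermediate map
        return self.drop_path(self.mlp[1](self.norm[3](z_hat2))) + z_hat2  # output map Z^{l+1}
```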
More specifically, please refer to fig. 4, which is a schematic structural diagram of the multi-head self-attention layer. The multi-head self-attention layer comprises a first window segmentation (Window Partition) module, a first multi-head self-attention (Multi-Head Self-Attention) module and a first window recombination (Window Reverse) module, where the first window segmentation module segments the input feature map into a plurality of non-overlapping independent windows of M x M adjacent pixels, i.e. splits the feature map into a plurality of patch vectors, so that the calculation of the first multi-head self-attention module is limited to each independent window, which reduces the amount of computation.
The first multi-head self-attention module performs multi-head scaled dot-product attention on each independent window to obtain the multi-head self-attention features of that window. Specifically, please refer to fig. 5, which is a flowchart of the first multi-head self-attention module; its steps comprise: linearly transforming the patch vectors of each independent window in the channel dimension so that the number of channels is increased to three times the original (providing the query Q, key K and value V), and simultaneously dividing them into h subspaces along the feature dimension, where h is the number of attention heads; linearly transforming the query Q, key K and value V of each pixel in the h subspaces with h different parameter matrices W^Q, W^K and W^V, and performing scaled dot-product attention; and inputting the h results into a Concat module and a Linear module, where they are spliced and fused through a learnable weight matrix W^O so that the feature information learned in the different subspaces is combined, giving the multi-head self-attention features. The scaled dot-product attention result head_i of the i-th attention head is:

head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)

where W_i^Q, W_i^K and W_i^V are the i-th of the parameter matrices W^Q, W^K and W^V, and Attention() is the normalized scaled dot-product model:

Attention(Q, K, V) = softmax(Q·K^T / √d + B)·V

where Q·K^T is the information interaction between different pixel points, whose similarity is computed by the dot product; d is the dimension of the query and key vectors, and dividing by √d scales the result and keeps the gradients stable; and B is a learnable relative position bias (Relative Position Bias).

The splicing and fusion of the multi-head self-attention features is expressed as:

MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h)·W^O
the first window recombination module is used for carrying out reduction splicing on the multi-head self-attention feature of each independent window to obtain a complete multi-head self-attention feature map.
Please refer to fig. 6, which is a schematic structural diagram of an MLP layer. The MLP layer comprises two Linear layers, a GELU layer and two Dropout layers, and when it linearly transforms an input feature map the method comprises the steps of: linearly transforming the input feature map with the first Linear layer to obtain a feature map with four times the original number of channels; applying the GELU non-linear activation function to this feature map to increase the non-linearity of the network model; applying random deactivation to the output of the GELU layer with the first Dropout layer, so that the network does not rely excessively on particular local features and the generalization of the model is enhanced; linearly transforming the output of the first Dropout layer with the second Linear layer to obtain a feature map with the same number of channels as the input of the MLP layer; and applying random deactivation with the second Dropout layer to obtain the output feature map.
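The MLP layer described above maps to a very small module; the sketch below assumes a dropout rate of 0.1, which the text does not specify.

```python
import torch.nn as nn

class MLP(nn.Module):
    """Expand to 4x channels, GELU, Dropout, project back, Dropout."""
    def __init__(self, dim, drop=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Dropout(drop),
            nn.Linear(4 * dim, dim), nn.Dropout(drop))

    def forward(self, x):
        return self.net(x)
```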
Please refer to fig. 7, which is a schematic structural diagram of the shifted-window multi-head self-attention layer. It comprises a cyclic shift (Cyclic Shift) module, a second window segmentation module, a second multi-head self-attention module, a second window recombination module and a shift restoration (Reverse Cyclic Shift) module. The cyclic shift module moves the top M/2 rows of pixels of the input feature map to the bottom and the left-most M/2 columns of pixels to the right-most side.
And the second window segmentation module is used for segmenting the shifted feature map into a plurality of independent non-overlapping windows of M × M adjacent pixels.
The second multi-head self-attention module performs multi-head scaled dot-product attention on each independent window to obtain the shifted multi-head self-attention features of that window. Please refer to fig. 8, which is a flowchart of the second multi-head self-attention module; its steps comprise: linearly transforming the patch vectors of each independent window in the channel dimension so that the number of channels is increased to three times the original, and simultaneously dividing them into h subspaces along the feature dimension, where h is the number of attention heads; linearly transforming the query Q, key K and value V of the pixels in the h subspaces with h different parameter matrices W^Q, W^K and W^V and performing scaled dot-product attention, with a mask mechanism added to the calculation because pixels from non-adjacent regions are moved into the same independent window by the shift (in a specific implementation, 100 is subtracted from the similarity results of pixel pairs that were not adjacent before the shift, so that after softmax normalization their weights become 0); and inputting the h results into a Concat module and a Linear module, where they are spliced and fused through a learnable weight matrix W^O to obtain the shifted multi-head self-attention features.
And the second window recombination module is used for carrying out reduction splicing on the shifting multi-head self-attention characteristics of each independent window.
And the shifting restoration module is used for moving the rightmost M/2 columns of pixels of the restored and spliced feature map to the leftmost side and moving the bottom M/2 rows of pixels to the top so as to restore the pixel positions of the feature map subjected to cyclic shifting, thereby obtaining the multi-head shifting self-attention feature map.
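The cyclic shift, its reversal and the additive attention mask (the "-100" mentioned above) can be sketched as follows; the helper names are illustrative and the window indexing is assumed to follow the same partition as in the windowed-attention sketch.

```python
import torch

def cyclic_shift(x, M):
    """Move the top M//2 rows to the bottom and the left-most M//2 columns to the right.
    x: [B, H, W, C]."""
    return torch.roll(x, shifts=(-(M // 2), -(M // 2)), dims=(1, 2))

def reverse_cyclic_shift(x, M):
    """Restore the pixel positions changed by cyclic_shift."""
    return torch.roll(x, shifts=(M // 2, M // 2), dims=(1, 2))

def shift_attention_mask(H, W, M):
    """Additive mask for the shifted windows: pixel pairs that were not adjacent before
    the shift get -100, so their attention weights become ~0 after softmax."""
    s = M // 2
    img = torch.zeros(1, H, W, 1)
    cnt = 0
    for h in (slice(0, -M), slice(-M, -s), slice(-s, None)):
        for w in (slice(0, -M), slice(-M, -s), slice(-s, None)):
            img[:, h, w, :] = cnt               # label the regions created by the shift
            cnt += 1
    win = img.view(1, H // M, M, W // M, M, 1).permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M)
    mask = win.unsqueeze(1) - win.unsqueeze(2)  # 0 where a pixel pair shares a region
    return mask.masked_fill(mask != 0, -100.0)  # add to Q K^T / sqrt(d) before softmax
```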
The first, second, third and fourth patch merging submodules each comprise a patch splitting layer, a concat layer, a LayerNorm layer and a fully connected layer. When these submodules downsample an input feature map of dimension [H, W, C], the patch splitting layer divides the pixels spaced 2 apart into a plurality of patches; the concat layer concatenates the split patches so that the dimension of the feature map becomes [H/2, W/2, 4C]; the LayerNorm layer normalizes the feature map; and the fully connected layer linearly transforms the number of channels so that the dimension of the feature map becomes [H/2, W/2, 2C].
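A minimal PyTorch sketch of this patch-merging downsampling (class name illustrative):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """[B, H, W, C] -> [B, H/2, W/2, 2C]: gather pixels spaced 2 apart, concat to 4C,
    LayerNorm, then a fully connected layer from 4C to 2C."""
    def __init__(self, C):
        super().__init__()
        self.norm = nn.LayerNorm(4 * C)
        self.reduction = nn.Linear(4 * C, 2 * C, bias=False)

    def forward(self, x):
        x0 = x[:, 0::2, 0::2, :]                  # the four interleaved sub-grids
        x1 = x[:, 1::2, 0::2, :]
        x2 = x[:, 0::2, 1::2, :]
        x3 = x[:, 1::2, 1::2, :]
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # [B, H/2, W/2, 4C]
        return self.reduction(self.norm(x))       # [B, H/2, W/2, 2C]
```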
The feature fusion module comprises a first CONV layer, a first UP layer, a first Concat layer, a first C3-Ghost layer, a second CONV layer, a second UP layer, a second Concat layer, a second C3-Ghost layer, a third CONV layer, a third UP layer, a third Concat layer, a third C3-Ghost layer, a fourth CONV layer, a fourth Concat layer, a fourth C3-Ghost layer, a fifth CONV layer, a fifth Concat layer, a fifth C3-Ghost layer, a sixth CONV layer, a sixth Concat layer and a sixth C3-Ghost layer, and fuses the image features extracted by the feature extraction module through the following steps:
acquiring a fifth Swin-T characteristic diagram, and inputting the fifth Swin-T characteristic diagram into a first CONV layer for convolution processing to obtain a first convolution characteristic diagram; inputting the first convolution characteristic diagram into a first UP layer to carry out UP-sampling operation; acquiring a fourth Swin-T characteristic diagram, inputting the fourth Swin-T characteristic diagram and the characteristic diagram output by the first UP layer into the first Concat layer together, and performing Concat splicing; inputting the feature graph output by the first Concat layer into a first C3-Ghost layer for convolution processing to obtain a first output feature graph;
inputting the first output characteristic diagram into a second CONV layer for convolution processing to obtain a second convolution characteristic diagram; inputting the second convolution characteristic diagram into a second UP layer for UP-sampling operation; acquiring a third Swin-T characteristic diagram, inputting the third Swin-T characteristic diagram and the characteristic diagram output by the second UP layer into the second Concat layer together for Concat splicing; inputting the feature graph output by the second Concat layer into a second C3-Ghost layer for convolution processing to obtain a second output feature graph;
inputting the second output characteristic diagram into a third CONV layer for convolution processing to obtain a third convolution characteristic diagram; inputting the third convolution characteristic diagram into a third UP layer for UP-sampling operation; acquiring a second Swin-T characteristic diagram, inputting the second Swin-T characteristic diagram and the characteristic diagram output by the third UP layer into a third Concat layer together, and performing Concat splicing; inputting the feature graph output by the third Concat layer into a third C3-Ghost layer for convolution processing to obtain a third output feature graph;
inputting the third output feature map into a fourth CONV layer for convolution processing to obtain a fourth convolution feature map; inputting the fourth convolution feature map and the third convolution feature map together into a fourth Concat layer for Concat splicing; inputting the feature map output by the fourth Concat layer into a fourth C3-Ghost layer for convolution processing to obtain a fourth output feature map;
inputting the fourth output characteristic diagram into a fifth CONV layer for convolution processing to obtain a fifth convolution characteristic diagram; inputting the fifth convolution feature map and the second convolution feature map into a fifth Concat layer together for Concat splicing; inputting the feature graph output by the fifth Concat layer into a fifth C3-Ghost layer for convolution processing to obtain a fifth output feature graph;
inputting the fifth output feature map into a sixth CONV layer for convolution processing to obtain a sixth convolution feature map; inputting the sixth convolution feature map and the first convolution feature map together into a sixth Concat layer for Concat splicing; and inputting the feature map output by the sixth Concat layer into a sixth C3-Ghost layer for convolution processing to obtain a sixth output feature map.
In a specific implementation, the first UP layer, the second UP layer and the third UP layer perform an upsampling operation through a nearest neighbor interpolation algorithm.
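For readability, the following is a minimal PyTorch-style sketch of one top-down step of the fusion path described above (CONV layer, nearest-neighbour UP layer, Concat splicing with a Swin-T feature map, C3-Ghost layer). The class name, channel arguments and the injected c3_ghost module are illustrative assumptions and are not taken from the embodiment.

```python
import torch
import torch.nn as nn

class FusionStep(nn.Module):
    """One top-down step of the fusion path described above: a CONV layer,
    nearest-neighbour up-sampling, Concat with a Swin-T feature map, and a
    C3-Ghost layer. Channel numbers and the injected c3_ghost module are
    illustrative assumptions, not values taken from the embodiment."""

    def __init__(self, c_in, c_out, c3_ghost):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)  # e.g. the first CONV layer
        self.up = nn.Upsample(scale_factor=2, mode="nearest")          # nearest-neighbour UP layer
        self.c3_ghost = c3_ghost                                       # e.g. the C3Ghost sketch given later

    def forward(self, deep_feat, swin_feat):
        conv_feat = self.conv(deep_feat)              # "first convolution feature map"
        upsampled = self.up(conv_feat)                # up-sampling operation
        fused = torch.cat([swin_feat, upsampled], 1)  # Concat splicing along channels
        out = self.c3_ghost(fused)                    # "first output feature map"
        return out, conv_feat                         # conv_feat is reused by the bottom-up Concat layers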
In a preferred embodiment in which the feature fusion module fuses the image features extracted by the feature extraction module, the step of inputting the fourth convolution feature map and the third convolution feature map together into the fourth Concat layer for Concat splicing may be replaced with: inputting the third Swin-T feature map, the fourth convolution feature map and the third convolution feature map together into the fourth Concat layer for Concat splicing; and the step of inputting the fifth convolution feature map and the second convolution feature map together into the fifth Concat layer for Concat splicing may be replaced with: inputting the fourth Swin-T feature map, the fifth convolution feature map and the second convolution feature map together into the fifth Concat layer for Concat splicing. In this preferred embodiment, by adding lateral jump connections between the original input nodes and the output nodes of the same level, the feature maps at the same level can share each other's semantic information, and the feature fusion is enhanced to improve the model accuracy.
Please refer to fig. 9, which is a schematic diagram illustrating an algorithm flow of a first C3-Ghost layer, a second C3-Ghost layer, a third C3-Ghost layer, a fourth C3-Ghost layer, a fifth C3-Ghost layer, and a sixth C3-Ghost layer. When the first C3-Ghost layer, the second C3-Ghost layer, the third C3-Ghost layer, the fourth C3-Ghost layer, the fifth C3-Ghost layer and the sixth C3-Ghost layer carry out convolution processing on the input feature map, the method comprises the following steps:
performing standard convolution operation on the input feature graph to compress the number of channels, and performing feature extraction through N serially connected GhostBottleneck modules to obtain a first C3-Ghost feature graph;
meanwhile, performing another standard convolution operation on the input feature map to obtain a second C3-Ghost feature map;
and performing Concat superposition on the first C3-Ghost characteristic diagram and the second C3-Ghost characteristic diagram according to channel dimensions, and performing characteristic fusion through convolution to obtain an output characteristic diagram.
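A hedged sketch of the C3-Ghost layer as described in the three steps above, assuming PyTorch and 1×1 standard convolutions for the channel compression and the final fusion; the channel split, activation placement and the GhostBottleneck class (sketched after the Ghost Bottleneck description below) are assumptions rather than details taken from the embodiment.

```python
import torch
import torch.nn as nn

class C3Ghost(nn.Module):
    """Sketch of the C3-Ghost layer: one standard convolution compresses the
    channels and feeds N stacked GhostBottleneck modules, a parallel standard
    convolution forms a second branch, and the two branches are concatenated
    along the channel dimension and fused by a final convolution."""

    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_hidden = c_out // 2  # assumed compression ratio
        self.branch1 = nn.Sequential(
            nn.Conv2d(c_in, c_hidden, 1, bias=False),  # compress channel number
            nn.BatchNorm2d(c_hidden),
            nn.SiLU(inplace=True),
            *[GhostBottleneck(c_hidden, c_hidden) for _ in range(n)],  # N serial GhostBottlenecks
        )
        self.branch2 = nn.Sequential(
            nn.Conv2d(c_in, c_hidden, 1, bias=False),  # the other standard convolution
            nn.BatchNorm2d(c_hidden),
            nn.SiLU(inplace=True),
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * c_hidden, c_out, 1, bias=False),  # feature fusion after Concat
            nn.BatchNorm2d(c_out),
            nn.SiLU(inplace=True),
        )

    def forward(self, x):
        y1 = self.branch1(x)                      # first C3-Ghost feature map
        y2 = self.branch2(x)                      # second C3-Ghost feature map
        return self.fuse(torch.cat([y1, y2], 1))  # Concat by channel dimension, then fuse
```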
More specifically, when the Ghost Bottleneck module performs feature extraction on an input feature map, the method comprises the following steps:
inputting the input feature map into a first-layer Ghost module for convolution operation, and processing the feature map through a BN (batch normalization) layer and a ReLU activation function with sparsity, wherein the BN layer is used for ensuring that the input of each layer of the network has the same distribution, and the ReLU activation function is used for avoiding the gradient vanishing phenomenon during back propagation;
and inputting the feature map processed by the BN layer and the ReLU activation function into a second-layer Ghost module for convolution operation, and processing through another BN layer. The ReLU activation function is not used at this point because the hard saturation at 0 on the negative half-axis of the ReLU activation function makes the output data distribution non-zero-mean, which leads to the deactivation of neurons and thereby reduces the performance of the network.
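A minimal sketch of the Ghost Bottleneck as described above: a first Ghost module followed by a BN layer and ReLU, then a second Ghost module followed only by another BN layer. The GhostModule class it relies on is sketched after the Ghost module description below; a residual shortcut is deliberately omitted because the embodiment does not mention one.

```python
import torch.nn as nn

class GhostBottleneck(nn.Module):
    """Sketch of the Ghost Bottleneck described above: Ghost module -> BN ->
    ReLU -> Ghost module -> BN (no ReLU after the second module, for the
    reason given in the text)."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.ghost1 = GhostModule(c_in, c_out)   # see the Ghost module sketch below
        self.bn1 = nn.BatchNorm2d(c_out)
        self.act = nn.ReLU(inplace=True)         # sparse ReLU after the first Ghost module
        self.ghost2 = GhostModule(c_out, c_out)
        self.bn2 = nn.BatchNorm2d(c_out)

    def forward(self, x):
        y = self.act(self.bn1(self.ghost1(x)))   # first Ghost module + BN + ReLU
        return self.bn2(self.ghost2(y))          # second Ghost module + BN only
```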
When the Ghost module performs convolution operation on the input feature graph, the method comprises the following steps:
performing point-by-point convolution on the input feature map through a convolution kernel of 1 × 1, compressing the number of channels of the input feature map through a scaling factor ratio, performing normalization operation through a BatchNorm2d layer, and performing SiLU activation function processing to obtain a concentrated feature map, wherein in the embodiment, the scaling factor ratio is 2, and the number of channels of the input feature map is compressed to half of the original number;
performing convolution operation layer by layer on the concentrated characteristic diagram, performing normalization operation through a BatchNorm2d layer, and performing SiLU activation function processing to obtain a redundant characteristic diagram;
and performing Concat superposition on the concentrated feature map and the redundant feature map according to the channel dimension, and outputting a superposition result.
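A hedged sketch of the Ghost module steps above, assuming PyTorch, that the "layer by layer" convolution is a depth-wise convolution, and a 3×3 kernel for it (the embodiment does not state the kernel size); the scaling factor ratio defaults to 2 as in the embodiment, and the output channel number is assumed to be divisible by the ratio.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Sketch of the Ghost module: a 1x1 point-wise convolution compresses the
    channels by the scaling factor ratio, followed by BatchNorm2d and SiLU to
    give the concentrated feature map; a depth-wise convolution (assumed 3x3)
    then generates the redundant feature map, again with BatchNorm2d and SiLU;
    the two results are concatenated along the channel dimension."""

    def __init__(self, c_in, c_out, ratio=2, dw_kernel=3):
        super().__init__()
        c_primary = c_out // ratio  # channels of the concentrated feature map
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_primary, kernel_size=1, bias=False),       # point-by-point convolution
            nn.BatchNorm2d(c_primary),
            nn.SiLU(inplace=True),
        )
        self.cheap = nn.Sequential(
            nn.Conv2d(c_primary, c_out - c_primary, kernel_size=dw_kernel,
                      padding=dw_kernel // 2, groups=c_primary, bias=False),  # depth-wise ("layer by layer")
            nn.BatchNorm2d(c_out - c_primary),
            nn.SiLU(inplace=True),
        )

    def forward(self, x):
        concentrated = self.primary(x)                      # concentrated feature map
        redundant = self.cheap(concentrated)                # redundant feature map
        return torch.cat([concentrated, redundant], dim=1)  # Concat by channel dimension
```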
The result prediction module obtains position size information and a category of a detection target according to a third output feature map, a fourth output feature map, a fifth output feature map and a sixth output feature map of different grid sizes, wherein the output feature map with the largest grid size is used for detecting a target with a small size, and the feature map with the smallest grid size is used for detecting a target with a large size, and the method specifically comprises the following steps:
predicting through four prior anchor frames with corresponding sizes at each spatial point of the third output feature map, the fourth output feature map, the fifth output feature map and the sixth output feature map with different grid sizes, to obtain the coordinate offset (t_x, t_y), the width t_w and the height t_h of the predicted target frame, a confidence probability value and the prediction category; obtaining the position coordinates and the width and height of the detection target according to the coordinate offset (t_x, t_y) and the width t_w and height t_h of the predicted target frame, wherein the expression of the position coordinates (b_x, b_y) of the detection target is:
b_x = σ(t_x) + C_x,  b_y = σ(t_y) + C_y
where C_x and C_y are respectively the coordinates of the upper left corner of the grid cell where the detection target is located in the feature map;
and the expression of the width b_w and height b_h of the detection target is:
b_w = p_w · e^(t_w),  b_h = p_h · e^(t_h)
where p_w and p_h are respectively the width and the height of the prior anchor frame;
and finally, carrying out non-maximum suppression processing to obtain the position coordinates, width and height of the final detection target and the confidence of the prediction category, and determining the prediction category with the highest confidence as the category of the corresponding detection target. In one implementation, the final position coordinates, width and height of the detection target and its category are marked on the image to be detected and output as a detection result image, the categories of the detection target being "safety helmet worn correctly" and "safety helmet not worn".
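The decoding and non-maximum suppression steps above can be sketched as follows, assuming the decoding formulas given earlier, torchvision's nms, and illustrative tensor layouts and thresholds that are not specified in the embodiment.

```python
import torch
from torchvision.ops import nms

def decode_and_nms(t_xy, t_wh, obj_conf, cls_prob, grid_xy, anchors_wh,
                   stride, conf_thres=0.25, iou_thres=0.45):
    """Sketch of the result prediction step.
      t_xy, t_wh : predicted offsets, shape (N, 2)
      obj_conf   : objectness confidence after sigmoid, shape (N,)
      cls_prob   : class probabilities after sigmoid, shape (N, num_classes)
      grid_xy    : top-left cell coordinates C_x, C_y, shape (N, 2)
      anchors_wh : prior anchor sizes p_w, p_h in grid units, shape (N, 2)
      stride     : grid-to-pixel scale of this output feature map
    Thresholds and layouts are illustrative assumptions."""
    b_xy = (torch.sigmoid(t_xy) + grid_xy) * stride           # b_x = sigma(t_x) + C_x
    b_wh = anchors_wh * torch.exp(t_wh) * stride               # b_w = p_w * e^(t_w)
    boxes = torch.cat([b_xy - b_wh / 2, b_xy + b_wh / 2], 1)   # (x1, y1, x2, y2)

    scores, cls_id = (obj_conf.unsqueeze(1) * cls_prob).max(dim=1)
    keep = scores > conf_thres                                 # drop low-confidence predictions
    boxes, scores, cls_id = boxes[keep], scores[keep], cls_id[keep]

    kept = nms(boxes, scores, iou_thres)                       # non-maximum suppression
    return boxes[kept], scores[kept], cls_id[kept]             # highest-confidence category per box
```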
When the improved YOLOv5 model of the present embodiment is trained, the position and size information of the detection target in the image to be detected output by the model and the class to which the detection target belongs are input into a loss function, the difference between the predicted data and the actual data is calculated, and the model is adjusted by the loss function.
In a preferred embodiment, updating the prior anchor box size in the improved YOLOv5 model according to the data set used for training the improved YOLOv5 model specifically includes the steps of:
firstly, selecting an actual frame from the data set as an initial clustering center according to probability through a roulette-wheel algorithm;
secondly, calculating the distance Loss between each actual frame and the current clustering center in the data set, wherein the expression of the distance Loss is as follows:
[Equation image in the original: the distance Loss between the i-th actual frame and the j-th clustering center, expressed in terms of Box_i and Center_j]
where Box_i is the area of the i-th actual frame among the n actual frames, and Center_j is the area of the j-th of the k clustering centers;
thirdly, dividing the actual frame into the cluster category to which the cluster center with the shortest distance to the actual frame belongs;
fourthly, calculating the median of the actual frame coordinate in each clustering category, updating the clustering centers of the corresponding categories by the median, repeatedly executing the second step to the fourth step until k clustering centers with stable positions are selected, and determining the obtained clustering centers as prior anchor frames;
fifthly, calculating the size error degree between each actual frame and each prior anchor frame, taking the minimum size error value corresponding to each actual frame and calculating the average of these minimum values, determining the average value as the fitness of the prior anchor frames, and determining the prior anchor frames with the highest fitness as the updated prior anchor frames;
in order to better develop the multi-scale target detection capability of the improved YOLOv5 model, the method further comprises a sixth step: performing linear transformation on the prior anchor frames, transforming the minimum width to 0.8 times and the maximum width to 1.5 times, while keeping the width-to-height ratio unchanged.
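A sketch of the anchor-updating steps one to four above is given below. Since the exact expression of the distance Loss appears only as an equation image in the original, a common 1 − IoU(w, h) distance is substituted here purely as an assumption; k = 16 assumes four prior anchors for each of the four output scales, and the fitness evaluation (step five) and linear transformation (step six) are omitted.

```python
import numpy as np

def distance(boxes_wh, centers_wh):
    """Stand-in distance between actual frames and cluster centers (assumption:
    1 - IoU of the (w, h) pairs, boxes aligned at a common corner)."""
    inter = np.minimum(boxes_wh[:, None, 0], centers_wh[None, :, 0]) * \
            np.minimum(boxes_wh[:, None, 1], centers_wh[None, :, 1])
    union = boxes_wh[:, 0:1] * boxes_wh[:, 1:2] + \
            (centers_wh[:, 0] * centers_wh[:, 1])[None, :] - inter
    return 1.0 - inter / union                                 # shape (n_boxes, k)

def kmedian_anchors(boxes_wh, k=16, iters=100, seed=0):
    """Roulette-wheel initialisation followed by median updates, mirroring
    steps one to four above; boxes_wh is an (n, 2) array of actual-frame
    widths and heights."""
    rng = np.random.default_rng(seed)
    # step one: first centre at random, further centres by roulette selection
    centers = boxes_wh[rng.integers(len(boxes_wh))][None, :]
    while len(centers) < k:
        d = distance(boxes_wh, centers).min(axis=1)
        centers = np.vstack([centers, boxes_wh[rng.choice(len(boxes_wh), p=d / d.sum())]])
    for _ in range(iters):
        # steps two and three: assign each actual frame to its nearest centre
        assign = distance(boxes_wh, centers).argmin(axis=1)
        # step four: update each centre with the median of its cluster
        new_centers = np.array([np.median(boxes_wh[assign == j], axis=0)
                                if np.any(assign == j) else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break                                              # centres have stabilised
        centers = new_centers
    return centers                                             # candidate prior anchor (w, h) pairs
```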
The following is a comparative test of the improved YOLOv5 model with the conventional YOLOv5 model:
(1) Respectively inputting the image to be detected in which the detection target is occluded into the conventional YOLOv5 model and the improved YOLOv5 model to obtain detection result images. Referring to fig. 10 and fig. 11, where fig. 10 is the detection result image output by the conventional YOLOv5 model and fig. 11 is the detection result image output by the improved YOLOv5 model, it can be seen that the improved YOLOv5 model can still detect the occluded detection target, while the conventional YOLOv5 model cannot.
(2) Respectively inputting the image to be detected containing a circular controller into the conventional YOLOv5 model and the improved YOLOv5 model to obtain detection result images. Referring to fig. 12 and fig. 13, where fig. 12 is the detection result image output by the conventional YOLOv5 model and fig. 13 is the detection result image output by the improved YOLOv5 model, it can be seen that the conventional YOLOv5 model mistakenly recognizes the circular controller as a human head without a helmet, whereas the improved YOLOv5 model can distinguish that the circular controller in the image is not a human head and thus has higher detection accuracy.
(3) Respectively inputting the image to be detected captured under weak illumination into the conventional YOLOv5 model and the improved YOLOv5 model to obtain detection result images. Referring to fig. 14 and fig. 15, where fig. 14 is the detection result image output by the conventional YOLOv5 model and fig. 15 is the detection result image output by the improved YOLOv5 model, it can be seen that the conventional YOLOv5 model misses some detections, while the improved YOLOv5 model can identify all detection targets in the image to be detected and still maintains high detection accuracy under the influence of the illumination environment.
In addition, the present application also performs ablation experiments on the improved YOLOv5 model, and the experimental results are shown in Table 1, wherein "conventional YOLOv5" is the conventional YOLOv5 model with four detection scales; "conventional YOLOv5 + C3-Ghost" is a model in which the C3 module of the conventional YOLOv5 model is replaced with the C3-Ghost module of the embodiment; "conventional YOLOv5 + improved feature fusion" is a model that adds lateral jump connections between the original input nodes and the output nodes of the same level in the feature fusion module of the conventional YOLOv5 model with four detection scales; "conventional YOLOv5 + C3-Ghost + improved feature fusion" is a model in which the C3 module of the conventional YOLOv5 model is replaced with the C3-Ghost module of the embodiment and the feature fusion module adds lateral jump connections between the original input nodes and the output nodes of the same level; "conventional YOLOv5 + Swin Transformer" is a model using the Swin Transformer as the backbone feature extraction network of conventional YOLOv5; "conventional YOLOv5 + Swin Transformer + C3-Ghost" is a model that uses the Swin Transformer as the backbone feature extraction network of conventional YOLOv5 and replaces the C3 module of the conventional YOLOv5 model with the C3-Ghost module of the embodiment; "improved YOLOv5" is the improved YOLOv5 model of the embodiment; P is the precision of the model, R is the recall of the model, mAP@.5 is the average of the AP (Average Precision) values of each category at an IoU threshold of 0.5; and mAP@.5:.95 is the average mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
TABLE 1
[Table 1 is reproduced as an image in the original: P, R, mAP@.5 and mAP@.5:.95 for each of the model variants described above]
The number of parameters of the conventional YOLOv5 model is 7.17 × 10^6, while that of conventional YOLOv5 + C3-Ghost is 6.14 × 10^6. It can be seen that, with the mAP@.5 value kept almost unchanged, the parameter count of conventional YOLOv5 + C3-Ghost is reduced by 14.4% compared with conventional YOLOv5, which proves that the C3-Ghost module of the embodiment can effectively reduce the model parameters and the computational complexity. The mAP@.5 value of conventional YOLOv5 + improved feature fusion is improved by 0.5% compared with conventional YOLOv5. Compared with the conventional YOLOv5 model, the conventional YOLOv5 + Swin Transformer model is improved by 2.1% on the mAP@.5:.95 index, and the conventional YOLOv5 + Swin Transformer + C3-Ghost model is improved by 1.9% on the mAP@.5:.95 index. The improved YOLOv5 network model combines the stronger feature extraction capability of the Swin Transformer-based backbone, the lighter computation brought by the C3-Ghost module, and the higher accuracy brought by the improved feature fusion. As can be seen from Table 1, the improved YOLOv5 model of the embodiment is improved by 2.3% on the mAP@.5:.95 index compared with the conventional YOLOv5 model, that is, its detection accuracy is significantly higher.
This application may take the form of a computer program product embodied on one or more storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having program code embodied therein. Computer-usable storage media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of the storage medium of the computer include, but are not limited to: phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technologies, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium, may be used to store information that may be accessed by a computing device.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, to those skilled in the art, changes and modifications may be made without departing from the spirit of the present invention, and it is intended that the present invention encompass such changes and modifications.

Claims (10)

1. A method for detecting wearing of safety helmets based on improved YOLOv5 is characterized by comprising the following steps:
acquiring an image to be detected containing a detection target;
inputting the image to be detected into an improved YOLOv5 model to perform target detection, so as to obtain position size information and the category of the detected target, wherein the improved YOLOv5 model comprises a feature extraction module, a feature fusion module and a result prediction module, the feature extraction module comprises an image block segmentation submodule, a linear embedding submodule, a first Swin-T submodule, a first image block splicing submodule, a second Swin-T submodule, a second image block splicing submodule, a third Swin-T submodule, a third image block splicing submodule, a fourth Swin-T submodule, a fourth image block splicing submodule and a fifth Swin-T submodule, and when the feature extraction module extracts the features of the image to be detected, the method comprises the following steps:
inputting the image to be detected into the image block segmentation submodule for image block segmentation;
inputting the segmented image to be detected into the linear embedding submodule for linear transformation;
inputting the image to be detected after linear transformation into the first Swin-T submodule for feature extraction to obtain a first Swin-T feature map;
inputting the first Swin-T feature map into the first image block splicing submodule for downsampling to obtain a first-level feature map;
inputting the first-level feature map into the second Swin-T submodule for feature extraction to obtain a second Swin-T feature map;
inputting the second Swin-T feature map into the second image block splicing submodule for downsampling to obtain a second-level feature map;
inputting the second-level feature map into the third Swin-T submodule for feature extraction to obtain a third Swin-T feature map;
inputting the third Swin-T feature map into the third image block splicing submodule for downsampling to obtain a third-level feature map;
inputting the third-level feature map into the fourth Swin-T submodule for feature extraction to obtain a fourth Swin-T feature map;
inputting the fourth Swin-T feature map into the fourth image block splicing submodule for downsampling to obtain a fourth-level feature map;
inputting the fourth-level feature map into the fifth Swin-T submodule for feature extraction to obtain a fifth Swin-T feature map;
the first Swin-T submodule, the second Swin-T submodule, the fourth Swin-T submodule and the fifth Swin-T submodule respectively comprise two Swin Transformer Block networks, the third Swin-T submodule comprises six Swin Transformer Block networks, and the Swin Transformer Block networks are used for carrying out image feature extraction on an input feature map;
the feature fusion module is used for fusing according to the fifth Swin-T feature map, the fourth Swin-T feature map, the third Swin-T feature map and the second Swin-T feature map to obtain a plurality of output feature maps with different grid sizes;
and the result prediction module is used for predicting to obtain the position size information and the category of the detection target according to the output feature maps with different grid sizes.
2. The method of claim 1, wherein: the Swin Transformer Block network comprises four LayerNorm layers, a multi-head self-attention layer, two MLP layers, a shift window multi-head self-attention layer, four DropPath layers and four residual connecting layers, and when the Swin Transformer Block network is used for extracting image features of an input feature map, the Swin Transformer Block network comprises the following steps:
inputting the input feature map into the LayerNorm layer for normalization processing; inputting the normalized input feature map into the multi-head self-attention layer to perform multi-head self-attention feature extraction to obtain a multi-head self-attention feature map; inputting the multi-head self-attention feature map into the DropPath layer for random inactivation; inputting the multi-head self-attention feature map output by the DropPath layer and the input feature map into the residual connection layer for residual connection to obtain a first intermediate feature map;
inputting the first intermediate feature map into the LayerNorm layer for normalization processing; inputting the normalized first intermediate characteristic diagram into the MLP layer for linear transformation to obtain a first transformation characteristic diagram; inputting the first transformation feature map into the DropPath layer for random inactivation; inputting the first transformation characteristic diagram output by the DropPath layer and the first intermediate characteristic diagram into the residual connecting layer for residual connection to obtain a second intermediate characteristic diagram;
inputting the second intermediate feature map into the LayerNorm layer for normalization processing; inputting the normalized second intermediate feature map into the shifting window multi-head self-attention layer to perform multi-head self-attention feature extraction of pixel shifting to obtain a shifting multi-head self-attention feature map; inputting the shift multi-head self-attention feature map into the DropPath layer for random inactivation; inputting the shift multi-head self-attention feature map output by the DropPath layer and the second intermediate feature map into the residual connecting layer for residual connection to obtain a third intermediate feature map;
inputting the third intermediate feature map into the LayerNorm layer for normalization processing; inputting the normalized third intermediate characteristic diagram into the MLP layer for linear transformation to obtain a second transformation characteristic diagram; inputting the second transformation feature map into the DropPath layer for random inactivation; inputting the second transformed feature map output by the DropPath layer and the third intermediate feature map into the residual connecting layer for residual connection, and obtaining the feature map output by the Swin Transformer Block network.
3. The method of claim 1, wherein: the first, second, third and fourth image block splicing submodules each comprise an image block segmentation layer, a concat layer, a LayerNorm layer and a fully connected layer, wherein the image block segmentation layer is used for dividing the input feature map with dimensions [H, W, C] into a plurality of image blocks by taking adjacent pixels at an interval of 2; the concat layer is used for concat splicing the segmented image blocks to obtain a feature map with dimensions changed to [H/2, W/2, 4C]; the LayerNorm layer is used for normalizing the feature map output by the concat layer; and the fully connected layer is used for performing linear transformation on the number of channels of the feature map output by the LayerNorm layer to obtain a feature map with dimensions [H/2, W/2, 2C].
4. The method of claim 1, wherein: the feature fusion module comprises a first CONV layer, a first UP layer, a first Concat layer, a first C3-Ghost layer, a second CONV layer, a second UP layer, a second Concat layer, a second C3-Ghost layer, a third CONV layer, a third UP layer, a third Concat layer, a third C3-Ghost layer, a fourth CONV layer, a fourth Concat layer, a fourth C3-Ghost layer, a fifth CONV layer, a fifth Concat layer, a fifth C3-Ghost layer, a sixth CONV layer, a sixth Concat layer and a sixth C3-Ghost layer, and is used for carrying out fusion according to the fifth Swin-T feature map, the fourth Swin-T feature map, the third Swin-T feature map and the second Swin-T feature map to obtain a plurality of output feature maps with different grid sizes, wherein the fusion comprises the following steps:
acquiring the fifth Swin-T characteristic diagram and inputting the fifth Swin-T characteristic diagram into the first CONV layer for convolution processing to obtain a first convolution characteristic diagram; inputting the first convolution feature map into the first UP layer for UP-sampling operation; acquiring the fourth Swin-T characteristic diagram, inputting the fourth Swin-T characteristic diagram and the characteristic diagram output by the first UP layer into the first Concat layer together, and performing Concat splicing; inputting the feature map output by the first Concat layer into the first C3-Ghost layer for convolution processing to obtain a first output feature map;
inputting the first output characteristic diagram into the second CONV layer for convolution processing to obtain a second convolution characteristic diagram; inputting the second convolution characteristic diagram into the second UP layer for UP-sampling operation; acquiring the third Swin-T characteristic diagram, inputting the third Swin-T characteristic diagram and the characteristic diagram output by the second UP layer into the second Concat layer together for Concat splicing; inputting the feature graph output by the second Concat layer into the second C3-Ghost layer for convolution processing to obtain a second output feature graph;
inputting the second output characteristic diagram into the third CONV layer for convolution processing to obtain a third convolution characteristic diagram; inputting the third convolution characteristic diagram into the third UP layer for UP-sampling operation; acquiring the second Swin-T characteristic diagram, inputting the second Swin-T characteristic diagram and the characteristic diagram output by the third UP layer into the third Concat layer together for Concat splicing; inputting the feature graph output by the third Concat layer into the third C3-Ghost layer for convolution processing to obtain a third output feature graph;
inputting the third output characteristic diagram into the fourth CONV layer for convolution processing to obtain a fourth convolution characteristic diagram; inputting the fourth convolution feature map and the third convolution feature map into the fourth Concat layer together for Concat splicing; inputting the feature map output by the fourth Concat layer into the fourth C3-Ghost layer for convolution processing to obtain a fourth output feature map;
inputting the fourth output characteristic diagram into the fifth CONV layer for convolution processing to obtain a fifth convolution characteristic diagram; inputting the fifth convolution feature map and the second convolution feature map together into the fifth Concat layer for Concat splicing; inputting the feature map output by the fifth Concat layer into the fifth C3-Ghost layer for convolution processing to obtain a fifth output feature map;
inputting the fifth output characteristic diagram into the sixth CONV layer for convolution processing to obtain a sixth convolution characteristic diagram; inputting the sixth convolution feature map and the first convolution feature map together into the sixth Concat layer for Concat splicing; and inputting the feature map output by the sixth Concat layer into the sixth C3-Ghost layer for convolution processing to obtain a sixth output feature map.
5. The method of claim 1, wherein: the feature fusion module comprises a first CONV layer, a first UP layer, a first Concat layer, a first C3-Ghost layer, a second CONV layer, a second UP layer, a second Concat layer, a second C3-Ghost layer, a third CONV layer, a third UP layer, a third Concat layer, a third C3-Ghost layer, a fourth CONV layer, a fourth Concat layer, a fourth C3-Ghost layer, a fifth CONV layer, a fifth Concat layer, a fifth C3-Ghost layer, a sixth CONV layer, a sixth Concat layer and a sixth C3-Ghost layer, and the feature fusion module performs fusion according to the fifth Swin-T feature map, the fourth Swin-T feature map, the third Swin-T feature map and the second Swin-T feature map to obtain a plurality of output feature maps with different grid sizes, and comprises the following steps:
acquiring the fifth Swin-T characteristic diagram and inputting the fifth Swin-T characteristic diagram into the first CONV layer for convolution processing to obtain a first convolution characteristic diagram; inputting the first convolution feature map into the first UP layer for UP-sampling operation; acquiring the fourth Swin-T characteristic diagram, inputting the fourth Swin-T characteristic diagram and the characteristic diagram output by the first UP layer into the first Concat layer together, and performing Concat splicing; inputting the feature map output by the first Concat layer into the first C3-Ghost layer for convolution processing to obtain a first output feature map;
inputting the first output characteristic diagram into the second CONV layer for convolution processing to obtain a second convolution characteristic diagram; inputting the second convolution characteristic diagram into the second UP layer for UP-sampling operation; acquiring the third Swin-T characteristic diagram, inputting the third Swin-T characteristic diagram and the characteristic diagram output by the second UP layer into the second Concat layer together for Concat splicing; inputting the feature graph output by the second Concat layer into the second C3-Ghost layer for convolution processing to obtain a second output feature graph;
inputting the second output characteristic diagram into the third CONV layer for convolution processing to obtain a third convolution characteristic diagram; inputting the third convolution characteristic diagram into the third UP layer for UP-sampling operation; acquiring the second Swin-T characteristic diagram, inputting the second Swin-T characteristic diagram and the characteristic diagram output by the third UP layer into the third Concat layer together for Concat splicing; inputting the feature graph output by the third Concat layer into the third C3-Ghost layer for convolution processing to obtain a third output feature graph;
inputting the third output characteristic diagram into the fourth CONV layer for convolution processing to obtain a fourth convolution characteristic diagram; inputting the third Swin-T feature map, the fourth convolution feature map and the third convolution feature map into the fourth Concat layer together for Concat splicing; inputting the feature map output by the fourth Concat layer into the fourth C3-Ghost layer for convolution processing to obtain a fourth output feature map;
inputting the fourth output characteristic diagram into the fifth CONV layer for convolution processing to obtain a fifth convolution characteristic diagram; inputting the fourth Swin-T feature map, the fifth convolution feature map and the second convolution feature map into the fifth Concat layer together for Concat splicing; inputting the feature map output by the fifth Concat layer into the fifth C3-Ghost layer for convolution processing to obtain a fifth output feature map;
inputting the fifth output characteristic diagram into the sixth CONV layer for convolution processing to obtain a sixth convolution characteristic diagram; inputting the sixth convolution feature map and the first convolution feature map together into the sixth Concat layer for Concat splicing; and inputting the feature map output by the sixth Concat layer into the sixth C3-Ghost layer for convolution processing to obtain a sixth output feature map.
6. The method according to any one of claims 4-5, wherein: when the first C3-Ghost layer, the second C3-Ghost layer, the third C3-Ghost layer, the fourth C3-Ghost layer, the fifth C3-Ghost layer and the sixth C3-Ghost layer carry out convolution processing on the input feature map, the method comprises the following steps:
performing standard convolution operation on the input feature graph to compress the number of channels, and performing feature extraction through N serially connected Ghost Bottleneck modules to obtain a first C3-Ghost feature graph;
performing another standard convolution operation on the input feature map to obtain a second C3-Ghost feature map;
concat superposition is carried out on the first C3-Ghost characteristic diagram and the second C3-Ghost characteristic diagram according to channel dimensions, and characteristic fusion is carried out through convolution to obtain an output characteristic diagram;
wherein, when the Ghost Bottleneck module performs feature extraction on the input feature map, the method comprises the following steps:
inputting the input feature map into a first layer Ghost module for convolution operation, and processing the feature map through a BN layer and a Relu activation function with sparsity;
inputting the feature graph processed by the BN layer and the Relu activation function into a second layer Ghost module for convolution operation, and processing the feature graph by another BN layer;
wherein, when the Ghost module performs the convolution operation, the method comprises the following steps:
performing point-by-point convolution on an input feature map through a convolution kernel of 1 multiplied by 1, compressing the number of channels of the input feature map through a scaling factor, performing normalization operation through a BatchNorm2d layer, and performing SiLU activation function processing to obtain a concentrated feature map;
performing convolution operation layer by layer on the concentrated characteristic diagram, performing normalization operation through a BatchNorm2d layer, and performing SiLU activation function processing to obtain a redundant characteristic diagram;
and performing Concat superposition on the concentrated feature map and the redundant feature map according to channel dimensions, and outputting a superposition result.
7. The method of claim 6, wherein: the scaling factor is 2.
8. The method according to any one of claims 4-5, wherein the step of the result prediction module predicting the position size information and the category of the detection target according to the output feature maps of different grid sizes comprises:
predicting through four prior anchor frames with corresponding sizes at each spatial point of the third output feature map, the fourth output feature map, the fifth output feature map and the sixth output feature map, to obtain the coordinate offset (t_x, t_y), the width t_w and the height t_h of the predicted detection target frame, a confidence probability value and the prediction category; obtaining the position coordinates and the width and height of the detection target according to the coordinate offset (t_x, t_y), the width t_w and the height t_h, wherein the expression of the position coordinates (b_x, b_y) of the detection target is:
b_x = σ(t_x) + C_x,  b_y = σ(t_y) + C_y
where C_x and C_y are respectively the coordinates of the upper left corner of the grid cell where the detection target is located;
the expression of the width b_w and height b_h of the detection target is:
b_w = p_w · e^(t_w),  b_h = p_h · e^(t_h)
where p_w and p_h are respectively the width and the height of the prior anchor frame;
and performing non-maximum suppression processing to obtain the final position coordinate, width, height and confidence coefficient of the prediction type of the detection target, and determining the prediction type with high confidence coefficient as the type of the corresponding detection target.
9. The method of claim 8, wherein: the size of the prior anchor frame is updated according to the data set used for training the improved YOLOv5 model, including the steps of:
firstly, selecting an actual frame in the data set according to the probability by a roulette algorithm to serve as an initial clustering center;
secondly, calculating the distance Loss between each actual frame and the current clustering center, wherein the expression of the distance Loss is as follows:
[Equation image in the original: the distance Loss between each actual frame and the current clustering center, expressed in terms of Box_i and Center_j]
where Box_i is the area of the i-th actual frame among the n actual frames, and Center_j is the area of the j-th of the k clustering centers;
thirdly, dividing the actual frame into the cluster category to which the cluster center with the shortest distance belongs;
fourthly, calculating the median of the actual frame coordinate in each clustering category, updating the clustering center of the corresponding category by the median, repeatedly executing the second step to the fourth step until k clustering centers with stable positions are obtained, and determining the obtained clustering centers with stable positions as prior anchor frames;
and fifthly, calculating the size error degree between each actual frame and each prior anchor frame, taking the minimum size error value corresponding to each actual frame and calculating the average of these minimum values, determining the average value as the fitness of the prior anchor frames, and determining the prior anchor frames with the highest fitness as the updated prior anchor frames.
10. The method of claim 9, wherein the prior anchor frame size is updated according to the data set used for training the improved YOLOv5 model, further comprising a sixth step: performing linear transformation on the updated prior anchor frames, transforming the minimum width to 0.8 times and the maximum width to 1.5 times, while keeping the width-to-height ratio unchanged.
CN202210467457.8A 2022-04-29 2022-04-29 Helmet wearing detection method based on improved YOLOv5 Pending CN114973122A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210467457.8A CN114973122A (en) 2022-04-29 2022-04-29 Helmet wearing detection method based on improved YOLOv5

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210467457.8A CN114973122A (en) 2022-04-29 2022-04-29 Helmet wearing detection method based on improved YOLOv5

Publications (1)

Publication Number Publication Date
CN114973122A true CN114973122A (en) 2022-08-30

Family

ID=82980043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210467457.8A Pending CN114973122A (en) 2022-04-29 2022-04-29 Helmet wearing detection method based on improved YOLOv5

Country Status (1)

Country Link
CN (1) CN114973122A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117726991A (en) * 2024-02-07 2024-03-19 金钱猫科技股份有限公司 High-altitude hanging basket safety belt detection method and terminal
CN117726991B (en) * 2024-02-07 2024-05-24 金钱猫科技股份有限公司 High-altitude hanging basket safety belt detection method and terminal

Similar Documents

Publication Publication Date Title
CN113011319B (en) Multi-scale fire target identification method and system
Li et al. Superpixel Masking and Inpainting for Self-Supervised Anomaly Detection.
CN106815859A (en) Target tracking algorism based on dimension self-adaption correlation filtering and Feature Points Matching
CN112766195B (en) Electrified railway bow net arcing visual detection method
CN113643228B (en) Nuclear power station equipment surface defect detection method based on improved CenterNet network
CN114627052A (en) Infrared image air leakage and liquid leakage detection method and system based on deep learning
CN113688709B (en) Intelligent detection method, system, terminal and medium for wearing safety helmet
CN113191204B (en) Multi-scale blocking pedestrian detection method and system
CN114419413A (en) Method for constructing sensing field self-adaptive transformer substation insulator defect detection neural network
CN112149664A (en) Target detection method for optimizing classification and positioning tasks
CN114565891A (en) Smoke and fire monitoring method and system based on graph generation technology
CN114140750A (en) Filling station safety helmet wearing real-time detection method based on YOLOv4-Tiny
CN115205604A (en) Improved YOLOv 5-based method for detecting wearing of safety protection product in chemical production process
CN112861646A (en) Cascade detection method for oil unloading worker safety helmet in complex environment small target recognition scene
CN115239710A (en) Insulator defect detection method based on attention feedback and double-space pyramid
CN114973122A (en) Helmet wearing detection method based on improved YOLOv5
CN117854072B (en) Automatic labeling method for industrial visual defects
CN115661607A (en) Small target identification method based on improved YOLOv5
CN113657225B (en) Target detection method
CN117218545A (en) LBP feature and improved Yolov 5-based radar image detection method
CN107832732A (en) Method for detecting lane lines based on ternary tree traversal
CN117351409A (en) Intelligent concrete dam face operation risk identification method
CN114782986B (en) Deep learning-based safety helmet wearing detection method, device, equipment and medium
CN115482489A (en) Improved YOLOv 3-based power distribution room pedestrian detection and trajectory tracking method and system
CN114494999A (en) Double-branch combined target intensive prediction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination