CN115830381A

CN115830381A - Improved YOLOv5-based method for detecting employees not wearing masks, and related components

Info

Publication number: CN115830381A
Application number: CN202211559855.9A
Authority: CN
Inventors: 陈嘉维; 周长源; 袁戟; 郭聿珉; 起亚·伊曼纽尔通格姆
Original assignee: Shenzhen Wanwuyun Technology Co ltd
Current assignee: Shenzhen Wanwuyun Technology Co ltd
Priority date: 2022-12-06
Filing date: 2022-12-06
Publication date: 2023-03-21

Abstract

The invention discloses a method, based on an improved YOLOv5, for detecting employees who are not wearing masks, together with related components. The method comprises the following steps: embedding an SA module between the first CBS module and the second CBS module of the backbone network of the original YOLOv5 network to obtain an improved YOLOv5 network, wherein the SA module comprises a feature grouping module, a channel attention module, a spatial attention module and an aggregation module; training the improved network to obtain a first weight model for predicting whether a mask is worn and a second weight model for predicting whether a person is an employee; and optimizing the first weight model and the second weight model with a loss function to obtain an optimized improved YOLOv5 network. Introducing the attention mechanism into the YOLOv5 network improves the model's ability to extract global features, and thereby improves the inference accuracy and robustness of the model that detects whether employees wear masks.

Description

Improved YOLOv5-based method for detecting employees not wearing masks, and related components

Technical Field

The invention relates to the technical field of target detection, and in particular to a method, based on an improved YOLOv5, for detecting employees who are not wearing masks, and to related components.

Background

At present, in indoor property-management scenes where masks must be worn, compliance usually relies on the employees' own awareness and on mutual reminders among employees. This approach depends too heavily on the employees' subjective initiative: it cannot objectively confirm that an employee is wearing a mask, nor does it let a manager carry out inspections. Target detection technology can instead be used to detect, through equipment such as cameras, employees who are not wearing masks, so that the employees' daily dress is standardized and they are urged to maintain a good working state. However, existing target detection techniques are affected in real scenes by problems such as object occlusion and multi-scale targets, so their inference accuracy and robustness are poor.

Disclosure of Invention

The embodiments of the invention provide a method, based on an improved YOLOv5, for detecting employees who are not wearing masks, and related components, with the aim of solving the low inference accuracy and poor robustness of existing target detection techniques when detecting whether employees wear masks.

In a first aspect, the invention provides a method, based on an improved YOLOv5, for detecting employees who are not wearing masks, comprising the following steps:

collecting picture data, removing the picture data that do not meet the requirements, and labeling the remaining picture data and dividing them into data sets; the data sets include picture data labeled with whether a mask is worn and picture data labeled with whether a person is an employee;

embedding an SA module between the first CBS module and the second CBS module of the backbone network of the original YOLOv5 network to obtain an improved YOLOv5 network; wherein the SA module comprises a feature grouping module, a channel attention module, a spatial attention module and an aggregation module;

inputting the picture samples in the data sets into the improved YOLOv5 network, feeding the output of the first CBS module as an input feature into the feature grouping module, grouping it along the channel dimension, and dividing each group into two branches;

inputting one branch into the channel attention module to extract global information and apply an activation to obtain a channel attention feature, and inputting the other branch into the spatial attention module to perform intra-group normalization and generate a spatial attention feature from the spatial relationship;

inputting the channel attention feature and the spatial attention feature into the aggregation module and aggregating them along the channel dimension to obtain an aggregated feature;

inputting the aggregated feature into the second CBS module of the improved YOLOv5 network and continuing training to obtain a first weight model for predicting whether a mask is worn and a second weight model for predicting whether a person is an employee;

and optimizing the first weight model and the second weight model with a loss function to obtain an optimized improved YOLOv5 network.

In a second aspect, the present invention provides an apparatus, based on an improved YOLOv5, for detecting employees who are not wearing masks, including:

a preprocessing module, configured to collect picture data, remove the picture data that do not meet the requirements, and label the remaining picture data and divide them into data sets; the data sets include picture data labeled with whether a mask is worn and picture data labeled with whether a person is an employee;

an embedding unit, configured to embed an SA module between the first CBS module and the second CBS module of the backbone network of the original YOLOv5 network to obtain an improved YOLOv5 network; wherein the SA module comprises a feature grouping module, a channel attention module, a spatial attention module and an aggregation module;

a feature grouping module, configured to input the picture samples in the data sets into the improved YOLOv5 network, take the output of the first CBS module as an input feature, group it along the channel dimension, and divide each group into two branches;

a channel attention module, configured to extract global information from one branch and apply an activation to obtain a channel attention feature;

a spatial attention module, configured to perform intra-group normalization on the other branch and generate a spatial attention feature from the spatial relationship;

an aggregation module, configured to aggregate the channel attention feature and the spatial attention feature along the channel dimension to obtain an aggregated feature;

a training module, configured to input the aggregated feature into the second CBS module of the improved YOLOv5 network and continue training to obtain a first weight model for predicting whether a mask is worn and a second weight model for predicting whether a person is an employee;

and an optimization module, configured to optimize the first weight model and the second weight model with a loss function to obtain an optimized improved YOLOv5 network.

In a third aspect, the present invention further provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the above improved-YOLOv5-based method for detecting employees not wearing masks.

The embodiments of the invention provide a method, based on an improved YOLOv5, for detecting employees who are not wearing masks, and related components. The method comprises the following steps: embedding an SA module between the first CBS module and the second CBS module of the backbone network of the original YOLOv5 network to obtain an improved YOLOv5 network, wherein the SA module comprises a feature grouping module, a channel attention module, a spatial attention module and an aggregation module; training the improved network to obtain a first weight model for predicting whether a mask is worn and a second weight model for predicting whether a person is an employee; and optimizing the first weight model and the second weight model with a loss function to obtain an optimized improved YOLOv5 network. Introducing the attention mechanism into the YOLOv5 network improves the model's ability to extract global features, and thereby improves the inference accuracy and robustness of the model that detects whether employees wear masks.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flow chart of the improved-YOLOv5-based method for detecting employees not wearing masks according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the algorithm of the improved-YOLOv5-based method for detecting employees not wearing masks according to an embodiment of the present invention;

Fig. 3 is a diagram of the improved YOLOv5 network structure according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of the post-processing flow according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

Referring to fig. 1, fig. 1 is a schematic flow chart of the improved-YOLOv5-based method for detecting employees not wearing masks according to an embodiment of the present invention, which includes steps S101 to S107:

S101, collecting picture data, removing the picture data that do not meet the requirements, and labeling the remaining picture data and dividing them into data sets; the data sets include picture data labeled with whether a mask is worn and picture data labeled with whether a person is an employee;

S102, embedding an SA module between the first CBS module and the second CBS module of the backbone network of the original YOLOv5 network to obtain an improved YOLOv5 network; wherein the SA module comprises a feature grouping module, a channel attention module, a spatial attention module and an aggregation module;

S103, inputting the picture samples in the data sets into the improved YOLOv5 network, feeding the output of the first CBS module as an input feature into the feature grouping module, grouping it along the channel dimension, and dividing each group into two branches;

S104, inputting one branch into the channel attention module to extract global information and apply an activation to obtain a channel attention feature, and inputting the other branch into the spatial attention module to perform group normalization and generate a spatial attention feature from the spatial relationship;

S105, inputting the channel attention feature and the spatial attention feature into the aggregation module and aggregating them along the channel dimension to obtain an aggregated feature;

S106, inputting the aggregated feature into the second CBS module of the improved YOLOv5 network and continuing training to obtain a first weight model for predicting whether a mask is worn and a second weight model for predicting whether a person is an employee;

S107, optimizing the first weight model and the second weight model with a loss function to obtain an optimized improved YOLOv5 network.

In the embodiment of the invention, the original YOLOv5 is a regression-based, high-precision, real-time single-stage target detection algorithm proposed in 2020. It integrates the advantages of the previous generations of YOLO and achieves a state-of-the-art balance between speed and precision. The network structure of YOLOv5 consists mainly of a backbone network (backbone), a neck network (neck) and three YOLO detection heads (prediction heads). In real scenes where employees do not wear masks, problems such as object occlusion and multi-scale targets reduce the inference accuracy of the model and make it less robust. Using the original YOLOv5 directly for mask-wearing detection therefore has shortcomings, specifically: the backbone network of YOLOv5 is a CNN, and a CNN has translation invariance and locality but lacks long-range modeling capability. Introducing the attention mechanism in the embodiment of the invention improves the model's ability to extract global features, which brings some improvement on small-target and dense detection tasks.

In step S101, picture data are collected, the picture data that do not meet the requirements are removed, and the remaining picture data are labeled and divided into data sets.

Specifically, as shown in fig. 2, camera video data are first collected from the guard posts of a plurality of residential communities or from other designated locations (such as "happy" guard posts). Picture data are then obtained by extracting frames at regular intervals, and pictures that contain no people or are otherwise unqualified are removed. Finally, the remaining pictures are labeled and divided into data sets. The qualified picture data are divided into two groups: in the first group, each person in the picture is labeled with whether a mask is worn; in the second group, each person in the picture is labeled as an employee or a non-employee. In both groups, 80% of the pictures are used as the training set (i.e., the training data set) and 20% as the test set (i.e., the test data set).
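The 80/20 division described above can be sketched in a few lines. This is only an illustration: the function name `split_dataset`, the fixed random seed and the file names are assumptions, and the text does not state whether the split is randomized.

```python
import random

def split_dataset(items, train_ratio=0.8, seed=0):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    shuffled = list(items)
    # Seeded shuffle so the split is reproducible (an assumption, not from the text).
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical file names standing in for the extracted frames.
frames = [f"frame_{i:04d}.jpg" for i in range(100)]
train_set, test_set = split_dataset(frames)
print(len(train_set), len(test_set))
```

The same function would be applied once to each of the two labeled groups, so that each group has its own training and test sets.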

It should be noted that labeling a picture means marking the specified objects in the picture and attaching the related information to them. Taking the first group of picture data as an example, a rectangular box marks the face of each person and carries the label information indicating whether a mask is worn. This rectangular box contains position information and a label category. Let the rectangular label box of the picture be L:

$$L \in \mathbb{R}^{5} = \{x, y, w, h, c\}$$

where (x, y) are the coordinates of the center point of the rectangular box, w and h are the width and height of the rectangular box, respectively, and c is the category (the category label for a mask not worn is no_mask, and the category label for a mask worn is mask). The same principle applies when labeling the second group of picture data: the class label for employees is staff, and the class label for non-employees is others.
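As an illustration of the label format, the five fields can be held in a small record type. The class name `BoxLabel` and the `corners` helper are assumptions introduced here for clarity; only the fields x, y, w, h, c and the four class names come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BoxLabel:
    x: float  # x coordinate of the box center
    y: float  # y coordinate of the box center
    w: float  # box width
    h: float  # box height
    c: str    # class label: "mask", "no_mask", "staff" or "others"

    def corners(self):
        """Return the box as (x_min, y_min, x_max, y_max)."""
        return (self.x - self.w / 2, self.y - self.h / 2,
                self.x + self.w / 2, self.y + self.h / 2)

label = BoxLabel(x=50.0, y=40.0, w=20.0, h=10.0, c="no_mask")
print(label.corners())
```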

Because the cameras are installed at different distances and heights, the quality of the collected pictures is uneven. The following requirements therefore apply when creating the data sets: 1. the training set and the test set both contain a certain number of pictures of people wearing masks and of people not wearing masks; 2. pictures of people at different distances from the camera are collected in equal proportion; 3. pictures of people facing the camera at different angles are collected in equal proportion. These measures ensure that the data come from a wide range of sources and are closer to real scenes, which in turn helps the trained model generalize well.

In step S102, the embodiment of the present invention uses the YOLOv5 model as the base model; to adapt to complex environments, the original YOLOv5 needs to be modified appropriately. The improved YOLOv5 is trained separately on the training sets of the first and second groups of data to obtain two weight models, and each model is verified on the test set of its own group.

In the embodiment of the invention, the original YOLOv5 model is improved to cope with the complexity of detecting mask wearing in property-management scenes; the network structure of the improved YOLOv5 is shown in fig. 3. The specific improvement is as follows: an SA module (shown by the dashed rectangle at the upper left of fig. 3) is embedded between the first CBS module and the second CBS module of the backbone network of the original YOLOv5 network to obtain the improved YOLOv5 network, wherein the SA module comprises a feature grouping module, a channel attention module, a spatial attention module and an aggregation module.

The essence of the attention mechanism is to focus on the information of interest and suppress useless information. In principle there are two main types: spatial attention and channel attention. Typically, the region of interest is only a small portion of the image; spatial attention extracts the relationships between pairs of pixels, while channel attention models the dependencies among channels. This embodiment introduces an SA module into the model, which brings in spatial attention and channel attention at the same time. Compared with simply fusing the two kinds of attention, SA introduces a channel shuffle operation and applies the spatial and channel attention mechanisms block by block and in parallel. This combines the strengths of the two mechanisms efficiently, saves computing resources, supplies the YOLOv5 network with sufficient global information, and increases the capacity of the network.

The improved part of the YOLOv5 network is described in detail below; for the unmodified parts of the YOLOv5 network, reference may be made to the prior art, and they are not described again here.

In an embodiment, inputting the output of the first CBS module as the input feature into the feature grouping module, grouping along the channel dimension, and dividing into two branches includes:

dividing the input feature $X \in \mathbb{R}^{C \times H \times W}$ into $G$ groups along the channel dimension to obtain $X = [X_1, X_2, \ldots, X_G]$, where each sub-feature $X_k \in \mathbb{R}^{C/G \times H \times W}$, and generating a corresponding importance coefficient for each sub-feature, where C, H and W denote the number of channels, the height and the width of the feature, respectively;

continuing to divide each sub-feature $X_k$ into two branches along the channel dimension to obtain $X_{k1}, X_{k2} \in \mathbb{R}^{C/2G \times H \times W}$.

Of the two branches, one generates the channel attention feature from the relationships between channels, and the other generates the spatial attention feature from the spatial relationships between features, so that the model can attend both to "what" the feature is and to "where" it is.

In an embodiment, inputting one of the branches into the channel attention module to extract global information and apply an activation to obtain a channel attention feature includes:

extracting the global information of one branch through global average pooling (GAP), which shrinks $X_{k1}$ over the spatial dimensions $H \times W$ to generate the channel statistic $s \in \mathbb{R}^{C/2G \times 1 \times 1}$:

$$s = \mathcal{F}_{gp}(X_{k1}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{k1}(i, j)$$

Accurate and adaptive selection is then guided by the sigmoid activation function:

$$X'_{k1} = \sigma(\mathcal{F}_c(s)) \cdot X_{k1} = \sigma(W_1 s + b_1) \cdot X_{k1}$$

where $W_1 \in \mathbb{R}^{C/2G \times 1 \times 1}$ and $b_1 \in \mathbb{R}^{C/2G \times 1 \times 1}$ are trainable parameters, $\sigma$ is the sigmoid activation function, $\mathcal{F}_c$ denotes the fully connected layer, $s$ denotes its input, and $\mathcal{F}_{gp}$ denotes global average pooling.
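The channel attention branch can be illustrated with a minimal pure-Python sketch, assuming the branch is given as a list of channels, each an H x W nested list, and that W_1 and b_1 hold one scalar per channel (as their stated shape C/2G x 1 x 1 suggests). The function names are hypothetical, and a real implementation would use a tensor library.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def channel_attention(x, w1, b1):
    """Gate each channel of x by sigmoid(w * GAP(channel) + b).

    x: list of channels, each an H x W nested list of floats.
    w1, b1: one scalar per channel.
    """
    out = []
    for ch, w, b in zip(x, w1, b1):
        height, width = len(ch), len(ch[0])
        # Global average pooling: one statistic per channel.
        s = sum(sum(row) for row in ch) / (height * width)
        # Per-channel linear map followed by the sigmoid activation.
        gate = sigmoid(w * s + b)
        out.append([[gate * v for v in row] for row in ch])
    return out

x_k1 = [[[1.0, 1.0], [1.0, 1.0]],
        [[2.0, 0.0], [0.0, 2.0]]]
# With zero weights and biases every gate equals sigmoid(0) = 0.5.
y_k1 = channel_attention(x_k1, w1=[0.0, 0.0], b1=[0.0, 0.0])
```

The output has the same shape as the input; only the per-channel scale changes.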

In one embodiment, inputting the other branch into the spatial attention module for intra-group normalization and generating the spatial attention feature from the spatial relationship includes:

performing group normalization on the other branch $X_{k2}$ to obtain spatial statistics and enhance the representation of $X_{k2}$:

$$X'_{k2} = \sigma(W_2 \cdot GN(X_{k2}) + b_2) \cdot X_{k2}$$

where $W_2, b_2 \in \mathbb{R}^{C/2G \times 1 \times 1}$ are trainable parameters, GN denotes group normalization, and $\sigma$ is the sigmoid activation function.

This embodiment applies group normalization (GN) to $X_{k2}$ to obtain spatial statistics, and then uses $\mathcal{F}_c(\cdot)$ to enhance the representation of $X_{k2}$. Here $\mathcal{F}_c$ denotes the fully connected layer, i.e. $\mathcal{F}_c(x) = W_2 x + b_2$, where the input $x$ is $GN(X_{k2})$, the result of applying group normalization to $X_{k2}$.
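A minimal pure-Python sketch of the spatial attention branch follows. It assumes the same nested-list layout (channels of H x W values), one scalar of W_2 and b_2 per channel, and a group normalization without learnable affine parameters; the `num_groups` argument and the `eps` constant are assumptions, since the text does not give them.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def group_norm(x, num_groups, eps=1e-5):
    """Normalize x to zero mean and unit variance within each group of channels."""
    size = len(x) // num_groups
    out = []
    for g in range(num_groups):
        group = x[g * size:(g + 1) * size]
        vals = [v for ch in group for row in ch for v in row]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        std = math.sqrt(var + eps)
        out.extend([[(v - mean) / std for v in row] for row in ch] for ch in group)
    return out

def spatial_attention(x, w2, b2, num_groups):
    """Gate every position of x by sigmoid(w * GN(x) + b)."""
    normed = group_norm(x, num_groups)
    out = []
    for ch, n_ch, w, b in zip(x, normed, w2, b2):
        out.append([[sigmoid(w * n + b) * v for v, n in zip(row, n_row)]
                    for row, n_row in zip(ch, n_ch)])
    return out

x_k2 = [[[1.0, 3.0], [1.0, 3.0]]]
y_k2 = spatial_attention(x_k2, w2=[1.0], b2=[0.0], num_groups=1)
```

Unlike the channel branch, the gate here varies from position to position, because the normalized value at each position feeds the sigmoid.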

In an embodiment, inputting the channel attention feature and the spatial attention feature into the aggregation module and aggregating them along the channel dimension to obtain an aggregated feature includes:

concatenating the channel attention feature and the spatial attention feature along the channel dimension and applying a channel shuffle operator to obtain cross-group information:

$$X'_k = [X'_{k1}, X'_{k2}] \in \mathbb{R}^{C/G \times H \times W}$$

In this step, the channel shuffle operator allows cross-group information to flow along the channel dimension. The final output of the SA module has the same size as its input.
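The channel shuffle operation can be illustrated on a plain list of channels. The sketch below uses the usual formulation (view the channels as a groups x n grid, transpose, flatten); the function name and the choice of two groups in the example are assumptions, since the text does not specify them.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups: reshape to (groups, n), transpose, flatten."""
    total = len(channels)
    if total % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = total // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]

# Channels 0-3 stand for one branch output and 4-7 for the other.
print(channel_shuffle(list(range(8)), 2))  # [0, 4, 1, 5, 2, 6, 3, 7]
```

After the shuffle, channels from the two branches alternate, so later layers see information from both attention types in every neighborhood of channels.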

In this embodiment, an SA module is embedded between the first CBS module and the second CBS module of the backbone network. The improved backbone network enhances the semantic representation of the shallow feature maps and obtains richer feature information over a larger area, further improving the performance of the backbone network. Compared with the original YOLOv5 network, the YOLOv5 network with the SA (shuffle attention) module can capture global feature dependencies across both space and channels and strengthens the interaction of feature information, which effectively improves the network's ability to extract shallow semantic features.

Although the neck network in the original YOLOv5 can enhance the semantic representation of the feature maps to some extent, detecting whether employees wear masks is easily affected by natural scenes, mutual occlusion between people, and the fact that facial features become less distinctive once a mask is worn. As a result, the ability to extract shallow semantic features of multi-scale targets is weak, and the network structure needs to be improved accordingly to increase the depth and capacity of the network. To solve these problems, this embodiment adjusts the structure of the neck network. As shown in fig. 3, in the neck network of the improved YOLOv5 network, the number of CBS modules in the SPPF module is increased from one layer to three layers, and a CBS module is added before the outputs of the fifth and seventh convolution blocks of the backbone network are fed into the neck network.

This embodiment thus makes two structural improvements to the original neck network, specifically: (1) the number of CBS modules in the SPPF module (shown by the top dashed circle in fig. 3) is increased from one layer to three layers, as shown by the dashed circle at the bottom left of fig. 3; (2) a CBS module is added before the outputs of the fifth and seventh convolution blocks of the backbone network are fed into the neck network, as shown by the dashed oval in fig. 3. Compared with the original YOLOv5, the improved neck network increases the capacity and depth of the whole network, obtains a larger receptive field and richer semantic feature information, and further improves the detection performance of the model.

In step S106, the aggregated feature is input into the second CBS module of the improved YOLOv5 network and training continues, yielding a first weight model for predicting whether a mask is worn and a second weight model for predicting whether a person is an employee.

That is, the training sets of the two groups of data obtained earlier (namely, the picture data labeled with whether a mask is worn, and the picture data labeled as employee or non-employee) are input separately into the improved YOLOv5 network for training, yielding the first weight model and the second weight model.

The training process is as follows: each picture contains a number of rectangular label boxes, each carrying its position information and category information. The model optimizes the prediction boxes through the loss function, and as the number of training epochs increases, the prediction boxes gradually approach the ground-truth labels.

In step S107, the loss function adopted in this embodiment consists of three loss terms: the localization loss $L_{box}$, which measures the error between the prediction box and the label box; the classification loss $L_{cls}$, which measures whether the prediction box is assigned the same class as the label box; and the confidence loss $L_{obj}$, which characterizes the confidence of the prediction box, where a larger confidence value means that a target is more likely to be present in the prediction box.

The most common metric for localization loss is the intersection over union (IOU), the ratio of the intersection area to the union area of the ground-truth rectangular box and the predicted rectangular box. The algorithm adopts $L_{GIOU}$ as the localization loss function:

$$L_{GIOU} = 1 - GIOU(truth, pred)$$

$$GIOU(truth, pred) = IOU(truth, pred) - \frac{|C \setminus (truth \cup pred)|}{|C|}$$

where C is the smallest enclosing convex shape that covers both the ground-truth rectangular box (truth) and the predicted rectangular box (pred), and "\" denotes the part of C that is not covered by either the ground-truth box or the predicted box.
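The GIOU localization loss can be sketched for axis-aligned boxes as follows. The corner format (x_min, y_min, x_max, y_max) and the function name are assumptions made for this example, and boxes are assumed to have positive area; for axis-aligned rectangles the smallest enclosing convex shape C is itself an axis-aligned rectangle.

```python
def giou_loss(truth, pred):
    """Return 1 - GIOU for two boxes given as (x_min, y_min, x_max, y_max)."""
    # Intersection rectangle (zero area if the boxes do not overlap).
    ix1, iy1 = max(truth[0], pred[0]), max(truth[1], pred[1])
    ix2, iy2 = min(truth[2], pred[2]), min(truth[3], pred[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_t = (truth[2] - truth[0]) * (truth[3] - truth[1])
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    union = area_t + area_p - inter

    # Smallest axis-aligned rectangle C enclosing both boxes.
    cx1, cy1 = min(truth[0], pred[0]), min(truth[1], pred[1])
    cx2, cy2 = max(truth[2], pred[2]), max(truth[3], pred[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)

    iou = inter / union
    giou = iou - (c_area - union) / c_area
    return 1.0 - giou

print(giou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # 0.0 for identical boxes
```

Unlike a plain IOU loss, the value keeps growing as non-overlapping boxes move further apart, which gives the optimizer a useful signal even when the intersection is empty.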

Both the classification loss and the confidence loss use a binary cross-entropy loss function.
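For reference, binary cross-entropy for a single prediction can be written as below. This is a minimal sketch: the function name and the clamping constant `eps` are assumptions, and practical implementations usually operate on logits over whole batches.

```python
import math

def binary_cross_entropy(p, y, eps=1e-7):
    """Binary cross-entropy between predicted probability p and target y (0 or 1)."""
    # Clamp p away from 0 and 1 so the logarithms stay finite.
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# A confident correct prediction costs less than a confident wrong one.
low = binary_cross_entropy(0.9, 1)
high = binary_cross_entropy(0.1, 1)
```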

Therefore, the overall loss function is as follows:

where N is the number of detection layers, B is the number of targets whose labels are assigned to prior boxes, and S × S is the number of grid cells into which the corresponding scale is divided; $L_{box}$ is the localization loss, which represents the box error; $L_{cls}$ is the classification loss, which indicates whether the classification is correct; $L_{obj}$ is the confidence loss, which represents the confidence; and $\lambda_1$, $\lambda_2$, $\lambda_3$ are the weights of the localization loss, the confidence loss and the classification loss, respectively.
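A form of the overall loss consistent with the symbol definitions above can be written as follows. This is a sketch only: the summation limits, the per-layer subscripts $B_i$ and $S_i$, and the superscript $(j)$ indexing individual terms are assumptions, chosen to match the way YOLOv5-style losses are commonly written.

```latex
L = \sum_{i=1}^{N} \left(
      \lambda_1 \sum_{j=1}^{B_i} L_{box}^{(j)}
    + \lambda_2 \sum_{j=1}^{S_i \times S_i} L_{obj}^{(j)}
    + \lambda_3 \sum_{j=1}^{B_i} L_{cls}^{(j)}
\right)
```

Under this reading, the localization and classification terms are summed over the targets assigned to prior boxes in each detection layer, while the confidence term is summed over every grid cell of that layer.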

Both weight models (the first weight model and the second weight model) are trained for 300 epochs. After each epoch, the test set data are input into the weight model obtained so far for verification, in order to check the performance of the model after each epoch. The verification procedure is as follows: load the test data set, run forward inference with the model to obtain the output, compute the error, use non-maximum suppression to select the box with the highest confidence as the prediction for the current target (deleting other boxes whose overlap with that box exceeds a certain threshold), and save the prediction results.

In an embodiment, the improved-YOLOv5-based method for detecting employees not wearing masks further includes:

inputting a real-time picture into the first weight model and the second weight model to obtain, respectively, first prediction box information indicating whether a mask is worn and second prediction box information indicating whether a person is an employee;

traversing the first prediction box information and the second prediction box information with a nested loop, and, if a first prediction box of class "mask not worn" is contained in a second prediction box of an employee and the distance from the center point of the first prediction box to the center point of the second prediction box is less than half of the height of the second prediction box, recording the first prediction box and the second prediction box;

and saving all recorded first and second prediction boxes to obtain the total prediction box information, which serves as the information on employees not wearing masks.

In this embodiment, frames are extracted in real time from the video stream to obtain pictures without annotation information. The trained first weight model and second weight model each perform real-time inference on the pictures, yielding two groups of inference results. The inference process is as follows: load the weight model and set the related parameters, load the picture data, run forward inference on the picture, use non-maximum suppression to remove redundant boxes, and save the prediction information and results. Take the first group of data, "whether a mask is worn", as an example. The prediction result R is:

$$R^{mask}, R^{no\_mask} = \{x, y, w, h, c, g\}$$

where x, y, w and h are the coordinate information of one of the predicted rectangular boxes obtained by inference, c is the class of the rectangular box (mask for a mask worn, no_mask for a mask not worn), and g is the confidence that the rectangular box belongs to that class.

Non-maximum suppression works as follows. Suppose a picture contains one target object that is to be detected. The algorithm usually predicts several boxes around the target object, most of which are redundant, so it is necessary to judge which rectangular boxes are redundant and remove them. Suppose there are 6 rectangular boxes, sorted by classification probability, whose probabilities of belonging to the target object are, from smallest to largest, A < B < C < D < E < F.

(1) Starting from the rectangular box F with the highest probability, judge for each of A, B, C, D and E whether its overlap (IoU) with F is larger than a set threshold;

(2) Suppose the overlap of B and D with F exceeds the threshold; then discard B and D, and mark F as the first retained rectangular box.

(3) From the remaining rectangular boxes A, C and E, select E, which has the highest probability; then judge the overlap of A and C with E, discard any box whose overlap exceeds the threshold, and mark E as the second retained rectangular box.

(4) Repeat this process until all retained rectangular boxes have been found.
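The four steps above can be sketched as a greedy loop in plain Python. This is an illustrative sketch rather than the implementation used in the embodiment: boxes are assumed to be given as corner coordinates (x1, y1, x2, y2), the threshold of 0.5 is an arbitrary choice, and the function names `iou` and `nms` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS; returns indices of retained boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-probability box still remaining
        keep.append(best)
        # Discard every remaining box that overlaps `best` above the threshold.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Six boxes mirroring the A..F example: B and D overlap F, C overlaps E.
names = ["A", "B", "C", "D", "E", "F"]
boxes = [(50, 50, 60, 60), (0, 1, 10, 11), (21, 20, 31, 30),
         (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 10, 10)]
scores = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
print([names[i] for i in nms(boxes, scores)])
```

With these made-up coordinates, F is retained first and suppresses B and D, E is retained second and suppresses C, and A is retained last because it overlaps nothing.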

After a picture has been processed by the first weight model and the second weight model, two sets of prediction box information are obtained: first prediction box information on whether a mask is worn, and second prediction box information on whether the person is an employee. The first set is obtained by inference with the first weight model (1) and contains the "mask worn or not worn" predictions; the second set is obtained by inference with the second weight model (2) and contains the "employee or non-employee" predictions. The two sets of prediction information then undergo prediction-result post-processing.

As shown in fig. 4, the post-processing procedure is:

1. First, in the two sets of prediction box information, mask all prediction boxes for a worn mask (the mask label) and for other persons (the others label), leaving only the prediction box information for "mask not worn" and "employee".

2. Traverse the prediction boxes for "mask not worn" and "employee" with a two-level nested loop. The "mask not worn" prediction boxes form the outer loop, and the "employee" prediction boxes form the inner loop.

3. If a "mask not worn" prediction box is contained in an "employee" prediction box, and the distance from the center point of the "mask not worn" prediction box to the center point of the "employee" prediction box is less than half the height of the "employee" prediction box, record both the "mask not worn" prediction box and the "employee" prediction box; in all other cases, record nothing.

4. After all prediction box information of the picture has been traversed, mask every prediction box that was not recorded, thereby obtaining the final total prediction box information of the picture.

5. Change the label of the final total prediction boxes to "employee not wearing mask".

In this embodiment, the "mask not worn" prediction boxes cover both employees and non-employees who are not wearing a mask. The prediction boxes of employees not wearing a mask therefore need to be screened out by nested-loop traversal. In pseudocode, this may be expressed as follows:

Input: "mask worn or not worn" and "employee or non-employee" prediction box information

Process:

1. Mask all prediction boxes labeled "mask worn";

2. For each "mask not worn" prediction box:

    For each "employee" prediction box:

        If the "mask not worn" prediction box is contained in the "employee" prediction box, and the distance from the center point of the "mask not worn" prediction box to the center point of the "employee" prediction box is less than half the height of the "employee" prediction box:

            record the pair in the list

3. Mask all other prediction boxes that are not in the list to obtain the total prediction box information of the picture.

4. Set the label of the final total prediction boxes to: no_mask_staff.

Output: final prediction box information: no_mask_staff (if any)
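The pseudocode above can be turned into a short runnable sketch. The version below assumes boxes are given as corner coordinates (x1, y1, x2, y2) and that the second model uses the labels "staff" and "others"; those label strings, the record layout and the function names are hypothetical. The source does not say what happens when one "mask not worn" box matches several "employee" boxes, so this sketch simply records every matching pair.

```python
import math


def center(box):
    """Center point of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def contains(outer, inner):
    """True if `inner` lies entirely inside `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def find_unmasked_staff(mask_preds, staff_preds):
    """Each argument is a list of (label, (x1, y1, x2, y2)) tuples."""
    # Step 1: keep only the "no_mask" and "staff" boxes.
    no_mask = [box for label, box in mask_preds if label == "no_mask"]
    staff = [box for label, box in staff_preds if label == "staff"]
    recorded = []
    # Step 2: outer loop over "no_mask" boxes, inner loop over "staff" boxes.
    for face in no_mask:
        for person in staff:
            half_height = (person[3] - person[1]) / 2
            close = math.dist(center(face), center(person)) < half_height
            if contains(person, face) and close:
                # Steps 3 and 4: only recorded pairs survive, relabeled.
                recorded.append({"label": "no_mask_staff",
                                 "no_mask_box": face,
                                 "staff_box": person})
    return recorded


mask_preds = [("no_mask", (40, 10, 60, 30)),
              ("mask", (140, 10, 160, 30)),
              ("no_mask", (240, 10, 260, 30))]
staff_preds = [("staff", (20, 0, 80, 100)),
               ("others", (220, 0, 280, 100))]
result = find_unmasked_staff(mask_preds, staff_preds)
print(result)
```

In this made-up example only the first "no_mask" box is reported: the second box is a worn mask, and the third belongs to a person labeled "others".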

Finally, check whether the inference result of the picture processed by the above steps contains prediction box information for an employee not wearing a mask. If it does, an alarm is triggered and the alarm information is pushed to on-site staff by short message or similar means, to remind the employee who is not wearing a mask.

The embodiment of the invention also provides a detection device for detecting that a mask is not worn by an employee based on the improved YOLOv5, which comprises:

the preprocessing module is used for acquiring the picture data, removing the picture data which do not meet the requirements, and labeling and dividing the rest picture data into data sets;

the channel attention module is used for extracting global information from one branch and activating it to obtain a channel attention feature;

the aggregation module is used for aggregating the channel attention feature and the spatial attention feature along the channel direction to obtain an aggregated feature;

In one embodiment, the feature grouping module comprises:

a grouping unit for dividing the input feature $X \in \mathbb{R}^{C \times H \times W}$ into K groups along the channel dimension to obtain $X = [X_1, X_2, \ldots, X_K]$, where each $X_k$ is a sub-feature;

a dividing unit for further dividing each sub-feature $X_k$ into two branches along the channel dimension to obtain $X_{k1}$ and $X_{k2}$.

In one embodiment, the channel attention module comprises:

a global average pooling unit for extracting global information of one of the branches by global average pooling, generating channel statistics $s \in \mathbb{R}^{C/2G \times 1 \times 1}$ by shrinking $X_{k1}$ over the spatial dimensions $H \times W$;

an activation unit for performing activation with the sigmoid activation function, the activation being applied to the output of a fully connected layer whose input is s, the channel statistics obtained by global average pooling.

In one embodiment, the spatial attention module comprises:

an intra-group normalization unit for performing group normalization on the other branch $X_{k2}$ to obtain spatial statistics and enhance the representation of $X_{k2}$.

In one embodiment, the aggregation module includes:

a splicing unit for concatenating the channel attention feature and the spatial attention feature along the channel dimension and exchanging cross-group information with a channel shuffle (random channel mixing) operator:

$X'_k = [X'_{k1}, X'_{k2}] \in \mathbb{R}^{C/G \times H \times W}$.
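For reference, if the SA module follows the published Shuffle Attention (SA-Net) formulation, which this text does not state explicitly, the computation for one group could be written as below. The symbols $F_{gp}$ (global average pooling), $F_c$ (the layer applied to s), $GN$ (group normalization), the parameters $W_1, b_1, W_2, b_2$ and the sigmoid $\sigma$ are assumptions taken from that formulation rather than from this document.

```latex
\begin{aligned}
s &= F_{gp}(X_{k1}) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X_{k1}(i,j), \\
X'_{k1} &= \sigma\big(F_c(s)\big)\cdot X_{k1} = \sigma(W_1 s + b_1)\cdot X_{k1}, \\
X'_{k2} &= \sigma\big(W_2 \cdot GN(X_{k2}) + b_2\big)\cdot X_{k2}, \\
X'_k &= [X'_{k1}, X'_{k2}] \in \mathbb{R}^{C/G \times H \times W}.
\end{aligned}
```

Under this reading, each branch has C/2G channels, so concatenating the two branches restores the C/G channels of the original sub-feature before the channel shuffle mixes information across groups.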

In one embodiment, in the neck network of the improved YOLOv5 network, the number of CBS modules in the SPPF module is increased from one layer to three layers; and a CBS module is added before the outputs of the fifth convolution block and the seventh convolution block in the backbone network of the improved YOLOv5 network are input into the neck network.

In an embodiment, the detection device for detecting that an employee does not wear a mask based on improved YOLOv5 further includes:

the real-time detection module is used for inputting a real-time picture into the first weight model and the second weight model to obtain, respectively, first prediction box information on whether a mask is worn and second prediction box information on whether the person is an employee;

the traversal module is used for traversing the first prediction box information and the second prediction box information in a nested loop, and recording the first prediction box and the second prediction box if a first prediction box indicating that no mask is worn is contained in a second prediction box of an employee and the distance from the center point of the first prediction box to the center point of the second prediction box is less than half of the height of the second prediction box;

and the storage module is used for storing all recorded first prediction frames and second prediction frames to obtain total prediction frame information and using the total prediction frame information as the information of staff not wearing the mask.

In one embodiment, the optimization module comprises:

an optimization unit for optimizing using the following loss function:

where N is the number of detection layers, B is the number of targets whose labels are assigned to prior boxes, S × S is the number of grid cells into which the corresponding scale is divided, $L_{box}$ is the localization loss, representing the localization error; $L_{cls}$ is the classification loss, indicating whether the classification is correct; $L_{obj}$ is the confidence loss, representing the confidence; and $\lambda_1$, $\lambda_2$, $\lambda_3$ are the weights of the localization loss, the confidence loss and the classification loss respectively.
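As an illustration of how these terms might combine, one form consistent with the definitions above and with the usual YOLOv5 composite loss is sketched below. The exact summation limits, the subscript i on B and S (one value per detection layer) and the superscript j on the individual loss terms are assumptions.

```latex
Loss = \sum_{i=1}^{N}\left(
  \lambda_1 \sum_{j=1}^{B_i} L_{box}^{\,j}
  + \lambda_2 \sum_{j=1}^{S_i \times S_i} L_{obj}^{\,j}
  + \lambda_3 \sum_{j=1}^{B_i} L_{cls}^{\,j}
\right)
```

In this form the localization and classification terms are summed over the matched targets of each detection layer, while the confidence term is summed over every grid cell of that layer.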

The functions and principles of the functional modules and units in the above device embodiments have been described in detail in the foregoing method embodiments, and thus are not described herein again.

An embodiment of the invention provides a computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the above method for detecting an employee not wearing a mask based on improved YOLOv5.

An embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the above method for detecting an employee not wearing a mask based on improved YOLOv5.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, devices and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both, and that, to clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functions. Whether such functions are implemented as hardware or software depends upon the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementations should not be considered to go beyond the scope of the present invention.

In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative: the division of the units is only a logical division, and other divisions are possible in actual implementation; units having the same function may be grouped into one unit; a plurality of units or components may be combined or integrated into another system; or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electrical, mechanical or other form of connection.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a mechanical hard disk, a solid state disk, a magnetic disk, an optical disk, and various other media capable of storing program code.

While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A detection method for detecting that an employee does not wear a mask based on improved YOLOv5 is characterized by comprising the following steps:

collecting picture data, removing picture data which do not meet requirements, labeling the remaining picture data and dividing it into data sets; the data sets include: picture data labeled with whether a mask is worn and picture data labeled with whether the person is an employee;

inputting one branch into the channel attention module to extract global information, activating to obtain channel attention characteristics, inputting the other branch into the space attention module to perform group normalization, and generating space attention characteristics by using a space relation;

inputting the channel attention feature and the space attention feature into the aggregation module to aggregate along the channel direction to obtain an aggregation feature;

2. The improved YOLOv5-based detection method for detecting the absence of masks by employees as claimed in claim 1, wherein the inputting the output of the first CBS module as the input feature into the feature grouping module for grouping along the channel dimension and dividing into two branches comprises:

dividing the input feature $X \in \mathbb{R}^{C \times H \times W}$ into K groups along the channel dimension to obtain $X = [X_1, X_2, \ldots, X_K]$, where each $X_k$ is a sub-feature;

further dividing each sub-feature $X_k$ into two branches along the channel dimension to obtain $X_{k1}$ and $X_{k2}$.

3. The method for detecting the absence of masks of employees based on improved YOLOv5 as claimed in claim 2, wherein the step of inputting one branch into the channel attention module to extract global information and activating to obtain the channel attention feature comprises the steps of:

extracting global information of one branch through global average pooling, generating channel statistics $s \in \mathbb{R}^{C/2G \times 1 \times 1}$ by shrinking $X_{k1}$ over the spatial dimensions $H \times W$;

performing activation with the sigmoid activation function, the activation being applied to the output of a fully connected layer whose input is s, the channel statistics obtained by global average pooling.

4. The method for detecting the mask not worn by the employee based on the improved YOLOv5 as claimed in claim 2, wherein the inputting the other branch into the spatial attention module for the intra-group normalization and generating the spatial attention feature by using the spatial relationship comprises:

5. The method for detecting the mask not worn by the employee based on the improved YOLOv5 as claimed in claim 1, wherein the inputting the channel attention feature and the spatial attention feature into the aggregation module to perform aggregation along a channel direction to obtain an aggregated feature comprises:

$X'_k = [X'_{k1}, X'_{k2}] \in \mathbb{R}^{C/G \times H \times W}$.

6. The improved YOLOv5-based detection method for detecting whether an employee does not wear a mask according to claim 1, wherein in the neck network of the improved YOLOv5 network, the number of CBS modules in the SPPF module is increased from one layer to three layers; and a CBS module is added before the outputs of the fifth convolution block and the seventh convolution block in the backbone network of the improved YOLOv5 network are input into the neck network.

7. The method for detecting whether an employee does not wear a mask according to claim 1, wherein the method further comprises:

inputting the real-time picture into the first weight model and the second weight model to obtain, respectively, first prediction box information on whether a mask is worn and second prediction box information on whether the person is an employee;

8. The method for detecting the mask not worn by the employee based on the improved YOLOv5 as claimed in claim 1, wherein the optimizing the first weight model and the second weight model by using the loss function to obtain the optimized improved YOLOv5 network comprises:

the following loss function was used for optimization:

where N is the number of detection layers, B is the number of targets whose labels are assigned to prior boxes, S × S is the number of grid cells into which the corresponding scale is divided, $L_{box}$ is the localization loss, representing the localization error; $L_{cls}$ is the classification loss, indicating whether the classification is correct; $L_{obj}$ is the confidence loss, representing the confidence; and $\lambda_1$, $\lambda_2$, $\lambda_3$ are the weights of the localization loss, the confidence loss and the classification loss respectively.

9. A device for detecting an employee not wearing a mask based on improved YOLOv5, characterized by comprising:

the feature grouping module is used for inputting the picture samples in the data set into the improved YOLOv5 network, inputting the output of the first CBS module as an input feature into the feature grouping module, grouping the output along the channel dimension, and dividing the output into two branches;

10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the computer program implements the improved YOLOv5 based employee unworn mask detection method of any one of claims 1 to 8.