CN114399799A - Mask wearing detection method based on YOLOv5 network - Google Patents
Mask wearing detection method based on YOLOv5 network
- Publication number
- CN114399799A (application CN202111382657.5A)
- Authority
- CN
- China
- Prior art keywords
- mask
- model
- yolov5
- training
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06N3/045 — Combinations of networks (G—Physics; G06—Computing; G06N—Computing arrangements based on specific computational models; G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks; G06N3/04—Architecture, e.g. interconnection topology)
- G06N3/08 — Learning methods (G—Physics; G06—Computing; G06N—Computing arrangements based on specific computational models; G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks)
Abstract
The invention discloses a mask wearing detection method based on a YOLOv5 network, which comprises the following steps: step one, preprocessing the original pictures by using an image enhancement algorithm and dividing the data set; step two, sending the training set pictures into a YOLOv5 network with an attention mechanism introduced for iterative training, which effectively enhances the extraction of key point information such as the face and the mask; step three, adopting CIOU Loss as the regression loss function of the target box in order to reduce the overall error; and step four, after training is finished, storing the optimal weight model and testing on the test set. The results show that, with image enhancement and the attention mechanism, the improved YOLOv5 model realizes efficient detection of mask wearing: it not only detects the face information but also correctly identifies the wearing state of the mask and gives a corresponding confidence. The model reaches a mask wearing detection accuracy of 92% under conditions of low visibility and weak illumination intensity, which is of practical significance for epidemic prevention and control and for safeguarding public health.
Description
Technical Field
The invention relates to the field of mask wearing detection, in particular to a mask wearing detection method based on a YOLOv5 network.
Background
In certain specific work occasions, workers are required to wear masks to prevent harmful gases, droplets, and viruses from entering the wearer's mouth and nose, and medical workers are generally required to wear masks in hospitals. The mask is a common medical and sanitary product that can reduce the wearer's risk of disease infection.
Masks have attributes such as small target size and diversity, and the diversity of scenes and the complexity of interactions among targets make multi-target tracking research difficult. In addition, mask wearing detection has the following difficulty: under dim conditions, because the illumination intensity is low and the visibility is poor, it is difficult to locate the face accurately, which greatly increases the difficulty of the mask wearing detection task.
Most published target detection methods have achieved good results on mask wearing detection tasks, but detection accuracy still needs to be improved under dim conditions with low visibility and low illumination intensity. The prior art usually adopts a multi-stage method to detect face obstructions; compared with a one-stage method, a multi-stage network is slower in detection and poorer in real-time performance. Alternatively, an image pyramid must be generated at input time, which increases the amount of data fed into the network and slows computation, while the limited range of face sizes affects the recall rate.
Disclosure of Invention
The invention solves the following problems: in scenes with low visibility and dim light, it is difficult to detect targets accurately, to detect a special target group (people not wearing masks), and to feed back detection information in real time. To this end, the invention provides a face mask detection method based on YOLOv5. Through the attention mechanism, the key point features of the face mask can be extracted more accurately, so that the original YOLOv5 network obtains stronger feature expression capability and the detection accuracy is improved.
In order to overcome the defects in the prior art, the invention provides a face mask detection method based on YOLOv5, which comprises the following steps:
the training set images are preprocessed using an image enhancement algorithm and then divided into a training data set and a test data set. The pictures are sent into a YOLOv5 network with the attention mechanism CBAM (Convolutional Block Attention Module) introduced for iterative training, which effectively enhances the extraction of key point information such as the face and the mask. Until the optimal weight model is obtained, the training data set is input into the YOLOv5 network for repeated iterative operation; after training is finished, the optimal weight model is stored and tested on the test set.
Further, the face mask data set to be detected is input into the target YOLOv5 algorithm model, and 32-fold down-sampling is achieved through 1 Focus module and 4 Conv modules. The Focus module performs a slicing operation on the input data: it takes a value at every other pixel of the picture, similar to adjacent down-sampling, splitting the input into 4 pieces of data, each corresponding to a 2-fold down-sampling. The sliced data are concatenated along the channel dimension and a convolution operation is then performed. The image information after the convolution operation enters a C3 module; referring to the CSPNet (Cross Stage Partial Network) structure, the C3 module divides the feature map of the base layer in one stage into two parts, uses a split-and-merge strategy across stages, and integrates the gradient changes into the feature map from beginning to end. The information output from the C3 module is passed to the Head module. In the Head part, high-level feature information is up-sampled and fused with low-level feature information, realizing a top-down information flow. Convolutions with stride 2 then process the result, and the bottom-level features are concatenated with the high-level features, so that bottom-level feature information is easily propagated to the upper layers, realizing the PANet operation.
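The slicing operation of the Focus module can be sketched in a few lines of PyTorch (a minimal illustration of the slice-concatenate-convolve pattern described above, not the patent's code; YOLOv5's actual Focus additionally applies batch normalization and an activation after the convolution, and the C3/PANet parts are not sketched here):

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice the input at every other pixel into 4 sub-images (each a 2-fold
    down-sampling), concatenate them on the channel dimension, then convolve."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch * 4, out_ch, k, stride=1, padding=k // 2)

    def forward(self, x):
        # x: (B, C, H, W) -> sliced: (B, 4C, H/2, W/2)
        sliced = torch.cat([x[..., ::2, ::2],    # even rows, even columns
                            x[..., 1::2, ::2],   # odd rows, even columns
                            x[..., ::2, 1::2],   # even rows, odd columns
                            x[..., 1::2, 1::2]], # odd rows, odd columns
                           dim=1)
        return self.conv(sliced)

# e.g. a 640x640 RGB image becomes a (1, 32, 320, 320) feature map
x = torch.randn(1, 3, 640, 640)
print(Focus(3, 32)(x).shape)  # torch.Size([1, 32, 320, 320])
```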
Further, before inputting the data to be tested into the target YOLOv5 algorithm model, the method further includes: making the data set by combining web crawling and self-shooting, collecting 9000 pictures, classifying the collected picture information, standardizing the data set format, and labeling manually.
Further, the data set is divided into two categories, bad and good: bad indicates that the person does not wear the mask or does not wear it to the standard, and good indicates that the mask is worn correctly.
Further, using image enhancement techniques, image translation, flipping, rotation, and scaling are performed on the sample pictures labeled good, the three color channels are separated, and random noise is added, which effectively alleviates the inter-class imbalance problem.
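Such an augmentation pipeline might look as follows (a sketch assuming torchvision; the parameter values, noise level, and file name are illustrative assumptions, not the patent's settings):

```python
import torch
from torchvision import transforms
from PIL import Image

# Illustrative augmentation for the "good" class: translation, flipping,
# rotation, and scaling, followed by random noise.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),               # flipping
    transforms.RandomAffine(degrees=15,                   # rotation
                            translate=(0.1, 0.1),         # translation
                            scale=(0.8, 1.2)),            # scaling
    transforms.ToTensor(),
])

def add_random_noise(img_tensor, std=0.02):
    """Add Gaussian noise and clamp back to the valid pixel range."""
    return (img_tensor + torch.randn_like(img_tensor) * std).clamp(0.0, 1.0)

img = Image.open("good_sample.jpg").convert("RGB")  # placeholder file name
r, g, b = img.split()                # separate the three color channels
augmented = add_random_noise(augment(img))
```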
Further, standardizing the data set format and labeling manually include: the data set adopts the YOLO format, LabelImg is used for picture annotation, the annotation files take .txt as suffix, and each file name is consistent with its picture name.
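For illustration, each line of a YOLO-format .txt file stores a class index followed by a bounding box normalized to the image size; a minimal parser sketch (the bad = 0 / good = 1 index mapping is an assumption):

```python
def parse_yolo_label(line):
    """Parse 'class x_center y_center width height', with all four box
    coordinates normalized to [0, 1] relative to the image dimensions."""
    cls, xc, yc, w, h = line.split()
    return int(cls), float(xc), float(yc), float(w), float(h)

# e.g. class 1 ("good", assumed mapping) near the image center
print(parse_yolo_label("1 0.48 0.52 0.21 0.34"))
```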
Further, on the basis of the original YOLOv5 network, a convolutional attention module CBAM is introduced. The CBAM comprises two sub-modules, the channel attention module CAM (Channel Attention Module) and the spatial attention module SAM (Spatial Attention Module). The YOLOv5 model with the attention mechanism introduced totals 367 layers and 7,150,056 parameters.
Further, CIOU Loss is adopted in the network structure as the loss function for target box regression.
Further, a binary cross-entropy loss function is used to compute the classification loss and the target confidence loss.
The invention provides a face mask detection method based on YOLOv5, which further comprises the following steps:
when training on the data, the training process of the mask wearing detection network is repeated and its parameters are continuously corrected until the network learns to find the face position in the image and can correctly judge whether the detected face wears a mask; the parameters obtained by training are then saved.
In the network model training phase, the iteration batch size is set to 32 and the total number of iterations to 600. The initial learning rate is set to 0.001, a mini-batch gradient descent method is employed, and an Adam optimizer is used to calculate an adaptive learning rate for each parameter.
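A minimal runnable sketch of this training configuration (the model and data below are toy stand-ins for the YOLOv5 network and the mask data set; only the optimizer settings mirror the text, and the bad = 0 / good = 1 label mapping is assumed):

```python
import torch
import torch.nn as nn

# Dummy model standing in for YOLOv5; the training loop shape is the point.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1),
                      nn.Flatten(),
                      nn.Linear(8 * 64 * 64, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # adaptive per-parameter rates
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(32, 3, 64, 64)      # one mini-batch of size 32
labels = torch.randint(0, 2, (32,))      # bad = 0 / good = 1 (assumed mapping)

for step in range(600):                  # total number of iterations
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()                      # mini-batch gradient descent step
    optimizer.step()
```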
Compared with the prior art, the invention has the beneficial effects that:
1. The attention mechanism introduced by the invention is applied to the feature map; by acquiring the available attention information in the feature map, a better task effect can be achieved.
2. The invention uses CIOU Loss as the loss function for target box regression; CIOU measures the overlap area, the center point distance, and the aspect ratio, so the regression effect of the prediction box is better.
3. The invention adds the CBAM module to the YOLOv5 network. Because the CBAM adds a global max pooling operation in the channel attention module, it can to a certain extent compensate for the information lost by global average pooling. In addition, the resulting two-dimensional spatial map is encoded by a convolutional layer with a kernel size of 7; the larger convolution kernel helps preserve important spatial regions. The YOLOv5 network can thus not only classify and identify targets more accurately, but also locate their positions more precisely.
In conclusion, with image enhancement and the attention mechanism, the improved YOLOv5 model realizes efficient detection of mask wearing and makes a correct judgment on whether the mask is worn correctly. Under dim conditions with low visibility and low illumination intensity, the introduced attention mechanism extracts the key point features of the face mask more accurately, improving the detection accuracy and the mask detection efficiency. The method also improves the loss function of the YOLOv5 network accordingly; it has strong robustness and expandability and can basically meet the real-time requirements of video images.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a full flow chart of the mask detection mechanism;
FIG. 2 is a diagram of a YOLOv5 network architecture;
FIG. 3 is a diagram of a CBAM network architecture;
FIG. 4 is a diagram of a CAM structure;
FIG. 5 is a SAM structure diagram;
FIG. 6 is a partial picture of a data set;
FIG. 7 is a data set category distribution plot;
FIG. 8 is a graph of a visualization analysis of an image enhanced data set;
FIG. 9 is a diagram showing an example of actual scene comparison of mask detection effects.
Detailed Description
In order that the objects and advantages of the invention will be more clearly understood, the invention is further described in connection with the following examples, it being understood that the specific examples set forth herein are illustrative of the invention only and are not intended to be limiting.
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and do not limit the scope of the present invention.
It should be understood that the step numbers used in the present invention are only for convenience of description and are not used as a limitation to the order of execution of the steps.
It is to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. It is to be understood that the terms "a", "an" and "the" are intended to cover the plural forms as well.
It should be understood that in the description of the present invention, the term "comprises/comprising" indicates the presence of the described features, modules, structures, elements, operations, but does not exclude the presence or addition of other features, modules, structures, elements, operations.
Referring to fig. 1, an embodiment of the present invention provides a face mask detection method based on YOLOv5, including:
step one, classifying a data set which is manufactured by combining network crawling and self-shooting and contains 9000 pictures, standardizing the format of the data set, manually marking and the like.
Step two, training on the preprocessed data, with the iteration batch size set to 32, the total number of iterations set to 600, and the initial learning rate set to 0.001, using a mini-batch gradient descent method; target localization, classification, and feature extraction are then carried out.
Step three, updating the network parameters by using an Adam optimizer.
Step four, judging whether the iterative training is finished and the optimal model obtained, where the ratio of the training set to the test set is 8:1. If not, the model returns for further training. If the iterative training is finished and the optimal model is obtained, test data are input to detect the type and position of the target, and the detection result is finally obtained and output.
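The 8:1 division can be sketched as follows (file names are placeholders, not the patent's data):

```python
import random

def split_dataset(paths, train_ratio=8, test_ratio=1, seed=0):
    """Shuffle picture paths and split them in the 8:1 train/test proportion."""
    rng = random.Random(seed)
    shuffled = paths[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * train_ratio // (train_ratio + test_ratio)
    return shuffled[:cut], shuffled[cut:]

paths = [f"img_{i:04d}.jpg" for i in range(9000)]   # placeholder names
train, test = split_dataset(paths)
print(len(train), len(test))  # 8000 1000
```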
In one embodiment, it should be noted that the data set is acquired by combining web crawling and self-shooting, wherein 80% of the pictures come from the web and 20% from actual shooting. The actual shooting mainly captures mask wearing pictures under dim conditions: during the experiments, pictures were collected in dimly lit places such as corridors and rooms, and were also taken in weak-light environments such as evening and early morning. The experimental parameters are shown in the table:
in one embodiment, the training effect is improved by optimizing the loss function.
Specifically, before explaining the YOLOv5 algorithm model, the loss function adopted by its structure is explained first. In the present invention, a binary cross-entropy loss function is employed to compute the classification loss and the target confidence loss. The confidence loss used herein is as follows:

$$L_{conf} = -\sum_{i=0}^{K\times K}\sum_{j=0}^{M} I_{ij}^{obj}\left[\hat{C}_i \log C_i + (1-\hat{C}_i)\log(1-C_i)\right] - \lambda_{no\text{-}obj}\sum_{i=0}^{K\times K}\sum_{j=0}^{M} I_{ij}^{no\text{-}obj}\left[\hat{C}_i \log C_i + (1-\hat{C}_i)\log(1-C_i)\right]$$

It should be noted that, in the loss function formula, $K$ indicates that the feature map finally output by the network is divided into $K \times K$ grids, $M$ represents the number of anchor boxes corresponding to each grid, $I_{ij}^{obj}$ denotes an anchor box containing a target, $I_{ij}^{no\text{-}obj}$ denotes an anchor box without a target, and $\lambda_{no\text{-}obj}$ is the confidence loss weight coefficient of anchor boxes without a target.
Further, through comparative experiments, CIOU Loss is used herein as the loss function of the target box regression, as follows:

$$L_{CIoU} = 1 - IoU + \frac{d_1^2}{d_2^2} + \alpha v, \qquad v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2, \qquad \alpha = \frac{v}{(1 - IoU) + v}$$

wherein $d_1$ represents the Euclidean distance between the center points of the prediction box and the target box, $d_2$ represents the diagonal distance of the minimum enclosing rectangle, and $w^{gt}/h^{gt}$ and $w/h$ respectively represent the aspect ratios of the target box and the prediction box.
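A sketch of this CIoU computation for boxes in (x1, y1, x2, y2) corner format, following the formula above (an illustration, not the patent's implementation):

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss for boxes given as (x1, y1, x2, y2) tensors of shape (N, 4):
    overlap area, center point distance, and aspect ratio terms."""
    # intersection over union
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # d1^2: squared Euclidean distance between the two box centers
    d1_sq = ((pred[:, 0] + pred[:, 2]) - (target[:, 0] + target[:, 2])) ** 2 / 4 + \
            ((pred[:, 1] + pred[:, 3]) - (target[:, 1] + target[:, 3])) ** 2 / 4
    # d2^2: squared diagonal of the minimum enclosing rectangle
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    d2_sq = cw ** 2 + ch ** 2 + eps

    # aspect-ratio consistency term v and trade-off weight alpha
    v = (4 / math.pi ** 2) * (
        torch.atan((target[:, 2] - target[:, 0]) / (target[:, 3] - target[:, 1] + eps))
        - torch.atan((pred[:, 2] - pred[:, 0]) / (pred[:, 3] - pred[:, 1] + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return (1 - iou + d1_sq / d2_sq + alpha * v).mean()

pred = torch.tensor([[10., 10., 50., 60.]])
target = torch.tensor([[12., 15., 48., 58.]])
print(ciou_loss(pred, target))
```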
In an implementation case, a convolutional attention module CBAM is introduced on the basis of the original YOLOv5 network. It should be noted that the CBAM includes two sub-modules, the channel attention module CAM and the spatial attention module SAM. The CAM summarizes channel attention information: for any given intermediate feature $F \in \mathbb{R}^{C\times H\times W}$, it compresses the feature map in the spatial dimension using a global max pooling and a global average pooling over width and height, obtaining the two feature descriptors $F_{max}^c$ and $F_{avg}^c$. The two descriptors share a two-layer neural network (MLP), in which the first layer has $C/r$ neurons ($r$ is the reduction ratio) with a ReLU activation function and the second layer has $C$ neurons. The two MLP outputs are added element-wise and normalized by a Sigmoid activation function to obtain the final channel attention map $M_c \in \mathbb{R}^{C\times 1\times 1}$.

The SAM collects spatial attention information: it mainly focuses on the position information of the target in the image and takes the output feature map of the CAM as the input feature map of this module. First, a channel-wise global max pooling and a channel-wise global average pooling yield the two feature maps $F_{max}^s$ and $F_{avg}^s$, which are concatenated along the channel dimension; a $7 \times 7$ convolution operation then generates the spatial attention map $M_s \in \mathbb{R}^{1\times H\times W}$.
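A minimal PyTorch reconstruction of the CAM and SAM structure described above (a sketch, not the patent's code; the reduction ratio r = 16 follows a common CBAM default):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CAM: global max and global average pooling over H, W; a shared
    two-layer MLP (C -> C/r -> C with ReLU); outputs summed and passed
    through a sigmoid to give Mc of shape (B, C, 1, 1)."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // r, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """SAM: channel-wise max and mean maps concatenated, then a 7x7
    convolution and a sigmoid give Ms of shape (B, 1, H, W)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        mx, _ = torch.max(x, dim=1, keepdim=True)
        avg = torch.mean(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([mx, avg], dim=1)))

class CBAM(nn.Module):
    """Apply channel attention first, then spatial attention."""
    def __init__(self, channels, r=16, kernel_size=7):
        super().__init__()
        self.cam = ChannelAttention(channels, r)
        self.sam = SpatialAttention(kernel_size)

    def forward(self, x):
        x = x * self.cam(x)
        return x * self.sam(x)

f = torch.randn(1, 64, 40, 40)
print(CBAM(64)(f).shape)  # torch.Size([1, 64, 40, 40])
```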
In one embodiment, the iteration batch size is preferably 32, the total number of iterations is preferably 600, and the initial learning rate is preferably 0.001, so as to improve the training effect and completeness.
In one embodiment, the model evaluation indexes are the mean Average Precision (mAP), the Recall, and the Precision. It should be noted that the mean Average Precision is the sum of the Average Precisions (AP) of all classes in the dataset divided by the number of classes $N$, where the AP of a class is the area under its P-R curve:

$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$

The Recall is the probability that a positive sample is predicted correctly by the model, where TP represents the number of positive samples predicted as positive and FN represents the number of positive samples predicted as negative:

$$Recall = \frac{TP}{TP + FN}$$

The Precision is the number of samples correctly predicted as positive divided by the total number of samples predicted as positive, where FP represents the number of negative samples predicted as positive:

$$Precision = \frac{TP}{TP + FP}$$
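These indexes follow directly from the counts defined above; a small sketch with illustrative numbers (not the patent's measured values):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def mean_average_precision(ap_per_class):
    """mAP: sum of per-class APs (P-R curve areas) divided by class count."""
    return sum(ap_per_class) / len(ap_per_class)

print(precision_recall(tp=90, fp=5, fn=10))    # (0.947..., 0.9)
print(mean_average_precision([0.99, 0.97]))    # 0.98
```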
In one embodiment, after model training is completed, the obtained model is compared with the mask detection models of the method in the reference and of the AIZOO method, with comparison experiments performed under illumination intensities of 30-75 Lux (dim), 75-250 Lux (fairly dark), and 250-1000 Lux (normal illumination), respectively. It should be noted that the illumination intensity refers to the energy of visible light received per unit area; it is generally used to indicate the strength of illumination and the degree to which the surface of an object is illuminated, in units of Lux. The greater the illumination intensity, the stronger the illumination and the brighter the illuminated object surface.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.
Claims (17)
1. A mask wearing detection method based on YOLOv5 is characterized by comprising the following steps:
step one, preprocessing the training set pictures by using an image enhancement algorithm and then dividing them into a training data set and a test data set;
step two, sending the pictures into a YOLOv5 network with the attention mechanism CBAM (Convolutional Block Attention Module) introduced for iterative training, thereby effectively enhancing the extraction of key point information such as the face and the mask;
step three, inputting the training data set into the YOLOv5 network for repeated iterative operation until the optimal weight model is trained; and
step four, after the training is finished, storing the optimal weight model and testing on the test set.
2. The mask wearing detection method based on YOLOv5 as claimed in claim 1, further comprising, before inputting the training set picture data into the target YOLOv5 model: making the data set by combining web crawling with self-shooting, wherein 80% of the pictures originate from the web and 20% from actual shooting. The actual shooting mainly captures mask wearing pictures under dim conditions; during the experiments, pictures were collected in dimly lit places such as corridors and rooms, and were also taken in weak-light environments such as evening and early morning.
3. The YOLOv5-based mask wearing detection method of claim 1, wherein the training set pictures are preprocessed using an image enhancement algorithm and the data set is divided. The experimental data set contains a total of 9000 pictures, which are manually labeled. The labels of this data set are divided into two categories, bad and good, where bad indicates that the person does not wear the mask or does not wear it to specification, and good indicates that the mask is worn correctly. The data set adopts the YOLO format, LabelImg is used for picture annotation, the annotation files take .txt as suffix, and each file name is consistent with its picture name.
4. The method as claimed in claim 3, wherein, in order to solve the slight inter-class imbalance of the data set, the image enhancement algorithm performs image translation, flipping, rotation, and scaling on the sample pictures labeled good, separates the three color channels, and adds random noise, thereby effectively alleviating the inter-class imbalance problem; finally, a visual analysis is performed on the class distribution of the image-enhanced data set.
5. The mask wearing detection method according to claim 1, wherein the convolutional attention module CBAM comprises two sub-modules, namely a channel attention module CAM (Channel Attention Module) and a spatial attention module SAM (Spatial Attention Module).
6. The CBAM of claim 5, wherein the CAM mainly focuses on the target class, so that more useful information can be obtained from the picture. Unlike the channel attention, the SAM mainly focuses on the position information of the target in the image and takes the output feature map of the CAM as the input feature map of the next step, so as to classify and identify the target more accurately and locate its position precisely.
7. The mask wearing detection method based on YOLOv5 according to claim 1, wherein the YOLOv5 model with the attention mechanism introduced has a total of 367 layers and 7,150,056 parameters.
8. The mask wearing detection method based on YOLOv5 of claim 1, wherein, in the network model iterative training phase, the data set is fed into the model in batches, with the iteration batch size set to 32 and the total number of iterations to 600. The initial learning rate is set to 0.001, a mini-batch gradient descent method is employed, and an Adam optimizer is used to calculate an adaptive learning rate for each parameter. After approximately 350 iterations, the model begins to converge gradually.
9. The mask wearing detection method based on YOLOv5 as claimed in claim 1, wherein the pictures are sent into the YOLOv5 network with the attention mechanism CBAM introduced for iterative training, and the trained network model is evaluated with the following indexes: mean Average Precision (mAP), Recall, and Precision.
10. The evaluation of the trained network model according to claim 9, wherein the three evaluation indexes are calculated as follows: when the number of iterations approaches 400, the mean Average Precision approaches 0.996; when the number of iterations approaches 450, the Recall approaches 1; and when the number of iterations approaches 500, the Precision approaches 0.995.
11. The mask wearing detection method based on YOLOv5 of claim 1, wherein CIOU Loss is used as the loss function of the target box regression in the process of training the network model. The model loss function is composed of the Classification Loss, the Localization Loss, and the target Confidence Loss. The invention uses a binary cross-entropy loss function to calculate the classification loss and the target confidence loss.
12. The model loss function of claim 11, wherein, to reduce errors, the localization loss is calculated by CIOU and the loss function is minimized by adjusting the parameters through gradient back propagation, so that an optimized model is finally obtained and the overall error is reduced.
13. The YOLOv5-based mask wearing detection method of claim 1, wherein, in the process of training the optimal weight model, the training process of the network is repeated and the parameters of the mask wearing detection network are continuously corrected until the network learns to find the face position in the image and can correctly judge whether the detected face wears a mask. The parameters obtained by training are saved; that is, after the training is finished, the optimal weight model is stored and tested on the test set, and the obtained model is compared with the mask detection model of the AIZOO method under different dim conditions.
14. The comparative experiment between the obtained model and the mask detection model of the AIZOO method as claimed in claim 13, wherein the physical quantity measuring the degree of dimness is the illumination intensity, i.e., the energy of visible light received per unit area, which indicates the strength of the illumination and the degree to which the surface of an object is illuminated, in units of Lux; the greater the illumination intensity, the stronger the illumination and the brighter the illuminated object surface. The criteria for the different dim conditions are specifically: 30-75 Lux (dim), 75-250 Lux (fairly dark), and 250-1000 Lux (normal illumination).
15. The comparative experiment between the obtained model and the mask detection model of the AIZOO method according to claim 13, wherein, within a certain range of illumination intensity, the mask wearing detection accuracy of both the method of the present invention and the AIZOO method gradually increases as the illumination intensity increases. Under the two dim conditions of 30-75 Lux and 75-250 Lux, the improved YOLOv5 model still realizes efficient detection of mask wearing; under dim light, the detection precision of the method is 9.3% higher than that of the AIZOO method.
16. The comparative experiment between the obtained model and the mask detection model of the AIZOO method according to claim 13, wherein the experimental results show that, under dim conditions with low visibility and low illumination intensity, the method of the present invention improves the picture quality by image enhancement and then, by means of the attention mechanism, extracts the key point features of the face mask more accurately than the AIZOO method, so that the accuracy of the detection results is significantly improved; the method has strong robustness and expandability and can basically meet the real-time requirements of video images.
17. The mask wearing detection method based on YOLOv5 according to claim 1, wherein the experimental environment uses the Ubuntu 18.04 operating system, adopts the PyTorch framework, and runs on a GeForce GTX 1080Ti graphics card with 11 GB of video memory.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111382657.5A CN114399799A (en) | 2021-11-22 | 2021-11-22 | Mask wearing detection method based on YOLOv5 network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111382657.5A CN114399799A (en) | 2021-11-22 | 2021-11-22 | Mask wearing detection method based on YOLOv5 network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114399799A true CN114399799A (en) | 2022-04-26 |
Family
ID=81225282
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111382657.5A Pending CN114399799A (en) | 2021-11-22 | 2021-11-22 | Mask wearing detection method based on YOLOv5 network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114399799A (en) |
- 2021-11-22: CN application CN202111382657.5A, publication CN114399799A (en), status: active, Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200042819A1 (en) * | 2017-08-29 | 2020-02-06 | Seoul National University R&Db Foundation | Attentive memory method and system for locating object through visual dialogue |
CN112287827A (en) * | 2020-10-29 | 2021-01-29 | 南通中铁华宇电气有限公司 | Complex environment pedestrian mask wearing detection method and system based on intelligent lamp pole |
CN112819804A (en) * | 2021-02-23 | 2021-05-18 | 西北工业大学 | Insulator defect detection method based on improved YOLOv5 convolutional neural network |
CN113553977A (en) * | 2021-07-30 | 2021-10-26 | 国电汉川发电有限公司 | Improved YOLO V5-based safety helmet detection method and system |
CN113610050A (en) * | 2021-08-26 | 2021-11-05 | 齐鲁工业大学 | Mask wearing real-time detection method based on YOLOv5 |
Non-Patent Citations (2)
Title |
---|
RAN Pengfei: "Mask wearing detection based on deep learning under complex illumination" (in Chinese) *
GUO Lei et al.: "Mask wearing detection under dim light conditions based on an attention mechanism" (in Chinese) *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113553922A (en) * | 2021-07-05 | 2021-10-26 | 安徽中医药大学 | Mask wearing state detection method based on improved convolutional neural network |
CN114863534A (en) * | 2022-05-24 | 2022-08-05 | 浙江九州云信息科技有限公司 | Face recognition and safety helmet wearing recognition method |
CN115205292A (en) * | 2022-09-15 | 2022-10-18 | 合肥中科类脑智能技术有限公司 | Distribution line tree obstacle detection method |
CN115205292B (en) * | 2022-09-15 | 2022-11-25 | 合肥中科类脑智能技术有限公司 | Distribution line tree obstacle detection method |
CN115909457A (en) * | 2022-11-23 | 2023-04-04 | 大连工业大学 | Mask wearing detection method based on polarization imaging AI recognition |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107609601B (en) | Ship target identification method based on multilayer convolutional neural network | |
CN114399799A (en) | Mask wearing detection method based on YOLOv5 network | |
Nguwi et al. | Detection and classification of road signs in natural environments | |
CN109684906B (en) | Method for detecting red fat bark beetles based on deep learning | |
CN108389220B (en) | Remote sensing video image motion target real-time intelligent cognitive method and its device | |
CN114220035A (en) | Rapid pest detection method based on improved YOLO V4 | |
CN109241913A (en) | In conjunction with the ship detection method and system of conspicuousness detection and deep learning | |
JP2017016593A (en) | Image processing apparatus, image processing method, and program | |
Lv et al. | A visual identification method for the apple growth forms in the orchard | |
CN116596875A (en) | Wafer defect detection method and device, electronic equipment and storage medium | |
CN112598031A (en) | Vegetable disease detection method and system | |
CN116385717A (en) | Foliar disease identification method, foliar disease identification device, electronic equipment, storage medium and product | |
CN118314353A (en) | Remote sensing image segmentation method based on double-branch multi-scale feature fusion | |
Yang et al. | Rapid image detection and recognition of rice false smut based on mobile smart devices with anti-light features from cloud database | |
CN118230166A (en) | Corn canopy organ identification method and canopy phenotype detection method based on improved Mask2YOLO network | |
CN113887455A (en) | Face mask detection system and method based on improved FCOS | |
CN112307894A (en) | Pedestrian age identification method based on wrinkle features and posture features in community monitoring scene | |
CN111797795A (en) | Pedestrian detection algorithm based on YOLOv3 and SSR | |
Hu et al. | Automatic detection of pecan fruits based on Faster RCNN with FPN in orchard | |
Sun et al. | Flame Image Detection Algorithm Based onComputer Vision. | |
Mokalla et al. | On designing MWIR and visible band based deepface detection models | |
CN114332601A (en) | Picking robot unstructured road identification method based on semantic segmentation | |
CN118447022B (en) | Crop disease spot accurate segmentation and hazard degree assessment system based on deep learning | |
CN115205853B (en) | Image-based citrus fruit detection and identification method and system | |
CN118262258B (en) | Ground environment image aberration detection method and system |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20220426 |