CN113553936A - Mask wearing detection method based on improved YOLOv3 - Google Patents

Mask wearing detection method based on improved YOLOv3

Info

Publication number
CN113553936A
CN113553936A (application CN202110813607.1A)
Authority
CN
China
Prior art keywords
mask
wearing detection
mask wearing
yolov3
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110813607.1A
Other languages
Chinese (zh)
Inventor
刘阳
李莉
彭娜
李冰雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University of Engineering
Original Assignee
Hebei University of Engineering
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University of Engineering filed Critical Hebei University of Engineering
Priority to CN202110813607.1A priority Critical patent/CN113553936A/en
Publication of CN113553936A publication Critical patent/CN113553936A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a mask wearing detection method based on improved YOLOv3, and belongs to the technical field of target detection. The method first acquires a masked-face data set, then constructs a mask wearing detection network based on YOLOv3, trains the network, and selects the optimal model, which is then used to detect mask wearing in dense crowds. The invention uses a channel attention mechanism so that the feature extraction network pays more attention to the associated target areas, and uses the K-means++ algorithm to perform cluster optimization on the mask data set, thereby improving detection efficiency. In addition, the invention optimizes the detection algorithm by taking CIoU as the loss function, which reduces the loss function value and improves the bounding box regression effect.

Description

Mask wearing detection method based on improved YOLOv3
Technical Field
The invention relates to the technical field of target detection, in particular to a mask wearing detection method based on improved YOLOv3.
Background
Since the outbreak of novel coronavirus pneumonia (COVID-19), tertiary industries such as tourism and catering, as well as labor-intensive enterprises, have been forced to delay the resumption of work and production, greatly affecting national economic development and people's daily lives. Research shows that the novel coronavirus is mainly transmitted through droplets and aerosols, that the population is generally susceptible, and that large-scale clustered outbreaks of infection may occur at any time, so wearing a mask in public places has become a necessary means of normalized epidemic control. In densely populated areas such as shopping malls and stations, manually inspecting whether masks are worn consumes considerable manpower and is inefficient.
In recent years, deep convolutional neural networks have made great progress in the field of target detection, and the associated algorithms can be broadly divided into two-stage and single-stage approaches. Two-stage algorithms are mainly represented by the R-CNN series, and single-stage algorithms by the SSD (Single Shot MultiBox Detector) series and the YOLO (You Only Look Once) series. A two-stage algorithm first generates target candidate boxes and then uses a convolutional neural network for feature extraction, classification, and bounding box regression; although its detection precision is excellent, its detection speed is slow and real-time detection cannot be guaranteed. A single-stage algorithm treats target detection as a single regression problem and realizes detection directly through regression, so it is computationally efficient and can run in real time. The YOLOv3 algorithm stands out among these algorithms due to its high speed, high precision, and strong practicability. However, when the YOLOv3 algorithm is applied directly to certain specific scenes, it cannot meet the detection requirements; mask wearing detection in particular is difficult because the scenes are complex, the crowds are dense, pedestrians occupy a small proportion of the image pixels, and the visual differences between mask wearing states are subtle. Therefore, a method that realizes real-time and efficient mask wearing detection is of great significance.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a mask wearing detection method based on improved YOLOv3, which can realize automatic detection of the wearing condition of a mask of a person and has higher detection precision.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
a mask wearing detection method based on improved YOLOv3 comprises the following steps:
step 1, acquiring a mask shielding face data set, carrying out classification marking and format conversion on the data set, and dividing the data set into a training set and a testing set;
step 2, constructing a mask wearing detection network based on YOLOv3;
step 3, using the training set in the step 1, training the mask wearing detection network for multiple times, and adjusting the learning rate parameter of each training to make the loss function converge, wherein each training obtains a mask wearing detection model;
step 4, the plurality of mask wearing detection models obtained in the step 3 are respectively tested by using the test set in the step 1, the accuracy of each mask wearing detection model is recorded, and the optimal model is selected as the final mask wearing detection model;
and 5, carrying out mask wearing detection on the dense population by using the mask wearing detection model selected in the step 4.
Further, the specific manner of step 2 is as follows:
step 201, reconstructing the YOLOv3 feature extraction network by using a multi-scale channel attention mechanism;
step 202, performing target anchor frame clustering on the data set;
step 203, optimize the loss function.
Further, the specific way of step 201 is to embed the SENet channel attention mechanism into the 5 residual network structures of the backbone feature extraction network of YOLOv3, deeply mine the context of the target, emphasize useful detail information, suppress invalid interference information, and complete the reconstruction of the feature extraction network.
Further, in step 202, the K-means++ algorithm is used to optimize the anchor box sizes for the mask occlusion face data set, thereby improving the detection efficiency.
Further, in step 203, a bounding box regression is performed using the CIoU loss function, so as to improve the positioning accuracy.
As can be seen from the above description, the technical scheme of the invention has the beneficial effects that:
1. aiming at the problem of insufficient feature extraction capability of the original YOLOv3, the method utilizes a channel attention mechanism to enable the feature extraction network to have higher attention to the associated target area, so that the feature extraction capability of the network is improved.
2. Aiming at the problem that the target size in the mask data set is small and the prior frame of the public data set is not suitable any more, the mask data set is subjected to cluster optimization by using a K-means + + algorithm, and the most appropriate anchor frame size is selected, so that the detection effect is optimized, the model convergence speed is accelerated, and the detection efficiency can be improved.
3. Aiming at the problems that IoU (Intersection over Union), the evaluation standard of the detection effect in the original YOLOv3 algorithm, is insensitive to the target object scale and cannot accurately reflect the overlap between a prediction box and a real box, the invention optimizes the detection algorithm by taking CIoU (Complete-IoU) as the loss function, so that the loss function value can be reduced and the bounding box regression effect can be improved.
In summary, by adopting the above three measures, the detection precision of the mask wearing detection task can be improved in dense-crowd scenes.
Drawings
In order to more clearly describe this patent, one or more of the following figures are provided.
FIG. 1 is a diagram of the SENet (Squeeze-and-Excitation Networks) architecture.
FIG. 2 is a diagram of the SE-Res structure.
FIG. 3 is a schematic diagram of the mask wearing detection model according to an embodiment of the present invention.
FIG. 4 is a graph of the visualized clustering results of the RMFD data set.
Detailed Description
To help those skilled in the art understand the technical solutions of this patent, the technical solutions are further described below with reference to a specific embodiment.
A mask wearing detection method based on improved YOLOv3 comprises the following steps:
step 1: and acquiring an open mask shielded Face data set RMFD (Real-World Masked Face Dataset), carrying out classification marking and format conversion on the data set, and dividing the data set into a training set and a testing set.
Step 2: the mask wearing detection network is constructed in the following specific mode:
step 201: a multi-scale channel attention mechanism SENet structure was built as shown in figure 1. In the figure, the C ' W ' H ' feature layers X are subjected to a switching operation FtrObtaining C characteristic layers U of W and H, and realizing the process as shown in the formula (1):
u_c = v_c * X = Σ_{s=1}^{C′} v_c^s * x^s    (1)

where u_c denotes the c-th two-dimensional matrix in the feature map U, v_c denotes the c-th convolution kernel, and x^s denotes the s-th input channel.
After U is obtained, the Squeeze compression operation compresses the width W and height H of each feature layer with global average pooling, so that the C feature layers are converted into a 1 × 1 × C vector, as shown in equation (2):

z_c = F_sq(u_c) = (1 / (W × H)) Σ_{i=1}^{W} Σ_{j=1}^{H} u_c(i, j)    (2)

where u_c(i, j) denotes the element in row i and column j of the matrix u_c.
The Excitation operation captures the channel dependencies in full, as shown in equation (3):

s = F_ex(z, W) = σ(g(z, W)) = σ(W₂ δ(W₁ z))    (3)

Equation (3) learns the nonlinear interaction between channels, where σ and δ are the Sigmoid activation function and the ReLU function respectively, W₁ is the dimensionality-reduction weight matrix, W₂ is the dimensionality-restoration weight matrix, and s holds the weight of each channel.
Finally, the Scale operation of equation (4) is applied to obtain the final output:

x̃_c = F_scale(u_c, s_c) = s_c · u_c    (4)

where s_c is the weight of the c-th two-dimensional matrix.
The Squeeze operation in SENet compresses the input feature map into a channel-wise one-dimensional vector with global average pooling, giving a global receptive field and a wider perceptive area. Two fully connected layers follow, which reduce the parameter count while the Excitation operation learns the dependencies between channels; the channel weights are then fixed between 0 and 1 by a Sigmoid activation function, and finally the input features are multiplied by these weights to obtain the final output. The SENet structure is embedded into a residual structure (Residual) to complete the construction of the SE-Res module, as shown in FIG. 2. Finally, the 5 residual structures in YOLOv3 are replaced with SE-Res structures to complete the network structure of the mask wearing detection model, as shown in FIG. 3. A 416 × 416 picture is input into the network; after initialization by DBL (convolution, batch normalization, and activation function) layers, it passes through the 5 SE-Res layers for feature extraction, the last three feature layers are taken for the multi-scale feature fusion of the feature enhancement network, and finally prediction boxes at three scales are obtained through the prediction layers.
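As a minimal illustrative sketch (not the patent's actual implementation), the Squeeze, Excitation, and Scale operations of equations (2) to (4) can be written as a NumPy forward pass; the weight matrices W1 and W2 are assumed here to be already-trained parameters of the two fully connected layers:

```python
import numpy as np

def se_block(U, W1, W2):
    """Forward pass of a Squeeze-and-Excitation block (eqs. (2)-(4)).

    U  : feature map of shape (C, H, W)
    W1 : dimensionality-reduction weights, shape (C//r, C)
    W2 : dimensionality-restoration weights, shape (C, C//r)
    """
    # Squeeze (eq. 2): global average pooling -> 1x1xC descriptor z
    z = U.mean(axis=(1, 2))                      # shape (C,)
    # Excitation (eq. 3): s = sigmoid(W2 . relu(W1 . z)), weights in (0, 1)
    relu = lambda x: np.maximum(x, 0.0)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    s = sigmoid(W2 @ relu(W1 @ z))               # shape (C,)
    # Scale (eq. 4): reweight each channel of U by its learned weight s_c
    return U * s[:, None, None]
```

Because every channel weight lies strictly between 0 and 1, the block can only attenuate channels, which is how less informative channels are suppressed relative to the emphasized ones.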
Step 202: and carrying out target anchor frame clustering on the RMFD data set. The target detection network based on the anchor frame needs reasonable anchor frame setting, and if the size of the anchor frame is not consistent with the size of a target, the number of positive samples can be greatly reduced, so that a large number of missed detection and false detection situations occur. YOLOv3 adopts a K-means algorithm to cluster targets in the data set to obtain prior frames with 9 sizes, the prior frames are distributed to 3 different detection layers, the RMFD mask shields the targets in the face data set to be smaller, and the prior frames of the public data set are not suitable any more. The K-means algorithm initialization clustering center is randomly selected from the samples, and the selection of the clustering center has great influence on the clustering result and the running time. The K-means + + algorithm is improved in the aspect of random selection, when the clustering centers are initialized, the distance between the clustering centers is increased as far as possible, and the inter-cluster distance is increased, so that the global optimum is achieved. In order to optimize the detection effect, the dimension and the width and the height of the mask data set are re-optimized and clustered by using a K-means + + clustering algorithm, and the obtained visual clustering result of the RMFD data set is shown in FIG. 4. In fig. 4, the abscissa represents the width of the object, the ordinate represents the height of the object, and the triangle represents the cluster center.
Step 203: the loss function is optimized. The original YOLOv3 algorithm uses L2 norm loss to calculate the regression loss of the bounding box position coordinates, but using IoU as the evaluation criterion of the target detection effect cannot truly reflect the overlapping condition of the prediction box and the real box. In order to solve the problems, a CIoU is introduced as a loss function, the CIoU takes the scale, distance, overlapping rate and punishment items between the target and the anchor frame into consideration, the problems of divergence and the like in the training process like IoU are avoided, and the regression of the target frame becomes more stable. CIoU is represented by formula (5):
CIoU = IoU − ρ²(b, b^gt) / c² − αv    (5)

where b and b^gt denote the center points of the prediction box and the real box respectively, ρ²(b, b^gt) is the squared Euclidean distance between the two center points, and c is the diagonal length of the smallest enclosing box that contains both the prediction box and the real box.
α and v are computed as in equations (6) and (7):

α = v / ((1 − IoU) + v)    (6)

v = (4 / π²) (arctan(W^gt / H^gt) − arctan(W / H))²    (7)

where W^gt and H^gt are the width and height of the real box, and W and H are the width and height of the prediction box.
The CIoU loss function is shown in equation (8):

L_CIoU = 1 − IoU + ρ²(b, b^gt) / c² + αv    (8)
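Equations (5) to (8) can be checked numerically with a small sketch. The corner-coordinate box format (x1, y1, x2, y2) and the small epsilon guard in α are implementation assumptions, not specified in the patent:

```python
import numpy as np

def ciou_loss(pred, gt):
    """CIoU loss (eq. 8) for two boxes given as (x1, y1, x2, y2)."""
    # IoU term
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_p + area_g - inter)
    # center-distance term: rho^2(b, b_gt) / c^2
    pcx, pcy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    gcx, gcy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    rho2 = (pcx - gcx) ** 2 + (pcy - gcy) ** 2
    # c^2: squared diagonal of the smallest enclosing box
    c2 = (max(pred[2], gt[2]) - min(pred[0], gt[0])) ** 2 \
       + (max(pred[3], gt[3]) - min(pred[1], gt[1])) ** 2
    # aspect-ratio consistency term v and its weight alpha (eqs. 6-7)
    w_p, h_p = pred[2] - pred[0], pred[3] - pred[1]
    w_g, h_g = gt[2] - gt[0], gt[3] - gt[1]
    v = (4 / np.pi ** 2) * (np.arctan(w_g / h_g) - np.arctan(w_p / h_p)) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)   # eps guards the 0/0 perfect-match case
    return 1 - iou + rho2 / c2 + alpha * v
```

A perfectly matched box gives a loss of 0, and unlike a pure IoU loss, disjoint boxes still receive a finite, distance-sensitive gradient signal through the ρ²/c² term.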
Step 3: Perform data enhancement on the training set from step 1 by rotation, translation, brightness and contrast adjustment, random cropping, and similar transforms; this increases image diversity, gives the network stronger generalization, and improves the robustness of the model. The images are then fed into the mask wearing detection network for training. The network is optimized with the Adam optimizer; the initial learning rate is set to 0.001 and multiplied by 0.1 every 30 epochs (1 epoch means all samples in the training set are trained once), and the batch size is 12. A total of 150 epochs are trained so that the loss function converges. Each epoch yields a mask wearing detection model, giving 150 mask wearing detection models in total.
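The step-decay schedule described in this embodiment (initial learning rate 0.001, multiplied by 0.1 every 30 epochs over 150 epochs) amounts to the following; the function name is an illustrative choice:

```python
def step_decay_lr(epoch, base_lr=1e-3, drop=0.1, every=30):
    """Step-decay schedule: lr starts at base_lr and is multiplied
    by `drop` once every `every` epochs (epochs counted from 0)."""
    return base_lr * drop ** (epoch // every)
```

Over the 150 training epochs this produces five plateaus, from 1e-3 down to 1e-7, so the later models in the run are fine-tuned with very small updates while the loss converges.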
Step 4: Test the 150 mask wearing detection models with the test set from step 1, record the accuracy of the models under the different learning rate parameters, and select the optimal model as the final mask wearing detection model.
Step 5: Use the mask wearing detection model obtained in step 4 to perform mask wearing detection on dense crowds.
It should be noted that the above embodiment is only one specific example of the implementation of this patent and does not cover all possible implementations, so the scope of protection cannot be regarded as limited to it; all implementations based on the same concept as the above case, and combinations of the above schemes, fall within the protection scope of this patent.

Claims (5)

1. A mask wearing detection method based on improved YOLOv3 is characterized by comprising the following steps:
step 1, acquiring a mask shielding face data set, carrying out classification marking and format conversion on the data set, and dividing the data set into a training set and a testing set;
step 2, constructing a mask wearing detection network based on YOLOv3;
step 3, using the training set in the step 1, training the mask wearing detection network for multiple times, and adjusting the learning rate parameter of each training to make the loss function converge, wherein each training obtains a mask wearing detection model;
step 4, the plurality of mask wearing detection models obtained in the step 3 are respectively tested by using the test set in the step 1, the accuracy of each mask wearing detection model is recorded, and the optimal model is selected as the final mask wearing detection model;
and 5, carrying out mask wearing detection on the dense population by using the mask wearing detection model selected in the step 4.
2. The mask wearing detection method based on the improved YOLOv3 as claimed in claim 1, wherein the specific mode of step 2 is as follows:
step 201, reconstructing the YOLOv3 feature extraction network by using a multi-scale channel attention mechanism;
step 202, performing target anchor frame clustering on the data set;
step 203, optimize the loss function.
3. The mask wearing detection method based on improved YOLOv3 according to claim 2, wherein the specific way of step 201 is to embed the SENet channel attention mechanism into the 5 residual network structures of the backbone feature extraction network of YOLOv3, deeply mine the context of the target, emphasize useful detail information, suppress invalid interference information, and complete the reconstruction of the feature extraction network.
4. The mask wearing detection method based on improved YOLOv3 according to claim 2, wherein in step 202, the K-means++ algorithm is used to optimize the anchor box sizes for the mask occlusion face data set, thereby improving the detection efficiency.
5. The mask wearing detection method based on improved YOLOv3 according to claim 2, wherein in step 203, the CIoU loss function is used to perform bounding box regression, thereby improving the positioning accuracy.
CN202110813607.1A 2021-07-19 2021-07-19 Mask wearing detection method based on improved YOLOv3 Pending CN113553936A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110813607.1A CN113553936A (en) 2021-07-19 2021-07-19 Mask wearing detection method based on improved YOLOv3

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110813607.1A CN113553936A (en) 2021-07-19 2021-07-19 Mask wearing detection method based on improved YOLOv3

Publications (1)

Publication Number Publication Date
CN113553936A true CN113553936A (en) 2021-10-26

Family

ID=78132030

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110813607.1A Pending CN113553936A (en) 2021-07-19 2021-07-19 Mask wearing detection method based on improved YOLOv3

Country Status (1)

Country Link
CN (1) CN113553936A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241548A (en) * 2021-11-22 2022-03-25 电子科技大学 Small target detection algorithm based on improved YOLOv5

Citations (6)

Publication number Priority date Publication date Assignee Title
CN109214399A (en) * 2018-10-12 2019-01-15 清华大学深圳研究生院 A kind of improvement YOLOV3 Target Recognition Algorithms being embedded in SENet structure
CN111062282A (en) * 2019-12-05 2020-04-24 武汉科技大学 Transformer substation pointer type instrument identification method based on improved YOLOV3 model
CN111507199A (en) * 2020-03-25 2020-08-07 杭州电子科技大学 Method and device for detecting mask wearing behavior
CN112215188A (en) * 2020-10-21 2021-01-12 平安国际智慧城市科技股份有限公司 Traffic police gesture recognition method, device, equipment and storage medium
CN112270341A (en) * 2020-10-15 2021-01-26 西安工程大学 Mask detection method integrating transfer learning and deep learning
CN112949572A (en) * 2021-03-26 2021-06-11 重庆邮电大学 Slim-YOLOv 3-based mask wearing condition detection method


Non-Patent Citations (2)

Title
Cao Chengshuo et al.: "Mask Wearing Detection Method Based on the YOLO-Mask Algorithm", Laser & Optoelectronics Progress *
Wang Yihao et al.: "Mask Wearing Detection Algorithm Based on Improved YOLOv3 in Complex Scenes", Computer Engineering *


Similar Documents

Publication Publication Date Title
CN112733749B (en) Real-time pedestrian detection method integrating attention mechanism
CN112966684B (en) Cooperative learning character recognition method under attention mechanism
Yuan et al. Gated CNN: Integrating multi-scale feature layers for object detection
CN105512289B (en) Image search method based on deep learning and Hash
CN111310861A (en) License plate recognition and positioning method based on deep neural network
CN110889449A (en) Edge-enhanced multi-scale remote sensing image building semantic feature extraction method
CN111753828B (en) Natural scene horizontal character detection method based on deep convolutional neural network
CN111275688A (en) Small target detection method based on context feature fusion screening of attention mechanism
CN108805070A (en) A kind of deep learning pedestrian detection method based on built-in terminal
CN112949673A (en) Feature fusion target detection and identification method based on global attention
CN103942557B (en) A kind of underground coal mine image pre-processing method
CN110503063A (en) Fall detection method based on hourglass convolution autocoding neural network
CN110991444A (en) Complex scene-oriented license plate recognition method and device
CN113569881A (en) Self-adaptive semantic segmentation method based on chain residual error and attention mechanism
CN115294563A (en) 3D point cloud analysis method and device based on Transformer and capable of enhancing local semantic learning ability
Zhou et al. Algorithm of Helmet Wearing Detection Based on AT-YOLO Deep Mode.
CN110334584A (en) A kind of gesture identification method based on the full convolutional network in region
CN109447014A (en) A kind of online behavioral value method of video based on binary channels convolutional neural networks
CN113743505A (en) Improved SSD target detection method based on self-attention and feature fusion
Yuan et al. Few-shot scene classification with multi-attention deepemd network in remote sensing
CN112883931A (en) Real-time true and false motion judgment method based on long and short term memory network
CN109711442A (en) Unsupervised layer-by-layer generation fights character representation learning method
CN117079098A (en) Space small target detection method based on position coding
CN115830449A (en) Remote sensing target detection method with explicit contour guidance and spatial variation context enhancement
Hu et al. Deep learning for distinguishing computer generated images and natural images: A survey

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20211026