CN113033375A - Face and mask detection method, system, equipment and medium based on YOLOV3 - Google Patents

Face and mask detection method, system, equipment and medium based on YOLOV3

Info

Publication number
CN113033375A
CN113033375A
Authority
CN
China
Prior art keywords
data
yolov3
algorithm model
face
mask detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110303335.0A
Other languages
Chinese (zh)
Inventor
王健
林浪
王宋凌
张海彬
刘诗伟
王柏芝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China Institute Of Software Engineering Gu
Original Assignee
South China Institute Of Software Engineering Gu
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Institute Of Software Engineering Gu filed Critical South China Institute Of Software Engineering Gu
Priority to CN202110303335.0A priority Critical patent/CN113033375A/en
Publication of CN113033375A publication Critical patent/CN113033375A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/30Noise filtering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/165Detection; Localisation; Normalisation using facial parts and geometric relationships

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Geometry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a face mask detection method, system, equipment and medium based on YOLOV3. The method comprises: inputting data to be detected into a target YOLOV3 algorithm model and performing feature extraction through the DarkNet53 feature extraction network to obtain feature data at different scales, wherein the data to be detected are face images in a uniform format and the target YOLOV3 algorithm model is a YOLOV3 algorithm model combined with an IOU Loss function; inputting the feature data into a feature fusion layer and performing feature fusion through convolution and upsampling; and performing convolution operation on the fused data through an output layer and detecting the bounding boxes of the face image to obtain a face mask detection result. The face mask detection method based on YOLOV3 has the advantages of high detection speed and high result accuracy.

Description

Face and mask detection method, system, equipment and medium based on YOLOV3
Technical Field
The invention relates to the technical field of face mask detection, and in particular to a face mask detection method, system, equipment and medium based on YOLOV3.
Background
At present, wearing a mask when going out has become a basic requirement of daily life. When entering various public places, people can pass security checks only if wearing a mask. For face mask detection, the prior art mainly relies on manual one-by-one inspection, but this approach is prone to missed detections, consumes manpower and material resources, is inefficient, and carries considerable potential safety hazards.
Disclosure of Invention
The invention aims to provide a face mask detection method, system, equipment and medium based on YOLOV3 which, by improving the YOLOV3 algorithm structure, solve the prior-art problems of low face mask detection efficiency and unguaranteed accuracy.
In order to overcome the defects in the prior art, the invention provides a face mask detection method based on Yolov3, which comprises the following steps:
inputting the data to be detected into a target YOLOV3 algorithm model, and performing feature extraction through a DarkNet53 feature extraction network to obtain feature data at different scales; the data to be detected is a face image with a uniform format, and the target YOLOV3 algorithm model is a YOLOV3 algorithm model combined with an IOU Loss function;
inputting the feature data into a feature fusion layer, and performing feature fusion through convolution and upsampling;
and performing convolution operation on the feature-fused data through an output layer, and detecting a bounding box of the face image to obtain a face mask detection result.
Further, before the inputting the data to be tested into the target YOLOV3 algorithm model, the method further includes:
performing detection frame regression analysis by using an IOU Loss function according to a YOLOV3 algorithm model to generate an initial YOLOV3 algorithm model;
and adjusting the learning rate, the number of training iteration rounds and the number of training data set samples of the initial YOLOV3 algorithm model to obtain a target YOLOV3 algorithm model.
Further, the adjusting the learning rate of the initial YOLOV3 algorithm model includes: fitting the learning rate by an optimizer using a decay strategy.
Further, the number of training iteration rounds is 270.
Further, before the inputting the data to be tested into the target YOLOV3 algorithm model, the method further includes:
the method comprises the steps of collecting a face image data set, and carrying out labeling, duplicate removal, data cleaning and normalization processing on the face image data set to obtain a face image with a uniform format.
Further, the normalization processing method comprises a Z-score normalization method.
Further, the face image dataset is acquired using image detection or liveness detection.
The invention also provides a face mask detection system based on YOLOV3, which comprises:
the feature extraction unit is used for inputting the data to be detected into a target YOLOV3 algorithm model, and extracting features through a DarkNet53 feature extraction network to obtain feature data at different scales; the data to be detected is a face image with a uniform format, and the target YOLOV3 algorithm model is a YOLOV3 algorithm model combined with an IOU Loss function;
the feature fusion unit is used for inputting the feature data into a feature fusion layer and carrying out feature fusion through convolution and upsampling;
and the detection unit is used for performing convolution operation on the feature-fused data through an output layer and detecting a bounding box of the face image to obtain a face mask detection result.
The present invention also provides a computer terminal device, comprising:
one or more processors;
a memory coupled to the processor for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the YOLOV3-based face mask detection method as described in any one of the above.
The invention also provides a computer-readable storage medium, on which a computer program is stored, the computer program being executed by a processor to implement the YOLOV3-based face mask detection method as described in any one of the above.
Compared with the prior art, the invention has the beneficial effects that:
Based on the YOLOV3 algorithm combined with the IOU Loss function, the accuracy of the data set can be correspondingly controlled during detection so that the value of the loss function is reduced, the overall learning effect is ultimately improved, and both the working efficiency of detecting whether a face wears a mask and the accuracy of the detection result are further improved.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a face mask detection method based on YOLOV3 according to an embodiment of the present invention;
FIG. 2 is a grid structure diagram of YOLOV3 according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating an image detection method according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a variation process of a learning rate according to an embodiment of the present invention;
FIG. 5 is a diagram of a service architecture under Paddle-Lite according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a face mask detection system based on YOLOV3 according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the step numbers used herein are for convenience of description only and are not intended as limitations on the order in which the steps are performed.
It is to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The terms "comprises" and "comprising" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The term "and/or" refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1, an embodiment of the present invention provides a face mask detection method based on YOLOV3, including:
s10, inputting the data to be tested into a target YOLOV3 algorithm model, and performing feature extraction through a DarkNet53 feature extraction network to obtain feature data in different formats; the data to be detected is a face image with a uniform format, and the target Yolov3 algorithm model is a Yolov3 algorithm model combined with an IOU Loss function;
in this embodiment, it should be noted that the face and mask detection is a technology for determining whether a face wears a mask according to facial features of a person, and the face and mask detection technology is implemented by collecting a face worn with a mask and a face not worn with a mask and integrating the faces into a data set, training the face detection at a mobile terminal and other devices by using a related algorithm, and determining whether the face wears the mask.
Specifically, in step S10, the pre-acquired data to be tested are input into the target YOLOV3 algorithm model to perform the first layer of operations. Before explaining the target YOLOV3 algorithm model, the network structure and the loss function adopted by the model are first explained. As shown in fig. 2, YOLOV3 divides an input image into S × S grid cells, i.e., the input image is mapped into a grid format and the coordinates of the grid cells are marked, so that the position information and categories of all objects in the image can be inferred by scanning the image only once. Each grid cell predicts B bounding boxes, and each bounding box predicts a Location (x, y, w, h), a Confidence Score and the probabilities of C categories, so the number of channels of the output layer of YOLOV3 is B × (5 + C). The loss function of YOLOV3 likewise has three components: Location error, Confidence error and classification error.
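To make the channel arithmetic concrete, the following minimal Python sketch (assuming B = 3 boxes per cell and the C = 80 COCO classes used later in this description; the function name is ours) computes the output-tensor shape at each detection scale:

```python
def yolov3_output_shape(grid_size: int, num_boxes: int = 3, num_classes: int = 80):
    """Shape of one YOLOV3 output feature map.

    Each grid cell predicts `num_boxes` bounding boxes; each box carries
    4 coordinates (x, y, w, h), 1 confidence score and `num_classes`
    class probabilities, hence B * (5 + C) channels per cell.
    """
    channels = num_boxes * (5 + num_classes)
    return (grid_size, grid_size, channels)

# The three detection scales mentioned later in this description:
for s in (13, 26, 52):
    print(s, yolov3_output_shape(s))   # e.g. 13 -> (13, 13, 255)
```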
Further, the network structure of YOLOV3 is composed of a basic feature extraction network, a multi-scale feature fusion layer and an output layer. YOLOV3 uses DarkNet53 as the feature extraction network; DarkNet53 is essentially a fully convolutional network in which the pooling layers are replaced by convolution operations with a stride of 2, and Residual units are added so that vanishing gradients are avoided when the network becomes very deep.
Further, the loss function is crucial to the self-learning effect of the YOLOV3 model. It should be noted that the loss function of YOLOV3 is composed of five parts:

$$
\begin{aligned}
\mathrm{Loss} ={}& \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_{i}-\hat{x}_{i})^{2}+(y_{i}-\hat{y}_{i})^{2}\right] &&(1)\\
&+\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}}\right)^{2}+\left(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\right)^{2}\right] &&(2)\\
&+\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_{i}-\hat{C}_{i}\right)^{2} &&(3)\\
&+\lambda_{noobj}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_{i}-\hat{C}_{i}\right)^{2} &&(4)\\
&+\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\sum_{c\in \mathrm{classes}}\left(p_{i}(c)-\hat{p}_{i}(c)\right)^{2} &&(5)
\end{aligned}
$$

Here, (1) is the loss error of the rectangular-box center point, where $(x_{i}, y_{i})$ are the center coordinates of the predicted rectangular box, $(\hat{x}_{i}, \hat{y}_{i})$ are the center coordinates of the labeled rectangular box, and $\mathbb{1}_{ij}^{obj}$ represents the responsibility of a rectangular box: it is 1 if an object target is detected, and 0 otherwise.
(2) is the predicted-box width and height error, where $(w_{i}, h_{i})$ denote the width and height of the predicted rectangular box and $(\hat{w}_{i}, \hat{h}_{i})$ those of the labeled rectangular box; the sum of the widths and heights of all predicted rectangular boxes is compared against that of the labeled rectangular boxes.
(3) and (4) are the predicted-box confidence loss, where $\hat{C}_{i}$ indicates the probability score that the prediction box contains the target object, and $\mathbb{1}_{ij}^{noobj}$, whose value is determined by whether the $(i, j)$-th rectangular box fulfils the prediction responsibility, is either 1 or 0.
(5) is the predicted-box class loss, where $p_{i}(c)$ is the probability of class $c$ in the $i$-th prediction cell, and the ground-truth value $\hat{p}_{i}(c)$ is likewise only 1 or 0.
It should be noted that the loss function under YOLOV3 uses smooth L1 Loss to perform regression on the detection box, and is divided into three parts: bounding-box mean square error, confidence cross entropy and category cross entropy. Binary cross entropy is used for the detection part; it depicts the distance between two probability distributions, and the smaller the cross-entropy value, the closer the two distributions. Let the two distributions be p and q, where p is the current distribution and q the predicted distribution; the cross entropy of q with respect to p is:

$$H(p,q)=-\sum_{i} p(x_{i})\log q(x_{i}) \qquad (6)$$

This equation is also referred to as the loss-function probability prediction equation of YOLOV3. In the process of calculating with the YOLOV3 function, it was found that the mAP of target detection only reaches about 70-90 without a more accurate value, so the relevant function under YOLOV3 needs to be improved to obtain the target YOLOV3; the main improvement is to use the IOU Loss function instead of the smooth L1 Loss function, as shown in Table 1:
TABLE 1 loss function formula for IOU function
$$IoU=\frac{\left|B\cap B^{gt}\right|}{\left|B\cup B^{gt}\right|},\qquad L_{IoU}=-\ln(IoU)\quad\text{or}\quad L_{IoU}=1-IoU$$
Under this algorithm, regression analysis is performed on the box formed by the 4 points of each of the two frames as a whole: the IoU of the two frames is obtained, and then −ln(IoU) is computed; in actual use, the IoU Loss is often defined as 1 − IoU. Here IoU is the ratio of the intersection to the union of the real frame and the prediction frame; when the two completely overlap, IoU = 1. For the Loss, the smaller the better, since a small loss indicates a high overlap ratio, so the IoU Loss can simply be expressed as 1 − IoU, and the linear regression is then adjusted using the two boxes.
Therefore, in this embodiment, the IOU Loss function is adopted to linearly modify the previous Loss function and obtain the target YOLOV3 algorithm model; the underlying principle is still cross entropy, and the IOU ratio is likewise restricted to [0, 1]. Training and testing on images of multiple scales and multiple aspect ratios can thus be carried out, the IOU loss decreases as the number of iterations increases, and objects in the prediction frame can be analyzed more accurately; the accuracy of the data set is correspondingly controlled, the value of the loss function is reduced, and the overall learning effect is improved.
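As a concrete illustration of the two variants above, here is a minimal Python sketch (an illustrative sketch assuming corner-format boxes, not the patented implementation; the function names are ours):

```python
import math

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_loss(pred, target, variant="linear"):
    """IoU Loss: 1 - IoU (the common 'linear' form) or -ln(IoU)."""
    overlap = iou(pred, target)
    if variant == "log":
        # -ln(IoU) is undefined for IoU == 0, so clamp with a small epsilon
        return -math.log(max(overlap, 1e-9))
    return 1.0 - overlap

print(iou_loss((0, 0, 10, 10), (5, 5, 15, 15)))  # IoU = 1/7, loss ~ 0.857
```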
S20, inputting the feature data into a feature fusion layer, and performing feature fusion through convolution and upsampling;
in this embodiment, to solve the problem that the previous YOLO version is not sensitive to small targets, YOLOV3 uses 3 feature maps with different scales for target detection, 13 × 13, 26 × 26, and 52 × 52, respectively, to detect three targets, i.e., large, medium, and small. The feature fusion layer selects three scale feature maps produced by DarkNet as input, and fuses the feature maps of all scales through a series of convolution layers and upsampling by using the idea of FPN (feature pyramid templates). It should be noted that the 3 different-scale feature maps of the present embodiment, that is, "13 × 13, 26 × 26, 52 × 52" is only a preferred way, and there may be other adaptive options in practical applications, and the present invention is not limited herein.
And S30, performing convolution operation on the data after feature fusion through an output layer, and detecting a bounding box of the face image to obtain a face mask detection result.
In this embodiment, the output layer also uses a fully convolutional structure, where the number of convolution kernels of the last convolution layer is 255, since 3 × (4 + 1 + 80) = 255: 3 indicates that one grid cell contains 3 bounding boxes, 4 indicates the 4 coordinate values of the box, 1 indicates the Confidence Score, and 80 indicates the probabilities of the 80 classes in the COCO data set. It should be noted that 80 may be modified to the actual number of categories if another data set is used instead. After the convolution operation is finished, the bounding boxes of the face image are detected, the output results are drawn with detection frames in different colors and marked with the confidence of the corresponding class, and the results are then classified as mask or no-mask, completing the detection process.
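For reference, the published YOLOV3 rule for converting the raw outputs (t_x, t_y, t_w, t_h) of such an output layer into a bounding box can be sketched in Python as follows; the grid cell, anchor size and input resolution used in the example call are illustrative assumptions:

```python
import math

def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride):
    """YOLOV3 box decoding: center offsets are squashed into the grid cell,
    width/height scale the anchor prior exponentially."""
    bx = (sigmoid(tx) + cx) * stride   # center x in pixels
    by = (sigmoid(ty) + cy) * stride   # center y in pixels
    bw = pw * math.exp(tw)             # width in pixels
    bh = ph * math.exp(th)             # height in pixels
    return bx, by, bw, bh

# One box in cell (6, 4) of a 13x13 map on a 416x416 input (stride 32),
# using an illustrative anchor of 116x90 pixels:
print(decode_box(0.2, -0.1, 0.05, 0.1, cx=6, cy=4, pw=116, ph=90, stride=32))
```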
This embodiment of the invention, based on the YOLOV3 algorithm and combined with the IOU Loss function, can correspondingly control the accuracy of the data set during detection so as to reduce the value of the loss function, ultimately improving the overall learning effect and further improving both the working efficiency of detecting whether a face wears a mask and the accuracy of the detection result.
In a certain embodiment, before step S10, the method further includes acquiring a face image data set, and performing labeling, deduplication, data cleaning, and normalization processing on the face image data set to obtain a face image with a uniform format. In this embodiment, a process of constructing a face mask data set is mainly given:
the collection of the data set pictures is all obtained from a network, and the resolution is larger than or equal to 1920 x 1080. Finally, 7949 effective pictures are obtained through screening and de-weighting, the total marked number is 16635, wherein the marked number of the masks is 7024, and the marked number of the normal is 9611. Screening the collected pictures, deleting the pictures with low resolution and non-conformity, and performing duplicate removal processing on the pictures by using duplicate removal software. And (3) labeling the image in labelimg image labeling software, wherein the labeling items are divided into a mask and a nomask, the mask represents the face with the mask, and the nomask represents the face without the mask.
And then, performing data cleaning on the marked data. Data cleansing is the process of re-examining and verifying data with the aim of deleting duplicate information, correcting existing errors, and providing data consistency. Firstly, data duplication removal is carried out, and a DuplicatePhotoFinder software is used for deleting pictures with the same characteristics (characteristic duplication removal); and then, deleting the noise data, and deleting the noise data from the data after the characteristic de-duplication by a manual selection mode.
And finally, carrying out data normalization operation and unifying the data formats. Through the series of data processing operations, a data set with more effective lattice content can be obtained, and further the training efficiency and effect are greatly improved.
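DuplicatePhotoFinder is an off-the-shelf tool; purely as a hedged illustration of feature deduplication, a simple average-hash filter could be sketched as follows (the hashing scheme and Hamming threshold are our assumptions, not the tool's actual algorithm):

```python
from pathlib import Path
from PIL import Image

def average_hash(path, size=8):
    """64-bit average hash: downscale, grayscale, threshold on the mean."""
    img = Image.open(path).convert("L").resize((size, size))
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    return sum(1 << i for i, p in enumerate(pixels) if p > mean)

def find_duplicates(folder, max_hamming=5):
    """Pair each image with an earlier near-duplicate, if one exists."""
    seen, duplicates = {}, []
    for path in sorted(Path(folder).glob("*.jpg")):
        h = average_hash(path)
        match = next((p for p, h0 in seen.items()
                      if bin(h ^ h0).count("1") <= max_hamming), None)
        if match is not None:
            duplicates.append((path, match))
        else:
            seen[path] = h
    return duplicates
```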
In one embodiment, the normalization is performed by the Z-score standardization method, which standardizes the data using the mean and standard deviation of the raw data. The processed data conform to the standard normal distribution, i.e., a mean of 0 and a standard deviation of 1, with the conversion function:
$$z=\frac{x-\mu}{\sigma}$$
where μ is the mean of all sample data and σ is the standard deviation of all sample data. The Z-score standardization method processes the image so that it is normalized to [−1, 1] rather than limited to [0, 1]; the input can then take positive or negative values, which accelerates model training.
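A minimal NumPy sketch of this preprocessing step follows; computing μ and σ per image is an assumption here, since the description does not state whether the statistics are taken per image or over the whole data set:

```python
import numpy as np

def z_score_normalize(image: np.ndarray) -> np.ndarray:
    """Z-score standardization: zero mean, unit standard deviation.

    Unlike min-max scaling to [0, 1], the result is roughly centered
    around 0 and may be negative, which can speed up training.
    """
    mu = image.mean()
    sigma = image.std()
    return (image - mu) / (sigma + 1e-8)   # epsilon guards flat images

img = np.random.randint(0, 256, (416, 416, 3)).astype(np.float32)
normed = z_score_normalize(img)
print(normed.mean().round(6), normed.std().round(6))  # ~0.0, ~1.0
```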
In one embodiment, the ways of acquiring face images include image detection and living-body detection. Image detection refers to acquiring a picture from a web page or locally and obtaining the face position in the image through a face detection algorithm, attribute detection and feature analysis for detection and comparison: first it is detected whether a face exists in the image and its position is obtained, attribute detection then determines whether the face wears a mask, and the detection result is output by the corresponding algorithm; the specific flow is shown in fig. 3. Living-body detection differs from image detection in that it acquires real-time pictures through the camera of the device, processes each frame in real time, detects the face in the picture, extracts features, performs attribute detection, and outputs the detection result. Living-body detection covers whatever range the device can recognize in real time; as long as a face falls within that range, the system automatically recognizes it and reports the detection result, so it is more flexible than image detection, and the confidence of whether a mask is worn can be displayed on the bounding box, which makes the detection results convenient to analyze.
In addition, during living-body detection the positioning of the acquired dynamic image frames is not very consistent, and the frames can only be positioned accurately if the subject keeps still, so the algorithm needs further optimization and improvement. Likewise, where the video is continuously fed to the model using ajax, each frame triggers one response, and the continuous asynchronous requests from Android can over time cause problems for the server and the model; this problem is addressed as well.
In a certain embodiment, before step S10, the method further includes adjusting a learning rate, a number of training iteration rounds, and a number of training data set samples of the initial YOLOV3 algorithm model to obtain a target YOLOV3 algorithm model.
In the previous embodiments it was explained that the training effect is enhanced by replacing the loss function; here the parameters of the model are optimized to assist detection. First, for the learning rate, an optimizer is used to implement a decay strategy: during training, the learning rate starts small, gradually increases, is held at a fixed value, and is then decayed step by step, as shown in fig. 4. This gives the learning process a rising warm-up phase, prevents overfitting, slows the learning down where needed, reduces the instability problems that occur during learning, and reduces the oscillation of the data training results that appears when learning too fast makes the loss value too large.
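A hedged Python sketch of such a schedule is shown below; the phase boundaries, base rate and decay factor are illustrative assumptions, as the description only specifies the shape of the curve in fig. 4:

```python
def learning_rate(step, base_lr=1e-3, warmup_steps=500,
                  hold_steps=4000, decay_every=1000, decay_factor=0.5):
    """Piecewise schedule: linear warm-up, plateau, then stepwise decay."""
    if step < warmup_steps:                       # small -> gradually increasing
        return base_lr * (step + 1) / warmup_steps
    if step < warmup_steps + hold_steps:          # held at a fixed value
        return base_lr
    decays = (step - warmup_steps - hold_steps) // decay_every + 1
    return base_lr * (decay_factor ** decays)     # step-by-step decay

for s in (0, 250, 500, 3000, 5000, 8000):
    print(s, learning_rate(s))
```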
In one embodiment, the number of training iteration rounds (num_epochs) is preferably 270. Fixing the number of iteration rounds at 270 uses the increased round count to improve the training effect and completeness: the machine repeatedly scans the data set, which improves learning accuracy.
In one embodiment, the training batch size is adjusted according to the size of the data set; when the batch size is set to 32, training produces good results and the gradient is moderate.
In one embodiment, supervised learning is used for training, and comparative analysis is performed on experimental results to illustrate the effect of the invention. Wherein, the experimental parameters are shown in table 2:
TABLE 2 Experimental parameters and values
(table rendered as an image in the original document)
By setting the parameters, the results of YOLOV3 training and YOLOV3 training in combination with IOU Loss were compared, and the results are shown in table 3:
TABLE 3 YOLOV3 training results compared with YOLOV3 combined with IOU Loss
(table rendered as an image in the original document)
As can be seen from the above table, the mAP of YOLOV3 combined with IOU Loss is significantly improved, and the training time is also significantly shortened. In addition, the YOLOV3 algorithm was compared with other algorithms, and the results are shown in table 4:
TABLE 4 Training results of the four algorithms
(table rendered as an image in the original document)
From the above table it can be seen that, among the four algorithms, the accuracy, training speed, mAP and small-object detection accuracy of the YOLOV3 algorithm combined with IOU Loss are all optimal.
Referring to fig. 5, in a certain embodiment, a service architecture diagram under Paddle-Lite is provided. A real-time video monitoring technology can be implemented by using the Paddle-Lite project in combination with an Android platform: after the model is exported and the application data are set, an Android-based real-time video playing and docking technology is formed, yielding an app focused on video and real-time monitoring that implements the face mask detection method based on YOLOV3 provided by the present invention.
Referring to fig. 6, in an embodiment, a face mask detection system based on YOLOV3 is further provided, including:
the feature extraction unit 01 is used for inputting data to be detected into a target YOLOV3 algorithm model, and performing feature extraction through a DarkNet53 feature extraction network to obtain feature data at different scales; the data to be detected is a face image with a uniform format, and the target YOLOV3 algorithm model is a YOLOV3 algorithm model combined with an IOU Loss function;
the feature fusion unit 02 is used for inputting the feature data into a feature fusion layer and performing feature fusion through convolution and upsampling;
and the detection unit 03 is configured to perform convolution operation on the data after feature fusion through an output layer, and detect a bounding box of the face image to obtain a face mask detection result.
In an embodiment, there is also provided a computer terminal device including:
one or more processors;
a memory coupled to the processor for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the YOLOV3-based face mask detection method as described above.
The processor is used for controlling the overall operation of the computer terminal device so as to complete all or part of the steps of the face mask detection method based on the YOLOV 3. The memory is used to store various types of data to support the operation at the computer terminal device, which data may include, for example, instructions for any application or method operating on the computer terminal device, as well as application-related data. The Memory may be implemented by any type of volatile or non-volatile Memory device or combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic disk, or optical disk.
The computer terminal Device may be implemented by one or more Application Specific Integrated Circuits (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and is configured to perform the face mask detection method based on YOLOV3 according to any of the embodiments described above and achieve technical effects consistent with the above methods.
In an embodiment, a computer readable storage medium is further provided, which includes program instructions, when executed by a processor, to implement the steps of the YOLOV 3-based face mask detection method according to any one of the above embodiments. For example, the computer readable storage medium may be the above memory including program instructions, which are executable by a processor of a computer terminal device to perform the face mask detection method based on YOLOV3 according to any of the above embodiments, and achieve the technical effects consistent with the above methods.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A face mask detection method based on YOLOV3 is characterized by comprising the following steps:
inputting the data to be detected into a target YOLOV3 algorithm model, and performing feature extraction through a DarkNet53 feature extraction network to obtain feature data at different scales; the data to be detected is a face image with a uniform format, and the target YOLOV3 algorithm model is a YOLOV3 algorithm model combined with an IOU Loss function;
inputting the feature data into a feature fusion layer, and performing feature fusion through convolution and upsampling;
and performing convolution operation on the feature-fused data through an output layer, and detecting a bounding box of the face image to obtain a face mask detection result.
2. The YOLOV3-based face mask detection method according to claim 1, wherein before the inputting the data to be tested into the target YOLOV3 algorithm model, the method further comprises:
performing detection frame regression analysis by using an IOU Loss function according to a YOLOV3 algorithm model to generate an initial YOLOV3 algorithm model;
and adjusting the learning rate, the number of training iteration rounds and the number of training data set samples of the initial YOLOV3 algorithm model to obtain a target YOLOV3 algorithm model.
3. The YOLOV3-based face mask detection method according to claim 2, wherein the adjusting the learning rate of the initial YOLOV3 algorithm model comprises: fitting the learning rate by an optimizer using a decay strategy.
4. The YOLOV3-based face mask detection method according to claim 2, wherein the number of training iteration rounds comprises 270 rounds.
5. The YOLOV3-based face mask detection method according to claim 1, wherein before the inputting the data to be tested into the target YOLOV3 algorithm model, the method further comprises:
the method comprises the steps of collecting a face image data set, and carrying out labeling, duplicate removal, data cleaning and normalization processing on the face image data set to obtain a face image with a uniform format.
6. The YOLOV3-based face mask detection method according to claim 5, wherein the normalization processing comprises a Z-score normalization method.
7. The YOLOV3-based face mask detection method according to claim 5, wherein the face image data set is acquired using image detection or living-body detection.
8. A face mask detection system based on YOLOV3, characterized by comprising:
the feature extraction unit is used for inputting the data to be detected into a target YOLOV3 algorithm model, and extracting features through a DarkNet53 feature extraction network to obtain feature data at different scales; the data to be detected is a face image with a uniform format, and the target YOLOV3 algorithm model is a YOLOV3 algorithm model combined with an IOU Loss function;
the feature fusion unit is used for inputting the feature data into a feature fusion layer and carrying out feature fusion through convolution and upsampling;
and the detection unit is used for performing convolution operation on the feature-fused data through an output layer and detecting a bounding box of the face image to obtain a face mask detection result.
9. A computer terminal device, comprising:
one or more processors;
a memory coupled to the processor for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the YOLOV3-based face mask detection method of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, the computer program being executed by a processor to implement the YOLOV3-based face mask detection method according to any one of claims 1 to 7.
CN202110303335.0A 2021-03-22 2021-03-22 Face and mask detection method, system, equipment and medium based on YOLOV3 Pending CN113033375A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110303335.0A CN113033375A (en) 2021-03-22 2021-03-22 Face and mask detection method, system, equipment and medium based on YOLOV3

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110303335.0A CN113033375A (en) 2021-03-22 2021-03-22 Face and mask detection method, system, equipment and medium based on YOLOV3

Publications (1)

Publication Number Publication Date
CN113033375A 2021-06-25

Family

ID=76472416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110303335.0A Pending CN113033375A (en) 2021-03-22 2021-03-22 Face and mask detection method, system, equipment and medium based on YOLOV3

Country Status (1)

Country Link
CN (1) CN113033375A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019153175A1 (en) * 2018-02-08 2019-08-15 国民技术股份有限公司 Machine learning-based occluded face recognition system and method, and storage medium
CN111291637A (en) * 2020-01-19 2020-06-16 中国科学院上海微系统与信息技术研究所 Face detection method, device and equipment based on convolutional neural network
CN112085010A (en) * 2020-10-28 2020-12-15 成都信息工程大学 Mask detection and deployment system and method based on image recognition
CN112183471A (en) * 2020-10-28 2021-01-05 西安交通大学 Automatic detection method and system for standard wearing of epidemic prevention mask of field personnel

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王艺皓 (Wang Yihao) et al., "Mask-Wearing Detection Algorithm Based on Improved YOLOv3 in Complex Scenes", 计算机工程 (Computer Engineering), vol. 46, no. 11, pages 12-22 *

Similar Documents

Publication Publication Date Title
US11341626B2 (en) Method and apparatus for outputting information
CN110807429A (en) Construction safety detection method and system based on tiny-YOLOv3
WO2018108129A1 (en) Method and apparatus for use in identifying object type, and electronic device
CN109815770B (en) Two-dimensional code detection method, device and system
CN109948497B (en) Object detection method and device and electronic equipment
CN111291637A (en) Face detection method, device and equipment based on convolutional neural network
CN111523414A (en) Face recognition method and device, computer equipment and storage medium
CN110889446A (en) Face image recognition model training and face image recognition method and device
CN112419202B (en) Automatic wild animal image recognition system based on big data and deep learning
CN116843999B (en) Gas cylinder detection method in fire operation based on deep learning
CN111814905A (en) Target detection method, target detection device, computer equipment and storage medium
CN113111817A (en) Semantic segmentation face integrity measurement method, system, equipment and storage medium
CN111738164B (en) Pedestrian detection method based on deep learning
CN117495735B (en) Automatic building elevation texture repairing method and system based on structure guidance
CN112819821A (en) Cell nucleus image detection method
CN116416884A (en) Testing device and testing method for display module
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN117372424B (en) Defect detection method, device, equipment and storage medium
CN113240699B (en) Image processing method and device, model training method and device, and electronic equipment
Lin et al. Integrated circuit board object detection and image augmentation fusion model based on YOLO
CN112884721A (en) Anomaly detection method and system and computer readable storage medium
CN112037173A (en) Chromosome detection method and device and electronic equipment
CN113033375A (en) Face and mask detection method, system, equipment and medium based on YOLOV3
CN113052798A (en) Screen aging detection model training method and screen aging detection method
CN112184708B (en) Sperm survival rate detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination