CN113762201A - Mask detection method based on yolov4 - Google Patents

Mask detection method based on yolov4

Info

Publication number
CN113762201A
CN113762201A
Authority
CN
China
Prior art keywords
mask
preset
yolov4
convolution
network
Prior art date
Legal status
Granted
Application number
CN202111088283.6A
Other languages
Chinese (zh)
Other versions
CN113762201B (en)
Inventor
张勇
赵东宁
吴显淞
朱经晨
宗拓
颜庚潇
张恒
Current Assignee
Shenzhen University
Original Assignee
Shenzhen University
Priority date
Filing date
Publication date
Application filed by Shenzhen University
Priority to CN202111088283.6A
Publication of CN113762201A
Application granted
Publication of CN113762201B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The embodiment of the invention discloses a mask detection method based on YOLOv4, which comprises the following steps. Step 1: collecting images of people wearing and not wearing masks in public places, and making a training set. Step 2: constructing a preset YOLOv4 target detection model. Step 3: putting the training set into the preset YOLOv4 target detection model for training. Step 4: detecting a video stream or picture to be detected with the trained preset YOLOv4 target detection model, and judging whether each face target in the video stream or picture is wearing a mask. For the general scenario of mask-wearing detection, the method replaces the Backbone network with a MobileNetV3 structure, which is small and computationally efficient, and uses depthwise separable convolution in place of ordinary convolution throughout the network; it further replaces the SE attention module in the MobileNetV3 structure with a CA attention mechanism, embedding position information into channel attention, which strengthens feature expression and improves mask detection in picture or video regions.

Description

Mask detection method based on yolov4
Technical Field
The invention relates to the technical field of mask detection, in particular to a mask detection method based on YOLOv4.
Background
The COVID-19 epidemic has continued to the present day, and public health protection requirements are continually emphasized. To prevent cross-infection with the novel coronavirus, wearing a mask when entering and leaving public places is a basic anti-infection measure; wearing a mask when going out helps protect both the wearer and others, effectively controlling the spread of the epidemic. Meanwhile, in monitoring systems such as ATMs and banks, a mask-wearing detection function can help detect suspicious persons; in operating rooms, detecting whether doctors wear masks supervises their work and helps avoid medical accidents caused by operational errors. With the development of intelligent devices and technologies, mask-wearing detection has gradually shifted from manual inspection to machine detection. Adopting machine vision instead of human labor is an important research direction in automation. An automatic mask-wearing detection system saves labor, provides intelligent reminders, and enables informatized monitoring. During a major epidemic outbreak, it can be used to supervise and remind the public to wear masks when going out, and can be mounted on unmanned aerial vehicles to automatically detect and advise people travelling without masks during patrols. It can also be used in access-control systems to remind people not wearing masks in public places such as buses, subway stations, supermarkets, and schools. Improving public health protection capability through such technology saves social resources and operating costs and reduces human resource costs. Therefore, quickly and accurately detecting and identifying whether people are wearing masks is of great significance for safety management and intelligent information management in future life.
Mask-wearing detection methods can be divided into traditional image processing methods and machine learning methods. The basic idea of the traditional methods is to separate the background from the portrait foreground through image processing, then find the face area and identify whether a mask is worn. Specifically, the skin-color area around the eyes and the skin-color area of the nose-and-mouth region are obtained through binocular detection and compared: if the former is smaller than the latter, no mask is worn; if the latter is smaller than the former, a mask is worn. Additionally, mouth-and-nose detection and Hough line detection on the edge of a mask-covered face are added as auxiliary decisions. The initial image preprocessing stage mainly uses operations such as filtering, binarization, erosion and dilation, and edge detection; an OpenCV classifier is then used to detect the face, nose, and mouth; when the eyes are detected, the eye skin-color area is computed from the skin-color contour; after the eye region is determined, the contour area of the skin color in the nose-and-mouth region is computed; finally, whether a mask is worn is judged according to the comparison rule above. With the development of deep learning in image recognition, new detection and recognition techniques are continuously proposed. For mask-wearing detection, researchers have proposed a two-module scheme of face detection and mask attribute recognition based on a combination of feature fusion and segmentation supervision: the face detection module locates the face area in the image, and the mask attribute recognition module judges whether the face in a single face area wears a mask. That is, mask-wearing detection is split into a face detection problem and an image classification problem: a face detection network first determines the face position, and an image classification network then classifies the image into two cases, not wearing a mask and wearing a mask. In addition, researchers have proposed directly detecting whether a face area wears a mask based on the SSD network structure, dividing detection into two classes. The localization and classification layers added to the main network comprise 8 convolutional layers; the model uses an anchor-based design, locates and detects the face area of the original image using the feature maps after the convolutional layers, and uses a cross-entropy loss function to predict whether a mask is worn.
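For illustration only, the traditional pipeline described above can be sketched with OpenCV's stock Haar cascade and a YCrCb skin-color threshold. The half-face split, the threshold values, and the 0.5 comparison factor are illustrative assumptions, not the exact procedure of the cited works:

```python
import cv2
import numpy as np

def skin_area(region_bgr):
    # Classic YCrCb skin segmentation: threshold the Cr/Cb chroma channels.
    ycrcb = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2YCrCb)
    mask = cv2.inRange(ycrcb, np.array((0, 133, 77)), np.array((255, 173, 127)))
    return int(cv2.countNonZero(mask))

def wears_mask(img_bgr):
    # OpenCV ships this cascade file with the package.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 5):
        eyes_region = img_bgr[y : y + h // 2, x : x + w]        # upper half: eyes
        mouth_region = img_bgr[y + h // 2 : y + h, x : x + w]   # lower half: nose/mouth
        # Comparison rule from the text: much less lower-face skin => mask worn.
        yield skin_area(mouth_region) < 0.5 * skin_area(eyes_region)

# Example (image path is illustrative): list(wears_mask(cv2.imread("person.jpg")))
```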
In order to obtain better performance, the number of layers in existing networks keeps increasing; although network performance improves, this also brings problems of storage and operating efficiency.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present invention is to provide a mask detection method based on yolov4, so as to reduce the amount of calculation and improve the operation efficiency.
In order to solve the technical problem, an embodiment of the present invention provides a mask detection method based on yolov4, including:
Step 1: collecting images of people wearing and not wearing masks in public places, screening samples by image clarity, and making a training set;
Step 2: constructing a preset YOLOv4 target detection model; the preset YOLOv4 target detection model comprises an input end, a BackBone reference network, a Neck intermediate layer, and a Head output layer, wherein the BackBone reference network adopts a preset MobileNetV3 network structure;
the initial part of the preset MobileNetV3 network structure comprises 1 convolution layer, extracting features through a 3×3 convolution; the middle part comprises a plurality of convolution layers composed of bneck blocks; the final part replaces the fully connected output with two 1×1 convolution layers;
Step 3: putting the training set into the preset YOLOv4 target detection model for training, to obtain a trained preset YOLOv4 target detection model with high robustness;
Step 4: detecting a video stream or picture to be detected with the trained preset YOLOv4 target detection model, and judging whether each face target in the video stream or picture is wearing a mask.
Further, in the bneck of the preset MobileNetV3 network structure, a CA attention mechanism module is used to replace an SE module.
Further, in the preset MobileNetV3 network structure, depthwise separable convolution is adopted to replace ordinary convolution, and the last three effective feature layers obtained by the BackBone reference network are then taken out to construct an enhanced feature pyramid.
The invention has the following beneficial effects: for the general scenario of mask-wearing detection, the method replaces the Backbone network with a MobileNetV3 structure, which is small and computationally efficient, and uses depthwise separable convolution in place of ordinary convolution throughout the network; it further replaces the SE attention module in the MobileNetV3 structure with a CA attention mechanism, embedding position information into channel attention, which strengthens feature expression and improves mask detection in picture or video regions.
Drawings
Fig. 1 is a schematic flow chart of the mask detection method based on YOLOv4 according to an embodiment of the present invention.
Fig. 2 is an overall block diagram of YOLOv4 in the related art.
Fig. 3 is a schematic structural diagram of a Bneck according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of the CA attention mechanism according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of the preset YOLOv4 target detection model according to an embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application can be combined with each other without conflict, and the present invention is further described in detail with reference to the drawings and specific embodiments.
The method adopts the YOLOv4 detection framework, and for the BackBone network part of the framework adopts MobileNetV3, a network structure with small size and high computational efficiency. By combining depthwise separable convolution, the inverted residual structure with a linear bottleneck, and an attention model structure, the model size is reduced and its computational efficiency is improved while detection precision is maintained. Then, the SE attention module in the MobileNetV3 structure is replaced with a CA attention mechanism (Coordinate Attention), which considers not only the relationship among channels but also the position information of the feature space; embedding position information into channel attention gives the feature map stronger feature expression capability and enhances mask detection in picture or video regions.
Referring to fig. 1, the mask detection method based on yolov4 in the embodiment of the present invention includes:
Step 1: collecting images of people wearing and not wearing masks in public places, screening samples by image clarity, and making a training set;
Step 2: constructing a preset YOLOv4 target detection model; the preset YOLOv4 target detection model comprises an input end, a BackBone reference network, a Neck intermediate layer, and a Head output layer, wherein the BackBone reference network adopts a preset MobileNetV3 network structure;
the initial part of the preset MobileNetV3 network structure comprises 1 convolution layer, extracting features through a 3×3 convolution; the middle part comprises a plurality of convolution layers composed of bneck blocks; the final part replaces the fully connected output with two 1×1 convolution layers;
Step 3: putting the training set into the preset YOLOv4 target detection model for training, to obtain a trained preset YOLOv4 target detection model with high robustness;
Step 4: detecting a video stream or picture to be detected with the trained preset YOLOv4 target detection model, and judging whether each face target in the video stream or picture is wearing a mask.
The YOLO series of target detection algorithms is a classic one-stage family and has drawn wide attention from both academia and industry. The YOLO series has high inference speed and can well meet the requirements of real scenarios. YOLOv4 adds a number of practical techniques on the basis of the YOLOv3 algorithm, greatly improving both speed and precision. YOLOv4 mainly comprises four parts: the input end, the BackBone reference network, the Neck intermediate layer, and the Head output layer, which are described in detail as follows.
The input end mainly serves the training phase, for picture preprocessing; the image size is typically 608×608. It mainly includes Mosaic data augmentation, CmBN, and SAT self-adversarial training. Mosaic data augmentation improves the training speed of the model and the accuracy of the network, while CmBN and SAT self-adversarial training improve the generalization performance of the network. The Mosaic method combines several pictures into one, enriching the dataset, greatly improving the training speed of the network, and reducing the memory requirement of the model.
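A minimal sketch of the Mosaic idea follows, assuming four uint8 source images each at least out_size//2 pixels on a side; a full YOLOv4 implementation additionally resizes the sources, jitters the split point, and merges the bounding boxes of the four pictures:

```python
import numpy as np

def mosaic4(images, out_size=608):
    # Paste 4 images into the 4 quadrants of one out_size x out_size canvas.
    cx = cy = out_size // 2                      # fixed split point for simplicity
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    regions = [(0, cy, 0, cx), (0, cy, cx, out_size),
               (cy, out_size, 0, cx), (cy, out_size, cx, out_size)]
    for img, (y0, y1, x0, x1) in zip(images, regions):
        canvas[y0:y1, x0:x1] = img[: y1 - y0, : x1 - x0]   # crop each source's corner
    return canvas
```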
The BackBone reference network mainly comprises the CSPDarknet53 network, the Mish activation function, and the DropBlock structure. CSPDarknet53 is used as the reference network to extract general feature representations; it contains 5 CSP modules, each composed of a convolution layer and X Res unit modules concatenated (Concat). The CSP module divides the feature map of the base layer into two parts and then merges them through a cross-stage hierarchical structure, which reduces the amount of computation while preserving model accuracy. The original ReLU activation function is replaced with the Mish activation function, an improvement on the Leaky ReLU approach: when x > 0, Leaky ReLU and Mish are essentially the same; when x < 0, Mish is essentially 0 while Leaky ReLU is λx. The Mish function is smoother and can further improve model accuracy. DropBlock, a regularization method for combating model overfitting, is added to the modules to further improve generalization.
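For reference, the Mish activation mentioned above can be sketched in a few lines of PyTorch; the formula is mish(x) = x · tanh(softplus(x)) = x · tanh(ln(1 + e^x)):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    """mish(x) = x * tanh(softplus(x)); smooth, non-monotonic, unbounded above."""
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))
```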
The Neck intermediate layer in YOLOv4 is used to improve the diversity and robustness of features, adding an SPP module and an FPN+PAN structure. The SPP module fuses feature maps at different scales; compared with a single k×k max pooling, the SPP approach more effectively increases the receptive range of the trunk features and clearly separates the most important context features. The FPN network constructs a pyramid on the feature maps and can better address the multi-scale problem in target detection. The FPN structure performs fusion operations on the 19×19, 38×38, and 76×76 feature maps in turn, i.e., it upsamples the smaller feature map layers, adjusts them to the same size, and then superimposes the two feature maps of the same size. FPN operations can adjust a 19×19 feature map up to 76×76, which enlarges the feature map, better addresses multi-scale detection, increases the depth of the network, and improves its robustness. The PAN structure rescales the 76×76 feature map back down to 19×19, which improves the target localization ability of the algorithm to some extent. The FPN layer conveys strong semantic features top-down, while PAN conveys strong localization features bottom-up; combining the two modules completes the target localization function well.
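A sketch of the SPP block as described: parallel max pooling with stride 1 and padding k//2, so the spatial size is preserved, followed by channel concatenation (the 1×1 branch is simply the identity). In YOLOv4 this block sits between convolution blocks, so its 4×C output channels are reduced again afterwards:

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in kernel_sizes)

    def forward(self, x):
        # Identity branch (the "1x1 pooling") plus the three pooled branches.
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)
```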
The Head output layer outputs the target detection results; the output end has different numbers of branches, usually including a classification branch and a regression branch. YOLOv4 replaces the Smooth L1 loss function with the CIOU_Loss function and replaces conventional NMS with DIOU_NMS, further improving the detection accuracy of the algorithm. CIOU_Loss adds an influence factor on the basis of DIOU_Loss, taking into account the aspect ratios of the prediction box and the GT box. DIOU_NMS considers the position of the center point of the bounding box, thereby obtaining more accurate detection results, and is suitable for target detection in dense scenes.
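A hedged sketch of the CIoU measure follows, for boxes given as (x1, y1, x2, y2) tensors; YOLOv4 uses 1 − CIoU as the box regression loss. The eps guards are illustrative:

```python
import math
import torch

def ciou(box1, box2, eps=1e-7):
    # Intersection and union.
    x1 = torch.max(box1[..., 0], box2[..., 0]); y1 = torch.max(box1[..., 1], box2[..., 1])
    x2 = torch.min(box1[..., 2], box2[..., 2]); y2 = torch.min(box1[..., 3], box2[..., 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    w1, h1 = box1[..., 2] - box1[..., 0], box1[..., 3] - box1[..., 1]
    w2, h2 = box2[..., 2] - box2[..., 0], box2[..., 3] - box2[..., 1]
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)
    # Squared center distance over squared diagonal of the enclosing box.
    cw = torch.max(box1[..., 2], box2[..., 2]) - torch.min(box1[..., 0], box2[..., 0])
    ch = torch.max(box1[..., 3], box2[..., 3]) - torch.min(box1[..., 1], box2[..., 1])
    rho2 = ((box1[..., 0] + box1[..., 2] - box2[..., 0] - box2[..., 2]) ** 2 +
            (box1[..., 1] + box1[..., 3] - box2[..., 1] - box2[..., 3]) ** 2) / 4
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term (box2 plays the role of the GT box).
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v

# Example: loss = 1 - ciou(torch.tensor([0., 0., 10., 10.]), torch.tensor([2., 2., 12., 12.]))
```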
As shown in fig. 2, the overall block diagram of the YOLOv4 target detection algorithm is divided into 4 general modules: the input end, the BackBone reference network, the Neck network, and the Head output, corresponding to the four large gray block regions in fig. 2. Some of the small components are briefly described as follows: CBM is the smallest component in the YOLOv4 network structure, consisting of Conv + BN + Mish; the CBL module consists of Conv + BN + Leaky ReLU; the Res unit borrows the residual structure of the ResNet network and is used to build deep networks, with CBM as a sub-module of the residual module; CSPX borrows the CSPNet network structure and consists of a convolution layer and X Res unit modules concatenated; SPP performs multi-scale feature fusion using max pooling of 1×1, 5×5, 9×9, and 13×13.
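The two smallest components named above can be sketched directly (nn.Mish is built into recent PyTorch releases; the padding and slope values are the usual conventions):

```python
import torch.nn as nn

def CBM(c_in, c_out, k=3, s=1):
    # Conv + BN + Mish.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.Mish())

def CBL(c_in, c_out, k=3, s=1):
    # Conv + BN + LeakyReLU.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True))
```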
As deep learning has continued to develop, CNN models have continuously evolved as well. To achieve higher accuracy, it is now common to use deeper and more complex networks, but when computing power or storage is limited, such networks may not be suitable. The Google team proposed MobileNet, a network structure with small size and high computational efficiency. It suits a variety of image applications on mobile and embedded devices, and the V1, V2, and V3 versions were successively designed as it developed and improved. The network structure of MobileNetV3 is shown in table 1.
TABLE 1
[Table 1: the MobileNetV3 network structure, reproduced as an image in the original; the column meanings are described below.]
In table 1, column 1 gives the division of the network structure; column 2, Input, gives the shape of each feature layer; column 3, Operator, gives the block structure each feature layer passes through; column 4 gives the number of channels after the expansion step of the inverted residual structure in bneck; column 5 gives the number of channels of the feature layer when it is input to bneck; column 6, SE, indicates whether the attention mechanism is introduced at that layer; column 7, NL, gives the kind of activation function, where HS denotes h-swish and RE denotes ReLU; column 8, Stride, gives the step size of each block structure.
MobileNetV1 uses depthwise separable convolutions to reduce parameters and computation and improve computational efficiency. MobileNetV2 adds the inverted residual with linear bottleneck module to form an efficient basic block. MobileNetV3 integrates the ideas of V1, V2, and the SE attention mechanism: it combines depthwise separable convolution, the inverted residual structure with a linear bottleneck, and the SE (Squeeze-and-Excitation) lightweight attention model structure, moves the average pooling layer forward in the last stage, removes the last convolution layer, and introduces the h-swish activation function. Specifically, the depthwise separable convolution performs a 3×3 depthwise convolution after the input is expanded by a 1×1 convolution; the inverted residual structure with a linear bottleneck first raises the dimension with a 1×1 convolution, then performs the subsequent operations and uses residual edges; the SE lightweight attention model works by adjusting the weight of each channel; and h-swish replaces the swish function to reduce computation and improve performance.
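The two MobileNet primitives just described, h-swish and the depthwise separable convolution, can be sketched as follows; the BN-plus-activation placement shown is one common arrangement, not necessarily the exact one in the original design:

```python
import torch.nn as nn
import torch.nn.functional as F

class HSwish(nn.Module):
    # h-swish(x) = x * ReLU6(x + 3) / 6  (PyTorch also ships this as nn.Hardswish).
    def forward(self, x):
        return x * F.relu6(x + 3.0) / 6.0

def depthwise_separable(c_in, c_out, k=3, s=1):
    return nn.Sequential(
        # Depthwise: groups=c_in gives one k x k filter per input channel.
        nn.Conv2d(c_in, c_in, k, s, k // 2, groups=c_in, bias=False),
        nn.BatchNorm2d(c_in),
        HSwish(),
        # Pointwise: a 1x1 convolution mixes information across channels.
        nn.Conv2d(c_in, c_out, 1, bias=False),
        nn.BatchNorm2d(c_out),
        HSwish())
```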
MobileNetV3 has three main steps: a 1×1 convolution converting from the input channels to the expansion channels; a 3×3 or 5×5 depthwise convolution on the expansion channels with step size stride; and a 1×1 convolution converting from the expansion channels to the output channels. There are also three optional choices: whether to add the SE structure; whether to adopt the residual structure; and the choice between two activation functions, ReLU and h-swish. The network structure can be divided into three parts. The initial part is 1 convolution layer, extracting features through a 3×3 convolution; the middle part is a number of convolution layers composed of Bneck blocks (the Bneck structure is shown in fig. 3); the last part replaces the fully connected output with two 1×1 convolution layers. The initial part, i.e., the 1st convolution layer in the structure, comprises 3 parts: a convolution layer, a BN layer, and an h-swish activation layer. The middle part is a network structure of several blocks (MobileBlock) containing convolution layers, as shown in the table, where SE is the Squeeze-and-Excite mechanism structure, optionally added; NL denotes the nonlinearity, where HS is the h-swish activation function and RE is the ReLU activation function; Bneck is the bottleneck layer, i.e., MobileBlock; and exp size is the number of channels used in the computation inside the bneck structure. In the last part, computation is reduced by moving the average pooling layer forward and using a 1×1 convolution instead of the Squeeze operation.
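A sketch of a bneck block following the three steps listed above (1×1 expansion, depthwise convolution with stride, 1×1 linear projection, with optional attention and a residual edge when shapes match); the argument names are illustrative:

```python
import torch.nn as nn

class Bneck(nn.Module):
    def __init__(self, c_in, c_exp, c_out, k=3, s=1, attn=None, act=nn.ReLU):
        super().__init__()
        self.use_res = (s == 1 and c_in == c_out)   # residual only if shapes match
        layers = [
            nn.Conv2d(c_in, c_exp, 1, bias=False), nn.BatchNorm2d(c_exp), act(),
            nn.Conv2d(c_exp, c_exp, k, s, k // 2, groups=c_exp, bias=False),
            nn.BatchNorm2d(c_exp), act(),
        ]
        if attn is not None:
            layers.append(attn(c_exp))              # SE in stock V3, CA in this method
        # Linear bottleneck: no activation after the 1x1 projection.
        layers += [nn.Conv2d(c_exp, c_out, 1, bias=False), nn.BatchNorm2d(c_out)]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_res else out
```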
In one embodiment, a CA attention mechanism module is used to replace the SE module in the bneck of the preset MobileNetV3 network structure. For the improvement of the BackBone network of the YOLOv4 structure, the invention adopts the small, computationally efficient MobileNetV3 structure and changes the SE attention module in the MobileNetV3 structure to a CA attention mechanism, which enhances feature expression capability while making the model lighter; the whole network structure has fewer FLOPs and parameters.
Channel attention (e.g., SE attention) has a significant effect on improving model performance, but it only re-weights the importance of each channel by modeling channel relationships, ignoring position information, which is important for generating spatially selective attention maps. Therefore, another attention mode is adopted that considers not only the relationships among channels but also the position information of the feature space, embedding the position information into channel attention. This is called Coordinate Attention, abbreviated as the CA attention mechanism, and its operation is divided into 2 steps: coordinate information embedding and coordinate attention generation.
Coordinate information embedding decomposes global pooling along the horizontal and vertical coordinates, factoring channel attention into two one-dimensional feature encoding processes that aggregate features along the two spatial directions (X and Y), so that the attention module can capture long-range spatial interactions with precise position information. Specifically, pooling kernels of size (H, 1) and (1, W) are used to encode, for each channel of the given input, the aggregated features along the X and Y coordinates respectively, yielding feature maps that are aware of both the X and Y directions. These two transformations allow the attention module to capture dependencies along one spatial direction while preserving position information along the other, helping the network locate objects of interest more accurately. Through this splitting, a global receptive field is obtained and precise position information can be encoded.
Coordinate attention generation first concatenates (Concat) the two feature maps produced above along the spatial dimension and transforms them with a 1×1 convolution transform function F1, giving an intermediate feature map f = δ(F1([z^h, z^w])) that encodes spatial information in both the horizontal and vertical directions, where δ is a nonlinear activation function. The intermediate map f is then split along the spatial dimension into 2 separate tensors f^h and f^w. Two further 1×1 convolution transforms, F_h and F_w, convert f^h and f^w into tensors with the same number of channels as the input, giving the representations in the height and width directions, g^h = σ(F_h(f^h)) and g^w = σ(F_w(f^w)), where σ is the sigmoid function; the outputs g^h and g^w are then expanded and used as attention weights. Finally, the output of the Coordinate Attention block can be written as (where x is the input):
y = x × g^h × g^w
Thus a Coordinate Attention block can be regarded as a computational unit that captures long-range dependencies along one spatial direction while retaining precise position information along the other; the generated feature maps are encoded into a pair of direction-aware and position-sensitive attention maps, which are applied complementarily to the input feature map to enhance the expressive power of features in the network. The overall input/output structure is shown in fig. 4.
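A sketch of the Coordinate Attention block described above, following the public reference design of the CA paper; the reduction ratio of 32 and the minimum width of 8 are conventional choices, not mandated here:

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # (H,1): pool along width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # (1,W): pool along height
        self.conv1 = nn.Conv2d(channels, mid, 1, bias=False)   # shared F1
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, 1)       # F_h
        self.conv_w = nn.Conv2d(mid, channels, 1)       # F_w

    def forward(self, x):
        n, c, h, w = x.size()
        xh = self.pool_h(x)                              # (n, c, h, 1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)          # (n, c, w, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([xh, xw], dim=2))))
        yh, yw = torch.split(y, [h, w], dim=2)           # split along spatial dim
        gh = torch.sigmoid(self.conv_h(yh))                       # (n, c, h, 1)
        gw = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2)))   # (n, c, 1, w)
        return x * gh * gw                               # y = x × g^h × g^w
```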
In the MobileNetV3 structure, a lightweight SE attention model was added. The SE attention model effectively builds the interdependence between channels by compressing each two-dimensional feature map and re-weighting the importance of each channel; it mainly adjusts channel weights by modeling channel relationships. However, it does not make full use of position information, so the method replaces the SE module with the CA attention mechanism module described in the previous section to form the preset MobileNetV3 structure. For the overall YOLOv4 mask-wearing detection structure, the BackBone is replaced by this improved preset MobileNetV3 structure; to further reduce the number of parameters, depthwise separable convolution is used in place of ordinary convolution; and the last three effective feature layers obtained by the BackBone feature extraction network are then taken out to construct an enhanced feature pyramid. This completes the improved preset YOLOv4 network used for mask-wearing detection. The modified structure is shown in fig. 5.
The invention adopts an improved YOLOv4 detection framework for mask-wearing detection: the BackBone reference network part adopts the small, computationally efficient MobileNetV3 structure; depthwise separable convolution is used instead of ordinary convolution throughout the network; and the SE attention module in the MobileNetV3 structure is changed to a CA attention mechanism, embedding position information into channel attention to enhance feature expression capability.
In the preset MobileNetV3 network structure, as one implementation, depthwise separable convolution is adopted to replace ordinary convolution, and the last three effective feature layers obtained by the BackBone reference network are then taken out to construct the enhanced feature pyramid, as sketched below.
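One hedged way to realize "take out the last three effective feature layers" with an off-the-shelf MobileNetV3 (here torchvision's stock model, standing in for the patent's modified backbone) is to record the last feature map of each spatial stage and keep the final three, which correspond to strides 8, 16, and 32:

```python
import torch
from torchvision.models import mobilenet_v3_large

def last_three_feature_maps(x):
    feats, stage_ends, prev = mobilenet_v3_large().features, [], x
    for layer in feats:
        y = layer(prev)
        if y.shape[-1] != prev.shape[-1]:   # spatial size shrank: a stage just ended
            stage_ends.append(prev)         # keep the last map of the finished stage
        prev = y
    stage_ends.append(prev)                 # the final stride-32 map
    return stage_ends[-3:]                  # e.g. 76x76, 38x38, 19x19 for a 608 input

maps = last_three_feature_maps(torch.randn(1, 3, 608, 608))
print([tuple(m.shape) for m in maps])
```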
The preset YOLOv4 target detection model of the invention is compared with the existing YOLOv4 network as follows, mainly in three parts.
(1) Comparison of the computation of standard convolution and depthwise separable convolution.
Assume an input feature map of size D_F × D_F with M channels, and convolution kernels of size D_K × D_K with M channels each, N kernels in total; after convolution, an output feature map of size D_F × D_F with N channels is obtained. The following computation can then be derived. Each pixel of each layer in the output feature map of the standard convolution is the result of one convolution, and the computation of each convolution is D_K × D_K × M; the computation of each convolution kernel over the whole output map is D_K × D_K × M × D_F × D_F; with N convolution kernels in total, the total computation is D_K × D_K × M × N × D_F × D_F. Of the computation of the depthwise separable convolution, the depth-wise convolution accounts for D_K × D_K × M × D_F × D_F and the point-wise convolution accounts for M × N × D_F × D_F; the total computation is the sum of the two. Therefore, with the sizes of the input feature map, the output feature map, and the convolution kernel unchanged, the ratio of the computation of the depthwise separable convolution to that of the standard convolution is:
(D_K × D_K × M × D_F × D_F + M × N × D_F × D_F) / (D_K × D_K × M × N × D_F × D_F) = 1/N + 1/D_K²
Therefore, the larger the convolution kernel or the larger the number of kernels, the smaller this ratio, and the smaller the relative computation of the depthwise separable convolution. In practice the convolution kernel size is usually fixed, but replacing the ordinary convolutions throughout the improved MobileNetV3-YOLOv4 network significantly reduces the computational load of the convolution operations.
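A quick numeric check of the ratio derived above:

```python
# cost(depthwise separable) / cost(standard) = 1/N + 1/Dk^2,
# independent of the feature-map size.
def flops_ratio(dk: int, n: int) -> float:
    return 1 / n + 1 / dk ** 2

print(flops_ratio(3, 256))   # 3x3 kernels, 256 output channels -> ~0.115 (about 8.7x fewer ops)
```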
(2) Comparison of the computation of the MobileNetV3 and CSPDarknet53 networks.
According to the respective ways the network structures are built and the parameter quantities are calculated, the CSPDarknet53 network has 27.6M parameters while the MobileNetV3 network has only 4.97M. Although MobileNetV3 as the BackBone gives a somewhat lower accuracy in the detection network than CSPDarknet53, the number of parameters is reduced to about 1/5, which greatly reduces storage space and leaves substantial headroom for deploying the model on edge devices such as mobile and embedded devices.
(3) Comparison of the CA and SE attention mechanisms.
Detection experiments on the Pascal VOC dataset show that adding the CA attention mechanism improves the average AP by about 2% over the SE attention mechanism, which greatly helps accuracy. The method therefore combines CA with MobileNetV3, replacing the original SE module, to improve detection accuracy. Meanwhile, on classification over other datasets, models with the CA attention mechanism show better transfer capability.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (3)

1. A mask detection method based on yolov4 is characterized by comprising the following steps:
Step 1: collecting images of people wearing and not wearing masks in public places, screening samples by image clarity, and making a training set;
Step 2: constructing a preset YOLOv4 target detection model; the preset YOLOv4 target detection model comprises an input end, a BackBone reference network, a Neck intermediate layer, and a Head output layer, wherein the BackBone reference network adopts a preset MobileNetV3 network structure;
the initial part of the preset MobileNetV3 network structure comprises 1 convolution layer, extracting features through a 3×3 convolution; the middle part comprises a plurality of convolution layers composed of bneck blocks; the final part replaces the fully connected output with two 1×1 convolution layers;
Step 3: putting the training set into the preset YOLOv4 target detection model for training, to obtain a trained preset YOLOv4 target detection model with high robustness;
Step 4: detecting a video stream or picture to be detected with the trained preset YOLOv4 target detection model, and judging whether each face target in the video stream or picture is wearing a mask.
2. The mask detection method based on yolov4 of claim 1, wherein in the bneck of the preset MobileNetV3 network structure, a CA attention mechanism module is used to replace an SE module.
3. The mask detection method based on yolov4 according to claim 1, characterized in that, in the preset MobileNetV3 network structure, depthwise separable convolution is adopted to replace ordinary convolution, and the last three effective feature layers obtained by the BackBone reference network are then taken out to construct the enhanced feature pyramid.
CN202111088283.6A 2021-09-16 2021-09-16 Mask detection method based on yolov4 Active CN113762201B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111088283.6A CN113762201B (en) 2021-09-16 2021-09-16 Mask detection method based on yolov4

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111088283.6A CN113762201B (en) 2021-09-16 2021-09-16 Mask detection method based on yolov4

Publications (2)

Publication Number Publication Date
CN113762201A 2021-12-07
CN113762201B (en) 2023-05-09

Family

ID=78796056

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111088283.6A Active CN113762201B (en) 2021-09-16 2021-09-16 Mask detection method based on yolov4

Country Status (1)

Country Link
CN (1) CN113762201B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931623A (en) * 2020-07-31 2020-11-13 南京工程学院 Face mask wearing detection method based on deep learning
CN112016464A (en) * 2020-08-28 2020-12-01 中移(杭州)信息技术有限公司 Method and device for detecting face shielding, electronic equipment and storage medium
CN112232199A (en) * 2020-10-15 2021-01-15 燕山大学 Wearing mask detection method based on deep learning
CN112183471A (en) * 2020-10-28 2021-01-05 西安交通大学 Automatic detection method and system for standard wearing of epidemic prevention mask of field personnel
CN112949572A (en) * 2021-03-26 2021-06-11 重庆邮电大学 Slim-YOLOv 3-based mask wearing condition detection method

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114283469A (en) * 2021-12-14 2022-04-05 贵州大学 Lightweight target detection method and system based on improved YOLOv4-tiny
CN114387484A (en) * 2022-01-11 2022-04-22 华南农业大学 Improved mask wearing detection method and system based on yolov4
CN114387484B (en) * 2022-01-11 2024-04-16 华南农业大学 Improved mask wearing detection method and system based on yolov4
CN114092820A (en) * 2022-01-20 2022-02-25 城云科技(中国)有限公司 Target detection method and moving target tracking method applying same
CN116912237A (en) * 2023-09-08 2023-10-20 江西拓荒者科技有限公司 Printed circuit board defect detection method and system based on image recognition
CN116912237B (en) * 2023-09-08 2023-12-12 江西麦可罗泰克检测技术有限公司 Printed circuit board defect detection method and system based on image recognition
CN117218606A (en) * 2023-11-09 2023-12-12 四川泓宝润业工程技术有限公司 Escape door detection method and device, storage medium and electronic equipment
CN117218606B (en) * 2023-11-09 2024-02-02 四川泓宝润业工程技术有限公司 Escape door detection method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN113762201B (en) 2023-05-09

Similar Documents

Publication Publication Date Title
CN113762201B (en) Mask detection method based on yolov4
CN112287940A (en) Semantic segmentation method of attention mechanism based on deep learning
CN110147743A (en) Real-time online pedestrian analysis and number system and method under a kind of complex scene
CN112183471A (en) Automatic detection method and system for standard wearing of epidemic prevention mask of field personnel
CN110852182B (en) Depth video human body behavior recognition method based on three-dimensional space time sequence modeling
CN114758288A (en) Power distribution network engineering safety control detection method and device
WO2023030182A1 (en) Image generation method and apparatus
CN110991444A (en) Complex scene-oriented license plate recognition method and device
CN113379771A (en) Hierarchical human body analytic semantic segmentation method with edge constraint
CN114998830A (en) Wearing detection method and system for safety helmet of transformer substation personnel
CN115861619A (en) Airborne LiDAR (light detection and ranging) urban point cloud semantic segmentation method and system of recursive residual double-attention kernel point convolution network
CN113807361A (en) Neural network, target detection method, neural network training method and related products
Mao et al. Panoptic lintention network: Towards efficient navigational perception for the visually impaired
CN115019274A (en) Pavement disease identification method integrating tracking and retrieval algorithm
CN113936299A (en) Method for detecting dangerous area in construction site
Huu et al. Proposed detection face model by mobilenetv2 using asian data set
CN112597902A (en) Small target intelligent identification method based on nuclear power safety
CN114764856A (en) Image semantic segmentation method and image semantic segmentation device
CN116740516A (en) Target detection method and system based on multi-scale fusion feature extraction
CN116310967A (en) Chemical plant safety helmet wearing detection method based on improved YOLOv5
Zhang et al. Lightweight PM-YOLO network model for moving object recognition on the distribution network side
CN109583584A (en) The CNN with full articulamentum can be made to receive the method and system of indefinite shape input
CN112634411B (en) Animation generation method, system and readable medium thereof
CN115331112A (en) Infrared and visible light image fusion method and system based on multi-granularity word elements
CN112446292B (en) 2D image salient object detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant