CN113762201A - Mask detection method based on yolov4 - Google Patents

Mask detection method based on yolov4

Info

Publication number
CN113762201A
CN113762201A
Authority
CN
China
Prior art keywords
mask
preset
yolov4
convolution
network
Prior art date
Legal status
Granted
Application number
CN202111088283.6A
Other languages
Chinese (zh)
Other versions
CN113762201B (en)
Inventor
张勇
赵东宁
吴显淞
朱经晨
宗拓
颜庚潇
张恒
Current Assignee
Shenzhen University
Original Assignee
Shenzhen University
Priority date
Filing date
Publication date
Application filed by Shenzhen University
Priority to CN202111088283.6A
Publication of CN113762201A
Application granted
Publication of CN113762201B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The embodiment of the invention discloses a mask detection method based on YOLOv4, which comprises the following steps. Step 1: collecting images of people wearing and not wearing masks in public places, and making a training set. Step 2: constructing a preset YOLOv4 target detection model. Step 3: putting the training set into the preset YOLOv4 target detection model for training. Step 4: detecting a video stream or picture to be detected with the trained preset YOLOv4 target detection model, and judging whether each face target in the video stream or picture is wearing a mask. For the general scenario of mask-wearing detection, the method replaces the Backbone network with a MobileNetV3 structure, which is small and computationally efficient, and uses depthwise separable convolution in place of ordinary convolution throughout the network; it further replaces the SE attention module in the MobileNetV3 structure with a CA attention mechanism, embedding position information into channel attention, which strengthens feature expression and improves mask detection in picture or video regions.

Description

Mask detection method based on yolov4
Technical Field
The invention relates to the technical field of mask detection, in particular to a mask detection method based on YOLOv4.
Background
The COVID-19 epidemic has continued to the present day, and public health protection requirements are continually emphasized. To prevent cross-infection with the novel coronavirus, wearing a mask when entering and leaving public places is a basic anti-infection measure; wearing a mask when going out helps protect both the wearer and others, effectively controlling the spread of the epidemic. Meanwhile, in monitoring systems such as ATMs and banks, a mask-wearing detection function can help detect suspicious persons; in operating rooms, detecting whether doctors wear masks supervises their work and helps avoid medical accidents caused by operational errors. With the development of intelligent devices and technologies, mask-wearing detection has gradually shifted from manual inspection to machine detection. Adopting machine vision instead of human labor is an important research direction in automation. An automatic mask-wearing detection system saves labor, provides intelligent reminders, and enables informatized monitoring. During a major epidemic outbreak, it can be used to supervise and remind the public to wear masks when going out, and can be mounted on unmanned aerial vehicles to automatically detect and advise people travelling without masks during patrols. It can also be used in access-control systems to remind people not wearing masks in public places such as buses, subway stations, supermarkets, and schools. Improving public health protection capability through such technology saves social resources and operating costs and reduces human resource costs. Therefore, quickly and accurately detecting and identifying whether people are wearing masks is of great significance for safety management and intelligent information management in future life.
Mask-wearing detection methods can be divided into traditional image processing methods and machine learning methods. The basic idea of the traditional methods is to separate the background from the portrait foreground through image processing, then find the face area and identify whether a mask is worn. Specifically, the skin-color area around the eyes and the skin-color area of the nose-and-mouth region are obtained through binocular detection and compared: if the former is smaller than the latter, no mask is worn; if the latter is smaller than the former, a mask is worn. Additionally, mouth-and-nose detection and Hough line detection on the edge of a mask-covered face are added as auxiliary decisions. The initial image preprocessing stage mainly uses operations such as filtering, binarization, erosion and dilation, and edge detection; an OpenCV classifier is then used to detect the face, nose, and mouth; when the eyes are detected, the eye skin-color area is computed from the skin-color contour; after the eye region is determined, the contour area of the skin color in the nose-and-mouth region is computed; finally, whether a mask is worn is judged according to the comparison rule above. With the development of deep learning in image recognition, new detection and recognition techniques are continuously proposed. For mask-wearing detection, researchers have proposed a two-module scheme of face detection and mask attribute recognition based on a combination of feature fusion and segmentation supervision: the face detection module locates the face area in the image, and the mask attribute recognition module judges whether the face in a single face area wears a mask. That is, mask-wearing detection is split into a face detection problem and an image classification problem: a face detection network first determines the face position, and an image classification network then classifies the image into two cases, not wearing a mask and wearing a mask. In addition, researchers have proposed directly detecting whether a face area wears a mask based on the SSD network structure, dividing detection into two classes. The localization and classification layers added to the main network comprise 8 convolutional layers; the model uses an anchor-based design, locates and detects the face area of the original image using the feature maps after the convolutional layers, and uses a cross-entropy loss function to predict whether a mask is worn.
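For illustration only, the traditional pipeline described above can be sketched with OpenCV's stock Haar cascade and a YCrCb skin-color threshold. The half-face split, the threshold values, and the 0.5 comparison factor are illustrative assumptions, not the exact procedure of the cited works:

```python
import cv2
import numpy as np

def skin_area(region_bgr):
    # Classic YCrCb skin segmentation: threshold the Cr/Cb chroma channels.
    ycrcb = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2YCrCb)
    mask = cv2.inRange(ycrcb, np.array((0, 133, 77)), np.array((255, 173, 127)))
    return int(cv2.countNonZero(mask))

def wears_mask(img_bgr):
    # OpenCV ships this cascade file with the package.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 5):
        eyes_region = img_bgr[y : y + h // 2, x : x + w]        # upper half: eyes
        mouth_region = img_bgr[y + h // 2 : y + h, x : x + w]   # lower half: nose/mouth
        # Comparison rule from the text: much less lower-face skin => mask worn.
        yield skin_area(mouth_region) < 0.5 * skin_area(eyes_region)

# Example (image path is illustrative): list(wears_mask(cv2.imread("person.jpg")))
```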
In order to obtain better performance, the number of layers in existing networks keeps increasing; although network performance improves, this also brings problems of storage and operating efficiency.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present invention is to provide a mask detection method based on yolov4, so as to reduce the amount of calculation and improve the operation efficiency.
In order to solve the technical problem, an embodiment of the present invention provides a mask detection method based on yolov4, including:
Step 1: collecting images of people wearing and not wearing masks in public places, screening samples by image clarity, and making a training set;
Step 2: constructing a preset YOLOv4 target detection model; the preset YOLOv4 target detection model comprises an input end, a BackBone reference network, a Neck intermediate layer, and a Head output layer, wherein the BackBone reference network adopts a preset MobileNetV3 network structure;
the initial part of the preset MobileNetV3 network structure comprises 1 convolution layer, extracting features through a 3×3 convolution; the middle part comprises a plurality of convolution layers composed of bneck blocks; the final part replaces the fully connected output with two 1×1 convolution layers;
Step 3: putting the training set into the preset YOLOv4 target detection model for training, to obtain a trained preset YOLOv4 target detection model with high robustness;
Step 4: detecting a video stream or picture to be detected with the trained preset YOLOv4 target detection model, and judging whether each face target in the video stream or picture is wearing a mask.
Further, in the bneck of the preset MobileNetV3 network structure, a CA attention mechanism module is used to replace an SE module.
Further, in the preset MobileNetV3 network structure, depthwise separable convolution is adopted to replace ordinary convolution, and the last three effective feature layers obtained by the BackBone reference network are then taken out to construct an enhanced feature pyramid.
The invention has the following beneficial effects: for the general scenario of mask-wearing detection, the method replaces the Backbone network with a MobileNetV3 structure, which is small and computationally efficient, and uses depthwise separable convolution in place of ordinary convolution throughout the network; it further replaces the SE attention module in the MobileNetV3 structure with a CA attention mechanism, embedding position information into channel attention, which strengthens feature expression and improves mask detection in picture or video regions.
Drawings
Fig. 1 is a schematic flow chart of the mask detection method based on YOLOv4 according to an embodiment of the present invention.
Fig. 2 is an overall block diagram of YOLOv4 in the related art.
Fig. 3 is a schematic structural diagram of a Bneck according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of the CA attention mechanism according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of the preset YOLOv4 target detection model according to an embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application can be combined with each other without conflict, and the present invention is further described in detail with reference to the drawings and specific embodiments.
The method adopts the YOLOv4 detection framework, and for the BackBone network part of the framework adopts MobileNetV3, a network structure with small size and high computational efficiency. By combining depthwise separable convolution, the inverted residual structure with a linear bottleneck, and an attention model structure, the model size is reduced and its computational efficiency is improved while detection precision is maintained. Then, the SE attention module in the MobileNetV3 structure is replaced with a CA attention mechanism (Coordinate Attention), which considers not only the relationship among channels but also the position information of the feature space; embedding position information into channel attention gives the feature map stronger feature expression capability and enhances mask detection in picture or video regions.
Referring to fig. 1, the mask detection method based on yolov4 in the embodiment of the present invention includes:
Step 1: collecting images of people wearing and not wearing masks in public places, screening samples by image clarity, and making a training set;
Step 2: constructing a preset YOLOv4 target detection model; the preset YOLOv4 target detection model comprises an input end, a BackBone reference network, a Neck intermediate layer, and a Head output layer, wherein the BackBone reference network adopts a preset MobileNetV3 network structure;
the initial part of the preset MobileNetV3 network structure comprises 1 convolution layer, extracting features through a 3×3 convolution; the middle part comprises a plurality of convolution layers composed of bneck blocks; the final part replaces the fully connected output with two 1×1 convolution layers;
Step 3: putting the training set into the preset YOLOv4 target detection model for training, to obtain a trained preset YOLOv4 target detection model with high robustness;
Step 4: detecting a video stream or picture to be detected with the trained preset YOLOv4 target detection model, and judging whether each face target in the video stream or picture is wearing a mask.
The YOLO series of target detection algorithms is a classic one-stage family and has drawn wide attention from both academia and industry. The YOLO series has high inference speed and can well meet the requirements of real scenarios. YOLOv4 adds a number of practical techniques on the basis of the YOLOv3 algorithm, greatly improving both speed and precision. YOLOv4 mainly comprises four parts: the input end, the BackBone reference network, the Neck intermediate layer, and the Head output layer, which are described in detail as follows.
The input end mainly serves the training phase, for picture preprocessing; the image size is typically 608×608. It mainly includes Mosaic data augmentation, CmBN, and SAT self-adversarial training. Mosaic data augmentation improves the training speed of the model and the accuracy of the network, while CmBN and SAT self-adversarial training improve the generalization performance of the network. The Mosaic method combines several pictures into one, enriching the dataset, greatly improving the training speed of the network, and reducing the memory requirement of the model.
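A minimal sketch of the Mosaic idea follows, assuming four uint8 source images each at least out_size//2 pixels on a side; a full YOLOv4 implementation additionally resizes the sources, jitters the split point, and merges the bounding boxes of the four pictures:

```python
import numpy as np

def mosaic4(images, out_size=608):
    # Paste 4 images into the 4 quadrants of one out_size x out_size canvas.
    cx = cy = out_size // 2                      # fixed split point for simplicity
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    regions = [(0, cy, 0, cx), (0, cy, cx, out_size),
               (cy, out_size, 0, cx), (cy, out_size, cx, out_size)]
    for img, (y0, y1, x0, x1) in zip(images, regions):
        canvas[y0:y1, x0:x1] = img[: y1 - y0, : x1 - x0]   # crop each source's corner
    return canvas
```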
The BackBone reference network mainly comprises the CSPDarknet53 network, the Mish activation function, and the DropBlock structure. CSPDarknet53 is used as the reference network to extract general feature representations; it contains 5 CSP modules, each composed of a convolution layer and X Res unit modules concatenated (Concat). The CSP module divides the feature map of the base layer into two parts and then merges them through a cross-stage hierarchical structure, which reduces the amount of computation while preserving model accuracy. The original ReLU activation function is replaced with the Mish activation function, an improvement on the Leaky ReLU approach: when x > 0, Leaky ReLU and Mish are essentially the same; when x < 0, Mish is essentially 0 while Leaky ReLU is λx. The Mish function is smoother and can further improve model accuracy. DropBlock, a regularization method for combating model overfitting, is added to the modules to further improve generalization.
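For reference, the Mish activation mentioned above can be sketched in a few lines of PyTorch; the formula is mish(x) = x · tanh(softplus(x)) = x · tanh(ln(1 + e^x)):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    """mish(x) = x * tanh(softplus(x)); smooth, non-monotonic, unbounded above."""
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))
```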
The Neck intermediate layer in YOLOv4 is used to improve the diversity and robustness of features, adding an SPP module and an FPN+PAN structure. The SPP module fuses feature maps at different scales; compared with a single k×k max pooling, the SPP approach more effectively increases the receptive range of the trunk features and clearly separates the most important context features. The FPN network constructs a pyramid on the feature maps and can better address the multi-scale problem in target detection. The FPN structure performs fusion operations on the 19×19, 38×38, and 76×76 feature maps in turn, i.e., it upsamples the smaller feature map layers, adjusts them to the same size, and then superimposes the two feature maps of the same size. FPN operations can adjust a 19×19 feature map up to 76×76, which enlarges the feature map, better addresses multi-scale detection, increases the depth of the network, and improves its robustness. The PAN structure rescales the 76×76 feature map back down to 19×19, which improves the target localization ability of the algorithm to some extent. The FPN layer conveys strong semantic features top-down, while PAN conveys strong localization features bottom-up; combining the two modules completes the target localization function well.
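A sketch of the SPP block as described: parallel max pooling with stride 1 and padding k//2, so the spatial size is preserved, followed by channel concatenation (the 1×1 branch is simply the identity). In YOLOv4 this block sits between convolution blocks, so its 4×C output channels are reduced again afterwards:

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in kernel_sizes)

    def forward(self, x):
        # Identity branch (the "1x1 pooling") plus the three pooled branches.
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)
```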
The Head output layer outputs the target detection results; the output end has different numbers of branches, usually including a classification branch and a regression branch. YOLOv4 replaces the Smooth L1 loss function with the CIOU_Loss function and replaces conventional NMS with DIOU_NMS, further improving the detection accuracy of the algorithm. CIOU_Loss adds an influence factor on the basis of DIOU_Loss, taking into account the aspect ratios of the prediction box and the GT box. DIOU_NMS considers the position of the center point of the bounding box, thereby obtaining more accurate detection results, and is suitable for target detection in dense scenes.
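A hedged sketch of the CIoU measure follows, for boxes given as (x1, y1, x2, y2) tensors; YOLOv4 uses 1 − CIoU as the box regression loss. The eps guards are illustrative:

```python
import math
import torch

def ciou(box1, box2, eps=1e-7):
    # Intersection and union.
    x1 = torch.max(box1[..., 0], box2[..., 0]); y1 = torch.max(box1[..., 1], box2[..., 1])
    x2 = torch.min(box1[..., 2], box2[..., 2]); y2 = torch.min(box1[..., 3], box2[..., 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    w1, h1 = box1[..., 2] - box1[..., 0], box1[..., 3] - box1[..., 1]
    w2, h2 = box2[..., 2] - box2[..., 0], box2[..., 3] - box2[..., 1]
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)
    # Squared center distance over squared diagonal of the enclosing box.
    cw = torch.max(box1[..., 2], box2[..., 2]) - torch.min(box1[..., 0], box2[..., 0])
    ch = torch.max(box1[..., 3], box2[..., 3]) - torch.min(box1[..., 1], box2[..., 1])
    rho2 = ((box1[..., 0] + box1[..., 2] - box2[..., 0] - box2[..., 2]) ** 2 +
            (box1[..., 1] + box1[..., 3] - box2[..., 1] - box2[..., 3]) ** 2) / 4
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term (box2 plays the role of the GT box).
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v

# Example: loss = 1 - ciou(torch.tensor([0., 0., 10., 10.]), torch.tensor([2., 2., 12., 12.]))
```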
As shown in fig. 2, the overall block diagram of the YOLOv4 target detection algorithm is divided into 4 general modules: the input end, the BackBone reference network, the Neck network, and the Head output, corresponding to the four large gray block regions in fig. 2. Some of the small components are briefly described as follows: CBM is the smallest component in the YOLOv4 network structure, consisting of Conv + BN + Mish; the CBL module consists of Conv + BN + Leaky ReLU; the Res unit borrows the residual structure of the ResNet network and is used to build deep networks, with CBM as a sub-module of the residual module; CSPX borrows the CSPNet network structure and consists of a convolution layer and X Res unit modules concatenated; SPP performs multi-scale feature fusion using max pooling of 1×1, 5×5, 9×9, and 13×13.
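The two smallest components named above can be sketched directly (nn.Mish is built into recent PyTorch releases; the padding and slope values are the usual conventions):

```python
import torch.nn as nn

def CBM(c_in, c_out, k=3, s=1):
    # Conv + BN + Mish.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.Mish())

def CBL(c_in, c_out, k=3, s=1):
    # Conv + BN + LeakyReLU.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True))
```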
As deep learning has continued to develop, CNN models have continuously evolved as well. To achieve higher accuracy, it is now common to use deeper and more complex networks, but when computing power or storage is limited, such networks may not be suitable. The Google team proposed MobileNet, a network structure with small size and high computational efficiency. It suits a variety of image applications on mobile and embedded devices, and the V1, V2, and V3 versions were successively designed as it developed and improved. The network structure of MobileNetV3 is shown in table 1.
TABLE 1
[Table 1: the MobileNetV3 network structure, reproduced as an image in the original; the column meanings are described below.]
In table 1, column 1 gives the division of the network structure; column 2, Input, gives the shape of each feature layer; column 3, Operator, gives the block structure each feature layer passes through; column 4 gives the number of channels after the expansion step of the inverted residual structure in bneck; column 5 gives the number of channels of the feature layer when it is input to bneck; column 6, SE, indicates whether the attention mechanism is introduced at that layer; column 7, NL, gives the kind of activation function, where HS denotes h-swish and RE denotes ReLU; column 8, Stride, gives the step size of each block structure.
MobileNetV1 uses depthwise separable convolutions to reduce parameters and computation and improve computational efficiency. MobileNetV2 adds the inverted residual with linear bottleneck module to form an efficient basic block. MobileNetV3 integrates the ideas of V1, V2, and the SE attention mechanism: it combines depthwise separable convolution, the inverted residual structure with a linear bottleneck, and the SE (Squeeze-and-Excitation) lightweight attention model structure, moves the average pooling layer forward in the last stage, removes the last convolution layer, and introduces the h-swish activation function. Specifically, the depthwise separable convolution performs a 3×3 depthwise convolution after the input is expanded by a 1×1 convolution; the inverted residual structure with a linear bottleneck first raises the dimension with a 1×1 convolution, then performs the subsequent operations and uses residual edges; the SE lightweight attention model works by adjusting the weight of each channel; and h-swish replaces the swish function to reduce computation and improve performance.
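The two MobileNet primitives just described, h-swish and the depthwise separable convolution, can be sketched as follows; the BN-plus-activation placement shown is one common arrangement, not necessarily the exact one in the original design:

```python
import torch.nn as nn
import torch.nn.functional as F

class HSwish(nn.Module):
    # h-swish(x) = x * ReLU6(x + 3) / 6  (PyTorch also ships this as nn.Hardswish).
    def forward(self, x):
        return x * F.relu6(x + 3.0) / 6.0

def depthwise_separable(c_in, c_out, k=3, s=1):
    return nn.Sequential(
        # Depthwise: groups=c_in gives one k x k filter per input channel.
        nn.Conv2d(c_in, c_in, k, s, k // 2, groups=c_in, bias=False),
        nn.BatchNorm2d(c_in),
        HSwish(),
        # Pointwise: a 1x1 convolution mixes information across channels.
        nn.Conv2d(c_in, c_out, 1, bias=False),
        nn.BatchNorm2d(c_out),
        HSwish())
```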
MobileNetV3 has three main steps: a 1×1 convolution converting from the input channels to the expansion channels; a 3×3 or 5×5 depthwise convolution on the expansion channels with step size stride; and a 1×1 convolution converting from the expansion channels to the output channels. There are also three optional choices: whether to add the SE structure; whether to adopt the residual structure; and the choice between two activation functions, ReLU and h-swish. The network structure can be divided into three parts. The initial part is 1 convolution layer, extracting features through a 3×3 convolution; the middle part is a number of convolution layers composed of Bneck blocks (the Bneck structure is shown in fig. 3); the last part replaces the fully connected output with two 1×1 convolution layers. The initial part, i.e., the 1st convolution layer in the structure, comprises 3 parts: a convolution layer, a BN layer, and an h-swish activation layer. The middle part is a network structure of several blocks (MobileBlock) containing convolution layers, as shown in the table, where SE is the Squeeze-and-Excite mechanism structure, optionally added; NL denotes the nonlinearity, where HS is the h-swish activation function and RE is the ReLU activation function; Bneck is the bottleneck layer, i.e., MobileBlock; and exp size is the number of channels used in the computation inside the bneck structure. In the last part, computation is reduced by moving the average pooling layer forward and using a 1×1 convolution instead of the Squeeze operation.
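A sketch of a bneck block following the three steps listed above (1×1 expansion, depthwise convolution with stride, 1×1 linear projection, with optional attention and a residual edge when shapes match); the argument names are illustrative:

```python
import torch.nn as nn

class Bneck(nn.Module):
    def __init__(self, c_in, c_exp, c_out, k=3, s=1, attn=None, act=nn.ReLU):
        super().__init__()
        self.use_res = (s == 1 and c_in == c_out)   # residual only if shapes match
        layers = [
            nn.Conv2d(c_in, c_exp, 1, bias=False), nn.BatchNorm2d(c_exp), act(),
            nn.Conv2d(c_exp, c_exp, k, s, k // 2, groups=c_exp, bias=False),
            nn.BatchNorm2d(c_exp), act(),
        ]
        if attn is not None:
            layers.append(attn(c_exp))              # SE in stock V3, CA in this method
        # Linear bottleneck: no activation after the 1x1 projection.
        layers += [nn.Conv2d(c_exp, c_out, 1, bias=False), nn.BatchNorm2d(c_out)]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_res else out
```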
In one embodiment, a CA attention mechanism module is used to replace the SE module in the bneck of the preset MobileNetV3 network structure. For the improvement of the BackBone network of the YOLOv4 structure, the invention adopts the small, computationally efficient MobileNetV3 structure and changes the SE attention module in the MobileNetV3 structure to a CA attention mechanism, which enhances feature expression capability while making the model lighter; the whole network structure has fewer FLOPs and parameters.
Channel attention (e.g., SE attention) has a significant effect on improving model performance, but it only re-weights the importance of each channel by modeling channel relationships, ignoring position information, which is important for generating spatially selective attention maps. Therefore, another attention mode is adopted that considers not only the relationships among channels but also the position information of the feature space, embedding the position information into channel attention. This is called Coordinate Attention, abbreviated as the CA attention mechanism, and its operation is divided into 2 steps: coordinate information embedding and coordinate attention generation.
Coordinate information embedding decomposes global pooling along the horizontal and vertical coordinates, factoring channel attention into two one-dimensional feature encoding processes that aggregate features along the two spatial directions (X and Y), so that the attention module can capture long-range spatial interactions with precise position information. Specifically, pooling kernels of size (H, 1) and (1, W) are used to encode, for each channel of the given input, the aggregated features along the X and Y coordinates respectively, yielding feature maps that are aware of both the X and Y directions. These two transformations allow the attention module to capture dependencies along one spatial direction while preserving position information along the other, helping the network locate objects of interest more accurately. Through this splitting, a global receptive field is obtained and precise position information can be encoded.
Coordinate attention generation first concatenates (Concat) the two feature maps produced above along the spatial dimension and transforms them with a 1×1 convolution transform function F1, giving an intermediate feature map f = δ(F1([z^h, z^w])) that encodes spatial information in both the horizontal and vertical directions, where δ is a nonlinear activation function. The intermediate map f is then split along the spatial dimension into 2 separate tensors f^h and f^w. Two further 1×1 convolution transforms, F_h and F_w, convert f^h and f^w into tensors with the same number of channels as the input, giving the representations in the height and width directions, g^h = σ(F_h(f^h)) and g^w = σ(F_w(f^w)), where σ is the sigmoid function; the outputs g^h and g^w are then expanded and used as attention weights. Finally, the output of the Coordinate Attention block can be written as (where x is the input):
y = x × g^h × g^w
Thus a Coordinate Attention block can be regarded as a computational unit that captures long-range dependencies along one spatial direction while retaining precise position information along the other; the generated feature maps are encoded into a pair of direction-aware and position-sensitive attention maps, which are applied complementarily to the input feature map to enhance the expressive power of features in the network. The overall input/output structure is shown in fig. 4.
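A sketch of the Coordinate Attention block described above, following the public reference design of the CA paper; the reduction ratio of 32 and the minimum width of 8 are conventional choices, not mandated here:

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # (H,1): pool along width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # (1,W): pool along height
        self.conv1 = nn.Conv2d(channels, mid, 1, bias=False)   # shared F1
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, 1)       # F_h
        self.conv_w = nn.Conv2d(mid, channels, 1)       # F_w

    def forward(self, x):
        n, c, h, w = x.size()
        xh = self.pool_h(x)                              # (n, c, h, 1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)          # (n, c, w, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([xh, xw], dim=2))))
        yh, yw = torch.split(y, [h, w], dim=2)           # split along spatial dim
        gh = torch.sigmoid(self.conv_h(yh))                       # (n, c, h, 1)
        gw = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2)))   # (n, c, 1, w)
        return x * gh * gw                               # y = x × g^h × g^w
```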
In the MobileNetV3 structure, a lightweight SE attention model was added. The SE attention model effectively builds the interdependence between channels by compressing each two-dimensional feature map and re-weighting the importance of each channel; it mainly adjusts channel weights by modeling channel relationships. However, it does not make full use of position information, so the method replaces the SE module with the CA attention mechanism module described in the previous section to form the preset MobileNetV3 structure. For the overall YOLOv4 mask-wearing detection structure, the BackBone is replaced by this improved preset MobileNetV3 structure; to further reduce the number of parameters, depthwise separable convolution is used in place of ordinary convolution; and the last three effective feature layers obtained by the BackBone feature extraction network are then taken out to construct an enhanced feature pyramid. This completes the improved preset YOLOv4 network used for mask-wearing detection. The modified structure is shown in fig. 5.
The invention adopts an improved YOLOv4 detection framework for mask-wearing detection: the BackBone reference network part adopts the small, computationally efficient MobileNetV3 structure; depthwise separable convolution is used instead of ordinary convolution throughout the network; and the SE attention module in the MobileNetV3 structure is changed to a CA attention mechanism, embedding position information into channel attention to enhance feature expression capability.
In the preset MobileNetV3 network structure, as one implementation, depthwise separable convolution is adopted to replace ordinary convolution, and the last three effective feature layers obtained by the BackBone reference network are then taken out to construct the enhanced feature pyramid, as sketched below.
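One hedged way to realize "take out the last three effective feature layers" with an off-the-shelf MobileNetV3 (here torchvision's stock model, standing in for the patent's modified backbone) is to record the last feature map of each spatial stage and keep the final three, which correspond to strides 8, 16, and 32:

```python
import torch
from torchvision.models import mobilenet_v3_large

def last_three_feature_maps(x):
    feats, stage_ends, prev = mobilenet_v3_large().features, [], x
    for layer in feats:
        y = layer(prev)
        if y.shape[-1] != prev.shape[-1]:   # spatial size shrank: a stage just ended
            stage_ends.append(prev)         # keep the last map of the finished stage
        prev = y
    stage_ends.append(prev)                 # the final stride-32 map
    return stage_ends[-3:]                  # e.g. 76x76, 38x38, 19x19 for a 608 input

maps = last_three_feature_maps(torch.randn(1, 3, 608, 608))
print([tuple(m.shape) for m in maps])
```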
The preset YOLOv4 target detection model of the invention is compared with the existing YOLOv4 network as follows, mainly in three parts.
(1) Comparison of the computation of standard convolution and depthwise separable convolution.
Assume an input feature map of size D_F × D_F with M channels, and convolution kernels of size D_K × D_K with M channels each, N kernels in total; after convolution, an output feature map of size D_F × D_F with N channels is obtained. The following computation can then be derived. Each pixel of each layer in the output feature map of the standard convolution is the result of one convolution, and the computation of each convolution is D_K × D_K × M; the computation of each convolution kernel over the whole output map is D_K × D_K × M × D_F × D_F; with N convolution kernels in total, the total computation is D_K × D_K × M × N × D_F × D_F. Of the computation of the depthwise separable convolution, the depth-wise convolution accounts for D_K × D_K × M × D_F × D_F and the point-wise convolution accounts for M × N × D_F × D_F; the total computation is the sum of the two. Therefore, with the sizes of the input feature map, the output feature map, and the convolution kernel unchanged, the ratio of the computation of the depthwise separable convolution to that of the standard convolution is:
(D_K × D_K × M × D_F × D_F + M × N × D_F × D_F) / (D_K × D_K × M × N × D_F × D_F) = 1/N + 1/D_K²
Therefore, the larger the convolution kernel or the larger the number of kernels, the smaller this ratio, and the smaller the relative computation of the depthwise separable convolution. In practice the convolution kernel size is usually fixed, but replacing the ordinary convolutions throughout the improved MobileNetV3-YOLOv4 network significantly reduces the computational load of the convolution operations.
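A quick numeric check of the ratio derived above:

```python
# cost(depthwise separable) / cost(standard) = 1/N + 1/Dk^2,
# independent of the feature-map size.
def flops_ratio(dk: int, n: int) -> float:
    return 1 / n + 1 / dk ** 2

print(flops_ratio(3, 256))   # 3x3 kernels, 256 output channels -> ~0.115 (about 8.7x fewer ops)
```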
(2) Comparison of the computation of the MobileNetV3 and CSPDarknet53 networks.
According to the respective ways the network structures are built and the parameter quantities are calculated, the CSPDarknet53 network has 27.6M parameters while the MobileNetV3 network has only 4.97M. Although MobileNetV3 as the BackBone gives a somewhat lower accuracy in the detection network than CSPDarknet53, the number of parameters is reduced to about 1/5, which greatly reduces storage space and leaves substantial headroom for deploying the model on edge devices such as mobile and embedded devices.
(3) Comparison of the CA and SE attention mechanisms.
Detection experiments on the Pascal VOC dataset show that adding the CA attention mechanism improves the average AP by about 2% over the SE attention mechanism, which greatly helps accuracy. The method therefore combines CA with MobileNetV3, replacing the original SE module, to improve detection accuracy. Meanwhile, on classification over other datasets, models with the CA attention mechanism show better transfer capability.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (3)

1. A mask detection method based on yolov4 is characterized by comprising the following steps:
Step 1: collecting images of people wearing and not wearing masks in public places, screening samples by image clarity, and making a training set;
Step 2: constructing a preset YOLOv4 target detection model; the preset YOLOv4 target detection model comprises an input end, a BackBone reference network, a Neck intermediate layer, and a Head output layer, wherein the BackBone reference network adopts a preset MobileNetV3 network structure;
the initial part of the preset MobileNetV3 network structure comprises 1 convolution layer, extracting features through a 3×3 convolution; the middle part comprises a plurality of convolution layers composed of bneck blocks; the final part replaces the fully connected output with two 1×1 convolution layers;
Step 3: putting the training set into the preset YOLOv4 target detection model for training, to obtain a trained preset YOLOv4 target detection model with high robustness;
Step 4: detecting a video stream or picture to be detected with the trained preset YOLOv4 target detection model, and judging whether each face target in the video stream or picture is wearing a mask.
2. The mask detection method based on yolov4 of claim 1, wherein in the bneck of the preset MobileNetV3 network structure, a CA attention mechanism module is used to replace an SE module.
3. The mask detection method based on yolov4 according to claim 1, characterized in that, in the preset MobileNetV3 network structure, depthwise separable convolution is adopted to replace ordinary convolution, and the last three effective feature layers obtained by the BackBone reference network are then taken out to construct the enhanced feature pyramid.
CN202111088283.6A 2021-09-16 2021-09-16 Mask detection method based on yolov4 Active CN113762201B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111088283.6A CN113762201B (en) 2021-09-16 2021-09-16 Mask detection method based on yolov4

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111088283.6A CN113762201B (en) 2021-09-16 2021-09-16 Mask detection method based on yolov4

Publications (2)

Publication Number Publication Date
CN113762201A 2021-12-07
CN113762201B (en) 2023-05-09

Family

ID=78796056

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111088283.6A Active CN113762201B (en) 2021-09-16 2021-09-16 Mask detection method based on yolov4

Country Status (1)

Country Link
CN (1) CN113762201B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931623A (en) * 2020-07-31 2020-11-13 南京工程学院 Face mask wearing detection method based on deep learning
CN112016464A (en) * 2020-08-28 2020-12-01 中移(杭州)信息技术有限公司 Method and device for detecting face shielding, electronic equipment and storage medium
CN112232199A (en) * 2020-10-15 2021-01-15 燕山大学 Wearing mask detection method based on deep learning
CN112183471A (en) * 2020-10-28 2021-01-05 西安交通大学 Automatic detection method and system for standard wearing of epidemic prevention mask of field personnel
CN112949572A (en) * 2021-03-26 2021-06-11 重庆邮电大学 Slim-YOLOv 3-based mask wearing condition detection method

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114283469A (en) * 2021-12-14 2022-04-05 贵州大学 Lightweight target detection method and system based on improved YOLOv4-tiny
CN114387484A (en) * 2022-01-11 2022-04-22 华南农业大学 Improved mask wearing detection method and system based on yolov4
CN114387484B (en) * 2022-01-11 2024-04-16 华南农业大学 Improved mask wearing detection method and system based on yolov4
CN114092820A (en) * 2022-01-20 2022-02-25 城云科技(中国)有限公司 Target detection method and moving target tracking method applying same
CN116912237A (en) * 2023-09-08 2023-10-20 江西拓荒者科技有限公司 Printed circuit board defect detection method and system based on image recognition
CN116912237B (en) * 2023-09-08 2023-12-12 江西麦可罗泰克检测技术有限公司 Printed circuit board defect detection method and system based on image recognition
CN117218606A (en) * 2023-11-09 2023-12-12 四川泓宝润业工程技术有限公司 Escape door detection method and device, storage medium and electronic equipment
CN117218606B (en) * 2023-11-09 2024-02-02 四川泓宝润业工程技术有限公司 Escape door detection method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN113762201B (en) 2023-05-09

Similar Documents

Publication Publication Date Title
CN113762201B (en) Mask detection method based on yolov4
CN112287940A (en) Semantic segmentation method of attention mechanism based on deep learning
CN110147743A (en) Real-time online pedestrian analysis and number system and method under a kind of complex scene
CN112183471A (en) Automatic detection method and system for standard wearing of epidemic prevention mask of field personnel
CN110852182B (en) Depth video human body behavior recognition method based on three-dimensional space time sequence modeling
CN114758288A (en) Power distribution network engineering safety control detection method and device
WO2023030182A1 (en) Image generation method and apparatus
CN110991444A (en) Complex scene-oriented license plate recognition method and device
CN113379771A (en) Hierarchical human body analytic semantic segmentation method with edge constraint
CN114998830A (en) Wearing detection method and system for safety helmet of transformer substation personnel
CN115861619A (en) Airborne LiDAR (light detection and ranging) urban point cloud semantic segmentation method and system of recursive residual double-attention kernel point convolution network
CN113807361A (en) Neural network, target detection method, neural network training method and related products
Mao et al. Panoptic lintention network: Towards efficient navigational perception for the visually impaired
CN115019274A (en) Pavement disease identification method integrating tracking and retrieval algorithm
CN113936299A (en) Method for detecting dangerous area in construction site
Huu et al. Proposed detection face model by mobilenetv2 using asian data set
CN112597902A (en) Small target intelligent identification method based on nuclear power safety
CN114764856A (en) Image semantic segmentation method and image semantic segmentation device
CN116740516A (en) Target detection method and system based on multi-scale fusion feature extraction
CN116310967A (en) Chemical plant safety helmet wearing detection method based on improved YOLOv5
Zhang et al. Lightweight PM-YOLO network model for moving object recognition on the distribution network side
CN109583584A (en) The CNN with full articulamentum can be made to receive the method and system of indefinite shape input
CN112634411B (en) Animation generation method, system and readable medium thereof
CN115331112A (en) Infrared and visible light image fusion method and system based on multi-granularity word elements
CN112446292B (en) 2D image salient object detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant