CN111401202A - Pedestrian mask wearing real-time detection method based on deep learning - Google Patents

Pedestrian mask wearing real-time detection method based on deep learning

Info

Publication number
CN111401202A
CN111401202A
Authority
CN
China
Prior art keywords
layer
pruning
layers
channel
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010164210.XA
Other languages
Chinese (zh)
Inventor
王兵
乐红霞
赵春兰
肖斌
李文璟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Petroleum University
Original Assignee
Southwest Petroleum University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Petroleum University filed Critical Southwest Petroleum University
Priority to CN202010164210.XA priority Critical patent/CN111401202A/en
Publication of CN111401202A publication Critical patent/CN111401202A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161: Detection; Localisation; Normalisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Abstract

The invention provides a pedestrian mask wearing real-time detection method based on deep learning, comprising the following steps: S1, building a robust backbone network; S2, multi-scale training; S3, compressing the model; and S4, optimizing the model. The method takes into account labor overhead, the computation cost of small hardware storage devices, and time overhead. It can judge more quickly whether a pedestrian is wearing a mask, and has good engineering practicability.

Description

Pedestrian mask wearing real-time detection method based on deep learning
Technical Field
The invention belongs to the fields of the Internet of Things and artificial intelligence, and particularly relates to a deep-learning-based method, and its implementation, for real-time detection of pedestrian mask wearing.
Background
Some large-scale viruses are transmitted through media such as droplets, and while effective antiviral drugs have not yet been developed, mask wearing is very important in slowing disease transmission. Masks are necessary protective articles in special periods, and dedicated personnel are stationed at the entrances of places with large crowd flow, such as communities, supermarkets and stations, to check whether masks are worn. However, this checking method consumes a large amount of manual resources, and misses can occur when the crowd flow is large. Therefore, realizing real-time detection of pedestrian mask wearing has very important practical significance.
Real-time detection of pedestrian mask wearing involves computer vision and embedded hardware devices, and requires cameras and hardware platforms that can perceive the environment, analyze the scene and respond accordingly. Because of the complexity of the detection environment, the demands on automatic detection platforms built around small cameras have grown, and real-time detection of pedestrian mask wearing presents many new challenges, such as: (1) how to correctly distinguish whether a mask is worn; (2) how to deploy a real-time mask-wearing detection system on platforms with limited computing power and memory; (3) how to balance the requirements of detection real-time performance and detection accuracy. An effective way to address these challenges is neural network detection based on deep learning, which has recently produced many notable results in artificial intelligence fields such as image classification (e.g., ResNet), object detection (e.g., Faster R-CNN, SSD and the YOLO series) and image segmentation (e.g., Mask R-CNN). With its advantages of speed, scalability and end-to-end learning, deep learning has yielded many detection frameworks; among them, the YOLO series is perhaps the most widely used in practical applications, striking a good balance between detection speed and accuracy. Existing detection algorithms can broadly be divided into two-stage and one-stage methods.
The two-stage detection method first generates a sparse set of candidate boxes using a candidate-box generator and extracts features from each candidate box, then predicts the class of each candidate region using a region classifier. The one-stage detection method directly performs class prediction on objects at each position of the feature map, without the region classification step of the two-stage method. Generally speaking, two-stage methods achieve better detection performance and obtain the current best results on public benchmarks, while one-stage methods are more time-efficient and have stronger applicability to real-time object detection. These detection methods can extract targets of interest from pictures or videos, and are often applied in fields such as blind-guidance systems, pedestrian detection, traffic-sign detection and vehicle detection.
The paper "Face detection based on occlusion area detection and recovery", published by Yihan Xiao et al. in Multimedia Tools and Applications in 2019, proposes face detection based on occlusion areas. For the occluded-face detection problem, it presents an optimal occlusion-region positioning algorithm, POOA. After saliency-detection processing, an average gray value is computed from the face image and multiplied by a suitable coefficient as a threshold to obtain a binary image. Then, borrowing the idea of Haar features, two features (a large rectangle and a large T shape) are used for retrieval and combined with the binary image to obtain the occlusion region of the face. Finally, a robust principal component analysis (PCA) method is used to obtain the best projection of the occluded face and fill in the occlusion region. The method achieves good results across occlusion areas, sizes and shapes, and improves detection precision to different degrees, but it does not address face detection when a mask is worn.
The paper "A study of feature-based face classification algorithms", published in Microcomputer Applications by Li Xia et al., systematically studies the performance of different face classification algorithms by classifying face images according to the two attributes of sunglasses and masks. The algorithms covered include principal component analysis (PCA), linear discriminant analysis (LDA), the correlation coefficient (Correlation), the support vector machine (SVM) and the Adaboost algorithm, and experimental comparison results are provided on the OMRON face database.
The Chinese patent document "Mask detection system and method based on fast Fourier transform and linear Gaussian" (publication number CN 109507198 A, published March 22, 2019) discloses a mask detection system. The disclosed method comprises an image acquisition device, a reading device, a detection device, a modeling device and an evaluation device, which may be implemented as hardware or as software modules executed by a processor. That patent considers conditions such as the size of the mask, the length of its ear bands, the length of the aluminum strip inside it, and stains on its surface, and provides a mask detection system based on fast Fourier transform and linear Gaussian models; however, it does not detect the case of pedestrians actually wearing masks.
The Chinese patent document "A detection method for occluded human faces" (publication number CN 108062542 A, published May 22, 2019) reports a method that detects occluded faces by constructing an image pyramid for each frame. It trains a Boosting method, using many face and non-face pictures, to judge whether a face is occluded; the classifiers are divided according to the local face position of the points used when each weak classifier extracts features, and face detection and abnormality detection are performed on the classifier set obtained after this division. The method can correctly judge the wearing of masks of different colors, eye shields of different colors, dark sunglasses and the like. It considers face detection when a mask is worn, but does not consider time overhead.
In summary, existing systems and methods for real-time detection of pedestrian mask wearing are still few, yet in some special periods wearing a mask is very important. Deep learning networks are typically large, deep and complex; on platforms with limited computing power and memory they exhibit high latency and cannot achieve real-time detection, or cannot satisfy detection real-time performance and detection accuracy at the same time. To address these drawbacks, the invention provides a pedestrian mask wearing real-time detection method based on deep learning, which increases detection speed while keeping detection precision nearly unchanged, thereby improving detection efficiency.
Disclosure of Invention
In order to solve the above problems, the invention provides a pedestrian mask wearing real-time detection method based on deep learning, used to detect whether pedestrians wear masks correctly in front of residential areas, supermarkets, station entrances and the like. By adding a multi-scale context-information fusion operation and a model compression operation, the false detection rate is reduced and the detection speed is increased.
A pedestrian mask wearing real-time detection method based on deep learning comprises the following steps:
s1, building a robust backbone network
A backbone network, Darknet53, is adopted as the feature extractor. Darknet53 has 52 convolutional layers as its main network layers, and the last layer is a fully connected layer implemented as a 1 × 1 convolution. The first layer of the main network is a convolution, followed by 5 repeated resblock_body blocks; each resblock_body_n includes 1 individual convolution and a group of res_unit_n units, where res_unit_n is a residual unit of two convolutions repeated n times (n is 1, 2, 8, 8, 4). In total:
1 + (1 + 1 × 2) + (1 + 2 × 2) + (1 + 8 × 2) + (1 + 8 × 2) + (1 + 4 × 2) = 52 layers;
wherein, the res _ unit _ n has a fast connection layer short, and the residual layer does not belong to the convolutional layer calculation;
the first convolution layer in the main network is composed of a two-dimensional convolutional layer, a batch normalization layer, and a LeakyReLU layer with slope 0.1; L2 regularization (Equation (2)) with parameter 5e-4 is applied to the kernel weight matrix of the two-dimensional convolutional layer. LeakyReLU is an activation function improved from the basic ReLU: ReLU sets all negative values to zero, so when a large gradient passes through a ReLU neuron the updated gradient can become 0; if the learning rate is large, it is then likely that many neurons in the network will never activate on any data. LeakyReLU gives all negative values a non-zero slope, which solves this problem. The formula is as follows:
$$f(x)=\begin{cases}x, & x>0\\ \alpha x, & x\le 0\end{cases},\qquad \alpha=0.1$$
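A minimal sketch of LeakyReLU with the slope 0.1 used here:

```python
def leaky_relu(x, slope=0.1):
    """LeakyReLU: identity for positive inputs, small non-zero slope for negatives."""
    return x if x > 0 else slope * x

print(leaky_relu(2.0))   # 2.0
print(leaky_relu(-4.0))  # -0.4 (a plain ReLU would output 0 here)
```

The non-zero slope keeps a gradient flowing through negative activations, which is the point made in the text about "dead" ReLU neurons.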
darknet53 uses more consecutive convolution layers of 3 × 3 and 1 × 1 and organizes them into residual blocks;
s2, multi-scale training
Bounding boxes are predicted at three different scales. Suppose the input RoI has width w and height h, and S is an RoI context scale factor; the regions at the three different scales share the same center, and context information of the different regions is extracted from the multiple scales. At the same time, three detection heads of different scales are established on the feature map, each responsible for detecting targets of a different scale. Each grid cell in a detection head is assigned three different anchors so as to predict three detections, each consisting of 4 bounding-box coordinates, 1 objectness score and C class predictions; the resulting tensor of a detection head is N × N × (3 × (4 + 1 + C)), where N × N is the spatial size of the final convolutional feature map;
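The head's channel count follows directly from that expression. As a sketch, assuming C = 2 for this task (mask worn / not worn; the class count is an assumption, not stated in the text):

```python
# Sketch: channels per grid cell of one detection head, 3 * (4 + 1 + C).
def head_channels(num_classes):
    anchors_per_grid = 3
    box_coords, objectness = 4, 1
    return anchors_per_grid * (box_coords + objectness + num_classes)

print(head_channels(2))   # 21 channels per cell; a 13 x 13 head holds 13*13*21 values
print(head_channels(80))  # 255, the familiar COCO-sized head, as a sanity check
```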
by introducing Maxout, together with dropout, as the activation function of the layer, multiple face RoIs are fused and output to the pooling layer, where the Maxout formula is as follows:
$$h_i(x)=\max_{j\in[1,k]} z_{ij},\qquad z_{ij}=x^{T}W_{\cdot ij}+b_{ij}$$
here the weight W is a three-dimensional matrix of size d × m × k and the bias b is a two-dimensional matrix of size m × k; both are parameters to be learned. m represents the number of hidden-layer nodes, and k means each hidden-layer node corresponds to k "hidden-hidden" nodes (linear pieces). When Maxout is used, the three feature maps are fused into a single feature map of the same dimension; the feature maps share the weights of each layer before the RoI pooling layer, and RoIs of different scales are forward-propagated to the target RoI pooling layer to obtain feature maps of fixed resolution;
the method can determine whether a face is present and whether the face is occluded. After the face is located, four anchor coordinates are generated: the upper-left vertex (x0, y0) and the lower-right vertex (x1, y1), defining an anchor box. The anchor box is divided into an upper half and a lower half of the face: the upper half has upper-left vertex (x0, y0) and lower-right vertex (x1, (y0 + y1)/2); the lower half has upper-left vertex (x0, (y0 + y1)/2) and lower-right vertex (x1, y1). If the center point of the occluded area falls in the lower half of the face anchor box and the IOU is greater than a threshold P, the pedestrian is considered to be wearing a mask; if the center point falls in the lower half but the IOU is below the threshold, or no occlusion is detected in the lower half of the face, the pedestrian is considered not to be wearing a mask;
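That decision rule can be sketched as follows; the threshold p = 0.5 and the box coordinates are illustrative values, not those of the invention:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter) if inter else 0.0

def wears_mask(face, occlusion, p=0.5):
    """face, occlusion: (x0, y0, x1, y1); p: IoU threshold (illustrative)."""
    x0, y0, x1, y1 = face
    lower = (x0, (y0 + y1) / 2, x1, y1)     # lower half of the face anchor box
    cx = (occlusion[0] + occlusion[2]) / 2  # occlusion center point
    cy = (occlusion[1] + occlusion[3]) / 2
    centre_in_lower = lower[0] <= cx <= lower[2] and lower[1] <= cy <= lower[3]
    return centre_in_lower and iou(lower, occlusion) > p

face = (0, 0, 100, 120)
print(wears_mask(face, (0, 60, 100, 120)))  # True: occlusion covers the lower half
print(wears_mask(face, (0, 0, 100, 30)))    # False: occlusion only on the upper half
```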
s3 model compression
After step S2 a preliminary result is obtained; the model is then compressed. The model compression method is divided into two parts: channel sparsity and channel pruning;
s4 model optimization
Model optimization operations are performed on the channel-pruned model to compensate for the temporarily reduced accuracy and potential performance degradation of the method. In fine-grained target detection tasks, detection performance is usually sensitive to channel pruning. Model optimization is carried out through fine-tuning; the number of updates, the number of iterations, the learning rate and the regularization parameters are modified to obtain a suitable model.
The channel sparsity in step S3 includes the following steps:
First, a penalty term is applied to the model parameters: the L1 norm (Equation (3)), i.e. the sum of the absolute values of all parameters, is used to penalize the activation units in the neural network and sparsify them:
$$\Omega(\theta)=\|\omega\|_{1}=\sum_{i}|\omega_{i}| \qquad (3)$$
The L1-regularized loss function is:
$$\tilde{J}(\omega;X,y)=\alpha\|\omega\|_{1}+J(\omega;X,y) \qquad (4)$$
The corresponding gradient is:
$$\nabla_{\omega}\tilde{J}(\omega;X,y)=\alpha\,\mathrm{sign}(\omega)+\nabla_{\omega}J(\omega;X,y) \qquad (5)$$
It can be seen that the effect of regularization on the gradient is no longer to scale each ω_j linearly, but to add a constant whose sign is sign(ω_j). In this case the different components of ω have no correlation, so L1 regularization drives some elements of the optimal solution to 0, producing sparsity. Here α ∈ [0, ∞) trades off the relative contribution of the norm penalty term, with larger α corresponding to stronger regularization;
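The sparsity-inducing effect of that constant step can be sketched with the proximal (soft-thresholding) form of the L1 update; the threshold value here is illustrative, not from the invention:

```python
def soft_threshold(w, t):
    """Proximal step for the L1 penalty |w|: shrink the magnitude by t, clip to 0."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0  # weights already smaller than the step land exactly on zero

weights = [0.8, 0.05, -0.3, -0.02]
print([round(soft_threshold(w, 0.1), 4) for w in weights])
# [0.7, 0.0, -0.2, 0.0]: small weights become exactly zero, unlike with L2
```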
then, a scaling factor is assigned to each channel, where the absolute value of the scaling factor represents the importance of the channel, and inputs with low importance are deleted. Specifically, except for the detection heads, each convolutional layer is followed by a BN layer to accelerate convergence and improve generalization; the BN layer normalizes the convolutional features using mini-batch statistics, as in Equation (6):
$$\hat{x}^{(k)}=\frac{x^{(k)}-\mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]+\epsilon}},\qquad y^{(k)}=\gamma\hat{x}^{(k)}+\beta \qquad (6)$$
where E[x^(k)] and √Var[x^(k)] are the mean and standard deviation of the input features over the mini-batch, and γ and β are trainable scale and shift factors. They are learned during training and allow the network to recover the feature distribution learned by the original network.
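A per-channel sketch of this normalization, with illustrative γ = 1, β = 0 and a small eps for numerical stability:

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel over a mini-batch, then scale and shift."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]

out = batch_norm([1.0, 2.0, 3.0, 4.0])
print([round(v, 3) for v in out])  # [-1.342, -0.447, 0.447, 1.342]
```

During channel sparsity, it is the |γ| of each channel's BN layer that serves as the importance score to be driven toward zero.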
The channel pruning in the step S3 includes the following steps:
introducing a global threshold value after channel sparse training
Figure BDA0002406822140000053
Determining whether to prune the characteristic channel to control the pruning rate; also introduces a local safety threshold
Figure BDA0002406822140000054
To prevent excessive pruning on convolutional layers and to maintain the integrity of network connections; directly discarding the maximum pooling layer and the upper sampling layer in the pruning process; first, according to a global threshold
Figure BDA0002406822140000055
And local safety threshold
Figure BDA0002406822140000056
Constructing a pruning area for all the convolutional layers; for the routing layer, the pruning sizes of the routing layer incoming layers are connected in sequence, and the connection sequence is used as the pruning sequence; the fast connect layer has a similar effect to ResNet, so all layers connected to the fast connect layer must have the same number of channels; in order to match the feature channels of each layer of the fast-connect layer, the pruning order for all the connection layers is iterated and an OR operation is performed on the pruning orders to generate a final pruning order for the connection layers.
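A sketch of threshold-based channel selection, using the BN scale factors γ as channel importance; interpreting the local safety threshold as a minimum keep-ratio per layer is an assumption for illustration:

```python
def prune_mask(gammas, global_thresh, local_keep_ratio=0.1):
    """Keep channels whose |gamma| exceeds the global threshold, but always
    retain at least a local fraction of channels as a safety margin."""
    keep = [abs(g) > global_thresh for g in gammas]
    min_keep = max(1, int(len(gammas) * local_keep_ratio))
    if sum(keep) < min_keep:
        # local safety threshold: fall back to keeping the largest-|gamma| channels
        order = sorted(range(len(gammas)), key=lambda i: -abs(gammas[i]))
        keep = [i in order[:min_keep] for i in range(len(gammas))]
    return keep

print(prune_mask([0.9, 0.01, 0.4, 0.002], 0.1))  # [True, False, True, False]
```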
The invention has the following beneficial effects: compared with the traditional method of manually checking mask wearing, the proposed method takes into account labor overhead, the computation cost of small hardware storage devices, and time overhead. It can judge more quickly whether a pedestrian is wearing a mask, and has good engineering practicability.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
fig. 2 is a diagram of a backbone network architecture of the present invention;
FIG. 3 is a multi-scale block diagram of the present invention;
FIG. 4 is a diagram of the model compression process of the present invention.
Detailed Description
The specific technical scheme of the invention is described below in combination with an embodiment.
A pedestrian mask wearing real-time detection method based on deep learning is disclosed, the flow is shown in figure 1, and the method comprises the following steps:
s1, building a robust backbone network
The invention adopts the backbone network Darknet53 as the feature extractor; the network structure is shown in FIG. 2. Darknet53 has 52 convolutional layers as its main network layers, and the last layer is a fully connected layer implemented as a 1 × 1 convolution. The first layer of the main network is a convolution, followed by 5 repeated resblock_body blocks; each resblock_body_n comprises 1 individual convolution and a group of res_unit_n units, where res_unit_n is a residual unit of two convolutions repeated n times (n is 1, 2, 8, 8, 4), for a total of 1 + (1 + 1 × 2) + (1 + 2 × 2) + (1 + 8 × 2) + (1 + 8 × 2) + (1 + 4 × 2) = 52 layers. The residual layer is not counted among the convolutional layers.
There is a shortcut (fast connection) layer in res_unit_n; this residual form is motivated by the degradation problem. Theoretically, the training error of a deeper model should be no larger than that of a shallower one, but the degradation problem shows that solvers have difficulty fitting an identity function with a multi-layer network, whereas the residual representation makes the multi-layer network easier to approximate it with.
The first convolution layer in the main network consists of a two-dimensional convolutional layer, a batch normalization layer, and a LeakyReLU layer with slope 0.1. L2 regularization (Equation (2)) with parameter 5e-4 is applied to the kernel weight matrix.
The penalty function of the regularization method is as follows:
$$\tilde{J}(\omega;X,y)=J(\omega;X,y)+\alpha\,\Omega(\omega) \qquad (1)$$
where X and y are the training samples and corresponding labels, ω is the weight coefficient vector, J is the objective function, Ω(ω) is the penalty term, and the parameter α controls the strength of regularization. The penalty term of L2 regularization is:
$$\Omega(\omega)=\tfrac{1}{2}\|\omega\|_{2}^{2} \qquad (2)$$
Assume the optimal solution of the original objective function J(ω) is ω*, with J twice differentiable. The second-order Taylor expansion of J(ω) at ω* is:
$$\hat{J}(\omega)=J(\omega^{*})+\tfrac{1}{2}(\omega-\omega^{*})^{T}H(\omega-\omega^{*})$$
where H is the Hessian matrix of J(ω) at ω*, with eigenvalues λ_j. The gradient
$$\nabla_{\omega}\hat{J}(\omega)=H(\omega-\omega^{*})$$
vanishes when the minimum is attained. Since the L2-regularized objective adds $\tfrac{\alpha}{2}\omega^{T}\omega$ to J(ω), the regularized minimizer $\tilde{\omega}$ satisfies:
$$\alpha\tilde{\omega}+H(\tilde{\omega}-\omega^{*})=0,\qquad \tilde{\omega}=(H+\alpha I)^{-1}H\omega^{*}$$
Decomposing $H=Q\Lambda Q^{T}$ gives $\tilde{\omega}=Q(\Lambda+\alpha I)^{-1}\Lambda Q^{T}\omega^{*}$: the component of ω* along each eigenvector of H is scaled by $\frac{\lambda_{j}}{\lambda_{j}+\alpha}$. As can be seen, components with λ_j much larger than α are almost unchanged, while components with λ_j much smaller than α are shrunk toward zero but never exactly to zero:
L2 regularization does not produce sparsity.
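Under L2 regularization, the component of the optimum along each Hessian eigenvector is scaled by λ/(λ + α); a numeric sketch with illustrative values:

```python
# Sketch: L2 regularization scales each component by lam / (lam + alpha),
# shrinking weakly-determined directions (small lam) without zeroing them.
alpha = 1.0
for lam in [100.0, 1.0, 0.01]:
    print(round(lam / (lam + alpha), 4))
# 0.9901  (lam >> alpha: nearly unchanged)
# 0.5     (comparable: halved)
# 0.0099  (lam << alpha: strongly shrunk, but still non-zero)
```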
The LeakyReLU layer: LeakyReLU is an activation function improved from the basic ReLU. ReLU sets all negative values to zero; when a large gradient passes through a ReLU neuron, the updated gradient can become 0, and if the learning rate is large, it is likely that many neurons in the network will never activate on any data. LeakyReLU gives all negative values a non-zero slope, solving this problem. The formula is as follows:
$$f(x)=\begin{cases}x, & x>0\\ \alpha x, & x\le 0\end{cases},\qquad \alpha=0.1$$
Darknet53 uses successive 3 × 3 and 1 × 1 convolutional layers and organizes them into residual blocks. Extensive experiments show that Darknet-53 is much more powerful than Darknet-19 and more efficient than ResNet-101.
S2, multi-scale training
The invention predicts bounding boxes at three different scales. Suppose the RoI width is w, the height is h, and S is the RoI context scale factor; the regions at the three different scales share the same center (in the invention, the scale factors are S1 = 1.2, S2 = 1.7 and S3 = 2.2). Context information of the different regions is extracted from the multiple scales, and three detection heads of different scales are established on the feature map, each responsible for detecting targets of a different scale.
In order to improve the adaptive selection capability of the context features for multi-scale RoIs, the invention fuses multiple face RoIs and outputs them to the pooling layer by introducing Maxout, together with dropout, as the activation function of the layer; the Maxout formula is as follows:
$$h_i(x)=\max_{j\in[1,k]} z_{ij},\qquad z_{ij}=x^{T}W_{\cdot ij}+b_{ij}$$
Here the weight W is a three-dimensional matrix of size d × m × k and the bias b is a two-dimensional matrix of size m × k; both are parameters to be learned. m represents the number of hidden-layer nodes, and k means each hidden-layer node corresponds to k "hidden-hidden" nodes (linear pieces). When Maxout is used, the three feature maps are fused into a single feature map of the same dimension; the feature maps share the weights of each layer before the RoI pooling layer, and RoIs of different scales are forward-propagated to the target RoI pooling layer to obtain feature maps of fixed resolution.
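A minimal pure-Python sketch of a Maxout unit with illustrative dimensions d = 2, m = 1, k = 3 (the weights are made-up values, not learned parameters):

```python
def maxout(x, W, b):
    """Maxout: for each hidden node i, take the max over k linear pieces
    z_ij = x . W[:, i, j] + b[i][j]."""
    d, m, k = len(W), len(W[0]), len(W[0][0])
    out = []
    for i in range(m):
        pieces = [sum(x[a] * W[a][i][j] for a in range(d)) + b[i][j]
                  for j in range(k)]
        out.append(max(pieces))
    return out

# d=2 inputs, m=1 hidden node, k=3 linear pieces (illustrative values)
W = [[[1.0, -1.0, 0.0]], [[0.5, 0.5, 0.0]]]
b = [[0.0, 0.0, 1.0]]
print(maxout([2.0, 2.0], W, b))  # [3.0] = max(3.0, -1.0, 1.0)
```

Taking the max over learned linear pieces is what lets the unit adaptively select among the fused multi-scale context features.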
The method can detect the judgment of whether the human face exists and whether the human face has shielding. After the face is located, four anchor points (an upper left vertex (x0, y0), a lower right vertex (x1, y1)) and an anchor frame are generated, the anchor frame is divided into an upper part and a lower part, namely, the upper part (the upper left vertex (x0, y0), the lower right vertex (x1, (y0+ y1)/2)) and the lower part (the upper left vertex (x0, (y0+ y1)/2), the lower right vertex (x1, y1)) of the face are corresponding, if the central point of the shielded area falls on the lower part of the face of the anchor frame, and the IOU is greater than a threshold value P, the pedestrian is considered to wear the mask; if the central point of the shielded area is located on the lower half of the face anchor frame and the IOU is smaller than the threshold value or the lower half of the face does not detect the shielding condition, the pedestrian is considered not to wear the mask.
S3 model compression
Although a preliminary result is obtained after step S2, the process generates anchor boxes and requires a large amount of computing resources and time; the model therefore needs to be compressed to reduce resource consumption and increase running speed. The model compression process is shown in FIG. 4; the model compression method in the present invention is divided into two parts: channel sparsity and channel pruning.
S3-1, channel sparseness
Introducing channel sparsity makes it possible to judge importance at the channel level, compress potentially unimportant channels, and suppress their output, facilitating subsequent channel pruning. First, a penalty is applied to the model parameters: the L1 norm (Equation (7)), i.e. the sum of the absolute values of all parameters, is used to penalize the activation units in the neural network and sparsify them.
$$\Omega(\theta)=\|\omega\|_{1}=\sum_{i}|\omega_{i}| \qquad (7)$$
From Equation (1), the L1-regularized loss function is:
$$\tilde{J}(\omega;X,y)=\alpha\|\omega\|_{1}+J(\omega;X,y) \qquad (8)$$
The corresponding gradient is:
$$\nabla_{\omega}\tilde{J}(\omega;X,y)=\alpha\,\mathrm{sign}(\omega)+\nabla_{\omega}J(\omega;X,y) \qquad (9)$$
It can be seen that the effect of regularization on the gradient is no longer to scale each ω_j linearly, but to add a constant whose sign is sign(ω_j). In this case the different components of ω have no correlation, so L1 regularization drives some elements of the optimal solution to 0, producing sparsity. Here α ∈ [0, ∞) trades off the relative contribution of the norm penalty term, with larger α corresponding to stronger regularization.
Then, each channel is assigned a scaling factor, where the absolute value of the scaling factor indicates the importance of the channel, and inputs with low importance are deleted. Specifically, except for the detection heads, each convolutional layer is followed by a BN layer to accelerate convergence and improve generalization capability; the BN layer normalizes the convolutional features using mini-batch statistics, as in Equation (10).
$$\hat{x}^{(k)}=\frac{x^{(k)}-\mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]+\epsilon}},\qquad y^{(k)}=\gamma\hat{x}^{(k)}+\beta \qquad (10)$$
where E[x^(k)] and √Var[x^(k)] are the mean and standard deviation of the input features over the mini-batch, and γ and β are trainable scale and shift factors. They are learned during training and allow the network to recover the feature distribution learned by the original network.
S3-2 channel pruning
After channel-sparsity training, a global threshold is introduced to determine whether a feature channel is pruned, controlling the overall pruning rate. In addition, a local safety threshold is introduced to prevent excessive pruning on any convolutional layer and to preserve the integrity of the network connections. Some special connections between layers, such as the routing layers and the shortcut (fast connection) layers, need to be handled carefully in the present invention. During pruning, the max-pooling and upsampling layers are discarded directly because they are independent of the channel number. First, a pruning mask is constructed for every convolutional layer according to the global threshold and the local safety threshold. For a routing layer, the pruning masks of its incoming layers are concatenated in order, and that concatenation serves as its pruning mask. The shortcut layer in the present invention plays a role similar to the residual connection in ResNet, so all layers connected to a shortcut layer must have the same number of channels. To match the feature channels of each layer of a shortcut connection, the pruning masks of all connected layers are iterated over and a logical OR is performed on them to generate the final pruning mask for those layers.
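The OR merge across shortcut-connected layers can be sketched as follows; the masks are illustrative:

```python
def merge_shortcut_masks(masks):
    """Layers joined by a shortcut must keep the same channels:
    a channel survives if ANY connected layer keeps it (logical OR)."""
    merged = [any(keep) for keep in zip(*masks)]
    return [merged for _ in masks]  # every connected layer uses the merged mask

layer_a = [True, False, True, False]
layer_b = [False, False, True, True]
print(merge_shortcut_masks([layer_a, layer_b])[0])
# [True, False, True, True]: channel counts now match across the shortcut
```

Using OR rather than AND is the conservative choice: it keeps a channel alive everywhere if any participant needs it, so the element-wise addition in the shortcut stays well-defined.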
S4 model optimization
Model optimization operations are performed on the channel pruned model to compensate for temporarily degraded accuracy and potential performance degradation of the method. In fine-grained target detection tasks, detection performance is typically sensitive to channel pruning. Model optimization by fine tuning is a relatively efficient and safe approach. The update times, iteration times, learning rate, regularization parameters, etc. may be modified to arrive at a suitable model.

Claims (3)

1. A pedestrian mask wearing real-time detection method based on deep learning is characterized by comprising the following steps:
s1, building a robust backbone network
A backbone network, Darknet53, is adopted as the feature extractor; Darknet53 has 52 convolutional layers as its main network layers, and the last layer is a fully connected layer implemented as a 1 × 1 convolution; the first layer of the main network is a convolution, followed by 5 repeated resblock_body blocks, each resblock_body_n including 1 individual convolution and a group of res_unit_n units, where res_unit_n is a residual unit of two convolutions repeated n times (n is 1, 2, 8, 8, 4), which together are:
1 + (1 + 1 × 2) + (1 + 2 × 2) + (1 + 8 × 2) + (1 + 8 × 2) + (1 + 4 × 2) = 52 layers;
wherein, the res _ unit _ n has a fast connection layer short, and the residual layer does not belong to the convolutional layer calculation;
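The layer count above can be checked with a short sketch; the repeat counts n = (1, 2, 8, 8, 4) are taken from the claim, with each res_unit contributing 2 convolution layers and each resblock_body adding 1 standalone convolution:

```python
# Sketch: recompute the 52-convolutional-layer count of Darknet53 from the
# repeat counts given in the claim (n = 1, 2, 8, 8, 4).
repeats = (1, 2, 8, 8, 4)

# 1 initial convolution, then each resblock_body_n contributes
# 1 standalone convolution plus n res_units of 2 convolutions each.
total_convs = 1 + sum(1 + 2 * n for n in repeats)
print(total_convs)  # 52
```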
the first convolution block in the main network is composed of a two-dimensional convolution layer, a batch normalization layer and a LeakyReLU layer with a slope of 0.1; the two-dimensional convolution layer applies L2 regularization, formula (2), with parameter 5e-4 to the kernel weight matrix; LeakyReLU is a simple improvement of ReLU: ReLU sets all negative values to zero, so when a large gradient passes through a ReLU neuron its subsequent updated gradient is 0, and if the learning rate is large, many of the neurons in the network may then fail to activate on any data; LeakyReLU gives all negative values a small nonzero slope, which solves this problem; the formula is:
f(x) = x, x > 0;  f(x) = 0.1·x, x ≤ 0, (1)
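As a minimal sketch, the activation can be written with NumPy; the 0.1 slope is the one stated above:

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    """LeakyReLU: identity for positive inputs, a small 0.1 slope for negatives."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, slope * x)
```

Unlike ReLU, negative inputs keep a small nonzero gradient, so a neuron can recover after a large negative update.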
Darknet53 uses many successive 3 × 3 and 1 × 1 convolutional layers and organizes them into residual blocks;
S2, multi-scale training
Bounding boxes are predicted on three different scales; supposing the input RoI has width w and height h, and S is the RoI context scale factor, the regions at the three different scales share the same center, and context information is extracted from these regions at multiple scales; at the same time, three detection heads of different scales are built on the feature map, each responsible for detecting targets of a corresponding scale; each grid cell in a detection head is assigned three different anchors so as to predict three detections, each consisting of 4 bounding-box coordinates, 1 objectness score and C class predictions; the final result tensor of a detection head is therefore N × N × (3 × (4 + 1 + C)), where N × N is the spatial size of the final convolutional feature map;
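As an illustration, the size of one detection head's output tensor follows directly from this layout; the grid size N and class count C below (e.g. C = 2 for mask / no-mask) are only example values, not fixed by the claim:

```python
def head_output_shape(n, c):
    """Output tensor of one detection head: an n x n grid, with 3 anchors per
    cell, each predicting 4 box coordinates + 1 objectness score + c classes."""
    return (n, n, 3 * (4 + 1 + c))

# e.g. a 13 x 13 grid with 2 classes gives a (13, 13, 21) tensor
```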
Maxout and dropout are introduced as the activation function of the layer, and a plurality of face RoIs are fused and output to the pooling layer; the Maxout formula is as follows:
h_i(x) = max_{j ∈ [1, k]} z_ij,  where z_ij = x^T·W_(:,i,j) + b_ij
wherein the weight W is a three-dimensional matrix of size d × m × k and the bias b is a two-dimensional matrix of size m × k; these two matrices are the parameters to be learned; m is the number of hidden nodes, and k is the number of linear pieces over which each hidden node takes its maximum; when Maxout is used, the three feature maps are fused into a single feature map of the same dimension; the feature maps share the weights of each layer before the RoI pooling layer, and RoIs of different scales are forward-propagated to the target RoI pooling layer to obtain feature maps of fixed resolution;
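A minimal NumPy sketch of the Maxout unit described above; the shapes follow the claim (W is d × m × k, b is m × k), while the concrete numbers in the test values are purely illustrative:

```python
import numpy as np

def maxout(x, W, b):
    """Maxout: each of the m hidden nodes takes the max over its k linear
    pieces z_ij = x . W[:, i, j] + b[i, j]."""
    z = np.einsum('d,dmk->mk', x, W) + b  # shape (m, k)
    return z.max(axis=1)                  # shape (m,)
```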
the method detects both whether a face is present and whether the face is occluded; after the face is located, four anchor coordinates are generated: the upper-left vertex (x0, y0) and the lower-right vertex (x1, y1) of an anchor box, and the anchor box is divided into an upper part and a lower part, namely the upper half of the face, with upper-left vertex (x0, y0) and lower-right vertex (x1, (y0 + y1)/2), and the lower half, with upper-left vertex (x0, (y0 + y1)/2) and lower-right vertex (x1, y1); if the center point of the occluded region lies in the lower half of the face anchor box and the IOU is greater than the threshold P, the pedestrian is considered to be wearing a mask; if the center point of the occluded region lies in the lower half of the face anchor box but the IOU is smaller than the threshold, or no occlusion is detected in the lower half of the face, the pedestrian is considered not to be wearing a mask;
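The decision rule above can be sketched as follows. This is a hedged illustration: the `iou` helper and the default threshold `p=0.5` are assumptions, since the claim leaves the value of P unspecified:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def wears_mask(face, occluded, p=0.5):
    """Claimed rule: mask worn iff the occluded region's centre lies in the
    lower half of the face box AND the IOU with that lower half exceeds p."""
    x0, y0, x1, y1 = face
    lower = (x0, (y0 + y1) / 2, x1, y1)
    cx = (occluded[0] + occluded[2]) / 2
    cy = (occluded[1] + occluded[3]) / 2
    centre_in_lower = x0 <= cx <= x1 and lower[1] <= cy <= y1
    return centre_in_lower and iou(lower, occluded) > p
```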
S3 model compression
After step S2, a preliminary result is obtained, and the model is then compressed; the model compression method is divided into two parts: channel sparsity and channel pruning;
S4 model optimization
Model optimization is performed on the channel-pruned model to compensate for the temporarily reduced accuracy and potential performance degradation; in fine-grained object detection tasks, detection performance is usually sensitive to channel pruning; model optimization is carried out by fine-tuning, and the number of updates, the number of iterations, the learning rate and the regularization parameters are adjusted to obtain a suitable model.
2. The pedestrian mask wearing real-time detection method based on deep learning as claimed in claim 1, wherein the channel sparsity in step S3 comprises the following steps:
first, an L1 penalty term, formula (3), i.e. the sum of the absolute values of all parameters, is applied to the model parameters to sparsify the activation units in the neural network:
Ω(θ) = ||ω||_1 = Σ_i |ω_i|, (3)
the L1-regularized loss function is:
J~(ω; X, y) = α·||ω||_1 + J(ω; X, y), (4)
the corresponding gradient is:
∇_ω J~(ω; X, y) = α·sign(ω) + ∇_ω J(ω; X, y), (5)
it can be seen that the effect of the regularization on the gradient is no longer to linearly scale each ω_j, but to add a constant with the sign sign(ω_j); since the components of ω are treated independently, L1 regularization drives some elements of the optimal solution to 0, producing sparsity; here α ∈ [0, ∞) trades off the relative contribution of the norm penalty term, with larger α corresponding to stronger regularization;
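The signed-constant effect of formula (5) can be seen in a two-line NumPy sketch; this is a sketch only, and the default `alpha=5e-4` merely reuses the regularization parameter mentioned in step S1:

```python
import numpy as np

def l1_regularized_grad(w, task_grad, alpha=5e-4):
    """Formula (5): the L1 penalty adds the signed constant alpha * sign(w_j)
    to each component of the task gradient, rather than scaling w_j linearly."""
    return task_grad + alpha * np.sign(w)
```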
then, a scaling factor is assigned to each channel, where the absolute value of the scaling factor represents the importance of the channel, and channels with low importance are deleted; specifically, except for the detection heads, a BN layer is placed after each convolutional layer to accelerate convergence and improve generalization; the BN layer normalizes the convolutional features using mini-batch statistics, as in formula (6):
x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)]),  y^(k) = γ·x̂^(k) + β, (6)
wherein E[x^(k)] and √(Var[x^(k)]) are the mean and standard deviation of the input features over the mini-batch, and γ and β are a trainable scale factor and bias, learned during training, which allow the network to recover the feature distribution that the original network would learn.
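Formula (6) and the role of the scale factor γ can be sketched in NumPy. This is a sketch, not the training code; the small `eps` term is added for numerical safety, and taking |γ| as the per-channel importance follows the description above:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-8):
    """Formula (6): normalise each channel with its mini-batch mean and
    standard deviation, then rescale with the trainable gamma and beta."""
    mean = x.mean(axis=0)
    std = np.sqrt(x.var(axis=0) + eps)
    return gamma * (x - mean) / std + beta

def channel_importance(gamma):
    """|gamma| per channel: small values mark channels that contribute little."""
    return np.abs(gamma)
```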
3. The pedestrian mask wearing real-time detection method based on deep learning of claim 1, wherein the channel pruning in step S3 comprises the following steps:
after the channel-sparsity training, a global threshold γ̂_global is introduced to determine whether a feature channel is pruned, which controls the pruning rate; a local safety threshold γ̂_local is also introduced to prevent excessive pruning of any single convolutional layer and to maintain the integrity of the network connections; the average pooling layers and upsampling layers are discarded directly during pruning; first, according to the global threshold γ̂_global and the local safety threshold γ̂_local, a pruning mask is constructed for all convolutional layers; for a routing layer, the pruning masks of its incoming layers are concatenated in order, and this concatenation is used as its pruning mask; the shortcut layer plays a role similar to that in ResNet, so all layers connected to a shortcut layer must have the same number of channels; to match the feature channels of every layer attached to a shortcut layer, the pruning masks of all the connected layers are traversed and an OR operation is performed on them to generate the final pruning mask for those layers.
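A hedged sketch of the pruning-mask construction described above. The claim does not specify the exact form of the local safety threshold; here it is interpreted as a minimum fraction of channels kept per layer, which is an assumption:

```python
import numpy as np

def prune_mask(gamma, global_thr, local_keep=0.1):
    """Keep channels whose |gamma| exceeds the global threshold, but always
    retain at least a local_keep fraction of the layer's channels (local
    safety) so the layer never loses all of its connections."""
    importance = np.abs(gamma)
    mask = importance > global_thr
    n_keep = max(1, int(np.ceil(local_keep * len(gamma))))
    mask[np.argsort(-importance)[:n_keep]] = True  # protect strongest channels
    return mask

def merge_shortcut_masks(masks):
    """Layers joined by a shortcut must keep identical channel counts: OR the
    per-layer masks so a channel survives if any connected layer keeps it."""
    merged = masks[0].copy()
    for m in masks[1:]:
        merged |= m
    return merged
```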
CN202010164210.XA 2020-03-11 2020-03-11 Pedestrian mask wearing real-time detection method based on deep learning Pending CN111401202A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010164210.XA CN111401202A (en) 2020-03-11 2020-03-11 Pedestrian mask wearing real-time detection method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010164210.XA CN111401202A (en) 2020-03-11 2020-03-11 Pedestrian mask wearing real-time detection method based on deep learning

Publications (1)

Publication Number Publication Date
CN111401202A true CN111401202A (en) 2020-07-10

Family

ID=71430769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010164210.XA Pending CN111401202A (en) 2020-03-11 2020-03-11 Pedestrian mask wearing real-time detection method based on deep learning

Country Status (1)

Country Link
CN (1) CN111401202A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105354563A (en) * 2015-12-14 2016-02-24 南京理工大学 Depth and color image combined human face shielding detection early-warning device and implementation method
CN108062542A (en) * 2018-01-12 2018-05-22 杭州智诺科技股份有限公司 The detection method for the face being blocked
CN108197584A (en) * 2018-01-12 2018-06-22 武汉大学 A kind of recognition methods again of the pedestrian based on triple deep neural network
CN109101923A (en) * 2018-08-14 2018-12-28 罗普特(厦门)科技集团有限公司 A kind of personnel wear the detection method and device of mask situation


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MIND_PROGRAMMONKEY: "[YOLOv3 object detection practice] Training a mask detection dataset with keras+yolov3", 《HTTPS://WWW.SHANGMAYUAN.COM/A/F57BD492344048219207BCA4.HTML》 *
PENGYI ZHANG et al.: "SlimYOLOv3: Narrower, Faster and Better for Real-Time UAV Applications", 《2019 ICCVW》 *
康行天下: "Activation functions (ReLU, Swish, Maxout)", 《HTTPS://WWW.CNBLOGS.COM/MAKEFILE/P/ACTIVATION-FUNCTION.HTML》 *
陈云霁 et al.: "《智能计算系统》 (Intelligent Computing Systems)", 28 February 2020, China Machine Press *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985621A (en) * 2020-08-24 2020-11-24 西安建筑科技大学 Method for building neural network model for real-time detection of mask wearing and implementation system
CN112001872A (en) * 2020-08-26 2020-11-27 北京字节跳动网络技术有限公司 Information display method, device and storage medium
CN112001872B (en) * 2020-08-26 2021-09-14 北京字节跳动网络技术有限公司 Information display method, device and storage medium
US11922721B2 (en) 2020-08-26 2024-03-05 Beijing Bytedance Network Technology Co., Ltd. Information display method, device and storage medium for superimposing material on image
CN112381097A (en) * 2020-11-16 2021-02-19 西南石油大学 Scene semantic segmentation method based on deep learning
US11436881B2 (en) 2021-01-19 2022-09-06 Rockwell Collins, Inc. System and method for automated face mask, temperature, and social distancing detection
CN113222142A (en) * 2021-05-28 2021-08-06 上海天壤智能科技有限公司 Channel pruning and quick connection layer pruning method and system
CN113517056A (en) * 2021-06-18 2021-10-19 安徽医科大学 Medical image target area identification method, neural network model and application
CN113517056B (en) * 2021-06-18 2023-09-19 安徽医科大学 Medical image target area identification method, neural network model and application
CN113379737A (en) * 2021-07-14 2021-09-10 西南石油大学 Intelligent pipeline defect detection method based on image processing and deep learning and application
CN113822414A (en) * 2021-07-22 2021-12-21 深圳信息职业技术学院 Mask detection model training method, mask detection method and related equipment

Similar Documents

Publication Publication Date Title
CN111401202A (en) Pedestrian mask wearing real-time detection method based on deep learning
US20200012923A1 (en) Computer device for training a deep neural network
US8649594B1 (en) Active and adaptive intelligent video surveillance system
Ryan et al. Scene invariant multi camera crowd counting
Cadena et al. Pedestrian graph: Pedestrian crossing prediction based on 2d pose estimation and graph convolutional networks
Chetverikov et al. Dynamic texture as foreground and background
Abbas et al. A comprehensive review of vehicle detection using computer vision
Czyżewski et al. Multi-stage video analysis framework
Farag et al. Deep learning versus traditional methods for parking lots occupancy classification
Charouh et al. Improved background subtraction-based moving vehicle detection by optimizing morphological operations using machine learning
Cao et al. Learning spatial-temporal representation for smoke vehicle detection
Jemilda et al. Moving object detection and tracking using genetic algorithm enabled extreme learning machine
Kumar et al. Background subtraction based on threshold detection using modified K-means algorithm
Lamba et al. A texture based mani-fold approach for crowd density estimation using Gaussian Markov Random Field
CN113095199B (en) High-speed pedestrian identification method and device
Anees et al. Deep learning framework for density estimation of crowd videos
Hanif et al. Performance analysis of vehicle detection techniques: a concise survey
Elguebaly et al. Generalized Gaussian mixture models as a nonparametric Bayesian approach for clustering using class-specific visual features
Agrawal et al. An improved Gaussian Mixture Method based background subtraction model for moving object detection in outdoor scene
Li et al. A deep pedestrian tracking SSD-based model in the sudden emergency or violent environment
Bhattacharya HybridFaceMaskNet: A novel face-mask detection framework using hybrid approach
Kim et al. Development of a real-time automatic passenger counting system using head detection based on deep learning
Marie et al. Dynamic background subtraction using moments
Tank et al. A fast moving object detection technique in video surveillance system
Ghosh et al. Pedestrian counting using deep models trained on synthetically generated images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200710