CN111401202A - Pedestrian mask wearing real-time detection method based on deep learning - Google Patents

Pedestrian mask wearing real-time detection method based on deep learning

Info

Publication number
CN111401202A
CN111401202A
Authority
CN
China
Prior art keywords
layer
pruning
layers
channel
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010164210.XA
Other languages
Chinese (zh)
Inventor
王兵
乐红霞
赵春兰
肖斌
李文璟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Petroleum University
Original Assignee
Southwest Petroleum University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Petroleum University filed Critical Southwest Petroleum University
Priority to CN202010164210.XA priority Critical patent/CN111401202A/en
Publication of CN111401202A publication Critical patent/CN111401202A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161: Detection; Localisation; Normalisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Abstract

The invention provides a pedestrian mask wearing real-time detection method based on deep learning, comprising the following steps: S1, building a robust backbone network; S2, multi-scale training; S3, compressing the model; and S4, optimizing the model. The method takes into account labor overhead, the computation cost of small hardware storage devices, and time overhead. It can judge more quickly whether a pedestrian is wearing a mask, and has good engineering practicability.

Description

Pedestrian mask wearing real-time detection method based on deep learning
Technical Field
The invention belongs to the fields of the Internet of Things and artificial intelligence, and particularly relates to a deep-learning-based method, and its implementation, for real-time detection of pedestrian mask wearing.
Background
Some large-scale viruses are transmitted through media such as droplets, and while effective antiviral drugs have not yet been developed, mask wearing is very important in slowing disease transmission. Masks are necessary protective articles in special periods, and dedicated personnel are stationed at the entrances of places with large crowd flow, such as communities, supermarkets and stations, to check whether masks are worn. However, this checking method consumes a large amount of manual resources, and misses can occur when the crowd flow is large. Therefore, realizing real-time detection of pedestrian mask wearing has very important practical significance.
Real-time detection of pedestrian mask wearing involves computer vision and embedded hardware devices, and requires cameras and hardware platforms that can perceive the environment, analyze the scene and respond accordingly. Because of the complexity of the detection environment, the demands on automatic detection platforms built around small cameras have grown, and real-time detection of pedestrian mask wearing presents many new challenges, such as: (1) how to correctly distinguish whether a mask is worn; (2) how to deploy a real-time mask-wearing detection system on platforms with limited computing power and memory; (3) how to balance the requirements of detection real-time performance and detection accuracy. An effective way to address these challenges is neural network detection based on deep learning, which has recently produced many notable results in artificial intelligence fields such as image classification (e.g., ResNet), object detection (e.g., Faster R-CNN, SSD and the YOLO series) and image segmentation (e.g., Mask R-CNN). With its advantages of speed, scalability and end-to-end learning, deep learning has yielded many detection frameworks; among them, the YOLO series is perhaps the most widely used in practical applications, striking a good balance between detection speed and accuracy. Existing detection algorithms can broadly be divided into two-stage and one-stage methods.
The two-stage detection method first generates a sparse set of candidate boxes using a candidate-box generator and extracts features from each candidate box, then predicts the class of each candidate region using a region classifier. The one-stage detection method directly performs class prediction on objects at each position of the feature map, without the region classification step of the two-stage method. Generally speaking, two-stage methods achieve better detection performance and obtain the current best results on public benchmarks, while one-stage methods are more time-efficient and have stronger applicability to real-time object detection. These detection methods can extract targets of interest from pictures or videos, and are often applied in fields such as blind-guidance systems, pedestrian detection, traffic-sign detection and vehicle detection.
The paper "Face detection based on occlusion area detection and recovery", published by Yihan Xiao et al. in Multimedia Tools and Applications in 2019, proposes face detection based on occlusion areas. For the occluded-face detection problem, it presents an optimal occlusion-region positioning algorithm, POOA. After saliency-detection processing, an average gray value is computed from the face image and multiplied by a suitable coefficient as a threshold to obtain a binary image. Then, borrowing the idea of Haar features, two features (a large rectangle and a large T shape) are used for retrieval and combined with the binary image to obtain the occlusion region of the face. Finally, a robust principal component analysis (PCA) method is used to obtain the best projection of the occluded face and fill in the occlusion region. The method achieves good results across occlusion areas, sizes and shapes, and improves detection precision to different degrees, but it does not address face detection when a mask is worn.
The paper "A study of feature-based face classification algorithms", published in Microcomputer Applications by Li Xia et al., systematically studies the performance of different face classification algorithms by classifying face images according to the two attributes of sunglasses and masks. The algorithms covered include principal component analysis (PCA), linear discriminant analysis (LDA), the correlation coefficient (Correlation), the support vector machine (SVM) and the Adaboost algorithm, and experimental comparison results are provided on the OMRON face database.
The Chinese patent document "Mask detection system and method based on fast Fourier transform and linear Gaussian" (publication number CN 109507198 A, published March 22, 2019) discloses a mask detection system. The disclosed method comprises an image acquisition device, a reading device, a detection device, a modeling device and an evaluation device, which may be implemented as hardware or as software modules executed by a processor. That patent considers conditions such as the size of the mask, the length of its ear bands, the length of the aluminum strip inside it, and stains on its surface, and provides a mask detection system based on fast Fourier transform and linear Gaussian models; however, it does not detect the case of pedestrians actually wearing masks.
The Chinese patent document "A detection method for occluded human faces" (publication number CN 108062542 A, published May 22, 2019) reports a method that detects occluded faces by constructing an image pyramid for each frame. It trains a Boosting method, using many face and non-face pictures, to judge whether a face is occluded; the classifiers are divided according to the local face position of the points used when each weak classifier extracts features, and face detection and abnormality detection are performed on the classifier set obtained after this division. The method can correctly judge the wearing of masks of different colors, eye shields of different colors, dark sunglasses and the like. It considers face detection when a mask is worn, but does not consider time overhead.
In summary, existing systems and methods for real-time detection of pedestrian mask wearing are still few, yet in some special periods wearing a mask is very important. Deep learning networks are typically large, deep and complex; on platforms with limited computing power and memory they exhibit high latency and cannot achieve real-time detection, or cannot satisfy detection real-time performance and detection accuracy at the same time. To address these drawbacks, the invention provides a pedestrian mask wearing real-time detection method based on deep learning, which increases detection speed while keeping detection precision nearly unchanged, thereby improving detection efficiency.
Disclosure of Invention
In order to solve the above problems, the invention provides a pedestrian mask wearing real-time detection method based on deep learning, used to detect whether pedestrians wear masks correctly in front of residential areas, supermarkets, station entrances and the like. By adding a multi-scale context-information fusion operation and a model compression operation, the false detection rate is reduced and the detection speed is increased.
A pedestrian mask wearing real-time detection method based on deep learning comprises the following steps:
s1, building a robust backbone network
A backbone network, Darknet53, is adopted as the feature extractor. Darknet53 has 52 convolutional layers as its main network layers, and the last layer is a fully connected layer implemented as a 1 × 1 convolution. The first layer of the main network is a convolution, followed by 5 repeated resblock_body blocks; each resblock_body_n includes 1 individual convolution and a group of res_unit_n units, where res_unit_n is a residual unit of two convolutions repeated n times (n is 1, 2, 8, 8, 4). In total:
1 + (1 + 1 × 2) + (1 + 2 × 2) + (1 + 8 × 2) + (1 + 8 × 2) + (1 + 4 × 2) = 52 layers;
wherein, the res _ unit _ n has a fast connection layer short, and the residual layer does not belong to the convolutional layer calculation;
the first convolution layer in the main network is composed of a two-dimensional convolutional layer, a batch normalization layer, and a LeakyReLU layer with slope 0.1; L2 regularization (Equation (2)) with parameter 5e-4 is applied to the kernel weight matrix of the two-dimensional convolutional layer. LeakyReLU is an activation function improved from the basic ReLU: ReLU sets all negative values to zero, so when a large gradient passes through a ReLU neuron the updated gradient can become 0; if the learning rate is large, it is then likely that many neurons in the network will never activate on any data. LeakyReLU gives all negative values a non-zero slope, which solves this problem. The formula is as follows:
$$f(x)=\begin{cases}x, & x>0\\ \alpha x, & x\le 0\end{cases},\qquad \alpha=0.1$$
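A minimal sketch of LeakyReLU with the slope 0.1 used here:

```python
def leaky_relu(x, slope=0.1):
    """LeakyReLU: identity for positive inputs, small non-zero slope for negatives."""
    return x if x > 0 else slope * x

print(leaky_relu(2.0))   # 2.0
print(leaky_relu(-4.0))  # -0.4 (a plain ReLU would output 0 here)
```

The non-zero slope keeps a gradient flowing through negative activations, which is the point made in the text about "dead" ReLU neurons.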
darknet53 uses more consecutive convolution layers of 3 × 3 and 1 × 1 and organizes them into residual blocks;
s2, multi-scale training
Bounding boxes are predicted at three different scales. Suppose the input RoI has width w and height h, and S is an RoI context scale factor; the regions at the three different scales share the same center, and context information of the different regions is extracted from the multiple scales. At the same time, three detection heads of different scales are established on the feature map, each responsible for detecting targets of a different scale. Each grid cell in a detection head is assigned three different anchors so as to predict three detections, each consisting of 4 bounding-box coordinates, 1 objectness score and C class predictions; the resulting tensor of a detection head is N × N × (3 × (4 + 1 + C)), where N × N is the spatial size of the final convolutional feature map;
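The head's channel count follows directly from that expression. As a sketch, assuming C = 2 for this task (mask worn / not worn; the class count is an assumption, not stated in the text):

```python
# Sketch: channels per grid cell of one detection head, 3 * (4 + 1 + C).
def head_channels(num_classes):
    anchors_per_grid = 3
    box_coords, objectness = 4, 1
    return anchors_per_grid * (box_coords + objectness + num_classes)

print(head_channels(2))   # 21 channels per cell; a 13 x 13 head holds 13*13*21 values
print(head_channels(80))  # 255, the familiar COCO-sized head, as a sanity check
```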
by introducing Maxout, together with dropout, as the activation function of the layer, multiple face RoIs are fused and output to the pooling layer, where the Maxout formula is as follows:
$$h_i(x)=\max_{j\in[1,k]} z_{ij},\qquad z_{ij}=x^{T}W_{\cdot ij}+b_{ij}$$
here the weight W is a three-dimensional matrix of size d × m × k and the bias b is a two-dimensional matrix of size m × k; both are parameters to be learned. m represents the number of hidden-layer nodes, and k means each hidden-layer node corresponds to k "hidden-hidden" nodes (linear pieces). When Maxout is used, the three feature maps are fused into a single feature map of the same dimension; the feature maps share the weights of each layer before the RoI pooling layer, and RoIs of different scales are forward-propagated to the target RoI pooling layer to obtain feature maps of fixed resolution;
the method can determine whether a face is present and whether the face is occluded. After the face is located, four anchor coordinates are generated: the upper-left vertex (x0, y0) and the lower-right vertex (x1, y1), defining an anchor box. The anchor box is divided into an upper half and a lower half of the face: the upper half has upper-left vertex (x0, y0) and lower-right vertex (x1, (y0 + y1)/2); the lower half has upper-left vertex (x0, (y0 + y1)/2) and lower-right vertex (x1, y1). If the center point of the occluded area falls in the lower half of the face anchor box and the IOU is greater than a threshold P, the pedestrian is considered to be wearing a mask; if the center point falls in the lower half but the IOU is below the threshold, or no occlusion is detected in the lower half of the face, the pedestrian is considered not to be wearing a mask;
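That decision rule can be sketched as follows; the threshold p = 0.5 and the box coordinates are illustrative values, not those of the invention:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter) if inter else 0.0

def wears_mask(face, occlusion, p=0.5):
    """face, occlusion: (x0, y0, x1, y1); p: IoU threshold (illustrative)."""
    x0, y0, x1, y1 = face
    lower = (x0, (y0 + y1) / 2, x1, y1)     # lower half of the face anchor box
    cx = (occlusion[0] + occlusion[2]) / 2  # occlusion center point
    cy = (occlusion[1] + occlusion[3]) / 2
    centre_in_lower = lower[0] <= cx <= lower[2] and lower[1] <= cy <= lower[3]
    return centre_in_lower and iou(lower, occlusion) > p

face = (0, 0, 100, 120)
print(wears_mask(face, (0, 60, 100, 120)))  # True: occlusion covers the lower half
print(wears_mask(face, (0, 0, 100, 30)))    # False: occlusion only on the upper half
```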
s3 model compression
After step S2 a preliminary result is obtained; the model is then compressed. The model compression method is divided into two parts: channel sparsity and channel pruning;
s4 model optimization
Model optimization operations are performed on the channel-pruned model to compensate for the temporarily reduced accuracy and potential performance degradation of the method. In fine-grained target detection tasks, detection performance is usually sensitive to channel pruning. Model optimization is carried out through fine-tuning; the number of updates, the number of iterations, the learning rate and the regularization parameters are modified to obtain a suitable model.
The channel sparsity in step S3 includes the following steps:
First, a penalty term is applied to the model parameters: the L1 norm (Equation (3)), i.e. the sum of the absolute values of all parameters, is used to penalize the activation units in the neural network and sparsify them:
$$\Omega(\theta)=\|\omega\|_{1}=\sum_{i}|\omega_{i}| \qquad (3)$$
The L1-regularized loss function is:
$$\tilde{J}(\omega;X,y)=\alpha\|\omega\|_{1}+J(\omega;X,y) \qquad (4)$$
The corresponding gradient is:
$$\nabla_{\omega}\tilde{J}(\omega;X,y)=\alpha\,\mathrm{sign}(\omega)+\nabla_{\omega}J(\omega;X,y) \qquad (5)$$
It can be seen that the effect of regularization on the gradient is no longer to scale each ω_j linearly, but to add a constant whose sign is sign(ω_j). In this case the different components of ω have no correlation, so L1 regularization drives some elements of the optimal solution to 0, producing sparsity. Here α ∈ [0, ∞) trades off the relative contribution of the norm penalty term, with larger α corresponding to stronger regularization;
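The sparsity-inducing effect of that constant step can be sketched with the proximal (soft-thresholding) form of the L1 update; the threshold value here is illustrative, not from the invention:

```python
def soft_threshold(w, t):
    """Proximal step for the L1 penalty |w|: shrink the magnitude by t, clip to 0."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0  # weights already smaller than the step land exactly on zero

weights = [0.8, 0.05, -0.3, -0.02]
print([round(soft_threshold(w, 0.1), 4) for w in weights])
# [0.7, 0.0, -0.2, 0.0]: small weights become exactly zero, unlike with L2
```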
then, a scaling factor is assigned to each channel, where the absolute value of the scaling factor represents the importance of the channel, and inputs with low importance are deleted. Specifically, except for the detection heads, each convolutional layer is followed by a BN layer to accelerate convergence and improve generalization; the BN layer normalizes the convolutional features using mini-batch statistics, as in Equation (6):
$$\hat{x}^{(k)}=\frac{x^{(k)}-\mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]+\epsilon}},\qquad y^{(k)}=\gamma\hat{x}^{(k)}+\beta \qquad (6)$$
where E[x^(k)] and √Var[x^(k)] are the mean and standard deviation of the input features over the mini-batch, and γ and β are trainable scale and shift factors. They are learned during training and allow the network to recover the feature distribution learned by the original network.
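A per-channel sketch of this normalization, with illustrative γ = 1, β = 0 and a small eps for numerical stability:

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel over a mini-batch, then scale and shift."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]

out = batch_norm([1.0, 2.0, 3.0, 4.0])
print([round(v, 3) for v in out])  # [-1.342, -0.447, 0.447, 1.342]
```

During channel sparsity, it is the |γ| of each channel's BN layer that serves as the importance score to be driven toward zero.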
The channel pruning in the step S3 includes the following steps:
introducing a global threshold value after channel sparse training
Figure BDA0002406822140000053
Determining whether to prune the characteristic channel to control the pruning rate; also introduces a local safety threshold
Figure BDA0002406822140000054
To prevent excessive pruning on convolutional layers and to maintain the integrity of network connections; directly discarding the maximum pooling layer and the upper sampling layer in the pruning process; first, according to a global threshold
Figure BDA0002406822140000055
And local safety threshold
Figure BDA0002406822140000056
Constructing a pruning area for all the convolutional layers; for the routing layer, the pruning sizes of the routing layer incoming layers are connected in sequence, and the connection sequence is used as the pruning sequence; the fast connect layer has a similar effect to ResNet, so all layers connected to the fast connect layer must have the same number of channels; in order to match the feature channels of each layer of the fast-connect layer, the pruning order for all the connection layers is iterated and an OR operation is performed on the pruning orders to generate a final pruning order for the connection layers.
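A sketch of threshold-based channel selection, using the BN scale factors γ as channel importance; interpreting the local safety threshold as a minimum keep-ratio per layer is an assumption for illustration:

```python
def prune_mask(gammas, global_thresh, local_keep_ratio=0.1):
    """Keep channels whose |gamma| exceeds the global threshold, but always
    retain at least a local fraction of channels as a safety margin."""
    keep = [abs(g) > global_thresh for g in gammas]
    min_keep = max(1, int(len(gammas) * local_keep_ratio))
    if sum(keep) < min_keep:
        # local safety threshold: fall back to keeping the largest-|gamma| channels
        order = sorted(range(len(gammas)), key=lambda i: -abs(gammas[i]))
        keep = [i in order[:min_keep] for i in range(len(gammas))]
    return keep

print(prune_mask([0.9, 0.01, 0.4, 0.002], 0.1))  # [True, False, True, False]
```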
The invention has the following beneficial effects: compared with the traditional method of manually checking mask wearing, the proposed method takes into account labor overhead, the computation cost of small hardware storage devices, and time overhead. It can judge more quickly whether a pedestrian is wearing a mask, and has good engineering practicability.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
fig. 2 is a diagram of a backbone network architecture of the present invention;
FIG. 3 is a multi-scale block diagram of the present invention;
FIG. 4 is a diagram of the model compression process of the present invention.
Detailed Description
The specific technical scheme of the invention is described below in combination with an embodiment.
A pedestrian mask wearing real-time detection method based on deep learning is disclosed, the flow is shown in figure 1, and the method comprises the following steps:
s1, building a robust backbone network
The invention adopts the backbone network Darknet53 as the feature extractor; the network structure is shown in FIG. 2. Darknet53 has 52 convolutional layers as its main network layers, and the last layer is a fully connected layer implemented as a 1 × 1 convolution. The first layer of the main network is a convolution, followed by 5 repeated resblock_body blocks; each resblock_body_n comprises 1 individual convolution and a group of res_unit_n units, where res_unit_n is a residual unit of two convolutions repeated n times (n is 1, 2, 8, 8, 4), for a total of 1 + (1 + 1 × 2) + (1 + 2 × 2) + (1 + 8 × 2) + (1 + 8 × 2) + (1 + 4 × 2) = 52 layers. The residual layer is not counted among the convolutional layers.
There is a shortcut (fast connection) layer in res_unit_n; this residual form is motivated by the degradation problem. Theoretically, the training error of a deeper model should be no larger than that of a shallower one, but the degradation problem shows that solvers have difficulty fitting an identity function with a multi-layer network, whereas the residual representation makes the multi-layer network easier to approximate it with.
The first convolution layer in the main network consists of a two-dimensional convolutional layer, a batch normalization layer, and a LeakyReLU layer with slope 0.1. L2 regularization (Equation (2)) with parameter 5e-4 is applied to the kernel weight matrix.
The penalty function of the regularization method is as follows:
$$\tilde{J}(\omega;X,y)=J(\omega;X,y)+\alpha\,\Omega(\omega) \qquad (1)$$
where X and y are the training samples and corresponding labels, ω is the weight coefficient vector, J is the objective function, Ω(ω) is the penalty term, and the parameter α controls the strength of regularization. The penalty term of L2 regularization is:
$$\Omega(\omega)=\tfrac{1}{2}\|\omega\|_{2}^{2} \qquad (2)$$
Assume the optimal solution of the original objective function J(ω) is ω*, with J twice differentiable. The second-order Taylor expansion of J(ω) at ω* is:
$$\hat{J}(\omega)=J(\omega^{*})+\tfrac{1}{2}(\omega-\omega^{*})^{T}H(\omega-\omega^{*})$$
where H is the Hessian matrix of J(ω) at ω*, with eigenvalues λ_j. The gradient
$$\nabla_{\omega}\hat{J}(\omega)=H(\omega-\omega^{*})$$
vanishes when the minimum is attained. Since the L2-regularized objective adds $\tfrac{\alpha}{2}\omega^{T}\omega$ to J(ω), the regularized minimizer $\tilde{\omega}$ satisfies:
$$\alpha\tilde{\omega}+H(\tilde{\omega}-\omega^{*})=0,\qquad \tilde{\omega}=(H+\alpha I)^{-1}H\omega^{*}$$
Decomposing $H=Q\Lambda Q^{T}$ gives $\tilde{\omega}=Q(\Lambda+\alpha I)^{-1}\Lambda Q^{T}\omega^{*}$: the component of ω* along each eigenvector of H is scaled by $\frac{\lambda_{j}}{\lambda_{j}+\alpha}$. As can be seen, components with λ_j much larger than α are almost unchanged, while components with λ_j much smaller than α are shrunk toward zero but never exactly to zero:
L2 regularization does not produce sparsity.
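Under L2 regularization, the component of the optimum along each Hessian eigenvector is scaled by λ/(λ + α); a numeric sketch with illustrative values:

```python
# Sketch: L2 regularization scales each component by lam / (lam + alpha),
# shrinking weakly-determined directions (small lam) without zeroing them.
alpha = 1.0
for lam in [100.0, 1.0, 0.01]:
    print(round(lam / (lam + alpha), 4))
# 0.9901  (lam >> alpha: nearly unchanged)
# 0.5     (comparable: halved)
# 0.0099  (lam << alpha: strongly shrunk, but still non-zero)
```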
The LeakyReLU layer: LeakyReLU is an activation function improved from the basic ReLU. ReLU sets all negative values to zero; when a large gradient passes through a ReLU neuron, the updated gradient can become 0, and if the learning rate is large, it is likely that many neurons in the network will never activate on any data. LeakyReLU gives all negative values a non-zero slope, solving this problem. The formula is as follows:
$$f(x)=\begin{cases}x, & x>0\\ \alpha x, & x\le 0\end{cases},\qquad \alpha=0.1$$
Darknet53 uses successive 3 × 3 and 1 × 1 convolutional layers and organizes them into residual blocks. Extensive experiments show that Darknet-53 is much more powerful than Darknet-19 and more efficient than ResNet-101.
S2, multi-scale training
The invention predicts bounding boxes at three different scales. Suppose the RoI width is w, the height is h, and S is the RoI context scale factor; the regions at the three different scales share the same center (in the invention, the scale factors are S1 = 1.2, S2 = 1.7 and S3 = 2.2). Context information of the different regions is extracted from the multiple scales, and three detection heads of different scales are established on the feature map, each responsible for detecting targets of a different scale.
In order to improve the adaptive selection capability of the context features for multi-scale RoIs, the invention fuses multiple face RoIs and outputs them to the pooling layer by introducing Maxout, together with dropout, as the activation function of the layer; the Maxout formula is as follows:
$$h_i(x)=\max_{j\in[1,k]} z_{ij},\qquad z_{ij}=x^{T}W_{\cdot ij}+b_{ij}$$
Here the weight W is a three-dimensional matrix of size d × m × k and the bias b is a two-dimensional matrix of size m × k; both are parameters to be learned. m represents the number of hidden-layer nodes, and k means each hidden-layer node corresponds to k "hidden-hidden" nodes (linear pieces). When Maxout is used, the three feature maps are fused into a single feature map of the same dimension; the feature maps share the weights of each layer before the RoI pooling layer, and RoIs of different scales are forward-propagated to the target RoI pooling layer to obtain feature maps of fixed resolution.
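A minimal pure-Python sketch of a Maxout unit with illustrative dimensions d = 2, m = 1, k = 3 (the weights are made-up values, not learned parameters):

```python
def maxout(x, W, b):
    """Maxout: for each hidden node i, take the max over k linear pieces
    z_ij = x . W[:, i, j] + b[i][j]."""
    d, m, k = len(W), len(W[0]), len(W[0][0])
    out = []
    for i in range(m):
        pieces = [sum(x[a] * W[a][i][j] for a in range(d)) + b[i][j]
                  for j in range(k)]
        out.append(max(pieces))
    return out

# d=2 inputs, m=1 hidden node, k=3 linear pieces (illustrative values)
W = [[[1.0, -1.0, 0.0]], [[0.5, 0.5, 0.0]]]
b = [[0.0, 0.0, 1.0]]
print(maxout([2.0, 2.0], W, b))  # [3.0] = max(3.0, -1.0, 1.0)
```

Taking the max over learned linear pieces is what lets the unit adaptively select among the fused multi-scale context features.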
The method can detect the judgment of whether the human face exists and whether the human face has shielding. After the face is located, four anchor points (an upper left vertex (x0, y0), a lower right vertex (x1, y1)) and an anchor frame are generated, the anchor frame is divided into an upper part and a lower part, namely, the upper part (the upper left vertex (x0, y0), the lower right vertex (x1, (y0+ y1)/2)) and the lower part (the upper left vertex (x0, (y0+ y1)/2), the lower right vertex (x1, y1)) of the face are corresponding, if the central point of the shielded area falls on the lower part of the face of the anchor frame, and the IOU is greater than a threshold value P, the pedestrian is considered to wear the mask; if the central point of the shielded area is located on the lower half of the face anchor frame and the IOU is smaller than the threshold value or the lower half of the face does not detect the shielding condition, the pedestrian is considered not to wear the mask.
S3 model compression
Although a preliminary result is obtained after step S2, the process generates anchor boxes and requires a large amount of computing resources and time; the model therefore needs to be compressed to reduce resource consumption and increase running speed. The model compression process is shown in FIG. 4; the model compression method in the present invention is divided into two parts: channel sparsity and channel pruning.
S3-1, channel sparseness
Introducing channel sparsity makes it possible to judge importance at the channel level, compress potentially unimportant channels, and suppress their output, facilitating subsequent channel pruning. First, a penalty is applied to the model parameters: the L1 norm (Equation (7)), i.e. the sum of the absolute values of all parameters, is used to penalize the activation units in the neural network and sparsify them.
$$\Omega(\theta)=\|\omega\|_{1}=\sum_{i}|\omega_{i}| \qquad (7)$$
From Equation (1), the L1-regularized loss function is:
$$\tilde{J}(\omega;X,y)=\alpha\|\omega\|_{1}+J(\omega;X,y) \qquad (8)$$
The corresponding gradient is:
$$\nabla_{\omega}\tilde{J}(\omega;X,y)=\alpha\,\mathrm{sign}(\omega)+\nabla_{\omega}J(\omega;X,y) \qquad (9)$$
It can be seen that the effect of regularization on the gradient is no longer to scale each ω_j linearly, but to add a constant whose sign is sign(ω_j). In this case the different components of ω have no correlation, so L1 regularization drives some elements of the optimal solution to 0, producing sparsity. Here α ∈ [0, ∞) trades off the relative contribution of the norm penalty term, with larger α corresponding to stronger regularization.
Then, each channel is assigned a scaling factor, where the absolute value of the scaling factor indicates the importance of the channel, and inputs with low importance are deleted. Specifically, except for the detection heads, each convolutional layer is followed by a BN layer to accelerate convergence and improve generalization capability; the BN layer normalizes the convolutional features using mini-batch statistics, as in Equation (10).
$$\hat{x}^{(k)}=\frac{x^{(k)}-\mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]+\epsilon}},\qquad y^{(k)}=\gamma\hat{x}^{(k)}+\beta \qquad (10)$$
where E[x^(k)] and √Var[x^(k)] are the mean and standard deviation of the input features over the mini-batch, and γ and β are trainable scale and shift factors. They are learned during training and allow the network to recover the feature distribution learned by the original network.
S3-2 channel pruning
After channel-sparsity training, a global threshold is introduced to determine whether a feature channel is pruned, controlling the overall pruning rate. In addition, a local safety threshold is introduced to prevent excessive pruning on any convolutional layer and to preserve the integrity of the network connections. Some special connections between layers, such as the routing layers and the shortcut (fast connection) layers, need to be handled carefully in the present invention. During pruning, the max-pooling and upsampling layers are discarded directly because they are independent of the channel number. First, a pruning mask is constructed for every convolutional layer according to the global threshold and the local safety threshold. For a routing layer, the pruning masks of its incoming layers are concatenated in order, and that concatenation serves as its pruning mask. The shortcut layer in the present invention plays a role similar to the residual connection in ResNet, so all layers connected to a shortcut layer must have the same number of channels. To match the feature channels of each layer of a shortcut connection, the pruning masks of all connected layers are iterated over and a logical OR is performed on them to generate the final pruning mask for those layers.
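The OR merge across shortcut-connected layers can be sketched as follows; the masks are illustrative:

```python
def merge_shortcut_masks(masks):
    """Layers joined by a shortcut must keep the same channels:
    a channel survives if ANY connected layer keeps it (logical OR)."""
    merged = [any(keep) for keep in zip(*masks)]
    return [merged for _ in masks]  # every connected layer uses the merged mask

layer_a = [True, False, True, False]
layer_b = [False, False, True, True]
print(merge_shortcut_masks([layer_a, layer_b])[0])
# [True, False, True, True]: channel counts now match across the shortcut
```

Using OR rather than AND is the conservative choice: it keeps a channel alive everywhere if any participant needs it, so the element-wise addition in the shortcut stays well-defined.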
S4 model optimization
Model optimization operations are performed on the channel pruned model to compensate for temporarily degraded accuracy and potential performance degradation of the method. In fine-grained target detection tasks, detection performance is typically sensitive to channel pruning. Model optimization by fine tuning is a relatively efficient and safe approach. The update times, iteration times, learning rate, regularization parameters, etc. may be modified to arrive at a suitable model.

Claims (3)

1. A pedestrian mask wearing real-time detection method based on deep learning is characterized by comprising the following steps:
s1, building a robust backbone network
A backbone network, Darknet53, is adopted as the feature extractor; Darknet53 has 52 convolutional layers as its main network layers, and the last layer is a fully connected layer implemented as a 1 × 1 convolution; the first layer of the main network is a convolution, followed by 5 repeated resblock_body blocks, each resblock_body_n including 1 individual convolution and a group of res_unit_n units, where res_unit_n is a residual unit of two convolutions repeated n times (n is 1, 2, 8, 8, 4), which together are:
1 + (1 + 1 × 2) + (1 + 2 × 2) + (1 + 8 × 2) + (1 + 8 × 2) + (1 + 4 × 2) = 52 layers;
wherein, the res _ unit _ n has a fast connection layer short, and the residual layer does not belong to the convolutional layer calculation;
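The layer count above can be checked with a short sketch; the repeat counts n = (1, 2, 8, 8, 4) are taken from the claim, with each res_unit contributing 2 convolution layers and each resblock_body adding 1 standalone convolution:

```python
# Sketch: recompute the 52-convolutional-layer count of Darknet53 from the
# repeat counts given in the claim (n = 1, 2, 8, 8, 4).
repeats = (1, 2, 8, 8, 4)

# 1 initial convolution, then each resblock_body_n contributes
# 1 standalone convolution plus n res_units of 2 convolutions each.
total_convs = 1 + sum(1 + 2 * n for n in repeats)
print(total_convs)  # 52
```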
the first convolution block in the main network is composed of a two-dimensional convolution layer, a batch normalization layer and a LeakyReLU layer with a slope of 0.1; the two-dimensional convolution layer applies L2 regularization, formula (2), with parameter 5e-4 to the kernel weight matrix; LeakyReLU is a simple improvement of ReLU: ReLU sets all negative values to zero, so when a large gradient passes through a ReLU neuron its subsequent updated gradient is 0, and if the learning rate is large, many of the neurons in the network may then fail to activate on any data; LeakyReLU gives all negative values a small nonzero slope, which solves this problem; the formula is:
f(x) = x, x > 0;  f(x) = 0.1·x, x ≤ 0, (1)
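As a minimal sketch, the activation can be written with NumPy; the 0.1 slope is the one stated above:

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    """LeakyReLU: identity for positive inputs, a small 0.1 slope for negatives."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, slope * x)
```

Unlike ReLU, negative inputs keep a small nonzero gradient, so a neuron can recover after a large negative update.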
Darknet53 uses many successive 3 × 3 and 1 × 1 convolutional layers and organizes them into residual blocks;
S2, multi-scale training
Bounding boxes are predicted on three different scales; supposing the input RoI has width w and height h, and S is the RoI context scale factor, the regions at the three different scales share the same center, and context information is extracted from these regions at multiple scales; at the same time, three detection heads of different scales are built on the feature map, each responsible for detecting targets of a corresponding scale; each grid cell in a detection head is assigned three different anchors so as to predict three detections, each consisting of 4 bounding-box coordinates, 1 objectness score and C class predictions; the final result tensor of a detection head is therefore N × N × (3 × (4 + 1 + C)), where N × N is the spatial size of the final convolutional feature map;
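As an illustration, the size of one detection head's output tensor follows directly from this layout; the grid size N and class count C below (e.g. C = 2 for mask / no-mask) are only example values, not fixed by the claim:

```python
def head_output_shape(n, c):
    """Output tensor of one detection head: an n x n grid, with 3 anchors per
    cell, each predicting 4 box coordinates + 1 objectness score + c classes."""
    return (n, n, 3 * (4 + 1 + c))

# e.g. a 13 x 13 grid with 2 classes gives a (13, 13, 21) tensor
```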
Maxout and dropout are introduced as the activation function of the layer, and a plurality of face RoIs are fused and output to the pooling layer; the Maxout formula is as follows:
h_i(x) = max_{j ∈ [1, k]} z_ij,  where z_ij = x^T·W_(:,i,j) + b_ij
wherein the weight W is a three-dimensional matrix of size d × m × k and the bias b is a two-dimensional matrix of size m × k; these two matrices are the parameters to be learned; m is the number of hidden nodes, and k is the number of linear pieces over which each hidden node takes its maximum; when Maxout is used, the three feature maps are fused into a single feature map of the same dimension; the feature maps share the weights of each layer before the RoI pooling layer, and RoIs of different scales are forward-propagated to the target RoI pooling layer to obtain feature maps of fixed resolution;
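A minimal NumPy sketch of the Maxout unit described above; the shapes follow the claim (W is d × m × k, b is m × k), while the concrete numbers in the test values are purely illustrative:

```python
import numpy as np

def maxout(x, W, b):
    """Maxout: each of the m hidden nodes takes the max over its k linear
    pieces z_ij = x . W[:, i, j] + b[i, j]."""
    z = np.einsum('d,dmk->mk', x, W) + b  # shape (m, k)
    return z.max(axis=1)                  # shape (m,)
```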
the method detects both whether a face is present and whether the face is occluded; after the face is located, four anchor coordinates are generated: the upper-left vertex (x0, y0) and the lower-right vertex (x1, y1) of an anchor box, and the anchor box is divided into an upper part and a lower part, namely the upper half of the face, with upper-left vertex (x0, y0) and lower-right vertex (x1, (y0 + y1)/2), and the lower half, with upper-left vertex (x0, (y0 + y1)/2) and lower-right vertex (x1, y1); if the center point of the occluded region lies in the lower half of the face anchor box and the IOU is greater than the threshold P, the pedestrian is considered to be wearing a mask; if the center point of the occluded region lies in the lower half of the face anchor box but the IOU is smaller than the threshold, or no occlusion is detected in the lower half of the face, the pedestrian is considered not to be wearing a mask;
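The decision rule above can be sketched as follows. This is a hedged illustration: the `iou` helper and the default threshold `p=0.5` are assumptions, since the claim leaves the value of P unspecified:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def wears_mask(face, occluded, p=0.5):
    """Claimed rule: mask worn iff the occluded region's centre lies in the
    lower half of the face box AND the IOU with that lower half exceeds p."""
    x0, y0, x1, y1 = face
    lower = (x0, (y0 + y1) / 2, x1, y1)
    cx = (occluded[0] + occluded[2]) / 2
    cy = (occluded[1] + occluded[3]) / 2
    centre_in_lower = x0 <= cx <= x1 and lower[1] <= cy <= y1
    return centre_in_lower and iou(lower, occluded) > p
```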
S3 model compression
After step S2, a preliminary result is obtained, and the model is then compressed; the model compression method is divided into two parts: channel sparsity and channel pruning;
S4 model optimization
Model optimization is performed on the channel-pruned model to compensate for the temporarily reduced accuracy and potential performance degradation; in fine-grained object detection tasks, detection performance is usually sensitive to channel pruning; model optimization is carried out by fine-tuning, and the number of updates, the number of iterations, the learning rate and the regularization parameters are adjusted to obtain a suitable model.
2. The pedestrian mask wearing real-time detection method based on deep learning as claimed in claim 1, wherein the channel sparsity in step S3 comprises the following steps:
first, an L1 penalty term, formula (3), i.e. the sum of the absolute values of all parameters, is applied to the model parameters to sparsify the activation units in the neural network:
Ω(θ) = ||ω||_1 = Σ_i |ω_i|, (3)
the L1-regularized loss function is:
J~(ω; X, y) = α·||ω||_1 + J(ω; X, y), (4)
the corresponding gradient is:
∇_ω J~(ω; X, y) = α·sign(ω) + ∇_ω J(ω; X, y), (5)
it can be seen that the effect of the regularization on the gradient is no longer to linearly scale each ω_j, but to add a constant with the sign sign(ω_j); since the components of ω are treated independently, L1 regularization drives some elements of the optimal solution to 0, producing sparsity; here α ∈ [0, ∞) trades off the relative contribution of the norm penalty term, with larger α corresponding to stronger regularization;
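The signed-constant effect of formula (5) can be seen in a two-line NumPy sketch; this is a sketch only, and the default `alpha=5e-4` merely reuses the regularization parameter mentioned in step S1:

```python
import numpy as np

def l1_regularized_grad(w, task_grad, alpha=5e-4):
    """Formula (5): the L1 penalty adds the signed constant alpha * sign(w_j)
    to each component of the task gradient, rather than scaling w_j linearly."""
    return task_grad + alpha * np.sign(w)
```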
then, a scaling factor is assigned to each channel, where the absolute value of the scaling factor represents the importance of the channel, and channels with low importance are deleted; specifically, except for the detection heads, a BN layer is placed after each convolutional layer to accelerate convergence and improve generalization; the BN layer normalizes the convolutional features using mini-batch statistics, as in formula (6):
x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)]),  y^(k) = γ·x̂^(k) + β, (6)
wherein E[x^(k)] and √(Var[x^(k)]) are the mean and standard deviation of the input features over the mini-batch, and γ and β are a trainable scale factor and bias, learned during training, which allow the network to recover the feature distribution that the original network would learn.
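Formula (6) and the role of the scale factor γ can be sketched in NumPy. This is a sketch, not the training code; the small `eps` term is added for numerical safety, and taking |γ| as the per-channel importance follows the description above:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-8):
    """Formula (6): normalise each channel with its mini-batch mean and
    standard deviation, then rescale with the trainable gamma and beta."""
    mean = x.mean(axis=0)
    std = np.sqrt(x.var(axis=0) + eps)
    return gamma * (x - mean) / std + beta

def channel_importance(gamma):
    """|gamma| per channel: small values mark channels that contribute little."""
    return np.abs(gamma)
```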
3. The pedestrian mask wearing real-time detection method based on deep learning of claim 1, wherein the channel pruning in step S3 comprises the following steps:
after the channel-sparsity training, a global threshold γ̂_global is introduced to determine whether a feature channel is pruned, which controls the pruning rate; a local safety threshold γ̂_local is also introduced to prevent excessive pruning of any single convolutional layer and to maintain the integrity of the network connections; the average pooling layers and upsampling layers are discarded directly during pruning; first, according to the global threshold γ̂_global and the local safety threshold γ̂_local, a pruning mask is constructed for all convolutional layers; for a routing layer, the pruning masks of its incoming layers are concatenated in order, and this concatenation is used as its pruning mask; the shortcut layer plays a role similar to that in ResNet, so all layers connected to a shortcut layer must have the same number of channels; to match the feature channels of every layer attached to a shortcut layer, the pruning masks of all the connected layers are traversed and an OR operation is performed on them to generate the final pruning mask for those layers.
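A hedged sketch of the pruning-mask construction described above. The claim does not specify the exact form of the local safety threshold; here it is interpreted as a minimum fraction of channels kept per layer, which is an assumption:

```python
import numpy as np

def prune_mask(gamma, global_thr, local_keep=0.1):
    """Keep channels whose |gamma| exceeds the global threshold, but always
    retain at least a local_keep fraction of the layer's channels (local
    safety) so the layer never loses all of its connections."""
    importance = np.abs(gamma)
    mask = importance > global_thr
    n_keep = max(1, int(np.ceil(local_keep * len(gamma))))
    mask[np.argsort(-importance)[:n_keep]] = True  # protect strongest channels
    return mask

def merge_shortcut_masks(masks):
    """Layers joined by a shortcut must keep identical channel counts: OR the
    per-layer masks so a channel survives if any connected layer keeps it."""
    merged = masks[0].copy()
    for m in masks[1:]:
        merged |= m
    return merged
```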
CN202010164210.XA 2020-03-11 2020-03-11 Pedestrian mask wearing real-time detection method based on deep learning Pending CN111401202A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010164210.XA CN111401202A (en) 2020-03-11 2020-03-11 Pedestrian mask wearing real-time detection method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010164210.XA CN111401202A (en) 2020-03-11 2020-03-11 Pedestrian mask wearing real-time detection method based on deep learning

Publications (1)

Publication Number Publication Date
CN111401202A true CN111401202A (en) 2020-07-10

Family

ID=71430769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010164210.XA Pending CN111401202A (en) 2020-03-11 2020-03-11 Pedestrian mask wearing real-time detection method based on deep learning

Country Status (1)

Country Link
CN (1) CN111401202A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105354563A (en) * 2015-12-14 2016-02-24 南京理工大学 Depth and color image combined human face shielding detection early-warning device and implementation method
CN108062542A (en) * 2018-01-12 2018-05-22 杭州智诺科技股份有限公司 The detection method for the face being blocked
CN108197584A (en) * 2018-01-12 2018-06-22 武汉大学 A kind of recognition methods again of the pedestrian based on triple deep neural network
CN109101923A (en) * 2018-08-14 2018-12-28 罗普特(厦门)科技集团有限公司 A kind of personnel wear the detection method and device of mask situation


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MIND_PROGRAMMONKEY: "[YOLOv3 object detection practice] Training a mask detection dataset with keras+yolov3", 《HTTPS://WWW.SHANGMAYUAN.COM/A/F57BD492344048219207BCA4.HTML》 *
PENGYI ZHANG et al.: "SlimYOLOv3: Narrower, Faster and Better for Real-Time UAV Applications", 《2019 ICCVW》 *
康行天下: "Activation functions (ReLU, Swish, Maxout)", 《HTTPS://WWW.CNBLOGS.COM/MAKEFILE/P/ACTIVATION-FUNCTION.HTML》 *
陈云霁 et al.: "《智能计算系统》 (Intelligent Computing Systems)", 28 February 2020, China Machine Press *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985621A (en) * 2020-08-24 2020-11-24 西安建筑科技大学 Method for building neural network model for real-time detection of mask wearing and implementation system
CN112001872A (en) * 2020-08-26 2020-11-27 北京字节跳动网络技术有限公司 Information display method, device and storage medium
CN112001872B (en) * 2020-08-26 2021-09-14 北京字节跳动网络技术有限公司 Information display method, device and storage medium
US11922721B2 (en) 2020-08-26 2024-03-05 Beijing Bytedance Network Technology Co., Ltd. Information display method, device and storage medium for superimposing material on image
CN112381097A (en) * 2020-11-16 2021-02-19 西南石油大学 Scene semantic segmentation method based on deep learning
US11436881B2 (en) 2021-01-19 2022-09-06 Rockwell Collins, Inc. System and method for automated face mask, temperature, and social distancing detection
CN113222142A (en) * 2021-05-28 2021-08-06 上海天壤智能科技有限公司 Channel pruning and quick connection layer pruning method and system
CN113517056A (en) * 2021-06-18 2021-10-19 安徽医科大学 Medical image target area identification method, neural network model and application
CN113517056B (en) * 2021-06-18 2023-09-19 安徽医科大学 Medical image target area identification method, neural network model and application
CN113379737A (en) * 2021-07-14 2021-09-10 西南石油大学 Intelligent pipeline defect detection method based on image processing and deep learning and application
CN113822414A (en) * 2021-07-22 2021-12-21 深圳信息职业技术学院 Mask detection model training method, mask detection method and related equipment

Similar Documents

Publication Publication Date Title
CN111401202A (en) Pedestrian mask wearing real-time detection method based on deep learning
US20200012923A1 (en) Computer device for training a deep neural network
US8649594B1 (en) Active and adaptive intelligent video surveillance system
Ryan et al. Scene invariant multi camera crowd counting
Cadena et al. Pedestrian graph: Pedestrian crossing prediction based on 2d pose estimation and graph convolutional networks
Chetverikov et al. Dynamic texture as foreground and background
Abbas et al. A comprehensive review of vehicle detection using computer vision
Czyżewski et al. Multi-stage video analysis framework
Farag et al. Deep learning versus traditional methods for parking lots occupancy classification
Charouh et al. Improved background subtraction-based moving vehicle detection by optimizing morphological operations using machine learning
Cao et al. Learning spatial-temporal representation for smoke vehicle detection
Jemilda et al. Moving object detection and tracking using genetic algorithm enabled extreme learning machine
Kumar et al. Background subtraction based on threshold detection using modified K-means algorithm
Lamba et al. A texture based mani-fold approach for crowd density estimation using Gaussian Markov Random Field
CN113095199B (en) High-speed pedestrian identification method and device
Anees et al. Deep learning framework for density estimation of crowd videos
Hanif et al. Performance analysis of vehicle detection techniques: a concise survey
Elguebaly et al. Generalized Gaussian mixture models as a nonparametric Bayesian approach for clustering using class-specific visual features
Agrawal et al. An improved Gaussian Mixture Method based background subtraction model for moving object detection in outdoor scene
Li et al. A deep pedestrian tracking SSD-based model in the sudden emergency or violent environment
Bhattacharya HybridFaceMaskNet: A novel face-mask detection framework using hybrid approach
Kim et al. Development of a real-time automatic passenger counting system using head detection based on deep learning
Marie et al. Dynamic background subtraction using moments
Tank et al. A fast moving object detection technique in video surveillance system
Ghosh et al. Pedestrian counting using deep models trained on synthetically generated images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200710