CN106778589A

CN106778589A - A robust masked face detection method based on a modified LeNet

Info

Publication number: CN106778589A
Application number: CN201611127956.3A
Authority: CN
Inventors: 纪荣嵘; 林绍辉; 林贤明
Original assignee: Xiamen University
Current assignee: Xiamen University
Priority date: 2016-12-09
Filing date: 2016-12-09
Publication date: 2017-05-31

Abstract

A robust masked face detection method based on an improved LeNet, relating to masked face detection. The method includes the following steps: 1) expanding the training data by horizontally flipping the original training images; 2) proposing a new MLeNet model by modifying the structure of the traditional LeNet model so that it suits masked face detection, specifically by adjusting the convolution kernel size and the number of feature maps and by changing the number of output-layer nodes from 10 to 2 to fit the two-class detection problem; 3) using the parameters of the original LeNet model to pre-train the MLeNet structure, and fine-tuning the MLeNet model to obtain a detector suited to masked faces; 4) combining a sliding window with non-maximum suppression to accurately locate masked faces. The method accurately detects masked faces, and the model remains robust under interference such as cluttered backgrounds and environmental changes.

Description

A Robust Masked Face Detection Method Based on Improved LeNet

Technical field

The invention relates to masked face detection, and in particular to a robust masked face detection method based on an improved LeNet.

Background

With the development of society, advances in science and technology, and the spread of multimedia technology, more and more people upload all kinds of videos to the Internet. Among them are criminals who use multimedia channels to spread violent and terrorist videos, which has affected the stable development of society to some extent. Quickly and accurately locating terrorists in massive numbers of video frames would greatly reduce the manpower required and help maintain social stability.

As a basic requirement for managing large-scale video libraries, accurately retrieving violent and terrorist video frames that contain terrorists plays a major role in social stability. How to define the presence of a terrorist in a given video frame is a difficult problem, because terrorists appear in many forms. Terrorists are usually masked, so the present invention treats a terrorist as a person with masked features. Masked face detection is a special face detection task that faces more challenges than conventional face detection. On the one hand, masked face detection involves pose changes, illumination and other conditions that conventional face detection techniques cannot handle. On the other hand, a masked face is heavily occluded and loses much of the normal facial structure, so conventional algorithms fail on masked faces.

At present, many face detection techniques rely on hand-crafted features, for example the widely used Fisherface [1], the cascade classifier based on Haar-like features [2], and the AdaBoost detector based on Gabor-like high-dimensional features [3]. Such hand-crafted features require a large number of training samples, and because a masked person lacks the complete facial structure, hand-crafted features cannot accurately represent a masked face. As a result, these methods cannot detect masked faces accurately. Recently, the exemplar-based face detection method [4] has shown good results, mainly because a huge exemplar database covers all possible visual variations of faces, including occlusion, illumination and pose. However, this method requires a large exemplar dataset and easily produces false alarms against highly cluttered backgrounds. To reduce the number of exemplars required, [5] proposed an efficient boosted exemplar-based face detection method. It further improves the detection rate, speeds up detection, and greatly reduces memory overhead by using discriminatively trained, effectively combined exemplars as weak classifiers.

In recent years, with the rise of deep learning, convolutional neural networks (CNNs) backed by powerful GPU computing have made great breakthroughs in face-related tasks, for example on LFW [6][7][8]. In particular, convolutional networks can automatically learn effective feature representations from training samples. In the 2012 Large Scale Visual Recognition Challenge, [9] made a breakthrough using deep convolutional neural networks. In addition, to handle the case where only a small number of training samples is available, [10] introduced pre-training to initialize the weights of a deep network, which speeds up convergence and yields a better local solution. [11] proposed the LeNet model, which performs well in handwritten character recognition. With the development of these deep learning techniques, face detection based on deep learning has become feasible.

References:

[1] P. N. Belhumeur, J. P. Hespanha, D. J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997, 19(7): 711-720.

[2] P. Viola, M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of CVPR, 2001.

[3] C. Liu, H. Wechsler. Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition. IEEE Transactions on Image Processing, 2002, 11(4): 467-476.

[4] X. Shen, Z. Lin, J. Brandt, et al. Detecting and aligning faces by image retrieval. In Proceedings of CVPR, 2013: 3460-3467.

[5] H. Li, Z. Lin, J. Brandt, et al. Efficient boosted exemplar-based face detection. In Proceedings of CVPR, 2014: 1843-1850.

[6] Y. Sun, X. Wang, X. Tang. Deep learning face representation from predicting 10,000 classes. In Proceedings of CVPR, 2014: 1891-1898.

[7] Y. Sun, X. Wang, X. Tang. Deeply learned face representations are sparse, selective, and robust. arXiv preprint arXiv:1412.1265.

[8] Y. Sun, X. Wang, X. Tang. Hybrid deep learning for face verification. In Proceedings of ICCV, 2013: 1489-1496.

[9] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of NIPS, 2012: 1097-1105.

[10] G. E. Hinton, R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 2006, 313: 504-507.

[11] Y. LeCun, L. Bottou, Y. Bengio, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998, 86(11): 2278-2324.

Summary of the invention

The purpose of the present invention is to address the scarcity of training samples and the fact that the complete structural features of a masked face cannot be obtained, by providing a robust masked face detection method based on an improved LeNet. The proposed MLeNet introduces pre-training and fine-tuning and is combined with a sliding window method to locate masked faces quickly and accurately.

The present invention comprises the following steps:

1) Expand the training data by horizontally flipping the original training images;

2) Propose a new MLeNet model by modifying the structure of the traditional LeNet model so that it suits the masked person detection problem. Specifically, adjust the convolution kernel size and the number of feature maps, and change the number of output-layer nodes from 10 to 2 to fit the two-class detection problem;

3) Use the parameters of the original LeNet model to pre-train the MLeNet structure, and fine-tune the MLeNet model to obtain a detector suited to masked faces;

4) Combine a sliding window with non-maximum suppression to accurately locate the masked face.

The present invention has the following outstanding advantages:

Starting from the original LeNet model, the present invention proposes a new MLeNet model by modifying the convolutional filter size of the convolutional layers, the number of feature maps, and the number of nodes in the fully connected layers. The performance of MLeNet is further improved by expanding the training samples and by combining pre-training with fine-tuning. Finally, the masked face is accurately located by combining a sliding window with non-maximum suppression. The hardware requirements of the present invention are low: only an 8 GB USB drive is needed to store the dataset for training the MLeNet model, plus a high-performance CPU to carry out the convolution computations in the MLeNet model.

The technical effects of the present invention are as follows:

By modifying the LeNet model into the new MLeNet model, applying pre-training, fine-tuning and data expansion, and introducing post-processing, the proposed model accurately detects masked faces and remains robust under interference such as cluttered backgrounds and environmental changes.

The MLeNet model effectively mitigates the overfitting caused by the small-sample problem and accurately locates masked faces in natural environments, so it has broad application prospects in video surveillance, public security and related fields. The present invention establishes the MLeNet model by modifying the original LeNet model so that it is better suited to masked face detection. With few training samples, training the model easily leads to overfitting. Expanding the training dataset and combining pre-training with fine-tuning therefore overcomes the overfitting problem and improves the classification accuracy of the MLeNet model. Post-processing such as non-maximum suppression makes masked face detection more accurate.

Description of drawings

Figure 1 is the overall flowchart of masked face detection.

Figure 2 shows the modified convolutional neural network, the MLeNet model: the MLeNet output layer has only two nodes, all convolutional layers use a smaller convolution kernel, and each layer has a larger number of feature maps.

Figure 3 shows the LeNet loss function values (training and validation losses).

Figure 4 shows the LeNet classification error rate (for positive and negative samples).

Figure 5 shows the loss function values of MLeNet without pre-training and fine-tuning (training and validation losses).

Figure 6 shows the classification error rate of MLeNet without pre-training and fine-tuning (for positive and negative samples).

Figure 7 shows the loss function values of MLeNet with pre-training and fine-tuning (training and validation losses).

Figure 8 shows the classification error rate of MLeNet with pre-training and fine-tuning (for positive and negative samples).

Figure 9 shows some face detection results for masked terrorists (to protect privacy, the face regions of the masked persons have been pixelated).

Detailed description

The purpose of the present invention is to address the scarcity of training samples and to improve on conventional hand-tuned facial features by providing the MLeNet model. An accurate and robust face model is trained through simple sample expansion, pre-training and fine-tuning, and is combined with a sliding window and non-maximum suppression to obtain a fast, robust and accurate face detector. The overall algorithm flow is shown in Figure 1. The individual modules are as follows:

1. Expanding the dataset

The training and test dataset used in the present invention is composed of key frames from violent and terrorist videos provided by the Ministry of Public Security. It contains 1140 images in total: 240 positive samples (containing a masked face) and 900 negative samples (containing no masked face). In the experiment, 150 positive and 750 negative samples are randomly selected as the training set, 50 positive and 50 negative samples as the validation set, and the remaining 140 images form the test set. Considering the symmetry of the human face, the present invention uses horizontal reflection to double the size of the original dataset.
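The horizontal-reflection step can be sketched with NumPy as follows. This is a minimal illustration, not the patent's implementation: the function name `augment_with_horizontal_flip`, the array layout (images stacked as `(N, H, W)` or `(N, H, W, C)`) and the toy data are assumptions made for the example.

```python
import numpy as np

def augment_with_horizontal_flip(images, labels):
    """Return the original samples followed by their mirrored copies.

    images: array of shape (N, H, W) or (N, H, W, C) (assumed layout).
    labels: array of shape (N,), 1 = masked face, 0 = no masked face.
    """
    flipped = images[:, :, ::-1]  # reverse the width axis of every image
    all_images = np.concatenate([images, flipped], axis=0)
    all_labels = np.concatenate([labels, labels], axis=0)  # a flip keeps the label
    return all_images, all_labels

# Toy data: two 2x3 "images" with one positive and one negative label.
images = np.arange(2 * 2 * 3).reshape(2, 2, 3)
labels = np.array([1, 0])
aug_images, aug_labels = augment_with_horizontal_flip(images, labels)
```

Applied to the 900-image training set described above, this would yield 1800 training samples.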

2. MLeNet model

The MLeNet model is an improvement on the original LeNet model. LeNet has 5 layers in total: 3 convolutional layers and 2 fully connected layers, where each convolutional layer includes convolution and down-sampling operations. Deciding whether a masked face is present is a binary classification problem, so the number of nodes in the last fully connected layer is changed from 10 to 2. The convolution kernel size of the original LeNet is reduced to 3×3, while the number of feature maps in each layer is increased. In particular, the number of nodes in the first fully connected layer (FC4) is increased from 84 to 500. The per-layer details of the MLeNet and LeNet models are listed in Table 1, and the final MLeNet model is shown in Figure 2.
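To make the effect of these changes concrete, the short sketch below counts learnable parameters for the pieces the text specifies. It is illustrative only: the helper names are hypothetical, and because the number of feature maps per layer is not given in this text, the convolution comparison is for a single filter on a single input map. The 5×5 kernel used for the original LeNet is the size from the LeNet paper [11], not from this document.

```python
def conv_params(in_maps, out_maps, kernel):
    """Weights plus one bias per output map for a square convolution kernel."""
    return out_maps * (in_maps * kernel * kernel + 1)

def fc_params(n_in, n_out):
    """Weights plus one bias per output node for a fully connected layer."""
    return n_out * (n_in + 1)

# Output layer: LeNet maps 84 FC4 nodes to 10 classes,
# MLeNet maps 500 FC4 nodes to 2 classes (masked face or not).
lenet_output_params = fc_params(84, 10)
mlenet_output_params = fc_params(500, 2)

# One filter on one input map: 5x5 (original LeNet) versus 3x3 (MLeNet).
params_5x5 = conv_params(1, 1, 5)
params_3x3 = conv_params(1, 1, 3)
```

A 3×3 filter needs fewer than half the weights of a 5×5 filter, which leaves room to increase the number of feature maps per layer as the text describes.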

Table 1 compares the MLeNet and LeNet models. Each model contains 3 convolutional layers and 2 fully connected layers. The detailed per-layer parameters of each model are listed in the last two rows, giving the convolution kernel size "num×size×size", the convolution stride "st.", the spatial padding "pad", and the max-pooling factor.

Table 1

Let the N training samples be $\{(x_i, y_i)\}_{i=1}^{N}$, where $y_i$ is the label variable (taking the value 0 or 1 in the present invention). The final loss function is the softmax loss (i.e., the error between the prediction and the label), defined as:

$$L(w, b) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=0}^{1} 1\{y_i = j\} \log p_{ij} \qquad (1)$$

where $p_{ij}$ is the probability output by the model that sample $x_i$ belongs to class $j$, and $1\{y_i = j\}$ is the indicator function, defined as

$$1\{y_i = j\} = \begin{cases} 1, & y_i = j \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

The closer the model output $p_{ij}$ is to the true label value, the smaller the loss. $w$ and $b$ are the weights and biases of each layer. The predicted label $\hat{y}_i$ is obtained by forward propagation through the layers with parameters $w, b$. The network parameters are updated by back-propagating the error through each layer and applying stochastic gradient descent.
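The softmax loss described here can be sketched numerically as below. This is a minimal NumPy illustration assuming the standard cross-entropy form averaged over the N samples. The names `softmax`, `softmax_loss`, `scores` and the toy values are assumptions for the example, not taken from the patent.

```python
import numpy as np

def softmax(scores):
    """Turn raw two-class scores of shape (N, 2) into probabilities per row."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def softmax_loss(scores, labels):
    """Mean negative log-probability assigned to the true class.

    The indicator function simply selects, for each sample i,
    the probability of the class j that equals the label y_i.
    """
    probs = softmax(scores)
    n = scores.shape[0]
    return -np.log(probs[np.arange(n), labels]).mean()

scores = np.array([[2.0, 0.0], [0.0, 3.0]])   # hypothetical network outputs
labels = np.array([0, 1])                      # true labels
loss_matching = softmax_loss(scores, labels)       # predictions agree with labels
loss_mismatch = softmax_loss(scores, 1 - labels)   # predictions disagree
```

The loss is small when the predicted probabilities agree with the labels and grows when they disagree, which is the behaviour the text describes.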

Specifically, the present invention trains the MLeNet model (i.e., updates the variables $w, b$ of each layer) by gradient descent with a batch size of 20, a momentum of 0.9, a weight decay of 0.0005, a learning rate of 0.001, and 100 training epochs. The update rules for the weights $w$ and biases $b$ are:

$$\begin{aligned} u_{i+1} &= 0.9\,u_i - 0.0005\,\epsilon\,w_i - \epsilon \left\langle \frac{\partial L}{\partial w}\Big|_{w_i} \right\rangle_{D_i}, & w_{i+1} &= w_i + u_{i+1} \\ v_{i+1} &= 0.9\,v_i - 0.0005\,\epsilon\,b_i - \epsilon \left\langle \frac{\partial L}{\partial b}\Big|_{b_i} \right\rangle_{D_i}, & b_{i+1} &= b_i + v_{i+1} \end{aligned} \qquad (3)$$

where $i$ is the iteration index, $u, v$ are the momentum variables, $\epsilon$ is the learning rate, $\left\langle \frac{\partial L}{\partial w}\big|_{w_i} \right\rangle_{D_i}$ is the partial derivative of the objective function with respect to the weights $w$ over the $i$-th batch of images $D_i$, and $\left\langle \frac{\partial L}{\partial b}\big|_{b_i} \right\rangle_{D_i}$ is the partial derivative of the objective function with respect to the biases $b$ over the same batch. This update rule moves the variables of each layer (weights $w$ and biases $b$) so that the objective loss function approaches a local minimum, finally yielding a locally optimal solution. The initial weights and biases in the present invention come directly from the trained LeNet model parameters, and MLeNet is fine-tuned by stochastic gradient descent. On an ordinary PC with 6 GB of memory and a 1.90 GHz AMD A8-4500M APU, the MLeNet model can be trained for 100 epochs without a GPU, and training takes only 10 minutes.
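The sketch below shows one way the stated hyperparameters could enter a momentum update. It assumes the common form in which the momentum variable accumulates the weight-decay term and the scaled gradient. The function name `momentum_step` and the toy quadratic objective are hypothetical and stand in for the MLeNet loss and its back-propagated gradients.

```python
import numpy as np

MOMENTUM = 0.9         # momentum from the text
WEIGHT_DECAY = 0.0005  # weight decay from the text
LEARNING_RATE = 0.001  # learning rate from the text

def momentum_step(param, velocity, grad):
    """One update of a parameter (w or b) and its momentum variable (u or v)."""
    velocity = (MOMENTUM * velocity
                - WEIGHT_DECAY * LEARNING_RATE * param
                - LEARNING_RATE * grad)
    return param + velocity, velocity

# Toy objective f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0])
u = np.zeros_like(w)
for _ in range(2000):
    w, u = momentum_step(w, u, grad=w)
```

On this toy objective the iterates shrink toward the minimum at zero. In real training the gradient would come from back-propagation over each batch of 20 images.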

3. Techniques for improving detection accuracy: pre-training and fine-tuning

The present invention learns the MLeNet model through pre-training and fine-tuning. First, the LeNet model is pre-trained on the MNIST dataset, and the learned LeNet parameters are used to initialize the MLeNet parameters. Finally, the parameters of MLeNet are fine-tuned by stochastic gradient descent.

4. Detecting masked faces

With the MLeNet training method described above, a masked face detector with high accuracy is obtained that can decide whether a given window contains a masked face. However, this does not account for multiple scales or overlapping detection windows, so the present invention uses an image pyramid matching scheme combined with non-maximum suppression as post-processing.

In short, pyramid matching samples candidate windows at different positions of the multi-scale images. Each sampled window is fed to the trained MLeNet masked face detector, which produces a score indicating whether the window contains a face. Non-maximum suppression then merges the high-scoring sub-windows to complete the detection.
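The post-processing described above can be sketched as follows. The window size, stride, pyramid scale factor and IoU threshold are not given in the text, so the values here are placeholders. The function names are hypothetical, and the suppression shown is the common greedy variant, which may differ from what the inventors used. Scoring each window with the trained MLeNet is left out.

```python
import numpy as np

def pyramid_windows(height, width, window=32, stride=16, scale=1.5):
    """Yield (x1, y1, x2, y2) boxes, in original-image coordinates, for a
    fixed-size window slid over successively downscaled copies of the image."""
    factor = 1.0
    while height / factor >= window and width / factor >= window:
        h, w = int(height / factor), int(width / factor)
        for y in range(0, h - window + 1, stride):
            for x in range(0, w - window + 1, stride):
                yield (x * factor, y * factor,
                       (x + window) * factor, (y + window) * factor)
        factor *= scale

def iou(box, others):
    """Intersection over union between one box and an array of boxes."""
    x1 = np.maximum(box[0], others[:, 0])
    y1 = np.maximum(box[1], others[:, 1])
    x2 = np.minimum(box[2], others[:, 2])
    y2 = np.minimum(box[3], others[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / (area + areas - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.3):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it, repeat."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(np.asarray(scores, dtype=float))[::-1]
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep

windows = list(pyramid_windows(64, 64, window=32, stride=16, scale=2.0))
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]   # hypothetical detector scores for the three boxes
kept = non_max_suppression(boxes, scores)
```

In the toy example the second box overlaps the first heavily and is suppressed, while the third, non-overlapping box is kept.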

This is a masked face detection technique based on a new MLeNet model. MLeNet introduces pre-training and fine-tuning and is combined with a sliding window method to locate masked faces quickly and accurately.

The specific experimental results are as follows:

With the development of society, advances in science and technology, and the spread of multimedia technology, more and more people upload all kinds of videos to the Internet. Among them are criminals who use multimedia channels to spread violent and terrorist videos, which has affected the stable development of society to some extent. Quickly and accurately locating terrorists in massive numbers of video frames would greatly reduce the manpower required and help maintain social stability. How to define the presence of a terrorist in a given video frame is a difficult problem, because terrorists appear in many forms. Terrorists are usually masked, so the present invention treats a terrorist as a person with masked features. Accurately locating the masked face is therefore the key to deciding whether a video frame contains a terrorist. Given only a small number of training samples, and given that the complete facial structure of a masked person cannot be obtained, conventional face detection techniques cannot accurately locate masked faces.

Face detection is an important application in computer vision. Conventional face detection algorithms detect frontal, unoccluded faces fairly accurately, but perform poorly on occluded faces, especially low-resolution or masked ones. The present invention proposes a new model for masked face detection that achieves good performance and can be used in video surveillance, human-computer interaction, retrieval of violent and terrorist videos, public security and other fields.

Figures 3 and 4 show the performance of the LeNet model on the masked face dataset. Figures 5 and 6 show the performance of MLeNet without pre-training and fine-tuning, and Figures 7 and 8 show the training results of MLeNet with pre-training and fine-tuning. The curves show that the MLeNet model trained with pre-training and fine-tuning greatly improves masked face classification.

The experimental results of detecting masked faces on the self-built masked person dataset are given in Table 2. As Table 2 shows, the MLeNet model with pre-training and fine-tuning (i.e., Ours) is better suited to masked face detection than the traditional AdaBoost algorithm, the LeNet model, and the MLeNet model without pre-training and fine-tuning.

Table 2

             Ours    AdaBoost [2]   LeNet [11]   MLeNet
Recall       0.925   0.75           0.82         0.85
Precision    0.71    0.6            0.64         0.68
F₁-score     0.803   0.667          0.719        0.756

"Ours" denotes MLeNet with pre-training and fine-tuning; "MLeNet" denotes the MLeNet model without pre-training and fine-tuning.
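The F₁-scores in Table 2 are consistent with the harmonic mean of the reported precision and recall, as the short check below illustrates. The dictionary layout and the name `f1_score` are choices made for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (recall, precision) pairs as reported in Table 2.
reported = {
    "Ours": (0.925, 0.71),
    "AdaBoost": (0.75, 0.6),
    "LeNet": (0.82, 0.64),
    "MLeNet": (0.85, 0.68),
}

f1 = {name: round(f1_score(precision, recall), 3)
      for name, (recall, precision) in reported.items()}
```

Rounded to three decimals, the computed values match the F₁-score row of the table.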

The formulas are explained as follows (for the variables and symbols, refer to the description accompanying each formula):

Formula (1) defines the loss function of the model, which measures the error between the model output and the original label value.

Formula (2) defines the indicator function, which tests whether two values are equal: it is 1 if they are equal and 0 otherwise.

Formula (3) defines the update rule of stochastic gradient descent, which updates the variables of each layer (weights w and biases b) so that the objective loss function moves toward a local minimum, yielding the final locally optimal solution.

Claims

1. A robust masked face detection method based on an improved LeNet, characterized by comprising the following steps:

1) expanding the training data by horizontally flipping the original training images;

2) proposing a new MLeNet model by modifying the structure of the traditional LeNet model, so that the MLeNet model suits the masked person detection problem;

3) pre-training the MLeNet structure using the parameters of the original LeNet model, and fine-tuning the MLeNet model to obtain a detector suited to masked faces;

4) accurately locating the masked face by combining a sliding window with non-maximum suppression.

2. The robust masked face detection method based on an improved LeNet as claimed in claim 1, wherein in step 2) the specific method for proposing the new MLeNet model by modifying the structure of the traditional LeNet model is: adjusting the convolution kernel size and the number of feature maps, and changing the number of output-layer nodes from 10 to 2 to fit the two-class detection problem.