CN108090417A - A face detection method based on a convolutional neural network

Info

Publication number: CN108090417A
Application number: CN201711204234.8A
Authority: CN
Inventors: 刘琳; 姜飞; 申瑞民
Original assignee: Shanghai Jiao Tong University
Current assignee: Shanghai Jiao Tong University
Priority date: 2017-11-27
Filing date: 2017-11-27
Publication date: 2018-05-29

Abstract

The present invention relates to a face detection method based on a convolutional neural network, comprising the following steps: 1) establishing a face detection model that adopts the RFCN network structure, wherein the RFCN network structure includes a feature extraction layer based on feature fusion; 2) obtaining a sample set; 3) training the face detection model established in step 1); 4) performing face detection on the picture to be tested with the trained face detection model. Compared with the prior art, the present invention has advantages such as higher accuracy and recall and better adaptability to complex scenes.

Description

A Face Detection Method Based on Convolutional Neural Network

Technical Field

The invention relates to the technical field of face recognition, and in particular to a face detection method based on a convolutional neural network.

Background

Face detection is a research topic spanning computer vision, pattern recognition, artificial intelligence and other fields. Because of its wide application value in commercial, medical and military fields, it has long been an active area of research. However, in real-world scenes, faces in complex images are often severely occluded, which poses a great challenge to face detection. Proposing a face detection method that can cope with severe occlusion therefore remains a difficult research problem.

The paper "Object Detection via Region-based Fully Convolutional Networks" (Dai, J., Li, Y., He, K., Sun, J.: R-FCN: Object Detection via Region-based Fully Convolutional Networks. In: 30th Conference on Neural Information Processing Systems, pp. 379-387. Barcelona) discloses an object detection method based on a region-based fully convolutional network. The method uses ResNet101 as its base network in an RFCN network structure, with two sub-networks: a Region Proposal Network (RPN) and a classification network. The overall network structure is shown in Figure 1. ResNet extracts feature maps in four stages, denoted res1, res2, res3 and res4. After res4, the network connects to the RPN sub-network and the classification sub-network through convolution operations. The two sub-networks share the feature maps extracted by ResNet, so feature extraction is performed only once, which greatly improves computational efficiency.

The RPN extracts region proposals, that is, candidate face regions. The rpn_bbox_pred layer regresses the offset of each region relative to an anchor. Anchors are rectangular boxes of different scales and aspect ratios generated with respect to the original input image. Adding the offset predicted by rpn_bbox_pred to each anchor gives the position of the region that the RPN outputs. The rpn_cls_prob layer outputs the probability that each region is foreground or background. The proposal layer combines the results of the rpn_bbox_pred and rpn_cls_prob layers, sorts the regions by foreground probability, and then applies the non-maximum suppression (NMS) algorithm to select a number of regions (2000 during training and 300 during testing).

The classification network continues feature extraction with res5, the fifth stage of ResNet, and produces score maps with a depth of C*k*k. Here k is a hyperparameter set to 3, and C is the number of final classes including the background class, set to 2 (face | background). RFCN uses a position-sensitive ROI pooling layer to perform position-based average pooling on the score maps for each region obtained by the RPN. It extracts features separately for each position of the region and obtains the final result by voting over all positions. Together, the RPN sub-network and the classification sub-network yield the regions where faces are located and the probability that each region is a face.
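The relationship between anchors and the offsets regressed by rpn_bbox_pred can be illustrated with a minimal Python sketch. It rests on assumptions that the text does not state: the base anchor size of 16 pixels is hypothetical, the scales (1, 2, 4) and ratios (0.5, 1, 2) are borrowed from the training hyperparameters in Table 1, and the offset format (dx, dy, dw, dh) is the parameterisation commonly used in Faster R-CNN style detectors rather than one confirmed by this document. The function names are illustrative only.

```python
import math


def make_anchors(base_size=16, scales=(1, 2, 4), ratios=(0.5, 1, 2)):
    """Return anchors as (x1, y1, x2, y2) boxes centred on the origin.

    ratio is height / width. All anchors of one scale share the same
    area, (base_size * scale) ** 2. base_size is an assumed value.
    """
    anchors = []
    for scale in scales:
        area = float(base_size * scale) ** 2
        for ratio in ratios:
            w = math.sqrt(area / ratio)
            h = w * ratio
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return anchors


def decode(anchor, delta):
    """Apply one predicted offset (dx, dy, dw, dh) to one anchor.

    dx and dy shift the centre in units of the anchor size;
    dw and dh rescale width and height on a log scale (assumed format).
    """
    x1, y1, x2, y2 = anchor
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2, y1 + h / 2
    dx, dy, dw, dh = delta
    new_cx, new_cy = cx + dx * w, cy + dy * h
    new_w, new_h = w * math.exp(dw), h * math.exp(dh)
    return (new_cx - new_w / 2, new_cy - new_h / 2,
            new_cx + new_w / 2, new_cy + new_h / 2)


anchors = make_anchors()
# Shift the square 16 x 16 anchor half its width to the right.
shifted = decode(anchors[1], (0.5, 0.0, 0.0, 0.0))
```

With three scales and three ratios, each feature-map position carries nine anchors, and a zero offset returns the anchor unchanged.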

During training, this object detection method uses the public dataset WIDER FACE as the sample set. An RFCN model pre-trained on ImageNet is obtained first, and training then continues on the prepared sample set. The trained model is finally used for face detection.

Although the existing method above achieves a certain accuracy, it has the following shortcomings: 1. it is sensitive to face occlusion, and detection is difficult under heavy occlusion, with an mAP of only 0.77 on WIDER FACE; 2. it performs poorly on small faces and profile faces.

Summary of the Invention

The object of the present invention is to overcome the above defects of the prior art by providing a face detection method based on a convolutional neural network.

The object of the present invention can be achieved through the following technical solution:

A face detection method based on a convolutional neural network, comprising the following steps:

1) establishing a face detection model, wherein the model adopts the RFCN network structure, and the RFCN network structure includes a feature extraction layer based on feature fusion;

2) obtaining a sample set;

3) training the face detection model established in step 1);

4) performing face detection on the picture to be tested with the trained face detection model.

Further, in the feature extraction layer, the output layer of res3 and the output layer of res4 are superimposed and fused.

Further, in step 2), the sample set contains more than 30,000 samples.

Further, the training in step 3) uses the caffe framework and includes:

301) pre-training the face detection model on ImageNet;

302) training the pre-trained face detection model again with the sample set.

Further, in step 4), multi-scale detection is performed on the picture to be tested.

Further, the multi-scale detection is specifically:

401) rescaling the picture to be tested to multiple sizes;

402) using the trained face detection model to perform face detection on the picture at each size, obtaining multiple face detection results;

403) merging and screening the multiple face detection results to obtain the final detection result.

Further, in step 403), the NMS algorithm is used to merge and screen the multiple face detection results.

Compared with the prior art, the present invention has the following beneficial effects:

1) The present invention establishes an improved face detection model. During feature extraction, the output of res3 is fused with the output of res4, and the fused features can be used by both sub-networks, the RPN and the classification network, which greatly improves the accuracy and recall of face detection.

2) When detecting the picture to be tested, the present invention uses multi-scale detection, which captures more information about occluded faces and low-resolution faces and further improves the accuracy and recall of face detection.

3) The sample set of the present invention contains more than 30,000 samples, which ensures the accuracy of the detection model.

4) The present invention adapts well to complex scenes, especially scenes with severely occluded faces and small faces. In extensive testing, both accuracy and recall exceed 90%.

Brief Description of the Drawings

Fig. 1 is a schematic diagram of the overall network structure of the existing method;

Fig. 2 is a schematic diagram of the existing RPN sub-network structure;

Fig. 3 is a schematic diagram of the feature fusion network structure of the present invention;

Fig. 4 is a schematic diagram of the overall network structure of the present invention;

Fig. 5 is a schematic diagram of the detection flow of the present invention;

Fig. 6 is a schematic diagram of the detection results of the present invention.

Detailed Description

The present invention is described in detail below with reference to the accompanying drawings and a specific embodiment. The embodiment is implemented on the basis of the technical solution of the present invention and gives a detailed implementation and a specific operating procedure, but the scope of protection of the present invention is not limited to the embodiment below.

The present invention provides a face detection method based on a convolutional neural network, comprising the following steps: 1) establishing a face detection model that adopts the RFCN network structure, wherein the RFCN network structure includes a feature extraction layer based on feature fusion; 2) obtaining a sample set; 3) training the face detection model established in step 1); 4) performing face detection on the picture to be tested with the trained face detection model. With this method, face detection in complex scenes achieves higher accuracy and recall.

The key points of the above detection method are:

a. Model structure improvement

The improvement to the model structure lies mainly in fusing features from the middle layers of the network. In the ResNet101 structure, the first four stages up to res4 perform 4 pooling operations in total. The deeper the convolution and pooling network, the larger the receptive field of each feature map and the more high-level the semantic features it learns. However, a severely occluded or small face offers only limited features to begin with, and extracting high-level semantic features makes those limited local features easier to lose. In other words, for objects that expose few features, a large receptive field helps detection less than a small one. Therefore, to detect small faces and severely occluded faces correctly, the present invention superimposes and fuses the output layer of res3 with the output layer of res4, so that the features learned at the res4 layer contain both high-level semantic features and low-level local features. Fusion is performed at res4 because the fused features can then be used by both sub-networks, the RPN and the classification network.

The feature fusion network structure is shown in Figure 3. res4b22_relu is the output of res4. res4b22_dcov upsamples the output of res4 so that the res4 feature map has the same spatial size as the res3 feature map. res3_scale expands the number of channels of res3 so that its depth matches that of the res4 feature map. Because the deconvolution operation can only scale the size up to an even number or to an odd number, whereas the size before pooling may be either even or odd, a crop operation is needed to trim the deconvolved feature map to the same size as res3. The improved overall network structure is shown in Figure 4.
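The size mismatch that makes the crop necessary can be illustrated with a small Python sketch. It is a simplification under several assumptions, not the network defined in this document: feature maps are single-channel nested lists, nearest-neighbour repetition stands in for the learned deconvolution, the stride-2 stage is assumed to round sizes up, fusion is assumed to be an element-wise sum, and the channel expansion done by res3_scale is taken as already applied. All function names are hypothetical.

```python
def downsample_size(n):
    """Spatial size after a stride-2 stage that rounds up (assumed)."""
    return (n + 1) // 2


def upsample_nearest(fmap):
    """Double height and width by repeating each value.

    A stand-in for the learned deconvolution; the output size is
    always even, whatever the parity of the original res3 size.
    """
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def crop_to(fmap, height, width):
    """Keep the top-left height x width window of a feature map."""
    return [row[:width] for row in fmap[:height]]


def fuse(res3, res4):
    """Upsample res4, crop it to the size of res3, and add element-wise."""
    h, w = len(res3), len(res3[0])
    up = crop_to(upsample_nearest(res4), h, w)
    return [[a + b for a, b in zip(r3, r4)] for r3, r4 in zip(res3, up)]


# A 5 x 5 res3 map corresponds to a 3 x 3 res4 map after downsampling.
res3 = [[1.0] * 5 for _ in range(5)]
side = downsample_size(5)
res4 = [[2.0] * side for _ in range(side)]
fused = fuse(res3, res4)
```

Here the 3 x 3 res4 map doubles to 6 x 6, one row and one column larger than the 5 x 5 res3 map, so the crop trims it back before the two maps are added.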

b. Training stage

Step 1: Prepare the samples

The sample set comes from the public dataset WIDER FACE, which contains 32,203 images and 393,703 face samples in total. The samples are prepared in the format of the PASCAL VOC dataset, which provides a standardized set of datasets for image recognition and classification. The storage layout is as follows: JPEGImages holds the sample images containing faces, and Annotations holds the details of each sample image together with the bounding-box coordinates of the face objects in it. Each face box is recorded by its top-left and bottom-right corner coordinates, and the annotations are stored as xml files.
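As an illustration of the annotation layout, the following Python sketch builds one PASCAL VOC style xml annotation with the standard library. The element names follow the usual VOC convention (a bndbox with xmin, ymin, xmax, ymax). The function name, the class label "face", the file name and the box values are hypothetical examples and are not taken from this document.

```python
import xml.etree.ElementTree as ET


def voc_annotation(filename, width, height, boxes):
    """Build a PASCAL VOC style annotation string for one image.

    boxes is a list of (xmin, ymin, xmax, ymax) face boxes in pixels.
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "folder").text = "JPEGImages"
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"
    for box in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = "face"  # assumed class label
        ET.SubElement(obj, "difficult").text = "0"
        bndbox = ET.SubElement(obj, "bndbox")
        for tag, value in zip(("xmin", "ymin", "xmax", "ymax"), box):
            ET.SubElement(bndbox, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")


# One image with a single face box (illustrative values).
xml_text = voc_annotation("0001.jpg", 1024, 768, [(10, 20, 110, 140)])
```

Each returned string would be written to a file under Annotations whose name matches the corresponding image under JPEGImages.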

Step 2: Train the model

The model is trained with the caffe framework. The improved RFCN model pre-trained on ImageNet is obtained first, and training then continues on the prepared sample set. The training hyperparameters are shown in Table 1.

Table 1: Hyperparameter settings for training

Parameter            Value
iterations           500000
batch size           1
base learning rate   0.001
k                    3
momentum             0.9
scale                1, 2, 4
weight_decay         0.0005
ratio                0.5, 1, 2

Training produces the model file face_model.caffemodel, which can then be used to detect faces.

c. Face detection

For face detection in complex scenes, the aim is to capture more information about occluded faces and low-resolution faces. At detection time the picture is therefore processed at multiple scales: detection is run once on the picture at each scale, and the results are then merged and screened with the NMS algorithm to obtain the final detection result. The flow is shown in Figure 5.
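The multi-scale flow described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in this document: detect and resize are hypothetical callables supplied by the caller (the trained model and an image-resizing routine), and the scale set (0.5, 1.0, 2.0) and the IoU threshold of 0.3 are illustrative values that the text does not specify.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2, ...)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.3):
    """Greedy NMS over (x1, y1, x2, y2, score) tuples, best score first."""
    kept = []
    for det in sorted(detections, key=lambda d: d[4], reverse=True):
        if all(iou(det, k) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


def detect_multiscale(image, detect, resize,
                      scales=(0.5, 1.0, 2.0), iou_threshold=0.3):
    """Run detect at several scales and merge the results with NMS.

    detect(image) returns (x1, y1, x2, y2, score) tuples in the
    coordinates of the image it receives; resize(image, s) returns the
    image rescaled by factor s. Both are hypothetical callables.
    """
    merged = []
    for s in scales:
        for x1, y1, x2, y2, score in detect(resize(image, s)):
            # Map each box back to the coordinates of the original image.
            merged.append((x1 / s, y1 / s, x2 / s, y2 / s, score))
    return nms(merged, iou_threshold)
```

Mapping every box back to the original image coordinates before NMS lets detections of the same face found at different scales overlap, so that only the highest-scoring one is kept.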

The mAP on WIDER FACE reaches 0.897, and the detection results are shown in Figure 6.

The preferred specific embodiment of the present invention has been described in detail above. It should be understood that those of ordinary skill in the art can make many modifications and variations according to the concept of the present invention without creative effort. Therefore, any technical solution that a person skilled in the art can obtain on the basis of the prior art through logical analysis, reasoning or limited experimentation according to the concept of the present invention shall fall within the scope of protection defined by the claims.

Claims

1. A face detection method based on a convolutional neural network, characterized in that it comprises the following steps:

1) establishing a face detection model, wherein the model adopts the RFCN network structure, and the RFCN network structure includes a feature extraction layer based on feature fusion;

2) obtaining a sample set;

3) training the face detection model established in step 1);

4) performing face detection on the picture to be tested with the trained face detection model.

2. The face detection method based on a convolutional neural network according to claim 1, characterized in that, in the feature extraction layer, the output layer of res3 and the output layer of res4 are superimposed and fused.

3. The face detection method based on a convolutional neural network according to claim 1, characterized in that, in step 2), the sample set contains more than 30,000 samples.

4. The face detection method based on a convolutional neural network according to claim 1, characterized in that the training in step 3) uses the caffe framework and comprises:

301) pre-training the face detection model on ImageNet;

302) training the pre-trained face detection model again with the sample set.

5. The face detection method based on a convolutional neural network according to claim 1, characterized in that, in step 4), multi-scale detection is performed on the picture to be tested.

6. The face detection method based on a convolutional neural network according to claim 5, characterized in that the multi-scale detection is specifically:

401) rescaling the picture to be tested to multiple sizes;

402) using the trained face detection model to perform face detection on the picture at each size, obtaining multiple face detection results;

403) merging and screening the multiple face detection results to obtain the final detection result.

7. The face detection method based on a convolutional neural network according to claim 6, characterized in that, in step 403), the NMS algorithm is used to merge and screen the multiple face detection results.