CN107871101A

CN107871101A - A kind of method for detecting human face and device

Info

Publication number: CN107871101A
Application number: CN201610849651.7A
Authority: CN
Inventors: 段旭; 张祥德
Original assignee: Beijing Eyecool Technology Co Ltd
Current assignee: Beijing Eyecool Technology Co Ltd
Priority date: 2016-09-23
Filing date: 2016-09-23
Publication date: 2018-04-03

Abstract

The embodiment of the present invention provides a kind of method for detecting human face, including：Using the first image feature vector of each candidate window in the first depth convolutional neural networks extraction input picture after training；Using the second image feature vector of each candidate window in the second depth convolutional neural networks extraction input picture after training；The described first image characteristic vector of identical dimensional and second image feature vector are merged to obtain the 3rd image feature vector；The 3rd image feature vector dimensionality reduction is obtained into the 4th image feature vector；According to the 4th image feature vector, whether each candidate window of the detection of classifier after training is used for human face region.The embodiment of the present invention also provides a kind of human face detection device.The embodiment of the present invention extracts the image feature vector in candidate face image by two depth convolutional neural networks and merges the image feature vector respectively, and by dimension-reduction treatment, not only can more accurately detect face, can also improve the efficiency of detection.

Description

A kind of method for detecting human face and device

Technical field

The present invention relates to human face detection tech field, more particularly to a kind of method for detecting human face and device.

Background technology

Face datection refers to the image given for any one width, uses certain strategy to be scanned for it to determine it In whether contain face, if there is information such as the positions and size for then returning to all faces.Conventional method for detecting human face has Based on lifting cascade method, the method based on DPM, based on convolutional neural networks (CNN) and depth convolutional neural networks (DCNN) method.But the method for prior art is typically using single network extraction image feature vector, image feature information Expression do not enrich, can not solve the test problems such as the influence of attitudes vibration, so as to influence Face datection result.

The content of the invention

The embodiment of the present invention provides a kind of method for detecting human face, to solve the characteristics of image of detection method of the prior art The test problems such as not abundant and attitudes vibration the influence of the expression of information.

The embodiment of the present invention provides a kind of human face detection device, to solve human face detection device detection of the prior art The test problems such as not abundant and attitudes vibration the influence of the expression of image feature information.

First aspect, there is provided a kind of method for detecting human face, including：Carried using the first depth convolutional neural networks after training Take the first image feature vector of each candidate window in input picture；Using the second depth convolutional neural networks after training Extract the second image feature vector of each candidate window in input picture；By the described first image feature of identical dimensional to Amount and second image feature vector merge to obtain the 3rd image feature vector；The 3rd image feature vector dimensionality reduction is obtained To the 4th image feature vector；According to the 4th image feature vector, using each time of the detection of classifier after training Select whether window is human face region.

Second aspect, there is provided a kind of human face detection device, including：First extraction module, for using first after training First image feature vector of each candidate window in depth convolutional neural networks extraction input picture；Second extraction module, It is special for the second image using each candidate window in the second depth convolutional neural networks extraction input picture after training Sign vector；Fusion Module, for the described first image characteristic vector of identical dimensional and second image feature vector to be melted Conjunction obtains the 3rd image feature vector；Dimensionality reduction module, for the 3rd image feature vector dimensionality reduction to be obtained into the 4th image spy Sign vector；Detection module, for according to the 4th image feature vector, using each time of the detection of classifier after training Select whether window is human face region.

So, in the embodiment of the present invention, extracted respectively in candidate face image by two depth convolutional neural networks Image feature vector and the image feature vector after the image feature vector extracted respectively is merged, after the fusion The image information of image feature vector expression is enriched, and can reduce influence of the attitudes vibration to detection, while by after to fusion Image feature vector dimensionality reduction, can solve image feature vector it is openness the problems such as, and reduce the complexity of calculating, not only may be used Face is more accurately detected, can also improve the efficiency of detection.

Brief description of the drawings

In order to illustrate the technical solution of the embodiments of the present invention more clearly, below by institute in the description to the embodiment of the present invention The accompanying drawing needed to use is briefly described, it should be apparent that, drawings in the following description are only some implementations of the present invention Example, for those of ordinary skill in the art, without having to pay creative labor, can also be according to these accompanying drawings Obtain other accompanying drawings.

Fig. 1 is the flow chart of the method for detecting human face of the embodiment of the present invention；

Fig. 2 is the schematic diagram of the image pyramid of the embodiment of the present invention；

Fig. 3 is the schematic diagram of the offset max-pooling methods of the embodiment of the present invention；

Fig. 4 is the full convolutional network Feature Mapping schematic diagram of the embodiment of the present invention；

Fig. 5 is the flow chart of the method for detecting human face of another embodiment of the present invention；

Fig. 6 is that the result that detection block is obtained using NMS methods of the method for detecting human face of another embodiment of the present invention is shown It is intended to；Wherein, (a) is using the Face datection frame distribution map in the input picture before the processing of NMS methods, and (b) is to use The Face datection frame distribution map in input picture after the processing of NMS-Max methods, (c) are using at NMS-Average methods Face datection frame distribution map in input picture after reason；

Fig. 7 is recall rate-flase drop number that the method for detecting human face of the embodiment of the present invention of the embodiment of the present invention detects to obtain Curve map；

Fig. 8 is the partial test result using the method for detecting human face of the embodiment of the present invention；

Fig. 9 is a kind of structural representation of the human face detection device of the embodiment of the present invention；

Figure 10 is another structural representation of the human face detection device of the embodiment of the present invention.

Embodiment

Below in conjunction with the accompanying drawing in the embodiment of the present invention, the technical scheme in the embodiment of the present invention is carried out clear, complete Site preparation describes, it is clear that described embodiment is part of the embodiment of the present invention, rather than whole embodiments.Based on this hair Embodiment in bright, the every other implementation that those of ordinary skill in the art are obtained under the premise of creative work is not made Example, belongs to the scope of protection of the invention.

The embodiments of the invention provide a kind of method for detecting human face.As shown in figure 1, the Face datection for the embodiment of the present invention The flow chart of method.The method for detecting human face of the embodiment specifically includes the steps：

Step S101：Using each candidate window in the first depth convolutional neural networks extraction input picture after training First image feature vector.

Wherein, depth convolutional neural networks (Deep Convolutional Neural Network, DCNN) pass through structure Learning model with multiple hidden layers carries out hierarchical table to input information and reached.It emphasizes the depth of model structure, clearly prominent The importance of feature learning, original input information is transferred to by different layers by successively eigentransformation, every layer passes through a numeral Wave filter obtains the most significant feature of input data so that classification is more accurate.

Therefore, the first image feature vector extracted using the first depth convolutional neural networks is most notable in candidate window Image feature vector.

Step S102：Using each candidate window in the second depth convolutional neural networks extraction input picture after training The second image feature vector.

Second depth convolutional neural networks also have the characteristic of depth convolutional neural networks, therefore, the second depth convolution god The second image feature vector through network extraction is also most significant image feature vector in candidate window.

Second depth convolutional neural networks and the first depth convolutional neural networks are two different networks, and therefore, this two The image feature vector of individual network extraction is had any different, so as to the expression of rich image characteristic information.

Step S103：First image feature vector of identical dimensional and the second image feature vector are merged to obtain the 3rd figure As characteristic vector.

The dimension of the 3rd image feature vector after fusion is the dimension and the second characteristics of image of the first image feature vector The dimension sum of vector.Accordingly, with respect to the first image feature vector and the second image feature vector, the 3rd image feature vector It is the image feature vector of a higher-dimension.The fusion is that the second image feature vector is connected after the first image feature vector.

The step of passing through fusion so that the image feature vector after fusion can make up the characteristics of image of single network extraction The defects of information representation is insufficient, is advantageous to fully learn characteristics of image, portrays the internal information of data rich.

Step S104：3rd image feature vector dimensionality reduction is obtained into the 4th image feature vector.

When the dimension of image feature vector is higher, calculating time complexity is higher, in addition, past between image feature vector It is past certain correlation to be present so as to cause information redundancy.By dimensionality reduction, the characteristics of image that can reflect classification essence both can obtain Therefore vector, and can reduce the complexity of calculating time, the image feature vector after dimensionality reduction can be carried out to face each several part information Integrative expression so that the Face datection under without constraint environment has larger lifting.

Step S105：According to the 4th image feature vector, using each candidate window of detection of classifier after training whether For human face region.

Grader can obtain the confidence level of each candidate window by classifying to the 4th image feature vector.By what is obtained The confidence level of each candidate window is compared with default confidence level.The candidate window that will be above the default confidence level judges to be people Face region, the candidate window less than the default confidence level are judged as non-face region, so as to realize the function of detection face.

To sum up, the method for detecting human face of the embodiment of the present invention, candidate is extracted respectively by two depth convolutional neural networks Image feature vector in facial image and the characteristics of image after the image feature vector extracted respectively is merged to Amount, the image information of the image feature vector expression after the fusion are enriched, can reduce influence of the attitudes vibration to detection, lead to simultaneously Cross to the image feature vector dimensionality reduction after fusion, can solve image feature vector it is openness the problems such as, and reduce calculating Complexity, face not only can be more accurately detected, can also improve the efficiency of detection.

In an of the invention preferred embodiment, before step S101, the method for the embodiment can also include as follows The step of：

Construct the input picture that image pyramid obtains multiple yardsticks.

As shown in Fig. 2 the schematic diagram of the image pyramid for the embodiment of the present invention.It can be obtained by constructing image pyramid Obtain the input picture of different scale, it is contemplated that the efficiency of processing, it is preferred that it is 8 to choose zoom scale, zoom factor 0.9057 Construct image pyramid.For example, when the size of the detection window used in follow-up step is 224 × 224, using image gold Word tower can detect the face that minimum dimension is 224/8=28 pixels, can be consecutively detected 28 using zoom factor and arrive picture Face in magnitude range.

The candidate face image of different scale is generated by above-mentioned step, is not easy what is detected under certain yardstick Feature is readily possible to detect under other yardstick, so as to more effectively extract image feature vector.

In a preferred embodiment of the invention, the first depth convolutional neural networks in the embodiment method are Clarifai networks, it is the network structure of Clarifai networks as shown in table 1.First convolutional layer of Clarifai networks can Experience region using relatively small 7 × 7 to be filtered, intensive filtration treatment is carried out to input picture.Clarifai networks Structure ensures to include more image feature informations in the first layer network and the second layer network, so as to lift classification performance.

The network structure of the Clarifai networks of table 1

In a preferred embodiment of the invention, it is trained before step S101, in addition to Clarifai networks The step of.The step of training, includes：

10 are set to according to initial learning rate^-4, momentum 0.9, the positive sample and negative sample in storehouse are trained with first sample Ratio is 1:3~1:10, training is finely adjusted to the Clarifai networks.

The ratio of positive sample and negative sample in the first sample training storehouse chosen can lead to according to the cutting situation of sample Cross experiment and compare determination, preferably 1:5.In training process, it is preferred that per 128 samples of batch processed, single treatment can be avoided Amount is too big so that efficiency is low.

, can be in advance in ImageNet before training is finely adjusted to the Clarifai networks using first sample training storehouse The model for training to obtain on 2012 training sets carries out parameter initialization to Clarifai networks, so as to preferentially optimize this The parameter of Clarifai networks, such as weights etc., the number subsequently trained is reduced, improve efficiency.

Wherein, the positive sample in first sample training storehouse may be selected from WIDER FACE data sets.WIDERFACE data set bags Containing different sizes, attitudes vibration, it is complicated block, expression and illumination etc. have the image of large change.Because zoom scale is larger, It can detect the human face region of reduced size, and for seriously blocking, obscuring and the complex situations such as larger attitudes vibration Face datection has robustness.Negative sample in first sample training storehouse may be selected from AFLW data sets.

Specifically, first sample training storehouse is built by such a way：

(1) it is more than according to the IOU of the first face calibration frame in the standard clipping WIDER_train subsets of first threshold Sample obtains first sample.Wherein, IOU (intersection over union) is as a kind of evaluation criterion of overlapping region, It is defined as combined region area of the area than upper two detection blocks of the intersecting area of two detection blocks.

Wherein, WIDER FACE data sets include WIDER_train subsets.IOU(intersection over union) As a kind of evaluation criterion of overlapping region, it is defined as the area of two intersecting areas for having overlapping bounding box than upper merging The area in region.

First face calibration frame is the face calibration frame on the sample in WIDER_train subsets.The first threshold is preferred For 0.65.

(2) according to the second face calibration frame friendship except and IOU be more than Second Threshold standard clipping WIDER_val subsets In sample, and choose cut out after WIDER_val subsets in size be more than presetted pixel sample as the second sample.

Wherein, WIDER FACE data sets include WIDER_val subsets.Due to WIDER_train subset small-medium size samples Originally larger specific gravity is accounted for, for enlarged sample quantity, then using WIDER_val subsets.

Second face calibration frame is the face calibration frame on the sample in WIDER_val subsets.Preferably, the Second Threshold Preferably 0.65.The presetted pixel is preferably 80 pixels.Preferably, per 64 samples of batch processed, single treatment amount can be avoided It is too big so that efficiency is low.

(3) first sample and the second sample are trained to the positive sample in storehouse collectively as first sample.

(4) removed according to the friendship that frame is demarcated with third party's face and IOU is less than in the standard clipping AFLW data sets of the 3rd threshold value Sample obtain the 3rd sample.

Preferably, the 3rd threshold value is preferably 0.3.

(5) negative sample in storehouse is trained using the 3rd sample as first sample.

(6) positive sample in the first training storehouse and negative sample are all fixed as special using depth convolutional neural networks extraction image Levy the size of the detection window of vector.

For example, using the image feature vector of Clarifai network extractions first, then by the positive sample in the first training storehouse and negative Sample is all fixed as 224 × 224 size.

(7) positive sample and negative sample are handled using the preprocess method for subtracting average, and positive sample and negative sample is subjected to mirror As upset processing, positive sample and negative sample after mirror image switch processing are obtained.

The sample in the extendible first sample training storehouse of positive sample and negative sample after mirror image switch processing, so as to more have Beneficial to network training.

In a preferred embodiment of the invention, the second depth convolutional neural networks are VGG Net-D networks, such as table 2 It is shown, it is the network structure of VGG Net-D networks.The convolutional layer of VGG Net-D networks can use less 3 × 3 receptive field Domain is filtered, and adds network depth indirectly.During convolution, the step-length of convolution is 1, because the step-length is smaller, so as to It can ensure that image feature information will not be lost.

The network structure of the VGG Net-D networks of table 2

Before step S101, in addition to the step of be trained to VGG Net-D networks.The step of training, includes： 10 are set to according to initial learning rate^-3, momentum 0.9, using the positive sample in first sample training storehouse and the ratio of negative sample as 1:3 ~1:10, training is finely adjusted to VGG Net-D networks.

The ratio of positive sample and negative sample in the first sample training storehouse chosen can lead to according to the cutting situation of sample Cross experiment and compare determination, preferably 1:5.

, can be in advance in ImageNet before being finely adjusted using first sample training storehouse to the VGG Net-D networks The model for training to obtain on 2012 training sets carries out parameter initialization to VGG Net-D networks, so as to preferentially optimize the VGG The parameter of Net-D networks, such as weights etc., the number subsequently trained is reduced, improve efficiency.

The first sample trains storehouse identical with foregoing first sample training storehouse, will not be repeated here.

It should be appreciated that above-mentioned the first training storehouse for being used for two kinds of network fine setting training can use other data Collection.The data set of selection need to include abundant face mark, including block, a variety of situations such as attitudes vibration and various activities scene Under sample.

If extracting the characteristic vector of image using original sliding window method, amount of calculation is very big, therefore, in the present invention In one preferred embodiment, therefore, the full articulamentum conversion in Clarifai networks and VGG Net-D networks can be helped volume Lamination, the layer parameter after conversion also convert therewith.Using full convolutional network, feature extraction is carried out to entire image first, then Convolution operation is carried out to obtained characteristic image, therefore amount of calculation can be reduced, and the input picture of arbitrary size can be handled.

Step S101 and step S102 adopts to the characteristic pattern after last convolutional layer processing of depth convolutional neural networks With translation pond offset max-pooling methods extraction image feature vector, i.e., only use offset max- in Pool5 layers Pooling, common pond method is still used in other layers.It should be appreciated that the method for the present embodiment is only in detection During to the characteristic pattern after the processing of last convolutional layers of depth convolutional neural networks using translation pond offset max- Pooling methods extract image feature vector, and do not use this method in the stage of training network.

Specifically, the step of extracting image feature vector using offset max-pooling methods is as follows：

The first step：Obtain the characteristic pattern after last convolutional layer processing of depth convolutional neural networks.

Last convolutional layer is conv5 convolutional layers in Clarifai networks.Last convolutional layer is in VGG It is conv5_3 convolutional layers in Net-D networks.

Second step：According to translational movement, different translation features figures is produced.As shown in figure 3, utilize 3 × 3 maximum pond Max-pooling layers carry out pondization operation to obtained characteristic pattern, can produce 4 kinds of translation features figure (offset max- Pooling feature map), because each detection window corresponds to the position of different input pictures, so by this close The mode of collection sliding window can find detection window and human face region matches well, the method equivalent to by sliding step by 2 are reduced to 1, because detection step-length is smaller, it is achieved thereby that intensive sliding window detects, it is achieved thereby that multipoint detection.

Translation features figure obtained above is sequentially inputted to handle in subsequent network again.

The image feature vector that step S101 and step S102 are finally extracted is the image feature vector of fc6-conv layers. Fc6-conv layers are that fc6 layers are converted into convolutional layer and obtained.There is higher characteristics of image in fc6-conv layers, therefore, choosing The image feature vector of extraction fc6-conv layers is selected, Detection results can be made more preferable.

As shown in figure 4, for being corresponding to the characteristic pattern of 6 × 6 sizes obtained in full convolutional network in fc6-conv layers The detection window of one 224 × 224 in input picture, each detection window is obtained in fc6-conv layers by forward calculation Fixation dimension image feature vector.

In a preferred embodiment of the invention, because step S101 and step S102 is the fc6-conv layers of extraction Image feature vector.The dimension of the image feature vector of this layer is 4096 dimensions.Therefore, the 3rd figure after being merged in step S103 As the dimension of characteristic vector is 8192 dimensions.

In a preferred embodiment of the invention, the method for the dimensionality reduction in step S104 is principal component analysis PCA methods.Should PCA methods are a kind of a kind of statistical analysis technique that multiple Feature Mappings are a few comprehensive characteristics.What this method obtained Comprehensive characteristics reflect original variable information as much as possible, orthogonal each other, so as to reach the purpose of dimensionality reduction denoising.PCA Method needs to carry out zero averaging processing.For the method for the embodiment of the present invention, input picture can inputted convolutional Neural Before network, the image feature vector of zero averaging pretreatment, i.e. input picture is carried out to the image feature vector of input picture Zero averaging processing is carried out to input picture so as to realize by subtracting constant average.The constant average is according to convolutional Neural net The scale of network determines.Therefore, in step S104, without carrying out zero averaging processing again, the PCA methods only need to be to the 3rd image The covariance matrix of characteristic vector carries out singular value decomposition, then chooses the 3rd image feature vector structure according to the size of characteristic value Make projection matrix.Pca model in the embodiment of the present invention is also the model after training.PCA methods in the embodiment of the present invention Choose and meet that the 3rd image feature vector of characteristic threshold value constructs projecting direction by the order of the characteristic value to sort from big to small Matrix.Preferably, this feature threshold value be 50%, that is, sort preceding 50% third feature vector.By constructing the projection matrix Realize the purpose of dimensionality reduction.

Feature Dimension Reduction and selection are carried out by using PCA methods, the comprehensive characteristics that can reflect classification essence, solution can be obtained Determined image feature vector it is openness the problems such as, while reduce the complexity of calculating.

In a preferred embodiment of the invention, the grader in step S105 is support vector machines grader.Should SVM classifier is a kind of disaggregated model, and its basic model is defined as being spaced the linear classifier of maximum on feature space, utilizes core Skill can turn into Nonlinear Classifier, and the learning strategy of SVMs is margin maximization, be eventually converted into one it is convex secondary The solution of planning problem.

Preferably, the training method of SVM classifier includes：

Using the positive sample in the second sample training storehouse and the ratio of negative sample as 1:1 Training Support Vector Machines SVM classifier.

The second sample training storehouse is built by such a way：

(1) according to the sample in the standard clipping WIDER_train subsets of the first face calibration frame and according to the second face The sample demarcated in the standard clipping WIDER_val subsets of frame, by the above-mentioned sample for cutting out to obtain collectively as the 4th sample.

(2) removed according to the friendship that frame is demarcated with third party's face and IOU is less than in the standard clipping AFLW data sets of the 3rd threshold value Sample obtain the 3rd sample.

Preferably, the 3rd threshold value is preferably 0.3.

(3) positive sample using the 4th sample as the second sample training storehouse, the 3rd sample is as the second sample training storehouse Negative sample.

Process using the second sample training storehouse training SVM classifier is as follows：

First, it is updated to from Polynomial kernel function as final kernel function in classification function, the classification function is：

Wherein, w and b be Optimal Separating Hyperplane normal vector and intercept, k (x_i, x) and it is Polynomial kernel function.

Then, introduce relaxation factor, use Lagrange Multiplier Methods obtain optimization aim for：

Wherein C is a constant having determined, for every weight in Controlling object function.ξ_i(i=1,2 ..., N) it is slack variable, corresponding data point x_iAllow the value at deflection function interval.α_iFor Lagrange multiplier, antithesis is translated into Problem obtains：

Finally, using SMO Algorithm for Solving parameters, final disaggregated model is obtained.

Preferably, the confidence level of each candidate window obtained in step S105 correspond to one 224 in input picture × The confidence level of 224 detection window.SVM classifier has the less risk for causing mistake to be classified.

Although multiple dimensioned detection information can be obtained using the method for image pyramid, cause the face inspection of output Survey frame have it is higher overlapping, therefore, in an of the invention preferred embodiment, as shown in figure 5, after step S105, in addition to：

Step S106：The human face region obtained according to detection, Face datection frame is marked over an input image.

Step S107：If the overlapping area of multiple Face datection frames is more than reference threshold, according to multiple Face datection frames Obtain an optimal Face datection frame.

Wherein, optimal Face datection frame refers to most representing the detection block of human face region.

Preferably, this method is non-maxima suppression NMS methods.The NMS methods (Non-maximal suppresion) It is a kind of post-processing approach of Face datection frame, the purpose is to ensure that each face object only corresponds to a detection block, eliminates more Optimal detection zone is obtained after remaining overlapping detection block.It is using the input figure before the processing of NMS methods as shown in Fig. 6 (a) Face datection frame distribution map as in.After finding the Face datection frame with highest confidence level using NMS-Max methods first, move Except all IOU are more than the Face datection frame of certain anti-eclipse threshold.It is using after the processing of NMS-Max methods as shown in Fig. 6 (b) Input picture in Face datection frame distribution map.Then the face for meeting anti-eclipse threshold is examined using NMS-Average methods Survey frame and merge into a face detection block, and the confidence level using highest confidence level as the Face datection frame after merging, such as Fig. 6 (c) it is using the Face datection frame distribution map in the input picture after the processing of NMS-Average methods shown in.

The effect of the method for the present invention is described further with a specific test case below.

Using FDDB data sets as test sample, the data set is the normal data currently used for weighing Face datection Collection, totally 2846 images, 5171 mark faces, wherein including different postures, illumination, low resolution and the complex situations such as out of focus Image.As shown in table 3, compare for heterogeneous networks test result.

The heterogeneous networks test result of table 3 compares

Network name	Recall rate	Flase drop number
			Clarifai networks	85.32%	2000
VGG Net-D	85.70%	2000
			Clarifai and VGG network characterizations merge	87.24%	2000

As can be seen from Table 3, abundant feature is included using the characteristics of image after the method extraction network integration of the present invention Expression, compensate for the deficiency of single network feature extraction, its detection performance is better than single network.It is as shown in fig. 7, real for the present invention The method for detecting human face for applying example detects obtained recall rate-flase drop number True positive rate-False positives songs Line.The curve show flase drop number be 2000 when, the recall rate of test set reaches 87.24%.From the curve it can also be seen that, this hair The detection performance of the method for detecting human face of bright embodiment is good.As shown in figure 8, for using the method for detecting human face of the embodiment of the present invention Partial test result.

Present invention also offers a kind of human face detection device.As shown in figure 9, the human face detection device for the embodiment of the present invention Structured flowchart.The human face detection device specifically includes following module：

First extraction module 901, for using in the first depth convolutional neural networks extraction input picture after training First image feature vector of each candidate window.

Preferably, the first depth convolutional neural networks are Clarifai networks.To last volume of Clarifai networks Characteristic pattern after lamination processing is using translation pond offset max-pooling methods extraction image feature vector.Final extraction The first image feature vector be fc6-conv layers image feature vector.

Second extraction module 902, for using in the second depth convolutional neural networks extraction input picture after training Second image feature vector of each candidate window.

Preferably, the second depth convolutional neural networks are VGG Net-D networks.To last of VGG Net-D networks Characteristic pattern after convolutional layer processing is using translation pond offset max-pooling methods extraction image feature vector.Finally carry The second image feature vector taken is the image feature vector of fc6-conv layers.

Fusion Module 903, for the first image feature vector of identical dimensional and the second image feature vector to be merged To the 3rd image feature vector.

Dimensionality reduction module 904, for the 3rd image feature vector dimensionality reduction to be obtained into the 4th image feature vector.

Preferably, the method for the dimensionality reduction is principal component analysis PCA methods.In the embodiment of the present invention PCA methods choose press from The order for arriving the characteristic value of small sequence greatly meets the 3rd image feature vector construction projecting direction matrix of characteristic threshold value.It is preferred that , this feature threshold value be 50%, that is, sort preceding 50% third feature vector.

Detection module 905, for according to the 4th image feature vector, using each candidate's window of detection of classifier after training Whether mouth is human face region.

Preferably, the grader is support vector machines grader.

Preferably, the device of the embodiment of the present invention also includes：

First training module 906, for being set to 10 according to initial learning rate^-4, momentum 0.9, storehouse is trained with first sample In positive sample and negative sample ratio be 1:3~1:10, Clarifai networks are finely adjusted.

Second training module 907, for being set to 10 according to initial learning rate^-3, momentum 0.9, storehouse is trained with first sample In positive sample and negative sample ratio be 1:3~1:10, training is finely adjusted to VGG Net-D networks.

3rd training module 908, for using the positive sample in the second sample training storehouse and the ratio of negative sample as 1:1 training Support vector machines grader.Wherein, the kernel function in the categorised decision function of support vector machines grader is multinomial Kernel function.

Constructing module 909, the input picture of multiple yardsticks is obtained for constructing image pyramid.

Labeling module 910, for the human face region obtained according to detection, Face datection frame is marked over an input image.

Merging module 911, if the overlapping area for multiple Face datection frames is more than reference threshold, according to multiple faces Detection block obtains an optimal Face datection frame.

It is furthermore preferred that the method that an optimal Face datection frame is obtained according to multiple Face datection frames presses down for non-maximum NMS methods processed.

For device embodiment, because it is substantially similar to embodiment of the method, so description is fairly simple, it is related Part illustrates referring to the part of embodiment of the method.

To sum up, the human face detection device of the embodiment of the present invention, candidate is extracted respectively by two depth convolutional neural networks Image feature vector in facial image and the characteristics of image after the image feature vector extracted respectively is merged to Amount, the image information of the image feature vector expression after the fusion are enriched, can reduce influence of the attitudes vibration to detection, lead to simultaneously Cross to the image feature vector dimensionality reduction after fusion, can solve image feature vector it is openness the problems such as, and reduce calculating Complexity, face not only can be more accurately detected, can also improve the efficiency of detection.

Those of ordinary skill in the art it is to be appreciated that with reference to disclosed in the embodiment of the present invention embodiment description it is each The unit and algorithm steps of example, it can be realized with the combination of electronic hardware or computer software and electronic hardware.These Function is performed with hardware or software mode actually, application-specific and design constraint depending on technical scheme.Specialty Technical staff can realize described function using distinct methods to each specific application, but this realization should not Think beyond the scope of this invention.

It is apparent to those skilled in the art that for convenience and simplicity of description, the system of foregoing description, The specific work process of device and unit, the corresponding process in preceding method embodiment is may be referred to, will not be repeated here.

In embodiment provided herein, it should be understood that disclosed apparatus and method, others can be passed through Mode is realized.For example, device embodiment described above is only schematical, for example, the division of the unit, is only A kind of division of logic function, can there is an other dividing mode when actually realizing, for example, multiple units or component can combine or Person is desirably integrated into another system, or some features can be ignored, or does not perform.Another, shown or discussed is mutual Between coupling or direct-coupling or communication connection can be INDIRECT COUPLING or communication link by some interfaces, device or unit Connect, can be electrical, mechanical or other forms.

The unit illustrated as separating component can be or may not be physically separate, show as unit The part shown can be or may not be physical location, you can with positioned at a place, or can also be distributed to multiple On NE.Some or all of unit therein can be selected to realize the mesh of this embodiment scheme according to the actual needs 's.

In addition, each functional unit in each embodiment of the present invention can be integrated in a processing unit, can also That unit is individually physically present, can also two or more units it is integrated in a unit.

If the function is realized in the form of SFU software functional unit and is used as independent production marketing or in use, can be with It is stored in a computer read/write memory medium.Based on such understanding, technical scheme is substantially in other words The part to be contributed to prior art or the part of the technical scheme can be embodied in the form of software product, the meter Calculation machine software product is stored in a storage medium, including some instructions are causing a computer equipment (can be People's computer, server, or network equipment etc.) perform all or part of step of each embodiment methods described of the present invention. And foregoing storage medium includes：USB flash disk, mobile hard disk, ROM, RAM, magnetic disc or CD etc. are various can be with store program codes Medium.

The foregoing is only a specific embodiment of the invention, but protection scope of the present invention is not limited thereto, any Those familiar with the art the invention discloses technical scope in, change or replacement can be readily occurred in, should all be contained Cover within protection scope of the present invention.Therefore, protection scope of the present invention should be defined by scope of the claims.

Claims

A kind of 1. method for detecting human face, it is characterised in that including：

The first image using each candidate window in the first depth convolutional neural networks extraction input picture after training is special Sign vector；

The second image using each candidate window in the second depth convolutional neural networks extraction input picture after training is special Sign vector；

The described first image characteristic vector of identical dimensional and second image feature vector are merged to obtain the 3rd image spy Sign vector；

The 3rd image feature vector dimensionality reduction is obtained into the 4th image feature vector；

According to the 4th image feature vector, whether each candidate window of the detection of classifier after training is used for face Region.
2. according to the method for claim 1, it is characterised in that the first depth convolutional neural networks are Clarifai nets Network, the training method of the first depth convolutional neural networks include：

10 are set to according to initial learning rate^-4, momentum 0.9, with the positive sample and the ratio of negative sample in first sample training storehouse For 1:3~1:10, training is finely adjusted to the Clarifai networks.
3. according to the method for claim 1, it is characterised in that the second depth convolutional neural networks are VGG Net-D Network, the training method of the second depth convolutional neural networks include：

10 are set to according to initial learning rate^-3, momentum 0.9, with the positive sample and the ratio of negative sample in first sample training storehouse For 1:3~1:10, training is finely adjusted to the VGG Net-D networks.
4. according to the method for claim 1, it is characterised in that

First figure of each candidate window in the first depth convolutional neural networks extraction input picture using after training As in the step of characteristic vector, being adopted to the characteristic pattern after last convolutional layer processing of the first depth convolutional neural networks With translation pond offset max-pooling methods extraction image feature vector；

Second figure of each candidate window in the second depth convolutional neural networks extraction input picture using after training As in the step of characteristic vector, being adopted to the characteristic pattern after last convolutional layer processing of the second depth convolutional neural networks With translation pond offset max-pooling methods extraction image feature vector.
5. according to the method for claim 1, it is characterised in that described that the 3rd image feature vector dimensionality reduction is obtained the The method of four image feature vectors is principal component analysis PCA methods.
6. according to the method for claim 1, it is characterised in that the grader is support vector machines grader, described The training method of grader includes：

Using the positive sample in the second sample training storehouse and the ratio of negative sample as 1:The 1 training support vector machines grader, Wherein, the kernel function in the categorised decision function of the support vector machines grader is Polynomial kernel function.
7. according to the method for claim 1, it is characterised in that the first depth convolutional neural networks using after training Before the step of extracting the first image feature vector of each candidate window in input picture, methods described also includes：

Construct the input picture that image pyramid obtains multiple yardsticks.
8. according to the method for claim 1, it is characterised in that it is described according to the 4th image feature vector, using instruction After the step of whether each candidate window of detection of classifier after white silk is human face region, methods described also includes：

The human face region obtained according to detection, Face datection frame is marked on the input picture；

If the overlapping area of multiple Face datection frames is more than reference threshold, one is obtained according to multiple Face datection frames Individual optimal Face datection frame.
9. according to the method for claim 8, it is characterised in that described to obtain one most according to multiple Face datection frames The method of excellent Face datection frame is non-maxima suppression NMS methods.
A kind of 10. human face detection device, it is characterised in that including：

First extraction module, for using each candidate in the first depth convolutional neural networks extraction input picture after training First image feature vector of window；

Second extraction module, for using each candidate in the second depth convolutional neural networks extraction input picture after training Second image feature vector of window；

Fusion Module, for the described first image characteristic vector of identical dimensional and second image feature vector to be merged To the 3rd image feature vector；

Dimensionality reduction module, for the 3rd image feature vector dimensionality reduction to be obtained into the 4th image feature vector；

Detection module, for according to the 4th image feature vector, using each candidate of the detection of classifier after training Whether window is human face region.