Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
First embodiment
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a face detection method of the present invention is shown, which may specifically include the following steps:
Step 101, extracting a plurality of face features of different hierarchical networks from a face image to be detected by adopting a convolutional neural network model trained in advance, to obtain a plurality of face feature vectors corresponding to the different hierarchical networks;
the convolutional neural network model is provided with a plurality of different hierarchical networks, and the information contained in the features extracted by the different hierarchical networks is likewise distributed hierarchically: features extracted by lower-level networks mainly describe edges and corners and carry good localization information, so they are suitable for tasks such as pose estimation; features extracted by higher-level networks are related to the corresponding categories, so they are suitable for complex classification tasks such as face detection. In the embodiment of the invention, different types of features can therefore be extracted by the different hierarchical networks of one deep neural network model;
and the down-sampling layer of each hierarchical network produces an output result (namely, the multi-dimensional feature vector output by that hierarchical network), which serves as the input of the next hierarchical network. Thus, after a face image to be detected is input into the first hierarchical network of the convolutional neural network model, the intermediate output result of that network is fed into the second hierarchical network for further feature extraction, and so on; finally, for one image, every hierarchical network outputs one feature vector, and the features extracted by each hierarchical network have a different emphasis.
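To make this hierarchy concrete, the following is a minimal sketch in PyTorch of a backbone whose every level ends in a down-sampling layer and emits its pooled feature map both to the next level and to the caller. The layer names, channel counts and two-convolutions-per-level layout are illustrative assumptions, not the exact parameters of Table 1.

```python
import torch
import torch.nn as nn

class HierarchicalBackbone(nn.Module):
    """Sketch of a backbone in which every level emits a feature map.

    Channel counts and the two-conv-per-level layout are assumptions,
    not the exact parameters of Table 1.
    """
    def __init__(self):
        super().__init__()
        chans = [3, 64, 128, 256, 512, 512]
        self.levels = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(chans[i], chans[i + 1], 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(chans[i + 1], chans[i + 1], 3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2, 2),   # the down-sampling layer of this level
            )
            for i in range(5)
        ])

    def forward(self, x):
        per_level = []
        for level in self.levels:
            x = level(x)          # output of this level's down-sampling layer
            per_level.append(x)   # kept for fusion; also feeds the next level
        return per_level

feats = HierarchicalBackbone()(torch.randn(1, 3, 224, 224))
print([f.shape for f in feats])   # one multi-dimensional feature per level
```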
Step 102, fusing the plurality of face feature vectors into a face feature vector;
in order that the features extracted from any one face image describe the image more richly and the detected face features are more accurate, the face feature vectors extracted by the networks at all levels can be fused, i.e., connected, to form a single face feature vector.
Step 103, performing dimension reduction processing on the fused face feature vector to obtain two face feature vectors with the same dimension;
because the single face feature vector obtained by connecting the plurality of face feature vectors has a high dimension and there is redundancy among the features, two additional fully-connected layers can be added after the different-level networks of the convolutional neural network model, and the high-dimensional features are mapped into a low-dimensional space by linear mapping, so that two face feature vectors with the same dimension are obtained.
Step 104, performing face detection processing on one of the two face feature vectors to obtain a face detection result;
Step 105, simultaneously performing pose estimation processing on the other of the two face feature vectors to obtain a pose estimation result.
In order to improve task processing efficiency, the embodiment of the invention can adopt a trained convolutional neural network model to perform face detection and pose estimation separately on the two face feature vectors obtained after dimension reduction, so that the pose estimation does not need to depend on the result of the face detection.
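The following sketch (continuing the previous one) illustrates steps 102 to 105: the per-level features are reduced to vectors, concatenated into one fused vector, projected by two fully-connected layers into two same-dimension vectors, and fed to parallel detection and pose heads. The global-average-pooling reduction and the head sizes are assumptions standing in for the patent's ip*_1/ipt_1/ipt_2 layers.

```python
import torch
import torch.nn as nn

class FusedHeads(nn.Module):
    """Sketch of steps 102-105: fuse per-level features, then run face
    detection and pose estimation in parallel on two 512-d projections."""
    def __init__(self, level_channels=(64, 128, 256, 512, 512)):
        super().__init__()
        fused_dim = sum(level_channels)           # 1472 with these channels
        self.pool = nn.AdaptiveAvgPool2d(1)       # stand-in for the ip*_1 layers
        self.to_det = nn.Linear(fused_dim, 512)   # ipt_1: detection branch
        self.to_pose = nn.Linear(fused_dim, 512)  # ipt_2: pose branch
        self.face_cls = nn.Linear(512, 2)         # face / non-face
        self.bbox_reg = nn.Linear(512, 4)         # detection-box regression
        self.pose = nn.Linear(512, 3)             # pitch, yaw, roll

    def forward(self, per_level_feats):
        vecs = [self.pool(f).flatten(1) for f in per_level_feats]
        fused = torch.cat(vecs, dim=1)            # step 102: one fused vector
        det_feat = self.to_det(fused)             # step 103: two 512-d vectors
        pose_feat = self.to_pose(fused)
        # steps 104 and 105: independent branches, so pose estimation does
        # not wait for the face detection result
        return (self.face_cls(det_feat), self.bbox_reg(det_feat),
                self.pose(pose_feat))

# usage with the backbone sketched above:
# cls_logits, box_offsets, angles = FusedHeads()(feats)
```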
By means of the technical scheme of this embodiment, fusing features of different levels makes the features used for face detection and pose estimation describe the image more richly and accurately, thereby reducing the error rate of subsequent face detection; moreover, because the fused features are used for face detection and pose estimation simultaneously, a plurality of related tasks can be executed at the same time, the performance of each single task can be improved, and task processing efficiency is increased.
Second embodiment
On the basis of the above embodiments, the present embodiment further discusses the face detection method of the present invention.
Before the convolutional neural network model is adopted to extract the features of the human face, the convolutional neural network model needs to be trained, and before training, a training sample needs to be prepared according to the method provided by the embodiment of the invention.
Embodiments of the invention utilize the training set of the WIDER FACE database to generate training samples for face detection and pose estimation. WIDER FACE contains 32,203 pictures and 393,703 face annotations, divided into 61 scenes such as parades, gatherings, festivals and meetings. In each scene, 40%, 10% and 50% of the samples can be randomly selected for use as training samples, validation samples and test samples, respectively. In addition, besides the face-frame annotation, annotations of occlusion degree, pose and scene are also provided. WIDER FACE is the largest face detection database to date.
In an alternative embodiment, the training sample preparation step according to an embodiment of the present invention includes: selecting a face data set containing face annotations as training samples, and cropping the training images in the training samples; determining positive samples and negative samples in the training samples according to the overlap between the cropped images and the real face annotations; and inputting the positive samples and negative samples into the convolutional neural network model according to a preset ratio so as to train the convolutional neural network model.
Specifically, in this embodiment, the 40% of samples selected from WIDER FACE may be used as training samples, and each large training image in the training samples may be cropped to obtain a plurality of region images. A cropped region whose IoU (intersection over union) with a real face annotation is larger than 0.65 can be used as a positive sample, and a cropped region whose IoU with the real face annotations is smaller than 0.3 can be used as a negative sample; the ratio of positive to negative samples in the training samples may be 1:3. When training is needed, training samples in this ratio can be input into the convolutional neural network model to train the model. The IoU is the ratio of the area of the intersection of two regions to the area of their union, as shown in formula 1:

$$\mathrm{IoU}(A, B) = \frac{\operatorname{area}(A \cap B)}{\operatorname{area}(A \cup B)} \qquad \text{(formula 1)}$$
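A minimal sketch of the sample-labeling rule just described; the function names and the handling of ambiguous crops are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_crop(crop_box, gt_box, pos_thresh=0.65, neg_thresh=0.3):
    """Assign a cropped region as positive, negative, or discarded."""
    overlap = iou(crop_box, gt_box)
    if overlap > pos_thresh:
        return "positive"
    if overlap < neg_thresh:
        return "negative"
    return None  # ambiguous crops are not used
```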
of course, it should be noted that the thresholds 0.65 and 0.3 used here to define positive and negative samples by IoU are not intended to limit the present invention; the thresholds can be adjusted flexibly according to actual needs.
Then, after the training samples are prepared, before step 101 is executed, the convolutional neural network model may be trained, and fig. 2 shows a training flowchart of the convolutional neural network model according to an embodiment of the present invention, which specifically includes the following steps:
step 201, training different levels of networks of the convolutional neural network model by adopting intermediate loss;
in performing step 201, according to one embodiment of the present invention, this may be achieved by the following sub-steps:
Sub-step S11, extracting a plurality of multi-dimensional face feature vectors from the input training samples by adopting the different hierarchical networks of the convolutional neural network model;
Sub-step S12, reducing the dimensions of the multi-dimensional face feature vectors by adopting a plurality of fully-connected layers in the different hierarchical networks of the convolutional neural network model, to obtain a plurality of one-dimensional face feature vectors corresponding to the different hierarchical networks;
Sub-step S13, performing face classification on the plurality of one-dimensional face feature vectors by adopting a plurality of first classification networks in the different hierarchical networks of the convolutional neural network model, to obtain a plurality of face classification results, each face classification result being the probability that the input training sample is a face;
Sub-step S14, calculating the face classification results with a classification loss function to obtain a plurality of intermediate losses of the first classification networks;
Sub-step S15, adjusting the parameters of every level of network that generated each intermediate loss according to that loss, so as to train the different hierarchical networks.
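As noted above, here is a sketch of sub-steps S11 to S15; the 128-d reduction, head construction and loss bookkeeping are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class LevelHead(nn.Module):
    """One per-level supervision head: a fully-connected reduction (ip*_1)
    followed by a face / non-face classifier (ip*_2)."""
    def __init__(self, in_features):
        super().__init__()
        self.reduce = nn.Linear(in_features, 128)   # S12: dimension reduction
        self.classify = nn.Linear(128, 2)           # S13: softmax classifier

    def forward(self, fmap):
        vec = self.reduce(fmap.flatten(1))          # one-dimensional feature
        return self.classify(vec)                   # face / non-face logits

def intermediate_losses(per_level_feats, heads, labels):
    """S14-S15: one classification loss per level; backpropagating each
    loss reaches every layer that produced that level's feature."""
    return [F.cross_entropy(head(feat), labels)
            for feat, head in zip(per_level_feats, heads)]

# e.g. heads = [LevelHead(f[0].numel()) for f in feats]  # one head per level
```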
Specifically, in the present embodiment, fig. 3 shows a convolutional neural network model used for training according to an embodiment of the present invention, and table 1 is a network structure parameter of the model, which is used to explain the model shown in fig. 3.
TABLE 1
Wherein the convolutional neural network model comprises 13 convolutional layers, Conv1_1 to Conv5_3;
referring to the model shown in fig. 3, the model shows five hierarchical networks. Viewed column by column, each hierarchical network includes the convolutional layers and the down-sampling layer of the corresponding column, as well as a fully-connected layer (ip1_1, ip2_1, ip3_1, ip4_1 or ip5_1) and a classification network layer (ip1_2, ip2_2, ip3_2, ip4_2 or ip5_2). The fully-connected layer and the classification network layer within each hierarchical network do not work independently; they are interdependent. When a training sample (input sample) is input into the model, the convolutional layers Conv1_1 and Conv1_2 of the first hierarchical network perform convolution operations in sequence, and a first down-sampling is then performed to obtain a three-dimensional face feature vector. This three-dimensional face feature vector enters the fully-connected layer ip1_1 of that hierarchical network for dimension reduction, yielding a one-dimensional face feature vector that enters the classification network; meanwhile, the same three-dimensional face feature vector enters the convolutional layer Conv2_1 of the next-level network for further feature extraction. The operation of the remaining levels is similar to that of the first level and is not repeated.
Thus, the down-sampling layer of each hierarchical network outputs a multi-dimensional face feature vector, which is input into the corresponding fully-connected layer ip1_1, ip2_1, ..., ip5_1 for dimension reduction, so as to obtain the one-dimensional face feature vector of the corresponding level;
then, in order to perform feature learning on the one-dimensional face feature vector of each hierarchical network, the embodiment of the present invention further adds classification network layers ip1_2, ip2_2, ..., ip5_2 below the fully-connected layers of the different hierarchical networks of the convolutional neural network model, so that the one-dimensional face feature vector generated by each hierarchical network enters the corresponding classification network (here, a softmax classification network) for classification processing. Through this classification processing, each face feature vector obtains a probability value representing the probability that the image corresponding to the face feature vector is a face.
Then, the embodiment of the present invention may apply the classification loss function of formula 2 to the probability output by each classification network, so as to obtain a plurality of intermediate losses.
Wherein, for a sample x_i, the softmax loss function of face detection is defined as:

$$\text{loss}_i = -\left(\hat{y}_i \log p_i + (1 - \hat{y}_i)\log(1 - p_i)\right) \qquad \text{(formula 2)}$$

where p_i denotes the probability that x_i is a face, the probability value p_i being calculated by the softmax classification network, and the label \hat{y}_i represents the actual annotation of the sample x_i.
In this way, five intermediate losses are obtained, denoted loss_t, t ∈ {1, 2, ..., 5}.
Parameters of all hierarchical networks that generate the intermediate losses can then be adjusted according to each intermediate loss to train the different hierarchical networks.
That is, referring to fig. 3, since the first face feature vector of the first hierarchical network is obtained through the operations of Conv1_1, Conv1_2 and pool1 of the first hierarchical network, the first intermediate loss (detection loss1) is used to adjust the parameters of Conv1_1, Conv1_2 and pool1 of the first hierarchical network; since the second face feature vector of the second hierarchical network is obtained through the operations of Conv1_1, Conv1_2 and pool1 of the first hierarchical network plus Conv2_1, Conv2_2 and pool2 of the second hierarchical network, the second intermediate loss (detection loss2) is used to adjust the parameters of Conv1_1, Conv1_2 and pool1 of the first hierarchical network and of Conv2_1, Conv2_2 and pool2 of the second hierarchical network; the target-level networks adjusted by the third, fourth and fifth intermediate losses (detection loss3, detection loss4 and detection loss5) are determined in the same manner as for the second intermediate loss, and this is not repeated here.
That is, each intermediate loss participates in the parameter training of all layers of the network that generated the loss; for example, detection loss3 participates in the parameter training of the 1st-, 2nd- and 3rd-level networks.
Optionally, the output of each convolutional layer and fully-connected layer may be passed through a ReLU nonlinear activation. After this training, the resulting parameters of the 13 convolutional layers and 5 down-sampling layers of the convolutional neural network can be used for initialization, so that the subsequent training of the face detection network and the pose estimation network can be performed.
Step 202, training a face detection network of the convolutional neural network model and the different-level networks by adopting loss of face detection;
referring to fig. 3, after dimension reduction of each face feature vector through the full-connection layer, one path of each face feature vector enters the classification network for classification, and the other path of each face feature vector is in full connection with the face feature vector of the adjacent-level full-connection layer, so that a face detection network and a posture estimation network are trained.
Therefore, before step 202 is performed, in one embodiment, the method according to an embodiment of the present invention further comprises: extracting a plurality of face features of different hierarchical networks from the training samples by adopting the untrained convolutional neural network model, to obtain a plurality of face feature vectors corresponding to the different hierarchical networks; fusing the plurality of face feature vectors obtained by the untrained convolutional neural network model into one face feature vector; and performing dimension reduction on the fused face feature vector by adding two fully-connected layers in the convolutional neural network model, to obtain two face feature vectors with the same dimension.
Specifically, referring to fig. 3, the five face feature vectors of the five intermediate-level networks are connected to obtain a 1472-dimensional feature (concat_all, 1472, shown in fig. 3); because this dimension is high, redundancy exists among the features. Thus, by adding two fully-connected layers after concat_all (i.e., the two arrows drawn from concat_all in fig. 3), the high-dimensional features can be mapped into a lower-dimensional space using linear mapping. In this way, two 512-dimensional feature vectors (ipt_1, 512 and ipt_2, 512) are obtained through the two fully-connected layers, and are used for the two tasks of face detection and pose estimation, respectively.
Step 202 may then be performed, and when step 202 is performed, it may be accomplished, according to one embodiment of the present invention, by the following substeps:
Sub-step S21, performing face classification on one of the two face feature vectors by adopting the second classification network of the face detection network, to obtain a face classification result, the face classification result being the probability that the input training sample is a face;
Sub-step S22, calculating the face classification result with a classification loss function to obtain the classification loss of the second classification network;
Sub-step S23, training the second classification network of the face detection network of the convolutional neural network model using the classification loss;
specifically, referring to fig. 3, the 512-dimensional face feature vector output by ipt _ 1512 is used for face detection (i.e. face detection) and detection box regression (i.e. Bbox regression), that is, the two arrows output at ipt _ 1512 represent the face detection network, and the arrow entering face detection is the classification network which is the same as the classification network in the middle hierarchical network. The embodiment of the invention utilizes positive and negative samples in training samples as samples for face classification, and trains the second classification network by adopting a softmax loss function shown in formula 3. For sample xiThe softmax loss function of the face classification is as follows:
wherein p isiDenotes xiProbability of a face, probability value piCalculated by softmax classification network. LabelingRepresents a sample xiThe actual annotation of. Wherein, the formula 3 is substantially the same as the formula 2,the subscript is only present in the area, and the classification loss of the network is the average of the classification losses of all samples, which is also different from the equation 2.
Finally, the second classification network of the face detection network may be trained using the classification loss.
Sub-step S24, calculating a target coordinate value of the detection frame for the face feature vector used for face classification;
Sub-step S25, calculating a first Euclidean loss of the target coordinate value;
Sub-step S26, adjusting the regression target of the detection frame of the face detection network of the convolutional neural network model and the parameters of the different hierarchical networks according to the first Euclidean loss, so as to train the face detection network and the different hierarchical networks.
Specifically, when sub-steps S21 to S23 are performed, there are candidate windows for detecting the face, and each candidate window is some distance from the position of the real face window, so the position of the detection frame of the face needs regression training. For each positive sample, the target value of the detection-frame (i.e., candidate-window) regression can be calculated as:

$$t_i = \left(\frac{x_1 - x'_1}{w'},\ \frac{y_1 - y'_1}{h'},\ \frac{x_2 - x'_2}{w'},\ \frac{y_2 - y'_2}{h'}\right) \qquad \text{(formula 4)}$$

where [x'_1, y'_1, x'_2, y'_2] are the coordinates of the candidate window, [x_1, y_1, x_2, y_2] is the true annotation corresponding to the candidate window, w' = x'_2 - x'_1 and h' = y'_2 - y'_1, and the regression target is t_i. For negative samples, the network outputs the vector [-1, -1, -1, -1].
Then, the Euclidean loss of the regression target is calculated using formula 5:

$$\text{loss}_{bbox\_reg_i} = \left\|\hat{t}_i - t_i\right\|_2^2 \qquad \text{(formula 5)}$$

where \hat{t}_i is the regression vector output by the network for the sample x_i.
finally, according to lossbbox_regiTraining a regression target of the detection frame of the face detection network of the convolutional neural network model, namely finely adjusting the position of the detection frame, wherein the position of the detection frame containing the face determined by the model is deviated from the position of the real face, so that the detection frame regression is required; and according to lossbbox_regiThe parameters of the 5-level network shown in fig. 3 of the convolutional neural network model are adjusted, so that the training of the face detection network and the different-level network is realized.
Step 203, training the pose estimation network of the convolutional neural network model and the different hierarchical networks by adopting the loss of pose estimation.
When performing step 203, according to an embodiment of the present invention, the following sub-steps may be employed:
Sub-step S31, performing pose estimation processing on the other of the two face feature vectors to obtain a pose estimation result, the pose estimation result comprising different types of angle pose annotations;
Sub-step S32, calculating a second Euclidean loss between the true angle pose annotation of the positive sample and the pose estimation result;
Sub-step S33, adjusting the parameters of the pose estimation network and the parameters of the different hierarchical networks of the convolutional neural network model according to the second Euclidean loss, so as to train the pose estimation network and the different hierarchical networks.
Specifically, the head pose is composed of three angles, pitch (p), yaw (y) and roll (r), representing the angles of vertical rotation, horizontal rotation and in-plane rotation, respectively.
After pose estimation processing is performed on the face feature vector of ipt_2 using the convolutional neural network model, a pose estimation result (\hat{p}_j, \hat{y}_j, \hat{r}_j), the pose annotation estimated by the network, is obtained. Then, for a positive sample x_j, the Euclidean loss between the pose annotation estimated by the network and the true pose annotation of the sample can be calculated using formula 6:

$$\text{loss}_{pose_j} = (p_j - \hat{p}_j)^2 + (y_j - \hat{y}_j)^2 + (r_j - \hat{r}_j)^2 \qquad \text{(formula 6)}$$

where (p_j, y_j, r_j) is the true pose annotation of the sample x_j and (\hat{p}_j, \hat{y}_j, \hat{r}_j) is the pose annotation estimated by the network.
The pose estimation loss of the network is the average of the Euclidean losses over all positive samples.
Finally, the parameters of the pose estimation network and the parameters of the different hierarchical networks of the convolutional neural network model may be adjusted according to this second Euclidean loss, averaged over all positive samples, so as to train the pose estimation network and the different hierarchical networks.
The face detection network and the pose estimation network are parallel networks.
In summary, when training the convolutional neural network model, the parameters of the whole network are obtained by joint training with all of the above losses, where each loss trains all of the networks in the network structure that were used in producing that loss.
After the training process of fig. 2, the total loss of the convolutional neural network model is the weighted sum of the above losses, as shown in formula 7:

$$\text{loss} = \lambda_{inter}\sum_{t=1}^{5}\text{loss}_t + \lambda_{det}\,\text{loss}_{det} + \lambda_{bbox\_reg}\,\text{loss}_{bbox\_reg} + \lambda_{pose}\,\text{loss}_{pose} \qquad \text{(formula 7)}$$

where each λ is the weight of the corresponding loss, set before training the model; in this example:
λ_inter = 0.1, λ_det = 1.0, λ_bbox_reg = 0.5, λ_pose = 0.5.
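A one-function sketch of formula 7 with the example weights; the argument layout is an assumption.

```python
def total_loss(inter_losses, det_loss, bbox_loss, pose_loss,
               lam_inter=0.1, lam_det=1.0, lam_bbox_reg=0.5, lam_pose=0.5):
    """Formula 7: weighted sum of the intermediate, detection,
    box-regression and pose losses used for joint training."""
    return (lam_inter * sum(inter_losses)
            + lam_det * det_loss
            + lam_bbox_reg * bbox_loss
            + lam_pose * pose_loss)
```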
by means of the technical scheme of this embodiment, classification networks are added at the intermediate levels of the convolutional neural network model, so that the classification losses can be used to adjust the convolution and down-sampling parameters of each intermediate level and thereby train it; and by providing the fully-connected layers, the feature vectors of the intermediate levels can be connected and fused, so that face detection and pose estimation can subsequently be trained on the convolutional neural network in parallel, which increases training speed while still training every intermediate-level network.
Third embodiment
On the basis of the above embodiments, the face detection method according to the embodiment of the present invention is discussed further below with reference to fig. 4.
In order to detect faces with different sizes, in an embodiment, before performing step 101, a face pyramid may be further generated for the face image to be detected, which specifically includes: adopting an image pyramid method to carry out scaling processing on the face image to be detected to obtain a plurality of face images to be detected with different sizes, which belong to the same original image; and sequentially inputting the plurality of face images to be detected to a convolutional neural network model trained in advance to perform face detection.
In this example, an image to be detected as shown in fig. 5A may be enlarged up to 6 times; since the training sample size is 224 × 224, the smallest detectable face is about 37 × 37. The enlarged image is then gradually reduced, as long as the short side of the image remains greater than or equal to 224, with the scaling factor in this example set to a preset value. An image pyramid as shown in fig. 5B is thus obtained, that is, a plurality of face images to be detected of different sizes belonging to the same original image.
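A sketch of the pyramid construction; the 1/1.2 step factor is an assumption, since the exact scaling factor is not recoverable from the text.

```python
from PIL import Image

def build_pyramid(img, min_side=224, upscale=6.0, factor=1 / 1.2):
    """Scale the input image up, then shrink it step by step while the
    short side stays at or above min_side. The 6x upscale follows the
    text; the 1/1.2 step factor is an assumption."""
    pyramid = []
    scale = upscale
    while min(img.width, img.height) * scale >= min_side:
        size = (round(img.width * scale), round(img.height * scale))
        pyramid.append((scale, img.resize(size, Image.BILINEAR)))
        scale *= factor
    return pyramid
```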
Then, when the face detection is performed, the plurality of face images to be detected can be sequentially input to the trained convolutional neural network model described in the second embodiment to perform the face detection.
In addition, in an alternative embodiment, referring to fig. 4, before step 101 is executed, the 5 classification networks ip1_2, ip2_2, ip3_2, ip4_2 and ip5_2 used in fig. 3 may be removed, keeping only ip1_1 to ip5_1; furthermore, the 5 fully-connected layers ip1_1 to ip5_1 in the different hierarchical networks shown in fig. 3 can each be converted into a fully convolutional layer, with the convolution kernel sizes and step sizes shown in fig. 4; for example, fc_conv (112, 64) indicates a convolution kernel size of 112 and a step size of 16.
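A sketch of the fully-convolutional conversion of a trained fully-connected layer. It assumes the layer originally consumed a feature map flattened from in_channels × spatial × spatial, and the stride-16 default follows the fc_conv annotation in fig. 4.

```python
import torch.nn as nn

def linear_to_conv(fc: nn.Linear, in_channels: int, spatial: int,
                   stride: int = 16) -> nn.Conv2d:
    """Reshape a trained fully-connected layer into an equivalent convolution
    so the network can slide over inputs larger than 224x224.
    Requires fc.in_features == in_channels * spatial * spatial."""
    conv = nn.Conv2d(in_channels, fc.out_features,
                     kernel_size=spatial, stride=stride)
    # PyTorch flattens (C, H, W) row-major, so this view preserves the mapping
    conv.weight.data = fc.weight.data.view(
        fc.out_features, in_channels, spatial, spatial)
    conv.bias.data = fc.bias.data
    return conv
```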
After the conversion to full convolutional layers and the generation of the image pyramid are completed, steps 101 to 105 of the first embodiment can be performed; for the details of each step, refer to the first embodiment.
Specifically, each image in the image pyramid shown in fig. 5B may be sequentially input to the convolutional neural network model shown in fig. 4, and referring to fig. 4, the meaning of each layer in fig. 4 is the same as that in fig. 3, and specifically refer to table 1, which is not described herein again.
In this way, after each pyramid image (input image) is input into the convolutional neural network model, a forward pass is computed to obtain the features of each intermediate level, denoted fea_{ipt_t}, t = 1, 2, ..., 5;
then, the features of the intermediate levels are fused (feature fusion), i.e., connected together, to obtain a three-dimensional feature fea_{concat_all} (of 1472 channels).
Assuming that the input image of the convolutional neural network model at a certain scale is denoted as F, with size M × N, then, with the 224 window and stride 16 of fig. 4, the dimensions of the finally obtained fea_{concat_all} are:

$$\left(\left\lfloor\frac{M - 224}{16}\right\rfloor + 1\right) \times \left(\left\lfloor\frac{N - 224}{16}\right\rfloor + 1\right) \times 1472$$
then, the dimension reduction is carried out on the fused features through two full convolution layers, and 2 face feature vectors with the same dimension can be obtained:
the two face feature vectors with the dimensionality of 512 are respectively used for face detection and attitude estimation, and after face detection processing is carried out on the face feature vectors, a face detection result can be obtained.
Specifically, when step 104 is executed, this can be achieved by:
face detection processing is performed on one of the two 512-dimensional face feature vectors to obtain a response map of face classification and a response map of detection-frame regression. Each point in the face classification response map corresponds to a 224 × 224 detection window in the input face image to be detected, and the value at that point represents the probability (i.e., the confidence, p_i in formula 2) that the detection window contains a face. Then, the target detection windows corresponding to points higher than a preset threshold can be determined in the face classification response map; next, each target detection window is scaled back according to the scale at which the input face image to be detected was resized; and the image corresponding to the scaled target detection window in the original image is determined to be a face region;
therefore, the face regions of a face image to be detected of one size (one level of the face pyramid) are obtained, and the positions of these face regions must then be determined. Thus, when step 104 is executed, the regression result corresponding to each target detection window is also read from the detection-frame regression response map (i.e., the coordinate position of the target detection window is fine-tuned so that it contains the face region); the regression result gives the coordinates of the target detection window in the original image, and the position of the face region is finally determined from this regression result.
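A sketch of the mapping from a response-map point back to an original-image window, assuming each point covers a 224 × 224 window sampled at stride 16 in the scaled image:

```python
def window_from_response(ix, iy, scale, win=224, stride=16):
    """Map a point (ix, iy) of the face-classification response map, computed
    on an image scaled by `scale`, back to a window in the original image."""
    x1 = ix * stride / scale
    y1 = iy * stride / scale
    return (x1, y1, x1 + win / scale, y1 + win / scale)

# e.g. a point at (3, 5) on the 2x-upscaled pyramid level:
print(window_from_response(3, 5, scale=2.0))   # (24.0, 40.0, 136.0, 152.0)
```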
Meanwhile, when step 105 is executed, pose estimation processing is performed on the other of the two face feature vectors to obtain a pose estimation result in the form of a pose estimation response map, and the pose of the face region detected in step 104 (for example, its pitch, yaw and in-plane rotation angles) is then determined from the pose estimation response map.
Then, through the above process, a face detection result and a pose estimation result of the face image to be detected of one size in the image pyramid shown in fig. 5B can be obtained;
then, the face images to be detected in the image pyramids of other sizes need to be sequentially input to the model shown in fig. 4 for face detection and pose estimation, and the process is similar to the first input detection process of the face image to be detected, and is not described herein again.
Therefore, the face detection results and the posture estimation results of different sizes detected by the same original image can be obtained;
optionally, in an embodiment, in order to eliminate redundant detection windows, find the optimal face detection position and improve the accuracy of the detection window position, the detection method according to an embodiment of the present invention may further include: clustering the target detection windows corresponding to the plurality of face images to be detected of different sizes to form a window set; determining the target detection window with the largest value in the window set; deleting from the window set every target detection window whose overlap with that largest-value window is larger than a first preset overlap threshold; and repeating, on the window set after the deletion, the step of determining the target detection window with the largest value and the step of deleting the target detection windows whose overlap exceeds the first preset overlap threshold, until only the target detection window with the largest value is left in the window set.
In other words, after the pyramid image of each size is processed by the model shown in fig. 4, detection windows containing faces (i.e., target detection windows) are obtained and can be gathered into a window set. The detection window with the maximum confidence in the window set is then found, and every detection window in the window set whose IoU (see formula 1) with that maximum-confidence window is larger than the preset overlap threshold is deleted. The detection window with the maximum confidence is then found among the remaining windows in the window set, and the process is repeated until only the target detection window with the largest value is left in the window set.
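This is classic non-maximum suppression; a sketch follows, reusing the iou helper from the sample-labeling sketch above.

```python
def nms(windows, scores, overlap_thresh):
    """Keep the highest-confidence window, drop every remaining window whose
    IoU with it exceeds the threshold, then repeat on what is left."""
    order = sorted(range(len(windows)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if iou(windows[i], windows[best]) <= overlap_thresh]
    return keep   # indices of the surviving target detection windows
```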
Optionally, in another embodiment, in order to eliminate redundant detection windows, find the optimal face detection position and improve the accuracy of the detection window position, the detection method according to an embodiment of the present invention may further include: clustering the target detection windows corresponding to the plurality of face images to be detected of different sizes to form a window set; determining the target detection window with the largest value in the window set; re-clustering the target detection windows in the window set whose overlap with that largest-value window is greater than a second preset overlap threshold; calculating the average of the coordinates of the re-clustered target detection windows in the original image; determining a final detection window from the average coordinates; and determining the average of the values of the re-clustered target detection windows in the face classification response map as the value of the final detection window in the face classification response map.
In other words, after the pyramid image of each size is processed by the model shown in fig. 4, detection windows containing faces (i.e., target detection windows) are obtained and gathered into a window set; the detection window with the maximum confidence in the window set is found; all detection windows in the window set whose IoU with that maximum-confidence window is greater than the second overlap threshold are then re-clustered into one class; the coordinate positions in the original image of all the re-clustered detection windows are averaged to obtain average coordinates; and a new detection window is obtained from the average coordinates, with the average of the confidences of all the re-clustered detection windows taken as the confidence of the new detection window, that is, the average of the values of the re-clustered target detection windows in the face classification response map is determined as the value of the final detection window in the response map.
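This variant averages the re-clustered windows instead of discarding them (a form of box voting); a sketch, again reusing the iou helper from above:

```python
def average_cluster(windows, scores, best, overlap_thresh):
    """Average the coordinates and confidences of every window whose IoU
    with the most confident window exceeds the second overlap threshold."""
    cluster = [i for i, w in enumerate(windows)
               if iou(w, windows[best]) > overlap_thresh]
    avg_box = tuple(sum(windows[i][k] for i in cluster) / len(cluster)
                    for k in range(4))
    avg_score = sum(scores[i] for i in cluster) / len(cluster)
    return avg_box, avg_score   # the final detection window and its value
```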
By means of the technical scheme of this embodiment, face images of different sizes corresponding to the same original image can be detected from one original image to be detected, improving the usability of the detected face images; moreover, the user only needs to provide one image to be detected in order to obtain face detection results at multiple sizes, which improves detection efficiency, since the user does not need to input images to be detected of various sizes.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Corresponding to the method provided by the embodiment of the present invention, referring to fig. 6, a block diagram of a structure of an embodiment of a face detection apparatus according to the present invention is shown, which may specifically include the following modules:
the extracting module 601 is configured to extract a plurality of face features of different hierarchical networks from the face image to be detected by using a convolutional neural network model trained in advance, so as to obtain a plurality of face feature vectors corresponding to the different hierarchical networks;
a fusion module 602, configured to fuse the plurality of face feature vectors into one face feature vector;
a dimension reduction module 603, configured to perform dimension reduction on the face feature vectors after the fusion processing, to obtain two face feature vectors with the same dimension;
a face detection module 604, configured to perform face detection processing on one of the two face feature vectors to obtain a face detection result;
and the pose estimation module 605 is configured to perform pose estimation processing on the other face feature vector of the two face feature vectors at the same time to obtain a pose estimation result.
Optionally, the apparatus further comprises the following not shown modules and sub-modules:
the cropping module is used for selecting a face data set containing face labels as a training sample and cropping training images in the training sample;
the sample determining module is used for determining a positive sample and a negative sample in the training sample according to the overlapping degree of the cut image and the real face label;
and the input sample module is used for inputting the positive sample and the negative sample to the convolutional neural network model according to a preset proportion so as to train the convolutional neural network model.
Optionally, the apparatus further comprises:
the intermediate loss training module is used for training different-level networks of the convolutional neural network model by adopting intermediate loss;
the face detection loss training module is used for training the face detection network of the convolutional neural network model and the different-level network by adopting the loss of face detection;
the pose estimation loss training module is used for training the pose estimation network of the convolutional neural network model and the different hierarchical networks by adopting the loss of pose estimation;
the face detection network and the pose estimation network are parallel networks.
Optionally, the intermediate loss training module comprises:
the extraction submodule is used for extracting a plurality of multi-dimensional face feature vectors of different levels from an input training sample by adopting different level networks of the convolutional neural network model;
the dimension reduction submodule is used for reducing dimensions of the multi-dimensional face feature vectors by adopting a plurality of full connection layers in different hierarchical networks of the convolutional neural network model to obtain a plurality of one-dimensional face feature vectors corresponding to the different hierarchical networks;
the classification submodule is used for adopting a plurality of first classification networks in different hierarchical networks of the convolutional neural network model to classify the face of the plurality of one-dimensional face feature vectors respectively to obtain a plurality of face classification results, and the face classification results are the probability that the input training sample is the face;
the intermediate loss calculation submodule is used for calculating the face classification results by adopting a classification loss function to obtain a plurality of intermediate losses of the first classification networks;
and the intermediate loss training submodule is used for adjusting parameters of all hierarchical networks generating the intermediate loss according to each intermediate loss so as to train the different hierarchical networks.
Optionally, the apparatus further comprises:
the training extraction module is used for extracting a plurality of face features of different hierarchical networks from the training sample by adopting an untrained convolutional neural network model to obtain a plurality of face feature vectors corresponding to the different hierarchical networks;
the training fusion module is used for fusing the plurality of face feature vectors obtained by the untrained convolutional neural network model into one face feature vector;
and the fully-connected-layer-adding dimension reduction module is used for performing dimension reduction on the fused face feature vectors by adding two fully-connected layers in the convolutional neural network model, to obtain two face feature vectors with the same dimension.
Optionally, the face detection loss training module includes:
a face classification sub-module, configured to perform face classification on one of the two face feature vectors by using a second classification network of the face detection network to obtain a face classification result, where the face classification result is a probability that an input training sample is a face;
the calculation classification loss submodule is used for calculating the face classification result by adopting a classification loss function to obtain the classification loss of the classification network;
a classification loss training sub-module for training the second classification network of the face detection network of the convolutional neural network model using the classification loss;
a coordinate calculation submodule for calculating a target coordinate value of a detection frame of the face feature vector for face classification;
a first Euclidean loss calculating submodule for calculating a first Euclidean loss of the target coordinate value;
a face detection loss training submodule, configured to train a regression target of the detection frame of the face detection network of the convolutional neural network model according to the first euclidean loss and adjust parameters of the different-level networks, so as to train the face detection network and the different-level networks.
Optionally, the pose estimation loss training module comprises:
the pose estimation submodule is used for performing pose estimation processing on the other of the two face feature vectors to obtain a pose estimation result, wherein the pose estimation result comprises different types of angle pose annotations;
the second Euclidean loss calculation submodule is used for calculating the second Euclidean loss between the true angle pose annotation of the positive sample and the pose estimation result;
and the pose estimation loss training submodule is used for adjusting the parameters of the pose estimation network and the parameters of the different hierarchical networks of the convolutional neural network model according to the second Euclidean loss, so as to train the pose estimation network and the different hierarchical networks.
By means of the technical scheme of this embodiment, fusing features of different levels makes the features used for face detection and pose estimation describe the image more richly and accurately, thereby reducing the error rate of subsequent face detection; moreover, because the fused features are used for face detection and pose estimation simultaneously, a plurality of related tasks can be executed at the same time, the performance of each single task can be improved, and task processing efficiency is increased.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The above is a detailed description of the face detection method and face detection apparatus provided by the present invention. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help in understanding the method and its core idea; meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as a limitation of the present invention.