Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
First embodiment
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a face detection method of the present invention is shown, which may specifically include the following steps:
Step 101, extracting a plurality of face features of different hierarchical networks from a face image to be detected by adopting a convolutional neural network model trained in advance, to obtain a plurality of face feature vectors corresponding to the different hierarchical networks;
the convolutional neural network model is provided with a plurality of different hierarchical networks, and the information contained in the features extracted by the different hierarchical networks is likewise distributed hierarchically: features extracted by lower-level networks mainly describe edges and corners and carry good localization information, so they are suitable for tasks such as pose estimation; features extracted by higher-level networks are related to the corresponding categories, so they are suitable for complex classification tasks such as face detection. In the embodiment of the invention, different types of features can therefore be extracted by the different hierarchical networks of one deep neural network model;
and the down-sampling layer of each hierarchical network produces an output result (namely, the multi-dimensional feature vector output by that hierarchical network), which serves as the input of the next hierarchical network. Thus, after a face image to be detected is input into the first hierarchical network of the convolutional neural network model, the intermediate output result of that network is fed into the second hierarchical network for further feature extraction, and so on; finally, for one image, every hierarchical network outputs one feature vector, and the features extracted by each hierarchical network have a different emphasis.
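To make this hierarchy concrete, the following is a minimal sketch in PyTorch of a backbone whose every level ends in a down-sampling layer and emits its pooled feature map both to the next level and to the caller. The layer names, channel counts and two-convolutions-per-level layout are illustrative assumptions, not the exact parameters of Table 1.

```python
import torch
import torch.nn as nn

class HierarchicalBackbone(nn.Module):
    """Sketch of a backbone in which every level emits a feature map.

    Channel counts and the two-conv-per-level layout are assumptions,
    not the exact parameters of Table 1.
    """
    def __init__(self):
        super().__init__()
        chans = [3, 64, 128, 256, 512, 512]
        self.levels = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(chans[i], chans[i + 1], 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(chans[i + 1], chans[i + 1], 3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2, 2),   # the down-sampling layer of this level
            )
            for i in range(5)
        ])

    def forward(self, x):
        per_level = []
        for level in self.levels:
            x = level(x)          # output of this level's down-sampling layer
            per_level.append(x)   # kept for fusion; also feeds the next level
        return per_level

feats = HierarchicalBackbone()(torch.randn(1, 3, 224, 224))
print([f.shape for f in feats])   # one multi-dimensional feature per level
```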
Step 102, fusing the plurality of face feature vectors into a face feature vector;
in order that the features extracted from any one face image describe the image more richly and the detected face features are more accurate, the face feature vectors extracted by the networks at all levels can be fused, i.e., connected, to form a single face feature vector.
Step 103, performing dimension reduction processing on the fused face feature vector to obtain two face feature vectors with the same dimension;
because the single face feature vector obtained by connecting the plurality of face feature vectors has a high dimension and there is redundancy among the features, two additional fully-connected layers can be added after the different-level networks of the convolutional neural network model, and the high-dimensional features are mapped into a low-dimensional space by linear mapping, so that two face feature vectors with the same dimension are obtained.
Step 104, performing face detection processing on one of the two face feature vectors to obtain a face detection result;
Step 105, simultaneously performing pose estimation processing on the other of the two face feature vectors to obtain a pose estimation result.
In order to improve task processing efficiency, the embodiment of the invention can adopt a trained convolutional neural network model to perform face detection and pose estimation separately on the two face feature vectors obtained after dimension reduction, so that the pose estimation does not need to depend on the result of the face detection.
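The following sketch (continuing the previous one) illustrates steps 102 to 105: the per-level features are reduced to vectors, concatenated into one fused vector, projected by two fully-connected layers into two same-dimension vectors, and fed to parallel detection and pose heads. The global-average-pooling reduction and the head sizes are assumptions standing in for the patent's ip*_1/ipt_1/ipt_2 layers.

```python
import torch
import torch.nn as nn

class FusedHeads(nn.Module):
    """Sketch of steps 102-105: fuse per-level features, then run face
    detection and pose estimation in parallel on two 512-d projections."""
    def __init__(self, level_channels=(64, 128, 256, 512, 512)):
        super().__init__()
        fused_dim = sum(level_channels)           # 1472 with these channels
        self.pool = nn.AdaptiveAvgPool2d(1)       # stand-in for the ip*_1 layers
        self.to_det = nn.Linear(fused_dim, 512)   # ipt_1: detection branch
        self.to_pose = nn.Linear(fused_dim, 512)  # ipt_2: pose branch
        self.face_cls = nn.Linear(512, 2)         # face / non-face
        self.bbox_reg = nn.Linear(512, 4)         # detection-box regression
        self.pose = nn.Linear(512, 3)             # pitch, yaw, roll

    def forward(self, per_level_feats):
        vecs = [self.pool(f).flatten(1) for f in per_level_feats]
        fused = torch.cat(vecs, dim=1)            # step 102: one fused vector
        det_feat = self.to_det(fused)             # step 103: two 512-d vectors
        pose_feat = self.to_pose(fused)
        # steps 104 and 105: independent branches, so pose estimation does
        # not wait for the face detection result
        return (self.face_cls(det_feat), self.bbox_reg(det_feat),
                self.pose(pose_feat))

# usage with the backbone sketched above:
# cls_logits, box_offsets, angles = FusedHeads()(feats)
```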
By means of the technical scheme of this embodiment, fusing features of different levels makes the features used for face detection and pose estimation describe the image more richly and accurately, thereby reducing the error rate of subsequent face detection; moreover, because the fused features are used for face detection and pose estimation simultaneously, a plurality of related tasks can be executed at the same time, the performance of each single task can be improved, and task processing efficiency is increased.
Second embodiment
On the basis of the above embodiments, the present embodiment further discusses the face detection method of the present invention.
Before the convolutional neural network model is adopted to extract the features of the human face, the convolutional neural network model needs to be trained, and before training, a training sample needs to be prepared according to the method provided by the embodiment of the invention.
Embodiments of the invention utilize the training set of the WIDER FACE database to generate training samples for face detection and pose estimation. WIDER FACE contains 32,203 pictures and 393,703 face annotations, divided into 61 scenes such as parades, gatherings, festivals and meetings. In each scene, 40%, 10% and 50% of the samples can be randomly selected for use as training samples, validation samples and test samples, respectively. In addition, besides the face-frame annotation, annotations of occlusion degree, pose and scene are also provided. WIDER FACE is the largest face detection database to date.
In an alternative embodiment, the training sample preparation step according to an embodiment of the present invention includes: selecting a face data set containing face annotations as training samples, and cropping the training images in the training samples; determining positive samples and negative samples in the training samples according to the overlap between the cropped images and the real face annotations; and inputting the positive samples and negative samples into the convolutional neural network model according to a preset ratio so as to train the convolutional neural network model.
Specifically, in this embodiment, the 40% of samples selected from WIDER FACE may be used as training samples, and each large training image in the training samples may be cropped to obtain a plurality of region images. A cropped region whose IoU (intersection over union) with a real face annotation is larger than 0.65 can be used as a positive sample, and a cropped region whose IoU with the real face annotations is smaller than 0.3 can be used as a negative sample; the ratio of positive to negative samples in the training samples may be 1:3. When training is needed, training samples in this ratio can be input into the convolutional neural network model to train the model. The IoU is the ratio of the area of the intersection of two regions to the area of their union, as shown in formula 1:

$$\mathrm{IoU}(A, B) = \frac{\operatorname{area}(A \cap B)}{\operatorname{area}(A \cup B)} \qquad \text{(formula 1)}$$
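A minimal sketch of the sample-labeling rule just described; the function names and the handling of ambiguous crops are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_crop(crop_box, gt_box, pos_thresh=0.65, neg_thresh=0.3):
    """Assign a cropped region as positive, negative, or discarded."""
    overlap = iou(crop_box, gt_box)
    if overlap > pos_thresh:
        return "positive"
    if overlap < neg_thresh:
        return "negative"
    return None  # ambiguous crops are not used
```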
of course, it should be noted that the thresholds 0.65 and 0.3 used here to define positive and negative samples by IoU are not intended to limit the present invention; the thresholds can be adjusted flexibly according to actual needs.
Then, after the training samples are prepared, before step 101 is executed, the convolutional neural network model may be trained, and fig. 2 shows a training flowchart of the convolutional neural network model according to an embodiment of the present invention, which specifically includes the following steps:
step 201, training different levels of networks of the convolutional neural network model by adopting intermediate loss;
in performing step 201, according to one embodiment of the present invention, this may be achieved by the following sub-steps:
Sub-step S11, extracting a plurality of multi-dimensional face feature vectors from the input training samples by adopting the different hierarchical networks of the convolutional neural network model;
Sub-step S12, reducing the dimensions of the multi-dimensional face feature vectors by adopting a plurality of fully-connected layers in the different hierarchical networks of the convolutional neural network model, to obtain a plurality of one-dimensional face feature vectors corresponding to the different hierarchical networks;
Sub-step S13, performing face classification on the plurality of one-dimensional face feature vectors by adopting a plurality of first classification networks in the different hierarchical networks of the convolutional neural network model, to obtain a plurality of face classification results, each face classification result being the probability that the input training sample is a face;
Sub-step S14, calculating the face classification results with a classification loss function to obtain a plurality of intermediate losses of the first classification networks;
Sub-step S15, adjusting the parameters of every level of network that generated each intermediate loss according to that loss, so as to train the different hierarchical networks.
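As noted above, here is a sketch of sub-steps S11 to S15; the 128-d reduction, head construction and loss bookkeeping are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class LevelHead(nn.Module):
    """One per-level supervision head: a fully-connected reduction (ip*_1)
    followed by a face / non-face classifier (ip*_2)."""
    def __init__(self, in_features):
        super().__init__()
        self.reduce = nn.Linear(in_features, 128)   # S12: dimension reduction
        self.classify = nn.Linear(128, 2)           # S13: softmax classifier

    def forward(self, fmap):
        vec = self.reduce(fmap.flatten(1))          # one-dimensional feature
        return self.classify(vec)                   # face / non-face logits

def intermediate_losses(per_level_feats, heads, labels):
    """S14-S15: one classification loss per level; backpropagating each
    loss reaches every layer that produced that level's feature."""
    return [F.cross_entropy(head(feat), labels)
            for feat, head in zip(per_level_feats, heads)]

# e.g. heads = [LevelHead(f[0].numel()) for f in feats]  # one head per level
```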
Specifically, in the present embodiment, fig. 3 shows a convolutional neural network model used for training according to an embodiment of the present invention, and table 1 is a network structure parameter of the model, which is used to explain the model shown in fig. 3.
TABLE 1
Wherein the convolutional neural network model comprises 13 convolutional layers, Conv1_1 to Conv5_3;
referring to the model shown in fig. 3, the model shows five hierarchical networks. Viewed column by column, each hierarchical network includes the convolutional layers and the down-sampling layer of the corresponding column, as well as a fully-connected layer (ip1_1, ip2_1, ip3_1, ip4_1 or ip5_1) and a classification network layer (ip1_2, ip2_2, ip3_2, ip4_2 or ip5_2). The fully-connected layer and the classification network layer within each hierarchical network do not work independently; they are interdependent. When a training sample (input sample) is input into the model, the convolutional layers Conv1_1 and Conv1_2 of the first hierarchical network perform convolution operations in sequence, and a first down-sampling is then performed to obtain a three-dimensional face feature vector. This three-dimensional face feature vector enters the fully-connected layer ip1_1 of that hierarchical network for dimension reduction, yielding a one-dimensional face feature vector that enters the classification network; meanwhile, the same three-dimensional face feature vector enters the convolutional layer Conv2_1 of the next-level network for further feature extraction. The operation of the remaining levels is similar to that of the first level and is not repeated.
Thus, the down-sampling layer of each hierarchical network outputs a multi-dimensional face feature vector, which is input into the corresponding fully-connected layer ip1_1, ip2_1, ..., ip5_1 for dimension reduction, so as to obtain the one-dimensional face feature vector of the corresponding level;
then, in order to perform feature learning on the one-dimensional face feature vector of each hierarchical network, the embodiment of the present invention further adds classification network layers ip1_2, ip2_2, ..., ip5_2 below the fully-connected layers of the different hierarchical networks of the convolutional neural network model, so that the one-dimensional face feature vector generated by each hierarchical network enters the corresponding classification network (here, a softmax classification network) for classification processing. Through this classification processing, each face feature vector obtains a probability value representing the probability that the image corresponding to the face feature vector is a face.
Then, the embodiment of the present invention may apply the classification loss function of formula 2 to the probability output by each classification network, so as to obtain a plurality of intermediate losses.
Wherein, for a sample x_i, the softmax loss function of face detection is defined as:

$$\text{loss}_i = -\left(\hat{y}_i \log p_i + (1 - \hat{y}_i)\log(1 - p_i)\right) \qquad \text{(formula 2)}$$

where p_i denotes the probability that x_i is a face, the probability value p_i being calculated by the softmax classification network, and the label \hat{y}_i represents the actual annotation of the sample x_i.
In this way, five intermediate losses are obtained, denoted loss_t, t ∈ {1, 2, ..., 5}.
Parameters of all hierarchical networks that generate the intermediate losses can then be adjusted according to each intermediate loss to train the different hierarchical networks.
That is, referring to fig. 3, since the first face feature vector of the first hierarchical network is obtained through the operations of Conv1_1, Conv1_2 and pool1 of the first hierarchical network, the first intermediate loss (detection loss1) is used to adjust the parameters of Conv1_1, Conv1_2 and pool1 of the first hierarchical network; since the second face feature vector of the second hierarchical network is obtained through the operations of Conv1_1, Conv1_2 and pool1 of the first hierarchical network plus Conv2_1, Conv2_2 and pool2 of the second hierarchical network, the second intermediate loss (detection loss2) is used to adjust the parameters of Conv1_1, Conv1_2 and pool1 of the first hierarchical network and of Conv2_1, Conv2_2 and pool2 of the second hierarchical network; the target-level networks adjusted by the third, fourth and fifth intermediate losses (detection loss3, detection loss4 and detection loss5) are determined in the same manner as for the second intermediate loss, and this is not repeated here.
That is, each intermediate loss participates in the parameter training of all layers of the network that generated the loss; for example, detection loss3 participates in the parameter training of the 1st-, 2nd- and 3rd-level networks.
Optionally, the output of each convolutional layer and fully-connected layer may be passed through a ReLU nonlinear activation. After this training, the resulting parameters of the 13 convolutional layers and 5 down-sampling layers of the convolutional neural network can be used for initialization, so that the subsequent training of the face detection network and the pose estimation network can be performed.
Step 202, training a face detection network of the convolutional neural network model and the different-level networks by adopting loss of face detection;
referring to fig. 3, after dimension reduction of each face feature vector through the full-connection layer, one path of each face feature vector enters the classification network for classification, and the other path of each face feature vector is in full connection with the face feature vector of the adjacent-level full-connection layer, so that a face detection network and a posture estimation network are trained.
Therefore, before step 202 is performed, in one embodiment, the method according to an embodiment of the present invention further comprises: extracting a plurality of face features of different hierarchical networks from the training samples by adopting the untrained convolutional neural network model, to obtain a plurality of face feature vectors corresponding to the different hierarchical networks; fusing the plurality of face feature vectors obtained by the untrained convolutional neural network model into one face feature vector; and performing dimension reduction on the fused face feature vector by adding two fully-connected layers in the convolutional neural network model, to obtain two face feature vectors with the same dimension.
Specifically, referring to fig. 3, the five face feature vectors of the five intermediate-level networks are connected to obtain a 1472-dimensional feature (concat_all, 1472, shown in fig. 3); because this dimension is high, redundancy exists among the features. Thus, by adding two fully-connected layers after concat_all (i.e., the two arrows drawn from concat_all in fig. 3), the high-dimensional features can be mapped into a lower-dimensional space using linear mapping. In this way, two 512-dimensional feature vectors (ipt_1, 512 and ipt_2, 512) are obtained through the two fully-connected layers, and are used for the two tasks of face detection and pose estimation, respectively.
Step 202 may then be performed, and when step 202 is performed, it may be accomplished, according to one embodiment of the present invention, by the following substeps:
Sub-step S21, performing face classification on one of the two face feature vectors by adopting the second classification network of the face detection network, to obtain a face classification result, the face classification result being the probability that the input training sample is a face;
Sub-step S22, calculating the face classification result with a classification loss function to obtain the classification loss of the second classification network;
Sub-step S23, training the second classification network of the face detection network of the convolutional neural network model using the classification loss;
specifically, referring to fig. 3, the 512-dimensional face feature vector output by ipt _ 1512 is used for face detection (i.e. face detection) and detection box regression (i.e. Bbox regression), that is, the two arrows output at ipt _ 1512 represent the face detection network, and the arrow entering face detection is the classification network which is the same as the classification network in the middle hierarchical network. The embodiment of the invention utilizes positive and negative samples in training samples as samples for face classification, and trains the second classification network by adopting a softmax loss function shown in formula 3. For sample xiThe softmax loss function of the face classification is as follows:
wherein p isiDenotes xiProbability of a face, probability value piCalculated by softmax classification network. LabelingRepresents a sample xiThe actual annotation of. Wherein, the formula 3 is substantially the same as the formula 2,the subscript is only present in the area, and the classification loss of the network is the average of the classification losses of all samples, which is also different from the equation 2.
Finally, the second classification network of the face detection network may be trained using the classification loss.
Sub-step S24, calculating a target coordinate value of the detection frame for the face feature vector used for face classification;
Sub-step S25, calculating a first Euclidean loss of the target coordinate value;
Sub-step S26, adjusting the regression target of the detection frame of the face detection network of the convolutional neural network model and the parameters of the different hierarchical networks according to the first Euclidean loss, so as to train the face detection network and the different hierarchical networks.
Specifically, when sub-steps S21 to S23 are performed, there are candidate windows for detecting the face, and each candidate window is some distance from the position of the real face window, so the position of the detection frame of the face needs regression training. For each positive sample, the target value of the detection-frame (i.e., candidate-window) regression can be calculated as:

$$t_i = \left(\frac{x_1 - x'_1}{w'},\ \frac{y_1 - y'_1}{h'},\ \frac{x_2 - x'_2}{w'},\ \frac{y_2 - y'_2}{h'}\right) \qquad \text{(formula 4)}$$

where [x'_1, y'_1, x'_2, y'_2] are the coordinates of the candidate window, [x_1, y_1, x_2, y_2] is the true annotation corresponding to the candidate window, w' = x'_2 - x'_1 and h' = y'_2 - y'_1, and the regression target is t_i. For negative samples, the network outputs the vector [-1, -1, -1, -1].
Then, the Euclidean loss of the regression target is calculated using formula 5:

$$\text{loss}_{bbox\_reg_i} = \left\|\hat{t}_i - t_i\right\|_2^2 \qquad \text{(formula 5)}$$

where \hat{t}_i is the regression vector output by the network for the sample x_i.
finally, according to lossbbox_regiTraining a regression target of the detection frame of the face detection network of the convolutional neural network model, namely finely adjusting the position of the detection frame, wherein the position of the detection frame containing the face determined by the model is deviated from the position of the real face, so that the detection frame regression is required; and according to lossbbox_regiThe parameters of the 5-level network shown in fig. 3 of the convolutional neural network model are adjusted, so that the training of the face detection network and the different-level network is realized.
Step 203, training the pose estimation network of the convolutional neural network model and the different hierarchical networks by adopting the loss of pose estimation.
When performing step 203, according to an embodiment of the present invention, the following sub-steps may be employed:
Sub-step S31, performing pose estimation processing on the other of the two face feature vectors to obtain a pose estimation result, the pose estimation result comprising different types of angle pose annotations;
Sub-step S32, calculating a second Euclidean loss between the true angle pose annotation of the positive sample and the pose estimation result;
Sub-step S33, adjusting the parameters of the pose estimation network and the parameters of the different hierarchical networks of the convolutional neural network model according to the second Euclidean loss, so as to train the pose estimation network and the different hierarchical networks.
Specifically, the head pose is composed of three angles, pitch (p), yaw (y) and roll (r), representing the angles of vertical rotation, horizontal rotation and in-plane rotation, respectively.
After pose estimation processing is performed on the face feature vector of ipt_2 using the convolutional neural network model, a pose estimation result (\hat{p}_j, \hat{y}_j, \hat{r}_j), the pose annotation estimated by the network, is obtained. Then, for a positive sample x_j, the Euclidean loss between the pose annotation estimated by the network and the true pose annotation of the sample can be calculated using formula 6:

$$\text{loss}_{pose_j} = (p_j - \hat{p}_j)^2 + (y_j - \hat{y}_j)^2 + (r_j - \hat{r}_j)^2 \qquad \text{(formula 6)}$$

where (p_j, y_j, r_j) is the true pose annotation of the sample x_j and (\hat{p}_j, \hat{y}_j, \hat{r}_j) is the pose annotation estimated by the network.
The pose estimation loss of the network is the average of the Euclidean losses over all positive samples.
Finally, the parameters of the pose estimation network and the parameters of the different hierarchical networks of the convolutional neural network model may be adjusted according to this second Euclidean loss, averaged over all positive samples, so as to train the pose estimation network and the different hierarchical networks.
The face detection network and the pose estimation network are parallel networks.
In summary, when training the convolutional neural network model, the parameters of the whole network are obtained by joint training with all of the above losses, where each loss trains all of the networks in the network structure that were used in producing that loss.
After the training process of fig. 2, the total loss of the convolutional neural network model is the weighted sum of the above losses, as shown in formula 7:

$$\text{loss} = \lambda_{inter}\sum_{t=1}^{5}\text{loss}_t + \lambda_{det}\,\text{loss}_{det} + \lambda_{bbox\_reg}\,\text{loss}_{bbox\_reg} + \lambda_{pose}\,\text{loss}_{pose} \qquad \text{(formula 7)}$$

where each λ is the weight of the corresponding loss, set before training the model; in this example:
λ_inter = 0.1, λ_det = 1.0, λ_bbox_reg = 0.5, λ_pose = 0.5.
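A one-function sketch of formula 7 with the example weights; the argument layout is an assumption.

```python
def total_loss(inter_losses, det_loss, bbox_loss, pose_loss,
               lam_inter=0.1, lam_det=1.0, lam_bbox_reg=0.5, lam_pose=0.5):
    """Formula 7: weighted sum of the intermediate, detection,
    box-regression and pose losses used for joint training."""
    return (lam_inter * sum(inter_losses)
            + lam_det * det_loss
            + lam_bbox_reg * bbox_loss
            + lam_pose * pose_loss)
```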
by means of the technical scheme of this embodiment, classification networks are added at the intermediate levels of the convolutional neural network model, so that the classification losses can be used to adjust the convolution and down-sampling parameters of each intermediate level and thereby train it; and by providing the fully-connected layers, the feature vectors of the intermediate levels can be connected and fused, so that face detection and pose estimation can subsequently be trained on the convolutional neural network in parallel, which increases training speed while still training every intermediate-level network.
Third embodiment
On the basis of the above embodiments, the face detection method according to the embodiment of the present invention is discussed further below with reference to fig. 4.
In order to detect faces with different sizes, in an embodiment, before performing step 101, a face pyramid may be further generated for the face image to be detected, which specifically includes: adopting an image pyramid method to carry out scaling processing on the face image to be detected to obtain a plurality of face images to be detected with different sizes, which belong to the same original image; and sequentially inputting the plurality of face images to be detected to a convolutional neural network model trained in advance to perform face detection.
In this example, an image to be detected as shown in fig. 5A may be enlarged up to 6 times; since the training sample size is 224 × 224, the smallest detectable face is about 37 × 37. The enlarged image is then gradually reduced, as long as the short side of the image remains greater than or equal to 224, with the scaling factor in this example set to a preset value. An image pyramid as shown in fig. 5B is thus obtained, that is, a plurality of face images to be detected of different sizes belonging to the same original image.
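A sketch of the pyramid construction; the 1/1.2 step factor is an assumption, since the exact scaling factor is not recoverable from the text.

```python
from PIL import Image

def build_pyramid(img, min_side=224, upscale=6.0, factor=1 / 1.2):
    """Scale the input image up, then shrink it step by step while the
    short side stays at or above min_side. The 6x upscale follows the
    text; the 1/1.2 step factor is an assumption."""
    pyramid = []
    scale = upscale
    while min(img.width, img.height) * scale >= min_side:
        size = (round(img.width * scale), round(img.height * scale))
        pyramid.append((scale, img.resize(size, Image.BILINEAR)))
        scale *= factor
    return pyramid
```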
Then, when the face detection is performed, the plurality of face images to be detected can be sequentially input to the trained convolutional neural network model described in the second embodiment to perform the face detection.
In addition, in an alternative embodiment, referring to fig. 4, before step 101 is executed, the 5 classification networks ip1_2, ip2_2, ip3_2, ip4_2 and ip5_2 used in fig. 3 may be removed, keeping only ip1_1 to ip5_1; furthermore, the 5 fully-connected layers ip1_1 to ip5_1 in the different hierarchical networks shown in fig. 3 can each be converted into a fully convolutional layer, with the convolution kernel sizes and step sizes shown in fig. 4; for example, fc_conv (112, 64) indicates a convolution kernel size of 112 and a step size of 16.
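A sketch of the fully-convolutional conversion of a trained fully-connected layer. It assumes the layer originally consumed a feature map flattened from in_channels × spatial × spatial, and the stride-16 default follows the fc_conv annotation in fig. 4.

```python
import torch.nn as nn

def linear_to_conv(fc: nn.Linear, in_channels: int, spatial: int,
                   stride: int = 16) -> nn.Conv2d:
    """Reshape a trained fully-connected layer into an equivalent convolution
    so the network can slide over inputs larger than 224x224.
    Requires fc.in_features == in_channels * spatial * spatial."""
    conv = nn.Conv2d(in_channels, fc.out_features,
                     kernel_size=spatial, stride=stride)
    # PyTorch flattens (C, H, W) row-major, so this view preserves the mapping
    conv.weight.data = fc.weight.data.view(
        fc.out_features, in_channels, spatial, spatial)
    conv.bias.data = fc.bias.data
    return conv
```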
After the conversion to full convolutional layers and the generation of the image pyramid are completed, steps 101 to 105 of the first embodiment can be performed; for the details of each step, refer to the first embodiment.
Specifically, each image in the image pyramid shown in fig. 5B may be sequentially input to the convolutional neural network model shown in fig. 4, and referring to fig. 4, the meaning of each layer in fig. 4 is the same as that in fig. 3, and specifically refer to table 1, which is not described herein again.
In this way, after each pyramid image (input image) is input into the convolutional neural network model, a forward pass is computed to obtain the features of each intermediate level, denoted fea_{ipt_t}, t = 1, 2, ..., 5;
then, the features of the intermediate levels are fused (feature fusion), i.e., connected together, to obtain a three-dimensional feature fea_{concat_all} (of 1472 channels).
Assuming that the input image of the convolutional neural network model at a certain scale is denoted as F, with size M × N, then, with the 224 window and stride 16 of fig. 4, the dimensions of the finally obtained fea_{concat_all} are:

$$\left(\left\lfloor\frac{M - 224}{16}\right\rfloor + 1\right) \times \left(\left\lfloor\frac{N - 224}{16}\right\rfloor + 1\right) \times 1472$$
then, the dimension reduction is carried out on the fused features through two full convolution layers, and 2 face feature vectors with the same dimension can be obtained:
the two face feature vectors with the dimensionality of 512 are respectively used for face detection and attitude estimation, and after face detection processing is carried out on the face feature vectors, a face detection result can be obtained.
Specifically, when step 104 is executed, this can be achieved by:
face detection processing is performed on one of the two 512-dimensional face feature vectors to obtain a response map of face classification and a response map of detection-frame regression. Each point in the face classification response map corresponds to a 224 × 224 detection window in the input face image to be detected, and the value at that point represents the probability (i.e., the confidence, p_i in formula 2) that the detection window contains a face. Then, the target detection windows corresponding to points higher than a preset threshold can be determined in the face classification response map; next, each target detection window is scaled back according to the scale at which the input face image to be detected was resized; and the image corresponding to the scaled target detection window in the original image is determined to be a face region;
therefore, the face regions of a face image to be detected of one size (one level of the face pyramid) are obtained, and the positions of these face regions must then be determined. Thus, when step 104 is executed, the regression result corresponding to each target detection window is also read from the detection-frame regression response map (i.e., the coordinate position of the target detection window is fine-tuned so that it contains the face region); the regression result gives the coordinates of the target detection window in the original image, and the position of the face region is finally determined from this regression result.
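A sketch of the mapping from a response-map point back to an original-image window, assuming each point covers a 224 × 224 window sampled at stride 16 in the scaled image:

```python
def window_from_response(ix, iy, scale, win=224, stride=16):
    """Map a point (ix, iy) of the face-classification response map, computed
    on an image scaled by `scale`, back to a window in the original image."""
    x1 = ix * stride / scale
    y1 = iy * stride / scale
    return (x1, y1, x1 + win / scale, y1 + win / scale)

# e.g. a point at (3, 5) on the 2x-upscaled pyramid level:
print(window_from_response(3, 5, scale=2.0))   # (24.0, 40.0, 136.0, 152.0)
```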
Meanwhile, when step 105 is executed, pose estimation processing is performed on the other of the two face feature vectors to obtain a pose estimation result in the form of a pose estimation response map, and the pose of the face region detected in step 104 (for example, its pitch, yaw and in-plane rotation angles) is then determined from the pose estimation response map.
Then, through the above process, a face detection result and a pose estimation result of the face image to be detected of one size in the image pyramid shown in fig. 5B can be obtained;
then, the face images to be detected in the image pyramids of other sizes need to be sequentially input to the model shown in fig. 4 for face detection and pose estimation, and the process is similar to the first input detection process of the face image to be detected, and is not described herein again.
Therefore, the face detection results and the posture estimation results of different sizes detected by the same original image can be obtained;
optionally, in an embodiment, in order to eliminate redundant detection windows, find the optimal face detection position and improve the accuracy of the detection window position, the detection method according to an embodiment of the present invention may further include: clustering the target detection windows corresponding to the plurality of face images to be detected of different sizes to form a window set; determining the target detection window with the largest value in the window set; deleting from the window set every target detection window whose overlap with that largest-value window is larger than a first preset overlap threshold; and repeating, on the window set after the deletion, the step of determining the target detection window with the largest value and the step of deleting the target detection windows whose overlap exceeds the first preset overlap threshold, until only the target detection window with the largest value is left in the window set.
In other words, after the pyramid image of each size is processed by the model shown in fig. 4, detection windows containing faces (i.e., target detection windows) are obtained and can be gathered into a window set. The detection window with the maximum confidence in the window set is then found, and every detection window in the window set whose IoU (see formula 1) with that maximum-confidence window is larger than the preset overlap threshold is deleted. The detection window with the maximum confidence is then found among the remaining windows in the window set, and the process is repeated until only the target detection window with the largest value is left in the window set.
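This is classic non-maximum suppression; a sketch follows, reusing the iou helper from the sample-labeling sketch above.

```python
def nms(windows, scores, overlap_thresh):
    """Keep the highest-confidence window, drop every remaining window whose
    IoU with it exceeds the threshold, then repeat on what is left."""
    order = sorted(range(len(windows)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if iou(windows[i], windows[best]) <= overlap_thresh]
    return keep   # indices of the surviving target detection windows
```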
Optionally, in another embodiment, in order to eliminate redundant detection windows, find the optimal face detection position and improve the accuracy of the detection window position, the detection method according to an embodiment of the present invention may further include: clustering the target detection windows corresponding to the plurality of face images to be detected of different sizes to form a window set; determining the target detection window with the largest value in the window set; re-clustering the target detection windows in the window set whose overlap with that largest-value window is greater than a second preset overlap threshold; calculating the average of the coordinates of the re-clustered target detection windows in the original image; determining a final detection window from the average coordinates; and determining the average of the values of the re-clustered target detection windows in the face classification response map as the value of the final detection window in the face classification response map.
In other words, after the pyramid image of each size is processed by the model shown in fig. 4, detection windows containing faces (i.e., target detection windows) are obtained and gathered into a window set; the detection window with the maximum confidence in the window set is found; all detection windows in the window set whose IoU with that maximum-confidence window is greater than the second overlap threshold are then re-clustered into one class; the coordinate positions in the original image of all the re-clustered detection windows are averaged to obtain average coordinates; and a new detection window is obtained from the average coordinates, with the average of the confidences of all the re-clustered detection windows taken as the confidence of the new detection window, that is, the average of the values of the re-clustered target detection windows in the face classification response map is determined as the value of the final detection window in the response map.
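This variant averages the re-clustered windows instead of discarding them (a form of box voting); a sketch, again reusing the iou helper from above:

```python
def average_cluster(windows, scores, best, overlap_thresh):
    """Average the coordinates and confidences of every window whose IoU
    with the most confident window exceeds the second overlap threshold."""
    cluster = [i for i, w in enumerate(windows)
               if iou(w, windows[best]) > overlap_thresh]
    avg_box = tuple(sum(windows[i][k] for i in cluster) / len(cluster)
                    for k in range(4))
    avg_score = sum(scores[i] for i in cluster) / len(cluster)
    return avg_box, avg_score   # the final detection window and its value
```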
By means of the technical scheme of this embodiment, face images of different sizes corresponding to the same original image can be detected from one original image to be detected, improving the usability of the detected face images; moreover, the user only needs to provide one image to be detected in order to obtain face detection results at multiple sizes, which improves detection efficiency, since the user does not need to input images to be detected of various sizes.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Corresponding to the method provided by the embodiment of the present invention, referring to fig. 6, a block diagram of a structure of an embodiment of a face detection apparatus according to the present invention is shown, which may specifically include the following modules:
the extracting module 601 is configured to extract a plurality of face features of different hierarchical networks from the face image to be detected by using a convolutional neural network model trained in advance, so as to obtain a plurality of face feature vectors corresponding to the different hierarchical networks;
a fusion module 602, configured to fuse the plurality of face feature vectors into one face feature vector;
a dimension reduction module 603, configured to perform dimension reduction on the face feature vectors after the fusion processing, to obtain two face feature vectors with the same dimension;
a face detection module 604, configured to perform face detection processing on one of the two face feature vectors to obtain a face detection result;
and the pose estimation module 605 is configured to perform pose estimation processing on the other face feature vector of the two face feature vectors at the same time to obtain a pose estimation result.
Optionally, the apparatus further comprises the following not shown modules and sub-modules:
the cropping module is used for selecting a face data set containing face labels as a training sample and cropping training images in the training sample;
the sample determining module is used for determining a positive sample and a negative sample in the training sample according to the overlapping degree of the cut image and the real face label;
and the input sample module is used for inputting the positive sample and the negative sample to the convolutional neural network model according to a preset proportion so as to train the convolutional neural network model.
Optionally, the apparatus further comprises:
the intermediate loss training module is used for training different-level networks of the convolutional neural network model by adopting intermediate loss;
the face detection loss training module is used for training the face detection network of the convolutional neural network model and the different-level network by adopting the loss of face detection;
the pose estimation loss training module is used for training the pose estimation network of the convolutional neural network model and the different hierarchical networks by adopting the loss of pose estimation;
the face detection network and the pose estimation network are parallel networks.
Optionally, the intermediate loss training module comprises:
the extraction submodule is used for extracting a plurality of multi-dimensional face feature vectors of different levels from an input training sample by adopting different level networks of the convolutional neural network model;
the dimension reduction submodule is used for reducing dimensions of the multi-dimensional face feature vectors by adopting a plurality of full connection layers in different hierarchical networks of the convolutional neural network model to obtain a plurality of one-dimensional face feature vectors corresponding to the different hierarchical networks;
the classification submodule is used for adopting a plurality of first classification networks in different hierarchical networks of the convolutional neural network model to classify the face of the plurality of one-dimensional face feature vectors respectively to obtain a plurality of face classification results, and the face classification results are the probability that the input training sample is the face;
the intermediate loss calculation submodule is used for calculating the face classification results by adopting a classification loss function to obtain a plurality of intermediate losses of the first classification networks;
and the intermediate loss training submodule is used for adjusting parameters of all hierarchical networks generating the intermediate loss according to each intermediate loss so as to train the different hierarchical networks.
Optionally, the apparatus further comprises:
the training extraction module is used for extracting a plurality of face features of different hierarchical networks from the training sample by adopting an untrained convolutional neural network model to obtain a plurality of face feature vectors corresponding to the different hierarchical networks;
the training fusion module is used for fusing the plurality of face feature vectors obtained by the untrained convolutional neural network model into one face feature vector;
and the fully-connected-layer-adding dimension reduction module is used for performing dimension reduction on the fused face feature vectors by adding two fully-connected layers in the convolutional neural network model, to obtain two face feature vectors with the same dimension.
Optionally, the face detection loss training module includes:
a face classification sub-module, configured to perform face classification on one of the two face feature vectors by using a second classification network of the face detection network to obtain a face classification result, where the face classification result is a probability that an input training sample is a face;
the calculation classification loss submodule is used for calculating the face classification result by adopting a classification loss function to obtain the classification loss of the classification network;
a classification loss training sub-module for training the second classification network of the face detection network of the convolutional neural network model using the classification loss;
a coordinate calculation submodule for calculating a target coordinate value of a detection frame of the face feature vector for face classification;
a first Euclidean loss calculating submodule for calculating a first Euclidean loss of the target coordinate value;
a face detection loss training submodule, configured to train a regression target of the detection frame of the face detection network of the convolutional neural network model according to the first euclidean loss and adjust parameters of the different-level networks, so as to train the face detection network and the different-level networks.
Optionally, the pose estimation loss training module comprises:
the pose estimation submodule is used for performing pose estimation processing on the other of the two face feature vectors to obtain a pose estimation result, wherein the pose estimation result comprises different types of angle pose annotations;
the second Euclidean loss calculation submodule is used for calculating the second Euclidean loss between the true angle pose annotation of the positive sample and the pose estimation result;
and the pose estimation loss training submodule is used for adjusting the parameters of the pose estimation network and the parameters of the different hierarchical networks of the convolutional neural network model according to the second Euclidean loss, so as to train the pose estimation network and the different hierarchical networks.
By means of the technical scheme of this embodiment, fusing features of different levels makes the features used for face detection and pose estimation describe the image more richly and accurately, thereby reducing the error rate of subsequent face detection; moreover, because the fused features are used for face detection and pose estimation simultaneously, a plurality of related tasks can be executed at the same time, the performance of each single task can be improved, and task processing efficiency is increased.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The above is a detailed description of the face detection method and face detection apparatus provided by the present invention. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help in understanding the method and its core idea; meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as a limitation of the present invention.