CN107784288A

CN107784288A - A kind of iteration positioning formula method for detecting human face based on deep neural network

Info

Publication number: CN107784288A
Application number: CN201711034973.7A
Authority: CN
Inventors: 文贵华; 罗达志
Original assignee: South China University of Technology SCUT
Current assignee: Inner Mongolia Kedian Data Service Co ltd
Priority date: 2017-10-30
Filing date: 2017-10-30
Publication date: 2018-03-09
Anticipated expiration: 2037-10-30
Also published as: CN107784288B

Abstract

The invention discloses a kind of iteration based on deep neural network to position formula method for detecting human face, comprises the following steps：Based on AFLW common image data sets, area image block therein is extracted as the input of training set and is pre-processed；Face candidate frame extraction model P Net and face skew fine setting model A Net are defined, the model is trained using above-mentioned training set；Full convolution strategy is used to the above-mentioned P of training pattern Net, to obtain the global detection matrix of consequence to sample；Picture input model P Net are obtained into face candidate frame in test process, then by the iterative fine setting face candidate frame positions of model A Net, coordinate maximum suppressing method to obtain final result.The inventive method is used in complex environment, using computer auto-detection face, and with accuracy is high, recognition speed is fast, steady performance.

Description

Iterative positioning type face detection method based on deep neural network

Technical Field

The invention relates to the technical field of image-based face detection, in particular to an iterative positioning type face detection method based on a deep neural network.

Background

1. Definition of face detection

Face detection means that given any one image, all faces (if any) in the image are automatically detected by a computer, and the positions of the faces are returned.

2. Importance of face detection

The human face is a visual mode with large information quantity, and the reflected visual information has important significance and function in life and work of people. Today, face recognition is widely applied in social life, in which face detection is a key link, and if the effect of a face detection algorithm is not good, the effect of a subsequent recognition algorithm is inevitably affected. Besides, based on identification algorithms such as age identification, gender identification, emotion identification and the like of the images, a face detection algorithm is also needed as a basic link. The wide application of the technologies promotes the importance of the face detection algorithm to a new height.

3. Technical development of face detection

The research on the face detection dates back to 70 s in the 20 th century, and the early research on the face detection mainly focuses on template matching, subspace method, deformed template matching and the like. These early face detection methods are often directed at front face detection under a simple and unchanging background, and do not have a good detection effect for faces under a complex environment. In the 90 s of the 20 th century to the beginning of the 21 st century, a face detection method based on a cascade structure is greatly developed, wherein on the basis of an Adaboost algorithm, viola and Jones use a Haar-like wavelet feature and an integral graph method to detect faces, the detection accuracy and the real-time performance of the method are greatly improved, but the face detection under a complex scene cannot be processed. In recent years, with the rapid development of deep learning, face detection algorithms based on deep learning have been developed greatly, and these methods include:

wu Suwen, wary vicavine face detection based on selective search and convolutional neural networks, 2016 [ J ] 9/28. Computer applications research, 2017 (2); chen Weidong, zhang Yang, yang Xiaolong face detection methods based on skin color features and depth models [ J ]. Industrial control computer, 2017,30 (3): 26-28; chen Rui, linda. Face key point localization based on cascaded convolutional neural networks [ J ]. Proceedings of Sichuan institute of technology (self edition), 2017,30 (1): 32-37; zhang Bailing, xia Yizhang, qian Rongjiang, etc. face occlusion detection method based on deep convolutional neural network, CN 106485215A [ P ].2017.

4. The current face detection method has the following defects: accuracy and speed

However, the method based on deep learning often has no advantage in speed, because the forward process of the deep neural network is time-consuming, and the forward process may need to be performed multiple times for one picture, which results in excessive time consumption. In addition, the existing method does not pay enough attention to the positioning accuracy of face detection, and actually, the positioning accuracy can affect the effects of subsequent algorithms such as face recognition and emotion recognition. Therefore, the algorithm utilizes the convolutional neural networks of two different tasks to combine with the detection result matrix to carry out face detection and iterative positioning of the face candidate box, and good effects on accuracy and real-time performance are obtained.

Disclosure of Invention

The invention aims to solve the defects in the prior art and provides an iterative positioning type face detection method based on a deep neural network, wherein the multitask deep neural network is designed and trained by using mass data of a public data set, a face candidate frame is extracted by using a face candidate frame extraction model in the test process to obtain a preliminary face candidate frame, and then a face offset fine adjustment model is used for iteratively positioning a face for multiple times to obtain more accurate face positioning. The algorithm is used for detecting the face in real time in a complex environment and has the characteristics of high accuracy, stable performance and the like.

The purpose of the invention can be achieved by adopting the following technical scheme:

an iterative positioning type face detection method based on a deep neural network comprises the following steps:

s1, defining a face candidate frame extraction model P-Net and a face offset fine adjustment model A-Net;

s2, extracting data and corresponding labels required by training P-Net and A-Net based on the AFLW public image data set;

s3, fine tuning training P-Net and A-Net based on a classical convolution neural network by using the data obtained in the last step;

s4, adopting a full convolution strategy to the trained P-Net model to obtain a global detection result matrix of the input picture;

s5, inputting the picture under the multi-scale form into P-Net to obtain a detection result matrix of multiple scales for a picture to be tested, and obtaining a candidate face frame through the matrix and a narrowed non-maximum suppression algorithm;

s6, inputting the candidate face frame iterative formula into the A-Net for fine adjustment according to the face position judging condition until the judging condition is met;

s7, removing repeated face candidate frames by using a narrowing non-maximum suppression algorithm, and outputting a final detection result;

further, the face candidate frame extraction model P-Net and the face offset fine-tuning model a-Net defined in the step S1 both adopt AlexNet models, and output layers thereof are modified into 2 types and 45 types according to actual situations.

Further, the training data required by the P-Net in the step S2 is two types of data, namely a human face and a non-human face; the training data required for A-Net is class 45 data, namely face candidate boxes in various offset modes.

Further, the training method in the step S3 adopts random gradient descent, and matches with learning rate attenuation and momentum; the adopted loss function is a cross entropy loss function, and the concrete form is as follows:

wherein x represents the original signal, z represents the reconstructed signal, the lengths of which are all represented in a vector form are d, the length can be easily modified into a vector inner product form, and K represents the number of samples in one iteration.

Further, the fully convolution strategy in step S4 is to store parameters of the fully connected layer, then replace the fully connected layer with the convolution layer of the same size, and assign the parameters of the previously stored fully connected layer to the new convolution layer.

Further, the narrowing of the non-maximum suppression in step S5 is an object shape customized maximum suppression algorithm, which has a better effect on rectangular candidate frames with different aspect ratios, such as a human face. The method comprises the following specific steps:

before carrying out non-maximum value suppression on a plurality of partially overlapped candidate frames, carrying out center narrowing on an original square candidate frame, wherein the narrowing formula is as follows

Wherein (x) ₁ ,y ₁ ) Is the coordinate of the upper left corner, (x) ₂ ,y ₂ ) For the bottom right corner coordinate, narrowrate is the narrowing rate, which is set to 0.08, which means that the candidate frame is narrowed to keep the original height and center point, but the width is reduced to 0.84 times the original width;

then, performing non-maximum suppression calculation and deduplication on the narrowed data, and after deduplication is finished, performing narrowing restoration, wherein a restoration formula is as follows:

wherein (x) ₁ ,y ₁ ) Is the coordinate of the upper left corner, (x) ₂ ,y ₂ ) For the bottom right corner coordinate, narrowrate is the narrowing rate, which is set to 0.08, meaning that the candidate box is restored to maintain the original center point and height, and the width is enlarged to the original width size before narrowing.

Further, in step S6, for the face candidate frame obtained by the model P-Net in the previous step, the frame candidate image is input to a-Net for performing offset mode classification, the model a-Net outputs classification confidences of the frame candidate image for 45 offset modes, and the offset condition of the frame candidate is integrated by using the classification result, as follows:

wherein, [ s, x, y ]]Is the final integration result; n is the number of offset patterns, N =45, [ s ] _n ,x _n ,y _n ]Is an offset pattern of class n, where n is an offset pattern subscript, following the 45 offset pattern settings previously; z is the number of offset patterns exceeding the threshold, and I is a weight calculation formula for calculating the respective weights of the offset patterns exceeding the threshold. The calculation formula for z and I is as follows:

wherein the definition of I is the same as that of the above weight calculation formula I

Wherein c is _n Is the weight of the offset pattern n that exceeds the threshold, t is the threshold;

then, according to the classification result obtained in the above process, the direction of the inverse classification result is finely adjusted to obtain more accurate face positioning, which is specifically as follows:

for a candidate frame with coordinates (x, y) at the upper left corner and length (w, h), the offset mode is [ x, y, s ] obtained by model A-Net classification]Then x after fine tuning of the reverse offset mode direction _new ,y _new ,w _new ,h _new ]Comprises the following steps:

further, in the step S6, after the fine tuning in the reverse offset mode direction, the fine-tuned candidate frame is input into the model a-Net again, and whether the current candidate frame reaches the most suitable position is determined by the estimation of the offset condition again, if yes, the fine tuning is stopped, and the next step is performed; if not, the fine tuning step is continued until the condition is met or the iteration number exceeds the set threshold.

The formula for determining whether the candidate frame has reached the optimal position is as follows:

wherein [ s, x, y ] is the integrated offset pattern of the current candidate frame calculated according to the above formula, and meanwhile, the maximum iteration number is set to 10, that is, the maximum iteration number is 10 to end.

Compared with the prior art, the invention has the following advantages and effects:

1. the method adopts a convolutional neural network, trains the convolutional neural network through mass data, and enables the convolutional neural network to automatically learn convolutional kernels with strong expression and a combination mode of the convolutional kernels so as to obtain better human face characteristic expression;

2. the method of the invention sets a face candidate frame extraction model P-Net and a face offset fine tuning model A-Net, which respectively have two functions of face/non-face classification and offset mode classification, and the two functions have a mutual gain effect on the face detection effect.

3. The method adopts a full convolution strategy, so that the image can be input into the picture with any size, and a detection result matrix is obtained through one convolution neural network forward operation, so that the method has higher detection speed.

4. The method adopts iterative face candidate frame positioning, and the positioning technology is beneficial to positioning the face candidate frame to a more proper position, namely improving the positioning precision.

5. Compared with the traditional method, the method has the advantages of high accuracy, high detection speed and stable performance, and has certain market value and popularization value.

Drawings

FIG. 1 is a flowchart illustrating steps of an iterative localization-based face detection method based on a deep neural network disclosed in the present invention;

fig. 2 is a schematic diagram of an iterative positioning method in a prediction process in the iterative positioning type face detection method based on the deep neural network disclosed in the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

Examples

s1, defining a face candidate frame extraction model P-Net and a face offset fine adjustment model A-Net:

in the step S1, two required model functions are defined as face candidate box extraction and face offset fine adjustment, then an AlexNet original model is downloaded from a model public source, and the output of the AlexNet original model is modified to 2 and 45, so as to adapt to the task requirement in the embodiment.

S2, extracting data and corresponding labels required by the training model P-Net and the model A-Net based on the AFLW common image data set, wherein in the step S2,

1. for the face/background classification task, data is obtained by cropping a picture in the AFLW common image dataset. The AFLW data set comprises 25000 pictures, the total number of faces is 50000, the position of a real face frame on each picture is marked by using coordinates (x 1, y 1) at the upper left corner and coordinates (x 2, y 2) at the lower right corner, and the real face frame is expressed as:

(x1,y1,x2,y2)

the face picture is cut by utilizing the coordinate information, and meanwhile, in order to enlarge a data set, certain displacement is allowed to be carried out on a cut face frame, as long as the following conditions are met:

defining the background picture as satisfying

And obtaining a background picture by adopting a random cutting mode.

B is a candidate frame for cutting the picture, and g is a standard human face actual frame in the data set; the IOU is the intersection ratio, namely the area proportion of the overlapping area of the rectangular frames b and g to the union of b and g.

2. For the offset mode classification task, data is obtained by cropping a picture in the AFLW common image dataset. Wherein for each face, the corresponding offset pattern is passed

[x _n ,y _n ,s _n ],

Wherein the content of the first and second substances,

s _n ∈{0.83,0.91,1.0,1.10,1.21}

x _n ∈{-0.17,0,0.17}

y _n ∈{-0.17,0,0.17}

a total of 45 offset face images are cut out (discarded if the size of the face image exceeds the size of the picture), and each picture is labeled with one category, and 45 categories are provided.

S3, fine tuning a training model P-Net and a model A-Net based on a classical convolution neural network by using the data obtained in the previous step;

in the step S3, a model is trained by using random gradient descent in cooperation with learning rate attenuation and momentum, and the specific parameters are as follows:

parameter name/model	Learning rate	Maximum number of iterations	Batch size	Learning rate decay rate
					Face candidate frame extraction model	0.001	90000	128	0.1
Human face offset fine-tuning model	0.002	60000	128	0.12

The training process is implemented on the TensorFlow framework.

S4, adopting a full convolution strategy to the trained model P-Net to obtain a global detection result matrix of the input picture;

the full convolution strategy in step S4 is to store the parameters of the full link layer, replace the full link layer with the convolution layer of the same size, and assign the parameters of the previously stored full link layer to the new convolution layer.

S5, for a picture to be tested, inputting the picture in a multi-scale form into a model P-Net to obtain detection result matrixes in multiple scales, and obtaining a candidate face frame through the matrixes and a narrowing non-maximum suppression algorithm:

in the step S5, for one picture to be tested, scaling is performed by using 6 scaling rates, i.e. 0.79, 1, 1.26, 1.59, 2.0, and 5.0, to obtain 6 pictures with different scales. Inputting the 6 pictures into a model P-Net, 6 global detection result matrixes can be obtained, each data point on the matrixes represents the classification result whether a certain square area in the original picture is a human face or not, candidate frames classified as the human face are screened out through analyzing data information of the 6 matrixes, and the screening criteria are as follows:

p (face) >0.85

Where p (face) represents the confidence that the candidate box is a face.

For the screened candidate frames, a narrowing non-maximum suppression algorithm is used to obtain candidate face frames, which specifically comprises the following steps:

before performing non-maximum suppression on a plurality of partially overlapped candidate frames, performing center narrowing on an original square candidate frame, wherein the narrowing formula is as follows:

and then carrying out non-maximum suppression calculation and deduplication on the narrowed data, wherein the non-maximum suppression algorithm is calculated as follows:

and (3) sorting all candidate frames according to the confidence degrees of the classified face, taking the candidate frame with the highest confidence degree as a target candidate frame, searching other candidate frames with the coincidence rate exceeding 0.3, removing the candidate frames, and storing the target candidate frame. Then, the object with the highest confidence level is continuously selected from the candidate boxes, and the process is continuously carried out until no candidate box exists.

After the duplication removal is finished, narrowing restoration is carried out, and the restoration formula is as follows:

s6, inputting the candidate face frame iterative formula to A-Net for fine adjustment according to the face position judging condition until the judging condition is met;

in the step S6, the face candidate frames obtained in the step S5 are cut out, scaled to a size of 227x 227 and input to a-Net. A-Net will output the classification confidence of the candidate box image for 45 shift modes, i.e., a vector of length 45. Then, the offset condition of the candidate frame is integrated by using the classification result, as follows:

wherein, [ s, x, y [ ]]Is the final integration result; n is the number of offset patterns, N =45, [ s ] _n ,x _n ,y _n ]Is an offset mode of class n, n being in the offset modeThe target follows the previous setting of 45 offset patterns; z is the number of offset patterns exceeding the threshold, and I is a weight calculation formula for calculating the weight of each offset pattern exceeding the threshold. The calculation formula for z and I is as follows:

the mathematical meaning of the above formula is that we choose the weighted direction of the shift directions of those shift modes whose confidence is greater than the threshold as the estimate of the image shift mode of the candidate frame.

Then, according to the classification result obtained in the above process, the direction of the inverse classification result is finely adjusted to obtain more accurate face positioning. The method comprises the following specific steps:

for a candidate box with coordinates (x, y) at the upper left corner and length (w, h), obtaining the offset mode [ x, y, s ] of the candidate box through A-Net classification]Then x after fine tuning of the reverse offset mode direction _new ,y _new ,w _new ,h _new ]Comprises the following steps:

the narrowing non-maximum suppression algorithm used in the above step S7 is the same as that in step S5, and the result is output until this time as the final result.

Example two

The embodiment specifically introduces an iterative positioning type face detection method based on a deep neural network from the aspects of framework building, data set preparation, model training and actual testing, and the specific process is described as follows.

1. The framework building process is as follows:

1. installing an Nvidia GPU driver and a related computing library on a Linux server;

2. and compiling and installing a deep learning framework TensorFlow.

2. The data set preparation process is as follows:

1. writing a tool script by using Python language to obtain a training set, cutting an image in an AFLW public image data set under the condition of four thread by using the tool script, and automatically recording a face/background label and an offset mode label;

2. for the data, a random copy method is adopted for a face/non-face training set to keep the face/non-face proportion of 1:3; and for the offset mode, the control parameters in the tool script are adopted to keep the equalization of various training samples.

3. Carrying out normalization processing on the picture data;

4. and converting the processed human face picture and the label thereof into a tfrecrds data format which can be stored in a memory in a large quantity and has a higher reading speed.

3. The training process is as follows:

1. downloading an AlexNet convolutional neural network model through a Tensorflow public model network publishing platform;

2. parameters of the convolution layer, the down-sampling layer and the front two full-connection layers of the AlexNet are reserved, and the number of output nodes of the final output layer is modified to be 2 and 45, so that the AlexNet and the face offset fine-tuning model are respectively adapted to a face candidate frame extraction model P-Net and a face offset fine-tuning model A-Net;

3. inputting a fixed number of samples into a convolutional neural network according to the number of samples in a batch for tfrecrds format data obtained in the data set preparation process;

4. outputting a feature map through a plurality of convolutional layers and downsampling layers in an AlexNet convolutional neural network model;

5. mapping the feature map features to a full connection layer through a concatemate operation;

6. calculating the classification result of the sample through a Softmax classifier, and sending the result to a loss function layer;

7. calculating the loss and the return gradient of the system according to the result and the loss function;

8. adjusting parameters of the convolutional neural network through a back propagation algorithm; the back propagation algorithm adopts the following hyper-parameters, and the hyper-parameters are obtained by multiple cross validation:

9. And repeating the processes from 3 to 8 until the maximum iteration number reaches a set threshold value, and terminating the training.

10. And adopting a full convolution strategy to the trained model P-Net to obtain a global detection result matrix of the sample.

4. The test procedure was as follows:

writing a test script by using Python language, wherein the test script comprises the following operations:

1. completing normalization operation on the pictures to be tested, and performing multi-scale operation to obtain at most 6 input pictures under multiple scales;

2. loading and training to obtain a model P-Net and a model A-Net;

3. inputting the processed picture to be tested into a model P-Net, analyzing the obtained at most 6 global result matrixes, and matching with narrowing non-maximum value inhibition to obtain a face candidate frame;

4. according to the face candidate frame, cutting out a face picture from an original picture, inputting the face picture into the model A-Net, and carrying out fine adjustment according to the result and determining whether to carry out repeated iterative fine adjustment;

5. and processing the residual candidate boxes again by using a narrowing non-maximum value suppression algorithm, and outputting the result as a final result.

In summary, the present invention extracts the image blocks of the region in the AFLW public image data set as the input of the training set and performs the preprocessing; defining a face candidate frame extraction model P-Net and a face offset fine tuning model A-Net, and using the training set to fine-tune and train the model; and adopting a full convolution strategy to the trained model P-Net to obtain a global detection result matrix of the sample. In the testing process, the picture is input into the model P-Net to obtain a face candidate frame, then the position of the face candidate frame is iteratively fine-tuned through the model A-Net, and a final result is obtained by matching with a maximum value inhibition method. The method is used for automatically detecting the face by using the computer in a complex environment, and has the advantages of high accuracy, high recognition speed, stable performance and the like.

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. An iterative positioning type face detection method based on a deep neural network is characterized by comprising the following steps:

s2, extracting data and corresponding labels required by a training model P-Net and a model A-Net based on the AFLW common image data set;

s3, fine-tuning a training model P-Net and a model A-Net based on a classical convolution neural network by using the data obtained in the previous step;

s5, inputting a picture to be tested into the model P-Net in a multi-scale mode to obtain detection result matrixes in multiple scales, and obtaining a candidate face frame through the matrixes and a narrowed non-maximum suppression algorithm;

s6, inputting the candidate face frame iterative formula into the model A-Net for fine adjustment according to the face position judging condition until the judging condition is met;

and S7, removing repeated face candidate frames by using a narrowing non-maximum suppression algorithm, and outputting a final detection result.

2. The iterative localization type face detection method according to claim 1, wherein in the face offset fine tuning model a-Net in step S1, the model is set to be an N-class classification model, the N-class face offset mode is used to evaluate an offset degree of a face candidate frame with respect to a face real frame, the face offset mode is measured by three factors, i.e. a horizontal axis, a vertical axis, and a scaling rate, and the setting is as follows:

defining a set of offset patterns:

wherein x is _n Representing the rate of shift of the frame candidate in the x-axis relative to the frame candidate width itself, y _n Representing the rate of shift, s, of the frame candidate in the y-axis relative to the frame candidate length itself _n Represents the ratio of the candidate box to itself, which should be scaled, N represents the number of classes of the offset pattern, and N is the class index.

3. The iterative positioning type face detection method based on the deep neural network as claimed in claim 2, wherein the offset mode class number N =45, N is set as a class index,

for x _n ，y _n ，s _n The respective assignments are as follows, and 5x 3x 3=45 categories can be obtained:

4. the iterative localization-type face detection method based on the deep neural network as claimed in claim 1, wherein the full convolution strategy in step S4 is to store parameters of the full connection layer, replace the full connection layer with the convolution layer of the same size, and assign the parameters of the full connection layer stored before to the new convolution layer.

5. The iterative localization type face detection method according to claim 1, wherein each point of the detection result matrix in step S5 represents a detection result of a square area with a size of 227 × 227 pixels in the original picture, and the face candidate frame is obtained by restoring the detection result to a candidate frame in the original picture and using a narrowing non-maximum suppression algorithm according to the overlapping condition of the candidate frame.

6. The iterative localized face detection method according to claim 1, wherein the narrowing non-maximum suppression in step S5 is an object shape customized maximum suppression algorithm, which has a better effect on rectangular candidate frames with different aspect ratios, such as a human face, as follows:

then, carrying out non-maximum suppression calculation and deduplication on the narrowed data, and after deduplication is finished, carrying out narrowing restoration, wherein a restoration formula is as follows:

wherein (x) ₁ ,y ₁ ) Is the coordinate of the upper left corner, (x) ₂ ,y ₂ ) In the lower right corner coordinate, narrowrate is the narrowing rate, which is set to 0.08, which means that the candidate frame is restored to maintain the original center point and height, and the width is enlarged to the original width before narrowing.

7. The iterative localization type face detection method based on the deep neural network of claim 1, wherein in step S6, for the face candidate frame obtained by the model P-Net, the frame candidate image is input to the model a-Net for the shift mode classification, the model a-Net outputs the classification confidence of the frame candidate image for N shift modes, and the shift condition of the frame candidate is integrated by using the classification result as follows:

wherein, [ s, x, y [ ]]Is the final integration result; n is the number of offset patterns, N =45, [ s ] _n ,x _n ,y _n ]Is an offset pattern of class n, where n is an offset pattern subscript, following the 45 offset pattern settings previously; z is the number of offset patterns exceeding the threshold, and I is a weight calculation formula for calculating the respective weights of the offset patterns exceeding the threshold. The calculation formula for z and I is as follows:

for a candidate box with (x, y) coordinates at the upper left corner and (w, h) length and width, obtaining the offset mode of [ x, y, s ] through A-Net classification]Then x after fine tuning of the reverse offset mode direction _new ,y _new ,w _new ,h _new ]Comprises the following steps:

8. the iterative localization type face detection method based on the deep neural network of claim 1, wherein in step S6, after the fine tuning in the direction of the reverse offset mode, the fine-tuned candidate frame is input into the model a-Net again, and whether the current candidate frame reaches the most suitable position is determined by the estimation of the offset condition again, if yes, the fine tuning is stopped, and the next step is performed; if not, the fine tuning step is continued until the condition is met or the iteration number exceeds the set threshold.

9. The iterative localization type face detection method based on the deep neural network of claim 8, wherein the formula for determining whether the candidate frame reaches the most suitable position is as follows:

wherein [ s, x, y ] is the integrated offset pattern of the current candidate frame calculated according to the above formula, and meanwhile, the maximum iteration number is set to 10, that is, the maximum iteration number is 10 times over.