CN114005150B - Design method for quantifiable front-end face detection - Google Patents

Design method for quantifiable front-end face detection

Info

Publication number
CN114005150B
Authority
CN
China
Prior art keywords
face
picture
faces
layer
feature map
Prior art date
Legal status
Active
Application number
CN202010736641.9A
Other languages
Chinese (zh)
Other versions
CN114005150A (en)
Inventor
田凤彬
于晓静
Current Assignee
Beijing Ingenic Semiconductor Co Ltd
Original Assignee
Beijing Ingenic Semiconductor Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Ingenic Semiconductor Co Ltd
Priority to CN202010736641.9A
Publication of CN114005150A
Application granted
Publication of CN114005150B

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a design method for quantifiable front-end face detection, which comprises the following steps: S1, adopting a cascade of two detectors and producing training samples for the two-stage model: S1.1, producing the first-stage detector model training samples; S1.2, producing the second-stage detector model training samples; S2, designing the network structure model: S2.1, the first-level network structure; S2.2, the second-level network structure; S3, training and using the two-stage model: the first-stage detector model is trained with the positive and negative samples of the first-stage training set (the trained coefficients constitute the model); the first-stage detector detects faces, each detected face is cropped and scaled to a 25×25 picture and input to the second-stage detector, which decides whether it is truly a face, and the coordinates are mapped back to the original image, finally determining whether the original image contains a face and where it is located. The method reduces detection time, satisfies network quantization while improving recall and accuracy, and yields a model that meets face detection requirements.

Description

Design method for quantifiable front-end face detection
Technical Field
The invention relates to the technical field of neural networks, and in particular to a design method for quantifiable front-end face detection.
Background
Neural-network technology in the field of artificial intelligence is developing rapidly in today's society, and MTCNN is among the more popular technologies of recent years. MTCNN, the Multi-Task Convolutional Neural Network, performs face region detection and face keypoint detection together, and can generally be divided into three network stages: P-Net, R-Net, and O-Net. The model uses three cascaded networks and adopts the idea of candidate boxes plus classifiers to perform fast and efficient face detection. The three cascaded networks are P-Net, which quickly generates candidate windows; R-Net, which performs high-precision filtering and selection of the candidate windows; and O-Net, which generates the final bounding boxes and face keypoints.
However, MTCNN cascade detection suffers from the following drawbacks:
1. It uses three-stage face detection; the third stage consumes a large amount of time, and the three-stage cascade lowers the recall rate.
2. The network structure cannot meet the quantization requirements of front-end chips: it uses the max-pooling function, which is forbidden in quantization, and the number of feature maps used in each layer is not a multiple of 16, so quantization cannot be realized.
3. MTCNN training samples are very noisy, so a model trained on them cannot meet practical requirements.
In addition, the prior art involves the following general technical terms:
1. Cascading: the manner in which several detectors detect in series is called a cascade.
2. IoU: the ratio of the area of the intersection of two regions to the area of their union (a minimal computation is sketched after this list).
3. Quantization: converting floating point to fixed point, e.g. to an 8-bit, 4-bit, or 2-bit representation, is called quantization.
4. Recall rate: the ratio of the number of correctly detected faces to the total number of labelled faces.
5. Accuracy rate: the ratio of the number of correct detections to the total number of detection results.
6. Model: all the coefficients of a function trained from the samples; these coefficients are called the model.
7. Detector: a function used for detection, whose main component is a model.
8. Face detection: the process of detecting whether a face exists in a video or picture using a face detector is called face detection.
9. Convolution kernel: a matrix used in image processing, whose entries are the parameters of the operation performed with the original image. A convolution kernel is typically a small matrix (for example 3×3) with a weight in each cell. Common kernel shapes are 1×1, 3×3, 5×5, 7×7, 1×3, 3×1, 2×2, 1×5, 5×1, and so on.
10. Convolution: the center of the convolution kernel is placed over the pixel to be calculated, the product of each kernel element and the image pixel it covers is computed, and the products are summed to give the new pixel value for that location; this process is called convolution.
11. Front-end face detection: face detection used on a chip is called front-end face detection; its speed and accuracy are lower than those of a cloud server.
12. Feature map: the result obtained by convolution over the input data is called a feature map, and the result produced by a fully connected layer is also called a feature map. A feature map size is generally expressed as length × width × depth, or 1 × depth.
13. Step size (stride): the distance the center of the convolution kernel moves in the coordinates at each step.
14. Two-end non-alignment: when an image or data is processed by a convolution kernel of size 3 with stride 2, the data at the two borders may be insufficient; discarding the surplus data on both sides or one side is called two-end non-alignment (see the output-size helper in the sketch below).
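As a concrete reference for definitions 2 and 14 above, the following minimal Python sketch computes the IoU of two boxes and the spatial output size of a convolution under two-end non-alignment (no padding, surplus border data discarded); the helper names are illustrative only, not part of the invention:

    def iou(box_a, box_b):
        """IoU of two boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union else 0.0

    def conv_out(size, kernel, stride):
        """Spatial output size under two-end non-alignment (no padding)."""
        return (size - kernel) // stride + 1

    # The first-level network's spatial sizes below follow directly:
    # 17 -> 15 (k=3, s=1) -> 5 (k=3, s=3) -> 3 (k=3, s=1) -> 1 (k=3, s=1)
    assert conv_out(17, 3, 1) == 15 and conv_out(15, 3, 3) == 5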
Disclosure of Invention
To solve the above problems, the objects of the present invention are to reduce detection time; to design a network that meets the quantization requirements while improving recall and accuracy; and to train a model that meets the face detection requirements of face recognition.
Specifically, the invention provides a design method for quantifiable front-end face detection, which comprises the following steps:
S1, adopting a cascade of two detectors and producing the training samples for the two-stage model:
S1.1, producing the first-stage detector model training samples:
A) Face occlusion must not exceed 30%, and heavily blurred or undersized faces do not qualify; each face is expanded outward by 0.5 times its width to the left and right and 0.5 times its height upward and downward according to the labelled box, the expanded region is cropped from the original image, and the label is remapped onto the new crop;
B) The cropped face images are manually screened and inspected; no face occluded by more than 30% is allowed inside them. The resulting 600,000 faces serve as the first part of the primary training set for face training;
C) Random crops are then taken: when the IoU of a crop and the labelled box exceeds 0.5, the crop is kept as a positive sample; two qualifying positives are cropped from each image and scaled to the specified size. Negative samples are cropped from the existing training set: if the IoU of the crop with every box on the image is below 0.25, the crop qualifies as a negative and is then scaled, the resulting picture being a negative sample; 3,000,000 negatives are generated by this method. A further 1,500,000 negatives are generated by randomly cropping from a face-free picture set and scaling to the specified size;
D) The positive and negative samples of the first-stage detector model training set are thus produced;
S1.2, producing the second-stage detector model training samples:
a) Positive sample extraction: positives are extracted from the first part of the primary training set of 600,000 faces. Pictures are randomly cropped around each label; when the IoU of the crop and the box exceeds 0.5, the crop is scaled to a 25×25 picture and kept as a positive sample. The number of positives is held at 100,000;
b) Negative sample extraction: the first-stage detector is run on a large number of face-free pictures, and the false faces it detects are cropped and scaled to 25×25 pictures as one part of the negative training set; the first-stage detector is also run on a common labelled face training set, and detections whose IoU with the labelled faces is below 0.2 are cropped and scaled to 25×25 pictures as another part of the negative training set;
S2, designing the network structure model:
S2.1, the first-level network structure;
S2.2, the second-level network structure;
S3, training and using the two-stage model: the first-stage detector model is trained with the positive and negative samples of the first-stage training set; the first-stage detector detects faces, each detected face is cropped and scaled to a 25×25 picture and input to the second-stage detector, which decides whether it is truly a face, and the coordinates are mapped back to the original image; finally it is determined whether the original image contains a face and where the face is located.
The requirement in step S1.1 A) that face occlusion not exceed 30% includes the case of a face truncated at the picture boundary by no more than 30%; occlusion includes masks, veils, and caps, but does not include sunglasses; and faces do not include camouflage-painted faces, clown faces, or faces in particularly dim conditions.
Extracting positive samples in step S1.2 a) further includes: running the first-stage detector on the 600,000 faces, cropping out detections whose IoU with a labelled face exceeds 0.5, and scaling them to 25×25 as another part of the positive training set.
The first-level network structure in step S2.1 specifically comprises:
the first layer takes a 17×17×3 input picture and outputs a feature map of depth 16; the convolution kernel is 3×3 with stride 1, computed with two-end non-alignment, giving output feature map (1): 15×15×16;
the second layer takes 15×15×16 input data and outputs a feature map of depth 16; the convolution kernel is 3×3 with stride 3, computed with two-end non-alignment, giving output feature map (2): 5×5×16;
the third layer takes 5×5×16 input data and outputs a feature map of depth 16; the convolution kernel is 3×3 with stride 1, computed with two-end non-alignment, giving output feature map (3): 3×3×16;
the fourth layer takes 3×3×16 input data and outputs 32 feature maps; the convolution kernel is 3×3 with stride 1, computed with two-end non-alignment, giving output feature map (4): 1×1×32;
the fifth layer takes feature map (4), 1×1×32, output by the fourth layer; the output feature map depth is 1, the convolution kernel is 1×1 with stride 1, and the output feature map is 1×1×1;
the sixth layer takes feature map (4), 1×1×32, output by the fourth layer; the output feature map depth is 4, the convolution kernel size is 1×1 with stride 1, and the output feature map is 1×1×4.
The second-level network structure in step S2.2 specifically comprises:
the first layer takes a 25×25×3 input picture and outputs a feature map of depth 32; the convolution kernel is 3×3 with stride 1, computed with two-end non-alignment, giving output feature map (1): 23×23×32;
the second layer takes feature map (1), 23×23×32; the output feature map depth is 32, the convolution kernel is 3×3 with stride 2, computed with two-end non-alignment, giving output feature map (2): 11×11×32;
the third layer takes feature map (2), 11×11×32; the output feature map depth is 48, the convolution kernel is 3×3 with stride 2, computed with two-end non-alignment, giving output feature map (3): 5×5×48;
the fourth layer takes feature map (3), 5×5×48; the output feature map depth is 64, the convolution kernel is 3×3 with stride 2, computed with two-end non-alignment, giving output feature map (4): 2×2×64;
the fifth layer flattens the data of feature map (4), 2×2×64, into a 256-dimensional vector;
the sixth layer comprises two fully connected branches, connecting the 256-dimensional vector to the judgment of whether a face exists and to the relative coordinates of the face box, respectively.
All data in the first layer of the second-level network structure is used effectively; padding the input would only introduce invalid data.
The network is a quantifiable network: it may use only 3×3 convolution kernels and 1×1 convolution kernels, the depth of each layer must be a multiple of 16, no other convolution kernels may be used, and no pooling may be used.
Thus, the present application has the following advantages: the first detector of the cascade performs coarse detection, with a recall above 30% at an accuracy of 99%; the detected face boxes are cropped, scaled to the size required by the second stage, and passed to the second-stage detector for detection. After the two-stage detection, at a recall of 98% the accuracy reaches 97%. Because one stage is removed, considerable detection time is saved: after quantization, detection takes 0.25 times the current detection time, with recall and accuracy unchanged.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and, together with the description, serve to explain the application.
Fig. 1 is a schematic flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of the structure of the first-level network used by the method of the invention.
Fig. 3 is a schematic diagram of the structure of the second-level network used by the method of the invention.
Fig. 4 is a schematic diagram of a picture containing two faces, each enclosed by its minimum circumscribed rectangle.
Fig. 5 is a schematic illustration of one of the two face images cropped from Fig. 4.
Fig. 6 is a schematic illustration of the other of the two face images cropped from Fig. 4.
Detailed Description
In order that the technical content and advantages of the present invention may be more clearly understood, a further detailed description of the present invention will now be made with reference to the accompanying drawings.
As shown in Fig. 1, the present invention relates to a design method for quantifiable front-end face detection, which comprises the following steps:
S1, adopting a cascade of two detectors and producing the training samples for the two-stage model:
S1.1, producing the first-stage detector model training samples:
A) Face occlusion must not exceed 30%, and heavily blurred or undersized faces do not qualify; each face is expanded outward by 0.5 times its width to the left and right and 0.5 times its height upward and downward according to the labelled box, the expanded region is cropped from the original image, and the label is remapped onto the new crop. For example, Fig. 4 shows a picture with two faces, each enclosed by its minimum circumscribed rectangle; the two face images cropped from the original picture (Fig. 4) are shown in Fig. 5 and Fig. 6. The upper-left and lower-right coordinates of the two circumscribed rectangles in the original picture are [(35,104), (147,235)] and [(220,89), (325,221)], and the remapped labels of the two cropped face images have upper-left and lower-right coordinates [(35,65), (147,196)] and [(52,66), (157,198)] (a code sketch of this expansion and remapping follows step D) below);
B) The cropped face images are manually screened and inspected; no face occluded by more than 30% is allowed inside them. The resulting 600,000 faces serve as the first part of the primary training set for face training;
C) Random crops are then taken: when the IoU of a crop and the labelled box exceeds 0.5, the crop is kept as a positive sample; two qualifying positives are cropped from each image and scaled to the specified size. Negative samples are cropped from the existing training set: if the IoU of the crop with every box on the image is below 0.25, the crop qualifies as a negative and is then scaled, the resulting picture being a negative sample; 3,000,000 negatives are generated by this method. A further 1,500,000 negatives are generated by randomly cropping from a face-free picture set and scaling to the specified size;
D) The positive and negative samples of the first-stage detector model training set are thus produced;
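The 0.5-times outward expansion and label remapping of step A) can be illustrated with the following minimal Python sketch. The helper name and the clipping behaviour at the picture border are assumptions consistent with the worked example of Figs. 4 to 6, not a literal part of the invention:

    def expand_and_crop(img_w, img_h, box, factor=0.5):
        """Expand a labelled face box outward by `factor` of its width to the
        left and right and of its height up and down, clip to the picture,
        and return the crop region plus the label remapped into the crop."""
        x1, y1, x2, y2 = box
        w, h = x2 - x1, y2 - y1
        cx1 = max(0, round(x1 - factor * w))
        cy1 = max(0, round(y1 - factor * h))
        cx2 = min(img_w, round(x2 + factor * w))
        cy2 = min(img_h, round(y2 + factor * h))
        # remap the label into the cropped image's coordinate frame
        new_box = (x1 - cx1, y1 - cy1, x2 - cx1, y2 - cy1)
        return (cx1, cy1, cx2, cy2), new_box

    # First face of Fig. 4: box (35, 104, 147, 235) remaps to approximately
    # (35, 65, 147, 196), matching the new label given above (rounding at
    # the border may differ by one pixel).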
S1.2, producing the second-stage detector model training samples:
a) Positive sample extraction: positives are extracted from the first part of the primary training set of 600,000 faces. Pictures are randomly cropped around each label; when the IoU of the crop and the box exceeds 0.5, the crop, which is generally larger than 25×25, is scaled to a picture 25 long and 25 wide, the size specified for network training, and kept as a positive sample. The number of positives is held at 100,000;
b) Negative sample extraction: the first-stage detector is run on a large number of face-free pictures, and the false faces it detects are cropped and scaled to 25×25 pictures as one part of the negative training set; the first-stage detector is also run on a common labelled face training set, and detections whose IoU with the labelled faces is below 0.2 are cropped and scaled to 25×25 pictures as another part of the negative training set;
S2, designing the network structure model:
S2.1, the first-level network structure;
S2.2, the second-level network structure;
S3, training and using the two-stage model: the first-stage detector model is trained with the positive and negative samples of the first-stage training set; the first-stage detector (whose trained coefficients constitute the model) detects faces, each detected face is cropped and scaled to a 25×25 picture and input to the second-stage detector, which decides whether it is truly a face, and the coordinates are mapped back to the original image; finally it is determined whether the original image contains a face and where the face is located.
The technical scheme of the invention can be further described as follows:
1. A cascade of two detectors is adopted. The first detector of the cascade performs coarse detection, with a recall above 30% at an accuracy of 99%; the detected face boxes are cropped, scaled to the size required by the second stage, and passed to the second-stage detector for detection. After the two-stage detection, at a recall of 98% the accuracy reaches 97%. Because one stage is removed, considerable detection time is saved: after quantization, detection takes 0.25 times the current detection time, with recall and accuracy unchanged.
2. Training of the model.
The quantifiable network may use only 3×3 convolution kernels and 1×1 convolution kernels, the depth of each layer must be a multiple of 16, no other convolution kernels may be used, and no other operations such as pooling may be used.
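Purely to illustrate the quantization defined in the terminology above (floating point converted to fixed point), a symmetric 8-bit scheme might look as follows; the front-end chip's actual quantization scheme is not specified by the patent, so this sketch is an assumption:

    import numpy as np

    def quantize_int8(weights):
        """Map float weights to int8 values plus a scale (symmetric scheme)."""
        scale = float(np.abs(weights).max()) / 127.0
        if scale == 0.0:
            scale = 1.0  # all-zero weights
        q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        """Recover approximate float weights from the int8 values."""
        return q.astype(np.float32) * scale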
Production of the first-stage detector model training samples.
A) Face occlusion must not exceed 30%, and heavily blurred or undersized faces do not qualify. To meet these requirements the samples are specially processed: each face is expanded outward by 0.5 times its width to the left and right and 0.5 times its height upward and downward according to the labelled box, the expanded region is cropped from the original image, and the label is remapped onto the new crop. For example, a picture with two faces is shown in Fig. 4, each face enclosed by its minimum circumscribed rectangle; the two face images cropped from the original picture (Fig. 4) are shown in Fig. 5 and Fig. 6. The upper-left and lower-right coordinates of the two circumscribed rectangles in the original picture are [(35,104), (147,235)] and [(220,89), (325,221)], and the remapped labels of the two cropped face images have upper-left and lower-right coordinates [(35,65), (147,196)] and [(52,66), (157,198)].
B) The cropped face images are screened manually; no face occluded by more than 30% is allowed inside them, where occlusion includes masks, veils, caps, and other objects; a face truncated at the picture boundary may be missing at most 30%; camouflage-painted faces and clown faces do not count as human faces, and such disqualified face pictures cannot be used as negative samples either; a particularly dim face is not considered a face; sunglasses are allowed. The resulting 600,000 faces serve as the first part of the primary training set for face training.
C) Random crops are then taken: when the IoU of a crop and the labelled box exceeds 0.5, the crop is kept as a positive sample; two qualifying positives are taken from each image and scaled to the specified size. Negative samples are cropped from the existing training set: if the IoU of the crop with every box on the image is below 0.25, the crop qualifies as a negative and is then scaled, the resulting picture being a negative sample; 3,000,000 negatives are generated by this method. A further 1,500,000 negatives are generated by randomly cropping from a face-free picture set and scaling to the specified size. This completes the sample preparation, and the first-stage detector model is trained with these positive and negative samples (a sketch of the random-crop sampling follows below).
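The random-crop sampling of step C) can be sketched as follows, reusing the iou helper from the terminology section. The crop-size distribution and retry count are assumptions for illustration; the patent fixes only the IoU thresholds (above 0.5 for positives, below 0.25 for negatives) and the two positives per image:

    import random

    def sample_positive_crops(img_w, img_h, face_box, want=2, tries=500):
        """Take random square crops around a labelled face and keep those
        whose IoU with the label exceeds 0.5 as positive samples."""
        x1, y1, x2, y2 = face_box
        side_ref = max(x2 - x1, y2 - y1)
        kept = []
        for _ in range(tries):
            if len(kept) == want:
                break
            side = int(random.uniform(0.8, 1.25) * side_ref)
            lo_x, hi_x = max(0, x1 - side), min(img_w - side, x2)
            lo_y, hi_y = max(0, y1 - side), min(img_h - side, y2)
            if lo_x > hi_x or lo_y > hi_y:
                continue
            cx = random.randint(lo_x, hi_x)
            cy = random.randint(lo_y, hi_y)
            crop = (cx, cy, cx + side, cy + side)
            if iou(crop, face_box) > 0.5:
                kept.append(crop)  # each kept crop is later scaled to the specified size
        return kept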
As shown in Fig. 2, the first-level network structure (sketched in code below):
The first layer takes a 17×17×3 input picture and outputs a feature map of depth 16; the convolution kernel is 3×3 with stride 1, computed with two-end non-alignment, giving output feature map (1): 15×15×16.
The second layer takes 15×15×16 input data and outputs a feature map of depth 16; the convolution kernel is 3×3 with stride 3, computed with two-end non-alignment, giving output feature map (2): 5×5×16.
The third layer takes 5×5×16 input data and outputs a feature map of depth 16; the convolution kernel is 3×3 with stride 1, computed with two-end non-alignment, giving output feature map (3): 3×3×16.
The fourth layer takes 3×3×16 input data and outputs 32 feature maps; the convolution kernel is 3×3 with stride 1, computed with two-end non-alignment, giving output feature map (4): 1×1×32.
The fifth layer takes feature map (4), 1×1×32, output by the fourth layer; the output feature map depth is 1, the convolution kernel is 1×1 with stride 1, and the output feature map is 1×1×1.
The sixth layer takes feature map (4), 1×1×32, output by the fourth layer; the output feature map depth is 4, the convolution kernel size is 1×1 with stride 1, and the output feature map is 1×1×4.
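Assuming a PyTorch-style implementation (the patent names no framework, and the activation function is unspecified, so the ReLU below is an assumption), the first-level structure of Fig. 2 could be sketched as:

    import torch.nn as nn

    class FirstStageNet(nn.Module):
        """Sketch of the first-level (coarse) detector network of S2.1:
        only 3x3 and 1x1 kernels, depths in multiples of 16, no pooling,
        no padding (two-end non-alignment)."""
        def __init__(self):
            super().__init__()
            self.conv1 = nn.Conv2d(3, 16, 3, stride=1)   # 17x17x3 -> 15x15x16
            self.conv2 = nn.Conv2d(16, 16, 3, stride=3)  # -> 5x5x16
            self.conv3 = nn.Conv2d(16, 16, 3, stride=1)  # -> 3x3x16
            self.conv4 = nn.Conv2d(16, 32, 3, stride=1)  # -> 1x1x32
            self.cls = nn.Conv2d(32, 1, 1)               # face score, 1x1x1
            self.box = nn.Conv2d(32, 4, 1)               # box coordinates, 1x1x4
            self.act = nn.ReLU()

        def forward(self, x):
            x = self.act(self.conv1(x))
            x = self.act(self.conv2(x))
            x = self.act(self.conv3(x))
            x = self.act(self.conv4(x))
            return self.cls(x), self.box(x)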
Production of the second-stage detector model training samples.
A) Positive sample extraction. Positives are taken from the first part of the primary training set of 600,000 faces. The first-stage detector is run on the 600,000 faces; detections whose IoU with a labelled face exceeds 0.5 are cropped out and, being generally larger than 25×25, scaled to a picture 25 long and 25 wide, the size specified for network training, as one part of the positive training set.
B) Pictures are randomly cropped around each label in the 600,000 faces; when the IoU of the crop and the box exceeds 0.5, the crop is likewise scaled to a 25×25 picture and kept as a positive sample. The number of positives is held at 100,000.
C) The first-stage detector is run on a large number of face-free pictures to extract negative samples: the false faces it detects are cropped and scaled to 25×25 as one part of the negative training set. The first-stage detector is also run on a common labelled face training set; detections whose IoU with the labelled faces is below 0.2 are cropped and scaled to 25×25 as another part of the negative training set. The training set is then ready.
As shown in Fig. 3, the second-level network structure (sketched in code below):
The first layer takes a 25×25×3 input picture and outputs a feature map of depth 32; the convolution kernel is 3×3 with stride 1, computed with two-end non-alignment, giving output feature map (1): 23×23×32. All data is used effectively; padding the input would only introduce invalid data.
The second layer takes feature map (1), 23×23×32; the output feature map depth is 32, the convolution kernel is 3×3 with stride 2, computed with two-end non-alignment, giving output feature map (2): 11×11×32.
The third layer takes feature map (2), 11×11×32; the output feature map depth is 48, the convolution kernel is 3×3 with stride 2, computed with two-end non-alignment, giving output feature map (3): 5×5×48.
The fourth layer takes feature map (3), 5×5×48; the output feature map depth is 64, the convolution kernel is 3×3 with stride 2, computed with two-end non-alignment, giving output feature map (4): 2×2×64.
The fifth layer flattens the data of feature map (4), 2×2×64, into a 256-dimensional vector.
The sixth layer comprises two fully connected branches, connecting the 256-dimensional vector to the face/non-face judgment and to the relative coordinates of the face box, respectively.
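Under the same assumptions (PyTorch, ReLU), the second-level structure of Fig. 3 could be sketched as follows; note that 2×2×64 flattens to exactly the 256-dimensional vector fed to the two fully connected branches, and the two-way classification head is an assumption (a single face score would also fit the description):

    import torch.nn as nn

    class SecondStageNet(nn.Module):
        """Sketch of the second-level (refinement) detector network of S2.2."""
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=1), nn.ReLU(),   # 25x25x3 -> 23x23x32
                nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),  # -> 11x11x32
                nn.Conv2d(32, 48, 3, stride=2), nn.ReLU(),  # -> 5x5x48
                nn.Conv2d(48, 64, 3, stride=2), nn.ReLU(),  # -> 2x2x64
            )
            self.cls = nn.Linear(256, 2)  # face / non-face judgment
            self.box = nn.Linear(256, 4)  # relative face-box coordinates

        def forward(self, x):
            x = self.features(x).flatten(1)  # 2x2x64 = 256-dimensional vector
            return self.cls(x), self.box(x)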
3. Use of the first- and second-level models.
The first-stage detector (whose trained coefficients constitute the model) is used to detect faces; each detected face is cropped, scaled to 25×25, and input to the second-stage detector, which decides whether it is truly a face; the coordinates are then mapped back to the original image, finally determining whether the original image contains a face and where the face is located.
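A minimal sketch of the second half of this pipeline, assuming the first stage has already produced candidate boxes in original-image coordinates and a SecondStageNet like the one sketched above; the 0.5 score threshold and the softmax over two classes are assumptions:

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def refine_candidates(image, candidates, second_net, score_thresh=0.5):
        """image: 3xHxW float tensor; candidates: list of (x1, y1, x2, y2).
        Crop each candidate, scale it to the 25x25 second-stage input, and
        keep the boxes the second stage confirms as faces."""
        kept = []
        for (x1, y1, x2, y2) in candidates:
            crop = image[:, y1:y2, x1:x2].unsqueeze(0)
            crop = F.interpolate(crop, size=(25, 25), mode="bilinear",
                                 align_corners=False)
            logits, _ = second_net(crop)
            if logits.softmax(-1)[0, 1] > score_thresh:  # index 1 = face (assumption)
                kept.append((x1, y1, x2, y2))            # already in original coordinates
        return kept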
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, and various modifications and variations can be made to the embodiments of the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. A design method for quantifiable front-end face detection, the method comprising the steps of:
S1, adopting a cascade of two detectors and producing training samples for the two-stage model:
S1.1, producing the first-stage detector model training samples:
A) face occlusion must not exceed 30%, and heavily blurred or undersized faces do not qualify; each face is expanded outward by 0.5 times its width to the left and right and 0.5 times its height upward and downward according to the labelled box, the expanded region is cropped from the original image, and the label is remapped onto the new crop;
B) the cropped face images are manually screened and inspected, no face occluded by more than 30% being allowed inside them; the resulting 600,000 faces serve as the first part of the primary training set for face training;
C) random crops are then taken: when the IoU of a crop and the labelled box exceeds 0.5, the crop is kept as a positive sample, two qualifying positives being cropped from each image and scaled to the specified size; negative samples are cropped from the existing training set, a crop whose IoU with every box on the image is below 0.25 qualifying as a negative and being scaled, 3,000,000 negatives being generated by this method; a further 1,500,000 negatives are generated by randomly cropping from a face-free picture set and scaling to the specified size;
D) the positive and negative samples of the first-stage detector model training set are thus produced;
S1.2, producing the second-stage detector model training samples:
a) positive sample extraction: positives are extracted from the first part of the primary training set of 600,000 faces; pictures are randomly cropped around each label, and when the IoU of the crop and the box exceeds 0.5, the crop is scaled to a 25×25 picture, namely a picture 25 long and 25 wide, the size specified for network training, and kept as a positive sample, the number of positives being held at 100,000;
b) negative sample extraction: the first-stage detector is run on a large number of face-free pictures, and the false faces it detects are cropped and scaled to 25×25 pictures as one part of the negative training set; the first-stage detector is also run on a common labelled face training set, and detections whose IoU with the labelled faces is below 0.2 are cropped and scaled to 25×25 pictures as another part of the negative training set;
S2, designing the network structure model:
S2.1, the first-level network structure;
the first-level network structure in step S2.1 specifically comprises:
the first layer takes a 17×17×3 input picture and outputs a feature map of depth 16; the convolution kernel is 3×3 with stride 1, computed with two-end non-alignment, giving output feature map 1: 15×15×16;
the second layer takes 15×15×16 input data and outputs a feature map of depth 16; the convolution kernel is 3×3 with stride 3, computed with two-end non-alignment, giving output feature map 2: 5×5×16;
the third layer takes 5×5×16 input data and outputs a feature map of depth 16; the convolution kernel is 3×3 with stride 1, computed with two-end non-alignment, giving output feature map 3: 3×3×16;
the fourth layer takes 3×3×16 input data and outputs 32 feature maps; the convolution kernel is 3×3 with stride 1, computed with two-end non-alignment, giving output feature map 4: 1×1×32;
the fifth layer takes feature map 4 (1×1×32) output by the fourth layer; the output feature map depth is 1, the convolution kernel is 1×1 with stride 1, and the output feature map is 1×1×1;
the sixth layer takes feature map 4 (1×1×32) output by the fourth layer; the output feature map depth is 4, the convolution kernel size is 1×1 with stride 1, and the output feature map is 1×1×4;
S2.2, the second-level network structure;
the second-level network structure in step S2.2 specifically comprises:
the first layer takes a 25×25×3 input picture and outputs a feature map of depth 32; the convolution kernel is 3×3 with stride 1, computed with two-end non-alignment, giving output feature map 1: 23×23×32;
the second layer takes feature map 1 (23×23×32); the output feature map depth is 32, the convolution kernel is 3×3 with stride 2, computed with two-end non-alignment, giving output feature map 2: 11×11×32;
the third layer takes feature map 2 (11×11×32); the output feature map depth is 48, the convolution kernel is 3×3 with stride 2, computed with two-end non-alignment, giving output feature map 3: 5×5×48;
the fourth layer takes feature map 3 (5×5×48); the output feature map depth is 64, the convolution kernel is 3×3 with stride 2, computed with two-end non-alignment, giving output feature map 4: 2×2×64;
the fifth layer flattens the data of feature map 4 (2×2×64) into a 256-dimensional vector;
the sixth layer comprises two fully connected branches, connecting the 256-dimensional vector to the class judgment of whether a face exists and to the relative coordinates of the face box, respectively;
S3, training and using the two-stage model: the first-stage detector model is trained with the positive and negative samples of the first-stage detector training set, the trained coefficients constituting the model; the first-stage detector detects faces, each detected face is cropped and scaled to a 25×25 picture and input to the second-stage detector, which decides whether it is truly a face, and the coordinates are mapped back to the original image; finally it is determined whether the original image contains a face and where the face is located.
2. The design method for quantifiable front-end face detection according to claim 1, wherein in step S1.1 A) the requirement that face occlusion not exceed 30% includes the case of a face truncated at the picture boundary by no more than 30%; occlusion includes masks, veils, and caps, but does not include sunglasses; and faces do not include camouflage-painted faces, clown faces, or faces in particularly dim conditions.
3. The design method for quantifiable front-end face detection according to claim 1, wherein extracting positive samples in step S1.2 a) further comprises: running the first-stage detector on the 600,000 faces, cropping out detections whose IoU with a labelled face exceeds 0.5, and scaling them to 25×25 as another part of the positive training set.
4. The design method for quantifiable front-end face detection according to claim 1, wherein all data in the first layer of the second-level network structure is used effectively, and padding the input would only introduce invalid data.
5. The design method for quantifiable front-end face detection according to claim 1, wherein the network is a quantifiable network that may use only 3×3 convolution kernels and 1×1 convolution kernels, the depth of each layer must be a multiple of 16, no other convolution kernels may be used, and no pooling may be used.
CN202010736641.9A 2020-07-28 2020-07-28 Design method for quantifiable front-end face detection Active CN114005150B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010736641.9A CN114005150B (en) 2020-07-28 2020-07-28 Design method for quantifiable front-end face detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010736641.9A CN114005150B (en) 2020-07-28 2020-07-28 Design method for quantifiable front-end face detection

Publications (2)

Publication Number Publication Date
CN114005150A CN114005150A (en) 2022-02-01
CN114005150B true CN114005150B (en) 2024-05-03

Family

ID=79920338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010736641.9A Active CN114005150B (en) 2020-07-28 2020-07-28 Design method for quantifiable front-end face detection

Country Status (1)

Country Link
CN (1) CN114005150B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108898112A (en) * 2018-07-03 2018-11-27 东北大学 A kind of near-infrared human face in-vivo detection method and system
CN109657548A (en) * 2018-11-13 2019-04-19 深圳神目信息技术有限公司 A kind of method for detecting human face and system based on deep learning
WO2019169895A1 (en) * 2018-03-09 2019-09-12 华南理工大学 Fast side-face interference resistant face detection method
WO2020001082A1 (en) * 2018-06-30 2020-01-02 东南大学 Face attribute analysis method based on transfer learning
CN110717481A (en) * 2019-12-12 2020-01-21 浙江鹏信信息科技股份有限公司 Method for realizing face detection by using cascaded convolutional neural network

Also Published As

Publication number Publication date
CN114005150A (en) 2022-02-01


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant