CN114005150A - Design method of quantifiable front-end face detection

Info

Publication number: CN114005150A
Application number: CN202010736641.9A
Authority: CN (China)
Prior art keywords: face, feature map, output, layer
Legal status: Granted; currently active
Other languages: Chinese (zh)
Other versions: CN114005150B
Inventors: 田凤彬, 于晓静
Current Assignee: Beijing Ingenic Semiconductor Co Ltd
Original Assignee: Beijing Ingenic Semiconductor Co Ltd
Priority and filing date: 2020-07-28 (application filed by Beijing Ingenic Semiconductor Co Ltd)
Publication date: 2022-02-01 (CN114005150A); granted 2024-05-03 (CN114005150B)

Classifications

    • G06N3/045 Combinations of networks (G Physics; G06 Computing; G06N Computing arrangements based on specific computational models; G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks; G06N3/04 Architecture, e.g. interconnection topology)
    • G06N3/08 Learning methods (G Physics; G06 Computing; G06N Computing arrangements based on specific computational models; G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks)


Abstract

The invention provides a design method for quantifiable front-end face detection, which comprises the following steps: S1, divide the detector into two cascaded detectors and prepare the training samples for the two-stage model: S1.1, prepare the first-stage detector model training samples; S1.2, prepare the second-stage detector model training samples; S2, design the network structure model: S2.1, the first-level network structure; S2.2, the second-level network structure; S3, train the two-stage model: train the first-stage detector model with the positive and negative samples of the first-stage training set (the trained coefficients constitute the model), detect faces with the first-stage detector, crop each detected face and scale it to a 25 × 25 picture, input it to the second-stage detector to decide whether it is a face, and map the coordinates back to the original picture; finally determine whether the original image contains a face and the position of the face. The method reduces detection time; the network meets the quantization requirement, and the recall rate and accuracy are improved; the model meets the requirements of face detection.

Description

Design method of quantifiable front-end face detection
Technical Field
The invention relates to the technical field of neural networks, in particular to a design method for quantifiable front-end face detection.
Background
In today's society, neural network technology in the field of artificial intelligence is developing rapidly. The MTCNN technique is one of the more popular technologies of recent years. MTCNN, the Multi-task Convolutional Neural Network, combines face region detection with face keypoint detection and can generally be divided into three network stages: P-Net, R-Net and O-Net. This multi-task neural network model for the face detection task mainly uses three cascaded networks and the idea of candidate boxes plus classifiers to perform fast and efficient face detection. The three cascaded networks are P-Net, which quickly generates candidate windows, R-Net, which filters and selects high-precision candidate windows, and O-Net, which generates the final bounding boxes and face keypoints.
However, MTCNN cascade detection has the following drawbacks:
1. It uses three-level face detection; the third level is very time-consuming, and the three-level cascade lowers the recall rate.
2. The network structure cannot meet the quantization requirements of front-end chips: it uses the max-pooling function, which is forbidden in quantization, and the number of feature maps in each layer is not a multiple of 16, so the network cannot be quantized.
3. The MTCNN training samples are very noisy, so the trained models cannot meet practical requirements.
In addition, the following commonly used technical terms appear in the prior art:
1. Cascade: an arrangement in which several detectors detect in series is called a cascade.
2. IoU: the ratio of the area of the intersection of two regions to the area of their union (see the sketch after this list).
3. Quantization: the conversion of floating point to fixed point, or to 8-bit, 4-bit, or 2-bit values, is called quantization.
4. Recall rate: the ratio of the number of correctly detected faces to the total number of labeled faces.
5. Accuracy: the ratio of the number of correct detection results to the total number of detection results.
6. Model: all the coefficients of a function trained from the samples are collectively called the model.
7. Detector: a function used for detection, whose main component is a model.
8. Face detection: the process of using a face detector to detect whether a face exists in a video or picture is called face detection.
9. Convolution kernel: a convolution kernel is the parameter used to perform an operation on a matrix and the original image during image processing. It is typically a small matrix (e.g., a 3 × 3 matrix) with a weight value for each cell of the region. Typical shapes are 1 × 1, 3 × 3, 5 × 5, 7 × 7, 1 × 3, 3 × 1, 2 × 2, 1 × 5, 5 × 1, and so on.
10. Convolution: the center of the convolution kernel is placed on the pixel to be calculated; the product of each kernel element and the image pixel it covers is computed and the products are summed; the result is the new pixel value at that location. This process is called convolution.
11. Front-end face detection: face detection that runs on a chip is called front-end face detection; its speed and accuracy are lower than those of face detection on a cloud server.
12. Feature map: the result of the convolution calculation on the input data is called a feature map, and the result of a fully connected layer is also called a feature map. The feature map size is typically expressed as length × width × depth, or 1 × depth.
13. Stride: the distance the center of the convolution kernel shifts in the coordinates at each step.
14. Non-aligned processing of the two ends: when an image or data is processed with a convolution kernel of size 3 and stride 2, the data at the two edges may be insufficient to fill a window; discarding the data on both sides or on one side is called non-aligned processing of the two ends.
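To make terms 2 and 14 concrete, the following minimal Python sketch (illustrative only; the patent itself contains no code) computes the IoU of two boxes and the output length of a convolution that discards non-aligned data at the ends:

def iou(box_a, box_b):
    # IoU (term 2): intersection area over union area; boxes are (x1, y1, x2, y2)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def valid_conv_out(size, kernel, stride):
    # Term 14: border data that cannot fill a window is discarded
    return (size - kernel) // stride + 1

# Examples taken from the network descriptions below:
assert valid_conv_out(15, 3, 3) == 5   # first-level second layer, 15 -> 5
assert valid_conv_out(11, 3, 2) == 5   # second-level third layer, 11 -> 5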
Disclosure of Invention
In order to solve the above problems, the present invention aims to provide a design method for quantifiable front-end face detection that reduces detection time, satisfies the network quantization requirement while improving the recall rate and accuracy, and yields a trained model that meets the requirements of face detection in face recognition.
Specifically, the invention provides a design method for quantifiable front-end face detection, which comprises the following steps:
S1, divide the detector into two cascaded detectors and prepare the training samples for the two-stage model:
S1.1, prepare the first-stage detector model training samples:
A) Face occlusion must not exceed 30%, and faces that are heavily blurred or too small do not meet the requirements; each face is extended by 0.5 times its size to the left, right, top and bottom according to the labeled box, then cropped out of the original image, and the label is remapped to the new crop;
B) The cropped face images are manually screened and inspected, and no face with more than 30% occlusion is allowed among them; the resulting 600,000 faces serve as the first part of the primary training set for the face training;
C) Random crops are then taken: a crop is kept as a positive sample when its IoU with the labeled box is greater than 0.5; two qualifying positive samples are cropped from each picture and scaled to the specified size. Negative samples are cropped from the existing training set: a crop whose IoU with every labeled box on the picture is less than 0.25 meets the negative-sample requirement and is then scaled; the resulting crops are negative samples, and 3 million of them are generated this way. Further crops are taken at random from the picture set containing no faces and scaled to the specified size; these are also negative samples, and 1.5 million of them are generated;
D) This completes the production of the positive and negative samples of the first-stage detector model training set;
S1.2, prepare the second-stage detector model training samples:
a) Extraction of positive samples: positive samples are extracted from the first-part primary training set of 600,000 faces. Random crops are taken near the labels of the 600,000 faces; when the IoU between a crop and the labeled box is greater than 0.5, the crop is scaled to a 25 × 25 picture and kept as a positive sample; the number of positive samples is controlled at 100,000;
b) Extraction of negative samples: a large number of pictures without faces are detected with the first-stage detector to extract negative samples; the detections mistaken for faces are picked out and scaled to 25 × 25 pictures as one part of the negative training samples. An ordinary labeled face training set is also detected with the first-stage detector, and detections whose IoU with every labeled face is less than 0.2 are extracted and scaled to 25 × 25 pictures as another part of the negative training samples;
S2, design the network structure model:
S2.1, the first-level network structure;
S2.2, the second-level network structure;
S3, train the two-stage model: train the first-stage detector model with the positive and negative samples of the first-stage training set (the trained coefficients constitute the model); detect faces with the first-stage detector; crop each detected face and scale it to a 25 × 25 picture, input it to the second-stage detector to decide whether it is a face, and map the coordinates back to the original picture; finally determine whether the original image contains a face and the position of the face.
The rule in step S1.1 A) that face occlusion must not exceed 30% includes the case that the incompleteness of a face at the image boundary must not exceed 30%; occlusion includes masks, mask-like coverings and hat coverings, but not sunglasses; faces do not include camouflage-painted faces, clown faces, or faces in extremely dim conditions.
Step S1.2 a), the extraction of positive samples, further includes: detect the 600,000 faces with the first-stage detector, crop out the detections whose IoU with a labeled face is greater than 0.5, and scale them to 25 × 25 pictures as part of the positive training samples.
The first-level network structure in step S2.1 specifically includes:
The first layer takes a 17 × 17 × 3 input picture and outputs a feature map of depth 16; the convolution kernel is 3 × 3, the stride is 1, the convolution discards non-aligned data at the two ends, and the output feature map (1) is 15 × 15 × 16;
The second layer takes the 15 × 15 × 16 data as input and outputs a feature map of depth 16; the convolution kernel is 3 × 3, the stride is 3, non-aligned ends are discarded, and the output feature map (2) is 5 × 5 × 16;
The third layer takes the 5 × 5 × 16 data as input and outputs a feature map of depth 16; the convolution kernel is 3 × 3, the stride is 1, non-aligned ends are discarded, and the output feature map (3) is 3 × 3 × 16;
The fourth layer takes the 3 × 3 × 16 data as input and outputs 32 feature maps; the convolution kernel is 3 × 3, the stride is 1, non-aligned ends are discarded, and the output feature map (4) is 1 × 1 × 32;
The fifth layer takes feature map (4) output by the fourth layer, 1 × 1 × 32, as input; the output feature map depth is 1, the convolution kernel is 1 × 1, the stride is 1, and the output feature map is 1 × 1 × 1;
The sixth layer also takes feature map (4) output by the fourth layer, 1 × 1 × 32, as input; the output feature map depth is 4, the convolution kernel size is 1 × 1, the stride is 1, and the output feature map is 1 × 1 × 4.
The second-level network structure in step S2.2 specifically includes:
The first layer takes a 25 × 25 × 3 input picture and outputs a feature map of depth 32; the convolution kernel is 3 × 3, the stride is 1, non-aligned ends are discarded, and the output feature map (1) is 23 × 23 × 32;
The second layer takes feature map (1), 23 × 23 × 32, as input; the output feature map depth is 32, the convolution kernel is 3 × 3, the stride is 2, non-aligned ends are discarded, and the output feature map (2) is 11 × 11 × 32;
The third layer takes feature map (2), 11 × 11 × 32, as input; the output feature map depth is 48, the convolution kernel is 3 × 3, the stride is 2, non-aligned ends are discarded, and the output feature map (3) is 5 × 5 × 48;
The fourth layer takes feature map (3), 5 × 5 × 48, as input; the output feature map depth is 64, the convolution kernel is 3 × 3, the stride is 2, non-aligned ends are discarded, and the output feature map (4) is 2 × 2 × 64;
The fifth layer flattens the 2 × 2 × 64 data of feature map (4) into 256 one-dimensional values;
The sixth layer comprises two fully connected branches, which connect the 256 values respectively to the face/non-face category judgment and to the relative coordinates of the face box.
All the data in the first layer of the second-level network structure is used effectively; if padding were used in the processing, invalid data would be added.
The network is a quantifiable network: only 3 × 3 and 1 × 1 convolution kernels may be used, the depth of each layer must be a multiple of 16, no other convolution kernels may be used, and pooling may not be used.
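A minimal sketch of how these constraints might be checked, assuming a PyTorch implementation (the patent does not name a framework) and treating the depth-1 and depth-4 output heads of the described structures as exceptions to the multiple-of-16 rule:

import torch.nn as nn

def check_quantizable(model: nn.Module):
    # Collect violations: only 3x3 or 1x1 kernels, depths a multiple of 16
    # (depth-1 and depth-4 output heads excepted), and no pooling layers.
    problems = []
    for name, m in model.named_modules():
        if isinstance(m, nn.Conv2d):
            if m.kernel_size not in ((3, 3), (1, 1)):
                problems.append(f"{name}: kernel {m.kernel_size} is not 3x3 or 1x1")
            if m.out_channels % 16 and m.out_channels not in (1, 4):
                problems.append(f"{name}: depth {m.out_channels} is not a multiple of 16")
        elif isinstance(m, (nn.MaxPool2d, nn.AvgPool2d)):
            problems.append(f"{name}: pooling is forbidden in a quantifiable network")
    return problems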
Thus, the present application has the following advantages: the first cascaded detector performs rough detection, reaching an accuracy above 30% at a recall rate of 99%; each detected face box is cropped out, scaled to the size required by the second-stage detector, and passed to the second-stage detector. After the two-stage detection, the accuracy reaches 97% at a recall rate of 98%. With one stage fewer, much detection time is saved and the network can be quantized; the detection time after quantization is 0.25 times the current detection time, with the recall rate and accuracy unchanged.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention.
Fig. 1 is a schematic flow diagram of the method of the present invention.
Fig. 2 is a schematic diagram of the structure of the first-level network of the method of the present invention.
Fig. 3 is a schematic diagram of the structure of the second-level network of the method of the present invention.
Fig. 4 is a schematic diagram of a picture with two faces, in which the rectangular boxes around the faces are the face bounding boxes.
Fig. 5 is a schematic diagram of the first face image extracted from Fig. 4.
Fig. 6 is a schematic diagram of the second face image extracted from Fig. 4.
Detailed Description
In order that the technical contents and advantages of the present invention can be more clearly understood, the present invention will now be described in further detail with reference to the accompanying drawings.
As shown in Fig. 1, the present invention relates to a design method for quantifiable front-end face detection, which comprises the following steps:
S1, divide the detector into two cascaded detectors and prepare the training samples for the two-stage model:
S1.1, prepare the first-stage detector model training samples:
A) Face occlusion must not exceed 30%, and faces that are heavily blurred or too small do not meet the requirements; each face is extended by 0.5 times its size to the left, right, top and bottom according to the labeled box, then cropped out of the original image, and the label is remapped to the new crop. For example, a picture contains two faces, as shown in Fig. 4, where the rectangular boxes around the faces are their minimum enclosing rectangles; the two face images extracted from the original picture (Fig. 4) are shown in Fig. 5 and Fig. 6. The top-left and bottom-right coordinates of the two faces' enclosing rectangles in the original picture are [(35,104), (147,235)] and [(220,89), (325,221)], and the remapped labels of the two extracted face images, i.e. their top-left and bottom-right coordinates, are [(35,65), (147,196)] and [(52,66), (157,198)].
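The following Python sketch reproduces this crop-and-remap step under assumptions not stated in the patent: boxes are (x1, y1, x2, y2) pixel coordinates, the extended crop is clipped at the image border, and fractional coordinates are rounded half up; both worked-example labels come out as listed above.

import math

def half_up(v):
    # Round half up; the worked example is consistent with this rounding
    return math.floor(v + 0.5)

def extend_and_remap(box, img_w, img_h):
    # Extend the labeled box by 0.5x its size on every side, clip the crop
    # at the image border, and remap the label into crop coordinates.
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx1 = max(0, half_up(x1 - 0.5 * w))
    cy1 = max(0, half_up(y1 - 0.5 * h))
    cx2 = min(img_w, half_up(x2 + 0.5 * w))
    cy2 = min(img_h, half_up(y2 + 0.5 * h))
    return (cx1, cy1, cx2, cy2), (x1 - cx1, y1 - cy1, x2 - cx1, y2 - cy1)

# The two faces of the example (the image size here is an assumption):
for box in [(35, 104, 147, 235), (220, 89, 325, 221)]:
    _, label = extend_and_remap(box, img_w=400, img_h=320)
    print(label)   # (35, 65, 147, 196) then (52, 66, 157, 198)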
B) The cropped face images are manually screened and inspected, and no face with more than 30% occlusion is allowed among them; the resulting 600,000 faces serve as the first part of the primary training set for the face training;
C) Random crops are then taken: a crop is kept as a positive sample when its IoU with the labeled box is greater than 0.5; two qualifying positive samples are cropped from each picture and scaled to the specified size. Negative samples are cropped from the existing training set: a crop whose IoU with every labeled box on the picture is less than 0.25 meets the negative-sample requirement and is then scaled; the resulting crops are negative samples, and 3 million of them are generated this way. Further crops are taken at random from the picture set containing no faces and scaled to the specified size; these are also negative samples, and 1.5 million of them are generated;
D) This completes the production of the positive and negative samples of the first-stage detector model training set;
S1.2, prepare the second-stage detector model training samples:
a) Extraction of positive samples: positive samples are extracted from the first-part primary training set of 600,000 faces. Random crops are taken near the labels of the 600,000 faces; when the IoU between a crop and the labeled box is greater than 0.5, the crop is scaled to a 25 × 25 picture (the crop is relatively large and must be scaled down to 25 pixels long and 25 pixels wide, the size specified for network training) and kept as a positive sample; the number of positive samples is controlled at 100,000;
b) Extraction of negative samples: a large number of pictures without faces are detected with the first-stage detector to extract negative samples; the detections mistaken for faces are picked out and scaled to 25 × 25 pictures as one part of the negative training samples. From an ordinary labeled face training set, the first-stage detector is also run, and detections whose IoU with every labeled face is less than 0.2 are extracted and scaled to 25 × 25 pictures as another part of the negative training samples;
S2, design the network structure model:
S2.1, the first-level network structure;
S2.2, the second-level network structure;
S3, train the two-stage model: train the first-stage detector model with the positive and negative samples of the first-stage training set (the trained coefficients constitute the model); detect faces with the first-stage detector; crop each detected face and scale it to a 25 × 25 picture, input it to the second-stage detector to decide whether it is a face, and map the coordinates back to the original picture; finally determine whether the original image contains a face and the position of the face.
The technical solution of the present invention can be further described as follows:
1. The detector is divided into two cascaded detectors. The first cascaded detector performs rough detection, reaching an accuracy above 30% at a recall rate of 99%; each detected face box is cropped out, scaled to the size required by the second-stage detector, and passed to the second-stage detector. After the two-stage detection, the accuracy reaches 97% at a recall rate of 98%. With one stage fewer, much detection time is saved and the network can be quantized; the detection time after quantization is 0.25 times the current detection time, with the recall rate and accuracy unchanged.
2. Model training.
The quantifiable network may only use 3 × 3 and 1 × 1 convolution kernels, the depth of each layer must be a multiple of 16, and other convolution kernels, pooling, and the like cannot be used.
Production of the first-stage detector model training samples:
A) Face occlusion must not exceed 30%, and faces that are heavily blurred or too small do not meet the requirements. To satisfy these requirements the samples are specially processed: each face is extended by 0.5 times its size to the left, right, top and bottom according to the labeled box, then cropped out of the original image, and the label is remapped to the new crop. For example, a picture contains two faces as shown in Fig. 4, where the rectangular boxes around the faces are their minimum enclosing rectangles; the two face images extracted from the original picture (Fig. 4) are shown in Fig. 5 and Fig. 6. The top-left and bottom-right coordinates of the two faces' enclosing rectangles in the original picture are [(35,104), (147,235)] and [(220,89), (325,221)], and the remapped labels of the two extracted face images, i.e. their top-left and bottom-right coordinates, are [(35,65), (147,196)] and [(52,66), (157,198)].
B) The cropped face images are manually screened and inspected. No face occluded by more than 30% is allowed, where occlusion includes masks, mask-like coverings, hat coverings and coverings by other objects; the incompleteness of a face at the image boundary cannot exceed 30%; camouflage-painted faces and clown faces do not count as faces. These non-compliant face pictures cannot be used as negative samples either. Particularly dim faces are not considered faces. Sunglasses are allowed. The 600,000 faces serve as the first part of the primary training set for the face training.
C) Random crops are then taken: a crop is kept as a positive sample when its IoU with the labeled box is greater than 0.5; two qualifying positive samples are cropped from each picture and scaled to the specified size. Negative samples are cropped from the existing training set: a crop whose IoU with every labeled box on the picture is less than 0.25 meets the negative-sample requirement and is then scaled; the resulting crops are negative samples, and 3 million of them are generated this way. Further crops are taken at random from the picture set containing no faces and scaled to the specified size to obtain negative samples; 1.5 million of these are generated. Sample preparation is then complete. The first-stage detector model is trained with the positive and negative samples.
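A minimal sketch of this random-crop sampling (not the patent's code: the jitter ranges and helper names are assumptions, iou() is the sketch given after the term list, and 17 matches the first-level network's input size):

import random
from PIL import Image  # assumption: PIL is used for image handling

INPUT_SIZE = 17  # input size of the first-level network

def sample_positive(img, box, tries=100):
    # Jitter a crop around the labeled box; keep it when IoU > 0.5 (step C)
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    for _ in range(tries):
        s = random.uniform(0.8, 1.25)   # assumed jitter ranges
        dx, dy = random.uniform(-0.2, 0.2) * w, random.uniform(-0.2, 0.2) * h
        cand = (x1 + dx, y1 + dy, x1 + dx + s * w, y1 + dy + s * h)
        if iou(cand, box) > 0.5:
            return img.crop(tuple(int(v) for v in cand)).resize((INPUT_SIZE, INPUT_SIZE))
    return None

def sample_negative(img, boxes, tries=100):
    # Random square crop whose IoU with every labeled box is below 0.25
    W, H = img.size
    for _ in range(tries):
        side = random.randint(INPUT_SIZE, min(W, H) - 1)
        x, y = random.randint(0, W - side), random.randint(0, H - side)
        cand = (x, y, x + side, y + side)
        if all(iou(cand, b) < 0.25 for b in boxes):
            return img.crop(cand).resize((INPUT_SIZE, INPUT_SIZE))
    return None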
As shown in Fig. 2, the first-level network structure is as follows.
The first layer takes a 17 × 17 × 3 input picture and outputs a feature map of depth 16; the convolution kernel is 3 × 3, the stride is 1, non-aligned ends are discarded, and the output feature map (1) is 15 × 15 × 16;
The second layer takes the 15 × 15 × 16 data as input and outputs a feature map of depth 16; the convolution kernel is 3 × 3, the stride is 3, non-aligned ends are discarded, and the output feature map (2) is 5 × 5 × 16;
The third layer takes the 5 × 5 × 16 data as input and outputs a feature map of depth 16; the convolution kernel is 3 × 3, the stride is 1, non-aligned ends are discarded, and the output feature map (3) is 3 × 3 × 16;
The fourth layer takes the 3 × 3 × 16 data as input and outputs 32 feature maps; the convolution kernel is 3 × 3, the stride is 1, non-aligned ends are discarded, and the output feature map (4) is 1 × 1 × 32;
The fifth layer takes feature map (4) output by the fourth layer, 1 × 1 × 32, as input; the output feature map depth is 1, the convolution kernel is 1 × 1, the stride is 1, and the output feature map is 1 × 1 × 1;
The sixth layer also takes feature map (4) output by the fourth layer, 1 × 1 × 32, as input; the output feature map depth is 4, the convolution kernel size is 1 × 1, the stride is 1, and the output feature map is 1 × 1 × 4.
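In PyTorch (a framework assumption; the patent names none, and activation functions are omitted because the description does not mention them), the first-level structure sketches as:

import torch
import torch.nn as nn

class FirstLevelNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Valid convolutions (no padding) drop the non-aligned border data
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=1),   # 17x17x3 -> 15x15x16, map (1)
            nn.Conv2d(16, 16, kernel_size=3, stride=3),  # -> 5x5x16, map (2)
            nn.Conv2d(16, 16, kernel_size=3, stride=1),  # -> 3x3x16, map (3)
            nn.Conv2d(16, 32, kernel_size=3, stride=1),  # -> 1x1x32, map (4)
        )
        self.cls = nn.Conv2d(32, 1, kernel_size=1)  # fifth layer: face score, 1x1x1
        self.box = nn.Conv2d(32, 4, kernel_size=1)  # sixth layer: box coordinates, 1x1x4

    def forward(self, x):
        f = self.backbone(x)
        return self.cls(f), self.box(f)

# Shape check against the description
score, box = FirstLevelNet()(torch.zeros(1, 3, 17, 17))
assert score.shape == (1, 1, 1, 1) and box.shape == (1, 4, 1, 1)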
Production of the second-stage detector model training samples.
A) Extraction of positive samples. Positive samples are taken from the first-part primary training set of 600,000 faces. The 600,000 faces are detected with the first-stage detector; detections whose IoU with a labeled face is greater than 0.5 are cropped out and scaled to 25 × 25 pictures (the crops are relatively large and must be scaled down to 25 pixels long and 25 pixels wide, the size specified for network training) as part of the positive training samples;
B) Random crops are taken near the labels of the 600,000 faces; when the IoU between a crop and the labeled box is greater than 0.5, the crop is scaled to a 25 × 25 picture, again the size specified for network training, and kept as a positive sample; the number of positive samples is controlled at 100,000.
C) A large number of pictures without faces are detected with the first-stage detector to extract negative samples; the falsely detected faces are cropped out of the detection results and scaled to 25 × 25 as one part of the negative training samples. From the ordinary labeled face training set, the first-stage detector is run, and detections whose IoU with every labeled face is less than 0.2 are extracted and scaled to 25 × 25 as another part of the negative training samples. The training set is then ready.
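A sketch of this negative mining, assuming the first-stage detector is exposed as a callable returning candidate boxes (a hypothetical interface) and reusing the iou() sketch from the term list:

def mine_negatives(first_detector, img, labeled_boxes):
    # Keep detections that match no labeled face: IoU < 0.2 with every label,
    # which on face-free pictures (no labels) keeps every false positive.
    negatives = []
    for det in first_detector(img):  # hypothetical: returns (x1, y1, x2, y2) boxes
        if all(iou(det, b) < 0.2 for b in labeled_boxes):
            negatives.append(img.crop(tuple(int(v) for v in det)).resize((25, 25)))
    return negatives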
As shown in Fig. 3, the second-level network structure is as follows.
The first layer takes a 25 × 25 × 3 input picture and outputs a feature map of depth 32; the convolution kernel is 3 × 3, the stride is 1, non-aligned ends are discarded, and the output feature map (1) is 23 × 23 × 32. All the data is used effectively; if padding were used in the processing, invalid data would be added.
The second layer takes feature map (1), 23 × 23 × 32, as input; the output feature map depth is 32, the convolution kernel is 3 × 3, the stride is 2, non-aligned ends are discarded, and the output feature map (2) is 11 × 11 × 32.
The third layer takes feature map (2), 11 × 11 × 32, as input; the output feature map depth is 48, the convolution kernel is 3 × 3, the stride is 2, non-aligned ends are discarded, and the output feature map (3) is 5 × 5 × 48.
The fourth layer takes feature map (3), 5 × 5 × 48, as input; the output feature map depth is 64, the convolution kernel is 3 × 3, the stride is 2, non-aligned ends are discarded, and the output feature map (4) is 2 × 2 × 64.
The fifth layer flattens the 2 × 2 × 64 data of feature map (4) into 256 one-dimensional values.
The sixth layer comprises two fully connected branches, which connect the 256 values respectively to the face/non-face category judgment and to the relative coordinates of the face box.
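Under the same framework assumption, the second-level structure sketches as follows; the widths of the two fully connected heads (2 class scores, 4 coordinates) are assumptions, since the description names only their roles:

import torch.nn as nn

class SecondLevelNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=1),   # 25x25x3 -> 23x23x32, map (1)
            nn.Conv2d(32, 32, kernel_size=3, stride=2),  # -> 11x11x32, map (2)
            nn.Conv2d(32, 48, kernel_size=3, stride=2),  # -> 5x5x48, map (3)
            nn.Conv2d(48, 64, kernel_size=3, stride=2),  # -> 2x2x64, map (4)
            nn.Flatten(),                                # fifth layer: -> 256 values
        )
        self.cls = nn.Linear(256, 2)  # face / not-face judgment (width assumed)
        self.box = nn.Linear(256, 4)  # relative face-box coordinates (width assumed)

    def forward(self, x):
        f = self.backbone(x)
        return self.cls(f), self.box(f)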
3. Use of the first- and second-level models.
Faces are detected with the first-level detector (the trained coefficients are the model); each detected face is cropped, scaled to 25 × 25, and input to the second-level detector to decide whether it is a face, and the coordinates are mapped back to the original picture. Finally it is determined whether the original image contains a face and where the face is.
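A sketch of the two-stage cascade at inference time; run_first_level(), to_tensor() and apply_offsets() are hypothetical helpers, and the threshold is a placeholder, since the patent specifies neither:

def detect(img, first_net, second_net, thresh=0.5):
    # Coarse first-level candidates, then a 25x25 second-level check
    faces = []
    for box in run_first_level(first_net, img):          # hypothetical scan over the image
        crop = img.crop(tuple(int(v) for v in box)).resize((25, 25))
        x = to_tensor(crop).unsqueeze(0)                 # hypothetical preprocessing
        cls, offsets = second_net(x)
        if cls.softmax(-1)[0, 1] > thresh:               # keep crops judged to be faces
            faces.append(apply_offsets(box, offsets))    # hypothetical: map back to the original picture
    return faces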
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention; various modifications and changes may be made to the embodiments by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (7)

1. A design method for quantifiable front-end face detection, characterized by comprising the following steps:
S1, dividing the detector into two cascaded detectors and preparing the training samples for the two-stage model:
S1.1, preparing the first-stage detector model training samples:
A) face occlusion must not exceed 30%, and faces that are heavily blurred or too small do not meet the requirements; each face is extended by 0.5 times its size to the left, right, top and bottom according to the labeled box, then cropped out of the original image, and the label is remapped to the new crop;
B) the cropped face images are manually screened and inspected, and no face with more than 30% occlusion is allowed among them; the resulting 600,000 faces serve as the first part of the primary training set for the face training;
C) random crops are then taken: a crop is kept as a positive sample when its IoU with the labeled box is greater than 0.5; two qualifying positive samples are cropped from each picture and scaled to the specified size; negative samples are cropped from the existing training set, where a crop whose IoU with every labeled box on the picture is less than 0.25 meets the negative-sample requirement and is then scaled, the resulting crops being negative samples, 3 million of which are generated this way; further crops are taken at random from the picture set containing no faces and scaled to the specified size, these also being negative samples, 1.5 million of which are generated;
D) the production of the positive and negative samples of the first-stage detector model training set is completed;
S1.2, preparing the second-stage detector model training samples:
a) extraction of positive samples: positive samples are extracted from the first-part primary training set of 600,000 faces; random crops are taken near the labels of the 600,000 faces, and when the IoU between a crop and the labeled box is greater than 0.5, the crop is scaled to a 25 × 25 picture, i.e. 25 pixels long and 25 pixels wide, the size specified for network training, and kept as a positive sample; the number of positive samples is controlled at 100,000;
b) extraction of negative samples: a large number of pictures without faces are detected with the first-stage detector to extract negative samples, the detections mistaken for faces are picked out and scaled to 25 × 25 pictures as one part of the negative training samples; an ordinary labeled face training set is also detected with the first-stage detector, and detections whose IoU with every labeled face is less than 0.2 are extracted and scaled to 25 × 25 pictures as another part of the negative training samples;
S2, designing the network structure model:
S2.1, the first-level network structure;
S2.2, the second-level network structure;
S3, training the two-stage model: training the first-stage detector model with the positive and negative samples of the first-stage training set, the trained coefficients constituting the model; detecting faces with the first-stage detector; cropping each detected face and scaling it to a 25 × 25 picture, inputting it to the second-stage detector to decide whether it is a face, and mapping the coordinates back to the original picture; finally determining whether the original image contains a face and the position of the face.
2. The design method for quantifiable front-end face detection according to claim 1, characterized in that the rule of step S1.1 A) that face occlusion must not exceed 30% includes the case that the incompleteness of a face at the image boundary must not exceed 30%; occlusion includes masks, mask-like coverings and hat coverings, but not sunglasses; faces do not include camouflage-painted faces, clown faces, or faces in extremely dim conditions.
3. The design method for quantifiable front-end face detection according to claim 1, characterized in that the extraction of positive samples in step S1.2 a) further comprises: detecting the 600,000 faces with the first-stage detector, cropping out the detections whose IoU with a labeled face is greater than 0.5, and scaling them to 25 × 25 pictures as part of the positive training samples.
4. The design method for quantifiable front-end face detection according to claim 1, characterized in that the first-level network structure in step S2.1 specifically includes:
the first layer takes a 17 × 17 × 3 input picture and outputs a feature map of depth 16; the convolution kernel is 3 × 3, the stride is 1, non-aligned ends are discarded, and the output feature map (1) is 15 × 15 × 16;
the second layer takes the 15 × 15 × 16 data as input and outputs a feature map of depth 16; the convolution kernel is 3 × 3, the stride is 3, non-aligned ends are discarded, and the output feature map (2) is 5 × 5 × 16;
the third layer takes the 5 × 5 × 16 data as input and outputs a feature map of depth 16; the convolution kernel is 3 × 3, the stride is 1, non-aligned ends are discarded, and the output feature map (3) is 3 × 3 × 16;
the fourth layer takes the 3 × 3 × 16 data as input and outputs 32 feature maps; the convolution kernel is 3 × 3, the stride is 1, non-aligned ends are discarded, and the output feature map (4) is 1 × 1 × 32;
the fifth layer takes feature map (4) output by the fourth layer, 1 × 1 × 32, as input; the output feature map depth is 1, the convolution kernel is 1 × 1, the stride is 1, and the output feature map is 1 × 1 × 1;
the sixth layer takes feature map (4) output by the fourth layer, 1 × 1 × 32, as input; the output feature map depth is 4, the convolution kernel size is 1 × 1, the stride is 1, and the output feature map is 1 × 1 × 4.
5. The design method for quantifiable front-end face detection according to claim 1, characterized in that the second-level network structure in step S2.2 specifically includes:
the first layer takes a 25 × 25 × 3 input picture and outputs a feature map of depth 32; the convolution kernel is 3 × 3, the stride is 1, non-aligned ends are discarded, and the output feature map (1) is 23 × 23 × 32;
the second layer takes feature map (1), 23 × 23 × 32, as input; the output feature map depth is 32, the convolution kernel is 3 × 3, the stride is 2, non-aligned ends are discarded, and the output feature map (2) is 11 × 11 × 32;
the third layer takes feature map (2), 11 × 11 × 32, as input; the output feature map depth is 48, the convolution kernel is 3 × 3, the stride is 2, non-aligned ends are discarded, and the output feature map (3) is 5 × 5 × 48;
the fourth layer takes feature map (3), 5 × 5 × 48, as input; the output feature map depth is 64, the convolution kernel is 3 × 3, the stride is 2, non-aligned ends are discarded, and the output feature map (4) is 2 × 2 × 64;
the fifth layer flattens the 2 × 2 × 64 data of feature map (4) into 256 one-dimensional values;
the sixth layer comprises two fully connected branches, which connect the 256 values respectively to the face/non-face category judgment and to the relative coordinates of the face box.
6. The design method for quantifiable front-end face detection according to claim 5, characterized in that all the data in the first layer of the second-level network structure is used effectively, and if padding were used in the processing, invalid data would be added.
7. The design method for quantifiable front-end face detection according to claim 1, characterized in that the network is a quantifiable network in which only 3 × 3 and 1 × 1 convolution kernels may be used, the depth of each layer must be a multiple of 16, no other convolution kernels may be used, and pooling may not be used.
CN202010736641.9A 2020-07-28 2020-07-28 Design method for quantifiable front-end face detection Active CN114005150B

Priority Applications (1)

Application Number: CN202010736641.9A
Priority Date: 2020-07-28
Filing Date: 2020-07-28
Title: Design method for quantifiable front-end face detection

Publications (2)

Publication Number: CN114005150A, published 2022-02-01
Publication Number: CN114005150B, granted 2024-05-03

Family

ID: 79920338

Family Applications (1)

Application Number: CN202010736641.9A (granted as CN114005150B), Active

Country Status (1)

Country: CN, CN114005150B granted

Patent Citations (5)

* Cited by examiner, † Cited by third party

Publication number / Priority date / Publication date / Assignee / Title
WO2019169895A1 * 2018-03-09 2019-09-12 华南理工大学 Fast side-face interference resistant face detection method
WO2020001082A1 * 2018-06-30 2020-01-02 东南大学 Face attribute analysis method based on transfer learning
CN108898112A * 2018-07-03 2018-11-27 东北大学 Near-infrared human face in-vivo detection method and system
CN109657548A * 2018-11-13 2019-04-19 深圳神目信息技术有限公司 Face detection method and system based on deep learning
CN110717481A * 2019-12-12 2020-01-21 浙江鹏信信息科技股份有限公司 Method for realizing face detection by using cascaded convolutional neural network



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant