CN114005150A - Design method of quantifiable front-end face detection

Info

Publication number: CN114005150A
Application number: CN202010736641.9A
Authority: CN (China)
Prior art keywords: face, feature map, output, layer
Legal status: Granted; currently active
Other languages: Chinese (zh)
Other versions: CN114005150B
Inventors: 田凤彬, 于晓静
Current Assignee: Beijing Ingenic Semiconductor Co Ltd
Original Assignee: Beijing Ingenic Semiconductor Co Ltd
Priority and filing date: 2020-07-28 (application filed by Beijing Ingenic Semiconductor Co Ltd)
Publication date: 2022-02-01 (CN114005150A); granted 2024-05-03 (CN114005150B)

Classifications

    • G06N3/045 Combinations of networks (G Physics; G06 Computing; G06N Computing arrangements based on specific computational models; G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks; G06N3/04 Architecture, e.g. interconnection topology)
    • G06N3/08 Learning methods (G Physics; G06 Computing; G06N Computing arrangements based on specific computational models; G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks)


Abstract

The invention provides a design method for quantifiable front-end face detection, which comprises the following steps: S1, divide the detector into two cascaded detectors and prepare the training samples for the two-stage model: S1.1, prepare the first-stage detector model training samples; S1.2, prepare the second-stage detector model training samples; S2, design the network structure model: S2.1, the first-level network structure; S2.2, the second-level network structure; S3, train the two-stage model: train the first-stage detector model with the positive and negative samples of the first-stage training set (the trained coefficients constitute the model), detect faces with the first-stage detector, crop each detected face and scale it to a 25 × 25 picture, input it to the second-stage detector to decide whether it is a face, and map the coordinates back to the original picture; finally determine whether the original image contains a face and the position of the face. The method reduces detection time; the network meets the quantization requirement, and the recall rate and accuracy are improved; the model meets the requirements of face detection.

Description

Design method of quantifiable front-end face detection
Technical Field
The invention relates to the technical field of neural networks, in particular to a design method for quantifiable front-end face detection.
Background
In today's society, neural network technology in the field of artificial intelligence is developing rapidly. The MTCNN technique is one of the more popular technologies of recent years. MTCNN, the Multi-task Convolutional Neural Network, combines face region detection with face keypoint detection and can generally be divided into three network stages: P-Net, R-Net and O-Net. This multi-task neural network model for the face detection task mainly uses three cascaded networks and the idea of candidate boxes plus classifiers to perform fast and efficient face detection. The three cascaded networks are P-Net, which quickly generates candidate windows, R-Net, which filters and selects high-precision candidate windows, and O-Net, which generates the final bounding boxes and face keypoints.
However, MTCNN cascade detection has the following drawbacks:
1. It uses three-level face detection; the third level is very time-consuming, and the three-level cascade lowers the recall rate.
2. The network structure cannot meet the quantization requirements of front-end chips: it uses the max-pooling function, which is forbidden in quantization, and the number of feature maps in each layer is not a multiple of 16, so the network cannot be quantized.
3. The MTCNN training samples are very noisy, so the trained models cannot meet practical requirements.
In addition, the following commonly used technical terms appear in the prior art:
1. Cascade: an arrangement in which several detectors detect in series is called a cascade.
2. IoU: the ratio of the area of the intersection of two regions to the area of their union (see the sketch after this list).
3. Quantization: the conversion of floating point to fixed point, or to 8-bit, 4-bit, or 2-bit values, is called quantization.
4. Recall rate: the ratio of the number of correctly detected faces to the total number of labeled faces.
5. Accuracy: the ratio of the number of correct detection results to the total number of detection results.
6. Model: all the coefficients of a function trained from the samples are collectively called the model.
7. Detector: a function used for detection, whose main component is a model.
8. Face detection: the process of using a face detector to detect whether a face exists in a video or picture is called face detection.
9. Convolution kernel: a convolution kernel is the parameter used to perform an operation on a matrix and the original image during image processing. It is typically a small matrix (e.g., a 3 × 3 matrix) with a weight value for each cell of the region. Typical shapes are 1 × 1, 3 × 3, 5 × 5, 7 × 7, 1 × 3, 3 × 1, 2 × 2, 1 × 5, 5 × 1, and so on.
10. Convolution: the center of the convolution kernel is placed on the pixel to be calculated; the product of each kernel element and the image pixel it covers is computed and the products are summed; the result is the new pixel value at that location. This process is called convolution.
11. Front-end face detection: face detection that runs on a chip is called front-end face detection; its speed and accuracy are lower than those of face detection on a cloud server.
12. Feature map: the result of the convolution calculation on the input data is called a feature map, and the result of a fully connected layer is also called a feature map. The feature map size is typically expressed as length × width × depth, or 1 × depth.
13. Stride: the distance the center of the convolution kernel shifts in the coordinates at each step.
14. Non-aligned processing of the two ends: when an image or data is processed with a convolution kernel of size 3 and stride 2, the data at the two edges may be insufficient to fill a window; discarding the data on both sides or on one side is called non-aligned processing of the two ends.
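To make terms 2 and 14 concrete, the following minimal Python sketch (illustrative only; the patent itself contains no code) computes the IoU of two boxes and the output length of a convolution that discards non-aligned data at the ends:

def iou(box_a, box_b):
    # IoU (term 2): intersection area over union area; boxes are (x1, y1, x2, y2)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def valid_conv_out(size, kernel, stride):
    # Term 14: border data that cannot fill a window is discarded
    return (size - kernel) // stride + 1

# Examples taken from the network descriptions below:
assert valid_conv_out(15, 3, 3) == 5   # first-level second layer, 15 -> 5
assert valid_conv_out(11, 3, 2) == 5   # second-level third layer, 11 -> 5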
Disclosure of Invention
In order to solve the above problems, the present invention aims to provide a design method for quantifiable front-end face detection that reduces detection time, satisfies the network quantization requirement while improving the recall rate and accuracy, and yields a trained model that meets the requirements of face detection in face recognition.
Specifically, the invention provides a design method for quantifiable front-end face detection, which comprises the following steps:
S1, divide the detector into two cascaded detectors and prepare the training samples for the two-stage model:
S1.1, prepare the first-stage detector model training samples:
A) Face occlusion must not exceed 30%, and faces that are heavily blurred or too small do not meet the requirements; each face is extended by 0.5 times its size to the left, right, top and bottom according to the labeled box, then cropped out of the original image, and the label is remapped to the new crop;
B) The cropped face images are manually screened and inspected, and no face with more than 30% occlusion is allowed among them; the resulting 600,000 faces serve as the first part of the primary training set for the face training;
C) Random crops are then taken: a crop is kept as a positive sample when its IoU with the labeled box is greater than 0.5; two qualifying positive samples are cropped from each picture and scaled to the specified size. Negative samples are cropped from the existing training set: a crop whose IoU with every labeled box on the picture is less than 0.25 meets the negative-sample requirement and is then scaled; the resulting crops are negative samples, and 3 million of them are generated this way. Further crops are taken at random from the picture set containing no faces and scaled to the specified size; these are also negative samples, and 1.5 million of them are generated;
D) This completes the production of the positive and negative samples of the first-stage detector model training set;
S1.2, prepare the second-stage detector model training samples:
a) Extraction of positive samples: positive samples are extracted from the first-part primary training set of 600,000 faces. Random crops are taken near the labels of the 600,000 faces; when the IoU between a crop and the labeled box is greater than 0.5, the crop is scaled to a 25 × 25 picture and kept as a positive sample; the number of positive samples is controlled at 100,000;
b) Extraction of negative samples: a large number of pictures without faces are detected with the first-stage detector to extract negative samples; the detections mistaken for faces are picked out and scaled to 25 × 25 pictures as one part of the negative training samples. An ordinary labeled face training set is also detected with the first-stage detector, and detections whose IoU with every labeled face is less than 0.2 are extracted and scaled to 25 × 25 pictures as another part of the negative training samples;
S2, design the network structure model:
S2.1, the first-level network structure;
S2.2, the second-level network structure;
S3, train the two-stage model: train the first-stage detector model with the positive and negative samples of the first-stage training set (the trained coefficients constitute the model); detect faces with the first-stage detector; crop each detected face and scale it to a 25 × 25 picture, input it to the second-stage detector to decide whether it is a face, and map the coordinates back to the original picture; finally determine whether the original image contains a face and the position of the face.
The rule in step S1.1 A) that face occlusion must not exceed 30% includes the case that the incompleteness of a face at the image boundary must not exceed 30%; occlusion includes masks, mask-like coverings and hat coverings, but not sunglasses; faces do not include camouflage-painted faces, clown faces, or faces in extremely dim conditions.
Step S1.2 a), the extraction of positive samples, further includes: detect the 600,000 faces with the first-stage detector, crop out the detections whose IoU with a labeled face is greater than 0.5, and scale them to 25 × 25 pictures as part of the positive training samples.
The first-level network structure in step S2.1 specifically includes:
The first layer takes a 17 × 17 × 3 input picture and outputs a feature map of depth 16; the convolution kernel is 3 × 3, the stride is 1, the convolution discards non-aligned data at the two ends, and the output feature map (1) is 15 × 15 × 16;
The second layer takes the 15 × 15 × 16 data as input and outputs a feature map of depth 16; the convolution kernel is 3 × 3, the stride is 3, non-aligned ends are discarded, and the output feature map (2) is 5 × 5 × 16;
The third layer takes the 5 × 5 × 16 data as input and outputs a feature map of depth 16; the convolution kernel is 3 × 3, the stride is 1, non-aligned ends are discarded, and the output feature map (3) is 3 × 3 × 16;
The fourth layer takes the 3 × 3 × 16 data as input and outputs 32 feature maps; the convolution kernel is 3 × 3, the stride is 1, non-aligned ends are discarded, and the output feature map (4) is 1 × 1 × 32;
The fifth layer takes feature map (4) output by the fourth layer, 1 × 1 × 32, as input; the output feature map depth is 1, the convolution kernel is 1 × 1, the stride is 1, and the output feature map is 1 × 1 × 1;
The sixth layer also takes feature map (4) output by the fourth layer, 1 × 1 × 32, as input; the output feature map depth is 4, the convolution kernel size is 1 × 1, the stride is 1, and the output feature map is 1 × 1 × 4.
The second-level network structure in step S2.2 specifically includes:
The first layer takes a 25 × 25 × 3 input picture and outputs a feature map of depth 32; the convolution kernel is 3 × 3, the stride is 1, non-aligned ends are discarded, and the output feature map (1) is 23 × 23 × 32;
The second layer takes feature map (1), 23 × 23 × 32, as input; the output feature map depth is 32, the convolution kernel is 3 × 3, the stride is 2, non-aligned ends are discarded, and the output feature map (2) is 11 × 11 × 32;
The third layer takes feature map (2), 11 × 11 × 32, as input; the output feature map depth is 48, the convolution kernel is 3 × 3, the stride is 2, non-aligned ends are discarded, and the output feature map (3) is 5 × 5 × 48;
The fourth layer takes feature map (3), 5 × 5 × 48, as input; the output feature map depth is 64, the convolution kernel is 3 × 3, the stride is 2, non-aligned ends are discarded, and the output feature map (4) is 2 × 2 × 64;
The fifth layer flattens the 2 × 2 × 64 data of feature map (4) into 256 one-dimensional values;
The sixth layer comprises two fully connected branches, which connect the 256 values respectively to the face/non-face category judgment and to the relative coordinates of the face box.
All the data in the first layer of the second-level network structure is used effectively; if padding were used in the processing, invalid data would be added.
The network is a quantifiable network: only 3 × 3 and 1 × 1 convolution kernels may be used, the depth of each layer must be a multiple of 16, no other convolution kernels may be used, and pooling may not be used.
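A minimal sketch of how these constraints might be checked, assuming a PyTorch implementation (the patent does not name a framework) and treating the depth-1 and depth-4 output heads of the described structures as exceptions to the multiple-of-16 rule:

import torch.nn as nn

def check_quantizable(model: nn.Module):
    # Collect violations: only 3x3 or 1x1 kernels, depths a multiple of 16
    # (depth-1 and depth-4 output heads excepted), and no pooling layers.
    problems = []
    for name, m in model.named_modules():
        if isinstance(m, nn.Conv2d):
            if m.kernel_size not in ((3, 3), (1, 1)):
                problems.append(f"{name}: kernel {m.kernel_size} is not 3x3 or 1x1")
            if m.out_channels % 16 and m.out_channels not in (1, 4):
                problems.append(f"{name}: depth {m.out_channels} is not a multiple of 16")
        elif isinstance(m, (nn.MaxPool2d, nn.AvgPool2d)):
            problems.append(f"{name}: pooling is forbidden in a quantifiable network")
    return problems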
Thus, the present application has the following advantages: the first cascaded detector performs rough detection, reaching an accuracy above 30% at a recall rate of 99%; each detected face box is cropped out, scaled to the size required by the second-stage detector, and passed to the second-stage detector. After the two-stage detection, the accuracy reaches 97% at a recall rate of 98%. With one stage fewer, much detection time is saved and the network can be quantized; the detection time after quantization is 0.25 times the current detection time, with the recall rate and accuracy unchanged.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention.
Fig. 1 is a schematic flow diagram of the method of the present invention.
Fig. 2 is a schematic diagram of the structure of the first-level network of the method of the present invention.
Fig. 3 is a schematic diagram of the structure of the second-level network of the method of the present invention.
Fig. 4 is a schematic diagram of a picture with two faces, in which the rectangular boxes around the faces are the face bounding boxes.
Fig. 5 is a schematic diagram of the first face image extracted from Fig. 4.
Fig. 6 is a schematic diagram of the second face image extracted from Fig. 4.
Detailed Description
In order that the technical contents and advantages of the present invention can be more clearly understood, the present invention will now be described in further detail with reference to the accompanying drawings.
As shown in Fig. 1, the present invention relates to a design method for quantifiable front-end face detection, which comprises the following steps:
S1, divide the detector into two cascaded detectors and prepare the training samples for the two-stage model:
S1.1, prepare the first-stage detector model training samples:
A) Face occlusion must not exceed 30%, and faces that are heavily blurred or too small do not meet the requirements; each face is extended by 0.5 times its size to the left, right, top and bottom according to the labeled box, then cropped out of the original image, and the label is remapped to the new crop. For example, a picture contains two faces, as shown in Fig. 4, where the rectangular boxes around the faces are their minimum enclosing rectangles; the two face images extracted from the original picture (Fig. 4) are shown in Fig. 5 and Fig. 6. The top-left and bottom-right coordinates of the two faces' enclosing rectangles in the original picture are [(35,104), (147,235)] and [(220,89), (325,221)], and the remapped labels of the two extracted face images, i.e. their top-left and bottom-right coordinates, are [(35,65), (147,196)] and [(52,66), (157,198)].
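The following Python sketch reproduces this crop-and-remap step under assumptions not stated in the patent: boxes are (x1, y1, x2, y2) pixel coordinates, the extended crop is clipped at the image border, and fractional coordinates are rounded half up; both worked-example labels come out as listed above.

import math

def half_up(v):
    # Round half up; the worked example is consistent with this rounding
    return math.floor(v + 0.5)

def extend_and_remap(box, img_w, img_h):
    # Extend the labeled box by 0.5x its size on every side, clip the crop
    # at the image border, and remap the label into crop coordinates.
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx1 = max(0, half_up(x1 - 0.5 * w))
    cy1 = max(0, half_up(y1 - 0.5 * h))
    cx2 = min(img_w, half_up(x2 + 0.5 * w))
    cy2 = min(img_h, half_up(y2 + 0.5 * h))
    return (cx1, cy1, cx2, cy2), (x1 - cx1, y1 - cy1, x2 - cx1, y2 - cy1)

# The two faces of the example (the image size here is an assumption):
for box in [(35, 104, 147, 235), (220, 89, 325, 221)]:
    _, label = extend_and_remap(box, img_w=400, img_h=320)
    print(label)   # (35, 65, 147, 196) then (52, 66, 157, 198)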
B) The cropped face images are manually screened and inspected, and no face with more than 30% occlusion is allowed among them; the resulting 600,000 faces serve as the first part of the primary training set for the face training;
C) Random crops are then taken: a crop is kept as a positive sample when its IoU with the labeled box is greater than 0.5; two qualifying positive samples are cropped from each picture and scaled to the specified size. Negative samples are cropped from the existing training set: a crop whose IoU with every labeled box on the picture is less than 0.25 meets the negative-sample requirement and is then scaled; the resulting crops are negative samples, and 3 million of them are generated this way. Further crops are taken at random from the picture set containing no faces and scaled to the specified size; these are also negative samples, and 1.5 million of them are generated;
D) This completes the production of the positive and negative samples of the first-stage detector model training set;
S1.2, prepare the second-stage detector model training samples:
a) Extraction of positive samples: positive samples are extracted from the first-part primary training set of 600,000 faces. Random crops are taken near the labels of the 600,000 faces; when the IoU between a crop and the labeled box is greater than 0.5, the crop is scaled to a 25 × 25 picture (the crop is relatively large and must be scaled down to 25 pixels long and 25 pixels wide, the size specified for network training) and kept as a positive sample; the number of positive samples is controlled at 100,000;
b) Extraction of negative samples: a large number of pictures without faces are detected with the first-stage detector to extract negative samples; the detections mistaken for faces are picked out and scaled to 25 × 25 pictures as one part of the negative training samples. From an ordinary labeled face training set, the first-stage detector is also run, and detections whose IoU with every labeled face is less than 0.2 are extracted and scaled to 25 × 25 pictures as another part of the negative training samples;
S2, design the network structure model:
S2.1, the first-level network structure;
S2.2, the second-level network structure;
S3, train the two-stage model: train the first-stage detector model with the positive and negative samples of the first-stage training set (the trained coefficients constitute the model); detect faces with the first-stage detector; crop each detected face and scale it to a 25 × 25 picture, input it to the second-stage detector to decide whether it is a face, and map the coordinates back to the original picture; finally determine whether the original image contains a face and the position of the face.
The technical solution of the present invention can be further described as follows:
1. The detector is divided into two cascaded detectors. The first cascaded detector performs rough detection, reaching an accuracy above 30% at a recall rate of 99%; each detected face box is cropped out, scaled to the size required by the second-stage detector, and passed to the second-stage detector. After the two-stage detection, the accuracy reaches 97% at a recall rate of 98%. With one stage fewer, much detection time is saved and the network can be quantized; the detection time after quantization is 0.25 times the current detection time, with the recall rate and accuracy unchanged.
2. Model training.
The quantifiable network may only use 3 × 3 and 1 × 1 convolution kernels, the depth of each layer must be a multiple of 16, and other convolution kernels, pooling, and the like cannot be used.
Production of the first-stage detector model training samples:
A) Face occlusion must not exceed 30%, and faces that are heavily blurred or too small do not meet the requirements. To satisfy these requirements the samples are specially processed: each face is extended by 0.5 times its size to the left, right, top and bottom according to the labeled box, then cropped out of the original image, and the label is remapped to the new crop. For example, a picture contains two faces as shown in Fig. 4, where the rectangular boxes around the faces are their minimum enclosing rectangles; the two face images extracted from the original picture (Fig. 4) are shown in Fig. 5 and Fig. 6. The top-left and bottom-right coordinates of the two faces' enclosing rectangles in the original picture are [(35,104), (147,235)] and [(220,89), (325,221)], and the remapped labels of the two extracted face images, i.e. their top-left and bottom-right coordinates, are [(35,65), (147,196)] and [(52,66), (157,198)].
B) The cropped face images are manually screened and inspected. No face occluded by more than 30% is allowed, where occlusion includes masks, mask-like coverings, hat coverings and coverings by other objects; the incompleteness of a face at the image boundary cannot exceed 30%; camouflage-painted faces and clown faces do not count as faces. These non-compliant face pictures cannot be used as negative samples either. Particularly dim faces are not considered faces. Sunglasses are allowed. The 600,000 faces serve as the first part of the primary training set for the face training.
C) Random crops are then taken: a crop is kept as a positive sample when its IoU with the labeled box is greater than 0.5; two qualifying positive samples are cropped from each picture and scaled to the specified size. Negative samples are cropped from the existing training set: a crop whose IoU with every labeled box on the picture is less than 0.25 meets the negative-sample requirement and is then scaled; the resulting crops are negative samples, and 3 million of them are generated this way. Further crops are taken at random from the picture set containing no faces and scaled to the specified size to obtain negative samples; 1.5 million of these are generated. Sample preparation is then complete. The first-stage detector model is trained with the positive and negative samples.
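A minimal sketch of this random-crop sampling (not the patent's code: the jitter ranges and helper names are assumptions, iou() is the sketch given after the term list, and 17 matches the first-level network's input size):

import random
from PIL import Image  # assumption: PIL is used for image handling

INPUT_SIZE = 17  # input size of the first-level network

def sample_positive(img, box, tries=100):
    # Jitter a crop around the labeled box; keep it when IoU > 0.5 (step C)
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    for _ in range(tries):
        s = random.uniform(0.8, 1.25)   # assumed jitter ranges
        dx, dy = random.uniform(-0.2, 0.2) * w, random.uniform(-0.2, 0.2) * h
        cand = (x1 + dx, y1 + dy, x1 + dx + s * w, y1 + dy + s * h)
        if iou(cand, box) > 0.5:
            return img.crop(tuple(int(v) for v in cand)).resize((INPUT_SIZE, INPUT_SIZE))
    return None

def sample_negative(img, boxes, tries=100):
    # Random square crop whose IoU with every labeled box is below 0.25
    W, H = img.size
    for _ in range(tries):
        side = random.randint(INPUT_SIZE, min(W, H) - 1)
        x, y = random.randint(0, W - side), random.randint(0, H - side)
        cand = (x, y, x + side, y + side)
        if all(iou(cand, b) < 0.25 for b in boxes):
            return img.crop(cand).resize((INPUT_SIZE, INPUT_SIZE))
    return None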
As shown in Fig. 2, the first-level network structure is as follows.
The first layer takes a 17 × 17 × 3 input picture and outputs a feature map of depth 16; the convolution kernel is 3 × 3, the stride is 1, non-aligned ends are discarded, and the output feature map (1) is 15 × 15 × 16;
The second layer takes the 15 × 15 × 16 data as input and outputs a feature map of depth 16; the convolution kernel is 3 × 3, the stride is 3, non-aligned ends are discarded, and the output feature map (2) is 5 × 5 × 16;
The third layer takes the 5 × 5 × 16 data as input and outputs a feature map of depth 16; the convolution kernel is 3 × 3, the stride is 1, non-aligned ends are discarded, and the output feature map (3) is 3 × 3 × 16;
The fourth layer takes the 3 × 3 × 16 data as input and outputs 32 feature maps; the convolution kernel is 3 × 3, the stride is 1, non-aligned ends are discarded, and the output feature map (4) is 1 × 1 × 32;
The fifth layer takes feature map (4) output by the fourth layer, 1 × 1 × 32, as input; the output feature map depth is 1, the convolution kernel is 1 × 1, the stride is 1, and the output feature map is 1 × 1 × 1;
The sixth layer also takes feature map (4) output by the fourth layer, 1 × 1 × 32, as input; the output feature map depth is 4, the convolution kernel size is 1 × 1, the stride is 1, and the output feature map is 1 × 1 × 4.
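In PyTorch (a framework assumption; the patent names none, and activation functions are omitted because the description does not mention them), the first-level structure sketches as:

import torch
import torch.nn as nn

class FirstLevelNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Valid convolutions (no padding) drop the non-aligned border data
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=1),   # 17x17x3 -> 15x15x16, map (1)
            nn.Conv2d(16, 16, kernel_size=3, stride=3),  # -> 5x5x16, map (2)
            nn.Conv2d(16, 16, kernel_size=3, stride=1),  # -> 3x3x16, map (3)
            nn.Conv2d(16, 32, kernel_size=3, stride=1),  # -> 1x1x32, map (4)
        )
        self.cls = nn.Conv2d(32, 1, kernel_size=1)  # fifth layer: face score, 1x1x1
        self.box = nn.Conv2d(32, 4, kernel_size=1)  # sixth layer: box coordinates, 1x1x4

    def forward(self, x):
        f = self.backbone(x)
        return self.cls(f), self.box(f)

# Shape check against the description
score, box = FirstLevelNet()(torch.zeros(1, 3, 17, 17))
assert score.shape == (1, 1, 1, 1) and box.shape == (1, 4, 1, 1)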
Production of the second-stage detector model training samples.
A) Extraction of positive samples. Positive samples are taken from the first-part primary training set of 600,000 faces. The 600,000 faces are detected with the first-stage detector; detections whose IoU with a labeled face is greater than 0.5 are cropped out and scaled to 25 × 25 pictures (the crops are relatively large and must be scaled down to 25 pixels long and 25 pixels wide, the size specified for network training) as part of the positive training samples;
B) Random crops are taken near the labels of the 600,000 faces; when the IoU between a crop and the labeled box is greater than 0.5, the crop is scaled to a 25 × 25 picture, again the size specified for network training, and kept as a positive sample; the number of positive samples is controlled at 100,000.
C) A large number of pictures without faces are detected with the first-stage detector to extract negative samples; the falsely detected faces are cropped out of the detection results and scaled to 25 × 25 as one part of the negative training samples. From the ordinary labeled face training set, the first-stage detector is run, and detections whose IoU with every labeled face is less than 0.2 are extracted and scaled to 25 × 25 as another part of the negative training samples. The training set is then ready.
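A sketch of this negative mining, assuming the first-stage detector is exposed as a callable returning candidate boxes (a hypothetical interface) and reusing the iou() sketch from the term list:

def mine_negatives(first_detector, img, labeled_boxes):
    # Keep detections that match no labeled face: IoU < 0.2 with every label,
    # which on face-free pictures (no labels) keeps every false positive.
    negatives = []
    for det in first_detector(img):  # hypothetical: returns (x1, y1, x2, y2) boxes
        if all(iou(det, b) < 0.2 for b in labeled_boxes):
            negatives.append(img.crop(tuple(int(v) for v in det)).resize((25, 25)))
    return negatives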
As shown in Fig. 3, the second-level network structure is as follows.
The first layer takes a 25 × 25 × 3 input picture and outputs a feature map of depth 32; the convolution kernel is 3 × 3, the stride is 1, non-aligned ends are discarded, and the output feature map (1) is 23 × 23 × 32. All the data is used effectively; if padding were used in the processing, invalid data would be added.
The second layer takes feature map (1), 23 × 23 × 32, as input; the output feature map depth is 32, the convolution kernel is 3 × 3, the stride is 2, non-aligned ends are discarded, and the output feature map (2) is 11 × 11 × 32.
The third layer takes feature map (2), 11 × 11 × 32, as input; the output feature map depth is 48, the convolution kernel is 3 × 3, the stride is 2, non-aligned ends are discarded, and the output feature map (3) is 5 × 5 × 48.
The fourth layer takes feature map (3), 5 × 5 × 48, as input; the output feature map depth is 64, the convolution kernel is 3 × 3, the stride is 2, non-aligned ends are discarded, and the output feature map (4) is 2 × 2 × 64.
The fifth layer flattens the 2 × 2 × 64 data of feature map (4) into 256 one-dimensional values.
The sixth layer comprises two fully connected branches, which connect the 256 values respectively to the face/non-face category judgment and to the relative coordinates of the face box.
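Under the same framework assumption, the second-level structure sketches as follows; the widths of the two fully connected heads (2 class scores, 4 coordinates) are assumptions, since the description names only their roles:

import torch.nn as nn

class SecondLevelNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=1),   # 25x25x3 -> 23x23x32, map (1)
            nn.Conv2d(32, 32, kernel_size=3, stride=2),  # -> 11x11x32, map (2)
            nn.Conv2d(32, 48, kernel_size=3, stride=2),  # -> 5x5x48, map (3)
            nn.Conv2d(48, 64, kernel_size=3, stride=2),  # -> 2x2x64, map (4)
            nn.Flatten(),                                # fifth layer: -> 256 values
        )
        self.cls = nn.Linear(256, 2)  # face / not-face judgment (width assumed)
        self.box = nn.Linear(256, 4)  # relative face-box coordinates (width assumed)

    def forward(self, x):
        f = self.backbone(x)
        return self.cls(f), self.box(f)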
3. Use of the first- and second-level models.
Faces are detected with the first-level detector (the trained coefficients are the model); each detected face is cropped, scaled to 25 × 25, and input to the second-level detector to decide whether it is a face, and the coordinates are mapped back to the original picture. Finally it is determined whether the original image contains a face and where the face is.
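A sketch of the two-stage cascade at inference time; run_first_level(), to_tensor() and apply_offsets() are hypothetical helpers, and the threshold is a placeholder, since the patent specifies neither:

def detect(img, first_net, second_net, thresh=0.5):
    # Coarse first-level candidates, then a 25x25 second-level check
    faces = []
    for box in run_first_level(first_net, img):          # hypothetical scan over the image
        crop = img.crop(tuple(int(v) for v in box)).resize((25, 25))
        x = to_tensor(crop).unsqueeze(0)                 # hypothetical preprocessing
        cls, offsets = second_net(x)
        if cls.softmax(-1)[0, 1] > thresh:               # keep crops judged to be faces
            faces.append(apply_offsets(box, offsets))    # hypothetical: map back to the original picture
    return faces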
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention; various modifications and changes may be made to the embodiments by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (7)

1. A design method for quantifiable front-end face detection, characterized by comprising the following steps:
S1, dividing the detector into two cascaded detectors and preparing the training samples for the two-stage model:
S1.1, preparing the first-stage detector model training samples:
A) face occlusion must not exceed 30%, and faces that are heavily blurred or too small do not meet the requirements; each face is extended by 0.5 times its size to the left, right, top and bottom according to the labeled box, then cropped out of the original image, and the label is remapped to the new crop;
B) the cropped face images are manually screened and inspected, and no face with more than 30% occlusion is allowed among them; the resulting 600,000 faces serve as the first part of the primary training set for the face training;
C) random crops are then taken: a crop is kept as a positive sample when its IoU with the labeled box is greater than 0.5; two qualifying positive samples are cropped from each picture and scaled to the specified size; negative samples are cropped from the existing training set, where a crop whose IoU with every labeled box on the picture is less than 0.25 meets the negative-sample requirement and is then scaled, the resulting crops being negative samples, 3 million of which are generated this way; further crops are taken at random from the picture set containing no faces and scaled to the specified size, these also being negative samples, 1.5 million of which are generated;
D) the production of the positive and negative samples of the first-stage detector model training set is completed;
S1.2, preparing the second-stage detector model training samples:
a) extraction of positive samples: positive samples are extracted from the first-part primary training set of 600,000 faces; random crops are taken near the labels of the 600,000 faces, and when the IoU between a crop and the labeled box is greater than 0.5, the crop is scaled to a 25 × 25 picture, i.e. 25 pixels long and 25 pixels wide, the size specified for network training, and kept as a positive sample; the number of positive samples is controlled at 100,000;
b) extraction of negative samples: a large number of pictures without faces are detected with the first-stage detector to extract negative samples, the detections mistaken for faces are picked out and scaled to 25 × 25 pictures as one part of the negative training samples; an ordinary labeled face training set is also detected with the first-stage detector, and detections whose IoU with every labeled face is less than 0.2 are extracted and scaled to 25 × 25 pictures as another part of the negative training samples;
S2, designing the network structure model:
S2.1, the first-level network structure;
S2.2, the second-level network structure;
S3, training the two-stage model: training the first-stage detector model with the positive and negative samples of the first-stage training set, the trained coefficients constituting the model; detecting faces with the first-stage detector; cropping each detected face and scaling it to a 25 × 25 picture, inputting it to the second-stage detector to decide whether it is a face, and mapping the coordinates back to the original picture; finally determining whether the original image contains a face and the position of the face.
2. The design method for quantifiable front-end face detection according to claim 1, characterized in that the rule of step S1.1 A) that face occlusion must not exceed 30% includes the case that the incompleteness of a face at the image boundary must not exceed 30%; occlusion includes masks, mask-like coverings and hat coverings, but not sunglasses; faces do not include camouflage-painted faces, clown faces, or faces in extremely dim conditions.
3. The design method for quantifiable front-end face detection according to claim 1, characterized in that the extraction of positive samples in step S1.2 a) further comprises: detecting the 600,000 faces with the first-stage detector, cropping out the detections whose IoU with a labeled face is greater than 0.5, and scaling them to 25 × 25 pictures as part of the positive training samples.
4. The design method for quantifiable front-end face detection according to claim 1, characterized in that the first-level network structure in step S2.1 specifically includes:
the first layer takes a 17 × 17 × 3 input picture and outputs a feature map of depth 16; the convolution kernel is 3 × 3, the stride is 1, non-aligned ends are discarded, and the output feature map (1) is 15 × 15 × 16;
the second layer takes the 15 × 15 × 16 data as input and outputs a feature map of depth 16; the convolution kernel is 3 × 3, the stride is 3, non-aligned ends are discarded, and the output feature map (2) is 5 × 5 × 16;
the third layer takes the 5 × 5 × 16 data as input and outputs a feature map of depth 16; the convolution kernel is 3 × 3, the stride is 1, non-aligned ends are discarded, and the output feature map (3) is 3 × 3 × 16;
the fourth layer takes the 3 × 3 × 16 data as input and outputs 32 feature maps; the convolution kernel is 3 × 3, the stride is 1, non-aligned ends are discarded, and the output feature map (4) is 1 × 1 × 32;
the fifth layer takes feature map (4) output by the fourth layer, 1 × 1 × 32, as input; the output feature map depth is 1, the convolution kernel is 1 × 1, the stride is 1, and the output feature map is 1 × 1 × 1;
the sixth layer takes feature map (4) output by the fourth layer, 1 × 1 × 32, as input; the output feature map depth is 4, the convolution kernel size is 1 × 1, the stride is 1, and the output feature map is 1 × 1 × 4.
5. The design method for quantifiable front-end face detection according to claim 1, characterized in that the second-level network structure in step S2.2 specifically includes:
the first layer takes a 25 × 25 × 3 input picture and outputs a feature map of depth 32; the convolution kernel is 3 × 3, the stride is 1, non-aligned ends are discarded, and the output feature map (1) is 23 × 23 × 32;
the second layer takes feature map (1), 23 × 23 × 32, as input; the output feature map depth is 32, the convolution kernel is 3 × 3, the stride is 2, non-aligned ends are discarded, and the output feature map (2) is 11 × 11 × 32;
the third layer takes feature map (2), 11 × 11 × 32, as input; the output feature map depth is 48, the convolution kernel is 3 × 3, the stride is 2, non-aligned ends are discarded, and the output feature map (3) is 5 × 5 × 48;
the fourth layer takes feature map (3), 5 × 5 × 48, as input; the output feature map depth is 64, the convolution kernel is 3 × 3, the stride is 2, non-aligned ends are discarded, and the output feature map (4) is 2 × 2 × 64;
the fifth layer flattens the 2 × 2 × 64 data of feature map (4) into 256 one-dimensional values;
the sixth layer comprises two fully connected branches, which connect the 256 values respectively to the face/non-face category judgment and to the relative coordinates of the face box.
6. The design method for quantifiable front-end face detection according to claim 5, characterized in that all the data in the first layer of the second-level network structure is used effectively, and if padding were used in the processing, invalid data would be added.
7. The design method for quantifiable front-end face detection according to claim 1, characterized in that the network is a quantifiable network in which only 3 × 3 and 1 × 1 convolution kernels may be used, the depth of each layer must be a multiple of 16, no other convolution kernels may be used, and pooling may not be used.
CN202010736641.9A 2020-07-28 2020-07-28 Design method for quantifiable front-end face detection Active CN114005150B

Priority Applications (1)

Application Number: CN202010736641.9A
Priority Date: 2020-07-28
Filing Date: 2020-07-28
Title: Design method for quantifiable front-end face detection

Publications (2)

Publication Number: CN114005150A, published 2022-02-01
Publication Number: CN114005150B, granted 2024-05-03

Family

ID: 79920338

Family Applications (1)

Application Number: CN202010736641.9A (granted as CN114005150B), Active

Country Status (1)

Country: CN, CN114005150B granted

Patent Citations (5)

* Cited by examiner, † Cited by third party

Publication number / Priority date / Publication date / Assignee / Title
WO2019169895A1 * 2018-03-09 2019-09-12 华南理工大学 Fast side-face interference resistant face detection method
WO2020001082A1 * 2018-06-30 2020-01-02 东南大学 Face attribute analysis method based on transfer learning
CN108898112A * 2018-07-03 2018-11-27 东北大学 Near-infrared human face in-vivo detection method and system
CN109657548A * 2018-11-13 2019-04-19 深圳神目信息技术有限公司 Face detection method and system based on deep learning
CN110717481A * 2019-12-12 2020-01-21 浙江鹏信信息科技股份有限公司 Method for realizing face detection by using cascaded convolutional neural network



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant