CN110490170B - Face candidate frame extraction method - Google Patents

Face candidate frame extraction method

Info

Publication number
CN110490170B
Authority
CN
China
Prior art keywords
frame
anchor
image
face
anchor frame
Prior art date
Legal status
Active
Application number
CN201910793249.5A
Other languages
Chinese (zh)
Other versions
CN110490170A (en)
Inventor
刘中秋
陈高曙
祝江威
张伟
唐松鹤
Current Assignee
Miaxis Biometrics Co Ltd
Original Assignee
Miaxis Biometrics Co Ltd
Priority date
Filing date
Publication date
Application filed by Miaxis Biometrics Co Ltd filed Critical Miaxis Biometrics Co Ltd
Priority to CN201910793249.5A
Publication of CN110490170A
Application granted
Publication of CN110490170B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques relating to the classification model, based on the proximity to a decision surface, e.g. support vector machines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 - Detection; Localisation; Normalisation

Abstract

The invention provides a face candidate frame extraction method which mainly comprises two steps: step 1 is a candidate network training process; step 2 is a candidate frame extraction process. In step 1, three types of training data are prepared and used as training samples to train the candidate network; the training data are respectively a slice image, the label of the anchor frame corresponding to the slice image, and the real frame regression vector. In step 2, a face candidate frame of the image to be detected is extracted based on the candidate network trained in step 1, wherein the candidate network is a fully convolutional neural network. The invention combines multiple anchor frames with the image pyramid, reduces the number of image pyramid layers, greatly improves the running speed while ensuring accuracy, and takes both accuracy and real-time performance into account, so that it has good practicability.

Description

Face candidate frame extraction method
[ technical field ]
The invention relates to the technical field of face detection, in particular to a face candidate frame extraction method.
[ background of the invention ]
Face detection is a basic step for related applications in the field of computer vision and plays a decisive role in the performance of those subsequent applications. Face detection is a computer technique for finding information such as the position, size and posture of a face in arbitrary input (such as images and videos); among these, the position and size of the face are the most basic information that face detection must obtain, and they are usually marked by a face candidate frame.
A common face detection method at present comprises the following steps: first, image pyramid processing is performed on the original image to obtain multiple layers of pyramid images; second, a candidate network is used to extract candidate frames that may contain faces from each pyramid layer; finally, one or more stages of refining networks are used to screen the candidate frames and correct their coordinates, so as to finally obtain the face candidate frames. In this method, the extraction of face candidate frames requires processing every pyramid layer; the amount of calculation is large, the time consumption is therefore long, and the practicability is poor.
[ summary of the invention ]
The invention aims to provide a face candidate frame extraction method that effectively reduces the amount of calculation and improves the extraction speed of face candidate frames.
The purpose of the invention is realized by the following technical scheme:
a face candidate frame extraction method mainly comprises two steps: step 1: a candidate network training process; step 2: a candidate frame extraction process; and 2, extracting a face candidate frame of the image to be detected based on the candidate network trained in the step 1.
Step 1: preparing 3 types of training data which are respectively a slice image, a label of an anchor frame corresponding to the slice image and a regression vector of a real frame, wherein the corresponding data acquisition mode is as follows:
(1) Slice image:
randomly selecting square slices with side length not less than A from the original image of the training data set, uniformly scaling the side lengths of the square slices to A, and finally obtaining slice images with the size of A x A.
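By way of illustration only, this slice-sampling step could be implemented as follows (a minimal sketch assuming a NumPy/OpenCV pipeline; the function name and the uniform random choice of crop position and side length are assumptions, not details given by the patent):

```python
import cv2
import numpy as np

def sample_slice(original: np.ndarray, A: int = 24) -> np.ndarray:
    """Randomly crop a square slice with side length >= A and rescale it to A x A."""
    h, w = original.shape[:2]
    assert min(h, w) >= A, "the original image must be at least A x A"
    side = np.random.randint(A, min(h, w) + 1)   # square side length not less than A
    y = np.random.randint(0, h - side + 1)       # top-left corner of the crop
    x = np.random.randint(0, w - side + 1)
    crop = original[y:y + side, x:x + side]
    return cv2.resize(crop, (A, A))              # uniformly scale the side length to A
```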
(2) Label of anchor frame corresponding to slice image:
1) Anchor frame corresponding to slice image:
selecting, on the original image corresponding to the slice image and taking the center point of the slice image as the center, k square anchor frames with side lengths A × scale^i respectively; wherein scale is a scaling factor with a value range of 0 to 1; i takes the values 0, 1, …, k−1, and k is a positive integer whose value satisfies the condition
[constraint on k, given as an equation image in the published document]
The preferred value of k is the maximum value satisfying this condition.
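The k anchor side lengths follow directly from A and scale. A small sketch (the closed-form constraint that bounds k is given only as an equation image in the published document and is therefore not reproduced here):

```python
def anchor_side_lengths(A: float, scale: float, k: int) -> list:
    """Side lengths A * scale**i of the k square anchor frames, i = 0, 1, ..., k-1."""
    assert 0.0 < scale < 1.0, "scale is a scaling factor in the range 0-1"
    return [A * scale ** i for i in range(k)]

# With the embodiment's values A = 24, scale = 0.7937 and k = 4 this gives
# approximately [24.0, 19.05, 15.12, 12.00].
```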
2) The label of the anchor frame:
calculating the IOU value of each anchor frame and the corresponding real face frame in the original image:
IOU = S∩ / S∪
wherein S∩ is the area of the intersection of the anchor frame and the real face frame, and S∪ is the area of their union. A label is assigned to each anchor frame according to the maximum IOU value corresponding to it; the label categories are divided into positive sample labels, partial sample labels and negative sample labels.
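A direct IOU computation consistent with this definition (a sketch; the corner-coordinate box representation (x1, y1, x2, y2) is an assumption):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)            # area of the intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter                              # area of the union
    return inter / union if union > 0 else 0.0
```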
(3) Real frame regression vector:
for each anchor frame assigned a positive sample label or a partial sample label in step (2) above, calculating the real frame regression vector of the real face frame corresponding to the maximum IOU value relative to the corresponding anchor frame, wherein the calculation formulas are as follows:
[regression vector calculation formulas, given as equation images in the published document]
wherein x_g, y_g are the coordinates of the center point of the real face frame; x_a, y_a are the coordinates of the center point of the corresponding anchor frame; w_g, h_g are the width and height of the real face frame; and w_a, h_a are the width and height of the corresponding anchor frame.
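The exact regression formulas appear only as equation images in the published document. The sketch below assumes the standard anchor-based parameterisation (centre offsets normalised by the anchor size, log-ratios for width and height), which matches the symbols defined above but is not confirmed by the text:

```python
import math

def real_frame_regression_vector(gt, anchor):
    """Regression target of a real face frame relative to an anchor frame.

    Both boxes are given as (cx, cy, w, h): centre coordinates, width and height.
    The log-ratio form for width/height is an assumption.
    """
    x_g, y_g, w_g, h_g = gt
    x_a, y_a, w_a, h_a = anchor
    return ((x_g - x_a) / w_a,      # horizontal centre offset, normalised by anchor width
            (y_g - y_a) / h_a,      # vertical centre offset, normalised by anchor height
            math.log(w_g / w_a),    # width ratio
            math.log(h_g / h_a))    # height ratio
```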
The candidate network is trained by taking the obtained 3 types of training data as training samples; training terminates when the specified number of iterations is reached or the loss converges.
Step 2: the extraction process of the candidate frame is as follows:
(a) Carrying out image pyramid processing on an original image to be detected to obtain a multilayer pyramid image;
(b) Starting from the k-th layer pyramid image, every j layers a pyramid image is input into the trained candidate network for calculation to obtain a face confidence feature map and a predicted-frame regression vector feature map, wherein j is a positive integer less than or equal to k;
(c) Selecting the pixel points whose face confidence is greater than or equal to a threshold T in the face confidence feature map of step (b);
(d) For each pixel point selected in step (c), obtaining the corresponding anchor frame on the original image, namely an initially selected anchor frame;
(e) For the initially selected anchor frames obtained in step (d), first removing redundant anchor frames using non-maximum suppression, and then performing coordinate correction on the remaining anchor frames using the corresponding predicted-frame regression vectors, finally obtaining the face candidate frames.
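Steps (b) to (e) can be sketched end to end as follows. The network interface (net(image) returning an (H, W, k) confidence map and an (H, W, k, 4) regression map), the feature-map stride, the anchor placement at the receptive-field origin, and the exponential decoding of width and height are all assumptions made for illustration, not details given by the patent:

```python
import numpy as np

def nms(boxes, scores, thr):
    """Greedy non-maximum suppression; boxes are rows of (x1, y1, x2, y2)."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        overlap = inter / (area_i + areas - inter)
        order = order[1:][overlap <= thr]
    return keep

def extract_candidates(processed_layers, net, anchor_sides, stride=2, T=0.6, nms_thr=0.5):
    """Sketch of steps (b)-(e).

    processed_layers: (layer_scale, image) pairs for the pyramid layers actually processed.
    net(image): assumed to return an (H, W, k) confidence map and an (H, W, k, 4) regression map.
    anchor_sides: the k anchor side lengths defined on the A x A network input.
    """
    anchors, regs, scores = [], [], []
    for layer_scale, image in processed_layers:
        conf, reg = net(image)
        for y, x, a in zip(*np.where(conf >= T)):              # (c) confidence >= threshold T
            side = anchor_sides[a] / layer_scale               # (d) anchor frame mapped to the original image
            cx, cy = x * stride / layer_scale, y * stride / layer_scale
            anchors.append([cx - side / 2, cy - side / 2, cx + side / 2, cy + side / 2])
            regs.append(reg[y, x, a])
            scores.append(conf[y, x, a])
    anchors, regs, scores = np.array(anchors), np.array(regs), np.array(scores)
    keep = nms(anchors, scores, nms_thr)                       # (e) remove redundant anchor frames first ...
    candidates = []
    for i in keep:                                             # ... then correct the remaining ones
        x1, y1, x2, y2 = anchors[i]
        w, h = x2 - x1, y2 - y1
        cx, cy = x1 + w / 2, y1 + h / 2
        dx, dy, dw, dh = regs[i]
        cx, cy, w, h = cx + dx * w, cy + dy * h, w * np.exp(dw), h * np.exp(dh)
        candidates.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return candidates
```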
The invention combines multiple anchor frames with the image pyramid, so that only a subset of the initial pyramid images needs to be processed [the fraction is given as an equation image in the published document]. The number of image pyramid layers to process is effectively reduced, the running speed is greatly increased while the accuracy is ensured, and both accuracy and real-time performance are taken into account, so the method has good practicability.
[ description of the drawings ]
Fig. 1 is a schematic flow chart of a candidate network training process according to an embodiment of the present invention.
Fig. 2 is a flowchart illustrating a candidate frame extraction process according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a candidate network according to an embodiment of the present invention.
[ detailed description ]
In order to make the objects, technical solutions and effects of the present invention clearer, the present invention is further described below with reference to the accompanying drawings and examples.
The invention provides a face candidate frame extraction method which mainly comprises the following two steps: step 1 is a candidate network training process, as shown in Fig. 1; step 2 is a candidate frame extraction process, as shown in Fig. 2; in step 2, a face candidate frame of the image to be detected is extracted based on the candidate network trained in step 1.
The candidate network is a convolutional neural network, and the specific network structure is shown in fig. 3.
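Fig. 3 is not reproduced here. Purely as an illustration of the kind of network this could be, the sketch below shows a small fully convolutional candidate network with k anchors per location (consistent with claim 2); every kernel size and channel count is an assumption:

```python
import torch
import torch.nn as nn

class CandidateNet(nn.Module):
    """Illustrative fully convolutional candidate network; not the structure of Fig. 3."""
    def __init__(self, k: int = 4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 10, 3), nn.PReLU(),
            nn.MaxPool2d(2),                   # a 24 x 24 slice: 24 -> 22 -> 11
            nn.Conv2d(10, 16, 3), nn.PReLU(),
            nn.MaxPool2d(2),                   # 11 -> 9 -> 4
            nn.Conv2d(16, 32, 3), nn.PReLU(),  # 4 -> 2
            nn.Conv2d(32, 32, 2), nn.PReLU(),  # 2 -> 1 (one prediction per anchor for a training slice)
        )
        self.conf_head = nn.Conv2d(32, k, 1)       # per-anchor face confidence
        self.reg_head = nn.Conv2d(32, 4 * k, 1)    # per-anchor predicted-frame regression vector

    def forward(self, x):
        feat = self.backbone(x)
        return torch.sigmoid(self.conf_head(feat)), self.reg_head(feat)
```

Being fully convolutional, the same network produces dense confidence and regression maps when a whole pyramid image is fed in.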
The candidate network training process of this embodiment is specifically as follows:
(1) Randomly selecting square slices with the side length not less than 24 pixels from an original image of a training data set, uniformly scaling the side lengths of the square slices to 24, and finally obtaining slice images with the size of 24 x 24 pixels as one of candidate network training input data;
(2) Taking the center point of the slice image obtained in step (1) as the center, k square anchor frames are obtained by scaling; the empirical value of the scaling factor scale is 0.7937, so the side lengths of the square anchor frames are respectively 24 × scale^i, where i sequentially takes the values 0, 1, …, k−1, and k is 4;
(3) Calculating the IOU value of each anchor frame in the step (2) and the corresponding real face frame in the original image, wherein the calculation formula is as follows:
IOU = S∩ / S∪
wherein S∩ is the area of the intersection of the anchor frame and the real face frame, and S∪ is the area of their union. A label is assigned to each anchor frame according to the maximum IOU value corresponding to it;
the assignment criteria for the tags are described as follows: giving a positive sample label to the corresponding anchor frame when the maximum IOU value is more than or equal to 0.5, giving a partial sample label to the corresponding anchor frame when the maximum IOU value is less than 0.5 and more than or equal to 0.3, giving a negative sample label to the corresponding anchor frame when the maximum IOU value is less than 0.2, and not giving labels to the rest anchor frames; the positive sample label anchor frame can be marked as 1, and the negative sample label anchor frame can be marked as 0;
(4) For each anchor frame assigned a positive sample label or a partial sample label in step (3), the real frame regression vector of the real face frame corresponding to the maximum IOU value, relative to the corresponding anchor frame, is calculated as follows:
[regression vector calculation formulas, given as equation images in the published document]
wherein x_g, y_g are the coordinates of the center point of the real face frame; x_a, y_a are the coordinates of the center point of the anchor frame; w_g, h_g are the width and height of the real face frame; and w_a, h_a are the width and height of the anchor frame.
(5) The trained candidate network outputs two types of data, namely the face confidence and the predicted-frame regression vector. Therefore, the face confidence output is trained using the labels of all positive sample label and negative sample label anchor frames together with the slice images corresponding to those anchor frames; the predicted-frame regression vector output is trained using the real frame regression vectors of all positive sample label and partial sample label anchor frames together with the slice images corresponding to those anchor frames. Training terminates when the specified number of iterations is reached or the loss converges, finally yielding the trained candidate network.
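One possible reading of this two-output training setup, sketched with PyTorch; the loss functions (binary cross-entropy for the confidence, mean-squared error for the regression), their equal weighting, and the numeric label encoding are all assumptions, since the patent only states which samples feed which output:

```python
import torch
import torch.nn.functional as F

def candidate_net_loss(conf_pred, reg_pred, labels, reg_targets):
    """conf_pred: (N,) predicted face confidences in [0, 1]; reg_pred: (N, 4) predicted
    regression vectors; labels: (N,) with 1 = positive, 0 = negative, -1 = partial,
    -2 = unlabeled; reg_targets: (N, 4) real frame regression vectors."""
    cls_mask = (labels == 1) | (labels == 0)           # confidence: positive and negative samples only
    cls_loss = F.binary_cross_entropy(conf_pred[cls_mask], labels[cls_mask].float())
    reg_mask = (labels == 1) | (labels == -1)          # regression: positive and partial samples only
    reg_loss = F.mse_loss(reg_pred[reg_mask], reg_targets[reg_mask])
    return cls_loss + reg_loss
```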
The specific flow of the candidate frame extraction process in this embodiment can be described as follows:
the method comprises the steps of obtaining an original image to be detected with a resolution of 1920 x 1080 from an intelligent foreground, conducting image pyramid processing on the original image to be detected to obtain 18 layers of pyramid images, inputting each 3 layers of pyramid images into a trained candidate network from the 4 th layer of pyramid images, obtaining two outputs of a face confidence coefficient feature image and a prediction frame regression vector feature image, selecting pixel points with a face confidence coefficient larger than or equal to 0.6 from the face confidence coefficient feature image, finding out an anchor frame corresponding to the original image to be an initial selection anchor frame, removing redundant anchor frames in the initial selection anchor frame by using a non-maximum value inhibition method, conducting coordinate correction calculation on the residual anchor frames by using corresponding prediction frame regression vectors, and finally obtaining the face candidate frame.
This embodiment combines multiple anchor frames with the image pyramid, reducing the pyramid images to be processed from the original 18 layers to 6 layers and effectively reducing the number of image pyramid layers.
Tests show that, when extracting face candidate frames from a 1920 x 1080 picture on an ordinary personal computer, the face candidate frame extraction time of the MTCNN face detection method is about 255 ms, while that of this embodiment is about 33 ms; the extraction speed of the face candidate frame is thus greatly improved, and the method has good practicability.
The above embodiments are only preferred embodiments of the present invention, and are not intended to limit the present invention; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (5)

1. A face candidate frame extraction method is characterized by mainly comprising the following two steps: step 1: a candidate network training process; step 2: a candidate frame extraction process; step 2, extracting a face candidate frame of the image to be detected based on the candidate network trained in step 1;
the step 1 comprises the following steps: a. preparing 3 types of training data, and b, training the candidate network by using the obtained data as a training sample; the training data are respectively a slice image, a label of an anchor frame corresponding to the slice image and a real frame regression vector, and the data acquisition mode is specifically described as follows:
(1) Slice image:
randomly selecting square slices with side length not less than A from an original image of a training data set, uniformly scaling the side lengths of the square slices to A, and finally obtaining slice images with the size of A x A;
(2) Label of anchor frame corresponding to slice image:
1) Anchor frame corresponding to slice image:
selecting, on the original image corresponding to the slice image and taking the center point of the slice image as the center, k square anchor frames with side lengths A × scale^i respectively; wherein scale is a scaling factor with a value range of 0 to 1; i takes the values 0, 1, …, k−1, and k is a positive integer satisfying the condition
[constraint on k, given as an equation image in the published document]
2) The label of the anchor frame:
calculating the IOU value of each anchor frame and the corresponding real face frame in the original image:
IOU = S∩ / S∪
wherein S∩ is the area of the intersection of the anchor frame and the real face frame, and S∪ is the area of their union; a label is assigned to each anchor frame according to the maximum IOU value corresponding to it, and the label categories are divided into positive sample labels, partial sample labels and negative sample labels;
(3) Real frame regression vector:
for each anchor frame assigned a positive sample label or a partial sample label in step (2) above, calculating the real frame regression vector of the real face frame corresponding to the maximum IOU value relative to the corresponding anchor frame, wherein the calculation formulas are as follows:
[regression vector calculation formulas, given as equation images in the published document]
wherein x_g, y_g are the coordinates of the center point of the real face frame; x_a, y_a are the coordinates of the center point of the corresponding anchor frame; w_g, h_g are the width and height of the real face frame; and w_a, h_a are the width and height of the corresponding anchor frame;
the specific process of the step 2 is as follows:
(a) Carrying out image pyramid processing on an original image to be detected to obtain a multilayer pyramid image;
(b) Starting from the k-th layer pyramid image, every j layers a pyramid image is input into the trained candidate network for calculation to obtain a face confidence feature map and a predicted-frame regression vector feature map, wherein j is a positive integer less than or equal to k;
(c) Selecting the pixel points whose face confidence is greater than or equal to a threshold T in the face confidence feature map of step (b);
(d) For each pixel point selected in step (c), obtaining the corresponding anchor frame on the original image, namely an initially selected anchor frame;
(e) For the initially selected anchor frames obtained in step (d), first removing redundant anchor frames using non-maximum suppression, and then performing coordinate correction on the remaining anchor frames using the corresponding predicted-frame regression vectors, finally obtaining the face candidate frames.
2. The method as claimed in claim 1, wherein the candidate network is a fully convolutional neural network.
3. The method as claimed in claim 1, wherein k is the maximum value satisfying the condition [given as an equation image in the published document].
4. The method as claimed in claim 1, wherein the threshold T is 0.6.
5. The method of claim 1, wherein an anchor frame whose maximum IOU value is 0.5 or more is assigned a positive sample label; an anchor frame whose maximum IOU value is less than 0.5 and not less than 0.3 is assigned a partial sample label; an anchor frame whose maximum IOU value is 0.2 or less is assigned a negative sample label; and the remaining anchor frames are not assigned labels.
CN201910793249.5A 2019-08-27 2019-08-27 Face candidate frame extraction method Active CN110490170B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910793249.5A CN110490170B (en) 2019-08-27 2019-08-27 Face candidate frame extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910793249.5A CN110490170B (en) 2019-08-27 2019-08-27 Face candidate frame extraction method

Publications (2)

Publication Number Publication Date
CN110490170A (en) 2019-11-22
CN110490170B (en) 2023-01-06

Family

ID=68554292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910793249.5A Active CN110490170B (en) 2019-08-27 2019-08-27 Face candidate frame extraction method

Country Status (1)

Country Link
CN (1) CN110490170B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046792A (en) * 2019-12-11 2020-04-21 三一重工股份有限公司 Face detection method and device, electronic equipment and computer readable storage medium
CN114299283B (en) * 2022-03-07 2022-05-20 科大天工智能装备技术(天津)有限公司 Image-based displacement measurement method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107871134A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 A kind of method for detecting human face and device
CN109919097A (en) * 2019-03-08 2019-06-21 中国科学院自动化研究所 Face and key point combined detection system, method based on multi-task learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9858496B2 (en) * 2016-01-20 2018-01-02 Microsoft Technology Licensing, Llc Object detection and classification in images

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107871134A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 A kind of method for detecting human face and device
CN109919097A (en) * 2019-03-08 2019-06-21 中国科学院自动化研究所 Face and key point combined detection system, method based on multi-task learning

Also Published As

Publication number Publication date
CN110490170A (en) 2019-11-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant