CN110458005B - Rotation-invariant face detection method based on multitask progressive registration network

Info

Publication number: CN110458005B (grant); CN110458005A (application)
Application number: CN201910590187.8A
Authority: CN (China)
Prior art keywords: face, image, network, rotation, layer
Legal status: Active (granted)
Original language: Chinese (zh)
Inventors: 周丽芳 (Zhou Lifang), 谷雨 (Gu Yu), 雷帮军 (Lei Bangjun), 李伟生 (Li Weisheng)
Applicant and current assignee: Chongqing University of Posts and Telecommunications
Priority/filing date: 2019-07-02
Application published: 2019-11-15 (CN110458005A); granted: 2022-12-27 (CN110458005B)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/165Detection; Localisation; Normalisation using facial parts and geometric relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Abstract

The invention discloses a rotation-invariant face detection method based on a multitask progressive registration network, belonging to the field of computer vision. The method mainly comprises the following steps: preprocessing the images, and constructing and training a cascaded multilayer convolutional neural network; inputting a test image, generating a set of images at different resolutions by means of an image pyramid, and feeding the set into the cascaded multilayer convolutional neural network for detection; at each stage of the network, filtering out part of the non-face windows, adjusting the candidate-box positions according to the box regression results, and predicting the face rotation angle; and then registering by flipping the image according to the predicted rotation angle. The invention achieves real-time, rotation-adaptive face detection through the multitask progressive network registration method, with good results in both accuracy and speed.

Description

Rotation-invariant face detection method based on multitask progressive registration network
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a rotation-invariant face detection method based on a convolutional neural network.
Background
Images containing faces are indispensable to vision-based human-computer interaction: face detection supplies rich visual information for the intelligent analysis of a target and can be used to identify objects of interest in an image. Face detection has accordingly become an unavoidable, fundamental problem in image processing, computer vision and pattern recognition, and has attracted wide attention from researchers. Progress in face detection underpins many problems in computer vision and pattern recognition, such as face recognition, video tracking, head pose estimation and gender recognition.
Research on detecting human faces by computer vision has been under way for decades, yet the performance of many face detection algorithms still falls short of practical requirements. Compared with a controlled environment, faces in real scenes show far greater variation in appearance: in a controlled environment the face is essentially upright, with only slight geometric deformation of the head, whereas in real scenes the face pose is more complex, its most prominent characteristic being an uncertain in-plane rotation angle between the face and the imaging device. An important shortcoming of typical existing DCNN face detection networks is their poor robustness to such image rotation and scale changes.
Disclosure of Invention
In view of the above shortcomings of the prior art, the present invention provides a face detection method that is robust to changes of the in-plane rotation angle.
The technical scheme adopted by the invention to achieve this aim is as follows. A rotation-invariant face detection method based on a multitask progressive registration network comprises the following steps:
S1, preprocessing the images, and constructing and training a cascaded multilayer convolutional neural network;
S2, inputting a test image, generating a set of images at different resolutions by means of an image pyramid, and feeding the set into the cascaded multilayer convolutional neural network to start detection;
S3, each layer of the convolutional neural network filtering out part of the non-face windows, adjusting the candidate-box positions according to a box regression result, and simultaneously predicting the face rotation angle;
and S4, registering through an image flipping operation according to the predicted rotation angle, and judging the registered image to be a face image.
Further, the image preprocessing comprises (a minimal sketch of this augmentation appears after this list):
A1, rotating the WIDER FACE dataset images by arbitrary angles to generate a large number of face images containing in-plane rotation-angle changes, with the face position information rotated and changed correspondingly;
and A2, randomly rotating the LFW dataset images by arbitrary angles to generate a large number of face images containing in-plane rotation-angle changes, with the facial key-point position information rotated and changed correspondingly.
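For concreteness, the augmentation in A1 and A2 can be sketched as follows. This is a minimal illustration, assuming OpenCV-style images and (x, y) pixel-coordinate annotations; the function name and the loading of the WIDER FACE / LFW samples are placeholders, not part of the patent.

```python
import cv2
import numpy as np

def rotate_sample(image, points, angle_deg):
    """Rotate an image about its centre and apply the same affine
    transform to its (N, 2) array of annotation points (face box
    corners or facial key points)."""
    h, w = image.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle_deg, 1.0)
    rotated = cv2.warpAffine(image, m, (w, h))
    pts = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coords
    return rotated, pts @ m.T  # transformed points, shape (N, 2)

# One randomly rotated training sample (sample loading omitted):
# angle = np.random.uniform(-180.0, 180.0)
# aug_img, aug_pts = rotate_sample(img, box_corners, angle)
# For A1, an axis-aligned box would then be refitted to the rotated corners.
```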
Further, the cascaded multilayer convolutional neural network adopts a three-stage cascaded architecture: the first stage comprises 4 convolutional layers and 1 max-pooling layer, the second stage comprises 3 convolutional layers, 2 max-pooling layers and 2 fully connected layers, and the third stage comprises 4 convolutional layers, 3 max-pooling layers and 2 fully connected layers.
The invention has the following advantages and beneficial effects:
the invention mainly aims at the defect that the existing popular face detection method based on the deep convolutional neural network lacks robustness on image rotation change, and designs a rotation-invariant face detection method based on a multi-task progressive registration network. In the actual scene, the situation that the face area cannot be detected due to the fact that the face target and the imaging device have uncertain plane rotation angles may occur. The method adopts a multi-task learning mode to carry out knowledge transfer among the human face detection task, the human face key point detection task and the angle registration task, and realizes effective learning among all related tasks so as to obtain a practical human face detector with high efficiency, strong discriminative power and high robustness. In addition, the in-plane face image rotation angle information contained in the position coordinates of the key points of the face is fully considered, and in order to improve the robustness of key point detection when the face angle changes, the regression loss function of the key points of the face is redefined, and the tolerance capability of the algorithm to the in-plane rotation angle change is effectively improved. The method obtains better detection effect.
Drawings
Fig. 1 is a flowchart of the implementation of rotation-invariant face detection provided in an embodiment of the present invention;
Fig. 2 is a diagram of the first-stage network structure of the cascaded multilayer convolutional neural network provided in an embodiment of the present invention;
Fig. 3 is a diagram of the second-stage network structure of the cascaded multilayer convolutional neural network provided in an embodiment of the present invention;
Fig. 4 is a diagram of the third-stage network structure of the cascaded multilayer convolutional neural network provided in an embodiment of the present invention;
Fig. 5 is a diagram illustrating the effect of the rotation-invariant face detection provided by an embodiment of the present invention;
Fig. 6 is a flowchart of a specific implementation of step S4 of the rotation-invariant face detection method according to an embodiment of the present invention;
Fig. 7 is a diagram showing the label positions of the feature points on a face image.
Detailed Description
The embodiment of the invention is realized on the basis of cascaded multilayer convolutional neural networks: the image to be detected passes through the multilayer convolutional neural networks of each stage in turn, and each multilayer convolutional neural network performs the tasks of face classification, face candidate-box regression, facial key-point detection and angle recognition. Finally, registration is performed through an image flipping operation according to the predicted rotation angle, and the registered image is judged to be a face image.
In order to explain the technical solution of the present invention, the following description is made with reference to the accompanying drawings and specific examples.
Fig. 1 shows the implementation process of the rotation-invariant face detection provided in an embodiment of the present invention, detailed as follows:
S1, constructing and training a cascaded multilayer convolutional neural network;
S2, inputting a test image, generating a set of images at different resolutions by means of an image pyramid, and then feeding the set into the cascaded multilayer convolutional neural network to start detection;
S3, filtering out part of the non-face windows at each stage of the network, adjusting the candidate-box positions according to the box regression result, and simultaneously predicting the face rotation angle;
and S4, registering through an image flipping operation according to the predicted rotation angle.
The cascaded multilayer convolutional neural network adopts a three-stage cascaded architecture: each stage consists of a shallow convolutional neural network that simultaneously performs the face detection, angle recognition and key-point localization tasks, achieving good results in both speed and accuracy.
Further, step S1 constructs and trains a multitask convolutional neural network with a three-stage cascaded architecture, exploiting the relatedness among the tasks by combining face detection, angle recognition and key-point localization. The specific implementation steps are as follows:
The network structures of the first-, second- and third-stage networks are shown in Fig. 2, Fig. 3 and Fig. 4 respectively. Rotation-invariant face detection is decomposed into a face/non-face binary classification problem, a face-angle recognition problem and a face candidate-box regression problem: the network judges whether an input image is a face and drives the output detection box as close as possible to its ground truth. Specifically, the method comprises the following steps:
A. As shown in Fig. 2, the structure of the first-stage network is, from top to bottom: the first layer, a convolutional layer with 3 × 3 kernels, 16 kernels in total; the second layer, a max-pooling layer with a 2 × 2 pooling window; the third layer, a convolutional layer with 3 × 3 kernels, 32 kernels in total; the fourth layer, a convolutional layer with 3 × 3 kernels, 64 kernels in total; the fifth layer, four parallel sublayers each connected to the fourth layer, all convolutional layers with 1 × 1 kernels, whose supervision signals are, respectively: face/non-face classification information, face position information, facial key-point position information and face direction information (a PyTorch sketch of this stage appears after this list);
B. As shown in Fig. 3, the structure of the second-stage network is, from top to bottom: the first layer, a convolutional layer with 3 × 3 kernels, 24 kernels in total; the second layer, a max-pooling layer with a 3 × 3 pooling window; the third layer, a convolutional layer with 3 × 3 kernels, 48 kernels in total; the fourth layer, a max-pooling layer with a 3 × 3 pooling window; the fifth layer, a convolutional layer with 2 × 2 kernels, 96 kernels in total; the sixth layer, a fully connected layer with 196 neurons; the seventh layer, four parallel sublayers each connected to the sixth layer, all fully connected layers, whose supervision signals are, respectively: face/non-face classification information, face position information, facial key-point position information and face direction information;
C. As shown in Fig. 4, the structure of the third-stage network is, from top to bottom: the first layer, a convolutional layer with 3 × 3 kernels, 24 kernels in total; the second layer, a max-pooling layer with a 3 × 3 pooling window; the third layer, a convolutional layer with 3 × 3 kernels, 48 kernels in total; the fourth layer, a max-pooling layer with a 3 × 3 pooling window; the fifth layer, a convolutional layer with 2 × 2 kernels, 96 kernels in total; the sixth layer, a max-pooling layer with a 2 × 2 pooling window; the seventh layer, a convolutional layer with 2 × 2 kernels, 192 kernels in total; the eighth layer, a fully connected layer with 254 neurons; the ninth layer, three parallel sublayers each connected to the eighth layer, all fully connected layers, whose supervision signals are, respectively: face/non-face classification information, face position information and facial key-point position information;
D. In the testing stage, the first- and second-stage networks output only the face/non-face judgment f, the face candidate-box displacement t and the face direction g, while the third-stage network outputs only the face/non-face judgment f, the face candidate-box displacement t and the facial key-point positions p;
E. The convolutional neural network is trained with a stochastic gradient descent algorithm (a sketch of these losses appears after this list). The loss of the face/non-face binary classification task is computed with a cross-entropy function, as shown in Eq. (1):
$L_{cls} = y \log f + (1 - y) \log(1 - f)$ (1)
where y denotes the ground-truth face classification label.
Likewise, the loss of the angle recognition task is computed with a cross-entropy function, as shown in Eq. (2):
$L_{cal} = x \log g + (1 - x) \log(1 - g)$ (2)
where x denotes the ground-truth angle classification label.
The regression task of the face candidate box uses a Euclidean distance function, computed as:
$L_{box} = \lVert \hat{t} - t^{*} \rVert_2^2$ (3)
where $\hat{t}$ denotes the predicted box coordinates and $t^{*}$ denotes the coordinates of the ground-truth face position.
Finally, the in-plane rotation-angle information contained in the facial key-point coordinates is fully exploited in order to improve the robustness of key-point detection under face-angle changes. The regression loss function of the facial key points is therefore redefined as a function of the point-wise distance and the rotation angle (Eq. (4); the expression appears in the original only as an image and is not reproduced here), where N is the total number of training samples participating in the facial key-point task, d is the Euclidean distance between a predicted point and its ground-truth point, and θ is the rotation angle of the sample, satisfying θ ∈ [−45°, 45°].
F. In this embodiment, the public face datasets WIDER FACE and LFW are used as training sets. WIDER FACE contains 32,203 images with 393,703 annotated face detection boxes. Of the face data, 50% is used to train the face classification and candidate-box regression tasks, 40% is used as the test set, and the remaining 10% as the validation set. The LFW dataset is used to train the angle recognition and face alignment tasks.
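As a concrete illustration of the first-stage structure described in A, the following is a minimal PyTorch sketch. The kernel sizes and channel counts follow the text; the PReLU activations, the head output widths and the absence of padding are assumptions, since the patent does not specify them.

```python
import torch.nn as nn

class StageOneNet(nn.Module):
    """First-stage network: three backbone conv layers plus one
    max-pooling layer, ending in four parallel 1 x 1 convolutional
    heads supervised by the four signals listed in A."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3), nn.PReLU(),   # 12x12 -> 10x10
            nn.MaxPool2d(kernel_size=2, stride=2),         # -> 5x5
            nn.Conv2d(16, 32, kernel_size=3), nn.PReLU(),  # -> 3x3
            nn.Conv2d(32, 64, kernel_size=3), nn.PReLU(),  # -> 1x1
        )
        self.cls_head = nn.Conv2d(64, 2, kernel_size=1)   # face / non-face f
        self.box_head = nn.Conv2d(64, 4, kernel_size=1)   # box displacement t
        self.pts_head = nn.Conv2d(64, 10, kernel_size=1)  # 5 facial key points
        self.dir_head = nn.Conv2d(64, 2, kernel_size=1)   # face direction g

    def forward(self, x):
        feat = self.backbone(x)
        return (self.cls_head(feat), self.box_head(feat),
                self.pts_head(feat), self.dir_head(feat))
```

Because every layer is convolutional, the same network can slide over arbitrarily sized pyramid levels at test time, which is what keeps the coarse first-stage scan cheap.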
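The training losses in E can likewise be written out. This is a sketch under stated assumptions: Eqs. (1) and (2) are applied in their conventional negated (minimized) form, Eq. (3) is taken as the squared Euclidean distance, and, since Eq. (4) appears in the original only as an image, the angle-dependent weighting of the key-point loss is shown with a hypothetical stand-in.

```python
import torch
import torch.nn.functional as F

def cls_loss(f, y):
    """Eq. (1): cross entropy between the face score f and label y."""
    return F.binary_cross_entropy(f, y)

def cal_loss(g, x):
    """Eq. (2): cross entropy between the direction score g and label x."""
    return F.binary_cross_entropy(g, x)

def box_loss(t_pred, t_true):
    """Eq. (3): squared Euclidean distance to the ground-truth box t*."""
    return ((t_pred - t_true) ** 2).sum(dim=1).mean()

def landmark_loss(p_pred, p_true, theta_deg):
    """Eq. (4), sketched: the patent redefines this loss from the
    Euclidean distance d and the sample angle theta in [-45, 45];
    `angle_weight` below is a hypothetical stand-in for the
    unreproduced expression."""
    d = (p_pred - p_true).pow(2).sum(dim=1).sqrt()             # distance d
    angle_weight = 1.0 / torch.cos(torch.deg2rad(theta_deg))   # assumption
    return (angle_weight * d).mean()
```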
Further, in step S2 the image is fed into the cascaded multilayer convolutional neural network, which outputs the generated face candidate-box displacements, candidate-box scores, key-point positions and face rotation angles. The specific implementation steps are as follows (a sketch of the pyramid generation and of the angle-label assignment appears after this list):
A. The image to be tested is first scaled to generate an image pyramid. The input to the first-stage network is 12 × 12 × 3, where 3 indicates a 3-channel color input, i.e. an RGB image. For an input image, the first-stage network outputs the face candidate-box displacement t, the candidate-box score f and the face direction g. At this stage the face-angle recognition task is treated as a binary classification, face up versus face down, labeled 1 and 0 respectively.
The label value f₁ of the samples used to train the first-stage angle recognition is a piecewise function of the sample rotation angle θ (the original definition appears only as an image); samples with label values f₁ = 0 and f₁ = 1 participate in training the first-stage angle recognition.
B. The input to the second-stage network is 24 × 24 × 3; for an input image it outputs the face candidate-box displacement t, the candidate-box score f and the face direction g. At this stage the face-angle recognition task is treated as a three-way classification, face left, face up and face right, labeled 0, 1 and 2 respectively.
The label value f₂ of the samples used to train the second-stage angle recognition is likewise a piecewise function of the sample rotation angle θ (the original definition appears only as an image); samples with label values f₂ = 0, 1 and 2 participate in training the second-stage angle recognition task.
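A minimal sketch of the pyramid generation in A and the piecewise label assignment in A and B follows. The pyramid scale factor and the angle thresholds are assumptions (the patent gives the label definitions only as images); the thresholds below simply split the angle range evenly among the classes.

```python
import cv2

def image_pyramid(image, min_size=12, scale=0.709):
    """Downscale repeatedly until the shorter side falls below the
    12-pixel first-stage input; the 0.709 factor is an assumption."""
    pyramid = []
    h, w = image.shape[:2]
    while min(h, w) >= min_size:
        pyramid.append(cv2.resize(image, (w, h)))  # dsize is (width, height)
        h, w = int(h * scale), int(w * scale)
    return pyramid

def stage1_label(theta):
    """Binary up/down label f1 from the rotation angle theta (degrees);
    the +-90 degree split is an assumed stand-in."""
    return 1 if -90.0 <= theta <= 90.0 else 0   # 1 = face up, 0 = face down

def stage2_label(theta):
    """Three-way label f2 on the reduced range [-90, 90]:
    0 = left, 1 = up, 2 = right (thresholds and sign convention assumed)."""
    if theta < -45.0:
        return 0   # face toward the left
    if theta > 45.0:
        return 2   # face toward the right
    return 1       # face up
```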
Further, in step S3 the face rotation angle output by the network is registered by flipping the image. The specific implementation steps are as follows:
A. In step S2, after the image to be detected passes through the first-stage network, a face direction score g is generated. The corresponding in-plane rotation angle is
$p = \begin{cases} 0^{\circ}, & g \ge 0.5 \\ 180^{\circ}, & g < 0.5 \end{cases}$
where 0° denotes face up and 180° denotes face down (the threshold form is reconstructed from the binary direction score; the original formula appears only as an image).
B. When the rotation angle p is 0°, the image is not flipped; when p is 180°, the image is flipped by 180°. The range of the in-plane face rotation angle is thereby narrowed from [−180°, 180°] to [−90°, 90°]. The flipping operation is simple and computationally cheap, enabling efficient and fast in-plane registration of the face image (a sketch of these flips appears after this list);
C. In step S2, after the image to be detected passes through the second-stage network, a face direction score g is generated and converted into a direction label according to Eq. (5):
$id = \arg\max_{i} g_i, \quad i \in \{0, 1, 2\}$ (5)
where id denotes the direction label and $g_0$, $g_1$, $g_2$ denote the scores for the face pointing left, up and right, respectively.
The corresponding in-plane rotation angle is
$p = \begin{cases} 90^{\circ}, & id = 0 \\ 0^{\circ}, & id = 1 \\ -90^{\circ}, & id = 2 \end{cases}$
where 0° denotes face up, 90° denotes face left, and −90° denotes face right.
D. When the rotation angle p is 0°, the image is not rotated; when p is 90°, the image is rotated 90° to the right; when p is −90°, the image is rotated 90° to the left. The range of the in-plane face rotation angle is thereby narrowed from [−90°, 90°] to [−45°, 45°].
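The registration in B and D reduces to exact 90° and 180° flips of the candidate image, which can be sketched as follows; `cv2.rotate` performs these rotations without interpolation, consistent with the low computational cost claimed above.

```python
import cv2

def register_stage1(image, p):
    """After the first stage: flip upside-down faces (p == 180) upright."""
    return cv2.rotate(image, cv2.ROTATE_180) if p == 180 else image

def register_stage2(image, p):
    """After the second stage: remove the coarse rotation p so the
    residual in-plane angle lies within [-45, 45] degrees."""
    if p == 90:
        return cv2.rotate(image, cv2.ROTATE_90_CLOCKWISE)         # rightward
    if p == -90:
        return cv2.rotate(image, cv2.ROTATE_90_COUNTERCLOCKWISE)  # leftward
    return image
```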
Further, in step S2 the image is fed into the last-stage multitask convolutional neural network, which outputs the face candidate-box position t, the candidate-box score f and the key-point positions p. The specific implementation steps are as follows:
A. The input to the last-stage network is 48 × 48 × 3. Unlike the first- and second-stage networks, the third-stage network outputs, for an input image, the face candidate-box displacement t, the candidate-box score f and the facial key-point positions p.
B. The purpose of passing the image to be detected through the first-stage network is to rapidly generate candidate windows with a fully convolutional network while predicting the image rotation angle in a relatively coarse manner. The purpose of the second-stage network is to further refine the candidate windows produced by the first stage with a more complex convolutional neural network, discarding a large number of overlapping windows and again predicting the image rotation angle.
Further, step S4 computes a rotation angle from the geometric relationship between the face candidate-box position output in step S2 and the facial key points, and registers by flipping the image to obtain the detected face image. The specific implementation is as follows:
A. As shown in Fig. 6(a), the image to be detected undergoes face detection and key-point localization through the third-stage network; the output image of the third-stage network is judged to be a face image, and the in-plane face rotation angle now lies in the range [−45°, 45°].
B. As shown in Fig. 6(b), it is known that the eyes lie closer to the top of the head than the other key points do. Based on this prior knowledge, the forward direction of the face detection box is first determined by computing the distances from the left and right eyes to the four edges of the bounding box.
C. As shown in Fig. 7, in a standard upright face image, the angle α formed by the line between the left eye and the nose tip equals the angle β formed by the line between the nose tip and the top of the head. As shown in Fig. 6(d), using the geometric relationship between the face detection box and the key points, the rotation angle of the face image is computed as θ = (α − β) ÷ 2 (a sketch of this computation follows).
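The θ = (α − β) ÷ 2 computation can be sketched as below. The exact construction of α and β is not spelled out in the text; the reading used here, one of several consistent with the stated equality α = β for an upright face, measures the nose-tip to left-eye and nose-tip to right-eye directions against the nose-tip to head-top axis derived from the box's forward edge, so that a roll of the face shifts both angles and (α − β) / 2 recovers it. All names below are ours, not the patent's.

```python
import numpy as np

def signed_angle(v1, v2):
    """Signed angle in degrees from vector v1 to vector v2."""
    a = np.degrees(np.arctan2(v2[1], v2[0]) - np.arctan2(v1[1], v1[0]))
    return (a + 180.0) % 360.0 - 180.0          # wrap into (-180, 180]

def residual_roll(left_eye, right_eye, nose_tip, head_top):
    """theta = (alpha - beta) / 2 under the assumed reading of Fig. 7:
    head_top is taken on the forward edge of the detection box, so it
    does not rotate with the face, while the eyes do."""
    nose = np.asarray(nose_tip, dtype=float)
    up = np.asarray(head_top, dtype=float) - nose           # box-forward axis
    alpha = signed_angle(up, np.asarray(left_eye, float) - nose)
    beta = signed_angle(np.asarray(right_eye, float) - nose, up)
    return (alpha - beta) / 2.0
```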
The rotation-invariant face detection results provided by the embodiment of the present invention are shown in Fig. 5.

Claims (4)

1. A rotation-invariant face detection method based on a multitask progressive registration network, characterized by comprising the following steps:
S1, preprocessing an image, and constructing and training a cascaded multilayer convolutional neural network;
S2, inputting a test image, generating a set of images at different resolutions by means of an image pyramid, and feeding the set into the cascaded multilayer convolutional neural network to start detection;
S3, each layer of the convolutional neural network filtering out part of the non-face windows, adjusting the candidate-box positions according to a box regression result, and simultaneously predicting the face rotation angle, wherein the box regression result is embodied through the regression loss of the facial key points;
the facial key-point regression loss is a function of d and θ (the expression is given in the original only as an image), where d is the Euclidean distance between a predicted point and its ground-truth point, and θ is the rotation angle of the sample, satisfying θ ∈ [−45°, 45°];
and S4, registering through an image flipping operation according to the predicted rotation angle, and judging the registered image to be a face image.
2. The rotation-invariant face detection method based on the multitask progressive registration network according to claim 1, characterized in that the image preprocessing comprises:
A1, randomly rotating face images by arbitrary angles to generate a large number of face images containing in-plane rotation-angle changes, with the face position information rotated and changed correspondingly;
and A2, randomly rotating facial key-point images by arbitrary angles to generate a large number of facial key-point images containing in-plane rotation-angle changes, with the facial key-point position information rotated and changed correspondingly.
3. The rotation-invariant face detection method based on the multitask progressive registration network according to claim 1 or 2, characterized in that the cascaded multilayer convolutional neural network adopts a three-stage cascaded structure, wherein the first stage comprises 4 convolutional layers and 1 max-pooling layer, the second stage comprises 3 convolutional layers, 2 max-pooling layers and 2 fully connected layers, and the third stage comprises 4 convolutional layers, 3 max-pooling layers and 2 fully connected layers.
4. The rotation-invariant face detection method based on the multitask progressive registration network according to claim 3, characterized in that, in the three-stage cascaded multilayer convolutional neural network:
the label value f₁ of the samples used to train the first-stage network angle recognition is a piecewise function of the sample rotation angle θ (the expression is given in the original only as an image); samples with label values f₁ = 0 and f₁ = 1 participate in training the first-stage network angle recognition;
the label value f₂ of the samples used to train the second-stage network angle recognition is likewise a piecewise function of the sample rotation angle θ (the expression is given in the original only as an image); samples with label values f₂ = 0, 1 and 2 participate in training the second-stage network angle recognition task.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201910590187.8A | 2019-07-02 | 2019-07-02 | Rotation-invariant face detection method based on multitask progressive registration network

Publications (2)

Publication Number | Publication Date
CN110458005A | 2019-11-15
CN110458005B | 2022-12-27






Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant