CN110956082B - Face key point detection method and detection system based on deep learning - Google Patents

Face key point detection method and detection system based on deep learning

Info

Publication number
CN110956082B
CN110956082B (application CN201910986156.4A)
Authority
CN
China
Prior art keywords
key point
face
training
image
point detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910986156.4A
Other languages
Chinese (zh)
Other versions
CN110956082A (en)
Inventor
马国军
马道懿
朱琎
唐跃
曾庆军
夏健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University of Science and Technology
Original Assignee
Jiangsu University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University of Science and Technology filed Critical Jiangsu University of Science and Technology
Priority to CN201910986156.4A priority Critical patent/CN110956082B/en
Publication of CN110956082A publication Critical patent/CN110956082A/en
Application granted granted Critical
Publication of CN110956082B publication Critical patent/CN110956082B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention discloses a face key point detection method and detection system based on deep learning, wherein the detection method comprises the following steps: 1. constructing a face key point detection network based on the MobileNet V1 architecture; the input of the face key point detection network is a face image, and its output is the face key point coordinates; 2. training the face key point detection network constructed in step 1 with the Facial Key-points Detection face data set on Kaggle as sample images; 3. collecting an image of the face to be detected and acquiring the face region of the collected image with an OpenCV cascade classifier; 4. inputting the face region acquired in step 3 into the trained face key point detection network to obtain the coordinates of the left eye and the nose tip; 5. calculating the coordinates of the right eye and the left and right mouth corners of the face in the image to be detected, and labeling the key points. The method achieves high detection accuracy with a small-scale deep learning network.

Description

Face key point detection method and detection system based on deep learning
Technical Field
The invention belongs to the technical field of face key point detection, and particularly relates to a method and a system for detecting face key points by applying deep learning.
Background
Face detection technology emerged together with face recognition technology in the 1960s and 1970s and has been developed for more than half a century since then. Many researchers studied the problem during this period and built face key point detection systems of different types and modes. However, until the end of the last century, many systems obtained the position of the face through external auxiliary sensors, or could not easily capture the movement of the face; the technology was therefore immature, and the development of face detection technology was slow in this period.
With the development of society, and especially the rapid development of computer technology and the wide application of biometric identification over the last two decades, face detection technology is no longer used only for face identification. Its application range has grown from pure scientific research to commerce, military and security, and attention to the technology has reached a new level. The convolutional neural network (CNN) plays an irreplaceable role in this.
A convolutional neural network is a feedforward network whose artificial neurons cover units within a local receptive field; it typically contains convolutional layers and pooling layers. At present the convolutional neural network (CNN) has been widely applied in the computer vision domain and has achieved good results. From the performance of CNNs in the ImageNet competition in recent years, it can be seen that, in pursuit of classification accuracy, models have become ever deeper and more complex; the deep residual network, for example, has up to 152 layers.
However, in some real application scenarios, such as mobile or embedded devices, such large and complex models are difficult to apply. First, the models are too large and face the problem of insufficient memory; second, these scenarios require low latency or fast response — for example, the pedestrian detection system of an autonomous car has strict response-time requirements. Therefore, it is very important to construct a small-scale and efficient CNN model.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to provide a face key point detection method that achieves high detection accuracy with a small-scale deep learning network.
The technical scheme is as follows: the invention discloses a face key point detection method based on deep learning, which comprises a training stage and a detection stage, wherein the training stage comprises the following steps:
(1) Constructing a face key point detection network based on a MobileNet V1 architecture; the input of the human face key point detection network is a human face image, and the output is a human face key point coordinate; the key points of the human face are coordinates of the left eye and the nose tip;
the face key point detection network comprises 16 layers connected in sequence, wherein the first layer is a common convolutional layer; the second to thirteenth layers are 6 depth separable convolution units in which depth convolution layers and point-by-point convolution layers are alternately connected; the fourteenth layer is a pooling layer, the fifteenth layer is a fully connected layer, and the sixteenth layer is a softmax layer;
(2) Training the face Key point Detection network constructed in the step (1) by taking a face data set of Facial Key-points Detection on Kaggle as a sample image to obtain a trained face Key point Detection network;
the detection stage comprises the following steps:
(3) Collecting an image of the face to be detected; acquiring the face region of the collected image by adopting an OpenCV cascade classifier;
(4) Inputting the face region acquired in the step (3) into a trained face key point detection network to obtain coordinates of a left eye and a nose tip;
(5) Calculating the coordinates of the right eye and the left and right mouth corners of the face in the image to be detected, and labeling the key points;
X_right_eye = 2·X_nose − X_left_eye,  Y_right_eye = Y_left_eye
X_left_mouth = X_left_eye,  Y_left_mouth = 2·Y_nose − Y_left_eye
X_right_mouth = 2·X_nose − X_left_eye,  Y_right_mouth = 2·Y_nose − Y_left_eye
wherein (X_left_eye, Y_left_eye), (X_nose, Y_nose), (X_right_eye, Y_right_eye), (X_left_mouth, Y_left_mouth), (X_right_mouth, Y_right_mouth) are the coordinates of the left eye, the nose tip, the right eye, the left mouth corner and the right mouth corner, respectively.
Further, the step (2) further comprises preprocessing the training sample, including:
(2.1) rejecting training samples with missing data;
(2.2) standardizing the training samples: the normalized gray value g'_i of pixel i in sample image A is:
g'_i = (g_i − g_min) / (g_max − g_min)
wherein g_i is the gray value of pixel i in sample image A before normalization, g_min is the minimum pixel gray level over all sample images, and g_max is the maximum pixel gray level over all sample images.
In order to reduce the dependence of the network on initialization, in the first to thirteenth layers of the face key point detection network, the input of each layer is batch-normalized (Batch Normalization) before the convolution operation is performed.
In order to prevent overfitting, an early-stopping method is adopted for the training of step (2): the sample images are divided into a training set and a verification set, the face key point detection network is trained on the training set, and the error of the face key point detection network is evaluated on the verification set;
the training steps are as follows:
(3.1) initializing the error of the optimal verification set into the error of the verification set after the first training is finished, and initializing the error training times of the optimal verification set to 0;
(3.2) after each training is finished, verifying the error of the face key point detection network by using a verification set, and comparing the error of the current verification set with the error of the optimal verification set; if the error of the current verification set is smaller than the error of the optimal verification set, updating the error of the optimal verification set into the error of the current verification set, and setting the error training times of the optimal verification set to be 0; if the error of the current verification set is larger than the error of the optimal verification set, adding one to the error training times of the optimal verification set;
and (3.3) stopping training if the error training times of the optimal verification set reach a preset error training time threshold of the optimal verification set or the total training times reach a preset total training time threshold, and taking the face key point detection network parameters corresponding to the errors of the optimal verification set as training results.
On the other hand, the invention discloses a detection system for realizing the human face key point detection method, which comprises an image acquisition module, a human face detection module and a human face key point detection and labeling module;
the image acquisition module is used for acquiring an image of a face to be detected;
the face detection module is used for acquiring a face area in an image of a face to be detected;
the face key point detection and labeling module is used for acquiring and labeling coordinates of a left eye, a nose tip, a right eye, a left mouth corner and a right mouth corner in a face region.
The face key point detection and labeling module is a face key point detection network based on the MobileNet V1 architecture; the input of the face key point detection network is a face image, and its output is the face key point coordinates; the face key point detection network comprises 16 layers connected in sequence, wherein the first layer is a convolutional layer; the second to thirteenth layers are 6 depth separable convolution units in which depth convolution layers and point-by-point convolution layers are alternately connected; the fourteenth layer is a pooling layer, the fifteenth layer is a fully connected layer, and the sixteenth layer is a softmax layer.
The face key point detection and labeling module is a computer provided with an NVIDIA GTX 1080 GPU.
Beneficial effects: compared with the prior art, the face key point detection method based on deep learning disclosed by the invention has the following advantages: 1. compared with a common convolutional neural network, the constructed face key point detection network based on the MobileNet V1 architecture has fewer model parameters for the same detection effect; meanwhile, only 2 face key points are detected directly, which further speeds up training and detection; 2. the coordinates of the other feature points can be obtained from the 2 detected key points.
Drawings
FIG. 1 is a flow chart of a face key point detection method disclosed by the present invention;
FIG. 2 is a schematic structural diagram of a face key point detection network constructed by the present invention;
FIG. 3 is a schematic diagram of the operation of depth convolution and point-by-point convolution;
fig. 4 is a schematic structural diagram of a face key point detection system disclosed in the present invention.
Detailed Description
The invention is further elucidated with reference to the drawings and the detailed description.
As shown in fig. 1, the invention discloses a method for detecting key points of a human face based on deep learning, which comprises a training stage and a detection stage, wherein the training stage comprises the following steps:
step 1, constructing a face key point detection network based on a MobileNet V1 architecture; the input of the human face key point detection network is a human face image, and the output is a human face key point coordinate; the key points of the human face are coordinates of the left eye and the nose tip;
as shown in fig. 2, the face key point detection network includes 16 layers of networks connected in sequence, where the first layer is a common convolutional layer; the second layer to the tenth layer are 6 depth separable convolution units in which depth convolution layers and point-by-point convolution layers are alternately connected; the fourteenth layer is a pooling layer, the fifteenth layer is a full-link layer, and the sixteenth layer is a softmax layer.
The 1-13 layers are feature extraction sub-networks and are used for extracting local features of the image; and the 14-16 layers are feature classification sub-networks, and face key points are obtained according to local features.
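The parameter savings of the depth separable units over standard convolutions can be illustrated with a short count (a sketch with our own function names; the 3 × 3, 64 → 128-channel example is illustrative and not taken from the patent):

```python
def standard_conv_params(f, c_in, c_out):
    """Weights of a standard f x f convolution mapping c_in -> c_out channels
    (biases omitted)."""
    return f * f * c_in * c_out

def separable_conv_params(f, c_in, c_out):
    """Weights of a depth separable unit: one f x f x 1 depth convolution
    kernel per input channel, plus a 1 x 1 x c_in point-by-point kernel
    per output channel."""
    return f * f * c_in + c_in * c_out

# Illustrative 3x3 convolution, 64 -> 128 channels
std = standard_conv_params(3, 64, 128)   # 73728 weights
sep = separable_conv_params(3, 64, 128)  # 8768 weights, roughly 1/8 of std
```

This factor-of-roughly-F² reduction is what lets the MobileNet V1-style network stay small while keeping the detection effect.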
The depth convolution is calculated as:
Step one: splitting an N-channel image of size M × M × N into N single-channel images of size M × M × 1;
Step two: defining 1 convolution kernel of size F × F × 1 for each single-channel image (F is usually an odd number); N kernels need to be defined in total;
Step three: in order to make the convolved output image and the input image equal in size, padding is required, and the size of the padding for the upper, lower, left and right sides is P = (F − 1)/2.
Step four: convolving each pixel point in each single-channel image with the convolution kernel of the corresponding single channel; letting F = 2a + 1 (since the size of the convolution kernel is usually odd), the calculation formula is as follows:
g(x, y) = Σ_{i=−a}^{a} Σ_{j=−a}^{a} w(i, j) f(x + i, y + j)
where w(i, j) is a coefficient of the F × F × 1 convolution kernel, f(x, y) is any pixel in the image, and g(x, y) is the convolution output at f(x, y); x and y are varied so that each coefficient of w visits each pixel of f.
The calculation of the point-by-point convolution is:
the results of the N depth convolution operations are added with a 1 × 1 × N convolution (a weighted sum across channels) to obtain an output of size M × M × 1.
Fig. 3 illustrates the depth separable convolution of a three-channel image using 3 × 3 convolution kernels. The three-channel image is split into three single-channel images, and a convolution operation is then performed on each single channel. At any point f(x, y) in a single-channel image, the output g(x, y) is the sum of the products of the convolution kernel coefficients and the image pixels covered by the kernel:
g(x, y) = w(−1,−1)f(x−1, y−1) + w(−1,0)f(x−1, y) + … + w(1,1)f(x+1, y+1)
The three single-channel outputs g(x, y) are then summed by a 1 × 1 × 3 convolution kernel to give the final output z(x, y).
In order to reduce the dependence of the network on initialization, the input of each layer in the first to thirteenth layers of the face key point detection network is batch-normalized (Batch Normalization) before the convolution operation. Meanwhile, a ReLU (rectified linear) activation function is applied to the output of each convolutional layer.
Step 2, taking the Facial Key-points Detection face data set on Kaggle as sample images, and training the face key point detection network constructed in step 1 to obtain a trained face key point detection network;
firstly, training samples are preprocessed, and the preprocessing comprises the following steps:
(2.1) eliminating training samples with data missing;
(2.2) standardizing the training samples: the normalized gray value g'_i of pixel i in sample image A is:
g'_i = (g_i − g_min) / (g_max − g_min)
wherein g_i is the gray value of pixel i in sample image A before normalization, g_min is the minimum pixel gray level over all sample images, and g_max is the maximum pixel gray level over all sample images.
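The standardization of step (2.2) can be written compactly in NumPy, with the global minimum and maximum taken over all sample images (a sketch with our own function name):

```python
import numpy as np

def normalize_samples(images):
    """Min-max normalize pixel gray values using the minimum and maximum
    over ALL sample images, as in step (2.2). `images` is a list of
    equally sized gray-level arrays; returns a float array in [0, 1]."""
    stack = np.stack(images).astype(np.float64)
    g_min, g_max = stack.min(), stack.max()
    return (stack - g_min) / (g_max - g_min)
```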
The method trains the face key point detection network with the Adam (adaptive moment estimation) stochastic gradient descent algorithm to minimize the loss function; the loss function is the mean square error.
In order to prevent overfitting, an early-stopping method is adopted for the training of step (2): the sample images are divided into a training set and a verification set, the face key point detection network is trained on the training set, and the error of the face key point detection network is evaluated on the verification set;
the training steps are as follows:
(3.1) initializing the error of the optimal verification set to be the error of the verification set after the first training is finished, and initializing the error training times of the optimal verification set to be 0;
(3.2) after each training is finished, verifying errors of the face key point detection network by using a verification set, and comparing the current verification set errors with the optimal verification set errors; if the error of the current verification set is smaller than the error of the optimal verification set, updating the error of the optimal verification set into the error of the current verification set, and setting the error training times of the optimal verification set to be 0; if the error of the current verification set is larger than the error of the optimal verification set, adding one to the error training times of the optimal verification set;
and (3.3) stopping training if the error training times of the optimal verification set reach a preset error training time threshold of the optimal verification set or the total training times reach a preset total training time threshold, and taking the face key point detection network parameters corresponding to the errors of the optimal verification set as training results.
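Steps (3.1)-(3.3) amount to the standard early-stopping loop, which can be sketched as follows (the function names and callback interface are our own, not the patent's):

```python
def train_with_early_stopping(train_step, validate, patience, max_epochs):
    """Early-stopping loop matching steps (3.1)-(3.3): track the best
    verification-set error, reset the counter whenever it improves, and
    stop after `patience` epochs without improvement or `max_epochs`
    epochs in total. `train_step()` runs one training pass; `validate()`
    returns the current verification-set error.
    Returns (best_error, epoch at which it occurred)."""
    best_err, best_epoch, since_best = float("inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        train_step()
        err = validate()
        if err < best_err:
            best_err, best_epoch, since_best = err, epoch, 0
        else:
            since_best += 1
        if since_best >= patience:
            break
    return best_err, best_epoch
```

The network parameters saved at `best_epoch` would then be taken as the training result, as in step (3.3).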
The detection stage comprises the following steps:
step 3, collecting the image of the face to be detected; acquiring a face region of an acquired image to be detected by adopting an OpenCV (open visual constant value) cascade classifier;
in the present invention, the face area resize to be acquired is 96 × 96 × 1.
Step 4, inputting the face region obtained in the step 3 into a trained face key point detection network to obtain coordinates of a left eye and a nose tip;
step 5, calculating coordinates of the right eye and the left and right mouth angles of the human face in the image to be detected, and labeling key points;
the longitudinal distance from the left eye to the nose tip is found to be equal to the longitudinal distance from the left mouth corner to the nose tip through statistics, and the abscissa of the left mouth corner is equal to the abscissa of the left eye; the longitudinal distance from the right eye to the tip of the nose and the longitudinal distance from the right mouth corner to the tip of the nose are equal and the abscissa of the right mouth corner and the right eye is equal. Obtaining the coordinates of other key points on the face according to the corresponding relation, wherein the calculation formula is as follows:
X_right_eye = 2·X_nose − X_left_eye,  Y_right_eye = Y_left_eye
X_left_mouth = X_left_eye,  Y_left_mouth = 2·Y_nose − Y_left_eye
X_right_mouth = 2·X_nose − X_left_eye,  Y_right_mouth = 2·Y_nose − Y_left_eye    (1)
wherein (X_left_eye, Y_left_eye), (X_nose, Y_nose), (X_right_eye, Y_right_eye), (X_left_mouth, Y_left_mouth), (X_right_mouth, Y_right_mouth) are the coordinates of the left eye, the nose tip, the right eye, the left mouth corner and the right mouth corner, respectively.
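The symmetry relations above can be sketched as a small helper function (the names are our own, not from the patent):

```python
def infer_keypoints(left_eye, nose):
    """Derive the right-eye and mouth-corner coordinates from the detected
    left eye and nose tip using the mirror symmetry described above.
    All arguments and results are (x, y) tuples in image coordinates."""
    x_le, y_le = left_eye
    x_n, y_n = nose
    right_eye   = (2 * x_n - x_le, y_le)         # mirror left eye about the nose abscissa
    left_mouth  = (x_le, 2 * y_n - y_le)         # mirror left eye about the nose ordinate
    right_mouth = (right_eye[0], left_mouth[1])  # shares x with right eye, y with left mouth
    return right_eye, left_mouth, right_mouth
```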
As shown in fig. 4, a detection system for implementing the face key point detection method includes an image acquisition module, a face detection module, and a face key point detection and labeling module;
the image acquisition module is used for acquiring an image of a human face to be detected; in the invention, a monocular camera is adopted to acquire an image of a face to be detected, and the acquired image is preprocessed and converted into a gray image;
the face detection module is used for acquiring a face area in an image of a face to be detected and framing the face area by using a rectangular frame;
the face key point detection and labeling module is used for acquiring and labeling coordinates of a left eye, a nose tip, a right eye, a left mouth angle and a right mouth angle in a face region.
The face key point detection and labeling module is a face key point detection network based on the MobileNet V1 architecture; the input of the face key point detection network is a face image, and its output is the coordinates of the left eye and the nose tip; the face key point detection network comprises 16 layers connected in sequence, wherein the first layer is a convolutional layer; the second to thirteenth layers are 6 depth separable convolution units in which depth convolution layers and point-by-point convolution layers are alternately connected; the fourteenth layer is a pooling layer, the fifteenth layer is a fully connected layer, and the sixteenth layer is a softmax layer.
The coordinates of the right eye, the left mouth corner and the right mouth corner are calculated according to formula (1).
The face key point detection and labeling module is a computer provided with an NVIDIA GTX 1080 GPU.

Claims (1)

1. The method for detecting the key points of the human face based on deep learning is characterized by comprising a training stage and a detection stage, wherein the training stage comprises the following steps:
(1) Constructing a face key point detection network based on a MobileNet V1 architecture; the input of the human face key point detection network is a human face image, and the output is a human face key point coordinate; the key points of the human face are coordinates of the left eye and the nose tip;
the face key point detection network comprises 16 layers connected in sequence, wherein the first layer is a common convolution layer; the second to thirteenth layers are 6 depth separable convolution units in which depth convolution layers and point-by-point convolution layers are alternately connected; the fourteenth layer is a pooling layer, the fifteenth layer is a fully connected layer, and the sixteenth layer is a softmax layer; layers 1-13 are the feature extraction sub-network, used to extract local features of the image; layers 14-16 are the feature classification sub-network, which obtains the face key points from the local features;
the depth convolution is calculated as:
step one: splitting an N-channel image of size M × M × N into N single-channel images of size M × M × 1;
step two: defining 1 convolution kernel of size F × F × 1 for each single-channel image, wherein F is an odd number; N kernels need to be defined in total;
step three: in order to make the size of the convolved output image equal to that of the input image, padding is needed, and the size of the padding is P = (F − 1)/2;
step four: convolving each pixel point in each single-channel image with the convolution kernel of the corresponding single channel; letting F = 2a + 1 (since the size of the convolution kernel is odd), the calculation formula is as follows:
g(x, y) = Σ_{i=−a}^{a} Σ_{j=−a}^{a} w(i, j) f(x + i, y + j)
wherein w(i, j) is a coefficient of the F × F × 1 convolution kernel, f(x, y) is any pixel in the image, and g(x, y) is the convolution output at f(x, y); x and y are varied so that each coefficient of w visits each pixel of f;
the calculation of the point-by-point convolution is:
adding the results of the N depth convolution operations by using a 1 × 1 × N convolution to obtain an output of size M × M × 1;
firstly, converting a three-channel image into three single-channel images, and then performing convolution operation on each single channel; at any point f (x, y) in a single channel image, the output g (x, y) is the sum of the products of the convolution kernel coefficients and the image pixels surrounded by the convolution kernel:
g(x, y) = w(−1,−1)f(x−1, y−1) + w(−1,0)f(x−1, y) + … + w(1,1)f(x+1, y+1)
then adding up the three single-channel image outputs g(x, y) by a 1 × 1 × 3 convolution kernel to obtain the final output z(x, y);
(2) Training the face Key point Detection network constructed in the step (1) by taking a face data set of Facial Key-points Detection on Kaggle as a sample image to obtain a trained face Key point Detection network;
the detection stage comprises the following steps:
(3) Collecting an image of the face to be detected; acquiring the face region of the collected image by adopting an OpenCV cascade classifier;
(4) Inputting the face region acquired in the step (3) into a trained face key point detection network to obtain coordinates of a left eye and a nose tip;
(5) Calculating the coordinates of the right eye and the left and right mouth corners of the face in the image to be detected, and labeling the key points;
X_right_eye = 2·X_nose − X_left_eye,  Y_right_eye = Y_left_eye
X_left_mouth = X_left_eye,  Y_left_mouth = 2·Y_nose − Y_left_eye
X_right_mouth = 2·X_nose − X_left_eye,  Y_right_mouth = 2·Y_nose − Y_left_eye
wherein (X_left_eye, Y_left_eye), (X_nose, Y_nose), (X_right_eye, Y_right_eye), (X_left_mouth, Y_left_mouth), (X_right_mouth, Y_right_mouth) are the coordinates of the left eye, the nose tip, the right eye, the left mouth corner and the right mouth corner, respectively;
the step (2) further comprises preprocessing a training sample, and the preprocessing comprises the following steps:
(2.1) rejecting training samples with missing data;
(2.2) standardizing the training samples: the normalized gray value g'_i of pixel i in sample image A is:
g'_i = (g_i − g_min) / (g_max − g_min)
wherein g_i is the gray value of pixel i in sample image A before normalization, g_min is the minimum pixel gray level over all sample images, and g_max is the maximum pixel gray level over all sample images;
in the first to thirteenth layers of the face key point detection network, the input of each layer is batch-normalized (Batch Normalization) before the convolution operation is performed;
in the step (2), the sample image is divided into a training set and a verification set, a face key point detection network is trained on the training set, and the error of the face key point detection network is verified by using the verification set; the training steps are as follows:
(3.1) initializing the error of the optimal verification set to be the error of the verification set after the first training is finished, and initializing the error training times of the optimal verification set to be 0;
(3.2) after each training is finished, verifying errors of the face key point detection network by using a verification set, and comparing the current verification set errors with the optimal verification set errors; if the error of the current verification set is smaller than the error of the optimal verification set, updating the error of the optimal verification set into the error of the current verification set, and setting the error training times of the optimal verification set to be 0; if the error of the current verification set is larger than the error of the optimal verification set, adding one to the error training times of the optimal verification set;
(3.3) if the error training times of the optimal verification set reach a preset error training time threshold of the optimal verification set or the total training times reach a preset total training time threshold, stopping training and taking the face key point detection network parameters corresponding to the errors of the optimal verification set as training results;
and when the error of the face key point detection network on the verification set becomes larger than that of the previous training result, training is stopped and the parameters of the previous training result are adopted as the final parameters of the face key point detection network.
CN201910986156.4A 2019-10-17 2019-10-17 Face key point detection method and detection system based on deep learning Active CN110956082B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910986156.4A CN110956082B (en) 2019-10-17 2019-10-17 Face key point detection method and detection system based on deep learning

Publications (2)

Publication Number Publication Date
CN110956082A CN110956082A (en) 2020-04-03
CN110956082B true CN110956082B (en) 2023-03-24

Family

ID=69975622

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910986156.4A Active CN110956082B (en) 2019-10-17 2019-10-17 Face key point detection method and detection system based on deep learning

Country Status (1)

Country Link
CN (1) CN110956082B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112200008A (en) * 2020-09-15 2021-01-08 青岛邃智信息科技有限公司 Face attribute recognition method in community monitoring scene
CN112597888B (en) * 2020-12-22 2024-03-08 西北工业大学 Online education scene student attention recognition method aiming at CPU operation optimization
CN112883854A (en) * 2021-02-04 2021-06-01 新绎健康科技有限公司 Facial region positioning method and system based on deep learning
CN113011356A (en) * 2021-03-26 2021-06-22 杭州朗和科技有限公司 Face feature detection method, device, medium and electronic equipment
CN113822256B (en) * 2021-11-24 2022-03-25 北京的卢深视科技有限公司 Face recognition method, electronic device and storage medium
CN114881893B (en) * 2022-07-05 2022-10-21 腾讯科技(深圳)有限公司 Image processing method, device, equipment and computer readable storage medium

Citations (7)

Publication number Priority date Publication date Assignee Title
CN105678248A (en) * 2015-12-31 2016-06-15 上海科技大学 Face key point alignment algorithm based on deep learning
WO2017031089A1 (en) * 2015-08-15 2017-02-23 Eyefluence, Inc. Systems and methods for biomechanically-based eye signals for interacting with real and virtual objects
CN108229442A (en) * 2018-02-07 2018-06-29 西南科技大学 Face fast and stable detection method in image sequence based on MS-KCF
CN108898112A (en) * 2018-07-03 2018-11-27 东北大学 A kind of near-infrared human face in-vivo detection method and system
CN109344693A (en) * 2018-08-13 2019-02-15 华南理工大学 A kind of face multizone fusion expression recognition method based on deep learning
CN110119702A (en) * 2019-04-30 2019-08-13 西安理工大学 Facial expression recognizing method based on deep learning priori
CN110263691A (en) * 2019-06-12 2019-09-20 合肥中科奔巴科技有限公司 Head movement detection method based on android system

Patent Citations (8)

Publication number Priority date Publication date Assignee Title
WO2017031089A1 (en) * 2015-08-15 2017-02-23 Eyefluence, Inc. Systems and methods for biomechanically-based eye signals for interacting with real and virtual objects
EP3335096A1 (en) * 2015-08-15 2018-06-20 Google LLC Systems and methods for biomechanically-based eye signals for interacting with real and virtual objects
CN105678248A (en) * 2015-12-31 2016-06-15 上海科技大学 Face key point alignment algorithm based on deep learning
CN108229442A (en) * 2018-02-07 2018-06-29 西南科技大学 Face fast and stable detection method in image sequence based on MS-KCF
CN108898112A (en) * 2018-07-03 2018-11-27 东北大学 A kind of near-infrared human face in-vivo detection method and system
CN109344693A (en) * 2018-08-13 2019-02-15 华南理工大学 A kind of face multizone fusion expression recognition method based on deep learning
CN110119702A (en) * 2019-04-30 2019-08-13 西安理工大学 Facial expression recognizing method based on deep learning priori
CN110263691A (en) * 2019-06-12 2019-09-20 合肥中科奔巴科技有限公司 Head movement detection method based on android system

Non-Patent Citations (3)

Title
"Deep Convolutional Network Cascade for Facial Point Detection"; Yi Sun et al.; 2013 IEEE Conference on Computer Vision and Pattern Recognition; 2013-10-03; pp. 3476-3483 *
"Deep Learning-Based Facial Beauty Prediction Model and Its Application"; Zeng Jiajian; China Master's Theses Full-text Database (Information Science and Technology); 2018-12-15; No. 12; pp. I138-1347 *
"Deep Learning-Based Facial Beauty Prediction Model and Its Application"; Jiang Kaiyong et al.; Journal of Wuyi University (Natural Science Edition); 2018-05-15; No. 02; pp. 45-52 *

Also Published As

Publication number Publication date
CN110956082A (en) 2020-04-03

Similar Documents

Publication Publication Date Title
CN110956082B (en) Face key point detection method and detection system based on deep learning
CN109522818B (en) Expression recognition method and device, terminal equipment and storage medium
CN110348376B (en) Pedestrian real-time detection method based on neural network
CN106599883B (en) CNN-based multilayer image semantic face recognition method
CN106529447B (en) Method for identifying face of thumbnail
CN108416266B (en) Method for rapidly identifying video behaviors by extracting moving object through optical flow
CN112766160A (en) Face replacement method based on multi-stage attribute encoder and attention mechanism
CN107909005A (en) Personage's gesture recognition method under monitoring scene based on deep learning
CN110399821B (en) Customer satisfaction acquisition method based on facial expression recognition
CN108009493B (en) Human face anti-cheating recognition method based on motion enhancement
CN108182397B (en) Multi-pose multi-scale human face verification method
CN104361316B (en) Dimension emotion recognition method based on multi-scale time sequence modeling
CN106909909B (en) Face detection and alignment method based on shared convolution characteristics
CN112434655A (en) Gait recognition method based on adaptive confidence map convolution network
CN110135277B (en) Human behavior recognition method based on convolutional neural network
CN112464844A (en) Human behavior and action recognition method based on deep learning and moving target detection
CN111382601A (en) Illumination face image recognition preprocessing system and method for generating confrontation network model
EP2790130A1 (en) Method for object recognition
Huang et al. Human emotion recognition based on face and facial expression detection using deep belief network under complicated backgrounds
CN113763417B (en) Target tracking method based on twin network and residual error structure
CN113963333A (en) Traffic sign board detection method based on improved YOLOF model
CN109522865A (en) A kind of characteristic weighing fusion face identification method based on deep neural network
Cho et al. Modified perceptual cycle generative adversarial network-based image enhancement for improving accuracy of low light image segmentation
CN116246338B (en) Behavior recognition method based on graph convolution and transducer composite neural network
KR20180092453A (en) Face recognition method Using convolutional neural network and stereo image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant