CN106203395B - Face attribute recognition method based on multitask deep learning - Google Patents

Face attribute recognition method based on multitask deep learning Download PDF

Info

Publication number
CN106203395B
CN106203395B CN201610591877.1A CN201610591877A
Authority
CN
China
Prior art keywords
face
image
training
neural network
key point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610591877.1A
Other languages
Chinese (zh)
Other versions
CN106203395A (en)
Inventor
严严 (Yan Yan)
陈日伟 (Chen Riwei)
王菡子 (Wang Hanzi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University filed Critical Xiamen University
Priority to CN201610591877.1A priority Critical patent/CN106203395B/en
Publication of CN106203395A publication Critical patent/CN106203395A/en
Application granted granted Critical
Publication of CN106203395B publication Critical patent/CN106203395B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Abstract

A face attribute recognition method based on multitask deep learning relates to face attribute recognition in computer vision. An image data set is prepared; face detection is carried out on each image in the image data set one by one; face key point detection is performed on all detected faces; according to the detected face key points, each face is aligned to a standard face image by a face alignment method to form a face image training set; an average face image of the training set is calculated; a multitask deep convolutional neural network is constructed, and network parameter training is carried out after the average face image is subtracted from each face image in the face image training set to obtain a convolutional neural network model; face detection and face key point detection are carried out on the test image to be recognized, and the face in the image is aligned to the standard face image according to the face key points; the average face image is subtracted from the standard face image, which is then put into the constructed convolutional neural network model for feed-forward operation, so as to obtain a plurality of attribute recognition results of the face.

Description

Face attribute recognition method based on multitask deep learning
Technical Field
The invention relates to face attribute recognition in computer vision, in particular to a face attribute recognition method based on multitask deep learning.
Background
The image-based face attribute recognition method is a process of judging the face attributes in an image by using pattern recognition technology according to a given input image. The face attributes contained in a face image mainly include age, gender, expression, race, whether glasses are worn, whether makeup is applied, and so on. Carrying out face attribute recognition automatically with a computer can effectively improve human-computer interaction, and therefore has very important practical application value. The process of face attribute recognition comprises face detection, face image preprocessing, face feature extraction, training of a face attribute classifier, and the like. The performance of the face feature extraction and of the face attribute classifier directly affects the final face attribute recognition performance.
At present, face attribute recognition technology is mainly completed in two steps: extracting face features and training a face attribute classifier. Face feature extraction techniques fall into two main categories according to how the features are obtained: manually designed features and automatically learned features. The quality of the face features directly affects the performance of the classifier. Typical manually designed features include the SIFT feature (D. G. Lowe. Distinctive image features from scale-invariant keypoints [J]. International Journal of Computer Vision, 2004, 60(2): 91-110) and the LBP feature (T. Ahonen, A. Hadid, M. Pietikäinen. Face description with local binary patterns: Application to face recognition [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006, 28(12): 2037-2041), among others. However, these manually designed features are mostly based on expert experience, and it is difficult to design features that are effective for a wide variety of tasks. In addition, manually designed feature extraction is decoupled from classifier design, so the selected features are not necessarily best suited to a particular classifier.
Deep learning has recently become one of the research hotspots in the field of computer vision. Unlike traditional methods, which decompose the recognition task into feature extraction and classifier training, deep learning organically combines the two: it takes the raw data directly as input and learns feature extraction and the classifier simultaneously. By placing feature learning and classifier training in the same statistical learning framework, deep learning effectively avoids the semantic gap between the extracted features and the target-task classifier, and overcomes the difficulty of manually designing features. A depth model composed of a multilayer neural network automatically acquires features from low level to high level, from simple to complex, and from general to task-specific. For example, in a typical image classification network, the early layers often extract edge information, the middle layers extract corner information, and the later layers extract contour information, target information, and the like. The lower the layer, the simpler and more universal the extracted features; features related to the target task are extracted gradually in the deeper layers.
Training a deep learning model often requires a large amount of labeled data, otherwise the learned model easily overfits the small amount of training data. However, acquiring a large amount of labeled data is very time-consuming and labor-intensive, so exploring how a deep network model, which acquires different features layer by layer, can alleviate the problem of insufficient data is a key problem worth solving.
Disclosure of Invention
The invention aims to provide a face attribute recognition method based on multitask deep learning.
The invention comprises the following steps:
A. An image data set is prepared that contains a large number of faces and corresponding face attribute labels.
B. Face detection is carried out on each image in the image data set one by one to obtain the position of the face in each image.
C. Face key point detection is performed on all detected faces to obtain the positions of the face key points in each image.
D. According to the detected face key points, each face is aligned to a standard face image by a face alignment method to form a face image training set.
E. An average face image of the training set is calculated from the face image training set.
F. A multitask deep convolutional neural network is constructed, and network parameter training is carried out after the average face image is subtracted from each face image in the face image training set, so as to obtain a convolutional neural network model.
G. Face detection and face key point detection are respectively carried out on the test image to be recognized, and the face in the image is aligned to the standard face image according to the face key points.
H. The average face image is subtracted from the standard face image, which is then put into the constructed convolutional neural network model for feed-forward operation to obtain a plurality of attribute recognition results of the face.
In step A, the prepared image data set may adopt face image data with good diversity collected in complex scenes, together with the corresponding K face attribute labels, where K is the number of tasks to be learned and is a natural number. Because a convolutional neural network structure based on multi-task learning is adopted, each face image does not need to carry all face attribute labels at the same time, so existing face databases can be fully combined to form a large-scale image data set.
In step B, the face detection performed on each image in the image data set one by one may adopt a common face detection method to obtain the position of the face in each image; for example, the face detection method provided by OpenCV may be used.
In step C, the face key point detection may adopt a common face key point detection method to obtain the positions of the face key points in each image; for example, the face key point detection method provided by Dlib may be used.
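By way of illustration, the following Python sketch combines steps B and C using OpenCV's Haar-cascade face detector and Dlib's 68-point landmark predictor, the two example tools named above; the model file names and the single-face handling are assumptions of this sketch rather than the patent's exact implementation.

```python
# Minimal sketch of steps B and C: face detection (OpenCV) plus 68 key points (Dlib).
# The cascade and predictor file paths are placeholders, not part of the patent.
import cv2
import dlib
import numpy as np

detector = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_face_and_keypoints(image_bgr):
    """Return (face box, 68x2 key point array) for the first detected face."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None, None
    x, y, w, h = faces[0]                              # position of the face (step B)
    rect = dlib.rectangle(int(x), int(y), int(x + w), int(y + h))
    shape = predictor(gray, rect)                      # 68 face key points (step C)
    pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)],
                   dtype=np.float64)
    return (x, y, w, h), pts
```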
In step D, the face alignment method is affine transformation based on a two-dimensional image, and specifically includes the following steps:
D1. According to the matching relation between the detected face key points and the standard face key points, a least-squares fit is used to obtain the optimal transformation matrix. Assume the d standard face key point coordinates are
dst = {(x'_1, y'_1), (x'_2, y'_2), ..., (x'_d, y'_d)}
where (x'_i, y'_i) is the coordinate of the i-th standard face key point; d is the number of face key points and d is a natural number. The detected face key point coordinates are {(x_1, y_1), (x_2, y_2), ..., (x_d, y_d)}, where (x_i, y_i) is the coordinate of the i-th detected face key point; these are extended to homogeneous coordinates src = {(x_1, y_1, 1), (x_2, y_2, 1), ..., (x_d, y_d, 1)} and optimized by the least square method. The specific calculation formula is as follows:
A_0 = argmin_A ||A · src - dst||^2
where A represents an affine transformation matrix and A_0 is the optimal transformation matrix.
D2. With the transformation matrix A_0 obtained by the optimization, all face images are aligned to the standard face image and cut into images of uniform size.
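A minimal sketch of the least-squares fit in D1 and the warp in D2, assuming a 2×3 affine parameterization solved with NumPy and applied with OpenCV; the 128×128 output size matches the embodiment described later but is otherwise an arbitrary choice of this sketch.

```python
# Sketch of steps D1-D2: fit a 2x3 affine matrix A0 by least squares and warp the
# face to the standard face image. The 128x128 crop size is an assumption here.
import cv2
import numpy as np

def align_face(image_bgr, src_pts, dst_pts, out_size=(128, 128)):
    """src_pts: d x 2 detected key points; dst_pts: d x 2 standard key points."""
    d = src_pts.shape[0]
    # Extend detected points to homogeneous coordinates src = [(x_i, y_i, 1)]
    src_h = np.hstack([src_pts, np.ones((d, 1))])               # d x 3
    # Least-squares solve src_h @ A.T ~= dst_pts, i.e. the optimal affine matrix A0
    A_t, _, _, _ = np.linalg.lstsq(src_h, dst_pts, rcond=None)  # 3 x 2
    A0 = A_t.T                                                  # 2 x 3
    # Apply A0 to align the face image to the standard face image and crop it
    return cv2.warpAffine(image_bgr, A0, out_size)
```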
In step E, the specific method for calculating the average face image of the training set from the face image training set may be:
E1. The average image is the arithmetic mean image M computed over each channel (the 3 RGB channels) of the face image training set, where each channel of M is calculated as follows:
M_R = (1/N) Σ_{n=1}^{N} I_n^R,  M_G = (1/N) Σ_{n=1}^{N} I_n^G,  M_B = (1/N) Σ_{n=1}^{N} I_n^B
where I_n^R, I_n^G and I_n^B are the R channel, G channel and B channel of the n-th face image respectively; N is the total number of images in the face image training set and N is a natural number; M_R, M_G and M_B are the arithmetic mean images of the R channel, G channel and B channel respectively;
E2. The arithmetic mean images obtained from the three RGB channels are combined into the average face image, and the calculation formula is as follows:
M = [M_R, M_G, M_B];
E3. dividing the face image training set into training data and verification data according to the ratio of 9: 1.
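A short sketch of steps E1–E3, assuming the aligned training faces are stacked into a NumPy array of shape (N, H, W, 3); the random 9:1 split mirrors E3.

```python
# Sketch of steps E1-E3: per-channel arithmetic mean face image and a 9:1 split
# into training and verification data. Input shape (N, H, W, 3) is an assumption.
import numpy as np

def mean_face_and_split(images, split_ratio=0.9, seed=0):
    # E1/E2: per-channel arithmetic mean -> average face image M of shape (H, W, 3)
    mean_face = images.mean(axis=0)
    # E3: split the face image training set into training and verification indices
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(images))
    n_train = int(len(images) * split_ratio)
    return mean_face, idx[:n_train], idx[n_train:]
```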
In step F, the specific method for constructing the multitask deep convolutional neural network may be:
F1. The training data are randomly shuffled, the size of each batch of samples is set to m, and the data are divided into batches, where m is a natural number. In the process of training the deep convolutional neural network model, the weight parameters of the neural network are learned with a batch gradient update method;
F2. A convolutional neural network structure is designed, which includes convolutional layers, downsampling layers and fully connected layers. Each convolutional layer and fully connected layer adopts a nonlinear rectification (ReLU) activation function. In the face attribute recognition of multi-task deep learning, the whole network structure is divided into a shared part and a unique (task-specific) part: the shared layers are shared by all tasks, which jointly participate in training the shared-layer parameters, while each unique layer is occupied by one task alone and its parameters are learned with that task's own data. Assume the number of shared layers is S, where S is a natural number, and the number of unique layers is U, where U is a natural number; different values of S and U are set for different multi-task combinations;
F3. The hyper-parameters required in the convolutional neural network structure are set, such as the number of convolutional-layer filters and feature maps, the filter sizes, the kernel sizes of the downsampling layers, the learning rate of each layer and the initial weight values;
F4. In the training process of the convolutional neural network, training techniques such as momentum and dropout are adopted to accelerate the training (a training-loop sketch follows this list);
F5. judging whether to stop training according to the performance of the trained network model parameters on the verification data;
F6. and extracting the trained network model parameters W.
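The training procedure of F1 and F4–F6 can be sketched as a generic mini-batch loop with momentum and validation-based stopping; the use of PyTorch, the hyper-parameter values and the patience rule below are illustrative assumptions, not values fixed by the patent.

```python
# Sketch of F1, F4, F5, F6: shuffle into batches of size m, SGD with momentum,
# dropout enabled via model.train(), stop on validation performance, return W.
import torch

def train(model, loss_fn, train_x, train_y, val_x, val_y,
          m=128, lr=0.01, momentum=0.9, max_epochs=100, patience=5):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    best_val, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()                                   # dropout active (F4)
        perm = torch.randperm(train_x.size(0))          # random shuffle (F1)
        for i in range(0, train_x.size(0), m):
            idx = perm[i:i + m]
            opt.zero_grad()
            loss = loss_fn(model(train_x[idx]), train_y[idx])
            loss.backward()
            opt.step()                                  # batch gradient update
        model.eval()                                    # F5: check verification data
        with torch.no_grad():
            val_loss = loss_fn(model(val_x), val_y).item()
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                  # stop training (F5)
                break
    return model.state_dict()                           # trained parameters W (F6)
```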
The method uses a deep convolutional neural network to learn feature extraction and attribute recognition of the face image simultaneously, so that the learned features are more favorable for the classifier, and feature extraction and classifier training do not need to be carried out separately. For multi-task face attribute recognition with K tasks, the training data do not all need to carry the K attribute labels at the same time; face data with one or more of the attributes can be used to train the network parameters, and the different attribute tasks benefit accordingly.
Unlike a traditional deep convolutional neural network which only has a single label and a single network output, the convolutional neural network based on the multitask deep learning has a plurality of outputs. The objective function of the network training is a combination of a plurality of Softmax loss functions and an L2 loss function. Assume that there are K tasks that need to be learned together. Then for the face attribute recognition of the ith classification task, the loss function is defined as follows:
L_i = -(1/m) Σ_{n=1}^{m} log p_{n, y_n},  with  p_{n,c} = exp(z_{n,c}) / Σ_{j=1}^{C_i} exp(z_{n,j})
where p_{n,c} represents the probability value calculated by the Softmax function for attribute class c of sample n; z_{n,c} represents the value of the fully connected classification output for that class; y_n is the target class of sample n; C_i represents the number of categories of the i-th task, and i is a natural number.
For the face attribute recognition of the j-th regression task, the loss function is defined as follows:
L_j = (1/(2m)) Σ_{n=1}^{m} (y_n - ŷ_n)^2
where y_n is the true label value and ŷ_n is the predicted value of the regressor.
In the network training, the cost loss functions of all the tasks are combined to form the total optimization objective function:
L = Σ_{k=1}^{K} α_k L_k
where α_k weights the loss function of the k-th task within the total loss function. By default all α_k values are 1, indicating that the respective tasks are equally important.
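The total objective can be sketched as a weighted sum of per-task losses; the PyTorch functions used below and the masking of samples that lack a label for a given task are assumptions of this sketch (the masking reflects the statement that training images need not carry all K attribute labels), and an L2 term for regression tasks could be added in the same way.

```python
# Sketch of the total objective: a weighted sum of per-task Softmax losses with
# alpha_k = 1 by default. Samples missing the label of task k are skipped for
# that task, which lets images with only some attribute labels contribute.
import torch
import torch.nn.functional as F

def multitask_loss(cls_logits, cls_targets, alphas=None, missing_label=-1):
    """cls_logits: list of (m, C_k) tensors; cls_targets: list of (m,) tensors."""
    K = len(cls_logits)
    alphas = alphas or [1.0] * K
    total = torch.zeros(())
    for k in range(K):
        mask = cls_targets[k] != missing_label           # ignore missing labels
        if mask.any():
            total = total + alphas[k] * F.cross_entropy(
                cls_logits[k][mask], cls_targets[k][mask])
    return total
```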
Compared with the prior art, the method reduces the amount of data required to train the network model parameters, reduces the risk of overfitting the training data, reduces the per-task recognition time compared with running each task as a separate single-task network, and effectively improves the accuracy of face attribute recognition.
Drawings
Fig. 1 is a schematic diagram of attributes of smile, gender and attraction of a female face.
Fig. 2 is a schematic diagram of attributes of smile, gender and attraction of a male face.
Detailed Description
The method of the present invention is described in detail below with reference to the accompanying drawings and examples.
The invention comprises the following steps:
s1, preparing an image data set which comprises a large number of human faces and corresponding human face attribute labels. The face attribute database used in this example is an image data set in the CelebrayA database, which contains over 20 million face images and 40 face attributes. Three representative face attribute tasks are employed (K — 3): and explaining the attribute of the human face gender, the attribute of the human face smile and the attribute of the human face attraction force. Schematic representation of three attributes as shown in FIGS. 1 and 2, with labels set to y, respectively1,y2,y3
S2. Face detection is carried out on each image in the image data set one by one to obtain the position of the face in each image. Any existing face detection method can be used for this step. This embodiment adopts the face detection method provided by OpenCV, which has the advantage of detecting faces rapidly.
S3. Face key point detection is performed on all detected faces to obtain the positions of the face key points in each image. Any existing face key point detection method can be used for this step. In this embodiment, 68 face key points are obtained with the face key point detection method provided by Dlib.
S4. According to the detected face key points, each face is aligned to a standard face image by a face alignment method to form a face image training set, which specifically comprises the following steps:
(1) According to the matching relation between the detected face key points and the standard face key points, a least-squares fit is used to obtain the optimal transformation matrix. Assume the 68 standard face key point coordinates are
dst = {(x'_1, y'_1), (x'_2, y'_2), ..., (x'_68, y'_68)}
where (x'_i, y'_i) is the coordinate of the i-th standard face key point and d = 68 is the number of face key points. The detected face key point coordinates are {(x_1, y_1), (x_2, y_2), ..., (x_68, y_68)}, where (x_i, y_i) is the coordinate of the i-th detected face key point; these are extended to homogeneous coordinates src = {(x_1, y_1, 1), (x_2, y_2, 1), ..., (x_68, y_68, 1)} and optimized by the least square method. The specific calculation formula is as follows:
A_0 = argmin_A ||A · src - dst||^2
where A represents an affine transformation matrix and A_0 is the optimal transformation matrix.
(2) With the transformation matrix A_0 obtained by the optimization, all face images are aligned to the standard face image and cut into images of uniform size 128 × 128.
S5. The average face image of the training set is calculated from the face image training set, which specifically comprises the following steps:
(1) The average image is the arithmetic mean image M computed over each channel (the 3 RGB channels) of the face image training set, where each channel of M is calculated as follows:
M_R = (1/N) Σ_{n=1}^{N} I_n^R,  M_G = (1/N) Σ_{n=1}^{N} I_n^G,  M_B = (1/N) Σ_{n=1}^{N} I_n^B
where I_n^R, I_n^G and I_n^B are the R channel, G channel and B channel of the n-th face image respectively; N is the total number of images in the face image training set and N is a natural number; M_R, M_G and M_B are the arithmetic mean images of the R channel, G channel and B channel respectively.
(2) The arithmetic mean images obtained from the three RGB channels are combined into the average face image, with the calculation formula M = [M_R, M_G, M_B].
(3) Dividing the face image training set into training data and verification data according to the ratio of 9: 1.
S6. A multi-task deep convolutional neural network is constructed, and network parameter training is carried out after the average face image is subtracted from each face image in the face image training set, so as to obtain a convolutional neural network model. This specifically comprises the following steps:
(1) The training data are randomly shuffled, the size of each batch is set to m = 128 samples, and the data are divided into batches. In the process of training the deep convolutional neural network model, a batch gradient update method is used to learn the weight parameters of the neural network.
(2) A convolutional neural network structure is designed, which includes convolutional layers, downsampling layers and fully connected layers. Each convolutional layer and fully connected layer adopts a nonlinear rectification (ReLU) activation function. In the face attribute recognition of multi-task deep learning, the overall network structure is divided into a shared part and a unique (task-specific) part. The shared layers are shared by all tasks, which jointly participate in training the shared-layer parameters. Each unique layer is occupied by one task alone, and its parameters are learned with that task's own data. The number of shared layers is set to S = 10 and the number of unique layers to U = 2. The loss functions used are all Softmax functions, and all α_k are set to 1.
(3) The hyper-parameters required in the convolutional neural network structure are set, such as the number of convolutional-layer filters and feature maps, the filter sizes, the kernel sizes of the downsampling layers, the learning rate of each layer and the initial weight values. The network structure of the deep convolutional neural network is shown in Table 1; an illustrative network-definition sketch follows the table.
TABLE 1
Network layer name Type Input size Output size Filter size/stride
Conv1_1 Convolutional layer 128*128*3 128*128*32 3*3/1
Conv1_2 Convolutional layer 128*128*32 128*128*64 3*3/1
Pool1 Downsampling layer 128*128*64 64*64*64 2*2/2
Conv2_1 Convolutional layer 64*64*64 64*64*96 3*3/1
Conv2_2 Convolutional layer 64*64*96 64*64*128 3*3/1
Pool2 Downsampling layer 64*64*128 32*32*128 2*2/2
Conv3_1 Convolutional layer 32*32*128 32*32*128 3*3/1
Conv3_2 Convolutional layer 32*32*128 32*32*192 3*3/1
Pool3 Downsampling layer 32*32*192 16*16*192 2*2/2
Conv4_1 Convolutional layer 16*16*192 16*16*256 3*3/1
Conv4_2 Convolutional layer 16*16*256 16*16*256 3*3/1
Pool4 Downsampling layer 16*16*256 8*8*256 2*2/2
Conv5_1 Convolutional layer 8*8*256 8*8*320 3*3/1
Conv5_2 Convolutional layer 8*8*320 8*8*320 3*3/1
Pool5 Downsampling layer 8*8*320 4*4*320 2*2/2
Dropout1 Dropout layer 4*4*320 4*4*320
Fc1 Full connection layer 4*4*320 1*1*128
Fc2 Full connection layer 1*1*128 1*1*Ci
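The following sketch mirrors the layer sizes of Table 1, treating the ten convolutional layers as the shared part (S = 10) and duplicating the two fully connected layers per task as the unique part (U = 2); the choice of PyTorch and the exact placement of the dropout layer within the shared part are assumptions of this sketch.

```python
import torch.nn as nn

def conv(cin, cout):
    # 3x3 convolution, stride 1, padding 1 keeps the spatial size; ReLU follows.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=1, padding=1), nn.ReLU())

class MultiTaskFaceNet(nn.Module):
    def __init__(self, num_classes_per_task=(2, 2, 2)):
        super().__init__()
        # Shared layers: Conv1_1 ... Conv5_2 with 2x2/2 downsampling after each block
        self.shared = nn.Sequential(
            conv(3, 32),    conv(32, 64),   nn.MaxPool2d(2, 2),   # 128 -> 64
            conv(64, 96),   conv(96, 128),  nn.MaxPool2d(2, 2),   # 64 -> 32
            conv(128, 128), conv(128, 192), nn.MaxPool2d(2, 2),   # 32 -> 16
            conv(192, 256), conv(256, 256), nn.MaxPool2d(2, 2),   # 16 -> 8
            conv(256, 320), conv(320, 320), nn.MaxPool2d(2, 2),   # 8 -> 4
            nn.Dropout(),                                         # Dropout1
        )
        # Unique layers per task: Fc1 (4*4*320 -> 128) and Fc2 (128 -> C_i)
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Flatten(),
                          nn.Linear(4 * 4 * 320, 128), nn.ReLU(),
                          nn.Linear(128, c))
            for c in num_classes_per_task])

    def forward(self, x):
        feat = self.shared(x)
        return [head(feat) for head in self.heads]    # one output per task
```

For the three classification tasks of the embodiment (gender, smile, attraction), `num_classes_per_task=(2, 2, 2)`, and the three returned logits tensors feed the per-task Softmax losses defined above.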
(4) In the training process of the convolutional neural network, training techniques such as momentum and dropout are adopted to accelerate the training of the convolutional neural network.
(5) Whether to stop training is judged according to the performance of the trained network model parameters on the verification data.
(6) And extracting the trained network model parameters W.
S7. For any given image on which face attribute recognition is to be performed, the image to be tested is converted into a 128 × 128 color image using the same data preprocessing as in steps S1-S4, the mean image of the training data is subtracted, and the result is input into the trained deep convolutional neural network, finally yielding the recognition results of the 3 different face attributes.
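Step S7 reduces to subtracting the mean image and running one feed-forward pass; the sketch below assumes the test face has already been detected and aligned to 128 × 128 as in the earlier steps, and that `model` and `mean_face` come from the preceding sketches.

```python
# Sketch of step S7: subtract the average face image and run a feed-forward pass
# through the trained multi-task network; returns one predicted class per task.
import numpy as np
import torch

def predict_attributes(aligned_rgb, mean_face, model):
    """aligned_rgb: 128x128x3 aligned test face; mean_face: average face image M."""
    x = torch.from_numpy((aligned_rgb - mean_face).astype(np.float32))
    x = x.permute(2, 0, 1).unsqueeze(0)          # 1 x 3 x 128 x 128
    model.eval()
    with torch.no_grad():
        logits = model(x)                        # one output tensor per task
    return [int(l.argmax(dim=1)) for l in logits]
```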
S8. The operation of step S7 is performed on each face image of the CelebA test data, and the attribute recognition accuracy and prediction time are compared. The accuracy of the single-task and multi-task networks on the CelebA test data and the time comparison results are shown in Table 2. It can be seen from the table that, under the same network structure, the multi-task deep learning attribute recognition method improves the accuracy of attribute recognition, and multi-task attribute recognition greatly reduces the average prediction time per task.
TABLE 2
Method T1 T2 T3 T1+T2 T1+T3 T2+T3 T1+T2+T3
Face gender (T1) 0.9670 N/A N/A 0.9580 0.9690 N/A 0.9680
Face smile (T2) N/A 0.9230 N/A 0.9210 N/A 0.9350 0.9360
Face attraction (T3) N/A N/A 0.8060 N/A 0.8130 0.8160 0.8220
Total predicted time (seconds) 0.0230 0.0230 0.0230 0.0270 0.0270 0.0270 0.0290
Average predicted time (seconds) 0.0230 0.0230 0.0230 0.0135 0.0135 0.0135 0.0097
For different face attribute recognition tasks, the invention can learn the shared network weights jointly while training the unique network weights independently, thereby greatly reducing the amount of training data required overall. The invention effectively improves the performance of face attribute recognition.

Claims (7)

1. The face attribute recognition method based on the multitask deep learning is characterized by comprising the following steps of:
A. preparing an image data set comprising a plurality of faces and corresponding face attribute labels;
B. carrying out face detection on each image in the image data set one by one to obtain the position of a face in each image;
C. performing face key point detection on all detected faces to acquire the position of the face key point in each image;
D. aligning each face to a standard face image according to a face alignment method for the detected face key points to form a face image training set;
E. calculating an average face image of the training set from the face image training set, wherein the specific method comprises the following steps:
E1. the average image is the arithmetic mean image M calculated over each channel of the face image training set, comprising the 3 RGB channels, wherein each channel of M is calculated as follows:
M_R = (1/N) Σ_{n=1}^{N} I_n^R,  M_G = (1/N) Σ_{n=1}^{N} I_n^G,  M_B = (1/N) Σ_{n=1}^{N} I_n^B
wherein I_n^R, I_n^G and I_n^B are respectively the R channel, the G channel and the B channel of the n-th face image; N is the total number of images in the face image training set, and N is a natural number; M_R, M_G and M_B are respectively the arithmetic mean images of the R channel, the G channel and the B channel;
E2. the arithmetic mean images obtained from the three RGB channels are combined into the average face image, and the calculation formula is as follows:
M = [M_R, M_G, M_B];
E3. dividing a face image training set into training data and verification data according to the ratio of 9: 1;
F. constructing a multitask deep convolutional neural network, and carrying out network parameter training after subtracting an average face image from each face image in a face image training set to obtain a convolutional neural network model;
the specific method for constructing the multitask deep convolution neural network comprises the following steps:
F1. randomly shuffling the training data, setting the size of each batch of samples to m, and dividing the data into batches, wherein m is a natural number; in the process of training the deep convolutional neural network model, learning the weight parameters of the neural network by using a batch gradient update method;
F2. designing a structure of a convolutional neural network, wherein the structure comprises a convolutional layer, a downsampling layer and a full-connection layer; each convolution layer and the full-connection layer adopt nonlinear rectification activation functions; in the face attribute recognition of multi-task deep learning, the whole network structure is divided into a sharing layer and a unique layer, the sharing layer is shared by all tasks, and a plurality of tasks participate in the training of parameters of the sharing layer together; the unique layer is occupied by each task independently, parameter learning is carried out by using independent data of each task, and the number of the shared layers is assumed to be S, wherein S is a natural number; the number of the unique layers is U, wherein U is a natural number; setting different numbers of S and U according to different multi-task combinations;
F3. setting the number of convolutional-layer filters and the number of feature maps required in the convolutional neural network structure, the sizes of the filters, the kernel sizes of the downsampling layers, the learning rate of each layer and the initial weight values;
F4. in the training process of the convolutional neural network, adopting training techniques such as momentum and dropout to accelerate the training of the convolutional neural network;
F5. judging whether to stop training according to the performance of the trained network model parameters on the verification data;
F6. extracting a trained network model parameter W;
G. respectively carrying out face detection and face key point detection on a test image to be recognized, and aligning a face in the image to a standard face image according to the face key point;
H. and subtracting the average face image from the standard face image, and putting the standard face image into the constructed convolutional neural network model for feedforward operation to obtain a plurality of attribute recognition results of the face.
2. The method for identifying human face attributes based on multitask deep learning as claimed in claim 1, characterized in that in step A, said prepared image data set adopts face image data collected in complex scenes, and provides the corresponding K face attribute labels.
3. The method according to claim 1, wherein in step B, the face detection is performed on each image in the image data set one by using a common face detection method to obtain the position of the face in each image.
4. The method for identifying the human face attribute based on the multitask deep learning as claimed in claim 3, wherein the commonly used human face detection method adopts an OpenCV self-contained human face detection method.
5. The method for identifying human face attributes based on multitask deep learning as claimed in claim 1, wherein in step C, said human face key point detection adopts a commonly used human face key point detection method to obtain the position of the human face key point in each image.
6. The method for identifying face attributes based on multitask deep learning as claimed in claim 5, characterized in that said commonly used face key point detection method adopts a Dlib-carried face key point detection method.
7. The method for identifying the attributes of the human face based on the multitask deep learning as claimed in claim 1, wherein in the step D, the human face alignment method is affine transformation based on a two-dimensional image, and specifically comprises the following steps:
D1. fitting by a least square method according to the matching relation between the face key points and the standard face key points to obtain an optimal transformation matrix; assuming the d standard face key point coordinates are
dst = {(x'_1, y'_1), (x'_2, y'_2), ..., (x'_d, y'_d)}
wherein (x'_i, y'_i) is the coordinate of the i-th standard face key point; d is the number of face key points and d is a natural number; the detected face key point coordinates are {(x_1, y_1), (x_2, y_2), ..., (x_d, y_d)}, wherein (x_i, y_i) is the coordinate of the i-th detected face key point, and these coordinates are extended to src = {(x_1, y_1, 1), (x_2, y_2, 1), ..., (x_d, y_d, 1)} and optimized by the least square method; the specific calculation formula is as follows:
A_0 = argmin_A ||A · src - dst||^2
wherein A represents an affine transformation matrix and A_0 is the optimal transformation matrix;
D2. with the transformation matrix A_0 obtained by the optimization, aligning all the face images to the standard face image and cutting them into images of uniform size.
CN201610591877.1A 2016-07-26 2016-07-26 Face attribute recognition method based on multitask deep learning Active CN106203395B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610591877.1A CN106203395B (en) 2016-07-26 2016-07-26 Face attribute recognition method based on multitask deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610591877.1A CN106203395B (en) 2016-07-26 2016-07-26 Face attribute recognition method based on multitask deep learning

Publications (2)

Publication Number Publication Date
CN106203395A CN106203395A (en) 2016-12-07
CN106203395B true CN106203395B (en) 2020-01-14

Family

ID=57495030

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610591877.1A Active CN106203395B (en) 2016-07-26 2016-07-26 Face attribute recognition method based on multitask deep learning

Country Status (1)

Country Link
CN (1) CN106203395B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3913545A3 (en) * 2020-12-14 2022-03-16 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for updating parameter of multi-task model, and electronic device

Families Citing this family (65)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106650653B (en) * 2016-12-14 2020-09-15 广东顺德中山大学卡内基梅隆大学国际联合研究院 Construction method of human face recognition and age synthesis combined model based on deep learning
CN106815566B (en) * 2016-12-29 2021-04-16 天津中科智能识别产业技术研究院有限公司 Face retrieval method based on multitask convolutional neural network
CN106874840B (en) * 2016-12-30 2019-10-22 东软集团股份有限公司 Vehicle information recognition method and device
CN107066941A (en) * 2017-03-01 2017-08-18 桂林电子科技大学 A kind of face identification method and system
CN107145857B (en) * 2017-04-29 2021-05-04 深圳市深网视界科技有限公司 Face attribute recognition method and device and model establishment method
CN107301381A (en) * 2017-06-01 2017-10-27 西安电子科技大学昆山创新研究院 Recognition Method of Radar Emitters based on deep learning and multi-task learning strategy
CN109033921A (en) * 2017-06-08 2018-12-18 北京君正集成电路股份有限公司 A kind of training method and device of identification model
CN109359499A (en) * 2017-07-26 2019-02-19 虹软科技股份有限公司 A kind of method and apparatus for face classifier
CN109299487B (en) * 2017-07-25 2023-01-06 展讯通信(上海)有限公司 Neural network system, accelerator, modeling method and device, medium and system
CN107545263B (en) * 2017-08-02 2020-12-15 清华大学 Object detection method and device
CN107437081A (en) * 2017-08-07 2017-12-05 北京中星微电子有限公司 Face identification method, device and storage medium based on depth volume neutral net
CN107704813B (en) * 2017-09-19 2020-11-17 北京一维大成科技有限公司 Face living body identification method and system
CN107832667A (en) * 2017-10-11 2018-03-23 哈尔滨理工大学 A kind of face identification method based on deep learning
CN107844781A (en) * 2017-11-28 2018-03-27 腾讯科技(深圳)有限公司 Face character recognition methods and device, electronic equipment and storage medium
CN107844782A (en) * 2017-11-29 2018-03-27 济南浪潮高新科技投资发展有限公司 A kind of face identification method based on the serial depth network of multitask
CN107766850B (en) * 2017-11-30 2020-12-29 电子科技大学 Face recognition method based on combination of face attribute information
CN107944416A (en) * 2017-12-06 2018-04-20 成都睿码科技有限责任公司 A kind of method that true man's verification is carried out by video
CN107977456B (en) * 2017-12-15 2018-10-30 清华大学 A kind of multi-source big data analysis method based on multitask depth network
CN107895160A (en) * 2017-12-21 2018-04-10 曙光信息产业(北京)有限公司 Human face detection and tracing device and method
CN108363948A (en) * 2017-12-29 2018-08-03 武汉烽火众智数字技术有限责任公司 A kind of face information structural method for video investigation
CN108596011A (en) * 2017-12-29 2018-09-28 中国电子科技集团公司信息科学研究院 A kind of face character recognition methods and device based on combined depth network
CN108428238B (en) * 2018-03-02 2022-02-15 南开大学 Multi-type task general detection method based on deep network
CN108510061B (en) * 2018-03-19 2022-03-29 华南理工大学 Method for synthesizing face by multiple monitoring videos based on condition generation countermeasure network
CN108596094B (en) * 2018-04-24 2021-02-05 杭州数为科技有限公司 Character style detection system, method, terminal and medium
CN110263603B (en) * 2018-05-14 2021-08-06 桂林远望智能通信科技有限公司 Face recognition method and device based on central loss and residual error visual simulation network
CN108764207B (en) * 2018-06-07 2021-10-19 厦门大学 Face expression recognition method based on multitask convolutional neural network
CN109325398B (en) * 2018-06-30 2020-10-09 东南大学 Human face attribute analysis method based on transfer learning
CN108960167B (en) * 2018-07-11 2023-08-18 腾讯科技(深圳)有限公司 Hairstyle identification method, device, computer readable storage medium and computer equipment
CN110737446B (en) * 2018-07-20 2021-10-12 杭州海康威视数字技术股份有限公司 Method and device for updating parameters
CN109214281A (en) * 2018-07-30 2019-01-15 苏州神指微电子有限公司 A kind of CNN hardware accelerator for AI chip recognition of face
CN109190514B (en) * 2018-08-14 2021-10-01 电子科技大学 Face attribute recognition method and system based on bidirectional long-short term memory network
CN109447259A (en) * 2018-09-21 2019-03-08 北京字节跳动网络技术有限公司 Multitasking and multitasking model training method, device and hardware device
CN109359575B (en) 2018-09-30 2022-05-10 腾讯科技(深圳)有限公司 Face detection method, service processing method, device, terminal and medium
CN109359688A (en) * 2018-10-19 2019-02-19 厦门理工学院 A kind of design method of the outer origin output compromise filter of premium class
CN109711252A (en) * 2018-11-16 2019-05-03 天津大学 A kind of face identification method of more ethnic groups
CN109558837B (en) * 2018-11-28 2024-03-22 北京达佳互联信息技术有限公司 Face key point detection method, device and storage medium
CN109727071A (en) * 2018-12-28 2019-05-07 中国科学院半导体研究所 Method and system for advertisement recommendation
CN111382642A (en) * 2018-12-29 2020-07-07 北京市商汤科技开发有限公司 Face attribute recognition method and device, electronic equipment and storage medium
CN110069994B (en) * 2019-03-18 2021-03-23 中国科学院自动化研究所 Face attribute recognition system and method based on face multiple regions
CN110046554B (en) * 2019-03-26 2022-07-12 青岛小鸟看看科技有限公司 Face alignment method and camera
CN110163269A (en) * 2019-05-09 2019-08-23 北京迈格威科技有限公司 Model generating method, device and computer equipment based on deep learning
CN110163151B (en) * 2019-05-23 2022-07-12 北京迈格威科技有限公司 Training method and device of face model, computer equipment and storage medium
CN110489951B (en) * 2019-07-08 2021-06-11 招联消费金融有限公司 Risk identification method and device, computer equipment and storage medium
CN110263768A (en) * 2019-07-19 2019-09-20 深圳市科葩信息技术有限公司 A kind of face identification method based on depth residual error network
CN110443189B (en) * 2019-07-31 2021-08-03 厦门大学 Face attribute identification method based on multitask multi-label learning convolutional neural network
CN110414489A (en) * 2019-08-21 2019-11-05 五邑大学 A kind of face beauty prediction technique based on multi-task learning
CN110633669B (en) * 2019-09-12 2024-03-26 华北电力大学(保定) Mobile terminal face attribute identification method based on deep learning in home environment
CN112825119A (en) * 2019-11-20 2021-05-21 北京眼神智能科技有限公司 Face attribute judgment method and device, computer readable storage medium and equipment
CN111507263B (en) * 2020-04-17 2022-08-05 电子科技大学 Face multi-attribute recognition method based on multi-source data
CN111626115A (en) * 2020-04-20 2020-09-04 北京市西城区培智中心学校 Face attribute identification method and device
CN111598000A (en) * 2020-05-18 2020-08-28 中移(杭州)信息技术有限公司 Face recognition method, device, server and readable storage medium based on multiple tasks
CN111753770A (en) * 2020-06-29 2020-10-09 北京百度网讯科技有限公司 Person attribute identification method and device, electronic equipment and storage medium
CN112200008A (en) * 2020-09-15 2021-01-08 青岛邃智信息科技有限公司 Face attribute recognition method in community monitoring scene
CN112149556A (en) * 2020-09-22 2020-12-29 南京航空航天大学 Face attribute recognition method based on deep mutual learning and knowledge transfer
WO2022061726A1 (en) * 2020-09-25 2022-03-31 Intel Corporation Method and system of multiple facial attributes recognition using highly efficient neural networks
CN112287765A (en) * 2020-09-30 2021-01-29 新大陆数字技术股份有限公司 Face living body detection method, device and equipment and readable storage medium
CN112488003A (en) * 2020-12-03 2021-03-12 深圳市捷顺科技实业股份有限公司 Face detection method, model creation method, device, equipment and medium
WO2022116163A1 (en) * 2020-12-04 2022-06-09 深圳市优必选科技股份有限公司 Portrait segmentation method, robot, and storage medium
CN112801138B (en) * 2021-01-05 2024-04-09 北京交通大学 Multi-person gesture estimation method based on human body topological structure alignment
CN112949382A (en) * 2021-01-22 2021-06-11 深圳市商汤科技有限公司 Camera movement detection method and device, and electronic device
CN112507978B (en) * 2021-01-29 2021-05-28 长沙海信智能系统研究院有限公司 Person attribute identification method, device, equipment and medium
CN113239727A (en) * 2021-04-03 2021-08-10 国家计算机网络与信息安全管理中心 Person detection and identification method
CN113743243A (en) * 2021-08-13 2021-12-03 厦门大学 Face beautifying method based on deep learning
CN113642541B (en) * 2021-10-14 2022-02-08 环球数科集团有限公司 Face attribute recognition system based on deep learning
CN117079337B (en) * 2023-10-17 2024-02-06 成都信息工程大学 High-precision face attribute feature recognition device and method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6038337A (en) * 1996-03-29 2000-03-14 Nec Research Institute, Inc. Method and apparatus for object recognition
CN101950415A (en) * 2010-09-14 2011-01-19 武汉大学 Shape semantic model constraint-based face super-resolution processing method
CN104636755A (en) * 2015-01-31 2015-05-20 华南理工大学 Face beauty evaluation method based on deep learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6038337A (en) * 1996-03-29 2000-03-14 Nec Research Institute, Inc. Method and apparatus for object recognition
CN101950415A (en) * 2010-09-14 2011-01-19 武汉大学 Shape semantic model constraint-based face super-resolution processing method
CN104636755A (en) * 2015-01-31 2015-05-20 华南理工大学 Face beauty evaluation method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
《多任务学习及卷积神经网络在人脸识别中的应用》 (Application of multi-task learning and convolutional neural networks in face recognition); 邵蔚元 (Shao Weiyuan) et al.; 《计算机工程与应用》 (Computer Engineering and Applications); 2016-07-01; Vol. 52, No. 13; Sections 2-4, Fig. 2 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3913545A3 (en) * 2020-12-14 2022-03-16 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for updating parameter of multi-task model, and electronic device

Also Published As

Publication number Publication date
CN106203395A (en) 2016-12-07

Similar Documents

Publication Publication Date Title
CN106203395B (en) Face attribute recognition method based on multitask deep learning
CN108615010B (en) Facial expression recognition method based on parallel convolution neural network feature map fusion
CN106203533B (en) Deep learning face verification method based on combined training
CN108182441B (en) Parallel multichannel convolutional neural network, construction method and image feature extraction method
CN106650806B (en) A kind of cooperating type depth net model methodology for pedestrian detection
CN103106265B (en) Similar image sorting technique and system
CN107016405A (en) A kind of insect image classification method based on classification prediction convolutional neural networks
CN107977671A (en) A kind of tongue picture sorting technique based on multitask convolutional neural networks
CN107506722A (en) One kind is based on depth sparse convolution neutral net face emotion identification method
CN106780466A (en) A kind of cervical cell image-recognizing method based on convolutional neural networks
CN107122375A (en) The recognition methods of image subject based on characteristics of image
CN109165674A (en) A kind of certificate photo classification method based on multi-tag depth convolutional network
CN106203356B (en) A kind of face identification method based on convolutional network feature extraction
CN104915643A (en) Deep-learning-based pedestrian re-identification method
CN106126585B (en) The unmanned plane image search method combined based on quality grading with perceived hash characteristics
CN104463209A (en) Method for recognizing digital code on PCB based on BP neural network
CN104268593A (en) Multiple-sparse-representation face recognition method for solving small sample size problem
CN104134071A (en) Deformable part model object detection method based on color description
CN105005765A (en) Facial expression identification method based on Gabor wavelet and gray-level co-occurrence matrix
CN109190643A (en) Based on the recognition methods of convolutional neural networks Chinese medicine and electronic equipment
CN109815967A (en) CNN ship seakeeping system and method based on Fusion Features
CN110516537B (en) Face age estimation method based on self-learning
CN105138975B (en) A kind of area of skin color of human body dividing method based on degree of depth conviction network
CN101169830A (en) Human face portrait automatic generation method based on embedded type hidden markov model and selective integration
CN114038037B (en) Expression label correction and identification method based on separable residual error attention network

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant