CN106096535B

CN106096535B - Face verification method based on bilinear joint CNN

Info

Publication number: CN106096535B
Application number: CN201610399704.XA
Authority: CN
Inventors: 胡海峰; 李昊曦; 顾建权; 胡伟鹏
Original assignee: Sun Yat Sen University; SYSU CMU Shunde International Joint Research Institute
Current assignee: Sun Yat Sen University; SYSU CMU Shunde International Joint Research Institute
Priority date: 2016-06-07
Filing date: 2016-06-07
Publication date: 2020-10-23
Anticipated expiration: 2036-06-07
Also published as: CN106096535A

Abstract

The invention discloses a face verification method based on a bilinear convolutional neural network. The method comprises the following steps: 1) training a convolutional neural network (CNN) with face images prepared in advance; 2) fine-tuning the bilinear CNN with the face pictures in the training set; 3) inputting two face pictures to be verified, segmenting the two pictures, and extracting the joint features output by the bilinear CNN; 4) training a self-coding (autoencoder) network on the resulting vectors to obtain the final verification result. The invention builds on the bilinear CNN method and replaces the two identical inputs of the original bilinear neural network with the two different input images of a face verification task. It thereby provides a new face verification descriptor that is robust to illumination, occlusion and pose changes. Because the features extracted by the bilinear CNN have a smaller dimension than those of the fully connected layers of a general CNN, the number of parameters is reduced, which makes the subsequent deep belief network training easier and improves the accuracy of face verification.

Description

Face verification method based on bilinear joint CNN

Technical Field

The invention relates to the field of computer vision, in particular to a face verification method based on a bilinear joint Convolutional Neural Network (CNN).

Background

Face recognition is the most common means of identity authentication in daily life and is currently one of the most popular research topics in pattern recognition. Face recognition dynamically captures a person's face through a camera connected to a computer and compares the captured face with the faces recorded in advance in a personnel database. Other biometric recognition methods all require the cooperation of the person being identified, whereas face recognition does not, so it can be used in certain covert settings and can be called the most user-friendly biometric identity authentication technology. Face verification is a sub-problem of face recognition: it decides whether the face in an image belongs to a designated person, and is therefore a one-to-one matching process. Because face classification and verification have great practical value, the subject has been a research hotspot for years and many methods have been proposed. Most face verification algorithms start by detecting certain features of the face and, after processing, obtain a verification result using various probability models or classifiers. These include methods based on classical features such as SIFT, HOG and Gabor, as well as CNN-based methods, and CNN-based face recognition currently achieves the best results. However, a deep CNN has a large number of parameters that must be learned, and many face data sets do not reach the scale that such learning requires; a deep CNN must therefore be trained on a large face data set, which limits its application in face recognition. In addition, training a deep CNN takes a lot of time, and optimizing the parameters is a lengthy process. Most parameters of a CNN come from the high-dimensional fully connected layers; combining the tails of two CNNs and reducing the dimensionality reduces the number of network parameters while retaining good recognition performance. The face verification study here is directed to CNN-based methods.

The main idea of the CNN-based method is as follows: the CNN first convolves the input image to extract local features, then reduces the dimensionality through matrix multiplication in the fully connected layers, while the parameters are adjusted by gradient descent with back-propagation so that the output of the whole network differs as little as possible from the labels of the training set. The features of the penultimate and third-to-last layers of the network can be regarded as global features of the original image. In a face verification model, these features are combined by various methods to calculate the probability that two face images belong to the same person, and the final recognition is completed by comparing the face image with the reference set one by one. Many scholars have explored and improved on this basis. Y et al. proposed connecting several partial convolution results of the original CNN model to the fully connected layer and combining these features to enhance the expressive power, thereby improving recognition. Aruni et al. combined two CNN models by multiplication, with chain-rule derivatives propagated back to both models, and proposed the bilinear CNN model, which reduces feature redundancy and improves computation speed. Wang et al. used a deep belief model to combine the CNN features of multiple pictures obtained by decomposing the face, and performed face verification through several layers of RBM coding. Existing CNN-based face verification research usually ignores the characteristics of the verification task: a single face image is used directly as a training sample for feature extraction, although the relationship between the two input faces is a cue that helps improve CNN-based recognition accuracy.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a face verification method based on a bilinear joint CNN. The method takes an acquired frontal face image as input, extracts discriminative face image features, compares them pairwise with the faces of a reference set, and finally outputs a classification result indicating whether the input face and the reference face belong to the same person.

In order to achieve the purpose, the invention adopts the technical scheme that:

a face verification method based on bilinear joint CNN comprises the following steps:

(1) for an input face image, cropping images of several parts of the face image with multi-scale rectangular frames to serve as input to a CNN, and pre-training the CNN with the face images in a training set to obtain an initial CNN model;

(2) combining two initial CNN models with the same parameters to form a new bilinear joint CNN; the initialization parameters of the convolution-pooling layers are given by step (1); the respective fully connected layers of the two CNN models are replaced by a joint three-layer fully connected stage whose input is obtained by multiplying the output matrices of the last convolution-pooling layers of the two initial CNNs; the softmax multi-class classifier of the last layer is replaced by a two-class classifier that judges whether the two inputs show the same face; the parameters of the three fully connected layers are then all initialized to random values drawn from a Gaussian distribution with zero mean and variance σ;

(3) pairing the face images in the training set two by two, feeding each pair into the two ends of the new bilinear joint CNN, and fine-tuning all parameters of the whole bilinear CNN according to the classification training result; during fine-tuning, the parameters of one side of the bilinear CNN model are first fixed while the parameters of the CNN model on the other side are adjusted, and the bilinear CNN model for face verification is obtained after repeated iterations of fine-tuning;

(4) performing two-class classification on the features of the reference face image and the tested face image, both cropped in a multi-scale, multi-channel and multi-region manner, with a three-layer deep autoencoder (self-coding) network, and finally outputting the recognition accuracy of face verification.

Preferably, in step (1), using the symmetry of the face image, the upper part, the upper left part, the upper right part, the middle part, the left part, the right part, the lower left part, the lower right part and the original image of the face image are each cropped at different scales, so that the face image is divided into 10 different face input crops; the CNN is pre-trained with the face images in the training set until the loss function of the CNN converges, and the parameters of the convolution-pooling layers of the trained CNN serve as the initial values of the convolution-pooling layer parameters of each of the two CNN models in the next step.
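The multi-region cropping described above can be sketched as follows. This is only an illustration: the patent names the regions but gives no coordinates or scales, so the fractional boxes in `REGIONS`, the function name `crop_regions` and the 224x224 placeholder image are all assumptions, and the sketch produces one crop per named region (nine in total) rather than the 10 crops of the preferred embodiment.

```python
import numpy as np

# Hypothetical fractional boxes (top, bottom, left, right); the patent
# names these regions but does not specify their coordinates or scales.
REGIONS = {
    "original":    (0.00, 1.00, 0.00, 1.00),
    "upper":       (0.00, 0.50, 0.00, 1.00),
    "upper_left":  (0.00, 0.50, 0.00, 0.50),
    "upper_right": (0.00, 0.50, 0.50, 1.00),
    "middle":      (0.25, 0.75, 0.25, 0.75),
    "left":        (0.00, 1.00, 0.00, 0.50),
    "right":       (0.00, 1.00, 0.50, 1.00),
    "lower_left":  (0.50, 1.00, 0.00, 0.50),
    "lower_right": (0.50, 1.00, 0.50, 1.00),
}

def crop_regions(image):
    """Return a dict mapping each region name to the cropped sub-image."""
    h, w = image.shape[:2]
    crops = {}
    for name, (top, bottom, left, right) in REGIONS.items():
        crops[name] = image[int(top * h):int(bottom * h),
                            int(left * w):int(right * w)]
    return crops

face = np.zeros((224, 224, 3), dtype=np.uint8)  # placeholder face image
crops = crop_regions(face)
```

In practice each crop would presumably be resized to the input size of the CNN before being fed to the network; that step is omitted here.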

Preferably, in step (1), for the input face image, images of several parts of the face image are cropped with multi-scale rectangular frames to serve as input to the CNN, and the CNN is pre-trained with the face images in the training set; the CNN uses a 19-layer VGG network model consisting of 5 convolution-pooling stages and three fully connected layers; the first two fully connected layers are ordinary neural-network connection layers, and the third is a softmax classifier layer that performs the classification task and starts the back-propagation of derivatives; each convolution-pooling stage comprises several convolution layers and one pooling layer; based on the softmax classification result for a face image, the back-propagation algorithm adjusts the parameters of all layers layer by layer, from each layer back to the one before it, thereby yielding the initial CNN model.

Preferably, the fully connected stage is set to 3 layers, and zero-mean white Gaussian noise with a variance of 0.01 is used as the initial value of the fully connected layer parameters.
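A minimal sketch of this initialization, assuming NumPy: a variance of 0.01 corresponds to a standard deviation of 0.1. The layer widths passed to the hypothetical helper `init_fc_layers` are placeholders, since the patent does not state the sizes of the three fully connected layers.

```python
import numpy as np

def init_fc_layers(layer_sizes, variance=0.01, seed=0):
    """Draw one weight matrix per fully connected layer from N(0, variance)."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(variance)  # variance 0.01 -> standard deviation 0.1
    weights = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights.append(rng.normal(0.0, std, size=(n_in, n_out)))
    return weights

# Hypothetical widths: joint feature -> two hidden layers -> two-class output.
fc_weights = init_fc_layers([512, 256, 128, 2])
```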

Preferably, in step (3), the face images in the training set are paired two by two, with whether the two images show the same person as the classification label; each pair is fed into the bilinear joint CNN in separate passes, once in each order of the two inputs; the mean square of the difference between the network's classification result and the actual label is used as the optimization criterion, and the network parameters are fine-tuned by gradient descent. During gradient back-propagation, at the output of the convolution-pooling layers of the two CNN models, which is the input of the fully connected layers, the matrix multiplication can be differentiated with respect to each of the two matrices. Therefore, when the derivatives for fine-tuning are computed, each iteration first fixes the parameters of one CNN model and fine-tunes the other, then fixes the parameters of the other CNN model and fine-tunes the first. After this automatic parameter adjustment has been iterated many times, the total loss function of the model converges, yielding the bilinear CNN model for face verification.

Preferably, in step (4), each input face picture to be verified is paired with each reference set picture using the 10 different crops, and each pair of crops is fed into the new bilinear joint CNN; the output of the fully connected layer of the bilinear joint CNN is taken as a feature vector, all the feature vectors are concatenated, and a three-layer deep autoencoder (self-coding) network performs data dimensionality reduction and two-class classification, giving the face verification result, i.e. the two-class result: if the value of the 'belonging to the same person' neuron output by the deep autoencoder network is larger than the value of the 'not belonging to the same person' neuron, the two pictures are judged to show the same person. If the method is used for a face classification task, the reference face for which the input face to be classified yields the highest 'belonging to the same person' neuron value can be taken as the classification result.

The invention discloses a face verification method based on a bilinear convolutional neural network. The method comprises the following steps: 1) training a convolutional neural network (CNN) with a large number of face images from a data set prepared in advance; 2) fine-tuning the bilinear CNN with the face pictures in the training set, starting from the pre-training result of the CNN; 3) inputting two face pictures to be verified, segmenting the two pictures in a multi-scale, multi-channel and multi-region manner, feeding them into the two ends of the bilinear CNN, and extracting the joint features output by the bilinear CNN; 4) concatenating the resulting vectors and training a deep autoencoder (self-coding) network on them to obtain the final two-class result of whether the two faces belong to the same person; for face recognition, the reference with the largest 'yes' output among the two-class neurons is taken as the classification result, from which the recognition accuracy is obtained. Based on the bilinear CNN method, the invention provides a new face verification descriptor by replacing the two identical inputs of the original bilinear neural network with the two different input images of a face verification task. The descriptor effectively captures both local and global features of the face, retains the translation and rotation invariance of the CNN, and is robust to illumination, occlusion and partial pose changes. Because the feature vector produced by the bilinear neural network has a smaller dimension than that of a general neural network, the amount of computation is reduced, which makes the subsequent deep belief network training easier and markedly improves the accuracy of face verification.

Compared with the prior art, the technical scheme of the invention has the beneficial effects that:

1. The invention provides a new descriptor suited to the face verification task, which captures image features with identity-discriminating power from the face images of different people.

2. The central area utilized by the invention can effectively and adaptively extract face image features that are robust to illumination and occlusion, and the bilinear joint CNN model reduces the dimension of the feature vector.

3. The invention combines the bilinear CNN with the deep autoencoder (self-coding) network for face recognition, which improves the recognition performance.

Drawings

FIG. 1 is a general flow diagram of the present invention.

FIG. 2 is a structure and training flow chart of the bilinear joint convolutional neural network of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

FIG. 1 shows the operating process of the present invention. As shown in the figure, a face verification method based on a bilinear convolutional neural network includes the following steps:

(1) for an input face image, images of several parts of the face image are cropped with multi-scale rectangular frames to serve as input to a CNN, and the CNN is pre-trained with a large number of face images from a training set prepared in advance. The CNN uses a 19-layer VGG network model consisting of 5 convolution-pooling stages and three fully connected layers; the first two fully connected layers are ordinary neural-network connection layers, and the third is a softmax classifier layer that performs the classification task and starts the back-propagation of derivatives; each convolution-pooling stage comprises several convolution layers and one pooling layer. Based on the softmax classification result for a face image, the back-propagation algorithm adjusts the parameters of all layers layer by layer, from each layer back to the one before it, thereby yielding the initial CNN model;

(2) two initial CNN models with the same parameters are combined to form a new bilinear joint CNN. The initialization parameters of the convolution-pooling layers are given by step (1); the respective fully connected layers of the two CNN models are replaced by a joint three-layer fully connected stage whose input is obtained by multiplying the output matrices of the last convolution-pooling layers of the two initial CNN models; the softmax multi-class classifier of the last layer is replaced by a two-class classifier that judges whether the two inputs show the same face; the parameters of the three fully connected layers are then all initialized to random values drawn from a zero-mean, small-variance Gaussian distribution;

(3) the face images in the training set are paired two by two, each pair is fed into the two ends of the new bilinear joint CNN, and all parameters of the whole bilinear CNN are fine-tuned according to the classification training result. During fine-tuning, the parameters of one side of the bilinear CNN model are first fixed while the parameters of the CNN model on the other side are adjusted, and the bilinear CNN model for face verification is obtained after 30 iterations of fine-tuning;

(4) a three-layer deep autoencoder (self-coding) network performs two-class classification on the features of the reference and tested face images, both cropped in a multi-scale, multi-channel and multi-region manner, and the recognition accuracy of face verification is finally output.

Further, the specific process in step (1) is as follows: using the symmetry of the face image, the upper part, the upper left part, the middle part, the left part, the lower left part and the original image of the face image are each cropped at different scales, and each face image is divided into 10 different face input crops, which improves the robustness of the CNN model. A large number of face pictures are collected from common face data sets such as CASIA to train the CNN, and the trained network serves as the pre-training basis for the next step.

The specific process in step (2) is as follows: a bilinear joint CNN is constructed; when the network is initialized, its two convolutional branches are each initialized with the weights of the CNN convolution layers trained in step (1), the fully connected stage is set to 3 layers, and these layers are randomly initialized from a zero-mean Gaussian distribution of small variance.
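One possible reading of "multiplying the output matrices of the last convolution-pooling layers" is sketched below, assuming NumPy. Each branch output is flattened to a (channels, locations) matrix and the two matrices are multiplied, as in the original bilinear CNN formulation; the patent does not spell out the exact shapes or any normalization, so the function name `bilinear_joint_feature` and the toy tensor sizes are assumptions.

```python
import numpy as np

def bilinear_joint_feature(feat_a, feat_b):
    """Combine two branch outputs of shape (channels, height, width).

    Both branches are assumed to share the same spatial size, so the
    product sums the channel-pair responses over all spatial locations.
    """
    a = feat_a.reshape(feat_a.shape[0], -1)  # (c_a, h*w)
    b = feat_b.reshape(feat_b.shape[0], -1)  # (c_b, h*w)
    joint = a @ b.T                          # (c_a, c_b)
    return joint.reshape(-1)                 # input vector for the FC layers

# Toy branch outputs; real VGG feature maps would be much larger.
feat_a = np.ones((4, 3, 3))
feat_b = np.ones((5, 3, 3))
joint = bilinear_joint_feature(feat_a, feat_b)
```

The length of the joint vector is the product of the two channel counts and does not depend on the spatial size of the feature maps.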

In step (3), the face images in the training set are paired two by two, with whether the two images show the same person as the classification label. Each pair is fed into the bilinear joint CNN in separate passes, once in each order of the two inputs. The mean square of the difference between the network's classification result and the actual label is used as the optimization criterion, and the network parameters are fine-tuned by gradient descent. During gradient back-propagation, each iteration first fixes the parameters of one CNN model and fine-tunes the other, then fixes the parameters of the other CNN model and fine-tunes the first. The concrete structure of the whole network model is shown in FIG. 2. After about 30 iterations of this automatic parameter adjustment, the total loss function of the model converges, yielding the bilinear CNN model for face verification.
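The alternating update can be illustrated on a toy problem, assuming NumPy. Here two small linear maps `W_a` and `W_b` stand in for the two CNN branches, the bilinear score is the inner product of their outputs, and the loss is the squared error against the label. This is not the patent's network; it only shows the pattern of freezing one branch while the other takes a gradient step, and all names and sizes are hypothetical.

```python
import numpy as np

def loss(W_a, W_b, x_a, x_b, y):
    """Squared error between the bilinear score and the label y."""
    return 0.5 * ((W_a @ x_a) @ (W_b @ x_b) - y) ** 2

def alternating_step(W_a, W_b, x_a, x_b, y, lr=0.1):
    # Phase 1: branch B is frozen, only branch A is updated.
    f_a, f_b = W_a @ x_a, W_b @ x_b
    err = f_a @ f_b - y
    W_a = W_a - lr * err * np.outer(f_b, x_a)
    # Phase 2: branch A is frozen, only branch B is updated.
    f_a = W_a @ x_a
    err = f_a @ f_b - y
    W_b = W_b - lr * err * np.outer(f_a, x_b)
    return W_a, W_b

rng = np.random.default_rng(0)
W_a = rng.normal(0.0, 0.1, size=(4, 6))
W_b = W_a.copy()  # both branches start from the same pre-trained parameters
x_a = rng.normal(size=6)
x_a /= np.linalg.norm(x_a)
x_b = rng.normal(size=6)
x_b /= np.linalg.norm(x_b)
y = 1.0  # label: the two inputs show the same person

initial_loss = loss(W_a, W_b, x_a, x_b, y)
for _ in range(30):  # about 30 iterations, as in the embodiment
    W_a, W_b = alternating_step(W_a, W_b, x_a, x_b, y)
final_loss = loss(W_a, W_b, x_a, x_b, y)
```

Because the score is linear in each branch when the other is held fixed, each half-step is an ordinary gradient step on a simple quadratic, which is presumably why the alternating scheme is convenient for the product layer.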

In step (4), each input face picture to be verified is paired with each reference set picture using the 10 different crops, and each pair of crops is fed into the new bilinear joint CNN. The output of the fully connected layer of the bilinear joint CNN is taken as a feature vector, all the feature vectors are concatenated, and a three-layer deep autoencoder (self-coding) network performs data dimensionality reduction and two-class classification, giving the face verification result, i.e. the two-class result: if the value of the 'belonging to the same person' neuron output by the deep autoencoder network is larger than the value of the 'not belonging to the same person' neuron, the two pictures are judged to show the same person. If the method is used for a face classification task, the reference face for which the input face to be classified yields the highest 'belonging to the same person' neuron value can be taken as the classification result.

The specific process in step (3) is as follows: the face images in the training set are paired two by two, with whether the two images show the same person as the classification label; each pair is fed into the bilinear joint CNN in separate passes, once in each order of the two inputs; the mean square of the difference between the network's classification result and the actual label is used as the optimization criterion, and the network parameters are fine-tuned by gradient descent, yielding the bilinear CNN model for face verification.

The specific process in step (4) is as follows: each input face picture to be verified is paired with each reference set picture using the 10 different crops obtained in step (1), and each pair of crops is fed into the bilinear joint CNN. The output of the fully connected layer of the bilinear joint CNN is taken as a feature vector, all the feature vectors are concatenated, and a three-layer deep autoencoder (self-coding) network performs data dimensionality reduction and two-class classification, giving the face verification result, i.e. the two-class result: if the value of the 'belonging to the same person' neuron output by the deep autoencoder network is larger than the value of the 'not belonging to the same person' neuron, the two pictures are judged to show the same person. If the method is used for a face classification task, the reference face for which the input face to be classified yields the highest 'belonging to the same person' neuron value can be taken as the classification result.
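The concatenation and the two decision rules of step (4) can be sketched as follows, assuming NumPy. The deep autoencoder network itself is not implemented here; the sketch takes its two output neuron values as given, and the function names, feature length and scores are illustrative assumptions.

```python
import numpy as np

def concat_features(per_crop_features):
    """Concatenate the feature vectors from all crop pairs into one vector."""
    return np.concatenate(per_crop_features)

def verify(outputs):
    """outputs = (same_person_value, different_person_value)."""
    same, different = outputs
    return bool(same > different)

def identify(outputs_per_reference):
    """Pick the reference whose 'same person' neuron value is highest."""
    return max(outputs_per_reference,
               key=lambda name: outputs_per_reference[name][0])

# Ten hypothetical per-crop feature vectors of length 4 each.
joint_vector = concat_features([np.ones(4) for _ in range(10)])

# Hypothetical network outputs for one probe face against two references.
scores = {"ref_a": (0.3, 0.7), "ref_b": (0.9, 0.1)}
is_same = verify(scores["ref_b"])
best_match = identify(scores)
```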

It should be understood that the above-described embodiments of the present invention are merely examples given to illustrate the present invention clearly, and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims

1. A face verification method based on a bilinear joint CNN, characterized by comprising the following steps:

(2) combining two initial CNN models with the same parameters to form a new bilinear joint CNN; the initialization parameters of the convolution-pooling layers are given by step (1); the respective fully connected layers of the two CNN models are replaced by a joint three-layer fully connected stage whose input is obtained by multiplying the output matrices of the last convolution-pooling layers of the two initial CNNs; the softmax multi-class classifier of the last layer is replaced by a two-class classifier that judges whether the two inputs show the same face; the parameters of the three fully connected layers are then all initialized to random values drawn from a Gaussian distribution with zero mean and variance σ;

(3) pairing the face images in the training set two by two, feeding each pair into the two ends of the new bilinear joint CNN, and fine-tuning all parameters of the whole bilinear CNN according to the classification training result; during fine-tuning, the parameters of one side of the bilinear CNN model are fixed each time while the parameters of the CNN model on the other side are adjusted by gradient descent, and the bilinear CNN model for face verification is obtained after repeated iterations of fine-tuning;

(4) performing two-class classification on the features of a reference face image and a tested face image, both cropped in a multi-scale, multi-channel and multi-region manner, with a three-layer deep autoencoder (self-coding) network, and finally outputting the recognition accuracy of face verification;

in step (3), the face images in the training set are paired two by two, with whether the two images show the same person as the classification label; each pair is fed into the bilinear joint CNN in separate passes, once in each order of the two inputs; the mean square of the difference between the network's classification result and the actual label is used as the optimization criterion, and the network parameters are fine-tuned by gradient descent; during gradient back-propagation, at the output of the convolution-pooling layers of the two CNN models, which is the input of the fully connected layers, the matrix multiplication can be differentiated with respect to each of the two matrices, so that, when the derivatives for fine-tuning are computed, each iteration first fixes the parameters of one CNN model and fine-tunes the other, then fixes the parameters of the other CNN model and fine-tunes the first; after this automatic parameter adjustment has been iterated many times, the total loss function of the model converges, yielding the bilinear CNN model for face verification.

2. The face verification method based on a bilinear joint CNN of claim 1, wherein in step (1), using the symmetry of the face image, the upper part, the upper left part, the upper right part, the middle part, the left part, the right part, the lower left part, the lower right part and the original image of the face image are each cropped at different scales, and each face image is divided into 10 different face input crops; the CNN is pre-trained with the face images in the training set until the loss function of the CNN converges, and the parameters of the convolution-pooling layers of the trained CNN serve as the initial values of the convolution-pooling layer parameters of each of the two CNN models in the next step.

3. The face verification method based on a bilinear joint CNN of claim 1, characterized in that, in step (1), for the input face image, images of several parts of the face image are cropped with multi-scale rectangular frames to serve as input to the CNN, and the CNN is pre-trained with the face images in the training set; the CNN uses a 19-layer VGG network model consisting of 5 convolution-pooling stages and three fully connected layers; the first two fully connected layers are ordinary neural-network connection layers, and the third is a softmax classifier layer that performs the classification task and starts the back-propagation of derivatives; each convolution-pooling stage comprises several convolution layers and one pooling layer; based on the softmax classification result for a face image, the back-propagation algorithm adjusts the parameters of all layers layer by layer, from each layer back to the one before it, thereby yielding the initial CNN model.

4. The face verification method based on a bilinear joint CNN of claim 1, wherein the fully connected stage is set to 3 layers, and zero-mean white Gaussian noise with a variance of 0.01 is used as the initial value of the fully connected layer parameters.

5. The face verification method based on a bilinear joint CNN of claim 2, wherein in step (4), each input face picture to be verified is paired with each reference set picture using the 10 different crops, and each pair of crops is fed into the new bilinear joint CNN; the output of the fully connected layer of the bilinear joint CNN is taken as a feature vector, all the feature vectors are concatenated, and a three-layer deep autoencoder (self-coding) network performs data dimensionality reduction and two-class classification, giving the face verification result, i.e. the two-class result: if the value of the 'belonging to the same person' neuron output by the deep autoencoder network is larger than the value of the 'not belonging to the same person' neuron, the two pictures are judged to show the same person; if the method is used for a face classification task, the reference face for which the input face to be classified yields the highest 'belonging to the same person' neuron value can be taken as the classification result; the autoencoder network is first initialized with a Gaussian restricted Boltzmann machine (Gaussian RBM) model: the network parameters are first randomly initialized with white Gaussian noise of very small variance and the RBM layers are trained; the resulting parameters are then used as initial values to train the autoencoder network.