CN107766850B

CN107766850B - Face recognition method based on combination of face attribute information

Info

Publication number: CN107766850B
Application number: CN201711232374.6A
Authority: CN
Inventors: 马争; 解梅; 张恒胜; 涂晓光
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2017-11-30
Filing date: 2017-11-30
Publication date: 2020-12-29
Anticipated expiration: 2037-11-30
Also published as: CN107766850A

Abstract

The invention discloses a face recognition method based on combination of face attribute information, and belongs to the technical field of digital image processing. The invention discloses a new method for fusing identity information and attribute information to improve the accuracy of face recognition, aiming at the technical problems that the existing fusion method needs to train a plurality of DCNN networks and then carries out score fusion or feature fusion for further training, the work task is heavy and complicated, and the practical application is not facilitated. The face identity authentication network and the attribute recognition network are fused to form a fusion network, and the identity characteristic and the face attribute characteristic are simultaneously learned in a joint learning mode, so that the face recognition accuracy is improved, the face attribute characteristic can be predicted, and the face identity authentication network is a multi-task network; a weighting function sensitive to cost is adopted, so that the target domain data distribution is not depended on, and the balance training in a source data domain is realized; and the modified fusion framework only adds a few parameters, and the additional calculation load is small.

Description

Face recognition method based on combination of face attribute information

Technical Field

The invention belongs to the technical field of digital image processing, and particularly relates to a face recognition method based on combination of face attribute information.

Background

With the rapid development of deep learning, the face recognition technology has been developed rapidly, and many products applying the face recognition technology are produced. However, the current face recognition technology has many limitations, and the very typical problems are influenced by environmental factors such as a large side face, light and the like, which all reduce the performance of the face recognition system. Many researchers have done much work on face pose correction, domain adaptation (domain), etc., and while many efforts have been made, they are still in the exploration phase. According to research, under the condition that a plurality of conditions are greatly changed, the identification of the facial attribute information (such as gender, eyebrow shape, nose bridge height and the like) of a plurality of people is not greatly influenced, and the facial attribute information can still be accurately identified. Therefore, the accuracy of face recognition can be improved by combining the face attribute information.

At present, a plurality of multitask frameworks applied to face attribute learning are not available. Many of these approaches, while simple in concept, are very burdensome. For example: using AdaBoost to select independent feature subspace and independent SVM classifier for each attribute to realize classification on different attributes; or separately training DCNN (deep convolutional neural network) for each attribute and then training an independent SVM classifier for classification. The work task is very complicated, and the practical utilization value is low. Rudd et al propose a hybrid target optimization network to learn face attributes, and train different attributes jointly, greatly reducing the work difficulty and making it easier to implement.

In the aspect of fusion, many workers try to add attribute information into face recognition to improve the accuracy of face recognition. However, no mature algorithm exists at present for how to fuse the face attribute information and the face identity authentication information. Currently existing fusion methods are roughly divided into two categories:

(1) the fusion framework is shown in fig. 1, and the identity recognition network and n (n > 2) attribute recognition networks are respectively trained, an input picture is subjected to DCNN (deep neural network) to extract features, similarity scores (label corresponding probability values) are output through a full connection layer Fc and a softmax layer, and then all the similarity scores are added to form a new identity similarity score prediction target identity.

(2) Feature level fusion, which can be further divided into an aggregation method and a subspace learning method. The aggregation method is to extract attribute features and identity authentication features by using a network, then simply connect the two features at a feature level or limit the two features to have the same dimension, and then carry out element averaging or multiplication. The subspace learning method is to connect the two features in series, then map the connected features to a more appropriate subspace, and then learn the fusion parameters by adopting a supervised or unsupervised learning method. Unsupervised learning does not utilize identity information for fusion learning, and supervised learning uses identity information for fusion learning, relatively. The fusion framework of the feature layer is as shown in fig. 2, which is similar to the structure of the fractional layer, and respectively trains an identity recognition network and n attribute recognition networks, then inputs the picture into all networks, then extracts the features of the pooling layer of the last pooling layer, fuses all the pooling layer features together through a feature connector, and then performs prediction output on a new feature training SVM or other classifiers.

Both methods need to train a plurality of DCNNs, and then perform score fusion or feature fusion for further training, which is heavy and complicated in work task and not beneficial to practical application.

Disclosure of Invention

The invention aims to: aiming at the existing problems, a new mode of fusing identity information and attribute information is disclosed to improve the accuracy of face recognition. The invention fuses the face identity authentication network and the attribute identification network to form a fusion network, and simultaneously learns the identity characteristics and the face attribute characteristics by adopting a joint learning mode.

The invention relates to a face recognition method based on combination of face attribute information, which comprises the following steps:

constructing a fusion network model:

a third module blockC is used as an input layer of the converged network model, the third module blockC is connected with a first module blockA1, the first module blockA1 is respectively connected with a first module blockA2, and a second module blockB 1; the first module BlockA2 is sequentially connected with a first module BlockA3, a pooling layer of a first global average pooling mode, a first full connection layer and a second full connection layer Softmax layer to form an identity identification network;

the first module Block A2 and the second module Block B1 are both connected with a feature connector, and the feature connection layer is sequentially connected with the second module Block B2, a pooling layer of a second global average pooling mode and a third full connection layer to form a face attribute identification network;

wherein, the first module Block a1 stacks 5 inclusion structures, the first module Block a2 stacks 10 inclusion structures, the first module Block A3 stacks 5 inclusion structures, the inclusion structure includes a feature connector, a convolution layer, a pooling layer, a normalization layer and an input interface layer, and four paths of parallel convolution are included between the feature connector and the input interface layer: the first path is a convolution layer and a normalization layer which are connected in series, wherein the convolution layer is connected with the input interface layer, and the convolution kernel is 1 multiplied by 1; the second path is two convolution layers and a normalization layer which are connected in series, wherein the convolution kernel of the convolution layer connected with the input interface layer is 1 multiplied by 1, and the convolution kernel of the other convolution layer is 3 multiplied by 3; the third path comprises two convolution layers and a normalization layer which are connected in series, wherein the convolution kernel of the convolution layer connected with the input interface layer is 1 multiplied by 1, and the convolution kernel of the other convolution layer is 5 multiplied by 5; the fourth path comprises a pooling layer, a convolution layer and a normalization layer which are sequentially connected in series, wherein the pooling layer is connected with the input interface layer, the pooling mode is maximum pooling, the pooling kernel is 2 multiplied by 2, and the convolution kernel of the convolution layer is 1 multiplied by 1;

the second modules, BlockB1 and BlockB2, are convolution structures: the system comprises an input interface layer, a convolution layer with convolution kernel of 1 × 1, a convolution layer with convolution kernel of 3 × 3 and an output interface layer which are connected in sequence;

the third module blockC comprises an input layer, 3 groups of convolution layers and pooling layer groups which are connected in series and an output interface layer, wherein the cores of the convolution layers and the pooling layers are respectively 3 multiplied by 3 and 2 multiplied by 2, and the pooling mode is maximum pooling;

training the fusion network model:

step 101: collecting a training sample set, and carrying out image preprocessing on the training sample, wherein the image preprocessing comprises size normalization, image pixel value mean normalization and random angle turnover normalization; randomly dividing the training sample set into a plurality of sub-training sets, wherein the sample number of each sub-training set is S;

step 102: initializing a neural network parameter and an attribute distribution weight of an attribute loss function, and acquiring a network parameter of first iteration and the attribute distribution weight of the attribute loss function; wherein the attribute distribution weight of the attribute loss function comprises the attribute distribution weight of the attribute loss function of the positive sample

And of negative examplesAttribute distribution weights for attribute loss functions

i is an attribute category identifier;

step 103: using the sub-training set as the input image of the fusion network model, predicting the identity label and each attribute label, comparing the error with the real label, and calculating the loss function

Wherein l_softmaxRepresents the loss function of the Softmax layer,/_centerlossCenter loss function, l, representing the identity of the first fully-connected layer face_multitaskA face attribute loss function, λ, representing the third connection layer₁And λ₂Represents a predetermined loss weight, 0 < lambda₁,λ₂If the ratio is less than 1, taking a channel test observation value;

wherein

FC[i]_jThe result of the attribute all-connected layer output attribute i representing the jth picture,

labels, P, representing pairs of pictures j to attributes i_t ⁱAttribute distribution weights representing attribute loss functions of positive or negative samples of the t-th iteration corresponding to attribute i, C representing the number of attribute classes;

step 104: calculating a loss function

Gradient of (2)

Wherein W_tA network parameter representing a t-th iteration;

iteratively updating the network parameters: : w_t+1＝W_t+V_t+1Wherein

Beta represents the learning rate of the preset negative gradient, mu represents the weight of the last gradient value, V_tRepresents the gradient of the t-th iteration, and the gradient of the first iteration is 0 (if the initial value of t is 0, i.e. V)₀0), the weight mu is a preset value;

iteratively updating the attribute distribution weights of the attribute loss function:

wherein the dimension parameter

r＝∑P_t ⁱyⁱFC[i]Current normalized variable Zⁱ＝∑P_t ⁱexp(-αyⁱFC[i])，FC[i]Representing the current output result of the third fully-connected layer for attribute i, i.e., S FCs [ i]_jComposition FC [ i]，yⁱTrue labels, i.e. S, representing the current sub-training set for attribute i

Composition yⁱ；

Step 105: step 103-104 is repeatedly executed, the network parameter and the attribute distribution weight of the attribute loss function of each attribute are iteratively updated until the loss function

Converging and storing the currently updated network parameters and the attribute distribution weight of the attribute loss function;

and (3) identification processing of the image to be identified:

step 201: carrying out size normalization and image pixel value mean value normalization processing on an image to be recognized;

step 202: loading the network parameters saved in the training process;

step 203: inputting the image to be recognized processed in the step 201 into the fusion network model constructed in the invention, performing forward propagation, and respectively predicting an identity tag and a C attribute tag through a second full-connection layer and a third full-connection layer, wherein the identity tag is an index tag corresponding to the maximum probability value through the second full-connection layer and a softmax layer; and the attribute label is directly output through the third full connection layer.

In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:

(1) the invention provides a new fusion framework of a human face attribute supervision human face recognition learning task, under the framework, the invention not only improves the accuracy of human face recognition, but also can predict the attribute characteristics of the human face, and is a multi-task network;

(2) a multi-task learning framework is improved, and a cost-sensitive weighting function is adopted, so that the balance training in a source data domain is realized without depending on the data distribution of a target domain;

(3) the modified fusion framework only adds a few parameters, and the additional calculation load is small. Compared with the existing method of independently training the face attribute network and the identity recognition network, extracting the features and fusing the features, the method reduces the workload and the operation load to a certain extent, and is more convenient for practical deployment and application.

Drawings

FIG. 1 is a diagram of a fusion framework for a prior art fractional layer;

FIG. 2 is a diagram of a fusion framework for a prior art fractional layer;

FIG. 3 is a schematic view of the fusion framework of the present invention;

FIG. 4 is a schematic diagram of the first module Block A of the fusion framework of the present invention;

FIG. 5 is a schematic diagram of the second module Block B of the fusion framework of the present invention;

fig. 6 is a schematic structural diagram of a third module BlockC of the fusion framework of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.

Referring to fig. 3, the converged network model of the present invention includes first, second and third modules: BlockA, BlockB and BlockC, pooling layer, full connectivity layer FC, feature connector (Filter connectivity), Softmax layer; the third module blockC is an input layer of the converged network model, and is connected with the first module blockA1, the first module blockA1 is respectively connected with the first module blockA2, and the second module blockB 1; the first module BlockA2 is sequentially connected with a first module BlockA3, a pooling layer (global average pooling mode), a first full connection layer (FC 1024) and a second full connection layer (FC N) Softmax layer to form an identity identification network; the first module BlockA2 and the second module BlockB1 are both connected with a feature connector, that is, features obtained by BlockA2 and BlockB1 are stacked in depth through the feature connector, and the feature connection layer is sequentially connected with a second module BlockB2, a pooling layer (global average pooling mode) and a third full connection layer (FC 8), so as to form a face attribute recognition network.

Wherein the first fully-connected layer is used to generate a center loss (centrorloss), that is, to make the face features of each identity gather at the center of the corresponding identity, reduce the intra-class distance, and increase the inter-class distance, and the output dimension of the first fully-connected layer depends on the feature dimension of the input image, for example 1024; the second full connection layer is used for outputting N (identity category number, namely N people) dimensional full connection characteristics, and the second full connection layer outputs final identity information through the softmax layer; the output dimension of the third fully-connected layer depends on the number of preset face attributes, namely, the recognition results (whether corresponding attributes exist) of different belongings are obtained through the second fully-connected layer, wherein the face attributes comprise sex, size of mouth, thickness of lips, size of eyes, thickness of eyebrows, height of nose bridge, size of nose, width of forehead and the like.

The first module, Block a (i), is a Block in which three inclusion structures are stacked (connected in series) and is used to extract shallow, intermediate, and advanced features in a picture, that is, Block a (i) is a different number of inclusion structures, where Block a1 stacks 5 inclusion structures, Block a2 stacks 10 inclusion structures, and Block A3 stacks 5 inclusion structures. The inclusion structure is shown in fig. 4, and the feature connector (Filter classification), the convolutional layer (conv), the pooling layer (posing), the normalization layer (batch normalization), and the input interface layer (Previous layer) include four parallel convolutions between the feature connector and the input interface layer: the first path is a convolution layer and a normalization layer which are connected in series, wherein the convolution layer is connected with the input interface layer, and the convolution kernel is 1 multiplied by 1; the second path is two convolution layers and a normalization layer which are connected in series, wherein the convolution kernel of the convolution layer connected with the input interface layer is 1 multiplied by 1, and the convolution kernel of the other convolution layer is 3 multiplied by 3; the third path comprises two convolution layers and a normalization layer which are connected in series, wherein the convolution kernel of the convolution layer connected with the input interface layer is 1 multiplied by 1, and the convolution kernel of the other convolution layer is 5 multiplied by 5; the fourth path comprises a pooling layer, a convolution layer and a normalization layer which are sequentially connected in series, wherein the pooling layer is connected with the input interface layer, the pooling mode is maximum pooling, the pooling kernel is 2 x 2, and the convolution kernel of the convolution layer is 1 x 1. The normalization layer is used for simulating large-scale change of parameters, so that training tends to be stable, deeper training of the network becomes easy, convergence is accelerated, a certain regularization effect is achieved, and overfitting of the model is prevented.

The first module Block A (i) of the invention carries out channel dimension reduction and weighted summation on the shallow layer feature and the intermediate feature through 1 x 1 convolution to form a primary feature and a part of intermediate feature of the face attribute, then learns the feature through 3 x 3 convolution kernels respectively, and finally forms the high-level feature of the face attribute, wherein in a neural network, the shallow layer and the intermediate feature contain certain general information, and the high-level feature is a targeted feature guided by a learning task. The present invention combines shallow and intermediate features in an identity recognition network to learn high-level features in the face attributes. Therefore, the convolution Block B with a small amount of added parameters can simultaneously realize the simultaneous learning of the identity characteristics and the attributes.

The second blocks blockb (i) are all convolution structures shown in fig. 5, and include an input interface layer (Previous layer), a convolution layer with a convolution kernel of 1 × 1, a convolution layer with a convolution kernel of 3 × 3, and an output interface layer (Top layer) connected in sequence.

Referring to fig. 6, the third block BlockC includes an input layer, 3 groups of convolutional layers and pooling layer groups connected in series, and an output interface layer (Top layer), where cores of the convolutional layers and the pooling layers are 3 × 3 and 2 × 2, respectively, and the pooling mode is maximum pooling.

After a small number of parameters are added, the face attribute characteristics and the identity characteristics are fused, network synchronous training is realized, and the face recognition accuracy is improved. The added parameters are mainly the parameters of Block B and a full link layer of a multitask classifier.

BlockB parameter increment: the number of input feature maps, N, is denoted by M₁Denotes the number of 1 × 1 convolution kernels, N₂Representing the number of 3 × 3 convolution kernels, the BlockB parameter is: num_param1＝MN₁+9N₁N₂；

Number of added parameters of the convolutional layer: if A represents the input feature dimension of the convolutional layer and C represents the attribute feature type, the input dimension of the fully-connected layer is C, so the parameters are: num_param2＝AC；

For example, for M128, N₁＝64，N₂An application scenario of 128 may be: num_param181920; and A is of the order of 10³C is of the order of 10²Then Num_param2Typically of the order of 10⁵The overall added parameters are not much more than the millions of parameters of the whole face recognition network.

The invention utilizes the shallow feature and the intermediate feature of the face identity recognition, further learns to generate a part of intermediate features and the final attribute advanced feature, and then outputs the attribute prediction through the full connection layer. C is used to represent attribute feature type, the output dimension of the full connection layer is C, and for a certain attribute i, the full connection output is FC [ i]If i is more than or equal to 1 and less than or equal to C, the classification result Y [ i ] of the attribute i]And error E [ i ]]Respectively as follows:

wherein the loss is: l is_i＝max(0,1-y_iFC[i])。

In the multitask optimization process, the data imbalance problem is a problem which must be solved, so that the loss of each attribute cannot be directly added. In this case, the invention defines a hybrid objective function, which uses the distribution of the attributes in the data field to perform a weighted summation of the loss of each attribute, but in the selection of the weighting function, this is achieved by a cost-sensitive weighting functionThe weighting function is:

wherein

Representing a scale parameter, r ═ Σ P_t ⁱyⁱFC[i]Current normalized variable Zⁱ＝∑P_t ⁱexp(-αyⁱFC[i])，FC[i]Representing the current output result of the third fully-connected layer for attribute i, i.e., S FCs [ i]_jComposition FC [ i]，yⁱTrue labels, i.e. S, representing the current sub-training set for attribute i

Composition yⁱThereby obtaining a multitask penalty function

Wherein N is the number of pictures in a batch, C is the number of attribute types, FC [ i]_jThe result of the attribute all-connected layer output attribute i representing the jth picture,

a label representing picture j versus attribute i.

The overall system penalty function is then:

wherein l_softmaxRepresents the loss function of the Softmax layer,/_centerlossCenter loss function, l, representing the face identity of the first fully-connected layer_multitaskFace attribute loss function, λ, corresponding to the third connection layer₁And λ₂Represents a predetermined loss weight, 0 < lambda₁,λ₂And (5) taking the observed value of the experience, wherein the suggested value is 0.08 and 0.02. Therefore, the whole system is trained under the co-supervision of identity recognition loss and attribute recognition loss, parameters are optimized, and the fusion of attribute recognition and identity recognition at the parameter level is realized, rather than the existing fusion mode at the feature level and the final similarityResulting in fusion of the planes.

The face recognition method based on the fusion network model constructed by the invention mainly comprises two processes of training and recognition, which are specifically as follows:

1. training process:

step 101: and acquiring a training sample set, and performing image preprocessing on the training sample, wherein the image preprocessing comprises size normalization, image pixel value mean normalization and random inversion normalization (left-right inversion is performed to increase the number of the training sample set). For example, scaling the picture size to 128 × 3, 112 × 3(H × W × C, H is the picture height, W is the picture width, C is the picture channel, and 3 is represented as RGB color picture), and then performing mean normalization and random flip normalization;

randomly dividing the training sample set into a plurality of sub-training sets, wherein the sample number of each sub-training set is S;

step 102: initializing a neural network parameter (such as an Xavier method), and distributing the weight of the attribute loss function, thereby obtaining the network parameter of the first iteration and the attribute distribution weight of the attribute loss function. Wherein the attribute distribution weight of the attribute loss function comprises the attribute distribution weight of the attribute loss function of the positive sample

Attribute distribution weight of attribute loss function of sum negative sample

And respectively represent positive and negative sample loss function initialization weights for attribute i,

and

representing the number of positive and negative samples with the attribute of i in the training sample set;

step 103: the sub-training set is used as an input image of the fusion network model constructed by the invention, the identity label and the C attribute label are predicted, and the error is compared with the real labelCalculating a loss function

In this embodiment, λ₁And λ₂The preferable values of (b) are 0.08 and 0.02, respectively. Namely, it is

Wherein

table indicates the label of picture j to attribute i, P_t ⁱAn attribute distribution weight representing an attribute loss function of the positive or negative sample of the tth iteration corresponding to attribute i;

step 104: calculating the gradient of the loss function

Wherein W_tA network parameter representing a t-th iteration;

updating the network parameter W at the t +1 th time_t+1：W_t+1＝W_t+V_t+1Wherein

updating attribute distribution weight of attribute loss function at t +1 th time

Namely, the attribute distribution weights of the attribute loss functions of the positive and negative samples are updated according to the updating mode, wherein alpha is a scale parameter,

Composition yⁱ；

Step 105: step 103 and step 104 are repeatedly executed, the network parameter and the attribute distribution weight of the attribute loss function of each attribute are iteratively updated until

And (6) converging. And storing the currently updated network parameters and attribute distribution weights of the attribute loss functions.

2. The identification process comprises the following steps:

step 202: loading the network parameters saved in the training process;

step 203: inputting the image to be recognized processed in step 201 into the fusion network model constructed in the invention, calculating forward, and predicting the identity label and the C attribute label through two full-connection layers (second and third). In the specific embodiment, the identity tag obtains the index tag corresponding to the maximum probability value through FC1024 and softmax; the face attribute label is output through FC8, where the face attribute label Y is:

while the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except mutually exclusive features and/or steps.

Claims

1. The face recognition method based on the combination of the face attribute information is characterized by comprising the following steps:

constructing a fusion network model:

training the fusion network model:

Attribute distribution weight of attribute loss function of sum negative sample

Wherein i is a face attribute class identifier;

Wherein

Representing the loss function of the Softmax layer,

a central loss function representing the identity of the face at the first fully connected layer,

a face attribute loss function, λ, representing the third connection layer₁And λ₂Represents a predetermined loss weight, 0 < lambda₁,λ₂If the ratio is less than 1, taking a channel test observation value;

wherein

Represents the current output result of the third fully-connected layer of the jth picture to attribute i,

a real label representing picture j versus attribute i,

attribute distribution weights representing attribute loss functions of positive or negative samples of the t-th iteration corresponding to attribute i, C representing the number of attribute classes;

step 104: calculating a loss function

Gradient of (2)

Wherein W_tA network parameter representing a t-th iteration;

iteratively updating the network parameters: w_t+1＝W_t+V_t+1Wherein

Beta represents the learning rate of the preset negative gradient, mu represents the weight of the last gradient value, V_tRepresenting the gradient of the t iteration, wherein the gradient of the first iteration is 0, and the weight mu is a preset value;

iteratively updating attribute distribution weights for attribute loss functionsHeavy:

wherein the dimension parameter

Current normalized variable

Representing the current output result of the third fully-connected layer for attribute i, i.e., S FCs [ i]_jComposition FC [ i]，yⁱTrue labels, i.e. S, representing the current sub-training set for attribute i

Composition yⁱ；

and (3) identification processing of the image to be identified:

step 202: loading the network parameters saved in the training process;

step 203: inputting the image to be recognized processed in the step 201 into the fusion network model, performing forward propagation, and predicting identity tags and C types of face attribute tags through a second full-connection layer and a third full-connection layer respectively, wherein the identity tags are subjected to the extraction of index tags corresponding to the maximum probability value through the second full-connection layer and the softmax layer; and the face attribute label is directly output through the third full-connection layer.

2. The method of claim 1, wherein the attribute loss function for positive samplesProperty distribution weight of

Attribute distribution weight of attribute loss function of sum negative sample

The initial values of (a) are:

and

representing the number of positive and negative samples with attribute i in the training sample set.