CN110443162B - Two-stage training method for disguised face recognition - Google Patents

Two-stage training method for disguised face recognition

Info

Publication number
CN110443162B
Authority
CN
China
Prior art keywords
network
layer
training
stage
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910654611.0A
Other languages
Chinese (zh)
Other versions
CN110443162A (en
Inventor
吴晓富
项阳
赵师亮
张索非
颜俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN201910654611.0A priority Critical patent/CN110443162B/en
Publication of CN110443162A publication Critical patent/CN110443162A/en
Application granted granted Critical
Publication of CN110443162B publication Critical patent/CN110443162B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation

Abstract

A two-stage training method for disguised face recognition comprises the following steps: step S1, preprocessing the data set needed for training to obtain Set_F and Set_S; step S2, first stage, using Set_F as the training set and training the network with the ArcFace loss function; step S3, removing the last fully connected layer of the network; step S4, second stage, using Set_S as the training set and training the network with the ArcFace loss function. The invention uses a small amount of disguised face data to transfer the application domain of the model from general face recognition to disguised face recognition, and achieves a good recognition effect on the DFW benchmark.

Description

Two-stage training method for disguised face recognition
Technical Field
The invention belongs to the technical field of face recognition, and particularly relates to a two-section type training method for disguised face recognition.
Background
Face recognition techniques based on convolutional neural networks have enjoyed tremendous success in recent years. Research shows that face recognition using face feature vectors obtained through a network mapping is very effective and is generally considered the state of the art. With the continual introduction of advanced network structures, high-quality datasets and more refined loss functions, the discriminative power of the resulting feature vectors keeps increasing: the difference between the feature vectors of different persons grows, while the difference between the feature vectors of the same person shrinks.
Although face recognition has made great progress, disguised face recognition remains a challenging topic. Applying makeup to the face or wearing a cap, mask or similar items greatly increases the difficulty of identification. On top of this, the overall quality of the datasets that deep learning depends on in this setting is unsatisfactory, which raises the difficulty of the topic further. Compared with general face recognition, where high-quality results such as FaceNet, SphereFace and ArcFace keep appearing, results in disguised face recognition are much scarcer; a recent one is MiRA-Face, obtained on the DFW disguised face dataset. That method uses two-stage training: it first obtains a network with a general face recognition training method, and then uses the training set provided by DFW to perform dimensionality reduction on the feature vectors with PCA, thereby extracting some disguise-related information. MiRA-Face has the following disadvantages: (1) the first training stage uses the method proposed by CosFace, which is not the best choice at present; (2) PCA extracts relatively little information. Both leave room for improvement in the performance of the algorithm.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of the prior art and provide a two-stage training method for disguised face recognition, in which ArcFace is adopted to train a basic convolutional neural network and a Joint loss function is used to minimize the intra-class distance of the DFW training samples and enlarge their inter-class distance, thereby achieving a good disguised face recognition effect.
The invention provides a two-stage training method for disguised face recognition, which comprises the following steps:
step S1, preprocessing the data set needed for training to obtain Set_F and Set_S;
step S2, first stage, using Set_F as the training set and training the network with the ArcFace loss function;
step S3, removing the last fully connected layer of the network;
step S4, second stage, using Set_S as the training set and training the network with the Joint loss function.
As a further technical scheme of the invention, the model adopted in the first stage is composed of a ResNet50IR residual network, an output module composed of a BatchNorm layer, a Dropout layer, a fully connected layer and another BatchNorm layer, a classification module composed of a fully connected layer and a Softmax classification layer, and the ArcFace loss function; the ResNet50IR residual network and the output module serve as the backbone network for extracting features.
Furthermore, the ResNet50IR residual network uses a residual unit based on the 50-layer ResNet; the residual unit is a 6-layer composite structure of BatchNorm-convolution-BatchNorm-PReLU-convolution-BatchNorm, and the output size is determined by the stride of the 5th (convolution) layer: when the stride is 1, the output has the same size as the input; when the stride is 2, the output size is half of the input size. The ResNet50IR residual network is composed of an input part and 4 convolution modules which have 3, 4, 14 and 3 residual units respectively, the first residual unit of each convolution module being responsible for reducing the output dimension. The Dropout parameter of the output module is 0.5, the output of the fully connected layer is a 512-dimensional vector, and the final feature vector v is obtained after passing through a BatchNorm layer.
Furthermore, before the feature vector v is input into the fully connected layer, it is normalized so that ‖v‖ = 1. The dimension of the weight of the fully connected layer is determined by the number of label categories of the training set: when the number of categories is P, the dimension of the weight matrix W is D × P; when the MS-Celeb-1M dataset is used as the training set, P is 85K and D is the length of the feature vector v, here 512. The offset b of the fully connected layer is set to zero and each column of W is normalized, so that the ith element of the output vector of the fully connected layer is
v·w_i = ‖v‖·‖w_i‖·cos θ_i = cos θ_i,
where w_i is the ith column of the weight matrix W.
Further, the network is trained with the ArcFace loss function, whose formula is
L = -(1/N) · Σ_{i=1}^{N} log [ e^{s·cos(θ_{y_i,i}+m)} / ( e^{s·cos(θ_{y_i,i}+m)} + Σ_{j≠y_i} e^{s·cos θ_{j,i}} ) ],
where the hyperparameters s and m are set to 64 and 0.5 respectively, θ_{j,i} is the angle between the feature vector v_i generated by the ith input and the weight vector w_j, and y_i is the correct label value corresponding to v_i.
Furthermore, the model of the second stage is composed of a feature extraction network and the Joint loss function; the feature extraction network is the backbone network of the first stage, and the Joint loss function formula is
L_Joint = Σ_{i=1}^{N} max(0, <f(x_i^a), f(x_i^n)> − <f(x_i^a), f(x_i^p)> + α) + λ · Σ_{i=1}^{N} (1 − <f(x_i^a), f(x_i^p)>),
in which the first part is the Triplet loss and the second part is the Pair loss; f(x_i) is the feature vector v_i output by the feature extraction network after normalization, <f(x_1), f(x_2)> is the vector product of two feature vectors, i.e. the cosine of the angle between v_1 and v_2, and the parameters α and λ are positive values.
Further, the training set of the second stage is the training set of the DFW dataset, and before training the triples (x_i^a, x_i^p, x_i^n) need to be paired: a Normal image is first selected as x_i^a, then a Validation or Disguised image under the same directory as the Normal image is selected as x_i^p, and finally an Impersonator image under the same directory as the Normal image is selected as x_i^n.
Furthermore, the pictures under the same directory in the DFW dataset are divided into four types, Normal, Validation, Disguised and Impersonator, where Normal, Validation and Disguised show the same person, and the Impersonator is a different person whose appearance is similar to the first three.
The invention uses a small amount of disguised face data to transfer the application domain of the model from general face recognition to disguised face recognition, and achieves a good recognition effect on the DFW benchmark.
Drawings
FIG. 1 is a schematic diagram of a training process of the present invention;
FIG. 2 is a schematic diagram of a backbone network structure according to the present invention;
FIG. 3 is a diagram of a residual unit structure according to the present invention;
FIG. 4 is a graph comparing different loss functions in phase 2 of the present invention;
FIG. 5 is an exemplary graph of a DFW data set;
fig. 6 is a graph comparing the results of DFW tests.
Detailed Description
Referring to fig. 1, the overall process of this embodiment is divided into two stages. The network model used in stage 1 is composed of the following 4 parts: (1) the ResNet50IR residual network; (2) the output module, composed of a BatchNorm layer, a Dropout layer, a fully connected layer and another BatchNorm layer; (3) the classification module, composed of a fully connected layer and a Softmax classification layer; (4) the ArcFace loss function. Parts (1) and (2) serve as the backbone network for feature extraction; the specific network structure and the output dimensions for a single input are shown in fig. 2. Stage 1 is analysed in detail as follows:
(1) ResNet50IR uses the modified residual unit shown in fig. 3 on top of the conventional 50-layer ResNet. The residual unit uses a 6-layer composite structure of BatchNorm-convolution-BatchNorm-PReLU-convolution-BatchNorm. The output size of the whole residual unit is controlled by the stride of the 5th (convolution) layer: when the stride is 1, the output has the same size as the input; when the stride is 2, the output size is half of the input size.
(2) ResNet50IR is made up of 5 parts, an input part and 4 convolution modules; the 4 convolution modules have 3, 4, 14 and 3 residual units respectively, and the first residual unit in each module is responsible for reducing the output dimension (the stride of the second convolution layer in that unit is set to 2). Fig. 2 presents the output dimension of each module for a single input of dimension [112 × 112 × 3].
(3) The parameter of the Dropout layer in the output module is set to 0.5, that is, the outputs of a randomly chosen half of the units in the layer are set to zero, which increases the robustness of the network. The output of the fully connected layer is a 512-dimensional vector, and the final feature vector v is obtained after this vector passes through a BatchNorm layer.
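A minimal PyTorch sketch of the residual unit and output module just described is given below. The 3×3 convolution kernels, the 1×1 projection shortcut used when the spatial size or channel count changes, and the application of the output module's first BatchNorm to the flattened backbone output are assumptions; the text only fixes the BatchNorm-convolution-BatchNorm-PReLU-convolution-BatchNorm layer order, the stride behaviour, the Dropout parameter of 0.5 and the 512-dimensional output.

```python
import torch.nn as nn

class IRResidualUnit(nn.Module):
    """BatchNorm-Conv-BatchNorm-PReLU-Conv-BatchNorm residual unit (sketch)."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.PReLU(out_ch),
            # 5th layer: its stride determines the output size (1 keeps it, 2 halves it)
            nn.Conv2d(out_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Shortcut branch: identity when shapes match, otherwise an assumed 1x1 projection.
        if stride == 1 and in_ch == out_ch:
            self.shortcut = nn.Identity()
        else:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        return self.body(x) + self.shortcut(x)


class OutputModule(nn.Module):
    """BatchNorm -> Dropout(0.5) -> fully connected (512) -> BatchNorm, yielding the feature vector v."""

    def __init__(self, in_features, emb_size=512):
        super().__init__()
        self.layers = nn.Sequential(
            nn.BatchNorm1d(in_features),
            nn.Dropout(0.5),
            nn.Linear(in_features, emb_size),
            nn.BatchNorm1d(emb_size),
        )

    def forward(self, x):  # x: flattened output of the last convolution module
        return self.layers(x)
```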
(4) Before the feature vector v is input into the fully connected layer, it must be normalized so that ‖v‖ = 1. The dimension of the weight of the fully connected layer is determined by the number of label categories of the training set: when the number of categories is P, the dimension of the weight matrix W is D × P (D rows and P columns); P is 85K when the MS-Celeb-1M dataset is used as the training set, and D is the length of the feature vector v, here 512. The offset b of the fully connected layer is set to zero and each column of W is normalized. Writing the ith column of W as w_i, the ith element of the output vector of the fully connected layer is:
v·w_i = ‖v‖·‖w_i‖·cos θ_i = cos θ_i (1.1)
where θ_i is the angle between the two vectors v and w_i.
(5) The whole network is trained with the loss function proposed by ArcFace:
L = -(1/N) · Σ_{i=1}^{N} log [ e^{s·cos(θ_{y_i,i}+m)} / ( e^{s·cos(θ_{y_i,i}+m)} + Σ_{j≠y_i} e^{s·cos θ_{j,i}} ) ] (1.2)
The hyperparameters s and m in the formula are set to 64 and 0.5 respectively, θ_{j,i} denotes the angle between the feature vector v_i generated by the ith input and the weight vector w_j, and y_i denotes the correct label value corresponding to v_i. For comparison, the commonly used Softmax loss function is shown in equation (1.3):
L = -(1/N) · Σ_{i=1}^{N} log [ e^{W_{y_i}^T v_i + b_{y_i}} / Σ_{j=1}^{P} e^{W_j^T v_i + b_j} ] (1.3)
Equation (1.2) modifies equation (1.3) as follows: the bias b of the fully connected layer is set to 0, and the feature vector v and the weight vectors w_i are normalized, so that the product of v and w_i can be regarded as the cosine of the angle between the two vectors, see equation (1.1). Treating equation (1.2) as a function of the angles θ_{j,i} and examining its gradient, the direction in which the loss function drops most rapidly is the one in which θ_{y_i,i} decreases and θ_{j,i} (j ≠ y_i) increases. Training will therefore pull each feature vector v_i as close as possible to the weight vector w_{y_i} representing its label category, while pushing it away from the remaining weight vectors w_j, j ≠ y_i, that do not represent that category. As training proceeds, feature vectors with the same label gather in the same region and the angles between feature vectors with different labels widen, i.e. the intra-class distance decreases and the inter-class distance increases.
Using cos(θ_{y_i,i} + m) in place of cos(θ_{y_i,i}) further reduces the intra-class distance of the feature vectors: even when θ_{y_i,i} is already small, cos(θ_{y_i,i} + m) keeps the loss function of equation (1.2) at a relatively large value, so that to decrease the loss further, θ_{y_i,i} must be reduced even more.
Besides normalizing the feature vector v and the weight vectors w_i, a hyperparameter s is set; using a larger value of s makes training easier and the network converges more readily, and s is generally set to 64. Without s, i.e. using cos(θ_{y_i,i} + m) alone instead of s·cos(θ_{y_i,i} + m), the model is difficult to converge in actual training. When s is set larger, the loss on a misclassified sample is larger than it would be without s, which forces the loss function to iterate toward the correct direction; when the classification is correct, the loss is smaller than the original, so training converges easily.
The training in the first stage generally uses a larger face dataset, such as VGG2 or MS-Celeb-1M; the purpose of this stage is to obtain a feature extraction network (i.e. the network with the last fully connected classification layer removed) that already works well on non-disguised faces.
The network model used in stage 2 consists of (1) the feature extraction network and (2) the Joint loss function, where the feature extraction network is the backbone network obtained from stage 1. Stage 2 is analysed in detail as follows:
(1) The Joint loss function is shown in the following equation:
L_Joint = Σ_{i=1}^{N} max(0, <f(x_i^a), f(x_i^n)> − <f(x_i^a), f(x_i^p)> + α) + λ · Σ_{i=1}^{N} (1 − <f(x_i^a), f(x_i^p)>) (1.5)
Equation (1.5) consists of two parts: the first part is called the Triplet loss and the second part the Pair loss. In the formula, f(x_i) represents the feature vector v_i output by the feature extraction network after normalization, and <f(x_1), f(x_2)> represents the vector product of two feature vectors, i.e. the cosine of the angle between v_1 and v_2. Both parameters α and λ take positive values. When training with this loss function, the pictures need to be arranged into triples (x_i^a, x_i^p, x_i^n) for input, where the pair (x_i^a, x_i^p), called a positive sample pair, must carry the same label, and (x_i^a, x_i^n), called a negative sample pair, must carry different labels. The Triplet loss constrains the distance between a positive sample pair to be smaller than the distance between the corresponding negative sample pair, with the margin controlled by the parameter α, which is generally about 0.3. The Pair loss constrains the distance within a positive sample pair, further limiting the intra-class distance; it avoids the situation, possible when only the Triplet loss is used, where the margin is satisfied by pushing negative pairs apart while the intra-class distance is never actually reduced. Fig. 4 compares the angular distribution of positive sample pairs obtained after training with the Triplet loss alone and with the Joint loss; the distribution is clearly better once the Pair loss is used. The parameter λ is generally set to 0.3 or 0.4.
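A sketch of the Joint loss of equation (1.5) in PyTorch follows. It assumes the reconstruction given above: a cosine-similarity Triplet term with margin α plus a Pair term of the form 1 minus the positive-pair cosine; the exact form of the Pair term is an assumption consistent with the description, not a formula quoted from the original.

```python
import torch
import torch.nn.functional as F

def joint_loss(v_a, v_p, v_n, alpha=0.3, lam=0.3):
    """Triplet + Pair loss over L2-normalized embeddings, cf. eq. (1.5).

    v_a, v_p, v_n: anchor (Normal), positive (Validation/Disguised) and
    negative (Impersonator) embeddings of shape [batch, 512].
    The Pair term 1 - cos(anchor, positive) is an assumed form.
    """
    f_a, f_p, f_n = (F.normalize(x, dim=1) for x in (v_a, v_p, v_n))
    cos_ap = (f_a * f_p).sum(dim=1)            # positive-pair cosine
    cos_an = (f_a * f_n).sum(dim=1)            # negative-pair cosine
    triplet = F.relu(cos_an - cos_ap + alpha)  # positive pair must beat negative pair by alpha
    pair = 1.0 - cos_ap                        # pulls positive pairs together
    return (triplet + lam * pair).mean()
```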
(2) The training set used in stage 2 is the training set of the DFW dataset. Before training, the triples (x_i^a, x_i^p, x_i^n) need to be paired: 1) a Normal image is selected as x_i^a; 2) a Validation or Disguised image under the same directory as the Normal image is selected as x_i^p; 3) an Impersonator image under the same directory as the Normal image is selected as x_i^n. (Note: the pictures under the same directory in the DFW dataset are divided into 4 groups, namely Normal, Validation, Disguised and Impersonator; Normal, Validation and Disguised show the same person, while the Impersonator is a different person with a similar appearance, see fig. 5 for examples.)
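A possible pairing routine is sketched below. It assumes one sub-directory per identity and that the group of each picture (Normal, Validation, Disguised, Impersonator) can be recovered from its file name; this naming convention is hypothetical and used only for illustration, since the actual DFW file layout is not reproduced here.

```python
import os
import random

def build_triplets(dfw_root):
    """Build (anchor, positive, negative) file triples from a DFW-style layout.

    Assumes one sub-directory per identity, with the group recoverable from
    each file name - a hypothetical convention for illustration only.
    """
    triplets = []
    for person in sorted(os.listdir(dfw_root)):
        folder = os.path.join(dfw_root, person)
        if not os.path.isdir(folder):
            continue
        groups = {"normal": [], "validation": [], "disguised": [], "impersonator": []}
        for fname in os.listdir(folder):
            for key in groups:
                if key in fname.lower():
                    groups[key].append(os.path.join(folder, fname))
        positives = groups["validation"] + groups["disguised"]
        if groups["normal"] and positives and groups["impersonator"]:
            for anchor in groups["normal"]:                   # x_a: a Normal image
                pos = random.choice(positives)                # x_p: Validation or Disguised, same directory
                neg = random.choice(groups["impersonator"])   # x_n: Impersonator, same directory
                triplets.append((anchor, pos, neg))
    return triplets
```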
According to the invention, a feature extraction network for ordinary faces is first obtained through the stage 1 training. Because existing disguised face datasets are generally small, stage 2 then performs triple-pairing training using the idea of the Triplet loss function, with the Pair loss compensating for the shortcomings of the Triplet loss, thereby completing the migration of the network's application range to disguised faces. The combination of the two treatments greatly improves the results of the invention.
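Putting the pieces together, a high-level sketch of the two-stage procedure is given below (stage 1 with the ArcFace head on a large general face dataset, removal of the classification layer, stage 2 with the Joint loss on DFW triples). The data loaders, optimizer settings and epoch counts are placeholders rather than values specified by the patent; arcface_head and joint_loss refer to the sketches above, and the backbone is assumed to return flattened convolutional features.

```python
import torch

def train_two_stage(backbone, output_module, arcface_head,
                    general_loader, dfw_triplet_loader, epochs1=20, epochs2=10):
    """Two-stage training skeleton; loaders and hyperparameters are illustrative only."""
    feature_net = torch.nn.Sequential(backbone, output_module)

    # Stage 1: train feature_net + ArcFace classification head on a large face dataset (Set_F).
    opt1 = torch.optim.SGD(list(feature_net.parameters()) + list(arcface_head.parameters()),
                           lr=0.1, momentum=0.9)
    for _ in range(epochs1):
        for images, labels in general_loader:
            loss = arcface_head(feature_net(images), labels)
            opt1.zero_grad()
            loss.backward()
            opt1.step()

    # The last fully connected (classification) layer is discarded; only feature_net is kept.

    # Stage 2: fine-tune feature_net on DFW triples (Set_S) with the Joint loss.
    opt2 = torch.optim.SGD(feature_net.parameters(), lr=0.01, momentum=0.9)
    for _ in range(epochs2):
        for img_a, img_p, img_n in dfw_triplet_loader:
            loss = joint_loss(feature_net(img_a), feature_net(img_p), feature_net(img_n))
            opt2.zero_grad()
            loss.backward()
            opt2.step()
    return feature_net
```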
The model of the invention (stage 1 trained on the MS-Celeb-1M dataset, stage 2 trained on the DFW training set) was tested on the disguised face recognition test set provided by DFW; the GAR at FAR = 1% and FAR = 0.1% is respectively: 1) protocol-1: 97.98% and 60.23%; 2) protocol-2: 90.37% and 82.84%; 3) protocol-3: 90.4% and 81.18%. GAR and FAR are described as follows:
the DFW test dataset provides a collection of face images in groups of two, where some of the pairs are the same person, and these pairs are taken as positive samples. The other picture pairs belong to different persons. The similarity degree of the two images is measured by the distance between the image feature vectors, but only one distance is obvious at present, so that whether the two images are the same person or not can not be judged. At present, a more common method is to add a threshold value as a threshold, and when the distance is smaller than the threshold value, the sample is regarded as a positive sample, otherwise, the sample is regarded as a negative sample.
Once a threshold is given, the values of TP, TN, FP and FN can be calculated:
TP: the number of positive samples correctly identified by the algorithm;
TN: the number of negative samples correctly identified by the algorithm;
FP: a number of negative samples identified as positive samples;
FN: the number of positive samples identified as negative samples.
The values of GAR and FAR are then obtained from these counts:
GAR = TP / (TP + FN), FAR = FP / (FP + TN).
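For illustration, a small sketch of how GAR and FAR can be computed at a given threshold from pairwise distances and ground-truth labels; the choice of distance measure (e.g. cosine distance between feature vectors) is left open, as in the description.

```python
import numpy as np

def gar_far(distances, is_same, threshold):
    """Compute GAR and FAR at one threshold.

    distances: array of pairwise distances (smaller = more similar);
    is_same:   boolean array, True where the pair shows the same person.
    """
    distances = np.asarray(distances)
    is_same = np.asarray(is_same, dtype=bool)
    accepted = distances < threshold            # pairs judged "same person"
    tp = np.sum(accepted & is_same)
    fn = np.sum(~accepted & is_same)
    fp = np.sum(accepted & ~is_same)
    tn = np.sum(~accepted & ~is_same)
    gar = tp / (tp + fn) if (tp + fn) else 0.0
    far = fp / (fp + tn) if (fp + tn) else 0.0
    return gar, far
```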
Fig. 6 compares the model of the invention with other models; in general terms, the invention performs better than most existing algorithms. (Note: the DFW dataset provides three different sets of positive and negative sample pairs, protocol-1, protocol-2 and protocol-3, where protocol-3 combines the first two.)
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the invention is not limited to the embodiments described above; the embodiments merely illustrate the principles of the invention, and various changes and modifications may be made without departing from its spirit and scope, all of which are intended to fall within the scope of protection. The scope of the invention is defined by the appended claims and their equivalents.

Claims (8)

1. A disguised face recognition method, characterized by adopting two-stage training and comprising the following steps:
step S1, preprocessing the data set needed for training to obtain Set_F and Set_S, the data set being a face data set;
step S2, first stage, using Set_F as the training set and training the network with the ArcFace loss function;
step S3, removing the last fully connected layer of the network;
step S4, second stage, using Set_S as the training set and training the network with the Joint loss function.
2. The disguised face recognition method according to claim 1, characterized in that the model adopted in the first stage is composed of a ResNet50IR residual network, an output module composed of a BatchNorm layer, a Dropout layer, a fully connected layer and another BatchNorm layer, a classification module composed of a fully connected layer and a Softmax classification layer, and the ArcFace loss function, the ResNet50IR residual network and the output module being used as the backbone network for extracting features.
3. The disguised face recognition method according to claim 2, characterized in that the ResNet50IR residual network uses a residual unit based on the 50-layer ResNet, the residual unit is a 6-layer composite structure of BatchNorm-convolution-BatchNorm-PReLU-convolution-BatchNorm, and the output size is determined by the stride of the 5th layer: when the stride is 1, the output has the same size as the input; when the stride is 2, the output size is half of the input size; the ResNet50IR residual network is composed of an input part and 4 convolution modules which have 3, 4, 14 and 3 residual units respectively, the first residual unit of each convolution module being responsible for reducing the output dimension; the Dropout parameter of the output module is 0.5, the output of the fully connected layer is a 512-dimensional vector, and the final feature vector v is obtained after passing through a BatchNorm layer.
4. The disguised face recognition method according to claim 3, characterized in that the feature vector v needs to be normalized before being input into the fully connected layer, so that ‖v‖ = 1 is satisfied; the dimension of the weight of the fully connected layer is determined by the number of label categories of the training set: when the number of categories is P, the dimension of the weight matrix W is D × P; when the MS-Celeb-1M dataset is used as the training set, P is 85K and D is the length of the feature vector v, here 512; the offset b of the fully connected layer is set to zero and each column of W is normalized, so that the ith element of the output vector of the fully connected layer is
v·w_i = ‖v‖·‖w_i‖·cos θ_i = cos θ_i,
where w_i is the ith column of the weight matrix W.
5. The disguised face recognition method according to claim 1 or 2, characterized in that the network is trained with the ArcFace loss function, whose formula is
L = -(1/N) · Σ_{i=1}^{N} log [ e^{s·cos(θ_{y_i,i}+m)} / ( e^{s·cos(θ_{y_i,i}+m)} + Σ_{j≠y_i} e^{s·cos θ_{j,i}} ) ],
where the hyperparameters s and m are 64 and 0.5 respectively, θ_{j,i} is the angle between the feature vector v_i generated by the ith input and the weight vector w_j, and y_i is the correct label value corresponding to v_i.
6. The disguised face recognition method according to claim 1, characterized in that the model of the second stage is composed of a feature extraction network and the Joint loss function, the feature extraction network is the backbone network of the first stage, and the Joint loss function formula is
L_Joint = Σ_{i=1}^{N} max(0, <f(x_i^a), f(x_i^n)> − <f(x_i^a), f(x_i^p)> + α) + λ · Σ_{i=1}^{N} (1 − <f(x_i^a), f(x_i^p)>),
in which the first part is the Triplet loss and the second part is the Pair loss; f(x_i) is the feature vector v_i output by the feature extraction network after normalization, <f(x_1), f(x_2)> is the vector product of two feature vectors, i.e. the cosine of the angle between v_1 and v_2, and the parameters α and λ are positive values.
7. The disguised face recognition method according to claim 1, characterized in that the training set of the second stage is the training set of the DFW dataset, and before training the triples (x_i^a, x_i^p, x_i^n) need to be paired: a Normal image is first selected as x_i^a, then a Validation or Disguised image under the same directory as the Normal image is selected as x_i^p, and finally an Impersonator image under the same directory as the Normal image is selected as x_i^n.
8. The disguised face recognition method according to claim 7, characterized in that the pictures under the same directory in the DFW dataset are divided into four types, Normal, Validation, Disguised and Impersonator, where Normal, Validation and Disguised show the same person, and the Impersonator is a different person whose appearance is similar to the first three.
CN201910654611.0A 2019-07-19 2019-07-19 Two-stage training method for disguised face recognition Active CN110443162B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910654611.0A CN110443162B (en) 2019-07-19 2019-07-19 Two-stage training method for disguised face recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910654611.0A CN110443162B (en) 2019-07-19 2019-07-19 Two-stage training method for disguised face recognition

Publications (2)

Publication Number Publication Date
CN110443162A CN110443162A (en) 2019-11-12
CN110443162B true CN110443162B (en) 2022-08-30

Family

ID=68430896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910654611.0A Active CN110443162B (en) 2019-07-19 2019-07-19 Two-stage training method for disguised face recognition

Country Status (1)

Country Link
CN (1) CN110443162B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401193B (en) * 2020-03-10 2023-11-28 海尔优家智能科技(北京)有限公司 Method and device for acquiring expression recognition model, and expression recognition method and device
CN111860266B (en) * 2020-07-13 2022-09-30 南京理工大学 Disguised face recognition method based on depth features
CN112101192B (en) * 2020-09-11 2021-08-13 中国平安人寿保险股份有限公司 Artificial intelligence-based camouflage detection method, device, equipment and medium
CN113361346B (en) * 2021-05-25 2022-12-23 天津大学 Scale parameter self-adaptive face recognition method for replacing adjustment parameters
CN113780461B (en) * 2021-09-23 2022-08-05 中国人民解放军国防科技大学 Robust neural network training method based on feature matching

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117744A (en) * 2018-07-20 2019-01-01 杭州电子科技大学 A kind of twin neural network training method for face verification
CN109214360A (en) * 2018-10-15 2019-01-15 北京亮亮视野科技有限公司 A kind of construction method of the human face recognition model based on ParaSoftMax loss function and application
CN109359541A (en) * 2018-09-17 2019-02-19 南京邮电大学 A kind of sketch face identification method based on depth migration study
CN109815801A (en) * 2018-12-18 2019-05-28 北京英索科技发展有限公司 Face identification method and device based on deep learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117744A (en) * 2018-07-20 2019-01-01 杭州电子科技大学 A kind of twin neural network training method for face verification
CN109359541A (en) * 2018-09-17 2019-02-19 南京邮电大学 A kind of sketch face identification method based on depth migration study
CN109214360A (en) * 2018-10-15 2019-01-15 北京亮亮视野科技有限公司 A kind of construction method of the human face recognition model based on ParaSoftMax loss function and application
CN109815801A (en) * 2018-12-18 2019-05-28 北京英索科技发展有限公司 Face identification method and device based on deep learning

Also Published As

Publication number Publication date
CN110443162A (en) 2019-11-12

Similar Documents

Publication Publication Date Title
CN110443162B (en) Two-stage training method for disguised face recognition
CN109711281B (en) Pedestrian re-recognition and feature recognition fusion method based on deep learning
CN105069400B (en) Facial image gender identifying system based on the sparse own coding of stack
CN109359541A (en) A kind of sketch face identification method based on depth migration study
CN108615010A (en) Facial expression recognizing method based on the fusion of parallel convolutional neural networks characteristic pattern
US20090116747A1 (en) Artificial intelligence systems for identifying objects
CN106303233A (en) A kind of video method for secret protection merged based on expression
Tivive et al. A gender recognition system using shunting inhibitory convolutional neural networks
CN109117795B (en) Neural network expression recognition method based on graph structure
Kantarcı et al. Thermal to visible face recognition using deep autoencoders
CN115601282A (en) Infrared and visible light image fusion method based on multi-discriminator generation countermeasure network
CN113112416A (en) Semantic-guided face image restoration method
CN113033345B (en) V2V video face recognition method based on public feature subspace
CN111611963B (en) Face recognition method based on neighbor preservation canonical correlation analysis
CN114036553A (en) K-anonymity-combined pedestrian identity privacy protection method
CN104732204B (en) Differentiate the face identification method of correlation analysis based on the dual multinuclear of color property
Zhang et al. CNN-based anomaly detection for face presentation attack detection with multi-channel images
CN109583406B (en) Facial expression recognition method based on feature attention mechanism
Xie et al. Improved locally linear embedding and its application on multi-pose ear recognition
CN108197573A (en) The face identification method that LRC and CRC deviations based on mirror image combine
CN110427892B (en) CNN face expression feature point positioning method based on depth-layer autocorrelation fusion
Zheng et al. Deep probabilities for age estimation
Du et al. Two-dimensional neighborhood preserving embedding for face recognition
CN113673451A (en) Graph volume module for extracting image features of tissue cytology pathology pieces
CN113505740A (en) Facial recognition method based on transfer learning and convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant