CN106372581B - Method for constructing and training face recognition feature extraction network - Google Patents


Info

Publication number
CN106372581B
CN106372581B CN201610726171.1A
Authority
CN
China
Prior art keywords
feature
network
sample
training
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201610726171.1A
Other languages
Chinese (zh)
Other versions
CN106372581A (en)
Inventor
吴晓雨
郭天楚
杨磊
朱贝贝
谭笑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Communication University of China
Original Assignee
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Communication University of China
Priority to CN201610726171.1A
Publication of CN106372581A
Application granted
Publication of CN106372581B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168: Feature extraction; Face representation
    • G06V40/172: Classification, e.g. identification

Abstract

The invention provides a method for constructing and training a face recognition feature extraction network. The method comprises the following steps: constructing a feature extraction network and a metric learning dimensionality reduction network, wherein the output of the feature extraction network is the input of the metric learning dimensionality reduction network; training the feature extraction network on the entire sample set to output a feature set; screening the feature set by semantic sampling to obtain a clean sample set; and training the metric learning dimensionality reduction network on the clean sample set. The natural face recognition network constructed by the invention improves the representational power of the extracted features, fully mining the feature information in the data so that original face pictures can be recognized accurately.

Description

Method for constructing and training face recognition feature extraction network
Technical Field
The invention relates to the technical field of image recognition, in particular to the construction of face recognition networks, and specifically to a method for constructing and training a face recognition feature extraction network.
Background
Face recognition has long been a hot topic in computer vision. Compared with traditional biometric identification such as iris and fingerprint recognition, face recognition can complete the identification task using only image data or pictures captured by an ordinary camera, without acquiring data through special media. This gives face recognition a much wider range of application scenarios than iris or fingerprint recognition. Face recognition is one form of biometric recognition and is mostly applied in fields such as security and identity authentication. With the continuous development of society and of science and technology, face recognition has gradually moved from laboratory research into daily life, and is now applied in fields close to everyday life such as access control, attendance checking, mobile phone unlocking and financial payment.
However, when face recognition technology is applied in daily life, one unavoidable problem is that the recognition equipment cannot capture photos with the standard illumination and standard poses obtainable in a laboratory. In a daily face recognition scenario, photos are likely taken with a mobile phone camera in a natural state, so the data to be recognized are natural, arbitrary face pictures with varied illumination and expressions. Compared with the standard-illumination, standard-pose face data collected in a laboratory, faces captured in natural living conditions contain more noise, and the recognition process must account for factors such as unbalanced illumination, non-frontal pose, expression, partial occlusion of the face, and make-up, which poses a great challenge to traditional face recognition technology. How to develop a face recognition technology robust to such external interference factors is therefore a problem to be solved.
In the prior art, obtaining a robust face recognition model relies on huge amounts of training data, and it is desirable that the training data have a statistical distribution similar to that of the actual prediction data. With the development of the internet and the popularity of social networks in the current big-data era, huge training sets can be collected online. How to use such data so that a face recognition model fully learns the required information has therefore become a research hot spot. With the rise of deep learning, it has been found that deep models describe the information implicit in data better than shallow models and have stronger representational power and ability to fit an objective function, so deep learning has made significant contributions in the field of natural image recognition. However, face recognition and natural image recognition are two different tasks with both similarities and differences. They are similar in that both are image recognition tasks with similar reference signals and loss functions, and both exploit the high abstraction and fitting capability of deep networks to process huge data. They differ in that natural images have complex and varied backgrounds, so a network may need to consider wide-range context and color texture information, whereas the structure of a human face is simple, the discrimination between different people is smaller than that between different classes of natural images, and a face recognition task must attend more to detail differences and less to color information. Therefore, the training mode and network structure of natural image recognition cannot be applied directly to the face recognition task.
The training data of existing common natural face recognition deep neural networks, such as CASIA-Net, are all collected from the Internet, with the identities overlapping the LFW database removed so that the training set does not overlap the test set. CASIA-Net comprises 10 convolutional layers and one fully-connected classification layer, as shown in FIG. 7; its specific parameters are shown in table a below. As can be seen from table a, CASIA-Net integrates proven neural network design techniques, including a deep structure, low-dimensional representation and multiple loss functions. Stacking smaller convolution kernels both reduces the number of parameters and increases the nonlinear capability of the network, and CASIA-Net uses only stacked 3 × 3 convolutions. Inspired by the existing VGG-Net, CASIA-Net combines two 3 × 3 convolutions into one stage, and 5 stages in total form the whole network. CASIA-Net does not use a fully-connected layer to fuse feature maps into low-dimensional features; feature extraction throughout the network uses only convolution operations. The pooling 5 (Pool5) layer is the feature layer, and the low-dimensional representation conforms to the assumption that human faces lie on a low-dimensional manifold. Convolution 52 (Conv52) does not employ ReLU activation, since the low-dimensional representation needs to contain all of the discriminative information of the face, whereas ReLU would make the neurons sparse. Max pooling passes the maximum value of the receptive field to the next layer as the activation value, and adopting max pooling at the feature layer easily introduces noise-sensitive regions.
Therefore, an average pooling operation is adopted after the Conv52 layer, and the softmax signal and the verification signal are fused at the feature layer to learn representations that better distinguish human faces.
TABLE a
(Table a, the network parameters of CASIA-Net, appears as an image in the original.)
As is well known, a human face differs from a natural image: its structure is single and fixed. A network for face classification must attend not only to large-scale features but also to finer details of the image, i.e. smaller convolution kernels and smaller receptive fields are needed to capture detail. However, the existing CASIA-Net simply stacks convolution layers, the features the network extracts have not been studied in depth, the convolution kernels are all 3 × 3, and the feature scale is single, so the existing CASIA-Net is not suited to natural face feature recognition.
To further improve existing natural face recognition deep neural networks, other learning ideas have been introduced to drive the network parameters toward a better optimum; for example, introducing metric learning can improve the learned features. A typical loss function for metric learning is the triplet loss. However, the triplet loss suffers from several problems in neural network training: first, it demands large hardware resources; second, it does not train well jointly with softmax; and third, it measures error in feature space and is not robust to noise.
Therefore, the conventional natural face recognition deep neural network still cannot accurately recognize the natural face, and therefore, how to improve the conventional face recognition deep neural network to accurately recognize the natural face image becomes a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the technical problem to be solved by the present invention is to provide a method for constructing and training a face recognition feature extraction network, which solves the problems that the existing deep learning network cannot accurately extract effective features and cannot accurately recognize original face pictures.
In order to solve the above technical problems, a specific embodiment of the present invention provides a method for constructing and training a face recognition feature extraction network, including: constructing a feature extraction network and a metric learning dimensionality reduction network, wherein the output of the feature extraction network is the input of the metric learning dimensionality reduction network; training the feature extraction network on the entire sample set to output a feature set; screening the feature set by semantic sampling to obtain a clean sample set; and training the metric learning dimensionality reduction network on the clean sample set.
According to the above embodiments of the present invention, the method for constructing and training the face recognition feature extraction network has at least the following advantages. Features are extracted jointly by a feature extraction network and a metric learning dimensionality reduction network. The feature extraction network is a deep learning network built by stacking Stages, which gives it better feature extraction capability. Within each Stage, 1 × 1, 3 × 3 and 5 × 5 convolution kernels are applied simultaneously to the feature maps of the previous layer, and the resulting feature maps are concatenated to extract multi-scale features; a 3 × 3 convolution kernel then convolves the multi-scale feature maps to fuse the features of the multi-scale kernels. Through this change of feature-map dimensionality, the network first expands to learn more complete features and then compresses to remove redundant ones. Each Stage can be regarded as a superposition of convolution kernels, which obtains a larger receptive field with fewer weights and strengthens the nonlinear representational layers of the deep learning network.
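As a concrete illustration, the multi-scale Stage described above can be sketched naively in NumPy. The branch channel counts below are illustrative assumptions, and the loop-based convolution favors clarity over speed:

```python
import numpy as np

def conv2d_same(x, kernels):
    """Naive 'same'-padded 2D convolution.
    x: (H, W, C_in); kernels: (k, k, C_in, C_out) -> output (H, W, C_out)."""
    k = kernels.shape[0]
    pad = k // 2
    H, W, _ = x.shape
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.empty((H, W, kernels.shape[3]))
    for i in range(H):
        for j in range(W):
            # Correlate the k x k x C_in patch with every output kernel.
            out[i, j] = np.tensordot(xp[i:i + k, j:j + k, :], kernels,
                                     axes=([0, 1, 2], [0, 1, 2]))
    return out

def multi_scale_stage(x, k1, k3, k5, fuse):
    """One Stage: 1x1 / 3x3 / 5x5 convolutions run in parallel on the same
    input, their feature maps are concatenated (expansion), then a 3x3
    convolution fuses and compresses them (redundancy removal)."""
    expanded = np.concatenate(
        [conv2d_same(x, k1), conv2d_same(x, k3), conv2d_same(x, k5)], axis=2)
    return conv2d_same(expanded, fuse)
```

For example, with a 2-channel input and 4 channels per branch, the concatenated tensor has 12 channels before the 3 × 3 fusion compresses it back down.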
In addition, a metric learning dimensionality reduction network is introduced, whose input is the lower-dimensional image features extracted by the feature extraction network. The output feature set of the feature extraction network is screened by semantic sampling to obtain a clean sample set, with which the metric learning dimensionality reduction network is trained; the network is then optimized with the metric learning loss function, triplet loss. Extracting features with the feature extraction network and the metric learning dimensionality reduction network together improves the representational power of the features, so that the feature information in the data is fully mined, the deep learning network is guided to converge quickly, and original face pictures can be recognized accurately.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a flowchart of a first embodiment of a method for constructing and training a face recognition feature extraction network according to an embodiment of the present invention;
fig. 2 is a flowchart of a second embodiment of a method for constructing and training a face recognition feature extraction network according to the embodiment of the present invention;
fig. 3A is a schematic diagram of an original face picture according to an embodiment of the present invention;
FIG. 3B is a schematic diagram of a standard face image obtained after processing an original face image using a natural face recognition network according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a feature extraction network provided in accordance with an embodiment of the present invention;
FIG. 5 is a schematic diagram of a two-dimensional feature residual according to an embodiment of the present invention;
FIG. 6A is a schematic diagram of a conventional triplet loss;
FIG. 6B is a schematic diagram of a triplet loss according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a CASIA-Net deep neural network in the prior art.
Detailed Description
For the purpose of promoting a clear understanding of the objects, aspects and advantages of the embodiments of the invention, reference is made below to the drawings and the detailed description. Various modifications of the embodiments described herein, as well as other embodiments of the invention, will be apparent to those skilled in the art.
The exemplary embodiments of the present invention and the description thereof are provided to explain the present invention and not to limit the present invention. Additionally, the same or similar numbered elements/components used in the drawings and the embodiments are used to represent the same or similar parts.
As used herein, the terms "first," "second," …, etc., do not denote any order or sequence, nor are they used to limit the present invention, but rather are used to distinguish one element from another or from another element or operation described in the same technical language.
With respect to directional terminology used herein, for example: up, down, left, right, front or rear, etc., are simply directions with reference to the drawings. Accordingly, the directional terminology used is intended to be illustrative and is not intended to be limiting of the present teachings.
As used herein, the terms "comprising," "including," "having," "containing," and the like are open-ended terms that mean including, but not limited to.
As used herein, "and/or" includes any and all combinations of the described items.
As used herein, the terms "substantially", "about" and the like are used to modify any slight variation in quantity or error that does not alter the nature of the variation. Generally, the range of slight variations or errors modified by such terms may be 20% in some embodiments, 10% in some embodiments, 5% in some embodiments, or other values. It should be understood by those skilled in the art that the aforementioned values can be adjusted according to actual needs, and are not limited thereto.
Certain words used to describe the present application are discussed below or elsewhere in this specification to provide additional guidance to those skilled in the art in describing the present application.
Fig. 1 is a flowchart of a first embodiment of a method for constructing and training a face recognition feature extraction network according to an embodiment of the present invention, and as shown in fig. 1, a feature extraction network and a metric learning dimension reduction network are constructed and trained.
The specific embodiments shown in the drawings include:
step 101: and constructing a feature extraction network and a metric learning dimensionality reduction network, wherein the output of the feature extraction network is the input of the metric learning dimensionality reduction network. And constructing a feature extraction network and a metric learning dimensionality reduction network as a feature extraction module of the natural face recognition network.
Step 102: train the feature extraction network on the entire sample set to output a feature set. Specifically: the feature extraction network is trained with the softmax loss function on the entire sample set to output a feature set. In an embodiment of the present invention, the feature dimension of the feature set is 320; the entire sample set is the CASIA-WebFace database, which contains 10,575 categories and about 490,000 pictures; and 1 × 1, 3 × 3 and 5 × 5 convolution kernels are applied simultaneously in the feature extraction network to form a multi-scale feature fusion scheme.
Step 103: screen the feature set by semantic sampling to obtain a clean sample set. Specifically: semantic sampling selects the 90% of samples in the feature set furthest from the feature plane as the clean sample set. In an embodiment of the present invention, the clean sample set is DataSubset; the feature plane is obtained by logistic regression; and the feature dimension of the clean sample set is 320.
Step 104: train the metric learning dimensionality reduction network on the clean sample set; after the clean sample set passes through the metric learning dimensionality reduction network, the feature dimension is reduced to 128.
Referring to fig. 1, the metric learning dimensionality reduction network is introduced, and a standard face picture is processed by the feature extraction network and the metric learning dimensionality reduction network in sequence. This improves the representational power of the features, so that the feature information in the data is fully mined, the deep learning network is guided to converge quickly, and original face pictures can be recognized accurately.
Fig. 2 is a flowchart of a second embodiment of a method for constructing and training a face recognition feature extraction network according to the embodiment of the present invention; as shown in fig. 2, after the metric learning dimensionality reduction network is trained on the clean sample set, it is further optimized with the improved metric learning loss function, triplet loss.
In the embodiment shown in the figure, after step 104, the method further comprises:
step 105: and optimizing the metric learning dimensionality reduction network by using a metric learning loss function tripletLoss. When the metric learning loss function tripletLoss is used for optimizing the metric learning dimension reduction network, a sample point far away from most sample points is taken as an anchor point, so that the sample point far away from most sample points is close to most sample points.
The specific formula of the metric learning loss function, triplet loss, used in the invention is as follows:

triplet loss = log(1 + z)

where

z = exp(||f_a − f_p||^2 − ||f_a − f_n||^2 + margin)

z is an intermediate variable; f_a is the feature of the selected sample a; f_p is the feature of the selected sample p; f_n is the feature of the selected sample n; and margin is a fixed, manually set interval.
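As an illustration, the smoothed loss log(1 + z), with z assumed to take the standard smoothed-triplet form exp(||f_a − f_p||^2 − ||f_a − f_n||^2 + margin), can be compared numerically with the prior-art hinge form (function names here are illustrative):

```python
import numpy as np

def triplet_loss_smoothed(f_a, f_p, f_n, margin=0.2):
    """Improved loss from the text: log(1 + z), a softplus smoothing of the
    triplet residual (z assumed to be exp(residual))."""
    residual = np.sum((f_a - f_p) ** 2) - np.sum((f_a - f_n) ** 2) + margin
    return np.log1p(np.exp(residual))

def triplet_loss_hinge(f_a, f_p, f_n, margin=0.2):
    """Conventional prior-art triplet loss: max(residual, 0)."""
    residual = np.sum((f_a - f_p) ** 2) - np.sum((f_a - f_n) ** 2) + margin
    return max(residual, 0.0)
```

For an easy triplet (negative far from the anchor) the hinge loss is exactly zero, while the smoothed loss stays small but positive, so every triplet still contributes a gradient.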
The metric learning loss function in the prior art is as follows:

loss = Σ max(||f_a − f_p||^2 − ||f_a − f_n||^2 + margin, 0)
obviously, compared with the existing metric learning loss function, the metric learning loss function after the improvement of the invention adds the balancing factor with overlarge residual error to play a smoothing role, thereby leading the network parameters of the metric learning dimensionality reduction network to reach a better position.
Referring to fig. 2, the metric learning dimensionality reduction network is optimized with the improved metric learning loss function, triplet loss, so that its network parameters reach a better optimum, improving the performance and robustness of the metric learning dimensionality reduction network.
Fig. 3A is a schematic diagram of an original face picture according to an embodiment of the present invention; fig. 3B is a schematic diagram of a standard face picture obtained after processing the original face picture with the natural face recognition network provided in the embodiment of the present invention. As shown in fig. 3A and 3B, an original face picture captured in a natural state is often cluttered: it may contain several faces against a complex background, each with a different in-plane rotation angle. If such pictures are fed directly into the deep networks (the feature extraction network, the metric learning dimensionality reduction network, etc.), the information the networks must handle may include multiple faces of different sizes and angles as well as considerable background noise. A deep network can certainly approximate the effective content of the image by fitting a complex function with a large number of parameters and complex nonlinearities, but introducing prior knowledge and preprocessing the input data lets the deep network learn effective features more precisely. The tasks in processing the original face picture are therefore: introduce prior knowledge, remove the cluttered background from the incoming picture, and remove in-plane rotation of the face. After processing, the picture should contain only one face. The present application uses five-point calibration; as shown in the figures, comparing the images before and after alignment, the facial features in the aligned image appear at fixed positions.
An affine transformation maps two-dimensional coordinates to two-dimensional coordinates and can be written in the form:

x' = ax + by + m (2.1)
y' = cx + dy + n (2.2)

where x' and y' are the new coordinates, computed from the coefficients a, b, c, d, m, n and the original coordinates x and y.

As the above equations show, the parameters of an affine transformation are uniquely determined by three point correspondences. To impose a stronger constraint, five-point detection is adopted to solve for the affine transformation parameters. The five points mark the positions of the facial features, and after the five-point affine transformation the facial features of the aligned image lie in essentially the same positions. All pictures are aligned to 128 × 128 pixels, where the standard feature positions are (32,50), (96,50), (64,75), (43,90) and (86,90), representing the positions of the left eyeball, right eyeball, nose tip, left mouth corner and right mouth corner, respectively.
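The five-point alignment amounts to a least-squares solve for the six affine parameters. A sketch under the assumption that landmark detection has already been done (the template coordinates are the standard positions given above; function names are illustrative):

```python
import numpy as np

# Standard 128x128 template: left eye, right eye, nose tip,
# left mouth corner, right mouth corner (positions from the text).
TEMPLATE = np.array([(32, 50), (96, 50), (64, 75), (43, 90), (86, 90)], float)

def solve_affine(landmarks):
    """Least-squares fit of x' = ax + by + m, y' = cx + dy + n mapping the
    five detected landmarks onto TEMPLATE. Returns (a, b, m) and (c, d, n)."""
    src = np.asarray(landmarks, float)
    A = np.hstack([src, np.ones((len(src), 1))])   # rows [x, y, 1]
    abm = np.linalg.lstsq(A, TEMPLATE[:, 0], rcond=None)[0]
    cdn = np.linalg.lstsq(A, TEMPLATE[:, 1], rcond=None)[0]
    return abm, cdn

def apply_affine(abm, cdn, point):
    """Apply equations (2.1) and (2.2) to a single point."""
    x, y = point
    return (abm[0] * x + abm[1] * y + abm[2],
            cdn[0] * x + cdn[1] * y + cdn[2])
```

With more than three correspondences the system is over-determined, which is exactly the constraining effect the five points provide.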
Fig. 4 is a schematic diagram of a feature extraction network according to an embodiment of the present invention, and as shown in fig. 4, the feature extraction network is trained first, and then the metric learning dimensionality reduction network is trained.
The training modes of the feature extraction network and the metric learning dimensionality reduction network are summarized as follows:
firstly, a standard face picture is transmitted into a feature extraction network, identity information is used as a reference signal (namely a supervision signal), softmax is used as a loss function, and the feature extraction network is trained.
And secondly, extracting lower dimensional features after all standard face pictures are subjected to the trained feature extraction network.
And thirdly, performing semantic sampling on the lower dimensional features to obtain data A.
And fourthly, taking the identity information as a reference signal, taking a softmax function as a loss function, and pre-training the metric learning dimensionality reduction network.
And fifthly, taking the category relation as a reference signal, taking an improved tripletloss function as a loss function, carrying out measurement sampling on the lower dimensional features by taking batch as a unit, taking the measurement sampling as data B, and inputting the data into a measurement learning dimensionality reduction network to obtain the low dimensional features.
And finishing the training of the feature extraction module. In the prediction process, the pre-trained data are respectively transmitted into the two networks to obtain the low-dimensional representation of the input image.
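The prediction path in the last step can be sketched with stand-in linear "networks"; the matrices below are illustrative stubs, not the trained models:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stubs for the two trained networks (shapes follow the text:
# aligned 128x128 face -> 320-d feature -> 128-d feature).
W_feat = rng.standard_normal((128 * 128, 320)) * 0.01
W_metric = rng.standard_normal((320, 128)) * 0.1

def extract_representation(aligned_face):
    """Prediction: pass the aligned face through the feature extraction
    network, then through the metric learning dimensionality reduction
    network, yielding the final low-dimensional representation."""
    f320 = aligned_face.reshape(-1) @ W_feat   # feature extraction (320-d)
    return f320 @ W_metric                     # metric dimensionality reduction (128-d)
```

The point of the sketch is only the composition: the second network consumes the first network's output, exactly as in steps one through five.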
The feature extraction network has the following characteristics: first, convolution kernels are stacked in stages, fully-connected layers are removed, and features are extracted using convolution layers only; second, the feature map extracted by the last convolution layer serves as the feature layer and does not use ReLU activation, so the features are low-dimensional and dense; third, to suppress noise, features are extracted by the last convolution kernel.
The invention improves the stage design: the stages of the feature extraction network perform multi-scale feature fusion, feature decorrelation, feature dimensionality reduction, and so on. In the feature extraction network, the fifth stage (stage5) has one fewer 5 × 5 convolution kernel than a normal stage, because by the time the deep network has convolved down to stage5 the height and width of its input are too small for 5 × 5 convolution. The whole feature extraction network comprises 11 nonlinear convolution layers and 1 fully-connected classification layer. The network parameters are shown in table 1, which lists the network parameters of the feature extraction network.
TABLE 1
(Table 1, the network parameters of the feature extraction network, appears as an image in the original.)
The feature extraction network is trained for 200,000 iterations. The initial learning rate is 0.01 and is decayed once every 10,000 iterations by a factor of gamma = 0.8; the learning rate at a given iteration is thus 0.01 × gamma^(iterations / 10,000), i.e. 0.01 for the first 10,000 iterations, 0.01 × gamma for the next 10,000, 0.01 × gamma^2 after that, and so on. The weight decay coefficient is 5e-4 (i.e. 5 × 10^-4), and the batch size is 150.
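The schedule just described can be written down directly as a one-line step-decay function of the stated hyper-parameters:

```python
def learning_rate(iteration, base_lr=0.01, gamma=0.8, step=10000):
    """Step decay from the text: lr = base_lr * gamma ** (iteration // step),
    i.e. 0.01 for the first 10,000 iterations, then 0.01 * 0.8, and so on."""
    return base_lr * gamma ** (iteration // step)
```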
In the above feature extraction network training, the entire CASIA-WebFace sample set is used as the training set, and the feature set it outputs is defined as DataSet_All. DataSet_All contains I identities. A subset of the data, DataSubset, which participates in training the metric learning dimensionality reduction network, is extracted from DataSet_All by semantic sampling.
In principle, the harder-to-learn samples should be found to form the data subset DataSubset. Samples that are difficult to classify are exactly the sample set the feature extraction network has not solved well, so using them as training samples for the metric learning dimensionality reduction network can address the problems the feature extraction network leaves behind. However, the CASIA-WebFace database contains 10,575 categories and about 490,000 pictures in total, and after all images are mirrored the whole database contains nearly 1,000,000 pictures. Traversing one million pictures to find, for each picture of each category, the X closest samples of different categories would be far too expensive. This sampling method also has the following problems: first, for data resources collected from the web such as CASIA-WebFace, the labels are likely to contain errors, and sampling in this way easily picks up the noise, harming the overall performance of the network; second, this sampling scheme over-weights the pictures that are hardest to classify, making the sample distribution unbalanced.
To solve the above problem, a simple binary classifier, logistic regression, is trained for each identity class. For a particular identity class, the positive samples of the classifier are the samples of that identity, and the negative samples are twice that number, collected at random from samples not belonging to the identity. The classifier trained this way is a weak classifier; its performance is not high, but it is somewhat tolerant to noise. Logistic regression is equivalent to finding a hyperplane with the positive and negative samples on opposite sides; the hyperplane parameters are w and b. This hyperplane is treated as the feature plane of the category. After all 10,575 feature planes are obtained, the feature plane closest to each is computed one by one. The distance between feature planes is given by equation 2.3 below, where f_i and f_j denote the feature planes of class i and class j, respectively. For each class feature plane and its nearest feature plane, the 90% of the class's samples farthest from the feature plane, that is, the 90% of samples most like the class, are selected; 75% of those selected samples are then randomly extracted as the sampled set, and all 10,575 classes are traversed in this way. The sampling process needs only the sample-to-feature-plane distance, not the probability space of the nonlinear mapping. In equation 2.4 below, z is the sample-to-feature-plane distance, f_i is the feature plane of the i-th class, and x_i is a sample feature of category i. The specific steps are summarized as follows:
In the first step, for each identity i ∈ I, all samples under that identity are selected as Pos_i, with number N; from the samples of all other identities j ∈ I, j ≠ i, 2N samples are selected as Neg_i.
In the second step, logistic regression is trained on Pos_i and Neg_i to obtain the parameters w_i and b_i of the feature plane P_i.
In the third step, the first and second steps are repeated to compute all the feature planes.
In the fourth step, for each identity i ∈ I, the feature plane f_j nearest to the feature plane f_i is computed according to equation 2.3.
In the fifth step, the distance of every sample x_i belonging to identity i to the feature plane f_i is computed according to equation 2.4; the samples are sorted in descending order of distance, the top 90% are taken as sub_90, and 75% of sub_90 are randomly selected and put into DataSubset.
In the sixth step, the same is done for all samples x_j belonging to identity j (the category semantically closest to identity i): samples are selected by the method of the fifth step and put into DataSubset.
In the seventh step, steps four to six are repeated until all identities have been traversed, yielding the final DataSubset.
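As a minimal sketch, the distance-based selection of steps five and six might look as follows. The helper name, the toy features, and the plane parameters are ours; the bias b is included alongside the w · x score of equation 2.4 for completeness:

```python
import numpy as np

def sample_class(features, w, b, keep_far=0.90, keep_rand=0.75, rng=None):
    """Rank one identity's samples by their signed distance to its feature
    plane (equation 2.4, z = w . x, plus bias b), keep the 90% farthest on
    the positive side (sub_90), then randomly keep 75% of those."""
    if rng is None:
        rng = np.random.default_rng(0)
    z = features @ w + b                     # per-sample distance score
    order = np.argsort(-z)                   # descending: most class-like first
    sub_90 = order[: int(round(keep_far * len(order)))]
    picked = rng.choice(sub_90, size=int(round(keep_rand * len(sub_90))),
                        replace=False)
    return picked                            # indices that enter DataSubset

# Toy run: 20 fake 4-dimensional features of one identity, hypothetical plane.
feats = np.random.default_rng(1).normal(size=(20, 4))
idx = sample_class(feats, w=np.ones(4), b=0.0)
print(len(idx))   # round(0.75 * round(0.9 * 20)) = 14 indices
```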
(Equation 2.3, which gives the distance between feature planes f_i and f_j, is rendered as an image in the original document.)
z_sample = w_i · x_i ............2.4
In the semantic sampling process, the 90% of class samples farthest from the feature plane are selected; this amounts to selecting the samples most like the class, that is, excluding mislabeled samples and samples of poor picture quality.
In addition, the metric learning dimensionality reduction network is a fully connected network. Its input is the output of the feature extraction network, that is, the 320-dimensional features; these pass through two fully connected layers down to 128 hidden neurons, and the output is fully connected to the number of CASIA-WebFace identities, 10,575. The parameter settings are given in Table 2, the parameter settings of the metric learning dimensionality reduction network. The 256-dimensional features here are dense, and none of the fully connected layers is followed by a ReLU activation.
TABLE 2
(Table 2 is rendered as an image in the original document.)
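Since Table 2 is not reproduced here, the following sketch assumes hidden widths of 320 → 256 → 128 read off the surrounding text; it shows only the inference-time forward pass (the 10,575-way classification head used in training is omitted), and, per the text, no ReLU follows the fully connected layers:

```python
import numpy as np

# A minimal sketch of the metric-learning dimension-reduction network:
# a small fully connected stack over the 320-dim extracted features.
# The hidden widths are an assumption; Table 2 fixes the real ones.
rng = np.random.default_rng(0)
dims = [320, 256, 128]
weights = [rng.normal(scale=0.01, size=(m, n)) for m, n in zip(dims, dims[1:])]
biases = [np.zeros(n) for n in dims[1:]]

def reduce_features(x):
    """Forward pass: 320-dim features in, 128-dim embeddings out.
    No activation between layers, matching the no-ReLU note above."""
    for W, b in zip(weights, biases):
        x = x @ W + b
    return x

emb = reduce_features(rng.normal(size=(5, 320)))   # a batch of 5 features
print(emb.shape)                                   # (5, 128)
```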
The present application improves upon conventional tripletLoss in three respects.
First, tripletLoss is introduced only within a purpose-built metric learning dimensionality reduction network. Once this network is in place, the batch samples that tripletLoss must observe are no longer the original pictures x but the lower-dimensional features of x extracted by the feature extraction network. Moreover, the metric learning dimensionality reduction network contains only fully connected hidden layers, so it has few parameters and little intermediate data to store; memory and video memory usage drop greatly, and the network can be trained directly on a single GPU.
Second, a pre-training scheme over the data set is employed. Samples are first placed into batches with random categories, so that the batches suit the residual of the softmax loss function for updating the network parameters; when the network reaches a good position, this training is stopped. Then samples are drawn by category, 30 samples per category and 100 categories per batch (3,000 samples in total), so that the batches suit the tripletLoss residual for updating the network parameters. Using the two training modes in turn lets the network begin metric learning from a better starting point, which benefits both the sampling of triplets and the balance between the two loss functions.
Finally, we modify the tripletLoss loss function by adding a smoothing factor to the residual, as in equations 2.5 and 2.6:
Loss=log(1+z)............2.5
z = ||f_a − f_p||² − ||f_a − f_n||² + margin ............2.6
The log function has a smoothing effect: when the network loss is differentiated with respect to the feature f_a of the selected sample a, introducing the log adds a coefficient of 1/(1 + z), and the larger z becomes, the smaller the residual coefficient, which produces the smoothing.
Fig. 5 is a schematic diagram of a two-dimensional feature residual provided by an embodiment of the present invention. As shown in Fig. 5, point a is the selected sample point (the anchor), p is a sample with the same identity as point a (a positive sample), and n is a sample with a different identity from point a (a negative sample). The double arrows represent the distances of the positive and negative sample pairs. According to equations 2.5 and 2.6, the residual of the loss function at sample point a is f_n − f_p, the difference between the two feature vectors; this term is the vector pointing from p to n, that is, the gradient direction points from p to n and the magnitude is the modulus of f_n − f_p. The network is updated by gradient descent, whose significance is to adjust the network parameters so as to move against the gradient direction; that is, the parameters are changed so that point a moves in the direction of point a'. With smoothing added, the magnitude of the gradient is no longer the modulus of f_n − f_p but that modulus multiplied by the coefficient 1/(1 + z); that is, the parameters are changed so that point a moves to the position of point a'. As z increases the coefficient decreases, so point a can be moved smoothly against the gradient direction.
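Numerically, the smoothing works out as follows. This sketch reconstructs z in the usual squared-distance triplet form (consistent with the residual f_n − f_p derived above); the margin value and the toy points are illustrative, and the triplet is chosen hard enough that z stays positive:

```python
import numpy as np

def smoothed_triplet_loss(fa, fp, fn, margin=0.2):
    """Loss = log(1 + z), with z the triplet residual of equation 2.6.
    Returns the loss and its gradient with respect to the anchor fa."""
    z = np.sum((fa - fp) ** 2) - np.sum((fa - fn) ** 2) + margin
    loss = np.log(1.0 + z)
    # dz/dfa = 2(fa - fp) - 2(fa - fn) = 2(fn - fp): the p -> n direction,
    # scaled by the smoothing coefficient 1/(1 + z) contributed by the log.
    grad_fa = (1.0 / (1.0 + z)) * 2.0 * (fn - fp)
    return loss, grad_fa

# Anchor at the origin, positive far away, negative nearby (a hard triplet).
fa, fp, fn = np.zeros(2), np.array([2.0, 0.0]), np.array([1.0, 0.0])
loss, g = smoothed_triplet_loss(fa, fp, fn)
print(loss)   # log(1 + (4 - 1 + 0.2)) = log(4.2)
print(g)      # (2 / 4.2) * (fn - fp): shrunk by the 1/(1 + z) factor
```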
Fig. 6A is a schematic diagram of conventional tripletLoss, and Fig. 6B is a schematic diagram of the tripletLoss of an embodiment of the present invention. As shown in Figs. 6A and 6B, conventional tripletLoss requires that the collected negative sample not be the hardest negative but only a harder one; otherwise the network may suffer gradient collapse early in training. The tripletLoss proposed by the present invention, however, is used in a network with few parameters whose weights have essentially reached a good position, so the sampling of triplets need not strictly follow the normalized Euclidean distance formula.
In the improved tripletLoss of the present invention, semantic sampling is used for training. Semantic sampling removes much of the noise, so the positive samples we collect contain less noise and the network suffers less interference. Experiments show that, in a batch where each category contains 30 samples, when each sample in turn is taken as the anchor and the hardest positive sample is selected, the sample labeled 1 in the batch is always chosen as the hardest positive when the other samples serve as anchors. The samples selected as hardest are also distributed unevenly, appearing as a few ungrouped points far from the other sample points. In Figs. 6A and 6B, the gray depth represents how often each sample point is selected as the hardest sample of the other points when they act as anchors; the deeper the gray, the higher the frequency. Rather than taking each point as an anchor and pulling most points toward the ungrouped points, we take the ungrouped points as anchors and pull them toward the majority of points. Fig. 6A shows the original sampling mode, in which the gradient moves the large majority of points toward the "outlier" points; Fig. 6B shows the sampling mode of this document, which moves the "ungrouped" points toward the majority. Taking the point labeled 1 in the batch of Fig. 6B as an example, it is regarded as the hardest positive sample point of 11 points of the same category; these 11 points form a set, set_1. Point 1 is then taken as the anchor, the 11 points in set_1 serve as the hardest positive sample points, and negative sample points satisfying equation 2.7 below are extracted at random. Anchor point 1 thus receives 11 different feature errors, so the network updates its weights to bring point 1 closer to the majority of the sample points.
(Equation 2.7, which gives the condition a randomly drawn negative sample must satisfy, is rendered as an image in the original document.)
The method comprises the following specific steps:
In the first step, 30 samples per class are taken from the DataSubset data set, each batch containing data from 100 classes.
In the second step, the network is initialized with the pre-trained network coefficients.
In the third step, for each input sample in each batch, the hardest positive samples are selected to form a set.
In the fourth step, the samples in the set are taken as anchor points, the samples for which they were selected serve as the hardest positive samples, negative samples satisfying equation 2.7 are randomly selected to form triplet pairs, and the network parameters are updated.
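The four steps can be sketched as follows, assuming features are already extracted. The function and data-set names are ours, and the equation 2.7 acceptance test on the negative is omitted because that equation is not reproduced in the text; a negative is simply drawn from another identity:

```python
import random

def build_triplets(dataset, n_classes=100, n_per_class=30, rng=None):
    """dataset maps identity -> list of sample ids. Builds one batch of
    100 identities x 30 samples and returns (anchor, positive, negative)
    triples, the anchor standing in for the hard-set point of its class."""
    if rng is None:
        rng = random.Random(0)
    classes = rng.sample(sorted(dataset), n_classes)
    batch = {c: rng.sample(dataset[c], n_per_class) for c in classes}
    triplets = []
    for c in classes:
        anchor = batch[c][0]                 # stand-in for the hard-set anchor
        for pos in batch[c][1:]:             # its 29 hardest-positive partners
            neg_cls = rng.choice([k for k in classes if k != c])
            triplets.append((anchor, pos, rng.choice(batch[neg_cls])))
    return triplets

# Toy data set: 120 identities with 40 samples each.
toy = {f"id{i}": [f"id{i}_s{j}" for j in range(40)] for i in range(120)}
trips = build_triplets(toy)
print(len(trips))   # 100 classes * 29 positives per class = 2900
```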
The specific embodiments of the invention provide a method for constructing and training a face recognition feature extraction network, in which features are extracted jointly by the feature extraction network and the metric learning dimensionality reduction network. The feature extraction network is a deep learning network built by stacking Stages, which gives it better feature extraction capability. Within each Stage, 1 × 1, 3 × 3, and 5 × 5 convolution kernels simultaneously convolve the feature maps of the previous layer, and the resulting feature maps are concatenated to extract multi-scale features; a 3 × 3 convolution kernel then convolves the multi-scale feature map to fuse the features of the multi-scale kernels. Through this change in feature map dimensionality the network first expands, fully learning more complete features, and then compresses, removing redundant features. Each Stage can be regarded as a superposition of convolution kernels, which yields a larger receptive field with fewer weights and strengthens the nonlinear characterization layers of the deep learning network.
In addition, a metric learning dimensionality reduction network is introduced whose input is the lower-dimensional image features extracted by the feature extraction network. The output feature set of the feature extraction network is semantically sampled to screen out a pure sample set, with which the metric learning dimensionality reduction network is trained; the network is then optimized with the metric learning loss function tripletLoss. Extracting features with the feature extraction network and the metric learning dimensionality reduction network together improves the characterization capability of the features, fully mines the feature information in the data, guides the deep learning network to a rapid solution, and allows the original face picture to be identified accurately.
The embodiments of the invention described above may be implemented in various hardware, software code, or combinations of both. For example, an embodiment of the present invention may also be program code for executing the above method in a Digital Signal Processor (DSP). The invention may also relate to a variety of functions performed by a computer processor, digital signal processor, microprocessor, or Field Programmable Gate Array (FPGA). The processor described above may be configured according to the present invention to perform certain tasks by executing machine-readable software code or firmware code that defines certain methods disclosed herein. Software code or firmware code may be developed in different programming languages and in different formats or forms. Software code may also be compiled for different target platforms. However, the different code styles, types, and languages of software code and other types of configuration code that perform tasks in accordance with the present invention do not depart from the spirit and scope of the present invention.
The foregoing is merely an illustrative embodiment of the present invention, and any equivalent changes and modifications made by those skilled in the art without departing from the spirit and principle of the present invention should fall within the protection scope of the present invention.

Claims (6)

1. A method for constructing and training a face recognition feature extraction network is characterized by comprising the following steps:
processing an original face picture to obtain a standard face picture, and constructing a feature extraction network and a metric learning dimension reduction network of the standard face picture, wherein the feature extraction network outputs a feature set comprising high-dimensional face feature information, the output of the feature extraction network is the input of the metric learning dimension reduction network, the output of the metric learning dimension reduction network is the feature set comprising low-dimensional face feature information, and the low-dimensional face feature information is the result of dimension reduction of the high-dimensional face feature information;
training the feature extraction network based on standard face pictures of all sample sets by taking the identity information as a reference signal so as to output a feature set comprising high-dimensional face feature information;
screening a feature set comprising high-dimensional face feature information by utilizing semantic sampling so as to obtain a pure sample set comprising the face feature information with the same dimension as the feature set;
training the metric learning dimensionality reduction network based on a pure sample set by taking identity information as a reference signal;
the method comprises the following steps of screening a feature set comprising high-dimensional face feature information by utilizing semantic sampling so as to obtain a pure sample set comprising the face feature information with the same dimension as the feature set, and specifically comprises the following steps:
screening 90% of samples farthest from a feature plane in a feature set comprising high-dimensional face feature information by utilizing semantic sampling to serve as a pure sample set comprising the face feature information with the same dimension as the sample set;
after the step of training the metric learning dimensionality reduction network based on the clean sample set, the method further includes:
optimizing the metric learning dimensionality reduction network by using a metric learning loss function tripletLoss;
the specific formula of the metric learning loss function tripletLoss is as follows:
tripletloss=log(1+z)
wherein,
z = ||f_a − f_p||² − ||f_a − f_n||² + margin
z is an intermediate variable; f_a is the feature of the selected sample a; f_p is the feature of the selected sample p; f_n is the feature of the selected sample n; p is a positive sample with the same identity as a, and n is a negative sample with a different identity from a; margin is a preset fixed interval;
when the metric learning loss function tripletLoss is used for optimizing the metric learning dimension reduction network, a sample point far away from most sample points is taken as an anchor point, so that the sample point far away from most sample points is close to most sample points.
2. The method for constructing and training a face recognition feature extraction network according to claim 1, wherein the step of training the feature extraction network based on standard face pictures of all sample sets with identity information as a reference signal so as to output a feature set including high-dimensional face feature information specifically comprises:
and training the feature extraction network by using a loss function softmax based on the standard face pictures of all the sample sets so as to output a feature set comprising high-dimensional face feature information.
3. The method of constructing and training a face recognition feature extraction network of claim 1, wherein the feature plane is obtained by Logistic regression.
4. The method of claim 1, wherein 1 x 1, 3 x 3, 5 x 5 convolution kernels are applied simultaneously in the feature extraction network to form a multi-scale feature fusion mode.
5. The method for constructing and training a face recognition feature extraction network according to claim 1, wherein the feature dimension of the feature set comprising high-dimensional face feature information is 320; the dimensionality of the face feature information of the pure sample set is 320; and the dimensionality of the pure sample set is reduced to 128 after passing through the metric learning dimensionality reduction network.
6. The method of claim 1, wherein the entire sample set is a CASIA-Webface database.
CN201610726171.1A 2016-08-25 2016-08-25 Method for constructing and training face recognition feature extraction network Expired - Fee Related CN106372581B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610726171.1A CN106372581B (en) 2016-08-25 2016-08-25 Method for constructing and training face recognition feature extraction network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610726171.1A CN106372581B (en) 2016-08-25 2016-08-25 Method for constructing and training face recognition feature extraction network

Publications (2)

Publication Number Publication Date
CN106372581A CN106372581A (en) 2017-02-01
CN106372581B true CN106372581B (en) 2020-09-04

Family

ID=57879182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610726171.1A Expired - Fee Related CN106372581B (en) 2016-08-25 2016-08-25 Method for constructing and training face recognition feature extraction network

Country Status (1)

Country Link
CN (1) CN106372581B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106919921B (en) * 2017-03-06 2020-11-06 重庆邮电大学 Gait recognition method and system combining subspace learning and tensor neural network
CN107123111B (en) * 2017-04-14 2020-01-24 惠州旭鑫智能技术有限公司 Deep residual error network construction method for mobile phone screen defect detection
CN107657223B (en) * 2017-09-18 2020-04-28 华南理工大学 Face authentication method based on rapid processing multi-distance metric learning
US10613891B2 (en) 2017-12-21 2020-04-07 Caterpillar Inc. System and method for using virtual machine operator model
CN108921106B (en) * 2018-07-06 2021-07-06 重庆大学 Capsule-based face recognition method
CN108985231B (en) * 2018-07-12 2021-08-13 广州麦仑信息科技有限公司 Palm vein feature extraction method based on multi-scale convolution kernel
CN109145986B (en) * 2018-08-21 2021-12-24 佛山市南海区广工大数控装备协同创新研究院 Large-scale face recognition method
CN109359555A (en) * 2018-09-21 2019-02-19 江苏安凰领御科技有限公司 A kind of high-precision human face quick detection method
CN109447990B (en) * 2018-10-22 2021-06-22 北京旷视科技有限公司 Image semantic segmentation method and device, electronic equipment and computer readable medium
CN111091020A (en) * 2018-10-22 2020-05-01 百度在线网络技术(北京)有限公司 Automatic driving state distinguishing method and device
CN109447897B (en) * 2018-10-24 2023-04-07 文创智慧科技(武汉)有限公司 Real scene image synthesis method and system
CN109784163A (en) * 2018-12-12 2019-05-21 中国科学院深圳先进技术研究院 A kind of light weight vision question answering system and method
CN109919320B (en) * 2019-01-23 2022-04-01 西北工业大学 Triplet network learning method based on semantic hierarchy
CN110070120B (en) * 2019-04-11 2021-08-27 清华大学 Depth measurement learning method and system based on discrimination sampling strategy
CN110163261A (en) * 2019-04-28 2019-08-23 平安科技(深圳)有限公司 Unbalanced data disaggregated model training method, device, equipment and storage medium
US20230119593A1 (en) * 2019-06-21 2023-04-20 One Connect Smart Technology Co., Ltd. Method and apparatus for training facial feature extraction model, method and apparatus for extracting facial features, device, and storage medium
CN110991296B (en) * 2019-11-26 2023-04-07 腾讯科技(深圳)有限公司 Video annotation method and device, electronic equipment and computer-readable storage medium
CN111582059B (en) * 2020-04-20 2022-07-15 哈尔滨工程大学 Face expression recognition method based on variational self-encoder
CN112560635B (en) * 2020-12-10 2024-03-26 深圳云天励飞技术股份有限公司 Face matching acceleration method and device, electronic equipment and storage medium
CN113409157B (en) * 2021-05-19 2022-06-28 桂林电子科技大学 Cross-social network user alignment method and device
CN113449707B (en) * 2021-08-31 2021-11-30 杭州魔点科技有限公司 Living body detection method, electronic apparatus, and storage medium
CN113963428B (en) * 2021-12-23 2022-03-25 北京的卢深视科技有限公司 Model training method, occlusion detection method, system, electronic device, and medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101398846A (en) * 2008-10-23 2009-04-01 上海交通大学 Image, semantic and concept detection method based on partial color space characteristic
JP5997545B2 (en) * 2012-08-22 2016-09-28 キヤノン株式会社 Signal processing method and signal processing apparatus
CN103778414A (en) * 2014-01-17 2014-05-07 杭州电子科技大学 Real-time face recognition method based on deep neural network
US9646227B2 (en) * 2014-07-29 2017-05-09 Microsoft Technology Licensing, Llc Computerized machine learning of interesting video sections
CN104866810B (en) * 2015-04-10 2018-07-13 北京工业大学 A kind of face identification method of depth convolutional neural networks
CN105512620B (en) * 2015-11-30 2019-07-26 北京眼神智能科技有限公司 The training method and device of convolutional neural networks for recognition of face
CN105760833A (en) * 2016-02-14 2016-07-13 北京飞搜科技有限公司 Face feature recognition method
CN105608450B (en) * 2016-03-01 2018-11-27 天津中科智能识别产业技术研究院有限公司 Heterogeneous face identification method based on depth convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FaceNet: A Unified Embedding for Face Recognition and Clustering; Florian Schroff et al.; arXiv; 2015-03-13; pp. 1-9 *
Graph-Based Multilevel Dimensionality Reduction with Applications to Eigenfaces and Latent Semantic Indexing; Sophia Sakellaridi et al.; 2008 Seventh International Conference on Machine Learning and Applications; 2008-12-22; pp. 194-200 *

Also Published As

Publication number Publication date
CN106372581A (en) 2017-02-01

Similar Documents

Publication Publication Date Title
CN106372581B (en) Method for constructing and training face recognition feature extraction network
CN107194341B (en) Face recognition method and system based on fusion of Maxout multi-convolution neural network
Cherian et al. Riemannian dictionary learning and sparse coding for positive definite matrices
CN106815566B (en) Face retrieval method based on multitask convolutional neural network
Trigeorgis et al. A deep semi-nmf model for learning hidden representations
CN105205449B (en) Sign Language Recognition Method based on deep learning
CN109255289B (en) Cross-aging face recognition method based on unified generation model
Singh et al. A study of moment based features on handwritten digit recognition
CN112580590A (en) Finger vein identification method based on multi-semantic feature fusion network
CN111274916A (en) Face recognition method and face recognition device
CN107316059B (en) Learner gesture recognition method
CN109344856B (en) Offline signature identification method based on multilayer discriminant feature learning
WO2021169257A1 (en) Face recognition
CN112733627A (en) Finger vein identification method based on fusion of local feature network and global feature network
CN111694959A (en) Network public opinion multi-mode emotion recognition method and system based on facial expressions and text information
Chandaliya et al. Child face age progression and regression using self-attention multi-scale patch gan
CN109948662B (en) Face image depth clustering method based on K-means and MMD
CN112101087A (en) Facial image identity de-identification method and device and electronic equipment
CN111797705A (en) Action recognition method based on character relation modeling
CN111160119A (en) Multi-task depth discrimination metric learning model construction method for cosmetic face verification
Jadhav et al. HDL-PI: hybrid DeepLearning technique for person identification using multimodal finger print, iris and face biometric features
Dong et al. Kinship classification based on discriminative facial patches
Vepuri Improving facial emotion recognition with image processing and deep learning
Gona et al. Multimodal biometric reorganization system using deep learning convolutional neural network
CN109685146A (en) A kind of scene recognition method based on double convolution sum topic models

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200904

Termination date: 20210825
