CN113807214B - Small-target face recognition method based on DeiT auxiliary-network knowledge distillation - Google Patents


Info

Publication number
CN113807214B
CN113807214B (application CN202111015756.XA)
Authority
CN
China
Prior art keywords
network
teacher
DeiT
characteristic
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111015756.XA
Other languages
Chinese (zh)
Other versions
CN113807214A (en)
Inventor
宋尧哲
孟方舟
舒子婷
吴萌萌
童官军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Institute of Microsystem and Information Technology of CAS
Original Assignee
Shanghai Institute of Microsystem and Information Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Institute of Microsystem and Information Technology of CAS filed Critical Shanghai Institute of Microsystem and Information Technology of CAS
Priority to CN202111015756.XA priority Critical patent/CN113807214B/en
Publication of CN113807214A publication Critical patent/CN113807214A/en
Application granted granted Critical
Publication of CN113807214B publication Critical patent/CN113807214B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a small-target face recognition method based on knowledge distillation with a DeiT auxiliary network, comprising the following steps: constructing a DeiT network as the student network, constructing a teacher network and appending a residual connection module to it, and training the student network on high-resolution face images with the teacher network; inputting a small-target face image into the trained student network to obtain a second classification feature and a second distillation feature; inputting an image of the same identity, but without downsampling, into the teacher network to obtain a second teacher feature; constructing a third loss function from the second classification feature and the ground-truth label, constructing a fourth loss function from the second distillation feature and the second teacher feature, and adding the two to obtain a second total loss; and performing secondary training of the trained DeiT network under the second total loss. The invention can effectively recognize small-target face images.

Description

Small-target face recognition method based on DeiT auxiliary-network knowledge distillation
Technical Field
The invention relates to the technical field of computer vision, and in particular to a small-target face recognition method based on DeiT auxiliary-network knowledge distillation.
Background
With the continual development of deep learning algorithms and of the corresponding large-scale datasets, face recognition has advanced greatly. Under the conditions of a fixed face pose (frontal face), clear images, and a closed-set environment (no "unknown" category), face recognition accuracy can exceed 99%.
In surveillance environments, however, owing to practical problems such as low camera resolution, distant face targets, and relative-motion blur, the small-target face images actually collected often exhibit varied poses (e.g., side face, raised head), low resolution (below 32×32 pixels), and noise interference. Moreover, since not every detected face target can be matched to an identity in the database under field surveillance conditions, small-target face recognition also becomes an open-set problem.
For these reasons, face recognition algorithms that perform excellently with fixed poses, clear images, and closed-set conditions tend to degrade drastically in real environments. The degradation appears in two ways: an algorithm trained on high-resolution face images performs far worse when tested directly on small-target face images from surveillance footage, and even an algorithm trained on such small-target surveillance images performs poorly when tested on the same kind of images. The reason is that training on a high-resolution dataset and testing on small-target images causes "domain shift", since the two datasets follow different distributions, and the model overfits; while training directly on small-target images makes features hard to extract because the resolution is too low (below 32×32 pixels), and no large-scale real-world low-resolution face recognition dataset exists among the public datasets, so it is difficult to train a network with discriminative power.
To address the difficulty of small-target face recognition in real environments, the two best-performing current algorithms both perform knowledge distillation based on a CNN, as follows. The teacher network is a CNN model pre-trained on high-resolution face images; its parameters are frozen during training and it serves only as a feature extractor. The student network has the same architecture as the teacher and is trained. During training, a high-resolution face image is fed to the teacher, and the small-target face image obtained by downsampling the same high-resolution image is fed to the student. A loss function is designed to pull the student's penultimate-layer features toward the teacher's corresponding layer, so that through distillation the student acquires the feature-extraction knowledge the teacher learned from high-resolution images, while the student's classification loss lets it learn from the small-target images themselves. In designing the distillation loss, earlier algorithms fed the high- and low-resolution face images directly into the loss function, which hurt high-resolution recognition accuracy; the improved algorithms therefore add a parallel feature-layer input to the loss. Because the teacher's feature layer is highly discriminative on high-resolution face images, designing the loss over the teacher and student feature layers lets the student learn the same discriminative features on downsampled images of the same identity, improving the student network's discriminative power on low-resolution faces.
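The feature-space distillation described above can be sketched as follows. This is an illustrative reconstruction, not the patent's exact loss: the helper name `kd_feature_loss`, the L2 feature term, and the weighting `alpha` are assumptions for the sketch.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kd_feature_loss(student_feat, teacher_feat, student_logits, label, alpha=0.5):
    """Classification cross-entropy on the low-resolution input, plus an L2
    term pulling the student's penultimate-layer features toward the frozen
    teacher's features extracted from the high-resolution image."""
    ce = -np.log(softmax(student_logits)[label] + 1e-12)   # classification loss
    feat = np.mean((student_feat - teacher_feat) ** 2)     # feature-imitation loss
    return (1 - alpha) * ce + alpha * feat
```

When the student's features already match the teacher's, only the classification term remains, so the loss rewards both correct identity prediction and teacher-like features.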
Disclosure of Invention
The technical problem to be solved by the invention is to provide a small-target face recognition method based on DeiT auxiliary-network knowledge distillation that can effectively recognize small-target face images.
The technical solution adopted to solve the above technical problem is a small-target face recognition method based on DeiT auxiliary-network knowledge distillation, comprising the following steps:
step (1): constructing a DeiT network as the student network, preprocessing a selected training set, and inputting the preprocessed training set into the student network to obtain a first classification feature and a first distillation feature;
step (2): selecting a teacher network pre-trained on the dataset, and inputting the preprocessed training set into the teacher network to obtain a first teacher feature;
step (3): adding a residual connection module after the last discrimination layer of the teacher network, the residual connection module participating in training;
step (4): constructing a first loss function from the first classification feature and the ground-truth label, constructing a second loss function from the first distillation feature and the first teacher feature, and adding the first and second loss functions to obtain a first total loss;
training the student network on first face images with the teacher network under the first total loss;
step (5): inputting a second face image into the trained student network to obtain a second classification feature and a second distillation feature;
the pixel resolution of the first face images being higher than that of the second face images;
step (6): inputting an image of the same identity as the second face image, but not downsampled, into the teacher network to obtain a second teacher feature;
step (7): constructing a third loss function from the second classification feature and the ground-truth label, constructing a fourth loss function from the second distillation feature and the second teacher feature, and adding the third and fourth loss functions to obtain a second total loss;
performing secondary training on the trained student network under the second total loss;
step (8): recognizing newly input second face images with the secondarily trained student network.
In step (1), the selected training set is preprocessed and then input into the student network, specifically: each image in the training set is resized to 224×224 by interpolation and cut into 14×14 = 196 image blocks of size 16×16, and the image blocks are input into the student network.
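The resize-and-patchify preprocessing can be sketched with plain array reshaping. The helper `to_patches` is hypothetical (the patent does not give implementation details), but the shapes match the description: a 224×224 image yields 14×14 = 196 patches, each flattened to 16×16×3 = 768 values.

```python
import numpy as np

def to_patches(img, patch=16):
    """Split an H x W x C image (H and W assumed divisible by `patch`)
    into a (num_patches, patch*patch*C) sequence of flattened blocks,
    as done before feeding the DeiT student."""
    h, w, c = img.shape
    g_h, g_w = h // patch, w // patch
    x = img.reshape(g_h, patch, g_w, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)                 # (g_h, g_w, patch, patch, c)
    return x.reshape(g_h * g_w, patch * patch * c)

seq = to_patches(np.zeros((224, 224, 3)))          # 196 patches of dimension 768
```

Note that 768 is exactly the patch-token dimension mentioned later in the embodiment.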
In step (1), VGGFace2 high-resolution face images are used as the training set.
Step (2) specifically comprises: selecting an SE-ResNet network pre-trained on the dataset as the teacher network, fixing its parameters so that it acts purely as a feature extractor, and inputting the preprocessed training set into the teacher network to obtain the first teacher feature.
In step (5), a second face image is input into the trained student network, specifically: the input to the trained student network is downsampled to 16×16 and then enlarged by interpolation, yielding a downsampled 224×224 second face image.
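The downsample-then-enlarge degradation can be sketched as below. Nearest-neighbour striding and pixel repetition stand in for the interpolation the patent uses; a real pipeline would use bilinear or bicubic resampling, so this is only an illustration of the shape-preserving degradation.

```python
import numpy as np

def degrade(img, low=16, high=224):
    """Shrink a high x high image to low x low, then blow it back up to
    high x high, so the student still receives a 224 x 224 input while
    the actual detail corresponds to only low x low pixels."""
    step = high // low
    small = img[::step, ::step]                              # 224 -> 16 by striding
    return np.repeat(np.repeat(small, step, axis=0), step, axis=1)
```

The output keeps the 224×224 shape but carries at most 16×16 = 256 distinct pixel values, which is the information loss the distillation is designed to compensate.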
Advantageous effects
Owing to the above technical solution, the invention has the following advantages and positive effects over the prior art. The student network adopts a transformer architecture rather than a CNN as its model backbone; the transformer's non-local attention mechanism relates every pixel of the input image to the information of all other pixels, so the network learns holistic image features, and after pre-training its performance loss on low-resolution images is far smaller than that of CNN backbones, avoiding the performance loss and overfitting caused by interpolating downsampled images up to the dimensions of high-resolution images. By adding an auxiliary residual connection module to the teacher network, the invention parameterizes "what knowledge the teacher should teach", avoiding the "model capacity gap" problem; the knowledge distillation becomes combined online-offline distillation, a stable and easily converged model is obtained adaptively, and the student network absorbs useful information from the teacher network. With this DeiT-based auxiliary-network knowledge distillation, the invention reaches 71.1% accuracy on the test set of TinyFace, a native low-resolution face dataset, which is the highest accuracy among end-to-end face recognition algorithms that do not enhance the test set.
Drawings
FIG. 1 is a schematic diagram of the DeiT network of an embodiment of the invention;
FIG. 2 is a schematic diagram of the residual connection module of an embodiment of the invention;
FIG. 3 is a schematic diagram of the overall architecture of the teacher network of an embodiment of the invention.
Detailed Description
The invention will be further illustrated with reference to specific examples. It is to be understood that these examples are illustrative of the present invention and are not intended to limit the scope of the present invention. Further, it is understood that various changes and modifications may be made by those skilled in the art after reading the teachings of the present invention, and such equivalents are intended to fall within the scope of the claims appended hereto.
The embodiment of the invention relates to a small-target face recognition method based on DeiT auxiliary-network knowledge distillation, comprising the following steps:
1. Construct a DeiT network as the student network (see fig. 1), select VGGFace2 high-resolution face images as the training set, resize each image to 224×224 by interpolation, cut it into 14×14 = 196 image blocks of size 16×16, and input the image blocks into the DeiT network to obtain the first classification feature and the first distillation feature.
In fig. 1, the patch tokens are 768-dimensional features obtained by linear-layer encoding of the 16×16 image blocks; the class token and the distillation token are each a learnable embedding vector of the same dimension as the patch tokens. The class token feeds the discrimination layer whose loss is computed against the ground-truth label, and the distillation token feeds the discrimination layer whose loss is computed against the teacher network's output.
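The assembly of the DeiT input sequence can be sketched as below. The random values stand in for learned parameters and encoded patches; only the shapes reflect the description above (two learnable tokens prepended to 196 patch tokens of dimension 768).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 768                                       # token dimension

# Learnable embedding vectors, same dimension as the patch tokens.
class_token = rng.normal(size=(1, D))         # trained against the ground-truth label
dist_token = rng.normal(size=(1, D))          # trained against the teacher's output

# 14*14 = 196 patch tokens after linear-layer encoding of the 16x16 blocks.
patch_tokens = rng.normal(size=(196, D))

# DeiT-style input sequence: [class; distillation; patches].
sequence = np.concatenate([class_token, dist_token, patch_tokens], axis=0)
```

After the transformer layers, the outputs at the first two sequence positions are read out as the classification feature and the distillation feature, respectively.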
2. Select the SE-ResNet network pre-trained on the VGGFace2 dataset as the teacher network, fix its parameters so that it acts purely as a feature extractor, and input the same face images as in step 1 into the teacher network to obtain the first teacher feature.
3. Construct the first loss function from the first classification feature of step 1 and the ground-truth label, construct the second loss function from the first distillation feature of step 1 and the first teacher feature of step 2, add the first and second loss functions to obtain the first total loss, and train the DeiT network on high-resolution face images with the teacher network under the first total loss. The ground-truth label here is the person identity ID before downsampling, i.e., the ground-truth label of step 1.
Steps 1 to 3 pre-train the network on high-resolution face image information before it is trained on downsampled face images, so that the network learns basic features for face recognition. The features learned from high-resolution images can then be exploited in the subsequent training with low-resolution images, avoiding the convergence difficulties that arise when the model is trained directly on low-resolution face images and the task is too complex.
4. Add the auxiliary residual connection module after the last discrimination layer of the teacher network of step 2. The last discrimination layer and everything before it remain frozen as a feature extractor; only the newly added residual connection module participates in training. Fig. 3 shows the overall architecture of the teacher network incorporating the residual connection module.
Fig. 2 shows the auxiliary residual connection module. By adding it, "what knowledge the teacher should teach" is parameterized, which avoids the "model capacity gap" problem; the knowledge distillation becomes combined online-offline distillation, a stable and easily converged model is obtained adaptively, and the student network can absorb useful information from the teacher network.
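One common way to realize such a module, sketched below under assumptions (the patent's exact layer sizes are given only in fig. 2, so the class name, MLP shape, and initialization scale here are hypothetical): a small trainable MLP whose output is added back to its input through a skip connection, so the module starts near the identity mapping and learns only a correction to the frozen teacher's features.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class ResidualAdapter:
    """Hypothetical auxiliary residual connection module appended after the
    frozen teacher's last discrimination layer. Small initial weights keep
    the module close to the identity at the start of training."""
    def __init__(self, dim, hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=1e-3, size=(dim, hidden))
        self.w2 = rng.normal(scale=1e-3, size=(hidden, dim))

    def __call__(self, feat):
        # skip connection: output = input + learned correction
        return feat + relu(feat @ self.w1) @ self.w2
```

Because only this module trains while the teacher stays frozen, the teacher's output can adapt to what the student can absorb, which is the online-offline combination described above.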
5. Input into the DeiT model trained in step 3 the small-target face images obtained by downsampling to 16×16 and then enlarging by interpolation to 224×224, to obtain the second classification feature and the second distillation feature.
6. Input an image of the same identity as in step 5, but not downsampled, into the teacher network of step 4 to obtain the second teacher feature.
7. Construct the third loss function from the second classification feature of step 5 and the ground-truth label, construct the fourth loss function from the second distillation feature of step 5 and the second teacher feature of step 6, add the two to obtain the second total loss, and perform secondary training on the trained DeiT network under the second total loss. The ground-truth label is the person identity ID after downsampling; since the person's identity is unchanged by downsampling, it is the same label as in steps 1 and 3.
Further, the first and second total losses of steps 3 and 7 can be expressed as:

L_global = (1 - λ) · L_CE(ψ(Z_s), y) + λ · τ² · KL(ψ(Z_s / τ), ψ(Z_t / τ))

where λ is the weighting coefficient balancing the terms of the total loss (0.5 in this embodiment); ψ(·) is the softmax function; Z_s is the output of the trained DeiT (student) network and Z_t is the output of the teacher network; y is the ground-truth label, i.e., the person identity ID of the face image; and τ is the knowledge-distillation temperature coefficient (1.25 in this embodiment). Dividing Z_s and Z_t by the temperature τ softens the outputs of the teacher and student networks, which aids distillation; ψ(Z_s/τ) and ψ(Z_t/τ) are the softened student and teacher outputs after softmax.
L_CE(·) is the cross-entropy loss; for a predicted distribution p and a one-hot label y it can be expressed as L_CE(p, y) = -Σ_i y_i · log(p_i).
KL(·) is the Kullback-Leibler divergence; for distributions p and q it can be expressed as KL(p, q) = Σ_i p_i · log(p_i / q_i).
8. Recognize the input small-target face images with the secondarily trained DeiT network.
9. Tested on the public TinyFace dataset, the invention achieves a Rank-1 accuracy of 71.1%, the highest accuracy among current algorithms on this dataset; specific results are shown in Table 1.
Table 1. Comparison of experimental results on TinyFace (%)

Model            Rank-1   Rank-20   mAP
DeepId2          17.4     25.2      12.1
SphereFace       22.3     35.5      16.2
VGGFace          30.4     40.4      23.1
CenterFace       32.1     44.5      24.6
CSRI             45.2     60.2      39.9
T-C              58.6     73.0      52.7
Shi              63.9     /         /
SafwanKhalid     70.4     82.2      63.2
This embodiment  71.13    84.09     64.58
In summary, the student network adopts a transformer architecture rather than a CNN as its backbone; using the transformer's non-local attention mechanism, every pixel of the input image is related to the information of all other pixels, so the network learns holistic image features, and after pre-training its performance loss on low-resolution images is far smaller than that of CNN backbones. By adding the auxiliary residual connection module to the teacher network, the knowledge distillation becomes combined online-offline distillation, a stable and easily converged model is obtained adaptively, and the student network absorbs useful information from the teacher network.

Claims (5)

1. A small-target face recognition method based on DeiT auxiliary-network knowledge distillation, characterized by comprising the following steps:
step (1): constructing a DeiT network as the student network, preprocessing a selected training set, and inputting the preprocessed training set into the student network to obtain a first classification feature and a first distillation feature;
step (2): selecting a teacher network pre-trained on the dataset, and inputting the preprocessed training set into the teacher network to obtain a first teacher feature;
step (3): adding a residual connection module after the last discrimination layer of the teacher network, the residual connection module participating in training;
step (4): constructing a first loss function from the first classification feature and the ground-truth label, constructing a second loss function from the first distillation feature and the first teacher feature, and adding the first and second loss functions to obtain a first total loss;
training the student network on first face images with the teacher network under the first total loss;
step (5): inputting a second face image into the trained student network to obtain a second classification feature and a second distillation feature;
the pixel resolution of the first face images being higher than that of the second face images;
step (6): inputting an image of the same identity as the second face image, but not downsampled, into the teacher network to obtain a second teacher feature;
step (7): constructing a third loss function from the second classification feature and the ground-truth label, constructing a fourth loss function from the second distillation feature and the second teacher feature, and adding the third and fourth loss functions to obtain a second total loss;
performing secondary training on the trained student network under the second total loss;
step (8): recognizing newly input second face images with the secondarily trained student network.
2. The small-target face recognition method based on DeiT auxiliary-network knowledge distillation according to claim 1, characterized in that in step (1) the selected training set is preprocessed and then input into the student network, specifically: each image in the training set is resized to 224×224 by interpolation and cut into 14×14 = 196 image blocks of size 16×16, and the image blocks are input into the student network.
3. The small-target face recognition method based on DeiT auxiliary-network knowledge distillation according to claim 1, characterized in that VGGFace2 high-resolution face images are used as the training set in step (1).
4. The small-target face recognition method based on DeiT auxiliary-network knowledge distillation according to claim 1, characterized in that step (2) specifically comprises: selecting an SE-ResNet network pre-trained on the dataset as the teacher network, fixing its parameters so that it acts purely as a feature extractor, and inputting the preprocessed training set into the teacher network to obtain the first teacher feature.
5. The small-target face recognition method based on DeiT auxiliary-network knowledge distillation according to claim 1, characterized in that in step (5) a second face image is input into the trained student network, specifically: the input to the trained student network is downsampled to 16×16 and then enlarged by interpolation, yielding a downsampled 224×224 second face image.
CN202111015756.XA 2021-08-31 2021-08-31 Small-target face recognition method based on DeiT auxiliary-network knowledge distillation Active CN113807214B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111015756.XA CN113807214B (en) 2021-08-31 2021-08-31 Small-target face recognition method based on DeiT auxiliary-network knowledge distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111015756.XA CN113807214B (en) 2021-08-31 2021-08-31 Small-target face recognition method based on DeiT auxiliary-network knowledge distillation

Publications (2)

Publication Number Publication Date
CN113807214A CN113807214A (en) 2021-12-17
CN113807214B true CN113807214B (en) 2024-01-05

Family

ID=78894457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111015756.XA Active CN113807214B (en) 2021-08-31 2021-08-31 Small-target face recognition method based on DeiT auxiliary-network knowledge distillation

Country Status (1)

Country Link
CN (1) CN113807214B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113947801B (en) * 2021-12-21 2022-07-26 中科视语(北京)科技有限公司 Face recognition method and device and electronic equipment
CN114743243B (en) * 2022-04-06 2024-05-31 平安科技(深圳)有限公司 Human face recognition method, device, equipment and storage medium based on artificial intelligence

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108830813A (en) * 2018-06-12 2018-11-16 福建帝视信息科技有限公司 A kind of image super-resolution Enhancement Method of knowledge based distillation
CN110458765A (en) * 2019-01-25 2019-11-15 西安电子科技大学 The method for enhancing image quality of convolutional network is kept based on perception
CN110674688A (en) * 2019-08-19 2020-01-10 深圳力维智联技术有限公司 Face recognition model acquisition method, system and medium for video monitoring scene
CN110674714A (en) * 2019-09-13 2020-01-10 东南大学 Human face and human face key point joint detection method based on transfer learning
CN111160533A (en) * 2019-12-31 2020-05-15 中山大学 Neural network acceleration method based on cross-resolution knowledge distillation
CN111291637A (en) * 2020-01-19 2020-06-16 中国科学院上海微系统与信息技术研究所 Face detection method, device and equipment based on convolutional neural network
CN111444760A (en) * 2020-02-19 2020-07-24 天津大学 Traffic sign detection and identification method based on pruning and knowledge distillation
CN112116030A (en) * 2020-10-13 2020-12-22 浙江大学 Image classification method based on vector standardization and knowledge distillation
CN112465138A (en) * 2020-11-20 2021-03-09 平安科技(深圳)有限公司 Model distillation method, device, storage medium and equipment
WO2021047286A1 (en) * 2019-09-12 2021-03-18 华为技术有限公司 Text processing model training method, and text processing method and apparatus
CN112613303A (en) * 2021-01-07 2021-04-06 福州大学 Knowledge distillation-based cross-modal image aesthetic quality evaluation method
CN112784964A (en) * 2021-01-27 2021-05-11 西安电子科技大学 Image classification method based on bridging knowledge distillation convolution neural network
WO2021118737A1 (en) * 2019-12-11 2021-06-17 Microsoft Technology Licensing, Llc Sentence similarity scoring using neural network distillation
CN112988975A (en) * 2021-04-09 2021-06-18 北京语言大学 Viewpoint mining method based on ALBERT and knowledge distillation
CN113205002A (en) * 2021-04-08 2021-08-03 南京邮电大学 Low-definition face recognition method, device, equipment and medium for unlimited video monitoring
CN113240580A (en) * 2021-04-09 2021-08-10 暨南大学 Lightweight image super-resolution reconstruction method based on multi-dimensional knowledge distillation
CN113257361A (en) * 2021-05-31 2021-08-13 中国科学院深圳先进技术研究院 Method, device and equipment for realizing self-adaptive protein prediction framework

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10872596B2 (en) * 2017-10-19 2020-12-22 Baidu Usa Llc Systems and methods for parallel wave generation in end-to-end text-to-speech
US11620515B2 (en) * 2019-11-07 2023-04-04 Salesforce.Com, Inc. Multi-task knowledge distillation for language model

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108830813A (en) * 2018-06-12 2018-11-16 Fujian Dishi Information Technology Co., Ltd. Image super-resolution enhancement method based on knowledge distillation
CN110458765A (en) * 2019-01-25 2019-11-15 Xidian University Image quality enhancement method based on perception-preserving convolutional network
CN110674688A (en) * 2019-08-19 2020-01-10 Shenzhen Liwei Zhilian Technology Co., Ltd. Face recognition model acquisition method, system and medium for video monitoring scenes
WO2021047286A1 (en) * 2019-09-12 2021-03-18 Huawei Technologies Co., Ltd. Text processing model training method, and text processing method and apparatus
CN110674714A (en) * 2019-09-13 2020-01-10 Southeast University Joint face and face key point detection method based on transfer learning
WO2021118737A1 (en) * 2019-12-11 2021-06-17 Microsoft Technology Licensing, Llc Sentence similarity scoring using neural network distillation
CN111160533A (en) * 2019-12-31 2020-05-15 Sun Yat-sen University Neural network acceleration method based on cross-resolution knowledge distillation
CN111291637A (en) * 2020-01-19 2020-06-16 Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences Face detection method, device and equipment based on convolutional neural network
CN111444760A (en) * 2020-02-19 2020-07-24 Tianjin University Traffic sign detection and identification method based on pruning and knowledge distillation
CN112116030A (en) * 2020-10-13 2020-12-22 Zhejiang University Image classification method based on vector standardization and knowledge distillation
CN112465138A (en) * 2020-11-20 2021-03-09 Ping An Technology (Shenzhen) Co., Ltd. Model distillation method, device, storage medium and equipment
CN112613303A (en) * 2021-01-07 2021-04-06 Fuzhou University Knowledge distillation-based cross-modal image aesthetic quality evaluation method
CN112784964A (en) * 2021-01-27 2021-05-11 Xidian University Image classification method based on bridging knowledge distillation convolution neural network
CN113205002A (en) * 2021-04-08 2021-08-03 Nanjing University of Posts and Telecommunications Low-definition face recognition method, device, equipment and medium for unconstrained video monitoring
CN112988975A (en) * 2021-04-09 2021-06-18 Beijing Language and Culture University Viewpoint mining method based on ALBERT and knowledge distillation
CN113240580A (en) * 2021-04-09 2021-08-10 Jinan University Lightweight image super-resolution reconstruction method based on multi-dimensional knowledge distillation
CN113257361A (en) * 2021-05-31 2021-08-13 Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences Method, device and equipment for realizing self-adaptive protein prediction framework

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
"A Seismic-Based Feature Extraction Algorithm for Robust Ground Target Classification"; Qianwei Zhou et al.; IEEE Signal Processing Letters; Vol. 19; full text *
"Contact Angle of an Evaporating Droplet of Binary Solution on a Super Wetting Surface"; Mengmeng Wu et al.; arXiv; full text *
"On the Demystification of Knowledge Distillation: A Residual Network Perspective"; Nandan Kumar Jha et al.; arXiv; full text *
"TutorNet: Towards Flexible Knowledge Distillation for End-to-End Speech Recognition"; Ji Won Yoon et al.; IEEE/ACM Transactions on Audio, Speech, and Language Processing; Vol. 29; full text *
"Object Detection Algorithm Based on Improved YOLOv3"; Zhao Qiong et al.; Laser & Optoelectronics Progress; Vol. 57, No. 12; full text *
"Face Restoration and Expression Recognition Based on Generative Adversarial Networks and Knowledge Distillation"; Jiang Huiming; China Master's Theses Full-text Database; full text *
"A Survey of Object Detection and Recognition Algorithms for Large Fields of View"; Li Tangwei et al.; Laser & Optoelectronics Progress; Vol. 57, No. 12; full text *
"Pre-trained Models for Natural Language Processing: A Survey"; Qiu XiPeng, Sun TianXiang, Xu YiGe, Shao YunFan, Dai Ning, Huang XuanJing; Science China (Technological Sciences); No. 10; full text *
"Ultra-High-Definition Video Quality Enhancement Technology and Its Chip-Based Implementation"; Gao Xinbo, Lu Wen, Zha Lin, Hui Zheng, Qi Tongshuai, Jiang Jiande; Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition); No. 5; full text *

Also Published As

Publication number Publication date
CN113807214A (en) 2021-12-17

Similar Documents

Publication Publication Date Title
CN113688723B (en) Infrared image pedestrian target detection method based on improved YOLOv5
CN112926396B (en) Action identification method based on double-current convolution attention
CN111652202B (en) Method and system for solving video question-answer problem by improving video-language representation learning through self-adaptive space-time diagram model
CN111583263A (en) Point cloud segmentation method based on joint dynamic graph convolution
CN113807214B (en) Small target face recognition method based on deit affiliated network knowledge distillation
CN114492574A (en) Unsupervised adversarial domain-adaptive image classification method with pseudo-label loss based on a Gaussian-uniform mixture model
CN111738054B (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
CN112150493A (en) Semantic guidance-based screen area detection method in natural scene
CN112307982A (en) Human behavior recognition method based on staggered attention-enhancing network
CN111598167B (en) Small sample image identification method and system based on graph learning
CN112085055A (en) Black-box attack method based on transfer-model Jacobian matrix eigenvector perturbation
CN112149526B (en) Lane line detection method and system based on long-distance information fusion
CN114842343A (en) ViT-based aerial image identification method
CN116452862A (en) Image classification method based on domain generalization learning
CN111291705A (en) Cross-multi-target-domain pedestrian re-identification method
CN112016592B (en) Domain adaptive semantic segmentation method and device based on cross domain category perception
CN113096133A (en) Method for constructing semantic segmentation network based on attention mechanism
CN116433909A (en) Similarity weighted multi-teacher network model-based semi-supervised image semantic segmentation method
CN116543250A (en) Model compression method based on class attention transmission
CN113159071B (en) Cross-modal image-text association anomaly detection method
CN115661539A (en) Less-sample image identification method embedded with uncertainty information
CN115630361A (en) Attention distillation-based federal learning backdoor defense method
CN114841887A (en) Image restoration quality evaluation method based on multi-level difference learning
CN114780767A (en) Large-scale image retrieval method and system based on deep convolutional neural network
CN114549958A (en) Night and disguised target detection method based on context information perception mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant