CN112580502A - SICNN-based low-quality video face recognition method - Google Patents

SICNN-based low-quality video face recognition method

Info

Publication number
CN112580502A
CN112580502A (application CN202011496030.8A)
Authority
CN
China
Prior art keywords
network
face
sicnn
loss
quality video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011496030.8A
Other languages
Chinese (zh)
Inventor
袁家斌
陆要要
何珊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202011496030.8A priority Critical patent/CN112580502A/en
Publication of CN112580502A publication Critical patent/CN112580502A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • G06T3/4053 Super resolution, i.e. output image resolution higher than sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation

Abstract

The invention discloses a low-quality video face recognition method based on SICNN. First, key frames are selected by clustering the key points that characterize face orientation according to the positions of the face in the video frames. Then a SICNN reconstruction model is established: features are extracted by a reconstruction network and a recognition network, the reconstruction loss and recognition loss are obtained respectively so as to define an identity loss, and the reconstruction network is trained with an alternating training strategy, yielding frame images with high resolution and richer identity features. Finally, the reconstructed frame images are input into the recognition network Inception ResNet v2, depth features are extracted for classification and recognition, and the recognition results of all image frames are voted on to obtain the video-level recognition result. The method is applied to low-quality video face recognition and effectively improves the accuracy of low-quality video face recognition.

Description

SICNN-based low-quality video face recognition method
Technical Field
The invention belongs to the technical field of face recognition, and particularly relates to a low-quality video face recognition method based on SICNN.
Background
In recent years, progress in computer vision has pushed more and more techniques into practical products in everyday life. With the rise of deep neural networks, face recognition technology has developed rapidly. Image-based face recognition has achieved excellent results, while research on video face recognition lags comparatively behind. This is because video face recognition not only faces the same problems of illumination, occlusion and pose as image face recognition, but the quality of video frames in practical applications (such as surveillance scenes) is usually inferior to that of still images. Current video face recognition methods fall into two categories: classical methods and deep learning methods. A common deep learning approach splits a video into image frames, performs face recognition on each frame, and votes on the per-frame results to obtain the final video recognition result. A common way to improve low-quality video is super-resolution reconstruction, which reconstructs a low-resolution video frame into a super-resolution image with better visual quality and richer features, thereby alleviating the mismatch between the feature spaces of high- and low-resolution images.
However, general super-resolution reconstruction algorithms also have problems: most existing super-resolution methods only consider whether the output image is sharp and realistic, i.e. the visual effect, and neglect the recovery of facial features. As a result they cannot generate a face close to the true identity, cannot improve face recognition accuracy, and fail to reach the expected recognition performance.
Disclosure of Invention
The invention aims to overcome the shortcomings of existing super-resolution reconstruction algorithms and to provide a reconstruction algorithm that benefits face recognition. The invention adopts a low-quality video face recognition method based on SICNN (Super-Identity Convolutional Neural Network), which effectively addresses the poor recognition performance on low-quality video and achieves a good recognition effect.
To achieve this purpose, the invention adopts the following technical scheme:
a low-quality video face recognition method based on SICNN comprises the following steps:
step 1, data preprocessing: splitting the low-quality video data in a data set into image frames, detecting faces and cropping them into 32 × 40 px face images, and dividing the image set into a training set and a test set at a ratio of 7:3;
step 2, key frame selection: taking the key point positions of the low-quality video face image frames in the data set as face features, and selecting key frames using a K-means clustering algorithm and a random algorithm;
step 3, inputting the low-quality video key frame image I^LR into the improved SICNN reconstruction network CNN_H to extract features and reconstruct a super-resolution image I^SR, and obtaining the super-resolution reconstruction loss L_SR against the corresponding high-resolution face image I^HR;
step 4, inputting the super-resolution image I^SR reconstructed in step 3 into the improved recognition network CNN_R to extract depth features, mapping the features to a hypersphere space for classification and recognition, and obtaining the recognition loss L_FR and the super-identity loss L_SI;
step 5, training the networks with an alternating training strategy: using the recognition loss L_FR obtained in step 4 to train the recognition network, and using the weighted super-identity loss L_SI obtained in step 4 together with the super-resolution reconstruction loss L_SR obtained in step 3 to jointly train the reconstruction network until convergence;
step 6, inputting the video image frames reconstructed by the SICNN in step 3 into the Inception ResNet v2 network, extracting features with small convolutions, and training the network with a softmax classifier and an improved CenterLoss as the loss function, where the improved CenterLoss, instead of computing the class centers as in the original CenterLoss, directly takes the features of the corresponding high-resolution face images as the centers;
step 7, voting on the recognition results of the image frames to obtain the final video recognition result.
Further, the data set used in step 1 is the COX data set; ten splits of the COX data set are used as training and test samples, and the reported result is the average over the ten experiments.
Further, the K value of the K-means algorithm in step 2 is 5, corresponding to 5 different face poses: left profile, left-deflected face, frontal face, right-deflected face and right profile; 10 key frames are selected from each cluster with a random algorithm.
Further, the CNN_H network in step 3 consists of DB (Dense Block), convolution, DB, deconvolution, convolution, DB and convolution connected in sequence; the CNN_H network uses the DB blocks to extract semantic features, deconvolution to enlarge the resolution of the input features, and convolution to perform mapping and reconstruction. Because the resolution of the low-quality video key frames used here is 32 × 40, a magnification of only 4 times is needed to meet the recognition requirement, so the number of deconvolution layers in CNN_H is reduced from 3 to 2, changing the original 8× magnification to 4×.
Furthermore, each DB block is a sequential connection of 6 identical DenseLayer structures; each DenseLayer consists of a 1 × 1 and a 3 × 3 convolutional layer connected in sequence, the 1 × 1 convolution serving as a bottleneck layer that reduces the number of input feature maps, i.e. performs dimensionality reduction. Each convolutional layer follows the Batch Normalization + ReLU + Conv pattern; the growth_rate of the DB block is 32 and bn_size is 4.
Further, in step 4, CNN_R is improved from the ResNet-like CNN_R in the SICNN model. CNN_R contains 36 convolutional layers, with 6 convolutional layers and 5 residual layers connected alternately, in the order convolutional layer 1a, convolutional layer 1b, residual layer 1, convolutional layer 2, residual layer 2, convolutional layer 3, residual layer 3, convolutional layer 4, residual layer 4, convolutional layer 5, residual layer 5; each residual layer contains one convolutional layer and several residual blocks, and the numbers of residual blocks in the 5 residual layers are 1, 2, 4, 6 and 2 respectively. CNN_R focuses on strengthening face identity features so that the reconstructed SR image is closer to the original HR image; its loss function is A-Softmax, which introduces an angular classification margin and is equivalent to learning features on a hypersphere, making the learned face features more discriminative.
Further, the super-identity loss L_SI in step 5 is a perceptual-style loss computed as a normalized Euclidean distance, which relates the loss directly to identity in the hypersphere space.
Further, the Inception ResNet v2 network in step 6 is a sequential connection of Stem, 5 × Inception-ResNet-A, Reduction-A, 10 × Inception-ResNet-B, Reduction-B, 5 × Inception-ResNet-C, Average Pooling, Dropout and Softmax layers. The Inception ResNet v2 network performs dimensionality reduction with 1 × 1 convolutions in each layer, replaces 5 × 5 convolutions with two 3 × 3 convolutions, and decomposes the 7 × 7 and 3 × 3 convolutions in the Stem and Reduction A/B/C layers into two one-dimensional convolutions (1 × 7, 7 × 1) and (1 × 3, 3 × 1). The network replaces the sequential convolution and pooling before the Inception-ResNet structure with a Stem module to obtain a deeper network. Training uses the softmax cross-entropy loss function and the improved center loss function CenterLoss.
Further, the final classification result in step 7 is obtained by voting; the result with the most votes is the final video recognition result.
Compared with the prior art, the invention has the following beneficial effects:
the invention relates to a low-quality video face recognition model based on super-resolution reconstruction, which aims at the characteristic of low-resolution illumination difference of a low-quality video, and improves the accuracy of video recognition by performing key frame extraction, super-resolution reconstruction and face recognition classification on a video frame;
according to the method, the video key frames are selected through the key frame selection algorithm, the calculation complexity of reconstruction and recognition is reduced on the basis of not influencing the recognition efficiency, and the training and testing time is reduced;
according to the invention, by introducing identity loss into the reconstruction network through the SICNN reconstruction method, the reconstructed super-resolution image can obtain more identity characteristics, and the face recognition accuracy can be improved;
the Incepton Resnet v2+ Centerlos identification network in the invention utilizes the center loss function to more accurately classify the result, and the Incepton Resnet v2 network is used to reduce the calculation cost and accelerate the learning speed.
Drawings
FIG. 1 is a diagram of a SICNN-based low-quality video face recognition model;
FIG. 2 is a SICNN model framework;
FIG. 3 is the Inception-ResNet v2 network architecture.
Detailed Description
The present invention will be further described with reference to the following examples.
Example 1
A low-quality video face recognition method based on SICNN comprises the following steps:
step 1, preprocessing data, dividing low-quality video data in a data set into image frames, cutting face images into face images with the size of 32 × 40px by face detection, and dividing the image set into a training set and a testing set by using an algorithm, wherein the size ratio of the data set is 7: 3. For example, a COX dataset may be used, and ten partitions of the COX dataset may be used for training samples and test samples, with the results averaged over ten experiments.
The COX face data set is designed to address video-to-still (V2S), still-to-video (S2V) and video-to-video (V2V) face recognition. The data set contains 1,000 subjects; for each subject a video surveillance scene is simulated, capturing 1 high-quality still image and 3 video sequences (cam1, cam2, cam3). After face detection and data preprocessing, most video sequences contain more than 100 face frames, and some more than 300.
Step 2, key frame selection: take the key point positions of the low-quality video face image frames in the data set as face features, and select key frames using the K-means clustering algorithm and a random algorithm. The K value of the K-means algorithm is 5, corresponding to 5 different face poses: left profile, left-deflected face, frontal face, right-deflected face and right profile; 10 key frames are selected from each cluster with a random algorithm.
The invention performs K-Means clustering on the images according to the positions of the face key points; the key point position of the a-th sample is denoted x^(a). An image produced by face detection is generally annotated with the positions of the two eyes, which the invention takes as the key points, i.e.

x^(a) = (x_aL, y_aL, x_aR, y_aR)

where (x_aL, y_aL) are the left-eye position coordinates of the a-th sample and (x_aR, y_aR) are its right-eye position coordinates. The invention therefore defines the distance function of the a-th sample in the K-Means clustering as:

l_aj = sqrt( (x_aL − x_jL)^2 + (y_aL − y_jL)^2 + (x_aR − x_jR)^2 + (y_aR − y_jR)^2 )

where l_aj is the distance between the a-th sample and the centroid μ_j of class j, μ_j is the centroid of class j, (x_jL, y_jL) are the left-eye position coordinates of μ_j, and (x_jR, y_jR) are its right-eye position coordinates.

Suppose the input sample position set is S = {x^(1), x^(2), …, x^(a), …, x^(m)}, x^(a) ∈ R^n, where m is the number of samples and R^n is the n-dimensional real space. The algorithm steps are as follows:

(1) randomly select k cluster centroids μ_1, μ_2, …, μ_k ∈ R^n, where μ_k is the centroid of class k;

(2) repeat the following process until convergence {

for each sample x^(a), compute the class c^(a) it belongs to (c^(a) is the class of the a-th sample, j is the class index):

c^(a) = argmin_j || x^(a) − μ_j ||^2

for each class j, recompute the centroid of the class:

μ_j = Σ_{a=1}^{m} 1{c^(a) = j} · x^(a) / Σ_{a=1}^{m} 1{c^(a) = j}

}
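For illustration, the following is a minimal Python sketch of this key frame selection, assuming scikit-learn's KMeans, eye coordinates stored as an (m, 4) array, K = 5 clusters and 10 randomly chosen frames per cluster as stated above; the variable names and the use of scikit-learn are assumptions of this sketch, not the patent's code.

import numpy as np
from sklearn.cluster import KMeans

def select_keyframes(eye_coords, frame_ids, k=5, frames_per_cluster=10, seed=0):
    """eye_coords: (m, 4) array of (xL, yL, xR, yR) per detected face frame."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(eye_coords)
    keyframes = []
    for j in range(k):                                   # one cluster per rough head pose
        members = np.where(labels == j)[0]
        take = min(frames_per_cluster, len(members))
        chosen = rng.choice(members, size=take, replace=False)
        keyframes.extend(frame_ids[i] for i in chosen)   # 10 random key frames from each cluster
    return keyframes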
Step 3, the i-th low-quality video key frame, i.e. the low-resolution face image I_i^LR, is input into the improved SICNN reconstruction network CNN_H to extract features and reconstruct the super-resolution face image I_i^SR, and the super-resolution reconstruction loss L_SR is obtained against the corresponding high-resolution face image I_i^HR. The CNN_H network consists of DB (Dense Block), convolution, DB, deconvolution, convolution, DB and convolution connected in sequence; CNN_H uses the DB blocks to extract semantic features, deconvolution to enlarge the resolution of the input features, and convolution to perform mapping and reconstruction. Since the resolution of the low-quality video key frames used here is 32 × 40, a magnification of only 4 times is needed to meet the recognition requirement, so the number of deconvolution layers in the improved CNN_H is reduced from 3 to 2, changing the original 8× magnification to 4×.
3.1 DB (Dense Block)

The CNN_H network uses DB blocks to extract semantic features. To address the gradient vanishing problem, a DB block borrows the idea of ResNet and, while guaranteeing maximum information flow between layers in the network, connects all layers directly: simply put, the input of each layer comes from the outputs of all preceding layers.

The number of output feature maps of each convolutional layer in a DB is small (fewer than 100), rather than hundreds or thousands as in very wide networks. At the same time, this connection pattern makes the propagation of features and gradients more effective and the network easier to train. Gradient vanishing tends to occur in deeper networks because the input and gradient information must pass through many layers; with dense connections every layer is in effect directly connected to the input and to the loss, which reduces gradient vanishing and makes deeper networks unproblematic.

Each DB block in the invention is composed of 6 identical substructures (DenseLayers); each substructure contains one 1 × 1 convolutional layer and one 3 × 3 convolutional layer, the 1 × 1 convolution serving as a bottleneck layer that reduces the number of input feature maps, i.e. performs dimensionality reduction. Each convolutional layer follows the Batch Normalization + ReLU + Conv pattern; the growth_rate of the DB block is 32 and bn_size is 4.

In the original SICNN, a face image of size 32 × 40 is input and the reconstructed image resolution is enlarged to 8 times the original, i.e. 256 × 320.
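A hedged PyTorch sketch of one DB block as described above (6 identical DenseLayers, a 1 × 1 bottleneck followed by a 3 × 3 convolution, BN + ReLU before each convolution, growth_rate = 32, bn_size = 4) might look as follows; channel counts and coding style are assumptions, since the patent publishes no source code.

import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    def __init__(self, in_ch, growth_rate=32, bn_size=4):
        super().__init__()
        self.bottleneck = nn.Sequential(                 # 1x1 bottleneck reduces the input feature maps
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, bn_size * growth_rate, kernel_size=1, bias=False))
        self.conv3x3 = nn.Sequential(                    # 3x3 convolution produces growth_rate new maps
            nn.BatchNorm2d(bn_size * growth_rate), nn.ReLU(inplace=True),
            nn.Conv2d(bn_size * growth_rate, growth_rate, kernel_size=3, padding=1, bias=False))

    def forward(self, x):
        out = self.conv3x3(self.bottleneck(x))
        return torch.cat([x, out], dim=1)                # dense connection: concatenate with all previous features

class DenseBlock(nn.Module):
    def __init__(self, in_ch, num_layers=6, growth_rate=32, bn_size=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [DenseLayer(in_ch + i * growth_rate, growth_rate, bn_size) for i in range(num_layers)])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x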
3.2 Loss function

In the reconstruction network CNN_H, the pixel-wise Euclidean distance between the SR image reconstructed from the LR image and the high-resolution HR image is defined as the super-resolution loss L_SR. For the i-th low-resolution face image, the super-resolution loss after reconstruction by CNN_H is:

L_SR = || I_i^SR − I_i^HR ||_2^2

where I_i^LR and I_i^HR are the i-th LR and HR face images in the training set respectively, and I_i^SR denotes the output reconstructed from I_i^LR, which can also be written as CNN_H(I_i^LR).
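As a rough illustration, the following minimal sketch (an assumption of this edit, not code from the patent) computes L_SR as the squared Euclidean pixel distance between batched SR and HR tensors; the batch averaging is an assumption.

import torch

def super_resolution_loss(sr, hr):
    """sr = CNN_H(lr); both tensors of shape (N, C, H, W)."""
    return ((sr - hr) ** 2).flatten(1).sum(dim=1).mean()   # squared L2 per image, averaged over the batch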
Step 4, the super-resolution image I_i^SR reconstructed from the i-th low-resolution face image in step 3 is input into the improved recognition network CNN_R to extract depth features; the features are mapped to a hypersphere space for classification and recognition, giving the recognition loss L_FR and the super-identity loss L_SI. CNN_R is improved from the ResNet-like CNN_R in the SICNN model: it contains 36 convolutional layers, with 6 convolutional layers and 5 residual layers connected alternately, in the order convolutional layer 1a, convolutional layer 1b, residual layer 1, convolutional layer 2, residual layer 2, convolutional layer 3, residual layer 3, convolutional layer 4, residual layer 4, convolutional layer 5, residual layer 5; each residual layer contains one convolutional layer and several residual blocks, and the numbers of residual blocks in the 5 residual layers are 1, 2, 4, 6 and 2 respectively. CNN_R focuses on strengthening face identity features so that the reconstructed SR image is closer to the original HR image; its loss function is A-Softmax, which introduces an angular classification margin and is equivalent to learning features on a hypersphere, making the learned face features more discriminative.
4.1 Recognition network CNN_R

CNN_R is a ResNet-like network containing 36 convolutional layers, with 6 convolutional layers and 5 residual layers connected alternately, in the order convolutional layer 1a, convolutional layer 1b, residual layer 1, convolutional layer 2, residual layer 2, convolutional layer 3, residual layer 3, convolutional layer 4, residual layer 4, convolutional layer 5, residual layer 5; each residual layer contains one convolutional layer and several residual blocks, and the numbers of residual blocks in the 5 residual layers are 1, 2, 4, 6 and 2 respectively. The network structure of CNN_R is as follows:

[Table: CNN_R layer-by-layer configuration, given as a figure in the original specification]
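Since the layer table itself is not reproducible from the text, the following is a hedged PyTorch skeleton of a CNN_R-style network with the stated layer ordering and residual-block counts (1, 2, 4, 6, 2), giving 6 + 2·(1+2+4+6+2) = 36 convolutional layers; the channel widths, strides, PReLU activations and feature dimension are assumptions of this sketch.

import torch.nn as nn

def conv(in_ch, out_ch, stride=1):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride, 1), nn.PReLU(out_ch))

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(conv(ch, ch), conv(ch, ch))   # two 3x3 convolutions per residual block
    def forward(self, x):
        return x + self.body(x)

def cnn_r(blocks=(1, 2, 4, 6, 2), widths=(64, 64, 128, 256, 512, 512), feat_dim=512):
    layers = [conv(3, widths[0], stride=2),                     # convolutional layer 1a
              conv(widths[0], widths[1])]                       # convolutional layer 1b
    in_ch = widths[1]
    for stage, n_blocks in enumerate(blocks):
        layers += [ResidualBlock(in_ch) for _ in range(n_blocks)]   # residual layer (stage + 1)
        if stage + 1 < len(blocks):                             # convolutional layers 2..5 between residual layers
            out_ch = widths[stage + 2]
            layers.append(conv(in_ch, out_ch, stride=2))
            in_ch = out_ch
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(in_ch, feat_dim)]
    return nn.Sequential(*layers)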
4.2 Loss function of the recognition network CNN_R

CNN_R uses the A-Softmax loss as its loss function. This loss introduces an angular classification margin and is equivalent to learning features on a hypersphere, which makes the learned face features more discriminative. CNN_R expresses this loss function as the recognition loss L_FR. For a super-resolution face image input I_i^SR belonging to the y_i-th identity, the recognition loss L_FR(I_i^SR) is:

L_FR(I_i^SR) = −log( e^{||f_i|| · ψ(θ_{y_i,i})} / ( e^{||f_i|| · ψ(θ_{y_i,i})} + Σ_{j≠y_i} e^{||f_i|| · cos θ_{j,i}} ) )

where e is the base of the natural logarithm; f_i is the identity feature extracted by CNN_R from the input face image I_i^SR; θ_{y_i,i} and θ_{j,i} are the learned angles between f_i and the weight vectors of identities y_i and y_j respectively; and ψ(·) is a generalized monotonically decreasing function of the angle, derived as:

ψ(θ) = (−1)^c · cos(b·θ) − 2c,   θ ∈ [cπ/b, (c+1)π/b]

where b is the hyper-parameter of the angular margin constraint, b ≥ 1; c is a non-negative integer with c ∈ [0, b−1]; preferably b = 4 and c ∈ [0, 3].
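A small numeric sketch of the piecewise angular function ψ and of an A-Softmax-style recognition loss for a single sample follows; it assumes the standard A-Softmax form (ψ applied to the true identity, cos to the other identities) and is illustrative only, not the patent's implementation.

import math

def psi(theta, b=4):
    """psi(theta) = (-1)^c * cos(b * theta) - 2c for theta in [c*pi/b, (c+1)*pi/b]."""
    c = min(int(theta * b / math.pi), b - 1)          # which angular interval theta falls into
    return (-1) ** c * math.cos(b * theta) - 2 * c

def recognition_loss_single(feat_norm, thetas, y, b=4):
    """thetas[j]: angle between the extracted feature and the weight vector of identity j; y: true identity."""
    logits = [feat_norm * psi(t, b) if j == y else feat_norm * math.cos(t)
              for j, t in enumerate(thetas)]
    log_denom = math.log(sum(math.exp(z) for z in logits))
    return -(logits[y] - log_denom)                   # negative log softmax probability of the true identity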
Step 5, the networks are trained with an alternating training strategy: the recognition loss L_FR obtained in step 4 is used to train the recognition network, and the weighted super-identity loss L_SI obtained in step 4 together with the super-resolution reconstruction loss L_SR obtained in step 3 are used to jointly train the reconstruction network until convergence. The super-identity loss L_SI is a perceptual-style loss computed as a normalized Euclidean distance, which relates the loss directly to identity in the hypersphere space. For an input LR image I_i^LR, the super-identity loss in the hypersphere identity metric space is:

L_SI = || ĩ(I_i^SR) − ĩ(I_i^HR) ||_2^2

where ĩ(I_i^SR) is the identity representation of I_i^SR projected onto the unit hypersphere, ĩ(I_i^HR) is the identity representation of I_i^HR projected onto the unit hypersphere, i.e. ĩ(·) = f(·)/||f(·)||_2, and f(I_i^SR) and f(I_i^HR) are the identity features extracted by CNN_R from I_i^SR and I_i^HR respectively.
5.1 Training strategy

Input: a recognition model CNN_R trained on high-resolution HR face images; a face hallucination (reconstruction) model CNN_H pre-trained with the super-resolution loss L_SR; mini-batch size N; I_i^LR and I_i^HR, the i-th low-resolution and high-resolution face images.

Output: SICNN.

1 while not converged:
2   select a mini-batch of N image pairs {I_i^LR, I_i^HR};
3   for {I_i^LR}, generate a mini-batch of N images {I_i^SR}, where I_i^SR = CNN_H(I_i^LR);
4   use the average recognition loss L_FR over the N image pairs to update the recognition model CNN_R by a gradient step (η is the learning rate):
      CNN_R ← CNN_R − η · ∇( (1/N) Σ_i L_FR(I_i^SR) )
5   use the average super-resolution loss L_SR and the average super-identity loss L_SI over the N image pairs to update the reconstruction model CNN_H (α is the weight of the super-identity loss L_SI, equal to 8):
      CNN_H ← CNN_H − η · ∇( (1/N) Σ_i ( L_SR + α · L_SI ) )
6 end
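The listing above can be sketched as the following PyTorch-style training loop; the Adam optimizers, the batched recognition_loss helper and the convergence criterion (a fixed epoch count) are assumptions of this sketch, not the patent's code.

import torch
import torch.nn.functional as F

def train_sicnn(cnn_h, cnn_r, loader, recognition_loss, alpha=8.0, lr=1e-4, epochs=10):
    opt_r = torch.optim.Adam(cnn_r.parameters(), lr=lr)
    opt_h = torch.optim.Adam(cnn_h.parameters(), lr=lr)
    for _ in range(epochs):                                   # stands in for "while not converged"
        for lr_img, hr_img, label in loader:
            sr_img = cnn_h(lr_img)                            # I_SR = CNN_H(I_LR)

            # line 4 of the listing: update CNN_R with the average recognition loss L_FR
            opt_r.zero_grad()
            l_fr = recognition_loss(cnn_r(sr_img.detach()), label)
            l_fr.backward()
            opt_r.step()

            # line 5 of the listing: update CNN_H with L_SR + alpha * L_SI (alpha = 8)
            opt_h.zero_grad()
            l_sr = ((sr_img - hr_img) ** 2).flatten(1).sum(1).mean()
            f_sr = F.normalize(cnn_r(sr_img), dim=1)          # identity features on the unit hypersphere
            f_hr = F.normalize(cnn_r(hr_img), dim=1).detach()
            l_si = ((f_sr - f_hr) ** 2).sum(1).mean()
            (l_sr + alpha * l_si).backward()
            opt_h.step()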
Step 6, the video image frames reconstructed by the SICNN in step 3 are input into the Inception ResNet v2 network, features are extracted with small convolutions, a softmax classifier is used, and the network is trained with the improved CenterLoss as the loss function. Unlike the convolutional and pooling layers of traditional networks, the Inception ResNet v2 network runs 1 × 1 convolution, 3 × 3 convolution and 3 × 3 pooling in parallel within the same layer. The network is a sequential connection of Stem, 5 × Inception-ResNet-A, Reduction-A, 10 × Inception-ResNet-B, Reduction-B, 5 × Inception-ResNet-C, Average Pooling, Dropout and Softmax layers. It performs dimensionality reduction with 1 × 1 convolutions in each layer, replaces 5 × 5 convolutions with two 3 × 3 convolutions, and decomposes the 7 × 7 and 3 × 3 convolutions in the Stem and Reduction A/B/C layers into two one-dimensional convolutions (1 × 7, 7 × 1) and (1 × 3, 3 × 1), which reduces the number of parameters, speeds up computation and allows a deeper network. The network also replaces the sequential convolution and pooling before the Inception-ResNet structure with a Stem module to obtain a deeper network. Training uses the softmax cross-entropy loss function and the improved center loss function CenterLoss.
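For illustration, the following minimal PyTorch sketch shows the two convolution factorizations mentioned above; the channel count of 64 is an arbitrary example value, not the patent's configuration.

import torch.nn as nn

# a 7x7 convolution decomposed into a 1x7 followed by a 7x1 convolution
conv7x7_factorized = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=(1, 7), padding=(0, 3)),
    nn.Conv2d(64, 64, kernel_size=(7, 1), padding=(3, 0)))   # same receptive field, fewer parameters

# a 5x5 receptive field obtained with two stacked 3x3 convolutions
conv5x5_as_two_3x3 = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.Conv2d(64, 64, kernel_size=3, padding=1))              # 2 * 9 weights per channel pair instead of 25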
Suppose the feature extracted by the Inception ResNet v2 network from the super-resolution face image I_i^SR reconstructed from the i-th low-resolution face image is denoted I_i, its true class is y_i, and the center of each class is denoted c_{y_i}. The center loss function used to increase intra-class cohesion is defined as L_center:

L_center = (1/2) Σ_{i=1}^{N} || I_i − c_{y_i} ||_2^2
Considering the characteristics of video recognition, the feature H_i extracted by the Inception ResNet v2 network from the high-resolution face image I_i^HR corresponding to I_i is taken as the center of the true class y_i; that is, in the improved L_center the center c_{y_i} is H_i, and this center remains unchanged during training. The improved L_center is:

L_center = (1/2) Σ_{i=1}^{N} || I_i − H_i ||_2^2
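A minimal sketch of the improved CenterLoss, with the fixed HR features H_i used as class centers, might look as follows; the batch averaging is an assumption of this sketch.

import torch

def improved_center_loss(sr_features, hr_features):
    """sr_features: I_i from the SR frames; hr_features: H_i from the matching HR images (kept fixed)."""
    return 0.5 * ((sr_features - hr_features.detach()) ** 2).sum(dim=1).mean()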
and 7, voting the identification result of the image frame to obtain a final video identification result. And the final classification result is obtained by voting the identification results of all the video frames, and the result with the largest number of votes is the final video identification result.
In summary, key frames are first selected by clustering the key point positions that characterize face orientation according to the positions of the face in the video; then a SICNN reconstruction model is established, features are extracted through the reconstruction network and the recognition network, the reconstruction loss and recognition loss are obtained respectively so as to define the identity loss, and the reconstruction network is trained with the alternating training strategy, yielding frame images with high resolution and richer identity features; finally, the reconstructed frame images are input into the recognition network Inception ResNet v2, depth features are extracted for classification and recognition, and the recognition results of all image frames are voted on to obtain the video recognition result. Compared with classical point-to-set video recognition methods and common super-resolution reconstruction methods, this super-resolution-based method, when applied to low-quality video face recognition, effectively improves the accuracy of low-quality video face recognition.
The above description covers only the preferred embodiments of the present invention. It should be noted that various modifications and adaptations can be made by those skilled in the art without departing from the principles of the invention, and these are also intended to fall within the scope of the invention.

Claims (9)

1. A low-quality video face recognition method based on SICNN is characterized by comprising the following steps:
step 1, data preprocessing: splitting the low-quality video data in a data set into image frames, detecting faces and cropping them into 32 × 40 px face images, and dividing the image set into a training set and a test set at a ratio of 7:3;
step 2, key frame selection: taking the key point positions of the low-quality video face image frames in the data set as face features, and selecting key frames using a K-means clustering algorithm and a random algorithm;
step 3, inputting the low-quality video key frame image I^LR into the improved SICNN reconstruction network CNN_H to extract features and reconstruct a super-resolution image I^SR, and obtaining the super-resolution reconstruction loss L_SR against the corresponding high-resolution face image I^HR;
Step 4, reconstructing the super-resolution image in the step 3
Figure FDA0002842198700000014
Input-improved recognition network CNNRExtracting depth features, mapping the features to a hypersphere space for classification and identification to obtain an identification loss LFRAnd super identity loss LSI
Step 5, training the network by using an alternative training strategy, and using the recognition loss L obtained in the step 4FRTraining the recognition network to obtain a weighted hyperidentity loss L using step 4SIAnd the super-resolution reconstruction loss L obtained in the step 3SRCo-training the reconstruction network until convergence;
step 6, inputting the video image frames reconstructed by the SICNN in step 3 into the Inception ResNet v2 network, extracting features with small convolutions, and training the network with a softmax classifier and an improved CenterLoss as the loss function, where the improved CenterLoss, instead of computing the class centers as in the original CenterLoss, directly takes the features of the corresponding high-resolution face images as the centers;
step 7, voting on the recognition results of the image frames to obtain the final video recognition result.
2. The SICNN-based low-quality video face recognition method of claim 1, wherein: the data set used in step 1 is the COX data set; ten splits of the COX data set are used as training and test samples, and the result is the average over ten experiments.
3. The SICNN-based low-quality video face recognition method of claim 1, wherein: the K value of the K-means algorithm in step 2 is 5, corresponding to 5 different face poses: left profile, left-deflected face, frontal face, right-deflected face and right profile; 10 key frames are selected from each cluster with a random algorithm.
4. The SICNN-based low-quality video face recognition method of claim 1, wherein: the CNN_H network in step 3 consists of DB, convolution, DB, deconvolution, convolution, DB and convolution connected in sequence; the CNN_H network uses the DB blocks to extract semantic features, deconvolution to enlarge the resolution of the input features, and convolution to perform mapping and reconstruction.
5. The SICNN-based low-quality video face recognition method of claim 4, wherein: each DB block is a sequential connection of 6 identical DenseLayer structures; each DenseLayer structure comprises a 1 × 1 and a 3 × 3 convolutional layer connected in sequence, the 1 × 1 convolutional layer being a bottleneck layer; each convolutional layer follows the Batch Normalization + ReLU + Conv pattern, the growth_rate of the DB block is 32, and bn_size is 4.
6. The SICNN-based low-quality video face recognition method of claim 1, wherein: in step 4, CNN_R in the SICNN model contains 36 convolutional layers, with 6 convolutional layers and 5 residual layers connected alternately, in the order convolutional layer 1a, convolutional layer 1b, residual layer 1, convolutional layer 2, residual layer 2, convolutional layer 3, residual layer 3, convolutional layer 4, residual layer 4, convolutional layer 5, residual layer 5; each residual layer contains one convolutional layer and several residual blocks, and the numbers of residual blocks in the 5 residual layers are 1, 2, 4, 6 and 2 respectively; the loss function uses A-Softmax.
7. The SICNN-based low-quality video face recognition method of claim 1, wherein: the super-identity loss L_SI in step 5 is a perceptual-style loss computed as a normalized Euclidean distance.
8. The SICNN-based low-quality video face recognition method of claim 1, wherein: the Inception ResNet v2 network in step 6 is a sequential connection of Stem, 5 × Inception-ResNet-A, Reduction-A, 10 × Inception-ResNet-B, Reduction-B, 5 × Inception-ResNet-C, Average Pooling, Dropout and Softmax layers; the Inception ResNet v2 network uses 1 × 1 convolutions in each layer, replaces 5 × 5 convolutions with two 3 × 3 convolutions, and decomposes the 7 × 7 and 3 × 3 convolutions in the Stem and Reduction A/B/C layers into two one-dimensional convolutions (1 × 7, 7 × 1) and (1 × 3, 3 × 1); the Inception ResNet v2 network replaces the sequential convolution and pooling before the Inception-ResNet structure with a Stem module; training uses the softmax cross-entropy loss function and the improved center loss function CenterLoss.
9. The SICNN-based low-quality video face recognition method of claim 1, wherein: the final classification result in step 7 is obtained by voting, and the result with the most votes is the final video recognition result.
CN202011496030.8A 2020-12-17 2020-12-17 SICNN-based low-quality video face recognition method Pending CN112580502A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011496030.8A CN112580502A (en) 2020-12-17 2020-12-17 SICNN-based low-quality video face recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011496030.8A CN112580502A (en) 2020-12-17 2020-12-17 SICNN-based low-quality video face recognition method

Publications (1)

Publication Number Publication Date
CN112580502A true CN112580502A (en) 2021-03-30

Family

ID=75135971

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011496030.8A Pending CN112580502A (en) 2020-12-17 2020-12-17 SICNN-based low-quality video face recognition method

Country Status (1)

Country Link
CN (1) CN112580502A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113375676A (en) * 2021-05-26 2021-09-10 南京航空航天大学 Detector landing point positioning method based on impulse neural network
CN113375676B (en) * 2021-05-26 2024-02-20 南京航空航天大学 Detector landing site positioning method based on impulse neural network
CN114612990A (en) * 2022-03-22 2022-06-10 河海大学 Unmanned aerial vehicle face recognition method based on super-resolution
CN115205768A (en) * 2022-09-16 2022-10-18 山东百盟信息技术有限公司 Video classification method based on resolution self-adaptive network


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination