CN112580502A - SICNN-based low-quality video face recognition method - Google Patents

SICNN-based low-quality video face recognition method

Info

Publication number
CN112580502A
CN112580502A (application CN202011496030.8A)
Authority
CN
China
Prior art keywords
network
face
sicnn
loss
quality video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011496030.8A
Other languages
Chinese (zh)
Inventor
袁家斌
陆要要
何珊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202011496030.8A priority Critical patent/CN112580502A/en
Publication of CN112580502A publication Critical patent/CN112580502A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • G06T3/4053 Super resolution, i.e. output image resolution higher than sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation

Abstract

The invention discloses a low-quality video face recognition method based on SICNN. First, key frames are selected by clustering the key points that characterize face orientation according to the positions of the face in the video frames. Then a SICNN reconstruction model is established: features are extracted by a reconstruction network and a recognition network, the reconstruction loss and recognition loss are obtained respectively so as to define an identity loss, and the reconstruction network is trained with an alternating training strategy, yielding frame images with high resolution and richer identity features. Finally, the reconstructed frame images are input into the recognition network Inception ResNet v2, depth features are extracted for classification and recognition, and the recognition results of all image frames are voted on to obtain the video-level recognition result. The method is applied to low-quality video face recognition and effectively improves the accuracy of low-quality video face recognition.

Description

SICNN-based low-quality video face recognition method
Technical Field
The invention belongs to the technical field of face recognition, and particularly relates to a low-quality video face recognition method based on SICNN.
Background
In recent years, progress in computer vision has pushed more and more techniques into practical products in everyday life. With the rise of deep neural networks, face recognition technology has developed rapidly. Image-based face recognition has achieved excellent results, while research on video face recognition lags comparatively behind. This is because video face recognition not only faces the same problems of illumination, occlusion and pose as image face recognition, but the quality of video frames in practical applications (such as surveillance scenes) is usually inferior to that of still images. Current video face recognition methods fall into two categories: classical methods and deep learning methods. A common deep learning approach splits a video into image frames, performs face recognition on each frame, and votes on the per-frame results to obtain the final video recognition result. A common way to improve low-quality video is super-resolution reconstruction, which reconstructs a low-resolution video frame into a super-resolution image with better visual quality and richer features, thereby alleviating the mismatch between the feature spaces of high- and low-resolution images.
However, general super-resolution reconstruction algorithms also have problems: most existing super-resolution methods only consider whether the output image is sharp and realistic, i.e. the visual effect, and neglect the recovery of facial features. As a result they cannot generate a face close to the true identity, cannot improve face recognition accuracy, and fail to reach the expected recognition performance.
Disclosure of Invention
The invention aims to overcome the shortcomings of existing super-resolution reconstruction algorithms and to provide a reconstruction algorithm that benefits face recognition. The invention adopts a low-quality video face recognition method based on SICNN (Super-Identity Convolutional Neural Network), which effectively addresses the poor recognition performance on low-quality video and achieves a good recognition effect.
To achieve this purpose, the invention adopts the following technical scheme:
a low-quality video face recognition method based on SICNN comprises the following steps:
step 1, data preprocessing: splitting the low-quality video data in a data set into image frames, detecting faces and cropping them into 32 × 40 px face images, and dividing the image set into a training set and a test set at a ratio of 7:3;
step 2, key frame selection: taking the key point positions of the low-quality video face image frames in the data set as face features, and selecting key frames using a K-means clustering algorithm and a random algorithm;
step 3, inputting the low-quality video key frame image I^LR into the improved SICNN reconstruction network CNN_H to extract features and reconstruct a super-resolution image I^SR, and obtaining the super-resolution reconstruction loss L_SR against the corresponding high-resolution face image I^HR;
step 4, inputting the super-resolution image I^SR reconstructed in step 3 into the improved recognition network CNN_R to extract depth features, mapping the features to a hypersphere space for classification and recognition, and obtaining the recognition loss L_FR and the super-identity loss L_SI;
step 5, training the networks with an alternating training strategy: using the recognition loss L_FR obtained in step 4 to train the recognition network, and using the weighted super-identity loss L_SI obtained in step 4 together with the super-resolution reconstruction loss L_SR obtained in step 3 to jointly train the reconstruction network until convergence;
step 6, inputting the video image frames reconstructed by the SICNN in step 3 into the Inception ResNet v2 network, extracting features with small convolutions, and training the network with a softmax classifier and an improved CenterLoss as the loss function, where the improved CenterLoss, instead of computing the class centers as in the original CenterLoss, directly takes the features of the corresponding high-resolution face images as the centers;
step 7, voting on the recognition results of the image frames to obtain the final video recognition result.
Further, the data set used in step 1 is the COX data set; ten splits of the COX data set are used as training and test samples, and the reported result is the average over the ten experiments.
Further, the K value of the K-means algorithm in step 2 is 5, corresponding to 5 different face poses: left profile, left-deflected face, frontal face, right-deflected face and right profile; 10 key frames are selected from each cluster with a random algorithm.
Further, the CNN_H network in step 3 consists of DB (Dense Block), convolution, DB, deconvolution, convolution, DB and convolution connected in sequence; the CNN_H network uses the DB blocks to extract semantic features, deconvolution to enlarge the resolution of the input features, and convolution to perform mapping and reconstruction. Because the resolution of the low-quality video key frames used here is 32 × 40, a magnification of only 4 times is needed to meet the recognition requirement, so the number of deconvolution layers in CNN_H is reduced from 3 to 2, changing the original 8× magnification to 4×.
Furthermore, each DB block is a sequential connection of 6 identical DenseLayer structures; each DenseLayer consists of a 1 × 1 and a 3 × 3 convolutional layer connected in sequence, the 1 × 1 convolution serving as a bottleneck layer that reduces the number of input feature maps, i.e. performs dimensionality reduction. Each convolutional layer follows the Batch Normalization + ReLU + Conv pattern; the growth_rate of the DB block is 32 and bn_size is 4.
Further, in step 4, CNN_R is improved from the ResNet-like CNN_R in the SICNN model. CNN_R contains 36 convolutional layers, with 6 convolutional layers and 5 residual layers connected alternately, in the order convolutional layer 1a, convolutional layer 1b, residual layer 1, convolutional layer 2, residual layer 2, convolutional layer 3, residual layer 3, convolutional layer 4, residual layer 4, convolutional layer 5, residual layer 5; each residual layer contains one convolutional layer and several residual blocks, and the numbers of residual blocks in the 5 residual layers are 1, 2, 4, 6 and 2 respectively. CNN_R focuses on strengthening face identity features so that the reconstructed SR image is closer to the original HR image; its loss function is A-Softmax, which introduces an angular classification margin and is equivalent to learning features on a hypersphere, making the learned face features more discriminative.
Further, the super-identity loss L_SI in step 5 is a perceptual-style loss computed as a normalized Euclidean distance, which relates the loss directly to identity in the hypersphere space.
Further, the Inception ResNet v2 network in step 6 is a sequential connection of Stem, 5 × Inception-ResNet-A, Reduction-A, 10 × Inception-ResNet-B, Reduction-B, 5 × Inception-ResNet-C, Average Pooling, Dropout and Softmax layers. The Inception ResNet v2 network performs dimensionality reduction with 1 × 1 convolutions in each layer, replaces 5 × 5 convolutions with two 3 × 3 convolutions, and decomposes the 7 × 7 and 3 × 3 convolutions in the Stem and Reduction A/B/C layers into two one-dimensional convolutions (1 × 7, 7 × 1) and (1 × 3, 3 × 1). The network replaces the sequential convolution and pooling before the Inception-ResNet structure with a Stem module to obtain a deeper network. Training uses the softmax cross-entropy loss function and the improved center loss function CenterLoss.
Further, the final classification result in step 7 is obtained by voting; the result with the most votes is the final video recognition result.
Compared with the prior art, the invention has the following beneficial effects:
the invention relates to a low-quality video face recognition model based on super-resolution reconstruction, which aims at the characteristic of low-resolution illumination difference of a low-quality video, and improves the accuracy of video recognition by performing key frame extraction, super-resolution reconstruction and face recognition classification on a video frame;
according to the method, the video key frames are selected through the key frame selection algorithm, the calculation complexity of reconstruction and recognition is reduced on the basis of not influencing the recognition efficiency, and the training and testing time is reduced;
according to the invention, by introducing identity loss into the reconstruction network through the SICNN reconstruction method, the reconstructed super-resolution image can obtain more identity characteristics, and the face recognition accuracy can be improved;
the Incepton Resnet v2+ Centerlos identification network in the invention utilizes the center loss function to more accurately classify the result, and the Incepton Resnet v2 network is used to reduce the calculation cost and accelerate the learning speed.
Drawings
FIG. 1 is a diagram of a SICNN-based low-quality video face recognition model;
FIG. 2 is a SICNN model framework;
FIG. 3 is the Inception-ResNet v2 network architecture.
Detailed Description
The present invention will be further described with reference to the following examples.
Example 1
A low-quality video face recognition method based on SICNN comprises the following steps:
step 1, preprocessing data, dividing low-quality video data in a data set into image frames, cutting face images into face images with the size of 32 × 40px by face detection, and dividing the image set into a training set and a testing set by using an algorithm, wherein the size ratio of the data set is 7: 3. For example, a COX dataset may be used, and ten partitions of the COX dataset may be used for training samples and test samples, with the results averaged over ten experiments.
The COX face data set is designed to address video-to-still (V2S), still-to-video (S2V) and video-to-video (V2V) face recognition. The data set contains 1,000 subjects; for each subject a video surveillance scene is simulated, capturing 1 high-quality still image and 3 video sequences (cam1, cam2, cam3). After face detection and data preprocessing, most video sequences contain more than 100 face frames, and some more than 300.
Step 2, key frame selection: take the key point positions of the low-quality video face image frames in the data set as face features, and select key frames using the K-means clustering algorithm and a random algorithm. The K value of the K-means algorithm is 5, corresponding to 5 different face poses: left profile, left-deflected face, frontal face, right-deflected face and right profile; 10 key frames are selected from each cluster with a random algorithm.
The invention performs K-Means clustering on the images according to the positions of the face key points; the key point position of the a-th sample is denoted x^(a). An image produced by face detection is generally annotated with the positions of the two eyes, which the invention takes as the key points, i.e.

x^(a) = (x_aL, y_aL, x_aR, y_aR)

where (x_aL, y_aL) are the left-eye position coordinates of the a-th sample and (x_aR, y_aR) are its right-eye position coordinates. The invention therefore defines the distance function of the a-th sample in the K-Means clustering as:

l_aj = sqrt( (x_aL − x_jL)^2 + (y_aL − y_jL)^2 + (x_aR − x_jR)^2 + (y_aR − y_jR)^2 )

where l_aj is the distance between the a-th sample and the centroid μ_j of class j, μ_j is the centroid of class j, (x_jL, y_jL) are the left-eye position coordinates of μ_j, and (x_jR, y_jR) are its right-eye position coordinates.

Suppose the input sample position set is S = {x^(1), x^(2), …, x^(a), …, x^(m)}, x^(a) ∈ R^n, where m is the number of samples and R^n is the n-dimensional real space. The algorithm steps are as follows:

(1) randomly select k cluster centroids μ_1, μ_2, …, μ_k ∈ R^n, where μ_k is the centroid of class k;

(2) repeat the following process until convergence {

for each sample x^(a), compute the class c^(a) it belongs to (c^(a) is the class of the a-th sample, j is the class index):

c^(a) = argmin_j || x^(a) − μ_j ||^2

for each class j, recompute the centroid of the class:

μ_j = Σ_{a=1}^{m} 1{c^(a) = j} · x^(a) / Σ_{a=1}^{m} 1{c^(a) = j}

}
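For illustration, the following is a minimal Python sketch of this key frame selection, assuming scikit-learn's KMeans, eye coordinates stored as an (m, 4) array, K = 5 clusters and 10 randomly chosen frames per cluster as stated above; the variable names and the use of scikit-learn are assumptions of this sketch, not the patent's code.

import numpy as np
from sklearn.cluster import KMeans

def select_keyframes(eye_coords, frame_ids, k=5, frames_per_cluster=10, seed=0):
    """eye_coords: (m, 4) array of (xL, yL, xR, yR) per detected face frame."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(eye_coords)
    keyframes = []
    for j in range(k):                                   # one cluster per rough head pose
        members = np.where(labels == j)[0]
        take = min(frames_per_cluster, len(members))
        chosen = rng.choice(members, size=take, replace=False)
        keyframes.extend(frame_ids[i] for i in chosen)   # 10 random key frames from each cluster
    return keyframes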
Step 3, the i-th low-quality video key frame, i.e. the low-resolution face image I_i^LR, is input into the improved SICNN reconstruction network CNN_H to extract features and reconstruct the super-resolution face image I_i^SR, and the super-resolution reconstruction loss L_SR is obtained against the corresponding high-resolution face image I_i^HR. The CNN_H network consists of DB (Dense Block), convolution, DB, deconvolution, convolution, DB and convolution connected in sequence; CNN_H uses the DB blocks to extract semantic features, deconvolution to enlarge the resolution of the input features, and convolution to perform mapping and reconstruction. Since the resolution of the low-quality video key frames used here is 32 × 40, a magnification of only 4 times is needed to meet the recognition requirement, so the number of deconvolution layers in the improved CNN_H is reduced from 3 to 2, changing the original 8× magnification to 4×.
3.1 DB (Dense Block)

The CNN_H network uses DB blocks to extract semantic features. To address the gradient vanishing problem, a DB block borrows the idea of ResNet and, while guaranteeing maximum information flow between layers in the network, connects all layers directly: simply put, the input of each layer comes from the outputs of all preceding layers.

The number of output feature maps of each convolutional layer in a DB is small (fewer than 100), rather than hundreds or thousands as in very wide networks. At the same time, this connection pattern makes the propagation of features and gradients more effective and the network easier to train. Gradient vanishing tends to occur in deeper networks because the input and gradient information must pass through many layers; with dense connections every layer is in effect directly connected to the input and to the loss, which reduces gradient vanishing and makes deeper networks unproblematic.

Each DB block in the invention is composed of 6 identical substructures (DenseLayers); each substructure contains one 1 × 1 convolutional layer and one 3 × 3 convolutional layer, the 1 × 1 convolution serving as a bottleneck layer that reduces the number of input feature maps, i.e. performs dimensionality reduction. Each convolutional layer follows the Batch Normalization + ReLU + Conv pattern; the growth_rate of the DB block is 32 and bn_size is 4.

In the original SICNN, a face image of size 32 × 40 is input and the reconstructed image resolution is enlarged to 8 times the original, i.e. 256 × 320.
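A hedged PyTorch sketch of one DB block as described above (6 identical DenseLayers, a 1 × 1 bottleneck followed by a 3 × 3 convolution, BN + ReLU before each convolution, growth_rate = 32, bn_size = 4) might look as follows; channel counts and coding style are assumptions, since the patent publishes no source code.

import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    def __init__(self, in_ch, growth_rate=32, bn_size=4):
        super().__init__()
        self.bottleneck = nn.Sequential(                 # 1x1 bottleneck reduces the input feature maps
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, bn_size * growth_rate, kernel_size=1, bias=False))
        self.conv3x3 = nn.Sequential(                    # 3x3 convolution produces growth_rate new maps
            nn.BatchNorm2d(bn_size * growth_rate), nn.ReLU(inplace=True),
            nn.Conv2d(bn_size * growth_rate, growth_rate, kernel_size=3, padding=1, bias=False))

    def forward(self, x):
        out = self.conv3x3(self.bottleneck(x))
        return torch.cat([x, out], dim=1)                # dense connection: concatenate with all previous features

class DenseBlock(nn.Module):
    def __init__(self, in_ch, num_layers=6, growth_rate=32, bn_size=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [DenseLayer(in_ch + i * growth_rate, growth_rate, bn_size) for i in range(num_layers)])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x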
3.2 Loss function

In the reconstruction network CNN_H, the pixel-wise Euclidean distance between the SR image reconstructed from the LR image and the high-resolution HR image is defined as the super-resolution loss L_SR. For the i-th low-resolution face image, the super-resolution loss after reconstruction by CNN_H is:

L_SR = || I_i^SR − I_i^HR ||_2^2

where I_i^LR and I_i^HR are the i-th LR and HR face images in the training set respectively, and I_i^SR denotes the output reconstructed from I_i^LR, which can also be written as CNN_H(I_i^LR).
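As a rough illustration, the following minimal sketch (an assumption of this edit, not code from the patent) computes L_SR as the squared Euclidean pixel distance between batched SR and HR tensors; the batch averaging is an assumption.

import torch

def super_resolution_loss(sr, hr):
    """sr = CNN_H(lr); both tensors of shape (N, C, H, W)."""
    return ((sr - hr) ** 2).flatten(1).sum(dim=1).mean()   # squared L2 per image, averaged over the batch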
Step 4, the super-resolution image I_i^SR reconstructed from the i-th low-resolution face image in step 3 is input into the improved recognition network CNN_R to extract depth features; the features are mapped to a hypersphere space for classification and recognition, giving the recognition loss L_FR and the super-identity loss L_SI. CNN_R is improved from the ResNet-like CNN_R in the SICNN model: it contains 36 convolutional layers, with 6 convolutional layers and 5 residual layers connected alternately, in the order convolutional layer 1a, convolutional layer 1b, residual layer 1, convolutional layer 2, residual layer 2, convolutional layer 3, residual layer 3, convolutional layer 4, residual layer 4, convolutional layer 5, residual layer 5; each residual layer contains one convolutional layer and several residual blocks, and the numbers of residual blocks in the 5 residual layers are 1, 2, 4, 6 and 2 respectively. CNN_R focuses on strengthening face identity features so that the reconstructed SR image is closer to the original HR image; its loss function is A-Softmax, which introduces an angular classification margin and is equivalent to learning features on a hypersphere, making the learned face features more discriminative.
4.1 Recognition network CNN_R

CNN_R is a ResNet-like network containing 36 convolutional layers, with 6 convolutional layers and 5 residual layers connected alternately, in the order convolutional layer 1a, convolutional layer 1b, residual layer 1, convolutional layer 2, residual layer 2, convolutional layer 3, residual layer 3, convolutional layer 4, residual layer 4, convolutional layer 5, residual layer 5; each residual layer contains one convolutional layer and several residual blocks, and the numbers of residual blocks in the 5 residual layers are 1, 2, 4, 6 and 2 respectively. The network structure of CNN_R is as follows:

[Table: CNN_R layer-by-layer configuration, given as a figure in the original specification]
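Since the layer table itself is not reproducible from the text, the following is a hedged PyTorch skeleton of a CNN_R-style network with the stated layer ordering and residual-block counts (1, 2, 4, 6, 2), giving 6 + 2·(1+2+4+6+2) = 36 convolutional layers; the channel widths, strides, PReLU activations and feature dimension are assumptions of this sketch.

import torch.nn as nn

def conv(in_ch, out_ch, stride=1):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride, 1), nn.PReLU(out_ch))

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(conv(ch, ch), conv(ch, ch))   # two 3x3 convolutions per residual block
    def forward(self, x):
        return x + self.body(x)

def cnn_r(blocks=(1, 2, 4, 6, 2), widths=(64, 64, 128, 256, 512, 512), feat_dim=512):
    layers = [conv(3, widths[0], stride=2),                     # convolutional layer 1a
              conv(widths[0], widths[1])]                       # convolutional layer 1b
    in_ch = widths[1]
    for stage, n_blocks in enumerate(blocks):
        layers += [ResidualBlock(in_ch) for _ in range(n_blocks)]   # residual layer (stage + 1)
        if stage + 1 < len(blocks):                             # convolutional layers 2..5 between residual layers
            out_ch = widths[stage + 2]
            layers.append(conv(in_ch, out_ch, stride=2))
            in_ch = out_ch
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(in_ch, feat_dim)]
    return nn.Sequential(*layers)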
4.2 Loss function of the recognition network CNN_R

CNN_R uses the A-Softmax loss as its loss function. This loss introduces an angular classification margin and is equivalent to learning features on a hypersphere, which makes the learned face features more discriminative. CNN_R expresses this loss function as the recognition loss L_FR. For a super-resolution face image input I_i^SR belonging to the y_i-th identity, the recognition loss L_FR(I_i^SR) is:

L_FR(I_i^SR) = −log( e^{||f_i|| · ψ(θ_{y_i,i})} / ( e^{||f_i|| · ψ(θ_{y_i,i})} + Σ_{j≠y_i} e^{||f_i|| · cos θ_{j,i}} ) )

where e is the base of the natural logarithm; f_i is the identity feature extracted by CNN_R from the input face image I_i^SR; θ_{y_i,i} and θ_{j,i} are the learned angles between f_i and the weight vectors of identities y_i and y_j respectively; and ψ(·) is a generalized monotonically decreasing function of the angle, derived as:

ψ(θ) = (−1)^c · cos(b·θ) − 2c,   θ ∈ [cπ/b, (c+1)π/b]

where b is the hyper-parameter of the angular margin constraint, b ≥ 1; c is a non-negative integer with c ∈ [0, b−1]; preferably b = 4 and c ∈ [0, 3].
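A small numeric sketch of the piecewise angular function ψ and of an A-Softmax-style recognition loss for a single sample follows; it assumes the standard A-Softmax form (ψ applied to the true identity, cos to the other identities) and is illustrative only, not the patent's implementation.

import math

def psi(theta, b=4):
    """psi(theta) = (-1)^c * cos(b * theta) - 2c for theta in [c*pi/b, (c+1)*pi/b]."""
    c = min(int(theta * b / math.pi), b - 1)          # which angular interval theta falls into
    return (-1) ** c * math.cos(b * theta) - 2 * c

def recognition_loss_single(feat_norm, thetas, y, b=4):
    """thetas[j]: angle between the extracted feature and the weight vector of identity j; y: true identity."""
    logits = [feat_norm * psi(t, b) if j == y else feat_norm * math.cos(t)
              for j, t in enumerate(thetas)]
    log_denom = math.log(sum(math.exp(z) for z in logits))
    return -(logits[y] - log_denom)                   # negative log softmax probability of the true identity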
Step 5, the networks are trained with an alternating training strategy: the recognition loss L_FR obtained in step 4 is used to train the recognition network, and the weighted super-identity loss L_SI obtained in step 4 together with the super-resolution reconstruction loss L_SR obtained in step 3 are used to jointly train the reconstruction network until convergence. The super-identity loss L_SI is a perceptual-style loss computed as a normalized Euclidean distance, which relates the loss directly to identity in the hypersphere space. For an input LR image I_i^LR, the super-identity loss in the hypersphere identity metric space is:

L_SI = || ĩ(I_i^SR) − ĩ(I_i^HR) ||_2^2

where ĩ(I_i^SR) is the identity representation of I_i^SR projected onto the unit hypersphere, ĩ(I_i^HR) is the identity representation of I_i^HR projected onto the unit hypersphere, i.e. ĩ(·) = f(·)/||f(·)||_2, and f(I_i^SR) and f(I_i^HR) are the identity features extracted by CNN_R from I_i^SR and I_i^HR respectively.
5.1 Training strategy

Input: a recognition model CNN_R trained on high-resolution HR face images; a face hallucination (reconstruction) model CNN_H pre-trained with the super-resolution loss L_SR; mini-batch size N; I_i^LR and I_i^HR, the i-th low-resolution and high-resolution face images.

Output: SICNN.

1 while not converged:
2   select a mini-batch of N image pairs {I_i^LR, I_i^HR};
3   for {I_i^LR}, generate a mini-batch of N images {I_i^SR}, where I_i^SR = CNN_H(I_i^LR);
4   use the average recognition loss L_FR over the N image pairs to update the recognition model CNN_R by a gradient step (η is the learning rate):
      CNN_R ← CNN_R − η · ∇( (1/N) Σ_i L_FR(I_i^SR) )
5   use the average super-resolution loss L_SR and the average super-identity loss L_SI over the N image pairs to update the reconstruction model CNN_H (α is the weight of the super-identity loss L_SI, equal to 8):
      CNN_H ← CNN_H − η · ∇( (1/N) Σ_i ( L_SR + α · L_SI ) )
6 end
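The listing above can be sketched as the following PyTorch-style training loop; the Adam optimizers, the batched recognition_loss helper and the convergence criterion (a fixed epoch count) are assumptions of this sketch, not the patent's code.

import torch
import torch.nn.functional as F

def train_sicnn(cnn_h, cnn_r, loader, recognition_loss, alpha=8.0, lr=1e-4, epochs=10):
    opt_r = torch.optim.Adam(cnn_r.parameters(), lr=lr)
    opt_h = torch.optim.Adam(cnn_h.parameters(), lr=lr)
    for _ in range(epochs):                                   # stands in for "while not converged"
        for lr_img, hr_img, label in loader:
            sr_img = cnn_h(lr_img)                            # I_SR = CNN_H(I_LR)

            # line 4 of the listing: update CNN_R with the average recognition loss L_FR
            opt_r.zero_grad()
            l_fr = recognition_loss(cnn_r(sr_img.detach()), label)
            l_fr.backward()
            opt_r.step()

            # line 5 of the listing: update CNN_H with L_SR + alpha * L_SI (alpha = 8)
            opt_h.zero_grad()
            l_sr = ((sr_img - hr_img) ** 2).flatten(1).sum(1).mean()
            f_sr = F.normalize(cnn_r(sr_img), dim=1)          # identity features on the unit hypersphere
            f_hr = F.normalize(cnn_r(hr_img), dim=1).detach()
            l_si = ((f_sr - f_hr) ** 2).sum(1).mean()
            (l_sr + alpha * l_si).backward()
            opt_h.step()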
Step 6, the video image frames reconstructed by the SICNN in step 3 are input into the Inception ResNet v2 network, features are extracted with small convolutions, a softmax classifier is used, and the network is trained with the improved CenterLoss as the loss function. Unlike the convolutional and pooling layers of traditional networks, the Inception ResNet v2 network runs 1 × 1 convolution, 3 × 3 convolution and 3 × 3 pooling in parallel within the same layer. The network is a sequential connection of Stem, 5 × Inception-ResNet-A, Reduction-A, 10 × Inception-ResNet-B, Reduction-B, 5 × Inception-ResNet-C, Average Pooling, Dropout and Softmax layers. It performs dimensionality reduction with 1 × 1 convolutions in each layer, replaces 5 × 5 convolutions with two 3 × 3 convolutions, and decomposes the 7 × 7 and 3 × 3 convolutions in the Stem and Reduction A/B/C layers into two one-dimensional convolutions (1 × 7, 7 × 1) and (1 × 3, 3 × 1), which reduces the number of parameters, speeds up computation and allows a deeper network. The network also replaces the sequential convolution and pooling before the Inception-ResNet structure with a Stem module to obtain a deeper network. Training uses the softmax cross-entropy loss function and the improved center loss function CenterLoss.
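For illustration, the following minimal PyTorch sketch shows the two convolution factorizations mentioned above; the channel count of 64 is an arbitrary example value, not the patent's configuration.

import torch.nn as nn

# a 7x7 convolution decomposed into a 1x7 followed by a 7x1 convolution
conv7x7_factorized = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=(1, 7), padding=(0, 3)),
    nn.Conv2d(64, 64, kernel_size=(7, 1), padding=(3, 0)))   # same receptive field, fewer parameters

# a 5x5 receptive field obtained with two stacked 3x3 convolutions
conv5x5_as_two_3x3 = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.Conv2d(64, 64, kernel_size=3, padding=1))              # 2 * 9 weights per channel pair instead of 25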
Suppose the feature extracted by the Inception ResNet v2 network from the super-resolution face image I_i^SR reconstructed from the i-th low-resolution face image is denoted I_i, its true class is y_i, and the center of each class is denoted c_{y_i}. The center loss function used to increase intra-class cohesion is defined as L_center:

L_center = (1/2) Σ_{i=1}^{N} || I_i − c_{y_i} ||_2^2
Considering the characteristics of video recognition, the feature H_i extracted by the Inception ResNet v2 network from the high-resolution face image I_i^HR corresponding to I_i is taken as the center of the true class y_i; that is, in the improved L_center the center c_{y_i} is H_i, and this center remains unchanged during training. The improved L_center is:

L_center = (1/2) Σ_{i=1}^{N} || I_i − H_i ||_2^2
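A minimal sketch of the improved CenterLoss, with the fixed HR features H_i used as class centers, might look as follows; the batch averaging is an assumption of this sketch.

import torch

def improved_center_loss(sr_features, hr_features):
    """sr_features: I_i from the SR frames; hr_features: H_i from the matching HR images (kept fixed)."""
    return 0.5 * ((sr_features - hr_features.detach()) ** 2).sum(dim=1).mean()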
and 7, voting the identification result of the image frame to obtain a final video identification result. And the final classification result is obtained by voting the identification results of all the video frames, and the result with the largest number of votes is the final video identification result.
In summary, key frames are first selected by clustering the key point positions that characterize face orientation according to the positions of the face in the video; then a SICNN reconstruction model is established, features are extracted through the reconstruction network and the recognition network, the reconstruction loss and recognition loss are obtained respectively so as to define the identity loss, and the reconstruction network is trained with the alternating training strategy, yielding frame images with high resolution and richer identity features; finally, the reconstructed frame images are input into the recognition network Inception ResNet v2, depth features are extracted for classification and recognition, and the recognition results of all image frames are voted on to obtain the video recognition result. Compared with classical point-to-set video recognition methods and common super-resolution reconstruction methods, this super-resolution-based method, when applied to low-quality video face recognition, effectively improves the accuracy of low-quality video face recognition.
The above description covers only the preferred embodiments of the present invention. It should be noted that various modifications and adaptations can be made by those skilled in the art without departing from the principles of the invention, and these are also intended to fall within the scope of the invention.

Claims (9)

1. A low-quality video face recognition method based on SICNN is characterized by comprising the following steps:
step 1, data preprocessing: splitting the low-quality video data in a data set into image frames, detecting faces and cropping them into 32 × 40 px face images, and dividing the image set into a training set and a test set at a ratio of 7:3;
step 2, key frame selection: taking the key point positions of the low-quality video face image frames in the data set as face features, and selecting key frames using a K-means clustering algorithm and a random algorithm;
step 3, inputting the low-quality video key frame image I^LR into the improved SICNN reconstruction network CNN_H to extract features and reconstruct a super-resolution image I^SR, and obtaining the super-resolution reconstruction loss L_SR against the corresponding high-resolution face image I^HR;
Step 4, reconstructing the super-resolution image in the step 3
Figure FDA0002842198700000014
Input-improved recognition network CNNRExtracting depth features, mapping the features to a hypersphere space for classification and identification to obtain an identification loss LFRAnd super identity loss LSI
Step 5, training the network by using an alternative training strategy, and using the recognition loss L obtained in the step 4FRTraining the recognition network to obtain a weighted hyperidentity loss L using step 4SIAnd the super-resolution reconstruction loss L obtained in the step 3SRCo-training the reconstruction network until convergence;
step 6, inputting the video image frames reconstructed by the SICNN in step 3 into the Inception ResNet v2 network, extracting features with small convolutions, and training the network with a softmax classifier and an improved CenterLoss as the loss function, where the improved CenterLoss, instead of computing the class centers as in the original CenterLoss, directly takes the features of the corresponding high-resolution face images as the centers;
step 7, voting on the recognition results of the image frames to obtain the final video recognition result.
2. The SICNN-based low-quality video face recognition method of claim 1, wherein: the data set used in step 1 is the COX data set; ten splits of the COX data set are used as training and test samples, and the result is the average over ten experiments.
3. The SICNN-based low-quality video face recognition method of claim 1, wherein: the K value of the K-means algorithm in step 2 is 5, corresponding to 5 different face poses: left profile, left-deflected face, frontal face, right-deflected face and right profile; 10 key frames are selected from each cluster with a random algorithm.
4. The SICNN-based low-quality video face recognition method of claim 1, wherein: the CNN_H network in step 3 consists of DB, convolution, DB, deconvolution, convolution, DB and convolution connected in sequence; the CNN_H network uses the DB blocks to extract semantic features, deconvolution to enlarge the resolution of the input features, and convolution to perform mapping and reconstruction.
5. The SICNN-based low-quality video face recognition method of claim 4, wherein: each DB block is a sequential connection of 6 identical DenseLayer structures; each DenseLayer structure comprises a 1 × 1 and a 3 × 3 convolutional layer connected in sequence, the 1 × 1 convolutional layer being a bottleneck layer; each convolutional layer follows the Batch Normalization + ReLU + Conv pattern, the growth_rate of the DB block is 32, and bn_size is 4.
6. The SICNN-based low-quality video face recognition method of claim 1, wherein: in step 4, CNN_R in the SICNN model contains 36 convolutional layers, with 6 convolutional layers and 5 residual layers connected alternately, in the order convolutional layer 1a, convolutional layer 1b, residual layer 1, convolutional layer 2, residual layer 2, convolutional layer 3, residual layer 3, convolutional layer 4, residual layer 4, convolutional layer 5, residual layer 5; each residual layer contains one convolutional layer and several residual blocks, and the numbers of residual blocks in the 5 residual layers are 1, 2, 4, 6 and 2 respectively; the loss function uses A-Softmax.
7. The SICNN-based low-quality video face recognition method of claim 1, wherein: the super-identity loss L_SI in step 5 is a perceptual-style loss computed as a normalized Euclidean distance.
8. The SICNN-based low-quality video face recognition method of claim 1, wherein: the Inception ResNet v2 network in step 6 is a sequential connection of Stem, 5 × Inception-ResNet-A, Reduction-A, 10 × Inception-ResNet-B, Reduction-B, 5 × Inception-ResNet-C, Average Pooling, Dropout and Softmax layers; the Inception ResNet v2 network uses 1 × 1 convolutions in each layer, replaces 5 × 5 convolutions with two 3 × 3 convolutions, and decomposes the 7 × 7 and 3 × 3 convolutions in the Stem and Reduction A/B/C layers into two one-dimensional convolutions (1 × 7, 7 × 1) and (1 × 3, 3 × 1); the Inception ResNet v2 network replaces the sequential convolution and pooling before the Inception-ResNet structure with a Stem module; training uses the softmax cross-entropy loss function and the improved center loss function CenterLoss.
9. The SICNN-based low-quality video face recognition method of claim 1, wherein: the final classification result in step 7 is obtained by voting, and the result with the most votes is the final video recognition result.
CN202011496030.8A 2020-12-17 2020-12-17 SICNN-based low-quality video face recognition method Pending CN112580502A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011496030.8A CN112580502A (en) 2020-12-17 2020-12-17 SICNN-based low-quality video face recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011496030.8A CN112580502A (en) 2020-12-17 2020-12-17 SICNN-based low-quality video face recognition method

Publications (1)

Publication Number Publication Date
CN112580502A true CN112580502A (en) 2021-03-30

Family

ID=75135971

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011496030.8A Pending CN112580502A (en) 2020-12-17 2020-12-17 SICNN-based low-quality video face recognition method

Country Status (1)

Country Link
CN (1) CN112580502A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113375676A (en) * 2021-05-26 2021-09-10 南京航空航天大学 Detector landing point positioning method based on impulse neural network
CN113375676B (en) * 2021-05-26 2024-02-20 南京航空航天大学 Detector landing site positioning method based on impulse neural network
CN114612990A (en) * 2022-03-22 2022-06-10 河海大学 Unmanned aerial vehicle face recognition method based on super-resolution
CN115205768A (en) * 2022-09-16 2022-10-18 山东百盟信息技术有限公司 Video classification method based on resolution self-adaptive network


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination