CN116453192A - Self-attention shielding face recognition method based on blocking - Google Patents
Self-attention shielding face recognition method based on blocking
- Publication number: CN116453192A
- Application number: CN202310430697.5A
- Authority: CN (China)
- Prior art keywords: face, block, attention, vector, self
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/168: Human faces; feature extraction; face representation
- G06V40/171: Human faces; local features and components; facial parts; occluding parts, e.g. glasses; geometrical relationships
- G06V40/172: Human faces; classification, e.g. identification
- G06V10/82: Image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06N3/0464: Computing arrangements based on biological models; neural networks; convolutional networks [CNN, ConvNet]
- G06N3/08: Computing arrangements based on biological models; neural networks; learning methods
- Y02T10/40: Climate change mitigation technologies related to transportation; engine management systems
Abstract
The invention discloses a block-based self-attention shielding face recognition method in the technical field of face recognition. The method first obtains a face image through preprocessing operations such as face detection and feature point positioning, divides the obtained face image into blocks according to the feature points, extracts features from each face block separately, then uses a self-attention module to emphasize the extraction of non-occluded face features, and finally performs classification. The influence of the features at the occluded positions of the face image on the overall face recognition is thereby reduced to a certain extent, and the overall efficiency of the model is improved; the obtained features are more conducive to improving the accuracy of the subsequent face classification, and the robustness of the whole model to occluded face recognition is improved.
Description
Technical Field
The invention relates to the technical field of face recognition, in particular to a self-attention shielding face recognition method based on blocking.
Background
In the first decade of the 21st century, machine learning theory developed rapidly, and researchers proposed face recognition methods based on genetic algorithms, support vector machines, sparse representation and the like. Among these, sparse representation became a research hotspot because of its elegant theory and robustness to occlusion, but its parameters need to be set manually, so the labor cost is high. With the advent of the big data era and the growth of computing power, deep learning came to the fore. Deep learning processes information by simulating the nervous system of the human brain and can better solve the complex problems present in face recognition. Under deep learning models, large-scale data analysis is carried out by computing systems built on graphics processors, and discriminative face features can be learned directly from the original images. In the era of massive face data, face recognition based on deep learning has achieved good results in terms of both speed and accuracy. However, uncertainty factors such as illumination, pose and occlusion often exist in the face recognition process, so problems such as identity verification in access control systems, identification of criminals who cover their faces, and personnel management during epidemics urgently need to be solved.
In recent years, researchers at home and abroad have worked tirelessly on occluded face recognition and have proposed methods that better solve the occlusion problem, for example: methods that effectively utilize the features of the non-occluded face, methods that repair and complete the features of the occluded area, methods based on feature fusion, methods based on generative adversarial networks, and so on.
For effectively utilizing the features of the non-occluded face, the methods include optimizing the loss function, enhancing the features of the non-occluded face area, and so on. In terms of enhancing the features of the non-occluded face area, Wang et al. used an anchor strategy and a data augmentation strategy to construct FAN (Face Attention Network), a face recognition network integrating an attention mechanism. During model training, different attention mechanisms are set on feature maps corresponding to different face sizes at different levels of the feature pyramid, that is, an attention function is added to the anchors of RetinaNet; and in order to learn occluded faces, multi-scale feature extraction, multi-scale anchors and a multi-scale attention mechanism based on semantic segmentation are adopted, which improves the detection of occluded faces. In terms of optimizing the loss function, Center Loss increases the inter-class distance and reduces the intra-class distance through Softmax and the L2 norm, effectively improving the accuracy of the predicted values. Liu et al. proposed the ArcFace method, which maximizes the inter-class margin in angular space. Opitz et al. designed the Grid Loss function to combine the classification effects of local and global information and enhance the robustness of the detection model to occlusion. The method adopts the idea of block processing: the face feature map is divided into several grids, and the loss of each grid is summed with the loss of the whole map as the total loss function, so as to strengthen the feature discrimination of each grid. Grid Loss effectively improves the recognition of occluded faces; the method performs well when training on small samples, is easy to implement, supports real-time detection and is highly stable. However, when large pose changes are involved, model training becomes difficult, stability drops, and the loss function has a relatively large influence.
For methods that repair and complete the features of the occluded area, Ge et al. proposed the locally linear embedding convolutional neural network LLE-CNN (Locally Linear Embedding CNN), which attempts to repair and complete the features of the occluded region using information from regions other than the face. The method uses nearest neighbors drawn from a trained face dictionary and a non-face dictionary to refine the descriptors, completing and recovering the face information lost to occlusion while suppressing the noise in the features.
For the feature-fusion-based methods, Zhu et al. used human body context information to assist face recognition and proposed the contextual multi-scale region convolutional neural network CMS-RCNN (Contextual Multi-Scale Region-based CNN). CMS-RCNN fuses global and local context information, attends to both the features of the face region and the face context, and fuses features from multi-layer feature maps into a long feature vector for subsequent classification. It achieves high recognition accuracy, but weighting and integrating the features of each part is difficult and inefficient, which also affects the stability of the model to some extent; and although the speed can be increased by reducing the number of regions or the resolution of the input image, the gain is small. Zhou Xiaojia et al. proposed a block-based recognition algorithm for occluded faces: face feature points are located with the coarse-to-fine auto-encoder network CFAN (Coarse-to-Fine Auto-encoder Networks); geometric normalization and illumination normalization are performed according to the feature-point localization result, and the face is segmented into eye, nose and mouth blocks; the features of each face block are extracted with an improved lightweight convolutional neural network; a multi-class network combined with the additional information of the input block judges whether each face block is occluded; and the face block features and the occlusion judgments are combined to obtain the features of the occluded face. Feature similarity is measured with the Euclidean distance; for two features generated from two faces, only the features corresponding to blocks judged unoccluded on both sides are used for similarity measurement. That is, when comparing two features, if a segment of one feature is a zero vector, the corresponding segment of the other feature is also zeroed. In the face recognition classification task, the face represented by the most similar features is taken as the classification result.
For methods based on generative adversarial networks, Chen et al. proposed the adversarial occlusion-aware face detector (Adversarial Occlusion-aware Face Detector, AOFD). A large number of occluded-face samples generated by the adversarial network expand the training data set, the occluded region is segmented using context information, and the segmentation mask shields the influence of the occluded region on the face features. AOFD uses a multi-stage object detection framework, so the detection speed is limited; and it requires generating a large number of occlusion samples, which also increases the training time and difficulty of the model. Zhang et al. fully exploited the information surrounding the face to train a GAN and proposed a context-based generative adversarial network (Contextual based Generative Adversarial Network, C-GAN), whose generator consists of an up-sampling subnet and a refinement subnet. The up-sampling subnet converts a low-resolution image into a high-resolution image and outputs it; the discrimination network distinguishes face from non-face and real from fake images; and a regression subnet completes the bounding-box detection of the face. The model is suited to detecting high-resolution images; otherwise low-resolution images must be up-sampled, which increases the model training time. Najibi et al. proposed SSH, which models context information through filters, and a selective refinement network SRN (Selective Refinement Network) was built. The convolutional outputs of the VGG network are divided into three branches, each with similar detection and classification processes, and multi-scale face detection is completed using feature maps of different scales, improving detection performance and precision. However, because the features output by the intermediate layers lack sufficient discriminative power, the added branches require extra training, which increases training difficulty and training time.
Disclosure of Invention
This section is intended to outline some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. Some simplifications or omissions may be made in this section as well as in the description of the invention and in the title of the invention, which may not be used to limit the scope of the invention.
The technical problems to be solved are: 1. how to process a picture containing a human face during preprocessing and obtain a high-quality face frame and the key facial feature points; 2. how to reasonably partition the obtained face image to obtain the features of each block; 3. how to apply the self-attention mechanism and its variants to extract the features of the face blocks.
The invention provides a block-based self-attention shielding face recognition method, which adopts the following technical scheme, comprising the following steps:
s1: preprocessing: MTCNN is adopted for face detection and feature point positioning; its candidate network P-Net rapidly generates a large number of candidate boxes for the face picture pyramid fed into the network and stores them in a target bounding-box list, and the candidate boxes are preliminarily screened using non-maximum suppression and the IoU technique;
s2: face image segmentation: after the face feature points are positioned, the face image is divided into blocks according to the obtained position information of the face key points; each block is segmented with the marked key point as its center;
s3: self-attention mechanism: the features extracted from each face block are trained through the self-attention model, and the face features at the non-occluded positions are extracted with emphasis;
s4: classification: the FaceNet system is adopted; FaceNet trains the neural network using a triplet loss function based on large-margin nearest-neighbor classification, and the output of the network is a 128-dimensional vector.
Optionally, the MTCNN adopts a three-network cascade structure, where the three networks are respectively the candidate network P-Net, the refinement network R-Net and the output network O-Net, with processing inserted between the networks:
Before being fed into the candidate network P-Net, the picture is repeatedly scaled down for as long as the resized picture is still no smaller than the input size of P-Net, so that the face picture pyramid is obtained. The resizing formula is as follows:
resize_picture = original_picture × resize_factor^t
where resize_picture represents the picture obtained after resizing, original_picture represents the original picture, and resize_factor represents the parameter for resizing the picture, which is between 0.7 and 0.8; t represents the number of resizing steps.
Optionally, the steps of initially screening candidate frames using non-maximum suppression and IoU techniques are:
s1.1: obtaining a confidence score list corresponding to the target bounding box list and setting a threshold value;
s1.2: ordering the confidence scores;
s1.3: calculating the area of intersections and union of all the boundary boxes;
s1.4: Calculating the IoU of the bounding box with the highest confidence and each of the other candidate boxes, where IoU is the intersection of the two bounding boxes divided by their union;
s1.5: Deleting the candidate boxes whose IoU is greater than the threshold;
s1.6: adding the corresponding bounding box with the highest confidence score to the final output list, and then deleting the bounding box from the bounding box list;
s1.7: the above process is repeated until the bounding box list is empty.
Alternatively, in S2,
for the left and right eye regions, let the key point of the left or right eye be (x_eye, y_eye); a 64×64 block is segmented with this key point as its center, with corners:
(x_eye - 32, y_eye - 32), (x_eye + 32, y_eye + 32);
for the nose region, let the identified nose key point be (x_nose, y_nose); the block corners are:
(x_nose - 8, y_nose + 32), (x_nose + 8, y_nose - 8);
for the mouth region, let the left and right mouth-corner feature points be (x_lmous, y_lmous) and (x_rmous, y_rmous); the block corners are:
(x_lmous - 8, y_lmous + 8), (x_rmous + 8, y_rmous - 8).
optionally, the self-attention model adopts a Query-Key-Value mode, and the calculation method of the self-attention model is as follows:
(1) Let the input sequence be X = [x_1, …, x_n] ∈ R^(D_x×N) and the output sequence be H = [h_1, …, h_n] ∈ R^(D_v×N);
The obtained face block features are multiplied by three randomly initialized matrices W_q ∈ R^(D_k×D_x), W_k ∈ R^(D_k×D_x) and W_v ∈ R^(D_v×D_x) to obtain the query vectors Q = [q_1, …, q_n], the key vectors K = [k_1, …, k_n] and the value vectors V = [v_1, …, v_n] respectively;
(2) The query vector of the current region is dot-multiplied with each key vector to obtain a score Score;
(3) The obtained Score is divided by √(D_k), and the result is then passed through a Softmax calculation; the calculation formula is:
α = Softmax( (K^T Q) / √(D_k) );
(4) All the Softmax results are multiplied by the value vectors V and summed; the result obtained is the value of Self-Attention at the current node. For each query vector q_n ∈ Q, the calculation formula is:
h_n = V · Softmax( (K^T q_n) / √(D_k) );
where D_x, D_k and D_v are the dimensions of the input, key and value vectors respectively, N is the length of the input sequence, and H is the output sequence.
Optionally, the dot-product formula of the left-eye block query vector and key vector is as follows:
Q_leye × K_leye = Score_lele;
where the magnitude of Score_lele represents the degree of correlation between the current key vector K_leye and the block features to which the query vector Q_leye belongs; the larger the value of Score_lele, the more relevant that key vector is to the block features of Q_leye.
Optionally, in S4, the selected triples include two face images of the same person and a face image of another person, and the triple loss function is as follows:
L = Σ max( ||x_a - x_p||^2 - ||x_a - x_n||^2 + a, 0 ), summed over all selected triplets;
where, in the triplet (x_a, x_p, x_n), x_a and x_p are the vectors of two images belonging to the same face, and x_n is the face image vector of a randomly selected other person; a is a threshold whose function is to ensure that the intra-class distance of the face is smaller than the inter-class distance; when the intra-class distance plus the threshold is smaller than the inter-class distance, the loss is 0, otherwise L > 0 and the network is driven to enlarge the inter-class distance.
In summary, the present invention includes at least one of the following beneficial effects:
1. After image preprocessing, the obtained occluded face image is segmented according to the obtained key facial feature points, so that the features at the non-occluded positions in the face image are effectively retained while the features at the occluded positions are partially discarded. This facilitates the effective extraction and use of the non-occluded face features in the subsequent modules, effectively avoids the computational cost wasted on the features of the occluded positions during model training, reduces to a certain extent the influence of the occluded-position features on the overall face recognition, and improves the overall efficiency of the model.
2. Each obtained face block picture is converted into a vector. Because the blocks differ in size, the converted vectors differ in length, so the feature vectors of the face blocks need to be fused, for example by padding each face block vector with zeros so that all vectors have the same dimension; the non-occluded parts other than the key parts in the face image are then fused with the key parts (a padding sketch is given below). The fused overall face features are then trained and characterized using the self-attention module and its variants, such as the Transformer and Vision Transformer, exploiting the self-attention mechanism's strength in capturing the internal relevance of features and data. With the self-attention module paying more attention to the non-occluded features, the obtained features are more conducive to improving the accuracy of the subsequent face classification, and the robustness of the whole model to occluded face recognition is improved.
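The zero-padding fusion described in point 2 can be sketched as follows; the use of PyTorch tensors and the stacking order are assumptions of this sketch rather than details fixed by the patent.

```python
import torch

def fuse_block_features(blocks):
    """Zero-pad each face-block feature vector to the longest length and stack
    them into one sequence for the self-attention module. `blocks` is a list of
    1-D tensors of differing lengths (one per face block)."""
    max_len = max(b.numel() for b in blocks)
    padded = [torch.cat([b, b.new_zeros(max_len - b.numel())]) for b in blocks]
    return torch.stack(padded)        # shape: (num_blocks, max_len)

# e.g. two eye blocks with 4096-dim features and a shorter nose-block feature
seq = fuse_block_features([torch.randn(4096), torch.randn(4096), torch.randn(640)])
```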
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a block-based self-attention-blocking face recognition method of the present invention;
FIG. 2 is a schematic diagram of a P-Net network architecture according to the present invention;
FIG. 3 is a schematic diagram of an R-Net network architecture of the present invention;
FIG. 4 is a schematic diagram of an O-Net network architecture according to the present invention;
FIG. 5 is a diagram of the calculation of the self-attention model of the present invention;
FIG. 6 is a structural diagram of the Vision Transformer used in the present invention.
Detailed Description
The invention is described in further detail below with reference to fig. 1-6.
Example 1
Referring to fig. 1, the invention discloses a block-based self-attention shielding face recognition method, which mainly comprises the steps of preprocessing, face image blocking, face block feature extraction, the self-attention mechanism, and classification. The main flow is shown in fig. 1. The method first obtains and processes a face image through preprocessing operations such as face detection and feature point positioning, divides the obtained face image according to the feature points, extracts features from each of the resulting face blocks, then uses the self-attention module to emphasize the extraction of non-occluded face features, and finally performs classification.
The method comprises the following specific steps:
s1: Preprocessing: in an actual scene, problems such as dim light, strong light, blurring and noise often exist, so preprocessing is needed to handle the obtained face image. As the first step of the face recognition algorithm, the implementation of the face detection function is very important: the accuracy and speed obtained in this step directly influence the performance of the whole face recognition pipeline. The object of face detection is to search for the face in the image and obtain its position information; the algorithm outputs the coordinates of the rectangle circumscribing the face in the image and may also contain pose information. After the face image is obtained, feature point positioning is performed on it. Feature point positioning automatically locates pre-defined key facial feature points, such as the eye corners, nose tip, mouth corners and facial contour, according to the physiological characteristics of the human face; this step provides the data basis for partitioning the face image into blocks according to the feature points. MTCNN (Multi-task Cascaded Convolutional Networks) is adopted here for face detection and feature point positioning. A minimal detection sketch is given below.
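For illustration only, the following sketch uses the facenet-pytorch package's MTCNN implementation to obtain the face bounding box and the five facial key points described above; the library choice and the placeholder file name are assumptions of this sketch, not part of the patent.

```python
# Hedged sketch of the preprocessing step (face detection + key point positioning)
# using the third-party facenet-pytorch MTCNN; "face.jpg" is a placeholder image.
from PIL import Image
from facenet_pytorch import MTCNN

detector = MTCNN(keep_all=False)  # cascaded P-Net -> R-Net -> O-Net
img = Image.open("face.jpg").convert("RGB")

# boxes: face bounding boxes, probs: confidence scores,
# landmarks: five key points (two eyes, nose, two mouth corners) per face
boxes, probs, landmarks = detector.detect(img, landmarks=True)
```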
MTCNN is a deep cascaded multi-task framework used to solve the problem of face detection and alignment in unconstrained environments with various poses, lighting conditions and occlusions. The network adopts a three-network cascade structure; the three networks are respectively the Proposal Network (P-Net), the Refine Network (R-Net) and the Output Network (O-Net). They are multi-task networks, since each network is trained on three tasks: face classification, bounding box regression and face key point positioning. The three networks are not directly connected, so another process can be interposed between two networks.
Before being fed into the candidate network P-Net, the picture is repeatedly scaled down for as long as the resized picture is still no smaller than the input size of P-Net, so that the face picture pyramid is obtained. The resizing formula is as follows:
resize_picture = original_picture × resize_factor^t
where resize_picture represents the picture obtained after resizing, original_picture represents the original picture, and resize_factor is the parameter controlling the resizing, typically between 0.7 and 0.8. If resize_factor is too large, the time needed to obtain the face image pyramid increases while the resolution differences between the obtained pictures remain small; if it is too small, some small and medium-sized faces are missed. t represents the number of resizing steps. A sketch of the corresponding pyramid scales is given below.
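As an illustrative aid (not part of the patent text), the following sketch computes the pyramid scale factors implied by the formula above; the minimum size of 12 pixels corresponds to the assumed P-Net input, and resize_factor = 0.75 is one value inside the stated 0.7-0.8 range.

```python
def pyramid_scales(original_size, min_size=12, resize_factor=0.75):
    """Scale factors for the face picture pyramid: original_size * resize_factor**t,
    kept only while the resized side is still no smaller than min_size
    (the assumed P-Net input of 12 px)."""
    scales, t = [], 0
    while original_size * resize_factor ** t >= min_size:
        scales.append(resize_factor ** t)
        t += 1
    return scales

print(pyramid_scales(250))  # e.g. [1.0, 0.75, 0.5625, ...]
```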
For P-Net, the candidate network, the main task is to quickly generate a large number of candidate boxes for the face image pyramid fed into the network and to preliminarily screen these candidate boxes using Non-Maximum Suppression (NMS) and the IoU (Intersection over Union) technique. The P-Net network structure is shown in FIG. 2.
The idea of the non-maximum suppression method is to search for local maxima and suppress non-maxima. The target bounding box generated by the P-Net is stored in the target bounding box list, because there may be overlap between candidate bounding boxes, and the optimal target bounding box needs to be found by using non-maximum suppression, so that redundant bounding boxes are eliminated.
IoU, i.e. the intersection of two bounding boxes divided by their union.
Before the non-maximum suppression method is utilized, a confidence score list corresponding to the target bounding box list is acquired and a threshold value is set so as to delete the bounding box with larger overlap.
The process of screening the large number of candidate boxes generated by P-Net using the non-maximum suppression method is as follows (a code sketch follows the list):
s1.1: obtaining a confidence score list corresponding to the target bounding box list and setting a threshold value;
s1.2: ordering the confidence scores;
s1.3: calculating the area of intersections and union of all the boundary boxes;
s1.4: Calculating the IoU of the bounding box with the highest confidence and each of the other candidate boxes, where IoU is the intersection of the two bounding boxes divided by their union;
s1.5: Deleting the candidate boxes whose IoU is greater than the threshold;
s1.6: adding the corresponding bounding box with the highest confidence score to the final output list, and then deleting the bounding box from the bounding box list;
s1.7: the above process is repeated until the bounding box list is empty.
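The following sketch illustrates steps S1.1-S1.7 above with a plain NumPy implementation; the box format (x1, y1, x2, y2) and the threshold value of 0.5 are assumptions of the sketch.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Non-maximum suppression over candidate boxes given as (x1, y1, x2, y2) rows:
    sort by confidence, keep the highest-scoring box, delete candidates whose IoU
    with it exceeds the threshold, and repeat until the list is empty."""
    order = scores.argsort()[::-1]          # S1.2: sort the confidence scores
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)                   # S1.6: move the best box to the output list
        rest = order[1:]
        # intersection of the best box with the remaining candidates
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter)   # S1.4: IoU = intersection / union
        order = rest[iou <= iou_threshold]              # S1.5: delete boxes above threshold
    return keep
```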
For R-Net, as a refinement network, R-Net has more fully connected layers than P-Net, so this stage can screen the face candidate boxes more finely. The R-Net network structure is shown in FIG. 3.
For O-Net, as the output network, it has one more convolutional layer than R-Net, so its processing results are finer; this stage strengthens the supervision of the face region and outputs the coordinates of the five facial key points. The O-Net network structure is shown in FIG. 4.
S2: Face image segmentation: after the face feature points are positioned, the face image is divided into blocks according to the obtained position information of the face key points; each block is segmented with the marked key point as its center;
For example, for the left and right eye regions, let the key point of the left or right eye be (x_eye, y_eye); a 64×64 block is segmented with this key point as its center, with corners:
(x_eye - 32, y_eye - 32), (x_eye + 32, y_eye + 32);
for the nose region, let the identified nose key point be (x_nose, y_nose); the block corners are:
(x_nose - 8, y_nose + 32), (x_nose + 8, y_nose - 8);
for the mouth region, let the left and right mouth-corner feature points be (x_lmous, y_lmous) and (x_rmous, y_rmous); the block corners are:
(x_lmous - 8, y_lmous + 8), (x_rmous + 8, y_rmous - 8).
A cropping sketch based on these block boundaries is given below.
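Purely for illustration, the sketch below crops the eye, nose and mouth blocks from an image array using the boundaries defined above; the key-point names and the assumption of integer pixel coordinates are choices of the sketch.

```python
def crop_blocks(img, kps):
    """Crop face blocks around the detected key points using the block corners
    given above. `img` is an H x W x 3 array; `kps` maps names to integer (x, y)
    key points: left_eye, right_eye, nose, left_mouth, right_mouth."""
    (xle, yle), (xre, yre) = kps["left_eye"], kps["right_eye"]
    xn, yn = kps["nose"]
    (xlm, ylm), (xrm, yrm) = kps["left_mouth"], kps["right_mouth"]
    return {
        "left_eye":  img[yle - 32:yle + 32, xle - 32:xle + 32],   # 64 x 64 block
        "right_eye": img[yre - 32:yre + 32, xre - 32:xre + 32],   # 64 x 64 block
        "nose":      img[yn - 8:yn + 32, xn - 8:xn + 8],          # nose strip
        # mouth strip; both mouth corners are assumed to lie at roughly the same height
        "mouth":     img[ylm - 8:ylm + 8, xlm - 8:xrm + 8],
    }
```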
s3: Self-attention mechanism: the attention mechanism screens out the important information from a large amount of information, focuses on that important information and ignores most of the unimportant information; the greater the weight, the more important the information. The self-attention mechanism is a variant of the attention mechanism that reduces the reliance on external information and is better at capturing the internal dependencies of data or features. Therefore, the self-attention mechanism can be used to extract the features of each face block, and through the training of the self-attention module the face features at the non-occluded positions are extracted with emphasis. The self-attention model often adopts the Query-Key-Value (QKV) mode; its calculation process, shown in fig. 5, is as follows:
(1) Let the input sequence be X = [x_1, …, x_n] ∈ R^(D_x×N) and the output sequence be H = [h_1, …, h_n] ∈ R^(D_v×N).
The obtained face block features are multiplied by three randomly initialized matrices W_q ∈ R^(D_k×D_x), W_k ∈ R^(D_k×D_x) and W_v ∈ R^(D_v×D_x) to obtain the query vectors Q = [q_1, …, q_n], the key vectors K = [k_1, …, k_n] and the value vectors V = [v_1, …, v_n] respectively.
The query vector, key vector and value vector of the left-eye block are calculated as:
Q_leye = W_q · x_leye, K_leye = W_k · x_leye, V_leye = W_v · x_leye;
where x_leye is the feature extracted from the left-eye block, W_q, W_k and W_v are all randomly initialized matrices, and Q_leye, K_leye and V_leye are the query vector, key vector and value vector of the left-eye block features respectively.
(2) The query vector of the current region is dot-multiplied with each key vector to obtain a score Score.
For the left-eye block, the dot product of the query vector and the key vector is:
Q_leye × K_leye = Score_lele;
where the magnitude of Score_lele represents the degree of correlation between the current key vector K_leye and the block features to which the query vector Q_leye belongs; the larger the value of Score_lele, the more relevant that key vector is to the block features of Q_leye.
(3) The obtained Score is divided by √(D_k), and the result is then passed through a Softmax calculation; the calculation formula is:
α = Softmax( (K^T Q) / √(D_k) );
(4) All the Softmax results are multiplied by the value vectors V and summed; the result obtained is the value of Self-Attention at the current node. For each query vector q_n ∈ Q, the calculation formula is:
h_n = V · Softmax( (K^T q_n) / √(D_k) );
where D_x, D_k and D_v are the dimensions of the input, key and value vectors respectively, N is the length of the input sequence, and H is the output sequence.
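As an editorial illustration of the calculation above (not the patent's own code), the sketch below runs scaled dot-product self-attention over a sequence of face-block feature vectors with NumPy; the dimensions and the random initialization are assumptions of the sketch.

```python
import numpy as np

def self_attention(X, Dk=64, Dv=64, seed=0):
    """Scaled dot-product self-attention over face-block features.
    X has shape (Dx, N): one column x_n per face block. W_q, W_k and W_v are the
    randomly initialized projection matrices of step (1)."""
    rng = np.random.default_rng(seed)
    Dx, N = X.shape
    W_q = rng.standard_normal((Dk, Dx))
    W_k = rng.standard_normal((Dk, Dx))
    W_v = rng.standard_normal((Dv, Dx))
    Q, K, V = W_q @ X, W_k @ X, W_v @ X              # query / key / value vectors
    score = (Q.T @ K) / np.sqrt(Dk)                  # steps (2)-(3): dot product, scale
    attn = np.exp(score - score.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)         # Softmax over the key axis
    return V @ attn.T                                # step (4): weighted sum, shape (Dv, N)

# e.g. four face blocks with 128-dimensional features
H = self_attention(np.random.rand(128, 4))
```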
The Vision Transformer, as a variant of the self-attention mechanism, has the structure shown in fig. 6 and contains six modules in total: two Layer Normalization modules, two Residual Connection modules, one MLP (Multi-Layer Perceptron) module and one Multi-Head Attention module.
In the Vision Transformer, each patch is treated analogously to a token in NLP by means of the Transformer model: the picture is converted into vectors by the flatten layer, and the patch embedding and position embedding generated by the embedding layer are then fed into the Vision Transformer module.
The Layer Normalization module normalizes the input data, the Residual Connection module alleviates the vanishing-gradient problem, and the MLP module performs the final classification calculation on the features produced by the preceding modules to obtain the predicted probability of each sample label. A sketch of one such encoder block is given below.
Wherein the Multi-Head Attention module is a variant of the self-Attention mechanism.
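The following PyTorch sketch shows one encoder block built from the six modules listed above (Layer Normalization, Multi-Head Attention, MLP and two residual connections); the embedding dimension and head count are illustrative assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    """One Vision-Transformer-style encoder block:
    LayerNorm -> Multi-Head Attention -> residual connection,
    then LayerNorm -> MLP -> residual connection."""
    def __init__(self, dim=128, heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                                    # x: (batch, num_patches, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]    # residual connection 1
        x = x + self.mlp(self.norm2(x))                       # residual connection 2
        return x

# e.g. a batch of 2 images, 6 face-block "patches", 128-dim embeddings
out = ViTBlock()(torch.randn(2, 6, 128))
```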
S4: Classification: FaceNet is a general system that can be used for face verification, recognition and clustering. The approach adopted by FaceNet is to map an image into a Euclidean space through convolutional neural network learning, in which the spatial distance is directly related to picture similarity: different pictures of the same person are close together in the space, while images of different persons are far apart. Whether two pictures show the same face can therefore be judged from the distance between them.
FaceNet trains the neural network using a triplet loss function based on large-margin nearest-neighbor classification; the output of the network is a 128-dimensional vector. Each selected triplet comprises two face images of the same person and one face image of another person. The triplet loss function is as follows:
L = Σ max( ||x_a - x_p||^2 - ||x_a - x_n||^2 + a, 0 ), summed over all selected triplets;
where, in the triplet (x_a, x_p, x_n), x_a and x_p are the vectors of two images belonging to the same face, and x_n is the face image vector of a randomly selected other person; a is a threshold whose function is to ensure that the intra-class distance of the face is smaller than the inter-class distance; when the intra-class distance plus the threshold is smaller than the inter-class distance, the loss is 0, otherwise L > 0 and the network is driven to enlarge the inter-class distance. A sketch of this loss is given below.
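As an illustrative sketch only (the 0.2 margin is an assumed value, not one given in the patent), the triplet loss above can be written in PyTorch as:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss on L2-normalized 128-d embeddings:
    loss = max(||a - p||^2 - ||a - n||^2 + margin, 0), averaged over the batch."""
    anchor, positive, negative = (F.normalize(t, dim=-1) for t in (anchor, positive, negative))
    d_pos = (anchor - positive).pow(2).sum(dim=-1)   # intra-class distance
    d_neg = (anchor - negative).pow(2).sum(dim=-1)   # inter-class distance
    return F.relu(d_pos - d_neg + margin).mean()

# e.g. a batch of 8 triplets of 128-dimensional embeddings
loss = triplet_loss(torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128))
```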
The above embodiments are not intended to limit the scope of the present invention, so: all equivalent changes in structure, shape and principle of the invention should be covered in the scope of protection of the invention.
Claims (8)
1. The self-attention shielding face recognition method based on blocking, characterized by comprising the following steps:
s1: preprocessing: MTCNN is adopted for face detection and feature point positioning; its candidate network P-Net rapidly generates a large number of candidate boxes for the face picture pyramid fed into the network and stores them in a target bounding-box list, and the candidate boxes are preliminarily screened using non-maximum suppression and the IoU technique;
s2: face image segmentation: after the face feature points are positioned, the face image is divided into blocks according to the obtained position information of the face key points; each block is segmented with the marked key point as its center;
s3: self-attention mechanism: the features extracted from each face block are trained through the self-attention model, and the face features at the non-occluded positions are extracted with emphasis;
s4: classification: the FaceNet system is adopted; FaceNet trains the neural network using a triplet loss function based on large-margin nearest-neighbor classification, and the output of the network is a 128-dimensional vector.
2. The block-based self-attention-blocking face recognition method of claim 1, wherein: the MTCNN adopts a three-network cascade structure, where the three networks are respectively the candidate network P-Net, the refinement network R-Net and the output network O-Net, with processing inserted between the networks:
Before being fed into the candidate network P-Net, the picture is repeatedly scaled down for as long as the resized picture is still no smaller than the input size of P-Net, so that the face picture pyramid is obtained. The resizing formula is as follows:
resize_picture = original_picture × resize_factor^t
where resize_picture represents the picture obtained after resizing, original_picture represents the original picture, and resize_factor represents the parameter for resizing the picture, which is between 0.7 and 0.8; t represents the number of resizing steps.
3. The block-based self-attention-blocking face recognition method of claim 1, wherein: the steps of preliminary screening candidate frames using non-maxima suppression and IoU techniques are:
s1.1: obtaining a confidence score list corresponding to the target bounding box list and setting a threshold value;
s1.2: ordering the confidence scores;
s1.3: calculating the area of intersections and union of all the boundary boxes;
s1.4: calculating the IoU of the bounding box with the highest confidence and each of the other candidate boxes, where IoU is the intersection of the two bounding boxes divided by their union;
s1.5: deleting the candidate boxes whose IoU is greater than the threshold;
s1.6: adding the corresponding bounding box with the highest confidence score to the final output list, and then deleting the bounding box from the bounding box list;
s1.7: the above process is repeated until the bounding box list is empty.
4. The block-based self-attention-blocking face recognition method of claim 1, wherein: in the step S2 of the process,
for the left and right eye regions, let the key point of the left or right eye be (x_eye, y_eye); a 64×64 block is segmented with this key point as its center, with corners:
(x_eye - 32, y_eye - 32), (x_eye + 32, y_eye + 32);
for the nose region, let the identified nose key point be (x_nose, y_nose); the block corners are:
(x_nose - 8, y_nose + 32), (x_nose + 8, y_nose - 8);
for the mouth region, let the left and right mouth-corner feature points be (x_lmous, y_lmous) and (x_rmous, y_rmous); the block corners are:
(x_lmous - 8, y_lmous + 8), (x_rmous + 8, y_rmous - 8).
5. the block-based self-attention-blocking face recognition method of claim 1, wherein: the self-attention model adopts a Query-Key-Value mode, and the calculation method of the self-attention model is as follows:
(1) Let the input sequence be X = [x_1, …, x_n] ∈ R^(D_x×N) and the output sequence be H = [h_1, …, h_n] ∈ R^(D_v×N);
The obtained face block features are multiplied by three randomly initialized matrices W_q ∈ R^(D_k×D_x), W_k ∈ R^(D_k×D_x) and W_v ∈ R^(D_v×D_x) to obtain the query vectors Q = [q_1, …, q_n], the key vectors K = [k_1, …, k_n] and the value vectors V = [v_1, …, v_n] respectively;
(2) The query vector of the current region is dot-multiplied with each key vector to obtain a score Score;
(3) The obtained Score is divided by √(D_k), and the result is then passed through a Softmax calculation; the calculation formula is:
α = Softmax( (K^T Q) / √(D_k) );
(4) All the Softmax results are multiplied by the value vectors V and summed; the result obtained is the value of Self-Attention at the current node. For each query vector q_n ∈ Q, the calculation formula is:
h_n = V · Softmax( (K^T q_n) / √(D_k) );
where D_x, D_k and D_v are the dimensions of the input, key and value vectors respectively, N is the length of the input sequence, and H is the output sequence.
6. The block-based self-attention-blocking face recognition method of claim 5, wherein: the left eye block query vector, key vector and value vector calculation formula is as follows:
Q_leye = W_q · x_leye, K_leye = W_k · x_leye, V_leye = W_v · x_leye;
where x_leye is the feature extracted from the left-eye block, W_q, W_k and W_v are all randomly initialized matrices, and Q_leye, K_leye and V_leye are the query vector, key vector and value vector of the left-eye block features respectively.
7. The block-based self-attention-blocking face recognition method of claim 5, wherein: the dot-product formula of the left-eye block query vector and key vector is as follows:
Q_leye × K_leye = Score_lele;
where the magnitude of Score_lele represents the degree of correlation between the current key vector K_leye and the block features to which the query vector Q_leye belongs; the larger the value of Score_lele, the more relevant that key vector is to the block features of Q_leye.
8. The block-based self-attention-blocking face recognition method of claim 1, wherein: in S4, the selected triples include two face images of the same person and a face image of another person, and the triple loss function is as follows:
L = Σ max( ||x_a - x_p||^2 - ||x_a - x_n||^2 + a, 0 ), summed over all selected triplets;
where, in the triplet (x_a, x_p, x_n), x_a and x_p are the vectors of two images belonging to the same face, and x_n is the face image vector of a randomly selected other person; a is a threshold whose function is to ensure that the intra-class distance of the face is smaller than the inter-class distance; when the intra-class distance plus the threshold is smaller than the inter-class distance, the loss is 0, otherwise L > 0 and the network is driven to enlarge the inter-class distance.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310430697.5A CN116453192A (en) | 2023-04-20 | 2023-04-20 | Self-attention shielding face recognition method based on blocking |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310430697.5A CN116453192A (en) | 2023-04-20 | 2023-04-20 | Self-attention shielding face recognition method based on blocking |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116453192A true CN116453192A (en) | 2023-07-18 |
Family
ID=87125263
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310430697.5A Pending CN116453192A (en) | 2023-04-20 | 2023-04-20 | Self-attention shielding face recognition method based on blocking |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116453192A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118430051A (en) * | 2024-07-03 | 2024-08-02 | 成都中扶蓉通科技有限公司 | High-discernment partial occlusion face recognition method, system and storage medium |
CN118430051B (en) * | 2024-07-03 | 2024-09-03 | 成都中扶蓉通科技有限公司 | High-discernment partial occlusion face recognition method, system and storage medium |
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination