CN110610129A

CN110610129A - Deep learning face recognition system and method based on self-attention mechanism

Info

Publication number: CN110610129A
Application number: CN201910719368.6A
Authority: CN
Inventors: 凌贺飞; 邬继阳; 李平
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2019-08-05
Filing date: 2019-08-05
Publication date: 2019-12-24

Abstract

The invention discloses a system and a method for deeply learning face recognition based on a self-attention mechanism, and belongs to the field of computer vision and pattern recognition. The invention constructs a channel self-attention module, performs dimension conversion transposition on three-dimensional data of a characteristic diagram, learns a cross-correlation relationship matrix among channels to express a relative relationship among different channels, obtains the characteristics after channel optimization through calculation with the original characteristics, and performs different weight assignment on different channels, thereby realizing the selection of channel filtration and reducing the redundant information of characteristic channels. A spatial self-attention module is constructed, spatial information of a three-dimensional feature map is modeled, a cross-correlation relation matrix among the spatial positions of the feature map is learned to represent the relative relation among different positions, the feature after spatial position optimization is obtained through calculation with the input feature, different weights are given to different positions of a face feature map, the selection of important feature areas of the face is achieved, and the feature is concentrated in the important areas of the face.

Description

Deep learning face recognition system and method based on self-attention mechanism

Technical Field

The invention belongs to the field of computer vision and pattern recognition, and particularly relates to a deep learning face recognition system and method based on an attention mechanism.

Background

In recent years, with the rapid development of parallel computing processing capability of computers, the technical field of computer vision has advanced greatly under the push of the heat trend of deep learning, and has certain application requirements in various fields. The face recognition is a technology for enabling a computer to automatically recognize the identities of related personnel in monitoring data in a visual algorithm, and is widely applied to various fields such as intelligent security, personnel attendance, community inspection, self-service and the like. For example, the sky-eye monitoring system in the 'safe city smart community' plan in China tracks and catches suspects by using a face recognition technology; in daily life and work, a face recognition technology is often used, such as a face recognition system installed in a campus laboratory and an enterprise office, so that the attendance checking function of related workers can be completed, and meanwhile, the invasion of external personnel can be prevented; in addition, in the field of financial payment, the face recognition technology is fully utilized, for example, a face recognition system is installed on an ATM (automatic teller machine) of a bank to prevent fraudulent card swiping of other people, and face swiping payment adopted in mobile payment and the like are used for further guaranteeing safety. The face recognition technology generally only needs one common camera to complete recognition and authentication operation in an actual deployment environment, has the advantages of dynamic property, no need of cooperation and the like, has more convenient advantages compared with the traditional biological characteristics such as iris recognition, fingerprint recognition and the like, and the application of the face recognition technology is more and more extensive due to the existence of the factors.

Since 2012, the understanding and analysis of face images by computers has been greatly leaped owing to the theoretical development of deep learning and the technological progress of GPU acceleration. The face recognition technology is also applied to the commercial application of high-speed trains with convolutional neural networks under the non-matching condition. Particularly, the current real-time personnel deployment and control system based on the monitoring video can automatically detect, analyze and capture the face image area in the video while analyzing the monitoring video stream, upload the face image area to a background server to complete real-time face deployment and control comparison, and simultaneously alarm abnormal face images, so that a large amount of manpower, material resources and financial resources are saved for the construction of the current 'safe city'.

By means of the analysis capability of Deep Convolutional Neural Network (DCNN) on images, the Deep features based on the Convolutional Neural Network gradually replace the traditional manual features in face recognition. Compared with the traditional manual shallow feature, the depth feature has stronger distinguishing capability and robustness. At present, the face recognition algorithm based on the convolutional neural network mainly realizes the constraint of a feature space by modifying a loss function, such as CosFace, ArcFace and the like, but does not carry out targeted research on the network structure. The method carries out feature extraction through a general classification convolutional neural network, and then carries out constraint of a feature space on a final classification layer, thereby realizing the purposes of increasing the distance between classes and reducing the distance in the classes. The modification aiming at the loss function actually enhances the distinguishing capability of the features to a great extent, but the methods ignore the problems of the convolutional neural network structure in the face recognition feature extraction. The existing convolutional neural network has a single structure, so that the problems of information redundancy and the like exist in forward-propagated feature extraction, the flexibility is weak, and the generalization capability is slightly poor.

Most of the existing face recognition algorithms use a general image classification backbone network, and such networks have two disadvantages in actual face application. Firstly, the feature map extracted by the standard CNN network often has a large number of channels, for example, the number of channels in the later stage of the ResNet network reaches 2048, so that a certain information redundancy is brought to a great extent by the large number of channels, and even a risk of network overfitting may exist. Although there are some regularization approaches such as Dropout that are effective to alleviate this problem, the results are still unsatisfactory; secondly, based on the cognition of the human face to the real world, different parts in the human face image have different importance in the actual recognition, but the mechanism of parameter sharing of convolution kernel in the convolution neural network endows the same weight to all image pixels, and different processing modes cannot be well given to different positions.

Disclosure of Invention

The invention provides a deep learning face recognition method based on a self-attention mechanism, aiming at the defects that overfitting is caused by high channel number of characteristic images in a general convolutional neural network in the prior art and different positions of a face are not distinguished and treated due to a convolutional kernel weight sharing mechanism and the like and improvement requirements. The method aims to learn the cross-correlation information among characteristic diagram channels through a channel self-attention module to obtain the matrix relation among the channels and endow different channels with different importance; then, the spatial self-attention module learns the cross-correlation information between the positions of the feature map to obtain the matrix relation between the positions, and different weights are given to the spatial positions of the feature map to learn the importance of different positions of the human face. The method not only can keep the excellent performance of the original convolution neural network, but also can optimize the characteristics of the face image in the forward transmission process of the neural network, reduce the information redundancy among image channels, concentrate the convolution kernel on the more important position in the face image, improve the face recognition accuracy and enhance the flexibility and generalization capability of the model.

To achieve the above object, according to one aspect of the present invention, there is provided a deep learning face recognition system based on an attention-free mechanism, the system including:

the input module is used for selecting a face picture training set and inputting a face picture to be recognized;

a self-attention based deep learning module with ResNet as a backbone network, comprising a plurality of residual blocks and a plurality of attention modules, said attention modules comprising a channel attention module and/or a spatial attention module, concatenating the channel attention module and/or the spatial attention module at the end of each residual block, the last layer being a fully connected layer, the residual block is used for further extracting the characteristic diagram of the input face picture or the characteristic diagram, the channel attention module is used for learning a cross-correlation relation matrix among characteristic diagram channels in the forward propagation process to obtain a characteristic diagram after channel optimization, the space attention module is used for learning a cross-correlation relation matrix between the space positions of the feature maps in the forward propagation process to obtain the feature maps after the space positions are optimized; the full connection layer is used for converting the finally optimized feature map into features;

the training module is used for training the self-attention-based deep learning module by adopting the face picture training set to obtain a trained self-attention-based deep learning module;

and the face recognition module is used for inputting the face picture to be recognized into the trained self-attention-based deep learning module and outputting a face recognition result.

Specifically, the channel attention module is realized by the following steps:

inputting a feature map F_I∈R^C×H×WRespectively obtaining a characteristic diagram theta (F) through two parallel convolutions_I)∈R^C×H×W、

Characteristic diagram theta (F)_I)、φ(F_I) Respectively carrying out maximum pooling and average pooling in parallel to obtain a characteristic diagram

Characteristic diagram Pool (F)_I)¹Obtaining a characteristic diagram through dimension conversionCharacteristic diagram Pool (F)_I)²Obtaining a characteristic diagram through dimension conversionThen to Pool' (F)_I)²Performing transposition;

obtaining a channel self-attention moment array by a Softmax activation function operated according to rows

Feature map F_IObtaining a characteristic diagram through convolutionCharacteristic diagram ρ (F)_I) Obtaining a characteristic diagram through dimension conversion

To A_CAnd ρ' (F)_I) Matrix multiplication and dimension conversion are carried out, and the result after dimension conversion is summed with F_IAdding bit by bit to obtain the final characteristic diagram with optimized channel dimension of C multiplied by H multiplied by W

Wherein C, H, W represents the channel dimension, height and width of the original characteristic diagram, theta, phi and rho represent the channel convolution operation,it is shown that the bit-by-bit addition operation,representing a matrix multiplication operation and alpha representing a coefficient controlling the proportion of the original features to the channel optimized features.

In particular, the channel self-attention matrix A_cThe deployment is as follows:

wherein,showing the correlation between the ith channel and the jth channel.

Specifically, the spatial attention module is realized by the following steps:

inputting a feature map F_C∈R^C×H×WRespectively obtaining a characteristic diagram through two parallel convolutions

Characteristic diagram theta (F)_C) Obtaining a characteristic diagram through the maximum pooling and the average pooling in parallel connection

Respectively corresponding to the characteristic diagram phi (F)_C) And Pool (F)_C)¹Dimension conversion to obtain a characteristic diagram Andaim at phi' (F)_C) Performing transposition;

for feature map phi' (F)_C)^TAnd Pool' (F)_C)¹Matrix multiplication and Softmax nonlinear activation calculation are carried out to obtain a spatial self-attention moment array

Feature map F_CObtaining a characteristic graph rho (F) through convolution_C)∈R^C×H×WObtaining a characteristic diagram through the maximum pooling and the average pooling which are connected in parallelObtaining a characteristic diagram through dimension conversion

Obtaining a feature map with dimension of C multiplied by H multiplied by W after space position optimization through matrix multiplication and bitwise addition

Wherein C, H, W represents the channel dimension, height and width of the original characteristic diagram, theta, phi and rho represent the channel convolution operation,it is shown that the bit-by-bit addition operation,expressing matrix multiplication operation, beta expressing a coefficient for controlling the proportion of the original characteristic to the space position optimization characteristic, and a variable r serving as a coefficient for channel dimension reduction, wherein the requirement that r is more than 1 andare integers.

Specifically, the calculation formula of the loss function L is as follows:

wherein N and N respectively represent the number of samples of the current batch and the total number of categories, the hyperparameter s represents a scale scaling factor,representing the angle between the current sample feature and the corresponding class weight, θ_jRepresenting the angle between the class weight and the corresponding sample, m₁And m₂The angle interval and the cosine interval are respectively represented as two hyper-parameters of the loss function.

To achieve the above object, according to another aspect of the present invention, there is provided a deep learning face recognition method based on an attention-free mechanism, the method including the steps of:

training the self-attention-based deep learning network by adopting a face picture training set to obtain a trained self-attention-based deep learning network;

inputting the face picture to be recognized into the trained self-attention-based deep learning network, and outputting a face recognition result;

the self-attention-based deep learning network takes ResNet as a backbone network and comprises a plurality of residual blocks and a plurality of attention modules, wherein each attention module comprises a channel attention module and/or a space attention module, the channel attention module and/or the space attention module are connected in series at the tail of each residual block, the last layer is a full connection layer, the residual blocks are used for further extracting feature maps of input face pictures or feature maps, the channel attention module is used for learning a cross-correlation relationship matrix among feature map channels in the forward propagation process to obtain a feature map after channel optimization, and the space attention module is used for learning a cross-correlation relationship matrix among feature map space positions in the forward propagation process to obtain a feature map after space position optimization; the full connection layer is used for converting the finally optimized feature map into features.

Specifically, the channel attention module is realized by the following steps:

wherein,showing the correlation between the ith channel and the jth channel.

Specifically, the spatial attention module is realized by the following steps:

Respectively corresponding to the characteristic diagram phi (F)_C) And Pool (F)_C)¹Dimension conversion to obtain a characteristic diagram Aim at phi' (F)_C) Performing transposition;

Specifically, the calculation formula of the loss function L is as follows:

Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:

1. according to the principle of an attention mechanism, a channel self-attention module is constructed, the module learns a cross-correlation relation matrix among channels by performing operations such as dimension conversion transposition on three-dimensional data of a feature map, the matrix represents the relative relation among different channels, finally, the feature after channel optimization is obtained by calculating the original feature, the cross-correlation relation among different channels is finally learned, different weight assignment is performed on different channels, the selection of channel filtering is realized, and the redundant information of the feature channels is reduced.

2. According to the invention, a spatial self-attention module is constructed according to the principles of an attention mechanism, global feature expression and the like, the spatial self-attention module models spatial information of a three-dimensional feature map, learns a cross-correlation relation matrix among the spatial positions of the feature map, the matrix represents the relative relation among different positions, and finally obtains features after spatial position optimization through calculation with input features, finally learns the cross-correlation relation among different spatial positions, gives different weights to different positions of a face feature map, realizes the selection of important feature regions of a face, distinguishes different parts and processes, and concentrates the features in the most important region of the face.

Drawings

Fig. 1 is an overall framework diagram of a deep learning face recognition system based on an attention-driven mechanism according to an embodiment of the present invention;

FIG. 2 is a block diagram of a channel self-attention module according to an embodiment of the present invention;

FIG. 3 is a block diagram of a spatial self-attention module according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a model combination of channel self-attention and spatial self-attention provided by an embodiment of the present invention;

fig. 5 is an effect diagram of a face recognition method based on a self-attention mechanism according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

As shown in fig. 1, in the deep learning face recognition system based on the Self-Attention mechanism, a Residual Self-Attention Network model (SRANet) is improved based on the standard ResNet. Specifically, the invention adds a serial channel attention module and a serial space attention module at the end of each standard residual block to calculate a channel attention relationship matrix and a space attention relationship matrix, and then obtains the final optimization characteristics in a matrix multiplication mode. In addition, the last average pooling layer in the original ResNet structure is removed, and the average pooling layer is replaced by a full-connection layer with the fixed size of 512 dimensions, so that the final feature extraction is carried out. Compared with average pooled single-channel average, the full-connection layer considers channel and space information at the same time, and is matched with a channel and space attention module, so that the design is more reasonable.

Taking the data in fig. 1 as an example, assume that the original output of a certain residual block in the convolutional neural network is F_ISRANet based on the self-attention mechanism, first F_IInputting the data into a channel self-attention module to calculate a channel attention matrix, and after cross-correlation matrix information among different channels is obtained, F_IPerforming matrix multiplication operation with channel self-attention matrix, and performing bitwise addition operation with original input to obtain channel qualityCharacteristic F after conversion_C(ii) a Similarly, the feature F after the spatial position optimization can be obtained by using the same method_SFinally, F_SThe output of this residual structure is input to the next residual structure, and the ellipses in the figure indicate that there are a plurality of such structures.

The invention divides the face recognition into four stages: the method comprises a face image preprocessing stage, a self-attention model building stage, a loss function calculating stage and a feature extraction and retrieval comparison stage.

Preprocessing stage of face image

The face image preprocessing stage comprises the following steps: selection of a face data set, and preprocessing of the face data. The face data preprocessing is mainly divided into two parts: face detection, key point alignment and image data normalization.

For face detection and face key point alignment, the invention uses a cascaded multi-task convolutional neural network MTCNN commonly used in the industry to predict the face position and the face key point at the same time. In the actual training, 4 coordinate positions and 5 key point positions are predicted, and then the detected original face is cut into 112 × 112 face pictures with fixed sizes through similarity transformation.

For image data normalization, the present invention normalizes the pixel values in the original RGB image to [ -1, 1] by subtracting 127.5 and then dividing by 128. In addition, in training, the normalized training set is horizontally turned over with a probability of 50%, so that the effect of data set expansion is achieved, and the overall system precision is improved.

Self-attention model construction phase

(2.1) selecting backbone network

The invention adopts ResNet-50/ResNet-100 as the backbone network of the self-attention model to train the face recognition model, and in the design of ResNet residual block, the convolution kernel size of 3 x 3 is selected.

(2.2) design and implementation of channel self-attention Module

And a channel self-attention module is added behind each residual block of the backbone network to learn the cross-correlation relationship among the characteristic diagram channels in the forward transmission process of the convolutional neural network.

The structure of the channel self-attention module is shown in fig. 2. Input feature map F_I∈R^C×H×WFirstly, inputting an input feature map into two parallel convolution layers of 1 multiplied by 1, keeping the space scale of the input feature map unchanged, and halving the number of one channel to obtain the feature mapIn order to reduce the burden of matrix calculation, the invention simultaneously increases the maximum pooling and the average pooling which are connected in parallel before the matrix calculation and after the convolution layer, and the pooling kernels of the maximum pooling and the average pooling are the same in size and are both C multiplied by 2. On one hand, the performance stability is kept, and on the other hand, the consumption of video memory is greatly reduced. Through the operation of two pooling layers, the channel self-attention module only retains one-quarter size of spatial data for calculation, so there are:

wherein,

next, two feature maps Pool (F)_I)¹、Pool(F_I)²Dimension conversion and/or transposition operations are performed. Characteristic diagram Pool (F)_I)¹Through dimension conversion, willIs converted intoObtaining a characteristic diagram Pool' (F)_I)¹. Characteristic diagram Pool (F)_I)²Through dimension conversion, willIs converted intoObtaining a characteristic diagram Pool' (F)_I)²Then is transposed into

Finally, obtaining a channel self-attention matrix A by a Softmax activation function operated according to rows_c。

The formula is developed:

wherein,showing the correlation between the ith channel and the jth channel. After the channel notices the completion of the calculation of the moment array, the characteristic F is input in the same way_IObtained by a 1 × 1 convolutionBy dimension conversion to obtain a characteristic diagramFeature map ρ' (F) of the magnitude of_I). Then to A_CAnd ρ' (F)_I) Matrix multiplication and dimension conversion are carried out, and then F is subjected to_IAdding the results after dimension conversion bit by bit to obtain the characteristic F with the dimension of C multiplied by H multiplied by W of the final channel optimization_C。

In all the above formulas, C, H, W represents the channel dimension, height and width of the input feature map, respectively, θ, φ and ρ represent the channel convolution operation,it is shown that the bit-by-bit addition operation,the coefficient alpha for controlling the proportion of the original features and the channel optimization features is a learnable parameter with an initial value of 0, and the purpose of the coefficient alpha is to reduce the difficulty of the neural network when the neural network is just trained.

(2.3) design and implementation of spatial self-attention Module

After the channel self-attention module, a serial spatial self-attention module is followed to learn the relationship between the feature map positions, wherein all parameters are trained by a neural network back propagation technology and are self-adaptively learned.

As shown in fig. 3, a feature map F is input_C∈R^C×H×WThe spatial self-attention module first inputs the feature F_CInputting the data into two parallel 1 × 1 convolutional layers, keeping their spatial scale unchanged, but performing a certain degree of channel dimensionality reduction to obtain a feature map r > 1 andare integers. And then reducing the space dimension of one feature map by adopting two parallel maximum pooling and average pooling (the pooling cores are the same). As shown in FIG. 3, θ (F) is selected_C) Obtaining a characteristic diagram Pool (F)_C)¹。

Then respectively adding phi (F)_C) And Pool (F)_C)¹Dimension conversion toAndobtain a characteristic diagram phi' (F)_C) And Pool' (F)_C)¹. Will phi' (F)_C) Is transposed intoThen, the invention carries out matrix multiplication and Softmax nonlinear activation calculation on the two characteristics to obtain a space self-attention matrix A_S。

Unfolding like the tunnel is self-attentive, one can get:

in the context of this formula, the expression,representing the number of features in the pooled feature space dimension.Is a 2-dimensional matrix representing the relationship between any two spatial locations of the input features, e.g.,denotes phi' (F)_C)^TI th position and Pool' (F)_C)¹Is determined, where Softmax is calculated by row.

Next, spatial self-annotation is computedAfter the relationship matrix of the intention, the input feature F is also given_COne convolution to obtain rho (F)_C)∈R^C×H×WIs converted intoDimension conversion toMatrix Pool' (p (F)_C)). Finally, obtaining the characteristic F with the dimensionality of C multiplied by H multiplied by W after space self-attention structure optimization through matrix multiplication and bitwise addition_S。

In all the above equations, θ, φ, ρ represent convolution operations,it is shown that the bit-by-bit addition operation,and representing matrix multiplication operation, wherein beta is a learnable parameter with an initial value set to be 0, and a variable r is used as a coefficient for reducing the dimension of a channel and is finally set to be 16 through a comparison experiment.

(2.4) feature optimization and feature extraction settings

As shown in fig. 4, in order to fully and comprehensively optimize the three-dimensional feature map, the channel self-attention module and the spatial self-attention module are respectively connected in series behind the ResNet residual block of the backbone network, so as to optimize the feature map in the forward propagation process. The topology of fig. 4 includes dot-multiply and dot-add operations, with arrows directing the direction of input flow to output.

In addition, in the aspect of final feature extraction, the global average pooling layer of the original ResNet is removed, the optimized features are input into a full-connection layer with fixed dimension, the dimension of the full-connection layer is fixedly set to 512, and the full-connection layer with 512 dimensions is replaced by the full-connection layer with 512 dimensions for final feature extraction.

Loss function calculation stage

In order to effectively solve the problem that the conventional loss function can not comprehensively and effectively constrain all samples of the feature space, the invention provides an improved loss function L based on multi-interval constraint.

The formula is established on the basis of weight normalization and feature normalization, namely the invention firstly needs toAfter such constraints, all sample features are distributed on a hypersphere, where x_i∈R^dFeatures of the ith sample, with this sample belonging to the y_iIndividual class, w_j∈R^dJ-th column, b, representing a weight parameter W_jIs the corresponding bias term parameter, N and N respectively represent the number of samples of the current batch and the total number of categories,representing the angle between the current sample feature and the corresponding class weight, m₁And m₂Two hyper-parameters of the loss function, representing the angle interval and the cosine interval, respectively, the hyper-parameter s representing a scale scaling factor for avoiding the disappearance of the gradient, θ_jRepresenting the angle between the class weight and the corresponding sample, | | | | represents a 2-norm operation.

Feature extraction and retrieval comparison stage

After the face image to be recognized is processed by the trained model, a feature vector with fixed dimension is obtained, the vector is used for carrying out real-time comparison with features extracted offline from a library, and whether the face image is a person needing to be retrieved is judged according to cosine similarity obtained through calculation and a set threshold value. In this embodiment, the threshold is usually set in the range of 0.6 to 0.7.

The extraction and retrieval comparison of the characteristics are the stage when real-time face recognition is carried out on line, the given face to be searched is input into the trained model according to the same processing mode, the characteristic vector with the fixed size of 512 dimensions is extracted at the last full-connected layer, the cosine similarity comparison is carried out on the characteristic vector and the characteristics extracted off line in the library, and the cosine similarity calculation formula is as follows:

wherein A is_i、B_iAnd respectively indicating the features of the facial image to be retrieved and the stored facial image features in the search library, taking a plurality of images with the highest similarity and the similarity being greater than a set threshold value as query results, and completing the final facial recognition process, wherein P represents the dimension of a feature vector, and is 512.

Examples

In order to prove that the deep learning face recognition method based on the self-attention mechanism has advantages in performance and adaptability, the method is verified and analyzed through the following experiments:

A. experimental data set

Training set: CASIA-Webface and MS-Celeb-1M. The total number of the CASIA-Webface is 10575, and the total number of the human face images is 49.4 million. The total number of 100K people in the MS-Celeb-1M raw data is 10M face pictures, but the number of wrong samples is more, so that samples after cleaning are adopted in training, and the total number of images is 86876 individual 3.9M.

And (3) test set: LFW, AgeDB-30, CFP-FP, and MegaFace. The LFW, the AgeDB-30 and the CFP-FP test the human face verification accuracy on a small scale, and the MegaFace tests the human face recognition accuracy on a million level and the human face verification accuracy on a millionth false alarm rate.

B. Evaluation criteria

The invention adopts the mainstream evaluation standard of face recognition research at home and abroad, for face verification test, the accuracy is evaluated, and if the tested sample set has K pairs of pictures, wherein L pairs exist in wrong judgment, the accuracy of face verification is as follows:

for the recognition accuracy of MegaFace in the million level, a cumulative matching feature first accuracy rate CMC @1, namely a Rank1 recognition rate, is adopted. For the assumption that the size of a face query set is Q, each image Q to be queried in the face query set is_iQ performs similarity rank matching work, if each query image Q is a query image Q, i is 1, 2_iThe first correctly matched image location is r (q)_i) Then the calculation formula for CMC @ K is:

in the CMC curve, the identification accuracy is higher when K is larger, and in the MegaFace test protocol, the identification result of Rank1 is analyzed, namely CMC @ 1.

C. Results of the experiment

Experiments show that the face verification accuracy of the invention on LFW, AgeDB-3 and CFP-FP reaches 99.83%, 98.67% and 95.86% respectively; in addition, the Rank1 recognition rate on MegaFace in the million level is 98.38%, the verification rate with the false alarm rate of one millionth is 98.45%, and the levels reach the leading level. Meanwhile, the invention compares the existing mainstream scheme on several data sets, and the experimental results are shown in the following table:

TABLE 1 face verification accuracy (%) -of LFW, AgeDB-30 and CFP-FP

TABLE 2 MegaFace test results (%)

From the above two tables, it can be seen that the present invention shows superior performance in the same experimental environment, and in addition, the present invention also performs visualization processing on the face model based on the self-attention mechanism, and as a result, as shown in fig. 5, it can be seen that the face model with the attention module has a clearer face contour, so that the person is easier to recognize, which fully proves that the model based on the self-attention mechanism can effectively perform feature optimization on the forward transmission process of the convolutional neural network, and enhances the distinguishing force and robustness of the face features.

It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A system for deep learning face recognition based on a self-attention mechanism, the system comprising:

2. The face recognition system of claim 1, wherein the channel attention module is implemented by:

channel self-attention is obtained by activating functions of Softmax operated according to rowsMatrix array

3. The face recognition system of claim 2, wherein the channels are from an attention matrix a_cThe deployment is as follows:

wherein,showing the correlation between the ith channel and the jth channel.

4. The face recognition system of claim 1, wherein the spatial attention module is implemented by:

5. A face recognition system as claimed in any one of claims 1 to 4, wherein the loss function L is calculated as follows:

wherein N and N respectively represent the current batchThe number of secondary samples, and the total number of classes, the hyper-parameter s represents a scaling factor,representing the angle between the current sample feature and the corresponding class weight, θ_iRepresenting the angle between the class weight and the corresponding sample, m₁And m₂The angle interval and the cosine interval are respectively represented as two hyper-parameters of the loss function.

6. A deep learning face recognition method based on a self-attention mechanism is characterized by comprising the following steps:

7. The face recognition method of claim 6, wherein the channel attention module is implemented by:

8. The face recognition method of claim 7, wherein the channel is from the attention matrix a_cThe deployment is as follows:

wherein,showing the correlation between the ith channel and the jth channel.

9. The face recognition method of claim 6, wherein the spatial attention module is implemented by:

10. The face recognition method according to any one of claims 6 to 9, wherein the loss function L is calculated as follows: