CN111582044A - Face recognition method based on convolutional neural network and attention model - Google Patents

Face recognition method based on convolutional neural network and attention model

Info

Publication number
CN111582044A
CN111582044A (application CN202010295613.8A; granted as CN111582044B)
Authority
CN
China
Prior art keywords
attention
feature
neural network
image
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010295613.8A
Other languages
Chinese (zh)
Other versions
CN111582044B (en)
Inventor
贺前华
杨泽睿
庞文丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202010295613.8A priority Critical patent/CN111582044B/en
Publication of CN111582044A publication Critical patent/CN111582044A/en
Application granted granted Critical
Publication of CN111582044B publication Critical patent/CN111582044B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/12Fingerprints or palmprints
    • G06V40/1365Matching; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/12Fingerprints or palmprints
    • G06V40/1347Preprocessing; Feature extraction
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to face recognition technology, and in particular to a face recognition method based on a convolutional neural network and an attention model, comprising the following steps: preprocessing a face image; inputting the preprocessed image data into a convolutional neural network to extract high-dimensional features; inputting the high-dimensional features into an attention model, computing an attention mask through neural-network training, and obtaining the spatial-domain and channel-domain attention distribution feature M(F_c); inputting the attention distribution feature M(F_c) into a Bottleneck module and obtaining the feature H(M(F_c)) using a Shortcut mechanism; and inputting the feature H(M(F_c)) into a fully convolutional network classification structure, using a Dropout strategy and the softmax function to obtain the final face recognition result. The invention adopts the Attention model and the Bottleneck module to replace the middle and high-level layers of the VGG model, which greatly reduces the number of parameters required by the model and effectively reduces GPU memory and time consumption during training; meanwhile, the network's ability to learn from data is effectively improved, improving the stability and practicality of the system.

Description

Face recognition method based on convolutional neural network and attention model
Technical Field
The invention relates to a face recognition technology, in particular to a face recognition method based on a convolutional neural network and an attention model.
Background
Face recognition refers to a technique of recognizing or verifying one or more faces by extracting face information from a still image or a video sequence and comparing it with information in an existing face database. Compared with biometric identification means such as signatures and fingerprints, information extraction in an image-based identity recognition system is more convenient. Owing to its wide range of applications and commercial value, face recognition is one of the most active research directions in computer vision and pattern recognition. In recent years, face recognition networks based on deep learning have achieved remarkable results. Meanwhile, attention models are widely used across many fields of deep learning. The visual attention mechanism mimics the signal-processing mechanism characteristic of human vision: by quickly scanning the global image, a target region that deserves focused attention is located, and more attention resources are then devoted to that region to obtain more detailed information about the target while suppressing other, useless information. This greatly improves the efficiency and accuracy of visual information processing.
In recent years, research combining deep learning with visual attention mechanisms has mostly focused on using masks to form the attention mechanism. The principle of a mask is that key features in the image data are identified through an additional layer of new weights; through training, the deep neural network learns the regions that deserve attention in each new image, thereby forming attention. Attention falls into two major classes: soft attention and hard attention. Hard attention is a stochastic prediction process that emphasizes dynamic change; it is non-differentiable, so training generally has to be done by reinforcement learning. The key property of soft attention is that it is differentiable (gradients can be computed), so it can be trained with ordinary neural-network methods. In CNN-based attention models, attention domains fall mainly into two classes: the spatial domain and the channel domain. Spatial-domain attention is distributed over space, meaning different regions of the image receive different degrees of attention. Mathematically, for a feature map of size C × W × H, spatial attention corresponds to a matrix of size W × H, where each position is a weight for the pixel at the corresponding position of the original feature map. This method integrates and compresses the multi-channel values of each pixel and differentiates between pixels when computing the attention weights, yielding the spatial attention of the features. Channel-domain attention is distributed over the channels, meaning different image channels receive different degrees of attention.
Mathematically, for a feature map of size C × W × H, channel attention corresponds to a matrix of size C × 1 × 1, where each position is a weight for all pixels of the corresponding channel of the original feature map. This method integrates and compresses the pixel values of each channel and differentiates between channels when computing the attention weights, yielding the channel attention of the features.
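Both attention layouts described above reduce to simple broadcasting over a C × W × H feature map. The following NumPy sketch is purely illustrative (the random weights stand in for learned attention masks):

```python
import numpy as np

rng = np.random.default_rng(0)
C, W, H = 4, 8, 8
F = rng.random((C, W, H))        # feature map of size C x W x H

spatial = rng.random((W, H))     # W x H matrix: one weight per pixel position
channel = rng.random((C, 1, 1))  # C x 1 x 1 matrix: one weight per channel

F_spatial = F * spatial  # each pixel weighted identically across all channels
F_channel = F * channel  # each channel weighted identically across all pixels
print(F_spatial.shape, F_channel.shape)  # (4, 8, 8) (4, 8, 8)
```

NumPy's broadcasting aligns trailing dimensions, so the W × H mask multiplies every channel and the C × 1 × 1 mask multiplies every pixel of its channel, exactly matching the two weight layouts in the text.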
Existing face recognition methods based on deep learning and attention models use training samples to learn face features automatically and can extract excellent, discriminative face features. However, an effective neural network often requires a large number of parameters, and its consumption of computing resources during training and testing is high, which also makes network migration, porting, and embedded design difficult. A design solution for a miniaturized network is therefore needed.
Disclosure of Invention
The invention provides a face recognition method based on a convolutional neural network and an attention model, aiming to solve the problem that existing face recognition neural networks generally consume large amounts of computing resources, thereby reducing the hardware requirements of the network model while improving computation speed. By using the attention model and the Bottleneck structure, on the basis of further reducing network parameters, the shortcoming that shallow networks struggle to extract features rich in semantic information is overcome; the recognition task is completed by combining semantic information at each scale, yielding a high-accuracy recognition result.
The invention is realized by the following technical scheme: the face recognition method based on the convolutional neural network and the attention model comprises the following steps:
S1, preprocessing the face image;
S2, inputting the preprocessed image data into a convolutional neural network to extract high-dimensional features, obtaining the high-dimensional feature F_c;
S3, inputting the high-dimensional feature F_c into an attention model, computing an attention mask through neural-network training, and obtaining the spatial-domain and channel-domain attention distribution feature M(F_c);
S4, inputting the attention distribution feature M(F_c) into a Bottleneck module and obtaining the feature H(M(F_c)) using a Shortcut mechanism;
S5, inputting the feature H(M(F_c)) into the fully convolutional network classification structure, using a Dropout strategy and the softmax function to obtain the final face recognition result.
According to the above technical scheme, the Attention model and the Bottleneck module are adopted to replace the middle and high-level layers of the VGG model, which greatly reduces the number of parameters required by the model and effectively reduces GPU memory and time consumption during training; meanwhile, the network's ability to learn from data is effectively improved, improving the stability and practicality of the system. Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The overall neural network markedly reduces the number of parameters while keeping VGG's simple structure and good generalization performance, greatly reducing GPU memory and host memory occupation during training; the extremely small network model also benefits embedded design.
2. The attention module is designed based on one-dimensional convolution, making full use of pixel-domain and channel-domain information while further reducing the number of parameters.
3. The Bottleneck module is designed based on one-dimensional convolution and a single Batchnorm layer, which reduces computation time while further reducing the number of parameters, and completes the capture of multi-dimensional information.
4. The method can learn a local data set quickly and efficiently, and still performs well when processing data with illumination differences and posture changes.
Drawings
Fig. 1 is a schematic flowchart of a face recognition method based on a convolutional neural network and an attention model according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a convolutional neural network feature extraction module according to an embodiment of the present invention;
fig. 3 is a schematic diagram of an attention model and a Bottleneck module according to an embodiment of the present invention;
fig. 4 is a schematic application diagram of a face recognition neural network module based on a convolutional neural network and an attention model according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
This embodiment provides a face recognition method based on a convolutional neural network and an attention model, in which an Attention model and a Bottleneck module are adopted to process the output features of a shallow convolutional neural network, establishing high-dimensional semantic features that include pixel-domain and channel-domain information; as shown in fig. 1, the method specifically includes the following steps:
1) preprocessing the face image: the preprocessing comprises converting the RGB face image into a gray image, randomly flipping the image left and right, standardizing the image tensor, and normalizing the image pixel size to a fixed size (256 × 256 in this embodiment);
2) Convolutional Neural Network (CNN) feature extraction: inputting the image data preprocessed in step 1) into a convolutional neural network to extract high-dimensional features, obtaining the high-dimensional feature F_c, where the convolutional neural network structure adopts an improved VGG network design;
3) Attention model: inputting the high-dimensional feature F_c output by step 2) into a soft attention model, computing an attention mask through neural-network training, and obtaining the spatial-domain and channel-domain attention distribution feature M(F_c), where the attention model is based on a CNN design;
4) Bottleneck module: inputting the attention distribution feature M(F_c) output by step 3) into a Bottleneck module and obtaining the feature H(M(F_c)) using a Shortcut mechanism, where the Bottleneck module is based on one-dimensional convolution;
5) Fully Convolutional Network (FCN) classification: inputting the feature H(M(F_c)) output by step 4) into the FCN classification structure to obtain the final face recognition result; the FCN classification structure uses a Dropout strategy and the softmax function.
Further, in the step 1) preprocessing, the RGB image is converted into a gray image, specifically:
after the RGB image is converted into a gray image, the single-channel gray image is expanded into an N-channel gray image; [the image conversion formula is not reproduced in the source text] and the channel expansion formula is as follows:
Gray_i = λ_i · Gray + bias_i
where R, G and B respectively represent the RGB three-channel pixel values of the original image, i denotes the i-th gray channel, λ_i is the weight of the i-th gray channel, bias_i is a bias term, and Gray_i ∈ [0, 255]. In this embodiment, the number of channels N is 3; λ_1, λ_2, λ_3 are 0.37, 0.30 and 0.33, respectively; bias_i is a randomly generated value.
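The channel-expansion step can be sketched as follows. This is an illustrative reconstruction: the standard luma weights are used for the RGB-to-gray conversion because the patent's own conversion formula is not reproduced in the text, and the bias range (±10) is likewise an assumption.

```python
import numpy as np

def rgb_to_n_channel_gray(img, lambdas=(0.37, 0.30, 0.33), rng=None):
    """img: (H, W, 3) array with values in [0, 255]. Returns (N, H, W).

    RGB-to-gray uses standard luma weights as a stand-in (assumption);
    each expanded channel is Gray_i = lambda_i * Gray + bias_i with a
    randomly generated bias, clipped back into [0, 255].
    """
    rng = np.random.default_rng() if rng is None else rng
    gray = 0.299 * img[..., 0] + 0.587 * img[..., 1] + 0.114 * img[..., 2]
    chans = [np.clip(lam * gray + rng.uniform(-10, 10), 0, 255)
             for lam in lambdas]
    return np.stack(chans)

out = rgb_to_n_channel_gray(np.full((256, 256, 3), 128.0),
                            rng=np.random.default_rng(0))
print(out.shape)  # (3, 256, 256)
```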
Further, the image tensor is standardized in step 1) as follows:
X_i' = (X_i − E(X_i)) / σ(X_i)
where X_i is the input data of channel i, E(X_i) is the mean, and σ(X_i) is the standard deviation. In this embodiment, since the picture is mapped into the (0, 1) interval during tensor conversion, E(X_i) = 0.5 and σ(X_i) = 0.5 are specified in order to reduce the amount of computation and obtain more discriminative input data; this maps X_i into the (−1, 1) interval and allows negative samples to be generated.
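With the fixed statistics chosen in the embodiment, the standardization is a one-line affine map from (0, 1) to (−1, 1):

```python
import numpy as np

def standardize(x, mean=0.5, std=0.5):
    """Maps tensor values from the (0, 1) interval into (-1, 1)
    using the fixed mean and standard deviation of the embodiment."""
    return (x - mean) / std

print(standardize(np.array([0.0, 0.25, 0.5, 1.0])))
```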
In step 2), the input image data is batch-normalized (Batch Normalization) after the first max-pooling (MaxPooling) layer, and the picture is down-sampled by max pooling after each convolutional layer; the number of convolutional layers is 5, so the shrunk feature size W × H (8 × 8 in this embodiment) is 1/32 of the original picture size. The output high-dimensional feature F_c has the form (C_c, W_c, H_c). The mean and variance of the batch normalization are as follows:
μ = (1/m) Σ_{i=1..m} W_h·x_i
σ² = (1/m) Σ_{i=1..m} (W_h·x_i − μ)²
where W_h denotes the convolutional neural network parameters, x_i is the convolutional neural network input, and m is the batch size. In this embodiment the batch size is 32 and the output feature F_c has the form (32, 128, 8, 8); the bias term is discarded when performing the batch normalization calculation.
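The batch statistics and the normalization with the bias (shift) term discarded can be sketched as follows (an illustrative reconstruction; the scale parameter `gamma` and epsilon value are assumptions, and the batch dimension is the first axis):

```python
import numpy as np

def batch_norm(x, gamma=1.0, eps=1e-5):
    """x: (m, ...) mini-batch. Computes the batch mean and (biased)
    variance over the batch axis and normalizes; the shift/bias term
    is discarded, as the embodiment specifies."""
    mu = x.mean(axis=0)
    var = ((x - mu) ** 2).mean(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 128, 8, 8)) * 3 + 7  # batch size m = 32
y = batch_norm(x)
print(float(y.mean()), float(y.std()))  # approximately 0 and 1
```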
As shown in fig. 3, Attention information is added to different channels and different pixel points of the output high-dimensional features by using the Attention model, and information of different dimensions is further captured by using a Bottleneck module, so that the performance of the convolutional neural network is improved, specifically:
Firstly, the feature of size (C_c, W_c, H_c) output by step 2) is spliced into an input vector V_c of length C_c × W_c × H_c; in this embodiment the dimensions of the three-dimensional feature are (128, 8, 8), so the vector V_c has length 128 × 8 × 8 = 8192.
Secondly, a fully-connected network structure is adopted to correlate the input vector V_c with an output vector V_p of length W_c × H_c, establishing the relationship between each high-dimensional feature pixel point and the other pixel points and channels; in this embodiment the output vector V_p has length 8 × 8 = 64.
Thirdly, the convolution output is passed through a sigmoid activation function to generate an attention feature map σ of size 8 × 8 = 64; for mask visualization, the vector needs to be reshaped into 8 × 8 image data.
Fourthly, the obtained attention feature map is tensor-multiplied with the output feature F_c of step 2) to form attention enhancement over the spatial and channel domains, obtaining the attention-enhanced feature M(F_c):
M(F_c) = ψ(f_{C×W×H}(Contact(F_c))) = ψ(f_{C×W×H}(V_c))
where Contact() is the matrix splicing (concatenation) operation, f_{C×W×H}() is the network computation performed on the feature vector V_c, and ψ() is the tensor multiplication (mask) computation of the attention feature map. In this embodiment, the tensor multiplication (mask) computation is based on a broadcast mechanism; specifically, the high-dimensional feature has scale (32, 128, 64), and the mask of scale (32, 1, 64) is element-wise multiplied with each of the 128 channels' high-dimensional features of scale (32, 64), obtaining the feature M(F_c) with scale (32, 128, 64).
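The splice → fully-connected layer → sigmoid → broadcast-multiply sequence can be sketched for a single sample in NumPy (an illustrative reconstruction, not the patented implementation; the weight matrix `Wfc` is random here, standing in for trained fully-connected weights):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention(Fc, Wfc, b):
    """Fc: (C, W, H) feature; Wfc: (W*H, C*W*H) fully-connected weights."""
    C, W, H = Fc.shape
    v = Fc.reshape(-1)                  # splice into vector V_c, length C*W*H
    mask = sigmoid(Wfc @ v + b)         # attention map sigma, length W*H
    return Fc.reshape(C, W * H) * mask  # broadcast mask over all C channels

rng = np.random.default_rng(0)
C, W, H = 128, 8, 8
Fc = rng.standard_normal((C, W, H))
Wfc = rng.standard_normal((W * H, C * W * H)) * 0.01
M = attention(Fc, Wfc, np.zeros(W * H))
print(M.shape)  # (128, 64), matching the per-sample scale in the text
```

Because the sigmoid output lies in (0, 1), the mask can only attenuate feature values, never amplify them.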
Further, the process by which the Bottleneck module in step 4) uses the Shortcut mechanism to acquire features specifically comprises:
Firstly, a 1 × 1 one-dimensional convolution is used to reduce the dimensionality of the feature M(F_c) output by step 3) and fuse information between different channels; in this embodiment the dimensionality reduction decreases the number of channels from 128 to 32.
Secondly, zero-padding is applied at both ends of the dimension-reduced feature vector, followed by a one-dimensional convolution with kernel size 3, obtaining a feature vector of scale (32, 32, 64).
Thirdly, a 1 × 1 convolution is performed again to restore the dimensionality, obtaining the feature F_B(M(F_c)) of scale (32, 128, 64).
Fourthly, the feature H_B(M(F_c)) is output, where H_B(M(F_c)) = F_B(M(F_c)) + M(F_c); in this embodiment this is an element-wise matrix addition, and the feature H_B(M(F_c)) has scale (32, 128, 64).
That is, the Bottleneck structure applies to the feature M(F_c) a 1 × 1 convolution followed by batch normalization, then successively a one-dimensional convolution with kernel size 3, a 1 × 1 convolution, and a ReLU activation function, obtaining the feature F_B(M(F_c)).
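The four steps can be sketched for a single sample in NumPy. This is an illustrative simplification: the batch-normalization layer is omitted and ReLU placement is simplified to follow each of the first two convolutions, and the random weights stand in for trained ones.

```python
import numpy as np

def conv1d(x, w):
    """x: (C_in, L); w: (C_out, C_in, k). Zero-pads both ends so the
    sequence length L is preserved."""
    c_out, c_in, k = w.shape
    L = x.shape[1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p)))
    out = np.zeros((c_out, L))
    for j in range(k):
        out += np.einsum('oc,cl->ol', w[:, :, j], xp[:, j:j + L])
    return out

def bottleneck(M, w_reduce, w_k3, w_restore):
    z = np.maximum(conv1d(M, w_reduce), 0)  # 1x1 conv: 128 -> 32 channels
    z = np.maximum(conv1d(z, w_k3), 0)      # kernel-3 conv, zero-padded
    z = conv1d(z, w_restore)                # 1x1 conv: 32 -> 128 channels
    return z + M                            # Shortcut: element-wise add

rng = np.random.default_rng(0)
M = rng.standard_normal((128, 64))          # one sample of M(F_c)
H = bottleneck(M,
               rng.standard_normal((32, 128, 1)) * 0.05,
               rng.standard_normal((32, 32, 3)) * 0.05,
               rng.standard_normal((128, 32, 1)) * 0.05)
print(H.shape)  # (128, 64)
```

With all convolution weights at zero the residual branch vanishes and the output equals M exactly, which is the point of the Shortcut: the module can never do worse than the identity mapping.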
Referring to fig. 4, the FCN classifies the data output by the Bottleneck module using a Dropout strategy and the softmax function, producing the final recognition result. In this embodiment, two fully-connected layers are adopted, the number of neurons is 512, and the Dropout coefficient is 0.2.
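A minimal sketch of this classification head, for a single flattened sample (illustrative assumptions: a ReLU between the two fully-connected layers, inverted dropout, and 10 identity classes; the weights are random placeholders):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def dropout(x, p=0.2, rng=None, training=True):
    if not training:
        return x
    rng = np.random.default_rng() if rng is None else rng
    keep = (rng.random(x.shape) >= p).astype(x.dtype)
    return x * keep / (1.0 - p)  # inverted dropout keeps the expectation

def classify(h, W1, W2, training=False):
    z = np.maximum(W1 @ h, 0)              # 512-neuron fully-connected layer
    z = dropout(z, p=0.2, training=training)
    return softmax(W2 @ z)                 # class probabilities

rng = np.random.default_rng(0)
h = rng.standard_normal(8192)              # flattened H(M(F_c))
probs = classify(h, rng.standard_normal((512, 8192)) * 0.01,
                 rng.standard_normal((10, 512)) * 0.01)
print(probs.shape)  # (10,); probabilities sum to 1
```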
The above description is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any equivalent replacement or modification of the technical solution and inventive concept within the disclosure of the present invention by a person skilled in the art shall fall within the protection scope of the present invention.

Claims (8)

1. The face recognition method based on the convolutional neural network and the attention model is characterized by comprising the following steps of:
S1, preprocessing the face image;
S2, inputting the preprocessed image data into a convolutional neural network to extract high-dimensional features, obtaining the high-dimensional feature F_c;
S3, inputting the high-dimensional feature F_c into an attention model, computing an attention mask through neural-network training, and obtaining the spatial-domain and channel-domain attention distribution feature M(F_c);
S4, inputting the attention distribution feature M(F_c) into a Bottleneck module and obtaining the feature H(M(F_c)) using a Shortcut mechanism;
S5, inputting the feature H(M(F_c)) into the fully convolutional network classification structure, using a Dropout strategy and the softmax function to obtain the final face recognition result.
2. The face recognition method based on the convolutional neural network and the attention model as claimed in claim 1, wherein the preprocessing of step S1 comprises: normalizing the face image size to a fixed size, randomly flipping the face image left and right, converting the RGB image into a gray image, and standardizing the tensor of the face image.
3. The face recognition method based on the convolutional neural network and the attention model as claimed in claim 2, wherein the RGB image is converted into a gray image, specifically:
after the RGB image is converted into a gray image, the single-channel gray image is expanded into an N-channel gray image; [the image conversion formula is not reproduced in the source text] and the channel expansion formula is as follows:
Gray_i = λ_i · Gray + bias_i
where R, G and B respectively represent the RGB three-channel pixel values of the original image, i denotes the i-th gray channel, λ_i is the weight of the i-th gray channel, bias_i is a bias term, and Gray_i ∈ [0, 255].
4. The face recognition method based on the convolutional neural network and the attention model as claimed in claim 3, wherein the number of channels N is 3, and λ_1, λ_2, λ_3 are 0.37, 0.30 and 0.33, respectively; bias_i is a randomly generated value.
5. The face recognition method based on the convolutional neural network and the attention model as claimed in claim 2, wherein the tensor of the face image is standardized as follows:
X_i' = (X_i − E(X_i)) / σ(X_i)
where X_i is the input data of channel i, E(X_i) is the mean, and σ(X_i) is the standard deviation.
6. The face recognition method based on the convolutional neural network and the attention model as claimed in claim 1, wherein the image data input in step S2 is batch-normalized after the first max-pooling layer, the picture is down-sampled by max pooling after each convolutional layer, the number of convolutional layers is 5, and the shrunk feature size W × H is 1/32 of the original picture size; the output high-dimensional feature F_c has the form (C_c, W_c, H_c); the mean and variance of the batch normalization are as follows:
μ = (1/m) Σ_{i=1..m} W_h·x_i
σ² = (1/m) Σ_{i=1..m} (W_h·x_i − μ)²
where W_h denotes the convolutional neural network parameters, x_i is the convolutional neural network input, and m is the batch size.
7. The face recognition method based on the convolutional neural network and the attention model as claimed in claim 1, wherein step S3 comprises the steps of:
S31, splicing the output feature of size (C_c, W_c, H_c) into an input vector V_c of length C_c × W_c × H_c;
S32, adopting a fully-connected network structure to correlate the input vector V_c with an output vector V_p of length W_c × H_c, establishing the relationship between each high-dimensional feature pixel point and the other pixel points and channels;
S33, generating an attention feature map from the convolution output through a sigmoid activation function;
S34, tensor-multiplying the attention feature map with the high-dimensional feature F_c output in step S2 to form attention enhancement over the spatial and channel domains, obtaining the attention-enhanced feature M(F_c):
M(F_c) = ψ(f_{C×W×H}(Contact(F_c))) = ψ(f_{C×W×H}(V_c))
where Contact() is the matrix splicing (concatenation) operation, f_{C×W×H}() is the network computation performed on the feature vector V_c, and ψ() is the tensor multiplication computation of the attention feature map.
8. The face recognition method based on the convolutional neural network and the attention model as claimed in claim 7, wherein the Bottleneck module in step S4 is based on one-dimensional convolution, and the process of acquiring features using the Shortcut mechanism comprises:
S41, using a 1 × 1 convolution to reduce the dimensionality of the output feature M(F_c);
S42, applying zero-padding at both ends of the dimension-reduced feature vector, then performing a one-dimensional convolution with kernel size 3;
S43, performing a 1 × 1 convolution again to restore the dimensionality, obtaining the feature F_B(M(F_c));
S44, outputting the feature H_B(M(F_c)), where H_B(M(F_c)) = F_B(M(F_c)) + M(F_c).
CN202010295613.8A 2020-04-15 2020-04-15 Face recognition method based on convolutional neural network and attention model Active CN111582044B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010295613.8A CN111582044B (en) 2020-04-15 2020-04-15 Face recognition method based on convolutional neural network and attention model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010295613.8A CN111582044B (en) 2020-04-15 2020-04-15 Face recognition method based on convolutional neural network and attention model

Publications (2)

Publication Number Publication Date
CN111582044A true CN111582044A (en) 2020-08-25
CN111582044B CN111582044B (en) 2023-06-20

Family

ID=72124407

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010295613.8A Active CN111582044B (en) 2020-04-15 2020-04-15 Face recognition method based on convolutional neural network and attention model

Country Status (1)

Country Link
CN (1) CN111582044B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507995A (en) * 2021-02-05 2021-03-16 成都东方天呈智能科技有限公司 Cross-model face feature vector conversion system and method
CN112766176A (en) * 2021-01-21 2021-05-07 深圳市安软科技股份有限公司 Training method of lightweight convolutional neural network and face attribute recognition method
CN112766229A (en) * 2021-02-08 2021-05-07 南京林业大学 Human face point cloud image intelligent identification system and method based on attention mechanism
CN112801280A (en) * 2021-03-11 2021-05-14 东南大学 One-dimensional convolution position coding method of visual depth self-adaptive neural network
CN112800871A (en) * 2021-01-13 2021-05-14 南京邮电大学 Automatic driving image recognition method based on attention mechanism and relation network
CN113256629A (en) * 2021-07-05 2021-08-13 之江实验室 Image calibration error detection method and device
CN113469111A (en) * 2021-07-16 2021-10-01 中国银行股份有限公司 Image key point detection method and system, electronic device and storage medium
CN114898363A (en) * 2022-05-26 2022-08-12 浙江大学 Identity recognition method and system based on egg shell image features
WO2024045320A1 (en) * 2022-08-31 2024-03-07 北京龙智数科科技服务有限公司 Facial recognition method and apparatus
CN117894058A (en) * 2024-03-14 2024-04-16 山东远桥信息科技有限公司 Smart city camera face recognition method based on attention enhancement

Citations (5)

Publication number Priority date Publication date Assignee Title
CN109543606A (en) * 2018-11-22 2019-03-29 中山大学 A kind of face identification method that attention mechanism is added
CN110059582A (en) * 2019-03-28 2019-07-26 东南大学 Driving behavior recognition methods based on multiple dimensioned attention convolutional neural networks
CN110427867A (en) * 2019-07-30 2019-11-08 华中科技大学 Human facial expression recognition method and system based on residual error attention mechanism
CN110781784A (en) * 2019-10-18 2020-02-11 高新兴科技集团股份有限公司 Face recognition method, device and equipment based on double-path attention mechanism
CN110781760A (en) * 2019-05-24 2020-02-11 西安电子科技大学 Facial expression recognition method and device based on space attention

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
CN109543606A (en) * 2018-11-22 2019-03-29 中山大学 A kind of face identification method that attention mechanism is added
CN110059582A (en) * 2019-03-28 2019-07-26 东南大学 Driving behavior recognition methods based on multiple dimensioned attention convolutional neural networks
CN110781760A (en) * 2019-05-24 2020-02-11 西安电子科技大学 Facial expression recognition method and device based on space attention
CN110427867A (en) * 2019-07-30 2019-11-08 华中科技大学 Human facial expression recognition method and system based on residual error attention mechanism
CN110781784A (en) * 2019-10-18 2020-02-11 高新兴科技集团股份有限公司 Face recognition method, device and equipment based on double-path attention mechanism

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800871B (en) * 2021-01-13 2022-08-26 南京邮电大学 Automatic driving image recognition method based on attention mechanism and relation network
CN112800871A (en) * 2021-01-13 2021-05-14 南京邮电大学 Automatic driving image recognition method based on attention mechanism and relation network
CN112766176A (en) * 2021-01-21 2021-05-07 深圳市安软科技股份有限公司 Training method of lightweight convolutional neural network and face attribute recognition method
CN112766176B (en) * 2021-01-21 2023-12-01 深圳市安软科技股份有限公司 Training method of lightweight convolutional neural network and face attribute recognition method
CN112507995A (en) * 2021-02-05 2021-03-16 成都东方天呈智能科技有限公司 Cross-model face feature vector conversion system and method
CN112766229A (en) * 2021-02-08 2021-05-07 南京林业大学 Intelligent face point-cloud image recognition system and method based on attention mechanism
CN112801280A (en) * 2021-03-11 2021-05-14 东南大学 One-dimensional convolution position coding method of visual depth self-adaptive neural network
CN113256629A (en) * 2021-07-05 2021-08-13 之江实验室 Image calibration error detection method and device
CN113469111A (en) * 2021-07-16 2021-10-01 中国银行股份有限公司 Image key point detection method and system, electronic device and storage medium
CN114898363B (en) * 2022-05-26 2023-03-24 浙江大学 Identity recognition method and system based on egg shell image features
CN114898363A (en) * 2022-05-26 2022-08-12 浙江大学 Identity recognition method and system based on egg shell image features
WO2024045320A1 (en) * 2022-08-31 2024-03-07 北京龙智数科科技服务有限公司 Facial recognition method and apparatus
CN117894058A (en) * 2024-03-14 2024-04-16 山东远桥信息科技有限公司 Smart city camera face recognition method based on attention enhancement
CN117894058B (en) * 2024-03-14 2024-05-24 山东远桥信息科技有限公司 Smart city camera face recognition method based on attention enhancement

Also Published As

Publication number Publication date
CN111582044B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
CN111582044A (en) Face recognition method based on convolutional neural network and attention model
CN112307958B (en) Micro-expression recognition method based on spatio-temporal appearance motion attention network
CN111639692B (en) Shadow detection method based on attention mechanism
WO2021042828A1 (en) Neural network model compression method and apparatus, and storage medium and chip
CN112926396B (en) Action recognition method based on two-stream convolutional attention
CN111950649B (en) Attention mechanism and capsule network-based low-illumination image classification method
CN112288011B (en) Image matching method based on self-attention deep neural network
CN111444881A (en) Fake face video detection method and device
CN110163286B (en) Hybrid pooling-based domain adaptive image classification method
CN112232184B (en) Multi-angle face recognition method based on deep learning and spatial transformer network
CN111476806B (en) Image processing method, image processing device, computer equipment and storage medium
CN111368672A (en) Construction method and device for genetic disease facial recognition model
CN111709313B (en) Pedestrian re-identification method based on local and channel combination characteristics
CN114549913B (en) Semantic segmentation method and device, computer equipment and storage medium
CN112733665B (en) Face recognition method and system based on lightweight network structure design
CN113780249B (en) Expression recognition model processing method, device, equipment, medium and program product
Rao et al. Exploring deep learning techniques for Kannada handwritten character recognition: a boon for digitization
CN111899203A (en) Real image generation method based on label graph under unsupervised training and storage medium
CN115601820A (en) Face fake image detection method, device, terminal and storage medium
CN110188646B (en) Human ear identification method based on fusion of gradient direction histogram and local binary pattern
CN110135435B (en) Saliency detection method and device based on breadth learning system
CN111444957B (en) Image data processing method, device, computer equipment and storage medium
KR20180092453A (en) Face recognition method using convolutional neural network and stereo image
CN116311345A (en) Transformer-based pedestrian shielding re-recognition method
Pei et al. FGO-Net: Feature and Gaussian Optimization Network for visual saliency prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant